<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ natural language processing - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ natural language processing - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 25 May 2026 20:15:05 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/natural-language-processing/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use NLP Techniques and Tools in Your Projects [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ Nowadays, computers can comprehend and produce human-like language thanks to Natural Language Processing. And this opens up numerous opportunities for you as a developer. This guide will teach you how to create NLP projects from scratch. It includes ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-nlp-techniques-and-tools-in-your-projects-full-handbook/</link>
                <guid isPermaLink="false">692096d4afb994c2aecc26e9</guid>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oleh Romanyuk ]]>
                </dc:creator>
                <pubDate>Fri, 21 Nov 2025 16:44:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763743424066/393a4384-ce7a-4ff8-9e98-1edaaa322bc6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Nowadays, computers can comprehend and produce human-like language thanks to Natural Language Processing. And this opens up numerous opportunities for you as a developer.</p>
<p>This guide will teach you how to create NLP projects from scratch. It includes details on how to organize your workflow, utilize the appropriate tools, and perform typical NLP tasks.</p>
<p>After reading this article, you will understand how to:</p>
<ul>
<li><p>Configure your environment for NLP development.</p>
</li>
<li><p>Select the appropriate frameworks and libraries for your project.</p>
</li>
<li><p>Execute fundamental NLP tasks such as sentiment analysis and text classification.</p>
</li>
<li><p>Create and implement a functional NLP application.</p>
</li>
<li><p>Diagnose and fix common problems in NLP projects.</p>
</li>
</ul>
<p>Before beginning, you should have some basics at hand already. They include a solid understanding of Python programming and knowledge of the general ideas of machine learning. You should also know how to build algorithms and data structures. Finally, your system should have Python 3.8 or higher installed so you can try running the example snippets.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-natural-language-processing">What is Natural Language Processing?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-nlp-systems-interpret-speech">How NLP Systems Interpret Speech</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-typical-nlp-tasks">Typical NLP Tasks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conventional-machine-learning-methods-for-nlp">Conventional Machine Learning Methods for NLP</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-nlp-in-various-industries">How to Use NLP in Various Industries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-choose-the-most-effective-nlp-tools-and-libraries">How to Choose the Most Effective NLP Tools and Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-prepare-and-train-nlp-systems">How to Prepare and Train NLP systems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-establishing-and-labeling-datasets">Establishing and Labeling Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-natural-language-processing">What is Natural Language Processing?</h2>
<p>NLP (natural language processing) is a set of methodologies that allow computers to learn to comprehend human language and produce relevant outputs. </p>
<p>NLP manages the intricacy of human communication. In contrast to conventional machine learning pipelines, which typically operate on structured, tabular data, NLP handles unstructured text data.</p>
<p>Specifically, to more accurately comprehend language, NLP systems simultaneously analyze the syntax (which is the arrangement of words and grammar), the semantics (the meanings of specific words and phrases), and interpret context (how adjacent information affects meaning). This allows them to differentiate between various interpretations of identical words, grasp implied messages, and produce responses as relevant as possible.</p>
<p>The ability of machines to process language was demonstrated by early experiments such as the Georgetown-IBM translation in 1954 and the ELIZA chatbot in 1966 (Sources: <a target="_blank" href="https://www.mdpi.com/2078-2489/15/8/443">Szmurlo and Akhtar, MDPI; Hutchins, ResearchGate</a>). With today's tools, any developer can put these capabilities to work.</p>
<p>So why is this important for you? In 2025, the market for NLP, which currently powers chatbots, translation software, and content creation platforms, has reached $42.47 billion. (Source: <a target="_blank" href="https://www.precedenceresearch.com/natural-language-processing-market">Precedence Research</a>)</p>
<p>The growth is only accelerating. By 2030, the global NLP market is expected to grow to $439.85 billion. (Source: <a target="_blank" href="https://www.grandviewresearch.com/industry-analysis/natural-language-processing-market-report#:~:text=The%20global%20natural%20language%20processing,38.7%25%20from%202025%20to%202030.">Grand View Research</a>)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762161849212/b2e0b1b5-8b91-4061-a647-2488a7396548.png" alt="NLP market size" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-important-nlp-concepts-to-know">Important NLP Concepts to Know</h3>
<p>Five interconnected layers generally make up NLP systems. Every layer addresses a distinct language processing problem. (Source: <a target="_blank" href="https://www.researchgate.net/publication/350058919_Natural_Language_Processing_History_Evolution_Application_and_Future_Work">Khatri and others, ResearchGate</a>)</p>
<ul>
<li><p><a target="_blank" href="https://www.researchgate.net/publication/350058919_Natural_Language_Processing_History_Evolution_Application_and_Future_Work"><strong>Analysis of morphology</strong></a> is where you break words down into their smallest meaningful components: prefixes, roots, and suffixes. For instance, "working" becomes "work" plus "ing." This makes it easier for your system to recognize that different forms of a word are related.</p>
</li>
<li><p><strong>Analysis of syntactic structure</strong> is where you use grammar rules to determine sentence structure. Here, you construct parse trees that map the grammatical relationships between words. Individual words are represented as leaves, phrases as intermediate nodes, and sentences as roots in the tree.</p>
</li>
<li><p><strong>Analysis of semantics</strong> is where you derive meaning from the parsed structure. You deal with synonyms, antonyms, and homophones as well as word ambiguity. This transforms grammatical structure into meaning.</p>
</li>
<li><p><strong>Analysis of discourse</strong> is where you connect sentences within longer text structures. You'll observe how ideas flow from one paragraph to the next and spot recurring themes. This connects meaning at the sentence level to meaning at the document level.</p>
</li>
<li><p><strong>Analysis of pragmatics</strong> is where you interpret intent and context. You resolve references, follow dialogue structure, and decipher implied meanings. This is the layer where you can process sarcasm, cultural background, and other aspects of everyday communication.</p>
</li>
</ul>
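<p>To make the first layer concrete, here is a minimal sketch of morphological analysis in plain Python. The <code>SUFFIXES</code> tuple and the <code>split_suffix</code> helper are hypothetical names invented for this illustration. Real projects would typically rely on a stemmer or lemmatizer from a library such as NLTK or spaCy instead of a hand-written suffix list.</p>

```python
# Toy morphological analysis: split a word into a root and a known suffix.
# SUFFIXES is a small, hypothetical list chosen only for illustration.
SUFFIXES = ("ing", "ed", "ly", "s")


def split_suffix(word):
    """Return (root, suffix); suffix is '' when no known suffix matches."""
    for suffix in SUFFIXES:
        # Require a root of at least three characters so short words stay whole.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""


for word in ["working", "jumped", "quickly", "cat"]:
    print(word, "->", split_suffix(word))
```

<p>A rule list like this breaks quickly on irregular forms (for example, "running" or "went"), which is one reason production systems use trained or dictionary-based tools.</p>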
<p>Understanding these layers gives you the ability to build NLP systems that can manage challenging language tasks in a variety of contexts.</p>
<h2 id="heading-how-nlp-systems-interpret-speech">How NLP Systems Interpret Speech</h2>
<p>NLP systems use a pipeline to convert raw text into computational meaning. Each step builds on its predecessor, allowing for better analysis of unstructured language data. In this section, I’ll provide code snippets you can run in your own editor to practice.</p>
<h3 id="heading-step-1-text-input">Step 1: Text Input</h3>
<p>To start, your system will take in raw text that can come in various forms. Potential sources for raw input include emails, social media posts, articles, documents, or transcripts of speeches. The raw data will contain misspellings, informal language, and grammatical mistakes that your pipeline needs to handle.</p>
<h3 id="heading-step-2-text-preprocessing">Step 2: Text Preprocessing</h3>
<p>Next, you’ll need to clean and standardize the input text before your system analyzes it. Your preprocessing will likely include some or all of these steps:</p>
<ul>
<li><p>Tokenizing text into single words or subwords</p>
</li>
<li><p>Removing punctuation marks from the text</p>
</li>
<li><p>Lowercasing all the text</p>
</li>
<li><p>Removing stop words like "the", "and", and "is"</p>
</li>
</ul>
<p>For example, you can implement these preprocessing steps in Python with NLTK. Note that you need to install and import the library first (we will discuss libraries later):</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> nltk
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords
<span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize

<span class="hljs-comment"># Download required NLTK data</span>
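nltk.download(<span class="hljs-string">'punkt'</span>)
nltk.download(<span class="hljs-string">'punkt_tab'</span>)  <span class="hljs-comment"># newer NLTK releases load the tokenizer data from this resource</span>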
nltk.download(<span class="hljs-string">'stopwords'</span>)

<span class="hljs-comment"># Raw text input</span>
text = <span class="hljs-string">"The quick brown fox jumps over the lazy dog!"</span>

<span class="hljs-comment"># Tokenization</span>
tokens = word_tokenize(text.lower())

<span class="hljs-comment"># Remove punctuation and stop words</span>
stop_words = set(stopwords.words(<span class="hljs-string">'english'</span>))
filtered_tokens = [word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> word.isalnum() <span class="hljs-keyword">and</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stop_words]

print(filtered_tokens)
<span class="hljs-comment"># Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']</span>
</code></pre>
<h3 id="heading-step-3-syntactic-parsing-and-analysis">Step 3: Syntactic Parsing and Analysis</h3>
<p>After cleaning, you’ll analyze the text’s grammatical structure by constructing parse trees. While parse trees can vary in complexity, they map the relationships between words, phrases, and clauses. You can leverage part-of-speech tagging information to assign grammatical roles (noun, verb, adjective, and so on) to words, and dependency parsing to learn how related words are linked syntactically.</p>
<p>For example, the code below illustrates how to perform part-of-speech tagging with spaCy, which determines the grammatical function of each word within a sentence.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy

<span class="hljs-comment"># Load English language model</span>
nlp = spacy.load(<span class="hljs-string">"en_core_web_sm"</span>)

<span class="hljs-comment"># Process text</span>
doc = nlp(<span class="hljs-string">"The cat sat on the mat"</span>)

<span class="hljs-comment"># Part-of-speech tagging</span>
<span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> doc:
    print(<span class="hljs-string">f"<span class="hljs-subst">{token.text}</span>: <span class="hljs-subst">{token.pos_}</span>"</span>)

<span class="hljs-comment"># Output:</span>
<span class="hljs-comment"># The: DET</span>
<span class="hljs-comment"># cat: NOUN</span>
<span class="hljs-comment"># sat: VERB</span>
<span class="hljs-comment"># on: ADP</span>
<span class="hljs-comment"># the: DET</span>
<span class="hljs-comment"># mat: NOUN</span>
</code></pre>
<h3 id="heading-step-4-feature-engineering-and-text-representation">Step 4: Feature Engineering and Text Representation</h3>
<p>Here, you convert text into numerical vectors that a model can work with, using embedding or transformer-based techniques that capture similarities and semantic relationships between terms. For instance, this allows your system to understand that the words "kid" and "child" are similar in meaning.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer

<span class="hljs-comment"># Load pre-trained model</span>
model = SentenceTransformer(<span class="hljs-string">'all-MiniLM-L6-v2'</span>)

<span class="hljs-comment"># Convert sentences to embeddings</span>
sentences = [<span class="hljs-string">"The cat sits on the mat"</span>, <span class="hljs-string">"The feline rests on the rug"</span>]
embeddings = model.encode(sentences)

print(<span class="hljs-string">f"Embedding shape: <span class="hljs-subst">{embeddings.shape}</span>"</span>)
<span class="hljs-comment"># Output: Embedding shape: (2, 384)</span>
</code></pre>
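<p>Once text is represented as vectors, similarity in meaning is commonly measured with cosine similarity. The sketch below uses tiny, made-up three-dimensional vectors (the values in <code>vectors</code> are hypothetical, not output from any real model) purely to show the calculation. With a real model you would pass in the rows of the embedding matrix instead.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", invented for illustration only.
vectors = {
    "kid": [0.9, 0.8, 0.1],
    "child": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

# Words with similar meaning should point in similar directions.
print("kid vs child:", round(cosine_similarity(vectors["kid"], vectors["child"]), 3))
print("kid vs car:  ", round(cosine_similarity(vectors["kid"], vectors["car"]), 3))
```

<p>With these toy values, "kid" and "child" score much closer to 1.0 than "kid" and "car", which is the behavior you would expect from a good embedding model.</p>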
<h3 id="heading-step-5-modeling-and-pattern-recognition">Step 5: Modeling and Pattern Recognition</h3>
<p>In this part of the process, you’ll use machine learning algorithms to identify patterns in vectorized text. You may use either a traditional machine learning model or one of the deep learning methods, such as transformers. Your models will learn patterns in the language, classify content, or extract entities from the text.</p>
<p>To understand this method, let’s see a straightforward example of a text classification model that uses transformers to identify sentiments.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

<span class="hljs-comment"># Load a pre-trained sentiment analysis model</span>
classifier = pipeline(<span class="hljs-string">"sentiment-analysis"</span>)

<span class="hljs-comment"># Classify text sentiment</span>
texts = [<span class="hljs-string">"I love this product!"</span>, <span class="hljs-string">"This is terrible and disappointing"</span>]
results = classifier(texts)

<span class="hljs-keyword">for</span> text, result <span class="hljs-keyword">in</span> zip(texts, results):
    print(<span class="hljs-string">f"Text: <span class="hljs-subst">{text}</span>"</span>)
    print(<span class="hljs-string">f"Sentiment: <span class="hljs-subst">{result[<span class="hljs-string">'label'</span>]}</span>, Confidence: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]:<span class="hljs-number">.2</span>f}</span>\n"</span>)

<span class="hljs-comment"># Output:</span>
<span class="hljs-comment"># Text: I love this product!</span>
<span class="hljs-comment"># Sentiment: POSITIVE, Confidence: 0.99</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Text: This is terrible and disappointing</span>
<span class="hljs-comment"># Sentiment: NEGATIVE, Confidence: 0.99</span>
</code></pre>
<p>This illustrates how the model detects linguistic patterns for sentiment categorization, which is a typical task in NLP. In subsequent sections, we’ll delve into more specialized modeling techniques tailored for various NLP applications.</p>
<h3 id="heading-step-6-evaluation-and-deployment">Step 6: Evaluation and Deployment</h3>
<p>Next, you will evaluate your model using metrics such as precision, recall, and F1 score. After evaluation, you will deploy your model to production, where you can monitor it and periodically retrain it on new real-world text. Here’s an example of how to compute these metrics:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> classification_report

<span class="hljs-comment"># Example predictions vs actual labels</span>
y_true = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>]
y_pred = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>]

<span class="hljs-comment"># Generate evaluation metrics</span>
print(classification_report(y_true, y_pred))
</code></pre>
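<p>To see where these numbers come from, the following sketch computes precision, recall, and F1 by hand for the positive class, using the same labels as above. The <code>binary_metrics</code> helper is a hypothetical name written for this example and is not part of scikit-learn.</p>

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# Output: precision=1.00 recall=0.67 f1=0.80
```

<p>Here the model never predicts a false positive (precision 1.00) but misses one of the three actual positives (recall 0.67), and F1 is the harmonic mean of the two.</p>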
<h2 id="heading-typical-nlp-tasks">Typical NLP Tasks</h2>
<h3 id="heading-natural-language-understanding-nlu-tasks">Natural Language Understanding (NLU) tasks</h3>
<p>Natural Language Understanding (NLU) tasks deal with actually understanding what people are communicating about. There are several elements involved in this process.</p>
<h4 id="heading-sentiment-analysis-and-text-classification">Sentiment analysis and text classification</h4>
<p>Here, you categorize documents according to the sentiment they express. Your engine identifies whether the text conveys a positive, negative, or neutral sentiment. You can then use those labels to filter or route content automatically across digital platforms. Consider this example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

<span class="hljs-comment"># Load sentiment analysis pipeline</span>
classifier = pipeline(<span class="hljs-string">"sentiment-analysis"</span>)

<span class="hljs-comment"># Analyze sentiment</span>
result = classifier(<span class="hljs-string">"I love this product! It works great."</span>)
print(result)
<span class="hljs-comment"># Output: [{'label': 'POSITIVE', 'score': 0.9998}]</span>
</code></pre>
<h4 id="heading-named-entity-recognition-ner">Named Entity Recognition (NER)</h4>
<p>NER is the task of automatically identifying and classifying distinct pieces of information within a body of text. This includes names of people, locations, organizations, dates, and monetary figures.</p>
<p>Your NER system analyzes unstructured text to accurately label these entities, converting raw data into a structured format that can be easily analyzed. Your algorithm can also uncover relationships among these entities, allowing you to gain valuable insights from extensive amounts of text.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy

nlp = spacy.load(<span class="hljs-string">"en_core_web_sm"</span>)
doc = nlp(<span class="hljs-string">"Apple Inc. was founded by Steve Jobs in Cupertino, California."</span>)

<span class="hljs-keyword">for</span> ent <span class="hljs-keyword">in</span> doc.ents:
    print(<span class="hljs-string">f"<span class="hljs-subst">{ent.text}</span>: <span class="hljs-subst">{ent.label_}</span>"</span>)

<span class="hljs-comment"># Output:</span>
<span class="hljs-comment"># Apple Inc.: ORG</span>
<span class="hljs-comment"># Steve Jobs: PERSON</span>
<span class="hljs-comment"># Cupertino: GPE</span>
<span class="hljs-comment"># California: GPE</span>
</code></pre>
<h4 id="heading-question-answering">Question answering</h4>
<p>You can create systems that consume natural language questions and retrieve appropriate answers. Your system can also use entailment and contradiction detection to analyze the logical relationships between text blocks. </p>
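<p>As a rough illustration of the retrieval half of question answering, the sketch below picks the passage that shares the most words with the question. The <code>PASSAGES</code> list and the <code>answer</code> helper are hypothetical and exist only for this example. Production systems usually rely on embeddings or a trained reader model rather than raw word overlap.</p>

```python
import re

# Hypothetical mini knowledge base, invented for illustration.
PASSAGES = [
    "ELIZA was an early chatbot created in 1966.",
    "The Georgetown-IBM experiment demonstrated machine translation in 1954.",
    "Tokenization splits text into words or subwords.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer(question):
    """Return the passage with the largest word overlap with the question."""
    question_tokens = tokenize(question)
    return max(PASSAGES, key=lambda passage: len(question_tokens & tokenize(passage)))


print(answer("When was the ELIZA chatbot created?"))
# Output: ELIZA was an early chatbot created in 1966.
```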
<h4 id="heading-intent-recognition">Intent recognition</h4>
<p>You can recognize user intentions in conversational domains. Once your dialog system identifies the user’s goal, its text or voice interface can respond appropriately. </p>
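<p>A very simple way to picture intent recognition is keyword matching, sketched below. The intent names and keyword sets in <code>INTENT_KEYWORDS</code> are hypothetical. Real assistants generally train a classifier on labeled utterances, but the input and output shape is similar: an utterance goes in, and an intent label comes out.</p>

```python
import re

# Hypothetical intents and trigger keywords, chosen only for illustration.
INTENT_KEYWORDS = {
    "check_balance": {"balance", "funds"},
    "transfer_money": {"transfer", "send", "pay"},
    "cancel_order": {"cancel", "refund"},
}


def recognize_intent(utterance):
    """Return the intent whose keywords overlap most with the utterance."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = {
        intent: len(tokens & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best_intent = max(scores, key=scores.get)
    # Fall back to "unknown" when no keyword matched at all.
    return best_intent if scores[best_intent] > 0 else "unknown"


print(recognize_intent("Please send 20 dollars to Anna"))  # transfer_money
print(recognize_intent("What's the weather like?"))        # unknown
```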
<p>Now, let’s move on to some general natural language-related tasks.</p>
<h3 id="heading-general-natural-language-tasks">General Natural Language Tasks</h3>
<p>This class of tasks pulls together some aspects of understanding while dealing with generation as well.</p>
<h4 id="heading-machine-translation">Machine translation</h4>
<p>You can translate text across multiple languages while preserving context and meaning. Neural networks use encoder-decoder architectures to create linguistic outputs in the target language.</p>
<p>Let’s see how it’s done with the MarianMTModel and MarianTokenizer classes:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> MarianMTModel, MarianTokenizer

<span class="hljs-comment"># Load translation model</span>
model_name = <span class="hljs-string">'Helsinki-NLP/opus-mt-en-de'</span>
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

<span class="hljs-comment"># Translate English to German</span>
text = <span class="hljs-string">"Hello, how are you?"</span>
translated = model.generate(**tokenizer(text, return_tensors=<span class="hljs-string">"pt"</span>, padding=<span class="hljs-literal">True</span>))
print(tokenizer.decode(translated[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>))
<span class="hljs-comment"># Example output (exact wording may vary): Hallo, wie geht's dir?</span>
</code></pre>
<h4 id="heading-text-summarization">Text summarization</h4>
<p>Often you’ll need to shorten a long document into a more accessible summary – this is text summarization, and it’s a common NLP task. Your system retains key details and coherence while reducing the length of a document.</p>
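<p>One classic approach is extractive summarization: score each sentence by how many frequent words it contains and keep the top-scoring ones. The sketch below is a minimal, standard-library version of that idea. The <code>STOP_WORDS</code> set and the <code>summarize</code> helper are hypothetical names for this example, and modern systems more often use transformer models that generate abstractive summaries.</p>

```python
import re
from collections import Counter

# Small hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in", "it", "that", "for", "on"}


def content_words(text):
    """Lowercase tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    frequencies = Counter(content_words(text))

    def score(sentence):
        return sum(frequencies[word] for word in content_words(sentence))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


document = (
    "NLP models process text. "
    "Summarization shortens text while keeping key details of the text. "
    "The weather was nice."
)
print(summarize(document))
# Output: Summarization shortens text while keeping key details of the text.
```

<p>Because the score is a plain sum, longer sentences tend to win. Dividing by sentence length is a common adjustment if that bias matters for your data.</p>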
<h4 id="heading-speech-recognition-and-text-to-speech">Speech recognition and text-to-speech</h4>
<p>Using these techniques, you can turn speech into text (speech recognition) or text into natural audio (text-to-speech). These tasks close the gap between text and audio modalities. </p>
<h4 id="heading-syntactic-parsing">Syntactic parsing</h4>
<p>Here, you examine the grammatical construction to determine the syntactic relationships between words in the sentence. This critical task gives a structural analysis of the text to support more complex understanding tasks.</p>
<p>These tasks, when combined, create powerful applications for different industries and use cases in Natural Language Processing.</p>
<h2 id="heading-conventional-machine-learning-methods-for-nlp">Conventional Machine Learning Methods for NLP</h2>
<p>Instead of relying on manually created linguistic rules (where programmers specify patterns like "if a word ends with '-ing', it is likely a verb" or "sentences containing 'not' followed by positive words suggest negative sentiment"), ML approaches apply statistical methods to discover patterns automatically within the data.</p>
<p>These methods learn through examples and don’t require human experts to define every potential language structure explicitly. As a result, they are more scalable and adaptable across different languages and fields. Let’s look at some of them now.</p>
<h3 id="heading-logistic-regression">Logistic Regression</h3>
<p>For tasks involving binary classification, you can use logistic regression. Based on input features, it predicts event probability by learning linear decision boundaries. Consider the following example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-comment"># Sample data</span>
texts = [<span class="hljs-string">"This is spam"</span>, <span class="hljs-string">"Normal email"</span>, <span class="hljs-string">"Buy now!"</span>, <span class="hljs-string">"Meeting tomorrow"</span>]
labels = [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]  <span class="hljs-comment"># 1 = spam, 0 = not spam</span>

<span class="hljs-comment"># Convert text to features</span>
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

<span class="hljs-comment"># Train model</span>
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=<span class="hljs-number">0.25</span>, random_state=<span class="hljs-number">42</span>)
model = LogisticRegression()
model.fit(X_train, y_train)

<span class="hljs-comment"># Predict</span>
new_text = vectorizer.transform([<span class="hljs-string">"Free money now"</span>])
prediction = model.predict(new_text)
print(<span class="hljs-string">f"Prediction: <span class="hljs-subst">{<span class="hljs-string">'Spam'</span> <span class="hljs-keyword">if</span> prediction[<span class="hljs-number">0</span>] == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'Not Spam'</span>}</span>"</span>)
</code></pre>
<p>Typical uses include toxicity classification, sentiment analysis, and spam detection.</p>
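<p>Under the hood, logistic regression computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability. The sketch below shows that calculation with hand-picked numbers. The <code>weights</code> and <code>bias</code> values are hypothetical (a trained model would learn them from data), and plain word counts stand in for the TF-IDF features used earlier.</p>

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical "learned" parameters, invented for illustration.
weights = {"free": 2.0, "money": 1.5, "meeting": -2.5}
bias = -0.5


def spam_probability(text):
    """Linear score (bias + sum of word weights) mapped to a probability."""
    tokens = text.lower().split()
    z = bias + sum(weights.get(token, 0.0) for token in tokens)
    return sigmoid(z)


for message in ["Free money now", "Meeting tomorrow"]:
    print(f"{message!r}: P(spam) = {spam_probability(message):.3f}")
```

<p>The decision boundary is the set of inputs where the linear score is exactly zero, which corresponds to a probability of 0.5.</p>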
<h3 id="heading-naive-bayes">Naive Bayes</h3>
<p><a target="_blank" href="https://www.freecodecamp.org/news/how-naive-bayes-classifiers-work/">Naive Bayes</a> applies <a target="_blank" href="https://www.freecodecamp.org/news/bayes-rule-explained/">Bayes' Theorem</a> under the simplifying ("naive") assumption that words are conditionally independent of one another given the label.</p>
<p>To classify documents, it computes:</p>
<p>$$P(\text{label} \mid \text{text}) = \frac{P(\text{label}) \, P(\text{text} \mid \text{label})}{P(\text{text})}$$</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.naive_bayes <span class="hljs-keyword">import</span> MultinomialNB
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer

<span class="hljs-comment"># Training data</span>
texts = [<span class="hljs-string">"I love this product"</span>, <span class="hljs-string">"Terrible service"</span>, <span class="hljs-string">"Amazing quality"</span>, <span class="hljs-string">"Waste of money"</span>]
labels = [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]  <span class="hljs-comment"># 1 = positive, 0 = negative</span>

<span class="hljs-comment"># Vectorize text</span>
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

<span class="hljs-comment"># Train Naive Bayes</span>
clf = MultinomialNB()
clf.fit(X, labels)

<span class="hljs-comment"># Predict sentiment</span>
new_review = vectorizer.transform([<span class="hljs-string">"Great purchase"</span>])
print(<span class="hljs-string">f"Sentiment: <span class="hljs-subst">{<span class="hljs-string">'Positive'</span> <span class="hljs-keyword">if</span> clf.predict(new_review)[<span class="hljs-number">0</span>] == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'Negative'</span>}</span>"</span>)
</code></pre>
<p>Common uses for this algorithm include spam detection and bug detection in software.</p>
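<p>To connect the formula to the code, here is a hand-rolled sketch of the same calculation on the training data above. It assumes whitespace tokenization and add-one (Laplace) smoothing, and the helper names <code>log_score</code> and <code>predict</code> are hypothetical. Because P(text) is the same for every label, it can be dropped when comparing labels, and logarithms are used to avoid multiplying many small probabilities.</p>

```python
import math
from collections import Counter

texts = ["I love this product", "Terrible service", "Amazing quality", "Waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative


def tokenize(text):
    # Assumption: simple whitespace tokenization, unlike CountVectorizer's regex.
    return text.lower().split()


vocabulary = {word for text in texts for word in tokenize(text)}
word_counts = {0: Counter(), 1: Counter()}
for text, label in zip(texts, labels):
    word_counts[label].update(tokenize(text))


def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    score = math.log(labels.count(label) / len(labels))
    total_words = sum(word_counts[label].values())
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # words never seen in training carry no evidence
        likelihood = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += math.log(likelihood)
    return score


def predict(text):
    """Pick the label with the highest (unnormalized) log posterior."""
    return max((0, 1), key=lambda label: log_score(text, label))


print(predict("Amazing product"))   # 1 (positive)
print(predict("Terrible service"))  # 0 (negative)
```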
<h3 id="heading-decision-trees">Decision Trees</h3>
<p>Decision trees partition a data set recursively, choosing at each split the feature that maximizes information gain. The result is an interpretable, tree-like model: each internal node is a decision on a feature, each branch is an outcome of that decision, and each leaf node is a classification.</p>
<p>Decision trees are especially useful for text classification and feature selection because you can trace exactly how the model arrived at each prediction. </p>
<p>Let’s see a code example that shows how the decision tree learns which words, converted to TF-IDF features, predict whether the sentiment of the text in question is positive or negative:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeClassifier
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-comment"># Sample text data with labels</span>
texts = [
    <span class="hljs-string">"I love this movie, it's fantastic"</span>,
    <span class="hljs-string">"Terrible film, waste of time"</span>,
    <span class="hljs-string">"Amazing performance and great story"</span>,
    <span class="hljs-string">"Boring and disappointing"</span>,
    <span class="hljs-string">"Excellent cinematography and acting"</span>,
    <span class="hljs-string">"Awful, would not recommend"</span>
]
labels = [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]  <span class="hljs-comment"># 1 = positive, 0 = negative</span>

<span class="hljs-comment"># Convert text to TF-IDF features</span>
vectorizer = TfidfVectorizer(max_features=<span class="hljs-number">20</span>)
X = vectorizer.fit_transform(texts)

<span class="hljs-comment"># Split data</span>
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=<span class="hljs-number">0.3</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># Train decision tree</span>
clf = DecisionTreeClassifier(max_depth=<span class="hljs-number">3</span>, random_state=<span class="hljs-number">42</span>)
clf.fit(X_train, y_train)

<span class="hljs-comment"># Make predictions</span>
test_text = [<span class="hljs-string">"This movie is wonderful"</span>]
test_vector = vectorizer.transform(test_text)
prediction = clf.predict(test_vector)

print(<span class="hljs-string">f"Text: <span class="hljs-subst">{test_text[<span class="hljs-number">0</span>]}</span>"</span>)
print(<span class="hljs-string">f"Predicted sentiment: <span class="hljs-subst">{<span class="hljs-string">'Positive'</span> <span class="hljs-keyword">if</span> prediction[<span class="hljs-number">0</span>] == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'Negative'</span>}</span>"</span>)
print(<span class="hljs-string">f"Model accuracy: <span class="hljs-subst">{clf.score(X_test, y_test):<span class="hljs-number">.2</span>f}</span>"</span>)
</code></pre>
<p>At each node, the decision tree asks a question such as "Does the text have a high TF-IDF score for 'wonderful'?" The tree then follows the branch that matches the answer and repeats the process until it reaches a leaf, which holds the classification.</p>
<p>One key parameter in the above code is <code>max_depth=3</code> – without it, the tree may become too complex and overfit. The parameter limits the complexity of the tree.</p>
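<p>To make the node-by-node questioning concrete, here is a minimal sketch of a hand-written tree in plain Python. The words, thresholds, and leaf labels are hypothetical and chosen for illustration. A trained <code>DecisionTreeClassifier</code> learns its own splits from the data.</p>

```python
# Hypothetical hand-written tree: each node asks one threshold question
# about a TF-IDF score. A trained tree learns these words and thresholds itself.
TREE = {
    "feature": "wonderful", "threshold": 0.3,
    "yes": {"label": "Positive"},
    "no": {
        "feature": "terrible", "threshold": 0.3,
        "yes": {"label": "Negative"},
        "no": {"label": "Positive"},  # arbitrary default leaf for this sketch
    },
}

def classify(tfidf_scores, node=TREE):
    """Walk from the root to a leaf, answering one question per node."""
    while "label" not in node:
        score = tfidf_scores.get(node["feature"], 0.0)
        node = node["yes"] if score > node["threshold"] else node["no"]
    return node["label"]

def tree_depth(node):
    """Number of questions on the longest root-to-leaf path."""
    if "label" in node:
        return 0
    return 1 + max(tree_depth(node["yes"]), tree_depth(node["no"]))

print(classify({"wonderful": 0.62, "movie": 0.41}))  # Positive
print(classify({"terrible": 0.58, "film": 0.44}))    # Negative
print(tree_depth(TREE))                              # 2
```

<p>The depth printed at the end is the quantity that <code>max_depth</code> caps: a smaller limit means fewer questions per prediction and less room to memorize the training set.</p>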
<h3 id="heading-latent-dirichlet-allocation-lda">Latent Dirichlet Allocation (LDA)</h3>
<p>Latent Dirichlet Allocation (LDA) automatically determines thematic structures in large collections of texts by treating documents as probabilistic mixtures of topics, and topics as distributions over words. This discovery approach uses unsupervised learning, which means that no labeled training data are needed to discover structured but hidden themes. LDA is suited for exploratory text analysis and organization of data in significant amounts of text.</p>
<p>Let’s see some code that builds a word-count (document-term) matrix from a small set of documents and fits LDA on it. The model identifies two underlying topics based on patterns of word co-occurrence, which you can think of as a soft form of clustering for text documents.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> LatentDirichletAllocation
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer

<span class="hljs-comment"># Document collection</span>
documents = [
    <span class="hljs-string">"Machine learning algorithms process data"</span>,
    <span class="hljs-string">"Deep learning uses neural networks"</span>,
    <span class="hljs-string">"Python is great for data science"</span>,
    <span class="hljs-string">"Neural networks learn from examples"</span>
]

<span class="hljs-comment"># Create document-term matrix</span>
vectorizer = CountVectorizer(max_features=<span class="hljs-number">50</span>)
doc_term_matrix = vectorizer.fit_transform(documents)

<span class="hljs-comment"># Train LDA model</span>
lda = LatentDirichletAllocation(n_components=<span class="hljs-number">2</span>, random_state=<span class="hljs-number">42</span>)
lda.fit(doc_term_matrix)

<span class="hljs-comment"># Display topics</span>
feature_names = vectorizer.get_feature_names_out()
<span class="hljs-keyword">for</span> topic_idx, topic <span class="hljs-keyword">in</span> enumerate(lda.components_):
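    top_words_idx = topic.argsort()[<span class="hljs-number">-5</span>:][::<span class="hljs-number">-1</span>]  <span class="hljs-comment"># Top 5 words, highest weight first</span>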
    top_words = [feature_names[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> top_words_idx]
    print(<span class="hljs-string">f"Topic <span class="hljs-subst">{topic_idx}</span>: <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(top_words)}</span>"</span>)
</code></pre>
<p>With a corpus this small, the topics are only illustrative and can change with the random seed. In a run like this, you might interpret Topic 0 as "data science and algorithms" and Topic 1 as "neural networks and deep learning." LDA is a mixed-membership model, so it assigns each document a probability distribution over the topics instead of a single label. For instance, a document containing "neural networks for data processing" could be 60% Topic 1 and 40% Topic 0.</p>
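<p>The idea of a document as a mixture of topics is easier to see from the generative side. The sketch below follows the story LDA assumes for how a document is written: for each word, pick a topic according to the document's mixture, then pick a word from that topic. All probabilities here are hypothetical and fixed by hand, and the Dirichlet priors that LDA places on them are left out for brevity.</p>

```python
import random

# Hypothetical numbers for illustration; a fitted LDA model estimates these.
topic_word_probs = {
    0: {"data": 0.4, "science": 0.3, "algorithms": 0.3},
    1: {"neural": 0.4, "networks": 0.4, "deep": 0.2},
}
doc_topic_mix = {0: 0.4, 1: 0.6}  # 40% Topic 0, 60% Topic 1

def generate_document(n_words, topic_mix, word_probs, seed=0):
    """Follow LDA's generative story: pick a topic per word, then a word."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        vocab = word_probs[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append((topic, word))
    return words

doc = generate_document(10, doc_topic_mix, topic_word_probs)
print(doc)
share_topic_1 = sum(1 for topic, _ in doc if topic == 1) / len(doc)
print(f"Share of words drawn from Topic 1: {share_topic_1:.0%}")
```

<p>Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that most plausibly produced them.</p>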
<h3 id="heading-deep-learning-models">Deep Learning Models</h3>
<p>Deep learning models automatically extract hierarchical representations from raw text without manual feature engineering. Applying deep learning to language processing is important because language understanding requires modeling not just individual words, but also phrases, sentences, and the context as a whole.</p>
<p>A neural architecture achieves this modeling by learning multiple layers of abstraction and can interpret the sentences in more complex ways, such as sentiment, intent, or topic. </p>
<p>Let’s illustrate how it works with a simplified text classification model built with TensorFlow/Keras. This example uses an embedding layer to map words to dense vectors that capture their semantic meaning, an LSTM layer that reads the sequence in order and carries context forward from earlier words, and a Dense layer that outputs a score for binary classification.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> tensorflow.keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> tensorflow.keras.layers <span class="hljs-keyword">import</span> Embedding, LSTM, Dense
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.text <span class="hljs-keyword">import</span> Tokenizer
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.sequence <span class="hljs-keyword">import</span> pad_sequences

<span class="hljs-comment"># Example sentences and labels</span>
texts = [<span class="hljs-string">"I like this movie"</span>, <span class="hljs-string">"I hate this movie"</span>]
labels = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">0</span>])  <span class="hljs-comment"># 1 = positive, 0 = negative (NumPy array, as Keras expects)</span>

<span class="hljs-comment"># Tokenize text and pad sequences</span>
tokenizer = Tokenizer(num_words=<span class="hljs-number">50</span>)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=<span class="hljs-number">5</span>)

<span class="hljs-comment"># Simple model: embedding + LSTM + output</span>
model = Sequential([
    Embedding(input_dim=<span class="hljs-number">50</span>, output_dim=<span class="hljs-number">8</span>, input_length=<span class="hljs-number">5</span>),
    LSTM(<span class="hljs-number">4</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>)
])

model.compile(optimizer=<span class="hljs-string">'adam'</span>, loss=<span class="hljs-string">'binary_crossentropy'</span>)
model.fit(X, labels, epochs=<span class="hljs-number">5</span>, verbose=<span class="hljs-number">0</span>)

<span class="hljs-comment"># Predict sentiment for new sentence</span>
test_text = [<span class="hljs-string">"I love this"</span>]
test_seq = pad_sequences(tokenizer.texts_to_sequences(test_text), maxlen=<span class="hljs-number">5</span>)
pred = model.predict(test_seq)[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]

print(<span class="hljs-string">f"Sentiment score: <span class="hljs-subst">{pred:<span class="hljs-number">.2</span>f}</span> (1=positive, 0=negative)"</span>)
</code></pre>
<p>The model learns which word patterns go with each label from example sentences marked as positive or negative, and then uses those learned patterns to predict the sentiment of new text. With only two training sentences, the score here is purely illustrative. The point is the workflow: the network learns its own representation of the text and uses it to classify sequences, without any manual feature engineering.</p>
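<p>Before the embedding layer sees any text, the tokenizer and padding step turn each sentence into a fixed-length list of integers. The following is a simplified plain-Python sketch of that preprocessing, not the Keras implementation. The helper names <code>build_vocab</code> and <code>encode</code> are hypothetical.</p>

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer index, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab, maxlen):
    """Convert text to indices, drop unknown words, and left-pad with zeros."""
    ids = [vocab[word] for word in text.lower().split() if word in vocab]
    ids = ids[-maxlen:]  # keep the last maxlen tokens if the text is too long
    return [0] * (maxlen - len(ids)) + ids

vocab = build_vocab(["I like this movie", "I hate this movie"])
print(vocab)  # {'i': 1, 'this': 2, 'movie': 3, 'like': 4, 'hate': 5}
print(encode("I like this movie", vocab, maxlen=5))  # [0, 1, 4, 2, 3]
print(encode("I love this", vocab, maxlen=5))        # [0, 0, 0, 1, 2]
```

<p>Notice that "love" never appeared in the training sentences, so it has no index and is dropped. A model can only learn from words that made it into the vocabulary, which is one reason real projects train on much larger datasets.</p>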
<h3 id="heading-convolutional-neural-networks-cnns">Convolutional Neural Networks (CNNs)</h3>
<p>CNNs apply the same pattern-detecting approach to text that they use for image recognition. A CNN treats a document as a sequence of word vectors and slides convolutional filters across it. Each filter responds to a local pattern such as an n-gram (a run of adjacent words) or a meaningful phrase.</p>
<p>A convolutional layer contains many filters, and each one learns to detect a different feature. When layers are stacked, later layers detect progressively more abstract features, moving from simple word combinations to combinations that consistently signal a semantic pattern. This makes CNNs effective for text classification. (Source: <a target="_blank" href="https://arxiv.org/abs/1408.5882">Yoon Kim</a>)</p>
<p>In the example below, a convolutional layer scans the text with its filters. During training, the filters learn to respond to informative patterns, such as the word "excellent" or the phrase "terrible waste," and the final classification step learns to treat those patterns as evidence of positive or negative sentiment.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow.keras.layers <span class="hljs-keyword">import</span> Embedding, Conv1D, GlobalMaxPooling1D, Dense
<span class="hljs-keyword">from</span> tensorflow.keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.text <span class="hljs-keyword">import</span> Tokenizer
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.sequence <span class="hljs-keyword">import</span> pad_sequences

<span class="hljs-comment"># Sample training data</span>
texts = [
    <span class="hljs-string">"This movie is excellent and entertaining"</span>,
    <span class="hljs-string">"Terrible film, complete waste"</span>,
    <span class="hljs-string">"Amazing story and great acting"</span>,
    <span class="hljs-string">"Boring and poorly made"</span>
]
labels = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>])  <span class="hljs-comment"># 1 = positive, 0 = negative (NumPy array, as Keras expects)</span>

<span class="hljs-comment"># Tokenize and pad sequences</span>
tokenizer = Tokenizer(num_words=<span class="hljs-number">100</span>)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=<span class="hljs-number">10</span>)

<span class="hljs-comment"># Build CNN model</span>
model = Sequential([
    Embedding(input_dim=<span class="hljs-number">100</span>, output_dim=<span class="hljs-number">32</span>, input_length=<span class="hljs-number">10</span>),  <span class="hljs-comment"># Convert words to dense vectors</span>
    Conv1D(filters=<span class="hljs-number">64</span>, kernel_size=<span class="hljs-number">3</span>, activation=<span class="hljs-string">'relu'</span>),  <span class="hljs-comment"># Detect 3-word patterns</span>
    GlobalMaxPooling1D(),  <span class="hljs-comment"># Extract most important features</span>
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>)  <span class="hljs-comment"># Binary classification</span>
])

model.compile(optimizer=<span class="hljs-string">'adam'</span>, loss=<span class="hljs-string">'binary_crossentropy'</span>, metrics=[<span class="hljs-string">'accuracy'</span>])
model.fit(X, labels, epochs=<span class="hljs-number">10</span>, verbose=<span class="hljs-number">0</span>)

<span class="hljs-comment"># Test prediction</span>
test_text = [<span class="hljs-string">"wonderful movie with great plot"</span>]
test_seq = tokenizer.texts_to_sequences(test_text)
test_pad = pad_sequences(test_seq, maxlen=<span class="hljs-number">10</span>)
prediction = model.predict(test_pad)

print(<span class="hljs-string">f"Sentiment probability: <span class="hljs-subst">{prediction[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"Classification: <span class="hljs-subst">{<span class="hljs-string">'Positive'</span> <span class="hljs-keyword">if</span> prediction[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>] &gt; <span class="hljs-number">0.5</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'Negative'</span>}</span>"</span>)
</code></pre>
<p>The pooling layer then condenses the output of the convolution. <code>GlobalMaxPooling1D</code> keeps only the strongest activation of each filter across all positions in the text, so the final Dense layer sees whether each pattern occurred anywhere in the sentence, regardless of where.</p>
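<p>To see what the convolution and pooling steps compute, here is a minimal sketch with a single filter in plain Python. It assumes each word is represented by just one number instead of a full embedding vector, and both the word values and the filter weights are hypothetical. In a real model they are learned.</p>

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence and apply ReLU to each window's dot product."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))  # ReLU
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest activation, wherever it occurred."""
    return max(feature_map)

# Hypothetical one-number "embeddings": higher values mean more positive words.
sentence = [0.25, 0.0, 1.0, 0.75, 0.25]
kernel = [0.5, 1.0, 1.0]  # hypothetical filter spanning 3 tokens

feature_map = conv1d(sentence, kernel)
print(feature_map)                   # [1.125, 1.75, 1.5]
print(global_max_pool(feature_map))  # 1.75
```

<p>A five-token input and a three-token filter give three window positions, and pooling reduces them to one number per filter. With 64 filters, as in the Keras model above, the Dense layer would receive 64 such numbers.</p>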
<h3 id="heading-recurrent-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h3>
<p>RNNs handle sequential data by tracking hidden states that reflect dependencies over time. At each time step, the RNN receives the current word and the previous hidden state as input and changes the hidden state, which reflects the accumulated context. </p>
<p>Here's a concrete example where, as the RNN reads the next word from left to right, it updates its hidden state to maintain the context. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow.keras.layers <span class="hljs-keyword">import</span> Embedding, SimpleRNN, Dense
<span class="hljs-keyword">from</span> tensorflow.keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.text <span class="hljs-keyword">import</span> Tokenizer
<span class="hljs-keyword">from</span> tensorflow.keras.preprocessing.sequence <span class="hljs-keyword">import</span> pad_sequences

<span class="hljs-comment"># Training data</span>
texts = [
    <span class="hljs-string">"I really enjoyed this book"</span>,
    <span class="hljs-string">"The plot was confusing and dull"</span>,
    <span class="hljs-string">"Fantastic read, highly recommend"</span>,
    <span class="hljs-string">"Disappointing and poorly written"</span>
]
labels = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>])  <span class="hljs-comment"># 1 = positive, 0 = negative (NumPy array, as Keras expects)</span>

<span class="hljs-comment"># Prepare data</span>
tokenizer = Tokenizer(num_words=<span class="hljs-number">100</span>)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X = pad_sequences(sequences, maxlen=<span class="hljs-number">10</span>)

<span class="hljs-comment"># Build RNN model</span>
model = Sequential([
    Embedding(input_dim=<span class="hljs-number">100</span>, output_dim=<span class="hljs-number">32</span>, input_length=<span class="hljs-number">10</span>),
    SimpleRNN(units=<span class="hljs-number">64</span>, return_sequences=<span class="hljs-literal">False</span>),  <span class="hljs-comment"># Process sequence and maintain hidden state</span>
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>)
])

model.compile(optimizer=<span class="hljs-string">'adam'</span>, loss=<span class="hljs-string">'binary_crossentropy'</span>, metrics=[<span class="hljs-string">'accuracy'</span>])
model.fit(X, labels, epochs=<span class="hljs-number">20</span>, verbose=<span class="hljs-number">0</span>)

<span class="hljs-comment"># Test</span>
test_text = [<span class="hljs-string">"amazing story highly engaging"</span>]
test_seq = tokenizer.texts_to_sequences(test_text)
test_pad = pad_sequences(test_seq, maxlen=<span class="hljs-number">10</span>)
prediction = model.predict(test_pad)

print(<span class="hljs-string">f"Sentiment probability: <span class="hljs-subst">{prediction[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)
</code></pre>
<p>Longer sentences are harder for a simple RNN because the information stored in the hidden state fades as the number of time steps grows, so early words have little influence by the end of the sequence. That's the motivation for the more sophisticated long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, which use gates to control what is kept and what is forgotten.</p>
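<p>The fading effect can be sketched with a one-unit RNN in plain Python. The update rule is the standard one for a simple RNN, <code>h_t = tanh(w_x * x_t + w_h * h_prev + bias)</code>, but the weights below are hypothetical values picked by hand, not learned ones.</p>

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.5, bias=0.0):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_prev + bias)."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states

# One strong signal at the first time step, followed by neutral inputs.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
for step, h in enumerate(states, start=1):
    print(f"step {step}: hidden state = {h:.4f}")
```

<p>With these weights, the trace of the first input shrinks at every step. After enough steps, almost nothing of it remains in the hidden state, which is the behavior that gated architectures are designed to counter.</p>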
<h3 id="heading-encoder-decoder-architectures">Encoder-Decoder Architectures</h3>
<p>These architectures pair two neural networks that work together. The first network, the encoder, takes the input text and compresses it into a dense representation that captures its essential meaning. The second network, the decoder, generates an output text from that representation.</p>
<p>Because the encoder learns a compressed representation of the input, architectures in this family (including autoencoders, where the decoder reconstructs the input itself) are often used for:</p>
<ul>
<li><p>Dimensionality reduction.</p>
</li>
<li><p>Feature learning.</p>
</li>
<li><p>Document clustering.</p>
</li>
<li><p>Sequence-to-sequence tasks (for example, translation or summarization).</p>
</li>
</ul>
<p>The following example illustrates how to use a Text-to-Text Transfer Transformer (T5) encoder-decoder model to translate English into German. The encoder takes the input English sentence and builds its internal representation of the text, while the decoder generates the German translation based on the representation:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> T5Tokenizer, T5ForConditionalGeneration

<span class="hljs-comment"># Load T5 model for text generation</span>
tokenizer = T5Tokenizer.from_pretrained(<span class="hljs-string">"t5-small"</span>)
model = T5ForConditionalGeneration.from_pretrained(<span class="hljs-string">"t5-small"</span>)

<span class="hljs-comment"># Translate text</span>
input_text = <span class="hljs-string">"translate English to German: Hello, how are you?"</span>
input_ids = tokenizer(input_text, return_tensors=<span class="hljs-string">"pt"</span>).input_ids

<span class="hljs-comment"># Generate translation</span>
outputs = model.generate(input_ids)
translation = tokenizer.decode(outputs[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>)
print(<span class="hljs-string">f"Translation: <span class="hljs-subst">{translation}</span>"</span>)
</code></pre>
<p>This architecture handles variable-length input and output elegantly. In classic RNN-based encoder-decoders, the encoder compresses the sentence into a fixed-size vector regardless of the input length. Transformer-based models such as T5 instead keep one vector per input token, and the decoder attends to all of them. In both cases, the decoder produces the output one token at a time and stops when it emits an end-of-sequence token, so the output can be one sentence or six, whatever the input calls for.</p>
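<p>The token-by-token generation loop is what makes variable-length output possible. Here is a minimal sketch of greedy decoding in plain Python. The lookup table is a hypothetical stand-in for a trained decoder: a real decoder picks the next token from the encoder output and all previously generated tokens, not just the last one.</p>

```python
# Hypothetical next-token table standing in for a trained decoder.
NEXT_TOKEN = {
    "[START]": "Hallo",
    "Hallo": ",",
    ",": "wie",
    "wie": "geht",
    "geht": "es",
    "es": "dir",
    "dir": "?",
    "?": "[END]",
}

def greedy_decode(max_new_tokens=20):
    """Emit one token at a time until the end token or the length limit."""
    tokens = []
    current = "[START]"
    for _ in range(max_new_tokens):
        current = NEXT_TOKEN.get(current, "[END]")
        if current == "[END]":
            break
        tokens.append(current)
    return tokens

print(greedy_decode())                  # ['Hallo', ',', 'wie', 'geht', 'es', 'dir', '?']
print(greedy_decode(max_new_tokens=3))  # ['Hallo', ',', 'wie']
```

<p>The second call shows why generation functions usually accept a length limit: it caps the output even if the end token has not been produced yet.</p>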
<h3 id="heading-transformer-models">Transformer Models</h3>
<p>Unlike RNNs, which process text sequentially (one word at a time), transformers evaluate the whole sequence in parallel. This means that a transformer can consider all of the words in a sentence at once and directly compute the relationship between any two words, even when they are far apart.</p>
<p>For example, in the sentence "The girl didn't go to school because she was ill," the model directly connects "she" with "girl" despite the words between them. Parallel processing makes training faster, and the direct connections help avoid the degradation of information across time steps. (Source: <a target="_blank" href="https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf">Vaswani and others</a>).</p>
<p>In the code below, BERT, one of the most well-known transformer models, is set up for sentiment classification. It shows how a transformer can reuse its pre-trained understanding of language for a classification task with only minimal additional training. Note that the classification head loaded here is newly initialized, so the scores are not meaningful until the model is fine-tuned on labeled sentiment data.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> BertTokenizer, BertForSequenceClassification
<span class="hljs-keyword">import</span> torch

<span class="hljs-comment"># Load pre-trained BERT</span>
tokenizer = BertTokenizer.from_pretrained(<span class="hljs-string">'bert-base-uncased'</span>)
model = BertForSequenceClassification.from_pretrained(<span class="hljs-string">'bert-base-uncased'</span>)

<span class="hljs-comment"># Prepare input</span>
text = <span class="hljs-string">"This movie was fantastic!"</span>
inputs = tokenizer(text, return_tensors=<span class="hljs-string">"pt"</span>, padding=<span class="hljs-literal">True</span>, truncation=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Get predictions</span>
<span class="hljs-keyword">with</span> torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=<span class="hljs-number">-1</span>)

print(<span class="hljs-string">f"Prediction scores: <span class="hljs-subst">{predictions}</span>"</span>)
</code></pre>
<p>In the above code, the tokenizer converts the text into numerical token IDs (which BERT understands) and adds special tokens, such as [CLS] (used for classification) at the beginning of the sequence and [SEP] at the end. BERT then processes the entire sentence through multiple layers, and each layer builds a more abstract representation of its meaning.</p>
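<p>The direct word-to-word connections described in this section come from attention. Below is a minimal plain-Python sketch of scaled dot-product attention weights for a single query. The two-dimensional vectors are hypothetical and picked by hand so that "she" points toward "girl". A trained transformer learns such vectors and runs many attention heads in parallel.</p>

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores between one query and every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-dimensional vectors; a trained model learns these.
tokens = ["girl", "school", "ill"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_she = [4.0, 0.0]

weights = attention_weights(query_for_she, keys)
for token, weight in zip(tokens, weights):
    print(f"she -> {token}: {weight:.2f}")
```

<p>Every token gets a weight in a single step, no matter how far it is from "she" in the sentence. That is the key difference from an RNN, where information has to survive every intermediate time step.</p>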
<h2 id="heading-how-to-use-nlp-in-various-industries">How to Use NLP in Various Industries</h2>
<p>You can use NLP to solve issues in almost any sector, and there are many sector-specific implementations. You can choose to try the snippets below depending on the area you’re most interested in.</p>
<h3 id="heading-tourism-and-hospitality">Tourism and Hospitality</h3>
<p>You can use NLP techniques to build intelligent booking systems that comprehend natural language requests from clients. Important uses you can apply:</p>
<ul>
<li><p><strong>Sentiment analysis</strong> monitors consumer feedback to spot patterns in satisfaction and problems with customer service.</p>
</li>
<li><p><strong>NER-enabled chatbots</strong> retrieve dates and locations from consumer inquiries such as "I need a flight to Paris next Tuesday."</p>
</li>
</ul>
<p>Here’s an example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

<span class="hljs-comment"># Load NER model</span>
ner = pipeline(<span class="hljs-string">"ner"</span>, aggregation_strategy=<span class="hljs-string">"simple"</span>)  <span class="hljs-comment"># Merge sub-word tokens into whole entities</span>

<span class="hljs-comment"># Extract booking information</span>
query = <span class="hljs-string">"I need a hotel in London from December 15 to December 20"</span>
entities = ner(query)

<span class="hljs-keyword">for</span> entity <span class="hljs-keyword">in</span> entities:
    print(<span class="hljs-string">f"<span class="hljs-subst">{entity[<span class="hljs-string">'entity_group'</span>]}</span>: <span class="hljs-subst">{entity[<span class="hljs-string">'word'</span>]}</span>"</span>)
<span class="hljs-comment"># Example output: LOC: London (models without a date label will not return the dates)</span>
</code></pre>
<p>Through machine translation, you can offer customer support in multiple languages. And a BERT-based intent classification model can identify what each customer wants, then route the request to the right service team or complete the booking automatically.</p>
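<p>General-purpose NER models are often trained only on labels such as person, organization, and location, so they may not return the dates in a booking request. A rule-based step can fill that gap. This is a minimal sketch that assumes dates are written as a full English month name followed by a day number. The pattern and the <code>extract_dates</code> helper are hypothetical and would need extending for other formats.</p>

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}\b")

def extract_dates(query):
    """Return every 'Month day' expression found in the query, in order."""
    return DATE_PATTERN.findall(query)

query = "I need a hotel in London from December 15 to December 20"
print(extract_dates(query))  # ['December 15', 'December 20']
```

<p>Combining the location from the NER model with the dates from a step like this gives you the fields a booking system needs: where, from when, and until when.</p>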
<h3 id="heading-logistics-and-supply-chain">Logistics and Supply Chain</h3>
<p>You can automate document processing via NLP and optimize delivery routing using predictive algorithms. Let’s see the common areas of application:</p>
<ul>
<li><strong>You can use OCR to process documents</strong> to automatically extract shipping information from invoices and customs forms. Here’s an example:</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pytesseract
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

<span class="hljs-comment"># Extract text from shipping document</span>
image = Image.open(<span class="hljs-string">'invoice.png'</span>)
text = pytesseract.image_to_string(image)

<span class="hljs-comment"># Parse extracted information</span>
<span class="hljs-comment"># (Add parsing logic based on document structure)</span>
</code></pre>
<ul>
<li><p><strong>Text classification</strong> can place shipments into categories based on their descriptions, allowing them to be sorted automatically for transport.</p>
</li>
<li><p><strong>Predictive routing models</strong> can use historical delivery data and weather reports to create delivery schedules.</p>
</li>
<li><p><strong>Natural Language Generation</strong> turns technical logistics data into user-friendly tracking updates.</p>
</li>
</ul>
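<p>For the OCR use case above, the parsing step depends entirely on how your documents are laid out. As a minimal sketch, suppose the OCR output looks like the hypothetical text below. The field names and regular expressions are assumptions for this made-up layout, not a general invoice parser.</p>

```python
import re

# Hypothetical OCR output; real invoices will need their own patterns.
ocr_text = """
INVOICE No: INV-2048
Ship To: Hamburg, Germany
Weight: 12.5 kg
Tracking: TRK123456789
"""

FIELD_PATTERNS = {
    "invoice_number": r"INVOICE No:\s*(\S+)",
    "destination": r"Ship To:\s*(.+)",
    "weight_kg": r"Weight:\s*([\d.]+)\s*kg",
    "tracking_id": r"Tracking:\s*(\S+)",
}

def parse_shipping_fields(text):
    """Return the first match for each field, or None when a field is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields

print(parse_shipping_fields(ocr_text))
```

<p>Returning <code>None</code> for missing fields makes it easy to flag documents that need manual review, which matters because OCR output is rarely perfect.</p>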
<h3 id="heading-retail-and-ecommerce">Retail and eCommerce</h3>
<p>In eCommerce operations, you can personalize your customers’ shopping experience and optimize pricing with NLP techniques.</p>
<p>Some key applications that you can benefit from:</p>
<ul>
<li><strong>Recommendation engines</strong> use embeddings to represent product descriptions and user reviews as vectors, then suggest the items that are most similar to what a customer is looking for. Here’s how, for instance:</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer, util

<span class="hljs-comment"># Load embedding model</span>
model = SentenceTransformer(<span class="hljs-string">'all-MiniLM-L6-v2'</span>)

<span class="hljs-comment"># Product descriptions</span>
products = [
    <span class="hljs-string">"Wireless Bluetooth headphones with noise cancellation"</span>,
    <span class="hljs-string">"USB-C charging cable for smartphones"</span>,
    <span class="hljs-string">"Noise-cancelling earbuds with long battery life"</span>
]

<span class="hljs-comment"># User query</span>
query = <span class="hljs-string">"I need headphones that block outside noise"</span>

<span class="hljs-comment"># Calculate similarities</span>
query_embedding = model.encode(query)
product_embeddings = model.encode(products)
similarities = util.cos_sim(query_embedding, product_embeddings)

<span class="hljs-comment"># Find best match</span>
best_match_idx = int(similarities.argmax())  <span class="hljs-comment"># Convert the tensor index to a plain int</span>
print(<span class="hljs-string">f"Recommended product: <span class="hljs-subst">{products[best_match_idx]}</span>"</span>)
</code></pre>
<ul>
<li><p><strong>Chatbots that include dialogue management</strong> can respond to inquiries from customers about products, orders, and returns.</p>
</li>
<li><p><strong>Sentiment analysis</strong> on social media tracks brand health and customer sentiment in real-time. </p>
</li>
<li><p><strong>Price optimization algorithms</strong> analyze competitors' pricing and market signals to change prices in real-time.</p>
</li>
<li><p><strong>Demand forecasting</strong> analyzes news and social sentiment to predict inventory needs.</p>
</li>
</ul>
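<p>The recommendation example above ranks products by cosine similarity between embeddings. Here is a minimal plain-Python sketch of that measure. The three-dimensional vectors are hypothetical and made up by hand, whereas real sentence embeddings have hundreds of dimensions.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings for a query and three products.
query_vec = [0.9, 0.1, 0.0]
product_vecs = {
    "headphones": [0.8, 0.2, 0.1],
    "charging cable": [0.0, 0.1, 0.9],
    "earbuds": [0.7, 0.3, 0.0],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in product_vecs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")

best = max(scores, key=scores.get)
print(f"Recommended product: {best}")
```

<p>Cosine similarity compares the direction of two vectors and ignores their length, which is why it works well for embeddings: texts with similar meaning point the same way even when they differ in length or wording.</p>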
<h3 id="heading-healthcare">Healthcare</h3>
<p>Healthcare generates large volumes of text in patient records, which makes it a natural fit for NLP. You can support clinical decision-making and process medical records using specialized NLP systems. </p>
<p>Here are a few of the possible uses and an example:</p>
<ul>
<li><strong>Clinical NER</strong> identifies conditions, medications, and treatments mentioned in the clinicians' notes within electronic health records. For instance:</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy

<span class="hljs-comment"># Load medical NER model (requires installation of scispacy)</span>
<span class="hljs-comment"># pip install scispacy</span>
<span class="hljs-comment"># pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/en_core_sci_sm-0.5.1.tar.gz</span>

nlp = spacy.load(<span class="hljs-string">"en_core_sci_sm"</span>)

<span class="hljs-comment"># Process clinical note</span>
text = <span class="hljs-string">"Patient presents with hypertension and type 2 diabetes. Prescribed metformin 500mg."</span>
doc = nlp(text)

<span class="hljs-keyword">for</span> ent <span class="hljs-keyword">in</span> doc.ents:
    print(<span class="hljs-string">f"<span class="hljs-subst">{ent.text}</span>: <span class="hljs-subst">{ent.label_}</span>"</span>)
</code></pre>
<ul>
<li><p><strong>Clinical decision support systems</strong> scan descriptions of symptoms and provide suggestions for potential diagnoses to help a physician's decision-making.</p>
</li>
<li><p><strong>Literature mining</strong> scans clinical studies and identifies new treatment patterns or potential drug discovery targets.</p>
</li>
</ul>
<p>Of course, NLP can also be used in patient assistance chatbots, as they can comprehend natural language and its nuances.</p>
<h3 id="heading-financial-services">Financial Services</h3>
<p>In the finance sector, there are unique bottlenecks you might face. Financial data security gaps and the risks of fraud are among the most threatening ones, as well as the regulatory fines that come with these issues.</p>
<p>With NLP, you can improve security mechanisms and create systems for detecting fraud. </p>
<p>You can also detect phishing attacks with high accuracy by combining ML classifiers with NLP models based on CNNs and RNNs. (Source: <a target="_blank" href="https://www.researchgate.net/publication/385251725_ScienceDirect_Advancements_of_SMS_Spam_Detection_A_Comprehensive_Survey_of_NLP_and_ML_Techniques">Saidat and others, ResearchGate</a>).</p>
<p>Some other use cases include:</p>
<ul>
<li><p><strong>Document analysis</strong> processes loan applications and contracts automatically to help assess credit risk. </p>
</li>
<li><p><strong>Fraud detection systems</strong> analyze transaction data and communication data to identify suspicious activity. For example:</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

<span class="hljs-comment"># Load zero-shot classification model</span>
classifier = pipeline(<span class="hljs-string">"zero-shot-classification"</span>)

<span class="hljs-comment"># Analyze transaction description</span>
description = <span class="hljs-string">"Wire transfer to offshore account for investment opportunity"</span>
candidate_labels = [<span class="hljs-string">"legitimate transaction"</span>, <span class="hljs-string">"potential fraud"</span>, <span class="hljs-string">"suspicious activity"</span>]

result = classifier(description, candidate_labels)
print(<span class="hljs-string">f"Classification: <span class="hljs-subst">{result[<span class="hljs-string">'labels'</span>][<span class="hljs-number">0</span>]}</span> (Score: <span class="hljs-subst">{result[<span class="hljs-string">'scores'</span>][<span class="hljs-number">0</span>]:<span class="hljs-number">.4</span>f}</span>)"</span>)
</code></pre>
<ul>
<li><p><strong>Automated compliance monitoring</strong> scans messages for adherence with regulations.</p>
</li>
<li><p><strong>Robo-advisors leverage natural language interfaces</strong> to engage with clients while providing investment advice.</p>
</li>
</ul>
<p>Apart from these uses, conventional chatbots also rely on NLP techniques to assist customers, and OCR algorithms are widely used for document analysis. We’ve already covered both in earlier sections, so we won’t discuss them further here.</p>
<h3 id="heading-legal-industry-and-compliance-regulations">Legal Industry and Compliance Regulations</h3>
<p>Even more than financial services, the legal sector depends on strict requirements, laws, and regulations. NLP techniques can help you improve safety, security, and efficiency in processing legal documents.</p>
<p>Key examples of how it can be applied:</p>
<ul>
<li><p><strong>Multimodal authentication</strong> is a secure identity verification process consisting of a combination of facial recognition, voice recognition, and natural language processing.</p>
</li>
<li><p><strong>Speaker recognition</strong> uses automatic speech-to-text encoding and intent recognition to process a verbal response to security questions.</p>
</li>
<li><p><strong>Contract analysis</strong> scans legal documents to identify key terms, deliverables, and dates, extracting that information automatically. As an example, you can try the following snippet with the spaCy library installed:</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy

nlp = spacy.load(<span class="hljs-string">"en_core_web_sm"</span>)

<span class="hljs-comment"># Extract dates, durations, and numbers from the contract</span>
contract_text = <span class="hljs-string">"The agreement shall commence on January 1, 2026 and continue for a period of 12 months."</span>
doc = nlp(contract_text)

<span class="hljs-keyword">for</span> ent <span class="hljs-keyword">in</span> doc.ents:
    <span class="hljs-keyword">if</span> ent.label_ <span class="hljs-keyword">in</span> [<span class="hljs-string">"DATE"</span>, <span class="hljs-string">"CARDINAL"</span>]:
        print(<span class="hljs-string">f"<span class="hljs-subst">{ent.label_}</span>: <span class="hljs-subst">{ent.text}</span>"</span>)
</code></pre>
<ul>
<li><strong>Compliance monitoring</strong> looks for possible regulatory infractions in legal communications.</li>
</ul>
<p>These real-world examples show how NLP can be used to solve practical business issues and boost operational effectiveness. You can adapt the samples to your own use case or explore other potential uses, but these are among the most widespread ones for you to try.</p>
<h2 id="heading-how-to-choose-the-most-effective-nlp-tools-and-libraries">How to Choose the Most Effective NLP Tools and Libraries</h2>
<p>A wide variety of tools and libraries can help you learn NLP or implement it in a project. You should select the appropriate tools based on your project needs and your background in the associated technologies.</p>
<p>Below are some popular tools you can choose to learn or check out, along with tips about when they’re most useful.</p>
<h3 id="heading-hugging-face-transformershttpshuggingfacecodocstransformersenindex"><a target="_blank" href="https://huggingface.co/docs/transformers/en/index">Hugging Face Transformers</a></h3>
<p>Hugging Face Transformers gives you access to thousands of pre-trained models for text generation, classification, and question answering. It supports more than 100 languages and is compatible with PyTorch and TensorFlow.</p>
<p>The Hugging Face platform also provides model hosting and datasets, and lets you collaborate with community members. It’s a strong choice for deep learning NLP applications that need state-of-the-art models.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762166821354/51df5644-2713-46c8-b24d-e0d7a51bd61d.png" alt="Hugging Face" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-nltkhttpswwwnltkorg-natural-language-toolkit"><a target="_blank" href="https://www.nltk.org/">NLTK</a> (Natural Language Toolkit)</h3>
<p>NLTK is one of the most widely used Python packages for NLP education and research. Developed at the University of Pennsylvania, it provides extensive modules for tokenization, stemming, parsing, and semantic reasoning. It’s a great choice if you want to learn NLP concepts or conduct research projects.</p>
<h3 id="heading-spacyhttpsspacyio"><a target="_blank" href="https://spacy.io/">spaCy</a></h3>
<p>spaCy is a Python library built for production use and known for fast syntactic parsing. It’s written in Cython for performance and offers excellent named entity recognition. It’s a good fit if you need strong dependency parsing and a developer-friendly API, and it’s also easy to use for quick prototyping.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762166854654/b563410d-b978-498a-a52e-9e6bc58f732c.png" alt="SpaCy" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-google-cloud-nlphttpscloudgooglecomnatural-language"><a target="_blank" href="https://cloud.google.com/natural-language">Google Cloud NLP</a></h3>
<p>Google Cloud NLP offers enterprise-level API services. It will fit your project if you need sentiment analysis, entity recognition, syntax analysis, automatic language identification, and straightforward scaling. And if you’re already in the Google Cloud ecosystem and working with large volumes of customer feedback, it’s a natural choice.</p>
<h3 id="heading-amazon-comprehendhttpsawsamazoncomcomprehend"><a target="_blank" href="https://aws.amazon.com/comprehend/">Amazon Comprehend</a></h3>
<p>Comprehend is a fully-managed service from AWS for text analysis in the cloud. It supports the major functions you might want to cover: sentiment analysis, entity recognition, topic modeling, built-in protection of personally identifiable information (PII), and auto-scaling. And it’s perfect if you need a built-in integration with the AWS suite.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762167136170/6cc6a502-adbc-401f-a9e7-957d7facbe0c.png" alt="Amazon Comprehend" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-ibm-watsonhttpswwwibmcomdocsenwatsonxsaastopicscripts-watson-natural-language-processing"><a target="_blank" href="https://www.ibm.com/docs/en/watsonx/saas?topic=scripts-watson-natural-language-processing">IBM Watson</a></h3>
<p>Watson has NLP models specific to regulated industries (healthcare, finance, and so on). Its library offers pre-trained models for more than 20 human languages. Its top features are strong data controls, REST API access, and compliance-oriented outputs. These make this tool a strong choice if you’re in the healthcare, finance, or legal industries.</p>
<h3 id="heading-textblobhttpstextblobreadthedocsio"><a target="_blank" href="https://textblob.readthedocs.io/">TextBlob</a></h3>
<p>TextBlob is a user-friendly Python library for common NLP tasks, and it’s a great option if you’re a beginner. Its simplified API still provides decent sentiment analysis, translation, spelling correction, and noun phrase extraction. Apart from beginner projects, it also works well for quick prototypes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762167252639/9961064e-1311-4423-bb6c-57aa15b11fc4.png" alt="TextBlob" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-prepare-and-train-nlp-systems">How to Prepare and Train NLP Systems</h2>
<p>As you get ready to train your NLP model, you’ll need to prepare your data accurately to ensure its quality doesn’t hinder the outputs. Remember that poor quality data results in a poor performing model, so you’ll want to make sure you have solid data.</p>
<h3 id="heading-understanding-data-quality-and-preprocessing">Understanding Data Quality and Preprocessing</h3>
<p>Raw text data is messy and unstructured. It contains typos, slang, and irrelevant information that degrades the performance of your model. </p>
<p>Preprocessing is the operation that takes messy data and converts it into clean, structured text that models can accept as input. </p>
<p>Research shows that 85.4% of NLP research studies used some form of restructuring or preprocessing so that NLP models could process raw text. The data quality components considered essential were accuracy (68.3%), relevance (34.1%), and comparability (31.7%). (Source: <a target="_blank" href="https://pmc.ncbi.nlm.nih.gov/articles/PMC10476151/">Nesca and others, NCBI</a>.)</p>
<p>Preprocessing comes down to a specific list of tasks you’ll need to perform. Let’s break them down.</p>
<h3 id="heading-text-cleaning">Text Cleaning</h3>
<p>Text cleaning is the process of standardizing the text format by removing anything that may affect model training. Raw text often contains extra elements (HTML tags, URLs, special characters, inconsistent use of capitalization, and excess whitespace) that add noise to your data.</p>
<p>The following example shows a cleaning pipeline that removes the above-mentioned elements. This function performs multiple steps of text cleaning:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_text</span>(<span class="hljs-params">text</span>):</span>
    <span class="hljs-comment"># Convert to lowercase</span>
    text = text.lower()

    <span class="hljs-comment"># Remove HTML tags</span>
    text = re.sub(<span class="hljs-string">r'&lt;[^&gt;]+&gt;'</span>, <span class="hljs-string">''</span>, text)    

    <span class="hljs-comment"># Remove URLs</span>
    text = re.sub(<span class="hljs-string">r'http\S+|www\.\S+'</span>, <span class="hljs-string">''</span>, text)

    <span class="hljs-comment"># Remove special characters and numbers</span>
    text = re.sub(<span class="hljs-string">r'[^a-zA-Z\s]'</span>, <span class="hljs-string">''</span>, text)    

    <span class="hljs-comment"># Remove extra whitespace</span>
    text = <span class="hljs-string">' '</span>.join(text.split())    

    <span class="hljs-keyword">return</span> text

<span class="hljs-comment"># Example</span>
raw_text = <span class="hljs-string">"Check out https://example.com! It's &lt;b&gt;AMAZING&lt;/b&gt; :-)"</span>
cleaned = clean_text(raw_text)
print(cleaned)
<span class="hljs-comment"># Output: check out its amazing</span>
</code></pre>
<p>The function first converts everything to lowercase for uniformity. Then it uses regular expressions to remove HTML tags, URLs, special characters, and numbers. Finally, it normalizes whitespace by splitting the text and joining it back together. The output is clean text in a standardized format that you can pass on to tokenization.</p>
<h3 id="heading-tokenization">Tokenization</h3>
<p>The next step is to divide the text into smaller digestible chunks that are easier for ML models to understand. These chunks are known as tokens.</p>
<p>Tokenization comes in three varieties:</p>
<ul>
<li><p><strong>Word tokenization</strong> separates text according to punctuation and whitespace.</p>
</li>
<li><p><strong>Sentence tokenization</strong> uses punctuation cues to divide text into sentences.</p>
</li>
<li><p><strong>Subword tokenization</strong> breaks words up into more manageable chunks.</p>
</li>
</ul>
<p>The example below shows word and sentence tokenization using NLTK (Natural Language Toolkit).</p>
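<pre><code class="lang-python"><span class="hljs-keyword">import</span> nltk
<span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize, sent_tokenize

<span class="hljs-comment"># Download the tokenizer data once (newer NLTK releases look for "punkt_tab")</span>
nltk.download(<span class="hljs-string">"punkt"</span>)
nltk.download(<span class="hljs-string">"punkt_tab"</span>)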

text = <span class="hljs-string">"Natural language processing is exciting! It helps computers understand text."</span>

<span class="hljs-comment"># Word tokenization</span>
words = word_tokenize(text)
print(<span class="hljs-string">f"Words: <span class="hljs-subst">{words}</span>"</span>)

<span class="hljs-comment"># Sentence tokenization</span>
sentences = sent_tokenize(text)
print(<span class="hljs-string">f"Sentences: <span class="hljs-subst">{sentences}</span>"</span>)
</code></pre>
<p>Notice that word tokenization treats the punctuation marks '!' and '.' as individual tokens, since punctuation can convey meaning. Sentence tokenization correctly identifies the boundary between the two sentences even though the first one ends with an exclamation mark, which shows that it relies on more than just splitting on periods.</p>
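<p>Subword tokenization isn’t covered by the NLTK example above, so here is a minimal sketch of the idea using greedy longest-match splitting, similar in spirit to WordPiece. The function name, the "##" continuation marker, and the toy vocabulary are assumptions made for illustration. Real subword tokenizers learn their vocabulary from a large corpus.</p>

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}

print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("unbelievable", toy_vocab))
print(subword_tokenize("xyz", toy_vocab))
```

<p>Because rare words are broken into known pieces, a model can handle words it has never seen as a whole, as long as their parts are in the vocabulary.</p>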
<h3 id="heading-stop-word-removal">Stop Word Removal</h3>
<p>In this step, you strip the text down to the words that carry most of its meaning by removing commonly used words with little semantic value, known as “stop words”.</p>
<p>Common stop words include articles, prepositions, pronouns, auxiliary verbs and conjunctions. Here’s how to do it:</p>
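<pre><code class="lang-python"><span class="hljs-keyword">import</span> nltk
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords
<span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize

<span class="hljs-comment"># Download the stop word list and tokenizer data once</span>
nltk.download(<span class="hljs-string">'stopwords'</span>)
nltk.download(<span class="hljs-string">'punkt'</span>)
nltk.download(<span class="hljs-string">'punkt_tab'</span>)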

text = <span class="hljs-string">"The quick brown fox jumps over the lazy dog"</span>
tokens = word_tokenize(text.lower())

stop_words = set(stopwords.words(<span class="hljs-string">'english'</span>))
filtered_tokens = [word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> stop_words]

print(<span class="hljs-string">f"Original: <span class="hljs-subst">{tokens}</span>"</span>)
print(<span class="hljs-string">f"Filtered: <span class="hljs-subst">{filtered_tokens}</span>"</span>)
<span class="hljs-comment"># Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']</span>
</code></pre>
<h3 id="heading-stemming-and-lemmatization">Stemming and Lemmatization</h3>
<p>In this next step, you’ll process the text further by reducing words to their root form. This will treat words with similar variations as a single token.</p>
<ul>
<li><p>To be precise, <strong>stemming is simply using heuristic rules</strong> that remove the endings of words. For example (running|runs|run) → run.</p>
</li>
<li><p><strong>Lemmatization uses morphological analysis</strong> and vocabulary. For example, (children|mice) → child|mouse.</p>
</li>
</ul>
<p>Lemmatization typically gives better, more accurate outcomes (but calls for more computation at the same time).</p>
<p>Here’s how you can apply both:</p>
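<pre><code class="lang-python"><span class="hljs-keyword">import</span> nltk
<span class="hljs-keyword">from</span> nltk.stem <span class="hljs-keyword">import</span> PorterStemmer, WordNetLemmatizer

nltk.download(<span class="hljs-string">'wordnet'</span>)  <span class="hljs-comment"># WordNet data used by the lemmatizer</span>

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

<span class="hljs-comment"># Pair each word with its WordNet part of speech (v = verb, n = noun, a = adjective),</span>
<span class="hljs-comment"># because the lemmatizer only finds the right lemma when the tag matches the word</span>
words = [(<span class="hljs-string">"running"</span>, <span class="hljs-string">"v"</span>), (<span class="hljs-string">"runs"</span>, <span class="hljs-string">"v"</span>), (<span class="hljs-string">"ran"</span>, <span class="hljs-string">"v"</span>), (<span class="hljs-string">"children"</span>, <span class="hljs-string">"n"</span>), (<span class="hljs-string">"better"</span>, <span class="hljs-string">"a"</span>)]

print(<span class="hljs-string">"Stemming:"</span>)
<span class="hljs-keyword">for</span> word, _ <span class="hljs-keyword">in</span> words:
    print(<span class="hljs-string">f"<span class="hljs-subst">{word}</span> -&gt; <span class="hljs-subst">{stemmer.stem(word)}</span>"</span>)

print(<span class="hljs-string">"\nLemmatization:"</span>)
<span class="hljs-keyword">for</span> word, pos <span class="hljs-keyword">in</span> words:
    print(<span class="hljs-string">f"<span class="hljs-subst">{word}</span> -&gt; <span class="hljs-subst">{lemmatizer.lemmatize(word, pos=pos)}</span>"</span>)
</code></pre>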
<h3 id="heading-expanding-contractions">Expanding Contractions</h3>
<p>Text models work best with standardized input, and contractions like “don’t” and “you’re” add variation that doesn’t change the meaning. This is why you typically want to expand each contraction into its full form. Here’s how you can do that:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> contractions

text = <span class="hljs-string">"I can't believe it's already 2025. We'll see what happens."</span>
expanded = contractions.fix(text)
print(expanded)
<span class="hljs-comment"># Output: I cannot believe it is already 2025. We will see what happens.</span>
</code></pre>
<h3 id="heading-correcting-spelling-errors">Correcting Spelling Errors</h3>
<p>Spelling and grammar errors add noise to the data that you feed to your ML models. You can correct them using statistical language models that predict the most likely intended word, edit distance algorithms that find the closest valid word, or neural approaches that learn patterns of common errors.</p>
<p>For example, let’s see how TextBlob detects and corrects misspellings. Its correct() method follows a dictionary-based approach built on Peter Norvig’s spelling corrector, combining edit distance with word frequency.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textblob <span class="hljs-keyword">import</span> TextBlob

text = <span class="hljs-string">"Natral languag procesing is powrful"</span>
corrected = TextBlob(text).correct()
print(<span class="hljs-string">f"Original: <span class="hljs-subst">{text}</span>"</span>)
print(<span class="hljs-string">f"Corrected: <span class="hljs-subst">{corrected}</span>"</span>)
</code></pre>
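<p>TextBlob checks each word against its dictionary, generates candidate words within a small edit distance of any word it doesn’t recognize, and picks the candidate that occurs most frequently in its reference corpus. Because it scores each word on its own rather than in context, some corrections may be wrong, so it’s worth reviewing the output.</p>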
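<p>To make the edit distance idea mentioned above more concrete, here is a small dependency-free sketch of Levenshtein distance and a closest-word lookup. The function names and the four-word dictionary are hypothetical and exist only for illustration. This is not how TextBlob is implemented internally.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_word(word, dictionary):
    """Return the dictionary entry with the smallest edit distance to word."""
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))


# Hypothetical toy dictionary, for illustration only
toy_dictionary = ["natural", "language", "processing", "powerful"]

for typo in ["natral", "languag", "procesing", "powrful"]:
    print(f"{typo} -> {closest_word(typo, toy_dictionary)}")
```

<p>A production spell checker would use a much larger dictionary and also weigh how common each candidate word is, rather than relying on distance alone.</p>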
<h3 id="heading-parts-of-speech-tagging">Parts-of-Speech Tagging</h3>
<p>Part-of-speech (POS) tagging assigns a grammatical category to each word based on its role in a sentence. This is important because the same word can function as a different part of speech depending on the context. For example, "walk" can be a noun (like "an evening walk") or a verb (like "I walk to the office").</p>
<p>POS taggers rely on statistical models trained to predict the most likely grammatical role for a word given the context. The following code shows POS tagging using NLTK, which applies a pre-trained model that will tag grammatical structure.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> nltk

text = <span class="hljs-string">"The cat sat on the mat"</span>
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

<span class="hljs-keyword">for</span> word, tag <span class="hljs-keyword">in</span> pos_tags:
    print(<span class="hljs-string">f"<span class="hljs-subst">{word}</span>: <span class="hljs-subst">{tag}</span>"</span>)
<span class="hljs-comment"># Output:</span>
<span class="hljs-comment"># The: DT (Determiner)</span>
<span class="hljs-comment"># cat: NN (Noun)</span>
<span class="hljs-comment"># sat: VBD (Verb, past tense)</span>
</code></pre>
<p>The function pos_tag() assesses each token and assigns it a standardized notation. For example, DT indicates determiners (such as "the"), NN indicates singular nouns, VBD indicates past tense verbs, and IN indicates prepositions. The tagger can also use context to make these decisions: it can determine that "sat" is VBD and not NN because it appears after a noun and before a preposition, all of which are typical patterns of the English sentence.</p>
<h2 id="heading-establishing-and-labeling-datasets">Establishing and Labeling Datasets</h2>
<p>For supervised learning tasks such as sentiment analysis, NER, or classification, unlabeled data alone isn’t enough to train a model.</p>
<p>This is why, to create training datasets, you must annotate raw data with relevant labels. Models can learn patterns and make predictions thanks to this "ground truth." Let’s define the most common methods for labeling.</p>
<h3 id="heading-automated-labeling-based-on-libraries">Automated Labeling Based on Libraries</h3>
<p>You don’t always have to create labels from scratch. Libraries like TextBlob include built-in sentiment analyzers that can label text for you. They compute a polarity score (a number that represents sentiment), which you can then map to a categorical label.</p>
<p>For example, TextBlob looks at word choice, modifiers (words like "very" or "not"), and grammatical patterns to compute a polarity score from -1 (most negative) to +1 (most positive), with zero meaning neutral sentiment.</p>
<p>In this example, we’re automatically labeling sentiment with TextBlob’s built-in sentiment analyzer:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textblob <span class="hljs-keyword">import</span> TextBlob

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">label_sentiment</span>(<span class="hljs-params">text</span>):</span>
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity

    <span class="hljs-keyword">if</span> polarity &lt; <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"negative"</span>
    <span class="hljs-keyword">elif</span> polarity == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"neutral"</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"positive"</span>

<span class="hljs-comment"># Example</span>
texts = [
    <span class="hljs-string">"I love this product!"</span>,
    <span class="hljs-string">"It's okay, nothing special"</span>,
    <span class="hljs-string">"Terrible experience, very disappointed"</span>
]

<span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> texts:
    label = label_sentiment(text)
    print(<span class="hljs-string">f"<span class="hljs-subst">{text}</span> -&gt; <span class="hljs-subst">{label}</span>"</span>)
</code></pre>
<h3 id="heading-manual-labeling">Manual Labeling</h3>
<p>In many cases, automated library-based labeling isn’t an option. For domain-specific standards, you should annotate your data by hand to ensure accuracy and relevance.</p>
<p><strong>For projects involving manual labeling:</strong></p>
<ul>
<li><p><strong>Establish precise labeling standards.</strong> Provide an annotation guideline that defines each label with clear criteria and edge cases. For example, if you’re annotating customer support tickets, the guideline might state that "I need help resetting my password" is "Technical support," while "When will my order arrive?" is "Order inquiry."</p>
</li>
<li><p><strong>For quality control, use several annotators</strong>. Have 2-3 people label the same data samples independently. For example, when annotating medical symptoms, multiple annotators reduce the chance of bias from one person's labeling and help catch clerical errors.</p>
</li>
<li><p><strong>Determine the inter-annotator agreement.</strong> Calculate <a target="_blank" href="https://numiqo.com/tutorial/cohens-kappa">Cohen's kappa</a> or <a target="_blank" href="https://numiqo.com/tutorial/fleiss-kappa">Fleiss' kappa</a> scores to measure the consistency of agreement among annotators. A score above 0.80 signifies very good agreement, while a score below 0.60 suggests that the labeling guidelines were not clear enough for the annotators.</p>
</li>
<li><p><strong>Give instructions and illustrations</strong>. Create a reference document with 20-30 pre-labeled examples that cover both typical cases and edge cases. For example, in a sentiment analysis project, you can show that "This product is fine, I guess" counts as neutral even though it may seem slightly negative, and that "Not bad at all" counts as positive even though it contains negative words.</p>
</li>
</ul>
<p>This is the gold standard for high-quality datasets, but it’s also time-consuming and expensive – labeling 10,000 customer reviews might take a week if done manually.</p>
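<p>The inter-annotator agreement check from the list above can be sketched in a few lines. The snippet below computes Cohen's kappa for two annotators using only the standard library. The function name and the sample labels are made up for illustration, and in a real project you may prefer an established implementation from a statistics or ML library.</p>

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten reviews by two people
annotator_1 = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "neu", "neu", "pos", "neg", "neg", "pos", "neg"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.4f}")
```

<p>Here the annotators agree on 8 of 10 items, but kappa comes out lower than 0.8 because part of that agreement would be expected by chance alone.</p>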
<h3 id="heading-approaches-with-semi-supervision">Approaches with Semi-Supervision</h3>
<p>There are instances where you should combine the previous approaches. This semi-supervised method uses a small, manually labeled data set (high-quality data) and a large pool of unlabeled data (cheap, large amounts of data). </p>
<p>The method works through iterative self-training: you first train the model on your small labeled dataset, then use it to predict labels for the unlabeled data, add the most confident predictions to your training data, and retrain. You repeat this process, gradually improving and expanding your labeled dataset.</p>
<p>The code below demonstrates this semi-supervised workflow with scikit-learn’s SelfTrainingClassifier.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.semi_supervised <span class="hljs-keyword">import</span> SelfTrainingClassifier
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC

<span class="hljs-comment"># Small labeled dataset + large unlabeled dataset</span>
X_labeled = [[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>], [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>], [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>]]
y_labeled = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]
X_unlabeled = [[<span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>], [<span class="hljs-number">6</span>, <span class="hljs-number">7</span>]]

<span class="hljs-comment"># Combine datasets (-1 represents unlabeled)</span>
X_train = X_labeled + X_unlabeled
y_train = y_labeled + [<span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>]

<span class="hljs-comment"># Self-training classifier</span>
base_classifier = SVC(probability=<span class="hljs-literal">True</span>, gamma=<span class="hljs-string">'auto'</span>)
self_training = SelfTrainingClassifier(base_classifier)
self_training.fit(X_train, y_train)
</code></pre>
<p>The SelfTrainingClassifier first trains on the three labeled examples, then uses the model to predict labels for the unlabeled data. It keeps only the predictions it is confident about (by default, scikit-learn uses a probability threshold of 0.75, which you can raise to 0.9 or higher through the threshold parameter) and treats them as newly labeled data. The classifier then retrains itself, and the process continues.</p>
<p>So how do you decide which approach will fit your needs? In most cases, the optimal one will depend on the following aspects:</p>
<ul>
<li><p>Available budget and time.</p>
</li>
<li><p>Desired accuracy.</p>
</li>
<li><p>Size of the dataset.</p>
</li>
<li><p>Complexity of the domain.</p>
</li>
</ul>
<p>As you see, approaches vary and can be mixed as needed. The key thing is to make sure the data you feed into training is cleaned, standardized, and labeled. Remember the key principle: “garbage in, garbage out”. Send gold instead, and good luck!</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>At this point, you should know the basics of working with NLP projects.</p>
<p><strong>Throughout this article, you've learned:</strong></p>
<ul>
<li><p>How to set up your NLP development environment using a set of tools and libraries.</p>
</li>
<li><p>The five parts of NLP systems and how they are used to process language.</p>
</li>
<li><p>How to conduct common tasks like NER, sentiment analysis, and text classification.</p>
</li>
<li><p>How to choose the library to use to accommodate your project needs.</p>
</li>
<li><p>How to prepare and label datasets for training.</p>
</li>
<li><p>How to find the key practical applications of NLP tailored to your industry and use case.</p>
</li>
</ul>
<h3 id="heading-next-steps">Next steps</h3>
<ul>
<li><p>Try to start with a simple project, like sentiment analysis, that uses pre-trained models.</p>
</li>
<li><p>Practice preprocessing methods with your own text data.</p>
</li>
<li><p>Try different libraries to see which one gives the best output for your project.</p>
</li>
<li><p>Build a full pipeline from preparing text data to deploying models.</p>
</li>
<li><p>Continue to practice and explore advanced applications like transformer models and fine-tuning.</p>
</li>
</ul>
<p>Most importantly, keep in mind that NLP is an iterative process. Start small, test until it works, and then add complexity as you grow more comfortable with the techniques.</p>
<h3 id="heading-about-the-author">About the author</h3>
<p>Hope you enjoyed the article and found it helpful. I’ve been a contributor to freeCodeCamp for more than 8 years, and to make this piece more precise and detailed, I used some expert help.</p>
<p>I’m grateful for the technical ideas of my co-workers at <a target="_blank" href="https://coaxsoft.com/">COAX Software</a> who wished to stay anonymous. The company is a well-regarded <a target="_blank" href="https://coaxsoft.com/services/ai-development-services">AI/ML development company.</a></p>
<p>To find out more about me and read more content on tech and digital, you can <a target="_blank" href="https://www.linkedin.com/in/oleg-romanyuk/">visit my LinkedIn page</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Perform Sentence Similarity Check Using Sentence Transformers ]]>
                </title>
                <description>
                    <![CDATA[ Sentence similarity plays an important role in many natural language processing (NLP) applications.  Whether you build chatbots, recommendation systems, or search engines, understanding how close two sentences are in meaning can improve user experien... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-perform-sentence-similarity-check-using-sentence-transformers/</link>
                <guid isPermaLink="false">68b86d04b7b16a9a0d9ce2d2</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Wed, 03 Sep 2025 16:29:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756916978057/de0bda62-c9ea-48d1-b1ac-b78eb10e82d2.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Sentence similarity plays an important role in many natural language processing (NLP) applications. </p>
<p>Whether you build chatbots, recommendation systems, or search engines, understanding how close two sentences are in meaning can improve user experience – and this is what sentence similarity allows you to do.</p>
<p><a target="_blank" href="https://sbert.net/">Sentence Transformers</a> make this process simple and efficient. In this guide, you will learn what sentence similarity is, how Sentence Transformers work, and how to write code to measure similarity between two sets of sentences.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-sentence-similarity">What Is Sentence Similarity?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-use-sentence-transformers">Why Use Sentence Transformers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-loading-a-pre-trained-model">Loading a Pre-trained Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-defining-sentences-to-compare">Defining Sentences to Compare</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-converting-sentences-into-embeddings">Converting Sentences into Embeddings</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-calculating-cosine-similarity">Calculating Cosine Similarity</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-printing-the-results">Printing the Results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sample-output">Sample Output</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-interpret-the-scores">How to Interpret the Scores</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-applications-of-sentence-similarity">Real-World Applications of Sentence Similarity</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-semantic-search">Semantic Search</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-duplicate-detection">Duplicate Detection</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-recommendation-systems">Recommendation Systems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chatbots-and-virtual-assistants">Chatbots and Virtual Assistants</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-improving-performance-with-larger-models">Improving Performance with Larger Models</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-sentence-similarity">What Is Sentence Similarity?</h2>
<p>Sentence similarity is the process of comparing two sentences to see how close they are in meaning. It does not look at the exact words but focuses on the meaning behind them.</p>
<p>For example:</p>
<ul>
<li><p>“The cat is sitting outside”</p>
</li>
<li><p>“The dog is playing in the garden”</p>
</li>
</ul>
<p>Both sentences talk about animals outdoors, so they share some similarity even though they use different words.</p>
<p>This kind of understanding is essential for tasks like document clustering, duplicate detection, or semantic search.</p>
<h2 id="heading-why-use-sentence-transformers">Why Use Sentence Transformers</h2>
<p>Traditional methods like <a target="_blank" href="https://www.freecodecamp.org/news/how-bag-of-words-works/">Bag of Words</a> rely on simple word matching or frequency counts. These methods fail when the words differ but the meaning stays the same.</p>
<p>Sentence Transformers solve this by using transformer-based language models like <a target="_blank" href="https://en.wikipedia.org/wiki/BERT_%28language_model%29">BERT</a> or RoBERTa to create embeddings.</p>
<p>An <a target="_blank" href="https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/">embedding</a> is a list of numbers that represents the meaning of a sentence. When two embeddings are close together in this high-dimensional space, their sentences are similar in meaning.</p>
<p>The Sentence Transformers library in Python makes this easy by providing pre-trained models that can generate embeddings for sentences.</p>
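<p>To make the Bag of Words limitation concrete, here is a minimal sketch in plain Python. It is not how any particular library implements Bag of Words. The helper names <code>bag_of_words</code> and <code>shared_words</code> are hypothetical, and the sentence about cat food is an invented example.</p>

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def shared_words(a, b):
    """Return the sorted words that appear in both sentences."""
    return sorted(set(bag_of_words(a)).intersection(bag_of_words(b)))


# Two sentences with similar meaning but almost no words in common.
paraphrase = shared_words("The movies are awesome", "The new movie is so great")

# Two sentences with different meanings that still share a content word.
unrelated = shared_words("The cat sits outside", "The cat food is on sale")

print(paraphrase)  # only the word "the" matches
print(unrelated)   # "cat" and "the" match
```

<p>Word overlap rates the second pair as closer than the first, which is the opposite of what a reader would say. Embeddings are meant to close that gap.</p>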
<h3 id="heading-installing-the-required-libraries">Installing the Required Libraries</h3>
<p>Before you start coding, make sure you install the required packages. Run this command to do so:</p>
<pre><code class="lang-plaintext">pip install -U sentence-transformers
</code></pre>
<p>This will install the Sentence Transformers library along with its dependencies.</p>
<h2 id="heading-loading-a-pre-trained-model">Loading a Pre-trained Model</h2>
<p>Sentence Transformers offers several pre-trained models. For this example, you will use the <strong>all-MiniLM-L6-v2</strong> model. It’s lightweight, fast, and works well for most applications.</p>
<p>Here is how to load it in Python:</p>
<pre><code class="lang-plaintext">from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("all-MiniLM-L6-v2")
</code></pre>
<p>Once loaded, this model can convert any sentence into its corresponding embedding.</p>
<h2 id="heading-defining-sentences-to-compare">Defining Sentences to Compare</h2>
<p>You need two lists of sentences for comparison. Here is an example:</p>
<pre><code class="lang-plaintext">sentences1 = [
    'The cat sits outside',
    'A man is playing guitar',
    'The movies are awesome'
]

sentences2 = [
    'The dog plays in the garden',
    'A woman watches TV',
    'The new movie is so great'
]
</code></pre>
<p>Each sentence in <code>sentences1</code> will be compared with the sentence at the same position in <code>sentences2</code>.</p>
<h2 id="heading-converting-sentences-into-embeddings">Converting Sentences into Embeddings</h2>
<p>Now that you have sentences, you must convert them into embeddings using the model.</p>
<p>Add this code:</p>
<pre><code class="lang-plaintext"># Convert sentences to embeddings
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
</code></pre>
<p>The <code>convert_to_tensor=True</code> argument tells the model to return <a target="_blank" href="https://docs.pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html">PyTorch tensors</a>, which work well with similarity calculations.</p>
<h2 id="heading-calculating-cosine-similarity">Calculating Cosine Similarity</h2>
<p>Once you have embeddings, you need a way to measure similarity. The <a target="_blank" href="https://www.youtube.com/watch?v=zcUGLp5vwaQ">cosine similarity</a> metric is commonly used for this.</p>
<p>Cosine similarity looks at the angle between two vectors in a high-dimensional space. If the angle is small, the vectors are similar.</p>
<p>Add this code to compute similarity:</p>
<pre><code class="lang-plaintext">from sentence_transformers import util
# Compute cosine similarity
cosine_scores = util.cos_sim(embeddings1, embeddings2)
</code></pre>
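<p>Now <code>cosine_scores</code> is a matrix that holds the similarity score for every combination of a sentence from <code>sentences1</code> and a sentence from <code>sentences2</code>. The entry at <code>[i][i]</code> is the score for the two sentences at position <code>i</code> in both lists.</p>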
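<p>To see what cosine similarity computes, and why the printing code looks up <code>[i][i]</code>, here is a minimal sketch in plain Python. The two-dimensional vectors are hypothetical stand-ins for real embeddings, and the <code>cosine</code> helper is written out by hand for illustration rather than taken from the library.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional "embeddings" standing in for real model output.
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[1.0, 1.0], [0.0, 2.0]]

# Row i, column j holds the similarity of first[i] and second[j].
scores = [[cosine(u, v) for v in second] for u in first]

# The diagonal holds the scores for the pairs at the same position.
paired = [scores[i][i] for i in range(len(first))]

for row in scores:
    print([round(s, 4) for s in row])
print("paired:", [round(s, 4) for s in paired])
```

<p>The second pair scores 1.0 because the vectors point in the same direction even though their lengths differ. Cosine similarity ignores length and only measures the angle.</p>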
<h2 id="heading-printing-the-results">Printing the Results</h2>
<p>To see the results clearly, format them like this:</p>
<pre><code class="lang-plaintext"># Print formatted results
for i in range(len(sentences1)):
    print(f"{sentences1[i]} \t\t {sentences2[i]} \t\t Score: {cosine_scores[i][i]:.4f}")
</code></pre>
<p>This will print each sentence pair along with its similarity score.</p>
<h2 id="heading-sample-output">Sample Output</h2>
<p>If you run this code, you will see a result similar to the one below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756385160047/576750a6-3c65-45e7-a634-f1e7375e7e16.png" alt="Sample output listing each sentence pair with its cosine similarity score" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The third pair has the highest score because both sentences talk about movies in a positive way.</p>
<h2 id="heading-how-to-interpret-the-scores">How to Interpret the Scores</h2>
<p>The cosine similarity score ranges between <strong>-1</strong> and <strong>1</strong>.</p>
<ul>
<li><p>A score close to <strong>1</strong> means the sentences are very similar.</p>
</li>
<li><p>A score near <strong>0</strong> means they are unrelated.</p>
</li>
<li><p>A negative score means the two embeddings point in opposing directions, which you can read as unrelated or even opposite in meaning.</p>
</li>
</ul>
<p>In most real-world cases, you focus on values between 0 and 1. The higher the value, the closer the meanings.</p>
<h2 id="heading-real-world-applications-of-sentence-similarity">Real-World Applications of Sentence Similarity</h2>
<p>Sentence similarity has become a core part of many modern applications because it helps systems understand meaning rather than relying on exact words. This shift makes search, analysis, and recommendations far more accurate and useful.</p>
<h3 id="heading-semantic-search"><strong>Semantic Search</strong></h3>
<p>Traditional search engines depend on keyword matches. If the exact words are missing, results often become irrelevant. <a target="_blank" href="https://en.wikipedia.org/wiki/Semantic_search">Semantic search</a> solves this problem by looking at the meaning behind a query. </p>
<p>For example, if someone searches for “best ways to learn guitar,” the system can return results for “top tips to play the guitar” even though the keywords differ. This makes search experiences smoother and more intelligent.</p>
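<p>The ranking step behind semantic search can be sketched as follows. This is a toy illustration, not a production search engine: the three-dimensional vectors are hypothetical values chosen by hand, and in a real system they would come from <code>model.encode</code>. The titles other than the guitar example above are invented.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical 3-dimensional embeddings; a real model would produce these.
documents = {
    "top tips to play the guitar": [0.9, 0.1, 0.0],
    "how to bake sourdough bread": [0.0, 0.2, 0.9],
    "guitar chords for beginners": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for "best ways to learn guitar"

# Rank every document by its similarity to the query, best match first.
ranked = sorted(
    documents,
    key=lambda title: cosine(query_embedding, documents[title]),
    reverse=True,
)

for title in ranked:
    print(f"{cosine(query_embedding, documents[title]):.4f}  {title}")
```

<p>Both guitar documents rank above the baking one even though the query shares few exact words with them. Only the embeddings are compared.</p>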
<h3 id="heading-duplicate-detection"><strong>Duplicate Detection</strong></h3>
<p>Large datasets often contain repeated or near-duplicate content. Manual checking is impractical when dealing with millions of records.</p>
<p>Sentence similarity automates this by detecting texts that carry the same meaning even if the wording changes slightly. This is especially useful in data cleaning, web scraping pipelines, or managing user-generated content.</p>
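<p>One simple way to flag near-duplicates is to compare every pair of records and keep the pairs whose score passes a threshold. The sketch below assumes hand-picked toy vectors and a cut-off of 0.95. Both are illustrative assumptions, and the right threshold depends on your model and data.</p>

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical embeddings keyed by record text.
records = {
    "Reset your password from the login page": [0.9, 0.1, 0.1],
    "You can reset the password on the sign-in screen": [0.85, 0.15, 0.1],
    "Our office is closed on public holidays": [0.1, 0.9, 0.2],
}

THRESHOLD = 0.95  # assumed cut-off; tune it on labeled examples from your data

# Compare every unordered pair once and keep the ones above the threshold.
duplicates = [
    (a, b)
    for a, b in combinations(records, 2)
    if cosine(records[a], records[b]) >= THRESHOLD
]

for a, b in duplicates:
    print(f"Possible duplicate:\n  {a}\n  {b}")
```

<p>Comparing all pairs grows quadratically with the number of records, so this brute-force loop only suits small datasets.</p>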
<h3 id="heading-recommendation-systems"><strong>Recommendation Systems</strong></h3>
<p>Recommendation engines work best when they understand context. For instance, if a user likes articles about “healthy cooking,” the system can recommend content on “nutritious recipes” or “quick healthy meals” using similarity scores. This approach goes beyond surface-level keywords and finds deeper connections in the text.</p>
<h3 id="heading-chatbots-and-virtual-assistants"><strong>Chatbots and Virtual Assistants</strong></h3>
<p>Chatbots store a large set of possible user questions and answers. When someone types a new question, the system must find the most relevant response. By using sentence similarity, chatbots match user input with the closest existing query in meaning, not just words, leading to more accurate and natural conversations.</p>
<h3 id="heading-improving-performance-with-larger-models">Improving Performance with Larger Models</h3>
<p>The <a target="_blank" href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">all-MiniLM-L6-v2</a> model is fast and accurate for small to medium tasks.</p>
<p>For more accuracy, you can try larger models like <a target="_blank" href="https://huggingface.co/sentence-transformers/all-mpnet-base-v2">all-mpnet-base-v2</a>, though they may require more memory and time to run.</p>
<p>Replace the model name in your code to use a different pre-trained model:</p>
<pre><code class="lang-plaintext">model = SentenceTransformer("all-mpnet-base-v2")
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Sentence Transformers make it easy to measure sentence similarity using pre-trained models. By converting sentences into embeddings and comparing them with cosine similarity, you can build systems that understand meaning rather than relying on simple word matching.</p>
<p>With just a few lines of code, you can integrate this into chatbots, search engines, or recommendation systems and create more intelligent applications.</p>
<p><em>Hope you enjoyed this article. Sign up for my free newsletter</em> <a target="_blank" href="https://www.turingtalks.ai/"><strong><em>TuringTalks.ai</em></strong></a> <em>for more hands-on tutorials on AI. You can also</em> <a target="_blank" href="https://manishshivanandhan.com/"><em>visit my website</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Tokenizers Explained – How Tokenizers Help AI Understand Language ]]>
                </title>
                <description>
                    <![CDATA[ Tokenizers are the fundamental tools that enable artificial intelligence to dissect and interpret human language. Let’s look at how tokenizers help AI systems comprehend and process language. In the fast-evolving world of natural language processing ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-tokenizers-shape-ai-understanding/</link>
                <guid isPermaLink="false">66d035f6c1024fe75b758f20</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ LLM's ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Wed, 27 Mar 2024 11:35:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/03/ab427b80-a502-11ea-8467-694f4e40dfa7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Tokenizers are the fundamental tools that enable artificial intelligence to dissect and interpret human language. Let’s look at how tokenizers help AI systems comprehend and process language.</p>
<p>In the fast-evolving world of natural language processing (NLP), tokenizers play a pivotal role.</p>
<p>Tokenizers are the unsung heroes behind the scenes, making sense of human language for machines to understand.</p>
<p>Let’s dive into what tokenizers are and explore their use cases. We'll also introduce you to Hugging Face, a leading platform in AI and NLP.</p>
<p>Then we'll walk through a simple code example using a tokenizer from the Hugging Face Transformers library.</p>
<h2 id="heading-what-are-tokenizers">What are Tokenizers?</h2>
<p>Imagine that you’re trying to teach a robot to understand and speak human languages. The first challenge you’d face is how to break down language into pieces the robot can digest. That’s where tokenizers come in.</p>
<p>Tokenizers dissect complex language into manageable pieces, transforming raw text into a structured form that AI models can easily process. This seemingly simple step is crucial, enabling machines to grasp the nuances of human communication.</p>
<p>Think of tokenizers as the chefs who chop ingredients before a meal is cooked. Without this step, preparing complex dishes (or understanding complex sentences) would be much harder.</p>
<p>Through tokenization, AI systems can recognize patterns, understand context, and generate responses that are increasingly similar to human interaction.</p>
<p>By breaking down the complexities of language into digestible bits, tokenizers not only enhance AI’s linguistic capabilities but also pave the way for more intuitive, efficient, and accurate machine learning models.</p>
<h2 id="heading-what-are-huggingface-tokenizers">What are Hugging Face Tokenizers?</h2>
<p><a target="_blank" href="https://huggingface.co/">Hugging Face</a> is a company at the forefront of AI and NLP.</p>
<p>They are best known for their Transformers library, which has made it easy to access state-of-the-art NLP models.</p>
<p>At the heart of their innovations is the tokenizers library, a powerful tool designed to convert text into a format that AI models can understand. This library is essential for developers and researchers working on AI projects.</p>
<p>Hugging Face’s tokenizers are not only efficient and fast but also support a wide range of languages, making them versatile tools for global NLP tasks. They are optimized for performance, ensuring that they can handle large volumes of text without compromising speed or accuracy.</p>
<p>What sets Hugging Face’s tokenizers apart is their integration with the Transformers library, another cornerstone of Hugging Face’s AI ecosystem.</p>
<p>This integration allows for seamless processing of text data, readying it for complex tasks like translation, summarization, and sentiment analysis.</p>
<p>The tokenizers library is continually updated, incorporating the latest research findings and community feedback to enhance its capabilities.</p>
<h2 id="heading-simple-code-example-of-huggingface-tokenizer-library">Simple Code Example of Huggingface Tokenizer Library</h2>
<p>Let’s get our hands dirty with some code. We’ll use a tokenizer from the Hugging Face Transformers library to tokenize a simple sentence.</p>
<p>First, let's install the Hugging Face Transformers library. (Add <code>!</code> before the command if you are running it in a <a target="_blank" href="https://colab.research.google.com/">Google Colab notebook</a>.)</p>
<pre><code>pip install transformers
</code></pre><p>With the library installed, let's import the <code>AutoTokenizer</code> class from the Transformers library. <code>AutoTokenizer</code> is a factory class that can automatically load the tokenizer corresponding to a pre-trained model we specify (in this case, the <a target="_blank" href="https://huggingface.co/google-bert/bert-base-uncased">bert-base-uncased</a> model).</p>
<pre><code><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer
</code></pre><p>Next, we load a tokenizer by calling the <code>from_pretrained</code> method on <code>AutoTokenizer</code>. It returns the tokenizer that matches the BERT model we named, configured to not differentiate between uppercase and lowercase letters (hence 'uncased').</p>
<pre><code>tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"bert-base-uncased"</span>)
</code></pre><p>Now let’s declare a string for tokenizing.</p>
<pre><code>text = <span class="hljs-string">"Hello, and welcome to the world of Tokenizers"</span>
</code></pre><p>Let’s use the <code>tokenize</code> method of the tokenizer with the sample text as its argument.</p>
<pre><code>tokens = tokenizer.tokenize(text)
</code></pre><p>The <code>tokenize</code> method splits the input text into a list of tokens (words or sub-words) drawn from the vocabulary that the pre-trained model was trained with. For models like BERT, words might be split into smaller units (sub-words or characters) to handle out-of-vocabulary words more effectively.</p>
<p>We'll also convert the list of tokens into a list of integers (token IDs). Each integer corresponds to a specific token in the tokenizer’s vocabulary.</p>
<p>This conversion is necessary because machine learning models do not understand text directly; they work with numerical data.</p>
<pre><code>token_ids = tokenizer.convert_tokens_to_ids(tokens)
</code></pre><p>We are done. Let’s print both the tokens and their corresponding IDs.</p>
<pre><code>print(<span class="hljs-string">"Tokens:"</span>, tokens)
print(<span class="hljs-string">"Token IDs:"</span>, token_ids)
</code></pre><p>So this piece of code loads a pre-trained tokenizer for the BERT model, tokenizes a sample sentence and converts those tokens into their corresponding IDs. These IDs are what machine learning models process.</p>
<p>Here is the output:</p>
<pre><code>Tokens: [<span class="hljs-string">'hello'</span>, <span class="hljs-string">','</span>, <span class="hljs-string">'and'</span>, <span class="hljs-string">'welcome'</span>, <span class="hljs-string">'to'</span>, <span class="hljs-string">'the'</span>, <span class="hljs-string">'world'</span>, <span class="hljs-string">'of'</span>, <span class="hljs-string">'token'</span>, <span class="hljs-string">'##izer'</span>, <span class="hljs-string">'##s'</span>]
Token IDs: [<span class="hljs-number">7592</span>, <span class="hljs-number">1010</span>, <span class="hljs-number">1998</span>, <span class="hljs-number">6160</span>, <span class="hljs-number">2000</span>, <span class="hljs-number">1996</span>, <span class="hljs-number">2088</span>, <span class="hljs-number">1997</span>, <span class="hljs-number">19204</span>, <span class="hljs-number">17629</span>, <span class="hljs-number">2015</span>]
</code></pre><p>These tokens and token IDs are crucial for training machine learning models. They convert text into a numerical format that models can process, enabling the understanding of language nuances.</p>
<p>Tokens like <code>##izer</code> and <code>##s</code> are examples of how the tokenizer deals with words or parts of words that might not be in its basic vocabulary.</p>
<p>The <code>##</code> prefix indicates that these are sub-word units or suffixes attached to the preceding token without a space. This allows the model to handle a wide range of vocabulary, including new or uncommon words, by breaking them down into known subcomponents.</p>
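<p>The splitting behaviour described above can be sketched with a greedy, longest-match-first lookup. This is a simplified illustration and not the Hugging Face implementation: the real tokenizer also handles punctuation, special tokens, and other details. The functions <code>wordpiece</code> and <code>tokenize</code> and the tiny vocabulary are hypothetical. The token IDs are copied from the sample output above, with an <code>[UNK]</code> entry added for words that cannot be split.</p>

```python
# Toy vocabulary; the IDs match the sample output above, plus an [UNK] entry.
VOCAB = {
    "[UNK]": 100,
    "world": 2088,
    "of": 1997,
    "token": 19204,
    "##izer": 17629,
    "##s": 2015,
}


def wordpiece(word, vocab):
    """Split one word by greedy longest-match-first lookup in vocab."""
    pieces, start = [], 0
    while start != len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply wordpiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


tokens = tokenize("World of Tokenizers", VOCAB)
token_ids = [VOCAB[t] for t in tokens]
print(tokens)
print(token_ids)
```

<p>Because "tokenizers" is not in the vocabulary as a whole word, the loop falls back to the longest prefix it knows ("token") and then covers the rest with continuation pieces.</p>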
<h2 id="heading-conclusion">Conclusion</h2>
<p>Tokenizers are foundational to NLP, and the Hugging Face Transformers library provides an extensive toolkit for working with them.</p>
<p>By understanding and utilizing tokenizers, we can bridge the gap between human language and machine understanding, unlocking a wide range of applications in AI.</p>
<p>Whether you’re a seasoned developer or new to NLP, diving into tokenization methods is a great way to enhance your machine-learning skills.</p>
<p>Hope you enjoyed this article. If you have any questions, let me know in the comments. <a target="_blank" href="https://www.turingtalks.ai/">Visit turingtalks.ai</a> for weekly byte-sized AI tutorials.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use the Hugging Face Transformer Library ]]>
                </title>
                <description>
                    <![CDATA[ In this article, I'll talk about why I think the Hugging Face’s Transformer Library is a game-changer in NLP for developers and researchers alike. Have you ever wondered how modern AI achieves such remarkable feats, like understanding human language ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/hugging-face-transformer-library-overview/</link>
                <guid isPermaLink="false">66d035f915ea3036a953992e</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Wed, 31 Jan 2024 00:36:42 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/hugging-face.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, I'll talk about why I think the Hugging Face’s Transformer Library is a game-changer in NLP for developers and researchers alike.</p>
<p>Have you ever wondered how modern AI achieves such remarkable feats, like understanding human language or generating text that sounds like it was written by a person?</p>
<p>A significant part of this magic stems from a groundbreaking model called <a target="_blank" href="https://blogs.nvidia.com/blog/what-is-a-transformer-model/">the Transformer</a>. Many frameworks released into the Natural Language Processing (NLP) space are based on the Transformer model, and an important one is the <a target="_blank" href="https://huggingface.co/docs/transformers/index">Hugging Face Transformer Library</a>.</p>
<p>In this article, I’ll walk you through why this library is not just another piece of software, but a powerful tool for engineers and researchers alike. Then you'll see a practical example of how to use it.</p>
<h2 id="heading-what-is-the-hugging-face-transformer-library">What is the Hugging Face Transformer Library?</h2>
<p>The Hugging Face Transformer Library is an open-source library that provides a vast array of pre-trained models primarily focused on NLP. It works with deep learning frameworks such as PyTorch and TensorFlow, making it incredibly versatile and powerful.</p>
<p>One of the first reasons the Hugging Face library stands out is its remarkable user-friendliness. Even if you’re not a deep learning expert, you can use this library with relative ease.</p>
<p>It offers straightforward interfaces that allow you to implement complex models with just a few lines of code. This simplicity opens the doors of advanced AI to a broader range of developers and researchers.</p>
<h2 id="heading-pre-trained-and-ready-to-go">Pre-Trained and Ready to Go</h2>
<p>The beauty of today’s deep learning models is that you don't have to train a model from scratch. Most models are pre-trained, and your job as an AI engineer will be to fine-tune one of them on custom data.</p>
<p>So imagine having access to a toolbox where each tool is tailored for a specific job. That’s what Hugging Face offers with its wide range of pre-trained models.</p>
<p>Whether you’re working on text classification, question answering, or language generation, there’s a model ready for you to use. This saves an enormous amount of time and resources as you don’t have to start from scratch.</p>
<p>While pre-trained models are fantastic, they might not fit every specific need. This is where Hugging Face truly shines. The library allows you to fine-tune models on your dataset, making it possible to customize the models to your specific requirements.</p>
<h2 id="heading-community-support">Community Support</h2>
<p>What sets Hugging Face apart is not just its technical capabilities but also its vibrant community. By engaging with this community, you gain access to a wealth of knowledge and support.</p>
<p>Users continuously contribute to the library, adding new models and features, making it a living, evolving ecosystem. This collaborative spirit ensures that the library stays at the cutting edge of AI research and application.</p>
<h2 id="heading-performance-and-scalability">Performance and Scalability</h2>
<p>In the world of AI, performance is key, and the Hugging Face library doesn’t disappoint. It’s designed to handle large-scale models efficiently, which means you can work with some of the most advanced AI models without needing a supercomputer at your disposal.</p>
<p>Hugging Face is also not just about English. It supports multiple languages, which is essential for organizations and developers aiming to create AI applications for a diverse user base.</p>
<h2 id="heading-popular-hugging-face-models">Popular Hugging Face Models</h2>
<ol>
<li><a target="_blank" href="https://huggingface.co/docs/transformers/model_doc/bert"><strong>BERT (Bidirectional Encoder Representations from Transformers)</strong></a><strong>:</strong> BERT excels in understanding the context of a word in a sentence, making it effective for tasks like sentiment analysis, question-answering, and language understanding. It’s widely used in chatbots, search engines, and to enhance user interaction with AI systems.</li>
<li><a target="_blank" href="https://huggingface.co/gpt2"><strong>GPT (Generative Pretrained Transformer)</strong></a><strong>:</strong> Known for its ability to generate human-like text, GPT is used for creative writing, generating conversational responses, and even writing code. It’s particularly popular in chatbots, automated content creation tools, and customer service applications.</li>
<li><a target="_blank" href="https://huggingface.co/docs/transformers/model_doc/distilbert"><strong>DistilBERT</strong></a>: A streamlined version of BERT, DistilBERT offers similar capabilities but is faster and requires less computational power. It’s ideal for environments where resources are limited, like mobile applications, and is used in tasks like text classification and information extraction.</li>
<li><a target="_blank" href="https://huggingface.co/docs/transformers/model_doc/roberta"><strong>RoBERTa (Robustly Optimized BERT Approach)</strong></a>: An optimized version of BERT, RoBERTa is trained on a larger dataset and for a longer time, leading to improved performance. It’s used in more complex NLP tasks like sentiment analysis, language inference, and text classification.</li>
<li><a target="_blank" href="https://huggingface.co/docs/transformers/model_doc/t5"><strong>T5 (Text-To-Text Transfer Transformer)</strong></a>: T5 converts all NLP problems into a text-to-text format, providing a versatile approach to tasks like translation, summarization, and question answering. Its adaptability makes it valuable in diverse applications, from automated translation services to information summarization tools.</li>
</ol>
<p>Each of these models has its unique strengths, and you should choose them based on the specific requirements of your tasks. Make sure to balance factors like computational resources, complexity of the task, and the desired level of performance.</p>
<h2 id="heading-how-to-use-the-hugging-face-transformers-library">How to Use the Hugging Face Transformers Library</h2>
<p>Let me show you how easy it is to work with the Hugging Face Transformers library. We will implement a simple summarization script that takes in a large text and returns a short summary.</p>
<p>We will first import <code>pipeline</code> from the transformers library. In Hugging Face, a “pipeline” is like a tool that helps you perform a series of steps to change data into the form you want. </p>
<pre><code><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
</code></pre><p>A pipeline makes it simple to use pre-trained models for different jobs, without needing to know all the complex details about how they work on the inside. For this example, we will use the "summarization" pipeline. Because no model name is given, the library downloads a default summarization model the first time this line runs.</p>
<pre><code>summarizer = pipeline(<span class="hljs-string">"summarization"</span>)
</code></pre><p>And we are now ready to start using the summarization pipeline. Let's pass in a long chunk of text and see what the response is. </p>
<pre><code>text = <span class="hljs-string">""</span><span class="hljs-string">"
The development of the internet has been one of the most transformative events in human history, altering virtually every aspect of modern life. Initially conceived as a military and academic network in the late 1960s, the internet evolved rapidly through the 1970s and 1980s, expanding its reach and capabilities with each passing year. The introduction of the World Wide Web in the early 1990s was a critical moment, making the internet much more accessible and user-friendly, sparking a global revolution in communication, business, and entertainment. As a tool for information dissemination, the internet has been unparalleled, allowing instant access to vast amounts of data from all over the world. It has democratized information, breaking down barriers that once existed due to geography or social status. The internet has also had a profound impact on commerce, giving rise to e-commerce and transforming traditional business models. The ease of online shopping and the rise of digital marketplaces have reshaped consumer habits and expectations. Socially and culturally, the internet has connected people across the globe, facilitating the exchange of ideas and cultures in a way that was previously unimaginable. However, it has also raised concerns about privacy, data security, and the digital divide. The rapid dissemination of information has sometimes led to the spread of misinformation, posing challenges for societies in discerning truth from falsehood. As the internet continues to evolve, it poses new challenges and opportunities, shaping the future of human interaction, governance, and technology.
"</span><span class="hljs-string">""</span>

summary = summarizer(text)
print(summary[<span class="hljs-number">0</span>][<span class="hljs-string">'summary_text'</span>])
</code></pre><p>Here is a sample response:</p>
<pre><code> The introduction of the internet in the 1970s and 1980s was a major event for the world's first time . As a result, the internet has been able to connect people across the globe . The internet has also raised concerns about privacy and security in the digital age of 21.
</code></pre><p>That's how easy it is to work with the Hugging Face Transformers library. </p>
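<p>Conceptually, a pipeline chains a preprocessing step, a model call, and a postprocessing step behind a single function. The toy sketch below shows only that idea and is not how the Transformers library is implemented. Every function in it is a hypothetical stand-in, and the "model" keeps the first sentence instead of running a neural network. The return value mimics the <code>summary[0]['summary_text']</code> shape used above.</p>

```python
def make_pipeline(preprocess, run_model, postprocess):
    """Chain three steps into one callable, mirroring the idea of a pipeline."""
    def run(text):
        return postprocess(run_model(preprocess(text)))
    return run


# Hypothetical stand-ins: a real pipeline would tokenize the text, call a
# neural network, and decode its output instead of these simple functions.
def preprocess(text):
    """Split the text into non-empty sentences."""
    return [s.strip() for s in text.split(".") if s.strip()]


def run_model(sentences):
    """Toy 'model': keep only the first sentence."""
    return sentences[:1]


def postprocess(sentences):
    """Wrap the result in the list-of-dicts shape used in the example above."""
    return [{"summary_text": ". ".join(sentences) + "."}]


toy_summarizer = make_pipeline(preprocess, run_model, postprocess)
summary = toy_summarizer(
    "The internet changed commerce. It also raised privacy concerns."
)
print(summary[0]["summary_text"])
```

<p>The caller only sees one function that takes text and returns a summary, which is the convenience that <code>pipeline</code> provides.</p>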
<h2 id="heading-ethical-ai-and-transparency-a-step-towards-responsible-ai">Ethical AI and Transparency: A Step Towards Responsible AI</h2>
<p>With AI ethics increasingly under the spotlight, Hugging Face has committed to transparency and responsible AI development. The open-source nature of the library promotes a level of transparency that’s essential for ethical AI development. Users can see exactly how models are built and make informed decisions about their use.</p>
<p>AI is a field that never stands still, and neither does the Hugging Face Transformer Library. It’s continuously updated with the latest breakthroughs in AI research. This means that when you use Hugging Face, you’re always at the forefront of AI technology.</p>
<p>Finally, the real test of any tool is its applications in the real world, and here, Hugging Face excels. It’s used by academics for cutting-edge research and by companies for practical applications like sentiment analysis, content generation, and language translation.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In summary, the Hugging Face Transformer Library is more than just a collection of AI models. It’s a gateway to advanced AI for people of all skill levels. Its ease of use and the availability of a comprehensive range of models make it a standout library in the world of AI.</p>
<p>Whether you’re a seasoned AI expert or just starting, the Hugging Face library is a useful resource that can help you achieve your AI goals.</p>
<p>Hope you enjoyed this article. Find more beginner-friendly tutorials on AI at <strong><a target="_blank" href="https://www.turingtalks.ai/">turingtalks.ai</a>.</strong> </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Simple Sentiment Analyzer Using Hugging Face Transformer ]]>
                </title>
                <description>
                    <![CDATA[ In this article, we will look at writing a sentiment analyzer using Hugging Face Transformer, a powerful tool in the world of NLP.  Imagine you’re running a business and you want to know what your customers think about your product. Or maybe you’re a... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-simple-sentiment-analyzer-using-hugging-face-transformer/</link>
                <guid isPermaLink="false">66d035d812c679876b0602de</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Sentiment analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Fri, 26 Jan 2024 00:32:04 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/pngtree-facial-emotions-illustration-in-black-outline-on-white-background-vector-picture-image_10574137.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, we will look at writing a sentiment analyzer using Hugging Face Transformer, a powerful tool in the world of NLP. </p>
<p>Imagine you’re running a business and you want to know what your customers think about your product. Or maybe you’re a movie director wanting to gauge the public reaction to your latest release.</p>
<p>This is where sentiment analysis comes into play.</p>
<blockquote>
<p>Sentiment analysis is a technique used in text analysis that helps in identifying and categorizing opinions expressed in a piece of text.</p>
</blockquote>
<p>Sentiment analysis determines whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.</p>
<p>In a world where data is king, sentiment analysis is a crown jewel. It’s like having a superpower to understand the emotional tone behind words at scale.</p>
<p>Companies use it to understand customer feedback on products and services. Governments and organizations use it to get a sense of public opinion.</p>
<p>In social media management, sentiment analysis is used for brand monitoring, customer service, and market research.</p>
<p>It’s not just about understanding how many people are talking about your brand or product, but how they feel about it.</p>
<h2 id="heading-what-is-hugging-face">What is Hugging Face?</h2>
<p>Now, let’s talk about Hugging Face. No, it’s not what you think. You don’t go around hugging faces.</p>
<p>In the world of AI, <a target="_blank" href="https://huggingface.co/">Hugging Face</a> is quite the star. It’s an AI community and platform that provides state-of-the-art tools and models for Natural Language Processing (NLP).</p>
<p>Think of it as a toolbox that gives you the power to understand and generate human language. It’s like having a linguistic wizard by your side.</p>
<p>Hugging Face’s most popular offering is the ‘Transformers’ library. The Transformers library comes packed with APIs and tools that let you easily grab and train top-notch pre-trained models.</p>
<p>When you pick these pre-trained models, you’re cutting down on compute costs and carbon footprint. Plus, you save loads of time and resources that you’d otherwise spend training a model from scratch.</p>
<p>These models solve common tasks across various domains, like:</p>
<ul>
<li><strong>Natural Language Processing (NLP)</strong>: Here, you can do a bunch of cool stuff like text classification, spotting names or entities in text, answering questions, language modelling, summarizing, translating, handling multiple-choice questions, and even generating text.</li>
<li><strong>Computer Vision:</strong> This involves image classification, spotting and outlining objects in images, and more.</li>
<li><strong>Audio:</strong> You can work on recognizing speech automatically and classifying different types of sounds.</li>
<li><strong>Multimodal Tasks:</strong> These are tasks that mix it up, like answering questions based on tables, recognizing text in images (like scanned documents), pulling out information from these documents, classifying videos, and answering questions based on images.</li>
</ul>
<p>The neat thing about Transformers is that they’re flexible with different frameworks. Whether you’re into <a target="_blank" href="https://turingtalks.substack.com/p/pytorch-vs-tensorflow-for-deep-learning">PyTorch</a>, TensorFlow, or JAX, Transformers has got you covered.</p>
<p>Its ease of use and comprehensive nature make it a go-to for researchers, developers, and businesses alike.</p>
<h2 id="heading-code-for-sentiment-analysis">Code for Sentiment Analysis</h2>
<p>Now that you know what sentiment analysis and Hugging Face are, let’s write some code. We’ll use Python and the Hugging Face <code>transformers</code> library to build a simple sentiment analyzer.</p>
<p>You can either use your terminal, install Python and run the code, or use a <a target="_blank" href="https://colab.research.google.com/">Google Colab notebook</a>. I would recommend the latter since it comes pre-installed with Python.</p>
<p>Install the <code>transformers</code> library with this command:</p>
<pre><code>pip install transformers
</code></pre><p>If you are using a Colab notebook, use a <strong>!</strong> symbol before the command for the notebook to treat it as a shell command (Colab executes code as Python by default).</p>
<pre><code>!pip install transformers
</code></pre><p>Once the installation is complete, you can start using the library. First, let's import <code>pipeline</code> from the transformers library.</p>
<pre><code><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
</code></pre><p>In Hugging Face, a “pipeline” is like a tool that helps you perform a series of steps to change data into the form you want. The pipeline makes it simple to use these tools for different jobs, without needing to know all the complex details about how these tools work on the inside.</p>
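<p>To make the idea concrete, here is a minimal sketch of the three stages a pipeline chains together: preprocessing the text, scoring it, and post-processing the scores into a label. Everything in it is hypothetical and only for illustration. The tiny word lists and the <code>toy_sentiment_pipeline</code> function stand in for the real tokenizer and neural network, which the library handles for you.</p>

```python
import math

# Hypothetical word lists standing in for a trained model (an assumption).
POSITIVE_WORDS = {"joyful", "great", "love", "opportunities"}
NEGATIVE_WORDS = {"failed", "disappointment", "frustration", "bad"}


def preprocess(text):
    """Stage 1: turn raw text into lowercase tokens."""
    return [token.strip(".,!?").lower() for token in text.split()]


def forward(tokens):
    """Stage 2: produce one raw score per label (the 'model' step)."""
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return {"POSITIVE": float(positive), "NEGATIVE": float(negative)}


def postprocess(raw_scores):
    """Stage 3: softmax the raw scores and keep the most likely label."""
    total = sum(math.exp(score) for score in raw_scores.values())
    probabilities = {label: math.exp(score) / total for label, score in raw_scores.items()}
    label = max(probabilities, key=probabilities.get)
    return [{"label": label, "score": probabilities[label]}]


def toy_sentiment_pipeline(text):
    """Chain the three stages, mirroring the shape of a pipeline's output."""
    return postprocess(forward(preprocess(text)))


print(toy_sentiment_pipeline("The project failed, leading to disappointment."))
```

<p>The toy version returns a list with one dictionary holding a <code>label</code> and a <code>score</code>, which is the same shape you’ll get back from the real sentiment pipeline.</p>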
<p>Now let’s load the <code>sentiment-analysis</code> pipeline.</p>
<pre><code>sentiment_pipeline = pipeline(<span class="hljs-string">"sentiment-analysis"</span>)
</code></pre><p>Now would you believe me if I said we are pretty much done? Our sentiment analysis model is ready and we can pass text to the pipeline and get the label as well as a sentiment score.</p>
<pre><code># Run sentiment analysis
result = sentiment_pipeline(<span class="hljs-string">"Every new day brings a chance to create joyful memories and embrace new opportunities."</span>)

# Print the result
print(result)
</code></pre><p>This is the output of the above code:</p>
<pre><code>[{<span class="hljs-string">'label'</span>: <span class="hljs-string">'POSITIVE'</span>, <span class="hljs-string">'score'</span>: <span class="hljs-number">0.9998821020126343</span>}]
</code></pre><p>If you want to pass multiple sentences, pass an array of inputs to the pipeline.</p>
<pre><code>result = sentiment_pipeline([<span class="hljs-string">"Every new day brings a chance to create joyful memories and embrace new opportunities."</span>,<span class="hljs-string">"Despite the effort, the project failed to meet expectations, leading to disappointment and frustration among the team."</span>])
print(result)
</code></pre><p>The output of the above code will be:</p>
<pre><code>[{<span class="hljs-string">'label'</span>: <span class="hljs-string">'POSITIVE'</span>, <span class="hljs-string">'score'</span>: <span class="hljs-number">0.9998821020126343</span>}, {<span class="hljs-string">'label'</span>: <span class="hljs-string">'NEGATIVE'</span>, <span class="hljs-string">'score'</span>: <span class="hljs-number">0.9997937083244324</span>}]
</code></pre><p>I hope you understand how powerful the Hugging Face Transformer library is. This is just a sample of the many pre-trained models that Hugging Face provides. Unless you are working on a unique problem, you should find a pre-trained model in Hugging Face available for you to work with.</p>
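<p>Once you run the pipeline over many texts, you’ll usually want to summarize the results rather than read them one by one. The sketch below works on a plain list of dictionaries in the same shape as the output above, so it runs without downloading a model. The <code>summarize_sentiment</code> helper and its <code>min_score</code> threshold of 0.9 are assumptions made for illustration, not part of the Transformers library.</p>

```python
from collections import Counter

# Results in the same shape the sentiment pipeline returns (copied from the output above).
results = [
    {"label": "POSITIVE", "score": 0.9998821020126343},
    {"label": "NEGATIVE", "score": 0.9997937083244324},
]


def summarize_sentiment(results, min_score=0.9):
    """Count labels, setting aside predictions below a confidence threshold."""
    confident = [r for r in results if r["score"] >= min_score]
    counts = Counter(r["label"] for r in confident)
    return {
        "counts": dict(counts),
        "uncertain": len(results) - len(confident),
    }


print(summarize_sentiment(results))
```

<p>Setting low-confidence predictions aside is one simple way to avoid over-reading borderline texts. The right threshold depends on your data.</p>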
<h2 id="heading-summary">Summary</h2>
<p>In this article, we’ve learned about sentiment analysis and Hugging Face, a powerful tool in the world of NLP. Most importantly, you’ve taken your first steps in performing sentiment analysis by using the Hugging Face Transformers library.</p>
<p>Remember, what we’ve covered is just the tip of the iceberg. The field of NLP is vast and constantly evolving. The Hugging Face Transformers library is a powerful ally in your journey through AI. It simplifies complex tasks and gives you access to pre-trained models, saving you time and resources.</p>
<p>Hope you enjoyed this article. Find more beginner-friendly articles on AI at <strong><a target="_blank" href="https://www.turingtalks.ai/">turingtalks.ai</a></strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Natural Language Processing Techniques for Topic Identification – Explained with Examples ]]>
                </title>
                <description>
                    <![CDATA[ There's a lot of textual information available these days. It ranges from articles to social media posts and research papers. So our ability to distill meaningful insights is key. This helps us make informed decisions in a wide array of contexts. For... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/topic-identification-using-natural-language-processing/</link>
                <guid isPermaLink="false">66d45f44052ad259f07e4af0</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Thu, 25 Jan 2024 16:16:15 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/pexels-wallace-chuck-3109168.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>There's a lot of textual information available these days. It ranges from articles to social media posts and research papers. So our ability to distill meaningful insights is key. This helps us make informed decisions in a wide array of contexts.</p>
<p>For example, you can analyze a large volume of textual content to extract a common theme. Companies and businesses utilize this technique to understand public opinion about their brand. This lets them make informed decisions and improve their services.</p>
<p>The ability to extract themes from a large amount of textual data is referred to as topic identification.</p>
<p>In this article, you will learn how to utilize NLP techniques for topic identification, enhancing your skillset as a data scientist. So sit back, because it's gonna be an interesting journey.</p>
<h2 id="heading-what-is-topic-identification">What is Topic Identification?</h2>
<p>Topic identification, simply put, is a sub-field under natural language processing. It involves the process of automatically discovering and organizing the main themes or topics present in a collection of textual data.</p>
<p>There are several Natural Language Processing (NLP) techniques you can use to identify themes in text, from simple ones to more algorithm-based techniques. In this article, we'll look at the most common ones and discuss each in detail below.</p>
<p>I recently tweeted about the essence of NLP. It really is purely statistics, because there are different manipulations you can do to ensure that numbers serve as representations for text (since computers don't understand text).</p>
<div class="embed-wrapper">
        <blockquote class="twitter-tweet">
          <a href="https://twitter.com/Ibrahim_Geek/status/1742877290227187989?s=20"></a>
        </blockquote>
        <script defer="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
<p> </p>
<h2 id="heading-requirements-for-this-project">Requirements for this Project</h2>
<p>In order for you to be able to follow along and get hands-on practical experience while learning, you should have Python 3.x installed on your machine.</p>
<p>We'll also use the following libraries: Gensim, Scikit-Learn, and NLTK. You can install them using the Pip package installer with the following command:</p>
<pre><code class="lang-bash">pip install gensim nltk scikit-learn
</code></pre>
<h2 id="heading-techniques-used-in-nlp-for-topic-identification">Techniques Used in NLP for Topic Identification</h2>
<p>There are various techniques you can use for topic identification. In this article, you will learn about some common NLP techniques that work quite well, from simple and effective methods to more advanced ones.</p>
<h3 id="heading-bag-of-words">Bag of Words</h3>
<p>Bag of Words (BoW) is a common representation used in NLP for textual data. You can use it to count the frequency at which each word occurs in a document.</p>
<p>BoW, in the context of topic identification, is based on the assumption that the more frequently a word occurs in a document, the more important it is. Then you can use those more common words to infer what the document is all about.</p>
<p>Bag of words is the simplest technique used to identify topics in NLP. While Bag of Words is simple and efficient, it is highly affected by stop words, which are common words in text data (like "the," "and," "is," and so on).</p>
<p>But once you eliminate the issue of stop words from the text, allowing you to perform effective text processing (using techniques like normalization), BoW can still prove effective in identifying some main topics.</p>
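<p>To see how much stop words can distort the counts, here is a small sketch that uses only Python's standard library. The sentence and the short stop word list are made up for illustration (NLTK's English list is much longer), and the regular expression is a rough stand-in for a real tokenizer.</p>

```python
import re
from collections import Counter

text = "The ban on the watches is lifted and the watches are back on sale in the US."

# A tiny hand-picked stop word list for illustration (an assumption);
# NLTK's English list is much longer.
stop_words = {"the", "on", "is", "and", "are", "in"}

# Lowercase the text and keep alphabetic runs only.
tokens = re.findall(r"[a-z]+", text.lower())

raw_counts = Counter(tokens)
filtered_counts = Counter(token for token in tokens if token not in stop_words)

print(raw_counts.most_common(2))
print(filtered_counts.most_common(2))
```

<p>Before filtering, "the" is the most frequent word, which tells you nothing about the topic. After filtering, "watches" comes out on top.</p>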
<p>Let's look at how you can use BoW to identify the topic below.</p>
<h4 id="heading-how-to-implement-of-bag-of-words-in-python">How to Implement Bag of Words in Python</h4>
<p>A bit of background about the example article we'll use here: I got it from the BBC, and it's titled "US lifts ban on imports of latest Apple watch." The article discusses the lifted ban on Apple's latest watches, Ultra 2 and Series 9.</p>
<p>Now let's go over how to implement the bag of words in Python. I'll break this code block up into sections and explain each part as I go to make it a bit easier to digest.</p>
<pre><code class="lang-python"><span class="hljs-comment">#import necessary libraries</span>
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
<span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities.\
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row.\
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level.\
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week.\
Apple had said it strongly disagrees with the ruling.\
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>In the above code, we're importing the necessary libraries that we'll use to implement the BoW.</p>
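<p>We'll use the <code>Counter</code> class to count the frequency of each word, and the <code>word_tokenize</code> function to split the document into individual word tokens so they can be counted. Lastly, the <code>stopwords</code> corpus gives us the list of stop words to remove from the document. NLTK's tokenizer models and stop word list are separate downloads, so you may need to fetch them with <code>nltk.download()</code> the first time you run this.</p>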
<pre><code class="lang-python">
<span class="hljs-comment"># Initialize english stopwords</span>
english_stopwords = stopwords.words(<span class="hljs-string">"english"</span>)

<span class="hljs-comment">#convert article to tokens</span>
tokens = word_tokenize(article)

<span class="hljs-comment">#extract alpha words and convert to lowercase</span>
alpha_lower_tokens = [word.lower() <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> word.isalpha()]

<span class="hljs-comment">#remove stopwords</span>
alpha_no_stopwords = [word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> alpha_lower_tokens <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> english_stopwords]

<span class="hljs-comment">#Count word</span>
BoW = Counter(alpha_no_stopwords)

<span class="hljs-comment">#3 Most common words</span>
BoW.most_common(<span class="hljs-number">3</span>)
</code></pre>
<p>In the above code, we use the first line of code to extract all stop words in the English language. Then, the second line tokenizes the article string into individual words. The third line of code normalizes each word into lowercase and only extracts alphabetic words from the article. The last two lines of code are used to count the frequency of each word and select the most common three words.</p>
<p>Below is the output of the BoW model:</p>
<pre><code class="lang-javascript">[(<span class="hljs-string">'watches'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'us'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'apple'</span>, <span class="hljs-number">3</span>)]
</code></pre>
<p>From this, we can infer that the article is all about "Apple's watches in the US". As you can see, despite the simple reasoning behind the bag of words, it is still possible to infer a bit of knowledge about the article.</p>
<h3 id="heading-latent-dirichlet-allocation">Latent Dirichlet Allocation</h3>
<p>Latent Dirichlet Allocation, or LDA for short, is a popular probabilistic model used in NLP and machine learning for topic modeling (using algorithms to identify topics). It is based on the assumption that documents are mixtures of topics, and topics are mixtures of words.</p>
<p>Simply put, LDA is an NLP technique used to identify the topic to which a document belongs based on the words contained in the document.</p>
<p>LDA operates on the bag-of-words representation of documents, where each document is represented as a vector of word frequencies. You can implement LDA using the Gensim library in Python (which is an open source library used for topic modelling and document similarity analysis).</p>
<p>Steps for implementing LDA include:</p>
<ul>
<li><p><strong>Import Libraries:</strong> The first step is to import the libraries you'll be using.</p>
</li>
<li><p><strong>Data Preparation:</strong> Convert raw data to a document format then tokenize, remove stop words, and optionally perform stemming or lemmatization.</p>
</li>
<li><p><strong>Create Dictionary and Corpus</strong>: Build a dictionary with unique word IDs. Then form a bag of words corpus representing document-word frequency.</p>
</li>
<li><p><strong>Train LDA Model</strong>: Use the document-word frequency and dictionary to train the LDA model, setting the desired number of topics.</p>
</li>
<li><p><strong>Print Topics</strong>: Explore and print the discovered topics.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Import the necessary libraries</span>
<span class="hljs-keyword">from</span> gensim.corpora.dictionary <span class="hljs-keyword">import</span> Dictionary
<span class="hljs-keyword">from</span> gensim.models <span class="hljs-keyword">import</span> LdaModel
<span class="hljs-keyword">from</span> nltk <span class="hljs-keyword">import</span> sent_tokenize, word_tokenize
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>The above lines of code include the necessary libraries that we'll use to implement the LDA.</p>
<p>The first line imports the <code>Dictionary</code> class, and the second imports the LDA model. The third line imports <code>sent_tokenize</code>, which we'll use to split the article into sentences (each sentence is treated as a document), and <code>word_tokenize</code>, which will split each document into individual words. Lastly, we import the <code>stopwords</code> corpus.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize English stop words</span>
english_stopwords = stopwords.words(<span class="hljs-string">"english"</span>)

<span class="hljs-comment"># convert article to documents (one per sentence)</span>
documents = sent_tokenize(article)

<span class="hljs-comment"># tokenize and normalize the documents</span>
tokenized_words = [word_tokenize(doc.lower()) <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> documents]

<span class="hljs-comment"># remove stop words and keep only alphanumeric tokens (words and numbers)</span>
cleaned_token = [[word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> sentence <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> english_stopwords <span class="hljs-keyword">and</span> word.isalnum()]
                 <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> tokenized_words]

<span class="hljs-comment"># create a dictionary</span>
dictionary = Dictionary(cleaned_token)

<span class="hljs-comment"># Create a corpus from the document</span>
corpus = [dictionary.doc2bow(text) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> cleaned_token]
</code></pre>
<p>The above lines of code include the preprocessing steps that will be performed on the article, including converting the article to a document, normalizing, and tokenizing the document into individual words.</p>
<p>The next part removes stopwords from the text and then extracts words and numbers from the document. After that, we create a dictionary, which is a map between each word and its numerical identifier. The last line of code then creates a corpus of the document.</p>
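<p>If the dictionary and corpus feel abstract, the following sketch builds the same kind of structures by hand with plain Python. The two token lists are made up for illustration, and the order in which IDs are assigned here is an assumption, so Gensim's <code>Dictionary</code> may number the words differently.</p>

```python
from collections import Counter

# Two already-cleaned "documents" (token lists), made up for illustration.
cleaned_docs = [
    ["apple", "watches", "ban", "watches"],
    ["apple", "court", "ban", "lifted"],
]

# Assign each unique word an integer ID in order of first appearance.
# (This ordering is an assumption; Gensim may assign IDs differently.)
word_to_id = {}
for doc in cleaned_docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))


def to_bow(doc):
    """Represent a document as sorted (word_id, count) pairs."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in cleaned_docs]
print(word_to_id)
print(toy_corpus)
```

<p>Each document becomes a list of (word ID, count) pairs, which is the bag-of-words form the LDA model is trained on.</p>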
<pre><code class="lang-javascript"># Build the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=<span class="hljs-number">3</span>)

# Print the topics
print(<span class="hljs-string">"Identified Topics:"</span>)
<span class="hljs-keyword">for</span> idx, topic <span class="hljs-keyword">in</span> lda_model.print_topics():
    print(f<span class="hljs-string">"Topic {idx + 1}: {topic}"</span>)
</code></pre>
<p>The above code trains the model on the corpus and then prints the 3 topics it identified in the article.</p>
<p>Below is the output of the LDA Model:</p>
<pre><code class="lang-javascript">Identified Topics:
Topic <span class="hljs-number">1</span>: <span class="hljs-number">0.045</span>*<span class="hljs-string">"9"</span> + <span class="hljs-number">0.045</span>*<span class="hljs-string">"ultra"</span> + <span class="hljs-number">0.044</span>*<span class="hljs-string">"sales"</span> + <span class="hljs-number">0.044</span>*<span class="hljs-string">"2"</span> + <span class="hljs-number">0.043</span>*<span class="hljs-string">"series"</span> + <span class="hljs-number">0.043</span>*<span class="hljs-string">"watches"</span> + <span class="hljs-number">0.029</span>*<span class="hljs-string">"apple"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"ruling"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"disagrees"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"said"</span>
Topic <span class="hljs-number">2</span>: <span class="hljs-number">0.051</span>*<span class="hljs-string">"maker"</span> + <span class="hljs-number">0.035</span>*<span class="hljs-string">"ban"</span> + <span class="hljs-number">0.035</span>*<span class="hljs-string">"us"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"emergency"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"made"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"successful"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"court"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"lifted"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"request"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"proved"</span>
Topic <span class="hljs-number">3</span>: <span class="hljs-number">0.055</span>*<span class="hljs-string">"apple"</span> + <span class="hljs-number">0.054</span>*<span class="hljs-string">"us"</span> + <span class="hljs-number">0.054</span>*<span class="hljs-string">"watches"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"sales"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"technology"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"imports"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"authorities"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"barred"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"appeal"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"filed"</span>
</code></pre>
<p>The LDA technique shows some improvement compared to the BoW method. We can now infer more: the article is about a ban related to Apple's Ultra and Series watches in the US.</p>
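<p>Each topic in the output is a string of weighted words, where a larger number means the word contributes more to that topic. If you want to work with those weights in code, one option is to parse the string. The sketch below does that with a regular expression. The <code>parse_topic</code> helper is hypothetical and written for this article, and the input is a shortened copy of one topic from the output above.</p>

```python
import re

# One topic string in the format shown in the output above (shortened).
topic = '0.055*"apple" + 0.054*"us" + 0.054*"watches" + 0.031*"sales"'


def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


for word, weight in parse_topic(topic):
    print(f"{word}: {weight}")
```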
<h3 id="heading-non-negative-matrix-factorization">Non-Negative Matrix Factorization</h3>
<p>Non-Negative Matrix Factorization (NMF), just like LDA, is another topic modeling technique that uncovers latent topics in a collection of documents.</p>
<p>But instead of relying on BoW, it relies on the Term Frequency-Inverse Document Frequency (TF-IDF) representation to capture and retrieve hidden themes or topics from the documents.</p>
<p>By incorporating TF-IDF information, NMF is able to weigh the importance of terms, thereby identifying more hidden patterns. You can perform NMF using the Scikit-learn library.</p>
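<p>To get a feel for how TF-IDF weighs terms, here is a minimal sketch using the textbook formula: term frequency multiplied by the logarithm of the number of documents divided by the number of documents containing the word. The three token lists are made up for illustration. Scikit-learn's <code>TfidfVectorizer</code> applies smoothing and normalization by default, so its numbers will differ from these.</p>

```python
import math
from collections import Counter

# Three tiny tokenized documents, made up for illustration.
docs = [
    ["apple", "watches", "ban"],
    ["apple", "court", "ban"],
    ["apple", "masimo", "accused"],
]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Score each word as term frequency times log(N / document frequency)."""
    term_counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(len(docs) / doc_freq[word])
        for word, count in term_counts.items()
    }


scores = tf_idf(docs[2])
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

<p>Because "apple" appears in every document, its score is zero, while the rarer words "masimo" and "accused" score higher. This is why a TF-IDF based method can surface terms that a pure frequency count would bury.</p>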
<h3 id="heading-steps-for-performing-nmf">Steps for performing NMF</h3>
<ul>
<li><p>Import necessary libraries</p>
</li>
<li><p>Data Preparation: Convert the text into documents, then perform the necessary data preparation, like removing stop words. The <code>TfidfVectorizer</code> class in Scikit-learn has a <code>stop_words</code> argument that does this.</p>
</li>
<li><p>Convert the document to a TF-IDF matrix using the TF-IDF vectorizer in Scikit-learn</p>
</li>
<li><p>Apply NMF to the TF-IDF matrix, specifying the number of topics you want, then choose how many top words to show for each topic</p>
</li>
<li><p>Lastly, interpret your result.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># import the necessary libraries</span>
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> NMF

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>The above code contains the libraries that we'll use to implement NMF and the article itself.</p>
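<pre><code class="lang-python"><span class="hljs-comment"># sent_tokenize comes from NLTK (it is not imported in the block above)</span>
<span class="hljs-keyword">from</span> nltk <span class="hljs-keyword">import</span> sent_tokenize

<span class="hljs-comment"># convert article to documents</span>
documents = sent_tokenize(article)

<span class="hljs-comment"># Create a TF-IDF vectorizer and build the TF-IDF matrix</span>
tfidf_vectorizer = TfidfVectorizer(stop_words=<span class="hljs-string">'english'</span>)
tfidf = tfidf_vectorizer.fit_transform(documents)

<span class="hljs-comment"># Apply NMF</span>
num_topics = <span class="hljs-number">5</span>  <span class="hljs-comment"># Set the number of topics you want to identify</span>
nmf_model = NMF(n_components=num_topics, init=<span class="hljs-string">'random'</span>, random_state=<span class="hljs-number">42</span>)
nmf_matrix = nmf_model.fit_transform(tfidf)

<span class="hljs-comment"># Print the 10 highest-weighted words for each topic</span>
feature_names = tfidf_vectorizer.get_feature_names_out()
<span class="hljs-keyword">for</span> topic_idx, topic <span class="hljs-keyword">in</span> enumerate(nmf_model.components_):
    top_words = [feature_names[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> topic.argsort()[::-<span class="hljs-number">1</span>][:<span class="hljs-number">10</span>]]
    print(f<span class="hljs-string">"Topic #{topic_idx + 1}: {', '.join(top_words)}"</span>)
</code></pre>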
<p>The above code converts the article into documents. Then it creates a Term Frequency-Inverse Document Frequency (TF-IDF) matrix from those documents. Finally, it sets the number of topics and uses NMF to extract them from the matrix.</p>
<p>Below is the output of the NMF Model:</p>
<pre><code class="lang-javascript">Topic #<span class="hljs-number">1</span>: ultra, series, sales, watches, row, halted, patent, white, house, effect
Topic #<span class="hljs-number">2</span>: lifted, court, iphone, getting, request, successful, proved, appeals, ban, maker
Topic #<span class="hljs-number">3</span>: disagrees, strongly, ruling, said, apple, body, blood, level, trade, oxygen
Topic #<span class="hljs-number">4</span>: filed, resume, appeal, latest, tech, authorities, sold, smart, company, emergency
Topic #<span class="hljs-number">5</span>: technology, apple, accused, masimo, device, staff, poaching, maker, trade, level
</code></pre>
<p>You can see that NMF reveals more insights concerning the themes of the document. For example, Topic #5 tells you that another company called Masimo has accused Apple of poaching its staff and technology, which is what the patent row over the watches is about.</p>
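<p>If you want to print topic lists like the ones above yourself, here is a minimal sketch. The helper name <code>top_words</code> and the toy weights are my own assumptions for illustration. With scikit-learn you would pass <code>nmf_model.components_</code> and the vocabulary returned by the fitted vectorizer's <code>get_feature_names_out()</code>, assuming you kept the fitted vectorizer in a variable.</p>
<pre><code class="lang-python"># Hypothetical helper (not from the original code): list the
# highest-weighted terms for each topic.
def top_words(components, feature_names, n_top=10):
    topics = []
    for weights in components:
        # Sort term indices by weight, largest first
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_top]])
    return topics

# Toy stand-ins for nmf_model.components_ and the vectorizer's vocabulary
toy_components = [
    [0.9, 0.1, 0.0, 0.7],
    [0.0, 0.8, 0.6, 0.1],
]
toy_features = ["apple", "ban", "court", "masimo"]

for number, words in enumerate(top_words(toy_components, toy_features, n_top=2), start=1):
    print(f"Topic #{number}: {', '.join(words)}")
</code></pre>
<p>Each row of the components matrix holds one topic's weight for every term, so sorting a row gives you that topic's most characteristic words.</p>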
<h2 id="heading-how-to-choose-which-technique-to-use">How to Choose Which Technique to Use?</h2>
<p>I recommend experimenting with all the approaches in order to gain different perspectives concerning the contents of your document.</p>
<p>Bag of Words and LDA are based on how frequently words occur, making these techniques useful for inferring the biggest/most general themes about the document.</p>
<p>On the other hand, when using NMF, which is based on TF-IDF, less frequent words can be used to infer additional topics and provide a different perspective on the document.</p>
<p>For example, NMF was able to identify key terms like "Masimo" and "accused," whereas LDA was not able to do this. So depending on your needs, go ahead and experiment with all the approaches to see which one is able to yield better results.</p>
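<p>To get a feel for why TF-IDF lets rarer words stand out, here is a small sketch using the textbook formula (term frequency multiplied by the log of the inverse document frequency). The three toy documents are made up for illustration, and scikit-learn's <code>TfidfVectorizer</code> uses a smoothed, normalized variant by default, so its exact numbers will differ.</p>
<pre><code class="lang-python">import math

# Three made-up "documents", already split into words
docs = [
    "apple watch ban".split(),
    "apple watch sales".split(),
    "masimo accused apple".split(),
]

def tf_idf(term, doc, corpus):
    # Assumes the term appears in at least one document of the corpus
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

# "apple" is in every document, "masimo" in only one
for term in ["apple", "masimo"]:
    print(term, round(tf_idf(term, docs[2], docs), 3))
</code></pre>
<p>A word that appears in every document gets a weight of zero, while a word that appears in only one document keeps a positive weight. That is why a rare term like "Masimo" can shape an NMF topic.</p>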
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, you've learned about topic identification and how you can use it to extract themes or topics from a large document.</p>
<p>We covered some different techniques you can use to identify topics, including simple ones like BoW and more advanced ones like LDA and NMF.</p>
<p>Happy learning, and see you in the next one.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Get Started with Hugging Face – Open Source AI Models and Datasets ]]>
                </title>
                <description>
                    <![CDATA[ By Ambreen Khan What is Hugging Face 🤗? If you are interested in Artificial Intelligence and Natural Language Processing, you have probably heard of Hugging Face – the company named after a cute emoji.  Hugging Face is not only a company, but also a... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/get-started-with-hugging-face/</link>
                <guid isPermaLink="false">66d45d97706b9fb1c166b918</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ open source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 10 Jan 2024 21:05:36 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/HuggingFace_Title-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ambreen Khan</p>
<h2 id="heading-what-is-hugging-face"><strong>What is Hugging Face 🤗?</strong></h2>
<p>If you are interested in Artificial Intelligence and Natural Language Processing, you have probably heard of Hugging Face – the company named after a cute emoji. </p>
<p>Hugging Face is not only a company, but also a platform that is transforming the fields of AI and NLP through open source and open science.</p>
<p>Hugging Face offers a platform called the Hugging Face Hub, where you can find and share thousands of AI models, datasets, and demo apps. The Hub is like the GitHub of AI, where you can collaborate with other machine learning enthusiasts and experts, and learn from their work and experience.</p>
<p>Hugging Face’s mission is to democratize good machine learning, one commit at a time. Whether you are a beginner or a professional, you can benefit from the amazing resources and tools that Hugging Face provides.</p>
<p>In this post, I'll guide you through the basics of Hugging Face. You'll learn how to create your Hugging Face account, set up your development environment, and use some of the pre-trained models that are available on the Hub. Let’s get started! 🚀</p>
<h2 id="heading-heres-what-well-cover">Here's what we'll cover:</h2>
<ol>
<li><a class="post-section-overview" href="#heading-what-can-you-do-on-the-hugging-face-platform">What can you do on the Hugging Face Platform?</a><ul>
<li><a class="post-section-overview" href="#heading-download-and-fine-tune-existing-open-source-models">Download and fine-tune existing Open Source models</a></li>
<li><a class="post-section-overview" href="#heading-run-models-directly-from-hugging-face">Run models directly from Hugging Face</a></li>
<li><a class="post-section-overview" href="#heading-addcreate-your-own-model">Add/create your own model</a></li>
<li><a class="post-section-overview" href="#heading-use-existing-datasets">Use existing datasets</a></li>
<li><a class="post-section-overview" href="#heading-createbrowse-demo-apps-also-known-as-spaces">Create/browse demo apps (also known as Spaces)</a></li>
<li><a class="post-section-overview" href="#heading-join-or-create-an-organization">Join or create an organization</a></li>
<li><a class="post-section-overview" href="#heading-create-a-portfolio">Create a portfolio</a></li>
<li><a class="post-section-overview" href="#heading-learn-ai-skills">Learn AI skills</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-hugging-face-terminology">Hugging Face terminology</a></li>
<li><a class="post-section-overview" href="#heading-how-to-get-started-with-hugging-face">How to get started with Hugging Face</a><ul>
<li><a class="post-section-overview" href="#heading-create-a-hugging-face-account">Create a Hugging Face account</a></li>
<li><a class="post-section-overview" href="#heading-set-up-your-environment">Set up your environment</a></li>
</ul>
</li>
<li><a class="post-section-overview" href="#heading-how-to-use-pre-trained-models-in-hugging-face">How to use pre-trained models in Hugging Face</a></li>
<li><a class="post-section-overview" href="#heading-how-to-find-the-right-pre-trained-model">How to find the right pre-trained model</a></li>
<li><a class="post-section-overview" href="#heading-whats-next">What's next?</a></li>
</ol>
<h2 id="heading-what-can-you-do-on-the-hugging-face-platform">What Can You Do on the Hugging Face Platform?</h2>
<p>Here are some of the awesome things you can do on Hugging Face:</p>
<h3 id="heading-download-and-fine-tune-existing-open-source-models">Download and fine-tune existing Open Source models:</h3>
<p>Why start from scratch when you can leverage the power of over 450k models that are already available on the Hugging Face model library? </p>
<p>You can easily download these models and fine-tune them on your own custom dataset with just a few lines of code. This way, you can save time and resources, and still get a model that suits your specific needs.</p>
<p>You can use these models to perform various tasks, such as:</p>
<ol>
<li>Natural language processing (for example, translation, summarization, and text generation)</li>
<li>Audio-related functions (for example, automatic speech recognition, voice activity detection, and text-to-speech)</li>
<li>Computer vision tasks (for example, depth estimation, image classification, and image-to-image processing)</li>
<li>Multimodal tasks that involve diverse data types (text, images, audio) and can produce multiple types of output</li>
</ol>
<h3 id="heading-run-models-directly-from-hugging-face">Run Models directly from Hugging Face:</h3>
<p>If you don’t want to set up these models on your own machine, you can simply use Hugging Face’s Transformers library to connect to these models, send requests, and receive outputs. </p>
<h3 id="heading-addcreate-your-own-model">Add/create your own model:</h3>
<p>If you have a brilliant idea for a new model, or you want to improve an existing one, you can also add/create your own model on Hugging Face. </p>
<p>The platform will host your model, and allow you to provide additional information, upload essential files, and manage different versions. You can also choose whether your models are public or private, so you can decide when or if you want to share them with the world. </p>
<p>Once your model is ready, you can access it directly from Hugging Face, send requests, and retrieve the outputs for integration into any applications you are developing.</p>
<h3 id="heading-use-existing-datasets">Use existing datasets:</h3>
<p>A good model needs a good dataset. Hugging Face provides a repository of over 90,000 datasets that you can use and feed into your models. </p>
<p>You can take an in-depth look inside the dataset using the dataset viewer. You can also contribute your own datasets to the repository, and help the machine learning community grow.</p>
<p><img src="https://lh7-us.googleusercontent.com/tYogXTtF_pOn4dIRAFUDP20kpbf4yzTvkWdINjnFqjka6N5b4xfDRT_ssvVqQCig09SlSfb3voil16yE37YOPLDmsHj508xkPtYWKHF63rX8ozOW21BQH2dKQL5jEuhq5Yn-m1xyU9pKKHOimOlDqHk" alt="Image" width="600" height="400" loading="lazy">
<em>Screenshot of dataset viewer</em></p>
<h3 id="heading-createbrowse-demo-apps-also-known-as-spaces">Create/browse demo apps (also known as Spaces):</h3>
<p>Hugging Face’s Spaces are Git repositories that allow you to showcase your machine learning applications. You can also browse and try out the Spaces created by other users, and find inspiration for your next AI app. </p>
<p>With thousands of ML apps to choose from, you will never run out of fun and interesting things to do.</p>
<p>Here are a few cool Spaces you can check out:</p>
<ul>
<li><a target="_blank" href="https://huggingface.co/spaces/openai/whisper">OpenAI's Whisper</a>: Transcribe long-form microphone or audio inputs with the click of a button.</li>
<li><a target="_blank" href="https://huggingface.co/spaces/jbilcke-hf/ai-comic-factory">AI Comic Factory</a>: Create your own comic books.</li>
<li><a target="_blank" href="https://huggingface.co/spaces/huggingface-projects/QR-code-AI-art-generator">QR Code AI Art Generator</a>: Generate beautiful QR codes using AI.</li>
<li><a target="_blank" href="https://huggingface.co/spaces/multimodalart/stable-video-diffusion">Stable Video Diffusion</a> (Img2Vid - XT): Generate 4s video from a single image.</li>
<li><a target="_blank" href="https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA">Video-LLaMA</a>: Audio-Visual Language Model for Video Understanding.</li>
</ul>
<h3 id="heading-join-or-create-an-organization">Join or create an organization:</h3>
<p>You can join or create your own organization on Hugging Face. This allows you to showcase your work and collaborate with other members from your university, lab, or company. You can also work on private datasets, models, and spaces with your organization.</p>
<h3 id="heading-create-a-portfolio">Create a portfolio:</h3>
<p>You can create a professional portfolio on Hugging Face to showcase your work and start building your reputation. This can help you land jobs related to AI model training, integration, and development. </p>
<p>Hugging Face provides the basic computing resources for running the demo app, including 16 GB of RAM, 2 CPU cores, and 50 GB of disk space for free. You can also upgrade your hardware for improved and faster performance with paid options.</p>
<h3 id="heading-learn-ai-skills">Learn AI Skills:</h3>
<p>Hugging Face is an excellent platform for learning AI skills. It offers a comprehensive set of tools and resources for training and using models. This includes demos, use cases, documentation, and tutorials that guide you through the entire process of using these tools and training models.</p>
<p>You can also learn from the experts and the community on Hugging Face, and improve your AI knowledge and skills.</p>
<h2 id="heading-hugging-face-terminology">Hugging Face Terminology</h2>
<p>There are some terms you'll need to know to get the most out of working with Hugging Face.</p>
<p><strong>Pretrained model:</strong> A model that has been trained on a large dataset for a specific task before being made available for use. </p>
<p><strong>Inference:</strong> Inference is the process of using a trained model to make predictions or draw conclusions about new, unseen data based on the learned patterns from the training data.</p>
<p><strong>Transformers:</strong> Transformers are models that can handle text-based tasks, such as translation, summarization, and text generation. They use a special architecture that relies on attention mechanisms to capture the relationships between words and sentences.</p>
<p><strong>Tokenizer</strong>: A tokenizer is a process that breaks down text into smaller units called tokens. Tokens are usually words or subwords that can be used for natural language processing (NLP) tasks.</p>
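<p>To make the idea of subword tokens more concrete, here is a toy sketch of a greedy longest-match splitter. The vocabulary and the function name <code>subword_tokenize</code> are assumptions made up for this illustration. Real Hugging Face tokenizers learn much larger vocabularies from data and use more refined algorithms.</p>
<pre><code class="lang-python"># Toy vocabulary (an assumption for illustration, not a real model's vocabulary)
VOCAB = {"token", "ization", "un", "break", "able", "s"}

def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of one word into known subwords."""
    tokens, start = [], 0
    while start != len(word):
        # Try the longest remaining piece first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no known piece fits at this position
    return tokens

print(subword_tokenize("tokenization"))
print(subword_tokenize("unbreakable"))
</code></pre>
<p>Splitting into subwords lets a model handle words it has never seen as a whole, as long as it knows the pieces.</p>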
<h2 id="heading-how-to-get-started-with-hugging-face"><strong>How to Get Started with Hugging Face</strong></h2>
<p>To get started with Hugging Face, you will need to set up an account and install the necessary libraries and dependencies. Don’t worry, it’s easy and fun! </p>
<p>Here are the steps you need to follow:</p>
<h3 id="heading-create-a-hugging-face-account">Create a Hugging Face Account</h3>
<p>Signing up as a Community individual contributor is free of charge. You can also opt for a ‘Pro’ plan or a customized plan for Organizations if you need more features and resources.</p>
<p>Go to the Hugging Face website and click on “Sign Up” to create a free account.</p>
<p>Then enter your email address and a password. Click next and complete your profile and security check.</p>
<p><img src="https://lh7-us.googleusercontent.com/OQA0CUGvs2Dg4LKI3X5mPVjNj7LYIbeUDF0q46sC2p39n-Ca56OwiGNYYdPJU4NrcZG4s-G_KKYX1YADa9QL2yyjHcMDoQ43BBllp6SHgq6P_33XG7ta4nVDTsjierUonbH3YYwuj7CploOW2tpAopo" alt="Image" width="600" height="400" loading="lazy">
<em>Setting up a Hugging Face account</em></p>
<p>Congratulations, you are now a Hugging Face member! 🎉 You will be directed to the Hugging Face ‘Welcome’ page, where you can find more information and tips on how to use the platform.</p>
<p>As a bonus, you also get a Git-based hosted repository where you can create your Models, Datasets and Spaces. You can do this directly using the website or using the CLI. If you prefer the latter, you can check the detailed instructions on the ‘Welcome’ page under the ‘Programmatic access’ section.</p>
<p><img src="https://lh7-us.googleusercontent.com/PhM1PcZxLn4jgchRlU2J6ZEemobdrBTBq0ypqFM3Y2mZsTwtvFUg7nhJ4KBL4HfvYJz4Zp2KsZa7SvbfJMe8o9ARKvy1NOdCGSn4WEJ0JUivxT2Lp4nnWrU21cCjjGl5yJMG7BqfaGzvqVGd9z06Mrg" alt="Image" width="600" height="400" loading="lazy">
<em>Hugging Face welcome screen showing options to create a new model, browse the docs, and set up programmatic access</em></p>
<h3 id="heading-set-up-your-environment">Set Up Your Environment</h3>
<p>Before you start using the Hugging Face hub programmatically, you will need to set up your environment.  </p>
<h4 id="heading-step-1-install-python-and-pip">Step 1: Install Python and Pip:</h4>
<p>Make sure you have Python 3.8 or higher installed on your system. You will also need Pip, the package manager for Python, to install the Hugging Face libraries. If you don’t have Python, you can install it by following the instructions <a target="_blank" href="https://www.python.org/downloads/">here</a>.</p>
<h4 id="heading-step-2-install-huggingface-libraries">Step 2: Install Hugging Face libraries:</h4>
<p>Open a terminal or command prompt and run the following command to install the Hugging Face libraries: </p>
<pre><code class="lang-shell">pip install transformers
</code></pre>
<p>This will install the core Hugging Face library along with its dependencies. To have the full capability, you should also install the datasets and tokenizers libraries.</p>
<pre><code class="lang-shell">pip install tokenizers datasets
</code></pre>
<h4 id="heading-step-3-set-up-a-development-environment">Step 3: Set up a development environment:</h4>
<p>Pick a code editor or IDE, such as Jupyter Notebook, PyCharm, or Visual Studio Code. Create a new project directory and set up a virtual environment to isolate your project dependencies. You can find more information on how to do this <a target="_blank" href="https://docs.python.org/3/library/venv.html">here</a>.</p>
<p>With these steps completed, you have successfully set up Hugging Face on your system and are ready to start exploring its features and capabilities. Let’s go! 🚀</p>
<h2 id="heading-how-to-use-pre-trained-models-in-hugging-face">How to Use Pre-Trained Models in Hugging Face</h2>
<p>One of the best things about Hugging Face is that it gives you access to thousands of pre-trained models that can perform various tasks on different types of data. Whether you are working with text, vision, audio, or a combination of them, you can find a model that suits your needs.</p>
<p>Hugging Face has two main libraries that provide access to pre-trained models: <strong>Transformers</strong> and <strong>Diffusers</strong>. The Transformers library handles text-based tasks, such as translation, summarization, and text generation. Diffusers can handle image-based tasks, such as image synthesis, image editing, and image captioning.</p>
<p>You have already installed the transformers library during the environment setup. Let’s see how you can use it to work with pre-trained models.</p>
<h3 id="heading-step-1-visit-the-pypi-page">Step 1: Visit the PyPI page</h3>
<p>To learn more about the transformers library, you can visit its page on PyPI, the Python Package Index. </p>
<p>Go to <a target="_blank" href="https://pypi.org/">PyPI</a> and search for ‘transformers’. Click on the latest version of the transformers library displayed in the search results. You will see a brief introduction to the library, as well as some useful links and information.</p>
<h3 id="heading-step-2-download-and-use-pre-trained-models">Step 2: Download and use pre-trained models</h3>
<p>The transformers library provides APIs to quickly download and use pre-trained models on a given text, fine-tune them on your own datasets, and then share them with the community on Hugging Face’s <a target="_blank" href="https://huggingface.co/models">model hub</a>.</p>
<h3 id="heading-step-3-use-the-pipeline-method">Step 3: Use the <code>pipeline()</code> method</h3>
<p>To use a pre-trained model on a given input, Hugging Face provides a <code>pipeline()</code> method, an easy-to-use API for performing a wide variety of tasks. </p>
<p>The <a target="_blank" href="https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/pipelines#transformers.pipeline">pipeline()</a> method makes it simple to use any <a target="_blank" href="https://huggingface.co/models">model</a> from the Hub for inference on any language, computer vision, speech, and multimodal tasks.</p>
<p>Let’s try to perform a task using the pipeline() method.</p>
<h4 id="heading-task-sentiment-analysis">Task: Sentiment analysis:</h4>
<p>Let’s use the <code>pipeline()</code> method to classify positive versus negative texts provided by the user:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

<span class="hljs-comment"># Load the pre-trained sentiment analysis model</span>
sentiment_analysis = pipeline(
<span class="hljs-string">"sentiment-analysis"</span>, model=<span class="hljs-string">"distilbert-base-uncased-finetuned-sst-2-english"</span>)

input_text = [
<span class="hljs-string">"It’s a great app, my biggest problem is the card readers regularly do not connect. Which is very poor customer service for us because we have to manually enter our customers debit cards, which takes time. This slows down our efficiency."</span>
]

<span class="hljs-comment"># Perform sentiment analysis on the input text</span>
result = sentiment_analysis(input_text)

<span class="hljs-comment"># Print the result</span>
print(result)
</code></pre>
<p>The pipeline statement downloads and caches the pretrained model used by the pipeline, while the statement <code>result = sentiment_analysis(input_text)</code> evaluates it on the given text. </p>
<p><strong>Output:</strong></p>
<pre><code class="lang-shell">[{'label': 'NEGATIVE', 'score': 0.9996176958084106}]
</code></pre>
<p>Here, the answer is "NEGATIVE" with a confidence of 99.96%.</p>
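<p>The pipeline returns a list with one dictionary per input text. As a small sketch of how you might turn that into a readable message, the snippet below reuses the output shown above so it runs without downloading a model. The helper name <code>describe</code> is my own and not part of the transformers library.</p>
<pre><code class="lang-python">def describe(prediction):
    # Each prediction is a dict with a "label" and a "score" between 0 and 1
    return f"{prediction['label']} ({prediction['score']:.2%} confidence)"

# Output copied from the sentiment analysis example above
result = [{'label': 'NEGATIVE', 'score': 0.9996176958084106}]

for prediction in result:
    print(describe(prediction))
</code></pre>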
<h4 id="heading-task-automatic-speech-recognition">Task: Automatic speech recognition</h4>
<p>Let’s try another task that involves speech recognition.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline

transcriber = pipeline(task=<span class="hljs-string">"automatic-speech-recognition"</span>,
                       model=<span class="hljs-string">"openai/whisper-small"</span>)
result = transcriber(
    <span class="hljs-string">"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"</span>)

print(result)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-shell">{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
</code></pre>
<p>You can see how easy it is to get a pre-trained model up and running using Hugging Face's libraries. </p>
<h3 id="heading-how-to-find-the-right-pre-trained-model">How to find the right pre-trained model</h3>
<p>But how can you find the right pre-trained model if you want to perform a specific task? </p>
<p>This is actually quite easy. You can browse the models on the Hugging Face website, and filter them by task, language, framework, and more. You can also search for models and datasets by keyword and sort them by trending, most likes, most downloads, or by recent updates.</p>
<p><img src="https://lh7-us.googleusercontent.com/e94ThjikQ7rAFXu-LUx6a0ZosgWFKqjfSION915OcA9fQweqZO62wLdyPkAH657OFOlO-Zw4O9WLvtQ1auZl8Oo9inxtul7J1hkuXs1Bqs10n_FRy8P6o-mhGVB_QKVEz4CHL7-mOm9wTGzbqr6gJJY" alt="Image" width="600" height="400" loading="lazy">
<em>Searching for models</em></p>
<p>Each model has a model card that contains important information, such as model details, inference examples, the training procedure, community interaction features, and a link to the files. You can also try the model on the model card page by using the Inference API section.</p>
<p><img src="https://lh7-us.googleusercontent.com/Fs-OKp8zUOF4WIN9-dFBYQIQDL5loPowHzEzIr7T8mWZltyGSDGEj8K-U-CrTZwPK3D1RjkFZwSfhNex_BhWYCYW4AkUFuADkefneuJtyHSYkDoTqAU24zqvUFdTjx978g8jfVkoajhZ9PF_lTi2Ekg" alt="Image" width="600" height="400" loading="lazy">
<em>Inference API</em></p>
<p>You can also check the list of spaces that are using that particular model and further explore the spaces by clicking on the space link.</p>
<p><img src="https://lh7-us.googleusercontent.com/z2abf18c-bvqWM82OJz7ua_sebywG4DHXQQbWE4QD0Vmv1tIOw35Okw56Va5nBrJlVRWJArC_L6RWdgYIl1nadcaRlMfbt_fyZyK6hFpDkhXAgURyDiU24hzRy91W8jQbwMbs4tavsAv2r3Di-Qjpo0" alt="Image" width="600" height="400" loading="lazy">
<em>Spaces</em></p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>In this guide, you have learned the basics of Hugging Face, and how to use its libraries, models, datasets, and spaces. But there is so much more to discover and enjoy!</p>
<p>Here are some tips on how to make the most of Hugging Face:</p>
<ul>
<li>Dive into Hugging Face’s Spaces: Spaces are where the magic happens. You can find and try out thousands of machine learning applications created by the community, and see what’s trending and popular. You can also create your own spaces and showcase your work to the world.</li>
<li>Explore the Hugging Face documentation and tutorials: If you want to learn more about the Hugging Face platform and its features, you can check out the documentation and tutorials. They provide detailed information and guidance on how to use the tools and resources that Hugging Face offers. You can also find information about common ML/AI tasks, such as text classification, image generation, and speech recognition, on the tasks page.</li>
<li>Visit the <a target="_blank" href="https://huggingface.co/learn">learn</a> section:  If you are interested in acquiring new skills and knowledge in AI and NLP, you can visit the ‘learn’ page that displays courses from Hugging Face. Here, you can learn from the experts and the best practices in the field, and apply them to your own projects.</li>
<li>Join the Hugging Face community: Machine learning is more fun when collaborating! You can join the Hugging Face community on platforms like GitHub, Discord, and Twitter to connect with other users and stay updated on the latest developments. You can also share your feedback, questions, and ideas with the community, and help Hugging Face grow and improve.</li>
</ul>
<p>Hugging Face is not just a platform for AI and NLP – it's also a playground for your curiosity and creativity. You can experiment with new models, expand your AI knowledge, and enrich your AI toolkit with various tools and resources. So, keep learning, keep exploring. There is always something new and exciting to discover with Hugging Face. 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ From Text to Meaning: How Computers Understand Language ]]>
                </title>
                <description>
                    <![CDATA[ Language is an intricate dance of words and meanings, a fundamental tool for human expression and understanding. For centuries, this dance was uniquely human. But with the advent of modern computing, a new question emerged: can machines understand ou... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-computers-understand-language/</link>
                <guid isPermaLink="false">66d035ce15ea3036a953992a</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Sat, 30 Sep 2023 14:31:13 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/09/blog_img.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Language is an intricate dance of words and meanings, a fundamental tool for human expression and understanding.</p>
<p>For centuries, this dance was uniquely human. But with the advent of modern computing, a new question emerged: can machines understand our language?</p>
<p>The answer, as many of us know, is a resounding “yes!” — but how do they do it? Let’s look at how Natural Language Processing (NLP) helps computers decode and derive context from our language.</p>
<h2 id="heading-the-building-blocks-tokens">The Building Blocks: Tokens</h2>
<p>Imagine reading a sentence.</p>
<p>To make sense of it, your brain breaks it down, recognizing individual words and their roles. Computers do something similar called tokenization.</p>
<p>Tokenization splits a piece of text into smaller units, or “tokens”, which are typically words or subwords. This is the computer’s first step in processing text data.</p>
<p>For example, the sentence “Computers are smart” would be tokenized into [‘Computers’, ‘are’, ‘smart’].</p>
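<p>Here is a minimal sketch of word-level tokenization using a regular expression from Python's standard library. The function name <code>tokenize</code> is just an illustrative choice, and production tokenizers handle punctuation, contractions, and subwords with far more care.</p>
<pre><code class="lang-python">import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores; punctuation is dropped
    return re.findall(r"\w+", text)

print(tokenize("Computers are smart"))
</code></pre>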
<h2 id="heading-understanding-word-forms-stemming-and-lemmatization">Understanding Word Forms: Stemming and Lemmatization</h2>
<p>Once a computer tokenizes a text, it needs to understand different word forms.</p>
<p>Consider the words “running”, “runner”, and “ran”. To us, they are related. But a computer sees them as separate words. Enter stemming and lemmatization.</p>
<h3 id="heading-stemming">Stemming</h3>
<p>Stemming simplifies words to their foundational form. For example, variations like “running”, “runner”, or “runs” are all stripped down to the basic root, which is “run”.</p>
<p>Stemming helps simplify the text data, making it easier for algorithms to analyze and process. While it’s useful for certain tasks, it’s important to note that stemming can sometimes lead to inaccurate results, as it might trim words too much and lose some of their original meaning.</p>
<p>For more nuanced tasks, other techniques like lemmatization might be more appropriate.</p>
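<p>To illustrate the idea, here is a deliberately naive suffix-stripping stemmer. The suffix list and the function name <code>naive_stem</code> are assumptions for this sketch. Established stemmers such as the Porter stemmer use many more rules and can give different results.</p>
<pre><code class="lang-python">SUFFIXES = ("ing", "er", "s")  # toy list, far smaller than a real stemmer's rules

def naive_stem(word):
    # Strip the first matching suffix, if any
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    # Undo a doubled final consonant: "runn" becomes "run"
    if word[-2:-1] == word[-1:] and word[-1:] not in ("", "a", "e", "i", "o", "u"):
        word = word[:-1]
    return word

for word in ["running", "runner", "runs", "bus", "ran"]:
    print(word, naive_stem(word))
</code></pre>
<p>The first three words all reduce to “run”. But “bus” gets trimmed too far and “ran” is left untouched, which shows the kind of inaccuracy mentioned above.</p>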
<h3 id="heading-lemmatization">Lemmatization</h3>
<p>Lemmatization reduces a word to its base or canonical form, called a lemma.</p>
<p>Unlike stemming, which simply trims words, lemmatization considers the context and meaning of the word. It ensures that the words are transformed into a valid base form. For instance, the word “better” might be lemmatized to “good”, and “running” would be lemmatized to “run”.</p>
<p>By using lemmatization, we can group different forms of a word together so that they’re treated as a single item. This is useful when analyzing text data, as it helps in recognizing that different word forms are essentially conveying the same concept.</p>
<p>Lemmatization often requires more computational resources than stemming since it has to consider word meanings and structures. It’s also typically dependent on dictionaries or morphological analysis tools.</p>
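<p>A dictionary-based lemmatizer can be sketched as a simple lookup. The tiny table below is hand-written for illustration. Real lemmatizers rely on full dictionaries and usually on part-of-speech information as well.</p>
<pre><code class="lang-python"># Tiny hand-written lookup table (hypothetical, for illustration only)
LEMMAS = {"better": "good", "running": "run", "ran": "run", "mice": "mouse"}

def lemmatize(word):
    # Fall back to the lowercased word when it is not in the table
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Better", "running", "ran", "cat"]])
</code></pre>
<p>Unlike suffix stripping, the lookup can map irregular forms such as “ran” and “better” to valid base words.</p>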
<h2 id="heading-understanding-context-with-syntax-and-semantics">Understanding Context with Syntax and Semantics</h2>
<p>Words interact with each other, influencing their meanings based on their neighbouring words. To grasp this context, computers analyze both syntax and semantics.</p>
<p>Take the word “bat” as an example. In the sentence “I played with the bat,” “bat” refers to a sporting tool. However, in the sentence “The bat flew in the night,” “bat” indicates a flying mammal.</p>
<p>Through syntax, computers determine a word’s function in a sentence, and with semantics, they interpret its exact meaning given that function.</p>
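<p>One simple way to picture the semantic side is to compare the words around “bat” with cue words for each meaning. The sketch below does this with hand-picked cue words, which are purely my assumption. It is a rough illustration of context-based disambiguation, not how modern NLP models actually do it.</p>
<pre><code class="lang-python"># Hand-picked cue words for each sense of "bat" (assumed for illustration)
SENSES = {
    "sporting tool": {"played", "hit", "ball", "cricket", "swing"},
    "flying mammal": {"flew", "night", "cave", "wings", "nocturnal"},
}

def guess_sense(sentence):
    words = set(sentence.lower().replace(".", "").split())
    # Pick the sense whose cue words overlap most with the sentence
    return max(SENSES, key=lambda sense: len(SENSES[sense].intersection(words)))

print(guess_sense("I played with the bat"))
print(guess_sense("The bat flew in the night"))
</code></pre>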
<h2 id="heading-the-power-of-word-embeddings">The Power of Word Embeddings</h2>
<p>Computers are great with numbers, but not so much with words.</p>
<p>To bridge this gap, words are often converted into vectors of numbers in a process called word embedding. These vectors capture the semantic meaning of words.</p>
<p>Words with similar meanings tend to have similar vectors. This numerical representation allows computers to perform mathematical operations on words, leading to tasks like finding word similarities or even analogies.</p>
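<p>A common way to measure how similar two word vectors are is cosine similarity. Here is a small sketch with made-up three-dimensional vectors. Real embeddings are learned from data and typically have hundreds of dimensions.</p>
<pre><code class="lang-python">import math

# Made-up vectors for illustration; these are not real embeddings
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"]), 3))
print(round(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"]), 3))
</code></pre>
<p>With these toy vectors, “king” scores much closer to “queen” than to “apple”, which mirrors how similar meanings end up with similar vectors.</p>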
<p>I recently published an article on word embeddings and you can <a target="_blank" href="https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/">read the full article here</a>.</p>
<h2 id="heading-the-final-piece-machine-learning">The Final Piece: Machine Learning</h2>
<p>All the above processes feed into machine learning models.</p>
<p>These models, trained on vast datasets, use patterns in the text to make determinations. Datasets can include various examples and scenarios, allowing the models to learn and recognize patterns, trends, and relationships within the text.</p>
<p>Once trained, when these models encounter new textual information, they analyze it by looking for familiar patterns they’ve learned. For example, a sentiment model decides whether a given piece of text is positive or negative, telling a review like “The movie was captivating” apart from “It was a dull watch.”</p>
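<p>To show what learning patterns from text can look like, here is a tiny Naive Bayes sentiment classifier written from scratch. The four training reviews are toy data I made up, and I'm swapping in Naive Bayes as one simple example of a model. Real systems train on vastly larger datasets and often use neural networks.</p>
<pre><code class="lang-python">import math
from collections import Counter

# Four hand-written training reviews (toy data, an assumption for illustration)
TRAIN = [
    ("the movie was captivating", "positive"),
    ("a captivating and fun story", "positive"),
    ("it was a dull watch", "negative"),
    ("dull and boring plot", "negative"),
]

# Count how often each word appears under each label
word_counts = {"positive": Counter(), "negative": Counter()}
for text, label in TRAIN:
    word_counts[label].update(text.split())

vocabulary = set(word_counts["positive"]) | set(word_counts["negative"])

def predict(text):
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Sum log-probabilities with add-one smoothing for unseen words.
        # Both labels have the same number of examples, so priors are omitted.
        scores[label] = sum(
            math.log((counts[word] + 1) / (total + len(vocabulary)))
            for word in text.split()
        )
    return max(scores, key=scores.get)

print(predict("a captivating movie"))
print(predict("what a dull plot"))
</code></pre>
<p>The model has never seen these exact sentences, but it recognizes words like “captivating” and “dull” from training and uses them to pick a label.</p>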
<p>These models can then power products like language translation and transformers. There are more steps involved in breaking down language for NLP, but these are all the ones that you will use almost on a daily basis as an AI engineer.</p>
<h2 id="heading-summary">Summary</h2>
<p>The journey from text to meaning is a complex one, even for humans. From breaking down sentences to understanding context and leveraging the power of machine learning, computers have come a long way in deciphering human language.</p>
<p>As technology continues to advance, we can anticipate even deeper interactions between humans and machines, facilitated by the power of Natural Language Processing.</p>
<p>If you found this article interesting, <a target="_blank" href="https://manishmshiva.com/">join my newsletter</a> and I'll send you an email with my content every Friday.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Understanding Word Embeddings: The Building Blocks of NLP and GPTs ]]>
                </title>
                <description>
                    <![CDATA[ Word embeddings serve as the foundation for many applications, from simple text classification to complex machine translation systems. But what exactly are word embeddings, and how do they work? Let's find out. What Are Word Embeddings? Word embeddin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/</link>
                <guid isPermaLink="false">66d0362664be048ac359a304</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Sun, 24 Sep 2023 12:17:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/09/nlp1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Word embeddings serve as the foundation for many applications, from simple text classification to complex machine translation systems. But what exactly are word embeddings, and how do they work? Let's find out.</p>
<h2 id="heading-what-are-word-embeddings">What Are Word Embeddings?</h2>
<p>Word embeddings serve as the digital DNA for words in the world of natural language processing (NLP). In essence, word embeddings convert words into numerical vectors (a fancy term for arrays of numbers). These vectors can be processed by machine learning algorithms.</p>
<p>Think of these vectors as a numeric fingerprint for each word. For example, the word “apple” might be represented by a numerical vector like [0.2, -0.4, 0.7].</p>
<p>The main benefit of word embeddings is their ability to capture the semantic essence of words. In simpler terms, they help machines understand the meaning and nuances behind each word.</p>
<p>For example, if “apple” is close to “fruit” in this numerical space but far from “car,” the machine understands that an apple is more related to fruits than to vehicles.</p>
<p>Beyond individual meaning, word embeddings also encode relationships between words. Words that often appear in similar contexts will have similar or ‘closer’ vectors.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/Screenshot-2023-09-24-at-5.43.04-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>Word embeddings</em></p>
<p>For example, in the numerical space, the vectors representing “king” and “queen” might be closer to each other than those representing “king” and “apple.” This is because the algorithm has learned from numerous texts that “king” and “queen” often appear in similar settings, such as discussions about royalty, while “king” and “apple” do not.</p>
<h2 id="heading-why-do-we-need-word-embeddings">Why Do We Need Word Embeddings?</h2>
<p>Traditional language models treated words as separate, isolated entities.</p>
<p>For instance, the word “dog” might be represented as a unique identifier, say 1, while the word "cat" as 2. This approach fails to capture the relationship between "dog" and "cat," which are both animals and pets.</p>
<p>Word embeddings solve this problem by placing words with similar meanings or contexts close to each other in a multi-dimensional space.</p>
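<p>The difference can be sketched in a few lines. Below, word IDs are turned into one-hot vectors, where every pair of distinct words looks equally unrelated, and then compared with hypothetical dense vectors chosen by hand so that the two pets sit close together. The numbers are assumptions for illustration only.</p>

```python
# Integer IDs (0-based here) turned into one-hot vectors.
VOCAB = {"dog": 0, "cat": 1, "car": 2}


def one_hot(word):
    """Vector with a single 1 at the word's ID position."""
    vec = [0] * len(VOCAB)
    vec[VOCAB[word]] = 1
    return vec


def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical dense vectors, hand-picked so the two pets sit close together.
DENSE = {"dog": [0.8, 0.6], "cat": [0.7, 0.7], "car": [-0.6, 0.1]}

# One-hot: every pair of different words looks equally unrelated.
print(dot(one_hot("dog"), one_hot("cat")), dot(one_hot("dog"), one_hot("car")))
# Dense: "dog" overlaps with "cat" far more than with "car".
print(dot(DENSE["dog"], DENSE["cat"]), dot(DENSE["dog"], DENSE["car"]))
```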
<h2 id="heading-algorithms-for-generating-embeddings">Algorithms for Generating Embeddings</h2>
<h3 id="heading-word2vec">Word2Vec</h3>
<p>Researchers at Google developed Word2Vec, which employs neural networks to generate word embeddings. The model processes a large text corpus and outputs high-quality word vectors.</p>
<p>It determines these embeddings by analyzing the context in which words appear, based on the idea that words found in similar contexts likely share semantic meaning.</p>
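<p>One way Word2Vec turns “context” into training data is its skip-gram variant, which learns from (center word, neighbouring word) pairs. The sketch below only generates such pairs and does not train a network. The window size and the sample sentence are arbitrary choices for illustration.</p>

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with the neighbours inside its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the king rules the land".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

<p>Words that keep showing up with the same neighbours end up with similar training pairs, which is what pushes their vectors together.</p>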
<h3 id="heading-glove-global-vectors-for-word-representation">GloVe (Global Vectors for Word Representation)</h3>
<p>Stanford researchers developed GloVe, which constructs a large table to monitor the frequency with which words co-occur in a text dataset. The model then employs mathematical methods to simplify this table, generating numerical vectors for individual words.</p>
<p>These vectors encapsulate both the meaning and the relationships among words, laying the groundwork for various machine-learning tasks related to language.</p>
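<p>The “large table” can be pictured as a dictionary of pair counts. The following sketch builds such a co-occurrence table for a two-sentence toy corpus. It leaves out the parts that make GloVe what it is, such as weighting pairs by distance and fitting vectors to the counts, so treat it as an illustration of the input only.</p>

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["the king rules the land", "the queen rules the land"]
table = cooccurrence_counts(corpus, window=1)
print(table[("rules", "the")])    # "rules" sits next to "the" in both sentences
print(table[("king", "queen")])   # these two never appear side by side
```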
<h3 id="heading-fasttext">FastText</h3>
<p>Facebook’s AI Research lab created FastText, which improves upon the Word2Vec model by viewing words as assemblies of smaller character strings, or character n-grams.</p>
<p>This method enables the model to more effectively capture the intricacies of languages that have complex word structures and to incorporate words not present in the original training data. Consequently, FastText yields a more adaptable and comprehensive language model useful for a diverse set of machine-learning tasks.</p>
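<p>To see why character n-grams help with unseen words, consider the sketch below. It splits words into overlapping three-character pieces and lists the pieces two words share. The boundary markers <code>^</code> and <code>$</code> are stand-ins chosen for this example (fastText itself marks word boundaries with angle brackets), and the function names are illustrative.</p>

```python
def char_ngrams(word, n=3, start="^", end="$"):
    """Split a word into overlapping character n-grams, with boundary markers."""
    marked = f"{start}{word}{end}"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def shared_ngrams(a, b):
    """N-grams two words have in common, which lets them share information."""
    return sorted(set(char_ngrams(a)) & set(char_ngrams(b)))


print(char_ngrams("where"))
print(shared_ngrams("running", "runner"))
```

<p>Because “running” and “runner” share several pieces, a model built on n-grams can say something sensible about one of them even if it only saw the other during training.</p>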
<h2 id="heading-word-embeddings-and-gpts">Word Embeddings and GPTs</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/gpt.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Word embeddings and GPTs</em></p>
<p>Word embeddings are a foundational component in GPT models like GPT-2, GPT-3, and GPT-4. However, the architecture and approach are a bit more advanced compared to simpler models that solely rely on word embeddings.</p>
<p>In traditional models that use word embeddings like Word2Vec or GloVe, each word is converted into a fixed vector in a pre-defined space. These vectors are then used as the input to other machine learning algorithms for tasks like classification, clustering, or even in sequence-to-sequence models for machine translation.</p>
<p>In contrast, GPT models use a variant known as “transformer embeddings,” which not only embeds individual words but also considers the context in which a word appears.</p>
<p>This is essential for understanding the meaning of words that can change based on their surrounding words. For example, the word “bank” could mean a financial institution or the side of a river, depending on the context.</p>
<p>The GPT architecture takes a sequence of words (or more precisely, tokens) as input and processes them through multiple layers of transformer blocks. These blocks output a new sequence of vectors that represent not just the individual words, but also their relationships with all other words in the input sequence.</p>
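<p>The core idea, that a token's output vector depends on its neighbours, can be sketched with a heavily simplified attention step. The example below blends hand-made two-dimensional vectors using softmax weights. It omits everything that makes a real transformer work well (learned query, key, and value projections, multiple heads, positional information, and stacked layers), so it is only a toy under those assumptions.</p>

```python
import math

# Hand-made 2-d token vectors; a real GPT learns these and uses many more dimensions.
EMBED = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}


def contextual_vector(tokens, position):
    """Blend all token vectors, weighted by similarity to the token at `position`."""
    query = EMBED[tokens[position]]
    # Similarity score between the chosen token and every token in the sequence.
    scores = [sum(q * k for q, k in zip(query, EMBED[t])) for t in tokens]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]  # softmax turns scores into weights
    return [sum(w * EMBED[t][d] for w, t in zip(weights, tokens))
            for d in range(len(query))]


print(contextual_vector(["river", "bank"], 1))   # leans toward the "river" axis
print(contextual_vector(["money", "bank"], 1))   # leans toward the "money" axis
```

<p>The input vector for “bank” is identical in both calls, yet the output differs because the surrounding word differs.</p>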
<p>This sequence is then used for NLP tasks, from text completion to translation and summarization.</p>
<p>So, while GPT models do use embeddings, they are far more dynamic and context-aware than traditional word embeddings. The embeddings in GPT models are part of a larger, more complex system designed to understand and generate human-like text based on the input it receives.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Word embeddings offer an effective and computationally efficient way to represent words as vectors, capturing the intricacies of language in a form that machines can understand. They lie at the heart of many NLP applications, improving the accuracy and sophistication of language models. </p>
<p>As technology continues to evolve, so will the methods for generating and utilizing word embeddings, promising even more robust and nuanced language processing capabilities in the years to come.</p>
<p><em>If you found this article interesting, <a target="_blank" href="https://manishmshiva.com/">join my newsletter</a> and I'll send you an email with my content every Friday.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Fine-Tune spaCy Models for NLP Use Cases ]]>
                </title>
                <description>
                    <![CDATA[ spaCy is an open-source software library for advanced natural language processing. It's written in the programming languages Python and Cython, and is published under the MIT license.  spaCy excels at large-scale information extraction tasks. It's wr... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-fine-tune-spacy-for-nlp-use-cases/</link>
                <guid isPermaLink="false">66ba10bf228e16bed602a89c</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Arunachalam B ]]>
                </dc:creator>
                <pubDate>Tue, 11 Jul 2023 14:11:02 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/07/Fine-tune-Spacy---Banner.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>spaCy is an open-source software library for advanced natural language processing. It's written in the programming languages Python and Cython, and is published under the MIT license. </p>
<p>spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython.</p>
<p>spaCy is designed to help us build real products, or gather real insights. It supports 73+ languages, as well as custom models built with PyTorch and TensorFlow. It's robust and has rigorously evaluated accuracy.</p>
<p>You may not know much about Cython. So let's have a quick look at it. </p>
<h2 id="heading-what-is-cython">What is Cython?</h2>
<p>Cython is a Python compiler that makes writing C extensions for Python as easy as Python itself. Cython is based on Pyrex, but supports more cutting edge functionality and optimizations. </p>
<p>To put it in simple words, it's a Python to C compiler. </p>
<p>Quoting from Wikipedia, </p>
<blockquote>
<p>Cython is a programming language, a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax. </p>
</blockquote>
<p>So you might be wondering if you need to learn Cython to help you fine-tune your spaCy models.</p>
<p>Well, don't worry. You don't need to learn Cython to work with spaCy. I just wanted to make sure you're aware of what it is to help you get the most out of this tutorial.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<h3 id="heading-basic-knowledge-of-spacy">Basic knowledge of spaCy</h3>
<p>The <a target="_blank" href="https://spacy.io/usage/spacy-101">official documentation site</a> of spaCy provides a lot of information about the tool. Alternatively, you can read my <a target="_blank" href="https://www.freecodecamp.org/news/getting-started-with-nlp-using-spacy/">other tutorial</a> which gives some basic information about spaCy. </p>
<h3 id="heading-basic-knowledge-of-how-to-gather-data">Basic knowledge of how to gather data</h3>
<p>In order to fine-tune any model, you need to have the data ready. And it should be good quality data. </p>
<p>In this tutorial, let's assume we've built event management software. We want to add voice assistance to our software. We have a module that converts voice input into text. Our next step is to process this text and extract data from the given sentence using spaCy. </p>
<p>We have to gather some basic sentences that we hear from people trying to schedule an event. Here are a few examples:</p>
<ol>
<li>Schedule event for visit to Trivandrum on July 18</li>
<li>Create event happening tomorrow on AI</li>
<li>Schedule Pongal celebration event in Oaks HOA at June 20, 2023</li>
</ol>
<p>Similarly, we have to collect prompts related to event scheduling. The more data you collect and input, the more accurate your model will be. </p>
<p>I created 7 sentences, which is far too few for an event management software company to train its model. But from a demo standpoint, it should be enough. </p>
<h2 id="heading-how-to-pre-process-the-data">How to Pre-Process the Data</h2>
<p>Collecting data covers just one part of the equation. We need to pre-process the data and transform it in a way that spaCy can easily understand. We should also define what kind of data (tags) should be identified from the given sentences. </p>
<p>Let's take the following sentence as an example:</p>
<blockquote>
<p>"Schedule event for visit to Trivandrum on July 18". </p>
</blockquote>
<p>Let's try to split out some tags from the above sentence:</p>
<ul>
<li>Schedule – this belongs to the "action" tag</li>
<li>event – this belongs to the "domain" tag</li>
<li>visit to Trivandrum – this belongs to the "name" tag</li>
<li>July 18 – this belongs to the "date" tag</li>
</ul>
<p>Every tag defined above may contain alternatives in other sentences. For example, we may input the following sentences:</p>
<ol>
<li>Cancel client meeting scheduled tomorrow</li>
<li>Change time of mall visit to 6 PM</li>
</ol>
<p>From the above sentences, the action tags are "Cancel" and "Change". Similarly, the data for each tag may vary from sentence to sentence. </p>
<p>Our next step is to teach spaCy about the words for each tag. We need to prepare a JSON file that contains examples with the tags and their indices. </p>
<p>For example, in the above sentence ("Schedule event for visit to Trivandrum on July 18"), the index for the "action" tag starts from 0 (indices always start from 0 in Python) and ends at 7. </p>
<p>Similarly for all 7 sentences I've chosen, I've prepared the index for each tag and created the JSON file. </p>
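<p>The JSON file itself is not reproduced here, so the sketch below only illustrates one plausible shape for it. The key names (<code>examples</code>, <code>content</code>, <code>annotations</code>, <code>start</code>, <code>end</code>, <code>tag_name</code>) are the ones the conversion code in this tutorial reads. The helper <code>annotate</code> is hypothetical, and it assumes each phrase occurs only once in the sentence, since <code>str.find</code> returns the first match.</p>

```python
import json


def annotate(sentence, tagged_phrases):
    """Build one example; `end` is inclusive, matching the 0-to-7 span of "Schedule"."""
    annotations = []
    for phrase, tag in tagged_phrases:
        start = sentence.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in sentence")
        annotations.append(
            {"start": start, "end": start + len(phrase) - 1, "tag_name": tag}
        )
    return {"content": sentence, "annotations": annotations}


example = annotate(
    "Schedule event for visit to Trivandrum on July 18",
    [("Schedule", "action"), ("event", "domain"),
     ("visit to Trivandrum", "name"), ("July 18", "date")],
)
print(json.dumps({"examples": [example]}, indent=2))
```

<p>Computing the indices in code rather than by hand avoids off-by-one mistakes, which are easy to make when the end index is inclusive.</p>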


<h2 id="heading-how-to-fine-tune-the-spacy-model">How to Fine-Tune the spaCy Model</h2>
<p>Let's try to fine-tune spaCy with the data that we have. </p>
<p>Create a folder and download the above JSON file and place it into the folder. Create a new file named <code>custom_model.ipynb</code>. </p>
<p>Each of the following sections needs its own code block. Create a code block wherever you see a heading. Here's a sample screenshot. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/07/image-61.png" alt="Image" width="600" height="400" loading="lazy">
<em>Creating and running blocks of code</em></p>
<h3 id="heading-import-spacy">Import spaCy</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy
</code></pre>
<h3 id="heading-load-the-pre-trained-model">Load the pre-trained model</h3>
<pre><code class="lang-python">nlp = spacy.load(<span class="hljs-string">"en_core_web_lg"</span>)
</code></pre>
<h3 id="heading-import-the-json-file">Import the JSON file</h3>
<p>Import the above downloaded JSON file. </p>
<pre><code><span class="hljs-keyword">import</span> json

<span class="hljs-keyword">with</span> open(<span class="hljs-string">'./event_schedule_data.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    data = json.load(f)
</code></pre><h3 id="heading-convert-the-data">Convert the data</h3>
<p>Convert the data read from the JSON file into a list of dictionaries containing the original text and entities. </p>
<pre><code>training_data = []
<span class="hljs-keyword">for</span> example <span class="hljs-keyword">in</span> data[<span class="hljs-string">'examples'</span>]:
    temp_dict = {}
    temp_dict[<span class="hljs-string">'text'</span>] = example[<span class="hljs-string">'content'</span>]
    temp_dict[<span class="hljs-string">'entities'</span>] = []
    <span class="hljs-keyword">for</span> annotation <span class="hljs-keyword">in</span> example[<span class="hljs-string">'annotations'</span>]:
        start = annotation[<span class="hljs-string">'start'</span>]
        end = annotation[<span class="hljs-string">'end'</span>] + <span class="hljs-number">1</span>
        label = annotation[<span class="hljs-string">'tag_name'</span>].upper()
        temp_dict[<span class="hljs-string">'entities'</span>].append((start, end, label))
    training_data.append(temp_dict)
print(training_data[<span class="hljs-number">0</span>])
</code></pre><p>The above code will convert the data to the required format and print the first dictionary in the list, which will look something like this: </p>
<pre><code class="lang-json">{'text': 'Schedule a calendar event in Teak oaks HOA about competitions happening tomorrow', 'entities': [(<span class="hljs-number">0</span>, <span class="hljs-number">8</span>, 'ACTION'), (<span class="hljs-number">11</span>, <span class="hljs-number">25</span>, 'DOMAIN'), (<span class="hljs-number">29</span>, <span class="hljs-number">42</span>, 'HOA'), (<span class="hljs-number">49</span>, <span class="hljs-number">71</span>, 'EVENT'), (<span class="hljs-number">72</span>, <span class="hljs-number">80</span>, 'DATE')]}
</code></pre>
<h3 id="heading-import-training-libraries">Import training libraries</h3>
<pre><code><span class="hljs-keyword">from</span> spacy.tokens <span class="hljs-keyword">import</span> DocBin
<span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm
<span class="hljs-keyword">from</span> spacy.util <span class="hljs-keyword">import</span> filter_spans

nlp = spacy.blank(<span class="hljs-string">'en'</span>)
</code></pre><h3 id="heading-train-the-model">Train the model</h3>
<p>The code below converts our training data into spaCy's binary training format. A binary file named <code>train.spacy</code> will be generated at the end. </p>
<pre><code class="lang-python">doc_bin = DocBin()
<span class="hljs-keyword">for</span> training_example <span class="hljs-keyword">in</span> tqdm(training_data):
    text = training_example[<span class="hljs-string">'text'</span>]
    labels = training_example[<span class="hljs-string">'entities'</span>]
    doc = nlp.make_doc(text)
    ents = []
    <span class="hljs-keyword">for</span> start, end, label <span class="hljs-keyword">in</span> labels: 
        span = doc.char_span(start, end, label=label, alignment_mode=<span class="hljs-string">"contract"</span>)
        <span class="hljs-keyword">if</span> span <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            print(<span class="hljs-string">"Skipping entity"</span>)
        <span class="hljs-keyword">else</span>:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk(<span class="hljs-string">"train.spacy"</span>)
</code></pre>
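<p>The call to <code>filter_spans</code> matters because an entity list cannot contain overlapping spans. spaCy documents that it prefers the longest span and, on ties, the first one. The sketch below mimics that rule on plain <code>(start, end, label)</code> tuples so you can see the effect without spaCy objects. The function name <code>filter_overlaps</code> and the overlapping <code>EVENT</code> span are made up for illustration.</p>

```python
def filter_overlaps(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    kept = []
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    for start, end, label in ordered:
        # Two spans are disjoint when one ends before the other begins.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


spans = [(11, 25, "DOMAIN"), (20, 25, "EVENT"), (0, 8, "ACTION")]
print(filter_overlaps(spans))
```

<p>Here the shorter <code>EVENT</code> span is dropped because it lies inside the longer <code>DOMAIN</code> span.</p>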
<h3 id="heading-create-config-files">Create config files</h3>
<p>Create a new file named <code>base_config.cfg</code> and copy the below code into it. </p>


<p>Create another file named <code>config.cfg</code> and copy the below code into it. </p>


<p>Don't worry. These are default configurations that I've taken from spaCy's official documentation, and I've not made any changes to them. </p>
<h3 id="heading-initialize-spacy-with-the-config-files">Initialize spaCy with the config files</h3>
<p>Run the following command in a notebook code block to initialize spaCy with the config file. This config file will be used to train the spaCy model with our generated training data. </p>
<pre><code class="lang-python">!python -m spacy init fill-config base_config.cfg config.cfg
</code></pre>
<h3 id="heading-train-spacy-model">Train spaCy model</h3>
<p>Run the following command to train the spaCy model:</p>
<pre><code>!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy
</code></pre><p>This may take some time depending on your system configuration. Ideally not too long (around 5 to 10 minutes). At the end, it'll generate 2 folders named <code>model-best</code> and <code>model-last</code>. </p>
<h3 id="heading-load-the-best-model">Load the best model</h3>
<pre><code>nlp_ner = spacy.load(<span class="hljs-string">"model-best"</span>)
</code></pre><h3 id="heading-test-our-model">Test our model</h3>
<p>Let's test our model with the following input. </p>
<p>"Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?"</p>
<pre><code class="lang-python">doc = nlp_ner(<span class="hljs-string">"Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?"</span>)

spacy.displacy.render(doc, style=<span class="hljs-string">"ent"</span>)
</code></pre>
<p>You may be surprised by the output. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/07/image-45.png" alt="Image" width="600" height="400" loading="lazy">
<em>Entity representation of our input sentence</em></p>
<p>That's great, right? </p>
<p>Eventually, you may have a question: as a programmer, how can I get this data in my backend code? </p>
<p>Well, that's something everyone asks. </p>
<p>spaCy has an answer for it. You can expose the above data as JSON. </p>
<h3 id="heading-convert-extracted-data-to-json">Convert extracted data to JSON</h3>
<pre><code class="lang-python">json_obj = doc.to_json()
json_obj
</code></pre>
<p>This will show output similar to the one below. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/07/image-46.png" alt="Image" width="600" height="400" loading="lazy">
<em>JSON of our test sentence input</em></p>
<p>Write a REST API and expose this data as JSON. That's it. But remember, spaCy will give you only the indices. You have to parse your sentence and extract the words between those indices. </p>
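<p>Extracting the words from the indices is a matter of slicing the original text. The sketch below works on a hand-written dictionary shaped like the output of <code>doc.to_json()</code>, with a <code>text</code> field and an <code>ents</code> list of character offsets. The sample labels and offsets are illustrative, not actual model output, and <code>extract_entities</code> is a hypothetical helper.</p>

```python
def extract_entities(doc_json):
    """Slice the entity text out of the sentence using character offsets."""
    text = doc_json["text"]
    entities = []
    for ent in doc_json["ents"]:
        value = text[ent["start"]:ent["end"]]  # the end offset is exclusive
        entities.append({"label": ent["label"], "value": value})
    return entities


# Hand-written stand-in for doc.to_json(); the labels and offsets are
# illustrative, not actual model output.
sample = {
    "text": "Schedule event for visit to Trivandrum on July 18",
    "ents": [
        {"start": 0, "end": 8, "label": "ACTION"},
        {"start": 42, "end": 49, "label": "DATE"},
    ],
}
print(extract_entities(sample))
```

<p>The resulting list of label and value pairs is what a REST endpoint could return to the rest of your backend.</p>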
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we learnt about how to customize and fine-tune a pre-trained spaCy model with the data that corresponds to our domain knowledge. </p>
<p>Similarly, you can train a model with your own domain-specific data. The model that you fine-tune will be private to you unless you expose it to the public. So it's best suited for training with domain data that is not publicly available. </p>
<p>If you wish to learn more about NLP/Machine Learning, subscribe to my <a target="_blank" href="https://5minslearn.gogosoon.com/?ref=fcc_fine_tune_spacy">email newsletter</a> (<a target="_blank" href="https://5minslearn.gogosoon.com/?ref=fcc_fine_tune_spacy">https://5minslearn.gogosoon.com/</a>) and follow me on social media.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ NLP using spaCy – How to Get Started with Natural Language Processing ]]>
                </title>
                <description>
                    <![CDATA[ In today's data-driven world, vast amounts of unstructured text data are generated every day. And to help handle all that data, Natural Language Processing (NLP) has emerged as a transformative technology.  NLP is a sub-field of artificial intelligen... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/getting-started-with-nlp-using-spacy/</link>
                <guid isPermaLink="false">66ba10af052fa53219e0a374</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Arunachalam B ]]>
                </dc:creator>
                <pubDate>Mon, 26 Jun 2023 23:10:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/06/NLP-using-Spacy---Banner.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today's data-driven world, vast amounts of unstructured text data are generated every day. And to help handle all that data, Natural Language Processing (NLP) has emerged as a transformative technology. </p>
<p>NLP is a sub-field of artificial intelligence. It focuses on enabling machines to understand, interpret, and generate human language. </p>
<p>In this tutorial, we'll explore the fundamental concepts of NLP and we'll look at a particular implementation with spaCy. This will showcase its immense potential to revolutionize various industries.</p>
<p>Let's have a quick look at Natural Language Processing before we begin with spaCy.</p>
<h2 id="heading-what-is-natural-language-processing">What is Natural Language Processing?</h2>
<p>NLP involves the intersection of linguistics, computer science, and machine learning. Its primary objective is to bridge the gap between human language and machine understanding. </p>
<p>NLP encompasses a wide range of tasks, including Text Classification, Named Entity Recognition (NER), Sentiment Analysis, and more.</p>
<h3 id="heading-text-classification">Text Classification</h3>
<p>This involves categorizing text into predefined classes or categories based on its content. </p>
<p>This has applications in sentiment analysis, spam detection, topic classification, and more.</p>
<h3 id="heading-named-entity-recognition-ner">Named Entity Recognition (NER)</h3>
<p>This involves identifying and extracting named entities such as names, organizations, locations, and dates from text. </p>
<p>NER is crucial for information extraction, question answering systems, and recommendation engines. </p>
<h3 id="heading-sentiment-analysis">Sentiment Analysis</h3>
<p>This involves determining the sentiment or emotion expressed in a piece of text, whether it's positive, negative, or neutral. </p>
<p>Sentiment analysis is extensively used for brand monitoring, customer feedback analysis, and social media monitoring. </p>
<h2 id="heading-challenges-in-natural-language-processing">Challenges in Natural Language Processing</h2>
<p>While NLP has made significant advancements, several challenges persist:</p>
<ol>
<li>Human language is inherently ambiguous, making it challenging sometimes for machines to accurately understand and interpret meaning.</li>
<li>Different languages, dialects, slang, and cultural nuances add complexity to NLP tasks, requiring models to be language-specific and adaptable.</li>
<li>Capturing contextual information and understanding the underlying semantics of text remains a significant challenge for NLP algorithms.</li>
<li>NLP models heavily rely on training data, and biased or low-quality data can result in biased or inaccurate predictions, leading to potential ethical concerns.</li>
</ol>
<h2 id="heading-what-is-spacy">What is spaCy?</h2>
<p>In the world of Natural Language Processing (NLP), spaCy has emerged as a powerful and efficient library, revolutionizing the way developers and researchers work with text data. </p>
<p>spaCy is an open-source Python library designed specifically for NLP tasks such as part-of-speech tagging, named entity recognition, dependency parsing, and more. </p>
<p>It was developed with the goal of providing industrial-strength performance, while still being easy to use and integrate into existing workflows. </p>
<p>spaCy is built on the latest research and implements state-of-the-art techniques, making it an ideal choice for both beginners and experienced NLP practitioners.</p>
<h2 id="heading-key-features-of-spacy">Key Features of spaCy</h2>
<h3 id="heading-linguistic-annotations">Linguistic Annotations</h3>
<p>spaCy provides a wide range of pre-trained models that can quickly analyze text and extract various linguistic features. These features include part-of-speech tags, named entities, syntactic dependencies, sentence boundaries, and more. </p>
<p>The pre-trained models are trained on large corpora and have high accuracy, allowing developers to focus on their specific NLP tasks without worrying about training models from scratch.</p>
<h3 id="heading-tokenization-and-sentence-segmentation">Tokenization and Sentence Segmentation</h3>
<p>Tokenization is a crucial step in NLP that breaks down text into individual words or subwords. spaCy's tokenization algorithms are highly efficient and language-specific, allowing for accurate and customizable tokenization. </p>
<p>spaCy can also automatically segment text into sentences, making it easy to work with text data at a granular level. </p>
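<p>To get a feel for what these two steps produce, here is a deliberately naive sketch that uses regular expressions from the standard library. It is not how spaCy works: spaCy's tokenizer applies language-specific rules and exceptions that a pair of regexes cannot capture. The function names are illustrative.</p>

```python
import re


def tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(text):
    """Naively break text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "spaCy is fast. It scales well!"
print(tokenize(text))
print(split_sentences(text))
```

<p>Abbreviations, contractions, URLs, and similar cases quickly break an approach like this, which is one reason to rely on a dedicated tokenizer.</p>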
<h3 id="heading-entity-recognition">Entity Recognition</h3>
<p>Named Entity Recognition (NER) is the task of identifying and classifying named entities such as persons, organizations, locations, dates, and more. </p>
<p>spaCy's NER capabilities are exceptional, providing out-of-the-box support for multiple languages. It allows developers to train custom NER models using their own labeled data, enabling domain-specific entity recognition. </p>
<h3 id="heading-dependency-parsing">Dependency Parsing</h3>
<p>Dependency parsing involves analyzing the grammatical structure of a sentence by determining the relationships between words. </p>
<p>spaCy's dependency parsing is based on efficient algorithms and achieves high accuracy. It provides a rich set of syntactic annotations, including the head of each word, the dependency label, and the subtree structure. </p>
<p>This information is invaluable for tasks like information extraction, question answering, and sentiment analysis.</p>
<h3 id="heading-customization-and-extensibility">Customization and Extensibility</h3>
<p>One of spaCy's major strengths is its flexibility and extensibility. Developers can easily customize and fine-tune spaCy's models to adapt to specific domains or improve performance on specific tasks. </p>
<p>The library also provides a straightforward API for adding custom components, such as new tokenizers, entity recognizers, or syntactic parsers, making it a versatile tool for research and development. </p>
<h3 id="heading-performance-and-scalability">Performance and Scalability</h3>
<p>spaCy is known for its exceptional performance and scalability. The library is implemented in Cython, a programming language that compiles Python-like code into highly efficient C/C++ modules. This allows spaCy to process text data blazingly fast, making it suitable for large-scale NLP applications and real-time systems. </p>
<h2 id="heading-named-entity-recognition-example-in-spacy">Named Entity Recognition Example in spaCy</h2>
<p>Let's try to implement NER using spaCy. </p>
<p>I'll be using Google Colab. Google Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs. </p>
<p>You can use Kaggle instead or run it on your own computer if you'd like. Since spaCy ships with pre-trained models, it does not require much computing power to get started. </p>
<p>But I'd advise that you set up Anaconda on your machine if you're working on machine learning problems. </p>
<p>Navigate to <a target="_blank" href="https://colab.research.google.com">https://colab.research.google.com</a> and click on the "New Notebook" button. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-244.png" alt="Image" width="600" height="400" loading="lazy">
<em>Google Colab Console</em></p>
<p>In the header, enter a name for your file. Ensure your file name ends with the <code>.ipynb</code> extension. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-245.png" alt="Image" width="600" height="400" loading="lazy">
<em>Change file name and create a code block</em></p>
<p>Click on the "+ Code" button to create a code block. </p>
<p>By default, Google Colab comes with many machine learning tools and Python libraries pre-installed. So, we don't have to worry about installations and getting our development environment ready. </p>
<p>But it doesn't come with the <code>spacy</code> library. </p>
<p>Run the following command inside the code block to install the <code>spacy</code> library. </p>
<pre><code class="lang-bash">!pip install -U spacy
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-246.png" alt="Image" width="600" height="400" loading="lazy">
<em>Install <code>spacy</code> library</em></p>
<p>spaCy provides several pre-trained English models, listed below. Choose whichever one you want and proceed. The major differences between them are their size, speed, and accuracy. </p>
<ul>
<li>Small – <a target="_blank" href="https://spacy.io/models/en#en_core_web_sm">en_core_web_sm</a></li>
<li>Medium – <a target="_blank" href="https://spacy.io/models/en#en_core_web_md">en_core_web_md</a></li>
<li>Large – <a target="_blank" href="https://spacy.io/models/en#en_core_web_lg">en_core_web_lg</a></li>
<li>Transformer – <a target="_blank" href="https://spacy.io/models/en#en_core_web_trf">en_core_web_trf</a></li>
</ul>
<p>Our next step is to download one of these models. Add a code block and run the following command, replacing the model name with the one you chose from the list above. I'll be downloading the large model.</p>
<pre><code class="lang-bash">!python -m spacy download en_core_web_lg
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-247.png" alt="Image" width="600" height="400" loading="lazy">
<em>Download pre-trained model</em></p>
<p>Add a code block and run the following command to load the model. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> spacy
nlp = spacy.load(<span class="hljs-string">"en_core_web_lg"</span>)
</code></pre>
<p>Alright. We're all set. </p>
<p>Let's try to extract entities from a sentence. Add a code block and run the following block of code:</p>
<pre><code class="lang-python">doc = nlp(<span class="hljs-string">"Apple is looking at buying U.K. startup for $1 billion"</span>)

<span class="hljs-keyword">for</span> ent <span class="hljs-keyword">in</span> doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
</code></pre>
<p>In the above code, we're asking the spaCy model to find the entities from the sentence "Apple is looking at buying U.K. startup for $1 billion". </p>
<p>We're then iterating through each entity and displaying its text, its start and end character indices in the sentence, and its entity label.</p>
<p>You should be seeing the following output:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-248.png" alt="Image" width="600" height="400" loading="lazy">
<em>Named Entity Recognition example 1 from <code>spaCy</code></em></p>
<p>The above output shows that "Apple" is an entity that spans character indices 0 to 5 in the given sentence, and that it is an organisation (ORG). </p>
<p>If you're confused about the indices, remember that they start from 0 and that the end index is exclusive. The first 5 characters of the input text (indices 0 through 4) are "Apple". So, the entity runs from 0 to 5. </p>
<p>Similarly, it identifies "U.K." as a geopolitical entity (GPE) and labels "$1 billion" as a monetary value (MONEY). </p>
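<p>If the offsets still feel abstract, the following minimal sketch reproduces the idea with plain Python string slicing. It does not call spaCy. The helper <code>char_span</code> is a hypothetical name introduced only for this illustration, and it simply locates each phrase with <code>str.find</code>.</p>

```python
# Character offsets behave like Python slices: start is inclusive, end is exclusive.
text = "Apple is looking at buying U.K. startup for $1 billion"

def char_span(text, phrase):
    """Return (start_char, end_char) of the first occurrence of phrase (hypothetical helper)."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found")
    return start, start + len(phrase)

for phrase in ["Apple", "U.K.", "$1 billion"]:
    start, end = char_span(text, phrase)
    # Slicing with the two offsets gives back exactly the phrase.
    print(phrase, start, end, text[start:end] == phrase)
```

<p>Because the end offset is exclusive, <code>end - start</code> is always the length of the entity text.</p>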
<p>Let's try a different sentence this time. </p>
<p>"Prime Minister of India Narendra Modi met US President Joe Biden at Washington DC". </p>
<p>Let's see which entities it finds. Add a code block and run the following code:</p>
<pre><code class="lang-python">doc = nlp(<span class="hljs-string">"Prime Minister of India Narendra Modi met US President Joe Biden at Washington DC"</span>)

<span class="hljs-keyword">for</span> ent <span class="hljs-keyword">in</span> doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/06/image-249.png" alt="Image" width="600" height="400" loading="lazy">
<em>Named Entity Recognition example 2 from <code>spaCy</code></em></p>
<p>That's awesome, isn't it? </p>
<p>It has identified "India", "US", and "Washington DC" as Geopolitical entities (GPE). It has also identified "Narendra Modi" and "Joe Biden" as Person entities (PERSON). </p>
<p>Try to input different sentences and play around with it. I'm sure you'll be amazed at its capabilities in identifying entities. </p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we learnt about NLP with a simple implementation using the spaCy library. </p>
<p>Natural Language Processing holds immense potential to transform the way we interact with machines and analyze vast amounts of textual data. spaCy has become a go-to library for many NLP practitioners due to its powerful features, ease of use, and exceptional performance. </p>
<p>If you wish to learn more about NLP/Machine Learning, subscribe to my <a target="_blank" href="https://5minslearn.gogosoon.com/?ref=fcc_getting_started_nlp_spacy">email newsletter</a> (<a target="_blank" href="https://5minslearn.gogosoon.com/?ref=fcc_getting_started_nlp_spacy">https://5minslearn.gogosoon.com/</a>) and follow me on social media. </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Topic Modeling Tutorial – How to Use SVD and NMF in Python ]]>
                </title>
                <description>
                    <![CDATA[ In the context of Natural Language Processing (NLP), topic modeling is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents.  Topic Modeling answers the question: "Given a text corpus of many documents, ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/advanced-topic-modeling-how-to-use-svd-nmf-in-python/</link>
                <guid isPermaLink="false">66bb8b17c332a9c775d15b66</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ topic modeling ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bala Priya C ]]>
                </dc:creator>
                <pubDate>Tue, 21 Feb 2023 18:32:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/brett-jordan-M3cxjDNiLlQ-unsplash-cover-img.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In the context of Natural Language Processing (NLP), <strong>topic modeling</strong> is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents. </p>
<p><strong>Topic Modeling</strong> answers the question: "Given a text corpus of many documents, can we find the abstract topics that the text is talking about?"</p>
<p>In this tutorial, you’ll:</p>
<ul>
<li>Learn about two powerful matrix factorization techniques - <strong>Singular Value Decomposition (SVD)</strong> and <strong>Non-negative Matrix Factorization (NMF)</strong></li>
<li>Use them to find topics in a collection of documents</li>
</ul>
<p>By the end of this tutorial, you'll be able to build your own topic models to find topics in any piece of text.📚📑 </p>
<p>Let's get started.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><a class="post-section-overview" href="#heading-what-is-topic-modeling">What is Topic Modeling?</a></li>
<li><a class="post-section-overview" href="#heading-tf-idf-score-equation">TF-IDF Score Equation</a></li>
<li><a class="post-section-overview" href="#heading-topic-modeling-using-singular-value-decomposition-svd">Topic Modeling Using Singular Value Decomposition (SVD)</a></li>
<li><a class="post-section-overview" href="#heading-what-is-truncated-svd-or-k-svd">What is Truncated SVD or k-SVD?</a></li>
<li><a class="post-section-overview" href="#heading-topic-modeling-using-non-negative-matrix-factorization-nmf">Topic Modeling Using Non-Negative Matrix Factorization (NMF)</a></li>
<li><a class="post-section-overview" href="#heading-7-steps-to-use-svd-for-topic-modeling">7 Steps to Use SVD for Topic Modeling</a></li>
<li><a class="post-section-overview" href="#heading-how-to-visualize-topics-as-word-clouds">How to Visualize Topics as Word Clouds</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-nmf-for-topic-modeling">How to Use NMF for Topic Modeling</a></li>
<li><a class="post-section-overview" href="#heading-svd-vs-nmf-an-overview-of-the-differences">SVD vs NMF – An Overview of the Differences</a></li>
</ol>
<h2 id="heading-what-is-topic-modeling">What is Topic Modeling?</h2>
<p>Let's start by understanding what topic modeling is.</p>
<p>Suppose you're given a large text corpus containing several documents. You'd like to know the <strong>key topics</strong> that reside in the given collection of documents without reading through each document.</p>
<p>Topic Modeling helps you distill the information in the large text corpus into a certain number of topics. Topics are groups of words that are <em>similar in context</em> and are indicative of the information in the collection of documents.</p>
<p>The general structure of the Document-Term Matrix for a text corpus containing <code>M</code> documents, and <code>N</code> terms in all, is shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Structure of the Document-Term Matrix</em></p>
<p>Let's parse the matrix representation:</p>
<ul>
<li>D1, D2, ..., DM are the M documents.</li>
<li>T1, T2, ..., TN are the N terms</li>
</ul>
<p>To populate the Document-Term Matrix, let’s use the widely-used metric—the TF-IDF Score.</p>
<h2 id="heading-tf-idf-score-equation">TF-IDF Score Equation</h2>
<p>The TF-IDF score is given by the following equation:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>where,</p>
<ul>
<li><code>TF_ij</code> is the number of times the term <code>T_j</code> occurs in the document <code>D_i</code>.</li>
<li><code>df_j</code> is the number of documents containing the term <code>T_j</code>.</li>
</ul>
<p>A term that occurs frequently in a particular document and rarely across the rest of the corpus has a higher TF-IDF score. </p>
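<p>To see how the score behaves, here is a small pure-Python sketch that fills a Document-Term Matrix for a toy corpus. It assumes the common form <code>TF_ij * log(M / df_j)</code>, which may differ from the exact equation in the image above. scikit-learn's <code>TfidfVectorizer</code> applies smoothing and normalization on top of this, so its numbers will differ. The corpus and the helper name <code>tf_idf</code> are made up for illustration.</p>

```python
import math

# Toy corpus: each document is already tokenized (hypothetical example data).
docs = [
    ["python", "code", "code"],
    ["python", "software"],
    ["software", "engineering"],
]
M = len(docs)
terms = sorted({t for doc in docs for t in doc})

# df_j: number of documents that contain the term T_j
df = {t: sum(t in doc for doc in docs) for t in terms}

def tf_idf(doc, term):
    # Assumed form: TF_ij * log(M / df_j)
    return doc.count(term) * math.log(M / df[term])

# Document-Term Matrix: one row per document, one column per term
dtm = [[tf_idf(doc, t) for t in terms] for doc in docs]

print(terms)
for row in dtm:
    print([round(x, 3) for x in row])
```

<p>In the first document, "code" appears twice and occurs in no other document, so it scores higher than "python", which appears once and is shared with the second document.</p>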
<p>I hope you’ve now gained a cursory understanding of the DTM and the TF-IDF score. Let’s now go over the matrix factorization techniques.</p>
<h2 id="heading-topic-modeling-using-singular-value-decomposition-svd">Topic Modeling Using Singular Value Decomposition (SVD)</h2>
<p>The use of Singular Value Decomposition (SVD) for topic modeling is explained in the figure below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/3.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Singular Value Decomposition on the Document-Term Matrix D gives the following three matrices:</p>
<ul>
<li>The left singular vector matrix <strong>U</strong>. This matrix is obtained by the eigen decomposition of the Gram matrix <strong>D.D_T</strong>—also called the document similarity matrix. The i,j-th entry of the document similarity matrix signifies how similar document <code>i</code> is to document <code>j</code>.</li>
<li>The matrix of singular values <strong>S</strong>. The singular values signify the relative importance of the topics.</li>
<li>The right singular vector matrix <strong>V_T</strong>, which is also called the term-topic matrix. The topics in the text reside along the rows of this matrix.</li>
</ul>
<p>If you'd like to refresh the concept of eigen decomposition, here's an excellent tutorial by <a target="_blank" href="https://www.youtube.com/c/3blue1brown">Grant Sanderson from 3Blue1Brown</a>. It explains eigenvectors and eigenvalues visually.</p>
<p><a target="_blank" href="https://www.youtube.com/embed/PFDu9oVAE-g">Embedded content</a></p>
<p>It's totally fine if you find the working of SVD a bit difficult to understand. 🙂 For now, you may think of SVD as a black box that operates on your Document-Term Matrix (DTM) and yields 3 matrices, <strong>U, S</strong>, and <strong>V_T</strong>. And the topics reside along the rows of the matrix <strong>V_T</strong>. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/4.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><strong>Note</strong>: Applying SVD to a Document-Term Matrix in this way is also called <strong>Latent Semantic Indexing (LSI)</strong> or Latent Semantic Analysis (LSA).</p>
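<p>To make the three matrices concrete, here is a minimal NumPy sketch. The matrix <code>D</code> holds made-up scores for 4 documents and 5 terms, so treat it as an illustration under that assumption, not as output from a real corpus. It checks that the factors multiply back to <code>D</code> and that the columns of <code>U</code> are eigenvectors of <code>D.D_T</code>.</p>

```python
import numpy as np

# Hypothetical 4-document x 5-term matrix of TF-IDF-like scores
D = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.7, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.9, 0.0],
    [0.0, 0.0, 0.9, 0.7, 0.2],
])

# full_matrices=False gives the compact SVD: U is 4x4, S has 4 values, V_T is 4x5
U, S, V_T = np.linalg.svd(D, full_matrices=False)

print(U.shape, S.shape, V_T.shape)
print(np.round(S, 3))  # singular values, sorted in descending order

# SVD is exact: multiplying the three factors recovers D
print(np.allclose(U @ np.diag(S) @ V_T, D))

# Columns of U are eigenvectors of D.D_T, with eigenvalues S**2
print(np.allclose(D @ D.T @ U, U * S**2))
```

<p>Each row of <code>V_T</code> assigns a weight to every term, which is why the topics are read off the rows of that matrix.</p>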
<h2 id="heading-what-is-truncated-svd-or-k-svd">What is Truncated SVD or k-SVD?</h2>
<p>Suppose you have a text corpus of 150 documents. Would you prefer skimming through 150 different topics that describe the corpus, or would you be happy reading through 10 topics that can convey the content of the corpus?</p>
<p>Well, it's often helpful to fix a small number of topics that best convey the content of the text. And this is what motivates <strong>k-SVD</strong>.</p>
<p>As matrix multiplication requires a lot of computation, it's preferred to choose the <strong>k largest singular values</strong>, and the topics corresponding to them. The working of k-SVD is illustrated below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/5.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
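<p>The truncation step can be sketched in a few lines of NumPy. The matrix below is random and purely hypothetical, and <code>k = 2</code> is an arbitrary choice. The sketch keeps the <code>k</code> largest singular values and checks that the approximation error is exactly the part carried by the discarded singular values.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((6, 8))  # hypothetical 6-document x 8-term matrix

U, S, V_T = np.linalg.svd(D, full_matrices=False)

k = 2  # number of topics to keep
D_k = U[:, :k] @ np.diag(S[:k]) @ V_T[:k, :]  # rank-k approximation of D

# The approximation error equals the energy in the discarded singular values
error = np.linalg.norm(D - D_k)  # Frobenius norm by default for matrices
discarded = np.sqrt(np.sum(S[k:] ** 2))
print(D_k.shape, np.isclose(error, discarded))

topics = V_T[:k, :]  # each row is one topic over the 8 terms
print(topics.shape)
```

<p>scikit-learn's <code>TruncatedSVD</code> computes only the top <code>k</code> components instead of the full decomposition, which is what makes it practical for large matrices.</p>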
<h2 id="heading-topic-modeling-using-non-negative-matrix-factorization-nmf">Topic Modeling Using Non-Negative Matrix Factorization (NMF)</h2>
<p>Non-negative Matrix Factorization (NMF) works as shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/6.jpeg" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Non-negative Matrix Factorization acts on the Document-Term Matrix and yields the following:</p>
<ul>
<li>The matrix <strong>W</strong> which is called the <strong>document-topic matrix</strong>. This matrix shows the distribution of the topics across the documents in the corpus.</li>
<li>The matrix <strong>H</strong> which is also called the <strong>term-topic matrix</strong>. This matrix captures the significance of terms across the topics.</li>
</ul>
<p>NMF is easier to interpret as all the elements of the matrices <strong>W</strong> and <strong>H</strong> are now non-negative. So a higher score corresponds to greater relevance.</p>
<p><strong>But how do we get matrices W and H?</strong> </p>
<p>NMF is a <em>non-exact</em> matrix factorization technique. This means that multiplying W and H gives only an approximation of the original document-term matrix V, not V itself. </p>
<p>The matrices W and H are initialized randomly. And the algorithm is run iteratively until we find a W and H that minimize the cost function. </p>
<p>The cost function is the Frobenius norm of the matrix <strong>V - W.H</strong>, as shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/e2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The Frobenius norm of a matrix A with m rows and n columns is given by the following equation:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/e3.png" alt="Image" width="600" height="400" loading="lazy"></p>
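<p>The iterative search for W and H can be sketched with NumPy as follows. This sketch uses the classic multiplicative update rules, which is an assumption made for brevity: scikit-learn's <code>NMF</code> uses a different solver by default. The function names, the random input matrix, and the small constant <code>eps</code> (added to avoid division by zero) are all hypothetical.</p>

```python
import numpy as np

def frobenius_norm(A):
    # Square root of the sum of squared entries
    return np.sqrt(np.sum(A ** 2))

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative V (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))  # random non-negative initialization
    H = rng.random((k, n))
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(frobenius_norm(V - W @ H))
    return W, H, errors

V = np.random.default_rng(1).random((6, 8))  # hypothetical non-negative DTM
W, H, errors = nmf_multiplicative(V, k=2)
print(W.shape, H.shape)
print(round(errors[0], 4), round(errors[-1], 4))  # cost after the first and last iteration
```

<p>The recorded cost shrinks as the iterations proceed, while W and H never leave the non-negative range.</p>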
<h2 id="heading-7-steps-to-use-svd-for-topic-modeling">7 Steps to Use SVD for Topic Modeling</h2>
<p>1️⃣ To use SVD to get topics, let's first get a text corpus. The following code cell contains a piece of text on <a target="_blank" href="https://en.wikipedia.org/wiki/Computer_programming">computer programming</a>.</p>
<pre><code class="lang-python">text=[<span class="hljs-string">"Computer programming is the process of designing and building an executable computer program to accomplish a specific computing result or to perform a specific task."</span>,

      <span class="hljs-string">"Programming involves tasks such as: analysis, generating algorithms, profiling algorithms' accuracy and resource consumption, and the implementation of algorithms in a chosen programming language (commonly referred to as coding)."</span>,

      <span class="hljs-string">"The source program is written in one or more languages that are intelligible to programmers, rather than machine code, which is directly executed by the central processing unit."</span>,

      <span class="hljs-string">"The purpose of programming is to find a sequence of instructions that will automate the performance of a task (which can be as complex as an operating system) on a computer, often for solving a given problem."</span>,

      <span class="hljs-string">"Proficient programming thus often requires expertise in several different subjects, including knowledge of the application domain, specialized algorithms, and formal logic."</span>,

      <span class="hljs-string">"Tasks accompanying and related to programming include: testing, debugging, source code maintenance, implementation of build systems, and management of derived artifacts, such as the machine code of computer programs."</span>,

      <span class="hljs-string">"These might be considered part of the programming process, but often the term software development is used for this larger process with the term programming, implementation, or coding reserved for the actual writing of code."</span>,

      <span class="hljs-string">"Software engineering combines engineering techniques with software development practices."</span>,

    <span class="hljs-string">"Reverse engineering is a related process used by designers, analysts and programmers to understand and re-create/re-implement"</span>]
</code></pre>
<p>The text for which you need to find topics is now ready.</p>
<p>2️⃣ The next step is to import the <code>TfidfVectorizer</code> class from scikit-learn's feature extraction module for text data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
</code></pre>
<p>You'll use the <code>TfidfVectorizer</code> class to get the DTM populated with the TF-IDF scores for the text corpus.</p>
<p>3️⃣ To use <strong>Truncated SVD (k-SVD)</strong> discussed earlier, you need to import the <code>TruncatedSVD</code> class from scikit-learn's <code>decomposition</code> module:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> TruncatedSVD
</code></pre>
<p>▶ Now that you've imported all the necessary modules, it's time to start your quest for topics in the text.</p>
<p>4️⃣ In this step, you'll instantiate a <code>TfidfVectorizer</code> object. Let's call it <code>vectorizer</code>.</p>
<pre><code class="lang-python">vectorizer = TfidfVectorizer(stop_words=<span class="hljs-string">'english'</span>,smooth_idf=<span class="hljs-literal">True</span>) 
<span class="hljs-comment"># under the hood - lowercasing,removing special chars,removing stop words</span>
<span class="hljs-comment"># toarray() returns a NumPy array; recent scikit-learn versions reject the np.matrix that todense() returns</span>
input_matrix = vectorizer.fit_transform(text).toarray()
</code></pre>
<p>So far, you've:</p>
<p>☑ collected the text,<br>☑ imported the necessary modules, and<br>☑ obtained the input DTM.</p>
<p>Now you'll proceed with using SVD to obtain topics.</p>
<p>5️⃣ You'll now use the <code>TruncatedSVD</code> class that you imported in step 3️⃣.</p>
<pre><code class="lang-python">svd_modeling= TruncatedSVD(n_components=<span class="hljs-number">4</span>, algorithm=<span class="hljs-string">'randomized'</span>, n_iter=<span class="hljs-number">100</span>, random_state=<span class="hljs-number">122</span>)
svd_modeling.fit(input_matrix)
components=svd_modeling.components_
vocab = vectorizer.get_feature_names_out()
</code></pre>
<p>6️⃣ Let’s write a function that gets the topics for us.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_topics</span>(<span class="hljs-params">components</span>):</span>
  <span class="hljs-comment"># Collect the 7 highest-weighted terms for each topic (each row of components)</span>
  topic_word_list = []
  <span class="hljs-keyword">for</span> i, comp <span class="hljs-keyword">in</span> enumerate(components):
    terms_comp = zip(vocab, comp)
    sorted_terms = sorted(terms_comp, key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)[:<span class="hljs-number">7</span>]
    topic = <span class="hljs-string">" "</span>.join(t[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> sorted_terms)
    topic_word_list.append(topic)
    print(<span class="hljs-string">f"Topic <span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>:"</span>)
    print(<span class="hljs-string">" "</span>, topic)
  <span class="hljs-keyword">return</span> topic_word_list
topic_word_list = get_topics(components)
</code></pre>
<p>7️⃣ And it's time to view the topics, and see if they make sense. When you call the <code>get_topics()</code> function with the components obtained from SVD as the argument, you'll get a list of topics, and the top words in each of those topics.</p>
<pre><code class="lang-python">Topic <span class="hljs-number">1</span>: 
  code programming process software term computer engineering

Topic <span class="hljs-number">2</span>: 
  engineering software development combines practices techniques used

Topic <span class="hljs-number">3</span>: 
  code machine source central directly executed intelligible

Topic <span class="hljs-number">4</span>: 
  computer specific task automate complex given instructions
</code></pre>
<p>And you have your topics in just 7 steps. Do the topics look good?</p>
<h2 id="heading-how-to-visualize-topics-as-word-clouds">How to Visualize Topics as Word Clouds</h2>
<p>In the previous section, you printed out the topics, and made sense of the topics using the top words in each topic.</p>
<p>Another popular visualization method for topics is the <strong>word cloud</strong>. In a word cloud, the terms in a particular topic are displayed in terms of their <strong>relative significance</strong>. The most important word has the largest font size, and so on.</p>
<pre><code class="lang-python">!pip install wordcloud
<span class="hljs-keyword">from</span> wordcloud <span class="hljs-keyword">import</span> WordCloud
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">4</span>):
  wc = WordCloud(width=<span class="hljs-number">1000</span>, height=<span class="hljs-number">600</span>, margin=<span class="hljs-number">3</span>,  prefer_horizontal=<span class="hljs-number">0.7</span>,scale=<span class="hljs-number">1</span>,background_color=<span class="hljs-string">'black'</span>, relative_scaling=<span class="hljs-number">0</span>).generate(topic_word_list[i])
  plt.imshow(wc)
  plt.title(<span class="hljs-string">f"Topic<span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>"</span>)
  plt.axis(<span class="hljs-string">"off"</span>)
  plt.show()
</code></pre>
<p>The word clouds for topics 1 through 4 are shown in the image grid below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/wc1.jpeg" alt="Image" width="600" height="400" loading="lazy">
<em>Topic Clouds from SVD</em></p>
<p>As you can see, the font size of each word indicates its relative importance in a topic. These word clouds are also called topic clouds.</p>
<h2 id="heading-how-to-use-nmf-for-topic-modeling">How to Use NMF for Topic Modeling</h2>
<p>In this section, you'll run through the same steps as in SVD. You need to first import the <code>NMF</code> class from scikit-learn's <code>decomposition</code> module.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> NMF
NMF_model = NMF(n_components=<span class="hljs-number">4</span>, random_state=<span class="hljs-number">1</span>)
W = NMF_model.fit_transform(input_matrix)
H = NMF_model.components_
</code></pre>
<p>And then you may call the <code>get_topics()</code> function on the matrix <strong>H</strong> to get the topics.</p>
<pre><code class="lang-python">Topic <span class="hljs-number">1</span>: 
  code machine source central directly executed intelligible

Topic <span class="hljs-number">2</span>: 
  engineering software process development used term combines

Topic <span class="hljs-number">3</span>: 
  algorithms programming application different domain expertise formal

Topic <span class="hljs-number">4</span>: 
  computer specific task programming automate complex given
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/wc2.jpeg" alt="Image" width="600" height="400" loading="lazy">
<em>Topic Clouds from NMF</em></p>
<p>For the given piece of text, you can see that both SVD and NMF give similar topic clouds.</p>
<h2 id="heading-svd-vs-nmf-an-overview-of-the-differences">SVD vs NMF – An Overview of the Differences</h2>
<p>Now, let's put together the differences between these two matrix factorization techniques for topic modeling.</p>
<ul>
<li>SVD is an exact matrix factorization technique – you can reconstruct the input DTM from the resultant matrices.</li>
<li>If you choose to use k-SVD, it's the best possible rank-k approximation to the input DTM (in terms of the Frobenius norm).</li>
<li>Though NMF is a non-exact approximation to the input DTM, it's known to capture more diverse topics than SVD.</li>
</ul>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>I hope you enjoyed this tutorial. As a next step, you may spin up your own Colab notebook using the code cells from this tutorial. You only have to plug in the piece of text that you'd like to find topics for, and you'd have your topics and word clouds ready!</p>
<p>Thank you for reading, and happy coding!</p>
<h3 id="heading-references-and-further-reading-on-topic-modeling">References and Further Reading on Topic Modeling</h3>
<ul>
<li><a target="_blank" href="https://www.fast.ai/2019/07/08/fastai-nlp/">A Code-First Approach to Natural Language Processing</a> by fast.ai</li>
<li><a target="_blank" href="https://www.fast.ai/2017/07/17/num-lin-alg/">Computational Linear Algebra</a> by fast.ai</li>
</ul>
<p>Cover Image: Photo by <a target="_blank" href="https://unsplash.com/ja/@brett_jordan?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Brett Jordan</a> on <a target="_blank" href="https://unsplash.com/photos/M3cxjDNiLlQ?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Perform Data Augmentation in NLP Projects ]]>
                </title>
                <description>
                    <![CDATA[ By Davis David In machine learning, you need to have a large amount of data in order to achieve strong model performance.  Using a method known as data augmentation, you can create more data for your machine learning project. Data augmentation is a c... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-perform-data-augmentation-in-nlp-projects/</link>
                <guid isPermaLink="false">66d84eb9ef84e4cc27cfbe33</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 24 Jun 2022 15:33:57 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/06/1_eproIleJllsp0enh6HA2Hw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Davis David</p>
<p>In machine learning, you need to have a large amount of data in order to achieve strong model performance. </p>
<p>Using a method known as data augmentation, you can create more data for your machine learning project. Data augmentation is a collection of techniques that manage the process of automatically generating high-quality data on top of existing data.</p>
<p>In computer vision applications, augmenting approaches are quite common. If you are working on a computer vision project (like image classification), for instance, you can apply dozens of techniques to each image: shift, modify color intensities, scale, rotate, crop, and more.</p>
<p>If you have a tiny dataset for your ML project or wish to reduce overfitting in your machine learning models, it may be a good idea to apply data augmentation approaches.</p>
<blockquote>
<p>“We don’t have better algorithms. We just have more data.”- Peter <a target="_blank" href="https://research.google/people/author205/?ref=hackernoon.com">Norvig</a></p>
</blockquote>
<p>In the field of Natural Language Processing (NLP), the tremendous complexity of language makes it difficult to augment the text. The process of augmenting text data is more challenging and not as straightforward as you might expect.</p>
<p>In this article, you will learn how to use a library called <a target="_blank" href="https://github.com/QData/TextAttack?ref=hackernoon.com">TextAttack</a> to improve data for natural language processing.</p>
<h2 id="heading-what-is-textattack">What is TextAttack?</h2>
<p>TextAttack is a Python framework that was built by the <a target="_blank" href="https://qdata.github.io/qdata-page/?ref=hackernoon.com">QData team</a> for the purpose of conducting adversarial attacks, adversarial training, and data augmentation in natural language processing. </p>
<p>TextAttack has components that can be utilized independently for a variety of basic natural language processing tasks, including sentence encoding, grammar checking, and word substitution.</p>
<p>TextAttack excels in performing the following three functions:</p>
<ol>
<li>Adversarial attacks (Python: <code>textattack.Attack</code>, Bash: <code>textattack attack</code>).</li>
<li>Data augmentation (Python: <code>textattack.augmentation.Augmenter</code>, Bash: <code>textattack augment</code>).</li>
<li>Model training (Python: <code>textattack.Trainer</code>, Bash: <code>textattack train</code>).</li>
</ol>
<p>For this article, we will focus on how to use the TextAttack library for data augmentation.</p>
<h2 id="heading-how-to-install-texattack">How to Install TextAttack</h2>
<p>To use this library, make sure you have Python 3.6 or above in your environment.</p>
<p>Run the following command to install TextAttack:</p>
<pre><code class="lang-python">pip install textattack
</code></pre>
<p><strong>Note:</strong> Once you have installed TextAttack, you can run it via the Python module or via the command line.</p>
<h2 id="heading-data-augmentation-techniques-for-text-data">Data Augmentation Techniques for Text Data</h2>
<p>The TextAttack library has various augmentation techniques that you can use in your NLP project to add more text data. </p>
<p>Here are some of the techniques that you can apply:</p>
<h3 id="heading-charswapaugmenter-technique"><code>CharSwapAugmenter</code> technique</h3>
<p>This technique augments words by swapping characters out for other characters.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textattack.augmentation <span class="hljs-keyword">import</span> CharSwapAugmenter

text = <span class="hljs-string">"I have enjoyed watching that movie, it was amazing."</span>

charswap_aug = CharSwapAugmenter()

charswap_aug.augment(text)
</code></pre>
<p>[‘I have enjoyed watching that omvie, it was amazing.’]</p>
<p>The Augmenter has swapped the word <strong>“movie”</strong> for <strong>“omvie”</strong>.</p>
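<p>To make the idea concrete, here is a minimal plain-Python sketch of one kind of character-level perturbation: swapping two neighboring characters, as in “movie” becoming “omvie”. This is not TextAttack's implementation, and the function name <code>swap_adjacent_chars</code> is made up for illustration.</p>

```python
import random


def swap_adjacent_chars(word, rng):
    """Return word with one randomly chosen pair of neighboring characters swapped."""
    if len(word) in (0, 1):
        return word  # nothing to swap
    i = rng.randrange(len(word) - 1)  # index of the left character in the pair
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# A seeded generator keeps the perturbation reproducible between runs.
rng = random.Random(0)
print(swap_adjacent_chars("movie", rng))
```

<p>The result always contains the same letters as the input, which is why a reader can still recognize the word even though a model sees a different string.</p>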
<h3 id="heading-deletionaugmenter-technique"><code>DeletionAugmenter</code> technique</h3>
<p>This one augments the text by deleting some parts of the text to make new text.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textattack.augmentation <span class="hljs-keyword">import</span> DeletionAugmenter

text = <span class="hljs-string">"I have enjoyed watching that movie, it was amazing."</span>

deletion_aug = DeletionAugmenter()

deletion_aug.augment(text)
</code></pre>
<p>[‘I have watching that, it was amazing.’]</p>
<p>This method has removed the words <strong>“enjoyed”</strong> and <strong>“movie”</strong> to create a new augmented text.</p>
<h3 id="heading-easydataaugmenter-technique"><code>EasyDataAugmenter</code> technique</h3>
<p>This augments the text with a combination of different methods, such as:</p>
<ul>
<li>Randomly swapping the positions of the words in the sentence.</li>
<li>Randomly removing words from the sentence.</li>
<li>Randomly inserting a random synonym of a random word at a random location.</li>
<li>Randomly replacing words with their synonyms.</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textattack.augmentation <span class="hljs-keyword">import</span> EasyDataAugmenter

text = <span class="hljs-string">"I was billed twice for the service and this is the second time it has happened"</span>

eda_aug = EasyDataAugmenter()

eda_aug.augment(text)
</code></pre>
<p>[‘I was billed twice for the service and this is the second time it has happen’, ‘I was billed twice for the one service and this is the second time it has happened’, ‘I billed twice for the service and this is the second time it has happened’,<br>‘I was billed twice for the this and service is the second time it has happened’]</p>
<p>As you can see, the augmented texts differ depending on which method was applied. For example, in the first augmented text, the last word has been modified from <strong>“happened”</strong> to <strong>“happen”</strong>.</p>
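<p>Two of the four operations listed above, random swap and random deletion, need no external resources, so they are easy to sketch in plain Python. The snippet below is only an illustration of the idea, not the code TextAttack runs. The function names and the deletion probability <code>p</code> are assumptions, and the two synonym-based operations are left out because they need a thesaurus such as WordNet.</p>

```python
import random


def random_swap(words, rng):
    """Swap the positions of two randomly chosen words."""
    if len(words) in (0, 1):
        return list(words)
    i, j = rng.sample(range(len(words)), 2)  # two distinct positions
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, but always keep at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)
words = "I was billed twice for the service".split()
print(" ".join(random_swap(words, rng)))
print(" ".join(random_deletion(words, 0.2, rng)))
```

<p>Both functions return a new list and leave the input untouched, so you can apply several operations to the same sentence and collect each result as a separate training example.</p>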
<h3 id="heading-wordnetaugmenter-technique"><code>WordNetAugmenter</code> technique</h3>
<p>This technique augments the text by replacing words with synonyms from the WordNet thesaurus.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textattack.augmentation <span class="hljs-keyword">import</span> WordNetAugmenter

text = <span class="hljs-string">"I was billed twice for the service and this is the second time it has happened"</span>

wordnet_aug = WordNetAugmenter()

wordnet_aug.augment(text)
</code></pre>
<p>[‘I was billed twice for the service and this is the second time it has pass’]</p>
<p>This method has changed the word <strong>“happened”</strong> to <strong>“pass”</strong> in order to create a new augmented text.</p>
<h3 id="heading-how-to-create-your-own-augmenter">How to Create Your Own Augmenter</h3>
<p>Importing transformations and constraints from <code>textattack.transformations</code> and <code>textattack.constraints</code> allows you to build your own augmenter from the ground up. </p>
<p>The following is an illustration of the use of the <code>WordSwapRandomCharacterDeletion</code> algorithm to produce augmentations of a string:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> textattack.transformations <span class="hljs-keyword">import</span> WordSwapRandomCharacterDeletion
<span class="hljs-keyword">from</span> textattack.transformations <span class="hljs-keyword">import</span> CompositeTransformation
<span class="hljs-keyword">from</span> textattack.augmentation <span class="hljs-keyword">import</span> Augmenter

my_transformation = CompositeTransformation([WordSwapRandomCharacterDeletion()])
augmenter = Augmenter(transformation=my_transformation, transformations_per_example=<span class="hljs-number">3</span>)

text = <span class="hljs-string">'Siri became confused when we reused to follow her directions.'</span>

augmenter.augment(text)
</code></pre>
<p>[‘Siri became cnfused when we reused to follow her directions.’, ‘Siri became confused when e reused to follow her directions.’, ‘Siri became confused when we reused to follow hr directions.’]</p>
<p>The output shows different augmented texts after implementing the <code>WordSwapRandomCharacterDeletion</code> method. For example, in the first augmented text, the method randomly removes the character <strong>“o”</strong> in the word <strong>“confused”</strong>.</p>
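<p>A custom augmenter pairs a transformation (how a word is changed) with constraints (which words may be changed). The plain-Python sketch below mimics that split so you can see how the two pieces interact. It does not use TextAttack, and every name in it (<code>STOPWORDS</code>, <code>delete_random_char</code>, <code>is_modifiable</code>, <code>augment</code>) is hypothetical.</p>

```python
import random

STOPWORDS = {"a", "an", "the", "we", "her", "to", "when"}  # tiny illustrative list


def delete_random_char(word, rng):
    """Transformation: remove one randomly chosen character from word."""
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def is_modifiable(word):
    """Constraint: leave stopwords and very short words untouched."""
    return word.lower() not in STOPWORDS and len(word) >= 3


def augment(text, n, rng):
    """Return up to n variants of text, each with one eligible word perturbed."""
    words = text.split()
    eligible = [i for i, w in enumerate(words) if is_modifiable(w)]
    variants = []
    for _ in range(n):
        if not eligible:
            break  # the constraint rejected every word
        i = rng.choice(eligible)
        changed = list(words)
        changed[i] = delete_random_char(words[i], rng)
        variants.append(" ".join(changed))
    return variants


text = "Siri became confused when we reused to follow her directions."
for line in augment(text, 3, random.Random(7)):
    print(line)
```

<p>Because the constraint filters positions before the transformation runs, words like “we” and “her” are never altered, no matter how many variants you request.</p>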
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, you have learned the significance of data augmentation for your Machine Learning projects. You've also learned how to execute data augmentation for textual data using the TextAttack library.</p>
<p>To the best of my knowledge, these techniques are the most effective approaches available to do the task for your NLP project. Hopefully they’ll be of use to you in your work.</p>
<p>You can also try to use other available augmentation techniques from the TextAttack library such as:</p>
<ul>
<li>EmbeddingAugmenter</li>
<li>CheckListAugmenter</li>
<li>CLAREAugmenter</li>
</ul>
<p>If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!</p>
<p>You can also find me on Twitter <a target="_blank" href="https://twitter.com/Davis_McDavid?ref=hackernoon.com">@Davis_McDavid</a>.</p>
<p>And you can read more articles like this <a target="_blank" href="https://hackernoon.com/u/davisdavid">here</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Train BPE, WordPiece, and Unigram Tokenizers from Scratch using Hugging Face ]]>
                </title>
                <description>
                    <![CDATA[ If you've had some experience with NLP, you probably know that tokenization is at the helm of any NLP pipeline. Tokenization is often regarded as a subfield of NLP but it has its own story of evolution. And now it underpins many state-of-the-art NLP ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/train-algorithms-from-scratch-with-hugging-face/</link>
                <guid isPermaLink="false">66d45f5ab3016bf139028d4c</guid>
                
                    <category>
                        <![CDATA[ algorithms ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Harshit Tyagi ]]>
                </dc:creator>
                <pubDate>Mon, 18 Oct 2021 22:27:40 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/10/tok_hf.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've had some experience with NLP, you probably know that tokenization is at the helm of any NLP pipeline.</p>
<p>Tokenization is often regarded as a subfield of NLP but it has its own <a target="_blank" href="https://dswharshit.substack.com/p/the-evolution-of-tokenization-byte">story of evolution</a>. And now it underpins many state-of-the-art NLP models.</p>
<p>This post is all about training tokenizers from scratch by leveraging <strong>Hugging Face’s tokenizers package.</strong></p>
<p>Before we get to the fun part of training and comparing the different tokenizers, I want to give you a brief summary of the key differences between the algorithms.</p>
<p>The main difference lies in the <strong>choice of character pairs</strong> to merge and the <strong>merging policy</strong> that each of these algorithms uses to generate the final set of tokens.</p>
<h2 id="heading-bpe-algorithm-a-frequency-based-model">BPE Algorithm – a Frequency-based Model</h2>
<p>Byte Pair Encoding uses the frequency of subword patterns to shortlist them for merging.</p>
<p>The drawback of using frequency as the driving factor is that you can end up having ambiguous final encodings that might not be useful for the new input text.</p>
<p>So BPE still has room for improvement when it comes to generating unambiguous tokens.</p>
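<p>To see what “frequency-based” means in practice, here is a minimal sketch of a single BPE training step on a toy corpus: count every adjacent symbol pair, then merge the most frequent one. The corpus and the helper names <code>count_pairs</code> and <code>merge_pair</code> are assumptions made for this example. A real trainer repeats this step until it reaches the target vocabulary size.</p>

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace every occurrence of pair with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i != len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of symbols mapped to how often it occurs.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("s", "l", "o", "t"): 3}
best = count_pairs(corpus).most_common(1)[0][0]
corpus = merge_pair(corpus, best)
print(best)    # ('l', 'o')
print(corpus)  # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('s', 'lo', 't'): 3}
```

<p>The pair <code>('l', 'o')</code> wins simply because it occurs 10 times in total. Nothing in this step asks whether the merged symbol is a meaningful unit, which is the weakness described above.</p>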
<h2 id="heading-unigram-algorithm-a-probability-based-model">Unigram Algorithm – a Probability-based Model</h2>
<p>Next we have the Unigram model that approaches solving the merging problem by calculating the likelihood of each subword combination rather than picking the most frequent pattern.</p>
<p>It calculates the probability of every subword token and then drops it based on a loss function that is explained in <a target="_blank" href="https://arxiv.org/pdf/1804.10959.pdf">this research paper.</a></p>
<p>Based on a certain threshold of the loss value, you can then trigger the model to drop the bottom 20-30% of the subword tokens.</p>
<p>Unigram is a completely probabilistic algorithm that chooses both the pairs of characters and the final decision to merge (or not) in each iteration based on probability.</p>
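<p>The probabilistic part can be sketched in a few lines. Assuming each subword token already has a probability, the likelihood of a segmentation is the product of its token probabilities, and the best segmentation can be found with dynamic programming. The probabilities below are invented for illustration, and the sketch covers only the scoring step, not the loss-based pruning of the vocabulary.</p>

```python
import math


def best_segmentation(word, probs):
    """Return the most likely split of word into known tokens and its log-probability."""
    n = len(word)
    # best[i] holds (log-probability, tokens) for the best split of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1], best[n][0]


# Made-up token probabilities for a tiny vocabulary.
probs = {"un": 0.1, "i": 0.05, "gram": 0.08, "uni": 0.02,
         "g": 0.01, "ram": 0.03, "unigram": 0.0001}
tokens, logp = best_segmentation("unigram", probs)
print(tokens, round(math.exp(logp), 6))  # ['uni', 'gram'] 0.0016
```

<p>Here <code>uni + gram</code> (0.02 × 0.08 = 0.0016) beats both <code>un + i + gram</code> (0.0004) and the whole word (0.0001), so it is the segmentation the model would prefer.</p>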
<h2 id="heading-wordpiece-algorithm">WordPiece Algorithm</h2>
<p>With the release of BERT in 2018, there came a new subword tokenization algorithm called WordPiece which can be considered an intermediary of BPE and Unigram algorithms.</p>
<p>WordPiece is also a greedy algorithm that leverages likelihood instead of count frequency to merge the best pair in each iteration but the choice of characters to pair is based on count frequency.</p>
<p>So, it is similar to BPE in terms of choosing characters to pair and similar to Unigram in terms of choosing the best pair to merge.</p>
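<p>One common way to describe the WordPiece merge criterion is to score each pair as its frequency divided by the product of the frequencies of its two parts. The sketch below compares that score with plain frequency on a toy corpus. Both the corpus and the helper name <code>pair_scores</code> are assumptions for this example, and real implementations may differ in the details.</p>

```python
from collections import Counter


def pair_scores(corpus):
    """Return (frequency, likelihood-style score) for each adjacent symbol pair."""
    unit, pair = Counter(), Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            unit[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair[(a, b)] += freq
    # score = freq(pair) / (freq(first symbol) * freq(second symbol))
    return {p: (f, f / (unit[p[0]] * unit[p[1]])) for p, f in pair.items()}


# Toy corpus: each word is a tuple of symbols mapped to how often it occurs.
corpus = {("t", "h", "e"): 10, ("t", "h", "a", "t"): 5, ("q", "u"): 3}
scores = pair_scores(corpus)
by_frequency = max(scores, key=lambda p: scores[p][0])
by_score = max(scores, key=lambda p: scores[p][1])
print(by_frequency)  # ('t', 'h')
print(by_score)      # ('q', 'u')
```

<p>A frequency-driven merge picks <code>('t', 'h')</code> because it is the most common pair. The likelihood-style score prefers <code>('q', 'u')</code>, since those two symbols never appear apart in this corpus.</p>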
<p>With the algorithmic differences covered, I tried to implement each of these algorithms (not from scratch) to compare the output generated by each of them.</p>
<h2 id="heading-how-to-train-the-bpe-unigram-and-wordpiece-algorithms">How to Train the BPE, Unigram, and WordPiece Algorithms</h2>
<p>Now, in order to have an unbiased comparison of outputs, I didn’t want to use pre-trained algorithms as that would bring size, quality, and the content of the dataset into the picture.</p>
<p>One way could be to code these algorithms from scratch using the research papers and then test them out. This is a good approach in order to truly understand the workings of each algorithm but you might end up spending weeks doing that.</p>
<p>I instead used <a target="_blank" href="https://huggingface.co/docs/tokenizers/python/latest/quicktour.html">Hugging Face’s tokenizers</a> package that offers the implementation of all of today’s most used tokenizers. It also allowed me to train these models from scratch on my choice of dataset and then tokenize the input string of my own choice.</p>
<h3 id="heading-how-to-train-the-datasets">How to Train the Datasets</h3>
<p>I chose two different datasets to train these models. One is a free book from Gutenberg which serves as a small dataset and the other one is the <a target="_blank" href="https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/">wikitext-103</a> which is 516M of text.</p>
<p>In the Colab, you can download the datasets first and unzip them (if required):</p>
<pre><code class="lang-javascript">!wget http:<span class="hljs-comment">//www.gutenberg.org/cache/epub/16457/pg16457.txt</span>
</code></pre>
<pre><code class="lang-javascript">!wget https:<span class="hljs-comment">//s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip</span>
</code></pre>
<pre><code class="lang-javascript">!unzip wikitext<span class="hljs-number">-103</span>-raw-v1.zip
</code></pre>
<h3 id="heading-import-the-required-models-and-trainers">Import the Required Models and Trainers</h3>
<p>Going through the documentation, you’ll find that the main API of the package is the class <code>Tokenizer</code>.</p>
<p>You can then instantiate any tokenizer with your choice of model (BPE/ Unigram/ WordPiece).</p>
<p>Here, I imported the main class, all the models I wanted to test, and their trainers, as I want to train these models from scratch.</p>
<pre><code class="lang-javascript">## importing the tokenizer and subword BPE trainer
<span class="hljs-keyword">from</span> tokenizers <span class="hljs-keyword">import</span> Tokenizer
<span class="hljs-keyword">from</span> tokenizers.models <span class="hljs-keyword">import</span> BPE, Unigram, WordLevel, WordPiece
<span class="hljs-keyword">from</span> tokenizers.trainers <span class="hljs-keyword">import</span> BpeTrainer, WordLevelTrainer, \
                                WordPieceTrainer, UnigramTrainer

## a pretokenizer to segment the text into words
<span class="hljs-keyword">from</span> tokenizers.pre_tokenizers <span class="hljs-keyword">import</span> Whitespace
</code></pre>
<h3 id="heading-how-to-automate-training-and-tokenization">How to Automate Training and Tokenization</h3>
<p>Since I need to perform somewhat similar processes for three different models, I broke the processes into 3 functions. I’ll only need to call these functions for each model and my job here will be done.</p>
<p>So, what are these functions?</p>
<h4 id="heading-step-1-prepare-the-tokenizer">Step 1 - Prepare the tokenizer</h4>
<p>Preparing the tokenizer requires us to instantiate the Tokenizer class with a model of our choice. But since we have four models (I added a simple Word-level algorithm as well) to test, we’ll write if/else cases to instantiate the tokenizer with the right model.</p>
<p>To train the instantiated tokenizer on the small and large datasets, we will also need to instantiate a trainer. In our case, these would be <a target="_blank" href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.trainers.BpeTrainer"><code>BpeTrainer</code></a>, <code>WordLevelTrainer</code>, <code>WordPieceTrainer</code>, and <code>UnigramTrainer</code>.</p>
<p>The instantiation and training will need us to specify some special tokens. These are tokens for unknown words and other special tokens that we’ll need to use later on to add to our vocabulary.</p>
<p>You can also specify other training arguments, such as vocabulary size or minimum frequency, here.</p>
<pre><code class="lang-javascript">unk_token = <span class="hljs-string">"&lt;UNK&gt;"</span>  # token <span class="hljs-keyword">for</span> unknown words
spl_tokens = [<span class="hljs-string">"&lt;UNK&gt;"</span>, <span class="hljs-string">"&lt;SEP&gt;"</span>, <span class="hljs-string">"&lt;MASK&gt;"</span>, <span class="hljs-string">"&lt;CLS&gt;"</span>]  # special tokens

def prepare_tokenizer_trainer(alg):
    <span class="hljs-string">""</span><span class="hljs-string">"
    Prepares the tokenizer and trainer with unknown &amp; special tokens.
    "</span><span class="hljs-string">""</span>
    <span class="hljs-keyword">if</span> alg == <span class="hljs-string">'BPE'</span>:
        tokenizer = Tokenizer(BPE(unk_token = unk_token))
        trainer = BpeTrainer(special_tokens = spl_tokens)
    elif alg == <span class="hljs-string">'UNI'</span>:
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token= unk_token, special_tokens = spl_tokens)
    elif alg == <span class="hljs-string">'WPC'</span>:
        tokenizer = Tokenizer(WordPiece(unk_token = unk_token))
        trainer = WordPieceTrainer(special_tokens = spl_tokens)
    <span class="hljs-attr">else</span>:
        tokenizer = Tokenizer(WordLevel(unk_token = unk_token))
        trainer = WordLevelTrainer(special_tokens = spl_tokens)

    tokenizer.pre_tokenizer = Whitespace()
    <span class="hljs-keyword">return</span> tokenizer, trainer
</code></pre>
<p>We’ll also need to add a pre-tokenizer to split our input into words as without a pre-tokenizer, we might get tokens that overlap several words: for instance we could get a <code>"there is"</code> token since those two words often appear next to each other.</p>
<blockquote>
<p><em>Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer.</em></p>
</blockquote>
<p>This function will return the tokenizer and its trainer object which we can use to train the model on a dataset.</p>
<p>Here, we are using the same pre-tokenizer (<code>Whitespace</code>) for all the models. You can choose to test it with <a target="_blank" href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers">others</a>.</p>
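<p>If you want a feel for what this pre-tokenization step produces, the following plain-Python approximation splits text into runs of word characters and runs of punctuation. It mirrors the splitting behavior of <code>Whitespace</code> only roughly: the real pre-tokenizer also tracks character offsets, and the function name <code>whitespace_pre_tokenize</code> is made up for this sketch.</p>

```python
import re


def whitespace_pre_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(whitespace_pre_tokenize("Hello there, is it me?!"))
# ['Hello', 'there', ',', 'is', 'it', 'me', '?!']
```

<p>Since “there” and “is” end up as separate pieces, no later merge can produce a single <code>"there is"</code> token.</p>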
<h4 id="heading-step-2-train-the-tokenizer">Step 2 - Train the tokenizer</h4>
<p>After preparing the tokenizers and trainers, we can start the training process.</p>
<p>Here’s a function that will take the file(s) on which we intend to train our tokenizer along with the algorithm identifier.</p>
<ul>
<li><p><code>‘WLV’</code> - Word Level Algorithm</p>
</li>
<li><p><code>‘WPC’</code> - WordPiece Algorithm</p>
</li>
<li><p><code>‘BPE’</code> - Byte Pair Encoding</p>
</li>
<li><p><code>‘UNI’</code> - Unigram</p>
</li>
</ul>
<pre><code class="lang-javascript">def train_tokenizer(files, alg=<span class="hljs-string">'WLV'</span>):
    <span class="hljs-string">""</span><span class="hljs-string">"
    Takes the files and trains the tokenizer.
    "</span><span class="hljs-string">""</span>
    tokenizer, trainer = prepare_tokenizer_trainer(alg)
    tokenizer.train(files, trainer) # training the tokenizer
    tokenizer.save(<span class="hljs-string">"./tokenizer-trained.json"</span>)
    tokenizer = Tokenizer.from_file(<span class="hljs-string">"./tokenizer-trained.json"</span>)
    <span class="hljs-keyword">return</span> tokenizer
</code></pre>
<p>This is the main function that we’ll need to call for training the tokenizer. It will first prepare the tokenizer and trainer and then start training the tokenizers with the provided files.</p>
<p>After training, it saves the model in a JSON file, loads it from the file, and returns the trained tokenizer to start encoding the new input.</p>
<h4 id="heading-step-3-tokenize-the-input-string">Step 3 - Tokenize the input string</h4>
<p>The last step is to start encoding the new input strings and compare the tokens generated by each algorithm.</p>
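<p>The comparison loop in this step calls a small <code>tokenize</code> helper whose definition is not shown in this post. A minimal sketch, assuming it does nothing more than wrap the tokenizer's <code>encode</code> method, could look like this:</p>

```python
def tokenize(input_string, tokenizer):
    """Encode input_string with a trained tokenizer and return the encoding."""
    # Assumption: the helper is a thin wrapper around tokenizer.encode().
    return tokenizer.encode(input_string)
```

<p>With a Hugging Face <code>Tokenizer</code>, the returned object exposes the generated tokens through its <code>tokens</code> attribute, which is what the loop prints and stores.</p>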
<p>Here, we’ll be writing a nested for loop to train each model on the smaller dataset first followed by training on the larger dataset and tokenizing the input string as well.</p>
<p><strong>Input string -</strong> “This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!😍”</p>
<pre><code class="lang-javascript">small_file = [<span class="hljs-string">'pg16457.txt'</span>]
large_files = [f<span class="hljs-string">"./wikitext-103-raw/wiki.{split}.raw"</span> <span class="hljs-keyword">for</span> split <span class="hljs-keyword">in</span> [<span class="hljs-string">"test"</span>, <span class="hljs-string">"train"</span>, <span class="hljs-string">"valid"</span>]]

<span class="hljs-keyword">for</span> files <span class="hljs-keyword">in</span> [small_file, large_files]:
    print(f<span class="hljs-string">"========Using vocabulary from {files}======="</span>)
    <span class="hljs-keyword">for</span> alg <span class="hljs-keyword">in</span> [<span class="hljs-string">'WLV'</span>, <span class="hljs-string">'BPE'</span>, <span class="hljs-string">'UNI'</span>, <span class="hljs-string">'WPC'</span>]:
        trained_tokenizer = train_tokenizer(files, alg)
        input_string = <span class="hljs-string">"This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!😍"</span>
        output = tokenize(input_string, trained_tokenizer)
        tokens_dict[alg] = output.tokens
        print(<span class="hljs-string">"----"</span>, alg, <span class="hljs-string">"----"</span>)
        print(output.tokens, <span class="hljs-string">"-&gt;"</span>, len(output.tokens))
</code></pre>
<p><strong>And here's the output:</strong></p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F43eb1a88-36a1-4343-be1e-ac65843e3837_1306x430.png" alt="Image" width="1306" height="430" loading="lazy"></p>
<h2 id="heading-analysis-of-the-output">Analysis of the output:</h2>
<p>Looking at the output, you’ll see differences in how the tokens were generated, which in turn led to a different number of tokens for each algorithm.</p>
<ul>
<li><p>A simple <strong>word level algorithm</strong> created 35 tokens no matter which dataset it was trained on.</p>
</li>
<li><p>The <strong>BPE</strong> algorithm created 55 tokens when trained on a smaller dataset and 47 when trained on a larger dataset. This shows that it was able to merge more pairs of characters when trained on a larger dataset.</p>
</li>
<li><p>The <strong>Unigram model</strong> created similar (68 and 67) numbers of tokens with both the datasets. But you can see the difference in the generated tokens:</p>
</li>
</ul>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3d128-641c-4680-9b43-0e04a505d67c_428x43.png" alt="Image" width="428" height="43" loading="lazy"></p>
<p>With the larger dataset, merging came closer to generating tokens that are better suited to encoding the real-world English words we often use.</p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Feb49063b-8896-496e-acec-0dea60d6ea37_260x40.png" alt="Image" width="260" height="40" loading="lazy"></p>
<p><strong>WordPiece</strong> created 52 tokens when trained on a smaller dataset and 48 when trained on a larger dataset. The generated tokens use a double hash (<code>##</code>) prefix to mark a token that continues a word rather than starting one.</p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5225119-6158-45e7-83b1-26bf587791f3_391x45.png" alt="Image" width="391" height="45" loading="lazy"></p>
<p>All three algorithms generated fewer and better subword tokens when trained on a larger dataset.</p>
<h2 id="heading-how-to-compare-the-tokens">How to Compare the Tokens</h2>
<p>To compare the tokens, I stored the output of each algorithm in a dictionary and I’ll turn it into a dataframe to view the differences in tokens better.</p>
<p>Since the number of tokens generated by each model is different, I’ve added a <code>&lt;PAD&gt;</code> token to make the data rectangular and fit a dataframe.</p>
<p><code>&lt;PAD&gt;</code> is basically nan in the dataframe.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

max_len = max(len(tokens_dict[<span class="hljs-string">'UNI'</span>]), len(tokens_dict[<span class="hljs-string">'WPC'</span>]), len(tokens_dict[<span class="hljs-string">'BPE'</span>]))
diff_bpe = max_len - len(tokens_dict[<span class="hljs-string">'BPE'</span>])
diff_wpc = max_len - len(tokens_dict[<span class="hljs-string">'WPC'</span>])

tokens_dict[<span class="hljs-string">'BPE'</span>] = tokens_dict[<span class="hljs-string">'BPE'</span>] + [<span class="hljs-string">'&lt;PAD&gt;'</span>]*diff_bpe
tokens_dict[<span class="hljs-string">'WPC'</span>] = tokens_dict[<span class="hljs-string">'WPC'</span>] + [<span class="hljs-string">'&lt;PAD&gt;'</span>]*diff_wpc

del tokens_dict[<span class="hljs-string">'WLV'</span>]

df = pd.DataFrame(tokens_dict)
df.head(<span class="hljs-number">10</span>)
</code></pre>
<p><strong>Here's the output:</strong></p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F856ab4bc-7343-4114-9867-27e64a71d21a_306x474.png" alt="Image" width="306" height="474" loading="lazy"></p>
<p>You can also look at the difference in tokens using sets:</p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7be68b1a-d979-4688-94e9-b19219f2259d_370x692.png" alt="Image" width="370" height="692" loading="lazy"></p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8942a6e1-d1bc-4a6e-bbec-3473a87ef9ca_370x626.png" alt="Image" width="370" height="626" loading="lazy"></p>
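<p>The set comparison only appears as screenshots, so here is a small sketch of how such a comparison can be written. The two token lists are invented for illustration and are not the output of the trained tokenizers. The helper name <code>compare_tokens</code> is also an assumption.</p>

```python
def compare_tokens(tokens_a, tokens_b):
    """Return the tokens shared by both lists and those unique to each one."""
    a, b = set(tokens_a), set(tokens_b)
    return {
        "shared": a.intersection(b),
        "only_a": a.difference(b),
        "only_b": b.difference(a),
    }


# Made-up token lists standing in for two tokenizers' output.
bpe_tokens = ["This", "is", "token", "ization"]
wpc_tokens = ["This", "is", "token", "##ization"]

result = compare_tokens(bpe_tokens, wpc_tokens)
for name, tokens in result.items():
    print(name, sorted(tokens))
```

<p>Keep in mind that sets drop duplicates and ordering, so this view tells you which distinct tokens each algorithm produced, not how many times or where they occurred.</p>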
<p>To check out the code, head over to this <a target="_blank" href="https://colab.research.google.com/drive/10gwzRY55JqzgeEQOX6nwFs6bQ84-mB9f?usp=sharing">Colab notebook.</a></p>
<h2 id="heading-closing-thoughts-and-next-steps">Closing Thoughts and Next Steps</h2>
<p>Based on the kind of tokens generated, WPC does seem to generate subword tokens that are more commonly found in the English language – but don’t hold me to this observation.</p>
<p>These algorithms are slightly different from each other and do a somewhat similar job of developing a decent NLP model. But much of the performance depends on the use case of your language model, the vocabulary size, speed and other factors.</p>
<p>This concludes our examination of tokenization algorithms. The next step to deep dive into this is to understand what embeddings are, how tokenization plays a vital role in creating these embeddings, and how they affect a model’s performance.</p>
<p>A further advancement on these algorithms is the <a target="_blank" href="https://arxiv.org/pdf/1808.06226.pdf">SentencePiece algorithm</a>, which takes a holistic approach to the whole tokenization problem. But much of this problem is alleviated by Hugging Face, and even better – they have all the algorithms implemented in a single GitHub repo.</p>
<h3 id="heading-references-and-notes">References and Notes</h3>
<p>If you have questions about my analysis or any of my work in this post, I highly encourage you to check out these resources for a precise understanding of the workings of each algorithm:</p>
<ol>
<li><p><a target="_blank" href="https://arxiv.org/pdf/1804.10959.pdf">Subword regularization: Improving Neural Network Translation Models with Multiple Subword Candidates</a> by Taku Kudo</p>
</li>
<li><p><a target="_blank" href="https://arxiv.org/pdf/1508.07909.pdf">Neural Machine Translation of Rare Words with Subword Units</a> - Research paper that discusses different segmentation techniques based on the BPE compression algorithm.</p>
</li>
<li><p><a target="_blank" href="https://huggingface.co/docs/tokenizers/python/latest/quicktour.html">Hugging Face’s tokenizer package.</a></p>
</li>
</ol>
<h3 id="heading-connect-with-me">Connect with me</h3>
<p>If you’re looking to get started in the field of data science or ML, check out my course on <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml"><strong>Foundations of Data Science &amp; ML</strong></a>.</p>
<p>If you would like to see more of such content and you are not a subscriber, consider subscribing to <a target="_blank" href="https://dswharshit.substack.com/">my newsletter</a>.</p>
<p>If you have something to add or suggest, you can reach out to me via:</p>
<ul>
<li><p><a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">YouTube</a></p>
</li>
<li><p><a target="_blank" href="https://twitter.com/dswharshit">Twitter</a></p>
</li>
<li><p><a target="_blank" href="https://www.linkedin.com/in/tyagiharshit/">LinkedIn</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Evolution of Tokenization – Byte Pair Encoding in NLP ]]>
                </title>
                <description>
                    <![CDATA[ Natural Language Processing may have come a little late to the AI game, but companies like Google and OpenAI are working wonders with NLP techniques these days. These companies have released state-of-the-art language models like BERT and GPT-2 and GP... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/evolution-of-tokenization/</link>
                <guid isPermaLink="false">66d45f3d3a8352b6c5a2aa7f</guid>
                
                    <category>
                        <![CDATA[ algorithms ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tokenization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Harshit Tyagi ]]>
                </dc:creator>
                <pubDate>Tue, 05 Oct 2021 15:26:44 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/10/IMG_0079.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Natural Language Processing may have come a little late to the AI game, but companies like Google and OpenAI are working wonders with NLP techniques these days.</p>
<p>These companies have released state-of-the-art language models like BERT, GPT-2, and GPT-3. And GitHub Copilot and OpenAI Codex are among the popular applications that have been in the news lately.</p>
<p>As someone who has had very limited exposure to NLP, I decided to take it up as an area of research so I can learn more about it. My next few articles and videos will focus on sharing what I learn after dissecting some important components of NLP.</p>
<h3 id="heading-main-components-of-nlp">Main Components of NLP</h3>
<p>NLP systems have three main components that help machines understand natural language:</p>
<ol>
<li><p>Tokenization</p>
</li>
<li><p>Embedding</p>
</li>
<li><p>Model architectures</p>
</li>
</ol>
<p>Top Deep Learning models like BERT, GPT-2, and GPT-3 all share the same components but with different architectures that distinguish one model from another.</p>
<p>In this article (and the <a target="_blank" href="https://colab.research.google.com/drive/1QLlQx_EjlZzBPsuj_ClrEDC0l8G-JuTn?usp=sharing">notebook</a> that accompanies it), we are going to focus on the basics of the first component of an NLP pipeline, which is <strong>tokenization</strong>. It's an often overlooked concept, but it is a field of research in itself.</p>
<p>We have come so far from the traditional NLTK tokenization process. And though we have state-of-the-art algorithms for tokenization, it's always a good practice to understand its evolution and how we got to where we are now.</p>
<p>So, here's what we'll cover:</p>
<ul>
<li><p>What is tokenization?</p>
</li>
<li><p>Why do we need a tokenizer?</p>
</li>
<li><p>Types of tokenization – Word, Character, and Subword.</p>
</li>
<li><p>Byte Pair Encoding Algorithm - a version of which is used by most NLP models these days.</p>
</li>
</ul>
<p>The next part of this tutorial will dive into more advanced algorithms, some of which build on Byte Pair Encoding:</p>
<ul>
<li><p><strong>Unigram Algorithm</strong></p>
</li>
<li><p><strong>WordPiece – BERT transformer</strong></p>
</li>
<li><p><strong>SentencePiece – End-to-End tokenizer system</strong></p>
</li>
</ul>
<h2 id="heading-what-is-tokenization">What is Tokenization?</h2>
<p>Tokenization is the process of breaking raw text into smaller units called tokens. These tokens can then be mapped to numbers and fed to an NLP model.</p>
<p>Here's an overly simplified example of what a tokenizer does:</p>
<pre><code class="lang-javascript">## read the text and enumerate the tokens <span class="hljs-keyword">in</span> the text
text = open(<span class="hljs-string">'example.txt'</span>, <span class="hljs-string">'r'</span>).read()  # read a text file

words = text.split(<span class="hljs-string">" "</span>)  # split the text on spaces

tokens = {<span class="hljs-attr">v</span>: k <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> enumerate(words)}  # generate a word-to-index mapping
</code></pre>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcaa2e479-181a-4703-afb6-9796d0f74d09_229x327.png" alt="Image" width="229" height="327" loading="lazy"></p>
<p>Here, we have simply mapped every word in the text to a numerical index. This is, of course, a very simple example, and we have not considered grammar, punctuation, or related word forms (like test, test-ify, test-ing, and so on).</p>
<p>So we need a more technical and accurate definition of tokenization for our work here. To take into account all punctuation and every related word, we need to start working at the character level.</p>
<p>There are multiple applications of tokenization. One of the use cases comes from compiler design where you might need to parse computer programs to convert raw characters into keywords of a programming language.</p>
<p><strong>In deep learning,</strong> tokenization is the process of converting a sequence of characters into a sequence of tokens, which then needs to be converted into a sequence of numerical vectors that a neural network can process.</p>
<h2 id="heading-why-do-we-need-a-tokenizer">Why do we need a Tokenizer?</h2>
<p>The need for a tokenizer came from the question "How can we make machines read?"</p>
<p>A common way of processing textual data is to define a set of rules in a dictionary and then look up that fixed dictionary of rules. But this method can only go so far, and we want machines to learn these rules from the text that they read.</p>
<p>Now, machines don't know any language, nor do they understand sound or phonetics. They need to be taught from scratch and in such a way that they can read any language that's out there.</p>
<p>Quite a task, right?</p>
<p>Humans learn a language by connecting sound to the meaning and then we learn to read and write in that language. Machines can't do that, so they need to be given the most basic units of text to start processing the text.</p>
<p>That's where tokenization comes into play. It breaks down the text into smaller units called "tokens".</p>
<p>And there are different ways of tokenizing text which is what we'll learn now.</p>
<h2 id="heading-different-ways-to-tokenize-text">Different ways to tokenize text</h2>
<p>To make the deep learning model learn from the text, we need a two-step process:</p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7fafb7-a127-4e41-a050-cb02951f3112_1391x821.jpeg" alt="Image" width="1391" height="821" loading="lazy"></p>
<ol>
<li><p>Tokenize – decide the algorithm we'll use to generate the tokens.</p>
</li>
<li><p>Encode the tokens to vectors</p>
</li>
</ol>
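<p>Here's a minimal sketch of those two steps. It assumes a naive split on spaces as the tokenizer and one-hot vectors as the encoding, both chosen only for illustration (real models usually learn dense embeddings instead):</p>
<pre><code class="lang-python"># A toy two-step pipeline: tokenize, then encode the tokens as vectors.
text = "it is going to rain"

# Step 1: tokenize (here, a naive split on spaces)
tokens = text.split(" ")

# Build a vocabulary that maps each unique token to an integer id
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Step 2: encode each token as a one-hot vector of length len(vocab)
def one_hot(token_id, size):
    vector = [0] * size
    vector[token_id] = 1
    return vector

token_ids = [vocab[token] for token in tokens]
vectors = [one_hot(token_id, len(vocab)) for token_id in token_ids]

print(token_ids)   # [2, 1, 0, 4, 3]
print(vectors[0])  # [0, 0, 1, 0, 0]
</code></pre>
<p>Each vector has a single 1 at the position of its token id, so the vector length grows with the vocabulary. That is one reason the choice of tokenizer matters so much.</p>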
<h2 id="heading-word-based-tokenization">Word-based tokenization</h2>
<p>As the first step suggests, we need to decide how to convert text into small tokens. A simple and straightforward method that most of us would propose is to use word-based tokens, splitting the text by spaces.</p>
<h3 id="heading-problems-with-word-tokenizer">Problems with Word tokenizer</h3>
<p><strong>There's a high risk of missing words in the training data.</strong> With word tokens, your model won't recognize the variants of words that were not part of the data on which the model was trained.</p>
<p>So, if your model has seen <code>foot</code> and <code>ball</code> in the training data but the final text has <code>football</code>, the model won't be able to recognize the word and will replace it with an <code>&lt;UNK&gt;</code> token.</p>
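<p>To make this concrete, here's a small sketch of a word-level vocabulary. The training sentence and the <code>encode</code> helper are made up for this example:</p>
<pre><code class="lang-python">UNK = "&lt;UNK&gt;"  # placeholder for out-of-vocabulary words

training_text = "he hurt his foot kicking the ball"

# Word-level vocabulary: id 0 is reserved for unknown words
vocab = {UNK: 0}
for word in training_text.split(" "):
    if word not in vocab:
        vocab[word] = len(vocab)

def encode(sentence):
    # Any word that is missing from the vocabulary falls back to UNK
    return [vocab.get(word, vocab[UNK]) for word in sentence.split(" ")]

print(encode("the foot"))      # [6, 4]: both words were seen in training
print(encode("the football"))  # [6, 0]: "football" maps to the unknown id
</code></pre>
<p>Even though both <code>foot</code> and <code>ball</code> are in the vocabulary, a word-level tokenizer has no way to combine them.</p>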
<p>Similarly, punctuation poses another problem. For example, <code>let</code> and <code>let's</code> will each need their own token, which is inefficient. This will <strong>require a huge vocabulary</strong> to make sure you've covered every variant of a word.</p>
<p>Even if you add a <strong>lemmatizer</strong> to solve this problem, you're adding an extra step in your processing pipeline.</p>
<p><strong>It's also tough to handle slang and abbreviations.</strong> We use lots of slang and abbreviations in text these days, such as "FOMO", "LOL", "tl;dr" and so on. What do we do for these words?</p>
<p><strong>Finally, what if the language doesn't use spaces for segmentation?</strong> For a language like Chinese, which doesn't use spaces for word separation, this tokenizer will fail completely.</p>
<p>After encountering these problems, researchers looked into another approach which involved tokenizing all the characters.</p>
<h2 id="heading-character-based-tokenization">Character-based tokenization</h2>
<p>To resolve the problems associated with word-based tokenization, data scientists tried an alternative approach of character-by-character tokenization.</p>
<p>This did solve the problem of missing words, as now we are dealing with characters that can be encoded using ASCII or Unicode. The model can now generate an embedding for any word.</p>
<p>Every character, whether it is a space, an apostrophe, a colon, or anything else, can now be assigned a symbol to generate a sequence of vectors.</p>
<p>But this approach had its own cons.</p>
<h3 id="heading-problems-with-character-based-models">Problems with character-based models</h3>
<p><strong>First, this approach requires more computing resources.</strong> Character-based models treat each character as a token. More tokens mean longer input sequences, which in turn require more compute to process.</p>
<p>For example, for a 5-word long sentence, you may need to process 30 tokens instead of 5 word-based tokens.</p>
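<p>You can check the difference yourself with a quick count. The sentence below is just an example, and the exact ratio depends on the text:</p>
<pre><code class="lang-python">sentence = "It is going to rain."

word_tokens = sentence.split(" ")  # word-based: split on spaces
char_tokens = list(sentence)       # character-based: every character, spaces included

print(len(word_tokens))  # 5
print(len(char_tokens))  # 20
</code></pre>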
<p><strong>Also, it narrows down the number of NLP tasks and applications.</strong> With long sequences of characters, you can only use a certain type of neural network architecture.</p>
<p>This limits the type of NLP tasks we can perform. For applications like entity recognition or text classification, character-based encoding might turn out to be an inefficient approach.</p>
<p><strong>Finally, there's a risk of learning incorrect semantics.</strong> Working with characters could generate incorrect spellings of words. Also, a single character carries no inherent meaning, so learning from characters alone is like learning without meaningful semantics.</p>
<blockquote>
<p>What's fascinating is that for such a seemingly simple task, multiple algorithms have been written to find the optimal tokenization policy.</p>
</blockquote>
<p>After understanding the pros and cons of these tokenization methods, it makes sense to look for an approach that offers a middle route. We want one that preserves semantics while using a limited vocabulary whose tokens can be merged to produce every word in the text.</p>
<h2 id="heading-subword-tokenization">Subword Tokenization</h2>
<p>With character-based models, we risk losing the semantic features of the word. And with word-based tokenization, we need a very large vocabulary to encompass all the possible variations of every word.</p>
<p>So, the goal was to develop an algorithm that could:</p>
<ol>
<li><p>Retain the semantic features of the token, that is, the information carried by each token.</p>
</li>
<li><p>Tokenize with a finite vocabulary, without demanding a very large set of words.</p>
</li>
</ol>
<p>To solve this problem, you can think of breaking down the words based on a set of prefixes and suffixes. For example, we can write a rule-based system to identify subwords like <code>"##s"</code>, <code>"##ing"</code>, <code>"##ify"</code>, <code>"un##"</code> and so on, where the position of the double hash shows whether the subword is a prefix or a suffix.</p>
<p>So, a word like <code>"unhappily"</code> is tokenized using subwords like <code>"un##"</code>, <code>"happ"</code>, and <code>"##ily"</code>.</p>
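<p>A rule-based splitter along these lines could be sketched as follows. The prefix and suffix lists and the <code>rule_based_subwords</code> function are hypothetical, chosen only to reproduce the example above:</p>
<pre><code class="lang-python"># Hypothetical rule set, chosen only for this example
PREFIXES = ["un"]
SUFFIXES = ["ily", "ing", "ify", "s"]

def rule_based_subwords(word):
    tokens = []
    # Strip at most one known prefix and mark it with a trailing "##"
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) != len(prefix):
            tokens.append(prefix + "##")
            word = word[len(prefix):]
            break
    # Strip at most one known suffix and mark it with a leading "##"
    suffix_token = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) != len(suffix):
            suffix_token = "##" + suffix
            word = word[:len(word) - len(suffix)]
            break
    tokens.append(word)  # whatever remains is treated as the stem
    if suffix_token is not None:
        tokens.append(suffix_token)
    return tokens

print(rule_based_subwords("unhappily"))  # ['un##', 'happ', '##ily']
print(rule_based_subwords("testing"))    # ['test', '##ing']
</code></pre>
<p>Note that hand-written rules like these are brittle. The same function splits <code>"under"</code> into <code>['un##', 'der']</code>, even though <code>un</code> is not a prefix in that word.</p>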
<p>The model only learns relatively few subwords and then puts them together to create other words. This solves the problems of memory requirement and effort required to create a large vocabulary.</p>
<h3 id="heading-problems-with-the-subword-tokenization-algorithm">Problems with the subword tokenization algorithm</h3>
<p>First of all, some of the subwords that are created as per the defined rules may never appear in your text to tokenize and may end up occupying extra memory.</p>
<p>Also, for every language, we'll need to define a different set of rules to create subwords.</p>
<p>To alleviate this problem, in practice, most modern tokenizers have a training phase that identifies the recurring text in the input corpus and creates new subword tokens. For rare patterns, we stick to word-based tokens.</p>
<p>Another important factor in this process is the vocabulary size that the user sets. A large vocabulary lets more common words be kept as single tokens, whereas a smaller vocabulary forces more words to be built from subwords so that every word in the text can be represented without using the <code>&lt;UNK&gt;</code> token.</p>
<p>Striking the right balance for your application is key here.</p>
<h2 id="heading-byte-pair-encoding-bpe-algorithm">Byte Pair Encoding (BPE) Algorithm</h2>
<p>BPE was originally a data compression algorithm that finds a compact way to represent data by identifying the most common byte pairs. We now use it in NLP to find a good representation of text using as few tokens as possible.</p>
<p>Here's how it works:</p>
<ol>
<li><p>Add an identifier (<code>&lt;/w&gt;</code>) at the end of each word to identify the end of a word and then calculate the word frequency in the text.</p>
</li>
<li><p>Split the word into characters and then calculate the character frequency.</p>
</li>
<li><p>From the character tokens, for a predefined number of iterations, count the frequency of the consecutive byte pairs and merge the most frequently occurring byte pairing.</p>
</li>
<li><p>Keep iterating until you have reached the iteration limit (set by you) or until you have reached the token limit.</p>
</li>
</ol>
<p>Let's go through each step (in the code) for some sample text. For coding this, I have taken help from <a target="_blank" href="https://leimao.github.io/blog/Byte-Pair-Encoding/">Lei Mao's very minimalistic blog on BPE</a>. I encourage you to check it out!</p>
<h2 id="heading-step-1-add-word-identifiers-and-calculate-word-frequency">Step 1: Add word identifiers and calculate word frequency</h2>
<p>Here's our sample text:</p>
<pre><code class="lang-javascript"><span class="hljs-string">"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."</span>
</code></pre>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> collections

## define the text first
text = <span class="hljs-string">"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."</span>

## split the text into words
words = text.strip().split(<span class="hljs-string">" "</span>)

print(f<span class="hljs-string">"Vocabulary size: {len(words)}"</span>)

## get the word frequency and add the end <span class="hljs-keyword">of</span> word (&lt;/w&gt;) token at the end <span class="hljs-keyword">of</span> each word
word_freq_dict = collections.defaultdict(int)
<span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> words:
    word_freq_dict[<span class="hljs-string">" "</span>.join(word) + <span class="hljs-string">" &lt;/w&gt;"</span>] += <span class="hljs-number">1</span>
</code></pre>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fb90e5882-9f1f-4b05-be48-3e9336cf1854_283x392.png" alt="Image" width="283" height="392" loading="lazy"></p>
<h2 id="heading-step-2-split-the-word-into-characters-and-then-calculate-the-character-frequency">Step 2: Split the word into characters and then calculate the character frequency</h2>
<pre><code class="lang-javascript">## count how often each character occurs, weighted by the frequency of the word it appears in
char_freq_dict = collections.defaultdict(int)
<span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freq_dict.items():
    chars = word.split()
    <span class="hljs-keyword">for</span> char <span class="hljs-keyword">in</span> chars:
        char_freq_dict[char] += freq

char_freq_dict
</code></pre>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fecbf93c5-7fd6-4a40-a63d-e504be1bf157_396x536.png" alt="Image" width="396" height="536" loading="lazy"></p>
<h2 id="heading-step-3-merge-the-most-frequently-occurring-consecutive-byte-pairings">Step 3: Merge the most frequently occurring consecutive byte pairings</h2>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> re

## create all possible consecutive pairs
pairs = collections.defaultdict(int)
<span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freq_dict.items():
    chars = word.split()
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(chars)<span class="hljs-number">-1</span>):
        pairs[chars[i], chars[i+<span class="hljs-number">1</span>]] += freq
</code></pre>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf21979-946b-4dfb-b090-64e591c13907_400x590.png" alt="Image" width="400" height="590" loading="lazy"></p>
<h2 id="heading-step-4-iterate-n-times-to-find-the-best-in-terms-of-frequency-pairs-to-encode-and-then-concatenate-them-to-find-the-subwords">Step 4: Iterate n times to find the best (in terms of frequency) pairs to encode and then concatenate them to find the subwords</h2>
<p>It is better at this point to structure our code into functions. This means that we need to perform the following steps:</p>
<ol>
<li><p>Find the most frequently occurring byte pairs in each iteration.</p>
</li>
<li><p>Merge these tokens.</p>
</li>
<li><p>Recalculate the character tokens' frequency with the new pair encoding added.</p>
</li>
<li><p>Keep doing this until there are no more pairs or you reach the end of the for loop.</p>
</li>
</ol>
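<p>Here's one way those steps could be structured into functions. This is a minimal sketch, not the notebook's code: the function names <code>get_pair_counts</code>, <code>merge_pair</code>, and <code>learn_bpe</code> are my own, and it assumes each word is stored as space-separated symbols ending with the <code>&lt;/w&gt;</code> marker:</p>
<pre><code class="lang-python">import collections

def get_pair_counts(word_freq):
    # Count every pair of adjacent symbols, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in word_freq.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_pair(pair, word_freq):
    # Rewrite each word so that every occurrence of the pair becomes one symbol
    merged = {}
    for word, freq in word_freq.items():
        symbols = word.split()
        out = []
        i = 0
        while i != len(symbols):
            if i + 1 != len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freq, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freq)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the first pair counted
        word_freq = merge_pair(best, word_freq)
        merges.append(best)
    return merges, word_freq

text = "There is an 80% chance of rainfall today. We are pretty sure it is going to rain."
word_freq = collections.defaultdict(int)
for word in text.strip().split(" "):
    word_freq[" ".join(word) + " &lt;/w&gt;"] += 1

merges, final_words = learn_bpe(word_freq, 5)
print(merges)       # the pairs that were merged, in order
print(final_words)  # the words rewritten with the merged symbols
</code></pre>
<p>Increasing <code>num_merges</code> produces longer subwords and fewer tokens per word, which is why the number of iterations has such a big effect on the final tokens.</p>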
<p>For detailed code, you should <strong>check out my</strong> <a target="_blank" href="https://colab.research.google.com/drive/1QLlQx_EjlZzBPsuj_ClrEDC0l8G-JuTn?usp=sharing"><strong>Colab notebook</strong></a><strong>.</strong></p>
<p>Here’s a trimmed output of those 4 steps:</p>
<p><img src="https://cdn.substack.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4cb7b992-1986-4dbc-a444-da817255f80f_1295x637.png" alt="Image" width="1295" height="637" loading="lazy"></p>
<p>So as we iterate with each best pair, we merge (concatenate) the pair. You can see that as we recalculate the frequency, the original character token frequency is reduced and the new paired token frequency pops up in the token dictionary.</p>
<p>If you look at the number of tokens created, it first increases because we create new pairings – but the number starts to decrease after a number of iterations.</p>
<p>Here, we started with 25 tokens, went up to 31 tokens in the 14th iteration, and then came down to 16 tokens in the 50th iteration. Interesting, right?</p>
<h2 id="heading-how-to-improve-the-bpe-algorithm">How to improve the BPE algorithm</h2>
<p>The BPE algorithm is greedy, which means that it picks the most frequent pair in each iteration without looking ahead. And there are some limitations to this greedy approach.</p>
<p>So of course there are pros and cons of the BPE algorithm, too.</p>
<p>The final tokens will vary depending upon the number of iterations you have run. This also causes another problem: we now can have different tokens for a single text, and thus different embeddings.</p>
<p>To address this issue, multiple solutions have been proposed. The one that stood out is <a target="_blank" href="https://arxiv.org/pdf/1804.10959.pdf">subword regularization (a new method of subword segmentation)</a>, which trains a unigram language model to assign a probability to each subword token and uses a loss function to choose the best segmentation. We'll talk more about this in upcoming articles.</p>
<h2 id="heading-do-we-use-bpe-in-berts-or-gpts">Do we use BPE in BERTs or GPTs?</h2>
<p>Models like BERT or GPT-2 use some version of the BPE or the unigram model to tokenize the input text.</p>
<p>BERT uses an algorithm called WordPiece. It is similar to BPE, but has an added layer of likelihood calculation to decide whether the merged token will make the final cut.</p>
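<p>To get a feel for the difference, here's a rough sketch. It assumes a commonly described formulation of the WordPiece criterion, where each candidate pair is scored by its frequency divided by the product of the frequencies of its two parts. That formula is an assumption for illustration and is not taken from BERT's code, and the toy corpus uses made-up counts:</p>
<pre><code class="lang-python">import collections

# Toy corpus: space-separated symbols mapped to word frequency (made-up numbers)
word_freq = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}

symbol_counts = collections.defaultdict(int)
pair_counts = collections.defaultdict(int)
for word, freq in word_freq.items():
    symbols = word.split()
    for symbol in symbols:
        symbol_counts[symbol] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[left, right] += freq

def wordpiece_score(pair):
    # Pair frequency normalized by the frequencies of its two parts
    left, right = pair
    return pair_counts[pair] / (symbol_counts[left] * symbol_counts[right])

bpe_choice = max(pair_counts, key=pair_counts.get)        # most frequent pair
wordpiece_choice = max(pair_counts, key=wordpiece_score)  # highest-scoring pair

print(bpe_choice)        # ('u', 'g')
print(wordpiece_choice)  # ('g', 's')
</code></pre>
<p>BPE picks <code>('u', 'g')</code> because it is the most frequent pair. The score instead favors <code>('g', 's')</code>, because <code>s</code> almost never appears without <code>g</code> in front of it, even though the pair itself is rarer.</p>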
<h2 id="heading-summary">Summary</h2>
<p>In this blog, you've learned how a machine starts to make sense of language by breaking down the text into very small units.</p>
<p>Now, there are many ways to break text down and so it becomes important to compare one approach with another.</p>
<p>We started off by understanding tokenization by splitting the English text by spaces – but not every language is written the same way (that is, using spaces to denote segmentation). So then we looked at splitting by character to generate character tokens.</p>
<p>The problem with characters was the loss of semantic features from the tokens and the risk of creating incorrect word representations or embeddings.</p>
<p>To get the best of both worlds, we looked at subword tokenization which was more promising. And finally we looked at the BPE algorithm to implement subword tokenization.</p>
<p>We'll look more into the next steps and advanced tokenizers like WordPiece, SentencePiece, and how to work with the HuggingFace tokenizer next week.</p>
<h2 id="heading-references-and-notes">References and Notes</h2>
<p>My post is actually an accumulation of the following papers and blogs that I encourage you to read:</p>
<ol>
<li><p><a target="_blank" href="https://arxiv.org/pdf/1508.07909.pdf">Neural Machine Translation of Rare Words with Subword Units</a> - Research paper that discusses different segmentation techniques based on the BPE compression algorithm.</p>
</li>
<li><p><a target="_blank" href="https://github.com/rsennrich/subword-nmt">GitHub repo on Subword NMT(Neural Machine Translation)</a> - supporting code for the above paper.</p>
</li>
<li><p><a target="_blank" href="https://leimao.github.io/blog/Byte-Pair-Encoding/">Lei Mao’s blog on Byte Pair Encoding</a> - I used the code in his blog to implement and understand BPE myself.</p>
</li>
<li><p><a target="_blank" href="https://blog.floydhub.com/tokenization-nlp/">How Machines read</a> - a blog by Cathal Horan.</p>
</li>
</ol>
<p>If you’re looking to start in the field of data science or ML, check out my course on <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml"><strong>Foundations of Data Science &amp; ML</strong></a>.</p>
<p>If you would like to get all my tutorials/blogs delivered directly to your inbox, consider subscribing to <a target="_blank" href="https://dswharshit.substack.com/">my newsletter here.</a></p>
<p>If you have something to add or suggest, you can reach out to me via:</p>
<ul>
<li><p><a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">YouTube</a></p>
</li>
<li><p><a target="_blank" href="https://twitter.com/dswharshit">Twitter</a></p>
</li>
<li><p><a target="_blank" href="https://www.linkedin.com/in/tyagiharshit/">LinkedIn</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
