<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ llm - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ llm - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 03 Jul 2026 22:38:07 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/llm/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Codex vs Claude Code: Which AI Coding Assistant to Choose ]]>
                </title>
                <description>
                    <![CDATA[ AI coding assistants have evolved from simple autocomplete tools into capable development agents that can write code, debug applications, refactor projects, and even execute complex workflows. Among t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/codex-vs-claude-code-which-ai-coding-assistant-to-choose/</link>
                <guid isPermaLink="false">6a4697abd8f1260e868746b9</guid>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude ]]>
                    </category>
                
                    <category>
                        <![CDATA[ codex ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Thu, 02 Jul 2026 16:54:03 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4ecd4fdb-8024-4bb6-92ae-142b35c0a3c3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>AI coding assistants have evolved from simple autocomplete tools into capable development agents that can write code, debug applications, refactor projects, and even execute complex workflows.</p>
<p>Among the newest generation of tools, <a href="https://chatgpt.com/codex/">OpenAI's Codex</a> and <a href="https://claude.com/product/claude-code">Anthropic's Claude Code</a> have emerged as two of the strongest options for developers.</p>
<p>Both platforms promise to improve productivity, reduce repetitive work, and help teams ship software faster. But they approach software development differently.</p>
<p>Choosing between them depends less on finding a universal winner and more on understanding which tool aligns with your workflow, team structure, and development goals.</p>
<h3 id="heading-what-well-cover-here">What We'll Cover Here:</h3>
<ul>
<li><p><a href="#heading-understanding-codex">Understanding Codex</a></p>
</li>
<li><p><a href="#heading-understanding-claude-code">Understanding Claude Code</a></p>
</li>
<li><p><a href="#heading-codex-vs-claude-code-direct-comparison">Codex vs Claude Code: Direct Comparison</a></p>
<ul>
<li><p><a href="#heading-the-difference-in-philosophy">The Difference in Philosophy</a></p>
</li>
<li><p><a href="#heading-code-quality-and-reasoning">Code Quality and Reasoning</a></p>
</li>
<li><p><a href="#heading-workflow-integration">Workflow Integration</a></p>
</li>
<li><p><a href="#heading-deployment-options">Deployment Options</a></p>
</li>
<li><p><a href="#heading-productivity-considerations">Productivity Considerations</a></p>
</li>
<li><p><a href="#heading-security-and-oversight">Security and Oversight</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-should-you-choose-codex-or-claude-code">Should You Choose Codex or Claude Code?</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-understanding-codex"><strong>Understanding Codex</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/1f4a1f16-a95f-4157-9c1e-9129b97d07c5.png" alt="Codex interface" style="display:block;margin:0 auto" width="2004" height="1380" loading="lazy">

<p>Codex is OpenAI's dedicated coding agent designed to assist developers throughout the software development lifecycle.</p>
<p>Unlike earlier code generation tools that focused mainly on snippets and autocomplete, modern Codex operates more like an autonomous development partner.</p>
<p>It can understand large codebases, generate new features, fix bugs, review existing implementations, and work on multiple tasks simultaneously.</p>
<p>OpenAI has expanded Codex beyond a simple command-line experience, introducing desktop and cloud-based environments that allow developers to delegate work while continuing with other responsibilities.</p>
<p>According to OpenAI, Codex can read, edit, and run code while operating in its own environment to complete assigned tasks. This makes it particularly useful for teams that want an AI assistant capable of handling longer-running assignments independently.</p>
<h2 id="heading-understanding-claude-code"><strong>Understanding Claude Code</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/806861dd-6cd5-4368-9392-420227068f1c.png" alt="Claude Code interface" style="display:block;margin:0 auto" width="1442" height="666" loading="lazy">

<p>Claude Code takes a different approach. Rather than emphasising autonomous execution, Anthropic has focused heavily on developer collaboration and reasoning quality.</p>
<p>Claude Code functions as a terminal-native assistant that integrates directly into existing workflows. Developers can interact with it conversationally while maintaining close oversight of the coding process.</p>
<p>The tool is particularly strong at explaining architectural decisions, reviewing unfamiliar codebases, and helping developers work through complex implementation challenges. Instead of simply generating solutions, Claude Code often provides context that helps engineers understand why a particular approach may be preferable.</p>
<p>This makes Claude Code attractive for developers who view AI as an intelligent collaborator rather than an independent coding agent.</p>
<h2 id="heading-codex-vs-claude-code-direct-comparison"><strong>Codex vs Claude Code: Direct Comparison</strong></h2>
<h3 id="heading-the-difference-in-philosophy">The Difference in Philosophy</h3>
<p>The biggest distinction between Codex and Claude Code lies in their approaches to autonomy.</p>
<p>Codex is designed to execute delegated work efficiently. Developers describe objectives, and the system attempts to complete them with minimal intervention. It excels in situations where productivity and task completion are the primary objectives.</p>
<p>Claude Code, on the other hand, prioritises interaction. It keeps developers closely involved in the decision-making process and often produces explanations alongside implementation suggestions.</p>
<p>Neither philosophy is inherently better.</p>
<p>Teams building products under tight deadlines may benefit from Codex's autonomous capabilities. Developers working on complex systems that require thoughtful design discussions may prefer Claude Code's collaborative style.</p>
<h3 id="heading-code-quality-and-reasoning">Code Quality and Reasoning</h3>
<p>When evaluating coding assistants, raw output quality matters.</p>
<p>Claude Code has earned a reputation for producing clean, maintainable code with strong architectural awareness. It often breaks larger problems into logical components and provides reasoning that helps developers understand the trade-offs involved.</p>
<p>Codex tends to optimise for execution and efficiency. Its outputs frequently focus on accomplishing the requested task with minimal overhead while maintaining practical production considerations.</p>
<p>Comparative testing has shown that Claude Code often excels in documentation tasks and feature design. Codex demonstrates strong consistency across multiple categories of development work. Research analysing thousands of pull requests found that no single agent dominated every software engineering task, reinforcing the idea that context matters when selecting a tool.</p>
<h3 id="heading-workflow-integration">Workflow Integration</h3>
<p>The way an AI coding assistant fits into your existing development process can significantly impact adoption and long-term value.</p>
<p>Claude Code is built around a terminal-first experience, allowing developers to interact with the model directly within familiar command-line environments. This makes it particularly appealing to engineers who prefer maintaining close control over implementation decisions while receiving real-time guidance and feedback.</p>
<p>Codex takes a different approach by emphasising automation and delegation. Developers can assign coding tasks and review the completed work later, making it well-suited for teams looking to reduce repetitive workloads and improve development velocity. This model can be especially useful in larger organisations where engineers frequently juggle multiple projects and priorities.</p>
<p>Ultimately, the right choice depends on how your team prefers to work. Developers seeking an interactive coding companion may gravitate toward Claude Code, while organisations focused on streamlining execution may find Codex a better fit within their existing workflows.</p>
<h3 id="heading-deployment-options">Deployment Options</h3>
<p>Writing code is only part of the software development process. Once an application is complete, developers still need a reliable way to test, deploy, and maintain it in production.</p>
<p>Whether you use Codex or Claude Code, the deployment workflow remains largely the same. AI coding assistants can generate production-ready applications, but they don't replace the infrastructure needed to host them.</p>
<p>Developers still need platforms like Vercel, Hostinger and Railway that support automated deployments, scalable environments, SSL certificates, backups, monitoring, and straightforward rollback options.</p>
<p>For teams looking to <a href="https://docs.aws.amazon.com/solutions/generative-ai-application-builder-on-aws/">deploy apps built with Claude</a>, platforms like AWS and Vercel make it easier. They integrate continuous delivery pipelines while providing the reliability expected from production systems.</p>
<p>The same applies when you try to <a href="https://www.hostinger.com/web-apps-hosting/codex-hosting">deploy apps built with Codex</a>. Services such as Hostinger simplify deployments with managed Node.js hosting, Git integration, and built-in security features, allowing developers to move from AI-generated code to a live production environment with minimal configuration.</p>
<p>As AI coding assistants become part of everyday development workflows, selecting the right production hosting for AI-generated applications is becoming just as important as choosing the coding tool itself. The best workflow combines an intelligent development assistant with infrastructure that makes shipping software fast, reliable, and repeatable.</p>
<h3 id="heading-productivity-considerations">Productivity Considerations</h3>
<p>One of the primary reasons organisations adopt AI coding assistants is to improve development velocity.</p>
<p>Codex often shines when repetitive or well-defined tasks dominate the workload. Generating boilerplate code, implementing straightforward features, writing tests, or executing multi-step workflows are scenarios where autonomy can deliver meaningful time savings.</p>
<p>Claude Code provides value during exploratory development. Developers can brainstorm implementation approaches, validate assumptions, and receive guidance while preserving human oversight.</p>
<p>The productivity gains from each tool depend heavily on how teams allocate engineering effort.</p>
<p>Organisations emphasising rapid delivery may prioritise Codex.</p>
<p>Teams prioritising knowledge sharing and architectural consistency may lean toward Claude Code.</p>
<h3 id="heading-security-and-oversight">Security and Oversight</h3>
<p>As AI agents gain more capabilities, governance becomes increasingly important.</p>
<p>Claude Code's interactive design naturally encourages human review before significant actions occur. This reduces the likelihood of unintended modifications and reinforces developer accountability.</p>
<p>Codex introduces stronger automation capabilities, which can accelerate workflows but also require clearly defined operational safeguards. Organisations adopting autonomous coding agents should establish review processes, permission controls, and testing requirements before integrating them into production environments.</p>
<p>The goal is not to eliminate human involvement but to position AI appropriately within existing software development practices.</p>
<h2 id="heading-should-you-choose-codex-or-claude-code"><strong>Should You Choose Codex or Claude Code?</strong></h2>
<p>The answer depends on how you work.</p>
<p>Choose Codex if your team values autonomy, wants to delegate substantial development tasks, and needs an assistant that can operate independently across multiple assignments. Organisations focused on maximising throughput may find this approach particularly compelling.</p>
<p>Choose Claude Code if you prefer collaborative problem-solving, appreciate detailed reasoning, and want AI assistance that remains closely integrated with human decision-making throughout the development process.</p>
<p>Neither assistant replaces engineering judgment. Instead, they amplify different aspects of software development.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>The debate between Codex and Claude Code reflects a broader shift within software engineering. AI assistants are no longer limited to suggesting individual lines of code. They're evolving into sophisticated development partners capable of influencing planning, implementation, testing, and deployment.</p>
<p>Codex emphasises execution. Claude Code emphasises collaboration.</p>
<p>For some teams, Codex will unlock significant productivity gains by handling routine work autonomously. For others, Claude Code will enhance decision-making by serving as an intelligent coding companion.</p>
<p>Ultimately, the best choice is the one that complements your team's existing strengths and addresses its most significant bottlenecks.</p>
<p>As AI continues to reshape development practices, the organisations that succeed will not necessarily be those using the most advanced tools. They will be the ones who integrate those tools thoughtfully into well-defined engineering processes.</p>
<p>Hope you enjoyed this article. You can <a href="https://linkedin.com/in/manishmshiva">connect with me on LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an AI Agent That Runs its Own LLM Experiments with autoresearch ]]>
                </title>
                <description>
                    <![CDATA[ A few months ago, Andrej Karpathy released autoresearch. It's an open-source Python tool that lets an AI agent run experiments on one GPU while you sit back and wait for the results. Lately I've still ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-an-ai-agent-that-runs-its-own-llm-experiments-with-autoresearch/</link>
                <guid isPermaLink="false">6a42a24e2a8a54195ace1aab</guid>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ ishaan gupta ]]>
                </dc:creator>
                <pubDate>Mon, 29 Jun 2026 16:50:22 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4f910471-5f78-41c0-a30e-7630737bbb74.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A few months ago, Andrej Karpathy released <a href="https://github.com/karpathy/autoresearch"><strong>autoresearch</strong></a>. It's an open-source Python tool that lets an AI agent run experiments on one GPU while you sit back and wait for the results.</p>
<p>Lately I've still seen folks on Twitter arguing about whether AI agents can build their <em>“million dollar idea”</em> or something about <em>Openclaw</em>. But here's a repo that lets you hand an agent a real GPT training setup and ask it to do the research itself.</p>
<p>In short, it edits the code, trains, reads the loss, decides whether the result is an improvement, and repeats. All of this happens while you sleep or work on something else. And, surprisingly, it actually works.</p>
<p>On a depth-12 nanochat baseline (more on what "depth" means later), Karpathy left it running for about two days. Over roughly 700 experiments, the agent found about 20 changes that genuinely improved the model, and those changes stacked on top of each other.</p>
<p>In this article, I'll walk through what autoresearch is, why the way it measures success is the whole trick, what each file in the repo actually does, what the agent tends to discover, and a step-by-step guide to running it yourself. By the end you should be able to point an agent at your own GPU and let it run.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-autoresearch">What is autoresearch?</a></p>
</li>
<li><p><a href="#heading-why-this-matters">Why This Matters</a></p>
</li>
<li><p><a href="#heading-what-exactly-is-valbpb">What Exactly is <code>val_bpb</code>?</a></p>
</li>
<li><p><a href="#heading-what-the-agent-actually-finds">What the Agent Actually&nbsp;Finds</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>This article is a complete walkthrough of this repo. The goal is that by the end, you'll understand what autoresearch is and how you can run it on your own machine.</p>
<p>No prior ML research experience is required, though the deeper sections will mean more if you have some. Basic familiarity with GPUs and VRAM, and with cards like the H100, A100, and RTX 4090, is enough. Don't worry if some terms are new: I explain every term I think a beginner needs as it comes up.</p>
<h2 id="heading-what-is-autoresearch">What is autoresearch?</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6a0065c6e3eebc2e20691ad8/4d4413c5-7264-49b0-bcb0-1cf8b7e763f7.png" alt="flowchart of the autoresearch loop" style="display:block;margin:0 auto" width="1600" height="967" loading="lazy">

<p>Simply put, autoresearch is just one specific idea executed cleanly. You take a small but real LLM training setup, put it in a single Python file, and let an AI agent edit that file.</p>
<p>The agent runs the file and reads the loss. When you train a language model, "loss" is just a single number that scores how badly the model is predicting the next chunk of text. A high number means it's guessing poorly, and a number close to zero means it's predicting almost perfectly.</p>
<p>Training is the process of nudging the model's millions of internal weights to push that number down. So when I say the agent "reads the loss," I mean it looks at that score to judge whether the change it just made helped or hurt.</p>
<p>Based on that score, the agent decides whether the change helped, and then either keeps the change or reverts it. Then it tries something else.</p>
<p>The flow runs top to bottom like this: a human (you) writes the playbook (a Markdown file called <code>program.md</code>), which spells out the rules. An AI agent reads that playbook and starts an experiment loop.</p>
<p>In each pass of the loop, the agent edits the training code with a new idea, trains for five minutes, reads the resulting score, decides whether to keep or undo the change, and writes the outcome to a results file. Then it loops back and tries the next idea.</p>
<p>It does this on its own, around twelve times an hour. So a full night of sleep buys you roughly a hundred experiments and, with luck, a noticeably better model by morning.</p>
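<p>The loop above can be sketched in a few lines of Python. This is only an illustration of the keep-or-revert logic, not code from the repo: <code>run_experiment</code> is a hypothetical stand-in that returns a made-up score instead of actually editing <code>train.py</code> and training for five minutes, and the results list plays the role of the results file.</p>

```python
import random

def run_experiment(current_best, rng):
    """Hypothetical stand-in for: edit train.py, train 5 minutes, read val_bpb."""
    # Most simulated ideas hurt slightly; a few help.
    return current_best + rng.uniform(-0.005, 0.010)

def research_loop(ideas, baseline_bpb, seed=0):
    rng = random.Random(seed)
    best_bpb = baseline_bpb
    results = []  # plays the role of the results file
    for idea in ideas:
        score = run_experiment(best_bpb, rng)
        kept = best_bpb > score  # lower val_bpb is better
        if kept:
            best_bpb = score  # keep the change as the new starting point
        # A rejected idea would be reverted here, restoring the previous code.
        results.append({"idea": idea, "val_bpb": score, "kept": kept})
    return best_bpb, results

# Twelve ideas is roughly one hour of experiments at five minutes each.
best, log = research_loop(["idea-%d" % i for i in range(12)], baseline_bpb=1.00)
print(best, sum(entry["kept"] for entry in log))
```

<p>Because each kept change becomes the new starting point, improvements build on one another, and the best score can only stay the same or go down.</p>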
<p>The repo is laid out so the agent has exactly one knob to turn. It can't install new packages or change how the data is loaded or how the loss is measured. All of that is locked down on purpose. The only file the agent edits is <code>train.py</code>, which contains the model architecture, the optimizer, the batch size, the learning rate, and the structure of the training loop itself.</p>
<p>The reason this design works is the same reason a controlled experiment in any field works. If the data, the metric, and the budget are all fixed, then any change in the result must be coming from the change the agent made. The agent is doing science the way a careful researcher would, only it doesn't get tired and doesn't need lunch.</p>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>It's tempting to read this as just another agent demo. But it's not, and the reason is the metric. That metric is called val_bpb, short for validation bits per byte. It's a specific way of scoring how well the model predicts text it has never seen during training (the "validation" set).</p>
<p>I'll break down exactly how it's calculated in the next section, but the one-line version is that it measures, on average, how many bits of information the model needs to encode each byte of text. Lower is better: a lower val_bpb means the model is surprised less often by real text, which is the whole goal.</p>
<p>The reason Karpathy uses bits per byte rather than the raw training loss is that bits per byte doesn't change just because you changed the vocabulary, so two very different models can still be compared fairly. The "lower is better" part and the "vocabulary-independent" part are two separate properties. The metric happens to have both.</p>
<p>When I say a baseline model from this repo "lands around 1.00 bpb," I mean that if you run the default untouched training script for its 5 minutes, the model it produces scores roughly 1.00 on this metric when measured on the held-out validation text. That's your starting line.</p>
<p>From there, an improvement of 0.005 bpb (so a score of about 0.995) is a small but real win, the kind the agent finds often. An improvement of 0.05 (a score near 0.95) would be enormous, the kind of jump you'd usually only get from a much bigger model or a much longer training run. So the numbers look tiny, but on this scale, thousandths of a bit genuinely matter.</p>
<p>Here's why optimizing this particular number is a big deal. The agent isn't chasing some artificial leaderboard that researchers spent years gaming. It's pushing down the same kind of validation loss curve that every major language model has been trained against since GPT-2 in 2019.</p>
<p>A "loss curve" is just the plot of that score dropping over the course of training, and "the wave of LLMs since GPT-2" is shorthand for the fact that essentially all of the progress, from GPT-2 to today's frontier models, came from people finding ways to make that curve drop faster or lower for the same amount of compute. The agent is working on the exact same problem, just at a small, fast, cheap scale.</p>
<p>And that's what makes the next part surprising. When the agent finds an improvement "here," I mean on the small depth-12 model it's allowed to edit. "Depth" is the number of transformer layers stacked in the model. A depth-12 model is small, and a depth-24 model is a bigger one with twice as many layers.</p>
<p>Karpathy took the roughly 20 tweaks the agent discovered on the small depth-12 model and applied them to the bigger depth-24 model, where they stacked cleanly. Stacking cleanly means two things at once: the improvements were additive (turning on all 20 together gave you the sum of their individual gains, rather than cancelling each other out), and they transferred (gains found on the small model still showed up on the big one).</p>
<p>That's the signal that the agent found real insights about training, not lucky quirks that only help at one specific size. Stacked together, they cut Karpathy's "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours, which is about an 11% speedup on code he'd already hand-tuned for a long time.</p>
<p>The other thing that's significant is the budget. Each experiment runs for exactly 5 minutes of wall-clock training time, no more, no less. That gives roughly 12 experiments per hour, or about 100 in a typical 8-hour sleep cycle.</p>
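<p>If you want to check the arithmetic behind those figures, here is a quick sketch. It counts training time only, so it assumes no overhead between experiments for startup, evaluation, or the agent's own thinking. The real count per night will be a little lower.</p>

```python
MINUTES_PER_EXPERIMENT = 5  # fixed training budget per experiment

experiments_per_hour = 60 // MINUTES_PER_EXPERIMENT
experiments_per_night = experiments_per_hour * 8  # an 8-hour night

# "Time to GPT-2" before and after the stacked changes, in hours.
before_hours, after_hours = 2.02, 1.80
time_saved_fraction = (before_hours - after_hours) / before_hours

print(experiments_per_hour)                  # 12
print(experiments_per_night)                 # 96
print(round(time_saved_fraction * 100, 1))   # 10.9
```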
<h3 id="heading-exploring-the-repo">Exploring the Repo</h3>
<p>Now if you clone the repo, you get a small handful of files. Most of them are plumbing. Only three are the heart of the system, and the difference between them is who edits what.</p>
<ol>
<li><p><code>train.py</code> is the file the agent edits. It holds the GPT model, the optimizer, and the training loop, and everything in it is fair game.</p>
</li>
<li><p><code>prepare.py</code> is the fixed foundation that nobody edits during a run: it downloads the data, trains the tokenizer, and defines the metric.</p>
</li>
<li><p><code>program.md</code> is the file you, the human, edit: it's the playbook of rules the agent follows.</p>
</li>
</ol>
<p>The remaining files (README.md, pyproject.toml, uv.lock, .gitignore, .python-version, the analysis.ipynb notebook, and the progress.png image) are plumbing and documentation that neither you nor the agent needs to touch during a run.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a0065c6e3eebc2e20691ad8/1a8acbf9-87a3-428e-9cc1-53aaee2adc91.png" alt="three main files that we need to understand" style="display:block;margin:0 auto" width="1600" height="752" loading="lazy">

<h2 id="heading-what-exactly-is-valbpb">What Exactly is <code>val_bpb</code>?</h2>
<p>Before going further, it helps to understand what val_bpb is. If you've read other LLM articles, you have probably seen terms like <strong>“perplexity”</strong> or <strong>“cross-entropy loss”</strong> thrown around.</p>
<p>Bits per byte is like their cousin. When a language model predicts text, it assigns probabilities to what comes next. If the model is confident and right, it gets a low loss. If it's confident and wrong, it gets a high loss, a large penalty. Add up those penalties across all the text and you get the model's total loss. Lower is better, because a lower total means the model assigned high probability to the words that actually appeared.</p>
<p>Cross-entropy loss is the standard scoring function for training language models. For each token, the model assigns a probability to every possible next token and the loss is the negative logarithm of the probability it gave to the token that actually came next. Predict the right token confidently and the loss is near zero. Assign low probability to the correct token and the loss is large. The model's total loss is the average of this across all tokens.</p>
<p>Cross-entropy loss measures this in nats. A nat is the unit you get when that logarithm is taken in base e (the natural log) instead of base 2. It measures the same quantity of "surprise" on a different scale (one nat is about 1.44 bits). Dividing the loss by the natural log of 2 is what rescales nats into bits, which is the conversion bits per byte performs.</p>
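<p>Here is a minimal sketch of those two ideas using only Python's standard library. The function names are mine, not the repo's, and the probabilities are made-up examples.</p>

```python
import math

def token_loss_nats(prob_of_correct_token):
    """Cross-entropy for one prediction: negative natural log of the probability."""
    return -math.log(prob_of_correct_token)

def nats_to_bits(nats):
    """Rescale from base e to base 2 by dividing by ln(2)."""
    return nats / math.log(2)

confident_right = token_loss_nats(0.99)  # close to zero
confident_wrong = token_loss_nats(0.01)  # a large penalty
print(round(confident_right, 3), round(confident_wrong, 3))
print(round(nats_to_bits(1.0), 2))  # 1.44
```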
<p>Bits per byte takes the loss summed over all tokens, divides it by the number of bytes the text actually contains, and converts the result from nats to bits (log base 2). The result is a number that tells you, on average, how many bits of information the model needs to encode each byte of text.</p>
<p>A perfect model would need close to zero, while a model that guesses uniformly at random over all 256 possible byte values would need 8 bits per byte (since a byte has 8 bits).</p>
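<p>As a rough sketch of that calculation (my own simplified version, not the repo's <code>evaluate_bpb</code>), you can sum the per-token losses in nats, then divide by ln(2) times the number of UTF-8 bytes. The example feeds in the losses of an imaginary byte-level model that guesses uniformly over all 256 byte values.</p>

```python
import math

def bits_per_byte(token_losses_nats, text):
    """Summed loss in nats, converted to bits and spread over the UTF-8 bytes."""
    total_nats = sum(token_losses_nats)
    total_bytes = len(text.encode("utf-8"))
    return total_nats / (math.log(2) * total_bytes)

text = "hello world"
# A uniform guess over 256 byte values costs ln(256) nats for every byte.
uniform_losses = [math.log(256)] * len(text.encode("utf-8"))
print(bits_per_byte(uniform_losses, text))  # approximately 8
```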
<p>The reason Karpathy chose bpb instead of plain cross-entropy is that bpb is <strong>vocabulary-size-independent</strong>. If the agent decides to change the tokenizer or the vocabulary, the cross-entropy loss would be completely different even for the same model quality. Bits per byte normalizes that out, so a depth-8 model with vocab 8192 and a depth-12 model with vocab 16384 are directly comparable.</p>
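<p>To see why the normalization matters, here is a toy comparison. It assumes, purely for illustration, that two models of equal quality spend the same total loss on the same text but split that text into a different number of tokens. The numbers are invented.</p>

```python
import math

text = "hello world!"   # 12 bytes of UTF-8
total_nats = 12.0       # assumed total loss, the same under both tokenizers

def mean_token_loss(total_nats, num_tokens):
    return total_nats / num_tokens

def bits_per_byte(total_nats, text):
    return total_nats / (math.log(2) * len(text.encode("utf-8")))

coarse = mean_token_loss(total_nats, num_tokens=3)  # large vocabulary, few tokens
fine = mean_token_loss(total_nats, num_tokens=6)    # small vocabulary, more tokens
print(coarse, fine)                                 # 4.0 2.0
print(round(bits_per_byte(total_nats, text), 3))    # 1.443
```

<p>The per-token loss doubles just because the tokens got bigger, while bits per byte never looks at the token count, so it gives one number for both.</p>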
<p>The function that computes this, <code>evaluate_bpb</code>, lives in <code>prepare.py</code>, which the agent is never allowed to edit. It can only touch <code>train.py</code>. Because the metric's definition sits in a file the agent can't modify, it can't lower its score by quietly changing how the score is calculated. The scoring rule stays identical for every experiment, which is what makes the comparison honest.</p>
<h3 id="heading-the-5-minute-rule">The 5 Minute&nbsp;Rule</h3>
<p>There's one design choice in autoresearch that deserves its own section, because it's the choice that makes the whole thing work in practice. Every experiment runs for exactly 5 minutes of wall-clock training time regardless of what the agent is doing.</p>
<p>Wall-clock time means real elapsed time: what a clock on the wall measures, not the number of training steps or tokens processed. 5 minutes of wall-clock time is 5 literal minutes, regardless of how much the model does in them.</p>
<p>If you trained for a fixed number of steps instead, the agent could “win” by making the model so small that it ripped through more steps than the baseline. If you trained for a fixed number of tokens, the agent could win by lowering the sequence length.</p>
<p>“Winning” here doesn't mean beating another agent. The agent's only objective is to push val_bpb below the previous best score on this exact setup. So "winning" means producing a lower score, and the risk is that it lowers the score through a degenerate shortcut that games whichever budget you chose rather than through a real efficiency gain. If you trained until convergence instead, each run would take a wildly different amount of time and you would never finish 100 experiments in a night.</p>
<p>A fixed wall clock budget cuts through all of this. The agent is forced to optimize for actual training efficiency on the actual hardware in front of it. If it makes the model slightly bigger but the per-step compute drops because of a smarter attention pattern, that's a real win. If it speeds up the per-step compute but the model now learns less per step, that shows up as a worse val_bpb. The two effects get netted out automatically in the end.</p>
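<p>A fixed wall-clock budget is simple to express in code. The sketch below is my own illustration rather than the repo's training loop: <code>train_for_budget</code> is a hypothetical helper, <code>time.sleep</code> stands in for a real training step, and the budget is a fraction of a second instead of 300 seconds so it finishes instantly.</p>

```python
import time

def train_for_budget(train_step, budget_seconds):
    """Run train_step repeatedly until the wall-clock budget is spent."""
    start = time.monotonic()
    steps = 0
    while budget_seconds > time.monotonic() - start:
        train_step()
        steps += 1
    return steps

# Same budget, different cost per step: the cheaper step fits in more updates.
fast_steps = train_for_budget(lambda: time.sleep(0.001), budget_seconds=0.2)
slow_steps = train_for_budget(lambda: time.sleep(0.05), budget_seconds=0.2)
print(fast_steps, slow_steps)
```

<p>The step count is an outcome here, not a setting. Whether more, cheaper steps beat fewer, richer ones is decided only by the val_bpb measured at the end.</p>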
<p>There's a tradeoff, though. Because the budget is wall-clock, the val_bpb you get on an H100 isn't directly comparable to the val_bpb you get on a 4090 or an A100. The H100 and A100 are NVIDIA datacenter GPUs and the RTX 4090 is a high-end consumer card. They differ sharply in speed and memory, so in a fixed 5 minute budget, a faster card processes more data and reaches a lower val_bpb.</p>
<p>The system is designed to find the best model <strong>for your specific compute platform</strong> in 5 minutes, not to be a global benchmark.</p>
<p>If you want to compare across hardware, you would need a budget that doesn't depend on how fast the hardware is. For the autonomous research use case, though, the wall-clock budget is exactly right.</p>
<p>Let’s get into each of the files in depth now.</p>
<h3 id="heading-1-preparepy">1. <code>prepare.py</code></h3>
<p>Nobody touches this file but everything depends on it. It mainly performs three jobs.</p>
<p>The first job is downloading data. The training corpus is ClimbMix-400B, a high-quality web dataset hosted on Hugging Face and shuffled into 6,543 parquet shards. By default <code>prepare.py</code> downloads only 10 of these (a few gigabytes), which is plenty for running thousands of 5-minute experiments.</p>
<p>The very last shard is always downloaded and pinned as the validation set. That pinning matters, since every experiment (no matter what changes) evaluates on the exact same held-out data.</p>
<p>The second job is training a tokenizer. The repo uses <strong>rustbpe</strong>, a fast Rust implementation of byte-pair encoding, to learn a vocabulary of 8,192 tokens from a sample of the training data. The result is exported as a tiktoken-compatible encoding so it integrates cleanly with PyTorch downstream. There's also a small precomputed lookup table called <code>token_bytes.pt</code> that maps each token id to its UTF-8 byte length. This is what makes the bpb calculation honest: the loss is normalized by bytes of text rather than by the number of tokens.</p>
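<p>To make the bits-per-byte idea concrete, here's a minimal sketch of the arithmetic. It assumes the metric is the total cross-entropy (converted from nats to bits) divided by the total UTF-8 byte count of the target tokens. The function name and the sample numbers are hypothetical, and the real <code>evaluate_bpb</code> in <code>prepare.py</code> may differ in its details.</p>
<pre><code class="language-python">import math


def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Total loss in bits divided by the total number of UTF-8 bytes predicted."""
    total_nats = sum(token_nll_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)


# Hypothetical values: per-token loss in nats, and the byte length of each
# target token (the kind of lookup token_bytes.pt is described as providing).
losses = [2.0, 1.5, 3.0]
byte_lengths = [4, 3, 5]
print(round(bits_per_byte(losses, byte_lengths), 4))
</code></pre>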
<p>The third job is providing utilities that <code>train.py</code> imports at runtime. The dataloader is the interesting one. It does what's called <strong>best-fit packing</strong>: every row in the batch starts with a special BOS (beginning of sequence) token and the loader fills the row by greedily picking documents that fit in the remaining space. Only when no document fits does it crop the shortest available document to fill the gap.</p>
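<p>The packing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's dataloader: it assumes each document is a list of token ids that already starts with BOS, it reads "best fit" as "the largest document that still fits", and it throws away the cropped remainder. The name <code>pack_row</code> and the token ids are hypothetical.</p>
<pre><code class="language-python">def pack_row(docs, row_len):
    """Build one row of exactly row_len tokens, consuming documents from docs.

    Assumes every document already starts with BOS and that docs holds
    enough tokens to fill the row.
    """
    row = []
    while len(row) &lt; row_len:
        space = row_len - len(row)
        fitting = [doc for doc in docs if len(doc) &lt;= space]
        if fitting:
            chosen = max(fitting, key=len)  # largest document that still fits
            row.extend(chosen)
        else:
            chosen = min(docs, key=len)     # nothing fits: crop the shortest one
            row.extend(chosen[:space])
        docs.remove(chosen)
    return row


BOS = 0  # hypothetical token id
docs = [[BOS, 1, 2, 3, 4], [BOS, 5, 6], [BOS, 7], [BOS, 8, 9, 10, 11, 12, 13]]
print(pack_row(docs, row_len=8))
print(pack_row(docs, row_len=8))
</code></pre>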
<p>The result is 100% utilization with no padding. This is meaningfully faster than the naïve approach of just truncating long documents and padding short ones.</p>
<p>The constants at the top of <code>prepare.py</code> are deliberately simple. Three numbers and a sequence length define the entire experimental contract.</p>
<p>If you run autoresearch on different hardware and want to compare results with a friend, the only thing both of you need to share is these constants. That's the whole point of putting them here and nowhere else.</p>
<h3 id="heading-2-trainpy">2. <code>train.py</code></h3>
<p>This is the file the agent lives in. It breaks naturally into four parts: the model, the optimizer (Muon for the matrix weights, AdamW for the embeddings and scalar parameters), the hyperparameters, and the training loop. We'll walk through each one with the goal of understanding why each piece exists.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a0065c6e3eebc2e20691ad8/a4847be2-2007-42e0-91bd-9599125b5ffc.png" alt="you can see in the image that the agent only controls the two green boxes in the middle, the model and the loop" style="display:block;margin:0 auto" width="1600" height="644" loading="lazy">

<p>The model is a fairly modern GPT written from scratch with no library dependencies beyond PyTorch and a Flash Attention 3 kernel. If you've read other GPT implementations, the high-level structure will look familiar: a token embedding, a stack of transformer blocks, a normalization layer, and a linear head that projects back to vocabulary logits.</p>
<p>The interesting parts are in the details. I don’t think explaining the architecture or code is required for this repo, so I’ll just draw out a small architecture diagram for those of you who want to visualize it. Then I'll explain how the training loop is written.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a0065c6e3eebc2e20691ad8/42663aea-dace-4d97-8bf4-7294d61f8a0d.png" alt="simple explanation of the  model in train.py- token embedding feeding a stack of transformer blocks, then a normalization layer, then a linear head producing vocabulary logits" style="display:block;margin:0 auto" width="1600" height="1600" loading="lazy">

<p>The loop itself is short and almost pleasant to read. The skeleton is:</p>
<pre><code class="language-python">while True:
    # accumulate gradient over micro-batches to hit TOTAL_BATCH_SIZE
    for micro_step in range(grad_accum_steps):
        with autocast_ctx:
            loss = model(x, y)
        loss = loss / grad_accum_steps
        loss.backward()
        x, y, epoch = next(train_loader)

    # update LR / momentum / weight decay based on time elapsed
    progress = min(total_training_time / TIME_BUDGET, 1.0)
    # ... set group["lr"], group["momentum"], group["weight_decay"] ...

    optimizer.step()
    model.zero_grad(set_to_none=True)

    # log step metrics
    # ...

    if step &gt; 10 and total_training_time &gt;= TIME_BUDGET:
        break
</code></pre>
<p>There are a few things worth noticing here. First, the time budget is checked after the first 10 steps. This is so the budget doesn't include the initial PyTorch compilation (which can take 30 seconds or more). Without this, fast experiments would get penalized for spending half their budget on warmup.</p>
<p>Second, the loop has a fast-fail check. If the loss explodes or hits NaN it prints “FAIL” and exits. The agent then sees a crash and logs it. This is a defense against the agent doing something that diverges spectacularly.</p>
<p>Third, after the loop ends, there's a single final call to <code>evaluate_bpb</code> and then a structured summary printed to stdout.</p>
<p>That summary is the whole API between the training script and the agent:</p>
<pre><code class="language-yaml">---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8
</code></pre>
<p>This is what the grep extracts and the agent reads. The whole experimental contract is nine lines of this plain text.</p>
<h4 id="heading-the-hyperparameters">The Hyperparameters</h4>
<p>The hyperparameters live in their own clearly-marked section near the bottom of <code>train.py</code>, with a comment that says "edit these directly, no CLI flags needed." They look like this:</p>
<pre><code class="language-python"># Model architecture
ASPECT_RATIO = 64       # model_dim = depth * ASPECT_RATIO
HEAD_DIM = 128          # target head dimension for attention
WINDOW_PATTERN = "SSSL" # sliding window pattern: L=full, S=half context

# Optimization
TOTAL_BATCH_SIZE = 2**19 # ~524K tokens per optimizer step
EMBEDDING_LR = 0.6
UNEMBEDDING_LR = 0.004
MATRIX_LR = 0.04
SCALAR_LR = 0.5
WEIGHT_DECAY = 0.2
ADAM_BETAS = (0.8, 0.95)
WARMUP_RATIO = 0.0
WARMDOWN_RATIO = 0.5
FINAL_LR_FRAC = 0.0

# Model size
DEPTH = 8
DEVICE_BATCH_SIZE = 128
</code></pre>
<p>Everything here is a deliberate single source of truth. The model dimension is computed from depth (<code>depth × 64</code>, rounded to a multiple of the head dimension). The number of heads is computed from the model dimension. This means that the agent can change one number, <code>DEPTH</code>, and the model rescales itself coherently.</p>
<p>That kind of "one knob to scale the model" parameterization is exactly what makes a search space tractable.</p>
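<p>Here's a rough sketch of how that one-knob scaling could work, using the constants shown above. The rounding direction (up to the next multiple of <code>HEAD_DIM</code>) and the helper name <code>model_shape</code> are my assumptions, so check <code>train.py</code> for the exact rule.</p>
<pre><code class="language-python">ASPECT_RATIO = 64
HEAD_DIM = 128


def model_shape(depth):
    """Derive (model_dim, num_heads) from depth alone."""
    base_dim = depth * ASPECT_RATIO
    # Round up to a multiple of HEAD_DIM (assumed rounding direction)
    model_dim = ((base_dim + HEAD_DIM - 1) // HEAD_DIM) * HEAD_DIM
    num_heads = model_dim // HEAD_DIM
    return model_dim, num_heads


for depth in (4, 5, 8):
    print(depth, model_shape(depth))
</code></pre>
<p>With the default <code>DEPTH = 8</code>, this gives a model dimension of 512 and 4 attention heads.</p>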
<h3 id="heading-3-programmd">3. <code>program.md</code></h3>
<p><code>program.md</code> is the shortest of the three files and is arguably the most important. It's the file that we edit and it contains everything the agent needs to know about how to behave during a run.</p>
<p>The structure of <code>program.md</code> mirrors the lifecycle of a research session. It opens with <strong>setup</strong>: agreeing on a run tag, creating a Git branch named <code>autoresearch/&lt;tag&gt;</code>, reading the in-scope files, verifying that the data exists, and initializing a results file. It then describes the experimentation rules: what the agent can and can't modify, that VRAM is a soft constraint, and, crucially, a simplicity criterion that says all else being equal, simpler is better.</p>
<p>A 0.001 bpb improvement that adds 20 lines of hacky code isn't worth keeping. A 0.001 bpb improvement that <strong>removes</strong> 20 lines is definitely worth keeping.</p>
<p>Then comes the actual loop. The agent is told to run training with <code>uv run train.py &gt; run.log 2&gt;&amp;1</code> and never to use <code>tee</code> or stream the output because that would flood the agent's context window. It's also told to extract metrics with <code>grep "^val_bpb:\|^peak_vram_mb:" run.log</code>, which gives just the one or two lines that matter.</p>
<p>If the grep produces nothing, that means the run crashed and the agent is told to read the last 50 lines of the log and try to fix the issue (but it should give up after a few attempts and move on). The result of every experiment is logged to <code>results.tsv</code>.</p>
<p>The decision rule is simple: if val_bpb improved (got lower) then the agent advances the branch by keeping its commit. If it didn't improve, the agent runs <code>git reset</code> to undo the commit. If it crashed, the agent logs that and tries something else.</p>
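<p>The extract-and-decide step can be sketched in Python. In the real workflow the agent does this with <code>grep</code> and Git commands, so treat the helper names <code>parse_metrics</code> and <code>decide</code> as hypothetical stand-ins that only mirror the logic described above.</p>
<pre><code class="language-python">import re


def parse_metrics(log_text):
    """Collect val_bpb and peak_vram_mb from the log, if they were printed."""
    pattern = r"^(val_bpb|peak_vram_mb):\s*(\S+)"
    found = re.findall(pattern, log_text, flags=re.MULTILINE)
    return {key: float(value) for key, value in found}


def decide(metrics, best_val_bpb):
    """Return 'crash', 'keep', or 'discard' for one finished experiment."""
    if "val_bpb" not in metrics:
        return "crash"    # no summary printed: read the log tail and retry
    if metrics["val_bpb"] &lt; best_val_bpb:
        return "keep"     # lower is better: keep the commit
    return "discard"      # no improvement: undo the commit


log = "---\nval_bpb:          0.997900\npeak_vram_mb:     45060.2\n"
metrics = parse_metrics(log)
print(metrics, decide(metrics, best_val_bpb=1.0))
</code></pre>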
<p>The last paragraph of <code>program.md</code> is the one that makes autoresearch what it is. It's titled <strong>NEVER STOP</strong>. The agent is explicitly told not to ask the human (you) if it should keep going, not to ask for any permissions, and not to pause for confirmation. If the agent runs out of ideas, it should think harder, look at the failures, combine near-misses, and try more radical changes.</p>
<p>The loop runs until we interrupt it. This single instruction is more interesting than any line of Python in the repo. It's the difference between an agent that does a few experiments and asks if you want to continue and an agent that genuinely does autonomous research overnight.</p>
<p>There is no contradiction with the 5 minute budget. 5 minutes governs a single experiment, one training run. The "Never stop" instruction governs the outer loop. The moment one run finishes and the agent logs the result, it launches the next one. It keeps starting fresh 5 minute experiments back-to-back until you interrupt it.</p>
<p>Nothing ever trains for more than five minutes. The agent simply never stops starting new 5 minute trainings.</p>
<p>Now that you understand how it works, let’s start using it.</p>
<h2 id="heading-setup-guide">Setup Guide</h2>
<p>I'm assuming you have a single NVIDIA GPU. The default settings target an H100 with 80GB of VRAM. Cards with less memory, such as a 24GB 4090 or 3090, need some tuning, which I'll cover later on.</p>
<h3 id="heading-step-1-install-uv-the-python-project-manager-the-repo-uses">Step 1: Install uv, the Python Project Manager the Repo Uses</h3>
<p>uv is much faster than pip and handles virtual environments transparently. After you install it, clone the repo and install the dependencies:</p>
<pre><code class="language-shell">curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
</code></pre>
<p>This will create a&nbsp;<code>.venv</code> and install PyTorch, Flash Attention, rustbpe, tiktoken, pyarrow, and a few other packages. It pulls PyTorch from the CUDA 12.8 wheel index, so make sure your driver supports that.</p>
<h3 id="heading-step-2-run-the-data-preparation">Step 2: Run the Data Preparation</h3>
<p>This downloads 10 ClimbMix shards plus the validation shard and then trains our tokenizer.</p>
<pre><code class="language-shell">uv run prepare.py
</code></pre>
<p>It takes about 2 minutes on a decent connection. If you have limited disk space, you can pass <code>--num-shards 4</code> for a smaller download. The data and tokenizer get cached in <code>~/.cache/autoresearch/</code>.</p>
<h3 id="heading-step-3-run-a-manual-training-experiement">Step 3: Run a Manual Training Experiment</h3>
<p>Now, you'll run a single training experiment manually, just to confirm that everything works end-to-end.</p>
<pre><code class="language-shell">uv run train.py
</code></pre>
<p>You should see the model compile (this takes 30 seconds or so the first time), then training output that looks something like this: <code>step 00050 (8.3%) | loss: 5.123456 | lrm: 1.00 | dt: 240ms | tok/sec: 2,184,533 | mfu: 39.8% | epoch: 1 | remaining: 275s</code>.</p>
<p>After about 5 minutes of training, plus an evaluation pass at the end, you'll get the summary block with <code>val_bpb</code> printed. That's your baseline.</p>
<h3 id="heading-step-4-hand-the-repo-to-an-agent">Step 4: Hand the Repo to an Agent</h3>
<p>In practice, this means opening Claude Code or your tool of choice in the repo directory, ideally with permissions disabled or scoped tightly to the repo, and prompting it with something like this:</p>
<pre><code class="language-plaintext">Have a look at program.md and let's kick off a new experiment.
Let's do the setup first.
</code></pre>
<p>The agent will read <code>program.md</code>, walk through the setup steps (creating the autoresearch branch and initializing <code>results.tsv</code>), confirm with you, and then start running. From this point on, you can leave it alone. When you come back, check <code>results.tsv</code> and the Git log on the autoresearch branch.</p>
<h3 id="heading-tuning-autoresearch-for-smaller-gpus">Tuning autoresearch for Smaller&nbsp;GPUs</h3>
<p>The default configuration assumes an H100. If you have a 4090, 3090, or anything with less than 80GB of VRAM, you'll need to dial things down.</p>
<ol>
<li><p>Lower the sequence length first: <code>MAX_SEQ_LEN = 2048</code> in <code>prepare.py</code> is the biggest VRAM lever since attention scales quadratically with it. Try 512 or even 256 on a small GPU and bump <code>DEVICE_BATCH_SIZE</code> in <code>train.py</code> slightly to compensate. The product of these two is the tokens-per-forward-pass.</p>
</li>
<li><p>Lower the depth: <code>DEPTH = 8</code> in <code>train.py</code> is the master knob for model size. Drop it to 4 on a small GPU and the model dimension automatically scales down with it.</p>
</li>
<li><p>Switch the window pattern: <code>WINDOW_PATTERN = "SSSL"</code> uses banded attention which is fast on H100 but can be slow on consumer GPUs, depending on the kernel implementation. Just <code>"L"</code> (always full attention) is simpler and often faster on smaller cards.</p>
</li>
<li><p>Lower the total batch size: <code>TOTAL_BATCH_SIZE = 2**19</code> is roughly 524K tokens per optimizer step. On a small GPU, drop it to <code>2**14</code> (~16K tokens) to start.</p>
</li>
<li><p>Consider switching the dataset: ClimbMix is a hard, broad web corpus. On a tiny model, the loss curve is noisy and bpb numbers are hard to interpret. Karpathy specifically recommends his own TinyStories-GPT4-Clean dataset for small-scale experimentation. The text is narrower in scope (children’s stories) so a small model can actually learn to generate something coherent in 5 minutes.</p>
</li>
</ol>
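<p>When you change these knobs, it helps to check how they interact. The sketch below assumes the number of gradient accumulation steps is <code>TOTAL_BATCH_SIZE</code> divided by the tokens in one forward pass, and that the division has to be exact. Both are assumptions about <code>train.py</code>, and the helper name is hypothetical.</p>
<pre><code class="language-python">def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len):
    """Micro-batches needed per optimizer step for a given configuration."""
    tokens_per_pass = device_batch_size * max_seq_len
    if total_batch_size % tokens_per_pass != 0:
        raise ValueError("total batch size must be a multiple of tokens per pass")
    return total_batch_size // tokens_per_pass


# Defaults from the article: 2**19 tokens, 128 sequences of 2048 tokens each
print(grad_accum_steps(2**19, 128, 2048))  # 2

# A hypothetical small-GPU setting: 2**14 tokens, 16 sequences of 512 tokens
print(grad_accum_steps(2**14, 16, 512))    # 2
</code></pre>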
<p>There are already several community forks that have done the consumer-GPU tuning for you, which you can find in the repo's <code>README.md</code> file.</p>
<h2 id="heading-what-the-agent-actually-finds">What the Agent Actually&nbsp;Finds</h2>
<p>It's one thing to describe how the loop works, and another to see what it produces. Karpathy was open about this on Twitter in his depth-12 run: the agent found about 20 changes that improved validation loss, all of which transferred to depth-24.</p>
<p>Specific examples from his post-run analysis include adding a learnable scalar to the parameterless QK-norm to sharpen attention, applying regularization to the value embeddings, widening the banded attention window, correcting the AdamW betas for certain parameter groups, tuning weight decay schedules, and adjusting initialization.</p>
<p>None of these would headline a research paper, but all of them showed up as 0.001 to 0.005 bpb improvements that stacked.</p>
<p>So it's not that an AI agent invented a new architecture. It's that the slow patient hill-climbing that real researchers spend months doing can be done by an agent in a couple of days. The result is the same boring detail-tuning that has always been where most of the actual progress in ML comes from.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>autoresearch doesn't introduce a new model or a new optimizer or a new dataset. It just defines a kind of contract between a human researcher and an AI agent and it shows that the contract can be enough. That contract is something like <em>“here is the fixed part of reality, the metric that judges you, a budget, and within those rules, do whatever you want and tell me what worked.”</em></p>
<p>There are two questions I'm still thinking about. One is <strong>overfitting to the validation set</strong>. If you run hundreds of experiments against the same fixed validation shard, eventually the agent will start finding tweaks that look like wins on this shard but don't transfer. Karpathy himself called the results “fragile” in some sessions.</p>
<p>There's no obvious fix here yet beyond rotating validation data, which would break comparability.</p>
<p>The other question is <strong>what the human’s role becomes</strong>. If the agent does the experiments, the human’s contribution shifts to shaping the search space and the rules. That is what <code>program.md</code> is. It's a pretty good preview of what research looks like when the loop is automated.</p>
<p>Well, that’s it for today. See you folks in my next article!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Personal Web Research AI Agent with Ollama and Qwen ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, I’ll show you how to build an AI web research agent using Ollama, Qwen, and Python. The agent searches the web for a topic, fetches relevant pages, and uses a local LLM to generate a ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-personal-ai-web-research-agent-with-ollama-and-qwen/</link>
                <guid isPermaLink="false">6a3ebfce33b56590aa5b54c9</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ollama ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Darsh Shah ]]>
                </dc:creator>
                <pubDate>Fri, 26 Jun 2026 18:07:10 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/33d0f53f-3eaf-4549-9335-d3a9e356b4f9.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, I’ll show you how to build an AI web research agent using Ollama, Qwen, and Python. The agent searches the web for a topic, fetches relevant pages, and uses a local LLM to generate a concise digest.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-background">Background</a></p>
</li>
<li><p><a href="#heading-motivation-and-architecture">Motivation and Architecture</a></p>
</li>
<li><p><a href="#heading-step-1-install-ollama-and-get-an-api-key">Step 1: Install Ollama and get an API key</a></p>
</li>
<li><p><a href="#heading-step-2-pull-the-qwen-model">Step 2: Pull the Qwen model</a></p>
</li>
<li><p><a href="#heading-step-3-install-python-dependencies">Step 3: Install Python dependencies</a></p>
</li>
<li><p><a href="#heading-step-4-write-the-agent-code">Step 4: Write the agent code</a></p>
</li>
<li><p><a href="#heading-step-5-run-the-agent">Step 5: Run the agent</a></p>
</li>
<li><p><a href="#heading-sample-output">Sample Output</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-background">Background</h2>
<p>Most of us have used ChatGPT or Claude to send queries to a large language model. You've probably also seen hallucinations in the response when the model didn't know something, sometimes because its knowledge was out of date.</p>
<p>With the rise of tool calling, LLMs can now use tools to search the web for the latest information. They can then bring that information into context and use it to generate an output, summarize results, and extract key points from retrieved sources.</p>
<p>In this tutorial, I'll show you how I built a personal research agent that searches the internet for any topic and uses a local LLM to summarize what it finds. The model runs entirely on my own machine, so the summarization step stays private and has no per-token API costs. With Ollama's free tier covering the web search, it's completely free.</p>
<p>To follow this tutorial, you'll need <a href="https://ollama.com">Ollama</a> installed on your machine and a free Ollama account. The tutorial works on macOS, Windows, and Linux. I'm using a MacBook Pro with 32 GB of RAM, but you can run this on a lower-memory machine by choosing a smaller Qwen model from Ollama.</p>
<h2 id="heading-motivation-and-architecture">Motivation and Architecture</h2>
<p>The motivation behind this project is to have agents running on my machine that can handle a variety of tasks every day. I can spin off agents to create a daily digest of AI news, surface the latest world events, or look for new job postings.</p>
<p>Running a local LLM also means the fetched pages and the model's output never go to a hosted LLM provider. Only the search query itself is sent to Ollama's web search API, and there are no per-query LLM costs to worry about.</p>
<p>For this project, we'll use Ollama web search for retrieval and a local Qwen LLM for summarization (rather than relying on hosted chat tools like ChatGPT or Claude). The system diagram below shows how the agent works.</p>
<p>When run in the terminal, the agent asks the user what they want to research. It then calls the Ollama web search API to fetch the top 5 results for the query, downloads each of those pages, and extracts the readable text.</p>
<p>The extracted content from all five pages is sent to the local Qwen model in a single message, along with the user's prompt and an instruction: "<em>Use these web results and page contents to answer in Markdown format</em>." The model's response is then saved as a Markdown file on disk.</p>
<img src="https://cdn.hashnode.com/uploads/covers/684c95e159698b4bf6a0e4be/238ef25e-6dff-4a54-ba73-2ccbe666bd60.png" alt="Diagram of the process: user prompt, Ollama web search API, top 5 result URLs, requests + BeautifulSoup, clean page text,  local Qwen model via Ollama, markdown digest saved to disk." width="1584" height="1212" loading="lazy">

<h2 id="heading-step-1-install-ollama-and-get-an-api-key">Step 1: Install Ollama and Get an API Key</h2>
<p>To get started, install the <a href="https://ollama.com/download">Ollama application</a> and create an account to get an <a href="https://docs.ollama.com/capabilities/web-search">API key</a>. The free tier of Ollama will suffice for this tutorial.</p>
<p>Once you have the key, place it in an environment variable:</p>
<pre><code class="language-bash">export OLLAMA_API_KEY="paste-key-here"
</code></pre>
<h2 id="heading-step-2-pull-the-qwen-model">Step 2: Pull the Qwen Model</h2>
<p>We'll use Qwen for this tutorial, an open-weight model that's currently one of the best small models available.</p>
<p>I'm using the 4-billion-parameter variant because it follows structured prompts well and runs on a laptop without a dedicated GPU. Other sizes, such as 2b and 9b, are also available.</p>
<p>To use <a href="https://ollama.com/library/qwen3.5:4b">Qwen3.5:4b</a> locally, install it using Ollama. The 4b model size is around 3.4 GB on my machine. If your machine has less RAM, you can use qwen3.5:0.8b instead of the 4b model.</p>
<pre><code class="language-plaintext">ollama pull qwen3.5:4b
</code></pre>
<h2 id="heading-step-3-install-python-dependencies">Step 3: Install Python Dependencies</h2>
<pre><code class="language-bash">python3 -m venv venv
source venv/bin/activate
pip install ollama requests beautifulsoup4
</code></pre>
<h2 id="heading-step-4-write-the-agent-code">Step 4: Write the Agent Code</h2>
<p>The Python code below does five things: it takes a research prompt from the terminal, calls Ollama's web search API for the top 5 results, downloads the webpages using Requests and cleans each page's text using BeautifulSoup, sends everything to a local Qwen model with an instruction to summarize in Markdown, and saves the result to a timestamped .md file.</p>
<p>Save the code in a file named research_agent.py.</p>
<p>The summarization prompt is intentionally basic. Feel free to tweak it to match the kind of output you want.</p>
<pre><code class="language-python">import os
import json
import requests
import ollama
from bs4 import BeautifulSoup
from datetime import datetime

API_KEY = os.getenv("OLLAMA_API_KEY")
SEARCH_URL = "https://ollama.com/api/web_search"
MODEL = "qwen3.5:4b"

# Search web using Ollama web search 
def search_web(query):
    response = requests.post(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "max_results": 5},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])

# Fetch full web page content
def fetch_text(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Skip pages that fail to download; the search snippet is still kept
        return ""
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def main():
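    # The web search API call needs a key, so fail early with a clear message
    if not API_KEY:
        print("Set the OLLAMA_API_KEY environment variable first.")
        return

    user_prompt = input("Enter your prompt: ").strip()
    if not user_prompt:
        print("Prompt cannot be empty.")
        return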

    results = search_web(user_prompt)

    # For each url in web search result, fetch full content
    pages = []
    for item in results:
        url = item.get("url")
        if not url:
            continue

        print(f"Fetching: {url}")
        page_text = fetch_text(url)

        pages.append({
            "title": item.get("title", ""),
            "url": url,
            "snippet": item.get("content", ""),
            "page_text": page_text,
        })

    # Prompt to send to Qwen model with web data
    prompt = f"""
    User request:
    {user_prompt}

    Use these web results and page contents to answer in markdown format.

    Data:
    {json.dumps(pages, ensure_ascii=False)}
    """

    # Invoke local Qwen model 
    response = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )

    digest = response.message.content

    # Build a unique filename using today's date and time
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"digest-{timestamp}.md"

    # Save the digest to disk
    with open(filename, "w", encoding="utf-8") as f:
        f.write(digest)

    print("Saved to digest")

if __name__ == "__main__":
    main()
</code></pre>
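<p>One practical caveat: five full pages can be much longer than a small local model's context window. A possible mitigation, sketched here under the assumption that a simple per-page character cap is good enough, is to trim each page before building the prompt. The helper name <code>truncate_pages</code> and the 4,000-character default are hypothetical and not part of the script above.</p>
<pre><code class="language-python">def truncate_pages(pages, max_chars_per_page=4000):
    """Return copies of the page dicts with page_text capped in length."""
    return [
        {**page, "page_text": page.get("page_text", "")[:max_chars_per_page]}
        for page in pages
    ]


# Hypothetical example: one long page is cut down to 10 characters
sample = [{"title": "Example", "url": "https://example.com", "page_text": "x" * 50}]
trimmed = truncate_pages(sample, max_chars_per_page=10)
print(len(trimmed[0]["page_text"]))  # 10
</code></pre>
<p>To try it, you could call <code>pages = truncate_pages(pages)</code> in <code>main()</code> just before the prompt is built.</p>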
<h2 id="heading-step-5-run-the-agent">Step 5: Run the Agent</h2>
<pre><code class="language-plaintext">python research_agent.py
</code></pre>
<p>The script will prompt you to enter the topic you'd like to research.</p>
<h3 id="heading-sample-output">Sample Output</h3>
<p>The summarized digest is saved as a timestamped Markdown file. The agent also prints the source URLs as it fetches them.</p>
<p>Before trusting the summary, skim it and spot-check a claim or two against the original source. Local models are smaller than hosted frontier models and tend to hallucinate more, so spot-checking helps catch errors.</p>
<p>As a test run, I asked the research agent "What's new in LLMs", and it fetched 5 web pages, as shown below:</p>
<pre><code class="language-plaintext">Enter your prompt: What's new in LLMs
Fetching: https://openai.com/nl-NL/index/chatgpt-memory-dreaming/
Fetching: https://pub.towardsai.net/tai-210-glm-5-2-closes-most-of-the-open-weight-gap-in-ten-weeks-2f970c5f1326
Fetching: https://www.globenewswire.com/news-release/2026/06/23/3315999/0/en/Multiverse-Computing-Launches-Pulsar-16B-in-collaboration-with-NVIDIA-Frontier-Grade-Reasoning-at-Half-the-Parameters.html
Fetching: https://thenextweb.com/news/anthropic-claude-tag-slack-always-on-ai-teammate
Fetching: https://www.aidoers.io/blog/claude-mythos-5-and-fable-5-explained-what-anthropic-actually-shipped

Saved to digest
</code></pre>
<p>The digest came out reasonably well-structured for a 4B local model. It's organized into sections with all the relevant data from the sources. I spot-checked the summary and it was accurate.</p>
<p>Here's what it produced:</p>
<pre><code class="language-plaintext"># What's New in LLMs (June 2026)

The landscape of Large Language Models (LLMs) has evolved rapidly in June 2026, with significant updates in memory synthesis, new frontier models, enterprise integrations, and market dynamics.

## 1. Memory &amp; Personalization: OpenAI’s "Dreaming" Update
OpenAI has deployed a new memory architecture for ChatGPT, referred to as **Dreaming V3**.
*   **Purpose:** Improves memory synthesis to optimize freshness, continuity, and relevance.
*   **Evolution:**
    *   **2024:** "Saved memories" (manual instruction-based).
    *   **2025:** "Dreaming V0" (background process curating memories from chat history).
    *   **2026:** **Dreaming V3** (significantly more capable and compute-efficient architecture).
*   **Impact:** Memory is now reviewable via a summary page, allowing users to update information and set instructions on topics to bring up.
*   **Availability:** Rolled out to ChatGPT Plus and Pro users in the US today, expanding to additional countries and Free/Go users over coming weeks.
*   **Capability:** The model now remembers specific user setups (e.g., photography gear preferences) and constraints (e.g., vegetarian diet, hotel AC preferences) without requiring explicit "remember" cues.

## 2. New Frontier Models &amp; Benchmarks

### Claude Fable 5 &amp; Mythos 5 (Anthropic)
*   **Classification:** Mythos-class tier, sitting above Opus in raw capability.
*   **Differentiation:** **Fable 5** is available to the public. **Mythos 5** is the identical model with cybersecurity safeguards removed, restricted to **Project Glasswing** partners only.
*   **Pricing:** $10 per million input tokens / $50 per million output tokens.
*   **Availability:** Included at no extra cost on Pro, Max, Team, and enterprise plans until June 22.
*   **Capabilities:** Significant jumps in **Knowledge work**, **Agentic coding**, **Vision**, **Legal reasoning**, and **Biology**.

### Z.ai GLM-5.2 (Open Weights)
*   **Release:** Z.ai (Z.AI) released GLM-5.2 under an MIT license on June 16, 2026.
*   **Performance:** Closed the open-weight gap in ten weeks. Scored **51** on the Artificial Analysis Intelligence Index.
    *   **Context:** Expanded from 200K to **1 million tokens**.
    *   **Architecture:** Utilizes "IndexShare" for long-context efficiency and "Compaction-aware reinforcement learning" for agents.
*   **Benchmarks:** Ranked third on the AA-Briefcase (91 held-out tasks), behind Fable and Opus 4.8 but ahead of GPT-5.5.
*   **Cost:** ~$0.52 per task (compared to $0.86 for GPT-5.5 and $1.80 for Opus 4.8).

### Multiverse Pulsar 16B (NVIDIA Collaboration)
*   **Parameters:** 16.15B total parameters (3.1B active).
*   **Performance:** Delivers 30B-class intelligence at half the parameter count.
*   **Validation:** Matches 30B-class architectures (e.g., Nemotron-3-Nano-30B-A3B) on reasoning, coding, and math.
*   **Deployment:** Available on Hugging Face under Apache 2.0 license. Optimized for lower-memory GPUs and single-node environments.

## 3. Enterprise Integration &amp; Tools

*   **Claude Tag (Anthropic):**
    *   An "always-on AI teammate" available to **Claude Enterprise and Team** customers.
    *   **Features:** Lives inside Slack, follows conversations, learns context, and uses an **ambient mode** to proactively flag updates and tasks.
    *   **Scoping:** Identity-based permissions allow admins to restrict which channels/teams the AI can access.
*   **MCP Connectors (Anthropic):**
    *   Launched **Enterprise-Managed Authorization (EMA)**.
    *   Allows IT admins to provision connector access via identity providers (Okta) without individual OAuth flows.
*   **Perplexity Brain (Computer Agent):**
    *   Research preview for Max/Enterprise Max subscribers.
    *   Self-improving memory system that remembers what the agent *did* rather than user preferences.
    *   Results show 25% increase in answer correctness on repeated tasks.

## 4. Industry Trends &amp; Personnel Moves

*   **Market Dynamics:** ChatGPT market share dropped below 50% (46.4% by May 2026). Claude leads in subscription conversion (13%).
*   **Talent Shifts:**
    *   **Noam Shazeer:** Co-inventor of Transformer (Google) joins OpenAI as Lead for Architecture Research.
    *   **John Jumper:** Nobel Laureate (DeepMind) joins Anthropic for AI-for-science infrastructure.
*   **Corporate M&amp;A:**
    *   **SpaceX** acquires **Cursor** (Anysphere) for **$60 Billion** in a Q3 2026 deal to strengthen its AI coding division.
    *   **Alibaba** released the **Qwen-Robot Suite** (Qwen-RobotNav, Manip, World) for embodied intelligence and robotic control.
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to build a personal AI web research agent that searches the web, summarizes results with a local LLM, and saves a Markdown digest. All this runs on your own machine with no data leaving your laptop. You have full control over the model and prompts without any API costs.</p>
<p>From here, you can try new prompts to research different topics, tweak the system prompt to change the output, swap in other local models like Qwen 3.6 or Mistral, or extend the script to fit your own workflow. Happy tinkering!</p>
<p>If you enjoyed this tutorial, you can find more of my writing on my <a href="https://darshshah.org/blog/">blog</a> (recent posts include a system design paper series), my work on my <a href="https://darshshah.org/">personal website</a>, and updates on <a href="https://www.linkedin.com/in/darshs">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Teach a Small LLM to Suggest K12 Creative Project Ideas ]]>
                </title>
                <description>
                    <![CDATA[ Recently, I wrote a post about an educational app I'd developed using AI tools, and the design decisions I made along the way. When I showed the prototype of my activity-based learning app to a few ed ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-teach-a-small-llm-to-suggest-k12-creative-project-ideas/</link>
                <guid isPermaLink="false">6a3ab6628d22211aa0282f4f</guid>
                
                    <category>
                        <![CDATA[ edtech ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Srishti Sethi ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jun 2026 16:37:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/381c2b8d-ed7d-4f88-b0d4-f4ba90878758.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Recently, I <a href="https://www.freecodecamp.org/news/technical-design-decisions-educational-app-llms/">wrote a post about an educational app</a> I'd developed using AI tools, and the design decisions I made along the way.</p>
<p>When I showed the prototype of my activity-based learning app to a few educators, one suggestion came up repeatedly that was drawn from their own experience hunting for creative ideas on platforms like Pinterest and TikTok. They wanted a feature that could pull project ideas from across the internet based on practical search criteria: the materials they have access to, and what they'd like the end product to look like.</p>
<p>The app already has a basic search that returns results from its own activity data, but that data is still limited at this stage. Generating results from outside the app felt like something LLMs are well suited to handle.</p>
<p>I was also curious to learn how you actually teach a small LLM for a K12 use case – not the kind of training that needs enormous datasets and compute (which I don't have access to), but the mechanics of it, for learning's sake. And, like in my previous post, I wanted to think through the design choices that go into it:</p>
<ul>
<li><p>What are the technicalities behind teaching a small LLM to handle a K12 use case?</p>
</li>
<li><p>How, and on what data, do you train such a model?</p>
</li>
<li><p>How do you ensure the model is child friendly?</p>
</li>
<li><p>What does it take to integrate the model into your app?</p>
</li>
</ul>
<p>In this post, I'll document everything I learned about training such a model and integrating it as a feature in my educational prototype.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-dataset-preparation">Dataset Preparation</a></p>
</li>
<li><p><a href="#heading-filtering-the-corpus">Filtering the Corpus</a></p>
</li>
<li><p><a href="#heading-generating-training-pairs">Generating Training Pairs</a></p>
</li>
<li><p><a href="#heading-fine-tuning">Fine-Tuning</a></p>
</li>
<li><p><a href="#heading-evaluating-the-fine-tuned-model">Evaluating the Fine-tuned Model</a></p>
</li>
<li><p><a href="#heading-building-the-index-amp-rag-retrieval">Building the Index &amp; RAG Retrieval</a></p>
</li>
<li><p><a href="#heading-integrate-the-model-with-the-feature">Integrate the Model with the Feature</a></p>
</li>
<li><p><a href="#heading-making-content-safe">Making Content Safe</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>This is a hands-on tutorial, so here's what will help you follow along or train the model yourself.</p>
<p><strong>Skills you'll want</strong></p>
<ul>
<li><p>Using Claude on the command line.</p>
</li>
<li><p>Basic Python: reading code, installing and using packages, calling APIs, and making sense of output like log files.</p>
</li>
<li><p>Reading a bit of TypeScript, since that's what the app's frontend is built in.</p>
</li>
<li><p>Most importantly, being comfortable following Claude's reasoning, weighing the options it lays out, and deciding what to do next. That back-and-forth, not any single command, is really the core skill this kind of project asks for.</p>
</li>
</ul>
<p>You don't need a background in machine learning. The post tries to explain the ML concepts as it goes, in plain language.</p>
<p><strong>Setup you'll need</strong></p>
<ul>
<li><p>An Apple Silicon Mac (M1/M2/M3 or newer). The fine-tuning step uses MLX, Apple's framework, which only runs on Apple Silicon.</p>
</li>
<li><p>Python 3 with a virtual environment (<code>python3 -m venv</code>).</p>
</li>
<li><p>Ollama installed, with the Qwen 2.5 7B model pulled (<code>ollama pull qwen2.5:7b</code>), for generating the training data locally. You'll want enough RAM to run a 7B model.</p>
</li>
<li><p>Claude on the command line, for working through the build.</p>
</li>
</ul>
<h2 id="heading-dataset-preparation"><strong>Dataset Preparation</strong></h2>
<p>For this experiment, I wanted the activity data to be grounded in local cultures from around the world. This would help the model suggest creative project ideas that inspire the facilitation of cultural activities in educational settings.</p>
<p>I'd come across a lot of Wikipedia articles on local arts and traditions over the years. Wikipedia is my favorite resource for information: it's human-first, its content is updated frequently, and as an open source project its APIs are free to use. So I decided to use Wikipedia data to teach my model.</p>
<p>The genuinely hands-on part of this stage was seeding the right categories. In a Python script, I defined ~40 seed categories and grouped them under 9 STEAM labels with suggestions from Claude on which categories to scrape and how to avoid noise in the fetched data.</p>
<p>For extracting text from the sections of each article, Claude suggested a Python wrapper for the Wikipedia API. This let me fetch each article as a section-structured record. To keep noise down, I limited the crawl to one sub-category level deep and only kept articles above a certain content size.</p>
<pre><code class="language-python"># Seed categories grouped by STEAM domain.
SEED_CATEGORIES = {
    "Crafts &amp; making": [
        "Category:Crafts",
        "Category:Origami",
        "Category:Pottery",
        "Category:Kites",
    ],
    "Arts": [
        "Category:Folk art",
        "Category:Textile arts",
        "Category:Indigenous art",
        "Category:Masks",
    ],
    "Science": [
        "Category:Ethnobotany",
        "Category:Food preservation",
        "Category:Gardening",
    ],
    # ... Media arts, Engineering, Mathematics, Music &amp; sky, Play &amp; learning
}

MAX_DEPTH = 1             # descend only one sub-category level
MIN_CONTENT_CHARS = 800   # skip stubs (summary + sections)
</code></pre>
<h2 id="heading-filtering-the-corpus">Filtering the Corpus</h2>
<p>The scraping step produced ~19,000 articles. This step makes sure the content stays relevant to STEAM topics. Relevance filtering runs in two stages: removing obvious noise, then semantic filtering.</p>
<p>The first stage drops obvious non-activity content (music, films, TV, biographies, and plant or animal species) using category, title, and section-heading patterns.</p>
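<p>As a rough illustration of that first stage, here is a minimal sketch of pattern-based noise removal. The patterns, the <code>is_noise</code> helper, and the article fields (<code>title</code>, <code>categories</code>, <code>headings</code>) are assumptions made up for this example. The project's actual pattern lists aren't shown in this post.</p>
<pre><code class="language-python">import re

# Hypothetical patterns for illustration only; tune them to your own scrape.
NOISE_PATTERNS = {
    "categories": re.compile(r"\b(films?|television|albums?|births|deaths|species)\b", re.I),
    "title": re.compile(r"\((film|album|TV series|band)\)$", re.I),
    "headings": re.compile(r"\b(discography|filmography|taxonomy|early life)\b", re.I),
}


def is_noise(article):
    """Return True if the title, any category, or any heading looks like non-activity content."""
    if NOISE_PATTERNS["title"].search(article["title"]):
        return True
    if any(NOISE_PATTERNS["categories"].search(c) for c in article["categories"]):
        return True
    return any(NOISE_PATTERNS["headings"].search(h) for h in article["headings"])


articles = [
    {"title": "Origami", "categories": ["Category:Paper folding"], "headings": ["History", "Techniques"]},
    {"title": "Kite (film)", "categories": ["Category:2003 films"], "headings": ["Plot", "Cast"]},
]
kept = [a["title"] for a in articles if not is_noise(a)]
print(kept)  # ['Origami']
</code></pre>
<p>Patterns like these are cheap to run over the whole scrape, which is why they make sense as a first pass before the more expensive embedding step.</p>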
<p>The second, semantic stage converts each article's title and summary into a vector using a small sentence-transformer model (all-MiniLM-L6-v2). It then compares it against two sets of example sentences: positive and negative anchors.</p>
<p>The positive anchors describe sentences relevant to STEAM activities and the negative anchors describe less relevant ones. Each article gets a score based on how close it sits to the positive examples versus the negative ones, and we keep every article that leans positive. We do this with the sentence-transformers library.</p>
<p>Writing these anchor sentences is the most human step in the process. With this filtering, I brought the corpus down to ~6,600 articles.</p>
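<pre><code class="language-python"># Filtering the raw scrape to articles useful for STEAM activity suggestions.
from sentence_transformers import SentenceTransformer, util

POSITIVE_ANCHORS = [
    "a hands-on craft that children can make using simple materials and a technique",
    "a traditional cultural art or making technique such as weaving, carving, pottery or paper folding",
]
NEGATIVE_ANCHORS = [
    "a species of plant, animal or fungus",
    "a biography of a person",
    "a city, region, building or geographic place",
]

# Assumed setup: load the embedding model and embed both anchor sets once.
model = SentenceTransformer("all-MiniLM-L6-v2")
pos = model.encode(POSITIVE_ANCHORS, convert_to_tensor=True)
neg = model.encode(NEGATIVE_ANCHORS, convert_to_tensor=True)


def score_articles(texts):
    """Score "title + summary" strings; a positive score leans toward STEAM activities."""
    emb = model.encode(texts, convert_to_tensor=True)
    # Embed article + anchors, then keep whatever leans positive.
    pos_sim = util.cos_sim(emb, pos).max(dim=1).values  # closest positive anchor
    neg_sim = util.cos_sim(emb, neg).max(dim=1).values  # closest negative anchor
    return (pos_sim - neg_sim).tolist()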
</code></pre>
<h2 id="heading-generating-training-pairs"><strong>Generating Training Pairs</strong></h2>
<p>The next step is to generate input → output training pairs from the filtered corpus. We do this by distilling it through a pretrained, local open-source model (Qwen 2.5 7B, running via Ollama).</p>
<p>For each article, you send the model the title, summary, cultural context, and a few content sections. You also send it a system prompt that explains the task, specifies the output format (valid JSON, in this case), and includes one example training pair to anchor the format.</p>
<p>Constructing this prompt well is where human intervention matters most: the schema, the rules, and that single worked example are what determine the quality of every pair the model generates.</p>
<p>After generation, the pairs are cleaned and prepared for fine-tuning. The local model tended to invent its own category labels ("Ceramics," "Crafts &amp; Making," "Circuits (metaphorical)"…). So this step maps every category onto the app's fixed set of 10 canonical categories (Art, Science, Coding, Circuits, Engineering, Storytelling, Drama, Film, Music, Nature), clamps each activity's age range into the K12 band, converts the pairs into chat format, and finally splits the data into three sets: train, validation, and test.</p>
<pre><code class="language-json"># The schema every generated training pair must match (valid JSON only).
  {
    "input": {
      "materials": ["3-6 realistic classroom materials"],
      "age_range": [min_int, max_int],
      "theme": "optional string or null"
    },
    "output": {
      "ideas": [{
        "title": "catchy, max 60 chars",
        "description": "2-3 sentences",
        "category": "one of: Art, Science, Coding, Circuits, Engineering, ...",
        "cultural_origin": "specific region or culture",
        "materials_used": ["subset of input materials"],
        "materials_missing": ["anything else needed"],
        "estimated_minutes": integer,
        "steps": ["3-6 short steps, one sentence each"],
        "learning_objectives": ["2-4 objectives"],
        "safety_note": "string or null"
      }]
    }
  }
</code></pre>
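<p>The clean-up described above (mapping categories, clamping ages, converting to chat format, and splitting) could look roughly like the sketch below. The alias table, the fallback category, the 5 to 18 age band, and all function names are assumptions for illustration, not the project's actual code. The <code>messages</code> layout is one common chat format, so check what your training tool expects.</p>
<pre><code class="language-python">import json
import random
import re

CANONICAL = ["Art", "Science", "Coding", "Circuits", "Engineering",
             "Storytelling", "Drama", "Film", "Music", "Nature"]
# Hypothetical alias table and fallback; the real mapping is not shown in this post.
ALIASES = {"ceramics": "Art", "crafts": "Art", "pottery": "Art", "gardening": "Nature"}
FALLBACK = "Art"
K12_AGES = (5, 18)  # assumed K12 age band


def canonical_category(label):
    """Map a free-form label onto a canonical category by whole-word match."""
    words = re.findall(r"[a-z]+", label.lower())
    for name in CANONICAL:
        if name.lower() in words:
            return name
    for word in words:
        if word in ALIASES:
            return ALIASES[word]
    return FALLBACK


def clamp_age_range(age_range):
    """Clamp [min_age, max_age] into the K12 band."""
    low, high = K12_AGES
    clamped = sorted(max(low, min(high, age)) for age in age_range)
    return [clamped[0], clamped[-1]]


def to_chat(pair, system_prompt):
    """Convert one input/output pair into a chat-format training record."""
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(pair["input"])},
        {"role": "assistant", "content": json.dumps(pair["output"])},
    ]}


def split(records, n_valid, n_test, seed=0):
    """Shuffle deterministically, then carve off the validation and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


print(canonical_category("Circuits (metaphorical)"))  # Circuits
print(clamp_age_range([3, 25]))  # [5, 18]
</code></pre>
<p>Matching on whole words rather than substrings avoids accidents like "Earth science" being filed under Art because "earth" contains "art".</p>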
<h2 id="heading-fine-tuning"><strong>Fine-Tuning</strong></h2>
<p>This is the step where the model learns how to behave and generate a desired response in the appropriate format. It involves fine-tuning a pretrained model (Qwen2.5-1.5B-Instruct-4bit in this case) via MLX on my dataset using the LoRA technique.</p>
<p>Fine-tuning with LoRA is a cheap and lightweight approach: it doesn't retrain the whole model, but instead adds a tiny correction layer that adjusts the final behavior while the original model stays frozen.</p>
<p>Given the constraints of this project, working on a personal laptop with a small dataset of ~400 pairs, full fine-tuning would have needed significantly more memory and compute, which would be overkill here. So LoRA was the right choice.</p>
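<p>To make the "tiny correction layer" idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer: the frozen weight matrix <code>W</code> is left alone, and two small matrices <code>A</code> and <code>B</code> add a low-rank correction on top. The function names and toy numbers are mine, and this only shows the arithmetic. It isn't how MLX implements it internally.</p>
<pre><code class="language-python">def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def param_counts(d_in, d_out, rank):
    """Return (weights in the full matrix, weights in the LoRA adapter)."""
    return d_in * d_out, rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (2x2 identity)
A = [[1.0, 1.0]]              # rank-1 adapter, shape 1x2
B = [[1.0], [2.0]]            # rank-1 adapter, shape 2x1
print(lora_forward(W, A, B, [1.0, 2.0]))  # [4.0, 8.0]
print(param_counts(1024, 1024, 8))        # (1048576, 16384)
</code></pre>
<p>For a hypothetical 1024 x 1024 layer at rank 8, the adapter holds about 16 thousand numbers against roughly a million in the full matrix, which is where the memory savings come from.</p>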
<h3 id="heading-the-lora-fine-tuning-cycle">The LoRA Fine-tuning Cycle:</h3>
<img src="https://cdn.hashnode.com/uploads/covers/6a172a9fbadcd8afcb11f314/bb1b995b-ff56-4364-8246-c885449c7399.png" alt="Flowchart showing the LoRA fine-tuning cycle" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>Training runs many iterations over the training pairs, and each iteration is the same short cycle. For each input, the model produces a prediction by assigning a probability score to every possible next word, based on the input and the model's current weights. During training it is then graded on how much probability it gave the actual correct next word from the training data.</p>
<p>(Note: in a neural network, <a href="https://www.youtube.com/watch?v=nEt5_8V_wpY">weights and biases</a> are the numbers that determine how the model processes an input, makes a prediction, and generates a response.)</p>
<p>From that comparison it calculates the train loss. It then updates the weights accordingly, specifically the small LoRA adapter weights, while the frozen base model stays untouched, so that next time the guess is a little closer. The lower the loss, the better the model is fitting the data.</p>
<p>Then it moves on to the next iteration, and the cycle repeats. At the end, the trained adapter weights are saved out to a safetensors file.</p>
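<p>The "grading" in that cycle is usually a cross-entropy loss: the negative log of the probability the model gave to the correct next token, averaged over tokens. Here is a small sketch with made-up probabilities. The function names and the toy vocabulary are assumptions for illustration.</p>
<pre><code class="language-python">import math


def next_token_loss(probabilities, correct_token):
    """Cross-entropy for one position: -log of the probability given to the correct token."""
    return -math.log(probabilities[correct_token])


def mean_loss(steps):
    """Average the per-token loss over (probabilities, correct_token) pairs."""
    return sum(next_token_loss(p, t) for p, t in steps) / len(steps)


steps = [
    ({"clay": 0.5, "paper": 0.3, "wire": 0.2}, "clay"),  # fairly confident and right
    ({"fold": 0.25, "cut": 0.75}, "fold"),               # correct token got little probability
]
print(round(mean_loss(steps), 3))  # 1.04
</code></pre>
<p>Read the other way round, a loss of 0.8 means the model is giving the correct token a typical probability of about <code>exp(-0.8)</code>, or roughly 0.45.</p>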
<p>For example, here is how the validation loss moved over my run: 2.532 → 0.842 → 0.823 → 0.814 → 0.820 → 0.831 → 0.845. It dropped sharply at first (the model was genuinely learning), bottomed out at 0.814 around iteration 300, then ticked back up to 0.845 by the end. This was an early sign that the model was starting to overfit, that is, to memorize the training data rather than continue improving.</p>
<p>So the sweet spot was the middle of the run, not the very end. This is where human review mattered most: I saved checkpoints at iterations 200, 400, and 600, and chose the 400 checkpoint, the one with the lowest validation loss among them, to evaluate and serve.</p>
<pre><code class="language-yaml"># Base model — small, instruction-tuned, 4-bit (runs on a laptop)
  model: "mlx-community/Qwen2.5-1.5B-Instruct-4bit"

  train: true
  data: "data/mlx"            # training data: train.jsonl + valid.jsonl
  adapter_path: "adapters"    # &lt;- the trained LoRA weights get saved here

  fine_tune_type: lora
  num_layers: 8               # apply LoRA to the last 8 transformer layers only
  lora_parameters:
    rank: 8                   # adapter size — bigger = more capacity, more overfit risk

  # Training loop
  batch_size: 4               # 400 train examples / 4 = 100 iterations per epoch
  iters: 600                  # ~6 passes over the training set
  learning_rate: 1e-5

  # Watch validation loss to catch overfitting
  steps_per_eval: 100         # check validation loss every 100 steps
  save_every: 200             # checkpoint adapters at 200 / 400 / 600
</code></pre>
<p>Above is the configuration file. It shows the model used, the adapter path, the fine-tuning and LoRA settings, the training loop, and the validation pass.</p>
<p>Below is the command, run with MLX (Apple's machine learning framework), that kicks off the fine-tuning process:</p>
<pre><code class="language-shell">mlx_lm.lora --config lora_config.yaml
</code></pre>
<p>The output below shows the result: the trained weights land in the adapters/ folder, with a checkpoint saved every 200 iterations at 200, 400, and 600.</p>
<pre><code class="language-shell">  adapters/
  ├── 0000200_adapters.safetensors
  ├── 0000400_adapters.safetensors   &lt;- the one you serve (lowest val loss of the three)
  ├── 0000600_adapters.safetensors
  └── adapters.safetensors           &lt;- copy of the final (600) weights
</code></pre>
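<p>Picking the checkpoint can be done by eye, but it's also easy to script. The sketch below uses the validation losses from my run. Pairing them with iteration numbers is an assumption based on <code>steps_per_eval: 100</code>, and the helper name is made up.</p>
<pre><code class="language-python"># Validation losses from the run above. The iteration keys are assumed from
# steps_per_eval: 100, with the first evaluation taken before any real training.
VAL_LOSS = {0: 2.532, 100: 0.842, 200: 0.823, 300: 0.814, 400: 0.820, 500: 0.831, 600: 0.845}
SAVED_CHECKPOINTS = [200, 400, 600]  # from save_every: 200


def best_checkpoint(val_loss, saved):
    """Return the saved iteration with the lowest validation loss."""
    return min(saved, key=val_loss.__getitem__)


best = best_checkpoint(VAL_LOSS, SAVED_CHECKPOINTS)
print(best, "adapters/{:07d}_adapters.safetensors".format(best))
# 400 adapters/0000400_adapters.safetensors
</code></pre>
<p>Note that the overall minimum (0.814 at iteration 300) falls between two saved checkpoints. Setting <code>save_every</code> to match <code>steps_per_eval</code> would likely have captured it, at the cost of a few more files on disk.</p>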
<h2 id="heading-evaluating-the-fine-tuned-model"><strong>Evaluating the Fine-tuned Model</strong></h2>
<p>Once fine-tuning was done, the model needed to be evaluated on the held-out test set, the 50 examples set aside during the training-pair generation step and never seen during training.</p>
<p>In this step, the user message is fed to the model, the model generates its own JSON answer, and that answer is compared against the gold (correct/reference) answer already stored in the file.</p>
<p>The evaluation checks and reports whether the JSON is valid, whether it has the expected keys, how much the predicted materials overlap with the gold answer, how often the prediction names a specific cultural origin, and so on.</p>
<p>It runs this for every example in the test set, printing a short per-example line and a summary at the end. It saves the full results, including each predicted idea alongside the actual (gold) idea, so you can read them side by side.</p>
<pre><code class="language-json"># Fine-tuned model on 50 held-out test examples:
  {
    "json_valid_rate":       1.00,   # always valid JSON
    "schema_match_rate":     1.00,   # always the right keys
    "avg_n_steps":           4.74,   # ~5 steps per idea
    "avg_materials_jaccard": 0.653,  # decent overlap with gold materials
    "pred_culture_specific_rate": 0.52,   # names a specific culture about half the time
    "culture_loose_match_rate":   0.108,  # but it's usually the WRONG one  &lt;-- the gap RAG tries to close
  }
</code></pre>
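<p>For a sense of how the first few of these numbers can be computed, here is a minimal sketch that scores a single prediction. The key list comes from the schema shown earlier. The function names, the result keys, and the choice to score only the first idea are assumptions for illustration rather than the project's actual evaluation script.</p>
<pre><code class="language-python">import json

# Keys taken from the training-pair schema shown earlier.
REQUIRED_KEYS = {"title", "description", "category", "cultural_origin",
                 "materials_used", "materials_missing", "estimated_minutes",
                 "steps", "learning_objectives", "safety_note"}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


def score_prediction(raw_output, gold_idea):
    """Score one model output (a JSON string) against one gold idea (a dict)."""
    result = {"json_valid": False, "schema_match": False, "materials_jaccard": 0.0}
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return result
    result["json_valid"] = True
    ideas = predicted.get("ideas") if isinstance(predicted, dict) else None
    if not ideas or not isinstance(ideas[0], dict):
        return result
    idea = ideas[0]
    result["schema_match"] = REQUIRED_KEYS.issubset(idea)
    result["materials_jaccard"] = jaccard(idea.get("materials_used", []),
                                          gold_idea["materials_used"])
    return result


print(jaccard(["paper", "string"], ["paper", "string", "glue"]))  # 2 shared out of 3 total
</code></pre>
<p>Averaging <code>json_valid</code>, <code>schema_match</code>, and <code>materials_jaccard</code> over the test examples gives rates comparable to the first rows of the summary above.</p>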
<h2 id="heading-building-the-index-amp-rag-retrieval"><strong>Building the Index &amp; RAG Retrieval</strong></h2>
<p>In the previous step we found that <code>culture_loose_match_rate</code> was low: the model is bad at recalling the right cultural origin for a suggested activity.</p>
<p>In this step, we'll try to address that weakness with RAG (retrieval-augmented generation). Instead of hoping that the model has memorized that Raku is Japanese, we'll look up the real Wikipedia article at query time, hand it to the model, and then test whether retrieval actually helps.</p>
<p>This happens in two parts. First, we'll build a retrieval index, turning the Wikipedia corpus we collected earlier into a searchable "meaning database." For each article we compute an embedding by passing its title and summary through a small embedding model, all-MiniLM-L6-v2. An embedding is a numeric fingerprint of meaning, a row of 384 numbers, and articles with similar meaning end up with similar numbers. These are computed once, offline, and saved to disk.</p>
<p>Second comes the retrieval itself. At query time, we turn the query into the same kind of vector, score every article by how similar it is, and return the few with the highest scores (that is, the articles whose meaning is closest to what the user asked for). We then run the same evaluation as the previous phase, but with these retrieved articles pasted into the prompt, to answer the core question: when the model is handed the right Wikipedia article, does it do better?</p>
<p>In a nutshell, this phase is: retrieve the relevant articles, augment the prompt with them, and let the model generate.</p>
<pre><code class="language-python">import numpy as np


def retrieve(query, embedder, embeddings, meta, k):
    """Return the k articles most similar to the query, with their scores."""
    # 1. turn the query into the same kind of 384-number vector
    q = embedder.encode([query], normalize_embeddings=True,
                        convert_to_numpy=True)[0]
    # 2. score every article by similarity (dot product of unit vectors = cosine)
    sims = embeddings @ q
    # 3. take the k closest, return them with their scores
    top = np.argsort(-sims)[:k]
    return [(meta[i], float(sims[i])) for i in top]
</code></pre>
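<p>The "augment" half of the phase is mostly string formatting. Below is a minimal sketch of how the retrieved articles might be pasted into the prompt. The prompt wording, the <code>build_rag_prompt</code> name, and the <code>title</code> and <code>summary</code> fields on each article are assumptions. Only the <code>(article, score)</code> shape is taken from <code>retrieve</code> above.</p>
<pre><code class="language-python">import json


def build_rag_prompt(materials, age_range, theme, retrieved, max_chars=600):
    """Put the retrieved articles ahead of the educator's request as reference context."""
    lines = ["Use the reference articles below when naming the cultural origin.", ""]
    for article, _score in retrieved:
        lines.append("### " + article["title"])
        lines.append(article["summary"][:max_chars])  # trim long summaries to keep the prompt small
        lines.append("")
    request = {"materials": materials, "age_range": list(age_range), "theme": theme}
    lines.append("Request: " + json.dumps(request))
    return "\n".join(lines)


retrieved = [({"title": "Raku ware", "summary": "Raku is a type of Japanese pottery."}, 0.71)]
print(build_rag_prompt(["clay", "sand"], (8, 12), None, retrieved))
</code></pre>
<p>Trimming each summary is a guess at a sensible default: a 1.5B model has limited room for context, so shorter references leave more space for the generated activity.</p>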
<p>So with RAG, the materials overlap improved and the model named a specific culture more often – but the exact cultural match barely moved. This is something I would like to improve in future versions of the app.</p>
<pre><code class="language-plaintext">Metric                        Plain     + RAG     Change
materials_jaccard             0.653     0.752     better
pred_culture_specific_rate    0.52      0.64      better
culture_loose_match_rate      0.108     0.135     barely
</code></pre>
<h2 id="heading-integrate-the-model-with-the-feature"><strong>Integrate the Model with the Feature</strong></h2>
<p>Now it's time to integrate the fine-tuned model into the app and see what cultural activities it can generate to inspire educators.</p>
<p>The end-to-end flow starts on a "Suggest" screen, where an educator enters the materials they have on hand and, optionally, a theme for the activity. From there, the suggestion happens in two phases: retrieval, then generation.</p>
<p>First, the app does a vector search over the Wikipedia index and populates a grid of culturally specific articles that match the educator's input. No model is involved, so the grid appears instantly.</p>
<p>Then, when you tap a card, you land on a detail screen where the fine-tuned model generates a full STEAM activity grounded in that single tradition: a title, description, materials, step-by-step instructions, learning objectives, and a safety note. Everything needed to guide the activity in the classroom.</p>
<pre><code class="language-typescript"> // Step 1 — RETRIEVAL: educator's materials -&gt; grid of cultural articles.
  // Pure vector search on the server, no model, so the grid appears instantly.
  export async function fetchInspiration(materials: string[], theme?: string) {
    const res = await fetch(`${BASE_URL}/suggest`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ materials, theme: theme ?? null }),
    });
    return res.json();   // { results: [...articles] }
  }

  // Step 2 — GENERATION: runs only when the educator taps ONE card.
  // The fine-tuned model generates a full activity grounded in that article.
  export async function fetchActivity(
    articleId: number,
    materials: string[],
    ageRange: [number, number],
  ) {
    const res = await fetch(`${BASE_URL}/activity`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ article_id: articleId, materials, age_range: ageRange }),
    });
    return res.json();   // { activity: {...}, article: {...} }
  }
</code></pre>
<p>Splitting browsing from generation this way is both a cost and a quality choice: retrieval is essentially free, so the model runs just once on the tradition the educator actually commits to, rather than once for every card on the grid.</p>
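<p>On the server side, that split might look something like the sketch below. It's written as plain Python functions with the search and the model passed in as arguments so it runs without a web framework. The function names and the fake search and model are all hypothetical. Only the request and response shapes are taken from the client code above.</p>
<pre><code class="language-python">def suggest(payload, search):
    """Phase 1: vector search only. No model call, so it can run on every query."""
    query = ", ".join(payload["materials"])
    if payload.get("theme"):
        query += " " + payload["theme"]
    return {"results": search(query)}


def activity(payload, articles, generate):
    """Phase 2: one model call, grounded in the single article the educator picked."""
    article = articles[payload["article_id"]]
    generated = generate(article, payload["materials"], payload["age_range"])
    return {"activity": generated, "article": article}


# Stand-ins for the real index and the fine-tuned model.
model_calls = []
articles = {7: {"title": "Raku ware", "summary": "Japanese pottery tradition."}}


def fake_search(query):
    return list(articles.values())


def fake_generate(article, materials, age_range):
    model_calls.append(article["title"])
    return {"title": "Make a pinch pot inspired by " + article["title"]}


grid = suggest({"materials": ["clay"], "theme": None}, fake_search)
print(len(model_calls))  # 0
detail = activity({"article_id": 7, "materials": ["clay"], "age_range": [8, 12]},
                  articles, fake_generate)
print(len(model_calls))  # 1
</code></pre>
<p>The counter makes the cost argument visible: browsing the grid never touches the model, and opening one card triggers exactly one generation.</p>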
<img src="https://cdn.hashnode.com/uploads/covers/6a172a9fbadcd8afcb11f314/b56489af-0450-48eb-b8f6-04c7a1a15781.png" alt="Screenshots showing steps to generate cultural STEAM activities using the app" style="display:block;margin:0 auto" width="5930" height="2532" loading="lazy">

<h2 id="heading-making-content-safe">Making Content Safe</h2>
<p>I wanted to talk about this topic explicitly at the end, even though many phases of the pipeline already involve steps to keep the model's content safe.</p>
<p>Even though the direct users of the app are educators, anything this feature produces can end up in front of kids. So we never want to surface or generate steps for intoxicants, drugs, tobacco, weapons, explosives, or poisons – basically any content that isn't age-appropriate.</p>
<p>This is something the model won't automatically handle on its own. The fine-tuned model was trained only on cultural-craft examples, so it has no built-in instinct to refuse an unsafe request, and the general knowledge of things like alcohol and weapons still lives in the base model's weights underneath.</p>
<p>As a builder, you have to put the necessary guards and checkpoints in place, and remind the model how to behave. We do this in two phases:</p>
<ul>
<li><p>Pre-filter the data to reduce risk at the source, the same way we dropped unrelated categories earlier. Screening the corpus (and the generated training pairs) means we never teach the model unsafe content in the first place. This matters especially if you ever plan to publish your model or dataset somewhere like Hugging Face, where it should already be filtered. This step removed ~850 unsafe articles from the ~19,000 scraped.</p>
</li>
<li><p>Keep runtime guardrails in the ZubHub app as the actual guarantee. Because data filtering reduces risk but can't erase what the base model already knows, the live app screens every input before retrieval and every generated output before display. This means that nothing built around unsafe terms is ever retrieved or shown.</p>
</li>
</ul>
<pre><code class="language-python"># safety.py — one shared list of what we never surface to kids...
  UNSAFE_TERMS = { 
      # ...
  }

  # ...matched whole-word, so "twine" != "wine" and "gunny sack" != "gun".
  def screen_text(text):
      """Return the first unsafe category found, or None if the text is clear."""
      for category, pattern in _PATTERNS.items():   # _PATTERNS built from UNSAFE_TERMS
          if pattern.search(text):
              return category
      return None

  # Phase 1, data: drop unsafe articles before they ever reach training.
  for article in corpus:
      if screen_text(article["title"] + " " + article["summary"]):  # space keeps whole-word matching intact at the join
          continue                      # never taught to the model

  # Phase 2, runtime: screen the educator's input AND the model's output.
  if screen_text(user_input):           # before retrieval
      return BLOCK_MESSAGE
  answer = model.generate(...)
  if screen_text(answer):               # before anything is shown
      return BLOCK_MESSAGE
</code></pre>
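<p>The excerpt above leaves out how <code>_PATTERNS</code> is built. One way to get the whole-word behavior is a compiled regex per category with <code>\b</code> word boundaries, as in this sketch. The term list here is a tiny, hypothetical stand-in (the app's real list isn't shown), and <code>guarded_generate</code> and <code>BLOCK_MESSAGE</code> are illustrative names.</p>
<pre><code class="language-python">import re

# Hypothetical, deliberately tiny term list for illustration only.
UNSAFE_TERMS = {
    "alcohol": ["wine", "beer"],
    "weapons": ["gun", "knife"],
}
BLOCK_MESSAGE = "Sorry, that request can't be shown here."

# One compiled whole-word, case-insensitive pattern per category.
_PATTERNS = {
    category: re.compile(r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
                         re.IGNORECASE)
    for category, terms in UNSAFE_TERMS.items()
}


def screen_text(text):
    """Return the first unsafe category found, or None if the text is clear."""
    for category, pattern in _PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def guarded_generate(user_input, generate):
    """Screen the input before generation and the output before display."""
    if screen_text(user_input):
        return BLOCK_MESSAGE
    answer = generate(user_input)
    if screen_text(answer):
        return BLOCK_MESSAGE
    return answer


print(screen_text("twine and a gunny sack"))  # None
print(screen_text("A bottle of Wine"))        # alcohol
</code></pre>
<p>One trade-off of strict whole-word matching is that plurals and other variants ("guns") only match if they are listed explicitly, so the term list needs to include them.</p>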
<h2 id="heading-conclusion">Conclusion</h2>
<p>In a nutshell, this article walked through how you teach a small LLM to suggest creative, hands-on projects for an educational app.</p>
<p>We started from a pretrained model, Qwen2.5-1.5B-Instruct, and taught it on a dataset we built from Wikipedia's STEAM and cultural articles.</p>
<p>The goal was to get it to take a simple input (the materials an educator has, the children's age range, and an optional theme) and respond with a structured JSON activity: a title, description, step-by-step instructions, learning objectives, and a safety note.</p>
<p>Along the way, we worked through the technicalities of adapting a small LLM for a K12 use case end to end: building the dataset with the Wikipedia API, filtering out irrelevant categories and unsafe content, generating training pairs, fine-tuning the model with LoRA, evaluating its quality, building a retrieval index and adding RAG to make the suggestions more grounded and specific, and finally integrating the model into the app.</p>
<p>Most importantly, building it this way as a hands-on project is what made the core ideas of the ML/LLM space click for me, rather than staying abstract. I hope it does the same for you!</p>
<h2 id="heading-resources"><strong>Resources</strong></h2>
<ul>
<li>Check out the source code in this <a href="https://github.com/unstructuredstudio/zubhub-mobile/commit/296729c6bf981b0aa4ed6418f7c771a667170e77">specific PR</a>.</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ From LLMs to LangChain: Understanding How Modern AI Applications Actually Work ]]>
                </title>
                <description>
                    <![CDATA[ Typically, when we start experimenting with AI, many of us begin similarly. We try a single LLM call as the core of an app, like this: const response = await llm.chat("Explain Kubernetes"); For a lit ]]>
                </description>
                <link>https://www.freecodecamp.org/news/from-llms-to-langchain-understanding-how-modern-ai-applications-actually-work/</link>
                <guid isPermaLink="false">6a3aab13b5ad15098db82372</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ langchain ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Web Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sudheesh Shetty ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jun 2026 15:49:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/38787e16-7e86-44da-9a6a-620cc1a99fce.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Typically, when we start experimenting with AI, many of us begin similarly. We try a single LLM call as the core of an app, like this:</p>
<pre><code class="language-plaintext">const response = await llm.chat("Explain Kubernetes");
</code></pre>
<p>For a little while it feels like the whole flow is: the user asks something, and the model returns an answer. That early success often creates a false impression that building AI is just about sending prompts and getting responses.</p>
<p>That simplicity is seductive, but it doesn't hold up. Over time, users want the assistant to find answers in their documents and knowledge bases, call APIs, fetch live data, trigger services, or schedule meetings.</p>
<p>Users also expect the agent to access internal systems and interact with ERPs, CRMs, or other tools holding critical business data. They'll want agents to combine multiple steps, as workflows often require chaining queries, computations, and side effects into reliable processes.</p>
<p>This is where concepts like MCP (the Model Context Protocol) and tools like LangChain come in. Initially, they may seem like buzzwords, but they address different aspects of running LLMs in production.</p>
<p>After experimenting with AI tools, I found that these concepts help solve different problems related to interfaces, orchestration, and system integration.</p>
<p>This article is a practical guide to understanding how LLMs connect with tools, orchestrate workflows, and power real AI applications.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ol>
<li><p><a href="#heading-what-is-an-llm">What Is an LLM?</a></p>
</li>
<li><p><a href="#heading-why-llms-need-tools">Why LLMs Need Tools</a></p>
</li>
<li><p><a href="#heading-where-mcp-comes-in">Where MCP Comes In</a></p>
</li>
<li><p><a href="#heading-so-what-does-langchain-actually-do">So What Does LangChain Actually Do?</a></p>
</li>
<li><p><a href="#heading-putting-it-together">Putting It Together</a></p>
</li>
<li><p><a href="#heading-what-i-built-while-learning-this">What I Built While Learning This</a></p>
</li>
</ol>
<p>Throughout the article we'll discuss what LLMs are and how they work, what tool-calling looks like in practice, what MCP is and how it works, how LangChain fits into the whole process, and how to put all these tools together.</p>
<p>To follow along, you'll need a basic understanding of Node.js, API operations, and basic JavaScript concepts.</p>
<h2 id="heading-what-is-an-llm"><strong>What Is an LLM?</strong></h2>
<p>LLM stands for <strong>Large Language Model</strong>. It's a class of deep neural networks trained on massive amounts of text to model and generate human-like language. Popular examples you might have heard of include GPT, Claude, Gemini, and Llama.</p>
<h3 id="heading-how-to-call-an-llm-from-a-nodejs-application">How to Call an LLM From a Node.js Application</h3>
<p>Before writing code, let’s understand what it means to call an LLM from a Node.js application.</p>
<p>Calling an LLM means sending input from your application to an AI provider’s API and receiving generated output in return. It's similar to calling any other external service.</p>
<p>In most real-world applications, the model isn't hosted or trained by your application. Instead, providers such as OpenAI and Groq host and maintain the models, while your application communicates with them over HTTP APIs.</p>
<p>In this example, we’ll build a minimal API using Node.js and Express. We’ll create a simple <code>POST /chat</code> endpoint that accepts a user message, sends it to Groq's OpenAI-compatible API, receives the generated response, and returns it to the client.</p>
<p>Here, our Node.js server acts as the bridge between the user and the LLM provider.</p>
<p>For this example, create an API key from the <a href="https://console.groq.com/keys">Groq</a> console. Since it offers a free tier, it’s a simple way to experiment and understand the concepts.</p>
<p>First, install the dependencies:</p>
<pre><code class="language-plaintext">npm install express
</code></pre>
<pre><code class="language-javascript">import express from "express";

const app = express();
app.use(express.json());

app.post("/chat", async (req, res) =&gt; {
  const { message } = req.body;
  const response = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`, // API key read from the environment
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile",
      messages: [{ role: "user", content: message }],
    }),
  });

  const data = await response.json();

  if (!response.ok) {
    return res.status(response.status).json({ error: data });
  }

  const reply = data.choices[0].message.content;

  res.json({ reply });
});

const PORT = process.env.PORT || 8888;
app.listen(PORT, () =&gt; {
  console.log(`Server running on http://localhost:${PORT}`);
});
</code></pre>
<p>Start the server, then use a tool such as Postman to send a POST request to <code>/chat</code> with the following body:</p>
<pre><code class="language-plaintext">POST /chat

{
  "message": "Explain Kubernetes"
}
</code></pre>
<p>Example response:</p>
<pre><code class="language-plaintext">{
  "reply": "Kubernetes is a container orchestration platform..."
}
</code></pre>
<p>The backend receives the message, forwards it to the model provider, receives generated text, and returns it to the client.</p>
<p>LLMs are excellent at language-centric tasks: they understand phrasing and intent, generate coherent text, extract structured information from unstructured input, and perform basic reasoning over provided context. These capabilities make them powerful for things like summarization, drafting, and conversational QA.</p>
<p>But there’s an important limitation: LLMs don't automatically know about and can't access your private or live data. They don’t have implicit access to your company database, internal APIs, or the current state of your systems unless you provide that information at runtime.</p>
<p>Because of that limitation, you need secure mechanisms to connect models to live systems and data — which brings us to the idea of tools.</p>
<h2 id="heading-why-llms-need-tools"><strong>Why LLMs Need Tools</strong></h2>
<p>Imagine asking:</p>
<blockquote>
<p>Check my order and raise support if delivery is delayed.</p>
</blockquote>
<p>The model alone can't inspect your order database or create a support ticket in your system. To do that, it must call external functions — for example, a <code>getOrderStatus(orderId)</code> API and a <code>createSupportTicket(orderId, issue)</code> action.</p>
<p>Those callable functions are what we call tools: programmatic interfaces the AI can use to interact with systems and take concrete actions on behalf of users.</p>
<p>For example, imagine we have a <code>getOrderStatus(id)</code> function that returns an order’s delivery status.</p>
<p>To expose this to the LLM, we define a tools array. Each tool includes:</p>
<ul>
<li><p>type – currently "function"</p>
</li>
<li><p>function name – the function identifier</p>
</li>
<li><p>function description – helps the LLM decide when to call the tool</p>
</li>
<li><p>function parameters – a JSON Schema describing the arguments the tool expects</p>
</li>
</ul>
<p>Here's an example:</p>
<pre><code class="language-typescript">function getOrderStatus(id) {
  const statuses = ["pending", "success", "cancelled"];
  const status = statuses[Math.floor(Math.random() * statuses.length)];
  return `Your order status is ${status}.`;
}

const tools = [
  {
    type: "function",
    function: {
      name: "getOrderStatus",
      description: "Get the status of an order by its ID",
      parameters: {
        type: "object",
        properties: {
          id: { type: "string", description: "The order ID" },
        },
        required: ["id"],
      },
    },
  },
];
</code></pre>
<p>The tool format above is the one Groq accepts. Different LLM providers may use different formats for defining tools, but the overall idea remains the same.</p>
<p>When making the API call, we pass both the user messages and the list of available tools.</p>
<pre><code class="language-typescript">body: JSON.stringify({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: message }],
    tools,
}),
</code></pre>
<p>After the API call, the LLM decides whether a tool is needed. If a tool call is requested, our application executes the corresponding function and sends the result back to the model.</p>
<p>For this example, we'll only handle the <code>getOrderStatus</code> tool. We can check whether the model requested a tool call like this:</p>
<pre><code class="language-typescript">const toolCall = data.choices[0].message.tool_calls[0];
const { id } = JSON.parse(toolCall.function.arguments);
const toolResult = getOrderStatus(id)
</code></pre>
<p>Next, we send a follow-up request that includes the original message, the assistant's tool-call message, and the tool result:</p>
<pre><code class="language-typescript">body: JSON.stringify({
    model: "llama-3.3-70b-versatile",
    messages: [
        { role: "user", content: message },
        assistantMessage,
        { role: "tool", tool_call_id: toolCall.id, content: toolResult },
    ],
    tools,
}),
</code></pre>
<p>Finally, return the model's reply from the follow-up response (parsed here as <code>followUpData</code>):</p>
<pre><code class="language-typescript">return res.json({ reply: followUpData.choices[0].message.content });
</code></pre>
<p>Here's a diagram of the flow:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a1fa5fdc5c3ae375fb38ab2/22d6dc4d-ad5e-4fbb-84f6-71c367565282.png" alt="User -> LLM -> Tool Execution -> Tool Result -> Final Response" style="display:block;margin:0 auto" width="1774" height="887" loading="lazy">

<p>The LLM decides whether a tool is needed and generates the required inputs, while your application executes the function.</p>
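<p>Once an application exposes more than one tool, the execution step is often written as a lookup from tool name to function. The following is a minimal sketch of that idea rather than part of the example above: <code>toolRegistry</code> and <code>executeToolCall</code> are hypothetical names, the stubbed <code>getOrderStatus</code> returns a fixed value so the result is predictable, and the tool-call object mirrors the shape read from <code>message.tool_calls</code> earlier.</p>
<pre><code class="language-javascript">// Hypothetical registry: maps tool names the model may request to local functions.
const toolRegistry = {
  // Stub with a fixed result so the sketch is deterministic.
  getOrderStatus: function ({ id }) {
    return `Order ${id} is pending.`;
  },
};

// Runs one tool call and returns the "tool" message to send back to the model.
function executeToolCall(toolCall) {
  const name = toolCall.function.name;
  if (!Object.hasOwn(toolRegistry, name)) {
    // Report unknown tools to the model instead of crashing the request.
    return { role: "tool", tool_call_id: toolCall.id, content: `Unknown tool: ${name}` };
  }
  // Arguments arrive as a JSON string and must be parsed before use.
  const args = JSON.parse(toolCall.function.arguments);
  const result = toolRegistry[name](args);
  return { role: "tool", tool_call_id: toolCall.id, content: String(result) };
}

// Example tool call in the shape found under message.tool_calls.
const exampleCall = {
  id: "call_1",
  type: "function",
  function: { name: "getOrderStatus", arguments: '{"id":"123"}' },
};

const toolMessage = executeToolCall(exampleCall);
console.log(toolMessage);
</code></pre>
<p>Returning an error message for an unknown tool, instead of throwing, is one possible design choice: it gives the model a chance to recover in the follow-up request.</p>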
<h2 id="heading-where-mcp-comes-in"><strong>Where MCP Comes In</strong></h2>
<p>Tools are simple. You define functions and tell the AI what it can use.</p>
<p>For example, <code>getOrderStatus()</code> works well when all tools are built inside your application. But as applications grow, tools may come from many places, like Slack, GitHub, databases, internal systems, or third-party services. Each one may expose tools differently.</p>
<p>This is where <a href="https://www.freecodecamp.org/news/how-does-an-mcp-work-under-the-hood/">MCP (Model Context Protocol) helps</a>. Think of MCP as a common language that lets AI systems connect to external tools in a consistent way.</p>
<p>Tools define what the AI can do. MCP standardizes how the AI connects to and uses those tools.</p>
<p>Now let’s extend the previous /chat API example so the LLM can use tools exposed through MCP. There are multiple ways to do this:</p>
<ul>
<li><p>build and host your own MCP server and expose your application functions</p>
</li>
<li><p>connect to existing third-party MCP servers such as Slack</p>
</li>
</ul>
<p>For this tutorial, we'll keep things simple and use a remote MCP server approach because it's easier to understand.</p>
<pre><code class="language-plaintext">npm install express @modelcontextprotocol/sdk zod
</code></pre>
<p>Now let’s create our own MCP server and expose the same <code>getOrderStatus</code> function as an MCP tool:</p>
<pre><code class="language-typescript">import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { createMcpExpressApp } from "@modelcontextprotocol/sdk/server/express.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { z } from "zod";

function getOrderStatus(id) {
  const statuses = ["pending", "success", "cancelled"];
  const status = statuses[Math.floor(Math.random() * statuses.length)];
  return `Your order status is ${status}.`;
}

function createOrderServer() {
  const server = new McpServer({ name: "order-server", version: "1.0.0" });

  server.registerTool(
    "getOrderStatus",
    {
      description: "Get the status of an order by its ID",
      inputSchema: { id: z.string() },
    },
    async ({ id }) =&gt; ({
      content: [{ type: "text", text: getOrderStatus(id) }],
    })
  );

  return server;
}

const app = createMcpExpressApp({ host: "0.0.0.0" });

app.post("/mcp", async (req, res) =&gt; {
  const server = createOrderServer();
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined,
  });

  res.on("close", () =&gt; {
    transport.close();
    server.close();
  });

  await server.connect(transport);
  await transport.handleRequest(req, res, req.body);
});

const PORT = process.env.PORT || 3001;
app.listen(PORT, "0.0.0.0", () =&gt; {
  console.log(`Order MCP server running on http://0.0.0.0:${PORT}/mcp`);
});
</code></pre>
<p>This is useful when you want to expose your own application functions through MCP. Typically, the MCP server runs separately and is accessed by MCP clients. Now any MCP client can connect to this server and discover the available tools automatically.</p>
<p>The same idea applies to third-party MCP servers.</p>
<p>For example, if a Slack MCP server is available, we can connect to it instead of writing Slack integration code ourselves.</p>
<p>In that case, our application isn't directly calling Slack APIs. It connects to the Slack MCP server, which exposes Slack-related tools using the MCP standard.</p>
<p>So the difference is:</p>
<ul>
<li><p>For our own features, we can build our own MCP server</p>
</li>
<li><p>For external systems, we can use existing MCP servers when available</p>
</li>
</ul>
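<p>Now we can pass MCP servers to the LLM request. With this remote approach the provider connects to each <code>server_url</code> itself, so the URL has to be reachable from the provider's side. A local address like the one below would need to be replaced with a public one:</p>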
<pre><code class="language-typescript">body: JSON.stringify({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: message }],
  tools: [
    {
      type: "mcp",
      server_label: "OrderServer",
      server_url: `http://0.0.0.0:${PORT}/mcp`,
      server_description: "Get the status of an order by its ID",
    },
    {
      type: "mcp",
      server_label: "Slack",
      server_url: "https://mcp.slack.com/mcp",
      server_description: "Send and read Slack messages",
      headers: {
        Authorization: `Bearer ${process.env.SLACK_BOT_TOKEN}`,
      },
    },
  ],
})
</code></pre>
<p>We can also use local MCP servers instead of remote URLs by connecting through transports such as <code>StdioClientTransport</code>. In that case, we connect locally, discover the available tools, and expose them to the LLM.</p>
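<p>One step in that local flow is translating the tools an MCP server lists into the function-tool format shown earlier. As a rough sketch (the helper name <code>mcpToolsToFunctionTools</code> and the sample listing are assumptions made for illustration), an MCP tool entry carries a <code>name</code>, a <code>description</code>, and a JSON Schema under <code>inputSchema</code>, which maps onto <code>parameters</code>:</p>
<pre><code class="language-javascript">// Hypothetical helper: converts an MCP tool listing into function-style tool definitions.
function mcpToolsToFunctionTools(mcpTools) {
  return mcpTools.map(function (tool) {
    return {
      type: "function",
      function: {
        name: tool.name,
        description: tool.description || "",
        // MCP describes tool arguments with JSON Schema under inputSchema.
        parameters: tool.inputSchema,
      },
    };
  });
}

// Sample listing, shaped like the tools array an MCP client gets back when it lists tools.
const listedTools = [
  {
    name: "getOrderStatus",
    description: "Get the status of an order by its ID",
    inputSchema: {
      type: "object",
      properties: { id: { type: "string" } },
      required: ["id"],
    },
  },
];

const functionTools = mcpToolsToFunctionTools(listedTools);
console.log(JSON.stringify(functionTools, null, 2));
</code></pre>
<p>In a real setup, the listing would come from the MCP client's <code>listTools()</code> call, and a tool the model requests would be run through the client's <code>callTool()</code> method instead of a local function.</p>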
<p>Now if the user sends:</p>
<pre><code class="language-json">{
  "message": "What is status of order 123"
}
</code></pre>
<p>The LLM decides whether a tool is needed, MCP exposes and executes the tool, and the final response is returned to the user.</p>
<p>The flow becomes:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a1fa5fdc5c3ae375fb38ab2/2db75d86-db9a-477e-b578-92221a490a2a.png" alt="User -> /chat api -> LLM -> MCP Tool -> Tool Result -> Tool Response" style="display:block;margin:0 auto" width="1774" height="887" loading="lazy">

<p>This standardization makes integrations far more reusable: instead of rewriting glue logic for each new connector, teams can register MCP-compliant tools and let the orchestrator and model handle discovery and invocation.</p>
<h2 id="heading-so-what-does-langchain-actually-do"><strong>So What Does LangChain Actually Do?</strong></h2>
<p>I initially thought LangChain was simply another wrapper around LLM APIs, but it is better understood as an orchestration framework for AI workflows. Tools let an LLM perform actions. MCP standardizes how tools are exposed. LangChain helps coordinate models, tools, and application logic to build multi-step workflows.</p>
<p>For example:</p>
<blockquote>
<p>User: Check my order and raise support if delivery is delayed.</p>
</blockquote>
<p>Now the system may need to:</p>
<ul>
<li><p>Check order status</p>
</li>
<li><p>Decide whether support is needed</p>
</li>
<li><p>Create a support ticket</p>
</li>
<li><p>Generate the final response</p>
</li>
</ul>
<p>Without orchestration, you would manually control each step. LangChain helps manage this flow.</p>
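<p>To make "manually control each step" concrete, here is a rough sketch of the loop that an orchestration framework takes off your hands. Everything in it is a stand-in: <code>fakeModel</code> imitates a model that asks for two tools and then answers, and <code>runAgentLoop</code>, <code>demoTools</code>, and <code>maxRounds</code> are hypothetical names, not LangChain APIs.</p>
<pre><code class="language-javascript">// Stand-in tools with fixed results so the sketch is deterministic.
const demoTools = {
  getOrderStatus: function ({ id }) { return `Order ${id} is delayed.`; },
  createSupportTicket: function ({ id }) { return `Ticket created for order ${id}.`; },
};

// Fake model: requests each tool once, then produces a final answer.
function fakeModel(messages) {
  const toolResults = messages.filter(function (m) { return m.role === "tool"; });
  if (toolResults.length === 0) {
    return { role: "assistant", toolCall: { name: "getOrderStatus", args: { id: "123" } } };
  }
  if (toolResults.length === 1) {
    return { role: "assistant", toolCall: { name: "createSupportTicket", args: { id: "123" } } };
  }
  return { role: "assistant", content: "Your order is delayed, so I opened a support ticket." };
}

// Manual orchestration: call the model, run any requested tool, repeat until it answers.
function runAgentLoop(model, userMessage, maxRounds) {
  const messages = [{ role: "user", content: userMessage }];
  for (const round of Array(maxRounds).keys()) {
    const reply = model(messages);
    messages.push(reply);
    if (!reply.toolCall) {
      return { answer: reply.content, rounds: round + 1 };
    }
    const result = demoTools[reply.toolCall.name](reply.toolCall.args);
    messages.push({ role: "tool", content: result });
  }
  // Stop instead of looping forever if the model keeps asking for tools.
  return { answer: "Stopped: too many tool rounds.", rounds: maxRounds };
}

const outcome = runAgentLoop(fakeModel, "Check my order and raise support if delivery is delayed.", 5);
console.log(outcome);
</code></pre>
<p>The round limit matters in practice: without it, a model that keeps requesting tools would keep the request open indefinitely.</p>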
<p>To use LangChain, install the required packages:</p>
<pre><code class="language-plaintext">npm install express langchain @langchain/groq zod
</code></pre>
<p>We'll reuse the same tool functions from earlier:</p>
<pre><code class="language-typescript">import express from "express";
import { createAgent, tool } from "langchain";
import { ChatGroq } from "@langchain/groq";
import { z } from "zod";

const app = express();
app.use(express.json());

// Both tools take an order ID, so they can share one input schema
const orderInput = z.object({ id: z.string().describe("The order ID") });

const agent = createAgent({
  model: new ChatGroq({
    model: "llama-3.3-70b-versatile",
    apiKey: process.env.GROQ_API_KEY,
  }),
  tools: [
    // tool() wraps a function with the name, description, and schema the model sees
    tool(
      ({ id }) =&gt; getOrderStatus(id), // we have this function above
      {
        name: "getOrderStatus",
        description: "Get order status",
        schema: orderInput,
      }
    ),
    tool(
      ({ id }) =&gt; createSupportTicket(id), // imagine a function that creates a support ticket
      {
        name: "createSupportTicket",
        description: "Create support ticket",
        schema: orderInput,
      }
    ),
  ],
});

app.post(
  "/chat",
  async (req, res) =&gt; {
    const { message } = req.body;

    const response =
      await agent.invoke({
        messages: [
          {
            role: "user",
            content: message,
          },
        ],
      });

    res.json({
      reply:
        response.messages
          ?.at(-1)
          ?.text,
    });
  }
);

app.listen(3000);
</code></pre>
<p>Now the flow becomes:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a1fa5fdc5c3ae375fb38ab2/bd2a266c-39eb-4f3e-9909-ad81360bccb7.png" alt="Horizontal architecture diagram showing User → /chat API → LangChain Agent → OpenAI → Tool → Tool Result → Final Response." style="display:block;margin:0 auto" width="1930" height="815" loading="lazy">

<p>LangChain doesn't replace tools or MCP. It sits above them and coordinates how everything works together.</p>
<h2 id="heading-putting-it-together"><strong>Putting It Together</strong></h2>
<p>A modern AI application usually has multiple layers working together. The LLM handles reasoning and language generation. Tools perform real operations such as reading data, calling APIs, or executing actions. MCP helps standardize how those tools are exposed and accessed. LangChain helps orchestrate the interaction between models, tools, and workflows.</p>
<p>By separating these responsibilities, applications become easier to extend, maintain, and scale.</p>
<p>The goal is more than just generating text. You want to be able to build systems that can reason, retrieve information, take actions, and reliably solve real user problems.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a1fa5fdc5c3ae375fb38ab2/bfc88660-3145-4b89-a626-158c4ec52bcc.png" alt="User ->LLM -> LangChain -> MCP -> Tools -> Systems &amp; Data" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h2 id="heading-what-i-built-while-learning-this"><strong>What I Built While Learning This</strong></h2>
<p>After understanding the concepts above, I wanted to reduce some of this setup for my own projects. As I experimented, I noticed most applications recreate the same plumbing over and over: connecting an LLM, wiring up tools, managing execution, and exposing orchestration patterns.</p>
<p>So I built a small open-source toolkit to reduce that setup. The goal was simple: you should be able to focus on business logic instead of wiring AI infrastructure.</p>
<p>Current capabilities:</p>
<ul>
<li><p>LLM integration</p>
</li>
<li><p>Tool registration</p>
</li>
<li><p>Tool execution</p>
</li>
<li><p>Chat orchestration</p>
</li>
<li><p>LangChain support</p>
</li>
<li><p>Extensible architecture</p>
</li>
</ul>
<h3 id="heading-packages">Packages:</h3>
<p>AI Chat Widget: <a href="https://www.npmjs.com/package/ai-chat-toolkit-widget">https://www.npmjs.com/package/ai-chat-toolkit-widget</a></p>
<p>AI Chat Server: <a href="https://www.npmjs.com/package/ai-chat-toolkit-server">https://www.npmjs.com/package/ai-chat-toolkit-server</a></p>
<p>GitHub Repository: <a href="https://github.com/sudheeshshetty/ai-chat-toolkit">https://github.com/sudheeshshetty/ai-chat-toolkit</a></p>
<p>To build a server using the toolkit:</p>
<pre><code class="language-plaintext">npm install express ai-chat-toolkit-server
</code></pre>
<p>Create the chat server:</p>
<pre><code class="language-typescript">const aiChat = new AiChatServer({
  path: "/my-chat",
  provider: "groq",
  apiKey: process.env.API_KEY,
  model: process.env.MODEL || "llama-3.3-70b-versatile",
  cors: {
    origin: "http://localhost:5174",
  },
  orchestration: "langchain",
  maxToolRounds: 6,
  systemPrompt:
    "You are a helpful operations assistant for a demo store. Keep answers concise.",
});
</code></pre>
<p>Add your tools:</p>
<pre><code class="language-typescript">aiChat.addTools([
  {
    name: "...",
    description: "...",
    inputSchema: { ... },
    handler: async (input) =&gt; { /* runs in Node */ },
  },
]);
</code></pre>
<p>Attach it to your Express app:</p>
<pre><code class="language-typescript">aiChat.attach(app);
</code></pre>
<p>Now <code>/my-chat</code> is exposed in your Express server and can be used directly.</p>
<p>You can also use <code>ai-chat-toolkit-widget</code> if you want to skip building the chat UI.</p>
<p>Examples are available in the repository, so you can try it out quickly.</p>
<p>Here's a quick look at one of the examples:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a1fa5fdc5c3ae375fb38ab2/a9079710-be65-472b-881f-350daeeb0f3b.gif" alt="a9079710-be65-472b-881f-350daeeb0f3b" style="display:block;margin:0 auto" width="3456" height="2234" loading="lazy">

<p>If you find it useful, I’d appreciate a star, feedback, or contributions on GitHub as I continue improving the developer experience and exploring new ideas.<br>Thanks for reading — I hope this helped make LLMs, tools, MCP, and LangChain feel a little less magical and a lot more practical.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Handle Small Context Window Limits in RAG Systems ]]>
                </title>
                <description>
                    <![CDATA[ Retrieval-augmented generation, or RAG, is a pattern where an application retrieves relevant source material and adds it to a model prompt so the model can answer from that context. A larger context w ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-handle-small-context-window-limits-in-rag-systems/</link>
                <guid isPermaLink="false">6a33373b82e3f02be1b8c36f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TypeScript ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sviatoslav Barbutsa ]]>
                </dc:creator>
                <pubDate>Thu, 18 Jun 2026 00:09:31 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/5ee25d45-2056-4e32-b780-c92e828e7964.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Retrieval-augmented generation, or RAG, is a pattern where an application retrieves relevant source material and adds it to a model prompt so the model can answer from that context.</p>
<p>A larger context window in a RAG system shouldn't be treated as a substitute for good context management, although it can make the experience more forgiving for the end user. It's like running unoptimized graphics on a powerful GPU: the extra capacity can hide inefficiency for a while, but it doesn't eliminate the underlying optimization problem.</p>
<p>But even a very large context window still has a hard limit. If you keep adding tokens, you can eventually exceed it. This problem becomes more visible on consumer hardware, where limited memory and compute usually mean smaller usable context windows.</p>
<p>I ran into this problem while experimenting with local models on a consumer laptop with 12 GB of VRAM. RAG worked well for small tests, but as soon as the documents got larger, the system would retrieve useful chunks and still fail to answer well.</p>
<p>The issue wasn't always retrieval. Sometimes the right chunk had been found, but the final prompt didn't have room for it.</p>
<p>This article walks through the solution I implemented for this problem:</p>
<p>Document summary → chunk summary → raw chunk → final answer</p>
<p>The pattern is based on three rules:</p>
<ul>
<li><p>Use summaries for retrieval.</p>
</li>
<li><p>Use raw chunks for answering.</p>
</li>
<li><p>Use a context budget to decide what reaches the model.</p>
</li>
</ul>
<p>To keep the demo simple and convenient, the <a href="https://github.com/sviat-barbutsa/small-context-rag-solution">companion repository</a> uses small Python and TypeScript examples with a simplified in-memory retrieval store and a simplified answer extractor. This lets you see the article’s core ideas in practice without installing a full stack of dependencies, downloading models, running a Large Language Model (LLM) server, setting up an embedding service, or configuring a vector database.</p>
<p>That setup process could easily become its own dedicated article, so this tutorial keeps the runnable examples focused on the small-context RAG pattern: summaries for retrieval, raw chunks for answers, and a visible context budget.</p>
<p>The repo demonstrates the data flow and debugging pattern rather than production-grade model quality. In production, you'd want to replace the simplified summarizer, in-memory similarity search, and token estimator with your own model, embedding store, reranker, and tokenizer.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-you-will-implement">What You Will Implement</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-why-basic-rag-can-fail-with-a-small-context-window">Why Basic RAG Can Fail with a Small Context Window</a></p>
</li>
<li><p><a href="#heading-how-summary-routing-works">How Summary Routing Works</a></p>
</li>
<li><p><a href="#heading-how-to-represent-documents-and-chunks">How to Represent Documents and Chunks</a></p>
</li>
<li><p><a href="#heading-how-to-split-documents-into-raw-chunks">How to Split Documents into Raw Chunks</a></p>
</li>
<li><p><a href="#heading-how-to-summarize-chunks-and-documents">How to Summarize Chunks and Documents</a></p>
</li>
<li><p><a href="#heading-how-to-recursively-reduce-summaries">How to Recursively Reduce Summaries</a></p>
</li>
<li><p><a href="#heading-how-to-implement-the-hierarchical-index">How to Implement the Hierarchical Index</a></p>
</li>
<li><p><a href="#heading-how-to-retrieve-through-summaries">How to Retrieve Through Summaries</a></p>
</li>
<li><p><a href="#heading-how-to-implement-a-budgeted-raw-context">How to Implement a Budgeted Raw Context</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-demo">How to Run the Demo</a></p>
</li>
<li><p><a href="#heading-how-to-interpret-the-250-vs-1200-token-test">How to Interpret the 250 vs 1200 Token Test</a></p>
</li>
<li><p><a href="#heading-how-this-relates-to-existing-rag-techniques">How This Relates to Existing RAG Techniques</a></p>
</li>
<li><p><a href="#heading-when-to-use-this-pattern">When to Use This Pattern</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-you-will-implement">What You Will Implement</h2>
<p>In this tutorial, you'll implement a small educational RAG pipeline that manages context window limitations by processing documents across three levels:</p>
<ul>
<li><p><strong>Document records</strong> contain a short summary used to choose likely documents.</p>
</li>
<li><p><strong>Chunk records</strong> contain a short summary used to choose likely chunks inside those documents, plus the raw source text.</p>
</li>
<li><p><strong>Raw context</strong> contains selected raw chunks packed into a fixed token budget.</p>
</li>
</ul>
<p>The important distinction is that summaries are only used to decide where to look. They're not used as final evidence.</p>
<p>That matters because summaries are lossy. They compress information, and they may leave out the detail needed to answer the user's question. Raw chunks, by contrast, are larger, but they preserve the original wording.</p>
<p>The demo prints a trace for every question:</p>
<ul>
<li><p>Document summary hits</p>
</li>
<li><p>Chunk summary hits</p>
</li>
<li><p>Raw chunks included</p>
</li>
<li><p>Raw chunks skipped</p>
</li>
<li><p>Answer</p>
</li>
</ul>
<p>That trace is the debugging interface. It shows whether retrieval failed, or whether prompt assembly skipped useful evidence because the context budget was too small.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you need one of these:</p>
<ul>
<li><strong>Python 3.10 or newer</strong></li>
</ul>
<p>or:</p>
<ul>
<li><p><strong>Node.js 22 or newer</strong></p>
</li>
<li><p><strong>npm</strong></p>
</li>
</ul>
<p>You'll get the most out of this article if you're already comfortable with:</p>
<ul>
<li><p>basic Python or TypeScript syntax</p>
</li>
<li><p>running commands in a terminal</p>
</li>
<li><p>reading small data classes, functions, and lists or maps</p>
</li>
<li><p>the general idea of an LLM prompt and context window</p>
</li>
<li><p>the basic RAG idea: retrieve relevant source text, add it to a prompt, and answer from that context</p>
</li>
</ul>
<p>You don't need prior experience with vector databases, embedding APIs, LangChain, LlamaIndex, or local LLM setup.</p>
<p>The examples don't require an LLM provider, an embedding API, or a vector database. They use:</p>
<ul>
<li><p>sentence extraction as a stand-in for LLM summarization</p>
</li>
<li><p>bag-of-words cosine similarity as a stand-in for embedding search</p>
</li>
<li><p>fixed character-based token estimates as a stand-in for a tokenizer</p>
</li>
</ul>
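<p>To give a feel for the second stand-in, here is a rough sketch of bag-of-words cosine similarity. It isn't taken from the companion repository: <code>toBag</code> and <code>cosineSimilarity</code> are hypothetical names, and splitting on lowercase letters and digits is a simplifying assumption.</p>
<pre><code class="language-javascript">// Counts how often each lowercase word appears in the text.
function toBag(text) {
  const bag = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
    bag.set(word, (bag.get(word) || 0) + 1);
  }
  return bag;
}

// Length of a word-count vector.
function norm(bag) {
  let sum = 0;
  for (const count of bag.values()) {
    sum += count * count;
  }
  return Math.sqrt(sum);
}

// Cosine similarity: dot product of the two vectors divided by the product of their lengths.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [word, count] of a) {
    dot += count * (b.get(word) || 0);
  }
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

const query = toBag("refund policy for late delivery");
const related = toBag("Our refund policy covers late delivery and damaged items.");
const unrelated = toBag("The office is closed on public holidays.");

const relatedScore = cosineSimilarity(query, related);
const unrelatedScore = cosineSimilarity(query, unrelated);
console.log({ relatedScore, unrelatedScore });
</code></pre>
<p>Texts that share words with the query score above zero, and texts with no words in common score exactly zero. That is also the main weakness compared with embeddings: synonyms and paraphrases don't match.</p>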
<p>I made these implementation choices to save you time and make the examples easier to try, while preserving the original purpose. They also make the retrieval path visible.</p>
<h2 id="heading-why-basic-rag-can-fail-with-a-small-context-window">Why Basic RAG Can Fail with a Small Context Window</h2>
<p>The basic RAG loop usually looks like this:</p>
<p>Load documents → split documents into chunks → embed chunks → retrieve the top chunks → put retrieved chunks into the prompt → ask the model to answer.</p>
<p>This is a good starting point. But it hides two different problems inside one phrase: "retrieve the top chunks."</p>
<p>First, you need to find relevant material. That's retrieval quality.</p>
<p>Second, you need to decide which retrieved material actually fits in the final prompt. That's context budgeting.</p>
<p>On a large hosted model, you may not notice this problem right away. On a local model or a smaller context window, you'll notice it quickly.</p>
<p>The failure mode looks like this:</p>
<ul>
<li><p>The retriever finds useful chunks.</p>
</li>
<li><p>The prompt builder tries to add them.</p>
</li>
<li><p>The context budget fills up.</p>
</li>
<li><p>Some chunks are skipped.</p>
</li>
<li><p>The final model never sees those skipped chunks.</p>
</li>
<li><p>The answer is incomplete or says "I do not know."</p>
</li>
</ul>
<p>This can feel confusing when you inspect retrieval and see that the relevant chunk was returned. But retrieval returning a chunk isn't the same thing as the model seeing that chunk.</p>
<p>If you develop RAG systems on constrained hardware, this distinction becomes important.</p>
<h2 id="heading-how-summary-routing-works">How Summary Routing Works</h2>
<p>Instead of searching all raw chunks directly, you can create a routing layer out of summaries.</p>
<p>At indexing time:</p>
<ol>
<li><p>Load documents.</p>
</li>
<li><p>Split each document into chunks.</p>
</li>
<li><p>Summarize each chunk.</p>
</li>
<li><p>Reduce chunk summaries into one document summary.</p>
</li>
<li><p>Store document summaries in a document-summary store.</p>
</li>
<li><p>Store chunk summaries in per-document chunk-summary stores.</p>
</li>
<li><p>Keep raw chunks in a lookup table.</p>
</li>
</ol>
<p>Here's what the indexing pipeline looks like:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a2161a1d326f39f24ff91db/4356d7cb-13a7-4cda-855f-9c4056b43157.png" alt="Diagram showing documents split into chunks, chunk summaries, recursive reduction, document summary stores, chunk summary stores, and raw chunk lookup" style="display:block;margin:0 auto" width="1672" height="941" loading="lazy">

<p>At question time:</p>
<ol>
<li><p>Search document summaries to choose likely documents.</p>
</li>
<li><p>Search chunk summaries only inside those documents.</p>
</li>
<li><p>Convert chunk-summary hits back to raw chunk IDs.</p>
</li>
<li><p>Optionally add neighboring chunks.</p>
</li>
<li><p>Pack raw chunks into the final context budget.</p>
</li>
<li><p>Answer from raw chunks only.</p>
</li>
</ol>
<p>The query path uses the summaries for routing, then switches back to raw chunks before answering:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a2161a1d326f39f24ff91db/905681c7-1807-439a-b3ec-514ea7b9c221.png" alt="Diagram showing a question flowing through document summaries, chunk summaries, raw chunk lookup, and a final answer" style="display:block;margin:0 auto" width="1672" height="941" loading="lazy">

<p>This gives you two useful properties:</p>
<ul>
<li><p>Summaries make retrieval cheaper.</p>
</li>
<li><p>Raw chunks keep answers grounded.</p>
</li>
</ul>
<p>It also gives you a place to debug. If the system gives a weak answer, inspect the trace. Did the right document summary match? Did the right chunk summary match? Did the raw chunk fit in the final context? Did it get skipped because of the budget?</p>
<h2 id="heading-how-to-represent-documents-and-chunks">How to Represent Documents and Chunks</h2>
<p>The data structures are intentionally small because they contain only the essential information needed for this pipeline. In a real system, you would probably add more metadata.</p>
<p>Here's the Python version:</p>
<pre><code class="language-python">from dataclasses import dataclass

@dataclass(frozen=True)
class SearchDocument:
    page_content: str
    metadata: dict[str, str | int]

@dataclass(frozen=True)
class DocumentRecord:
    doc_id: str
    source: str
    text: str
    summary: str

@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    doc_id: str
    source: str
    index: int
    text: str
    summary: str
    previous_chunk_id: str | None
    next_chunk_id: str | None
</code></pre>
<p>The <code>DocumentRecord</code> stores the full document and a summary. The <code>ChunkRecord</code> stores the raw chunk, its summary, and links to the previous and next chunks.</p>
<p>Those neighbor links are useful because chunk boundaries are artificial. If retrieval finds chunk 4, the answer may start in chunk 3 or continue into chunk 5.</p>
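<p>As an illustration, the neighbor links could be filled in while the chunk records are built. This is a sketch under assumptions: <code>build_chunk_records</code>, <code>make_chunk_id</code>, and the ID scheme are hypothetical, modeled on the chunk IDs that appear in this article's trace output.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    doc_id: str
    source: str
    index: int
    text: str
    summary: str
    previous_chunk_id: str | None
    next_chunk_id: str | None


def make_chunk_id(doc_id: str, index: int) -&gt; str:
    # Hypothetical ID scheme: 1-based, zero-padded chunk number.
    return f"{doc_id}-chunk-{index + 1:03d}"


def build_chunk_records(
    doc_id: str,
    source: str,
    texts: list[str],
    summaries: list[str],
) -&gt; list[ChunkRecord]:
    records = []
    last_index = len(texts) - 1

    # strict=True raises if texts and summaries have different lengths.
    for index, (text, summary) in enumerate(zip(texts, summaries, strict=True)):
        records.append(
            ChunkRecord(
                chunk_id=make_chunk_id(doc_id, index),
                doc_id=doc_id,
                source=source,
                index=index,
                text=text,
                summary=summary,
                # The first chunk has no previous neighbor, the last has no next.
                previous_chunk_id=make_chunk_id(doc_id, index - 1) if index != 0 else None,
                next_chunk_id=make_chunk_id(doc_id, index + 1) if index != last_index else None,
            )
        )

    return records


records = build_chunk_records(
    "doc-001-notes", "notes.md", ["first", "second", "third"], ["s1", "s2", "s3"]
)
print(records[1].previous_chunk_id, records[1].next_chunk_id)
</code></pre>
<p>Storing neighbor IDs rather than neighbor objects keeps the records immutable and lets the lookup map resolve them only when expansion is actually needed.</p>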
<p>The index keeps both searchable stores and lookup maps:</p>
<pre><code class="language-python">@dataclass(frozen=True)
class HierarchicalIndex:
    documents_by_id: dict[str, DocumentRecord]
    chunks_by_id: dict[str, ChunkRecord]
    chunks_by_doc_id: dict[str, list[ChunkRecord]]
    document_summary_store: SimpleVectorStore
    chunk_summary_stores_by_doc_id: dict[str, SimpleVectorStore]
</code></pre>
<p>The most important lookup is this:</p>
<pre><code class="language-python">chunk = index.chunks_by_id[chunk_hit.metadata["chunk_id"]]
</code></pre>
<p>That line converts a retrieved summary hit back into the raw source text used for the final answer.</p>
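<p>For reference, here is one way a store with this interface could look, using the bag-of-words cosine similarity mentioned in the prerequisites. Treat it as a sketch: the class name and the <code>similarity_search(query, k)</code> signature follow the listings in this article, while <code>bag_of_words</code>, <code>cosine_similarity</code>, and the tokenization rule are assumptions.</p>
<pre><code class="language-python">import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDocument:
    page_content: str
    metadata: dict[str, str | int]


def bag_of_words(text: str) -&gt; Counter[str]:
    # Lowercase word counts act as a stand-in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(left: Counter[str], right: Counter[str]) -&gt; float:
    dot = sum(count * right[word] for word, count in left.items())
    left_norm = math.sqrt(sum(count * count for count in left.values()))
    right_norm = math.sqrt(sum(count * count for count in right.values()))

    if left_norm == 0 or right_norm == 0:
        return 0.0

    return dot / (left_norm * right_norm)


class SimpleVectorStore:
    def __init__(self, documents: list[SearchDocument]) -&gt; None:
        # Vectorize once at indexing time so searches only vectorize the query.
        self._entries = [(doc, bag_of_words(doc.page_content)) for doc in documents]

    def similarity_search(self, query: str, k: int) -&gt; list[SearchDocument]:
        query_vector = bag_of_words(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [doc for doc, _vector in ranked[:k]]


store = SimpleVectorStore(
    [
        SearchDocument("Context budget decides which chunks reach the prompt.", {"doc_id": "a"}),
        SearchDocument("Bananas are yellow fruit.", {"doc_id": "b"}),
    ]
)
print(store.similarity_search("context budget", k=1)[0].metadata["doc_id"])
</code></pre>
<p>Swapping this class for a real embedding store shouldn't change the rest of the pipeline, as long as the replacement still returns documents whose metadata carries the <code>doc_id</code> or <code>chunk_id</code>.</p>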
<h2 id="heading-how-to-split-documents-into-raw-chunks">How to Split Documents into Raw Chunks</h2>
<p>The demo splits Markdown files by paragraph and groups paragraphs until a target character size is reached:</p>
<pre><code class="language-python">import re

CHUNK_SIZE = 420

def split_text(text: str) -&gt; list[str]:
    chunks = []
    current_paragraphs = []
    current_size = 0

    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()

        if not paragraph:
            continue

        if current_paragraphs and current_size + len(paragraph) &gt; CHUNK_SIZE:
            chunks.append("\n\n".join(current_paragraphs))
            current_paragraphs = []
            current_size = 0

        current_paragraphs.append(paragraph)
        current_size += len(paragraph)

    if current_paragraphs:
        chunks.append("\n\n".join(current_paragraphs))

    return chunks
</code></pre>
<p>One important thing: this isn't the perfect splitter for every use case. It's intentionally readable.</p>
<p>In a production system, you might use a tokenizer-aware splitter, Markdown-aware sections, semantic chunking, or parent-child chunking. But regardless of the option you pick, the idea stays the same: keep raw chunks as the final evidence.</p>
<h2 id="heading-how-to-summarize-chunks-and-documents">How to Summarize Chunks and Documents</h2>
<p>To keep the demo easy to run, this article uses sentence extraction as a stand-in for LLM summarization. It scores sentences that include important RAG terms and keeps the top sentences.</p>
<pre><code class="language-python">def summarize_text(text: str, max_sentences: int = 2) -&gt; str:
    sentences = [
        sentence.strip()
        for sentence in re.split(r"(?&lt;=[.!?])\s+", " ".join(text.split()))
        if sentence.strip()
    ]

    if len(sentences) &lt;= max_sentences:
        return " ".join(sentences)

    scored_sentences = []

    for position, sentence in enumerate(sentences):
        sentence_words = words(sentence)
        term_score = sum(3 for word in sentence_words if word in IMPORTANT_TERMS)
        first_sentence_bonus = 1 if position == 0 else 0
        scored_sentences.append((term_score + first_sentence_bonus, position, sentence))

    selected = sorted(scored_sentences, key=lambda item: (-item[0], item[1]))[:max_sentences]
    selected.sort(key=lambda item: item[1])

    return " ".join(sentence for _score, _position, sentence in selected)
</code></pre>
<p>In a real system, this function would call a small local model or a hosted model. The prompt instructions would be something like:</p>
<ul>
<li><p>Summarize this chunk for retrieval.</p>
</li>
<li><p>Preserve names, constraints, decisions, errors, numbers, and domain-specific terms.</p>
</li>
<li><p>Don't answer a user question.</p>
</li>
</ul>
<p>Note that the chunk summary isn't supposed to replace the raw chunk. Its only goal is to make retrieval easier.</p>
<h2 id="heading-how-to-recursively-reduce-summaries">How to Recursively Reduce Summaries</h2>
<p>A common mistake is to create a document summary by putting every chunk summary into one prompt:</p>
<pre><code class="language-python">combined = "\n\n".join(chunk_summaries)
document_summary = summarize(combined)
</code></pre>
<p>That works for a few chunks, but it doesn't work for hundreds of chunks. You have only moved the context-window problem from answer time into indexing time.</p>
<p>A better approach is to reduce summaries in batches:</p>
<p>Chunk summaries → budgeted batches → batch summaries → higher-level summaries → final document summary.</p>
<p>The reduction process looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a2161a1d326f39f24ff91db/5cac6432-5cdd-4940-8318-08ffa2ec0622.png" alt="Diagram showing chunk summaries being grouped into budgeted batches, reduced into higher-level summaries, and then reduced into one final document summary" style="display:block;margin:0 auto" width="1672" height="941" loading="lazy">

<p>Here is the budgeted packing function:</p>
<pre><code class="language-python">def pack_summaries_by_token_budget(
    summaries: list[str],
    token_budget: int,
) -&gt; list[list[str]]:
    batches = []
    current_batch = []
    current_tokens = 0

    for summary in summaries:
        summary_tokens = approximate_tokens(summary)

        if current_batch and current_tokens + summary_tokens &gt; token_budget:
            batches.append(current_batch)
            current_batch = []
            current_tokens = 0

        current_batch.append(summary)
        current_tokens += summary_tokens

    if current_batch:
        batches.append(current_batch)

    return batches
</code></pre>
<p>And here is the recursive reduction loop:</p>
<pre><code class="language-python">def recursively_reduce_summaries(summaries: list[str]) -&gt; str:
    if not summaries:
        return "No summary available."

    current_summaries = summaries
    level = 1

    while len(current_summaries) &gt; 1:
        batches = pack_summaries_by_token_budget(
            current_summaries,
            SUMMARY_REDUCTION_INPUT_TOKEN_BUDGET,
        )

        if len(batches) == len(current_summaries):
            batches = force_summary_reduction_progress(current_summaries)

        print(
            f"Reducing {len(current_summaries)} summaries into "
            f"{len(batches)} batch summaries at level {level}"
        )

        current_summaries = [reduce_summary_batch(batch) for batch in batches]
        level += 1

    return summarize_text(current_summaries[0], max_sentences=3)
</code></pre>
<p>The fallback matters:</p>
<pre><code class="language-python">if len(batches) == len(current_summaries):
    batches = force_summary_reduction_progress(current_summaries)
</code></pre>
<p>If each summary is too large to fit with another summary, simple budget packing makes no progress, so pairing summaries forces the reduction to continue.</p>
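<p>One way the pairing fallback could be written is shown below. The function name matches the call in the reduction loop, but the body is an assumption based on the description above rather than the companion repository's implementation.</p>
<pre><code class="language-python">def force_summary_reduction_progress(summaries: list[str]) -&gt; list[list[str]]:
    """Pair adjacent summaries so each level has fewer batches than summaries."""
    # Step through the list two items at a time. An odd final summary
    # ends up alone in its own batch.
    return [summaries[start:start + 2] for start in range(0, len(summaries), 2)]


print(force_summary_reduction_progress(["s1", "s2", "s3"]))
</code></pre>
<p>With two or more summaries, pairing always yields fewer batches than inputs, so the <code>while</code> loop is guaranteed to terminate. The tradeoff is that a pair built this way can exceed the token budget, so the step that reduces each batch still has to cope with oversized input.</p>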
<h2 id="heading-how-to-implement-the-hierarchical-index">How to Implement the Hierarchical Index</h2>
<p>Once you have document records and chunk records, create two kinds of stores:</p>
<ul>
<li><p>one store for document summaries</p>
</li>
<li><p>one store for chunk summaries, grouped by document</p>
</li>
</ul>
<p>Here's the document-summary store:</p>
<pre><code class="language-python">document_summary_store = SimpleVectorStore(
    [
        SearchDocument(
            page_content=record.summary,
            metadata={"doc_id": record.doc_id, "source": record.source},
        )
        for record in document_records
    ]
)
</code></pre>
<p>Then group chunks by document:</p>
<pre><code class="language-python">chunks_by_doc_id: dict[str, list[ChunkRecord]] = {}

for chunk in chunk_records:
    chunks_by_doc_id.setdefault(chunk.doc_id, []).append(chunk)
</code></pre>
<p>Then create one chunk-summary store per document:</p>
<pre><code class="language-python">chunk_summary_stores_by_doc_id = {}

for doc_id, doc_chunks in chunks_by_doc_id.items():
    chunk_summary_stores_by_doc_id[doc_id] = SimpleVectorStore(
        [
            SearchDocument(
                page_content=chunk.summary,
                metadata={
                    "chunk_id": chunk.chunk_id,
                    "doc_id": chunk.doc_id,
                    "source": chunk.source,
                    "chunk_index": chunk.index,
                },
            )
            for chunk in doc_chunks
        ]
    )
</code></pre>
<p>This is what makes retrieval hierarchical: the first search chooses documents, while the second search only looks inside the chosen documents.</p>
<h2 id="heading-how-to-retrieve-through-summaries">How to Retrieve Through Summaries</h2>
<p>At question time, search document summaries first:</p>
<pre><code class="language-python">document_hits = index.document_summary_store.similarity_search(
    question,
    k=min(DOC_RETRIEVAL_K, len(index.documents_by_id)),
)
</code></pre>
<p>In these searches, <code>k</code> controls how many top-ranked results the store should return.</p>
<p>Then search chunk summaries inside each selected document:</p>
<pre><code class="language-python">chunk_hits = []
seen_chunk_ids = set()

for document_hit in document_hits:
    doc_id = str(document_hit.metadata["doc_id"])
    chunk_store = index.chunk_summary_stores_by_doc_id[doc_id]
    doc_chunk_count = len(index.chunks_by_doc_id[doc_id])
    per_doc_hits = chunk_store.similarity_search(
        question,
        k=min(CHUNK_RETRIEVAL_K_PER_DOC, doc_chunk_count),
    )

    for chunk_hit in per_doc_hits:
        chunk_id = str(chunk_hit.metadata["chunk_id"])

        if chunk_id in seen_chunk_ids:
            continue

        chunk_hits.append(chunk_hit)
        seen_chunk_ids.add(chunk_id)
</code></pre>
<p>Notice what is being retrieved here: summaries.</p>
<p>The summary hit contains the <code>chunk_id</code>, but the final answer still uses the raw chunk text associated with that ID. The raw chunk preserves the original wording and details that the summary might have removed.</p>
<h2 id="heading-how-to-implement-a-budgeted-raw-context">How to Implement a Budgeted Raw Context</h2>
<p>After chunk-summary retrieval, convert the hits back to raw chunks.</p>
<p>The demo also adds neighbor chunks:</p>
<pre><code class="language-python">def candidate_raw_chunks(
    chunk_hits: list[SearchDocument],
    index: HierarchicalIndex,
) -&gt; list[ChunkRecord]:
    candidates = []
    seen_chunk_ids = set()

    for chunk_hit in chunk_hits:
        chunk = index.chunks_by_id[str(chunk_hit.metadata["chunk_id"])]
        related_chunk_ids = [chunk.chunk_id]

        if EXPAND_NEIGHBOR_CHUNKS:
            related_chunk_ids.extend([chunk.next_chunk_id, chunk.previous_chunk_id])

        for chunk_id in related_chunk_ids:
            if chunk_id is None or chunk_id in seen_chunk_ids:
                continue

            candidates.append(index.chunks_by_id[chunk_id])
            seen_chunk_ids.add(chunk_id)

    return candidates
</code></pre>
<p>Then apply the final context budget:</p>
<pre><code class="language-python">def build_raw_context(
    chunk_hits: list[SearchDocument],
    index: HierarchicalIndex,
) -&gt; tuple[str, list[tuple[ChunkRecord, int]], list[tuple[ChunkRecord, int]]]:
    included_chunks = []
    skipped_chunks = []
    used_tokens = 0

    for chunk in candidate_raw_chunks(chunk_hits, index):
        raw_context_part = format_raw_chunk(chunk)
        raw_context_tokens = approximate_tokens(raw_context_part)

        if used_tokens + raw_context_tokens &gt; RAW_CONTEXT_TOKEN_BUDGET:
            skipped_chunks.append((chunk, raw_context_tokens))
            continue

        included_chunks.append((chunk, raw_context_tokens))
        used_tokens += raw_context_tokens

    included_chunks.sort(key=lambda item: (item[0].source, item[0].index))

    context = "\n\n---\n\n".join(
        format_raw_chunk(chunk)
        for chunk, _tokens in included_chunks
    )

    return context, included_chunks, skipped_chunks
</code></pre>
<p>This step is where many RAG bugs become visible.</p>
<p>If the system retrieves a useful chunk but skips it because the prompt is full, the problem isn't document search. It's context budgeting.</p>
<h2 id="heading-how-to-run-the-demo">How to Run the Demo</h2>
<p>The companion repository contains two versions of the same example.</p>
<p>From the companion repository root, run the Python version:</p>
<pre><code class="language-bash">cd python
python3 -m small_context_rag_solution --question "Why can RAG fail when the context budget is too small?"
</code></pre>
<p>Run the TypeScript version:</p>
<pre><code class="language-bash">cd typescript
npm install
npm run demo
</code></pre>
<p>You can also run either example interactively by leaving off the question flag. Type <code>q</code>, <code>quit</code>, or <code>exit</code> to leave interactive mode.</p>
<p>Python:</p>
<pre><code class="language-bash">python3 -m small_context_rag_solution
</code></pre>
<p>TypeScript:</p>
<pre><code class="language-bash">npm run build
npm start
</code></pre>
<p>The default raw context budget is small on purpose: <code>RAW_CONTEXT_TOKEN_BUDGET=250</code>. That makes skipped chunks visible.</p>
<h2 id="heading-how-to-interpret-the-250-vs-1200-token-test">How to Interpret the 250 vs 1200 Token Test</h2>
<p>Run the same question with two budgets.</p>
<p>Python:</p>
<pre><code class="language-bash">RAW_CONTEXT_TOKEN_BUDGET=250 python3 -m small_context_rag_solution --question "Why can RAG fail when the context budget is too small?"
RAW_CONTEXT_TOKEN_BUDGET=1200 python3 -m small_context_rag_solution --question "Why can RAG fail when the context budget is too small?"
</code></pre>
<p>TypeScript:</p>
<pre><code class="language-bash">RAW_CONTEXT_TOKEN_BUDGET=250 npm run demo
RAW_CONTEXT_TOKEN_BUDGET=1200 npm run demo
</code></pre>
<p>With the 250-token budget, the raw context builder includes only two chunks:</p>
<ul>
<li><p><code>doc-003-large_rag_notes-chunk-004</code> (110 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-005</code> (121 approx tokens)</p>
</li>
</ul>
<p>It skips five other selected chunks:</p>
<ul>
<li><p><code>doc-003-large_rag_notes-chunk-003</code> (117 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-001</code> (116 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-002</code> (120 approx tokens)</p>
</li>
<li><p><code>doc-001-context_window_notes-chunk-001</code> (131 approx tokens)</p>
</li>
<li><p><code>doc-001-context_window_notes-chunk-002</code> (73 approx tokens)</p>
</li>
</ul>
<p>With the 1200-token budget, every selected raw chunk fits:</p>
<ul>
<li><p><code>doc-001-context_window_notes-chunk-001</code> (131 approx tokens)</p>
</li>
<li><p><code>doc-001-context_window_notes-chunk-002</code> (73 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-001</code> (116 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-002</code> (120 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-003</code> (117 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-004</code> (110 approx tokens)</p>
</li>
<li><p><code>doc-003-large_rag_notes-chunk-005</code> (121 approx tokens)</p>
</li>
</ul>
<p>No selected raw chunks are skipped.</p>
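<p>Both results can be checked by replaying the budget arithmetic on the token counts listed above. The sketch below assumes the candidate order is the order in which the 250-token run lists its included and skipped chunks. <code>replay_budget</code> is a hypothetical helper that mirrors the skip-and-continue logic of <code>build_raw_context</code>.</p>
<pre><code class="language-python"># Approximate token counts copied from the trace above, in assumed candidate order.
CANDIDATES = [
    ("doc-003-large_rag_notes-chunk-004", 110),
    ("doc-003-large_rag_notes-chunk-005", 121),
    ("doc-003-large_rag_notes-chunk-003", 117),
    ("doc-003-large_rag_notes-chunk-001", 116),
    ("doc-003-large_rag_notes-chunk-002", 120),
    ("doc-001-context_window_notes-chunk-001", 131),
    ("doc-001-context_window_notes-chunk-002", 73),
]


def replay_budget(candidates: list[tuple[str, int]], budget: int):
    included, skipped, used = [], [], 0

    for chunk_id, tokens in candidates:
        # Skip a chunk that would overflow the budget, but keep scanning
        # in case a later, smaller chunk still fits.
        if used + tokens &gt; budget:
            skipped.append(chunk_id)
            continue

        included.append(chunk_id)
        used += tokens

    return included, skipped, used


for budget in (250, 1200):
    included, skipped, used = replay_budget(CANDIDATES, budget)
    print(f"budget={budget}: included={len(included)} skipped={len(skipped)} used={used}")
</code></pre>
<p>With a budget of 250, the first two chunks use 231 tokens, and even the smallest remaining chunk (73 tokens) would push the total to 304. With a budget of 1200, all seven chunks fit because together they need only 788 tokens.</p>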
<p>This diagram shows the difference between the two context budgets:</p>
<img src="https://cdn.hashnode.com/uploads/covers/6a2161a1d326f39f24ff91db/266bcd79-cd50-4e84-ade7-2d6bbaddd662.png" alt="Diagram comparing a 250-token raw context budget that includes two chunks and skips five with a 1200-token budget that includes seven chunks and skips none" style="display:block;margin:0 auto" width="1540" height="1021" loading="lazy">

<p>A 1200-token limit is still a very small context window for a real system, but it's much larger than 250. In this example, you can clearly see that the same retrieval route behaves differently when the prompt builder has more room.</p>
<p>This is why I like printing both included and skipped chunks. It helps answer a practical debugging question:</p>
<p><em>Did retrieval miss the evidence, or did prompt assembly drop it?</em></p>
<p>The demo uses a simplified answer step, so don't focus too much on the exact wording of the final answer. In a real LLM prompt, you would include instructions like:</p>
<ul>
<li><p>Answer only from the raw chunks below.</p>
</li>
<li><p>If the raw chunks contain multiple relevant reasons, include all of them.</p>
</li>
<li><p>Prefer a concise bullet list for multi-part answers.</p>
</li>
<li><p>If the raw chunks don't contain enough evidence, say so.</p>
</li>
</ul>
<p>More context doesn't automatically make the answer better. The prompt still has to tell the model how to use the extra evidence.</p>
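<p>As a rough sketch, those instructions could be assembled into a prompt like this. The function name <code>build_answer_prompt</code>, the constant <code>ANSWER_INSTRUCTIONS</code>, and the prompt layout are assumptions for illustration. The demo's simplified answer step doesn't necessarily work this way.</p>
<pre><code class="language-python">ANSWER_INSTRUCTIONS = [
    "Answer only from the raw chunks below.",
    "If the raw chunks contain multiple relevant reasons, include all of them.",
    "Prefer a concise bullet list for multi-part answers.",
    "If the raw chunks don't contain enough evidence, say so.",
]


def build_answer_prompt(question: str, raw_context: str) -&gt; str:
    """Combine instructions, budgeted raw chunks, and the question into one prompt."""
    instructions = "\n".join(f"- {line}" for line in ANSWER_INSTRUCTIONS)

    return (
        f"Instructions:\n{instructions}\n\n"
        f"Raw chunks:\n{raw_context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_answer_prompt("Why can RAG fail when the context budget is too small?", "(raw chunks here)"))
</code></pre>
<p>The instructions and the question take up tokens too, so in a real system you would count them against the model's context window along with the raw chunks.</p>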
<h2 id="heading-how-this-relates-to-existing-rag-techniques">How This Relates to Existing RAG Techniques</h2>
<p>This pattern isn't brand new research. It's a practical combination of several ideas that already exist in the RAG ecosystem.</p>
<p>LangChain uses a related technique in its <a href="https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain_classic/retrievers/parent_document_retriever.py">ParentDocumentRetriever</a>, which searches smaller child chunks and then returns their larger parent documents.</p>
<p>It is also related to the <a href="https://developers.llamaindex.ai/python/examples/index_structs/doc_summary/docsummary/">LlamaIndex Document Summary Index</a>, which uses document summaries to select relevant documents and then retrieves the nodes for those documents.</p>
<p>And it's conceptually adjacent to <a href="https://arxiv.org/abs/2401.18059">RAPTOR</a>, a retrieval method that builds a tree by recursively clustering and summarizing text.</p>
<p>The version in this article is intentionally simpler:</p>
<ul>
<li><p>No clustering.</p>
</li>
<li><p>No framework requirement.</p>
</li>
<li><p>No vector database required for the demo.</p>
</li>
<li><p>No claim that summaries are enough for final answers.</p>
</li>
</ul>
<p>The goal is to show a transparent pattern that's easy to understand under the hood and adapt to your own needs without relying on heavy frameworks. For my local-model work, the useful part was the separation:</p>
<ul>
<li><p>Summaries for retrieval</p>
</li>
<li><p>Raw chunks for grounding</p>
</li>
<li><p>Budget trace for debugging</p>
</li>
</ul>
<h2 id="heading-when-to-use-this-pattern">When to Use This Pattern</h2>
<p>This pattern is useful when:</p>
<ul>
<li><p>you run local models with limited VRAM</p>
</li>
<li><p>your context window is small or expensive</p>
</li>
<li><p>you have many documents but only a few are relevant to each question</p>
</li>
<li><p>you want inspectable retrieval traces</p>
</li>
<li><p>you want summaries for search but raw text for answers</p>
</li>
<li><p>you need to avoid unbounded prompts during both indexing and answering</p>
</li>
</ul>
<p>It's less useful when:</p>
<ul>
<li><p>your source documents are already small</p>
</li>
<li><p>your whole corpus fits comfortably in the prompt</p>
</li>
<li><p>exact keyword search is enough</p>
</li>
<li><p>you don't need multi-document routing</p>
</li>
<li><p>you can afford to retrieve and rerank many raw chunks directly</p>
</li>
</ul>
<p>There is also a tradeoff. This pattern adds indexing work:</p>
<ul>
<li><p>chunk summaries</p>
</li>
<li><p>recursive summary reduction</p>
</li>
<li><p>document summaries</p>
</li>
<li><p>extra lookup maps</p>
</li>
</ul>
<p>That's usually acceptable for document assistants, research tools, internal knowledge bases, and local-model projects where indexing can happen once and queries happen many times.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Don't treat RAG as only "retrieve chunks and paste them into a prompt."</p>
<p>For small-context systems, retrieval needs routing and budgeting. Even on high-end hardware with very large context windows, the same discipline matters more and more as the project scales.</p>
<p>The pattern comes down to three practical rules:</p>
<ul>
<li><p><strong>Summaries help find relevant source material.</strong></p>
</li>
<li><p><strong>Raw chunks ground the answer.</strong></p>
</li>
<li><p><strong>Context budgeting decides what reaches the model.</strong></p>
</li>
</ul>
<p>This solution helped me develop more reliable local RAG systems on constrained hardware. It also made failures easier to debug, because I could see exactly which summaries matched, which raw chunks were selected, and which raw chunks were skipped.</p>
<p>Whether you run RAG locally or through a hosted model, this pattern is worth trying before you spend money on a larger context window, especially if you're working with a small model, a limited context window, or a strict prompt budget.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models ]]>
                </title>
                <description>
                    <![CDATA[ For the last few years, Large Language Models have been impressing researchers with their ability to generate text, answer questions, translate languages, and perform tasks they had never been explici ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-chain-of-thought-prompting-elicits-reasoning-in-large-language-models/</link>
                <guid isPermaLink="false">6a30800dc3625a1a686f75f8</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Mon, 15 Jun 2026 22:43:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0d9c4f6a-1352-431f-af2e-c08b0e128e39.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>For the last few years, Large Language Models have been impressing researchers with their ability to generate text, answer questions, translate languages, and perform tasks they had never been explicitly trained to solve.</p>
<p>Each new generation seemed to confirm a simple belief: bigger models lead to better capabilities. Yet there was one area where progress appeared frustratingly limited. When problems required multiple steps of reasoning, language models often struggled in ways that were difficult to ignore.</p>
<p>A math word problem, a common sense question, or a symbolic puzzle could expose a surprising gap between fluent language generation and genuine problem solving. Models could frequently produce confident answers, but confidence alone wasn't enough. The challenge was whether they could reason through a problem before arriving at an answer.</p>
<p>Against this backdrop, the paper <em>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</em> introduced an idea that was both simple and unexpected. Rather than asking a model to produce an answer immediately, the authors encouraged it to work through intermediate reasoning steps first.</p>
<p>What followed was one of the most influential discoveries in modern AI research: many reasoning abilities that appeared absent in large language models weren't necessarily missing. In many cases, they simply hadn't been elicited in the right way.</p>
<p>This paper went on to reshape how researchers think about prompting, reasoning, and the capabilities of large language models. More importantly, it laid the intellectual foundation for many of the reasoning-oriented techniques and systems that emerged in the years that followed.</p>
<h2 id="heading-paper-overview">Paper Overview</h2>
<p>In this article, we'll explore the paper <em>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</em>, published by researchers at Google Research in 2022.</p>
<p>This paper introduced one of the most influential ideas in modern AI: <strong>Chain-of-Thought (CoT) Prompting</strong>. At a time when researchers were focused on scaling language models to ever-larger sizes, this study revealed that performance improvements were not always about building bigger models. Sometimes, the key was changing how we communicate with them.</p>
<p>The paper investigates a simple but powerful question: what happens if a language model is encouraged to show its reasoning process before giving an answer? Instead of responding directly, the model is guided to generate intermediate reasoning steps that lead to the final solution.</p>
<p>What makes this paper historically important is that it changed how researchers think about reasoning in large language models. The authors demonstrated that many reasoning capabilities can be unlocked through prompting alone, without additional training, fine-tuning, or architectural modifications.</p>
<p>The impact of this idea quickly extended beyond arithmetic reasoning. It influenced a new generation of research on reasoning, including self-consistency, process supervision, verification-based methods, and the reasoning-oriented models that followed in subsequent years.</p>
<p>In many ways, this paper marked a shift from asking language models <strong>what the answer is</strong> to asking them <strong>how they arrived at the answer</strong>.</p>
<p>Here's the original paper if you'd like to explore it directly:</p>
<p><a href="https://arxiv.org/pdf/2201.11903"><strong>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</strong></a></p>
<p>And here's a quick infographic of what we'll cover throughout this review.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/bdf2234d-0fb2-4a44-a632-a0b3aa77fff4.png" alt="Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h2 id="heading-table-of-contents"><strong>Table of Contents:</strong></h2>
<ul>
<li><p><a href="#heading-abstract">Abstract</a></p>
</li>
<li><p><a href="#heading-introduction">Introduction</a></p>
</li>
<li><p><a href="#heading-chain-of-thought-prompting">Chain-of-Thought Prompting</a></p>
</li>
<li><p><a href="#heading-arithmetic-reasoning">Arithmetic Reasoning</a></p>
</li>
<li><p><a href="#heading-results">Results</a></p>
</li>
<li><p><a href="#heading-ablation-study">Ablation Study</a></p>
</li>
<li><p><a href="#heading-robustness-of-chain-of-thought-prompting">Robustness of Chain-of-Thought Prompting</a></p>
</li>
<li><p><a href="#heading-common-sense-reasoning">Common Sense Reasoning</a></p>
</li>
<li><p><a href="#heading-symbolic-reasoning">Symbolic Reasoning</a></p>
</li>
<li><p><a href="#heading-discussion">Discussion</a></p>
</li>
<li><p><a href="#heading-related-work">Related Work</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to already be familiar with a few foundational ideas and the evolution of large language models that led to Chain-of-Thought prompting.</p>
<p>Reading the previous reviews in this series will be especially helpful:</p>
<ul>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/">AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/">AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-few-shot-learners-gpt-3/">AI Paper Review: Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-training-language-models-to-follow-instructions-with-human-feedback-instructgpt/">AI Paper Review: Training Language Models to Follow Instructions with Human Feedback (InstructGPT)</a></p>
</li>
</ul>
<p>The GPT-3 review is particularly important because the Chain-of-Thought paper builds directly on one of GPT-3's most surprising capabilities: in-context learning. Rather than changing the model architecture or retraining the model, the authors discovered that reasoning performance could be dramatically improved simply by changing how examples were presented in the prompt.</p>
<p>It also helps to have:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and large language models</p>
</li>
<li><p>A basic understanding of Transformer-based autoregressive models</p>
</li>
<li><p>Familiarity with prompting, few-shot learning, and in-context learning</p>
</li>
<li><p>A high-level understanding of how language models generate text token by token</p>
</li>
<li><p>General machine learning concepts such as training, inference, scaling laws, and model evaluation</p>
</li>
<li><p>Some exposure to reasoning tasks, logic problems, and mathematical word problems</p>
</li>
<li><p>A basic understanding of benchmark datasets and model performance evaluation</p>
</li>
</ul>
<p>You don't need a deep background in mathematics or machine learning research to follow this article.</p>
<p>I'll keep the explanations intuitive and practical, focusing on why Chain-of-Thought prompting became one of the most influential reasoning techniques in modern AI and how a simple prompting strategy changed the way researchers think about language model reasoning.</p>
<h2 id="heading-abstract"><strong>Abstract</strong></h2>
<p>One of the long-standing challenges for large language models has been reasoning. While these models can generate fluent text and answer a wide variety of questions, they often struggle when a task requires multiple logical steps.</p>
<p>This paper introduces a remarkably simple idea to address that limitation: instead of prompting a model with only questions and answers, you should provide examples that also include the intermediate reasoning steps leading to the solution.</p>
<p>The authors call this approach Chain-of-Thought (CoT) Prompting. By showing a model a few demonstrations of step-by-step reasoning, they find that sufficiently large language models can generate their own reasoning chains and solve complex problems more effectively. Importantly, this improvement doesn't require additional training or fine-tuning, only a different style of prompting.</p>
<p>Through experiments on arithmetic, common sense, and symbolic reasoning tasks, the paper demonstrates that chain-of-thought prompting consistently improves performance. The gains become especially pronounced at larger model scales, suggesting that reasoning abilities emerge naturally as models grow and are given the right prompting strategy.</p>
<p>The paper's most striking result comes from the GSM8K math benchmark, where PaLM 540B, using only eight chain-of-thought examples, achieved state-of-the-art performance and even surpassed a fine-tuned GPT-3 system equipped with a verifier. This finding revealed that prompting alone could unlock reasoning capabilities that standard prompting often fails to expose.</p>
<p>The figure below compares standard prompting with Chain-of-Thought (CoT) prompting using a simple arithmetic example.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/126da3c8-fa3f-4207-8d86-723c576d80d5.png" alt="Standard prompting vs chain of thought prompting" style="display:block;margin:0 auto" width="1853" height="835" loading="lazy">

<p>Source: <a href="https://arxiv.org/pdf/2201.11903">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</a></p>
<p>In standard prompting, the model is shown question–answer pairs and is expected to produce an answer directly, which can lead to mistakes on multi-step problems.</p>
<p>In Chain-of-Thought prompting, the examples include intermediate reasoning steps before the final answer. When faced with a new problem, the model follows a similar step-by-step process, arriving at the correct solution.</p>
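<p>To make the difference concrete, here is a minimal sketch of how the two prompt styles could be assembled. The exemplar is modeled on the arithmetic example from the paper's first figure, but the wording, the dictionary keys, and the helper names (<code>format_exemplar</code>, <code>build_prompt</code>) are assumptions made for illustration, not code from the paper.</p>

```python
# Illustrative exemplar; the keys and helper names are assumptions for this sketch.
EXEMPLAR = {
    "question": (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    "chain_of_thought": (
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    "answer": "11",
}


def format_exemplar(exemplar: dict, with_reasoning: bool) -> str:
    """Render one demonstration as a Q/A block."""
    answer_line = f"The answer is {exemplar['answer']}."
    if with_reasoning:
        # Chain-of-thought: the reasoning is written out before the final answer.
        answer_line = f"{exemplar['chain_of_thought']} {answer_line}"
    return f"Q: {exemplar['question']}\nA: {answer_line}"


def build_prompt(exemplars: list, new_question: str, with_reasoning: bool) -> str:
    """Join the demonstrations and append the unanswered test question."""
    blocks = [format_exemplar(e, with_reasoning) for e in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


NEW_QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

standard_prompt = build_prompt([EXEMPLAR], NEW_QUESTION, with_reasoning=False)
cot_prompt = build_prompt([EXEMPLAR], NEW_QUESTION, with_reasoning=True)
print(cot_prompt)
```

<p>The only difference between the two prompts is the text after each <code>A:</code>. The model, the question, and the number of demonstrations stay the same.</p>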
<p>This paper shows that providing reasoning demonstrations can substantially improve performance on arithmetic, common sense, and symbolic reasoning tasks, particularly in large language models.</p>
<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>By 2022, large language models had already transformed natural language processing. Models such as GPT-3 demonstrated that scaling model size could unlock impressive capabilities, from text generation to few-shot learning.</p>
<p>But there was an important limitation: larger models weren't necessarily better at reasoning. Tasks that required multi-step arithmetic, common sense inference, or symbolic manipulation remained surprisingly difficult, even for some of the largest models available.</p>
<p>The authors begin by observing two promising research directions. The first comes from prior work showing that reasoning tasks can benefit from natural language explanations or intermediate solution steps. Instead of jumping directly to an answer, a model can generate a rationale that mirrors how a human might solve the problem.</p>
<p>The second direction is few-shot prompting, where a model learns a task from a handful of examples provided in the prompt, eliminating the need for task-specific fine-tuning.</p>
<p>Still, both approaches have drawbacks. Training models on large collections of human-written rationales is expensive and time-consuming, while standard few-shot prompting often struggles on tasks that require genuine reasoning.</p>
<p>The key insight of this paper was to combine the strengths of both ideas. Rather than providing only input-output examples, the prompt includes an additional component: the reasoning process itself. Each example follows the structure of <em>input → chain of thought → output</em>.</p>
<p>This simple modification led to Chain-of-Thought Prompting. By exposing intermediate reasoning steps, the model is encouraged to break complex problems into smaller, more manageable stages before arriving at a final answer.</p>
<p>To evaluate the idea, the authors tested chain-of-thought prompting across arithmetic, common sense, and symbolic reasoning benchmarks. The results showed substantial improvements over standard prompting, with some gains being remarkably large.</p>
<h2 id="heading-chain-of-thought-prompting"><strong>Chain-of-Thought Prompting</strong></h2>
<p>At the heart of this paper is a simple observation about how humans solve difficult problems. When faced with a multi-step reasoning task, we rarely jump directly to the answer. Instead, we break the problem into smaller pieces, solve each intermediate step, and gradually work toward a conclusion. The authors argued that large language models could benefit from a similar process.</p>
<p>This idea led to Chain-of-Thought (CoT) Prompting, where examples in the prompt included not only the question and answer, but also the reasoning steps connecting them. By seeing a few demonstrations of this reasoning process, sufficiently large language models learned to generate their own chains of thought before producing a final answer.</p>
<p>The significance of this approach extends beyond improving accuracy. First, it allows complex problems to be decomposed into manageable intermediate steps, making multi-step reasoning easier to perform.</p>
<p>Second, the generated reasoning process offers a degree of interpretability, giving researchers and users a glimpse into how the model arrived at its answer. While these reasoning traces don't fully reveal the model's internal computations, they can help identify where mistakes occur.</p>
<p>Another important aspect of chain-of-thought prompting is its generality. The authors proposed it not as a solution for a single benchmark, but as a broad reasoning framework that can be applied to arithmetic problems, common sense reasoning tasks, symbolic manipulation, and potentially many other challenges that require sequential reasoning.</p>
<p>Perhaps most importantly, this capability can be elicited from existing language models through prompting alone, without additional training or architectural modifications.</p>
<p>This section establishes the paper's central claim: reasoning abilities don't necessarily require new model architectures or specialized fine-tuning. In sufficiently large language models, these capabilities can emerge when the model is guided to generate intermediate reasoning steps rather than being asked to produce an answer immediately.</p>
<h2 id="heading-arithmetic-reasoning"><strong>Arithmetic Reasoning</strong></h2>
<p>The authors begin their empirical evaluation with arithmetic reasoning, a domain that had long exposed a weakness of large language models.</p>
<p>Although solving math word problems is relatively straightforward for humans, it often requires a sequence of intermediate calculations and logical deductions.</p>
<p>Previous research had shown that even large language models struggled with these tasks, making arithmetic reasoning an ideal setting for testing whether chain-of-thought prompting could genuinely improve reasoning ability.</p>
<p>To evaluate their approach, the authors selected five established benchmarks covering a variety of math word problems: GSM8K, SVAMP, ASDiv, AQuA, and MAWPS. These datasets differ in style and difficulty, ranging from straightforward arithmetic questions to more complex problems that require multiple reasoning steps before arriving at a solution. Together, they provide a broad picture of how well language models handle mathematical reasoning.</p>
<p>The experiments compare two prompting strategies. The first is standard few-shot prompting, where the model is shown examples consisting only of questions and their corresponding answers. This was the dominant prompting approach at the time and serves as the baseline throughout the paper.</p>
<p>The second is chain-of-thought prompting, where each example is expanded to include the intermediate reasoning steps that connect the question to the final answer.</p>
<p>For the chain-of-thought prompts, the authors manually wrote a small set of eight reasoning demonstrations and reused them across the arithmetic benchmarks, with the exception of AQuA, whose multiple-choice format called for its own exemplars. Importantly, these examples weren't heavily optimized or engineered for specific datasets. Instead, they were intended to test whether a modest number of natural reasoning demonstrations could reliably encourage models to reason through new problems on their own.</p>
<p>The study also evaluates a diverse collection of language models, including GPT-3, LaMDA, PaLM, UL2, and Codex, spanning model sizes from hundreds of millions to hundreds of billions of parameters. This broad range allowed the authors to examine not only whether chain-of-thought prompting works, but also how its effectiveness changes as models become larger.</p>
<p>With this experimental framework in place, the paper investigated a central question: can providing a few examples of step-by-step reasoning enable large language models to solve mathematical problems that standard prompting struggles to handle?</p>
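<p>Comparing the two strategies also requires a way to score free-form completions. The sketch below assumes each completion ends with a sentence of the form "The answer is N." and pulls out the last such number. The function names and the regular expression are assumptions for illustration and are not taken from the paper's evaluation code.</p>

```python
import re


def extract_final_answer(completion: str):
    """Return the number after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is \$?(-?\d[\d,]*(?:\.\d+)?)", completion)
    if not matches:
        return None
    # Drop thousands separators before converting, e.g. "1,234" becomes 1234.0.
    return float(matches[-1].replace(",", ""))


def accuracy(completions: list, gold_answers: list) -> float:
    """Fraction of completions whose extracted answer matches the gold answer."""
    correct = sum(
        extract_final_answer(c) == float(g)
        for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)


# Two hypothetical completions for the same question (gold answer: 9).
completions = [
    "The cafeteria had 23 apples originally. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have "
    "3 + 6 = 9. The answer is 9.",
    "The answer is 27.",
]
print(accuracy(completions, [9, 9]))  # 0.5
```

<p>Scoring only the final number keeps the comparison fair: a chain-of-thought completion gets no credit for plausible reasoning unless the answer itself is right.</p>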
<h2 id="heading-results">Results</h2>
<p>The arithmetic reasoning experiments revealed that the success of chain-of-thought prompting depends heavily on model scale.</p>
<p>One of the clearest patterns across the benchmarks was that smaller models gained little benefit from generating reasoning steps. In some cases, their performance even deteriorated because the models produced explanations that sounded plausible but were logically flawed.</p>
<p>The advantages of chain-of-thought prompting only became apparent once the models reached roughly 100 billion parameters, suggesting that the ability to effectively use intermediate reasoning steps is itself an emergent capability.</p>
<p>Another important observation was that the benefits of chain-of-thought prompting grew as problems became more challenging. On simpler tasks that required only a single reasoning step, standard prompting was already sufficient and the additional reasoning process provided little value.</p>
<p>But as the complexity of the problems increased, the gap between standard prompting and chain-of-thought prompting widened substantially. The GSM8K benchmark provides the strongest example of this trend, where the largest GPT and PaLM models more than doubled their performance when allowed to reason step by step.</p>
<p>Perhaps the most significant result is that chain-of-thought prompting enabled large language models to compete with, and in some cases surpass, specialized systems trained directly for these tasks.</p>
<p>Using only a handful of reasoning demonstrations, PaLM 540B established new state-of-the-art results on several arithmetic benchmarks, despite relying solely on prompting rather than task-specific fine-tuning. This outcome challenged the prevailing assumption that strong performance on reasoning tasks necessarily required dedicated training datasets and specialized models.</p>
<p>To better understand these improvements, the authors manually inspected the reasoning traces generated by the models. When the model arrived at the correct answer, the reasoning process was usually correct as well, indicating that the model was often following a coherent sequence of logical steps rather than guessing the final answer.</p>
<p>Even among incorrect predictions, many reasoning chains were largely accurate and failed only because of small mistakes such as arithmetic slips, incorrect symbol mappings, or a missing intermediate step. More serious failures tended to arise from misunderstanding the problem itself or producing incoherent reasoning.</p>
<p>The error analysis also offered an explanation for why larger models benefited more from chain-of-thought prompting. Comparing PaLM 62B with PaLM 540B showed that increasing scale reduced many of the semantic misunderstandings and incomplete reasoning patterns that appeared in smaller models.</p>
<p>In other words, larger models were not merely generating longer explanations. They were producing reasoning chains that were more logically complete and more faithful to the underlying problem.</p>
<h2 id="heading-ablation-study"><strong>Ablation Study</strong></h2>
<p>Before diving into this section, it's worth briefly explaining what an ablation study is. In machine learning research, an ablation study systematically removes or modifies parts of a method to determine which components are actually responsible for its performance. Rather than asking whether a method works, an ablation study asks why it works.</p>
<p>Having shown that chain-of-thought prompting improves reasoning performance, the authors use ablation experiments to identify which aspects of the prompt are actually responsible for the gains. Higher accuracy alone doesn't settle that question, so each experiment isolates one candidate explanation and tests it separately.</p>
<p>One possible explanation is that chain-of-thought prompting helps because it encourages the model to generate mathematical equations before producing an answer. If this were true, then the natural language reasoning itself might not be necessary.</p>
<p>To test this idea, the authors replaced the reasoning steps with equations alone. The results showed that this approach provides only limited benefits on complex benchmarks such as GSM8K. While equations can help with simpler problems, they are often insufficient for tasks that require understanding the meaning of the question before translating it into mathematical operations. This suggests that the value of chain-of-thought prompting comes from more than symbolic calculation.</p>
<p>The authors then examined another hypothesis: perhaps chain-of-thought prompting succeeds simply because it allows the model to generate more tokens and therefore spend more computation on difficult problems.</p>
<p>To isolate this factor, they prompted the model to output only a sequence of dots before the answer, so that it produced additional tokens without any meaningful reasoning content. Performance remained close to the standard prompting baseline, indicating that extra computation alone doesn't explain the observed improvements. What mattered wasn't the number of intermediate tokens, but the reasoning expressed within them.</p>
<p>A third possibility was that chain-of-thought prompts merely activated relevant knowledge already stored in the model. If that were the case, the reasoning steps wouldn't need to appear before the answer.</p>
<p>The authors tested this by moving the reasoning process to after the final answer. Once again, performance largely fell back to the baseline. This result suggested that the sequence of reasoning steps plays an active role in helping the model arrive at the correct solution rather than simply serving as an explanation after the fact.</p>
<p>Taken together, these experiments strengthen the paper's central argument. The success of chain-of-thought prompting can't be explained by equation generation, additional computation, or easier access to stored knowledge alone.</p>
<p>Instead, the evidence points toward the reasoning process itself as the critical ingredient. The intermediate steps aren't merely decorative explanations. They appear to guide the model through a sequence of decisions that makes complex problem solving more effective.</p>
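<p>The ablations above differ only in what appears between the question and the final answer in each demonstration. As a rough sketch, the variants could be rendered like this. The exemplar, the field names, and the variant labels are hypothetical, and using one filler dot per character of the equation is an assumption about how the extra-token prompt might be built.</p>

```python
# One hypothetical exemplar; field names are assumptions for this sketch.
EXEMPLAR = {
    "question": "Tom has 3 boxes with 4 pens each. He gives away 5 pens. "
                "How many pens are left?",
    "equation": "3 * 4 - 5 = 7",
    "reasoning": "Tom has 3 boxes of 4 pens, so 3 * 4 = 12 pens. "
                 "After giving away 5, 12 - 5 = 7 pens are left.",
    "answer": "7",
}


def render(exemplar: dict, variant: str) -> str:
    """Render one demonstration under a given prompting variant."""
    final = f"The answer is {exemplar['answer']}."
    if variant == "standard":
        body = final
    elif variant == "chain_of_thought":
        body = f"{exemplar['reasoning']} {final}"
    elif variant == "equation_only":
        # Only the equation, without natural language reasoning.
        body = f"{exemplar['equation']}. {final}"
    elif variant == "variable_compute":
        # Filler dots: extra tokens before the answer, but no reasoning content.
        body = f"{'.' * len(exemplar['equation'])} {final}"
    elif variant == "reasoning_after_answer":
        # The same reasoning, moved after the final answer.
        body = f"{final} {exemplar['reasoning']}"
    else:
        raise ValueError(f"unknown variant: {variant}")
    return f"Q: {exemplar['question']}\nA: {body}"


for name in ("standard", "chain_of_thought", "equation_only",
             "variable_compute", "reasoning_after_answer"):
    print(f"--- {name} ---")
    print(render(EXEMPLAR, name))
```

<p>Holding the question and answer fixed while swapping the middle part is what lets each ablation test a single explanation at a time.</p>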
<h2 id="heading-robustness-of-chain-of-thought-prompting"><strong>Robustness of Chain-of-Thought Prompting</strong></h2>
<p>One of the long-standing concerns with prompting methods is their sensitivity to the examples included in the prompt. Small changes in wording, example selection, or even the order of examples can sometimes produce noticeably different results.</p>
<p>Once they established that chain-of-thought prompting improves reasoning performance, the authors investigated whether these gains were robust or whether they depended on a particular set of carefully crafted demonstrations.</p>
<p>To answer this question, other co-authors of the paper independently wrote reasoning traces for the same examples. The researchers also experimented with a more concise writing style and tested prompts built from entirely different sets of examples.</p>
<p>The goal was to determine whether chain-of-thought prompting was succeeding because of a specific wording choice or because the underlying reasoning structure was genuinely useful.</p>
<p>The results provided reassuring evidence that the technique isn't tied to a particular author, writing style, or collection of exemplars. While some variation in performance naturally appeared across different prompts, every version of chain-of-thought prompting consistently outperformed standard prompting by a substantial margin. Whether the reasoning steps were detailed or concise, manually written or drawn from an independent dataset, the overall pattern remained remarkably stable.</p>
<p>The authors further broadened their analysis by varying the order and number of exemplars used in the prompt. Once again, the central finding persisted: although prompt design still influenced performance to some degree, the effectiveness of chain-of-thought prompting didn't depend on a single carefully engineered prompt.</p>
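<p>A robustness check over exemplar order and count can be set up by sampling several prompt variants from the same pool of demonstrations and scoring each one separately. The sketch below only builds the variants. The function name, its parameters, and the placeholder demonstrations are assumptions for illustration.</p>

```python
import random


def prompt_variants(exemplars: list, num_shots: int, num_orders: int, seed: int = 0) -> list:
    """Sample `num_orders` prompts, each with `num_shots` exemplars in random order."""
    rng = random.Random(seed)  # fixed seed so the same variants can be regenerated
    variants = []
    for _ in range(num_orders):
        chosen = rng.sample(exemplars, num_shots)  # random subset, random order
        variants.append("\n\n".join(chosen))
    return variants


# Placeholder demonstrations standing in for real chain-of-thought exemplars.
demo = [f"Q: question {i}\nA: reasoning {i}. The answer is {i}." for i in range(1, 5)]

for prompt in prompt_variants(demo, num_shots=3, num_orders=2):
    print(prompt)
    print("=====")
```

<p>Each variant would then be evaluated on the same test set, and the spread in accuracy across variants indicates how sensitive the method is to prompt construction.</p>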
<p>This robustness analysis strengthens one of the paper's most important claims: the success of chain-of-thought prompting isn't an artifact of a particular phrasing or annotation style. Instead, the benefits appear to arise from exposing the model to a reasoning process itself, suggesting that the method captures a more general principle rather than a prompt-specific trick.</p>
<h2 id="heading-common-sense-reasoning"><strong>Common Sense Reasoning</strong></h2>
<p>Up to this point, the paper focused primarily on mathematical reasoning. While the results are impressive, they leave an important question unanswered: is chain-of-thought prompting useful only for arithmetic problems, or can it improve reasoning more broadly?</p>
<p>To investigate this, the authors turned to common sense reasoning tasks. Unlike math problems, these tasks often require background knowledge about the world, an understanding of human behavior, or the ability to connect multiple pieces of information before arriving at a conclusion. In many cases, the challenge isn't performing calculations but reasoning through situations that humans find intuitive.</p>
<p>The evaluation spanned a diverse collection of benchmarks, including common sense question answering, multi-hop reasoning, date understanding, sports-related reasoning, and even tasks that involved converting natural language instructions into robot actions.</p>
<p>Despite their differences, these tasks share a common requirement: solving them often involves a sequence of intermediate inferences rather than an immediate answer.</p>
<p>The results showed that the benefits of chain-of-thought prompting extend well beyond mathematics. Across most benchmarks, models consistently performed better when encouraged to generate intermediate reasoning steps before producing a final answer.</p>
<p>The improvements became particularly noticeable for larger models, suggesting that the same pattern observed in arithmetic reasoning also applies to common sense reasoning.</p>
<p>Some of the strongest gains appeared on tasks that required multi-step inference. On StrategyQA, for example, chain-of-thought prompting enabled PaLM 540B to surpass the previous state of the art. Similarly, on the Sports Understanding benchmark, the model achieved performance that exceeded that of an unaided human sports enthusiast.</p>
<p>These results suggest that the reasoning process encouraged by chain-of-thought prompting can help models connect facts, evaluate plausibility, and navigate more complex decision-making scenarios.</p>
<p>At the same time, the improvements weren't uniform across every dataset. The gains on CommonsenseQA were relatively modest, indicating that not all reasoning tasks benefit equally from explicit reasoning traces. This serves as an early reminder that chain-of-thought prompting isn't a universal solution, even though it consistently proves valuable across a wide range of settings.</p>
<p>More broadly, this section strengthens the paper's central argument by showing that chain-of-thought prompting isn't merely a technique for solving math word problems. Its effectiveness across diverse common sense tasks suggests that the method taps into a more general reasoning capability that emerges in sufficiently large language models.</p>
<h2 id="heading-symbolic-reasoning"><strong>Symbolic Reasoning</strong></h2>
<p>The final evaluation moves away from mathematics and real-world knowledge altogether. Instead, the authors focus on symbolic reasoning tasks, where success depends on following abstract rules rather than recalling facts or performing calculations. These tasks are simple for humans, yet they provide a useful way to test whether language models can consistently apply a sequence of reasoning steps.</p>
<p>To explore this question, the authors designed two controlled tasks. The first required the model to extract and concatenate the last letters of words in a name. The second asked the model to track the state of a coin after a sequence of flips and non-flips.</p>
<p>Although these tasks may appear simple, they required the model to perform precise symbolic manipulations without relying on memorized knowledge about the world.</p>
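<p>The first task is easy to state in code, which is part of what makes it a clean test. Here is a minimal sketch of the ground-truth rule for last-letter concatenation, together with a helper that writes out a step-by-step rationale. The rationale wording and the function names are assumptions for illustration.</p>

```python
def last_letter_concat(name: str) -> str:
    """Ground truth: join the last letter of each word in the name."""
    return "".join(word[-1] for word in name.split())


def last_letter_rationale(name: str) -> str:
    """Write out a step-by-step rationale, one sentence per word."""
    steps = [f'The last letter of "{w}" is "{w[-1]}".' for w in name.split()]
    answer = last_letter_concat(name)
    return " ".join(steps) + f' Concatenating them is "{answer}". So the answer is {answer}.'


print(last_letter_concat("Lady Gaga"))  # ya
print(last_letter_rationale("Lady Gaga"))
```

<p>A program applies this rule perfectly for any input length. The experiment asks whether a language model, shown only a few written-out examples, can do the same.</p>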
<p>What made these experiments particularly interesting was the introduction of an out-of-distribution setting. During prompting, the model only saw examples involving short reasoning chains. At evaluation time, it was asked to solve versions of the same tasks that required more steps than any example it had previously encountered.</p>
<p>This setup allowed the authors to test not only whether the model could follow a reasoning procedure, but also whether it could extend that procedure to longer and unfamiliar cases.</p>
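<p>The in-domain and out-of-distribution split can be sketched for the coin flip task by generating questions with different numbers of potential flips. The specific lengths used here (two steps in-domain, three and four out-of-distribution), the question wording, and the helper names are assumptions for illustration.</p>

```python
import random


def coin_still_heads_up(flips: list) -> bool:
    """The coin starts heads up and stays heads up after an even number of flips."""
    return sum(flips) % 2 == 0


def make_example(num_people: int, rng: random.Random) -> dict:
    """Build one coin-flip question with `num_people` potential flips."""
    flips = [rng.choice([True, False]) for _ in range(num_people)]
    actions = [
        f"Person {i + 1} {'flips' if flipped else 'does not flip'} the coin."
        for i, flipped in enumerate(flips)
    ]
    question = "A coin is heads up. " + " ".join(actions) + " Is the coin still heads up?"
    return {"question": question, "answer": "yes" if coin_still_heads_up(flips) else "no"}


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# In-domain: same number of steps as the demonstrations in the prompt.
in_domain = [make_example(2, rng) for _ in range(3)]
# Out-of-distribution: more steps than any demonstration.
out_of_domain = [make_example(n, rng) for n in (3, 4) for _ in range(3)]

print(in_domain[0]["question"], in_domain[0]["answer"])
print(out_of_domain[-1]["question"], out_of_domain[-1]["answer"])
```

<p>Because the rule itself never changes, any drop in accuracy on the longer questions reflects difficulty in extending the procedure, not a change in the task.</p>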
<p>The results revealed a familiar pattern. Large models benefited substantially from chain-of-thought prompting, while smaller models struggled even when the required reasoning process was straightforward.</p>
<p>On the in-domain tasks, where the evaluation closely matched the examples provided in the prompt, the largest models achieved near-perfect performance when guided by chain-of-thought reasoning. This indicated that they could successfully learn and apply the underlying procedure demonstrated in the prompt.</p>
<p>The more revealing results came from the out-of-distribution evaluations. Standard prompting largely failed when the reasoning chain became longer than those seen in the examples. In contrast, with chain-of-thought prompting, performance improved as model size increased, demonstrating an ability to extend learned reasoning patterns beyond the exact situations shown during prompting.</p>
<p>Although accuracy declined compared to the in-domain setting, the models were still able to generalize in ways that standard prompting couldn't.</p>
<p>This section provided some of the strongest evidence that chain-of-thought prompting is doing more than improving benchmark performance. By helping models apply reasoning procedures to longer and previously unseen inputs, it suggests that the generated reasoning steps serve as a scaffold for systematic problem solving rather than merely a mechanism for producing better answers on familiar examples.</p>
<h2 id="heading-discussion"><strong>Discussion</strong></h2>
<p>The most important contribution of this paper wasn't a new model architecture, a new training objective, or a larger dataset. Instead, it demonstrated that a simple change in prompting could unlock capabilities that standard prompting often failed to reveal.</p>
<p>Across arithmetic, common sense, and symbolic reasoning tasks, chain-of-thought prompting consistently allowed large language models to solve problems that were previously difficult or inaccessible.</p>
<p>A recurring theme throughout the paper was the relationship between reasoning and scale. The authors repeatedly observed that chain-of-thought prompting became effective only once models reached a sufficient size. Smaller models generated fluent reasoning traces, but those traces were often logically inconsistent.</p>
<p>Larger models, in contrast, were able to use intermediate reasoning steps in a way that genuinely improved problem-solving performance.</p>
<p>This finding reinforced a broader lesson emerging from language model research at the time: some capabilities don't appear gradually, but emerge once a model crosses a certain scale threshold.</p>
<p>Perhaps the most intriguing implication was that standard prompting may significantly underestimate what large language models are capable of doing.</p>
<p>Before this work, many reasoning tasks appeared to have reached a performance ceiling. Chain-of-thought prompting revealed that the limitation wasn't always the model itself, but sometimes the way the model was being asked to solve the problem. In that sense, the paper shifted attention from building more capable models to discovering better ways of interacting with the capabilities that already exist within them.</p>
<p>At the same time, the authors were careful not to overstate their conclusions. Although chain-of-thought outputs can resemble human reasoning, the paper doesn't prove that language models reason in the same way humans do. The generated reasoning traces may reflect genuine problem-solving processes, post-hoc rationalizations, or something in between. Determining the relationship between generated reasoning and internal model computation remains an open research question.</p>
<p>The authors also acknowledged several practical limitations. Constructing high-quality reasoning demonstrations can require additional effort, particularly if the approach is extended beyond few-shot prompting.</p>
<p>Also, generating a chain of thought doesn't guarantee that the reasoning itself is correct. Models can still produce convincing but flawed reasoning paths, leading to incorrect answers.</p>
<p>Finally, the strongest benefits appear only in very large models, raising questions about computational cost and whether similar reasoning abilities can be induced in smaller systems.</p>
<p>Viewed from a historical perspective, this paper marked a turning point in research on language model reasoning. Rather than treating reasoning as something that must be explicitly trained into a model, it suggested that reasoning abilities could be elicited through the right prompting strategy.</p>
<p>Many influential ideas that followed, including self-consistency, reasoning supervision, process supervision, and the reasoning-focused models that emerged in later years, can trace part of their intellectual foundation back to the simple insight introduced here: sometimes a model performs better when it's encouraged to show its work.</p>
<h2 id="heading-related-work"><strong>Related Work</strong></h2>
<p>The ideas behind Chain-of-Thought prompting didn't emerge in isolation. Instead, the paper sits at the intersection of two research directions that had been evolving independently for several years.</p>
<p>The first direction focused on helping models solve complex problems through intermediate reasoning steps. Earlier work had already shown that tasks such as mathematical reasoning become easier when a model generates natural language rationales rather than producing an answer directly. Researchers explored methods that trained models to generate explanations, reasoning traces, or intermediate computations before arriving at a final solution.</p>
<p>Other approaches relied on formal symbolic representations, translating problems into structured equations or logical forms. Despite their differences, these efforts shared a common intuition: difficult reasoning tasks are often easier to solve when they're decomposed into smaller steps.</p>
<p>Chain-of-thought prompting inherits this intuition but introduces an important shift. Earlier methods typically required dedicated training procedures, specialized datasets, or task-specific fine-tuning.</p>
<p>In contrast, this paper demonstrated that reasoning traces could be elicited through prompting alone. Rather than teaching a model to reason through additional training, the authors showed that providing a handful of reasoning examples may be enough to unlock capabilities that already exist within sufficiently large language models.</p>
<p>The second research direction concerns prompting itself. Following the success of GPT-3 and few-shot learning, a growing body of work explored how prompts could be used to improve model performance without retraining.</p>
<p>Researchers experimented with prompt engineering, prompt tuning, and natural language instructions to better communicate tasks to language models. Most of these techniques focused on improving the input side of the interaction by changing how a task was described to the model.</p>
<p>Chain-of-thought prompting takes a different approach. Instead of modifying the instructions that precede a task, it augments the examples that follow them by exposing the reasoning process that connects inputs and outputs. This distinction may seem subtle, but it represents one of the paper's key insights: the contribution goes beyond a better prompt template. It lies in the realization that demonstrating how to reason can be just as important as describing what task should be solved.</p>
<p>Viewed in this broader context, the paper acts as a bridge between research on reasoning traces and research on prompting. It combines the strengths of both traditions and, in doing so, lays the foundation for many later advances in language model reasoning, including self-consistency, STaR, process supervision, and the reasoning-oriented systems that followed in subsequent years.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Chain-of-Thought Prompting introduced a simple idea that changed how researchers think about reasoning in large language models. Rather than modifying model architectures or relying on additional training, the authors showed that reasoning abilities could often be unlocked by encouraging models to generate intermediate reasoning steps before producing an answer.</p>
<p>Across arithmetic, commonsense, and symbolic reasoning tasks, the results demonstrated that large language models become significantly more capable when allowed to work through a problem step by step. More importantly, the paper revealed that many of these improvements emerge at larger scales, suggesting that reasoning isn't simply a product of prompting but a capability that becomes increasingly accessible as models grow more powerful.</p>
<p>What made this work particularly influential wasn't the complexity of the method, but the insight behind it. A model may possess the knowledge required to solve a problem, yet still fail to use that knowledge effectively when asked for an immediate answer. By exposing the reasoning process, Chain-of-Thought prompting showed that how a model arrives at an answer can be just as important as the answer itself.</p>
<p>This idea helped shift the focus of AI research beyond what language models know toward how they reason, plan, and solve problems. Many of the techniques that followed (including Self-Consistency, process supervision, verification-based methods, and modern reasoning-focused systems) build upon the foundation established by this paper.</p>
<p>Viewed in retrospect, Chain-of-Thought Prompting was more than a prompting technique. It marked a turning point in the study of language model reasoning, demonstrating that some capabilities aren't absent from a model but simply require the right conditions to emerge.</p>
<p>The infographic below highlights some of the most influential papers and milestones that shaped modern AI, from the introduction of GPT-1 and the scaling era of GPT-2 and GPT-3, to instruction tuning, Chain-of-Thought reasoning, Self-Consistency, process supervision, and the latest generation of reasoning-focused models. Together, these works reveal how the field evolved from teaching models to predict language toward helping them reason, verify, and solve increasingly complex problems.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6d03f50e-e3d7-4370-94b7-6f5a9a5cd201.png" alt="The GPT Journey Key Papers That Shaped Modern AI" style="display:block;margin:0 auto" width="2320" height="1480" loading="lazy">

<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a></p>
</li>
<li><p><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2001.08361">Scaling Laws for Neural Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2203.02155">Training Language Models to Follow Instructions with Human Feedback (InstructGPT)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2109.01652">Finetuned Language Models are Zero-Shot Learners (FLAN)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2201.08239">LaMDA: Language Models for Dialog Applications</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2204.02311">PaLM: Scaling Language Modeling with Pathways</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1705.04146">Program Induction by Rationale Generation: Learning to Solve and Explain Algebra Word Problems</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2110.14168">Training Verifiers to Solve Math Word Problems</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2112.00114">Show Your Work: Scratchpads for Intermediate Computation with Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2203.11171">Self-Consistency Improves Chain of Thought Reasoning in Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2203.14465">STaR: Bootstrapping Reasoning with Reasoning</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2206.07682">Emergent Abilities of Large Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2303.12712">Sparks of Artificial General Intelligence: Early Experiments with GPT-4</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2305.20050">Let's Verify Step by Step</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2503.19470">Learning to Reason with LLMs</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/2303.08774">GPT-4 Technical Report</a></p>
</li>
</ul>
<p><strong>Contact Me</strong></p>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Training Language Models to Follow Instructions with Human Feedback (InstructGPT) ]]>
                </title>
                <description>
                    <![CDATA[ GPT-3 was a major breakthrough in natural language processing. With 175 billion parameters, it demonstrated remarkable few-shot learning abilities and showed that scaling large language models could u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-training-language-models-to-follow-instructions-with-human-feedback-instructgpt/</link>
                <guid isPermaLink="false">6a206bf72a223bf98b13dcfc</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ large language models ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ chatgpt ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 18:01:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/494c3fa7-d7a0-448b-9983-99575f91836d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>GPT-3 was a major breakthrough in natural language processing. With 175 billion parameters, it demonstrated remarkable few-shot learning abilities and showed that scaling large language models could unlock a wide range of capabilities.</p>
<p>Yet despite its impressive performance, GPT-3 revealed an important limitation: raw capability doesn't automatically create a useful assistant.</p>
<p>A language model can generate fluent text, answer questions, and solve complex tasks while still failing to follow what the user actually wants.</p>
<p>GPT-3 could produce responses that were inconsistent, overly confident, difficult to control, or misaligned with user instructions. It was a powerful prediction engine, but it wasn't designed to reliably act as a helpful assistant.</p>
<p>This challenge motivated one of the most influential papers in modern AI: <em>Training Language Models to Follow Instructions with Human Feedback</em>. Rather than making the model larger, the researchers focused on teaching it how to better follow human intent.</p>
<p>The result was InstructGPT, a system fine-tuned from GPT-3 that demonstrated how human feedback could transform a capable language model into a far more useful and aligned assistant.</p>
<p>The underlying problem has a name, and it has become one of the most important in modern AI: alignment.</p>
<p>Researchers realized that building larger models was only part of the solution. While scaling improved capabilities, it didn't guarantee that models would reliably follow instructions or behave in ways that matched user expectations. The next stage of progress required teaching models how to respond in a more helpful, truthful, and safe manner.</p>
<p>This led to the development of instruction-following systems and Reinforcement Learning from Human Feedback (RLHF). Instead of optimizing models solely to predict the next word, researchers began training them to better align with human preferences and intentions.</p>
<p>This shift marked a major turning point in the evolution of large language models.</p>
<p>GPT-3 demonstrated the power of large-scale language modeling and introduced many people to prompting and few-shot learning.</p>
<p>InstructGPT built on that foundation by showing how human feedback could significantly improve instruction following and model behavior. ChatGPT then brought these ideas to a much broader audience by packaging aligned language models into an accessible conversational interface used by millions of people.</p>
<p>In many ways, language models became capable before they became aligned.</p>
<p>That's why the transition from GPT-3 to InstructGPT represents one of the most important milestones in the history of artificial intelligence. The focus was no longer only on making models more capable. It was also about making them more useful, reliable, and responsive to human intent.</p>
<p>InstructGPT's success established many of the alignment techniques that later became a core part of systems such as ChatGPT and GPT-4.</p>
<h2 id="heading-paper-overview"><strong>Paper Overview:</strong></h2>
<p>In this article, we’ll mainly focus on the paper <a href="https://arxiv.org/pdf/2203.02155"><strong>Training Language Models to Follow Instructions with Human Feedback</strong></a>, published by OpenAI in 2022.</p>
<p>This paper introduced <strong>InstructGPT</strong>, one of the most important transitions in the history of large language models. While earlier GPT systems focused heavily on scaling model size and improving raw capabilities, this work shifted attention toward something equally important: <strong>alignment</strong>.</p>
<p>The paper explores how language models can be trained to better follow human instructions using reinforcement learning from human feedback (RLHF). Instead of optimizing only for next-token prediction, the model is further optimized to produce responses that humans actually prefer – responses that are more helpful, safer, and more aligned with user intent.</p>
<p>What makes this paper historically important is that it became the foundation for the modern ChatGPT alignment pipeline.</p>
<p>Many of the interaction patterns people now associate with ChatGPT (like instruction following, conversational behavior, refusal handling, and safer responses) can be traced directly back to the ideas introduced here.</p>
<p>Here’s the original paper again if you want to explore it directly: <a href="https://arxiv.org/pdf/2203.02155">Training language models to follow instructions with human feedback</a></p>
<p>And here’s a quick infographic of what we’ll cover throughout this review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6986f1fe-7ee5-4bc6-b144-44aad5d2bb3e.png" alt="AI Papers Quick Insights- InstructGPT" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h2 id="heading-table-of-contents"><strong>Table of Contents:</strong></h2>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-the-core-problem">The Core Problem</a></p>
</li>
<li><p><a href="#heading-why-gpt-3-was-not-enough">Why GPT-3 Was Not Enough</a></p>
</li>
<li><p><a href="#heading-instructgpt-the-birth-of-alignment-centered-llms">InstructGPT: The Birth of Alignment-Centered LLMs</a></p>
</li>
<li><p><a href="#heading-rlhf-pipeline-how-instructgpt-learned-to-behave-like-an-assistant">RLHF Pipeline: How InstructGPT Learned to Behave Like an Assistant</a></p>
<ul>
<li><p><a href="#heading-stage-1-supervised-fine-tuning-sft">Stage 1 — Supervised Fine-Tuning (SFT)</a></p>
</li>
<li><p><a href="#heading-stage-2-reward-model-training">Stage 2 — Reward Model Training</a></p>
</li>
<li><p><a href="#heading-stage-3-ppo-reinforcement-learning">Stage 3 — PPO Reinforcement Learning</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-helpful-honest-harmless">Helpful, Honest, Harmless</a></p>
</li>
<li><p><a href="#heading-human-feedback-as-the-new-scaling-factor">Human Feedback as the New Scaling Factor</a></p>
</li>
<li><p><a href="#heading-why-chatgpt-exploded-globally">Why ChatGPT Exploded Globally</a></p>
</li>
<li><p><a href="#heading-chatgpt-as-an-interface-revolution">ChatGPT as an Interface Revolution</a></p>
</li>
<li><p><a href="#heading-benchmarks-and-results">Benchmarks and Results</a></p>
</li>
<li><p><a href="#heading-truthfulness-and-hallucinations">Truthfulness and Hallucinations</a></p>
</li>
<li><p><a href="#heading-safety-and-refusal-behavior">Safety and Refusal Behavior</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-historical-importance">Historical Importance</a></p>
</li>
<li><p><a href="#heading-discussion-the-real-shift">Discussion: The Real Shift</a></p>
</li>
<li><p><a href="#heading-connection-to-gpt-4">Connection to GPT-4</a></p>
</li>
<li><p><a href="#heading-gpt-3-vs-instructgpt-vs-chatgpt-vs-gpt-4-key-differences">GPT-3 vs InstructGPT vs ChatGPT vs GPT-4: Key Differences</a></p>
</li>
<li><p><a href="#heading-from-gpt-1-to-gpt-4-a-timeline-of-modern-ai-systems-and-alignment-evolution">From GPT-1 to GPT-4: A Timeline of Modern AI Systems and Alignment Evolution</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to already be familiar with a few foundational ideas.</p>
<p>Reading the previous reviews in this series will be especially helpful:</p>
<ul>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/">AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/">AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-few-shot-learners-gpt-3/">AI Paper Review: Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
</ul>
<p>Even though GPT-4 was released after InstructGPT, reading the GPT-4 review can still be helpful. It provides a broader view of how alignment techniques evolved and how they were combined with stronger reasoning and multimodal capabilities in later generations of GPT models.</p>
<ul>
<li><a href="https://www.freecodecamp.org/news/ai-paper-review-gpt-4-technical-report/">AI Paper Review: GPT-4 Technical Report (GPT-4)</a></li>
</ul>
<p>It also helps to have:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and large language models</p>
</li>
<li><p>A high-level idea of Transformer-based autoregressive models</p>
</li>
<li><p>Familiarity with prompting, few-shot learning, and in-context learning</p>
</li>
<li><p>A basic understanding of reinforcement learning and human feedback systems</p>
</li>
<li><p>General machine learning concepts like training data, fine-tuning, scaling, and inference</p>
</li>
<li><p>Some familiarity with alignment, safety, and AI behavior control concepts</p>
</li>
</ul>
<p>You don't need to be an AI researcher to follow this article, though.</p>
<p>I’ll keep the explanations practical and intuitive, focusing more on understanding how InstructGPT changed modern AI systems rather than getting lost in dense mathematical details or academic terminology.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>The paper <em>Training Language Models to Follow Instructions with Human Feedback</em> marks one of the biggest turning points in the history of modern AI systems. Instead of asking only how to make language models larger or smarter, OpenAI focused on a different question: how do we make these models actually helpful for real people?</p>
<p>The paper introduces <strong>InstructGPT</strong>, a version of GPT-3 fine-tuned to follow human instructions more accurately using a method called <strong>Reinforcement Learning from Human Feedback (RLHF)</strong>.</p>
<p>The core insight of the paper is simple but extremely important:</p>
<p>Bigger language models don't automatically become better assistants.</p>
<p>Even highly capable models like GPT-3 could still:</p>
<ul>
<li><p>ignore instructions</p>
</li>
<li><p>hallucinate facts</p>
</li>
<li><p>generate toxic or biased outputs</p>
</li>
<li><p>produce responses that were technically fluent but not actually useful to users</p>
</li>
</ul>
<p>To solve this problem, OpenAI built a multi-stage alignment pipeline: humans first demonstrate ideal answers, humans then rank model outputs, and finally the model learns from those preferences using reinforcement learning.</p>
<p>This changed the direction of modern AI development.</p>
<p>The paper shows that alignment and usability can matter more than raw model size itself. One of the most surprising findings was that human evaluators preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the original 175B-parameter GPT-3 model, even though the smaller model has over 100 times fewer parameters.</p>
<p>The paper also demonstrates improvements in instruction following, truthfulness, toxicity reduction, conversational behavior, and general user preference.</p>
<p>Historically, this paper became the foundation behind modern conversational AI systems.</p>
<p>GPT-3 proved that language models could learn from prompts.</p>
<p>GPT-4 later proved that scaling and multimodal reasoning could unlock even stronger capabilities.</p>
<p>But InstructGPT showed something equally important: AI systems must be aligned with human intent to become truly usable products.</p>
<p>In many ways, this paper represents the transition from raw language modeling to aligned assistants, from capability scaling to behavior shaping, and from research demos to real-world conversational AI systems.</p>
<p>And that transition led to ChatGPT.</p>
<h2 id="heading-the-core-problem">The Core Problem</h2>
<p>One of the most important ideas in this paper is that raw language modeling is not the same thing as building a useful assistant.</p>
<p>Before InstructGPT, models like GPT-3 were trained mainly with a simple objective: predict the next token in a sequence.</p>
<p>That objective made language models extremely powerful at generating fluent text, but it also created a major limitation. The model learned how to continue internet text, not necessarily how to help humans.</p>
<p>This became one of the defining realizations behind modern AI alignment research.</p>
<p>Despite its impressive capabilities, GPT-3 often struggled to behave like a reliable assistant. The model could produce fluent text, but it was not explicitly trained to follow user intent.</p>
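<p>To make the pretraining objective concrete, here's a minimal sketch of next-token prediction loss. The toy "model" is just a hypothetical lookup table of made-up probabilities that conditions only on the previous word (a real GPT conditions on the whole context), and <code>next_token_loss</code> is an illustrative helper, not anything from the paper's code.</p>

```python
import math

# Hypothetical toy model: maps the previous word to next-word probabilities.
# All numbers are invented for illustration.
toy_model = {
    "Explain": {"the": 0.6, "gravity": 0.4},
    "the": {"moon": 0.5, "landing": 0.5},
    "moon": {"landing": 0.8, "to": 0.2},
}

def next_token_loss(tokens):
    """Average negative log-likelihood of each token given the previous one."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total += -math.log(toy_model[prev][nxt])
    return total / (len(tokens) - 1)

loss = next_token_loss(["Explain", "the", "moon", "landing"])
print(round(loss, 4))
```

<p>Notice what the loss measures: how likely each continuation is. Nothing in it checks whether the output follows an instruction, is true, or is safe, which is exactly the gap this section describes.</p>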
<p>Here are some examples that highlight the differences between GPT-3 and InstructGPT in how they respond to user prompts:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/22cfce35-8c0e-4560-9419-15c6e33123ce.png" alt="Comparison of GPT-3 and InstructGPT responses to the same prompts. GPT-3 often continues generating similar prompts instead of completing the requested task, while InstructGPT follows the instruction directly and produces the requested answer, demonstrating stronger instruction-following behavior." style="display:block;margin:0 auto" width="1764" height="678" loading="lazy">

<p>Source: <a href="https://openai.com/index/instruction-following/"><strong>Aligning language models to follow instructions</strong></a></p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/cd366a10-f872-4468-bff3-64d05d0597d6.png" alt="A second comparison of GPT-3 and InstructGPT responses to the same prompt" style="display:block;margin:0 auto" width="1753" height="794" loading="lazy">

<p>Source: <a href="https://openai.com/index/instruction-following/"><strong>Aligning language models to follow instructions</strong></a></p>
<p>These examples reveal the central weakness of early GPT systems. GPT-3 often continued the pattern of the prompt rather than completing the requested task. InstructGPT, by contrast, responded directly to the user's instruction. The difference wasn't a matter of raw intelligence. It was a difference in training objectives.</p>
<p>GPT models were trained on massive internet-scale datasets where the goal was simply to predict what text comes next. As a result, the model optimized for plausibility, continuation, and pattern completion, not necessarily for truthfulness, safety, helpfulness, or alignment with human goals.</p>
<p>This created a major gap between language capability and useful assistant behavior.</p>
<p>For example, if a user asked a harmful, misleading, or nonsensical question, the model might still attempt to continue the pattern naturally instead of recognizing the issue. In many cases, the model behaved more like an internet text simulator than a reliable assistant.</p>
<p>The paper repeatedly emphasizes that scaling alone couldn't solve this problem, and researchers increasingly recognized that better behavior would require something more.</p>
<p>Models also needed stronger instruction following, better alignment with human intent, improved safety behavior, greater truthfulness, and optimization around real user needs.</p>
<h2 id="heading-why-gpt-3-was-not-enough">Why GPT-3 Was Not Enough</h2>
<p>When GPT-3 was released, it felt like a massive leap forward in AI capabilities.</p>
<p>The model could perform few-shot learning, answer questions, summarize text, generate code, translate languages, and even solve certain reasoning tasks, all without traditional fine-tuning. For many researchers, it was the first time a language model started to feel genuinely general-purpose.</p>
<p>Yet using GPT-3 in practice was often less reliable than its benchmark performance suggested. Getting good results usually required careful prompt engineering: small wording changes could completely change the quality of the response. Sometimes the model followed instructions well, and other times it ignored them entirely.</p>
<p>Users often found themselves rewriting prompts repeatedly to obtain the response they actually wanted.</p>
<p>This became the core motivation behind InstructGPT.</p>
<p>OpenAI responded by exploring ways to make model behavior more consistent, predictable, and useful for users.</p>
<h2 id="heading-instructgpt-the-birth-of-alignment-centered-llms">InstructGPT: The Birth of Alignment-Centered LLMs</h2>
<p>The release of InstructGPT marked one of the biggest shifts in the history of large language models.</p>
<p>Before InstructGPT, most advances in language models came from scaling data, compute, and model size.</p>
<p>With InstructGPT, the focus shifted toward alignment: building systems that could follow instructions more reliably and behave in ways users actually preferred.</p>
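<p>This is where InstructGPT brought one of the most important ideas in modern AI systems to large language models at scale: Reinforcement Learning from Human Feedback (RLHF). The technique itself predates the paper, but InstructGPT showed how well it works on a GPT-3-sized model.</p>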
<p>Instead of optimizing models only to predict internet text, OpenAI started optimizing models based on what humans actually preferred. Human labelers ranked model outputs, and those preferences became part of the training process itself.</p>
<p>This fundamentally changed the objective of language models.</p>
<p>Rather than optimizing solely for next-token prediction, the system was increasingly optimized to produce responses that humans judged to be helpful, safe, and aligned with their intentions.</p>
<p>That distinction may sound subtle, but it completely changed the direction of AI development.</p>
<p>InstructGPT combined instruction-following training with human preference optimization, creating a model whose behavior could be shaped directly through feedback rather than solely through pretraining.</p>
<p>The model was no longer trained only to imitate the internet. It was trained to behave more like an assistant.</p>
<h2 id="heading-rlhf-pipeline-how-instructgpt-learned-to-behave-like-an-assistant">RLHF Pipeline: How InstructGPT Learned to Behave Like an Assistant</h2>
<p>At the center of the InstructGPT paper is a training pipeline that completely changed how modern AI assistants are built.</p>
<p>RLHF builds on traditional language-model pretraining rather than replacing it.</p>
<p>The question behind the InstructGPT paper is simple: instead of training models only on internet text, why not also train them on human preferences directly?</p>
<p>This led to the development of the RLHF pipeline: Reinforcement Learning from Human Feedback. This approach would later become a standard component of modern conversational AI systems.</p>
<p>The paper’s Figure 2 is especially important because it visualizes the entire alignment pipeline introduced by OpenAI. Rather than relying on a single training stage, the system uses multiple stages where human feedback gradually shapes model behavior.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/d1ccebd1-00b4-48ea-8bc7-e3953bc88fc6.png" alt="RLHF Training Pipeline for InstructGPT" style="display:block;margin:0 auto" width="1212" height="808" loading="lazy">

<p><strong>Source:</strong> <em>Training Language Models to Follow Instructions with Human Feedback</em> (OpenAI, 2022).</p>
<p>As you can see in the image above, the process happens in three major stages.</p>
<h3 id="heading-stage-1-supervised-fine-tuning-sft">Stage 1 — Supervised Fine-Tuning (SFT)</h3>
<p>The first stage starts with human-written demonstrations.</p>
<p>Labelers are given prompts and asked to write ideal responses – the kinds of answers a helpful assistant should produce. These examples become the initial training dataset for the model.</p>
<p>At this stage, the model learns the basic patterns of assistant-style responses.</p>
<p>This is still traditional supervised learning, but the goal is different from standard language modeling. Instead of learning only from web text, the model now learns from examples of preferred assistant behavior.</p>
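<p>Here's a minimal sketch of how a single demonstration could become a training example. It assumes a common convention in which the loss is averaged only over the demonstration tokens and the prompt tokens are masked out. The paper doesn't spell out this detail, so treat the masking, the helper names, and the per-token loss values as illustrative assumptions.</p>

```python
def build_sft_example(prompt_tokens, demo_tokens):
    """Concatenate prompt and demonstration and mark which positions are trained on."""
    tokens = prompt_tokens + demo_tokens
    # 0 = prompt position (ignored by the loss), 1 = demonstration position.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(demo_tokens)
    return tokens, loss_mask

def masked_mean(per_token_losses, loss_mask):
    """Average the per-token losses over demonstration positions only."""
    kept = [loss * keep for loss, keep in zip(per_token_losses, loss_mask)]
    return sum(kept) / sum(loss_mask)

prompt = ["Explain", "gravity", "to", "a", "child", ":"]
demo = ["Gravity", "pulls", "things", "down", "."]
tokens, mask = build_sft_example(prompt, demo)

# Hypothetical per-token losses, one per position in `tokens`.
fake_losses = [2.0] * len(prompt) + [0.5] * len(demo)
sft_loss = masked_mean(fake_losses, mask)
print(sft_loss)
```

<p>The objective is the same next-token loss used in pretraining. What changes is the data: the targets are now labeler-written answers rather than arbitrary web text.</p>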
<p>This stage creates what the paper calls the Supervised Fine-Tuned model (SFT model).</p>
<p>And while this already improves behavior significantly, OpenAI realized something important: human preferences are more complex than simple “correct answers.”</p>
<p>There are often many possible responses to a prompt, but humans may strongly prefer some answers over others.</p>
<p>That leads to the next stage.</p>
<h3 id="heading-stage-2-reward-model-training">Stage 2 — Reward Model Training</h3>
<p>In the second stage, humans no longer write responses directly.</p>
<p>Instead, the model generates multiple answers for the same prompt, and human labelers rank them from best to worst.</p>
<p>For a given prompt, one response may be clearer, another more accurate, and another safer or more appropriate. Human labelers rank these alternatives according to their preferences.</p>
<p>The rankings are then used to train a separate neural network called the Reward Model (RM).</p>
<p>This model learns something extremely important: which outputs humans prefer.</p>
<p>In other words, the system converts human preferences into a trainable reward signal.</p>
<p>This is one of the biggest conceptual breakthroughs in the paper. Instead of manually programming behavior rules, OpenAI trains a model to approximate human judgment itself: given a prompt and a response, the reward model outputs a single scalar score that predicts how highly a labeler would rate that response.</p>
<p>That reward signal becomes the foundation for the final training stage.</p>
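<p>The step from rankings to a reward signal can be sketched in a few lines. The paper's reward-model loss is pairwise: for every pair of responses to the same prompt, the model is penalized when it scores the less-preferred response higher than the preferred one. The scores below are hypothetical reward-model outputs, and <code>pairwise_ranking_loss</code> is an illustrative name rather than code from the paper.</p>

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all ranked pairs."""
    losses = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(scores_best_to_worst, 2):
        margin = better - worse
        # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
        losses.append(math.log(1.0 + math.exp(-margin)))
    return sum(losses) / len(losses)

# Hypothetical reward-model scores for three responses,
# listed in the labeler's order (best first).
agrees = pairwise_ranking_loss([2.0, 0.5, -1.0])
disagrees = pairwise_ranking_loss([-1.0, 0.5, 2.0])
print(agrees, disagrees)
```

<p>When the scores agree with the labeler's ranking, the loss is small. When they contradict it, the loss is much larger, and training pushes the scores back toward the human ordering.</p>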
<h3 id="heading-stage-3-ppo-reinforcement-learning">Stage 3 — PPO Reinforcement Learning</h3>
<p>The final stage uses reinforcement learning to optimize the language model against the reward model.</p>
<p>More specifically, the paper uses PPO (Proximal Policy Optimization), a reinforcement learning algorithm commonly used in policy optimization tasks.</p>
<p>At this stage, the model generates responses, receives scores from the reward model, and gradually updates its behavior to favor the responses that score higher.</p>
<p>The key innovation is that optimization now occurs against a learned representation of human preferences rather than only a language-modeling objective.</p>
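<p>One detail worth sketching is the reward the policy actually optimizes. In the paper, the reward-model score is combined with a KL penalty that discourages the policy from drifting too far from the SFT model. The snippet below shows only that shaped reward, not the PPO update itself, and it omits the paper's additional pretraining term. The <code>beta</code> value and the log-probabilities are hypothetical numbers chosen for illustration.</p>

```python
def shaped_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logprob_policy / logprob_sft: log-probability of the same response under
    the policy being trained and under the frozen SFT model.
    beta: penalty strength (hypothetical value for illustration).
    """
    kl_term = logprob_policy - logprob_sft
    return rm_score - beta * kl_term

# Both responses get the same reward-model score, but the second is one the
# policy has become far more confident in than the SFT model ever was.
close_to_sft = shaped_reward(rm_score=1.0, logprob_policy=-12.0, logprob_sft=-12.5)
far_from_sft = shaped_reward(rm_score=1.0, logprob_policy=-4.0, logprob_sft=-30.0)
print(close_to_sft, far_from_sft)
```

<p>With these numbers, the response that stays close to the SFT model keeps almost all of its reward (roughly 0.99), while the one that drifted far keeps only about 0.48. The penalty makes it harder for the policy to exploit quirks of the reward model by producing text the original model would never have written.</p>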
<p>According to the paper, this RLHF pipeline significantly improved instruction following and user preference ratings while also reducing toxic and unsafe behavior.</p>
<p>And in many ways, this pipeline became the blueprint for the modern era of conversational AI systems.</p>
<h2 id="heading-helpful-honest-harmless">Helpful, Honest, Harmless</h2>
<p>The authors argue that evaluating language models requires more than measuring capability alone. They should also be evaluated by how they behave around humans.</p>
<p>At the time, this represented a significant shift in how researchers evaluated language models.</p>
<p>That is why the paper repeatedly emphasizes a new alignment philosophy centered around three goals:</p>
<ul>
<li><p>Helpful</p>
</li>
<li><p>Honest</p>
</li>
<li><p>Harmless</p>
</li>
</ul>
<p>These ideas became the conceptual foundation behind modern alignment research and conversational AI systems.</p>
<h3 id="heading-helpful">Helpful</h3>
<p>The first goal is straightforward: the model should genuinely help the user accomplish what they want.</p>
<p>In practice, helpfulness means following instructions clearly, answering questions directly, providing relevant information, and adapting to the user's intent.</p>
<p>This may seem simple, but it fundamentally changes the training objective.</p>
<p>The model is no longer optimized only for linguistic fluency. It's optimized for usefulness.</p>
<h3 id="heading-honest">Honest</h3>
<p>The second goal is honesty.</p>
<p>One of the biggest problems with large language models is that they often produce convincing answers even when those answers are wrong. The models can hallucinate facts, invent references, or respond confidently despite uncertainty.</p>
<p>The paper recognizes that a useful assistant shouldn't merely sound intelligent. It should also behave truthfully and acknowledge uncertainty when necessary.</p>
<p>This is especially important because language models are optimized to generate plausible text, not verified truth.</p>
<p>As a result, earlier models sometimes prioritized sounding coherent over being accurate.</p>
<p>The alignment process introduced in InstructGPT attempts to reduce this behavior through human feedback and preference optimization. Human evaluators consistently prefer responses that are more accurate, transparent, and reliable, and those preferences gradually shape the model during RLHF training.</p>
<p>The paper doesn't claim that hallucinations disappear completely. Far from it. But it marks one of the first large-scale attempts to explicitly optimize language models for truthfulness and reliability rather than pure text generation quality.</p>
<h3 id="heading-harmless">Harmless</h3>
<p>The third goal is harmlessness.</p>
<p>Large language models trained on internet data inevitably absorb toxic, biased, unsafe, or harmful patterns from that data. Without alignment, models may generate dangerous instructions, offensive content, or manipulative behavior.</p>
<p>The paper directly addresses this concern and treats safety as a central part of model development.</p>
<p>Through RLHF and human preference ranking, the model learns to refuse certain harmful requests, avoid toxic generations, produce safer responses, and behave more responsibly during interaction.</p>
<p>This became one of the defining characteristics of modern conversational AI systems.</p>
<p>Instead of maximizing unrestricted generation, the system begins balancing usefulness, safety, and alignment with human values.</p>
<p>But the paper is also honest about limitations.</p>
<p>The authors acknowledge that harmful outputs, biases, and unsafe behavior can still appear. Alignment is imperfect, and human values themselves are complex and difficult to define universally.</p>
<p>Still, this paper marks a point at which safety and alignment were treated as core engineering goals rather than secondary concerns.</p>
<p>Taken together, these three principles (helpful, honest, and harmless) became much more than training objectives. They became the philosophical foundation behind ChatGPT-era AI systems.</p>
<p>Earlier GPT papers mainly explored how to scale intelligence. But InstructGPT explored something deeper: how to make intelligence usable for humans.</p>
<h2 id="heading-human-feedback-as-the-new-scaling-factor">Human Feedback as the New Scaling Factor</h2>
<p>One of the most fascinating ideas behind the InstructGPT paper is that it quietly changed what “scaling” meant in modern AI.</p>
<p>For years, progress in language models was largely measured through scaling.</p>
<p>GPT-1 showed that pretraining works. GPT-2 showed that larger models develop stronger zero-shot behavior. GPT-3 pushed this idea even further by scaling to 175 billion parameters and demonstrating impressive few-shot learning abilities.</p>
<p>And to some extent, the bet on scale paid off. Larger models became better at reasoning, code generation, language understanding, translation, and generalization.</p>
<p>But scale alone didn't make models reliably do what users asked. That is where human feedback became central.</p>
<p>Instead of relying purely on internet-scale text, OpenAI introduced a training pipeline where human preferences directly shaped model behavior. Human labelers ranked responses, evaluated quality, and guided the system toward outputs people actually preferred.</p>
<p>In many ways, this created a completely new scaling dimension for AI systems:</p>
<ul>
<li><p>scaling human feedback</p>
</li>
<li><p>scaling preference learning</p>
</li>
<li><p>scaling alignment pipelines</p>
</li>
</ul>
<p>Historically, this shifted attention from model scale alone toward the quality of model behavior.</p>
<p>InstructGPT focused on scaling usability. And the results were surprisingly powerful.</p>
<p>According to the paper, a much smaller aligned model was often preferred over the original 175B GPT-3 model by human evaluators.</p>
<p>That finding changed how the industry thought about progress.</p>
<p>The result suggested that improving behavior could sometimes matter as much as increasing scale.</p>
<p>This is why RLHF became one of the defining ideas of the ChatGPT era.</p>
<p>After InstructGPT, modern AI systems were no longer evaluated only by benchmark scores, parameter counts, or scaling curves.</p>
<p>They were increasingly evaluated by usefulness, conversational quality, safety, reliability, and how well they interact with humans.</p>
<p>And that shift fundamentally changed the future direction of large language models.</p>
<h2 id="heading-why-chatgpt-exploded-globally">Why ChatGPT Exploded Globally</h2>
<p>When ChatGPT launched publicly, the reaction was immediate and unlike anything the AI industry had seen before.</p>
<p>Millions of people started using it within days. Developers, students, writers, researchers, businesses, and everyday users suddenly felt like they were interacting with AI in a completely different way.</p>
<p>What made this moment so important was that advanced AI capabilities finally became accessible to ordinary users. After all, the underlying language models were already extremely capable before ChatGPT existed. GPT-3 could generate essays, answer questions, write code, summarize text, and perform impressive few-shot learning tasks. GPT-4, released after ChatGPT, later pushed reasoning and multimodal abilities even further.</p>
<p>The challenge was no longer whether language models could perform useful tasks, but whether people could interact with them naturally.</p>
<p>ChatGPT combined powerful language-model capabilities with RLHF-based alignment, conversational interaction, safer behavior, and a user-friendly chat interface.</p>
<p>Earlier systems often required significant prompt experimentation to achieve consistent results. Users had to carefully engineer prompts, retry questions, or work around strange outputs. The models could be brilliant one moment and confusing the next.</p>
<p>ChatGPT changed that experience dramatically.</p>
<p>Thanks to the alignment techniques introduced in the InstructGPT paper, the system became far better at following instructions, maintaining conversational flow, understanding intent, and responding in a way that felt cooperative rather than purely generative.</p>
<p>The conversational interface itself also mattered enormously.</p>
<p>Before ChatGPT, interacting with advanced AI systems often required APIs, coding knowledge, prompt experimentation, or technical understanding.</p>
<p>ChatGPT simplified everything into a familiar chat format: you simply typed naturally, and the system responded naturally.</p>
<p>That design decision may sound small, but historically it was transformative. It turned large language models from research tools into consumer products.</p>
<p>Although imperfect, the system felt substantially more reliable than earlier language-model interfaces, and it communicated in ways that felt more natural and cooperative.</p>
<p>The breakthrough was not simply that the AI became smarter. The breakthrough was that the AI became usable.</p>
<p>And that usability is what transformed large language models from impressive research demonstrations into globally adopted AI assistants.</p>
<h2 id="heading-chatgpt-as-an-interface-revolution">ChatGPT as an Interface Revolution</h2>
<p>One of the most important things about ChatGPT is that it changed how humans interact with computers.</p>
<p>Before ChatGPT, powerful AI systems mostly lived behind APIs, research demos, developer tools, and complex prompting workflows.</p>
<p>Using advanced language models often required technical knowledge. Developers experimented with prompt engineering, API parameters, temperature settings, and carefully structured inputs just to get reliable outputs from the model.</p>
<p>Even GPT-3, despite being extremely powerful, still felt like a research system for many users. You had to learn how to “talk to the model.”</p>
<p>And in many cases, the interaction felt fragile. Slight changes in wording could completely change the quality of the response.</p>
<p>ChatGPT changed that dynamic almost overnight.</p>
<p>Instead of making users adapt to the AI, the AI became much better at adapting to humans.</p>
<p>Natural conversation became the interface.</p>
<p>For decades, human-computer interaction depended on commands, menus, search boxes, forms, programming languages, and specialized software interfaces.</p>
<p>ChatGPT introduced something different: you could simply explain what you wanted in plain language. And the system would usually understand.</p>
<p>This made AI feel accessible to people who had never written code, used APIs, or interacted with machine learning systems before.</p>
<p>In many ways, ChatGPT transformed prompting into a universal interface for computing. And that single shift affected nearly every digital field.</p>
<p>In education, students started using conversational AI to explain difficult concepts, summarize lessons, practice languages, and receive tutoring-style help.</p>
<p>In coding, developers began using AI systems for debugging, code generation, documentation, and learning new frameworks.</p>
<p>This eventually led to the rise of AI coding assistants integrated directly into development environments.</p>
<p>In writing and content creation, conversational AI became a brainstorming partner capable of drafting ideas, rewriting text, organizing articles, and helping people communicate more effectively.</p>
<p>Search behavior also started changing. Instead of searching through lists of links, users increasingly expected direct conversational answers. This fundamentally challenged traditional search-engine interaction models.</p>
<p>And across productivity tools, AI systems began acting less like software features and more like collaborative assistants.</p>
<p>This shift was enabled by advances in conversational AI and interaction design that made dialogue feel natural and useful.</p>
<p>The alignment techniques introduced by InstructGPT were an important part of making these conversational experiences practical.</p>
<p>Historically, this may become one of the most important consequences of the GPT era: earlier software required humans to learn interfaces. ChatGPT pushed computing toward interfaces that adapt to humans instead.</p>
<h2 id="heading-benchmarks-and-results">Benchmarks and Results</h2>
<p>We've already discussed how one of the biggest improvements didn't come from making the model larger. Instead, it came from making the model better aligned with humans.</p>
<p>This is one of the central findings of the entire paper, and it changed how many researchers thought about progress in large language models.</p>
<p>Before this work, the dominant belief was that scaling was the main path forward, with bigger models, more parameters, more compute, and more data. And GPT-3 seemed to confirm that idea. Larger models consistently showed stronger few-shot learning, reasoning, and generalization abilities.</p>
<p>But the InstructGPT paper introduced a different perspective. The researchers found that a relatively small 1.3B parameter InstructGPT model was often preferred by human evaluators over the original 175B GPT-3 model.</p>
<p>That result was extremely important. It suggested that alignment sometimes outperformed scale.</p>
<p>This became one of the defining insights of the ChatGPT era.</p>
<p>According to the paper, human evaluators consistently preferred InstructGPT responses because they were more helpful, more accurate, safer, and better aligned with what users were actually asking for.</p>
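<p>Preference results like these are usually reported as a win rate: the share of head-to-head comparisons in which labelers picked one model's output over the other's. Here's a minimal sketch of that arithmetic. The <code>win_rate</code> helper, the model labels, and the judgments are invented for illustration and aren't data from the paper.</p>
<pre><code class="language-python">def win_rate(judgments, model):
    """Fraction of pairwise comparisons in which `model` was preferred."""
    if not judgments:
        raise ValueError("need at least one comparison")
    wins = sum(1 for winner in judgments if winner == model)
    return wins / len(judgments)


# Invented labeler choices for five prompts (not data from the paper).
judgments = [
    "instruct-1.3b",
    "gpt3-175b",
    "instruct-1.3b",
    "instruct-1.3b",
    "instruct-1.3b",
]
rate = win_rate(judgments, "instruct-1.3b")
print(rate)
</code></pre>
<p>A win rate above 0.5 means labelers preferred that model more often than not. Real evaluations use many more prompts and labelers than this toy list.</p>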
<p>The improvements appeared across several important areas.</p>
<p>One major improvement was instruction following. Earlier GPT models often ignored instructions, drifted off-topic, or generated responses that sounded fluent but failed to solve the user’s actual task. InstructGPT behaved much more like a cooperative assistant and followed prompts more reliably.</p>
<p>The paper also reports improvements in truthfulness. Large language models are known for hallucinating information and confidently generating false statements. Through RLHF and preference optimization, InstructGPT reduced some of these behaviors and produced answers humans judged to be more truthful and reliable.</p>
<p>Another important improvement involved toxicity and harmful outputs. The researchers evaluated the system on toxicity benchmarks and found that aligned models generated fewer toxic or unsafe responses compared to earlier GPT systems.</p>
<p>What makes these findings historically important is that they changed the industry’s understanding of what “better AI” actually meant.</p>
<p>Before InstructGPT, improvement was mostly measured through benchmark scores, scaling curves, and parameter counts.</p>
<p>After InstructGPT, researchers increasingly focused on usability, safety, alignment, conversational quality, and human preference satisfaction.</p>
<p>This was a major shift in AI development philosophy.</p>
<h2 id="heading-truthfulness-and-hallucinations">Truthfulness and Hallucinations</h2>
<p>A major challenge for language models is that fluent responses are not always truthful.</p>
<p>When a model states something false with that same fluency, the behavior is now commonly called hallucination.</p>
<p>Hallucinations can take many forms, including invented facts, fabricated references, incorrect explanations, or confident answers that lack factual support.</p>
<p>And because the responses are fluent and natural, the mistakes can sometimes look believable to users. The InstructGPT paper treats this as a serious issue rather than a minor flaw.</p>
<p>The authors note that language models are optimized for plausibility rather than verified truth. This is an important distinction: a language model can generate text that <em>looks</em> correct while still being inaccurate.</p>
<p>This is why the paper places particular emphasis on truthfulness and factual reliability.</p>
<p>Through RLHF and human preference optimization, InstructGPT was trained to produce answers humans judged to be more accurate and trustworthy. Human evaluators generally preferred responses that were more transparent about uncertainty and less likely to contain misleading information.</p>
<p>The paper also evaluates the model on truthfulness benchmarks such as <a href="https://arxiv.org/pdf/2109.07958">TruthfulQA</a>, where aligned models demonstrated improvements compared to earlier GPT systems.</p>
<p>But the paper is also careful not to overstate the results. Hallucinations didn't disappear. The aligned models could still make reasoning mistakes, generate false information, misunderstand prompts, or produce overconfident answers.</p>
<p>This nuance is extremely important: the paper doesn't claim that RLHF solved factuality or reasoning completely. Alignment improved behavior, but it didn't produce perfection.</p>
<p>That distinction became increasingly important as ChatGPT and later GPT-4 systems reached millions of users worldwide.</p>
<p>The models became more useful, more truthful, and more aligned, but they still remained probabilistic language models rather than guaranteed fact engines.</p>
<p>In many ways, the InstructGPT paper marks the beginning of large-scale efforts to make AI systems not only intelligent, but also trustworthy enough for real-world human interaction.</p>
<h2 id="heading-safety-and-refusal-behavior">Safety and Refusal Behavior</h2>
<p>As language models became more powerful, researchers realized that safety was becoming a deployment problem.</p>
<p>A model that can generate human-like language at scale can also generate harmful instructions, produce toxic content, spread misinformation, or be manipulated into unsafe behavior.</p>
<p>The InstructGPT paper treats these risks very seriously and frames alignment as a necessary part of deploying large language models responsibly.</p>
<p>One of the biggest changes introduced through RLHF was safer refusal behavior.</p>
<p>Earlier GPT systems often attempted to answer almost anything. As a result, they often responded to unsafe prompts rather than recognizing when a refusal was appropriate.</p>
<p>InstructGPT begins changing that behavior.</p>
<p>Through human feedback and preference optimization, the model learns that some requests shouldn't be answered directly. Human labelers consistently prefer safer responses, refusals for harmful instructions, and outputs that avoid dangerous or toxic behavior.</p>
<p>This leads to systems that are better at refusing unsafe requests, avoiding toxic generations, and behaving more cautiously during interaction.</p>
<p>The paper also evaluates toxicity reduction using safety-related benchmarks and finds that aligned models generally produce fewer harmful outputs than earlier GPT systems.</p>
<p>Another important issue is harmful content filtering. Large language models absorb patterns from massive internet datasets, which inevitably contain biased language, misinformation, unsafe instructions, and toxic behavior.</p>
<p>Without alignment, models may reproduce these patterns surprisingly easily.</p>
<p>RLHF acts as a corrective layer on top of pretraining. Instead of only imitating internet text, the model is further optimized toward responses humans judge to be safer and more appropriate.</p>
<p>Of course, the paper is also realistic about limitations.</p>
<p>The authors acknowledge that alignment is incomplete and that unsafe outputs can still occur. Models may still be vulnerable to adversarial prompting or attempts to bypass safety behavior (what later became widely known as jailbreaks).</p>
<p>This is an important nuance: alignment reduces risk, but it doesn't eliminate it.</p>
<p>And historically, this realization became incredibly important for the future of large-scale AI deployment.</p>
<p>In many ways, the InstructGPT paper marks the beginning of modern AI safety engineering inside flagship language models.</p>
<p>InstructGPT introduced large-scale behavior alignment. Then GPT-4 expanded this even further with red teaming, adversarial testing, deployment monitoring, and much larger safety evaluation pipelines.</p>
<p>So this paper becomes a direct bridge between early generative language models and the much more safety-focused AI systems that followed in the GPT-4 era.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>One of the strongest aspects of the InstructGPT paper is that it doesn't present alignment as a solved problem.</p>
<p>Even though the results are impressive, the authors are careful and surprisingly honest about the system’s remaining weaknesses and risks.</p>
<p>This balance is important because the paper isn't arguing that RLHF creates perfect AI systems. The authors consistently frame alignment as a work in progress rather than a finished solution.</p>
<p>One major limitation is that the models still hallucinate.</p>
<p>The paper acknowledges that hallucinations remain a significant challenge despite alignment improvements.</p>
<p>RLHF improves truthfulness and instruction adherence, but it doesn't fundamentally solve the probabilistic nature of language models. The system still predicts likely text patterns rather than verifying objective truth.</p>
<p>Another important issue is <a href="https://arxiv.org/pdf/2209.13085">reward hacking</a>.</p>
<p>Because the model is optimized against a learned reward signal, it can sometimes discover shortcuts that maximize reward without genuinely improving reasoning or understanding. In other words, the model may learn behaviors that <em>look</em> aligned to evaluators while still hiding deeper problems underneath.</p>
<p>This is a common challenge in reinforcement learning systems more broadly.</p>
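<p>Here's a toy illustration of the idea, which is not the paper's setup. Suppose a stand-in reward only counts polite phrases. Picking the highest-scoring candidate then favors a response that repeats those phrases over one that actually answers the question. The <code>proxy_reward</code> function and the candidate strings are hypothetical.</p>
<pre><code class="language-python">POLITE_WORDS = ("please", "thank", "sorry")


def proxy_reward(response):
    """Flawed stand-in reward: counts polite words, ignores correctness."""
    text = response.lower()
    return sum(text.count(word) for word in POLITE_WORDS)


candidates = [
    "The capital of France is Paris.",
    "Sorry, thank you, please, thank you so much, sorry!",
]

# Selecting by the proxy picks the keyword-stuffed response.
best = max(candidates, key=proxy_reward)
print(best)
</code></pre>
<p>A learned reward model is far less crude than a keyword counter, but the failure mode is the same in spirit: whatever the reward measures imperfectly, optimization pressure tends to find.</p>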
<p>The paper also hints at two problems that later became widely discussed in ChatGPT-era systems: <a href="https://arxiv.org/pdf/2406.11717">over-refusal</a> and <a href="https://arxiv.org/pdf/2310.13548">sycophancy</a>.</p>
<p>Sometimes aligned models become too cautious and refuse harmless requests unnecessarily. In other cases, models may become overly agreeable, telling users what they appear to want to hear instead of providing more balanced or truthful responses.</p>
<p>This creates a difficult tension between safety, helpfulness, and honesty.</p>
<p>Another major limitation is bias.</p>
<p>Since these systems are trained on massive internet datasets and further shaped through human labeling, they inevitably inherit biases from both sources. The paper explicitly acknowledges that alignment doesn't remove all harmful or biased behavior.</p>
<p>And perhaps most importantly, the paper emphasizes that RLHF aligns models to labeler preferences, not universal human values. This is a very important nuance.</p>
<p>The system learns from the judgments of specific human annotators operating within specific cultural and organizational contexts. That means alignment itself is subjective and imperfect.</p>
<p>There is no single universally agreed definition of helpfulness, fairness, safety, or acceptable behavior.</p>
<p>The paper discusses these concerns carefully and recognizes that human feedback introduces its own limitations and assumptions.</p>
<p>The alignment itself is also fragile. Even aligned systems can sometimes be manipulated through adversarial prompting or jailbreak-style attacks that bypass safety behavior. This later became one of the defining challenges of ChatGPT and GPT-4 deployment.</p>
<p>And finally, there's the practical issue of scale.</p>
<p>RLHF requires large amounts of human labeling, ranking, evaluation, and monitoring. Building these alignment pipelines is expensive, time-consuming, and operationally complex. Unlike raw pretraining data scraped automatically from the internet, human feedback doesn't scale nearly as easily.</p>
<p>In many ways, the paper reveals an important truth about modern AI systems: making models intelligent is difficult. But making them reliably aligned with humans may be even harder.</p>
<h2 id="heading-historical-importance">Historical Importance</h2>
<p>Looking back now, it's difficult to overstate how important the InstructGPT paper became for the entire AI industry.</p>
<p>Earlier GPT papers focused mostly on one central question: How do we make language models more capable?</p>
<p>That era was largely driven by larger datasets, larger parameter counts, scaling laws, and benchmark performance.</p>
<p>The models became increasingly impressive at generating text, solving tasks, and demonstrating emergent abilities. But they still behaved primarily like prediction engines trained to continue internet text.</p>
<p>InstructGPT changed the focus completely. For the first time, large-scale AI development began shifting from model-centric AI to interaction-centric AI.</p>
<p>This was a major philosophical transition: the industry realized that users didn't only care about raw intelligence, benchmark scores, or parameter counts.</p>
<p>They cared about usability, conversational quality, safety, trust, and whether the system could actually help them effectively.</p>
<p>This is why ChatGPT felt so different to the public. The underlying language model capabilities were important, but the real breakthrough came from how those capabilities were shaped into a usable human experience.</p>
<p>The interface became conversational. The system became more cooperative. The AI became more aligned with user intent.</p>
<p>That shift fundamentally changed public perception of artificial intelligence.</p>
<p>Before ChatGPT, most people saw AI as research software, technical demos, or specialized tools for experts.</p>
<p>After ChatGPT, millions of people started interacting with AI systems conversationally on a daily basis.</p>
<p>And that changed everything.</p>
<p>Earlier GPT papers focused mainly on discovering what scaling could achieve. InstructGPT introduced a different challenge: How do we safely deploy these systems in the real world?</p>
<p>That shift helped create entirely new areas of research and engineering, including RLHF pipelines, safety tuning, refusal behavior, red teaming, adversarial testing, policy frameworks, and large-scale human-feedback infrastructure.</p>
<p>In many ways, the ChatGPT era began the moment researchers realized that building powerful models was only part of the problem.</p>
<p>The harder challenge was making those systems reliable enough for human interaction at global scale.</p>
<p>It also helps explain why later systems placed much greater emphasis on safety, alignment, deployment practices, and real-world reliability.</p>
<p>The industry was no longer building language models only for research papers. It was building AI systems intended to operate in the real world. And the InstructGPT paper became one of the clearest turning points in that transformation.</p>
<h2 id="heading-discussion-the-real-shift">Discussion: The Real Shift</h2>
<p>The transition from GPT-3 to ChatGPT represents something much deeper than a simple improvement in model performance.</p>
<p>It changed the central question driving the entire AI industry.</p>
<p>During the GPT-3 era, the big question was, “Can language models learn tasks directly from prompts?”</p>
<p>GPT-3's breakthrough was showing that the answer was yes.</p>
<p>Research attention shifted toward scaling and emergent capabilities.</p>
<p>But the ChatGPT era introduced a completely different challenge: the question was no longer simply “Can the model perform the task?” Instead, it became, “Can humans actually trust and use these systems every day?”</p>
<p>That shift changed everything.</p>
<p>Once millions of people began interacting with AI systems directly, raw intelligence alone was no longer sufficient. Users needed systems that were understandable, reliable, safe, conversational, and aligned with human expectations.</p>
<p>This is exactly why the InstructGPT paper became so historically important. It introduced the idea that large language models should not only optimize for capability, but also for human interaction quality.</p>
<p>In many ways, the industry moved from “How smart is the model?” to “How usable is the model?”</p>
<p>And that transition fundamentally changed AI development.</p>
<p>After ChatGPT, success was no longer measured only by benchmark scores, parameter counts, or scaling curves.</p>
<p>It was increasingly measured by alignment, conversational quality, safety, and real-world usability.</p>
<p>This also explains why alignment research suddenly became central to modern AI systems.</p>
<p>GPT-3 showed that models could learn from prompts. ChatGPT showed that humans needed models that could cooperate.</p>
<p>That was the real shift.</p>
<p>And it may ultimately become one of the most important turning points in the history of artificial intelligence.</p>
<h2 id="heading-connection-to-gpt-4">Connection to GPT-4</h2>
<p>One of the most important things to understand about GPT-4 is that it didn't appear out of nowhere.</p>
<p>It was built on top of the alignment ideas introduced by InstructGPT and refined through the large-scale deployment experience of ChatGPT.</p>
<p>GPT-4 is often discussed in terms of its reasoning, multimodal abilities, and benchmark performance.</p>
<p>But beneath all of those improvements is something equally important: the alignment pipeline.</p>
<p>Without the work introduced in the InstructGPT paper, GPT-4 would likely feel far less usable as a real-world assistant.</p>
<p>That distinction matters enormously.</p>
<p>Many of GPT-4's alignment techniques can be traced back to ideas introduced by InstructGPT, including RLHF, instruction tuning, conversational alignment, safer refusal behavior, and human preference optimization.</p>
<p>ChatGPT then became the large-scale real-world testing ground for these ideas.</p>
<p>Millions of user interactions exposed weaknesses ranging from hallucinations and jailbreak attempts to broader safety and usability issues.</p>
<p>Those deployment lessons became incredibly valuable.</p>
<p>By the time GPT-4 arrived, OpenAI was no longer simply training a larger language model. It was building a large-scale aligned conversational system shaped by RLHF pipelines, human feedback, safety engineering, adversarial testing, and real-world user interaction.</p>
<p>This is why GPT-4 feels fundamentally different from earlier GPT models.</p>
<p>In many ways, GPT-4 represents the convergence of two major ideas: scaling capability and scaling alignment.</p>
<ul>
<li><p>GPT-3 proved that language models could learn tasks from prompts.</p>
</li>
<li><p>InstructGPT proved that models could be shaped through human feedback.</p>
</li>
<li><p>ChatGPT proved that aligned conversational AI could work at global scale.</p>
</li>
<li><p>GPT-4 combined all of those ideas into a much more capable multimodal system.</p>
</li>
</ul>
<p>That historical progression is important because it shows that modern AI systems aren't built through scaling alone. They're built through the combination of intelligence, alignment, interaction design, and deployment experience.</p>
<p>And the InstructGPT paper became one of the key foundations that made GPT-4 possible.</p>
<h2 id="heading-gpt-3-vs-instructgpt-vs-chatgpt-vs-gpt-4-key-differences">GPT-3 vs InstructGPT vs ChatGPT vs GPT-4: Key Differences</h2>
<p>By this point, we've discussed GPT-3, InstructGPT, ChatGPT, and GPT-4 individually. But it can be helpful to see them side by side.</p>
<p>Although these systems are closely related, each one introduced a different shift in the evolution of modern AI.</p>
<p>GPT-3 focused on capability through scale, InstructGPT focused on alignment through human feedback, ChatGPT focused on conversational usability, and GPT-4 combined these ideas with stronger reasoning and multimodal capabilities.</p>
<p>The table below summarizes the main differences between them and shows how each system built on the progress of the previous generation.</p>
<table style="min-width:125px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>GPT-3</strong></p></td><td><p><strong>InstructGPT</strong></p></td><td><p><strong>ChatGPT</strong></p></td><td><p><strong>GPT-4</strong></p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Large-scale language model enabling few-shot and in-context learning</p></td><td><p>Align language models with human instructions using RLHF</p></td><td><p>Conversational AI assistant optimized for dialogue and usability</p></td><td><p>Aligned multimodal foundation model with stronger reasoning and deployment maturity</p></td></tr><tr><td><p><strong>Main Goal</strong></p></td><td><p>Scale capability through massive pretraining</p></td><td><p>Improve instruction following and alignment</p></td><td><p>Deliver usable conversational AI for the public</p></td><td><p>Build reliable multimodal AI systems for real-world deployment</p></td></tr><tr><td><p><strong>Training Objective</strong></p></td><td><p>Predict next token from internet-scale text</p></td><td><p>Optimize outputs using human feedback and preference learning</p></td><td><p>Conversational interaction optimized through RLHF and dialogue tuning</p></td><td><p>Large-scale multimodal pretraining combined with RLHF, safety tuning, and deployment optimization</p></td></tr><tr><td><p><strong>Alignment Focus</strong></p></td><td><p>Minimal explicit alignment</p></td><td><p>Central focus of the paper</p></td><td><p>Strong conversational alignment</p></td><td><p>Advanced alignment and safety engineering</p></td></tr><tr><td><p><strong>RLHF Usage</strong></p></td><td><p>Not central</p></td><td><p>Core innovation of the system</p></td><td><p>Major component of interaction quality</p></td><td><p>Expanded and refined at larger scale</p></td></tr><tr><td><p><strong>Human 
Feedback Role</strong></p></td><td><p>Limited</p></td><td><p>Human rankings shape model behavior directly</p></td><td><p>Human feedback improves conversation flow and usability</p></td><td><p>Human feedback combined with large-scale safety evaluation and red teaming</p></td></tr><tr><td><p><strong>Interaction Style</strong></p></td><td><p>Prompt-based text generation</p></td><td><p>Instruction-following assistant</p></td><td><p>Natural multi-turn conversational assistant</p></td><td><p>Advanced conversational and multimodal assistant</p></td></tr><tr><td><p><strong>Prompting Style</strong></p></td><td><p>Zero-shot, one-shot, and few-shot prompting</p></td><td><p>Instruction prompts become more reliable</p></td><td><p>Conversational prompting becomes primary interface</p></td><td><p>Conversational and multimodal prompting</p></td></tr><tr><td><p><strong>Conversation Memory</strong></p></td><td><p>Limited contextual continuity</p></td><td><p>Better instruction adherence</p></td><td><p>Maintains dialogue flow across interactions</p></td><td><p>Stronger contextual reasoning across longer interactions</p></td></tr><tr><td><p><strong>Instruction Following</strong></p></td><td><p>Often inconsistent</p></td><td><p>Significantly improved</p></td><td><p>Strong conversational instruction following</p></td><td><p>More reliable and nuanced instruction handling</p></td></tr><tr><td><p><strong>Truthfulness</strong></p></td><td><p>Frequent hallucinations and overconfidence</p></td><td><p>Improved factual alignment through RLHF</p></td><td><p>More reliable but still hallucinates</p></td><td><p>Improved reasoning and factual performance, though hallucinations remain</p></td></tr><tr><td><p><strong>Safety Behavior</strong></p></td><td><p>Weak safety control</p></td><td><p>Safer refusal behavior introduced</p></td><td><p>More robust refusal and moderation behavior</p></td><td><p>Advanced safety pipelines and adversarial testing</p></td></tr><tr><td><p><strong>Harmful Output 
Handling</strong></p></td><td><p>Often continues unsafe prompts</p></td><td><p>Learns safer refusals from human feedback</p></td><td><p>Stronger refusal behavior in public deployment</p></td><td><p>More sophisticated alignment and safety systems</p></td></tr><tr><td><p><strong>Reasoning Ability</strong></p></td><td><p>Strong emergent reasoning for its time</p></td><td><p>Similar base capability but behaviorally improved</p></td><td><p>Improved practical reasoning in conversation</p></td><td><p>Major leap in reasoning and problem-solving</p></td></tr><tr><td><p><strong>Multimodal Capability</strong></p></td><td><p>Text only</p></td><td><p>Text only</p></td><td><p>Primarily text-based at launch</p></td><td><p>Text and image multimodal understanding</p></td></tr><tr><td><p><strong>Coding Ability</strong></p></td><td><p>Strong code generation emergence</p></td><td><p>Improved usability for coding tasks</p></td><td><p>Widely used as coding assistant</p></td><td><p>Much stronger coding and debugging performance</p></td></tr><tr><td><p><strong>Context Handling</strong></p></td><td><p>2048-token context window</p></td><td><p>Similar GPT-3-based context limits</p></td><td><p>Improved conversational memory handling</p></td><td><p>Much larger context capabilities</p></td></tr><tr><td><p><strong>Model Size</strong></p></td><td><p>175B parameters</p></td><td><p>Fine-tuned versions of GPT-3 models</p></td><td><p>Based on aligned GPT-3.5/GPT-4 systems</p></td><td><p>Undisclosed by OpenAI</p></td></tr><tr><td><p><strong>Training Data</strong></p></td><td><p>Massive internet-scale text datasets</p></td><td><p>GPT-3 pretraining plus human demonstrations and rankings</p></td><td><p>Large conversational interaction tuning datasets</p></td><td><p>Large-scale multimodal and internet-scale datasets</p></td></tr><tr><td><p><strong>Learning Paradigm</strong></p></td><td><p>In-context learning through scale</p></td><td><p>Human preference learning through RLHF</p></td><td><p>Conversational 
alignment at deployment scale</p></td><td><p>Combined capability scaling and alignment scaling</p></td></tr><tr><td><p><strong>Key Innovation</strong></p></td><td><p>Emergent few-shot learning</p></td><td><p>RLHF-based alignment pipeline</p></td><td><p>Conversational AI interface revolution</p></td><td><p>Multimodal aligned foundation systems</p></td></tr><tr><td><p><strong>User Experience</strong></p></td><td><p>Powerful but difficult to control</p></td><td><p>More cooperative and instruction-aware</p></td><td><p>Feels like talking to an assistant</p></td><td><p>More reliable, capable, and multimodal interaction</p></td></tr><tr><td><p><strong>Reliability</strong></p></td><td><p>Often unstable across prompts</p></td><td><p>More stable instruction behavior</p></td><td><p>Significantly improved usability</p></td><td><p>Stronger robustness and interaction quality</p></td></tr><tr><td><p><strong>Deployment Style</strong></p></td><td><p>Research and API usage</p></td><td><p>Alignment research milestone</p></td><td><p>Mass public deployment</p></td><td><p>Large-scale multimodal deployment</p></td></tr><tr><td><p><strong>Benchmark Emphasis</strong></p></td><td><p>Capability scaling and few-shot tasks</p></td><td><p>Human preference evaluations and alignment</p></td><td><p>Real-world conversational usability</p></td><td><p>Broad multimodal benchmark dominance</p></td></tr><tr><td><p><strong>Main Limitation</strong></p></td><td><p>Poor alignment and hallucinations</p></td><td><p>Alignment still incomplete and subjective</p></td><td><p>Hallucinations and jailbreak vulnerabilities</p></td><td><p>Hallucinations, safety tradeoffs, and lack of transparency</p></td></tr><tr><td><p><strong>Historical Importance</strong></p></td><td><p>Proved scaling produces emergent abilities</p></td><td><p>Introduced modern alignment-centered LLM training</p></td><td><p>Brought conversational AI to mainstream global use</p></td><td><p>Defined the era of aligned multimodal AI 
systems</p></td></tr><tr><td><p><strong>What Changed in AI</strong></p></td><td><p>Prompting became central</p></td><td><p>Alignment became a core research priority</p></td><td><p>AI became a mainstream consumer interface</p></td><td><p>AI became deployable multimodal infrastructure</p></td></tr><tr><td><p><strong>Legacy</strong></p></td><td><p>Foundation of prompt-driven AI</p></td><td><p>Foundation of ChatGPT alignment pipeline</p></td><td><p>Popularized conversational AI globally</p></td><td><p>Established modern multimodal AI ecosystem</p></td></tr></tbody></table>

<h2 id="heading-from-gpt-1-to-gpt-4-a-timeline-of-modern-ai-systems-and-alignment-evolution">From GPT-1 to GPT-4: A Timeline of Modern AI Systems and Alignment Evolution</h2>
<p>Before we wrap up, it's worth stepping back and looking at the bigger picture.</p>
<p>The InstructGPT paper didn't emerge in isolation. It was part of a much larger evolution that transformed GPT models from research-focused language models into the conversational AI systems we use today.</p>
<p>Each generation introduced a new idea that pushed the field forward.</p>
<p>GPT-1 introduced large-scale pretraining, GPT-2 demonstrated zero-shot capabilities, GPT-3 popularized prompting and in-context learning, and InstructGPT made alignment through human feedback central to training. ChatGPT then brought these ideas to millions of users through a conversational interface, while GPT-4 combined alignment with stronger reasoning and multimodal capabilities.</p>
<p>The timeline below summarizes the key transitions that shaped the modern AI era.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6e4cc89c-7772-41e4-b5dc-b61820e1521a.png" alt="From GPT-1 to GPT-4 A Timeline of Modern AI Systems and Alignment Evolution" style="display:block;margin:0 auto" width="1920" height="1080" loading="lazy">

<table style="min-width:150px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Year</strong></p></td><td><p><strong>System</strong></p></td><td><p><strong>Main Transition</strong></p></td><td><p><strong>What Changed</strong></p></td><td><p><strong>Key Paper / Release</strong></p></td><td><p><strong>Historical Importance</strong></p></td></tr><tr><td><p><strong>2018</strong></p></td><td><p>GPT-1</p></td><td><p>Pretraining + Fine-Tuning Era</p></td><td><p>Introduced generative pretraining using Transformers before supervised fine-tuning</p></td><td><p><em>Improving Language Understanding by Generative Pre-Training</em></p></td><td><p>Started the modern large-scale NLP pretraining paradigm</p></td></tr><tr><td><p><strong>2019</strong></p></td><td><p>GPT-2</p></td><td><p>Zero-Shot Language Modeling Era</p></td><td><p>Showed that larger language models could perform multiple tasks without task-specific fine-tuning</p></td><td><p><em>Language Models are Unsupervised Multitask Learners</em></p></td><td><p>Shifted AI toward general-purpose generative models</p></td></tr><tr><td><p><strong>2020</strong></p></td><td><p>GPT-3</p></td><td><p>In-Context Learning Era</p></td><td><p>Demonstrated few-shot, one-shot, and zero-shot learning at massive scale using prompts alone</p></td><td><p><em>Language Models are Few-Shot Learners</em></p></td><td><p>Made prompting the central interface for AI systems</p></td></tr><tr><td><p><strong>March 2022</strong></p></td><td><p>InstructGPT</p></td><td><p>Alignment and RLHF Era</p></td><td><p>Introduced reinforcement learning from human feedback (RLHF) to align models with user intent</p></td><td><p><em>Training Language Models to Follow Instructions with Human Feedback</em></p></td><td><p>Shifted AI development from raw capability to alignment and 
usability</p></td></tr><tr><td><p><strong>Nov 2022</strong></p></td><td><p>GPT-3.5 / ChatGPT</p></td><td><p>Conversational AI Era</p></td><td><p>Combined GPT-3.5 with RLHF and chat-based interaction for public deployment</p></td><td><p>ChatGPT public release based on GPT-3.5 family</p></td><td><p>Turned LLMs into mainstream conversational assistants used globally</p></td></tr><tr><td><p><strong>2023</strong></p></td><td><p>GPT-4</p></td><td><p>Multimodal Aligned Foundation Model Era</p></td><td><p>Expanded aligned AI into multimodal reasoning across text and images with stronger reliability and safety systems</p></td><td><p>GPT-4 Technical Report</p></td><td><p>Established the modern era of deployable multimodal AI systems</p></td></tr><tr><td><p><strong>2023–Present</strong></p></td><td><p>GPT-4 + ChatGPT Ecosystem</p></td><td><p>AI Assistant Infrastructure Era</p></td><td><p>AI systems evolved into integrated assistants for coding, education, productivity, reasoning, and multimodal interaction</p></td><td><p>GPT-4 deployment ecosystem</p></td><td><p>Transitioned AI from research products into global infrastructure platforms</p></td></tr></tbody></table>

<h2 id="heading-final-insight">Final Insight</h2>
<p>When people look back at the history of modern AI, they often focus on the moments when models became larger, more powerful, or more capable. But the story of the GPT series is not just a story about scale. It is also a story about learning how to make that intelligence useful.</p>
<p>GPT-1 showed that language models could learn surprisingly rich representations from large amounts of text before being adapted to specific tasks.</p>
<p>GPT-2 expanded that idea and revealed that scale itself could unlock new behaviors.</p>
<p>GPT-3 pushed the field into entirely new territory, demonstrating that a single model could perform a wide variety of tasks simply by responding to prompts and examples.</p>
<p>For a moment, it seemed as though scaling might be the answer to everything.</p>
<p>Then InstructGPT arrived and exposed a different challenge.</p>
<p>The problem was no longer whether a model could generate text, answer questions, or complete tasks. Models were already becoming remarkably capable.</p>
<p>The real question was whether people could actually rely on them. Could they follow instructions consistently? Could they respond in ways users found helpful? Could they become something more than sophisticated prediction engines?</p>
<p>That was the breakthrough at the heart of InstructGPT.</p>
<p>Rather than focusing solely on making models smarter, the paper focused on making them behave better.</p>
<p>Human feedback became part of the training process itself.</p>
<p>Alignment moved from a research concern to a core design principle. For the first time, improving the relationship between humans and AI became just as important as improving the model's raw capabilities.</p>
<p>The impact of that shift extended far beyond a single paper.</p>
<p>It laid the groundwork for ChatGPT, which introduced millions of people to conversational AI. Suddenly, interacting with advanced language models no longer required APIs, research expertise, or carefully engineered prompts. People could simply ask questions, seek advice, explore ideas, or learn something new through natural conversation.</p>
<p>That change transformed AI from a research breakthrough into a widely used product.</p>
<p>GPT-4 would later build on this foundation, combining stronger reasoning and broader capabilities with the alignment techniques that began with InstructGPT. But by then, the industry had already learned an important lesson: capability alone was not enough. Intelligence had to be usable.</p>
<p>In hindsight, the lasting significance of the InstructGPT paper is not that it introduced a new training pipeline. It is that it helped redefine the goal of modern AI.</p>
<p>The challenge was no longer just building systems that could generate language.</p>
<p>It was building systems that people could work with, learn from, and trust.</p>
<p>And that may ultimately be the transition that defined this era of artificial intelligence.</p>
<h2 id="heading-resources"><strong>Resources:</strong></h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2203.02155">Training Language Models to Follow Instructions with Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.01325">Learning to Summarize from Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1909.08593">Fine-Tuning Language Models from Human Preferences</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1706.03741">Deep Reinforcement Learning from Human Preferences</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2008.02275">Aligning AI With Shared Human Values</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2107.05637">Asking for Help on Recursive Decomposition</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2112.09332">WebGPT: Browser-assisted Question-Answering with Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2109.07958">TruthfulQA: Measuring How Models Mimic Human Falsehoods</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.11462">RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2104.08691">The Power of Scale for Parameter-Efficient Prompt Tuning</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2109.01652">Finetuned Language Models Are Zero-Shot Learners</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2110.08207">Multitask Prompted Training Enables Zero-Shot Task Generalization</a></p>
</li>
</ul>
<p><strong>Contact Me</strong></p>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Run an LLM Locally on Your Mobile Phone with QVAC and Expo ]]>
                </title>
                <description>
                    <![CDATA[ When I was younger, I remember my mother’s Android phone, a Samsung Galaxy Note 3 that she bought right after losing her BlackBerry. During that time, a phone with 16 GB of storage was considered cutt ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-an-llm-locally-on-your-mobile-phone-with-qvac-and-expo/</link>
                <guid isPermaLink="false">6a2061ad78a43e3153aede0d</guid>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Mobile Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ local development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Jibril-M🍀 ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:17:33 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a5fb9baf-a10d-4e53-9c66-3980919a35b8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I still remember my mother’s Android phone from when I was younger: a Samsung Galaxy Note 3 that she bought right after losing her BlackBerry. At that time, a phone with 16 GB of storage was considered cutting-edge technology. The ability to store five 720p torrented movies on a single phone honestly felt unreal.</p>
<p>Most flagship devices back then shipped with somewhere between 1 and 3 GB of RAM, and GPUs were nowhere near what we carry around today. My mom’s Galaxy Note 3 featured the Qualcomm Adreno 330 GPU with 32 unified shader cores running at up to 578 MHz, a complete powerhouse for its time.</p>
<p>Fast forward to today, and the phones in our pockets are ridiculously more powerful, more efficient, and, honestly, capable of things people would’ve considered science fiction back then.</p>
<p>But enough about my mom’s phone. What I’m really trying to say is this: instead of spending hundreds of dollars every month on AI subscriptions and tokens, we can take advantage of the insanely capable devices we already carry around every day.</p>
<p>Modern smartphones now have dedicated AI acceleration, impressive thermal efficiency, and enough compute power to run lightweight language models locally, completely offline. That means better privacy, full control over your chat history, lower latency, and the ability to use AI without depending entirely on cloud services.</p>
<p>In this article, we’re going to build a React Native application that interacts with an LLM running directly on the device itself. The implementation will revolve around QVAC, a family of inference tools designed specifically for running AI models locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-qvac">What is QVAC?</a></p>
</li>
<li><p><a href="#heading-environment-setup">Environment Setup</a></p>
</li>
<li><p><a href="#heading-model-management">Model Management</a></p>
</li>
<li><p><a href="#heading-custom-models">Custom Models</a></p>
</li>
<li><p><a href="#heading-complete-implementation">Complete Implementation</a></p>
</li>
<li><p><a href="#heading-codebase-breakdown">Codebase Breakdown</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-resources-amp-further-reading">Resources &amp; Further Reading</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this article, you should have a basic understanding of front-end development and React in general. You don't have to be a mobile developer, but understanding React will help a lot.</p>
<h2 id="heading-what-is-qvac">What is QVAC?</h2>
<p>QVAC (QuantumVerse Automatic Computer) is a local-first AI inference platform developed by Tether. It's designed to move artificial intelligence away from centralized cloud systems and bring computation back to the user’s own device.</p>
<p>Most modern AI tools rely heavily on remote servers, API keys, and cloud infrastructure controlled by a handful of companies. While this makes AI accessible, it also creates major concerns around privacy, censorship, vendor lock-in, internet dependency, and ownership of user data. Every prompt, conversation, or uploaded file often passes through third-party servers that users have little control over.</p>
<p>QVAC was designed to solve that problem by allowing AI models and agents to run directly on devices like smartphones, laptops, and embedded systems, even while completely offline. Instead of sending personal conversations and sensitive data to the cloud, users can process everything locally on their own hardware.</p>
<p>The platform also embraces decentralization through peer-to-peer communication, reducing reliance on centralized infrastructure and eliminating single points of failure. This approach makes AI systems more private, resilient, autonomous, and accessible, especially in environments with limited internet access or strict data privacy requirements.</p>
<p>In simple terms, QVAC exists to make AI truly owned by its users — local-first, private by default, and independent from centralized control.</p>
<h2 id="heading-environment-setup">Environment Setup</h2>
<p>To speed up the process, I prepared a React Native starter project with all the dependencies installed. But we will install and set up QVAC in this article, since that's our main topic. Here's a link to the <a href="https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-">repository</a>.</p>
<p>Or you can run the command below to clone the starter project.</p>
<pre><code class="language-shell">git clone --branch ft-ui-implementation --single-branch https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-
</code></pre>
<h3 id="heading-qvac-installation">QVAC Installation</h3>
<p>Run the following command to install the SDK: <code>npm i @qvac/sdk</code>. Feel free to use any package manager of your choice. As for me, I will keep things simple with <code>npm</code>.</p>
<p>Then add the peer dependencies <code>bare-rpc</code> and <code>react-native-bare-kit</code> to your <code>package.json</code>:</p>
<pre><code class="language-json">{
  "dependencies": {
    "@qvac/sdk": "^0.7.0",
    "bare-rpc": "^1.0.0",
    "expo": "~54.0.33",
    "expo-status-bar": "~3.0.9",
    "react": "19.1.0",
    "react-native": "0.81.5",
    "react-native-bare-kit": "^0.11.5"
  },
  "devDependencies": {
    "@types/react": "~19.1.0",
    "bare-pack": "^1.5.1", 
    "typescript": "~5.9.2"
  }
}
</code></pre>
<p>Install the following additional dependencies:</p>
<pre><code class="language-shell">npx expo install expo-file-system expo-build-properties expo-device
</code></pre>
<p>Then configure <code>expo-build-properties</code> and add <code>@qvac/sdk/expo-plugin</code> to the <code>plugins</code> array in your <code>app.json</code>:</p>
<pre><code class="language-json">{
  "expo": {
    "plugins": [
      "expo-router",
      "@qvac/sdk/expo-plugin",
      [
        "expo-splash-screen",
        {
          "backgroundColor": "#208AEF",
          "android": {
            "image": "./assets/images/splash-icon.png",
            "imageWidth": 76
          }
        }
      ]
    ]
  }
}
</code></pre>
<p>Run the following command to generate the native Android and iOS projects:</p>
<pre><code class="language-shell">npx expo prebuild
</code></pre>
<p><strong>Note:</strong> QVAC uses llama.cpp under the hood. Due to optimization requirements and native hardware dependencies, the QVAC SDK doesn't run on emulators. You'll have to test this with a real physical device with Developer Mode enabled.</p>
<p>To run the app on your physical device, execute:</p>
<pre><code class="language-shell"># For Android:
npx expo run:android --device

# For iOS:
npx expo run:ios --device
</code></pre>
<h2 id="heading-model-management">Model Management</h2>
<p>The QVAC model management system is completely local-first and decentralized. It handles the entire model lifecycle, from downloading files to loading them into memory and unloading them again, and abstracts everything behind clean utility APIs.</p>
<h3 id="heading-resumable-amp-deduplicated-downloading-downloadasset">Resumable &amp; Deduplicated Downloading (<code>downloadAsset</code>)</h3>
<p><code>downloadAsset</code> writes temporary chunks to local disk. If the network drops, the partial file is preserved and the download resumes automatically on the next call. Also, if multiple components request the same asset at the same time, QVAC serves them all from a single network stream.</p>
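<p>The deduplication idea can be illustrated without the SDK. The sketch below is not QVAC's implementation. It is a minimal, assumed version of the pattern, in which concurrent requests for the same asset share one in-flight promise. The names <code>createDeduplicatedDownloader</code> and <code>fetchAsset</code> are hypothetical and exist only for this example.</p>

```typescript
// Illustrative only: shares one in-flight download among concurrent callers.
// `fetchAsset` is a hypothetical stand-in for the real network transfer.
type FetchAsset = (assetSrc: string) => Promise<string>;

function createDeduplicatedDownloader(fetchAsset: FetchAsset) {
  // Maps an asset source to the promise of its in-progress download.
  const inFlight = new Map<string, Promise<string>>();

  return function download(assetSrc: string): Promise<string> {
    const existing = inFlight.get(assetSrc);
    if (existing) return existing; // join the download that is already running

    const started = fetchAsset(assetSrc).finally(() => {
      // Forget the entry once it settles so a later call starts fresh.
      inFlight.delete(assetSrc);
    });
    inFlight.set(assetSrc, started);
    return started;
  };
}
```

<p>With this shape, two components that ask for the same model at the same moment trigger a single transfer. Resuming a partial file from disk is a separate concern that this sketch does not cover.</p>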
<h3 id="heading-memory-lifecycle-loadmodel-amp-unloadmodel">Memory Lifecycle (<code>loadModel</code> &amp; <code>unloadModel</code>)</h3>
<p><code>loadModel</code> maps the asset file directly into memory, assigns it to your hardware target (such as the device GPU), and returns an ephemeral <code>modelId</code>. Because local inference is memory-intensive on mobile devices, call <code>unloadModel</code> when you are done with a model: it frees system RAM immediately while preserving the downloaded file on disk.</p>
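<p>One way to make sure a model is always unloaded is to wrap the load and unload calls around the work you want to do. The helper below is only a sketch: <code>ModelRuntime</code> and <code>withModel</code> are hypothetical names, and the load and unload functions are passed in as parameters so that no SDK signature is assumed. In a real app you would adapt them to whatever arguments <code>loadModel</code> and <code>unloadModel</code> expect in your SDK version.</p>

```typescript
// Illustrative only: the runtime is injected, so no SDK signature is assumed.
// `ModelRuntime` and `withModel` are hypothetical names for this sketch.
interface ModelRuntime {
  load(modelSrc: string): Promise<string>; // resolves to an ephemeral modelId
  unload(modelId: string): Promise<void>; // frees RAM, keeps the file on disk
}

async function withModel<T>(
  runtime: ModelRuntime,
  modelSrc: string,
  task: (modelId: string) => Promise<T>,
): Promise<T> {
  const modelId = await runtime.load(modelSrc);
  try {
    return await task(modelId);
  } finally {
    // Runs on success and on failure, so the memory is always released.
    await runtime.unload(modelId);
  }
}
```

<p>The <code>try</code>/<code>finally</code> block is the important part: even if inference throws, the model does not stay resident in RAM.</p>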
<h3 id="heading-custom-models">Custom Models</h3>
<p>Because QVAC relies on an optimized branch of llama.cpp, it remains highly compatible with the open-source AI ecosystem. If you plan to load custom models, ensure they adhere to these criteria:</p>
<ul>
<li><p><strong>Format:</strong> Must be in the GGUF (<code>.gguf</code>) format.</p>
</li>
<li><p><strong>Quantization:</strong> For mobile and edge deployments, prefer <code>Q4_0</code>, <code>Q4_K_M</code>, or <code>Q8_0</code> quantizations so the model is more likely to fit within the RAM available on mobile hardware.</p>
</li>
</ul>
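<p>To get a feel for these two criteria, here is a small sketch that checks a file header for the GGUF magic bytes and estimates how much space the quantized weights take. The function names are hypothetical. The bits-per-weight figures for <code>Q4_0</code> and <code>Q8_0</code> follow from their block layout, while the <code>Q4_K_M</code> figure is an approximate average that I'm assuming for illustration. Treat the result as a lower bound, not a measurement.</p>

```typescript
// Bits per weight for each format. Q4_0 and Q8_0 follow from the block layout
// (32 weights plus one 16-bit scale). The Q4_K_M value is an approximate
// average and is an assumption of this sketch.
const BITS_PER_WEIGHT = {
  Q4_0: 4.5,
  Q4_K_M: 4.85,
  Q8_0: 8.5,
} as const;

type Quantization = keyof typeof BITS_PER_WEIGHT;

// GGUF files begin with the four ASCII bytes "GGUF".
function hasGgufMagic(header: Uint8Array): boolean {
  const magic = [0x47, 0x47, 0x55, 0x46];
  return header.length >= 4 && magic.every((byte, i) => header[i] === byte);
}

// Rough size of the quantized weights in bytes. Real files are somewhat larger
// (metadata, tensors kept at higher precision), and inference needs extra RAM
// for the KV cache on top of this.
function estimateWeightBytes(parameterCount: number, quant: Quantization): number {
  return Math.round((parameterCount * BITS_PER_WEIGHT[quant]) / 8);
}
```

<p>For a model with 1 billion parameters, this gives roughly 0.56 GB at <code>Q4_0</code> versus about 1.06 GB at <code>Q8_0</code>, which is why the 4-bit variants are usually the safer starting point on phones.</p>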
<h2 id="heading-complete-implementation">Complete Implementation</h2>
<p>Now let's put everything together: the UI layout, user interaction state, model lifecycle setup, and real-time inference handling, all in a single component.</p>
<p>Replace your entry file with the following code:</p>
<pre><code class="language-typescript">import { ChatInput } from "@/components/chat-input";
import { ChatMessage, Message } from "@/components/chat-message";
import { ModelLoader } from "@/components/model-loader";
import { Button } from "@/components/ui/button";
import { Text } from "@/components/ui/text";

import {
  completion,
  deleteCache,
  downloadAsset,
  LLAMA_3_2_1B_INST_Q4_0,
  loadModel,
  type ModelProgressUpdate,
  VERBOSITY,
} from "@qvac/sdk";
import { SymbolView } from "expo-symbols";
import { useEffect, useRef, useState } from "react";

import {
  Clipboard,
  KeyboardAvoidingView,
  Platform,
  SafeAreaView,
  ScrollView,
  View,
} from "react-native";

const makeId = () =&gt; Math.random().toString(36).substring(2, 9);

export default function Index() {
  const [messages, setMessages] = useState&lt;Message[]&gt;([]);
  const [input, setInput] = useState("");
  const [isGenerating, setIsGenerating] = useState(false);

  // Model loading state
  const [modelId, setModelId] = useState&lt;string | null&gt;(null);
  const [isModelLoaded, setIsModelLoaded] = useState(false);
  const [isDownloading, setIsDownloading] = useState(false);
  const [downloadProgress, setDownloadProgress] = useState(0);

  const scrollViewRef = useRef&lt;ScrollView&gt;(null);
  const messagesRef = useRef&lt;Message[]&gt;([]);

  useEffect(() =&gt; {
    messagesRef.current = messages;
  }, [messages]);

  const startDownload = () =&gt; {
    setIsDownloading(true);
    setupModel();
  };

  // Automatically scroll to bottom when messages list updates
  useEffect(() =&gt; {
    if (scrollViewRef.current) {
      setTimeout(() =&gt; {
        scrollViewRef.current?.scrollToEnd({ animated: true });
      }, 100);
    }
  }, [messages, isGenerating]);

  const copyToClipboard = (text: string) =&gt; {
    if (Platform.OS === "web") {
      navigator.clipboard.writeText(text);
    } else {
      Clipboard.setString(text);
    }
  };

  const setupModel = async () =&gt; {
    try {
      setIsDownloading(true);
      setDownloadProgress(0);
      
      // 1. Local download path execution
      await downloadAsset({
        assetSrc: LLAMA_3_2_1B_INST_Q4_0,
        onProgress: (progress: ModelProgressUpdate) =&gt; {
          setDownloadProgress(progress.percentage / 100);
        },
      });

      setDownloadProgress(1);

      // 2. Load model into runtime memory
      const loadedModel = await loadModel({
        modelSrc: LLAMA_3_2_1B_INST_Q4_0,
        modelType: "llm",
        modelConfig: {
          device: "gpu",
          ctx_size: 2048,
          verbosity: VERBOSITY.ERROR,
        },
      });

      setModelId(loadedModel);
      setIsModelLoaded(true);
      setIsDownloading(false);
    } catch (e: any) {
      console.error("Error setting up model:", e);
      setIsDownloading(false);
    }
  };

  async function handleSend() {
    // Guard against sending before the model is ready or while generating.
    if (!modelId || isGenerating) return;

    const trimmed = input.trim();
    if (!trimmed) return;

    setInput("");
    setIsGenerating(true);

    // Append user message and a placeholder assistant message for streaming.
    const userMsg: Message = {
      id: makeId(),
      role: "user",
      content: trimmed,
    };

    const assistantId = makeId();

    const assistantMsg: Message = {
      id: assistantId,
      role: "assistant",
      content: "",
    };

    setMessages((prev) =&gt; [...prev, userMsg, assistantMsg]);

    try {
      // Build chat history for the completion request.
      const history = [...messagesRef.current, userMsg].map((m) =&gt; ({
        role: m.role,
        content: m.content,
      }));

      // Run a streaming completion and update the last assistant bubble.
      const result = completion({
        modelId,
        history,
        stream: true,
      });

      let acc = "";

      for await (const token of result.tokenStream) {
        acc += token;

        // Update only the last assistant message content
        setMessages((prev) =&gt;
          prev.map((m) =&gt;
            m.id === assistantId ? { ...m, content: acc } : m
          )
        );
      }

      // Optional: Log completion performance stats
      try {
        const stats = await result.stats;
        console.log("📊 Completion stats:", stats);
      } catch {}

    } catch (e: any) {
      // Show any error in the assistant bubble.
      setMessages((prev) =&gt;
        prev.map((m) =&gt;
          m.id === assistantId
            ? { ...m, content: `❌ Error: ${e?.message ?? String(e)}` }
            : m
        )
      );
    } finally {
      setIsGenerating(false);
    }
  }

  if (!isModelLoaded) {
    return (
      &lt;ModelLoader
        onDownload={startDownload}
        isDownloading={isDownloading}
        progress={downloadProgress}
      /&gt;
    );
  }

  return (
    &lt;SafeAreaView className="flex-1 bg-background"&gt;
      &lt;KeyboardAvoidingView
        behavior={Platform.OS === "ios" ? "padding" : "height"}
        className="flex-1"
      &gt;
        &lt;View className="flex-row items-center justify-between p-4 border-b border-border"&gt;
          &lt;View className="flex-row items-center gap-2"&gt;
            &lt;View className="w-2 h-2 rounded-full bg-emerald-500" /&gt;
            &lt;Text className="font-semibold text-lg"&gt;Local Llama 3.2&lt;/Text&gt;
          &lt;/View&gt;
          &lt;Text className="text-xs text-muted-foreground"&gt;Offline Engine&lt;/Text&gt;
        &lt;/View&gt;

        &lt;ScrollView
          ref={scrollViewRef}
          className="flex-1 px-4"
          contentContainerStyle={{ paddingVertical: 16, gap: 16 }}
        &gt;
          {messages.filter(m =&gt; m.content !== "" || m.role === "assistant").map((msg) =&gt; (
            &lt;ChatMessage
              key={msg.id}
              message={msg}
              onCopy={() =&gt; copyToClipboard(msg.content)}
            /&gt;
          ))}
        &lt;/ScrollView&gt;

        &lt;ChatInput
          value={input}
          onChangeText={setInput}
          onSend={handleSend}
          disabled={isGenerating}
          placeholder={isGenerating ? "Thinking..." : "Type a message..."}
        /&gt;
      &lt;/KeyboardAvoidingView&gt;
    &lt;/SafeAreaView&gt;
  );
}
</code></pre>
<h3 id="heading-codebase-breakdown">Codebase Breakdown</h3>
<p>Let’s lift the hood on how this unified component manages local model workflows and real-time UI streaming.</p>
<h4 id="heading-1-tracking-model-state-amp-asynchronous-synchronization">1. Tracking Model State &amp; Asynchronous Synchronization</h4>
<p>At the root of the component, we track both user-facing interface state and underlying QVAC runtime handles:</p>
<pre><code class="language-typescript">const [messages, setMessages] = useState&lt;Message[]&gt;([]);
const [modelId, setModelId] = useState&lt;string | null&gt;(null);
const [isModelLoaded, setIsModelLoaded] = useState(false);
const [isDownloading, setIsDownloading] = useState(false);
const [downloadProgress, setDownloadProgress] = useState(0);
</code></pre>
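<p>React state updates are asynchronous, and a function like <code>handleSend()</code> only sees the <code>messages</code> value from the render in which it was created. A long-running streaming loop can therefore end up working with a stale copy of the chat log.</p>
<p>To work around this, a mutable <code>messagesRef</code> mirrors the latest committed <code>messages</code> array, so asynchronous code can always read the current session history:</p>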
<pre><code class="language-typescript">const messagesRef = useRef&lt;Message[]&gt;([]);

useEffect(() =&gt; {
  messagesRef.current = messages;
}, [messages]);
</code></pre>
<h4 id="heading-2-orchestrating-download-amp-memory-instantiation">2. Orchestrating Download &amp; Memory Instantiation</h4>
<p>When the user taps the download button, the application calls <code>setupModel()</code>. This function handles two separate steps: downloading the model to local storage, then loading it into memory:</p>
<pre><code class="language-typescript">await downloadAsset({
  assetSrc: LLAMA_3_2_1B_INST_Q4_0,
  onProgress: (progress: ModelProgressUpdate) =&gt; {
    setDownloadProgress(progress.percentage / 100);
  },
});
</code></pre>
<ul>
<li><p><strong>Storage Sync:</strong> <code>downloadAsset</code> downloads the specified model file to the device's local storage and reports progress along the way.</p>
</li>
<li><p><strong>Hardware Binding:</strong> Once the file is on disk, <code>loadModel</code> loads it into the inference runtime:</p>
</li>
</ul>
<pre><code class="language-typescript">const loadedModel = await loadModel({
  modelSrc: LLAMA_3_2_1B_INST_Q4_0,
  modelType: "llm",
  modelConfig: {
    device: "gpu",
    ctx_size: 2048,
    verbosity: VERBOSITY.ERROR,
  },
});
</code></pre>
<p>Passing <code>device: "gpu"</code> asks QVAC to run inference on the phone's GPU rather than the CPU, which is generally much faster for this kind of workload.</p>
<h4 id="heading-3-pipeline-ingest-amp-streaming-generation-loop">3. Pipeline Ingest &amp; Streaming Generation Loop</h4>
<p>Once the input has been validated, <code>handleSend()</code> appends the user's message and an empty assistant placeholder message that will receive the streamed tokens.</p>
<p>It then maps the messages in <code>messagesRef.current</code>, plus the new user message, into a list of <code>role</code>/<code>content</code> pairs and passes that history to <code>completion</code>:</p>
<pre><code class="language-typescript">const result = completion({
  modelId,
  history,
  stream: true,
});
</code></pre>
<p>With <code>stream: true</code> enabled, QVAC doesn't make your application wait for the full response. Instead, it returns an asynchronous iterable that yields each token as soon as it is generated:</p>
<pre><code class="language-typescript">let acc = "";

for await (const token of result.tokenStream) {
  acc += token;

  setMessages((prev) =&gt;
    prev.map((m) =&gt;
      m.id === assistantId ? { ...m, content: acc } : m
    )
  );
}
</code></pre>
<p>The loop appends each token to the accumulator (<code>acc</code>) and updates only the placeholder message identified by <code>assistantId</code>. The result is a live typing effect, produced entirely offline on the user's device.</p>
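<p>To see the accumulate-and-patch step in isolation, here is a minimal sketch that runs outside React. The <code>fakeTokenStream</code> generator and the <code>patchMessage</code> and <code>streamInto</code> helpers are hypothetical stand-ins for QVAC's <code>tokenStream</code> and the <code>setMessages</code> updater; they are not part of the SDK.</p>

```typescript
type Message = { id: string; role: "user" | "assistant"; content: string };

// Hypothetical stand-in for result.tokenStream: yields tokens one at a time.
async function* fakeTokenStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token;
  }
}

// Pure version of the updater passed to setMessages: replaces the content of
// the message with the given id and leaves every other message untouched.
function patchMessage(messages: Message[], id: string, content: string): Message[] {
  return messages.map((m) => (m.id === id ? { ...m, content } : m));
}

// Consumes the stream, growing the accumulator and patching the placeholder
// after every token, mirroring the loop in handleSend().
async function streamInto(
  messages: Message[],
  assistantId: string,
  stream: AsyncIterable<string>
): Promise<Message[]> {
  let acc = "";
  let current = messages;
  for await (const token of stream) {
    acc += token;
    current = patchMessage(current, assistantId, acc);
  }
  return current;
}
```

<p>Because <code>patchMessage</code> returns a new array and reuses every untouched message object, React can re-render only the bubble whose content changed.</p>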
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a local-first AI application is no longer a concept confined to high-end desktops or specialized research labs. As we’ve seen, the smartphones we carry in our pockets every day possess more than enough computational muscle and dedicated hardware acceleration to run highly capable language models completely offline.</p>
<p>By leveraging React Native and the QVAC SDK, we successfully bypassed the traditional cloud-dependent architecture. We eliminated the need for complex server infrastructure, API key management, and recurring token subscription fees, all while providing an ultra-private, low-latency, streaming chat experience directly on-device.</p>
<p>As open-source models continue to shrink in size and grow in capabilities, edge inference will become an essential architecture for developers prioritizing privacy, offline resilience, and cost efficiency. The power to compute is back where it belongs: in the hands of the user.</p>
<h3 id="heading-resources-amp-further-reading">Resources &amp; Further Reading</h3>
<p>To dive deeper into local inference, inspect the source code, or explore advanced configurations for your mobile applications, check out the following resources:</p>
<ul>
<li><p><a href="https://docs.qvac.tether.io/tutorials/expo/"><strong>QVAC Expo Integration Tutorial</strong></a> – The official step-by-step documentation for configuring QVAC within the Expo and React Native ecosystems.</p>
</li>
<li><p><a href="https://github.com/DjibrilM/QVAC-offline-Chatbot-Article-Project-"><strong>Project GitHub Repository</strong></a> – Access the complete source code, including the UI layout components, starter themes, and full configuration files used in this guide.</p>
</li>
<li><p><a href="https://github.com/ggml-org/llama.cpp"><strong>Llama.cpp Official Repository</strong></a> – Learn more about the underlying inference engine that powers QVAC's hardware-accelerated local execution.</p>
</li>
<li><p><a href="https://huggingface.co/models?search=gguf"><strong>Hugging Face GGUF Models</strong></a> – Explore thousands of open-source, quantized models that you can download and experiment with inside your local application.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The LLM Gateway Pattern: Why Every Kubernetes-Based AI App Needs One ]]>
                </title>
                <description>
                    <![CDATA[ You ship your first LLM-powered feature. It works and the users love it. A second team adds another feature calling a different model, and a third integrates a completely different provider. Six month ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-llm-gateway-pattern-why-every-kubernetes-based-ai-app-needs-one/</link>
                <guid isPermaLink="false">6a20607178a43e3153ae3cc4</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ development ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:12:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/35be7043-56b7-4df6-b56b-a48620be2dd8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You ship your first LLM-powered feature. It works and the users love it. A second team adds another feature calling a different model, and a third integrates a completely different provider.</p>
<p>Six months later, you have fourteen microservices, each holding its own API keys, writing its own retry logic, and failing in its own unique way.</p>
<p>Nobody knows how much you're spending on tokens or which service is hammering the rate limit. And when OpenAI goes down, everything goes down with it.</p>
<p>That scenario plays out across engineering teams every single day, and the root cause is almost always the same: moving fast with LLMs while skipping the infrastructure thinking that holds everything together at scale.</p>
<p>Fortunately, a well-established architectural pattern solves exactly these problems. If you already run Kubernetes, you're more than halfway to implementing it. That pattern is called the LLM Gateway Pattern, and this article walks you through what it is, why it matters, and how to put it into practice.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-the-llm-gateway-pattern">What Is the LLM Gateway Pattern?</a></p>
<ul>
<li><a href="#heading-how-it-works">How It Works</a></li>
</ul>
</li>
<li><p><a href="#heading-the-problem-without-a-gateway">The Problem Without a Gateway</a></p>
</li>
<li><p><a href="#heading-deploying-an-llm-gateway-on-kubernetes">Deploying an LLM Gateway on Kubernetes</a></p>
<ul>
<li><p><a href="#heading-storing-api-keys-securely">Storing API Keys Securely</a></p>
</li>
<li><p><a href="#heading-defining-routing-rules-in-a-configmap">Defining Routing Rules in a ConfigMap</a></p>
</li>
<li><p><a href="#heading-scaling-the-gateway">Scaling the Gateway</a></p>
</li>
<li><p><a href="#heading-wiring-up-observability">Wiring Up Observability</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-features-of-an-llm-gateway">Features of an LLM Gateway</a></p>
<ul>
<li><p><a href="#heading-multi-provider-routing">Multi-Provider Routing</a></p>
</li>
<li><p><a href="#heading-semantic-caching">Semantic Caching</a></p>
</li>
<li><p><a href="#heading-rate-limiting-per-consumer">Rate Limiting Per Consumer</a></p>
</li>
<li><p><a href="#heading-fallback-and-failover">Fallback and Failover</a></p>
</li>
<li><p><a href="#heading-token-usage-tracking">Token Usage Tracking</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-the-llm-gateway-pattern">What Is the LLM Gateway Pattern?</h2>
<p>The LLM Gateway Pattern is an architectural approach where all LLM API traffic from your applications flows through a single, centralized proxy service before reaching any external provider. Think of it as the AI equivalent of an API gateway, except it's purpose-built for the unique challenges that come with language models: token budgets, streaming responses, model routing, semantic caching, and multi-provider fallback.</p>
<p>Instead of every service in your cluster talking directly to OpenAI or Anthropic, they all talk to one internal gateway. That gateway handles authentication, routing, rate limiting, logging, and failover. Your application services stay clean and focused on business logic, while the gateway takes on all the messy operational concerns of working with LLMs at scale.</p>
<p>The pattern itself is not new in concept. Engineers have used API gateways for years to manage REST traffic. What makes LLM gateways distinct is that they understand the specific shape of LLM requests, including token counts, model parameters, prompt structure, and streaming semantics.</p>
<h3 id="heading-how-it-works">How It Works</h3>
<p>The core components of an LLM Gateway on Kubernetes are straightforward. Here is the high-level flow:</p>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/2aaa42ed-d6b4-4a9e-9d4c-2faa42e76783.png" alt="Diagram showing how LLM Gateway works on Kubernetes" style="display:block;margin:0 auto" width="1162" height="718" loading="lazy">

<p><strong>App Pods</strong> send requests to the gateway using a standard OpenAI-compatible API format. Because of this, most existing LLM client libraries work without modification: you just change the base URL to point at your internal gateway service.</p>
<p><strong>The Gateway Service</strong> receives each incoming request, authenticates the caller, applies any configured rate limits, checks the cache, selects the appropriate upstream provider based on routing rules, and forwards the request. On the way back, it logs token usage and latency before returning the response to the caller.</p>
<p><strong>ConfigMap</strong> holds the routing rules. Which model should handle requests tagged as fast? Which provider should the system fall back to if the primary one is unavailable? All of this lives in configuration, not code, so you can update routing behaviour without redeploying anything.</p>
<p><strong>Secrets</strong> hold the actual API keys for each provider. The gateway is the only service in the cluster that needs access to them. Application pods never touch provider credentials directly.</p>
<p><strong>Provider endpoints</strong> are the actual LLM APIs: OpenAI, Anthropic, a self-hosted vLLM instance running in your cluster, or any other provider that exposes an OpenAI-compatible interface.</p>
<h2 id="heading-the-problem-without-a-gateway">The Problem Without a Gateway</h2>
<p>To appreciate why this pattern matters, it helps to look at what happens when you skip it.</p>
<h3 id="heading-1-scattered-secrets-and-no-central-control">1. Scattered Secrets and No Central Control</h3>
<p>Every service that calls an LLM needs an API key. In Kubernetes, this usually means creating a <a href="https://kubernetes.io/docs/concepts/configuration/secret/">Secret</a> per namespace or per deployment.</p>
<p>When that key rotates or gets compromised, you're hunting through dozens of manifests to update it. There's no single place to revoke access or audit who is calling what.</p>
<h3 id="heading-2-no-visibility-into-cost-or-usage">2. No Visibility into Cost or Usage</h3>
<p>LLM APIs charge per token. Without a centralized layer collecting usage data, you have no reliable way to know which service is responsible for that spike in your monthly bill.</p>
<h3 id="heading-3-provider-lock-in-at-the-application-level">3. Provider Lock-in at the Application Level</h3>
<p>When you hardcode <a href="https://api.openai.com">https://api.openai.com</a> into your service, switching to a different provider or routing certain requests to a cheaper model becomes a code change. You need to redeploy your application just to change which model handles a request type.</p>
<h3 id="heading-4-no-caching">4. No Caching</h3>
<p>Many LLM applications send semantically similar or identical prompts repeatedly. Without a shared caching layer, each one incurs full token costs and full latency. The savings from even basic caching can be significant.</p>
<p>All of these problems compound as your team grows and more services start calling LLMs. The gateway pattern cuts through all of them in one architectural decision.</p>
<h2 id="heading-deploying-an-llm-gateway-on-kubernetes">Deploying an LLM Gateway on Kubernetes</h2>
<p>There are several tools that can serve as an LLM gateway in a Kubernetes environment, including <a href="https://docs.litellm.ai/docs/simple_proxy">LiteLLM Proxy</a>, <a href="https://portkey.ai/">Portkey</a>, <a href="https://openrouter.ai/">OpenRouter</a>, and Envoy with custom filters.</p>
<p>For the rest of this walkthrough, we'll use LiteLLM Proxy. It ships with a Helm chart, supports over a hundred models across all major providers, and comes with a management UI that makes initial configuration straightforward.</p>
<h3 id="heading-storing-api-keys-securely">Storing API Keys Securely</h3>
<p>Start by creating a Kubernetes Secret that holds your provider API keys. Your gateway pods will consume these credentials as environment variables, which means no provider key ever needs to live inside your application containers:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: llm-gateway-secrets
  namespace: ai-platform
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-..."
  ANTHROPIC_API_KEY: "sk-ant-..."
</code></pre>
<h3 id="heading-defining-routing-rules-in-a-configmap">Defining Routing Rules in a <code>ConfigMap</code></h3>
<p>The routing configuration tells the gateway which models are available and how to reach each one. Keeping this in a <code>ConfigMap</code> means you can update your routing rules without touching a single line of application code:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-gateway-config
  namespace: ai-platform
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
      - model_name: fast
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY
</code></pre>
<p>With this configuration in place, any application in your cluster can reach the gateway at <a href="http://llm-gateway.ai-platform.svc.cluster.local">http://llm-gateway.ai-platform.svc.cluster.local</a> using the standard OpenAI client format, regardless of which actual provider sits behind it.</p>
<h3 id="heading-scaling-the-gateway">Scaling the Gateway</h3>
<p>Because the gateway is stateless, horizontal scaling is straightforward. You can attach a <code>HorizontalPodAutoscaler</code> to scale on CPU utilization, as shown here, or on request rate if you expose that metric through a custom metrics adapter:</p>
<pre><code class="language-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gateway-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
</code></pre>
<h3 id="heading-wiring-up-observability">Wiring Up Observability</h3>
<p>A gateway you can't observe is a gateway you can't trust, so wiring up monitoring before you go to production is worth the extra hour it takes.</p>
<p>LiteLLM exposes a <code>/metrics</code> endpoint in Prometheus format. You can scrape it with a standard <code>ServiceMonitor</code> if you run the Prometheus Operator, or configure Prometheus directly to target the gateway service.</p>
<p>The metrics that matter most in day-to-day operations are token throughput per model, request latency percentiles, error rates per provider, and cache hit ratio.</p>
<p>Once Prometheus is collecting that data, you can build Grafana dashboards that show token spend broken down by caller, model, and time period. This gives engineering managers and finance teams the cost visibility they've been asking for, and it takes surprisingly little effort to set up once the metrics pipeline is in place.</p>
<p>If you run an OpenTelemetry collector in your cluster, you can also configure the gateway to emit trace spans for every LLM request. This lets you see the full latency breakdown from the moment a user action triggers a call in your application all the way through to the provider response. So when something is slow, you can tell immediately whether the bottleneck sits in your service, the gateway, or upstream with the provider.</p>
<h2 id="heading-features-of-an-llm-gateway">Features of an LLM Gateway</h2>
<p>Not all gateway implementations are equal, so as your needs grow, these are the core capabilities worth evaluating.</p>
<h3 id="heading-multi-provider-routing">Multi-Provider Routing</h3>
<p>A well-built gateway routes requests to different providers based on declarative, configurable rules that live entirely outside your application code. This means that changing a model never requires a redeployment.</p>
<h3 id="heading-semantic-caching">Semantic Caching</h3>
<p>Rather than only caching byte-for-byte identical prompts, a semantic cache uses embedding similarity to recognise when two different prompts are asking essentially the same thing. This can cut redundant API calls dramatically.</p>
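<p>As a rough illustration of the lookup step, the sketch below compares embeddings with cosine similarity and returns a cached response when the best match clears a threshold. The <code>SemanticCache</code> class and the <code>0.9</code> threshold are assumptions made for this example; a real gateway would obtain embeddings from an embedding model and typically use a vector index rather than a linear scan.</p>

```typescript
type CacheEntry = { embedding: number[]; response: string };

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory semantic cache keyed by prompt embeddings.
class SemanticCache {
  private readonly entries: CacheEntry[] = [];
  private readonly threshold: number;

  constructor(threshold: number = 0.9) {
    this.threshold = threshold;
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }

  // Returns the response of the most similar entry, or null on a cache miss.
  lookup(embedding: number[]): string | null {
    let best: CacheEntry | null = null;
    let bestScore = -1;
    for (const entry of this.entries) {
      const score = cosineSimilarity(entry.embedding, embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }
}
```

<p>The threshold is the main tuning knob: set it too low and users receive answers to questions they didn't ask, set it too high and the cache rarely hits.</p>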
<h3 id="heading-rate-limiting-per-consumer">Rate Limiting Per Consumer</h3>
<p>The gateway should let you set token budgets and request limits per team, per namespace, or per application, so no single runaway service can starve the rest of your cluster or drive up costs unchecked.</p>
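<p>One simple way to picture a per-consumer token budget is a fixed-window counter, sketched below. The <code>TokenBudget</code> class and its parameters are hypothetical and do not reflect how any particular gateway implements limits; production systems usually keep these counters in a shared store so every gateway replica sees the same totals.</p>

```typescript
// Tracks tokens spent per consumer inside a fixed time window.
class TokenBudget {
  private readonly used = new Map<string, { windowStart: number; tokens: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the spend if the consumer stays within budget;
  // returns false without recording anything if the request would exceed it.
  tryConsume(consumer: string, tokens: number, now: number = Date.now()): boolean {
    const entry = this.used.get(consumer);
    const windowExpired = entry === undefined || now - entry.windowStart >= this.windowMs;
    if (entry === undefined || windowExpired) {
      if (tokens > this.limit) return false;
      // Start a fresh window for this consumer.
      this.used.set(consumer, { windowStart: now, tokens });
      return true;
    }
    if (entry.tokens + tokens > this.limit) return false;
    entry.tokens += tokens;
    return true;
  }
}
```

<p>Keying the map by team, namespace, or application name gives each consumer its own independent budget.</p>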
<h3 id="heading-fallback-and-failover">Fallback and Failover</h3>
<p>When a primary provider fails or exceeds acceptable latency thresholds, the gateway should automatically retry against a configured fallback. This centralizes logic that is notoriously hard to get right inside individual services.</p>
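<p>The core of that logic can be sketched in a few lines: try each provider in order, treat a timeout as a failure, and move on to the next. The <code>Provider</code> type, <code>completeWithFallback</code>, and <code>withTimeout</code> are hypothetical names for this illustration and are not taken from any gateway's API.</p>

```typescript
type Provider = { name: string; call: (prompt: string) => Promise<string> };

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Tries providers in the order given and returns the first successful result.
// Throws only when every provider has failed or timed out.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
  timeoutMs: number
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const text = await withTimeout(provider.call(prompt), timeoutMs);
      return { provider: provider.name, text };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

<p>Collecting the per-provider errors makes the final failure message useful for debugging, since it shows why each upstream was skipped.</p>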
<h3 id="heading-token-usage-tracking">Token Usage Tracking</h3>
<p>Every request should produce a detailed usage record capturing input tokens, output tokens, model, caller identity, and latency. This gives engineering managers the clear, actionable picture of AI spending they need.</p>
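<p>To make the shape of such a record concrete, here is a small sketch with a helper that totals tokens per caller. The <code>UsageRecord</code> field names and the <code>totalTokensByCaller</code> function are assumptions for illustration; actual gateways define their own log schema.</p>

```typescript
// Hypothetical usage record with the fields described above.
type UsageRecord = {
  caller: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
};

// Sums input and output tokens for each caller.
function totalTokensByCaller(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const record of records) {
    const previous = totals.get(record.caller) ?? 0;
    totals.set(record.caller, previous + record.inputTokens + record.outputTokens);
  }
  return totals;
}
```

<p>The same grouping works for model or time period; swap the map key to change the breakdown.</p>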
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The LLM Gateway Pattern solves a set of operational problems that every team building on language models at scale will eventually run into. Scattered secrets, invisible costs, inconsistent failure handling, and provider lock-in are all symptoms of the same underlying issue: infrastructure concerns leaking into services that shouldn't have to deal with them.</p>
<p>A centralized gateway on Kubernetes gives your application teams a stable, provider-agnostic interface while giving your platform team the visibility and controls they need to manage cost and reliability effectively. When a provider goes down in the middle of the night, your configured fallback kicks in automatically instead of someone waking up to a page.</p>
<p>Start with LiteLLM Proxy, wire up the Prometheus metrics, build a simple Grafana dashboard, and watch how quickly the pattern pays for itself. Once you have seen what centralized LLM traffic management looks like in practice, it becomes very hard to go back to doing it any other way.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Design APIs for AI Agents ]]>
                </title>
                <description>
                    <![CDATA[ APIs are designed for human developers. People read documentation, infer the intent behind an endpoint, and know how to handle edge cases when something unexpected happens. AI agents don't have that c ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-design-apis-for-ai-agents/</link>
                <guid isPermaLink="false">6a18bdb078258754833f8205</guid>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ api ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ David Aniebo ]]>
                </dc:creator>
                <pubDate>Thu, 28 May 2026 22:12:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/056b20d6-7409-4b6e-a29c-0b48061a7508.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>APIs are designed for human developers. People read documentation, infer the intent behind an endpoint, and know how to handle edge cases when something unexpected happens.</p>
<p>AI agents don't have that context and understanding.</p>
<p>AI agents understand APIs through schemas, examples, sample data, and live responses. When a behavior or method is ambiguous or inconsistent, the model doesn't pause to “think”; it fills in the blanks with a guess.</p>
<p>In production, those guesses can turn into blocked requests, retry storms, duplicated side effects, or broken workflows.</p>
<p>This is why APIs that are perfectly fine for humans frequently fail under AI agent use. The problem is rarely “the agent isn’t smart enough.” More often, the API was never designed for an agent/machine consumer that must plan, call tools, and recover from failure without a human in the loop.</p>
<p>In this guide, you’ll learn how to design APIs that agents can use reliably. We’ll anchor the discussion in three practical ideas:</p>
<ol>
<li><p><strong>Deterministic behavior:</strong> same inputs and state should yield predictable outcomes and shapes.</p>
</li>
<li><p><strong>Strong schemas:</strong> contracts that are complete, descriptive, and testable.</p>
</li>
<li><p><strong>Guardrails at the API boundary:</strong> authorization, validation, and safe defaults that prevent unsafe autonomy.</p>
</li>
</ol>
<p>The aim of this article is not to build “AI-powered” APIs, but rather to build APIs that are <strong>clear, strict,</strong> and <strong>dependable,</strong> even when the caller is not an agent but a fellow developer using their own tools.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-why-good-enough-for-devs-is-not-good-enough-for-agents">Why “Good Enough for Devs” Is Not Good Enough for Agents</a></p>
</li>
<li><p><a href="#heading-principle-1-deterministic-behavior">Principle 1: Deterministic Behavior</a></p>
</li>
<li><p><a href="#heading-principle-2-strong-schemas">Principle 2: Strong Schemas</a></p>
</li>
<li><p><a href="#heading-principle-3-guardrails-at-the-api-boundary">Principle 3: Guardrails at the API Boundary</a></p>
</li>
<li><p><a href="#heading-patterns-that-bridge-apis-and-agent-runtimes">Patterns That Bridge APIs and Agent Runtimes</a></p>
</li>
<li><p><a href="#heading-a-practical-before-and-after-example">A Practical Before and After Example</a></p>
</li>
<li><p><a href="#heading-checklist-is-your-api-agent-ready">Checklist: Is Your API Agent-Ready?</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before reading this guide, it helps to have:</p>
<ul>
<li><p>A basic understanding of HTTP APIs and REST concepts</p>
</li>
<li><p>Familiarity with JSON and API request/response patterns</p>
</li>
<li><p>An understanding of common API concepts like authentication, pagination, and retries</p>
</li>
</ul>
<h2 id="heading-why-good-enough-for-devs-is-not-good-enough-for-agents">Why “Good Enough for Devs” Is Not Good Enough for Agents</h2>
<p>Human developers bring implied and contextual knowledge: they read through Slack threads, read blog posts, and recognize that “this 404 usually means you forgot the workspace ID.”</p>
<p>Agents mostly get whatever is in the spec, the examples, and the last response body.</p>
<p>That gap shows up in predictable ways:</p>
<ul>
<li><p><strong>Ambiguous semantics:</strong> wrong endpoint or wrong parameter combination.</p>
</li>
<li><p><strong>Undocumented branches:</strong> the model invents fields or misreads optional behavior.</p>
</li>
<li><p><strong>Inconsistent error bodies:</strong> retries that shouldn't happen, or no retry when one is safe.</p>
</li>
<li><p><strong>Non-idempotent “do things” endpoints:</strong> duplicate charges, duplicate tickets, duplicate emails.</p>
</li>
</ul>
<p>Industry commentary and practitioner guides converge on the same point: agents are becoming a major class of API consumer, and machine legibility matters as much as developer experience.</p>
<p>See, for example, the discussions of OpenAPI as the source of truth for agents, emerging tool protocols, and traffic patterns that differ from human clients in the resources listed at the end of this article.</p>
<h2 id="heading-principle-1-deterministic-behavior">Principle 1: Deterministic Behavior</h2>
<p>Determinism for agents doesn't mean “always return the same JSON forever.” It means: <strong>given the same request and the same server-side state, your API behaves in a way the agent can model</strong>, and when state changes, you make that explicit.</p>
<h3 id="heading-prefer-explicit-state-over-hidden-magic">Prefer Explicit State Over Hidden Magic</h3>
<p>Agents struggle with “sometimes the server does X depending on internal flags.” Where humans infer intent from product copy, agents infer from patterns. If those patterns drift, autonomy breaks.</p>
<p>Practical habits:</p>
<ul>
<li><p>Model lifecycle explicitly (<code>draft</code> → <code>submitted</code> → <code>approved</code>) instead of overloading a single <code>status</code> field with undocumented combinations.</p>
</li>
<li><p>Return what changed after mutations (updated resource, relevant IDs, next allowed actions).</p>
</li>
<li><p>Avoid silent coercion (auto-correcting bad enums, silently dropping unknown fields) unless you document and signal it.</p>
</li>
</ul>
<h3 id="heading-make-writes-safe-idempotency-and-intent-keys">Make Writes Safe: Idempotency and Intent Keys</h3>
<p>For any endpoint that bills, sends messages, provisions infrastructure, or otherwise <strong>does something irreversible</strong>, assume double-submission will happen.</p>
<ul>
<li><p>Support idempotency keys (header or body) for create-like operations.</p>
</li>
<li><p>Use clear HTTP semantics: <code>POST</code> creates, <code>PUT</code> replaces where appropriate, and <code>PATCH</code> applies partial updates. Document what happens when each request is repeated.</p>
</li>
<li><p>Where duplicates are possible, offer a lookup-by-client-reference path so agents can reconcile.</p>
</li>
</ul>
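<p>The idempotency-key idea from the list above can be sketched as a small in-memory service. The <code>ChargeService</code> class, the <code>ch_</code> ID format, and the rule that a reused key with a different payload is rejected are all assumptions for this example; a real implementation would persist keys and expire them after some retention period.</p>

```typescript
type Charge = { id: string; amountCents: number };

// Hypothetical service that creates at most one charge per idempotency key.
class ChargeService {
  private readonly byKey = new Map<string, Charge>();
  private nextId = 1;

  // The first call with a key creates the charge. Repeats with the same key
  // and payload return the stored result instead of charging again.
  createCharge(
    idempotencyKey: string,
    amountCents: number
  ): { charge: Charge; replayed: boolean } {
    const existing = this.byKey.get(idempotencyKey);
    if (existing !== undefined) {
      if (existing.amountCents !== amountCents) {
        // Same key, different request: refuse rather than guess the intent.
        throw new Error("idempotency key reused with a different payload");
      }
      return { charge: existing, replayed: true };
    }
    const charge: Charge = { id: `ch_${this.nextId++}`, amountCents };
    this.byKey.set(idempotencyKey, charge);
    return { charge, replayed: false };
  }
}
```

<p>With this in place, an agent that retries after a timeout gets the original charge back, and the <code>replayed</code> flag tells it that nothing new happened.</p>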
<h3 id="heading-pagination-and-sorting-one-pattern-everywhere">Pagination and Sorting: One Pattern, Everywhere</h3>
<p>Agents loop. If every resource paginates differently, the model will mix strategies.</p>
<p>To combat this, pick one pagination style (cursor vs offset) per API surface and stick to it.</p>
<p>Also, always return a stable sort order or require <code>sort</code> explicitly, and include <code>next</code> links or cursors in a consistent envelope.</p>
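<p>A minimal sketch of one such envelope, using cursor pagination over items sorted by ascending <code>id</code>, might look like this. The <code>Page</code> type, the <code>listPage</code> function, and the choice to use the last item's ID as the cursor are assumptions made for illustration.</p>

```typescript
// One envelope shape for every list endpoint: items plus the cursor for the
// next page, or null when there is nothing left.
type Page<T> = { items: T[]; nextCursor: string | null };

// Returns up to `limit` items whose id is greater than the cursor.
function listPage<T extends { id: number }>(
  all: T[],
  limit: number,
  cursor: string | null
): Page<T> {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  // Sort a copy so the order is stable regardless of storage order.
  const sorted = [...all].sort((a, b) => a.id - b.id);
  const afterId = cursor === null ? -Infinity : Number(cursor);
  const remaining = sorted.filter((item) => item.id > afterId);
  const items = remaining.slice(0, limit);
  const last = items[items.length - 1];
  const hasMore = remaining.length > limit;
  return { items, nextCursor: hasMore && last !== undefined ? String(last.id) : null };
}
```

<p>An agent can then loop with a single rule: keep requesting pages until <code>nextCursor</code> is <code>null</code>.</p>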
<h3 id="heading-timeouts-partial-success-and-async-work">Timeouts, Partial Success, and Async Work</h3>
<p>Agents hate “maybe it worked.” Long-running work should be <strong>explicitly async</strong>:</p>
<ul>
<li><p><code>202 Accepted</code> + job ID + polling or webhooks.</p>
</li>
<li><p>Clear terminal states: <code>succeeded</code>, <code>failed</code>, <code>canceled</code>, with structured error details on failure.</p>
</li>
</ul>
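<p>The sketch below simulates that flow end to end: a start call that returns <code>202</code> with a job ID, and a polling loop that stops on a terminal state. The job table, the <code>status_url</code> field, and the scripted state sequence are hypothetical stand-ins for a real backend.</p>

```python
import itertools

TERMINAL_STATES = {"succeeded", "failed", "canceled"}

# Hypothetical in-memory job table; each job walks a scripted list of states.
_jobs = {}
_ids = itertools.count(1)


def start_export() -> tuple:
    """Accept the work and return immediately with 202 and a job ID."""
    job_id = f"job_{next(_ids)}"
    _jobs[job_id] = iter(["queued", "running", "succeeded"])
    return 202, {"job_id": job_id, "status_url": f"/v1/jobs/{job_id}"}


def get_job(job_id: str) -> dict:
    """Simulated status endpoint: advances one scripted state per poll."""
    return {"job_id": job_id, "status": next(_jobs[job_id])}


def wait_for(job_id: str, max_polls: int = 10) -> dict:
    """Poll until a terminal state, with a hard cap so the loop cannot run forever."""
    for _ in range(max_polls):
        job = get_job(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
    raise TimeoutError(f"{job_id} did not finish within {max_polls} polls")


status_code, accepted = start_export()
final = wait_for(accepted["job_id"])
print(status_code, final["status"])
```

<p>A real client would sleep between polls (or wait for a webhook). The cap on polls matters more than the exact number: it turns “maybe it worked” into either a terminal state or an explicit timeout.</p>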
<h2 id="heading-principle-2-strong-schemas">Principle 2: Strong Schemas</h2>
<p>If determinism is about behavior, schemas are about communication. For agents, your OpenAPI (or equivalent) isn't paperwork, it's part of the runtime interface.</p>
<h3 id="heading-treat-openapi-as-a-contract-not-a-souvenir">Treat OpenAPI as a Contract, Not a Souvenir</h3>
<p>A specification that lags production is worse than no spec: it trains the agent to be confidently wrong. Teams increasingly treat OpenAPI as the authoritative contract and validate requests/responses against it in CI and at the edge.</p>
<p>Here's the minimum bar for agent-friendly OpenAPI:</p>
<ul>
<li><p>Every operation has a <code>summary</code> and a <code>description</code> that explain <em>when</em> to use it, not only <em>what</em> it returns.</p>
</li>
<li><p>Every request body property has <code>description</code> and realistic <code>example</code> values.</p>
</li>
<li><p>All responses are documented, including 4xx/5xx, with stable JSON shapes.</p>
</li>
</ul>
<h3 id="heading-describe-intent-in-natural-language-precisely">Describe Intent in Natural Language, Precisely</h3>
<p>Agents aren't offended by verbosity. They're confused by vague verbs.</p>
<p>Instead of:</p>
<blockquote>
<p>“Gets orders.”</p>
</blockquote>
<p>Prefer:</p>
<blockquote>
<p>“Lists orders for the authenticated merchant. Supports filtering by <code>status</code> and a time window on <code>created_at</code>. Returns at most <code>limit</code> items; use <code>cursor</code> for the next page.”</p>
</blockquote>
<p>This aligns with what multiple guides call <strong>context-aware</strong> or <strong>self-describing</strong> APIs: the schema carries semantic intent, not just types.</p>
<h3 id="heading-examples-are-part-of-the-contract">Examples Are Part of the Contract</h3>
<p>You should provide a happy path example per endpoint, at least one validation error example (400) with your standard error object, and examples for optional fields when they change behavior.</p>
<p>Examples reduce “shape hallucination” where the model guesses field names or nesting.</p>
<h3 id="heading-json-schema-strictness-helps-tool-calling-stacks">JSON Schema Strictness Helps Tool-Calling Stacks</h3>
<p>If your agent uses function calling / structured outputs, tighten schemas:</p>
<ul>
<li><p>Prefer <code>enum</code> for small closed sets.</p>
</li>
<li><p>Mark fields <code>required</code> honestly.</p>
</li>
<li><p>Use <code>format</code> (<code>uuid</code>, <code>date-time</code>) where real.</p>
</li>
<li><p>Avoid <code>additionalProperties: true</code> on security-sensitive payloads if you need strict validation.</p>
</li>
</ul>
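<p>To show what a tightened schema might look like, here is a hypothetical tool schema written as a Python dictionary, plus a toy checker. The field names are illustrative, and the checker deliberately covers only <code>required</code>, <code>enum</code>, and <code>additionalProperties</code>. It is not a JSON Schema implementation, and a production system would use a full validator library instead.</p>

```python
# Hypothetical tool schema for a "list orders" call; field names are illustrative.
LIST_ORDERS_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["draft", "submitted", "approved"]},
        "merchant_id": {"type": "string", "format": "uuid"},
        "created_after": {"type": "string", "format": "date-time"},
    },
    "required": ["merchant_id"],
    "additionalProperties": False,
}


def check_arguments(schema: dict, args: dict) -> list:
    """Toy checker covering only required, enum, and additionalProperties."""
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            if schema.get("additionalProperties") is False:
                problems.append(f"unknown field: {name}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


problems = check_arguments(LIST_ORDERS_SCHEMA, {"status": "shipped", "userId": "42"})
print(problems)
```

<p>Each of the three problems reported here (a missing required field, a value outside the enum, and an unknown field) is one an agent can correct on its next attempt, because the schema told it exactly what was allowed.</p>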
<h3 id="heading-name-things-consistently">Name Things Consistently</h3>
<p><code>userId</code> in one endpoint and <code>user_id</code> in another is a human annoyance and an agent trap. Pick a convention and enforce it.</p>
<h2 id="heading-principle-3-guardrails-at-the-api-boundary">Principle 3: Guardrails at the API Boundary</h2>
<p>Autonomy amplifies mistakes. Guardrails turn “oops” into blocked requests instead of incidents.</p>
<h3 id="heading-authorization-should-be-narrow-and-explicit">Authorization Should Be Narrow and Explicit</h3>
<p>Agents should receive credentials scoped to <strong>least privilege</strong>. For example, use short-lived tokens, with refresh documented clearly. Use scopes that map to real actions (<code>orders:read</code> vs <code>orders:write</code>). And avoid flows that assume a human is available to solve a CAPTCHA or click an email link mid-run, or isolate those steps as human-in-the-loop tools.</p>
<h3 id="heading-validate-hard-fail-loud-and-structured">Validate Hard, Fail Loud and Structured</h3>
<p>Reject bad input at the edge with stable <code>error_code</code> values (machine-actionable), human-readable <code>message</code> (for logs and UI), optional <code>field</code> or JSON Pointer to the problem, and optional <code>doc_url</code> linking to documentation.</p>
<p>This matches guidance from several practitioner articles: opaque 500s and generic errors are where autonomous clients spiral.</p>
<p>RFC 7807 Problem Details (<code>application/problem+json</code>), since superseded by RFC 9457, is a good, widely understood pattern for HTTP APIs: a structured envelope agents can parse consistently.</p>
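<p>A small helper can keep every error in the same shape. The sketch below builds a Problem Details style body with the extra members described above. The <code>problem</code> helper, the <code>INVALID_CURRENCY</code> code, and the way the <code>type</code> URL is derived from the error code are assumptions made for the example.</p>

```python
import json


def problem(status: int, error_code: str, title: str, detail: str,
            field: str = None, doc_url: str = None) -> dict:
    """Build a Problem Details style body with the extension members named above."""
    body = {
        "type": f"https://api.example.com/problems/{error_code.lower().replace('_', '-')}",
        "title": title,
        "status": status,
        "detail": detail,
        "error_code": error_code,
    }
    # Optional members are omitted rather than sent as null.
    if field is not None:
        body["field"] = field
    if doc_url is not None:
        body["doc_url"] = doc_url
    return body


error = problem(
    400, "INVALID_CURRENCY", "Invalid currency",
    "currency must be a three-letter ISO 4217 code.",
    field="/total/currency",
)
print(json.dumps(error, indent=2))
```

<p>Here <code>field</code> is written as a JSON Pointer, so an agent can locate the offending value without parsing the human-readable <code>detail</code> text.</p>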
<h3 id="heading-separate-read-the-world-from-change-the-world">Separate “Read the World” from “Change the World”</h3>
<p>For high-impact actions (refunds, deletes, transfers), consider using a two-step pattern: first create an intent, then confirm execution.</p>
<p>Alternatively, you can offer dry-run query parameters or dedicated endpoints that validate a request without committing it.</p>
<p>Also tune rate limits and quotas for bursty agent behavior, because autonomous loops can dwarf human traffic.</p>
<h3 id="heading-observability-is-a-product-feature">Observability is a Product Feature</h3>
<p>Log correlation IDs, surface them in responses where safe, and monitor for retry amplification. An agent that misreads a 409 as “retry forever” becomes a denial-of-wallet attack on your own systems.</p>
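<p>On the client side, the antidote to retry amplification is an explicit, bounded retry policy. The sketch below is one possible policy, not a standard. The set of retryable statuses, the attempt cap, and the backoff schedule are assumptions, and the function records the delays instead of actually sleeping so the example runs instantly.</p>

```python
# Assumed policy: which statuses are worth retrying is a design choice, not a standard.
RETRYABLE_STATUSES = {429, 502, 503, 504}
MAX_ATTEMPTS = 4


def call_with_retries(send, base_delay: float = 0.5) -> dict:
    """Call send() until it succeeds, hits a non-retryable status, or runs out of attempts.

    Returns the final status plus the delays that would have been slept.
    """
    delays = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = send()
        if status not in RETRYABLE_STATUSES or attempt == MAX_ATTEMPTS:
            return {"status": status, "attempts": attempt, "delays": delays}
        # Exponential backoff; a real client would time.sleep(delay) here.
        delays.append(base_delay * 2 ** (attempt - 1))


conflict = call_with_retries(lambda: 409)
responses = iter([503, 503, 200])
recovered = call_with_retries(lambda: next(responses))
print(conflict["attempts"], recovered["attempts"], recovered["delays"])
```

<p>With this policy a <code>409</code> is returned to the caller after a single attempt, while a transient <code>503</code> is retried a bounded number of times with growing delays.</p>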
<h2 id="heading-patterns-that-bridge-apis-and-agent-runtimes">Patterns That Bridge APIs and Agent Runtimes</h2>
<h3 id="heading-workflow-documentation-sequences-not-just-endpoints">Workflow Documentation: Sequences, Not Just Endpoints</h3>
<p>Agents excel when they can follow a recipe. Document common sequences (“create customer → add payment method → charge”) and consider standards meant for multi-step API flows (such as Arazzo) when your product’s complexity justifies it.</p>
<h3 id="heading-hypermedia-and-next-steps">Hypermedia and “Next Steps”</h3>
<p>Including links to plausible next actions (for example, pagination <code>next</code>, or related resources) reduces improvisation. This is the same spirit as <a href="https://en.wikipedia.org/wiki/HATEOAS">HATEOAS</a>: the response whispers what you can do next, instead of forcing the model to guess URLs.</p>
<h3 id="heading-tool-oriented-surfaces-for-example-mcp">Tool-Oriented Surfaces (For Example, MCP)</h3>
<p>Protocols like the Model Context Protocol (MCP) are gaining traction as a way to expose curated capabilities (“tools”) with schemas agents can bind to directly.</p>
<p>A common pragmatic pattern is not to dump every micro-endpoint as a tool, but to expose coarse-grained tools aligned to user outcomes while keeping your underlying REST API strict and clean.</p>
<p>MCP isn't a substitute for good API design. It's a delivery and discovery layer. Slapping a thin wrapper on a messy API still leaves you with a messy system – it just fails faster in public.</p>
<h3 id="heading-metadata-for-discovery-llmstxt-and-friends">Metadata for Discovery (<code>llms.txt</code> and Friends)</h3>
<p>Some teams publish <code>/llms.txt</code> or similar lightweight discovery files for documentation sites. Treat these as optional signposts, not replacements for OpenAPI.</p>
<p>Ecosystem adoption is still evolving, but the underlying idea is sound: make the canonical machine-readable description easy to find.</p>
<h2 id="heading-a-practical-beforeafter">A Practical Before/After</h2>
<h3 id="heading-weak-pattern-agent-hostile">Weak Pattern (Agent-hostile)</h3>
<pre><code class="language-http">POST /do-stuff
</code></pre>
<p>Response <code>200 OK</code>:</p>
<pre><code class="language-json">{ "ok": true }
</code></pre>
<p>Problems: no idempotency, no structured error, no entity ID, and no way to poll. The agent must guess whether “ok” means “created” or “ignored duplicate.”</p>
<h3 id="heading-stronger-pattern-agent-friendly">Stronger Pattern (Agent-friendly)</h3>
<pre><code class="language-http">POST /v1/invoices
Idempotency-Key: 7b3c-...
</code></pre>
<p>Response <code>201 Created</code>:</p>
<pre><code class="language-json">{
  "invoice": {
    "id": "inv_9Qz",
    "status": "draft",
    "total": { "amount": "120.00", "currency": "USD" }
  },
  "links": {
    "finalize": "/v1/invoices/inv_9Qz/finalize"
  }
}
</code></pre>
<p>Conflict response <code>409 Conflict</code> with Problem Details:</p>
<pre><code class="language-json">{
  "type": "https://api.example.com/problems/duplicate-idempotency-key",
  "title": "Duplicate idempotency key",
  "status": 409,
  "detail": "A different request body was sent with the same Idempotency-Key.",
  "error_code": "IDEMPOTENCY_KEY_REUSE_BODY_MISMATCH"
}
</code></pre>
<p>This tells the agent what happened and whether retrying is appropriate.</p>
<h2 id="heading-checklist-is-your-api-agent-ready">Checklist: Is Your API Agent-Ready?</h2>
<ul>
<li><p><strong>Contract</strong>: Published OpenAPI 3.x, validated against real traffic, with rich descriptions and examples.</p>
</li>
<li><p><strong>Determinism</strong>: Documented state machines, consistent pagination, explicit async for long jobs.</p>
</li>
<li><p><strong>Safe writes</strong>: Idempotency for side effects, reconciliation endpoints where needed.</p>
</li>
<li><p><strong>Errors</strong>: Stable codes, structured bodies, documented remediation paths.</p>
</li>
<li><p><strong>Security</strong>: Least-privilege tokens, no “mystery” side doors agents can accidentally hit.</p>
</li>
<li><p><strong>Operations</strong>: Rate limits, bulk endpoints where appropriate, correlation IDs, dashboards for anomalous agent traffic.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Designing for AI agents is, in most respects, disciplined API design — pushed to the level where machines can rely on your contract without tribal knowledge.</p>
<p>If you remember only three things:</p>
<ol>
<li><p><strong>Be predictable:</strong> in shapes, states, and side effects.</p>
</li>
<li><p><strong>Be explicit:</strong> in schemas, examples, and errors.</p>
</li>
<li><p><strong>Be protective:</strong> validate early, scope narrowly, and make dangerous actions hard to trigger by accident.</p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ RAG Explained Simply with a Real Project ]]>
                </title>
                <description>
                    <![CDATA[ If you have used ChatGPT, you know how magical it feels. You ask a question, and it instantly generates a highly articulate answer. But you also probably know its biggest flaw. If you ask it about you ]]>
                </description>
                <link>https://www.freecodecamp.org/news/rag-explained-simply-with-a-real-project/</link>
                <guid isPermaLink="false">6a186a9260295e5547e04628</guid>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ashutosh Krishna ]]>
                </dc:creator>
                <pubDate>Thu, 28 May 2026 16:17:22 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/5dc3370a-a536-43f6-850e-223928f99870.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you have used ChatGPT, you know how magical it feels. You ask a question, and it instantly generates a highly articulate answer.</p>
<p>But you also probably know its biggest flaw. If you ask it about your company's internal code, your private Notion workspace, or an event that happened yesterday, it fails.</p>
<p>Usually, it does one of two things. It either apologizes and says it doesn't have access to that information, or worse, it confidently makes something up entirely.</p>
<p>This happens because Large Language Models (LLMs) are like extremely smart students who are locked in a room without internet access. They only know what they memorized before they were locked inside. If you ask them a question outside of their memorized knowledge, they have to guess.</p>
<p>So, how do we fix this? How do we get an AI to answer questions about our private data without retraining the entire model from scratch?</p>
<p>The answer is <strong>RAG</strong>, which stands for Retrieval-Augmented Generation.</p>
<p>RAG is the architecture behind nearly every modern AI application that interacts with private data. If you have ever used a "chat with PDF" app or a customer support bot that actually knows company policies, you have interacted with RAG.</p>
<p>In this article, we'll break down exactly how RAG works from first principles. Then, we'll build a working RAG application from scratch using Python.</p>
<h3 id="heading-heres-what-well-cover">Here's what we'll cover:</h3>
<ul>
<li><p><a href="#heading-what-is-rag">What is RAG?</a></p>
</li>
<li><p><a href="#heading-why-traditional-llms-fail">Why Traditional LLMs Fail</a></p>
</li>
<li><p><a href="#heading-how-rag-works-internally">How RAG Works Internally</a></p>
</li>
<li><p><a href="#heading-how-to-build-a-real-rag-project">How to Build a Real RAG Project</a></p>
</li>
<li><p><a href="#heading-the-full-data-flow">The Full Data Flow</a></p>
</li>
<li><p><a href="#heading-common-rag-problems">Common RAG Problems</a></p>
</li>
<li><p><a href="#heading-advanced-rag-concepts">Advanced RAG Concepts</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-what-is-rag">What is RAG?</h2>
<p>RAG stands for <strong>Retrieval-Augmented Generation</strong>. Let's break down what those three words actually mean.</p>
<ul>
<li><p><strong>Retrieval:</strong> Finding relevant information from a database.</p>
</li>
<li><p><strong>Augmented:</strong> Adding that information to the user's original question.</p>
</li>
<li><p><strong>Generation:</strong> Asking the LLM to write an answer using only the added information.</p>
</li>
</ul>
<h3 id="heading-the-open-book-test-analogy">The Open-Book Test Analogy</h3>
<p>To build a mental model, think of a traditional LLM as a student taking a closed-book exam. The student has read billions of books in the past, but right now, they have to answer questions purely from memory. Sometimes they forget facts, and sometimes they make up answers to avoid leaving the page blank. Not gonna lie, I pulled the same move in quite a few university exams.</p>
<p>RAG turns this into an open-book exam.</p>
<p>When you ask a question, the system first runs to a massive library (your database), finds the pages most likely to contain the answer, and hands those pages to the student. The student then reads those specific pages and writes an answer grounded in them.</p>
<p>Instead of relying on the AI's memory, we're only relying on its reading comprehension skills.</p>
<h2 id="heading-why-traditional-llms-fail">Why Traditional LLMs Fail</h2>
<p>Before we dive into how to build RAG, we need to understand exactly why prompting an LLM on its own isn't enough.</p>
<ol>
<li><p><strong>Training cutoffs:</strong> Training an LLM takes months and costs millions of dollars. Because of this, models are trained on data up to a specific date. If an LLM was trained in 2025, it has absolutely no idea what happened in 2026.</p>
</li>
<li><p><strong>No access to private data:</strong> Your company's Jira tickets, internal wikis, and Slack messages are private. OpenAI, Google, and Anthropic don't have them in their training datasets.</p>
</li>
<li><p><strong>Hallucinations:</strong> LLMs are essentially advanced autocomplete engines. They predict the next most likely word based on patterns. If they don't know a fact, they'll string together words that sound highly plausible but may be completely incorrect. We call this hallucinating.</p>
</li>
<li><p><strong>Context window limitations:</strong> You might be thinking, "Why not just copy and paste my entire company wiki into the ChatGPT prompt?" Well, every LLM has a "context window", which is the maximum amount of text it can process at once. Even with modern models that have massive context windows, pasting thousands of documents into a prompt is incredibly slow and expensive. Also, models tend to lose track of information when you overwhelm them with too much text.</p>
</li>
<li><p><strong>The high cost of retraining:</strong> You could theoretically fine-tune an LLM on your private data. But fine-tuning is complicated and expensive. More importantly, knowledge changes constantly. If you update a company policy, you would have to fine-tune the model all over again to teach it the new rule.</p>
</li>
</ol>
<p>RAG addresses all of these problems. It gives the LLM access to up-to-date, private data without needing to retrain the model.</p>
<h2 id="heading-how-rag-works-internally">How RAG Works Internally</h2>
<p>To make RAG work, we need a specific pipeline of technologies. Let's explore every major concept in the RAG architecture.</p>
<h3 id="heading-documents">Documents</h3>
<p>Everything starts with your raw data. These are your PDFs, database records, text files, or scraped websites. In the AI world, we refer to all of these source materials generally as "documents".</p>
<h3 id="heading-chunking">Chunking</h3>
<p>You can't feed a 500-page book into an AI all at once for a simple question. It's inefficient. Instead, we break the documents down into smaller, manageable pieces called "chunks". A chunk might be a single paragraph or a few sentences.</p>
<p>This matters because when a user asks a question, we only want to retrieve the specific paragraphs that contain the answer, not the entire book. If we skipped chunking, the system would retrieve massive walls of text, which could overflow the LLM's context window or bury the relevant passage.</p>
<h3 id="heading-embeddings">Embeddings</h3>
<p>This is the most intimidating term for beginners, but the concept is brilliant. Computers don't understand words, but they're great at math. <strong>Embeddings</strong> are a way to translate human language into lists of numbers (vectors) that capture the actual meaning of the text.</p>
<p>Imagine a 2D map. We can plot the word "Dog" at coordinates [2, 3] and the word "Puppy" at [2.1, 3.1]. Even though they're different words, the computer knows they mean similar things because their coordinates are physically close together on the map. The word "Car" might be way over at [10, 10].</p>
<p>In a real AI system, an embedding model doesn't use just 2 dimensions. It maps sentences across hundreds or thousands of dimensions to capture deep semantic meaning.</p>
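<p>To make the map analogy concrete, here is a tiny sketch that measures the straight-line distance between the toy coordinates from the example above. These are hand-picked 2D points, not output from a real embedding model.</p>

```python
import math

# Toy 2D "embeddings" taken from the map example; real models output far more dimensions.
points = {
    "Dog": [2.0, 3.0],
    "Puppy": [2.1, 3.1],
    "Car": [10.0, 10.0],
}


def distance(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.dist(a, b)


print(round(distance(points["Dog"], points["Puppy"]), 3))  # 0.141
print(round(distance(points["Dog"], points["Car"]), 3))    # 10.63
```

<p>"Dog" and "Puppy" are about 0.14 units apart, while "Dog" and "Car" are more than 10 units apart, which is how the computer "knows" the first pair is related.</p>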
<h3 id="heading-vector-databases">Vector Databases</h3>
<p>Once we convert all of our text chunks into number coordinates (embeddings), we need a place to store them. Traditional SQL databases are great at finding exact keyword matches. But they're terrible at finding "similar meanings".</p>
<p>A <strong>vector database</strong> is specifically designed to store lists of numbers and quickly calculate the distance between them. Popular options include ChromaDB, Pinecone, Weaviate, and Milvus, as well as FAISS, which is a similarity-search library rather than a full database.</p>
<h3 id="heading-semantic-search-and-similarity-matching">Semantic Search and Similarity Matching</h3>
<p>When a user types a question into our chatbot, we run the question through the exact same embedding model. The question becomes a list of numbers.</p>
<p>We then ask the vector database to perform a <strong>similarity search</strong>. The database looks at the coordinates of the user's question and finds the stored chunks that are located closest to it in mathematical space. Because closeness in this space tracks similarity in meaning, the closest chunks are the ones most likely to contain the information needed to answer the question.</p>
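<p>Under the hood, a similarity search boils down to "score every chunk against the question and keep the best few." The sketch below does this by brute force with cosine similarity. The three chunk texts and their 3D vectors are made up for illustration, and real vector databases typically use faster approximate indexes rather than scanning everything.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3D vectors standing in for real chunk embeddings.
stored_chunks = {
    "Remote work is allowed on Tuesdays and Thursdays.": [0.9, 0.1, 0.0],
    "Expense reports are due by the fifth of each month.": [0.1, 0.9, 0.1],
    "Employees may work from home with manager approval.": [0.8, 0.2, 0.1],
}


def top_k(query_vector, k=2):
    """Rank every stored chunk by similarity to the query and keep the best k."""
    ranked = sorted(
        stored_chunks.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Pretend this is the embedding of "What days can I work from home?"
matches = top_k([0.85, 0.15, 0.05])
print(matches)
```

<p>The two remote-work chunks come back and the expense-report chunk is left out, even though the query never had to share exact keywords with them.</p>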
<h3 id="heading-prompt-augmentation">Prompt Augmentation</h3>
<p>Now we have the user's original question and the text chunks we retrieved from the database. We "augment" (add to) the prompt. We create a hidden template behind the scenes that looks like this:</p>
<blockquote>
<p>"You are a helpful assistant. Use ONLY the following context to answer the user's question.</p>
<p>Context:</p>
<p>[Insert retrieved chunks here]</p>
<p>Question:</p>
<p>[Insert user question here]"</p>
</blockquote>
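<p>In code, augmentation is just string formatting. Here is a minimal sketch that fills the template above. The <code>build_prompt</code> helper and the sample chunk texts are placeholders for illustration.</p>

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Use ONLY the following context to answer the user's question.

Context:
{context}

Question:
{question}"""


def build_prompt(chunks, question):
    """Join retrieved chunks and drop them into the template alongside the question."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    ["Chunk one text.", "Chunk two text."],
    "What does the handbook say?",
)
print(prompt)
```

<p>The user never sees this combined text. They only type the question, and the application assembles everything else behind the scenes.</p>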
<h3 id="heading-final-llm-response">Final LLM Response</h3>
<p>We send this giant, augmented prompt to the LLM. The LLM reads the context, processes the question, and generates a response grounded in the provided data.</p>
<h3 id="heading-quick-recap">Quick Recap</h3>
<p>A RAG pipeline usually looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61c1acb4a90dea775da8262b/fa6b3432-bc29-4346-8537-3f5b3861b9d1.png" alt="RAG Pipeline" style="display:block;margin:0 auto" width="795" height="1310" loading="lazy">

<h2 id="heading-how-to-build-a-real-rag-project">How to Build a Real RAG Project</h2>
<p>Let's build a real-world RAG application. We'll build an AI chatbot that reads and understands a PDF document.</p>
<p>To make this completely free to build, we'll use Python, LangChain (a popular AI framework), Google's Gemini API (which has a generous free tier for developers), and ChromaDB (a local vector database).</p>
<p>Note: We'll be using the free Gemini tier here for illustration purposes so you can learn without spending money. Because LangChain is modular, you can easily swap this out for any other production-grade model later just by changing one line (or a few lines) of code.</p>
<h3 id="heading-project-setup">Project Setup</h3>
<p>First, open your terminal or command prompt, create a new directory for your project, and navigate into it:</p>
<pre><code class="language-shell">mkdir my-rag-project
cd my-rag-project
</code></pre>
<p>Next, it's a best practice to create an isolated <strong>virtual environment</strong>. This ensures that the packages we install for this project don't conflict with other Python projects on your computer.</p>
<p>To create and activate a virtual environment, run the commands for your specific operating system:</p>
<p><strong>For macOS and Linux:</strong></p>
<pre><code class="language-shell">python3 -m venv venv
source venv/bin/activate
</code></pre>
<p><strong>For Windows (Command Prompt):</strong></p>
<pre><code class="language-shell">python -m venv venv
venv\Scripts\activate
</code></pre>
<p><strong>For Windows (PowerShell):</strong></p>
<pre><code class="language-shell">python -m venv venv
.\venv\Scripts\Activate.ps1
</code></pre>
<p>Once activated, you'll see <code>(venv)</code> appear at the beginning of your terminal line. Now, go ahead and install the required libraries inside your fresh environment:</p>
<pre><code class="language-shell">python -m pip install --upgrade pip
pip install langchain langchain-text-splitters langchain-google-genai langchain-community chromadb python-dotenv pypdf
</code></pre>
<p>You'll also need a Google Gemini API key. You can get one for free from <a href="https://aistudio.google.com/app/api-keys">Google AI Studio</a>.</p>
<p>Instead of running messy terminal configuration commands for different operating systems, create a new file named <code>.env</code> in the root of your project folder and add your key like this:</p>
<pre><code class="language-plaintext">GOOGLE_API_KEY=your_actual_api_key_here
</code></pre>
<h3 id="heading-preparing-the-pdf">Preparing the PDF</h3>
<p>Since this is a "Chat with PDF" project, you’ll need a sample PDF document to work with. To keep things simple, download <a href="https://drive.google.com/file/d/1UOUVl2mzc39SEHxpi8hujpueIyhEUPC7/view?usp=sharing">this ready-made sample document</a> and place it inside your project folder.</p>
<p>You can then use this PDF throughout the tutorial for testing uploads, parsing, embeddings, and chat functionality.</p>
<h3 id="heading-writing-the-rag-code-step-by-step">Writing the RAG Code Step-by-Step</h3>
<p>Create a Python file named <code>rag_app.py</code> in your project folder. Instead of copying a massive block of code, we'll build this application block by block so we can understand exactly how data flows through our pipeline.</p>
<h4 id="heading-step-1-imports-and-environment-setup">Step 1: Imports and Environment Setup</h4>
<p>At the very top of your file, add the necessary library imports and initialize your environment configuration:</p>
<pre><code class="language-python">import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Load environment variables from the .env file
load_dotenv()
</code></pre>
<p>We're bringing in LangChain modules to handle loading, splitting, embedding, storing, and prompting. The <code>load_dotenv()</code> call reads our <code>.env</code> file and loads <code>GOOGLE_API_KEY</code> into the process's environment variables, so the Gemini client libraries can authenticate without the key being hardcoded in the script.</p>
<h4 id="heading-step-2-loading-the-pdf-document">Step 2: Loading the PDF Document</h4>
<p>Next, let's point our script to the PDF document we downloaded earlier:</p>
<pre><code class="language-python">print("Loading PDF document...")
loader = PyPDFLoader("TechCorp_Official_Employee_Handbook.pdf")
document = loader.load()

print(document[0].page_content)
</code></pre>
<p>Computers can't read a PDF like a standard text file because PDFs contain complex layout streams. <code>PyPDFLoader</code> handles the heavy lifting of opening the file, stripping away visual layout formatting, and extracting the raw text characters into a clean format that LangChain can work with.</p>
<p>At this point, when you run the script, you should see the text content from the first page of the PDF printed in the terminal. This is a quick way to verify that the PDF was loaded successfully and that <code>PyPDFLoader</code> was able to extract readable text from the document correctly.</p>
<h4 id="heading-step-3-chunking-the-text">Step 3: Chunking the Text</h4>
<p>Now that the raw text is in memory, we need to chop it up into smaller pieces:</p>
<pre><code class="language-python">print("Chunking text...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(document)

print(chunks[0].page_content)
</code></pre>
<p>If a user asks a simple question, sending an entire 100-page document to the LLM is incredibly slow and expensive. <code>RecursiveCharacterTextSplitter</code> cuts the text into segments of roughly 500 characters.</p>
<p>The <code>chunk_overlap=50</code> parameter tells the text splitter to repeat up to about 50 characters from the end of one chunk at the beginning of the next. This helps preserve context between chunks so that sentences or ideas are not abruptly cut off.</p>
<p>Without overlap, important information near chunk boundaries could be separated, making retrieval less accurate. By maintaining a small shared section between neighboring chunks, the model can better understand continuity in the document, resulting in more reliable search results and higher-quality responses.</p>
<p>When you run the script, you should now see the contents of the first text chunk printed in the terminal.</p>
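<p>If you want to see the overlap idea in isolation, here is a simplified fixed-width chunker in plain Python. This is not how <code>RecursiveCharacterTextSplitter</code> works internally (it prefers to split on separators such as paragraphs and sentences). The <code>chunk_text</code> function is a hypothetical helper that only demonstrates what the two parameters mean.</p>

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Fixed-width sliding window: each chunk starts chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "ABCDEFGHIJKLMNOPQRST"  # 20 characters
for piece in chunk_text(sample, chunk_size=8, chunk_overlap=3):
    print(piece)
```

<p>The first chunk ends with <code>FGH</code> and the second one starts with <code>FGH</code>. That shared region is the overlap, so anything that straddles a boundary still appears whole in at least one chunk.</p>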
<h4 id="heading-step-4-creating-embeddings-and-initializing-the-vector-db">Step 4: Creating Embeddings and Initializing the Vector DB</h4>
<p>With our chunks ready, we'll convert them into vector coordinates and save them locally:</p>
<pre><code class="language-python">print("Creating vector database...")
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_db = Chroma.from_documents(
    documents=chunks, 
    embedding=embeddings, 
    persist_directory="./chroma_db"
)
</code></pre>
<p>This is the mathematical core of RAG. <code>GoogleGenerativeAIEmbeddings</code> takes a raw text chunk and turns it into a list of numbers representing its conceptual meaning. We then hand those chunks and numbers to <code>Chroma</code>, which maps them into a local database directory named <code>chroma_db</code> on your hard drive, allowing for lightning-fast mathematical lookups later.</p>
<h4 id="heading-step-5-setting-up-the-retriever-and-prompt-template">Step 5: Setting Up the Retriever and Prompt Template</h4>
<p>Now we need a mechanism to query our database and a structure to house our instructions:</p>
<pre><code class="language-python"># Configure the database to act as a document retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 2})

# Define the hidden prompt structure for the LLM
template = """
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

Context: {context}

Question: {question}

Answer:
"""
prompt = PromptTemplate.from_template(template)
</code></pre>
<p><code>vector_db.as_retriever()</code> converts the vector database into a retriever object that can search through stored document embeddings and return the most relevant chunks for a user’s question. Setting <code>k=2</code> on our retriever tells the database to only pull the top two most relevant chunks for any given question, which keeps things clean and efficient.</p>
<p>The prompt template acts as hidden instructions for the model. When a user asks a question, LangChain automatically replaces <code>{context}</code> with the retrieved document chunks and <code>{question}</code> with the user’s actual query. The template also acts as a safety guardrail. By explicitly telling the model to say that it doesn't know when the context lacks the information, we reduce the model's tendency to hallucinate answers.</p>
<h4 id="heading-step-6-initializing-the-llm-and-constructing-the-rag-chain">Step 6: Initializing the LLM and Constructing the RAG Chain</h4>
<p>Next, we hook up our language model and construct our execution pipeline:</p>
<pre><code class="language-python"># Initialize the free Gemini model tier
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

# Helper function to stitch retrieved chunks into a single text block
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Connect everything together using LangChain Expression Language (LCEL)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
</code></pre>
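<p>We use the Gemini Flash model named in the code above (<code>gemini-1.5-flash</code>) with <code>temperature=0</code>, which makes the output as deterministic and focused as possible rather than creative. Model names change over time, so use whichever Flash model is currently available to your API key.</p>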
<p>The retriever returns multiple document chunks as structured objects. The <code>format_docs</code> function converts those chunks into a single continuous text block by joining their <code>page_content</code>. This step is necessary because the prompt expects a clean, readable context string rather than a list of document objects.</p>
<p>Finally, we connect everything using LangChain Expression Language (LCEL). When a question comes in, the chain passes it to the retriever, formats the retrieved documents into a single context string, fills in the prompt template with that context and the question, and sends the finished prompt to the LLM.</p>
<h4 id="heading-step-7-invoking-the-chain-with-a-question">Step 7: Invoking the Chain with a Question</h4>
<p>Finally, let's execute the pipeline and print the result out to the console:</p>
<pre><code class="language-python">user_question = "What days can I work from home?"
print(f"\nQuestion: {user_question}")

response = rag_chain.invoke(user_question)
print(f"Answer: {response.content}")
</code></pre>
<p>This is where the magic happens. The <code>invoke</code> command sets off the entire chain reaction we just built. When you run this, the console will output:</p>
<pre><code class="language-shell">Loading PDF document...
Chunking text...
Creating vector database...

Question: What days can I work from home?
Answer: [{'type': 'text', 'text': 'You are permitted to work from home on Tuesdays and Thursdays. Additional remote flexibility may also be approved by your department manager.', 'extras': {'signature': 'Eo0JCooJAQw51seue7vZT7Vby90GMDLhtOBWLKm5UjfEro7f8dRoKC0KAIHxSqQSLXq0s3kf6yfzTsgaUMFiNd0fnwtNSNoApzcZ7huRD8iq+f+xomoXGhmFYClnLApHUKtOLykICluJnM1j6DfYGaVHKLqU0MF4+Fng9CdqXVqPgN9HcfJEvSpeMAc9vTYENj07s8N6MidlMvMt1w0fl4GCjxAZXyEngdU4kGfjUqaKyjjCQ9yLFeoXrV55pqZdkElLxXEK4ZWNnMGh5NDqGmt2b0kMG4KoCdunUltBr1ctV15rZ+724T0qnjDvI+pIgp/ZtKa423gaVXSkSmdvSePEog38blJ2dgjtZg72XF5xlh45Yv06fZVu7e60ZB1sTn4W8iWuYGQ61i/xCN6xCX/e3SuitjwQoHSlEe/iuoaNf5BXhdp87TUyQTawiY+qIZjgWz2AMLUbMcOvns/0iFt6jpUkXr/dO4eYF39UCosrbWC5TZQp2gllNQ6mlrczTAKqe8mPZwmBVuTJ3kx3q+SsVROln584EdD94IxXrgLXhuLkbR9ub0qyvjBfAmIfvUEK5pcaBCGydQvheH9wsIvAOG1kspMb/wqjAv/mpmii8J9vztSvM9PR9v7L3YLu8vcANol80w2PfeHhyWUJWit8R58kKd7HHor5GJhA436x+tCukIlBq2oTcob+ydxVJydA12pRsiuw4kYkEIU8nr5yCiIwjYCDtVm6Ws0RUnhyk5u+dRONPZ6g+mfBShKCnahcIMzzJpXznmPXvmP2C96uD64SGTI6L86EMlLEz06/cTJTabgqAYqe2AhERgnYc/4d0XabQOkzvDmBKMr5/LOAt3ZZg7X4PIuefEwxx0eB60gLROefcbbu8k+KPazqFsDP/YA/aPyAxyss/6V43EID0amJcDA81LKJzazL9KnclefQZrN9viIwteMaV04IIlx+Ynk1vZi/LVgWiFuDVWF3Ql2luY4KwFpfFDxQ728gkrhvUdTBrfUeKRSLV1W4ox6I7ogo0e9i7db2lkOQljctGs3Km3hWu4JOkH+YzLNmcDHMF3imfgQH5Ml99H9PXh1ScBjq47MXKzJPdHijkY5ZRSjceEIlKEGv8afQO60NB8lk1MQAGwd+CxqIwVg11N8q9EFSwdJmVVmoyM1nINGJERSKhKOrkqBsOELfpKDjv14tuNgDUy4wdtuxn8C4tJBKvN8t/hrW/Z65VoBGdMwA08sRSV6Fp5l/gSdYeB9yA/Lx/VGkgVqaP5tU73XrE/XO8ysJ/kgRDXiTvsg+2uayU1Q9PfKFAawopslwybCHtdOwaVgsRdA5R4f1NIkPoP/sX+iBxyR0kKg6v4RRAj851WifM2fQ8Vsw5dtFSeh/4TfYg1GCCCDNT4JwrtI8fqcF+qMQqUb+oUqoyzjzFqqSRxXcyqHXOLV9V9C6yWYmZ3TSY043WL9L4kGGJGxFHD5VWG77Quiy+rHWGO13LOc5EBKIO05sg1xnI88QQTUgkxwJeuntytIy3f3pfMVrFYFkvi8w5LzL4RK68+4HMg=='}}]
</code></pre>
<p>Modern LLMs like Google's Gemini are <strong>multimodal</strong>. This means they're designed to read and generate not just plain text, but images, video, and audio simultaneously. Because of this, the LangChain Google integration doesn't always return a simple text string. Instead, it returns a <strong>list of content blocks</strong>.</p>
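<p>In your output, the AI successfully returned your text, but it also included an <code>extras</code> dictionary containing a <code>signature</code>. This signature is opaque, encoded metadata that the API attaches to the response. Google's Gemini documentation describes values like this as thought signatures, which let the model's reasoning context carry across turns. It isn't meant to be read by humans, and you can ignore it when all you need is the answer text.</p>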
<p>To get a clean, human-readable string, you simply need to extract the <code>text</code> value from that list. You can update your final print statement to check if the response is a list and extract the text automatically:</p>
<pre><code class="language-python"># Clean up the output if Gemini returns a list of content blocks
if isinstance(response.content, list):
    clean_answer = response.content[0]['text']
else:
    clean_answer = response.content

print(f"Answer: {clean_answer}")
</code></pre>
<p>Now, your output will look like this:</p>
<pre><code class="language-shell">Question: What days can I work from home?
Answer: You are permitted to work from home on Tuesdays and Thursdays. Additional remote flexibility may also be approved by your department manager.
</code></pre>
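<p>The snippet above reads only the first block. If a response ever arrives split across several text blocks, a slightly more defensive helper can join them. This is a sketch that assumes the blocks follow the shape shown in the raw output earlier (dictionaries with <code>type</code> and <code>text</code> keys). The name <code>extract_text</code> and the sample data are made up for illustration.</p>
<pre><code class="language-python">def extract_text(content):
    """Return the answer text whether content is a string or a list of blocks."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            # Ignore metadata such as "extras"; keep only the text itself
            parts.append(block.get("text", ""))
    return "".join(parts)


# Invented sample that mirrors the shape of the raw output shown earlier
sample = [
    {"type": "text", "text": "You may work from home ", "extras": {"signature": "abc"}},
    {"type": "text", "text": "on Tuesdays and Thursdays."},
]
print(extract_text(sample))
print(extract_text("plain string"))
</code></pre>
<p>With a helper like this, the final print statement could become <code>print(f"Answer: {extract_text(response.content)}")</code>.</p>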
<h4 id="heading-step-8-making-it-conversational">Step 8: Making it Conversational</h4>
<p>Right now, our script hardcodes a single question, prints the answer, and immediately exits. In the real world, you want to chat with your documents naturally. Let's upgrade our script to run continuously in your terminal so you can ask as many questions as you want without restarting the program.</p>
<p>Replace the bottom section of your code with a simple <code>while</code> loop:</p>
<pre><code class="language-python"># Chat with your PDF in a continuous loop
print("\n--- PDF Chatbot Initialized ---")
print("Type 'exit' or 'quit' to stop.")

while True:
    # 1. Wait for the user to type a question
    user_question = input("\nYour Question: ")

    # 2. Allow the user to break the loop and close the program
    if user_question.lower() in ['exit', 'quit']:
        print("Shutting down chatbot. Goodbye!")
        break

    # 3. Send the question through our RAG chain
    response = rag_chain.invoke(user_question)

    # 4. Clean up the output format
    if isinstance(response.content, list):
        clean_answer = response.content[0]['text']
    else:
        clean_answer = response.content

    # 5. Print the final answer to the console
    print(f"Answer: {clean_answer}")
</code></pre>
<p>By using Python's <code>input()</code> function wrapped inside an infinite <code>while True</code> loop, we keep the Python script alive. The PDF chunks and vector database stay loaded in your computer's memory, so you can ask consecutive questions without reloading or re-embedding the PDF each time. Each question still makes network calls to Google, but the expensive setup happens only once. This transforms your script from a static demonstration into a fully interactive AI tool!</p>
<p>Here's a sample run:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61c1acb4a90dea775da8262b/987055ad-353d-4bd3-9026-6e4172a0904a.png" alt="Image of sample run" style="display:block;margin:0 auto" width="1907" height="994" loading="lazy">

<h4 id="heading-full-code">Full Code</h4>
<pre><code class="language-python">import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Load environment variables from the .env file
load_dotenv()

print("Loading PDF document...")
loader = PyPDFLoader("TechCorp_Official_Employee_Handbook.pdf")
document = loader.load()
# print(document[0].page_content)

print("Chunking text...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(document)
# print(chunks[0].page_content)

print("Creating vector database...")
embeddings = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001")
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Configure the database to act as a document retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 2})

# Define the hidden prompt structure for the LLM
template = """
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

Context: {context}

Question: {question}

Answer:
"""
prompt = PromptTemplate.from_template(template)

# Initialize the free Gemini model tier
llm = ChatGoogleGenerativeAI(model="gemini-3.5-flash", temperature=0)

# Helper function to stitch retrieved chunks into a single text block
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Connect everything together using LangChain Expression Language (LCEL)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Single-question version from Step 7, kept for reference (the loop below replaces it):
# user_question = "What days can I work from home?"
# print(f"\nQuestion: {user_question}")
#
# response = rag_chain.invoke(user_question)
#
# # Clean up the output if Gemini returns a list of content blocks
# if isinstance(response.content, list):
#     clean_answer = response.content[0]['text']
# else:
#     clean_answer = response.content
#
# print(f"Answer: {clean_answer}")

# Chat with your PDF in a continuous loop
print("\n--- PDF Chatbot Initialized ---")
print("Type 'exit' or 'quit' to stop.")

while True:
    # 1. Wait for the user to type a question
    user_question = input("\nYour Question: ")

    # 2. Allow the user to break the loop and close the program
    if user_question.lower() in ['exit', 'quit']:
        print("Shutting down chatbot. Goodbye!")
        break

    # 3. Send the question through our RAG chain
    response = rag_chain.invoke(user_question)

    # 4. Clean up the output format
    if isinstance(response.content, list):
        clean_answer = response.content[0]['text']
    else:
        clean_answer = response.content

    # 5. Print the final answer to the console
    print(f"Answer: {clean_answer}")
</code></pre>
<h4 id="heading-taking-it-out-of-the-terminal">Taking it out of the terminal</h4>
<p>Once you have your terminal chatbot working, you probably want to give it a proper visual interface. The easiest way to do this in Python is using an open-source library called <strong>Gradio</strong>. <a href="https://blog.ashutoshkrris.in/build-ai-apps-with-gradio-turn-your-python-scripts-into-web-apps">Gradio</a> has a built-in <code>ChatInterface</code> feature that can wrap your existing RAG code and automatically generate a beautiful, ChatGPT-style web UI in your browser with just three extra lines of code. It's highly recommended as your next mini-project.</p>
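<p>As a rough sketch of what that wrapper could look like: Gradio's <code>ChatInterface</code> takes a function that receives the new message and the chat history and returns the reply. Here, <code>EchoChain</code> is a hypothetical stand-in for <code>rag_chain</code> so the example runs without an API key, and <code>respond</code> and <code>main</code> are illustrative names. It assumes Gradio is installed (<code>pip install gradio</code>).</p>
<pre><code class="language-python">class EchoChain:
    # Hypothetical stand-in for rag_chain so this sketch runs without an API key
    def invoke(self, question):
        return f"(answer for: {question})"


rag_chain = EchoChain()


def respond(message, history):
    # Gradio passes the new message plus the chat history; history is unused here
    result = rag_chain.invoke(message)
    content = getattr(result, "content", result)
    if isinstance(content, list):
        return content[0]["text"]
    return content


def main():
    import gradio as gr  # imported lazily; assumes gradio is installed

    gr.ChatInterface(fn=respond).launch()


# Quick check in the terminal; call main() instead to open the web UI
print(respond("What days can I work from home?", []))
</code></pre>
<p>In your own script, you would drop the stand-in class and let <code>respond</code> call the real <code>rag_chain</code> built earlier.</p>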
<h2 id="heading-the-full-data-flow">The Full Data Flow</h2>
<p>To truly solidify your understanding, let's map out the exact lifecycle of a single user question in our system:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61c1acb4a90dea775da8262b/af98284c-39c6-4cc7-bb4f-13955b659048.png" alt="af98284c-39c6-4cc7-bb4f-13955b659048" style="display:block;margin:0 auto" width="1577" height="621" loading="lazy">

<h3 id="heading-breaking-down-the-execution-timeline">Breaking Down the Execution Timeline</h3>
<ol>
<li><p><strong>The request begins:</strong> The user interfaces with our console and asks a text-based question: "How much vacation do I get?" At this exact moment, our application code takes control of the program flow.</p>
</li>
<li><p><strong>The text-to-vector translation:</strong> Computers can't measure semantic similarity from raw text characters alone. Our app makes a network call to the Google Embedding Model, handing over the raw question. The model converts the text into a long array of numbers (a vector) that mathematically represents the user's intent.</p>
</li>
<li><p><strong>The database distance calculation:</strong> Our application script passes that query vector directly to ChromaDB. ChromaDB compares it against the vectors stored for each of our PDF chunks using a similarity function and returns the closest matches (the top two, since we set <code>k=2</code>). It surfaces the text chunk mentioning "20 days of paid time off" because its vector is mathematically closest to the query vector.</p>
</li>
<li><p><strong>The prompt augmentation:</strong> ChromaDB hands the matching chunks back to our script. The chain joins them into one text block and fills in our prompt template, placing the chunk text in the <code>{context}</code> slot and the user's original question in the <code>{question}</code> slot.</p>
</li>
<li><p><strong>The final generation:</strong> Our application sends this combined prompt to the Gemini LLM in a final network call. With <code>temperature=0</code> and the prompt's instruction to rely on the provided context, the model behaves mostly like a reading comprehension engine. It reads the retrieved context, writes a concise answer, and sends it back to our terminal to be printed for the user.</p>
</li>
</ol>
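<p>Step 3 is the least intuitive part, so here is a tiny worked example of "closest vector wins". It uses cosine similarity, which is one common choice; the metric your vector database actually uses depends on its configuration. The three-number vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), and <code>cosine_similarity</code> is written by hand rather than taken from a library.</p>
<pre><code class="language-python">import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embedding for "How much vacation do I get?"
query_vector = [0.9, 0.1, 0.0]

# Made-up embeddings for three stored chunks
chunk_vectors = {
    "20 days of paid time off": [0.8, 0.2, 0.1],
    "office dress code": [0.1, 0.9, 0.2],
    "parking permits": [0.0, 0.2, 0.9],
}

# Rank chunks from most to least similar and keep the top two (like k=2)
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
top_two = ranked[:2]
print(top_two)
</code></pre>
<p>The vacation chunk ranks first because its vector points in nearly the same direction as the query vector.</p>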
<h2 id="heading-common-rag-problems">Common RAG Problems</h2>
<p>Building a simple RAG app is easy. Building a RAG app that works perfectly in production is very difficult. Here are the most common problems engineers face and how they fix them.</p>
<h3 id="heading-1-bad-chunking">1. Bad Chunking</h3>
<p>If your chunks are too large, they include irrelevant information that confuses the LLM. If they're too small, they lose vital context. Engineers can solve this by experimenting with different chunk sizes or using semantic chunking (splitting by whole sentences or paragraphs rather than strict character counts).</p>
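<p>To build intuition for how chunk size and overlap interact, here is a deliberately simplified fixed-window chunker. It is not how <code>RecursiveCharacterTextSplitter</code> works internally (that splitter also tries to break on separators such as paragraphs and sentences). <code>chunk_text</code> is a hypothetical helper written just for this illustration.</p>
<pre><code class="language-python">def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=5, overlap=2)
print(chunks)
</code></pre>
<p>Each chunk repeats the last two characters of the previous one. That repeated slice is what keeps a sentence that straddles a boundary from losing its context entirely.</p>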
<h3 id="heading-2-irrelevant-retrieval">2. Irrelevant Retrieval</h3>
<p>Sometimes semantic search fails. If a user searches for "Apple" expecting information about fruit, but the database only has data about the tech company, the system will confidently return tech company documents. Engineers can fix this by adjusting the embedding models or adding keyword search rules.</p>
<h3 id="heading-3-hallucinations">3. Hallucinations</h3>
<p>Even with RAG, an LLM might ignore the retrieved context and rely on its training memory. Engineers mitigate this by heavily engineering the prompt template with strict rules like "ONLY use the provided text."</p>
<h3 id="heading-4-latency">4. Latency</h3>
<p>RAG requires an embedding network call, a database search, and an LLM network call. This takes time. Engineers can optimize this by using faster, locally hosted embedding models or caching common questions.</p>
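<p>A minimal sketch of the caching idea, assuming exact-match caching on a normalized question is good enough for your use case. <code>CachedAnswerer</code> and <code>slow_answer</code> are hypothetical names; in the tutorial's script, <code>slow_answer</code> would wrap <code>rag_chain.invoke</code>.</p>
<pre><code class="language-python">class CachedAnswerer:
    def __init__(self, answer_fn):
        self.answer_fn = answer_fn
        self.cache = {}
        self.misses = 0  # counts how often the expensive function really ran

    def ask(self, question):
        # Normalize case and whitespace so trivial variations share one entry
        key = " ".join(question.lower().split())
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.answer_fn(question)
        return self.cache[key]


def slow_answer(question):
    # Stand-in for the real embedding + retrieval + LLM round trip
    return f"answer to: {question.strip()}"


bot = CachedAnswerer(slow_answer)
first = bot.ask("What days can I work from home?")
second = bot.ask("  what days can I work FROM home?  ")
print(first == second, bot.misses)
</code></pre>
<p>The second call never reaches the expensive function. The trade-off is that cached answers can go stale, which leads directly to the next problem.</p>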
<h3 id="heading-5-stale-data">5. Stale Data</h3>
<p>If HR updates the company policy PDF, the vector database still holds the old numbers. The AI will give outdated answers. Engineers build update pipelines that automatically delete old vectors and embed new ones whenever a source file changes.</p>
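<p>One simple way such a pipeline can detect a change is to store a content hash next to the vectors and compare it on each run. The sketch below shows only the detection step. The byte strings are invented sample data, and <code>fingerprint</code> and <code>needs_reindex</code> are hypothetical helper names.</p>
<pre><code class="language-python">import hashlib


def fingerprint(data):
    # SHA-256 digest of the file contents; any edit changes the digest
    return hashlib.sha256(data).hexdigest()


def needs_reindex(file_bytes, stored_fingerprint):
    return fingerprint(file_bytes) != stored_fingerprint


old_pdf = b"PTO policy: 20 days"
new_pdf = b"PTO policy: 25 days"

stored = fingerprint(old_pdf)  # saved when the vectors were first created
print(needs_reindex(old_pdf, stored))
print(needs_reindex(new_pdf, stored))
</code></pre>
<p>When the check returns <code>True</code>, the pipeline would delete the vectors for that file and embed the new version.</p>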
<h2 id="heading-advanced-rag-concepts">Advanced RAG Concepts</h2>
<p>Once you master basic RAG, the AI engineering world opens up to highly advanced techniques.</p>
<h3 id="heading-hybrid-search">Hybrid Search</h3>
<p>Vector databases are great at understanding meaning, but bad at finding exact ID numbers or specific names. Hybrid search combines traditional keyword search (like searching a SQL database) with semantic vector search to get the best of both worlds.</p>
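<p>There are several ways to merge the two result lists. One widely used option is reciprocal rank fusion (RRF), sketched below. The document IDs are invented, and <code>k=60</code> is a conventional default for RRF rather than anything specific to this tutorial.</p>
<pre><code class="language-python">def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; items near the top of any list score higher."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Invented results from a keyword search and a vector search
keyword_ranking = ["policy_17", "policy_03", "policy_09"]
vector_ranking = ["policy_03", "policy_21", "policy_17"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
</code></pre>
<p>Documents that appear in both lists rise to the top, while documents found by only one method still make the final list.</p>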
<h3 id="heading-reranking">Reranking</h3>
<p>Sometimes the vector database returns 10 chunks, but the best answer is accidentally placed at the bottom of the list. Reranking uses a second, specialized AI model to read the retrieved chunks and sort them strictly by relevance before sending them to the LLM.</p>
<h3 id="heading-agentic-rag">Agentic RAG</h3>
<p>Instead of forcing the system to retrieve documents every single time, Agentic RAG uses an AI "Agent" to decide if it even needs to search. If you say "Hello", the agent skips the database and just says "Hi". If you ask a hard question, it decides to query the database.</p>
<h3 id="heading-graph-rag">Graph RAG</h3>
<p>Instead of breaking text into isolated chunks, Graph RAG extracts entities (people, places, concepts) and maps how they relate to each other in a Knowledge Graph. This is incredibly powerful for complex datasets with deep relationships.</p>
<h3 id="heading-multi-modal-rag">Multi-Modal RAG</h3>
<p>Traditional RAG only reads text. Multi-modal RAG processes images, charts, and audio files, allowing users to ask questions like, "What does the graph on page 4 indicate?"</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Retrieval-Augmented Generation is the bridge between incredible reasoning engines (LLMs) and reliable factual knowledge (your data).</p>
<p>Understanding RAG is no longer optional for software engineers. Nearly every enterprise software product being built today involves some form of it. By learning how chunking, embeddings, vector databases, and prompt augmentation work together, you have demystified the magic behind modern AI.</p>
<p>Your next step is to build on the code we wrote today. Try pointing the PDF loader to your résumé, a school textbook, or a financial report. Once you experience your own code answering questions about your personal data, you'll start to truly understand the power of AI engineering.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Language Models are Few-Shot Learners (GPT-3) ]]>
                </title>
                <description>
                    <![CDATA[ After GPT-2, it became clear that language models could do much more than researchers originally expected. Simply training a model to predict the next word had already started producing surprising abi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-language-models-are-few-shot-learners-gpt-3/</link>
                <guid isPermaLink="false">6a0b76a04e81b730489aea6f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Mon, 18 May 2026 20:29:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/9fd8e279-ebf3-4662-b204-737dd38b7648.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>After GPT-2, it became clear that language models could do much more than researchers originally expected. Simply training a model to predict the next word had already started producing surprising abilities like translation, summarization, and question answering without task-specific training.</p>
<p>But there was still a major limitation. Even though GPT-2 could generalize across tasks, it still struggled to adapt reliably. Performance often depended on carefully written prompts, and for many real-world applications, fine-tuning was still necessary. AI systems were becoming more flexible, but they still were not truly learning tasks from context the way humans do.</p>
<p>Then GPT-3 pushed the idea much further. Instead of asking whether language models could perform tasks without fine-tuning, the paper explored something even more ambitious:</p>
<p>What happens if we scale language models to an extreme size? The answer surprised almost everyone in the AI community.</p>
<p>GPT-3 showed that a sufficiently large language model could often learn new tasks directly from examples inside the prompt itself. No retraining. No gradient updates. Just a few demonstrations written in natural language.</p>
<p>For example, if you showed the model a few English-to-French translations, it could continue the pattern correctly for a new sentence. If you gave it examples of questions and answers, it could often infer the task immediately and generate reasonable responses.</p>
<p>This became known as <em>few-shot learning</em> and <em>in-context learning</em>.</p>
<p>More importantly, GPT-3 suggested a completely different way of interacting with AI systems. Instead of training a separate model for every task, the same model could dynamically adapt depending on the instructions and examples it received.</p>
<p>That idea eventually became the foundation for modern AI systems like ChatGPT.</p>
<p>Now, like many influential AI papers, the GPT-3 paper can be difficult to read because of its scale, technical experiments, and long benchmark evaluations. So in this article, I’ll break everything down in a clear and practical way.</p>
<p>We’ll explore what problem the paper was trying to solve, how few-shot learning works, why scaling became so important, how GPT-3 was trained, and why this paper fundamentally changed the direction of modern AI research.</p>
<p>By the end, you should understand the core ideas behind GPT-3 and why this paper became one of the most important milestones in the history of large language models (LLMs).</p>
<h2 id="heading-paper-overview">Paper Overview</h2>
<p>In this article, we’ll review the paper <a href="https://arxiv.org/pdf/2005.14165"><em>Language Models are Few-Shot Learners</em></a> by Tom Brown et al. from OpenAI.</p>
<p>This paper introduced GPT-3 and demonstrated something that changed the direction of modern AI research: large language models could learn tasks directly from prompts and examples, without the task-specific fine-tuning that GPT-1's methodology relied on.</p>
<p>Instead of retraining the model for every new task, GPT-3 could often adapt dynamically through natural language instructions, one-shot examples, or few-shot prompting.</p>
<p>The paper also introduced the idea of <em>in-context learning</em>, where the model effectively learns from patterns inside the prompt itself during inference.</p>
<p>Here’s the original paper if you want to explore it directly: <a href="https://arxiv.org/pdf/2005.14165"><em>Language Models are Few-Shot Learners (PDF)</em></a></p>
<p>And here’s a quick infographic of what we’ll cover throughout this review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/871201a8-de4c-4a1c-8b75-4bab09fdb1fc.png" alt="GPT-3 Quick Insight" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h2 id="heading-table-of-content">Table of Contents</h2>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-paper">Goals of the Paper</a></p>
</li>
<li><p><a href="#heading-core-idea">Core Idea</a></p>
</li>
<li><p><a href="#heading-methodology">Methodology</a></p>
</li>
<li><p><a href="#heading-fine-tuning-vs-zero-shot-vs-few-shot">Fine-tuning vs Zero-Shot vs Few-Shot</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-experiments">Experiments</a></p>
</li>
<li><p><a href="#heading-key-findings">Key Findings</a></p>
</li>
<li><p><a href="#heading-task-specific-observations">Task-Specific Observations</a></p>
</li>
<li><p><a href="#heading-generalization-vs-memorization">Generalization vs Memorization</a></p>
</li>
<li><p><a href="#heading-discussion">Discussion</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-gpt-1-vs-gpt-2-vs-gpt-3-key-differences">GPT-1 vs GPT-2 vs GPT-3: Key Differences</a></p>
</li>
<li><p><a href="#heading-pytorch-implementations-of-the-gpt-architecture-evolution">PyTorch Implementations of the GPT Architecture Evolution</a></p>
</li>
<li><p><a href="#heading-resources">Resources:</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to already be familiar with a few foundational ideas.</p>
<p>Reading the previous reviews in this series will be especially helpful:</p>
<ul>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/"><em>AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</em></a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/"><em>AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)</em></a></p>
</li>
</ul>
<p>GPT-3 directly builds on many of the ideas introduced in those earlier papers, especially pre-training, zero-shot learning, and large-scale language modeling.</p>
<p>It also helps to have:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and how machines work with text</p>
</li>
<li><p>A high-level idea of what a Transformer model is (you do not need deep mathematical details)</p>
</li>
<li><p>Familiarity with supervised learning, unsupervised learning, and zero-shot learning</p>
</li>
<li><p>A basic understanding of prompts and how language models generate text</p>
</li>
<li><p>General machine learning concepts like training data, parameters, scaling, and inference</p>
</li>
</ul>
<p>You do not need to be an AI researcher to follow this article, though.</p>
<p>I’ll keep the explanations practical and intuitive, focusing more on understanding the core ideas behind GPT-3 rather than getting lost in dense mathematical details or academic terminology.</p>
<h2 id="heading-executive-summary"><strong>Executive Summary</strong></h2>
<p>Before GPT-3, models like GPT-2 had already shown something surprising: a language model trained only to predict the next word could still perform many tasks it was never directly trained for. Translation, summarization, question answering: somehow these abilities started appearing naturally as models became larger.</p>
<p>But there was still a limitation.</p>
<p>Even with GPT-2, strong performance often depended on careful prompting or additional fine-tuning. In practice, most NLP systems still followed the same pattern: train a large model first, then retrain or fine-tune it separately for every new task.</p>
<p>GPT-3 challenges that entire workflow.</p>
<p>According to the authors, if a language model becomes large enough, it can begin learning tasks directly from context alone. Instead of updating the model’s parameters, you simply show it a few examples inside the prompt, and the model continues the pattern.</p>
<p>This idea is what the paper calls <em>few-shot learning</em>.</p>
<p>For example, rather than training a separate translation model, you could write something like:</p>
<ul>
<li><p>dog → chien</p>
</li>
<li><p>cat → chat</p>
</li>
<li><p>house → ?</p>
</li>
</ul>
<p>And GPT-3 would often continue with the correct answer: <em>maison</em>.</p>
<p>What makes this important is that the model is not learning through gradient updates during inference. There is no retraining happening in the traditional sense. The learning happens inside the context window itself, through the examples provided in the prompt.</p>
<p>This marks a major shift in how language models are used.</p>
<p>Instead of building a specialized system for every task, GPT-3 suggests that a single sufficiently large model can adapt dynamically just by reading instructions and examples. The paper refers to this behavior as <em>in-context learning</em>, and much of GPT-3’s contribution revolves around showing how powerful this idea becomes at scale.</p>
<h2 id="heading-goals-of-the-paper"><strong>Goals of the Paper</strong></h2>
<p>According to the authors, one of the biggest limitations of existing NLP systems is that they depend too heavily on task-specific training. Even though models had become increasingly powerful by the time GPT-3 was introduced, most systems still required a separate fine-tuning process for every new task.</p>
<p>In practice, this created several problems.</p>
<p>First, every task needed labeled data. If you wanted a model to summarize articles, answer questions, classify sentiment, or translate text, you usually needed thousands, or sometimes millions, of carefully prepared examples. Collecting that data was expensive, time-consuming, and often unrealistic for smaller or niche tasks.</p>
<p>Second, every new capability required additional training. Even when the underlying model was already pretrained on massive amounts of text, developers still had to retrain or fine-tune it again and again for specific use cases.</p>
<p>The paper argues that this workflow is fundamentally inefficient. More importantly, the authors point out that it does not resemble how humans learn. Humans can often understand a task after seeing only a few demonstrations or simple instructions. We do not usually need thousands of labeled examples to figure out what is being asked.</p>
<p>This becomes the central question behind GPT-3:</p>
<p>Can a language model learn new tasks directly from context instead of relying on parameter updates and task-specific retraining?</p>
<p>That question drives nearly every experiment in the paper. Rather than testing whether GPT-3 can master one carefully optimized benchmark, the authors are exploring something broader: whether scaling language models can produce systems that adapt dynamically just from prompts, examples, and natural language instructions.</p>
<h2 id="heading-core-idea"><strong>Core Idea</strong></h2>
<p>At its core, GPT-3 is still built around the same fundamental idea used in GPT-2: train a language model to predict the next token in a sequence. The training objective itself is surprisingly simple. Given some text, the model learns to guess what comes next, one token at a time.</p>
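<p>To see what "predict the next token" means in the simplest possible terms, here is a toy model that just counts which word follows which. It is a far cruder stand-in than a Transformer and is not how GPT-3 is implemented. The corpus and the helper names <code>train_bigram</code> and <code>predict_next</code> are invented for illustration.</p>
<pre><code class="language-python">from collections import Counter, defaultdict


def train_bigram(tokens):
    # For every token, count which tokens were seen immediately after it
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, token):
    # Return the most frequent follower, or None for an unseen token
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "on"))
</code></pre>
<p>GPT-3 has the same goal, but it conditions on a long context instead of a single previous word and learns its predictions with billions of parameters rather than a lookup table.</p>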
<p>On the surface, GPT-3 may look like nothing more than a much larger version of GPT-2. And in some ways, that is true. The model scales dramatically in size, growing to 175 billion parameters, and it is trained on a far larger and more diverse dataset gathered from sources like Common Crawl, WebText, books, and Wikipedia.</p>
<p>But the paper argues that something more interesting begins to happen as language models scale.</p>
<p>Instead of simply memorizing text patterns better, GPT-3 starts showing the ability to learn tasks directly from prompts. When the model sees examples inside the input itself, it can often continue the pattern correctly without any additional training or parameter updates.</p>
<p>For example, if the prompt contains a few question-answer pairs or translation examples, GPT-3 can infer the structure of the task and generate similar outputs for new inputs. In other words, the prompt becomes a temporary learning environment.</p>
<p>This is the key conceptual shift in the paper.</p>
<p>Traditional machine learning usually separates training from inference. First the model learns by updating its weights, then later it is deployed to make predictions. GPT-3 blurs that boundary. The model still learns during pretraining, of course, but during inference it can also adapt behavior dynamically based on the context it receives.</p>
<p>The authors describe this behavior as <em>in-context learning</em>.</p>
<p>What makes this idea important is that the model is not retrained for each task. There are no gradient updates happening while the prompt is processed. Instead, GPT-3 learns from the examples embedded inside the context window itself.</p>
<p>This marks a subtle but important change in how we think about language models. The prompt is no longer just an input. It effectively becomes a lightweight interface for teaching the model what to do.</p>
<h2 id="heading-methodology"><strong>Methodology</strong></h2>
<p>One reason GPT-3 became so influential is that the underlying training process is actually very familiar. Unlike many research papers that introduce entirely new architectures or complicated learning algorithms, GPT-3 mostly builds on ideas that already existed before it. The difference is how aggressively those ideas are scaled.</p>
<p>According to the authors, the core training objective remains standard autoregressive language modeling. In simple terms, the model reads text and repeatedly learns to predict the next token in the sequence. This is the same general approach used in GPT-2.</p>
<p>The process itself is conceptually straightforward:</p>
<ul>
<li><p>Train a very large Transformer model</p>
</li>
<li><p>Feed it enormous amounts of internet text</p>
</li>
<li><p>Optimize it to predict the next word over and over again</p>
</li>
</ul>
<p>What changes dramatically is the scale.</p>
<p>GPT-3 is trained on hundreds of billions of tokens collected from sources such as Common Crawl, WebText, books, and Wikipedia. The paper also explains that OpenAI filtered and cleaned large portions of the Common Crawl dataset to improve quality and reduce duplication.</p>
<p>But the most important part of the methodology is not just how the model is trained. It is how the model is <em>used after training</em>.</p>
<p>Traditionally, NLP systems relied heavily on fine-tuning. After pretraining a language model, developers would train it again on a smaller labeled dataset for each individual task. GPT-3 experiments with a different approach entirely.</p>
<p>Instead of retraining the model, tasks are described directly inside the prompt.</p>
<p>The paper studies three main settings:</p>
<ul>
<li><p><em>Zero-shot learning</em>: the model receives only a natural language instruction</p>
</li>
<li><p><em>One-shot learning</em>: the model receives a single example of the task</p>
</li>
<li><p><em>Few-shot learning</em>: the model receives several examples before solving a new case (typically between 10 and 100 in the paper, limited by what fits in the context window)</p>
</li>
</ul>
<p>For example, a translation prompt might look like this:</p>
<p>dog → chien<br>cat → chat<br>house → ?</p>
<p>GPT-3 then continues the pattern and predicts:</p>
<p>maison</p>
<p>What makes this remarkable is that no retraining happens during this process. The model’s weights remain completely unchanged. It is simply using the information inside the prompt to infer what kind of task is being requested.</p>
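<p>The three settings differ only in how many demonstrations are placed in the prompt. The sketch below shows one way to assemble such prompts as plain strings. The helper name <code>build_prompt</code>, the task description, and the arrow format are illustrative assumptions rather than the exact templates used in the paper.</p>

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt: optional description, K demonstrations, then the query."""
    lines = []
    if task_description:
        lines.append(task_description)
    for source, target in demonstrations:
        lines.append(f"{source} → {target}")
    # The final line is left open so the model continues the pattern.
    lines.append(f"{query} →")
    return "\n".join(lines)

description = "Translate English to French:"
examples = [("dog", "chien"), ("cat", "chat")]

zero_shot = build_prompt(description, [], "house")           # K = 0
one_shot = build_prompt(description, examples[:1], "house")  # K = 1
few_shot = build_prompt(description, examples, "house")      # K = 2

print(few_shot)
```

<p>Nothing here touches model weights. The only thing that changes between settings is the text the model is asked to continue.</p>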
<p>In practice, this transforms the prompt into something much more powerful than an ordinary input. It becomes a temporary workspace where the model can recognize patterns, adapt behavior, and apply learned knowledge dynamically.</p>
<p>The paper repeatedly emphasizes that this behavior emerges through scale rather than task-specific engineering. GPT-3 is not trained separately for translation, summarization, reasoning, or question answering. Instead, the same general language modeling objective appears to produce all of these abilities when the model becomes sufficiently large.</p>
<h2 id="heading-fine-tuning-vs-zero-shot-vs-few-shot"><strong>Fine-tuning vs Zero-Shot vs Few-Shot</strong></h2>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Fine-Tuning</strong></p></td><td><p><strong>Zero-Shot Learning</strong></p></td><td><p><strong>Few-Shot Learning</strong></p></td></tr><tr><td><p><strong>Definition</strong></p></td><td><p>The model is additionally trained on labeled data for a specific task</p></td><td><p>The model performs a task using only instructions, without examples</p></td><td><p>The model learns the task from a small number of examples inside the prompt</p></td></tr><tr><td><p><strong>Training Requirement</strong></p></td><td><p>Requires supervised task-specific datasets</p></td><td><p>No task-specific training or examples</p></td><td><p>No retraining, but requires a few demonstrations in the prompt</p></td></tr><tr><td><p><strong>How Tasks Are Given</strong></p></td><td><p>Through a separate training phase</p></td><td><p>Through natural language instructions</p></td><td><p>Through instructions plus a few input-output examples</p></td></tr><tr><td><p><strong>Learning Process</strong></p></td><td><p>Model weights are updated during training</p></td><td><p>No weight updates</p></td><td><p>No weight updates; learning happens inside the context window</p></td></tr><tr><td><p><strong>Flexibility</strong></p></td><td><p>Usually specialized for one task</p></td><td><p>Highly flexible across many tasks</p></td><td><p>Flexible while still benefiting from demonstrations</p></td></tr><tr><td><p><strong>Adaptability</strong></p></td><td><p>Requires retraining for new tasks</p></td><td><p>Adapts instantly through prompting</p></td><td><p>Adapts quickly from contextual examples</p></td></tr><tr><td><p><strong>Data Dependency</strong></p></td><td><p>Depends heavily on labeled datasets</p></td><td><p>Depends mostly on pretraining knowledge</p></td><td><p>Depends on both 
pretraining and prompt examples</p></td></tr><tr><td><p><strong>Performance</strong></p></td><td><p>Often strongest on narrow benchmark tasks</p></td><td><p>Usually weaker than fine-tuning</p></td><td><p>Often much stronger than zero-shot and sometimes close to fine-tuning</p></td></tr><tr><td><p><strong>Scalability Across Tasks</strong></p></td><td><p>Expensive and difficult to scale</p></td><td><p>Extremely scalable</p></td><td><p>Scalable without retraining</p></td></tr><tr><td><p><strong>Compute Cost</strong></p></td><td><p>High because every task may require new training</p></td><td><p>Low during usage</p></td><td><p>Low during usage</p></td></tr><tr><td><p><strong>Example</strong></p></td><td><p>Fine-tune a model on a sentiment analysis dataset</p></td><td><p>“Classify the sentiment of this sentence”</p></td><td><p>“Positive: I loved the movie. Negative: The film was boring. Sentence: The story was amazing →”</p></td></tr><tr><td><p><strong>Main Strength</strong></p></td><td><p>High accuracy on carefully trained tasks</p></td><td><p>Simplicity and broad generalization</p></td><td><p>Strong balance between flexibility and performance</p></td></tr><tr><td><p><strong>Main Weakness</strong></p></td><td><p>Poor scalability across many tasks</p></td><td><p>Can misunderstand task format or intent</p></td><td><p>Sensitive to prompt quality and example selection</p></td></tr><tr><td><p><strong>Most Associated With</strong></p></td><td><p>Traditional NLP systems, GPT-1 era</p></td><td><p>GPT-2 style prompting</p></td><td><p>GPT-3 and in-context learning</p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Train specifically for each task</p></td><td><p>Infer the task from instructions</p></td><td><p>Infer the task from examples in context</p></td></tr></tbody></table>

<h2 id="heading-model-architecture"><strong>Model Architecture</strong></h2>
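<p>Architecturally, GPT-3 does not introduce a radically new design. In fact, one of the most interesting aspects of the paper is that the core architecture is almost identical to GPT-2. OpenAI continues using a decoder-only Transformer model trained with an autoregressive objective. The main documented difference is that GPT-3 alternates dense and locally banded sparse attention patterns across its layers, similar to the Sparse Transformer.</p>
<p>At a high level, the Transformer architecture processes text using a mechanism called <em>attention</em>. Instead of passing information strictly one step at a time like older recurrent models, a Transformer lets each position look directly at other positions in the sequence and determine which words are most relevant to it. In a decoder-only model such as GPT-3, each position attends only to the tokens that came before it.</p>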
<p>More specifically, GPT-3 relies on <em>self-attention</em>, which allows the model to weigh different parts of the context while generating text. This helps the model capture long-range relationships between words, sentences, and ideas.</p>
<p>The model is also <em>autoregressive</em>, meaning it generates text sequentially by predicting the next token based on everything that came before it. This next-token prediction objective remains the foundation of GPT-3, just as it was for GPT-2.</p>
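<p>Self-attention and the autoregressive constraint can be sketched together in a few lines. The example below is a deliberately simplified single-head version in plain Python. It uses the input vectors directly as queries, keys, and values (no learned projections), which is an assumption made to keep it short and is not how GPT-3 itself is implemented. The function name <code>causal_self_attention</code> and the toy vectors are illustrative.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(vectors):
    """Single-head attention where position i only sees positions 0..i.

    Queries, keys, and values are the input vectors themselves,
    a simplification that skips the learned projection matrices.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: future positions are dropped
        scores = [dot(query, key) / math.sqrt(dim) for key in visible]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = causal_self_attention(tokens)
for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

<p>The first position can only attend to itself, while later positions spread their attention over everything before them. That restriction is what lets the model be trained and sampled one token at a time.</p>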
<p>So if the architecture is mostly the same, what actually changed?</p>
<p>The answer is scale.</p>
<p>GPT-3 dramatically increases the size of the model, the amount of training data, and the computational resources used during training. The largest version of GPT-3 contains 175 billion parameters, making it far larger than GPT-2’s 1.5 billion parameter model.</p>
<p>The paper also experiments with multiple model sizes ranging from 125 million parameters all the way to 175 billion. This was important because the authors wanted to study how capabilities evolve as models grow larger.</p>
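<p>The parameter counts can be sanity-checked with a common back-of-the-envelope rule: a decoder-only Transformer has roughly 12 × layers × width² weights, plus its embeddings. The sketch below applies that rule to the smallest and largest configurations listed in the paper (12 layers at width 768, and 96 layers at width 12288). The 12·d² approximation, the 4x feed-forward expansion, and the vocabulary size of 50,257 (the GPT-2 tokenizer) are assumptions used for this estimate, not figures taken from the article.</p>

```python
def estimate_parameters(n_layers, d_model, vocab_size=50257, context=2048):
    """Rough decoder-only Transformer parameter count.

    Each layer has about 4*d^2 attention weights and 8*d^2 feed-forward
    weights (assuming a 4x hidden expansion). Biases and layer norms
    are ignored, so this is an approximation.
    """
    per_layer = 12 * d_model ** 2
    embeddings = (vocab_size + context) * d_model  # token + position tables
    return n_layers * per_layer + embeddings

small = estimate_parameters(n_layers=12, d_model=768)
largest = estimate_parameters(n_layers=96, d_model=12288)

print(f"smallest: about {small / 1e6:.0f}M parameters")
print(f"largest:  about {largest / 1e9:.1f}B parameters")
```

<p>Both estimates land close to the advertised 125 million and 175 billion, which shows that nearly all of the growth comes from making the same layer design wider and deeper.</p>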
<p>The architecture includes:</p>
<ul>
<li><p>A decoder-only Transformer design</p>
</li>
<li><p>A context window of 2048 tokens</p>
</li>
<li><p>Multiple model scales trained under similar objectives</p>
</li>
<li><p>Attention mechanisms that allow the model to process contextual relationships efficiently</p>
</li>
</ul>
<p>One of the paper’s most important observations is that performance improves smoothly as scale increases. Larger models consistently perform better across a wide range of tasks, including translation, question answering, reasoning, and few-shot learning.</p>
<p>This idea becomes central to the entire GPT-3 paper.</p>
<p>Rather than relying on handcrafted task-specific systems, the authors suggest that many advanced capabilities emerge naturally when language models become sufficiently large and are trained on enough diverse data. In other words, scaling itself starts acting like a research strategy.</p>
<p>What makes this shift important is that GPT-3 does not achieve its results through complicated architectural innovations. The paper’s argument is much simpler, and in some ways more surprising:</p>
<p>A relatively standard Transformer architecture, when scaled aggressively enough, begins to display entirely new behaviors.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/4ab1a945-4379-4f2a-b8a5-3dd15ddbcebb.png" alt="Transformer-Decoder-Architecture" style="display:block;margin:0 auto" width="732" height="1064" loading="lazy">

<p><strong>Note:</strong> The original figure illustrates the complete Transformer architecture (Encoder–Decoder) from <em>Attention Is All You Need</em>. For clarity and relevance to GPT-style models, the image used here was cropped to focus only on the decoder side of the architecture, since GPT models are based on a decoder-only Transformer design.</p>
<p><strong>Reference:</strong> Brownlee, J. <a href="https://machinelearningmastery.com/encoders-and-decoders-in-transformer-models/?utm_source=chatgpt.com">Encoders and Decoders in Transformer Models</a>, Machine Learning Mastery.</p>
<h2 id="heading-experiments"><strong>Experiments</strong></h2>
<p>To understand whether GPT-3 could truly learn from context alone, the authors evaluated the model across a very broad range of NLP tasks. Rather than focusing on a single benchmark, the paper tests whether the same pretrained model can adapt to many different kinds of problems using only prompts and examples.</p>
<p>The experiments cover a wide variety of domains, including:</p>
<ul>
<li><p>Language modeling and text completion</p>
</li>
<li><p>Question answering</p>
</li>
<li><p>Translation between languages</p>
</li>
<li><p>Reading comprehension</p>
</li>
<li><p>Commonsense reasoning</p>
</li>
<li><p>Winograd-style reasoning tasks</p>
</li>
<li><p>Cloze and sentence completion tasks</p>
</li>
<li><p>Synthetic reasoning problems such as arithmetic and word manipulation</p>
</li>
</ul>
<p>What makes these experiments especially important is the evaluation setup itself.</p>
<p>Instead of fine-tuning GPT-3 separately for each benchmark, the model is tested entirely through prompting. The authors evaluate GPT-3 in three different settings:</p>
<ul>
<li><p><em>Zero-shot learning</em>, where the model receives only a task description</p>
</li>
<li><p><em>One-shot learning</em>, where it receives a single example</p>
</li>
<li><p><em>Few-shot learning</em>, where several demonstrations are included inside the prompt</p>
</li>
</ul>
<p>For example, in translation tasks, the prompt may contain a few English-to-French examples before asking the model to continue the pattern. In question-answering tasks, the model might see several example questions and answers before attempting a new one.</p>
<p>Importantly, the model’s parameters never change during these evaluations. There are no gradient updates, no retraining steps, and no task-specific optimization. GPT-3 performs every task using the exact same pretrained weights.</p>
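<p>The evaluation protocol can be sketched as a loop that only ever <em>calls</em> the model. In the example below, <code>toy_model</code> is a hypothetical stand-in for a frozen language model (a fixed lookup table), and <code>format_prompt</code> and <code>evaluate</code> are illustrative helper names. The paper samples demonstrations at random from each task's training set, whereas this sketch simply takes the first K from a small pool.</p>

```python
def format_prompt(demonstrations, question):
    """Place K question/answer demonstrations before the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def evaluate(model_fn, demo_pool, test_set, k):
    """Exact-match accuracy with K demonstrations; the model is called, never updated."""
    correct = 0
    for question, answer in test_set:
        prompt = format_prompt(demo_pool[:k], question)
        prediction = model_fn(prompt).strip()
        correct += int(prediction == answer)
    return correct / len(test_set)

def toy_model(prompt):
    """Hypothetical frozen model: answers from a fixed table, ignoring demonstrations."""
    knowledge = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}
    last_question = prompt.rsplit("Q: ", 1)[1].split("\n", 1)[0]
    return knowledge.get(last_question, "unknown")

demo_pool = [("Capital of Italy?", "Rome"), ("Capital of Spain?", "Madrid")]
test_set = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]

# Zero-shot, one-shot, and few-shot runs all use the same unchanged model.
scores = {k: evaluate(toy_model, demo_pool, test_set, k) for k in (0, 1, 2)}
print(scores)
```

<p>Because the stand-in ignores its demonstrations, its score is identical for every K. The interesting finding in the paper is that a large enough real model does <em>not</em> behave this way: its accuracy rises as K grows, even though its weights stay fixed.</p>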
<p>This is one of the paper’s biggest departures from traditional NLP systems.</p>
<p>At the time, most state-of-the-art models achieved strong benchmark results through supervised fine-tuning on carefully prepared datasets. GPT-3 instead tests whether a single large language model can generalize across tasks simply by understanding patterns inside prompts.</p>
<p>The paper also evaluates how performance changes as model size increases. OpenAI trained multiple versions of GPT-3, ranging from 125 million parameters up to 175 billion parameters, then compared how scaling affected zero-shot, one-shot, and few-shot behavior.</p>
<p>According to the authors, larger models become noticeably better at using contextual information. Few-shot learning improves especially strongly with scale, suggesting that bigger models are not just memorizing more information. They are becoming better at adapting to new tasks dynamically.</p>
<h2 id="heading-key-findings"><strong>Key Findings</strong></h2>
<p>This is the section where GPT-3 stops feeling like “just a bigger language model” and starts looking like something fundamentally different.</p>
<p>According to the paper, one of the clearest patterns across nearly all experiments is that performance improves consistently as model size increases. As GPT-3 scales from millions of parameters to hundreds of billions, the model becomes dramatically better at understanding prompts, adapting to context, and performing tasks it was never explicitly trained for.</p>
<p>But the most surprising result is not simply higher benchmark scores.</p>
<p>The real breakthrough is that <em>few-shot learning actually works at scale</em>.</p>
<p>Across many tasks, GPT-3’s few-shot performance approaches strong fine-tuned systems, and in some cases even matches or surpasses them. This is remarkable because GPT-3 achieves these results without updating its weights for individual tasks. Everything happens through prompting alone.</p>
<p>One of the strongest examples appears in question answering benchmarks.</p>
<p>On TriviaQA, GPT-3 improves significantly as more examples are provided in the prompt. The paper reports 64.3% accuracy in the zero-shot setting, 68.0% one-shot, and 71.2% few-shot. Zero-shot performance is already competitive, and one-shot and few-shot prompting push results further, eventually reaching or exceeding some state-of-the-art fine-tuned systems in the same closed-book setting.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/1b4bfb72-6cbe-4af9-ba1c-5ddb1afa47eb.png" alt="ZeroShot-OneShot-FewShot learning" style="display:block;margin:0 auto" width="1487" height="827" loading="lazy">

<p>Source: Brown et al. (2020), <em>Language Models are Few-Shot Learners</em>, Figure 1.2.</p>
<p>The same pattern appears repeatedly throughout the paper:</p>
<ul>
<li><p>Few-shot prompting consistently outperforms zero-shot prompting</p>
</li>
<li><p>Larger models make better use of contextual examples</p>
</li>
<li><p>Scaling improves not only accuracy, but adaptability itself</p>
</li>
</ul>
<p>This last point is especially important.</p>
<p>The paper suggests that scaling does more than help the model memorize facts or generate more fluent text. As models become larger, they appear to develop stronger <em>in-context learning</em> abilities. In other words, bigger models become better at inferring patterns and task structures directly from prompts.</p>
<p>The authors even observe that the gap between zero-shot and few-shot performance grows with model size. Smaller models struggle to learn effectively from prompts, while larger models can often infer the task from only a handful of examples.</p>
<p>What makes this finding historically important is that it changes how researchers think about capability growth in AI systems.</p>
<p>Before GPT-3, scaling was often viewed mainly as a way to improve existing performance metrics. GPT-3 introduces a different possibility: that entirely new behaviors can emerge as models become sufficiently large.</p>
<p>This is why the paper became so influential. It was not just reporting better benchmark numbers. It was presenting evidence that scale itself can unlock qualitatively new forms of learning behavior.</p>
<h2 id="heading-task-specific-observations"><strong>Task-Specific Observations</strong></h2>
<p>When you look beyond the headline results, the paper reveals something more nuanced about GPT-3: its abilities are highly uneven. The model performs surprisingly well in some areas, yet still struggles badly in others.</p>
<p>GPT-3 shows particularly strong performance on tasks that align closely with pattern recognition and language continuation.</p>
<p>Translation is one notable example. While GPT-3 was never trained specifically as a translation system, the model can still produce impressive results when given a few examples in the prompt. According to the paper, few-shot translation performance improves substantially as model size increases, especially when translating into English.</p>
<p>The model also performs well on question answering benchmarks, especially in closed-book settings where the answer must come directly from information stored inside the model’s parameters. Tasks like TriviaQA show strong gains as GPT-3 moves from zero-shot to few-shot prompting.</p>
<p>Text completion and cloze-style tasks are another major strength. GPT-3 demonstrates a strong ability to continue patterns, complete paragraphs, and infer missing words from context. On datasets like LAMBADA, the few-shot setup produces especially large improvements.</p>
<p>But the paper is also careful about documenting weaknesses.</p>
<p>GPT-3 struggles noticeably on certain reasoning-heavy benchmarks, particularly tasks involving natural language inference. Datasets like ANLI remain difficult even for the largest model.</p>
<p>Some reading comprehension tasks also expose limitations. In several cases, GPT-3 generates answers that sound plausible but fail to demonstrate deep understanding of the passage. This becomes a recurring theme throughout the paper: fluent language generation does not always mean reliable reasoning.</p>
<p>One of the most interesting observations is how sensitive GPT-3 is to prompt design.</p>
<p>Performance often changes dramatically depending on how examples are written, formatted, or ordered inside the context window. In many tasks, adding just a few demonstrations significantly improves accuracy.</p>
<p>This suggests something important about how GPT-3 operates.</p>
<p>The model is not simply retrieving fixed knowledge from memory. Instead, it relies heavily on contextual cues to infer what kind of behavior is expected. Small prompt changes can reshape the model’s interpretation of the task itself.</p>
<p>In practice, this paper helped introduce an entirely new idea to the AI community: that <em>how you ask the model</em> can matter almost as much as the model itself.</p>
<p>That insight eventually evolved into what we now call <em>prompt engineering</em>.</p>
<h2 id="heading-generalization-vs-memorization"><strong>Generalization vs Memorization</strong></h2>
<p>One of the biggest questions surrounding GPT-3 is whether the model is genuinely learning useful patterns, or simply memorizing enormous portions of the internet.</p>
<p>This concern becomes especially important because GPT-3 is trained on massive web-scale datasets, including Common Crawl. With a model this large, it is reasonable to ask whether strong benchmark performance comes from real generalization or from accidentally seeing parts of the evaluation data during training.</p>
<p>The authors take this issue seriously and dedicate an entire section of the paper to studying what they call <em>data contamination</em>.</p>
<p>According to the paper, OpenAI searched for overlaps between the training data and benchmark datasets used during evaluation. They discovered that some contamination did exist. In other words, portions of certain evaluation datasets appeared somewhere inside the model’s training corpus.</p>
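<p>An overlap search of this kind can be approximated with word-level n-grams: collect every n-gram in the training text, then flag any benchmark example that shares at least one. The sketch below uses a short n-gram length and tiny made-up texts purely for illustration. The paper's actual procedure uses longer n-grams and additional filtering, and the helper names <code>ngrams</code> and <code>flag_contaminated</code> are assumptions.</p>

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(benchmark_examples, training_documents, n=5):
    """Return benchmark examples that share at least one n-gram with the training text."""
    seen = set()
    for document in training_documents:
        seen.update(ngrams(document, n))
    return [ex for ex in benchmark_examples if not ngrams(ex, n).isdisjoint(seen)]

# Made-up texts for illustration only.
training_documents = [
    "the quick brown fox jumps over the lazy dog near the river bank",
]
benchmark_examples = [
    "A quick brown fox jumps over the fence",   # shares "quick brown fox jumps over"
    "Which planet is known as the red planet",  # no shared 5-gram
]

flagged = flag_contaminated(benchmark_examples, training_documents)
print(flagged)
```

<p>Once examples are flagged, a benchmark can be re-scored on the clean subset only. Comparing that score with the original one is how you tell whether overlap actually inflated the result.</p>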
<p>However, the authors argue that this overlap is not large enough to fully explain GPT-3’s results.</p>
<p>For many benchmarks, performance improvements remain consistent even after accounting for contamination effects. The paper also notes that some tasks specifically designed to test adaptation and reasoning still show strong few-shot behavior despite being unlikely to appear directly in the training data.</p>
<p>Another important observation is that GPT-3 does not significantly overfit its training data when measured against a held-out validation set. This means the model has not perfectly memorized everything it has seen, even after extremely large-scale training.</p>
<p>That detail matters because it suggests the model is learning statistical structures and linguistic patterns rather than storing an exact copy of the dataset.</p>
<p>Of course, memorization does still happen to some extent. Large language models can reproduce fragments of training text, especially when the same passage is repeated many times in the training data. The paper does not deny this. Instead, the authors argue that memorization alone cannot explain GPT-3’s broad performance across translation, reasoning, question answering, and in-context learning tasks.</p>
<p>In practice, the evidence points toward something more complex.</p>
<p>GPT-3 appears to absorb patterns, relationships, and task structures from large-scale text data, then reuse those patterns flexibly in new contexts. That is very different from simply copying stored answers.</p>
<p>This distinction becomes one of the central debates in modern AI research. GPT-3 forced researchers to think more carefully about what it actually means for a language model to “understand” something, and where the boundary lies between memorization, pattern recognition, and genuine generalization.</p>
<h2 id="heading-discussion"><strong>Discussion</strong></h2>
<p>This is the point in the paper where the broader implications of GPT-3 start becoming clear.</p>
<p>According to the authors, large language models may be doing something more general than simply predicting text. By training on enormous amounts of language data, the model appears to learn patterns associated with tasks themselves.</p>
<p>That idea changes how we think about language modeling.</p>
<p>Traditionally, NLP systems were designed around explicit supervision. If you wanted a model to translate text, answer questions, summarize documents, or classify sentiment, you trained it specifically for that task using labeled examples.</p>
<p>GPT-3 suggests a different possibility.</p>
<p>The paper argues that many tasks are already implicitly embedded inside natural language data. During pretraining, the model encounters countless examples of explanations, translations, conversations, reasoning patterns, instructions, and question-answer pairs scattered across the internet. As scale increases, the model begins learning these behaviors indirectly.</p>
<p>In practice, this means the model does not always require explicit retraining to perform a new task. Instead, prompts and examples can activate behaviors the model has already absorbed during pretraining.</p>
<p>This is why prompting becomes so powerful in GPT-3.</p>
<p>The prompt is not merely providing information. It is guiding the model toward a behavior pattern that already exists somewhere inside its learned representations.</p>
<p>At the same time, the authors are careful not to overstate the results.</p>
<p>Throughout the paper, they repeatedly acknowledge that GPT-3 is still inconsistent. Some outputs are remarkably convincing, while others are obviously incorrect, nonsensical, or logically flawed.</p>
<p>This becomes one of GPT-3’s defining characteristics.</p>
<p>The model often sounds far more confident than it actually is. It can generate fluent explanations and persuasive answers even when the underlying reasoning is weak or factually wrong. In some tasks, especially deeper reasoning and reading comprehension benchmarks, GPT-3 still struggles significantly.</p>
<p>So the paper does not present GPT-3 as a solved form of intelligence.</p>
<p>Instead, it presents evidence that scaling language models unlocks new capabilities that were previously weak or absent. The results are impressive enough to suggest a major shift in direction, but not strong enough to eliminate the need for further research.</p>
<p>That balance is part of what makes the paper influential. It is ambitious, but also surprisingly honest about the limitations that still remain.</p>
<h2 id="heading-limitations"><strong>Limitations</strong></h2>
<p>One reason the GPT-3 paper remained credible despite the excitement surrounding it is that the authors were unusually open about the model’s weaknesses. The paper does not claim that few-shot learning solves NLP, nor does it pretend that GPT-3 works reliably on every task.</p>
<p>In many cases, traditional fine-tuned systems still perform better.</p>
<p>Although GPT-3 achieves impressive few-shot results across a wide range of benchmarks, the model continues to struggle on several reasoning-heavy tasks, especially natural language inference and certain reading comprehension datasets.</p>
<p>The paper also emphasizes that GPT-3’s success depends heavily on scale. Smaller versions of the model show far weaker few-shot capabilities, while the strongest results appear only at extremely large parameter counts.</p>
<p>This creates a major practical problem.</p>
<p>Training GPT-3 required enormous computational resources, specialized infrastructure, and vast amounts of data. The largest model contains 175 billion parameters and was trained using large GPU clusters over massive datasets.</p>
<p>In practice, very few organizations in the world could realistically reproduce this work at the time.</p>
<p>The paper also discusses broader concerns around bias and fairness. Since GPT-3 learns from large internet datasets, it inevitably absorbs social biases, stereotypes, and problematic language patterns present in the data itself.</p>
<p>This becomes especially concerning because the model can generate highly convincing text. Incorrect or biased outputs may sound authoritative even when they are misleading or harmful.</p>
<p>Another issue the authors examine is <em>data contamination</em>. Because GPT-3 is trained on web-scale corpora, parts of benchmark datasets may accidentally appear in the training data. The paper investigates this directly and acknowledges that some overlap exists, although the authors argue that contamination alone does not explain the overall results.</p>
<p>There is also an environmental and economic cost to scaling models this aggressively.</p>
<p>Training systems at the scale of GPT-3 consumes enormous amounts of compute and energy, raising questions about sustainability and accessibility in AI research. As models become larger, cutting-edge progress increasingly depends on access to industrial-scale infrastructure.</p>
<p>This creates a tension that still exists today.</p>
<p>GPT-3 demonstrated that scaling works extraordinarily well, but it also highlighted how concentrated advanced AI research was becoming. The future of large language models was clearly promising, but also increasingly expensive.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The paper ends with a surprisingly simple conclusion: scaling language models changes what they are capable of doing.</p>
<p>According to the authors, GPT-3 demonstrates that a sufficiently large language model can learn tasks directly from context without requiring gradient updates or task-specific fine-tuning.</p>
<p>That idea represents a major shift in the direction of NLP.</p>
<p>For years, the standard workflow in machine learning looked something like this:</p>
<ul>
<li><p>Pretrain a model</p>
</li>
<li><p>Fine-tune it for a specific task</p>
</li>
<li><p>Deploy the specialized system</p>
</li>
</ul>
<p>GPT-3 introduces a different paradigm.</p>
<p>Instead of retraining the model repeatedly for new tasks, the same pretrained model can often adapt through prompts alone. Instructions and examples inside the context window become enough to guide the model toward useful behavior.</p>
<p>In other words, the workflow starts looking more like this:</p>
<ul>
<li><p>Train once</p>
</li>
<li><p>Adapt dynamically through prompting</p>
</li>
</ul>
<p>What makes this important is not just convenience. It changes how researchers think about generalization itself.</p>
<p>The paper suggests that many capabilities traditionally associated with supervised learning can emerge naturally from large-scale language modeling. Translation, question answering, reasoning, summarization, and even task adaptation begin appearing inside a single unified system trained only with next-token prediction.</p>
<p>At the same time, the authors remain careful in their conclusions.</p>
<p>GPT-3 is clearly powerful, but it is not reliable enough to be considered a complete solution to intelligence or reasoning. The paper repeatedly acknowledges weaknesses involving logic, factual accuracy, bias, and consistency.</p>
<p>Still, the broader message is difficult to ignore.</p>
<p>GPT-3 showed that scaling language models does not simply improve fluency. It can produce entirely new behaviors that were weak or absent in smaller systems. That realization reshaped the trajectory of modern AI research and laid the foundation for the prompt-driven systems that would soon follow.</p>
<h2 id="heading-final-insight"><strong>Final Insight</strong></h2>
<p>If GPT-1 introduced the idea of large-scale pretraining followed by fine-tuning, and GPT-2 showed that language models could generalize surprisingly well without task-specific training, then GPT-3 pushes the idea even further.</p>
<p>It suggests that language models can begin learning <em>during inference itself</em>.</p>
<p>That is the real conceptual shift behind this paper.</p>
<p>Before GPT-3, most AI systems were still fundamentally task-specific. Even powerful pretrained models usually needed additional supervised training before they became useful for a particular application.</p>
<p>GPT-3 starts breaking that pattern.</p>
<p>Instead of building a separate model for translation, summarization, question answering, or reasoning, the same model can adapt dynamically depending on the prompt it receives. Examples inside the context window effectively become temporary instructions for behavior.</p>
<p>In practice, this moves AI systems away from narrow specialization and toward something more flexible:</p>
<ul>
<li><p>From task-specific systems</p>
</li>
<li><p>To general-purpose models that adapt on the fly</p>
</li>
</ul>
<p>What makes this especially important is that GPT-3 did not achieve this through complicated symbolic reasoning systems or handcrafted pipelines. The model was still trained using a relatively simple next-token prediction objective. Yet at sufficient scale, entirely new behaviors started emerging.</p>
<p>Looking back, this paper feels less like the end of the GPT series and more like the beginning of a new era.</p>
<p>Many ideas that now define modern AI trace directly back to GPT-3:</p>
<ul>
<li><p>Prompt engineering</p>
</li>
<li><p>Instruction-following systems</p>
</li>
<li><p>In-context learning</p>
</li>
<li><p>Conversational AI assistants</p>
</li>
<li><p>General-purpose foundation models</p>
</li>
</ul>
<p>And ultimately, systems like ChatGPT exist because GPT-3 demonstrated that prompting itself could become a powerful interface for interacting with intelligence.</p>
<p>That is why this paper became historically important.</p>
<p>It did not just scale language models. It changed how people imagined using them.</p>
<h2 id="heading-gpt-1-vs-gpt-2-vs-gpt-3-key-differences"><strong>GPT-1 vs GPT-2 vs GPT-3: Key Differences</strong></h2>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>GPT-1</strong></p></td><td><p><strong>GPT-2</strong></p></td><td><p><strong>GPT-3</strong></p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Pre-training followed by fine-tuning</p></td><td><p>Pre-training alone enables zero-shot behavior</p></td><td><p>Large-scale pre-training enables few-shot and in-context learning</p></td></tr><tr><td><p><strong>Training Approach</strong></p></td><td><p>Two-stage pipeline: pretrain then fine-tune</p></td><td><p>Single-stage language modeling</p></td><td><p>Same language modeling approach, but massively scaled</p></td></tr><tr><td><p><strong>Supervision</strong></p></td><td><p>Requires labeled data for downstream tasks</p></td><td><p>Can perform tasks without supervised fine-tuning</p></td><td><p>Can adapt from prompts and examples without retraining</p></td></tr><tr><td><p><strong>Task Handling</strong></p></td><td><p>Separate fine-tuning for each task</p></td><td><p>Tasks handled mainly through zero-shot prompts</p></td><td><p>Tasks handled through zero-shot, one-shot, and few-shot prompting</p></td></tr><tr><td><p><strong>Learning Style</strong></p></td><td><p>Learns representations, then specializes</p></td><td><p>Learns general language patterns</p></td><td><p>Learns to infer tasks directly from context</p></td></tr><tr><td><p><strong>Generalization</strong></p></td><td><p>Limited outside fine-tuned tasks</p></td><td><p>Stronger cross-task generalization</p></td><td><p>Much stronger contextual adaptation and in-context learning</p></td></tr><tr><td><p><strong>Prompt Usage</strong></p></td><td><p>Minimal importance</p></td><td><p>Prompts become useful</p></td><td><p>Prompts become central to system behavior</p></td></tr><tr><td><p><strong>Inference 
Behavior</strong></p></td><td><p>Mostly static after training</p></td><td><p>Can generalize during inference</p></td><td><p>Can adapt dynamically during inference</p></td></tr><tr><td><p><strong>Architecture</strong></p></td><td><p>Transformer (decoder-based)</p></td><td><p>Decoder-only Transformer</p></td><td><p>Decoder-only Transformer at much larger scale</p></td></tr><tr><td><p><strong>Model Size</strong></p></td><td><p>~117M parameters</p></td><td><p>Up to 1.5B parameters</p></td><td><p>Up to 175B parameters</p></td></tr><tr><td><p><strong>Context Window</strong></p></td><td><p>512 tokens</p></td><td><p>Up to 1024 tokens</p></td><td><p>2048-token context window</p></td></tr><tr><td><p><strong>Training Data</strong></p></td><td><p>BooksCorpus</p></td><td><p>WebText internet dataset</p></td><td><p>Massive multi-source dataset including Common Crawl, WebText, Books, and Wikipedia</p></td></tr><tr><td><p><strong>Key Capability</strong></p></td><td><p>Transfer learning</p></td><td><p>Zero-shot learning</p></td><td><p>Few-shot and in-context learning</p></td></tr><tr><td><p><strong>Performance Style</strong></p></td><td><p>Strong after fine-tuning</p></td><td><p>Strong without task-specific training</p></td><td><p>Often competitive with fine-tuned systems using prompts alone</p></td></tr><tr><td><p><strong>Scaling Importance</strong></p></td><td><p>Moderate</p></td><td><p>Important</p></td><td><p>Central research strategy of the paper</p></td></tr><tr><td><p><strong>Main Limitation</strong></p></td><td><p>Requires labeled datasets and retraining</p></td><td><p>Weak reasoning and inconsistent zero-shot behavior</p></td><td><p>Extremely expensive compute requirements and persistent reasoning limitations</p></td></tr><tr><td><p><strong>Main Contribution</strong></p></td><td><p>Introduced modern NLP pre-training paradigm</p></td><td><p>Demonstrated multitask zero-shot behavior</p></td><td><p>Demonstrated emergent in-context learning 
at scale</p></td></tr><tr><td><p><strong>Historical Impact</strong></p></td><td><p>Foundation of modern Transformer NLP</p></td><td><p>Shift toward general-purpose language models</p></td><td><p>Foundation for prompt-driven AI systems and modern LLM applications</p></td></tr><tr><td><p><strong>What Changed in the Field</strong></p></td><td><p>Pre-training became standard</p></td><td><p>Prompting became viable</p></td><td><p>Prompting became the primary interface for AI systems</p></td></tr><tr><td><p><strong>Legacy</strong></p></td><td><p>Inspired modern transfer learning pipelines</p></td><td><p>Inspired large-scale generative models</p></td><td><p>Directly influenced ChatGPT, instruction tuning, and foundation models</p></td></tr></tbody></table>

<h2 id="heading-pytorch-implementations-of-the-gpt-architecture-evolution">PyTorch Implementations of the GPT Architecture Evolution</h2>
<p><strong>GPT-1: Pre-training + Fine-Tuning Architecture</strong></p>
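<pre><code class="language-python">import torch
import torch.nn as nn


class GPT1(nn.Module):
    # Simplified sketch. TransformerBlock is assumed to be defined elsewhere
    # (masked self-attention followed by a feed-forward network).
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Learned position embeddings for up to 512 positions
        self.position_embedding = nn.Embedding(512, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model)
            for _ in range(n_layers)
        ])

        self.ln_f = nn.LayerNorm(d_model)

        # Language modeling head
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # Position indices 0..seq_len-1, created on the same device as the input
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        # (batch, seq_len, d_model); position embeddings broadcast over the batch
        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.ln_f(x)

        # (batch, seq_len, vocab_size) next-token scores
        logits = self.lm_head(x)

        return logits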
</code></pre>
<p><code>GPT1</code> inherits from <code>nn.Module</code>, which is the base class used to build neural networks in PyTorch. The constructor (<code>__init__</code>) defines all trainable layers used by the model.</p>
<p><code>nn.Embedding(vocab_size, d_model)</code> creates a learnable lookup table that converts token IDs into dense vectors. Each token in the vocabulary is mapped to a vector of size <code>d_model</code>.</p>
<p>The positional embedding layer adds information about token order. Since Transformers process tokens in parallel, they need explicit positional information to understand sequence structure.</p>
<p><code>nn.ModuleList([...])</code> stores multiple <code>TransformerBlock</code> instances while ensuring PyTorch properly tracks their parameters during training. Each <code>TransformerBlock</code> typically contains masked self-attention and a feed-forward network.</p>
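<p>The <code>TransformerBlock</code> class itself is not defined in the snippets in this section. As a minimal sketch of what such a block could look like, assuming PyTorch's built-in <code>nn.MultiheadAttention</code> and a GELU feed-forward layer (the <code>n_heads</code> default and the accepted-but-ignored <code>sparse_attention</code> flag are illustrative choices, not taken from the original implementations):</p>
<pre><code class="language-python">import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Hypothetical decoder block: masked self-attention plus feed-forward."""

    def __init__(self, d_model, n_heads=4, pre_layer_norm=False,
                 sparse_attention=False):
        super().__init__()
        # sparse_attention is accepted for compatibility only;
        # this sketch always uses dense causal attention.
        self.pre_layer_norm = pre_layer_norm
        self.ln_1 = nn.LayerNorm(d_model)
        self.ln_2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def _attend(self, x):
        seq_len = x.size(1)
        # True above the diagonal marks future positions that must not be seen
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        return out

    def forward(self, x):
        if self.pre_layer_norm:
            # Pre-LN: normalize, apply the sub-layer, then add the residual
            x = x + self._attend(self.ln_1(x))
            x = x + self.mlp(self.ln_2(x))
        else:
            # Post-LN: add the residual first, then normalize
            x = self.ln_1(x + self._attend(x))
            x = self.ln_2(x + self.mlp(x))
        return x


block = TransformerBlock(d_model=16, n_heads=4, pre_layer_norm=True)
x = torch.randn(2, 5, 16)    # (batch, seq_len, d_model)
y = block(x)                 # same shape as x
</code></pre>
<p>The causal mask is what makes the block usable for next-token prediction: the output at a given position depends only on that position and the ones before it.</p>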
<p><code>nn.LayerNorm(d_model)</code> applies layer normalization before the output projection, which helps stabilize training in deep Transformer stacks. This final normalization is a simplification carried over from later GPT models: the original GPT-1 normalized after each sub-layer (Post-LN) and had no separate final layer norm.</p>
<p>The language modeling head (<code>nn.Linear</code>) projects the hidden representations back into vocabulary space. The output size equals <code>vocab_size</code>, producing prediction scores for every possible next token.</p>
<p>Inside the <code>forward()</code> method, <code>input_ids.size(1)</code> retrieves the sequence length, and <code>torch.arange(...)</code> generates positional indices for each token position.</p>
<p>The token embeddings and positional embeddings are added together to produce the initial Transformer input representation.</p>
<p>The model then passes the representation through each Transformer block sequentially:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>This iterative stacking is what allows GPT models to learn increasingly abstract contextual representations.</p>
<p>After normalization, the final hidden states are passed into <code>lm_head</code>, producing <code>logits</code>. These logits are unnormalized prediction scores used to compute probabilities for next-token generation.</p>
<p>The model finally returns the logits tensor, which is typically passed through <code>softmax</code> during inference or used directly with <code>CrossEntropyLoss</code> during training.</p>
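<p>To make the last step concrete, here is a small sketch of how the logits might be used. It assumes random tensors as a stand-in for real model output, and the variable names are illustrative:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
input_ids = torch.randint(0, vocab_size, (2, 6))    # (batch, seq_len)
logits = torch.randn(2, 6, vocab_size)              # stand-in for model(input_ids)

# Training: position t predicts token t+1, so drop the last logit and the first label
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)

# Inference: softmax over the last position gives the next-token distribution
next_token_probs = F.softmax(logits[:, -1, :], dim=-1)
next_token = next_token_probs.argmax(dim=-1)        # greedy choice
</code></pre>
<p><code>F.cross_entropy</code> applies log-softmax internally, which is why the raw logits are passed to it during training.</p>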
<p><strong>GPT-2: Zero-Shot Multitask Architecture</strong></p>
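<pre><code class="language-python">import torch
import torch.nn as nn


class GPT2(nn.Module):
    # Simplified sketch. TransformerBlock is assumed to be defined elsewhere
    # and to accept a pre_layer_norm flag.
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Context length doubles from 512 to 1024 positions
        self.position_embedding = nn.Embedding(1024, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                pre_layer_norm=True
            )
            for _ in range(n_layers)
        ])

        # Extra layer norm after the last block
        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        # Position indices 0..seq_len-1, created on the same device as the input
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        # (batch, seq_len, vocab_size) next-token scores
        logits = self.lm_head(x)

        return logits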
</code></pre>
<p>Like GPT-1, the model begins with token embeddings and positional embeddings. <code>nn.Embedding</code> converts token IDs into dense vectors, while positional embeddings provide information about token order in the sequence.</p>
<p>One noticeable difference is the larger positional embedding table (<code>1024</code> positions instead of <code>512</code>), which lets GPT-2 process longer contexts.</p>
<p>The Transformer layers are stored using <code>nn.ModuleList</code>, but each <code>TransformerBlock</code> now uses:</p>
<pre><code class="language-python">pre_layer_norm=True
</code></pre>
<p>This means layer normalization is applied before attention and feed-forward operations rather than after them. This “Pre-LN” design significantly improves gradient flow and training stability in deeper Transformer models.</p>
<p>The forward pass follows the same overall pipeline:</p>
<ol>
<li><p>Generate positional indices with <code>torch.arange()</code></p>
</li>
<li><p>Add token and positional embeddings</p>
</li>
<li><p>Pass representations through stacked Transformer blocks</p>
</li>
<li><p>Apply final normalization</p>
</li>
<li><p>Project outputs into vocabulary space</p>
</li>
</ol>
<p>The sequential block processing happens here:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>GPT-2 also introduces a small optimization in the output layer:</p>
<pre><code class="language-python">self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
</code></pre>
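<p>The bias term is dropped because it adds little in large language modeling setups and slightly reduces the parameter count. In common GPT-2 implementations, the output projection also shares its weight matrix with the token embedding (weight tying), although this sketch keeps the two separate.</p>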
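<p>A bias-free output layer also makes it easy to share weights with the token embedding, since both matrices have shape <code>(vocab_size, d_model)</code>. A minimal sketch of that weight tying, using small illustrative sizes rather than GPT-2's real dimensions:</p>
<pre><code class="language-python">import torch.nn as nn

vocab_size, d_model = 100, 16    # toy sizes for illustration

token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

untied_params = token_embedding.weight.numel() + lm_head.weight.numel()

# Tie the weights: both layers now point at the same parameter tensor
lm_head.weight = token_embedding.weight
tied_params = token_embedding.weight.numel()

print(untied_params, tied_params)    # 3200 1600
</code></pre>
<p>With tying, the vocabulary matrix is stored and trained once, which matters more as <code>vocab_size</code> and <code>d_model</code> grow.</p>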
<p>Finally, the model returns <code>logits</code>, which contain prediction scores for every token in the vocabulary at each sequence position.</p>
<p><strong>GPT-3: Few-Shot / In-Context Learning Architecture</strong></p>
<pre><code class="language-python">import torch
import torch.nn as nn


class GPT3(nn.Module):
    # Simplified sketch. TransformerBlock is assumed to be defined elsewhere.
    # The defaults mirror the 175B configuration and need hundreds of GB of
    # memory to instantiate, so pass much smaller values to experiment.
    def __init__(
        self,
        vocab_size=50257,
        d_model=12288,
        n_layers=96,
        n_heads=96,
        context_length=2048
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(context_length, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                n_heads=n_heads,
                pre_layer_norm=True,
                sparse_attention=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(
            d_model,
            vocab_size,
            bias=False
        )

    def forward(self, input_ids):
        # Position indices 0..seq_len-1, created on the same device as the input
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        # (batch, seq_len, vocab_size) next-token scores
        logits = self.lm_head(x)

        return logits
</code></pre>
<p>Compared to earlier GPT versions, this model dramatically increases scale. The embedding size (<code>d_model=12288</code>) and the number of Transformer layers (<code>96</code>) allow the network to learn highly complex language patterns and long-range dependencies.</p>
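<p>These numbers account for almost all of the 175B parameters. As a rough back-of-the-envelope check, assuming the usual feed-forward width of <code>4 * d_model</code>, ignoring biases and layer norm weights, and assuming the output projection shares weights with the token embedding:</p>
<pre><code class="language-python"># Rough parameter count for the configuration above
vocab_size, d_model, n_layers, context_length = 50257, 12288, 96, 2048

attention_params = 4 * d_model * d_model     # Q, K, V and output projections
mlp_params = 8 * d_model * d_model           # d_model to 4*d_model and back
per_layer = attention_params + mlp_params    # 12 * d_model^2

embedding_params = vocab_size * d_model + context_length * d_model
total = n_layers * per_layer + embedding_params

print(f"{total / 1e9:.1f}B parameters")      # 174.6B parameters
</code></pre>
<p>The Transformer blocks dominate: the embeddings contribute well under 1% of the total.</p>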
<p>The model also uses <code>96</code> attention heads:</p>
<pre><code class="language-python">n_heads=96
</code></pre>
<p>Multi-head attention allows the model to focus on different relationships between tokens simultaneously, improving contextual understanding.</p>
<p>The positional embedding length is expanded to <code>2048</code>, enabling the model to process much longer sequences than GPT-2.</p>
<p>Each Transformer block is configured with:</p>
<pre><code class="language-python">pre_layer_norm=True,
sparse_attention=True
</code></pre>
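<p>Pre-layer normalization improves training stability in very deep networks, while sparse attention reduces the computational cost of attention by limiting which tokens attend to each other. According to the GPT-3 paper, the model alternates dense and locally banded sparse attention layers, similar to the Sparse Transformer, so the single <code>sparse_attention=True</code> flag used here is a simplification. This matters at GPT-3 scale, where full attention over long sequences is extremely expensive.</p>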
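<p>To illustrate what a locally banded pattern means, here is a small sketch of a causal mask in which each token attends only to itself and a few preceding tokens. The function name and the <code>window</code> parameter are hypothetical, and the exact pattern used in GPT-3 may differ:</p>
<pre><code class="language-python">import torch


def banded_causal_mask(seq_len, window):
    """True where query position i may attend to key position j."""
    idx = torch.arange(seq_len)
    distance = idx.unsqueeze(1) - idx.unsqueeze(0)    # i - j
    # Allowed: j is not in the future (distance at least 0) and inside the window
    return distance.ge(0).logical_and(distance.lt(window))


mask = banded_causal_mask(seq_len=6, window=3)
dense_links = 6 * 7 // 2          # full causal attention: 21 allowed pairs
sparse_links = int(mask.sum())    # banded: at most `window` keys per query
print(dense_links, sparse_links)  # 21 15
</code></pre>
<p>With a fixed window, the number of attended pairs grows linearly with sequence length instead of quadratically.</p>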
<p>The forward pass follows the standard GPT pipeline:</p>
<ol>
<li><p>Convert token IDs into embeddings</p>
</li>
<li><p>Add positional information</p>
</li>
<li><p>Pass representations through stacked Transformer blocks</p>
</li>
<li><p>Apply final layer normalization</p>
</li>
<li><p>Generate vocabulary logits</p>
</li>
</ol>
<p>The core iterative processing happens here:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>Finally, the output layer projects the hidden states into vocabulary space, producing <code>logits</code> used for next-token prediction during training and text generation.</p>
<h2 id="heading-resources"><strong>Resources:</strong></h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1706.03762?utm_source=chatgpt.com">Attention Is All You Need</a></p>
</li>
<li><p><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf?utm_source=chatgpt.com">Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf?utm_source=chatgpt.com">Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1810.04805?utm_source=chatgpt.com">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1906.08237?utm_source=chatgpt.com">XLNet: Generalized Autoregressive Pretraining for Language Understanding</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1907.11692?utm_source=chatgpt.com">RoBERTa: A Robustly Optimized BERT Pretraining Approach</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1909.08053?utm_source=chatgpt.com">Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.08366?utm_source=chatgpt.com">Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1904.10509?utm_source=chatgpt.com">Sparse Transformers</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2001.08361?utm_source=chatgpt.com">Scaling Laws for Neural Language Models</a></p>
</li>
</ul>
<p><strong>Contact Me</strong></p>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2) ]]>
                </title>
                <description>
                    <![CDATA[ Before models like ChatGPT became part of everyday life, AI systems were already getting surprisingly good at generating text. But there was still a major limitation: most models could only perform ta ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/</link>
                <guid isPermaLink="false">6a01fbeffca21b0d4b40ae1d</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Mon, 11 May 2026 15:55:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/be6d96bd-c687-4fac-a3e2-ea68ba622c51.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Before models like ChatGPT became part of everyday life, AI systems were already getting surprisingly good at generating text. But there was still a major limitation: most models could only perform tasks they were specifically trained for.</p>
<p>If you wanted a model to translate text, summarize an article, or answer questions, you usually had to collect labeled data and train it separately for each task. AI was powerful, but still very narrow.</p>
<p>Then GPT-2 introduced a different idea.</p>
<p>Instead of teaching a model every task individually, researchers explored whether simply training a model to predict the next word on a massive amount of internet text could be enough for useful abilities to emerge on their own.</p>
<p>And surprisingly, it worked.</p>
<p>The model began showing early signs of generalization. It could answer questions, summarize text, translate between languages, and complete prompts, all without task-specific training or fine-tuning for downstream tasks.</p>
<p>Now, research papers like the one that introduced these new ideas can be difficult and time-consuming to read, especially when they’re filled with technical terminology and experimental details. So in this article, I’ll break the paper down in a simple and practical way.</p>
<p>We’ll look at what problem the paper was trying to solve, the main ideas behind GPT-2, how zero-shot learning works, and why this paper became such an important step toward modern large language models.</p>
<p>By the end, you should understand the key insights of GPT-2 without needing to read the full paper yourself.</p>
<h2 id="heading-paper-overview"><strong>Paper Overview</strong></h2>
<p>In this article, we’ll review the paper <em>Language Models are Unsupervised Multitask Learners</em> by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.</p>
<p>The paper introduced GPT-2 and showed how a language model trained on massive amounts of text could perform multiple tasks without task-specific training.</p>
<p>Here’s the actual paper if you want to read it yourself:</p>
<p><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf?utm_source=chatgpt.com">Language Models are Unsupervised Multitask Learners (PDF)</a></p>
<p>And here’s a quick infographic of what we’ll cover in this review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/0a814405-f634-4251-a1be-b3b02d785691.png" alt="AI paper quick insights" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-paper">Goals of the Paper</a></p>
</li>
<li><p><a href="#heading-core-idea">Core Idea</a></p>
</li>
<li><p><a href="#heading-methodology">Methodology</a></p>
</li>
<li><p><a href="#heading-zero-shot-setup">Zero-Shot Setup</a></p>
</li>
<li><p><a href="#heading-fine-tuning-vs-zero-shot-learning">Fine-tuning vs Zero-Shot Learning</a></p>
</li>
<li><p><a href="#heading-training-data-web-text">Training Data (WebText)</a></p>
</li>
<li><p><a href="#heading-input-representation">Input Representation</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-experiments">Experiments</a></p>
</li>
<li><p><a href="#heading-key-findings">Key Findings</a></p>
</li>
<li><p><a href="#heading-task-specific">Task-Specific</a></p>
</li>
<li><p><a href="#heading-generalization-vs-memorization">Generalization vs Memorization</a></p>
</li>
<li><p><a href="#heading-discussion">Discussion</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-gpt-1-vs-gpt-2-key-differences">GPT-1 vs GPT-2 — Key Differences</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to be familiar with a few basic ideas:</p>
<ul>
<li><p>Reading the previous review, <a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/">AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</a>, will be helpful and will give you some solid background info and context (since GPT-2 directly builds on many of the ideas introduced there).</p>
</li>
<li><p>A general understanding of <a href="https://www.freecodecamp.org/news/natural-language-processing-with-spacy-python-full-course/">natural language processing (NLP)</a> and how machines work with text</p>
</li>
<li><p>A high-level idea of what a <a href="https://www.freecodecamp.org/news/how-transformer-models-work-for-language-processing/">Transformer model</a> is (you don’t need deep technical details, just the basic concept)</p>
</li>
<li><p>The difference between supervised learning, unsupervised learning, and zero-shot learning</p>
</li>
<li><p>Basic <a href="https://www.freecodecamp.org/news/learn-the-foundations-of-machine-learning-and-artificial-intelligence/">machine learning concepts</a> like training data, models, and scaling</p>
</li>
</ul>
<p>If you’re not fully comfortable with all of these, that’s completely okay. I’ll keep the explanations as simple and intuitive as possible, focusing more on understanding the ideas than getting lost in heavy technical details.</p>
<h2 id="heading-executive-summary"><strong>Executive Summary</strong></h2>
<p>Before GPT-2, most NLP systems depended heavily on supervised learning. Each task, whether it was translation, question answering, or summarization, typically required its own labeled dataset and a model trained specifically for it.</p>
<p>This paper challenges that approach.</p>
<p>According to the authors, a single large language model, trained only to predict the next word in a sequence of text, can learn to perform many different tasks without any task-specific training.</p>
<p>Instead of being explicitly taught how to solve each problem, the model picks up these abilities from patterns in the data.</p>
<p>In simple terms, the model is not directly trained to translate, answer questions, or summarize. Rather, it learns to do these things implicitly through exposure to large amounts of text.</p>
<p>This marks an important shift. Rather than relying on supervised learning for every task, the paper shows that models can begin to generalize across tasks in what is now known as a zero-shot setting.</p>
<h2 id="heading-goals-of-the-paper"><strong>Goals of the Paper</strong></h2>
<p>To understand the motivation behind this work, it helps to look at the limitations of traditional NLP systems.</p>
<p>According to the authors, most existing approaches rely heavily on labeled datasets, require separate training for each task, and struggle to generalize beyond the specific problems they were designed for.</p>
<p>In practice, this makes systems powerful but narrow: they perform well on what they are trained for, but don’t easily transfer that knowledge elsewhere.</p>
<p>This paper explores a different direction.</p>
<p>The authors ask whether a model can learn to perform multiple tasks without explicit supervision, simply by training on large amounts of text.</p>
<p>They also investigate whether language modeling alone is enough to capture general capabilities, and whether increasing the size of the model and the amount of data can improve this behavior.</p>
<p>At its core, the goal is to move toward more general systems that learn from language itself, rather than from carefully labeled datasets.</p>
<h2 id="heading-core-idea"><strong>Core Idea</strong></h2>
<p>At the heart of the paper is a simple but powerful idea: instead of training models in the traditional supervised way (mapping inputs directly to outputs), the authors train a model to do just one thing: predict the next word in a sequence of text.</p>
<p>At first, this might sound limited. But the key insight is that natural language already contains many examples of tasks embedded within it.</p>
<p>Text on the internet includes questions followed by answers, translations between languages, summaries of longer content, and detailed explanations.</p>
<p>According to the paper, by learning to predict and generate text, the model is indirectly learning how these tasks work. In other words, it begins to model relationships like <em>p(output | input, task)</em>, where the task is described in the text itself rather than being built into the architecture or the training objective.</p>
<p>This is what allows the model to move beyond a single objective and start behaving like a general system.</p>
<h2 id="heading-methodology"><strong>Methodology</strong></h2>
<p>To understand how this idea works in practice, it helps to look at how the model is trained.</p>
<p>According to the authors, everything starts with a standard language modeling objective.</p>
<p>The model is trained to predict the next token in a sequence based on the tokens that come before it.</p>
<p>While this may seem simple, it allows the model to learn the underlying structure of language over time.</p>
<p>Formally, this means the model is learning probabilities over sequences of text. In practice, this ability enables it to generate coherent text, complete sentences, and even mimic patterns that resemble specific tasks.</p>
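<p>The probability of a whole sequence is the product of the next-token probabilities along it. Here is a toy sketch of that chain rule. The probability table is made up for illustration, and it conditions only on the previous token, whereas GPT-2 conditions on all previous tokens:</p>
<pre><code class="language-python">import math

# Hypothetical next-token probabilities p(token | previous token)
next_token_prob = {
    ("BOS", "the"): 0.6,
    ("the", "cat"): 0.3,
    ("cat", "sat"): 0.5,
}


def sequence_log_prob(tokens):
    """Sum the log-probabilities of each token given the one before it."""
    log_p = 0.0
    prev = "BOS"    # beginning-of-sequence marker
    for tok in tokens:
        log_p += math.log(next_token_prob[(prev, tok)])
        prev = tok
    return log_p


log_p = sequence_log_prob(["the", "cat", "sat"])
prob = math.exp(log_p)    # 0.6 * 0.3 * 0.5
</code></pre>
<p>Training pushes this log-probability up for real text, which is the same quantity the cross-entropy loss measures with the sign flipped.</p>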
<p>This is what makes the approach powerful. Even though the model is only trained to predict the next word, it ends up capturing much richer behavior that can be applied to a variety of tasks.</p>
<h2 id="heading-zero-shot-setup"><strong>Zero-Shot Setup</strong></h2>
<p>One of the most important differences from earlier approaches is how the model is used after training.</p>
<p>Unlike GPT-1, there's no fine-tuning or task-specific training. The model isn't adapted or retrained for each new task. Instead, everything is handled through the input itself.</p>
<p>According to the authors, tasks are expressed directly as text prompts. For example, you might write something like “Translate to French:” followed by a sentence, or “Answer the question:” followed by a prompt. The model then continues the text in a way that reflects the task.</p>
<p>In practice, this means the model isn't explicitly told what to do through training – it infers the task from the structure of the input and responds accordingly.</p>
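<p>Mechanically, "continuing the text" is just repeated next-token prediction. A minimal sketch of greedy decoding, assuming a stand-in <code>toy_model</code> instead of a trained network and skipping tokenization (both are illustrative, not the paper's setup):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

# The task lives in the prompt text itself; no task-specific head or fine-tuning.
prompt = "Translate to French: The weather is nice today."
# A real pipeline would tokenize `prompt` into IDs; a toy ID sequence is used below.


def toy_model(input_ids, vocab_size=10):
    """Stand-in for a trained model: always predicts (token + 1) mod vocab_size."""
    next_ids = (input_ids + 1) % vocab_size
    return F.one_hot(next_ids, vocab_size).float()    # (batch, seq_len, vocab)


@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(input_ids)
        # Take the highest-scoring token at the last position and append it
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids


generated = greedy_generate(toy_model, torch.tensor([[3]]), max_new_tokens=4)
print(generated.tolist())    # [[3, 4, 5, 6, 7]]
</code></pre>
<p>Swapping the prompt changes the task without touching the model or the decoding loop.</p>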
<h2 id="heading-fine-tuning-vs-zero-shot-learning"><strong>Fine-tuning vs Zero-Shot Learning</strong></h2>
<table style="min-width:75px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Fine-tuning (Task-Specific Training)</strong></p></td><td><p><strong>Zero-Shot Learning</strong></p></td></tr><tr><td><p><strong>Definition</strong></p></td><td><p>Model is trained further on labeled data for a specific task</p></td><td><p>Model performs tasks without any additional training</p></td></tr><tr><td><p><strong>Training Requirement</strong></p></td><td><p>Requires task-specific labeled datasets</p></td><td><p>No labeled data needed for the task</p></td></tr><tr><td><p><strong>Setup</strong></p></td><td><p>Separate training phase for each task</p></td><td><p>Tasks are given as natural language prompts</p></td></tr><tr><td><p><strong>Flexibility</strong></p></td><td><p>Limited to trained tasks</p></td><td><p>Can generalize to many unseen tasks</p></td></tr><tr><td><p><strong>Performance</strong></p></td><td><p>Usually higher on specific tasks</p></td><td><p>Lower, but improving with scale</p></td></tr><tr><td><p><strong>Cost</strong></p></td><td><p>Expensive (training per task)</p></td><td><p>Efficient (no retraining needed)</p></td></tr><tr><td><p><strong>Adaptability</strong></p></td><td><p>Needs retraining for new tasks</p></td><td><p>Adapts instantly via prompts</p></td></tr><tr><td><p><strong>Example (NLP)</strong></p></td><td><p>Train model for sentiment analysis dataset</p></td><td><p>“Classify sentiment: …” prompt</p></td></tr><tr><td><p><strong>Used in</strong></p></td><td><p>GPT-1, traditional NLP systems</p></td><td><p>GPT-2, GPT-3, modern LLMs</p></td></tr><tr><td><p><strong>Main Advantage</strong></p></td><td><p>High accuracy on defined tasks</p></td><td><p>High flexibility and generalization</p></td></tr><tr><td><p><strong>Main Limitation</strong></p></td><td><p>Not scalable across many tasks</p></td><td><p>Less precise than fine-tuned 
models</p></td></tr></tbody></table>

<h2 id="heading-training-data-web-text"><strong>Training Data (WebText)</strong></h2>
<p>Another key part of this work is the dataset used to train the model.</p>
<p>Instead of relying on traditional sources like Wikipedia, books, or news articles alone, the authors created a new dataset called <strong>WebText</strong>.</p>
<p>It consists of roughly 8 million documents, around 40 GB of text, collected from outbound links shared on Reddit that received at least 3 karma.</p>
<p>According to the paper, this filtering step helps improve the overall quality of the data, since the content is more likely to be interesting or useful to readers.</p>
<p>What makes this dataset important is its diversity. It contains real-world language from many domains, and more importantly, it includes natural examples of tasks, such as explanations, question–answer pairs, and translations, embedded within the text itself.</p>
<h2 id="heading-input-representation"><strong>Input Representation</strong></h2>
<p>To process text, the model uses a technique called <strong>Byte Pair Encoding (BPE)</strong>.</p>
<p>According to the authors, BPE works as a middle ground between word-level and character-level representations.</p>
<p>Instead of treating text strictly as full words or individual characters, it breaks it into smaller units that can adapt depending on how frequently patterns appear in the data.</p>
<p>In practice, this allows the model to handle a wide range of text more effectively, including rare words and different languages. It also improves generalization, since the model isn't limited to a fixed vocabulary of complete words.</p>
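<p>To see how the merging works, here is a toy sketch of learning BPE merges on a tiny character-level corpus. The corpus and helper names are made up for illustration, and GPT-2's tokenizer operates on bytes with additional rules that are not shown here:</p>
<pre><code class="language-python">from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a word with one merged symbol."""
    out, skip = [], False
    for i, sym in enumerate(symbols):
        if skip:
            skip = False
            continue
        nxt = symbols[i + 1] if i + 1 != len(symbols) else None
        if (sym, nxt) == pair:
            out.append(sym + nxt)
            skip = True
        else:
            out.append(sym)
    return tuple(out)


# Toy corpus: each word (as a tuple of characters) mapped to its frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = {merge_pair(symbols, pair): freq for symbols, freq in words.items()}

print(merges)
</code></pre>
<p>Frequent character sequences such as <code>est</code> end up as single tokens, while rare words stay split into smaller pieces the model has already seen.</p>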
<h2 id="heading-model-architecture"><strong>Model Architecture</strong></h2>
<p>The model used in this paper is based on a <strong>Transformer (decoder-only)</strong> architecture, similar to GPT-1 but significantly scaled up.</p>
<p>According to the authors, the model relies on <strong>masked self-attention</strong>, which allows it to look at previous tokens in a sequence while predicting the next one.</p>
<p>This means each position can only use past context. During generation, the model produces text one token at a time, feeding each new token back in as context for the next.</p>
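<p>The masking can be sketched in a few lines. The example below assumes all-zero attention scores as a stand-in for real query-key products, so that the effect of the mask is easy to read off:</p>
<pre><code class="language-python">import torch

seq_len = 4
scores = torch.zeros(seq_len, seq_len)    # stand-in for scaled query-key scores

# Lower-triangular mask: row i may look at columns 0..i only
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Future positions get -inf so softmax assigns them zero weight
masked = scores.masked_fill(causal.logical_not(), float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights)
</code></pre>
<p>Each row sums to 1, and every entry above the diagonal is 0: the first token attends only to itself, while the last one spreads its attention over all four positions.</p>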
<p>Compared to GPT-1, several important changes were introduced.</p>
<p>The model can handle longer context, with sequences of up to 1024 tokens (up from 512), and uses a larger byte-level BPE vocabulary of 50,257 tokens. It's also much deeper, with up to 48 layers compared to GPT-1's 12, and has significantly more parameters.</p>
<p>The authors trained multiple versions of the model, ranging from 117 million to 1.5 billion parameters.</p>
<p>The largest of these is what we now refer to as GPT-2, and it's the one responsible for most of the strong results reported in the paper.</p>
<p><strong>Transformer (decoder-only)</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/602d56bd-dbf1-4eec-b11d-6d82b3dcd04d.png" alt="Transformer (decoder-only)" style="display:block;margin:0 auto" width="732" height="1064" loading="lazy">

<p><strong>Note:</strong> The original figure illustrates the complete Transformer architecture (Encoder–Decoder) from <em>Attention Is All You Need</em>. For clarity and relevance to GPT-style models, the image used here was cropped to focus only on the decoder side of the architecture, since GPT models are based on a decoder-only Transformer design.</p>
<p><strong>Reference:</strong> Brownlee, J. <a href="https://machinelearningmastery.com/encoders-and-decoders-in-transformer-models/?utm_source=chatgpt.com">Encoders and Decoders in Transformer Models</a>. Machine Learning Mastery.</p>
<h2 id="heading-experiments">Experiments</h2>
<p>To evaluate the model, the authors tested it across a wide range of tasks – but with an important constraint: according to the paper, the model wasn't trained or fine-tuned on any of these tasks.</p>
<p>Instead, everything was evaluated in a zero-shot setting, where the model is simply given a prompt and asked to continue the text.</p>
<p>They applied this setup to different types of problems, including language modeling benchmarks, reading comprehension, translation, summarization, question answering, and commonsense reasoning.</p>
<p>The goal here was not just to measure performance, but to see how far a single model (trained only on raw text) could generalize across tasks without any additional training.</p>
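<p>In a zero-shot setup, the task is expressed entirely through the text the model is asked to continue. The sketch below shows what such prompts might look like. The "TL;DR:" cue and the "english = french" pair format follow the paper's description, while the function names and the question-answer layout are illustrative assumptions.</p>
<pre><code class="language-python">def summarization_prompt(article):
    """Append the "TL;DR:" cue the paper uses to induce summarization."""
    return f"{article}\nTL;DR:"


def translation_prompt(example_pairs, sentence):
    """Show "english = french" example pairs, then leave the last one open."""
    lines = [f"{english} = {french}" for english, french in example_pairs]
    lines.append(f"{sentence} =")
    return "\n".join(lines)


def question_prompt(document, question):
    """Hypothetical layout: the question follows the document, ending with an answer cue."""
    return f"{document}\nQ: {question}\nA:"


print(translation_prompt([("good morning", "bonjour")], "thank you"))
</code></pre>
<p>In each case the model's weights stay the same. Whatever it generates after the cue is treated as its answer, and that continuation is what gets scored.</p>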
<h2 id="heading-key-findings">Key Findings</h2>
<p>After evaluating the model across different tasks, the results were stronger than many would have expected.</p>
<p>According to the authors, GPT-2 achieves state-of-the-art results on 7 out of 8 language modeling benchmarks in a zero-shot setting.</p>
<p>One of the most important observations is that performance consistently improves as the model size increases, following a roughly log-linear trend.</p>
<p>In other words, scaling up the model leads to better results across tasks.</p>
<p>The paper also shows that larger models display more consistent multitask behavior.</p>
<p>For example, GPT-2 performs well on tasks that require long-range understanding, such as LAMBADA, and shows competitive results in reading comprehension on datasets like CoQA.</p>
<p>It even demonstrates early capabilities in translation and can answer factual questions without being explicitly trained for those tasks.</p>
<p>In practice, the key takeaway is clear: increasing model size and data plays a major role in unlocking these capabilities.</p>
<h2 id="heading-task-specific">Task-Specific Results</h2>
<p>Looking more closely at individual tasks, the paper gives a clearer picture of where the model performs well and where it still struggles.</p>
<p>GPT-2 shows surprisingly strong results in reading comprehension, even without any task-specific training. But its performance on summarization is still limited.</p>
<p>While it can generate summaries that look reasonable, they're often less accurate than those produced by supervised approaches.</p>
<p>For translation, the model demonstrates some ability, but the results are still far from competitive.</p>
<p>On the other hand, question answering improves noticeably as the model size increases, suggesting that scale plays an important role in this capability.</p>
<p>Overall, the model is far from perfect. But what stands out is that it's clearly beginning to learn general skills across tasks, even without being explicitly trained for them.</p>
<h2 id="heading-generalization-vs-memorization">Generalization vs Memorization</h2>
<p>A natural question that comes up is whether the model is actually learning useful patterns or simply memorizing the training data.</p>
<p>The authors address this directly. They analyze overlap between the training dataset and evaluation benchmarks using n-gram comparisons, looking for signs that the model might be copying rather than generalizing.</p>
<p>According to the paper, while some overlap does exist (as is common in large datasets), it's not enough to explain the model’s performance.</p>
<p>They also observe that the model still underfits the data, meaning it hasn’t fully captured everything in the training set.</p>
<p>This is an important point: if the model were mainly memorizing, we would expect it to fit the data much more closely.</p>
<p>In practice, this suggests that the improvements are coming from genuine learning rather than simple memorization, even though some overlap is unavoidable.</p>
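<p>The n-gram overlap check described above can be sketched as follows. This is a simplified stand-in, not the authors' code: the paper reports using Bloom filters over 8-grams, while this version uses ordinary Python sets, and the function names and sample sentences are made up for illustration.</p>
<pre><code class="language-python">def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_text, eval_text, n=8):
    """Fraction of the evaluation text's n-grams that also occur in the training text."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    train_grams = ngrams(train_text, n)
    return len(eval_grams.intersection(train_grams)) / len(eval_grams)


# Toy strings with n=3 so the overlap is easy to check by hand
train = "the quick brown fox jumps over the lazy dog"
held_out = "a quick brown fox jumps over a fence"
print(overlap_fraction(train, held_out, n=3))  # 0.5
</code></pre>
<p>Three of the six trigrams in the held-out sentence also appear in the training sentence, so the overlap is 0.5. A high fraction on a real benchmark would suggest that part of the score could come from memorized text rather than generalization.</p>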
<h2 id="heading-discussion">Discussion</h2>
<p>This section is where the authors step back and reflect on what these results actually mean.</p>
<p>According to the paper, language models trained on large and diverse datasets aren't just learning representations of text. They're beginning to learn how to perform tasks directly, even without supervision.</p>
<p>In other words, pre-training is doing more than providing useful features: it's capturing patterns that resemble real task behavior.</p>
<p>At the same time, the authors are careful not to overstate the results.</p>
<p>While the zero-shot capabilities are impressive, performance is still far from practical on many tasks.</p>
<p>Some outputs look convincing on the surface but lack accuracy when measured more carefully.</p>
<p>In practice, this section highlights both sides of the story. The approach is clearly promising, but it's still an early step toward more general systems.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>Despite the progress shown in the paper, the approach still has several important limitations.</p>
<p>According to the authors, zero-shot performance, while impressive, is generally weaker than that of fully supervised models on many tasks.</p>
<p>The results also depend heavily on scale, both in terms of model size and the amount of data used. This means that smaller models don't show the same level of capability.</p>
<p>In addition, some tasks, such as summarization, remain relatively weak.</p>
<p>The model can produce outputs that look plausible, but they often lack accuracy or consistency when evaluated more carefully.</p>
<p>Another practical challenge is the cost. Training these models requires significant computational resources and large datasets, which makes this approach difficult to reproduce or scale for many researchers.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The paper ends with a simple but powerful idea.</p>
<p>According to the authors, when a language model is trained on a sufficiently large and diverse dataset – and with enough capacity – it begins to generalize across tasks and perform them without explicit training.</p>
<p>This suggests that the model isn't just learning language, but also the structure of the tasks embedded within it.</p>
<p>In practice, this points to a different way of thinking about AI systems. Instead of designing and training a model for each specific task, we can focus on training a single model on large-scale language data&nbsp;– and allow useful capabilities to emerge naturally from that process.</p>
<h2 id="heading-final-insight">Final Insight</h2>
<p>If GPT-1 introduced the idea of combining pre-training with fine-tuning, GPT-2 takes that idea a step further.</p>
<p>According to the paper, pre-training alone – when done at a large enough scale – can already produce models that begin to perform a wide range of tasks without any additional training.</p>
<p>This is a subtle but important shift, because it suggests that general capabilities can emerge directly from exposure to large amounts of text.</p>
<p>In my view, this is the point where things start to change direction.</p>
<p>The focus moves away from designing task-specific systems and toward building more general models that can adapt on their own.</p>
<p>This idea directly sets the stage for what comes next: models like GPT-3, ChatGPT, and modern large language systems that build on this same principle.</p>
<h2 id="heading-gpt-1-vs-gpt-2-key-differences"><strong>GPT-1 vs GPT-2 — Key Differences</strong></h2>
<table style="min-width:75px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>GPT-1</strong></p></td><td><p><strong>GPT-2</strong></p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Pre-training + fine-tuning</p></td><td><p>Pre-training alone (zero-shot)</p></td></tr><tr><td><p><strong>Training Approach</strong></p></td><td><p>Two stages: learn language, then adapt to tasks</p></td><td><p>Single stage: learn language and infer tasks</p></td></tr><tr><td><p><strong>Supervision</strong></p></td><td><p>Requires labeled data for fine-tuning</p></td><td><p>No labeled data needed for tasks</p></td></tr><tr><td><p><strong>Task Handling</strong></p></td><td><p>Tasks require separate fine-tuning</p></td><td><p>Tasks handled via prompts (zero-shot)</p></td></tr><tr><td><p><strong>Generalization</strong></p></td><td><p>Limited, depends on fine-tuning</p></td><td><p>Stronger generalization across tasks</p></td></tr><tr><td><p><strong>Model Role</strong></p></td><td><p>Learns language, then adapts</p></td><td><p>Learns language and tasks together</p></td></tr><tr><td><p><strong>Architecture</strong></p></td><td><p>Transformer (decoder-based)</p></td><td><p>Transformer (decoder-only, scaled up)</p></td></tr><tr><td><p><strong>Model Size</strong></p></td><td><p>Smaller (~117M parameters)</p></td><td><p>Much larger (up to 1.5B parameters)</p></td></tr><tr><td><p><strong>Context Length</strong></p></td><td><p>Shorter context (512 tokens)</p></td><td><p>Longer context (up to 1024 tokens)</p></td></tr><tr><td><p><strong>Dataset</strong></p></td><td><p>BooksCorpus + other curated datasets</p></td><td><p>WebText (large, diverse internet data)</p></td></tr><tr><td><p><strong>Key Capability</strong></p></td><td><p>Transfer learning</p></td><td><p>Zero-shot learning</p></td></tr><tr><td><p><strong>Performance Style</strong></p></td><td><p>Strong after fine-tuning</p></td><td><p>Strong without any task training</p></td></tr><tr><td><p><strong>Limitations</strong></p></td><td><p>Depends on labeled data</p></td><td><p>Depends heavily on scale (data + compute)</p></td></tr><tr><td><p><strong>Main Contribution</strong></p></td><td><p>Introduced pre-training paradigm</p></td><td><p>Showed emergence of multitask behavior</p></td></tr><tr><td><p><strong>Impact</strong></p></td><td><p>Foundation of modern NLP pipelines</p></td><td><p>Shift toward general-purpose models</p></td></tr></tbody></table>

<h2 id="heading-resources">Resources:</h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need</a></p>
</li>
<li><p><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Improving Language Understanding by Generative Pre-Training</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1810.04805">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a></p>
</li>
<li><p><a href="https://papers.nips.cc/paper_files/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf">Semi-supervised Sequence Learning</a></p>
</li>
<li><p><a href="https://aclanthology.org/P18-1031.pdf?">Universal Language Model Fine-tuning for Text Classification</a></p>
</li>
<li><p><a href="https://aclanthology.org/N18-1202.pdf">Deep Contextualized Word Representations</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1508.07909">Neural Machine Translation of Rare Words with Subword Units</a></p>
</li>
<li><p><a href="https://papers.nips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf">Distributed Representations of Words and Phrases and Their Compositionality</a></p>
</li>
<li><p><a href="https://aclanthology.org/D14-1162.pdf">GloVe: Global Vectors for Word Representation</a></p>
</li>
</ul>
<h3 id="heading-contact-me"><strong>Contact Me</strong></h3>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Rise of AI Agents: How Software Is Learning to Act ]]>
                </title>
                <description>
                    <![CDATA[ Software has always been reactive. You click a button, it responds. You call an API, it returns data. Even the most sophisticated systems have historically depended on explicit instructions and tightl ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-rise-of-ai-agents-how-software-is-learning-to-act/</link>
                <guid isPermaLink="false">69fe184ef239332df4ea34e7</guid>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Fri, 08 May 2026 17:07:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1351f6d0-79c2-491b-a8e7-943cc9ece905.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Software has always been reactive.</p>
<p>You click a button, it responds. You call an API, it returns data.</p>
<p>Even the most sophisticated systems have historically depended on explicit instructions and tightly defined workflows. That model is starting to break.</p>
<p>A new class of software is emerging that doesn't just respond, but acts.</p>
<p>This shift isn't cosmetic. It changes how software is designed, how systems are operated, and how work itself is executed.</p>
<p>Instead of encoding every step of a workflow, developers are now defining goals, constraints, and tools, then letting software figure out the execution path. The result is software that behaves less like a function and more like an operator.</p>
<p>In this article, you'll learn what AI agents actually are, how they differ from traditional software systems, and why they're starting to represent a major shift in modern software design.</p>
<p>This article is written for developers, technical founders, engineering managers, and anyone building software systems with AI components.</p>
<p>You don't need prior experience building AI agents, but it helps to be familiar with basic Python syntax and large language models (LLMs).</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-from-deterministic-systems-to-goal-driven-execution">From Deterministic Systems to Goal-Driven Execution</a></p>
</li>
<li><p><a href="#heading-the-core-components-of-an-ai-agent">The Core Components of an AI Agent</a></p>
</li>
<li><p><a href="#heading-why-ai-agents-are-emerging-now">Why AI Agents Are Emerging Now</a></p>
</li>
<li><p><a href="#heading-the-illusion-and-reality-of-autonomy">The Illusion and Reality of Autonomy</a></p>
</li>
<li><p><a href="#heading-designing-agents-that-work-in-practice">Designing Agents That Work in Practice</a></p>
</li>
<li><p><a href="#heading-multi-agent-systems-and-coordination">Multi-Agent Systems and Coordination</a></p>
</li>
<li><p><a href="#heading-where-ai-agents-are-already-delivering-value">Where AI Agents Are Already Delivering Value</a></p>
</li>
<li><p><a href="#heading-the-shift-in-software-design">The Shift in Software Design</a></p>
</li>
<li><p><a href="#heading-what-comes-next">What Comes Next</a></p>
</li>
</ul>
<h2 id="heading-from-deterministic-systems-to-goal-driven-execution">From Deterministic Systems to Goal-Driven Execution</h2>
<p>Traditional software systems are deterministic. Given the same input, they produce the same output.</p>
<p>This predictability is what makes them reliable, but it's also what limits them. Any variation in workflow requires new code, new conditions, and new branches.</p>
<p>AI agents introduce a different model. They're goal-driven rather than instruction-driven. Instead of specifying every step, you define an objective and provide access to tools. The agent decides how to achieve the objective, often adapting in real time.</p>
<p>Consider a simple task like summarizing a set of documents and emailing the result. In a traditional system, you would write a pipeline that loads documents, processes them, formats the output, and sends an email. Each step is explicitly coded.</p>
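<p>For comparison, a traditional version of that pipeline might look like the sketch below. Everything here is illustrative: the function names, the naive first-sentences "summary", and the print-based <code>send_email</code> stand-in are assumptions, not a real mail or summarization library.</p>
<pre><code class="language-python">from pathlib import Path


def load_documents(folder):
    """Read every .txt file in `folder` and return the contents in name order."""
    return [path.read_text() for path in sorted(Path(folder).glob("*.txt"))]


def summarize(text, max_sentences=2):
    """Naive placeholder summary: keep the first few sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def format_briefing(summaries):
    """Join per-document summaries into a bulleted briefing."""
    return "\n".join(f"- {summary}" for summary in summaries)


def send_email(recipient, body):
    """Stand-in for a real mail client; it only prints the message."""
    print(f"To: {recipient}\n{body}")


def run_pipeline(folder, recipient):
    # Every step and its order is fixed by the developer
    documents = load_documents(folder)
    summaries = [summarize(doc) for doc in documents]
    briefing = format_briefing(summaries)
    send_email(recipient, briefing)
    return briefing
</code></pre>
<p>Calling <code>run_pipeline("/reports", "leadership@example.com")</code> always runs the same four steps in the same order. If the requirements change, for example to skip drafts or attach charts, someone has to edit the code.</p>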
<p>With an agent, the system might look more like this:</p>
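<pre><code class="language-plaintext">from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# The goal states the outcome, not the steps
goal = "Summarize all documents in /reports and email a concise briefing to the leadership team"

# Tools are only named here; in a real agent each name maps to a callable
tools = [
    "read_files",
    "summarize_text",
    "send_email"
]

# Ask the model how to reach the goal with the available tools
response = client.responses.create(
    model="gpt-4.1",
    input=f"Goal: {goal}. Available tools: {tools}"
)
print(response.output_text)
</code></pre>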
<p>This example is simplified, but it captures the shift. The developer defines intent and capability. The agent determines execution.</p>
<h2 id="heading-the-core-components-of-an-ai-agent">The Core Components of an AI&nbsp;Agent</h2>
<p>To understand how agents work, it helps to break them into components. At a high level, most agents consist of reasoning, memory, and tools.</p>
<p>Reasoning is handled by a large language model. This is what allows the agent to interpret goals, plan actions, and adapt when something fails. It isn't just generating text; it's generating decisions.</p>
<p>Memory allows the agent to maintain context across steps. Without memory, the agent behaves like a stateless function. With memory, it can track progress, recall past actions, and refine its approach.</p>
<p><a href="https://www.freecodecamp.org/news/how-to-build-your-first-mcp-server-using-fastmcp/">Tools are what make the agent useful</a>. A tool can be anything from an API to a database query to a shell command. The agent doesn't need to know how the tool works internally. It only needs to know when and how to use it.</p>
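<p>One way to express "when and how to use it" in code is to pair each tool with a name and a short description the model can read. The sketch below is a minimal illustration: the <code>Tool</code> dataclass, the two toy tools, and <code>describe_tools</code> are hypothetical and not part of any agent framework.</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str  # tells the model when the tool is appropriate
    run: Callable[[str], str]  # hides how the tool works internally


def word_count(text):
    return str(len(text.split()))


def shout(text):
    return text.upper()


# Hypothetical registry: tool names mapped to their metadata and callables
TOOLS = {
    tool.name: tool
    for tool in [
        Tool("word_count", "Count the words in a piece of text.", word_count),
        Tool("shout", "Convert text to upper case.", shout),
    ]
}


def describe_tools(tools):
    """Render the tool list as text that can be placed in a prompt."""
    return "\n".join(f"{t.name}: {t.description}" for t in tools.values())


print(describe_tools(TOOLS))
print(TOOLS["word_count"].run("agents call tools"))  # 3
</code></pre>
<p>The model only ever sees the names and descriptions. The callables stay on the application side, which is what keeps the tool's internals out of the agent's reasoning.</p>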
<p>Here is a minimal example of tool usage in an agent loop:</p>
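<pre><code class="language-plaintext">def parse_tool_call(decision):
    # Assumed reply format: "USE_TOOL tool_name tool_input"
    _, tool_name, tool_input = decision.split(" ", 2)
    return tool_name, tool_input


def agent_loop(goal, tools, model, max_steps=10):
    # tools maps tool names to callables; model is any object with generate(prompt)
    context = []

    # Cap the steps so a model that never says DONE cannot loop forever
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nContext: {context}\nWhat should be done next?"

        decision = model.generate(prompt).strip()

        if decision == "DONE":
            break

        if decision.startswith("USE_TOOL"):
            tool_name, tool_input = parse_tool_call(decision)
            result = tools[tool_name](tool_input)
            context.append(result)
        else:
            context.append(decision)

    return context
</code></pre>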
<p>This loop is where the agent “acts.” It observes, decides, executes, and updates its understanding.</p>
<h2 id="heading-why-ai-agents-are-emerging-now">Why AI Agents Are Emerging&nbsp;Now</h2>
<p>The idea of autonomous software isn't new. What has changed is the capability of the underlying models.</p>
<p>Large language models can now reason across multiple steps, interpret unstructured inputs, and generate structured outputs that can drive real systems.</p>
<p>Equally important is the ecosystem around them. APIs are more standardized, infrastructure is more programmable, and data is more accessible. This makes it easier to expose tools to agents and let those agents interact with real systems, which has helped produce some of the <a href="https://nexos.ai/blog/best-ai-agents/">best AI agents</a> in use today.</p>
<p>There's also an economic driver. Many workflows today are still manual, even in highly digitized organizations. These workflows often involve coordination across systems, interpretation of data, and decision-making under uncertainty. This is exactly the kind of work agents are suited for.</p>
<h2 id="heading-the-illusion-and-reality-of-autonomy">The Illusion and Reality of&nbsp;Autonomy</h2>
<p>It's tempting to describe AI agents as fully autonomous. In practice, most are not. They operate within constraints defined by developers. They rely on tools that expose only certain actions. They're often monitored, rate-limited, and evaluated at each step.</p>
<p>What makes them different isn't complete autonomy, but partial autonomy. They can decide how to execute within a bounded environment.</p>
<p>This distinction matters because it affects how systems are designed. You're not building a system that always behaves predictably. You're building a system that explores a solution space and converges on an outcome.</p>
<p>That introduces new challenges. Agents can take inefficient paths. They can misinterpret goals. They can fail in ways that are hard to debug because the failure isn't a single error, but a chain of decisions.</p>
<h2 id="heading-designing-agents-that-work-in-practice">Designing Agents That Work in&nbsp;Practice</h2>
<p>Building an agent is easy. Building one that works reliably is harder. The difference comes down to control.</p>
<p>One approach is to constrain the agent’s <a href="https://milvus.io/ai-quick-reference/what-is-an-action-space-in-rl">action space</a>. Instead of giving it open-ended access, you define a limited set of tools with clear interfaces. This reduces ambiguity and makes behavior more predictable.</p>
<p>Another approach is to introduce intermediate checkpoints. Instead of letting the agent run freely, you validate its decisions at key steps. You can do this through rules, secondary models, or even human review.</p>
<p>Here's an example of adding a validation layer:</p>
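<pre><code class="language-plaintext">def safe_execute(tool, input_data):
    # validate_input and validate_output are placeholders for your own checks
    # Reject bad input before the tool runs
    if not validate_input(tool, input_data):
        return "Invalid input"
    
    result = tool(input_data)
    
    # Check the result before it goes back to the agent
    if not validate_output(tool, result):
        return "Invalid output"
    
    return result
</code></pre>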
<p>This pattern is critical in production systems. It turns an unconstrained agent into a controlled system that can still adapt, but within safe boundaries.</p>
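<p>Both ideas from this section, a constrained action space and validation around each call, can be combined in one small sketch. The tool, the validators, and the <code>ALLOWED_TOOLS</code> table below are hypothetical examples rather than a prescribed design.</p>
<pre><code class="language-python">def lookup_order(order_id):
    # Stand-in for a real database or API call
    return {"order_id": order_id, "status": "shipped"}


def is_valid_order_id(value):
    return isinstance(value, str) and value.isdigit()


def has_status(result):
    return isinstance(result, dict) and "status" in result


# Allowlist: tool name mapped to (callable, input validator, output validator)
ALLOWED_TOOLS = {
    "lookup_order": (lookup_order, is_valid_order_id, has_status),
}


def run_tool(name, input_data):
    """Run an allowlisted tool, rejecting unknown tools and invalid data."""
    if name not in ALLOWED_TOOLS:
        return {"ok": False, "error": f"tool not allowed: {name}"}
    tool, check_input, check_output = ALLOWED_TOOLS[name]
    if not check_input(input_data):
        return {"ok": False, "error": "invalid input"}
    result = tool(input_data)
    if not check_output(result):
        return {"ok": False, "error": "invalid output"}
    return {"ok": True, "result": result}


print(run_tool("lookup_order", "1042"))
print(run_tool("delete_database", "1042"))
print(run_tool("lookup_order", "1042; DROP TABLE orders"))
</code></pre>
<p>Only the first call reaches the tool. The second is rejected because the tool isn't on the allowlist, and the third because the input fails validation. Returning a structured result instead of a bare string also lets the agent tell a failure apart from a legitimate tool output.</p>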
<h2 id="heading-multi-agent-systems-and-coordination">Multi-Agent Systems and Coordination</h2>
<p>As agents become more capable, a single agent is often not enough. Complex tasks can be decomposed into multiple agents, each responsible for a specific function.</p>
<p>For example, one agent might handle data retrieval, another might handle analysis, and a third might handle communication. These agents can coordinate by passing structured messages.</p>
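<pre><code class="language-plaintext">class Message:
    def __init__(self, sender, receiver, content):
        self.sender = sender
        self.receiver = receiver
        self.content = content


class AnalystAgent:
    # Placeholder agent: a real one would call a model or tools here
    def process(self, message):
        return Message("analyst", message.sender, f"Received: {message.content}")


def send_message(agent, message):
    return agent.process(message)


analyst_agent = AnalystAgent()
message = Message("retriever", "analyst", "Data collected from API")
response = send_message(analyst_agent, message)
</code></pre>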
<p>This model starts to resemble a distributed system, but with agents instead of services. Coordination becomes a first-class concern. You need to define protocols, handle failures, and ensure consistency across agents.</p>
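<p>As a rough sketch of what a protocol with failure handling could look like, the example below routes messages by receiver name and converts failures into error messages instead of letting them crash the system. The agent classes and the <code>route</code> function are hypothetical and exist only to exercise the success and error paths.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    receiver: str
    content: str


class EchoAnalyst:
    """Placeholder agent; a real one would call a model or tools."""

    def process(self, message):
        return Message("analyst", message.sender, f"analysis of: {message.content}")


class FlakyAgent:
    """Agent that always fails, used to exercise the error path."""

    def process(self, message):
        raise RuntimeError("upstream API timed out")


def route(agents, message):
    """Deliver a message by receiver name and turn failures into error messages."""
    agent = agents.get(message.receiver)
    if agent is None:
        return Message("router", message.sender, f"error: unknown receiver {message.receiver}")
    try:
        return agent.process(message)
    except Exception as exc:
        return Message("router", message.sender, f"error: {message.receiver} failed ({exc})")


agents = {"analyst": EchoAnalyst(), "reporter": FlakyAgent()}
print(route(agents, Message("retriever", "analyst", "Data collected from API")).content)
print(route(agents, Message("retriever", "reporter", "Send the summary")).content)
print(route(agents, Message("retriever", "planner", "What next?")).content)
</code></pre>
<p>Because every outcome comes back as a <code>Message</code>, the sending agent can decide what to do next, such as retrying, choosing another agent, or escalating to a human, using the same channel it uses for normal replies.</p>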
<h2 id="heading-where-ai-agents-are-already-delivering-value">Where AI Agents Are Already Delivering Value</h2>
<p>Despite the hype, there are concrete areas where agents are already useful. Internal tooling is one of them. Automating repetitive workflows, generating reports, and orchestrating tasks across systems are all well-suited for agents.</p>
<p>Customer support is another area. Agents can handle complex queries that require accessing multiple systems, not just retrieving canned responses.</p>
<p>Security and compliance workflows are also a strong fit. These often involve monitoring signals, correlating data, and taking action based on rules that aren't always deterministic.</p>
<p>What these use cases have in common is that they involve structured environments with clear objectives and measurable outcomes. Agents perform best when the problem space is bounded, even if the execution path is not.</p>
<h2 id="heading-the-shift-in-software-design">The Shift in Software&nbsp;Design</h2>
<p>The rise of AI agents isn't just about adding a new feature. It's about changing the abstraction layer of software.</p>
<p>Instead of writing code that directly implements behavior, you're designing systems that enable behavior. You define goals, expose capabilities, and enforce constraints. The actual execution becomes dynamic.</p>
<p>This requires a different mindset. Debugging is no longer just about tracing code. It's about understanding decision paths. Testing is no longer just about input-output pairs. It's about evaluating behavior across scenarios.</p>
<p>Observability becomes critical. You need to log not just what the system did, but why it did it. This includes prompts, intermediate decisions, and tool interactions.</p>
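<p>A lightweight way to capture this is to record one structured event per step. The sketch below is a minimal illustration using only the standard library. The <code>Trace</code> class, the event kinds, and the sample file path are assumptions made for the example, not a standard format.</p>
<pre><code class="language-python">import json
import time


class Trace:
    """Collects one structured record per agent step."""

    def __init__(self):
        self.events = []

    def record(self, kind, **fields):
        event = {"step": len(self.events), "time": time.time(), "kind": kind}
        event.update(fields)
        self.events.append(event)

    def dump(self):
        """Serialize the trace as JSON Lines, one event per line."""
        return "\n".join(json.dumps(event) for event in self.events)


trace = Trace()
trace.record("prompt", text="Goal: summarize the Q3 report")
trace.record("decision", text="USE_TOOL read_files reports/q3.txt")
trace.record("tool_call", tool="read_files", input="reports/q3.txt", output="(file contents)")
print(trace.dump())
</code></pre>
<p>With the prompt, the decision, and the tool call stored side by side, you can replay a run step by step and see where a chain of decisions went wrong, not just that it did.</p>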
<h2 id="heading-what-comes-next">What Comes&nbsp;Next</h2>
<p>AI agents are still at a relatively early stage. The current generation is powerful but imperfect. Reliability is a major challenge. So is cost, especially when agents require multiple model calls per task.</p>
<p>But the direction is clear: software is moving from static execution to dynamic action. The boundary between user and system is becoming less rigid. Instead of telling software what to do step by step, users will increasingly define outcomes and let systems figure out the rest.</p>
<p>This doesn't eliminate the need for engineers. It changes what engineers do. The focus shifts from implementing logic to designing systems that can reason, act, and adapt.</p>
<p>The rise of AI agents marks a transition. Software is no longer just a tool. It's becoming an actor.</p>
<p><em>Join my</em> <a href="https://applyaito.substack.com/"><em><strong>Applied AI newsletter</strong></em></a> <em>to learn how to build and ship real AI systems. Practical projects, production-ready code, and direct Q&amp;A. You can also</em> <a href="https://www.linkedin.com/in/manishmshiva/"><em><strong>connect with me on</strong></em> <em><strong>LinkedIn</strong></em></a><em><strong>.</strong></em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
