llm - freeCodeCamp.org

AI Paper Review: Language Models are Few-Shot Learners (GPT-3)

Mohammed Fahd Abrah — Mon, 18 May 2026 20:29:20 +0000

After GPT-2, it became clear that language models could do much more than researchers originally expected. Simply training a model to predict the next word had already started producing surprising abilities like translation, summarization, and question answering without task-specific training.

But there was still a major limitation. Even though GPT-2 could generalize across tasks, it still struggled to adapt reliably. Performance often depended on carefully written prompts, and for many real-world applications, fine-tuning was still necessary. AI systems were becoming more flexible, but they still were not truly learning tasks from context the way humans do.

Then GPT-3 pushed the idea much further. Instead of asking whether language models could perform tasks without fine-tuning, the paper explored something even more ambitious:

What happens if we scale language models to an extreme size? The answer surprised almost everyone in the AI community.

GPT-3 showed that a sufficiently large language model could often learn new tasks directly from examples inside the prompt itself. No retraining. No gradient updates. Just a few demonstrations written in natural language.

For example, if you showed the model a few English-to-French translations, it could continue the pattern correctly for a new sentence. If you gave it examples of questions and answers, it could often infer the task immediately and generate reasonable responses.

This became known as few-shot learning and in-context learning.

More importantly, GPT-3 suggested a completely different way of interacting with AI systems. Instead of training a separate model for every task, the same model could dynamically adapt depending on the instructions and examples it received.

That idea eventually became the foundation for modern AI systems like ChatGPT.

Now, like many influential AI papers, the GPT-3 paper can be difficult to read because of its scale, technical experiments, and long benchmark evaluations. So in this article, I’ll break everything down in a clear and practical way.

We’ll explore what problem the paper was trying to solve, how few-shot learning works, why scaling became so important, how GPT-3 was trained, and why this paper fundamentally changed the direction of modern AI research.

By the end, you should understand the core ideas behind GPT-3 and why this paper became one of the most important milestones in the history of large language models (LLMs).

Paper Overview

In this article, we’ll review the paper Language Models are Few-Shot Learners by Tom Brown et al. from OpenAI.

This paper introduced GPT-3 and demonstrated something that changed the direction of modern AI research: large language models could learn tasks directly from prompts and examples, without the task-specific fine-tuning that GPT-1’s methodology relied on.

Instead of retraining the model for every new task, GPT-3 could often adapt dynamically through natural language instructions, one-shot examples, or few-shot prompting.

The paper also introduced the idea of in-context learning, where the model effectively learns from patterns inside the prompt itself during inference.

Here’s the original paper if you want to explore it directly: Language Models are Few-Shot Learners (PDF)

And here’s a quick infographic of what we’ll cover throughout this review:

Table of Contents:

Executive Summary
Goals of the Paper
Core Idea
Methodology
Fine-tuning vs Zero-Shot vs Few-Shot
Model Architecture
Experiments
Key Findings
Task-Specific Observations
Generalization vs Memorization
Discussion
Limitations
Conclusion
Final Insight
GPT-1 vs GPT-2 vs GPT-3: Key Differences
PyTorch Implementations of the GPT Architecture Evolution
Resources:

Prerequisites

To get the most out of this breakdown, it helps to already be familiar with a few foundational ideas.

Reading the previous reviews in this series will be especially helpful:

GPT-3 directly builds on many of the ideas introduced in those earlier papers, especially pre-training, zero-shot learning, and large-scale language modeling.

It also helps to have:

A general understanding of natural language processing (NLP) and how machines work with text
A high-level idea of what a Transformer model is (you do not need deep mathematical details)
Familiarity with supervised learning, unsupervised learning, and zero-shot learning
A basic understanding of prompts and how language models generate text
General machine learning concepts like training data, parameters, scaling, and inference

You do not need to be an AI researcher to follow this article, though.

I’ll keep the explanations practical and intuitive, focusing on the core ideas behind GPT-3 rather than getting lost in dense mathematical details or academic terminology.

Executive Summary

Before GPT-3, models like GPT-2 had already shown something surprising: a language model trained only to predict the next word could still perform many tasks it was never directly trained for. Translation, summarization, question answering: somehow these abilities started appearing naturally as models became larger.

But there was still a limitation.

Even with GPT-2, strong performance often depended on careful prompting or additional fine-tuning. In practice, most NLP systems still followed the same pattern: train a large model first, then retrain or fine-tune it separately for every new task.

GPT-3 challenges that entire workflow.

According to the authors, if a language model becomes large enough, it can begin learning tasks directly from context alone. Instead of updating the model’s parameters, you simply show it a few examples inside the prompt, and the model continues the pattern.

This idea is what the paper calls few-shot learning.

For example, rather than training a separate translation model, you could write something like:

dog → chien
cat → chat
house → ?

And GPT-3 would often continue with the correct answer: maison.

What makes this important is that the model is not learning through gradient updates during inference. There is no retraining happening in the traditional sense. The learning happens inside the context window itself, through the examples provided in the prompt.

This marks a major shift in how language models are used.

Instead of building a specialized system for every task, GPT-3 suggests that a single sufficiently large model can adapt dynamically just by reading instructions and examples. The paper refers to this behavior as in-context learning, and much of GPT-3’s contribution revolves around showing how powerful this idea becomes at scale.

Goals of the Paper

According to the authors, one of the biggest limitations of existing NLP systems is that they depend too heavily on task-specific training. Even though models had become increasingly powerful by the time GPT-3 was introduced, most systems still required a separate fine-tuning process for every new task.

In practice, this created several problems.

First, every task needed labeled data. If you wanted a model to summarize articles, answer questions, classify sentiment, or translate text, you usually needed thousands, or sometimes millions, of carefully prepared examples. Collecting that data was expensive, time-consuming, and often unrealistic for smaller or niche tasks.

Second, every new capability required additional training. Even when the underlying model was already pretrained on massive amounts of text, developers still had to retrain or fine-tune it again and again for specific use cases.

The paper argues that this workflow is fundamentally inefficient. More importantly, the authors point out that it does not resemble how humans learn. Humans can often understand a task after seeing only a few demonstrations or simple instructions. We do not usually need thousands of labeled examples to figure out what is being asked.

This becomes the central question behind GPT-3:

Can a language model learn new tasks directly from context instead of relying on parameter updates and task-specific retraining?

That question drives nearly every experiment in the paper. Rather than testing whether GPT-3 can master one carefully optimized benchmark, the authors are exploring something broader: whether scaling language models can produce systems that adapt dynamically just from prompts, examples, and natural language instructions.

Core Idea

At its core, GPT-3 is still built around the same fundamental idea used in GPT-2: train a language model to predict the next token in a sequence. The training objective itself is surprisingly simple. Given some text, the model learns to guess what comes next, one token at a time.
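The next-token objective described above can be written compactly. As a sketch in standard notation (the symbols x_t, T, and θ are my own labels, not taken from the paper's text), an autoregressive model factorizes the probability of a token sequence into one prediction per position, and training minimizes the negative log-likelihood of each actual next token:

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
```

Every term in the sum is one "guess what comes next" step, which is why the same objective can be applied to any text without labels.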

On the surface, GPT-3 may look like nothing more than a much larger version of GPT-2. And in some ways, that is true. The model scales dramatically in size, growing to 175 billion parameters, and it is trained on a far larger and more diverse dataset gathered from sources like Common Crawl, an expanded version of WebText, books, and Wikipedia.

But the paper argues that something more interesting begins to happen as language models scale.

Instead of simply memorizing text patterns better, GPT-3 starts showing the ability to learn tasks directly from prompts. When the model sees examples inside the input itself, it can often continue the pattern correctly without any additional training or parameter updates.

For example, if the prompt contains a few question-answer pairs or translation examples, GPT-3 can infer the structure of the task and generate similar outputs for new inputs. In other words, the prompt becomes a temporary learning environment.

This is the key conceptual shift in the paper.

Traditional machine learning usually separates training from inference. First the model learns by updating its weights, then later it is deployed to make predictions. GPT-3 blurs that boundary. The model still learns during pretraining, of course, but during inference it can also adapt its behavior dynamically based on the context it receives.

The authors describe this behavior as in-context learning.

What makes this idea important is that the model is not retrained for each task. There are no gradient updates happening while the prompt is processed. Instead, GPT-3 learns from the examples embedded inside the context window itself.

This marks a subtle but important change in how we think about language models. The prompt is no longer just an input. It effectively becomes a lightweight interface for teaching the model what to do.

Methodology

One reason GPT-3 became so influential is that the underlying training process is actually very familiar. Unlike many research papers that introduce entirely new architectures or complicated learning algorithms, GPT-3 mostly builds on ideas that already existed before it. The difference is how aggressively those ideas are scaled.

According to the authors, the core training objective remains standard autoregressive language modeling. In simple terms, the model reads text and repeatedly learns to predict the next token in the sequence. This is the same general approach used in GPT-2.

The process itself is conceptually straightforward:

Train a very large Transformer model
Feed it enormous amounts of internet text
Optimize it to predict the next word over and over again
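To make the "predict the next word over and over again" step concrete, here is a minimal sketch of next-token prediction on a tiny corpus. It swaps the Transformer for a simple bigram count model, so it is only an illustration of the objective, not of how GPT-3 is actually trained. The function names and the toy corpus are my own assumptions.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_nll(counts, tokens):
    """Average negative log-likelihood of each next token given the previous one."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)

# Toy corpus (hypothetical); a real model sees hundreds of billions of tokens.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")   # distribution over what follows "the"
loss = average_nll(model, corpus)        # the quantity training tries to lower
```

A neural language model replaces the count table with a network and lowers the same kind of loss through gradient descent, but the shape of the problem (context in, distribution over the next token out) is the same.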

What changes dramatically is the scale.

GPT-3 is trained on hundreds of billions of tokens collected from sources such as Common Crawl, an expanded version of WebText, books, and Wikipedia. The paper also explains that OpenAI filtered and cleaned large portions of the Common Crawl dataset to improve quality and reduce duplication.
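To give a feel for what "reducing duplication" can involve, here is a small sketch of near-duplicate filtering using Jaccard similarity over word shingles. This is one common textbook approach, not OpenAI's actual pipeline, and the function names, shingle size, and threshold are all illustrative assumptions.

```python
def shingles(text, n=3):
    """Return the set of n-word shingles in a lowercased document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.8, n=3):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

# Hypothetical documents: the second is an exact copy of the first.
docs = [
    "GPT-3 is a large language model trained on web text",
    "GPT-3 is a large language model trained on web text",
    "Transformers use attention to relate tokens to each other",
]
unique_docs = deduplicate(docs)
```

This version compares every document against every kept document, which is fine for a demo but far too slow at web scale. Large pipelines typically rely on hashing-based approximations instead of pairwise comparison.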

But the most important part of the methodology is not just how the model is trained. It is how the model is used after training.

Traditionally, NLP systems relied heavily on fine-tuning. After pretraining a language model, developers would train it again on a smaller labeled dataset for each individual task. GPT-3 experiments with a different approach entirely.

Instead of retraining the model, tasks are described directly inside the prompt.

The paper studies three main settings:

Zero-shot learning: the model receives only a natural language instruction
One-shot learning: the model receives a single example of the task
Few-shot learning: the model receives several examples before solving a new case

For example, a translation prompt might look like this:

dog → chien
cat → chat
house → ?

GPT-3 then continues the pattern and predicts:

maison

What makes this remarkable is that no retraining happens during this process. The model’s weights remain completely unchanged. It is simply using the information inside the prompt to infer what kind of task is being requested.
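The three settings differ only in what text goes into the prompt, which is easy to see in code. Below is a minimal sketch of a prompt builder. The function name `build_prompt`, its parameters, and the instruction wording are my own assumptions for illustration, not a format prescribed by the paper.

```python
def build_prompt(query, demonstrations=(), instruction=None, arrow=" → "):
    """Assemble a zero-, one-, or few-shot prompt as plain text."""
    lines = []
    if instruction:
        lines.append(instruction)
    for source, target in demonstrations:
        lines.append(f"{source}{arrow}{target}")
    # The final line is left unfinished so the model completes it.
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)

demos = [("dog", "chien"), ("cat", "chat")]

# Zero-shot: instruction only. One-shot: one demonstration. Few-shot: several.
zero_shot = build_prompt("house", instruction="Translate English to French:")
one_shot = build_prompt("house", demos[:1], instruction="Translate English to French:")
few_shot = build_prompt("house", demos)

print(few_shot)
```

Whatever string comes out is simply passed to the model as input. Nothing about the model changes between the three cases, only the text it reads.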

In practice, this transforms the prompt into something much more powerful than an ordinary input. It becomes a temporary workspace where the model can recognize patterns, adapt its behavior, and apply learned knowledge dynamically.

The paper repeatedly emphasizes that this behavior emerges through scale rather than task-specific engineering. GPT-3 is not trained separately for translation, summarization, reasoning, or question answering. Instead, the same general language modeling objective appears to produce all of these abilities when the model becomes sufficiently large.

Fine-tuning vs Zero-Shot vs Few-Shot

Aspect	Fine-Tuning	Zero-Shot Learning	Few-Shot Learning
Definition	The model is additionally trained on labeled data for a specific task	The model performs a task using only instructions, without examples	The model learns the task from a small number of examples inside the prompt
Training Requirement	Requires supervised task-specific datasets	No task-specific training or examples	No retraining, but requires a few demonstrations in the prompt
How Tasks Are Given	Through a separate training phase	Through natural language instructions	Through instructions plus a few input-output examples
Learning Process	Model weights are updated during training	No weight updates	No weight updates; learning happens inside the context window
Flexibility	Usually specialized for one task	Highly flexible across many tasks	Flexible while still benefiting from demonstrations
Adaptability	Requires retraining for new tasks	Adapts instantly through prompting	Adapts quickly from contextual examples
Data Dependency	Depends heavily on labeled datasets	Depends mostly on pretraining knowledge	Depends on both pretraining and prompt examples
Performance	Often strongest on narrow benchmark tasks	Usually weaker than fine-tuning	Often much stronger than zero-shot and sometimes close to fine-tuning
Scalability Across Tasks	Expensive and difficult to scale	Extremely scalable	Scalable without retraining
Compute Cost	High because every task may require new training	Low during usage	Low during usage
Example	Fine-tune a model on a sentiment analysis dataset	“Classify the sentiment of this sentence”	“Positive: I loved the movie. Negative: The film was boring. Sentence: The story was amazing →”
Main Strength	High accuracy on carefully trained tasks	Simplicity and broad generalization	Strong balance between flexibility and performance
Main Weakness	Poor scalability across many tasks	Can misunderstand task format or intent	Sensitive to prompt quality and example selection
Most Associated With	Traditional NLP systems, GPT-1 era	GPT-2 style prompting	GPT-3 and in-context learning
Core Idea	Train specifically for each task	Infer the task from instructions	Infer the task from examples in context

Model Architecture

Architecturally, GPT-3 does not introduce a radically new design. In fact, one of the most interesting aspects of the paper is that the core architecture is almost identical to GPT-2. OpenAI continues using a decoder-only Transformer model trained with an autoregressive objective.

At a high level, the Transformer architecture processes text using a mechanism called attention. Instead of reading words strictly one at a time like older recurrent models, Transformers can look across the entire sequence and determine which words are most relevant to each other.

More specifically, GPT-3 relies on self-attention, which allows the model to weigh different parts of the context while generating text. This helps the model capture long-range relationships between words, sentences, and ideas.

The model is also autoregressive, meaning it generates text sequentially by predicting the next token based on everything that came before it. This next-token prediction objective remains the foundation of GPT-3, just as it was for GPT-2.
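Self-attention and the autoregressive constraint fit together through a causal mask: each position may attend only to itself and to earlier positions. The sketch below shows that idea in plain Python for a single attention head. It is a simplification under stated assumptions: real models apply learned projection matrices to produce queries, keys, and values, use many heads, and run on tensors. Here the raw token vectors play all three roles, and the function names are my own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t sees only positions <= t."""
    d = len(queries[0])
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Scores against the current and earlier positions only (the causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        # Weighted average of the visible value vectors.
        out = [sum(w[s] * values[s][j] for s in range(t + 1))
               for j in range(len(values[0]))]
        outputs.append(out)
        # Pad with zeros so every row has one weight per position.
        weights.append(w + [0.0] * (len(queries) - t - 1))
    return outputs, weights

# Three toy token vectors; for simplicity they serve as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The first token can only attend to itself, so its attention weights are all on position 0. Later tokens spread their attention over everything that came before, which is what lets the model use demonstrations placed earlier in the prompt.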

So if the architecture is mostly the same, what actually changed?

The answer is scale.

GPT-3 dramatically increases the size of the model, the amount of training data, and the computational resources used during training. The largest version of GPT-3 contains 175 billion parameters, making it far larger than GPT-2’s 1.5 billion parameter model.

The paper also experiments with multiple model sizes ranging from 125 million parameters all the way to 175 billion. This was important because the authors wanted to study how capabilities evolve as models grow larger.

The architecture includes:

A decoder-only Transformer design
A context window of 2048 tokens
Multiple model scales trained under similar objectives
Attention mechanisms that allow the model to process contextual relationships efficiently
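A quick back-of-the-envelope calculation shows where numbers like 125 million and 175 billion come from. The sketch below uses a common approximation of roughly 12 × d_model² parameters per Transformer layer, ignoring biases and layer norms. The layer counts and widths are the values I am taking from the paper's model-size table rather than from this article, and the vocabulary size of 50257 (GPT-2's tokenizer) is an assumption on my part.

```python
def approx_transformer_params(n_layers, d_model, vocab_size=50257, n_ctx=2048):
    """Rough parameter count for a GPT-style decoder-only Transformer.

    Per layer: about 4*d_model^2 for attention (Q, K, V, output projections)
    plus about 8*d_model^2 for a feed-forward block with a 4x hidden size.
    Biases and layer-norm parameters are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model  # token + position embeddings
    return n_layers * per_layer + embeddings

# Smallest and largest configurations (assumed from the paper's size table).
small = approx_transformer_params(n_layers=12, d_model=768)
largest = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{small / 1e6:.0f}M, {largest / 1e9:.1f}B")
```

The estimate lands close to the headline figures for both ends of the range, which is a nice sanity check that the jump in size comes almost entirely from making the same architecture deeper and wider.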

One of the paper’s most important observations is that performance improves smoothly as scale increases. Larger models consistently perform better across a wide range of tasks, including translation, question answering, reasoning, and few-shot learning.

This idea becomes central to the entire GPT-3 paper.

Rather than relying on handcrafted task-specific systems, the authors suggest that many advanced capabilities emerge naturally when language models become sufficiently large and are trained on enough diverse data. In other words, scaling itself starts acting like a research strategy.

What makes this shift important is that GPT-3 does not achieve its results through complicated architectural innovations. The paper’s argument is much simpler, and in some ways more surprising:

A relatively standard Transformer architecture, when scaled aggressively enough, begins to display entirely new behaviors.

Note: The original figure illustrates the complete Transformer architecture (Encoder–Decoder) from Attention Is All You Need. For clarity and relevance to GPT-style models, the image used here was cropped to focus only on the decoder side of the architecture, since GPT models are based on a decoder-only Transformer design.

Reference: Brownlee, J. Encoders and Decoders in Transformer Models. Machine Learning Mastery.

Experiments

To understand whether GPT-3 could truly learn from context alone, the authors evaluated the model across a very broad range of NLP tasks. Rather than focusing on a single benchmark, the paper tests whether the same pretrained model can adapt to many different kinds of problems using only prompts and examples.

The experiments cover a wide variety of domains, including:

Language modeling and text completion
Question answering
Translation between languages
Reading comprehension
Commonsense reasoning
Winograd-style reasoning tasks
Cloze and sentence completion tasks
Synthetic reasoning problems such as arithmetic and word manipulation

What makes these experiments especially important is the evaluation setup itself.

Instead of fine-tuning GPT-3 separately for each benchmark, the model is tested entirely through prompting. The authors evaluate GPT-3 in three different settings:

Zero-shot learning, where the model receives only a task description
One-shot learning, where it receives a single example
Few-shot learning, where several demonstrations are included inside the prompt

For example, in translation tasks, the prompt may contain a few English-to-French examples before asking the model to continue the pattern. In question-answering tasks, the model might see several example questions and answers before attempting a new one.

Importantly, the model’s parameters never change during these evaluations. There are no gradient updates, no retraining steps, and no task-specific optimization. GPT-3 performs every task using the exact same pretrained weights.
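The evaluation setup can be sketched as a small loop: build a prompt with k demonstrations, ask the model, and compare the answer with the reference. Everything below is a hypothetical illustration. `toy_model` is a hand-written stand-in that I deliberately built so demonstrations help, and its scores say nothing about GPT-3 itself. The point is only the structure: the same frozen "model" is called for every value of k.

```python
def format_prompt(demos, query):
    """Format k demonstrations followed by an unfinished query line."""
    lines = [f"{x} -> {y}" for x, y in demos]
    lines.append(f"{query} ->")
    return "\n".join(lines)

def evaluate(model_fn, train_pool, test_set, k):
    """Exact-match accuracy with k in-context demonstrations and no weight updates."""
    demos = train_pool[:k]
    correct = 0
    for query, answer in test_set:
        prediction = model_fn(format_prompt(demos, query)).strip()
        correct += prediction == answer
    return correct / len(test_set)

def toy_model(prompt):
    """Hypothetical stand-in for a language model: it uppercases the query only
    if the demonstrations in the prompt show that uppercasing is the task."""
    *demo_lines, query_line = prompt.split("\n")
    query = query_line.rsplit("->", 1)[0].strip()
    pairs = [line.split(" -> ") for line in demo_lines]
    if pairs and all(y == x.upper() for x, y in pairs):
        return query.upper()
    return query

train_pool = [("cat", "CAT"), ("dog", "DOG"), ("bird", "BIRD")]
test_set = [("house", "HOUSE"), ("tree", "TREE")]

# k = 0 (no demonstrations), k = 1 (one-shot), k = 3 (few-shot).
results = {k: evaluate(toy_model, train_pool, test_set, k) for k in (0, 1, 3)}
```

To evaluate a real model, you would replace `toy_model` with a function that sends the prompt to the model and returns its completion. The harness itself would not need to change.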

This is one of the paper’s biggest departures from traditional NLP systems.

At the time, most state-of-the-art models achieved strong benchmark results through supervised fine-tuning on carefully prepared datasets. GPT-3 instead tests whether a single large language model can generalize across tasks simply by understanding patterns inside prompts.

The paper also evaluates how performance changes as model size increases. OpenAI trained multiple versions of GPT-3, ranging from 125 million parameters up to 175 billion parameters, then compared how scaling affected zero-shot, one-shot, and few-shot behavior.

According to the authors, larger models become noticeably better at using contextual information. Few-shot learning improves especially strongly with scale, suggesting that bigger models are not just memorizing more information. They are becoming better at adapting to new tasks dynamically.

Key Findings

This is the section where GPT-3 stops feeling like “just a bigger language model” and starts looking like something fundamentally different.

According to the paper, one of the clearest patterns across nearly all experiments is that performance improves consistently as model size increases. As GPT-3 scales from millions of parameters to hundreds of billions, the model becomes dramatically better at understanding prompts, adapting to context, and performing tasks it was never explicitly trained for.

But the most surprising result is not simply higher benchmark scores.

The real breakthrough is that few-shot learning actually works at scale.

Across many tasks, GPT-3’s few-shot performance approaches strong fine-tuned systems, and in some cases even matches or surpasses them. This is remarkable because GPT-3 achieves these results without updating its weights for individual tasks. Everything happens through prompting alone.

One of the strongest examples appears in question answering benchmarks.

On TriviaQA, GPT-3 improves significantly as more examples are provided in the prompt. The paper reports that zero-shot performance is already competitive, but one-shot and few-shot prompting push results even further, eventually reaching or exceeding some state-of-the-art fine-tuned systems in the same closed-book setting.

Source: Brown et al. (2020), Language Models are Few-Shot Learners, Figure 1.2.

The same pattern appears repeatedly throughout the paper:

Few-shot prompting consistently outperforms zero-shot prompting
Larger models make better use of contextual examples
Scaling improves not only accuracy, but adaptability itself

This last point is especially important.

The paper suggests that scaling does more than help the model memorize facts or generate more fluent text. As models become larger, they appear to develop stronger in-context learning abilities. In other words, bigger models become better at inferring patterns and task structures directly from prompts.

The authors even observe that the gap between zero-shot and few-shot performance grows with model size. Smaller models struggle to learn effectively from prompts, while larger models can often infer the task from only a handful of examples.

What makes this finding historically important is that it changes how researchers think about capability growth in AI systems.

Before GPT-3, scaling was often viewed mainly as a way to improve existing performance metrics. GPT-3 introduces a different possibility: that entirely new behaviors can emerge as models become sufficiently large.

This is why the paper became so influential. It was not just reporting better benchmark numbers. It was presenting evidence that scale itself can unlock qualitatively new forms of learning behavior.

Task-Specific Observations

When you look beyond the headline results, the paper reveals something more nuanced about GPT-3: its abilities are highly uneven. The model performs surprisingly well in some areas, yet still struggles badly in others.

GPT-3 shows particularly strong performance on tasks that align closely with pattern recognition and language continuation.

Translation is one notable example. While GPT-3 was never trained specifically as a translation system, the model can still produce impressive results when given a few examples in the prompt. According to the paper, few-shot translation performance improves substantially as model size increases, especially when translating into English.

The model also performs well on question answering benchmarks, especially in closed-book settings where the answer must come directly from information stored inside the model’s parameters. Tasks like TriviaQA show strong gains as GPT-3 moves from zero-shot to few-shot prompting.

Text completion and cloze-style tasks are another major strength. GPT-3 demonstrates a strong ability to continue patterns, complete paragraphs, and infer missing words from context. On datasets like LAMBADA, the few-shot setup produces especially large improvements.

But the paper is also careful about documenting weaknesses.

GPT-3 struggles noticeably on certain reasoning-heavy benchmarks, particularly tasks involving natural language inference. Datasets like ANLI remain difficult even for the largest model.

Some reading comprehension tasks also expose limitations. In several cases, GPT-3 generates answers that sound plausible but fail to demonstrate deep understanding of the passage. This becomes a recurring theme throughout the paper: fluent language generation does not always mean reliable reasoning.

One of the most interesting observations is how sensitive GPT-3 is to prompt design.

Performance often changes dramatically depending on how examples are written, formatted, or ordered inside the context window. In many tasks, adding just a few demonstrations significantly improves accuracy.

This suggests something important about how GPT-3 operates.

The model is not simply retrieving fixed knowledge from memory. Instead, it relies heavily on contextual cues to infer what kind of behavior is expected. Small prompt changes can reshape the model’s interpretation of the task itself.

In practice, this paper helped introduce an entirely new idea to the AI community: that how you ask the model can matter almost as much as the model itself.

That insight eventually evolved into what we now call prompt engineering.

Generalization vs Memorization

One of the biggest questions surrounding GPT-3 is whether the model is genuinely learning useful patterns, or simply memorizing enormous portions of the internet.

This concern becomes especially important because GPT-3 is trained on massive web-scale datasets, including Common Crawl. With a model this large, it is reasonable to ask whether strong benchmark performance comes from real generalization or from accidentally seeing parts of the evaluation data during training.

The authors take this issue seriously and dedicate an entire section of the paper to studying what they call data contamination.

According to the paper, OpenAI searched for overlaps between the training data and benchmark datasets used during evaluation. They discovered that some contamination did exist. In other words, portions of certain evaluation datasets appeared somewhere inside the model’s training corpus.
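A simple way to picture this kind of overlap search is to check whether any run of consecutive words from a benchmark example also appears in the training text. The sketch below does that with word-level n-grams. The function names, the toy strings, and the choice of n = 5 are my own assumptions, picked so the example stays short. It illustrates the general idea rather than reproducing the paper's exact procedure.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(benchmark_examples, training_docs, n=5):
    """Return benchmark examples sharing at least one n-gram with the training docs."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [ex for ex in benchmark_examples if ngrams(ex, n) & train_ngrams]

# Hypothetical data: the first benchmark item overlaps with the training text.
training_docs = ["the capital of france is paris and it lies on the seine"]
benchmark = [
    "Q: The capital of France is Paris true or false",
    "Q: Which planet is closest to the sun",
]
flagged = find_contaminated(benchmark, training_docs)
```

Once the overlapping examples are flagged, a benchmark can be scored twice, once on the full set and once on the clean subset, to see how much the overlap actually matters.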

However, the authors argue that this overlap is not large enough to fully explain GPT-3’s results.

For many benchmarks, performance improvements remain consistent even after accounting for contamination effects. The paper also notes that some tasks specifically designed to test adaptation and reasoning still show strong few-shot behavior despite being unlikely to appear directly in the training data.

Another important observation is that GPT-3 still underfits the training data. This means the model has not perfectly memorized everything it has seen, even after extremely large-scale training.

That detail matters because it suggests the model is learning statistical structures and linguistic patterns rather than storing an exact copy of the dataset.

Of course, memorization does still happen to some extent. Large language models can reproduce fragments of training text, especially passages that are repeated many times in the training data. The paper does not deny this. Instead, the authors argue that memorization alone cannot explain GPT-3’s broad performance across translation, reasoning, question answering, and in-context learning tasks.

In practice, the evidence points toward something more complex.

GPT-3 appears to absorb patterns, relationships, and task structures from large-scale text data, then reuse those patterns flexibly in new contexts. That is very different from simply copying stored answers.

This distinction becomes one of the central debates in modern AI research. GPT-3 forced researchers to think more carefully about what it actually means for a language model to “understand” something, and where the boundary lies between memorization, pattern recognition, and genuine generalization.

Discussion

This is the point in the paper where the broader implications of GPT-3 start becoming clear.

According to the authors, large language models may be doing something more general than simply predicting text. By training on enormous amounts of language data, the model appears to learn patterns associated with tasks themselves.

That idea changes how we think about language modeling.

Traditionally, NLP systems were designed around explicit supervision. If you wanted a model to translate text, answer questions, summarize documents, or classify sentiment, you trained it specifically for that task using labeled examples.

GPT-3 suggests a different possibility.

The paper argues that many tasks are already implicitly embedded inside natural language data. During pretraining, the model encounters countless examples of explanations, translations, conversations, reasoning patterns, instructions, and question-answer pairs scattered across the internet. As scale increases, the model begins learning these behaviors indirectly.

In practice, this means the model does not always require explicit retraining to perform a new task. Instead, prompts and examples can activate behaviors the model has already absorbed during pretraining.

This is why prompting becomes so powerful in GPT-3.

The prompt is not merely providing information. It is guiding the model toward a behavior pattern that already exists somewhere inside its learned representations.

At the same time, the authors are careful not to overstate the results.

Throughout the paper, they repeatedly acknowledge that GPT-3 is still inconsistent. Some outputs are remarkably convincing, while others are obviously incorrect, nonsensical, or logically flawed.

This becomes one of GPT-3’s defining characteristics.

The model often sounds far more confident than it actually is. It can generate fluent explanations and persuasive answers even when the underlying reasoning is weak or factually wrong. In some tasks, especially deeper reasoning and reading comprehension benchmarks, GPT-3 still struggles significantly.

So the paper does not present GPT-3 as a solved form of intelligence.

Instead, it presents evidence that scaling language models unlocks new capabilities that were previously weak or absent. The results are impressive enough to suggest a major shift in direction, but not strong enough to eliminate the need for further research.

That balance is part of what makes the paper influential. It is ambitious, but also surprisingly honest about the limitations that still remain.

Limitations

One reason the GPT-3 paper remained credible despite the excitement surrounding it is that the authors were unusually open about the model’s weaknesses. The paper does not claim that few-shot learning solves NLP, nor does it pretend that GPT-3 works reliably on every task.

In many cases, traditional fine-tuned systems still perform better.

Although GPT-3 achieves impressive few-shot results across a wide range of benchmarks, the model continues to struggle on several reasoning-heavy tasks, especially natural language inference and certain reading comprehension datasets.

The paper also emphasizes that GPT-3’s success depends heavily on scale. Smaller versions of the model show far weaker few-shot capabilities, while the strongest results appear only at extremely large parameter counts.

This creates a major practical problem.

Training GPT-3 required enormous computational resources, specialized infrastructure, and vast amounts of data. The largest model contains 175 billion parameters and was trained using large GPU clusters over massive datasets.

In practice, very few organizations in the world could realistically reproduce this work at the time.

The paper also discusses broader concerns around bias and fairness. Since GPT-3 learns from large internet datasets, it inevitably absorbs social biases, stereotypes, and problematic language patterns present in the data itself.

This becomes especially concerning because the model can generate highly convincing text. Incorrect or biased outputs may sound authoritative even when they are misleading or harmful.

Another issue the authors examine is data contamination. Because GPT-3 is trained on web-scale corpora, parts of benchmark datasets may accidentally appear in the training data. The paper investigates this directly and acknowledges that some overlap exists, although the authors argue that contamination alone does not explain the overall results.

There is also an environmental and economic cost to scaling models this aggressively.

Training systems at the scale of GPT-3 consumes enormous amounts of compute and energy, raising questions about sustainability and accessibility in AI research. As models become larger, cutting-edge progress increasingly depends on access to industrial-scale infrastructure.

This creates a tension that still exists today.

GPT-3 demonstrated that scaling works extraordinarily well, but it also highlighted how concentrated advanced AI research was becoming. The future of large language models was clearly promising, but also increasingly expensive.

Conclusion

The paper ends with a surprisingly simple conclusion: scaling language models changes what they are capable of doing.

According to the authors, GPT-3 demonstrates that a sufficiently large language model can learn tasks directly from context without requiring gradient updates or task-specific fine-tuning.

That idea represents a major shift in the direction of NLP.

For years, the standard workflow in machine learning looked something like this:

Pretrain a model
Fine-tune it for a specific task
Deploy the specialized system

GPT-3 introduces a different paradigm.

Instead of retraining the model repeatedly for new tasks, the same pretrained model can often adapt through prompts alone. Instructions and examples inside the context window become enough to guide the model toward useful behavior.

In other words, the workflow starts looking more like this:

Train once
Adapt dynamically through prompting

What makes this important is not just convenience. It changes how researchers think about generalization itself.

The paper suggests that many capabilities traditionally associated with supervised learning can emerge naturally from large-scale language modeling. Translation, question answering, reasoning, summarization, and even task adaptation begin appearing inside a single unified system trained only with next-token prediction.

At the same time, the authors remain careful in their conclusions.

GPT-3 is clearly powerful, but it is not reliable enough to be considered a complete solution to intelligence or reasoning. The paper repeatedly acknowledges weaknesses involving logic, factual accuracy, bias, and consistency.

Still, the broader message is difficult to ignore.

GPT-3 showed that scaling language models does not simply improve fluency. It can produce entirely new behaviors that were weak or absent in smaller systems. That realization reshaped the trajectory of modern AI research and laid the foundation for the prompt-driven systems that would soon follow.

Final Insight

If GPT-1 introduced the idea of large-scale pretraining followed by fine-tuning, and GPT-2 showed that language models could generalize surprisingly well without task-specific training, then GPT-3 pushes the idea even further.

It suggests that language models can begin learning during inference itself.

That is the real conceptual shift behind this paper.

Before GPT-3, most AI systems were still fundamentally task-specific. Even powerful pretrained models usually needed additional supervised training before they became useful for a particular application.

GPT-3 starts breaking that pattern.

Instead of building a separate model for translation, summarization, question answering, or reasoning, the same model can adapt dynamically depending on the prompt it receives. Examples inside the context window effectively become temporary instructions for behavior.

In practice, this moves AI systems away from narrow specialization and toward something more flexible:

From task-specific systems
To general-purpose models that adapt on the fly

What makes this especially important is that GPT-3 did not achieve this through complicated symbolic reasoning systems or handcrafted pipelines. The model was still trained using a relatively simple next-token prediction objective. Yet at sufficient scale, entirely new behaviors started emerging.

Looking back, this paper feels less like the end of the GPT series and more like the beginning of a new era.

Many ideas that now define modern AI trace directly back to GPT-3:

Prompt engineering
Instruction-following systems
In-context learning
Conversational AI assistants
General-purpose foundation models

And ultimately, systems like ChatGPT exist because GPT-3 demonstrated that prompting itself could become a powerful interface for interacting with intelligence.

That is why this paper became historically important.

It did not just scale language models. It changed how people imagined using them.

GPT-1 vs GPT-2 vs GPT-3: Key Differences

Aspect	GPT-1	GPT-2	GPT-3
Core Idea	Pre-training followed by fine-tuning	Pre-training alone enables zero-shot behavior	Large-scale pre-training enables few-shot and in-context learning
Training Approach	Two-stage pipeline: pretrain then fine-tune	Single-stage language modeling	Same language modeling approach, but massively scaled
Supervision	Requires labeled data for downstream tasks	Can perform tasks without supervised fine-tuning	Can adapt from prompts and examples without retraining
Task Handling	Separate fine-tuning for each task	Tasks handled mainly through zero-shot prompts	Tasks handled through zero-shot, one-shot, and few-shot prompting
Learning Style	Learns representations, then specializes	Learns general language patterns	Learns to infer tasks directly from context
Generalization	Limited outside fine-tuned tasks	Stronger cross-task generalization	Much stronger contextual adaptation and in-context learning
Prompt Usage	Minimal importance	Prompts become useful	Prompts become central to system behavior
Inference Behavior	Mostly static after training	Can generalize during inference	Can adapt dynamically during inference
Architecture	Decoder-only Transformer (post-layer norm)	Decoder-only Transformer (pre-layer norm)	Decoder-only Transformer, scaled up, with alternating dense and sparse attention layers
Model Size	~117M parameters	Up to 1.5B parameters	Up to 175B parameters
Context Window	512 tokens	Up to 1024 tokens	2048-token context window
Training Data	BooksCorpus	WebText internet dataset	Massive multi-source dataset including filtered Common Crawl, WebText2, Books1, Books2, and English Wikipedia
Key Capability	Transfer learning	Zero-shot learning	Few-shot and in-context learning
Performance Style	Strong after fine-tuning	Strong without task-specific training	Often competitive with fine-tuned systems using prompts alone
Scaling Importance	Moderate	Important	Central research strategy of the paper
Main Limitation	Requires labeled datasets and retraining	Weak reasoning and inconsistent zero-shot behavior	Extremely expensive compute requirements and persistent reasoning limitations
Main Contribution	Introduced modern NLP pre-training paradigm	Demonstrated multitask zero-shot behavior	Demonstrated emergent in-context learning at scale
Historical Impact	Foundation of modern Transformer NLP	Shift toward general-purpose language models	Foundation for prompt-driven AI systems and modern LLM applications
What Changed in the Field	Pre-training became standard	Prompting became viable	Prompting became the primary interface for AI systems
Legacy	Inspired modern transfer learning pipelines	Inspired large-scale generative models	Directly influenced ChatGPT, instruction tuning, and foundation models

PyTorch Implementations of the GPT Architecture Evolution

GPT-1: Pre-training + Fine-Tuning Architecture

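import torch
import torch.nn as nn

# TransformerBlock is assumed to be defined elsewhere (masked self-attention + feed-forward).
class GPT1(nn.Module):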
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(512, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model)
            for _ in range(n_layers)
        ])

        self.ln_f = nn.LayerNorm(d_model)

        # Language modeling head
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
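        # Create position indices on the same device as the input to avoid a device mismatch
        positions = torch.arange(input_ids.size(1), device=input_ids.device)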

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.ln_f(x)

        logits = self.lm_head(x)

        return logits

GPT1 inherits from nn.Module, which is the base class used to build neural networks in PyTorch. The constructor (__init__) defines all trainable layers used by the model.

nn.Embedding(vocab_size, d_model) creates a learnable lookup table that converts token IDs into dense vectors. Each token in the vocabulary is mapped to a vector of size d_model.

The positional embedding layer adds information about token order. Since Transformers process tokens in parallel, they need explicit positional information to understand sequence structure.

nn.ModuleList([...]) stores multiple Transformer blocks while ensuring PyTorch properly tracks their parameters during training. Each TransformerBlock typically contains masked self-attention and a feed-forward network.

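nn.LayerNorm(d_model) applies layer normalization before the output projection. This helps stabilize training and improves gradient flow in deep Transformer architectures. Note that this final normalization is a simplification in this sketch: the original GPT-1 applied layer normalization after each sub-block and had no separate final LayerNorm, which GPT-2 later added.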

The language modeling head (nn.Linear) projects the hidden representations back into vocabulary space. The output size equals vocab_size, producing prediction scores for every possible next token.

Inside the forward() method, input_ids.size(1) retrieves the sequence length, and torch.arange(...) generates positional indices for each token position.

The token embeddings and positional embeddings are added together to produce the initial Transformer input representation.

The model then passes the representation through each Transformer block sequentially:

for block in self.transformer_blocks:
    x = block(x)

This iterative stacking is what allows GPT models to learn increasingly abstract contextual representations.

After normalization, the final hidden states are passed into lm_head, producing logits. These logits are unnormalized prediction scores used to compute probabilities for next-token generation.

The model finally returns the logits tensor, which is typically passed through softmax during inference or used directly with CrossEntropyLoss during training.
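As a rough illustration of that last step, the sketch below computes a next-token loss from a logits tensor. The helper name next_token_loss and the toy tensors are assumptions made for this example. The key detail is the shift by one position, so that the prediction at position t is scored against the token at position t + 1.

```python
import math

import torch
import torch.nn.functional as F


def next_token_loss(logits, input_ids):
    """Cross-entropy between each position's prediction and the following token.

    logits: (batch, seq_len, vocab_size), input_ids: (batch, seq_len)
    """
    shifted_logits = logits[:, :-1, :]   # the last position has no next token to predict
    targets = input_ids[:, 1:]           # the first token is never a prediction target
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )


vocab_size = 10
input_ids = torch.randint(0, vocab_size, (2, 5))
uniform_logits = torch.zeros(2, 5, vocab_size)   # stand-in for model output

loss = next_token_loss(uniform_logits, input_ids)
# At inference time, softmax turns the last position's logits into probabilities.
probs = torch.softmax(uniform_logits[:, -1, :], dim=-1)
```

With all-zero logits the model is maximally uncertain, so the loss equals log(vocab_size). That makes a handy sanity check for an untrained model.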

GPT-2: Zero-Shot Multitask Architecture

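import torch
import torch.nn as nn

# TransformerBlock is assumed to be defined elsewhere and to accept a pre_layer_norm flag.
class GPT2(nn.Module):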
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(1024, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                pre_layer_norm=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
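        # Create position indices on the same device as the input to avoid a device mismatch
        positions = torch.arange(input_ids.size(1), device=input_ids.device)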

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        logits = self.lm_head(x)

        return logits

Like GPT-1, the model begins with token embeddings and positional embeddings. nn.Embedding converts token IDs into dense vectors, while positional embeddings provide information about token order in the sequence.

One noticeable difference is the larger positional embedding table (1024 positions instead of 512), allowing GPT-2 to process longer contexts.

The Transformer layers are stored using nn.ModuleList, but each TransformerBlock now uses:

pre_layer_norm=True

This means layer normalization is applied before attention and feed-forward operations rather than after them. This “Pre-LN” design significantly improves gradient flow and training stability in deeper Transformer models.
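The TransformerBlock used in these listings is never defined in the article, so here is one possible sketch of it that supports both normalization orders. Everything about it is an assumption for illustration: the class layout, the default n_heads, the 4x feed-forward width, and the use of nn.MultiheadAttention are not taken from the papers' code.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Hypothetical decoder block: masked self-attention plus a feed-forward network."""

    def __init__(self, d_model, n_heads=4, pre_layer_norm=False):
        super().__init__()
        self.pre_layer_norm = pre_layer_norm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def _attend(self, x):
        seq_len = x.size(1)
        # True above the diagonal marks future positions that must not be attended to.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        return out

    def forward(self, x):
        if self.pre_layer_norm:
            # Pre-LN: normalize the input of each sub-layer, keep the residual path clean.
            x = x + self._attend(self.ln1(x))
            x = x + self.mlp(self.ln2(x))
        else:
            # Post-LN: normalize after the residual addition.
            x = self.ln1(x + self._attend(x))
            x = self.ln2(x + self.mlp(x))
        return x


block = TransformerBlock(d_model=16, n_heads=4, pre_layer_norm=True)
x = torch.randn(2, 5, 16)
y = block(x)
```

In the pre-LN branch the residual connection carries x through unchanged, which is the usual explanation for why gradients flow more easily through deep stacks.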

The forward pass follows the same overall pipeline:

Generate positional indices with torch.arange()
Add token and positional embeddings
Pass representations through stacked Transformer blocks
Apply final normalization
Project outputs into vocabulary space

The sequential block processing happens here:

for block in self.transformer_blocks:
    x = block(x)

GPT-2 also omits the bias term in the output layer:

self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

In the released GPT-2 implementation, the output projection reuses the token embedding matrix (weight tying), so there is no separate output bias to learn. This sketch keeps lm_head as its own layer for readability.

Finally, the model returns logits, which contain prediction scores for every token in the vocabulary at each sequence position.

GPT-3: Few-Shot / In-Context Learning Architecture

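import torch
import torch.nn as nn

# TransformerBlock is assumed to be defined elsewhere and to accept the keyword arguments used below.
class GPT3(nn.Module):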
    def __init__(
        self,
        vocab_size=50257,
        d_model=12288,
        n_layers=96,
        n_heads=96,
        context_length=2048
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(context_length, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                n_heads=n_heads,
                pre_layer_norm=True,
                # Simplification: the paper alternates dense and locally banded sparse attention across layers
                sparse_attention=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(
            d_model,
            vocab_size,
            bias=False
        )

    def forward(self, input_ids):
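        # Create position indices on the same device as the input to avoid a device mismatch
        positions = torch.arange(input_ids.size(1), device=input_ids.device)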

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        logits = self.lm_head(x)

        return logits

Compared to earlier GPT versions, this model dramatically increases scale. The embedding size (d_model=12288) and the number of Transformer layers (96) allow the network to learn highly complex language patterns and long-range dependencies.
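A quick back-of-the-envelope calculation shows how these hyperparameters add up to roughly 175 billion parameters. The sketch below is an approximation under stated assumptions: it counts only the attention and feed-forward weight matrices plus the embedding tables, assumes a feed-forward width of 4 * d_model, and ignores biases and layer-norm parameters. The function name is made up for this example.

```python
def approx_gpt_params(n_layers, d_model, vocab_size, context_length):
    """Rough parameter count for a decoder-only Transformer (weights only)."""
    attention = 4 * d_model * d_model      # query, key, value, and output projections
    feed_forward = 8 * d_model * d_model   # d_model -> 4*d_model -> d_model
    blocks = n_layers * (attention + feed_forward)
    embeddings = (vocab_size + context_length) * d_model
    return blocks + embeddings


total = approx_gpt_params(n_layers=96, d_model=12288, vocab_size=50257, context_length=2048)
head_dim = 12288 // 96   # width of each attention head

print(f"{total / 1e9:.1f}B parameters, head_dim={head_dim}")
# prints "174.6B parameters, head_dim=128"
```

Almost all of the total comes from the 96 blocks. The embedding tables contribute well under one percent.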

The model also uses 96 attention heads:

n_heads=96

Multi-head attention allows the model to focus on different relationships between tokens simultaneously, improving contextual understanding.

The positional embedding length is expanded to 2048, enabling the model to process sequences twice as long as GPT-2's 1024-token limit.

Each Transformer block is configured with:

pre_layer_norm=True,
sparse_attention=True

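Pre-layer normalization improves training stability in very deep networks, while sparse attention reduces the computational cost of attention by limiting how many tokens each position attends to. According to the paper, GPT-3 alternates dense and locally banded sparse attention patterns across its layers, in the style of the Sparse Transformer, rather than using sparse attention in every block, so the single flag here is a simplification. This matters at GPT-3 scale, where full attention over long sequences is expensive.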
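To give a feel for what a locally banded pattern means, the sketch below builds a boolean mask in which each position may attend only to itself and a fixed number of preceding tokens. The function name and the window parameter are assumptions for illustration. The exact sparsity pattern used in GPT-3 is not spelled out in this review, so treat this as one simple instance of the idea rather than the model's actual layout.

```python
import torch


def banded_causal_mask(seq_len, window):
    """Boolean (seq_len, seq_len) mask: True where a query position may attend to a key position.

    Each query sees itself and the previous window - 1 tokens, never the future.
    """
    positions = torch.arange(seq_len)
    offset = positions.unsqueeze(1) - positions.unsqueeze(0)   # query index minus key index
    return (offset >= 0) & (offset < window)


sparse_mask = banded_causal_mask(seq_len=5, window=2)
dense_mask = banded_causal_mask(seq_len=5, window=5)   # window == seq_len gives ordinary causal attention
print(sparse_mask.int())
```

The dense causal mask allows a number of query-key pairs that grows quadratically with sequence length, while the banded version grows linearly for a fixed window. That is where the compute saving comes from.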

The forward pass follows the standard GPT pipeline:

Convert token IDs into embeddings
Add positional information
Pass representations through stacked Transformer blocks
Apply final layer normalization
Generate vocabulary logits

The core iterative processing happens here:

for block in self.transformer_blocks:
    x = block(x)

Finally, the output layer projects the hidden states into vocabulary space, producing logits used for next-token prediction during training and text generation.


AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)

Mohammed Fahd Abrah — Mon, 11 May 2026 15:55:27 +0000

Before models like ChatGPT became part of everyday life, AI systems were already getting surprisingly good at generating text. But there was still a major limitation: most models could only perform tasks they were specifically trained for.

If you wanted a model to translate text, summarize an article, or answer questions, you usually had to collect labeled data and train it separately for each task. AI was powerful, but still very narrow.

Then GPT-2 introduced a different idea.

Instead of teaching a model every task individually, researchers explored whether simply training a model to predict the next word on a massive amount of internet text could be enough for useful abilities to emerge on their own.

And surprisingly, it worked.

The model began showing early signs of generalization. It could answer questions, summarize text, translate between languages, and complete prompts – all without task-specific training or fine-tuning on downstream tasks.

Now, research papers like the one that introduced these new ideas can be difficult and time-consuming to read, especially when they’re filled with technical terminology and experimental details. So in this article, I’ll break the paper down in a simple and practical way.

We’ll look at what problem the paper was trying to solve, the main ideas behind GPT-2, how zero-shot learning works, and why this paper became such an important step toward modern large language models.

By the end, you should understand the key insights of GPT-2 without needing to read the full paper yourself.

Paper Overview

In this article, we’ll review the paper Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.

The paper introduced GPT-2 and showed how a language model trained on massive amounts of text could perform multiple tasks without task-specific training.

Here’s the actual paper if you want to read it yourself:

Language Models are Unsupervised Multitask Learners (PDF)

And here’s a quick infographic of what we’ll cover in this review:

Executive Summary
Goals of the Paper
Core Idea
Methodology
Zero-Shot Setup
Fine-tuning vs Zero-Shot Learning
Training Data (WebText)
Input Representation
Model Architecture
Experiments
Key Findings
Task-Specific
Generalization vs Memorization
Discussion
Limitations
Conclusion
Final Insight
GPT-1 vs GPT-2 — Key Differences
Resources

Prerequisites

To get the most out of this breakdown, it helps to be familiar with a few basic ideas:

Reading the previous review, AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1), will be helpful and will give you some solid background info and context (since GPT-2 directly builds on many of the ideas introduced there).
A general understanding of natural language processing (NLP) and how machines work with text
A high-level idea of what a Transformer model is (you don’t need deep technical details, just the basic concept)
The difference between supervised learning, unsupervised learning, and zero-shot learning
Basic machine learning concepts like training data, models, and scaling

If you’re not fully comfortable with all of these, that’s completely okay. I’ll keep the explanations as simple and intuitive as possible, focusing more on understanding the ideas than getting lost in heavy technical details.

Executive Summary

Before GPT-2, most NLP systems depended heavily on supervised learning. Each task, whether it was translation, question answering, or summarization, typically required its own labeled dataset and a model trained specifically for it.

This paper challenges that approach.

According to the authors, a single large language model, trained only to predict the next word in a sequence of text, can learn to perform many different tasks without any task-specific training.

Instead of being explicitly taught how to solve each problem, the model picks up these abilities from patterns in the data.

In simple terms, the model is not directly trained to translate, answer questions, or summarize. Rather, it learns to do these things implicitly through exposure to large amounts of text.

This marks an important shift. Rather than relying on supervised learning for every task, the paper shows that models can begin to generalize across tasks in what is now known as a zero-shot setting.

Goals of the Paper

To understand the motivation behind this work, it helps to look at the limitations of traditional NLP systems.

According to the authors, most existing approaches rely heavily on labeled datasets, require separate training for each task, and struggle to generalize beyond the specific problems they were designed for.

In practice, this makes systems powerful but narrow: they perform well on what they are trained for, but don’t easily transfer that knowledge elsewhere.

This paper explores a different direction.

The authors ask whether a model can learn to perform multiple tasks without explicit supervision, simply by training on large amounts of text.

They also investigate whether language modeling alone is enough to capture general capabilities, and whether increasing the size of the model and the amount of data can improve this behavior.

At its core, the goal is to move toward more general systems that learn from language itself, rather than from carefully labeled datasets.

Core Idea

At the heart of the paper is a simple but powerful idea: instead of training models in the traditional supervised way (mapping inputs directly to outputs), the authors train a model to do just one thing: predict the next word in a sequence of text.

At first, this might sound limited. But the key insight is that natural language already contains many examples of tasks embedded within it.

Text on the internet includes questions followed by answers, translations between languages, summaries of longer content, and detailed explanations.

According to the paper, by learning to predict and generate text, the model is indirectly learning how these tasks work. In other words, it begins to model relationships like p(output | input, task) without ever being explicitly told what the task is.

This is what allows the model to move beyond a single objective and start behaving like a general system.

Methodology

To understand how this idea works in practice, it helps to look at how the model is trained.

According to the authors, everything starts with a standard language modeling objective.

The model is trained to predict the next token in a sequence based on the tokens that come before it.

While this may seem simple, it allows the model to learn the underlying structure of language over time.

Formally, this means the model is learning probabilities over sequences of text. In practice, this ability enables it to generate coherent text, complete sentences, and even mimic patterns that resemble specific tasks.
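In symbols, this is the standard autoregressive factorization of a sequence probability. Writing a text as tokens s_1, ..., s_n (the symbol names are chosen here for illustration), the probability of the whole sequence is a product of next-token conditionals:

```latex
p(s_1, \ldots, s_n) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
```

Training adjusts the model so that each conditional on the right assigns high probability to the token that actually comes next in the training text.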

This is what makes the approach powerful. Even though the model is only trained to predict the next word, it ends up capturing much richer behavior that can be applied to a variety of tasks.

Zero-Shot Setup

One of the most important differences from earlier approaches is how the model is used after training.

Unlike GPT-1, there's no fine-tuning or task-specific training. The model isn't adapted or retrained for each new task. Instead, everything is handled through the input itself.

According to the authors, tasks are expressed directly as text prompts. For example, you might write something like “Translate to French:” followed by a sentence, or “Answer the question:” followed by a prompt. The model then continues the text in a way that reflects the task.

In practice, this means the model isn't explicitly told what to do through training – it infers the task from the structure of the input and responds accordingly.

Fine-tuning vs Zero-Shot Learning

Aspect	Fine-tuning (Task-Specific Training)	Zero-Shot Learning
Definition	Model is trained further on labeled data for a specific task	Model performs tasks without any additional training
Training Requirement	Requires task-specific labeled datasets	No labeled data needed for the task
Setup	Separate training phase for each task	Tasks are given as natural language prompts
Flexibility	Limited to trained tasks	Can generalize to many unseen tasks
Performance	Usually higher on specific tasks	Lower, but improving with scale
Cost	Expensive (training per task)	Efficient (no retraining needed)
Adaptability	Needs retraining for new tasks	Adapts instantly via prompts
Example (NLP)	Train model for sentiment analysis dataset	“Classify sentiment: …” prompt
Used in	GPT-1, traditional NLP systems	GPT-2, GPT-3, modern LLMs
Main Advantage	High accuracy on defined tasks	High flexibility and generalization
Main Limitation	Not scalable across many tasks	Less precise than fine-tuned models

Training Data (WebText)

Another key part of this work is the dataset used to train the model.

Instead of relying on traditional sources like Wikipedia, books, or news articles alone, the authors created a new dataset called WebText.

It consists of about 8 million documents – around 40 GB of text – collected from outbound links shared on Reddit that received at least 3 karma.

According to the paper, this filtering step helps improve the overall quality of the data, since the content is more likely to be interesting or useful to readers.

What makes this dataset important is its diversity. It contains real-world language from many domains, and more importantly, it includes natural examples of tasks, such as explanations, question–answer pairs, and translations, embedded within the text itself.

Input Representation

To process text, the model uses a technique called Byte Pair Encoding (BPE).

According to the authors, BPE works as a middle ground between word-level and character-level representations.

Instead of treating text strictly as full words or individual characters, it breaks it into smaller units that can adapt depending on how frequently patterns appear in the data.

In practice, this allows the model to handle a wide range of text more effectively, including rare words and different languages. It also improves generalization, since the model isn't limited to a fixed vocabulary of complete words.
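The core of BPE is a simple loop: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. The sketch below shows a single merge step on a toy word-frequency table. It works on characters for readability, whereas GPT-2 applies the idea to bytes, and the function names and toy corpus are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters mapped to how often it occurs.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newer"): 6, tuple("wider"): 3}

best = most_frequent_pair(corpus)
corpus_after_merge = merge_pair(corpus, best)
print(best)   # ('e', 'r')
```

Repeating this step many times builds a vocabulary in which frequent words become single tokens while rare words are still representable as sequences of smaller pieces.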

Model Architecture

The model used in this paper is based on a Transformer (decoder-only) architecture, similar to GPT-1 but significantly scaled up.

According to the authors, the model relies on masked self-attention, which allows it to look at previous tokens in a sequence while predicting the next one.

This means it processes text step by step, always using past context to generate the next token.
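This step-by-step process can be sketched as a small decoding loop. The example below uses greedy decoding (always pick the highest-scoring token) and a hand-written bigram table as a stand-in for the model. Both are assumptions for illustration: a real GPT conditions on the entire preceding context, and the paper's experiments are not limited to greedy decoding.

```python
def greedy_generate(next_token_scores, prompt_tokens, max_new_tokens, eos_token=None):
    """Repeatedly append the highest-scoring next token to the running context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)         # dict: candidate token -> score
        next_token = max(scores, key=scores.get)   # greedy choice
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens


# Hypothetical toy "model": scores depend only on the last token.
BIGRAM_SCORES = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"<eos>": 1.0},
}


def toy_scores(tokens):
    return BIGRAM_SCORES.get(tokens[-1], {"<eos>": 1.0})


generated = greedy_generate(toy_scores, ["the"], max_new_tokens=5, eos_token="<eos>")
print(generated)   # ['the', 'cat', 'sat', '<eos>']
```

Each new token is appended to the context before the next prediction is made, which is what it means for the model to always use past context.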

Compared to GPT-1, several important changes were introduced.

The model can handle longer context, with sequences of up to 1024 tokens, and uses a larger vocabulary of 50,257 tokens. It's also much deeper, with more layers and significantly more parameters.

The authors trained multiple versions of the model, ranging from 117 million to 1.5 billion parameters.

The largest of these is what we now refer to as GPT-2, and it's the one responsible for most of the strong results reported in the paper.

Figure: Transformer (decoder-only). Reference: Brownlee, J. Encoders and Decoders in Transformer Models. Machine Learning Mastery.

Experiments

To evaluate the model, the authors tested it across a wide range of tasks – but with an important constraint: according to the paper, the model wasn't trained or fine-tuned on any of these tasks.

Instead, everything was evaluated in a zero-shot setting, where the model is simply given a prompt and asked to continue the text.

They applied this setup to different types of problems, including language modeling benchmarks, reading comprehension, translation, summarization, question answering, and commonsense reasoning.

The goal here was not just to measure performance, but to see how far a single model (trained only on raw text) could generalize across tasks without any additional training.

Key Findings

After evaluating the model across different tasks, the results were stronger than many would have expected.

According to the authors, GPT-2 achieves state-of-the-art results on 7 out of 8 language modeling benchmarks in a zero-shot setting.

One of the most important observations is that performance consistently improves as the model size increases, following a roughly log-linear trend.

In other words, scaling up the model leads to better results across tasks.

The paper also shows that larger models display more consistent multitask behavior.

For example, GPT-2 performs well on tasks that require long-range understanding, such as LAMBADA, and shows competitive results in reading comprehension on datasets like CoQA.

It even demonstrates early capabilities in translation and can answer factual questions without being explicitly trained for those tasks.

In practice, the key takeaway is clear: increasing model size and data plays a major role in unlocking these capabilities.

Task-Specific

Looking more closely at individual tasks, the paper gives a clearer picture of where the model performs well and where it still struggles.

GPT-2 shows surprisingly strong results in reading comprehension, even without any task-specific training. But its performance on summarization is still limited.

While it can generate summaries that look reasonable, they're often less accurate compared to supervised approaches.

For translation, the model demonstrates some ability, but the results are still far from competitive.

On the other hand, question answering improves noticeably as the model size increases, suggesting that scale plays an important role in this capability.

Overall, the model is far from perfect. But what stands out is that it's clearly beginning to learn general skills across tasks, even without being explicitly trained for them.

Generalization vs Memorization

A natural question that comes up is whether the model is actually learning useful patterns or simply memorizing the training data.

The authors address this directly. They analyze overlap between the training dataset and evaluation benchmarks using n-gram comparisons, looking for signs that the model might be copying rather than generalizing.

According to the paper, while some overlap does exist (as is common in large datasets), it's not enough to explain the model’s performance.
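A simplified version of this kind of check fits in a few lines. The paper reports using Bloom filters over 8-grams of WebText tokens. The sketch below substitutes an exact Python set, which is fine for small inputs but would not scale to a web-sized corpus. The function names and toy token lists are assumptions for this example.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_tokens, test_tokens, n=8):
    """Fraction of the test set's distinct n-grams that also occur in the training data."""
    test_grams = ngrams(test_tokens, n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(train_tokens, n)
    return len(test_grams & train_grams) / len(test_grams)


train = "a b c d e f".split()
test = "c d e x y".split()

fraction = overlap_fraction(train, test, n=2)
print(fraction)   # 0.5
```

A high fraction for a benchmark suggests that strong scores on it may partly reflect memorized text, which is why the authors report these numbers alongside the results.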

They also observe that the model still underfits the data, meaning it hasn’t fully captured everything in the training set.

This is an important point: if the model was mainly memorizing, we would expect it to fit the data much more closely.

In practice, this suggests that the improvements are coming from genuine learning rather than simple memorization, even though some overlap is unavoidable.

Discussion

This section is where the authors step back and reflect on what these results actually mean.

According to the paper, language models trained on large and diverse datasets aren't just learning representations of text. They're beginning to learn how to perform tasks directly, even without supervision.

In other words, pre-training is doing more than providing useful features: it's capturing patterns that resemble real task behavior.

At the same time, the authors are careful not to overstate the results.

While the zero-shot capabilities are impressive, performance is still far from practical on many tasks.

Some outputs look convincing on the surface but lack accuracy when measured more carefully.

In practice, this section highlights both sides of the story. The approach is clearly promising, but it's still an early step toward more general systems.

Limitations

Despite the progress shown in the paper, the approach still has several important limitations.

According to the authors, zero-shot performance, while impressive, is generally weaker than fully supervised models on many tasks.

The results also depend heavily on scale, both in terms of model size and the amount of data used. This means that smaller models don't show the same level of capability.

In addition, some tasks, such as summarization, remain relatively weak.

The model can produce outputs that look plausible, but they often lack accuracy or consistency when evaluated more carefully.

Another practical challenge is the cost. Training these models requires significant computational resources and large datasets, which makes this approach difficult to reproduce or scale for many researchers.

Conclusion

The paper ends with a simple but powerful idea.

According to the authors, when a language model is trained on a sufficiently large and diverse dataset – and with enough capacity – it begins to generalize across tasks and perform them without explicit training.

This suggests that the model isn't just learning language, but also the structure of the tasks embedded within it.

In practice, this points to a different way of thinking about AI systems. Instead of designing and training a model for each specific task, we can focus on training a single model on large-scale language data – and allow useful capabilities to emerge naturally from that process.

Final Insight

If GPT-1 introduced the idea of combining pre-training with fine-tuning, GPT-2 takes that idea a step further.

According to the paper, pre-training alone – when done at a large enough scale – can already produce models that begin to perform a wide range of tasks without any additional training.

This is a subtle but important shift, because it suggests that general capabilities can emerge directly from exposure to large amounts of text.

In my view, this is the point where things start to change direction.

The focus moves away from designing task-specific systems and toward building more general models that can adapt on their own.

This idea directly sets the stage for what comes next: models like GPT-3, ChatGPT, and modern large language systems that build on this same principle.

GPT-1 vs GPT-2 — Key Differences

Aspect	GPT-1	GPT-2
Core Idea	Pre-training + fine-tuning	Pre-training alone (zero-shot)
Training Approach	Two stages: learn language, then adapt to tasks	Single stage: learn language and infer tasks
Supervision	Requires labeled data for fine-tuning	No labeled data needed for tasks
Task Handling	Tasks require separate fine-tuning	Tasks handled via prompts (zero-shot)
Generalization	Limited, depends on fine-tuning	Stronger generalization across tasks
Model Role	Learns language, then adapts	Learns language and tasks together
Architecture	Transformer (decoder-based)	Transformer (decoder-only, scaled up)
Model Size	Smaller (~117M parameters)	Much larger (up to 1.5B parameters)
Context Length	Shorter context	Longer context (up to 1024 tokens)
Dataset	BooksCorpus + other curated datasets	WebText (large, diverse internet data)
Key Capability	Transfer learning	Zero-shot learning
Performance Style	Strong after fine-tuning	Strong without any task training
Limitations	Depends on labeled data	Depends heavily on scale (data + compute)
Main Contribution	Introduced pre-training paradigm	Showed emergence of multitask behavior
Impact	Foundation of modern NLP pipelines	Shift toward general-purpose models

The Rise of AI Agents: How Software Is Learning to Act

Manish Shivanandhan — Fri, 08 May 2026 17:07:26 +0000

Software has always been reactive.

You click a button, it responds. You call an API, it returns data.

Even the most sophisticated systems have historically depended on explicit instructions and tightly defined workflows. That model is starting to break.

A new class of software is emerging that doesn't just respond, but acts.

This shift isn't cosmetic. It changes how software is designed, how systems are operated, and how work itself is executed.

Instead of encoding every step of a workflow, developers are now defining goals, constraints, and tools, then letting software figure out the execution path. The result is software that behaves less like a function and more like an operator.

In this article, you'll learn what AI agents actually are, how they differ from traditional software systems, and why they're starting to represent a major shift in modern software design.

This article is written for developers, technical founders, engineering managers, and anyone building software systems with AI components.

You don't need prior experience building AI agents, but it helps to be familiar with basic Python syntax and large language models (LLMs).

What We'll Cover:

From Deterministic Systems to Goal-Driven Execution
The Core Components of an AI Agent
Why AI Agents Are Emerging Now
The Illusion and Reality of Autonomy
Designing Agents That Work in Practice
Multi-Agent Systems and Coordination
Where AI Agents Are Already Delivering Value
The Shift in Software Design
What Comes Next

From Deterministic Systems to Goal-Driven Execution

Traditional software systems are deterministic. Given the same input, they produce the same output.

This predictability is what makes them reliable, but it's also what limits them. Any variation in workflow requires new code, new conditions, and new branches.

AI agents introduce a different model. They're goal-driven rather than instruction-driven. Instead of specifying every step, you define an objective and provide access to tools. The agent decides how to achieve the objective, often adapting in real time.

Consider a simple task like summarizing a set of documents and emailing the result. In a traditional system, you would write a pipeline that loads documents, processes them, formats the output, and sends an email. Each step is explicitly coded.

With an agent, the system might look more like this:

from openai import OpenAI

client = OpenAI()
goal = "Summarize all documents in /reports and email a concise briefing to the leadership team"
tools = [
    "read_files",
    "summarize_text",
    "send_email"
]
response = client.responses.create(
    model="gpt-4.1",
    input=f"Goal: {goal}. Available tools: {tools}"
)
print(response.output_text)

This example is simplified, but it captures the shift. The developer defines intent and capability. The agent determines execution.

The Core Components of an AI Agent

To understand how agents work, it helps to break them into components. At a high level, most agents consist of reasoning, memory, and tools.

Reasoning is handled by a large language model. This is what allows the agent to interpret goals, plan actions, and adapt when something fails. It's not just generating text, it's generating decisions.

Memory allows the agent to maintain context across steps. Without memory, the agent behaves like a stateless function. With memory, it can track progress, recall past actions, and refine its approach.

Tools are what make the agent useful. A tool can be anything from an API to a database query to a shell command. The agent doesn't need to know how the tool works internally. It only needs to know when and how to use it.

Here is a minimal example of tool usage in an agent loop:

def agent_loop(goal, tools):
    context = []
    
    while True:
        prompt = f"Goal: {goal}\nContext: {context}\nWhat should be done next?"
        
        decision = model.generate(prompt)
        
        if decision == "DONE":
            break
        
        if decision.startswith("USE_TOOL"):
            tool_name, tool_input = parse_tool_call(decision)
            result = tools[tool_name](tool_input)
            context.append(result)
        else:
            context.append(decision)
    
    return context

This loop is where the agent “acts.” It observes, decides, executes, and updates its understanding.
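The loop above depends on a `model` object and a `parse_tool_call` helper that aren't defined here. As a rough sketch of how the pieces fit together, the version below swaps the LLM for a scripted stand-in and adds a step limit. `ScriptedModel`, the `USE_TOOL <name> <input>` format, and `max_steps` are all assumptions made for illustration, not part of any real agent framework.

```python
def parse_tool_call(decision):
    """Split 'USE_TOOL <name> <input>' into (name, input). Hypothetical format."""
    _, tool_name, tool_input = decision.split(" ", 2)
    return tool_name, tool_input


class ScriptedModel:
    """Stand-in for an LLM: replays a fixed list of decisions."""

    def __init__(self, decisions):
        self._decisions = iter(decisions)

    def generate(self, prompt):
        # Fall back to "DONE" once the script runs out
        return next(self._decisions, "DONE")


def run_agent(goal, tools, model, max_steps=10):
    context = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        prompt = f"Goal: {goal}\nContext: {context}\nWhat should be done next?"
        decision = model.generate(prompt)
        if decision == "DONE":
            break
        if decision.startswith("USE_TOOL"):
            tool_name, tool_input = parse_tool_call(decision)
            context.append(tools[tool_name](tool_input))
        else:
            context.append(decision)
    return context


tools = {"shout": lambda text: text.upper()}
model = ScriptedModel(["USE_TOOL shout hello world", "Task looks complete", "DONE"])
result = run_agent("Shout a greeting", tools, model)
print(result)
```

The step limit is the main design change: without it, a model that never emits "DONE" would keep the loop running indefinitely.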

Why AI Agents Are Emerging Now

The idea of autonomous software isn't new. What has changed is the capability of the underlying models.

Large language models can now reason across multiple steps, interpret unstructured inputs, and generate structured outputs that can drive real systems.

Equally important is the ecosystem around them. APIs are more standardized, infrastructure is more programmable, and data is more accessible. This makes it easier to expose tools to agents and let them interact with real systems, which is what some of the best AI agents in use today are built on.

There's also an economic driver. Many workflows today are still manual, even in highly digitized organizations. These workflows often involve coordination across systems, interpretation of data, and decision-making under uncertainty. This is exactly the kind of work agents are suited for.

The Illusion and Reality of Autonomy

It's tempting to describe AI agents as fully autonomous. In practice, most are not. They operate within constraints defined by developers. They rely on tools that expose only certain actions. They're often monitored, rate-limited, and evaluated at each step.

What makes them different isn't complete autonomy, but partial autonomy. They can decide how to execute within a bounded environment.

This distinction matters because it affects how systems are designed. You're not building a system that always behaves predictably. You're building a system that explores a solution space and converges on an outcome.

That introduces new challenges. Agents can take inefficient paths. They can misinterpret goals. They can fail in ways that are hard to debug because the failure isn't a single error, but a chain of decisions.

Designing Agents That Work in Practice

Building an agent is easy. Building one that works reliably is harder. The difference comes down to control.

One approach is to constrain the agent’s action space. Instead of giving it open-ended access, you define a limited set of tools with clear interfaces. This reduces ambiguity and makes behavior more predictable.

Another approach is to introduce intermediate checkpoints. Instead of letting the agent run freely, you validate its decisions at key steps. You can do this through rules, secondary models, or even human review.

Here's an example of adding a validation layer:

def safe_execute(tool, input_data):
    if not validate_input(tool, input_data):
        return "Invalid input"
    
    result = tool(input_data)
    
    if not validate_output(tool, result):
        return "Invalid output"
    
    return result

This pattern is critical in production systems. It turns an unconstrained agent into a controlled system that can still adapt, but within safe boundaries.
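To show what the two validators might look like, here is a small sketch built around a made-up `send_email` tool. The tool, the `ALLOWED_DOMAIN` policy, and the rules inside `validate_input` and `validate_output` are assumptions for illustration. Real systems would use checks specific to each tool.

```python
ALLOWED_DOMAIN = "example.com"  # hypothetical policy: only internal recipients


def send_email(payload):
    """Hypothetical tool: pretends to send an email and reports the result."""
    return {"status": "sent", "to": payload["to"]}


def validate_input(tool, input_data):
    # Rule-based check: a recipient must be present and on the allowed domain
    to = input_data.get("to", "") if isinstance(input_data, dict) else ""
    return to.endswith("@" + ALLOWED_DOMAIN)


def validate_output(tool, result):
    # The tool must report success in the expected shape
    return isinstance(result, dict) and result.get("status") == "sent"


def safe_execute(tool, input_data):
    if not validate_input(tool, input_data):
        return "Invalid input"
    result = tool(input_data)
    if not validate_output(tool, result):
        return "Invalid output"
    return result


ok = safe_execute(send_email, {"to": "team@example.com"})
blocked = safe_execute(send_email, {"to": "someone@elsewhere.org"})
print(ok)
print(blocked)
```

Because the input check runs before the tool is called, the blocked request never reaches `send_email` at all.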

Multi-Agent Systems and Coordination

As agents become more capable, a single agent is often not enough. Complex tasks can be decomposed into multiple agents, each responsible for a specific function.

For example, one agent might handle data retrieval, another might handle analysis, and a third might handle communication. These agents can coordinate by passing structured messages.

class Message:
    def __init__(self, sender, receiver, content):
        self.sender = sender
        self.receiver = receiver
        self.content = content

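def send_message(agent, message):
    return agent.process(message)


class AnalystAgent:
    """Minimal placeholder agent so the example runs end to end."""

    def process(self, message):
        return f"Received from {message.sender}: {message.content}"


analyst_agent = AnalystAgent()
message = Message("retriever", "analyst", "Data collected from API")
response = send_message(analyst_agent, message)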

This model starts to resemble a distributed system, but with agents instead of services. Coordination becomes a first-class concern. You need to define protocols, handle failures, and ensure consistency across agents.

Where AI Agents Are Already Delivering Value

Despite the hype, there are concrete areas where agents are already useful. Internal tooling is one of them. Automating repetitive workflows, generating reports, and orchestrating tasks across systems are all well-suited for agents.

Customer support is another area. Agents can handle complex queries that require accessing multiple systems, not just retrieving canned responses.

Security and compliance workflows are also a strong fit. These often involve monitoring signals, correlating data, and taking action based on rules that aren't always deterministic.

What these use cases have in common is that they involve structured environments with clear objectives and measurable outcomes. Agents perform best when the problem space is bounded, even if the execution path is not.

The Shift in Software Design

The rise of AI agents isn't just about adding a new feature. It's about changing the abstraction layer of software.

Instead of writing code that directly implements behavior, you're designing systems that enable behavior. You define goals, expose capabilities, and enforce constraints. The actual execution becomes dynamic.

This requires a different mindset. Debugging is no longer just about tracing code. It's about understanding decision paths. Testing is no longer just about input-output pairs. It's about evaluating behavior across scenarios.

Observability becomes critical. You need to log not just what the system did, but why it did it. This includes prompts, intermediate decisions, and tool interactions.
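One way to capture the "why" is to write a structured record for every step the agent takes. The sketch below is only an illustration: the `DecisionTrace` class and its field names are assumptions, not a standard schema.

```python
import json
import time


class DecisionTrace:
    """Collects one structured record per agent step (hypothetical schema)."""

    def __init__(self):
        self.records = []

    def log(self, step, prompt, decision, tool=None, tool_result=None):
        self.records.append({
            "step": step,
            "timestamp": time.time(),
            "prompt": prompt,
            "decision": decision,
            "tool": tool,
            "tool_result": tool_result,
        })

    def to_jsonl(self):
        # One JSON object per line, easy to ship to a log pipeline
        return "\n".join(json.dumps(r, ensure_ascii=False) for r in self.records)


trace = DecisionTrace()
trace.log(1, "Goal: summarize reports", "USE_TOOL read_files /reports",
          tool="read_files", tool_result="3 files read")
trace.log(2, "Goal: summarize reports", "DONE")
print(trace.to_jsonl())
```

With records like these, a failed run can be replayed step by step to find where the chain of decisions went wrong.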

What Comes Next

AI agents are still in relatively early stages. The current generation is powerful but imperfect. Reliability is a major challenge. So is cost, especially when agents require multiple model calls per task.

But the direction is clear: software is moving from static execution to dynamic action. The boundary between user and system is becoming less rigid. Instead of telling software what to do step by step, users will increasingly define outcomes and let systems figure out the rest.

This doesn't eliminate the need for engineers. It changes what engineers do. The focus shifts from implementing logic to designing systems that can reason, act, and adapt.

The rise of AI agents marks a transition. Software is no longer just a tool. It's becoming an actor.

Join my Applied AI newsletter to learn how to build and ship real AI systems. Practical projects, production-ready code, and direct Q&A. You can also connect with me on LinkedIn.

How to Run Open Source LLMs Locally and in the Cloud

Beau Carnes — Thu, 07 May 2026 13:55:17 +0000

Learn how to work with a wide range of open large language models (LLMs) such as Gemma, Kimi, and GLM across various local and cloud-based environments.

We just posted a course on the freeCodeCamp.org YouTube channel, taught by Andrew Brown, that explores how to use coding harnesses like Claude Code and Pi Agent to build real-world agentic workflows while benchmarking model performance and hardware requirements.

The course provides a practical look at the current state of open AI by conducting "smoke tests," such as building Flappy Bird clones to evaluate how different models handle real-world coding tasks. You will explore the hardware requirements necessary for local execution, including the VRAM limitations that often make cloud-hosted options more viable for large context windows.

Andrew also evaluates various coding harnesses, like Claude Code and Pi Coding Agent. By the end of the course, you will understand which models, such as Kimi K2.5 and Gemma 4, are most reliable for tool calling and structured code generation.

Watch the full course for free on the freeCodeCamp.org YouTube channel.

How to Build Your Own Language-Specific LLM [Full Handbook]

Wisamul Haque — Fri, 24 Apr 2026 20:59:02 +0000

What if you could build your own LLM, one that speaks your native language, all from scratch? That's exactly what we'll do in this tutorial. The best way to understand how LLMs work is by actually building one.

We'll go through each step of creating your own LLM in a specific language (Urdu in this case). This will help you understand what goes on inside an LLM.

Modern LLMs trace back to the research paper that changed everything: "Attention Is All You Need". But rather than getting lost in the math (I am bad at math, sadly), we'll learn by building one from scratch.

Who is This Handbook For?

Software engineers, product owners, or anyone curious about how LLMs work. If you have a little machine learning knowledge, that would be great, but if not, no worries. I've written this so that you don't have to go anywhere outside this tutorial.

By the end, you will have a working Urdu LLM chatbot deployed and running. You can create one for your own native language as well by following the steps defined below.

A Note on Expectations:

The goal here is to educate ourselves on how LLMs work by practically going through all the steps.

The goal is not to build an LLM that behaves like ChatGPT. That requires massive datasets, months of training, and reinforcement learning from human feedback (RLHF), all of which you'll understand better by going through this tutorial.

A Note on the Code:

The code in this tutorial was largely generated using Claude Opus 4. This is worth highlighting because it shows that LLMs are not just coding assistants that help you ship features faster. They can also be powerful learning tools.

By prompting Claude to generate, explain, and iterate on each component, I was able to understand the internals of LLM training far more deeply than reading documentation alone.

If you're following along, I encourage you to do the same: use an LLM for your learning.

What We'll Cover:

Components of LLM Training
- Tech Stack Required
1. Data Preparation
- Data Cleaning
2. Tokenization
3. Pre-Training
4. Supervised Fine-Tuning (SFT)
5. Deployment
- Gradio Web Interface (app.py)
- Deployment Options
Full Pipeline Summary
Results
Conclusion

Components of LLM Training

In this tutorial, we'll be covering the following components one by one with code examples for better understanding:

Data Preparation
Tokenization
Pre-Training
Supervised Fine-Tuning (SFT)
Deployment

Tech Stack Required

Before starting the steps, here is the tech stack you need:

Python 3.9+
PyTorch
Tokenizers / SentencePiece
Hugging Face Datasets & Hub
regex, BeautifulSoup4, requests (for data cleaning)
tqdm, matplotlib (for training utilities)
Gradio (for chat UI deployment)
Google Colab (free T4 GPU for training)

Note: Make sure to install all the dependencies listed in the requirements.txt file of the repository before getting started.

1. Data Preparation

In data preparation, the first and foremost step is data collection. An LLM needs to be trained on a large amount of text data. There is no single place to get this data. Depending on the type of model you want to build, you can collect text from many sources:

Digital libraries and archives: Internet Archive or Wikipedia dumps
Code repositories: GitHub, GitLab (useful if your model needs to understand code)
Web scraping: Crawling websites, blogs, and forums using automated scripts
Academic datasets: Research papers, open-access journals
Pre-built datasets: Platforms like Hugging Face Datasets and Kaggle host thousands of ready-to-use datasets

In practice, large-scale LLMs like GPT and LLaMA rely heavily on web scraping from many sources using automated pipelines. But there's one important rule to follow: only use publicly available, open-source data. Don't scrape private or personal user information. Stick to data that's explicitly shared for public use or falls under permissive licenses.

Also, keep this principle in mind: garbage in, garbage out. Just getting the data isn't enough. It should be correct, clean, and without noise.

In practice, you can combine data from several of these sources. For this tutorial, I used Hugging Face as my data source. I chose it for a few reasons.

First, since the goal was to learn how LLMs work, I wanted to spend my time on the model, not on writing web scrapers. Hugging Face already has a large collection of datasets in a cleaned and structured format, which saves a lot of upfront work.

Second, Hugging Face offers language-specific datasets. Since I was building an Urdu LLM, I needed Urdu text specifically, and Hugging Face has CulturaX which provides multilingual datasets including Urdu and many other languages. The dataset was huge, so I avoided downloading all of it and only downloaded a small portion.

Important: Before you download the dataset from Hugging Face, you need to create an account and log in through the Hugging Face CLI. Once you're authenticated, you can download the dataset.

In the script below, we load the dataset from Hugging Face with streaming=True. This way we don't have to download all the data, only as many samples as NUM_SAMPLES specifies.

# ============================================================
# Option A: Download from CulturaX (recommended, high quality)
# ============================================================
# CulturaX is a cleaned version of mC4 + OSCAR
# We stream it to avoid downloading the entire dataset

from datasets import load_dataset  # Hugging Face Datasets
from tqdm import tqdm              # progress bar

NUM_SAMPLES = 100_000  # Start with 100K samples (~50-100MB text)

print("Loading CulturaX Urdu dataset (streaming)...")
dataset = load_dataset(
    "uonlp/CulturaX",
    "ur",                    # Urdu language code
    split="train",
    streaming=True,          # Don't download everything
    trust_remote_code=True
)

# Collect samples
raw_texts = []
for i, sample in enumerate(tqdm(dataset, total=NUM_SAMPLES, desc="Downloading")):
    if i >= NUM_SAMPLES:
        break
    raw_texts.append(sample["text"])

print(f"\nDownloaded {len(raw_texts)} samples")
print(f"Total characters: {sum(len(t) for t in raw_texts):,}")
print(f"\nSample text (first 500 chars):")
print(raw_texts[0][:500])

Data Cleaning

Simply having the data is not enough to start training your model. The next step is probably the most important one: data cleaning. The goal is to make the data as pure as possible.

As I was building a language-specific Urdu LLM, I had to write cleaning logic to remove non-Urdu text, HTML links, special characters, duplicate content, and excess whitespace. All these factors pollute the training data and can cause issues during training.

Based on the type of dataset, some language-specific or use-case cleaning will be required.

One thing that might be new to you is the NFKC Unicode normalization step. This normalizes text that appears the same but exists in different Unicode forms, keeping one canonical form.
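Here is a small sketch of what NFKC does, using only Python's standard library. The specific characters are my own picks for illustration: U+FB56 is the isolated presentation form of the letter peh, and U+FF11 to U+FF13 are full-width digits.

```python
import unicodedata

# U+FB56 is the "isolated" presentation form of the letter peh (پ);
# NFKC folds it into the standard code point U+067E.
presentation_form = "\uFB56"
normalized = unicodedata.normalize("NFKC", presentation_form)

print(presentation_form == "\u067E")  # False: different code points
print(normalized == "\u067E")         # True after normalization

# Compatibility characters such as full-width digits are folded as well
digits = unicodedata.normalize("NFKC", "\uFF11\uFF12\uFF13")
print(digits)
```

Without this step, the tokenizer would treat visually identical letters as different symbols and waste vocabulary slots on them.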

You'll also see some regex patterns that are used to keep only the Urdu text. As Urdu script is based on Arabic, we'll use Arabic Unicode ranges. I also removed artifacts like //, --, and extra empty spaces that were present in the raw data.

This cleaning took multiple iterations. I reviewed the results manually each time and identified issues like inconsistent spacing, long dashes, and stray punctuation. All of these can negatively impact the next stages, so it's important to clean thoroughly.

This also gives you an idea of how important the data part still is and how much LLMs depend on data.

Here is the cleaning function I used:

import re

import regex  # third-party "regex" package, listed in the tech stack above


def clean_urdu_text(text: str) -> str:
    """
    Clean a single Urdu text document.
    
    Steps:
    1. Remove URLs
    2. Remove HTML tags and entities
    3. Remove email addresses
    4. Normalize Unicode (NFKC normalization)
    5. Remove non-Urdu characters (keep Urdu + punctuation + digits)
    6. Normalize repeated punctuation (۔۔۔, ..., - -, etc.)
    7. Normalize whitespace
    """
    import unicodedata
    
    # Step 1: Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Step 2: Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove HTML entities
    text = re.sub(r'&[a-zA-Z]+;', ' ', text)
    text = re.sub(r'&#\d+;', ' ', text)
    
    # Step 3: Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Step 4: Unicode normalization (NFKC)
    # This normalizes different representations of the same character
    text = unicodedata.normalize('NFKC', text)
    
    # Step 5: Keep only Urdu characters, basic punctuation, digits, and whitespace
    # Urdu Unicode ranges + Arabic punctuation + Western digits + basic punctuation
    urdu_pattern = regex.compile(
        r'[^'
        r'\u0600-\u06FF'    # Arabic (includes Urdu)
        r'\u0750-\u077F'    # Arabic Supplement
        r'\u08A0-\u08FF'    # Arabic Extended-A
        r'\uFB50-\uFDFF'    # Arabic Presentation Forms-A
        r'\uFE70-\uFEFF'    # Arabic Presentation Forms-B
        r'0-9۰-۹'           # Western and Eastern Arabic-Indic digits
        r'\s'               # Whitespace
        r'۔،؟!٪'           # Urdu punctuation (full stop, comma, question mark, etc.)
        r'.,:;!?\-\(\)"\']'  # Basic Latin punctuation
    )
    text = urdu_pattern.sub(' ', text)
    
    # Step 6: Normalize repeated punctuation
    text = re.sub(r'۔{2,}', '۔', text)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'-\s*-+', '-', text)
    text = re.sub(r'-{2,}', '-', text)
    text = re.sub(r'،{2,}', '،', text)
    text = re.sub(r',{2,}', ',', text)
    text = re.sub(r'\s+[۔\.\-,،]\s+', ' ', text)
    
    # Step 7: Normalize whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 newlines
    text = re.sub(r'[^\S\n]+', ' ', text)    # Collapse spaces (but keep newlines)
    text = text.strip()
    
    return text


def is_mostly_urdu(text: str, threshold: float = 0.5) -> bool:
    """
    Check if text is mostly Urdu characters.
    This filters out documents that are primarily English/other languages.
    
    threshold: minimum fraction of characters that must be Urdu
    """
    if len(text) == 0:
        return False
    urdu_chars = len(regex.findall(r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]', text))
    return (urdu_chars / len(text)) > threshold


# Test the cleaning function
sample = raw_texts[0]
print("=== BEFORE CLEANING ===")
print(sample[:300])
print("\n=== AFTER CLEANING ===")
cleaned = clean_urdu_text(sample)
print(cleaned[:300])
print(f"\nIs mostly Urdu: {is_mostly_urdu(cleaned)}")

After cleaning, I stored the data in two formats: a text file (used for tokenizer training) and a JSONL file (used for pre-training). Each format serves a specific purpose in the upcoming steps.
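Saving the two formats could look roughly like the sketch below. The `urdu_corpus.jsonl` file name, the `{"text": ...}` record shape, and the temporary output directory are assumptions for illustration. The tutorial's repository may organize this differently.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the cleaned documents produced above
cleaned_texts = ["یہ پہلا جملہ ہے۔", "یہ دوسرا جملہ ہے۔"]

out_dir = Path(tempfile.mkdtemp())          # hypothetical output location
txt_path = out_dir / "urdu_corpus.txt"
jsonl_path = out_dir / "urdu_corpus.jsonl"  # assumed file name

# Plain text: one document per line, for tokenizer training
txt_path.write_text("\n".join(cleaned_texts) + "\n", encoding="utf-8")

# JSONL: one JSON object per line, for pre-training
with jsonl_path.open("w", encoding="utf-8") as f:
    for text in cleaned_texts:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read the JSONL back to confirm the round trip
with jsonl_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line)["text"] for line in f]
print(loaded == cleaned_texts)
```

Passing `ensure_ascii=False` keeps the Urdu text readable in the file instead of turning it into `\u` escape sequences.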

2. Tokenization

The next step after cleaning is tokenization. Tokenization converts text into numbers, and provides a way to convert those numbers back into text.

This is necessary because neural networks can't understand text – they only understand numbers. So tokenization is essentially a translation layer between human language and what the model can process.

For example:

"hello world"  →  ["hel", "lo", " world"]  →  [1245, 532, 995]
"اردو زبان"   ←  ["ار", "دو", "زب", "ان"]  ←  [412, 87, 953, 201]

Tokenization Approaches

There are three main approaches to tokenization:

Approach 1: Character-level

With this approach, you split text into individual characters:

hello -> ['h', 'e', 'l', 'l', 'o']
اردو -> ['ا', 'ر', 'د', 'و']

The problem is that sequences become very long. A 1000-word document might be 5000+ tokens. The model has to learn to combine characters into words, which is very hard.

Approach 2: Word-level

In this approach, you split based on spaces between words:

hello how are you -> ['hello', 'how', 'are', 'you']
اردو بہت اچھی زبان ہے -> ['اردو', 'بہت', 'اچھی', 'زبان', 'ہے']

The problem is that a language's vocabulary is huge (Urdu has 100K+ unique words, English has 170K+). The model can't handle new or rare words (the out-of-vocabulary problem).
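The first two approaches are simple enough to try in a few lines of Python. This is a rough sketch: the toy vocabulary and the `UNK_ID` fallback are assumptions for illustration.

```python
text = "اردو بہت اچھی زبان ہے"

# Character-level: every character (including spaces) becomes a token
char_tokens = list(text)

# Word-level: split on whitespace
word_tokens = text.split()

print(len(char_tokens), len(word_tokens))

# Build a word-level vocabulary from the sentence above
vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}
UNK_ID = len(vocab)  # hypothetical fallback ID for out-of-vocabulary words

# Reuse three known words and add one word the vocabulary has never seen
new_sentence = [word_tokens[0], "مشکل", word_tokens[3], word_tokens[4]]
ids = [vocab.get(word, UNK_ID) for word in new_sentence]
print(ids)
```

The character-level sequence is several times longer than the word-level one, and the unseen word collapses to the fallback ID. Those are the two problems subword tokenization is designed to avoid.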

Approach 3: Subword using BPE (Byte Pair Encoding)

With this approach, the model learns common character sequences from data.

unhappiness might split as ['un', 'happi', 'ness']
مکمل might split as ['مکم', 'ل'] or stay whole if common enough.

This gives a smaller vocabulary (we use 32K tokens), and it can handle any word, even new ones. Common words stay as single tokens.

BPE is the industry standard, used by GPT, LLaMA, and most modern LLMs. Here is how it works step by step:

Start with characters: vocabulary = all individual characters
Count pairs: find the most frequent adjacent pair of tokens
Merge: combine that pair into a new token
Repeat: until vocabulary reaches desired size

Here's an example:

Start:  ا ر د و   ز ب ا ن
Merge 1: 'ا ر' -> 'ار'    (most common pair)
Result: ار د و   ز ب ا ن
Merge 2: 'ز ب' -> 'زب'    (next most common)
Result: ار د و   زب ا ن
...and so on until the vocabulary reaches 32,000 tokens

This is the approach we'll use for our Urdu LLM. I trained a BPE tokenizer with a vocabulary size of 32K tokens on the cleaned Urdu corpus.
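To see the merge loop in action, here is a toy BPE trainer in plain Python. It's a simplified sketch for learning purposes, not what the `tokenizers` library does internally: it splits on whitespace, ignores special tokens, and stops after a fixed number of merges. The function names are my own.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Start from characters: each word is a tuple of single-character symbols
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges, the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still split into "low" plus individual characters.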

Special Tokens

Along with BPE, we also need to add some special tokens. These tokens give the model structural information it needs during training and inference.

Token	Purpose	Why It Is Needed
`<pad>`	Padding for equal-length sequences	Batching requires all sequences to be the same length. Shorter sequences are filled with `<pad>` tokens.
`<unk>`	Unknown word fallback	If the model encounters a token not in the vocabulary, it maps to `<unk>` instead of failing.
`<bos>`	Marks the start of a sequence	Tells the model where the input begins, leading to more stable generation.
`<eos>`	Marks the end of a sequence	Tells the model when to stop generating. Without it, output may run forever or stop randomly.
`<sep>`	Separates segments	In chat format, separates the system prompt, user message, and assistant response so the model knows which role is which.
`<|user|>`	User turn marker (for chat)
`<|assistant|>`	Assistant turn marker (for chat)
`<|system|>`	System prompt marker (for chat)

BPE Tokenizer Configuration

I set vocab size to 32K. What does that mean? It means the model will have 32K tokens in its vocabulary lookup table.

This is a good balance between language coverage and model size. If we increase vocab size, the embedding layer and output layer both grow, which means more parameters to train. For a learning project, 32K keeps things manageable.

MIN_FREQUENCY is set to 2, meaning a token must appear at least twice in the corpus to be included. This filters out one-off noise tokens that would waste vocabulary slots.

For reference: GPT-2 uses a vocabulary of 50K tokens, and LLaMA uses 32K. Our choice of 32K is in line with production models.
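A quick back-of-the-envelope calculation shows how vocabulary size drives parameter count. The hidden size of 512 below is an assumption chosen for illustration, not the tutorial's actual configuration. The `tied` flag models the common option of sharing weights between the embedding and output layers.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the token embedding plus the output projection (no biases)."""
    embedding = vocab_size * d_model
    output = 0 if tied else vocab_size * d_model
    return embedding + output


D_MODEL = 512  # assumed hidden size, for illustration only

for vocab_size in (16_000, 32_000, 50_000):
    total = embedding_params(vocab_size, D_MODEL)
    print(f"vocab={vocab_size:>6,}  params={total / 1e6:.1f}M")
```

For a small model, these two layers can account for a large share of all parameters, which is why the vocabulary size is worth choosing carefully.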

VOCAB_SIZE = 32_000  # Number of tokens in our vocabulary
MIN_FREQUENCY = 2    # Token must appear at least twice (filters noise)

# Special tokens - these have reserved IDs
SPECIAL_TOKENS = [
    "<pad>",    # ID 0: padding
    "<unk>",    # ID 1: unknown
    "<bos>",    # ID 2: beginning of sequence
    "<eos>",    # ID 3: end of sequence
    "<sep>",    # ID 4: separator (for chat format)
    "<|user|>",     # ID 5: user turn marker (for chat)
    "<|assistant|>", # ID 6: assistant turn marker (for chat)
    "<|system|>",    # ID 7: system prompt marker (for chat)
]

Building the Tokenizer

Next up is creating the tokenizer using the cleaned text file we created earlier. First, we'll import the required libraries and set up the file paths:

import os
from pathlib import Path
from tokenizers import (
    Tokenizer,
    models,
    trainers,
    pre_tokenizers,
    decoders,
    processors,
    normalizers,
)

PROJECT_ROOT = Path(".").resolve().parent
CLEANED_DIR = PROJECT_ROOT / "data" / "cleaned"
TOKENIZER_DIR = PROJECT_ROOT / "tokenizer" / "urdu_tokenizer"
TOKENIZER_DIR.mkdir(parents=True, exist_ok=True)

CORPUS_FILE = str(CLEANED_DIR / "urdu_corpus.txt")
print(f"Corpus file: {CORPUS_FILE}")
print(f"Tokenizer output: {TOKENIZER_DIR}")

# Verify corpus exists
assert os.path.exists(CORPUS_FILE), f"Corpus not found at {CORPUS_FILE}. Run notebook 01 first!"
file_size_mb = os.path.getsize(CORPUS_FILE) / 1024 / 1024
print(f"Corpus size: {file_size_mb:.1f} MB")

Now we'll configure the tokenizer components:

# ============================================================
# Build the tokenizer
# ============================================================

# Step 1: Create a BPE model (the core algorithm)
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

# Step 2: Add normalizer (text cleaning before tokenization)
# NFKC normalizes Unicode (e.g., different forms of the same Arabic letter)
tokenizer.normalizer = normalizers.NFKC()

# Step 3: Pre-tokenizer (how to split text before BPE)
# We use Metaspace which replaces spaces with ▁ and splits on them
# This preserves space information so we can reconstruct the original text
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# Step 4: Decoder (how to convert tokens back to text)
# Metaspace decoder converts ▁ back to spaces
tokenizer.decoder = decoders.Metaspace()

# Step 5: Configure the trainer
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    special_tokens=SPECIAL_TOKENS,
    show_progress=True,
    initial_alphabet=[]  # Learn alphabet from data
)

print("Tokenizer configured. Ready to train!")

Training the Tokenizer

Once the tokenizer is configured, the next step is to run it. This will take roughly 5 to 10 minutes depending on your device.

print("Training tokenizer... (this may take a few minutes)")
tokenizer.train([CORPUS_FILE], trainer)

print(f"\n Tokenizer trained!")
print(f"  Vocabulary size: {tokenizer.get_vocab_size():,}")

Configuring Post-Processing (Auto-Wrapping with BOS/EOS)

Next, we'll configure post-processing so the tokenizer automatically wraps every sequence with `<bos>` and `<eos>` tokens. This means we don't have to manually add them each time we encode text:

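bos_id = tokenizer.token_to_id("<bos>")
eos_id = tokenizer.token_to_id("<eos>")
sep_id = tokenizer.token_to_id("<sep>")

tokenizer.post_processor = processors.TemplateProcessing(
    # Single sequence: <bos> text <eos>
    single="<bos>:0 $A:0 <eos>:0",
    # Pair of sequences: <bos> first <sep> second <eos>
    pair="<bos>:0 $A:0 <sep>:0 $B:1 <eos>:1",
    special_tokens=[
        ("<bos>", bos_id),
        ("<eos>", eos_id),
        ("<sep>", sep_id),
    ],
)

print("Post-processor configured (auto-adds <bos> and <eos>)")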

Note: You might wonder why we need this step when we already defined `<bos>` and `<eos>` in SPECIAL_TOKENS. The SPECIAL_TOKENS list only reserves vocabulary slots for these tokens (assigns them IDs). Post-processing tells the tokenizer to automatically insert them into every encoded sequence.

Without this step, the tokens would exist in the vocabulary but never appear in your data unless you added them manually each time.
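As a rough picture of what the post-processor does to a single sequence, here is a plain-Python sketch. The function `wrap_with_bos_eos` is hypothetical (it is not part of the tokenizers library), the IDs 2 and 3 are taken from the reserved-ID comments in SPECIAL_TOKENS, and the body IDs are only illustrative.

```python
BOS_ID, EOS_ID = 2, 3  # assumption: reserved IDs from the SPECIAL_TOKENS comments

def wrap_with_bos_eos(token_ids):
    """Mimic the single-sequence template: start marker, tokens, end marker."""
    return [BOS_ID] + list(token_ids) + [EOS_ID]

body = [1418, 324, 431, 2965, 1430, 276]  # illustrative token IDs
wrapped = wrap_with_bos_eos(body)
print(wrapped)  # [2, 1418, 324, 431, 2965, 1430, 276, 3]
```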

Testing the Tokenizer

The final step in tokenization is to test it. The test encodes Urdu sentences into token IDs, then decodes those IDs back into text. If the decoded text matches the original input, the tokenizer is working correctly. This roundtrip test confirms that no information is lost during encoding and decoding:

test_sentences = [
    "اردو ایک بہت خوبصورت زبان ہے",           # "Urdu is a very beautiful language"
    "پاکستان کا دارالحکومت اسلام آباد ہے",      # "The capital of Pakistan is Islamabad"
    "آج موسم بہت اچھا ہے",                     # "The weather is very nice today"
    "مصنوعی ذہانت مستقبل کی ٹیکنالوجی ہے",     # "AI is the technology of the future"
    "السلام علیکم! آپ کیسے ہیں؟",               # "Peace be upon you! How are you?"
]

print("=" * 70)
print("TOKENIZER TEST RESULTS")
print("=" * 70)

for sentence in test_sentences:
    encoded = tokenizer.encode(sentence)
    decoded = tokenizer.decode(encoded.ids)
    
    print(f"\n Input:    {sentence}")
    print(f" Token IDs: {encoded.ids}")
    print(f"  Tokens:   {encoded.tokens}")
    print(f" Decoded:  {decoded}")
    print(f"   Num tokens: {len(encoded.ids)}")
    print(f"   Roundtrip OK: {sentence in decoded}")
    print("-" * 70)

Here is what the output looks like:

======================================================================
TOKENIZER TEST RESULTS
======================================================================

 Input:    اردو ایک بہت خوبصورت زبان ہے
 Token IDs: [2, 1418, 324, 431, 2965, 1430, 276, 3]
 Tokens:   ['<bos>', '▁اردو', '▁ایک', '▁بہت', '▁خوبصورت', '▁زبان', '▁ہے', '<eos>']
 Decoded:  اردو ایک بہت خوبصورت زبان ہے
   Num tokens: 8
   Roundtrip OK: True
----------------------------------------------------------------------

 Input:    پاکستان کا دارالحکومت اسلام آباد ہے
 Token IDs: [2, 474, 289, 3699, 616, 1004, 276, 3]
 Tokens:   ['<bos>', '▁پاکستان', '▁کا', '▁دارالحکومت', '▁اسلام', '▁آباد', '▁ہے', '<eos>']
 Decoded:  پاکستان کا دارالحکومت اسلام آباد ہے
   Num tokens: 8
   Roundtrip OK: True

Notice how `<bos>` and `<eos>` are automatically added (thanks to our post-processing step), common Urdu words like پاکستان stay as single tokens, and the ▁ prefix marks word boundaries from the Metaspace pre-tokenizer. Most importantly, every roundtrip succeeds, meaning decoded text matches the original input exactly.

Fertility Score

Fertility is the average number of tokens per word.

A fertility of 1 means each word maps to one token (ideal but unrealistic in modern subword tokenizers).
In modern LLMs, fertility is usually around 1.3–2.5 depending on the language.
Higher fertility means more token splitting, which increases cost and reduces efficiency, but it's also influenced by language complexity, not just tokenizer quality.
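Before running this on the real corpus, the arithmetic can be checked on a toy case. This is a minimal sketch: the `fertility` helper, the English sample text, and the token count are hypothetical, and it assumes two of the tokens are the automatically added start and end markers.

```python
def fertility(text, token_count, n_special=2):
    """Tokens per whitespace-separated word, ignoring special tokens."""
    words = text.split()
    return (token_count - n_special) / len(words)

# Hypothetical case: 4 words encoded as 8 tokens, 2 of which are start/end markers
score = fertility("one two three four", token_count=8)
print(f"{score:.2f}")  # 1.50
```

A score of 1.50 here means each word was split into one and a half tokens on average.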

# ============================================================
# Calculate fertility score on training corpus
# ============================================================
import json

jsonl_file = CLEANED_DIR / "urdu_corpus.jsonl"
corpus_words = 0
corpus_tokens = 0
sample_size = 10000  # Sample 10K documents for speed

print(f"Calculating fertility on {sample_size:,} documents from corpus...")

with open(jsonl_file, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= sample_size:
            break
        doc = json.loads(line)
        text = doc["text"]
        
        words = text.split()
        tokens = tokenizer.encode(text).tokens
        n_tokens = len(tokens) - 2  # Remove <bos> and <eos>
        
        corpus_words += len(words)
        corpus_tokens += n_tokens

corpus_fertility = corpus_tokens / corpus_words
print(f"\n📊 Fertility Score (corpus): {corpus_fertility:.2f} tokens/word")
print(f"   (Total: {corpus_words:,} words → {corpus_tokens:,} tokens)")
print(f"   Documents sampled: {min(i+1, sample_size):,}")

if corpus_fertility < 2.0:
    print("   ✅ Excellent! Tokenizer is well-optimized for Urdu.")
elif corpus_fertility < 3.0:
    print("   ⚠️ Good, but could be better. Consider larger vocab.")
else:
    print("   ❌ High fertility. The tokenizer needs improvement.")

The fertility score we get here is 1.04, which is quite good. But keep in mind that this number is artificially low because the tokenizer was trained on the same small corpus it's being evaluated on. With a larger or unseen dataset, fertility would likely be higher (closer to the 1.3-2.5 range typical for production tokenizers).

Saving the Tokenizer

The final step is to save the tokenizer in JSON format and verify that it loads correctly:

# ============================================================
# Save the tokenizer
# ============================================================

tokenizer_path = str(TOKENIZER_DIR / "urdu_bpe_tokenizer.json")
tokenizer.save(tokenizer_path)

print(f" Tokenizer saved to: {tokenizer_path}")
print(f"   File size: {os.path.getsize(tokenizer_path) / 1024:.0f} KB")

# Verify we can load it back
loaded_tokenizer = Tokenizer.from_file(tokenizer_path)
test = loaded_tokenizer.encode("اردو ایک خوبصورت زبان ہے")
print(f"\n   Verification: {test.tokens}")
print(f"    Tokenizer loads correctly!")

Once saved, we have a lookup table. Using this, along with our corpus of data, we can perform the next important step: pre-training.

3. Pre-Training

In this part, the model learns the language, grammar, patterns, and vocabulary. Once training is done, the model is able to predict the next word in a sequence, and this is where we start to see raw data turning into an LLM.

LLMs are, at their core, next-token predictors. Given a sequence of tokens, they assign a probability to every possible next token, and generation picks from the most probable ones.

With the help of training, the model learns:

The syntax of the language
Semantics, the contextual meaning
Frequently used expressions
Facts from the training dataset

For training, you have some options. As the model is small, you can train it on your local machine. It will be slower but will get the job done.

The other option is using Google Colab. This is the one I used – the free version was enough for the training I required, using a T4 GPU.

Steps to Do Pre-Training

Upload the dataset JSONL file and tokenizer to Google Drive.
Set the model configuration (vocab size, layers, heads, and so on).
Define the transformer architecture (attention, feed-forward, blocks).
Load and tokenize the corpus into training/validation splits.
Run the training loop with optimizer, LR schedule, and checkpointing.

Model Configuration

from dataclasses import dataclass

@dataclass
class UrduLLMConfig:
    # Vocabulary
    vocab_size: int = 32_000
    pad_token_id: int = 0
    bos_token_id: int = 2
    eos_token_id: int = 3

    # Model Architecture
    d_model: int = 384
    n_layers: int = 6
    n_heads: int = 6
    d_ff: int = 1536  # 4 * d_model
    dropout: float = 0.1
    max_seq_len: int = 256

    # Training
    batch_size: int = 32
    learning_rate: float = 3e-4
    weight_decay: float = 0.1
    max_epochs: int = 10
    warmup_steps: int = 500
    grad_clip: float = 1.0

Configuration parameters explained:

The vocabulary parameters (vocab_size, pad_token_id, bos_token_id, eos_token_id) simply match the tokenizer we built earlier. vocab_size is 32K (our BPE vocabulary), and the special token IDs (0, 2, 3) correspond to the positions we assigned during tokenizer training.

Model architecture parameters:

Variable	What it Means	Example	Impact of Value
`d_model`	Embedding/vector size per token	384	Higher: better understanding but slower & more memory. Lower: faster but less expressive
`n_layers`	Number of transformer layers	6	More layers: deeper understanding but higher latency. Fewer: faster but less powerful
`n_heads`	Attention heads per layer	6	More heads: better context capture. Too few: limited attention diversity
`d_ff`	Feedforward layer size	1536	Larger: more computation power. Smaller: faster but weaker transformations
`dropout`	% of neurons dropped during training	0.1	Higher: prevents overfitting but may underfit. Lower: better training fit but risk of overfitting
`max_seq_len`	Maximum tokens per input	256	Higher: more context but slower & costly. Lower: faster but limited context

Training hyperparameters:

Variable	What it Means	Example	Impact of Value
`batch_size`	Samples per training step	32	Larger: smoother gradient estimates and better GPU utilization, but needs more memory. Smaller: less memory, but noisier updates
`learning_rate`	Step size for updates	0.0003	Too high: unstable training. Too low: very slow learning
`weight_decay`	Regularization strength	0.1	Higher: reduces overfitting. Lower: risk of overfitting
`max_epochs`	Full dataset passes	10	More: better learning but risk of overfitting. Fewer: undertrained model
`warmup_steps`	Gradual LR increase steps	500	More: smoother start, safer training. Less: risk of early instability
`grad_clip`	Max gradient norm	1.0	Lower: stable but slower learning. Higher: risk of exploding gradients

Transformer Architecture

Next up is the main part of training: writing the transformer architecture. Before jumping into code, it's important to know what a transformer architecture is.

To learn in depth about what transformers are and how they differ from RNNs and CNNs, I would recommend going through this article: AWS: What is Transformers in Artificial Intelligence

But in short:

"Transformers are a type of neural network architecture that transforms or changes an input sequence into an output sequence."

The original Transformer paper introduced both an encoder (reads input) and a decoder (generates output). But GPT-style models like ours use only the decoder part. This is called a decoder-only architecture.

The decoder takes a sequence of tokens, applies self-attention to understand relationships between them, and predicts the next token.

Self-attention is what makes transformers powerful: instead of processing tokens one by one in order (like RNNs), the model looks at all previous tokens simultaneously and determines which ones are most relevant for the current prediction.

Here's the complete transformer code. A detailed breakdown of each component follows:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.d_model = config.d_model
        self.head_dim = config.d_model // config.n_heads

        self.qkv_proj = nn.Linear(config.d_model, 3 * config.d_model)
        self.out_proj = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape

        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * (self.head_dim ** -0.5)

        if mask is not None:
            attn = attn.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)

        out = attn @ v
        out = out.transpose(1, 2).reshape(B, T, C)
        out = self.out_proj(out)
        return out


class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.fc1 = nn.Linear(config.d_model, config.d_ff)
        self.fc2 = nn.Linear(config.d_ff, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = F.gelu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x


class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = MultiHeadSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ff = FeedForward(config)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x, mask=None):
        x = x + self.dropout(self.attn(self.ln1(x), mask))
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x


class UrduGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.token_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])

        self.ln_f = nn.LayerNorm(config.d_model)
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Weight tying
        self.head.weight = self.token_emb.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids, targets=None):
        B, T = input_ids.shape
        device = input_ids.device

        tok_emb = self.token_emb(input_ids)
        pos = torch.arange(0, T, dtype=torch.long, device=device)
        pos_emb = self.pos_emb(pos)

        x = self.dropout(tok_emb + pos_emb)

        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=device)).unsqueeze(0).unsqueeze(0)

        for block in self.blocks:
            x = block(x, mask)

        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return {'logits': logits, 'loss': loss}

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=100, temperature=0.8,
                 top_k=50, top_p=0.9, eos_token_id=None):
        """
        Generate text autoregressively.

        Sampling strategies:
        - temperature: Controls randomness (low = deterministic, high = creative)
        - top_k: Only consider the top K most likely tokens
        - top_p (nucleus): Only consider tokens whose cumulative probability <= p
        - eos_token_id: Stop generating when this token is produced
        """
        self.eval()
        eos_token_id = eos_token_id or getattr(self.config, 'eos_token_id', None)

        for _ in range(max_new_tokens):
            idx_cond = input_ids if input_ids.size(1) <= self.config.max_seq_len \
                       else input_ids[:, -self.config.max_seq_len:]

            outputs = self.forward(idx_cond)
            logits = outputs["logits"][:, -1, :] / temperature

            # Top-K filtering
            if top_k > 0:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            # Top-P (nucleus) filtering
            if top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
                sorted_indices_to_remove[:, 0] = 0
                indices_to_remove = sorted_indices_to_remove.scatter(
                    1, sorted_indices, sorted_indices_to_remove
                )
                logits[indices_to_remove] = float('-inf')

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            if eos_token_id is not None and next_token.item() == eos_token_id:
                break

        return input_ids

This code builds a text prediction machine. You give it some Urdu words, and it guesses the next word, over and over, until it forms a sentence. That's literally how ChatGPT works too, just much bigger.

Transformer Code Breakdown

1. MultiHeadSelfAttention: "The Lookback System"

Imagine reading a sentence. When you see the word "اس" (this), your brain looks back to figure out what "this" refers to. That's attention.

Q, K, V: Think of it like a library:

Query (Q): "I'm looking for information about X"
Key (K): Each previous word holds up a sign: "I have info about Y"
Value (V): The actual information that word carries

6 heads = 6 different "readers" looking at the sentence simultaneously. One might focus on grammar, another on meaning, another on nearby words, and so on.

Causal mask = A rule that says: "You can only look at words that came before you, not after." (Because when generating, future words don't exist yet!)

The math: Multiply Q×K to get "how relevant is each word?", then use those scores to grab the most useful info from V.
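To see the math on actual numbers, here is a small pure-Python sketch of single-head causal attention. It is meant as an illustration of the idea, not a copy of the PyTorch class above: the function names and the tiny 2-dimensional vectors are made up, and the scores are scaled by the square root of the vector size before the softmax.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors. Position t may only look at positions 0..t."""
    d = len(q[0])
    outputs, weights = [], []
    for t in range(len(q)):
        # Relevance score of each earlier position (the causal mask is the range limit)
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(w[s] * v[s][j] for s in range(t + 1))
                        for j in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(q, k, v)
for t, w in enumerate(weights):
    print(f"position {t} attends to {len(w)} position(s):", [round(x, 3) for x in w])
```

The first position can only attend to itself, so its weight is exactly 1.0, and every row of weights sums to 1.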

2. FeedForward: "The Thinking Step"

After attention figured out which words matter, this is where the model actually thinks about what they mean.

It's just two layers:

Expand (384 → 1536): Give the model more "brain space" to think
Shrink (1536 → 384): Compress the thought back down
GELU activation: A filter that decides "keep this thought" or "discard it" (smoothly, not harshly)
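The "smooth filter" behaviour of GELU is easy to see numerically. The sketch below uses the exact erf-based form of GELU in plain Python; the helper name `gelu` is just for this example.

```python
import math

def gelu(x):
    # x times the standard normal CDF of x
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"gelu({x:+.1f}) = {gelu(x):+.4f}")
```

Large positive inputs pass through almost unchanged, large negative inputs are pushed close to zero, and values near zero are scaled down gradually instead of being cut off sharply.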

3. TransformerBlock: "One Round of Reading"

One pass of reading a sentence and thinking about it.

Step 1: Look at other words (attention)
Step 2: Think about what you saw (feed-forward)
LayerNorm: Like resetting your brain between steps so numbers don't get too big or too small.
Residual connection (x + ...): The model keeps its original thought AND adds the new insight. It's like taking notes: you don't erase old notes, you add new ones.

The model does this 6 times (6 blocks). Each round understands the text a little deeper.
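Here is a simplified sketch of the two ingredients of a block, layer normalization and the residual connection, in plain Python. It leaves out the learnable scale and shift that a real LayerNorm has, and `halve` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def pre_norm_residual(x, sublayer):
    """Compute x + sublayer(LayerNorm(x)): keep the old notes, add the new ones."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def halve(vec):  # hypothetical stand-in for attention / feed-forward
    return [0.5 * vi for vi in vec]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
y = pre_norm_residual(x, halve)
print([round(n, 3) for n in normed])
print([round(v, 3) for v in y])
```

Because the sub-layer output is added to `x` instead of replacing it, the original signal always has a direct path through all 6 blocks.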

4. UrduGPT: "The Full Machine"

Setup (__init__):

Token embedding: A giant lookup table. Each of 32,000 Urdu words/subwords gets a list of 384 numbers that represent its "meaning."
Position embedding: Another lookup table that tells the model "this word is 1st, this is 2nd, this is 3rd..." (otherwise it wouldn't know word order).
6 Transformer blocks: The 6 rounds of reading described above.
LM head: At the end, converts the model's internal "thoughts" (384 numbers) back into a score for each of the 32,000 possible next words.
Weight tying: The input lookup table and output scoring table share the same weight matrix. This saves 32,000 × 384 ≈ 12.3M parameters, and it is the same trick GPT-2 uses.
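The pieces in this setup can be tallied by hand to see where the model's size comes from. This is a back-of-the-envelope sketch that assumes the layer shapes in the UrduGPT code above: a bias on every Linear layer except the tied LM head, and two values (scale and shift) per LayerNorm feature. The function name `count_parameters` is made up for this example.

```python
def count_parameters(vocab_size=32_000, d_model=384, n_layers=6,
                     d_ff=1536, max_seq_len=256):
    token_emb = vocab_size * d_model   # also used as the LM head (weight tying)
    pos_emb = max_seq_len * d_model
    layer_norm = 2 * d_model           # scale + shift
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attn + ff
    final_ln = layer_norm
    return token_emb + pos_emb + n_layers * block + final_ln

total = count_parameters()
print(f"{total:,}")  # 23,033,856
```

Roughly half of the parameters sit in the token embedding table, which is why tying it to the output layer matters so much for a model this small.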

Processing (forward):

Look up each word's meaning (embedding)
Add position info
Run through 6 rounds of attention + thinking
Score every possible next word
If we know the correct answer, calculate how wrong we were (loss)

Generating text (generate): A simple loop:

Feed in the words so far
Get scores for the next word
Temperature: Controls creativity. Low = safe/predictable, high = wild/creative
Top-K: Only consider the K best options (ignore the 31,950 unlikely words)
Top-P (nucleus): Dynamically select the smallest set of tokens whose cumulative probability reaches the threshold
Randomly pick one word from the remaining options
Add it to the sentence, go back to step 1
Stop when `<eos>` is generated or max_new_tokens is reached
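The filtering steps in this loop can be tried out on a tiny made-up distribution. The sketch below is a plain-Python approximation of temperature, Top-K, and Top-P applied in that order; the function `filter_probs` and the five example logits are hypothetical, and ties at the Top-K boundary are handled more simply than in the PyTorch version.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} for the tokens that survive filtering."""
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                 # keep the K highest-scoring tokens
    probs = softmax([scaled[i] for i in order])
    if top_p < 1.0:
        kept, cumulative = 0, 0.0
        for p in probs:
            kept += 1
            cumulative += p
            if cumulative > top_p:            # keep the token that crosses the threshold
                break
        order, probs = order[:kept], probs[:kept]
        total = sum(probs)
        probs = [p / total for p in probs]    # renormalize the survivors
    return dict(zip(order, probs))

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_probs(logits, temperature=1.0, top_k=3, top_p=0.9)
print({tok: round(p, 3) for tok, p in dist.items()})

# Randomly pick the next token from what is left
next_id = random.choices(list(dist), weights=list(dist.values()))[0]
print("sampled token id:", next_id)
```

With a much lower `top_p`, only the single most likely token survives, which makes generation effectively deterministic.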

Loading the Dataset and Training

First, we load the JSONL corpus and tokenize every document into one long sequence of token IDs. Then we split it 90/10 into training and validation sets, and wrap them in a PyTorch Dataset that creates fixed-length chunks for next-token prediction:

import json
from tokenizers import Tokenizer
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

# Load tokenizer
tokenizer = Tokenizer.from_file(TOKENIZER_PATH)
print(f"Tokenizer loaded. Vocab: {tokenizer.get_vocab_size():,}")

# Load and tokenize corpus
print("Loading corpus...")
all_token_ids = []
with open(DATA_PATH, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Tokenizing"):
        doc = json.loads(line)
        encoded = tokenizer.encode(doc["text"])
        all_token_ids.extend(encoded.ids)

all_token_ids = torch.tensor(all_token_ids, dtype=torch.long)
print(f"Total tokens: {len(all_token_ids):,}")

class UrduTextDataset(Dataset):
    def __init__(self, token_ids, seq_len):
        self.token_ids = token_ids
        self.seq_len = seq_len
        self.n_chunks = (len(token_ids) - 1) // seq_len

    def __len__(self):
        return self.n_chunks

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.token_ids[start:start + self.seq_len + 1]
        return chunk[:-1], chunk[1:]  # input, target (shifted by 1)

config = UrduLLMConfig()

# Split 90/10
split_idx = int(len(all_token_ids) * 0.9)
train_dataset = UrduTextDataset(all_token_ids[:split_idx], config.max_seq_len)
val_dataset = UrduTextDataset(all_token_ids[split_idx:], config.max_seq_len)

train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config.batch_size)

print(f"Train: {len(train_dataset):,} chunks")
print(f"Val: {len(val_dataset):,} chunks")

Each chunk is 256 tokens long. __getitem__ returns (input, target) where target is the input shifted by one position, which is exactly what next-token prediction needs.
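The shift-by-one is easier to see on a short stream. This sketch repeats the same indexing logic as UrduTextDataset in plain Python, using a chunk length of 4 and stand-in token IDs so the output fits on a few lines. The function name `make_chunks` is made up for this example.

```python
def make_chunks(token_ids, seq_len):
    """Split a token stream into (input, target) pairs shifted by one position."""
    n_chunks = (len(token_ids) - 1) // seq_len
    pairs = []
    for idx in range(n_chunks):
        start = idx * seq_len
        chunk = token_ids[start:start + seq_len + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

stream = list(range(10, 20))  # 10 stand-in token IDs: 10..19
pairs = make_chunks(stream, seq_len=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

At every position the target is simply the token that follows the input token, so one chunk gives the model `seq_len` prediction tasks at once.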

Training took me around 3 hours and completed 3 epochs. The configuration calls for 10 epochs, but after 3 I reached the free-tier limit of Google Colab. Since the purpose of training was learning, I used the checkpoint produced up to that point, which was saved to Drive.

Here's the complete training code:

# Model (moved to the GPU when one is available)
model = UrduGPT(config).to(device)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)

# LR Schedule
total_steps = len(train_loader) * config.max_epochs
def get_lr(step):
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps
    progress = (step - config.warmup_steps) / (total_steps - config.warmup_steps)
    return config.learning_rate * 0.5 * (1 + math.cos(math.pi * progress))

# Training
history = {'train_loss': [], 'val_loss': []}
global_step = 0
best_val_loss = float('inf')

for epoch in range(config.max_epochs):
    model.train()
    epoch_loss = 0
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}")

    for input_ids, targets in pbar:
        input_ids, targets = input_ids.to(device), targets.to(device)

        lr = get_lr(global_step)
        for g in optimizer.param_groups:
            g['lr'] = lr

        outputs = model(input_ids, targets)
        loss = outputs['loss']

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip)
        optimizer.step()

        epoch_loss += loss.item()
        global_step += 1
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for input_ids, targets in val_loader:
            input_ids, targets = input_ids.to(device), targets.to(device)
            val_loss += model(input_ids, targets)['loss'].item()
    val_loss /= len(val_loader)

    train_loss = epoch_loss / len(train_loader)
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)

    print(f"Epoch {epoch+1}: Train={train_loss:.4f}, Val={val_loss:.4f}")

    # Save best
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), f"{DRIVE_PATH}/best_model.pt")
        print(f"Best model saved!")

print(f"\nDone! Best val loss: {best_val_loss:.4f}")

Now let's break down what each part of the training code does.

Training Code Explained: Line by Line

1. Optimizer Setup

optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)

AdamW maintains two running statistics per parameter (23M × 2 = 46M extra values in memory):

First moment (momentum): Exponential moving average of gradients. Smooths out noisy updates so the optimizer doesn't zigzag.
Second moment: Exponential moving average of squared gradients. Gives each parameter its own adaptive learning rate (frequently updated params get smaller steps, rare ones get larger).
Weight decay (0.1): Each step, weights are multiplied by (1 - lr × 0.1), shrinking them slightly. This discourages any single weight from growing too large, which reduces overfitting. The "W" in AdamW means this decay is decoupled from the gradient update: it is applied directly to the weights instead of being added to the gradient as an L2 penalty, which is how weight decay is usually combined with plain Adam.
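The three ingredients above can be written out for a single scalar parameter. This is a simplified sketch of the AdamW update rule, assuming PyTorch's default betas (0.9, 0.999) and eps (1e-8); the function `adamw_step` and the example numbers are made up for illustration.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter at step t (t starts at 1)."""
    theta = theta * (1 - lr * weight_decay)    # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(f"{theta:.5f}")  # 0.99967
```

On the very first step the bias-corrected ratio is close to 1, so the parameter moves by about one learning rate (3e-4) plus the small shrink from weight decay (3e-5).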

2. Learning Rate Schedule

total_steps = len(train_loader) * config.max_epochs  # e.g., 500 batches × 10 epochs = 5000 steps

def get_lr(step):
    if step < config.warmup_steps:                                      # Phase 1: steps 0–499
        return config.learning_rate * step / config.warmup_steps        # Linear ramp: 0 → 3e-4
    progress = (step - config.warmup_steps) / (total_steps - config.warmup_steps)  # 0.0 → 1.0
    return config.learning_rate * 0.5 * (1 + math.cos(math.pi * progress))        # 3e-4 → ~0

Warmup (first 500 steps): At step 0, weights are random and gradients point in semi-random directions, so a large LR would cause destructive parameter updates. By linearly ramping from 0 to 3e-4, we let the loss landscape "stabilize" before making aggressive updates.
Cosine decay (remaining steps): The formula 0.5 × (1 + cos(π × progress)) traces a smooth S-curve from 1.0 to 0.0 as progress goes from 0 to 1. Multiplied by peak LR, this gives:
- Early: Large LR, so parameters change quickly and the loss drops rapidly
- Late: Tiny LR, so updates become small tweaks that refine the model without overshooting a minimum

LR:  0 ──ramp──▶ peak ──smooth curve──▶ ~0
     |  warmup  |     cosine decay      |
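To check the shape of the curve, the same schedule can be evaluated at a few steps. This sketch restates the formula as a standalone function and assumes the illustrative total of 5000 steps from the comment above; `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=5000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

for step in (0, 250, 500, 2750, 5000):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
# step    0: lr = 0.00e+00
# step  250: lr = 1.50e-04
# step  500: lr = 3.00e-04
# step 2750: lr = 1.50e-04
# step 5000: lr = 0.00e+00
```

Note that the schedule is tied to `total_steps`: if training stops early, the learning rate never gets to decay all the way down.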

3. Tracking Variables

history = {'train_loss': [], 'val_loss': []}   # For plotting curves later
global_step = 0                                 # Counts total batches across all epochs (for LR schedule)
best_val_loss = float('inf')                    # Tracks best validation; starts at infinity so any real loss beats it

4. Training Loop

Outer Loop: Epochs

for epoch in range(config.max_epochs):
    model.train()     # Enables dropout (randomly zeros 10% of activations for regularization)

Each epoch = one full pass through all training data. We repeat for max_epochs rounds.

Inner Loop: Batches

1. Move to GPU:

input_ids, targets = input_ids.to(device), targets.to(device)

Transfers tensor data from CPU RAM to GPU VRAM. The matrix multiplications in transformers (attention, FFN) are highly parallel, so they typically run many times faster on a GPU than on a CPU. The exact speedup depends on the hardware and the model size.

2. Manual LR Update:

lr = get_lr(global_step)
for g in optimizer.param_groups:
    g['lr'] = lr

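Here the learning rate is set by hand: we compute it with get_lr and write it into the optimizer before every step. PyTorch also ships schedulers such as torch.optim.lr_scheduler.LambdaLR that can wrap a function like this, and the manual override does the same job explicitly. param_groups is a list (here just one group), and each group can have its own LR/weight decay.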

3. Forward Pass:

outputs = model(input_ids, targets)
loss = outputs['loss']

Input tokens flow through: embeddings → 6 transformer blocks → LM head → logits. Cross-entropy loss is computed between the logits (shape [batch, seq_len, 32000]) and target token IDs. This loss measures the negative log-probability the model assigns to the correct next token, averaged over all positions and batch elements.
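For a single position, the loss is just the negative log-probability of the correct token. The sketch below computes it in plain Python; `cross_entropy` here is a hand-written helper for illustration, not the PyTorch function, and the uniform logits stand in for an untrained model that has no preference among the 32,000 tokens.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

uniform = [0.0] * 32_000  # every token equally likely
print(f"{cross_entropy(uniform, target=7):.4f}")  # 10.3735
```

That value is ln(32,000), so a loss of around 10.4 at the very start of training is a useful sanity check: it means the model begins at roughly "pure guessing" and everything below that is learned.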

4. Backward Pass + Update:

optimizer.zero_grad()          # Reset all parameter gradients to zero (they accumulate by default)
loss.backward()                # Backpropagation: compute ∂loss/∂θ for all 23M parameters via chain rule
torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip)  # If ||gradient||₂ > 1.0, scale it down
optimizer.step()               # θ_new = θ_old - lr × adam_adjusted_gradient - lr × weight_decay × θ_old

zero_grad(): PyTorch accumulates gradients by default (useful for gradient accumulation across micro-batches). We must manually clear them before each new backward pass.
loss.backward(): Backpropagation traverses the computation graph in reverse, computing ∂loss/∂θ for every parameter using the chain rule. This is the most compute-intensive step alongside the forward pass.
Gradient clipping: Computes the L2 norm across all parameter gradients concatenated into one vector. If the norm exceeds 1.0, every gradient is multiplied by 1.0/norm, preserving direction but capping magnitude. This prevents rare batches (unusual token distributions) from causing catastrophically large updates that destabilize training.
optimizer.step(): AdamW applies the update rule using momentum, adaptive per-parameter LR, and decoupled weight decay.
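To make the gradient clipping step above concrete, here is a small pure-Python sketch of clipping by global norm. It is an illustration of the idea, not PyTorch's implementation; the function name clip_by_global_norm and the toy gradients are made up for this example.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale per-parameter gradient lists so their joint L2 norm is at most max_norm."""
    # Treat all gradients as one long vector and take its L2 norm
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up; the small epsilon guards against division by zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]          # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, round(norm_after, 4))
```

Every entry is multiplied by the same factor, so the 3:4 ratio between the two non-zero gradients (the direction of the update) is preserved while the overall magnitude is capped.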

5. Bookkeeping:

epoch_loss += loss.item()      # .item() returns a plain Python float, so the tensor and its autograd graph can be freed
global_step += 1               # Increment for LR schedule
pbar.set_postfix({'loss': ...})  # Update the tqdm progress bar display

6. Validation:

model.eval()                   # Disables dropout so we use full model capacity for honest evaluation
val_loss = 0
with torch.no_grad():          # Disables gradient tracking, which saves memory and runs faster
    for input_ids, targets in val_loader:
        input_ids, targets = input_ids.to(device), targets.to(device)
        val_loss += model(input_ids, targets)['loss'].item()
val_loss /= len(val_loader)    # Average loss per batch

This tests on held-out data the model never trained on. Comparing train vs val loss reveals:

Pattern	Meaning
Both decreasing	Model is learning generalizable patterns
Train ↓, Val stalling/↑	Overfitting: memorizing, not learning
Both high and flat	Underfitting: model needs more capacity or data

model.eval() turns OFF dropout so we evaluate with the full model. torch.no_grad() skips gradient computation since we're just measuring, not learning.

7. Checkpointing:

if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), f"{DRIVE_PATH}/best_model.pt")

model.state_dict() returns an OrderedDict mapping parameter names onto tensors. torch.save serializes this to disk using Python's pickle + zip. We only save when val loss improves.

This is early stopping in spirit: we keep the checkpoint that generalizes best, regardless of what happens in later epochs.

Summary: One Batch in 6 Steps

Feed 32 Urdu sequences through the model → get predicted probabilities
Cross-entropy vs actual next tokens → scalar loss (how wrong?)
Backpropagate through 23M parameters → gradient per parameter (what to fix?)
Clip gradient norm to ≤ 1.0 → prevent instability
AdamW updates parameters with momentum + decay → the actual learning
Repeat ~5000 times, save the best checkpoint → done

Key Metrics

Cross-entropy loss measures how far the predicted probability distribution is from the true next token. A random model over 32K vocab gets loss ≈ ln(32000) ≈ 10.4

Perplexity = e^loss, interpretable as "the model is choosing between N equally likely tokens"

PPL 32,000 = random guessing
PPL 100 = narrowed to ~100 candidates
PPL 10 = quite confident predictions
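The relationship between loss and perplexity is a one-line conversion. The short sketch below (the helper name perplexity is just for illustration) checks the random-guessing baseline for a 32K vocabulary and shows which loss values correspond to the perplexities listed above.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is e raised to the cross-entropy loss (measured in nats)."""
    return math.exp(loss)


VOCAB_SIZE = 32_000
random_loss = math.log(VOCAB_SIZE)       # loss of a uniform guess over the vocabulary
print(f"{random_loss:.2f}")              # 10.37

# Going the other way: the loss you need to reach a given perplexity
for ppl in (100, 10):
    print(ppl, f"{math.log(ppl):.2f}")
```

So moving from random guessing to a perplexity of 100 means bringing the loss from roughly 10.4 down to about 4.6.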

Once training is complete and the model is saved to Drive, download it to your local system for the remaining steps.

Now we have a trained model, but a question arises: can we chat with it the way we do with AI tools like ChatGPT, Claude, or Copilot? The answer is no, not quite yet. Why?

Pre-training is done, but the model doesn't yet know how to structure its output as a conversation, as if it were answering user queries. Teaching it that is the step we call Supervised Fine-Tuning (SFT).

4. Supervised Fine-Tuning (SFT)

At a very high level, in SFT we teach the model how to respond to queries. It's like giving it examples from which it learns how to answer. The more examples you have, the better the responses will become. So essentially, supervised fine-tuning converts the model to a conversational agent.

To achieve this, we'll create a dataset of examples with the following key pairs and format:

{
  "conversations": [
    {"role": "system", "content": "آپ ایک مددگار اردو اسسٹنٹ ہیں۔"},
    {"role": "user", "content": "سوال..."},
    {"role": "assistant", "content": "جواب..."}
  ]
}

The dataset here has around 79 such examples, saved in JSONL format (one JSON object per line). In real cases, you would use many more examples. As I already mentioned, more examples lead to better results.
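As a rough sketch of the JSONL step, the snippet below writes and reads conversations with Python's standard json module. The helper names and the sft_data.jsonl filename are hypothetical; the article doesn't show how its file was produced.

```python
import json
import os
import tempfile

conversations = [
    {
        "conversations": [
            {"role": "system", "content": "آپ ایک مددگار اردو اسسٹنٹ ہیں۔"},
            {"role": "user", "content": "سوال..."},
            {"role": "assistant", "content": "جواب..."},
        ]
    }
]


def save_jsonl(records, path):
    # ensure_ascii=False keeps the Urdu text readable instead of \uXXXX escapes
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    # One JSON object per non-empty line
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "sft_data.jsonl")  # hypothetical filename
save_jsonl(conversations, path)
loaded = load_jsonl(path)
print(len(loaded), [turn["role"] for turn in loaded[0]["conversations"]])
```

Opening the file with encoding="utf-8" on both sides matters for Urdu text, since the platform default encoding isn't UTF-8 everywhere.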

Formatting Conversations for Training

The next step is formatting the saved conversations for training. This step converts the raw conversation JSON into token ID sequences with loss masking, so the model only learns to generate assistant responses.

Loss masking means we intentionally hide certain parts of the input from the training loss. In this case, we mask the system prompt and user message so the model isn't trained to memorize or reproduce them. The training signal comes only from the assistant's response, which is the useful part in teaching the model what to generate and when to stop.

Part 1: Disable Auto-Formatting & Get Special Token IDs

tokenizer.no_padding()

BOS_ID = tokenizer.token_to_id("")       # 2
EOS_ID = tokenizer.token_to_id("")       # 3
SEP_ID = tokenizer.token_to_id("")       # 4
PAD_ID = tokenizer.token_to_id("")       # 0
USER_ID = tokenizer.token_to_id("<|user|>")          # 5
ASSISTANT_ID = tokenizer.token_to_id("<|assistant|>") # 6
SYSTEM_ID = tokenizer.token_to_id("<|system|>")       # 7

IGNORE_INDEX = -100

no_padding(): Tells the tokenizer "don't add padding automatically, I'll handle it myself." We need full control over the token sequence.
We fetch the integer IDs for each special token so we can manually insert them at the right positions.
IGNORE_INDEX = -100: PyTorch's cross_entropy has a built-in feature: any label set to -100 is skipped in loss computation. This is how we implement loss masking.
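To see what skipping -100 labels means for the loss, here is a small pure-Python sketch of a masked cross-entropy. It mimics the idea behind ignore_index rather than reproducing PyTorch's code; the function name and the toy logits are invented for the example.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token IDs, with ignore_index at masked positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[label]
        count += 1
    # Returning 0.0 when everything is masked is a choice made for this sketch
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
labels = [IGNORE_INDEX, 1, 2]
loss = masked_cross_entropy(logits, labels)

# Changing the scores at the masked position leaves the loss untouched
other = masked_cross_entropy([[9.0, -3.0, 1.0]] + logits[1:], labels)
print(loss == other)
```

Because masked positions are excluded from both the sum and the count, the average is taken over supervised tokens only.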

Part 2: `format_conversation()`: The Core Function

This takes a conversation and produces two parallel arrays:

input_ids: [BOS, SYSTEM, آپ, ایک, مددگار, ..., SEP, USER, پاکستان, کا, ..., SEP, ASST, اسلام, آباد, ہے, EOS, PAD, PAD, ...]
labels:    [-100, -100, -100, -100, -100, ..., -100, -100, -100,    -100,..., -100, -100, اسلام, آباد, ہے, EOS, -100, -100, ...]

Step-by-step inside the function:

1. Start with BOS:

input_ids = [BOS_ID]
labels = [IGNORE_INDEX]    # Don't learn to predict BOS

2. For each turn, encode the content and strip auto-added BOS/EOS:

content_ids = tokenizer.encode(content).ids
if content_ids[0] == BOS_ID: content_ids = content_ids[1:]     # Remove if tokenizer auto-added
if content_ids[-1] == EOS_ID: content_ids = content_ids[:-1]

We strip these because we're manually placing special tokens at exact positions, so we don't want duplicates.

3. Build token sequence per role:

Role	Token sequence	Labels
system	`[SYSTEM_ID] + content + [SEP_ID]`	All -100 (masked)
user	`[USER_ID] + content + [SEP_ID]`	All -100 (masked)
assistant	`[ASST_ID] + content + [EOS_ID]`	`[-100] + content + [EOS_ID]`

The assistant's role token (<|assistant|>) itself is masked because we don't want the model to learn to predict that. But the actual response content and the EOS token do have labels, so the model learns:

What to say (the response content)
When to stop (predicting EOS)

4. Truncate and pad:

input_ids = input_ids[:max_len]          # Cut to 256 tokens max
labels = labels[:max_len]                # Keep labels aligned with input_ids
pad_len = max_len - len(input_ids)
input_ids = input_ids + [PAD_ID] * pad_len
labels = labels + [IGNORE_INDEX] * pad_len   # Don't learn from padding either

All sequences must be the same length for batched training. Padding labels are -100 so they're ignored in loss.

Here's the complete format_conversation() function:

def format_conversation(conversation: dict, max_len: int = 256) -> dict:
    """
    Convert a conversation dict into token IDs + labels for SFT.

    Format: <|system|>...<|user|>...<|assistant|>...
    Labels: -100 for system/user tokens (masked), actual IDs for assistant tokens.
    """
    input_ids = [BOS_ID]
    labels = [IGNORE_INDEX]

    for turn in conversation["conversations"]:
        role = turn["role"]
        content = turn["content"]

        content_ids = tokenizer.encode(content).ids
        if content_ids and content_ids[0] == BOS_ID:
            content_ids = content_ids[1:]
        if content_ids and content_ids[-1] == EOS_ID:
            content_ids = content_ids[:-1]

        if role == "system":
            role_ids = [SYSTEM_ID] + content_ids + [SEP_ID]
            role_labels = [IGNORE_INDEX] * len(role_ids)
        elif role == "user":
            role_ids = [USER_ID] + content_ids + [SEP_ID]
            role_labels = [IGNORE_INDEX] * len(role_ids)
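        elif role == "assistant":
            role_ids = [ASSISTANT_ID] + content_ids + [EOS_ID]
            role_labels = [IGNORE_INDEX] + content_ids + [EOS_ID]
        else:
            # Unknown role: fail loudly instead of reusing IDs from the previous turn
            raise ValueError(f"Unknown role: {role!r}")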

        input_ids.extend(role_ids)
        labels.extend(role_labels)

    # Truncate and pad to max_len
    input_ids = input_ids[:max_len]
    labels = labels[:max_len]
    pad_len = max_len - len(input_ids)
    input_ids = input_ids + [PAD_ID] * pad_len
    labels = labels + [IGNORE_INDEX] * pad_len

    return {"input_ids": input_ids, "labels": labels}

Part 3: Verification

n_loss_tokens = sum(1 for l in test_formatted['labels'] if l != IGNORE_INDEX)
print(f"  Tokens with loss: {n_loss_tokens} / 256")

This confirms that only a small fraction of tokens (the assistant's words + EOS) contribute to the loss. For a typical example, you might see something like Tokens with loss: 18 / 256, meaning only ~7% of the sequence drives gradient updates. The rest (system prompt, user questions, special tokens, padding) is masked with -100.

This makes SFT extremely efficient: 100% of the learning signal comes from predicting the assistant's actual response and knowing when to stop (EOS). That efficiency is especially critical when you only have 79 training examples.
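The same check can be turned into a small helper that reports the supervised fraction of any example. This is a sketch with a made-up function name and toy labels; one practical use is spotting examples where truncation to max_len cut off the whole assistant turn, which would leave nothing to learn from.

```python
IGNORE_INDEX = -100


def supervised_fraction(labels, ignore_index=IGNORE_INDEX):
    """Fraction of positions that actually contribute to the loss."""
    n_supervised = sum(1 for label in labels if label != ignore_index)
    return n_supervised / len(labels) if labels else 0.0


# Toy labels: 200 masked prompt positions, 18 supervised tokens, 38 padding positions
labels = [IGNORE_INDEX] * 200 + list(range(10, 28)) + [IGNORE_INDEX] * 38
print(f"{supervised_fraction(labels):.1%}")

# An example truncated before the assistant turn has no supervised tokens at all
truncated = [IGNORE_INDEX] * 256
print(supervised_fraction(truncated) == 0.0)
```

Examples with a fraction of zero add padding to a batch without adding any gradient signal, so they are reasonable candidates to shorten or drop.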

Formatting Summary

Component	Purpose
`no_padding()`	Take manual control of token placement
Special token IDs	Insert chat structure markers at exact positions
`IGNORE_INDEX = -100`	PyTorch's built-in mechanism to skip positions in loss
System/User labels → -100	Don't learn from these (context only)
Assistant labels → real IDs	Learn to generate responses + when to stop
Truncation to 256	Match model's context window
Padding with -100 labels	Batch alignment without polluting the loss

SFT Dataset & DataLoader

class SFTDataset(Dataset):
    def __init__(self, conversations: list, max_len: int = 256):
        self.examples = []
        for conv in conversations:
            formatted = format_conversation(conv, max_len)
            self.examples.append(formatted)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.examples[idx]['input_ids'], dtype=torch.long),
            torch.tensor(self.examples[idx]['labels'], dtype=torch.long),
        )

This wraps all 79 formatted conversations into a PyTorch Dataset. At init time, it pre-formats every conversation using format_conversation() and stores the results. When the DataLoader requests item idx, it returns (input_ids, labels) as tensors.

DataLoader:

sft_loader = DataLoader(sft_dataset, batch_size=4, shuffle=True)

batch_size=4: Small batch because we only have 79 examples. Larger batches would mean fewer gradient updates per epoch.
shuffle=True: Randomize order each epoch so the model doesn't memorize a fixed sequence of examples.
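A quick back-of-the-envelope calculation shows how many gradient updates these settings produce. It assumes the DataLoader keeps the final, smaller batch (PyTorch's default, drop_last=False) and uses the 50 epochs from the training loop shown later in this section.

```python
import math

n_examples = 79
batch_size = 4
epochs = 50

batches_per_epoch = math.ceil(n_examples / batch_size)        # the last batch is smaller
last_batch_size = n_examples - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)      # 20 3 1000
```

With a batch size of 16, the same data would give only 5 updates per epoch, which is the trade-off the batch_size note above refers to.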

Loading the Pre-trained Model

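model = UrduGPT(config).to(device)
checkpoint = torch.load("best_model.pt", map_location=device)
# Pre-training saved a bare state_dict; also accept a wrapped checkpoint dict
state_dict = checkpoint.get('model_state_dict', checkpoint)

# Name mapping (Colab → local)
name_mapping = {
    'token_emb.weight': 'token_embedding.weight',
    'pos_emb.weight': 'position_embedding.weight',
    'ln_f.weight': 'ln_final.weight',
    'ln_f.bias': 'ln_final.bias',
    'head.weight': 'lm_head.weight',
}

# Rename Colab-style keys, then load; strict=False tolerates leftover mismatches
state_dict = {name_mapping.get(k, k): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)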

This creates a fresh UrduGPT model and loads the pre-trained weights from Phase 3.

You might be wondering: why the name mapping? The model was trained on Google Colab with slightly different variable names (for example, token_emb vs token_embedding). The mapping translates Colab's naming convention to the local code's convention. strict=False in load_state_dict allows loading even if some keys don't match exactly. Keep in mind that it also silently skips the keys that don't match, so it's worth inspecting the missing_keys and unexpected_keys that load_state_dict returns.
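The renaming itself is plain dictionary work, which the following sketch illustrates without needing PyTorch. The helper names are hypothetical, the values are placeholder strings standing in for tensors, and the block key in the toy data is invented for the example.

```python
def rename_keys(state_dict, name_mapping):
    """Return a new dict with keys renamed; keys not in the mapping are kept as-is."""
    return {name_mapping.get(key, key): value for key, value in state_dict.items()}


def compare_keys(loaded_keys, expected_keys):
    """Report keys the model expects but didn't get, and keys it got but doesn't know."""
    missing = sorted(set(expected_keys) - set(loaded_keys))
    unexpected = sorted(set(loaded_keys) - set(expected_keys))
    return missing, unexpected


name_mapping = {
    'token_emb.weight': 'token_embedding.weight',
    'head.weight': 'lm_head.weight',
}

# Placeholder values; in practice these would be tensors from the checkpoint
colab_state = {
    'token_emb.weight': 'W_tok',
    'head.weight': 'W_head',
    'blocks.0.attn.weight': 'W_attn',
}
expected = ['token_embedding.weight', 'lm_head.weight', 'blocks.0.attn.weight', 'ln_final.weight']

renamed = rename_keys(colab_state, name_mapping)
missing, unexpected = compare_keys(renamed.keys(), expected)
print(missing, unexpected)
```

Here the comparison would flag ln_final.weight as missing, which is exactly the kind of mismatch that strict=False would otherwise let through unnoticed.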

Also, why start from pre-trained? Well, SFT builds on top of pre-training. The model already knows Urdu grammar, vocabulary, and facts. SFT just teaches it the conversation format. Starting from random weights would require far more data and training.

SFT Training Loop

Here's the complete SFT training loop:

SFT_LR = 2e-5
SFT_EPOCHS = 50
optimizer = torch.optim.AdamW(model.parameters(), lr=SFT_LR, weight_decay=0.01)

sft_history = {'loss': []}
best_loss = float('inf')

for epoch in range(SFT_EPOCHS):
    model.train()
    epoch_loss = 0
    n_batches = 0

    for input_ids, labels in sft_loader:
        input_ids = input_ids.to(device)
        labels = labels.to(device)

        outputs = model(input_ids)
        logits = outputs['logits']

        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()

        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=IGNORE_INDEX,
        )

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        epoch_loss += loss.item()
        n_batches += 1

    avg_loss = epoch_loss / n_batches
    sft_history['loss'].append(avg_loss)

    if avg_loss < best_loss:
        best_loss = avg_loss
        torch.save({
            'model_state_dict': model.state_dict(),
            'config': config.__dict__,
            'epoch': epoch + 1,
            'loss': avg_loss,
        }, "sft_model.pt")

    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(f"Epoch {epoch+1}/{SFT_EPOCHS} | Loss: {avg_loss:.4f}")

print(f"SFT complete! Best loss: {best_loss:.4f}")

Why these hyperparameters differ from pre-training:

Parameter	Pre-training	SFT	Why different
Learning rate	3e-4	2e-5	Lower LR prevents catastrophic forgetting. Large updates would erase the Urdu knowledge learned during pre-training
Epochs	3	50	Only 79 examples vs millions of tokens. The model needs many passes to learn the conversation pattern
Weight decay	0.1	0.01	Less regularization needed since we want the model to fit these specific examples closely
LR schedule	Cosine warmup	Constant	Simple and effective for small-data fine-tuning

Here's the training step (per batch):

# Forward pass with no targets; we compute loss manually
outputs = model(input_ids)
logits = outputs['logits']

# Shift for next-token prediction
shift_logits = logits[:, :-1, :].contiguous()    # Predictions at positions 0..254
shift_labels = labels[:, 1:].contiguous()         # Targets at positions 1..255

# Loss with masking
loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
    ignore_index=IGNORE_INDEX,  # Skip -100 positions
)

There's a key difference from pre-training: in pre-training, we passed targets directly to model(input_ids, targets) which computed loss internally on ALL tokens. Here we compute loss manually so we can use ignore_index=-100 to mask non-assistant positions.

The shift: logits[:, :-1] and labels[:, 1:] implement next-token prediction. The model's prediction at position i is compared against the actual token at position i+1.
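The shift is easier to see on a tiny example with plain lists. In this sketch the special-token IDs follow the values noted earlier (BOS 2, EOS 3, SEP 4, user 5, assistant 6), while 40 and 41 are made-up IDs standing in for one question token and one answer token.

```python
IGNORE_INDEX = -100

# BOS, <|user|>, question, SEP, <|assistant|>, answer, EOS
input_ids = [2, 5, 40, 4, 6, 41, 3]
labels = [IGNORE_INDEX] * 5 + [41, 3]

# The output at position i is scored against the label at position i + 1
context_positions = range(len(input_ids) - 1)   # plays the role of logits[:, :-1]
shift_labels = labels[1:]                       # plays the role of labels[:, 1:]

pairs = [
    (input_ids[i], target)
    for i, target in zip(context_positions, shift_labels)
    if target != IGNORE_INDEX
]
print(pairs)  # [(6, 41), (41, 3)]
```

Each pair reads as "after seeing this token, predict that one". The first supervised prediction happens at the <|assistant|> position, and the last one teaches the model to emit EOS after the answer.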

Backward pass + update:

optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

This is the same as pre-training: clear gradients → backprop → clip to prevent instability → update parameters. Gradient clipping at 1.0 is especially important here since the model is being fine-tuned and some gradients can be large on small data.

Checkpointing:

if avg_loss < best_loss:
    torch.save({'model_state_dict': model.state_dict(), ...}, "sft_model.pt")

Save whenever training loss improves. Unlike pre-training, we don't have a separate validation set (79 examples is too few to split), so we checkpoint on training loss.

Chat Function: Inference

Here's the complete chat function:

def chat(model, tokenizer, user_message: str, system_prompt: str = None,
         max_tokens: int = 100, temperature: float = 0.7) -> str:
    """Generate a chat response."""
    model.eval()

    if system_prompt is None:
        system_prompt = SYSTEM_PROMPT

    # Build the prompt
    prompt_ids = [BOS_ID, SYSTEM_ID]

    sys_ids = tokenizer.encode(system_prompt).ids
    if sys_ids and sys_ids[0] == BOS_ID: sys_ids = sys_ids[1:]
    if sys_ids and sys_ids[-1] == EOS_ID: sys_ids = sys_ids[:-1]
    prompt_ids.extend(sys_ids)
    prompt_ids.append(SEP_ID)

    prompt_ids.append(USER_ID)
    user_ids = tokenizer.encode(user_message).ids
    if user_ids and user_ids[0] == BOS_ID: user_ids = user_ids[1:]
    if user_ids and user_ids[-1] == EOS_ID: user_ids = user_ids[:-1]
    prompt_ids.extend(user_ids)
    prompt_ids.append(SEP_ID)

    prompt_ids.append(ASSISTANT_ID)

    # Generate
    input_tensor = torch.tensor([prompt_ids], dtype=torch.long).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            input_tensor,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_k=50,
            top_p=0.9,
            eos_token_id=EOS_ID,
        )

    # Decode only the generated part
    generated_ids = output_ids[0][len(prompt_ids):].tolist()
    if EOS_ID in generated_ids:
        generated_ids = generated_ids[:generated_ids.index(EOS_ID)]

    return tokenizer.decode(generated_ids)

And here's a step-by-step breakdown:

1. Build the prompt:

prompt_ids = [BOS_ID, SYSTEM_ID]
prompt_ids.extend(sys_ids)          # System prompt content
prompt_ids.append(SEP_ID)
prompt_ids.append(USER_ID)
prompt_ids.extend(user_ids)          # User message content
prompt_ids.append(SEP_ID)
prompt_ids.append(ASSISTANT_ID)      # "Now respond..."

This constructs exactly the same format the model saw during SFT training:

<|system|>آپ ایک مددگار...<|user|>پاکستان کا دارالحکومت؟<|assistant|>

The model sees <|assistant|> and knows "I should generate a response now" because during SFT, it learned that tokens after <|assistant|> are what it should produce.

2. Generate autoregressively:

with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_k=50,
        top_p=0.9,
        eos_token_id=EOS_ID,
    )

torch.no_grad(): No gradients are needed for inference, which saves memory and time
temperature=0.7: Slightly sharpened distribution for coherent but not robotic output
top_k=50: Only sample from top 50 tokens to avoid low-probability noise
top_p=0.9: Nucleus sampling that dynamically selects the smallest set of tokens whose cumulative probability ≥ 0.9
eos_token_id: Stop generating when EOS is produced
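The model's generate() method isn't shown here, so the following is only a pure-Python sketch of how temperature, top-k, and top-p can be combined for a single sampling step. The function name, the order of the filters, and the toy logits are assumptions; the actual implementation may differ in its details.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Sample one token ID from raw logits using temperature, top-k, then top-p."""
    # 1. Temperature: dividing by T < 1 sharpens the distribution
    scaled = [x / temperature for x in logits]
    max_s = max(scaled)
    exps = [math.exp(s - max_s) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep only the k most likely token IDs
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 4. Sample among the survivors (random.choices renormalizes the weights)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


logits = [5.0, 4.0, 1.0, 0.0, -2.0]
token, kept = sample_next_token(logits, rng=random.Random(0))
print(kept)  # [0, 1]
```

In this toy case the two most likely tokens already cover more than 90% of the probability mass, so nucleus sampling discards the other three before the draw.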

3. Extract and decode:

generated_ids = output_ids[0][len(prompt_ids):].tolist()    # Only the new tokens
if EOS_ID in generated_ids:
    generated_ids = generated_ids[:generated_ids.index(EOS_ID)]  # Trim at EOS
return tokenizer.decode(generated_ids)

We slice off the prompt (we don't want to return the system prompt and user message back), trim at EOS, and decode token IDs back to Urdu text.

5. Deployment

At this point, you have your own LLM. That's a great milestone. But there's still the classic problem: "it works on my machine."

To make the model public so others can use it too, we need to deploy it and provide an interface for users to interact with.

While exploring deployment options, I came across Gradio, which provides a simple, clean interface for deploying machine learning models and applications. Gradio integrates directly with Hugging Face Spaces, giving us free hosting with minimal setup.

Gradio Web Interface (`app.py`)

The app.py file ties everything together: it loads the tokenizer and model, defines the chat() function, and launches a Gradio UI. The model loading and chat() logic are identical to what we covered in the SFT section, so here we only show the Gradio-specific part:

import gradio as gr

def respond(message, history):
    if not message.strip():
        return "براہ کرم کچھ لکھیں۔"
    return chat(model, tokenizer, message)

demo = gr.ChatInterface(
    fn=respond,
    title="🇵🇰 اردو LLM چیٹ بوٹ",
    description="""
    ### ایک چھوٹا اردو زبان ماڈل جو شروع سے تیار کیا گیا ہے
    **A small Urdu language model built from scratch (~23M parameters)**
    """,
    examples=[
        "السلام علیکم",
        "پاکستان کا دارالحکومت کیا ہے؟",
        "لاہور کے بارے میں بتائیں۔",
        "بریانی کیسے بنتی ہے؟",
        "کرکٹ کیسے کھیلی جاتی ہے؟",
        "چاند کیسے چمکتا ہے؟",
        "رمضان کیا ہے؟",
        "علامہ اقبال کون تھے؟",
        "خوش کیسے رہیں؟",
        "آپ کون ہیں؟",
    ],
    theme=gr.themes.Soft(),
)

if __name__ == "__main__":
    demo.launch()

respond() wraps chat() with an empty-input guard, matching the signature Gradio's ChatInterface expects.
gr.ChatInterface provides a ready-made chat UI with message history, input box, and send button.
examples are pre-filled messages users can click to try.
theme=gr.themes.Soft() gives a clean, modern visual theme.

Note: Hugging Face Spaces runs app.py as a standalone script, so the full app.py in the repository inlines everything into one file: the model config, the complete transformer architecture, model loading with gc.collect() for memory optimization, the chat() function, and the Gradio interface above.

We won't repeat all of that here since it was already covered in the Pre-Training and SFT sections.

Running locally:

python app.py
# Opens at http://127.0.0.1:7860

Deployment Options

Option A: Hugging Face Spaces (Free, Recommended)

Hugging Face Spaces provides free CPU hosting for Gradio apps.

What to upload:

urdu-llm-chat/
├── app.py                          # Gradio web interface
├── requirements.txt                # torch, tokenizers, gradio
├── README.md                       # Space metadata (sdk: gradio)
├── model/
│   ├── __init__.py
│   ├── config.py
│   ├── transformer.py
│   └── checkpoints/sft_model.pt    # ~90MB trained model weights
└── tokenizer/
    └── urdu_tokenizer/
        └── urdu_bpe_tokenizer.json

How it works:

Create a free account on huggingface.co
Create a new Space (SDK: Gradio, Hardware: CPU Basic)
Push files via git: git clone https://huggingface.co/spaces/USERNAME/urdu-llm-chat
Copy project files into the cloned repo and push
Hugging Face automatically installs dependencies and runs app.py
Your model is live at https://huggingface.co/spaces/USERNAME/urdu-llm-chat

Why CPU is fine: Our model is only 23M parameters (~90MB). Inference takes <1 second on CPU. No GPU needed for serving.
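The ~90MB figure follows directly from the parameter count. The sketch below assumes the weights are stored as 32-bit floats (4 bytes each), which the article doesn't state explicitly but which matches the quoted size.

```python
n_params = 23_000_000
bytes_per_param = 4                      # assumption: float32 weights

size_mb = n_params * bytes_per_param / 1_000_000
print(size_mb)                           # 92.0

# For comparison, 16-bit weights would halve the footprint
print(n_params * 2 / 1_000_000)          # 46.0
```

A checkpoint of this size fits comfortably in the memory of a basic CPU instance, which is why no GPU is needed for serving.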

Option B: Running Locally

cd your-project-directory
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python app.py

Opens at http://127.0.0.1:7860. Works on any machine with Python 3.9+.

Option C: Terminal Chat (No UI)

A lightweight alternative with no Gradio dependency, just terminal input/output. Loads the model and enters an interactive loop:

"""
Standalone Chat Inference Script for Urdu LLM

Usage:
    python inference/chat.py
"""

import sys
import torch
from pathlib import Path
from tokenizers import Tokenizer

# Add project root to path
PROJECT_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from model.config import UrduLLMConfig
from model.transformer import UrduGPT


def load_model(checkpoint_path: str, device: str = None):
    """Load the fine-tuned model."""
    if device is None:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
        else:
            device = "cpu"

    device = torch.device(device)

    config = UrduLLMConfig()
    model = UrduGPT(config).to(device)

    checkpoint = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()

    return model, config, device


def chat_response(model, tokenizer, config, device, user_message,
                  system_prompt="آپ ایک مددگار اردو اسسٹنٹ ہیں۔",
                  max_tokens=100, temperature=0.7):
    """Generate a chat response."""
    BOS_ID = tokenizer.token_to_id("")
    EOS_ID = tokenizer.token_to_id("")
    SEP_ID = tokenizer.token_to_id("")
    USER_ID = tokenizer.token_to_id("<|user|>")
    ASSISTANT_ID = tokenizer.token_to_id("<|assistant|>")
    SYSTEM_ID = tokenizer.token_to_id("<|system|>")

    # Build prompt
    prompt_ids = [BOS_ID, SYSTEM_ID]

    sys_ids = tokenizer.encode(system_prompt).ids
    if sys_ids and sys_ids[0] == BOS_ID: sys_ids = sys_ids[1:]
    if sys_ids and sys_ids[-1] == EOS_ID: sys_ids = sys_ids[:-1]
    prompt_ids.extend(sys_ids)
    prompt_ids.append(SEP_ID)

    prompt_ids.append(USER_ID)
    user_ids = tokenizer.encode(user_message).ids
    if user_ids and user_ids[0] == BOS_ID: user_ids = user_ids[1:]
    if user_ids and user_ids[-1] == EOS_ID: user_ids = user_ids[:-1]
    prompt_ids.extend(user_ids)
    prompt_ids.append(SEP_ID)

    prompt_ids.append(ASSISTANT_ID)

    # Generate (no gradients are needed at inference time)
    input_tensor = torch.tensor([prompt_ids], dtype=torch.long).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            input_tensor,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_k=50,
            top_p=0.9,
            eos_token_id=EOS_ID,
        )

    generated_ids = output_ids[0][len(prompt_ids):].tolist()
    if EOS_ID in generated_ids:
        generated_ids = generated_ids[:generated_ids.index(EOS_ID)]

    return tokenizer.decode(generated_ids)


def main():
    print("=" * 60)
    print("🇵🇰  اردو LLM چیٹ بوٹ  🇵🇰")
    print("    Urdu LLM ChatBot")
    print("=" * 60)

    # Load model
    tokenizer_path = PROJECT_ROOT / "tokenizer" / "urdu_tokenizer" / "urdu_bpe_tokenizer.json"

    # Try SFT model first, fall back to pre-trained
    sft_path = PROJECT_ROOT / "model" / "checkpoints" / "sft_model.pt"
    pretrained_path = PROJECT_ROOT / "model" / "checkpoints" / "best_model.pt"

    if sft_path.exists():
        checkpoint_path = sft_path
        print("Loading SFT (conversational) model...")
    elif pretrained_path.exists():
        checkpoint_path = pretrained_path
        print("Loading pre-trained model (not fine-tuned for chat)...")
    else:
        print("❌ No model checkpoint found!")
        print("   Run notebooks 03 and 04 first to train the model.")
        sys.exit(1)

    model, config, device = load_model(str(checkpoint_path))
    tokenizer = Tokenizer.from_file(str(tokenizer_path))

    print(f"Model loaded on {device}")
    print("\nType your message in Urdu. Type 'quit' to exit.\n")
    print("-" * 60)

    while True:
        try:
            user_input = input("\n👤 آپ: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nخدا حافظ! 👋")
            break

        if user_input.lower() in ['quit', 'exit', 'q']:
            print("خدا حافظ! 👋")
            break

        if not user_input:
            continue

        response = chat_response(model, tokenizer, config, device, user_input)
        print(f"🤖 بوٹ: {response}")


if __name__ == "__main__":
    main()

Run it with:

python inference/chat.py

👤 آپ: السلام علیکم
🤖 بوٹ: وعلیکم السلام! میں آپ کی کیا مدد کر سکتا ہوں؟

Full Pipeline Summary

Phase	Description
Phase 1	Raw Urdu Text → Clean corpus
Phase 2	Corpus → BPE Tokenizer (32K vocab)
Phase 3	Tokenized chunks → Pre-train GPT (23M params, next-token prediction)
Phase 4	Conversations → SFT with loss masking → Chat model
Deploy	`app.py` + model weights → Hugging Face Spaces / Local

You can check out the deployed model here: https://huggingface.co/spaces/Wisamul/urdu_llm

Results

The results were pretty interesting. Because the training sample was small, the chances of hallucination were high, and if you understand Urdu, you'll appreciate how entertaining those hallucinations can be. But overall, the model worked fine.

Here are a few sample interactions with the deployed model:

👤 User: السلام علیکم
🤖 Assistant: وعلیکم السلام! میں آپ کی کیا مدد کر سکتا ہوں؟

👤 User: پاکستان کا دارالحکومت کیا ہے؟
🤖 Assistant: پاکستان کا دارالحکومت اسلام آباد ہے۔ یہ شہر 1960 کی دہائی میں تعمیر کیا گیا تھا۔

👤 User: تعلیم کی اہمیت کیا ہے؟
🤖 Assistant: تعلیم انسانی ترقی کی بنیاد ہے۔ یہ نہ صرف علم حاصل کرنے کا ذریعہ ہے بلکہ معاشرے کی تعمیر میں اہم کردار ادا کرتی ہے۔

For questions close to the training data, the model responds accurately and fluently. For out-of-distribution queries, it tends to hallucinate or repeat fragments. This is expected given the small SFT dataset (79 examples) and model size (23M parameters).

Conclusion

The goal of all this was to discuss and explain all steps involved in the process of building an LLM. I hope it gives you clarity as to why everyone isn't making their own LLM: factors like cost, training, data collection, and so on. At the same time, going through this process has hopefully helped you appreciate the work that has been done before and what we've achieved here.

We went from raw Urdu text all the way to a deployed chatbot: data cleaning, BPE tokenization, pre-training a GPT-style transformer, supervised fine-tuning with loss masking, and finally a Gradio web interface.

The model is tiny and the dataset is small, but every concept here (attention, next-token prediction, SFT, chat formatting) is exactly what powers production LLMs like GPT-4 and Llama – just at a much larger scale.

If you want to improve on this, the highest-impact next steps would be:

more SFT data (thousands of examples instead of 79),
a larger model (100M+ parameters), and
RLHF/DPO alignment.

But even at this scale, you now have a concrete understanding of the full LLM pipeline.

The New Definition of Software Engineering in the Age of AI

Tapas Adhikary — Tue, 21 Apr 2026 15:57:48 +0000

If you're a software developer today, it's almost impossible to avoid the noise around AI (Artificial Intelligence) and its impact on the industry. You open X or LinkedIn in the morning, and most of the posts you see are terrifying ones about tech layoffs.

You scroll a little more, and someone is claiming that a new AI tool released last week has already made entry-level developers obsolete. You go to YouTube, and a thumbnail screams that all technologies are dead, all developer jobs are dead, and at the same time, a solo founder claims that they've built a million-dollar full-stack app in five minutes using AI agents.

At some point, you start feeling overwhelmed. You start to question and doubt the nights you've spent learning something, building something. You wonder whether the effort you're putting into mastering a programming language or framework still makes sense. You start asking yourself an extremely uncomfortable question: "Is my career still safe?"

This concern is valid. Instead of dismissing the concern with a lot of motivational talk or toxic positivity, let's do a reality check. The industry is fundamentally changing. Hiring patterns are shifting. Expectations for both junior and senior developers are rising exponentially. And yes, AI is the main catalyst accelerating all these changes.

But there is a massive misunderstanding around what's going on. The narrative that "AI is replacing developers" lacks a lot of details. It has created unnecessary fear because it fails to specify what's actually happening.

Not many devs are stepping up to explain these details, because a good portion of us are still observing, and some are steering the fear to their own benefit.

Well, here's my take: AI isn't replacing all software engineers. It's replacing a specific kind of work. The low-level, average, routine execution work is getting replaced with AI much faster than anyone could imagine. As a result, it's forcing us to think of what it means to be a software engineer in today's market.

This article is about that thought process. It's a deep dive into the changing landscape of software development, the shift from effort-based to impact-based engineering, and a practical, actionable roadmap to enable you to remain relevant in the era of AI-assisted coding.

The End of the Tutorial-Driven Era
Let's Decode the "AI is Taking Jobs" Myth
Applying a Clean Architecture
A Practical, AI-Era Engineering Roadmap
The Must-Needed Mindset Shift
If You've Read This Far...

The End of the Tutorial-Driven Era

Let's step back for a moment and look at how most of us learned to develop software over the last decade or so.

Between 2010 and 2023, the industry was filled with tutorial-driven developers. We learned to build software by following step-by-step instructions.

Projects like TODO apps, weather dashboards, and clones of YouTube or Spotify were popular among developers. These projects gave us confidence. They helped us memorize syntax, learn how to use libraries, and figure out how to write a basic frontend and backend.

For a long time, this was enough. The goal was simple: "Can I build this full-stack application that works?"

If you could write code, connect to a few APIs, and build a working interface, companies were willing to hire you. They viewed junior developers as an investment. The expectation was that you should be trainable: you would come in, write standard boilerplate code, and learn the complexities of the system architecture on the job. The industry had the budget and patience for that learning curve.

But while we were memorizing syntax and completing Udemy courses, the tooling was quietly evolving. Today, AI has taken that to a different extreme.

A significant portion of what we used to learn manually can now be generated, assisted, and suggested by AI in seconds.

Need a basic Express server setup with rate limiting and CORS integrated? Can be generated.
Need a responsive navigation bar written in React? Can be assisted.
Need a standard SQL query to fetch company data? Can be suggested.

If a machine can do something exponentially faster, cheaper, and reasonably well, that specific task stops being the differentiator in the job market. So, when people say AI is replacing junior developers, what they mean is that AI has automated the execution of these surface-level tasks.

But does it mean developers are no longer needed? No, it means the value of our work has moved up the stack. A TODO app, a weather dashboard, or a website clone is no longer a portfolio item. These are just your warm-up exercises.

Let's Decode the "AI is Taking Jobs" Myth

Traditionally, software engineers were given requirements: they wrote code, and they ensured it worked. The value of a software engineer was tied to their work execution. Even in interviews, the emphasis was on effort and memory:

Can you write a linked list from scratch?
Can you check if this text is a palindrome?
Can you find the duplicates in this array of numbers?

If you were a developer who put in long hours analyzing problem statements, manually debugging critical issues, and hand-crafting thousands of lines of source code, you were seen as a dedicated, highly valued employee.

Today, effort alone is no longer a metric for success.

If you spend hours writing regular expressions or standard authentication flows that an AI agent can scaffold within two minutes, the industry doesn't reward you for your six hours of hard work. The industry asks: "What value did you add beyond what the machine generated?"

This is an uncomfortable truth, but accepting it could be the turning point in your career. Once you accept that AI can write code, your mindset shifts. You realize that you no longer have to worry about your execution speed, and that you need to focus instead on System Composition and Abstract Thinking.

If you're a front-end developer today, your job is no longer limited to translating a Figma design into pixel-perfect React components. An AI coding assistant can do 80% of that in a few well-constructed prompts. The expectations of you as a front-end developer have now shifted to questions like these:

When that UI connects to the backend, and 10K users log in concurrently, how does the system behave?
Suppose a customer has an SLA (Service Level Agreement) stating that the dashboard must render with all data in 1.2 seconds on a slow 4G network, in 500 ms on a fast 4G network, and in 12 ms on a 5G network. How do you architect your Next.js application to meet that?
Are you leveraging server-side rendering, static generation, or edge caching correctly?
How does the application behave for users depending on screen readers?

Source code is no longer the primary output. It should be the byproduct of your thinking and reasoning. You need to anticipate edge cases, and most importantly, you need to take ownership.

AI can write an API, but AI can't sit in a meeting with a furious client and explain why the production database went down. AI can't own the consequences of a system failure. That accountability belongs entirely to you.

Applying a Clean Architecture

Suppose you ask an LLM to build a complex application, say, an e-commerce product dashboard with sorting, filtering, and pagination. It will gladly generate code that you can run and render in the browser. But AI has a very peculiar tendency: it loves to build monoliths.

The AI will likely output a massive 1000+ line React component. The state management, UI rendering, data fetching, and business logic will be clubbed together in a single file. So it'll technically work in the browser, but it will be a nightmare to test, maintain, and scale.

This is where human software engineers come in. A modern engineer understands clean code principles and design patterns. Instead of blindly accepting the monolithic AI output, the engineer thinks in terms of LEGO-block compositions of React components.

A capable engineer looks at the requirements and thinks, "We shouldn't put everything in a single file. Let's use the Compound Components Pattern here to make the UI flexible. Let's use the Slot Pattern to create holes in our layout so consumers of this component can pass in their own custom elements without breaking the underlying logic."

You apply abstract thinking. You ask architectural questions:

How are we managing side effects vs. the data fetching?
Can we swap out the payment provider later with a very small code change?
What happens if the network drops while the user is filtering?
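One way to keep the payment-provider question cheap to answer is to hide the vendor behind a small interface. The following is a minimal sketch, not a prescribed design: `PaymentProvider`, `PaymentResult`, `FakeRazorpayProvider`, `FakeStripeProvider`, and `checkout` are hypothetical names invented for illustration, and the providers are synchronous in-memory fakes rather than real SDK calls.

```typescript
// Hypothetical "port": the only payment surface the rest of the app sees.
interface PaymentResult {
  ok: boolean;
  provider: string;
  reference: string;
}

interface PaymentProvider {
  readonly name: string;
  charge(amountInMinorUnits: number, currency: string): PaymentResult;
}

// In-memory stand-ins. A real adapter would wrap the vendor SDK here
// and would almost certainly be asynchronous.
class FakeRazorpayProvider implements PaymentProvider {
  readonly name = "razorpay";
  charge(amountInMinorUnits: number, currency: string): PaymentResult {
    return {
      ok: amountInMinorUnits > 0,
      provider: this.name,
      reference: `rzp_${currency}_${amountInMinorUnits}`,
    };
  }
}

class FakeStripeProvider implements PaymentProvider {
  readonly name = "stripe";
  charge(amountInMinorUnits: number, currency: string): PaymentResult {
    return {
      ok: amountInMinorUnits > 0,
      provider: this.name,
      reference: `pi_${currency}_${amountInMinorUnits}`,
    };
  }
}

// Business logic depends on the interface, never on a concrete vendor.
function checkout(provider: PaymentProvider, amountInMinorUnits: number): string {
  const result = provider.charge(amountInMinorUnits, "INR");
  return result.ok
    ? `paid via ${result.provider} (${result.reference})`
    : `payment failed via ${result.provider}`;
}

console.log(checkout(new FakeRazorpayProvider(), 49900));
console.log(checkout(new FakeStripeProvider(), 49900));
```

With this shape, swapping the provider means writing one new adapter class and changing the line that constructs it. `checkout` itself doesn't change.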

AI provides us with the raw materials. We need to provide the engineering discipline on top of it to make the result production-ready.

A Practical, AI-Era Engineering Roadmap

Now, it's time to think about how to bridge the gap between a tutorial-driven developer and a modern, impact-driven engineer. Here is a practical stage-by-stage roadmap for you.

Step 1: Strengthen Your Fundamentals

You can't use AI effectively if you don't understand the code it generates. In the past, a surface-level knowledge of a framework would have been enough for you to execute your tasks. You might have gotten away without knowing the "under the hood" aspects of it.

Today, AI abstracts the frameworks. If something breaks underneath, you're multiple layers away from the actual problem. Having a strong fundamental knowledge will help you to battle this situation, and you'll enjoy working with AI even more.

You must go deep into the fundamentals of Computer Science & Web Technologies:

How does the internet work? Understand Networking basics.
Don't just learn to write JavaScript promises. Learn about the event loop. Understand the call stack, the microtask queue, and how memory allocation works.
When a React application has a memory leak, AI will struggle to find it if it spans multiple files. You need to know how to use Chrome DevTools memory profilers.
Instead of focusing on random algorithmic puzzles, focus on applied abstract thinking. If you're building a real-time collaborative document editor, how do you manage the data structure for concurrent edits? This is how DSA is tested in this era of technical interviews.
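To make the event loop point concrete, here is a small sketch that records the order in which synchronous code, a promise callback (microtask), and a `setTimeout` callback (macrotask) actually run. `traceEventLoopOrder` is a hypothetical helper written for this illustration.

```typescript
// Records the order in which each kind of callback actually runs.
function traceEventLoopOrder(): Promise<string[]> {
  const order: string[] = [];
  return new Promise((resolve) => {
    order.push("sync: start");

    // Macrotask: runs only after the current call stack has emptied
    // AND every pending microtask has been processed.
    setTimeout(() => {
      order.push("macrotask: setTimeout");
      resolve(order);
    }, 0);

    // Microtask: runs as soon as the current call stack empties,
    // before any timer callback gets a turn.
    Promise.resolve().then(() => {
      order.push("microtask: promise.then");
    });

    order.push("sync: end");
  });
}

traceEventLoopOrder().then((order) => console.log(order.join(" -> ")));
// prints: sync: start -> sync: end -> microtask: promise.then -> macrotask: setTimeout
```

Even with a delay of 0, the timer callback runs last. That ordering is the kind of detail that explains "impossible" bugs in AI-generated async code.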

Step 2: Build Real (Uncomfortable) Systems

Stop building TODO apps. Stop building basic CRUD applications that only work in an ideal, localhost environment. Learn to build systems to handle failures.

Instead of building a generic e-commerce clone, build an Automated E-book Delivery and Waitlist system. For example:

The stack: TanStack Start for the front end, NestJS for the API, Supabase for the database, Razorpay for payment processing, Firebase for social logins, and Resend for email delivery.
The challenge: Don't be satisfied with just making the happy path work. What happens if the Razorpay webhook fails to reach your server after a user pays? How do you implement a retry mechanism? How do you secure your Supabase database with RLS (Row Level Security) so users can only download the book they paid for? How do you prevent duplicate sign-ups on your waitlist?
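Two of the questions above, repeated webhook deliveries and duplicate waitlist sign-ups, come down to idempotency. Here is a minimal sketch, assuming in-memory storage: `WebhookEvent`, `handlePaymentWebhook`, and `joinWaitlist` are hypothetical names, the event shape is invented rather than taken from Razorpay's real payload, and a production version would rely on a database unique constraint instead of a `Set`.

```typescript
// Hypothetical event shape; a real provider's payload will differ.
interface WebhookEvent {
  id: string; // unique per event, reused when the provider retries delivery
  orderId: string;
}

const processedEventIds = new Set<string>();
const fulfilledOrders: string[] = [];

// Safe to call repeatedly with the same event: the order is fulfilled once.
function handlePaymentWebhook(event: WebhookEvent): "processed" | "duplicate" {
  if (processedEventIds.has(event.id)) {
    return "duplicate";
  }
  fulfilledOrders.push(event.orderId);
  processedEventIds.add(event.id);
  return "processed";
}

const waitlist = new Set<string>();

// Normalizes the email so "Reader@Example.com " and "reader@example.com"
// count as a single sign-up. Returns false when the email is already listed.
function joinWaitlist(email: string): boolean {
  const key = email.trim().toLowerCase();
  if (waitlist.has(key)) {
    return false;
  }
  waitlist.add(key);
  return true;
}

console.log(handlePaymentWebhook({ id: "evt_1", orderId: "order_1" })); // "processed"
console.log(handlePaymentWebhook({ id: "evt_1", orderId: "order_1" })); // "duplicate"
console.log(joinWaitlist("Reader@Example.com ")); // true
console.log(joinWaitlist("reader@example.com")); // false
```

Because the handler is idempotent, the payment provider (or your own retry job) can redeliver the same event as often as it needs to without double-fulfilling an order.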

When you build systems like this, you naturally run into complex real-world problems. By solving them, you'll build the exact engineering muscles that companies are now desperate to hire for.

Step 3: Master the Art of Debugging

When the system breaks in production, panic starts. The developers who can stay calm, isolate assumptions, trace problems, and fix them are invaluable.

AI is great at explaining isolated error messages, but it can't easily debug a distributed system where a frontend state mismatch is caused by a race condition in a backend microservice. That's on you to burn the midnight oil and get it done.

As a software developer at any level:

Learn how to implement structured logging in your code.
Learn how to read a stack trace systematically.
Practice fixing performance bottlenecks without causing regressions in other parts of the application.
Understand Web Vitals (LCP, CLS, INP, and so on) and how to profile a slow rendering page.
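As a starting point for the structured logging item, the sketch below emits one JSON object per log event instead of a free-form string. `LogLevel`, `LogEntry`, and `formatLog` are hypothetical names for this illustration. Real projects usually reach for an established logging library, but the idea is the same.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  level: LogLevel;
  message: string;
  timestamp: string;
  [field: string]: unknown; // arbitrary searchable context
}

// Builds one JSON line per event so log tooling can filter on fields
// (orderId, durationMs, ...) instead of grepping through prose.
function formatLog(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  // Context fields are spread first so they can never overwrite
  // the reserved level/message/timestamp keys.
  const entry: LogEntry = {
    ...fields,
    level,
    message,
    timestamp: now.toISOString(),
  };
  return JSON.stringify(entry);
}

console.log(formatLog("error", "checkout failed", { orderId: "order_42", durationMs: 812 }));
```

When something breaks in production, a query like "all `error` entries for `order_42`" becomes a field filter rather than a text search.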

Step 4: Use AI as a Tool, Not as a Crutch

First of all, stop blind copy-pasting AI responses. Treat AI like an incredibly fast, highly confident, but slightly carefree junior developer.

Use it for boilerplate: Need an Express.js setup? A Zustand store setup? Generate it.
Use it for research: Learning a new thing like Rust, Go, or Cybersecurity? Prompt the AI to generate a 30-day learning roadmap tailored to your existing programming language knowledge.
Use it for content: Want to write a README file? Want to brainstorm a draft idea? AI can be your companion.
Use it for scaffolding: Need to write unit tests for a utility function? Let AI scaffold the test suites.

Note that every time you copy code from an LLM without understanding it, you're unknowingly creating tech debt. Your job is to make the AI's response as optimal as possible for production.

If you prompt an AI to write a complex data aggregation logic, and it outputs 72 lines of reducer function, don't just copy-paste it. Read it line-by-line, and ask yourself: Is this optimal? What's the Big O time complexity of this code? Can I make it more readable?
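As an illustration of that kind of review, consider counting orders per customer. This is a hypothetical example, not output from any particular model: `Order`, `countOrdersQuadratic`, and `countOrdersLinear` are invented names. Both functions return the same result, but the first one rescans the whole array for every element.

```typescript
interface Order {
  customerId: string;
  total: number;
}

// O(n^2): for each order, filter() walks the entire array again.
function countOrdersQuadratic(orders: Order[]): Record<string, number> {
  return orders.reduce<Record<string, number>>((acc, order) => {
    acc[order.customerId] = orders.filter(
      (candidate) => candidate.customerId === order.customerId
    ).length;
    return acc;
  }, {});
}

// O(n): a single pass that increments one counter per customer.
function countOrdersLinear(orders: Order[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const order of orders) {
    counts[order.customerId] = (counts[order.customerId] ?? 0) + 1;
  }
  return counts;
}

const sampleOrders: Order[] = [
  { customerId: "a", total: 10 },
  { customerId: "b", total: 25 },
  { customerId: "a", total: 5 },
];

console.log(countOrdersQuadratic(sampleOrders)); // { a: 2, b: 1 }
console.log(countOrdersLinear(sampleOrders)); // { a: 2, b: 1 }
```

On three orders the difference is invisible. On a few hundred thousand, the quadratic version is the one that shows up in your performance profile.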

Step 5: Establish Strong Proof of Work

A résumé listing your skills or a certificate from a bootcamp isn't very strong proof of work today.

Strong proof of work looks like:

A GitHub repository featuring a complex real-world application with a beautifully written README explaining the architectural choices.
Meaningful contributions to open-source projects where your code had to pass serious reviews from senior maintainers.
Writing deep tech articles or LinkedIn posts explaining how you solved a difficult rendering bug or why you chose a specific database schema for a project.
Participating in a hackathon to build something that is trendy, has the potential to go viral, can bring in revenue, or combines all of these.

Don't just code in silos. Build in public. Explain your thought process socially. When you articulate your engineering thoughts and decisions publicly, it separates you from the millions of developers who are just relying on responses from ChatGPT or any other AI tool.

The diagram below captures all five steps visually for you to connect them and revisit at any point in time.

You can download this tech frame and many others from here.

The Must-Needed Mindset Shift

"It all begins and ends in your mind. What you give power to, has power over you" - by Leon Brown

If you're currently looking for a job, you need to immediately stop asking people, "Will I get a Job?" It's the wrong question. You can't be sure you'll get a job if you don't have a convincing reason why a company should hire you.

Instead, look at the job descriptions. Look at the companies you admire. Then ask yourself: "Why should they hire me in today's circumstances?"

If you don't have a convincing answer yet, that's perfectly fine! That's your baseline, and you've identified your skill gap. Your mission now is to bridge that gap.

We've entered a phase where the definition of a software engineer is sharper and more demanding than ever before. The bar is higher, but the expectations are clearer. If you refuse to adapt and insist on staying at the level of simple execution, the path forward will likely be incredibly difficult. You'll compete with AI tools that never sleep and developers who are utilizing those tools to do the work of three people.

But if you embrace the shift and move toward abstract thinking, deep fundamentals, system architecture, and true accountability, the opportunities are limitless. You're no longer competing with everyone. Your competition will be with a small set of developers willing to take up the challenge of evolving.

The software engineering of the future (read: "today") is not about typing code syntax into an editor. It's about understanding what to build, why to build it, how it impacts the business, how to design it to last, and how to use AI as a tool to accelerate things exponentially.

If You've Read This Far...

Thank You!

I'm a Full Stack Software Engineer with more than two decades of experience in building products and people. At present, I'm pushing my startup, CreoWis Technologies, and teaching/mentoring developers on my YouTube channel, tapaScript.

I'm thrilled to publish my 50th article on the freeCodeCamp platform, and it makes me exceptionally proud to give back my knowledge to the developer community. If you want to connect with me,

Follow on LinkedIn and X
Subscribe to my YouTube Channel
Catch up with my React Clean Code Rules Book

See you soon with my next article. Until then, please take care of yourself and keep learning.

GPT-5.4 vs GLM-5: Is Open Source Finally Matching Proprietary AI?

Oyedele Tioluwani — Mon, 13 Apr 2026 17:24:10 +0000

On March 27, 2026, Zhipu AI quietly pushed an update to their open-weight model line. GLM-5.1, they claim, now performs at 94.6% of Claude Opus 4.6 on coding benchmarks. That's a 28% improvement over GLM-5, which was released just six weeks prior.

The open-source story is not slowing down. It's accelerating.

And yet, most of the teams celebrating these headlines can't run the models they're celebrating. Self-hosting GLM-5 requires roughly 1,490GB of memory.

The gap between open and proprietary AI has closed on benchmarks, but "open" and "accessible" aren't the same word. Treating them as synonyms is the most expensive mistake a team can make these days.

What follows is a look at the benchmarks that matter, the infrastructure reality the press releases leave out, and a decision framework for teams that need to ship something.

The two models at the center of this comparison are GPT-5.4, OpenAI's most capable frontier model for professional work, released on March 5, 2026, and GLM-5, the 744-billion-parameter open-weight model from China's Zhipu AI, released on February 11.

GPT-5.4 represents the current ceiling of proprietary AI: a model that unifies coding and reasoning into a single system with a one-million token context window, native computer use, and the full weight of OpenAI's platform behind it.

GLM-5 represents something different: the first open-weight model to crack the Intelligence Index score of 50, trained entirely on domestic Chinese hardware, available for free under an MIT license.

The question now shifts from which model scores higher on a given leaderboard to what the gap between them means for teams making real infrastructure decisions.

What We'll Cover:

What GLM-5 Achieved
Where GPT-5.4 Still Has the Edge
"Open" Does Not Mean "Accessible"
The Right Question Is Not Which Model Wins
What This Moment Means

What GLM-5 Achieved

GLM-5 is a 744-billion-parameter model with 40 billion active parameters per forward pass. It uses a sparse MoE architecture and was trained on 28.5 trillion tokens.

The model was released February 11, 2026, by Zhipu AI, a Tsinghua University spin-off that IPO'd in Hong Kong and raised $558 million in its last funding round. The license is MIT, which means it's commercially usable without restrictions.

The Artificial Analysis Intelligence Index v4.0 is an independent benchmark that aggregates 10 evaluations spanning agentic tasks, coding, scientific reasoning, and general knowledge.

Unlike single-task benchmarks, it's designed to measure a model's overall capability across the kinds of work people actually pay AI to do. Scores are normalized so that even the best frontier models sit around 50 to 57, preserving meaningful separation between them.

GLM-5 scores 50 on this index, the first time any open-weight model has cracked that threshold. GLM-4.7 scored 42. The eight-point jump came from improvements in agentic performance and a 56-percentage-point reduction in the hallucination rate.

On Arena (formerly LMArena), the human-preference benchmark initiated by UC Berkeley, GLM-5 ranked number one among open models in both Text Arena and Code Arena at launch, putting it on par with Claude Opus 4.5 and Gemini 3 Pro overall. That's a human-preference ranking, not an automated benchmark.

On SWE-bench Verified, GLM-5 scores 77.8%, the number one open-source score. The only models scoring higher are Claude Opus 4.6 (80.8%) and GPT-5.2 (80.0%). On Humanity's Last Exam with tools enabled, GLM-5 scores 50.4, beating GPT-5.2's 45.5.

So GLM-5 is genuinely competitive. But competitive at what, exactly? The Intelligence Index gap tells part of the story. The rest lives in specific benchmarks where GPT-5.4 still pulls ahead.

Where GPT-5.4 Still Has the Edge

The gap is not imaginary. On the Artificial Analysis Intelligence Index, GPT-5.4 scores 57, tied with Gemini 3.1 Pro Preview for number one out of 427 models, while GLM-5 scores 50.

Terminal-Bench is where the gap is most evident. It measures how well a model performs real-world terminal tasks in actual shell environments: file editing, Git operations, build systems, CI/CD pipelines, and system debugging.

Unlike benchmarks that test whether a model can write code in isolation, Terminal-Bench evaluates whether it can operate a computer the way a developer does.

According to OpenAI's API documentation, GPT-5.4 scores 75.1%, a 9.7-point lead over the next proprietary model. If your team does DevOps, infrastructure-as-code, or CI/CD debugging, this benchmark maps directly to your actual job.

Context window is another differentiator. GPT-5.4 handles 1.05 million tokens, while GLM-5 caps at 200,000. For agentic workflows that need to plan across large codebases or synthesize multi-document research, this is not a spec difference but a capability difference.

Native computer use is another advantage. This means the model can interact directly with desktop software through screenshots, mouse commands, and keyboard inputs, without requiring a separate plugin or wrapper.

GPT-5.4 is the first general-purpose OpenAI model with this capability built in, while GLM-5 is text-only with no image input. If you're building agents that interact with UIs or need multimodal reasoning, you can't use GLM-5 for that.

OpenAI also claims a 47% token reduction in tool-heavy workflows through something called tool search, a real efficiency gain if you are paying per token.

On pricing, GPT-5.4 at $2.50 per million input tokens and $15.00 per million output tokens is 4.2 times more expensive than GLM-5's API. And long-context pricing doubles above 272,000 tokens to $5.00 per million input tokens, a tax you'll feel if you run large-context agents.

There's a deeper issue the benchmark numbers don't capture, and it's most likely to trip up teams who rush to adopt open source.

"Open" Does Not Mean "Accessible"

The MIT license is real, and the weights are downloadable, but running GLM-5 in native BF16 precision requires roughly 1,490GB of memory. The recommended production setup for the FP8 model is eight H200 GPUs, each with 141GB of memory. That's a GPU cluster, not something you spin up on a single workstation.

In dollar terms, a used or leased H100 runs $15,000 to $25,000. Eight H200s are not a startup purchase. The infrastructure cost of self-hosting GLM-5 rivals or exceeds that of just calling the OpenAI API for most real-world usage volumes.

There is a quantization path. Quantization is a technique that reduces a model's memory footprint by representing its weights at lower numerical precision – for example, compressing from 16-bit to 2-bit values. It makes large models runnable on smaller hardware, but at the cost of some accuracy.

Unsloth's 2-bit GGUF reduces memory usage to 241GB, which fits within a Mac's 256GB unified memory. But quantization degrades model quality. That 77.8% SWE-bench score is for the full-precision model, and the number you get from a quantized local deployment will be lower.
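The memory figures above follow from simple arithmetic on the parameter count. Here is a rough sketch that counts weights only and assumes 1GB = 10^9 bytes. `weightMemoryGB` is a hypothetical helper, and real deployments also need room for the KV cache and activations.

```typescript
// Weights-only estimate: parameters x bits per weight, converted to GB.
// Assumes 1 GB = 1e9 bytes and ignores KV cache and activation memory.
function weightMemoryGB(parameters: number, bitsPerWeight: number): number {
  const bytes = (parameters * bitsPerWeight) / 8;
  return bytes / 1e9;
}

const GLM5_PARAMETERS = 744e9; // total parameter count quoted in this article

const bf16GB = weightMemoryGB(GLM5_PARAMETERS, 16); // 1488, in line with the ~1,490GB figure
const fp8GB = weightMemoryGB(GLM5_PARAMETERS, 8); // 744
const naiveTwoBitGB = weightMemoryGB(GLM5_PARAMETERS, 2); // 186, if every weight were 2-bit

const eightH200GB = 8 * 141; // 1128 GB of combined GPU memory

console.log({ bf16GB, fp8GB, naiveTwoBitGB, eightH200GB });
```

Two things fall out of the numbers. The FP8 weights (about 744GB) fit in eight H200s with headroom left for the KV cache, while BF16 does not. And the 241GB quoted for the 2-bit GGUF is larger than the naive 186GB, which suggests not every tensor is stored at 2 bits, so treat the naive figure as a lower bound.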

The honest alternative is to use a hosted GLM-5 API. DeepInfra charges $0.80 per million input tokens, and Novita charges $1.00 per million input tokens. You can get the model without the hardware, but then you're not self-hosting. You're just using a cheaper API, and the data sovereignty, privacy, and vendor lock-in arguments all evaporate.

"Open weight" in 2026 increasingly means open to enterprises with GPU clusters, open to researchers with cloud credits, and open to teams willing to accept quality trade-offs from quantization. It doesn't mean open to the median developer who wants to avoid their API bill.

The paradox is real: open weights, but not open access. That doesn't mean the choice is impossible. It just means the choice has to be honest.

The Right Question Is Not Which Model Wins

	GLM-5 via API	GPT-5.4	Self-hosted GLM-5
Best for	Cost-sensitive, under 200K context	Terminal, computer use, long context	Regulated environments with existing GPU infra
Pricing	$0.80 per million input (DeepInfra)	$2.50 per million input	Hardware cost only
Context window	200K tokens	1.05M tokens	200K tokens
Image input	No	Yes	No
Data sovereignty	No	No	Yes
Self-hosting required	No	No	Yes

The right model depends entirely on what your team is trying to optimize.

Use GLM-5 via API when cost efficiency is the primary constraint, when sending data to a Chinese-origin model raises no data residency concerns, when your workflow doesn't require multimodal or image input, and when context demands stay under 200,000 tokens.

It's also the right choice if you want to experiment with open-weight research or contribute back to it. The GLM-5 API is cheap, and if tokens per dollar is your dominant variable, it's hard to beat.

Use GPT-5.4 when your workflow is terminal-heavy or involves computer use, when long-context coherence above 200,000 tokens matters, when you need multimodal input, or when your team is already embedded in the OpenAI ecosystem.

The premium you pay is real. But if response consistency at scale is non-negotiable, the consistency and capabilities can justify it for some workloads.

Consider self-hosting GLM-5 only when your organization already has GPU cluster infrastructure or the budget to build one, when data sovereignty concerns are documented and specific rather than hypothetical, and when you have the ML infrastructure capabilities to manage deployment, updates, and monitoring. Self-hosting a 744-billion parameter model is not a weekend project.

The break-even math is worth doing. At roughly $0.80 per million tokens via DeepInfra, a team would need to process over one billion tokens per month before self-hosting on $15,000 H100 hardware begins to pay off. Most teams don't hit that volume, and the ones that do probably already have the infrastructure.
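The arithmetic behind that estimate can be sketched in a few lines. This is a deliberately simplified model: `monthlyApiCostUSD` and `monthsToBreakEven` are hypothetical helpers, the price is the input-token rate only, and the calculation ignores output tokens, power, hosting, staff time, and the fact that a single H100 can't hold the model on its own.

```typescript
// Monthly API spend for a given token volume at a flat per-million price.
function monthlyApiCostUSD(tokensPerMonth: number, pricePerMillionUSD: number): number {
  return (tokensPerMonth / 1e6) * pricePerMillionUSD;
}

// Number of months of API spend needed to equal an up-front hardware cost.
function monthsToBreakEven(hardwareCostUSD: number, monthlyApiSpendUSD: number): number {
  return hardwareCostUSD / monthlyApiSpendUSD;
}

// One billion tokens per month at $0.80 per million tokens.
const apiSpend = monthlyApiCostUSD(1e9, 0.8);

// A single $15,000 card, ignoring every other cost of running it.
const months = monthsToBreakEven(15000, apiSpend);

console.log(`API spend: $${apiSpend.toFixed(2)}/month`); // API spend: $800.00/month
console.log(`Break-even after ${months.toFixed(2)} months`); // Break-even after 18.75 months
```

Even at a billion tokens a month, one card takes more than a year and a half to pay for itself under these generous assumptions. Plugging in your own volume and the full hardware bill is the quickest way to see which side of the line your team is on.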

With this decision framework in place, the question shifts to a larger one. What does this moment mean for how teams should think about open source and proprietary AI?

What This Moment Means

The benchmark gap has closed. It's real, significant, and historic. The MMLU gap between open and proprietary models was 17.5 points in late 2023 and is now effectively zero. GLM-5, scoring 50 on the Intelligence Index, the first open-weight model to do so, is a genuine milestone.

But the way the gap closed matters as much as the fact that it closed. It closed through architectural ingenuity like DeepSeek Sparse Attention (DSA), MoE efficiency, and asynchronous reinforcement learning, not through democratized compute.

The models that have closed the gap are still large, still expensive to deploy at full fidelity, and still dominated by Chinese labs with significant institutional backing.

The proprietary moat is no longer better models. It's now a better platform, a better ecosystem, a bigger context window, better enterprise support, and a deployment path that doesn't require a GPU cluster. It's a narrower moat, but it's still a moat.

The question for 2026 is not whether to choose open source or proprietary. It's what you're getting for the premium you pay, and whether that's worth it for your specific workflow. For some teams, the answer will flip. For many, it won't yet.

Most teams reading this won't do the math. They'll see "open source" and assume it means cheaper. They will see "GLM-5 matches GPT-5.4 on benchmarks" and assume they can swap one for the other with no trade-offs.

Those assumptions are how you end up with a $50,000 GPU cluster you don't know how to operate, or a production outage because your quantized model can't handle long context.

The gap between what a benchmark says and what a model does in your actual environment is where engineering judgment lives. If you outsource that judgment to headlines, you're not saving money. You're just deferring the cost until it shows up as an incident.

How to Build Reliable AI Systems

Jide Abdul-Qudus — Thu, 09 Apr 2026 17:05:06 +0000

We've all been there: You open ChatGPT, drop a prompt. "Extract all emails from this sheet and categorize by sentiment." It gives you something close. You correct it, it apologizes, and gives you a new version. You ask for a different format, and suddenly, it's lost all context from earlier, and you're starting over.

Errors like that might be fine for small tasks, but they're a disaster for production systems. The gap between "this worked in my ChatGPT conversation" and "this runs reliably in production" is massive. It's not closed by better prompts. It's closed by engineering.

This article is about that engineering. You'll learn the architecture patterns, failure modes, and implementation strategies that separate AI experiments from AI products.

What You'll Learn

In this tutorial, you'll learn how to:

Understand why AI systems fail differently from traditional software
Identify and prevent the three critical failure modes in production AI
Implement the validator sandwich pattern for consistent outputs
Build observable pipelines with proper monitoring and alerting
Control costs at scale with rate limiting and circuit breakers
Design a complete production-ready AI architecture

Prerequisites

To get the most from this tutorial, you should have:

Basic understanding of any programming language
Familiarity with REST APIs and asynchronous programming
Experience with at least one LLM API (OpenAI, Anthropic, or similar)
Node.js installed locally (optional, for running code examples)

You don't need to be an expert in any of these. Intermediate knowledge is sufficient.

What Makes AI Systems Fundamentally Different
Failure Mode #1: Inconsistent Outputs
Failure Mode #2: Silent Failures
Failure Mode #3: Uncontrolled Costs
How to Build a Complete Production Architecture
Conclusion

What Makes AI Systems Fundamentally Different

Traditional software is deterministic. You write if (urgency > 8) { return 'high' } and it does exactly that, every single time. Same input, same output. Forever. You can write unit tests that cover every path. You can predict every failure mode.

AI systems, on the other hand, are probabilistic. You ask a large language model (LLM) to classify urgency and sometimes it says "high," sometimes "urgent," sometimes it gives you a 1–10 score, sometimes it writes a paragraph explaining its reasoning. Same input, different outputs, depending on temperature settings, model version, context window, and factors you can't fully control.

Here's what that looks like in practice:

Challenge	Traditional systems	AI systems
Consistency	100% reproducible	Varies per request
Debugging	Stack traces, logs	"The model just changed its behaviour."
Testing	Unit tests cover all paths	Can't test all possible outputs
Deployment	Deploy once, works forever	Degrades over time (data drift)
Failure modes	Predictable, finite	Creative, infinite

The engineering challenge is: how do you build reliability on top of inherent unpredictability?

The answer is not "use a better model." The model is maybe 20% of the solution. The remaining 80% is the system you build around it.

Failure Mode #1: Inconsistent Outputs

The Problem

You ask the AI to extract a customer email from a support ticket. Sometimes you get the email back. Sometimes you get just the name. Sometimes you get a phone number. The format changes every time. Same prompt, different outputs.

Prompt: "Extract the customer email from this support ticket"

Output on Monday:    "john@example.com"
Output on Tuesday:   "Customer email: john@example.com (verified)"
Output on Wednesday:   "John Doe"
Output on Thursday: {
                       "customer_info": {
                         "email": "john@example.com"
                       }
                     }

Three of these four outputs contain the correct email, but each one arrives in a different shape, so you can't parse them programmatically. You can't route tickets, trigger workflow systems, or integrate with other code because your response data lacks consistency.

The Solution: The Validator Sandwich Pattern

The validator sandwich pattern (also called the guardrails pattern) ensures the AI system doesn't generate or process the wrong data by sandwiching your AI between two layers of deterministic code.

Essentially, you have three layers:

The top bun: Input guardrails (deterministic)
The meat: The LLM (probabilistic)
The bottom bun: Output guardrails (deterministic)

Let's break down each layer.

The Top Bun: Input Guardrails

Before anything touches the AI, validate it. Reject garbage immediately, fail fast and cheaply. Here's a basic example with deterministic code that checks the data being received:

function validateTicketInput(raw: Record<string, any>): TicketInput {
  // Type checks
  if (!raw.email || typeof raw.email !== "string") {
    throw new ValidationError("Missing or invalid email");
  }

  // Format checks
  if (!isValidEmail(raw.email)) {
    throw new ValidationError(`Invalid email format: ${raw.email}`);
  }

  // Range checks
  if (!raw.body || raw.body.length < 10) {
    throw new ValidationError("Ticket body too short to classify");
  }

  if (raw.body.length > 10000) {
    throw new ValidationError("Ticket body exceeds max length");
  }

  // Return typed, validated input
  return {
    email: raw.email.toLowerCase().trim(),
    subject: raw.subject?.trim() || "No subject",
    body: raw.body.trim(),
    timestamp: new Date(raw.timestamp),
  };
}

This runs before the LLM is ever called. It's fast, cheap, and deterministic. It catches easy failures immediately.
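The function above leans on a few names it doesn't define. One possible set of definitions is sketched below. `TicketInput`, `ValidationError`, and `isValidEmail` appear in the snippet, but their bodies here are assumptions, and the email regex is deliberately simple rather than RFC-complete.

```typescript
// Shape returned by validateTicketInput (inferred from its return statement).
interface TicketInput {
  email: string;
  subject: string;
  body: string;
  timestamp: Date;
}

// A distinct error type lets callers tell "bad input" apart from
// genuine system failures and respond with a 4xx instead of a 5xx.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationError";
  }
}

// Pragmatic check: exactly one "@", no whitespace, and a dot in the domain.
// This is an assumption for illustration, not a full address validator.
function isValidEmail(value: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value.trim());
}

console.log(isValidEmail("john@example.com")); // true
console.log(isValidEmail("john@")); // false
console.log(new ValidationError("Missing or invalid email").name); // "ValidationError"
```

Keeping these helpers small and dependency-free fits the purpose of the top bun: it should be cheap enough to run on every request before any tokens are spent.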

The Meat: Structured Outputs from the LLM

Stop asking the AI for free text. Force it into a schema. Most modern APIs support this directly.

So what does "free text" mean? When you prompt an LLM without constraints, it returns unstructured natural language. The model decides the format. Sometimes it's a sentence, sometimes a paragraph, sometimes it adds extra context you didn't ask for. This makes programmatic parsing nearly impossible.

Forcing it into a schema, on the other hand, means that you explicitly tell the model: "Respond only with JSON matching this exact structure", for example. Modern LLM APIs have built-in features to enforce this. Instead of hoping the AI formats its response correctly, you make it structurally impossible for it to return anything else.

Here's the difference in practice:

Without schema enforcement (free text):

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{
    role: "user",
    content: "Classify this support ticket as bug, billing, or feature request: " + ticketText
  }]
});

// Response could be:
// "This appears to be a billing issue"
// "billing"
// "Category: Billing (confidence: high)"
// { "type": "billing" }  <- if you're lucky

With schema enforcement:

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{
    role: "user",
    content: "Classify this support ticket: " + ticketText
  }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "ticket_classification",
      strict: true,
      schema: {
        type: "object",
        properties: {
          category: {
            type: "string",
            enum: ["bug", "billing", "feature", "other"]
          },
          confidence: {
            type: "number",
            minimum: 0,
            maximum: 1
          },
          priority: {
            type: "integer",
            minimum: 1,
            maximum: 5
          }
        },
        required: ["category", "confidence", "priority"],
        additionalProperties: false
      }
    }
  }
});

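// With strict mode, response.choices[0].message.content is a JSON string such as:
// { "category": "billing", "confidence": 0.89, "priority": 2 }

The response_format parameter constrains the model to output valid JSON matching your schema. In strict mode, the API restricts generation to tokens that keep the output schema-valid, rather than generating freely and retrying. You get predictable, parseable data in the normal case, but you still need to handle refusals and responses cut off by a token limit.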

The key difference: you're making the AI conform to your format instead of hoping it does the right thing.

The Bottom Bun: Output Guardrails

This is the most critical layer. LLMs will hallucinate. This layer catches those hallucinations before they break your database or confuse your users.

Guardrails are validation checks that run after the LLM responds. Think of them as safety barriers on a highway: they don't prevent the car from moving, but they can stop it from going off the road.

In AI systems, guardrails verify that:

The output matches your expected schema
The data types are correct
The values fall within acceptable ranges
The business logic makes sense

At this point you have a structured response. Validate it aggressively before you use it:

function validateClassification(raw: any): Classification {
  // Must match the fields declared in the response schema
  const required = ["category", "confidence", "priority"];
  for (const field of required) {
    if (raw[field] === undefined || raw[field] === null) {
      throw new ValidationError(`Missing required field: ${field}`);
    }
  }

  if (!["bug", "billing", "feature", "other"].includes(raw.category)) {
    throw new ValidationError(`Invalid category: ${raw.category}`);
  }

  if (typeof raw.confidence !== "number" || 
      raw.confidence < 0 || raw.confidence > 1) {
    throw new ValidationError(`Invalid confidence: ${raw.confidence}`);
  }

  if (!Number.isInteger(raw.priority) || 
      raw.priority < 1 || raw.priority > 5) {
    throw new ValidationError(`Invalid priority: ${raw.priority}`);
  }

  if (raw.category === "billing" && raw.priority > 3) {
    logger.warn("Suspicious: billing classified as low priority", raw);
  }

  return raw as Classification;
}

Validating aggressively means checking everything, not just schema compliance. You're validating:

Schema compliance: Does the JSON have the right fields?
Type safety: Is "confidence" actually a number, not a string?
Range validity: Is confidence between 0 and 1, not -5 or 999?
Business logic: Does the combination of fields make sense for your domain?
Confidence thresholds: Is the AI actually confident in this answer?

If any validation fails, you don't silently accept bad data. You have three options:

Retry with a clearer prompt: Ask the model to try again with stricter instructions
Escalate to human review: Log the failure and route to a review queue
Use a fallback: Return a safe default value that requires human attention
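The three options above can be chained: retry a bounded number of times, then fall back to a safe default that is flagged for human review. Here's a minimal sketch of that idea. The names `classifyWithRecovery`, `Outcome`, and `FALLBACK` are hypothetical, and the `classify` callback stands in for whatever LLM call you use.

```typescript
type Classification = { category: string; confidence: number; priority: number };

type Outcome =
  | { status: "ok"; value: Classification; attempts: number }
  | { status: "needs_review"; value: Classification; attempts: number; lastError: string };

// Safe default (an assumption for this sketch): zero confidence forces human review.
const FALLBACK: Classification = { category: "other", confidence: 0, priority: 3 };

async function classifyWithRecovery(
  classify: (attempt: number) => Promise<unknown>, // attempt number lets the caller tighten the prompt
  validate: (raw: unknown) => Classification,      // throws when the output is invalid
  maxAttempts = 2
): Promise<Outcome> {
  let lastError = "unknown error";

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = validate(await classify(attempt));
      return { status: "ok", value, attempts: attempt };
    } catch (err) {
      // Remember why this attempt failed, then try again
      lastError = err instanceof Error ? err.message : String(err);
    }
  }

  // Retries exhausted: return the fallback and flag it for review
  return { status: "needs_review", value: FALLBACK, attempts: maxAttempts, lastError };
}
```

Keeping the retry count small matters here: every retry is another paid LLM call, so an unbounded loop turns a validation problem into a cost problem.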

The Deterministic Rule

Here's a rule to follow religiously:

If it can be solved with an if-statement, don't use AI.

Email format validation? Use regex. Date parsing? Use a date library. Checking if a string contains a keyword? Use a string method. Math? Use actual math.

AI is expensive and probabilistic. Traditional code is free, instant, and deterministic. Use AI for genuinely ambiguous tasks, extracting meaning from unstructured text, generating content, and reasoning about complex inputs. Let deterministic code handle everything else.
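As a rough illustration of this rule, you can try cheap deterministic checks first and only call the LLM when they can't decide. The keyword rules below are made up for this sketch, and `classifyDeterministically` is a hypothetical helper name.

```typescript
type Category = "bug" | "billing" | "feature" | "other";

// Hypothetical keyword rules; a real system would derive these from its own ticket data.
const KEYWORD_RULES: Array<{ pattern: RegExp; category: Category }> = [
  { pattern: /\b(refund|invoice|charged|payment)\b/i, category: "billing" },
  { pattern: /\b(crash|error|exception|stack trace)\b/i, category: "bug" },
];

// Returns a category when a rule matches, or null to signal "ambiguous, ask the LLM".
function classifyDeterministically(body: string): Category | null {
  for (const rule of KEYWORD_RULES) {
    if (rule.pattern.test(body)) return rule.category;
  }
  return null;
}
```

A `null` result is the signal to spend money on a model call. Everything else is handled for free, with the same answer every time.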

Failure Mode #2: Silent Failures

The Problem

Silent failures come in many forms in AI workflows: gradually degrading accuracy, stale training data, and misclassifications that nobody notices. This is the scariest failure mode because you don't know it's happening.

Consider accuracy drift. You trained your model on 2024 data. It's now mid-2026. Your vendors changed their invoice formats. Your classification accuracy has drifted from 95% down to 71%. You won't know until you do a quarterly audit. And by then, thousands of records have been processed incorrectly.

The principle is simple: you cannot fix what you cannot see.

The Solution: Observable Pipelines

Every production AI system needs observability baked in from day one. Here's how this plays out in a production system:

In the diagram above:

Input arrives: A user request comes in (support ticket, document, query). You log: request ID, timestamp, user ID, input hash (for deduplication).
LLM Processing: The request goes to your AI model. You log which model was called, how long it took (latency), how many tokens it used, what it cost, and critically, the confidence score.
Confidence Gate: This is where you make a routing decision:
- High confidence (>0.8): Auto-process and execute the action
- Medium confidence (0.6-0.8): Send to human review queue
- Low confidence (<0.6): Immediate escalation + alert
Monitoring Dashboard: All this data flows into your observability tools, where you track trends over time.
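The confidence gate in that list boils down to a small routing function. Here's a minimal sketch using the thresholds above. The function name and route labels are assumptions, and sending out-of-range scores straight to escalation is a design choice of this sketch, not something the list prescribes.

```typescript
type Route = "auto_process" | "human_review" | "escalate";

// Thresholds taken from the confidence gate described above; tune them for your own data.
const HIGH_CONFIDENCE = 0.8;
const LOW_CONFIDENCE = 0.6;

function routeByConfidence(confidence: number): Route {
  // Anything that is not a number in [0, 1] is treated as untrustworthy
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return "escalate";
  }
  if (confidence > HIGH_CONFIDENCE) return "auto_process";
  if (confidence >= LOW_CONFIDENCE) return "human_review";
  return "escalate";
}
```

Because the thresholds live in one place, you can adjust them as your human override data shows where the model's confidence can and can't be trusted.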

With monitoring, you can detect issues in your system and address them as soon as possible. Monitoring doesn't just catch problems. It gives you data to diagnose and fix them in hours instead of months.

What you're measuring and why:

Metric	Why it matters
Response time	API health, model issues
Confidence	Model degradation
Human override rate	Output quality problems
Error rate	System failures
Cost per request	Budget control
Token usage trend	Prompt efficiency

The goal is not to remove humans from the loop. It's to involve them only when the system is genuinely uncertain.
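To catch the kind of drift described earlier before a quarterly audit does, you can track a rolling average of a metric such as confidence and compare it to a baseline. This is a simplified sketch: `RollingAverage` and `hasDrifted` are hypothetical names, and the window size, baseline, and tolerance are values you would pick from your own history.

```typescript
class RollingAverage {
  private values: number[] = [];
  private readonly windowSize: number;

  constructor(windowSize: number) {
    this.windowSize = windowSize;
  }

  // Add a new observation and drop the oldest one once the window is full
  add(value: number): void {
    this.values.push(value);
    if (this.values.length > this.windowSize) this.values.shift();
  }

  mean(): number | null {
    if (this.values.length === 0) return null;
    return this.values.reduce((sum, v) => sum + v, 0) / this.values.length;
  }

  isFull(): boolean {
    return this.values.length === this.windowSize;
  }
}

// True once a full window of recent scores averages below (baseline - tolerance)
function hasDrifted(recent: RollingAverage, baseline: number, tolerance: number): boolean {
  const mean = recent.mean();
  return recent.isFull() && mean !== null && mean < baseline - tolerance;
}
```

Waiting for a full window before alerting avoids paging someone because the first two requests of the day happened to be hard ones.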

Failure Mode #3: Uncontrolled Costs

The Problem

You test your workflow with 10 tickets. It works great and costs 50 cents. You deploy to production. 10,000 requests hit your API. Your bill: $500 for the day.

Or you write a retry loop incorrectly. It creates infinite API calls. Your bill: $5,000 for the day.

Or you're using the most expensive model for everything, including simple tasks that a cheaper model could handle.

The reality: "works for 10 requests" ≠ "works for 10,000 requests." Scale changes everything.
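It's worth doing this arithmetic before you deploy. Here's a minimal sketch: the helper names are hypothetical, and any per-token prices you pass in should come from your provider's current price list rather than from this example.

```typescript
interface Pricing {
  inputPerMillionTokens: number;  // USD per 1M input tokens (placeholder values in the usage below)
  outputPerMillionTokens: number; // USD per 1M output tokens
}

// Cost of a single request, given its token counts
function costPerRequest(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens * pricing.inputPerMillionTokens +
      outputTokens * pricing.outputPerMillionTokens) /
    1_000_000
  );
}

// Linear projection: assumes every request costs about the same
function projectedDailyCost(requestsPerDay: number, perRequestCost: number): number {
  return requestsPerDay * perRequestCost;
}

// The test run from above: 10 tickets for 50 cents is 5 cents per ticket
const perTicket = 0.5 / 10;
const atScale = projectedDailyCost(10_000, perTicket); // roughly $500 per day
```

The projection is deliberately naive. It ignores retries, longer production inputs, and caching, which is exactly why the measured number in production can differ so much from the test run.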

The Solution: Gated Pipelines with Circuit Breakers

To move from a fragile prototype to a robust production system, you must abandon the naive approach of directly connecting user inputs to LLM APIs. Instead, implement a gated pipeline.

Think of this architecture as a series of blast doors. A request must successfully pass through each gate before it earns the right to cost you money. If any gate closes, the request is rejected cheaply and quickly, protecting your budget and your upstream dependencies.

From the diagram above, these gates are:

The rate limiter
The cache check
The request queue
The circuit breaker

Let's examine each one.

Gate 1: Rate limiting

The first line of defence stops abuse before it enters your system. In standard web development, rate limiting is about protecting the server CPU. In AI development, it's about protecting your wallet.

Gate 2: Cache check

The cheapest LLM API call is the one you never have to make. Many AI requests are repeated or highly similar. Cache aggressively.

Gate 3: Request queue

LLM APIs are not like standard REST APIs; requests often take 10–30 seconds to complete. If 500 users hit "submit" simultaneously, your server cannot open 500 simultaneous connections without crashing or hitting provider concurrency limits. A request queue solves this by buffering requests and processing them with a fixed number of concurrent workers.

Gate 4: Circuit breaker

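Retry logic is necessary for transient network blips, but it is destructive during a real outage. If an LLM provider is experiencing downtime and returning 500 errors, a naive retry loop will frantically hammer their API, burning time and resources on requests that cannot succeed. A circuit breaker counts consecutive failures, stops calling the provider once a threshold is reached, and only lets a few test requests through after a cool-down period.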

How to implement a gated pipeline

Here's an example implementation showing all four gates working together:

Step 1: Rate Limiter (using Redis)

import { RateLimiterRedis } from "rate-limiter-flexible";
import Redis from "ioredis";

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379
});

// Rate limiting per user
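const userLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: "rl:user",
  points: 100,        // 100 requests...
  duration: 3600,     // ...per hour, per user
  blockDuration: 60   // Block for 60 seconds once the limit is exceeded
});

// Rate limiting globally 
const globalLimiter = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: "rl:global",
  points: 1000,       // 1,000 requests...
  duration: 3600      // ...per hour across all users
});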

Step 2: Cache Layer

import { createHash } from "crypto";

class AICache {
  // Reuses the Redis client created in Step 1
  private redis: Redis = redis;
  private ttl: number = 3600; // Cache entries expire after 1 hour

  // Hash the input so keys have a fixed length and don't expose ticket text
  hashInput(input: string): string {
    return createHash("sha256").update(input).digest("hex");
  }

  async get<T>(input: string): Promise<T | null> {
    const key = `ai:cache:${this.hashInput(input)}`;
    const cached = await this.redis.get(key);
    
    if (cached) {
      // Cache hit - free!
      await metrics.increment("ai.cache.hits");
      return JSON.parse(cached) as T;
    }
    
    await metrics.increment("ai.cache.misses");
    return null;
  }

  async set<T>(input: string, result: T): Promise<void> {
    const key = `ai:cache:${this.hashInput(input)}`;
    await this.redis.setex(key, this.ttl, JSON.stringify(result));
  }
}

Step 3: Request Queue

import Queue from "bull";

const aiQueue = new Queue("ai-requests", {
  redis: {
    host: process.env.REDIS_HOST,
    port: 6379
  }
});

aiQueue.process(5, async (job) => {
  // Only 5 simultaneous LLM calls max
  const { ticket } = job.data;
  return await callLLM(ticket);
});

async function enqueueRequest(ticket: Ticket) {
  const job = await aiQueue.add(
    { ticket },
    {
      attempts: 3,
      backoff: {
        type: "exponential",
        delay: 2000
      }
    }
  );
  
  return job.finished(); 
}

Step 4: Circuit Breaker

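enum CircuitState {
  CLOSED,    // Normal operation: requests pass through
  OPEN,      // Failing: requests are rejected without calling the provider
  HALF_OPEN  // Probing: test requests check whether the provider has recovered
}

class CircuitBreaker {
  private state = CircuitState.CLOSED;
  private failures = 0;
  private lastFailureTime?: Date;
  private successesInHalfOpen = 0;

  private readonly failureThreshold = 3;            // Consecutive failures before opening
  private readonly openDurationMs = 5 * 60 * 1000;  // Stay open for 5 minutes
  private readonly halfOpenSuccesses = 2;           // Successes needed to close again

  async execute<T>(
    fn: () => Promise<T>,
    fallback?: () => T
  ): Promise<T> {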
    if (this.state === CircuitState.OPEN) {
      const elapsed = Date.now() - (this.lastFailureTime?.getTime() || 0);
      
      if (elapsed < this.openDurationMs) {
        // Still in open state - use fallback or throw
        if (fallback) {
          logger.warn("Circuit OPEN - using fallback");
          return fallback();
        }
        throw new Error("Circuit breaker OPEN - service unavailable");
      }
      
      // Transition to half-open
      this.state = CircuitState.HALF_OPEN;
      logger.info("Circuit transitioning to HALF_OPEN");
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successesInHalfOpen++;
      
      if (this.successesInHalfOpen >= this.halfOpenSuccesses) {
        // Service recovered - close circuit
        this.state = CircuitState.CLOSED;
        this.failures = 0;
        this.successesInHalfOpen = 0;
        logger.info("Circuit CLOSED - service recovered");
      }
    } else {
      this.failures = 0;
    }
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = new Date();

    if (this.state === CircuitState.HALF_OPEN) {
      // Failed during test - back to open
      this.state = CircuitState.OPEN;
      this.successesInHalfOpen = 0;
      logger.error("Circuit reopened during HALF_OPEN test");
    } else if (this.failures >= this.failureThreshold) {
      // Too many failures - open circuit
      this.state = CircuitState.OPEN;
      logger.error(`Circuit OPEN after ${this.failures} failures`);
    }
  }
}

Step 5: Putting it all together

const cache = new AICache();
const circuitBreaker = new CircuitBreaker();

async function processWithGatedPipeline(ticket: Ticket) {
  try {
    await userLimiter.consume(ticket.userId);
    await globalLimiter.consume("global");
  } catch (error) {
    throw new Error("Rate limit exceeded. Please try again later.");
  }

  const cacheKey = ticket.body;
  const cached = await cache.get(cacheKey);
  if (cached) {
    logger.info("Cache hit - returning cached result");
    return cached;
  }

  // Run the queued LLM call through the circuit breaker so each ticket
  // triggers exactly one call (the queue worker is what invokes callLLM)
  const result = await circuitBreaker.execute(
    async () => {
      const classification = await enqueueRequest(ticket);
      await cache.set(cacheKey, classification);
      return classification;
    },
    () => ({
      category: "other",
      confidence: 0,
      requiresHumanReview: true,
      reason: "service_unavailable"
    })
  );

  return result;
}

What this achieves:

Rate limiting: Prevents abuse and runaway costs
Caching: Eliminates the LLM cost of repeated queries; overall savings depend on your cache hit rate
Queueing: Prevents server overload during traffic spikes
Circuit breaker: Fails fast during outages instead of wasting money on retries

Each gate is cheap to operate. Together, they protect your system from the most common production failures.

How to Build a Complete Production Architecture

When you solve all three failure modes (inconsistent outputs, silent failures, and uncontrolled costs), you graduate from a simple script to a production-grade system. This architecture doesn't just generate text: it validates its own inputs and outputs, manages resources, and surfaces its mistakes so you can fix them.

The Complete Workflow Implementation

Here's how all the pieces we've covered fit together in a single workflow. This brings together the validation functions from Failure Mode #1, the observability from Failure Mode #2, and the gated pipeline from Failure Mode #3:

class TicketWorkflow {
  async processTicket(rawInput: unknown) { // return type is inferred
    const requestId = generateId();
    const startTime = Date.now();

    try {
      // LAYER 1: Input validation + rate limiting + cache
      const ticket = validateTicketInput(rawInput);
      await userLimiter.consume(ticket.userId);
      
      const cached = await cache.get(ticket.body);
      if (cached) return { ...cached, source: "cache" };

      // LAYER 2: AI processing with circuit breaker protection
      const classification = await circuitBreaker.execute(() => 
        classifyTicket(ticket)
      );

      // LAYER 3: Output validation + confidence routing
      const validated = validateClassification(classification);
      
      let action: string;
      if (validated.confidence >= 0.8) {
        await sendToAgent(ticket, validated);
        action = "auto_assigned";
      } else {
        await sendToReviewQueue(ticket, validated);
        action = "needs_review";
      }

      // LAYER 4: Log everything for observability
      await logger.log({
        requestId,
        userId: ticket.userId,
        confidence: validated.confidence,
        action,
        latencyMs: Date.now() - startTime,
        cost: calculateCost(classification.tokensUsed)
      });

      await cache.set(ticket.body, validated);
      return { classification: validated, action };

    } catch (error) {
      await logger.logError(requestId, error);
      throw error;
    }
  }
}

What each layer does:

Layer 1 (Input) protects your system from bad data and abuse:

Validates the ticket has required fields (email and body; a missing subject falls back to a default)
Checks rate limits (prevents one user from overwhelming the system)
Returns cached results if we've seen this exact ticket before

Layer 2 (Orchestration) is where the AI does its work:

Calls the LLM with structured output requirements
Wrapped in a circuit breaker (fails fast if the API is down)
Uses the cheapest model that works (a small model such as gpt-4o-mini is enough for classification)

Layer 3 (Validation) ensures the output is safe to use:

Validates the response matches our schema
Routes based on confidence (high confidence → auto-assign, low → human review)
Never blindly trusts AI output

Layer 4 (Observability) tracks everything:

Logs every request with latency, cost, and confidence scores
Sends metrics to your monitoring dashboard
Alerts on anomalies (confidence dropping, costs spiking)

This architecture takes you from "it worked in my ChatGPT demo" to "it runs reliably at 10,000 tickets per day." The code is more complex than a simple API call, but the complexity is intentional. It's what makes the system production-ready.

Conclusion: Engineering Over Prompting

The teams winning with AI right now aren't winning because they have better models. They're winning because they've built better systems around imperfect models.

Any company can call the OpenAI API. The ones that pull ahead are the ones who wrap that API call in validation, observability, cost controls, and thoughtful architecture — the ones who treat AI as a component in an assembly line, not a creative partner in a conversation.

The three things every production AI system needs:

Structure: Validators, schemas, deterministic layers that enforce consistency and eliminate unpredictability at the edges.
Visibility: Logging, monitoring, and alerting so you catch problems in hours, not months. Observable pipelines that let you see exactly what the system is doing and why.
Control: Rate limits, caching, circuit breakers, and cost gates so scale doesn't turn your experiment into a budget emergency.

Reliable AI workflows aren't about better prompts. They're about better architecture around unreliable components.


How the Mixture of Experts Architecture Works in AI Models

Manish Shivanandhan — Tue, 07 Apr 2026 17:18:05 +0000

Artificial intelligence (AI) has seen remarkable advancements over the years, with AI models growing in size and complexity.

Among the innovative approaches gaining traction today is the Mixture of Experts (MoE) architecture. This method optimizes AI model performance by distributing processing tasks across specialized subnetworks known as “experts.”

In this article, we’ll explore how this architecture works, the role of sparsity, routing strategies, and its real-world application in the Mixtral model. We’ll also discuss the challenges these systems face and the solutions developed to address them.

We'll Cover:

Understanding the Mixture of Experts (MoE) Approach
The Role of Sparsity in AI Models
The Art of Routing in MoE Architectures
Load Balancing Challenges and Solutions
What Actually Happens During an MoE Inference
Real-World Application: The Mixtral Model
Conclusion

Understanding the Mixture of Experts (MoE) Approach

The Mixture of Experts (MoE) is a machine learning technique that divides an AI model into smaller, specialized networks, each focusing on specific tasks.

This is akin to assembling a team where each member possesses unique skills suited for particular challenges.

The idea isn't new. It dates back to the 1991 paper "Adaptive Mixtures of Local Experts" by Jacobs, Jordan, Nowlan, and Hinton, which highlighted the benefits of having separate networks specialize in different training cases.

Fast forward to today, and MoE is experiencing a resurgence, particularly among large language models, which utilize this approach to enhance efficiency and effectiveness.

At its core, this system comprises several components: an input layer, multiple expert networks, a gating network, and an output layer.

The gating network serves as a coordinator, determining which expert networks should be activated for a given task.

By doing so, MoE significantly reduces the need to engage the entire network for every operation. This improves performance and reduces computational overhead.

The Role of Sparsity in AI Models

An essential concept within MoE architecture is sparsity, which refers to activating only a subset of experts for each processing task.

Instead of engaging all network resources, sparsity ensures that only the relevant experts and their parameters are used. This targeted selection significantly reduces computation needs, especially when dealing with complex, high-dimensional data such as natural language processing tasks.

Sparse models excel because they allow for specialized processing. For example, different parts of a sentence may require distinct types of analysis: one expert might be adept at understanding idioms, while another could specialize in parsing complex grammar structures.

By activating only the necessary experts, MoE models can provide more precise and efficient analysis of the input data.

The Art of Routing in MoE Architectures

Routing is another critical component of the Mixture of Experts model.

The gating network plays a crucial role here, as it determines which experts to activate for each input. A successful routing strategy ensures that the network is capable of selecting the most suitable experts, optimizing performance and maintaining balance across the network.

Typically, the routing process involves predicting which experts will provide the best output for a given input. The gating network is usually a small learned layer that assigns each expert a score for the current token, and those scores are normalized (commonly with a softmax) so they can be used as weights.

One popular strategy is the “top-k” routing method, where the k most suitable experts are chosen for a task. In practice, a variant known as “top-2” routing is often used, activating the best two experts, which balances effectiveness and computational cost.
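To make top-k routing concrete, here's a minimal sketch that takes one gate score (logit) per expert, keeps the k highest, and turns them into weights with a softmax over just the selected experts. The function names are hypothetical, and real implementations run this on tensors for many tokens at once.

```typescript
// Numerically stable softmax: subtract the max before exponentiating
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Pick the k highest-scoring experts and normalize their weights so they sum to 1
function topKRoute(
  gateLogits: number[],
  k: number
): Array<{ expert: number; weight: number }> {
  const ranked = gateLogits
    .map((logit, expert) => ({ expert, logit }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);

  const weights = softmax(ranked.map((r) => r.logit));
  return ranked.map((r, i) => ({ expert: r.expert, weight: weights[i] }));
}

// Four experts, top-2 routing: experts 1 and 3 have the highest scores
const selected = topKRoute([0.1, 2.0, -1.0, 1.5], 2);
```

The weights matter as much as the selection: they decide how much each chosen expert contributes when the outputs are combined.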

Load Balancing Challenges and Solutions

While MoE models have clear advantages, they also introduce specific challenges, particularly regarding load balancing.

The potential issue is that the gating network might consistently select only a few experts, leading to an uneven distribution of tasks. This imbalance can leave some experts overutilized and, consequently, over-trained, while others remain underutilized.

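To address this challenge, researchers have developed “noisy top-k” gating, a technique that adds Gaussian noise to the gating scores before the top-k selection. This controlled randomness gives lower-ranked experts a chance to be picked, promoting a more balanced activation of experts. It is often paired with an auxiliary training loss that penalizes uneven expert usage.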

By distributing the workload more evenly across experts, this approach mitigates the risk of inefficiencies and ensures that the entire network remains effective.
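Here's a simplified sketch of the effect of noise on expert selection. It uses a fixed noise scale, whereas the published technique learns a noise scale per expert during training, so treat this as an illustration of the idea rather than the actual method. The function names and the injectable `rand` parameter are assumptions made for this example.

```typescript
// Box-Muller transform: turns two uniform samples into one standard normal sample
function gaussian(rand: () => number): number {
  const u1 = Math.max(rand(), Number.EPSILON); // avoid log(0)
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Add noise to each gate score, then keep the k highest-scoring experts
function noisyTopK(
  gateLogits: number[],
  k: number,
  noiseStd: number,
  rand: () => number = Math.random
): number[] {
  return gateLogits
    .map((logit, expert) => ({ expert, score: logit + noiseStd * gaussian(rand) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.expert);
}

// Count how often each expert is selected across many routing decisions
function expertLoad(
  gateLogits: number[],
  k: number,
  noiseStd: number,
  trials: number,
  rand: () => number = Math.random
): number[] {
  const counts = new Array<number>(gateLogits.length).fill(0);
  for (let t = 0; t < trials; t++) {
    for (const expert of noisyTopK(gateLogits, k, noiseStd, rand)) {
      counts[expert] += 1;
    }
  }
  return counts;
}
```

With `noiseStd` set to 0, the same two experts win every time. With noise switched on, experts with slightly lower scores start receiving a share of the traffic, which is the balancing effect described above.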

What Actually Happens During an MoE Inference

To make the Mixture of Experts architecture more concrete, it helps to walk through what happens during a single request.

Consider a prompt like:

“Explain why startups fail due to poor cash flow management.”

In a traditional dense model, every layer and every parameter contribute to generating the response. In an MoE model, the process is more selective.

As the input is processed, each layer passes the token representations to the gating network. This component evaluates all available experts and assigns them scores based on how relevant they are to the input. Instead of activating the full network, the model selects only the top-k experts (commonly two).

For this example, the gating network might select:

One expert specialized in financial reasoning
Another expert better at structuring causal explanations

Only these selected experts process the input, producing intermediate outputs that are then combined and passed to the next layer. The rest of the experts remain inactive for that token.

This selection and combination process repeats across layers, meaning that at any given point, only a small fraction of the model’s total parameters are being used.

The result is a system that behaves like a large, highly capable model, but executes more like a smaller one in terms of compute. This is the practical advantage of MoE: it doesn’t just improve model capacity, it ensures that capacity is used selectively and efficiently for each request.
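The walkthrough above can be condensed into a toy forward pass for a single token. This is a minimal sketch under simplifying assumptions: experts are plain functions that preserve the vector length, the gate is one weight row per expert, and `moeLayer` is a hypothetical name. Real models do the same thing with learned feed-forward networks and batched tensors.

```typescript
type Expert = (x: number[]) => number[];

function moeLayer(x: number[], gateWeights: number[][], experts: Expert[], k = 2): number[] {
  if (k < 1 || k > experts.length) throw new RangeError("k must be between 1 and experts.length");

  // 1. Gate: one score per expert (dot product of the token vector with that expert's gate row)
  const logits = gateWeights.map((row) => row.reduce((sum, w, i) => sum + w * x[i], 0));

  // 2. Select the k highest-scoring experts
  const top = logits
    .map((logit, expert) => ({ expert, logit }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);

  // 3. Softmax over the selected scores only
  const max = top[0].logit;
  const exps = top.map((t) => Math.exp(t.logit - max));
  const total = exps.reduce((a, b) => a + b, 0);

  // 4. Run only the selected experts and combine their outputs by weight
  const output = new Array<number>(x.length).fill(0);
  top.forEach((t, j) => {
    const expertOutput = experts[t.expert](x);
    const weight = exps[j] / total;
    expertOutput.forEach((value, i) => {
      output[i] += weight * value;
    });
  });
  return output;
}
```

Note that the unselected experts are never called at all. That skipped work is where the compute savings come from.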

Real-World Application: The Mixtral Model

A compelling example of the Mixture of Experts architecture in action is the Mixtral model. This open-source large language model exemplifies how MoE can enhance efficiency in processing tasks.

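Each layer of the Mixtral 8x7B model contains eight experts. As the model processes each token of input data, the gating network selects the two most suitable experts. These experts handle the token, and their outputs are combined before moving to the next model layer. Despite the name, the experts are not eight separate seven-billion-parameter models: only the feed-forward blocks are replicated, while attention and other parameters are shared. The model has roughly 47 billion parameters in total, of which about 13 billion are active for any given token.

This approach allows Mixtral to deliver performance comparable to much larger dense models at roughly the per-token compute cost of a 13-billion-parameter one. By using resources efficiently and routing each token to specialized experts, Mixtral stands as a testament to the potential of MoE architectures in advancing AI technology.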

Conclusion

The Mixture of Experts architecture represents a significant step forward in developing efficient AI systems. With its focus on specialized processing and resource optimization, MoE offers numerous benefits, particularly for large-scale language models.

Key concepts like sparsity and effective routing ensure that these models can handle complex tasks with precision, while innovations like noisy top-k gating address the common challenges of load balancing.

Despite its complexity and the need for careful tuning, the MoE approach remains promising in elevating AI model performance. As AI continues to advance, architectures like MoE could play a crucial role in powering the next generation of intelligent systems, offering improved efficiency and specialized processing capabilities.


How to Use the Model Context Protocol to Build a Personal Financial Assistant

Nikhil Adithyan — Wed, 25 Mar 2026 16:41:36 +0000

LLMs are great at writing market commentary. The problem is they can sound confident even when they haven't looked at any data. That’s fine for casual chat, but it’s not fine if you’re building a feature for a product, an internal tool, or anything a user might rely on.

In this guide, we’ll build a small financial assistant that fetches real data by calling tools exposed via MCP (Model Context Protocol), then computes the numbers in Python. The LLM’s job is only to narrate the computed facts. It doesn't invent metrics, and it doesn't do the math.

By the end, you’ll have two outputs you can actually plug into a product flow: a single-ticker market brief, and a watchlist snapshot that compares multiple tickers on volatility and drawdown, with the tool calls traced so you can see exactly what data was used.

What is MCP, and How Does it Change the Integration Story?
Architecture: The “Narrator” Pattern
Step 1: MCP Client Wrapper (client.py)
Step 2: The Assistant Core (core.py)
Demo 1: Market Brief for One Ticker
Demo 2: Watchlist Snapshot
What Makes this Shippable, and What Can Be Improved?
Conclusion

Prerequisites

This is a code-first guide. I won’t explain every line of Python, so you should be comfortable reading pandas code, basic async/await patterns, and calling APIs from Python.

Before you start, you’ll need:

Python 3.10+
An EODHD API key (to access the EODHD MCP server)
An OpenAI API key (for the narration step)
The MCP Python client installed, plus the usual data stack: numpy and pandas
A local environment where you can run async Python code (Jupyter or a normal script both work)

If you’ve never worked with async code before, you can still follow along. Just treat the async functions as "network calls" and focus on how the data flows from tool calls, to deterministic metrics, to narration.

What is MCP, and How Does it Change the Integration Story?

MCP (Model Context Protocol) is a protocol for how an LLM application can discover and call external tools exposed by an MCP server. Instead of hardcoding a bunch of function schemas or building custom connectors per framework, you plug into an MCP server and the tools become “available” in a consistent format.

For product teams, this matters because it reduces integration churn. Tool discovery is predictable, you’re not rewriting wrappers every time your stack changes, and you get a clean separation between the model and the data layer.

In our case, that data layer is EOD Historical Data (EODHD), a market data provider. We’ll use EODHD’s MCP server, which exposes market data tools the assistant can call whenever it needs prices or fundamentals.

One important clarification for this tutorial: we’re using an MCP server purely as the data access layer. The model doesn’t decide which MCP tools to call or what parameters to pass. We'll do that deterministically in Python, then hand the model a facts object and let it write the narrative. This keeps the output grounded and makes the system much easier to trust and debug.

Architecture: The “Narrator” Pattern

Here’s the architecture we’re using in this guide:

The idea is simple: we'll separate “getting facts” from “writing words”. The model only does the second part.

First, the user asks a question like “Give me a 30-day brief for AAPL” or “Compare TSLA, NVDA, AMZN over the last 60 days”. That raw text goes into a tiny parser. The parser is intentionally boring. It only extracts what the system needs to operate: a list of tickers and a lookback window.

Once we have tickers and dates, our MCP client connects to the EODHD MCP server and fetches the data by calling its tools. So instead of the assistant guessing prices or fundamentals, it calls tools like “get historical prices” and “get fundamentals”. At this point we have raw data. Nothing has been computed yet, and the model has not written a single sentence.

Then Python takes over. This is where we compute everything deterministically: returns, volatility, max drawdown, trend slope, and a simple volatility regime label. For watchlists, we align returns and compute correlation. These numbers are the backbone of the output. If you rerun the same query with the same window, you should get the same metrics.

Only after that do we involve the LLM. We pass it a compact facts object. It contains the metrics we computed, plus a few clean fundamentals fields. The prompt is strict. Use only these facts – no extra numbers and no guessing. The model’s job is to turn the facts into a clean note that feels like something a product would show.

Finally, the assistant returns a structured response object. Not just text. You get:

answer (the narrative)
metrics (the exact computed numbers)
data_used (tickers, date range, and which tools were called)
tool_trace_id (a trace id you can log, debug, or attach to monitoring)
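
As a rough sketch, that response shape can be written down as a TypedDict. The four top-level field names come from the list above. The type annotations and the placeholder values are assumptions for illustration only.

```python
from typing import Any, TypedDict


class DataUsed(TypedDict, total=False):
    tickers: list[str]
    date_range: list[str]      # [start_date, end_date] as ISO strings
    tools_called: list[str]
    tool_calls: int


class AssistantResponse(TypedDict):
    answer: str                # narrative written by the model
    metrics: dict[str, Any]    # numbers computed in Python
    data_used: DataUsed        # audit block
    tool_trace_id: str         # id shared by every log line of the request


# Placeholder values, only to show the shape
example: AssistantResponse = {
    "answer": "(narrative text goes here)",
    "metrics": {"ret_total": -0.02},
    "data_used": {
        "tickers": ["AAPL.US"],
        "date_range": ["2026-02-03", "2026-03-05"],
        "tools_called": ["get_historical_stock_prices", "get_fundamentals_data"],
        "tool_calls": 2,
    },
    "tool_trace_id": "abc123",
}

print(sorted(example.keys()))
```

Writing the shape down once makes it easier to keep the UI, the logs, and the assistant code in agreement about what a response contains.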

This pattern is B2B-friendly for a very practical reason. It reduces hallucinations because the model isn’t doing analysis. It makes numbers repeatable because Python computes them. And it’s easy to audit because you can always show what data was fetched, what window was used, and which tool calls happened.

Step 1: MCP Client Wrapper (`client.py`)

Before we touch any “assistant logic”, we need one thing: a tiny MCP client wrapper that opens MCP sessions to the EODHD MCP server and calls tools reliably. That’s it.

This file does three jobs:

opens a streamable HTTP MCP session
calls a tool with a timeout and a small retry loop
returns the tool output plus a small metadata object we can later attach to logs and traces

Here’s the complete client.py:

import time
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamable_http_client

class EODHDMCP:
    def __init__(self, apikey, base_url=None):
        self.apikey = apikey
        self.base_url = base_url or "https://mcp.eodhd.dev/mcp"
        self._tools = None

    def _url(self):
        return f"{self.base_url}?apikey={self.apikey}"

    def _open(self):
        return streamable_http_client(self._url())

    async def list_tools(self):
        if self._tools is not None:
            return self._tools

        async with self._open() as (read, write, _):
            async with ClientSession(read, write) as s:
                await s.initialize()
                resp = await s.list_tools()
                self._tools = [t.name for t in resp.tools]
                return self._tools

    async def call_tool(self, name, args, trace_id, timeout_s=25, retries=1):
        last = None

        for attempt in range(retries + 1):
            t0 = time.time()
            try:
                async with self._open() as (read, write, _):
                    async with ClientSession(read, write) as s:
                        await s.initialize()
                        out = await asyncio.wait_for(s.call_tool(name, args), timeout=timeout_s)
                        dt = time.time() - t0
                        meta = {"trace_id": trace_id, "tool": name, "args": args, "latency_s": round(dt, 3)}
                        return out, meta
            except Exception as e:
                last = e
                if attempt < retries:
                    await asyncio.sleep(0.25)

        raise last

How this works:

streamable_http_client(self._url()) opens an MCP session over streamable HTTP. The URL includes your API key as a query param, so the server can authenticate.
list_tools() is just a convenience. It asks the server which tools exist and caches the names in memory so you don’t fetch them repeatedly.
call_tool() is the workhorse. It opens a session, initializes it, calls a tool with call_tool(name, args), and wraps the result with a meta object.
That meta object is important later. It lets you trace which tool was called, with which params, how long it took, and which request it belonged to (trace_id).
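
To see the timeout and retry behavior without touching the network, here is a minimal sketch of the same pattern in isolation. The call_with_retry helper and the flaky_tool coroutine are hypothetical stand-ins, not part of client.py or the MCP SDK.

```python
import asyncio
import time


async def call_with_retry(fn, timeout_s=1.0, retries=1, pause_s=0.01):
    """Run the coroutine function `fn` with a timeout and a small retry loop."""
    last = None
    for attempt in range(retries + 1):
        t0 = time.perf_counter()
        try:
            out = await asyncio.wait_for(fn(), timeout=timeout_s)
            meta = {
                "attempts": attempt + 1,
                "latency_s": round(time.perf_counter() - t0, 3),
            }
            return out, meta
        except Exception as e:  # timeouts raised by wait_for land here too
            last = e
            if attempt < retries:
                await asyncio.sleep(pause_s)
    raise last


# Hypothetical flaky tool: fails on the first call, succeeds on the second
calls = {"n": 0}


async def flaky_tool():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    return {"ok": True}


out, meta = asyncio.run(call_with_retry(flaky_tool, retries=1))
print(out, meta["attempts"])
```

With retries=1 the helper makes at most two attempts. If both fail, the last exception is re-raised, which is the same behavior as the raise last line in call_tool().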

Next, we’ll build the core runner in core.py. This is where we parse the user’s request, fetch prices and fundamentals via MCP, compute metrics in Python, and then hand the facts to the LLM for narration.

Step 2: The Assistant Core (`core.py`)

This is where the assistant actually becomes “real”. client.py was just a connector. Here we decide what data to fetch, how much to fetch, how to compute the numbers, and what we hand to the model for narration.

1. Budgets and Trace Logging

When you build anything that calls real tools, you want limits. Not because you don’t trust your code, but because without limits, one messy prompt can easily turn into an expensive, slow request.

In our case, we cap:

how far back we’ll fetch data (MAX_LOOKBACK_DAYS)
how many tool calls we allow per request (MAX_TOOL_CALLS)
how many tickers we’ll accept in one query (MAX_TICKERS)

And we log a few events so we can always debug what happened later.

Here’s the top part of core.py for that:

import json
import re
import time
import uuid
from datetime import date, timedelta
from openai import OpenAI
import numpy as np
import pandas as pd
import asyncio
from client import EODHDMCP

EODHD_API_KEY = "YOUR EODHD API KEY"
MCP_BASE_URL = "https://mcp.eodhd.dev/mcp"

MAX_LOOKBACK_DAYS = 365
MAX_TOOL_CALLS = 6
MAX_TICKERS = 5

mcp = EODHDMCP(EODHD_API_KEY, base_url=MCP_BASE_URL)
oa = OpenAI(api_key = "OPENAI API KEY")
NARRATION_MODEL = "gpt-5.3-chat-latest"

def log_event(event, trace_id, **k):
    payload = {"event": event, "trace_id": trace_id, "ts": round(time.time(), 3)}
    payload.update(k)
    print(json.dumps(payload, default=str))

What’s going on here:

MAX_LOOKBACK_DAYS, MAX_TOOL_CALLS, MAX_TICKERS are basically your safety rails. We’ll enforce them later, right after parsing the user query.
trace_id is a small id we generate per request. Every log line includes it, so when something breaks, you can reconstruct the exact flow for that request.
log_event() prints one JSON line. Nothing fancy – but it’s enough for debugging and it also looks very similar to how real systems emit traces.
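
Here is a small sketch of what “reconstruct the exact flow” can look like in practice. It assumes you captured the printed JSON lines somewhere. The sample lines and the events_for helper are made up for illustration.

```python
import json

# Hypothetical captured stdout from two interleaved requests
log_lines = [
    '{"event": "request_started", "trace_id": "aaa111", "ts": 1.0}',
    '{"event": "request_started", "trace_id": "bbb222", "ts": 1.1}',
    '{"event": "parsed", "trace_id": "aaa111", "ts": 1.2, "tickers": ["AAPL"]}',
    '{"event": "request_failed", "trace_id": "bbb222", "ts": 1.3, "err": "boom"}',
    '{"event": "request_finished", "trace_id": "aaa111", "ts": 1.4, "tool_calls": 2}',
]


def events_for(trace_id, lines):
    """Return the ordered event names logged for one request."""
    events = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("trace_id") == trace_id:
            events.append(rec["event"])
    return events


print(events_for("aaa111", log_lines))
print(events_for("bbb222", log_lines))
```

Because every line is a standalone JSON object, the same filter works on a log file, a notebook cell output, or a log aggregation tool.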

Note: Make sure to replace YOUR EODHD API KEY with your actual EODHD API key. If you don’t have one, you can obtain it by creating an EODHD developer account.

2. Parsing the Request

This part is intentionally not “smart”. We’re not doing NLP. We’re not letting the model interpret the query. We just want to extract two things in a predictable way:

tickers
lookback window

That’s it.

The benefit of keeping it dumb is that the behavior is stable. If the query is messy, we still do something consistent, and the rest of the pipeline remains controllable.

Here are the two functions:

def parse_request(text):
    t = (text or "").upper()

    raw = re.findall(r"\b[A-Z]{1,5}\b", t)

    bad = {
        "I","A","AN","THE","AND","OR","TO","FOR","OF","IN","ON","BY","WITH","ME","WE","US",
        "GIVE","DAY","DAYS","BRIEF","COMPARE","RANK","OVER","LAST","TREND","VOL","VOLATILITY",
        "DRAWDOWN","FLAG","RISKS","RISK","PLUS","MAX","MIN","LOOKBACK"
    }

    tickers = []
    for x in raw:
        if x in bad:
            continue
        if len(x) < 2:
            continue
        if x not in tickers:
            tickers.append(x)

    days = 30

    if "LAST" in t:
        after = t.split("LAST", 1)[1]
        m = re.search(r"\d{1,4}", after)
        if m:
            days = int(m.group(0))
    
    return tickers, days

def enforce_budgets(tickers, lookback_days):
    if lookback_days < 1:
        lookback_days = 1
    if lookback_days > MAX_LOOKBACK_DAYS:
        lookback_days = MAX_LOOKBACK_DAYS

    tickers = tickers[:MAX_TICKERS]

    return tickers, lookback_days

How to read this:

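Because the text is uppercased first, re.findall(r"\b[A-Z]{1,5}\b", t) pulls out every word of one to five letters, not just the tokens the user typed in capitals. That’s our crude “ticker candidate” list.
The bad set is just a blacklist of common words that show up in prompts but are obviously not tickers. Any short word that isn’t in the set will still be treated as a ticker, and single-letter tokens are dropped, which also excludes real one-letter tickers such as F.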
We keep unique tickers in order, because the first ticker becomes the “base” for correlation in the watchlist demo.
Lookback is simple: the default is 30 days. If the query contains “last …”, we grab the first number after “LAST”. That avoids regex edge cases with punctuation.

Then enforce_budgets() clamps everything so one request can’t ask for 500 tickers or a 10-year window.
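
To make the parser’s behavior concrete, here is a condensed copy of parse_request() run against three queries. The logic is the same as above, only compacted. The third query is a made-up example that shows the main limitation: a short word that is not in the blacklist gets picked up as a ticker.

```python
import re

BAD = {
    "I", "A", "AN", "THE", "AND", "OR", "TO", "FOR", "OF", "IN", "ON", "BY", "WITH",
    "ME", "WE", "US", "GIVE", "DAY", "DAYS", "BRIEF", "COMPARE", "RANK", "OVER",
    "LAST", "TREND", "VOL", "VOLATILITY", "DRAWDOWN", "FLAG", "RISKS", "RISK",
    "PLUS", "MAX", "MIN", "LOOKBACK",
}


def parse_request(text):
    t = (text or "").upper()
    tickers = []
    for x in re.findall(r"\b[A-Z]{1,5}\b", t):
        # skip blacklisted words, single letters, and duplicates
        if x in BAD or len(x) < 2 or x in tickers:
            continue
        tickers.append(x)

    days = 30  # default lookback
    if "LAST" in t:
        m = re.search(r"\d{1,4}", t.split("LAST", 1)[1])
        if m:
            days = int(m.group(0))
    return tickers, days


print(parse_request("Give me a 30-day brief for AAPL"))
# (['AAPL'], 30)
print(parse_request("Compare TSLA, NVDA, AMZN over the last 60 days"))
# (['TSLA', 'NVDA', 'AMZN'], 60)
print(parse_request("Show me MSFT for the last 90 days"))
# (['SHOW', 'MSFT'], 90)
```

If that kind of false positive matters for your product, one option is to validate candidates against a known symbol list before fetching anything.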

Next, we’ll wire these parsed values into a request state and start making actual MCP calls for prices and fundamentals.

3. Tool Wrappers: Prices and Fundamentals

Now we’re at the point where the assistant actually touches data.

These two functions follow the same pattern for two different kinds of data:

fetch_prices() calls the historical prices tool on the EODHD MCP server, then normalizes the output into a tiny DataFrame with just date and price.
fetch_fundamentals() calls the fundamentals tool on the EODHD MCP server.

We also keep a small state object per request. It tracks tool calls and keeps a trace of what was called. That’s how we later produce the data_used block in the final response.

Here’s the code:

def new_state():
    return {"tool_calls": 0, "tool_trace": [], "rows": {}}

def _bump(state, meta):
    state["tool_calls"] += 1
    state["tool_trace"].append(meta)
    if state["tool_calls"] > MAX_TOOL_CALLS:
        raise RuntimeError("tool call budget exceeded")

def _as_json_text(out):
    if isinstance(out, str):
        return out
    if hasattr(out, "content"):
        try:
            return out.content[0].text
        except Exception:
            pass
    return str(out)

async def fetch_prices(ticker, start_date, end_date, trace_id, state):
    args = {
        "ticker": ticker,
        "start_date": start_date,
        "end_date": end_date,
        "period": "d",
        "order": "a",
        "fmt": "json",
    }

    out, meta = await mcp.call_tool("get_historical_stock_prices", args, trace_id)
    txt = _as_json_text(out)

    _bump(state, meta)

    data = json.loads(txt)
    if isinstance(data, dict) and data.get("error"):
        raise RuntimeError(data["error"])

    df = pd.DataFrame(data)
    if df.empty:
        return df

    cols = [c for c in ["date", "adjusted_close", "close"] if c in df.columns]
    df = df[cols].copy()

    if "adjusted_close" in df.columns:
        df = df.rename(columns={"adjusted_close": "price"})
    elif "close" in df.columns:
        df = df.rename(columns={"close": "price"})
    else:
        return pd.DataFrame()

    df["ticker"] = ticker

    state["rows"][f"{meta['tool']}:{ticker}"] = len(df)
    return df

async def fetch_fundamentals(ticker, trace_id, state):
    args = {
        "ticker": ticker,
        "include_financials": False,
        "fmt": "json",
    }

    out, meta = await mcp.call_tool("get_fundamentals_data", args, trace_id)
    txt = _as_json_text(out)

    _bump(state, meta)

    data = json.loads(txt)
    if isinstance(data, dict) and data.get("error"):
        raise RuntimeError(data["error"])

    return data

What’s happening here:

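_bump() is the budget guard. Every time we make a tool call, we increment the counter and store the tool metadata. If we cross the budget, we fail fast. Because _bump() runs after the call returns, the call that crosses the limit has already been made, so MAX_TOOL_CALLS marks the point where the request stops rather than a hard cap on calls sent.
meta comes from client.py. It contains trace_id, tool, args, and latency_s. That’s enough to trace “what did we call and how long did it take”.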
_as_json_text() is there because the tool results returned by the MCP server are not always plain strings. Sometimes it’s an object with .content. This helper just tries to extract the text cleanly.
In fetch_prices(), we intentionally keep only date and price. That’s not because OHLC is useless. It’s because this tutorial’s metrics only need adjusted closes. Fewer columns means simpler code, smaller payloads, and fewer chances to break.
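
Here is a small sketch of what the text-extraction helper is dealing with. The fake_result object is a hypothetical stand-in built with SimpleNamespace. It only mimics the shape this tutorial relies on, an object whose .content is a list of blocks with a .text field.

```python
import json
from types import SimpleNamespace


def as_json_text(out):
    """Same logic as _as_json_text() above, repeated so this runs on its own."""
    if isinstance(out, str):
        return out
    if hasattr(out, "content"):
        try:
            return out.content[0].text
        except Exception:
            pass
    return str(out)


# Hypothetical stand-in for a tool result
fake_result = SimpleNamespace(
    content=[SimpleNamespace(text='[{"date": "2026-03-04", "adjusted_close": 101.5}]')]
)

rows = json.loads(as_json_text(fake_result))
print(rows[0]["date"], rows[0]["adjusted_close"])

# A plain string passes through unchanged
plain = as_json_text('{"error": "not found"}')

# An empty content list falls through to str(out) instead of raising
fallback = as_json_text(SimpleNamespace(content=[]))
print(type(fallback).__name__)
```

Note that the fallback returns a string that is not valid JSON, so json.loads() in the fetch functions would raise on it. That error is caught by the try/except in run_assistant() and reported as a failed request.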

Next, we’ll compute the actual metrics. This is where the assistant stops being “an API caller” and starts producing something useful.

4. Deterministic Metrics

This is the most important design choice in the whole build. The model never computes numbers. Python does.

So for every ticker, we compute a small set of metrics that are easy to explain and are actually useful in a “market brief” style output:

total return over the window
realized volatility (daily and annualized)
max drawdown (worst peak-to-trough fall)
a simple trend slope (so we can say “mild uptrend” or “downtrend” without vibes)
a lightweight regime label (low, mid, high volatility)

Here’s the code:

def compute_metrics(prices_df):
    if prices_df is None or prices_df.empty:
        return {}

    df = prices_df.copy()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"]).sort_values("date")

    close = pd.to_numeric(df["price"], errors="coerce").dropna()
    if close.empty:
        return {}

    rets = close.pct_change().dropna()

    out = {}

    # realized vol (daily), annualize with sqrt(252)
    if not rets.empty:
        out["vol_daily"] = float(rets.std())
        out["vol_annualized"] = float(rets.std() * np.sqrt(252))
        out["ret_total"] = float((close.iloc[-1] / close.iloc[0]) - 1.0)

    # max drawdown
    peak = close.cummax()
    dd = (close / peak) - 1.0
    out["max_drawdown"] = float(dd.min())

    # simple trend score
    logp = np.log(close.values)
    x = np.arange(len(logp))
    if len(logp) >= 3:
        slope = np.polyfit(x, logp, 1)[0]
        out["trend_slope"] = float(slope)
    else:
        out["trend_slope"] = 0.0

    # basic helpers
    out["n_points"] = int(len(close))
    out["start_close"] = float(close.iloc[0])
    out["end_close"] = float(close.iloc[-1])

    return out

def compute_regime(prices_df, window=20):
    # cheap regime label, based on rolling vol percentile
    if prices_df is None or prices_df.empty:
        return {"regime": "unknown"}

    df = prices_df.copy()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"]).sort_values("date")

    close = pd.to_numeric(df["price"], errors="coerce").dropna()
    if close.empty:
        return {"regime": "unknown"}

    rets = close.pct_change()
    rv = rets.rolling(window).std()

    last = rv.dropna()
    if last.empty:
        return {"regime": "unknown"}

    cur = float(last.iloc[-1])
    p80 = float(last.quantile(0.8))
    p50 = float(last.quantile(0.5))

    if cur >= p80:
        reg = "high_vol"
    elif cur >= p50:
        reg = "mid_vol"
    else:
        reg = "low_vol"

    return {"regime": reg, "rolling_vol": cur, "window": int(window)}

How to think about these calculations:

Total return is just end / start - 1. It’s the simplest “did it go up or down” number.
Volatility here is realized volatility of daily returns. That’s just the standard deviation of daily % changes. We annualize it using sqrt(252) because there are roughly 252 trading days in a year.
Max drawdown tells you how bad the worst dip was during the window. It’s often more meaningful than return when you’re writing a quick risk note.
Trend slope is intentionally simple. We fit a straight line to log prices. If the slope is positive, it’s generally drifting up. If it’s negative, it’s drifting down.
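Regime label is not a fancy model. It just says “compared to its own recent rolling volatility, are we currently in a high, medium, or low vol phase”. One caveat: the percentiles are taken over the rolling values inside the requested window, so short lookbacks give the comparison very little history. With a 20-day rolling window and about 21 prices, there is only one rolling value, it equals its own 80th percentile, and the label comes out as high_vol by construction.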

The main point is this: these numbers are deterministic. If the assistant says “max drawdown was -13%”, you can trace it back to the exact adjusted close series that produced it.
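
To make the arithmetic concrete, here is a worked example on a tiny made-up price series, using only the standard library. The four prices are invented for illustration. statistics.stdev() is the sample standard deviation, which matches the pandas default used in compute_metrics(). The slope is an ordinary least-squares fit on log prices, written out by hand instead of calling np.polyfit.

```python
import math
import statistics

prices = [100.0, 110.0, 99.0, 108.9]  # hypothetical adjusted closes

# Daily returns: roughly +10%, -10%, +10%
rets = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]

ret_total = prices[-1] / prices[0] - 1.0
vol_daily = statistics.stdev(rets)
vol_annualized = vol_daily * math.sqrt(252)

# Max drawdown: worst drop from the running peak
peak = prices[0]
max_drawdown = 0.0
for p in prices:
    peak = max(peak, p)
    max_drawdown = min(max_drawdown, p / peak - 1.0)

# Trend slope: least-squares slope of log price against the day index
logp = [math.log(p) for p in prices]
xs = list(range(len(logp)))
mx, my = sum(xs) / len(xs), sum(logp) / len(logp)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, logp)) / sum((x - mx) ** 2 for x in xs)

print(round(ret_total, 4))      # 0.089
print(round(vol_daily, 4))      # 0.1155
print(round(max_drawdown, 4))   # -0.1
print(slope > 0)                # True
```

The series ends 8.9% higher, but the worst peak-to-trough fall was 10% (from 110 to 99). That gap between return and drawdown is exactly why both numbers go into the facts object.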

Next, we’ll handle the watchlist side. That means aligning returns across tickers, computing correlation, and generating a ranked snapshot.

5. Watchlist Utilities

Once you have more than one ticker, you want two extra things:

a quick ranking so you can say “this is the riskiest name in the basket”
a correlation snapshot so you can see what’s moving together

The only “gotcha” with correlation is dates. If TSLA has 41 price points and NVDA has 39 because of missing days, you can’t just correlate blindly. You need the returns lined up on the same dates first. That’s what align_returns() does.

Here’s the code:

def align_returns(price_frames):
    if not price_frames:
        return pd.DataFrame()

    parts = []
    for df in price_frames:
        if df is None or df.empty:
            continue
        x = df.copy()
        x["date"] = pd.to_datetime(x["date"], errors="coerce")
        x = x.dropna(subset=["date"])
        x["price"] = pd.to_numeric(x["price"], errors="coerce")
        x = x.dropna(subset=["price"])
        x = x.sort_values("date")
        x["ret"] = x["price"].pct_change()
        x = x.dropna(subset=["ret"])
        parts.append(x[["date", "ticker", "ret"]])

    if not parts:
        return pd.DataFrame()

    allr = pd.concat(parts, ignore_index=True)
    wide = allr.pivot(index="date", columns="ticker", values="ret").dropna(how="any")
    return wide


def corr_summary(ret_wide, base_ticker, top_n=3):
    if ret_wide is None or ret_wide.empty:
        return []

    if base_ticker not in ret_wide.columns:
        return []

    c = ret_wide.corr()[base_ticker].dropna()
    c = c.drop(labels=[base_ticker], errors="ignore")
    if c.empty:
        return []

    out = []
    for k, v in c.sort_values(ascending=False).head(top_n).items():
        out.append({"ticker": k, "corr": float(v)})

    return out


def rank_watchlist(metrics_by_ticker):
    rows = []
    for t, m in metrics_by_ticker.items():
        if not m:
            continue
        rows.append({
            "ticker": t,
            "vol_annualized": m.get("vol_annualized"),
            "max_drawdown": m.get("max_drawdown"),
            "ret_total": m.get("ret_total"),
            "trend_slope": m.get("trend_slope"),
        })

    if not rows:
        return pd.DataFrame()

    df = pd.DataFrame(rows)
    df = df.sort_values(["vol_annualized", "max_drawdown"], ascending=[False, True])
    return df.reset_index(drop=True)

What’s happening here:

align_returns() takes a list of price DataFrames, computes daily returns for each, then pivots them into a wide table like: date -> TSLA.US, NVDA.US, AMZN.US.
We drop rows where any ticker is missing, because correlation only makes sense when the returns are aligned on the same dates.
corr_summary() is a compact “who moves with whom” helper. We pick one base ticker, compute correlations against everything else, then grab the top few. For a watchlist widget, that’s usually enough.
rank_watchlist() is the ranking logic for the snapshot. We sort primarily by annualized volatility, and use drawdown as a secondary risk indicator. You could choose different ranking logic. The point is to keep it deterministic and explainable.
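
Here is the date-alignment idea in miniature, using plain dictionaries instead of a pivot table. The tickers A and B and their returns are made up. B is deliberately set to exactly twice A on the shared dates, so the correlation should come out as 1.

```python
import math

# Hypothetical daily returns keyed by date; B has no row for 2026-01-07
rets = {
    "A": {"2026-01-06": 0.01, "2026-01-07": -0.02, "2026-01-08": 0.03, "2026-01-09": -0.01},
    "B": {"2026-01-06": 0.02, "2026-01-08": 0.06, "2026-01-09": -0.02},
}


def aligned(rets, a, b):
    """Keep only dates present in both series, like dropna(how='any') on the wide table."""
    common = sorted(set(rets[a]) & set(rets[b]))
    return [rets[a][d] for d in common], [rets[b][d] for d in common]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


xa, xb = aligned(rets, "A", "B")
print(len(xa), round(pearson(xa, xb), 3))
# 3 1.0
```

Only three of A’s four dates survive the alignment. Pairing the series by position instead of by date would have matched A’s return on 2026-01-07 with B’s return on 2026-01-08 and produced a meaningless number.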

Next, we’ll build the facts objects and narration layer. That’s where we enforce the “model is just a narrator” contract.

6. Facts Object and Narration

This is where the “narrator pattern” becomes real.

Up to this point, we’ve done everything with MCP and Python. We fetched prices and fundamentals from EODHD, we computed metrics, and we aligned returns. Now we need one clean object that represents “the truth” for this request.

That’s what the facts object is.

The rule is simple.

facts contains only things we actually fetched or computed.
The model never sees raw market data. It sees the cleaned facts.
The model is told to write using only those facts, and not to invent any numbers.

Here are the functions that build those facts objects for the two demos, plus the narration function.

def build_facts_single(ticker, lookback_days, metrics, regime, fundamentals):
    # keep this compact. LLM will narrate from this later
    out = {
        "type": "single_ticker_brief",
        "ticker": ticker,
        "lookback_days": int(lookback_days),
        "metrics": metrics,
        "regime": regime,
    }

    if isinstance(fundamentals, dict):
        gen = fundamentals.get("General", {}) or {}
        hi = fundamentals.get("Highlights", {}) or {}
        val = fundamentals.get("Valuation", {}) or {}
        tech = fundamentals.get("Technicals", {}) or {}

        base = {
            "name": gen.get("Name"),
            "exchange": gen.get("Exchange"),
            "sector": gen.get("Sector"),
            "industry": gen.get("Industry"),
        }

        # named fund_metrics so it doesn't shadow the `metrics` argument
        fund_metrics = {
            "market_cap": hi.get("MarketCapitalization"),
            "pe": hi.get("PERatio") or val.get("TrailingPE") or val.get("PERatio"),
            "beta": tech.get("Beta"),
            "div_yield": hi.get("DividendYield"),
        }

        out["fundamentals"] = {k: v for k, v in {**base, **fund_metrics}.items() if v is not None}

    return out


def build_facts_watchlist(tickers, lookback_days, rank_df, corr_bits, metrics_by_ticker):
    out = {
        "type": "watchlist_snapshot",
        "tickers": tickers,
        "lookback_days": int(lookback_days),
        "ranking": rank_df.to_dict(orient="records") if isinstance(rank_df, pd.DataFrame) else [],
        "correlation": corr_bits,
        "metrics_by_ticker": metrics_by_ticker,
    }
    return out


def narrate(facts):
    prompt = (
        "Write a short, product-ready market note using ONLY the facts below.\n"
        "No guessing. No extra numbers. If something is missing, say it's missing.\n"
        "Keep it tight and readable.\n\n"
        f"FACTS:\n{json.dumps(facts, indent=2, default=str)}"
    )

    r = oa.responses.create(
        model=NARRATION_MODEL,
        input=[{"role": "user", "content": prompt}],
    )

    try:
        return r.output_text
    except Exception:
        return str(r)

What’s happening here:

build_facts_single() takes the ticker, window, computed metrics, the vol regime label, and the fundamentals payload. But it doesn’t dump the entire fundamentals JSON. It picks a handful of fields from the General, Highlights, Valuation, and Technicals sections and only keeps what exists. That keeps the prompt tight and the output predictable.
build_facts_watchlist() is the same idea but for multiple tickers. It passes the ranking table, correlation notes, and per-ticker metrics.
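narrate() is basically “convert this facts object into human-friendly text”. The prompt is strict on purpose. Because the model only sees these facts, it has far less room to invent numbers. The prompt is a constraint rather than a guarantee, though, which is why the response also returns the computed metrics alongside the text.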

One small implementation detail: narrate() is a normal blocking function, while everything else is async. That’s why later, inside run_assistant(), we call it with await asyncio.to_thread(...) so it doesn’t block the async flow.
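
Here is a minimal sketch of why asyncio.to_thread() matters. slow_narrate is a hypothetical stand-in that sleeps instead of calling a model, and heartbeat is a made-up coroutine that can only make progress if the event loop stays free.

```python
import asyncio
import time


def slow_narrate(facts):
    """Stand-in for a blocking API call (sleeps instead of calling a model)."""
    time.sleep(0.2)
    return f"note for {facts['ticker']}"


async def heartbeat(ticks):
    # Each tick is recorded only if the event loop is not blocked
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main():
    ticks = []
    answer, _ = await asyncio.gather(
        asyncio.to_thread(slow_narrate, {"ticker": "AAPL.US"}),
        heartbeat(ticks),
    )
    return answer, len(ticks)


answer, n_ticks = asyncio.run(main())
print(answer, n_ticks)
# note for AAPL.US 3
```

The blocking call runs in a worker thread while the heartbeat keeps ticking on the event loop. Calling slow_narrate() directly inside a coroutine would freeze every other task for the duration of the call.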

7. The Orchestration Function (`run_assistant()`)

This is the piece that ties everything together. It does four things in order:

create a trace id and log the request
parse tickers and lookback, then clamp them to budgets
fetch EODHD data via MCP and compute metrics in Python
call the model to narrate the facts, then return a structured response

Here’s the function:

def _dates_from_lookback(lookback_days):
    end = date.today()
    start = end - timedelta(days=int(lookback_days))
    return start.isoformat(), end.isoformat()

async def run_assistant(user_text, mode="auto"):
    trace_id = uuid.uuid4().hex[:10]
    log_event("request_started", trace_id, text=user_text, mode=mode)

    tickers, lookback = parse_request(user_text)
    tickers, lookback = enforce_budgets(tickers, lookback)

    if not tickers:
        return {
            "answer": "no tickers found in request",
            "metrics": {},
            "data_used": {},
            "tool_trace_id": trace_id,
        }

    log_event("parsed", trace_id, tickers=tickers, lookback_days=lookback)
    
    start_date, end_date = _dates_from_lookback(lookback)
    state = new_state()
        
    if mode == "auto":
        mode = "watchlist" if len(tickers) > 1 else "single"

    try:
        if mode == "single":
            t = tickers[0]
            t_full = t if "." in t else f"{t}.US"

            log_event("tool_phase", trace_id, mode="single", ticker=t_full, start_date=start_date, end_date=end_date)

            prices = await fetch_prices(t_full, start_date, end_date, trace_id, state)
            metrics = compute_metrics(prices)
            regime = compute_regime(prices)

            fundamentals = await fetch_fundamentals(t_full, trace_id, state)

            facts = build_facts_single(t_full, lookback, metrics, regime, fundamentals)
            answer = await asyncio.to_thread(narrate, facts)

            resp = {
                "answer": answer,
                "metrics": metrics,
                "data_used": {
                    "tickers": [t_full],
                    "date_range": [start_date, end_date],
                    "tools_called": [x.get("tool") for x in state["tool_trace"]],
                    "tool_calls": state["tool_calls"],
                },
                "tool_trace_id": trace_id,
            }

            log_event("request_finished", trace_id, tool_calls=state["tool_calls"])
            return resp

        # watchlist
        full = [x if "." in x else f"{x}.US" for x in tickers]

        log_event("tool_phase", trace_id, mode="watchlist", tickers=full, start_date=start_date, end_date=end_date)

        frames = []
        metrics_by = {}

        for t in full:
            prices = await fetch_prices(t, start_date, end_date, trace_id, state)
            frames.append(prices)
            metrics_by[t] = compute_metrics(prices)

        ret_wide = align_returns(frames)

        base = full[0]
        corr_bits = []
        top = corr_summary(ret_wide, base, top_n=3)
        if top:
            corr_bits.append({"base": base, "top": top})

        rank_df = rank_watchlist(metrics_by)
        facts = build_facts_watchlist(full, lookback, rank_df, corr_bits, metrics_by)
        answer = await asyncio.to_thread(narrate, facts)

        resp = {
            "answer": answer,
            "metrics": {"by_ticker": metrics_by},
            "data_used": {
                "tickers": full,
                "date_range": [start_date, end_date],
                "tools_called": [x.get("tool") for x in state["tool_trace"]],
                "tool_calls": state["tool_calls"],
            },
            "tool_trace_id": trace_id,
        }

        log_event("request_finished", trace_id, tool_calls=state["tool_calls"])
        return resp

    except Exception as e:
        detail = repr(e)
        if hasattr(e, "exceptions"):
            detail = detail + " | " + " ; ".join([repr(x) for x in e.exceptions])

        log_event("request_failed", trace_id, err=detail)
        
        return {
            "answer": f"failed: {e}",
            "metrics": {},
            "data_used": {
                "tickers": tickers,
                "date_range": [start_date, end_date],
                "tools_called": [x.get("tool") for x in state["tool_trace"]],
                "tool_calls": state["tool_calls"],
            },
            "tool_trace_id": trace_id,
        }

This function is the glue. It creates a trace_id, logs the request, extracts tickers and a lookback window, then clamps both to your budgets so the assistant can’t over-fetch or spam tool calls.

After that, it turns the lookback into a start_date and end_date, initializes a fresh state, and picks a mode. In single mode, it fetches prices and fundamentals for one ticker via EODHD’s MCP tools, computes the metrics in Python, packs everything into a facts object, and asks the LLM to only narrate those facts. In watchlist mode it does the same across multiple tickers, then aligns returns so correlation is computed on matching dates, and builds a ranked snapshot.

The response is always structured the same way. You get the narrative answer, the raw computed metrics, a data_used block that shows tickers, date range, and tools called, plus a tool_trace_id so you can trace any output back to logs.

That structure is the difference between “a chat response” and “a shippable assistant output”. You can plug the same response into a UI card, a Slack alert, or a dashboard without changing anything.
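
As an illustration of that last point, here is a small sketch of a consumer that turns a response into a plain-text card. The render_card function and its layout are hypothetical. The only thing it relies on is the four response keys described above.

```python
def render_card(resp):
    """Format a response dict as a short plain-text card (layout is illustrative)."""
    used = resp.get("data_used") or {}
    tickers = ", ".join(used.get("tickers", [])) or "n/a"
    date_range = " to ".join(used.get("date_range", [])) or "n/a"
    return "\n".join([
        f"[{tickers}] {date_range}",
        resp.get("answer", ""),
        f"tool calls: {used.get('tool_calls', 0)} | trace: {resp.get('tool_trace_id', '')}",
    ])


# Placeholder response with the same keys run_assistant() returns
ok_resp = {
    "answer": "(narrative text)",
    "metrics": {},
    "data_used": {
        "tickers": ["AAPL.US"],
        "date_range": ["2026-02-03", "2026-03-05"],
        "tools_called": ["get_historical_stock_prices", "get_fundamentals_data"],
        "tool_calls": 2,
    },
    "tool_trace_id": "abc123",
}

# Shape returned when no tickers are found: data_used is empty
empty_resp = {
    "answer": "no tickers found in request",
    "metrics": {},
    "data_used": {},
    "tool_trace_id": "def456",
}

print(render_card(ok_resp))
print(render_card(empty_resp))
```

The same function handles the success path and the “no tickers” path, because both return the same top-level keys.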

Demo 1: Market Brief for One Ticker

Let’s start with the simplest flow. One ticker, one lookback window, and a market brief that looks like something you could show inside a product.

Prompt used:

“Give me a 30-day brief for AAPL. trend, volatility, max drawdown, plus 3 fundamental highlights.”

Code (Jupyter Notebook):

import asyncio
import json
from core import run_assistant

q1 = "Give me a 30-day brief for AAPL. trend, volatility, max drawdown, plus 3 fundamental highlights."

r1 = await run_assistant(q1, mode="single")
print(json.dumps(r1, indent=2, ensure_ascii=False))

Output:

{"event": "request_started", "trace_id": "2af550173f", "ts": 1772735388.777, "text": "Give me a 30-day brief for AAPL. trend, volatility, max drawdown, plus 3 fundamental highlights.", "mode": "single"}
{"event": "parsed", "trace_id": "2af550173f", "ts": 1772735388.778, "tickers": ["AAPL"], "lookback_days": 30}
{"event": "tool_phase", "trace_id": "2af550173f", "ts": 1772735388.778, "mode": "single", "ticker": "AAPL.US", "start_date": "2026-02-03", "end_date": "2026-03-05"}
{"event": "request_finished", "trace_id": "2af550173f", "ts": 1772735404.392, "tool_calls": 2}
{
  "answer": "Apple Inc (AAPL.US) | NASDAQ | Technology — Consumer Electronics\n
\nOver the past 30 days, Apple shares declined 2.58%, falling from 269.48 to 
262.52 across 21 trading observations. The trend slope over the period was 
negative (-0.00175), indicating a modest downward drift.\n\nRealized daily 
volatility was 1.93%, equivalent to about 30.65% annualized. The stock is currently 
classified in a high‑volatility regime based on a 20‑day rolling volatility measure.
\n\nMaximum drawdown during the period reached -8.03%.\n\nAdditional fundamentals 
or valuation metrics were not provided.",
  "metrics": {
    "vol_daily": 0.01930981768788001,
    "vol_annualized": 0.3065338527847606,
    "ret_total": -0.02582751966750796,
    "max_drawdown": -0.08032503955127279,
    "trend_slope": -0.0017498633497641184,
    "n_points": 21,
    "start_close": 269.48,
    "end_close": 262.52
  },
  "data_used": {
    "tickers": [
      "AAPL.US"
    ],
    "date_range": [
      "2026-02-03",
      "2026-03-05"
    ],
    "tools_called": [
      "get_historical_stock_prices",
      "get_fundamentals_data"
    ],
    "tool_calls": 2
  },
  "tool_trace_id": "2af550173f"
}

First, you’ll see the log events. They’re not part of the final response. They’re just the trace trail.

request_started shows the raw prompt and that we forced mode="single".
parsed confirms the parser extracted AAPL and a 30-day lookback.
tool_phase shows what we actually fetched: AAPL.US from 2026-02-03 to 2026-03-05.
request_finished confirms we made exactly 2 tool calls.

Now the actual response JSON:

answer is the narrative. In this run it summarizes:

return of -2.58% (269.48 to 262.52)
21 price observations in that window
negative trend slope (-0.00175) meaning mild downward drift
daily vol 1.93% and annualized vol 30.65%
max drawdown -8.03%
and it labels the regime as high volatility using the rolling vol logic.

metrics is where those numbers come from. This is the deterministic part. ret_total, vol_daily, vol_annualized, max_drawdown, and trend_slope were computed directly from the fetched closes. start_close, end_close, and n_points explain the exact series used.
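
You can check that claim by hand. The sketch below copies a few values from the response above and recomputes the total return, the annualized volatility, and the start date with nothing but the standard library.

```python
import math
from datetime import date, timedelta

# Values copied from the Demo 1 response
start_close, end_close = 269.48, 262.52
vol_daily = 0.01930981768788001
reported_ret = -0.02582751966750796
reported_vol_ann = 0.3065338527847606

ret_total = end_close / start_close - 1.0
vol_annualized = vol_daily * math.sqrt(252)

print(round(ret_total, 4))       # -0.0258
print(round(vol_annualized, 4))  # 0.3065

# The window is 30 calendar days back from the end date in data_used
start = date(2026, 3, 5) - timedelta(days=30)
print(start.isoformat())         # 2026-02-03
```

The 30-day lookback is counted in calendar days, which is why the window contains 21 price observations rather than 30.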

data_used is the audit block for this specific output. It shows:

ticker normalized to AAPL.US
the exact date range pulled
the exact tools called on the MCP server: get_historical_stock_prices and get_fundamentals_data
and again, tool_calls: 2 so you can quickly spot runaway calls.

tool_trace_id (2af550173f) is your handle for debugging. Every log line above carries the same id, so you can trace this brief back to the exact tool calls and parameters.

Demo 2: Watchlist Snapshot

Now let’s switch to the watchlist flow. Same assistant core. The only difference is we pass multiple tickers and a longer window, so the output becomes a comparative risk snapshot.

Prompt used:

“Compare TSLA, NVDA, AMZN over the last 60 days. rank by volatility and drawdown, and flag valuation risks.”

Code:

q2 = "Compare TSLA, NVDA, AMZN over the last 60 days. rank by volatility and drawdown, and flag valuation risks."

r2 = await run_assistant(q2, mode="watchlist")
print(json.dumps(r2, indent=2, ensure_ascii=False))

Output:

{"event": "request_started", "trace_id": "1b67bb47d6", "ts": 1772735404.394, "text": "Compare TSLA, NVDA, AMZN over the last 60 days. rank by volatility and drawdown, and flag valuation risks.", "mode": "watchlist"}
{"event": "parsed", "trace_id": "1b67bb47d6", "ts": 1772735404.394, "tickers": ["TSLA", "NVDA", "AMZN"], "lookback_days": 60}
{"event": "tool_phase", "trace_id": "1b67bb47d6", "ts": 1772735404.394, "mode": "watchlist", "tickers": ["TSLA.US", "NVDA.US", "AMZN.US"], "start_date": "2026-01-05", "end_date": "2026-03-06"}
{"event": "request_finished", "trace_id": "1b67bb47d6", "ts": 1772735423.004, "tool_calls": 3}
{
  "answer": "Market Watchlist Snapshot (last 60 days)\n\nAll three names show 
negative total returns and downward trend slopes over the period.\n\nNVDA.US 
ranks highest in the group despite a small decline. Total return is -0.027. 
Price moved from 188.12 to 183.04 across 41 observations. Annualized volatility is 
0.3808 and maximum drawdown is -0.107.\n\nTSLA.US shows the second‑highest volatility 
profile with annualized volatility of 0.3561. Total return is -0.101, with price 
falling from 451.67 to 405.94. Maximum drawdown reached -0.131. Trend slope is negative.
\n\nAMZN.US has the lowest volatility in the set (annualized 0.3196) but the deepest 
drawdown at -0.196. Total return is -0.0697, with price moving from 233.06 to 
216.82. Trend slope is also negative.\n\nCorrelation: TSLA shows a stronger 
relationship with NVDA (0.533) than with AMZN (0.177).\n\nMissing from the 
data: trading volume, catalysts, sector context, and forward-looking indicators.",
  "metrics": {
    "by_ticker": {
      "TSLA.US": {
        "vol_daily": 0.02243518393199404,
        "vol_annualized": 0.3561475038122908,
        "ret_total": -0.10124648526579139,
        "max_drawdown": -0.13115770363318358,
        "trend_slope": -0.0026452119688441023,
        "n_points": 41,
        "start_close": 451.67,
        "end_close": 405.94
      },
      "NVDA.US": {
        "vol_daily": 0.023987861378298222,
        "vol_annualized": 0.3807954941476091,
        "ret_total": -0.027004039974484417,
        "max_drawdown": -0.10716326424601319,
        "trend_slope": -4.3573704505466623e-05,
        "n_points": 41,
        "start_close": 188.12,
        "end_close": 183.04
      },
      "AMZN.US": {
        "vol_daily": 0.020129905817481322,
        "vol_annualized": 0.31955234824924766,
        "ret_total": -0.06968162704882863,
        "max_drawdown": -0.1964184655186353,
        "trend_slope": -0.00520436173926906,
        "n_points": 41,
        "start_close": 233.06,
        "end_close": 216.82
      }
    }
  },
  "data_used": {
    "tickers": [
      "TSLA.US",
      "NVDA.US",
      "AMZN.US"
    ],
    "date_range": [
      "2026-01-05",
      "2026-03-06"
    ],
    "tools_called": [
      "get_historical_stock_prices",
      "get_historical_stock_prices",
      "get_historical_stock_prices"
    ],
    "tool_calls": 3
  },
  "tool_trace_id": "1b67bb47d6"
}

The logs show the assistant correctly extracted TSLA, NVDA, AMZN and a 60-day lookback, then fetched TSLA.US, NVDA.US, and AMZN.US from 2026-01-05 to 2026-03-06. Since this is a watchlist request, it made exactly 3 tool calls: one get_historical_stock_prices call per ticker.

Inside answer, the model is basically summarizing what Python computed. In this run, all three names had negative returns and negative trend slopes.

NVDA had the highest annualized volatility at 0.3808 with a relatively small decline of -2.7%.
TSLA was next in volatility (0.3561) with a larger decline (-10.1%) and drawdown of about -13.1%.
AMZN had the lowest volatility (0.3196) but the deepest drawdown at around -19.6%.
The answer also includes a correlation note derived from the aligned returns table: TSLA’s return series correlated more with NVDA (0.533) than with AMZN (0.177) in this window.

metrics.by_ticker is where the snapshot really lives. It contains the full computed metric set per ticker, including observation count (n_points=41) and the start and end closes used for the return calculation. data_used shows exactly what we fetched, including the tickers, the date range, and the three price tool calls. And tool_trace_id is the id that links this output back to the full trace logs.
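As a quick sanity check, the two simplest metrics can be reproduced from the numbers in the snapshot itself. The sketch below assumes ret_total is end_close / start_close - 1 and that daily volatility is annualized with the square root of 252 trading days. Both assumptions are inferred from the output above, not taken from the assistant's source, and the helper names are illustrative.

```python
import math

TRADING_DAYS = 252  # assumed annualization factor; it is consistent with the output above


def total_return(start_close: float, end_close: float) -> float:
    """Simple return over the window, from the first and last close."""
    return end_close / start_close - 1.0


def annualize_vol(vol_daily: float, periods: int = TRADING_DAYS) -> float:
    """Scale a daily standard deviation of returns to an annual figure."""
    return vol_daily * math.sqrt(periods)


# Values copied from metrics.by_ticker["TSLA.US"] in the output above.
tsla = {"vol_daily": 0.02243518393199404, "start_close": 451.67, "end_close": 405.94}

print(round(total_return(tsla["start_close"], tsla["end_close"]), 4))  # -0.1012
print(round(annualize_vol(tsla["vol_daily"]), 4))                      # 0.3561
```

If a recomputed value ever disagrees with the reported one, that points to a change in the fetched series or in the metric definition, which is exactly what the audit fields are meant to surface.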

So how would a product team use this? Well, this output is already shaped like a widget backend. You can render the ranking as a watchlist “risk card”, show the top volatility and drawdown names, and drop the narrative into a compact summary box. Since you also get deterministic metrics, you can build UI elements without parsing text, and still keep the narration as a layer on top.
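To make the "no text parsing" point concrete, here is a minimal sketch of how a widget backend might derive the two rankings directly from metrics.by_ticker. The rank_by helper is hypothetical, and the values are rounded copies of the numbers in the output above.

```python
# Rounded values taken from metrics.by_ticker in the output above.
metrics_by_ticker = {
    "TSLA.US": {"vol_annualized": 0.3561, "max_drawdown": -0.1312},
    "NVDA.US": {"vol_annualized": 0.3808, "max_drawdown": -0.1072},
    "AMZN.US": {"vol_annualized": 0.3196, "max_drawdown": -0.1964},
}


def rank_by(metrics: dict, key: str, reverse: bool) -> list:
    """Return tickers ordered by one metric (hypothetical helper)."""
    return sorted(metrics, key=lambda ticker: metrics[ticker][key], reverse=reverse)


# Highest volatility first.
by_volatility = rank_by(metrics_by_ticker, "vol_annualized", reverse=True)
# Deepest drawdown first: drawdowns are negative, so sort ascending.
by_drawdown = rank_by(metrics_by_ticker, "max_drawdown", reverse=False)

print(by_volatility)  # ['NVDA.US', 'TSLA.US', 'AMZN.US']
print(by_drawdown)    # ['AMZN.US', 'TSLA.US', 'NVDA.US']
```

The narrative in answer can then sit next to these lists as a caption, without the UI ever depending on its wording.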

What Makes This Shippable, and What Can Be Improved?

The core reason this works in a real product setting is that the numbers are deterministic. Prices and fundamentals come from EODHD via MCP, metrics are computed in Python, and the model only writes narrative from a facts object.

On top of that, every run is traceable. You get tool logs, data_used, and a tool_trace_id, plus hard limits on lookback, tickers, and tool calls so the system can’t spiral.

At the same time, this is still an MVP. The parsing is a simple heuristic, the metric set is intentionally small, and fundamentals are only lightly extracted.

If you want to take this further, the next upgrades are straightforward: you can add volume and a couple more data tools like earnings calendar and news, introduce caching for repeated requests, build a tiny evaluation harness with fixed prompts and expected outputs, then wrap run_assistant() behind a small API so it can power an actual UI or internal service.
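As an illustration of the caching upgrade, here is a minimal sketch of a time-based cache keyed by the request parameters. TTLCache and its method names are hypothetical and not part of the assistant built in this article. The sketch is synchronous for brevity, so an async tool layer would need an awaitable variant.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache with a time-to-live per entry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the tool call
        value = fetch()
        self._store[key] = (now, value)
        return value


# Usage: key on everything that changes the result.
cache = TTLCache(ttl_seconds=300)
key = ("TSLA.US", "2026-01-05", "2026-03-06")
prices = cache.get_or_fetch(key, lambda: {"ticker": "TSLA.US", "closes": []})
```

Keying on ticker plus date range means two users asking for the same window share one tool call, while a different window always triggers a fresh fetch.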

Conclusion

The main takeaway is simple. If you want a financial assistant to be usable beyond casual chat, you need to separate facts from narrative. MCP gives you a clean way to connect to tool providers via an MCP server. Python gives you deterministic metrics, and the model becomes the last-mile layer that turns those facts into readable output.

This is still a small build, but it’s already shaped like something you can ship. The response format is structured, traceable, and easy to plug into a UI. If you extend it with a few more tools and add basic caching, it can quickly move from a Jupyter notebook demo to a real feature.

If you want to try the same approach with a full market data tool layer out of the box, EODHD’s MCP server is a solid starting point.

With that being said, you’ve reached the end of the article. Hope you learned something new and useful today. Thank you very much for your time.

Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers

Balajee Asish Brahmandam — Mon, 23 Mar 2026 17:21:11 +0000

Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.

I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.

So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.

Why Not Just Use Prometheus?
The Architecture
Setting Up the Project
The Monitoring Script – Line by Line
The Claude Diagnosis Prompt (and Why Structure Matters)
Auto-Fix Logic – Being Conservative on Purpose
Adding Slack Notifications
Health Check Endpoint
Rate Limiting Claude Calls
Docker Compose – The Full Setup
Real Errors I Caught in Production
Cost Breakdown – What This Actually Costs
Security Considerations
What I'd Do Differently
What's Next?

Why Not Just Use Prometheus?

Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.

Even then, those tools tell you what happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you why. You still need a human to look at the logs, figure out the root cause, and decide what to do.

That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).

The Architecture

Here's how the pieces fit together:

┌─────────────────────────────────────────────┐
│                 Docker Host                 │
│                                             │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│   │   web    │  │   api    │  │    db    │  │
│   │ (nginx)  │  │ (flask)  │  │(postgres)│  │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│        │             │             │        │
│        └─────────────┼─────────────┘        │
│                      │                      │
│                Docker Socket                │
│                      │                      │
│            ┌─────────┴─────────┐            │
│            │ Container Doctor  │            │
│            │  (Python agent)   │            │
│            └─────────┬─────────┘            │
│                      │                      │
└──────────────────────┼──────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘

The flow works like this:

The Container Doctor runs in its own container with the Docker socket mounted
Every 10 seconds, it pulls the last 50 lines of logs from each target container
It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")
When it finds something, it sends the logs to Claude with a structured prompt
Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart
If severity is high and auto-restart is safe, the script restarts the container
Either way, it sends a Slack notification with the full diagnosis
A simple health endpoint lets you check the doctor's own status

The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.

Setting Up the Project

Create your project directory:

mkdir container-doctor && cd container-doctor

Here's your requirements.txt:

docker==7.0.0
anthropic>=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0

Install locally for testing: pip install -r requirements.txt

Create a .env file:

ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20

A quick note on CHECK_INTERVAL: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.

The Monitoring Script – Line by Line

Here's the full container_doctor.py. I'll walk through the important parts after:

import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
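from anthropic import Anthropic
from dotenv import load_dotenv

# Load variables from a local .env file when running outside Docker.
# This is a no-op if the file is missing (for example, inside the container).
load_dotenv()

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()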
docker_client = None

# --- Config ---
TARGET_CONTAINERS = os.getenv("TARGET_CONTAINERS", "").split(",")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now > rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total >= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start >= 0 and end > start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t > datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) >= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        # Only report success once the container is confirmed running
        logger.info(f"Container {container_name} restarted successfully")
        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except Exception:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info("Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")

That's a lot of code, so let me walk through the parts that matter.

Error deduplication (is_new_error): This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.

Rate limiting (check_rate_limit): Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.

Restart throttling (inside apply_fix): If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.

Post-restart verification: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.
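A single status check after a fixed sleep can still miss a container that comes up and then dies a few seconds later. One possible refinement, sketched here under the assumption that a short series of polls is acceptable, is to require the container to report "running" on several consecutive checks. The stays_running helper and its parameters are hypothetical; with the Docker SDK, get_status could call container.reload() and return container.status, as apply_fix already does.

```python
import time
from typing import Callable


def stays_running(
    get_status: Callable[[], str],
    checks: int = 5,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every poll reports 'running' (illustrative helper)."""
    for _ in range(checks):
        sleep(interval)
        if get_status() != "running":
            return False  # crashed or still restarting: treat the fix as failed
    return True


# Example with a fake status source that dies on the third poll.
statuses = iter(["running", "running", "exited"])
ok = stays_running(lambda: next(statuses), checks=3, interval=0)
print(ok)  # False
```

The trade-off is a few extra seconds inside the monitoring loop per restart, in exchange for fewer false "restarted successfully" reports.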

The Claude Diagnosis Prompt (and Why Structure Matters)

Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.

The version I landed on is explicit about format:

prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

A few things I learned:

Include the detected patterns. Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.

Ask for estimated_impact. This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."

likely_recurring is gold. If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.

Claude returns something like:

{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}

I only auto-restart on high severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.
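Because the restart decision hangs on two fields, it can be worth validating them before acting. The following is a minimal sketch, not the script's actual parser: validate_diagnosis is a hypothetical helper that extracts the JSON object, rejects unknown severities, and treats anything other than a literal JSON true as "not safe". That last rule guards against a string such as "false", which Python would otherwise treat as truthy.

```python
import json
from typing import Optional

ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate_diagnosis(raw: str) -> Optional[dict]:
    """Return a cleaned diagnosis dict, or None if the response is unusable."""
    # Tolerate surrounding text or code fences by slicing out the outermost braces.
    start, end = raw.find("{"), raw.rfind("}") + 1
    if start < 0 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("severity") not in ALLOWED_SEVERITIES:
        return None
    if not isinstance(data.get("root_cause"), str):
        return None
    # Only a real boolean true allows an automatic restart.
    data["auto_restart_safe"] = data.get("auto_restart_safe") is True
    return data


sample = '{"root_cause": "Pool exhausted", "severity": "high", "auto_restart_safe": "false"}'
print(validate_diagnosis(sample)["auto_restart_safe"])  # False
```

Failing closed here means a malformed response results in a Slack alert and no restart, which matches the conservative stance of the rest of the script.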

Auto-Fix Logic – Being Conservative on Purpose

The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:

Restarting is safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.

The three safety checks before any restart:

Global toggle: AUTO_FIX=true in .env. I can kill all auto-fixes instantly by changing one variable.
Claude's assessment: auto_restart_safe must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.
Restart throttle: No more than 3 restarts per container per hour. After that, it's a human problem.

If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.
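As a rough idea of what that approval message could look like, here is a sketch of a Block Kit payload with two buttons. The function name and the action_id values are made up for illustration. Handling the button clicks also requires a Slack app with interactivity enabled and an endpoint to receive the callback, which is not shown here.

```python
def build_approval_blocks(container_name: str, root_cause: str) -> list:
    """Build a Slack Block Kit message asking a human to approve a restart."""
    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Restart `{container_name}`?*\n{root_cause}",
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart"},
                    "style": "primary",
                    "action_id": "approve_restart",  # hypothetical identifier
                    "value": container_name,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Ignore"},
                    "style": "danger",
                    "action_id": "deny_restart",  # hypothetical identifier
                    "value": container_name,
                },
            ],
        },
    ]


blocks = build_approval_blocks("api", "Connection pool exhausted")
print(len(blocks))  # 2
```

The container name travels in each button's value, so whatever handles the click knows which container the decision applies to.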

Adding Slack Notifications

Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.

The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.

To set this up, create a Slack app at api.slack.com/apps, add an incoming webhook, and paste the URL in your .env.

Health Check Endpoint

The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:

curl http://localhost:8080/health

Returns:

{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}

And /history returns the last 50 diagnoses:

curl http://localhost:8080/history

I point an uptime checker (UptimeRobot, free tier) at the /health endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.

Rate Limiting Claude Calls

This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.

The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.

Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.
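One caveat with the deduplication as written: get_container_logs requests timestamps=True, so if a container keeps re-logging the same error, every new line carries a new timestamp and the hashed tail changes. A minimal sketch of a timestamp-insensitive fingerprint follows. The error_fingerprint helper is hypothetical, and the regular expression assumes Docker's usual RFC 3339 prefix at the start of each line.

```python
import hashlib
import re

# Matches a leading timestamp such as "2026-03-15T14:30:00.123456789Z "
_TS_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+ ?", re.MULTILINE)


def error_fingerprint(logs: str, tail_chars: int = 200) -> str:
    """Hash the tail of the logs after removing per-line timestamps."""
    stripped = _TS_PREFIX.sub("", logs)
    return hashlib.sha256(stripped[-tail_chars:].encode("utf-8")).hexdigest()


first = "2026-03-15T14:30:00.123456789Z ERROR: QueuePool limit of 5 overflow 0 reached\n"
later = "2026-03-15T14:30:10.987654321Z ERROR: QueuePool limit of 5 overflow 0 reached\n"

print(error_fingerprint(first) == error_fingerprint(later))  # True
```

Using hashlib instead of the built-in hash() also keeps the fingerprint stable across process restarts, since Python salts string hashes per process by default.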

Docker Compose – The Full Setup

Here's the complete docker-compose.yml with the Container Doctor, a sample web server, API, and database:

version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:

And the Dockerfile:

FROM python:3.12-slim

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]

Start everything: docker compose up -d

Important: The socket mount (/var/run/docker.sock:/var/run/docker.sock) gives the Container Doctor full access to the Docker daemon, so treat this container as privileged. Also, don't copy .env into the Docker image: that bakes your API key into an image layer. Pass environment variables via the compose file or at runtime.

Real Errors I Caught in Production

I've been running this for about 3 weeks now. Here are the actual incidents it caught:

Incident 1: OOM Kill (Week 1)

Logs showed a single word: Killed. That's Linux's OOMKiller doing its thing.

Claude's diagnosis:

{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}

The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.

Incident 2: Connection Pool Exhausted (Week 2)

ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached

Claude caught that my pool size was too small for the number of workers:

{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}

Incident 3: Transient Timeout (Week 2)

WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry

Claude correctly identified this as a non-issue:

{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}

No restart. No alert (I filter low-severity from Slack pings). This is the right call: restarting on every transient timeout causes more downtime than it prevents.

Incident 4: Disk Full (Week 3)

ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space

{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}

Notice Claude said auto_restart_safe: false here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.
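The restart behaviour across these four incidents comes down to a small decision rule. The sketch below is a hypothetical reconstruction, not the Container Doctor's actual code: the `decide_actions` name and its return shape are assumptions, while the `severity` and `auto_restart_safe` fields and the AUTO_FIX setting come from the diagnoses and compose file above. It also assumes that everything above low severity triggers a Slack alert.

```python
def decide_actions(diagnosis: dict, auto_fix: bool = True) -> dict:
    """Map a diagnosis to actions (hypothetical helper, for illustration only)."""
    severity = diagnosis.get("severity", "high")  # assume the worst if missing
    # Restart only when auto-fix is enabled AND the model explicitly marked it safe.
    restart = auto_fix and diagnosis.get("auto_restart_safe") is True
    # Low-severity findings are recorded but do not ping Slack.
    alert = severity != "low"
    return {"restart": restart, "alert": alert}


oom = {"severity": "high", "auto_restart_safe": True}
transient = {"severity": "low", "auto_restart_safe": False}
disk_full = {"severity": "high", "auto_restart_safe": False}

for name, diagnosis in [("oom", oom), ("transient", transient), ("disk_full", disk_full)]:
    print(name, decide_actions(diagnosis))
```

Defaulting a missing `auto_restart_safe` to "do not restart" keeps a malformed diagnosis from triggering an action on its own.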

Cost Breakdown – What This Actually Costs

After 3 weeks of running this on 5 containers:

Claude API: ~$3.80/month (with rate limiting and deduplication)
Linode compute: $0 extra (the Container Doctor uses about 50MB RAM)
Slack: Free tier
My time saved: ~2-3 hours/month of 3 AM debugging

Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.

If you're monitoring more containers or have noisier logs, expect higher costs. The MAX_DIAGNOSES_PER_HOUR setting is your budget knob.

Security Considerations

Let's talk about the elephant in the room: the Docker socket.

Mounting /var/run/docker.sock gives the Container Doctor root-equivalent access to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.

Here's how I mitigate this:

Network isolation: The Container Doctor's health endpoint is only exposed on localhost. In production, put it behind a reverse proxy with auth.
Read-mostly access: The script only reads logs and restarts containers. It never execs into containers, pulls images, or modifies volumes.
No external inputs: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).
API key rotation: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.

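For a more secure setup, consider putting a tool like docker-socket-proxy in front of the socket to restrict which API calls the Container Doctor can make. Note that mounting the socket read-only (:ro) is not enough on its own: a process can still connect to a read-only socket and issue any Docker API call.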

What I'd Do Differently

After 3 weeks in production, here's my honest retrospective:

I'd use structured logging from day one. My regex-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.
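To illustrate the difference, here is a minimal sketch of severity-based detection on JSON log lines. It assumes each line is a JSON object with `level` and `message` keys. That schema, the set of level names, and the `is_error_line` helper are all hypothetical, not part of the Container Doctor.

```python
import json

ERROR_LEVELS = {"ERROR", "FATAL", "CRITICAL"}  # assumed severity names


def is_error_line(line: str) -> bool:
    """Return True only when a JSON log line carries an error-level severity."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False  # not structured; a real setup might fall back to regex here
    if not isinstance(record, dict):
        return False
    return str(record.get("level", "")).upper() in ERROR_LEVELS


lines = [
    '{"level": "INFO", "message": "no error found in cache"}',
    '{"level": "ERROR", "message": "database connection pool exhausted"}',
]
print([is_error_line(line) for line in lines])
```

The first line contains the word "error" in its message, which a naive regex would flag. Checking the `level` field avoids that kind of false positive.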

I'd add per-container policies. Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.
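A rough sketch of what such policies could look like, assuming a static table keyed by container name. The `ContainerPolicy` class, the `may_restart` helper, and the restart limits are illustrative choices, not something the current script implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerPolicy:
    auto_restart: bool          # may the doctor restart this container at all?
    max_restarts_per_hour: int  # cap to avoid restart loops


# Hypothetical policy table keyed by the names used in TARGET_CONTAINERS.
POLICIES = {
    "web": ContainerPolicy(auto_restart=True, max_restarts_per_hour=6),
    "api": ContainerPolicy(auto_restart=True, max_restarts_per_hour=3),
    "db": ContainerPolicy(auto_restart=False, max_restarts_per_hour=0),
}
# Unknown containers get the most conservative treatment.
DEFAULT_POLICY = ContainerPolicy(auto_restart=False, max_restarts_per_hour=0)


def may_restart(container: str, model_says_safe: bool, restarts_this_hour: int) -> bool:
    """Combine the static policy with the model's auto_restart_safe verdict."""
    policy = POLICIES.get(container, DEFAULT_POLICY)
    return (
        policy.auto_restart
        and model_says_safe
        and restarts_this_hour < policy.max_restarts_per_hour
    )


print(may_restart("web", model_says_safe=True, restarts_this_hour=0))
print(may_restart("db", model_says_safe=True, restarts_this_hour=0))
```

With this shape, the policy acts as a hard ceiling: the database is never restarted, even if a diagnosis says it would be safe.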

I'd build a simple web UI. The /history endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.

I'd try local models first. For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.

I'd add a "learning mode." Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.

What's Next?

If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.

Got questions or built something similar? Drop a comment below or find me on GitHub and LinkedIn.

Happy building.

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Chidozie Managwu — Mon, 16 Mar 2026 17:43:51 +0000

Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.

They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions

We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. Those appear only as a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

Project Structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install Dependencies

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment

Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:

“Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP (an exact inner-product index) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product equals cosine similarity, giving us a stable score we can use for a retrieval gate.
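To see why normalisation matters, the small sketch below compares a raw inner product with the inner product of L2-normalised vectors. It uses plain Python rather than FAISS, and the helper names and the two example vectors are made up for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_ip = dot(a, b)                               # grows with vector length
norm_ip = dot(l2_normalize(a), l2_normalize(b))  # bounded in [-1, 1]

print(raw_ip)
print(round(norm_ip, 6), round(cosine(a, b), 6))
```

The raw inner product depends on vector magnitude, so a fixed threshold would be hard to interpret. After normalisation the score matches cosine similarity and stays within a fixed range, which is what makes a single gate threshold workable.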

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. By using an overlap (for example, 200 characters), we ensure that the end of one chunk and the beginning of the next share context.
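As a quick worked example, the sketch below computes where each chunk starts and ends for a 2,500-character document. The `chunk_spans` helper is hypothetical. It mirrors the stepping logic of the tutorial's chunk_text function but returns offsets instead of text, so the overlap is easy to see.

```python
def chunk_spans(text_len: int, size: int = 1000, overlap: int = 200):
    """Return (start, end) character offsets for each chunk."""
    step = max(1, size - overlap)  # advance by size minus the shared region
    return [
        (start, min(start + size, text_len))
        for start in range(0, text_len, step)
    ]


spans = chunk_spans(2500)
for start, end in spans:
    print(start, end)
```

Each chunk begins 800 characters after the previous one, so neighbouring chunks share 200 characters. Note that with this stepping the final short chunk can fall entirely inside the previous one, which is harmless but adds a near-duplicate entry to the index.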

Implementation of `rag.py`

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

Implementation of `llm.py`

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.

The solution is the retrieval gate:

Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when scoring with cosine similarity, but the right value depends on your embedding model and documents, so you should tune it using evals (next section).
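One simple way to tune the threshold is to sweep a few candidate values over labelled cases and see which one matches the expected refuse/answer decisions most often. The sketch below assumes you have logged the top similarity score for each eval question. The scores shown are made up for illustration, and `gate_accuracy` is a hypothetical helper.

```python
def gate_accuracy(cases, threshold):
    """Fraction of cases where the gate's decision matches the expectation."""
    correct = 0
    for top_score, expect_refusal in cases:
        refused = top_score < threshold  # same comparison the gate uses
        correct += refused == expect_refusal
    return correct / len(cases)


# Made-up (top_score, expect_refusal) pairs; replace with scores from your own evals.
cases = [
    (0.62, False), (0.48, False), (0.35, False),
    (0.27, True), (0.18, True), (0.33, True),
]

candidates = [0.20, 0.25, 0.30, 0.35, 0.40]
for t in candidates:
    print(f"threshold={t:.2f} accuracy={gate_accuracy(cases, t):.2f}")

best = max(candidates, key=lambda t: gate_accuracy(cases, t))
print("best threshold:", best)
```

With only a handful of cases the result is noisy, so treat the chosen value as a starting point and revisit it as the eval set grows.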

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.
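One gap worth noting: retrieve() in rag.py also makes a network call (the embeddings request), so a failure there would surface as an unhandled error rather than a refusal-shaped response. The sketch below shows one possible way to cover it. The `safe_retrieve` wrapper, the fallback message, and the "retrieval_error" value are assumptions, not part of the tutorial's code.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Same response shape as the fallbacks in call_llm(); the wording is illustrative.
RETRIEVAL_FALLBACK = {
    "answer": "Search is temporarily unavailable. Please try again.",
    "refusal": True,
    "confidence": "low",
    "sources": [],
    "error_type": "retrieval_error",
}


def safe_retrieve(
    retrieve_fn: Callable[[str, int], List[Dict]], question: str, k: int = 5
) -> Tuple[List[Dict], Optional[Dict]]:
    """Return (results, None) on success or ([], fallback_payload) on failure."""
    try:
        return retrieve_fn(question, k), None
    except Exception:
        # Copy so callers can attach latency or debug fields without side effects.
        return [], dict(RETRIEVAL_FALLBACK)


def broken_retriever(question: str, k: int) -> List[Dict]:
    raise ConnectionError("embeddings endpoint unreachable")  # simulated outage


results, fallback = safe_retrieve(broken_retriever, "What is a retrieval gate?")
print(results, fallback["error_type"])
```

In the endpoint, you could return the fallback payload as soon as it is not None, before the retrieval gate runs, so clients always receive the same response shape.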

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of `app.py`

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

@app.post("/answer")
def get_answer(req: QueryRequest):
    # Plain def: FastAPI runs it in a threadpool, so the blocking requests calls don't stall the event loop.
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralized Prompt Template: `prompts.py`

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

Example `prompts.py`

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic. This makes testing harder than in traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.

Instead of “does it output exactly this string,” you test:

Should the app refuse when the retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create `evals/eval_set.json`

This should contain both positive and negative cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create `evals/run_evals.py`

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.

How to Build End-to-End LLM Observability in FastAPI with OpenTelemetry

Jessica Patel — Fri, 13 Mar 2026 16:13:16 +0000

This article shows how to build end-to-end, code-first LLM observability in a FastAPI application using the OpenTelemetry Python SDK.

Instead of relying on vendor-specific agents or opaque SDKs, we will manually design traces, spans, and semantic attributes that capture the full lifecycle of an LLM-powered request.

Introduction
Prerequisites and Technical Context
Why LLM Observability Is Fundamentally Different
Reference Architecture: A Traceable RAG Request
Reference Architecture Explained
Why This Design Is Better Than Simpler Alternatives
LLM Models That Work Best for This Architecture
OpenTelemetry Primer (LLM-Relevant Concepts Only)
Designing LLM-Aware Spans
FastAPI Example: End-to-End LLM Spans (Complete and Explained)
Semantic Attributes: Best Practices for LLM Observability
Evaluation Hooks Inside Traces
Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)
Operational Patterns and Anti-Patterns
Extending the System
Conclusion

Introduction

Large Language Models (LLMs) are rapidly becoming a core component of modern software systems. Applications that once relied on deterministic APIs are now incorporating LLM-powered features such as conversational assistants, document summarization, intelligent search, and retrieval-augmented generation (RAG).

While these capabilities unlock new user experiences, they also introduce operational complexity that traditional monitoring approaches were never designed to handle.

Unlike conventional software services, LLM systems are probabilistic by nature. The same request may produce slightly different responses depending on factors such as prompt structure, model configuration, retrieval context, and sampling parameters such as temperature or top-p.

In addition, LLM workloads introduce entirely new operational dimensions such as token consumption, prompt construction latency, inference cost, context window limits, and response quality.

These factors mean that a request can appear technically successful from an infrastructure perspective while still producing an incorrect, hallucinated, or low-quality result.

Traditional observability tools typically focus on infrastructure-level signals such as latency, error rate, and throughput. While these metrics remain important, they are insufficient for understanding how an LLM application behaves in production.

Engineers must also understand what prompt was constructed, which documents were retrieved, how many tokens were consumed, which model configuration was used, and how the final response was evaluated. Without this visibility, debugging LLM behavior becomes extremely difficult and operational costs can quickly spiral out of control.

This is where LLM observability becomes essential. Observability for LLM systems extends beyond infrastructure monitoring. It captures the full lifecycle of an AI-driven request — from user input and context retrieval to prompt construction, model inference, post-processing, and quality evaluation.

When implemented correctly, observability allows teams to answer why the model generated a particular response, which retrieval results influenced the output, how much a request cost in terms of tokens, where latency occurred within the request pipeline, and whether the response passed basic quality or safety checks.

This article demonstrates how to implement end-to-end LLM observability in a FastAPI application using OpenTelemetry. Instead of relying on proprietary monitoring agents or opaque vendor SDKs, we take a code-first approach to instrumentation. By explicitly designing traces, spans, and semantic attributes, we gain precise control over how LLM interactions are observed and analyzed.

Throughout the guide, we will walk through a practical architecture for tracing a retrieval-augmented generation (RAG) workflow, where each stage of the request lifecycle is represented as a trace span. We will explore how to design meaningful span boundaries, capture prompt and model metadata safely, record token usage and cost signals, and attach evaluation results directly to traces.

The article also explains how this instrumentation can be exported to any OpenTelemetry-compatible backend such as Jaeger, Grafana Tempo, or LLM-specific platforms like Phoenix.

By the end of this guide, you will understand how to:

Structure traces so that each user request maps to a single end-to-end LLM interaction
Design span hierarchies that reflect the logical stages of an LLM pipeline
Capture prompt metadata, model configuration, and token usage safely
Attach evaluation and quality signals to traces for deeper analysis
Export observability data to different backends without changing instrumentation

Most importantly, the goal of this article is not simply to demonstrate how to add telemetry to an application. Instead, it aims to show how to think about observability when building LLM-powered systems.

When LLM operations are treated as first-class components within a distributed system, traces become a powerful tool for debugging, optimization, cost management, and continuous improvement of model behavior.

Prerequisites and Technical Context

Before following this guide, you should be familiar with the Python programming language, basic web API concepts, and general microservice architecture. Below are some key tools and concepts used in this article.

FastAPI (Web Framework)

FastAPI is used as the primary web framework for the application. It is a modern Python framework designed for building high-performance APIs using standard Python type hints. FastAPI simplifies request validation, serialization, and API documentation while remaining lightweight and fast.

Large Language Models (LLMs)

Large Language Models (LLMs) are the computational core of the example system. An LLM is a model trained on vast amounts of text data to generate or transform language in ways that resemble human communication. In production environments, LLMs are commonly used for tasks such as conversational interfaces, summarization, and question answering.

Observability (Concept)

Observability is the overarching concept that connects all the technical pieces in this article. At a high level, observability refers to the ability to understand a system's internal behavior by examining the data it produces during execution. Rather than asking whether a system is simply "up" or "down," observability helps answer deeper questions about why a request behaved a certain way, where latency was introduced, or how different components interacted.

OpenTelemetry (Instrumentation Standard)

OpenTelemetry is the mechanism used to implement observability within the application. It is an open, vendor-neutral standard for generating telemetry data such as traces, metrics, and logs. By instrumenting key parts of the LLM workflow, we can observe how requests flow through the system, how long each step takes, and what contextual data influenced the final outcome. OpenTelemetry serves as the foundation for collecting this information in a consistent and portable way, independent of any specific monitoring backend.

Why LLM Observability Is Fundamentally Different

Traditional observability assumes deterministic behavior: the same input produces the same output. LLM systems violate this assumption. The same request can vary due to prompt template changes, retrieval differences, sampling parameters (temperature, top-p), model version upgrades, and context window truncation.

As a result, teams need visibility into what the model saw, how it was configured, what it retrieved, how long it took, and how much it cost, all correlated to a single user request. Logs alone are insufficient, and metrics lack dimensionality. Distributed traces are the backbone of LLM observability.

Reference Architecture: A Traceable RAG Request

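A typical FastAPI-based RAG service follows this flow:

Client sends a request to /chat
FastAPI validates input and authenticates the user
Retriever queries the vector database
Prompt is assembled using retrieved documents
LLM API is invoked
Response is post-processed and returned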

Each step is observable, but only if we deliberately instrument it. The goal is one trace per user request, with child spans representing each logical LLM step.

Reference Architecture Explained

Client Sends a Request to /chat

The architecture begins when a client sends a request to the /chat endpoint. This request typically contains the user's query along with any session or conversation context required by the application.

Keeping the client interface minimal and well-defined is intentional: it ensures the backend receives a predictable input shape and prevents application-specific logic from leaking into downstream LLM processing.

From an observability perspective, this request marks the start of a single end-to-end trace, allowing every subsequent operation to be correlated back to the original user action.

FastAPI Validates Input and Authenticates the User

Once the request reaches the service, FastAPI performs schema validation and authentication. Validation guarantees that only well-formed inputs proceed through the pipeline, while authentication ensures that expensive LLM operations are only executed for authorized users.

Placing this step early reduces unnecessary computation and protects the system from abuse. It also improves trace quality by ensuring that all observed requests represent legitimate execution paths rather than malformed or rejected traffic.

Retriever Queries the Vector Database

After validation, the system queries a vector database to retrieve documents relevant to the user's request. This retrieval step is the foundation of retrieval-augmented generation (RAG). By grounding the LLM in external knowledge, the system improves factual accuracy and reduces hallucinations.

Separating retrieval from generation allows teams to tune similarity thresholds, embedding models, and top-k values independently, and it makes it easier to diagnose whether poor responses are caused by bad retrieval or model behavior.
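To make the tuning knobs concrete, here is a minimal sketch of threshold-plus-top-k selection. It assumes embeddings are plain lists of floats and uses an in-memory corpus; the function names and sample vectors are illustrative and not part of any vector database API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_documents(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    top_k: int,
    similarity_threshold: float,
) -> List[Tuple[str, float]]:
    """Score every document, drop those below the threshold, keep the best top_k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    kept = [item for item in scored if item[1] >= similarity_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Hypothetical two-dimensional embeddings, chosen only to keep the example readable.
corpus = [
    ("doc about tracing", [1.0, 0.0]),
    ("doc about cooking", [0.0, 1.0]),
    ("doc about spans", [0.9, 0.1]),
]
results = select_documents([1.0, 0.0], corpus, top_k=2, similarity_threshold=0.8)
```

Because the threshold and top_k are explicit arguments, both values can be recorded on the retrieval span alongside the number of documents that survived filtering.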

Prompt Is Assembled Using Retrieved Documents

With relevant documents in hand, the system constructs the final prompt that will be sent to the LLM. This step combines the user query, retrieved context, system instructions, and formatting rules into a single structured prompt.

Making prompt assembly an explicit stage enables prompt versioning, experimentation, and observability. It also provides a natural place to detect issues such as context window overflows or excessive prompt size before invoking the model.
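One way to catch oversized prompts before the model call is a simple budget check during assembly. The sketch below is an assumption-laden illustration: it measures size in characters (a real system would more likely count tokens with the model's tokenizer), and fit_documents_to_budget is a hypothetical helper, not part of the sample application.

```python
from typing import List, Tuple


def fit_documents_to_budget(
    query: str,
    documents: List[str],
    max_prompt_chars: int,
) -> Tuple[List[str], int]:
    """Keep documents in order until adding another would exceed the budget.

    Returns the documents that fit and the number that were dropped.
    """
    used = len(query)
    kept: List[str] = []
    for doc in documents:
        cost = len(doc) + 1  # +1 for the newline that joins documents
        if used + cost > max_prompt_chars:
            break
        kept.append(doc)
        used += cost
    return kept, len(documents) - len(kept)


docs = ["a" * 40, "b" * 40, "c" * 40]
kept, dropped = fit_documents_to_budget("q" * 10, docs, max_prompt_chars=100)
```

The dropped count is a natural candidate for a span attribute, since a non-zero value signals that the model saw less context than retrieval returned.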

LLM API Is Invoked

The LLM API call is the most expensive and non-deterministic operation in the pipeline, which is why it occurs only after all preparatory work is complete. At this stage, the model receives a fully constructed prompt and produces a response based on its configuration parameters.

This step is the primary focus of latency, cost, and reliability controls such as retries, timeouts, and circuit breakers. From an observability standpoint, this span becomes the anchor for token usage, cost attribution, and prompt-level debugging.
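As a rough illustration of the timeout and retry controls mentioned above, the following sketch wraps an async call with a per-attempt timeout and a bounded number of retries. The flaky_llm coroutine is a hypothetical stand-in for a provider client, and the linear backoff is one simple choice among many; circuit breaking is not covered here.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def call_with_retries(
    make_call: Callable[[], Awaitable[str]],
    timeout_s: float,
    max_attempts: int,
    backoff_s: float = 0.0,
) -> Tuple[str, int]:
    """Run make_call with a per-attempt timeout; return (result, attempts_used)."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(make_call(), timeout=timeout_s)
            return result, attempt
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts:
                await asyncio.sleep(backoff_s * attempt)  # linear backoff
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts") from last_error


# Hypothetical flaky provider: fails once, then succeeds.
calls = {"count": 0}


async def flaky_llm() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    return "ok"


result, attempts = asyncio.run(
    call_with_retries(flaky_llm, timeout_s=1.0, max_attempts=3)
)
```

Returning the attempt count makes it easy to record the number of retries on the llm.call span, so slow requests caused by retries can be told apart from slow inference.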

Response Is Post-Processed and Returned

After the LLM returns a response, the system performs post-processing before sending the result back to the client. This may include formatting, filtering, validation, or enrichment of the output. Post-processing acts as a final safeguard against malformed or low-quality responses and ensures consistency with application requirements. It also provides a clean boundary for attaching evaluation signals, such as response length, relevance scores, or truncation indicators, before the request completes.

Why This Design Is Better Than Simpler Alternatives

This architecture intentionally avoids coupling responsibilities together. Validation, retrieval, prompt construction, model execution, and response handling are all distinct steps. This separation makes the system easier to test, easier to observe, and easier to evolve. When something fails, engineers can identify where and why rather than treating the LLM as a black box.

Compared to a monolithic "send user input directly to the LLM" approach, this design offers better correctness, lower cost, and higher resilience. It also aligns naturally with distributed tracing, since each block maps cleanly to a trace span with a clear semantic purpose. As the system grows, additional features such as caching, fallback models, or policy enforcement can be added without destabilizing the entire flow.

Most importantly, this architecture treats the LLM as one component in a larger system, not the system itself. That mindset is essential for building reliable production applications.

LLM Models That Work Best for This Architecture

This architecture is model-agnostic, but certain model characteristics work particularly well with retrieval-augmented workflows.

Models with strong instruction-following and reasoning capabilities tend to perform best, especially when prompts include structured context from retrieved documents. General-purpose models such as GPT-4-class systems perform well when accuracy and reasoning depth are critical.

For lower-latency or cost-sensitive use cases, smaller instruction-tuned models can be effective when paired with high-quality retrieval. Open-source models such as LLaMA-derived or Mistral-based systems also fit well into this architecture, particularly when deployed behind a private inference endpoint.

The key requirement is not the model itself, but how it is used. Models that can reliably ground their responses in provided context, respect system instructions, and produce stable outputs under varying prompts integrate most cleanly into this design. Because retrieval and prompt construction are explicit stages, models can be swapped or compared without changing the overall system structure.

OpenTelemetry Primer (LLM-Relevant Concepts Only)

OpenTelemetry defines three core types of telemetry data: traces, metrics, and logs. For LLM systems, traces are the most important. To make them useful, you need to understand a few building blocks:

a trace represents a single end-to-end request
a span is a timed operation within that trace
attributes are key–value metadata attached to spans
events are time-stamped annotations
context propagation ensures child spans attach to the correct parent.

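FastAPI’s async nature makes correct context propagation essential. OpenTelemetry’s Python SDK tracks the active span using Python's contextvars, so child spans attach to the right parent across await points as long as each span is made current (for example, with start_as_current_span) inside the request handler.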

With those concepts in place, the next step is to wire OpenTelemetry into the app. Start by configuring the OpenTelemetry SDK in FastAPI: define a TracerProvider, attach a Resource (service name and environment), configure an exporter (Jaeger, Tempo, Phoenix, and so on), and enable FastAPI auto-instrumentation.
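The setup steps above might look roughly like the following. This is a minimal sketch, assuming the opentelemetry-sdk and opentelemetry-instrumentation-fastapi packages are installed; the service name and environment are placeholder values, and the console exporter is used only so the example has no external dependency.

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes identify the service in the tracing backend.
# Both values here are placeholders.
resource = Resource.create(
    {
        "service.name": "rag-service",
        "deployment.environment": "development",
    }
)

# The provider owns span processors and exporters for the whole process.
provider = TracerProvider(resource=resource)

# ConsoleSpanExporter prints finished spans to stdout. To send spans to
# Jaeger, Tempo, or Phoenix, an OTLP exporter would be attached here instead.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()

# Auto-instrumentation creates a server span for every incoming request.
FastAPIInstrumentor.instrument_app(app)
```

With auto-instrumentation enabled, any span opened inside a handler with start_as_current_span becomes a child of the automatically generated server span.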

Designing LLM-Aware Spans

Span Taxonomy

A clean span hierarchy is critical. In this guide, a single http.request span (usually auto-generated) acts as the root, and it contains child spans such as rag.retrieval, rag.prompt.build, llm.call, llm.postprocess, and, optionally, llm.eval. Each of these spans represents a logical unit of work rather than an implementation detail.

Span Boundaries

Getting span boundaries right is just as important as picking the right span names. Avoid extremes like wrapping the entire LLM workflow in one giant span, creating a separate span for every token, or dumping all data into logs.

Instead, aim for a few coarse-grained spans that each represent a meaningful step in the request, enrich them with well-chosen attributes, and use events to mark important milestones within a span rather than splitting everything into smaller spans.

Instrumenting the LLM Call

When instrumenting the LLM call, treat it as the most critical span in the trace. Whether you are calling OpenAI, Anthropic, or another provider, start the span immediately before the API request and end it only after the full response (or stream) is complete.

Within that span, capture retries, timeouts, and errors so it becomes the central place for latency analysis, cost attribution, and prompt debugging.

For streaming responses, you can emit events for each chunk to track progress, but avoid creating separate child spans unless you truly need fine-grained timing.
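The chunk-event idea can be sketched as follows. To keep the example runnable without a tracing backend, it uses a small RecordingSpan stand-in; real OpenTelemetry spans accept the same add_event(name, attributes=...) and set_attribute(key, value) calls. The event name llm.stream.chunk, the attribute keys, and fake_stream are all hypothetical.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


class RecordingSpan:
    """Stand-in for an OpenTelemetry span so the sketch runs without a backend."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, int]]] = []
        self.attributes: Dict[str, int] = {}

    def add_event(self, name: str, attributes: Dict[str, int]) -> None:
        self.events.append((name, attributes))

    def set_attribute(self, key: str, value: int) -> None:
        self.attributes[key] = value


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical provider stream that yields three text chunks."""
    for chunk in ["Fast", "API ", "rocks"]:
        await asyncio.sleep(0)
        yield chunk


async def consume_stream(span, stream: AsyncIterator[str]) -> str:
    """Collect a streamed response, emitting one event per chunk on the span."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)
        span.add_event(
            "llm.stream.chunk",
            attributes={"chunk.index": len(parts) - 1, "chunk.chars": len(chunk)},
        )
    # Record the total once the stream is complete.
    span.set_attribute("llm.stream.chunks", len(parts))
    return "".join(parts)


span = RecordingSpan()
text = asyncio.run(consume_stream(span, fake_stream()))
```

For long responses, emitting an event for every chunk can become noisy, so recording only the first chunk (time to first token) and the final count may be a reasonable compromise.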

FastAPI Example: End-to-End LLM Spans (Complete and Explained)

from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.trace import Tracer
from typing import List
import asyncio
import hashlib

# Obtain a tracer instance from OpenTelemetry.
# All spans created with this tracer will be part of the same distributed
# tracing system and exported to the configured backend.
tracer: Tracer = trace.get_tracer(__name__)

# Initialize the FastAPI application.
app = FastAPI()

# Helper functions used by the observable endpoint
async def retrieve_documents(query: str) -> List[str]:
    """
    Simulate document retrieval (e.g., vector search or knowledge base lookup).
    This function represents the retrieval stage in a RAG pipeline.
    In a real system, this might query a vector database or search index.
    """
    await asyncio.sleep(0.05)  # Simulate I/O latency
    return [
        "FastAPI enables high-performance async APIs.",
        "OpenTelemetry provides vendor-neutral observability.",
        "LLM observability requires tracing prompts and tokens.",
    ]


def build_prompt(query: str, documents: List[str]) -> str:
    """
    Construct the final prompt from retrieved documents and the user query.
    Prompt construction is kept separate so it can be observed or modified
    independently if needed (for example, to measure prompt assembly latency).
    """
    context = "\n".join(documents)
    return f"""
Context:
{context}

Question:
{query}
"""


class LLMResponse:
    """
    Minimal abstraction for an LLM response.
    This keeps the example self-contained while still allowing us to attach
    token usage and other metadata for observability.
    """

    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)


def summarize_response(response: LLMResponse) -> str:
    """
    Example post-processing step.
    Post-processing is separated into its own phase so any additional latency
    or errors are not incorrectly attributed to the LLM itself.
    """
    return response.text


# Observable FastAPI endpoint
@app.post("/query")
async def rag_query(request: Request, query: str):
    """
    Handle a single RAG-style request with explicit OpenTelemetry spans.
    This endpoint demonstrates how to create one trace per request, with child
    spans for retrieval, LLM invocation, and post-processing.
    """

    # Create a top-level span for the HTTP request.
    # Even if FastAPI auto-instrumentation is enabled, defining this explicitly
    # allows us to attach domain-specific metadata.
    with tracer.start_as_current_span("http.request") as http_span:
        http_span.set_attribute("http.method", "POST")
        http_span.set_attribute("http.route", "/query")

        # Retrieval phase
        # This span isolates the retrieval step so that relevance issues can be
        # debugged independently of LLM behavior.
        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            retrieval_span.set_attribute("rag.top_k", 5)
            retrieval_span.set_attribute("rag.similarity_threshold", 0.8)
            documents = await retrieve_documents(query)

            # Record how many documents were returned.
            # This is a key signal when diagnosing hallucinations
            # or missing context in the final response.
            retrieval_span.set_attribute(
                "rag.documents_returned",
                len(documents),
            )

        # LLM invocation phase
        # This span wraps the actual LLM call and is the primary anchor for
        # latency, cost, and prompt-related analysis.
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.provider", "example")
            llm_span.set_attribute("llm.model", "example-llm")
            llm_span.set_attribute("llm.temperature", 0.7)
            llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

            # Build the final prompt using retrieved context.
            # The raw prompt is intentionally not stored as a span attribute.
            prompt = build_prompt(query, documents)
            
            # Prompt metadata
            prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
            llm_span.set_attribute("llm.prompt_hash", prompt_hash)
            llm_span.set_attribute("llm.prompt_length", len(prompt))

            response = await call_llm(prompt)

            # Hash the response instead of storing raw text.
            # This allows correlation across traces without exposing content.
            response_hash = hashlib.sha256(
                response.text.encode()
            ).hexdigest()
            llm_span.set_attribute("llm.response_hash", response_hash)

            # Record token usage to enable cost attribution
            # and capacity planning.
            llm_span.set_attribute("llm.usage.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("llm.usage.completion_tokens", response.completion_tokens)
            llm_span.set_attribute("llm.usage.total_tokens", response.total_tokens)
            
            # example price per token
            estimated_cost = response.total_tokens * 0.000002
            llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

        # Post-processing phase
        # Any transformation after the LLM response is captured here,
        # ensuring inference latency is not overstated.
        with tracer.start_as_current_span("llm.postprocess") as post_span:
            summary = summarize_response(response)
            post_span.set_attribute(
                "llm.summary_length",
                len(summary),
            )

    # Return the final response to the client.
    # All spans above belong to the same distributed trace.
    return {"summary": summary}

With the full code example in view, it helps to map the instrumentation back to the observability principles described earlier in this article.

The goal of the example is not simply to show how to create spans, but to demonstrate how a single user request can be represented as a structured trace containing meaningful metadata about each stage of the LLM pipeline.

At a high level, the code follows three key design ideas:

One trace per user request
One span per logical LLM workflow stage
Semantic attributes attached to spans for debugging, cost tracking, and analysis

Each of these concepts directly corresponds to the observability practices discussed earlier.

Top-Level Request Span

The FastAPI endpoint begins by creating a top-level span called http.request. This span represents the entire lifecycle of the incoming request and serves as the root span for the trace.

with tracer.start_as_current_span("http.request") as http_span:

Although FastAPI can generate HTTP spans automatically through OpenTelemetry auto-instrumentation, explicitly creating this span allows the application to attach domain-specific metadata such as route names or user identifiers.

Attributes such as the HTTP method and route are attached here:

http_span.set_attribute("http.method", "POST")
http_span.set_attribute("http.route", "/query")

This ensures that every trace can be easily filtered by endpoint when analyzing production traffic.

Retrieval Span

The next span captures the retrieval phase of the RAG pipeline:

with tracer.start_as_current_span("rag.retrieval") as retrieval_span:

This span isolates the vector search or knowledge retrieval step from the rest of the pipeline. If users report irrelevant answers, engineers can inspect this span to determine whether the issue originates from poor retrieval results rather than model behavior.

Several semantic attributes are attached here:

rag.top_k – number of documents requested
rag.similarity_threshold – similarity cutoff used for filtering results
rag.documents_returned – number of documents actually retrieved

These attributes align with the RAG observability signals discussed in the earlier section of the article.

LLM Invocation Span

The most important span in the trace is the llm.call span, which wraps the actual model invocation.

with tracer.start_as_current_span("llm.call") as llm_span:

This span captures the latency, configuration, and token usage associated with the LLM request. In production systems, it becomes the primary location for analyzing model behavior and cost.

Key attributes recorded in this span include:

llm.provider – the model provider (OpenAI, Anthropic, etc.)
llm.model – the specific model version
llm.temperature – sampling parameter controlling response randomness
llm.prompt_template_id – identifier for the prompt template used

These attributes make it possible to correlate changes in model configuration with downstream quality or cost changes.

Prompt Handling and Privacy

Instead of storing the full prompt or response text directly in the trace, the example demonstrates a safer practice: hashing sensitive data.

response_hash = hashlib.sha256(response.text.encode()).hexdigest()

The resulting hash is stored as a span attribute:

llm_span.set_attribute("llm.response_hash", response_hash)

This approach allows engineers to correlate repeated responses across traces without exposing potentially sensitive content in observability systems.

Token Usage Tracking

The llm.call span also records token usage:

llm_span.set_attribute(
    "llm.usage.total_tokens",
    response.total_tokens
)

Capturing token usage at the span level is critical for monitoring cost and efficiency, since token consumption directly determines billing for most LLM providers.

Post-Processing Span

Finally, the example includes a llm.postprocess span:

with tracer.start_as_current_span("llm.postprocess") as post_span:

This span represents any transformation applied after the model generates its response. Separating post-processing from the LLM call ensures that additional latency — such as formatting, filtering, or validation — is not incorrectly attributed to the model itself.

An attribute such as response length is recorded here:

post_span.set_attribute("llm.summary_length", len(summary))

This can be useful when diagnosing issues such as unexpectedly short or truncated outputs.

How the Spans Form a Complete Trace

When the request finishes, all spans belong to the same distributed trace:

http.request
 ├── rag.retrieval
 ├── llm.call
 └── llm.postprocess

This hierarchy reflects the logical workflow of a retrieval-augmented LLM system. Because each span contains structured metadata, engineers can quickly answer questions such as:

Was the latency caused by retrieval or model inference?
How many documents influenced the prompt?
Which model configuration produced the response?
How many tokens were consumed?
Was the response post-processed or truncated?

This structured trace design is what transforms observability from simple monitoring into a practical debugging and optimization tool for LLM systems.

Semantic Attributes: Best Practices for LLM Observability

The goal is not to capture every possible detail, but to record the minimal set of stable, high-signal attributes that enable effective debugging, cost control, and quality analysis in production. Poor attribute design leads to noisy traces, privacy risks, and dashboards that are impossible to reason about.

Prompt, Response, and Model Metadata

Storing raw prompts is often unsafe and expensive, so it is better to record minimal, structured metadata instead. In practice, this means attaching a stable template identifier with llm.prompt_template_id, a hashed version of the final prompt using llm.prompt_hash (to avoid storing raw text), and a size indicator such as llm.prompt_length, which captures the number of tokens or characters.

You should also record key inference parameters, within the limits of your organization’s privacy and compliance requirements: llm.provider (for example, "openai" or "anthropic"), llm.model (for example, "gpt-4.1"), llm.temperature and llm.top_p (sampling parameters), llm.max_tokens (the maximum tokens allowed), and llm.stream (whether streaming was enabled).


with tracer.start_as_current_span("llm.call") as llm_span:
    llm_span.set_attribute("llm.provider", "example")
    llm_span.set_attribute("llm.model", "example-llm")
    llm_span.set_attribute("llm.temperature", 0.7)
    llm_span.set_attribute("llm.top_p", 0.9)
    llm_span.set_attribute("llm.max_tokens", 512)
    llm_span.set_attribute("llm.stream", False)
    llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

    # Build the final prompt using retrieved context.
    # The raw prompt is intentionally not stored as a span attribute.
    prompt = build_prompt(query, documents)

    # Prompt metadata
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    llm_span.set_attribute("llm.prompt_hash", prompt_hash)
    llm_span.set_attribute("llm.prompt_length", len(prompt))

Token Usage and Cost (Why This Matters in Practice)

Token usage is one of the most common blind spots in LLM systems. Many teams monitor latency and error rates but discover runaway costs only after invoices spike. Because token consumption varies significantly by prompt structure, retrieved context, and model configuration, it must be captured explicitly at the span level.

The most important practice is to record token usage at the end of the LLM span, once the model has completed inference. This ensures that the values reflect the full request rather than partial or streamed output.

At minimum, capture the attributes llm.usage.prompt_tokens, llm.usage.completion_tokens, and llm.usage.total_tokens.

class LLMResponse:
    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -> LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)

These values allow you to distinguish between requests that are expensive because of large prompts (often caused by excessive retrieval or poor prompt construction) versus those that are expensive because of long model-generated outputs.

Where possible, also attach an estimated cost as llm.cost_estimated_usd:

    # example price per token
    estimated_cost = response.total_tokens * 0.000002
    llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

This value is typically derived by multiplying token counts by the model's published pricing. Even if the estimate is approximate, it enables powerful analysis. For example, you can identify which endpoints, prompt templates, or user flows are responsible for the highest cumulative cost, rather than relying on coarse, account-level billing dashboards.
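Many providers price input and output tokens differently, so a slightly more detailed estimate can keep the two separate. The sketch below assumes a hypothetical price table expressed in USD per million tokens; the numbers are invented for illustration and should be replaced with the provider's published pricing.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per one million tokens: (input, output).
# Real values come from the provider's pricing page and change over time.
PRICES_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "example-llm": (2.0, 6.0),
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one call from token counts and per-model prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    cost = (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
    # Round to limit floating-point noise in dashboards.
    return round(cost, 8)


cost = estimate_cost_usd("example-llm", prompt_tokens=1_000, completion_tokens=500)
```

Keeping the price table in one place means a pricing change only requires updating the table, while the attribute name on the span stays the same.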

Once spans carry the right attributes, the next step is to connect them to output quality, not just system health.

Evaluation Hooks Inside Traces

This section describes an additional pattern you can layer on top of the core instrumentation in this guide. It is optional and not implemented in the sample code, but it shows how to attach quality signals directly to your traces.

Observability is not just about whether the system stayed up. It is also about whether the model produced a useful answer. Evaluation hooks inside traces let you attach lightweight quality signals directly to the same spans you use for latency and cost.

Inline evaluations are the simplest approach. You can run quick checks synchronously and record the results as span attributes, such as llm.eval.passed for a simple boolean check, llm.eval.relevance_score for an optional numerical score, or flags like llm.eval.hallucination_detected and llm.eval.refusal_detected. These attributes travel with the trace, so you can filter and aggregate on them in your observability backend just like any other field.
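A minimal sketch of such inline checks is shown below. It is deliberately crude: the refusal markers are a hypothetical phrase list, and the relevance score is just word overlap between the response and the retrieved context, not a validated metric. The function returns a dictionary keyed by the attribute names mentioned above so the results can be copied onto a span.

```python
from typing import Dict, List, Union

# Hypothetical refusal phrases; a real list would be tuned to the model in use.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def inline_eval(response_text: str, documents: List[str]) -> Dict[str, Union[bool, float]]:
    """Run cheap synchronous checks and return them as span-ready attributes."""
    lowered = response_text.lower()
    refusal = any(marker in lowered for marker in REFUSAL_MARKERS)

    # Crude relevance proxy: share of response words that appear in the context.
    context_words = set(" ".join(documents).lower().split())
    response_words = lowered.split()
    overlap = sum(1 for word in response_words if word in context_words)
    relevance = overlap / len(response_words) if response_words else 0.0

    return {
        "llm.eval.refusal_detected": refusal,
        "llm.eval.relevance_score": round(relevance, 3),
        "llm.eval.passed": bool(response_words) and not refusal,
    }


attrs = inline_eval(
    "OpenTelemetry provides tracing",
    ["OpenTelemetry provides vendor-neutral observability."],
)
```

Inside the endpoint, the returned values could be attached with a loop such as `for key, value in attrs.items(): llm_span.set_attribute(key, value)`.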

For higher accuracy, you can introduce model-based evaluation as a separate step. In this pattern, an evaluator LLM runs asynchronously on the original prompt and response, and its work is captured in a child span (for example, llm.eval) that shares the same trace ID as the main llm.call span. You then attach scores such as relevance, faithfulness, or toxicity to that evaluation span.

Because the evaluation span shares the same trace ID, you can correlate quality regressions with changes in prompts or retrieval.

Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)

This code-first observability design is vendor-agnostic. Once traces are emitted using OpenTelemetry, they can be exported to different backends without changing instrumentation.

General-purpose tracing systems like Jaeger and Grafana Tempo help engineers debug latency, errors, and request flow across retrieval, prompting, and model calls, answering how the system behaved. LLM-focused platforms such as Arize Phoenix use the same data but add model-specific insights like prompt clustering, token analysis, and quality correlation.

Because instrumentation stays OpenTelemetry-native, you maintain full control over attributes and trace structure while still using vendor dashboards, and you can switch backends as your needs evolve without touching the application code.

Operational Patterns and Anti-Patterns

Effective LLM observability requires disciplined practices. High-volume systems should sample traces to limit overhead, and prompts or responses should be hashed by default to reduce storage and privacy risk. Traces must be treated as production data, with proper access control and retention policies.
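Sampling can be configured on the tracer provider. The fragment below is a sketch assuming the opentelemetry-sdk package; the 10% ratio is a placeholder to tune for your traffic volume.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces. ParentBased makes child spans follow the
# sampling decision of their parent, so a trace is kept or dropped as a whole.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```

Head-based sampling like this decides before the request runs, so rare failures or unusually expensive requests may be dropped along with ordinary traffic.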

Common pitfalls include relying only on vendor SDK traces, logging prompts without trace correlation, or ignoring evaluation signals. These issues fragment visibility and hide quality regressions, especially when observability focuses only on agents instead of full application context.

Extending the System

Once traces are reliable, they support advanced capabilities. Metrics like p95 latency can be derived from spans, logs can be linked using trace IDs, and historical traces can power offline evaluation or prompt testing.
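As an illustration of deriving a latency metric from spans, the sketch below computes a nearest-rank p95 over a list of span durations. The durations are made-up values; in practice they would come from the tracing backend's query API or an export of llm.call spans.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical llm.call span durations in milliseconds: 100, 200, ..., 2000.
durations_ms = [float(ms) for ms in range(100, 2100, 100)]
p95 = percentile(durations_ms, 95)
```

Most backends compute percentiles for you, but having the calculation available offline is useful when replaying historical traces for prompt or model comparisons.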

By following OpenTelemetry conventions, the observability stack also stays aligned with emerging LLM semantic standards, keeping the system flexible and future-proof.

Conclusion

End-to-end LLM observability is not achieved by installing another agent. It is achieved through intentional span design, meaningful semantic attributes, and, where needed, lightweight evaluation hooks.

By treating LLM calls as first-class operations within distributed traces, you gain faster debugging, controlled costs, safer deployments, and measurable quality improvements. The backend — Jaeger, Tempo, Phoenix — is interchangeable. The instrumentation strategy is not.

A well-designed trace is the most valuable artifact in a production LLM system.

Learn how to fine-tune LLMs in 12 hours

Beau Carnes — Wed, 11 Mar 2026 16:01:21 +0000

The goal isn't just to train a model; it's to build a system that understands your specific data as well as you do.

We just posted a massive, 12-hour course on the freeCodeCamp.org YouTube channel designed to turn you from an AI consumer into an LLM architect.

While massive models like Llama 3, Gemini, and GPT-4 are impressive out of the box, their true power is unlocked when they are tailored to specific domains. This course is a deep dive into the modern LLM ecosystem, teaching you how to take these giants and make them work for your specific needs.

The course is structured into four major sections:

The Foundations of PEFT: Learn why full fine-tuning is often overkill. You will learn about Parameter-Efficient Fine-Tuning (PEFT), and master techniques like LoRA and QLoRA to train models on consumer hardware.
Advanced Alignment: You’ll learn about Reinforcement Learning from Human Feedback (RLHF) and the increasingly popular Direct Preference Optimization (DPO) to align models with human intent.
High-Performance Tooling: Get hands-on with the fastest tools in the industry. The course covers Unsloth (for 2x faster training), Axolotl, and the Llama Factory project for streamlined workflows.
Enterprise & Multimodal AI: Beyond text, the course explores Vision Transformers (ViT), multimodal architectures (Image, Video, Audio), and how to leverage enterprise APIs from OpenAI and Google Cloud Vertex AI.

Watch the full course on the freeCodeCamp.org YouTube channel (12-hour watch).

How to Build a Production-Ready Voice Agent Architecture with WebRTC

Nataraj Sundar — Fri, 06 Mar 2026 19:46:46 +0000

In this tutorial, you'll build a production-ready voice agent architecture: a browser client that streams audio over WebRTC (Web Real-Time Communication), a backend that mints short-lived session tokens, an agent runtime that orchestrates speech and tools safely, and a post-call step that generates artifacts for downstream workflows.

This article is intentionally vendor-neutral. You can implement these patterns using any AI voice platform that supports WebRTC (directly or via an SFU, selective forwarding unit) and server-side token minting. The goal is to help you ship a voice agent architecture that is secure, observable, and operable in production.

Disclosure: This article reflects my personal views and experience. It does not represent the views of my employer or any vendor mentioned.

What You'll Build
How to Avoid Common Production Failures in Voice Agents
How to Design a Latency Budget for a Real-Time Voice Agent
How to Design a Production Voice Agent Architecture (Vendor-Neutral)
Production readiness checklist
Closing

What You'll Build

By the end, you'll have:

A web client that streams microphone audio and plays agent audio.
A backend token endpoint that keeps credentials server-side.
A safe coordination channel between the agent and the application.
Structured messages between the application and the agent.
A production checklist for security, reliability, observability, and cost control.

Prerequisites

You should be comfortable with:

JavaScript or TypeScript
Node.js 18+ (so fetch works server-side) and an HTTP framework (Express in examples)
Browser microphone permissions
Basic WebRTC concepts (high level is fine)

TL;DR

A production-ready voice agent needs:

A server-side token service (no secrets in the browser)
A real-time media plane (WebRTC) for low-latency audio
A data channel for structured messages between your app and the agent
Tool guardrails (allowlists, confirmations, timeouts, audit logs)
Post-call processing (summary, action items, tickets, and customer relationship management, or CRM, entries)
Observability-first implementation (state transitions + metrics)

How to Avoid Common Production Failures in Voice Agents

If you've operated distributed systems, you've seen most failures happen at boundaries:

timeouts and partial connectivity
retries that amplify load
unclear ownership between components
missing observability
“helpful automation” that becomes unsafe

Voice agents amplify those risks because:

Latency is User Experience: A slow agent feels broken. Conversational UX is less forgiving than web UX.

Audio + UI + Tools is a Distributed System: You coordinate browser audio capture, WebRTC transport, STT (speech-to-text), model reasoning, tool calls, TTS (text-to-speech), and playback buffering. Each stage has different clocks and failure modes.

Security Boundaries are Non-negotiable: A leaked API key is catastrophic. A tool misfire can trigger real-world side effects.

Debuggability determines whether you can ship: If you don't log state transitions and capture post-call artifacts, you can't operate or improve the system safely.

How to Design a Latency Budget for a Real-Time Voice Agent

Conversations have a “feel.” That feel is mostly latency.

A practical guideline:

Under ~200ms feels instant
300–500ms feels responsive
Over ~700ms feels broken

Your end-to-end latency is the sum of mic capture, network RTT (round-trip time), STT, reasoning, tool execution, TTS, and playback buffering. Budget for it explicitly or you’ll ship a technically correct system that users perceive as unintelligent.
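One way to make the budget explicit is to write it down as data and check the total. The sketch below uses the stage names from the paragraph above; the millisecond values are illustrative placeholders, and the "sluggish" label for the 500–700ms range is an assumption added to fill the gap in the guideline.

```javascript
// Per-stage budget in milliseconds. The numbers are placeholders, not measurements.
const budgetMs = {
  micCapture: 20,
  networkRtt: 60,
  stt: 150,
  reasoning: 250,
  toolExecution: 0,
  tts: 120,
  playbackBuffer: 40,
};

// End-to-end latency is the sum of every stage.
function totalLatencyMs(stages) {
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0);
}

// Thresholds follow the guideline above. The 200-300ms gap is counted as
// "responsive", and 500-700ms gets a hypothetical "sluggish" label.
function classifyLatency(totalMs) {
  if (totalMs < 200) return "instant";
  if (totalMs <= 500) return "responsive";
  if (totalMs <= 700) return "sluggish";
  return "broken";
}

const total = totalLatencyMs(budgetMs);
console.log(total, classifyLatency(total));
```

Keeping the budget in one object makes it easy to see which stage to cut when the total drifts past the threshold you care about.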

How to Design a Production Voice Agent Architecture (Vendor-Neutral)

A scalable voice agent architecture typically has these layers:

Web client: mic capture, audio playback, UI state
Token service: short-lived session tokens (secrets stay server-side)
Real-time plane: WebRTC media + a data channel
Agent runtime: STT → reasoning → TTS, plus tool orchestration
Tool layer: external actions behind safety controls
Post-call processor: summary + structured outputs after the session ends

This separation makes failure domains and trust boundaries explicit.

Step 0: Set Up the Project

Create a new project directory:

mkdir voice-agent-app
cd voice-agent-app
npm init -y
npm pkg set type=module
npm pkg set scripts.start="node server.js"

Install dependencies:

npm install express dotenv

Create this folder structure:

voice-agent-app/
├── server.js
├── .env
└── public/
    ├── index.html
    └── client.js

Add a .env file:

VOICE_PLATFORM_URL=https://your-provider.example
VOICE_PLATFORM_API_KEY=your_api_key_here

Now you’re ready to implement each part of the system.

Step 1: Keep Credentials Server-side

Treat every API key like production credentials:

store it in environment variables or a secrets manager
rotate it if exposed
never embed it in browser or mobile apps
avoid logging secrets (log only a short suffix if necessary)

Even if a vendor supports CORS, the browser is not a safe place for long-lived credentials.
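If you do need to reference a key in logs, a small masking helper keeps the habit safe. This is a minimal sketch; `maskSecret` and its defaults are hypothetical names chosen for the example.

```javascript
// Returns a log-safe string that exposes only the last few characters.
function maskSecret(secret, visible = 4) {
  if (typeof secret !== "string" || secret.length === 0) return "(unset)";
  // Short secrets are fully masked so the suffix never reveals most of the value.
  if (secret.length <= visible * 2) return "****";
  return `****${secret.slice(-visible)}`;
}

console.log("Using API key:", maskSecret("sk_live_abcdef123456"));
```

The suffix is enough to tell two keys apart during rotation without putting the key itself into log storage.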

Step 2: Build a Backend Token Endpoint

Your backend should:

authenticate the user
mint a short-lived session token using your platform API
return only what the client needs (URL + token + expiry)

Create server.js (Node.js + Express)

import express from "express";
import dotenv from "dotenv";
import path from "path";
import { fileURLToPath } from "url";

dotenv.config();

const app = express();
app.use(express.json());

// Serve the web client from /public
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
app.use(express.static(path.join(__dirname, "public")));

const VOICE_PLATFORM_URL = process.env.VOICE_PLATFORM_URL;
const VOICE_PLATFORM_API_KEY = process.env.VOICE_PLATFORM_API_KEY;

app.post("/api/voice-token", async (req, res) => {
  res.setHeader("Cache-Control", "no-store");

  try {
    if (!VOICE_PLATFORM_URL || !VOICE_PLATFORM_API_KEY) {
      return res.status(500).json({
        error: "Missing VOICE_PLATFORM_URL or VOICE_PLATFORM_API_KEY in .env",
      });
    }

    // TODO: Authenticate the caller before minting tokens.

    const r = await fetch(`${VOICE_PLATFORM_URL}/api/v1/token`, {
      method: "POST",
      headers: {
        "X-API-Key": VOICE_PLATFORM_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ participant_name: "Web User" }),
    });

    if (!r.ok) {
      const detail = await r.text().catch(() => "");
      return res.status(r.status).json({ error: "Token request failed", detail });
    }

    const data = await r.json();

    res.json({
      rtc_url: data.rtc_url || data.livekit_url,
      token: data.token,
      expires_in: data.expires_in,
    });
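  } catch (err) {
    // Log the cause server-side; the client only gets a generic message.
    console.error("[voice-token] mint failed:", err?.message);
    res.status(500).json({ error: "Failed to mint token" });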
  }
});

app.listen(3000, () => console.log("Open http://localhost:3000"));

Run the server

npm start

Then open: http://localhost:3000

How this code works

You load credentials from environment variables so secrets never enter the browser.
The /api/voice-token endpoint calls the voice platform’s token API.
You return only the rtc_url, token, and expiration time.
The browser never sees the API key.
If the provider returns an error, you forward a structured error response.
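Because the endpoint returns an expiration time, the client can plan a refresh before the token lapses. The sketch below assumes `expires_in` is a number of seconds, which depends on your provider; `refreshDelayMs` and its defaults are hypothetical.

```javascript
// Computes how long to wait before requesting a fresh token.
// Assumes expiresInSec is the token lifetime in seconds (provider-dependent).
function refreshDelayMs(expiresInSec, { fraction = 0.5, minMs = 1000 } = {}) {
  // Fall back to the minimum delay when the lifetime is missing or invalid.
  if (!Number.isFinite(expiresInSec) || expiresInSec <= 0) return minMs;
  // Refresh after a fraction of the lifetime, but never sooner than minMs.
  return Math.max(minMs, Math.floor(expiresInSec * 1000 * fraction));
}

console.log(refreshDelayMs(100)); // halfway through a 100-second token
```

Refreshing well before expiry leaves room for a slow token request without dropping the call.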

Production Notes

rate-limit /api/voice-token (cost + abuse control)
instrument token mint latency and error rate
keep TTL short and handle refresh/reconnect
return minimal fields
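For the rate-limiting note, a fixed-window counter is enough to show the idea. This sketch is in-memory and single-process only; `createRateLimiter` is a hypothetical helper, and a production setup would more likely use a shared store or an existing middleware.

```javascript
// Minimal fixed-window rate limiter keyed by caller (in-memory, single process).
function createRateLimiter({ limit, windowMs }) {
  const windows = new Map(); // key -> { start, count }

  // Returns true when the call is allowed, false when the caller is over the limit.
  return function allow(key, now = Date.now()) {
    const w = windows.get(key);
    // Start a new window on the first call or once the old window has elapsed.
    if (!w || now - w.start >= windowMs) {
      windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count >= limit) return false;
    w.count += 1;
    return true;
  };
}

// Example: at most 5 token requests per caller per minute.
const allowTokenRequest = createRateLimiter({ limit: 5, windowMs: 60_000 });
```

Inside the /api/voice-token handler you could call `allowTokenRequest(req.ip)` first and respond with status 429 when it returns false.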

Step 3: Connect from the Web Client (WebRTC + SFU)

In this step, you'll build a minimal web UI that:

Requests a short-lived token from your backend
Connects to a real-time WebRTC room (often via an SFU)
Plays the agent's audio track
Captures and publishes microphone audio

Create `public/index.html`



  
    
    
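<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <title>Voice Agent Demo</title>
  </head>
  <body>
    <h1>Voice Agent Demo</h1>

    <button id="startBtn">Start</button>
    <button id="endBtn" disabled>End</button>

    <p id="status">Idle</p>

    <script type="module" src="/client.js"></script>
  </body>
</html>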

Create `public/client.js`

Note: This uses a LiveKit-style client SDK to demonstrate the pattern. If you're using a different provider, swap this import and the connect/publish calls for your provider's WebRTC client.

import { Room, RoomEvent, Track } from "https://unpkg.com/livekit-client@2.10.1/dist/livekit-client.esm.mjs";

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room = null;
let intentionallyDisconnected = false;
let audioEls = [];

function setStatus(text) {
  statusEl.textContent = text;
}

function detachAllAudio() {
  for (const el of audioEls) {
    try { el.pause?.(); } catch {}
    el.remove();
  }
  audioEls = [];
}

async function mintToken() {
  const res = await fetch("/api/voice-token", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ participant_name: "Web User" }),
    cache: "no-store",
  });

  if (!res.ok) {
    const detail = await res.text().catch(() => "");
    throw new Error(`Token request failed: ${detail || res.status}`);
  }

  const { rtc_url, token } = await res.json();
  if (!rtc_url || !token) throw new Error("Token response missing rtc_url or token");
  return { rtc_url, token };
}

function wireRoomEvents(r) {
  // 1) Play the agent audio track when subscribed
  r.on(RoomEvent.TrackSubscribed, (track) => {
    if (track.kind !== Track.Kind.Audio) return;

    const el = track.attach();
    audioEls.push(el);
    document.body.appendChild(el);

    // Autoplay restrictions vary by browser/device.
    el.play?.().catch(() => {
      setStatus("Connected (audio may be blocked — click the page to enable)");
    });
  });

  // 2) Reconnect on disconnect (token expiry often shows up this way)
  r.on(RoomEvent.Disconnected, async () => {
    if (intentionallyDisconnected) return;
    setStatus("Disconnected (reconnecting...)");
    await attemptReconnect();
  });
}

async function connectOnce() {
  const { rtc_url, token } = await mintToken();

  const r = new Room();
  wireRoomEvents(r);

  await r.connect(rtc_url, token);

  // Mic permission + publish mic
  try {
    await r.localParticipant.setMicrophoneEnabled(true);
  } catch {
    try { r.disconnect(); } catch {}
    throw new Error("Microphone access denied. Allow mic permission and try again.");
  }

  return r;
}

async function startCall() {
  if (room) return;

  intentionallyDisconnected = false;
  setStatus("Connecting...");

  room = await connectOnce();

  setStatus("Connected");
  startBtn.disabled = true;
  endBtn.disabled = false;
}

async function stopCall() {
  intentionallyDisconnected = true;

  try {
    await room?.localParticipant?.setMicrophoneEnabled(false);
  } catch {}

  try {
    room?.disconnect();
  } catch {}

  room = null;
  detachAllAudio();

  setStatus("Disconnected");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

async function attemptReconnect() {
  // Simplified exponential backoff reconnect.
  // In production, add jitter, max attempts, and better error classification.
  const delaysMs = [250, 500, 1000, 2000];

  for (const delay of delaysMs) {
    if (intentionallyDisconnected) return;

    try {
      // Tear down current state before reconnecting
      try { room?.disconnect(); } catch {}
      room = null;
      detachAllAudio();

      await new Promise((r) => setTimeout(r, delay));

      room = await connectOnce();
      setStatus("Reconnected");
      startBtn.disabled = true;
      endBtn.disabled = false;
      return;
    } catch {
      // keep retrying
    }
  }

  setStatus("Disconnected (reconnect failed)");
  startBtn.disabled = false;
  endBtn.disabled = true;
}

startBtn.addEventListener("click", async () => {
  try {
    await startCall();
  } catch (err) {
    setStatus(err?.message || "Connection failed");
    startBtn.disabled = false;
    endBtn.disabled = true;
    room = null;
    detachAllAudio();
  }
});

endBtn.addEventListener("click", async () => {
  await stopCall();
});

How this Step works (and why these details matter)

The Start button gives you a user gesture so browsers are more likely to allow audio playback.
Mic permission is handled explicitly: if the user denies access, you show a clear error and avoid a half-connected session.
Disconnect cleanup removes audio elements so you don't leak resources across retries.
The reconnect loop demonstrates the production pattern: if a disconnect happens (often due to token expiry or network churn), the client re-mints a token and reconnects.
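The comment in attemptReconnect mentions jitter and a maximum attempt count. A minimal sketch of both follows, using the "full jitter" variant of exponential backoff. The helper names, defaults, and the injectable `random` and `sleep` options are assumptions for illustration.

```javascript
// Exponential backoff with full jitter:
// each delay is random in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, { baseMs = 250, capMs = 4000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Runs an async operation up to maxAttempts times, sleeping between failures.
async function retryWithBackoff(operation, { maxAttempts = 5, sleep, ...opts } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // No need to sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) await wait(backoffDelayMs(attempt, opts));
    }
  }
  throw lastError;
}
```

In the client above, `retryWithBackoff(() => connectOnce())` could stand in for the fixed delay list. Jitter spreads reconnects out so that many clients dropped at once do not all hit the token endpoint at the same moment.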

In the next step, you'll add a structured data-channel handler to safely process agent-suggested “client actions”.

Handle These Explicitly

Autoplay Restriction Example

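The page needs Start and End buttons and a status element with the IDs startBtn, endBtn, and status, which client.js looks up below.

In client.js: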

const startBtn = document.getElementById("startBtn");
const endBtn = document.getElementById("endBtn");
const statusEl = document.getElementById("status");

let room;

startBtn.addEventListener("click", async () => {
  try {
    // connectOnce() is the connect helper defined in client.js above.
    room = await connectOnce();
    statusEl.textContent = "Connected";
    startBtn.disabled = true;
    endBtn.disabled = false;
  } catch (err) {
    statusEl.textContent = "Connection failed";
  }
});

Microphone denial

try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  statusEl.textContent = "Microphone access denied";
  throw err;
}

Disconnect cleanup

endBtn.addEventListener("click", () => {
  if (room) {
    room.disconnect();
    statusEl.textContent = "Disconnected";
    startBtn.disabled = false;
    endBtn.disabled = true;
  }
});

Token refresh (simplified)

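room.on(RoomEvent.Disconnected, async () => {
  // The token endpoint only accepts POST, so a default GET fetch would fail.
  const res = await fetch("/api/voice-token", { method: "POST" });
  if (!res.ok) return; // give up here; attemptReconnect() above shows retries
  const { rtc_url, token } = await res.json();
  await room.connect(rtc_url, token);
});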

Step 4: Add Client Actions (Agent Suggests, App Executes)

A production voice agent often needs to:

open a runbook/dashboard URL
show a checklist in the UI
request confirmation for an irreversible action
receive structured context (account, region, incident ID)

The key safety rule:

The agent suggests actions. The application validates and executes them.

Use structured messages over the data channel:

{
  "type": "client_action",
  "action": "open_url",
  "payload": { "url": "https://internal.example.com/runbook" },
  "id": "action_123"
}

Add guardrails:

allowlist permitted actions
validate payload shape
confirmation gates for irreversible actions
idempotency via id
audit logs for every request and outcome

This boundary limits damage from hallucinations or prompt injection.

// Guardrails: allowlist + validation + idempotency + confirmation

const ALLOWED_ACTIONS = new Set(["open_url", "request_confirm"]);
const EXECUTED_ACTION_IDS = new Set();
const ALLOWED_HOSTS = new Set(["internal.example.com"]);

function parseClientAction(text) {
  let msg;
  try {
    msg = JSON.parse(text);
  } catch {
    return null;
  }

  if (msg?.type !== "client_action") return null;
  if (typeof msg.id !== "string") return null;
  if (!ALLOWED_ACTIONS.has(msg.action)) return null;

  return msg;
}

async function handleClientAction(msg, room) {
  if (EXECUTED_ACTION_IDS.has(msg.id)) return; // idempotency
  EXECUTED_ACTION_IDS.add(msg.id);

  console.log("[client_action]", msg); // audit log (demo)

  if (msg.action === "open_url") {
    const url = msg.payload?.url;
    if (typeof url !== "string") return;

    let u;
    try {
      u = new URL(url); // throws on malformed input
    } catch {
      return;
    }
    if (!ALLOWED_HOSTS.has(u.host)) {
      console.warn("Blocked navigation to:", u.host);
      return;
    }

    window.open(url, "_blank", "noopener,noreferrer");
    return;
  }

  if (msg.action === "request_confirm") {
    const prompt = msg.payload?.prompt || "Confirm this action?";
    const ok = window.confirm(prompt);

    // Send confirmation back to agent/app
    room.localParticipant.publishData(
      new TextEncoder().encode(
        JSON.stringify({ type: "user_confirmed", id: msg.id, ok })
      ),
      { topic: "client_events", reliable: true }
    );
  }
}

room.on(RoomEvent.DataReceived, (payload, participant, kind, topic) => {
  if (topic !== "client_actions") return;

  const text = new TextDecoder().decode(payload);
  const msg = parseClientAction(text);
  if (!msg) return;

  handleClientAction(msg, room);
});

Step 5: Add Tool Integrations Safely

Tools turn a voice agent into automation. Regardless of vendor, enforce these rules:

timeouts on every tool call
circuit breakers for flaky dependencies
audit logs (inputs, outputs, duration, trace IDs)
explicit confirmation for destructive actions
credentials stored server-side (never in prompts or clients)

If tools fail, degrade gracefully (“I can’t access that system right now, here’s the manual fallback.”). Silence reads as failure.

Create a server-side tool runner (example)

Paste this into server.js:

const TOOL_ALLOWLIST = {
  get_status: { destructive: false },
  create_ticket: { destructive: true },
};

let failures = 0;
let circuitOpenUntil = 0;

function circuitOpen() {
  return Date.now() < circuitOpenUntil;
}

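function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  // Clear the timer once either side settles so it does not linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}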

async function runToolSafely(tool, args) {
  if (circuitOpen()) throw new Error("circuit_open");

  try {
    const result = await withTimeout(Promise.resolve({ ok: true, tool, args }), 2000);
    failures = 0;
    return result;
  } catch (err) {
    failures++;
    if (failures >= 3) circuitOpenUntil = Date.now() + 10_000;
    throw err;
  }
}

app.post("/api/tools/run", async (req, res) => {
  const { tool, args, user_confirmed } = req.body || {};

  if (!TOOL_ALLOWLIST[tool]) return res.status(400).json({ error: "Tool not allowed" });

  if (TOOL_ALLOWLIST[tool].destructive && user_confirmed !== true) {
    return res.status(400).json({ error: "Confirmation required" });
  }

  try {
    const started = Date.now();
    const result = await runToolSafely(tool, args);
    console.log("[tool_call]", { tool, ms: Date.now() - started }); // audit log
    res.json({ ok: true, result });
  } catch (err) {
    console.log("[tool_error]", { tool, err: String(err) });
    res.status(500).json({ ok: false, error: "Tool call failed" });
  }
});

Step 6: Add Post-Call Processing (Where Durable Value Appears)

After a call ends, generate structured artifacts:

summary
action items
follow-up email draft
CRM entry or ticket creation

A production pattern:

store transcript + metadata
enqueue a background job (queue/worker)
produce outputs as JSON + a human-readable report
apply integrations with retries + idempotency
store a “call report” for audits and incident reviews
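The retries and idempotency steps in this pattern can be sketched with an in-memory map keyed by call ID. The map stands in for a durable store, and `processOnce` is a hypothetical helper; a real worker would persist results and add backoff between attempts.

```javascript
// In-memory stand-in for a durable results store (assumption for this sketch).
const processedCalls = new Map(); // call_id -> report

// Runs handler(job) at most once per call_id, retrying transient failures.
async function processOnce(job, handler, { maxAttempts = 3 } = {}) {
  // Idempotency: a repeated delivery of the same call_id returns the stored report.
  if (processedCalls.has(job.call_id)) return processedCalls.get(job.call_id);

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const report = await handler(job);
      processedCalls.set(job.call_id, report);
      return report;
    } catch (err) {
      lastError = err; // try again until attempts run out
    }
  }
  throw lastError;
}
```

This version does not guard against two deliveries of the same call being processed concurrently. A durable queue with a lock or a unique constraint on the call ID would cover that case.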

Create a post-call webhook endpoint (example)

Paste into server.js:

app.post("/webhooks/call-ended", async (req, res) => {
  const payload = req.body;

  console.log("[call_ended]", {
    call_id: payload.call_id,
    ended_at: payload.ended_at,
  });

  setImmediate(() => processPostCall(payload));
  res.json({ ok: true });
});

function processPostCall(payload) {
  const transcript = payload.transcript || [];
  const summary = transcript.slice(0, 3).map(t => `- ${t.speaker}: ${t.text}`).join("\n");

  const report = {
    call_id: payload.call_id,
    summary,
    action_items: payload.action_items || [],
    created_at: new Date().toISOString(),
  };

  console.log("[call_report]", report);
}

Test it locally

curl -X POST http://localhost:3000/webhooks/call-ended \
  -H "Content-Type: application/json" \
  -d '{
    "call_id": "call_123",
    "ended_at": "2026-02-26T00:10:00Z",
    "transcript": [
      {"speaker": "user", "text": "I need help resetting my password."},
      {"speaker": "agent", "text": "Sure — I can help with that."}
    ],
    "action_items": ["Send password reset link", "Verify account email"]
  }'

Production readiness checklist

Security

no API keys in the browser
strict allowlist for client actions
confirmation gates for destructive actions
schema validation on all inbound messages
audit logging for actions and tool calls

Reliability

reconnect strategy for expired tokens
timeouts + circuit breakers for tools
graceful degradation when dependencies fail
idempotent side effects

Observability

Log state transitions (for example):
listening → thinking → speaking → ended
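A small state machine makes these transitions explicit and logs each one. The sketch below uses the four states named above. The transition table, including the loop from speaking back to listening for multi-turn calls, is an assumption, and `createCallState` is a hypothetical helper.

```javascript
// Allowed transitions between call states (assumed table for this sketch).
const TRANSITIONS = {
  listening: ["thinking", "ended"],
  thinking: ["speaking", "ended"],
  speaking: ["listening", "ended"],
  ended: [],
};

// Tracks the current state and logs every valid transition.
function createCallState(log = console.log) {
  let state = "listening";
  return {
    get state() {
      return state;
    },
    transition(next) {
      if (!TRANSITIONS[state].includes(next)) {
        throw new Error(`Invalid transition: ${state} -> ${next}`);
      }
      log("[state]", { from: state, to: next, at: Date.now() });
      state = next;
    },
  };
}
```

Rejecting invalid transitions turns silent logic errors into visible ones, and the timestamps on each log line give you per-stage durations for free.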

Track:

connect failure rate
end-to-end latency (STT + reasoning + TTS)
tool error rate
reconnect frequency

Cost control

rate-limit token minting and sessions
cap max call duration
bound context growth (summarize or truncate)
track per-call usage drivers (STT/TTS minutes, tool calls)
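Bounding context growth can be as simple as keeping only the most recent messages that fit a budget. The sketch below measures the budget in characters as a rough proxy for tokens, which is an assumption; `truncateContext` and the `text` field are hypothetical names.

```javascript
// Keeps the most recent messages whose combined text length fits the budget.
function truncateContext(messages, maxChars) {
  const kept = [];
  let used = 0;
  // Walk backwards so the newest messages are retained first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const len = messages[i].text.length;
    if (used + len > maxChars) break;
    kept.unshift(messages[i]);
    used += len;
  }
  return kept;
}
```

Truncation loses older turns entirely. Summarizing them into one short message before dropping them is the usual alternative when early context matters.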

Optional resources

How to Try a Managed Voice Platform Quickly

If you want a managed provider to test quickly, you can sign up for a Vocal Bridge account and implement these steps using their token minting + real-time session APIs.

But the core production voice agent architecture in this article is vendor-agnostic. You can replace any component (SFU, STT/TTS, agent runtime, tool layer) as long as you preserve the boundaries: secure token service, real-time media, safe tool execution, and strong observability.

Watch a full demo and explore a complete reference repo

If you'd like to see these patterns working together in a realistic scenario (incident triage), here are two optional resources:

- Demo video: Voice-First Incident Triage (end-to-end run)
This is a hackathon run-through showing client actions, decision boundaries for irreversible actions, and a structured post-call summary.

- GitHub repo (architecture + design + working code): https://github.com/natarajsundar/voice-first-incident-triage

These links are optional; you can follow the tutorial end-to-end without them.

Closing

Production-ready voice agents work when you treat them like real-time distributed systems.

Start with the baseline:

token service + web client + real-time audio

Then layer in:

controlled client actions
safe tools
post-call automation
observability and cost controls

That’s how you ship a voice agent architecture you can operate. You now have a vendor-neutral reference architecture you can adapt to your stack, with clear trust boundaries, safe tool execution, and operational visibility.

If you’re shipping real-time AI systems, what’s been your biggest production bottleneck so far: latency, reliability, or tool safety? I’d love to hear what you’re seeing in the wild. Connect with me on LinkedIn.
