What if you could build your own LLM, one that speaks your native language, all from scratch? That's exactly what we'll do in this tutorial. The best way to understand how LLMs work is by actually building one.
We'll go through each step of creating your own LLM in a specific language (Urdu in this case). This will help you understand what goes on inside an LLM.
Modern LLMs trace back to the research paper that changed everything: "Attention Is All You Need". But rather than getting lost in the math (I am bad at math, sadly), we'll learn by building one from scratch.
Who is This Handbook For?
Software engineers, product owners, or anyone curious about how LLMs work. If you have a little machine learning knowledge, that would be great, but if not, no worries. I've written this so that you don't have to go anywhere outside this tutorial.
By the end, you will have a working Urdu LLM chatbot deployed and running. You can create one for your own native language as well by following the steps defined below.
A Note on Expectations:
The goal here is to educate ourselves on how LLMs work by practically going through all the steps.
The goal is not for your LLM to act like ChatGPT. That would require massive datasets, months of training, and reinforcement learning from human feedback (RLHF), all of which you'll understand better by going through this tutorial.
A Note on the Code:
The code in this tutorial was largely generated using Claude Opus 4. This is worth highlighting because it shows that LLMs are not just coding assistants that help you ship features faster. They can also be powerful learning tools.
By prompting Claude to generate, explain, and iterate on each component, I was able to understand the internals of LLM training far more deeply than reading documentation alone.
If you're following along, I encourage you to do the same: use an LLM for your learning.
What We'll Cover:
Components of LLM Training
In this tutorial, we'll be covering the following components one by one with code examples for better understanding:
Data Preparation
Tokenization
Pre-Training
Supervised Fine-Tuning (SFT)
Deployment
Tech Stack Required
Before starting the steps, here is the tech stack you need:
Python 3.9+
PyTorch
Tokenizers / SentencePiece
Hugging Face Datasets & Hub
regex, BeautifulSoup4, requests (for data cleaning)
tqdm, matplotlib (for training utilities)
Gradio (for chat UI deployment)
Google Colab (free T4 GPU for training)
Note: Make sure to install all the dependencies listed in the requirements.txt file of the repository before getting started.
1. Data Preparation
In data preparation, the first and foremost step is data collection. An LLM needs to be trained on a large amount of text data. There is no single place to get this data. Depending on the type of model you want to build, you can collect text from many sources:
Digital libraries and archives: Internet Archive or Wikipedia dumps
Code repositories: GitHub, GitLab (useful if your model needs to understand code)
Web scraping: Crawling websites, blogs, and forums using automated scripts
Academic datasets: Research papers, open-access journals
Pre-built datasets: Platforms like Hugging Face Datasets and Kaggle host thousands of ready-to-use datasets
In practice, large-scale LLMs like GPT and LLaMA rely heavily on web scraping from many sources using automated pipelines. But there's one important rule to follow: only use publicly available, open-source data. Don't scrape private or personal user information. Stick to data that's explicitly shared for public use or falls under permissive licenses.
Also, keep this principle in mind: garbage in, garbage out. Just getting the data isn't enough. It should be correct, clean, and without noise.
For this tutorial, I used Hugging Face as my data source. I chose it for a few reasons.
First, since the goal was to learn how LLMs work, I wanted to spend my time on the model, not on writing web scrapers. Hugging Face already has a large collection of datasets in a cleaned and structured format, which saves a lot of upfront work.
Second, Hugging Face offers language-specific datasets. Since I was building an Urdu LLM, I needed Urdu text specifically, and Hugging Face has CulturaX which provides multilingual datasets including Urdu and many other languages. The dataset was huge, so I avoided downloading all of it and only downloaded a small portion.
Important: Before you start downloading the dataset from Hugging Face, you need to create an account. Then authenticate from the command line with huggingface-cli login, after which you'll be able to download the dataset.
In the script below, we load the dataset from Hugging Face and set streaming to True. This way, we don't download the entire dataset, only as many samples as NUM_SAMPLES specifies.
# ============================================================
# Option A: Download from CulturaX (recommended, high quality)
# ============================================================
# CulturaX is a cleaned version of mC4 + OSCAR
# We stream it to avoid downloading the entire dataset
from datasets import load_dataset
from tqdm import tqdm

NUM_SAMPLES = 100_000  # Start with 100K samples (~50-100MB text)

print("Loading CulturaX Urdu dataset (streaming)...")
dataset = load_dataset(
    "uonlp/CulturaX",
    "ur",                    # Urdu language code
    split="train",
    streaming=True,          # Don't download everything
    trust_remote_code=True
)

# Collect samples
raw_texts = []
for i, sample in enumerate(tqdm(dataset, total=NUM_SAMPLES, desc="Downloading")):
    if i >= NUM_SAMPLES:
        break
    raw_texts.append(sample["text"])

print(f"\nDownloaded {len(raw_texts)} samples")
print(f"Total characters: {sum(len(t) for t in raw_texts):,}")
print("\nSample text (first 500 chars):")
print(raw_texts[0][:500])
Data Cleaning
Simply having the data is not enough to start training your model. The next step is probably the most important one: data cleaning. The goal is to make the data as pure as possible.
As I was building a language-specific Urdu LLM, I had to write cleaning logic to remove non-Urdu text, HTML links, special characters, duplicate content, and excess whitespace. All these factors pollute the training data and can cause issues during training.
Based on the type of dataset, some language-specific or use-case cleaning will be required.
One thing that might be new to you is the NFKC Unicode normalization step. This normalizes text that appears the same but exists in different Unicode forms, keeping one canonical form.
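Here's a quick illustration using Python's standard library (the characters below are arbitrary compatibility-form examples, not taken from the dataset):

```python
import unicodedata

# The same visible text can be encoded with different code points.
# NFKC folds these "compatibility" forms into one canonical form.
ligature = "\ufb01le"      # starts with the single-code-point 'fi' ligature
alef_isolated = "\ufe8d"   # Arabic Presentation Form of alef (ﺍ)

print(unicodedata.normalize("NFKC", ligature))                   # file
print(unicodedata.normalize("NFKC", alef_isolated) == "\u0627")  # True (base alef ا)
```

Without this step, the tokenizer would treat visually identical strings as different tokens.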
You'll also see some regex patterns that are used to keep only the Urdu text. As Urdu script is based on Arabic, we'll use Arabic Unicode ranges. I also removed artifacts like //, --, and extra empty spaces that were present in the raw data.
This cleaning took multiple iterations. I reviewed the results manually each time and identified issues like inconsistent spacing, long dashes, and stray punctuation. All of these can negatively impact the next stages, so it's important to clean thoroughly.
This also gives you an idea of how important the data part still is and how much LLMs depend on data.
Here is the cleaning function I used:
import re
import regex
import unicodedata

def clean_urdu_text(text: str) -> str:
    """
    Clean a single Urdu text document.
    Steps:
    1. Remove URLs
    2. Remove HTML tags and entities
    3. Remove email addresses
    4. Normalize Unicode (NFKC normalization)
    5. Remove non-Urdu characters (keep Urdu + punctuation + digits)
    6. Normalize repeated punctuation (۔۔۔, ..., - -, etc.)
    7. Normalize whitespace
    """
    # Step 1: Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Step 2: Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove HTML entities
    text = re.sub(r'&[a-zA-Z]+;', ' ', text)
    text = re.sub(r'&#\d+;', ' ', text)
    # Step 3: Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Step 4: Unicode normalization (NFKC)
    # This normalizes different representations of the same character
    text = unicodedata.normalize('NFKC', text)
    # Step 5: Keep only Urdu characters, basic punctuation, digits, and whitespace
    # Urdu Unicode ranges + Arabic punctuation + Western digits + basic punctuation
    urdu_pattern = regex.compile(
        r'[^'
        r'\u0600-\u06FF'        # Arabic (includes Urdu)
        r'\u0750-\u077F'        # Arabic Supplement
        r'\u08A0-\u08FF'        # Arabic Extended-A
        r'\uFB50-\uFDFF'        # Arabic Presentation Forms-A
        r'\uFE70-\uFEFF'        # Arabic Presentation Forms-B
        r'0-9۰-۹'               # Western and Eastern Arabic-Indic digits
        r'\s'                   # Whitespace
        r'۔،؟!٪'                # Urdu punctuation (full stop, comma, question mark, etc.)
        r'.,:;!?\-\(\)"\']'     # Basic Latin punctuation
    )
    text = urdu_pattern.sub(' ', text)
    # Step 6: Normalize repeated punctuation
    text = re.sub(r'۔{2,}', '۔', text)
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r'-\s*-+', '-', text)
    text = re.sub(r'-{2,}', '-', text)
    text = re.sub(r'،{2,}', '،', text)
    text = re.sub(r',{2,}', ',', text)
    text = re.sub(r'\s+[۔\.\-,،]\s+', ' ', text)
    # Step 7: Normalize whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 newlines
    text = re.sub(r'[^\S\n]+', ' ', text)   # Collapse spaces (but keep newlines)
    text = text.strip()
    return text
def is_mostly_urdu(text: str, threshold: float = 0.5) -> bool:
    """
    Check if text is mostly Urdu characters.
    This filters out documents that are primarily English/other languages.
    threshold: minimum fraction of characters that must be Urdu
    """
    if len(text) == 0:
        return False
    urdu_chars = len(regex.findall(
        r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]', text))
    return (urdu_chars / len(text)) > threshold

# Test the cleaning function
sample = raw_texts[0]
print("=== BEFORE CLEANING ===")
print(sample[:300])
print("\n=== AFTER CLEANING ===")
cleaned = clean_urdu_text(sample)
print(cleaned[:300])
print(f"\nIs mostly Urdu: {is_mostly_urdu(cleaned)}")
After cleaning, I stored the data in two formats: a text file (used for tokenizer training) and a JSONL file (used for pre-training). Each format serves a specific purpose in the upcoming steps.
2. Tokenization
The next step after cleaning is tokenization. Tokenization converts text into numbers, and provides a way to convert those numbers back into text.
This is necessary because neural networks can't understand text – they only understand numbers. So tokenization is essentially a translation layer between human language and what the model can process.
For example:
"hello world" → ["hel", "lo", " world"] → [1245, 532, 995]
"اردو زبان" ← ["ار", "دو", "زب", "ان"] ← [412, 87, 953, 201]
Tokenization Approaches
There are three main approaches to tokenization:
Approach 1: Character-level
With this approach, you split text into individual characters:
hello -> ['h', 'e', 'l', 'l', 'o']
اردو -> ['ا', 'ر', 'د', 'و']
The problem is that sequences become very long. A 1000-word document might be 5000+ tokens. The model has to learn to combine characters into words, which is very hard.
Approach 2: Word-level
In this approach, you split based on spaces between words:
hello how are you -> ['hello', 'how', 'are', 'you']
اردو بہت اچھی زبان ہے -> ['اردو', 'بہت', 'اچھی', 'زبان', 'ہے']
The problem is that a language's vocabulary is huge (Urdu has 100K+ unique words, English has 170K+). The model can't handle new or rare words (the out-of-vocabulary problem).
Approach 3: Subword using BPE (Byte Pair Encoding)
With this approach, the model learns common character sequences from data.
unhappiness might split as ['un', 'happi', 'ness']
مکمل might split as ['مکم', 'ل'] or stay whole if common enough
This gives a smaller vocabulary (we use 32K tokens), and it can handle any word, even new ones. Common words stay as single tokens.
BPE is the industry standard, used by GPT, LLaMA, and most modern LLMs. Here is how it works step by step:
Start with characters: vocabulary = all individual characters
Count pairs: find the most frequent adjacent pair of tokens
Merge: combine that pair into a new token
Repeat: until vocabulary reaches desired size
Here's an example:
Start: ا ر د و ز ب ا ن
Merge 1: 'ا ر' -> 'ار' (most common pair)
Result: ار د و ز ب ا ن
Merge 2: 'ز ب' -> 'زب' (next most common)
Result: ار د و زب ا ن
...and so on for 32,000 merges
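The merge loop above can be sketched in a few lines of plain Python. This is a toy illustration on a made-up English corpus, not the optimized implementation in the tokenizers library we'll actually use:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs. `words` maps a tuple of symbols to its frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, with each word starting as characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for step in range(3):  # a real tokenizer runs thousands of merges
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print(f"Merge {step + 1}: {pair[0]} + {pair[1]} -> {pair[0] + pair[1]}")
```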
This is the approach we'll use for our Urdu LLM. I trained a BPE tokenizer with a vocabulary size of 32K tokens on the cleaned Urdu corpus.
Special Tokens
Along with BPE, we also need to add some special tokens. These tokens give the model structural information it needs during training and inference.
| Token | Purpose | Why It Is Needed |
|---|---|---|
| `<pad>` | Padding for equal-length sequences | Batching requires all sequences to be the same length. Shorter sequences are filled with `<pad>` tokens. |
| `<unk>` | Unknown word fallback | If the model encounters a token not in the vocabulary, it maps to `<unk>` instead of failing. |
| `<bos>` | Marks the start of a sequence | Tells the model where the input begins, leading to more stable generation. |
| `<eos>` | Marks the end of a sequence | Tells the model when to stop generating. Without it, output may run forever or stop randomly. |
| `<sep>` | Separates segments | In chat format, separates the system prompt, user message, and assistant response so the model knows which role is which. |
| `<\|user\|>`, `<\|assistant\|>`, `<\|system\|>` | Mark conversation turns | In chat format, label which speaker each segment belongs to (user, assistant, or system). |
BPE Tokenizer Configuration
I set vocab size to 32K. What does that mean? It means the model will have 32K tokens in its vocabulary lookup table.
This is a good balance between language coverage and model size. If we increase vocab size, the embedding layer and output layer both grow, which means more parameters to train. For a learning project, 32K keeps things manageable.
MIN_FREQUENCY is set to 2, meaning a token must appear at least twice in the corpus to be included. This filters out one-off noise tokens that would waste vocabulary slots.
For reference: GPT-2 uses a vocabulary of 50K tokens, and LLaMA uses 32K. Our choice of 32K is in line with production models.
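A quick back-of-the-envelope calculation shows why vocabulary size drives parameter count (using d_model = 384, the embedding size this tutorial configures in the pre-training section):

```python
vocab_size = 32_000
d_model = 384  # embedding size used later in this tutorial

# Each vocabulary entry gets one d_model-sized row in the embedding table.
embedding_params = vocab_size * d_model
print(f"Embedding table: {embedding_params:,} parameters")   # 12,288,000

# The LM head has the same shape, so doubling the vocab to 64K
# roughly doubles these vocabulary-dependent parameters.
embedding_params_64k = 64_000 * d_model
print(f"At 64K vocab: {embedding_params_64k:,} parameters")  # 24,576,000
```

(Our model later ties the embedding and output weights, so the table is paid for once rather than twice.)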
VOCAB_SIZE = 32_000   # Number of tokens in our vocabulary
MIN_FREQUENCY = 2     # Token must appear at least twice (filters noise)

# Special tokens - these have reserved IDs
SPECIAL_TOKENS = [
    "<pad>",          # ID 0: padding
    "<unk>",          # ID 1: unknown
    "<bos>",          # ID 2: beginning of sequence
    "<eos>",          # ID 3: end of sequence
    "<sep>",          # ID 4: separator (for chat format)
    "<|user|>",       # ID 5: user turn marker (for chat)
    "<|assistant|>",  # ID 6: assistant turn marker (for chat)
    "<|system|>",     # ID 7: system prompt marker (for chat)
]
Building the Tokenizer
Next up is creating the tokenizer using the cleaned text file we created earlier. First, we'll import the required libraries and set up the file paths:
import os
from pathlib import Path
from tokenizers import (
    Tokenizer,
    models,
    trainers,
    pre_tokenizers,
    decoders,
    processors,
    normalizers,
)

PROJECT_ROOT = Path(".").resolve().parent
CLEANED_DIR = PROJECT_ROOT / "data" / "cleaned"
TOKENIZER_DIR = PROJECT_ROOT / "tokenizer" / "urdu_tokenizer"
TOKENIZER_DIR.mkdir(parents=True, exist_ok=True)

CORPUS_FILE = str(CLEANED_DIR / "urdu_corpus.txt")
print(f"Corpus file: {CORPUS_FILE}")
print(f"Tokenizer output: {TOKENIZER_DIR}")

# Verify corpus exists
assert os.path.exists(CORPUS_FILE), f"Corpus not found at {CORPUS_FILE}. Run notebook 01 first!"
file_size_mb = os.path.getsize(CORPUS_FILE) / 1024 / 1024
print(f"Corpus size: {file_size_mb:.1f} MB")
Now we'll configure the tokenizer components:
# ============================================================
# Build the tokenizer
# ============================================================
# Step 1: Create a BPE model (the core algorithm)
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# Step 2: Add normalizer (text cleaning before tokenization)
# NFKC normalizes Unicode (e.g., different forms of the same Arabic letter)
tokenizer.normalizer = normalizers.NFKC()
# Step 3: Pre-tokenizer (how to split text before BPE)
# We use Metaspace which replaces spaces with ▁ and splits on them
# This preserves space information so we can reconstruct the original text
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
# Step 4: Decoder (how to convert tokens back to text)
# Metaspace decoder converts ▁ back to spaces
tokenizer.decoder = decoders.Metaspace()
# Step 5: Configure the trainer
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    special_tokens=SPECIAL_TOKENS,
    show_progress=True,
    initial_alphabet=[]  # Learn alphabet from data
)
print("Tokenizer configured. Ready to train!")
Training the Tokenizer
Once the tokenizer is configured, the next step is to run it. This will take roughly 5 to 10 minutes depending on your device.
print("Training tokenizer... (this may take a few minutes)")
tokenizer.train([CORPUS_FILE], trainer)
print(f"\n Tokenizer trained!")
print(f" Vocabulary size: {tokenizer.get_vocab_size():,}")
Configuring Post-Processing (Auto-Wrapping with BOS/EOS)
Next, we'll configure post-processing so the tokenizer automatically wraps every sequence with <bos> and <eos> tokens. This means we don't have to manually add them each time we encode text:
bos_id = tokenizer.token_to_id("<bos>")
eos_id = tokenizer.token_to_id("<eos>")

tokenizer.post_processor = processors.TemplateProcessing(
    single="<bos>:0 $A:0 <eos>:0",
    pair="<bos>:0 $A:0 <sep>:0 $B:1 <eos>:1",
    special_tokens=[
        ("<bos>", bos_id),
        ("<eos>", eos_id),
        ("<sep>", tokenizer.token_to_id("<sep>")),
    ],
)

print("Post-processor configured (auto-adds <bos> and <eos>)")
Note: You might wonder why we need this step when we already defined <bos> and <eos> in SPECIAL_TOKENS. The SPECIAL_TOKENS list only reserves vocabulary slots for these tokens (assigns them IDs). Post-processing tells the tokenizer to automatically insert them into every encoded sequence.
Without this step, the tokens would exist in the vocabulary but never appear in your data unless you added them manually each time.
Testing the Tokenizer
The final step in tokenization is to test it. The test encodes Urdu sentences into token IDs, then decodes those IDs back into text. If the decoded text matches the original input, the tokenizer is working correctly. This roundtrip test confirms that no information is lost during encoding and decoding:
test_sentences = [
    "اردو ایک بہت خوبصورت زبان ہے",          # "Urdu is a very beautiful language"
    "پاکستان کا دارالحکومت اسلام آباد ہے",   # "The capital of Pakistan is Islamabad"
    "آج موسم بہت اچھا ہے",                   # "The weather is very nice today"
    "مصنوعی ذہانت مستقبل کی ٹیکنالوجی ہے",   # "AI is the technology of the future"
    "السلام علیکم! آپ کیسے ہیں؟",            # "Peace be upon you! How are you?"
]

print("=" * 70)
print("TOKENIZER TEST RESULTS")
print("=" * 70)

for sentence in test_sentences:
    encoded = tokenizer.encode(sentence)
    decoded = tokenizer.decode(encoded.ids)
    print(f"\n Input: {sentence}")
    print(f" Token IDs: {encoded.ids}")
    print(f" Tokens: {encoded.tokens}")
    print(f" Decoded: {decoded}")
    print(f" Num tokens: {len(encoded.ids)}")
    print(f" Roundtrip OK: {sentence in decoded}")
    print("-" * 70)
Here is what the output looks like:
======================================================================
TOKENIZER TEST RESULTS
======================================================================
Input: اردو ایک بہت خوبصورت زبان ہے
Token IDs: [2, 1418, 324, 431, 2965, 1430, 276, 3]
Tokens: ['<bos>', '▁اردو', '▁ایک', '▁بہت', '▁خوبصورت', '▁زبان', '▁ہے', '<eos>']
Decoded: اردو ایک بہت خوبصورت زبان ہے
Num tokens: 8
Roundtrip OK: True
----------------------------------------------------------------------
Input: پاکستان کا دارالحکومت اسلام آباد ہے
Token IDs: [2, 474, 289, 3699, 616, 1004, 276, 3]
Tokens: ['<bos>', '▁پاکستان', '▁کا', '▁دارالحکومت', '▁اسلام', '▁آباد', '▁ہے', '<eos>']
Decoded: پاکستان کا دارالحکومت اسلام آباد ہے
Num tokens: 8
Roundtrip OK: True
Notice how <bos> and <eos> are automatically added (thanks to our post-processing step), common Urdu words like پاکستان stay as single tokens, and the ▁ prefix marks word boundaries from the Metaspace pre-tokenizer. Most importantly, every roundtrip succeeds, meaning decoded text matches the original input exactly.
Fertility Score
Fertility is the average number of tokens per word.
A fertility of 1 means each word maps to one token (ideal but unrealistic in modern subword tokenizers).
In modern LLMs, fertility is usually around 1.3–2.5 depending on the language.
Higher fertility means more token splitting, which increases cost and reduces efficiency, but it's also influenced by language complexity, not just tokenizer quality.
# ============================================================
# Calculate fertility score on training corpus
# ============================================================
import json

jsonl_file = CLEANED_DIR / "urdu_corpus.jsonl"

corpus_words = 0
corpus_tokens = 0
sample_size = 10000  # Sample 10K documents for speed

print(f"Calculating fertility on {sample_size:,} documents from corpus...")

with open(jsonl_file, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= sample_size:
            break
        doc = json.loads(line)
        text = doc["text"]
        words = text.split()
        tokens = tokenizer.encode(text).tokens
        n_tokens = len(tokens) - 2  # Remove <bos> and <eos>
        corpus_words += len(words)
        corpus_tokens += n_tokens

corpus_fertility = corpus_tokens / corpus_words
print(f"\n📊 Fertility Score (corpus): {corpus_fertility:.2f} tokens/word")
print(f" (Total: {corpus_words:,} words → {corpus_tokens:,} tokens)")
print(f" Documents sampled: {min(i+1, sample_size):,}")

if corpus_fertility < 2.0:
    print(" ✅ Excellent! Tokenizer is well-optimized for Urdu.")
elif corpus_fertility < 3.0:
    print(" ⚠️ Good, but could be better. Consider larger vocab.")
else:
    print(" ❌ High fertility. The tokenizer needs improvement.")
The fertility score we get here is 1.04, which is quite good. But keep in mind that this number is artificially low because the tokenizer was trained on the same small corpus it's being evaluated on. With a larger or unseen dataset, fertility would likely be higher (closer to the 1.3-2.5 range typical for production tokenizers).
Saving the Tokenizer
The final step is to save the tokenizer in JSON format and verify that it loads correctly:
# ============================================================
# Save the tokenizer
# ============================================================
tokenizer_path = str(TOKENIZER_DIR / "urdu_bpe_tokenizer.json")
tokenizer.save(tokenizer_path)
print(f" Tokenizer saved to: {tokenizer_path}")
print(f" File size: {os.path.getsize(tokenizer_path) / 1024:.0f} KB")
# Verify we can load it back
loaded_tokenizer = Tokenizer.from_file(tokenizer_path)
test = loaded_tokenizer.encode("اردو ایک خوبصورت زبان ہے")
print(f"\n Verification: {test.tokens}")
print(f" Tokenizer loads correctly!")
Once saved, we have a lookup table. Using this, along with our corpus of data, we can perform the next important step: pre-training.
3. Pre-Training
In this part, the model learns the language, grammar, patterns, and vocabulary. Once training is done, the model is able to predict the next word in a sequence, and this is where we start to see raw data turning into an LLM.
LLMs are actually next-word predictors. Given a sequence of words, they predict the most probable next word.
With the help of training, the model learns:
The syntax of the language
Semantics, the contextual meaning
Frequently used expressions
Facts from the training dataset
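To make "next-word predictor" concrete, here's a toy predictor built from bigram counts on a made-up sentence. A real LLM learns these statistics with a neural network instead of a lookup table, but the prediction task is the same:

```python
from collections import Counter, defaultdict

# A toy next-word predictor built from bigram counts
corpus = "the cat sat on the mat the cat ran".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word that follows `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (follows "the" twice; "mat" only once)
```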
For training, you have some options. As the model is small, you can train it on your local machine. It will be slower but will get the job done.
The other option is using Google Colab. This is the one I used – the free version was enough for the training I required, using a T4 GPU.
Steps to Do Pre-Training
Upload the dataset JSONL file and tokenizer to Google Drive.
Set the model configuration (vocab size, layers, heads, and so on).
Define the transformer architecture (attention, feed-forward, blocks).
Load and tokenize the corpus into training/validation splits.
Run the training loop with optimizer, LR schedule, and checkpointing.
Model Configuration
from dataclasses import dataclass

@dataclass
class UrduLLMConfig:
    # Vocabulary
    vocab_size: int = 32_000
    pad_token_id: int = 0
    bos_token_id: int = 2
    eos_token_id: int = 3

    # Model Architecture
    d_model: int = 384
    n_layers: int = 6
    n_heads: int = 6
    d_ff: int = 1536       # 4 * d_model
    dropout: float = 0.1
    max_seq_len: int = 256

    # Training
    batch_size: int = 32
    learning_rate: float = 3e-4
    weight_decay: float = 0.1
    max_epochs: int = 10
    warmup_steps: int = 500
    grad_clip: float = 1.0
Configuration parameters explained:
The vocabulary parameters (vocab_size, pad_token_id, bos_token_id, eos_token_id) simply match the tokenizer we built earlier. vocab_size is 32K (our BPE vocabulary), and the special token IDs (0, 2, 3) correspond to the positions we assigned during tokenizer training.
Model architecture parameters:
| Variable | What it Means | Example | Impact of Value |
|---|---|---|---|
| `d_model` | Embedding/vector size per token | 384 | Higher: better understanding but slower & more memory. Lower: faster but less expressive |
| `n_layers` | Number of transformer layers | 6 | More layers: deeper understanding but higher latency. Fewer: faster but less powerful |
| `n_heads` | Attention heads per layer | 6 | More heads: better context capture. Too few: limited attention diversity |
| `d_ff` | Feedforward layer size | 1536 | Larger: more computational power. Smaller: faster but weaker transformations |
| `dropout` | Fraction of neurons dropped during training | 0.1 | Higher: prevents overfitting but may underfit. Lower: better training fit but risk of overfitting |
| `max_seq_len` | Maximum tokens per input | 256 | Higher: more context but slower & costly. Lower: faster but limited context |
Training hyperparameters:
| Variable | What it Means | Example | Impact of Value |
|---|---|---|---|
| `batch_size` | Samples per training step | 32 | Larger: faster training but needs more memory. Smaller: stable but slower |
| `learning_rate` | Step size for updates | 0.0003 | Too high: unstable training. Too low: very slow learning |
| `weight_decay` | Regularization strength | 0.1 | Higher: reduces overfitting. Lower: risk of overfitting |
| `max_epochs` | Full dataset passes | 10 | More: better learning but risk of overfitting. Fewer: undertrained model |
| `warmup_steps` | Gradual LR increase steps | 500 | More: smoother start, safer training. Fewer: risk of early instability |
| `grad_clip` | Max gradient value | 1.0 | Lower: stable but slower learning. Higher: risk of exploding gradients |
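warmup_steps and learning_rate work together through the learning-rate schedule. One common scheme is linear warmup followed by cosine decay; the sketch below assumes example values for total_steps and min_lr just to show the shape, and may differ from the schedule used in the actual training loop:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=500,
               total_steps=10_000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(f"step 0:     {lr_at_step(0):.2e}")       # tiny LR, safe early updates
print(f"step 499:   {lr_at_step(499):.2e}")     # full LR at the end of warmup
print(f"step 10000: {lr_at_step(10_000):.2e}")  # decayed to min_lr
```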
Transformer Architecture
Next up is the main part of training: writing the transformer architecture. Before jumping into code, it's important to know what a transformer architecture is.
To learn in depth about what transformers are and how they differ from RNNs and CNNs, I would recommend going through this article: AWS: What are Transformers in Artificial Intelligence?
But in short:
"Transformers are a type of neural network architecture that transforms or changes an input sequence into an output sequence."
The original Transformer paper introduced both an encoder (reads input) and a decoder (generates output). But GPT-style models like ours use only the decoder part. This is called a decoder-only architecture.
The decoder takes a sequence of tokens, applies self-attention to understand relationships between them, and predicts the next token.
Self-attention is what makes transformers powerful: instead of processing tokens one by one in order (like RNNs), the model looks at all previous tokens simultaneously and determines which ones are most relevant for the current prediction.
Here's the complete transformer code. A detailed breakdown of each component follows:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.d_model = config.d_model
        self.head_dim = config.d_model // config.n_heads
        self.qkv_proj = nn.Linear(config.d_model, 3 * config.d_model)
        self.out_proj = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(B, T, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * (self.head_dim ** -0.5)
        if mask is not None:
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)
        out = attn @ v
        out = out.transpose(1, 2).reshape(B, T, C)
        out = self.out_proj(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.fc1 = nn.Linear(config.d_model, config.d_ff)
        self.fc2 = nn.Linear(config.d_ff, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = F.gelu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = MultiHeadSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ff = FeedForward(config)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x, mask=None):
        x = x + self.dropout(self.attn(self.ln1(x), mask))
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x

class UrduGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.token_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        self.ln_f = nn.LayerNorm(config.d_model)
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        # Weight tying
        self.head.weight = self.token_emb.weight
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids, targets=None):
        B, T = input_ids.shape
        device = input_ids.device
        tok_emb = self.token_emb(input_ids)
        pos = torch.arange(0, T, dtype=torch.long, device=device)
        pos_emb = self.pos_emb(pos)
        x = self.dropout(tok_emb + pos_emb)
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=device)).unsqueeze(0).unsqueeze(0)
        for block in self.blocks:
            x = block(x, mask)
        x = self.ln_f(x)
        logits = self.head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return {'logits': logits, 'loss': loss}

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=100, temperature=0.8,
                 top_k=50, top_p=0.9, eos_token_id=None):
        """
        Generate text autoregressively.
        Sampling strategies:
        - temperature: Controls randomness (low = deterministic, high = creative)
        - top_k: Only consider the top K most likely tokens
        - top_p (nucleus): Only consider tokens whose cumulative probability <= p
        - eos_token_id: Stop generating when this token is produced
        """
        self.eval()
        eos_token_id = eos_token_id or getattr(self.config, 'eos_token_id', None)
        for _ in range(max_new_tokens):
            idx_cond = input_ids if input_ids.size(1) <= self.config.max_seq_len \
                else input_ids[:, -self.config.max_seq_len:]
            outputs = self.forward(idx_cond)
            logits = outputs["logits"][:, -1, :] / temperature
            # Top-K filtering
            if top_k > 0:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            # Top-P (nucleus) filtering
            if top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
                sorted_indices_to_remove[:, 0] = 0
                indices_to_remove = sorted_indices_to_remove.scatter(
                    1, sorted_indices, sorted_indices_to_remove
                )
                logits[indices_to_remove] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
            if eos_token_id is not None and next_token.item() == eos_token_id:
                break
        return input_ids
This code builds a text prediction machine. You give it some Urdu words, and it guesses the next word, over and over, until it forms a sentence. That's literally how ChatGPT works too, just much bigger.
Transformer Code Breakdown
1. MultiHeadSelfAttention: "The Lookback System"
Imagine reading a sentence. When you see the word "اس" (this), your brain looks back to figure out what "this" refers to. That's attention.
Q, K, V: Think of it like a library:
Query (Q): "I'm looking for information about X"
Key (K): Each previous word holds up a sign: "I have info about Y"
Value (V): The actual information that word carries
6 heads = 6 different "readers" looking at the sentence simultaneously. One might focus on grammar, another on meaning, another on nearby words, and so on.
Causal mask = A rule that says: "You can only look at words that came before you, not after." (Because when generating, future words don't exist yet!)
The math: Multiply Q×K to get "how relevant is each word?", then use those scores to grab the most useful info from V.
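To make the math concrete, here's a toy run of it on made-up tensors (sizes chosen for readability, not the tutorial's config): multiply Q by K to get relevance scores, mask the future, softmax, then gather from V.

```python
import torch
import torch.nn.functional as F

# Toy sketch of the causal attention math: 1 batch, 4 tokens, 8-dim vectors
torch.manual_seed(0)
T, d = 4, 8
Q = torch.randn(1, T, d)
K = torch.randn(1, T, d)
V = torch.randn(1, T, d)

scores = Q @ K.transpose(-2, -1) / d ** 0.5           # "how relevant is each word?"
mask = torch.tril(torch.ones(T, T))                    # causal: no peeking ahead
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)                    # each row sums to 1
out = weights @ V                                      # grab the useful info from V

print(weights[0, 0])  # first row: token 0 can only attend to itself
```

The first row of the weights comes out as [1, 0, 0, 0]: token 0 has nothing before it, so the causal mask forces all of its attention onto itself.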
2. FeedForward: "The Thinking Step"
After attention figured out which words matter, this is where the model actually thinks about what they mean.
It's just two layers:
Expand (384 → 1536): Give the model more "brain space" to think
Shrink (1536 → 384): Compress the thought back down
GELU activation: A filter that decides "keep this thought" or "discard it" (smoothly, not harshly)
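As a sketch, the whole block is just this. It's a minimal stand-in using the tutorial's dimensions (384 → 1536 → 384); details like dropout placement may differ from the real module.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal expand → GELU → shrink block (a sketch, not the exact tutorial class)."""
    def __init__(self, d_model=384, d_ff=1536, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: more "brain space"
            nn.GELU(),                  # smooth keep/discard filter
            nn.Linear(d_ff, d_model),   # shrink the thought back down
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

ff = FeedForward()
x = torch.randn(2, 16, 384)   # (batch, seq_len, d_model)
y = ff(x)
print(y.shape)                # same shape in and out: (2, 16, 384)
```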
3. TransformerBlock: "One Round of Reading"
One pass of reading a sentence and thinking about it.
Step 1: Look at other words (attention)
Step 2: Think about what you saw (feed-forward)
LayerNorm: Like resetting your brain between steps so numbers don't get too big or too small.
Residual connection (x + ...): The model keeps its original thought AND adds the new insight. It's like taking notes: you don't erase old notes, you add new ones.
The model does this 6 times (6 blocks). Each round understands the text a little deeper.
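Here's a minimal sketch of that wiring, using PyTorch's built-in nn.MultiheadAttention as a stand-in for the tutorial's custom attention module. The point is the LayerNorm → sublayer → residual-add pattern, not the exact internals:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One round of reading: attend, then think, with norms and residuals."""
    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + a                          # keep old notes, add new insight
        x = x + self.ff(self.ln2(x))       # same trick for the thinking step
        return x

block = TransformerBlock()
x = torch.randn(1, 10, 384)
# Boolean mask: True = position is blocked (the future)
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
y = block(x, attn_mask=mask)
print(y.shape)  # unchanged: (1, 10, 384), so blocks can be stacked freely
```

Because each block maps (batch, seq, 384) to the same shape, stacking 6 of them is just a for-loop.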
4. UrduGPT: "The Full Machine"
Setup (__init__):
Token embedding: A giant lookup table. Each of 32,000 Urdu words/subwords gets a list of 384 numbers that represent its "meaning."
Position embedding: Another lookup table that tells the model "this word is 1st, this is 2nd, this is 3rd..." (otherwise it wouldn't know word order).
6 Transformer blocks: The 6 rounds of reading described above.
LM head: At the end, converts the model's internal "thoughts" (384 numbers) back into a score for each of the 32,000 possible next words.
Weight tying: The input lookup table and output scoring table share the same data. Saves memory and actually works better!
Processing (forward):
Look up each word's meaning (embedding)
Add position info
Run through 6 rounds of attention + thinking
Score every possible next word
If we know the correct answer, calculate how wrong we were (loss)
Generating text (generate): A simple loop:
Feed in the words so far
Get scores for the next word
Temperature: Controls creativity. Low = safe/predictable, high = wild/creative
Top-K: Only consider the K best options (ignore the 31,950 unlikely words)
Top-P (nucleus): Dynamically select the smallest set of tokens whose cumulative probability reaches the threshold
Randomly pick one word from the remaining options
Add it to the sentence, go back to step 1
Stop when <eos> is generated or max_new_tokens is reached
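A toy illustration of what temperature and top-k do to a made-up 5-word distribution (the logits here are invented for the example):

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 5-word vocabulary
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -2.0]])

cold = F.softmax(logits / 0.2, dim=-1)   # low temperature → near-deterministic
hot = F.softmax(logits / 2.0, dim=-1)    # high temperature → flatter, "wilder"
print(cold)  # almost all mass on the best word
print(hot)   # mass spread across many words

# Top-k=2: everything outside the 2 best options drops to probability 0
k = 2
v, _ = torch.topk(logits, k)
filtered = logits.masked_fill(logits < v[:, [-1]], float('-inf'))
probs = F.softmax(filtered, dim=-1)
assert int((probs > 0).sum()) == k  # only 2 candidates survive
```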
Loading the Dataset and Training
First, we load the JSONL corpus and tokenize every document into one long sequence of token IDs. Then we split it 90/10 into training and validation sets, and wrap them in a PyTorch Dataset that creates fixed-length chunks for next-token prediction:
import json
from tokenizers import Tokenizer
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
# Load tokenizer
tokenizer = Tokenizer.from_file(TOKENIZER_PATH)
print(f"Tokenizer loaded. Vocab: {tokenizer.get_vocab_size():,}")
# Load and tokenize corpus
print("Loading corpus...")
all_token_ids = []
with open(DATA_PATH, "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Tokenizing"):
        doc = json.loads(line)
        encoded = tokenizer.encode(doc["text"])
        all_token_ids.extend(encoded.ids)
all_token_ids = torch.tensor(all_token_ids, dtype=torch.long)
print(f"Total tokens: {len(all_token_ids):,}")
class UrduTextDataset(Dataset):
    def __init__(self, token_ids, seq_len):
        self.token_ids = token_ids
        self.seq_len = seq_len
        self.n_chunks = (len(token_ids) - 1) // seq_len

    def __len__(self):
        return self.n_chunks

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.token_ids[start:start + self.seq_len + 1]
        return chunk[:-1], chunk[1:]  # input, target (shifted by 1)
config = UrduLLMConfig()
# Split 90/10
split_idx = int(len(all_token_ids) * 0.9)
train_dataset = UrduTextDataset(all_token_ids[:split_idx], config.max_seq_len)
val_dataset = UrduTextDataset(all_token_ids[split_idx:], config.max_seq_len)
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config.batch_size)
print(f"Train: {len(train_dataset):,} chunks")
print(f"Val: {len(val_dataset):,} chunks")
Each chunk is 256 tokens long. __getitem__ returns (input, target) where target is the input shifted by one position, which is exactly what next-token prediction needs.
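Here's the same chunking logic on a toy token list, so you can see the one-position shift directly:

```python
# Toy version of the Dataset's chunking: seq_len=3 instead of 256,
# and made-up token IDs instead of a real corpus.
token_ids = [10, 11, 12, 13, 14, 15, 16]
seq_len = 3

chunks = []
for idx in range((len(token_ids) - 1) // seq_len):
    start = idx * seq_len
    chunk = token_ids[start:start + seq_len + 1]   # seq_len + 1 tokens
    chunks.append((chunk[:-1], chunk[1:]))         # (input, target)

print(chunks)
# [([10, 11, 12], [11, 12, 13]), ([13, 14, 15], [14, 15, 16])]
```

Each input token's "label" is simply the token that follows it, which is why the chunk is read with one extra token and then split.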
Training took me around 3 hours and completed 3 epochs. Ideally it would have run for 10, but after 3 I hit the limit of Google Colab's free tier. Since the purpose of training was learning, I kept the model from that point, which had been saved to Drive.
Here's the complete training code:
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)
# LR Schedule
total_steps = len(train_loader) * config.max_epochs
def get_lr(step):
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps
    progress = (step - config.warmup_steps) / (total_steps - config.warmup_steps)
    return config.learning_rate * 0.5 * (1 + math.cos(math.pi * progress))
# Training
history = {'train_loss': [], 'val_loss': []}
global_step = 0
best_val_loss = float('inf')
for epoch in range(config.max_epochs):
    model.train()
    epoch_loss = 0
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}")
    for input_ids, targets in pbar:
        input_ids, targets = input_ids.to(device), targets.to(device)
        lr = get_lr(global_step)
        for g in optimizer.param_groups:
            g['lr'] = lr
        outputs = model(input_ids, targets)
        loss = outputs['loss']
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip)
        optimizer.step()
        epoch_loss += loss.item()
        global_step += 1
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for input_ids, targets in val_loader:
            input_ids, targets = input_ids.to(device), targets.to(device)
            val_loss += model(input_ids, targets)['loss'].item()
    val_loss /= len(val_loader)
    train_loss = epoch_loss / len(train_loader)
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    print(f"Epoch {epoch+1}: Train={train_loss:.4f}, Val={val_loss:.4f}")

    # Save best
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), f"{DRIVE_PATH}/best_model.pt")
        print(f"Best model saved!")
print(f"\nDone! Best val loss: {best_val_loss:.4f}")
Now let's break down what each part of the training code does.
Training Code Explained: Line by Line
1. Optimizer Setup
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)
AdamW maintains two running statistics per parameter (23M × 2 = 46M extra values in memory):
First moment (momentum): Exponential moving average of gradients. Smooths out noisy updates so the optimizer doesn't zigzag.
Second moment: Exponential moving average of squared gradients. Gives each parameter its own adaptive learning rate (frequently updated params get smaller steps, rare ones get larger).
Weight decay (0.1): Each step, weights are multiplied by (1 - lr × 0.1), shrinking them slightly. This is L2 regularization. It prevents any single weight from growing too large, which reduces overfitting. The "W" in AdamW means this decay is decoupled from the gradient update (applied directly to weights, not mixed into the gradient like vanilla Adam).
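You can verify the decoupling directly: with a zero gradient, AdamW still shrinks a weight by the factor (1 − lr × weight_decay) each step. The lr and weight_decay below are toy values, not the tutorial's config:

```python
import torch

# Toy check of decoupled weight decay: zero gradient, so the ONLY change
# to the weight comes from the decay term.
p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.AdamW([p], lr=0.1, weight_decay=0.1)
p.grad = torch.zeros_like(p)
opt.step()
print(p.data)  # tensor([0.9900]) — exactly 1.0 × (1 − 0.1 × 0.1)
```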
2. Learning Rate Schedule
total_steps = len(train_loader) * config.max_epochs # e.g., 500 batches × 10 epochs = 5000 steps
def get_lr(step):
    if step < config.warmup_steps:  # Phase 1: steps 0–499
        return config.learning_rate * step / config.warmup_steps  # Linear ramp: 0 → 3e-4
    progress = (step - config.warmup_steps) / (total_steps - config.warmup_steps)  # 0.0 → 1.0
    return config.learning_rate * 0.5 * (1 + math.cos(math.pi * progress))  # 3e-4 → ~0
Warmup (first 500 steps): At step 0, weights are random and gradients point in semi-random directions, so a large LR would cause destructive parameter updates. By linearly ramping from 0 to 3e-4, we let the loss landscape "stabilize" before making aggressive updates.
Cosine decay (remaining steps): The formula 0.5 × (1 + cos(π × progress)) traces a smooth curve from 1.0 down to 0.0 as progress goes from 0 to 1. Multiplied by the peak LR, this gives:
Early: large LR – big parameter changes, which gives rapid loss reduction
Late: tiny LR – small tweaks, which gives fine-tuning without overshooting local minima
LR: 0 ──ramp──▶ peak ──smooth curve──▶ ~0
| warmup | cosine decay |
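Plugging sample steps into the schedule makes the ramp and decay concrete. This standalone version assumes the values discussed above (peak LR 3e-4, 500 warmup steps, 5000 total steps):

```python
import math

peak_lr, warmup_steps, total_steps = 3e-4, 500, 5000

def get_lr(step):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(get_lr(0))      # 0.0 — start of warmup
print(get_lr(250))    # ≈ 1.5e-4 — halfway up the ramp
print(get_lr(500))    # ≈ 3e-4 — the peak
print(get_lr(2750))   # ≈ 1.5e-4 — midpoint of the cosine decay
print(get_lr(4999))   # ≈ 0 — nearly zero at the end
```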
3. Tracking Variables
history = {'train_loss': [], 'val_loss': []} # For plotting curves later
global_step = 0 # Counts total batches across all epochs (for LR schedule)
best_val_loss = float('inf') # Tracks best validation; starts at infinity so any real loss beats it
4. Training Loop
Outer Loop: Epochs
for epoch in range(config.max_epochs):
    model.train()  # Enables dropout (randomly zeros 10% of activations for regularization)
Each epoch = one full pass through all training data. We repeat for max_epochs rounds.
Inner Loop: Batches
1. Move to GPU:
input_ids, targets = input_ids.to(device), targets.to(device)
Transfers tensor data from CPU RAM to GPU VRAM. Matrix multiplications in transformers (attention, FFN) run 50–100× faster on GPU due to massive parallelism.
2. Manual LR Update:
lr = get_lr(global_step)
for g in optimizer.param_groups:
    g['lr'] = lr
PyTorch does ship LR schedulers (torch.optim.lr_scheduler), but for a custom warmup-plus-cosine schedule it's simplest to override the LR manually each step. param_groups is a list (here just one group), and each group can have its own LR/weight decay.
3. Forward Pass:
outputs = model(input_ids, targets)
loss = outputs['loss']
Input tokens flow through: embeddings → 6 transformer blocks → LM head → logits. Cross-entropy loss is computed between the logits (shape [batch, seq_len, 32000]) and target token IDs. This loss measures the negative log-probability the model assigns to the correct next token, averaged over all positions and batch elements.
4. Backward Pass + Update:
optimizer.zero_grad() # Reset all parameter gradients to zero (they accumulate by default)
loss.backward() # Backpropagation: compute ∂loss/∂θ for all 23M parameters via chain rule
torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip) # If ||gradient||₂ > 1.0, scale it down
optimizer.step() # θ_new = θ_old - lr × adam_adjusted_gradient - lr × weight_decay × θ_old
zero_grad(): PyTorch accumulates gradients by default (useful for gradient accumulation across micro-batches). We must manually clear them before each new backward pass.
loss.backward(): Backpropagation traverses the computation graph in reverse, computing ∂loss/∂θ for every parameter using the chain rule. This is the most compute-intensive step alongside the forward pass.
Gradient clipping: Computes the L2 norm across all parameter gradients concatenated into one vector. If the norm exceeds 1.0, every gradient is multiplied by 1.0/norm, preserving direction but capping magnitude. This prevents rare batches (unusual token distributions) from causing catastrophically large updates that destabilize training.
optimizer.step(): AdamW applies the update rule using momentum, adaptive per-parameter LR, and decoupled weight decay.
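A quick toy check of what clip_grad_norm_ actually does: a gradient of [3, 4] has L2 norm 5, so it gets rescaled down to norm 1 while keeping its direction.

```python
import torch

# One fake parameter with a hand-set gradient of L2 norm 5
p = torch.nn.Parameter(torch.zeros(2))
p.grad = torch.tensor([3.0, 4.0])                        # ||g||₂ = 5
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(total_norm)   # tensor(5.) — the norm BEFORE clipping
print(p.grad)       # tensor([0.6000, 0.8000]) — scaled by 1/5, same direction
```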
5. Bookkeeping:
epoch_loss += loss.item() # .item() extracts the Python float from the CUDA tensor (avoids GPU memory leak)
global_step += 1 # Increment for LR schedule
pbar.set_postfix({'loss': ...}) # Update the tqdm progress bar display
6. Validation
model.eval()  # Disables dropout so we use full model capacity for honest evaluation
val_loss = 0
with torch.no_grad():  # Disables gradient tracking, saves ~50% memory and runs faster
    for input_ids, targets in val_loader:
        input_ids, targets = input_ids.to(device), targets.to(device)
        val_loss += model(input_ids, targets)['loss'].item()
val_loss /= len(val_loader)  # Average loss per batch
This tests on held-out data the model never trained on. Comparing train vs val loss reveals:
| Pattern | Meaning |
|---|---|
| Both decreasing | Model is learning generalizable patterns |
| Train ↓, Val stalling/↑ | Overfitting: memorizing, not learning |
| Both high and flat | Underfitting: model needs more capacity or data |
model.eval() turns OFF dropout so we evaluate with the full model. torch.no_grad() skips gradient computation since we're just measuring, not learning.
7. Checkpointing
if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), f"{DRIVE_PATH}/best_model.pt")
model.state_dict() returns an OrderedDict mapping parameter names onto tensors. torch.save serializes this to disk using Python's pickle + zip. We only save when val loss improves.
This is early stopping in spirit: we keep the checkpoint that generalizes best, regardless of what happens in later epochs.
Summary: One Batch in 6 Steps
Feed 32 Urdu sequences through the model → get predicted probabilities
Cross-entropy vs actual next tokens → scalar loss (how wrong?)
Backpropagate through 23M parameters → gradient per parameter (what to fix?)
Clip gradient norm to ≤ 1.0 → prevent instability
AdamW updates parameters with momentum + decay → the actual learning
Repeat ~5000 times, save the best checkpoint → done
Key Metrics
Cross-entropy loss measures how far the predicted probability distribution is from the true next token. A random model over 32K vocab gets loss ≈ ln(32000) ≈ 10.4
Perplexity = e^loss, interpretable as "the model is choosing between N equally likely tokens"
PPL 32,000 = random guessing
PPL 100 = narrowed to ~100 candidates
PPL 10 = quite confident predictions
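These numbers are easy to sanity-check in a couple of lines:

```python
import math

# Cross-entropy of a uniform (random) model over a 32,000-token vocab
random_loss = math.log(32000)
print(round(random_loss, 1))          # 10.4 — matches the figure above
print(round(math.exp(random_loss)))   # 32000 — perplexity of random guessing

# Lower loss → exponentially fewer "effective candidates"
print(round(math.exp(4.6), 1))        # ~99.5 — loss 4.6 ≈ choosing among 100
print(round(math.exp(2.3), 1))        # ~10.0 — loss 2.3 ≈ choosing among 10
```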
Once training is completed and we've saved the model in Drive, the next step is to download the model to your local system to perform the next steps.
Now we have a trained model, but a question arises: can we chat with it the way we chat with AI tools like ChatGPT, Claude, or Copilot? Not quite yet. Why?
Pre-training is done, but the model doesn't yet know how to structure its output conversationally, as if it were answering user queries. Teaching it that is the step we call Supervised Fine-Tuning (SFT).
4. Supervised Fine-Tuning (SFT)
At a very high level, in SFT we teach the model how to respond to queries. It's like giving it examples from which it learns how to answer. The more examples you have, the better the responses will become. So essentially, supervised fine-tuning turns the model into a conversational agent.
To achieve this, we'll create a dataset of examples with the following key pairs and format:
{
  "conversations": [
    {"role": "system", "content": "آپ ایک مددگار اردو اسسٹنٹ ہیں۔"},
    {"role": "user", "content": "سوال..."},
    {"role": "assistant", "content": "جواب..."}
  ]
}
I created around 79 such examples and saved them in JSONL format. In a real project you would use many more; as mentioned earlier, more examples lead to better results.
Formatting Conversations for Training
The next step is formatting the conversations saved above for training. This is the conversation formatting step for SFT. It converts raw conversation JSON into token ID sequences with loss masking, so the model only learns to generate assistant responses.
Loss masking means we intentionally hide certain parts of the input from the training loss. In this case, we mask the system prompt and user message so the model isn't trained to memorize or reproduce them. The training signal comes only from the assistant's response, which is the useful part in teaching the model what to generate and when to stop.
Part 1: Disable Auto-Formatting & Get Special Token IDs
tokenizer.no_padding()
BOS_ID = tokenizer.token_to_id("<bos>") # 2
EOS_ID = tokenizer.token_to_id("<eos>") # 3
SEP_ID = tokenizer.token_to_id("<sep>") # 4
PAD_ID = tokenizer.token_to_id("<pad>") # 0
USER_ID = tokenizer.token_to_id("<|user|>") # 5
ASSISTANT_ID = tokenizer.token_to_id("<|assistant|>") # 6
SYSTEM_ID = tokenizer.token_to_id("<|system|>") # 7
IGNORE_INDEX = -100
no_padding(): Tells the tokenizer "don't add padding automatically, I'll handle it myself." We need full control over the token sequence.
We fetch the integer IDs for each special token so we can manually insert them at the right positions.
IGNORE_INDEX = -100: PyTorch's cross_entropy has a built-in feature: any label set to -100 is skipped in loss computation. This is how we implement loss masking.
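A tiny experiment confirms this behavior: the loss over labels containing -100 equals the loss computed over only the unmasked positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                  # 4 positions, 10-class toy vocab
labels = torch.tensor([3, -100, 7, -100])    # positions 1 and 3 are masked

masked_loss = F.cross_entropy(logits, labels, ignore_index=-100)
# Same value as averaging over ONLY the unmasked positions:
manual_loss = F.cross_entropy(logits[[0, 2]], labels[[0, 2]])
print(torch.isclose(masked_loss, manual_loss))  # tensor(True)
```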
Part 2: format_conversation(): The Core Function
This takes a conversation and produces two parallel arrays:
input_ids: [BOS, SYSTEM, آپ, ایک, مددگار, ..., SEP, USER, پاکستان, کا, ..., SEP, ASST, اسلام, آباد, ہے, EOS, PAD, PAD, ...]
labels: [-100, -100, -100, -100, -100, ..., -100, -100, -100, -100,..., -100, -100, اسلام, آباد, ہے, EOS, -100, -100, ...]
Step-by-step inside the function:
1. Start with BOS:
input_ids = [BOS_ID]
labels = [IGNORE_INDEX] # Don't learn to predict BOS
2. For each turn, encode the content and strip auto-added BOS/EOS:
content_ids = tokenizer.encode(content).ids
if content_ids[0] == BOS_ID: content_ids = content_ids[1:] # Remove if tokenizer auto-added
if content_ids[-1] == EOS_ID: content_ids = content_ids[:-1]
We strip these because we're manually placing special tokens at exact positions, so we don't want duplicates.
3. Build token sequence per role:
| Role | Token sequence | Labels |
|---|---|---|
| system | [SYSTEM_ID] + content + [SEP_ID] | All -100 (masked) |
| user | [USER_ID] + content + [SEP_ID] | All -100 (masked) |
| assistant | [ASST_ID] + content + [EOS_ID] | [-100] + content + [EOS_ID] |
The assistant's role token (<|assistant|>) itself is masked because we don't want the model to learn to predict that. But the actual response content and the <eos> do have labels, so the model learns:
What to say (the response content)
When to stop (predicting <eos>)
4. Truncate and pad:
input_ids = input_ids[:max_len] # Cut to 256 tokens max
pad_len = max_len - len(input_ids)
input_ids = input_ids + [PAD_ID] * pad_len
labels = labels + [IGNORE_INDEX] * pad_len # Don't learn from padding either
All sequences must be the same length for batched training. Padding labels are -100 so they're ignored in loss.
Here's the complete format_conversation() function:
def format_conversation(conversation: dict, max_len: int = 256) -> dict:
    """
    Convert a conversation dict into token IDs + labels for SFT.
    Format: <bos><|system|>...<sep><|user|>...<sep><|assistant|>...<eos>
    Labels: -100 for system/user tokens (masked), actual IDs for assistant tokens.
    """
    input_ids = [BOS_ID]
    labels = [IGNORE_INDEX]
    for turn in conversation["conversations"]:
        role = turn["role"]
        content = turn["content"]
        content_ids = tokenizer.encode(content).ids
        if content_ids and content_ids[0] == BOS_ID:
            content_ids = content_ids[1:]
        if content_ids and content_ids[-1] == EOS_ID:
            content_ids = content_ids[:-1]
        if role == "system":
            role_ids = [SYSTEM_ID] + content_ids + [SEP_ID]
            role_labels = [IGNORE_INDEX] * len(role_ids)
        elif role == "user":
            role_ids = [USER_ID] + content_ids + [SEP_ID]
            role_labels = [IGNORE_INDEX] * len(role_ids)
        elif role == "assistant":
            role_ids = [ASSISTANT_ID] + content_ids + [EOS_ID]
            role_labels = [IGNORE_INDEX] + content_ids + [EOS_ID]
        input_ids.extend(role_ids)
        labels.extend(role_labels)
    # Truncate and pad to max_len
    input_ids = input_ids[:max_len]
    labels = labels[:max_len]
    pad_len = max_len - len(input_ids)
    input_ids = input_ids + [PAD_ID] * pad_len
    labels = labels + [IGNORE_INDEX] * pad_len
    return {"input_ids": input_ids, "labels": labels}
Part 3: Verification
n_loss_tokens = sum(1 for l in test_formatted['labels'] if l != IGNORE_INDEX)
print(f" Tokens with loss: {n_loss_tokens} / 256")
This confirms that only a small fraction of tokens (the assistant's words + EOS) contribute to the loss. For a typical example, you might see something like Tokens with loss: 18 / 256, meaning only ~7% of the sequence drives gradient updates. The rest (system prompt, user questions, special tokens, padding) is masked with -100.
This makes the SFT signal highly focused: all of it comes from predicting the assistant's actual response and learning when to stop (<eos>). That focus matters especially when you only have 79 training examples.
Formatting Summary
| Component | Purpose |
|---|---|
| no_padding() | Take manual control of token placement |
| Special token IDs | Insert chat structure markers at exact positions |
| IGNORE_INDEX = -100 | PyTorch's built-in mechanism to skip positions in loss |
| System/User labels → -100 | Don't learn from these (context only) |
| Assistant labels → real IDs | Learn to generate responses + when to stop |
| Truncation to 256 | Match model's context window |
| Padding with -100 labels | Batch alignment without polluting the loss |
SFT Dataset & DataLoader
class SFTDataset(Dataset):
    def __init__(self, conversations: list, max_len: int = 256):
        self.examples = []
        for conv in conversations:
            formatted = format_conversation(conv, max_len)
            self.examples.append(formatted)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.examples[idx]['input_ids'], dtype=torch.long),
            torch.tensor(self.examples[idx]['labels'], dtype=torch.long),
        )
This wraps all 79 formatted conversations into a PyTorch Dataset. At init time, it pre-formats every conversation using format_conversation() and stores the results. When the DataLoader requests item idx, it returns (input_ids, labels) as tensors.
DataLoader:
sft_loader = DataLoader(sft_dataset, batch_size=4, shuffle=True)
batch_size=4: Small batch because we only have 79 examples. Larger batches would mean fewer gradient updates per epoch.
shuffle=True: Randomize order each epoch so the model doesn't memorize a fixed sequence of examples.
Loading the Pre-trained Model
model = UrduGPT(config).to(device)
checkpoint = torch.load("best_model.pt", map_location=device)
# Pre-training saved a bare state_dict; newer checkpoints may wrap it in a
# dict under 'model_state_dict'. .get() handles both layouts.
state_dict = checkpoint.get('model_state_dict', checkpoint)

# Name mapping (Colab → local)
name_mapping = {
    'token_emb.weight': 'token_embedding.weight',
    'pos_emb.weight': 'position_embedding.weight',
    'ln_f.weight': 'ln_final.weight',
    'ln_f.bias': 'ln_final.bias',
    'head.weight': 'lm_head.weight',
}
state_dict = {name_mapping.get(k, k): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)
This creates a fresh UrduGPT model and loads the pre-trained weights from Phase 3.
You might be wondering: why the name mapping? The model was trained on Google Colab with slightly different variable names (for example, token_emb vs token_embedding). The mapping translates Colab's naming convention to the local code's convention. strict=False in load_state_dict allows loading even if some keys don't match exactly.
Also, why start from pre-trained? Well, SFT builds on top of pre-training. The model already knows Urdu grammar, vocabulary, and facts. SFT just teaches it the conversation format. Starting from random weights would require far more data and training.
SFT Training Loop
Here's the complete SFT training loop:
SFT_LR = 2e-5
SFT_EPOCHS = 50
optimizer = torch.optim.AdamW(model.parameters(), lr=SFT_LR, weight_decay=0.01)
sft_history = {'loss': []}
best_loss = float('inf')
for epoch in range(SFT_EPOCHS):
    model.train()
    epoch_loss = 0
    n_batches = 0
    for input_ids, labels in sft_loader:
        input_ids = input_ids.to(device)
        labels = labels.to(device)
        outputs = model(input_ids)
        logits = outputs['logits']
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=IGNORE_INDEX,
        )
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        epoch_loss += loss.item()
        n_batches += 1

    avg_loss = epoch_loss / n_batches
    sft_history['loss'].append(avg_loss)
    if avg_loss < best_loss:
        best_loss = avg_loss
        torch.save({
            'model_state_dict': model.state_dict(),
            'config': config.__dict__,
            'epoch': epoch + 1,
            'loss': avg_loss,
        }, "sft_model.pt")
    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(f"Epoch {epoch+1}/{SFT_EPOCHS} | Loss: {avg_loss:.4f}")
print(f"SFT complete! Best loss: {best_loss:.4f}")
Why these hyperparameters differ from pre-training:
| Parameter | Pre-training | SFT | Why different |
|---|---|---|---|
| Learning rate | 3e-4 | 2e-5 | Lower LR prevents catastrophic forgetting. Large updates would erase the Urdu knowledge learned during pre-training |
| Epochs | 3 | 50 | Only 79 examples vs millions of tokens. The model needs many passes to learn the conversation pattern |
| Weight decay | 0.1 | 0.01 | Less regularization needed since we want the model to fit these specific examples closely |
| LR schedule | Cosine warmup | Constant | Simple and effective for small-data fine-tuning |
Here's the training step (per batch):
# Forward pass with no targets; we compute loss manually
outputs = model(input_ids)
logits = outputs['logits']
# Shift for next-token prediction
shift_logits = logits[:, :-1, :].contiguous() # Predictions at positions 0..254
shift_labels = labels[:, 1:].contiguous() # Targets at positions 1..255
# Loss with masking
loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
    ignore_index=IGNORE_INDEX,  # Skip -100 positions
)
There's a key difference from pre-training: in pre-training, we passed targets directly to model(input_ids, targets) which computed loss internally on ALL tokens. Here we compute loss manually so we can use ignore_index=-100 to mask non-assistant positions.
The shift: logits[:, :-1] and labels[:, 1:] implement next-token prediction. The model's prediction at position i is compared against the actual token at position i+1.
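On a toy label row you can see the alignment after the shift (the token IDs here are invented for illustration):

```python
import torch

# One toy label row: two masked context positions, then an answer + EOS
labels = torch.tensor([[-100, -100, 501, 502, 3]])
shift_labels = labels[:, 1:]
print(shift_labels.tolist())  # [[-100, 501, 502, 3]]
# After the shift, the model's output at position 0 is scored against
# labels[1] (-100, skipped), position 1 against labels[2] (501), and so on.
```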
Backward pass + update:
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
This is the same as pre-training: clear gradients → backprop → clip to prevent instability → update parameters. Gradient clipping at 1.0 is especially important here since the model is being fine-tuned and some gradients can be large on small data.
Checkpointing:
if avg_loss < best_loss:
    torch.save({'model_state_dict': model.state_dict(), ...}, "sft_model.pt")
Save whenever training loss improves. Unlike pre-training, we don't have a separate validation set (79 examples is too few to split), so we checkpoint on training loss.
Chat Function: Inference
Here's the complete chat function:
def chat(model, tokenizer, user_message: str, system_prompt: str = None,
         max_tokens: int = 100, temperature: float = 0.7) -> str:
    """Generate a chat response."""
    model.eval()
    if system_prompt is None:
        system_prompt = SYSTEM_PROMPT

    # Build the prompt
    prompt_ids = [BOS_ID, SYSTEM_ID]
    sys_ids = tokenizer.encode(system_prompt).ids
    if sys_ids and sys_ids[0] == BOS_ID: sys_ids = sys_ids[1:]
    if sys_ids and sys_ids[-1] == EOS_ID: sys_ids = sys_ids[:-1]
    prompt_ids.extend(sys_ids)
    prompt_ids.append(SEP_ID)
    prompt_ids.append(USER_ID)
    user_ids = tokenizer.encode(user_message).ids
    if user_ids and user_ids[0] == BOS_ID: user_ids = user_ids[1:]
    if user_ids and user_ids[-1] == EOS_ID: user_ids = user_ids[:-1]
    prompt_ids.extend(user_ids)
    prompt_ids.append(SEP_ID)
    prompt_ids.append(ASSISTANT_ID)

    # Generate
    input_tensor = torch.tensor([prompt_ids], dtype=torch.long).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            input_tensor,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_k=50,
            top_p=0.9,
            eos_token_id=EOS_ID,
        )

    # Decode only the generated part
    generated_ids = output_ids[0][len(prompt_ids):].tolist()
    if EOS_ID in generated_ids:
        generated_ids = generated_ids[:generated_ids.index(EOS_ID)]
    return tokenizer.decode(generated_ids)
And here's a step-by-step breakdown:
1. Build the prompt:
prompt_ids = [BOS_ID, SYSTEM_ID]
prompt_ids.extend(sys_ids) # System prompt content
prompt_ids.append(SEP_ID)
prompt_ids.append(USER_ID)
prompt_ids.extend(user_ids) # User message content
prompt_ids.append(SEP_ID)
prompt_ids.append(ASSISTANT_ID) # "Now respond..."
This constructs exactly the same format the model saw during SFT training:
<bos><|system|>آپ ایک مددگار...<sep><|user|>پاکستان کا دارالحکومت؟<sep><|assistant|>
The model sees <|assistant|> and knows "I should generate a response now" because during SFT, it learned that tokens after <|assistant|> are what it should produce.
2. Generate autoregressively:
with torch.no_grad():
    output_ids = model.generate(
        input_tensor,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_k=50,
        top_p=0.9,
        eos_token_id=EOS_ID,
    )
torch.no_grad(): No gradients needed for inference, which saves memory and speed
temperature=0.7: Slightly sharpened distribution for coherent but not robotic output
top_k=50: Only sample from the top 50 tokens to avoid low-probability noise
top_p=0.9: Nucleus sampling that dynamically selects the smallest set of tokens whose cumulative probability ≥ 0.9
eos_token_id: Stop generating when <eos> is produced
3. Extract and decode:
generated_ids = output_ids[0][len(prompt_ids):].tolist()  # Only the new tokens
if EOS_ID in generated_ids:
    generated_ids = generated_ids[:generated_ids.index(EOS_ID)]  # Trim at EOS
return tokenizer.decode(generated_ids)
We slice off the prompt (we don't want to return the system prompt and user message back), trim at <eos>, and decode token IDs back to Urdu text.
5. Deployment
At this point, you have your own LLM. That's a great milestone. But there's still the classic problem: "it works on my machine."
To make the model public so others can use it too, we need to deploy it and provide an interface for users to interact with.
While exploring deployment options, I came across Gradio, which provides a simple, clean interface for deploying machine learning models and applications. Gradio integrates directly with Hugging Face Spaces, giving us free hosting with minimal setup.
Gradio Web Interface (app.py)
The app.py file ties everything together: it loads the tokenizer and model, defines the chat() function, and launches a Gradio UI. The model loading and chat() logic are identical to what we covered in the SFT section, so here we only show the Gradio-specific part:
import gradio as gr
def respond(message, history):
    if not message.strip():
        return "براہ کرم کچھ لکھیں۔"
    return chat(message)

demo = gr.ChatInterface(
    fn=respond,
    title="🇵🇰 اردو LLM چیٹ بوٹ",
    description="""
### ایک چھوٹا اردو زبان ماڈل جو شروع سے تیار کیا گیا ہے
**A small Urdu language model built from scratch (~23M parameters)**
""",
    examples=[
        "السلام علیکم",
        "پاکستان کا دارالحکومت کیا ہے؟",
        "لاہور کے بارے میں بتائیں۔",
        "بریانی کیسے بنتی ہے؟",
        "کرکٹ کیسے کھیلی جاتی ہے؟",
        "چاند کیسے چمکتا ہے؟",
        "رمضان کیا ہے؟",
        "علامہ اقبال کون تھے؟",
        "خوش کیسے رہیں؟",
        "آپ کون ہیں؟",
    ],
    theme=gr.themes.Soft(),
)

if __name__ == "__main__":
    demo.launch()
- `respond()` wraps `chat()` with an empty-input guard, matching the signature Gradio's `ChatInterface` expects.
- `gr.ChatInterface` provides a ready-made chat UI with message history, an input box, and a send button.
- `examples` are pre-filled messages users can click to try.
- `theme=gr.themes.Soft()` gives a clean, modern visual theme.
Note: Hugging Face Spaces runs app.py as a standalone script, so the full app.py in the repository inlines everything into one file: the model config, the complete transformer architecture, model loading with gc.collect() for memory optimization, the chat() function, and the Gradio interface above.
We won't repeat all of that here since it was already covered in the Pre-Training and SFT sections.
Running locally:
python app.py
# Opens at http://127.0.0.1:7860
Deployment Options
Option A: Hugging Face Spaces (Free, Recommended)
Hugging Face Spaces provides free CPU hosting for Gradio apps.
What to upload:
urdu-llm-chat/
├── app.py # Gradio web interface
├── requirements.txt # torch, tokenizers, gradio
├── README.md # Space metadata (sdk: gradio)
├── model/
│ ├── __init__.py
│ ├── config.py
│ ├── transformer.py
│ └── checkpoints/sft_model.pt # ~90MB trained model weights
└── tokenizer/
└── urdu_tokenizer/
└── urdu_bpe_tokenizer.json
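For the README.md, Hugging Face Spaces reads the Space configuration from a YAML header at the top of the file. A minimal sketch (the title and emoji are placeholders; `sdk: gradio` is the field the layout above refers to):

```yaml
---
title: Urdu LLM Chat
emoji: 🇵🇰
sdk: gradio
app_file: app.py
---
```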
How it works:
1. Create a free account on huggingface.co
2. Create a new Space (SDK: Gradio, Hardware: CPU Basic)
3. Push the files via git: `git clone https://huggingface.co/spaces/USERNAME/urdu-llm-chat`, then copy your project files into the cloned repo and push
4. Hugging Face automatically installs the dependencies and runs app.py
5. Your model is live at https://huggingface.co/spaces/USERNAME/urdu-llm-chat
Why CPU is fine: Our model is only 23M parameters (~90MB). Inference takes <1 second on CPU. No GPU needed for serving.
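The ~90MB figure follows directly from the parameter count. A quick back-of-envelope check, assuming fp32 weights (checkpoint files are slightly larger because they also store training metadata):

```python
params = 23_000_000          # model size from the pre-training section
bytes_per_param = 4          # float32 = 4 bytes per weight
size_mb = params * bytes_per_param / 1024**2
print(f"{size_mb:.0f} MB")   # → 88 MB, consistent with the ~90MB checkpoint
```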
Option B: Running Locally
cd your-project-directory
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python app.py
Opens at http://127.0.0.1:7860. Works on any machine with Python 3.9+.
Option C: Terminal Chat (No UI)
A lightweight alternative with no Gradio dependency, just terminal input/output. Loads the model and enters an interactive loop:
"""
Standalone Chat Inference Script for Urdu LLM
Usage:
python inference/chat.py
"""
import sys
import torch
from pathlib import Path
from tokenizers import Tokenizer
# Add project root to path
PROJECT_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))
from model.config import UrduLLMConfig
from model.transformer import UrduGPT
def load_model(checkpoint_path: str, device: str = None):
"""Load the fine-tuned model."""
if device is None:
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
else:
device = "cpu"
device = torch.device(device)
config = UrduLLMConfig()
model = UrduGPT(config).to(device)
checkpoint = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
return model, config, device
def chat_response(model, tokenizer, config, device, user_message,
                  system_prompt="آپ ایک مددگار اردو اسسٹنٹ ہیں۔",  # "You are a helpful Urdu assistant."
                  max_tokens=100, temperature=0.7):
"""Generate a chat response."""
BOS_ID = tokenizer.token_to_id("<bos>")
EOS_ID = tokenizer.token_to_id("<eos>")
SEP_ID = tokenizer.token_to_id("<sep>")
USER_ID = tokenizer.token_to_id("<|user|>")
ASSISTANT_ID = tokenizer.token_to_id("<|assistant|>")
SYSTEM_ID = tokenizer.token_to_id("<|system|>")
# Build prompt
prompt_ids = [BOS_ID, SYSTEM_ID]
sys_ids = tokenizer.encode(system_prompt).ids
if sys_ids and sys_ids[0] == BOS_ID: sys_ids = sys_ids[1:]
if sys_ids and sys_ids[-1] == EOS_ID: sys_ids = sys_ids[:-1]
prompt_ids.extend(sys_ids)
prompt_ids.append(SEP_ID)
prompt_ids.append(USER_ID)
user_ids = tokenizer.encode(user_message).ids
if user_ids and user_ids[0] == BOS_ID: user_ids = user_ids[1:]
if user_ids and user_ids[-1] == EOS_ID: user_ids = user_ids[:-1]
prompt_ids.extend(user_ids)
prompt_ids.append(SEP_ID)
prompt_ids.append(ASSISTANT_ID)
# Generate
input_tensor = torch.tensor([prompt_ids], dtype=torch.long).to(device)
output_ids = model.generate(
input_tensor,
max_new_tokens=max_tokens,
temperature=temperature,
top_k=50,
top_p=0.9,
eos_token_id=EOS_ID,
)
generated_ids = output_ids[0][len(prompt_ids):].tolist()
if EOS_ID in generated_ids:
generated_ids = generated_ids[:generated_ids.index(EOS_ID)]
return tokenizer.decode(generated_ids)
def main():
print("=" * 60)
print("🇵🇰 اردو LLM چیٹ بوٹ 🇵🇰")
print(" Urdu LLM ChatBot")
print("=" * 60)
# Load model
tokenizer_path = PROJECT_ROOT / "tokenizer" / "urdu_tokenizer" / "urdu_bpe_tokenizer.json"
# Try SFT model first, fall back to pre-trained
sft_path = PROJECT_ROOT / "model" / "checkpoints" / "sft_model.pt"
pretrained_path = PROJECT_ROOT / "model" / "checkpoints" / "best_model.pt"
if sft_path.exists():
checkpoint_path = sft_path
print("Loading SFT (conversational) model...")
elif pretrained_path.exists():
checkpoint_path = pretrained_path
print("Loading pre-trained model (not fine-tuned for chat)...")
else:
print("❌ No model checkpoint found!")
print(" Run notebooks 03 and 04 first to train the model.")
sys.exit(1)
model, config, device = load_model(str(checkpoint_path))
tokenizer = Tokenizer.from_file(str(tokenizer_path))
print(f"Model loaded on {device}")
print("\nType your message in Urdu. Type 'quit' to exit.\n")
print("-" * 60)
while True:
try:
user_input = input("\n👤 آپ: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nخدا حافظ! 👋")
break
if user_input.lower() in ['quit', 'exit', 'q']:
print("خدا حافظ! 👋")
break
if not user_input:
continue
response = chat_response(model, tokenizer, config, device, user_input)
print(f"🤖 بوٹ: {response}")
if __name__ == "__main__":
main()
Run it with:
python inference/chat.py
👤 آپ: السلام علیکم
🤖 بوٹ: وعلیکم السلام! میں آپ کی کیا مدد کر سکتا ہوں؟
(User: "Hello." / Bot: "Hello! How can I help you?")
Full Pipeline Summary
| Phase | Description |
|---|---|
| Phase 1 | Raw Urdu Text → Clean corpus |
| Phase 2 | Corpus → BPE Tokenizer (32K vocab) |
| Phase 3 | Tokenized chunks → Pre-train GPT (23M params, next-token prediction) |
| Phase 4 | Conversations → SFT with loss masking → Chat model |
| Deploy | app.py + model weights → Hugging Face Spaces / Local |
You can check out the deployed model here: https://huggingface.co/spaces/Wisamul/urdu_llm
Results
The results were pretty interesting. With such a small training set, hallucinations were common, and if you can read Urdu, some of them are genuinely entertaining. Overall, though, the model held up well on familiar topics.
Here are a few sample interactions with the deployed model:
👤 User: السلام علیکم
🤖 Assistant: وعلیکم السلام! میں آپ کی کیا مدد کر سکتا ہوں؟
(User: "Hello." / Assistant: "Hello! How can I help you?")
👤 User: پاکستان کا دارالحکومت کیا ہے؟
🤖 Assistant: پاکستان کا دارالحکومت اسلام آباد ہے۔ یہ شہر 1960 کی دہائی میں تعمیر کیا گیا تھا۔
(User: "What is the capital of Pakistan?" / Assistant: "The capital of Pakistan is Islamabad. The city was built in the 1960s.")
👤 User: تعلیم کی اہمیت کیا ہے؟
🤖 Assistant: تعلیم انسانی ترقی کی بنیاد ہے۔ یہ نہ صرف علم حاصل کرنے کا ذریعہ ہے بلکہ معاشرے کی تعمیر میں اہم کردار ادا کرتی ہے۔
(User: "What is the importance of education?" / Assistant: "Education is the foundation of human development. It is not only a means of acquiring knowledge but also plays an important role in building society.")
For questions close to the training data, the model responds accurately and fluently. For out-of-distribution queries, it tends to hallucinate or repeat fragments. This is expected given the small SFT dataset (79 examples) and model size (23M parameters).
Conclusion
The goal of all this was to walk through every step involved in building an LLM. I hope it also makes clear why not everyone is building their own: cost, training time, data collection, and so on. At the same time, going through this process has hopefully helped you appreciate both the work that came before and what we've achieved here.
We went from raw Urdu text all the way to a deployed chatbot: data cleaning, BPE tokenization, pre-training a GPT-style transformer, supervised fine-tuning with loss masking, and finally a Gradio web interface.
The model is tiny and the dataset is small, but every concept here (attention, next-token prediction, SFT, chat formatting) is exactly what powers production LLMs like GPT-4 and Llama – just at a much larger scale.
If you want to improve on this, the highest-impact next steps would be:
more SFT data (thousands of examples instead of 79),
a larger model (100M+ parameters), and
RLHF/DPO alignment.
But even at this scale, you now have a concrete understanding of the full LLM pipeline.