General Programming - freeCodeCamp.org

How to Build an AI-Powered Research Automation System with n8n, Groq, and Academic APIs

Chidozie Managwu — Mon, 16 Mar 2026 18:17:27 +0000

As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources.

For my work on circular economy and battery recycling, I needed a way to query multiple databases at once without the manual fatigue.

In this tutorial, you'll build an automated research pipeline using n8n that turns roughly six hours of manual literature review into a five-minute automated process.

This isn’t a “cool demo workflow.” It’s a production-minded pipeline with parallel collection, normalisation, deduplication, structured AI extraction, scoring, and practical error handling.

Prerequisites
The Problem: Research Takes Too Long
The Tech Stack
The Project Structure: How to Think About an n8n Workflow Like Software
Stage 1: Centralised Configuration
Stage 2: Parallel API Collection (With Failure Isolation)
Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)
Stage 4: AI-Powered Content Extraction (Strict JSON)
Stage 5: Scoring and Synthesis
Beginner-Friendly Evals: Retrieval and Extraction QA
Key Learnings and Error Handling
Conclusion

Prerequisites

You don’t need to be a DevOps engineer to follow this, but you should have:

Basic comfort with APIs and JSON (request/response payloads)
Familiarity with spreadsheets (Google Sheets basics)
Willingness to use a small amount of JavaScript inside n8n Function/Code nodes

Access to:

An n8n instance (self-hosted or cloud)
A Groq API key (or a compatible LLM provider)
Optional API keys, depending on the databases you use

What you’ll build assumes:

You’re extracting from metadata + abstracts (not downloading full PDFs).
You can accept that some sources will occasionally rate-limit or return partial results (and your workflow will be designed to survive this).

The Problem: Research Takes Too Long

Manual research is often a bottleneck for innovation. Before building this automation, my workflow involved searching multiple academic databases, scanning abstracts, and manually extracting key findings. This process was not only slow but also prone to human error and inconsistent note-taking.

The goal of this automation is to provide a “full-stack research assistant” that handles the heavy lifting of collecting candidate papers, removing duplicates, extracting consistent fields, scoring relevance and quality, and delivering a curated daily or weekly report, so you can spend your time on high-level synthesis rather than repetitive collection.

The Tech Stack

This workflow leverages a combination of automation tooling, high-speed LLM inference, and academic metadata providers.

Tool	Purpose
n8n	The workflow engine that orchestrates all steps
Groq	Runs a fast LLM (for example, Llama 3.3 70B) for structured extraction/synthesis
Semantic Scholar / OpenAlex	Broad academic coverage for metadata, abstracts, citations
arXiv / PubMed	Strong specialised coverage (preprints, life sciences)
Google Sheets	A lightweight “research database” for storage + history

Notes: coverage varies by provider. Some APIs return abstracts reliably, while others may omit them. Your pipeline should treat missing abstracts as a normal case, not a failure.

The Project Structure: How to Think About an n8n Workflow Like Software

While n8n is a visual tool, it helps to design your workflow as modular stages to avoid the “spaghetti workflow” problem.

.
├── configuration/         # Keywords, thresholds, limits, date filters
├── collectors/            # Parallel HTTP request nodes (multiple sources)
├── processing/            # Normalization + deduplication code nodes
├── extraction/            # LLM extraction nodes (strict JSON)
├── scoring/               # Relevance + quality scoring + filtering
└── delivery/              # Google Sheets + email/HTML report

Design principle: each stage should produce a clean, predictable output shape that the next stage can rely on.

Stage 1: Centralised Configuration

Instead of hardcoding search parameters (keywords, min year, citation thresholds) across multiple nodes, use one configuration node to define workflow variables.

This matters for maintainability (change a value once, not in ten nodes), reusability (repurpose the entire pipeline by swapping one config object), and debuggability (log the config at the start of each run so you can reproduce results).

Use a Set node, or a Code node returning JSON like this:

{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": 2020,
  "max_results_per_source": 10,
  "min_citations": 2,
  "relevance_threshold": 15,
  "batch_size": 10
}

Tip: keep numeric fields as numbers (not strings) to avoid scoring bugs later.
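One way to enforce that tip is to check the config once, at the very start of a run. The sketch below is illustrative only: `validateConfig` and `NUMERIC_FIELDS` are hypothetical names, and the field list simply mirrors the JSON above.

```javascript
// Hypothetical helper: fail fast if a numeric config field arrives as a string.
const NUMERIC_FIELDS = [
  "min_year",
  "max_results_per_source",
  "min_citations",
  "relevance_threshold",
  "batch_size",
];

function validateConfig(config) {
  const problems = [];

  if (typeof config.keywords !== "string" || config.keywords.trim() === "") {
    problems.push("keywords must be a non-empty string");
  }

  for (const field of NUMERIC_FIELDS) {
    const value = config[field];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      problems.push(`${field} must be a number, got ${JSON.stringify(value)}`);
    }
  }

  if (problems.length > 0) {
    throw new Error(`Invalid config: ${problems.join("; ")}`);
  }
  return config;
}

const config = validateConfig({
  keywords: "circular economy battery recycling remanufacturing",
  min_year: 2020,
  max_results_per_source: 10,
  min_citations: 2,
  relevance_threshold: 15,
  batch_size: 10,
});
```

In an n8n Code node you would finish with `return [{ json: config }];`. The check matters because JavaScript silently coerces types: `"2" + 3` evaluates to the string `"23"`, which is exactly the kind of scoring bug that is hard to spot later.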

Stage 2: Parallel API Collection (With Failure Isolation)

Your workflow should query multiple sources simultaneously. In n8n, you can branch from your configuration node into multiple HTTP Request nodes, and then merge results later.

The production mindset here is simple: APIs fail. Rate limits happen. Providers return partial data. The key is to prevent one failing collector from crashing the whole run.

To implement this, on each HTTP Request node, enable Continue On Fail (or the equivalent “don’t stop workflow” behaviour). Then, in the normalisation stage, treat missing or failed outputs as empty arrays so downstream stages still run.

In practice, it also helps to set explicit timeouts and add a small retry policy (one to two retries) for transient failures. “Good” looks like this: if two out of five sources fail, you still produce a useful report from the remaining three, and you log which sources failed so you can investigate later.
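The "treat failures as empty arrays and log which sources failed" idea can be sketched as a small merge function. This is a sketch under assumptions, not the workflow's actual code: it assumes each collector output has been tagged as `{ source, json }`, that a failed collector leaves an `error` field on its `json`, and that successful responses carry their records under `data`. Adjust all three to what your nodes really emit.

```javascript
// Sketch: merge collector outputs, treating failed or empty ones as [].
// Assumptions: entries look like { source, json }; failures expose json.error;
// records live under json.data.
function mergeCollectorOutputs(outputs) {
  const papers = [];
  const failedSources = [];

  for (const { source, json } of outputs) {
    if (!json || json.error) {
      failedSources.push(source);
      continue; // a failed source contributes nothing, but the run goes on
    }
    const records = Array.isArray(json.data) ? json.data : [];
    for (const record of records) {
      papers.push({ ...record, source });
    }
  }

  return { papers, failedSources };
}

const merged = mergeCollectorOutputs([
  { source: "Semantic Scholar", json: { data: [{ title: "Paper A" }] } },
  { source: "OpenAlex", json: { error: "429 Too Many Requests" } },
  { source: "arXiv", json: null },
]);

console.log(merged.papers.length); // 1
console.log(merged.failedSources); // [ 'OpenAlex', 'arXiv' ]
```

Carrying `failedSources` forward lets the final report say which collectors were down without stopping the run.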

Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)

Each academic API returns different field names and shapes. One might use title, another display_name, another paper_title. Your next stage should normalise all inputs into one schema.

Target normalised schema

Here’s a simple baseline schema (expand later as needed):

{
  "title": "string",
  "abstract": "string|null",
  "doi": "string|null",
  "year": 2024,
  "citations": 12,
  "url": "string|null",
  "source": "Semantic Scholar|OpenAlex|arXiv|PubMed"
}

What deduping by DOI means (and what a DOI is)

A DOI (Digital Object Identifier) is a unique, persistent identifier assigned to many scholarly publications. If a paper has a DOI, that DOI functions like a stable ID: the same paper may appear in multiple databases with slightly different metadata, but the DOI should remain consistent.

So, deduping by DOI means: if two records share the same DOI, treat them as the same paper and keep only one.

When a DOI is missing (which is common for some preprints and some API responses), the fallback is to dedupe using a normalised title key: the title lowercased, trimmed, stripped of punctuation, and with whitespace collapsed. It’s not as reliable as DOI-based matching, but it’s a strong pragmatic backup.

What “normalise into a unified object” means (what’s happening in the code)

“Normalise into a unified object” simply means converting every provider’s raw response into the same predictable shape (the schema above). Once everything looks the same, downstream steps, such as deduplication, scoring, AI extraction, and storage, become straightforward because they don’t need provider-specific logic.

In the code below, that’s what the normalized object is: it maps Semantic Scholar’s fields (paper.title, paper.externalIds.DOI, paper.citationCount) into your standard fields (title, doi, citations, etc.). After that, the workflow generates a dedupe key (doi:... if DOI exists, otherwise title:...) and uses a Set to keep only the first occurrence.

Example n8n Code Node (Normalisation + Dedupe Pattern)

const itemsIn = $input.all();

const seen = new Set();
const results = [];

function titleKey(t) {
  return (t || "")
    .toLowerCase()
    .replace(/[\W_]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

for (const item of itemsIn) {
  // Example: Semantic Scholar response shape
  const papers = item.json?.data || [];

  for (const paper of papers) {
    // "Normalize into a unified object":
    // take the provider-specific fields and map them into our standard schema.
    const normalized = {
      title: paper.title || null,
      abstract: paper.abstract || null,
      doi: paper.externalIds?.DOI || null,
      year: paper.year || null,
      citations: paper.citationCount || 0,
      url: paper.url || null,
      source: "Semantic Scholar",
    };

    if (!normalized.title) continue;

    // Dedupe key: DOI is strongest; title is fallback
    const key = normalized.doi
      ? `doi:${normalized.doi.toLowerCase()}`
      : `title:${titleKey(normalized.title)}`;

    if (seen.has(key)) continue;
    seen.add(key);

    results.push(normalized);
  }
}

return results.map(r => ({ json: r }));

Production-minded note: keep a field like source so you can debug where bad metadata is coming from later.
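The example above handles one provider. As a hedged illustration of a second normaliser, here is a sketch for OpenAlex-style records. The field names (`display_name`, `publication_year`, `cited_by_count`, `abstract_inverted_index`, and a `doi` given as a full `https://doi.org/...` URL) reflect my understanding of that API and should be verified against the current documentation. The helper names are hypothetical.

```javascript
// Sketch of a second normaliser. Field names are assumptions about
// OpenAlex "work" objects; verify them before relying on this.

// Strip a resolver prefix so "https://doi.org/10.1/X" and "10.1/x" match.
function cleanDoi(doi) {
  if (!doi) return null;
  const cleaned = String(doi)
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, "")
    .replace(/^doi:/, "");
  return cleaned || null;
}

// Rebuild abstract text from a { word: [positions] } inverted index.
function abstractFromInvertedIndex(index) {
  if (!index) return null;
  const words = [];
  for (const [word, positions] of Object.entries(index)) {
    for (const pos of positions) words[pos] = word;
  }
  const text = words.filter(Boolean).join(" ");
  return text || null;
}

function normalizeOpenAlexWork(work) {
  return {
    title: work.display_name || null,
    abstract: abstractFromInvertedIndex(work.abstract_inverted_index),
    doi: cleanDoi(work.doi),
    year: work.publication_year || null,
    citations: work.cited_by_count || 0,
    url: work.doi || work.id || null,
    source: "OpenAlex",
  };
}

const example = normalizeOpenAlexWork({
  display_name: "Battery Recycling at Scale",
  doi: "https://doi.org/10.1234/ABC.567",
  publication_year: 2023,
  cited_by_count: 4,
  abstract_inverted_index: { Recycling: [0], matters: [1] },
});

console.log(example.doi); // 10.1234/abc.567
console.log(example.abstract); // Recycling matters
```

The DOI clean-up is the important part: if one provider returns a bare DOI and another returns a resolver URL, a `doi:` dedupe key only matches once both are reduced to the same form.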

Stage 4: AI-Powered Content Extraction (Strict JSON)

Once you have a deduplicated list of papers, you can send each paper (or a small batch) to Groq for structured extraction.

Why structured output matters

If your LLM returns narrative text instead of JSON, misses fields, or emits malformed JSON, your workflow breaks downstream. In a production workflow, that’s not a rare edge case; it’s something you should expect and design around.

That’s why you’ll use strict schema prompting and validate responses downstream.

System prompt vs user prompt (and how to compose them)

A helpful way to think about prompts in production is:

The system prompt defines the non-negotiable contract: output format, allowed keys, no commentary, and what to do in uncertain cases. This is where you say “return ONLY valid JSON” and “no extra keys.”
The user prompt provides the variable data for this specific request: title, year, citations, abstract, and the exact schema you want filled.

Composing them this way keeps your workflow stable. The system prompt stays mostly constant (your formatting contract), while the user prompt changes per paper (your payload). It also makes debugging easier: if outputs start failing, you can adjust the system constraints without rewriting every payload template.

Suggested extraction schema

Extract only what you can support from abstract-level data:

research_question
methodology
key_findings
limitations
notes (for missing abstract / ambiguity)

Example prompt (system + user)

System:

You are a research extraction engine. You must return ONLY valid JSON.
No markdown. No extra keys. No commentary.
If the abstract is missing or too vague, set fields to null and include a reason in "notes".

User:

Extract structured fields from this paper.

TITLE: {{title}}
YEAR: {{year}}
CITATIONS: {{citations}}
ABSTRACT: {{abstract}}

Return JSON with keys:
research_question (string|null)
methodology (string|null)
key_findings (array of strings)
limitations (array of strings)
notes (string)

Model settings: keep temperature low (around 0.2–0.3) and keep responses short and structured.
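To make the system/user split concrete, the sketch below assembles a request body for an OpenAI-compatible chat completions endpoint. Several things here are assumptions rather than facts from this tutorial: the model id `llama-3.3-70b-versatile`, the availability of JSON mode through `response_format`, and the helper names. If your provider does not support `response_format`, drop that field and rely on the prompt plus downstream validation.

```javascript
// Sketch: build a chat-completions request body for one paper.
// Assumptions: OpenAI-compatible endpoint, the model id below, and JSON mode.
const SYSTEM_PROMPT = [
  "You are a research extraction engine. You must return ONLY valid JSON.",
  "No markdown. No extra keys. No commentary.",
  'If the abstract is missing or too vague, set fields to null and include a reason in "notes".',
].join("\n");

// The user prompt carries the per-paper payload and the schema to fill.
function buildUserPrompt(paper) {
  return [
    "Extract structured fields from this paper.",
    "",
    `TITLE: ${paper.title}`,
    `YEAR: ${paper.year ?? "unknown"}`,
    `CITATIONS: ${paper.citations ?? 0}`,
    `ABSTRACT: ${paper.abstract ?? "(missing)"}`,
    "",
    "Return JSON with keys:",
    "research_question (string|null)",
    "methodology (string|null)",
    "key_findings (array of strings)",
    "limitations (array of strings)",
    "notes (string)",
  ].join("\n");
}

function buildExtractionRequest(paper, model = "llama-3.3-70b-versatile") {
  return {
    model,
    temperature: 0.2,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: buildUserPrompt(paper) },
    ],
  };
}

const body = buildExtractionRequest({
  title: "Battery Recycling at Scale",
  year: 2023,
  citations: 4,
  abstract: null,
});
console.log(body.messages[1].content.includes("ABSTRACT: (missing)")); // true
```

The system prompt is a constant and only the user message changes per paper, so a formatting fix is made in one place.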

Batch processing to avoid timeouts

Instead of sending 50 papers at once, process them in batches (for example, 10). This reduces latency spikes, failure blast radius, and cost surprises. Smaller batches also make it easier to retry only the failing chunk rather than re-running everything.
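n8n can do the splitting for you with a batching node, but if you want the same behaviour inside a Code node, a minimal sketch looks like this. `processBatch` is a hypothetical async function standing in for whatever calls your LLM.

```javascript
// Split a list into fixed-size batches; the last batch may be smaller.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run a hypothetical async `processBatch` over each batch in sequence and
// record which batches failed, so only those need to be retried later.
async function processInBatches(items, size, processBatch) {
  const results = [];
  const failedBatches = [];
  const batches = chunk(items, size);

  for (let i = 0; i < batches.length; i++) {
    try {
      results.push(...(await processBatch(batches[i])));
    } catch (err) {
      failedBatches.push({
        index: i,
        items: batches[i],
        reason: String(err.message || err),
      });
    }
  }
  return { results, failedBatches };
}

const sizes = chunk(Array.from({ length: 25 }, (_, i) => i), 10).map((b) => b.length);
console.log(sizes); // [ 10, 10, 5 ]
```

Because each failure is recorded with its own items, a retry only needs to resend the batches in `failedBatches`.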

Stage 5: Scoring and Synthesis

Not every retrieved paper is worth your time. Without scoring, your pipeline becomes a firehose: you’ve automated collection, but you still have to manually decide what to read. Scoring is what turns “a big list of results” into a shortlist you can trust.

I recommend computing two signals:

Relevance: Is this actually about your research question?
Quality/priority: If it’s relevant, is it worth reading first?

For relevance, keep it simple and explainable. Count keyword hits in the title and abstract (and optionally in extracted key_findings). Title matches should be weighted higher because titles are deliberately compact summaries. Abstract hits are useful too, but cap them so long abstracts don’t dominate the score.

For quality/priority, use lightweight metadata you already have. Recency is a strong signal in fast-moving areas, and citations can help, but they should be treated as a weak signal (and capped) so newer high-value papers aren’t unfairly penalised.

A solid first scoring model is: add a title bonus, add a capped abstract bonus, add a capped citations bonus, and add a small recency bonus for papers from the last two years. Then filter using the relevance_threshold value from your Stage 1 configuration. The advantage of this approach is that it’s easy to debug and tune: you can always explain why a paper passed or failed.
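Here is one way that additive model could look in code. Every weight and cap below is an illustrative assumption chosen for the example, not a value from the original workflow. `scorePaper` and `WEIGHTS` are hypothetical names, and `currentYear` is a parameter so the result is reproducible.

```javascript
// Sketch of the additive scoring model. All weights and caps are
// illustrative assumptions; tune them for your own domain.
const WEIGHTS = {
  titleHit: 5, // points per keyword found in the title
  abstractHit: 2, // points per keyword found in the abstract
  abstractCap: 10, // maximum points from the abstract
  perCitation: 0.5, // points per citation
  citationCap: 5, // maximum points from citations
  recencyBonus: 3, // bonus for papers from the last two years
};

function scorePaper(paper, config, currentYear = new Date().getFullYear()) {
  const keywords = config.keywords.toLowerCase().split(/\s+/).filter(Boolean);
  const title = (paper.title || "").toLowerCase();
  const abstract = (paper.abstract || "").toLowerCase();

  const titleHits = keywords.filter((k) => title.includes(k)).length;
  const abstractHits = keywords.filter((k) => abstract.includes(k)).length;

  // Keep the per-signal breakdown so every decision can be explained.
  const breakdown = {
    title: titleHits * WEIGHTS.titleHit,
    abstract: Math.min(abstractHits * WEIGHTS.abstractHit, WEIGHTS.abstractCap),
    citations: Math.min((paper.citations || 0) * WEIGHTS.perCitation, WEIGHTS.citationCap),
    recency: paper.year && paper.year >= currentYear - 2 ? WEIGHTS.recencyBonus : 0,
  };

  const score = breakdown.title + breakdown.abstract + breakdown.citations + breakdown.recency;
  return { ...paper, score, breakdown, accepted: score >= config.relevance_threshold };
}

const scored = scorePaper(
  {
    title: "Battery recycling in a circular economy",
    abstract: "We study remanufacturing and recycling of lithium-ion cells.",
    year: 2025,
    citations: 4,
  },
  { keywords: "circular economy battery recycling remanufacturing", relevance_threshold: 15 },
  2026
);
console.log(scored.score, scored.accepted); // 29 true
```

Storing `breakdown` next to the score is what keeps the model explainable: in this example the 29 points come from 20 (title), 4 (abstract), 2 (citations), and 3 (recency).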

Once you’ve filtered down to your “gold” set, synthesis becomes safer and more useful. Write one row per accepted paper to Google Sheets, then generate a daily/weekly HTML summary (for example, top 5 papers with 1–2 key findings each) and include links so you can verify quickly.
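The HTML summary can be produced by a small rendering function. This sketch assumes each accepted paper carries `title`, `url`, `score`, and an `extraction.key_findings` array. Those field names, and the helpers `escapeHtml` and `renderReport`, are assumptions for illustration.

```javascript
// Escape text before placing it in HTML, since titles and findings
// come from external APIs and an LLM.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render the top-N papers by score, with at most two key findings each.
function renderReport(papers, topN = 5) {
  const top = [...papers].sort((a, b) => b.score - a.score).slice(0, topN);

  const rows = top.map((p) => {
    const findings = (p.extraction?.key_findings || [])
      .slice(0, 2)
      .map((f) => `<li>${escapeHtml(f)}</li>`)
      .join("");
    const title = escapeHtml(p.title);
    const heading = p.url ? `<a href="${escapeHtml(p.url)}">${title}</a>` : title;
    return `<li><strong>${heading}</strong> (score ${p.score})<ul>${findings}</ul></li>`;
  });

  return `<h2>Top ${top.length} papers</h2><ol>${rows.join("")}</ol>`;
}

const html = renderReport([
  {
    title: "A & B",
    url: "https://example.org/a",
    score: 12,
    extraction: { key_findings: ["First", "Second", "Third"] },
  },
  { title: "C", url: null, score: 30, extraction: null },
]);
```

The resulting string can be passed to an email node as the message body. Papers without a URL or without extracted findings still render, just without a link or bullet points.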

Beginner-Friendly Evals: Retrieval and Extraction QA

AI workflows regress silently. A prompt tweak, a model update, or an API schema change can break extraction without throwing an obvious error. Adding lightweight evals is the difference between “it worked last week” and “it’s reliable.”

The goal here isn’t to build a full evaluation framework. It’s to add small, cheap checks that catch the most common failure modes:

Are collectors still returning results?
Are we actually removing duplicates?
Is the LLM returning valid JSON with the keys we require?

What it looks like in n8n (a concrete example)

A simple implementation is to add an “Assertions” Code node immediately after your extraction step, plus (optionally) another one after normalisation/deduplication.

At a high level, the workflow section looks like:

Collectors (parallel HTTP Request nodes)
Merge results
Normalise + dedupe (Code node)
Split in Batches (optional)
LLM extraction (Groq/OpenAI-compatible node)
Assertions (Code node)
If node (pass/fail)
Delivery (Sheets + email)

Example: Assertions code node after extraction

This code node assumes each item is a paper with:

title, abstract in the normalised fields, and
an extraction field (or whatever you name it) containing the LLM response as an object or JSON string.

Adapt the field name to match your actual node output, but the pattern is the same: parse, validate required keys, compute percentages, then decide whether to fail or warn.

const items = $input.all();

let total = items.length;
let withTitle = 0;
let withAbstract = 0;

let parseOk = 0;
let schemaOk = 0;

const requiredKeys = [
  "research_question",
  "methodology",
  "key_findings",
  "limitations",
  "notes",
];

const failures = [];

for (let i = 0; i < items.length; i++) {
  const p = items[i].json;

  if (p.title && String(p.title).trim().length > 0) withTitle++;
  if (p.abstract && String(p.abstract).trim().length > 0) withAbstract++;

  // Adjust this depending on where you store the model output:
  const raw = p.extraction ?? p.llm ?? p.model_output;

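  let obj = null;
  try {
    obj = typeof raw === "string" ? JSON.parse(raw) : raw;
  } catch (e) {
    failures.push({ index: i, title: p.title || null, reason: "JSON parse failed" });
    continue;
  }

  // A missing field or a non-object value (null, number, array) is not a usable
  // extraction, and calling hasOwnProperty on null/undefined below would throw.
  if (obj === null || typeof obj !== "object" || Array.isArray(obj)) {
    failures.push({ index: i, title: p.title || null, reason: "Extraction missing or not an object" });
    continue;
  }
  parseOk++;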

  const hasAllKeys = requiredKeys.every(k => Object.prototype.hasOwnProperty.call(obj, k));
  if (!hasAllKeys) {
    failures.push({ index: i, title: p.title || null, reason: "Missing required keys" });
    continue;
  }

  // Optional: ensure arrays are arrays
  const arraysOk = Array.isArray(obj.key_findings) && Array.isArray(obj.limitations);
  if (!arraysOk) {
    failures.push({ index: i, title: p.title || null, reason: "key_findings/limitations not arrays" });
    continue;
  }

  schemaOk++;
}

const pct = (n) => (total === 0 ? 0 : Math.round((n / total) * 100));

const report = {
  total_papers: total,
  pct_with_title: pct(withTitle),
  pct_with_abstract: pct(withAbstract),
  pct_extraction_json_parse_ok: pct(parseOk),
  pct_extraction_schema_ok: pct(schemaOk),
  failures_sample: failures.slice(0, 5),
};

// Decide pass/fail thresholds
const HARD_FAIL_PARSE_BELOW = 90;
const HARD_FAIL_SCHEMA_BELOW = 85;

const shouldFail =
  report.pct_extraction_json_parse_ok < HARD_FAIL_PARSE_BELOW ||
  report.pct_extraction_schema_ok < HARD_FAIL_SCHEMA_BELOW;

return [
  {
    json: {
      eval_report: report,
      shouldFail,
    },
  },
];

Then add an If node:

If shouldFail is true, then route to an “Alert/Stop” branch (Slack/email/log) and optionally stop the workflow.
If false, then continue to the delivery stage.

This is the automation equivalent of unit tests: small, cheap, and extremely effective. It also gives you a concrete paper trail when something changes upstream.
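The optional second assertions node, placed after normalisation/deduplication, covers the first two questions from the checklist: are collectors still returning results, and are duplicates actually gone? A minimal sketch, assuming items follow the normalised schema and that you pass in the list of sources you expect (`checkCollection` and `dedupeKey` are hypothetical names):

```javascript
// Same key logic as the dedupe stage: DOI first, normalised title as fallback.
function dedupeKey(paper) {
  if (paper.doi) return `doi:${String(paper.doi).toLowerCase()}`;
  const title = (paper.title || "")
    .toLowerCase()
    .replace(/[\W_]+/g, " ")
    .trim();
  return `title:${title}`;
}

// Count papers per expected source and look for keys that appear twice.
function checkCollection(papers, expectedSources) {
  const perSource = new Map(expectedSources.map((s) => [s, 0]));
  const seen = new Set();
  const duplicateKeys = [];

  for (const paper of papers) {
    if (perSource.has(paper.source)) {
      perSource.set(paper.source, perSource.get(paper.source) + 1);
    }
    const key = dedupeKey(paper);
    if (seen.has(key)) duplicateKeys.push(key);
    seen.add(key);
  }

  return {
    total: papers.length,
    per_source: Object.fromEntries(perSource),
    silent_sources: expectedSources.filter((s) => perSource.get(s) === 0),
    duplicate_keys: duplicateKeys, // should be empty after dedupe
    shouldFail: papers.length === 0 || duplicateKeys.length > 0,
  };
}

const check = checkCollection(
  [
    { title: "Paper A", doi: "10.1/a", source: "Semantic Scholar" },
    { title: "Paper A (copy)", doi: "10.1/A", source: "OpenAlex" },
  ],
  ["Semantic Scholar", "OpenAlex", "arXiv"]
);
console.log(check.silent_sources); // [ 'arXiv' ]
console.log(check.duplicate_keys); // [ 'doi:10.1/a' ]
```

A silent source is reported but does not fail the run here, in keeping with the failure-isolation approach from Stage 2. Only an empty result set or surviving duplicates set `shouldFail`.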

Key Learnings and Error Handling

Building this automation taught me that the best workflows are designed for failure.

First, error resilience is not optional. Never let one failing API crash the workflow. Use “Continue On Fail” on your HTTP nodes, merge partial results, and log which sources failed in your final report so you can debug without losing an entire run.

Second, batching is your friend. Process papers in batches (often 5–15) to reduce timeouts and cost spikes. Keep LLM payloads small and focused on what you actually need (metadata + abstract), and retry transient failures once rather than repeatedly hammering the model or API.

Third, structured prompting is what makes AI reliable in automation. A strict JSON schema is the difference between a workflow that runs unattended and one that breaks randomly. Keep temperature low, enforce the schema in the system prompt, and validate everything downstream with simple parse-and-assert checks.

Conclusion

A good research pipeline doesn’t just retrieve papers – it turns scattered results into a consistent, deduplicated, scored, and review-ready shortlist you can trust.

By treating your n8n workflow like software (modular stages, strict contracts between steps, and lightweight eval checks), you can reduce hours of manual literature review to a fast, repeatable process that survives real-world API failures and model quirks.

If you build this with good defaults (failure isolation, batching, normalisation, strict JSON extraction, and simple scoring), you end up with something you can run daily or weekly and actually rely on without the manual fatigue.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead the AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognised with the Global Tech Hero award and featured on platforms like HackerNoon.

Storyteller: A Medium For Guiding Others Through Code

Mark Mahoney — Sat, 28 Feb 2026 01:07:32 +0000

As a computer science instructor, I have long wished that there was a better way to guide others through my code. When I was first learning to program, I was a big fan of traditional programming books. I have shelves and shelves of 800+ page books covering different programming languages and technologies.

I have known for a while now that most learners today don't share my love of big thick books, and to be honest, I rarely read those books in their entirety. Those big books often had a lot more exposition about the code than was probably needed. As a book buyer I wanted to make sure that I was getting my money's worth, so the thicker they were, the better. It is much more common these days for learners to consume blog-based tutorials and videos.

If you're learning to code right now, you've probably experienced the frustration of these formats too. I want to share something I've been working on that might help.

Blogs and Videos

Blog-style tutorials mix code and the explanation of it in a top-to-bottom fashion. Scrolling through these web-based explanations feels familiar and one can copy and paste with ease. However, linking the explanation of the code and the code itself has always been less than ideal. Often I find myself jumping around the blog post wishing I could see the entire code example while working through the explanation. Instead, I am only able to see small parts of the code and it is challenging to see how those parts relate to other parts.

Video tutorials are very popular these days. They solve some of the problems associated with blog-style tutorials. Videos are great because you get two streams of information: the author's audible narrative and the code being written. A viewer can focus on the two streams simultaneously. However, videos have some problems too.

Viewing Videos

From the perspective of the viewer, videos are hard to search through and are not useful as a copy and paste source or a code reference. More importantly, though, they discourage the viewer from taking their time and reflecting on the material. Often, when I am viewing a video tutorial I don't pause and let concepts sink in before the video moves on. Yes, I could be more disciplined and pause and rewind more often but usually I don't.

Making videos

From the perspective of the video creator, it is clear that not all code being developed is interesting to watch. Some of it is not really worth showing the viewer. Not all video creators can keep the narrative interesting the whole time.

I know I struggle with the 'performance' aspect of making videos (you won't find me coding on Twitch anytime soon). Many times after I am done making a video, as I review it, I wish I had mentioned something that I forgot. It is hard to go back and edit the video without scrapping it and starting over.

Storyteller

I have created a new medium to guide viewers through code examples. It combines the best of books, blog posts, and videos. This new medium allows a developer to write code using a top-notch editor (Visual Studio Code) and then replay the development of that code in the browser.

The author can add comments at important points in the evolution of the code. The comments can include text, hand drawn pictures, screenshots, and audio and video recordings. This allows the author to add visualizations that we have in our heads but don't make it into the code itself. The tool is called Storyteller.

Here are a few examples of a 'playback':

These work best on a big screen. If you are viewing a playback on a small screen you can view it in 'blog' mode (there is a button in the top right to switch from 'code' mode to 'blog' mode).

I have created groups of these guided code walk-throughs to help me teach different topics to my students. These are all free and hosted on a website I created called Playback Press. Here are some of the 'books' I have created so far:

I usually assign these as readings in my classes instead of using expensive textbooks. It is a lot easier for me to write several programs than it is to find a perfect textbook.

I also use them for in-class demos instead of writing code live. This makes code demos flow much faster and smoother. If I make an interesting mistake while preparing the code I can still highlight it with a comment. If I make an uninteresting or embarrassing mistake I can just ignore it and the students won't focus on it.

The Advantages of Code Playbacks:

The primary focus is on the code. It is always visible and easy to search and navigate.
Since the code is so accessible, the explanation of it tends to be short and concise.
The narrative can include whiteboard style drawings, screenshots, or videos of running code in addition to a text explanation.
As an author, I can review the code several times and add/edit comments each time I go through it. I don't have to give a perfect performance like I do with a video.
Comment points highlight when the author wants the viewer to take a moment to really think about the code and reflect on it. The playback only moves forward when the viewer is ready.
The code mentioned in a comment can be highlighted so the viewer knows exactly where they should be looking.
The code can be downloaded at any point in the playback. Then a viewer can run it, change it, and add to it.
The tool is a language-independent editor plug-in and can be used to describe programs in any language.
Viewers only need a web browser to go through a playback.

Recently, I've been exploring how to make playbacks even more useful for learners.

AI as an Infinitely Patient Tutor

I have extended code playbacks to include an AI tutor. One thing I've learned in my years of teaching is that students often hesitate to ask questions. They worry about looking foolish, or they don't want to slow down the class, or they simply can't articulate what's confusing them.

What if every student had access to a patient tutor who never got frustrated with repeated questions and could explain concepts in multiple ways until something clicked?

I've integrated AI directly into the playback experience. As students work through a playback, they can ask questions about anything they're seeing. This might be a specific line of code, a concept I mentioned in a comment, or how something connects to material from earlier in the playback. The AI has full context. It can see the code, it understands where the student is in the playback, and it can provide explanations tailored to that exact moment. The AI is right there with the student, looking at the same code, understanding the same context.

The AI can also generate self-grading multiple choice questions based on the code and comments in a playback. These low-stakes quizzes make the learning experience more engaging and help learners check their understanding as they go.

Let me be clear: the AI doesn't replace me as an instructor. I still create the playbacks. I still decide what concepts to cover, what order to present them, and what examples best illustrate the ideas. The AI is an extension of my teaching, not a replacement for it.

Note: The AI features are available to registered users on Playback Press. Registration is free but logging in is required to access the AI tutor. If you want to see what this feels like, try one of the playbacks linked above and ask the AI a question about what you're seeing.

Conclusion

My goal has always been to help people learn to code. Books gave us depth but demanded commitment. Blogs gave us accessibility but fragmented the code. Videos gave us narrative but took away control. Playbacks keep the code front and center while letting learners move at their own pace and reflect when they need to. Adding AI doesn't change that philosophy. It just means there's always someone available to answer questions. Together, they get closer to the experience of having an expert sit beside you and walk you through a program. That's what I've been trying to build, and I think we're getting there.

How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans

Sameer Shukla — Thu, 05 Feb 2026 22:45:15 +0000

In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformations and actual computation lies an invisible translation layer – the logical plan – that determines whether your job runs in minutes or hours.

Most engineers never look at this layer, which is why they spend days tuning configurations that don't address the real problem: inefficient transformations that generate bloated plans.

This handbook teaches you to read, interpret, and control those plans, transforming you from someone who writes PySpark code into someone who architects efficient data pipelines with precision and confidence.

Background Information
Chapter 1: The Spark Mindset: Why Plans Matter
Chapter 2: Understanding the Spark Execution Flow
Chapter 3: Reading and Debugging Plans Like a Pro
Chapter 4: Writing Efficient Transformations
Conclusion

Background Information

What This Handbook is Really About

This is not a tutorial about Spark internals, cluster tuning, or PySpark syntax or APIs.

This is a handbook about writing PySpark code that generates efficient logical plans.

Because when your code produces clean, optimized plans, Spark pushes filters correctly, shuffles reduce instead of multiply, projections stay shallow, and the DAG (Directed Acyclic Graph) becomes predictable, lean, and fast.

When your code produces messy plans, Spark shuffles more than necessary, and projects pile up into deep, expensive stacks. Filters arrive late instead of early, joins explode into wide, slow operations, and the DAG becomes tangled and expensive.

The difference between a fast job and a slow job is not “faster hardware.” It’s the structure of the plan Spark generates from your code. This handbook teaches you to shape that plan deliberately through scenarios.

Who This Handbook Is For

This handbook is written for:

Data engineers building production ETL pipelines who want to move beyond trial-and-error tuning and understand why jobs perform the way they do
Analytics engineers working with large datasets in Databricks, EMR, or Glue who need to optimize Spark jobs but don't have time for thousand-page reference manuals
Data scientists transitioning from pandas to PySpark who find themselves writing code that technically runs but takes forever
Anyone who has stared at the Spark UI, seen mysterious "Exchange" nodes in the DAG, and wondered, "Why is this shuffling so much data?"

You should already be comfortable writing basic PySpark code: creating DataFrames, applying transformations, and running aggregations. This handbook won't teach you Spark syntax. Instead, it teaches you how to write transformations that work with the optimizer, not against it.

How This Handbook Is Structured

We’ll start with foundations, then move on to real-world scenarios.

Chapters 1-3 build your mental model. You'll learn what logical plans are, how they connect to physical execution, and how to read the plan output that Spark shows you. These chapters are short and focused – just enough theory to make the practical scenarios meaningful.

Chapter 4 is the heart of the handbook. It contains 15 real-world scenarios, organized by category. Each scenario shows you a common performance problem, explains what's happening in the logical plan, and demonstrates the better approach. You'll see before-and-after code, plan comparisons, and clear explanations of why one approach outperforms another.

What You'll Learn

By the end of this handbook, you'll be able to:

Read and interpret Spark's logical, optimized, and physical plans
Identify expensive operations before running your code
Restructure transformations to minimize shuffles
Choose the right join strategies for your data
Avoid common pitfalls that cause memory issues and slow performance
Debug production issues by examining execution plans

More importantly, you'll develop a Spark mindset, an intuition for how your code translates to cluster operations. You'll stop writing code that "should work" and start writing code that you know will work efficiently.

Technical Prerequisites

I assume that you’re familiar with the following concepts before proceeding:

Python fundamentals
PySpark basics
- Creating DataFrames and reading data from files
- Basic DataFrame operations: select, filter, withColumn, groupBy, join
- Writing DataFrames back to storage
Basic Spark concepts
- Basic understanding of Spark applications, jobs, stages, and tasks
- Basic understanding of the difference between transformations and actions
- Understanding of partitions and shuffles
AWS Glue (Good to have)

Chapter 1: The Spark Mindset: Why Plans Matter

This chapter isn’t about Spark theory or internals. It’s about understanding Spark Plans, and seeing Spark the way the engine sees your code. Once you understand how Spark builds and optimizes a logical plan, optimization stops being trial and error and becomes intentional engineering.

Behind every simple transformation, Spark quietly redraws its internal blueprint. Every transformation you write, from withColumn to join, changes that plan. When the plan is efficient, Spark flies, but when it’s messy, Spark crawls.

The Invisible Layer Behind Every Transformation

When you write PySpark code, it feels like you’re chaining operations step by step. In reality, Spark isn’t executing those lines. It’s quietly building a blueprint, a logical plan describing what to do, not how.

Once this plan is built, the Catalyst Optimizer analyzes it, rearranges operations, eliminates redundancies, and produces an optimized plan. Catalyst is Spark’s query optimization engine.

Every DataFrame or SQL operation we write, such as select, filter, join, or groupBy, is first converted into a logical plan. Catalyst then analyzes and transforms this plan using a set of rule-based optimizations, such as predicate pushdown, column pruning, constant folding, and join reordering. The result is an optimized logical plan, which Spark finally converts into a physical plan: the operations your cluster actually runs. This invisible planning layer often affects the job’s performance more than any configuration setting.

From Logical to Optimized to Physical Plans

When you run df.explain(True), Spark actually shows you four stages of reasoning:

1. Logical Plan

The logical plan is the first stage: the initial translation of your code produces a tree structure that shows what operations need to happen, without worrying about how to execute them efficiently. It’s a blueprint of the query’s logic before any optimization or physical planning occurs.

This:

df.filter(col('age') > 25) \
  .select('firstname', 'country') \
  .groupby('country') \
  .count() \
  .explain(True)

results in the following logical plan:

== Parsed Logical Plan ==
'Aggregate ['country], ['country, 'count(1) AS count#108]
+- Project [firstname#95, country#97]
   +- Filter (age#96L > cast(25 as bigint))
      +- LogicalRDD [firstname#95, age#96L, country#97], false

2. Analyzed Logical Plan

The analyzed logical plan is the second stage in Spark’s query optimization. In this stage, Spark validates the query by checking that tables and columns actually exist in the Catalog and resolving all references. It converts the unresolved logical plan into a resolved one with correct data types and column bindings before optimization.

3. Optimized Logical Plan

The optimized logical plan is where Spark's Catalyst optimizer improves the logical plan by applying smart rules like filtering data early, removing unnecessary columns, and combining operations to reduce computation. It's the smarter, more efficient version of your original plan that will execute faster and use fewer resources.

Let’s understand using a simple code example:

df.select('firstname', 'country') \
  .groupby('country') \
  .count() \
  .filter(col('country') == 'USA') \
  .explain(True)

Here’s the parsed logical plan:

== Parsed Logical Plan ==
'Filter '`=`('country, USA)
+- Aggregate [country#97], [country#97, count(1) AS count#122L]
   +- Project [firstname#95, country#97]
      +- LogicalRDD [firstname#95, age#96L, country#97], false

What this means:

Spark first projects firstname and country
Then aggregates by country
Then applies the filter country = 'USA' after aggregation

(because that’s how you wrote it).

Here’s the optimized logical plan:

== Optimized Logical Plan ==
Aggregate [country#97], [country#97, count(1) AS count#122L]
+- Project [country#97]
   +- Filter (isnotnull(country#97) AND (country#97 = USA))
      +- LogicalRDD [firstname#95, age#96L, country#97], false

Key improvements Catalyst applied:

Filter pushdown: The filter country = 'USA' is pushed below the aggregation, so Spark only groups U.S. rows.
Column pruning: “firstname” is automatically removed because it’s never used in the final output.
Cleaner projection: Intermediate columns are dropped early, reducing I/O and in-memory footprint.
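
To make filter pushdown less abstract, here is a toy model of a single rewrite rule in plain Python. It is only a sketch: the node classes (Scan, Project, Aggregate, Filter) and the push_filter_down function are hypothetical names invented for illustration, not Spark's internal classes, and the real Catalyst rules handle many more cases. The rule moves a Filter below an Aggregate when the predicate reads a grouping key, and below a Project when the column passes through unchanged.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    group_by: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str      # the single column the predicate reads
    predicate: str   # human-readable predicate, e.g. "country = USA"
    child: "Plan"


Plan = Union[Scan, Project, Aggregate, Filter]


def push_filter_down(plan):
    """Toy rule: move each Filter as close to the Scan as is safe."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Filter):
        child = plan.child
        # Safe below an Aggregate only if the predicate reads a grouping key.
        if isinstance(child, Aggregate) and plan.column in child.group_by:
            pushed = Filter(plan.column, plan.predicate, child.child)
            return Aggregate(child.group_by, push_filter_down(pushed))
        # Safe below a Project if the column passes through it unchanged.
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = Filter(plan.column, plan.predicate, child.child)
            return Project(child.columns, push_filter_down(pushed))
        return Filter(plan.column, plan.predicate, push_filter_down(child))
    if isinstance(plan, Project):
        return Project(plan.columns, push_filter_down(plan.child))
    return Aggregate(plan.group_by, push_filter_down(plan.child))


def node_names(plan):
    """List node types from the top of the tree down to the scan."""
    names = []
    while plan is not None:
        names.append(type(plan).__name__)
        plan = getattr(plan, "child", None)
    return names


scan = Scan(("firstname", "age", "country"))
parsed = Filter(
    "country", "country = USA",
    Aggregate(("country",), Project(("firstname", "country"), scan)),
)
optimized = push_filter_down(parsed)

print(node_names(parsed))     # ['Filter', 'Aggregate', 'Project', 'Scan']
print(node_names(optimized))  # ['Aggregate', 'Project', 'Filter', 'Scan']
```

The rewritten tree has the same shape as the optimized plan shown earlier: the Filter now sits directly above the data source. Column pruning would be a second rule of the same kind and is left out to keep the sketch short.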

4. Physical Plan

The physical plan is Spark's final execution blueprint that shows exactly how the query will run: which specific algorithms to use, how to distribute work across machines, and the order of low-level operations. It's the concrete, executable version of the optimized logical plan, translated into actual Spark operations like “ShuffleExchange”, “HashAggregate”, and “FileScan” that will run on your cluster.

Catalyst may, for example:

Fold constants (lit(2) * lit(3) → lit(6))
Push filters closer to the data source
Replace a regular join with a broadcast join when data fits in memory

Once the physical plan is finalized, Spark’s scheduler converts it into a DAG of stages and tasks that run across the cluster. Understanding that lineage, from your code → plan → DAG, is what separates fast jobs from slow ones.

How to Read a Logical Plan

A logical plan prints as a tree: the bottom is your data source, and each higher node represents a transformation.

Node	Meaning
Relation / LogicalRDD	Data source, the initial DataFrame
Project	Column selection and transformation (select, withColumn)
Filter	Row filtering based on conditions (where, filter)
Join	Combining two DataFrames on a key (join)
Aggregate	GroupBy and aggregation operations (groupBy, agg)
Exchange	Shuffle operation (data redistribution across partitions)
Sort	Ordering data (orderBy, sort)

Each node represents a transformation. Execution flows from the bottom up. Let's understand with a basic example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName("Practice").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

df = spark.createDataFrame(employees_data,  
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

Version A: withColumn → filter

In this version, we add a derived column with withColumn and then apply a filter to the dataset. This ordering is logically correct and produces the expected result. It also shows what happens to the logical plan when Spark is asked to compute a new column before any rows are eliminated.

df_filtered = (
    df.withColumn('bonus', col('salary') * 82)
      .filter(col('age') > 35)
)
# explain() prints the plans and returns None, so call it separately
df_filtered.explain(True)

Parsed Logical Plan (Simplified)

Filter (age > 35)
└─ Project [*, (salary * 82) AS bonus]
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

So what’s going on here? Execution flows from the bottom up.

Spark first reads the LogicalRDD.
Then applies the Project node, keeping all columns and adding bonus.
Finally, the Filter removes rows where age ≤ 35.

This means Spark computes the bonus for every employee, even those who are later filtered out. It's harmless here, but costly on millions of rows: more computation, more I/O, and more shuffle volume.

Version B: Filter → Project

In this version, we apply the filter before introducing the derived column. The idea is to show how pushing row-reducing operations earlier allows Catalyst to produce a leaner logical plan. Compared to Version A, this example demonstrates that the same logic, written in a different order, can significantly reduce the amount of work Spark needs to perform.

df_filtered = (
    df.filter(col('age') > 35)
      .withColumn('bonus', col('salary') * 82)
)
# explain() prints the plans and returns None, so call it separately
df_filtered.explain(True)

Parsed Logical Plan (Simplified)

Project [*, (salary * 82) AS bonus]
└─ Filter (age > 35)
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

So what’s going on here?

Spark starts from the LogicalRDD.
It immediately applies the Filter, reducing the dataset to only employees with age > 35.
Then the Project node adds the derived column bonus for this smaller subset.

Now the Filter sits below the Project in the plan, cutting data movement and minimizing computation. Spark prunes data first, then derives new columns. This order reduces both the volume of data processed and the amount transferred, leading to a lighter and faster plan.

Why You Should Look at the Plan Every Time by running `df.explain(True)`

This is the quickest way to spot performance issues before they hit production. It shows:

Whether filters sit in the right place.
How many Project nodes exist (each adds overhead).
Where Exchange nodes appear (these are shuffle boundaries).
If Catalyst pushed filters or rewrote joins as expected.

A quick explain() takes seconds, while debugging a bad shuffle in production takes hours. Run explain() whenever you add or reorder transformations. The plan never lies.
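
If you want to inspect plans programmatically rather than by eye, one option is to capture what explain() prints and split it into its sections. The sketch below is an assumption-laden helper, not part of PySpark: capture_plan and plan_sections are hypothetical names, and the approach assumes explain() writes to Python's standard output (which, as far as I know, is how PySpark implements it) and that the section headers look like "== Physical Plan ==".

```python
import io
import re
from contextlib import redirect_stdout


def capture_plan(df, extended=True):
    """Return the text that df.explain(extended) would print."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(extended)
    return buffer.getvalue()


def plan_sections(plan_text):
    """Split explain(True) output into {'Parsed Logical Plan': '...', ...}."""
    parts = re.split(r"^== (.+?) ==\s*$", plan_text, flags=re.MULTILINE)
    # parts looks like [preamble, name1, body1, name2, body2, ...]
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
```

With a real DataFrame you would call plan_sections(capture_plan(df)) and then look at the "Optimized Logical Plan" or "Physical Plan" entry, for example to check whether a Filter ended up where you expected.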

What Spark Does Under the Hood

Catalyst can sometimes reorder simple filters automatically, but once you use UDFs, nested logic, or joins, it often can’t. That’s why the best habit is to write transformations in a way that already makes sense to the optimizer. Filter early, avoid redundant projections, and keep plans as shallow as possible.

Optimizing Spark isn’t about tuning cluster configs – it’s about writing code that yields efficient plans. If your plan shows late filters, too many projections, or multiple Exchange nodes, it’s already explaining why your job will run slow.

Chapter 2: Understanding the Spark Execution Flow

In Chapter 1, you learned how Spark interprets your transformations into logical plans – blueprints of what the job intends to do.

But Spark doesn't stop there. It must translate those plans into distributed actions across a cluster of executors, coordinate data movement, and handle any failures that may occur.

This chapter reveals what happens when that plan leaves the driver: how Spark breaks your job into stages, tasks, and a directed acyclic graph (DAG) that actually runs.

By the end, you’ll understand why some operations shuffle terabytes while others fly, and how to predict it before execution begins.

From Plans to Stages to Tasks

A Spark job evolves through three conceptual layers:

Layer	What It Represents	Example View
Plan	The optimized logical + physical representation of your query	Read → Filter → Join → Aggregate
Stage	A contiguous set of operations that can run without shuffling data	“Map Stage” or “Reduce Stage”
Task	The smallest unit of work, one per partition per stage	“Process Partition 7 of Stage 3”

The Execution Trigger: Actions vs Transformations

Here's the critical distinction that determines when execution actually begins:

df1 = spark.read.parquet("data.parquet")
df2 = df1.filter(col("age") > 25)
df3 = df2.groupBy("city").count()

Nothing executes yet! Spark just builds up the logical plan, adding each transformation as a node in the plan tree. No data is read, no filters run, no shuffles happen.

Actions Trigger Execution

Spark transformations are lazy. When a sequence of DataFrame operations is defined, a logical plan is created, but no computation takes place. It’s only when Spark encounters an action, an operation that needs a result to be returned to the driver or written out, that execution takes place.

For example:

result = df3.collect()

At this stage, Spark materializes the logical plan, applies optimizations, creates a physical plan, and executes the job. Until Spark is asked to act, such as collect(), count(), or write(), it’s just describing what it needs to do – but it’s not actually doing it.
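
The same idea can be shown without a cluster. The following is a toy stand-in written in plain Python, not Spark code: LazyPipeline is a hypothetical class that records transformations and only runs them when collect() is called. It illustrates the lazy-evaluation contract described above under that simplifying assumption.

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step and return a new pipeline; no rows are touched.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        return LazyPipeline(self._rows, self._steps + (("select", names),))

    def collect(self):
        # The "action": only now are the recorded steps applied.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows


calls = []


def older_than_25(row):
    calls.append(row["name"])  # track when the predicate actually runs
    return row["age"] > 25


rows = [{"name": "Ann", "age": 31}, {"name": "Bo", "age": 22}]
pipeline = LazyPipeline(rows).filter(older_than_25).select("name")

print(len(calls))          # 0: nothing has executed yet
print(pipeline.collect())  # [{'name': 'Ann'}]
print(len(calls))          # 2: the predicate ran only inside collect()
```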

The Complete Execution Flow

Execution begins when an action such as collect() is called. The driver passes the optimized physical plan through the SparkContext to the DAG Scheduler, which analyzes it to find the shuffle boundaries created by wide operations such as groupBy or orderBy.

The plan is then divided into stages that each contain only narrow operations. Each stage is sent to the Task Scheduler as a TaskSet, with one task per partition.

The tasks are assigned to executor cores based on data locality and then run. A stage starts only after the stages it depends on have completed. The results of the final stage are returned to the driver or written to storage.
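
As a rough mental model of how the plan is cut into stages, here is a small sketch in plain Python. It is deliberately simplified: split_into_stages and WIDE_OPS are hypothetical names, the plan is treated as a flat list of operation names, and every join is assumed to shuffle (a broadcast join would not).

```python
# Operations assumed to need a shuffle in this toy model.
WIDE_OPS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def split_into_stages(operations):
    """Group operation names into stages; each wide op starts a new stage."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS:
            stages.append([])  # shuffle boundary: later work is a new stage
        stages[-1].append(op)
    return [stage for stage in stages if stage]


print(split_into_stages(["read", "filter", "groupBy", "select"]))
# [['read', 'filter'], ['groupBy', 'select']]
```

Narrow operations stay in the stage they were found in, while each wide operation forces a new stage to begin, which is the behavior the DAG Scheduler description above relies on.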

What Triggers a Shuffle

A shuffle occurs when Spark needs to redistribute data across partitions, typically because the operation requires grouping, joining, or repartitioning data in a way that can’t be done locally within existing partitions.

Common shuffle triggers:

Operation	Why it Shuffles
groupBy(), reduceByKey()	Data with the same key must co-locate for aggregation
join()	Matching keys may reside in different partitions
orderBy() / sort()	Requires global ordering across all partitions
distinct()	Needs comparison of all values across partitions
repartition(n)	Explicit redistribution to a new number of partitions

df.groupBy("user_id") \
  .agg(sum("amount"))

In Stage 1 (Map), each task performs a partial aggregation on its partition and writes a shuffle file to disk. During the shuffle, each executor retrieves these files across the network such that all records with the same hash(user_id) % numPartitions are colocated.

In Stage 2 (Reduce), each task performs a final aggregation on its partitioned data and writes back to disk. Because Spark has tracked this process as a DAG, a failed task can re-read only the affected shuffle files instead of re-computing the entire DAG.
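
The map, shuffle, and reduce steps can be simulated in a few lines of plain Python. This is a sketch under stated assumptions, not Spark's implementation: zlib.crc32 stands in for Spark's own hash function, and partition_for, map_side, shuffle, and reduce_side are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # crc32 is a stable stand-in for the engine's hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def map_side(partition_rows):
    """Stage 1: partial sum per key within one input partition."""
    partial = defaultdict(int)
    for user_id, amount in partition_rows:
        partial[user_id] += amount
    return dict(partial)


def shuffle(partials, num_partitions):
    """Route each (key, partial sum) to the partition chosen by the key's hash."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for user_id, subtotal in partial.items():
            buckets[partition_for(user_id, num_partitions)][user_id].append(subtotal)
    return buckets


def reduce_side(bucket):
    """Stage 2: final sum per key within one shuffle partition."""
    return {user_id: sum(subtotals) for user_id, subtotals in bucket.items()}


input_partitions = [
    [("u1", 10), ("u2", 5), ("u1", 7)],
    [("u2", 3), ("u3", 8), ("u1", 1)],
]
partials = [map_side(rows) for rows in input_partitions]
buckets = shuffle(partials, num_partitions=3)

totals = {}
for bucket in buckets:
    totals.update(reduce_side(bucket))

print(totals == {"u1": 18, "u2": 8, "u3": 8})  # True
```

Because the target partition depends only on the key, every partial sum for a given user_id lands in the same bucket, so the reduce side never needs to look at another partition.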

In practice, a healthy job has 2-6 stages. Seeing 20+ stages for such simple logic usually means unnecessary shuffles or bad partitioning.

Why Shuffles Create Stage Boundaries

Shuffles force data to move across the network between executors. Spark cannot continue processing until:

All tasks in the current stage write their shuffle output to disk
The shuffle data is available for the next stage to read over the network

This dependency creates a natural boundary – so a new stage begins after every shuffle. The DAG Scheduler uses these boundaries to determine where stages must wait for previous stages to complete.

Common Performance Bottlenecks

Bottleneck Type	Symptom	Solution
Data skew	Few tasks run much longer	Use salting, split hot keys, or AQE skew join
Small files	Too many tasks, high overhead	Coalesce or repartition after read
Large shuffle	High network I/O, spill to disk	Filter early, broadcast small tables, reduce cardinality
Unnecessary stages	Extra Exchange nodes in plan	Combine operations, remove redundant repartitions
Inefficient file formats	Slow reads, no predicate pushdown	Use Parquet or ORC with partitioning
Complex data types	Serialization overhead, large objects	Use simple types, cache in serialized form
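
To see why salting helps with data skew, the sketch below counts how many rows land in each partition before and after a salt suffix is added to the key. It is plain Python under simplifying assumptions: crc32 stands in for the engine's hash function, and rows_per_partition and salt_keys are hypothetical helper names. In PySpark, the salt is typically an extra column, and the aggregation is done in two steps (per salted key, then per original key).

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # crc32 is a stable stand-in for the engine's hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, num_salts):
    # A rotating suffix turns one hot key into num_salts distinct keys.
    return [f"{key}_{i % num_salts}" for i, key in enumerate(keys)]


# One hot key holds 90% of the rows.
keys = ["hot"] * 900 + ["a", "b", "c", "d"] * 25

before = rows_per_partition(keys, 8)
after = rows_per_partition(salt_keys(keys, 8), 8)

print("largest partition before salting:", max(before))  # at least 900
print("largest partition after salting: ", max(after))
```

Without the salt, all 900 "hot" rows hash to a single partition, so one task does most of the work. With the salt, the same rows are spread over several keys and therefore over several partitions.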

Let’s ground this with a small but realistic pattern using the same employees DataFrame. Goal: average salary per department and country, only for employees older than 30.

Naïve approach:

from pyspark.sql.functions import col, when, avg

df_dept_country = df.select("department", "country").distinct()

df_result = (
    df.withColumn(
        "age_group",
        when(col("age") < 30, "junior")
        .when(col("age") < 40, "mid")
        .otherwise("senior")
    )
    # Join on both keys so "country" is not duplicated (and ambiguous) after the join
    .join(df_dept_country, ["department", "country"], "inner")
    .filter(col("age") > 30)  # the age filter is written late, after the join
    .groupBy("department", "country")
    .agg(avg("salary").alias("avg_salary"))
)

This looks harmless, but:

The join with df_dept_country introduces a wide dependency → shuffle #1.
The groupBy("department", "country") introduces another wide dependency → shuffle #2.

So we have two shuffles for what should be a simple aggregation. If you run explain on the df_result, you’ll see two exchange nodes, each marking a shuffle and stage boundary.

Optimized Approach

We can do better by filtering early, broadcasting the small dimension (df_dept_country), and keeping only one global shuffle for aggregation.

from pyspark.sql.functions import broadcast

df_dept_country = df.select("department", "country").distinct()

df_result_optimized = (
    df.filter(col("age") > 30)
        # Join on both keys so "country" is not duplicated (and ambiguous) after the join
        .join(broadcast(df_dept_country), ["department", "country"], "inner")
        .groupBy("department", "country")
        .agg(avg("salary").alias("avg_salary"))
)

What changed:

filter(col("age") > 30) is narrow and runs before any shuffle.
broadcast(df_dept_country) avoids a shuffle for the join.
Only the groupBy("department", "country") causes a single shuffle.

Now explain shows just one Exchange.

Version	Shuffles	Stages	Notes
Naïve	2	~4 (2 map + 2 reduce)	Join shuffle + groupBy shuffle = double overhead
Optimized	1	~2 (1 map + 1 reduce)	Broadcast join avoids shuffle. Only groupBy shuffles

Chapter 3: Reading and Debugging Plans Like a Pro

As explained in Chapter 1, Spark executes transformations based on three levels: the logical plan, the optimized logical plan (Catalyst), and the physical plan. This chapter will expand on this explanation and concentrate on the impact of the logical plan on shuffle and execution performance.

By now, you understand how Spark builds and executes plans. But reading those plans and instantly spotting inefficiencies is the real superpower of a performance-focused data engineer.

Spark’s explain() output isn’t random jargon. It’s a precise log of Spark’s thought process. Once you learn to read it, every optimization becomes obvious.

Three Layers in Spark

As we talked about above, every Spark plan has three key views, printed when you call df.explain(True). Let’s review them now:

Parsed Logical Plan: The raw intent Spark inferred from your code. It may include unresolved column names or expressions.
Analyzed / Optimized Logical Plan: After Spark applies Catalyst optimizations: constant folding, predicate pushdown, column pruning, and plan rearrangements.
Physical Plan: What your executors actually run: joins, shuffles, exchanges, scans, and code-generated operators.

Each stage narrows the gap between what you asked Spark to do and what Spark decides to do.

df_avg = (
    df.filter(col("age") > 30)
      .groupBy("department")
      .agg(avg("salary").alias("avg_salary"))
)

df_avg.explain(True)

1. Parsed Logical Plan

== Parsed Logical Plan ==
'Aggregate ['department], ['department, 'avg('salary) AS avg_salary#8]
+- Filter (age#5L > cast(30 as bigint))
   +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false

How to read this

Bottom → data source (LogicalRDD).
Middle → Filter: Spark hasn’t yet optimized column references.
Top → Aggregate: high-level grouping intent.

At this stage, the plan may include unresolved symbols (like 'department or 'avg('salary)), meaning Spark hasn’t yet validated column existence or data types.

2. Optimized Logical Plan


== Optimized Logical Plan ==
Aggregate [department#3], [department#3, avg(salary#4L) AS avg_salary#8]
+- Project [department#3, salary#4L]
   +- Filter (isnotnull(age#5L) AND (age#5L > 30))
      +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false

Here, Catalyst has done its job:

Column IDs (#3, #4L) are resolved.
Unused columns are pruned – no need to carry them forward.
The plan now accurately reflects Spark’s optimized logical intent.

If you ever wonder whether Spark pruned columns or pushed filters, this is the section to check.

3. Physical Plan

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)], output=[department#3, avg_salary#8])
   +- Exchange hashpartitioning(department#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)], output=[department#3, sum#20, count#21L])
         +- Project [department#3, salary#4L]
            +- Filter (isnotnull(age#5L) AND (age#5L > 30))
               +- Scan ExistingRDD[id#0L,firstname#1,lastname#2,department#3,salary#4L,age#5L,hire_date#6,country#7]

Breakdown

Scan ExistingRDD → Spark reading from the in-memory DataFrame.
Filter → narrow transformation, no shuffle.
HashAggregate → partial aggregation per partition.
Exchange → wide dependency: data is shuffled by department.
Top HashAggregate → final aggregation after shuffle.

This structure – partial agg → shuffle → final agg – is Spark’s default two-phase aggregation pattern.
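
The partial_avg step in the plan outputs a sum and a count rather than an average, and a short plain-Python sketch shows why. The function names partial_avg and merge_avg are hypothetical, and the salary values are borrowed from the sample employees data purely for illustration.

```python
def partial_avg(values):
    """Map side: keep (sum, count) instead of an average."""
    return (sum(values), len(values))


def merge_avg(partials):
    """Reduce side: combine partial (sum, count) pairs into the final average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


partition_a = [80000, 85000, 90000]  # Engineering salaries in one partition
partition_b = [95000]                # Engineering salaries in another

partials = [partial_avg(partition_a), partial_avg(partition_b)]
print(merge_avg(partials))  # 87500.0

# Averaging the per-partition averages would give the wrong answer:
naive = (sum(partition_a) / 3 + sum(partition_b) / 1) / 2
print(naive)  # 90000.0
```

Carrying the sum and count across the shuffle keeps the result exact no matter how unevenly the rows are spread across partitions.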

Recognizing Common Nodes

Node / Operator	Meaning	Optimization Hint
Project	Column selection or computed fields	Combine multiple withColumn() into one select()
Filter	Predicate on rows	Push filters as low as possible in the plan
Join	Combine two DataFrames	Broadcast smaller side if < 10 MB
Aggregate	GroupBy, sum, avg, count	Filter before aggregating to reduce cardinality
Exchange	Shuffle / data redistribution	Minimize by filtering early, using broadcast join
Sort	OrderBy, sort	Avoid global sorts; use within partitions if possible
Window	Windowed analytics (row_number, rank)	Partition on selective keys to reduce shuffle

Repeated invocations of withColumn stack multiple Project nodes, which increases the plan depth. Instead, combine these invocations using select.

Multiple Exchange nodes imply repeated data shuffles. You can eliminate these by broadcasting the data or filtering.

Multiple scans of the same table within a single job suggest that an intermediate result should be cached.

And frequent SortMergeJoin operations imply that Spark is unnecessarily sorting and shuffling the data. You can eliminate these by broadcasting the smaller dataframe or bucketing.

Debugging Strategy: Read Plans from Top to Bottom

Remember: Spark executes plans from bottom up (from data source to final result). But when you're debugging, you read from the top down (from the output schema back to the root cause). This reversal is intentional: you start with what's wrong at the output level, then trace backward through the plan to find where the inefficiency was introduced.

When debugging a slow job:

Start at the top: Identify output schema and major operators (HashAggregate, Join, and so on).
Scroll for Exchanges: Count them. Each = stage boundary. Ask “Why do I need this shuffle?”
Trace backward: See if filters or projections appear below or above joins.
Look for duplication: Same scan twice? Missing cache? Re-derived columns?
Check join strategy: If it’s SortMergeJoin but one table is small, force a broadcast().
Re-run explain after optimization: You should literally see the extra nodes disappear.
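
Counting operators by hand gets tedious on large plans, so here is a small sketch that tallies them from the physical plan text. operator_counts and removed_operators are hypothetical helper names, and the parsing assumes the tree layout that explain(True) prints (operator name first on each line, after the "+-" prefix).

```python
import re
from collections import Counter

# Skip leading tree characters, then capture the operator name.
OPERATOR = re.compile(r"^[\s:+\-]*([A-Za-z]+)")


def operator_counts(physical_plan_text):
    """Count how often each operator appears in a physical plan."""
    counts = Counter()
    for line in physical_plan_text.splitlines():
        match = OPERATOR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def removed_operators(before, after):
    """Operators that appear fewer times after the fix than before it."""
    return before - after


plan = """AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)])
   +- Exchange hashpartitioning(department#3, 200), ENSURE_REQUIREMENTS
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)])
         +- Project [department#3, salary#4L]
            +- Filter (isnotnull(age#5L) AND (age#5L > 30))
               +- Scan ExistingRDD[id#0L,firstname#1,lastname#2]
"""

counts = operator_counts(plan)
print(counts["Exchange"], counts["HashAggregate"])  # 1 2
```

Comparing operator_counts for the plan before and after a change, for example with removed_operators, gives a quick check that an Exchange or a redundant Project really disappeared.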

Catalyst Optimizer in Action

Catalyst applies dozens of rules automatically. Knowing a few helps you interpret what changed:

Optimization Rule	Example Transformation
Predicate Pushdown	Moves filters below joins/scans
Constant Folding	Replaces salary * (1 + 1) with salary * 2
Column Pruning	Drops unused columns early
Combine Filters	Merges consecutive filters into one
Simplify Casts	Removes redundant type casts
Join Reordering	Changes join order for a cheaper plan

Putting it all together: every plan tells a story.

As you progress through the practical scenarios in Chapter 4, read every plan before and after. Your goal isn't memorization – it's intuition.

Chapter 4: Writing Efficient Transformations

Every Spark job tells a story, not in code, but in plans. By now, you've seen how Spark interprets transformations (Chapter 1), how it executes them through stages and tasks (Chapter 2), and how to read plans like a detective (Chapter 3). Now comes the part where you apply that knowledge: writing transformations that yield efficient logical plans.

This chapter is the heart of the handbook. It's where we move from understanding Spark's mind to writing code that speaks its language fluently.

Why Transformations Matter

In PySpark, most performance issues don’t start in clusters or configurations. They start in transformations: the way we chain, filter, rename, or join data. Every transformation reshapes the logical plan, influencing how Spark optimizes, when it shuffles, and whether the final DAG is streamlined or tangled.

A good transformation sequence:

Keeps plans shallow, not nested.
Applies filters early, not after computation.
Reduces data movement, not just data size.
Lets Catalyst and AQE optimize freely, without user-induced constraints.

A bad one can double runtime, and you won't see it in your code, only in your plan.

The Goal of this Chapter

We’ll explore a series of real-world optimization scenarios, drawn from production ETL and analytical pipelines, each showing how a small change in code can completely reshape the logical plan and execution behavior.

Each scenario is practical and short, following a consistent structure. By the end of this chapter, you’ll be able to see optimization opportunities the moment you write code, because you’ll know exactly how they alter the logical plan beneath.

Before You Dive In:

Open a Spark shell or notebook. Load your familiar employees DataFrame. Run every example, and compare the explain("formatted") output before and after the fix. Because in this chapter, performance isn’t about more theory, it’s about seeing the difference in the plan and feeling the difference in execution time.

Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()

If you’ve worked with PySpark DataFrames, you’ve probably had to rename columns, either by calling withColumnRenamed() repeatedly or by using toDF() in one shot.

At first glance, both approaches produce identical results: the columns have the new names you wanted. But beneath the surface, Spark treats them very differently – and that difference shows up directly in your logical plan.

df_renamed = (df.withColumnRenamed("id", "emp_id")
    .withColumnRenamed("firstname", "first_name")
    .withColumnRenamed("lastname", "last_name")
    .withColumnRenamed("department", "dept")
    .withColumnRenamed("salary", "base_salary")
    .withColumnRenamed("age", "age_years")
    .withColumnRenamed("hire_date", "hired_on")
    .withColumnRenamed("country", "country_code")
)

This is simple and readable. But Spark builds the plan step by step, adding one Project node for every rename. Each Project node copies all existing columns, plus the newly renamed one. In large schemas (hundreds of columns), this silently bloats the plan.

Logical Plan Impact:

Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]
└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country]
   └─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hire_date, country]
      └─ Project [emp_id, first_name, last_name, dept, base_salary, age, hire_date, country]
         └─ Project [emp_id, first_name, last_name, dept, salary, age, hire_date, country]
            └─ Project [emp_id, first_name, last_name, department, salary, age, hire_date, country]
               └─ Project [emp_id, first_name, lastname, department, salary, age, hire_date, country]
                  └─ Project [emp_id, firstname, lastname, department, salary, age, hire_date, country]
                     └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Each rename adds a new Project layer, deepening the DAG. Spark now has to materialize intermediate projections before applying the next one. You can see this by running: df.explain(True).

The Better Approach: Rename Once with toDF()

Instead of chaining multiple renames, rename all columns in a single pass:

new_cols = ["emp_id", "first_name", "last_name", "dept",
            "base_salary", "age_years", "hired_on", "country_code"]

df_renamed = df.toDF(*new_cols)

Logical Plan Impact:

Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]
└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now there’s just one Project node, which means one projection over the source data. This gives us a flatter, more efficient plan.

Under the Hood: What Spark Actually Does

Every time you call withColumnRenamed(), Spark rewrites the entire projection list. Catalyst treats the rename as a full column re-selection from the previous node, not as a light-weight alias update. When you chain several renames, Catalyst duplicates internal column metadata for each intermediate step.

By contrast, toDF() rebases the schema in a single step. Catalyst interprets it as a single schema rebinding, so no redundant metadata trees are created.
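
One practical drawback of toDF() is that it needs the full list of column names in order, even when you only want to rename a few. A small wrapper can build that list from a mapping. This is a sketch: rename_columns is a hypothetical helper, not a PySpark function, and it relies only on the DataFrame's columns attribute and toDF() method.

```python
def rename_columns(df, mapping):
    """Rename several columns with one toDF() call.

    Columns that are not in the mapping keep their current names.
    """
    unknown = set(mapping) - set(df.columns)
    if unknown:
        raise KeyError(f"columns not in DataFrame: {sorted(unknown)}")
    return df.toDF(*[mapping.get(name, name) for name in df.columns])
```

For the employees DataFrame, rename_columns(df, {"firstname": "first_name", "lastname": "last_name", "hire_date": "hired_on"}) would produce the renamed columns with a single toDF() call. Failing fast on unknown names is a design choice made here, since a typo in the mapping would otherwise be silently ignored.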

Real-World Timing: Glue Job Benchmark

To see if chained withColumnRenamed calls add real overhead, here's a simple timing test performed on a Glue job using a DataFrame with 1M rows. First using withColumnRenamed:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionRowsRenameTest").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

multiplied_data = [(i, f"firstname_{i}", f"lastname_{i}",
                    employees_data[i % 10][3],  # department
                    employees_data[i % 10][4],  # salary
                    employees_data[i % 10][5],  # age
                    employees_data[i % 10][6],  # hire_date
                    employees_data[i % 10][7])  # country
                   for i in range(1, 1_000_001)]

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df1 = (df
       .withColumnRenamed("firstname", "first_name")
       .withColumnRenamed("lastname", "last_name")
       .withColumnRenamed("department", "dept_name")
       .withColumnRenamed("salary", "annual_salary")
       .withColumnRenamed("age", "emp_age")
       .withColumnRenamed("hire_date", "hired_on")
       .withColumnRenamed("country", "work_country"))

print("withColumnRenamed Count:", df1.count())
print("withColumnRenamed time:", round(time.time() - start, 2), "seconds")

Using toDF:

start = time.time()
df2 = df.toDF("id", "first_name", "last_name", "dept_name", "annual_salary", "emp_age", "hired_on", "work_country")
print("toDF Count:", df2.count())
print("toDF time:", round(time.time() - start, 2), "seconds")

spark.stop()

Approach	Number of Project Nodes	Glue Execution Time (1M rows)	Plan Complexity
Chained withColumnRenamed()	8 nodes	~12 seconds	Deep, nested
Single toDF()	1 node	~8 seconds	Flat, simple

The difference matters more at larger sizes or in complex pipelines, especially on managed runtimes such as AWS Glue, where planning overhead is noticeable, or when tens of millions of rows are involved. Each additional Project adds column resolution, metadata work, and plan depth before execution starts. Catalyst can often collapse adjacent projections during optimization, but it still has to analyze every intermediate Project first. Renaming all columns in one go with toDF() gives Spark a flatter plan from the start: one rename, one projection, and, in this test, faster execution.

Scenario 2: Reusing Expressions

Sometimes Spark jobs run slower, not because of shuffles or joins, but because the same computation is performed repeatedly within the logical plan. Every time you repeat an expression, say, col("salary") * 0.1 in multiple places, Spark treats it as a new derived column, expanding the logical plan and forcing redundant work.

The Problem: Repeated Expressions

Let’s say we’re calculating bonus and total compensation for employees:

df_expr = (
    df.withColumn("bonus", col("salary") * 0.10)
      .withColumn("total_comp", col("salary") + (col("salary") * 0.10))
)

At first glance, it’s simple enough. But Spark’s optimizer doesn’t automatically know that the (col("salary") * 0.10) in the second column is identical to the one computed in the first. Both get evaluated separately in the logical plan.

Simplified Logical Plan:

Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * 0.10) AS bonus,

(salary + (salary * 0.10)) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

While this looks compact, Spark must compute (salary * 0.10) twice: once for bonus, and again inside total_comp. For a large dataset (say 100 million rows), that’s two full column evaluations. The waste compounds when the expression is complex: imagine parsing JSON, applying UDFs, or running date arithmetic multiple times.

The Better Approach: Compute Once, Reuse Everywhere

Compute the expression once, store it as a column, and reference it later:

df_expr = (
    df.withColumn("bonus", col("salary") * 0.10)
      .withColumn("total_comp", col("salary") + col("bonus"))
)

Simplified Logical Plan:

Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * 0.10) AS bonus,

(salary + bonus) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now Spark calculates (salary * 0.10) once, stores it in the bonus column, and reuses that column when computing total_comp. This single change cuts CPU cost and memory usage.

Under the Hood: Why Repetition Hurts

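Spark’s Catalyst optimizer doesn’t reliably factor out repeated expressions across different columns. Each withColumn() creates a new Project node with its own expression tree, and if multiple nodes reuse the same arithmetic or function, Catalyst may re-evaluate them independently. Code generation does include subexpression elimination (spark.sql.subexpressionElimination.enabled, on by default), but it doesn’t cover every case, so making the reuse explicit is the safer habit.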

On small DataFrames, this cost is invisible. On wide, computation-heavy jobs (think feature engineering pipelines), it can add hundreds of milliseconds per task.

Each redundant expression increases:

Catalyst’s internal expression resolution time
The size of generated Java code in WholeStageCodegen
CPU cycles per row, since Spark cannot share intermediate results between columns in the same node

Real-World Benchmark: AWS Glue

We tested this pattern on AWS Glue (Spark 3.3) with 10 million rows and a simulated expensive computation, using a dataset similar to the one in Scenario 1.

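import time
from pyspark.sql.functions import col, exp, log, sqrt

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

# Simulated expensive computation, used by both variants below
expr = sqrt(exp(log(col("salary") + 1)))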

start = time.time()

df_repeated = (
    df.withColumn("metric_a", expr)
      .withColumn("metric_b", expr * 2)
      .withColumn("metric_c", expr / 10)
)

df_repeated.count()
time_repeated = round(time.time() - start, 2)

start = time.time()

df_reused = (
    df.withColumn("metric", expr)
      .withColumn("metric_a", col("metric"))
      .withColumn("metric_b", col("metric") * 2)
      .withColumn("metric_c", col("metric") / 10)
)

df_reused.count()

print("Repeated expr time:", time_repeated, "seconds")
print("Reused expr time:", round(time.time() - start, 2), "seconds")

spark.stop()

Approach	Project Nodes	Execution Time (10M rows)	Expression Evaluations
Repeated expression	Multiple (nested)	~18 seconds	3x per row
Compute once, reuse	Single	~11 seconds	1x per row

The performance gap widens further with genuinely expensive expressions (like regex extraction, JSON parsing, or UDFs).
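To make the "3x per row" versus "1x per row" column concrete, here is a plain-Python model of the two styles. It is only an analogy, assuming no optimizer steps in to share work: expensive and counter are hypothetical names, and the function is a stand-in for the benchmark expression rather than Spark code.

```python
import math

# Mutable counter so we can see how often the expression is evaluated.
counter = {"calls": 0}


def expensive(salary):
    """Stand-in for the benchmark expression sqrt(exp(log(salary + 1)))."""
    counter["calls"] += 1
    return math.sqrt(math.exp(math.log(salary + 1)))


salaries = [80000, 85000, 60000, 90000]

# Repeated: the expression is written out again for every derived column.
counter["calls"] = 0
repeated = [(expensive(s), expensive(s) * 2, expensive(s) / 10) for s in salaries]
repeated_calls = counter["calls"]

# Reused: evaluate once per row, then derive the other columns from the result.
counter["calls"] = 0
reused = []
for s in salaries:
    metric = expensive(s)
    reused.append((metric, metric * 2, metric / 10))
reused_calls = counter["calls"]

print(repeated_calls, reused_calls)
```

Both variants produce the same values; only the number of evaluations differs, which is the effect the benchmark table attributes to the reused column.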

Physical Plan Implication

In the physical plan, repeated expressions expand into multiple Java blocks within the same WholeStageCodegen node:

*(1) Project [sqrt(exp(log(salary + 1))) AS metric_a,

(sqrt(exp(log(salary + 1))) * 2) AS metric_b,

(sqrt(exp(log(salary + 1))) / 10) AS metric_c, ...]

Spark literally embeds three copies of the same logic.

Each is JIT-compiled separately, leading to:

Larger generated Java classes
Higher CPU utilization
Longer code-generation time before tasks even start

When reusing a column, Spark generates one expression and references it by name, dramatically shrinking the codegen footprint. If you have complex transformations (nested when, UDFs, regex extractions, and so on), compute them once and reuse them with col("alias"). For even heavier expressions that appear across multiple pipelines, consider persisting the intermediate DataFrame:

df_features = df.withColumn("complex_feature", complex_logic)

df_features.cache()

That cache can save multiple recomputations across downstream steps.

Scenario 3: Batch Column Ops

Most PySpark pipelines don’t die because of one big, obvious mistake. They slow down from a thousand tiny cuts: one extra withColumn() here, another there, until the logical plan turns into a tall stack of projections.

On its own, withColumn() is fine. The problem is how we use it:

10–30 chained calls in a row
Re-deriving similar expressions
Spreading logic across many tiny steps

This scenario shows how batching column operations into a single select() produces a flatter, cleaner logical plan that scales better and is easier to reason about.

The Problem: Chaining withColumn() Forever

from pyspark.sql.functions import col, concat_ws, when, lit, upper

df_transformed = (
    df.withColumn("full_name", concat_ws(" ", col("firstname"), col("lastname")))
      .withColumn("is_senior", when(col("age") >= 35, lit(1)).otherwise(lit(0)))
      .withColumn("salary_k", col("salary") / 1000.0)
      .withColumn("experience_band",
                  when(col("age") < 30, "junior")
                  .when((col("age") >= 30) & (col("age") < 40), "mid")
                  .otherwise("senior"))
      .withColumn("country_upper", upper(col("country")))
)

It reads nicely, it runs, and everyone moves on. But under the hood, Spark builds this as multiple Project nodes, one per withColumn() call.

Simplified Logical Plan (Chained, Conceptual):

Project [..., country_upper]

└─ Project [..., experience_band]

   └─ Project [..., salary_k]

      └─ Project [..., is_senior]

         └─ Project [..., full_name]

            └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Each layer re-selects all existing columns, adds one more derived column, and deepens the plan.

The Better Approach: Batch with select()

Instead of incrementally patching the schema, build it once.

df_transformed = df.select(
    col("id"),
    col("firstname"),
    col("lastname"),
    col("department"),
    col("salary"),
    col("age"),
    col("hire_date"),
    col("country"),
    concat_ws(" ", col("firstname"), col("lastname")).alias("full_name"),
    when(col("age") >= 35, lit(1)).otherwise(lit(0)).alias("is_senior"),
    (col("salary") / 1000.0).alias("salary_k"),
    when(col("age") < 30, "junior")
        .when((col("age") >= 30) & (col("age") < 40), "mid")
        .otherwise("senior").alias("experience_band"),
    upper(col("country")).alias("country_upper")  # upper() comes from pyspark.sql.functions
)

Simplified Logical Plan (Batched):

Project [id, firstname, lastname, department, salary, age, hire_date, country,

         full_name, is_senior, salary_k, experience_band, country_upper]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

One Project. All derived columns are defined together. Flatter DAG. Cleaner plan.

Under the Hood: Why This Matters

Each withColumn() is syntactic sugar for: “Take the previous plan, and create a new Project on top of it.” So 10 withColumn() calls = 10 projections wrapped on top of each other.

Catalyst can sometimes collapse adjacent Project nodes, but:

Not always (especially when aliases shadow each other).
Not when expressions become complex or interdependent.
Not when UDFs or analysis barriers appear.

Batching with select():

Gives Catalyst a single, complete view of all expressions.
Enables more aggressive optimizations (constant folding, expression reuse, pruning).
Keeps expression trees shallower and codegen output smaller.

Think of it as the difference between editing a sentence 10 times in a row and writing the final sentence once, cleanly.
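Listing every existing column by hand gets tedious on wide tables. One way to keep the single-projection shape is a small helper like the sketch below. select_with_derived is a hypothetical name, and the function relies only on df.columns, df.select(), and Column.alias(), so it is shown without a Spark session.

```python
def select_with_derived(df, derived):
    """Add all derived columns in a single select() call.

    `derived` maps new column names to Column expressions. Existing columns
    are kept in order; if a derived column reuses an existing name, the old
    column is dropped and the new one is appended at the end.
    """
    kept = [name for name in df.columns if name not in derived]
    new_cols = [expr.alias(name) for name, expr in derived.items()]
    return df.select(*kept, *new_cols)
```

A call might look like select_with_derived(df, {"salary_k": col("salary") / 1000.0, "country_upper": upper(col("country"))}). Note that the derived expressions cannot reference each other, because they are all resolved against the input DataFrame in the same projection. On PySpark 3.3 and later, DataFrame.withColumns() accepts a similar dictionary and is worth checking first.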

Real-World Example: Using the Employees DF at Scale

Chained version (many withColumn()):

from pyspark.sql.functions import col, concat_ws, when, lit, upper
import time

start = time.time()
df_chain = (
    df.withColumn("full_name", concat_ws(" ", col("firstname"), col("lastname")))
      .withColumn("is_senior", when(col("age") >= 35, 1).otherwise(0))
      .withColumn("salary_k", col("salary") / 1000.0)
      .withColumn("high_earner", when(col("salary") >= 90000, 1).otherwise(0))
      .withColumn("experience_band",
                  when(col("age") < 30, "junior")
                  .when((col("age") >= 30) & (col("age") < 40), "mid")
                  .otherwise("senior"))
      .withColumn("country_upper", upper(col("country")))
)

df_chain.count()
time_chain = round(time.time() - start, 2)

Batched version (single select()):

start = time.time()
df_batch = df.select(
    "id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country",
    concat_ws(" ", col("firstname"), col("lastname")).alias("full_name"),
    when(col("age") >= 35, 1).otherwise(0).alias("is_senior"),
    (col("salary") / 1000.0).alias("salary_k"),
    when(col("salary") >= 90000, 1).otherwise(0).alias("high_earner"),
    when(col("age") < 30, "junior")
        .when((col("age") >= 30) & (col("age") < 40), "mid")
        .otherwise("senior").alias("experience_band"),
    upper(col("country")).alias("country_upper")
)

df_batch.count()
time_batch = round(time.time() - start, 2)

Approach	Logical Shape	Glue Execution Time (1M rows)	Notes
Chained withColumn()	6 nested Projects	~14 seconds	Deep plan, more Catalyst work
Single select()	1 Project	~9 seconds	Flat planning, cleaner DAG

The distinction is most evident when there are more derived columns, more complex expressions (UDFs, window functions), or when executing on managed runtimes such as AWS Glue.

In the chained cases, there are more Project nodes, code generation is fragmented, and expression evaluation is less amenable to global optimization.

In the batched cases, Spark generates a single Project node, more work is consolidated into a single WholeStageCodegen pipeline, code generation is reduced, the JVM is less stressed, and the plan is flatter and more amenable to optimization. This is not only cleaner, but it’s also faster, more reliable, and friendlier to Spark’s optimizer.

Scenario 4: Early Filter vs Late Filter

Many pipelines apply transformations first, adding columns, joining datasets, or calculating derived metrics, before filtering records. That order looks harmless in code but can double or triple the workload at execution.

Problem: Late Filtering

df_late = (
    df.withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
      .filter(col("age") > 35)
)

This means Spark first computes all columns for every employee, then discards most rows.

Simplified Logical Plan:

Filter (age > 35)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country,

            (salary * 0.1) AS bonus,

            (salary / 1000) AS salary_k]

   └─ LogicalRDD [...]

Catalyst can sometimes reorder this automatically, but when it can't (due to UDFs or complex logic), you're doing unnecessary work on data that's thrown away.

Better Approach: Early Filtering

df_early = (
    df.filter(col("age") > 35)
      .withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
)

Simplified Logical Plan:

Project [id, firstname, lastname, department, salary, age, hire_date, country,

         (salary * 0.1) AS bonus,

         (salary / 1000) AS salary_k]

└─ Filter (age > 35)

   └─ LogicalRDD [...]

Now Spark prunes the dataset first, then applies transformations. The result: smaller intermediate data, less codegen, shorter logical plan, shorter DAG, and smaller shuffle footprint.

Real-World Benchmark: AWS Glue

Late Filtering:

df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start_late = time.time()

df_late = (
    df.withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
      .filter(col("age") > 35)   
)

df_late.count()
time_late = round(time.time() - start_late, 2)

Early Filtering:

start_early = time.time()

df_early = (
    df.filter(col("age") > 35)    
      .withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
)

df_early.count()
time_early = round(time.time() - start_early, 2)

print("Late Filter Time:", time_late, "seconds")
print("Early Filter Time:", time_early, "seconds")

spark.stop()

Approach	Rows Reaching the Projection	Execution Time (approx)	Notes
Late filter	1,000,000 (all rows)	~14 seconds	Computes bonus and salary_k for all rows, then filters
Early filter	~100,000 (rows with age > 35)	~9 seconds	Filters first, computes only for age > 35

The early filter approach processes significantly less data before the projection, leading to faster execution and less memory pressure.

Always filter as early as possible, before joins, aggregations, expensive transformations (such as UDFs or window functions), and even during file reads via Parquet/ORC pushdown, since filtering at the source touches fewer partitions and leads to faster jobs.
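Before reordering a pipeline, it can help to estimate how selective a filter actually is. The sketch below does that in plain Python for the synthetic benchmark data: BASE_AGES is copied from the ten sample employees in this article, and rows_passing is a hypothetical helper that mirrors how multiplied_data assigns ages.

```python
# Ages of the ten base employees used to build the synthetic benchmark rows.
BASE_AGES = [28, 32, 25, 35, 29, 27, 40, 33, 26, 31]
TOTAL_ROWS = 1_000_000


def rows_passing(predicate, base_ages=BASE_AGES, total_rows=TOTAL_ROWS):
    """Count generated rows whose age satisfies `predicate`.

    Row i (starting at 1) takes its age from base_ages[i % len(base_ages)],
    which is how multiplied_data is built in the benchmark code.
    """
    n = len(base_ages)
    return sum(1 for i in range(1, total_rows + 1) if predicate(base_ages[i % n]))


survivors = rows_passing(lambda age: age > 35)
print(survivors, survivors / TOTAL_ROWS)
```

The more rows a filter removes, the more there is to gain from applying it before the expensive steps.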

Scenario 5: Column Pruning

When working with Spark DataFrames, convenience often wins over correctness, and nothing feels more convenient than select("*"). It’s quick, flexible, and perfect for exploration.

But in production pipelines, that little star silently costs CPU, memory, network bandwidth, and runtime efficiency. Every time you write select("*"), Spark expands it into every column from your schema, even if you’re using just one or two later.

Those extra attributes flow through every stage of the plan, from filters and joins to aggregations and shuffles. The result: inflated logical plans, bigger shuffle files, and slower queries.

The Problem: “The Lazy Star”

df_star = (
    df.select("*")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

At first glance, this seems harmless. But the query needs only three columns (department for the filter, country and salary for the aggregation), while Spark carries all eight (id, firstname, lastname, department, salary, age, hire_date, country) through every transformation.

Simplified Logical Plan:

Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Every node in this tree carries all columns. Catalyst can’t prune them because you explicitly asked for "*". The excess attributes are serialized, shuffled, and deserialized across the cluster, even though they serve no purpose in the final result.

The Fix: Select Only What You Need

Be deliberate with your projections. Select the minimal schema required for the task.

df_pruned = (
    df.select("department", "salary", "country")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

Simplified Logical Plan:

Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [department, salary, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now Spark reads and processes only the three required columns: department, salary, and country. The plan is narrower, the DAG simpler, and execution faster.

Real-World Benchmark: AWS Glue

Wide Projection:

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df_star = (
    df.select("*")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

df_star.count()
time_star = round(time.time() - start, 2)

Pruned Projection:

start = time.time()

df_pruned = (
    df.select("department", "salary", "country")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

df_pruned.count()
time_pruned = round(time.time() - start, 2)

print(f"select('*') time: {time_star}s")
print(f"pruned columns time: {time_pruned}s")

spark.stop()

Approach	Columns Processed	Execution Time (1M rows)	Observation
select("*")	8	~26.54 s	Spark carries all columns through the plan.
Pruned projection	3	~2.21 s	Only needed columns processed → faster and lighter.

Under the Hood: How Catalyst Handles Columns

When you call select("*"), Catalyst resolves every attribute into the logical plan. Each subsequent transformation inherits that full attribute list, increasing plan depth and overhead.

Catalyst includes a rule called ColumnPruning, which removes unused attributes, but it only works when Spark can see which columns are necessary. If you use "*" or dynamically reference df.columns, Catalyst loses visibility.

Works:

df \
    .select("salary", "country") \
    .groupBy("country") \
    .agg(avg("salary"))

Doesn’t Work:

cols = df.columns

df.select(cols) \
  .groupBy("country") \
  .agg(avg("salary"))

In the second case, Catalyst can’t prune anything because cols might include everything.
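When the column list has to be built programmatically, one option is to derive it from what the query actually uses rather than from df.columns. The helper below is a minimal sketch with hypothetical names (needed_columns, filter_cols, group_cols, agg_cols); it only manipulates lists of names.

```python
def needed_columns(*groups):
    """Merge groups of column names into one ordered, de-duplicated list."""
    seen = {}
    for group in groups:
        for name in group:
            seen.setdefault(name, None)  # dicts keep insertion order
    return list(seen)


filter_cols = ["department"]   # used by .filter(...)
group_cols = ["country"]       # used by .groupBy(...)
agg_cols = ["salary"]          # used by .agg(...)

projection = needed_columns(filter_cols, group_cols, agg_cols)
print(projection)
```

The result can then be passed to an explicit projection, for example df.select(*projection), so the narrow schema is stated in the code instead of being left to the optimizer.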

Physical Plan Differences

Wide Projection (select("*")):

*(1) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(1) Filter (department = Engineering)

      +- *(1) Scan parquet ...

Pruned Projection:

*(1) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(1) Project [department, salary, country]

   +- *(1) Filter (department = Engineering)

      +- *(1) Scan parquet [department, salary, country]

Notice the last line: Spark physically scans only the three referenced columns from Parquet. That’s genuine I/O reduction, not just logical simplification. Using select("*") increases shuffle file sizes, memory usage during serialization, Catalyst planning time, and I/O and network traffic. The fix requires nothing more than naming the columns you need.

In managed environments like AWS Glue or Databricks, this simple practice can greatly reduce ETL time, particularly for Parquet or Delta files, because explicit projection enables effective column pruning. It’s one of the easiest and highest-impact Spark optimization techniques, and it starts with typing fewer asterisks.

Scenario 6: Filter Pushdown vs Full Scan

When a Spark job feels slow right from the start, even before joins or aggregations, the culprit is often hidden at the data-read layer. Spark spends seconds (or minutes) scanning every record, even though most rows are useless for the query.

That’s where filter pushdown comes in. It tells Spark to push your filter logic down to the file reader so that Parquet / ORC / Delta formats return only the relevant rows from disk. Done right, this optimization can reduce scan size significantly. Done wrong, Spark performs a full scan, reading everything before filtering in memory.

The Problem: Late Filters and Full Scans

employees_df = spark.read.parquet("s3://data/employee_data/")

df_full = (
    employees_df
        .select("*")  # reads all columns
        .filter(col("country") == "Canada")
)

Looks fine, right? But Spark can’t push this filter to the Parquet reader because it’s applied after the select("*") projection step. Catalyst sees the filter as operating on a projected DataFrame, not the raw scan, so the pushdown boundary is lost.

Simplified Logical Plan:

Filter (country = Canada)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

   └─ Scan parquet employee_data [id, firstname, lastname, department, salary, age, hire_date, country]

Every record from every Parquet file is read into memory before the filter executes. In large tables, this means scanning terabytes when you only need megabytes.

The Fix: Filter Early and Project Light

Move filters as close as possible to the data source and limit columns before Spark reads them:

df_pushdown = (
    spark.read.parquet("s3://data/employee_data/")
        .select("id", "firstname", "department", "salary", "country")
        .filter(col("country") == "Canada")
)

Simplified Logical Plan:

Project [id, firstname, department, salary, country]

└─ Scan parquet employee_data [id, firstname, department, salary, country]

PushedFilters: [country = Canada]

Notice the difference: PushedFilters appears in the plan. That means the Parquet reader handles the predicate, returning only matching blocks and rows.

Under the Hood: What Actually Happens

When Spark performs filter pushdown, it leverages the Parquet metadata (min/max statistics and row-group indexes) stored in file footers.

Spark inspects file-level metadata for the predicate column (country).
It skips any row group whose min/max statistics show that no row can match (Canada falls outside the group’s value range).
It reads only the necessary row groups and columns from disk.
Those records enter the DAG directly – no in-memory filtering required.

This pruning happens inside the Parquet reader at scan time, before rows are materialized for the rest of the plan, reducing both I/O and network transfer.
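The steps above can be modeled in a few lines of plain Python. This is a simplified illustration of min/max row-group skipping, not the Parquet reader's actual implementation: RowGroup and scan_with_pushdown are hypothetical names, and each row group holds just the values of the country column.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: str   # smallest value of the column in this group
    max_value: str   # largest value of the column in this group
    rows: list       # the column values themselves


def scan_with_pushdown(row_groups, target):
    """Return matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Statistics prove that no row can match: skip without reading data.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        # min/max only narrows the search; rows are still checked one by one.
        matches.extend(row for row in group.rows if row == target)
    return matches, groups_read


row_groups = [
    RowGroup("Canada", "Canada", ["Canada", "Canada"]),
    RowGroup("UK", "USA", ["UK", "USA", "UK"]),
    RowGroup("Canada", "USA", ["Canada", "UK", "USA"]),
]
matches, groups_read = scan_with_pushdown(row_groups, "Canada")
print(len(matches), groups_read)
```

The model also shows why data layout matters: when files are sorted or clustered by the filter column, the ranges are narrow and more row groups can be skipped.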

Real-World Benchmark: AWS Glue

import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FilterPushdownBenchmark").getOrCreate()

start = time.time()
df_full = (
    spark.read.parquet("s3://data/employee_data/")
        .select("*")                         # all columns
        .filter(col("country") == "Canada")  
)
df_full.count()
time_full = round(time.time() - start, 2)

start = time.time()
df_pushdown = (
    spark.read.parquet("s3://data/employee_data/")
        .select("id", "firstname", "department", "salary", "country")
        .filter(col("country") == "Canada")  
)
df_pushdown.count()
time_push = round(time.time() - start, 2)

print("Full Scan Time:", time_full, "sec")
print("Filter Pushdown Time:", time_push, "sec")

spark.stop()

Approach	Execution Time (1 M rows)	Observation
Full Scan	14.2 s	All files scanned and filtered in memory.
Filter Pushdown	3.8 s	Only relevant row groups and columns read.

Physical Plan Comparison

Full Scan:

*(1) Filter (country = Canada)

+- *(1) ColumnarToRow

   +- *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

      Batched: true, DataFilters: [], PushedFilters: []

Pushdown:

*(1) ColumnarToRow

+- *(1) FileScan parquet [id, firstname, department, salary, country]

   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]

The difference is clear: PushedFilters confirms that Spark applied predicate pushdown, skipping unnecessary row groups at the scan stage.

Reflection: Why Pushdown Matters

Pushdown isn’t a micro-optimization. It’s often the single biggest performance lever in Spark ETL. In data lakes with hundreds of files, full scans waste hours and inflate AWS S3 I/O costs. By filtering and projecting early, Spark prunes both rows and columns before the rest of the plan runs.

Apply filters as early as possible in the read pipeline, combine filter pushdown with column pruning, verify PushedFilters in explain("formatted"), avoid UDFs and select("*") at read time, and let pushdown turn “read everything and discard most” into “read only what you need.”
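Checking PushedFilters by eye works for one job, but it is easy to automate for tests or CI. The sketch below parses plan text with a regular expression. pushed_filters and has_pushdown are hypothetical helpers, and the pattern assumes the "PushedFilters: [...]" layout shown in this article, without nested brackets.

```python
import re


def pushed_filters(plan_text):
    """Return the contents of every 'PushedFilters: [...]' entry in a plan."""
    return [m.strip() for m in re.findall(r"PushedFilters: \[(.*?)\]", plan_text)]


def has_pushdown(plan_text):
    """True when at least one scan reports a non-empty PushedFilters list."""
    return any(pushed_filters(plan_text))


full_scan_plan = """
*(1) Filter (country = Canada)
+- *(1) FileScan parquet [id, firstname, lastname]
      Batched: true, DataFilters: [], PushedFilters: []
"""

pushdown_plan = """
*(1) ColumnarToRow
+- *(1) FileScan parquet [id, firstname, country]
   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]
"""

print(has_pushdown(full_scan_plan), has_pushdown(pushdown_plan))
```

One way to obtain the plan text, assuming explain() prints to standard output in your environment, is to capture it with contextlib.redirect_stdout and pass the resulting string to has_pushdown.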

Scenario 7: De-duplicate Right

The Problem: “All-Row Deduplication” and Why It Hurts

When we use this:

df.dropDuplicates()

Spark removes identical rows across all columns. It sounds simple, but this operation forces Spark to treat every column as part of the deduplication key.

Internally, it means:

Every attribute is serialized and hashed.
Every unique combination of all columns is shuffled across the cluster to ensure global uniqueness.
Even small changes in a non-essential field (like hire_date) cause new keys and destroy aggregation locality.

In wide tables, this is one of the heaviest shuffle operations Spark can perform.

Simplified Logical Plan:

Aggregate [id, firstname, lastname, department, salary, age, hire_date, country], [first(id) AS id, ...]

└─ Exchange hashpartitioning(id, firstname, lastname, department, salary, age, hire_date, country, 200)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Notice the Exchange: that’s a full shuffle across all columns. Spark must send every record to the partition responsible for its unique combination of all fields. This is slow, memory-intensive, and scales poorly as columns grow.

The Better Approach: Key-Based Deduplication

In most real datasets, duplicates are determined by a primary or business key, not all attributes. For example, if id uniquely identifies an employee, we only need to keep one record per id.

df.dropDuplicates(["id"])

Now Spark deduplicates based only on the id column.

Aggregate [id], [first(id) AS id, first(firstname) AS firstname, ...]

└─ Exchange hashpartitioning(id, 200)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

The shuffle is dramatically narrower. Instead of hashing across all columns, Spark redistributes data only by id. Fewer bytes, smaller shuffle files, faster reduce stage.
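The two calls are not interchangeable, so it is worth being clear about what each one keeps. The plain-Python model below illustrates the semantics only. dedup_full_row and dedup_by_key are hypothetical names, and keeping the first row seen is an assumption of this model: without an explicit ordering, Spark does not guarantee which row per key survives.

```python
rows = [
    (1, "John", "2020-01-15"),
    (1, "John", "2020-01-16"),  # same id, different hire_date
    (2, "Jane", "2019-03-20"),
    (2, "Jane", "2019-03-20"),  # exact duplicate
]


def dedup_full_row(rows):
    """Model of dropDuplicates(): rows are equal only if every field matches."""
    return list(dict.fromkeys(rows))


def dedup_by_key(rows, key_index=0):
    """Model of dropDuplicates(["id"]): keep one row per key value."""
    kept = {}
    for row in rows:
        kept.setdefault(row[key_index], row)  # first row seen wins in this model
    return list(kept.values())


full = dedup_full_row(rows)
by_key = dedup_by_key(rows)
print(len(full), len(by_key))
```

Full-row deduplication keeps both versions of employee 1 because hire_date differs, while key-based deduplication keeps exactly one row per id. Switch to the key-based form only when the key really defines a duplicate for your data.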

Real-World Benchmark: AWS Glue

import time
from pyspark.sql.functions import exp, log, sqrt, col, concat_ws, when, upper, avg
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionRowsRenameTest").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

multiplied_data = [(i, f"firstname_{i}", f"lastname_{i}",
                    employees_data[i % 10][3],   # department
                    employees_data[i % 10][4],   # salary
                    employees_data[i % 10][5],   # age
                    employees_data[i % 10][6],   # hire_date
                    employees_data[i % 10][7]    # country
                    )
                   for i in range(1, 1_000_001)]

df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start = time.time()
dedup_full = df.dropDuplicates()
dedup_full.count()
time_full = round(time.time() - start, 2)

start = time.time()
dedup_key = df.dropDuplicates(["id"])
dedup_key.count()
time_key = round(time.time() - start, 2)

print(f"Full-row dedup time: {time_full}s")
print(f"Key-based dedup time: {time_key}s")

spark.stop()

Approach	Execution Time (1M rows)	Observation
Full-Row Dedup	27.6 s	Shuffle across all attributes, large hash table
Key-Based Dedup (["id"])	2.06 s	~13× faster, minimal shuffle width

Under the Hood: What Catalyst Does

When you specify a key list, Catalyst rewrites dropDuplicates(keys) into a partial + final aggregate plan, just like a groupBy:

HashAggregate(keys=[id], functions=[first(...)])

This allows Spark to:

Perform map-side partial aggregation on each partition (before shuffle).
Exchange only the grouping key (id).
Perform a final aggregation on the reduced data.

The all-column version can’t do that optimization: because every column participates in uniqueness, Spark must ensure complete data redistribution.

Best Practices for Deduplication

Practice	Why It Matters
Always deduplicate by key columns	Reduces shuffle width and data movement
Use deterministic keys (id, email, ssn)	Ensures predictable grouping
Avoid dropDuplicates() without arguments	Forces global shuffle across all attributes
Combine with column pruning	Keep only necessary fields before deduplication
For “latest record” logic, use window functions	Allows targeted deduplication (row_number() with order)
Cache intermediate datasets if reused	Avoids recomputation of expensive dedup stages
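The "latest record" practice in the table deserves a concrete picture. The sketch below models the semantics in plain Python: for each key, keep the row with the greatest ordering value. latest_per_key and the updated_on field are hypothetical. In Spark, the same idea is usually expressed with row_number() over a window partitioned by the key and ordered by the timestamp descending, then filtering to row number 1.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each key, the row with the greatest `order_by` value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


rows = [
    {"id": 1, "salary": 80000, "updated_on": "2020-01-15"},
    {"id": 1, "salary": 82000, "updated_on": "2021-03-01"},
    {"id": 2, "salary": 85000, "updated_on": "2019-03-20"},
]

result = latest_per_key(rows, key="id", order_by="updated_on")
print(result)
```

ISO-formatted date strings compare correctly as text, which is why the model can use a plain > comparison. Unlike dropDuplicates(["id"]), this approach makes the surviving row deterministic.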

Combining Deduplication & Aggregation

You can merge deduplication with aggregation for even better results:

df_dedup_agg = (
    df.dropDuplicates(["id"])
        .groupBy("department")
        .agg(avg("salary").alias("avg_salary"))
)

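Spark plans both steps in a single job, with no intermediate materialization. Note that the shuffle on id is reused only when the aggregation groups by the same key; grouping by a different column such as department adds its own Exchange, so check explain() to see how many shuffles you actually get. A simplified plan (omitting any additional Exchange) will show: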

HashAggregate(keys=[department], functions=[avg(salary)])

└─ HashAggregate(keys=[id], functions=[first(...), first(department)])

   └─ Exchange hashpartitioning(id, 200)

Prefer dropDuplicates(["key_col"]) over dropDuplicates() to deduplicate by business or surrogate keys rather than the entire schema. Combine deduplication with projection to reduce I/O, and remember that one narrow shuffle is always better than a wide shuffle. Deduplication isn’t just cleanup – it’s an optimization strategy. Choose your keys wisely, and Spark will reward you with faster jobs and lighter DAGs.

Scenario 8: Count Smarter

In production, one of the most common performance pitfalls is the simplest line of code:

if df.count() > 0:

At first glance, this seems harmless. You just want to know whether the DataFrame has any data before writing, joining, or aggregating. But in Spark, count() is not a metadata lookup; it’s a full cluster-wide job.

What Really Happens with count()

When you call df.count(), Spark executes a complete action:

It scans every partition.
Deserializes every row.
Counts records locally on each executor.
Reduces the counts to the driver.

That means your “empty check” runs a full distributed computation, even when the dataset has billions of rows or lives in S3.

df.count()

Simplified Logical Plan:

*(1) HashAggregate(keys=[], functions=[count(1)])

+- *(1) ColumnarToRow

   +- *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

Every record is read, aggregated, and returned just to produce a single integer.

Now imagine this runs in the middle of your Glue job, before a write, before a filter, or inside a loop. You’ve just added a full-table scan to your DAG for no reason.

The Smarter Way: limit(1) or head(1)

If all you need to know is whether data exists, you don’t need to count every record. You just need to know if there’s at least one.

Two efficient alternatives:

df.head(1)
# or
df.limit(1).collect()

Both execute a lazy scan that stops as soon as one record is found.

Simplified Logical Plan:

CollectLimit 1

└─ *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

No global aggregation.
No shuffle.
No full scan.
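To make the contrast concrete, here is a plain-Python sketch (no Spark involved) that models a DataFrame as a list of partitions and counts how many rows each strategy has to touch. The function names and the toy data are illustrative assumptions, not Spark APIs.

```python
# Toy model: a "DataFrame" is a list of partitions, each a list of rows.
# This is NOT Spark code; it only mimics the two access patterns.

def distributed_count(partitions):
    """Mimic count(): every partition counts locally, then the counts are reduced."""
    touched = 0
    local_counts = []
    for partition in partitions:
        local = 0
        for _row in partition:
            local += 1
            touched += 1
        local_counts.append(local)
    return sum(local_counts), touched


def has_any_row(partitions):
    """Mimic an existence check: stop at the first row found."""
    touched = 0
    for partition in partitions:
        for _row in partition:
            touched += 1
            return True, touched
    return False, touched


partitions = [list(range(1000)) for _ in range(8)]
total, rows_read_by_count = distributed_count(partitions)
found, rows_read_by_check = has_any_row(partitions)
print(total, rows_read_by_count)   # 8000 8000
print(found, rows_read_by_check)   # True 1
```

The answer to "is there any data?" is the same either way. Only the amount of work differs.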

Real-World Benchmark: AWS Glue

import time
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("CountVsLimitTest").getOrCreate()

# Base dataset (10 sample employees)
employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

# Create 1 million rows
multiplied_data = [
    (i, f"firstname_{i}", f"lastname_{i}",
     employees_data[i % 10][3],
     employees_data[i % 10][4],
     employees_data[i % 10][5],
     employees_data[i % 10][6],
     employees_data[i % 10][7])
    for i in range(1, 1_000_001)
]

# Create DataFrame
df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start = time.time()
df.count()
count_time = round(time.time() - start, 2)

start = time.time()
df.limit(1).collect()
limit_time = round(time.time() - start, 2)

start = time.time()
df.head(1)
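head_time = round(time.time() - start, 2)

# Report all three timings side by side
print(f"count(): {count_time}s | limit(1): {limit_time}s | head(1): {head_time}s")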

spark.stop()

Method	Plan Type	Execution Time (1M rows)	Notes
count()	HashAggregate + Exchange	26.33 s	Full scan + aggregation
limit(1)	CollectLimit	0.62 s	Stops after first record
head(1)	CollectLimit	0.42 s	Fastest, single partition

For the same logical check, count() took roughly 40 to 60 times longer than either alternative.

So why does this difference exist? Spark’s execution model treats every action as a trigger for computation. count() is an aggregation action that needs every partition to report in, while limit(1) and head(1) are early-stopping actions that end the job once the first record has been fetched. Spark plans a CollectLimit node instead of a HashAggregate, scans only as many partitions as it needs, and stops as soon as one row is available.

Plan comparison:

Action	Simplified Plan	Type	Behavior
count()	HashAggregate → Exchange → FileScan	Global	Full scan, wide dependency
limit(1)	CollectLimit → FileScan	Local	Early stop, narrow dependency
head(1)	CollectLimit → FileScan	Local	Early stop, single task

Avoid using count() to check emptiness since it triggers a full scan. Use limit(1) or head(1) for lightweight existence checks, and reserve count() for cases where the total is actually required, because Spark will process all data unless explicitly told to stop.

Other alternatives:

Method	Behavior
df.take(1)	Similar to head(1); returns a list of Rows
df.first()	Returns the first Row, or None if the DataFrame is empty
df.isEmpty()	Returns True if the DataFrame has no rows (PySpark 3.3+)
df.rdd.isEmpty()	RDD-level check

Scenario 9: Window Wisely

Window functions (rank(), dense_rank(), lag(), avg() with over(), and so on) are essential in analytics. They let you calculate running totals, rankings, or time-based metrics.

But in Spark, they’re not cheap, because they rely on shuffles and ordering.

Each window operation:

Requires all rows for the same partition key to be co-located on the same node.
Requires sorting those rows by the orderBy() clause within each partition.

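If you omit partitionBy(), Spark treats the entire dataset as one window partition and moves every row into a single task for one big sort. A partition key with very few distinct values has a similar effect, because a handful of tasks end up doing all the work.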

Global Window: The Wrong Way

Let’s compute employee rankings by salary without partitioning:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window_spec = Window.orderBy(col("salary").desc())

df_ranked = df.withColumn("salary_rank", rank().over(window_spec))

Simplified Logical Plan:

Window [rank() windowspecdefinition(orderBy=[salary DESC]) AS salary_rank]

└─ Sort [salary DESC], false

   └─ Exchange SinglePartition

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Spark must move every row into a single partition and sort it there. With no partition key to split on, one task ends up ordering the entire dataset while the rest of the cluster waits, and all data must move through the network to reach it.

Partition by a Selective Key: The Better Way

Most analytics don’t need a global ranking. You likely want rankings within a department or group, not across the entire company.

window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

df_ranked = df.withColumn("salary_rank", rank().over(window_spec))

Now Spark builds separate windows per department. Each partition’s data stays local, dramatically reducing shuffle size.

Simplified Logical Plan:

Window [rank() windowspecdefinition(partitionBy=[department], orderBy=[salary DESC]) AS salary_rank]

└─ Sort [department ASC, salary DESC], false

   └─ Exchange hashpartitioning(department, 200)

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

The Exchange now hash-partitions data by department, so each department is sorted independently and in parallel. That means smaller sorts, fewer comparisons, and a lower spill risk than one global sort.
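Conceptually, a partitioned window is "group rows by key, then sort and rank each group on its own". The plain-Python sketch below mimics that for rank() with a descending order. It is a toy model under that assumption, not how Spark implements WindowExec, and rank_within_partitions is a hypothetical name.

```python
from collections import defaultdict

_UNSET = object()  # sentinel so a first value of None still gets rank 1


def rank_within_partitions(rows, partition_key, order_key, rank_name="salary_rank"):
    """Toy version of rank().over(Window.partitionBy(...).orderBy(desc))."""
    # Step 1: co-locate rows that share a partition key (the "shuffle").
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    # Step 2: sort and rank each group on its own; groups never interact.
    ranked = []
    for group in groups.values():
        group.sort(key=lambda r: r[order_key], reverse=True)
        previous_value, previous_rank = _UNSET, 0
        for position, row in enumerate(group, start=1):
            # rank() semantics: ties share a rank, and the next rank skips ahead.
            if row[order_key] == previous_value:
                rank = previous_rank
            else:
                rank = position
            ranked.append({**row, rank_name: rank})
            previous_value, previous_rank = row[order_key], rank
    return ranked


employees = [
    {"name": "John", "department": "Engineering", "salary": 80000},
    {"name": "Jane", "department": "Engineering", "salary": 85000},
    {"name": "Bob", "department": "Engineering", "salary": 85000},
    {"name": "Alice", "department": "Sales", "salary": 60000},
    {"name": "Frank", "department": "Sales", "salary": 70000},
]
for row in rank_within_partitions(employees, "department", "salary"):
    print(row["department"], row["name"], row["salary_rank"])
```

Because no group ever looks at another group's rows, each one can be handled by a different task. Without a partition key there is only one group, so there is nothing to parallelize.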

Real-World Benchmark: AWS Glue

We can run both window specifications on the same 1 million row dataset:

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age",
 "hire_date", "country"])

start = time.time()
window_global = Window.orderBy(col("salary").desc())
df_global = df.withColumn("salary_rank", rank().over(window_global))
df_global.count()
global_time = round(time.time() - start, 2)
print(f'global_time:{global_time}')

start = time.time()
window_local = Window.partitionBy("department").orderBy(col("salary").desc())
df_local = df.withColumn("salary_rank", rank().over(window_local))
df_local.count()
local_time = round(time.time() - start, 2)
print(f'local_time:{local_time}')

spark.stop()

Approach	Stage Count	Execution Time (1M rows)	Observation
Global Window (no partition)	5	30.21 s	Full dataset shuffle + global sort
Partitioned Window (by department)	3	1.74 s	Localized sort, fewer shuffle files

In this run, partitioning the window cut runtime from 30.21 s to 1.74 s (about 17×) and removed two stages. The gap typically widens as data grows, because the unpartitioned version funnels ever more rows through a single sort.

Under the Hood: What Spark Actually Does

Each Window transformation adds a physical plan node like:

WindowExec [rank() windowspecdefinition(...)], frame=RangeFrame

This node is non-pipelined: it buffers each window partition's rows before computing window metrics. The Catalyst optimizer generally can’t push a filter below the window unless the filter only touches the partitionBy() columns, which means:

If you rank before filtering, Spark computes ranks for all rows.
If you order globally, Spark must sort everything before starting.

That’s why window placement in your code matters almost as much as partition keys.

Common Anti-Patterns:

Anti-Pattern	Why It Hurts	Fix
Missing partitionBy()	Global sort across dataset	Partition by key columns
Overly granular (near-unique) partition key	Creates too many tiny window partitions	Use selective, not unique keys
Wide, unbounded window frame	Retains all rows in memory per key	Use bounded ranges (for example, rowsBetween(-3, 0))
Filtering after window	Computes unnecessary metrics	Filter first, then window
Multiple chained windows	Each triggers new sort	Combine window metrics in one spec

Partition on selective keys to reduce shuffle volume, and avoid global windows that force full sorts and shuffles. Prefer bounded frames to keep state in memory and limit disk spill, and filter early while combining metrics to minimize unnecessary data flowing through WindowExec. Windows are powerful, but unbounded ones can silently crush performance. In Spark, partitioning isn’t optional. It’s the line between analytics and overhead.

Scenario 10: Incremental Aggregations with Cache and Persist

When multiple actions depend on the same expensive base computation, don’t recompute it every time. Materialize it once with cache() or persist(), then reuse it. Most Spark teams get this wrong in two ways:

They never cache, so Spark recomputes long lineages (filters, joins, window ops) for every action.
They cache everything, blowing executor memory and making things worse.

This scenario shows how to do it intelligently.

The Problem: Recomputing the Same Work for Every Metric

from pyspark.sql.functions import col, avg, max as max_, count

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 70000)
)

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_salary = base.groupBy("department").agg(count("*").alias("cnt"))

Looks totally fine at a glance. But remember: Spark is lazy.
Every time you trigger an action:

avg_salary.show()
max_salary.show()
cnt_salary.show()

Spark walks back to the same base definition and re-runs all filters and shuffles for each metric – unless you persist.

So instead of 1 filtered + shuffled dataset reused 3 times, you effectively get:

3 jobs
3 scans / filter chains
3 groupBy shuffles

for the same input slice.

Simplified Logical Plan Shape (Without Cache):

HashAggregate [department], [avg/max/count]

└─ Exchange hashpartitioning(department)

   └─ Filter (department = 'Engineering' AND country = 'USA' AND salary > 70000)

      └─ Scan ...

And Spark builds this three times. Even though the filter logic is identical, each action triggers a new job with:

new stages,
new shuffles, and
new scans.

On large datasets (hundreds of GBs), this is brutal.

The Better Approach: Cache the Shared Base

from pyspark import StorageLevel

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 70000)
)

base = base.persist(StorageLevel.MEMORY_AND_DISK)

base.count()

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_salary = base.groupBy("department").agg(count("*").alias("cnt"))

avg_salary.show()
max_salary.show()
cnt_salary.show()

base.unpersist()

Now, the filters and initial scan run once, the results are cached, and all subsequent aggregates read from cached data instead of recomputing upstream logic.

Logical Plan Shape (With Cache):

Before materialization (base.count()), the plan still shows the lineage. Afterward, subsequent actions operate off the cached node.

InMemoryRelation [department, salary, country, ...]

   └─ * Cached from:

      Filter (department = 'Engineering' AND country = 'USA' AND salary > 70000)

      └─ Scan parquet employees_large ...

Then:

HashAggregate [department], [avg/max/count]

└─ InMemoryRelation [...]

One heavy pipeline, many cheap reads. The DAG becomes flatter:

Expensive scan and filter chain: once.
Cheap aggregations: N times, reading from memory/disk.

Real-World Benchmark: AWS Glue

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age",
"hire_date", "country"])

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 85000)
)


start = time.time()

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt = base.groupBy("department").agg(count("*").alias("emp_count"))

print("---- Without Cache ----")
avg_salary.show()
max_salary.show()
cnt.show()

no_cache_time = round(time.time() - start, 2)
print(f"Total time without cache: {no_cache_time} seconds")


from pyspark import StorageLevel

base_cached = base.persist(StorageLevel.MEMORY_AND_DISK)
base_cached.count()  # materialize cache

start = time.time()

avg_salary_c = base_cached.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary_c = base_cached.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_c = base_cached.groupBy("department").agg(count("*").alias("emp_count"))

print("---- With Cache ----")
avg_salary_c.show()
max_salary_c.show()
cnt_c.show()

cache_time = round(time.time() - start, 2)
print(f"Total time with cache: {cache_time} seconds")

# Cleanup
base_cached.unpersist()

print("\n==== Summary ====")
print(f"Without cache: {no_cache_time}s | With cache: {cache_time}s")
print("=================")

spark.stop()

Approach	Execution Time (1M rows)
Without Cache	30.75 s
With Cache	3.34 s

Under the Hood: Why This Works

Using cache() or persist() in Spark inserts an InMemoryRelation / InMemoryTableScanExec node so that expensive intermediate results are stored in executor memory (or memory+disk). This allows future jobs to reuse cached blocks instead of re-scanning sources or re-computing shuffles. This shortens downstream logical plans, reduces repeated shuffles, and lowers load on systems like S3, HDFS, or JDBC.

Without caching, every action replays the full lineage and Spark recomputes the data unless another operator or AQE optimization has already materialized part of it. But caching should not become “cache everything”. Rather, you should avoid caching very large DataFrames used only once, wide raw inputs instead of filtered/aggregated subsets, or long-lived caches that are never unpersisted.

A good rule of thumb is to cache only when the DataFrame is expensive to recompute (joins, filters, windows, UDFs), is used at least twice, and is reasonably sized after filtering so it can fit in memory or work with MEMORY_AND_DISK. Otherwise, allow Spark to recompute.

Conceptually, caching converts a tall, repetitive DAG such as repeated “HashAggregate → Exchange → Filter → Scan” sequences into a hub-and-spoke design where one heavy cached hub feeds multiple lightweight downstream aggregates.

When multiple actions depend on the same expensive computation, cache or persist the shared base to flatten the DAG, eliminate repeated scans and shuffles, and improve end-to-end performance. Be intentional about it: cache only when reuse is real and the data size is safe, and always call unpersist() when done.

Don’t make Spark re-solve the same puzzle three times. Let it solve it once, remember the answer, and move on.

Scenario 11: Reduce Shuffles

Shuffles are Spark’s invisible tax collectors. Every time your data crosses executors, you pay in CPU, disk I/O, and network bandwidth.

Two of the most common yet misunderstood transformations that trigger or avoid shuffles are coalesce() and repartition(). Both change partition counts, but they do it in fundamentally different ways.

The Problem

It’s tempting to write df_result = df.repartition(10) and think, “I’m just changing the partition count, so Spark won’t move data unnecessarily.” That assumption is wrong. repartition() always performs a full shuffle, whether:

You are reducing partitions (from 200 → 10), or
You are increasing partitions (from 10 → 200).

In both cases, Spark redistributes every row across the cluster: round-robin when you pass only a number, or by hash when you pass columns. So even if your data is already partitioned optimally, repartition() will still reshuffle it, adding a stage boundary.

Logical Plan:

Exchange RoundRobinPartitioning(10)

└─ LogicalRDD [...]

That Exchange node signals a wide dependency: Spark spills intermediate data to disk, transfers it over the network, and reloads it before the next stage. In short: repartition() = "new shuffle, no matter what."

The Better Approach: coalesce()

If your goal is to reduce the number of partitions (for example, before writing results to S3 or Snowflake), use coalesce() instead.

df_result = df.coalesce(10)

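coalesce() merges existing partitions without a shuffle, avoiding the costly redistribution step. It uses a narrow dependency, meaning each output partition is built directly from one or more existing partitions, preferably ones already on the same executor. The trade-off is that coalesce() can only reduce the partition count, and the merged partitions may be uneven in size.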

Coalesce

└─ LogicalRDD [...]

No Exchange.
No network shuffle.
Just local merges – fast and cheap.

Real-World Benchmark: AWS Glue

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df_repart = df.repartition(10)
df_repart.count()
print("Repartition time:", round(time.time() - start, 2), "sec")

start = time.time()
df_coalesced = df.coalesce(10)
df_coalesced.count()
print("Coalesce time:", round(time.time() - start, 2), "sec")

spark.stop()

Operation	Plan Node	Shuffle Triggered	Glue Runtime	Observation
repartition(10)	Exchange	Yes	18.2 s	Full cluster reshuffle
coalesce(10)	Coalesce	No	1.99 s	Local partition merge only

Even though both ended with 10 partitions, repartition() took about 9× longer (18.2 s versus 1.99 s), all because of the unnecessary shuffle.

Why This Matters

Each Exchange node in your physical plan creates a new stage in your DAG, meaning:

Extra disk I/O
Extra serialization
Extra network transfer

That’s why avoiding just one shuffle in a Glue ETL pipeline can save seconds to minutes per run, especially on wide datasets.

When to use which:

Goal	Transformation	Reasoning
Increase parallelism for heavy groupBy or join	repartition()	Distributes data evenly across executors
Reduce file count before writing	coalesce()	Avoids shuffle, merges partitions locally
Rebalance skewed data before a join	repartition("key")	Enables better key distribution
Optimize output after aggregation	coalesce()	Prevents too many small output files

AQE and Auto Coalescing

You can enable Adaptive Query Execution (AQE) in AWS Glue 3.0+ to let Spark merge small shuffle partitions automatically:

spark.conf.set("spark.sql.adaptive.enabled", "true")

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

With AQE, Spark dynamically combines small partitions after shuffle to balance performance and I/O.

repartition() always triggers a shuffle, while coalesce() avoids shuffles and is ideal for local merges before writes. You should always inspect Exchange nodes to identify shuffle points. Note that in the AWS Glue benchmark above, avoiding a single shuffle yielded a roughly 9× runtime improvement at the 1M-row scale. Finally, use AQE to enable dynamic partition coalescing in larger workflows.

Scenario 12: Know Your Shuffle Triggers

Much of Spark's performance cost comes from invisible data movement. Every shuffle boundary adds a new stage, a new write–read cycle, and sometimes minutes of extra execution time.

In Spark, any operation that requires rearranging data between partitions introduces a wide dependency, represented in the physical plan as an Exchange node.

Common shuffle triggers:

Operation	Why It Shuffles	Plan Node
join()	Records with the same key must be co-located for matching	Exchange (on join keys)
groupBy() / agg()	Keys must gather to a single partition for aggregation	Exchange
distinct()	Spark must compare all values across partitions	Exchange
orderBy()	Requires global ordering of data	Exchange
repartition()	Explicit reshuffle for partition balancing	Exchange

Each Exchange means a shuffle stage: Spark writes partition data to disk, transfers it over the network, and reads it back into memory on the next stage. That’s your hidden performance cliff.

from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

df_result = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
      .join(df.select("department", "country")
            .distinct(), "department")
      .orderBy("total_salary", ascending=False)
)

df_result.explain("formatted")

Logical Plan Simplified:

Sort [total_salary DESC]

└─ Exchange (global sort)

   └─ SortMergeJoin [department]

      ├─ Exchange (groupBy shuffle)

      │   └─ HashAggregate (sum salary)

      └─ Exchange (distinct shuffle)

          └─ Aggregate (department, country)

We can see three Exchange nodes, one for the aggregation, one for the distinct join, and one for the global sort. That’s three separate shuffles, three full dataset transfers.

Better Approach

Whenever possible, combine wide transformations into a single stage before an action. For instance, you can compute aggregates and join results in one consistent shuffle domain:

from pyspark.sql.functions import sum as sum_

agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))

country_df = df.select("department", "country").distinct()

df_result = (
    agg_df.join(country_df, "department")
          .sortWithinPartitions("total_salary", ascending=False)
)

Logical Plan Simplified:

SortWithinPartitions [total_salary DESC]

└─ SortMergeJoin [department]

   ├─ Exchange (shared shuffle for join)

   └─ Exchange (shared shuffle for distinct)

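The range-partitioning Exchange for the global sort is gone: sortWithinPartitions() orders rows inside each existing partition as a narrow transformation. The remaining Exchange nodes feed the join, where Spark can reuse the aggregation's department partitioning instead of shuffling that side again. Keep in mind that the output is no longer globally ordered, so use this only when downstream consumers don't need a total order.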

Real-World Benchmark: AWS Glue (1M)

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]).repartition(20)

from pyspark.sql.functions import sum as sum_

start = time.time()

dept_salary = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
)

dept_country = (
    df.select("department", "country")
      .distinct()
)

naive_result = (
    dept_salary.join(dept_country, "department", "inner")
               .orderBy(col("total_salary").desc())
)

naive_count = naive_result.count()
naive_time = round(time.time() - start, 2)

print("Naive result count:", naive_count)
print("Naive pipeline time:", naive_time, "sec")


start = time.time()

dept_country_once = (
    df.select("department", "country")
      .distinct()
)

optimized = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
      .join(dept_country_once, "department", "inner")
      .sortWithinPartitions(col("total_salary").desc())
      # local ordering, avoids extra global shuffle
)

opt_count = optimized.count()
opt_time = round(time.time() - start, 2)

print("Optimized result count:", opt_count)
print("Optimized pipeline time:", opt_time, "sec")

print("\nOptimized plan:")
optimized.explain("formatted")

spark.stop()

Pipeline	# of Shuffles	Glue Runtime (sec)	Observation
Naive: groupBy + distinct + orderBy	3	28.99 s	Multiple wide stages
Optimized: combined agg + join + sortWithinPartitions	1	3.52 s	Single wide stage

By merging compatible stages and using sortWithinPartitions() instead of global orderBy(), the job ran about 8× faster on the same dataset (3.52 s versus 28.99 s), with fewer Exchange nodes and shorter lineage. To find shuffles in your own jobs, run df.explain() and search for Exchange: each one signals a full shuffle. You can also check Spark UI → SQL tab → Exchange Read/Write Size to see exactly how much data moved.

Every Exchange represents a shuffle, adding serialization, network I/O, and stage overhead, so avoid chaining wide operations back-to-back by combining them under a consistent partition key. Prefer sortWithinPartitions() over global orderBy() when ordering is local, monitor plan depth to catch consecutive wide dependencies, and note that in AWS Glue eliminating even one shuffle in a 1M-row job can significantly reduce runtime.

Scenario 13: Tune Parallelism: Shuffle Partitions & AQE

Most Spark jobs are either over-parallelized (thousands of tiny tasks doing almost nothing, flooding the driver and filesystem) or under-parallelized (a handful of huge tasks doing all the work, causing slow stages and skew-like behavior). Both waste resources. We can control this behavior using spark.sql.shuffle.partitions and Adaptive Query Execution (AQE).

In many environments, spark.sql.shuffle.partitions defaults to 200 (you can check with spark.conf.get("spark.sql.shuffle.partitions")). That means every shuffle (groupBy, join, distinct, and so on) produces roughly 200 shuffle partitions, regardless of data size. Whether this default is reasonable depends entirely on the workload:

If you’re processing 2 GB, 200 partitions might be great.
If you’re processing 5 MB, 200 partitions is comedy – 200 tiny tasks, overhead > work.
If you’re processing 2 TB, 200 partitions might be too few – tasks become huge and slow.
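The arithmetic behind those three cases is worth seeing once. This small sketch divides each dataset size by 200 partitions and prints the average partition size. It assumes data is spread evenly and uses binary units (1 GB = 1024³ bytes). Both function names are hypothetical helpers.

```python
def bytes_per_partition(total_bytes, num_partitions=200):
    """Average shuffle partition size, assuming an even spread."""
    return total_bytes / num_partitions


def human(n_bytes):
    """Format a byte count with binary units."""
    value = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.1f} {unit}"
        value /= 1024


KIB, MIB, GIB, TIB = 1024, 1024**2, 1024**3, 1024**4

for label, size in [("2 GB", 2 * GIB), ("5 MB", 5 * MIB), ("2 TB", 2 * TIB)]:
    print(label, "->", human(bytes_per_partition(size)), "per partition")
# 2 GB -> 10.2 MiB per partition
# 5 MB -> 25.6 KiB per partition
# 2 TB -> 10.2 GiB per partition
```

Tasks that each read about 10 MiB are reasonable, tasks that read 25 KiB are almost pure overhead, and tasks that read 10 GiB are likely to spill and run slowly.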

Example A: The Default Plan (Too Many Tiny Tasks)

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("ParallelismExample").getOrCreate()

spark.conf.get("spark.sql.shuffle.partitions")  # '200'

data = [
    (1, "John", "Engineering", 90000),
    (2, "Alice", "Engineering", 85000),
    (3, "Bob", "Sales", 75000),
    (4, "Eve", "Sales", 72000),
    (5, "Grace", "HR", 65000),
]

df = spark.createDataFrame(data, ["id", "name", "department", "salary"])

agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.explain("formatted")

Even though there are only 3 departments, Spark (without AQE coalescing) will still create 200 shuffle partitions, meaning 200 tasks for 3 groups of data.

Effect: Each task has almost nothing to do. Spark spends more time planning and scheduling than actually computing.

Example B: Tuned Plan (Balanced Parallelism)

spark.conf.set("spark.sql.shuffle.partitions", "8")
agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.explain("formatted")

Now Spark launches only 8 shuffle tasks: still parallelized, but not wasteful. Even in this small example, the difference is visible: one configuration change produces a much leaner physical plan.

The Real Problem: Static Tuning Doesn’t Scale

In production, job sizes vary:

Today: 10 GB
Tomorrow: 500 GB
Next week: 200 MB (sampling run)

Manually changing shuffle partitions for each run is neither practical nor reliable. That’s where Adaptive Query Execution (AQE) steps in.

Adaptive Query Execution (AQE): Smarter, Dynamic Parallelism

AQE doesn’t guess. It measures actual shuffle statistics at runtime and rewrites the plan while the job is running.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "64m")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")

Configuration	Shuffle Partitions	Task Distribution	Observation
Default	200	200 tasks / 3 groups	Too granular, mostly idle
Tuned	8	8 tasks / 3 groups	Balanced execution

AQE merges tiny shuffle partitions, or splits huge ones, based on real-time data metrics, not pre-set assumptions.

df = spark.createDataFrame(multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age",
     "hire_date", "country"])

start = time.time()
agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.count()

print(f'Num Partitions df: {df.rdd.getNumPartitions()}')
print(f'Num Partitions aggdf: {agg_df.rdd.getNumPartitions()}')
print("Execution time:", round(time.time() - start, 2), "sec")

spark.stop()

Stage	Without AQE	With AQE
Stage 3 (Aggregation)	200 shuffle partitions, each reading KBs	8–12 coalesced partitions
Stage 4 (Join Output)	200 shuffle files	Merged into balanced partitions
Result	Many small tasks, high overhead	Fewer, balanced tasks, faster runtime

Understanding the Plan

Before AQE (static):

Exchange hashpartitioning(department, 200)

With AQE: AdaptiveSparkPlan (coalesced)

HashAggregate(keys=[department], functions=[sum(salary)])

Exchange hashpartitioning(department, 200) # runtime coalesced to 12

The logical plan remains the same, but the physical execution plan is rewritten during runtime. Spark intelligently reduces or merges shuffle partitions based on data volume.

Spark’s default 200 shuffle partitions often misfit real workloads. Static tuning may work for predictable pipelines, but fails with variable data. AQE, on the other hand, uses shuffle statistics to dynamically coalesce partitions at runtime. Use it with sensible ceilings (for example, 400 partitions) and always verify in the Spark UI to catch over-partitioning (many tasks reading KBs) or under-partitioning (few tasks reading GBs).

Scenario 14: Handle Skew Smartly

In an ideal Spark world, all partitions contain roughly equal amounts of data. But real datasets are rarely that kind. If one key (say "USA", "2024", or "customer_123") holds millions of rows while others have only a few, Spark ends up with one or two massive partitions. Those partitions take disproportionately longer to process, leaving other executors idle. That’s data skew: the silent killer of parallelism.

You’ll often spot it in Spark UI:

198 tasks finish quickly.
2 tasks take 10× longer.
Stage stays stuck at 98% for minutes.

Example A: The Skew Problem

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("DataSkewDemo").getOrCreate()

# Create skewed dataset
df = spark.range(0, 10000).toDF("id") \
    .withColumn("department",
        F.when(F.col("id") < 8000, "Engineering")  # 80% of data
         .when(F.col("id") < 9000, "Sales")
         .otherwise("HR")) \
    .withColumn("salary", (F.rand() * 100000).cast("int"))

df.groupBy("department").count().show()

Spark hashes “Engineering” into a single reducer partition, making it far heavier than the others. That single task becomes the bottleneck: most of the stage has finished, but it cannot complete until that one lagging task does.

Example B: The Solution: Salting Hot Keys

To handle skew, we split the hot key (Engineering) into multiple pseudo-keys using a random salt. This redistributes that large partition across multiple reducers.

from pyspark.sql.functions import rand, lit

salt_buckets = 10

df_salted = (
    df.withColumn(
        "department_salted",
        F.when(F.col("department") == "Engineering",
            F.concat(F.col("department"), lit("_"),
                     (F.floor(rand() * salt_buckets))))
         .otherwise(F.col("department"))
    )
)

df_salted.groupBy("department_salted").agg(F.avg("salary"))

Now “Engineering” isn’t one hot key – it’s 10 smaller keys like Engineering_0, Engineering_1, ..., Engineering_9. Each one goes to a separate reducer partition, enabling parallel processing.
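You can see the effect of salting without a cluster. The plain-Python simulation below reproduces the 80/10/10 split from the skewed dataset and measures how many rows land on the busiest reducer with and without a salt. It is a toy model: zlib.crc32 is a deterministic stand-in for Spark's own hash function, and the reducer count of 200 is an assumption matching the default shuffle partition setting.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 200   # assumed shuffle partition count for the toy model
SALT_BUCKETS = 10    # same value as salt_buckets in the article's example


def reducer_for(key):
    """Deterministic stand-in for a hash partitioner (not Spark's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def department_for(row_id):
    """Same 80/10/10 split as the skewed dataset above."""
    if row_id < 8000:
        return "Engineering"
    if row_id < 9000:
        return "Sales"
    return "HR"


rng = random.Random(42)
plain_load = Counter()
salted_load = Counter()
for row_id in range(10_000):
    dept = department_for(row_id)
    plain_load[reducer_for(dept)] += 1
    # Only the hot key gets a salt; other keys pass through unchanged.
    if dept == "Engineering":
        salted_key = f"{dept}_{rng.randrange(SALT_BUCKETS)}"
    else:
        salted_key = dept
    salted_load[reducer_for(salted_key)] += 1

print("heaviest reducer without salt:", max(plain_load.values()))
print("heaviest reducer with salt:   ", max(salted_load.values()))
```

The total number of rows is unchanged. Salting only changes how evenly they are spread across reducers.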

Example C: Post-Aggregation Desalting

After aggregating, recombine salted keys to get the original department names:

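# Carry sum and count through the salted stage, then divide once at the end.
# Averaging the per-bucket averages would be skewed by uneven bucket sizes.
df_final = (
    df_salted.groupBy("department_salted")
        .agg(F.sum("salary").alias("sum_salary"),
             F.count("salary").alias("cnt"))
        .withColumn("department", F.split(F.col("department_salted"), "_")
            .getItem(0))
        .groupBy("department")
        .agg((F.sum("sum_salary") / F.sum("cnt")).alias("final_avg_salary"))
)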
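One caveat when recombining salted results: an average of per-bucket averages is only exact when every bucket holds the same number of rows. Random salts produce buckets that are only roughly equal, so carrying a sum and a count through the salted stage is the safer pattern. The numbers below are made up purely to show the arithmetic.

```python
# (sum_of_salaries, row_count) per salted bucket; values are invented.
buckets = {
    "Engineering_0": (300_000, 3),   # bucket average: 100,000
    "Engineering_1": (50_000, 1),    # bucket average:  50,000
}

# Naive recombination: average the bucket averages.
avg_of_avgs = sum(s / c for s, c in buckets.values()) / len(buckets)

# Exact recombination: add up sums and counts, then divide once.
total_sum = sum(s for s, _ in buckets.values())
total_cnt = sum(c for _, c in buckets.values())
true_avg = total_sum / total_cnt

print(avg_of_avgs)  # 75000.0
print(true_avg)     # 87500.0
```

The same reasoning applies to any aggregate that is not directly mergeable. Sums, counts, minimums, and maximums recombine cleanly. Averages need their sum and count.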

When to Use Salting

Use salting when:

You observe stage skew (one or few long tasks).
Shuffle read sizes vary drastically between tasks.
The skew originates from a few dominant key values.

Avoid it when:

The dataset is small (< 1 GB).
You already use partitioning or bucketing keys with uniform distribution.

Alternative approaches:

Technique	Use Case	Pros	Cons
Salting (manual)	Skewed joins/aggregations	Full control	Requires extra logic to merge
Skew join hints (/*+ SKEWJOIN */)	Supported joins in Spark 3+	No extra columns needed	Works only on joins
Broadcast smaller side	One table ≪ other	Avoids shuffle on big side	Limited by broadcast size
AQE skew optimization	Spark 3.0+	Automatic handling	Needs AQE enabled

Glue-Specific Tip

AWS Glue 3.0+ includes Spark 3.x, meaning you can also enable AQE’s built-in skew optimization:

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "128m")

Spark will then detect oversized shuffle partitions in sort-merge joins and split them, effectively auto-salting hot keys at runtime.

Data skew causes uneven shuffle sizes across tasks and can be detected in the Spark UI or via shuffle read/write metrics. Mitigate heavy-key skew with manual salting (recombined later) or rely on AQE skew join optimization for mild cases, and always validate improvements in the Spark UI SQL tab by checking “Shuffle Read Size.”

Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)

Most Spark jobs need sorted data at some point – for window functions, for writing ordered files, or for downstream processing. The instinct is to reach for orderBy(). But that instinct costs you a full shuffle every single time.

The Problem: Global Sort When You Don't Need It

Let's say you want to write employee data partitioned by department, sorted by salary within each department:

from pyspark.sql.functions import col

# Naive approach: global sort
df_sorted = df.orderBy(col("department"), col("salary").desc())

df_sorted.write.partitionBy("department").parquet("s3://output/employees/")

This looks reasonable. You're sorting by department and salary, then writing partitioned files. Clean and simple. But here's what Spark actually does:

Simplified Logical Plan:

Sort [department ASC, salary DESC], true

└─ Exchange rangepartitioning(department ASC, salary DESC, 200)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

That Exchange rangepartitioning is a full shuffle. So Spark:

Samples the data to determine range boundaries
Redistributes every row across 200 partitions based on sort keys
Sorts each partition locally
Produces globally ordered output

You just shuffled 1 million rows across the cluster to achieve global ordering – even though you're immediately partitioning by department on write, which destroys that global order anyway.
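The four steps above can be modeled in a few lines of plain Python. This is a toy sketch of the idea (sample, pick boundaries, route every row, sort locally), not Spark's implementation; the function name and parameters are illustrative.

```python
import bisect
import random

def range_partition_sort(rows, num_partitions, sample_size=100, seed=0):
    """Toy model of a global sort: sample, pick boundaries, redistribute, sort locally."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    # Choose num_partitions - 1 boundaries at evenly spaced sample quantiles.
    bounds = [sample[len(sample) * i // num_partitions] for i in range(1, num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:  # the "shuffle": every row is routed by its sort key
        partitions[bisect.bisect_right(bounds, row)].append(row)
    return [sorted(p) for p in partitions]

rng = random.Random(42)
data = [rng.randint(0, 10_000) for _ in range(1_000)]
parts = range_partition_sort(data, num_partitions=4)
flat = [x for p in parts for x in p]
print(flat == sorted(data))  # True
```

Reading the partitions in order yields a globally sorted result, but only because every row was routed to a new partition first. That routing step is the cost you pay in the Exchange.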

Why This Hurts

Range partitioning for global sort is one of the most expensive shuffles Spark performs:

Sampling overhead: Spark must scan data twice (once to sample, once to process)
Network transfer: Every row moves to a new executor based on range boundaries
Disk I/O: Shuffle files written and read from disk
Wasted work: Global ordering across departments is meaningless when you partition by department

For 1M rows, this adds roughly 8-12 seconds of pure shuffle overhead in the benchmark environment used later in this scenario.

The Better Approach: Sort Locally Within Partitions

If you only need ordering within each department (or within each output partition), use sortWithinPartitions():

# Optimized approach: local sort only
df_sorted = df.sortWithinPartitions(col("department"), col("salary").desc())
df_sorted.write.partitionBy("department").parquet("s3://output/employees/")

Simplified Logical Plan:

Sort [department ASC, salary DESC], false

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

No Exchange.
No shuffle.
Just local sorting within existing partitions.

Spark sorts each partition in-place, without moving data across the network. The false flag in the Sort node indicates this is a local sort, not a global one.
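A small plain-Python model shows what "local sort" means in practice. The rows and the helper below are illustrative assumptions: each inner list stands in for one existing partition, and no row ever changes partitions.

```python
# Illustrative (department, salary) rows already spread over two partitions.
partitions = [
    [("Sales", 50), ("Engineering", 90), ("Sales", 70)],
    [("Engineering", 60), ("HR", 40), ("Engineering", 120)],
]

def sort_within_partitions(parts):
    """Sort each partition independently: department ASC, salary DESC. No rows move."""
    return [sorted(p, key=lambda r: (r[0], -r[1])) for p in parts]

local = sort_within_partitions(partitions)
print(local[0])  # [('Engineering', 90), ('Sales', 70), ('Sales', 50)]
print(local[1])  # [('Engineering', 120), ('Engineering', 60), ('HR', 40)]

# Each partition is ordered, but the concatenation is not a global order.
flat = [r for p in local for r in p]
print(flat == sorted(flat, key=lambda r: (r[0], -r[1])))  # False
```

That last False is the trade-off: you give up total ordering across partitions in exchange for skipping the shuffle.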

Real-World Benchmark: AWS Glue

Let's measure the difference on 1 million employee records. First, the global sort with orderBy():

import time
from pyspark.sql.functions import col

print("\n--- Testing orderBy() (global sort) ---")

start = time.time()

df_global = df.orderBy(col("department"), col("salary").desc())
df_global.write.mode("overwrite").parquet("/tmp/global_sort_output")

global_time = round(time.time() - start, 2)
print(f"orderBy() time: {global_time}s")

Local Sort:

print("\n--- Testing sortWithinPartitions() (local sort) ---")

start = time.time()

df_local = df.sortWithinPartitions(col("department"), col("salary").desc())
df_local.write.mode("overwrite").parquet("/tmp/local_sort_output")

local_time = round(time.time() - start, 2)
print(f"sortWithinPartitions() time: {local_time}s")

Approach	Plan Type	Execution Time (1M rows)	Observation
orderBy()	Exchange rangepartitioning	10.34 s	Full shuffle for global sort
sortWithinPartitions()	Local Sort (no Exchange)	2.18 s	In-place sorting, no network transfer
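Both timing snippets repeat the same start/stop pattern, so if you run many comparisons like this, a small helper can keep the measurements consistent. The sketch below is one possible shape, using only the standard library; the name timed and the results dictionary are assumptions, and the summed range is just a stand-in for a Spark write.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = round(time.perf_counter() - start, 2)

results = {}
with timed("example", results):
    sum(range(100_000))  # stand-in for df.orderBy(...).write.parquet(...)
print(results)
```

Because Spark transformations are lazy, the timed block needs to contain an action such as the write. Timing only the orderBy() or sortWithinPartitions() call would measure plan construction, not the sort.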

Physical Plan Differences:

orderBy() Physical Plan:

*(2) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, 0

+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, 200)

   +- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]

      +- *(1) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]

The Exchange rangepartitioning node marks the shuffle boundary. Spark must:

Sample data to determine range splits
Redistribute all rows across executors
Sort within each range partition

sortWithinPartitions() Physical Plan:

*(1) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, 0

+- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(1) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]

No Exchange. The false flag in Sort indicates local sorting only. Each partition is sorted independently, in parallel, without any data movement.

When to Use Which:

Use Case	Method	Why
Writing partitioned files (Parquet, Delta)	sortWithinPartitions()	Partition-level order is sufficient; global order wasted
Window functions with ROWS BETWEEN	sortWithinPartitions()	Only need order within each window partition
Top-N per group (rank, dense_rank)	sortWithinPartitions()	Ranking is local to each partition key
Final output must be globally ordered	orderBy()	Need total order across all partitions
Downstream system requires strict ordering	orderBy()	For example, time-series data for sequential processing
Sorting before coalesce() for fewer output files	sortWithinPartitions()	Maintains order within merged partitions

Common Anti-Pattern

df.orderBy("department", "salary") \
  .write.partitionBy("department") \
  .parquet("output/")

Problem: You're globally sorting by department, then immediately partitioning by department. The global order is destroyed during partitioning.

Here’s the fix:

df.sortWithinPartitions("department", "salary") \
  .write.partitionBy("department") \
  .parquet("output/")

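Or, if you also want each department's rows to land in the same task (and therefore in fewer output files), hash-repartition by department first and then sort locally. This adds one hash shuffle but still avoids the range-partitioned global sort. Note that DataFrameWriter.sortBy() can't be used here: it only works together with bucketBy() and saveAsTable().

# Co-locate each department, then sort locally within partitions
df.repartition("department") \
    .sortWithinPartitions("department", "salary") \
    .write.partitionBy("department") \
    .parquet("output/")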

orderBy() triggers an expensive full shuffle using range partitioning, while sortWithinPartitions() sorts data locally without a shuffle and was roughly 4–5× faster in the benchmark above. Use sortWithinPartitions() when writing partitioned files, computing window functions with partitionBy(), or when order is needed only within groups. Reserve orderBy() strictly for true global ordering: in most production ETL, the best sort is the one that doesn’t shuffle.

Conclusion

You began this handbook likely wondering why your Spark application was slow, and now you see that the answer was both clear and not so clear: your problem was never your Spark application, your configuration, or your version of Spark. It was your plan all along.

You now understand that Spark runs plans, not code, that transformation order affects logical plans, that shuffles generate stages and are key to runtime performance, and that examining your physical plans allows you to directly link your application performance issues back to your problematic line of code.

And you’ve seen this pattern repeat across many scenarios: problem, plan, solution, improved plan, and so forth, until optimization feels less like a dark art and more like a certainty.

This is the Spark optimization mindset: read plans before you write code, and challenge every single Exchange. Engineers who write high-performance Spark jobs minimize shuffles, filter early, project narrowly, deal with skew carefully, and validate everything via explain() and the Spark UI. Once you learn to read the plan, Spark performance becomes mechanical.

How to Use Vibe Coding Effectively as a Dev

Ankur Tyagi — Tue, 25 Nov 2025 16:52:00 +0000

It may seem like everyone is a vibe coder these days, and that prompting is becoming the new coding. But is this AI-generated code really deployable?

Bragging on social media about a clever script is one thing, but pushing a vibe coded app to prod comes with many security risks.

With so many AI dev tools out there now, code reviews become more critical than ever.

This article will explore what vibe coding means and how code reviews should adapt in the era of AI.

What is Vibe Coding?

In early 2025, AI researcher Andrej Karpathy popularized the term vibe coding to describe a new way of development in which you “fully give in to the vibes” and let AI write code while you focus on high level intent.

A developer expresses their desired functionality in plain language, and an AI system (like an LLM) generates the source code to implement it.

This code-by-prompt approach allows even beginners to produce working code without deep knowledge of programming languages. Karpathy joked that with advanced IDE agents (like Cursor’s Composer mode), “I barely even touch the keyboard... I ‘Accept All’ always, I don’t read the diffs anymore... and it mostly works”.

So, vibe coding is coding by vibe and trusting AI to handle the heavy lifting.

How to Implement Vibe Coding in Practice

In practice, vibe coding usually involves using AI assistants and adapting your workflow to a more interactive, prompt-driven style.

Here’s an overview of how you can “vibe code” a project:

Step 1: Choose an AI assistant

Select a development environment that supports AI code generation. Popular choices include Cursor and GitHub Copilot.

Step 2: Define your requirements

Instead of writing boilerplate code, describe what you want to build. Provide AI with a specific prompt detailing functionality. The more context and detail you give, the better AI can fulfill your intent.

For example, when I ran an SEO inspection for my website, DevTools Academy, I used this prompt in Cursor:

“Now, act as a senior product engineer and UX strategist. Evaluate and improve https://www.devtoolsacademy.com with a practical, no-fluff lens.

Scope:

UX

SEO and technical SEO

Positioning and messaging

Copywriting and information architecture

What to add to stand out in the developer tools space.”

This prompt works well because it gives the AI a clear role, a defined scope, and a specific intent. AI knows it’s not just fixing SEO but also reviewing how the site communicates value to devs. That combination of clarity and context produces actionable insights instead of surface-level suggestions.

Below is a screenshot of that audit in progress, showing how I reviewed code, metadata, and UX recommendations side by side.

You can check out the full code on my open source blog here, and browse the closed PRs. This will help you see how I use these coding agents on a production-ready app.

Step 3: Review the code

AI will produce initial code based on your prompt. Think of this as a prototype – it’s not perfect. Run the code and see how it behaves.

Let’s look at an example: here, CodeRabbit is reviewing one of my pull requests on GitHub. I had pushed a small fix to sort blog posts correctly and make sure the RSS feed reflects the latest publish date. Within seconds, CodeRabbit analyzed the diff, understood the intent behind my change, and explained exactly what the new code does.

It pointed out that the fix now sorts posts before mapping them, uses the sorted data for both items and the lastBuildDate, and ensures proper chronological order throughout the feed.

It’s like having a senior reviewer who not only checks syntax but also validates logic and confirms that your reasoning holds up.

This is just a reminder to expect imperfections. Vibe coding embraces a “code first, refine later” mindset. This means you get a working version quickly, then iteratively improve it. You might go through a few cycles of prompt -> code -> test -> tweak.

Step 4: Validate, debug, polish

Once the AI-generated code meets your expectations, do a final review.

Throughout the process, the core idea is that you collaborate with the AI. The AI agent serves as a coding assistant, making real-time suggestions, automating tedious boilerplate, and even generating entire modules on your behalf.

Why Isn’t Vibe Coded Output Production Ready?

Vibe coding moves fast: you describe intent, the AI produces something that runs, and you’re off to the next prompt. What’s missing is the slow, unglamorous work that usually turns a draft into shippable software, like shared context, architectural alignment, verification, and documentation.

AI generates plausible code based on patterns it has seen. But it doesn’t understand your team’s history, your system’s constraints, or the implicit rules that keep everything coherent over time.

That mismatch shows up the moment a “works on my machine” demo meets a real codebase.

Let’s explore the common pitfalls of vibe-coded code, so you’ll know what to watch for. Then, in the checklist section below, I’ll outline practical strategies to address or prevent each issue.

Context gaps are the first crack.

AI only sees what you show it, so it’s easy for it to make the right local choice and the wrong global one: duplicating logic that already exists, choosing defaults that conflict with prior decisions, or introducing functions that don’t respect domain boundaries.

The result is code that looks reasonable in isolation but collides with existing assumptions and conventions once integrated.

Drafts often ignore the lived details of your environment – shared utilities, cross-cutting concerns, configuration, deployment hooks, and operational policies. Interfaces may line up at a glance and still fail at runtime because the draft doesn’t fit how your system composes modules, handles errors, or manages state across services.

The most serious risk is security by omission.

AI rarely includes robust input validation, clear authentication and authorization paths, or rate limiting unless you spell it out. Secrets handling and logging tend to be superficial or missing. That leaves common exposure points like request handlers, job processors, and webhook endpoints without the checks that prevent injection, SSRF, mass assignment, or data exfiltration.

Even when the surface looks tidy, the absence of explicit security controls means you’re trusting defaults you didn’t choose.

Testing and correctness evidence are thin.

Quality suffers in quieter ways, too. Beyond “it runs,” there’s little to demonstrate behavior across edge cases or to guard against regressions.

Performance and scalability remain unknowns: extra network calls, N+1 patterns, and quadratic loops sneak in because nobody measured them. Dependencies and environments drift as versions aren’t pinned, infrastructure isn’t declared, and configuration lives only in the author’s head, making behavior differ across machines and CI.

Operability lags behind.

A lack of metrics, missing health/readiness probes, and no runbook make failures harder to detect and slower to recover from. Add in data quality and compliance concerns (PII handling, encoding assumptions, transitive license obligations), and you have code that demos well but isn’t ready for production’s reliability, security, and audit demands.

In short, vibe-coded output accelerates drafting but skips the shared understanding and evidence that make software safe to ship.

Until those gaps are closed, it’s a prototype, not a release.

Guidelines for AI Code Reviews

Your team should keep pre-AI engineering standards as the bar, including security, tests, readability, maintainability, performance, and docs. AI should change how fast you gather the evidence for those standards, not how much evidence you require. In other words, use AI to accelerate the path to your existing bar, never to lower it.

Using AI, you can generate code at speed. But if reviews take the same amount of time (or more time), you lose some of the benefit. The goal isn’t to relax standards, it’s to shorten the time to prove you met them. That means layering in automation (tests, static analysis, secret scans, SCA) and AI-assisted review to catch obvious issues quickly so human reviewers can focus on intent, architecture, and risk.

Well-used assistants can help here. For example, tools like CodeRabbit, GitHub Copilot PR Reviewer, Claude Code, Cursor’s Bugbot, Graphite’s AI Review, and Greptile can highlight potential bugs, security gaps, style deviations, and mismatched intent, and summarize diffs for faster context. Treat these as accelerators for your existing process, not as replacements for judgment.

Code Review Process in Vibe Coding

The fundamentals of good code reviews haven’t changed – and in fact, they’re more critical now.

Below are some key principles to maintain speed without sacrificing quality.

1. Trust, but verify.

A reviewer usually assumes the author understands the system. With vibe-coded output, the “author” may be an AI with limited context. If something looks odd or unnecessary, question it. Run the code, add/execute tests, or ask the developer/AI for clarification on intent and constraints.

2. Don’t let reviews become a bottleneck.

Vibe coding generates code quickly. If human review takes as long as hand-writing the change, you’ve erased the gain.

Combat this by front-loading automation: run unit/integration tests, static analysis (lint/SAST), secret scans, SCA, and basic perf checks in CI to clear the noise. Then reviewers spend their time on design trade-offs, boundary cases, and risk. The balance is: high standards, faster evidence.

3. Use AI code reviews wisely

AI can help review code just as it helps generate it. Modern “pair reviewer” tools scan a PR and surface likely bugs, security issues, missing tests, or style violations in minutes, and they give natural-language summaries of the change.

Tools you can consider include CodeRabbit, GitHub Copilot PR Reviewer, Claude Code, Cursor Bugbot, Graphite, and Greptile. Many integrate with the CLI/IDE and GitHub/GitLab to leave actionable comments.

Think of them as fast first-pass reviewers that increase coverage and consistency across PRs.

4. Human judgment is still irreplaceable.

Even the best AI reviewer is an assistant. Keep humans accountable for correctness, security posture, architectural fit, and user impact. A healthy pattern is AI first-pass > human second-pass that inspects invariants, failure modes, and long-term maintainability.

5. Maintain a high bar for quality.

It’s tempting to accept “it runs” when an AI wrote it. Don’t. Stakeholders still expect software to be robust, secure, and maintainable. Keep DRY, readability, and testability standards. Insist on input validation, authZ checks where relevant, and sensible logging/metrics. If you can’t provide evidence that you met the bar, you haven’t met it.

6. Educate and document

When reviewers find bugs or security flaws in AI-generated code, capture the lesson.

Update internal guides with patterns like “When generating handlers, validate and bound inputs, add rate limits, log request IDs, avoid N+1 queries, and sanitize user-visible output.” Over time, bake these into prompts, templates, repo scaffolds, and CI checks so the next AI draft starts closer to done.

Checklist for Reviewing AI Generated Code

Before approving any vibe-coded change, make the standards explicit and verifiable. Use this checklist to confirm behavior, security, performance, integration, and documentation so the draft you got from AI becomes code you can safely ship.

Here’s a checklist a human reviewer should go through before approving vibe-coded output:

1. Define the code’s purpose (scope & non-goals).

Be explicit about what this change does and does not do. Tie it to a user story/ticket and call out non-goals so “helpful” AI changes don’t creep in.

2. Verify X and Y (behavior and edge cases).

Be clear about what you’re verifying. For example, verify input parsing and pagination boundaries, verify that error paths return the correct status and body, and verify that database writes are idempotent. Run existing tests, add missing unit/integration tests, and reproduce edge inputs (empty, null, huge, unicode).

3. Perform code-quality checks (readability, DRY, refactor needs).

AI often produces verbose or duplicated logic. Ensure names are meaningful, side effects are clearly stated, and duplication is removed or minimized. Run linters/formatters, collapse repetition, and extract helpers where they aid clarity.

4. Analyze organization and structure (make sure it fits the architecture).

AI writes code in isolation. Confirm the change uses existing utilities, layers, and boundaries (domain/services/controllers/jobs). Check imports and module placement, avoid reinventing existing helpers, and align with repository conventions.

5. Validate inputs and assumptions (make the implicit explicit).

List the assumptions the AI made (default locale/timezone, allowed ranges, required fields). Add schema validation (DTO/class validators/JSON Schema). Test with empty, null, over-max, non-ASCII, unexpected-enum, and malicious-string inputs. And finally, enforce limits/timeouts.
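As a minimal sketch of what "make the implicit explicit" can look like, the snippet below validates an untrusted payload by hand using only the standard library. The field names, the length limit, and the allowed roles are all assumptions for illustration; in a real codebase you would more likely use the schema or validation library your project already depends on.

```python
from dataclasses import dataclass

MAX_NAME_LEN = 64                      # assumed limit for illustration
ALLOWED_ROLES = {"viewer", "editor"}   # assumed enum for illustration

@dataclass(frozen=True)
class CreateUser:
    name: str
    role: str

def parse_create_user(payload: dict) -> CreateUser:
    """Validate an untrusted payload; raise ValueError on anything unexpected."""
    name = payload.get("name")
    role = payload.get("role")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name is required")
    if len(name) > MAX_NAME_LEN:
        raise ValueError("name is too long")
    if not isinstance(role, str) or role not in ALLOWED_ROLES:
        raise ValueError("role is not allowed")
    return CreateUser(name=name.strip(), role=role)

print(parse_create_user({"name": " Ada ", "role": "editor"}))
# CreateUser(name='Ada', role='editor')
```

The point is that every assumption (required, bounded, enumerated) becomes a line a reviewer can read and a test can exercise.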

6. Perform security audits (minimum pass).

AuthN/AuthZ: Confirm endpoint checks identity and authorization paths; deny-by-default.
Inputs: Sanitize/validate inputs, prevent injection (SQL/NoSQL/command), and escape user-visible output.
Secrets: No secrets in code/diff/logs, use env/secret manager, and rotate any test keys.
Abuse controls: Add rate limits, size limits, and timeouts on network and disk operations. Run SAST/secret scan/SCA, and fix or justify findings.

7. Do a performance evaluation (right now, at a small scale).

Look for N+1s, needless network calls, unbounded loops, quadratic sorts. Add a micro-benchmark or run a quick load test for hot paths. Set sensible cache/timeout/retry with jitter where applicable.
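For the "retry with jitter" part, here is one common shape: exponential backoff where each wait is a random duration up to a growing cap. It is a sketch under assumptions (the function name, the default delays, and the flaky stand-in are all illustrative), and the sleep function is injectable so the behavior can be tested without real waiting.

```python
import random
import time

def retry_with_jitter(fn, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure wait a random time up to an exponentially growing cap."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # randomized wait spreads out retries

calls = {"n": 0}

def flaky():
    """Stand-in for a network call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry_with_jitter(flaky, sleep=lambda s: None))  # ok
```

In production code you would usually catch only the exception types you consider transient rather than a bare Exception.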

8. Manage dependencies (pin, justify, minimize).

Review any new libraries. Are they necessary? Maintained? License compatible? Pin versions, add lockfiles, and remove unused transitive additions.

9. Review documentation (what to add and where).

Ensure the docs are in line with the code. AI often changes some parts or adds code blocks at different places while resolving various issues. These changes might not make it into the docs.

10. Observability (see problems early).

Use structured logs with request/trace IDs, key counters/timers (success/error/latency), health/readiness probes, and a basic dashboard or alert stub.
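A minimal sketch of structured logs with request IDs, using only Python's standard logging module. The logger name, the field names, and the latency value are assumptions for illustration; many teams use a dedicated logging library for this instead.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "latency_ms": getattr(record, "latency_ms", None),
        })

logger = logging.getLogger("app.requests")  # logger name is an assumption
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id = str(uuid.uuid4())
logger.info("order created", extra={"request_id": request_id, "latency_ms": 42})
```

With one JSON object per line, you can filter all log lines for a single request by its ID, which is what makes failures traceable across services.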

11. Compliance and data handling (when applicable).

Identify any personally identifiable information (PII), document collection/retention, ensure masking/redaction in logs, verify dependency licenses and data-residency constraints.

How to Work Effectively with AI Tools

At this point, you can probably see why it’s very important to understand the actual skills involved in AI-assisted development.

There’s a pretty big difference between an experienced developer who uses AI tools to help them get more done, and a newbie who thinks AI can build the next Facebook or Google just with a simple prompt.

An inexperienced dev will ask AI something like "Hey, build me Twitter and make no mistakes."

But an experienced developer who has solid fundamentals might say something like:

"AI, we're building a Twitter replica. Use $SQL_Database, Use $Language, Avoid $Common_Pitfalls, Follow $Standard_Practices."
"The generated code is prone to X problem, implement this fix."
"Implementation of $X is flawed because of $Y, do $Z instead."

So as you can see, you still need to know the how's and the why's and what depends on what. Often you’ll just need to make the changes manually, because it will be faster. And you don’t want to outsource the critical thinking part, which is the part that AI can't actually do.

LLMs are good at information retrieval. If you know nothing about what you’re looking for, then asking an AI isn’t going to be that helpful (or that reliable). But if you have an idea, some background knowledge/context, and the skills to verify AI’s responses, then it can be really helpful.

Last month, I shared in my newsletter how my current coding loop looks in practice.

I draft with Claude Code (or Copilot/Cursor), open a PR, and let an AI reviewer like CodeRabbit (or Copilot PR Reviewer / Cursor Bugbot or Greptile) do the first pass. CI runs tests and scans.

I repeat until everything’s green and the PR is ready to merge. It’s fast, but it’s still disciplined.

If you want to understand why this kind of workflow is becoming essential, read this article: Era of AI Slop Cleanup Has Begun. I talk about what’s happening in AI-assisted engineering, where generating code is easy, but keeping it clean and production ready takes experience – and you must have good programming skills.

Conclusion

AI-generated code can boost productivity – but production value still comes from software that is robust, secure, and maintainable.

Mindless code generation creates technical debt. But when you integrate AI thoughtfully, with guardrails, verification, tests, security checks, and documentation, you can go faster without lowering your standards.

That's it for this article. I hope you learned something new today.

If you have any questions about code reviews, engineering, startups, or business in general, please find me on Twitter: @TheAnkurTyagi. I’d be more than happy to discuss them.

Want to read more interesting articles like this?

You can read more about the latest dev tools like this one on my website.

How to Improve Your Programming Skills by Building Games

Manish Shivanandhan — Thu, 30 Oct 2025 13:14:40 +0000

When most people think about learning to code, they imagine building websites or automating small tasks. Few think of building games as a serious way to improve programming skills.

But creating even a simple game can teach lessons that no tutorial ever could. Games force you to think about performance, user input, structure, and creative problem-solving all at once.

When I started building small 2D games as weekend projects, I didn’t realize how much they would sharpen my overall coding skills. From learning how to organize complex systems to handling real-time input, every part of game development stretched my thinking.

Whether you’re a web developer, mobile engineer, or hobby coder, building games will make you a stronger problem solver.

Here are ten programming skills you’ll learn along the way.

1. Thinking in Systems
2. Writing Event-Driven Code
3. Optimizing for Performance
4. Debugging Complex States
5. Handling User Input Responsively
6. Building Reusable Game Loops and Engines
7. Managing Complexity Through Components
8. Learning the Math That Actually Matters
9. Sharpening Your Design and UX Instincts
10. Embracing Creative Problem Solving
Conclusion

1. Thinking in Systems

Every game is a set of systems working together. You might have a physics system that controls movement, a rendering system that draws the visuals, and an AI system that decides how enemies react.

Each one depends on the others, but they must remain separate enough to be managed and improved without breaking the rest of the game.

This is exactly what developers deal with in larger software projects. Building a game helps you understand modular design and why separating logic into smaller, independent parts makes everything easier to scale and debug.

You stop writing long scripts that try to do everything and instead start thinking in terms of systems that talk to each other through clear rules.

2. Writing Event-Driven Code

Games live and breathe on events. A button press, a collision, or a timer hitting zero are all events that trigger actions.

When you code a game, you quickly learn to think in event loops. This helps you understand how asynchronous code works in real life.

If you’ve struggled with JavaScript event listeners or backend message queues, building a small game is the perfect way to get comfortable with them.

Every time a player jumps, attacks, or collects an item, you’re writing code that listens for an event and reacts in real time. That experience makes you a better developer, even outside of gaming.
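To make the listen-and-react idea concrete, here is a minimal publish/subscribe dispatcher in Python. The class name EventBus, the event name, and the handlers are assumptions for illustration; game frameworks ship their own event systems, but the shape is usually similar.

```python
from collections import defaultdict

class EventBus:
    """Minimal publish/subscribe dispatcher."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        """Register a handler to be called whenever `event` is emitted."""
        self._handlers[event].append(handler)

    def emit(self, event, **data):
        """Call every handler registered for `event`, in registration order."""
        for handler in list(self._handlers[event]):
            handler(**data)

bus = EventBus()
state = {"score": 0, "log": []}

def on_coin(value):
    state["score"] += value

bus.on("coin_collected", on_coin)
bus.on("coin_collected", lambda value: state["log"].append(f"+{value}"))

bus.emit("coin_collected", value=10)
bus.emit("coin_collected", value=5)
print(state)  # {'score': 15, 'log': ['+10', '+5']}
```

The code that emits the event doesn't know who is listening, which is the same decoupling you get from DOM event listeners or message queues.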

3. Optimizing for Performance

Unlike websites, games can’t afford to lag. A delay of even a few milliseconds can break the experience.

When you write games, you learn to measure performance constantly. You start thinking about memory usage, CPU load, and rendering time.

You might experiment with how often to update physics calculations or how to reuse textures instead of loading them every frame.

Those small optimizations become second nature, and later, when you’re building a web app or a backend service, you’ll know exactly where to look when something feels slow.

4. Debugging Complex States

Games are full of moving parts that interact in unpredictable ways. Maybe a character disappears after jumping twice, or a power-up triggers twice because of overlapping timers. These problems force you to learn structured debugging.

You’ll get used to adding logs, reproducing edge cases, and isolating bugs by breaking large systems into smaller ones. The patience and process you develop while debugging a tricky game bug translate perfectly to real-world software.

You become the kind of developer who doesn’t panic when something goes wrong because you’ve already handled far more chaotic code in your side projects.

5. Handling User Input Responsively

When you build a game, user input becomes one of your main concerns. You want the player’s actions to feel instant.

That means learning how to manage input devices like keyboards, mice, or PC controllers. You’ll discover how to debounce actions, prevent lag, and detect simultaneous keypresses. You might even test your code with a physical controller to make sure it feels smooth and accurate.

This focus on responsiveness changes how you approach every future project. You begin to see every button click or touch gesture as part of a feedback loop that should feel immediate and natural.
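As a rough illustration of debouncing an action, the sketch below accepts the first press and ignores repeats of the same action inside a cooldown window (a leading-edge debounce, sometimes called a cooldown). The class name, the 0.2 second window, and the press timestamps are assumptions for illustration.

```python
class Debouncer:
    """Ignore repeat triggers of the same action inside a cooldown window."""
    def __init__(self, cooldown):
        self.cooldown = cooldown
        self._last = {}  # action name -> time of last accepted trigger

    def accept(self, action, now):
        last = self._last.get(action)
        if last is not None and now - last < self.cooldown:
            return False
        self._last[action] = now
        return True

jump = Debouncer(cooldown=0.2)
presses = [0.00, 0.05, 0.15, 0.25, 0.50]  # hypothetical key-press times in seconds
print([jump.accept("jump", t) for t in presses])
# [True, False, False, True, True]
```

Passing the timestamp in, instead of reading the clock inside the class, keeps the logic deterministic and easy to test.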

6. Building Reusable Game Loops and Engines

After writing a few games, you’ll realize that many parts of your code repeat. The main loop that updates the world, the input handlers, and the collision checks all follow patterns. This realization leads to a powerful skill: abstraction.

You start building small frameworks or reusable components that handle these repetitive tasks. In doing so, you learn the same lessons that professional developers learn when they design APIs or internal tools.

The discipline of turning messy scripts into organized, reusable code teaches you about structure and design in a way that theory never can.
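One example of a loop worth extracting is a fixed-timestep update: the simulation always advances in equal steps, however long each rendered frame takes. The sketch below is a simplified model with assumed names and values; the frame durations are passed in as a list so the example runs without a real clock or renderer.

```python
def run_fixed_timestep(frame_times, update, dt=1 / 64):
    """Advance the simulation in fixed dt steps regardless of frame duration."""
    accumulator = 0.0
    steps = 0
    for frame_time in frame_times:  # seconds each rendered frame took
        accumulator += frame_time
        while accumulator >= dt:
            update(dt)
            accumulator -= dt
            steps += 1
        # a real loop would render here, once per frame
    return steps

position = {"x": 0.0}

def update(dt):
    position["x"] += 128.0 * dt  # move at 128 units per second

steps = run_fixed_timestep([2 / 64, 3 / 64, 1 / 64], update)
print(steps, position["x"])  # 6 12.0
```

Once this loop lives in one reusable function, every new game only has to supply its own update logic.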

7. Managing Complexity Through Components

Game developers often use something called an Entity-Component-System (ECS) architecture. It’s a way of organizing objects in a game so they can share behavior without heavy inheritance trees. For example, a player and an enemy might both have movement and health components, but different AI logic.

This pattern is very similar to how modern front-end frameworks work. If you use React, you already think in components. Building games strengthens that habit.

You start to see every system, UI, physics, AI, as a component that can be composed and reused. It’s one of the most powerful ways to manage complexity in any large codebase.
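Here is a deliberately tiny sketch of the ECS idea in Python: entities are plain ids, components are small data classes stored per type, and a system is a function that runs over every entity with the components it needs. All names and values are illustrative assumptions, and real ECS libraries add much more (queries, archetypes, scheduling).

```python
from dataclasses import dataclass

@dataclass
class Position:
    x: float
    y: float

@dataclass
class Velocity:
    dx: float
    dy: float

@dataclass
class Health:
    hp: int

# Entities are just ids; components live in per-type stores keyed by entity id.
positions, velocities, healths = {}, {}, {}

player, enemy, rock = 1, 2, 3
positions[player], velocities[player], healths[player] = Position(0, 0), Velocity(1, 0), Health(100)
positions[enemy], velocities[enemy], healths[enemy] = Position(5, 5), Velocity(0, -1), Health(30)
positions[rock] = Position(9, 9)  # no velocity, no health

def movement_system(dt):
    """Update every entity that has both a position and a velocity."""
    for eid in positions.keys() & velocities.keys():
        positions[eid].x += velocities[eid].dx * dt
        positions[eid].y += velocities[eid].dy * dt

movement_system(dt=2.0)
print(positions[player], positions[enemy], positions[rock])
# Position(x=2.0, y=0.0) Position(x=5.0, y=3.0) Position(x=9, y=9)
```

The player and the enemy share movement and health without sharing a base class, and the rock is skipped by the movement system simply because it has no velocity component.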

8. Learning the Math That Actually Matters

Many developers shy away from math, but games make it practical. When you need to move a character along a curve, calculate projectile motion, or detect collisions, you’re forced to use geometry, trigonometry, and vectors.

The best part is that you learn it through doing, not memorizing formulas. You begin to understand how angles, distances, and forces interact in a way that feels visual and intuitive. Later, when you face algorithmic problems or data visualizations, that math background helps you approach them with confidence.
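For a taste of that math, the sketch below uses trigonometry to split a launch speed into components, the standard constant-gravity formula for projectile position, and the Euclidean distance you would use in a simple collision check. The function names and the launch values are assumptions for illustration, and air resistance is ignored.

```python
import math

def launch_velocity(speed, angle_deg):
    """Split a launch speed into horizontal and vertical components."""
    angle = math.radians(angle_deg)
    return speed * math.cos(angle), speed * math.sin(angle)

def position_at(t, vx, vy, gravity=9.81):
    """Projectile position after t seconds, starting at the origin (y is up)."""
    return vx * t, vy * t - 0.5 * gravity * t * t

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

vx, vy = launch_velocity(speed=20.0, angle_deg=45.0)
flight_time = 2 * vy / 9.81  # time until the projectile returns to y = 0
landing = position_at(flight_time, vx, vy)
print(round(landing[0], 2))       # horizontal range in the same units as speed * time
print(distance((0, 0), (3, 4)))   # 5.0
```

Changing the angle and watching the landing point move is exactly the kind of visual feedback that makes these formulas stick.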

9. Sharpening Your Design and UX Instincts

Good games feel right. The jump height, the delay between actions, the feedback when you collect a coin, every small detail affects how enjoyable the game feels.

When you design these experiences, you’re learning about user experience design without even realizing it.

You begin to think about things like timing, feedback, and accessibility. You learn how to make interactions satisfying and clear.

The same mindset applies when you build apps or websites. You start designing not just for functionality but for how it feels to use.

10. Embracing Creative Problem Solving

Games are rarely built in a straight line. You’ll face problems that don’t have clear answers.

Maybe you need a way to fake physics without heavy computation or make AI feel smarter than it is. These challenges train you to think creatively.

You’ll often come up with unconventional but clever solutions. That kind of flexible problem-solving becomes one of your most valuable programming skills.

When something breaks in production or a feature seems impossible under current constraints, you’ll know how to find a creative way around it because you’ve done it before in your own projects.

Conclusion

Building games is more than a hobby. It’s an accelerated crash course in becoming a better developer. You’ll write cleaner code, understand systems thinking, and develop a sharp sense for performance and design. You’ll also have fun in the process, which keeps your motivation alive longer than any tutorial series can.

Each project you build will teach you something new about programming. The lessons won’t come from books but from the moments you struggle, test, and finally see your creation come to life. Build something that teaches you back, and you’ll grow as both a coder and a creator.

Hope you enjoyed this article. Connect with me on LinkedIn or visit my website.

The Architecture of Mathematics – And How Developers Can Use it in Code

Tiago Capelo Monteiro — Fri, 23 May 2025 15:06:16 +0000

"To understand is to perceive patterns." - Isaiah Berlin

Math is not just numbers. It is the science of finding complex patterns that shape our world. This means that to truly understand it, we need to see beyond numbers, formulas, and theorems and understand its structures.

The main goal of this article is to show how math is just like a growing tree of ideas. I want to show that math is a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math more deeply and how you can apply it to programming.

I’ve also included some code examples here to help you connect theory and practice. I show them to demonstrate how math ideas are applied to real problems. Whether you are new to more advanced math or are more experienced, these code examples will help you understand how to apply math in programming.

This link between theory and application reflects my own studies. I am in the final year of an undergraduate degree in Electrical and Computer Engineering at NOVA FCT, one of the best engineering faculties in Portugal.

My degree is heavy on math and physics, because a solid grasp of math is key to understanding electronics, telecommunications, control theory, and other areas of engineering.

Here’s a brief overview of some of the math and physics subjects I’ve learned:

Partial Differential Equations (PDEs): These equations model real-world phenomena, from heat diffusion to the economy of a country.
Harmonic Analysis (Fourier & Laplace): Integral transforms like the Fourier and Laplace transform allow us to understand problems in new domains.
Complex Analysis: Extending calculus into the complex plane gives rise to powerful tools used in physics and engineering.
Numerical Analysis: When analytical solutions are impossible or inefficient, numerical methods provide computer-based approximations. This is crucial for real-world applications.
Control and Signal Theory: These areas show us how to design stable systems like rockets, trains, and robots.
Physics: Courses in Classical Mechanics and Electromagnetism helped bridge theoretical math to physical laws.

During my years of study, besides technical skills, I’ve developed a deeper understanding of how the world works and the structure of the field of mathematics. And I’ve started to find patterns in how math is a framework of interconnected logic.

In this article, we’ll explore:

Simple Analogy: The Tree of Mathematics
The Structure and History of Mathematics
A Tree Example: Foundations of Relativity by Albert Einstein
The Biggest Paradox of Math, Discovered by Kurt Gödel
What About Applied Math and Engineering?
Code Examples – Analytical and Numerical Approaches
The Impact of a Grand Unified Theory of Mathematics
A Final Lesson From History

Simple Analogy: The Tree of Mathematics

Imagine math as a vast tree growing forever.

The roots of the tree are the foundations of mathematics: logic and set theory. From this foundation emerge the main basic fields of math: arithmetic, algebra, geometry, and analysis.

As the tree divides further and further into more branches, new, more complex subfields start to appear, like topology, abstract algebra, and complex analysis. Sometimes the branches are connected to each other.

And remember: this tree is always growing in many directions. From branches creating new branches to branches connecting to other branches. Little by little, it grows.

Throughout history, there have been times when, thanks to big scientific discoveries, parts of the math tree grew very fast. At other times, decades and even centuries passed without many new branches. Imaginary numbers, for example, took centuries to be widely accepted.

And you might wonder: How many more branches and connections between them will keep appearing?

The Structure and History of Mathematics

The first mathematical ideas appeared independently across ancient civilizations. For example:

India’s invention of zero
Islamic algebraic advances
Greek geometric rigor

Over time, many different great mathematicians created and shared them by writing and giving lectures.

Eventually, these new ideas were shared widely with new generations and these new generations created new math based on old math.

This is how new branches are continuously born from previous branches of the tree of mathematics.

And this is why Isaac Newton wrote, in a letter to Robert Hooke in 1675:

If I have seen further, it is by standing on the shoulders of giants

He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.

Yet, the real power of math lies in practicing it over and over and understanding it more and more deeply. As one of my professors once explained:

More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.

Very often, to solve problems, it is necessary to think in terms of first principles and build from there. Math teaches exactly that. In this way, math is not just an academic subject. It is a language spoken by scientists and engineers around the globe.

By having it well preserved and shared, it is still possible to create new math from previous ideas. And it’s possible for the big tree to continue growing based on previous branches or nodes.

A Tree Example: Foundations of Relativity by Albert Einstein

Albert Einstein created the general and special theories of relativity. These have big consequences nowadays:

GPS and Global Communication
Advancements in Satellite Telecommunications
Space Exploration and Satellite Launches

But this was only possible through the unification of geometry with calculus, called differential geometry. The evolution of differential geometry happened over the centuries, thanks to many great mathematicians. Below are some of them, but this is not a complete list:

Euclid (circa 300 BCE): Contributed to geometry, laying the groundwork for later mathematical systems
Archimedes (circa 287–212 BCE): Pioneered the understanding of volume, surface area, and the principles of mechanics
René Descartes (1596–1650): Developed Cartesian coordinates and analytical geometry
Isaac Newton (1642–1727) & Gottfried Wilhelm Leibniz (1646–1716): Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.
Leonhard Euler (1707–1783): Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.
Gaspard Monge (1746–1818): The father of differential geometry and pioneer in descriptive geometry
Carl Friedrich Gauss (1777–1855): Made groundbreaking advances in geometry, including the concept of curved surfaces.
Bernhard Riemann (1826–1866): Introduced Riemannian geometry, a branch of differential geometry.

Once again, as Isaac Newton wrote, in a letter to Robert Hooke in 1675:

If I have seen further, it is by standing on the shoulders of giants.

Albert Einstein saw what no one else in his time saw, thanks to these great math giants and countless others.

The Biggest Paradox of Math, Discovered by Kurt Gödel

The biggest paradox in math, in my opinion, is what Kurt Gödel discovered. His early 20th century research revealed a limitation within this cycle.

This paradox – that is, his incompleteness theorems – shows that in any consistent formal system capable of expressing simple arithmetic, there will always be true mathematical statements that cannot be proven within the system itself.

This means that in all such systems, there are limits to what you can actually prove to be true or false. For mathematicians, this means that the tree will never be completed. There are truths that lie beyond formal proof, and yet we still assume that they are true (albeit unproven).

In other words, no matter how many mathematicians work in the field or how much AI is used to find new mathematics, these limitations will always exist. Some true statements are impossible to prove, and we believe them only because of numerical evidence, approximations, and other methods that fall short of an exact logical proof.

What About Applied Math and Engineering?

Applied math and engineering involve interpreting the same pure math ideas in real-world scenarios. In many cases, a single tool is actually a combination of many math ideas. Let’s consider some examples:

Principal component analysis (PCA) is a widely used tool in data science. It is a mixture of linear algebra (the eigenvalues and eigenvectors of the data’s covariance matrix) with optimization (keeping the components with the largest eigenvalues, which capture the most variance) in order to reduce the number of dimensions in a dataset.

In machine learning, logistic regression is a mixture of calculus with statistics and probability.

In harmonic analysis, Laplace, Fourier, and Z-transforms are a way to see the same thing in a new domain to get new insights. In this case, integrals are used to make this mapping.

In deep learning, neural networks are essentially many matrices that are multiplied together and updated so that they adapt to model a dataset representing a system. This optimization of matrix values relies on activation functions, backpropagation (which computes how much each value contributed to the error), and a gradient descent-based optimization method (which uses those gradients to update all the matrix values).

I have actually written an article where I teach why activation functions are important if you want to check it out.
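To see how calculus and probability meet in logistic regression, here is a minimal sketch trained with plain gradient descent. The tiny dataset, the learning rate, and the function names are assumptions made up for this illustration, not taken from any library.

```python
import math


def sigmoid(z):
    # Probability: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Calculus: derivative of the log-loss with respect to w and b.
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Gradient descent: step against the average gradient.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)
```

The same loop, with matrices instead of single numbers and with backpropagation computing the gradients layer by layer, is the core of how neural networks are trained.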

But the best example of this fusion of math with engineering is in control theory.

Control theory is the study of the architecture of systems. From trains to cars to airplanes, everything is based on control theory. It is everywhere in nearly all modern electronic devices. In electric circuits, control theory is also used heavily to guarantee circuit stability in the face of electric disturbances.

So as you can probably start to see, many of the tools we now have are just combinations, or recipes, of pure math ideas. In essence, applied math is the application of pure math as “ingredients” in “recipes” to solve problems.

So, we’ve explored the structure and evolution of mathematics. Yet, it is important to see how these ideas can be applied in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.

Code Examples – Analytical and Numerical Approaches

These code examples demonstrate a couple of ways you can use Python to solve math equations.

In the first code example, we’ll solve the problem in the same way that kids in school solve math exercises: essentially, by hand with a pencil, moving variables from one side of the equation to the other to find their values. In the second example, we’ll solve a problem using numerical analysis.

Example 1: Solve a Problem Analytically

When we solve math problems analytically, like we did in school, we are manipulating symbols to get exact values. Often these symbols are x, y, and z. In Python, we can do this using the SymPy library:

from sympy import symbols, Eq, solve

x, y = symbols('x y')
eq1 = Eq(2*x + 3*y, 6)
eq2 = Eq(-x + y, 1)

solution = solve((eq1, eq2), (x, y))
print(solution)

Essentially, we are finding the x and y that satisfy this system of equations:

$$\begin{align*} 2x + 3y &= 6 \\ -x + y &= 1 \end{align*}$$

Which gives us the following result:

{x: 3/5, y: 8/5}

Or:

x = 0.6
y = 1.6

When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.

But many times, problems are harder and can’t be solved just by moving symbols from one side of the equation to the other.

Sometimes, there can be so many symbols and transformed versions of them, with things like derivatives and integrals, that the problem becomes very hard to manage and takes a lot of time to solve.

For this reason, there is an area of mathematics, called numerical analysis, devoted to finding approximate solutions to problems that are already stated as mathematical formulas. It makes it faster to solve these problems, and it is the method we will explore next.
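As a quick illustration of why approximation matters, the equation cos(x) = x has no simple closed-form solution that you can reach by rearranging symbols. The sketch below uses bisection, one of the simplest numerical methods. The function name bisect_root and the tolerance are choices made for this example only.

```python
import math


def bisect_root(f, lo, hi, tol=1e-10, max_iter=200):
    """Approximate a root of f in [lo, hi] by repeatedly halving the interval."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0 or (hi - lo) / 2 < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # the sign change is in the left half
        else:
            lo, f_lo = mid, f_mid  # the sign change is in the right half
    return (lo + hi) / 2


root = bisect_root(lambda x: math.cos(x) - x, 0.0, 1.0)
print(round(root, 6))  # 0.739085
```

The answer is not exact, but it is as close as the tolerance asks for, which is usually all an engineering problem needs.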

Example 2: Solve Numerically (Approximation)

We’ll now use SciPy to solve a larger system, with five equations and five unknowns, using numerical methods:

import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])

b = np.array([12, 5, 7, 9, 10])

solution = solve(A, b)

print(solution)

In this code example, this line of code:

solution = solve(A, b)

Uses the solve function from the SciPy Python library:

from scipy.linalg import solve

It’s a function that finds the values of x in an equation A⋅x=b, where A is a square matrix (a square grid of numbers) and b is a vector (a list of numbers). Running the code gives us the following:

[ 1.35022026 -0.79955947 -1.17180617  3.14317181 -0.83920705]
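To get a feel for what a solver like this does, here is a pure-Python sketch of Gaussian elimination with partial pivoting, followed by a residual check. This is a teaching sketch, not SciPy’s actual implementation (which delegates to optimized LAPACK routines), and the helper names are made up for this example.

```python
def gaussian_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], so every row operation applies to both sides.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]

        # Eliminate this column from every row below the pivot row.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]

    # Back substitution, from the last unknown up to the first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def residuals(A, x, b):
    """Return A x - b row by row; values near zero mean x is a good solution."""
    return [sum(a * xi for a, xi in zip(row, x)) - rhs for row, rhs in zip(A, b)]


A = [[3, 2, -1, 4, 5],
     [1, 1, 3, 2, -2],
     [4, -1, 2, 1, 0],
     [5, 3, -2, 1, 1],
     [2, -3, 1, 3, 4]]
b = [12, 5, 7, 9, 10]

x = gaussian_solve(A, b)
print([round(v, 8) for v in x])
print(max(abs(r) for r in residuals(A, x, b)))
```

The printed values should agree with the SciPy output above up to rounding, and the largest residual should be close to zero. Checking residuals like this is a cheap habit that catches many numerical mistakes.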

Now imagine, in this simple case, that a matrix like A could represent the traffic flow between cities or intersections, and b could represent the traffic entering or leaving each city.

By solving the system, it could help us determine the distribution of traffic between cities to meet desired traffic conditions.

Of course, these types of problems are far more complex in real life. But to understand and solve the big problems, you need to first understand the smaller problems.

And by the way, a system of linear equations can always be written as a matrix equation. We represent systems of equations as matrices because that makes calculations easier and lets us use linear algebra to check the properties of the system and understand it better.

In essence, a matrix represents a system of equations, and systems of equations can represent real-life phenomena like the economy of a country or the weather.

If you want to know more, I wrote an entire article on numerical analysis that you can check out.

The Impact of a Grand Unified Theory of Mathematics

Despite the biggest paradox in mathematics, what would happen with a Grand Unified Theory of Mathematics?

Remember that Gödel’s theorems tell us that there are true statements that are impossible to formally prove, and we just need to accept that. But even with this limitation, it is still possible to work toward unifying math.

This is what the Langlands program is trying to achieve. It is an attempt to interconnect some of the largest parts of the big tree of math to uncover new patterns.

With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.

What is the value of this big unification for society?

By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:

In the 19th century, James Clerk Maxwell united the fields of electricity and magnetism with the famous Maxwell’s equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for all technological progress in the 20th and 21st centuries.
In the 20th century, the unification of algebra with logic led to the rise of digital systems. In turn, digital systems gave rise to processors and the evolution of computers into the modern laptop.
Also in the 20th century, the unification of probability and communication led to information theory. This became the foundation for the internet. This unification was carried out by a great mathematician called Claude Shannon.

In the end, a Grand Unified Theory of Mathematics could be one of the biggest achievements in modern society.

It could lead to new discoveries in physics, such as in string theory or quantum gravity, where deep mathematical structures are needed to create new physics. In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models. It could also open the door to new cryptographic methods and material science advances, revealing, with math, the deep patterns still not found in these fields.

Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.

A Final Lesson From History

From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it is possible to see its role in finding the patterns of our universe. I hope I was able to make you see math in this way.

In addition, we can conclude that the unification of scientific fields lays the foundations for new innovations that help society move forward. Many profound societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.

Find the full code here.

How to Refactor Complex Codebases – A Practical Guide for Devs

Ankur Tyagi — Wed, 21 May 2025 15:47:44 +0000

Developers often see refactoring as a secondary concern that they can delay indefinitely because it doesn’t immediately contribute to revenue or feature development.

And managers frequently view refactoring as "not a business need" until it boils over and becomes the most significant business need possible.

"Oh, our software somehow works. We can't implement any new changes. And oh, everyone is quitting because work is miserable."

In this article, I’ll walk you through the steps I use to refactor a complex codebase. We’ll talk about setting goals, writing tests, breaking up monoliths into smaller modules, verifying changes, making sure existing features still work, and keeping tabs on performance. I’ll also show you how to speed up reviews using AI tools.

By following these steps, you can turn complex, fragile code into a clean, reliable codebase your team can own.

The Issue of Technical Debt

As projects grow and evolve, technical debt increases. Code that was once functional and manageable turns into an unmaintainable mess, where even small changes become risky and time-consuming.

Despite the obvious need for cleanup, refactoring rarely gets prioritized because there’s always something more urgent: new features, bug fixes, and client demands.

I’ve had conversations with engineers, many of whom are working on enterprise software and are fully aware of their codebase's code smells and inconsistencies. They dislike the situation but feel powerless to change it.

So how do we shift from a culture of writing for pure functionality to a culture that values maintainability, especially for complex codebases?

It’s usually a mistake to completely halt new feature development for a long refactoring period (except perhaps in emergencies). Business needs still exist, and putting everything on hold can create tension and lost opportunities. It’s better to find a balance so you’re still delivering value to users even as you clean under the hood.

While there is no one-size-fits-all solution, a structured approach can help teams introduce sustainable refactoring practices, even in environments where management is resistant. Let’s explore how this works.

What is Refactoring?

Many people all too often use the word "refactor" when they mean a targeted rewrite.

As Martin Fowler famously said,

“Refactoring is a controlled technique for improving the design of an existing code base. Its essence is applying a series of small behavior-preserving transformations... However, the cumulative effect... is quite significant.”

In practice, this means continuously polishing code to reduce complexity and technical debt.

While traditional software development follows a linear approach of designing first and coding second, real-world projects often evolve in ways that lead to structural decay. Refactoring counteracts this by continuously refining the codebase, transforming disorganized or inefficient implementations into well-structured, maintainable solutions.

A targeted rewrite is a focused overhaul of a specific aspect of an application, often affecting multiple parts of the codebase. It carries more risk than refactoring but is still controlled and contained.

Preparing for Refactoring

Even the most skilled refactoring effort can stall without proper preparation. Before you start moving code around, laying a foundation that will keep your work organized and your team on the same page is crucial.

Here are some steps you can take to ensure your refactoring efforts are successful.

Secure Management Buy-in

As I’ve already discussed, getting time for refactoring can be difficult in feature-driven organizations. Often, management will accept refactoring investment if you can tie it to business outcomes: faster time to market, fewer outages (which translates to happier customers), and the ability to take on new initiatives.

Make those connections explicit. For example, you could say:

“If we refactor our reporting engine now, it will make it feasible to add the analytics module next quarter, which unlocks a new revenue stream.”

Or use data:

“We spent 30% of our last sprint fixing bugs in module Y. After refactoring Y, we expect that to drop significantly, freeing time for new features.”

Business-minded arguments help justify the balance.

Ensure a Safety Net with Automated Testing

As you refactor, tests are your safety net. Before modifying a component, write characterization tests around it if they don’t exist.

# example: characterization test for a legacy function
def legacy_calculate_discount(price, rate):
    # ... complex logic you don't fully understand yet ...
    return price * (1 - rate/100) if rate < 100 else 0

def test_legacy_calculate_discount():
    # capture existing behavior
    assert legacy_calculate_discount(100, 10) == 90
    assert legacy_calculate_discount(50, 200) == 0

These tests capture the current behavior, so you’ll know if you accidentally change it. Unit tests, integration tests, and e2e tests all validate that refactoring hasn’t broken anything.

It’s often worth investing time in setting up a continuous integration pipeline so that every change triggers automated tests. This gives rapid feedback and confidence that you’re not introducing regressions. Robust testing and CI/CD enable you to move faster and refactor with peace of mind.

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install -r requirements.txt
      - run: pytest --maxfail=1 --disable-warnings -q

Identify High-Risk Areas

The first step is to figure out what to refactor. High-risk areas are parts of the code likely to cause bugs or slow development. Common signs include long methods, large classes, duplicate code, and complex conditional logic.

Such code “smells” often hint at deeper design problems. Static analysis tools can automatically flag these issues.

For example, SonarQube will mark code smells (like high complexity or long methods) that increase technical debt. Using SonarQube or similar tools, you can generate reports on code complexity (for example, cyclomatic complexity metrics) and find hotspots in the codebase that need more attention.
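If you want a rough feel for what such a metric measures, the sketch below approximates cyclomatic complexity per function using Python’s built-in ast module. It is a simplified approximation written for illustration (for instance, nested functions are also counted toward their parent) and is not how SonarQube computes its numbers.

```python
import ast

# Node types that add a decision point to a function.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_complexity(source):
    """Return {function_name: approximate cyclomatic complexity}."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1  # a function with no branches has exactly one path
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # "a and b and c" adds two extra decision points
                    score += len(child.values) - 1
            scores[node.name] = score
    return scores


sample = '''
def simple(x):
    return x + 1

def tangled(order):
    if order.total > 100 and order.customer.is_vip:
        for item in order.items:
            if item.discountable:
                item.price *= 0.9
    elif order.total > 50:
        return "small discount"
    return "done"
'''

print(approx_complexity(sample))  # {'simple': 1, 'tangled': 6}
```

Even a crude score like this is enough to sort functions from “probably fine” to “look here first.”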

Set Clear Refactoring Goals

Before refactoring code, define the goal.

Goals must be specific and measurable. For example, you might aim to reduce a class’s size or a function’s cyclomatic complexity by a certain amount or to increase unit test coverage from 60% to 90%.

Each goal is tied to a measurable outcome: shorter methods, fewer if statements or classes with a single responsibility, faster execution for processing orders, higher test coverage, and no unused code. These targets will guide our refactoring plan and let us verify when we’ve succeeded.

Tip: Write down your refactoring goals and share them with your team. This sets expectations that you’re not adding new features in this effort, just making the code cleaner and more robust. It also helps justify the time spent by showing the benefits (like more straightforward future additions and fewer bugs).

Techniques for Refactoring Complex Codebases

1. Identifying and Isolating Problem Areas

It can be overwhelming to decide where to start refactoring a large codebase. Not every part of the code needs refactoring – some areas are delicate or rarely touched.

The most impactful refactoring efforts typically target the “problem areas”: parts of the codebase that are overly complex, error-prone, or act as bottlenecks for development and performance. Identifying these areas is a crucial first step.

Techniques for Finding Hotspots

Team knowledge & developer frustration

Don’t underestimate the value of anecdotal information from the team. Which parts of the code do developers dread working in? Often, the team’s instincts point to areas that are hard to understand or modify (for example, “the accounting module is a black box, we hate touching it”). These could be areas to improve.

In my experience, simply asking, “If you had a magic wand, which part of the code would you rewrite?” yields very insightful answers.

Code complexity metrics

Use static analysis tools to measure cyclomatic complexity, code duplication, large functions/classes, and so on. Files or modules with extremely high complexity numbers or thousands of lines are good candidates for scrutiny. But static complexity alone doesn’t tell the whole story – a file might be ugly but rarely touched.

Change frequency (Churn)

Look at version control history to see which files are often changed, especially those associated with bug fixes or incidents.

Hotspot analysis

A robust approach combines complexity and change frequency to find “hotspots.” For example, a tool or technique plotting modules by their complexity and how often they change can highlight the problematic areas. CodeScene (a code analysis tool) popularized this: hotspots are parts of the code that are highly complex and frequently modified, indicating areas where “paying down debt has a real impact”.

If a module is a mess and developers are in it every week, improving that module will likely yield outsized benefits (fewer bugs, faster feature additions).
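A hotspot ranking can be sketched in a few lines. The example below counts how often each file appears in the output of git log --name-only --pretty=format: and multiplies that churn by a complexity score. The sample log, the complexity numbers, and the simple churn × complexity formula are all assumptions for illustration; tools like CodeScene use their own, more refined models.

```python
from collections import Counter


def churn_from_git_log(log_text):
    """Count how many commits touched each file.

    Expects the output of: git log --name-only --pretty=format:
    (one file path per line, blank lines between commits).
    """
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())


def rank_hotspots(churn, complexity):
    """Sort files by churn * complexity, highest first."""
    scores = {path: churn[path] * complexity.get(path, 0) for path in churn}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data standing in for a real repository.
sample_log = """billing/invoice.py
billing/tax.py

billing/invoice.py

utils/strings.py
billing/invoice.py
"""
sample_complexity = {
    "billing/invoice.py": 40,
    "billing/tax.py": 12,
    "utils/strings.py": 55,
}

ranking = rank_hotspots(churn_from_git_log(sample_log), sample_complexity)
for path, score in ranking:
    print(f"{score:5d}  {path}")
```

In this made-up data, utils/strings.py is the most complex file but is rarely touched, so billing/invoice.py ranks first.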

Performance bottlenecks and crashes

Some parts of the codebase become targets for refactoring because they cause frequent performance problems or outages. For instance, if a specific service or job crashes often or can’t keep up with the load, you might need to refactor it for stability.

How to Isolate Problem Areas

Once you’ve identified a hotspot or problem area, the next challenge is isolating it so you can refactor safely. In a complex system, nothing lives in complete isolation. That problematic module likely interacts with many others.

Here are strategies to isolate and tackle it:

Break dependencies (Create seams)

Michael Feathers (in Working Effectively with Legacy Code) introduced the concept of “seams” – places where you can cut into a codebase to isolate a part for testing or refactoring. This might mean introducing an interface or abstraction between components so you can work on one side independently.

For example, suppose PaymentService is tightly coupled to StripeGateway, with direct calls scattered throughout the code.

# payment_service.py

def charge_customer(order_id, amount):
    # Hardcoded dependency to Stripe
    stripe = StripeGateway()
    stripe.charge(order_id, amount)

To isolate and refactor the payment logic safely, you can introduce a PaymentProcessor interface and have PaymentService depend on that interface instead. Then, create an adapter like StripeAdapter that implements PaymentProcessor and delegates to the existing Stripe logic.

This way, you can safely refactor or even replace the Stripe integration behind the StripeAdapter without impacting PaymentService or any other module that uses it. As long as the PaymentProcessor interface is honored, the rest of the system remains unaffected.

# interfaces.py

class PaymentProcessor:
    def charge(self, order_id, amount):
        raise NotImplementedError


# stripe_adapter.py

class StripeAdapter(PaymentProcessor):
    def charge(self, order_id, amount):
        # Internally still uses Stripe
        stripe = StripeGateway()
        stripe.charge(order_id, amount)


# payment_service.py (Refactored)

class PaymentService:
    def __init__(self, processor: PaymentProcessor):
        self.processor = processor

    def charge_customer(self, order_id, amount):
        self.processor.charge(order_id, amount)
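One practical payoff of this seam is testability. The sketch below repeats minimal versions of PaymentProcessor and PaymentService so it runs on its own, and adds a hypothetical FakeProcessor test double that records charges instead of calling Stripe.

```python
class PaymentProcessor:
    def charge(self, order_id, amount):
        raise NotImplementedError


class PaymentService:
    def __init__(self, processor: PaymentProcessor):
        self.processor = processor

    def charge_customer(self, order_id, amount):
        self.processor.charge(order_id, amount)


class FakeProcessor(PaymentProcessor):
    """Hypothetical test double: records charges instead of calling Stripe."""

    def __init__(self):
        self.charges = []

    def charge(self, order_id, amount):
        self.charges.append((order_id, amount))


def test_charge_customer_delegates_to_processor():
    fake = FakeProcessor()
    PaymentService(fake).charge_customer("order-42", 1999)
    assert fake.charges == [("order-42", 1999)]


test_charge_customer_delegates_to_processor()
```

Because PaymentService only knows about the interface, the test needs no network access and no Stripe credentials.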

“Branch-by-abstraction”

This technique is related to the above and is often used in continuous delivery. The idea is to add a layer of abstraction (like an interface or proxy) in front of the old code, have both old and new code implementations behind it, and then gradually shift usage from the old to the new implementation. For a while, you might have a temporary state where both versions exist (perhaps toggled by a config or feature flag).

This is similar to how the strangler fig pattern works at an architectural level. It’s a bit of extra work (since you maintain two paths for a while), but it allows you to migrate functionality incrementally and fall back if needed.
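Here is a minimal sketch of branch-by-abstraction, assuming a simple in-memory feature flag. LegacyProcessor, NewProcessor, SwitchingProcessor, and the flags dictionary are hypothetical names for this example; in a real system the flag would come from your configuration or feature-flag service.

```python
class PaymentProcessor:
    def charge(self, order_id, amount):
        raise NotImplementedError


class LegacyProcessor(PaymentProcessor):
    def charge(self, order_id, amount):
        return f"legacy:{order_id}:{amount}"


class NewProcessor(PaymentProcessor):
    def charge(self, order_id, amount):
        return f"new:{order_id}:{amount}"


class SwitchingProcessor(PaymentProcessor):
    """Routes each call to the old or the new implementation based on a flag."""

    def __init__(self, legacy, new, use_new):
        self._legacy = legacy
        self._new = new
        self._use_new = use_new  # callable returning True or False

    def charge(self, order_id, amount):
        target = self._new if self._use_new() else self._legacy
        return target.charge(order_id, amount)


flags = {"new_payments": False}
processor = SwitchingProcessor(
    LegacyProcessor(), NewProcessor(), lambda: flags["new_payments"]
)

before = processor.charge("order-1", 500)  # handled by the legacy path
flags["new_payments"] = True
after = processor.charge("order-1", 500)   # handled by the new path
flags["new_payments"] = False              # instant rollback if something goes wrong
print(before, after)
```

Callers only ever see a PaymentProcessor, so flipping the flag (or rolling it back) requires no change to their code. Once the new path has proven itself, the legacy class and the switch can be deleted.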

Aim to identify the 20% of the code causing 80% of the problems. Focus your refactoring energy there for maximum impact. When you do, create a plan to isolate that area via abstractions, interfaces, modules, or other means so that you can work on it with minimal risk of side effects. The more you can contain the blast radius of a refactoring, the more confidently you can move forward.

2. Incremental vs. Big Bang Refactoring

One of the first strategic decisions is whether to approach the refactor incrementally or go for a “big bang” overhaul. In most cases, an incremental approach is preferable, but there are scenarios where larger, coordinated refactoring steps are worth considering.

Let’s break down what these mean, starting with a small example of an incremental step:

# before: one large function with multiple responsibilities
def process_order(order):
    validate(order)
    apply_discount(order)
    save_to_db(order)
    send_confirmation(order)
    log_metrics(order)
    update_loyalty_points(order)
    # potentially more steps 

# after: refactored incrementally into clearer, smaller units
def process_order(order):
    validate(order)
    apply_discount(order)
    persist_and_notify(order)

def persist_and_notify(order):
    save_to_db(order)
    send_confirmation(order)
    log_metrics(order)
    update_loyalty_points(order)

Incremental refactoring

This means making small, manageable changes over time rather than attempting a massive overhaul in one shot. The system should remain functional at each step (even internally in transition). The advantage is risk mitigation: each small change is less likely to go wrong, and it’s easier to pinpoint and fix if it does.

Incremental delivery lets you confirm changes in production and makes diagnosing issues easier since you’re only changing one small thing at a time. It also means the system keeps running during the refactor, so there’s less pressure to rush to “get the system back to working condition”. If priorities shift, you can pause after some increments and still have a working product.

Big bang refactoring (Rewrite)

This is the “tear it down and rebuild” approach. You stop adding new features, possibly freeze the code for a period, and devote a considerable effort to redesigning or rewriting a significant portion (or the entirety) of the system. The idea is to emerge on the other side with a brand new, clean system.

So when (if ever) is a big bang justified? Perhaps when the existing system is truly untenable – for example, an outdated technology that must be replaced (such as a platform that can’t meet new performance or security requirements or code written in a language no longer supported). Even then, wise teams often simulate a big bang by breaking it into stages or developing the new system in parallel.

Whenever possible, favor an incremental refactoring strategy. Teams successfully pull off massive transformations by treating the big refactor as a series of mini-refactors under a shared vision.

3. Breaking Down Monolithic Code

Many complex codebases start life as a single monolithic application: one deployable, a single code project, or a tightly coupled set of modules all maintained and released together.

Over time, monoliths can become unwieldy: builds take forever, a change in one area can unintentionally affect another, and teams become hard to scale because everyone is stepping on each other’s toes in the same code. A common refactoring challenge for senior engineers is modularising or splitting a monolith into more manageable pieces.

# define the interface
class PaymentProcessor:
    def charge(self, amount): ...

# old implementation
class LegacyProcessor(PaymentProcessor):
    def charge(self, amount):
        ...  # original code

# new implementation behind a feature flag
class NewProcessor(PaymentProcessor):
    def charge(self, amount):
        ...  # cleaner code

def get_processor():
    # config is assumed to be the application's settings object
    if config.feature_new_payment:
        return NewProcessor()
    return LegacyProcessor()

# usage remains the same
processor = get_processor()
processor.charge(100)

Strategies for modularization

Layer separation: Start by enforcing logical layer boundaries. For example, separate the user interface code from business logic and separate business logic from data access. In a messy monolith, these concerns often get mixed together. By organizing the code into layers (even within the same repository), you can limit the ripple effect of changes.
Domain-based modularization: If your system spans multiple business domains or functional areas, consider splitting along those lines. For example, an e-commerce monolith might be separated into modules like Accounts, Orders, Products, Shipping, and so on.
Each could become a subsystem or a package. The goal is to minimize the information these modules need to know about each other’s internals (high cohesion within modules and clear APIs between them).
Microservices or services extraction: In recent years, the trend has been to break monoliths into microservices, independent services that communicate over APIs. This form of architectural refactoring can significantly improve independent deployability and scalability. But it’s a significant undertaking with complexities (distributed systems, network calls, and so on). If you decide to go this route, do it gradually.
A proven method is the strangler fig pattern mentioned earlier: you pick one piece of functionality, rewrite or extract it as a separate service, and redirect traffic or calls to the new service while the rest of the monolith remains intact. You then repeat this iteratively for other pieces.
Modular monolith: Not every system needs to go full microservices. There’s an approach called a modular monolith, essentially structuring your single application into well-defined modules that communicate via explicit interfaces (almost like internal microservices but without the overhead of separate deployments).

This can give you many of the advantages of microservices (clear boundaries, separate development responsibility) while avoiding the operational complexity.

Identify shared utilities vs. truly independent components: In breaking down a monolith, some code is widely shared (like utility functions or cross-cutting concerns such as authentication). It might make sense to factor those into libraries or services first, as they will be needed by whatever other pieces you split out.
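To make the idea of modules talking through explicit interfaces more concrete, here is a minimal sketch. The names `InventoryAPI`, `InMemoryInventory`, and `OrderModule` are hypothetical and exist only for illustration. The point is that the Orders code depends on a narrow interface and never reaches into the Inventory module's internals.

```python
from dataclasses import dataclass
from typing import Protocol


class InventoryAPI(Protocol):
    """The only surface the Orders module may use from Inventory."""

    def reserve(self, sku: str, quantity: int) -> bool: ...


@dataclass
class InMemoryInventory:
    """Hypothetical Inventory module; its stock dict stays private to it."""

    stock: dict[str, int]

    def reserve(self, sku: str, quantity: int) -> bool:
        available = self.stock.get(sku, 0)
        if quantity <= 0 or available < quantity:
            return False
        self.stock[sku] = available - quantity
        return True


class OrderModule:
    """Hypothetical Orders module; depends on the interface, not the class."""

    def __init__(self, inventory: InventoryAPI):
        self._inventory = inventory

    def place_order(self, sku: str, quantity: int) -> str:
        reserved = self._inventory.reserve(sku, quantity)
        return "confirmed" if reserved else "rejected"


orders = OrderModule(InMemoryInventory(stock={"widget": 3}))
print(orders.place_order("widget", 2))  # enough stock
print(orders.place_order("widget", 2))  # only one left
```

Because `OrderModule` only knows about `reserve()`, the Inventory module could later move behind a network call or into its own service without changes to the Orders code.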

While breaking down a monolith, maintaining functionality during the transition is essential. Techniques like backward compatibility (discussed next) and thorough testing will be your safety net.

Finally, be prepared for the team workflow to change. If you move to microservices, teams might take ownership of different services, requiring more DevOps and communication across teams. If you keep a modular monolith, enforce code ownership or review rules to keep the modules from tangling up again (for example, you might restrict direct database access from one module to another’s tables, and so on).

4. Ensuring Backward Compatibility

A critical concern during large refactoring is: Will our changes break existing contracts?

In other words, can other systems, modules, or clients that rely on our code work as expected after we refactor? Backward compatibility is especially important if your codebase provides public APIs (to external customers or other teams), data persisted in a certain format, configuration files that users have written, etc.

Here are some strategies and considerations to maintain backward compatibility:

Suppose you have a widely-used function like send_email(to, subject, body). You want to refactor the internal logic to support additional features like HTML formatting, but you don’t want to break existing callers.

Instead of changing the function signature, you keep the public API unchanged and delegate to a new internal function:

# original API
def send_email(to, subject, body):
    ...  # send mail

# refactored internals, keep signature
def send_email(to, subject, body):
    send_email_v2(to=to, subject=subject, body=body)

def send_email_v2(to, subject, body, html=True):
    ...  # new implementation with HTML support

The internal send_email_v2() function adds new capabilities like HTML formatting, but older code using send_email() still works without any modifications.

If you're introducing a new, improved version like send_email_v2(to, subject, body, html=True), it's good practice to:

Mark the old version (send_email) as deprecated in documentation.
Ensure the old version internally calls the new one.
Give other teams time to migrate at their own pace.
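In Python, one way to mark the old function as deprecated in code (and not only in documentation) is the standard library's `warnings` module. This is a minimal sketch: the body of `send_email_v2` is a stand-in that returns a dict so the example can be inspected.

```python
import warnings


def send_email_v2(to, subject, body, html=True):
    # Stand-in for the new implementation; returns a dict for inspection.
    return {"to": to, "subject": subject, "body": body, "html": html}


def send_email(to, subject, body):
    """Deprecated: kept for existing callers, delegates to send_email_v2."""
    warnings.warn(
        "send_email() is deprecated; use send_email_v2() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return send_email_v2(to, subject, body)
```

Existing callers keep working, but anyone running tests with warnings enabled gets a nudge to migrate.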

Use versioning for external APIs

If your system provides an HTTP API or similar to external clients, the safest route for major changes is to version the API. Introduce a v2 API endpoint for the refactored logic, keep v1 running (maybe internally calling v2 or using a translation layer). Clients can move to v2 at their own pace.

It’s extra work to maintain two APIs temporarily, but it prevents a breaking change from angering users or causing outages. Always communicate changes clearly and provide migration guides if applicable.

Have a clear deprecation policy

Make sure there’s a policy (and communication) around how long deprecated features will be supported. For internal APIs, maybe it’s one release cycle. For external ones, maybe multiple cycles, or no removal at all without a major version bump. A good practice is to announce deprecation early.

If you’re exposing an HTTP API, consider introducing a new versioned endpoint (for example, /api/v2/send_email) and maintain the older /api/v1/send_email temporarily. Internally, v1 might call v2 with default parameters, ensuring behavior stays consistent for existing clients.
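Here is a rough, framework-free sketch of v1 delegating to v2. The handler names, payload fields, and the `ROUTES` table are hypothetical, and the choice that v1 always sends plain text is an assumption made for the example.

```python
def send_email_v2_handler(payload: dict) -> dict:
    """Hypothetical handler for /api/v2/send_email."""
    return {
        "status": "queued",
        "to": payload["to"],
        "subject": payload["subject"],
        "format": "html" if payload.get("html", True) else "text",
    }


def send_email_v1_handler(payload: dict) -> dict:
    """Hypothetical handler for /api/v1/send_email: translate, delegate, trim."""
    # Assumption for this sketch: v1 clients only ever sent plain text.
    v2_payload = {**payload, "html": False}
    v2_response = send_email_v2_handler(v2_payload)
    # Return only the fields v1 clients already know about.
    return {"status": v2_response["status"], "to": v2_response["to"]}


ROUTES = {
    "/api/v1/send_email": send_email_v1_handler,
    "/api/v2/send_email": send_email_v2_handler,
}

print(ROUTES["/api/v1/send_email"]({"to": "a@example.com", "subject": "Hi"}))
```

The v1 handler contains no business logic of its own, so there is only one implementation to maintain while both versions are live.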

In summary, maintain backward compatibility whenever possible, and implement a clear deprecation policy for anything you do change.

Write adapter or compatibility layers

In some cases, you can write an adapter to bridge old and new systems. For instance, suppose you refactor the underlying data model of your application, but you still have old configuration files in the old format. Rather than forcing all those files to be rewritten immediately, you could write a small adapter that translates the old format to the new one at runtime (or during startup). This way, old data continues to work.
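As a small illustration, suppose the old configuration was flat and the new one is nested. The key names below (`db_host`, `db_port`, `database`) and the default port are made up for this sketch.

```python
def adapt_legacy_config(config: dict) -> dict:
    """Translate a hypothetical old flat config into the new nested shape."""
    if "database" in config:
        # Already in the new format, nothing to translate.
        return config
    return {
        "database": {
            "host": config["db_host"],
            # Old files stored the port as a string; default is an assumption.
            "port": int(config.get("db_port", 5432)),
        }
    }


old_config = {"db_host": "localhost", "db_port": "5433"}
print(adapt_legacy_config(old_config))
```

The rest of the application only ever sees the new shape, so the adapter can be deleted once the old files have been migrated.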

Test for compatibility

Include tests that specifically ensure backward compatibility. For instance, if you have a public API, keep a suite of tests using the old API contracts and run them against the refactored code. They should still pass.
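A compatibility test can be as simple as pinning the public signature and the call styles that existing callers rely on. In this sketch, `send_email` is a stand-in that returns `True`; in a real codebase you would import the refactored function instead.

```python
import inspect


def send_email(to, subject, body):
    # Stand-in for the refactored public function.
    return True


def test_send_email_keeps_its_signature():
    params = list(inspect.signature(send_email).parameters)
    assert params == ["to", "subject", "body"]


def test_send_email_accepts_old_style_calls():
    # Positional and keyword calls that existing callers depend on.
    assert send_email("a@example.com", "Hi", "Body") is True
    assert send_email(to="a@example.com", subject="Hi", body="Body") is True


test_send_email_keeps_its_signature()
test_send_email_accepts_old_style_calls()
```

If someone later renames a parameter or adds a required one, the first test fails before any downstream team notices.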

In summary, ensure that as you refactor, the external behavior and contracts remain consistent. This careful approach protects your users and downstream systems, allowing you to reap the internal benefits of refactoring without causing external chaos.

5. Handling dependencies and tight coupling

One of the hairiest aspects of refactoring a large codebase is dealing with deeply interdependent code. Complex systems often suffer from tight coupling: Module A assumes details about Module B and vice versa, global variables or singletons are used all over, or a change in one place ripples through half the codebase.

Reducing coupling is a significant aim of refactoring because it makes the code more modular, meaning each piece can be understood, tested, and changed independently. So, how do we gradually loosen the coupling in a legacy system?

Let’s go over some strategies to reduce coupling.

Introduce interfaces or abstraction layers

A very effective way to decouple is to put an interface between components. For example, if you have a class that directly queries a database, introduce an interface and have the class use that instead. The underlying database code implements the interface.

# before: direct instantiation
class OrderService:
    def __init__(self):
        self.repo = OrderRepository()

# after: inject dependency
class OrderService:
    def __init__(self, repo):
        self.repo = repo

# wiring up in application startup
repo = OrderRepository(db_conn)
service = OrderService(repo)

Now, that class no longer depends on how the data is fetched. This applies the dependency inversion principle: depend on abstractions, not concretions.

Use dependency injection

Once you have interfaces, use dependency injection to supply concrete implementations. Many frameworks support DI containers, or you can do it manually (passing in dependencies via constructors). Dependency injection means code A doesn’t instantiate code B itself – instead, B is passed into A.

This approach also makes unit testing easier (you can inject mock dependencies).
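To show what that looks like in a test, here is a minimal sketch. `FakeOrderRepository`, `get_items`, and `order_total` are hypothetical names added for the example; only the constructor injection mirrors the `OrderService` shown earlier.

```python
class OrderService:
    def __init__(self, repo):
        self.repo = repo

    def order_total(self, order_id):
        # Hypothetical method: sums line-item prices fetched via the repo.
        return sum(item["price"] for item in self.repo.get_items(order_id))


class FakeOrderRepository:
    """In-memory test double; no database connection required."""

    def __init__(self, items_by_order):
        self._items_by_order = items_by_order

    def get_items(self, order_id):
        return self._items_by_order.get(order_id, [])


fake_repo = FakeOrderRepository({42: [{"price": 10}, {"price": 5}]})
service = OrderService(fake_repo)
print(service.order_total(42))
```

The test never touches a real database, so it runs quickly and fails only when the service logic itself is wrong.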

Facades or wrapper services

If a particular subsystem is heavily entangled with others, consider creating a Facade, an object that provides a simplified interface to a larger body of code. Other parts of the system then call the Facade, not the many internal methods of the subsystem. Internally, the subsystem can be refactored (even split into smaller pieces) as long as the Facade’s outward interface remains consistent.

This is similar to how microservices work (other services don’t care how one service is implemented internally – they just call its API), but you can do it in-process, too.
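A small in-process sketch of the idea follows. The billing classes and the flat 20% tax rate are invented for illustration; what matters is that callers only see `BillingFacade.create_invoice()`.

```python
class _TaxCalculator:
    def tax_for(self, amount):
        return round(amount * 0.2, 2)  # flat 20% rate, purely illustrative


class _InvoiceStore:
    def __init__(self):
        self.saved = []

    def save(self, invoice):
        self.saved.append(invoice)


class BillingFacade:
    """Single entry point; callers never touch the internals above."""

    def __init__(self):
        self._tax = _TaxCalculator()
        self._store = _InvoiceStore()

    def create_invoice(self, customer_id, amount):
        invoice = {
            "customer_id": customer_id,
            "amount": amount,
            "tax": self._tax.tax_for(amount),
        }
        self._store.save(invoice)
        return invoice


billing = BillingFacade()
print(billing.create_invoice("cust-1", 100))
```

You could later merge, split, or rewrite `_TaxCalculator` and `_InvoiceStore` freely, as long as `create_invoice()` keeps returning the same shape.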

Gradual replacement (Parallel Run)

If a specific component is to be replaced with a new implementation, it can help to run them in parallel for a while. For instance, if you have a spaghetti module that you want to redo correctly, you could leave the spaghetti code in place for legacy calls but start routing new calls to the new module.
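One variation on this, sometimes called a shadow run, keeps serving the legacy result while also calling the new code and logging any differences. The sketch below assumes a hypothetical shipping-cost function; both implementations are stand-ins.

```python
import logging

logger = logging.getLogger("parallel_run")


def legacy_shipping_cost(weight_kg):
    # Stand-in for the old "spaghetti" logic.
    return 5 + 2 * weight_kg


def new_shipping_cost(weight_kg):
    # Stand-in for the rewritten logic.
    return 5 + 2 * weight_kg


def shipping_cost(weight_kg):
    """Serve the legacy result, but shadow-call the new code and compare."""
    old = legacy_shipping_cost(weight_kg)
    try:
        new = new_shipping_cost(weight_kg)
        if new != old:
            logger.warning("mismatch for %s: old=%s new=%s", weight_kg, old, new)
    except Exception:
        # The new path must never break callers while it is still unproven.
        logger.exception("new implementation failed for %s", weight_kg)
    return old


print(shipping_cost(3))
```

Once the mismatch log stays quiet for long enough, you can switch `shipping_cost()` to return the new result and retire the legacy function.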

The result is a codebase where changes in one area (hopefully) won’t unpredictably break another, a key property of a maintainable system.

6. Testing Strategies (Safely Refactoring with Confidence)

A robust testing strategy will give you the confidence to make sweeping changes because you’ll know quickly if something important breaks. Here’s how to approach testing in the context of a large refactoring:

Establish a baseline with regression tests

Before you even begin refactoring a particular component, make sure you have tests that cover its current behavior. You're lucky if the codebase already has a good test suite, but many legacy systems have inadequate tests.

One of the first tasks in those cases is often writing characterization tests. A characterization test is a test that documents what the system currently does, not what we think it should do.

As Feathers says, “a characterization test is a test that characterizes the actual behavior of a piece of code.” This allows you to take a snapshot of what it does and ensure that it doesn’t change.

This gives you a safety net so you can refactor with confidence that you’re not introducing regressions. Use automated test suites to help things run smoothly (unit, integration, end-to-end).
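Here is what a characterization test might look like. `legacy_format_name` is a made-up legacy function with a few quirks; the expected values in the test were obtained by running the code, not by deciding what it ought to return.

```python
def legacy_format_name(first, last):
    # Hypothetical legacy function with quirks we do not yet understand.
    return (last.upper() + ", " + first).strip(", ")


def test_characterize_legacy_format_name():
    # Expected values were copied from what the code returns today,
    # not from what we think it should return.
    assert legacy_format_name("Ada", "Lovelace") == "LOVELACE, Ada"
    assert legacy_format_name("", "Lovelace") == "LOVELACE"
    assert legacy_format_name("Ada", "") == "Ada"


test_characterize_legacy_format_name()
```

If a refactor changes any of these outputs, the test fails and you get to decide deliberately whether the change is a fix or a regression.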

Continuous integration (CI)

It is highly recommended that testing be integrated into a CI pipeline that runs on every commit or merge. This way, you catch a bug during refactoring as soon as you introduce it, tightening the feedback loop.

Canary releases and feature flags

Beyond pre-release testing, consider strategies for safely deploying refactored code. A canary release involves rolling out the change to a small subset of users or servers first, observing it, and then gradually expanding.

This is great for catching issues that tests might miss (for example, performance issues or edge cases in production data). If the canary looks good (no errors, metrics are healthy), you proceed to full rollout. If not, you rollback quickly—with only a small impact scope.
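If you don't have canary infrastructure at the deployment level, a percentage-based feature flag in code can approximate it. This is a minimal sketch: the function names and the 5% default are assumptions, and the hash-based bucketing is just one common way to keep each user consistently on the same path.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def handle_request(user_id: str, canary_percent: int = 5) -> str:
    if in_canary(user_id, canary_percent):
        return "refactored code path"
    return "existing code path"


print(handle_request("user-123"))
```

Raising `canary_percent` gradually expands the rollout, and setting it to 0 acts as an instant rollback.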

Performance and load testing

If performance is a concern, incorporate performance tests into your strategy. This can be done in a staging environment. You might reconsider your approach or optimize the new code if you see a significant regression.

Testing legacy code lacking tests

If you’re dealing with a part of the system with zero tests (not uncommon in older code), prioritize getting at least some coverage there. There are also techniques like approval testing (where you generate output and have a human approve it as correct, then use that as a baseline for future tests). The key is not to refactor entirely in the dark; give yourself at least a flashlight in the form of tests!

In sum, a strong testing strategy is non-negotiable for refactoring complex systems. It’s your safety net, early warning system, and guide to know that your “cleanup” hasn’t broken anything vital.

7. Refactoring Without Breaking Performance

A common concern when refactoring is: will these cleaner code changes make my system slower or more resource-hungry? Ideally, refactoring is about the internal structure and shouldn’t change external behavior, and performance is part of that behavior.

In theory, performance should remain the same if you don’t change algorithms or data structures in a way that affects complexity.

In practice, though, performance can be inadvertently affected by refactoring. The new code may be more readable but uses more memory, or perhaps a critical caching mechanism was removed in the spirit of simplicity.

Senior engineers need to be mindful of performance-sensitive parts of the system when refactoring and take steps to avoid regressions (or even improve performance where possible).

Here’s how to refactor with performance in mind:

Identify performance-critical code paths

Not all code is equal when it comes to performance impact. Start by identifying the hot paths, the parts that run most often or sit on the critical path of a request. If you refactor them, treat it almost like a functional change: you must re-measure performance afterwards. You have more leeway for parts of the code that run rarely or are not bottlenecks.

Use profiling before and after

A profiler is a tool that measures where time is spent in your code or how memory is allocated. It’s beneficial to run a profiler on the code before refactoring a module to see how it behaves, and then run it after to compare. If you see, for example, that after refactoring, a function now shows up as taking 30% of execution time (when it was negligible before), that’s a red flag. Maybe the new code calls it more times than before.

import cProfile, pstats
from mymodule import slow_function

def profile(fn):
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    stats = pstats.Stats(profiler).strip_dirs().sort_stats('cumtime')
    stats.print_stats(10)

# run before refactor
profile(lambda: slow_function())

# after you refactor slow_function(), re-run and compare stats

When possible, improve performance through refactoring

On the flip side, refactoring can help performance.

For example, by refactoring duplicated code into one place, you can use better caching in that one place. So, watch for performance improvement opportunities that arise naturally as you refactor.
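As a sketch of that idea, imagine an expensive lookup that used to be copy-pasted in several places. Once it lives in a single function, adding a cache is a one-line change with `functools.lru_cache`. The exchange-rate table here is invented for the example.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def get_exchange_rate(currency: str) -> float:
    """Hypothetical expensive lookup, now defined in exactly one place."""
    rates = {"EUR": 1.1, "GBP": 1.3}  # stand-in for a slow remote call
    return rates.get(currency, 1.0)


def price_in_usd(amount: float, currency: str) -> float:
    return amount * get_exchange_rate(currency)


price_in_usd(10, "EUR")
price_in_usd(20, "EUR")
print(get_exchange_rate.cache_info())  # second call was served from the cache
```

With the duplicated versions, each copy would have needed its own caching logic, and they could easily have drifted apart.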

A good mindset is to treat performance as part of the “external behavior” that needs to be preserved. Refactoring should ideally not make things slower for users. To ensure that, incorporate performance checks into your plan, especially for critical sections. Measure, don’t guess. The end goal is a codebase that is both clean and fast enough.

8. Automate Code Reviews with AI tools

Refactoring code is an ongoing process, not a one-time event – AI code review tools help enforce clean-code standards, catch smells early, and reduce the repetitive tasks that can bog down human reviewers. This frees your engineers to focus on deeper architectural or domain-specific issues.

One powerful option is CodeRabbit, an AI-driven review platform designed to cut review time and bugs in half.

Here’s how it works and why it can boost your refactoring workflow:

AI-powered contextual feedback

CodeRabbit analyzes pull requests line by line, applying both advanced language models and static analysis under the hood. It flags potential bugs, best-practice deviations, and style issues before a human opens the PR.

Some other features include:

Auto-generated summaries and 1-click fixes – Summarize large PRs and apply straightforward fixes instantly.
Real-time collaboration and AI chat – Chat with the AI for clarifications, alternate code snippets, and instant feedback.
Integrates with popular dev platforms – Supports GitHub, GitLab, and Azure DevOps for seamless PR scanning.

CodeRabbit also offers free AI code reviews in VS Code. With its VS Code extension, you can get AI code reviews directly in your code editor, which saves review time, catches more bugs, and helps with refactoring.

Summary

Refactoring a complex enterprise codebase is like renovating a large building while people still live in it: you have to do the work without collapsing the structure.

Refactoring should be an ongoing process. You prevent the codebase from decaying by incorporating these practices into your regular development (perhaps allocating some time each sprint for refactoring or doing it opportunistically when touching your code). No single refactoring needs to be complex, yet the cumulative effect is significant.

As Martin Fowler puts it, a series of small changes can lead to a significant improvement in design.

That's it for this blog. I hope you learned something new today.

If you want to read more interesting articles about developer tools, React, Next.js, AI and more, then I'll encourage you to checkout my blog.

You can get in touch if you have any questions or corrections. I’m expecting them.

And if you found this blog useful, please share it with your friends and colleagues who might benefit from it as well. Your support enables me to continue producing useful content for the tech community.

Now it’s time to take the next step by subscribing to my newsletter and following me on Twitter.

How to Perform Code Reviews in Tech – The Painless Way

Ankur Tyagi — Tue, 03 Dec 2024 20:29:10 +0000

Okay, I know you may be skeptical: other guides have promised painless code reviews only to reveal that their solution requires some hyper-specific tech stack or a paid developer tool. I won’t do that to you.

This guide provides a straightforward and flexible template for code reviews that you can apply to your engineering team. The only requirement is that your app code is open source.

You can test a TypeScript workflow, Java workflow, Python workflow, PHP, Ruby or even some wacky web stack you invented. And it doesn’t matter if you’re developing on Windows, Linux, or Mac. Best of all, you don’t have to perform convoluted configuration or install software beyond a yaml.

I’ve been in engineering for the last 15 years, and code reviews have a bad reputation. We’ve all witnessed or lived through horror stories where sometimes it feels like every previous line gets torn to shreds.

So, what can you do differently? How can you make reviewing your code painless so that even the biggest nitpicker on your team has nothing but praise?

After participating in code reviews for a decade, I’ve found that taking code reviews less personally is the single biggest thing you can do to improve your code. Why? Because all software is iterative. Even “perfect” code will eventually become outdated. Instead of thinking of it like a graded assignment, think of it as a part of the process.

Prerequisites

This tutorial uses free, open-source tools. You’ll need to have a GitHub account to help you make your code reviews more pleasant and valuable.

What is a Code Review?

The term “code review” can refer to various activities, from simply reading code over your teammate’s shoulder to a 10-person meeting where you dissect code line by line. I use the term to refer to a formal and written process, but not so heavyweight as a series of in-person code inspection meetings.

In a project where you work on a repository with other developers, after you complete your work, you commit, push, and create a pull request on the VCS, most likely using Git commands. Then, everyone reviews the pull request to determine whether it’s okay to use. If so, they approve it, and that code gets used in the project.

What is the Purpose of a Code Review?

Code Reviews are a tool for knowledge transfer. They help make devs more efficient when doing maintenance on a part of the system they didn't write.

When you review a pull request, it’s an opportunity to iron out issues before they become technical debt.

Code reviews can also be a good setting for mentoring junior developers.

Now, let’s discuss what is not the purpose of a code review:

Finding bugs. That's what tests (unit, integration, e2e, API, and so on) are for.

Nitpicking on style issues – settle for one style and use formatters or AI tools to enforce it. Just keep in mind that there are many things that an AI tool cannot check. Code reviews are an excellent place to ensure the code is sufficiently documented or self-documenting.

Do you want to know how you can check this? Return to the code you wrote 6-12 months ago and try to understand what it was written to do.

If you understand it quickly, that means it's readable, and the code review was done properly and in a helpful manner.

Why is Doing Code Reviews Hard?

Despite their importance, many devs don’t like doing code reviews – in part because they can be challenging, especially if you’re not following best practices.

Here are some pain points I’ve observed during my years of participating in code reviews:

When people talk about code reviews, they focus on the reviewer. But the developer who writes the code is just as crucial to the review as the person who reads it.
Doing a code review is not an automatic routine for a developer.
The reviewer may sometimes just do a partial review and add new comments at every pass, even on code in the previous review(s) that remained untouched.
Sometimes, the code reviewer may not clearly express their expectations.
Multiple code reviewers can often have diverging opinions, leading to (too) long discussions.
The developer does not understand the comments from reviewers and requires back-and-forth discussions.
The developer addresses code review comments differently than agreed upon during the review process.

These pain points often bottleneck our development velocity. But recent advances in AI-assisted code review tools have started addressing these common friction points in our PR workflows.

Let's explore how AI-powered tools, along with some best practices, can address these review challenges and optimize your development workflow.

Can AI Replace Code Reviews?

While AI hasn’t replaced human code reviews, it is a powerful force multiplier in the review process.

Here's how: AI code reviews excel as a preliminary screening tool, catching common issues before human reviewers see the code. This becomes especially valuable in open-source projects where maintainer bandwidth is limited.

I recently started using AI code reviews on a case-by-case basis for my projects.

AI tools improve my existing workflows, reduce failure rates by detecting logic errors early on, and boost productivity. So I’ve added it to my CI/CD pipelines. It doesn't have to be perfect at detecting logic errors, as long as its false positive rate is very low (ideally as close to 0 as possible).

Most importantly, AI reviews respect the golden rule of 'value your reviewer's time' by handling routine checks. This allows human reviewers to focus on architecture, business logic, and complex edge cases.

This approach positions AI as a complementary tool that augments rather than replaces human expertise in the code review process.

What to Focus on During a Code Review

When reviewing code, try to prioritise what matters most using the Code Review Pyramid. This is a framework that helps you focus your attention where it creates the most value.

Think of it like building a house — start with the foundation before worrying about paint colours.

The pyramid has five layers, from most critical (bottom) to least critical (top):

API Semantics: Core design decisions that affect users
Implementation Semantics: The code's functionality, security, and performance
Documentation: Clear explanation of how to use the code
Tests: Verification that everything works as intended
Code Style: Formatting and naming conventions

Source: The Code Review Pyramid by Gunnar Morling

Remember: if you want to catch issues/bugs, there are more appropriate processes for that. That is why we have automated testing, canary releases, testing environments, and so on.

In my personal opinion, using code reviews as a bug catching tool is somewhat of an anti-pattern where you're compensating for a development process that may be lacking some key steps/processes.

To me, a code review is much more about managing technical debt and ensuring that quality is produced, while shipping more features.

In doing a code review, you should make sure that:

The code is readable
It has appropriate unit tests
The developer used clear names for everything
The code is well-designed and isn’t more complex than it needs to be
Test cases make sense and have comprehensive coverage
It’s something the team can maintain in the long run
There are no architectural issues that will block the team
The code fits the team's idea of quality
You’re thinking about what you can learn from the PR
You’re sharing any knowledge the developer might use in their PR
You’re thinking about how you can empower the dev through your positive feedback
The PR has a clear changelist description

Code Review Best Practices And Process

There is no general rule in engineering for code reviews, as what you’ll need to focus on depends on many factors. You can and should set up the process according to your company standards and way of working as a team.

Here are some factors you’ll need to think about before setting up a code review process:

The size and type of company you’re in (for example a startup vs a large corporation)
The number of developers on your team
Your budget
The timeframe you’re working with
Your and your team’s workloads
The complexity of the code
The abilities and skills of the reviewer(s)
The availability of the reviewer(s)

As an example, at my work we have a very simple rule: all code changes must be reviewed by at least one developer before a merge or a commit to the trunk.

Code reviews need a systematic approach, but maintaining consistency across every PR is challenging. It’s useful to let computers handle repetitive checks (style, formatting) while humans focus on what matters most: architecture and logic. This balanced approach makes reviews both thorough and sustainable.

Take a look at this example. It shows how we can optimize our code review process by intelligently delegating tasks between humans and automated tools. The diagram below illustrates a typical code style review workflow, comparing manual human review steps against automated tooling.

The diagram shows a real problem we all face in code reviews. See the left side? That's us humans doing manual formatting checks: finding weird spaces, fixing indents, writing comments about it... pretty tedious stuff. But check out the right side: that's where tools like Prettier just fix these formatting issues automatically.

No meetings, no back-and-forth – just done. That's why I started using CodeRabbit, which is a dev tool that caught my attention recently.

What is CodeRabbit?

The CodeRabbit docs describe the tool pretty effectively, so I’ll just leave this here:

CodeRabbit is an AI-powered code reviewer that delivers context-aware feedback on pull requests within minutes, reducing the time and effort needed for manual code reviews. It provides a fresh perspective and catches issues that are often missed, enhancing the overall review quality. – from the CodeRabbit docs

How Does CodeRabbit Help?

Let me walk you through a real example. When you submit a PR, CodeRabbit:

Performs a PR summary on the fly:

First, it gives you a quick overview of what changed.
It also explains the impact in plain English (great for non-tech folks in your team).
Then it includes a detailed walkthrough of file changes.

Does a “Smart Code Review”:

It drops comments right on the specific lines that need attention.
It also suggests fixes in diff format that you can apply with one click.
And it shows what commits and files it checked (which is helpful for tracking review coverage).

Gives you interactive feedback:

You can chat with it right in the PR comments.
You can ask it questions about specific code changes to get more details.
And it remembers your team's patterns and preferences which is super helpful for consistency’s sake (which I discussed above).

Extra Helpful Features:

CodeRabbit validates changes against linked GitHub/GitLab issues.
It creates sequence diagrams to visualize changes.
And it can apply one-click fixes for simple issues.

I first discovered CodeRabbit last month while I was searching for something else on GitHub. I accidentally came across it and I was surprised by how many people are already using it.

I signed up right away because I was looking for exactly this kind of solution to help me and my team with our reviews.

I read through the CodeRabbit docs and was very impressed.

Getting started using it is pretty much a plug and play process.

In the next section, we’ll go through the quick steps you can follow to enable CodeRabbit using an example repo.

Sign up at coderabbit.ai using your GitHub account.
Go to Add Your Repository.
And that's it. CodeRabbit starts reviewing your PRs automatically.

A GitHub Repo to Test

As an example GitHub repo to test, we’ll use devtoolsacademy: my blog on everything about awesome developer tools.

First, visit the CodeRabbit login page and log in via GitHub.

Next, add CodeRabbit to some of your public GitHub repositories.

Now, CodeRabbit is fully integrated and ready to do code reviews on your selected repo.

Yes: it’s that simple and fast. And in my opinion, it’s one of the main reasons the tool is so useful.

Here are some sample PRs for you to check out:

Additional Examples

💡 Check out all the open source examples of code reviews done by CodeRabbit.

Conclusion

Everyone’s code needs reviewing. Just because someone is the most senior person on the team does not mean that their code doesn’t need to be reviewed.

In this article, I talked about code reviews along with some common pain points. I then showed you how you can leverage CodeRabbit to iterate quickly through your code reviews and focus more on business.

Before I End

I hope you found it helpful learning how to use AI tools for code reviews.

If you like my writing, these are some of my other most recent articles.

Follow me on Twitter to stay updated on my open source projects.

How I Built a Custom Video Conferencing App with Stream and Next.js

Ankur Tyagi — Wed, 02 Oct 2024 17:46:39 +0000

Building full-stack apps can be tough. You have to think about frontend, APIs, databases, auth – plus you have to know how all of these things work together.

And building a project like a video conferencing app from scratch can feel even more overwhelming, especially with the complexities of managing video streams, user auth, and real-time interactions.

But what if I told you there’s an easier way to do this – one that lets you build your video conferencing app in a fraction of the time?

In this article, I’ll show you how I built a video conferencing app using Stream and Clerk in Next.js.

Here is the source code (remember to give it a star ⭐).

Before we start, let me tell you why I wrote this tutorial.

I’m a Software Engineer who cares about writing and I love to code, design, develop, and then teach people.

I've been using open-source projects, products, and services for a while now, and contributing to many of them to improve them however I can. Last month, I built an open-source blog for “awesome developer tools” called devtoolsacademy.

This article is about sharing the experience I’ve had using yet another awesome developer tool.

What is Stream?

Stream is an open-source cloud-based platform that provides APIs and SDKs for building scalable and feature-rich real-time applications. It offers pre-built UI components for creating enterprise-grade software apps with features like chat, video, audio, and activity feeds.

Here's how I'll use Stream while building the app:

Set up real-time video and audio calls
Use Stream's UI components to quickly build the interface
Implement key features like video and audio calls
Call Types – I'll implement instant meetings and pre-scheduled calls using Stream
Leverage Stream's call and participant objects to manage call state

Prerequisites

To fully understand the tutorial, you need to have a basic understanding of React and Next.js. You’ll also need the following:

Stream React SDK - provides pre-built UI components for adding video call features quickly.
Stream Node.js SDK - for managing server-side interactions and keeping Stream's state in sync.
Clerk - a comprehensive user management platform to handle authentication effortlessly.
Headless UI - provides accessible UI components for building user-friendly applications.
React Copy-to-Clipboard - allows users to easily copy meeting links within the app.
React Icons - offers a library of easily integrated icons.

How to Build the App Interface with Next.js

In this section, I'll guide you through creating the user interface for the video-conferencing app. The interface will allow users to easily create, join, and schedule meetings, as well as view their upcoming meetings.

First, let’s create a Next.js TypeScript project by running the code snippet below:

npx create-next-app facetime-app

Then install the following packages:

React icons - a popular React icons package
Headless UI - provides a set of accessible UI components
React-copy-to-clipboard - a lightweight package that enables us to copy meeting links.

npm install react-icons @headlessui/react react-copy-to-clipboard

Copy the code snippet below into the app/page.tsx file:

"use client";
import { useState } from "react";
import { FaLink, FaVideo } from "react-icons/fa";
import InstantMeeting from "@/app/modals/InstantMeeting";
import UpcomingMeeting from "@/app/modals/UpcomingMeeting";
import CreateLink from "@/app/modals/CreateLink";
import JoinMeeting from "@/app/modals/JoinMeeting";

export default function Dashboard() {
    const [startInstantMeeting, setStartInstantMeeting] =
        useState<boolean>(false);
    const [joinMeeting, setJoinMeeting] = useState<boolean>(false);
    const [showUpcomingMeetings, setShowUpcomingMeetings] =
        useState<boolean>(false);
    const [showCreateLink, setShowCreateLink] = useState<boolean>(false);

    return (
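        <>
            <button onClick={() => setShowUpcomingMeetings(true)}>
                Upcoming FaceTime
            </button>

            <main className='w-full h-screen flex flex-col items-center justify-center'>
                <h1 className='font-bold text-2xl text-center'>FaceTime</h1>
                <div className='flex flex-col'>
                    <button onClick={() => setJoinMeeting(true)}>
                        Join FaceTime
                    </button>
                </div>

                <div className='flex items-center justify-center space-x-4 mt-6'>
                    <button onClick={() => setShowCreateLink(true)}>
                        <FaLink /> Create link
                    </button>
                    <button onClick={() => setStartInstantMeeting(true)}>
                        <FaVideo /> New FaceTime
                    </button>
                </div>
            </main>

            {startInstantMeeting && (
                <InstantMeeting
                    enable={startInstantMeeting}
                    setEnable={setStartInstantMeeting}
                />
            )}
            {showUpcomingMeetings && (
                <UpcomingMeeting
                    enable={showUpcomingMeetings}
                    setEnable={setShowUpcomingMeetings}
                />
            )}
            {showCreateLink && (
                <CreateLink enable={showCreateLink} setEnable={setShowCreateLink} />
            )}
            {joinMeeting && (
                <JoinMeeting enable={joinMeeting} setEnable={setJoinMeeting} />
            )}
        </>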
    );
}

The code snippet above renders multiple buttons that allow users to perform actions like joining, creating, and scheduling a call. Each button opens a modal that prompts the user to provide additional details specific to the action they are performing.

Next, let’s create a modals folder within the Next.js app directory and add the following components to the modals folder:

cd app
mkdir modals && cd modals
touch CreateLink.tsx InstantMeeting.tsx JoinMeeting.tsx UpcomingMeeting.tsx

The CreateLink modal allows users to provide a description and schedule a time for the call. The InstantMeeting modal lets users start an instant meeting by providing a call description. The JoinMeeting modal enables users to enter a call link and join a meeting. And the UpcomingMeeting modal displays all scheduled upcoming calls.

Copy the code snippet below into the CreateLink modal:

"use client";
import {
    Dialog,
    DialogTitle,
    DialogPanel,
    Transition,
    Description,
    TransitionChild,
} from "@headlessui/react";
import { Fragment, SetStateAction, useState, Dispatch } from "react";
import CopyToClipboard from "react-copy-to-clipboard";
import { FaCopy } from "react-icons/fa";

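//👇🏻 props shared by every modal: visibility flag and its setter
interface Props {
    enable: boolean;
    setEnable: Dispatch<SetStateAction<boolean>>;
}

export default function CreateLink({ enable, setEnable }: Props) {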
    const [showMeetingLink, setShowMeetingLink] = useState(false);
    const [facetimeLink, setFacetimeLink] = useState<string>("");
    const closeModal = () => setEnable(false);

    return (
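        <>
            <Transition appear show={enable} as={Fragment}>
                <Dialog as='div' className='relative z-10' onClose={closeModal}>
                    <TransitionChild
                        as={Fragment}
                        enter='ease-out duration-300'
                        enterFrom='opacity-0'
                        enterTo='opacity-100'
                        leave='ease-in duration-200'
                        leaveFrom='opacity-100'
                        leaveTo='opacity-0'
                    >
                        <div className='fixed inset-0 bg-black/75' />
                    </TransitionChild>

                    <div className='fixed inset-0 overflow-y-auto'>
                        <div className='flex min-h-full items-center justify-center p-4 text-center'>
                            <TransitionChild
                                as={Fragment}
                                enter='ease-out duration-300'
                                enterFrom='opacity-0 scale-95'
                                enterTo='opacity-100 scale-100'
                                leave='ease-in duration-200'
                                leaveFrom='opacity-100 scale-100'
                                leaveTo='opacity-0 scale-95'
                            >
                                <DialogPanel className='w-full max-w-2xl transform overflow-hidden rounded-2xl bg-white p-6 align-middle shadow-xl transition-all text-center'>
                                    {showMeetingLink ? (
                                        <MeetingLink facetimeLink={facetimeLink} />
                                    ) : (
                                        <MeetingForm
                                            setShowMeetingLink={setShowMeetingLink}
                                            setFacetimeLink={setFacetimeLink}
                                        />
                                    )}
                                </DialogPanel>
                            </TransitionChild>
                        </div>
                    </div>
                </Dialog>
            </Transition>
        </>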
    );
}

The code snippet above renders a form that allows users to input a description and select a time to schedule a call. Once the call is created, the generated link is displayed and can be copied.

Finally, add the MeetingForm and MeetingLink components below the CreateLink component:

const MeetingForm = ({
    setShowMeetingLink,
    setFacetimeLink,
}: {
    setShowMeetingLink: React.Dispatch<SetStateAction<boolean>>;
    setFacetimeLink: Dispatch<SetStateAction<string>>;
}) => {
    const [description, setDescription] = useState<string>("");
    const [dateTime, setDateTime] = useState<string>("");

    const handleStartMeeting = async (e: React.FormEvent) => {
        e.preventDefault();
        console.log({ description, dateTime });
    };

    return (
        <>
            <DialogTitle
                as='h3'
                className='text-lg font-bold leading-6 text-green-600'
            >
                Schedule a FaceTime
            </DialogTitle>

            <Description className='text-xs opacity-40 mb-4'>
                Schedule a FaceTime meeting with your cliq
            </Description>

            <form className='w-full' onSubmit={handleStartMeeting}>
                <label
                    className='block text-left text-sm font-medium text-gray-700'
                    htmlFor='description'
                >
                    Meeting Description
                </label>
                <input
                    type='text'
                    name='description'
                    id='description'
                    value={description}
                    onChange={(e) => setDescription(e.target.value)}
                    className='mt-1 block w-full text-sm py-3 px-4 border-gray-200 border-[1px] rounded mb-3'
                    required
                    placeholder='Enter a description for the meeting'
                />

                <label
                    className='block text-left text-sm font-medium text-gray-700'
                    htmlFor='date'
                >
                    Date and Time
                </label>

                <input
                    type='datetime-local'
                    id='date'
                    name='date'
                    required
                    className='mt-1 block w-full text-sm py-3 px-4 border-gray-200 border-[1px] rounded mb-3'
                    value={dateTime}
                    onChange={(e) => setDateTime(e.target.value)}
                />

                <button type='submit'>Schedule FaceTime</button>
            </form>
        </>
    );
};

The MeetingForm component accepts the call description and scheduled time, while the MeetingLink component displays the generated call link and allows users to copy it.

const MeetingLink = ({ facetimeLink }: { facetimeLink: string }) => {
    const [copied, setCopied] = useState<boolean>(false);
    const handleCopy = () => setCopied(true);

    return (
        <>
            <DialogTitle
                as='h3'
                className='text-lg font-bold leading-6 text-green-600'
            >
                Copy FaceTime Link
            </DialogTitle>

            <Description className='text-xs opacity-40 mb-4'>
                You can share the facetime link with your participants
            </Description>

            <div className='bg-gray-100 p-4 rounded flex items-center justify-between'>
                <p className='text-xs text-gray-500'>
                    {`${process.env.NEXT_PUBLIC_FACETIME_HOST}/${facetimeLink}`}
                </p>

                <CopyToClipboard
                    onCopy={handleCopy}
                    text={`${process.env.NEXT_PUBLIC_FACETIME_HOST}/${facetimeLink}`}
                >
                    <FaCopy className='text-green-600 text-lg cursor-pointer' />
                </CopyToClipboard>
            </div>

            {copied && (
                <p className='text-red-600 text-xs mt-2'>Link copied to clipboard</p>
            )}
        </>
    );
};

Copy the following code snippet into the InstantMeeting modal:

"use client";
import {
    Dialog,
    DialogTitle,
    DialogPanel,
    Transition,
    Description,
    TransitionChild,
} from "@headlessui/react";
import { FaCopy } from "react-icons/fa";
import CopyToClipboard from "react-copy-to-clipboard";
import { Fragment, useState, Dispatch, SetStateAction } from "react";
import { useStreamVideoClient } from "@stream-io/video-react-sdk";
import { useUser } from "@clerk/nextjs";
import Link from "next/link";

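//👇🏻 props shared by every modal: visibility flag and its setter
interface Props {
    enable: boolean;
    setEnable: Dispatch<SetStateAction<boolean>>;
}

export default function InstantMeeting({ enable, setEnable }: Props) {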
    const [showMeetingLink, setShowMeetingLink] = useState(false);
    const [facetimeLink, setFacetimeLink] = useState<string>("");

    const closeModal = () => setEnable(false);

    return (
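        <>
            <Transition appear show={enable} as={Fragment}>
                <Dialog as='div' className='relative z-10' onClose={closeModal}>
                    <TransitionChild
                        as={Fragment}
                        enter='ease-out duration-300'
                        enterFrom='opacity-0'
                        enterTo='opacity-100'
                        leave='ease-in duration-200'
                        leaveFrom='opacity-100'
                        leaveTo='opacity-0'
                    >
                        <div className='fixed inset-0 bg-black/75' />
                    </TransitionChild>

                    <div className='fixed inset-0 overflow-y-auto'>
                        <div className='flex min-h-full items-center justify-center p-4 text-center'>
                            <TransitionChild
                                as={Fragment}
                                enter='ease-out duration-300'
                                enterFrom='opacity-0 scale-95'
                                enterTo='opacity-100 scale-100'
                                leave='ease-in duration-200'
                                leaveFrom='opacity-100 scale-100'
                                leaveTo='opacity-0 scale-95'
                            >
                                <DialogPanel className='w-full max-w-2xl transform overflow-hidden rounded-2xl bg-white p-6 align-middle shadow-xl transition-all text-center'>
                                    {showMeetingLink ? (
                                        <MeetingLink facetimeLink={facetimeLink} />
                                    ) : (
                                        <MeetingForm
                                            setShowMeetingLink={setShowMeetingLink}
                                            setFacetimeLink={setFacetimeLink}
                                        />
                                    )}
                                </DialogPanel>
                            </TransitionChild>
                        </div>
                    </div>
                </Dialog>
            </Transition>
        </>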
    );
}

The code snippet above renders a form that allows users to provide a call description. Once the call is created, the link is generated and available to be copied before starting the call.

Finally, add the MeetingForm and MeetingLink components below the InstantMeeting component:

const MeetingForm = ({
    setShowMeetingLink,
    setFacetimeLink,
}: {
    setShowMeetingLink: Dispatch<SetStateAction<boolean>>;
    setFacetimeLink: Dispatch<SetStateAction<string>>;
}) => {
    const [description, setDescription] = useState<string>("");

    const handleStartMeeting = async (e: React.FormEvent) => {
        e.preventDefault();
        console.log({ description });
    };

    return (
        <>
            <DialogTitle
                as='h3'
                className='text-lg font-bold leading-6 text-green-600'
            >
                Create Instant FaceTime
            </DialogTitle>

            <Description className='text-xs opacity-40 mb-4'>
                You can start a new FaceTime instantly.
            </Description>

            <form className='w-full' onSubmit={handleStartMeeting}>
                <label
                    className='block text-left text-sm font-medium text-gray-700'
                    htmlFor='description'
                >
                    Meeting Description
                </label>
                <input
                    type='text'
                    name='description'
                    id='description'
                    value={description}
                    required
                    onChange={(e) => setDescription(e.target.value)}
                    className='mt-1 block w-full text-sm py-3 px-4 border-gray-200 border-[1px] rounded mb-3'
                    placeholder='Enter a description for the meeting'
                />

                <button type='submit'>Create FaceTime</button>
            </form>
        </>
    );
};

The MeetingForm component accepts the call description, while the MeetingLink component displays the generated call link and allows users to copy it before starting the call.

Copy the code snippet below into the JoinMeeting.tsx file. It renders a form that accepts the call link and redirects users to the call page.

"use client";
import {
    Dialog,
    DialogTitle,
    DialogPanel,
    Transition,
    TransitionChild,
} from "@headlessui/react";
import { useRouter } from "next/navigation";
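import { Dispatch, Fragment, SetStateAction, useState } from "react";

//👇🏻 props shared by every modal: visibility flag and its setter
interface Props {
    enable: boolean;
    setEnable: Dispatch<SetStateAction<boolean>>;
}

export default function JoinMeeting({ enable, setEnable }: Props) {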
    const closeModal = () => setEnable(false);

    return (
        <>
            <Transition appear show={enable} as={Fragment}>
                <Dialog as='div' className='relative z-10' onClose={closeModal}>
                    <TransitionChild
                        as={Fragment}
                        enter='ease-out duration-300'
                        enterFrom='opacity-0'
                        enterTo='opacity-100'
                        leave='ease-in duration-200'
                        leaveFrom='opacity-100'
                        leaveTo='opacity-0'
                    >
                        <div className='fixed inset-0 bg-black/75' />
                    </TransitionChild>

                    <div className='fixed inset-0 overflow-y-auto'>
                        <div className='flex min-h-full items-center justify-center p-4 text-center'>
                            <TransitionChild
                                as={Fragment}
                                enter='ease-out duration-300'
                                enterFrom='opacity-0 scale-95'
                                enterTo='opacity-100 scale-100'
                                leave='ease-in duration-200'
                                leaveFrom='opacity-100 scale-100'
                                leaveTo='opacity-0 scale-95'
                            >
                                <DialogPanel className='w-full max-w-2xl transform overflow-hidden rounded-2xl bg-white p-6 align-middle shadow-xl transition-all text-center'>
                                    <CallLinkForm />
                                </DialogPanel>
                            </TransitionChild>
                        </div>
                    </div>
                </Dialog>
            </Transition>
        </>
    );
}

Add the CallLinkForm below the JoinMeeting component:

const CallLinkForm = () => {
    const [link, setLink] = useState<string>("");
    const router = useRouter();

    const handleJoinMeeting = (e: React.FormEvent) => {
        e.preventDefault();
        router.push(`${link}`);
    };

    return (
        <>
            <DialogTitle
                as='h3'
                className='text-lg font-bold leading-6 text-green-600'
            >
                Join FaceTime
            </DialogTitle>

            <form className='w-full' onSubmit={handleJoinMeeting}>
                <label
                    className='block text-left text-sm font-medium text-gray-700'
                    htmlFor='link'
                >
                    Enter the FaceTime link
                </label>
                <input
                    type='url'
                    name='link'
                    id='link'
                    value={link}
                    onChange={(e) => setLink(e.target.value)}
                    className='mt-1 block w-full text-sm py-3 px-4 border-gray-200 border-[1px] rounded mb-3'
                    placeholder='Enter the FaceTime link'
                />

                <button type='submit'>Join FaceTime</button>
            </form>
        </>
    );
};

Congratulations! You’ve completed the app’s interface.

How to Authenticate Users with Clerk

Clerk is a user management platform that enables you to add auth to web apps.

You can install the Clerk Next.js SDK by running the following code snippet in your terminal:

npm install @clerk/nextjs

Create a middleware.ts file within the Next.js src folder and copy the code snippet below into the file:

import { clerkMiddleware, createRouteMatcher } from "@clerk/nextjs/server";

const protectedRoutes = createRouteMatcher([
    "/facetime(.*)",
    "/dashboard",
    "/",
]);

//👇🏻 protects the route
export default clerkMiddleware((auth, req) => {
    if (protectedRoutes(req)) {
        auth().protect();
    }
});

export const config = {
    matcher: ["/((?!.*\\..*|_next).*)", "/", "/(api|trpc)(.*)"],
};

The createRouteMatcher function accepts an array containing routes to be protected from unauthenticated users and the clerkMiddleware() function ensures the routes are protected.
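To make the matching behaviour more concrete, here is a minimal sketch of how route patterns like the ones above could be tested against a request path. This is not Clerk's implementation: the matchesAny helper and the way it turns each pattern into a regular expression are assumptions made purely for illustration.

```typescript
// Hypothetical helper: treats each pattern as a regular expression that must
// match the whole path. An illustration only, not Clerk's real matcher.
function matchesAny(patterns: string[], path: string): boolean {
    return patterns.some((pattern) => new RegExp(`^${pattern}$`).test(path));
}

// The same patterns that are passed to createRouteMatcher in middleware.ts.
const protectedPatterns = ["/facetime(.*)", "/dashboard", "/"];

// Paths that this sketch treats as protected.
const protectedExamples = ["/", "/dashboard", "/facetime/abc-123"].filter(
    (path) => matchesAny(protectedPatterns, path)
);

// Paths that fall outside the patterns and stay public in this sketch.
const publicExamples = ["/about", "/dashboard/settings"].filter(
    (path) => !matchesAny(protectedPatterns, path)
);
```

Under these assumptions, "/facetime(.*)" covers every meeting page, while "/dashboard" and "/" only match those exact paths.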

Next, import the following Clerk components into the app/layout.tsx file and update the RootLayout function as shown below:

import {
    ClerkProvider,
    SignInButton,
    SignedIn,
    SignedOut,
    UserButton,
} from "@clerk/nextjs";
import "./globals.css";

export default function RootLayout({
    children,
}: {
    children: React.ReactNode;
}) {
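    return (
        <ClerkProvider>
            <html lang='en'>
                <body>
                    <nav className='w-full py-4 md:px-8 px-4 text-center flex items-center justify-between sticky top-0 bg-white '>
                        <div className='flex items-center justify-end gap-5'>
                            {/*-- if user is signed out --*/}
                            <SignedOut>
                                <SignInButton mode='modal' />
                            </SignedOut>
                            {/*-- if user is signed in --*/}
                            <SignedIn>
                                <UserButton />
                            </SignedIn>
                        </div>
                    </nav>

                    {children}
                </body>
            </html>
        </ClerkProvider>
    );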
}

After completing this, users will be prompted to create an account or sign in before they can access the application pages.

Finally, create a Clerk account and set up a new Clerk application. Add your Clerk publishable and secret keys to the .env.local file in your project.

NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=
CLERK_SECRET_KEY=

How to Set Up Stream in a Next.js app

First, create a Stream account and set up an organization to house your app. Then, copy the following credentials into your .env.local file:

STREAM_APP_ID=
NEXT_PUBLIC_STREAM_API_KEY=
STREAM_SECRET_KEY=
NEXT_PUBLIC_FACETIME_HOST=http://localhost:3000/facetime
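Since the app depends on several environment variables, it can help to fail early when one is missing. The following is a minimal sketch, assuming a hypothetical findMissingEnvVars helper that is not part of Next.js, Clerk, or Stream. The variable names are the ones used in this tutorial.

```typescript
// Names taken from the .env.local entries used in this tutorial.
const REQUIRED_ENV_VARS = [
    "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
    "CLERK_SECRET_KEY",
    "STREAM_APP_ID",
    "NEXT_PUBLIC_STREAM_API_KEY",
    "STREAM_SECRET_KEY",
    "NEXT_PUBLIC_FACETIME_HOST",
];

// Hypothetical helper: returns the names of variables that are unset or blank.
function findMissingEnvVars(
    env: Record<string, string | undefined>,
    required: string[] = REQUIRED_ENV_VARS
): string[] {
    return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Example with a partially filled environment object.
const missing = findMissingEnvVars({
    NEXT_PUBLIC_FACETIME_HOST: "http://localhost:3000/facetime",
    STREAM_APP_ID: "",
});
// `missing` lists every required name except NEXT_PUBLIC_FACETIME_HOST.
```

In server-only code you could pass process.env to the helper and throw if the returned list is not empty.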

Next, install Stream React Video SDK and the Stream Node.js SDK.

npm install @stream-io/video-react-sdk @stream-io/node-sdk

Create a providers folder containing a StreamVideoProvider.tsx file and copy the following code snippet into the file:

"use client";
import { tokenProvider } from "@/actions/stream.actions";
import { StreamVideo, StreamVideoClient } from "@stream-io/video-react-sdk";
import { useState, ReactNode, useEffect } from "react";
import { useUser } from "@clerk/nextjs";

const apiKey = process.env.NEXT_PUBLIC_STREAM_API_KEY!;

export const StreamVideoProvider = ({ children }: { children: ReactNode }) => {
    const [videoClient, setVideoClient] = useState<StreamVideoClient>();

    const { user, isLoaded } = useUser();

    useEffect(() => {
        if (!isLoaded || !user || !apiKey) return;
        if (!tokenProvider) return;
        const client = new StreamVideoClient({
            apiKey,
            user: {
                id: user?.id,
                name: user?.primaryEmailAddress?.emailAddress,
                image: user?.imageUrl,
            },
            tokenProvider, //👉🏻 pending creation
        });

        setVideoClient(client);
    }, [user, isLoaded]);

    if (!videoClient) return null;

    return <StreamVideo client={videoClient}>{children}</StreamVideo>;
};

Let’s wrap the entire app with the StreamVideoProvider component, which initializes a Stream client to identify each user.

The StreamVideoClient constructor takes an object containing the API key, a user object built from the Clerk user details, and a tokenProvider.
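The mapping from the Clerk user to the user object passed to the client can be isolated in a small function. This is a minimal sketch: ClerkUserLike, StreamUserLike, and toStreamUser are hypothetical names that only model the three fields the provider reads, and the fallback to the user ID when no email is available is an assumption, not something the tutorial's provider does.

```typescript
// Minimal shape of the Clerk user fields the provider reads (an assumption
// for this sketch; the real Clerk user object has many more fields).
interface ClerkUserLike {
    id: string;
    imageUrl?: string;
    primaryEmailAddress?: { emailAddress: string } | null;
}

// Minimal shape of the user object handed to the video client.
interface StreamUserLike {
    id: string;
    name?: string;
    image?: string;
}

// Hypothetical helper: maps a Clerk user to the Stream user object.
function toStreamUser(user: ClerkUserLike): StreamUserLike {
    return {
        id: user.id,
        // Assumption: fall back to the ID when no primary email is set.
        name: user.primaryEmailAddress?.emailAddress ?? user.id,
        image: user.imageUrl,
    };
}

const exampleUser = toStreamUser({
    id: "user_123",
    imageUrl: "https://example.com/avatar.png",
    primaryEmailAddress: { emailAddress: "ada@example.com" },
});
```

Keeping this mapping in one place makes it easier to change the display name later, for example to a full name instead of an email address.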

Next, let’s create a Next.js server action (tokenProvider) that generates the token.

Create an actions folder, add a stream.actions.ts file, and copy the following code snippet into the file:

//👇🏻 tokenProvider function
"use server";

import { currentUser } from "@clerk/nextjs/server";
import { StreamClient } from "@stream-io/node-sdk";

const STREAM_API_KEY = process.env.NEXT_PUBLIC_STREAM_API_KEY!;
const STREAM_API_SECRET = process.env.STREAM_SECRET_KEY!;

export const tokenProvider = async () => {
    const user = await currentUser();

    if (!user) throw new Error("User is not authenticated");
    if (!STREAM_API_KEY) throw new Error("Stream API key is missing");
    if (!STREAM_API_SECRET) throw new Error("Stream API secret is missing");

    const streamClient = new StreamClient(STREAM_API_KEY, STREAM_API_SECRET);

    const expirationTime = Math.floor(Date.now() / 1000) + 3600;
    const issuedAt = Math.floor(Date.now() / 1000) - 60;

    //👇🏻 generates a Stream user token
    const token = streamClient.generateUserToken({
        user_id: user.id,
        exp: expirationTime, //👉🏻 token expires one hour from now
        iat: issuedAt, //👉🏻 issued 60 seconds in the past to tolerate clock skew
    });
    //👇🏻 returns the user token
    return token;
};
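The timestamp arithmetic in the token provider is easy to get wrong, because JWT timestamps are in seconds while Date.now() returns milliseconds. Here is a minimal sketch of that calculation in isolation. The tokenTimestamps helper and its parameter names are hypothetical and exist only for illustration.

```typescript
// Hypothetical helper mirroring the arithmetic used in tokenProvider:
// converts the current time to seconds, then derives the expiry and
// issued-at timestamps from it.
function tokenTimestamps(
    nowMs: number,
    validitySeconds: number = 3600,
    clockSkewSeconds: number = 60
): { exp: number; iat: number } {
    const nowSeconds = Math.floor(nowMs / 1000);
    return {
        exp: nowSeconds + validitySeconds, // expires after the validity window
        iat: nowSeconds - clockSkewSeconds, // issued slightly in the past
    };
}

const stamps = tokenTimestamps(1700000000000);
// stamps.exp === 1700003600 and stamps.iat === 1699999940
```

Backdating the issued-at time by a minute is a common way to avoid rejections when the server clocks involved are slightly out of sync.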

Finally, update the RootLayout function in the app/layout.tsx file by wrapping the entire application with the StreamVideoProvider component:

import "@stream-io/video-react-sdk/dist/css/styles.css";
import { StreamVideoProvider } from "./providers/StreamVideoProvider";

export default function RootLayout({
    children,
}: {
    children: React.ReactNode;
}) {
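    return (
        <ClerkProvider>
            <html lang='en'>
                <body>
                    <StreamVideoProvider>
                        <nav className='w-full py-4 md:px-8 px-4 text-center flex items-center justify-between sticky top-0 bg-white '>
                            <div className='flex items-center justify-end gap-5'>
                                {/*-- if user is signed out --*/}
                                <SignedOut>
                                    <SignInButton mode='modal' />
                                </SignedOut>
                                {/*-- if user is signed in --*/}
                                <SignedIn>
                                    <UserButton />
                                </SignedIn>
                            </div>
                        </nav>

                        {children}
                    </StreamVideoProvider>
                </body>
            </html>
        </ClerkProvider>
    );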
}

Congratulations! You've successfully integrated Stream into the Next.js app.

How to Create and Join Calls with Stream

In this section, you'll learn how to create, schedule, and join calls using the Stream SDK. You'll also learn how to set up the meeting room with the necessary components and fetch upcoming calls from Stream.

Creating and Scheduling calls

To create an instant meeting, execute the handleStartMeeting function. It generates a random ID for the call and creates the meeting using the current date and the provided description.

import { useStreamVideoClient } from "@stream-io/video-react-sdk";
import { useUser } from "@clerk/nextjs";
const client = useStreamVideoClient();
const { user } = useUser();

const handleStartMeeting = async (e: React.FormEvent) => {
    e.preventDefault();
    if (!client || !user) return;
    try {
        const id = crypto.randomUUID();
        const call = client.call("default", id);
        if (!call) throw new Error("Failed to create meeting");

        await call.getOrCreate({
            data: {
                starts_at: new Date(Date.now()).toISOString(),
                custom: {
                    description,
                },
            },
        });

        setFacetimeLink(`${call.id}`);
        setShowMeetingLink(true);
    } catch (error) {
        console.error(error);
        alert("Failed to create Meeting");
    }
};

The call.getOrCreate() function accepts an optional call description along with the current date and time to initiate the call.

It also allows you to schedule calls for a specific time in the future. In this case, you can specify the desired date and time, and Stream will automatically schedule the call for that period.

import { useStreamVideoClient } from "@stream-io/video-react-sdk";
import { useUser } from "@clerk/nextjs";
const client = useStreamVideoClient();
const { user } = useUser();

const handleScheduleMeeting = async (e: React.FormEvent) => {
    e.preventDefault();
    if (!client || !user) return;
    try {
        const id = crypto.randomUUID();
        const call = client.call("default", id);
        if (!call) throw new Error("Failed to create meeting");

        await call.getOrCreate({
            data: {
                //👇🏻 only necessary changes
                starts_at: new Date(dateTime).toISOString(),
                custom: {
                    description,
                },
            },
        });
        setFacetimeLink(`${call.id}`);
        setShowMeetingLink(true);
    } catch (error) {
        console.error(error);
        console.error("Failed to create Meeting");
    }
};
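Both handlers above build the same data object and differ only in the start time. The following is a minimal sketch of that shared step, assuming a hypothetical buildCallData helper that also rejects invalid or past dates. The validation is an addition for illustration and is not part of the handlers shown above.

```typescript
// Shape of the `data` object passed to call.getOrCreate() in this tutorial.
interface CallData {
    starts_at: string;
    custom: { description: string };
}

// Hypothetical helper: `dateTimeLocal` is the value of a datetime-local
// input, for example "2024-10-02T17:46". When omitted, the call starts now.
function buildCallData(
    description: string,
    now: Date,
    dateTimeLocal?: string
): CallData {
    // A date-time string without an offset is parsed as local time.
    const startsAt = dateTimeLocal ? new Date(dateTimeLocal) : now;
    if (Number.isNaN(startsAt.getTime())) {
        throw new Error("Invalid date and time");
    }
    if (startsAt.getTime() < now.getTime()) {
        throw new Error("Scheduled time is in the past");
    }
    return {
        starts_at: startsAt.toISOString(),
        custom: { description: description.trim() },
    };
}

// Instant meeting: starts at the current time.
const instant = buildCallData("Daily standup", new Date());

// Scheduled meeting: starts at the time picked in the form.
const scheduled = buildCallData(
    "Sprint review",
    new Date(2024, 0, 1),
    "2024-10-02T17:46"
);
```

With a helper like this, each handler could pass the result straight to call.getOrCreate({ data }).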

Joining calls and the Meeting Page

Recall that the meeting link in the app is declared as:

`${process.env.NEXT_PUBLIC_FACETIME_HOST}/${facetimeLink}`
// 👉🏻 format:
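To illustrate how such a link is put together and taken apart again, here is a minimal sketch. The buildMeetingLink and extractCallId helpers are hypothetical and not part of the app's code. They assume the host value ends in /facetime, as in the .env.local example.

```typescript
// Hypothetical helper: joins the host and the call ID into a meeting link.
function buildMeetingLink(host: string, callId: string): string {
    // Strip trailing slashes so the result never contains "//" before the ID.
    return `${host.replace(/\/+$/, "")}/${encodeURIComponent(callId)}`;
}

// Hypothetical helper: reads the call ID back out of a meeting link.
// Returns null when the link does not point at a /facetime/<id> page.
function extractCallId(link: string): string | null {
    const match = link.match(/\/facetime\/([^/?#]+)\/?(?:[?#].*)?$/);
    return match ? decodeURIComponent(match[1]) : null;
}

const link = buildMeetingLink("http://localhost:3000/facetime", "abc-123");
// link === "http://localhost:3000/facetime/abc-123"

const callId = extractCallId(link);
// callId === "abc-123"
```

The last path segment of the link is therefore the call ID, which is what the meeting page needs in order to look up the call.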

Therefore, we need to create the /facetime/[id] route to enable users to join a call. To do this, create a facetime folder with an [id] directory inside, and within that directory, add a page.tsx file. Then, copy the following code snippet into the file:

"use client";
import { useGetCallById } from "@/app/hooks/useGetCallById";
import { useUser } from "@clerk/nextjs";
import {
    StreamCall,
    StreamTheme,
    PaginatedGridLayout,
    CallControls,
} from "@stream-io/video-react-sdk";
import { useParams, useRouter } from "next/navigation";
import { useEffect, useState } from "react";

export default function FaceTimePage() {
    const { id } = useParams<{ id: string }>();
    const [confirmJoin, setConfirmJoin] = useState<boolean>(false);
    const [camMicEnabled, setCamMicEnabled] = useState<boolean>(false);
    const router = useRouter();
    //👇🏻 gets call details by ID
    const { call, isCallLoading } = useGetCallById(id);

    useEffect(() => {
        if (camMicEnabled) {
            call?.camera.enable();
            call?.microphone.enable();
        } else {
            call?.camera.disable();
            call?.microphone.disable();
        }
    }, [call, camMicEnabled]);

    //👇🏻 enable users to join calls
    const handleJoin = () => {
        call?.join();
        setConfirmJoin(true);
    };

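    if (isCallLoading) return <p>Loading...</p>;

    if (!call) return <p>Call not found</p>;

    return (
        <main className='min-h-screen w-full items-center justify-center'>
            <StreamCall call={call}>
                <StreamTheme>
                    {confirmJoin ? (
                        <MeetingRoom />
                    ) : (
                        <div className='flex flex-col items-center justify-center gap-5'>
                            <h1 className='text-3xl font-bold'>Join Call</h1>
                            <p className='text-lg'>
                                Are you sure you want to join this call?
                            </p>
                            <div className='flex gap-5'>
                                <button onClick={handleJoin}>Join</button>
                                <button onClick={() => router.push("/")}>Cancel</button>
                            </div>
                        </div>
                    )}
                </StreamTheme>
            </StreamCall>
        </main>
    );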
}

When users visit the meeting page, they are presented with a confirmation message, allowing them to confirm that they want to join the call.

In the code snippet above:

The useGetCallById hook is a custom function that retrieves call details based on the call ID.
The handleJoin function allows users to join the call and then displays the MeetingRoom component.

Add the MeetingRoom component below the FaceTimePage component:

const MeetingRoom = () => {
    const router = useRouter();

    const handleLeave = () => {
        confirm("Are you sure you want to leave the call?") && router.push("/");
    };

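    return (
        <section className='relative min-h-screen w-full overflow-hidden pt-4'>
            <div className='relative flex size-full items-center justify-center'>
                <div className='flex size-full max-w-[1000px] items-center'>
                    <PaginatedGridLayout />
                </div>
                <div className='fixed bottom-0 flex w-full items-center justify-center gap-5'>
                    <CallControls onLeave={handleLeave} />
                </div>
            </div>
        </section>
    );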
};

The PaginatedGridLayout arranges participants in a grid layout with pagination, allowing you to manage larger video calls by displaying a set number of participants per page.

The CallControls component provides built-in actions, such as muting, video toggling, and screen sharing, that can be performed during a call. Both components are part of the Stream SDK, making integration seamless.

Additionally, you can switch to the SpeakerLayout, which highlights the dominant speaker or shared screen while displaying other participants in a smaller view.

Finally, create a hooks folder containing the useGetCallById.ts file and copy the code snippet below into the file:

import { useEffect, useState } from "react";
import { Call, useStreamVideoClient } from "@stream-io/video-react-sdk";

export const useGetCallById = (id: string | string[]) => {
    const [call, setCall] = useState<Call>();
    const [isCallLoading, setIsCallLoading] = useState(true);

    const client = useStreamVideoClient();

    useEffect(() => {
        if (!client) return;

        const loadCall = async () => {
            try {
                const { calls } = await client.queryCalls({
                    filter_conditions: { id },
                });

                if (calls.length > 0) setCall(calls[0]);

                setIsCallLoading(false);
            } catch (error) {
                console.error(error);
                setIsCallLoading(false);
            }
        };

        loadCall();
    }, [client, id]);

    return { call, isCallLoading };
};

The code snippet above filters the call list and returns the call with a matching ID, allowing users to join the specified call.

Retrieving Upcoming Calls

To retrieve upcoming calls from Stream, you can create a custom hook that fetches all the calls created by the user, as well as the calls they are a member of.

import { useEffect, useState } from "react";
import { useUser } from "@clerk/nextjs";
import { Call, useStreamVideoClient } from "@stream-io/video-react-sdk";

export const useGetCalls = () => {
    const { user } = useUser();
    const client = useStreamVideoClient();
    const [calls, setCalls] = useState<Call[]>();
    const [isLoading, setIsLoading] = useState(false);

    useEffect(() => {
        const loadCalls = async () => {
            if (!client || !user?.id) return;
            setIsLoading(true);
            try {
                //👇🏻 gets all the calls the user is featured in
                const { calls } = await client.queryCalls({
                    sort: [{ field: "starts_at", direction: -1 }],
                    filter_conditions: {
                        starts_at: { $exists: true },
                        $or: [
                            { created_by_user_id: user.id },
                            { members: { $in: [user.id] } },
                        ],
                    },
                });

                setCalls(calls);
            } catch (error) {
                console.error(error);
            } finally {
                setIsLoading(false);
            }
        };

        loadCalls();
    }, [client, user?.id]);

    const now = new Date();

    //👇🏻 gets only calls that are yet to start
    const upcomingCalls = calls?.filter(({ state: { startsAt } }: Call) => {
        return startsAt && new Date(startsAt) > now;
    });

    return { upcomingCalls, isLoading };
};

The useGetCalls hook retrieves the list of upcoming calls, which can then be displayed in the UpcomingMeeting modal.
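To see the date filter in isolation, here is a minimal sketch in plain JavaScript. The filterUpcoming helper and the sample objects are made up for illustration; they only mimic the state.startsAt shape used in the hook, and real Call objects come from the Stream client.

```javascript
// Hypothetical helper: keeps only the calls whose start time is still ahead.
// Calls without a start time are dropped, as in the hook above.
function filterUpcoming(calls, now = new Date()) {
  return calls.filter(
    ({ state: { startsAt } }) => startsAt && new Date(startsAt) > now
  );
}

// Made-up sample data shaped like { state: { startsAt } }
const now = new Date("2025-01-01T12:00:00Z");
const sampleCalls = [
  { id: "past", state: { startsAt: new Date("2025-01-01T09:00:00Z") } },
  { id: "future", state: { startsAt: new Date("2025-01-02T09:00:00Z") } },
  { id: "unscheduled", state: { startsAt: undefined } },
];

// Only the call with id "future" survives the filter
console.log(filterUpcoming(sampleCalls, now).map((call) => call.id));
```

Passing the current time as a parameter keeps the filter easy to test, since the result no longer depends on when the code runs.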

Congratulations! You’ve completed the project for this tutorial.

Check out the live app here.

Next Steps

So far, you’ve learned how to build a video conferencing app. If you'd like to learn more about how you can leverage Stream to build scalable apps, then check out these resources:

Before We End...

I hope you found it insightful and that it has given you enough motivation on how to build apps using awesome developer tools.

These are some of my other most recent blog posts.

Check out my blog for more tutorials like this on awesome developer tools.

Follow me on Twitter to stay updated on my side projects and ongoing learning.

Happy coding.

How to Use React Compiler – A Complete Guide

Tapas Adhikary — Tue, 27 Aug 2024 22:35:47 +0000

In this tutorial, you'll learn how the React compiler can help you write more optimized React applications.

React is a user interface library that has been doing its job quite well for over a decade. Its component architecture, uni-directional data flow, and declarative nature stand out in helping devs build production-ready, scalable software applications.

Over the releases (even up until the latest stable release of v18.x), React has provided various techniques and methodologies to improve application performance.

For example, the entire memoization paradigm has been supported using the React.memo() higher-order component, or with hooks like useMemo() and useCallback().

In programming, memoization is an optimization technique that makes your programs execute faster by caching the result of expensive computations.
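As a rough illustration of the idea (this is not how React implements it), here is a minimal memoizing wrapper in plain JavaScript. The names memoize and slowSquare are invented for this sketch, and it assumes a function that takes a single primitive argument.

```javascript
// A minimal memoization sketch: results are cached by argument.
// `memoize` and `slowSquare` are hypothetical names used only here.
function memoize(fn) {
  const cache = new Map();
  return function memoized(arg) {
    if (cache.has(arg)) {
      return cache.get(arg); // cache hit: skip the computation
    }
    const result = fn(arg);
    cache.set(arg, result);
    return result;
  };
}

let computeCount = 0;
function slowSquare(n) {
  computeCount += 1; // track how often the real work runs
  return n * n;
}

const fastSquare = memoize(slowSquare);
fastSquare(12); // computed
fastSquare(12); // served from the cache
console.log(computeCount); // 1
```

The trade-off is memory for speed: every cached result stays in the Map, which is one reason memoizing everything can backfire.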

Although React's memoization techniques are great for applying optimizations, as Uncle Ben (remember, Spiderman's uncle?) once said, "With great power comes great responsibility". So we as developers need to be a little more responsible in applying them. Optimization is great, but over-optimization can be a killer for the application's performance.

With React 19, the developer community has received a list of enhancements and features to boast about:

An experimental open-source compiler. We will be focusing primarily on it in this article.
React Server Components.
Server Actions.
Easier and more organic way of handling the document metadata.
Enhanced hooks and APIs.
ref can be passed as props.
Improvements in asset loading for styles, images, and fonts.
A much smoother integration with Web Components.

If these are exciting to you, I recommend watching this video that explains how each feature will impact you as a React developer. I hope you like it 😊.

The introduction of a compiler with React 19 is set to be a game-changer. From now on, we can let the compiler handle the optimization headache rather than keeping it on us.

Does this mean we do not have to use memo, useMemo(), useCallback, and so on anymore? For the most part, yes. The compiler can take care of these things automatically if you understand and follow the Rules of React for components and hooks.

How will it do this? Well, we'll get to it. But before that, let's understand what a compiler is and whether it's justified to call this new optimizer for React code the React Compiler.

If you like to learn from video tutorials as well, this article is also available as a video tutorial here:

What is a Compiler, traditionally?
React Compiler Architecture
React Compiler in action
Understanding the problem: Without the React Compiler
Fixing the problem: Without the React Compiler
Fixing the problem: Using the React Compiler
Optimized React App with React Compiler
React Compiler in React DevTools
Diving deep - How does the React Compiler work?
How do you opt in and out of the React compiler?
Can we use the React Compiler with React 18.x?
Repositories to look into
What's Next?

What is a Compiler, Traditionally?

Simply put, a compiler is a software program/tool that translates high-level programming language code (source code) into machine code. There are several steps to follow to compile source code and generate machine code:

The lexical analyzer tokenizes the source code and generates tokens.
The Syntax Analyzer creates an abstract syntax tree (AST) to structure the source code tokens logically.
The Semantic Analyzer validates the semantic correctness of the code, such as type and scope rules (the syntax itself was already checked by the Syntax Analyzer).
After all three types of analysis by the respective analyzers, some intermediate code gets generated. It is also known as the IR code.
Then optimization is performed on the IR code.
Finally, the machine code is generated by the compiler from the optimized IR code.
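The steps above can be sketched with a toy pipeline in plain JavaScript. This is a deliberately tiny illustration, assuming a made-up "language" of numbers and names joined by +. It skips semantic analysis and a separate IR, and it generates text rather than machine code. All function names are hypothetical.

```javascript
// 1. Lexical analysis: source text -> tokens
function tokenize(source) {
  return source.match(/\d+|[A-Za-z_]\w*|\+/g) ?? [];
}

// Turns one operand token into a leaf node of the tree
function toNode(token) {
  if (token === undefined || token === "+") {
    throw new SyntaxError("Expected a number or a name");
  }
  return /^\d+$/.test(token)
    ? { type: "Num", value: Number(token) }
    : { type: "Var", name: token };
}

// 2. Syntax analysis: tokens -> left-nested abstract syntax tree
function parse(tokens) {
  let ast = toNode(tokens[0]);
  for (let i = 1; i < tokens.length; i += 2) {
    if (tokens[i] !== "+") {
      throw new SyntaxError(`Unexpected token ${tokens[i]}`);
    }
    ast = { type: "Add", left: ast, right: toNode(tokens[i + 1]) };
  }
  return ast;
}

// 3. Optimization: fold additions whose operands are both constants
function fold(node) {
  if (node.type !== "Add") return node;
  const left = fold(node.left);
  const right = fold(node.right);
  if (left.type === "Num" && right.type === "Num") {
    return { type: "Num", value: left.value + right.value };
  }
  return { type: "Add", left, right };
}

// 4. Code generation: tree -> output text
function generate(node) {
  if (node.type === "Num") return String(node.value);
  if (node.type === "Var") return node.name;
  return `${generate(node.left)} + ${generate(node.right)}`;
}

console.log(generate(fold(parse(tokenize("1 + 2 + x"))))); // "3 + x"
```

The optimizer replaces 1 + 2 with 3 before any output is produced, which is the same kind of work a real compiler does on a much larger scale.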

Now that you understand the basics of how a compiler works, let's learn about the React Compiler and understand how it works.

React Compiler Architecture

React compiler is a build-time tool that you need to configure with your React 19 project explicitly using the configuration options provided by the React tools ecosystem.

For example, if you are using Vite to create your React application, the compiler configuration will take place in the vite.config.js file.

React compiler has three primary components:

Babel Plugin: helps transform the code during the compilation process.
ESLint Plugin: helps catch and report any violations of the Rules of React.
Compiler Core: the core compiler logic that performs the code analysis and optimizations. Both Babel and ESLint plugins use the core compiler logic.

The compilation flow goes like this:

The Babel Plugin identifies which functions (components or hooks) to compile. We will see some configurations later to learn how to opt in and out of the compilation process. The plugin calls the core compiler logic for each of the functions and finally creates the Abstract Syntax Tree.
Then the compiler core converts the Babel AST into IR code, analyzes it, and runs various validations to ensure none of the rules are broken.
Next, it tries to reduce the amount of code to be optimized by performing various passes to eliminate dead code. The code gets further optimized using memoization.
Finally, in the code generation stage, the transformed AST is converted back to the optimized JavaScript code.

React Compiler in Action

Now that you know how React Compiler works, let's now dive into configuring it with a React 19 project so you can start learning about the various optimizations.

Understanding the problem: Without the React Compiler

Let's create a simple product page with React. The product page shows a heading with the number of products on the page, a list of products, and the featured products.

The component hierarchy and the data passing between the components look like this:

As you can see in the image above,

The ProductPage component has three child components, Heading, ProductList, and FeaturedProducts.
The ProductPage component receives two props, products and the heading.
The ProductPage component computes the total number of products and passes the value along with the heading text value to the Heading component.
The ProductPage component passes down the products prop to the ProductList child component.
Similarly, it computes the featured products and passes the featuredProducts prop to the FeaturedProducts child component.

Here is how the source code of the ProductPage component may look:

import React from 'react'

import Heading from './Heading';
import FeaturedProducts from './FeaturedProducts';
import ProductList from './ProductList';

const ProductPage = ({products, heading}) => {
  const featuredProducts = products.filter(product => product.featured);
  const totalProducts = products.length;

  return (
    <div className="m-2">
      <Heading
        heading={heading}
        totalProducts={totalProducts} />

      <ProductList
        products={products} />

      <FeaturedProducts
        featuredProducts={featuredProducts} />  

    </div>
  )
}

export default ProductPage

Also, assume we use the ProductPage component in the App.js file like this:


import ProductPage from "./components/compiler/ProductPage";

function App() {

  // A list of food products    
  const foodProducts = [
    {
      "id": "001",
      "name": "Hamburger",
      "image": "🍔",
      "featured": true
    },
    {
      "id": "002",
      "name": "French Fries",
      "image": "🍟",
      "featured": false
    },
    {
      "id": "003",
      "name": "Taco",
      "image": "🌮",
      "featured": false
    },
    {
      "id": "004",
      "name": "Hot Dog",
      "image": "🌭",
      "featured": true
    }
  ];

  return (
      <ProductPage 
            products={foodProducts} 
            heading="The Food Product" />
  );
}

export default App;

That's all good – so where is the problem? The problem is that, by default, React re-renders every child component whenever the parent component re-renders, even if the child's props have not changed. These unnecessary renders are what we want to optimize away. Let's understand the problem fully first.

We'll add the current timestamp in each of the child components. Now the rendered user interface will look like this:

The big number you see beside the headings is the timestamp (using the simple Date.now() function from the JavaScript Date API) we have added to the component code. Now what happens if we change the value of the heading prop of the ProductPage component?

Before:

<ProductPage 
   products={foodProducts} 
   heading="The Food Product" />

And after (notice that we have made it plural for products by adding an s at the end of the heading value):

<ProductPage 
   products={foodProducts} 
   heading="The Food Products" />

Now you will notice an immediate change in the user interface. All three timestamps got updated. This is because all three components were re-rendered when the parent component was re-rendered due to the props change.

If you notice, the heading prop was passed only to the Heading component, and even then the other two child components re-rendered. This is where we need the optimizations.

Fixing the Problem: Without the React Compiler

As discussed before, React provides us with various hooks and APIs for memoization. We can use React.memo() or useMemo() to safeguard the components that are re-rendering unnecessarily.

For example, we can use React.memo() to memoize the ProductList component to ensure that unless the products prop is changed, the ProductList component will not be re-rendered.

We can use the useMemo() hook to memoize the computation for the featured products. Both implementations are indicated in the image below.
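To make the idea concrete without pulling in React, here is a rough plain-JavaScript simulation of the shallow prop comparison that React.memo() performs conceptually. The helpers memoComponent and shallowEqual are hypothetical names for this sketch, not React APIs, and the "component" returns a string instead of JSX.

```javascript
// Compares two props objects key by key, using Object.is on each value
function shallowEqual(a, b) {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return (
    keysA.length === keysB.length &&
    keysA.every(
      (key) =>
        Object.prototype.hasOwnProperty.call(b, key) && Object.is(a[key], b[key])
    )
  );
}

// Wraps a render function so it only runs again when a prop changes
function memoComponent(render) {
  let lastProps = null;
  let lastOutput = null;
  return (props) => {
    if (lastProps !== null && shallowEqual(lastProps, props)) {
      return lastOutput; // props unchanged: reuse the previous output
    }
    lastProps = props;
    lastOutput = render(props);
    return lastOutput;
  };
}

let renderCount = 0;
const ProductListSim = memoComponent(({ products }) => {
  renderCount += 1;
  return products.map((product) => product.name).join(", ");
});

const products = [{ name: "Hamburger" }, { name: "Taco" }];
ProductListSim({ products }); // renders
ProductListSim({ products }); // same array reference: skipped
ProductListSim({ products: [...products] }); // new reference: renders again
console.log(renderCount); // 2
```

The last call shows why reference stability matters: a new array with the same contents still counts as a changed prop, which is the gap that useMemo() is meant to close for computed values such as the featured products.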

But again, recollecting the wise words of great Uncle Ben, over the last few years we have started over-using these optimization techniques. These over-optimizations can negatively impact the performance of your applications. So, the availability of the compiler is a boon for React developers as it lets them delegate many such optimizations to the compiler.

Let's now fix the problem using the React compiler.

Fixing the problem: Using the React Compiler

Again, React compiler is an opt-in build-time tool. It doesn't come bundled with React 19 RC. You need to install the required dependencies and configure the compiler with your React 19 project.

Before configuring the compiler, you can check if your codebase is compatible by executing this command on your project directory:

npx react-compiler-healthcheck@experimental

It will check and report:

How many components can be optimized by the compiler
If the Rules of React are followed.
If there are any incompatible libraries.

If you find that things are compatible, it's time to install the ESLint plugin powered by the React compiler. This plugin will help you catch any violation of the rules of React in your code. Violating code will be skipped by the React compiler and no optimizations will be performed on it.

npm install eslint-plugin-react-compiler@experimental

Then open the ESLint configuration file (for example, .eslintrc.cjs for Vite) and add these configurations:

module.exports = {
  plugins: [
    'eslint-plugin-react-compiler',
  ],
  rules: {
    'react-compiler/react-compiler': "error",
  },
}

Next, you'll use the Babel plugin for the React compiler to enable the compiler for your entire project. If you're starting a new project with React 19, I recommend that you enable the React compiler for the entire project. Let's install the Babel plugin for the React compiler:

npm install babel-plugin-react-compiler@experimental

Once installed, you need to complete the configuration by adding the options in the Babel config file. As we're using Vite, open the vite.config.js file and replace the content with the following code snippet:

import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

const ReactCompilerConfig = {/* ... */ };

// https://vitejs.dev/config/
export default defineConfig({
  plugins: [react({
    babel: {
      plugins: [
        [
          "babel-plugin-react-compiler",
           ReactCompilerConfig
          ]
        ],
    },
  })],
})

Here, you've added the babel-plugin-react-compiler to the configuration. The ReactCompilerConfig object is where you provide any advanced options, such as a custom runtime module. In this case, it's an empty object without any advanced configuration.

That's it. You are done configuring the React compiler with your code base to utilize its power. From now on, the React compiler will look into every component and hook in your project to try and apply optimizations to it.

If you want to configure the React compiler with Next.js, Remix, Webpack, and so on, you can follow this guide.

Optimized React App with React Compiler

Now you should have an optimized React app with the inclusion of the React compiler. So, let's run the same tests you did before. Again, change the value of the heading prop of the ProductPage component.

This time, you will not see the child components re-rendering unnecessarily, so the timestamps will not update. Only the portion of the UI whose data changed (the heading text) will reflect the change. Also, for the most part, you won't have to use memo, useMemo(), or useCallback() in your code anymore.

You can see it working visually from here.

React Compiler in React DevTools

React DevTools version 5.0+ has built-in support for the React compiler. You will see a badge with the text Memo ✨ beside the components optimized by the React compiler. This is fantastic!

Diving Deep – How Does the React Compiler Work?

Now that you've seen how the React compiler works on React 19 code, let's deep dive into understanding what's happening in the background. We will use the React Compiler Playground to explore the translated code and the optimization steps.

We'll use the Heading component as an example. Copy and paste the following code inside the leftmost section of the playground:

const Heading = ({ heading, totalProducts }) => {
  return (
    <nav>
      <h1 className="text-2xl">
          {heading}({totalProducts}) - {Date.now()}
      </h1>
    </nav>
  )
}

You will see that some JavaScript code is generated immediately inside the JS tab of the playground. The React compiler generates this JavaScript code as part of the compilation process. Let's go over it step-by-step:

function anonymous_0(t0) {
  const $ = _c(4);
  const { heading, totalProducts } = t0;
  let t1;
  if ($[0] === Symbol.for("react.memo_cache_sentinel")) {
    t1 = Date.now();
    $[0] = t1;
  } else {
    t1 = $[0];
  }
  let t2;
  if ($[1] !== heading || $[2] !== totalProducts) {
    t2 = (
      <nav>
        <h1 className="text-2xl">
          {heading}({totalProducts}) - {t1}
        </h1>
      </nav>
    );
    $[1] = heading;
    $[2] = totalProducts;
    $[3] = t2;
  } else {
    t2 = $[3];
  }
  return t2;
}

The compiler uses a hook called _c() to create an array of items to cache. In the code above, an array of four elements has been created to cache four items.

const $ = _c(4);

But, what are the things to cache?

The component takes two props, heading and totalProducts. The compiler needs to cache them. So, it needs two elements in the array of cacheable items.
The Date.now() part in the header should be cached.
The JSX itself should be cached. There is no point in computing JSX unless either of the above changes.

So there are a total of four items to cache.

The compiler creates memoization blocks using the if-block. The final return value from the compiler is the JSX which depends on three dependencies:

The Date.now() value.
Two props, a heading and totalProducts

The output JSX needs re-computation when any of the above changes. This means that the compiler needs to create two memoization blocks: one for the Date.now() value and one for the JSX that depends on the two props.

The first memoization block looks like this:

if ($[0] === Symbol.for("react.memo_cache_sentinel")) {
    t1 = Date.now();
    $[0] = t1;
} else {
    t1 = $[0];
}

On the first render, the first slot of the cache still holds the sentinel value, so the if-block computes Date.now() and stores the result in $[0]. On every later render, the cached value is reused instead of calling Date.now() again.

Similarly, in the second memoization block:

if ($[1] !== heading || $[2] !== totalProducts) {
    t2 = (
      <nav>
        <h1 className="text-2xl">
          {heading}({totalProducts}) - {t1}
        </h1>
      </nav>
    );
    $[1] = heading;
    $[2] = totalProducts;
    $[3] = t2;
  } else {
    t2 = $[3];
  }

Here, the check is for the value changes for either heading or totalProducts props. If either of these changes, the JSX needs to be recomputed. All the values are then stored in the cacheable array. If there are no changes in the value, the previously computed JSX is returned from the cache.
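If you want to run the two memoization blocks outside the playground, here is a plain-JavaScript imitation. It is only a sketch: createCache is a hypothetical stand-in for the compiler runtime's _c(), the cache lives in a closure instead of being tied to a component instance, and the JSX is replaced by a plain object so that identity can be compared.

```javascript
const SENTINEL = Symbol.for("react.memo_cache_sentinel");

// Hypothetical stand-in for _c(size): an array pre-filled with the sentinel
function createCache(size) {
  return new Array(size).fill(SENTINEL);
}

function createHeading() {
  const $ = createCache(4); // persists across calls, like the real cache across renders
  return function Heading({ heading, totalProducts }) {
    // Block 1: Date.now() is computed on the first call only
    let t1;
    if ($[0] === SENTINEL) {
      t1 = Date.now();
      $[0] = t1;
    } else {
      t1 = $[0];
    }
    // Block 2: the output is rebuilt only when a prop changes
    let t2;
    if ($[1] !== heading || $[2] !== totalProducts) {
      t2 = { text: `${heading}(${totalProducts}) - ${t1}` };
      $[1] = heading;
      $[2] = totalProducts;
      $[3] = t2;
    } else {
      t2 = $[3];
    }
    return t2;
  };
}

const Heading = createHeading();
const first = Heading({ heading: "The Food Product", totalProducts: 4 });
const second = Heading({ heading: "The Food Product", totalProducts: 4 });
const third = Heading({ heading: "The Food Products", totalProducts: 4 });

console.log(first === second); // true: same props, cached output reused
console.log(second === third); // false: heading changed, output rebuilt
```

Note that the timestamp inside third.text is the same one captured on the first call, which matches what you saw in the optimized app: the heading text changes, but the timestamp does not.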

You can now paste any other component source code into the left side and look into the generated JavaScript code to help you understand what's going on as we did above. This will help you to get a better grip on how the compiler performs the memoization techniques in the compilation process.

How Do You Opt in and Out of the React Compiler?

Once you've configured the React compiler the way we have done with our Vite project here, it's enabled for all the components and hooks of the project.

But in some cases, you may want to selectively opt-in for the React compiler. In that case, you can run the compiler in “opt-in” mode using the compilationMode: "annotation" option.

// Specify the option in the ReactCompilerConfig
const ReactCompilerConfig = {
  compilationMode: "annotation",
};

Then annotate the components and hooks you want to opt-in for compilation with the "use memo" directive.

// src/ProductPage.jsx
export default function ProductPage() {
  "use memo";
  // ...
}

Note that there is a "use no memo" directive as well. There might be some rare cases where your component may not be working as expected after compilation, and you want to opt out of the compilation temporarily until the issue is identified and fixed. In that case, you can use this directive:

function AComponent() {
  "use no memo";
  // ...
}

Can We Use the React Compiler with React 18.x?

It is recommended to use the React compiler with React 19 as there are required compatibilities. If you can't upgrade your application to React 19, you'll need to have a custom implementation of the cache function. You can go over this thread describing the workaround.

Repositories to Look Into

All the source code used in this article is in this repository.
If you want to start coding with React 19 and its features, here is a template repository configured with React 19 RC, Vite, and TailwindCSS. You may want to try it out.

What's Next?

To learn further,

Check out the official documentation of React Compiler from here.
Check out the discussions in the Working Group.

Up next, if you are willing to learn React and its ecosystem, like Next.js, with both fundamental concepts and projects, I have great news for you: you can check out this playlist on my YouTube channel with 22+ video tutorials and 12+ hours of engaging content so far, for free. I hope you like them as well.

That's all for now. Did you enjoy reading this article and have you learned something new? If so, I would love to know if the content was helpful.

Subscribe to my YouTube Channel.
Follow me on X (Twitter) or LinkedIn if you don't want to miss the daily dose of up-skilling tips.
Check out and follow my Open Source work on GitHub.
I regularly publish meaningful posts on my GreenRoots Blog, you may find them helpful, too.

See you soon with my next article. Until then, please take care of yourself, and keep learning.

Practice Your Coding Skills by Building a Program in Different Ways

Niladri S. Jyoti — Mon, 04 Mar 2024 15:39:55 +0000

While we have 365 days in other years, this year (2024) is special because it has one ‘extra’ day.

So in the spirit of Leap Day, let's practice some coding to understand various aspects of programming. We'll focus on the same program but from different perspectives.

Our example program will explore different ways you can code a program that determines whether a given year is a leap year. On other days, we code. But today, let’s decode what we do and get some extra knowledge out of that process.

Program Requirements & Prerequisites
Logical Approaches to Solving the Problem

My Naïve Approach
Reassignments and a Single Return Statement
Switching to Switch-Case from If-Else
Logical Deduction & Subsets for Better Structure
Logical Operators Combining All True Conditions
Applying Nitro with the Ternary Operator
Making it a Single Line Arrow Function

Paradigm Shift: Declarative Programming

Functions with Side Effects
More About Functional Programming
Side-Tracking: Short-Circuiting!
Encapsulation and Declarative Programming

Going Above & Beyond with Code Quality

Validations: Beyond the Basic Specifications
Testing it Out From the Outside

End Note

Program Requirements & Prerequisites

First, let’s discuss the requirements and set the specifications. The program should accept a year (a number, an integer to be specific) as an argument and return either true or false (a boolean) depending on whether it is a leap year.

Through the examples, we will focus on the program logic (semantics) rather than the language (syntax). Over the years, I have used JavaScript most frequently, so we'll use this language for the project. If you use a different language, no worries, because many concepts are common between programming languages. For example, in this article we will use arrow functions, which are similar to the lambda functions found in some other programming languages, such as Python.

So, as prerequisites, you should have a basic knowledge of programming and be comfortable with the concepts of functions (different ways to define and call functions, return values, and so on) and conditional logic (if-else, switch-case, and so on). That is enough to follow along, for the most part, if you want to read and try the code for yourself.

In the last section, we also unit test our code. If you aren't familiar with unit testing, here is a good refresher on how to write unit tests in JavaScript with Jest.

Logical Approaches to Solving the Problem

My Naïve Approach

This is based on the pedagogical style of determining a leap year that I learned as a kid who knew how to divide numbers. If a year (the number representing it) is divisible by 4, it is generally a leap year. But not always. When that year ends with two zeroes (meaning when the number is divisible by 100), it must also be divisible by 400 to be a leap year.

How to determine if a year is a leap year - as described above

As a beginner programmer, my thoughts flowed like you can see in the above flowchart. As a result, I converted that logic into my program like so:

function isLeapYear(year) {
  if (year % 4 == 0) {
      if (year % 100 == 0) {
          if (year % 400 == 0) {
              return true;
          } else {
              return false;
          }
      } else {
          return true;
      }
  } else {
      return false;
  }
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

This makes the program easily understandable. But with time, as I have moved farther in my programming journey, this type of code looks ugly because of so many nested conditional checks. It's not bad, but because of the nested levels, my brain has to work extra hard to get the logic from the code snapshot quickly.

Reassignments and a Single Return Statement

To avoid nested conditionals, many programmers follow the strategy of consecutive if conditions, avoiding the else conditions (like how Kyle Cook of Web Dev Simplified shows in this video with examples). It definitely improves readability.

Also, it lets us use only one return statement at the end while reassigning the returnable value. Let's not discuss it too much more when you can better see the code itself:

function isLeapYear(year) {
  let isLeap = false;
  if (year % 4 == 0) {
      isLeap = true;
  }
  if (year % 100 == 0) {
      isLeap = false;
  }
  if (year % 400 == 0) {
      isLeap = true;
  }
  return isLeap;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

The above code looks shorter and quicker to interpret. But it does affect the efficiency of the code, as now you have to go through all of the if conditions in all cases.

In contrast, in our previous naïve approach, due to the if-else construct, if a year is not divisible by 4 (like the year 2023), it would just be checked against one if condition. It’s true, of course, that for small programs such as this one, you don’t have to be overly concerned with efficiency.

The pitfall in this approach, though, is that you need to be cautious to apply all the if conditions one after another — using ‘else if’ would create trouble, as that would skip some if condition checks if the previous if condition test passed.
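To see the pitfall in action, here is a deliberately broken variant that chains the same checks with else if. The function name isLeapYearElseIf is made up for this sketch.

```javascript
// Broken on purpose: else-if stops at the first condition that passes
function isLeapYearElseIf(year) {
  let isLeap = false;
  if (year % 4 == 0) {
    isLeap = true;
  } else if (year % 100 == 0) {
    isLeap = false; // never reached for 1900, because the % 4 check already passed
  } else if (year % 400 == 0) {
    isLeap = true;
  }
  return isLeap;
}

console.log(isLeapYearElseIf(2024)); // Output: true (correct)
console.log(isLeapYearElseIf(1900)); // Output: true (wrong, 1900 is not a leap year)
```

Because every century year is also divisible by 4, the first branch always wins and the century checks are never evaluated.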

Another important fact is that the order matters. Since you started with the more generic cases of years not being a leap year (that is, let isLeap = false;), you have to go from relatively generic to relatively more specific cases.

So if, out of your three condition checks, the check of divisibility by 4 comes at the end, it would make ‘isLeap’ true even for years that are divisible by 100 but not divisible by 400 (like years 1700, 1800, 1900, and so on).

The same logical error would occur if you interchange the order of divisibility checks involving 100 and 400.
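Both ordering mistakes are easy to reproduce. The two functions below are intentionally wrong variants written for this illustration; their names are made up.

```javascript
// Wrong order 1: the divisible-by-4 check comes last and overrides the century rule
function isLeapYearFourLast(year) {
  let isLeap = false;
  if (year % 100 == 0) {
    isLeap = false;
  }
  if (year % 400 == 0) {
    isLeap = true;
  }
  if (year % 4 == 0) {
    isLeap = true;
  }
  return isLeap;
}

// Wrong order 2: the 100 and 400 checks are swapped
function isLeapYearSwapped(year) {
  let isLeap = false;
  if (year % 4 == 0) {
    isLeap = true;
  }
  if (year % 400 == 0) {
    isLeap = true;
  }
  if (year % 100 == 0) {
    isLeap = false; // also resets years divisible by 400, such as 2000
  }
  return isLeap;
}

console.log(isLeapYearFourLast(1900)); // Output: true (wrong)
console.log(isLeapYearSwapped(2000)); // Output: false (wrong)
```

In both cases the last matching check has the final say, so the most specific rule (divisibility by 400) has to come last.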

One last point I must mention is that some beginner programmers may think that you can not use multiple return statements and you must return only once in a program (and that you can do reassignments until that point). But experienced programmers can only call that notion a beginners’ myth!

Switching to Switch-Case from If-Else

While the if-else structure is used to choose between two options, you can also use switch-case to choose one from multiple options. You can compare it to nested if-else blocks (as in the first approach) or a series of if blocks (as in the second approach).

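A switch-case structure can be more efficient than a chain of if-else checks when a single value is compared against constant cases, because the matching case may be found in one go. With the switch (true) pattern used below, though, each case expression is still evaluated in order until one matches, so the benefit here is mainly readability.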

Note that there is one quirky thing with switch-case. When using switch-case, once a case is matched, all subsequent cases will also execute unless you are using break statements. So, the following program will not be correct even if it looks very similar to our previous version of the code.

Incorrect code: to show problems with missing break statements

```js
// example-bad: the missing break statements cause fall-through
function isLeapYear(year) {
  let isLeap = false;
  switch (true) {
    case year % 4 == 0:
      isLeap = true;
    case year % 100 == 0:
      isLeap = false;
    case year % 400 == 0:
      isLeap = true;
  }
  return isLeap;
}
```


If we must use a switch-case structure, we need to use break statements. We also need to go from specific cases first to generic cases next. While not all if-else logic can be converted into a switch-case logic, we can successfully convert the previous function like so:

```js
function isLeapYear(year) {
  let isLeap = false;
  switch (true) {
    case year % 400 == 0:
      isLeap = true;
      break;
    case year % 100 == 0:
      isLeap = false;
      break;
    case year % 4 == 0:
      isLeap = true;
      break;
  }
  return isLeap;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true
```

Notice that in the above, we don't have a 'default' case. And this is because we have initialized the isLeap variable with false. Had we just declared the variable without initialization with a value, we could've written a default case which would assign the value false to isLeap.

Also, the above version of the switch-case code is slightly longer because we wanted to use one return statement at the end and used assignments until then. If we refactor it, a shorter and more organized version would be this:

function isLeapYear(year) {
  switch (true) {
    case (year % 400 === 0):
      return true;
    case (year % 100 === 0):
      return false;
    case (year % 4 === 0):
      return true;
    default:
      return false;
  }
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

Notice that since execution of a return statement in a function automatically ends the function call, the program does not read lines that follow that statement. So, in this example, we don't have to use the break statements necessarily.
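To make the fall-through behavior visible, the following sketch leaves out both break and return and simply records which case bodies run. The helper name traceWithoutBreak is invented for this illustration.

```js
// Records which case bodies execute when neither break nor return is used.
function traceWithoutBreak(year) {
  const visited = [];
  switch (true) {
    case year % 4 === 0:
      visited.push("divisible by 4");
    case year % 100 === 0:
      visited.push("divisible by 100");
    case year % 400 === 0:
      visited.push("divisible by 400");
  }
  return visited;
}

// 2024 matches only the first case, yet all three bodies run.
console.log(traceWithoutBreak(2024).length); // 3
// 2023 matches no case, so nothing runs.
console.log(traceWithoutBreak(2023).length); // 0
```

Once a case matches, the remaining case expressions are not even tested. Execution simply continues downward until a break, a return, or the end of the switch.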

Logical Deduction & Subsets for Better Structure

Switching back from switch-case to if-else logic, let's do some logical deduction. In our previous if-else logic, we went from generic cases to specific cases. What if we go in reverse order? We consider that a given year will be a leap year unless negated.

So, we start with the narrower cases of centenary years — for them, the rule is simple: to be negated, they need to be divisible by 100 but not by 400 (like years such as 1700, 1800, 1900).

In this process, since we've already accepted years like 2000 (or years divisible by 400) to be a leap year, we won’t test them for divisibility by 4 (because a number divisible by 400 would anyway be divisible by 4 as well).

In the next step, as we consider only the non-centenary years, we would simply negate the cases where the year is not divisible by 4 (years like 2023, 1997, and so on).

function isLeapYear(year) {
  let isLeap = true;
  if (year % 100 == 0 && year % 400 != 0) {
      isLeap = false;
  } else if (year % 4 != 0) {
      isLeap = false;
  }
  return isLeap;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

Here you see, we first consider the centenary years and then non-centenary years — so they are mutually exclusive — and that’s why we use ‘else-if’ instead of if in the second conditional check. And in that process, we gain some efficiency over consecutive if blocks.

As this approach is about breaking the possible routes of being a leap year (or for that matter, not being a leap year) into subsets of years, depending upon how we break the possible years into subsets, we can construct the program alternatively as shown below:

function isLeapYear(year) {
  let isLeap = false;
  if (year % 400 == 0) {
      isLeap = true;
  } else if (year % 100 != 0 && year % 4 == 0) {
      isLeap = true;
  }
  return isLeap;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

So, in brief, our deduction from the leap year rule is that years divisible by 400 (like 1600, 2000) are leap years, and out of all the other years they must be divisible by 4 but not divisible by 100 to be a leap year.

In taking this approach, we have combined conditions and that’s why we involved logical operators (&&, the logical AND operator). This has helped us reduce the length of the function. Instead of three conditional blocks, we are currently using two blocks — an if block and then an else (where we further check the condition, and thus we call it else-if rather than just else).

But now that we are just using almost a single ‘if-else’ construct and we are also delving into logical operators, let's unleash more power from the logical operators in the following approach.

Logical Operators Combining All True Conditions

This time let's just reorganize the logic from the previous approach (two subsets) to group all positive conditions together and then accept a year as a leap year. If that’s not met, then call it a non-leap year.

function isLeapYear(year) {
    if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) {
        return true;
    } else {
        return false;
    }
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

This one looks good because it increases readability by organizing the positive conditions together. The only cost we incur here is that the condition in the if block is longer.

But with logical operators, it looks visually shorter and not complex (at least to programmers habituated to combining logical operators like this).

Dissecting further, since in the previous approach we said we could break the subsets in two different ways, we can have two corresponding versions for this approach as well. The second one is the following:

function isLeapYear(year) {
  if ((year % 100 == 0 && year % 400 != 0) || year % 4 != 0) {
      return false;
  } else {
      return true;
  }
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true
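If you want some reassurance that the two versions really agree, a brute-force comparison over a range of years is a quick check. This is only a sketch: the names acceptLeap and rejectNonLeap and the range of 1 to 4000 are arbitrary choices for the illustration.

```js
// Version 1: group the conditions that make a year a leap year.
const acceptLeap = (year) =>
  (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;

// Version 2: group the conditions that rule a year out, then negate.
const rejectNonLeap = (year) =>
  !((year % 100 === 0 && year % 400 !== 0) || year % 4 !== 0);

let mismatches = 0;
for (let year = 1; year <= 4000; year++) {
  if (acceptLeap(year) !== rejectNonLeap(year)) mismatches++;
}

console.log(mismatches); // 0
```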

Applying Nitro with the Ternary Operator

As you progress in your programming-learning journey, at some point or other, you must have been elated to discover the possibility of writing ultra-short programs.

While logical operators help us do that, to activate the ‘Nitro’ mode, we must use a Ternary Operator — which basically makes our if-else blocks a single line.

function isLeapYear(year) {
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

By now, as a pro programmer, you must be pitying your rookie self. You think of those times when you used to declare and initialize a variable with a default value first and then reassign it with the value you wanted to return, and finally return the value held by that variable.

It has been a long time since you shunned that practice, and you now return what you need to return, and don’t consume unnecessary memory space for useless variables.
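One small observation about the ternary version: the condition already evaluates to a boolean, so ? true : false hands back the very value it was given. A minimal sketch, with function names invented for the comparison:

```js
function isLeapYearTernary(year) {
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

// Same logic, returning the boolean expression directly.
function isLeapYearDirect(year) {
  return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// The ternary is more useful when the branches are not just true and false.
function describeYear(year) {
  return isLeapYearDirect(year) ? "leap year" : "not a leap year";
}

console.log(isLeapYearTernary(2024) === isLeapYearDirect(2024)); // true
console.log(typeof isLeapYearDirect(1900)); // boolean
console.log(describeYear(1900)); // not a leap year
```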

Making it a Single Line Arrow Function

Now that you have been boosted with Nitro, your programming technique is advancing like an arrow, on a mission to tear away the remnants of ES5 and boldly fly into the post-ES6 world. So you welcome arrow functions with open arms.

const isLeapYear = year => (year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0));

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

Previously, you skipped variables, and you skipped ‘if-else’ blocks. And now, you can even skip the return statement, because an arrow function whose body is a single expression returns the value of that expression implicitly. You also skip the parentheses around the parameter, as there is only one.
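The implicit return only applies when the body is an expression without curly braces. The sketch below contrasts the two body styles; the three constant names are made up for the illustration.

```js
// Expression body: the value of the expression is returned implicitly.
const conciseBody = year =>
  year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0);

// Block body: an explicit return is required.
const blockBody = (year) => {
  return year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0);
};

// Block body without return: the expression is evaluated and then discarded.
const forgotReturn = (year) => {
  year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0);
};

console.log(conciseBody(2000)); // true
console.log(blockBody(1900)); // false
console.log(forgotReturn(2024)); // undefined
```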

While singing the saga of shorter code, a point must be made that the shorter code is not necessarily the better code. It all depends on your users of the code (people who might read it and possibly collaborate/improve upon it).

If you are working with experienced programmers, this level of concision is fine. Just make sure you don’t exceed the line width beyond a certain number of spaces (80 characters recommended) so you don't trouble your coworkers with the need to handle horizontal scrollbars.

But if you are working with team members with varying levels of experience, or you are a teacher working with learners, then you must be conscious of the readability of your code for everyone.

Paradigm Shift: Declarative Programming

Anyway, we have discussed the logic of determining the leap year in the above examples. But let’s now dissect further to find more nuances of programming. And in that process let's move from imperative programming (as we have used so far) towards declarative programming (which is the end goal in this section).

Functions with Side Effects

Functions are said to have side effects when they modify non-local variables. In addition, a function that prints (logs) in the console is also considered a function with some side effects. That is because if a function does not have a side effect, a call to it can be replaced by its return value.

Functional Programming is a paradigm which dictates that our program should be like a pure function without side effects. A pure function means a function which always returns the same output given the same input. So, in its body, it depends on only the input parameter given from outside and no other global variable. Additionally, it should just return the output value without side effects or trying to modify anything outside its scope.

But consider the following variation of the program which does not specifically return any value representing the result. Instead, it logs the result as a statement (string) in the console. This is an example of a side effect.

function isLeapYear(year) {
  if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) {
      console.log("leap year.");
  } else {
      console.log("not leap year.");
  }
}

// Example usage:
let someValue = isLeapYear(2024); // Output: leap year.
console.log(someValue); // Output: undefined

Evidently, it does not follow the specification, as it needs to return a value of boolean type. A function can, of course, do both — printing and returning, like an alternative form of the above function.

function isLeapYear(year) {
  if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) {
      console.log("leap year.");
      return true;
  } else {
      console.log("not leap year.");
      return false;
  }
}

// Example usage:
let someValue = isLeapYear(2024); // Output: leap year.
console.log(someValue); // Output: true

But the mere fact that it is doing two things — returning a value and printing in the console — is the problem. A function should be made to do one thing for proper reusability. The ‘isLeapYear’ function should just determine if a year is a leap year. If we need to print anything about it, let that onus of doing the side effects lie with some other logger function(s).

// pure function

function isLeapYear(year) {
    if ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) {
        return true;
    } else {
        return false;
    }
}

// functions with side effect

function simpleLeapYearLogger(isLeap) {
    if (isLeap) {
        console.log("Yes, a leap year!");
    } else {
        console.log("Sorry, not a leap year.");
    }
}

function advancedLeapYearLogger(year, isLeap) {
    if (isLeap) {
        console.log(`The year ${year} is a leap year!`);
    } else {
        console.log(`The year ${year} is not a leap year!`);
    }
}

// Example usage:
let currYear = 2024;
let check2024 = isLeapYear(currYear); // No Output/Side Effect, just returned value.
simpleLeapYearLogger(check2024); // Output: Yes, a leap year!
advancedLeapYearLogger(currYear, check2024); // Output: The year 2024 is a leap year!

As you can see above, the function ‘isLeapYear’ is more reusable — with two different use cases in two separate logger functions. Also, had there been any mistake in the logic for the ‘isLeapYear’ function, it would have been easier to fix without touching the logger functions’ code.

Similarly, if you need to display the string logged in the console differently, you could modify the respective logger function without touching the leap year’s logic function. Thus, a function doing just one thing that it was supposed to do increases the reusability and maintainability of that function.
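The same pure function can also feed code that never logs anything. As a small sketch, here is a hypothetical daysInFebruary helper (not part of the original example) that reuses isLeapYear as-is:

```js
// Pure function: the result depends only on the argument.
function isLeapYear(year) {
  return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Hypothetical consumer that needs the boolean, not a console message.
function daysInFebruary(year) {
  return isLeapYear(year) ? 29 : 28;
}

console.log(daysInFebruary(2024)); // 29
console.log(daysInFebruary(1900)); // 28
```

Had isLeapYear printed to the console, every call to daysInFebruary would have produced unwanted output as well.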

More About Functional Programming

In the above section, you have already ventured into the space of functional programming. And now is the time to delve deeper.

If I search the term ‘Functional Programming’ in Wikipedia, the first line states

“functional programming is a programming paradigm where programs are constructed by applying and composing functions.”

The phrase ‘composing function’ means building complex functions from simple ones. In our example, the leap year function is quite simple already. But to showcase the mechanism of function composition, let's create it out of component functions.

// component function
function divisible(dividend, divisor) {
    return dividend % divisor == 0;
}

// composed function
function isLeapYear(year) {
    let isLeap = false;
    divisible(year, 4) && (isLeap = true);
    divisible(year, 100) && (isLeap = false);
    divisible(year, 400) && (isLeap = true);
    return isLeap;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear(2023)); // Output: false
console.log(isLeapYear(1900)); // Output: false
console.log(isLeapYear(2000)); // Output: true

Side-Tracking: Short-Circuiting!

Above, you are using a function to build another function — a component-based approach that you also follow in the React JavaScript-based front-end library.

But wait, before we go further into React, what is that ‘&&’ doing in those three lines in the 'isLeapYear' function when we are not using any if-else statements there?

Welcome to the short-circuit evaluation of logical operators. In that process, an expression stops being evaluated as soon as its outcome is determined. So when two sides are joined by a logical AND (&&) and the first side is false, the whole expression is already false, and the second side is not read (not executed) at all.

But if the first side is evaluated to be true, it further reads (executes) the second side for evaluation. And in that process, it does that assignment on the right-hand side of && in our example.

Similarly, the process when logical OR (||) is involved is such that if the left-hand side is evaluated as true, the whole expression is true (it needs one condition evaluated as true for || for the whole expression to be true). Then, the second side is ignored. The second side is read or executed only when the first side is evaluated as false.

You can use this kind of evaluation logic as a replacement for the ‘if’ condition checks. For more examples of how it works in different scenarios, read the section ‘Short-Circuiting of Logical Operators (&& and ||)’ in my blog post where I have discussed some nuances of JavaScript Operators.
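Here is a small sketch that makes the short-circuiting visible by recording which side of the operator actually gets evaluated. The helper sidesEvaluated is invented purely for this demonstration.

```js
// Returns the list of sides ("left", "right") that were evaluated.
function sidesEvaluated(operator, leftValue) {
  const evaluated = [];
  const left = () => { evaluated.push("left"); return leftValue; };
  const right = () => { evaluated.push("right"); return true; };

  if (operator === "&&") {
    left() && right();
  } else {
    left() || right();
  }
  return evaluated;
}

console.log(sidesEvaluated("&&", false)); // [ 'left' ]
console.log(sidesEvaluated("&&", true));  // [ 'left', 'right' ]
console.log(sidesEvaluated("||", true));  // [ 'left' ]
console.log(sidesEvaluated("||", false)); // [ 'left', 'right' ]
```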

Encapsulation and Declarative Programming

Returning to React and components, the idea of composing functions or components is rooted in the need for encapsulation. With encapsulation, you can hide the complex details, like in a capsule, and use it repeatedly without bothering much about its underlying complexity.

Essentially, you just proclaim (declare) what you need rather than straining yourself with the workload and headache of how you can make it happen step-by-step with ‘do-this’ and ‘do-that’ type statements (imperatives).

That, briefly, is declarative programming for you.
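As a rough illustration of the difference, the sketch below picks the leap years out of a list twice: once imperatively with a loop, and once declaratively with the built-in filter method. The sample years are arbitrary.

```js
const isLeapYear = (year) =>
  (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;

const years = [1900, 2000, 2023, 2024];

// Imperative: spell out how to walk the list and collect matches.
const leapYearsImperative = [];
for (let i = 0; i < years.length; i++) {
  if (isLeapYear(years[i])) {
    leapYearsImperative.push(years[i]);
  }
}

// Declarative: state what you want and let filter handle the iteration.
const leapYearsDeclarative = years.filter(isLeapYear);

console.log(leapYearsImperative); // [ 2000, 2024 ]
console.log(leapYearsDeclarative); // [ 2000, 2024 ]
```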

Going Above & Beyond with Code Quality

So far, we have covered the logical structures and the programming paradigms, but now, let’s look at the third aspect: code quality.

Validations: Beyond the Basic Specifications

The requirements that we laid out at first just considered valid inputs. What if the function is called with arguments that are not the ideal ones — like a non-number, or even if a number but a non-integer?

To address that, we can build validation logic. To build validation logic, you need to think about all the different ways in which the input value (the argument passed to your function) may not be workable for you.

If one of those non-workable inputs does come along, you need to return something that makes more sense, because you cannot give a verdict like true or false in that case. You may return something more neutral (like undefined or null) to indicate that the function encountered an invalid entry.

function isLeapYear(year) {
  if (typeof year != "number" || year % 1 != 0 || year <= 0) return undefined;
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear("TwentyTwentyFour")); // Output: undefined
console.log(isLeapYear(2023.99)); // Output: undefined
console.log(isLeapYear(0)); // Output: undefined
console.log(isLeapYear(-1)); // Output: undefined
console.log(isLeapYear("2024")); // Output: undefined

If you looked carefully, our leap year check uses loose equality (==) instead of strict equality (===). JavaScript would happily coerce a string such as "2024" to a number in those expressions (the % operator already converts its operands), but we can't reap the benefit of that here, because the typeof check rejects every string first.

If our intention is to strictly accept a number, the kind of validation we wrote is fine, and it would then be more consistent to use === throughout.

But if, on the other hand, we want to accept values like "2024", we must enhance our validation logic like so:

function isLeapYear(year) {
  if (isNaN(Number(year)) || year % 1 != 0 || year <= 0) return undefined;
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

// Example usage:
console.log(isLeapYear(2024)); // Output: true
console.log(isLeapYear("TwentyTwentyFour")); // Output: undefined
console.log(isLeapYear(2023.99)); // Output: undefined
console.log(isLeapYear(0)); // Output: undefined
console.log(isLeapYear(-1)); // Output: undefined
console.log(isLeapYear("2024")); // Output: true
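One caveat with relying on Number() for coercion is that it also accepts some surprising inputs: with the function above, true is treated as the year 1, and [2024] or " 2024 " are treated as 2024. If that is too lenient for your use case, a stricter variant might look like the sketch below. The helper toYear and the name isLeapYearStrict are hypothetical, and accepting only digit-only strings is an assumption about what you want to allow.

```js
// Hypothetical helper: returns a positive integer year, or undefined.
function toYear(value) {
  if (typeof value === "number") {
    return Number.isSafeInteger(value) && value > 0 ? value : undefined;
  }
  if (typeof value === "string" && /^\d+$/.test(value)) {
    const parsed = Number(value);
    return Number.isSafeInteger(parsed) && parsed > 0 ? parsed : undefined;
  }
  return undefined; // booleans, arrays, null, padded strings, and so on
}

function isLeapYearStrict(input) {
  const year = toYear(input);
  if (year === undefined) return undefined;
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

console.log(isLeapYearStrict(2024)); // true
console.log(isLeapYearStrict("2024")); // true
console.log(isLeapYearStrict(true)); // undefined
console.log(isLeapYearStrict([2024])); // undefined
console.log(isLeapYearStrict(" 2024 ")); // undefined
```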

Testing it Out From the Outside

In the above two code blocks, we write our code and test it in the same place. But the code that goes into production will not have the opportunity to include such console logs that we have used extensively for demonstrating 'example usage' in the above code blocks.

This is where unit testing comes in. In unit testing, we first export the function for use in other places (files), then import that function in a test file. In that test file we write our test cases, and then we run the file to execute them.

I have used the Jest package to do this unit testing, and here is the code from my index file and test script file:

index.js

function isLeapYear(year) {
  if (isNaN(Number(year)) || year % 1 != 0 || year <= 0) return undefined;
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

module.exports = isLeapYear;

index.test.js

const isLeapYear = require('./index.js');

describe('Test isLeapYear', () => {
  it('should return true for leap year', () => {
    expect(isLeapYear(2020)).toBe(true);
  });
  it('should return false for non-leap year', () => {
    expect(isLeapYear(2023)).toBe(false);
  });
  it('should return undefined for invalid input', () => {
    expect(isLeapYear('TwentyTwentyFour')).toBe(undefined);
    expect(isLeapYear('2023.99')).toBe(undefined);
    expect(isLeapYear('0')).toBe(undefined);
    expect(isLeapYear('-1')).toBe(undefined);
  });
  it('should return true for a leap year in string format', () => {
    expect(isLeapYear("2024")).toBe(true);
  });
});

I installed Jest using the command npm i jest. Then, I added jest as a value for test in the scripts object inside my package.json file. Then, as I ran npm test, it passed all my test cases, like so:

testing output

If you want to tweak and try this unit testing code, you can use and fork this replit project.
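The test file above does not exercise century years, which are exactly where the 100 and 400 rules kick in. Below is a small table of extra cases as a sketch. It uses a plain loop so that it runs without any dependency, and it repeats the function instead of importing it; with Jest, the same rows could be passed to it.each.

```js
function isLeapYear(year) {
  if (isNaN(Number(year)) || year % 1 != 0 || year <= 0) return undefined;
  return ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0) ? true : false;
}

// Each row is [year, expected result].
const centuryCases = [
  [1900, false],
  [2000, true],
  [2100, false],
  [2400, true],
];

for (const [year, expected] of centuryCases) {
  const status = isLeapYear(year) === expected ? "ok" : "FAILED";
  console.log(`${year}: ${status}`);
}
```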

End Note

We've reviewed many programming concepts in the above exercise. And one key takeaway is that a program can be written in multiple ways.

There are typically many correct solutions to a programming problem. So beginner programmers should, therefore, think of the logic part of it (the algorithm) more than the exact execution steps when starting to solve a problem.

And by the way, if you're wondering why we have leap years, then this is for you: the time Earth takes to complete one revolution around the sun is not exactly 365 days (or 365 x 24 hours) but approximately one-quarter of a day extra.

This process may remind you of the modulus operator, represented by the symbol %, which returns the remainder of a division operation. Here, the approximate time (in hours) taken for one revolution of earth is being divided by 24 hours (that is, a day). It gives a remainder of about 6 hours.

const approxTimeHrsRev = 8766;
const hrsPerDay = 24;
let completedDaysEachYear;

let remainderHrsPerYear = approxTimeHrsRev % hrsPerDay;
completedDaysEachYear = (approxTimeHrsRev - remainderHrsPerYear) / hrsPerDay;

console.log(`After ${completedDaysEachYear} complete days, there is still about ${remainderHrsPerYear} hours left out each year.`);
// Output: After 365 complete days, there is still about 6 hours left out each year.

To account for those missed hours, we must adjust our calendars once every four years when those left-out portions add up to make — again approximately — a day.

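Finally, because the leftover is not exactly 6 hours but slightly less (roughly 5 hours and 49 minutes), adding a day every four years overcorrects a little. That is why we skip the leap day in century years, unless the year is also divisible by 400.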
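You can check the effect of the 100 and 400 rules numerically. The sketch below counts the leap years in one 400-year cycle and derives the average length of a calendar year, which lands close to the roughly 365.2422 days of the solar year. The variable names are arbitrary.

```js
const isLeapYear = (year) =>
  (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;

// Count leap years over one full 400-year cycle.
let leapYears = 0;
for (let year = 1; year <= 400; year++) {
  if (isLeapYear(year)) leapYears++;
}

// Average calendar year length over that cycle, in days.
const meanDaysPerYear = (400 * 365 + leapYears) / 400;

console.log(leapYears); // 97
console.log(meanDaysPerYear); // 365.2425
```

Without the century rules, there would be 100 leap years per cycle and the average would be 365.25 days, drifting away from the seasons a little faster.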

How to Create a Music Bot Using Discord.js – Step-by-Step Tutorial

freeCodeCamp — Wed, 28 Feb 2024 13:00:00 +0000

By Gabriel Tanner

The Discord API provides you with an easy tool to create and use your own bots and tools.

In this tutorial, you'll learn how you can create a basic music bot and add it to your server. The bot will be able to play, skip, and stop the music, and will also support queuing functionality.

Prerequisites
How to Set Up a Discord Bot
– How to add the bot to your server
– How to create your project
– Discord.js basics
Discord Bot Version 0.13
– How to create the Discord player
– How to add slash commands
– How to implement interactions
– How to play songs
– How to skip songs
– How to stop songs
– Complete source code for index.js
Discord Bot Version 0.12
– How to read messages
– How to add songs
– How to play songs
– How to skip songs
– How to stop songs
– Complete source code for index.js
Conclusion

Prerequisites

Before we get started creating the bot, make sure that you have installed all the tools you'll need:

After you've installed these, you can continue by setting up your Discord bot.

How to Set Up a Discord Bot

First, you need to create a new application on the Discord Developer Portal.

You can do so by visiting the portal and clicking on New Application.

Create a new Discord application

After that, you need to give your application a name and click the Create button.

Give your bot whatever name you like - I've chosen "music-bot"

After that, select the bot tab and click on Add Bot.

Add your bot under the "Bot" tab

Now your bot is created and you can continue with inviting it to your server.

How to add the bot to your server

After creating your bot, you can invite it using the OAuth2 URL Generator.

For that, you need to navigate to the OAuth2 page and select bot in the scopes section.

Selecting "bot" on the OAuth2 Generator page

After that, you need to select the needed permissions to play music and read messages.

Select the permissions you'll need - "read messages/view channels", "send messages", "manage messages", "add reactions", "use slash commands", "connect", and "speak".

Then you can copy your generated URL and paste it into your browser.

Copy the URL

After pasting it, add it to your server by selecting the server and clicking the authorize button.

How to create your project

Now you can start creating your project using the terminal.

First, create a directory and move into it. You can do so by using these two commands:

mkdir musicbot && cd musicbot

After that, create your project modules using the npm init command. After entering the command, you will be asked some questions – just answer them and continue.

Then you just need to create the two files you will work in.

touch index.js && touch config.json

Now, open your project in your text editor. I personally use VS Code and can open it with the following command:

code .

Discord.js basics

Now you need to install some dependencies before we can get started.

npm install discord.js@^12.5.3 ffmpeg fluent-ffmpeg @discordjs/opus ytdl-core --save

After the installation finishes, you can continue with writing your config.json file. Here, save the token of your bot and the prefix it should listen for.

{
  "prefix": "!",
  "token": "your-token"
}

To get your token, you need to visit the Discord Developer Portal again and copy it from the bot section.

Get your bot token by clicking "Copy" and save it somewhere safe

Those are the only things you need to do in your config.json file. So now it's time to start writing your JavaScript code.
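To give an idea of how the prefix is typically used, here is a minimal sketch of splitting a chat message into a command name and its arguments. The parseCommand helper is hypothetical and not part of discord.js; the config object simply mirrors the shape of the config.json file above.

```js
// Hypothetical helper: turns "!play some song" into a name and arguments.
function parseCommand(content, prefix) {
  if (!content.startsWith(prefix)) return null; // not addressed to the bot
  const [name, ...args] = content.slice(prefix.length).trim().split(/\s+/);
  if (!name) return null; // the message was only the prefix
  return { name: name.toLowerCase(), args };
}

// Same shape as config.json; in the bot you would load it with require().
const config = { prefix: "!", token: "your-token" };

console.log(parseCommand("!play lofi beats", config.prefix));
// { name: 'play', args: [ 'lofi', 'beats' ] }
console.log(parseCommand("hello there", config.prefix)); // null
```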

The article includes two versions: one for the new discord.js v13, which uses slash commands combined with the discord-player library to implement the music functionality, and one for discord.js v12.5.3, which implements the functionality without a library.

The older version is better for learning purposes, and the newer version works with the current discord.js and is a lot easier to implement – so choose which you prefer.

Discord Bot Version 0.13

Now you just need to install some more dependencies before we can get started.

npm install discord.js discord-player @discordjs/opus

After installing the dependencies, import them in your index.js file.

const { Client, GuildMember, Intents } = require("discord.js");
const { Player, QueryType } = require("discord-player");
const config = require("./config.json");

After that, create your client and log in using your token.

const client = new Client({
    intents: [Intents.FLAGS.GUILD_VOICE_STATES, Intents.FLAGS.GUILD_MESSAGES, Intents.FLAGS.GUILDS]
});
client.login(config.token);

Now add some basic listeners that console.log when they get executed.

client.once('ready', () => {
 console.log('Ready!');
});

client.on("error", console.error);
client.on("warn", console.warn);

After that, you can start your bot using the node command and the bot should be online on Discord and print “Ready!” in the console.

node index.js

How to create the Discord player

Now that you've created the client for the discord bot, you can continue by initializing your player. This will allow you to play and manage music in your Discord channel.

const player = new Player(client);

You can also add some error handlers that will be called if an error occurs.

player.on("error", (queue, error) => {
    console.log(`[${queue.guild.name}] Error emitted from the queue: ${error.message}`);
});
player.on("connectionError", (queue, error) => {
    console.log(`[${queue.guild.name}] Error emitted from the connection: ${error.message}`);
});

The last thing you need to do is add listeners for the different player events like a song starting or being added.

player.on("trackStart", (queue, track) => {
    queue.metadata.send(`🎶 | Started playing: **${track.title}** in **${queue.connection.channel.name}**!`);
});

player.on("trackAdd", (queue, track) => {
    queue.metadata.send(`🎶 | Track **${track.title}** queued!`);
});

player.on("botDisconnect", (queue) => {
    queue.metadata.send("❌ | I was manually disconnected from the voice channel, clearing queue!");
});

player.on("channelEmpty", (queue) => {
    queue.metadata.send("❌ | Nobody is in the voice channel, leaving...");
});

player.on("queueEnd", (queue) => {
    queue.metadata.send("✅ | Queue finished!");
});

In most cases, you just send a message into the Discord text channel using the send() function.

How to add slash commands

After you've set up the player successfully, you can continue by adding your Slash commands to your client. This step lets Discord know which commands the bot can execute.

client.on("messageCreate", async (message) => {
    if (message.author.bot || !message.guild) return;
    if (!client.application?.owner) await client.application?.fetch();
});

You can do this by implementing a simple !deploy command that registers your commands through the guild.commands property of the guild the message was sent in.

A slash command has a name, a description, and an optional options field that contains the command’s parameters. For example, the play command takes a song query as an argument.

client.on("messageCreate", async (message) => {
    ...

    if (message.content === "!deploy" && message.author.id === client.application?.owner?.id) {
        await message.guild.commands.set([
            {
                name: "play",
                description: "Plays a song from youtube",
                options: [
                    {
                        name: "query",
                        type: "STRING",
                        description: "The song you want to play",
                        required: true
                    }
                ]
            },
            {
                name: "skip",
                description: "Skip the current song"
            },
            {
                name: "queue",
                description: "See the queue"
            },
            {
                name: "stop",
                description: "Stop the player"
            },
        ]);

        await message.reply("Deployed!");
    }
});

After entering !deploy in your Discord text chat, the slash commands will be added to your application. When typing / into the chat you should see something similar to this:

Example of using the slash commands
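Since the command definitions are plain objects, you can keep them in a constant and run a quick sanity check before handing them to message.guild.commands.set(). The sketch below is only an illustration: findInvalidCommands is a hypothetical helper, and the limits it checks (lowercase names of up to 32 characters, descriptions of 1 to 100 characters) reflect my understanding of Discord's rules, with a deliberately conservative name pattern.

```js
const commands = [
  {
    name: "play",
    description: "Plays a song from youtube",
    options: [
      { name: "query", type: "STRING", description: "The song you want to play", required: true },
    ],
  },
  { name: "skip", description: "Skip the current song" },
  { name: "queue", description: "See the queue" },
  { name: "stop", description: "Stop the player" },
];

// Hypothetical helper: returns the names of definitions that look invalid.
function findInvalidCommands(definitions) {
  const namePattern = /^[a-z0-9_-]{1,32}$/; // assumption: conservative subset
  return definitions
    .filter(({ name, description }) =>
      !namePattern.test(name) ||
      typeof description !== "string" ||
      description.length < 1 ||
      description.length > 100
    )
    .map(({ name }) => name);
}

console.log(findInvalidCommands(commands)); // []
```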

How to implement interactions

Now that the interactions (slash commands) are defined, you need to implement them.

All slash commands trigger the interactionCreate event and can be implemented inside the async function below. Before executing any functionality, run a few conditionals to check if the user is allowed to perform the given functionality.

client.on("interactionCreate", async (interaction) => {
    if (!interaction.isCommand() || !interaction.guildId) return;

    if (!(interaction.member instanceof GuildMember) || !interaction.member.voice.channel) {
        return void interaction.reply({ content: "You are not in a voice channel!", ephemeral: true });
    }

    if (interaction.guild.me.voice.channelId && interaction.member.voice.channelId !== interaction.guild.me.voice.channelId) {
        return void interaction.reply({ content: "You are not in my voice channel!", ephemeral: true });
    }
});

After that, check which command is being executed by matching the commandName with the name of the commands you defined above.

client.on("interactionCreate", async (interaction) => {
    ...

    if (interaction.commandName === "play") {
        // TODO: Implement play command
    }
});

You can then add the implementation inside of the if statement.
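As the number of commands grows, a chain of if statements can get long. One alternative is a lookup table that maps each command name to a handler. The sketch below uses plain objects in place of real discord.js interactions, and the handlers map, the dispatch function, and the query property are all hypothetical names chosen for illustration.

```javascript
// Each command name maps to a handler. The handlers return strings here so the
// sketch runs without Discord; in a real bot they would be async functions that
// call interaction.followUp().
const handlers = new Map([
    ["play", (interaction) => `Searching for ${interaction.query}`],
    ["skip", () => "Skipping the current song"],
    ["stop", () => "Stopping the player"]
]);

function dispatch(interaction) {
    const handler = handlers.get(interaction.commandName);
    // Unknown commands get a fallback instead of throwing
    if (!handler) return "Unknown command!";
    return handler(interaction);
}

console.log(dispatch({ commandName: "play", query: "lofi beats" }));
console.log(dispatch({ commandName: "dance" }));
```

The if/else chain used in this tutorial works just as well for three commands. A table like this mostly pays off once each handler lives in its own file.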

How to play songs

The play command requires you to search for the provided song and add the result to the current queue of songs.

Let’s start by retrieving the user-provided query using the options.get() function. After that you can use the player.search() function to search for the desired song.

if (interaction.commandName === "play") {
    await interaction.deferReply();

    const query = interaction.options.get("query").value;
    const searchResult = await player
        .search(query, {
            requestedBy: interaction.user,
            searchEngine: QueryType.AUTO
        })
        .catch(() => {});
    if (!searchResult || !searchResult.tracks.length) return void interaction.followUp({ content: "No results were found!" });
}

Now that you have the song, you can create a queue for the songs (if there is already a queue, the createQueue function will return the existing one).

Once the queue is created, you can try joining the user’s voice channel. If that is successful, add the song to the current queue using the addTracks function.

if (interaction.commandName === "play") {
    ...

    const queue = await player.createQueue(interaction.guild, {
        metadata: interaction.channel
    });

    try {
        if (!queue.connection) await queue.connect(interaction.member.voice.channel);
    } catch {
        void player.deleteQueue(interaction.guildId);
        return void interaction.followUp({ content: "Could not join your voice channel!" });
    }

    await interaction.followUp({ content: `⏱ | Loading your ${searchResult.playlist ? "playlist" : "track"}...` });
    searchResult.playlist ? queue.addTracks(searchResult.tracks) : queue.addTrack(searchResult.tracks[0]);
    if (!queue.playing) await queue.play();
}

Lastly, if the queue isn’t already playing, let’s start it using the play() function.

How to skip songs

Skipping is quite easy – you can do it by calling the skip() function on the queue.

if (interaction.commandName === "skip") {
    await interaction.deferReply();
    const queue = player.getQueue(interaction.guildId);
    if (!queue || !queue.playing) return void interaction.followUp({ content: "❌ | No music is being played!" });
    const currentTrack = queue.current;
    const success = queue.skip();
    return void interaction.followUp({
        content: success ? `✅ | Skipped **${currentTrack}**!` : "❌ | Something went wrong!"
    });
}

If the action is successful, you can write a message to the Discord text channel using interaction.followUp().

How to stop songs

The stop functionality will remove all the songs from the queue and the bot will leave the voice channel. You can do this by destroying the current queue which automatically makes the bot leave the voice channel (unless you configure it otherwise in the player configuration).

else if (interaction.commandName === "stop") {
    await interaction.deferReply();
    const queue = player.getQueue(interaction.guildId);
    if (!queue || !queue.playing) return void interaction.followUp({ content: "❌ | No music is being played!" });
    queue.destroy();
    return void interaction.followUp({ content: "🛑 | Stopped the player!" });
}

Complete source code for the index.js:

Here you can get the complete source code for the music bot:

const { Client, GuildMember, Intents } = require("discord.js");
const { Player, QueryType } = require("discord-player");
const config = require("./config.json");

const client = new Client({
    intents: [Intents.FLAGS.GUILD_VOICE_STATES, Intents.FLAGS.GUILD_MESSAGES, Intents.FLAGS.GUILDS]
});

client.on("ready", () => {
    console.log("Bot is online!");
    client.user.setActivity({
        name: "🎶 | Music Time",
        type: "LISTENING"
    });
});
client.on("error", console.error);
client.on("warn", console.warn);

const player = new Player(client);

player.on("error", (queue, error) => {
    console.log(`[${queue.guild.name}] Error emitted from the queue: ${error.message}`);
});
player.on("connectionError", (queue, error) => {
    console.log(`[${queue.guild.name}] Error emitted from the connection: ${error.message}`);
});

player.on("trackStart", (queue, track) => {
    queue.metadata.send(`🎶 | Started playing: **${track.title}** in **${queue.connection.channel.name}**!`);
});

player.on("trackAdd", (queue, track) => {
    queue.metadata.send(`🎶 | Track **${track.title}** queued!`);
});

player.on("botDisconnect", (queue) => {
    queue.metadata.send("❌ | I was manually disconnected from the voice channel, clearing queue!");
});

player.on("channelEmpty", (queue) => {
    queue.metadata.send("❌ | Nobody is in the voice channel, leaving...");
});

player.on("queueEnd", (queue) => {
    queue.metadata.send("✅ | Queue finished!");
});

client.on("messageCreate", async (message) => {
    if (message.author.bot || !message.guild) return;
    if (!client.application?.owner) await client.application?.fetch();

    if (message.content === "!deploy" && message.author.id === client.application?.owner?.id) {
        await message.guild.commands.set([
            {
                name: "play",
                description: "Plays a song from youtube",
                options: [
                    {
                        name: "query",
                        type: "STRING",
                        description: "The song you want to play",
                        required: true
                    }
                ]
            },
            {
                name: "skip",
                description: "Skip the current song"
            },
            {
                name: "stop",
                description: "Stop the player"
            },
        ]);

        await message.reply("Deployed!");
    }
});

client.on("interactionCreate", async (interaction) => {
    if (!interaction.isCommand() || !interaction.guildId) return;

    if (!(interaction.member instanceof GuildMember) || !interaction.member.voice.channel) {
        return void interaction.reply({ content: "You are not in a voice channel!", ephemeral: true });
    }

    if (interaction.guild.me.voice.channelId && interaction.member.voice.channelId !== interaction.guild.me.voice.channelId) {
        return void interaction.reply({ content: "You are not in my voice channel!", ephemeral: true });
    }

    if (interaction.commandName === "play") {
        await interaction.deferReply();

        const query = interaction.options.get("query").value;
        const searchResult = await player
            .search(query, {
                requestedBy: interaction.user,
                searchEngine: QueryType.AUTO
            })
            .catch(() => {});
        if (!searchResult || !searchResult.tracks.length) return void interaction.followUp({ content: "No results were found!" });

        const queue = await player.createQueue(interaction.guild, {
            metadata: interaction.channel
        });

        try {
            if (!queue.connection) await queue.connect(interaction.member.voice.channel);
        } catch {
            void player.deleteQueue(interaction.guildId);
            return void interaction.followUp({ content: "Could not join your voice channel!" });
        }

        await interaction.followUp({ content: `⏱ | Loading your ${searchResult.playlist ? "playlist" : "track"}...` });
        searchResult.playlist ? queue.addTracks(searchResult.tracks) : queue.addTrack(searchResult.tracks[0]);
        if (!queue.playing) await queue.play();
    } else if (interaction.commandName === "skip") {
        await interaction.deferReply();
        const queue = player.getQueue(interaction.guildId);
        if (!queue || !queue.playing) return void interaction.followUp({ content: "❌ | No music is being played!" });
        const currentTrack = queue.current;
        const success = queue.skip();
        return void interaction.followUp({
            content: success ? `✅ | Skipped **${currentTrack}**!` : "❌ | Something went wrong!"
        });
    } else if (interaction.commandName === "stop") {
        await interaction.deferReply();
        const queue = player.getQueue(interaction.guildId);
        if (!queue || !queue.playing) return void interaction.followUp({ content: "❌ | No music is being played!" });
        queue.destroy();
        return void interaction.followUp({ content: "🛑 | Stopped the player!" });
    } else {
        interaction.reply({
            content: "Unknown command!",
            ephemeral: true
        });
    }
});

client.login(config.token);

Discord Bot Version 0.12

Now you'll just need to install some dependencies before we can get started.

npm install discord.js ffmpeg fluent-ffmpeg @discordjs/opus ytdl-core --save

After installing the dependencies, import them in your index.js file.

const Discord = require('discord.js');
const {
    prefix,
    token,
} = require('./config.json');
const ytdl = require('ytdl-core');

After that, create your client and login using your token.

const client = new Discord.Client();
client.login(token);

Now let’s add some basic listeners that console.log when they get executed.

client.once('ready', () => {
 console.log('Ready!');
});
client.once('reconnecting', () => {
 console.log('Reconnecting!');
});
client.once('disconnect', () => {
 console.log('Disconnect!');
});

After that, you can start your bot using the node command and it should be online on Discord and print “Ready!” in the console.

node index.js

How to read messages

Now that your bot is on your server and able to go online, you can start reading chat messages and responding to them.

To read messages, you only need to write one simple function:

client.on('message', async message => {

});

Here, you're creating a listener for the message event. Whenever the event fires, the callback receives the incoming message as a message object.

Now you need to check if the message is from your own bot and ignore it if it is.

if (message.author.bot) return;

In this line, you're checking if the author of the message is a bot (including your own) and returning if it is.

After that, check if the message starts with the prefix you defined earlier and return if it doesn’t.

if (!message.content.startsWith(prefix)) return;

After that, you can check which command you need to execute. You can do so using some simple if statements:

const serverQueue = queue.get(message.guild.id);

if (message.content.startsWith(`${prefix}play`)) {
    execute(message, serverQueue);
    return;
} else if (message.content.startsWith(`${prefix}skip`)) {
    skip(message, serverQueue);
    return;
} else if (message.content.startsWith(`${prefix}stop`)) {
    stop(message, serverQueue);
    return;
} else {
    message.channel.send("You need to enter a valid command!");
}

In this code block, you're checking which command to execute and calling the command. If the input command isn’t valid, you're writing an error message into the chat using the send() function.
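One thing to keep in mind is that startsWith() also matches longer words, so a message like !playlist would trigger the play branch. A stricter approach is to split the message into a command name and its arguments first. The following is a minimal sketch; parseCommand is a hypothetical helper and not part of discord.js.

```javascript
// Splits "!play some-url" into a command name and its arguments.
// Returns null when the message doesn't start with the prefix.
function parseCommand(content, prefix) {
    if (!content.startsWith(prefix)) return null;
    const [name, ...args] = content.slice(prefix.length).trim().split(/\s+/);
    return { name: name.toLowerCase(), args };
}

const parsed = parseCommand("!play https://example.com/song", "!");
console.log(parsed.name, parsed.args);

// "!playlist" is parsed as its own command name, not as "play"
console.log(parseCommand("!playlist", "!").name);
```

With a helper like this, you could compare the parsed name against "play", "skip", and "stop" exactly, and take the song URL from args[0].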

Now that you know which command you need to execute, you can start implementing these commands.

How to add songs

Let’s start by adding the play command. For that, you'll need a song and a guild (a guild represents an isolated collection of users and channels and is often referred to as a server). You'll also need the ytdl library you installed earlier.

First, create a map named queue where you save all the songs that users request in the chat.

const queue = new Map();

After that, create an async function called execute and check if the user is in a voice chat and if the bot has the right permissions. If not, write an error message and return.

async function execute(message, serverQueue) {
  const args = message.content.split(" ");

  const voiceChannel = message.member.voice.channel;
  if (!voiceChannel)
    return message.channel.send(
      "You need to be in a voice channel to play music!"
    );
  const permissions = voiceChannel.permissionsFor(message.client.user);
  if (!permissions.has("CONNECT") || !permissions.has("SPEAK")) {
    return message.channel.send(
      "I need the permissions to join and speak in your voice channel!"
    );
  }
}

Now you can continue with getting the song info and saving it into a song object. For that, use your ytdl library which gets the song information from the YouTube link.

const songInfo = await ytdl.getInfo(args[1]);
const song = {
 title: songInfo.title,
 url: songInfo.video_url,
};

This will get the information of the song using the ytdl library you installed earlier. Then, save the information you need into a song object.

After saving the song info, you just need to create a contract you can add to your queue.

To do so, first check if your serverQueue is already defined which means that music is already playing. If so, add the song to your existing serverQueue and send a success message. If not, create it and try to join the voice channel and start playing music.

if (!serverQueue) {

} else {
 serverQueue.songs.push(song);
 console.log(serverQueue.songs);
 return message.channel.send(`${song.title} has been added to the queue!`);
}

Here, check if the serverQueue is empty and add the song to it if it’s not. Now you just need to create your contract if the serverQueue is null.

// Creating the contract for our queue
const queueContruct = {
 textChannel: message.channel,
 voiceChannel: voiceChannel,
 connection: null,
 songs: [],
 volume: 5,
 playing: true,
};
// Setting the queue using our contract
queue.set(message.guild.id, queueContruct);
// Pushing the song to our songs array
queueContruct.songs.push(song);

try {
 // Here we try to join the voicechat and save our connection into our object.
 var connection = await voiceChannel.join();
 queueContruct.connection = connection;
 // Calling the play function to start a song
 play(message.guild, queueContruct.songs[0]);
} catch (err) {
 // Printing the error message if the bot fails to join the voicechat
 console.log(err);
 queue.delete(message.guild.id);
 return message.channel.send(err);
}

In this code block, you created a contract and added your song to the songs array. After that, you tried to join the user's voice chat and called the play() function, which you'll implement next.

How to play songs

Now that you can add songs to your queue and create a contract if there isn’t one yet, you can implement the play functionality.

First, create a function called play which takes two parameters (the guild and the song you want to play) and checks if the song is empty. If so, just leave the voice channel and delete the queue.

function play(guild, song) {
  const serverQueue = queue.get(guild.id);
  if (!song) {
    serverQueue.voiceChannel.leave();
    queue.delete(guild.id);
    return;
  }
}

After that, start playing your song using the play() function of the connection and passing the URL of your song.

const dispatcher = serverQueue.connection
    .play(ytdl(song.url))
    .on("finish", () => {
        serverQueue.songs.shift();
        play(guild, serverQueue.songs[0]);
    })
    .on("error", error => console.error(error));
dispatcher.setVolumeLogarithmic(serverQueue.volume / 5);
serverQueue.textChannel.send(`Start playing: **${song.title}**`);

Here, you created a stream and passed it the URL of your song. You also added two listeners that handle the finish and error events.

Note: This is a recursive function which means that it calls itself over and over again. We're using recursion so it plays the next song when the song is finished.
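To see the recursion in isolation, here is a small simulation that runs without Discord. It's only a sketch: createFakeConnection is a hypothetical stand-in for the voice connection, and its finish() method plays the role of the dispatcher's "finish" event.

```javascript
// Stand-in for a voice connection. play() "starts" a song and returns a fake
// dispatcher whose finish() method simulates the song ending.
function createFakeConnection(log) {
    return {
        play(song) {
            log.push(`playing ${song}`);
            let onFinish = () => {};
            return {
                on(event, callback) {
                    if (event === "finish") onFinish = callback;
                    return this;
                },
                finish() {
                    onFinish();
                }
            };
        }
    };
}

function play(serverQueue, log) {
    const song = serverQueue.songs[0];
    if (!song) {
        // Base case of the recursion: nothing left to play
        log.push("queue empty, leaving");
        return;
    }
    serverQueue.dispatcher = serverQueue.connection.play(song).on("finish", () => {
        serverQueue.songs.shift();
        play(serverQueue, log); // recursive call for the next song
    });
}

const log = [];
const serverQueue = { songs: ["song A", "song B"], connection: createFakeConnection(log), dispatcher: null };

play(serverQueue, log);
serverQueue.dispatcher.finish(); // song A ends, song B starts
serverQueue.dispatcher.finish(); // song B ends, the queue is empty
console.log(log);
```

Each finish event removes the song that just ended and calls play() again, so the recursion stops on its own once the songs array is empty.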

Now you're ready to play a song by typing !play followed by a YouTube URL in the chat.

How to skip songs

Now you can implement the skipping functionality. For that, you just need to end the dispatcher you created in your play() function so it starts the next song.

function skip(message, serverQueue) {
  if (!message.member.voice.channel)
    return message.channel.send(
      "You have to be in a voice channel to skip the music!"
    );
  if (!serverQueue)
    return message.channel.send("There is no song that I could skip!");
  serverQueue.connection.dispatcher.end();
}

Here, you're checking if the user that typed the command is in a voice channel and if there is a song to skip.

How to stop songs

The stop() function is almost the same as skip(), except that you clear the songs array which will make your bot delete the queue and leave the voice chat.

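function stop(message, serverQueue) {
  if (!message.member.voice.channel)
    return message.channel.send(
      "You have to be in a voice channel to stop the music!"
    );
  // Without an active queue there is nothing to stop
  if (!serverQueue)
    return message.channel.send("There is no song that I could stop!");
  serverQueue.songs = [];
  serverQueue.connection.dispatcher.end();
}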

Complete source code for the index.js:

Here you can get the complete source code for the music bot:

const Discord = require("discord.js");
const { prefix, token } = require("./config.json");
const ytdl = require("ytdl-core");

const client = new Discord.Client();

const queue = new Map();

client.once("ready", () => {
  console.log("Ready!");
});

client.once("reconnecting", () => {
  console.log("Reconnecting!");
});

client.once("disconnect", () => {
  console.log("Disconnect!");
});

client.on("message", async message => {
  if (message.author.bot) return;
  if (!message.content.startsWith(prefix)) return;

  const serverQueue = queue.get(message.guild.id);

  if (message.content.startsWith(`${prefix}play`)) {
    execute(message, serverQueue);
    return;
  } else if (message.content.startsWith(`${prefix}skip`)) {
    skip(message, serverQueue);
    return;
  } else if (message.content.startsWith(`${prefix}stop`)) {
    stop(message, serverQueue);
    return;
  } else {
    message.channel.send("You need to enter a valid command!");
  }
});

async function execute(message, serverQueue) {
  const args = message.content.split(" ");

  const voiceChannel = message.member.voice.channel;
  if (!voiceChannel)
    return message.channel.send(
      "You need to be in a voice channel to play music!"
    );
  const permissions = voiceChannel.permissionsFor(message.client.user);
  if (!permissions.has("CONNECT") || !permissions.has("SPEAK")) {
    return message.channel.send(
      "I need the permissions to join and speak in your voice channel!"
    );
  }

  const songInfo = await ytdl.getInfo(args[1]);
  const song = {
    title: songInfo.title,
    url: songInfo.video_url
  };

  if (!serverQueue) {
    const queueContruct = {
      textChannel: message.channel,
      voiceChannel: voiceChannel,
      connection: null,
      songs: [],
      volume: 5,
      playing: true
    };

    queue.set(message.guild.id, queueContruct);

    queueContruct.songs.push(song);

    try {
      var connection = await voiceChannel.join();
      queueContruct.connection = connection;
      play(message.guild, queueContruct.songs[0]);
    } catch (err) {
      console.log(err);
      queue.delete(message.guild.id);
      return message.channel.send(err);
    }
  } else {
    serverQueue.songs.push(song);
    return message.channel.send(`${song.title} has been added to the queue!`);
  }
}

function skip(message, serverQueue) {
  if (!message.member.voice.channel)
    return message.channel.send(
      "You have to be in a voice channel to skip the music!"
    );
  if (!serverQueue)
    return message.channel.send("There is no song that I could skip!");
  serverQueue.connection.dispatcher.end();
}

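function stop(message, serverQueue) {
  if (!message.member.voice.channel)
    return message.channel.send(
      "You have to be in a voice channel to stop the music!"
    );
  // Without an active queue there is nothing to stop
  if (!serverQueue)
    return message.channel.send("There is no song that I could stop!");
  serverQueue.songs = [];
  serverQueue.connection.dispatcher.end();
}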

function play(guild, song) {
  const serverQueue = queue.get(guild.id);
  if (!song) {
    serverQueue.voiceChannel.leave();
    queue.delete(guild.id);
    return;
  }

  const dispatcher = serverQueue.connection
    .play(ytdl(song.url))
    .on("finish", () => {
      serverQueue.songs.shift();
      play(guild, serverQueue.songs[0]);
    })
    .on("error", error => console.error(error));
  dispatcher.setVolumeLogarithmic(serverQueue.volume / 5);
  serverQueue.textChannel.send(`Start playing: **${song.title}**`);
}

client.login(token);

Conclusion

You made it all the way until the end! Hope that this article helped you understand the Discord API and how you can use it to create a simple bot.

If you want to see an example of a more advanced discord bot, you can visit my GitHub repository.

If you have found this useful, please consider recommending and sharing it with fellow developers.

If you have any questions or feedback, let me know and I'd be happy to help.

Idempotence in HTTP Methods – Explained with CRUD Examples

Yemi Ojedapo — Fri, 22 Dec 2023 21:19:43 +0000

Idempotence refers to a program's ability to maintain a particular result even after repeated actions.

For example, let's say you have a button that only opens a door when pressed. This button does not have the ability to close the door, so it stays open even when it's pressed repeatedly. It simply remains in the state it was changed to by the first press.

This same logic applies to idempotent HTTP methods. Calling an idempotent HTTP method repeatedly won't have any additional effect beyond the initial execution.

Understanding idempotence is important for API design, because it determines how API endpoints should behave when clients send the same request more than once.

In this tutorial, I'll explain the concept of idempotence and the role it plays in building robust and functional APIs. You'll also learn about what safe methods are, how they relate to idempotence, and how to implement idempotency in non-idempotent methods.

Prerequisites

Before understanding and implementing idempotence in API design, it's essential to have a solid foundation in the following areas:

RESTful Principles
Fundamentals of HTTP methods
API Development
HTTP Status codes
Basics of Web development.

Idempotence Example

Let's start off with an example of idempotence in action. We'll create a function that uses the DELETE method to delete data from a web page:


from flask import Flask, jsonify, abort

app = Flask(__name__)

web_page_data = [
   {"id": 1, "content": "Row 1 data"},
   {"id": 2, "content": "Row 2 data"},
   # Add more rows as needed
]

@app.route('/delete_row/<int:row_id>', methods=['DELETE'])
def delete_row(row_id):
   # Find the row to delete
   row_to_delete = next((row for row in web_page_data if row["id"] == row_id), None)

   if row_to_delete:
       # Simulate deletion
       web_page_data.remove(row_to_delete)
       return jsonify({"message": f"Row {row_id} deleted successfully."}), 200
   else:
       abort(404, description=f"Row {row_id} not found.")

if __name__ == '__main__':
   app.run(debug=True)

This function deletes the row chosen by the user. Because of the idempotent nature of the DELETE method, the row is deleted only once, no matter how many times the request is repeated. Subsequent calls return a 404 error since the first call already deleted the row, but the state of the server is the same after every call.
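The same behavior can be shown without a web framework. The snippet below is a minimal JavaScript sketch, where createStore and deleteRow are hypothetical names: the status code differs between the first and second call, but the stored data ends up identical.

```javascript
// In-memory stand-in for the server's data
function createStore(rows) {
    const data = new Map(rows.map((row) => [row.id, row.content]));
    return {
        deleteRow(id) {
            if (!data.has(id)) return { status: 404 };
            data.delete(id);
            return { status: 200 };
        },
        ids() {
            return [...data.keys()];
        }
    };
}

const store = createStore([
    { id: 1, content: "Row 1 data" },
    { id: 2, content: "Row 2 data" }
]);

const firstResponse = store.deleteRow(1);
const idsAfterFirst = store.ids();

const secondResponse = store.deleteRow(1);
const idsAfterSecond = store.ids();

// The responses differ, but the remaining rows are the same after both calls
console.log(firstResponse.status, secondResponse.status);
console.log(idsAfterFirst, idsAfterSecond);
```

Idempotence is about the state of the resource, not about the response, which is why a 200 followed by a 404 doesn't break the guarantee.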

Let’s look at another example with the GET method. The GET method is used to retrieve data from a resource. Let’s create a function that uses the GET method to retrieve a username:

import requests

def get_username():
    url = 'https://api.example.com/get_username'
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()['username']
        else:
            return None
    except requests.RequestException as e:
        print(f"Error occurred: {e}")
        return None

# Usage
username = get_username()
if username:
    print(f"The username is: {username}")
else:
    print("Failed to retrieve the username.")

In this example, we define the get_username() function, which sends a GET request to the API endpoint to retrieve the username. If the request is successful, we extract the username from the JSON response and return it. But if any error occurs during the request, we handle it and return None.

The idempotent nature of the GET method means that calling get_username() multiple times doesn't change anything on the server. Each call simply fetches the username from the resource, and the result stays the same as long as the resource itself is unchanged.

Idempotent vs. Non-Idempotent HTTP Methods:

HTTP methods determine how data is fetched, modified, or created when interacting with APIs. Idempotency is one of the key concepts that influences the consistency and reliability of these methods.

Here's a breakdown of the different methods based on their idempotency.

Idempotent methods:

GET
HEAD
PUT
DELETE
OPTIONS
TRACE

Non-idempotent methods:

POST
PATCH
CONNECT
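To make the difference concrete, here is a small in-memory sketch in JavaScript. The putUser and postUser functions are hypothetical stand-ins for PUT and POST handlers, not a real server implementation.

```javascript
const users = new Map();
let nextId = 1;

// PUT-style: the client chooses the id, so repeating the call overwrites the same entry
function putUser(id, user) {
    users.set(id, user);
    return users.size;
}

// POST-style: the server assigns a new id on every call, so repeats create new entries
function postUser(user) {
    users.set(nextId++, user);
    return users.size;
}

putUser(100, { name: "ada" });
const sizeAfterPuts = putUser(100, { name: "ada" }); // still one entry

postUser({ name: "grace" });
const sizeAfterPosts = postUser({ name: "grace" }); // two more entries

console.log(sizeAfterPuts, sizeAfterPosts);
```

Repeating the PUT-style call leaves a single entry, while every repeated POST-style call adds another one.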

Safe Methods

In our previous example, we used the GET method to retrieve a username and this had no side effect on the server. This is because it is a safe method.

A safe method is a type of method that doesn’t modify the server’s state or the resource being accessed. In other words, they perform read-only operations used to retrieve data or for resource representation.

When you make a request using a safe method, the server does not perform any operations that modify the resource's state. Like in our previous example, we retrieved the username from the webpage which is the resource without changing anything in the server.

All safe methods are automatically idempotent, but not all idempotent methods are safe. This is because while idempotent methods produce consistent results when called repeatedly, some of them may still modify the server's state or the resource being accessed.

Like in our first example, the DELETE method is idempotent, because deleting a resource multiple times will have the same effect. But it's not safe, as it changes the server's state by removing the resource.

Here’s a classification of HTTP methods based on their safe status:

Safe methods:

GET
OPTIONS
HEAD

Unsafe methods:

DELETE
POST
PUT
PATCH

Why is POST not idempotent?

POST is an HTTP method that sends information to a server. When you make a POST request, you typically submit data to create a new resource or trigger a server-side action. Therefore, making the same request multiple times can result in different outcomes and side effects on the server. This can lead to duplicated data, strained server resources, and reduced performance because of the repeated action.

Unlike idempotent methods like GET, PUT, and DELETE, which have consistent results regardless of repetition, POST requests can cause changes to the server's state with each invocation.

POST requests often create new resources on the server. Repeating the same POST request will generate multiple identical resources, potentially leading to duplication.

A similar subtlety applies to DELETE, which is idempotent but not safe. A request that deletes one specific entry is idempotent: after the first call the entry is gone, and repeating the call changes nothing further. But if a developer creates an endpoint that deletes whatever entry is currently last in a collection, each repeated DELETE call removes a different entry. Such an endpoint would be considered non-idempotent, even though it uses the DELETE method.
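The difference between the two kinds of delete can be sketched in a few lines of JavaScript. Both deleteById and deleteLast are hypothetical helper names used only for illustration.

```javascript
// Deleting a specific entry: repeating the call gives the same final state
function deleteById(entries, id) {
    return entries.filter((entry) => entry.id !== id);
}

// Deleting "the last entry": every repeated call removes a different entry
function deleteLast(entries) {
    return entries.slice(0, -1);
}

const entries = [{ id: 1 }, { id: 2 }, { id: 3 }];

const byIdOnce = deleteById(entries, 3);
const byIdTwice = deleteById(byIdOnce, 3);

const lastOnce = deleteLast(entries);
const lastTwice = deleteLast(lastOnce);

console.log(byIdOnce.length, byIdTwice.length); // same length both times
console.log(lastOnce.length, lastTwice.length); // shrinks with every call
```

The first pair has the same length after one call and after two, while the second pair keeps shrinking, which is exactly what makes a "delete last" operation non-idempotent.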

How to Achieve Idempotency with Non-Idempotent Methods

Idempotency isn't only a property inherent to certain methods – it can also be implemented as a feature of a non-idempotent method.

Here are some techniques to achieve idempotency even with non-idempotent methods.

Unique Identifiers

Adding unique identifiers to every request is one of the most common techniques used to implement idempotency. It works by tracking whether the operation has already been performed or not. If it's a duplicate (a repeat request), the server knows it's already dealt with that request and simply ignores it, ensuring that no side effects occur.

Here's an example of how it works:

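from uuid import uuid4

import requests
from django.http import HttpResponse

# Order is assumed to be a Django model with a unique_id field,
# defined elsewhere in your project.

def process_order(unique_id, order_data):
    # A repeated request with the same identifier must not create a second order
    if Order.objects.filter(unique_id=unique_id).exists():
        return HttpResponse(status=409)  # Conflict
    order = Order.objects.create(unique_id=unique_id, **order_data)
    return HttpResponse(status=201, content_type="application/json")

# Example usage
post_data = {"products": [...]}
headers = {"X-Unique-ID": str(uuid4())}
requests.post("https://api.example.com/orders", data=post_data, headers=headers)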

In this code snippet, we define a function called process_order that creates orders in an API, using unique identifiers to implement idempotency.

Here's a breakdown of the code:

Importing the Unique Identifier Generator:

from uuid import uuid4: The code snippet starts by importing the uuid4 function from the uuid module. This function generates unique identifiers, which are used to achieve idempotency in this code.

Defining the `process_order` Function:

def process_order(unique_id, order_data): This line defines a function named process_order that takes two arguments:

unique_id: This is a string representing a unique identifier for the request. This ensures no duplicate orders are created with the same identifier.
order_data: This is a dictionary containing the actual order data, like product information and customer details.

Checking for Existing Orders:

if Order.objects.filter(unique_id=unique_id).exists(): This line checks if an order with the same unique_id already exists in the database.

Order.objects.filter(unique_id=unique_id).exists() queries the Order model for orders with the matching unique_id and checks if any orders were found in the query result. If an order is found, it means the same request was already processed.

Handling existing orders:

return HttpResponse(status=409): If an order with the same unique_id already exists, the function immediately returns an HTTP response with status code 409 indicating a conflict. This prevents duplicate orders from being created.

Creating a new order (if unique):

order = Order.objects.create(unique_id=unique_id, **order_data): This line only runs if no existing order is found.

Order.objects.create: creates a new object in the Order model.

unique_id=unique_id sets the unique_id attribute of the new order to the provided unique_id.

**order_data: unpacks the order_data dictionary as keyword arguments to the Order model's constructor, setting other relevant attributes like products and customer information.

Sending a success response:

return HttpResponse(status=201, content_type="application/json"): If the order creation is successful, the function will return an HTTP response with status code 201 which shows a successful creation. It also specifies the response content type as JSON, assuming the order data might be returned in JSON format.

post_data = {"products": [...]}: This line defines an example request body, a dictionary containing the actual order data, like a list of products.

headers = {"X-Unique-ID": str(uuid4())}: This line creates a dictionary containing a custom header named X-Unique-ID. It generates a unique identifier string using uuid4() and adds it to the header.

requests.post("https://api.example.com/orders", data=post_data, headers=headers): This line sends a POST request to the API endpoint https://api.example.com/orders with the provided post_data and headers.

How does this implement idempotence?

It does so by using a unique identifier (unique_id) for each order.

It checks whether an order with the same identifier already exists in the database. If one does, the server returns a 409 Conflict status without creating anything. Otherwise, it creates a new order and responds with a 201 Created status. Because a repeated request with the same identifier can never create a second order, the operation is idempotent.
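To see the check-then-create flow in isolation, here is a minimal, framework-free sketch. The in-memory orders dictionary and the create_order helper are hypothetical stand-ins for the Order model and the Django view; only the 201 and 409 status codes come from the walkthrough above.

```python
from uuid import uuid4

# Hypothetical in-memory stand-in for the Order table, keyed by unique_id
orders = {}

def create_order(unique_id, order_data):
    """Return an HTTP-style status code for an order creation attempt."""
    if unique_id in orders:
        # Same unique_id seen before: report a conflict and create nothing
        return 409
    orders[unique_id] = dict(order_data)
    return 201

# The client generates one identifier and reuses it when retrying
request_id = str(uuid4())
first = create_order(request_id, {"products": ["book"]})
retry = create_order(request_id, {"products": ["book"]})
print(first, retry, len(orders))
```

With a real database, the existence check and the insert are two separate steps, so two concurrent requests could both pass the check. A unique constraint on the identifier column is one common way to close that gap.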

Token-based Authorization

Token-based authorization is a form of authorization that assigns temporary tokens for each non-idempotent action. Once the action is completed, the token is invalidated. If the same request comes again with the same token, the server recognizes it as invalid and refuses the request, thereby preventing duplicate actions.

// Generate a unique token for this action
const token = generateToken();

fetch("https://api.example.com/create-user", {
    method: "POST",
    body: JSON.stringify({ username, password }),
    headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
    },
})
    .then(response => {
        // Handle successful response
        if (response.ok) {
            // Do something with the successful response
        } else {
            // Handle non-successful response
        }
    })
    .catch(error => {
        // Handle error
        console.error("Error occurred:", error);
    })
    .finally(() => {
        // Invalidate token after successful action or in case of an error
        invalidateToken(token);
    });

// Simple implementation for generating a token
// (for illustration only: Math.random() is not cryptographically secure)
function generateToken() {
    return Math.random().toString(36).slice(2);
}

// Simple implementation for invalidating a token
function invalidateToken(token) {
    // Add your logic to invalidate the token, e.g., remove it from storage
}

Here's a breakdown of the code:

Generating a unique token:

const token = generateToken(): This line calls generateToken() (defined at the bottom of the snippet), which generates a token string. This token will be used for authorization and idempotency.

Sending the `POST` request:

fetch("https://api.example.com/create-user", { ... }): This line uses the fetch API to send a POST request to the API endpoint https://api.example.com/create-user.

method: "POST": This specifies the HTTP method as POST, indicating the intention to create a new user.

body: JSON.stringify({ username, password }): This defines the request body with user details like username and password. The data is converted to JSON format before sending.

headers: { Authorization: `Bearer ${token}` }: This sets the Authorization header in the request. The header value includes the generated token prefixed with "Bearer ".

Handling the Response:

.then(response => { ... }): This block runs once the server responds. Check response.ok to tell a successful response from an error status. On success, you would handle things like storing user information or redirecting the user.

.catch(error => { ... }): This block runs if the request itself fails, for example because of a network error. You would log the error or handle specific error scenarios here.

Invalidating the Token:

invalidateToken(token): This line calls invalidateToken(token) (a placeholder at the bottom of the snippet whose body you would supply), which marks the used token as invalid. This ensures the same token cannot be used for subsequent requests, adding to the idempotency guarantee.

How does this implement Idempotence?

This code snippet uses token-based authorization to implement idempotency in a POST request to create a user on an API. The token is generated once per action, so if the same user creation request is accidentally sent more than once, each copy carries the same token in the Authorization header.

The API server can recognize and verify the token. Once the user creation action has been performed (assuming it's successful the first time) and the token invalidated, the server refuses any later request that presents it, so it won't create duplicate users.
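The client code above does not show what the server does with the token. The following is a rough sketch of that side, under stated assumptions: the valid_tokens set, the issue_token and create_user helpers, and the 403 status for a spent token are all hypothetical and chosen only for illustration.

```python
import secrets

# Hypothetical server-side state: tokens that are still valid, and created users
valid_tokens = set()
users = []

def issue_token():
    """Hand out a fresh single-use token."""
    token = secrets.token_urlsafe(16)
    valid_tokens.add(token)
    return token

def create_user(token, username):
    """Create the user only if the token has not been used yet."""
    if token not in valid_tokens:
        # Unknown or already-used token (the 403 status is an assumption)
        return 403
    valid_tokens.discard(token)  # invalidate the token before doing the work
    users.append(username)
    return 201

token = issue_token()
first = create_user(token, "ada")
retry = create_user(token, "ada")  # same token again: refused
print(first, retry, users)
```

Invalidating the token on the server, rather than trusting the client to do it, means a duplicate request is rejected even if the client never learns that the first attempt succeeded.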

ETag Header

An ETag (Entity Tag) header is an HTTP header used for web cache validation and conditional requests. Here, it is mainly used with PUT requests so that a resource is only updated if it hasn't changed since the client last fetched it.

When you want to update a resource, the server sends you its ETag which is then included in your PUT request along with the updated data. If the ETag hasn't changed (meaning the resource remains the same), the server accepts the update. But if the ETag has changed, the server rejects the update, preventing it from overwriting someone else's changes.

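from django.http import HttpResponse

# Article is assumed to be a model defined in your app; request is the incoming Django request
def update_article(request, article_id, content):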
    # Get existing article and its ETag
    article = Article.objects.get(pk=article_id)
    etag = article.etag

    # Check if ETag matches with request header
    if request.headers.get("If-Match") != etag:
        return HttpResponse(status=409)  # Conflict

    # Update article content and generate new ETag
    article.content = content
    article.save()
    new_etag = article.etag

    # Return success response with updated ETag
    return HttpResponse(status=200, content_type="text/plain", content=new_etag)

In this code snippet, we define a function called update_article that allows you to update the content of an existing article based on its ID and new content. It implements idempotency using the ETag header technique.

Here's a step-by-step explanation of how it works:

Getting the Existing Article and its ETag:

article = Article.objects.get(pk=article_id): This line fetches the article with the provided article_id from the database using the Article model.

etag = article.etag: This line extracts the ETag value from the retrieved article object. The ETag serves as a unique identifier for the article's current state.

Checking for a Match:

if request.headers.get("If-Match") != etag: This line checks whether the ETag the client sent in the If-Match request header matches the ETag of the retrieved article.

return HttpResponse(status=409): If the ETag doesn't match, it indicates that the article might have been updated by another request since the client retrieved its information. The function returns a 409 Conflict response, which prevents accidental data corruption.

Updating the Article Content and generating a new ETag:

article.content = content: This line updates the article's content with the new content received in the request.

article.save(): This line saves the updated article back to the database.

new_etag = article.etag: This line retrieves the new ETag generated for the updated article after saving it.

Returning the Success Response with the new ETag:

return HttpResponse(status=200, content_type="text/plain", content=new_etag): returns a successful 200 OK response, including the content type ("text/plain") and the updated ETag of the article in the response body.

How does this implement idempotence?

This code ensures that if the same update request is sent multiple times with the same ETag, the update will only be performed once. The first successful update changes the article's ETag, so any repeat carrying the old ETag no longer matches and receives a 409. This prevents duplicate updates and maintains data consistency. The new ETag is then provided in the response to help the client keep track of the article's state for future interactions.
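The same compare-then-update logic can be tried without Django. This is a minimal sketch under stated assumptions: the in-memory article dictionary and the compute_etag helper are hypothetical, and hashing the content is just one common way to derive an ETag (the original does not say how article.etag is produced).

```python
import hashlib

def compute_etag(content):
    """Derive an ETag from the content (hashing is an assumed choice)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# Hypothetical in-memory article standing in for the Article model
article = {"content": "draft one"}
article["etag"] = compute_etag(article["content"])

def update_article(if_match, new_content):
    """Apply the update only when the caller's ETag matches the current one."""
    if if_match != article["etag"]:
        return 409, article["etag"]
    article["content"] = new_content
    article["etag"] = compute_etag(new_content)
    return 200, article["etag"]

seen_etag = article["etag"]
first_status, new_etag = update_article(seen_etag, "draft two")
retry_status, _ = update_article(seen_etag, "draft two")  # stale ETag
print(first_status, retry_status)
```

The retry fails because it still carries the ETag from before the first update. To send a further update, the client would use the new ETag returned by the first call.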

Conclusion

In this tutorial, we highlighted the difference between safe methods like GET, which retrieve data without side effects, and non-idempotent methods like POST, which can have different outcomes with each repetition.

We also explored techniques you can apply to achieve idempotence in non-idempotent methods, emphasizing the importance of designing APIs that prioritize consistency and reliability.

How to Build a Movie Recommendation System Based on Collaborative Filtering

freeCodeCamp — Wed, 29 Nov 2023 15:45:18 +0000

By Jess Wilk

In today’s world of technology, we get more recommendations from Artificial Intelligence models than from our friends.

Surprised? Think of the content you see and the apps you use daily. We get product recommendations on Amazon, clothing recommendations on Myntra, and movie suggestions on Netflix based on our past preferences, purchases, and so on.

Have you ever wondered what’s under the hood? The answer is machine learning-powered Recommender systems. Recommender systems are machine learning algorithms developed using historical data and social media information to find products personalized to our preferences.

In this article, I’ll walk you through the different types of ML methods for building a recommendation system and focus on the collaborative filtering method. We will obtain a sample dataset and create a collaborative filtering recommender system step by step.

Make sure to grab a cup of cappuccino (or whatever is your beverage of choice) and get ready!

Prerequisites

Before we embark on this journey, you should have a basic understanding of machine learning concepts and familiarity with Python programming. Knowledge of data processing and experience with libraries like Pandas, NumPy, and Scikit-learn will also be beneficial.

If you're new to these topics, you can check out the Introduction to Data Science course on Hyperskill, where I contribute as an expert.

Different Types of Recommendation Systems

You'll probably agree that there is more than one way to decide what to suggest or recommend when a friend asks our opinion. This applies to AI, too!

In machine learning, two primary methods of building recommendation engines are Content-based and Collaborative filtering methods.

When using the content-based filter method, the suggested products or items are based on what you liked or purchased. This method feeds the machine learning model with historical data such as customer search history, purchase records, and items in their wishlists. The model finds other products that share features similar to your past preferences.

Let’s understand this better with an example of a movie recommendation. Let’s say you saw Inception and gave it a five-star rating. Finding movies of similar themes and genres, like Interstellar and The Matrix, and recommending them is called content-based filtering.

Imagine if all the recommendation systems just suggested things based on what you have seen. How would you discover new genres and movies? That’s where the Collaborative filtering method comes in. So what is it?

Rather than finding similar content, the Collaborative filtering method finds other users and customers similar to you and recommends their choices. The algorithm doesn’t consider the product features as in the case of content-based filtering.

To understand how it works, let’s go back to our example of movie recommendations. The system looks at the movies you've enjoyed and finds other users who liked the same movies. Then, it sees what else these similar users enjoyed and suggests those movies to you.

For example, if you and a friend both love The Shawshank Redemption, and your friend also loves Forrest Gump, the system will recommend Forrest Gump to you, thinking you might share your friend's taste.
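The Shawshank Redemption example can be written as a few lines of plain Python. This is only a toy sketch of the idea: the likes dictionary and the recommend helper are hypothetical, and "similar" here simply means "shares at least one liked movie".

```python
# Hypothetical likes: user -> set of movies they enjoyed
likes = {
    "you": {"The Shawshank Redemption"},
    "friend": {"The Shawshank Redemption", "Forrest Gump"},
    "stranger": {"Alien"},
}

def recommend(user, likes):
    """Suggest movies liked by users who share at least one liked movie."""
    mine = likes[user]
    suggestions = set()
    for other, theirs in likes.items():
        # A user counts as similar when the two sets of likes overlap
        if other != user and mine & theirs:
            suggestions |= theirs - mine
    return sorted(suggestions)

print(recommend("you", likes))  # ['Forrest Gump']
```

Notice that the function never looks at genres or any other movie feature. It works only from who liked what, which is the defining trait of collaborative filtering.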

In the upcoming sections, I’ll show you how to build a movie recommendation engine using Python based on collaborative filtering.

How to Prepare and Process the Movies Dataset

The first step of any machine learning project is collecting and preparing the data. As our goal is to build a movie recommendation engine, I have chosen a movie rating dataset. The dataset is publicly available for free on Kaggle.

The dataset has two main files in the format of CSV:

ratings.csv: Contains the rating given by each user to each movie they watched
movies_metadata.csv: Contains information on genre, budget, release date, revenue, and so on for all the movies in the dataset.

Let’s first import the Python packages needed to read the CSV files.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next, read the ratings file into a Pandas dataframe and look at the columns.

user_ratings_df = pd.read_csv("../input/the-movies-dataset/ratings.csv")
user_ratings_df.head()

Columns in Pandas dataframe

The userId column has the unique ID for every customer, and movieId has the unique identification number for every movie. The rating column contains the rating given by the particular user to the movie out of 5. The timestamp column can be dropped, as we won’t need it for our analysis.

Next, let’s read the movie metadata information into a dataframe. Let’s keep only the relevant columns of Movie Title and genre for each MovieID.

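# Keep the full metadata in movie_names; later snippets look up titles from it
movie_names = pd.read_csv("../input/the-movies-dataset/movies_metadata.csv")
# Assumption: the file's 'id' column identifies each movie; expose it as 'movieId' for the merge
movie_names['movieId'] = pd.to_numeric(movie_names['id'], errors='coerce')
movie_metadata = movie_names[['movieId', 'title', 'genres']]
movie_metadata.head()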

The columns of Movie Title and genre for each MovieID

Next, combine these dataframes on the common column movieId.

movie_data = user_ratings_df.merge(movie_metadata, on='movieId')
movie_data.head()

This dataset can be used for Exploratory Data Analysis. You can find the movie with the top number of ratings, the best rating, and so on. Try it out to better grasp the data you are dealing with.
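As a starting point for that exploration, here is a small sketch that finds the most-rated and best-rated titles. It uses a hypothetical toy dataframe shaped like the merged data (a title column and a rating column), so the titles and numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings, shaped like the merged movie_data dataframe
toy = pd.DataFrame({
    "title": ["Heat", "Heat", "Heat", "Alien", "Alien"],
    "rating": [4.0, 5.0, 3.0, 5.0, 5.0],
})

# Number of ratings and mean rating per title
summary = toy.groupby("title")["rating"].agg(["count", "mean"])

most_rated = summary["count"].idxmax()  # title with the most ratings
best_rated = summary["mean"].idxmax()   # title with the highest average rating
print(most_rated, best_rated)
```

On real data, a title with a single five-star rating would top the "best rated" list, so you may want to filter on a minimum count before comparing means.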

How to Build the User-Item Matrix

Now that our dataset is ready, let's focus on how collaborative-based filtering works. The machine learning algorithm aims to discover user preference patterns used to make recommendations.

One common approach is to use a user-item matrix. It involves a large spreadsheet where users are listed on one side and movies on the other. Each cell in the spreadsheet shows if a user likes a particular movie. The system then uses various algorithms to analyze this matrix, find patterns, and generate recommendations.

This matrix leads us to one of the advantages of collaborative filtering: it's excellent at discovering new and unexpected recommendations. Since it's based on user behavior, it can suggest a movie you might never have considered but will probably like.

Let’s create a user-movie rating matrix for our dataset. You can do this using the built-in pivot function of a Pandas dataframe, as shown below. We also use the fillna() method to impute missing or null values with 0.

user_item_matrix = user_ratings_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)
user_item_matrix

Here’s our output matrix:

A user-movie rating matrix for our dataset

Sometimes, the matrix can be sparse. Sparsity means that most cells are empty (here, filled with 0) because each user rates only a small fraction of the movies. Storing all those empty cells can significantly increase the computational resources needed, so compressing sparse matrices with the scipy Python package is recommended when working with a large dataset.
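Here is a minimal sketch of that compression step using scipy's csr_matrix. The 3-by-4 rating matrix is a hypothetical toy example, and the sparsity calculation is added only to show how empty the matrix is.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-user x 4-movie rating matrix; 0 means "not rated"
dense = np.array([
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
])

# The compressed sparse row format stores only the non-zero entries
sparse = csr_matrix(dense)

# Fraction of cells that hold no rating
sparsity = 1.0 - sparse.nnz / (dense.shape[0] * dense.shape[1])
print(sparse.nnz, sparsity)
```

Only three values are stored instead of twelve. On a real ratings dataset, where most users rate a tiny share of the catalogue, the saving is far larger.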

How to Define and Train the Model

You can use multiple machine learning algorithms for collaborative filtering, like K-nearest neighbors (KNN) and SVD. I’ll be using a KNN model here.

KNN is super straightforward. Picture a giant, colorful board with dots representing different items (like movies). Each dot is close to others that are similar. When you ask KNN for recommendations, it finds the spot of your favorite item on this board and then looks around to see the nearest dots—these are your recommendations.

Now, the metric parameter in KNN is crucial. It's like the ruler the system uses to measure the distance between these dots. The metric used here is Cosine similarity.

What is cosine similarity?

It is a metric that measures how similar two entities are (like documents or vectors in a multi-dimensional space), irrespective of size. Cosine similarity is widely used in NLP to find similar context words.
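To make "irrespective of size" concrete, here is a short NumPy sketch. The cosine_similarity helper and the three rating vectors are hypothetical examples written for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ratings_a = np.array([1.0, 2.0, 3.0])
ratings_b = np.array([2.0, 4.0, 6.0])   # same direction as ratings_a, twice the size
ratings_c = np.array([3.0, 0.0, -1.0])  # at a right angle to ratings_a

same_direction = cosine_similarity(ratings_a, ratings_b)
orthogonal = cosine_similarity(ratings_a, ratings_c)
print(same_direction, orthogonal)
```

The first pair scores close to 1 even though one vector is twice as long, and the second pair scores 0. Note that scikit-learn's metric='cosine' works with cosine distance, which is 1 minus this similarity, so smaller values there mean more similar items.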

Follow the snippet below to define a KNN model, the metric, and other parameters. The model is fit on the user-item matrix created in the previous section.

from sklearn.neighbors import NearestNeighbors

# Define a KNN model on cosine similarity
cf_knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10, n_jobs=-1)

# Fitting the model on our matrix
cf_knn_model.fit(user_item_matrix)
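To make the "nearest dots" idea concrete, here is a brute-force neighbour search written directly in NumPy. It is a sketch of the concept, not scikit-learn's implementation: the titles, the tiny movie-by-user rating matrix, and the nearest_movies helper are all hypothetical.

```python
import numpy as np

# Hypothetical movie x user rating matrix (each row is one movie)
titles = ["Batman", "Batman Returns", "Notting Hill"]
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
])

def nearest_movies(query_row, ratings, k):
    """Return indices of the k rows closest to query_row by cosine distance."""
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[query_row] / (norms * norms[query_row])
    distances = 1.0 - sims
    order = np.argsort(distances)
    # Drop the query movie itself, then keep the k closest
    return [int(i) for i in order if i != query_row][:k]

neighbours = nearest_movies(0, ratings, k=1)
print([titles[i] for i in neighbours])  # ['Batman Returns']
```

The two Batman movies were rated highly by the same users, so their rating vectors point in nearly the same direction and end up as nearest neighbours.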

Next, let's define a function to provide the desired number of movie recommendations, given a movie title as input. The code below uses the KNN algorithm to find the data points closest to the input movie. The input parameters for the function are:

movie_name: Input movie name, based on which we find new recommendations
matrix: The User-Movie Rating matrix
cf_model: The KNN model to fit and query
n_recs: Controls the number of final recommendations that we would get as output

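# Assumed source of process.extractOne (fuzzy title matching)
from fuzzywuzzy import process

def movie_recommender_engine(movie_name, matrix, cf_model, n_recs):
    # Fit the model that was passed in on the matrix
    # (the matrix is indexed by row below, so each row should hold one movie's ratings)
    cf_model.fit(matrix)

    # Find the index of the title that best matches the input movie name
    movie_id = process.extractOne(movie_name, movie_names['title'])[2]

    # Calculate neighbour distances
    distances, indices = cf_model.kneighbors(matrix[movie_id], n_neighbors=n_recs)
    # Sort neighbours by distance, then reverse and drop the closest entry (the input movie itself)
    movie_rec_ids = sorted(list(zip(indices.squeeze().tolist(),distances.squeeze().tolist())),key=lambda x: x[1])[:0:-1]

    # List to store recommendations
    # Note: the title lookup assumes row positions in matrix line up with the index of movie_names
    cf_recs = []
    for i in movie_rec_ids:
        cf_recs.append({'Title':movie_names['title'][i[0]],'Distance':i[1]})

    # Build a dataframe from the n_recs - 1 remaining neighbours
    df = pd.DataFrame(cf_recs, index = range(1,n_recs))

    return df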

How to Get Recommendations from the Model

Let's call our defined function to get movie recommendations. For instance, we can obtain a list of the top 10 recommended movies for someone who is a fan of Batman.

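from scipy.sparse import csr_matrix

n_recs = 10
# The function reads one row per movie, so pass a sparse movie-by-user version of the matrix
movie_recommender_engine('Batman', csr_matrix(user_item_matrix.T.values), cf_knn_model, n_recs)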

A list of the top 10 recommended movies for someone who is a fan of Batman

Hurray! We have got the result we needed.

Advantages and Limitations of Collaborative Filtering

The advantages of this method include:

Personalized Recommendations: Offers tailored suggestions based on user behavior, leading to highly customized experiences.
Diverse Content Discovery: Capable of recommending a wide range of items, helping users discover content they might not find on their own. It gives diverse content discovery the edge over content-based filtering.
Community Wisdom: Leverages the collective preferences of users, often leading to more accurate recommendations than individual or content-based analysis alone.
Dynamic Adaptation: The model continuously gets updated with user interactions, keeping the recommendations relevant and up-to-date.

It’s not all sunshine, though. One big challenge is the cold start problem. For example, this happens when new movies or users are added to the system. The system struggles to make accurate recommendations since there's not enough data on these new entries.

Another issue is popularity bias. Popular movies get recommended a lot, overshadowing lesser-known gems. There are also scalability issues that come with managing such a large dataset.

While developing collaborative filtering-based engines, computational expenses and data sparsity must be kept in mind for an efficient process. It’s also recommended to take action to ensure data privacy and security.

Conclusion

Using Collaborative Filtering to build a movie recommendation system significantly advances digital content personalization. This system reflects our preferences and exposes us to a broader range of choices based on similar users' tastes.

Despite its challenges, such as the cold start problem and popularity bias, the benefits of personalized recommendations make it a powerful tool in the machine learning industry. As technology advances, these systems will become even more sophisticated, offering refined and enjoyable user experiences in the digital world.

Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out an Introduction to Data Science course on the platform.

Web Storage Explained – How to Use localStorage and sessionStorage in JavaScript Projects

Oluwatobi Sofela — Mon, 09 Oct 2023 16:45:31 +0000

Web Storage is the JavaScript API that browsers provide for storing data locally and securely within a user’s browser.

Session and local storage are the two main types of web storage. They are similar to regular JavaScript objects, but they persist (do not disappear) when the webpage reloads.

This article aims to show you exactly how the two storage objects work in JavaScript. We will also use a To-Do list exercise to practice using web storage in a web app project.

What is the Session Storage Object?
What is the Local Storage Object?
How to Access the Session and Local Storage Objects
What are Web Storage’s Built-In Interfaces?
Time to Practice with Web Storage 🤸‍♂️🏋️‍♀️
How Did You Go About Solving the Web Storage Exercise?
How to Continue Practicing with Web Storage 🧗‍♀️🚀
Web Storage vs. Cookies: What is the Difference?
Wrapping up

Without further ado, let’s discuss session storage.

What is the Session Storage Object?

The session storage object (window.sessionStorage) stores data that persists for only one session of an opened tab.

In other words, whatever gets stored in the window.sessionStorage object will not disappear on a reload of the web page. Instead, the computer will delete the stored data only when users close the browser tab or window.

Note the following:

The data stored inside the session storage is per-origin and per-instance. In other words, http://freecodecamp.com’s sessionStorage object is different from https://freecodecamp.com’s sessionStorage object because the two origins use different schemes (http and https).
Per-instance means per-window or per-tab. In other words, the sessionStorage object’s lifespan expires once users close the instance (window or tab).
Browsers create a unique page session for each new tab or window. Therefore, users can run multiple instances of an app without interfering with each instance’s session storage. (Note: Cookies do not have good support for running multiple instances of the same app. Such an attempt can cause errors such as double entry of bookings.)
Session storage is a property of the global Window object. So sessionStorage.setItem() is equivalent to window.sessionStorage.setItem().

What is the Local Storage Object?

The local storage object (window.localStorage) stores data that persists even when users close their browser tab (or window).

In other words, whatever gets stored in the window.localStorage object will not disappear during a reload or reopening of the web page or when users close their browsers. Those data have no expiration time. Browsers never clear them automatically.

The computer will delete the window.localStorage object’s content in the following instances only:

When the content gets cleared through JavaScript
When the browser’s cache gets cleared

Note the following:

The window.localStorage object’s storage limit is larger than the window.sessionStorage.
The data stored inside the local storage is per-origin. In other words, http://freecodecamp.com’s localStorage object is different from https://freecodecamp.com’s localStorage object because the two origins use different schemes (http and https).
There are inconsistencies with how browsers handle the local storage of documents not served from a web server (for instance, pages with a file: URL scheme). Therefore, the localStorage object may behave differently among browsers when used with non-HTTP URLs, such as file:///document/on/users/local/system.html.
Local storage is a property of the global Window object. Therefore, localStorage.setItem() is equivalent to window.localStorage.setItem().

How to Access the Session and Local Storage Objects

You can access the two web storages by:

Using the same technique as you'd use for accessing regular JavaScript objects
Using web storage’s built-in interfaces

For instance, consider the snippet below:

sessionStorage.bestColor = "Green";
sessionStorage["bestColor"] = "Green";
sessionStorage.setItem("bestColor", "Green");

The three statements above do the same thing—they set bestColor’s value. But the third line is recommended because it uses web storage’s setItem() method.

Tip: you should prefer using the web storage’s built-in interfaces to avoid the pitfalls of using objects as key/value stores.

Let’s discuss more on the web storage’s built-in interfaces below.

What are Web Storage’s Built-In Interfaces?

The web storage built-in interfaces are the recommended tools for reading and manipulating a browser’s sessionStorage and localStorage objects.

The six (6) built-in interfaces are:

setItem()
key()
getItem()
length
removeItem()
clear()

Let’s discuss each one now.

What is web storage’s `setItem()` method?

The setItem() method stores its key and value arguments inside the specified web storage object.

Syntax of the `setItem()` method

setItem() accepts two required arguments. Here is the syntax:

webStorageObject.setItem(key, value);

webStorageObject represents the storage object (localStorage or sessionStorage) you wish to manipulate.
key is the first argument accepted by setItem(). It is a required string argument representing the name of the web storage property you want to create or update.
value is the second argument accepted by setItem(). It is a required string argument specifying the value of the key you are creating or updating.

Note:

The key and value arguments are always strings.
Suppose you provide an integer as a key or value. In that case, browsers will convert it to a string automatically.
setItem() may throw an exception (a QuotaExceededError) if the storage object is full.

Example 1: How to store data in the session storage object

Invoke sessionStorage’s setItem() method.
Provide the name and value of the data you wish to store.

// Store color: "Pink" inside the browser's session storage object:
sessionStorage.setItem("color", "Pink");

// Log the session storage object to the console:
console.log(sessionStorage);

// The invocation above will return:
{color: "Pink"}

Note: Your browser’s session storage may contain additional data if it already uses the storage object to store information.

Example 2: How to store data in the local storage object

Invoke localStorage’s setItem() method.
Provide the name and value of the data you wish to store.

// Store color: "Pink" inside the browser's local storage object:
localStorage.setItem("color", "Pink");

// Log the local storage object to the console:
console.log(localStorage);

// The invocation above will return:
{color: "Pink"}

Note:

Your browser’s local storage may contain additional data if it already uses the storage object to store information.
It is best to serialize objects before storing them in local or session storage. Otherwise, the computer will store the object as "[object Object]".

Example 3: Browsers use `"[object Object]"` for non-serialized objects in the web storage

// Store myBio object inside the browser's session storage object:
sessionStorage.setItem("myBio", { name: "Oluwatobi" });

// Log the session storage object to the console:
console.log(sessionStorage);

// The invocation above will return:
{myBio: "[object Object]", length: 1}

You can see that the computer stored the object as "[object Object]" because we did not serialize it.

Example 4: How to store serialized objects in the web storage

// Store myBio object inside the browser's session storage object:
sessionStorage.setItem("myBio", JSON.stringify({ name: "Oluwatobi" }));

// Log the session storage object to the console:
console.log(sessionStorage);

// The invocation above will return:
{myBio: '{"name":"Oluwatobi"}', length: 1}

We used JSON.stringify() to convert the object to JSON before storing it in the web storage.

Tip: Learn how to convert JSON to JavaScript objects.

What is web storage’s `key()` method?

The key() method retrieves a specified web storage item’s name (key).

Syntax of the `key()` method

key() accepts one required argument. Here is the syntax:

webStorageObject.key(index);

webStorageObject represents the storage object (localStorage or sessionStorage) whose key you wish to get.
index is a required argument. It is an integer specifying the index of the item whose key you want to get.

Example 1: How to get the name of an item in the session storage object

Invoke sessionStorage’s key() method.
Provide the index of the item whose name you wish to get.

// Store carColor: "Pink" inside the browser's session storage object:
sessionStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the session storage object:
sessionStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the session storage object:
sessionStorage.setItem("laptopColor", "White");

// Get the name of the item at index 1:
sessionStorage.key(1);

Important: The user-agent defines the order of items in the session storage. In other words, key()’s output may vary based on how the user-agent orders the web storage’s items. So you shouldn't rely on key() to return a constant value.

Example 2: How to get the name of an item in the local storage object

Invoke localStorage’s key() method.
Provide the index of the item whose name you wish to get.

// Store carColor: "Pink" inside the browser's local storage object:
localStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the local storage object:
localStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the local storage object:
localStorage.setItem("laptopColor", "White");

// Get the name of the item at index 1:
localStorage.key(1);

Important: The user-agent defines the order of items in the local storage. In other words, key()’s output may vary based on how the user-agent orders the web storage’s items. So you shouldn't rely on key() to return a constant value.

What is web storage’s `getItem()` method?

The getItem() method retrieves the value of a specified web storage item.

Syntax of the `getItem()` method

getItem() accepts one required argument. Here is the syntax:

webStorageObject.getItem(key);

webStorageObject represents the storage object (localStorage or sessionStorage) whose item you wish to get.
key is a required argument. It is a string specifying the name of the web storage property whose value you want to get.

Example 1: How to get data from the session storage object

Invoke sessionStorage’s getItem() method.
Provide the name of the data you wish to get.

// Store color: "Pink" inside the browser's session storage object:
sessionStorage.setItem("color", "Pink");

// Get color's value from the session storage:
sessionStorage.getItem("color");

// The invocation above will return:
"Pink"

Example 2: How to get data from the local storage object

Invoke localStorage’s getItem() method.
Provide the name of the data you wish to get.

// Store color: "Pink" inside the browser's local storage object:
localStorage.setItem("color", "Pink");

// Get color's value from the local storage:
localStorage.getItem("color");

// The invocation above will return:
"Pink"

Note: The getItem() method will return null if its argument does not exist in the specified web storage.

What is web storage’s `length` property?

The length property returns the number of properties in the specified web storage.

Syntax of the `length` property

Here is length’s syntax:

webStorageObject.length;

webStorageObject represents the storage object (localStorage or sessionStorage) whose length you wish to verify.

Example 1: How to verify the number of items in the session storage object

Invoke sessionStorage’s length property.

// Store carColor: "Pink" inside the browser's session storage object:
sessionStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the session storage object:
sessionStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the session storage object:
sessionStorage.setItem("laptopColor", "White");

// Verify the number of items in the session storage:
sessionStorage.length;

// The invocation above may return:
3

Note: Your sessionStorage.length invocation may return a value greater than 3 if your browser’s session storage already contains some stored information.

Example 2: How to verify the number of items in the local storage object

Invoke localStorage’s length property.

// Store carColor: "Pink" inside the browser's local storage object:
localStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the local storage object:
localStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the local storage object:
localStorage.setItem("laptopColor", "White");

// Verify the number of items in the local storage:
localStorage.length;

// The invocation above may return:
3

Note: Your localStorage.length invocation may return a value greater than 3 if your browser's local storage already contains some stored information.
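
The length property is often paired with the key() method to loop over everything in a storage object. Here is a minimal sketch of that pattern. listStorageEntries is a hypothetical helper name, and the storage object is passed in as a parameter so it works with either localStorage or sessionStorage. Keep in mind that browsers decide the order in which key() returns names.

```javascript
// Hypothetical helper: collects every [key, value] pair in a storage object.
function listStorageEntries(storage) {
  const entries = [];
  for (let index = 0; index < storage.length; index++) {
    // key(index) returns the name of the item at that position.
    const key = storage.key(index);
    entries.push([key, storage.getItem(key)]);
  }
  return entries;
}

// Usage in a browser:
// console.table(listStorageEntries(localStorage));
```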

What is web storage’s `removeItem()` method?

The removeItem() method removes a property from the specified web storage.

Syntax of the `removeItem()` Method

removeItem() accepts one required argument. Here is the syntax:

webStorageObject.removeItem(key);

webStorageObject represents the storage object (localStorage or sessionStorage) whose item you wish to remove.
key is a required argument. It is a string specifying the name of the web storage property you want to remove.

Example 1: How to remove data from the session storage object

Invoke sessionStorage’s removeItem() method.
Provide the name of the data you wish to remove.

// Store carColor: "Pink" inside the browser's session storage object:
sessionStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the session storage object:
sessionStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the session storage object:
sessionStorage.setItem("laptopColor", "White");

// Remove the pcColor item from the session storage:
sessionStorage.removeItem("pcColor");

// Confirm whether the pcColor item still exists in the session storage:
sessionStorage.getItem("pcColor");

// The invocation above will return:
null

Note: The removeItem() method will do nothing if its argument does not exist in the session storage.

Example 2: How to remove data from the local storage object

Invoke localStorage’s removeItem() method.
Provide the name of the data you wish to remove.

// Store carColor: "Pink" inside the browser's local storage object:
localStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the local storage object:
localStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the local storage object:
localStorage.setItem("laptopColor", "White");

// Remove the pcColor item from the local storage:
localStorage.removeItem("pcColor");

// Confirm whether the pcColor item still exists in the local storage:
localStorage.getItem("pcColor");

// The invocation above will return:
null

Note: The removeItem() method will do nothing if its argument does not exist in the local storage.

What is web storage’s `clear()` method?

The clear() method clears (deletes) all the items in the specified web storage.

Syntax of the `clear()` Method

clear() accepts no argument. Here is the syntax:

webStorageObject.clear();

webStorageObject represents the storage object (localStorage or sessionStorage) whose items you wish to clear.

Example 1: How to clear all items from the session storage object

Invoke sessionStorage’s clear() method.

// Store carColor: "Pink" inside the browser's session storage object:
sessionStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the session storage object:
sessionStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the session storage object:
sessionStorage.setItem("laptopColor", "White");

// Clear all items from the session storage:
sessionStorage.clear();

// Confirm whether the session storage still contains any item:
console.log(sessionStorage);

// The invocation above will log something like:
{length: 0}

Example 2: How to clear all items from the local storage object

Invoke localStorage’s clear() method.

// Store carColor: "Pink" inside the browser's local storage object:
localStorage.setItem("carColor", "Pink");

// Store pcColor: "Yellow" inside the local storage object:
localStorage.setItem("pcColor", "Yellow");

// Store laptopColor: "White" inside the local storage object:
localStorage.setItem("laptopColor", "White");

// Clear all items from the local storage:
localStorage.clear();

// Confirm whether the local storage still contains any item:
console.log(localStorage);

// The invocation above will log something like:
{length: 0}
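
Since clear() deletes every item in the storage object, you may sometimes prefer to remove only a related group of items. The sketch below shows one possible approach that combines length, key(), and removeItem(). The removeItemsByPrefix name and the idea of prefixing keys (for example "todo:") are assumptions made for illustration.

```javascript
// Hypothetical helper: removes only the items whose key starts with `prefix`.
// Returns the number of items removed.
function removeItemsByPrefix(storage, prefix) {
  // Collect the matching keys first. Removing items while looping by index
  // would shift the remaining indexes and could skip some keys.
  const keysToRemove = [];
  for (let index = 0; index < storage.length; index++) {
    const key = storage.key(index);
    if (key !== null && key.startsWith(prefix)) {
      keysToRemove.push(key);
    }
  }

  keysToRemove.forEach((key) => storage.removeItem(key));
  return keysToRemove.length;
}

// Usage in a browser:
// removeItemsByPrefix(sessionStorage, "todo:");
```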

Now that we know what web storage is and how to access it, we can practice using it in a JavaScript project.

Time to Practice with Web Storage 🤸‍♂️🏋️‍♀️

Consider the following To-Do List app:

The Problem

The issue with the To-Do List app is this:

Tasks disappear whenever users refresh the webpage.

Your Exercise

Use the appropriate Web Storage APIs to accomplish the following tasks:

Prevent the Session pane’s To-Do items from disappearing whenever users reload the browser.
Prevent the Local section’s To-Do items from disappearing whenever users reload or close their browser tab (or window).
Auto-display the Session section's previously added tasks on page reload.
Auto-display the Local section's previously added tasks on page reload (or browser reopen).

Bonus Exercise

Use your browser’s console to:

Check the number of items in your browser’s session storage object.
Display the name of your local storage’s zeroth index item.
Delete all the items in your browser’s session storage.

Try the Web Storage Exercise

Note: You will benefit much more from this tutorial if you attempt the exercise yourself.

If you get stuck, don’t be discouraged. Instead, review the lesson and give it another try.

Once you’ve given it your best shot (you’ll only cheat yourself if you don’t!), we can discuss how I approached the exercise below.

How Did You Go About Solving the Web Storage Exercise?

Below are feasible ways to get the exercise done.

How to prevent the Session Storage pane’s To-Do items from disappearing on page reload

Whenever users click the “Add task” button,

Get existing session storage’s content, if any. Otherwise, return an empty array.
Merge the existing to-do items with the user’s new input.
Add the new to-do list to the browser’s session storage object.

Here’s the code:

sessionAddTaskBtn.addEventListener('click', () => {
  // Get existing session storage's content, if any. Otherwise, return an empty array:
  const currentTodoArray =
    JSON.parse(sessionStorage.getItem('codesweetlyStore')) || [];

  // Merge currentTodoArray with the user's new input:
  const newTodoArray = [
    ...currentTodoArray,
    { checked: false, text: sessionInputEle.value },
  ];

  // Add newTodoArray to the session storage object:
  sessionStorage.setItem('codesweetlyStore', JSON.stringify(newTodoArray));

  const todoLiElements = createTodoLiElements(newTodoArray);
  sessionTodosContainer.replaceChildren(...todoLiElements);
  sessionInputEle.value = '';
});

Note: The three dots (...) preceding the currentTodoArray variable represent the spread operator. We used it in the newTodoArray array to copy currentTodoArray’s items into newTodoArray.
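
The `JSON.parse(...) || []` pattern works because JSON.parse(null) evaluates to null, which then falls back to the empty array. If you want to guard against a stored value that is not valid JSON, one option is to wrap the read in a try...catch block. The sketch below is a variation on the handler above, not the tutorial's original code. readTodos and appendTodo are hypothetical helper names, and the storage object is passed in as a parameter.

```javascript
// Hypothetical helper: reads the to-do array stored under `key`.
function readTodos(storage, key) {
  try {
    // JSON.parse(null) returns null, so a missing key falls back to [].
    const parsed = JSON.parse(storage.getItem(key));
    return Array.isArray(parsed) ? parsed : [];
  } catch (error) {
    // The stored string was not valid JSON.
    return [];
  }
}

// Hypothetical helper: merges a new task with the stored ones and saves the result.
function appendTodo(storage, key, text) {
  const newTodoArray = [...readTodos(storage, key), { checked: false, text }];
  storage.setItem(key, JSON.stringify(newTodoArray));
  return newTodoArray;
}

// Usage in a browser:
// appendTodo(sessionStorage, "codesweetlyStore", sessionInputEle.value);
```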

How to prevent the Local Storage pane’s To-Do items from disappearing on page reload or reopen

Get existing local storage’s content, if any. Otherwise, return an empty array.
Merge the existing to-do items with the user’s new input.
Add the new to-do list to the browser’s local storage object.

Here’s the code:

localAddTaskBtn.addEventListener('click', () => {
  // Get existing local storage's content, if any. Otherwise, return an empty array:
  const currentTodoArray =
    JSON.parse(localStorage.getItem('codesweetlyStore')) || [];

  // Merge currentTodoArray with the user's new input:
  const newTodoArray = [
    ...currentTodoArray,
    { checked: false, text: localInputEle.value },
  ];

  // Add newTodoArray to the local storage object:
  localStorage.setItem('codesweetlyStore', JSON.stringify(newTodoArray));

  const todoLiElements = createTodoLiElements(newTodoArray);
  localTodosContainer.replaceChildren(...todoLiElements);
  localInputEle.value = '';
});

Note: The localTodosContainer.replaceChildren(...todoLiElements) statement tells the browser to replace localTodosContainer’s current child elements with the list of <li> elements in the todoLiElements array.

How to auto-display the Session section’s previously added tasks on page reload

Whenever users reload the page,

Get existing session storage’s content, if any. Otherwise, return an empty array.
Use the retrieved content to create <li> elements.
Populate the tasks display space with the <li> elements.

Here’s the code:

window.addEventListener('load', () => {
  // Get existing session storage's content, if any. Otherwise, return an empty array:
  const sessionTodoArray =
    JSON.parse(sessionStorage.getItem('codesweetlyStore')) || [];

  // Use the retrieved sessionTodoArray to create <li> elements:
  const todoLiElements = createTodoLiElements(sessionTodoArray);

  // Populate the tasks display space with the todoLiElements:
  sessionTodosContainer.replaceChildren(...todoLiElements);
});

How to auto-display the Local section’s previously added tasks on page reload or reopen

Whenever users reload or reopen the page,

Get existing local storage’s content, if any. Otherwise, return an empty array.
Use the retrieved content to create <li> elements.
Populate the tasks display space with the <li> elements.

Here’s the code:

window.addEventListener('load', () => {
  // Get existing local storage's content, if any. Otherwise, return an empty array:
  const localTodoArray =
    JSON.parse(localStorage.getItem('codesweetlyStore')) || [];

  // Use the retrieved localTodoArray to create <li> elements:
  const todoLiElements = createTodoLiElements(localTodoArray);

  // Populate the tasks display space with the todoLiElements:
  localTodosContainer.replaceChildren(...todoLiElements);
});

How to check the total items in the browser’s session storage

Use session storage’s length property like so:

console.log(sessionStorage.length);

How to display the local storage’s zeroth index item’s name

Use the local storage’s key() method as follows:

console.log(localStorage.key(0));

How to empty the browser’s session storage

Use the session storage’s clear() method as follows:

sessionStorage.clear();

How to Continue Practicing with Web Storage 🧗‍♀️🚀

The to-do app still has a lot of potential. For instance, you can:

Convert it to a React TypeScript application.
Make it keyboard accessible.
Allow users to delete or edit individual tasks.
Allow users to star (mark as important) specific tasks.
Let users specify due dates.

So, feel free to continue developing what we’ve built in this tutorial so you can better understand the web storage objects.

For instance, here’s my attempt at making the two panes functional:

Before we wrap up our discussion, you should know some differences between web storage and cookies. So, let’s talk about that below.

Web Storage vs. Cookies: What is the Difference?

Web storage and cookies are two main ways to store data locally within a user’s browser. But they work differently. Below are the main distinctions between them.

Storage limit

Cookies: Have a maximum storage limit of 4 kilobytes.

Web storage: Can store a lot more than 4 kilobytes of data. For instance, Safari 8 can store up to 5 MB, while Firefox 34 permits 10 MB.
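
Although web storage is roomier than cookies, it is still limited, and browsers throw an exception (commonly a DOMException named QuotaExceededError) when setItem() cannot save the data. A minimal sketch of guarding against that could look like the following. trySetItem is a hypothetical helper name, and the storage object is passed in as a parameter.

```javascript
// Hypothetical helper: attempts to store a value and reports whether it succeeded.
function trySetItem(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (error) {
    // setItem() throws when the storage is full or otherwise unavailable.
    return false;
  }
}

// Usage in a browser:
// if (!trySetItem(localStorage, "color", "Pink")) {
//   console.warn("Could not save the color to local storage.");
// }
```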

Data transfer to the server

Cookies: Transfer data to the server whenever browsers send HTTP requests to the web server.

Web storage: Never transfers data to the server.

Note: It is a waste of users’ bandwidth to send data to the server if such information is needed only by the client (browser), not the server.

Weak integrity and confidentiality

Cookies: Suffer from weak integrity and weak confidentiality issues.

Web storage: Does not suffer from weak integrity and confidentiality issues because it stores data per origin.

Property

Cookies: Cookies are a property of the Document object.

Web storage: Web storage is a property of the Window object.

Expiration

Cookies: You can specify a cookie’s expiration date.

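Web storage: Browsers determine web storage’s expiration. Session storage lasts until the browser tab (or window) closes, while local storage persists until it is explicitly cleared.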

Retrieving individual data

Cookies: There’s no way to retrieve individual data. You always have to recall all the data to read any single one.

Web storage: You can choose the specific data you wish to retrieve.
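
To illustrate the difference, reading one cookie means fetching the whole document.cookie string and searching through it yourself. The sketch below shows one way to do that. getCookieValue is a hypothetical helper name, and the code assumes the cookie values were written with encodeURIComponent().

```javascript
// Hypothetical helper: finds one cookie's value inside a full cookie string.
// `cookieString` has the same shape as document.cookie: "color=Pink; size=Large"
function getCookieValue(cookieString, name) {
  for (const pair of cookieString.split("; ")) {
    const separatorIndex = pair.indexOf("=");
    if (separatorIndex === -1) continue; // Skip entries without a name=value shape

    const key = pair.slice(0, separatorIndex);
    if (key === name) {
      // Assumption: the value was encoded with encodeURIComponent().
      return decodeURIComponent(pair.slice(separatorIndex + 1));
    }
  }
  return null; // Mirror getItem()'s behavior for a missing key
}

// Usage in a browser:
// getCookieValue(document.cookie, "color");
```

With web storage, the equivalent lookup is a single webStorageObject.getItem("color") call.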

The syntax for storing data

Cookies:

document.cookie = "key=value";

Web storage:

webStorageObject.setItem(key, value);

The syntax for reading data

Cookies:

document.cookie;

Web storage:

webStorageObject.getItem(key);

The syntax for removing data

Cookies:

document.cookie = "key=; expires=Thu, 01 May 1930 00:00:00 UTC";

The snippet above deletes the cookie by assigning an empty value to the key property and setting a past expiration date.

Web storage:

webStorageObject.removeItem(key);

Wrapping up

In this article, we discussed how to use web storage and its built-in interfaces. We also used a to-do list project to practice using the local and session storage objects to store data locally and securely within users’ browsers.

Thanks for reading!

And here’s a useful React TypeScript resource:

I wrote a book about Creating NPM Packages!

It is a beginner-friendly book that takes you from zero to creating, testing, and publishing NPM packages like a pro.
