Testing - freeCodeCamp.org

How I Tested Malaysia's Open Data Portals with Plain English

Tech With RJ — Thu, 23 Apr 2026 23:37:22 +0000

Most end-to-end test suites drive a real browser and click through an app like a user. They check whether a page renders and whether elements appear.

But they don't check whether the numbers on those elements are correct. A data-pipeline bug that shows Malaysia's population as 3.4 million instead of the real 34 million slips past every selector test in the suite.

The element still exists. A number still renders. The page still looks right. But the bug ships and sits there until a human notices.

I work as a full-stack engineer. Writing end-to-end (E2E) tests with Playwright and unit tests with Jest is part of my day job. I also use Playwright MCP, the bridge between AI assistants like Claude and a running browser, when I need to generate first-draft test code or debug a flow.

None of that tooling removes the maintenance tax on selector-based suites. Every E2E suite I keep alive at work accumulates data-testid selectors, waitForSelector calls, and tests that break because someone renamed a button.

Bug0's Breaking Apps Hackathon gave me a pretext to try something different. Over a weekend, I built an automated regression suite for Malaysia's three public open data portals, data.gov.my, OpenDOSM, and KKMNow, using Passmark, Bug0's open-source AI-driven Playwright library.

The tests are written in plain English. Two AI models verify each assertion. A third arbitrates disagreements.

What You'll Find Below:

How to write an E2E test that checks whether a dashboard's numbers are correct, not only whether the page renders
A specific assertion pattern (range-bounded KPIs) that catches an entire class of data-pipeline bug that selector tests miss, with working examples ready to copy
A cross-field math assertion that takes one sentence in Passmark and around a hundred lines of code without it
How Passmark's own failure explanations became my debugging loop (the single biggest shift in how I'll write E2E tests going forward)
The real limits: a 14% cache-hit rate, a dependency on OpenRouter, and what two-model voting fails to catch

Why Malaysia's Open Data Portals?
What Is Passmark?
The Hero Spec: Range-Bounded Assertions
- What Two-Model Voting Doesn't Catch
Going Further: Cross-Field Math
What I Found Across Three Runs
- The Debugging Loop
- The Two Specs That Still Fail Are the Most Interesting
What It Cost, and Why Cache Rate Is Cost Rate
The Pattern Worth Stealing
Honest Verdict
Resources

Why Malaysia's Open Data Portals?

The hackathon suggested targets like Vercel Commerce, Cal.com, and Hashnode. These all would've been solid picks.

But I wanted to test something local and closer to my day-to-day work. I also wanted a data-heavy site where the numbers on screen have to be accurate, since I work with numbers every day too.

Malaysia has three public open-data portals:

data.gov.my, run by MAMPU, the government's digital transformation agency
OpenDOSM, run by the Department of Statistics
KKMNow, run by the Ministry of Health

They're public, require no authentication, and have documented APIs, which made them a good fit for an automated test suite. The data on them is what Malaysians use every day, so accuracy isn't optional.

What Is Passmark?

Passmark is a Playwright library where the tests read like specs. Here's an example:

await runSteps({
  page,
  userFlow: "population dashboard smoke",
  steps: [
    { description: "Navigate to https://data.gov.my/dashboard/kawasanku" },
    {
      description: "Wait for the country-level Malaysia view to render",
      waitUntil: "A headline population number is visible",
    },
  ],
  assertions: [
    {
      assertion:
        "The page shows Malaysia's total population as a number greater than 20 million and less than 40 million",
    },
  ],
  test,
  expect,
});

There are no selectors, no data-testid, and no page.locator(). The assertion expresses what I care about, in the words I would use with a colleague.

On the first run, an AI agent drives the page and caches the resolved Playwright action to Redis. Every run after that replays at native Playwright speed with zero model calls.

When the UI changes and a cached action fails, the AI re-engages only for that step. Two assertion models (Claude and Gemini) vote. A third model arbitrates disagreements.

The Hero Spec: Range-Bounded Assertions

Range-bounded assertions were the first shape of test I wrote, and the one I came back to most across the suite.

The idea is straightforward: check that a number on the page falls inside a sensible range, not that a specific element exists.

The image below is the Playwright report from the population spec, with all four range-bounded assertions passing.

The range-bounded population test is the one that shows Passmark's real value.

Traditional Playwright asserts DOM structure. It confirms that an element with class kpi-total contains the text 34.2 million. That tells you the page rendered, not whether the number makes sense.

A bug that shows Malaysia's population as 3.42 million sails past any selector test. The DOM is correct. The number renders. Nothing breaks in the conventional sense.

Passmark reads the page, evaluates the claim, and fails because 3.42 million falls outside the sane range. Two models vote. A hallucination by one model alone produces no false pass.

What Two-Model Voting Doesn't Catch

Voting defends against one model misreading the page. It doesn't defend against both models misreading the page the same way. If Claude and Gemini both parse "32.4 million" as "3.24 million" because of the same unusual spacing in the DOM, they agree, they vote pass, and the bug ships.

The mitigation is assertion design. Write assertions that are hard to misread. A range check ("between 20 million and 40 million") is harder for a model to get wrong than a prose check ("roughly 34 million"). Numerical bounds leave less room for interpretation than adjectives. The more your assertion looks like a unit test written in English, the less room the models have to disagree.

Going Further: Cross-Field Math

Range-bounded assertions are a good first step. They catch "is this number in the right ballpark?" But they don't catch "do these numbers agree with each other?"

For that, you need cross-field math. If a dashboard shows a total population and a breakdown by gender, those two things are supposed to agree. Male plus female should equal total. Ethnicity breakdown percentages should sum to 100.

test("Cross-field math: sex breakdown sums to total population", async ({ page }) => {
  test.setTimeout(180_000);
  await runSteps({
    page,
    userFlow: "population sex breakdown consistency",
    steps: [
      { description: "Navigate to https://data.gov.my/dashboard/kawasanku" },
      {
        description: "Wait for the Malaysia country-level view with breakdown data",
        waitUntil:
          "A headline total population figure is visible and a breakdown by sex is shown on the page",
      },
    ],
    assertions: [
      {
        assertion:
          "The male and female population values shown on the page add up to approximately the headline total population, within a 5% margin",
      },
      {
        assertion:
          "Any percentage-based breakdowns visible on the page (by sex, age, or ethnicity) sum to approximately 100% within a 2 percentage-point margin",
      },
      {
        assertion: "No breakdown value is negative or greater than the headline total",
      },
    ],
    test,
    expect,
  });
});

Try writing that in vanilla Playwright. You need selectors for the headline number, selectors for the breakdown components, number parsing with a comma-aware regex, and a margin calculation. Seventy to a hundred lines of code to verify three invariants a primary school student would call obvious.
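For a sense of what the hand-written version involves, here is a minimal sketch of only the parsing and margin logic, leaving out the selectors. The names parseCount and checkBreakdown are hypothetical, and the sketch assumes commas are thousands separators and that the display strings have already been pulled off the page.

```typescript
// Pull the first number out of display text such as "32,447,385 people".
// Assumes commas are thousands separators (an assumption about the page's locale).
function parseCount(text: string): number {
  const match = text.match(/-?\d[\d,]*(?:\.\d+)?/);
  if (!match) {
    throw new Error(`No number found in "${text}"`);
  }
  return Number(match[0].replace(/,/g, ""));
}

// Check that the parts add up to the total within a margin (0.05 = 5%)
// and that no part is negative or larger than the total.
// Returns a list of human-readable problems; an empty list means all checks passed.
function checkBreakdown(
  totalText: string,
  partTexts: string[],
  marginFraction = 0.05,
): string[] {
  const total = parseCount(totalText);
  const parts = partTexts.map(parseCount);
  const sum = parts.reduce((acc, value) => acc + value, 0);
  const problems: string[] = [];

  if (Math.abs(sum - total) > Math.abs(total) * marginFraction) {
    problems.push(`Parts sum to ${sum}, which is not within the margin of total ${total}`);
  }
  for (const value of parts) {
    if (value < 0 || value > total) {
      problems.push(`Part value ${value} is negative or greater than total ${total}`);
    }
  }
  return problems;
}
```

This covers two of the three invariants. The selectors, the waits, and the percentage check still have to be written and maintained separately, which is where the remaining lines go.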

The Passmark version is one spec. I ran it against Kawasanku's live country view. All three assertions passed in 1.4 minutes. Passmark's annotation, verbatim:

"The headline total population figure 'Malaysia has a population of 32,447,385 people.' is visible, and 'Gender And Age Distribution' is shown, which implies a breakdown by sex (male, female) will be available."

Two models read the page, extract the numbers, do the arithmetic, and agree. When the dashboard changes layout in three months, the same assertion still works, because it never named a selector.

This is the class of test I want running against every dashboard product that I touch. Financial totals matching their line items. Percentages that sum to 100. Inventory counts equal to the sum of warehouse locations. This rarely gets checked today, because writing the check by hand outweighs the perceived value of running it.

What I Found Across Three Runs

Run	Passed	Key change
1	4 of 13 (31%)	Baseline. Wrote specs without looking at the target pages
2	8 of 13 (62%)	Rewrote five over-specified assertions using Passmark's own feedback
3	12 of 13 (92%)	Dropped one more wrong assertion, bumped timeouts, added retry, installed WebKit

Every passing spec after run 1 came from Passmark telling me, in plain English, why my assertion didn't match the page.

Here are three examples from run 1:

For dataset-detail.spec.ts, I asserted "an API usage snippet (curl or JS) is shown", knowing that the page uses Python, because I wanted to see what the result would be. Passmark replied:

"The page contains API usage snippets, but they are specifically for Python using the requests library. There are no snippets provided in curl or JavaScript formats."

The page had snippets. I asked for the wrong languages. Fix: accept any language.

For dashboard-population.spec.ts, I asserted "a chart visualizing population by age or ethnicity is rendered." Passmark replied:

"The current page displays charts for vital statistics such as Live Births, Deaths, and Natural Increase over time, but there is no chart visualizing population specifically by age groups or ethnicity."

The charts are there. Not the slice I guessed. Fix: accept any chart about population.

For kkmnow/hospital-utilisation.spec.ts, I asserted a "headline bed-utilisation percentage." Passmark replied:

"While there are multiple bed-utilisation percentages listed in tables and rankings further down the page, there is no prominent, top-level headline KPI figure displaying the overall bed-utilisation percentage."

The numbers are there. I had asked for a layout the designers didn't build.

This is the killer feature: Passmark's failure messages aren't stack traces. They're explanations. The AI read the page, compared it against my words, and pointed me at the fix. Nothing like a selector-based test throwing TimeoutError: waiting for locator.

The Debugging Loop

Once I saw the pattern, the loop became my main technique. Here's the procedure:

Read the failure message word for word. Don't skim.
Trust it as a description of what is on the page. The AI has read the page. Your assertion has not.
Rewrite the assertion so it matches what's on the page. Broaden, narrow, or restate.
Run it again.

The discipline is to not argue with the tool. The page is what the page is. Your assertion is what is wrong. Every time I tried to "fix" the page (convinced my assertion was right and the site was broken), I lost some time. Every time I took the failure message at face value and rewrote, the test passed on the next run.

This is the single biggest change in how I'll write E2E tests going forward. The feedback loop is the tool. Every failed assertion is a draft of the correct one.

The Two Specs That Still Fail Are the Most Interesting

1. The two models disagreed and the arbiter call failed.

On catalogue-search.spec.ts, Claude voted fail (72% confidence) and Gemini voted pass (100% confidence) on the same assertion. I had written the assertion in a way that read two ways.

Passmark escalated to an arbiter model through OpenRouter. The call came back with a 504 from Cloudflare. The arbiter never ran. The suite failed the spec.

This is an honest limit, not a fluke. Any CI that runs Passmark depends on OpenRouter's availability. External gateway errors happen. My fix for the final run was a global retry wrapper around the OpenRouter call, and the 504 stopped being a problem in practice.

If you bring this to production CI, plan for retries and treat OpenRouter outages as a first-class failure mode in your runbook.
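A retry wrapper of this kind can be sketched as below. It is a generic illustration of retry with exponential backoff, not the code used in the suite and not part of Passmark's API. The names withRetry, maxAttempts, and baseDelayMs are hypothetical.

```typescript
// Retry an async call, doubling the delay after each failed attempt.
// Rethrows the last error once maxAttempts is exhausted.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Delay grows as baseDelayMs, 2x, 4x, ...
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

This version retries on any error. In a real runbook you would likely retry only on gateway-style failures such as a 504 and let genuine assertion failures surface immediately.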

This failure taught me something about assertion design: my wording was ambiguous. Claude's reading was reasonable. Gemini's reading was reasonable. When you write tests in English, being precise about what you mean is part of writing a good test.

2. The wait condition fired too early.

On the KKMNow spec, I had waitUntil: "A utilisation metric is visible". The page showed the section label "Hospital Bed Utilisation (%)" before the numbers finished loading. The wait step saw the label, decided the condition was met, and moved on. By the time the numbers rendered, the test had run out of time. Once the page was fully loaded, the range assertions would have passed on content.

"The page displays multiple bed-utilisation percentages within the specified range (0% to 120%). For example, the ranked list shows Perlis at 93.1% and Melaka at 88.2%."

The lesson: your waitUntil wording needs the same care as your assertion wording. Both are read by AI. A vague wait is as bad as a vague assertion.

What It Cost, and Why Cache Rate Is Cost Rate

Each of the three runs took about 20 minutes on 13 specs with a single worker. The hackathon's pooled OpenRouter key covered the AI costs, so I have no personal dollar figure to report.

The more useful cost finding is what gets cached.

$ docker exec passmark-redis redis-cli DBSIZE
5

Five steps out of roughly 35 were cached across three runs. A 14% cache-hit rate. The Passmark README explains why:

Only steps that produced a single tool call get cached. Multi-step sequences are considered non-deterministic.

Most of my steps described multi-tool sequences. "Open the area selector and choose Selangor, then wait for navigation" becomes click, wait, verify. Those don't cache by design.

This matters for your budget. An 86% miss rate means 86% of your steps call a model on every run. The cost model is per-tool-call via OpenRouter.

To estimate your own bill: count non-atomic steps in your suite, multiply by your chosen model's per-call price at current OpenRouter rates, and the product is your recurring cost per run. Cache rate is cost rate.

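The fix is authoring discipline. Split compound descriptions into atomic steps. Treat cache fill rate as a metric you track, not an implementation detail to ignore. If every uncached step costs about the same, a suite with 80% atomic (cacheable) steps pays for model calls on 20% of its steps instead of 86%, which is roughly a quarter of the cost.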
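The estimate can be sketched as a few lines of arithmetic. This assumes one model call per uncached step at a flat price, which is a simplification. The function names are hypothetical, and pricePerCallUsd is a placeholder you would replace with a real rate for your chosen model.

```typescript
// Fraction of steps that replay from cache, between 0 and 1.
function cacheRate(totalSteps: number, cachedSteps: number): number {
  if (totalSteps <= 0 || cachedSteps < 0 || cachedSteps > totalSteps) {
    throw new RangeError("Expected totalSteps > 0 and 0 <= cachedSteps <= totalSteps");
  }
  return cachedSteps / totalSteps;
}

// Recurring model cost per run: uncached steps times the price of one call.
// Assumes one call per uncached step; pricePerCallUsd is a placeholder value.
function estimateRunCostUsd(
  totalSteps: number,
  cachedSteps: number,
  pricePerCallUsd: number,
): number {
  const missRate = 1 - cacheRate(totalSteps, cachedSteps);
  return totalSteps * missRate * pricePerCallUsd;
}

// Cost of suite A relative to suite B, given the same step count and price.
function relativeCost(cacheRateA: number, cacheRateB: number): number {
  return (1 - cacheRateA) / (1 - cacheRateB);
}
```

With the numbers from this suite, cacheRate(35, 5) is about 0.14, and relativeCost(0.8, 0.14) comes out at roughly 0.23.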

The Pattern Worth Stealing

The idea here is bigger than Passmark.

Check that the numbers on your dashboards make sense. Most teams don't. They should.

A one-line assertion like "the headline number is between 20 million and 40 million" catches several classes of bug regular tests miss.

Here are four common ones:

The data pipeline divided by the wrong thing, so the number on screen is ten times too small.
A timezone bug made yesterday's total show up under tomorrow's date.
The data never refreshed, so users are looking at last week's numbers.
A locale flip swapped commas and decimals, so 1,234,567 now reads as 1.234567.

Civic portals were my target. The pattern applies anywhere a dashboard shows numbers. Fintech reports, SaaS analytics, healthcare metrics, e-commerce admin panels. Any screen where a number is supposed to mean something.

Most of these numbers never get tested. Writing the check by hand is tedious. You need a selector to find the number, code to parse it, code to handle units, and a margin calculation. Fifty lines for one check. Nobody bothers.

You don't need Passmark to steal the idea. The same check works in plain Playwright with page.evaluate and number parsing. The Passmark version is just more efficient to write and readable by anyone on the team, not only engineers.
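As a rough sketch of the plain-code version, the parsing and range check might look like this. The helper names parseHeadlineNumber and assertInRange are hypothetical, and the parser only handles the formats listed in its comment.

```typescript
// Multipliers for unit words that may follow the number on screen.
const UNIT_MULTIPLIERS: Record<string, number> = {
  thousand: 1e3,
  million: 1e6,
  billion: 1e9,
};

// Parse text such as "34.2 million" or "32,447,385" into a plain number.
// Assumes commas are thousands separators and unit words are in English.
function parseHeadlineNumber(text: string): number {
  const match = text.match(/(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?/i);
  if (!match) {
    throw new Error(`No number found in "${text}"`);
  }
  const value = Number(match[1].replace(/,/g, ""));
  const unit = match[2]?.toLowerCase();
  return unit ? value * UNIT_MULTIPLIERS[unit] : value;
}

// Throw unless min < value < max. NaN also fails the check.
function assertInRange(value: number, min: number, max: number, label: string): void {
  if (!(value > min && value < max)) {
    throw new Error(`${label} = ${value} is outside the range (${min}, ${max})`);
  }
}
```

In a Playwright test, the input text could come from something like a locator's innerText() call on whichever element holds the headline number, with the selector being yours to choose. That selector is the part this approach still has to maintain.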

Honest Verdict

Passmark works. Across three runs I went from 4 of 13 passing to 12 of 13 without touching a selector, guided by the tool's own feedback.

Still, the caveats are real:

On a cold cache, every step waits for a model. Budget more wall-clock time than a selector suite.
In my suite only 14% of steps cached. The other 86% pays model cost on every run. Authoring discipline (atomic steps) is the difference between cents and dollars per run.
Two-model voting doesn't protect against both models misreading the same way. Write assertions that are hard to misread.
Every assertion depends on OpenRouter's availability. External gateway errors need a retry strategy before this runs in CI.

What stuck with me: Passmark didn't make me better at Playwright. It made me write tests I would have skipped otherwise.

What I imagine myself doing at work:

Run a small nightly Passmark suite against the critical dashboards, focused on range and freshness checks.
Keep traditional Playwright and Jest for everything that has to be fast and deterministic.
Treat every Passmark failure message as a specification of the page, not an error to argue with.

Try this, even if you never touch Passmark. Pick a number on a dashboard you work with. Write a test that fails if the number is outside a sane range. See what breaks. That is the whole pattern and purpose of this article.

Resources

Repo: github.com/LeeRenJie/passmark-hackathon
Passmark: github.com/bug0inc/passmark
Breaking Apps Hackathon: hashnode.com/hackathons/breaking-things
Test targets: data.gov.my, OpenDOSM, KKMNow

The Data Quality Handbook: Data Errors, the Developer's Role, and Validation Layers Explained.

Great John — Tue, 14 Apr 2026 20:29:40 +0000

In August 2012, Knight Capital, a major trading firm in the United States, deployed faulty trading software to its production system. The system ran with an incorrect configuration, and it triggered millions of unintended stock trades.

The company lost about $440 million in just 45 minutes. Knight Capital nearly collapsed and had to be rescued by investors. It was later acquired by another firm.

When Target expanded into Canada, the company relied on a new supply chain system that contained incorrect product and inventory data. Product information in the database was incomplete and inaccurate. Prices, sizes, and product descriptions were entered incorrectly.

Inventory systems reported items in stock that were actually unavailable. Customers found empty shelves in stores despite the system showing stock. The company lost over $2 billion in the Canadian market. Target eventually shut down all Canadian stores in 2015.

One employee put it this way: “Even though we had a great supply chain system on paper, we didn’t have accurate data. Bad data leads to bad decisions.”

Another famous data-related engineering failure involves the Mars Climate Orbiter spacecraft. One engineering team used metric units (newton-seconds of thruster impulse). Another team used imperial units (pound-force seconds). The system failed to convert the data correctly. The spacecraft entered Mars' atmosphere at the wrong altitude. The mission failed and the spacecraft was destroyed. The loss was about $125 million.

In this article, we'll delve deep into what data quality truly means, the types of data errors that silently break systems, the developer’s responsibility in preventing them, and the validation layers that work together to keep bad data out of production.

What We'll Cover:

Prerequisites
The Importance of Data Quality
- How Does Bad Data Happen in the First Place?
- The Cost of Bad Data
Types of Data Errors
What Makes Good Data?
Data Validation Layers
Testing Strategies to Protect Data Quality
Conclusion

Prerequisites

A basic understanding of what data is
A basic understanding of data structures
An understanding of what an API is
An understanding of what a database is and what it does

The Importance of Data Quality

As you can see from just these few examples, the quality of the data you're working with really matters.

Gartner reports that organisations attribute an average of around $15 million in annual losses to poor-quality data. The same research also shows that nearly 60% of companies have no clear idea what bad data actually costs them, largely because they don’t track or measure data-quality issues at all.

A 2016 study by IBM is even more eye-popping. IBM found that poor data quality strips $3.1 trillion from the U.S. economy annually due to lower productivity, system outages, and higher maintenance costs.

Bad data is, and will continue to be, the kryptonite of any organisation. This is even more concerning as more organisations now depend on data for strategy execution than ever before.

When data is wrong, incomplete, duplicated, or inconsistent, the consequences ripple outward: incorrect dashboards mislead teams, misled teams make incorrect decisions, and acting on those decisions produces faulty strategy and policy.

Eventually, the organisation pays the price, financially, operationally, and reputationally. And while money can be recovered, reputation rarely bounces back so easily.

How Does Bad Data Happen in the First Place?

Form fields are usually the first place where data enters an application, so they’re often where bad data begins. This is why the developer’s role is so critical.

Many of the most damaging data errors don’t originate from malicious users or complex edge cases – they come from simple oversights that the system should never have allowed in the first place.

But it's equally important to recognise that data quality issues often originate before the data ever reaches an application. Upstream processes — how data is collected, measured, recorded, or pre‑validated — can introduce inaccuracies long before the system receives it.

For example, a nurse might weigh a patient using an uncalibrated mechanical scale, record the incorrect value on a paper form, and later have that value transcribed into the hospital system. By the time the data enters the application, the error is already embedded.

This means that maintaining data quality requires attention both to upstream data collection practices and to the system-level validation that developers control.

When the UI, backend, or API layer permits invalid, incomplete, inconsistent, or logically impossible data to enter the pipeline, the organisation inherits a long‑term liability. Even small choices — such as allowing empty fields, ignoring duplicates, or failing to enforce validation rules — can introduce errors that may only surface months later in reports or dashboards, leading to confusion and inaccurate insights.

The Cost of Bad Data

Data quality can also be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

If bad data is caught in the UI, it's almost free, if we're thinking in terms of cost. If it's caught at the API layer, that's still pretty cheap. If it's caught in the database, the cost is moderate. And if it's caught in a report or ML model months later, that's expensive, and sometimes irreversible.

A key principle in modern data management is: the cheapest and safest place to catch bad data is at the source, and that is before ingestion. The well-known 1-10-100 Rule, introduced by George Labovitz and Yu Sang Chang in 1992, clearly illustrates this idea.

According to the rule, it costs about $1 to validate data at the point of entry, $10 to correct it after it has entered the system, and $100 per record if the error goes unnoticed and causes problems further down the line.
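To make the scaling concrete, here is a tiny sketch of the rule as arithmetic. The per-record figures are the rule's own illustrative values, not measured costs, and the Stage type and costOfBadRecords function are hypothetical names.

```typescript
// The three stages of the 1-10-100 Rule and their illustrative per-record costs.
type Stage = "entry" | "correction" | "failure";

const COST_PER_RECORD_USD: Record<Stage, number> = {
  entry: 1, // validated at the point of entry
  correction: 10, // corrected after entering the system
  failure: 100, // unnoticed until it causes problems downstream
};

// Total illustrative cost of handling a batch of bad records at a given stage.
function costOfBadRecords(recordCount: number, stage: Stage): number {
  return recordCount * COST_PER_RECORD_USD[stage];
}
```

Under the rule, the same 500 bad records cost 500 at entry, 5,000 at correction, and 50,000 once they cause downstream failures.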

As the saying goes, an ounce of prevention is worth a pound of cure – and this is especially true when it comes to maintaining high-quality data.

To support this point, I’ve categorised the types of errors and oversights that developers can and should prevent before bad data ever reaches the database, analytics layer, or reporting systems.

Types of Data Errors

Required Field Errors

If you build a registration form that lets a user submit it with important fields left empty (like first name, last name, email address, phone number, date of birth, or address), you're directly letting incomplete data enter the system.

I remember a scenario from my time as a data analyst where I was analysing a dataset containing different types of alarms triggered across several buildings. These alarms fell into categories such as aquarium alarms, intruder alarms, fire alarms, and maintenance alarms.

The purpose of the analysis was simple: identify which buildings had the highest frequency of alarms so that maintenance, resources, or investigations could be allocated appropriately.

Whenever an alarm went off, the security team recorded it using a software system. By the end of each month, we could view the cumulative alarms and generate insights.

But I encountered a major data quality issue. The security officers often selected the alarm category but failed to submit the building where the alarm occurred — and the system allowed this incomplete record to be saved into the database.

Every alarm had to occur in a specific building. Yet during analysis, I would see entries like “20 fire alarms” with no building information attached. Since I couldn’t determine where these alarms happened, the data became unusable. I had no choice but to delete those records because they provided no actionable value.

This is a classic example of poor data validation. If the developer had implemented proper constraints, the system would never allow an alarm to be submitted without a building name.

Required fields should be enforced at the UI and backend levels to prevent missing data from entering the system in the first place. Unenforced fields lead to missing or unusable data in the database, often forcing teams to delete or manually repair records later.

To prevent these errors, you can use required‑field validation, disable the submit button until all mandatory fields are completed, and visually highlight missing fields with inline error messages.

Here's a practical code example of some bad code (no required checks):

From the above code snippet, the core problem is that the form doesn't enforce required input. Neither HTML‑level validation (using the required attribute) nor JavaScript‑based checks are implemented. This omission allows users to submit the form without providing necessary information, making the form unreliable for collecting valid and complete user data.

From a usability and data quality perspective, this is problematic. Forms are typically designed to collect meaningful and complete information, and fields such as “Full name” and “Email” are usually essential. Without marking these inputs as required or validating them programmatically, we risk receiving blank or invalid submissions, which can compromise the quality of stored data and any processes that depend on it.

Here's an example of a better version (UI prevents empty submission):

In this revised version of the code, the addition of the required attribute to both the name and email input elements ensures that the browser won't allow the form to be submitted unless these fields are filled. This is an important step toward maintaining data completeness and improving the overall reliability of the form.

Also, by checking e.target.checkValidity(), we now ensure that the form is evaluated before submission proceeds.

Another positive aspect is the conditional use of e.preventDefault(). When the form is invalid, the default submission behavior is stopped, preventing incomplete or incorrect data from being sent.
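Since required fields should also be enforced on the backend, the same rule can be mirrored in server-side code. The following is a minimal sketch in TypeScript, not the form code discussed above. The function name findMissingFields and the field names are hypothetical.

```typescript
// Return the names of required fields that are absent or blank.
// A field counts as missing if it is undefined, null, or whitespace only.
function findMissingFields(
  input: Record<string, unknown>,
  requiredFields: string[],
): string[] {
  return requiredFields.filter((field) => {
    const value = input[field];
    return value === undefined || value === null || String(value).trim() === "";
  });
}

// Reject a submission outright if any required field is missing.
function assertComplete(input: Record<string, unknown>, requiredFields: string[]): void {
  const missing = findMissingFields(input, requiredFields);
  if (missing.length > 0) {
    throw new Error(`Missing required fields: ${missing.join(", ")}`);
  }
}
```

Applied to the alarm example, requiring both a category and a building would have rejected a record like { category: "fire" } before it reached the database.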

Format Validation Errors

If your form accepts an email without an @ symbol, an email without a domain, a phone number containing letters, or a postcode/ZIP code in the wrong format, it lets invalid data enter the system.

The same applies when you allow a user to submit an impossible date (32/15/2025) or a credit card number with the wrong length.

These issues will cause the data analyst to spend more time cleaning the data, if it's even cleanable. And such incorrect inputs create unreliable data that breaks downstream processes and increases cleanup costs.

To prevent these types of errors, you can use regex validation, input masks, and field‑type restrictions (for example, numeric‑only fields for phone numbers) to enforce correct formats before submission.

Here's a bad example of allowing format validation errors:

This code doesn't perform any checks on the format or structure of the phone number. The function simply retrieves whatever value exists – whether valid, invalid, or blank – and logs it to the console without any condition.

Here's the fixed version:

This version fixes the earlier mistake by introducing a clear validation rule. Before the system accepts the phone number, it checks whether the input contains only digits. The regular expression ^\d+$ ensures that the value is made up entirely of numbers, with no letters or symbols allowed. If the user enters anything invalid, the function stops and displays an error message instead of saving bad data.

This approach prevents the format error that occurred in the previous example. Instead of blindly trusting whatever the user types, the code now enforces a rule that matches the expected format of a phone number. This is what a responsible developer should do: verify the input before using it.
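The digits-only rule described above can be sketched as a small reusable function. This is an illustrative TypeScript version with hypothetical names (validatePhone, PhoneResult), not the snippet from the form itself.

```typescript
// Result type: either the cleaned value or an error message for the user.
type PhoneResult = { ok: true; value: string } | { ok: false; error: string };

// Matches one or more digits and nothing else.
const DIGITS_ONLY = /^\d+$/;

// Accept the input only if, after trimming, it consists entirely of digits.
function validatePhone(raw: string): PhoneResult {
  const value = raw.trim();
  if (!DIGITS_ONLY.test(value)) {
    return { ok: false, error: "Phone number must contain digits only" };
  }
  return { ok: true, value };
}
```

Note that this rule is strict: it also rejects inputs with a leading plus sign, spaces, or dashes, and it does not check length. Whether that is right depends on the formats your system needs to accept.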

Range and Limit Errors

Allowing users to enter values outside acceptable limits (such as negative ages, quantities below zero, discounts above 100%, or measurements far beyond realistic ranges) lets the system ingest data that violates business rules. These errors distort analytics, break calculations, and create operational inconsistencies.

To mitigate these errors, you can apply min/max constraints, sliders, steppers, and numeric boundaries to ensure values fall within valid ranges.

Here's a bad example of allowing range and limit errors:

As seen above, we've created an input field for age but haven't specified any limits or constraints. The browser allows the user to type any number, including values that make no sense, such as negative ages, extremely large ages, or decimals. The JavaScript function simply reads the value and logs it without checking whether the age is realistic.

Here's a better version:

Now in this version, the inclusion of the min="0" and max="120" attributes sets clear boundaries for acceptable input values. This ensures that only realistic age values within a defined range are allowed, preventing invalid entries such as negative numbers or excessively large ages.

The JavaScript function further enhances this validation by using the checkValidity() method. This method checks whether the input satisfies all defined constraints, including the required condition and the specified numeric range. If the input doesn't meet these conditions, the function prevents further execution and displays an alert message, informing the user that the entered age must fall within the allowed range.
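The browser-side constraints can be bypassed by anything that posts directly to the backend, so it helps to repeat the range rule in code. Here is a minimal sketch, assuming the same 0 to 120 bounds. The function name validateAge is hypothetical.

```typescript
// Parse and validate an age submitted as text.
// Rejects blanks, non-numeric input, decimals, and values outside [min, max].
function validateAge(raw: string, min = 0, max = 120): number {
  const age = Number(raw);
  // Number("") is 0, so blank input needs its own check.
  if (raw.trim() === "" || !Number.isInteger(age)) {
    throw new RangeError("Age must be a whole number");
  }
  if (age < min || age > max) {
    throw new RangeError(`Age must be between ${min} and ${max}`);
  }
  return age;
}
```

The explicit blank check matters because JavaScript converts an empty string to 0, which would otherwise pass as a valid age.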

Logical Consistency Errors

If you allow a user to select an end date before the start date, choose a checkout date earlier than check‑in at a hotel, or enter a delivery date before the order date, this will result in logically impossible data. The same applies when you allow a user to enter a graduation year earlier than their admission to a program, or submit working hours that exceed 24 hours in a day.

You can mitigate this by implementing cross‑field validation, business‑rule checks, and conditional logic that ensures related fields remain consistent.

Here's a bad example of a logical consistency error:

In the code above, the core issue is the complete absence of validation. Although the inputs use type="date", which provides a structured way for users to select dates, the code doesn't enforce that either field is required. This means the user can leave one or both date fields empty, and the save() function will still run and log the values. As a result, the system may end up processing incomplete or meaningless data.

Beyond missing required checks, the code also fails to validate the logical relationship between the two dates. In any scenario involving a start date and an end date, it's expected that the start date shouldn't occur after the end date. But this code performs no such comparison.

This means that the user can select a start date that's later than the end date, and the system will accept it without warning. This leads to inconsistent or impossible data being recorded.

Also, the function simply logs the values without providing any feedback to the user. There's no mechanism to alert the user when a field is empty or when the dates are logically incorrect. This reduces usability and makes it difficult for users to understand or correct their mistakes.

Here's the fixed version:

In this improved version, first, both date fields now include the required attribute, ensuring that the user can't leave either field empty without triggering validation.

Second, we've added a logical validation check to ensure that the relationship between the two dates is correct. After retrieving the values, the function converts them into Date objects and compares them to verify that the end date doesn't occur before the start date. If this condition is violated, the function stops execution and displays an alert informing the user of the error.

This prevents inconsistent or impossible date ranges from being accepted.
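To make the cross-field rule concrete, here's a minimal sketch of the comparison as a standalone function. The name validateDateRange and the shape of the returned object are assumptions, and the sketch assumes both values arrive in the YYYY-MM-DD format that a date input produces.

```javascript
// Hypothetical cross-field check for a start/end date pair.
// Assumes both values use the "YYYY-MM-DD" format from <input type="date">.
function validateDateRange(startValue, endValue) {
  if (!startValue || !endValue) {
    return { ok: false, error: "Both dates are required." };
  }

  const start = new Date(startValue);
  const end = new Date(endValue);

  // Invalid date strings produce a Date whose time value is NaN.
  if (Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    return { ok: false, error: "Dates must be valid." };
  }

  // The rule itself: the end date must not come before the start date.
  if (end < start) {
    return { ok: false, error: "End date can't be before start date." };
  }

  return { ok: true, error: null };
}

console.log(validateDateRange("2025-03-01", "2025-03-10").ok); // true
console.log(validateDateRange("2025-03-10", "2025-03-01").ok); // false
console.log(validateDateRange("", "2025-03-01").ok);           // false
```

Returning an error message instead of calling alert() directly lets the caller decide how to show the problem to the user.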

Duplicate and Data Integrity Errors

When you let a user submit an email that's already registered, choose a username that's already taken, or enter a duplicate employee ID or student number, this results in identity conflicts and duplicate records. Problems also arise when you allow users to upload unsupported file types, oversized files, or corrupted images.

Security risks can emerge when users are able to enter HTML/script tags (XSS), SQL‑injection patterns, or disallowed special characters. These issues compromise data quality, system integrity, and security.

You can prevent these types of issues by using uniqueness checks, file‑type and size validation, and input sanitization to block duplicates, invalid uploads, and malicious inputs.

Here's an example of a duplicate error:

This code blindly pushes every email into the savedEmails array without checking whether the email already exists. Because there is no duplicate detection, the user can enter the same email multiple times.

Here is the fixed version:

In this improved version of the code, we've implemented proper validation steps to prevent duplicate email entries. Before saving the email, the function checks whether the value already exists in the savedEmails array using the includes() method. If the email is found, the function stops execution and displays an alert informing the user that the email has already been saved. This ensures that each email is stored only once, maintaining the uniqueness and integrity of the data.
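One thing an exact includes() match can miss is the same address typed with different casing or stray spaces. The sketch below is one possible way to handle that, not the code described above: it assumes emails should be compared case-insensitively after trimming, and the names normalizeEmail and saveEmail are hypothetical.

```javascript
// Hypothetical in-memory store; a real system would also enforce this
// with a UNIQUE constraint in the database.
const savedEmails = new Set();

// Assumption: emails are compared case-insensitively and without
// surrounding whitespace.
function normalizeEmail(email) {
  return email.trim().toLowerCase();
}

// Returns true if the email was stored, false if it was a duplicate.
function saveEmail(email) {
  const key = normalizeEmail(email);
  if (savedEmails.has(key)) {
    return false;
  }
  savedEmails.add(key);
  return true;
}

console.log(saveEmail("ada@example.com"));   // true
console.log(saveEmail(" Ada@Example.com ")); // false (same address)
console.log(savedEmails.size);               // 1
```

A Set gives a constant-time lookup, which matters more as the list of saved emails grows.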

Relational Errors (Reference Integrity)

If you let a user select a city that doesn’t belong to the chosen country, a product ID that no longer exists, a retired SKU, or a shipping method unavailable in the selected region, this can result in broken references.

The same applies when users can select a manager from a different department or choose a fully booked time slot, or when roles and permissions aren't set up correctly. These errors break relationships between tables and corrupt downstream joins and reports.

Here, you can use dependent dropdowns, real‑time lookups, and foreign‑key validation to help ensure that users can only select valid, existing, and compatible options.

Here's a bad example of a relational error:

From the above, the mistake in this code is that we've treated country and city as completely independent fields, even though one is supposed to depend on the other. By presenting all cities regardless of the selected country, the interface allows users to create combinations that make no sense — such as choosing “United Kingdom” with “New York” or “United States” with “Manchester.”

Also, because the save() function performs no validation and simply logs whatever the user selects, the system ends up accepting and storing relationships that should never exist. This breaks the logical link between the two fields and leads to invalid, inconsistent data that can corrupt downstream joins and reports.

Here's the fixed, production-ready version:

This improved code turns the country–city form into a controlled, relationship‑aware flow instead of two loose dropdowns.

When the user selects a country, the loadCities() function runs. It first clears the city dropdown and, if no country is selected, keeps the city field disabled so the user can't choose a city on its own.

Once a valid country is chosen, the city dropdown is enabled and populated only with the cities that belong to that specific country, using the citiesByCountry mapping. Also, the city values are normalised (lowercased and stripped of spaces) so they’re consistent and safe to compare.

When the user clicks “Save,” the save() function checks that both a country and a city have been selected. If either is missing, it shows an alert and stops. It then rebuilds the list of valid city values for the chosen country and verifies that the selected city is actually in that list.
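The final membership check is the part that protects the relationship, so it's worth seeing on its own. Here's a minimal sketch without the DOM wiring. The contents of citiesByCountry and the function names are illustrative assumptions, not the article's original code.

```javascript
// Hypothetical mapping; the countries and cities are illustrative only.
const citiesByCountry = {
  uk: ["London", "Manchester"],
  us: ["New York", "Chicago"],
};

// Lowercase and remove spaces so values compare consistently.
function normalise(value) {
  return value.toLowerCase().replace(/\s+/g, "");
}

// Returns true only when the city belongs to the selected country.
function isValidCityForCountry(country, city) {
  if (!country || !city) {
    return false;
  }
  // Use hasOwnProperty so inherited keys such as "toString" don't match.
  if (!Object.prototype.hasOwnProperty.call(citiesByCountry, country)) {
    return false;
  }
  const validCities = citiesByCountry[country].map(normalise);
  return validCities.includes(normalise(city));
}

console.log(isValidCityForCountry("uk", "Manchester")); // true
console.log(isValidCityForCountry("uk", "New York"));   // false
console.log(isValidCityForCountry("us", "new york"));   // true
console.log(isValidCityForCountry("fr", "Paris"));      // false
```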

Structural Errors (Dropdowns, Radio Buttons, Enums)

If users can type a country as “U.S.A”, “USA”, “United States”, or “us”, enter gender as “male”, “Male”, “M”, or “man”, or type a department as “Engineering”, “Eng”, or “engineer”, this can result in inconsistent categorical data.

The same applies to currencies typed as “usd”, “USD”, “US Dollars”, product categories spelled differently, status values like “active”, “Active”, “ACT”, “enabled”, or boolean values like “yes”, “Yes”, “Y”, “1”.

These inconsistencies make analytics, grouping, and reporting unreliable, and the analyst will spend time cleaning and standardizing these files.

You should replace free‑text fields with dropdowns, radio buttons, and enums to enforce standardized categorical values.

Bad example of a structural error:


  Country

The problem with this code is that it pretends to save a country value without doing any real validation or enforcing any rules, which makes the form unreliable and prone to bad data.

The form uses a plain text input for “country,” meaning the user can type anything they want — misspellings, random characters, invalid countries, or even leave it blank. Because the input isn’t marked as required and the JavaScript doesn’t check whether the field contains a meaningful value, the form will happily “save” an empty string or nonsense text.

The submit handler prevents the default form submission but does nothing beyond logging whatever the user typed, so the system accepts invalid, incomplete, or malformed data without question. In short, the code collects input but doesn't validate it, doesn't enforce correctness, and doesn't protect the system from bad or unusable values.

Here's the fixed version:


  Country

The biggest improvement is that we're no longer relying on a free‑text field for the country. By switching to a dropdown, the form now limits the user to a controlled set of valid options. This prevents misspellings, random text, or invalid country names from ever entering the system.
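A dropdown only restricts what the page offers, so it can help to check the submitted value against the same list of options in code. The sketch below shows one way to do that. The list of codes and the name parseCountryCode are assumptions for illustration.

```javascript
// Hypothetical allowed values; a real list would come from shared config
// or a reference table.
const COUNTRY_CODES = Object.freeze(["US", "GB", "MY"]);

// Returns the canonical code, or null when the value is not an allowed option.
function parseCountryCode(value) {
  if (typeof value !== "string") {
    return null;
  }
  const candidate = value.trim().toUpperCase();
  return COUNTRY_CODES.includes(candidate) ? candidate : null;
}

console.log(parseCountryCode("us"));            // US
console.log(parseCountryCode("United States")); // null
console.log(parseCountryCode(""));              // null
```

Anything that isn't one of the known codes comes back as null, so free-text variants never reach storage as a new category.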

These are the main types of data errors you might come across in your work. Now that we've discussed what causes them and some key fixes/preventative measures you can take, let's move on to data quality itself.

What Makes Good Data?

So what, in fact, is data quality? IBM defines it as the degree of accuracy, consistency, completeness, reliability, and relevance of the data collected, stored, and used within an organization or a specific context.

Let's look at each of these features of quality data a bit more closely to understand what they entail.

Completeness:

Completeness measures how much of the required data is actually present. When large portions of fields are missing, the dataset stops representing reality and any analysis built on it becomes unreliable.

An example would be a sign‑up form that stores users, but half of them are missing an email address. If you run an analysis on “email engagement,” your results will be skewed because a big chunk of users can’t even receive emails. This means that this data is incomplete.

Uniqueness:

Uniqueness checks whether each real‑world entity appears only once in the dataset. Duplicate records inflate counts, break joins, and distort metrics.

An example would be a customer table containing two rows for the same person with the same customer ID. When calculating “active customers,” the system counts them twice, inflating revenue projections.
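Completeness and uniqueness are both easy to measure directly. The sketch below computes each one for a small in-memory list. The rows, field names, and function names are all made up for illustration.

```javascript
// Illustrative rows; the field names (id, email) are assumptions.
const users = [
  { id: 1, email: "ada@example.com" },
  { id: 2, email: "" },
  { id: 3, email: "grace@example.com" },
  { id: 3, email: "grace@example.com" }, // duplicate customer ID
];

// Completeness: share of rows where the field is present and non-empty.
function completeness(rows, field) {
  if (rows.length === 0) return 1;
  const filled = rows.filter(
    (row) => row[field] !== null && row[field] !== undefined && row[field] !== ""
  );
  return filled.length / rows.length;
}

// Uniqueness: list the key values that appear more than once.
function duplicateKeys(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([value]) => value);
}

console.log(completeness(users, "email")); // 0.75
console.log(duplicateKeys(users, "id"));   // [ 3 ]
```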

Validity:

Validity evaluates whether data follows the expected format, type, or business rules. This includes correct data types, allowed ranges, and patterns defined by the system.

An example would be a field meant to store dates that contains values like “32/99/2025” or “tomorrow.” These invalid entries break downstream ETL jobs that expect a proper date format.
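A validity check for that date field could look like the sketch below. It assumes dates are written as DD/MM/YYYY, and the name isValidDate is hypothetical. Because JavaScript dates roll invalid parts over instead of failing, the sketch builds the date and then confirms that nothing changed.

```javascript
// Hypothetical validator for dates written as DD/MM/YYYY.
function isValidDate(text) {
  const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(text);
  if (!match) {
    return false; // wrong shape, for example "tomorrow"
  }

  const day = Number(match[1]);
  const month = Number(match[2]);
  const year = Number(match[3]);

  // Date.UTC rolls invalid parts over (month 13 becomes January of the
  // next year), so build the date and confirm every part survived.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isValidDate("15/08/2025")); // true
console.log(isValidDate("32/99/2025")); // false
console.log(isValidDate("tomorrow"));   // false
console.log(isValidDate("29/02/2025")); // false (2025 is not a leap year)
```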

Timeliness:

Timeliness reflects whether data is available when it’s needed. Even accurate data becomes useless if it arrives too late for the process that depends on it. For example, after a customer places an order, the system should generate an order ID instantly.

Accuracy:

Accuracy measures how closely data matches the real‑world truth. When multiple systems report the same metric, one must be designated as the authoritative source to avoid conflicting values.

Consistency:

Consistency checks whether data aligns across different datasets or within related fields. If two systems describe the same concept, their values shouldn't contradict each other.

For example, a company’s HR system reports 50 employees in Engineering, but the payroll system lists only 42. Since both describe the same group, the mismatch signals a data quality issue.

Fitness for Purpose:

Fitness for purpose assesses whether the data is suitable for the specific business task at hand. Even complete, accurate, and timely data may be unhelpful if it doesn’t answer the intended question.

A dataset of website clicks might be perfect for analysing user engagement, for example, but it’s useless for forecasting revenue because it contains no purchase or pricing information.

Data Validation Layers

Now that we've highlighted the characteristics that ensure quality data, it's important to discuss the layers of data validation.

There are five layers you'll need to check to enforce data quality.

Frontend Layer — “Protect the User, Not the System”

Frontend validation plays an important role in enhancing the user experience – but it doesn't provide real protection for a system.

Since frontend logic operates within the user’s environment, we can't trust it as a mechanism for enforcing data quality. Any code executed in the browser is ultimately under the user’s control, meaning it can be disabled, modified, intercepted, or bypassed entirely.

For instance, a user can simply open browser developer tools, remove validation rules, and submit invalid or malicious data without restriction.

Frontend validation also can't be relied on to enforce complex business rules. Constraints such as ensuring that a discounted price is lower than the original price, validating that a start date precedes an end date, preventing stock levels from becoming negative, or confirming that a product belongs to a valid category within the database require deeper system-level checks.

What the frontend typically validates is required fields, email format, password strength, address fields, and payment input format.

So frontend validation doesn't guarantee data quality or security, as it can be bypassed through API tools (like Postman), disabled JavaScript, malicious bots, and third-party integrations.

Because of this, it's best to treat the front-end as a usability layer, not a trust layer.

Backend Validation — “The Real Gatekeeper”

You can only guarantee true data quality and system integrity at the backend and database layers.

The backend is responsible for enforcing request validation, implementing business logic, and managing authentication and authorization.

If validation fails here, invalid data is rejected before it can propagate. Without this layer, data corruption begins at ingestion.

For example:

$request->validate([
   'name' => 'required|string|max:255',
   'price' => 'required|numeric|min:0',
   'stock' => 'required|integer|min:0',
   'category_id' => 'required|exists:categories,id',
]);

The code snippet above demonstrates how you can use request validation in Laravel to ensure that incoming data meets specific requirements before it's processed or stored in the database. This is an essential practice in web development, as it helps maintain data integrity, prevents errors, and enhances application security.

In this example, we're using the $request->validate() method to define a set of validation rules for four input fields: name, price, stock, and category_id. Each field is assigned a series of constraints that the incoming data must satisfy.

The name field is marked as required, meaning it must be included in the request and can't be empty. It must also be a string, ensuring that only textual data is accepted, and it's limited to a maximum length of 255 characters using max:255. This prevents excessively long inputs that could potentially cause issues in the database or user interface.

Similarly, the price field is required and must be numeric, allowing only numbers such as integers or decimal values. The rule min:0 ensures that the price can't be negative, which is logically consistent for most product pricing scenarios.

The stock field is also required and must be an integer, meaning it can only accept whole numbers. This is appropriate for counting physical items. Like the price field, it includes a min:0 rule to prevent negative stock values, which would not make sense in an inventory system.

Finally, the category_id field is validated to ensure it is both present and valid. The required rule ensures that a category is selected, while the exists:categories,id rule checks that the provided value corresponds to an existing id in the categories database table. This prevents invalid or non-existent category references, thereby preserving relational integrity within the database.

This layer validates null values, data types and formats, allowed ranges, and referential integrity (exists).

Database Layer — “Protect the Data at Rest”

Validation at the application level is insufficient on its own. You'll also need to enforce database-level constraints like NOT NULL constraints, UNIQUE constraints (email, SKU, order number), foreign keys (orders.user_id → users.id), and check constraints (for example, price >= 0).

This layer is critical because application bugs may bypass validation, background jobs and imports may skip controllers, and malicious actors may attempt direct access.

The database layer acts as the final line of defense, ensuring structural integrity regardless of application failures. Database constraints are the last hard stop: they enforce correctness even when code is bypassed.

Service Layer / Business Logic — “Validate Real-World Rules”

This layer enforces domain-specific rules that can't be captured by simple request validation or database constraints. The service layer is where the application stops asking “Is this data shaped correctly?” and starts asking “Is this allowed to happen in the real world?” These rules reflect business truth, not structural correctness.

Example:

if ($product->stock < $quantity) {
   throw new OutOfStockException();
}

This prevents overselling and ensures the system reflects physical reality.

if ($cartTotal !== $calculatedTotal) {
   throw new PriceMismatchException();
}

This protects revenue and prevents tampering.

In this layer, you enforce real‑world business rules by ensuring inventory correctness, recalculating totals, applying discount logic, and checking user‑specific limits.
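A strict comparison of totals only works reliably when both sides are calculated the same way. With floating-point prices, 0.1 + 0.2 doesn't equal 0.3. Here's a minimal sketch of the recalculation, assuming prices are stored as integer minor units (cents or pence). The item shape and function names are hypothetical, and the error class is named after the exception in the example above.

```javascript
// Hypothetical error type, named after the exception in the example above.
class PriceMismatchError extends Error {}

// Assumption: prices are stored as integer minor units (cents or pence),
// which avoids floating-point surprises such as 0.1 + 0.2 !== 0.3.
function calculateTotalCents(items) {
  return items.reduce(
    (sum, item) => sum + item.unitPriceCents * item.quantity,
    0
  );
}

// Recalculate the total on the server and reject a client total that differs.
function verifyCartTotal(items, clientTotalCents) {
  const calculated = calculateTotalCents(items);
  if (clientTotalCents !== calculated) {
    throw new PriceMismatchError(
      `Expected ${calculated}, received ${clientTotalCents}`
    );
  }
  return calculated;
}

const cart = [
  { unitPriceCents: 1999, quantity: 2 },
  { unitPriceCents: 500, quantity: 1 },
];

console.log(verifyCartTotal(cart, 4498)); // 4498
```

The total sent by the client is treated only as a claim to verify. The value that gets stored is the one the server calculated.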

Jobs / Queues / Data Ingestion — “Validate External Data”

When importing or processing external data (for example, supplier feeds), validation must occur before processing. You'll need to ensure schema conformity, that the required columns are present, that you have the correct data types, that the JSON structure is valid, and that you're detecting duplicate batches.

This is because external data sources are a major source of data quality issues. Without validation here, corrupted data can silently enter the system at scale.
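The checks listed above can be sketched as a single function that runs before any row is processed. Everything specific here is an assumption for illustration: the column names, the batch shape, and the in-memory set of processed batch IDs. The sketch also assumes the payload has already been parsed from JSON.

```javascript
// Hypothetical schema for a supplier feed: column name -> expected typeof.
const REQUIRED_COLUMNS = { sku: "string", price: "number", stock: "number" };

// Batch IDs that were already processed (kept in memory for this sketch).
const seenBatchIds = new Set();

// Returns a list of problems; an empty list means the batch can be processed.
function validateBatch(batch) {
  const errors = [];

  if (seenBatchIds.has(batch.id)) {
    return [`Batch ${batch.id} was already processed`];
  }
  if (!Array.isArray(batch.rows)) {
    return ["rows must be an array"];
  }

  batch.rows.forEach((row, index) => {
    if (row === null || typeof row !== "object") {
      errors.push(`Row ${index}: expected an object`);
      return;
    }
    for (const [column, type] of Object.entries(REQUIRED_COLUMNS)) {
      if (!(column in row)) {
        errors.push(`Row ${index}: missing column "${column}"`);
      } else if (typeof row[column] !== type) {
        errors.push(`Row ${index}: "${column}" should be a ${type}`);
      }
    }
  });

  // Only remember the batch once it has passed every check.
  if (errors.length === 0) {
    seenBatchIds.add(batch.id);
  }
  return errors;
}

const goodBatch = { id: "batch-001", rows: [{ sku: "A1", price: 9.5, stock: 3 }] };
console.log(validateBatch(goodBatch).length); // 0
console.log(validateBatch(goodBatch).length); // 1 (duplicate batch)
```

Collecting every problem instead of stopping at the first one gives the supplier a full report for the batch in a single pass.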

Now that we've discussed the layers of a modern application stack, it should be clear that data quality isn't something you “check once” at the UI.

It must be enforced repeatedly, at multiple depths of the system. Each layer catches a different class of defects, and together they form a defensive wall that prevents bad data from ever reaching storage, analytics, or downstream consumers.

Testing Strategies to Protect Data Quality

To wrap up, here are the three foundational testing strategies every developer should apply to protect data quality.

Unit Testing

Unit tests are the first line of defense in data quality. In this context, a “unit” refers to a single column, a single transformation, or a single validation rule.

The purpose is straightforward: verify that the smallest building blocks of your data logic behave exactly as intended. This matters because if these low‑level rules are not tested and validated, incorrect or inconsistent data will flow into the database and contaminate everything built on top of it.

By isolating each rule or transformation, you can guarantee that schema constraints, field‑level assumptions, and low‑level logic remain correct before data ever flows into larger pipelines or business processes.

Typical questions answered at this layer include:

Does this column allow nulls?
Does this regex correctly strip whitespace from email strings?
Does this transformation produce the expected output for a single row?

This is where you can verify that the data contract is sound. If a column must be non‑null, unique, or follow a specific pattern, the unit test enforces it. When these rules fail here, they fail cheaply – before they can corrupt a table or mislead a dashboard.
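As a small illustration of the whitespace question above, here's a sketch that checks one rule against a handful of inputs. The function stripWhitespace and the sample cases are assumptions. In a real project these cases would usually live in a test runner such as Jest or Vitest instead of a hand-written loop.

```javascript
// Hypothetical rule under test: remove all whitespace from an email string.
function stripWhitespace(raw) {
  return raw.replace(/\s+/g, "");
}

// One rule, a few inputs, one expectation each.
const cases = [
  { input: "  ada@example.com", expected: "ada@example.com" },
  { input: "ada@example.com\n", expected: "ada@example.com" },
  { input: "ada @example.com", expected: "ada@example.com" },
  { input: "ada@example.com", expected: "ada@example.com" },
];

// Returns a description of every case that did not match its expectation.
function runCases() {
  const failures = [];
  for (const { input, expected } of cases) {
    const actual = stripWhitespace(input);
    if (actual !== expected) {
      failures.push(`${JSON.stringify(input)} produced ${JSON.stringify(actual)}`);
    }
  }
  return failures;
}

console.log(runCases().length); // 0
```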

To make this concrete, here’s what a unit test looks like in a real codebase. Even though this example comes from Laravel, the testing principle is identical to data‑quality unit tests: one rule, one expectation, isolated from everything else.

Example: Testing a Discount Calculation Rule

Imagine your e‑commerce shop has this rule:

If a product costs more than £100, apply a 10% discount.
Otherwise, apply no discount.

Let's say this is your discount logic:

<?php

class DiscountService
{
    public function calculate($price)
    {
        if ($price > 100) {
            return $price * 0.10; // 10% discount
        }

        return 0;
    }
}

The unit test for this logic will be:

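<?php

use PHPUnit\Framework\TestCase;

class DiscountServiceTest extends TestCase
{
    /** @test */
    public function it_applies_a_10_percent_discount_when_price_is_above_100()
    {
        $service = new DiscountService();

        $discount = $service->calculate(200);

        $this->assertEquals(20, $discount);
    }

    /** @test */
    public function it_applies_no_discount_when_price_is_100_or_below()
    {
        $service = new DiscountService();

        $discount = $service->calculate(100);

        $this->assertEquals(0, $discount);
    }
}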

The DiscountService contains a simple rule: if a price is greater than 100, a 10% discount is applied. Otherwise, no discount is applied. The unit test verifies this rule in isolation, without involving controllers, databases, or HTTP requests. By testing the service directly, the developer ensures that the core calculation behaves exactly as intended.

The first test checks the positive case — a price of 200 should produce a discount of 20. The second test checks the boundary condition — a price of 100 should produce no discount. Together, these tests confirm both sides of the rule and protect against regressions if the logic changes in the future.

Since this is a Laravel example, it's worth noting that Laravel tests let you verify both your logic (unit tests) and your full application behaviour (feature tests). You can run them with php artisan test, which runs the suite under Laravel's testing environment configuration, so your tests can use a separate test database and leave your real data unaffected.

Integration Testing: The Flow & Lineage Check

While unit tests validate the correctness of individual rules, integration tests validate the movement of data across components. Integration testing verifies that multiple layers work together as a single data flow.

In the example below, the controller receives an order, calls the discount service, applies the transformation, and persists the result to the database. That interaction across layers is what elevates this from a unit test to an integration test. This is where you test the real-world flow:

Controller → Service → Repository → MySQL
Check if MySQL migrations run correctly
Check foreign keys enforce relationships
Check to ensure services interact with the database as expected
Check to ensure models and repositories behave consistently

Integration tests reveal issues that only appear when components interact: incorrect joins, broken migrations, mismatched field names, or subtle type mismatches that unit tests cannot detect.

This is the layer where you catch the bugs that would otherwise silently corrupt data lineage.

Here's an example:

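<?php

use App\Models\Order;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class ApplyDiscountTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function it_applies_the_discount_and_saves_the_totals()
    {
        $order = Order::factory()->create(['subtotal' => 150]);

        $response = $this->postJson("/orders/{$order->id}/apply-discount");

        $response->assertStatus(200);

        $this->assertDatabaseHas('orders', [
            'id' => $order->id,
            'grand_total' => 135, // 150 - 10% discount
            'discount_total' => 15
        ]);
    }
}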

This represents a full flow rather than a single rule:

Controller → Service
Service → Calculation
Controller → Database write
Database → Final state

This test begins by creating an order using an Eloquent factory. It immediately steps beyond the boundaries of a unit test, since it interacts with the database and relies on Laravel’s model layer to persist real data.

From there, the test sends an actual HTTP POST request to the /orders/{id}/apply-discount endpoint, which means it's not calling a method directly, but instead it's traveling through Laravel’s routing layer, invoking the controller responsible for handling the request, and triggering whatever business logic is responsible for calculating and applying the discount.

This movement through multiple layers (routing, controller, service logic, and model persistence) is precisely what defines integration testing: the goal is to verify that these components work together correctly as a system.

Once the request is processed, the test asserts that the response returns a successful status code, which confirms that the HTTP layer behaved as expected.

But the most important part comes afterward, when the test checks the database to ensure that the correct grand_total and discount_total were saved. This final assertion proves that the discount logic was executed, the model was updated, and the changes were successfully written to the database.

In other words, the test isn't merely checking whether a calculation is correct. It's also checking whether the entire pipeline – from receiving the request to updating the database – functions as a coherent whole.

Functional Testing: The Business Rule Check

Functional tests validate the entire user experience, from the moment a request enters the system to the moment a response is returned. This includes:

HTTP requests
Controller logic
Validation rules
Service operations
Database writes
Redirects or rendered views

This is where you test the business rules that govern real‑world behaviour:

“A student can't register for two exams at the same time.”

“A cart can't have negative quantities.”

“A user can't update their profile without a valid email.”

Functional tests ensure that the system behaves correctly from the perspective of the user and the business, not just the code.

Here's an example of a functional test:

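<?php

use App\Models\Product;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class CartUpdateTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function it_rejects_negative_quantities_and_keeps_the_cart_unchanged()
    {
        // Arrange: create a real product
        $product = Product::factory()->create(['price' => 40]);

        // Simulate existing cart
        $this->withSession([
            'cart' => [
                $product->id => ['quantity' => 2]
            ]
        ]);

        // Act: user tries to update quantity to a negative number
        $response = $this->post('/cart/update', [
            'product_id' => $product->id,
            'quantity' => -5
        ]);

        // Assert: system rejects invalid business behaviour
        $response->assertStatus(302); // redirect back with errors
        $response->assertSessionHasErrors(['quantity']);

        // Assert: cart remains unchanged (business rule preserved)
        $this->assertEquals(2, session('cart')[$product->id]['quantity']);
    }
}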

The test begins by creating a realistic environment in which a user interacts with a shopping cart. This is essential for understanding the behaviour the system is meant to enforce.

First, it generates a real product in the database using a factory, giving the product a price so that it resembles an item a customer might genuinely add to their cart.

Once the product exists, the test manually seeds the session with a cart containing that product and a quantity of two. This simulates a user who has already added the item to their cart in a previous interaction, and it establishes the baseline state the system must preserve if the user attempts an invalid update.

With the environment prepared, the test then imitates a user action by sending a POST request to the /cart/update endpoint. Instead of calling a method directly, it uses Laravel’s HTTP layer to reproduce the exact behaviour of a browser submitting a form. The request includes the product ID and a deliberately invalid quantity of negative five.

This is the heart of the scenario: the user is attempting something that violates the business rules of the application, and the test is designed to confirm that the system responds appropriately.

Now, when the request is processed, the test expects the application to reject the input, redirect the user back, and attach validation errors to the session. The assertion that the response has a 302 status code and contains validation errors confirms that the validation layer is functioning correctly and that the controller is enforcing the rule that quantities can't be negative.

The final part of the test is where the business rule is truly verified. After the failed update attempt, the test inspects the session to ensure that the cart remains unchanged. This is crucial because rejecting invalid input is only half of the requirement: the system must also protect the integrity of the existing cart data.

Functional tests answer questions like:

Does the system prevent invalid real‑world behaviour?
Does the user get the correct feedback?
Does the data remain consistent after the request?
Does the final output match the business expectation?

Conclusion

Data quality is never the result of a single check or a single team. It emerges from a disciplined, layered approach where each testing level catches a different category of defects.

Unit tests safeguard the smallest rules, integration tests validate the flow of data across components, and functional tests enforce the business logic that governs real‑world behaviour.

When these layers operate together, bad data has nowhere to hide. When they don’t, even a minor oversight can slip through the cracks and escalate into a costly downstream failure.

So as you can see, your role in data quality is fundamentally proactive, not reactive. By designing systems with validation, integrity, and monitoring in mind, you ensure that data flowing through the pipeline is accurate, timely, complete, unique, and fit for purpose – supporting reliable analytics, reporting, and intelligent systems.

Software Testing with Playwright

Beau Carnes — Thu, 19 Mar 2026 18:12:22 +0000

Testing is the unsung hero of software development because shipping features is only half the battle.

We just published a comprehensive course on the freeCodeCamp.org YouTube channel that will teach you all about why and how to test software.

You will learn about the foundational Testing Pyramid and how to balance fast unit tests with complex end-to-end journeys. And you will learn how to use Playwright to test an e-commerce application. The course also explores the future of the industry by showcasing KaneAI, an AI-powered agent that allows you to author stable, auto-healing tests using plain English instructions.

This course will give you the practical skills to automate your workflow and ensure your code remains production-ready.

Here are the sections in this course:

Course Introduction and Overview
Why Software Testing Matters
Case Studies: Knight Capital & Therac-25
The Boeing 737 Max & The Cost of Everyday Bugs
Testing as "Insurance" for Your Code
The Testing Pyramid: Unit, Integration, & E2E
Test-Driven Development (TDD) Explained
Hands-on: Setting Up the TechMart Sample App
Playwright Framework Installation & Setup
Understanding Playwright Test Structure & Assertions
Writing a Search Functionality Test from Scratch
Strategic Locators: Finding Elements Effectively
Testing Complex Shopping Cart Logic
Login Forms, Validations, & Error Handling
Full End-to-End Checkout Flow Walkthrough
Direct API Testing with Playwright
Debugging Tests in Headed and UI Interactive Modes
Testing Edge Cases and Security (XSS) Vulnerabilities
Mocking API Responses and Simulating Slow Networks
Accessibility Testing for Screen Readers & Keyboards
Challenges: Learning Curves and Maintenance Burdens
Introduction to AI-Powered Software Testing
Hands-on with KaneAI: Authoring Tests in Plain English
Natural Language Code Generation & Auto-Healing Tests
Executing API Tests Using AI Agents
Professional Best Practices: CI/CD & Page Objects
Final Takeaways: When to Use Manual vs. AI Tools

Watch the full course on the freeCodeCamp.org YouTube channel (1-hour watch).

How to Test a Complex Full-Stack App: Manual Approach vs AI-Assisted Testing

Ajay Yadav — Mon, 16 Mar 2026 17:53:09 +0000

A few days ago, I ran an experiment with an AI-powered testing agent that lets you write test cases in plain English instead of code. I opened its natural language interface and typed four simple sentences to test google.com:

1. Go to google.com
2. There should be a long input field on the page
3. Type something and verify suggestions appear in a dropdown
4. The input field should not have any placeholder text

A real browser opened Google, found the search bar, typed a query, checked for the autocomplete dropdown, and verified there was no placeholder, all from those four lines.

No Playwright selectors. No page.getByRole(). No CSS class names. Just plain English describing what a user would do.

That made me curious: what happens if I try this on something actually complex? So I tested my own full-stack app's auth endpoint the same way:

Send a GET request to /api/auth/status without any session cookie. Verify it returns 401.

Within 15 seconds, done.

The same test took me an hour to set up manually, building a session helper, separating my Express app from the server startup, seeding a test database, just so I could write five lines of Supertest code.

I ended up testing my entire application both ways: the traditional manual approach and the AI-assisted approach. Same endpoints, same assertions, completely different experience. This article is about what I learned.

But before I get into how I tested it, let's talk about what actually matters: the testing concepts themselves. Because no approach, manual or automated, will save you time or energy if you don't understand what you're testing and why.

What we'll cover:

Prerequisites
How Testing Actually Works in Full-Stack Apps
What Made This Hard
The Manual Approach
The AI-Assisted Approach
When to Use Which Approach
Conclusion

Prerequisites

To get the most out of this article, you should have a basic understanding of JavaScript and Node.js, along with some familiarity with React and Express.

Experience writing simple tests with any JavaScript testing framework like Jest or Vitest will be helpful, though I'll explain the core testing concepts as we go.

You should also have Node.js installed on your machine. If you want to follow along with the manual testing examples, you'll need Vitest (or Jest) for unit and API tests, Supertest for HTTP endpoint testing, and Playwright for end-to-end browser tests. For the AI-assisted approach, I used KaneAI by LambdaTest, which you can explore through their platform.

How Testing Actually Works in Full-Stack Apps

If you've only tested isolated React components or written a few unit tests for utility functions, full-stack testing feels like a different sport. The concepts are the same, but the complexity jumps dramatically. Here's what you actually need to know.

Three Layers, Three Different Jobs

Every full-stack application has three natural testing layers, and trying to cover everything with just one of them leads to either fragile tests or blind spots.

Unit Tests

Unit tests check that individual functions return the right output for a given input. They don't touch the database, the network, or the browser.

They run in milliseconds. If your function takes a string and returns a formatted slug, a unit test calls that function and checks the result. That's it.

it("converts a title to a slug", () => {
  expect(slugify("My First Post")).toBe("my-first-post");
});

API Tests

API tests check that your backend endpoints return the right responses. They send real HTTP requests to your Express (or Next.js) app and verify the status codes, response shapes, and error handling.

If your /api/auth/status endpoint should return 401 without a session cookie, an API test confirms that contract.

it("returns 401 without session cookie", async () => {
  const res = await request(app).get("/api/auth/status");
  expect(res.status).toBe(401);
});

End-to-end (E2E) Tests

End-to-end (E2E) tests open a real browser and interact with your app the way a user would. They click buttons, fill forms, navigate pages, and check that the right things appear on screen.

If your login flow should redirect to a dashboard after authentication, an E2E test walks through that entire journey.

test("login redirects to dashboard", async ({ page }) => {
  await page.goto("/");
  await page.getByTestId("username-input").fill("ajay");
  await page.getByTestId("password-input").fill("password123");
  await page.getByTestId("login-button").click();
  await expect(page.getByTestId("dashboard")).toBeVisible();
});

The Pain Points Nobody Warns You About

Tutorials make all three layers look straightforward. In practice, each one has a trap.

First, we have the session cookie problem. Most real apps have authentication. To test any authenticated endpoint, you need a valid session.

That means you need a helper function that logs in a test user, extracts the session cookie from the Set-Cookie header, and returns it for future requests.

This sounds simple. It took me an hour to build one that actually works with express-session. Every project reinvents this wheel.
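The cookie-extraction half of that helper can be sketched in a few lines. This is a minimal illustration, not the helper from the project: extractSessionCookie is a hypothetical name, and connect.sid is assumed because it is the default cookie name in express-session.

```javascript
// Hypothetical helper: pull one cookie out of a Set-Cookie header (string or array).
// "connect.sid" is express-session's default cookie name; change it if your app overrides it.
function extractSessionCookie(setCookieHeaders, cookieName = "connect.sid") {
  const headers = Array.isArray(setCookieHeaders)
    ? setCookieHeaders
    : setCookieHeaders
      ? [setCookieHeaders]
      : [];

  for (const header of headers) {
    // Only the first "name=value" pair is the cookie itself.
    // Everything after the first ";" is an attribute (Path, HttpOnly, ...).
    const pair = header.split(";")[0].trim();
    if (pair.startsWith(cookieName + "=")) {
      return pair;
    }
  }
  return null; // no session cookie was set
}

// Made-up header values for illustration.
const example = extractSessionCookie([
  "theme=dark; Path=/",
  "connect.sid=s%3Aabc123.signature; Path=/; HttpOnly",
]);
console.log(example); // connect.sid=s%3Aabc123.signature
```

The returned string is what a later request would pass in its Cookie header.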

Then we have the app vs. server separation issue. Supertest (the most popular API testing library) needs to import your Express app without starting a real server.

If your app.ts file has app.listen(3000) at the bottom, every test file that imports it will try to bind to port 3000, and your tests will crash with a port conflict when running in parallel.

You have to separate your app definition from the server startup. app.ts exports the Express instance, server.ts calls .listen(). It's a three-minute refactor, but nobody tells you about it until your tests fail.

You also have the SSE and real-time nightmare. If your app uses Server-Sent Events (SSE) or WebSockets, you're testing time-dependent behavior.

You open a connection, trigger an action, and wait for an event to arrive. If the event takes too long, your test times out. If you don't set a timeout, the test hangs forever. You end up writing 30 lines of Promise wrappers, timeout handlers, and cleanup logic for a single assertion.
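The timeout part of that wrapper is the piece most worth isolating. Here is a rough sketch under stated assumptions: withTimeout and fakeEvent are hypothetical names, and fakeEvent stands in for "wait for the next SSE event" so the example runs without a server.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, onTimeout = () => {}) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. close the event stream so nothing is left hanging
      reject(new Error(`Timed out after ${ms}ms`));
    }, ms);
  });
  // Whichever settles first wins. Always clear the timer afterwards
  // so it cannot keep the test process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for an SSE event arriving after `delayMs`.
function fakeEvent(delayMs, payload) {
  return new Promise((resolve) => setTimeout(() => resolve(payload), delayMs));
}

// Event arrives in time: the promise resolves with the payload.
withTimeout(fakeEvent(10, { type: "repo-changed" }), 500)
  .then((event) => console.log("received", event.type));

// Event arrives too late: the promise rejects instead of hanging.
withTimeout(fakeEvent(200, { type: "late" }), 20)
  .catch((err) => console.log(err.message));
```

With a helper like this, the test body shrinks to one awaited call plus the assertion, and the cleanup lives in one place.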

Finally, there's the selector fragility trap. E2E tests that use CSS selectors (.btn-primary, .card-title) break every time you rename a class.

The fix is using data-testid attributes, stable identifiers that exist solely for testing and don't change during refactors. But retrofitting them into an existing app means touching dozens of components.

Schema Validation: The Hidden Time Sink

Here's something nobody tells you about API testing. Writing the assertion for "does this endpoint return 200" takes one line.

Writing assertions that verify the shape of the response, every field exists, every field has the right type, every enum value is valid, takes 15 to 20 lines per endpoint. Multiply that across a dozen endpoints and you're spending hours writing boilerplate like:

expect(res.body[0]).toHaveProperty("title");
expect(typeof res.body[0].title).toBe("string");
expect(res.body[0]).toHaveProperty("status");
expect(["open", "closed", "merged"]).toContain(res.body[0].status);

It's important work, though: schema validation catches real bugs when your backend changes a response shape. But the repetitiveness is what makes it a good candidate for automation, which I'll get to later.
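One way to cut the repetition by hand is a small shape checker. The sketch below is only an illustration: findShapeErrors and its schema format are invented for this example, and the field names come from the pull request response described in this article.

```javascript
// Hypothetical schema format: field name -> a typeof string, "array",
// or a list of allowed values (treated as an enum).
function findShapeErrors(item, schema) {
  const errors = [];
  for (const [field, rule] of Object.entries(schema)) {
    if (!(field in item)) {
      errors.push(`${field}: missing`);
      continue;
    }
    const value = item[field];
    if (Array.isArray(rule)) {
      if (!rule.includes(value)) {
        errors.push(`${field}: unexpected value ${JSON.stringify(value)}`);
      }
    } else if (rule === "array") {
      if (!Array.isArray(value)) errors.push(`${field}: expected array`);
    } else if (typeof value !== rule) {
      errors.push(`${field}: expected ${rule}, got ${typeof value}`);
    }
  }
  return errors;
}

const pullRequestSchema = {
  title: "string",
  author: "string",
  url: "string",
  status: ["open", "closed", "merged"],
};

// Made-up sample items.
const good = { title: "Fix login", author: "ajay", url: "https://example.com/pr/1", status: "open" };
const bad = { title: 42, author: "ajay", status: "draft" };

console.log(findShapeErrors(good, pullRequestSchema)); // []
console.log(findShapeErrors(bad, pullRequestSchema));  // three errors: title, url, status
```

In a test, one assertion that the error list is empty replaces the per-field expect calls for each endpoint.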

These aren't edge cases. These are the everyday realities of testing a full-stack app. Knowing them upfront saves you from the "why is this so much harder than the tutorial??" frustration.

What Made This Hard

A few months ago, I wrote a freeCodeCamp article about testing JavaScript apps from unit tests to AI-augmented QA. That article covered testing fundamentals with clean, simple examples.

After publishing it, I kept thinking: what happens when you apply all of this to something messy?

I had the perfect candidate. Creoper (code name) is an AI-powered project management tool I built that connects GitHub with Discord.

Teams can monitor repositories, track pull requests, and query project status using natural language, all without leaving their chat platform.

I built it across two internal hackathons at CreoWis, and it won both times. What started as a simple GitHub-Discord automation bot evolved into a full product with five interconnected components:

It has a React dashboard with GitHub OAuth. An Express backend with REST APIs and SSE. A Discord bot that processes natural language through an LLM intent detection layer. PostgreSQL with Prisma. GitHub webhook handlers.

But here's the thing: despite winning two hackathons, Creoper had zero test cases. The app wasn't even deployed yet. I'd been stuck on Railway monorepo deployment issues for weeks.

So I was staring at a system that had every real-world testing challenge I'd just written about, auth flows, real-time events, multiple integration points, complex business logic, and no safety net at all.

I decided to test it two different ways and document what actually happened. If you want to explore the full project, I've written two separate blogs about how I built it.

The Manual Approach

I mapped pure logic components like the intent parser and embed builder to unit tests, since they deal with straightforward input-output behavior. I assigned Express endpoints to API tests using Supertest, which let me send real HTTP requests and verify response codes and shapes.

I planned to cover the React dashboard with end-to-end tests using Playwright, simulating actual user interactions in a real browser. As for Discord bot interactions and webhook delivery, those couldn't be automated reliably yet, so I documented them and tested them manually.

Here's what each layer looked like in practice.

Unit Tests: The Easy Win

Creoper has a function that classifies Discord messages into structured intents. If someone types "list prs," it should return LIST_PRS with a high confidence score.

If the message is gibberish, it should return UNKNOWN with zero confidence. The confidence score matters because anything below a threshold triggers a safe fallback instead of executing an action.

it("detects LIST_PRS intent", () => {
  const result = parseIntent("list prs");
  expect(result.action).toBe("LIST_PRS");
  expect(result.confidence).toBeGreaterThan(0.8);
});

it("returns low confidence when repo name is missing", () => {
  const result = parseIntent("set active repo");
  expect(result.confidence).toBeLessThan(0.8);
});

Notice these aren't just "does it work" checks. They're testing a safety mechanism, the threshold between executing an action and falling back.

These are exactly the kinds of tests that need to be written by hand because you have to understand the business logic behind the numbers.
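To make the mechanism concrete, here is a sketch of what such a threshold gate might look like. It is not Creoper's actual code: decideAction, the reply text, and the SET_ACTIVE_REPO action name are invented, and the 0.8 threshold is assumed from the tests above.

```javascript
// Assumed threshold, taken from the 0.8 used in the tests above.
const CONFIDENCE_THRESHOLD = 0.8;

// Hypothetical gate: only execute when the parser is confident enough.
function decideAction(intent, threshold = CONFIDENCE_THRESHOLD) {
  if (intent.action === "UNKNOWN" || intent.confidence < threshold) {
    // Safe fallback: ask the user instead of acting on a guess.
    return { execute: false, reply: "I'm not sure what you meant. Could you rephrase?" };
  }
  return { execute: true, action: intent.action };
}

console.log(decideAction({ action: "LIST_PRS", confidence: 0.93 }));
console.log(decideAction({ action: "SET_ACTIVE_REPO", confidence: 0.4 }).execute); // false
```

The unit tests pin down which side of the threshold each message lands on, which is exactly the part that needs a human decision.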

I also tested the Discord embed builder the same way. Give it push event data, check that the formatted message contains the right repo name, author, branch, and commit messages.

Pure input, pure output, no external dependencies. Unit tests ran in milliseconds and caught edge cases like empty commit arrays immediately.

API Tests: Where the Friction Starts

Testing the Express endpoints required the infrastructure work I described earlier. I separated app.ts from server.ts, built the createTestSession() helper, and set up an in-memory test database so tests wouldn't touch real data.

it("returns 401 without session cookie", async () => {
  const res = await request(app).get("/api/auth/status");
  expect(res.status).toBe(401);
  expect(res.body).toHaveProperty("error");
});

it("returns user data with valid session", async () => {
  const cookie = await createTestSession();
  const res = await request(app)
    .get("/api/auth/status")
    .set("Cookie", cookie);
  expect(res.status).toBe(200);
  expect(res.body).toHaveProperty("username");
  expect(res.body).not.toHaveProperty("accessToken");
});

Five lines of test code, one hour of infrastructure to make those five lines work.

Then I had to repeat this pattern across every endpoint: repos, pull requests, issues, active repo configuration, each with happy path, error cases, and the tedious schema validation I mentioned earlier.

The SSE test was the worst. I needed a Promise wrapper, an EventSource connection, a timeout handler, an onopen callback to trigger the change, an event listener to catch the response, and cleanup for both the connection and the server. About 30 lines for a single assertion, and it took three attempts to get the timing right.

E2E Tests: The Full Journey

Playwright's E2E tests were actually pleasant to write once I added data-testid attributes to the React components. The login flow, note creation, editing, and deletion all followed a predictable pattern.

test("login shows the username", async ({ page }) => {
  await page.goto("/");
  await page.getByTestId("username-input").fill("ajay");
  await page.getByTestId("password-input").fill("password123");
  await page.getByTestId("login-button").click();
  await expect(page.getByTestId("username-display")).toContainText("ajay");
});

The real cost wasn't writing the tests — it was maintaining them. Midway through development, I renamed a CSS class from .repo-list-item to .repository-card. Two Playwright tests broke immediately. I found the references, updated them, re-ran. Ten minutes for a CSS rename. I can see this becoming death-by-a-thousand-cuts as the UI evolves.

The AI-Assisted Approach

Now here's the same project, tested with a fundamentally different workflow.

Instead of writing test code, you describe what you want to test in natural language. An AI agent interprets your intent, interacts with the actual application, generates assertions, and produces exportable test code.

The tool I used is KaneAI, a GenAI-native testing agent that covers web UIs, APIs, and mobile apps through natural language test authoring with real browser execution. That's the only background you need. Let me show you the workflow.

API Testing: Describing Instead of Coding

Instead of writing Supertest code, I opened the slash command menu, selected API, and pasted a curl command:

curl -X GET http://localhost:3000/api/auth/status

It fired the request through the tunnel, showed the 401 response, and I added it to my test steps. For the authenticated version, I pasted the same command with a session cookie from DevTools. No createTestSession() helper. No test database. No app separation.

For the repository endpoints, I described the flow in plain English:

1. Set active repository to "atechajay/no-javascript" via POST to /api/repos/active
2. Verify the response confirms the repository is active
3. Fetch open pull requests via GET to /api/repos/pulls
4. Verify each item has title, author, url, and status fields
5. Try an invalid repository name, verify 400 error

It generated assertions for the happy path and added schema validation I didn't ask for, checking that title is a string, labels is an array, and status is one of the expected values. That's the tedious work that ate up hours in the manual approach, generated in seconds.

E2E Testing: Plain English, Real Browser

For the React dashboard, instead of Playwright selectors, I described:

1. Navigate to localhost:3001
2. Click "Go to Dashboard"
3. Verify redirect to GitHub OAuth
4. After auth, verify the dashboard loads
5. Verify the username appears in the sidebar

It executed each step in a real cloud browser connected to my localhost. No page.getByRole(), no page.waitForURL(), no selector debugging.

After each test, I exported the generated code. It came with wait conditions and assertion logic baked in.

It wasn't perfect copy-paste: I updated environment variables, adjusted base URLs, and fixed a few field name mismatches where it expected pullRequestUrl instead of my actual url field. But it gave me roughly 70–80% of the foundation.

The Feature That Surprised Me

Midway through testing, I renamed that CSS class from .repo-list-item to .repository-card. My manual Playwright tests broke immediately.

But the AI tool's auto-healing detected the selector change, found the closest matching element based on the test's original intent, and continued the test with a review flag. No code changes needed.

For a rapidly changing MVP where class names are still in flux, that alone saved significant maintenance time.

When to Use Which Approach

After testing the same project both ways, here's my honest take.

Write tests by hand when you're testing business logic that requires domain understanding. For Creoper's intent parser, I needed to think about what "low confidence" means in the context of the application's safety mechanism.

An AI tool can generate assertions, but it can't understand why a confidence score of 0.5 should trigger a fallback instead of an action. Pure logic with meaningful edge cases is where hand-written tests earn their keep.

You should also write tests by hand when they need to run in CI without external dependencies. Vitest tests with mocked dependencies are self-contained. They run in milliseconds and don't need a tunnel, a cloud browser, or a third-party account.

Hand-written tests are also best when the team needs to maintain them. Hand-written tests are transparent. Generated code, even when exported, can feel opaque to someone who wasn't there when it was authored.

Reach for AI-assisted testing, on the other hand, when your UI changes frequently. For an MVP where CSS classes and component structure are still in flux, auto-healing prevents the "my tests broke because I renamed a div" problem. You spend less time fixing selectors and more time shipping features.

AI-assisted testing is also helpful when you need coverage fast and plan to refine later. The 70–80% foundation is a real boost when you're the only developer and you need coverage now. You can always hand-tune the exported code later.

Never rely solely on either approach to understand your system. No tool knows that an SSE connection drops after 30 seconds if the heartbeat isn't configured. No tool understands that a Discord bot should never execute a write action when confidence is below 0.8. No tool realizes the OAuth callback silently fails if the redirect_uri doesn't match precisely.

The strategy relies on you knowing which endpoints are crucial, identifying dangerous edge cases, and understanding what should occur during failures. The tool simply accelerates how quickly you can articulate and implement that strategy.

Conclusion

My full-stack app won two hackathons. But without tests, it was a house of cards. One renamed CSS class, one changed API response, and the whole system could silently break.

Testing it both ways taught me that the manual vs AI question is the wrong question. The real skill is matching the approach to the problem.

Write unit tests by hand for business logic. Use AI-assisted testing when you're drowning in repetitive schema validation across a dozen endpoints.

Use auto-healing for E2E tests on a fast-changing UI. And for the things you can't automate yet, like Discord bot interactions or webhook delivery, document them and test them manually until you can.

If you're building something complex and thinking "I'll add tests after I deploy", flip that. Test what you can now. Document what you can't. When deployment day comes, you'll ship with confidence instead of anxiety.

Before We End

I hope you found this article insightful. I’m Ajay Yadav, a software developer and content creator.

You can connect with me on:

Twitter/X and LinkedIn, where I share insights to help you improve 0.01% each day.
Check out my GitHub for more projects.
Check out my Medium page for more blogs.
I also run a YouTube Channel where I share content about careers, software engineering, and technical writing.

See you in the next article — until then, keep learning!

How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster

Osomudeya Zudonu — Fri, 06 Mar 2026 14:43:26 +0000

I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, an OOMKill from kubectl describe, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.

You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.

What is KubeLab?
Prerequisites
How to Get the Lab Running
Simulation 1 — Kill Random Pod
Simulation 2 — Drain a Worker Node
Simulation 3 — CPU Stress and Throttling
Simulation 4 — Memory Stress and OOMKill
Simulation 5 — Database Failure
Simulation 6 — Cascading Pod Failure
Simulation 7 — Readiness Probe Failure
How to Read the Signals in Grafana
How to Use This for Production Debugging

What is KubeLab?

KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.

Simulation	What it teaches
Kill Random Pod	ReplicaSet self-healing, pod immutability
Drain Worker Node	Zero-downtime maintenance, PodDisruptionBudgets
CPU Stress	Throttling vs crashing, invisible latency
Memory Stress	OOMKill, exit code 137, silent restart loops
Database Failure	StatefulSets, PVC persistence
Cascading Pod Failure	Why replicas: 2 isn't enough
Readiness Probe Failure	Liveness vs readiness, traffic control

Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.

Prerequisites

You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.

Hardware: 8GB RAM minimum, 16GB recommended. The lab can run on Mac, Linux, or Windows with WSL2. You'll need to install three tools. Multipass spins up Ubuntu VMs for the cluster. kubectl is the Kubernetes CLI you will use for every simulation. Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at setup/docker-compose-preview.md: the full UI with mock data, no real cluster needed.

How to Get the Lab Running

Full cluster setup lives at setup/k8s-cluster-setup.md in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:

kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running

Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:

# Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000

Grafana login: admin / kubelab-grafana-2026.

Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.

Simulation 1: Kill Random Pod

This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually after a user alert at 3am.

Before you click: Run kubectl get pods -n kubelab -w. Watch for a pod to go Terminating then a new one to appear.

kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement

What happened: The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the dying pod from the Service before SIGTERM fired, so zero traffic hit a dying pod.

The production trap: A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.

The fix: Set replicas: 2, add a readiness probe, and set terminationGracePeriodSeconds to match your longest request timeout.

Simulation 2: Drain a Worker Node

This simulation cordons a worker node, then evicts all its pods to the remaining node.

To "cordon" a worker node means to mark it as unschedulable. When you run kubectl cordon <node-name>, the Kubernetes control plane adds the node.kubernetes.io/unschedulable:NoSchedule taint to the node. (A taint is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does not affect the pods that are already running there.

Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.

Without Kubernetes, you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back. The window of downtime is unpredictable.

Before you click: Run kubectl get pods -n kubelab -o wide -w. Watch which node each pod runs on.

kubectl get pods -n kubelab -o wide -w

NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled

In kubectl get nodes the node shows Ready,SchedulingDisabled until you run kubectl uncordon.

What happened: The node spec got spec.unschedulable=true. The Eviction API ran per pod. That path goes through PodDisruptionBudget policy checks before proceeding, unlike a raw delete. A raw kubectl delete pod bypasses this check entirely — which is why draining with kubectl drain is always safer than deleting pods manually during maintenance.

The production trap: Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods evict at once. Complete downtime despite replicas: 2.

The fix: Use pod anti-affinity with topologyKey: kubernetes.io/hostname and a PodDisruptionBudget with minAvailable: 1.

Simulation 3: CPU Stress and Throttling

This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.

Before you click: Run watch -n 2 kubectl top pods -n kubelab and open the Grafana CPU Usage panel.

kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m

What happened: The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.
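The 20ms and 80ms figures follow directly from the limit. The sketch below reproduces the arithmetic, assuming the default 100ms CFS period and a single busy thread (with several busy threads the quota is used up sooner). The function name cfsQuota is made up for this example.

```javascript
// Default CFS period on Linux (cpu.cfs_period_us = 100000).
const CFS_PERIOD_MS = 100;

// Hypothetical helper: turn a CPU limit in millicores into per-period run time.
function cfsQuota(limitMillicores, periodMs = CFS_PERIOD_MS) {
  // 1000m means one full CPU, i.e. the whole period.
  const runnableMs = (limitMillicores * periodMs) / 1000;
  const throttledMs = Math.max(0, periodMs - runnableMs);
  return {
    runnableMs,
    throttledMs,
    throttledPct: (throttledMs * 100) / periodMs,
  };
}

console.log(cfsQuota(200)); // { runnableMs: 20, throttledMs: 80, throttledPct: 80 }
console.log(cfsQuota(500)); // { runnableMs: 50, throttledMs: 50, throttledPct: 50 }
```

Raising the limit from 200m to 500m in this model cuts the frozen share of each period from 80% to 50% for the same CPU-bound work.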

The production trap: kubectl top shows the pod using 95-150m, which looks normal. The metric shows usage at the ceiling, not the throttle rate. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.

The fix: For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]).

Simulation 4: Memory Stress and OOMKill

This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.

Before you click: Run kubectl get pods -n kubelab -l app=backend -w and open the Grafana Memory Usage panel.

kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown

What happened: The container's memory usage crossed its 256Mi cgroup limit. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.

The production trap: The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."

The fix: Alert on increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) > 3 before users notice.
The Prometheus expression means: look at how many times containers in the kubelab namespace have restarted over the last hour, and fire an alert if that count exceeds 3. (increase() returns the growth over the window. rate() returns a per-second average, so a threshold of 3 on rate() would mean 3 restarts per second, not per hour.) A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.
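Because Prometheus's rate() reports a per-second average, it helps to see how a per-hour restart budget maps onto per-second numbers. The sketch below is plain arithmetic with made-up function names. It does not query Prometheus.

```javascript
const SECONDS_PER_HOUR = 3600;

// Per-second rate that corresponds to N restarts per hour.
function restartsPerHourToPerSecond(restartsPerHour) {
  return restartsPerHour / SECONDS_PER_HOUR;
}

// Hypothetical check: did restarts in the observed window exceed the hourly budget?
function exceedsRestartBudget(restartsInWindow, windowSeconds, maxPerHour) {
  // Scale the observed count up to one hour before comparing.
  return (restartsInWindow * SECONDS_PER_HOUR) / windowSeconds > maxPerHour;
}

console.log(restartsPerHourToPerSecond(3).toFixed(5)); // 0.00083
console.log(exceedsRestartBudget(4, 3600, 3)); // true: 4 restarts in the last hour
console.log(exceedsRestartBudget(1, 3600, 3)); // false: 1 restart in the last hour
```

Three restarts per hour is a per-second rate of roughly 0.00083, which is why a threshold written against a per-second value has to be a small fraction.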

Confirm it happened:

kubectl describe pod <pod-name> -n kubelab | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137

To see the last output before the kernel killed the process, run kubectl logs <pod-name> -n kubelab --previous. The log stream stops abruptly with no shutdown message, because SIGKILL leaves no time for cleanup or final logs.

Simulation 5: Database Failure

This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.

Before you click: Run kubectl get pods,pvc -n kubelab. Note that the PVC exists before you start.

kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume

A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new postgres-0 pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.

What happened: The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. postgres-0 always mounts postgres-data-postgres-0. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.

The production trap: Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.

The fix: Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.
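As a rough illustration of the retry half of that fix, here is a generic backoff loop. Everything in it is an assumption for the example: the function names, the default delays, and the fake connect function that fails twice before succeeding. A real app would pass its own database connect call.

```javascript
// Delay before each retry: base * factor^attempt, capped at maxMs.
function backoffDelays(attempts, baseMs = 100, factor = 2, maxMs = 5000) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(maxMs, baseMs * factor ** i));
  }
  return delays;
}

// Hypothetical helper: call `connect` until it succeeds or attempts run out.
async function retryWithBackoff(connect, { attempts = 5, baseMs = 100, factor = 2, maxMs = 5000 } = {}) {
  const delays = backoffDelays(attempts, baseMs, factor, maxMs);
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait before the next attempt; no wait after the final failure.
        await new Promise((resolve) => setTimeout(resolve, delays[i]));
      }
    }
  }
  throw lastError;
}

console.log(backoffDelays(7)); // [100, 200, 400, 800, 1600, 3200, 5000]

// Fake connection that fails twice, standing in for a database that is still starting.
let tries = 0;
retryWithBackoff(() => {
  tries += 1;
  if (tries < 3) throw new Error("connection refused");
  return "connected";
}, { baseMs: 5 }).then((result) => console.log(result, "after", tries, "tries"));
```

Production code usually adds random jitter to the delays so that many pods do not retry in lockstep.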

Simulation 6: Cascading Pod Failure

This simulation deletes both backend replicas at the same time. Without Kubernetes, if everything is down, you'd have to restart every service manually and hope they come up in the right order.

Before you click: Run kubectl get endpoints -n kubelab backend-service -w. Watch the IP list.

kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   <none>   ← every request in this window gets Connection refused

What happened: Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until both passed their readiness probes. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.

The production trap: replicas: 2 protects you from one pod dying at a time, nothing more.
If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.
Check right now with kubectl get pods -n kubelab -o wide | grep backend, and if both pods show the same NODE, you are one node failure away from an outage.
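The same check can be automated once the pod-to-node mapping is available. The sketch below works on a plain array of name and node pairs. That shape and the sample data are assumptions for illustration, not the output format of kubectl.

```javascript
// Hypothetical helper: list every node that hosts more than one of the given pods.
function nodesWithMultipleReplicas(pods) {
  const byNode = new Map();
  for (const pod of pods) {
    const names = byNode.get(pod.node) ?? [];
    names.push(pod.name);
    byNode.set(pod.node, names);
  }
  return [...byNode]
    .filter(([, names]) => names.length > 1)
    .map(([node, names]) => ({ node, pods: names }));
}

// Made-up sample: both backend replicas landed on the same worker.
const risky = [
  { name: "backend-abc123-xk2qp", node: "kubelab-worker-1" },
  { name: "backend-abc123-n7mw3", node: "kubelab-worker-1" },
];
const spread = [
  { name: "backend-abc123-xk2qp", node: "kubelab-worker-1" },
  { name: "backend-abc123-n7mw3", node: "kubelab-worker-2" },
];

console.log(nodesWithMultipleReplicas(risky));  // one entry for kubelab-worker-1
console.log(nodesWithMultipleReplicas(spread)); // []
```

An empty result means every replica sits on its own node, so a single node failure cannot take them all down.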

The fix: Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with minAvailable: 1 to block any voluntary action that would leave zero replicas.

Simulation 7: Readiness Probe Failure

This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no way to take a pod out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.

Before you click: Run kubectl get pods -n kubelab -w in one tab and kubectl get endpoints -n kubelab backend-service -w in another.

# Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic

What happened: /ready returned 503. The kubelet marked the pod Ready=False. The Endpoints controller removed its IP from the Service. The liveness probe (/health) still returned 200, so no restart. After 120 seconds /ready recovered and the pod rejoined. Run kubectl logs <pod-name> -n kubelab -f to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.

The production trap: Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.

The fix: Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.
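One way to picture that split is two separate checks. This is a sketch under stated assumptions, not KubeLab's actual code: appState, readinessStatus, and deepHealthReport are hypothetical names, and the dependency flags would come from whatever checks your app already runs.

```javascript
// Local state the pod itself controls (hypothetical flags for illustration).
const appState = { configLoaded: true, shuttingDown: false };

// Readiness: answers only "can this process serve requests?"
// It deliberately does not look at the database, cache, or other services.
function readinessStatus(state) {
  return state.configLoaded && !state.shuttingDown ? 200 : 503;
}

// Deep health: reports dependency status for dashboards and alerts.
// It is not wired to the readiness probe, so a dependency outage
// does not pull every pod out of rotation.
function deepHealthReport(dependencies) {
  const failing = Object.entries(dependencies)
    .filter(([, healthy]) => !healthy)
    .map(([name]) => name);
  return { status: failing.length === 0 ? 200 : 503, failing };
}

console.log(readinessStatus(appState));
console.log(deepHealthReport({ database: false, cache: true }));
```

With the database down, the deep report flags it while readiness stays at 200, so the pod keeps serving whatever it can.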

4. How to Read the Signals in Grafana

kubectl shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.

The Four Panels that Matter

Pod Restarts: A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.

CPU Usage: A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.

Memory Usage: Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.

HTTP Request Rate: During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.

5. How to Read the Terminal Signals

What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.

The -w flag on kubectl get pods -n kubelab -w streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total — 1/2 means one container is alive but not passing its readiness probe. STATUS shows the pod lifecycle phase: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.

kubectl get events -n kubelab --sort-by=.lastTimestamp is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.

kubectl describe pod -n kubelab is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason — OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.

kubectl get endpoints -n kubelab backend-service shows which pod IPs are actually receiving traffic right now. A pod can show Running in kubectl get pods and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.
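The comparison described above (Running pods versus endpoint IPs) can be sketched as a small function. The input shapes are simplified assumptions: in real kubectl JSON output the pod fields live under status.phase and status.podIP, and the sample names and IPs are made up.

```javascript
// Hypothetical, simplified sample data.
const samplePods = [
  { name: "backend-abc123", phase: "Running", podIP: "10.244.1.5" },
  { name: "backend-def456", phase: "Running", podIP: "10.244.2.7" },
];
// IPs currently listed in the Service's endpoints.
const sampleEndpointIps = ["10.244.1.5"];

// Return the names of pods that are Running but absent from the endpoints.
function runningButNotServing(pods, endpointIps) {
  const serving = new Set(endpointIps);
  return pods
    .filter((pod) => pod.phase === "Running" && !serving.has(pod.podIP))
    .map((pod) => pod.name);
}

console.log(runningButNotServing(samplePods, sampleEndpointIps));
```

Any name this returns is a pod that looks healthy in kubectl get pods but receives no traffic.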

kubectl logs -n kubelab shows the container's stdout and stderr. Use -f to follow the stream. After a pod restarts, use --previous to see the logs from the container that just exited, essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.

A full event sequence during Kill Pod recovery looks like this:

kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10

REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running

The gap between the Killing and Started lines is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line: the scheduler may have struggled to find a node.
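To turn that gap into a number, you can subtract the two timestamps. This sketch assumes events shaped like the items in kubectl get events -n kubelab -o json, using only the reason and lastTimestamp fields. The timestamps are invented for illustration, and recoveryWindowSeconds is a hypothetical helper.

```javascript
// Hypothetical sample events, already sorted by time.
const sampleEvents = [
  { reason: "Killing", lastTimestamp: "2026-01-01T10:00:00Z" },
  { reason: "SuccessfulCreate", lastTimestamp: "2026-01-01T10:00:01Z" },
  { reason: "Scheduled", lastTimestamp: "2026-01-01T10:00:01Z" },
  { reason: "Pulled", lastTimestamp: "2026-01-01T10:00:03Z" },
  { reason: "Started", lastTimestamp: "2026-01-01T10:00:05Z" },
];

// Seconds between the first Killing event and the next Started event.
// Returns null when either event is missing.
function recoveryWindowSeconds(events) {
  const killing = events.find((e) => e.reason === "Killing");
  if (!killing) return null;
  const killedAt = Date.parse(killing.lastTimestamp);
  const started = events.find(
    (e) => e.reason === "Started" && Date.parse(e.lastTimestamp) >= killedAt
  );
  if (!started) return null;
  return (Date.parse(started.lastTimestamp) - killedAt) / 1000;
}

console.log(recoveryWindowSeconds(sampleEvents)); // 5
```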

Two Prometheus Queries Worth Memorizing

First query: silent restart loop. rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) takes the restart counter for containers in that namespace over the last hour and expresses it as a per-second rate. A healthy workload rarely restarts. If this rate is high (for example more than 3 restarts per hour, which is roughly 0.00083 restarts per second), something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds a threshold so you see the pattern before users report errors.

Second query: invisible CPU throttling. rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]) measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in kubectl top often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).

# Silent restart loop: alert when this exceeds 3 per hour (rate() is per second, so about 0.00083)
rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling — alert when this exceeds 25%
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])

Run these against your own cluster. Not just KubeLab. These are production queries.
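Because rate() returns per-second values, the "3 per hour" threshold needs a unit conversion before you compare. The sketch below shows that arithmetic with the two thresholds from the text. The input numbers are hypothetical query results, and shouldAlert is an illustrative helper, not part of Prometheus.

```javascript
const SECONDS_PER_HOUR = 3600;

// rate() over a counter yields events per second; scale it to per hour.
function restartsPerHour(ratePerSecond) {
  return ratePerSecond * SECONDS_PER_HOUR;
}

// Thresholds from the text: 3 restarts per hour, 25% of time throttled.
function shouldAlert({ restartRatePerSecond, throttledSecondsPerSecond }) {
  return {
    restartLoop: restartsPerHour(restartRatePerSecond) > 3,
    cpuThrottling: throttledSecondsPerSecond > 0.25,
  };
}

// Hypothetical query results, for illustration only.
console.log(
  shouldAlert({ restartRatePerSecond: 0.002, throttledSecondsPerSecond: 0.1 })
);
```

Here 0.002 restarts per second is about 7 restarts per hour, so the restart alert fires while the throttling alert does not.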

6. How to Use This for Production Debugging

The repo includes docs/diagnose.md, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.

Exit code 137, pods restarting. Run the Memory Stress simulation. Confirm with kubectl describe pod | grep -A 5 "Last State:" and look for Reason: OOMKilled. Raise limits or find the leak. The simulation shows both.

High latency, pods look healthy, zero restarts. Run the CPU Stress simulation. Check container_cpu_cfs_throttled_seconds_total in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.

503 on some requests, pods show Running. Run the Readiness Probe Failure simulation. Check kubectl get endpoints — one pod IP is missing despite Running. The pod gets zero traffic.

Pods stuck Pending after a node went down. Run the Drain Node simulation. Run kubectl describe pod and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.

Conclusion

You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from kubectl describe, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.

What you practiced here applies to other clusters too: staging or production environments that you can read but not safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 am when something is wrong. KubeLab is the safe place to build that reflex.

The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at docs/interview-prep.md has answers to the 13 most common Kubernetes interview questions. The observability guide at docs/observability.md covers Prometheus and Grafana setup in detail.

If this helped you, star the repo at https://github.com/Osomudeya/kube-lab and share it with someone who is learning Kubernetes the hard way.

What is Disaster Recovery Testing? Explained with Practical Examples

Alex Tray — Mon, 02 Mar 2026 10:07:11 +0000

Most teams are confident they can recover from a major outage until they actually have to. Backups exist, architectures are redundant and a recovery plan is documented somewhere, yet real incidents often reveal critical gaps.

Disaster recovery testing is what separates assumed resilience from proven recovery, but it’s still skipped, rushed or treated as a checkbox exercise. For developers and technical teams, that gap can turn a manageable failure into a prolonged outage.

What is Disaster Recovery Testing?
How Disaster Recovery Testing Works in Practice
Disaster Recovery Testing Methods Developers Should Know
What Technology Disaster Recovery Testing Evaluates
How to Test a Disaster Recovery Plan
Disaster Recovery Test Scenarios: Practical Examples
Disaster Recovery Test Report: Turning Tests Into Improvements
Disaster Recovery Audits and Continuous Validation
Conclusion

What is Disaster Recovery Testing?

Disaster recovery (DR) testing is the process of validating that systems, data and applications can be restored after a disruptive event within defined recovery objectives. It generally evaluates:

Recovery Time Objective (RTO): How quickly systems must be restored.
Recovery Point Objective (RPO): How much data loss is acceptable.
Operational readiness: Whether teams know what to do during an incident.

A disaster recovery test plan documents how these elements are tested, who is responsible and what success looks like. Without testing, DR plans are assumptions, not guarantees.

How Disaster Recovery Testing Works in Practice

In real environments, disaster recovery testing is used to check all elements of the disaster recovery plan and is rarely a single event. It’s a structured exercise that simulates failure, observes system behavior and measures outcomes against expectations.

A typical DR test involves:

Defining scope – Which applications, services, or data sets are included.
Selecting a scenario – Outage, corruption, ransomware, region failure, and so on.
Executing recovery actions – Restore data, fail over systems, reconfigure dependencies.
Measuring results – Time to recovery, data consistency, service availability.
Documenting findings – What worked, what failed, what needs improvement.

For developers, the key shift is recognizing that DR testing isn’t just an ops exercise. Application architecture, data handling and deployment patterns all influence recovery outcomes.

Importantly, regulatory pressure is also reshaping how organizations approach recovery validation. Frameworks such as the NIS2 Directive require essential and important entities in the EU to implement robust cybersecurity risk management measures, including incident response and business continuity capabilities.

Disaster Recovery Testing Methods Developers Should Know

Different testing methods provide different levels of confidence. Mature teams use more than one. Each method has a place, but relying only on low-impact testing creates blind spots that surface during real incidents.

Checklist Testing

The simplest method: Teams review documented recovery steps without executing them. This helps validate documentation completeness but does not confirm real-world recoverability.

Tabletop Exercises

Stakeholders walk through a simulated disaster scenario and discuss responses. Tabletop tests are useful for identifying communication gaps and unclear responsibilities, especially for cross-team coordination.

Partial or Component Testing

Specific systems, such as databases or backup restores, are tested in isolation. Developers often encounter this when validating recovery procedures for individual services or environments.

Full-scale Testing

This is the most comprehensive method. It involves actual failover or full recovery in production-like environments. While disruptive, full-scale tests provide the highest confidence.

What Technology Disaster Recovery Testing Evaluates

Modern environments are complex, and disaster recovery testing must validate more than just data restores.

DR testing evaluates:

Backup integrity – Are backups usable, consistent and complete?
Application dependencies – Do services come back in the correct order?
Infrastructure recovery – Can compute, storage and networking be re-provisioned?
Identity and access – Do credentials, secrets and permissions still function?
Automation and scripts – Do recovery workflows still match current architectures?

For developers, this often reveals hidden coupling between services, outdated scripts or environment-specific assumptions that were never documented.

How to Test a Disaster Recovery Plan

Testing a disaster recovery plan doesn’t require shutting down production on day one. A practical, incremental approach works best.

Start with a single application: Pick a service with well-defined data and dependencies. Avoid starting with your most complex system.
Validate backup restores: Restore data into a non-production environment and confirm application functionality, not just file presence.
Measure RTO and RPO: Time the recovery process and compare results to stated objectives. At this stage, many teams discover that their objectives were unrealistic.
Test failure assumptions: Simulate real-world issues like missing credentials, expired certificates or partial data loss.
Document gaps immediately: Update the disaster recovery test plan while findings are fresh. Untested fixes are just new assumptions.

This approach makes disaster recovery testing part of standard processes rather than a once-a-year compliance task.
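The "Measure RTO and RPO" step comes down to timestamp arithmetic. Here is a minimal sketch, assuming you record three timestamps during a drill. The field names, the timestamps, and the 60-minute objectives are all made up for illustration.

```javascript
const MS_PER_MINUTE = 60 * 1000;

function minutesBetween(startIso, endIso) {
  return (Date.parse(endIso) - Date.parse(startIso)) / MS_PER_MINUTE;
}

// Compare one drill's measured times against the stated objectives.
// Actual RTO: outage start until service restored.
// Actual RPO: last good backup until outage start (data that would be lost).
function evaluateDrill(drill, objectives) {
  const actualRtoMinutes = minutesBetween(drill.outageStart, drill.serviceRestored);
  const actualRpoMinutes = minutesBetween(drill.lastGoodBackup, drill.outageStart);
  return {
    actualRtoMinutes,
    actualRpoMinutes,
    rtoMet: actualRtoMinutes <= objectives.rtoMinutes,
    rpoMet: actualRpoMinutes <= objectives.rpoMinutes,
  };
}

// Hypothetical drill data.
const drillResult = evaluateDrill(
  {
    lastGoodBackup: "2026-03-01T08:00:00Z",
    outageStart: "2026-03-01T09:30:00Z",
    serviceRestored: "2026-03-01T10:15:00Z",
  },
  { rtoMinutes: 60, rpoMinutes: 60 }
);
console.log(drillResult);
```

In this made-up drill, recovery took 45 minutes (RTO met) but the last backup was 90 minutes old (RPO missed), which is the kind of gap the test is meant to surface.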

Automating Restore Validation

One of the most common gaps in disaster recovery testing is stopping at “restore completed” instead of validating that the application actually works. A restored database that can’t serve queries or contains incomplete data doesn’t meet recovery objectives.

Teams can reduce this risk by automating post-restore validation. For example, after restoring a PostgreSQL database into a staging or isolated DR environment, a simple validation script can confirm connectivity and basic data integrity:

import sys

import psycopg2


def validate_restore():
    conn = None
    try:
        # Connect to the restored instance (placeholder host and credentials)
        conn = psycopg2.connect(
            host="restored-db.internal",
            database="appdb",
            user="dr_test_user",
            password="securepassword"
        )
        cur = conn.cursor()

        # Run a real query rather than stopping at a successful connection
        cur.execute("SELECT COUNT(*) FROM users;")
        result = cur.fetchone()

        if result and result[0] > 0:
            print("Restore validation successful.")
        else:
            print("Restore validation failed: No data found.")
            sys.exit(1)
    except Exception as e:
        print(f"Restore validation error: {e}")
        sys.exit(1)
    finally:
        # Close the connection on every path, including early exits
        if conn is not None:
            conn.close()


validate_restore()

This script does three important things:

Confirms the database is reachable
Executes a real query, not just a connection check
Fails explicitly if the expected data is missing

In practice, teams can integrate scripts like this into CI/CD pipelines or scheduled recovery drills. The goal isn’t to test every edge case, but to move from “backup exists” to “restore is functionally verified.” Over time, these automated checks become part of the disaster recovery test plan, helping teams measure RTO accurately and detect configuration drift before a real incident exposes it.

Disaster Recovery Test Scenarios: Practical Examples

Effective disaster recovery testing focuses on realistic failures, not idealized outages.

Accidental Deletion or Misconfiguration

A dropped database table, deleted storage bucket or bad configuration change tests how quickly teams can restore specific data without rolling back entire systems. These everyday incidents often reveal slow or overly manual recovery processes.

Data Corruption and Application Failure

Buggy releases can silently corrupt data while systems remain online. This scenario validates point-in-time recovery and whether teams can identify when corruption started, not just restore the latest backup.

Ransomware Simulation

Ransomware testing checks whether clean, uncompromised backups can be restored in isolation. It often exposes gaps in backup immutability, credential handling and realistic recovery times.

Infrastructure or Platform Outage

Simulating the loss of a cluster, availability zone or region tests automation and infrastructure-as-code maturity. In virtualized environments (VMware being the most common case), testing involves restoring virtual machines at a secondary site and validating networking and application dependencies.

Credential and Access Failure

Recovery can stall if credentials, certificates or secret keys are unavailable. Testing this scenario validates identity systems and whether recovery procedures rely on fragile access assumptions.

Disaster Recovery Test Report: Turning Tests Into Improvements

Testing without documentation is wasted effort. A disaster recovery test report turns results into actionable improvements.

A valuable DR test report includes:

Test scope and scenario
Expected vs. actual RTO/RPO
Recovery steps executed
Failures, delays and root causes
Recommended changes

For developers, this often results in concrete action items: refactoring startup dependencies, adding health checks, improving automation or adjusting data protection policies. The report should feed directly into backlog planning.

Disaster Recovery Audits and Continuous Validation

Audits often expose what teams already suspect: Disaster recovery plans exist, but haven’t been tested recently (or at all).

Rather than treating audits as one-time events, teams should adopt continuous validation:

Regular restore tests integrated into CI/CD pipelines.
Scheduled DR tests tied to major architecture changes.
Automated alerts when recovery objectives drift.

This shifts disaster recovery testing from an annual obligation to an ongoing practice that evolves alongside the environment.

Conclusion

Disaster recovery testing is not about pessimism; it’s about realism. Systems and people change, and failure modes evolve faster than documentation. Without testing, even the best-designed recovery plan can become outdated.

For developers and technical teams, practicing disaster recovery testing builds confidence rooted in evidence, not assumptions. It exposes hidden dependencies, validates data protection strategies and ensures that when something goes wrong, recovery is predictable instead of chaotic.

The AI Coding Loop: How to Guide AI With Rules and Tests

Sumit Saha — Wed, 25 Feb 2026 00:26:58 +0000

Building great software isn't about perfect prompts; it's about a disciplined process. In this guide, I'll share my workflow for shipping secure code: defining clear goals, mapping edge cases, and building incrementally with runnable tests.

Using a Node.js shopping cart example, I'll show why server-side validation and test-driven development beat "one-shot" AI outputs every time. Let's dive into how to make AI your most reliable collaborator.

Some Background

Last week I did something that felt amazing for about… five seconds. I opened an AI tool, typed one sentence, and it generated a whole shopping cart module for an e-commerce app. Lots of files, lots of code, even folders and patterns. It looked professional.

And then I realized something: the problem was not "how fast AI wrote code." The problem was "how do I know this code is correct?"

Here's the truth: a big pile of code that you didn't write is not a shortcut. For most developers, it's actually extra work. You have to read it, understand it, and still catch the hidden mistakes.

So today I'm not going to give you another "AI is coming" talk. Instead, I'll show you a simple loop that any developer can follow – beginner, mid-level, or senior – to get better results from AI, step by step, without getting trapped. And I'll show it with a real example you can run in one file.

Here’s What We’ll Cover:

The 5-second high (and the real problem)
The golden rule: never trust user prices
The mindset shift: stop asking for the whole app
The AI coding loop (the 7-step workflow)
Apply the loop: a server-side cart total calculator
- The prompt (small piece, strong constraints)
One-file runnable example (with a wrong version on purpose)
- What you should notice here
How to use failing tests as a flashlight
Copy-paste prompt template
A calm hype check: why fundamentals matter more now
- A simple exercise (do this once and you'll feel the skill)
Recap

The 5-Second High (and the Real Problem)

A lot of people misunderstand AI coding. They think the main job is typing code. But the main job is thinking clearly. Typing is cheap now. Thinking is expensive.

When AI produces a "perfect-looking" module in one shot, the real work doesn't disappear. It moves downstream:

You still need to understand what it generated
You still need to verify it matches your rules
You still need to catch the mistakes that hide inside "nice looking code"

If you can't verify it, you don't own it. And if you don't own it, you can't safely ship it.

Tip: Treat AI output like code from a stranger on the internet: useful, but untrusted until proven.

The Golden Rule: Never Trust User Prices

I started exactly like a beginner would start. I opened AI and wrote a vague prompt:

Design and develop an e-commerce shopping cart module for me.

AI replied with a big output. It looked clean. If you're new, you might think:

Wow, it solved it.

But then I asked myself:

What is the easiest way this can go wrong in real life?

And the answer is also simple: “money can be stolen”. Because a shopping cart has one golden rule: never trust prices coming from the user.

If the browser sends you “T-shirt price is $1” and you accept it, someone can pay $1 for a $20 product. And when AI generates a big module quickly, that kind of mistake can easily hide inside "nice looking code."

Warning: Any system that accepts client-sent prices is basically inviting price tampering.

The Mindset Shift: Stop Asking for the Whole App

So instead of accepting the big AI output, I changed my approach. I said:

I'm not going to ask AI to build the whole app. I will break the big thing into small parts, and I will guide AI like a real engineer.

That is the first mindset shift. In the AI era, your value is not how fast you type. Your value is how well you can do three things:

define the problem clearly
break it into small pieces
prove the result is correct

Big systems are built from small correct pieces. That's not "prompt engineering." That's engineering.

The AI Coding Loop (the 7-Step Workflow)

Here's the loop I use. It's simple English. You can copy it and use it for any project:

Write the goal in one sentence
Write the rules (what must be true)
Write two examples (input → output)
Write two bad situations (weird cases)
Ask AI for a small piece, not the whole thing
Ask for tests, then run them
If something fails, improve the prompt and repeat

That's it. That's the loop. Here it is in visual form:

Tip: The loop is the skill. Tools will change. The loop will still work.

Apply the Loop: a Server-side Cart Total Calculator

Now let's apply it to the shopping cart example. Instead of "build me a cart module," I wrote a tiny requirement note:

We need a cart total calculator on the server. User sends productId and quantity. We must ignore any price from the user. We must use our own product list. We must handle unknown products and invalid quantity. We must calculate subtotal, discount, tax, and final total. We must round money correctly. We must have tests.

This is not a large or complex requirements specification - just a clear and concise note.

And then I asked AI for only one small piece:

Not the UI
Not the database
Not the entire architecture
Just one function, with tests

Because the fastest way to build something real is to prove one brick at a time. The requirement note captures everything in words, and it helps to pair it with a simple sketch or diagram for our own reference. Together they form a clean, well-documented requirement specification that we can keep in the project's GitHub README.md file.

In the diagram below, we can have a browser on the left and the server on the right. The browser/user is an untrusted input source. The user may send productId, qty, and even a fake price, but the server must treat only productId and qty as input and must ignore any client-sent price. The server then looks up the real price from its own trusted product catalog, validates the quantity, and calculates totals from server-side data. This is the trust boundary: prices come from the server, not from the client.

The prompt (small piece, strong constraints)

This is the shape of the prompt I used:

Create a single JavaScript file I can run with Node.

Goal:

Calculate shopping cart totals.

Rules:

Input items have productId and qty.
Do NOT trust price from user input.
Use my product catalog.
qty must be at least 1.
discountPercent and taxPercent must not be negative.
discount first, then tax.
round money to 2 decimals.

Examples:

2 T-shirts (20 each) + 1 mug (12.50) => subtotal 52.50
discount 10%, tax 8% => discount first, then tax

Deliver:

one function
simple tests using Node's built-in assert
print one example output

One small change makes a massive difference: “rules + examples + tests”. AI still tries to help fast, but now it has guardrails. And if it still makes a mistake, you can catch it, because you asked for proof.

Here is a visual representation of the "Cart Totals Pipeline" that covers all the use cases involved in the cart totals calculation process.

In the diagram, the cart total calculation follows a fixed pipeline. First, validate inputs (known productId, valid qty, non-negative discount/tax). Next, compute subtotal from the trusted product catalog. Then apply the discount to get the discounted amount. After that, calculate tax on the discounted amount (not on the original subtotal). Finally, round values correctly and return the result (subtotal, discount, tax, and total). The key rule is the order: discount first, then tax.
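To see why the order matters, here is the arithmetic for the example from the prompt (subtotal 52.50, discount 10%, tax 8%) worked in integer cents. This is a small illustration of the rule, and totals is a hypothetical helper name.

```javascript
// Integer cents avoid floating-point drift in money math.
function totals(subtotalCents, discountPercent, taxPercent) {
  const discountCents = Math.round(subtotalCents * (discountPercent / 100));
  const afterDiscountCents = subtotalCents - discountCents;
  // Tax is charged on the discounted amount, not the original subtotal.
  const taxCents = Math.round(afterDiscountCents * (taxPercent / 100));
  return {
    discountCents,
    afterDiscountCents,
    taxCents,
    totalCents: afterDiscountCents + taxCents,
  };
}

const correct = totals(5250, 10, 8);

// For contrast only: tax wrongly computed on the undiscounted subtotal.
const taxOnSubtotalCents = Math.round(5250 * (8 / 100));

console.log(correct);
console.log("tax if charged on the full subtotal:", taxOnSubtotalCents);
```

Discount first gives 3.78 in tax on 47.25. Taxing the full 52.50 would charge 4.20, so the customer would pay tax on money they never spent.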

One-File Runnable Example (with a Wrong Version on Purpose)

Now here's the one-file example you can run right now. No setup. Just Node. Create a file named cart.js, paste in the code below, and run node cart.js.

It includes two versions:

a wrong version that trusts user price (this is the mistake we want to learn from)
a correct version that uses a trusted catalog

// cart.js

// Run: node cart.js

const assert = require("node:assert/strict");

// Trusted product catalog (server-side truth)

const PRODUCTS = {
    tshirt: { name: "T-shirt", priceCents: 2000 }, // $20.00

    mug: { name: "Mug", priceCents: 1250 }, // $12.50

    book: { name: "Book", priceCents: 1599 }, // $15.99
};

function money(cents) {
    return (cents / 100).toFixed(2);
}

// WRONG: trusts user price

function cartTotal_WRONG(cartItems, discountPercent = 0, taxPercent = 0) {
    let subtotalCents = 0;

    for (const item of cartItems) {
        const priceCents = Math.round((item.price ?? 0) * 100); // user can cheat

        subtotalCents += priceCents * item.qty;
    }

    const discountCents = Math.round(subtotalCents * (discountPercent / 100));

    const afterDiscount = subtotalCents - discountCents;

    const taxCents = Math.round(afterDiscount * (taxPercent / 100));

    const totalCents = afterDiscount + taxCents;

    return totalCents;
}

// Correct: uses trusted catalog + checks

function cartTotal(cartItems, discountPercent = 0, taxPercent = 0) {
    if (!Array.isArray(cartItems))
        throw new Error("cartItems must be an array");

    // Number.isFinite rejects non-numbers, NaN, and Infinity in one check
    if (!Number.isFinite(discountPercent) || discountPercent < 0)
        throw new Error("discountPercent must be non-negative");

    if (!Number.isFinite(taxPercent) || taxPercent < 0)
        throw new Error("taxPercent must be non-negative");

    let subtotalCents = 0;

    for (const item of cartItems) {
        const { productId, qty } = item || {};

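        // Own-property check so inherited keys like "constructor" are rejected
        if (
            typeof productId !== "string" ||
            !Object.prototype.hasOwnProperty.call(PRODUCTS, productId)
        ) {
            throw new Error("Unknown productId: " + productId);
        }

        // Number.isInteger also rejects NaN and fractional quantities
        if (!Number.isInteger(qty) || qty < 1) {
            throw new Error("qty must be a whole number of at least 1");
        }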

        subtotalCents += PRODUCTS[productId].priceCents * qty;
    }

    const discountCents = Math.round(subtotalCents * (discountPercent / 100));

    let afterDiscountCents = subtotalCents - discountCents;

    if (afterDiscountCents < 0) afterDiscountCents = 0;

    const taxCents = Math.round(afterDiscountCents * (taxPercent / 100));

    const totalCents = afterDiscountCents + taxCents;

    return { subtotalCents, discountCents, taxCents, totalCents };
}

function runTests() {
    // Normal example

    const cart = [
        { productId: "tshirt", qty: 2 },

        { productId: "mug", qty: 1 },
    ];

    const r = cartTotal(cart, 10, 8);

    assert.equal(r.subtotalCents, 5250); // 52.50

    assert.equal(r.discountCents, 525); // 10% of 52.50

    assert.equal(r.taxCents, 378); // 8% of 47.25

    assert.equal(r.totalCents, 5103); // 51.03

    // Attack example: user tries to cheat with price = 1

    const attackerCart = [
        { productId: "tshirt", qty: 2, price: 1 },

        { productId: "mug", qty: 1, price: 1 },
    ];

    const wrong = cartTotal_WRONG(attackerCart, 0, 0);

    assert.equal(money(wrong), "3.00"); // totally wrong in real life

    const safe = cartTotal(attackerCart, 0, 0);

    assert.equal(money(safe.totalCents), "52.50"); // correct, ignores user price

    // Edge cases

    assert.throws(() => cartTotal([{ productId: "unknown", qty: 1 }], 0, 0));

    assert.throws(() => cartTotal([{ productId: "tshirt", qty: 0 }], 0, 0));

    assert.throws(() => cartTotal(cart, -1, 0));

    assert.throws(() => cartTotal(cart, 0, -1));
}

runTests();

console.log("All tests passed.");

const example = cartTotal(
    [
        { productId: "tshirt", qty: 1 },

        { productId: "book", qty: 2 },
    ],

    15,

    5,
);

console.log("Example subtotal:", money(example.subtotalCents));

console.log("Example discount:", money(example.discountCents));

console.log("Example tax:", money(example.taxCents));

console.log("Example total:", money(example.totalCents));

In this code, we didn't do a magic trick. We did some engineering:

We took a big problem and broke it into a small piece
We wrote rules so the AI doesn't guess
We wrote examples so the AI understands
We asked for tests so we can prove it
We ran the tests so we can trust it

That is the loop you can reuse for any project.

How to Use Failing Tests as a Flashlight

This is the part many developers skip. They ask for code, but they don't ask for proof. When you run the tests, one of two things happens:

Tests pass: great, you earned confidence
Tests fail: even better, you earned clarity

A failing test is a flashlight. It shows you the exact place where your thinking (or your prompt) needs improvement. Instead of "AI is wrong," you get a real question:

Which rule was unclear, missing, or contradictory?

Then you adjust:

add a stricter rule
add an example that removes ambiguity
add an edge case that forces the correct behavior
regenerate only the small piece, not the whole codebase

Copy-Paste Prompt Template

Here is a copy-paste prompt template you can start reusing today:


Build ONE small piece, not the full app.

Goal:

(One sentence)

Rules:

(3 to 7 bullets)

Examples:

(2 examples: input -> output)

Edge cases:

(2 cases that can break it)

Deliver:

- one runnable file

- include tests using Node assert

- print one example output

Then ask:

Before giving code, list the possible mistakes and confirm the rules.

That last line is powerful. It forces the AI to think about failure before writing code.

A Calm Hype Check: Why Fundamentals Matter More Now

A lot of content online makes it sound like: "AI codes now, so you don't need to learn coding." That idea is a trap. Because yes, AI can type code. But AI cannot replace your responsibilities as a developer and engineer.

If you ship a broken cart, you can lose money. If you ship insecure code, you can get hacked. If you ship unreliable software, users leave. And in real life, nobody will accept the excuse: "The AI wrote it."

In the AI era, learning coding isn't less important. It's more important, just in a different way. The goal isn't to become a fast typist. The goal is to become a strong thinker.

Fundamentals matter more than before:

how data flows through a system
how to break big problems into small parts
how to write clear rules and requirements
how to test and verify
how to notice edge cases
how to think about security
how to understand the tools you use, not just copy answers

Average software will be everywhere. It will be cheap. It will be copied. It will be easy to make. So the only software that matters will be software that is truly valuable: safe, reliable, high quality, and built with real understanding.

That's good news for serious learners. Because the best engineers will become even more valuable, not less.

A Simple Exercise (do this once and you'll feel the skill)

Add one more rule to the cart, like:

qty cannot be more than 10
Write the test first. Then ask AI to update the function. Run the tests.
That's how you train the real AI skill: not prompting, but guiding and verifying.
Let AI type the code.
You do the thinking.
You do the breaking down.
You do the proof.
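
Here is one way the test-first step might look, as a minimal sketch. The cart code isn't reproduced in this part of the article, so checkQty and MAX_QTY are hypothetical names invented for illustration, and the "qty starts at 1" rule is an assumption rather than something the cart is known to enforce.

```javascript
// Hypothetical sketch: checkQty and MAX_QTY are illustrative names,
// not taken from the original cart code.
const { strictEqual, throws } = require("node:assert");

const MAX_QTY = 10; // the new rule: qty cannot be more than 10

// Returns the quantity unchanged when it is valid, otherwise throws.
function checkQty(qty) {
  if (!Number.isInteger(qty) || qty < 1) {
    // Assumed rule: qty is a whole number starting at 1.
    throw new RangeError("qty must be a positive integer");
  }
  if (qty > MAX_QTY) {
    throw new RangeError(`qty cannot be more than ${MAX_QTY}`);
  }
  return qty;
}

// These checks are the part you write first: they pin down the boundary
// before you ask the AI to touch the function.
strictEqual(checkQty(10), 10);           // the limit itself is allowed
throws(() => checkQty(11), RangeError);  // one past the limit is rejected
throws(() => checkQty(0), RangeError);   // assumed lower bound

console.log("qty rule tests passed");
```

If the generated function lets 11 through, the second check fails and points you straight at the rule that needs tightening.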

Recap

Don't ask AI to build the whole app
Break the problem into one small piece
Write rules, examples, and edge cases so AI doesn't guess
Always ask for tests and run them
Treat failing tests as a flashlight
Repeat the loop until you can trust what you ship

That's the game now. And if you play it well, you're not behind, you're ahead.

Final Words

If you found the information here valuable, feel free to share it with others who might benefit from it.

I’d really appreciate your thoughts – mention me on X @sumit_analyzen or on Facebook @sumit.analyzen, watch my coding tutorials, or simply connect with me on LinkedIn.

You can also check out my official website sumitsaha.me for details about me.

How to Test React Applications with Vitest

Aiyedogbon Abraham — Tue, 10 Feb 2026 22:43:37 +0000

Testing is one of those things that every developer knows they should do, but many put off until problems start appearing in production. If you’re building React applications with Vite, there's a testing framework that fits so naturally into your workflow that you might actually enjoy writing tests. That framework is Vitest.

In this tutorial, you’ll learn how to set up Vitest in a React project, write effective tests for your components and hooks, and understand the testing patterns that will help you build more reliable applications.

What is Vitest and Why Should You Use It?
Prerequisites
How to Set Up Vitest in Your React Project
How to Write Your First Test
How to Test React Components
How to Test User Interactions
How to Test Custom Hooks
How to Mock API Calls
Best Practices for Testing React Components

What is Vitest and Why Should You Use It?

Vitest is a testing framework built on top of Vite. It uses Vite’s development server and plugin pipeline to transform and load files during testing. This means your tests use the same configuration and plugins as your app (for example, the React plugin, TypeScript support, and so on), so you don’t need a separate build or compile step.

Vitest runs tests in parallel across worker threads for maximum speed, and it automatically enables an instant “watch” mode (similar to Vite’s HMR) that reruns only the tests related to changed files. Vitest also has first-class support for modern JavaScript out of the box: it handles ESM, TypeScript, and JSX natively via Vite’s transformer (powered by Oxc).

Because Vitest provides a Jest-compatible API, you can continue to use familiar testing libraries (for example, React Testing Library, jest-dom matchers, user-event, and so on) without extra setup.

In short, Vitest tightly integrates with your Vite-powered stack (or can even run standalone) and lets you plug in existing testing tools seamlessly.

Here is why Vitest has become popular in the React ecosystem:

Speed: Vitest can run tests more than four times faster than Jest in many scenarios. This speed comes from Vite's fast Hot Module Replacement and efficient caching capabilities.
Zero configuration: Unlike Jest, which typically needs Babel integration, ts-jest setup, and several extra dependencies, Vitest works out of the box. It reuses your existing Vite configuration, eliminating the need to configure a separate test pipeline.
Native TypeScript support: Vitest handles TypeScript and JSX through Vite's built-in transformer, with no additional configuration needed.
Modern JavaScript: Vitest offers native support for ES modules out of the box, making it ideal for modern JavaScript stacks.
Familiar API: If you know Jest, you already know most of Vitest. The API is intentionally compatible, making migration straightforward.

Prerequisites

To follow along with this tutorial, you should have:

Basic knowledge of React and JavaScript
Understanding of React Hooks
Node.js installed (version 20 or higher is a safe choice; current Vite and Vitest releases no longer support Node 14)
A React project created with Vite (or you can create one as we go)

How to Set Up Vitest in Your React Project

Let's start by creating a new React project with Vite and setting up Vitest.

Step 1: Create a React Project with Vite

If you don't have an existing project, create one with the following command:

npm create vite@latest my-react-app -- --template react
cd my-react-app
npm install

This creates a React project with Vite as the build tool.

Step 2: Install Vitest and Testing Dependencies

Install Vitest along with the React Testing Library and other necessary dependencies:

npm install --save-dev vitest @testing-library/react @testing-library/jest-dom @testing-library/user-event jsdom

Here's what each package does:

vitest: The testing framework itself
@testing-library/react: Provides utilities for testing React components
@testing-library/jest-dom: Adds custom matchers for DOM assertions
@testing-library/user-event: Simulates user interactions
jsdom: Provides a DOM environment for testing

Step 3: Configure Vitest

Create a vitest.config.js file in your project root:

import { defineConfig } from 'vitest/config';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  test: {
    globals: true,
    environment: 'jsdom',
    setupFiles: './src/test/setup.js',
  },
});

Setting globals: true exposes Vitest's test APIs (describe, it, expect, vi, and lifecycle hooks such as beforeEach and afterEach) as globals, so you don't need to import them in every test file. The environment: 'jsdom' setting tells Vitest to use jsdom for simulating a browser environment.

Step 4: Create the Test Setup File

Create a file at src/test/setup.js:

import { afterEach } from 'vitest';
import { cleanup } from '@testing-library/react';
import '@testing-library/jest-dom';

afterEach(() => {
  cleanup();
});

The cleanup() function runs after each test to clean up the DOM, ensuring tests don't interfere with each other.

Step 5: Add Test Scripts

Add the following scripts to your package.json:

{
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "test": "vitest",
    "test:ui": "vitest --ui",
    "coverage": "vitest --coverage"
  }
}

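Now you can run tests with npm test. The test:ui and coverage scripts rely on optional packages (@vitest/ui and a coverage provider such as @vitest/coverage-v8), which you need to install before using them.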

How to Write Your First Test

Let's write a simple test to make sure everything is working. Create a file called sum.test.js in your src directory:

import { expect, test } from 'vitest';

function sum(a, b) {
  return a + b;
}

test('adds 1 + 2 to equal 3', () => {
  expect(sum(1, 2)).toBe(3);
});

Run npm test and you should see your test pass. A test in Vitest passes if it doesn't throw an error.
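
To make the "passes if it doesn't throw" idea concrete, here is a toy runner in plain Node.js. It's only an illustration of the principle, not how Vitest is implemented, and runTest and expectToBe are made-up helper names.

```javascript
// Illustrative only: a toy runner, not Vitest's actual implementation.
const results = [];

// Runs fn and records "pass" when it returns normally, "fail" when it throws.
function runTest(name, fn) {
  try {
    fn();
    results.push({ name, status: "pass" });
  } catch (err) {
    results.push({ name, status: "fail", message: err.message });
  }
}

// A bare-bones stand-in for expect(actual).toBe(expected).
function expectToBe(actual, expected) {
  if (!Object.is(actual, expected)) {
    throw new Error(`expected ${expected} but received ${actual}`);
  }
}

const sum = (a, b) => a + b;

runTest("adds 1 + 2 to equal 3", () => expectToBe(sum(1, 2), 3));
runTest("deliberately wrong expectation", () => expectToBe(sum(1, 2), 4));

console.log(results);
```

The first entry is recorded as a pass because nothing threw. The second is recorded as a fail because the assertion helper threw an error. Matchers like toBe work on the same principle.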

How to Test React Components

Now let's test an actual React component. We'll start with a simple component and gradually build up to more complex scenarios.

Testing a Simple Component

Create a component called Greeting.jsx:

export function Greeting({ name }) {
  return (
    <div>
      <h1>Hello, {name}!</h1>
      <p>Welcome to our application</p>
    </div>
  );
}

Now create a test file Greeting.test.jsx:

import { render, screen } from '@testing-library/react';
import { Greeting } from './Greeting';

describe('Greeting Component', () => {
  it('should render the greeting with the provided name', () => {
    render(<Greeting name="Alice" />);

    const heading = screen.getByRole('heading', { level: 1 });
    expect(heading).toHaveTextContent('Hello, Alice!');
  });

  it('should render the welcome message', () => {
    render(<Greeting name="Bob" />);

    const paragraph = screen.getByText('Welcome to our application');
    expect(paragraph).toBeInTheDocument();
  });
});

The describe function groups related tests under a shared label. Each it function contains one test case.

The render function from React Testing Library renders your component in a test environment. The screen object provides query methods to find elements in the rendered output.

Understanding Query Functions

React Testing Library provides three types of query functions: get, query, and find.

getBy queries: Throw an error if the element isn't found. Use these when you expect the element to be present.

const button = screen.getByRole('button', { name: /click me/i });

queryBy queries: Return null if the element isn't found. Use these when you want to assert that an element doesn't exist.

const errorMessage = screen.queryByText('Error');
expect(errorMessage).not.toBeInTheDocument();

findBy queries: Return a promise and wait for the element to appear. Use these for asynchronous operations.

const loadedData = await screen.findByText('Data loaded');
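
The difference between the three families is easier to see side by side. The sketch below imitates their behaviour with plain JavaScript over an array of strings. It is not React Testing Library's code, and the timeout and interval values are arbitrary choices made for this example.

```javascript
// Illustrative only: plain-JavaScript stand-ins, not React Testing Library's code.
const rendered = ["Count: 0", "Increment"]; // pretend these texts are on screen

// queryBy-style: returns the match, or null when nothing matches.
function queryByText(text) {
  return rendered.includes(text) ? text : null;
}

// getBy-style: same lookup, but a miss is an error.
function getByText(text) {
  const match = queryByText(text);
  if (match === null) {
    throw new Error(`Unable to find an element with the text: ${text}`);
  }
  return match;
}

// findBy-style: returns a promise and retries until the text appears
// or the timeout is reached.
function findByText(text, { timeout = 1000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    const attempt = () => {
      const match = queryByText(text);
      if (match !== null) return resolve(match);
      if (Date.now() >= deadline) {
        return reject(new Error(`Timed out waiting for: ${text}`));
      }
      setTimeout(attempt, interval);
    };
    attempt();
  });
}

// Simulate content that shows up later, like data arriving from an API.
setTimeout(() => rendered.push("Data loaded"), 100);

console.log(queryByText("Error")); // null
findByText("Data loaded").then((text) => console.log("found:", text));
```

A missing element gives you null from the query version, an immediate error from the get version, and a rejected promise (after the timeout) from the find version.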

Testing a Counter Component

Let's test a more interactive component. Create Counter.jsx:

import { useState } from 'react';

export function Counter({ initialCount = 0 }) {
  const [count, setCount] = useState(initialCount);

  return (
    <div>
      <p>Count: {count}</p>
      <button onClick={() => setCount(count + 1)}>Increment</button>
      <button onClick={() => setCount(count - 1)}>Decrement</button>
      <button onClick={() => setCount(0)}>Reset</button>
    </div>
  );
}

Create the test file Counter.test.jsx:

import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { Counter } from './Counter';

describe('Counter Component', () => {
  it('should render with initial count of 0', () => {
    render(<Counter />);

    expect(screen.getByText('Count: 0')).toBeInTheDocument();
  });

  it('should render with custom initial count', () => {
    render(<Counter initialCount={5} />);

    expect(screen.getByText('Count: 5')).toBeInTheDocument();
  });

  it('should increment count when increment button is clicked', async () => {
    const user = userEvent.setup();
    render(<Counter />);

    const incrementButton = screen.getByRole('button', { name: /increment/i });
    await user.click(incrementButton);

    expect(screen.getByText('Count: 1')).toBeInTheDocument();
  });

  it('should decrement count when decrement button is clicked', async () => {
    const user = userEvent.setup();
    render(<Counter initialCount={5} />);

    const decrementButton = screen.getByRole('button', { name: /decrement/i });
    await user.click(decrementButton);

    expect(screen.getByText('Count: 4')).toBeInTheDocument();
  });

  it('should reset count to 0 when reset button is clicked', async () => {
    const user = userEvent.setup();
    render(<Counter initialCount={10} />);

    const resetButton = screen.getByRole('button', { name: /reset/i });
    await user.click(resetButton);

    expect(screen.getByText('Count: 0')).toBeInTheDocument();
  });
});

In these Counter tests, we first use render() to mount the component in a virtual DOM. We then query the output using Testing Library’s screen object. For example, screen.getByText('Count: 0') finds the element displaying the initial count of 0, and expect(...).toBeInTheDocument() asserts that it is present. The getByText query will throw an error if the text isn’t found, immediately failing the test.

For interactive tests, we create a user with const user = userEvent.setup() and then call await user.click(...) on the increment/decrement/reset buttons. The userEvent.click method simulates a real user click (dispatching the sequence of events a browser would fire). We locate buttons by their accessible role and name (for example, getByRole('button', { name: /increment/i })), following best practices for accessible queries.

After each click, we assert that the DOM updates accordingly (for example, the count text changes to “Count: 1”). Using async/await with user.click ensures the test waits for any state changes. In this way, each test checks the user-visible behavior: that clicking the Increment button increases the count, the Decrement button decreases it, and the Reset button sets it back to zero, without depending on the component’s internal implementation.

How to Test User Interactions

User interactions are a critical part of testing React applications. The @testing-library/user-event library provides a more realistic simulation of user behaviour than simple event dispatching.

Testing Form Inputs

Create a LoginForm.jsx component:

import { useState } from 'react';

export function LoginForm({ onSubmit }) {
  const [email, setEmail] = useState('');
  const [password, setPassword] = useState('');
  const [error, setError] = useState('');

  const handleSubmit = (e) => {
    e.preventDefault();

    if (!email || !password) {
      setError('Both fields are required');
      return;
    }

    setError('');
    onSubmit({ email, password });
  };

  return (
    <form onSubmit={handleSubmit}>
      <div>
        <label htmlFor="email">Email</label>
        <input
          id="email"
          type="email"
          value={email}
          onChange={(e) => setEmail(e.target.value)}
        />
      </div>
      <div>
        <label htmlFor="password">Password</label>
        <input
          id="password"
          type="password"
          value={password}
          onChange={(e) => setPassword(e.target.value)}
        />
      </div>
      {error && <p role="alert">{error}</p>}
      <button type="submit">Log In</button>
    </form>
  );
}

Create the test file LoginForm.test.jsx:

import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { LoginForm } from './LoginForm';

describe('LoginForm Component', () => {
  it('should render email and password inputs', () => {
    render(<LoginForm onSubmit={() => {}} />);

    expect(screen.getByLabelText(/email/i)).toBeInTheDocument();
    expect(screen.getByLabelText(/password/i)).toBeInTheDocument();
  });

  it('should update input values when user types', async () => {
    const user = userEvent.setup();
    render(<LoginForm onSubmit={() => {}} />);

    const emailInput = screen.getByLabelText(/email/i);
    const passwordInput = screen.getByLabelText(/password/i);

    await user.type(emailInput, 'test@example.com');
    await user.type(passwordInput, 'password123');

    expect(emailInput).toHaveValue('test@example.com');
    expect(passwordInput).toHaveValue('password123');
  });

  it('should show error when form is submitted empty', async () => {
    const user = userEvent.setup();
    const mockSubmit = vi.fn();
    render(<LoginForm onSubmit={mockSubmit} />);

    const submitButton = screen.getByRole('button', { name: /log in/i });
    await user.click(submitButton);

    expect(screen.getByRole('alert')).toHaveTextContent('Both fields are required');
    expect(mockSubmit).not.toHaveBeenCalled();
  });

  it('should call onSubmit with form data when valid', async () => {
    const user = userEvent.setup();
    const mockSubmit = vi.fn();
    render(<LoginForm onSubmit={mockSubmit} />);

    await user.type(screen.getByLabelText(/email/i), 'test@example.com');
    await user.type(screen.getByLabelText(/password/i), 'password123');
    await user.click(screen.getByRole('button', { name: /log in/i }));

    expect(mockSubmit).toHaveBeenCalledWith({
      email: 'test@example.com',
      password: 'password123',
    });
  });
});

The LoginForm tests similarly use render and screen to interact with the component. We use screen.getByLabelText(/email/i) and screen.getByLabelText(/password/i) to find the input fields by their associated labels, mimicking how users identify form fields.

To simulate typing, we use await user.type(input, text), which sends real keyboard events to the input (via user-event). After typing, we assert the input’s value with expect(input).toHaveValue(...) (a custom matcher from jest-dom).

When submitting the form empty, clicking the Log In button triggers the form’s validation and displays an error message. We find this error by querying getByRole('alert') and check its text content. We also assert that the mock onSubmit handler was not called.

In the valid submission test, we fill both fields and click Log In; then expect(mockSubmit).toHaveBeenCalledWith({...}) verifies the submit handler received the correct { email, password } object.

These tests focus on user actions and outcomes: typing and clicking drive the form logic, and our assertions confirm the expected outputs (visible error text or the callback arguments).
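
If mock functions are new to you, it may help to see the idea behind vi.fn() stripped down to a few lines. The sketch below is a toy version in plain JavaScript, not Vitest's implementation. createSpy and submitLogin are hypothetical names, and submitLogin only imitates the form's validation logic without React.

```javascript
// Illustrative only: a toy version of the idea behind vi.fn(),
// not Vitest's implementation.
function createSpy() {
  const spy = (...args) => {
    spy.calls.push(args); // remember the arguments of every call
  };
  spy.calls = [];
  return spy;
}

// Hypothetical stand-in for the form's submit logic (no React involved).
function submitLogin({ email, password }, onSubmit) {
  if (!email || !password) return "Both fields are required";
  onSubmit({ email, password });
  return "";
}

const onSubmit = createSpy();

// Empty submission: an error comes back and the handler is never called.
const error = submitLogin({ email: "", password: "" }, onSubmit);
console.log(error, onSubmit.calls.length);

// Valid submission: the handler is called once with the form data.
submitLogin({ email: "test@example.com", password: "password123" }, onSubmit);
console.log(onSubmit.calls[0][0]);
```

Assertions such as toHaveBeenCalledWith inspect a call record of this kind. Vitest keeps it on the mock's mock.calls property.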

How to Test Custom Hooks

Custom hooks encapsulate reusable logic, and they need testing just like components. React Testing Library provides a renderHook function specifically for this purpose.

Creating and Testing a Custom Hook

Create a custom hook useFetch.js:

import { useState, useEffect } from 'react';

export function useFetch(url) {
  const [data, setData] = useState(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    const fetchData = async () => {
      try {
        setLoading(true);
        const response = await fetch(url);

        if (!response.ok) {
          throw new Error('Network response was not ok');
        }

        const json = await response.json();
        setData(json);
        setError(null);
      } catch (err) {
        setError(err.message);
        setData(null);
      } finally {
        setLoading(false);
      }
    };

    fetchData();
  }, [url]);

  return { data, loading, error };
}

Create the test file useFetch.test.js:

import { renderHook, waitFor } from '@testing-library/react';
import { useFetch } from './useFetch';

describe('useFetch Hook', () => {
  beforeEach(() => {
    global.fetch = vi.fn();
  });

  afterEach(() => {
    vi.restoreAllMocks();
  });

  it('should return loading state initially', () => {
    global.fetch.mockImplementation(() => 
      Promise.resolve({
        ok: true,
        json: async () => ({ data: 'test' }),
      })
    );

    const { result } = renderHook(() => useFetch('https://api.example.com/data'));

    expect(result.current.loading).toBe(true);
    expect(result.current.data).toBe(null);
    expect(result.current.error).toBe(null);
  });

  it('should return data when fetch succeeds', async () => {
    const mockData = { id: 1, title: 'Test Post' };

    global.fetch.mockImplementation(() =>
      Promise.resolve({
        ok: true,
        json: async () => mockData,
      })
    );

    const { result } = renderHook(() => useFetch('https://api.example.com/posts/1'));

    await waitFor(() => expect(result.current.loading).toBe(false));

    expect(result.current.data).toEqual(mockData);
    expect(result.current.error).toBe(null);
  });

  it('should return error when fetch fails', async () => {
    global.fetch.mockImplementation(() =>
      Promise.resolve({
        ok: false,
      })
    );

    const { result } = renderHook(() => useFetch('https://api.example.com/posts/1'));

    await waitFor(() => expect(result.current.loading).toBe(false));

    expect(result.current.data).toBe(null);
    expect(result.current.error).toBe('Network response was not ok');
  });
});

The renderHook function from React Testing Library renders custom hooks, and waitFor is used to wait for asynchronous state updates in the hook.

How to Mock API Calls

When testing components that make API calls, you don't want to hit real endpoints. Mocking ensures your tests are fast, reliable, and don't depend on network conditions.

Mocking with Vitest

Vitest only mocks a module when you explicitly ask it to with vi.mock(), so you need to set up each mock yourself. Let's see how to mock an Axios call.

Create a PostsList.jsx component:

import { useState, useEffect } from 'react';
import axios from 'axios';

export function PostsList() {
  const [posts, setPosts] = useState([]);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    const fetchPosts = async () => {
      try {
        const response = await axios.get('https://api.example.com/posts');
        setPosts(response.data);
      } catch (err) {
        setError(err.message);
      } finally {
        setLoading(false);
      }
    };

    fetchPosts();
  }, []);

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error: {error}</p>;

  return (
    <ul>
      {posts.map((post) => (
        <li key={post.id}>{post.title}</li>
      ))}
    </ul>
  );
}

Create the test file PostsList.test.jsx:

import { render, screen, waitFor } from '@testing-library/react';
import axios from 'axios';
import { PostsList } from './PostsList';

vi.mock('axios');

describe('PostsList Component', () => {
  beforeEach(() => {
    vi.clearAllMocks();
  });

  it('should display loading state initially', () => {
    axios.get.mockImplementation(() => new Promise(() => {}));
    render(<PostsList />);

    expect(screen.getByText('Loading...')).toBeInTheDocument();
  });

  it('should display posts when API call succeeds', async () => {
    const mockPosts = [
      { id: 1, title: 'First Post' },
      { id: 2, title: 'Second Post' },
    ];

    axios.get.mockResolvedValue({ data: mockPosts });
    render(<PostsList />);

    await waitFor(() => {
      expect(screen.queryByText('Loading...')).not.toBeInTheDocument();
    });

    expect(screen.getByText('First Post')).toBeInTheDocument();
    expect(screen.getByText('Second Post')).toBeInTheDocument();
  });

  it('should display error when API call fails', async () => {
    axios.get.mockRejectedValue(new Error('Network error'));
    render(<PostsList />);

    await waitFor(() => {
      expect(screen.queryByText('Loading...')).not.toBeInTheDocument();
    });

    expect(screen.getByText(/error/i)).toBeInTheDocument();
  });
});

In these tests, we verify specific UI states: the “loading” test checks that a loading indicator shows while data is being fetched, the “success” test confirms that post items render when the API returns data, and the “error” test makes sure an error message appears if the call fails.

We mock Axios by calling vi.mock('axios') and then using methods like mockResolvedValue(...) on axios.get to simulate a successful response (and mockRejectedValue(...) to simulate a failure). This kind of mocking isolates our tests from real network calls (making them fast and reliable) and lets us control exactly what data or error the component receives.

We use await waitFor(...) to pause the test until those asynchronous updates complete before making assertions. Finally, we use screen.getByText(...) to find elements that should be present (it will throw an error if they’re missing) and screen.queryByText(...) to check that elements aren’t present (it returns null if the element is not in the DOM).

Mocking Specific Module Functions

Sometimes you only want to mock specific functions while keeping the rest of a module's behaviour intact. Here's how to do that:

vi.mock('date-fns', async () => {
  const original = await vi.importActual('date-fns');
  return {
    ...original,
    format: vi.fn(() => '2025-01-01'),
  };
});

In Vitest, you use vi.importActual to retain all original methods while mocking only the format method.
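
The factory above relies on ordinary object spread: copy every export, then override one. The sketch below shows that pattern on its own, using a plain object as a stand-in for a real module. originalModule and its two functions are invented for this example and are not the date-fns API.

```javascript
// Illustrative only: a plain object stands in for a real module.
const MS_PER_DAY = 86400000;

const originalModule = {
  format: (date) => date.toISOString().slice(0, 10),
  addDays: (date, days) => new Date(date.getTime() + days * MS_PER_DAY),
};

// Same shape as the vi.mock factory: copy everything, then override one function.
const mockedModule = {
  ...originalModule,
  format: () => "2025-01-01",
};

const start = new Date("2024-03-10T00:00:00Z");

console.log(mockedModule.format(start));                            // 2025-01-01
console.log(mockedModule.addDays === originalModule.addDays);       // true
console.log(originalModule.format(mockedModule.addDays(start, 1))); // 2024-03-11
```

Only format is replaced. addDays is the very same function object in both, so any code that depends on it keeps its real behaviour.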

Best Practices for Testing React Components

Now that you know how to write tests, let's talk about how to write good tests.

Test User Behaviour, Not Implementation

Focus on testing what users see and do, not internal component details. If you refactor your component's implementation without changing its behaviour, your tests shouldn't break.

Bad test (testing implementation):

it('should set isOpen state to true', () => {
  const { result } = renderHook(() => useState(false));
  // Testing internal state directly
});

Good test (testing behaviour):

it('should show menu when button is clicked', async () => {
  const user = userEvent.setup();
  render(<Menu />);

  await user.click(screen.getByRole('button', { name: /menu/i }));
  expect(screen.getByRole('navigation')).toBeVisible();
});

Use Accessible Queries

React Testing Library encourages you to query elements the way users do. Prefer queries that mirror user interaction:

getByRole (best for interactive elements)
getByLabelText (for form fields)
getByPlaceholderText
getByText
getByTestId (last resort)

Keep Tests Simple and Focused

Each test should verify one thing. If your test needs a lot of setup or has many assertions, consider splitting it into multiple tests.

Clean Up Between Tests

Use afterEach to clean up the DOM after each test run, ensuring tests don't interfere with each other. This is already handled if you followed the setup steps earlier.

Use Descriptive Test Names

Test names should clearly describe what they're testing and what the expected outcome is.

Good test names:

it('should display error message when form is submitted empty');
it('should call onSubmit with email and password when form is valid');
it('should disable submit button while request is pending');

Mock External Dependencies

Always mock API calls, timers, and other external dependencies. Your tests should be isolated and not depend on network conditions or external services.

Conclusion

You've now learned how to set up Vitest in a React project and write effective tests for components, user interactions, custom hooks, and API calls. Vitest provides a powerful and efficient way to test React applications, especially when combined with modern tools like Vite.

Testing is about building confidence in your code, documenting expected behaviour, and enabling safe refactoring. Vitest's speed makes testing feel less like a chore and more like a natural part of development.

Start small. Add tests for critical user flows. Test the components that change frequently. As you build the habit, you will find that tests actually make development faster, not slower. The code will still be there tomorrow. But the bugs you catch today won't be.

Unit Testing in Go - A Beginner's Guide

Gabor Koos — Mon, 12 Jan 2026 17:55:57 +0000

If you're learning Go and you’re already familiar with the idea of unit testing, the main challenge is usually not why to test, but how to test in Go.

Go takes a deliberately minimal approach to testing. There are no built-in assertions, no annotations, and no special syntax. Instead, tests are written as regular Go code using a small standard library package, and run with a single command. This can feel unusual at first if you're coming from ecosystems with richer testing frameworks, but it quickly becomes predictable and easy to reason about.

In this article, we'll look at how unit testing works in Go in practice. We'll write a few small tests, run them from the command line, and cover the most common patterns you'll see in real Go codebases, such as table-driven tests and testing functions that return errors. We'll focus on the essentials and won't cover more advanced topics like mocks or external frameworks.

The goal is to show how familiar testing concepts translate into idiomatic Go. By the end, you should feel comfortable reading and writing basic unit tests and integrating them into your regular Go workflow.

What We'll Cover:

Prerequisites
Writing Your First Test
Table-Driven Tests
Testing Functions That Return Errors
Best Practices and Tips
Conclusion
Solutions to Exercises
- Subtract Function and Tests
- SafeSubtract Function and Tests

Prerequisites

Before you start, you should be comfortable with:

Writing and running basic Go programs
Defining and calling functions in Go
Understanding basic Go types (int, string, bool, and so on)
Using the Go command-line tool (go run, go build)
Basic understanding of unit tests: what a test is and why it's useful
Familiarity with Test-Driven Development concepts like testing before or alongside writing code
Awareness of common testing ideas such as assertions, test coverage, and checking error conditions

You don't need prior experience with Go's testing package or Go-specific test patterns, as this guide will cover all of that.

Writing Your First Test

Let's start with a simple function to test. Imagine you have a small calc package with an Add function:

// calc.go
package calc

// Add returns the sum of two integers
func Add(a, b int) int {
    return a + b
}

To test this function, create a new file named calc_test.go in the same package. In Go, test files must end with _test.go to be recognized by the testing tool.

Inside calc_test.go, you write a test function:

// calc_test.go
package calc

import "testing"

func TestAdd(t *testing.T) {
    got := Add(2, 3)
    want := 5
    if got != want {
        t.Errorf("Add(2, 3) = %d; want %d", got, want)
    }
}

Here's what's happening:

The function name starts with Test and takes a single *testing.T parameter. Go automatically discovers and runs any function that follows this convention.
The t.Errorf call reports a test failure. Unlike some frameworks, Go doesn't provide special assertions – you simply check a condition and call t.Errorf or t.Fatalf if it fails.
Each test is a standalone function. You can write as many as you like, and Go will run them all.

Running Your Test

Once the file is saved, you can run your test with:

go test

This runs tests for the current package (files ending with _test.go). If you want to run tests recursively in all subdirectories of your project, use:

go test ./...

The ./... pattern is shorthand for "run tests in this directory and all subdirectories". This is especially useful in larger projects where your code is spread across multiple packages.

If everything is working, you should see output indicating that the test passed:

$ go test
PASS
ok      _/C_/projects/Articles/Go_Testing       0.334s

You can add the -v flag for verbose output:

go test -v

This will show you the names of the tests as they run:

$ go test -v
=== RUN   TestAdd
--- PASS: TestAdd (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.356s

Not much difference for a single test, but it becomes useful as you add more tests.

Now let's see what happens if the test fails. Change the expected value in calc_test.go to an incorrect one:

  ...
    want := 6 // Incorrect expected value
  ...

Run the tests again:

$ go test
--- FAIL: TestAdd (0.00s)
    calc_test.go:9: Add(2, 3) = 5; want 6
FAIL
exit status 1
FAIL    _/C_/projects/Articles/Go_Testing       0.340s

or with verbose output:

$ go test -v
=== RUN   TestAdd
    calc_test.go:9: Add(2, 3) = 5; want 6
--- FAIL: TestAdd (0.00s)
FAIL
exit status 1
FAIL    _/C_/projects/Articles/Go_Testing       0.337s

Of course, your tests should always check for the correct expected values! A failing (but correct) test is a sign that your code needs to be fixed.

We only created one test file and one test function with one assertion here, but Go's testing tool can handle many files and functions at once. Behind the scenes, Go will automatically:

Find all _test.go files in the specified packages (for example, current directory for go test, or recursively in all subdirectories with go test ./...).
Identify functions that start with Test and have the correct signature.
Compile them together with your package into a temporary test binary.
Execute each test function and report the results.

To prove this, let's quickly add a Divide function to our package:

// calc.go
...
// Divide returns the result of dividing a by b
func Divide(a, b int) int {
    return a / b
}

(Note that this is an integer division, so fractional parts are discarded. Divide(5, 2) would return 2.)

And another test file with a corresponding test:

// calc_2_test.go
package calc

import "testing"

func TestDivide(t *testing.T) {
    got := Divide(10, 2)
    want := 5
    if got != want {
        t.Errorf("Divide(10, 2) = %d; want %d", got, want)
    }
}

Now when you run go test, both TestAdd and TestDivide will be executed:

$ go test
PASS
ok      _/C_/projects/Articles/Go_Testing       0.325s

Or:

$ go test -v
=== RUN   TestAdd
--- PASS: TestAdd (0.00s)
=== RUN   TestDivide
--- PASS: TestDivide (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.323s

Divide by Zero

What happens if we try to Divide by zero? Let's add another test case for that:

// calc_test.go
...
func TestDivideByZero(t *testing.T) {
    defer func() {
        if r := recover(); r == nil { // Check if a panic occurred
            t.Errorf("Divide did not panic on division by zero")
        }
    }()
    Divide(10, 0) // This should cause a panic
}

This test checks that the Divide function panics when dividing by zero. When you run the tests again, you'll see that this new test also passes:

$ go test -v
=== RUN   TestAdd
--- PASS: TestAdd (0.00s)
=== RUN   TestDivide
--- PASS: TestDivide (0.00s)
=== RUN   TestDivideByZero
--- PASS: TestDivideByZero (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.312s

(Note that in real-world Go code, it's better to return (int, error) for unsafe operations instead of panicking.)

Feel free to experiment by adding more test cases, changing expected values, and exploring how Go's testing framework handles different scenarios.

`t.Errorf` vs `t.Fatalf`

In the examples above, we used t.Errorf to report test failures. This function logs the error but allows the test to continue running. This is useful when you want to check multiple conditions in a single test function.

In contrast, t.Fatalf logs the error and immediately stops the execution of the current test. Use t.Fatalf when continuing the test after a failure doesn't make sense or could cause misleading results.

For example, in TestDivideByZero the t.Errorf call sits inside a deferred function, which runs last, so there is nothing left to skip and t.Errorf is enough. But when later checks depend on an earlier step succeeding (for instance, a setup call that must not return an error), use t.Fatalf so the test stops before those checks produce misleading failures.

While t.Errorf and t.Fatalf use fmt-style formatting, for simple messages without formatting, you can also use t.Error and t.Fatal, respectively.

In the next section, we'll look at table-driven tests, a common Go pattern for testing multiple cases efficiently.

Table-Driven Tests

In Go, it's common to want to run the same test logic for multiple inputs and expected outputs. Rather than writing a separate test function for each case, Go developers often use table-driven tests. This pattern keeps your tests concise, readable, and easy to extend.

Table-Driven `Add` Test

Let's rewrite our Add test using a table-driven approach (and delete calc_2_test.go for clarity):

// calc_test.go
package calc

import "testing"

func TestAddTableDriven(t *testing.T) {
    tests := []struct { // Define a struct for each test case and create a slice of them
        name string
        a, b int
        want int
    }{
        {"both positive", 2, 3, 5},
        {"positive + zero", 5, 0, 5},
        {"negative + positive", -1, 4, 3},
        {"both negative", -2, -3, -5},
    }

    for _, tt := range tests { // Loop over each test case
        t.Run(tt.name, func(t *testing.T) { // Run each case as a subtest
            got := Add(tt.a, tt.b)
            if got != tt.want { // Check the result
                t.Errorf("Add(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want) // Report failure if it doesn't match
            }
        })
    }
}

Here's how it works:

We define a slice of structs, each representing a test case.
Each struct contains the test name, input values, and the expected result.
We loop over the slice and call t.Run(tt.name, func(t *testing.T) { ... }) to run each test as a subtest.
If a subtest fails, you can see which one by its name in the output.

$ go test
PASS
ok      _/C_/projects/Articles/Go_Testing       0.452s

Or to see detailed output:

$ go test -v
=== RUN   TestAddTableDriven
=== RUN   TestAddTableDriven/both_positive
=== RUN   TestAddTableDriven/positive_+_zero
=== RUN   TestAddTableDriven/negative_+_positive
=== RUN   TestAddTableDriven/both_negative
--- PASS: TestAddTableDriven (0.00s)
    --- PASS: TestAddTableDriven/both_positive (0.00s)
    --- PASS: TestAddTableDriven/positive_+_zero (0.00s)
    --- PASS: TestAddTableDriven/negative_+_positive (0.00s)
    --- PASS: TestAddTableDriven/both_negative (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.385s

Table-Driven Divide Test

We can apply the same pattern to Divide, including checking for divide-by-zero:

// calc_test.go
...
func TestDivideTableDriven(t *testing.T) {
    tests := []struct { // Define test cases
        name      string
        a, b      int
        want      int
        wantPanic bool
    }{
        {"normal division", 10, 2, 5, false},
        {"division by zero", 10, 0, 0, true},
    }

    for _, tt := range tests { // Loop over each test case
        t.Run(tt.name, func(t *testing.T) { // Run each case as a subtest
            if tt.wantPanic { // Only install the recover check when a panic is expected
                defer func() {
                    if r := recover(); r == nil { // nil means Divide did not panic
                        t.Errorf("Divide(%d, %d) did not panic", tt.a, tt.b)
                    }
                }()
            }
            got := Divide(tt.a, tt.b) // Panics here in the divide-by-zero case
            if !tt.wantPanic && got != tt.want { // Compare results only for non-panicking cases
                t.Errorf("Divide(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want)
            }
        })
    }
}

This example shows how to handle both normal and panic cases in a single table-driven test:

The wantPanic field tells the test whether we expect a panic.
We use defer and recover to check for a panic when needed.
Normal test cases still check the result as usual.

Run all tests as before:

$ go test -v
=== RUN   TestAddTableDriven
=== RUN   TestAddTableDriven/both_positive
=== RUN   TestAddTableDriven/positive_+_zero
=== RUN   TestAddTableDriven/negative_+_positive
=== RUN   TestAddTableDriven/both_negative
--- PASS: TestAddTableDriven (0.00s)
    --- PASS: TestAddTableDriven/both_positive (0.00s)
    --- PASS: TestAddTableDriven/positive_+_zero (0.00s)
    --- PASS: TestAddTableDriven/negative_+_positive (0.00s)
    --- PASS: TestAddTableDriven/both_negative (0.00s)
=== RUN   TestDivideTableDriven
=== RUN   TestDivideTableDriven/normal_division
=== RUN   TestDivideTableDriven/division_by_zero
--- PASS: TestDivideTableDriven (0.00s)
    --- PASS: TestDivideTableDriven/normal_division (0.00s)
    --- PASS: TestDivideTableDriven/division_by_zero (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.321s

Subtest names make it easy to see which case passed or failed.

Exercise

Try creating your own table-driven test for a new function, Subtract(a, b int) int. Include at least four test cases:

Both positive numbers
Positive minus zero
Negative minus positive
Both negative

Then run your tests and verify the output.

Testing Functions That Return Errors

Many Go functions return an error as the last return value. Writing tests for these functions is slightly different from testing pure functions like our Add or Divide, because you need to check both the result and whether an error occurred.

Safe Divide Function

Let's add a SafeDivide function to return an error instead of panicking:

// calc.go
...
import "fmt"
...
// SafeDivide returns the result of dividing a by b.
// It returns an error if b is zero.
func SafeDivide(a, b int) (int, error) {
    if b == 0 {
        return 0, fmt.Errorf("cannot divide by zero")
    }
    return a / b, nil
}

Writing Tests for `SafeDivide()`

We can use a table-driven test again:

// calc_test.go
func TestSafeDivide(t *testing.T) {
    tests := []struct {
        name      string
        a, b      int
        want      int
        wantError bool
    }{
        {"normal division", 10, 2, 5, false},
        {"division by zero", 10, 0, 0, true},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got, err := SafeDivide(tt.a, tt.b)
            if tt.wantError {
                if err == nil {
                    t.Errorf("SafeDivide(%d, %d) expected error, got nil", tt.a, tt.b)
                }
                return // stop here, no need to check `got`
            }
            if err != nil {
                t.Errorf("SafeDivide(%d, %d) unexpected error: %v", tt.a, tt.b, err)
            }
            if got != tt.want {
                t.Errorf("SafeDivide(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want)
            }
        })
    }
}

What's happening here:

We added a wantError field to indicate whether the test expects an error.
If an error is expected, we check that err != nil. If err is nil instead, we fail the test. Either way we return early, because got is not meaningful in the error case.
If no error is expected, we check both the returned value (got) and that err == nil.
Using t.Run subtests keeps everything organized and readable.

Running the tests again:

$ go test -v
...
=== RUN   TestSafeDivide
=== RUN   TestSafeDivide/normal_division
=== RUN   TestSafeDivide/division_by_zero
--- PASS: TestSafeDivide (0.00s)
    --- PASS: TestSafeDivide/normal_division (0.00s)
    --- PASS: TestSafeDivide/division_by_zero (0.00s)
PASS
ok      _/C_/projects/Articles/Go_Testing       0.323s

The output shows that both the normal and the error cases are handled correctly.

Exercise

Update your Subtract(a, b int) int function to a SafeSubtract(a, b int) (int, error) variant that returns an error if the result would be negative. Then write a table-driven test that covers:

A positive result
Zero result
A negative result (should return an error)

Best Practices and Tips

Writing tests in Go is straightforward, but there are a few conventions and tips that make your tests more readable, maintainable, and idiomatic:

Name Tests Clearly

First, make sure you use descriptive names for test functions and subtests. A good name explains what you're testing and under what conditions.

Here’s an example:

t.Run("Divide positive numbers", func(t *testing.T) { ... })
t.Run("Divide by zero returns error", func(t *testing.T) { ... })

Keep Tests Small and Focused

Each subtest should verify one thing, and each test function should cover a single function or method.

Try to avoid combining multiple unrelated checks in the same test function, and use table-driven tests to keep multiple similar checks concise without losing clarity.

Use Table-Driven Tests for Repetitive Cases

If you find yourself writing multiple similar test functions, switch to a table-driven pattern. It makes it easier to add new cases, reduces duplicated code, and keeps output organized with t.Run.

Check Errors Explicitly

In Go, functions often return an error. Make sure you always check it in tests, even when you expect nil.

You can use the wantError pattern in table-driven tests for clarity.

if tt.wantError {
    if err == nil {
        t.Errorf("expected error, got nil")
    }
}

Avoid Panics When Possible

Panics are fine for some internal checks, but in production code, prefer returning an error.

Your tests can check for panics using defer and recover, but this should be the exception rather than the norm.

Run Tests Frequently

Try to make running tests a habit, for example by running go test -v ./... after every change. Frequent testing helps catch mistakes early and reinforces TDD practices.

Keep Tests in the Same Package

By convention, tests live in the same package as the code they test. You can create _test.go files for testing, and Go automatically recognizes them.

Only use a separate package calc_test if you want to test your code from the outside, like a consumer. External test packages (just like every other external package) cannot access unexported identifiers.

Use t.Fatalf vs t.Errorf Appropriately

t.Errorf reports a failure but continues running the test.
t.Fatalf stops the test immediately, which is useful if subsequent code depends on successful setup.

These tips will help you write clean, maintainable, and idiomatic Go tests that are easy to read and extend. Following these practices early in your Go journey will make testing less intimidating and more effective.

Conclusion

Unit testing in Go may feel different at first, especially if you're coming from ecosystems with heavy frameworks and assertions. But the simplicity of Go's testing tools is one of its strengths: once you understand the conventions, writing, running, and organizing tests becomes predictable and intuitive.

In this guide, you've seen how to:

Write basic test functions with the testing package
Run tests from the command line and interpret the results
Use table-driven tests to cover multiple cases efficiently
Handle functions that return errors and check for expected failures

Beyond these fundamentals, testing is not just about verifying correctness. It's also about confidence. Well-tested code allows you to refactor, experiment, and add new features with less fear of breaking existing functionality.

As you continue writing Go code, try to integrate testing early, follow the idiomatic patterns you've learned, and explore more advanced topics such as:

Using mocks or interfaces to isolate dependencies
Benchmark tests with testing.B
Coverage analysis with go test -cover

The key takeaway is that testing in Go is accessible, flexible, and powerful, even without fancy frameworks. By building these habits now, you'll write code that's more reliable, maintainable, and enjoyable to work with.

Solutions to Exercises

Subtract Function and Tests

// calc.go
package calc

func Subtract(a, b int) int {
    return a - b
}

// calc_test.go
package calc

import "testing"

func TestSubtractTableDriven(t *testing.T) {
    tests := []struct {
        name string
        a, b int
        want int
    }{
        {"both positive", 5, 3, 2},
        {"positive minus zero", 5, 0, 5},
        {"negative minus positive", -1, 4, -5},
        {"both negative", -3, -2, -1},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := Subtract(tt.a, tt.b)
            if got != tt.want {
                t.Errorf("Subtract(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want)
            }
        })
    }
}

SafeSubtract Function and Tests

// calc.go
package calc

import "fmt"

func SafeSubtract(a, b int) (int, error) {
    result := a - b
    if result < 0 {
        return 0, fmt.Errorf("result would be negative")
    }
    return result, nil
}

// calc_test.go
package calc

import "testing"

func TestSafeSubtract(t *testing.T) {
    tests := []struct {
        name      string
        a, b      int
        want      int
        wantError bool
    }{
        {"positive result", 5, 3, 2, false},
        {"zero result", 3, 3, 0, false},
        {"negative result", 2, 5, 0, true},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got, err := SafeSubtract(tt.a, tt.b)
            if tt.wantError {
                if err == nil {
                    t.Errorf("SafeSubtract(%d, %d) expected error, got nil", tt.a, tt.b)
                }
                return
            }
            if err != nil {
                t.Errorf("SafeSubtract(%d, %d) unexpected error: %v", tt.a, tt.b, err)
            }
            if got != tt.want {
                t.Errorf("SafeSubtract(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want)
            }
        })
    }
}

How to Test and Improve AI Applications with an Evaluation Flywheel

Yemi Ojedapo — Mon, 22 Dec 2025 10:18:04 +0000

In traditional programming, developers rely on unit tests to catch mistakes in applications. But when building AI products, that safety net doesn't exist. Responses can shift with model updates, data changes, and subtle fluctuations in prompts or retrieval results. The usual testing methods (unit tests with Pytest or Jest, integration tests, and CI pipelines) fail to catch accuracy drops, hallucinations, or regressions, and these silent failures can become real production risks.

In this article, you’ll learn why traditional testing methods fall short for AI systems and how an evaluation flywheel can be used as a practical approach to testing and improving AI applications. The sections below break the evaluation flywheel down step by step, from identifying the problem to implementing a repeatable evaluation loop.

Why Does Traditional Testing Fail for AI applications?
What is the Evaluation Flywheel?
Drawing Parallels to Familiar Practices
Why Silent Failures Matter: A Real-World Example
How to Create an Evaluation Flywheel
Tools and Frameworks you can use for evaluation
What a Complete Evaluation Loop Looks Like in Practice
Key Takeaways
Conclusion

Why Does Traditional Testing Fail for AI applications?

In standard programming, tests assume deterministic behavior. This means the same input is expected to always produce the same output. For example:

def authenticate_user_age(age: int) -> str:
    limit = 18

    if age >= limit:
        return "Access granted"
    else:
        return "User doesn't meet the age limit"

# Test 
assert authenticate_user_age(20) == "Access granted"
assert authenticate_user_age(16) == "User doesn't meet the age limit"

The response from this function is always predictable. You can write tests once and trust they'll catch errors forever.

However, AI models don’t behave the same way every time, because they generate output based on probabilities. A query like “best programming practices” may produce strong guidance one day, and outdated or incomplete advice the next. This shift can happen because of changes in the underlying model, updates to retrieval components, or gradual data drift. Without a structured evaluation process in place, these inconsistencies slip into production unnoticed and can quietly weaken the system’s performance.
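To make the contrast with the deterministic function concrete, here is a minimal sketch of probabilistic generation. The token distribution is made up for illustration and does not come from any real model. It shows why an exact-match assertion is the wrong tool: the same "prompt" yields different outputs across calls, so the test has to check a property of the output instead.

```python
import random

# Toy next-token distribution (hypothetical numbers, for illustration only)
NEXT_TOKEN_PROBS = {"tests": 0.5, "reviews": 0.3, "comments": 0.2}


def sample_completion(rng: random.Random) -> str:
    """Sample one 'answer' according to the probabilities above."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is reproducible
outputs = [sample_completion(rng) for _ in range(200)]
distinct_outputs = set(outputs)

# Same input, several different outputs: an equality assert would be flaky.
print(distinct_outputs)

# A property check still holds: every output comes from the allowed set.
assert all(token in NEXT_TOKEN_PROBS for token in outputs)
```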

What is the Evaluation Flywheel?

The evaluation flywheel is a continuous improvement system where test cases representing real user behavior are passed through multiple evaluation steps to assess the output of AI models. The results don't just tell you whether the system passed or failed. They feed directly into the next cycle of improvement.

┌─────────────┐
│   Collect   │
│ Test Cases  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│     Run     │
│ Evaluations │
└──────┬──────┘
       │
       ▼
┌─────────────┐      ┌─────────────┐
│  Identify   │─────▶│   Improve   │
│  Failures   │      │   System    │
└─────────────┘      └──────┬──────┘
                            │
                            ▼
                       ┌─────────────┐
                       │   Repeat    │
                       └─────────────┘

Here's how it works in practice:

Collect test cases: Gather examples from real user interactions or create synthetic scenarios. These should reflect the kinds of tasks and inputs your system needs to handle.
Run evaluations: Pass each test case through a series of checks. Checks can be programmatic (automated metrics like relevance scores or hallucination detectors) or require manual review (like verifying legal advice accuracy or brand voice consistency).
Identify failures: Detect where the model goes wrong. This can include hallucinations, irrelevant responses, or mistakes on corner cases.
Improve the system: Based on those failures, refine prompts, improve training or retrieval data, or adjust architectural components.
Repeat the cycle: Re-run the updated system on the existing and newly collected cases. Over time, this grows and strengthens your evaluation suite and boosts system reliability.
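The steps above can be sketched as one small function that runs a single turn of the flywheel. This is only an illustration: run_flywheel_cycle, the two system versions, and the contains_expected check are hypothetical names, and a real system would call a model rather than return canned strings.

```python
from typing import Callable


def run_flywheel_cycle(
    test_cases: list[dict],
    system: Callable[[str], str],
    passes: Callable[[dict, str], bool],
) -> list[dict]:
    """Run every test case through the system and return the failing ones."""
    failures = []
    for case in test_cases:
        response = system(case["question"])
        if not passes(case, response):
            failures.append({**case, "response": response})
    return failures


# Hypothetical stand-ins for two versions of the system under test
def system_v1(question: str) -> str:
    return "We offer refunds in certain cases."


def system_v2(question: str) -> str:
    return "You can get a full refund within 30 days."


def contains_expected(case: dict, response: str) -> bool:
    """Programmatic check: every expected element appears in the response."""
    return all(e.lower() in response.lower() for e in case["expected_elements"])


cases = [
    {
        "question": "What's your refund policy?",
        "expected_elements": ["30 days", "full refund"],
    }
]

failures_v1 = run_flywheel_cycle(cases, system_v1, contains_expected)  # identify failures
failures_v2 = run_flywheel_cycle(cases, system_v2, contains_expected)  # repeat after improving
print(len(failures_v1), len(failures_v2))
```

Keeping the system and the check as parameters means the same loop can be re-run unchanged after every prompt or retrieval change, which is the "repeat" part of the cycle.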

Drawing Parallels to Familiar Practices

If you've written software before, the evaluation flywheel will feel familiar. It mirrors patterns that are already used in engineering. For instance:

Unit tests → Evaluation datasets
Unit tests confirm a function returns the right output. Evaluation datasets play the same role for AI: they're ground-truth queries and answers that guard against regressions.

Test-driven development (TDD) → Evaluation-driven development (EDD)
In TDD, you write tests before code. In EDD, you write evaluation cases before shipping prompts or updating models. This replaces assumptions with verifiable results.

CI/CD pipelines → Continuous evaluation pipelines
CI/CD runs checks automatically on every code change. Continuous evaluation does the same for models: it runs automated quality checks every time you tweak a prompt, retrain, or swap out a component.

The key difference is subtle but important. Traditional software tests check whether a function returns the right value or type. AI evaluation tests check whether the system produces the right meaning. That's harder to measure, but the principle is the same: build a safety net that grows stronger with every cycle.

Why Silent Failures Matter: A Real-World Example

AI systems often behave differently in production than they do in development. A model that seems solid in testing can drift, hallucinate, or silently fail when facing real-world input.

Case in point: a fraud detection model passed all monitoring metrics yet missed a spike in fraud. An ML engineer shared how their production dashboards tracked latency, throughput, and error rates, and everything showed green. But fraudulent transactions were slipping through at twice the normal rate. Nobody noticed because the existing observability tools focused on pipeline health, not prediction quality.

This silent failure caused significant losses for the company. By traditional metrics the system seemed fine: the monitoring measured system performance (latency, throughput, uptime) but ignored what mattered most, prediction accuracy. As fraudsters adapted their tactics, the model drifted, and without proper evaluation loops, the degradation went undetected for weeks.

Source: InsightFinder.

Why This Example Matters

Silent failures aren't always bugs: They often stem from models failing to adapt to shifting patterns in the real world.
Static evaluation isn't enough: You need continuous, real-world feedback loops to detect when assumptions no longer hold.
Data drift has business impact: Model degradation isn't just a technical problem. It translates directly into revenue loss, security breaches, or damaged user trust.
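As a rough illustration of what monitoring prediction quality (rather than pipeline health) could look like, the sketch below compares recall on a recent window of labeled transactions against a baseline. The data, the 0.10 threshold, and the function names are assumptions for illustration and are not taken from the case above.

```python
def recall(labels: list[int], predictions: list[int]) -> float:
    """Fraction of actual positives (1 = fraud) that the model flagged."""
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    actual_positives = sum(labels)
    return true_positives / actual_positives if actual_positives else 1.0


def quality_alert(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """Alert when recall has dropped by more than max_drop since the baseline."""
    return (baseline - current) > max_drop


# Hypothetical labeled windows: same fraud volume, but the model now misses half of it
baseline_recall = recall([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0])
current_recall = recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0])
alert = quality_alert(baseline_recall, current_recall)
print(baseline_recall, current_recall, alert)
```

A check like this depends on getting ground-truth labels (confirmed fraud cases) back into the loop, which latency and uptime dashboards never require.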

How to Create an Evaluation Flywheel

To show how to build a flywheel and how it works, let's create one for a customer support chatbot that answers questions about a SaaS product.

Step 1: Build Your AI System

Create your initial product: prompts, retrieval logic, and integrations. For our chatbot:

def answer_support_question(question: str) -> str:
    # Retrieve relevant docs from knowledge base
    context = retrieve_docs(question, top_k=5)

    # Generate answer using LLM
    prompt = f"""You are a helpful customer support agent.

Context: {context}

Question: {question}

Provide a clear, accurate answer based on the context."""

    response = llm.generate(prompt)
    return response

How this works: This function defines the core chat logic. It takes a customer’s question and returns an AI-generated answer. First, it calls retrieve_docs() to find the five most relevant documents in your knowledge base. These documents provide context about your product or policies. Next, it constructs a prompt that includes this context and the user's question, then sends it to a language model. The LLM reads the context and generates a relevant answer, which the function returns. Note that retrieve_docs() and llm are placeholders for your own retrieval function and model client.
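If you want to run the function above locally before wiring in real services, one option is to stub its two dependencies. The sketch below is an assumption-heavy stand-in: the three-document knowledge base, the word-overlap ranking, and the FakeLLM class are all hypothetical, and real retrieval would normally use embeddings and a vector store.

```python
import re

# Hypothetical in-memory knowledge base
KNOWLEDGE_BASE = [
    "To reset your password, open the settings page and click the reset link sent by email.",
    "Refunds: full refund within 30 days, contact support to start the process.",
    "Data export: use the export button on the dashboard to download a CSV.",
]


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_docs(question: str, top_k: int = 5) -> list[str]:
    """Toy retrieval: rank documents by how many words they share with the question."""
    question_words = _words(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(question_words & _words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


class FakeLLM:
    """Stand-in for a model client: records prompts and returns a canned reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: list[str] = []

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


llm = FakeLLM("Open the settings page and use the reset link we email you.")
top_doc = retrieve_docs("How do I reset my password?", top_k=1)[0]
print(top_doc)
```

Recording the prompts in the fake client also lets a test assert that the retrieved context actually made it into the prompt.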

Step 2: Identify Test Cases

Build an evaluation set that reflects real user behavior. The more representative your test cases are, including common cases, edge cases, and ambiguous inputs, the better your evaluation can catch failures before they reach production.

Sources for test cases:

Previous customer support tickets
Common FAQ topics
Edge cases discovered in beta testing
Synthetic scenarios (hypothetical but realistic queries)

Example test cases:

test_cases = [
    {
        "question": "How do I reset my password?",
        "expected_elements": ["settings page", "reset link", "email"],
        "category": "account_management"
    },
    {
        "question": "What's your refund policy?",
        "expected_elements": ["30 days", "full refund", "contact support"],
        "category": "billing"
    },
    {
        "question": "Can I export my data to CSV?",
        "expected_elements": ["yes", "export button", "dashboard"],
        "category": "features"
    },
    {
        "question": "Does your API support webhooks?",
        "expected_elements": ["yes", "webhook endpoints", "documentation"],
        "category": "technical"
    }
]

How this works: Here, we define a set of representative test cases to evaluate the AI system. Each test case includes the user’s question, a list of key elements expected in the answer, and a category for organization. These cases help ensure the chatbot is tested against real-world scenarios, edge cases, and important information that should appear in responses.
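Because the rest of the loop indexes into these dictionaries by key, it can help to validate the evaluation set before running it. The following is a small sketch under the assumption that every case uses the three keys shown above. The validate_test_cases name is hypothetical.

```python
from collections import Counter

REQUIRED_KEYS = {"question", "expected_elements", "category"}


def validate_test_cases(cases: list[dict]) -> Counter:
    """Check that every case is well formed and return the number of cases per category."""
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"test case {i} is missing keys: {sorted(missing)}")
        if not case["expected_elements"]:
            raise ValueError(f"test case {i} has no expected elements")
    return Counter(case["category"] for case in cases)


sample_cases = [
    {"question": "How do I reset my password?",
     "expected_elements": ["settings page", "reset link", "email"],
     "category": "account_management"},
    {"question": "What's your refund policy?",
     "expected_elements": ["30 days", "full refund", "contact support"],
     "category": "billing"},
]

coverage = validate_test_cases(sample_cases)
print(coverage)
```

The per-category counts give a quick view of which areas of the product the evaluation set covers thinly.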

Step 3: Evaluate Outputs

Define evaluation criteria based on what matters for your use case: accuracy, faithfulness, safety, relevance, tone. Then measure the output against these criteria.

Evaluation happens in two main ways:

Automated Evaluation

Use programmatic metrics and LLM-as-judge patterns:

import re

def evaluate_response(question: str, response: str, expected_elements: list) -> dict:
    scores = {}

    # 1. Key-info coverage: does the response mention every expected element?
    scores['contains_key_info'] = all(
        elem.lower() in response.lower()
        for elem in expected_elements
    )

    # 2. Relevance: semantic similarity between question and response
    scores['relevance'] = calculate_semantic_similarity(question, response)

    # 3. Safety: check for problematic content
    scores['is_safe'] = not contains_harmful_content(response)

    # 4. Helpfulness: use LLM-as-judge
    judge_prompt = f"""Rate the helpfulness of this support response on a scale of 1-5.

Question: {question}
Response: {response}

Score (1-5):"""

    # The judge may reply with extra text ("Score: 4"), so extract the first digit 1-5.
    # 0 means the reply could not be parsed, which flags the case for review.
    match = re.search(r"[1-5]", llm.generate(judge_prompt))
    scores['helpfulness'] = int(match.group()) if match else 0

    return scores

# Run evaluation
for test_case in test_cases:
    response = answer_support_question(test_case['question'])
    scores = evaluate_response(
        test_case['question'],
        response,
        test_case['expected_elements']
    )
    test_case['scores'] = scores
    test_case['response'] = response

How this works: The evaluate_response() function applies four different checks to each AI response:

First, it checks key-information coverage by verifying that all expected elements appear in the response, using simple string matching.
Second, it calculates semantic similarity, a measure of how closely the response's meaning matches the intent of the question, using embeddings.
Third, it runs a safety check to flag any problematic content.
Fourth, it uses an LLM as a judge by asking a model (ideally a stronger one, like GPT-4) to rate the helpfulness of the response on a 1-5 scale.

The loop then runs the evaluation for every test case. It generates a response for each question, evaluates it using the evaluate_response function, and then stores both the scores and the response back in the test case. This creates a complete dataset of test results for analysis and further improvements.
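The calculate_semantic_similarity helper called in evaluate_response is not defined in the example. A minimal stand-in is sketched below, with one loud assumption: it uses word counts instead of real embeddings, so it only measures word overlap. In practice you would swap embed() for a call to an embedding model and keep the cosine computation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in 'embedding': a bag-of-words count vector (not a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors, 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def calculate_semantic_similarity(question: str, response: str) -> float:
    return cosine_similarity(embed(question), embed(response))


question = "How do I reset my password?"
on_topic = calculate_semantic_similarity(question, "To reset your password, open settings.")
off_topic = calculate_semantic_similarity(question, "Our office is closed on Sundays.")
print(round(on_topic, 2), off_topic)
```

With word counts, a correct answer phrased in different vocabulary would score near zero, which is exactly the gap real embeddings are meant to close.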

Common Automated Metrics:

Semantic similarity (0.0–1.0): This is measured by converting the question and response into vector embeddings and calculating cosine similarity. The score shows how closely the response matches the intent of the question, even if the wording differs.
ROUGE / BLEU scores: The model’s output is compared to reference answers by checking n-gram overlap. These metrics help spot regressions, though scores can be modest for open-ended answers.
LLM-as-judge: A stronger model (like GPT-4 or Claude) can rate the response on a fixed scale, such as 1–5. These ratings give a sense of quality and are useful for tracking improvements or drops over time.
Retrieval metrics (Precision@k, Recall@k): For retrieval-based systems, these metrics calculate how many relevant documents appear in the top-k results. Precision shows accuracy of the retrieved set, and recall indicates completeness.
Custom validators: Simple rule-based checks, like regex patterns, keywords, or length limits, ensure responses meet hard requirements. These help catch issues that similarity-based metrics might miss.
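The retrieval metrics in the list above are simple enough to compute by hand. Here is a short sketch using the common definitions (hits in the top k divided by k for precision, and by the number of relevant documents for recall). The document IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranking returned for "What's your refund policy?"
retrieved = ["refund-policy", "payment-processing", "billing-faq", "api-guide"]
relevant = {"refund-policy", "billing-faq"}

p_at_2 = precision_at_k(retrieved, relevant, k=2)
r_at_2 = recall_at_k(retrieved, relevant, k=2)
r_at_4 = recall_at_k(retrieved, relevant, k=4)
print(p_at_2, r_at_2, r_at_4)
```

Raising k from 2 to 4 brings the second relevant document into the results, so recall improves while precision cannot exceed one half here.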

Manual Evaluation

Automated metrics can't capture everything. Subjective qualities like tone, empathy, and brand voice require human judgment, as do small factual errors that slip past keyword checks and similarity scores.

# Flag cases for human review
needs_review = [
    case for case in test_cases 
    if case['scores']['helpfulness'] < 3 
    or not case['scores']['contains_key_info']
]

# SMEs review and annotate
for case in needs_review:
    annotation = get_sme_feedback(case)
    case['human_rating'] = annotation['rating']
    case['improvement_notes'] = annotation['notes']

This code filters the test cases to find responses that need human attention: those scoring below 3 for helpfulness or missing important information. Subject matter experts (SMEs) review these flagged cases and provide ratings and notes. Here, get_sme_feedback() is a placeholder for however you collect that review. Their input helps you spot patterns that automated metrics miss and shows you where to improve your prompts, retrieval setup, or system settings.

When to use manual evaluation:

Assessing tone, empathy, or brand voice
Detecting subtle hallucinations automated checks miss
Validating edge cases with domain-specific nuance
Creating ground truth labels for training evaluation models

Step 4: Learn and Improve

Once you've identified failures, adjust the controllable parts of your AI system (the "configs"):

Common configuration levers:

Prompts — Add instructions, examples, constraints
Retrieval — Change chunk size, top-k, reranking strategy
Model — Switch models, adjust temperature, max tokens
Context — Modify system instructions, add memory
Post-processing — Add validation, formatting, safety filters

Example improvement cycle:

# Problem discovered: Chatbot missing key details
failing_case = {
    "question": "What's your refund policy?",
    "response": "We offer refunds in certain cases.",
    "issue": "Too vague, missing 30-day window and process"
}

# Root cause: Retrieval returning wrong docs
retrieved_docs = retrieve_docs(failing_case['question'], top_k=5)
# Docs about "payment processing" ranked higher than "refund policy"

# Solution 1: Improve retrieval with reranking
def retrieve_docs_v2(question: str, top_k: int) -> str:
    # Initial retrieval
    candidates = vector_search(question, top_k=20)

    # Rerank by relevance
    reranked = rerank_by_relevance(question, candidates)

    return reranked[:top_k]

# Solution 2: Update prompt to require specificity
prompt_v2 = f"""You are a helpful customer support agent.

Context: {context}

Question: {question}

Provide a clear, accurate answer based on the context. Include specific details like:
- Time windows (e.g., "within 30 days")
- Step-by-step processes
- Relevant links or contact methods

Answer:"""

# Re-evaluate
new_response = answer_support_question_v2(failing_case['question'])
new_scores = evaluate_response(
    failing_case['question'],
    new_response,
    ["30 days", "full refund", "contact support"]
)

# Verify improvement
assert new_scores['contains_key_info']
assert new_scores['helpfulness'] >= 4

How this works: In this example, the chatbot's refund answer was too vague. After checking what went wrong, the problem was that the system retrieved docs about payment processing instead of the refund policy.

To resolve this, two changes can be made. First, retrieval is improved by grabbing twenty documents, then picking the best five. Second, the prompt is updated to ask for specific details like dates and steps.

After making these changes, the test runs again to confirm it works: the response now has all the key info and scores at least 4 out of 5. This process turns problems into fixes you can measure.

Step 5: Automate and Repeat

Integrate evaluation into your development workflow using CI/CD:

# .github/workflows/eval.yml
name: Continuous Evaluation

on:
  pull_request:
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        run: python run_evals.py

      - name: Check pass rate
        run: |
          PASS_RATE=$(python calculate_pass_rate.py)
          if (( $(echo "$PASS_RATE < 0.85" | bc -l) )); then
            echo "Pass rate $PASS_RATE below threshold"
            exit 1
          fi

      - name: Upload results
        if: always()  # keep the results even when the pass-rate gate fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

Explanation: This GitHub Actions workflow runs your evaluation suite on every code change. It triggers when a pull request is opened or updated and on every push to the main branch. It checks out your code, runs the full evaluation suite with run_evals.py, then calculates the fraction of test cases that passed. If the pass rate drops below 0.85 (85%), the job fails. With the check marked as required in your branch protection rules, a failing job blocks the merge and keeps quality regressions out of production.
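
The workflow expects calculate_pass_rate.py to print a single number between 0 and 1, but the script itself isn't shown. Here is a minimal sketch of what it might look like, assuming (hypothetically) that run_evals.py produces a JSON list of records with a boolean `passed` field. The record format and the helper names are assumptions for illustration.

```python
import json


def load_results(raw_json: str) -> list:
    """Parse results from a JSON string (assumed format: a list of objects)."""
    return json.loads(raw_json)


def calculate_pass_rate(results: list) -> float:
    """Return the fraction of test cases whose 'passed' flag is true."""
    if not results:
        # Treat an empty run as a failure instead of dividing by zero.
        return 0.0
    passed = sum(1 for record in results if record.get("passed"))
    return passed / len(results)


if __name__ == "__main__":
    # Hypothetical sample standing in for the file run_evals.py would write.
    sample = '[{"id": "refund", "passed": true}, {"id": "api-limits", "passed": false}]'
    # Print only the number so the shell step can capture it with $(...).
    print(f"{calculate_pass_rate(load_results(sample)):.2f}")
```

Returning 0.0 for an empty run is a deliberate choice in this sketch: if the suite silently produced no results, the quality gate should fail rather than pass.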

Key practices for automation:

Version your test cases — Track them in Git alongside code
Set quality gates — Block deployments if pass rate drops below threshold
Monitor trends — Track metrics over time to catch gradual drift
Alert on regressions — Notify team when specific test cases start failing
Sample production traffic — Continuously add real queries to eval dataset
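
To make "alert on regressions" concrete, here is a small sketch that compares two evaluation runs and lists the test cases that used to pass and now fail. It assumes each run is stored as a mapping from test case ID to a pass/fail flag. That storage format and the function name are assumptions for illustration.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Return IDs of test cases that passed in the previous run but fail now."""
    return sorted(
        case_id
        for case_id, passed_now in current.items()
        # New test cases (absent from the previous run) are not regressions.
        if previous.get(case_id, False) and not passed_now
    )


if __name__ == "__main__":
    last_run = {"refund-policy": True, "api-free-plan": True, "shipping": False}
    this_run = {"refund-policy": True, "api-free-plan": False, "shipping": False}
    for case_id in find_regressions(last_run, this_run):
        # In a real pipeline this could post to chat or open an issue.
        print(f"REGRESSION: {case_id}")
```

Tracking specific regressions alongside the overall pass rate helps when a new failure is masked by an unrelated fix that keeps the aggregate number steady.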

Tools and Frameworks you can use for evaluation

Several platforms can help implement continuous evaluation. The one you choose depends on your stack and needs:

If you're building with LLMs: Try LangSmith or Braintrust first. Both handle prompt versioning, evaluation datasets, and tracing out of the box.

If you're doing traditional ML: Weights & Biases is a widely used choice for experiment tracking. If you're in the Microsoft ecosystem, Prompt flow integrates well with Azure.

If you want full control: Build custom with pytest for test execution and MLflow for tracking results. More setup, but you own the entire pipeline.

What a Complete Evaluation Loop Looks Like in Practice

This walkthrough shows how a support chatbot improves after running a single cycle of evaluations. Each stage shows how evaluation signals guide improvements and lock in quality for the next release.

Stage	Before	After
Test Case	"Can I use your API on the free plan?"	Same question
Model Response	"Yes, you can access our API."	"Yes, you can access our API on the free plan with a rate limit of 100 requests per day. For higher limits, upgrade to Pro or Enterprise."
Evaluation Scores	contains_key_info=False, helpfulness=2/5	contains_key_info=True, helpfulness=5/5
Issue Identified	Missing crucial detail: free plan rate limits	N/A (issue resolved)
Analysis / Root Cause	Retrieval returned general API docs; prompt didn’t emphasize limitations	N/A (analysis led to fix)
Fixes Applied	1. Improved retrieval to fetch plan comparison docs 2. Updated prompt: "Always mention plan-specific restrictions" 3. Added validation: Response must mention rate limits if asked	N/A (fix implemented)
Outcome	Test failed, regression not prevented	Test passes, regression prevented
Next Cycle Actions	N/A	1. Add this test case to permanent suite 2. Look for similar issues (other plan-related questions) 3. Monitor production queries for this pattern
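
The third fix in the table, a validation rule that the response must mention rate limits when asked about a plan, could be sketched roughly as follows. The regular expressions and the function name are assumptions chosen to match this one example. A real rule set would need broader patterns.

```python
import re

# Hypothetical patterns: what counts as a "plan question" and a "rate-limit mention".
PLAN_QUESTION = re.compile(r"\b(free|pro|enterprise) plan\b", re.IGNORECASE)
RATE_LIMIT = re.compile(r"\brate limit", re.IGNORECASE)


def violates_rate_limit_rule(question: str, response: str) -> bool:
    """True when a plan-related question gets an answer with no rate-limit detail."""
    asks_about_plan = bool(PLAN_QUESTION.search(question))
    mentions_limit = bool(RATE_LIMIT.search(response))
    return asks_about_plan and not mentions_limit


if __name__ == "__main__":
    question = "Can I use your API on the free plan?"
    before = "Yes, you can access our API."
    after = (
        "Yes, you can access our API on the free plan with a rate limit of "
        "100 requests per day. For higher limits, upgrade to Pro or Enterprise."
    )
    print(violates_rate_limit_rule(question, before))  # rule is violated
    print(violates_rate_limit_rule(question, after))   # rule is satisfied
```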

Next cycle:

Add this test case to permanent suite
Look for similar issues (other plan-related questions)
Monitor if this pattern appears in production queries

Key Takeaways

AI systems need continuous evaluation, not one-time testing — Models drift, data changes, and silent failures accumulate without ongoing checks.
Build evaluation into your workflow from day one — Don't wait until production failures force you to retrofit evaluation.
Start simple, then scale — Begin with 10-20 test cases and basic metrics. Grow your suite as you encounter edge cases.
Automate what you can, involve humans for what you can't — Use programmatic checks for speed, SME review for nuance.
Treat evaluation datasets as first-class artifacts — Version control them, review changes, and grow them over time.
Make evaluation a team sport — Product, engineering, and domain experts should all contribute test cases and evaluation criteria.

Conclusion

Every developer has felt the relief of seeing "all tests passing." In AI systems, that reassurance is often misleading. A model can deploy successfully, meet performance benchmarks, and still produce incorrect, incomplete, or misleading outputs in ways traditional tests miss.

The evaluation flywheel addresses this gap by making model behavior testable in practice. Instead of assuming correctness, it forces the system to answer real questions, measures the quality of those answers, and highlights where performance degrades over time. This shifts evaluation from a one-off validation step into an ongoing part of development.

Evaluation won't eliminate uncertainty completely, but it makes failures visible before they reach users. With failures clearly exposed, teams stop guessing and start fixing based on results. This might mean adjusting prompts, improving retrieval logic, or refining evaluation criteria. Over time, this leads to AI systems that evolve in controlled ways rather than breaking silently.

Resources for further reading

Anthropic's eval guide: https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
OpenAI's evals framework: https://github.com/openai/evals
LangChain evaluation: https://python.langchain.com/docs/guides/evaluation
Arize AI blog: Comprehensive resources on ML observability

How to Build Your First Dynamic Performance Test in Apache JMeter

Mah Noor — Tue, 28 Oct 2025 16:48:10 +0000

As a QA engineer, I have always found performance testing to be one of the most exciting and underrated parts of software testing. Yes, functional testing is important, but it’s of little use if users have to wait 5 seconds for each page to load.

For me personally, there is a deep satisfaction in seeing your product come alive under load and finding out how it’ll actually behave in production when thousands of users are using it.

Performance testing is about discovering how your system performs under real-world pressure in terms of load, concurrency, and throughput. One of the key aspects of performance testing is ensuring that the APIs can endure the expected load. You can do this using tools like Apache JMeter and K6.

In this tutorial, we’ll explore how you can build your first end-to-end performance test in Apache JMeter. You’ll learn to create a test suite that is dynamic (it can run with any test data) and one-click executable (it can be run from the GUI as well as the CLI).

Prerequisites
Introduction to Apache JMeter
Conclusion

Prerequisites

Before you start, make sure you have:

Apache JMeter (5.5 or above) installed.
Java 8 or later configured on your system.

You can check if JMeter is installed by running the command below:

jmeter -v

Note: This tutorial will use the JSONPlaceholder public API. You’ll learn how you can get a post_id and use it in a chain request to get user details.

Let’s get started.

Introduction to Apache JMeter

Apache JMeter is an open-source API load and stress testing tool. It’s a powerful testing tool that supports a wide range of protocols, including HTTP, HTTPS, FTP, JDBC, SOAP, and REST.

JMeter helps you answer critical questions about your APIs, like:

How does my API perform under heavy load?
What’s the maximum number of users it can handle before it starts failing?
Which requests or endpoints are slowing things down?

Let’s go through the step-by-step process of building a dynamic load testing suite with JMeter.

Step 1: Create a New Test Plan

Once JMeter opens, you’ll see an empty Test Plan. Think of this as your main workspace, which holds everything: Test configuration, users, requests, assertions, and results.

Right-click on Test Plan → Add → Threads (Users) → Thread Group to add a thread group. A thread group is essentially a test suite containing our test cases.

Step 2: Configure the Thread Group

To configure the thread group, fill out the following input fields:

Setting	Value	Description
Number of Threads (Users)	5	The number of concurrent virtual users. Here, 5.
Ramp-up Period (seconds)	10	The time JMeter takes to start all threads. With 5 threads and 10 seconds, a new thread starts every 2 seconds.
Loop Count	2	The number of times each thread runs through the thread group.

You’ve now created a small, controlled load test: 5 users × 2 loops gives 10 iterations, so each sampler in the thread group sends 10 requests.
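
The Thread Group numbers can be sanity-checked with a little arithmetic before you run anything. The helper below is a sketch written for this tutorial, not part of JMeter. The function name and the `samplers` parameter are assumptions.

```python
def thread_group_plan(threads: int, ramp_up_s: float, loops: int, samplers: int = 1) -> dict:
    """Estimate the start spacing and request volume for a thread group."""
    # JMeter spreads thread starts evenly across the ramp-up period.
    start_interval_s = ramp_up_s / threads if threads else 0.0
    iterations = threads * loops
    return {
        "start_interval_s": start_interval_s,
        "iterations": iterations,
        # Every iteration runs each sampler in the thread group once.
        "total_samples": iterations * samplers,
    }


if __name__ == "__main__":
    # Settings from the table above, with a single sampler.
    print(thread_group_plan(threads=5, ramp_up_s=10, loops=2))
    # The same settings once a second sampler is added to the thread group.
    print(thread_group_plan(threads=5, ramp_up_s=10, loops=2, samplers=2))
```

Running the numbers first is a cheap way to avoid sending far more traffic to a shared API than you intended.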

Step 3: Add HTTP Request Defaults

When you’re building a suite that covers hundreds of APIs, you don’t need to repeat the same request details in every sampler. JMeter lets you set them once globally using a config element called HTTP Request Defaults. To add this element, follow the steps below:

Right-click on Thread Group → Add → Config Element → HTTP Request Defaults.
Enter the following:
- Protocol: https
- Server Name or IP: jsonplaceholder.typicode.com

This means all requests in this test will automatically use this base URL.

Step 4: Add a CSV Data Set Config (Dynamic Input)

In real projects, APIs rarely use static inputs. Take a login API that you want to run for 100 concurrent users: in a real-world scenario, every login request has a different username and password.

To replicate this in JMeter, you need to run your test with 100 different login credentials. In other words, your test should be data-driven. We can build a data-driven test in JMeter using a CSV file:

Create a file named data.csv with the following content:
```
post_id
1
2
3
4
5
```
Save it in your JMeter project folder.
In JMeter, right-click on Thread Group → Add → Config Element → CSV Data Set Config.
Fill in the following fields:
- Filename: data.csv
- Variable Names: post_id
- Ignore first line: True (the header row is not test data)
- Recycle on EOF: True
- Stop thread on EOF: False

Now each user will pick a new post_id for every iteration from the CSV file.
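
To picture how the rows are handed out, here is a small simulation of reading a CSV with and without recycling. It is only a sketch of the idea, assuming the default sharing mode in which all threads read from the same file pointer. The function name is hypothetical, and which thread receives which value depends on timing.

```python
from itertools import cycle, islice


def values_handed_out(rows: list, reads: int, recycle: bool = True) -> list:
    """Return the values given out, in read order, for a number of reads."""
    # With recycling, the reader wraps back to the first row after the last one.
    source = cycle(rows) if recycle else iter(rows)
    return list(islice(source, reads))


if __name__ == "__main__":
    post_ids = ["1", "2", "3", "4", "5"]
    # 5 users x 2 loops = 10 reads against 5 data rows.
    print(values_handed_out(post_ids, reads=10))
    # Without recycling, the data runs out after 5 reads.
    print(values_handed_out(post_ids, reads=10, recycle=False))
```

With 10 iterations and only 5 rows, recycling is what keeps the second loop supplied with data. If you need every request to use a unique value, add more rows instead.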

Step 5: Add the HTTP Request Sampler

Now let’s add the actual API call we'll test under load. To do this, follow the steps below:

Right-click on Thread Group → Add → Sampler → HTTP Request.
Rename it to Get Post Data.
Set the following fields:
- Method: GET
- Path: /posts/${post_id}

Here ${post_id} dynamically takes its value from your CSV file. Leave the Protocol and Server Name or IP fields blank: they are filled in automatically from the HTTP Request Defaults config element that we added in Step 3.

Step 6: Add a JSON Extractor

When the API returns a response, we can extract a value (like userId) from it and use it later. This is how you build an end-to-end flow in which data fetched from one API (with GET) is passed to the next request, such as a POST or DELETE.

For our API, below is the example response:

{
  "userId": 1,
  "id": 3,
  "title": "fugiat veniam minus",
  "body": "This is an example post body"
}

To extract userId:

Right-click on Get Post Data → Add → Post Processors → JSON Extractor.
Set the variables below in the JSON Extractor:
- Name: Extract User ID
- Variable Name: user_id
- JSON Path Expression: $.userId

Now you can use ${user_id} in the next request, making your test fully dynamic.
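
If the JSON Path expression is unfamiliar, the extraction is roughly equivalent to the following Python sketch. It only illustrates what $.userId selects and how the value feeds the next request path. The function names are hypothetical, and JMeter does this work for you.

```python
import json


def extract_user_id(response_body: str) -> str:
    """Mimic the JSON Path $.userId: read the top-level userId field."""
    data = json.loads(response_body)
    # JMeter variables are strings, so convert the number to text.
    return str(data["userId"])


def build_user_path(user_id: str) -> str:
    """Build the path for the follow-up request, like /users/${user_id}."""
    return f"/users/{user_id}"


if __name__ == "__main__":
    body = """
    {
      "userId": 1,
      "id": 3,
      "title": "fugiat veniam minus",
      "body": "This is an example post body"
    }
    """
    user_id = extract_user_id(body)
    print(build_user_path(user_id))
```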

Step 7: Add an Assertion

Assertions help you verify that your API responds correctly even under load. You can assert on the API response code, response time, or even the response payload. To add an assertion, follow the steps below:

Right-click Get Post Data → Add → Assertions → Response Assertion.
Configure as:
- Response Field to Test: Response Code – This will add an assertion for the response code.
- Pattern Matching Rules: Contains
- Pattern to Test: 200

This ensures JMeter only counts the request as successful if the response code is 200.

Step 8: Add Listeners

We’ll add listeners to display our test results in different forms, such as visually or in a summary. Let’s add two essential ones:

View Results Tree: to view and debug individual requests.
Summary Report: to view performance metrics like response time, error rate, and throughput.

Add them via Thread Group → Add → Listener → [Choose Listener]

Step 9: Run Your Test

Hit the green Start button at the top. JMeter will start sending requests to your API using the dynamic post IDs from your CSV file.

As the test runs:

Green checkmarks in View Results Tree mean successful responses.
Assertion failures will appear in red.
Summary Report will aggregate key metrics.

Step 10: Chain Another Request (Optional)

Let’s take it one step further: we’ll use the extracted user_id from the first response to get user details from the GET users call. To do this, follow the steps below:

Right-click Thread Group → Add → Sampler → HTTP Request.
Rename to Get User Details.
Set:
- Method: GET
- Path: /users/${user_id}

Step 11: Analyze the Results

Once the test completes, open the Summary Report. You’ll see:

Metric	Description
# Samples	Total number of requests sent
Average	Mean response time per request, in milliseconds
Min/Max	Fastest and slowest response times, in milliseconds
Error %	Percentage of failed requests
Throughput	Requests handled per second

If your error percentage is 0% and throughput is stable, your system handled the load well.
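
The same metrics can be recomputed from a saved results file, which is handy when you want to check them in a script. The sketch below assumes CSV results with a header row and the timeStamp, elapsed, and success columns (both times in milliseconds). Throughput is approximated as samples divided by the time between the first sample's start and the last sample's end. The function name and sample data are made up for illustration.

```python
import csv
import io


def summarize_results(csv_text: str) -> dict:
    """Compute Summary Report style metrics from CSV results text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no samples found")
    starts = [int(row["timeStamp"]) for row in rows]
    elapsed = [int(row["elapsed"]) for row in rows]
    failures = sum(1 for row in rows if row["success"].strip().lower() != "true")
    # Wall-clock duration of the run, converted from milliseconds to seconds.
    ends = [start + took for start, took in zip(starts, elapsed)]
    duration_s = (max(ends) - min(starts)) / 1000
    return {
        "samples": len(rows),
        "average_ms": sum(elapsed) / len(elapsed),
        "min_ms": min(elapsed),
        "max_ms": max(elapsed),
        "error_pct": 100 * failures / len(rows),
        "throughput_per_s": len(rows) / duration_s if duration_s > 0 else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample data, not real measurements.
    sample = (
        "timeStamp,elapsed,label,responseCode,success\n"
        "1000,100,Get Post Data,200,true\n"
        "1500,300,Get Post Data,200,true\n"
        "2000,200,Get User Details,500,false\n"
        "2800,200,Get User Details,200,true\n"
    )
    print(summarize_results(sample))
```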

Pro Tips

Parameterize everything. Use multiple CSVs for realistic test flows (users, IDs, tokens).
Add timers (like Constant Timer) to simulate think time between user actions.
Use Assertions wisely. Don’t add extra assertions; focus on key validations such as response time and API status code.

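Generate an HTML report from the command line with the command below. The -n flag runs JMeter in non-GUI mode, -l writes the raw results to results.jtl, and -e -o builds the HTML dashboard in the report folder, which must be empty or not yet exist: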

  jmeter -n -t test-plan.jmx -l results.jtl -e -o report

Example Folder Structure:

Follow the folder structure below for an organized test suite.

performance-test/
├── data.csv
├── test-plan.jmx
├── results.jtl
└── report/
    └── index.html

Conclusion

Performance testing is an essential element of a production readiness checklist for any product. It helps you ensure that your product can handle the expected user load and scale gracefully.

This guide is your first step towards writing end-to-end performance test cases and bridging the gap between being a functional test engineer and a full-stack QA Engineer who understands both quality and scalability.

I hope you found this tutorial helpful. If you want to stay connected or learn more about performance testing, follow me on LinkedIn.

How to Test JavaScript Apps: From Unit Tests to AI-Augmented QA

Ajay Yadav — Wed, 08 Oct 2025 16:07:18 +0000

As a software engineer, you should always be open to the challenges this field brings. Two months ago, my project manager assigned me a task: write test cases for an API. I was super excited because it meant I got to learn something new beyond just coding features.

Now, if you’re thinking “writing test cases isn’t my job as a frontend or backend developer”, then you’re missing the point. That mindset holds you back.

At the very least, every engineer should understand Unit Testing and Integration Testing. Writing test cases isn’t rocket science. It reads almost like plain English and feels very similar to writing JavaScript code.

That said, if you’ve ever tried setting up testing in a JavaScript application, you probably know how complicated and frustrating it can get.

The JavaScript ecosystem is massive, with endless libraries and frameworks. Things shift constantly, new tools replace old ones, and community standards evolve almost overnight. That’s exactly why I decided to write this article.

In it, we’ll explore a modern approach to JavaScript testing, covering practical patterns, workflows, and even how AI-assisted tools are changing the game.

Let’s dive in.

The Evolution of Testing
The Core Layers of Testing
Future of JavaScript Testing
Conclusion
Before We End

The Evolution of Testing

Software testing has been around for as long as software itself. According to IBM (2016), testing started right alongside the very first programs. After World War II, three computer scientists wrote what’s considered to be the first piece of software.

It ran on June 21, 1948, at the University of Manchester in England, performing mathematical calculations with basic machine code instructions.

Since then, testing methods and principles have continuously evolved. As software became more complex and development cycles got faster, the need for reliable and systematic testing grew stronger.

In the early days, the concept of the Testing Pyramid became popular. At the base, you had unit tests, in the middle integration tests, and at the very top a thin layer of end-to-end (E2E) tests. This approach worked well for simpler applications.

But as apps grew more dynamic and interconnected, the pyramid approach began to show its limits. That’s where the Testing Trophy model came in. Instead of overloading with unit tests, it puts greater emphasis on integration testing while still keeping E2E tests and unit tests in balance.

Now, with the rise of AI in QA, testing has entered a new phase. AI-driven tools don’t just run tests, they help generate, maintain, and even self-heal them. This shift is creating a future-ready testing framework designed to handle the complexity of modern software in 2025 and beyond.

The Core Layers of Testing

Testing is not just about finding bugs, but also ensuring reliability, scalability, and user satisfaction. Every testing strategy should cover four main layers:

Unit Testing

Unit testing is a method where you test individual components or units of software in isolation to make sure they work as expected. A unit can be a simple function, a React component, or even a utility module.

When building JavaScript apps, we usually create separate modules or components that later get combined. If any one of those small pieces is broken, the entire application can fail. That’s why unit tests are essential: they catch problems early and ensure reliability before integration.

In the JavaScript ecosystem, there are several tools you can use for writing unit tests:

Vitest – a modern, fast, and developer-friendly testing framework built to work seamlessly with Vite projects.
Jest – one of the most widely used testing frameworks, great for React apps among others.

For this section, we’ll focus on Vitest, because it’s lightweight, super-fast, and feels very natural for modern frontend development. Let’s write a test case for a small module.

Imagine we have a simple utility function that adds two numbers:

// sum.ts
export const sum = function (a: number, b: number) {
  return a + b;
};

Every test typically has 3 parts:

A description (string).
The code execution.
The assertion.

Now, let’s write a unit test for the above function using Vitest.

// sum.test.ts
import { describe, expect, it } from "vitest";
import { sum } from "./sum";

describe("sum function", () => {
  it("should return the sum of two numbers", () => { // 1. description
    const result = sum(2, 3); // 2. code execution
    expect(result).toBe(5);   // 3. assertion
  });

  // ... other test cases
});

// ... other describe blocks

Breaking it down:

describe groups related test cases together. Here, we group everything about the sum function.
it (or test) defines a single test case. In this example: “should return the sum of two numbers.”
expect makes the actual assertion. It checks if the result from sum(2,3) equals 5.

When you run this test, Vitest will quickly execute it and show you whether the function passed or failed.

If the function works, you’ll see 1 passed in green. If it fails, the output will be red with details about what went wrong.

Integration Testing

Now that we’ve covered unit testing, let’s move one step up to integration testing. While unit tests focus on testing individual pieces in isolation, integration tests ensure those pieces work together as expected.

Think of it like assembling Lego blocks: each piece might work fine on its own, but when you connect them, something might not fit right. Integration testing helps you catch those issues early.

In simple terms, Integration testing checks how components and modules interact with each other.

Let’s say we have a React component that fetches user data from an API and displays it on the screen.
We’re no longer just testing one function – we’re testing how the component behaves when it calls an API, manages loading states, and renders data dynamically.

Here’s a simple example:

import { useEffect, useState } from "react";

const User = () => {
  const [users, setUsers] = useState<{ name: string; email: string }[]>([]);
  const [loading, setLoading] = useState(false);

  const fetchUsers = async () => {
    setLoading(true);
    try {
      const res = await fetch("https://api.escuelajs.co/api/v1/users");
      const data = await res.json();
      setUsers(data);
    } catch (e) {
      console.log(e);
    } finally {
      setLoading(false);
    }
  };

  useEffect(() => {
    fetchUsers();
  }, []);

  return (
    <>
      {loading ? (
        <p>Loading...</p>
      ) : (
        <ul>
          {users.map((user, index) => (
            <li key={index}>
              {user.name}: {user.email}
            </li>
          ))}
        </ul>
      )}
    </>
  );
};

export default User;

This component does a few things:

Calls an external API when the component mounts.
Sets a loading state while fetching data.
Renders the fetched users on the screen once the data is ready.

Now, our job is to test the complete flow, from the API call to the rendered UI, using Vitest and React Testing Library.

Here’s what the test file looks like:

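import { render, screen, waitFor } from "@testing-library/react";
import "@testing-library/jest-dom/vitest"; // registers toBeInTheDocument()
import User from "../components/User";
import { describe, test, expect, vi } from "vitest";

describe("User Component", () => {
  test("fetches and displays users successfully", async () => {
    // Stub fetch with sample users so the test doesn't depend on the live API
    vi.stubGlobal(
      "fetch",
      vi.fn().mockResolvedValue({
        json: async () => [
          { name: "Ajay Yadav", email: "ajay.yadav@example.com" },
          { name: "Jane Smith", email: "jane.smith@example.com" },
        ],
      })
    );

    render(<User />);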

    // 1. Initially shows loading
    expect(screen.getByText("Loading...")).toBeInTheDocument();

    // 2. Wait for API response and UI update
    await waitFor(() => {
      expect(
        screen.getByText("Ajay Yadav: ajay.yadav@example.com")
      ).toBeInTheDocument();
      expect(
        screen.getByText("Jane Smith: jane.smith@example.com")
      ).toBeInTheDocument();
    });

    // 3. Loading should disappear
    expect(screen.queryByText("Loading...")).not.toBeInTheDocument();
  });
});

This test looks simple, but it covers the entire flow of our component. Let’s understand it step-by-step:

Render the component: Render the component inside the test environment.
Check the loading state: As soon as the component mounts, the “Loading…” text should appear, indicating that data is being fetched.
Wait for the data to load: Since the API call is asynchronous, use waitFor() to wait until the users are fetched and displayed.
Verify the data: Once the API resolves, check if the user names and emails are correctly rendered on the screen.
Confirm loading disappears: Finally, ensure that the “Loading…” text is removed once the data is displayed, confirming a proper state update.

You can also test how your component behaves when the API fails. For example, you can mock the fetch() call to reject and then verify if an error message appears on the screen.

Vitest and React Testing Library make it easy to mock responses and simulate both success and failure cases, helping you ensure that your app handles real-world scenarios gracefully.

End-to-End Testing

Now that we’ve seen how integration testing ensures that different components work together, let’s move to the third layer, End-to-End (E2E) testing.

While unit and integration tests run in isolated or simulated environments, E2E tests mimic how real users interact with your app.

They open a browser and perform actions like clicking buttons, typing in fields, and verifying what appears on the screen, exactly like a real person would.

Think of E2E testing as putting your entire app on stage and watching if it performs flawlessly in front of the audience. In simple words, E2E testing verifies the full user journey from start to finish.

Let’s take a common example, a login flow. As a developer, you’ve probably built dozens of login forms, but how do you know if they truly work under real conditions? That’s where E2E testing comes in.

You can write E2E tests with tools like Playwright or Cypress, both of which are powerful and popular among developers.

We can simulate a real browser, fill out the login form, submit it, and confirm that the user is redirected to the dashboard. Here’s what a simple E2E test looks like using Playwright:

// tests/login.e2e.ts
import { test, expect } from "@playwright/test";

test("should login successfully", async ({ page }) => {
  // 1. Visit the login page
  await page.goto("http://localhost:3000/login");

  // 2. Fill in the form
  await page.fill('input[name="email"]', "user@example.com");
  await page.fill('input[name="password"]', "password123");

  // 3. Click login button
  await page.click('button[type="submit"]');

  // 4. Wait for navigation and verify success message or dashboard
  await expect(page).toHaveURL("http://localhost:3000/dashboard");
  await expect(page.getByText("Welcome back!")).toBeVisible();
});

Let’s understand what’s happening here step-by-step:

Visit the page: The test opens your web app in a real browser. It navigates to http://localhost:3000/login.
Simulate user input: Playwright fills in the email and password fields, just like a real user typing into the form.
Perform actions: It clicks the login button, triggering all the same logic your frontend and backend would normally handle.
Verify the outcome: Once the user logs in, check if the URL changes to /dashboard and whether a welcome message appears on the screen.

That’s it, you just automated your first user journey from login to dashboard. Cypress can express the same flow with its own API. Either way, the goal is the same: ensuring your app behaves correctly in a real browser, not just in isolated tests.

AI-Augmented Testing

As testing evolves, a new layer has emerged: AI-augmented QA. This isn’t just another tool in the developer’s toolkit. It’s a complete transformation in how software quality is managed.

Traditionally, testing has been a manual process. Engineers wrote, maintained, and updated test cases whenever the product changed. But with AI entering the scene, that manual burden is decreasing.

AI models can now analyze your codebase, understand logic, and generate relevant test cases almost instantly, covering edge cases you might never think of. Tools like GitHub Copilot and CodiumAI already assist in generating smart test suites, while continuously learning from your coding style and past patterns.

Beyond code suggestions, complete AI QA platforms are changing automation itself. For example, an AI QA agent like Bug0 can adjust to UI changes automatically. If a button label or DOM structure changes, its self-healing tests find elements visually instead of depending on fixed selectors.

It also produces real-time test reports with detailed logs and video recordings, helping developers pinpoint UI or data changes causing failures.

With CI/CD integrations like GitHub or GitLab, it can automatically start and validate test runs for every pull request, updating PR checks just like a human QA engineer would.

While AI-assisted testing is powerful, it’s not a full replacement for human judgment. Developers still play a vital role in the following ways:

Deciding what matters: AI can generate test cases, but humans must decide what truly matters for business logic and user experience.
Reviewing AI-generated tests: checking that they are relevant and don’t produce false positives.
Interpreting failures in context: understanding whether a test failure indicates a real bug or an expected change.
Maintaining ethical and data-safe workflows: avoiding the exposure of sensitive data when using cloud-based AI tools.

When used responsibly, AI becomes a testing partner, automating the tedious tasks while leaving creative problem-solving, decision-making, and domain understanding to developers.

This shift marks the beginning of intelligent, autonomous QA. AI isn’t just automating repetitive testing, it’s transforming the process into a continuous, adaptive feedback loop, capable of predicting and resolving failures on its own.

In the coming years, expect testing to evolve into a collaborative process between human engineers and AI copilots, ensuring every release is not just faster, but smarter and more reliable than ever before.

Future of JavaScript Testing

JavaScript testing is changing faster than ever. A few years ago, developers had to deal with tons of testing libraries and confusing setups. Now, things are becoming much more unified, smarter, and easier to work with.

In the future, testing will move from being reactive to proactive. That means instead of catching bugs after they happen, tools will be smart enough to predict and prevent them before they appear.

With AI-powered test generation and real-time monitoring, every commit you make could be automatically checked for reliability and performance without you even running a command.

Frameworks like Vitest, Playwright, and React Testing Library will still be the core tools, but the real progress will come from how they integrate and learn.

We’ll also see tighter CI/CD integrations, where pipelines can automatically adjust based on your test coverage and code risk. Testing won’t feel like an extra step anymore, it’ll become a natural part of development, powered by both human logic and machine intelligence.

In short, the future of JavaScript testing is about speed, intelligence, and automation. A world where developers spend more time building and less time debugging.

Conclusion

Testing isn’t just about preventing bugs, it’s about building confidence. Confidence that your code works, your features scale, and your users have a seamless experience.

Whether it’s unit tests ensuring logic, integration tests validating flow, E2E tests simulating real behavior, or AI-enhanced automation managing it all, testing is the silent force that makes great software possible.

As a developer, understanding how testing fits into your workflow is no longer optional. Rather, it’s a skill that sets you apart. The more you test, the better you code and the faster you ship with peace of mind.

So, the next time someone says writing tests isn’t your job, you’ll know the truth: Testing isn’t extra work. Instead, it’s part of writing better, more reliable software.

Before We End

I hope you found this article insightful. I’m Ajay Yadav, a software developer and content creator.

You can connect with me on:

Twitter/X and LinkedIn, where I share insights to help you improve 0.01% each day.
Check out my GitHub for more projects.
I also run a YouTube Channel where I share content about careers, software engineering, and technical writing.

See you in the next article — until then, keep learning!

How to Use pytest: A Simple Guide to Testing in Python

Olowo Jude — Tue, 08 Jul 2025 20:42:29 +0000

With the recent advancements in AI, tools like ChatGPT have made the development process faster and more accessible. Developers can now write code and build web apps with some well-articulated prompts and careful code reviews.

While this brings an increase in productivity, there's a growing downside. AI-generated code is prone to errors, unexpected bugs, or poor integration with the rest of your code.

Because of these risks, it’s more important than ever to establish robust testing practices to make sure your code is high quality and properly functioning. Various testing tools are available to help solve these challenges, and pytest stands out in the Python ecosystem for its simplicity, flexibility, and powerful features.

In this article, we'll explore the following topics:

Why Use pytest?
How to Write Your First Tests with pytest
How to Run pytest Tests
How to Interpret pytest Results
How to Handle Exceptions in pytest
Advanced pytest Features
Conclusion

By the end of this article, you will have a comprehensive knowledge of pytest and be able to use it in your Python development process.

Pre-requisites

Must have Python installed
An understanding of the Python programming language

Why Use pytest?

pytest is a popular testing framework for Python that makes it easy to write and run tests. Unlike unittest, which requires test classes that inherit from unittest.TestCase, pytest lets you write tests as plain functions or within plain classes. This lets you write clean, readable tests without extra boilerplate.

pytest also supports popular Python frameworks like Flask, Django, and more. Combined with other rich features, pytest equips you with the tools you need to ship reliable software in today’s AI-driven era.

Key features of pytest that make it a preferred testing tool include:

Flexibility: it provides flexibility in test structure by supporting tests for functions, classes, and modules.
Detailed test output: it provides a detailed and readable test output, making it easy to understand test failures and errors.
Automatic test discovery: it automatically discovers tests by looking for files that start with "test_" or end with "_test.py". This eliminates the need for manually specifying test files.
Parameterization: it supports parameterized tests, which allow you to run a single test function with multiple sets of inputs.
Fixtures: fixtures provide setup and teardown logic that helps prevent code repetition. They let you set up baseline conditions for your tests and clean them up after each test.
Plugins and extensions: it has a rich ecosystem of plugins and extensions that add extra functionality, such as detailed test reporting and integration with other tools and Python frameworks like Django and Flask.
Compatibility: it is compatible with other testing frameworks like unittest, allowing you to migrate tests from different testing frameworks and run them seamlessly on it.

How to Write Your First Tests with pytest

This section will guide you through writing your first set of tests using the pytest framework.

pytest is a Python package, and you’ll need to install it before using it. You can do that with the following command:

pip install pytest

NOTE: Following Python's best practices, it’s recommended you install pytest within a virtual environment. Here's a guide to help you set it up.

Next, create a Python file where you will write your tests and import pytest into it using:

import pytest

pytest has 2 basic methods of writing tests, which include:

The function-based method: This method is straightforward for writing tests because you write the tests in individual functions.

Note: Each function name must be prefixed with the word test_ for pytest to discover and run these tests automatically.

Here’s an example of a function-based test:
```
  def test_addition():
      assert 1 + 1 == 2
```
Note: The assert statement used above is Python’s built-in assert. It’s more convenient than the specific methods like assertEqual and assertTrue that unittest requires. Another advantage is that pytest provides more detailed error messages when an assertion fails.
Class-based method: This method is similar to the way of writing tests in unittest, except that your test class does not need to inherit from a base class such as unittest.TestCase. For pytest to discover the class, its name must start with Test. An example is shown below:
```
  class TestMathOperations:
      def test_addition(self):
          assert 1 + 1 == 2
```
This method of writing tests in pytest is useful when you want to group related tests together.
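
To make the grouping idea a little more concrete, here is a minimal sketch that collects several checks for one helper in a single class. The helper normalize_name and the class name are made up for this illustration and are not part of pytest.

```python
def normalize_name(raw):
    """Hypothetical helper: trim surrounding whitespace and title-case a name."""
    return raw.strip().title()


class TestNormalizeName:
    # Every method below is a separate test, but they share one class
    # because they all exercise the same helper.
    def test_strips_whitespace(self):
        assert normalize_name("  alice ") == "Alice"

    def test_title_cases_each_word(self):
        assert normalize_name("bob marley") == "Bob Marley"

    def test_empty_string_stays_empty(self):
        assert normalize_name("") == ""
```

Because the tests live in one class, they can be run together by pointing pytest at the class, or individually by naming a single method.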

How to Run pytest Tests

Running pytest differs slightly from the normal convention of running regular Python scripts.

The general method of running pytest tests is by running the pytest command in your terminal. pytest will automatically look for and run all files of the form test_*.py or *_test.py in the current directory and subdirectories. But while this may be a great way to run tests, pytest offers more flexibility beyond this general method of running tests.

Depending on preferences, you may want to run your test files based on the following:

To run a specific test file: To run tests in a specific file, use the pytest command followed by the file name. For example: pytest test_example.py.
To run tests in a directory: Let’s say you have a directory named Tests that contains some test files. To run all the tests in that directory, use the pytest command followed by the directory and a forward slash. For example: pytest Tests/.
To run tests using specific keywords: To run tests based on a certain keyword, use the command pytest -k "keyword". pytest will look in the current directory and subdirectories and run the tests whose function, class, or file names match that keyword. To restrict the match to a specific file, pass the file name as well. For example: pytest test_example.py -k "keyword".
Run a specific test within a test file: To run only a specific test inside a test file, use the command pytest test_example.py::test_addition. This will run only the test_addition test function within the test_example.py module.
To run all test methods in a specific class: To run all the tests within a specific class, use pytest test_example.py::TestClass. This command would run all the test methods inside the TestClass class in the test_example.py module.
To run a specific test method inside a specific class: To run a specific test inside a specific class, use pytest test_example.py::TestClass::test_addition. This command would run the specific test_addition method within the TestClass class in the test_example.py module.

How to Interpret pytest Results

One major advantage pytest has over other Python testing frameworks is the rich output it provides, which gives very detailed information about the status of your tests.

Let’s use a basic test to understand how to interpret pytest’s output:

import pytest

def test_addition():
    assert 1 + 1 == 3

Run this test, and we get an output similar to the one below:

============================== test session starts ====================================
platform win32 -- Python 3.10.5, pytest-8.4.1, pluggy-1.6.0
rootdir: C:\Users\hp\Desktop\TDD pytest
collected 1 item

test_example.py F                                                                 [100%]

===================================== FAILURES =========================================
____________________________________ test_addition _____________________________________

    def test_addition():
>       assert 1 + 1 == 3
E       assert (1 + 1) == 3

test_example.py:4: AssertionError
============================== short test summary info =================================
FAILED test_example.py::test_addition - assert (1 + 1) == 3
================================= 1 failed in 0.13s ====================================

The above output is divided into several sections. Here’s a breakdown of what each section means:

Test session information:
```
 =============================== test session starts ===============================
 platform win32 -- Python 3.10.5, pytest-8.4.1, pluggy-1.6.0
 rootdir: C:\Users\hp\Desktop\TDD pytest
 collected 1 item
```
- This section displays a summary of the test environment. It begins with a line marker that indicates the beginning of the test session.
- Below the marker, pytest displays information about the operating system, along with the installed versions of Python, pytest and pluggy. (Pluggy is a pytest dependency used to manage plugins.)
- The next line indicates the root directory where the test is being run.
- The last line in this section displays the number of tests found in this directory.
Test status:
```
 test_example.py F                                                              [100%]

 ================================== FAILURES =========================================
 ________________________________ test_addition ______________________________________

     def test_addition():
 >       assert 1 + 1 == 3
 E       assert (1 + 1) == 3

 test_example.py:4: AssertionError
```
- This section displays information about the status of our tests.
- The first line in this section specifies the test file which is being run, followed by the status (F in this case, which indicates a test failure).
- The next set of lines gives specific information about the failed tests. This includes the function where the failure occurred (test_addition), and the exact line of code responsible for the error.
- The last line gives a concise summary of this section. It indicates that the error occurred in test_example.py on line 4 and it was an AssertionError.

Test summary:

 ============================= short test summary info =============================
 FAILED test_example.py::test_addition - assert (1 + 1) == 3
 ================================ 1 failed in 0.13s ================================

This section provides an overall summary of the test.
It indicates that the failed test occurred in test_example.py file in the test_addition function because of an incorrect assertion (1 + 1) == 3 which isn’t true.

Edit the test to use the correct assertion, assert 1 + 1 == 2, and rerun it. This time, the test passes and the output changes:

=============================== test session starts ==================================
platform win32 -- Python 3.10.5, pytest-8.4.1, pluggy-1.6.0
rootdir: C:\Users\hp\Desktop\TDD pytest
collected 1 item

test_example.py .                                                               [100%]

=============================== 1 passed in 0.01s =================================

How to Handle Exceptions in pytest

Exceptions are errors raised while code runs. Sometimes raising an exception is the correct behavior, for example when a function receives invalid input, and your tests should confirm that it happens. pytest offers several built-in mechanisms for this (but we’ll just cover one of them in this article).

The pytest.raises context manager checks whether a block of code raises a specific exception. If the specified exception is raised, the test passes, confirming that the expected error occurred. If the specified exception is not raised, the test fails.

Usage Examples of pytest.raises

Checking for ValueError: In Python, a ValueError is raised when a function receives an argument with an incorrect value. In the example below, we can verify that a ValueError is raised when attempting to calculate the square root of a negative number.

 import pytest
 import math

 def calculate_square_root(value):
     if value < 0:
         raise ValueError("Cannot calculate the square root of a negative number")
     return math.sqrt(value)

 def test_calculate_square_root():
     with pytest.raises(ValueError):
         calculate_square_root(-1)

Checking for ZeroDivisionError: Dividing a number by zero raises a ZeroDivisionError. In this example, we check that this error is raised when dividing a number by zero.

 import pytest

 def divide_numbers(numerator, denominator):
     return numerator / denominator

 def test_divide_numbers():
     with pytest.raises(ZeroDivisionError):
         divide_numbers(10, 0)

Checking for TypeError: A TypeError is raised when an operation is applied to an object of an inappropriate type. Here, we check that this error is raised when adding incompatible data types, such as the string and the integer in the example below.
```
 import pytest

 def add_numbers(a, b):
     return a + b

 def test_add_numbers():
     with pytest.raises(TypeError):
         add_numbers("10", 5)
```

Checking for KeyError: A KeyError is raised when we try to access a dictionary key that doesn’t exist. We can verify and handle this error using the following code:

 import pytest

 def get_value(dictionary, key):
     return dictionary[key]

 def test_get_value():
     with pytest.raises(KeyError):
         get_value({"name": "Alice"}, "age")
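
The examples above only check the type of the exception. As a small extension, the sketch below also checks the error message, using the match argument of pytest.raises and the excinfo object it returns. It reuses the calculate_square_root function from the first example; the test name is made up for illustration.

```python
import math

import pytest


def calculate_square_root(value):
    if value < 0:
        raise ValueError("Cannot calculate the square root of a negative number")
    return math.sqrt(value)


def test_error_message_mentions_negative():
    # match= is treated as a regular expression and searched for
    # in the string form of the raised exception.
    with pytest.raises(ValueError, match="negative number") as excinfo:
        calculate_square_root(-1)

    # excinfo.value is the exception instance that was raised.
    assert "square root" in str(excinfo.value)
```

Checking the message as well as the type helps when a function can raise the same exception type for several different reasons.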

Advanced pytest Features

As a robust testing framework, pytest offers some advanced features that help you manage complex test scenarios. In this section, we will explore some of these advanced features at a beginner-friendly level and demonstrate how you can start applying them in your tests.

1. pytest Markers

When working with a large codebase, sometimes running every single test can be time-consuming. This is where pytest markers come in handy.

A marker is just like a label that you can attach to a test function to categorise it. Once a test is labelled, you can instruct pytest to run only tests with certain markers. For example, you may label some tests as "slow" if they take longer to execute and run them separately from the faster ones.

One advantage of using markers is that they allow you to run specific tests based on categories or specific parameters, and also to skip tests if certain conditions aren’t met.

pytest comes along with some built-in markers that can be quite useful:

@pytest.mark.skip: This marker allows you to skip a test unconditionally, and can be useful when you know a test will fail due to an external issue or incomplete code.

Example:
```
 import pytest

 @pytest.mark.skip(reason="Feature not yet implemented")
 def test_feature():
     pass
```

@pytest.mark.skipif: This marker allows you to skip a test conditionally if certain conditions are met.

Example:

 import sys

 import pytest

 @pytest.mark.skipif(sys.platform == "win32", reason="does not run on windows")
 class TestClass:
     def test_function(self):
         "This test will not run under 'win32' platform"

@pytest.mark.xfail: This marker is attached to tests that are expected to fail, probably due to a bug or incomplete feature. So when pytest runs such tests, it won’t count it as a failure.

Example:
```
 from operator import truediv as divide  # stand-in for the function under test
 import pytest

 @pytest.mark.xfail(reason="division by zero not handled yet")
 def test_divide_by_zero():
     assert divide(10, 0) == 0
```
Note: Detailed information about skipped and xfailed tests is not shown by default to avoid cluttering the output. You can use the -r option to see it, for example pytest -rxXs.

While pytest comes along with some built-in markers, you can also create your own custom markers (but we won’t cover that in this tutorial). Kindly refer to the documentation for more information on working with custom markers.

2. pytest Fixtures

In pytest, fixtures allow you to create reusable default data that can be shared across multiple tests. By using fixtures, you can reduce code repetition, making your tests cleaner and more maintainable.

Fixtures are defined with the @pytest.fixture decorator.

Let’s say we have several tests that rely on a list of user data. Instead of repeating the same data in each test, we can create a fixture to hold this data and pass it to the tests that need it by naming it as a parameter:

import pytest

@pytest.fixture
def user_data():
    return [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35}
    ]

# Test function to check for a specific user by name and age
def test_user_exists(user_data):
    user = {"name": "Alice", "age": 30}

    # Check if the target user is in the list
    assert user in user_data

# Test average age of users
def test_average_age(user_data):
    ages = [user["age"] for user in user_data]
    avg_age = sum(ages) / len(ages)
    assert avg_age == 30

Note: The @pytest.fixture decorator in the code above marks the user_data function as a fixture in pytest. This fixture provides reusable data that can be shared across multiple test functions, allowing them to share the same setup without repeating code.
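
Fixtures can also clean up after a test. The sketch below is a minimal illustration of that idea using a yield fixture: the code before yield acts as setup and the code after it acts as teardown. The user_registry fixture and the two tests are made up for this example.

```python
import pytest


@pytest.fixture
def user_registry():
    # Setup: runs before each test that requests this fixture.
    registry = {"Alice": 30, "Bob": 25}
    yield registry
    # Teardown: runs after the test has finished.
    registry.clear()


def test_lookup_age(user_registry):
    assert user_registry["Alice"] == 30


def test_add_user(user_registry):
    # Each test receives a fresh registry, so this change
    # does not leak into test_lookup_age.
    user_registry["Charlie"] = 35
    assert len(user_registry) == 3
```

By default a fixture is created once per test function, which is why the second test can modify the registry without affecting the first.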

3. Parametrization

Parametrization is a pytest feature that allows you to run a test function with different sets of data at once.

For example: Let’s say you have a function that calculates the square of a number. To provide enough coverage while testing, you would want to test the function with zero, positive, and negative numbers.

Instead of writing separate test functions for each scenario, you can use parametrization to run a test function with different sets of data at once. This approach is more concise, and reduces code duplication.

To use parametrization in pytest, we use the @pytest.mark.parametrize decorator as shown in the example below:

import pytest

# Function to calculate the square of a number
def square_numbers(num):
    return num * num

# Parametrize decorator to test the square function with different inputs
@pytest.mark.parametrize("input_value, expected_output", [
    (2, 4),
    (-3, 9),
    (0, 0),
])
def test_square(input_value, expected_output):
    assert square_numbers(input_value) == expected_output

In the example above, the different input values and expected values are listed in the @pytest.mark.parametrize decorator. We’re testing the square_numbers() function with three different input values: 2, -3, and 0.

For each value, pytest calls the test_square() function and compares the result of square_numbers(input_value) to expected_output.

This approach is more efficient and ensures the function behaves as expected across a variety of cases.
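
When a parametrized test fails, it helps to know which case failed. One possible refinement, sketched below, is to wrap each case in pytest.param and give it a readable id. The id strings are arbitrary labels chosen for this example.

```python
import pytest


def square_numbers(num):
    return num * num


@pytest.mark.parametrize(
    "input_value, expected_output",
    [
        # Each id becomes part of the test name in pytest's output.
        pytest.param(2, 4, id="positive"),
        pytest.param(-3, 9, id="negative"),
        pytest.param(0, 0, id="zero"),
    ],
)
def test_square(input_value, expected_output):
    assert square_numbers(input_value) == expected_output
```

Running pytest with the -v flag then reports the cases as test_square[positive], test_square[negative], and test_square[zero].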

4. pytest Plugins

Plugins are an extension mechanism that allows you to add new functionality to pytest or modify its existing behaviour. These plugins work by providing additional features that extend pytest’s capabilities, which can be useful, especially in complex test scenarios.

pytest has a vast ecosystem of plugins, each designed to suit your different testing needs. You can find the full list of available plugins on PyPI in the pytest Plugin List.

To use a plugin, install it with pip. If you no longer need it, remove it with pip uninstall.

For example:

pip install pytest-NAME
pip uninstall pytest-NAME

Note: NAME in the code above should be replaced with the name of the plugin you want to install.

After installing a plugin, pytest automatically finds and integrates it. There’s no need for any additional configuration.

In this section, we explored some of pytest's advanced features. By leveraging these features, you can now significantly improve the quality of your tests by ensuring they’re more efficient, scalable, and easier to maintain over time.

Conclusion

In this article, you’ve learned the basics of testing with pytest, from writing and interpreting tests to handling exceptions and using advanced features like fixtures and parametrization.

Whether your code is written manually or generated by AI, learning how to write tests empowers you to detect bugs early and build more reliable software. Testing acts as a safety net that boosts your confidence during development and ensures your code works as expected.

If you're ready to go a step further, I’ve written an in-depth article on Test Driven Development in Python. It is a powerful approach where writing tests guides your entire coding process.

If you found this helpful, let me know, share it with your network, or give it a like to help others discover it too.

The Logic, Philosophy, and Science of Software Testing – A Handbook for Developers

Han Qi — Tue, 17 Jun 2025 18:43:38 +0000

In an age of information overload, AI assistance, and rapid technological change, the ability to think clearly and reason soundly has never been more valuable.

This handbook takes you on a journey from fundamental logical principles to their practical applications in software development, scientific reasoning, and critical thinking.

Whether you're a high school student learning to think more clearly, a professional debugging complex systems, or simply someone curious about how sound reasoning works, this handbook provides tools for sharper, more reliable thinking.

What We’ll Cover:

Part I: Foundational Theory

We start with the bedrock of formal logic – understanding implications, truth tables, and the core rules of reasoning.

You'll learn the scaffolding for everything that follows:

How "if-then" statements actually work (spoiler: it's not always intuitive!)
The power of truth tables to map all possible scenarios
Why some arguments are valid while others are logical fallacies
The elegant relationship between Modus Ponens, Modus Tollens, and Contrapositives

Part II: Practical Applications

Here's where logic comes alive in tangible ways:

In Software Development:

How debugging mirrors logical reasoning, and why your tests might be lying to you
The logic behind Test-Driven Development and Mutation Testing

In Scientific Thinking:

Karl Popper's falsification principle and why it matters beyond academia
How Hypothesis Testing is just statistics meets Modus Tollens

In Everyday Reasoning:

Spotting logical fallacies in arguments, media, and your thinking
The art of considering multiple causal paths instead of jumping to conclusions

Part III: Philosophical Depths

The final section confronts the beautiful complexity of applying pure logic to an impure world:

Why perfect "if-and-only-if" relationships are the goal but rarely achievable
How modern software systems hide their complexity
The butterfly effect of bugs and why root cause analysis is often harder than it seems
Formal verification tools: from Prolog to Coq to TLA+

What You'll Gain

For Students:

Critical thinking superpowers: Learn to spot flawed reasoning in arguments, social media, and news
Academic advantage: These concepts appear in debates, philosophy, computer science, mathematics, and statistics

For Software Engineers:

Debugging mastery: Modus Tollens for debugging: "If the output is wrong, what could cause it?"
Testing philosophy: Move beyond "make the tests pass" to "prove the code is correct"
Problem analysis: Avoid jumping to solutions before understanding the real problem
System design: Think more rigorously about failure modes and edge cases, evaluate cause-and-effect relationships in complex systems
Communication and career growth: Present arguments more clearly and persuasively, gain logical thinking skills that separate senior engineers from juniors

For Scientists:

Experimental design: Strengthen your understanding of hypothesis testing and falsifiability
Peer review: Better evaluate the logical soundness of research claims
Grant writing: Structure arguments more persuasively using solid logical foundations

Pre-requisites

I’ll introduce code samples starting in the second half of the article, so knowing a programming language would be helpful. The concepts in this article are programming language-agnostic, but I’ve used Python throughout for readability.

No prior formal logic or philosophy background is strictly necessary, but the following will let you reap the most benefits from this article:

Experience in testing and debugging during software development.
Knowing what a REPL (Read-Eval-Print Loop) is, if you want to try the Proof Assistants.
Knowledge of logical operators (NOT, AND, OR), and the fact that they take 1 or 2 boolean values as input and return a single boolean value as output.
Basic Algebraic Thinking: representing statements as variables (P, Q), the concept of NOT (¬) as an inversion of statements, and the concept that different input combinations can reach the same output.
Exposure to deductive reasoning, where inferences are made based on some facts, and fallacies, which are some ways arguments can be flawed.
Willingness to engage in conceptual back-and-forth between concrete English examples and abstract logical symbols.
Holding possibly conflicting ideas between the ideal logic world and the impure real world.
Openness to challenging intuition and following logical rules before applying your real-world experience.

An Introduction to Logic
Truth Tables: Mapping All Possibilities
Contrapositives, Modus Ponens, Modus Tollens
The Origin of P⟹Q: Science and Reality
Revisiting Argument Forms: Valid Inferences and Common Fallacies
Denying the Antecedent: A Database Example
Assigning Real-World Meanings to Logic
Applying Logic to Software Testing
A Closer Look at Testing
Revisiting the Four Statements for Coding
The Missing Ingredient - If and Only If
Mutation Testing: Testing the Tests
Toward If-and-Only-If Confidence
Real-World Challenges
Glimmers of Hope: Tools and Practices for Clarity
The Power of Falsification in Testing
Proof Assistants
Food for Thought
Q.E.D.: The Enduring Power of Logic in an Uncertain World
Resources
Glossary

An Introduction to Logic

Imagine that the following statement is True:

If you are a coding instructor, then you have a job.

Now, do these make sense?

You have no job, so you are not a coding instructor
You have a job, so you are a coding instructor
You are not a coding instructor, so you have no job

Interpretations

Based on logic:

Statement 1 is correct.
Statement 2 is wrong because you may have other jobs without being a coding instructor.
Statement 3 is wrong because you may or may not have a job, and as before, you may have other jobs without being a coding instructor.
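
The same interpretations can be checked mechanically. The sketch below is only an illustration: it lists every combination of "is a coding instructor" and "has a job", drops the one combination the original statement rules out, and then looks for counterexamples to each of the three statements. The names worlds and always_holds are made up for this example.

```python
from itertools import product

# Each "world" is a pair (is_instructor, has_job).
# The original statement rules out exactly one world:
# being a coding instructor without having a job.
worlds = [
    (instructor, job)
    for instructor, job in product([True, False], repeat=2)
    if not (instructor and not job)
]


def always_holds(given, conclusion):
    """True if `conclusion` is true in every allowed world where `given` is true."""
    return all(conclusion(i, j) for i, j in worlds if given(i, j))


# 1. You have no job, so you are not a coding instructor
statement_1 = always_holds(lambda i, j: not j, lambda i, j: not i)
# 2. You have a job, so you are a coding instructor
statement_2 = always_holds(lambda i, j: j, lambda i, j: i)
# 3. You are not a coding instructor, so you have no job
statement_3 = always_holds(lambda i, j: not i, lambda i, j: not j)

print(statement_1, statement_2, statement_3)  # True False False
```

Statements 2 and 3 both fail because of the same world: someone who is not a coding instructor but still has a job.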

Growing complexity

These statements grow increasingly complex due to:

Changing from 1 valid conclusion (statement 1) to 2 invalid conclusions (statements 2 and 3)
Moving from a clear job status (1, 2) to uncertainty about job existence or type (3).

Let’s get familiar with some notation before seeing how Truth tables help manage this complexity.

Notations

Notation	Meaning	Example (if P="It's raining", Q="The ground is wet")
P, Q	Propositions	P, Q
⟹	Implies / If...then...	P⟹Q ("If it's raining, then the ground is wet")
¬	Not	¬P ("It's not raining")
∧	And (conjunction)	P∧Q ("It's raining and the ground is wet")
∨	Or (disjunction)	P∨Q ("It's raining or the ground is wet")
⟺	If and only if (biconditional)	P⟺Q ("It's raining if and only if the ground is wet")
∴	Therefore	P ⟹ Q: If it's raining, then the ground is wet; P: It's raining; ∴ Q: Therefore, the ground is wet

Truth Tables: Mapping All Possibilities

What is a Truth Table?

A truth table is a powerful tool in logic that helps us determine the overall truth or falsity of a compound logical statement. It does this by systematically listing all possible combinations of truth values (True or False) for its individual component propositions.

For every way the "inputs" (our propositions like P and Q) can be true or false, the truth table shows you the precise "output" (the truth value of the entire logical statement, such as P⟹Q).

Why are Truth Tables Helpful?

Truth tables offer critical benefits for clear thinking:

Clarity and precision: They eliminate ambiguity by explicitly showing the outcome for every single scenario.
Systematic analysis: They ensure no possible combination is missed, which is vital for sound reasoning.
Foundation for understanding: They define how logical rules work, forming the bedrock for analyzing more complex arguments in any domain.

How to Read Our First Truth Table:

Let's examine the truth table for the implication P⟹Q ("If P then Q").

Each row represents a unique scenario, combining the truth values of P and Q to show the resulting truth value of P⟹Q.

P	Q	P⟹Q (If P then Q)	Used In
True	True	True	Modus Ponens ✅
True	False	False	Falsifiability 🚨
False	True	True	No Inference
False	False	True	Modus Tollens ✅

Let's break down each row:

P and Q Columns: These show the input truth values (True or False) for our two propositions. Since each can be one of two values, we have 2×2 = 4 unique combinations, filling all four rows.
P ⟹ Q Column: This is the output truth value of the "If P then Q" statement for each combination of inputs P and Q.
- Row 1: P is True, Q is True.
  - If P is true (you are a coding instructor) and Q is also true (you have a job), then the implication P⟹Q is True. (The "If...then..." statement holds).
  - This row is key for Modus Ponens.
- Row 2: P is True, Q is False
  - If P is true (you are a coding instructor) but Q is false (you have no job), then the implication P⟹Q is False. This is the only scenario that disproves an "if-then" statement.
  - This row is key for Falsifiability.
- Row 3: P is False, Q is True.
  - If P is False (you are not a coding instructor) but Q is True (you have a job), then the implication P⟹Q is still considered True. This can seem counter-intuitive.
  - The reason is that the implication statement only makes a claim about what happens when P is true. If P is false, the implication's claim isn't tested, so it is considered vacuously true.
- Row 4: P is False, Q is False.
  - If P is False (you are not a coding instructor) and Q is False (you have no job), then the implication P⟹Q is also considered True.
  - Similar to Row 3, since the initial condition (P) was false, the implication's truth value remains True, as it hasn't been disproven.
  - This row is key for Modus Tollens.

The "Used In" column serves as a preview of the specific logical arguments or concepts that rely on each row's behavior, which we will explore in detail later.

Understanding the Implication (P⟹Q) Deeper

Most programmers are familiar with truth tables from logical operators like AND (∧), OR (∨), and NOT (¬), where they define the output based on combinations of inputs.

The implication (P⟹Q) works similarly. Its output is defined by the rules of propositional logic, not by any real-world causal relationship or your “common sense”. For any given pair of inputs for P and Q, the result of P⟹Q is fixed.

If this feels counter-intuitive, consider that mathematical logic, like any formal system, is built upon agreed-upon axioms. These basic accepted truths allow us to construct complex systems of ideas. If later found ineffective or contradictory, these axioms can be redefined, or a new system can be developed.

In formal logic, this implication is also defined as being logically equivalent to "NOT P OR Q" (¬P∨Q).

This is the fundamental logical rule that dictates why, if P is False, P⟹Q is always True, regardless of Q's truth value. You can also understand this using the NOT P OR Q form.

If P is False, that means NOT P is True.
Using the rules of the OR operation:
- True (Not P) OR True (Q) is True (NOT P OR Q)
- True (Not P) OR False (Q) is True (NOT P OR Q)
- NOT P OR Q is True regardless of what Q is.

The above explains rows 3 and 4 of the truth table from the NOT P OR Q form. As an exercise, you can apply the inputs (P, Q) from the first two rows of the truth table to NOT P OR Q to arrive at the same results defined in the P⟹Q column.
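
For readers who prefer to run the exercise rather than work it by hand, here is a minimal sketch. It copies the P⟹Q column of the truth table into a dictionary and checks that the NOT P OR Q form agrees on all four rows. The names IMPLICATION_TABLE and implies are made up for this illustration.

```python
# The truth table for P => Q, copied row by row: (P, Q) -> P => Q
IMPLICATION_TABLE = {
    (True, True): True,
    (True, False): False,
    (False, True): True,
    (False, False): True,
}


def implies(p, q):
    """Implication written in its NOT P OR Q form."""
    return (not p) or q


# Check every row of the table against the NOT P OR Q form.
for (p, q), expected in IMPLICATION_TABLE.items():
    assert implies(p, q) == expected
    print(f"P={p!s:<5} Q={q!s:<5} NOT P OR Q={implies(p, q)}")

# Collect the rows where the implication is False.
falsifying_rows = [row for row, value in IMPLICATION_TABLE.items() if not value]
print(falsifying_rows)  # [(True, False)]
```

The only falsifying row is P True with Q False, which matches row 2 of the truth table.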

This formal definition allows us to use implication to reason in powerful ways, not just in the "forward" direction (P⟹Q, leading to Modus Ponens), but also in a crucial "backward" direction.

This backward form (Contrapositive) involves swapping and negating the propositions (¬Q⟹¬P).

For example, if "If you are a coding instructor, then you have a job" is true, then it must also be true that "If you have no job (¬Q), then you are not a coding instructor (¬P)".

This "backward" way of reasoning, which underpins Modus Tollens, is a powerful tool for inferring conclusions from observed outcomes.

We'll explore the Contrapositive and two argument forms (Modus Ponens, Modus Tollens) in detail next.

Contrapositives, Modus Ponens, Modus Tollens

We've explored the fundamental implication (P⟹Q) and how truth tables reveal its behavior.

Now, we explore reasoning tools that build upon this foundation: Modus Ponens, Modus Tollens, and the concept of Contrapositives. These are bedrock principles of valid argument and efficient logical thought.

What is Logical Equivalence?

Before we dive into these specific concepts, let's clarify what logical equivalence means. Two statements are logically equivalent if they always have the same truth value under all possible circumstances. In simpler terms, if one statement is true, the other is always true. If one is false, the other is always false. They are, in essence, different ways of saying the same logical thing.

Understanding logical equivalence is incredibly useful. It:

Simplifies logic: It allows us to substitute one statement for another without changing the truth of an argument, which simplifies complex proofs and reasoning.
Reduces complexity: In fields like circuit design, it can lead to fewer physical gates.
Maintains software correctness: In programming, it helps maintain code's correctness during refactoring and debugging, especially when simplifying conditional statements, by ensuring the transformed code still behaves identically to the original under all conditions.

The Contrapositive: An Equivalent Implication

One of the most important logical equivalences involves the Contrapositive of an implication. The contrapositive of an "If P then Q" (P⟹Q) statement is "If not Q, then not P" (¬Q⟹¬P).

You might intuitively question how "If P then Q" could be logically the same as "If not Q then not P." Let's demonstrate this using a truth table.

We'll start with our familiar P and Q columns and the P⟹Q implication. Then, we'll add columns for ¬P (Not P) and ¬Q (Not Q), and finally, the implication for the contrapositive, ¬Q⟹¬P.

Let's look at how the truth table explicitly shows this equivalence:

[Image: truth table with columns P, Q, P⟹Q, ¬P, ¬Q, and ¬Q⟹¬P]

Explanation of the table

P, Q, P ⟹ Q (Columns 1-3): These are our standard propositions and the implication we've already defined.
¬P (Column 4): This column simply shows the negation (opposite truth value) of the P column. If P is True, ¬P is False, and vice-versa.
¬Q (Column 5): Similarly, this column shows the negation of the Q column.
¬Q ⟹ ¬P (Column 6): This is the contrapositive. We apply the same rules for implication that we learned earlier, but now using ¬Q as our "if" part and ¬P as our "then" part. For example, in Row 2, ¬Q is True and ¬P is False. According to the implication rule (True ⟹ False yields False), the result for ¬Q⟹¬P is False.
The Proof of Equivalence: Now, compare Column 3 (P⟹Q) with Column 6 (¬Q⟹¬P). You'll notice that for every single row, their truth values are identical! When P⟹Q is True, ¬Q⟹¬P is also True. When P⟹Q is False, ¬Q⟹¬P is also False. This perfectly illustrates why they are logically equivalent.

So, "If you are a coding instructor, then you have a job" (P⟹Q) is logically the same as saying "If you have no job, then you are not a coding instructor" (¬Q⟹¬P). They convey the same information about the relationship between being a coding instructor and having a job.
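The same comparison can be reproduced in a few lines of code. This is a small sketch that rebuilds the six columns described above and checks that column 3 and column 6 agree on every row. The `implies` helper is a made-up name for this illustration.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication, computed as (NOT P) OR Q."""
    return (not p) or q


# Each tuple holds: P, Q, P=>Q, not P, not Q, (not Q)=>(not P)
table = []
for p, q in product([True, False], repeat=2):
    original = implies(p, q)
    contrapositive = implies(not q, not p)
    table.append((p, q, original, not p, not q, contrapositive))

# Logical equivalence: the two implication columns match in every row.
equivalent = all(row[2] == row[5] for row in table)
print(equivalent)
```

Because the two columns match for all four input combinations, either statement can be substituted for the other without changing the truth of an argument.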

How Modus Ponens and Modus Tollens Relate to Implication

Having defined logical equivalence and the contrapositive, we can now precisely understand two of the most fundamental and valid forms of deductive argument: Modus Ponens and Modus Tollens. Both of these argument forms rely on a core premise that an implication (P⟹Q) is true, and then use additional information to draw a valid conclusion.

Modus Ponens (Affirming the Antecedent): This is often considered the most intuitive and direct form of logical inference. It works in the "forward" direction of the implication.
- Premise 1: We are given that the implication is true: If P, then Q (P⟹Q).
- Premise 2: We are also given that the "if" part, the antecedent, is true: P is true.
- Conclusion: Therefore, we can validly infer that the "then" part, the consequent, must also be true: Q is true.

Example:

Premise 1: If it is raining (P), then the ground is wet (Q).
Premise 2: It is raining (P).
Conclusion: Therefore, the ground is wet (Q).

This directly corresponds to Row 1 (True, True) of our truth table for P⟹Q.

Modus Tollens (Denying the Consequent): This argument form works in the "backward" direction and relies directly on the logical equivalence of an implication and its contrapositive.
- Premise 1: We are given that the implication is true: If P, then Q (P⟹Q).
- Premise 2: We are also given that the "then" part, the consequent, is false: Not Q (¬Q).
- Conclusion: Therefore, we can validly infer that the "if" part, the antecedent, must also be false: Not P (¬P).

Example:

Premise 1: If it is raining (P), then the ground is wet (Q).
Premise 2: The ground is not wet (¬Q).
Conclusion: Therefore, it is not raining (¬P).

Modus Tollens is valid because if P⟹Q is true, its contrapositive (¬Q⟹¬P) must also be true. Applying Modus Ponens to this contrapositive (with ¬Q as our second premise) directly leads to the conclusion ¬P. This corresponds to Row 4 (False, False) of our original truth table for P⟹Q, where P and Q are both false but the implication is still true.

These two argument forms are central to rigorous deductive reasoning, allowing us to draw certain conclusions based on the truth of implications and related facts.
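One way to make "valid" concrete is to say that an argument form is valid when every row of the truth table that makes all premises true also makes the conclusion true. The sketch below applies that check to both forms. The names `implies` and `is_valid` are assumptions made for this illustration.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication, computed as (NOT P) OR Q."""
    return (not p) or q


def is_valid(premises, conclusion) -> bool:
    """True if the conclusion holds in every (P, Q) row where all premises hold."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


# Modus Ponens: from P=>Q and P, conclude Q.
modus_ponens = is_valid([implies, lambda p, q: p], lambda p, q: q)

# Modus Tollens: from P=>Q and not Q, conclude not P.
modus_tollens = is_valid([implies, lambda p, q: not q], lambda p, q: not p)

print(modus_ponens, modus_tollens)
```

For Modus Ponens, the only row that satisfies both premises is Row 1. For Modus Tollens, it is Row 4. In both cases the conclusion is true in that row, so both forms pass the check.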

The Origin of P⟹Q: Science and Reality

In science, hypotheses often take the form "If P, then Q" where P is a cause and Q is its predicted effect, for example, "If a drug is given (P), then symptoms improve (Q)."

Ideally, P is controllable, as in experimental studies, but even in observational studies, P must be clearly defined and measurable.

Each experiment yields one observation, reflecting one of four possible truth-value combinations of P and Q.

The Falsifying Case in Science and Logic

Of the four possible combinations of P and Q that an experiment can produce, only one contradicts the hypothesis:

If P=True, Q=False is observed (row 2 of the truth table), the hypothesis is falsified.
In all other cases, the hypothesis is not falsified (yet).

Thus:

If all observations fall in the 3 truth-preserving rows, the hypothesis remains viable.
If at least one experiment yields P=True, Q=False, we either:
- Conclude falsification, or
- Re-examine the experiment and attempt replication before accepting falsification.
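The rule above can be sketched as a single check over a list of observations. The function name and the sample observations are invented for this illustration, and the sketch ignores the replication step, which is a judgment call rather than a logical operation.

```python
def is_falsified(observations) -> bool:
    """True if any observation is the falsifying case: P is True and Q is False."""
    return any(p and not q for p, q in observations)


# Hypothetical observations, each recorded as (drug_given, symptoms_improved).
surviving = [(True, True), (False, True), (False, False)]
refuted = surviving + [(True, False)]

print(is_falsified(surviving))
print(is_falsified(refuted))
```

No number of observations from the three truth-preserving rows proves the hypothesis. They only leave it standing. A single (True, False) observation is enough to flag it.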

The Power of the Falsifying Case

In the Logical World

The falsifying case is not useful for inference with Modus Ponens or Modus Tollens because these two argument forms require starting with P⟹Q = True. I’ll explain both arguments in detail later.

But the falsifying case is useful for showing counterexamples to disprove the implication, or proof by contradiction.

In the Real Scientific world

The falsifying case embodies Falsifiability – a crucial concept in Science.

In so far as a scientific statement speaks about reality, it must be falsifiable: and in so far as it is not falsifiable, it does not speak about reality.

– Karl R. Popper, The Logic of Scientific Discovery

Scientific theories come about through hypotheses that are continually tested and survive attempts at falsification.

Popperian Falsification and Hypothesis Testing

These two approaches, one philosophical and one statistical, are distinct but complementary in the scientific method.

Popperian Falsification starts with a scientific hypothesis (for example, "P has an effect on Q"). Its core aim is to actively seek evidence that would disprove this hypothesis. If such disproving evidence is found, the hypothesis is falsified.
Statistical Hypothesis Testing begins with a null hypothesis (H0) (for example, "P has no effect on Q"). Its goal is to determine if the collected data provides sufficiently extreme evidence to reject this null hypothesis.

If the null hypothesis is rejected, it provides statistical support for the alternative hypothesis (that P does have an effect on Q). This statistically supported hypothesis then becomes a stronger candidate, continually subjected to further Popperian attempts at falsification through new experiments and observations.

The Nuance: Implication is Not Causality

P⟹Q does not inherently imply that P causes Q.

Consider these examples:

"If the fire alarm is sounding, then there is smoke." The alarm doesn't cause the smoke.
"If a colleague screams during code review, then the code is bad." Does the screaming cause the bad code, or merely reveal it? (Perhaps sometimes both! 😰)

Causality is a real-world concept crucial for making informed decisions, predicting outcomes, and inferring the underlying reasons for events.

It's often central to predictive modeling and supervised learning in data science, where the target variable is the effect and the predictors are proposed causes. A common pitfall here is data leakage, where predictors are inadvertently influenced by (or are themselves effects of) the target, violating the causal assumption.

Logic, however, doesn't model time, mechanisms, or interventions. It only cares about truth values and formal structure. Logic defines what is true based on premises, not what makes something true in a causal sense.

Revisiting Argument Forms: Valid Inferences and Common Fallacies

We've now established the rules of implication, understood logical equivalence, and learned about two powerful, valid argument forms: Modus Ponens and Modus Tollens. But when we try to reason using "if-then" statements, it's easy to fall into common logical traps.

In this section, we'll systematically revisit the four common ways we might try to draw conclusions from an implication P⟹Q (If you are a coding instructor, then you have a job) introduced at the start of the handbook.

Two are valid arguments (Modus Ponens and Modus Tollens), and two are common logical fallacies. Understanding the differences is crucial for sound reasoning.

First, let's quickly define the parts of an "if-then" condition:

Antecedent: The "if" part of the condition (P).
Consequent: The "then" part of the condition (Q).

Now, let's examine these four argument forms, using our knowledge of truth tables and the coding instructor example.

Affirming the Antecedent (Modus Ponens)

This is the first valid argument form we discussed. It's called "affirming the antecedent" because it asserts the truth of the "if" part (the antecedent, P) to conclude the "then" part (the consequent, Q).

Argument Form:
1. If P, then Q (P⟹Q)
2. P is true.
3. Therefore, Q is true.
Examples:
- You are a coding instructor (P), so you have a job (Q).
- You provided invalid input data (P), so the code will show an error (Q).
Interpretation: This argument directly aligns with Row 1 (P=True, Q=True) of our truth table, where the implication holds true. It's often the most intuitive form of logical deduction. In programming, it's natural to expect bad input to lead to error messages if the code is designed correctly.

Denying the Consequent (Modus Tollens)

This is the second valid argument form. It's called "denying the consequent" because it asserts the falsity of the "then" part (the consequent, ¬Q) to conclude the falsity of the "if" part (the antecedent, ¬P). As we learned, Modus Tollens derives its validity from the logical equivalence of P⟹Q and its contrapositive (¬Q⟹¬P).

Argument Form:
1. If P, then Q (P⟹Q)
2. Not Q is true (¬Q).
3. Therefore, Not P is true (¬P).
Examples:
- You have no job (¬Q), so you are not a coding instructor (¬P).
- There are no error messages (¬Q), so the input data is valid (¬P).
Interpretation: This argument corresponds to Row 4 (P=False, Q=False) of our truth table, where P⟹Q is true, and both P and Q are false. This form of reasoning is critical for skillful debugging: it lets you draw a sound conclusion about the cause (P) from an observation of the outcome (Q), provided your program logic (P⟹Q) holds true.

Affirming the Consequent (Fallacy)

Now we move to the common pitfalls. This is an invalid argument form where we attempt to conclude that the antecedent (P) is true simply because the consequent (Q) is true. It's a fallacy because the truth of Q does not guarantee the truth of P, as Q could have been caused by something other than P.

Argument Form (Invalid):
1. If P, then Q (P⟹Q)
2. Q is true.
3. Therefore, P is true. (**Incorrect inference!**🚨)
Examples:
- You have a job (Q), so you are a coding instructor (P).
  - Incorrect: You could have many other jobs.
- The code showed an error (Q), so you provided invalid data (P).
  - Incorrect: Other things besides invalid data can cause errors.
Interpretation: This fallacy highlights the difference between a one-to-one and a many-to-one relationship. Looking at our truth table, when P⟹Q is True and Q is True, P could be True (Row 1) or False (Row 3). The argument mistakenly concludes that P must always be True. The uncertainty arises because observing Q as True doesn't uniquely point to P as the cause. There could be many other reasons or paths that lead to Q.
- Think of walking down a forest path, unaware that another trail has merged into yours from behind you. When retracing your steps in reverse, you encounter a split (Q) at that merge and feel disoriented, unsure which path leads back to your start point (P). Just as multiple paths can converge on the same point, multiple causes can produce the same outcome.

Denying the Antecedent (Fallacy)

This is another invalid argument form. Here, we attempt to conclude that the consequent (Q) is false simply because the antecedent (P) is false. It's a fallacy because P being false does not guarantee that Q will also be false. Q could still be true for other reasons, or the implication might not cover all scenarios where Q occurs.

Argument Form (Invalid):
1. If P, then Q (P⟹Q)
2. Not P is true (¬P).
3. Therefore, Not Q is true (¬Q). (**Incorrect inference!**🚨)
Examples:
- You are not a coding instructor (¬P), so you have no job (¬Q).
  - Incorrect: You could have a different job.
- You provided valid data (¬P), so you have no error (¬Q).
  - Incorrect: Valid data doesn't guarantee no error. Other factors like network issues, memory leaks, or non-idempotent operations can still cause errors.
Interpretation: Similar to Affirming the Consequent, this fallacy stems from incorrectly assuming a unique relationship. From our truth table, when P⟹Q is True and P is False, Q could be True (Row 3) or False (Row 4). The argument mistakenly concludes Q must always be False.

Both of these fallacies (Affirming the Consequent and Denying the Antecedent) creep into our thinking when we prematurely assume a single cause for an effect. In complex real-world systems, many factors can lead to an outcome, and narrowing your thinking too soon can lead to missed bugs or incorrect conclusions.
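To see exactly where each fallacy breaks, it can help to list the truth-table rows in which the premises are all true but the conclusion is false. The sketch below does that. The names `implies` and `counterexamples` are assumptions made for this illustration.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication, computed as (NOT P) OR Q."""
    return (not p) or q


def counterexamples(premises, conclusion):
    """Return the (P, Q) rows where every premise is true but the conclusion is false."""
    return [
        (p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises) and not conclusion(p, q)
    ]


# Affirming the Consequent: from P=>Q and Q, conclude P.
affirming_consequent = counterexamples([implies, lambda p, q: q], lambda p, q: p)

# Denying the Antecedent: from P=>Q and not P, conclude not Q.
denying_antecedent = counterexamples([implies, lambda p, q: not p], lambda p, q: not q)

print(affirming_consequent)  # [(False, True)]
print(denying_antecedent)    # [(False, True)]
```

Both fallacies fail on the same row, Row 3 (P False, Q True). That is the case where Q holds for some reason other than P, such as having a job that is not coding instruction.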

Fallacies and Implication: A Prerequisite

Both fallacies, affirming the consequent and denying the antecedent, assume the underlying implication (P⟹Q) is true.

If this implication is false from the start, there's no logical argument to be made, and thus, no fallacy to speak of.

Exercise: Identifying an Argument Form

Which of the 4 forms of argument is this?

Penguins can’t fly. I can’t fly. Therefore, I’m a penguin.

Hint: Rephrase the first statement into an if-then form.

Denying the Antecedent: A Database Example

We just saw that Denying the Antecedent is a logical fallacy, meaning that even if the initial implication (P⟹Q) is true, concluding ¬Q from ¬P is not a valid inference. To make this abstract concept concrete, and to illustrate why this fallacy can be particularly dangerous in real-world systems like software, let's explore a practical example involving a database.

The implication: If the database is down (P), we’ll see a connection timeout error (Q).

Now, applying the fallacy of Denying the Antecedent, we might incorrectly conclude: If the database is not down (¬P), we will not see a connection timeout error (¬Q). ❌

But even if the database itself is perfectly operational and "not down," you might still encounter a connection timeout error. This could happen due to a variety of other, independent reasons, such as:

Network problems
Firewall rules
The database is up but extremely slow
The query engine is stuck
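The list above can be turned into a toy model. This is only a sketch: the `SystemState` fields and the `connection_times_out` function are invented for this illustration, and a real timeout depends on far more than five boolean flags.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Hypothetical snapshot of the factors listed above."""
    database_down: bool = False
    network_ok: bool = True
    firewall_allows: bool = True
    database_slow: bool = False
    query_engine_stuck: bool = False


def connection_times_out(state: SystemState) -> bool:
    """Toy rule: any one of several independent causes produces a timeout (Q)."""
    return (
        state.database_down
        or not state.network_ok
        or not state.firewall_allows
        or state.database_slow
        or state.query_engine_stuck
    )


# P => Q holds: a down database produces a timeout.
print(connection_times_out(SystemState(database_down=True)))

# Not P does not give not Q: the database is up, yet the timeout still occurs.
print(connection_times_out(SystemState(network_ok=False)))
```

In this model the database being down is sufficient for a timeout but not necessary. Ruling out P therefore tells you nothing definite about Q.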

This specific example of multiple potential causes for a "timeout" highlights a broader, critical skill in software development: thorough case analysis.

This is precisely why technical assessments, especially in areas like algorithms and system design, frequently demand that you consider exhaustive possibilities. For instance, you are often asked to handle base and recursive cases in dynamic programming, or to ensure mutually exclusive and collectively exhaustive coverage when grouping multiple scenarios in problems like interval merging.

Such strong case analysis is vital for minimizing bugs and cultivating an open-minded approach to considering multiple causal paths, driven by experience, curiosity, and a dedication to craftsmanship.

But even perfect case analysis doesn't guarantee a correct implementation. Weak language mastery or mistaken assumptions can still lead to errors, making tests a crucial last line of defense.

Before jumping into applying logic to software testing, let’s practice our agility in conceptually switching between real-world concepts in English and symbols in logic.

Assigning Real-World Meanings to Logic

We must define what P, Q, and P⟹Q refer to when applying logical theory to real-world concepts.

How we define these variables affects our truth tables.

For example:

If P means "valid input," then ¬P means "invalid input."
If P means "invalid input," then ¬P means "valid input."

Imagine we define P = "Good input" and Q = "No Error."

When testing the happy path, we are verifying that the implication P⟹Q (If input is good, then no error) holds true.
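When testing the unhappy path (mutation testing, more details later), we are verifying that ¬P⟹¬Q (If input is not good, then an error occurs) holds true. This is a separate claim that needs its own test: ¬P⟹¬Q is the inverse of P⟹Q, not its contrapositive, so it does not follow from the happy path passing.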

In any test, a failure indicates that the tested implication is false. This warrants investigation into whether the issue lies with the specification's interpretation, the implementation, or even the test itself.
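As a small sketch of the two paths, assume a hypothetical `parse_age` function where "good input" means a whole number from 0 to 150 and "error" means a raised `ValueError`. Neither the function nor the range comes from a real library. They exist only to give P and Q concrete meanings.

```python
def parse_age(text: str) -> int:
    """Hypothetical function under test: accepts whole numbers from 0 to 150."""
    age = int(text)  # int() raises ValueError for non-numeric text
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age


def test_happy_path():
    # P => Q: good input, so no error and the parsed value comes back.
    assert parse_age("42") == 42


def test_unhappy_path():
    # not P => not Q: bad input, so an error must occur.
    for bad_input in ["-1", "abc", "999"]:
        try:
            parse_age(bad_input)
        except ValueError:
            continue  # the expected error occurred
        raise AssertionError(f"expected ValueError for {bad_input!r}")


test_happy_path()
test_unhappy_path()
```

The two tests check two different implications. A passing happy-path test says nothing about whether bad input is rejected, which is why the unhappy path needs its own assertions.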

Applying Logic to Software Testing

Software development relies on constructing systems that behave predictably. Software testing is our primary tool for validating these behaviors. At its core, testing is a process deeply rooted in logical implications, where we propose a hypothesis about our code and then run an experiment (the test) to check its truth.

A test case is carefully designed to evaluate a specific piece of code. This involves:

Setting up Preconditions and Inputs: Before executing the code under test, we meticulously establish a specific environment and provide particular inputs. This includes:
- Function/Method Arguments: The precise values passed into the code being tested.
- System State: Setting up relevant data in a database, preparing the content of a file system, configuring an object's instance variables, or dictating the responses of external services (often through "mocks" or "stubs").
- Environmental Factors: Controlling elements like the current time, specific network conditions, or user permissions relevant to the code's execution. This precise setup ensures that the code runs under defined conditions, allowing us to evaluate its behavior consistently.

Once the setup is complete, the code under test is executed, and its output or behavior is observed. This observation is then compared against an expected result.

To precisely analyze test outcomes, let's establish our specific logical mapping:

P: The code under test is correct for the specific scenario defined by the test. This refers to the actual, objective state of the code's internal logic and implementation when presented with the test's preconditions and inputs. If P is True, the code is without defect for this case. If P is False, there is a bug or deviation.
Q: The test passes. This means the actual output or behavior observed from the code precisely matches the expected outcome defined in our test case. If they do not match, the test fails.
P⟹Q: If the code under test is correct for this specific scenario, then the test will pass. In pure propositional logic, the truth value of P⟹Q is indeed defined by the truth values of P and Q. But in the context of software testing, P⟹Q represents our hypothesis or desired specification for how the code should behave. We don't directly "know" P's truth value beforehand. Instead, the test's execution provides empirical data (the actual Q) that allows us to evaluate whether this hypothesis holds true in practice, and thereby infer the actual state of P.

Understanding this mapping is vital for interpreting test results. Let's examine the different outcomes of a test run, referencing the truth table for P⟹Q:

Row 1: P is True (Code is correct), Q is True (Test passes)
- Interpretation in Testing: Ideal State/Validation
  - This is the desired outcome and strengthens our confidence that the code adheres to its specification.
  - This scenario directly confirms the truth of our hypothesis (P⟹Q).
Row 2: P is True (Code is correct), Q is False (Test fails)
- Interpretation in Testing: Logical Contradiction / Falsification of Hypothesis
  - This row means our overall hypothesis P⟹Q is false for this specific instance.
  - This demands investigation: either our initial assumption that P was True (meaning the code was correct) is wrong (i.e., there's an actual bug, so P is actually False), or the test itself is flawed (its inputs/expectations are incorrect), or the specification is wrong.
  - This is where rethinking of the P⟹Q hypothesis itself happens.
Row 3: P is False (Code is incorrect), Q is True (Test passes)
- Interpretation in Testing: False Positive / Inadequate Test
  - This is a problematic scenario. It implies the test is not robust enough to detect the defect in the code, or the test's expectation is flawed.
  - While P⟹Q remains true vacuously, this outcome is misleading and means the test is not effectively verifying code correctness.
Row 4: P is False (Code is incorrect), Q is False (Test fails)
- Interpretation in Testing: Bug Found / Confirmation of Incorrectness
  - This is a beneficial outcome, as the test has successfully identified a defect.
  - When P is truly False, P⟹Q is vacuously true.
  - This row can represent either a known, intended 'P is False' state (e.g., TDD Red phase) or the actual state discovered via deduction (explained below in Scenario 1).
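The four rows can be summarized as a lookup keyed by (P, Q). This is only a sketch for reference: the labels paraphrase the interpretations above, and in practice P is not directly observable, so the function describes what each combination would mean rather than something a test runner can compute.

```python
# Keys are (code_is_correct, test_passes), in truth-table row order.
INTERPRETATION = {
    (True, True): "Row 1: ideal state, the hypothesis is confirmed",
    (True, False): "Row 2: hypothesis falsified, investigate code, test, and spec",
    (False, True): "Row 3: false positive, the test is inadequate",
    (False, False): "Row 4: bug found",
}


def implies(p: bool, q: bool) -> bool:
    """Material implication, computed as (NOT P) OR Q."""
    return (not p) or q


def interpret(code_is_correct: bool, test_passes: bool):
    """Return the testing interpretation and whether P=>Q holds for this row."""
    label = INTERPRETATION[(code_is_correct, test_passes)]
    return label, implies(code_is_correct, test_passes)


for key in INTERPRETATION:
    print(interpret(*key))
```

Only Row 2 makes P⟹Q false. Rows 3 and 4 keep the implication true, which is why a passing test on incorrect code (Row 3) is so easy to overlook.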

Note on this Contextualized Truth Table and Probabilistic Nature

This truth table differs from a purely abstract logical truth table by being explicitly contextualized for software testing.

Specific Definitions: Unlike a generic P and Q, here they have precise meanings within the domain of code correctness and test outcomes.
"Interpretation in Testing" Column: This is the key distinguishing feature. It translates the raw logical outcomes of (P, Q, and P⟹Q) into actionable insights and common debugging/development scenarios for software engineers. It explains what it means when a particular row is observed in the context of testing.
Probabilistic Confidence: While formal logic operates in binary (True/False), real-world software testing often involves probabilistic confidence. A test doesn't provide absolute logical proof of correctness (for example, a passing test doesn't guarantee P is 100% True due to the possibility of undiscovered bugs or false positives). Instead, test results increase our confidence that the code is correct, or provide strong evidence that it is incorrect. Testing is fundamentally about reducing uncertainty and increasing the probability that our code functions as intended.

Let's now explore how these logical outcomes are interpreted in two common testing scenarios:

Scenario 1: Debugging an Unexpected Defect (Applying Modus Tollens)

This scenario occurs when a test that was previously passing, or a newly written test that we strongly trust as a precise and correct specification, unexpectedly fails. In this context, we assume the validity of the implication P⟹Q for this specific test case, treating it as an unbreakable rule for how correct code should behave.

Our Core Premise (Trusted Specification): We operate under the assumption that the implication "P⟹Q" ("If the code is correct for this scenario, then the test passes") is True for this specific test. Our confidence stems from the test's meticulous design, its history of passing, or its role in a well-established regression suite.
Test Execution and Observation: We run the test, which has its preconditions and inputs set.
- If the Test Fails (Q is False): This is the key observation. Since we trust our premise that P⟹Q is True, and we observe ¬Q (the test fails), we are logically compelled to deduce that our initial belief about P (the code being correct for this scenario) must be false.
  - Application of Modus Tollens:
    - Premise 1: If the code is correct for this scenario (P), then the test passes (Q). (P⟹Q, assumed true as a trusted specification).
    - Premise 2: The test did not pass (¬Q).
    - Conclusion: Therefore, the code is not correct for this scenario (¬P).
  - Outcome: This inference directly points us to a defect in the code. The test's failure, given its trusted nature, reveals that the actual state of the code for this scenario is P is False. This effectively places the scenario in Row 4 (P False, Q False) of our truth table, confirming the presence of a bug that needs fixing. This is typical in regression testing, where a previously correct feature suddenly breaks.

Scenario 2: Validating/Refining the Specification (Falsifying P⟹Q or Confirming Known Incorrectness)

This scenario arises when a test fails, and our primary focus is not immediately on debugging the code as if it's a regression. Instead, it's on understanding why the P⟹Q relationship (our hypothesis for this specific behavior) isn't holding, or simply confirming an expected failure. This can involve questioning the test itself, the underlying requirements, or confirming a deliberately incorrect state of the code.

Our Hypothesis (Being Challenged or Confirmed): We are either actively evaluating the validity of the implication "P⟹Q" for a specific behavior, or we are running a test against code we know is incomplete or incorrect.
Test Execution and Observation: We run the test with its defined preconditions and inputs.
If the Test Fails (Q is False): The interpretation here depends on our prior knowledge or intent about the code's state (P):
- Sub-scenario 2A: Falsifying P⟹Q and Rethinking Specification (Corresponds to Row 2: P True, Q False):
  - We observe Q is False (the test fails).
  - If we then examine the code and the requirements, and we conclude that the code should have been correct for this scenario (meaning, our expectation/belief was P is True), then the test result means the specific instance of our hypothesis "P⟹Q" is FALSE.
  - This direct falsification reveals a contradiction. We must then investigate:
    - Is our initial belief that P was True mistaken (that is, is there a genuine bug in the code that makes P actually False, moving this to a Row 4 scenario)?
    - Or, is the test itself incorrect (its inputs or expected output are wrong), meaning our P⟹Q premise needs to be re-evaluated and corrected?
    - Or, have the underlying requirements changed or been misunderstood?
  - Outcome: This critical outcome prompts us to "rethink" – either the code needs fixing, or the test needs adjusting, or the specification needs clarification. This is common in exploratory testing or when working with new/evolving features where the exact behavior is still being defined.
- Sub-scenario 2B: Confirming Known Incorrectness (Corresponds to Row 4: P False, Q False):
  - We observe Q is False (the test fails).
  - We already know or intentionally designed the code to be incorrect for this scenario (for example, we are actively developing a feature and haven't written the full code yet, or we're running a test against a known, un-fixed bug, so our expectation is P is False).
  - The test result simply confirms our prior knowledge that P is False. The test correctly highlights the missing or incorrect behavior. In this case, the P⟹Q implication is vacuously true, and the test effectively served its purpose of showing the existing defect.
  - Outcome: This is typical in Test-Driven Development (TDD) in the Red phase, where a failing test for a not-yet-implemented feature confirms the "P is False" state, guiding development to make P True. It also applies when verifying that a bug fix indeed works: the test initially fails (confirming the bug), and then passes after the fix (confirming P is now True).
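Sub-scenario 2B can be sketched as a red-then-green cycle. The `slugify` functions and the `run_test` helper are hypothetical names chosen for this illustration. Real TDD would use a test runner rather than a hand-written helper.

```python
def run_test(slugify) -> bool:
    """Return True if the test passes (Q), False if it fails (not Q)."""
    try:
        return slugify("Hello World") == "hello-world"
    except NotImplementedError:
        return False


def slugify_unimplemented(title: str) -> str:
    # Red phase: the feature does not exist yet, so P is known to be False.
    raise NotImplementedError


def slugify_implemented(title: str) -> str:
    # Green phase: the simplest code that satisfies the test.
    return "-".join(title.lower().split())


red = run_test(slugify_unimplemented)
green = run_test(slugify_implemented)
print(red, green)
```

The failing run confirms what was already known (P is False), so it lands in Row 4. The passing run after the change is evidence, not proof, that P is now True for this one scenario.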

A Closer Look at Testing

The Illusion of Correctness: Affirming the Consequent

Consider a common scenario where a test passes, seemingly validating our code:

def get_user_role(user_id):
    if user_id == 42:
        return "admin"
    return "guest"

# test
assert get_user_role(42) == "admin"

Here, our implicit claim (the specification) is: If the code is correct (P), then the output will match the expectation (Q).

In this example, the test passes – the output is "admin" (Q), but can we definitively conclude that the function is correct (P)? Not necessarily.

This scenario often exemplifies the logical fallacy of affirming the consequent. We see the desired outcome (Q) and mistakenly assume that our specific intended cause (P, the correctness of our specific implementation path) was the reason.

The Problem: What if the real condition for an "admin" role should be checking a database, but we have temporarily hardcoded the value for testing? The test would pass, but the correctness is illusory. If we see P as false because the code did not implement the behaviour from the full specification, this corresponds to Row 3 (P False, Q True: False Positive) in our truth table.

As I mentioned before, deliberately implementing ¬P works well if ¬Q is observed, but is not useful, or even erroneous, if Q is observed.
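To make the Row 3 case concrete, the sketch below compares the hardcoded function against a made-up `ROLE_STORE` dictionary that stands in for the database the real implementation should consult. The dictionary and its contents are assumptions for this illustration.

```python
# Hypothetical source of truth that a full implementation would query.
ROLE_STORE = {42: "admin", 7: "admin", 99: "guest"}


def get_user_role(user_id):
    # Hardcoded shortcut from the example above.
    if user_id == 42:
        return "admin"
    return "guest"


# The original single assertion passes (Q is True)...
single_case_passes = get_user_role(42) == "admin"

# ...but a wider comparison shows the code is not correct (P is False).
# Each mismatch maps user_id to (actual, expected).
mismatches = {
    user_id: (get_user_role(user_id), expected)
    for user_id, expected in ROLE_STORE.items()
    if get_user_role(user_id) != expected
}

print(single_case_passes)
print(mismatches)  # {7: ('guest', 'admin')}
```

The single test only ever exercised the one input that the shortcut happens to handle. Adding cases that the shortcut cannot satisfy is what moves the outcome from Row 3 to Row 4.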

Even without hardcoding, the output might match by coincidence, or because of factors outside the direct logic we intended to test. This can happen due to:

Default behavior: A broader system default might produce the expected output.
Caching: A previous successful operation might have cached the result, bypassing the actual logic.
Fallback logic: An unintended fallback mechanism produces the correct output despite an error in the primary path.
Test harness bugs: Flaws in the testing setup itself might obscure real issues.

The Role and Risks of Test Doubles

The challenges highlighted above are particularly relevant when using test doubles, such as Stubs and Mocks. These are artificial components that replace real dependencies (for example, databases, external APIs, time-sensitive operations) during testing.

Stubs focus on state: they provide pre-programmed fake data or return values to get the rest of the code under test working predictably, like the get_user_role example.
Mocks focus on behavior: they allow you to verify interactions, such as the number of calls made to a certain API, or how control flows through specific parts of the system.

Both remove external dependencies, allowing you to isolate and focus on the internal logic of the code without noise or side effects. But using them without understanding their limitations can lead to false confidence.

If a test double simulates a "correct" response, but the real dependency it replaces has a bug, or the way the main code interacts with that dependency is flawed, the test will pass (Q is True) – yet P (the code's overall correctness in a real environment) might be False, leading to a dangerous false positive.
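Here is a minimal sketch of both uses with Python's standard `unittest.mock.Mock`. The `greeting_for` function and the `role_service` object with its `get_role` method are hypothetical and exist only for this illustration.

```python
from unittest.mock import Mock


def greeting_for(user_id, role_service):
    """Hypothetical code under test: builds a greeting from a role lookup."""
    role = role_service.get_role(user_id)
    return f"Welcome, {role}!"


# Stub usage: fix the return value so the code under test runs predictably.
role_service = Mock()
role_service.get_role.return_value = "admin"

result = greeting_for(42, role_service)
assert result == "Welcome, admin!"

# Mock usage: verify the interaction, not just the result.
role_service.get_role.assert_called_once_with(42)
```

Both checks pass without any real role service being involved. They show that `greeting_for` behaves as expected given an "admin" response. They show nothing about whether the real dependency would return "admin" for user 42, which is exactly the gap where a false positive can hide.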

Whether you encounter such logical fallacies in your testing depends on precisely what behavior or state you are attempting to verify, and whether you are over-interpreting the test results.

Test Scope and Interpretation

The choice of testing scope – from narrowly focused unit tests to broader integration tests, system tests, user acceptance tests (UAT), and even testing in production – represents a continuum. On this spectrum, various trade-offs are involved, especially concerning the effort-reward ratio. This effort is influenced by factors like individual developer skill, company engineering practices (for example, responsibility split between feature developer and dedicated tester roles), and industry regulations.

Generally:

Smaller-scoped tests (for example, unit tests) have fewer assumptions baked in and a shorter chain of logical implications. This translates to less risk of committing fallacies in both test implementation and test result interpretation. They are excellent for quickly verifying isolated units of code.
Larger-scoped tests (for example, end-to-end integration tests) incorporate more real-world complexities and dependencies. While providing higher confidence in the system's overall behavior, they inherently increase the potential for confounding factors that can lead to false positives or make debugging more challenging.

Being acutely aware of the assumptions implicit in each test, at every scope level, is paramount. Passing tests for the wrong reasons will inevitably cause problems down the road.

Debugging, Observability, and Mental Models

Failing tests are not failures of the testing process but are, in fact, incredibly valuable learning moments. They represent opportunities to:

Run focused debugging experiments to pinpoint the exact cause of the failure.
Refine your mental model of the code-to-outcome (P⟹Q) link. A failing test (where Q is False) tells you that your current understanding of P, or of the P⟹Q relationship, is flawed. Use this feedback to update your understanding of the code's actual behavior.
Improve both the code and the tests themselves.
Enhance system observability to better detect and confirm outcomes (Q). The more clearly, from multiple angles, and through diverse methods we can observe Q (for example, logs, metrics, tracing, output inspection), the more confident we can be in its causes and, by extension, the actual state of P.

Crucially, avoid blindly fixing tests just to make them pass. Always ensure you thoroughly understand why a test failed and update your P⟹Q model accordingly. The ultimate goal is not just to fix current bugs, but to prevent them in the future by continually strengthening both the correctness of the code and the verifiability of its behavior.

Falsifiable Tests Reveal Regressions

Beyond avoiding false positives (where the code is incorrect but the test passes), a good test must also be falsifiable. This means the test must be genuinely capable of failing under certain (incorrect) conditions. An unfalsifiable test is a broken test – it cannot serve its purpose of revealing regressions or confirming the presence of bugs.

While we strive for the implication P⟹Q to hold true for all the scenarios we care about, it may not be true for all cases due to unforeseen or mistaken assumptions, or simply because the code is incorrect. The test's ability to demonstrate this incorrectness by failing under specific, well-defined conditions makes it profoundly valuable.

Some common culprits for unfalsifiable or "bad" tests include:

Vague or Untestable Specifications: Statements like "The system should behave well under most conditions," "It shouldn't crash randomly," or "The algorithm is robust" lack clear, measurable criteria. It's impossible to design a test that definitively passes or fails against such statements, thus rendering them effectively unfalsifiable.
Broken Implementations of the Test Suite: The test code itself might be flawed, perhaps due to logical errors or control flow issues that prevent assertions from ever being reached or correctly evaluated, inadvertently taking the same passing path regardless of the code under test.
Insufficient Test Data or Edge Cases: If tests only cover "happy path" scenarios and fail to include challenging inputs or boundary conditions, they might pass for incorrect code that only breaks under specific, untested circumstances.

A robust specification clearly defines what constitutes success and failure. Correspondingly, a good test suite correctly implements that specification, making its tests both accurate and truly falsifiable.
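As a small illustration of the "broken test implementation" case, the sketch below contrasts a test that cannot fail with one that can. The function names are hypothetical, and `broken_sort` is deliberately wrong.

```python
def broken_sort(items):
    """Deliberately wrong implementation: returns the input unchanged."""
    return list(items)


def unfalsifiable_test(sort_fn):
    """Always reports a pass: the except clause swallows AssertionError too."""
    try:
        assert sort_fn([3, 1, 2]) == [1, 2, 3]
    except Exception:
        pass  # AssertionError is a subclass of Exception, so it vanishes here
    return True


def falsifiable_test(sort_fn):
    """Reports a pass only when the output really is sorted."""
    return sort_fn([3, 1, 2]) == [1, 2, 3]


results = {
    "unfalsifiable/correct": unfalsifiable_test(sorted),
    "unfalsifiable/broken": unfalsifiable_test(broken_sort),
    "falsifiable/correct": falsifiable_test(sorted),
    "falsifiable/broken": falsifiable_test(broken_sort),
}
print(results)
```

The first test takes the same passing path for correct and incorrect code, so its result carries no information about P. The second one distinguishes the two implementations.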

Take a Step Back

Critical thinkers might observe that the application of the four fundamental logical argument forms to coding scenarios, as initially presented, could be misleading in the complexities of real-world software.

The next section shows some nuances that arise when we transition from the clear-cut rules of formal logic to the often messy reality of software development.

Specifically:

The first two points below show why the seemingly valid arguments of Modus Ponens and Modus Tollens may not always lead to reliable conclusions when applied to coding scenarios.
The last two points below show why the two common logical fallacies, Affirming the Consequent and Denying the Antecedent, may actually provide correct insights under specific real-world coding conditions.

Revisiting the Four Statements for Coding

Here are the four arguments and their associated coding examples:

Modus Ponens: If you provide invalid input data (P), the code will show an error (Q).
Modus Tollens: There are no error messages (¬Q), so the input data is valid (¬P).
Affirming the Consequent (Fallacy): The code showed an error (Q), so you provided invalid data (P).
Denying the Antecedent (Fallacy): You provided valid data (¬P), so you have no error (¬Q).

Now, let's dive into the nuances of each:

Modus Ponens

Our coding example: If you provide invalid input data (P), then the code will show an error (Q).
Why it may not always hold: This application of Modus Ponens assumes that either your code or any third-party code it relies upon will always properly detect and explicitly raise exceptions or show errors on bad data. In reality, systems might automatically fix or sanitize bad input, silence errors, or simply proceed with unexpected behavior without explicitly signaling an error, leading to a passing (or non-failing) state (¬Q) even when P (invalid input) was true.

Modus Tollens

Our coding example: There are no error messages (¬Q), so the input data is valid (¬P).
Why it may not always hold: This application of Modus Tollens assumes there are no automatic mechanisms within the system to fix or silence bad input before errors are typically displayed. If such "silent correction" or "error suppression" occurs, you might observe no error messages (¬Q), but the input data could still be invalid (P), rendering the conclusion (¬P) false despite the premise (¬Q) being true. This highlights the dangers of incomplete observability.
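Silent correction is easy to reproduce. In the sketch below (both parser functions are hypothetical), the lenient version swallows the bad input and returns a default, so observing "no error" says nothing about whether the input was valid.

```python
def parse_age_lenient(raw: str) -> int:
    """Silently 'fixes' bad input by falling back to 0, so no error is shown."""
    try:
        return int(raw)
    except ValueError:
        return 0


def parse_age_strict(raw: str) -> int:
    """Lets int() raise ValueError on bad input, so the error is observable."""
    return int(raw)


lenient_value = parse_age_lenient("twenty")  # invalid input (P), no error (not Q)

try:
    parse_age_strict("twenty")
    strict_raised = False
except ValueError:
    strict_raised = True                     # invalid input (P), error shown (Q)

print(lenient_value, strict_raised)
```

With the lenient parser, both "If P then Q" and the Modus Tollens conclusion drawn from ¬Q break down, because the link between invalid input and a visible error has been cut.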

Affirming the Consequent (Fallacy)

Our coding example: The code showed an error (Q), so you provided invalid data (P).
Why it may actually be correct: While logically a fallacy, in specific, highly constrained real-world conditions, this inference can gain practical validity. If the error message is so uniquely and specifically defined that it can only be caused by invalid input data (P) and no other known factor, then this statement can become reliable. This is rare and typically requires meticulous error handling design where each error message maps unambiguously to a single root cause.

Denying the Antecedent (Fallacy)

Our coding example: You provided valid data (¬P), so you have no error (¬Q).
Why it may actually be correct: Although a fallacy in general logic, this inference can hold a high degree of practical confidence under certain programming paradigms, such as functional programming. If the code is sufficiently simple, purely functional (meaning outputs depend only on inputs and have no side effects), and has no external dependencies (like network or database interactions), then the absence of invalid data (¬P) can indeed make us reasonably confident that there will be no errors (¬Q). The lack of external variables and internal state makes the code's behavior highly predictable and directly tied to its inputs.

You may now be thinking: what’s the point of studying logic if it has so many loopholes and edge cases when applied to coding?

The Missing Ingredient – If and Only If

In our exploration of logical implications, we've focused primarily on the unidirectional relationship P⟹Q ("If P, then Q"). This statement tells us what happens if P is true, but it remains silent on whether Q only happens when P is true. It's like saying, "If it rains, the ground gets wet." This is true, but the ground can also get wet if a sprinkler is on, even if it's not raining.

But in many critical contexts, especially in rigorous scientific theories and robust software systems, we often seek a much stronger relationship: one where the truth of Q absolutely depends on the truth of P, and vice versa. This powerful bidirectional relationship is captured by the phrase "If and Only If" (P⟺Q).

What "If and Only If" Means: A Stronger Statement

When we assert "P⟺Q", we're making two distinct claims simultaneously:

If P, then Q (P⟹Q): P is a sufficient condition for Q. Whenever P is true, Q must also be true.
If Q, then P (Q⟹P): P is also a necessary condition for Q. Whenever Q is true, P must also be true. In other words, Q cannot be true without P being true.

Notice the significant increase in the strength of the statement. "If P, then Q" merely states a consequence. "P⟺Q" declares a definitive equivalence, where P and Q are inextricably linked. They rise and fall together – one cannot be true without the other being true, and one cannot be false without the other being false.

Bidirectional Truth Table: Unambiguous Relationships

Let's construct the truth table for P⟺Q to clearly see this strong relationship.

P⟺Q is logically equivalent to (P⟹Q)∧(Q⟹P).

P	Q	P⟹Q	Q⟹P	P⟺Q
True	True	True	True	True
True	False	False	True	False
False	True	True	False	False
False	False	True	True	True

Creating the Table (columns 4 and 5 are new):

Q⟹P (Column 4): We apply the standard implication rules, but with Q as our "if" and P as our "then." For instance, in Row 3, Q is True and P is False, so Q⟹P is False.
P⟺Q (Column 5): This is the logical AND of the P⟹Q and Q⟹P columns. For P⟺Q to be True, both component implications must be True, which explains why you see fewer Trues in the bidirectional implication than in either of the unidirectional implications.
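The five columns can also be generated mechanically, which is a quick way to double-check the table. This is a small sketch where `implies` is a helper defined here, using the standard definition of material implication (¬a ∨ b).

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: false only when a is True and b is False."""
    return (not a) or b


rows = []
for p, q in product([True, False], repeat=2):  # Rows 1-4: TT, TF, FT, FF
    p_implies_q = implies(p, q)
    q_implies_p = implies(q, p)
    rows.append((p, q, p_implies_q, q_implies_p, p_implies_q and q_implies_p))

print("P      Q      P=>Q   Q=>P   P<=>Q")
for row in rows:
    print("  ".join(f"{str(value):<5}" for value in row))
```

The last column is True exactly when P and Q have the same truth value, which is the "rise and fall together" behavior described above.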

Implications for the Two Common Fallacies

The clarity provided by "If and Only If" is particularly powerful in preventing the very logical fallacies we discussed earlier: Affirming the Consequent and Denying the Antecedent. These fallacies arise from the incorrect assumption that an "if-then" statement implies an "if and only if" relationship.

Let's revisit them through the lens of P⟺Q: "If and only if you provided invalid data (P), the code will show an error (Q)."

Affirming the Consequent: No More Ambiguity

The Fallacy (assuming unidirectional P⟹Q):
- If the code showed an error (Q), then you provided invalid data (P).
- Previously, when P⟹Q was True and Q was True, P could be True (Row 1) or False (Row 3). This ambiguity led to the fallacy.
With P⟺Q:
- Now, look at the P⟺Q column in the table. When P⟺Q is True and Q is True (Row 1), P is unambiguously True. The confusion from Row 3 is gone because if Q were True while P was False, P⟺Q would be False (as Q⟹P would be False), thus making that row irrelevant for valid modus ponens inference under the P⟺Q premise.
- In a system designed with P⟺Q in mind, knowing that Q is True (observing an error) would force the conclusion that P is True (invalid data is the cause), assuming the "if and only if" relationship holds true for that specific system design.

Denying the Antecedent: Unmistakable Consequences

The Fallacy (assuming unidirectional P⟹Q):
- You provided valid data (¬P), so you have no error (¬Q).
- Previously, when P⟹Q was True and P was False, Q could be True (Row 3) or False (Row 4). This ambiguity led to the fallacy.
With P⟺Q:
- Now, when P⟺Q is True and P is False (Row 4), Q is unambiguously False. The problematic scenario from Row 3 (where P was False but Q was True) is irrelevant here because P⟺Q would be False in that case (specifically, Q⟹P would be False).
- If your system genuinely adheres to "P⟺Q", then knowing that P is False (valid data provided) guarantees that Q is False (no error messages).

Practical Mitigation in Coding

The insights from "If and Only If" are more than just theoretical. Practically, both fallacies (Affirming the Consequent and Denying the Antecedent) can be mitigated by striving for conditions that approximate an "if and only if" relationship in your code and tests.

Focused Unit Tests

Design unit tests that are so granular and isolated that they effectively aim to establish an "if and only if" scenario for a tiny piece of logic. By thoroughly mocking or controlling all external dependencies and environmental factors, you reduce the impact of "other causes."

If your test for a specific input passes, you want to be as confident as possible that it passed only because the code handled that specific input correctly, and not due to some irrelevant side effect. Similarly, if it fails, you want to be sure that the failure points directly to the intended logical path.

Exception Handling and Specificity

Instead of catching broad Exception types, catch and handle specific exceptions. This helps differentiate between various "causes" (P1,P2,…) that might lead to a generic "error" (Q). The more precise your error handling, the closer you get to a scenario where "If X error, then Y specific cause," moving towards a bidirectional understanding of error conditions.
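As a sketch of what this looks like in Python, the hypothetical `diagnose` function below maps each specific exception to one named cause instead of collapsing everything into a generic failure. The config shape and the cause labels are assumptions for illustration.

```python
import json


def diagnose(config_text: str) -> str:
    """Return a label naming the single cause of failure, or 'ok'."""
    try:
        config = json.loads(config_text)
        int(config["port"])
    except json.JSONDecodeError:
        # Must come before ValueError: JSONDecodeError is a ValueError subclass.
        return "invalid-json"
    except KeyError:
        return "missing-port"
    except (TypeError, ValueError):
        return "port-not-integer"
    return "ok"


print(diagnose('{"port": 8080}'))    # well-formed config
print(diagnose('{port'))             # malformed JSON
print(diagnose('{}'))                # valid JSON, no "port" key
print(diagnose('{"port": "http"}'))  # "port" present but not numeric
```

Because each label is tied to a narrow cause, seeing "missing-port" lets you reason backwards to the cause with much more confidence than a bare `except Exception` would allow.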

Test-Driven Development (TDD) and Mutation Testing

These methodologies inherently push towards P⟺Q thinking. TDD encourages writing a failing test first (¬Q), which then necessitates a specific code change (P) to make it pass.

Mutation testing, which we'll explore further, takes this a step further by ensuring that your tests are robust enough to fail when code is subtly altered (that is, proving that ¬P leads to ¬Q, and thus, that the original P was indeed necessary for Q).

By consciously aiming for "if and only if" relationships in your code's design and your testing strategies, you can build systems that are not only predictable but also much easier to debug and reason about, moving beyond mere correlation to a deeper understanding of cause and effect.

Callback to Mutation Testing

In the earlier section on Assigning Real-World Meanings to Logic, we discussed:

When testing the happy path, we are verifying that the implication P⟹Q (If input is good, then no error) holds true.

When testing the unhappy path (mutation testing), we are verifying that ¬P⟹¬Q (If input is not good, then an error occurs) holds true.

This dual view is key to understanding how mutation testing contributes to software correctness.

Mutation Testing: Testing the Tests

Mutation testing deliberately introduces small faults (mutations) in the code and checks whether the test suite detects them by failing. This process assesses not the code, but the tests themselves.

In a robust test suite, we strive for two ideal conditions:

All correct implementations should pass the tests.
All incorrect implementations should fail the tests.

If a mutated (wrong) version of the code is introduced and causes no test failures, that defeats the fundamental purpose of testing. It means your tests aren't sensitive enough to catch a deviation from correctness. Mutations reveal hidden assumptions or gaps in your test coverage, acting as a sensitivity probe for your test suite.

Example code mutations:

Changing an arithmetic or comparison operator (+ to -, > to >=).
Flipping a boolean condition (true to false).
Deleting or duplicating a statement.
Modifying a constant value.
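To make the idea concrete, here is a hand-rolled sketch of a single comparison-operator mutation using Python's built-in `ast` module. It is not how any particular tool works internally. The `is_adult` function, the `GtEToGt` transformer, and both test suites are hypothetical names invented for this example.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class GtEToGt(ast.NodeTransformer):
    """Mutation operator: rewrite every '>=' comparison into '>'."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the is_adult function it defines."""
    namespace = {}
    exec(compile(tree, "<generated>", "exec"), namespace)
    return namespace["is_adult"]


original = load(ast.parse(SOURCE))
mutant_tree = ast.fix_missing_locations(GtEToGt().visit(ast.parse(SOURCE)))
mutant = load(mutant_tree)


def weak_suite(fn):
    """Only checks values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Also checks the boundary value 18."""
    return weak_suite(fn) and fn(18) is True


print("weak suite:  ", weak_suite(original), weak_suite(mutant))
print("strong suite:", strong_suite(original), strong_suite(mutant))
```

The mutant survives the weak suite and is killed by the strong one. The surviving mutant points directly at the missing boundary check.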

Common Python mutation testing tools:

mutmut uses Python’s built-in ast module.
cosmic-ray uses parso, which provides a more complete AST.

These tools rely on abstract syntax trees to surgically mutate code.

You can even swap out underlying AST libraries for different precision or completeness: https://github.com/boxed/mutmut/issues/281

Logic Behind Mutation Testing

Let's formalize the logical mapping of mutation testing, recalling our definitions:

Let P: Code is correct.
Let Q: Tests pass.

Standard happy path testing primarily checks that P⟹Q – "if the code is correct, then tests pass."

Mutation testing focuses on the other side of the coin: we intentionally make ¬P true (by introducing a fault), and then we expect ¬Q (the tests should fail). This process rigorously checks whether the implication ¬P⟹¬Q ("if the code is not correct, then the tests fail") holds true for your test suite.

But there's a deeper, more powerful logical implication here:

As we learned earlier, the statement ¬P⟹¬Q is logically equivalent to its contrapositive, Q⟹P.

So, by successfully verifying that introducing a fault (¬P) leads to a test failure (¬Q), we are simultaneously validating the contrapositive: if tests pass (Q), then the code must be correct (P).

This is incredibly significant! It moves us much closer to establishing a bidirectional guarantee between our code and our tests: P⟺Q (code correctness is tightly coupled with test success). Mutation testing helps us confidently eliminate false positives in the test suite – situations where Q is true (the test passes) but P is false (the code is actually incorrect).

In a world where LLMs help us write and refactor code quickly, having this "if and only if" confidence in our test suite is invaluable for ensuring the generated or refactored code truly meets expectations.

Clarifying the Kinds of Failures

In software, we typically categorize errors into three main types:

Syntax errors: Violations of the language's grammatical rules (for example, missing colon, invalid keyword). These prevent the code from running at all.
Runtime errors: Errors that occur during program execution, often due to unexpected conditions (for example, TypeError, AttributeError, ZeroDivisionError).
Logic errors: The program runs without crashing, but it produces an incorrect result or behaves in a way that doesn't match the intended specification (for example, wrong algorithm, wrong return value).

Mutation testing focuses on logic errors – failures where the program runs, but produces incorrect results. These are usually caught via AssertionError in the "Assert" phase of the Arrange–Act–Assert (AAA) testing pattern.

You could argue pedantically that AssertionError is a runtime error, but in testing, we treat it as a signal for logical failure:

"The function ran, but the output didn’t match the expected behavior."

Mutation testing assumes that syntax and runtime errors are already handled. Its purpose is to validate whether the test suite reliably catches logical misbehavior.

A Deeper Falsification Perspective

Now, let's connect mutation testing back to Karl Popper's principle of falsification, which we introduced earlier in the context of scientific reasoning. Recall that Popper argued scientific theories gain strength not by being "proven," but by surviving rigorous attempts to disprove them. The core idea of falsification logic is that to disprove an implication like P⟹Q, you only need to find one instance where P is True and Q is False.

Mutation testing applies this same powerful principle, but to our test suite's effectiveness:

Instead of trying to prove directly that our tests are perfect, mutation testing takes a falsification approach to the implication ¬P⟹¬Q ("If the code is incorrect, then the tests fail"). It actively tries to falsify this crucial relationship.

If we introduce a mutation (making ¬P true, that is, the code is now incorrect) but the existing test suite still passes (meaning Q is true), then we have found an instance where:

¬P is True (the code is incorrect due to the mutation).
Q is True (the test still passes).

In this scenario, the implication ¬P⟹¬Q is falsified because we have a True antecedent (¬P) leading to a False consequent (¬Q is false, because Q is true).

And, critically, if ¬P⟹¬Q is falsified, then its logically equivalent contrapositive, Q⟹P ("If the tests pass, then the code is correct"), is also falsified. This means we can no longer trust that a passing test suite reliably indicates correct code. Our desired P⟺Q relationship is broken – the test suite is no longer fully effective at guaranteeing correctness.

By pushing for zero surviving mutants, mutation testing forces us to minimize the surface area of these "hidden assumptions" in our test suite. It demands highly sensitive and specific tests that can pinpoint even subtle logical flaws, thereby moving us closer to building truly resilient systems.

Comparing TDD (Red Phase) and Mutation Testing

Both methodologies, albeit through different means and at different stages of the development cycle, aim to establish confidence in the ¬P ⟹ ¬Q relationship.

Key Differences Summarized:

Feature	TDD (Red Phase)	Mutation Testing
Primary Goal	Drive new code development. Confirm a bug/feature.	Evaluate the quality/completeness of existing tests.
Code State	Production code is incomplete or buggy.	Production code is (assumed to be) correct.
Test State	The new test is expected to fail.	Existing tests are expected to fail (due to mutants).
Initiator	Developer wanting to add functionality/fix bug.	Tool that inserts artificial bugs into code.
"Bugs"	Actual, intended bugs or missing features.	Artificial, subtle changes to the code.

Toward If-and-Only-If Confidence

Ultimately, the goal in software development is to establish if-and-only-if relationships whenever possible, both in the code implementation and especially in the sensitivity of the test suite to the code under test.

This means if a certain condition (P) is true, then a specific outcome (Q) must occur, and if Q occurs, then P must have been the cause. Achieving this level of clarity comes from:

A deep understanding of the problem.
Aligned expectations during requirements gathering.
Logical analysis and interpretation of well-designed experiments.
Adherence to the Single Responsibility Principle (SRP) from SOLID.
Rigorous tests with meaningful coverage.

This allows us to understand how control flow and data flow work with greater depth and confidence, leading to better inferences throughout the entire software development lifecycle.

Real-World Challenges

While striving for perfect "if-and-only-if" relationships provides a powerful logical ideal, the messy reality of modern software development presents significant hurdles. The very characteristics that make large systems powerful and scalable – their intricate interconnections and inherent dynamism – simultaneously obscure clear cause-and-effect relationships, making precise logical reasoning and debugging an ongoing battle.

A Web of Complexity

Fan-In, Fan-Out: The Nature of Modern Systems

Any reasonably large software system rarely operates through purely linear control and data flows. Fan-out and fan-in patterns – where many components are called and then their results merged – are inevitable.

For example:

In ETL pipelines, data may be ingested from multiple sources (external APIs, CSVs) and logged to multiple destinations (files, databases).
In concurrent programming, Python’s ProcessPoolExecutor splits data into chunks processed in parallel, then recombines the results.

SRP Meets Real-World Boundaries

Just as functional programming must eventually perform I/O, the Single Responsibility Principle (SRP) runs into real-world boundaries, whether conceptual or infrastructural. At some point, something must glue these isolated units together.

Orchestration logic might live in a single function, span multiple files, or even distribute across microservices and machines communicating over networks. While this decomposition enhances modularity, it also increases surface area for bugs involving:

Side effects: Unintended changes to system state outside a component's explicit outputs.
Circular dependencies: Components relying on each other in a loop, leading to difficult-to-trace behavior.
Interface drift: Changes in one component's input/output expectations not being correctly reflected elsewhere.
Race conditions: Timing-dependent bugs in concurrent operations.
Serialization issues: Problems translating data between different formats or systems.
Network unreliability: Unpredictable latency, packet loss, or disconnections in distributed systems.

The Double-Edged Sword of Abstraction

This web of dependencies is the price of progress, made manageable only through better tooling and abstractions.

If boundaries are well-designed, observable, and testable, they enable asynchronous collaboration, improve long-term maintainability, and increase developer confidence. (See GitHub Playbook in References)
If systems lack architectural coherence or fall behind evolving needs, they calcify into technical debt that demoralizes even the most motivated teams.

Clean Code Is Contextual

While abstractions and orchestration help manage complexity, overusing design patterns or creating unnecessary class layers can introduce needless indirection. This is a common counterargument to architectural purism.

Ultimately, what counts as "clean code" is context-dependent. It varies with programmer skill, the tooling at hand (linters, tests, Copilot), and whether the project is a throwaway script or a multi-year infrastructure investment. Architectural practices like SRP should evolve alongside those constraints.

The Butterfly Effect of Bugs

From SRP to Reasoning Chains

Previously, we focused on simple, direct cause-effect logic (P ⟹ Q), but real-world systems are messier.

The more we adhere to SRP through small, focused functions, the more we create longer chains of logic. This improves separation of concerns but also extends the reasoning required to debug behavior.

Debugging in a Causal Fog

A seemingly minor trigger (O) can cascade through a chain like O⟹P⟹Q⟹R, which we may not fully understand due to knowledge silos, evolving requirements, or runtime dynamism.

Even when we understand the components, precisely identifying “P” is hard, much like how redefining a research question shifts the statistical population being studied. In complex systems with feedback loops (for example, recommender engines), there might not be a single "root cause" at all.

Short-Term Triage vs. Long-Term Insight

Finding the true origin of a bug often demands experimentation, telemetry, and broad system insight. These investigations produce robust, future-proof fixes but take time.

In on-call scenarios, however, urgency reshapes priorities. Fast mitigations and clear communication often take precedence over deep diagnosis.

Masked by Design and Debt

As systems scale, failure stops looking like a crash. Instead, it shows up as a retry spike, a slow metric drift, or silent fallback behavior.

Modern fault-tolerant systems, built with retries, failovers, circuit breakers, and autoscaling, are designed to recover quickly. This resilience often masks deeper problems, delaying detection for weeks and making root cause analysis harder.
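A retry wrapper shows how this masking happens. The sketch below is a simplified illustration: `call_with_retries` and `make_flaky` are hypothetical helpers, and the transient failure is simulated with a counter instead of a real network.

```python
def call_with_retries(operation, attempts=3):
    """Run operation up to `attempts` times; return (result, failures_hidden)."""
    failures = 0
    for _ in range(attempts):
        try:
            return operation(), failures
        except ConnectionError:
            failures += 1  # without this counter, the failure leaves no trace
    raise ConnectionError(f"gave up after {attempts} attempts")


def make_flaky(fail_times):
    """Build an operation that fails `fail_times` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ConnectionError("simulated transient failure")
        return "ok"

    return operation


result, hidden_failures = call_with_retries(make_flaky(2))
print(result, hidden_failures)
```

The caller sees a clean "ok". Only the failure count, if it is recorded and exported as a metric, reveals that two out of three calls failed along the way.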

Operating in non-deterministic environments with flaky networks, race conditions, or dynamic routing adds further ambiguity. Small symptoms become harder to link back to specific causes.

Compounding this, technical debt driven by weak technical leadership, shifting priorities or time pressure weakens the system’s observability and test coverage. Teams inherit brittle, poorly understood code, making it hard to draw clean lines between cause and effect.

Even the best engineers struggle in such conditions. When a system resists clarity, it doesn’t just block debugging. It erodes trust, slows learning, and fuels long-term burnout.

Glimmers of Hope: Tools and Practices for Clarity

Despite these challenges, several strategies and practices offer a path toward more robust and understandable software.

Leveraging Design Patterns

Design patterns offer a shared vocabulary and time-tested strategies for structuring systems. When applied well, they tame complexity, reduce technical debt, and make behavior more predictable.

They also tend to concentrate similar failure modes. The same bug might appear across companies or industries, creating a wealth of prior art and solution playbooks. Familiarity with patterns can accelerate debugging and deepen shared understanding across teams.

Nurturing Expert Mentorship

Promoting mentors based on real technical impact instead of tenure builds stronger teams and avoids the Peter Principle (people in a hierarchy tend to rise to their level of incompetence).

Great mentors teach more than skills – they model falsifiability, independent thinking, and an ability to reason under uncertainty.

They help others challenge assumptions, navigate tradeoffs, and grow both technically and interpersonally. In systems where root causes are murky, this kind of leadership is essential.

One of the most powerful techniques that scales from mentorship to code is falsification: the disciplined search for counterexamples. Whether applied in design reviews, debugging sessions, or automated tests, this mindset anchors reasoning in reality.

The Power of Falsification in Testing

The deliberate search for counterexamples is core to building reliable systems.

In algorithm design, testing edge cases is just falsification in disguise: finding where your logic breaks.
In code, fuzz testing (Atheris) throws diverse inputs at functions to expose falsifying examples.
Property-based testing (Hypothesis) goes further by generating inputs that satisfy certain rules, then shrinking failures to their minimal form. This greatly improves reproducibility and helps stress-test concurrency issues.
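The generate-then-shrink loop can be imitated with the standard library alone. To be clear, this is a hand-rolled stand-in and not the Hypothesis API: `buggy_max`, `find_counterexample`, and `shrink` are hypothetical names, and the shrinking strategy here is far simpler than what a real library does.

```python
import random


def buggy_max(xs):
    """Deliberately wrong: starts from 0, so all-negative lists break it."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def holds(xs):
    """The property under test: buggy_max agrees with the built-in max."""
    return buggy_max(xs) == max(xs)


def find_counterexample(rng, trials=200):
    """Generate random non-empty integer lists until the property fails."""
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 6))]
        if not holds(xs):
            return xs
    return None


def shrink(xs):
    """Greedily drop elements, then halve values, while the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if candidate and not holds(candidate):
                xs, changed = candidate, True
                break
        if changed:
            continue
        for i, value in enumerate(xs):
            smaller = int(value / 2)  # truncates toward zero
            candidate = xs[:i] + [smaller] + xs[i + 1:]
            if smaller != value and not holds(candidate):
                xs, changed = candidate, True
                break
    return xs


found = find_counterexample(random.Random(0))
minimal = shrink(found) if found is not None else None
print("first failure:", found)
print("shrunk to:    ", minimal)
```

Whatever failing list the generator stumbles on first, shrinking reduces it to a one-element list with the smallest negative magnitude, which makes the bug (a wrong starting value of 0) much easier to spot than the original random input.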

The more rigorously we attempt to falsify our assumptions, the more confidently we can reason about behavior using tools like Modus Ponens and Modus Tollens.

Assumptions are always present in software to simplify complexity. The question is whether they're explicitly codified in tests or left hidden and fragile.

Of course, no test is ever bulletproof: our assumptions could be mistaken, or the world could change. That’s why critical thinking, discerning "what should be" versus "what is", remains essential as newer generations increasingly rely on AI tools like Large Language Models.

This deliberate, falsification-driven approach is paramount for building reliable software. It underpins sophisticated testing techniques designed to expose hidden assumptions and break our logical chains.

While testing helps us uncover where our reasoning might falter, some domains demand an even higher degree of certainty. For those critical systems, we turn to the ultimate tools for logical rigor: Proof Assistants.

Proof Assistants

While traditional testing and fuzzing are powerful for finding bugs, they fundamentally cannot guarantee correctness for all possible inputs or scenarios. They can only prove the presence of bugs, not their absence.

To achieve formal, mathematically verified proofs of program behavior – providing the strongest possible guarantees – we turn to proof assistants. These tools allow us to build step-by-step logical proofs, ensuring that a program or system design adheres to its specification with absolute rigor.

Prolog

Prolog offers a relatively straightforward entry point into the world of logic programming and theorem proving. SWI-Prolog is a widely used Prolog implementation, and it ships with an interactive interpreter (a REPL, or Read-Eval-Print Loop).

You interact with Prolog by providing it with a knowledge base composed of facts and rules (which are a type of logical clause called Horn clauses). You then pose queries.

Installing SWI-Prolog

You can download SWI-Prolog from its official website: https://www.swi-prolog.org/download/stable
Follow the instructions for your operating system (Windows, macOS, or Linux).

On Ubuntu/Debian, you can usually install it via:

sudo apt update
sudo apt install swi-prolog

Using Prolog: REPL vs. File

REPL (swipl) is best for: Quick, interactive tests of single facts or rules, and posing queries to an already loaded knowledge base.
A File (.pl extension) is best for: Defining your entire knowledge base (multiple facts and rules) and storing your program for reusability. This is the standard way to work with Prolog for anything beyond a few lines.

Example: A Simple Knowledge Base

Let's define a knowledge base to represent who has a job and who is a coding instructor.

1. Create a file named knowledge.pl with the following content:

% knowledge.pl
% This file defines a small knowledge base in Prolog.
% In Prolog, all statements (facts and rules) about the same predicate
% (identified by its name AND number of arguments, e.g., 'has_job' with 1 argument is 'has_job/1')
% should be written consecutively without other predicate definitions in between.
% Otherwise SWI-Prolog warns that the clauses are not together in the source file.

% --- Definitions for the 'has_job' predicate (takes 1 argument) ---

% Fact: Alice has a job.
has_job(alice).

% Fact: Bob has a job.
has_job(bob).

% Rule: Anyone (represented by variable X) has a job IF they are a coding instructor.
% ':-' means 'if'. 'X' is a variable (starts with uppercase).
has_job(X) :- is_coding_instructor(X).

% --- Definitions for the 'is_coding_instructor' predicate (takes 1 argument) ---

% Fact: Alice is a coding instructor.
is_coding_instructor(alice).

What each line does:

Lines starting with %: These are comments for human readability, ignored by Prolog. They explain the file's purpose and key rules like predicate grouping.
has_job(alice). / has_job(bob).: These are facts. They assert simple truths, like "Alice has a job." The . at the end is mandatory for every statement.
has_job(X) :- is_coding_instructor(X).: This is a rule. It states a conditional truth: "For any X, X has a job if X is a coding instructor." X is a variable (always starts with an uppercase letter), and :- means "if." This rule allows Prolog to deduce new information.
is_coding_instructor(alice).: Another fact, asserting "Alice is a coding instructor." It's placed after all has_job/1 clauses to satisfy Prolog's grouping rule.

2. Load and Query in the REPL:

Open your terminal and type swipl. Once at the ?- prompt, load the file and then pose your queries:

$ swipl
?- [knowledge].   % Load the 'knowledge.pl' file (omit .pl, use square brackets and a period)
% Press Enter. Depending on the version, SWI-Prolog may print a '% knowledge.pl compiled...' message first.
true.

?- has_job(alice). % Query: Does Alice have a job?
% Press Enter. Prolog finds a first proof, then waits.
true .             % Output: Yes, because it's a fact. Press Enter to accept this answer.
% Prolog waits here because has_job(alice) can also be proven through the rule.
% Typing ';' instead of Enter asks for that alternative proof, which prints 'true.' as well.
% After the final answer, you'll see the '?- ' prompt again, indicating Prolog is ready for your next query.

?- has_job(carol). % Query: Does Carol have a job?
% Press Enter.
false.             % Output: No, Prolog cannot prove it from its knowledge.

?- has_job(X).     % Query: Who has a job? (Find values for X)
% Press Enter
X = alice ;        % Prolog finds Alice as the first solution. Type ';' to ask for the next solution.
X = bob ;          % It finds Bob. Type ';' for the next solution.
X = alice.         % It finds Alice again (this time deduced via the rule and is_coding_instructor(alice)).
% The trailing period means no alternatives remain, so the '?- ' prompt returns.
% Pressing Enter instead of ';' at any earlier answer stops the search for more solutions.

?- halt.           % Type 'halt.' to exit the Prolog REPL cleanly.
% Alternatively, you can often use Ctrl+D (press and hold Ctrl, then D) to exit most REPLs.

The Prolog example clearly demonstrates:

"Is P(X) true for a specific X?": Shown by ?- has_job(alice). (returns true.) and ?- has_job(carol). (returns false.).
"Is there an X for which P(X) is true?": Shown by ?- has_job(X). (provides solutions like X = alice, X = bob).

Prolog Limitations

Prolog's limitations become evident when attempting to reason about falsity or non-existence. You cannot directly ask "Is there any X for which P(X) is false?"

Instead, Prolog operates on the principle of negation as failure. This means that if Prolog cannot prove a statement, it considers that statement false.

For example, if you ask ?- \+ has_job(carol). (meaning "Is it not true that Carol has a job?"), Prolog will answer true, because it simply cannot find any proof that Carol has a job in its knowledge base.

This is a significant distinction: it doesn't mean Carol definitely doesn't have a job, nor does Prolog provide a formal counterexample. It merely reflects a lack of provable information.

This fundamental constraint means Prolog, while powerful for logic programming, falls short of being a full-fledged proof assistant for comprehensive formal verification.

Coq

After experimenting with Prolog and seeing its limitations, you can move on to a more powerful proof assistant like Coq. Coq is employed in safety-critical domains where absolute mathematical certainty is paramount. coqtop is the standard REPL for Coq.

A fundamental difference from Prolog is Coq's lack of a Closed World Assumption. In Coq, anything not explicitly proven is simply unknown, not automatically false.

Unlike Prolog, Coq's primary purpose isn't solving computational problems by searching a knowledge base. Its true power lies in its ability to construct and verify formal mathematical proofs and programs with absolute rigor. Its interaction involves managing a proof state (your remaining goals) and applying tactics (logical inference steps) until the proof is complete.

Installing Coq

Coq can be installed in several ways, often via package managers or a tool called opam (the OCaml package manager, as Coq is written in OCaml).

Official Downloads: Visit the Coq website for detailed instructions for your OS: https://coq.inria.fr/download
Using a system package manager (for example, Ubuntu/Debian):
```
  sudo apt update
  sudo apt install coq
```

Using Coq: REPL vs. File

REPL (coqtop) is best for: Trying out single tactics, inspecting the current proof state, or learning basic syntax for very short commands.
A File (.v extension) is best for: Almost all Coq development and proof construction. This is how complex proofs and verified programs are structured and managed.

Coq's Comprehensive Question Answering

Unlike Prolog, Coq can directly address all three types of logical questions we've discussed, providing robust answers backed by formal proof:

"Is P(X) true for a specific X?": Coq allows you to define a precise statement (a theorem) like "Alice has a job." You then build a step-by-step logical proof that formally confirms whether this statement is true based on your definitions. If the proof succeeds, Coq formally verifies it; if it fails, Coq clearly shows where your logic breaks down.
"Is there an X for which P(X) is true?": Coq handles questions of existence. If you ask, "Does someone have a job?", you can construct a proof by explicitly providing an example (like "Alice") and then proving that your chosen example indeed satisfies the condition ("Alice has a job").
"Is there any X for which P(X) is false?": This is a key capability where Coq excels over Prolog. Coq allows you to formally prove that a statement is false, or that a counterexample exists. For instance, you could prove "Carol does not have a job" by showing it contradicts the definition, or prove "there exists someone who doesn't have a job" by explicitly identifying such a person and proving that they indeed lack a job. This direct ability to reason about negation and provide formal counterexamples (or prove their non-existence) is what makes Coq a full-fledged proof assistant.

While Coq's core doesn't automatically generate counterexamples when a proof fails, plugins like QuickChick can be integrated for property-based testing to find falsifying examples.

It's a Coq library that allows you to specify properties about your Coq definitions and then randomly generate inputs to try and find a counterexample that falsifies your property.

This is a powerful way to find bugs early in your formalization before you invest a lot of time trying to prove a false theorem.
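To make the three question types from the list above concrete, here is a minimal sketch written in Lean 4 rather than Coq (Lean is introduced later in this handbook, and the proof structure carries over). The names Person, HasJob, fact_alice, and fact_bob are illustrative assumptions, not part of any library. Because HasJob is defined inductively, its constructors are the only ways to have a job, which is what makes the negative statements provable instead of merely unproven.

```lean
-- A tiny closed universe of people (illustrative).
inductive Person where
  | alice
  | bob
  | carol

-- The only ways to establish that someone has a job.
inductive HasJob : Person → Prop where
  | fact_alice : HasJob Person.alice
  | fact_bob : HasJob Person.bob

-- "Is P(X) true for a specific X?"
theorem alice_has_job : HasJob Person.alice :=
  HasJob.fact_alice

-- "Is there an X for which P(X) is true?" Supply a witness and its proof.
theorem someone_has_job : ∃ p, HasJob p :=
  ⟨Person.alice, HasJob.fact_alice⟩

-- A negative fact: no constructor can produce HasJob Person.carol.
theorem carol_has_no_job : ¬ HasJob Person.carol := by
  intro h
  cases h

-- "Is there any X for which P(X) is false?" Supply the counterexample.
theorem someone_lacks_job : ∃ p, ¬ HasJob p :=
  ⟨Person.carol, carol_has_no_job⟩
```

Contrast this with Prolog's negation as failure: here the absence of a job for Carol is a checked theorem, not a side effect of a failed search.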

TLA+, Isabelle, and Lean: A Spectrum of Formal Verification

Beyond Prolog and Coq, other powerful proof assistants and formal specification languages cater to different needs and paradigms:

TLA+: This is a formal specification language developed by Leslie Lamport. It focuses on modeling and verifying system designs (especially concurrent and distributed ones) using temporal logic, rather than proving low-level code. It helps ensure critical properties like safety (nothing bad ever happens) and liveness (something good eventually happens). Its practicality and accessibility make it popular in industry, notably at Amazon and Microsoft for robust system design.
Isabelle and Lean: These are modern, highly advanced proof assistants.
- Isabelle, grounded in higher-order logic, is widely used by researchers and institutions (for example, in projects like the seL4 verified microkernel) for formal theorem proving and software verification in academic and safety-critical domains demanding extreme rigor.
- Lean, based on dependent type theory, is favored by mathematicians for formalizing proofs in pure mathematics (for example, number theory, algebra). It's known for its powerful automation and active community.

These tools represent the pinnacle of applying formal logic to ensure the correctness and reliability of both mathematical theories and complex software systems.

Now that you have a good lay of the land in both theory and practice, here are some thought experiments to enrich your education.

Food for Thought

The journey into formal logic and its intersection with practical domains like software and science offers many avenues for deeper exploration.

Hypothesis Testing in Science and the Implication Truth Table

Statistical hypothesis testing uses a probabilistic form of Modus Tollens. We start with a null hypothesis (H0): "If H0 is true, then observing this data (or more extreme data) is likely." We then observe data that is highly unlikely/unexpected if H0 were true (that is, a small p-value). This serves as our probabilistic "not Q." Therefore, we conclude that H0 is likely not true (we reject H0). This is our probabilistic "∴¬P."

Here, the "truthiness" of P⟹Q is being tested, rather than simply assumed to be true for developing arguments, as in Modus Ponens or Modus Tollens. There's no absolute truth or anything to "prove" definitively.

Inferences are drawn from prior experiments (which inform the test data distribution) and context-specific experiment setups (which determine the significance level α), together defining the threshold (critical value) for what is considered an unlikely observation of Q.

The experiment's result is a rejection (or lack thereof) of H0, never a definitive proof that H0 is true or false.

Inductive Reasoning's Relationship to Deductive Arguments

Induction generates general rules (for example, "P is always followed by Q") from specific observations or cases.
Deduction then tests or applies those general rules in new situations.

If deduction leads to wrong predictions (that is, a rule is falsified), induction may need to revise the original rule, which forms a continuous feedback loop that refines our understanding.

Necessity and Sufficiency in Implication

The implication P⟹Q ("If you crossed the border, you must have had a passport") unpacks into two fundamental logical concepts:

P is sufficient for Q: Crossing the border guarantees you had a passport. (P alone is enough for Q.)
Q is necessary for P: If you didn't have a passport (¬Q), you couldn't have crossed (¬P). (Q is required for P to happen.)
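As a small illustration, the two readings can be stated as one-line proofs. This is a sketch in Lean 4, with the theorem names sufficient and necessary chosen only for this example.

```lean
-- P is sufficient for Q: given P ⟹ Q and P, we obtain Q (Modus Ponens).
theorem sufficient {P Q : Prop} (h : P → Q) (hp : P) : Q :=
  h hp

-- Q is necessary for P: given P ⟹ Q and ¬Q, we obtain ¬P (Modus Tollens).
theorem necessary {P Q : Prop} (h : P → Q) (hnq : ¬Q) : ¬P :=
  fun hp => hnq (h hp)
```

Both theorems take the same implication h as input, which shows that sufficiency and necessity are two views of a single statement rather than two separate claims.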

Q.E.D.: The Enduring Power of Logic in an Uncertain World

Throughout this handbook, we’ve journeyed from the foundational concepts of propositional logic and truth tables to the powerful argument forms of Modus Ponens and Modus Tollens. We explored how these tools enable valid deductions and identified common logical fallacies like Affirming the Consequent and Denying the Antecedent, understanding why they lead to incorrect inferences when an "if-then" relationship isn't a strict "if and only if." We learned the profound importance of falsifiability – the ability for a statement or hypothesis to be disproven – a cornerstone of both scientific inquiry and robust software testing.

We then delved into the practical application of these logical principles in software development, mapping code correctness to test outcomes. We discovered how a failing test, when trusted, becomes a powerful application of Modus Tollens, pinpointing defects. We also confronted the "illusion of correctness" that arises from the affirming the consequent fallacy when tests pass for the wrong reasons, especially when using test doubles.

Crucially, we introduced the "If and Only If" (P⟺Q) relationship, highlighting its unparalleled power in establishing unambiguous connections between cause and effect. This bidirectional guarantee is the ideal we strive for in test suite quality, moving beyond mere correlation to a deeper understanding of causality. We saw how mutation testing rigorously pushes us towards this "if and only if" confidence by actively trying to falsify the assumption that "incorrect code leads to failing tests," thereby strengthening its contrapositive: "passing tests guarantee correct code."

We also acknowledged the "messy reality" of modern software. Large systems are webs of complexity, with fan-in/fan-out patterns, side effects, and unforeseen interactions that can obscure clear logical chains. Technical debt and the double-edged sword of abstraction often mask the true origins of bugs, turning debugging into a "causal fog."

Logic as Your Compass

Despite these formidable challenges, the logical principles we've explored remain your most vital tools. They provide the mental framework to navigate uncertainty.

When confronted with a bug, your ability to reason logically allows you to formulate hypotheses, design focused experiments (your tests), and interpret their outcomes with precision. Whether you're debugging a complex microservice or reasoning about a simple function, applying Modus Tollens to a failing test or designing tests that aim for P⟺Q clarity helps you cut through the noise.

We also touched upon logic programming and formal verification tools (Prolog, Coq, TLA+, Isabelle, Lean), which represent the pinnacle of applying formal logic to guarantee system correctness, a testament to the enduring power of logical rigor in critical domains.

In the intricate dance between theory and practice, the principles of logic stand as an unshakeable foundation. They are the "rocks" upon which you can meticulously build your understanding and your systems. The more consistently you apply this critical thinking, driven by curiosity and a commitment to rigorous validation, the clearer your path becomes.

This clarity is not just about fixing today’s bugs, it’s about continually refining your mental models, fostering trust in your codebase, and equipping yourself to build increasingly robust and predictable systems in an ever-evolving technological landscape.

If you love problem solving, critical thinking, or have experiences on how you fixed an issue that looked different from how it initially seemed, feel free to connect with me at https://linkedin.com/in/hanqi91.

Resources

Article that motivated this handbook: Classical Reasoning and Debugging
3 Formal proofs of modus tollens: https://en.wikipedia.org/wiki/Modus_tollens
Table of 24 syllogisms: https://en.wikipedia.org/wiki/Syllogism
Challenging Assumptions: Falsehoods software teams believe about user feedback
How assumptions and software evolve beyond your control: https://www.tdda.info/why-code-rusts
Relationship to Hypothesis Testing: https://sites.google.com/view/reasonedwriting/home/FRAMEWORK_FOR_SCIENTIFIC_PAPERS/HYPOTHESES/HOW_TO_TEST_HYPOTHESES/MODUS_TOLLENS
The Troubleshooting Mindset: https://www.autodidacts.io/troubleshooting/
Causal Diagrams from The Effect Book: https://theeffectbook.net/ch-CausalDiagrams.html
A systematic guide to the mindsets and practices of debugging: https://www.amazon.sg/Debug-Find-Repair-Prevent-Bugs/dp/193435628X
Constructing P in a way to ensure software correctness: https://www.hillelwayne.com/post/constructive/
Fail Fast by explicitly representing assumptions as assertions: https://www.martinfowler.com/ieeeSoftware/failFast.pdf
Deterministic Simulation Testing to tackle complex systems: https://pierrezemb.fr/posts/learn-about-dst/
GitHub’s Engineering System Success Playbook (ESSP) - Quality, Velocity, Developer Happiness on Business Outcomes: https://assets.ctfassets.net/wfutmusr1t3h/us6AUuwawrtNGTlwlT9Ac/f0fce86712054fc87f10db28b20f303b/GitHub-ESSP.pdf
Closed-world assumption: https://en.wikipedia.org/wiki/Closed-world_assumption

Glossary

Axiom: A fundamental truth or rule accepted as a starting point for a logical or mathematical system, without requiring proof.
Contrapositive: A logically equivalent form of an "if-then" statement (P⟹Q), which is ¬Q⟹¬P ("If not Q, then not P").
Deductive Reasoning: A type of logical reasoning where a conclusion is necessarily true if its premises are true.
Falsification: The principle, especially in science (from Karl Popper), that a hypothesis or theory must be capable of being proven false by empirical observation or experiment.
Formal Logic: The study of abstract systems of reasoning and arguments based on their structure, independent of content.
Hypothesis Testing: A statistical method for making inferences about a population based on sample data, typically by testing a null hypothesis (e.g., "P has no effect on Q") against an alternative hypothesis.
Logical Fallacy: A flaw in the structure or content of an argument that makes it unsound or invalid, even if its conclusion might seem plausible.
- Affirming the Consequent (Fallacy): An invalid argument form that mistakenly assumes if P⟹Q is true, and Q is true, then P must be true.
- Denying the Antecedent (Fallacy): An invalid argument form that mistakenly assumes if P⟹Q is true, and P is false, then Q must be false.
Modus Ponens: A valid argument form: If P⟹Q is true and P is true, then Q must be true.
Modus Tollens: A valid argument form: If P⟹Q is true and ¬Q is true, then ¬P must be true.
Mutation Testing: A software testing technique that involves deliberately introducing small, single-point faults (mutations) into code to assess the effectiveness and coverage of a test suite.
Propositional Logic: A branch of logic that deals with propositions and their relationships using logical operators.
Test-Driven Development (TDD): A software development methodology where tests are written before the code, guiding the development process and ensuring correctness.
Truth Table: A table that systematically lists all possible truth values for a set of propositions and shows the resulting truth value of a complex logical statement.
Vacuously True: Describes an implication (P⟹Q) that is considered true simply because its antecedent (P) is false.

How to Build a Testing Framework for E-Commerce Checkout and Payments

Venkata Sai Sandeep — Fri, 23 May 2025 15:07:30 +0000

When I first started working on E-commerce applications, I assumed testing checkout flows and payments would be straightforward. My expectation was simple: users select items, provide an address, pay, and receive confirmation. But I quickly learned that each step in the checkout process is filled with hidden complexities, edge cases, and unexpected behaviors.

The reason I’m sharing my experience is simple: I struggled initially to find detailed resources that described real-world checkout testing challenges. I want this article to be what I wish I had when I began – a clear, structured guide to building a robust checkout and payment testing framework that anticipates and handles real-world scenarios effectively.

Why This is Important and Challenging
Getting Started
Testing the Checkout Flow
Personal Challenges & Lessons Learned
Final Thoughts

Why This is Important and Challenging

Testing checkout and payment flows is crucial because they’re directly tied to customer trust and business revenue. Each mistake or oversight can lead to lost sales, security vulnerabilities, or damaged reputation.

The complexity arises because checkout processes involve multiple integrated components (carts, addresses, payments, and confirmations), each of which can fail or behave unpredictably. Robust testing ensures the system reliably handles real-world customer behaviors and system anomalies, safeguarding both user experience and business success.

Getting Started

To follow along with this guide, you'll need basic experience in Java (8 or later), object-oriented programming concepts like interfaces and classes, and familiarity with a text editor or IDE such as IntelliJ, Eclipse, or VS Code.

This article is beginner-friendly but touches on real-world use cases that are beneficial to experienced engineers. You'll work with simulated inputs rather than real APIs, making it safe to explore and experiment.

Defining Some Terms:

In this context, a "testing framework" refers to a modular, logic-driven structure for validating key business rules in the checkout pipeline.

Instead of relying on external libraries like JUnit or Selenium, this approach embeds rule-based validations directly into the control flow. Each component (for example, cart, address, payment) is treated as a testable unit with clear preconditions and response logic, reflecting how a lightweight internal QA harness might enforce system integrity.

For example, verifying that a cart has items with quantity > 0, or that an address includes required fields like postal code, simulates the validation engine that would exist in production-grade systems.

We'll also use the term "Assertion Steps" throughout this article to describe the key validation points your framework should enforce at each stage of the checkout flow. These aren't formal assertions from a test library, but are rather logical checks built into the control flow that verify specific conditions like ensuring a cart isn’t empty or a payment method is supported.

When I began building frameworks, I often focused on getting things to work, but missed defining what "working" meant. Adding clear, meaningful assertions to each step transformed my process. They became not only guardrails for correctness, but also checkpoints that made my test code more maintainable, predictable, and easier to extend.

Testing the Checkout Flow

Now that we understand why checkout testing is important and what we’ll be doing here, let’s walk through the key parts of the flow. Each stage represents a critical checkpoint where real-world issues can emerge and where your test framework should be ready to catch them.

Step 1: Cart State and Validation

Before testing payments, I learned the hard way that ensuring the cart’s state is critical. Users frequently modify carts during checkout, or their session might expire.

The cart is where every checkout begins. It might look simple, but it’s surprisingly fragile. Users can remove items mid-flow, reload stale pages, or even send malformed data. Your framework should validate both the cart’s structure and the legitimacy of its contents before allowing checkout to proceed.

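import java.util.Map;

// Item ID -> requested quantity.
Map<String, Integer> cartItems = getCartItems();

// allMatch() returns true for an empty stream, so reject empty carts explicitly.
boolean isCartValid = !cartItems.isEmpty()
    && cartItems.values().stream()
        .allMatch(quantity -> quantity != null && quantity > 0);

if (isCartValid) {
    proceedToCheckout();
} else {
    showError("Cart validation failed: the cart is empty or one or more items have invalid quantities.");
}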

Assertion Steps:

We’re validating that this logic enforces key conditions, ensuring that only valid cart states proceed and failures are clearly reported. This helps isolate issues early and improves confidence in the checkout pipeline:

Verify error messages appear when the cart validation fails (showError(…) line).
Confirm the checkout process advances only if the cart is valid (proceedToCheckout() line).
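The two assertion steps above can be exercised without a UI by pulling the cart rule into a function that returns its verdict. The following is a minimal sketch under that assumption: CartValidator and validate are hypothetical names for illustration and do not come from any existing checkout API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class CartValidator {

    /** Returns an error message when the cart must be rejected, or empty when it may proceed. */
    static Optional<String> validate(Map<String, Integer> cartItems) {
        if (cartItems == null || cartItems.isEmpty()) {
            return Optional.of("Cart is empty.");
        }
        for (Map.Entry<String, Integer> entry : cartItems.entrySet()) {
            Integer quantity = entry.getValue();
            if (quantity == null || quantity <= 0) {
                return Optional.of("Invalid quantity for item " + entry.getKey() + ".");
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Integer> goodCart = new LinkedHashMap<>();
        goodCart.put("sku-1", 2);

        Map<String, Integer> badCart = new LinkedHashMap<>();
        badCart.put("sku-1", 2);
        badCart.put("sku-2", 0); // zero quantity should block checkout

        System.out.println("good cart rejected? " + validate(goodCart).isPresent());
        System.out.println("bad cart rejected?  " + validate(badCart).isPresent());
        System.out.println("empty cart rejected? " + validate(new LinkedHashMap<>()).isPresent());
    }
}
```

Returning the error instead of calling showError(…) directly keeps the rule free of side effects, so a test can check both the decision and the message.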

Step 2: Address and Shipping Details

I encountered many edge cases such as incomplete addresses, international formats, and unexpected API failures from shipping providers.

To handle these issues, you can use shipping address validation. This ensures that the order actually has a destination and that it's reachable. Also, incomplete fields, invalid formats, or API glitches can lead to fulfillment failures. Your test logic should enforce address completeness and formatting before progressing.

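import java.util.Map;
import java.util.stream.Stream;

// Field name -> value as entered by the user.
Map<String, String> addressFields = address.getAddressFields();

// Every required field must be present and non-blank.
boolean isAddressComplete = Stream.of("street", "city", "postalCode")
    .map(addressFields::get)
    .allMatch(value -> value != null && !value.trim().isEmpty());

if (isAddressComplete) {
    confirmShippingDetails(address);
} else {
    showError("Invalid or incomplete address provided.");
}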

Assertion Steps:

This validation ensures the system doesn’t proceed with incomplete address data. The stream logic checks for required fields, and depending on the result, either confirms the shipping or triggers an error message.

Confirm the system rejects incomplete or invalid addresses (the conditional check in the isAddressComplete stream logic).
Ensure clear error messages are displayed if address validation fails (showError(…) line).
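A clear error message usually needs to say which field is missing. As one possible sketch, the check can return the list of offending fields instead of a single boolean. AddressValidator and missingFields are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class AddressValidator {
    private static final List<String> REQUIRED_FIELDS =
            Arrays.asList("street", "city", "postalCode");

    /** Lists every required field that is absent, null, or blank, in a fixed order. */
    static List<String> missingFields(Map<String, String> addressFields) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED_FIELDS) {
            String value = addressFields.get(field);
            if (value == null || value.trim().isEmpty()) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> address = new HashMap<>();
        address.put("street", "1 Example Road");
        address.put("city", "   "); // whitespace only counts as missing
        System.out.println(missingFields(address)); // prints [city, postalCode]
    }
}
```

An empty result means the address is complete; a non-empty one can feed straight into the error message shown to the user.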

Step 3: Payment Method Selection and Validation

Payment methods like credit cards, debit cards, digital wallets, and gift cards require different validation rules and logic flows.

This step ensures that only valid and supported payment methods can be used. From credit cards to mobile wallets, each method requires its own validation logic. Testing here prevents users from attempting transactions with incomplete or unverified payment inputs.

// Requires: import java.util.Arrays; import java.util.List;
List<String> supportedMethods = Arrays.asList("CreditCard", "DebitCard", "PayPal", "Wallet");

if (supportedMethods.contains(paymentMethod.getType()) && paymentMethod.detailsAreValid()) {
    processPayment(paymentMethod);
} else {
    showError("Selected payment method is invalid or unsupported.");
}

Assertion Steps:

This logic ensures that only supported and valid payment types can proceed to processing. The contains(…) check confirms the method is allowed, while detailsAreValid() guards against incomplete or incorrect data. Combined, these help isolate bad inputs early in the flow:

Confirm unsupported payment types trigger the appropriate error (showError(…) line).
Ensure the payment processing proceeds only with valid and supported methods (processPayment(paymentMethod) line).
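One way to cover both assertion steps is a small table of cases that pairs each input with the expected decision. The sketch below mirrors the condition from the snippet above. PaymentSelection and canProcess are hypothetical names, and the boolean detailsAreValid stands in for the result of paymentMethod.detailsAreValid().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only.
final class PaymentSelection {
    private static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("CreditCard", "DebitCard", "PayPal", "Wallet"));

    /** The type must be supported and its details must have validated. */
    static boolean canProcess(String type, boolean detailsAreValid) {
        return SUPPORTED.contains(type) && detailsAreValid;
    }

    public static void main(String[] args) {
        // Each row: payment type, whether its details validated, expected decision.
        Object[][] cases = {
            {"CreditCard", true, true},
            {"CreditCard", false, false}, // supported type, bad details
            {"GiftCard", true, false},    // valid details, unsupported type
            {"Wallet", true, true},
        };
        for (Object[] c : cases) {
            boolean actual = canProcess((String) c[0], (Boolean) c[1]);
            boolean expected = (Boolean) c[2];
            System.out.println(c[0] + " valid=" + c[1] + " -> " + actual
                    + (actual == expected ? " (as expected)" : " (UNEXPECTED)"));
        }
    }
}
```

Keeping the cases in a table makes it cheap to add a row whenever a new payment type or failure mode appears.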

Common Payment Method Validations:

Different payment methods have unique validation requirements. Here are examples of some key tests:

Credit Card: Validate card number format (for example, starts with 4 for Visa, correct length), CVV (3-digit), and expiry date validity.

  if (paymentMethod.getType().equals("CreditCard") && paymentMethod.getCardNumber().matches("^4[0-9]{12}(?:[0-9]{3})?$")) {
      processPayment(paymentMethod);
  } else {
      showError("Invalid credit card details.");
  }

PayPal: Confirm linked account is verified.

  if (paymentMethod.getType().equals("PayPal") && paymentMethod.isAccountVerified()) {
      processPayment(paymentMethod);
  } else {
      showError("Unverified PayPal account.");
  }

Digital Wallet: Validate secure token is correctly formed and active.

  if (paymentMethod.getType().equals("Wallet") && paymentMethod.isTokenValid()) {
      processPayment(paymentMethod);
  } else {
      showError("Invalid or expired wallet token.");
  }
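The credit card item above mentions CVV and expiry checks that the snippet itself does not perform. Here is a minimal sketch of how those could sit next to the number check. CardChecks and its method names are hypothetical, and passing the current month in as a parameter is a design assumption made so tests stay deterministic.

```java
import java.time.YearMonth;

// Hypothetical helper for illustration only.
final class CardChecks {

    // Same Visa pattern as the snippet above: a leading 4, then 13 or 16 digits in total.
    static boolean isVisaNumber(String cardNumber) {
        return cardNumber != null && cardNumber.matches("^4[0-9]{12}(?:[0-9]{3})?$");
    }

    static boolean isThreeDigitCvv(String cvv) {
        return cvv != null && cvv.matches("^[0-9]{3}$");
    }

    /** Treats a card as usable through the end of its expiry month. */
    static boolean isNotExpired(YearMonth expiry, YearMonth currentMonth) {
        return expiry != null && !expiry.isBefore(currentMonth);
    }

    static boolean isValidVisa(String number, String cvv, YearMonth expiry, YearMonth currentMonth) {
        return isVisaNumber(number) && isThreeDigitCvv(cvv) && isNotExpired(expiry, currentMonth);
    }

    public static void main(String[] args) {
        YearMonth now = YearMonth.of(2025, 5); // fixed "today" for a repeatable run
        System.out.println(isValidVisa("4111111111111111", "123", YearMonth.of(2026, 1), now)); // true
        System.out.println(isValidVisa("4111111111111111", "12", YearMonth.of(2026, 1), now));  // false: CVV too short
        System.out.println(isValidVisa("4111111111111111", "123", YearMonth.of(2025, 4), now)); // false: expired
    }
}
```

A production system would typically add a checksum on the card number and defer the final decision to the payment provider; this sketch covers only the format-level checks described in the text.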

Step 4: Payment Processing and Error Handling

Even when payment details are valid, payment gateways can fail unpredictably due to network issues, bank declines, or incorrect transaction formats.

This step tests that the system handles payment failures gracefully, reports them clearly, and processes orders only after true confirmation.

PaymentResponse response = paymentGateway.process(transactionDetails);
if (response.isSuccessful()) {
    confirmOrder(response);
} else {
    handlePaymentError(response.getError());
}

Assertion Steps:

This logic focuses on how the system handles responses from the payment gateway. The isSuccessful() check ensures only confirmed transactions trigger order creation, while any failure path is routed to handlePaymentError(), allowing you to test error flows like declines or timeouts clearly.

Confirm errors from payment processing (handlePaymentError(response.getError()) line) are handled gracefully.
Common errors your framework should simulate and verify include:
- Timeouts: when the gateway service is delayed or unreachable.
- Insufficient Funds: valid card but not enough balance.
- Card Declined: blocked or expired cards.
- Malformed Requests: missing fields or invalid transaction payloads.
Ensure successful transactions are always followed by order confirmations (confirmOrder(response) line).
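The failure modes listed above can be simulated with a stand-in gateway that returns whichever outcome a test asks for. The sketch below assumes a simplified gateway that reports an Outcome value. GatewaySimulation, Outcome, PaymentGateway, and Checkout are hypothetical names for illustration and do not correspond to a real payment SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
final class GatewaySimulation {
    enum Outcome { SUCCESS, TIMEOUT, INSUFFICIENT_FUNDS, CARD_DECLINED, MALFORMED_REQUEST }

    /** Minimal gateway contract; a test double returns whatever outcome it is given. */
    interface PaymentGateway {
        Outcome process(String transactionId);
    }

    /** Records what the checkout did so a test can inspect it afterwards. */
    static final class Checkout {
        final List<String> confirmedOrders = new ArrayList<>();
        final List<String> errors = new ArrayList<>();

        void pay(PaymentGateway gateway, String transactionId) {
            Outcome outcome = gateway.process(transactionId);
            if (outcome == Outcome.SUCCESS) {
                confirmedOrders.add(transactionId);
            } else {
                errors.add(transactionId + ": " + outcome);
            }
        }
    }

    public static void main(String[] args) {
        Checkout checkout = new Checkout();
        for (Outcome outcome : Outcome.values()) {
            // A lambda acts as the fake gateway for each simulated outcome.
            checkout.pay(id -> outcome, "txn-" + outcome);
        }
        System.out.println("confirmed: " + checkout.confirmedOrders); // only the SUCCESS transaction
        System.out.println("error count: " + checkout.errors.size()); // one per failure outcome
    }
}
```

Because the loop walks every enum value, adding a new failure type to Outcome automatically adds it to the simulation.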

Step 5: Order Confirmation

Order confirmation accuracy and timing are crucial. Issues can occur if confirmation happens prematurely or email notifications are delayed.

This final step validates that orders are only confirmed after successful payment. Rushing this process can result in orders without revenue or duplicate transactions. The framework should check for payment settlement before confirming and notifying the user.

if (payment.isSettled()) {
    order.createRecord();
    notifyCustomer(order);
} else {
    showError("Order cannot be confirmed until payment settles.");
}

Assertion Steps:

This logic ensures confirmation and notification only happen after payment settlement. The payment.isSettled() check guards against premature actions, allowing order creation and customer notifications only when the transaction is fully complete:

Validate emails are sent only after payment settlement (notifyCustomer(order) line following successful payment check).
Confirm that orders are created accurately after payments (order.createRecord() line).
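Both assertion steps depend on what happened and in what order. A minimal sketch of that idea is to record each side effect in an ordered log that a test can inspect. OrderConfirmation and its event strings are hypothetical and stand in for order.createRecord(), notifyCustomer(order), and showError(…).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class OrderConfirmation {
    /** Ordered log of side effects, so tests can assert both presence and sequence. */
    final List<String> events = new ArrayList<>();

    void confirm(String orderId, boolean paymentSettled) {
        if (paymentSettled) {
            events.add("record:" + orderId); // stands in for order.createRecord()
            events.add("email:" + orderId);  // stands in for notifyCustomer(order)
        } else {
            events.add("error:" + orderId);  // stands in for showError(...)
        }
    }

    public static void main(String[] args) {
        OrderConfirmation flow = new OrderConfirmation();
        flow.confirm("A1", false);
        flow.confirm("A2", true);
        System.out.println(flow.events); // prints [error:A1, record:A2, email:A2]
    }
}
```

With the log in place, "no email before settlement" becomes a simple check that no email event exists for an unsettled order.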

Personal Challenges & Lessons Learned

Users behave unpredictably: design your tests to mimic real-world behavior as closely as possible.
Simulate external service failures proactively: don’t wait for production to expose them.
Maintain detailed logs: they help pinpoint issues faster during debugging.
Communicate clearly and promptly: users value transparency when issues arise.

These challenges reinforced that technical correctness alone is not sufficient. An effective testing framework must account for unpredictable user behavior, proactively simulate third-party service failures, and offer traceability through detailed logs.

By building for resilience and maintaining clear communication, you can ensure your e-commerce system operates reliably and builds lasting user trust even under stress.

Key Takeaways:

Always validate backend logic separately from UI.
Include negative and edge-case scenarios in your tests.
Expect API failures and handle them gracefully.

Lessons from the Journey

Testing e-commerce checkouts taught me that robust frameworks understand human behaviors, expect the unexpected, and rigorously validate each step. By sharing my journey, I aim to simplify the learning curve for others facing similar challenges.

Remember – effective testing isn’t about getting to zero defects immediately. It's about continuous refinement and learning from every scenario. Keep building, keep testing, and let your code reflect real-world reliability.

How I Tested Malaysia's Open Data Portals with Plain English

What You'll Find Below:

Table of Contents

Why Malaysia's Open Data Portals?

What Is Passmark?

The Hero Spec: Range-Bounded Assertions

What Two-Model Voting Doesn't Catch

Going Further: Cross-Field Math

What I Found Across Three Runs

The Debugging Loop

The Two Specs That Still Fail Are the Most Interesting

1. The two models disagreed and the arbiter call failed.

2. The wait condition fired too early.

What It Cost, and Why Cache Rate Is Cost Rate

The Pattern Worth Stealing

Honest Verdict

Resources

The Data Quality Handbook: Data Errors, the Developer's Role, and Validation Layers Explained.

What We'll Cover:

Prerequisites

The Importance of Data Quality

How Does Bad Data Happen in the First Place?

The Cost of Bad Data

Types of Data Errors

Required Field Errors

Format Validation Errors

Range and Limit Errors

Logical Consistency Errors

Duplicate and Data Integrity Errors

Relational Errors (Reference Integrity)

Structural Errors (Dropdowns, Radio Buttons, Enums)

What Makes Good Data?

Completeness:

Uniqueness:

Validity:

Timeliness:

Accuracy:

Consistency:

Fitness for Purpose:

Data Validation Layers

Frontend Layer — “Protect the User, Not the System”

Backend Validation — “The Real Gatekeeper”

Database Layer — “Protect the Data at Rest”

Service Layer / Business Logic — “Validate Real-World Rules”

Jobs / Queues / Data Ingestion — “Validate External Data”

Testing Strategies to Protect Data Quality

Unit Testing

Example: Testing a Discount Calculation Rule

Integration Testing: The Flow & Lineage Check

Functional Testing: The Business Rule Check

Here's an example: Functional Test

Conclusion

Software Testing with Playwright

How to Test a Complex Full-Stack App: Manual Approach vs AI-Assisted Testing

What we'll cover:

Prerequisites

How Testing Actually Works in Full-Stack Apps

Three Layers, Three Different Jobs

API Tests

End-to-end (E2E) Tests

The Pain Points Nobody Warns You About

Schema Validation: The Hidden Time Sink

What Made This Hard

The Manual Approach

Unit Tests: The Easy Win

API Tests: Where the Friction Starts

E2E Tests: The Full Journey

The AI-Assisted Approach

API Testing: Describing Instead of Coding

E2E Testing: Plain English, Real Browser

The Feature That Surprised Me

When to Use Which Approach

Conclusion

Before We End

How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster

Table of Contents

What is KubeLab?

Prerequisites

How to Get the Lab Running