Traditional Scraping vs AI Scraping: A Practical Guide for Developers and Data Teams

Joel Olawanle — Thu, 16 Apr 2026 21:37:47 +0000

Enormous amounts of data are constantly generated on the open web. Product prices change, job listings go live and get taken down, news articles are published, and company information gets updated.

For developers and teams that rely on this kind of data, the question has never been whether to scrape the web, but how to do so reliably over time.

For a long time, the approach has been straightforward. You inspect a page, write selectors, and extract the data using tools like BeautifulSoup or browser automation libraries like Playwright and Selenium. This works well, but it comes with a familiar problem: the moment the structure of a page changes, your scraper breaks and needs fixing.

Recently, a different approach has started gaining attention. Instead of writing selectors, you describe what you want and let the system figure out how to extract it. This is what people refer to as AI scraping.

Both approaches are widely used today, but they solve the problem in very different ways. This guide breaks down how each one works, where each one fits, and how to decide which approach makes sense for your use case.

What is Traditional Web Scraping?
Traditional Scraping in Practice
What is AI Web Scraping?
AI Scraping in Practice
Traditional Scraping vs AI Scraping: When to Use Each

What is Traditional Web Scraping?

Traditional web scraping is built on a simple idea: if a browser can load a page and display data to a user, then a program should be able to do the same and extract that data automatically.

Extraction is usually done with CSS selectors or XPath. A CSS selector like .product-card .price means “find the price element inside a product card.” It's easy to understand and works well for most use cases.

XPath, on the other hand, is more powerful but more complex. It allows you to navigate the structure of a page in more detail, including moving up and down the DOM, filtering by text, or handling deeply nested elements.

In practice, most developers start with CSS selectors and only use XPath when the structure becomes too complex.
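To make the difference concrete, here is a minimal sketch using Python's standard-library ElementTree, which supports only a subset of XPath. The markup is hypothetical and deliberately well-formed; real pages usually need a forgiving HTML parser. Note that the attribute test matches the exact class string, whereas a CSS class selector would also match elements carrying additional classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a product listing page
MARKUP = """
<div>
  <div class="product-card"><h2>Kettle</h2><span class="price">19.99</span></div>
  <div class="product-card"><h2>Toaster</h2><span class="price">34.50</span></div>
</div>
"""

root = ET.fromstring(MARKUP)

# Rough XPath counterpart of the CSS selector ".product-card .price"
price_nodes = root.findall(".//div[@class='product-card']//span[@class='price']")
prices = [node.text for node in price_nodes]

# Filter by text, then step back up to the parent card.
# Moving upward like this is something CSS selectors cannot express.
toaster_card = root.find(".//h2[.='Toaster']/..")
toaster_price = toaster_card.find("span[@class='price']").text

print(prices)         # ['19.99', '34.50']
print(toaster_price)  # 34.50
```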

This idea has been around since the early days of the web. Instead of manually copying information from a page, developers started writing scripts that send requests, receive HTML responses, and extract the pieces they care about.

At its core, nothing about that model has really changed.

You still fetch a page, inspect its structure, and extract data from it. The difference today is not the concept, but how sophisticated the tooling and scale have become.

The Tools Behind Traditional Scraping

Over time, a solid ecosystem of tools has developed around this approach.

Requests is the de facto Python library for making HTTP calls. Most traditional scrapers use requests to fetch pages and then pass the response to BeautifulSoup for parsing. It's simple and reliable for static sites.
BeautifulSoup is a Python library for parsing HTML and XML. It takes raw HTML and builds a navigable tree of objects from it. It's fast to learn, very readable, and excellent for static pages. Its main limitation is that it has no browser engine, so it can't execute JavaScript. If a site renders content dynamically after page load, BeautifulSoup will see an empty container.
Selenium and Playwright are browser automation tools that control a real browser. They can click buttons, scroll, and wait for JavaScript to finish loading before extracting data. The trade-off is that they are slower and more resource-intensive than simple HTTP requests, but they are necessary for dynamic sites.
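The "empty container" problem is easy to reproduce. The sketch below uses Python's built-in html.parser rather than BeautifulSoup so it runs without extra dependencies, and the page markup is hypothetical. The price only exists inside a script that a browser would execute, so a static parser never sees it as an element.

```python
from html.parser import HTMLParser

# Hypothetical page: the product list is filled in by JavaScript after load
RAW_HTML = """
<div id="products"></div>
<script>
  document.getElementById("products").innerHTML = "<p class='price'>19.99</p>";
</script>
"""


class PriceFinder(HTMLParser):
    """Collects the text of every <p class="price"> element in the static HTML."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "p" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


finder = PriceFinder()
finder.feed(RAW_HTML)

# Script contents are treated as plain text, not markup, so nothing is found
print(finder.prices)  # []
```

A browser automation tool would run the script first and then expose the populated container.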

Traditional Scraping in Practice

Let's build a real, working scraper using Books to Scrape, a sandbox site built specifically for practicing web scraping. The goal is to extract the title, price, and star rating for every book listed on the first page.

Step 1: Install Dependencies

pip install requests beautifulsoup4

Step 2: Inspect the Page

Before writing a single line of code, open the target page in your browser and inspect its HTML. Right-click any book title and choose "Inspect" to see the structure.

You'll notice each book lives inside an <article class="product_pod"> element, and within it:

The title is in the <a> tag inside an <h3> element (as a title attribute)
The price is in a <p class="price_color"> element
The star rating is encoded in the CSS class of a <p> element. For example, <p class="star-rating Three"> means three stars

This is the core detective work of traditional scraping: you study the HTML, find the patterns, and write selectors to match them.

Step 3: Write the Scraper

import requests
from bs4 import BeautifulSoup

# 1. Fetch the page
url = "https://books.toscrape.com/"
response = requests.get(url, timeout=10)

# Always check the request succeeded before going further
if response.status_code != 200:
    # SystemExit stops the script with a message and a non-zero exit code
    raise SystemExit(f"Failed to fetch page: {response.status_code}")

# 2. Parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

# 3. Find all book containers on the page
books = soup.select("article.product_pod")

# 4. Extract data from each book
results = []

for book in books:
    # Title is stored as an attribute, not visible text
    title = book.select_one("h3 a")["title"]

    # Price is the text inside the price element
    price = book.select_one("p.price_color").get_text(strip=True)

    # Rating is encoded as a word in the CSS class: "star-rating Three"
    # We grab the second class name and map it to a number
    rating_word = book.select_one("p.star-rating")["class"][1]
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    rating = rating_map.get(rating_word, 0)

    results.append({
        "title": title,
        "price": price,
        "rating": rating
    })

# 5. Display results
for book in results:
    print(f"{book['title']} | {book['price']} | {book['rating']} stars")

Step 4: Run It

python scraper.py

Your output will look something like this:

A Light in the Attic | £51.77 | 3 stars
Tipping the Velvet | £53.74 | 1 stars
Soumission | £50.10 | 1 stars
Sharp Objects | £47.82 | 4 stars
Sapiens: A Brief History of Humankind | £54.23 | 5 stars
...

Twenty books, all structured and clean.

Step 5: Extend It to Multiple Pages

The site has 50 pages. Extending the scraper to crawl all of them requires following the "next" button:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/"
start_url = "https://books.toscrape.com/catalogue/page-1.html"

all_books = []
url = start_url

while url:
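    response = requests.get(url, timeout=10)
    # Stop on HTTP errors instead of silently parsing an error page
    response.raise_for_status()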
    soup = BeautifulSoup(response.content, "html.parser")

    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 a")["title"]
        price = book.select_one("p.price_color").get_text(strip=True)
        rating_word = book.select_one("p.star-rating")["class"][1]
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating = rating_map.get(rating_word, 0)
        all_books.append({"title": title, "price": price, "rating": rating})

    # Check for a "next" button and follow it
    next_btn = soup.select_one("li.next a")
    url = BASE_URL + next_btn["href"] if next_btn else None

print(f"Scraped {len(all_books)} books total.")

Running this crawls all 1,000 books across all 50 pages.
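Concatenating BASE_URL with the link works here because the crawl starts inside the catalogue folder. A more general option is to resolve each relative link against the URL of the page it came from. The sketch below shows this with urljoin from the standard library; the two href values are assumptions about what the "next" link looks like on the home page and on a catalogue page.

```python
from urllib.parse import urljoin

# Assumed href on the home page: relative to the site root
from_home = urljoin("https://books.toscrape.com/", "catalogue/page-2.html")

# Assumed href on a catalogue page: relative to the catalogue folder
from_catalogue = urljoin(
    "https://books.toscrape.com/catalogue/page-2.html", "page-3.html"
)

print(from_home)       # https://books.toscrape.com/catalogue/page-2.html
print(from_catalogue)  # https://books.toscrape.com/catalogue/page-3.html
```

In the loop above, that would mean passing the current page URL as the first argument instead of a fixed base.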

What Makes This Approach Fragile

This scraper works well today because books.toscrape.com is a static, stable sandbox. In production, the same approach has a well-known weakness: it's completely dependent on the HTML structure staying the same.

If the site's developer renames product_pod to book-card, or moves the price out of its <p> tag into a different element, the selectors that depend on them stop matching. You get no data, or worse, incorrect data with no error, and you only discover the breakage when someone notices the output looks wrong.
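A cheap defence against silent breakage is to check the scraper's output before passing it on. The following is a minimal sketch, not part of the scraper above: check_scrape is a hypothetical helper, and the field names simply mirror the dictionaries built in the earlier example.

```python
def check_scrape(rows, required=("title", "price", "rating"), min_rows=1):
    """Raise instead of silently returning empty or partial data."""
    if len(rows) < min_rows:
        raise ValueError(
            f"expected at least {min_rows} rows, got {len(rows)}; "
            "the selectors may no longer match the page"
        )
    for index, row in enumerate(rows):
        # None, empty strings, and the 0 fallback rating all count as missing
        missing = [field for field in required if not row.get(field)]
        if missing:
            raise ValueError(f"row {index} is missing {missing}")
    return rows


good = [{"title": "Sharp Objects", "price": "£47.82", "rating": 4}]
print(len(check_scrape(good)))  # 1

try:
    check_scrape([])  # what a renamed class would produce
except ValueError as error:
    print(f"Scrape rejected: {error}")
```

This does not stop the breakage, but it turns a quiet data gap into an error you see on the next run.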

This is one of the problems AI scraping is designed to address.

What is AI Web Scraping?

Traditional scraping works by following the structure of a page. It looks for specific elements, class names, or patterns in the HTML and extracts data based on those rules.

AI-powered scraping approaches the same problem differently. Instead of relying only on structure, it focuses on understanding the content itself. It looks at a page and identifies what something represents, not just where it's located.

In a traditional scraper, you might write something like:

response.css(".product-card .price::text").get()

You're telling the system exactly where to look. With AI scraping, you describe the outcome instead:

Extract the product name, price, and availability for each item on this page.

The system reads the page, identifies what appears to be a product listing, extracts the relevant fields, and returns structured data.

What's Actually Happening Under the Hood

AI scraping can feel like magic at first, but it's built on a combination of familiar components.

At the core are large language models (LLMs) trained on vast amounts of text, including web content and HTML. Over time, they learn patterns such as what a product listing looks like, how prices are usually presented, or how job listings are structured.

When given a page, the model can recognize these patterns and map them to the fields you asked for.

But the model is only one part of the system. You still need something to load and interact with the page. That is where browser automation comes in. Most AI scraping tools rely on headless browsers like Chromium or frameworks like Playwright to render pages, execute JavaScript, and handle real-world behavior such as scrolling or clicking.

On top of that, there's a layer that interprets your input. When you write a prompt describing the data you want, the system translates that into an extraction task. It decides what parts of the page are relevant and how to structure the output.

Finally, the system formats the results into clean data, typically as JSON or CSV, so you can use them directly with minimal post-processing.
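The flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool is implemented: build_prompt, extract, and fake_model are hypothetical names, the page text is assumed to have been rendered already, and the model call is replaced with a stand-in so the example runs offline.

```python
import json


def build_prompt(instruction, page_text):
    """Combine the plain-English request with the rendered page content."""
    return (
        f"{instruction}\n"
        "Return only a JSON array of objects.\n\n"
        f"PAGE:\n{page_text}"
    )


def extract(instruction, page_text, call_model):
    """call_model is any function that takes a prompt and returns the model's reply text."""
    reply = call_model(build_prompt(instruction, page_text))
    rows = json.loads(reply)  # fails loudly if the reply is not valid JSON
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of records")
    return rows


def fake_model(prompt):
    """Stand-in for a real LLM call, returning a canned reply."""
    return '[{"title": "A Light in the Attic", "price": "£51.77"}]'


rows = extract(
    "Extract the title and price of each book.",
    "A Light in the Attic £51.77",
    fake_model,
)
print(rows[0]["title"])  # A Light in the Attic
```

Keeping the model behind a plain function makes it easy to swap providers or test the surrounding logic without network access.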

Note: Tools like ChatGPT can interpret content, but they're not scraping systems. They don't crawl pages, handle workflows, or run repeatable data extraction. AI scraping tools combine this intelligence with the infrastructure required to collect data reliably.

Popular Tools Behind AI Scraping

As AI scraping has grown more popular, a number of tools have emerged that make this approach accessible without requiring you to build everything from scratch.

For example:

Spidra takes a pretty direct approach to extraction. You describe the data you want, and it handles loading the page, interpreting the content, and returning structured results. It also manages things like navigation and interactions behind the scenes, which makes it useful when you want to extract data without worrying about selectors or maintaining scraping logic.
Firecrawl focuses on turning web pages into clean, structured content. Instead of extracting specific fields like price or title, it converts entire pages into formats like markdown or simplified JSON. This makes it especially useful when you want to feed web content into AI systems or work with it in a readable format without dealing with messy HTML.
Jina Reader is designed to simplify web pages into clean text. It strips away layout noise such as navigation, ads, and styling, and focuses on the actual content. This is helpful when your goal is to understand or process the information on a page rather than extract structured fields.
Bright Data AI scrapers combine AI-based extraction with a strong scraping infrastructure. They allow you to request structured data without writing selectors, while also handling challenges like blocking and scaling. This makes them more suitable for larger or more demanding scraping tasks.
Apify sits somewhere in between traditional and AI-driven scraping. It provides a full platform for building and running scrapers, and allows you to introduce AI where it makes sense, whether for extraction or post-processing. This makes it useful when you need more control over the entire pipeline.

In practice, these tools aren't trying to solve the exact same problem. Some focus on extracting structured data, others on cleaning content, and others on building full scraping workflows. The right choice depends on what you're trying to achieve, not just the tool itself.

AI Scraping in Practice

Let's run the same data collection task of extracting books from books.toscrape.com using an AI scraping tool. We'll use Spidra's API so you can see exactly what changes.

Step 1: Get an API Key

Step 2: Understand the API Structure

Spidra's scrape endpoint accepts a JSON payload. The two most important fields are urls (the pages to scrape) and prompt (what to extract, written in plain English). You can optionally specify the output format. JSON works best for structured data.

POST https://api.spidra.io/scrape
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Notice that we don't need selectors or HTML inspection, just a URL and a description.

Step 3: Write a Single-Page Extraction

Here's the equivalent of our traditional scraper, written as an API call:

import requests
import json

API_KEY = "your_api_key_here"

payload = {
    "urls": [{"url": "https://books.toscrape.com/"}],
    "prompt": "Extract all books on this page. For each book, return the title, price, and star rating as a number from 1 to 5.",
    "output": "json"
}

response = requests.post(
    "https://api.spidra.io/scrape",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json=payload
)

data = response.json()
print(json.dumps(data, indent=2))

That's the entire scraper. No BeautifulSoup, no selector logic, and no HTML parsing.

Step 4: Understand the Output

The API returns a structured JSON response. Each book is represented as an object with the fields you described:

{
  "results": [
    {
      "title": "A Light in the Attic",
      "price": "£51.77",
      "rating": 3
    },
    {
      "title": "Tipping the Velvet",
      "price": "£53.74",
      "rating": 1
    },
    {
      "title": "Soumission",
      "price": "£50.10",
      "rating": 1
    }
    ...
  ]
}

The model identified the star rating encoding (star-rating Three → 3) without being told how ratings are represented. It understood the intent of "star rating as a number from 1 to 5" and handled the mapping itself.
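Because a model produces the output, it is worth checking it before trusting it. The sketch below validates a response shaped like the sample above. validate_books is a hypothetical helper, and the rules (a non-empty title, a price starting with £, a rating from 1 to 5) are assumptions drawn from that sample rather than guarantees made by the API.

```python
def validate_books(payload):
    """Split a response shaped like the sample into valid and rejected records."""
    valid, rejected = [], []
    for book in payload.get("results", []):
        title, price, rating = book.get("title"), book.get("price"), book.get("rating")
        ok = bool(
            isinstance(title, str) and title.strip()
            and isinstance(price, str) and price.startswith("£")
            and isinstance(rating, int) and 1 <= rating <= 5
        )
        (valid if ok else rejected).append(book)
    return valid, rejected


sample = {
    "results": [
        {"title": "A Light in the Attic", "price": "£51.77", "rating": 3},
        {"title": "Tipping the Velvet", "price": "£53.74", "rating": 7},  # out of range
    ]
}

valid, rejected = validate_books(sample)
print(len(valid), len(rejected))  # 1 1
```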

Step 5: Use Actions for Multi-Step Workflows

Where AI scraping starts to show its real advantages is with workflows that would require significant engineering in a traditional scraper.

Suppose you want to visit each book's detail page and extract the full description and availability status (not just what's visible on the listing page).

In a traditional scraper, this means building a follow-link loop, managing state, handling errors on each detail page, and maintaining separate selectors for the detail page's different structure. In an AI scraper like Spidra, you can mimic a real human interaction with browser actions:

{
  "urls": [{
    "url": "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    "actions": [{
      "type":            "forEach",
      "observe":         "Find all book cards in the product grid",
      "mode":            "inline",
      "captureSelector": "article.product_pod",
      "maxItems":        10,
      "itemPrompt":      "Extract the book title, price, and star rating (One/Two/Three/Four/Five). Return as JSON: {title, price, star_rating}"
    }]
  }]
}

The system navigates to each book's page, reads the new content, extracts the additional fields, and returns them as part of the same result set.

You can also define the shape of the output with a schema:

{
  "urls": [{ "url": "https://jobs.example.com/senior-engineer" }],
  "prompt": "Extract the job details",
  "schema": {
    "type": "object",
    "required": ["title", "company", "remote", "employment_type"],
    "properties": {
      "title":           { "type": "string" },
      "company":         { "type": "string" },
      "location":        { "type": ["string", "null"] },
      "remote":          { "type": ["boolean", "null"] },
      "salary_min":      { "type": ["number", "null"] },
      "salary_max":      { "type": ["number", "null"] },
      "employment_type": {
        "type": ["string", "null"],
        "enum": ["full_time", "part_time", "contract", null]
      },
      "skills": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}
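On the receiving side, a record can be checked against the same rules. The sketch below hand-rolls only two of them, the required list and the employment_type enum, plus the skills array; a real project would more likely use a full JSON Schema validator. The schema_problems helper and the sample record are hypothetical.

```python
REQUIRED = ["title", "company", "remote", "employment_type"]
EMPLOYMENT_TYPES = ["full_time", "part_time", "contract", None]


def schema_problems(job):
    """Return a list of problems; an empty list means the record passes these checks."""
    # "required" means the key must be present; a null value is still allowed
    problems = [f"missing required field: {key}" for key in REQUIRED if key not in job]
    if job.get("employment_type") not in EMPLOYMENT_TYPES:
        problems.append(f"employment_type not in enum: {job['employment_type']!r}")
    skills = job.get("skills", [])
    if not (isinstance(skills, list) and all(isinstance(s, str) for s in skills)):
        problems.append("skills must be a list of strings")
    return problems


# Hypothetical record, as an extraction might return it
job = {
    "title": "Senior Engineer",
    "company": "Example Ltd",
    "remote": None,
    "employment_type": "freelance",
    "skills": ["Python", "SQL"],
}

print(schema_problems(job))  # ["employment_type not in enum: 'freelance'"]
```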

These tools offer more than what's shown here, including batch scraping and AI crawling.

Where AI Scraping Earns Its Keep

Now suppose the site updates its frontend. The class product_pod gets renamed to book-card. The price moves into a different element.

In the traditional scraper, you get zero results and no error until you notice the data is missing. You then re-inspect the page, update the selectors, test, and redeploy.

In the AI scraper, you run the same prompt. The model isn't looking for product_pod or price_color. It's looking for content that resembles a product listing with pricing information. The layout change is invisible to the extraction logic.

This is the core operational advantage of the AI approach: structural changes to a page don't automatically break your extraction.

Traditional Scraping vs AI Scraping: When to Use Each

At this point, the difference between the two approaches is clear. The more important question is when each one actually makes sense in practice.

A simple way to think about it is this:

Scenario	Traditional Scraping	AI Scraping
Stable websites	✅ Best choice	✅ Works, but can be overkill
Frequently changing layouts	❌ Breaks often	✅ More resilient
Large-scale crawling	✅ More cost-efficient	✅ Efficient but can get expensive
Fast prototyping	❌ Slower setup	✅ Very fast
Non-technical users	❌ Requires coding	✅ More accessible
Full control & transparency	✅ High control	❌ Less transparent
Messy or inconsistent data	❌ Hard to maintain	✅ Easier to handle
Complex workflows (login, steps)	⚠️ Possible but manual	✅ Often built-in

In practice, it's not a cut-and-dried choice between the two. Traditional scraping works best when everything is predictable and stable. AI scraping becomes useful when things are messy, dynamic, or time-sensitive. Most real-world systems combine both approaches rather than relying on one alone.

Wrapping up

Web scraping is not going away. What's changing is how we approach it.

Traditional scraping gives you control and precision, but it can be fragile and time-consuming to maintain. AI scraping makes things faster and more flexible, especially when dealing with messy or constantly changing pages, but it comes with less transparency.

In practice, most real-world workflows are starting to combine both.

We're also beginning to see AI scraping tools integrate into larger systems, especially with AI agents and MCP-style setups, where scraping becomes something that can be triggered on demand rather than built from scratch each time.

The key takeaway is simple. Traditional scraping tells the system where the data is. AI scraping tells the system what the data means.

Knowing when to use each is what actually matters.

The Data Quality Handbook: Data Errors, the Developer's Role, and Validation Layers Explained.

Great John — Tue, 14 Apr 2026 20:29:40 +0000

In August 2012, Knight Capital, a major trading firm in the United States, deployed faulty trading software to its production system. The system ran with an incorrect configuration and triggered millions of unintended stock trades.

The company lost about $440 million in just 45 minutes. Knight Capital nearly collapsed and had to be rescued by investors. It was later acquired by another firm.

When Target expanded into Canada, the company relied on a new supply chain system that contained incorrect product and inventory data. Product information in the database was incomplete and inaccurate. Prices, sizes, and product descriptions were entered incorrectly.

Inventory systems reported items in stock that were actually unavailable. Customers found empty shelves in stores despite the system showing stock. The company lost over $2 billion in the Canadian market. Target eventually shut down all Canadian stores in 2015.

One employee said, “Even though we had a great supply chain system on paper, we didn’t have accurate data. Bad data leads to bad decisions.”

Another famous example of data-related engineering failures involves the Mars Climate Orbiter spacecraft. One engineering team used metric units (newtons). Another team used imperial units (pounds-force). The system failed to convert the data correctly. The spacecraft entered Mars' atmosphere at the wrong altitude. The mission failed and the spacecraft was destroyed. The loss was about $125 million.

In this article, we'll delve deep into what data quality truly means, the types of data errors that silently break systems, the developer’s responsibility in preventing them, and the validation layers that work together to keep bad data out of production.

What We'll Cover:

Prerequisites
The Importance of Data Quality
- How Does Bad Data Happen in the First Place?
- The Cost of Bad Data
Types of Data Errors
What Makes Good Data?
Data Validation Layers
Testing Strategies to Protect Data Quality
Conclusion

Prerequisites

A basic understanding of what data is
A basic understanding of data structures
An understanding of what an API is
An understanding of what a database is and what it does

The Importance of Data Quality

As you can see from just these few examples, the quality of the data you're working with really matters.

Gartner reports that organisations attribute an average of around $15 million in annual losses to poor‑quality data. The same research also shows that nearly 60% of companies have no clear idea what bad data actually costs them, largely because they don’t track or measure data‑quality issues at all.

A 2016 study by IBM is even more eye-popping. IBM found that poor data quality strips $3.1 trillion from the U.S. economy annually due to lower productivity, system outages, and higher maintenance costs.

Bad data is, and will continue to be, the kryptonite of any organisation. This is even more concerning as more organisations now depend on data for strategy execution than ever before.

When data is wrong, incomplete, duplicated, or inconsistent, the consequences ripple outward. Incorrect dashboards mislead teams into making incorrect decisions, and acting on those decisions leads to faulty strategy and policy.

Eventually, the organisation pays the price, financially, operationally, and reputationally. And while money can be recovered, reputation rarely bounces back so easily.

How Does Bad Data Happen in the First Place?

Form fields are usually the first place where data enters an application, so they’re often where bad data begins. This is why the developer’s role is so critical.

Many of the most damaging data errors don’t originate from malicious users or complex edge cases – they come from simple oversights that the system should never have allowed in the first place.

But it's equally important to recognise that data quality issues often originate before the data ever reaches an application. Upstream processes — how data is collected, measured, recorded, or pre‑validated — can introduce inaccuracies long before the system receives it.

For example, a nurse might weigh a patient using an uncalibrated mechanical scale, record the incorrect value on a paper form, and later have that value transcribed into the hospital system. By the time the data enters the application, the error is already embedded.

This means that maintaining data quality requires attention both to upstream data collection practices and to the system-level validation that developers control.

When the UI, backend, or API layer permits invalid, incomplete, inconsistent, or logically impossible data to enter the pipeline, the organisation inherits a long‑term liability. Even small choices — such as allowing empty fields, ignoring duplicates, or failing to enforce validation rules — can introduce errors that may only surface months later in reports or dashboards, leading to confusion and inaccurate insights.

The Cost of Bad Data

Data quality can also be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

If bad data is caught in the UI, it's almost free, if we're thinking in terms of cost. If it's caught at the API layer, that's still pretty cheap. If it's caught in the database, the cost is moderate. And if it's caught in a report or ML model months later, that's expensive, and sometimes irreversible.

A key principle in modern data management is that the cheapest and safest place to catch bad data is at the source, before ingestion. The well-known 1-10-100 Rule, introduced by George Labovitz and Yu Sang Chang in 1992, illustrates this idea.

According to the rule, it costs about $1 per record to validate data at the point of entry, $10 to correct it after it has entered the system, and $100 if the error goes unnoticed and causes problems further down the line.
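To see how quickly this adds up, here is a small worked example. The multipliers come from the rule itself; the figure of 500 bad records is a hypothetical number chosen for illustration.

```python
# Per-record cost at each stage, following the 1-10-100 Rule
COST_PER_RECORD = {"at entry": 1, "after ingestion": 10, "downstream": 100}

bad_records = 500  # hypothetical number of faulty records

totals = {stage: bad_records * cost for stage, cost in COST_PER_RECORD.items()}

for stage, total in totals.items():
    print(f"Caught {stage}: ${total:,}")

# Caught at entry: $500
# Caught after ingestion: $5,000
# Caught downstream: $50,000
```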

As the saying goes, an ounce of prevention is worth a pound of cure – and this is especially true when it comes to maintaining high-quality data.

To make this concrete, I’ve categorised the types of errors and oversights that developers can and should prevent before they ever reach the database, analytics layer, or reporting systems.

Types of Data Errors

Required Field Errors

If you build a registration form that lets a user submit it with important fields left empty (like first name, last name, email address, phone number, date of birth, or address), you're directly letting incomplete data enter the system.

I remember a scenario from my time as a data analyst where I was analysing a dataset containing different types of alarms triggered across several buildings. These alarms fell into categories such as aquarium alarms, intruder alarms, fire alarms, and maintenance alarms.

The purpose of the analysis was simple: identify which buildings had the highest frequency of alarms so that maintenance, resources, or investigations could be allocated appropriately.

Whenever an alarm went off, the security team recorded it using a software system. By the end of each month, we could view the cumulative alarms and generate insights.

But I encountered a major data quality issue. The security officers often selected the alarm category but failed to submit the building where the alarm occurred — and the system allowed this incomplete record to be saved into the database.

Every alarm had to occur in a specific building. Yet during analysis, I would see entries like “20 fire alarms” with no building information attached. Since I couldn’t determine where these alarms happened, the data became unusable. I had no choice but to delete those records because they provided no actionable value.

This is a classic example of poor data validation. If the developer had implemented proper constraints, the system would never allow an alarm to be submitted without a building name.

Required fields should be enforced at the UI and backend levels to prevent missing data from entering the system in the first place. These gaps lead to missing or unusable data in the database, often forcing teams to delete or manually repair records later.

To prevent these errors, you can use required‑field validation, disable the submit button until all mandatory fields are completed, and visually highlight missing fields with inline error messages.

Here's a practical code example of some bad code (no required checks):

From the above code snippet, the core problem is that the form doesn't enforce required input. Neither HTML‑level validation (using the required attribute) nor JavaScript‑based checks are implemented. This omission allows users to submit the form without providing necessary information, making the form unreliable for collecting valid and complete user data.

From a usability and data quality perspective, this is problematic. Forms are typically designed to collect meaningful and complete information, and fields such as “Full name” and “Email” are usually essential. Without marking these inputs as required or validating them programmatically, we risk receiving blank or invalid submissions, which can compromise the quality of stored data and any processes that depend on it.

Here's an example of a better version (UI prevents empty submission):

In this revised version of the code, the addition of the required attribute to both the name and email input elements ensures that the browser won't allow the form to be submitted unless these fields are filled. This is an important step toward maintaining data completeness and improving the overall reliability of the form.

Also, by checking e.target.checkValidity(), we now ensure that the form is evaluated before submission proceeds.

Another positive aspect is the conditional use of e.preventDefault(). When the form is invalid, the default submission behavior is stopped, preventing incomplete or incorrect data from being sent.
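Browser checks can be bypassed, so the same rule should also be enforced on the backend. Here is a minimal Python sketch based on the alarm example from earlier. The field names, the missing_fields and save_alarm helpers, and the in-memory list standing in for a database are all hypothetical.

```python
REQUIRED_FIELDS = ("alarm_type", "building")  # hypothetical field names


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in the record."""
    return [field for field in required if not str(record.get(field) or "").strip()]


def save_alarm(record, store):
    """Append the record to the store only if every required field is filled in."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"cannot save alarm, missing: {', '.join(missing)}")
    store.append(record)


alarms = []  # stands in for a database table
save_alarm({"alarm_type": "fire", "building": "Block A"}, alarms)

try:
    save_alarm({"alarm_type": "fire", "building": "  "}, alarms)
except ValueError as error:
    print(error)  # cannot save alarm, missing: building

print(len(alarms))  # 1
```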

Format Validation Errors

If your form lets a user enter an email without an @ symbol, an email without a domain, a phone number containing letters, or a postcode/ZIP code in the wrong format, it allows invalid data to enter the system.

The same applies when you allow a user to submit an impossible date (32/15/2025) or a credit card number with the wrong length.

These issues force the data analyst to spend more time cleaning the data, if it's even cleanable. Such incorrect inputs create unreliable data that breaks downstream processes and increases cleanup costs.

To prevent these types of errors, you can use regex validation, input masks, and field‑type restrictions (for example, numeric‑only fields for phone numbers) to enforce correct formats before submission.

Here's a bad example of allowing format validation errors:

This code doesn't perform any checks on the format or structure of the phone number. The function simply retrieves whatever value exists – whether valid, invalid, or blank – and logs it to the console without any condition.

Here's the fixed version:

This version fixes the earlier mistake by introducing a clear validation rule. Before the system accepts the phone number, it checks whether the input contains only digits. The regular expression ^\d+$ ensures that the value is made up entirely of numbers, with no letters or symbols allowed. If the user enters anything invalid, the function stops and displays an error message instead of saving bad data.

This approach prevents the format error that occurred in the previous example. Instead of blindly trusting whatever the user types, the code now enforces a rule that matches the expected format of a phone number. This is what a responsible developer should do: verify the input before using it.
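The same format rules can be applied on the backend. The sketch below is one way to do it in Python, with hypothetical helper names. It uses re.fullmatch with an explicit 0-9 range, because in Python \d also matches non-ASCII digits and $ tolerates a trailing newline. The date check assumes a day/month/year format.

```python
import re
from datetime import datetime


def is_valid_phone(value):
    """Digits only, matching the rule described above."""
    return re.fullmatch(r"[0-9]+", value) is not None


def is_valid_date(value, fmt="%d/%m/%Y"):
    """True only if the string is a real calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


print(is_valid_phone("08012345678"))  # True
print(is_valid_phone("0801-ABC"))     # False
print(is_valid_date("28/02/2025"))    # True
print(is_valid_date("32/15/2025"))    # False
```

Parsing the date rather than pattern-matching it also rejects values like 29/02/2025, which look well-formed but don't exist.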

Range and Limit Errors

Allowing users to enter values outside acceptable limits (such as negative ages, quantities below zero, discounts above 100%, or measurements far beyond realistic ranges) lets data that violates business rules into the system. These errors distort analytics, break calculations, and create operational inconsistencies.

To mitigate these errors, you can apply min/max constraints, sliders, steppers, and numeric boundaries to ensure values fall within valid ranges.

Here's a bad example of allowing range and limit errors:

As seen above, we've created an input field for age but haven't specified any limits or constraints. The browser allows the user to type any number, including values that make no sense, such as negative ages, extremely large ages, or decimals. The JavaScript function simply reads the value and logs it without checking whether the age is realistic.

Here's a better version:

Now in this version, the inclusion of the min="0" and max="120" attributes sets clear boundaries for acceptable input values. This ensures that only realistic age values within a defined range are allowed, preventing invalid entries such as negative numbers or excessively large ages.

The JavaScript function further enhances this validation by using the checkValidity() method. This method checks whether the input satisfies all defined constraints, including the required condition and the specified numeric range. If the input doesn't meet these conditions, the function prevents further execution and displays an alert message, informing the user that the entered age must fall within the allowed range.
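The `checkValidity()` method only exists on DOM elements, so the same boundaries are worth expressing as plain logic too. The sketch below assumes a hypothetical `isValidAge` helper and mirrors the `min="0"` and `max="120"` limits. Rejecting decimals is an added assumption.

```javascript
// Hypothetical limits mirroring the min="0" and max="120" attributes.
const AGE_MIN = 0;
const AGE_MAX = 120;

function isValidAge(rawValue) {
  // Treat missing or blank input as invalid (the field is required).
  if (rawValue === null || rawValue === undefined) return false;
  const text = String(rawValue).trim();
  if (text === "") return false;

  const age = Number(text);
  // Number.isInteger is false for NaN and for decimals such as 12.5.
  if (!Number.isInteger(age)) return false;

  return age >= AGE_MIN && age <= AGE_MAX;
}

console.log(isValidAge("35"));  // true
console.log(isValidAge("-4"));  // false
console.log(isValidAge("150")); // false
```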

Logical Consistency Errors

If you allow a user to select an end date before the start date, choose a checkout date earlier than check‑in at a hotel, or enter a delivery date before the order date, this will result in logically impossible data. The same applies when you allow a user to enter a graduation year earlier than their admission to a program, or submit working hours that exceed 24 hours in a day.

You can mitigate this by implementing cross‑field validation, business‑rule checks, and conditional logic that ensures related fields remain consistent.

Here's a bad example of a logical consistency error:

In the code above, the core issue is the complete absence of validation. Although the inputs use type="date", which provides a structured way for users to select dates, the code doesn't enforce that either field is required. This means the user can leave one or both date fields empty, and the save() function will still run and log the values. As a result, the system may end up processing incomplete or meaningless data.

Beyond missing required checks, the code also fails to validate the logical relationship between the two dates. In any scenario involving a start date and an end date, it's expected that the start date shouldn't occur after the end date. But this code performs no such comparison.

This means that the user can select a start date that's later than the end date, and the system will accept it without warning. This leads to inconsistent or impossible data being recorded.

Also, the function simply logs the values without providing any feedback to the user. There's no mechanism to alert the user when a field is empty or when the dates are logically incorrect. This reduces usability and makes it difficult for users to understand or correct their mistakes.

Here's the fixed version:

This improved version makes two changes. First, both date fields now include the required attribute, ensuring that the user can't leave either field empty without triggering validation.

Second, we've added a logical validation check to ensure that the relationship between the two dates is correct. After retrieving the values, the function converts them into Date objects and compares them to verify that the end date doesn't occur before the start date. If this condition is violated, the function stops execution and displays an alert informing the user of the error.

This prevents inconsistent or impossible date ranges from being accepted.
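The cross-field check described above can be sketched as a single function. This is an illustration under assumptions: the `validateDateRange` name and the messages are hypothetical, and the inputs are expected to be the `YYYY-MM-DD` strings that a date input produces.

```javascript
// Hypothetical helper: checks that both dates are present and in order.
function validateDateRange(startValue, endValue) {
  if (!startValue || !endValue) {
    return { ok: false, error: "Both start and end dates are required." };
  }

  const start = new Date(startValue);
  const end = new Date(endValue);

  // An unparseable string produces an invalid Date whose time is NaN.
  if (Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    return { ok: false, error: "Dates must be valid." };
  }
  if (end < start) {
    return { ok: false, error: "End date cannot be before start date." };
  }
  return { ok: true };
}

console.log(validateDateRange("2025-03-01", "2025-03-10").ok); // true
console.log(validateDateRange("2025-03-10", "2025-03-01").ok); // false
```

This sketch accepts a range that starts and ends on the same day. Whether that is allowed is a business decision.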

Duplicate and Data Integrity Errors

When you let a user submit an email that's already registered, choose a username that's already taken, or enter a duplicate employee ID or student number, this results in identity conflicts and duplicate records. Problems also arise when you allow users to upload unsupported file types, oversized files, or corrupted images.

Security risks can emerge when users are able to enter HTML/script tags (XSS), SQL‑injection patterns, or disallowed special characters. These issues compromise data quality, system integrity, and security.

You can prevent these types of issues by using uniqueness checks, file‑type and size validation, and input sanitization to block duplicates, invalid uploads, and malicious inputs.

Here's an example of a duplicate error:

This code blindly pushes every email into the savedEmails array without checking whether the email already exists. Because there is no duplicate detection, the user can enter the same email multiple times.

Here is the fixed version:

In this improved version of the code, we've implemented proper validation steps to prevent duplicate email entries. Before saving the email, the function checks whether the value already exists in the savedEmails array using the includes() method. If the email is found, the function stops execution and displays an alert informing the user that the email has already been saved. This ensures that each email is stored only once, maintaining the uniqueness and integrity of the data.
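A minimal sketch of that duplicate check follows. The `saveEmail` name and the return shape are assumptions. Trimming and lowercasing the address first is an addition to the description above, so that two spellings of the same address count as one.

```javascript
// In-memory store standing in for the savedEmails array described above.
const savedEmails = [];

// Hypothetical helper: stores an email only if it is not already saved.
function saveEmail(rawEmail) {
  // Normalising first is an added assumption, not part of the description.
  const email = String(rawEmail).trim().toLowerCase();

  if (email === "") {
    return { saved: false, reason: "empty" };
  }
  if (savedEmails.includes(email)) {
    return { saved: false, reason: "duplicate" };
  }

  savedEmails.push(email);
  return { saved: true };
}

console.log(saveEmail("ada@example.com").saved);   // true
console.log(saveEmail(" Ada@Example.com ").saved); // false
```

An in-memory array only protects a single page session. A uniqueness guarantee that survives reloads and concurrent users still needs a server-side or database check.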

Relational Errors (Reference Integrity)

If you let a user select a city that doesn’t belong to the chosen country, a product ID that no longer exists, a retired SKU, or a shipping method unavailable in the selected region, this can result in broken references.

The same applies when users can select a manager from a different department or choose a fully booked time slot, or when roles and permissions aren't set up correctly. These errors break relationships between tables and corrupt downstream joins and reports.

Here, you can use dependent dropdowns, real‑time lookups, and foreign‑key validation to help ensure that users can only select valid, existing, and compatible options.

Here's a bad example of a relational error:

From the above, the mistake in this code is that we've treated country and city as completely independent fields, even though one is supposed to depend on the other. By presenting all cities regardless of the selected country, the interface allows users to create combinations that make no sense — such as choosing “United Kingdom” with “New York” or “United States” with “Manchester.”

Also, because the save() function performs no validation and simply logs whatever the user selects, the system ends up accepting and storing relationships that should never exist. This breaks the logical link between the two fields and leads to invalid, inconsistent data that can corrupt downstream joins and reports.

Here's the fixed, production-ready version:

This improved code turns the country–city form into a controlled, relationship‑aware flow instead of two loose dropdowns.

When the user selects a country, the loadCities() function runs. It first clears the city dropdown and, if no country is selected, keeps the city field disabled so the user can't choose a city on its own.

Once a valid country is chosen, the city dropdown is enabled and populated only with the cities that belong to that specific country, using the citiesByCountry mapping. Also, the city values are normalised (lowercased and stripped of spaces) so they’re consistent and safe to compare.

When the user clicks “Save,” the save() function checks that both a country and a city have been selected. If either is missing, it shows an alert and stops. It then rebuilds the list of valid city values for the chosen country and verifies that the selected city is actually in that list.
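The lookup and the final check described above can be sketched without the DOM. The mapping contents and the helper names other than `citiesByCountry` are assumptions for illustration.

```javascript
// Hypothetical mapping; the countries and cities are illustrative only.
const citiesByCountry = new Map([
  ["uk", ["London", "Manchester"]],
  ["us", ["New York", "Chicago"]],
]);

// Normalise a city name: lowercase it and strip spaces.
function normaliseCity(name) {
  return name.toLowerCase().replace(/\s+/g, "");
}

// Valid, normalised city values for a country (empty list if unknown).
function validCityValues(country) {
  return (citiesByCountry.get(country) || []).map(normaliseCity);
}

// Mirrors the save() check: both fields set, and the city belongs to the country.
function isValidSelection(country, cityValue) {
  if (!country || !cityValue) return false;
  return validCityValues(country).includes(cityValue);
}

console.log(isValidSelection("uk", "manchester")); // true
console.log(isValidSelection("uk", "newyork"));    // false
```

Re-checking the pair at save time matters even with a dependent dropdown, because option values can be edited in the browser before submission.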

Structural Errors (Dropdowns, Radio Buttons, Enums)

If users can type a country as “U.S.A”, “USA”, “United States”, or “us”, enter gender as “male”, “Male”, “M”, or “man”, or type a department as “Engineering”, “Eng”, or “engineer”, this can result in inconsistent categorical data.

The same applies to currencies typed as “usd”, “USD”, “US Dollars”, product categories spelled differently, status values like “active”, “Active”, “ACT”, “enabled”, or boolean values like “yes”, “Yes”, “Y”, “1”.

These inconsistencies make analytics, grouping, and reporting unreliable, and the analyst will spend time cleaning and standardizing these files.

You should replace free‑text fields with dropdowns, radio buttons, and enums to enforce standardized categorical values.

Here's a bad example of a structural error:



The problem with this code is that it pretends to save a country value without doing any real validation or enforcing any rules, which makes the form unreliable and prone to bad data.

The form uses a plain text input for “country,” meaning the user can type anything they want — misspellings, random characters, invalid countries, or even leave it blank. Because the input isn’t marked as required and the JavaScript doesn’t check whether the field contains a meaningful value, the form will happily “save” an empty string or nonsense text.

The submit handler prevents the default form submission but does nothing beyond logging whatever the user typed, so the system accepts invalid, incomplete, or malformed data without question. In short, the code collects input but doesn't validate it, doesn't enforce correctness, and doesn't protect the system from bad or unusable values.

Here's the fixed version:



The biggest improvement is that we're no longer relying on a free‑text field for the country. By switching to a dropdown, the form now limits the user to a controlled set of valid options. This prevents misspellings, random text, or invalid country names from ever entering the system.
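A dropdown restricts what the interface offers, but the submitted value can still be checked against the same controlled list. The sketch below assumes a hypothetical `ALLOWED_COUNTRIES` list of codes. In practice it would come from whatever source populates the dropdown.

```javascript
// Hypothetical list of allowed country codes (illustrative values).
const ALLOWED_COUNTRIES = Object.freeze(["GB", "US", "NG"]);

// Accepts only exact members of the controlled list.
function isAllowedCountry(value) {
  return ALLOWED_COUNTRIES.includes(value);
}

console.log(isAllowedCountry("US"));     // true
console.log(isAllowedCountry("U.S.A"));  // false
```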

These are the main types of data errors you might come across in your work. Now that we've discussed what causes them and some key fixes/preventative measures you can take, let's move on to data quality itself.

What Makes Good Data?

So what, in fact, is data quality? IBM defines it as the degree of accuracy, consistency, completeness, reliability, and relevance of the data collected, stored, and used within an organization or a specific context.

Let's look at each of these features of quality data a bit more closely to understand what they entail.

Completeness:

Completeness measures how much of the required data is actually present. When large portions of fields are missing, the dataset stops representing reality and any analysis built on it becomes unreliable.

An example would be a sign‑up form that stores users, but half of them are missing an email address. If you run an analysis on “email engagement,” your results will be skewed because a big chunk of users can’t even receive emails. This means that this data is incomplete.

Uniqueness:

Uniqueness checks whether each real‑world entity appears only once in the dataset. Duplicate records inflate counts, break joins, and distort metrics.

An example would be a customer table containing two rows for the same person with the same customer ID. When calculating “active customers,” the system counts them twice, inflating revenue projections.
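Completeness and uniqueness, the two measures described so far, can each be expressed as a simple ratio. The sketch below is one way to compute them. The sample records, field names, and function names are assumptions for illustration.

```javascript
// Illustrative records; the field names are assumptions.
const users = [
  { customerId: 1, email: "a@example.com" },
  { customerId: 2, email: "" },
  { customerId: 2, email: "b@example.com" },
  { customerId: 3, email: null },
];

// Share of rows where the field is present and non-empty.
function completeness(rows, field) {
  if (rows.length === 0) return 1;
  const filled = rows.filter((row) => {
    const value = row[field];
    return value !== null && value !== undefined && value !== "";
  }).length;
  return filled / rows.length;
}

// Share of rows whose key value is distinct (1 means no duplicates).
function uniqueness(rows, key) {
  if (rows.length === 0) return 1;
  const distinct = new Set(rows.map((row) => row[key]));
  return distinct.size / rows.length;
}

console.log(completeness(users, "email"));    // 0.5
console.log(uniqueness(users, "customerId")); // 0.75
```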

Validity:

Validity evaluates whether data follows the expected format, type, or business rules. This includes correct data types, allowed ranges, and patterns defined by the system.

An example would be a date field that contains values like “32/99/2025” or “tomorrow.” These invalid entries break downstream ETL jobs that expect a proper date format.

Timeliness:

Timeliness reflects whether data is available when it’s needed. Even accurate data becomes useless if it arrives too late for the process that depends on it. For example, after a customer places an order, the system should generate an order ID instantly.

Accuracy:

Accuracy measures how closely data matches the real‑world truth. When multiple systems report the same metric, one must be designated as the authoritative source to avoid conflicting values.

Consistency:

Consistency checks whether data aligns across different datasets or within related fields. If two systems describe the same concept, their values shouldn't contradict each other.

For example, a company’s HR system reports 50 employees in Engineering, but the payroll system lists only 42. Since both describe the same group, the mismatch signals a data quality issue.

Fitness for Purpose:

Fitness for purpose assesses whether the data is suitable for the specific business task at hand. Even complete, accurate, and timely data may be unhelpful if it doesn’t answer the intended question.

A dataset of website clicks might be perfect for analysing user engagement, for example, but it’s useless for forecasting revenue because it contains no purchase or pricing information.

Data Validation Layers

Now that we've highlighted the characteristics that ensure quality data, it's important to discuss the layers of data validation.

There are five layers you'll need to check to enforce data quality.

Frontend Layer — “Protect the User, Not the System”

Frontend validation plays an important role in enhancing the user experience – but it doesn't provide real protection for a system.

Since frontend logic operates within the user’s environment, we can't trust it as a mechanism for enforcing data quality. Any code executed in the browser is ultimately under the user’s control, meaning it can be disabled, modified, intercepted, or bypassed entirely.

For instance, a user can simply open browser developer tools, remove validation rules, and submit invalid or malicious data without restriction.

Frontend validation is incapable of enforcing complex business rules. Constraints such as ensuring that a discounted price is lower than the original price, validating that a start date precedes an end date, preventing stock levels from becoming negative, or confirming that a product belongs to a valid category within the database require deeper system-level checks.

At the frontend level, you typically validate required fields, email format, password strength, address fields, and payment input format.

So frontend validation doesn't guarantee data quality or security, as it can be bypassed through API tools (like Postman), disabled JavaScript, malicious bots, and third-party integrations.

Because of this, it's best to treat the front-end as a usability layer, not a trust layer.

Backend Validation — “The Real Gatekeeper”

You can only guarantee true data quality and system integrity at the backend and database layers.

The backend is responsible for enforcing request validation, implementing business logic, and managing authentication and authorization.

If validation fails here, invalid data is rejected before it can propagate. Without this layer, data corruption begins at ingestion.

For example:

$request->validate([
   'name' => 'required|string|max:255',
   'price' => 'required|numeric|min:0',
   'stock' => 'required|integer|min:0',
   'category_id' => 'required|exists:categories,id',
]);

The code snippet above demonstrates how you can use request validation in Laravel to ensure that incoming data meets specific requirements before it's processed or stored in the database. This is an essential practice in web development, as it helps maintain data integrity, prevents errors, and enhances application security.

In this example, we're using the $request->validate() method to define a set of validation rules for four input fields: name, price, stock, and category_id. Each field is assigned a series of constraints that the incoming data must satisfy.

The name field is marked as required, meaning it must be included in the request and can't be empty. It must also be a string, ensuring that only textual data is accepted, and it's limited to a maximum length of 255 characters using max:255. This prevents excessively long inputs that could potentially cause issues in the database or user interface.

Similarly, the price field is required and must be numeric, allowing only numbers such as integers or decimal values. The rule min:0 ensures that the price can't be negative, which is logically consistent for most product pricing scenarios.

The stock field is also required and must be an integer, meaning it can only accept whole numbers. This is appropriate for counting physical items. Like the price field, it includes a min:0 rule to prevent negative stock values, which would not make sense in an inventory system.

Finally, the category_id field is validated to ensure it is both present and valid. The required rule ensures that a category is selected, while the exists:categories,id rule checks that the provided value corresponds to an existing id in the categories database table. This prevents invalid or non-existent category references, thereby preserving relational integrity within the database.

This layer validates null values, data types and formats, allowed ranges, and referential integrity (exists).

Database Layer — “Protect the Data at Rest”

Validation at the application level is insufficient on its own. You'll also need to enforce database-level constraints like NOT NULL constraints, UNIQUE constraints (email, SKU, order number), foreign keys (orders.user_id → users.id), and check constraints (for example, price >= 0).

This layer is critical because application bugs may bypass validation, background jobs and imports may skip controllers, and malicious actors may attempt direct access.

The database layer acts as the final line of defense: its constraints enforce structural integrity even when application code fails or is bypassed.

Service Layer / Business Logic — “Validate Real-World Rules”

This layer enforces domain-specific rules that can't be captured by simple request validation or database constraints. The service layer is where the application stops asking “Is this data shaped correctly?” and starts asking “Is this allowed to happen in the real world?”.

These rules reflect business truth, not structural correctness.

Example:

if ($product->stock < $quantity) {
   throw new OutOfStockException();
}

This prevents overselling and ensures the system reflects physical reality.

if ($cartTotal !== $calculatedTotal) {
   throw new PriceMismatchException();
}

This protects revenue and prevents tampering.

In this layer, you enforce real‑world business rules by ensuring inventory correctness, recalculating totals, applying discount logic, and checking user‑specific limits.

Jobs / Queues / Data Ingestion — “Validate External Data”

When importing or processing external data (for example, supplier feeds), validation must occur before processing. You'll need to ensure schema conformity, that the required columns are present, that you have the correct data types, that the JSON structure is valid, and that you're detecting duplicate batches.

This is because external data sources are a major source of data quality issues. Without validation here, corrupted data can silently enter the system at scale.
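The checks listed above can be sketched as one gate that a batch must pass before processing. Everything named here is an assumption for illustration: the `FEED_SCHEMA` columns, the `batchId` and `rows` fields, and the in-memory set of processed batches.

```javascript
// Hypothetical schema for a supplier feed: column name -> expected typeof.
const FEED_SCHEMA = { sku: "string", price: "number", stock: "number" };

// Batch IDs already processed; a real system would persist these.
const seenBatchIds = new Set();

function validateBatch(rawJson) {
  let batch;
  try {
    batch = JSON.parse(rawJson); // the JSON structure must be valid
  } catch (err) {
    return { ok: false, errors: ["Invalid JSON"] };
  }
  if (batch === null || typeof batch !== "object" || !Array.isArray(batch.rows)) {
    return { ok: false, errors: ["Expected an object with a rows array"] };
  }

  const errors = [];
  if (seenBatchIds.has(batch.batchId)) {
    errors.push("Duplicate batch: " + batch.batchId);
  }

  batch.rows.forEach((row, index) => {
    for (const [column, type] of Object.entries(FEED_SCHEMA)) {
      if (row === null || typeof row !== "object" || !(column in row)) {
        errors.push(`Row ${index}: missing column "${column}"`);
      } else if (typeof row[column] !== type) {
        errors.push(`Row ${index}: "${column}" must be a ${type}`);
      }
    }
  });

  // Only remember the batch once it has passed every check.
  if (errors.length === 0) seenBatchIds.add(batch.batchId);
  return { ok: errors.length === 0, errors: errors };
}
```

Collecting every error instead of stopping at the first one makes a rejected feed easier to fix in a single pass.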

Now that we've discussed the layers of a modern application stack, it should be clear that data quality isn't something you “check once” at the UI.

It must be enforced repeatedly, at multiple depths of the system. Each layer catches a different class of defects, and together they form a defensive wall that prevents bad data from ever reaching storage, analytics, or downstream consumers.

Testing Strategies to Protect Data Quality

To wrap up, here are the three foundational testing strategies every developer should apply to protect data quality.

Unit Testing

Unit tests are the first line of defense in data quality. In this context, a “unit” refers to a single column, a single transformation, or a single validation rule.

The purpose is straightforward: verify that the smallest building blocks of your data logic behave exactly as intended. This matters because if these low‑level rules are not tested and validated, incorrect or inconsistent data will flow into the database and contaminate everything built on top of it.

By isolating each rule or transformation, you can guarantee that schema constraints, field‑level assumptions, and low‑level logic remain correct before data ever flows into larger pipelines or business processes.

Typical questions answered at this layer include:

Does this column allow nulls?
Does this regex correctly strip whitespace from email strings?
Does this transformation produce the expected output for a single row?

This is where you can verify that the data contract is sound. If a column must be non‑null, unique, or follow a specific pattern, the unit test enforces it. When these rules fail here, they fail cheaply – before they can corrupt a table or mislead a dashboard.
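One of the questions above, whether a regex correctly strips whitespace from email strings, makes a compact illustration of "one rule, one expectation". The sketch below uses a hypothetical `stripEmailWhitespace` rule and a tiny hand-rolled check helper so that it runs without a test framework.

```javascript
// Hypothetical rule under test: remove all whitespace from an email string.
function stripEmailWhitespace(email) {
  return email.replace(/\s+/g, "");
}

// Minimal check helper; a real project would use its test framework instead.
function expectEqual(actual, expected) {
  if (actual !== expected) {
    throw new Error(`Expected "${expected}" but got "${actual}"`);
  }
}

// Each check isolates a single behaviour of the rule.
expectEqual(stripEmailWhitespace("  ada@example.com \n"), "ada@example.com");
expectEqual(stripEmailWhitespace("ada @example.com"), "ada@example.com");
expectEqual(stripEmailWhitespace("ada@example.com"), "ada@example.com");
```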

To make this concrete, here’s what a unit test looks like in a real codebase. Even though this example comes from Laravel, the testing principle is identical to data‑quality unit tests: one rule, one expectation, isolated from everything else.

Example: Testing a Discount Calculation Rule

Imagine your e‑commerce shop has this rule:

If a product costs more than £100, apply a 10% discount.
Otherwise, apply no discount.

Let's say this is your discount logic:

<?php

class DiscountService
{
    public function calculate($price)
    {
        if ($price > 100) {
            return $price * 0.10; // 10% discount
        }

        return 0;
    }
}

The unit test for this logic will be:

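<?php

use PHPUnit\Framework\TestCase;

class DiscountServiceTest extends TestCase
{
    /** @test */
    public function it_applies_a_10_percent_discount_when_price_is_above_100()
    {
        $service = new DiscountService();

        $discount = $service->calculate(200);

        $this->assertEquals(20, $discount);
    }

    /** @test */
    public function it_applies_no_discount_when_price_is_100_or_below()
    {
        $service = new DiscountService();

        $discount = $service->calculate(100);

        $this->assertEquals(0, $discount);
    }
}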

The DiscountService contains a simple rule: if a price is greater than 100, a 10% discount is applied. Otherwise, no discount is applied. The unit test verifies this rule in isolation, without involving controllers, databases, or HTTP requests. By testing the service directly, the developer ensures that the core calculation behaves exactly as intended.

The first test checks the positive case — a price of 200 should produce a discount of 20. The second test checks the boundary condition — a price of 100 should produce no discount. Together, these tests confirm both sides of the rule and protect against regressions if the logic changes in the future.

Since this is a Laravel example, it's worth noting that Laravel tests help you verify both your logic (unit tests) and your full application behaviour (feature tests). You can run them using php artisan test, which executes them in a separate testing environment so that, provided a dedicated test database is configured, your real data remains unaffected.

Integration Testing: The Flow & Lineage Check

While unit tests validate the correctness of individual rules, integration tests validate the movement of data across components. Integration testing verifies that multiple layers work together as a single data flow.

In the example later in this section, the controller receives an order, calls the discount service, applies the transformation, and persists the result to the database. That interaction across layers is what elevates this from a unit test to an integration test. This is where you test the real-world flow:

Controller → Service → Repository → MySQL
Check if MySQL migrations run correctly
Check foreign keys enforce relationships
Check to ensure services interact with the database as expected
Check to ensure models and repositories behave consistently

Integration tests reveal issues that only appear when components interact: incorrect joins, broken migrations, mismatched field names, or subtle type mismatches that unit tests cannot detect.

This is the layer where you catch the bugs that would otherwise silently corrupt data lineage.

Here's an example:

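<?php

use App\Models\Order;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class ApplyDiscountTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function it_applies_the_discount_and_saves_the_totals()
    {
        $order = Order::factory()->create(['subtotal' => 150]);

        $response = $this->postJson("/orders/{$order->id}/apply-discount");

        $response->assertStatus(200);

        $this->assertDatabaseHas('orders', [
            'id' => $order->id,
            'grand_total' => 135, // 150 - 10% discount
            'discount_total' => 15
        ]);
    }
}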

This represents a full flow rather than a single rule:

Controller → Service
Service → Calculation
Controller → Database write
Database → Final state

This test begins by creating an order using an Eloquent factory. It immediately steps beyond the boundaries of a unit test, since it interacts with the database and relies on Laravel’s model layer to persist real data.

From there, the test sends an actual HTTP POST request to the /orders/{id}/apply-discount endpoint. It's not calling a method directly. Instead, the request travels through Laravel’s routing layer, invokes the controller responsible for handling it, and triggers whatever business logic calculates and applies the discount.

This movement through multiple layers (routing, controller, service logic, and model persistence) is precisely what defines integration testing: the goal is to verify that these components work together correctly as a system.

Once the request is processed, the test asserts that the response returns a successful status code, which confirms that the HTTP layer behaved as expected.

But the most important part comes afterward, when the test checks the database to ensure that the correct grand_total and discount_total were saved. This final assertion proves that the discount logic was executed, the model was updated, and the changes were successfully written to the database.

In other words, the test isn't merely checking whether a calculation is correct. It's also checking whether the entire pipeline – from receiving the request to updating the database – functions as a coherent whole.

Functional Testing: The Business Rule Check

Functional tests validate the entire user experience, from the moment a request enters the system to the moment a response is returned. This includes:

HTTP requests
Controller logic
Validation rules
Service operations
Database writes
Redirects or rendered views

This is where you test the business rules that govern real‑world behaviour:

“A student can't register for two exams at the same time.”

“A cart can't have negative quantities.”

“A user can't update their profile without a valid email.”

Functional tests ensure that the system behaves correctly from the perspective of the user and the business, not just the code.

Here's an example of a functional test:

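<?php

use App\Models\Product;
use Illuminate\Foundation\Testing\RefreshDatabase;
use Tests\TestCase;

class CartUpdateTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function it_rejects_a_negative_quantity_and_keeps_the_cart_unchanged()
    {
        // Arrange: create a real product in the database
        $product = Product::factory()->create(['price' => 40]);

        // Simulate existing cart
        $this->withSession([
            'cart' => [
                $product->id => ['quantity' => 2]
            ]
        ]);

        // Act: user tries to update quantity to a negative number
        $response = $this->post('/cart/update', [
            'product_id' => $product->id,
            'quantity' => -5
        ]);

        // Assert: system rejects invalid business behaviour
        $response->assertStatus(302); // redirect back with errors
        $response->assertSessionHasErrors(['quantity']);

        // Assert: cart remains unchanged (business rule preserved)
        $this->assertEquals(2, session('cart')[$product->id]['quantity']);
    }
}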

The test begins by creating a realistic environment in which a user interacts with a shopping cart. This is essential for understanding the behaviour the system is meant to enforce.

First, it generates a real product in the database using a factory, giving the product a price so that it resembles an item a customer might genuinely add to their cart.

Once the product exists, the test manually seeds the session with a cart containing that product and a quantity of two. This simulates a user who has already added the item to their cart in a previous interaction, and it establishes the baseline state the system must preserve if the user attempts an invalid update.

With the environment prepared, the test then imitates a user action by sending a POST request to the /cart/update endpoint. Instead of calling a method directly, it uses Laravel’s HTTP layer to reproduce the exact behaviour of a browser submitting a form. The request includes the product ID and a deliberately invalid quantity of negative five.

This is the heart of the scenario: the user is attempting something that violates the business rules of the application, and the test is designed to confirm that the system responds appropriately.

Now, when the request is processed, the test expects the application to reject the input, redirect the user back, and attach validation errors to the session. The assertion that the response has a 302 status code and contains validation errors confirms that the validation layer is functioning correctly and that the controller is enforcing the rule that quantities can't be negative.

The final part of the test is where the business rule is truly verified. After the failed update attempt, the test inspects the session to ensure that the cart remains unchanged. This is crucial because rejecting invalid input is only half of the requirement: the system must also protect the integrity of the existing cart data.

Functional tests answer questions like:

Does the system prevent invalid real‑world behaviour?
Does the user get the correct feedback?
Does the data remain consistent after the request?
Does the final output match the business expectation?

Conclusion

Data quality is never the result of a single check or a single team. It emerges from a disciplined, layered approach where each testing level catches a different category of defects.

Unit tests safeguard the smallest rules, integration tests validate the flow of data across components, and functional tests enforce the business logic that governs real‑world behaviour.

When these layers operate together, bad data has nowhere to hide. When they don’t, even a minor oversight can slip through the cracks and escalate into a costly downstream failure.

So as you can see, your role in data quality is fundamentally proactive, not reactive. By designing systems with validation, integrity, and monitoring in mind, you ensure that data flowing through the pipeline is accurate, timely, complete, unique, and fit for purpose – supporting reliable analytics, reporting, and intelligent systems.

The Modern React Data Fetching Handbook: Suspense, use(), and ErrorBoundary Explained

Tapas Adhikary — Thu, 12 Feb 2026 15:19:30 +0000

Most React developers don’t break the data fetching process all at once. It usually degrades gradually.

Traditionally, you may have used a useEffect here, a loading flag there, and an error state along with it to tackle data fetching. Then another fetch depended on the first one, so you added a second useEffect, along with another loading and error state.

This likely continued until you started feeling like you were writing code that even you couldn’t maintain in the future.

Requests that should run in parallel started running sequentially. Components re-rendered unnecessarily just to satisfy another data fetch request-response. Loading spinners appeared when nothing meaningfully changed. Error states got scattered inside the component.

Well, none of these things are React’s problem. These are core design problems you should be aware of while coding your React Apps.

In this handbook, we’ll walk through one React Pattern that fixes data fetching at the architecture level without ignoring real data dependencies, and without introducing any new magic. If data fetching in React has ever felt harder than it should be, this pattern will make even more sense to you.

You’ll learn how to use React’s Suspense with the recently introduced use() API to handle data fetching smoothly. In case of errors, you’ll learn how an Error Boundary can help handle them gracefully.

Give this one a read, and code along to get a better grip on this pattern’s mental model.

This handbook is also available as a video tutorial as part of the 15 Days of React Design Patterns initiative, which you can check out if you’d like.

We’ll use a lot of source code to demonstrate the problems with the traditional data-fetching approach and how the Suspense Pattern can improve things. I would suggest that you try the code as you read. But if you want to take a look at the source code ahead of time, you can find it on the tapaScript GitHub.

The Traditional Way of Data Fetching in React
- The Problem with the Traditional Way
Let’s Build a Dashboard with the Traditional Data Fetching Approach
What is Suspense?
What is the use() API in React?
How to Use Suspense and the use() API for Data Fetching
Let’s Build the Dashboard with Suspense and the use() API
How to Handle Error Scenarios with Error Boundaries
- Error Boundary
- Suspense and Error Boundary
Learn from the 15 Days of React Design Patterns
Before We End…

The Traditional Way of Data Fetching in React

To understand why data fetching can become painful in React, we first need to understand how React works under the hood.

React works in phases. It doesn’t do everything at once. At a high level, every update in React goes through three distinct phases:

Render phase – React figures out what the UI should look like
Commit phase – React applies those changes to the DOM
Effect phase – React synchronises with the outside world

This separation is intentional. It’s what allows React to be predictable, interruptible, and efficient.

Now, let’s see where data fetching with useEffect fits into this picture:

Where does useEffect actually run? It doesn’t run during rendering. It runs after React has already committed the UI to the DOM.

That means the flow looks like this:

React renders the component (without data)
React commits the UI
useEffect runs
Data fetching starts
State updates when the data arrives
React renders again

useEffect(() => {
  fetchData().then(setData);
}, []);

Hence, the fetch only starts after the UI has already rendered.

The Problem with the Traditional Way

Consider a very common scenario: you fetch a user, and then fetch related data using the user ID.

The traditional React data fetching solution would look like this:

useEffect(() => {
  fetchUser().then(setUser);
}, []);

useEffect(() => {
  if (!user) return;
  fetchOrders(user.id).then(setOrders);
}, [user]);

What actually happens is:

The component renders
React commits the UI
The first effect runs and fetches the user
React re-renders
The second effect runs and fetches orders

Even if the network is fast, the requests are forced to start one after another, because each fetch is triggered by a render that only happens after the previous fetch completes.

This paradigm of data fetching is called Fetch-On-Render. The fetching logic is no longer controlled by data dependencies – it’s controlled by render timing. But that’s not all. There are other problems with this approach: you create and maintain unnecessary states.
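You can see the cost of a waterfall even without React. Here is a minimal sketch in plain JavaScript, assuming hypothetical fake API functions (makeApi and the two loader functions are illustrative names, not part of this handbook's project) that record when each request starts and ends:

```javascript
// Hypothetical stand-ins for real network calls; each one logs its start and end.
function makeApi(log, delayMs = 20) {
  const fake = (name, value) => () =>
    new Promise((resolve) => {
      log.push(`start:${name}`);
      setTimeout(() => {
        log.push(`end:${name}`);
        resolve(value);
      }, delayMs);
    });

  return {
    fetchUser: fake("user", { id: 1 }),
    fetchOrders: fake("orders", ["Order A"]),
    fetchAnalytics: fake("analytics", { growth: "18%" }),
  };
}

// Waterfall: every request waits for the previous one to finish.
async function loadSequentially(api) {
  const user = await api.fetchUser();
  const orders = await api.fetchOrders(user.id);
  const analytics = await api.fetchAnalytics(user.id);
  return { user, orders, analytics };
}

// Only the real dependency (the user) is awaited before the other two start.
async function loadWithRealDependencies(api) {
  const user = await api.fetchUser();
  const [orders, analytics] = await Promise.all([
    api.fetchOrders(user.id),
    api.fetchAnalytics(user.id),
  ]);
  return { user, orders, analytics };
}

const sequentialLog = [];
const parallelLog = [];
Promise.all([
  loadSequentially(makeApi(sequentialLog)),
  loadWithRealDependencies(makeApi(parallelLog)),
]).then(() => {
  console.log(sequentialLog.join(" > "));
  console.log(parallelLog.join(" > "));
});
```

In the second log, the analytics request starts before the orders request ends, because both depend only on the user. In the first log, analytics has to wait for orders even though it doesn't need them.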

Now, let’s see both these problems in action by building something practical.

Let’s Build a Dashboard with the Traditional Data Fetching Approach

Let’s build a simple dashboard with the traditional data fetching approach using the useEffect hook at the center. The dashboard will have four primary sections:

A Static heading.
A Profile section welcoming the user with their name.
An Order section listing the items ordered by the user.
An Analytics section showing a few metrics for the same user.

You can visualise it like this:

The profile, order, and analytics sections should show the dynamic data of a user and their order and analytics. Hence, we’ll simulate three API calls to get the user details, order details, and the analytics data.

// API to fetch User
export function fetchUser() {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve({ id: 1, name: "Tapas" });
    }, 1500);
  });
}

// API to fetch the Orders of a User
export function fetchOrders(userId) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve([
          `Order A for user ${userId}`,
          `Order B for user ${userId}`
      ]);
    }, 1500);
  });
}

// API to fetch the Analytics of a User
export function fetchAnalytics(userId) {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve({
        revenue: "$12,000",
        growth: "18%",
        userId
      });
    }, 1500);
  });
}

As you can see in this code:

Each of the API functions returns a promise.
There is an intentional delay of 1.5 seconds using setTimeout to simulate the feel of a network call. The promise resolves after the delay passes.
Once the promise gets resolved, we get the data.

Now, let’s create the Dashboard component:

import { useEffect, useState } from "react";
import { fetchAnalytics, fetchOrders, fetchUser } from "../api";

export default function Dashboard() {
    const [user, setUser] = useState(null);
    const [orders, setOrders] = useState(null);
    const [analytics, setAnalytics] = useState(null);

    // Step 1: Fetch user
    useEffect(() => {
        fetchUser().then(setUser);
    }, []);

    // Step 2: Fetch orders (depends on user)
    useEffect(() => {
        if (!user) return;
        fetchOrders(user.id).then(setOrders);
    }, [user]);

    // Step 3: Fetch analytics (depends on user)
    useEffect(() => {
        if (!user) return;
        fetchAnalytics(user.id).then(setAnalytics);
    }, [user]);

    // Logic to ensure that the user, orders, and analytics data
    // are loaded before we render them in JSX
    if (!user || !orders || !analytics) {
        return <p className="text-xl m-3">Loading dashboard...</p>;
    }

    return (
        <div className="m-2">
            <header>
                <h1 className="text-5xl mb-12">📊 Dashboard</h1>
            </header>

            <h2 className="text-3xl">Welcome, {user.name}</h2>

            <h2 className="text-3xl mt-3">Orders</h2>
            <ul>
                {orders.map((o) => (
                    <li className="text-xl" key={o}>
                        {o}
                    </li>
                ))}
            </ul>

            <h2 className="text-3xl mt-3">Analytics</h2>
            <p className="text-xl">Revenue: {analytics.revenue}</p>
            <p className="text-xl">Growth: {analytics.growth}</p>
        </div>
    );
}

Let’s break it down:

The first thing you’ll notice is that we have three states for holding the data of users, orders, and analytics.
Then we have three useEffects to manage the fetching of data and updating the states.
Then we show the data values in the JSX.
We’re using a Fetch-On-Render methodology.

However, in between, there’s an explicit logic to check if the user data, order data, or analytics data has been loaded. If the data are not loaded, then we don’t even process the JSX – rather, we show a loading message.

// Logic to ensure that the user, orders, and analytics data
// are loaded before we render them in JSX
if (!user || !orders || !analytics) {
  return <p className="text-xl m-3">Loading dashboard...</p>;
}

This is good as a measure so that the UI doesn’t crash at runtime. But this is not a declarative approach. Since React is declarative, it would make more sense if we could handle this scenario in a declarative way as well.

In declarative programming, you as a programmer don’t specify how to solve a certain problem. You declare what you want to achieve, and the programming language/framework takes care of the “how” part for you. When you specify the “how” part, it becomes imperative, not declarative.

React is declarative because you don’t specify how to update the browser DOM to render the UI changes. You declare them using JSX, and React takes care of it under the hood.
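The same contrast shows up in plain JavaScript. Here is a small sketch with made-up sample data (sampleOrders and both function names are purely illustrative) that produces the same result in an imperative and a declarative style:

```javascript
// Hypothetical sample data for illustration only.
const sampleOrders = [
  { item: "Order A", paid: true },
  { item: "Order B", paid: false },
  { item: "Order C", paid: true },
];

// Imperative: spell out *how* to build the result, step by step.
function paidItemsImperative(list) {
  const result = [];
  for (let i = 0; i < list.length; i++) {
    if (list[i].paid) {
      result.push(list[i].item);
    }
  }
  return result;
}

// Declarative: describe *what* you want; filter and map handle the iteration.
function paidItemsDeclarative(list) {
  return list.filter((order) => order.paid).map((order) => order.item);
}

console.log(paidItemsImperative(sampleOrders)); // [ 'Order A', 'Order C' ]
console.log(paidItemsDeclarative(sampleOrders)); // [ 'Order A', 'Order C' ]
```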

As an alternative to the explicit loading check shown earlier, you could also handle it using loading states. You could have loading states for profile, orders, and analytics. The loading states could decide when to show the data conditionally. But this approach needs additional state management and conditional rendering of JSX.

Along with these issues, think of handling errors! Again, you would need states for error handling and the conditional logic to show and hide the error messages. That’s too much to manage.
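To get a feel for how much bookkeeping that means, here is a rough sketch of the state you would track by hand for just one resource. The reducer, its action names, and describeView are all hypothetical, and you would need the same machinery again for every other fetch:

```javascript
// Hypothetical reducer tracking the lifecycle of a single resource by hand.
const initialState = { status: "idle", data: null, error: null };

function fetchReducer(state, action) {
  switch (action.type) {
    case "FETCH_START":
      return { status: "loading", data: null, error: null };
    case "FETCH_SUCCESS":
      return { status: "success", data: action.payload, error: null };
    case "FETCH_ERROR":
      return { status: "error", data: null, error: action.error };
    default:
      return state;
  }
}

// The branching a component would have to repeat on every render.
function describeView(state, label) {
  if (state.status === "loading") return `Loading ${label}...`;
  if (state.status === "error") return `Could not load ${label}: ${state.error.message}`;
  if (state.status === "success") return `${label} ready`;
  return `${label} not requested yet`;
}

let state = initialState;
state = fetchReducer(state, { type: "FETCH_START" });
console.log(describeView(state, "orders")); // Loading orders...
state = fetchReducer(state, { type: "FETCH_ERROR", error: new Error("Network down") });
console.log(describeView(state, "orders")); // Could not load orders: Network down
```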

So, with the useEffect strategy, data fetching in React is not that effective. We need a better pattern to handle data along with loading states and errors.

But before we move on, I want to clarify that useEffect isn’t bad. It has a purpose, but sometimes we don’t use it as intended. If you’re someone who wants to learn the effective usages of this hook and how to debug it properly, you can check out this session.

What is Suspense?

At its core, React Suspense is not a loading feature. It’s a rendering coordination mechanism. Suspense allows a component to tell React, “I’m not ready to be rendered, yet”. When that happens, React pauses rendering for that part of the tree and shows a fallback UI until the required data becomes available.

This is fundamentally different from how data fetching works with useEffect.

With the traditional Fetch-On-Render approach, React must first render a component before it’s allowed to start fetching data. Effects run after the commit phase, which means data fetching is always a reaction to rendering, never a prerequisite for it. As applications grow, this creates render-fetch-re-render loops, hidden waterfalls, and loading logic spread across components.

Suspense flips that model.

Instead of rendering first and fetching later, Suspense enables Render-as-you-Fetch. Data fetching can begin before React attempts to commit the UI, and rendering simply waits until the data is ready. The UI doesn’t guess when to show loading states. React coordinates it declaratively through Suspense boundaries.

With Suspense, you need to wrap the component that handles the asynchronous call.

Suspense can pause the rendering while the wrapped component is dealing with the promise. Suspense can show a fallback UI (it could be a loader, UI skeleton, and so on) until the promise is resolved (or rejected). Once the promise is resolved, Suspense replaces the fallback UI with the actual wrapped component baked with the data. No hard-coded logic, no extra state management is needed.

What is the `use()` API in React?

use() is an API introduced in React 19 that accepts a promise and returns its resolved value. If the promise hasn’t resolved yet, React doesn’t continue rendering. It suspends. If the promise fails, React throws an error. Both cases are handled declaratively by Suspense and Error Boundaries.

import { use } from "react";

function fetchUser() {
  return fetch("/api/user").then(res => res.json());
}

const userPromise = fetchUser();

export default function Profile() {
  const user = use(userPromise);
  return <h2>Welcome, {user.name}</h2>;
}

What’s important here:

use() is called during render
If the promise is unresolved, rendering pauses
No useEffect, no loading state

use() is very powerful. It can read promises that depend on other promises.

const userPromise = fetch("/api/user").then(r => r.json());

const ordersPromise = userPromise.then(user =>
  fetch(`/api/orders?userId=${user.id}`).then(r => r.json())
);

function Orders() {
  const orders = use(ordersPromise);
  return (
    <ul>
      {orders.map(o => <li key={o.id}>{o.title}</li>)}
    </ul>
  );
}

Here:

Dependencies are expressed in data, not effects
Rendering is coordinated automatically (declaratively)

The primary mental model is: this render is not allowed to complete without this data. It suspends.

const data = use(promise);

We need to use Suspense to handle this gap (when the promise hasn’t resolved yet) using a fallback UI, and then to continue rendering once the promise is resolved.

How to Use Suspense and the `use()` API for Data Fetching

The use() API is what finally makes Suspense practical for data fetching. Before use(), Suspense could pause rendering, but React didn’t have a clean way to consume asynchronous data during render without hacks. Most examples relied on custom abstractions or libraries to bridge that gap. use() changed that by allowing React components to read async values directly during rendering.
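For context, one common shape of those custom abstractions was a small "resource" wrapper whose read() method throws the pending promise. The sketch below is an illustration of that idea only. It is not React's implementation, and createResource is a hypothetical name:

```javascript
// Illustrative "resource" wrapper of the kind used before use() existed.
// This is a sketch of the idea, not how React implements use() internally.
function createResource(promise) {
  let status = "pending";
  let result;

  // Track the outcome of the promise as soon as it settles.
  const suspender = promise.then(
    (value) => {
      status = "fulfilled";
      result = value;
    },
    (error) => {
      status = "rejected";
      result = error;
    }
  );

  return {
    read() {
      if (status === "pending") throw suspender; // "I'm not ready to be rendered yet"
      if (status === "rejected") throw result; // surfaces the failure to an error handler
      return result; // data is ready
    },
  };
}

const userResource = createResource(Promise.resolve({ id: 1, name: "Tapas" }));
try {
  userResource.read();
} catch (thrown) {
  // On the first read the promise is still pending, so read() throws it.
  thrown.then(() => console.log(userResource.read().name)); // logs "Tapas"
}
```

With use(), you no longer write this wrapper yourself: you pass the promise in, and React takes care of the waiting.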

When a component reads data using use(promise), React treats that promise as a render dependency. If the promise hasn’t resolved yet, React pauses rendering at the nearest Suspense boundary. When it resolves, React retries rendering automatically, without manual state updates, effects, or conditional logic.

import { Suspense, use } from "react";

const userPromise = fetch("/api/user").then(res => res.json());

function Profile() {
  const user = use(userPromise);
  return <h2>Welcome, {user.name}</h2>;
}

export default function App() {
  return (
    <Suspense fallback={<p>Loading profile...</p>}>
      <Profile />
    </Suspense>
  );
}

What happens here:

Profile tries to read userPromise
If the promise is unresolved, React pauses rendering
React renders the nearest Suspense fallback
When the promise resolves, React retries rendering automatically

Here, there are no effects, no loading flags, and no manual re-rendering.
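If it helps your mental model, the pause, fallback, and retry steps can be imitated in a few lines of plain JavaScript. This is a toy model under stated assumptions: readOrSuspend and renderWithBoundary are hypothetical helpers, and React's real scheduler is far more sophisticated:

```javascript
// Toy model of "suspend, show fallback, retry". Not React's implementation.
const cache = new WeakMap();

// Returns the value if the promise has settled, otherwise throws the promise.
function readOrSuspend(promise) {
  let entry = cache.get(promise);
  if (!entry) {
    entry = { status: "pending", value: undefined };
    cache.set(promise, entry);
    promise.then(
      (value) => {
        entry.status = "fulfilled";
        entry.value = value;
      },
      (error) => {
        entry.status = "rejected";
        entry.value = error;
      }
    );
  }
  if (entry.status === "fulfilled") return entry.value;
  if (entry.status === "rejected") throw entry.value;
  throw promise;
}

// Plays the role of a Suspense boundary around a single render function.
function renderWithBoundary(render, fallback, onUpdate) {
  let output;
  try {
    output = render();
  } catch (thrown) {
    if (thrown && typeof thrown.then === "function") {
      onUpdate(fallback); // show the fallback while waiting
      const retry = () => renderWithBoundary(render, fallback, onUpdate);
      thrown.then(retry, retry); // try rendering again once the promise settles
      return;
    }
    throw thrown; // a real app would hand this to an Error Boundary
  }
  onUpdate(output);
}

const userPromise = Promise.resolve({ id: 1, name: "Tapas" });
renderWithBoundary(
  () => `Welcome, ${readOrSuspend(userPromise).name}`,
  "Loading profile...",
  (screen) => console.log(screen)
);
// Logs "Loading profile..." first, then "Welcome, Tapas" once the promise settles.
```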

Now, let’s see all these in action together by rebuilding the same Dashboard app.

Let’s Build the Dashboard with Suspense and the `use()` API

Now that we have a better understanding of Suspense and use(), let’s rewrite the same Dashboard application with it.

Project Setup

First, you’ll need to create a React project scaffolding using Vite. You can use the following command to create a Vite-based React project with modern tooling:

npx degit atapas/code-in-react-19#main suspense-patterns

This will create a React 19 project with TailwindCSS configured.

Now use the npm install command to install the dependencies. This will create the node_modules folder for you. At this point, the directory structure should look like this:

API Services

Now under src/, create a new folder called api/. Then create an index.js file under src/api/ with the following code snippet:

export function fetchUser() {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve({ id: 1, name: "Tapas" });
    }, 1500);
  });
}

export function fetchOrders(userId) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve([
          `Order A for user ${userId}`,
          `Order B for user ${userId}`
      ]);
    }, 1500);
  });
}

export function fetchAnalytics(userId) {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve({
        revenue: "$12,000",
        growth: "18%",
        userId
      });
    }, 1500);
  });
}

These are the same APIs we used before when constructing the dashboard with useEffect.

fetchUser: For fetching the user’s profile
fetchOrders: For fetching the orders made by a user
fetchAnalytics: For fetching the analytics data of a user

Create a Centralised User Resource

Now, let’s create a centralised JavaScript utility file where we can create each of the promises by calling their respective fetch methods. It’s a good practice to handle all the fetch APIs and their promises from a single place, rather than keeping them scattered. The same utility can export the promises so that we can consume them in the components.

Create a resources/ folder under src/. Create a userResource.js file under src/resources/ with the following code:

import { fetchAnalytics, fetchOrders, fetchUser } from "../api";

let userPromise;
let ordersPromise;
let analyticsPromise;

export function createUserResources() {
  userPromise = fetchUser();

  ordersPromise = userPromise.then(user =>
    fetchOrders(user.id)
  );

  analyticsPromise = userPromise.then(user =>
    fetchAnalytics(user.id)
  );
}

export function getUserResources() {
  return {
    userPromise,
    ordersPromise,
    analyticsPromise
  };
}

Here, we export two functions:

The createUserResources() creates all the promises and keeps them ready.
The getUserResources() returns all the promises we can consume later.
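One detail worth noticing: in this module, getUserResources() hands back undefined promises if it's called before createUserResources(). As a possible variation (not what this handbook's project uses), the sketch below creates the promises lazily on first access and keeps returning the same promise objects afterwards. createResourceStore and the fake API object are hypothetical names for illustration:

```javascript
// Hypothetical variation: create the promises lazily and reuse them.
function createResourceStore(api) {
  let resources = null;

  // (Re)creates all promises, for example at startup or on retry.
  function create() {
    const userPromise = api.fetchUser();
    resources = {
      userPromise,
      ordersPromise: userPromise.then((user) => api.fetchOrders(user.id)),
      analyticsPromise: userPromise.then((user) => api.fetchAnalytics(user.id)),
    };
    return resources;
  }

  // Returns the existing promises, creating them only if none exist yet.
  function get() {
    return resources ?? create();
  }

  return { create, get };
}

// Fake API used only to make this sketch runnable on its own.
const fakeApi = {
  fetchUser: () => Promise.resolve({ id: 1, name: "Tapas" }),
  fetchOrders: (userId) => Promise.resolve([`Order A for user ${userId}`]),
  fetchAnalytics: (userId) => Promise.resolve({ growth: "18%", userId }),
};

const store = createResourceStore(fakeApi);
console.log(store.get().userPromise === store.get().userPromise); // true
```

Reusing the same promise object matters here, because reading a promise with use() generally works best when the promise is stable across renders rather than recreated each time.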

Now, the question is, when will we create these promises? That is, where will we call the createUserResources() function? We should create these promises when the application starts up, and the main.jsx file would be the perfect place for that.

Open the main.jsx file, import {createUserResources}, and invoke it immediately.

import React from "react";
import ReactDOM from "react-dom/client";
import App from "./App.jsx";
import "./index.css";

import { createUserResources } from "./resources/userResource.js";

createUserResources();

ReactDOM.createRoot(document.getElementById("root")).render(
    <React.StrictMode>
        <App />
    </React.StrictMode>,
);

Great! Our data fetching APIs and the promises are ready. Let’s create the components where we’ll be using these promises.

Create Individual Components

We’ll create three components to compose the dashboard: Profile, Orders, and Analytics.

Let’s start with the Profile component. Create a folder components/ under the src/. Now, create a Profile.jsx with the following code:

import { use } from "react";
import { getUserResources } from "../resources/userResource";

export default function Profile() {
    const { userPromise } = getUserResources();
    const user = use(userPromise);
    return <h2 className="text-3xl">Welcome, {user.name}</h2>;
}

Let’s break it down:

We imported use() from React, as we’ll be reading the user promise here to get the user name to render.
Next, we need the user promise. We have the getUserResources() function to get that, so we imported it.
Then, inside the Profile component, we destructured userPromise from the getUserResources() function.
After that, we passed the promise to use(). We have learned that the use() API accepts a promise and returns the result when it is resolved. Until then, the component suspends, and the nearest Suspense boundary shows its fallback.
Finally, we used the resolved user to extract the name property and render it.

Simple, right? Let’s quickly create the Orders and Analytics components.

The Orders component:

import { use } from "react";
import { getUserResources } from "../resources/userResource";

export default function Orders() {
  const { ordersPromise } = getUserResources();
  const orders = use(ordersPromise);

  return (
    <>
      <h2 className="text-3xl mt-2">Orders</h2>
      <ul>
        {orders.map((o) => (
          <li className="text-xl" key={o}>{o}</li>
        ))}
      </ul>
    </>
  );
}

It has the same flow as the Profile component.

Now, let’s do the Analytics component:

import { use } from "react";
import { getUserResources } from "../resources/userResource";

export default function Analytics() {
    const { analyticsPromise } = getUserResources();
    const analytics = use(analyticsPromise);

    return (
        <>
            <h2 className="text-3xl mt-2">Analytics</h2>
            <p className="text-xl">Revenue: {analytics.revenue}</p>
            <p className="text-xl">Growth: {analytics.growth}</p>
        </>
    );
}

All three components are ready. Before we move further, let’s reflect once more on what we learned about Suspense.

Suspense wraps a component that deals with promises (async operations). Until the promise gets resolved, the Suspense holds the rendering and can show a fallback UI in the meantime. Once the promise gets resolved and we have the value, the fallback UI gets swapped with the actual component Suspense wrapped.

So, we have the ideal case now: wrapping the <Profile />, <Orders />, and <Analytics /> components with the <Suspense>…</Suspense> boundary to handle the promises and resolved data for each of the components.

Let’s do that, but aren’t we missing something? Yeah we are: the fallback UI. Let’s create it.

Create the Fallback UI

Now we’ll create three different fallback UI components. Create a file Skeletons.jsx under the src/components/ with the following code:

export const ProfileSkeleton = () => <p className="text-3xl m-2">Loading user...</p>;
export const OrdersSkeleton = () => <p className="text-3xl m-2">Loading orders...</p>;
export const AnalyticsSkeleton = () => <p className="text-3xl m-2">Loading analytics...</p>;

We now have a fallback skeleton UI for each of our components. These are very simple components that just render loading messages.

Create the Dashboard Component with Suspense

Now we have everything to make our Dashboard work. Create a suspense/ folder under src/. Then create a Dashboard.jsx file under src/suspense/ with the following code:

import { Suspense } from "react";

import Analytics from "../components/Analytics";
import Orders from "../components/Orders";
import Profile from "../components/Profile";

import {
    AnalyticsSkeleton,
    OrdersSkeleton,
    ProfileSkeleton,
} from "../components/Skeletons";

export default function Dashboard() {
    return (
        <div className="m-2">
            <header>
                <h1 className="text-5xl mb-12">📊 Dashboard</h1>
            </header>

            <Suspense fallback={<ProfileSkeleton />}>
                <Profile />
            </Suspense>
            <Suspense fallback={<OrdersSkeleton />}>
                <Orders />
            </Suspense>
            <Suspense fallback={<AnalyticsSkeleton />}>
                <Analytics />
            </Suspense>
        </div>
    );
}

First, let me explain the code:

We imported Suspense from React, all the components, and all the fallback UI components.
Then we rendered a static header and three Suspense boundaries, one for each component. We wrapped Profile, Orders, and Analytics with Suspense, respectively. To handle the pending promise state, we passed the matching skeleton component as the fallback to each Suspense boundary.

How clean is this? If you scroll up and recheck our old implementation of the dashboard using useEffect and compare it with the one we created with suspense, the positive differences are clear.

It’s declarative.
There’s less code, with the chance of fewer bugs
There’s no effect management and synchronisations
There’s no conditional JSX

It’s a huge win 🏆.

Run the Dashboard App

To run the dashboard app, import the dashboard component in the App.jsx file and use it like this:

import Dashboard from "./suspense/Dashboard";
function App() {
    return (
        <div className="flex items-center justify-center gap-12">
            <Dashboard />
        </div>
    );
}

export default App;

Next, open the terminal. Run the app using the npm run dev command. You get the same dashboard back, but it’s much improved:

It loads the data of each of the sections independently.
Each of the sections shows the data loading indicator when the promise is pending.
It doesn’t block the entire UI.

Suspense and use() together are very powerful. Now you have learned that powerful pattern end-to-end.

How to Handle Error Scenarios with Error Boundaries

This data fetching handbook wouldn’t be complete without talking about error scenarios and how to handle them. So far, we’ve spoken about only the happy path. But what if any of the promises are rejected? How do we handle that?

To understand this in depth, let’s reject one of the promises – say the Order promise. Open the index.js file under the src/api/ folder and replace the fetchOrders() function with this updated code:

export function fetchOrders(userId) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      // Simulate failure
      if (Math.random() < 0.5) {
        reject(new Error("Failed to fetch orders"));
      } else {
        resolve([
          `Order A for user ${userId}`,
          `Order B for user ${userId}`
        ]);
      }
    }, 1500);
  });
}

Here, the changes are:

We have simulated a failure by rejecting the promise.
The promise gets rejected randomly (roughly half the time, whenever Math.random() returns a value below 0.5) with an Error that carries a message.

At this point, if you refresh the UI a few times, you’ll randomly get a blank broken UI with the error message logged into the browser console. This isn’t ideal. It kills the UX of the app.
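While you experiment, a random failure can be awkward to reproduce. As an optional variation, here is a sketch of a deterministic version, assuming a hypothetical options argument (shouldFail and delayMs are not part of the handbook's API) so that you can force either outcome:

```javascript
// Hypothetical variant of fetchOrders with a controllable outcome.
// shouldFail and delayMs are illustrative options, not part of the original API.
function fetchOrdersWithOutcome(userId, { shouldFail = false, delayMs = 1500 } = {}) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (shouldFail) {
        reject(new Error("Failed to fetch orders"));
      } else {
        resolve([
          `Order A for user ${userId}`,
          `Order B for user ${userId}`,
        ]);
      }
    }, delayMs);
  });
}

// Force the failure path and handle the rejection explicitly.
fetchOrdersWithOutcome(1, { shouldFail: true, delayMs: 10 }).catch((error) => {
  console.log(error.message); // Failed to fetch orders
});
```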

A better way of handling would be to show the error message on the UI and provide a way to retry and check if the user can recover from the error.

This is where Error Boundary comes in.

Error Boundary

Error Boundaries in React exist for a simple reason: errors are inevitable, and we must handle them gracefully. For example:

Network requests fail
Data is malformed
Assumptions break

Without boundaries, a single tiny rendering error can crash the entire React tree. Error Boundaries provide React with a structured way to handle failures.

Technically, an Error Boundary is a component that catches errors thrown during rendering. When an error occurs, React stops rendering the subtree and renders a fallback UI instead.

Let’s now create an Error Boundary. Create a file called ErrorBoundary.jsx under src/components with the following code:

import { Component } from "react";
import { createUserResources } from "../resources/userResource";

export default class ErrorBoundary extends Component {
  state = { error: null };

  static getDerivedStateFromError(error) {
    return { error };
  }

  handleRetry = () => {
    this.setState({ error: null });
    createUserResources();
  };

  render() {
    if (this.state.error) {
      return (
        <div className="border border-red-700 rounded p-1">
          <p className="text-xl">{this.state.error.message}</p>
          <button
            className="bg-orange-400 rounded-xl p-1 text-black cursor-pointer"
            onClick={this.handleRetry}>
            Retry
          </button>
        </div>
      );
    }

    return this.props.children;
  }
}

Now, let’s understand what’s going on in the code above:

This is a class component that inherits from React.Component. That’s because Error Boundaries must use class lifecycle methods, which are not available in function components.
The component keeps track of whether an error has occurred. state = { error: null } means everything is rendering normally. When an error happens, this state will store the error object.
The static getDerivedStateFromError() is a special lifecycle method. React automatically calls it when a child component throws an error during render.
The handleRetry() method resets the error state back to null. It calls the createUserResources() to reinitialise the async resources.
In the render() method, if an error exists, render a fallback UI, show the error message, and provide an ability to retry the error using a retry button. If no error exists, render the children normally. The Error Boundary becomes invisible when everything works without an error. The fallback UI also can be an external component that we can pass as a prop to the Error Boundary.

If you’re interested in diving deep into the Error Boundary pattern and want to learn various use cases of it, here is a dedicated video you can check out.

Suspense and Error Boundary

Next, we’ll use the Error Boundary to wrap each of the Suspense boundaries so that if an error originates from any of them, it can be managed. Open the Dashboard.jsx file and wrap each of the Suspense boundaries with the ErrorBoundary component as shown below:

import { Suspense } from "react";
import Analytics from "../components/Analytics";
import ErrorBoundary from "../components/ErrorBoundary";
import Orders from "../components/Orders";
import Profile from "../components/Profile";
import {
    AnalyticsSkeleton,
    OrdersSkeleton,
    ProfileSkeleton,
} from "../components/Skeletons";

export default function Dashboard() {
    return (
        <div className="m-2">
            <header>
                <h1 className="text-5xl mb-12">📊 Dashboard</h1>
            </header>

            <ErrorBoundary>
                <Suspense fallback={<ProfileSkeleton />}>
                    <Profile />
                </Suspense>
            </ErrorBoundary>

            <ErrorBoundary>
                <Suspense fallback={<OrdersSkeleton />}>
                    <Orders />
                </Suspense>
            </ErrorBoundary>

            <ErrorBoundary>
                <Suspense fallback={<AnalyticsSkeleton />}>
                    <Analytics />
                </Suspense>
            </ErrorBoundary>
        </div>
    );
}

That’s it. Now, access the dashboard on the browser. Whenever the order promise rejects, we’ll get the fallback error UI from the error boundary. Note, the remaining UI isn’t broken and rendered successfully. The partial failure of the UI is also recoverable, as we have provided a retry button to attempt to revive that portion. It provides a great UX.
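The isolation idea itself is easy to picture outside React. Here is a rough analogy in plain JavaScript, with hypothetical helper names, where each section is rendered inside its own try/catch so that one failure doesn't take the others down:

```javascript
// Rough analogy only: each section gets its own "boundary" (a try/catch).
function renderSection(name, render) {
  try {
    return `${name}: ${render()}`;
  } catch (error) {
    // Only this section falls back to an error message with a retry hint.
    return `${name}: ${error.message} [Retry]`;
  }
}

function renderDashboard(sections) {
  return sections.map(([name, render]) => renderSection(name, render));
}

const output = renderDashboard([
  ["Profile", () => "Welcome, Tapas"],
  ["Orders", () => { throw new Error("Failed to fetch orders"); }],
  ["Analytics", () => "Revenue: $12,000"],
]);

console.log(output);
```

The Orders entry comes back as an error message, while Profile and Analytics render as usual. That is the same outcome you get from placing a separate Error Boundary around each Suspense boundary.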

This is how the Suspense boundary, the use() API, and the Error Boundary work together to help you write scalable React code that can be maintained very easily in the future. I hope you found it helpful. All the source code used in this handbook is in the tapaScript GitHub Repository.

Learn from the 15 Days of React Design Patterns

I have some great news for you: after my 40 days of JavaScript initiative, I have now completed a brand new initiative called 15 Days of React Design Patterns (with Bonus Episodes).

If you enjoyed learning from this handbook, I’m sure you’ll love this series, featuring the 15+ most important React design patterns. Check it out, subscribe, and get it for free:

Before We End…

That’s all! I hope you found this insightful.

Let’s connect:

Subscribe to my YouTube Channel.
Check out my courses, 40 Days of JavaScript, 15 Days of React Design Patterns, and Thinking in Debugging.
Follow on LinkedIn if you don't want to miss the daily dose of up-skilling tips.
Join my Discord Server, and let’s learn together.
Follow my work on GitHub.

See you soon with my next article. Until then, please take care of yourself and keep learning.

The Data Communication and Networking Handbook

Valentine Gatwiri — Wed, 18 Jun 2025 18:29:46 +0000

When I was beginning to learn about networks, I didn't know how many things in my daily life depended on them – from texting on WhatsApp to watching YouTube.

I still vividly remember when I learned that computers communicate with one another. It was magic – telepathy, nearly. But there is a systematic, logical process behind the magic: computer networking. And I’m excited to help you discover how computers communicate and why it’s possible.

Essentially, data communication is all about exchanging information between two or more machines. But it's not just a question of sending – it's a matter of sending the right data, to the right machine, in the right format. And that's the brilliance of networking basics.

This handbook will teach you the fundamentals of the language of computers. You'll discover how data is passed from machine to machine, how operations are carried out on information, and how networks – from tiny home arrangements to massive worldwide networks – are constructed and managed.

We’ll start with the absolute basics: what a network is, what the hardware is, and how devices know each other and talk to each other. Next, we’ll examine crucial networking models like the OSI and TCP/IP stacks, which segment communication into layers to make it easier to understand and troubleshoot. You'll learn about IP addresses, DNS, routing, switching, and firewalls, along with the role security plays in keeping networks safe.

Whether you are a complete beginner starting from the ground up or a seasoned dev looking to solidify your foundation, this handbook will walk you through linking the dots. When you're finished, you won't only understand how your favorite sites and apps really function behind the scenes – you'll be able to speak networks in your sleep.

Chapter 1: Data and Communication Fundamentals
Chapter 2: Signals — The Language of Communication
Chapter 3: Bandwidth — Understanding How Much We Can Transmit
Chapter 4: Transmission Media — The Highways of Communication
Chapter 5: Network Topologies — How We Structure Our Connections
Chapter 6: The OSI Model — Understanding Layers of Communication
Chapter 7: Protocols and Ports — How Rules and Doors Guide Communication
Chapter 8: IP Addressing and Subnetting — Naming and Organizing the Network
Chapter 9: Routing and Switching — Directing Data on the Network
Chapter 10: Network Infrastructure — Devices, Security, and the Modern Internet

Chapter 1: Data and Communication Fundamentals

This introductory section lays the groundwork for the rest of the handbook. You’ll learn what data communication is, how it's different from "sending a message," and what's required for two computers (or phones, or servers) to exchange information efficiently.

You'll start to feel at home with fundamental ideas, technical terminology, and the machinery behind the scenes that works quietly in the background to make daily technology appear effortless.

By the end, you will be able to:

Explain what data communication is and how it works in real life
Identify the components involved in data communication systems
Differentiate between types of data and how they're represented
Understand different types of data flow (simplex, half duplex, full duplex)
Describe what a computer network is and its main categories (LAN, MAN, WAN)
Understand the importance of protocols and how they enable communication
Recognize the role of standards and standard organizations in making networking universal

Data vs Information

We throw around the word "data" a lot these days – "big data," "data science," "data plans" – but what does it mean?

Data is raw. It's unprocessed, meaningless on its own. Think of numbers on a spreadsheet with no labels.
Information is processed data – it's meaningful and helps us make decisions.

A personal example: I once received a CSV file from my school with hundreds of rows of marks. It looked like chaos – just student IDs and scores. But the moment I matched those IDs to names and applied the grading criteria, it became useful information about who passed, who failed, and who topped the class.

So, data is the ingredient. Information is the cooked dish.

So, What Exactly is Data Communication?

Imagine you're texting your friend. Your phone sends data to their phone using signals through cables, Wi-Fi, or even satellites. This entire process is called data communication, moving data from one place (you!) to another (your friend).

But it’s not as random as it sounds. It follows a set of agreed rules called protocols. Think of them as social etiquette for devices – how to talk, when to talk, and what to say.

This process involves:

Devices (sender and receiver)
A transmission medium (like cables or wireless)
A set of rules (protocols)

Let’s break it down further.

Characteristics of Data Communication

To be considered effective, data communication must exhibit the following characteristics:

Delivery: Data must reach the correct destination. If I send a message to John, it shouldn't land in Sarah's inbox.
Accuracy: No one wants a corrupted file. Data must be accurate, free from errors.
Timeliness: Some data, like live video, must arrive on time. Lag ruins the experience.
Jitter: Inconsistent arrival times of data packets (especially in audio/video) create disruption. A good system keeps jitter low.

I once experienced a video call where the sound lagged by 5 seconds. It turned into a game of "Guess what I said." That's poor timeliness in action, and when the delay keeps changing from one moment to the next, that's jitter.

Meet the Cast: The Components of Data Communication

In every data conversation, five key players show up:

Sender – The device that starts the chat (like your phone).
Receiver – The one getting the message (your friend’s phone).
Message – The actual info, whether it’s "hi" or a TikTok.
Transmission Medium – The path your message travels (Wi-Fi, cables, and so on).
Protocol – The language they agree to speak (like TCP/IP).

Pretty cool, right?

Data Representation

Computers are not humans. They don’t understand language, pictures, or music – unless these are converted into a format they can process: bits (0s and 1s).

Let’s walk through the different types of data representation:

1. Text

Text is stored as a sequence of characters using encoding schemes like ASCII and Unicode. For example, the letter "A" in ASCII is 65, which in binary is 01000001.
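To make the "A is 65 is 01000001" idea concrete, here is a minimal Python sketch. The helper name text_to_bits is my own invention for illustration, and it assumes plain ASCII text.

```python
def text_to_bits(text: str, encoding: str = "ascii") -> str:
    """Return the 8-bit pattern of every encoded byte, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


code_point = ord("A")          # numeric code for the letter "A"
bits_a = text_to_bits("A")     # the same value written as 8 bits
bits_hi = text_to_bits("Hi")   # two characters, two bytes

print(code_point, bits_a)
print(bits_hi)
```

Swapping the encoding argument to "utf-8" shows how characters outside ASCII take more than one byte.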

2. Numbers

Similarly, numeric data is stored as bit patterns. Computers can perform calculations using binary logic.

3. Images

An image is a matrix of pixels. Each pixel is represented by bits. A black-and-white image might only need 1 bit per pixel, while a full-color photo could use 24 bits per pixel or more.

Example: A 10x10 black and white image = 100 pixels = 100 bits.
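The same arithmetic works for any image size. This is a small sketch of the uncompressed calculation only (the function name is hypothetical, and real image files are usually compressed, so they end up smaller).

```python
def image_size_bits(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed size of an image: one group of bits for every pixel."""
    return width * height * bits_per_pixel


black_and_white = image_size_bits(10, 10, 1)   # 1 bit per pixel
full_color = image_size_bits(10, 10, 24)       # 24 bits per pixel

print(black_and_white, "bits for the black-and-white image")
print(full_color, "bits for the same image in 24-bit color")
```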

4. Audio

Audio is analog, but we digitize it for storage and transmission. For instance, voice notes are sampled at certain intervals and stored as bits.
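Here is a rough sketch of what "sampled at certain intervals and stored as bits" can look like. It assumes a pure sine tone stands in for the voice, and the function name and parameters are made up for illustration. Real audio pipelines are more involved.

```python
import math


def sample_and_quantize(freq_hz: float, sample_rate_hz: int,
                        duration_s: float, bits: int) -> list[int]:
    """Sample a sine tone and map each sample to an integer code of `bits` bits."""
    levels = 2 ** bits
    n_samples = int(sample_rate_hz * duration_s)
    codes = []
    for i in range(n_samples):
        t = i / sample_rate_hz                          # time of this sample
        value = math.sin(2 * math.pi * freq_hz * t)     # analog value in [-1, 1]
        codes.append(round((value + 1) / 2 * (levels - 1)))  # scale to 0..levels-1
    return codes


# A 1 Hz tone, sampled 8 times per second for 1 second, stored with 3 bits per sample.
codes = sample_and_quantize(freq_hz=1, sample_rate_hz=8, duration_s=1, bits=3)
print(codes)
print([format(c, "03b") for c in codes])
```

More samples per second and more bits per sample give a closer copy of the original wave, at the cost of more data.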

5. Video

Video is a sequence of images (frames) along with synchronized audio. It’s high in data volume and needs compression, typically a codec like H.264 stored in a container format like MP4, to be practical.

How Does the Data Flow?

You might think data just zips across in one go – but it has modes, just like moods:

Simplex: One-way only (like a radio broadcast).
Half Duplex: You take turns – like walkie-talkies.
Full Duplex: Both sides talk at once – think phone calls.

Each has its own vibe depending on the situation.

What is a Computer Network?

A computer network is a system that allows devices to share data. These connected devices (nodes) use communication links to interact.

The main goals of a network are:

Reliability: Data should get there.
Security: Unwanted access should be blocked.
Performance: High speed, low delay.

When you connect your laptop at a café, for example, you’re part of a network. But networks come in all shapes:

PAN (Personal Area Network): Tiny – connects the electronic devices within a user's immediate area.

LAN (Local Area Network): Small – like your home Wi-Fi.

MAN (Metropolitan Area Network): Covers a city – like a network linking a university's campuses across town.

WAN (Wide Area Network): Huge – think the entire internet!

The internet isn’t one big net – it’s a net of many, many nets.

What is a Protocol?

A protocol is a set of rules that devices follow to communicate. Without a protocol, it’s chaos.

Analogy: Think of a group project. If everyone agrees to use Google Docs and write in English (or any one language), it works. But if one person uses Word in French, and another emails a PDF in Mandarin, you have a mess.

Protocols define:

What data to send
How to send it
When to send it

Elements of a Protocol

Syntax: Format and structure (like grammar).
Semantics: Meaning of each section.
Timing: When to send and at what speed.
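To see syntax and semantics side by side, here is a toy message format. It is entirely made up for illustration and is not a real protocol. The syntax is "1 byte of type, 2 bytes of length, then the payload", and the semantics say that type 1 means "text message".

```python
import struct

# Syntax: 1-byte message type + 2-byte payload length, in network byte order.
HEADER = struct.Struct("!BH")
MSG_TEXT = 1  # Semantics: type 1 means the payload is text (hypothetical)


def encode(msg_type: int, payload: bytes) -> bytes:
    """Build a frame: header first, payload after it."""
    return HEADER.pack(msg_type, len(payload)) + payload


def decode(frame: bytes) -> tuple[int, bytes]:
    """Read the header, then slice out exactly `length` payload bytes."""
    msg_type, length = HEADER.unpack_from(frame)
    return msg_type, frame[HEADER.size:HEADER.size + length]


frame = encode(MSG_TEXT, b"hi")
print(frame)
print(decode(frame))
```

Timing is the part this sketch leaves out: a real protocol would also say when each side may send and how fast.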

Standards in Networking

Standards are agreements to ensure that different systems can work together. Without standards, each manufacturer would create isolated networks that couldn’t talk to others.

There are two types of standards:

De facto: By convention (used commonly but not formally approved)
De jure: By law (formally approved)

Standards Organizations

There are a few key organizations that help define these standards:

ISO – International Organization for Standardization
ITU-T – International Telecommunication Union, Telecommunication Standardization Sector
IEEE – Institute of Electrical and Electronics Engineers
ANSI – American National Standards Institute
EIA – Electronic Industries Association

Chapter 2: Signals — The Language of Communication

In this chapter, I’ll teach you about the invisible messengers – signals – that make it all possible. You will:

Understand what signals are and how they carry data
Distinguish between analog and digital signals, and when each is used
Learn about key signal characteristics like amplitude, frequency, phase, and wavelength
Visualize and compare time domain vs frequency domain representations
Appreciate how real-world signals are composed of multiple waves (composite signals)
Understand digital signal features like bit rate, baud rate, and bit interval
Learn about baseband vs broadband transmission methods
Identify challenges like attenuation, distortion, and noise
Grasp how bandwidth affects data quality and speed

When I was a teenager, I often wondered how my voice traveled through a phone and reached someone else in another town. I imagined tiny versions of myself running through wires with a message in hand. Turns out, while not exactly accurate, the idea of something carrying your message is spot on. That something is called a signal.

A signal is the form data takes to move through physical space. Whether it’s your mom calling you, your professor sending an email, or your friend uploading a reel – all of that happens through signals.

Data and Signals

What is a Signal?

I learned that data is like the message I wanted to send, and a signal is the delivery truck. Without the truck, the message goes nowhere.

Here’s where things get a bit science-y, but stay with me. When data travels, it becomes signals, kind of like waves. These waves can be classified in two common ways: by the nature of the signal, and by their pattern over time. We’ll talk about the nature of the signal first.

The Nature of the Signal: Analog vs Digital

Analog – A continuous signal that varies smoothly over time and can take any value in a range. Like ocean waves, or a human voice.
Digital – A signal that takes only discrete values, usually 0s and 1s. Like a staircase – clear, sharp steps, either up or down, which is how computers represent bits.

Analog Signals

The first time I visualized an analog signal, it looked like the ripples I saw after tossing a stone in water. Gentle curves moving outwards.

Key features of analog signals:

Amplitude: This reminded me of volume. Louder signals have taller waves.
Frequency: It’s the beat or rhythm. High frequency = rapid waves = higher pitch.
Period: Time for one full wave cycle. Shorter periods mean higher frequency.
Phase: Two waves can start at different points – just like dancers starting a move a second apart.
Wavelength: The distance one full wave cycle covers in space. It equals the wave's speed divided by its frequency.
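Two of these features are linked by simple formulas: period is 1 divided by frequency, and wavelength is speed divided by frequency. A quick sketch, assuming a radio wave moving at roughly the speed of light (the function names are mine):

```python
SPEED_OF_LIGHT_M_PER_S = 3e8  # approximate speed of a radio wave in air


def period_s(frequency_hz: float) -> float:
    """Time for one full cycle: shorter periods mean higher frequency."""
    return 1 / frequency_hz


def wavelength_m(speed_m_per_s: float, frequency_hz: float) -> float:
    """Distance one cycle covers: speed divided by frequency."""
    return speed_m_per_s / frequency_hz


fm_frequency_hz = 100e6  # a 100 MHz FM station, chosen as an example
print(period_s(fm_frequency_hz), "seconds per cycle")
print(wavelength_m(SPEED_OF_LIGHT_M_PER_S, fm_frequency_hz), "meters per cycle")
```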

Time vs. Frequency Domain

Time Domain: Shows how signals change over time. Like watching a song’s audio waveform.
Frequency Domain: Shows the ingredients – how much bass, how much treble. It’s like the EQ settings on a music player.

Composite Signals and Fourier

Real-world signals are messy, made of multiple waves mixed together. Fourier’s big idea was: Any messy signal can be broken down into simple sine waves. That insight changed how engineers understand and clean up signals.
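You can watch Fourier's idea at work with a few lines of Python. This is a deliberately slow, textbook-style discrete Fourier transform written with the standard library only. Real tools use a fast Fourier transform (FFT) instead. The composite signal below (a 3 Hz wave plus a quieter 10 Hz wave) is invented for the demo.

```python
import cmath
import math


def dft_magnitudes(samples: list[float]) -> list[float]:
    """Amplitude of each frequency bin from 0 up to half the sample count.

    The 2/n scaling recovers sine amplitudes for bins strictly between
    0 and n/2, which is all this demo needs.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(total) * 2 / n)
    return magnitudes


SAMPLE_RATE = 64  # 64 samples over 1 second, so bin k corresponds to k Hz
signal = [math.sin(2 * math.pi * 3 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 10 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]

magnitudes = dft_magnitudes(signal)
peaks = [k for k, m in enumerate(magnitudes) if m > 0.1]
print("Frequencies found (Hz):", peaks)
```

The time-domain list looks like a wobbly mess, but the frequency-domain view picks out just the two ingredients that were mixed in.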

Digital Signals

Digital signals felt familiar to me. My laptop, my phone, even my microwave speak digital.

Key features of digital signals:

Bit Interval: One bit’s duration. Like how long I hold down a piano key.
Bit Rate: How many notes (bits) I can play per second.
Baud Rate: How often the signal actually changes. Not always the same as bit rate.
Levels: 2-level = 1s and 0s. More levels = more complex encoding.
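The link between these features is that each signal change (symbol) can carry log2(levels) bits, so bit rate equals baud rate times bits per symbol. A small sketch, assuming the number of levels is a power of two (the helper names are my own):

```python
import math


def bits_per_symbol(levels: int) -> int:
    """How many bits one signal level can encode (levels must be a power of two)."""
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two, at least 2")
    return int(math.log2(levels))


def bit_rate_bps(baud_rate: float, levels: int) -> float:
    """Bit rate = signal changes per second x bits carried by each change."""
    return baud_rate * bits_per_symbol(levels)


def bit_interval_s(bit_rate: float) -> float:
    """Duration of a single bit."""
    return 1 / bit_rate


print(bit_rate_bps(1000, 2))   # 2 levels: bit rate equals baud rate
print(bit_rate_bps(1000, 4))   # 4 levels: each change carries 2 bits
print(bit_interval_s(bit_rate_bps(1000, 4)))
```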

Square Waves

If analog signals are elegant curves, digital signals are sharp edges. A square wave is a bold, binary shout: ON-OFF-ON-OFF.

Digital Advantages and Struggles

Why I love them:

They’re clean and easy to work with.
Errors are easier to spot and fix.

But they’re not perfect:

They need more bandwidth.
They don’t travel well over long distances without help.

Pattern Over Time: Periodic vs Non-periodic Signals

Periodic Signals: Repeat at regular intervals over time (for example, sine waves, clock pulses).
Non-periodic Signals: Do not repeat – more random or unique (for example, a burst of data or speech waveform).

Periodic Signals

These feel like the rhythm of my favorite song. They’re predictable. Repeating. Reliable.

Key Features

Repetition: The same pattern, again and again. Like waves hitting the shore at steady intervals.
Cycle: One complete shape of the signal. Think of it as one heartbeat in a steady pulse.
Frequency: How many cycles per second? Measured in Hertz (Hz).

Why I like them

Easy to analyze – like having a beat to follow.
Great for systems that need synchronization, like clock signals in my devices.

But still...

They can’t carry surprise or variety. No space for one-time messages.

Non-periodic Signals

These are the jazz solos of the signal world. Wild. Unique. Unpredictable.

Key Features

No repetition: Each part is different – like my playlist on shuffle.
Spikes and silence: Sudden changes, long pauses. Perfect for one-off data transmissions.
Used in real-life data: Emails, videos, and downloads all love this format.

Why they’re cool

Great for representing actual information – each burst means something new.
More flexible for transmitting complex messages.

What’s tricky

Harder to analyze and predict.
Tougher to filter or compress efficiently.

Understanding signals helps us know how fast and cleanly information travels.

Channels: The Roads Signals Travel On

In the context of signals and communication, channels refer to the medium or path through which a signal travels from a sender (transmitter) to a receiver. Channels are like roads. You can’t just send a truck (signal) without knowing if the road (channel) allows it.

We can describe channels in different ways:

Physically: What the signal travels through (like a wire or air).
Functionally: How the signal is allowed to move through (based on frequency).
Logically: How we organize multiple data streams within the same physical path.

Physical Channels = The Road Itself

These are the real, tangible paths for signals:

Example	Medium
Ethernet cable	Copper wire
Fiber-optic link	Glass strand
Wi-Fi or Radio	Air (wireless)
Satellite transmission	Space (electromagnetic waves)

Frequency Behavior of Physical Channels

Just like roads are built for certain speeds, physical channels are better at carrying certain frequencies.

Here’s where low-pass, high-pass, band-pass, and band-stop come in – they describe how a physical channel behaves.

Channel Type	Behavior	Analogy	Common Use
Low-pass	Lets low frequencies pass	Quiet country road (slow cars only)	Telephone lines (voice)
Band-pass	Allows a specific frequency band	Toll road with speed range	FM radio, Wi-Fi
High-pass	Blocks low, passes high frequencies	Speedway (fast cars only)	Audio filtering
Band-stop	Blocks a range but passes others	Road under construction	Noise removal (for example, hum filter)

So when we say "low-pass channel," we're talking about how a physical channel filters signals.

Logical Channels = Lanes on the Road

A logical channel is a virtual path created within a physical one. It organizes or splits the signal flow so multiple people or devices can use the same channel without crashing into each other.

Feature	Description	Analogy
Frequency Division	Each user gets their own frequency	FM radio stations
Time Division	Each user gets a time slot	Taking turns at a speaking table
Virtual Circuits	Custom paths inside networks	Reserved bus seats

So yes – you can have many logical channels on one physical cable.
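Time division is easy to mimic in code. In this toy sketch (the function names and the "-" idle marker are assumptions for illustration), three senders take turns putting one character at a time onto a shared link, and the receiver picks every third slot back out.

```python
from itertools import zip_longest


def tdm_multiplex(streams: list[str], idle: str = "-") -> list[str]:
    """Interleave streams slot by slot; a sender with nothing left sends `idle`."""
    frames = zip_longest(*streams, fillvalue=idle)
    return [slot for frame in frames for slot in frame]


def tdm_demultiplex(link: list[str], n_streams: int, idle: str = "-") -> list[str]:
    """Rebuild each stream by reading its own time slot and skipping idle slots."""
    return ["".join(s for s in link[i::n_streams] if s != idle)
            for i in range(n_streams)]


streams = ["AAAA", "BB", "CCC"]
link = tdm_multiplex(streams)
print("".join(link))
print(tdm_demultiplex(link, len(streams)))
```

One physical path, three logical channels. The idle slots also show the downside of fixed time slots: capacity is wasted when a sender has nothing to say.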

How They Work Together

Let’s combine it all:

Imagine a fiber optic cable (physical channel) that’s designed to carry a specific frequency range (band-pass).
Within that frequency range, you can create many logical channels using time or frequency division.

Example: FM Radio

Physical Channel: Air (radio waves)
Type: Band-pass (88–108 MHz)
Logical Channels: Each station (for example, 98.4 FM) is a logical channel inside that band

Example: Internet over DSL

Physical Channel: Telephone line (copper wire)
Type: Low-pass for voice, high-pass for internet
Logical Channels: Browsing, streaming, and downloads running together via time/frequency division

Baseband vs Broadband Transmission: How We Use the Channel

There are two main ways we use the channel: baseband and broadband transmission.

Baseband Transmission is like talking directly to someone across a quiet room. Simple and unaltered. Common in local systems like Ethernet.

Broadband Transmission is a bit different. Here, we dress up the digital message in analog clothing using modulation. That’s how we send data over radio or fiber. It’s more complex, but necessary when you’re dealing with wider, noisier roads.

Signal Villains: What Goes Wrong on the Way

As your signal travels down the channel, it may face three big problems.

Attenuation: It’s like my voice getting quieter the farther I am from someone. Amplifiers help boost it.
Distortion: Imagine you and I agree to send square waves, but by the time they reach you, they look like mush. That’s distortion, and it's especially bad over long cables.
Noise: Noise is anything extra that wasn’t supposed to be in the signal. From lightning strikes to microwaves, interference is real.

Types I learned about:

Thermal (heat-related)
Induced (nearby equipment)
Crosstalk (adjacent wires “talking”)
Impulse (sudden bursts)

We can reduce noise using better cables, filters, and digital corrections.

Bandwidth

The word ‘bandwidth’ gets thrown around so much. For me, it used to just mean internet speed. But it’s deeper:

Analog Bandwidth: Range of frequencies a signal uses.
Digital Bandwidth: How much data we can push through per second.

More bandwidth = more room = faster, clearer communication.

We’ll talk more about bandwidth in the next chapter.

Learning about signals was like being handed the key to a secret code. Every beep, flash, and wave in our world is part of a language. Once you see it, you can’t unsee it. Signals are not just theory – they are the reason I can write this on a laptop, send it to the cloud, and have you read it anywhere in the world.

Chapter 3: Bandwidth — Understanding How Much We Can Transmit

When I first heard the term "bandwidth," I assumed it just meant how fast my internet was. And while that’s not entirely wrong, I came to learn there’s much more to it.

In this chapter, we’ll delve into the concept of bandwidth as the capacity of a communication path, examine its impact on signal quality and speed, and investigate how it's measured in both analog and digital systems.

By the end of this chapter, you will be able to explain:

What bandwidth means in different contexts
How analog and digital bandwidths are measured
The concept of throughput and how it differs from bandwidth
Factors that affect data transmission performance

What Bandwidth is All About

Bandwidth is the maximum amount of data that can be transmitted over a communication channel in a given amount of time.

Have you ever streamed a movie and it kept buffering? That frustrating lag led me to one of the most important concepts in networking: bandwidth. Bandwidth is like a highway. The wider the road, the more cars (or data) can pass at once.

I also like to think of it this way: If I’m trying to pour water (data) through a pipe (the communication channel), a narrow pipe limits how much water can flow through at a time. That’s low bandwidth. A wide pipe? Now we’re talking high bandwidth – fast and smooth.

Bandwidth Utilization

Efficiency

This is how well we use the available bandwidth. High efficiency means most of the bandwidth is being used for actual data (not overhead).

Overhead

Overhead includes headers, acknowledgments, and error-checking codes. It’s necessary, but it eats into our available bandwidth.

Idle Time

Sometimes the channel sits unused, due to waiting for acknowledgment, processing time, and so on. Minimizing idle time improves efficiency.
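Here is a rough way to put numbers on these three ideas. The packet sizes and the 10% idle figure are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Share of each packet that is actual data rather than overhead."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def effective_rate_bps(bandwidth_bps: float, payload_bytes: int,
                       overhead_bytes: int, idle_fraction: float) -> float:
    """Useful data rate after subtracting idle time and per-packet overhead."""
    busy_fraction = 1.0 - idle_fraction
    return bandwidth_bps * busy_fraction * payload_efficiency(payload_bytes, overhead_bytes)


# Assumed example: 1460 bytes of data plus 40 bytes of headers per packet,
# on a 100 Mbps link that sits idle 10% of the time.
efficiency = payload_efficiency(1460, 40)
useful_rate = effective_rate_bps(100e6, 1460, 40, idle_fraction=0.10)

print(f"Efficiency: {efficiency:.1%}")
print(f"Useful rate: {useful_rate / 1e6:.1f} Mbps")
```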

Bandwidth in Analog and Digital Terms

Analog Bandwidth

Analog bandwidth refers to the range of frequencies over which an analog signal can be accurately acquired, processed, or transmitted by a system. Beyond this range, the signal begins to degrade – either being attenuated or distorted, making it unreliable for precise use.

Key Concepts

Frequency Range: Analog bandwidth defines the spectrum of frequencies that a system can handle without significant degradation. It’s the system’s “comfort zone” for signal fidelity.
3 dB Bandwidth: One common method of defining analog bandwidth is the -3 dB point. At this point, the signal’s amplitude drops to about 70.7% of its original value, meaning half its power is lost. Frequencies beyond this threshold experience much more signal loss or distortion.
Importance in Signal Fidelity: Analog bandwidth directly affects how well a system can reproduce or process real-world signals – especially in audio, video, instrumentation, and telecommunications. A narrow bandwidth results in muffled or distorted outputs, while a wider bandwidth ensures better detail and accuracy.
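If decibels feel abstract, this short sketch converts them back into plain ratios using the standard definitions (amplitude ratio is 10^(dB/20), power ratio is 10^(dB/10)). The function names are my own.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude (voltage) ratio for a given gain or loss in decibels."""
    return 10 ** (db / 20)


def db_to_power_ratio(db: float) -> float:
    """Power ratio for a given gain or loss in decibels."""
    return 10 ** (db / 10)


amplitude_at_3db = db_to_amplitude_ratio(-3)
power_at_3db = db_to_power_ratio(-3)

print(f"Amplitude at -3 dB: {amplitude_at_3db:.1%} of the original")
print(f"Power at -3 dB: {power_at_3db:.1%} of the original")

# Exactly half power corresponds to an amplitude of 1/sqrt(2).
print(f"Exact half-power point: {20 * math.log10(1 / math.sqrt(2)):.2f} dB")
```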

Bandwidth and Rise Time

In instruments like oscilloscopes, analog bandwidth is closely related to rise time – the time it takes for a signal to transition from low to high. A wider bandwidth enables faster transitions to be captured accurately, which is essential for analyzing high-speed or fast-changing signals.

Real-Life Example

Consider old telephone systems: they typically had an analog bandwidth ranging from 300 Hz to 3300 Hz, resulting in a 3000 Hz bandwidth. This range was enough for clear voice transmission, but not wide enough for high-fidelity music or modern audio standards.

Applications of Analog Bandwidth

Application Area	Role of Analog Bandwidth
Oscilloscopes	Determines how accurately signals (especially fast ones) are captured.
Amplifiers	Specifies which frequency ranges can be amplified without distortion.
Communication Systems	Defines signal capacity and transmission quality.
Data Acquisition	Affects how well fast-changing signals are measured and analyzed.

Digital Bandwidth

Digital bandwidth refers to the maximum capacity of a digital channel to transmit data over a specific period, usually measured in bits per second (bps). It’s a measure of how much data can “flow” through a communication path, much like how the width of a pipe controls how much water can pass through.

The wider the digital bandwidth, the more data can be transmitted simultaneously, resulting in faster downloads, smoother video streams, and better overall network performance.

Bandwidth vs. Data Rate

Although they’re often used interchangeably, they aren’t quite the same:

Bandwidth is the capacity of the channel – the maximum potential.
Data rate is the actual speed at which data is transmitted, which can vary based on factors like:
- Network congestion
- Hardware limitations
- Signal interference

Think of bandwidth as the size of a highway, and data rate as how fast cars are moving on it.

How Digital Bandwidth is Measured

Digital bandwidth is expressed in units such as:

bps – bits per second
Kbps – thousands of bits per second
Mbps – millions of bits per second
Gbps – billions of bits per second

Example: A 100 Mbps internet connection can, in theory, transfer 100 million bits of data every second.
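A handy consequence: you can estimate the best-case download time for a file. Remember that file sizes are usually quoted in bytes and link speeds in bits, so multiply by 8. This sketch ignores overhead and congestion, and the function name is an assumption.

```python
def transfer_time_s(file_size_bytes: int, bandwidth_bps: float) -> float:
    """Best-case transfer time: total bits divided by bits per second."""
    return file_size_bytes * 8 / bandwidth_bps


one_gigabyte = 1_000_000_000  # 1 GB in bytes (decimal)
seconds = transfer_time_s(one_gigabyte, 100e6)  # over a 100 Mbps link
print(f"{seconds:.0f} seconds in theory")
```

In practice the transfer takes longer, for reasons the rest of this chapter covers.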

Why It Matters

Bandwidth plays a central role in modern digital life. Without enough bandwidth:

Streaming videos buffer
Video calls drop in quality or disconnect
Online games lag or stutter
Large files download painfully slowly

This becomes even more critical when multiple devices share the same network. Each device draws from the available bandwidth, which can quickly get overwhelmed if the demand is too high.

Digital vs. Analog Bandwidth

Aspect	Digital Bandwidth	Analog Bandwidth
Measured in	Bits per second (bps, Mbps, Gbps)	Hertz (Hz)
Focus	Data transmission rate	Frequency range
Example	Internet connection	FM radio signal (for example, 88–108 MHz)

Bandwidth in Shared Networks

In shared environments – like home Wi-Fi or public hotspots – everyone taps into the same bandwidth. If bandwidth is limited and several devices are streaming, gaming, or downloading, the network slows down for everyone.

Throughput – What Gets Delivered

While bandwidth is the potential capacity of a channel (the width of the road), throughput is the actual rate at which data travels end‑to‑end under real‑world conditions. It’s the number of cars that make it through the city per minute, after red lights, speed limits, and detours.

Key factors that influence throughput:

Interference & Noise (analog) or packet collisions (digital)
Hardware Constraints (CPU, NICs, switches)
Network Congestion (too many users/devices)
Error Retransmissions (when packets get lost or corrupted)

Example: A “100 Mbps” link (bandwidth) might only sustain 80 Mbps of throughput because of TCP overhead, competing traffic, and occasional packet losses.

Latency and Delay – The Time Dimension

Latency is the time it takes for a single bit (or packet) to travel from sender to receiver. Think of it as a travel time, whereas bandwidth and throughput are about volume.

Propagation Delay: Time for the signal to move through the medium (for example, light in fiber: ~200,000 km/s).
Transmission Delay: Time to push all the bits of a packet onto the wire, calculated as Packet Size (bits) ÷ Link Bandwidth (bps).
Processing Delay: Time routers or switches spend examining headers, making forwarding decisions.
Queuing Delay: Time packets wait in buffers when traffic spikes.
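Adding the four delays together gives the one-way latency of a packet. Here is a minimal sketch. The distance, packet size, and the processing and queuing values are assumed numbers picked for the example, and the function names are hypothetical.

```python
FIBER_SPEED_M_PER_S = 2e8  # roughly 200,000 km/s for light in fiber


def propagation_delay_s(distance_m: float,
                        speed_m_per_s: float = FIBER_SPEED_M_PER_S) -> float:
    """Time for the signal to cross the medium."""
    return distance_m / speed_m_per_s


def transmission_delay_s(packet_bits: int, bandwidth_bps: float) -> float:
    """Time to push every bit of the packet onto the link."""
    return packet_bits / bandwidth_bps


def total_delay_s(distance_m: float, packet_bits: int, bandwidth_bps: float,
                  processing_s: float = 0.0, queuing_s: float = 0.0) -> float:
    """One-way latency as the sum of the four delay components."""
    return (propagation_delay_s(distance_m)
            + transmission_delay_s(packet_bits, bandwidth_bps)
            + processing_s
            + queuing_s)


# A 1500-byte (12,000-bit) packet over 1,000 km of fiber on a 100 Mbps link.
delay = total_delay_s(1_000_000, 12_000, 100e6,
                      processing_s=0.0001, queuing_s=0.002)
print(f"{delay * 1000:.2f} ms one way")
```

Notice that in this example the distance dominates: a faster link shrinks the transmission delay, but it cannot make light travel faster.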

Real‑world story: During a long‑distance video call, even a few hundred milliseconds of round‑trip latency can feel like talking through molasses – voices overlap, and the conversation feels stilted.

Jitter – Variability in Arrival

Jitter is the inconsistency in packet arrival times. Even if the average latency is low, high jitter disrupts:

Audio/Video Streams: Choppy playback when packets clump or arrive too late.
VoIP Calls: Glitches, echoes, or dropped words.

You can mitigate this with jitter buffers and Quality of Service (QoS) policies, which prioritize real‑time traffic to smooth out the delivery.
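One simple way to put a number on jitter is to look at how much the gap between consecutive packets changes. The sketch below uses that simplified definition, which is my own choice for illustration and not the formula any particular standard uses. The arrival times are made up.

```python
def inter_arrival_gaps(arrivals_ms: list[float]) -> list[float]:
    """Time between each packet and the one before it."""
    return [later - earlier for earlier, later in zip(arrivals_ms, arrivals_ms[1:])]


def mean_jitter_ms(arrivals_ms: list[float]) -> float:
    """Average absolute change between consecutive inter-arrival gaps."""
    gaps = inter_arrival_gaps(arrivals_ms)
    changes = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(changes) / len(changes) if changes else 0.0


steady = [0, 20, 40, 60, 80]      # a packet every 20 ms, like clockwork
bursty = [0, 20, 55, 60, 100]     # same number of packets, uneven arrival

print("Steady stream jitter:", mean_jitter_ms(steady), "ms")
print("Bursty stream jitter:", round(mean_jitter_ms(bursty), 1), "ms")
```

A jitter buffer works by holding packets briefly so that the player sees something closer to the steady stream.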

How to Improve Performance

If I could go back in time and give myself one tip: Performance isn’t just about speed – it’s about reliability and consistency, too.

Here’s what affects performance:

Bandwidth: Think of this as the diameter of your internet pipe – the maximum amount of data that can move through it per second, usually in Mbps or Gbps.

Why it matters: More bandwidth means your connection can handle more data – like downloading big files fast or streaming in 4K. BUT: Just because your connection can go fast doesn't necessarily mean that it always does. That's where throughput comes in.
Throughput: Your actual speed – how much data is really passing through the pipe right now.

Why it matters: Your actual internet experience (web page loading, Netflix streaming, gaming) is throughput-dependent, not bandwidth-dependent. If your throughput is bad, your videos buffer, downloads crawl, and games lag – even when you're signed up for a "fast" plan.
Latency & Jitter: Latency is the lag – how long it takes information to travel from your machine to the server and back (in milliseconds). Jitter is the variation in that lag – how inconsistent the timing gets.

Why they're significant: High latency = frustrating delay in video calls, sluggish online gaming, or keyboard lag in remote desktops. High jitter = choppy audio, frozen faces, or desync'd video in live meetings or streams.
Packet Loss: Sometimes, data just doesn't get to where it’s supposed to go. Packets are tiny chunks of data, and if a few get lost along the way, your device has to ask for them again.

Why it matters: Small levels of packet loss can cause buffering, call drops, or rubberbanding during gaming. Greater loss = subpar performance, stuttery audio, or crashed streams.
Utilization & Overhead: Utilization refers to what ratio of your total bandwidth is being used at any one time. Overhead is the extra information that needs to be dealt with to manage your connection – like labels on a package.

Why they're important: High utilization is when your connection gets crowded, like a road at rush hour. Everything slows down. High overhead eats into your available bandwidth – less room for what you actually love (video, games, files).

Engineers use techniques like compression, efficient routing, better cabling, and load balancing to improve performance.

I now see bandwidth everywhere – not just in networks, but in life. Our mental bandwidth, emotional bandwidth – it's all about capacity. Knowing how bandwidth works helped me troubleshoot slow Wi-Fi, plan file transfers, and appreciate what’s going on behind a simple Google search.

Networks, like people, need both capacity and consistency to function at their best. With these metrics in hand, you can build networks that meet real user demands.

Chapter 4: Transmission Media — The Highways of Communication

How does data move across distances? What path does it take?

This chapter dives into the physical and wireless pathways data takes from one device to another – the transmission media. By the end of this chapter, you will understand:

What transmission media is and why it matters
The difference between guided (wired) and unguided (wireless) media
Various types of cables (twisted pair, coaxial, fiber optics)
Wireless media like radio waves, microwaves, and infrared
The strengths and limitations of each medium

What are Transmission Media?

Imagine needing to deliver a letter. Do you send it through a postal truck? Drop it by drone? Deliver it by hand? The method you choose is your transmission medium.

In the digital world, transmission media refers to the path data takes from the sender to the receiver. These paths can be physical (guided), like cables, or wireless (unguided), like airwaves.

When I finally understood that even invisible data needs a “road,” I realized how crucial this topic was to building fast, reliable networks.

Different Types of Transmission Media

Transmission media are classified into two broad categories:

Guided Media (Wired): The data follows a specific path (like a road or railway). Common types include twisted pair, coaxial, and fiber optic cables.
Unguided Media (Wireless): Data floats freely through the atmosphere, like radio signals or Wi-Fi. Types include Radio Waves, Microwaves, and Infrared Waves.

Let’s dive into each of these types of transmission media in a bit more detail.

Guided Transmission Media

1. Twisted Pair Cable

This was the first cable I ever handled – it looked like two wires twisted together. Signals are transmitted as tiny voltage differences between the two copper conductors. By twisting the pair, electromagnetic interference picked up on one wire tends to be canceled out on the other, since each twist reverses their positions relative to the noise source.
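The cancellation trick is easier to believe once you see the arithmetic. This is an idealized sketch: it assumes the noise lands identically on both wires, which twisting only approximates in a real cable. The function name and values are invented for the demo.

```python
def transmit_differential(signal: list[float], noise: list[float]) -> list[float]:
    """Send +half on one wire and -half on the other, then subtract at the receiver."""
    wire_a = [s / 2 + n for s, n in zip(signal, noise)]    # +signal/2 plus noise
    wire_b = [-s / 2 + n for s, n in zip(signal, noise)]   # -signal/2 plus the same noise
    # The receiver only looks at the difference, so the shared noise drops out.
    return [a - b for a, b in zip(wire_a, wire_b)]


signal = [1.0, -1.0, 1.0, 1.0]
noise = [0.25, -0.5, 0.75, 0.125]   # interference picked up along the cable

recovered = transmit_differential(signal, noise)
print(recovered)
```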

Features & Use‑Cases:

Structure: Two insulated copper wires twisted to reduce interference.
Types:
- Unshielded Twisted Pair (UTP): Common in LANs, cheaper but more prone to noise.
- Shielded Twisted Pair (STP): Has shielding for better noise protection.
Usage: Telephones, Ethernet.
Bandwidth: Low to medium.
Distance: Up to 100 meters (for UTP).

2. Coaxial Cable

I remember unscrewing one from the back of our old TV. A single copper core carries the signal; an insulating layer and an outer metal shield form a concentric geometry. The signal propagates as an electromagnetic wave confined between the inner conductor and shield, which also blocks external noise.

Features & Use‑Cases:

Structure: A central copper core, surrounded by insulation, a metal shield, and an outer plastic cover.
Advantages: Better shielding, higher bandwidth than UTP.
Usage: Cable TV, broadband internet.
Distance: Up to several kilometers with amplifiers.

3. Fiber Optic Cable

This one blew my mind – light carrying data! Data is encoded into light pulses (laser or LED) sent down a glass or plastic core. Total internal reflection at the core–cladding interface traps light, allowing it to travel long distances with almost no loss.
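Total internal reflection only happens when light hits the core–cladding boundary at a shallow enough angle. The cutoff is called the critical angle, and it comes from Snell's law: sin(critical angle) = cladding index ÷ core index. The refractive indices below are assumed, typical-looking values chosen for the example, not the specs of any particular fiber.

```python
import math


def critical_angle_deg(n_core: float, n_cladding: float) -> float:
    """Smallest angle (measured from the normal) at which light is fully reflected."""
    if n_cladding >= n_core:
        raise ValueError("The core must have a higher refractive index than the cladding")
    return math.degrees(math.asin(n_cladding / n_core))


# Assumed example values for a glass core and its cladding.
angle = critical_angle_deg(n_core=1.48, n_cladding=1.46)
print(f"Light hitting the boundary at more than {angle:.1f} degrees stays in the core")
```

The closer the two indices are, the closer the critical angle gets to 90 degrees, so only light traveling nearly parallel to the core stays trapped.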

Features & Use‑Cases:

Structure: Glass or plastic core surrounded by cladding and a protective sheath.
Types:
- Single-Mode Fiber: For long distances, uses a laser.
- Multi-Mode Fiber: For shorter distances, uses LED.
Advantages:
- Immune to electromagnetic interference
- Higher bandwidth and longer distances
- More secure and reliable
Usage: Backbone of the internet, submarine cables, hospitals.

Unguided Transmission Media

When you connect to Wi-Fi or use Bluetooth, you are relying on unguided media. These don’t need a cable – just air.

There are several different kinds of unguided transmission media. Let’s talk about some of the most common.

1. Radio Waves

How It Works:
Antennas convert electrical signals into electromagnetic waves (and vice versa). Radio frequencies (3 kHz–1 GHz) propagate omnidirectionally (or in broad beams) through the air and can diffract around obstacles.

Pros: Penetrates walls; easy broadcast to many receivers.
Cons: Susceptible to interference and eavesdropping.
Applications: FM/AM radio, Wi‑Fi (2.4 GHz band), Bluetooth, cordless phones.

2. Microwaves

How It Works:
Microwaves (1 GHz–300 GHz) travel as highly directional beams generated by parabolic dishes or waveguide antennas. Because they travel in straight lines (line‑of‑sight), they must be carefully aligned between towers or rooftop dishes.

Pros: High data rates; well suited to cellular backhaul and satellite links.
Cons: Rain fade, clear path required, more expensive antennas.
Applications: Mobile networks, satellite TV, point‑to‑point enterprise links.

3. Infrared

How It Works:
LED or laser diodes emit infrared light pulses, which are detected by photodiodes on the receiver. Because IR light cannot pass through walls, it works only within a confined, line‑of‑sight space or within a reflective “cone.”

Pros: Highly secure (confined to room), no RF interference.
Cons: Very short range; blocked by obstacles; strict alignment.
Applications: TV remotes, short‑range device pairing, some industrial sensors.

Comparison Table

Medium	Speed	Distance	Interference	Cost	Usage
Twisted Pair	Low-Medium	~100m	High	Low	LAN, telephony
Coaxial	Medium	~2km (amplified)	Medium	Medium	Cable TV, broadband
Fiber Optic	Very High	>60km (with repeaters)	Very Low	High	Backbone, high-speed
Radio	Low-Medium	Long (via towers)	High	Low	Wi-Fi, radio, Bluetooth
Microwave	High	Long (LOS)	Medium	High	Mobile, satellites
Infrared	Low	Short	Very Low	Low	Remotes, IR sensors

How to Choose the Right Transmission Medium

When I set up my first home network, I had to think about speed, distance, and cost. That’s what engineers do when designing large networks, too.

Questions to ask yourself or your team:

How far does the data need to travel?
How fast do I need the connection?
Can I afford high-end cables or equipment?
Is the environment prone to interference?

Scenario	Best Medium	Why & How to Decide
Home LAN & Office Ethernet	Cat6 UTP	Affordable, easy to install, handles Gigabit speeds up to 100 m.
No‑Cable Wireless Access	Wi‑Fi (2.4/5 GHz)	Easy coverage of rooms; choose 5 GHz for less interference, higher speed.
Long‑Distance Fiber Backbone	Single‑Mode Fiber	Minimal signal loss over tens of kilometers; vital for ISP backbones.
Campus/Building Interconnect	Multi‑Mode Fiber	Supports 10–100 Gbps across campus; lower cost than single‑mode for short runs.
Point‑to‑Point Enterprise Link	Microwave Link	Rapid deployment between buildings; ensure clear LOS and proper dish alignment.
Industrial/Noisy Environments	Shielded Twisted‑Pair or Fiber	STP resists EMI; fiber is immune but costlier.
Room‑Confined, Secure Control Signals	Infrared	Perfect for IR‑controlled lighting or remote‑only devices in one room.
Broad Wireless Broadcast	Radio Waves	For wide‑area IoT sensors or broadcast audio; simple omnidirectional antennas.

Define Distance & Speed:
- Short run (<100 m) + moderate speed → UTP.
- Long haul → fiber or microwave.
Assess Environment:
- High EMI (factories) → fiber or STP.
- Indoor home/office → UTP or Wi‑Fi.
Consider Mobility:
- Devices moving around → wireless (Wi‑Fi, cellular).
Weigh Cost vs. Performance:
- Budget LAN → UTP
- Critical backbone → fiber
Security Needs:
- Room‑confined control → infrared
- Open campus → directional microwave or encrypted Wi‑Fi
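
The rules of thumb above can be turned into a tiny helper. This is only a sketch: the function name `suggest_medium`, its parameters, and the order in which the checks run are my own assumptions for illustration, and a real design would also weigh cost and security.

```python
def suggest_medium(distance_m: float, high_emi: bool = False, mobile: bool = False) -> str:
    """Return a rough suggestion based on the chapter's rules of thumb."""
    if mobile:
        # Devices moving around -> wireless.
        return "wireless (Wi-Fi or cellular)"
    if distance_m > 100:
        # Longer than a single UTP run (about 100 m) -> long haul.
        return "fiber or microwave"
    if high_emi:
        # Noisy environments such as factories.
        return "fiber or STP"
    # Short run, moderate speed, ordinary home or office.
    return "UTP"


print(suggest_medium(30))                  # short office run
print(suggest_medium(30, high_emi=True))   # factory floor
print(suggest_medium(5000))                # link between distant sites
print(suggest_medium(10, mobile=True))     # laptops and phones
```

Checking mobility first is a design choice, not a rule: a moving device can't be cabled no matter how short the distance is.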

By matching distance, throughput requirements, environmental constraints, and budget, you can select the transmission medium that delivers optimal real‑world performance, just as engineers do when designing networks that power everything from our smartphones to submarine data cables.

Learning about transmission media made me realize how much effort goes into a simple text message. Whether it’s a copper wire under the road or a beam of light under the ocean, there’s always a path connecting us.

I now see cables and antennas not just as hardware, but as lifelines of human connection. They are the highways of our digital lives.

Chapter 5: Network Topologies — How We Structure Our Connections

The word “topology”, in the context of networking, refers to how devices are arranged and connected. This chapter helps you see that the structure of a network is just as important as the technology it uses.

By the end of this chapter, you will:

Understand what a network topology is and why it matters
Explore different types of physical and logical topologies
Learn the pros and cons of each layout (bus, ring, star, mesh, hybrid)
Recognize how topology affects performance, scalability, and fault tolerance

What is Topology?

If you’ve ever arranged chairs in a room for a meeting, you’ve thought about topology. Should everyone face forward? Sit in a circle? Group up in clusters?

Networking topology is the same idea – it’s about the layout of devices and how they connect. Whether you're designing a small home LAN or a vast corporate network, choosing the right topology affects everything: speed, cost, troubleshooting, and scalability.

Physical vs Logical Topology

Physical Topology

This is what you can see – the actual layout of wires and devices.

Example: You see computers in a classroom connected by cables to a central switch. That’s the physical topology.

Logical Topology

This is how data flows, regardless of how devices are physically connected.

Example: Even if computers are wired to a central hub (a physical star), the hub repeats every frame to all ports, so the data travels like a bus – this makes it a logical bus topology (more on this below).

It’s like a subway map vs. the actual underground tunnels – one shows the concept, the other shows the reality.

Types of Network Topologies

Let’s go through the main types of network topologies. Each has strengths, weaknesses, and ideal use cases.

Bus Topology

Imagine one long cable – all devices “tap into” it.

In a bus topology, a single backbone cable connects all devices.

Pros:
- Simple and cheap
- Uses less cable
Cons:
- If the backbone fails, the whole network goes down
- Difficult to troubleshoot
- Performance degrades with more devices
Use case: Small temporary networks

Ring Topology

Here, each device connects to exactly two others, forming a circle.

In this case, data travels in one direction, passing through each node.

Pros:
- Easy to install
- Better than bus for managing traffic
Cons:
- Failure in one node can break the ring
- Adding/removing nodes is disruptive
Use case: Token Ring networks (rare today)

Star Topology

This is what I used when setting up a LAN in my home. All devices connect to a central hub or switch.

Pros:
- Easy to install and manage
- Failure of one device doesn’t affect the rest
Cons:
- If the central device fails, everything goes down
- Requires more cable
Use case: Modern Ethernet networks

Mesh Topology

This one fascinated me because of its complexity.

In a full mesh topology, every device is connected to every other device. In a partial mesh, each device connects to several others rather than all of them.

Pros:
- Redundant paths ensure reliability
- Excellent fault tolerance
Cons:
- Expensive and complex to install
- Requires lots of cabling
Use case: Military, critical systems, backbone networks
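
To get a feel for why a mesh needs so much cabling, here is a quick sketch that counts links. It assumes a full mesh (every pair of devices gets its own link) and compares it with a star where each device has one link to the central switch. The function names are made up for this example.

```python
def full_mesh_links(devices: int) -> int:
    """Each pair of devices needs its own link: n * (n - 1) / 2."""
    return devices * (devices - 1) // 2


def star_links(devices: int) -> int:
    """Each device needs one link to the central switch."""
    return devices


for n in (4, 10, 50):
    print(f"{n} devices: star={star_links(n)} links, full mesh={full_mesh_links(n)} links")
```

With 50 devices, the full mesh needs 1,225 links while the star needs 50, which is why full mesh is usually reserved for small groups of critical nodes.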

Hybrid Topology

Like a recipe with ingredients from different cuisines.

A hybrid topology works by combining two or more topologies.

Pros:
- Flexible and scalable
- Can be tailored to specific needs
Cons:
- Complex design and management
Use case: Large organizations with diverse requirements

Comparison Table

Topology	Cost	Reliability	Scalability	Complexity	Use Case
Bus	Low	Low	Low	Low	Small LANs
Ring	Medium	Medium	Low	Medium	Outdated systems
Star	Medium	Medium-High	High	Low	Homes, offices
Mesh	High	Very High	Medium	Very High	Data centers, military
Hybrid	High	High	Very High	High	Enterprises

How to Choose the Right Topology

When I built my first network for a class project, I went with a star topology. Why? Because it was easy to set up and troubleshoot, and it matched our desk layout, with all PCs around a central switch. That hands-on experience taught me that the right topology isn’t just about wiring – it’s about reliability, cost, and how people use the network.

Think of it like planning a city:

Where are the busiest hubs?
Do you need alternate routes in case one fails?
Can you maintain all the connections?

Common Network Topologies and When to Use Them

Topology	How It Works	When to Use It	Pros	Cons
Bus	All devices share a single backbone cable	Very small networks, temporary setups, or budget constraints	Cheap, minimal cabling	Hard to troubleshoot, poor scalability, one break = network down
Star	Devices connect to a central hub or switch	Home networks, classrooms, offices	Easy to manage, isolate issues, scalable	Hub is single point of failure
Ring	Each device connects to two others forming a closed loop	Legacy systems or specialized industrial networks	Predictable data flow, fair traffic management	Break in loop can halt the network unless dual ring used
Mesh	Every device connects to multiple others	Critical systems (e.g. military, finance), where uptime is vital	Highly fault-tolerant, redundant paths	Expensive, complex, heavy cabling
Hybrid	Mix of two or more topologies	Large enterprises or campuses	Flexible, optimized for different departments	Can be complex and costly to manage

How to Actually Choose a Topology (Real-Life Scenarios)

Let’s move beyond theory. Here’s how you'd pick a topology depending on your network goals and constraints:

1. Need a simple setup with a tight budget?

Choose: Bus or Star
Why: Bus requires minimal cabling (but be warned: it’s fragile); Star uses affordable switches and is easy to expand.
Example: Setting up a temporary lab or a network for a rural clinic.

2. Setting up a home or small office?

Choose: Star
Why: It mirrors how devices are physically placed. One faulty PC won’t crash the whole network.
Example: Wi-Fi router (the central node) with laptops, smart TVs, and printers.

3. Running a business with multiple departments?

Choose: Hybrid (Star + Mesh or Star + Ring)
Why: Combine flexibility with reliability. Use star for offices, mesh for server interconnects.
Example: A university with classrooms (star) and data centers (mesh).

4. Downtime is a dealbreaker?

Choose: Mesh
Why: Redundant paths keep communication alive even if several links fail.
Example: Military control center or emergency dispatch system.

5. Working with legacy systems?

Choose: Ring
Why: Some older systems (like Token Ring networks or SONET rings) require ring layouts.
Example: Legacy manufacturing networks that still run on ring-based designs.

6. Expecting rapid growth?

Choose: Star or Hybrid
Why: You can easily add more nodes to the central hub or integrate new segments.
Example: A startup anticipating more staff and devices within 6–12 months.

Tips from Experience

Think long-term: Design for tomorrow’s load, not just today’s.
Plan for failures: Even if you don’t need full mesh, maybe add backup links for your star’s hub.
Sketch the layout: Visualizing devices and data flow helps you pick the best design.
Consider wireless topologies too: For mobile or flexible environments, wireless mesh or infrastructure-based topologies might be better than wired ones.

Just like roads and power lines shape how a city grows, your network topology shapes how your digital systems evolve. The best layout isn’t the one with the fanciest name – it’s the one that fits your users, your budget, and your goals.

Choose thoughtfully, and your network becomes more than wires – it becomes infrastructure for productivity, connection, and growth.

Network topology is the blueprint for that digital city. When done right, everything flows. When it’s messy, things get congested, slow, or fail. And that’s why I now look at every network not just as wires and switches, but as architecture, with a purpose and design.

Chapter 6: The OSI Model — Understanding Layers of Communication

The OSI model is like a translator – it helps all types of systems speak the same language. And it’s everywhere.

In this chapter, you will:

Understand what the OSI model is and why it was created
Learn what each of the 7 layers does
Discover how the layers work together during communication
Apply real-life analogies to remember each layer’s role

What is the OSI Model?

Picture this: you want to send a letter. You write it 📝 → put it in an envelope ✉️ → mail it 📮 → it goes to your friend’s house 🏠 → they open it 👐 → and read it 👀.

That’s basically how the OSI Model works. The OSI (Open Systems Interconnection) model is a conceptual framework that describes how data moves from one device to another in a network. Instead of all systems operating differently, the OSI model helps break down communication into 7 distinct layers.

Each layer has a specific task, and together they make communication structured, understandable, and interoperable.

Developed by the International Organization for Standardization (ISO), the OSI model was created to provide a universal standard for different systems to communicate.

Think of it like this: You’re building a house. You wouldn’t put the roof before the walls. Similarly, data follows an order, moving through each of these layers – from sender to receiver.

The 7 layers of the OSI model are:

Application (your browser or app)
Presentation (formatting, encrypting)
Session (starting/ending chats)
Transport (reliable delivery)
Network (finding the route)
Data Link (organizing the data)
Physical (the actual wires or Wi-Fi)

It’s teamwork that makes the stream work!

An easy mnemonic I used to memorize them (from top to bottom): “All People Seem To Need Data Processing.”

Let’s explore each layer from the bottom (Layer 1) to the top (Layer 7):

Layer 1 – Physical Layer

This is the hardware level.

Handles: cables, connectors, hubs, voltages, pins
Responsible for: physically transmitting raw bits (0s and 1s)
Example: Ethernet cables, fiber optics

Analogy: The roads on which data travels.

Layer 2 – Data Link Layer

Ensures reliable transfer across the physical link.

Handles: MAC addresses, framing, error detection
Divided into:
- Logical Link Control (LLC)
- Media Access Control (MAC)
Example: Switches, MAC addressing

Analogy: Street signs and traffic signals managing who goes when.

Layer 3 – Network Layer

This is about routing – finding the best path to the destination.

Handles: IP addresses, packet forwarding
Devices: Routers
Protocols: IP, ICMP

Analogy: Google Maps calculating the best route.

Layer 4 – Transport Layer

Responsible for end-to-end communication and reliability.

Handles: segmentation, flow control, error recovery
Protocols: TCP (reliable), UDP (fast but no guarantee)

Analogy: Your personal driver, making sure you arrive safely.

Layer 5 – Session Layer

This layer manages dialogues (sessions) between systems.

Handles: session setup, management, and termination

Analogy: A host managing who gets to speak in a Zoom meeting.

Layer 6 – Presentation Layer

Responsible for data formatting and translation.

Handles: encryption, compression, data conversion
Example: JPEG, MP3, SSL, ASCII, EBCDIC

Analogy: A translator ensuring the data is understood.

Layer 7 – Application Layer

The layer closest to the user.

Handles: user interfaces, network services
Protocols: HTTP, FTP, SMTP, DNS

Analogy: The app you open – browser, email client, and so on.

Communication Flow

When I send a message:

It starts at Layer 7 and goes down to Layer 1 at my device
Then travels across the medium
And climbs back up from Layer 1 to Layer 7 on the receiving device

Each layer talks to its “peer” on the other device using a protocol.
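
The flow above can be sketched as a toy program. This is a simplification under stated assumptions: real layers add binary headers (and sometimes trailers), not text labels, and the function names `encapsulate` and `decapsulate` are invented for this illustration.

```python
LAYERS = [
    "Application", "Presentation", "Session",
    "Transport", "Network", "Data Link", "Physical",
]


def encapsulate(message: str) -> str:
    """Sender side: wrap the message going down from Layer 7 to Layer 1."""
    data = message
    for layer in LAYERS:
        data = f"[{layer}|{data}]"
    return data


def decapsulate(data: str) -> str:
    """Receiver side: unwrap going up from Layer 1 to Layer 7."""
    for layer in reversed(LAYERS):
        prefix = f"[{layer}|"
        if not (data.startswith(prefix) and data.endswith("]")):
            raise ValueError(f"missing {layer} wrapper")
        data = data[len(prefix):-1]
    return data


on_the_wire = encapsulate("hi")
print(on_the_wire)
print(decapsulate(on_the_wire))
```

Notice that each wrapper is removed by the layer with the same name on the other side. That is the “peer” relationship in miniature.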

Why the OSI Model Matters

The OSI model is more than theory. It’s a map of the journey your data takes, and it gives structure to what would otherwise be chaos. It has also helped me think systematically about problems, identify where things break down, and appreciate the complexity behind “just sending a message.” When debugging a network problem, I ask:

Is the cable plugged in? (Layer 1)
Is the MAC address correct? (Layer 2)
Can I ping the destination? (Layer 3)
Is the application service running? (Layer 7)

It gave me a checklist to go through, along with some clarity.

Whether you’re a student or a network pro, these 7 layers are your best friends.

TCP/IP: The Real MVP of the Internet

While the OSI model is an ideal learning tool, the TCP/IP model is what the internet actually uses. It has only four layers, combining some of the OSI layers for simplicity and practicality:

TCP/IP Layer	Corresponds to OSI Layers	Examples
Application	Layers 5–7 (Session to Application)	HTTP, FTP, DNS, SMTP
Transport	Layer 4 (Transport)	TCP, UDP
Internet	Layer 3 (Network)	IP, ICMP
Network Access / Link	Layers 1–2 (Physical + Data Link)	Ethernet, Wi-Fi, MAC addresses

Why TCP/IP Matters:

Scalable: It powers everything from home routers to global telecom infrastructure.
Interoperable: Works across all hardware, operating systems, and devices.
Fault-tolerant: TCP handles dropped packets, reordering, and error checking.
Backbone of the Internet: Every website, email, or Zoom call runs over TCP/IP.

How TCP/IP Works (Simplified Walkthrough)

Let’s say you open your browser and type in www.example.com.

Application Layer (HTTP): Your browser sends a request for a web page.
Transport Layer (TCP): The request is broken into segments, with each piece numbered and prepared for reliable delivery.
Internet Layer (IP): Each segment is wrapped in a packet labeled with the source and destination IP addresses and is routed across networks.
Network Access Layer: The data is turned into frames and signals, then physically transmitted over the internet (via cables or wireless).

At the other end, the process reverses, and you see the web page appear on your screen.

OSI vs. TCP/IP: Why Learn Both?

OSI	TCP/IP
Conceptual, educational model	Practical, real-world protocol suite
7 distinct layers	4 simplified layers
Rarely used directly in implementation	Foundation of the internet

Think of the OSI model as a textbook diagram – helpful for troubleshooting and interviews. TCP/IP is the actual engine – streamlined and optimized for real-world communication.

Chapter 7: Protocols and Ports — How Rules and Doors Guide Communication

Protocols and ports are the rules and gates that make it all happen smoothly. This chapter helps you appreciate how structured communication actually is.

By the end of this chapter, you will:

Understand what protocols are and why they’re essential
Learn about standard protocols used in networking
Explore the concept of ports and their numbers
Discover how protocols and ports work together to manage communication

The Importance of Protocols and Ports

When I tried setting up a local web server for the first time, nothing loaded. It took me a while to realize I hadn’t opened the right port or used the correct protocol.

Protocols are the rules that devices follow when talking to each other. Ports are like doors that allow specific types of data to come in and go out.

Without protocols and ports, communication would be total chaos.

What is a Protocol?

A protocol is an agreed-upon set of rules for sending and receiving data.

Think of it like:

A language: both sides must understand it
A traffic system: everyone follows the same rules to avoid crashes

Characteristics of Good Protocols

For a protocol to be effective in communication, it must clearly define how data is structured, understood, and managed in time. Let’s break that down:

1. Syntax – The Format and Structure of the Data

Think of syntax like grammar in language. It defines:

Data format (for example, header, payload, footer)
Order of fields in a message
Encoding rules (for example, binary, ASCII, JSON, XML)

Example: In an email protocol like SMTP, the syntax might require that the sender and recipient addresses come in a specific format like MAIL FROM: and RCPT TO:.

A good protocol syntax is:

Consistent and unambiguous
Easy to parse by machines
Designed to minimize errors in interpretation

2. Semantics – The Meaning of Each Field

Semantics defines what each piece of data means – what should be done with it.

What does a "200 OK" response mean in HTTP? (It means the request was successful.)
What does a SYN flag mean in TCP? (It initiates a new connection.)

Good protocol semantics:

Ensure that both sender and receiver interpret the data in the same way
Clearly define error codes, commands, and responses
Support meaningful actions tied to each instruction
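
Here is a small sketch of syntax and semantics working together, using the HTTP status line from the example above. The helper name `parse_status_line` is my own. Splitting the line into fields is the syntax part, and looking up what the code means (with Python's standard `http.HTTPStatus`) is the semantics part.

```python
from http import HTTPStatus


def parse_status_line(line: str) -> tuple[str, int, str]:
    """Syntax: an HTTP status line is 'VERSION CODE REASON'."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.1 200 OK")

# Semantics: what the numeric code means to both sides.
status = HTTPStatus(code)
succeeded = 200 <= code < 300

print(version, code, reason)
print(status.phrase, "- success" if succeeded else "- not a success")
```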

3. Timing – When and How Fast to Communicate

Timing refers to:

When messages are sent (synchronization)
How fast messages should arrive (data rate)
How long to wait before assuming failure (timeouts)

A good protocol timing design:

Prevents collisions (two devices sending at the same time)
Supports flow control to avoid overwhelming slower devices
Includes retransmission logic in case of delay or loss
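
To make timeouts and retransmission concrete, here is a minimal sketch with a simulated link. Nothing here touches a real network: `send_once` is a hypothetical callable that returns True when an acknowledgment arrived before the timeout and False when the timeout expired.

```python
def send_with_retries(send_once, max_attempts: int = 3) -> int:
    """Resend until an acknowledgment arrives; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if send_once():
            return attempt
    raise TimeoutError(f"no acknowledgment after {max_attempts} attempts")


# Simulated link: the first two attempts time out, the third is acknowledged.
outcomes = iter([False, False, True])
attempts_used = send_with_retries(lambda: next(outcomes))
print(attempts_used)
```

Real protocols also adjust how long they wait between attempts, but the core idea is the same: wait, give up on that attempt, and try again a limited number of times.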

Common Networking Protocols

Before diving into details, here’s some context: A networking protocol is like a shared language for computers. It ensures that devices can communicate, share data, and coordinate actions reliably and securely.

TCP – Transmission Control Protocol

TCP is the backbone of reliable internet communication.

It is:

Connection-oriented: A session is established before data is sent.
Reliable: It ensures all data arrives correctly and in order using acknowledgments and retransmission.
Error-checked: Includes checksums to detect corruption; damaged segments are discarded and retransmitted.

You use TCP in web browsing (HTTP/HTTPS), email (SMTP), and file transfers (FTP). It’s like mailing a package with tracking and a required signature on delivery.
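
The “in order” part of TCP can be sketched in a few lines. This is a toy model, not real TCP: actual sequence numbers start from a random value and come with acknowledgments and windows. The function names `segment` and `reassemble` are assumptions for this example.

```python
def segment(data: bytes, size: int) -> list[tuple[int, bytes]]:
    """Split data into (offset, chunk) pairs, like numbered segments."""
    return [(offset, data[offset:offset + size]) for offset in range(0, len(data), size)]


def reassemble(segments) -> bytes:
    """Put the chunks back in order using their offsets."""
    return b"".join(chunk for _, chunk in sorted(segments))


original = b"GET /index.html"
pieces = segment(original, 4)

# Pretend the network delivered the pieces in reverse order.
arrived = list(reversed(pieces))
print(arrived)
print(reassemble(arrived))
```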

UDP – User Datagram Protocol

UDP is lightweight, fast, and doesn’t worry about delivery guarantees.

It is:

Connectionless: No handshake or setup, just send and forget.
Low overhead: No acknowledgments or retransmission.
Faster than TCP, but riskier for data loss.

You use it in online gaming, voice calls (VoIP), and live video streaming. It’s like shouting a message across a noisy room – quick, but no guarantee it’ll be heard.

HTTP / HTTPS – HyperText Transfer Protocol

HTTP is the protocol of the web – it enables your browser to request and display web pages.

It is:

Stateless: Each request is independent.
Based on the request-response model: Client sends a request; server responds.

HTTPS adds encryption via TLS (the successor to SSL), making it secure for sensitive data (for example, online banking, logins).

It’s used for activities like browsing websites and in REST APIs.

FTP – File Transfer Protocol

FTP is a classic protocol for transferring files between devices on a network.

It:

Works in client-server mode
Requires authentication (username/password)
Is not secure on its own – can be enhanced with FTPS or replaced by SFTP (uses SSH)

You can use it for website hosting and file backup systems.

SMTP, POP3, IMAP – Email Protocols

These are the three common email protocols, and each has its own features:

SMTP (Simple Mail Transfer Protocol): Used to send email from clients to servers or between servers.
POP3 (Post Office Protocol v3): Downloads emails to the device and usually deletes them from the server.
IMAP (Internet Message Access Protocol): Keeps email on the server and synchronizes across devices.

These are used in email clients like Outlook, Thunderbird, and Apple Mail.

DNS – Domain Name System

DNS is the internet’s phonebook – it converts human-readable names (like google.com) into IP addresses.

Hierarchical and distributed system
Uses caching to speed up lookups
Works behind the scenes of every website visit

It’s used in every internet-connected application that uses domain names.

What is a Port?

A port is a virtual door on a device that allows certain kinds of data through.

Each application or service uses a specific port number, which ranges from 0 to 65535.

Port Ranges

Well-known ports: 0–1023 (assigned to common services)
Registered ports: 1024–49151 (used by user processes)
Dynamic/Private ports: 49152–65535 (temporary or private use)
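
As a quick illustration, the three ranges above can be expressed as a small function. The name `port_range` is made up for this sketch.

```python
def port_range(port: int) -> str:
    """Classify a port number into one of the three ranges listed above."""
    if not 0 <= port <= 65535:
        raise ValueError("port must be between 0 and 65535")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


for port in (80, 443, 8080, 50000):
    print(port, port_range(port))
```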

Common Port Numbers

Service	Protocol	Port
HTTP	TCP	80
HTTPS	TCP	443
FTP	TCP	21
SSH	TCP	22
DNS	UDP/TCP	53
SMTP	TCP	25
POP3	TCP	110
IMAP	TCP	143

How Protocols and Ports Work Together

Imagine you’re throwing a party:

Protocol: The invitation format – RSVP, dress code, rules.
Port: The door your friends enter through.

A web browser knows to use HTTP (protocol) on port 80. A secure connection will use HTTPS on port 443.

Your computer and servers use these pairings to know what type of data to expect.
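
You can watch a protocol and a port work together on your own machine. The sketch below uses Python's standard `socket` module to open a TCP server on the loopback address and connect to it. The function name `loopback_demo` is my own, and asking for port 0 simply lets the operating system pick a free port.

```python
import socket


def loopback_demo(message: bytes = b"hello") -> tuple[int, bytes]:
    """Send a message over TCP to a server on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
        server.listen(1)
        server.settimeout(2)
        port = server.getsockname()[1]

        # The client must knock on the same door the server opened.
        with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
            conn, _ = server.accept()
            with conn:
                conn.settimeout(2)
                client.sendall(message)
                return port, conn.recv(1024)


port, data = loopback_demo()
print(f"server listened on port {port} and received {data!r}")
```

If the client used a different port number, the connection would be refused, which is exactly the kind of mistake I made with my first local web server.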

Once I understood protocols and ports, troubleshooting network issues got easier. Suddenly, firewall rules, web server configs, and error messages started to make sense.

Protocols ensure everyone speaks the same language. Ports ensure everyone enters through the correct door.

They are the silent heroes of every network conversation.

Chapter 8: IP Addressing and Subnetting — Naming and Organizing the Network

When I first saw an IP address like 192.168.0.1, I didn’t think much of it. But now I see it for what it is: the digital address that tells data where to go. In this chapter, you will learn:

What an IP address is and why it's necessary
The difference between IPv4 and IPv6
How subnetting works and why it's useful
How to calculate and interpret IP ranges, subnet masks, and CIDR notation

Imagine trying to mail a letter without an address – it would be lost forever. The same applies to data on a network. Every device needs a unique identifier called an IP address to send and receive information correctly.

IP addressing ensures that when I request a webpage, my data comes back to me, not someone else on the network.

What is an IP Address?

An IP address (Internet Protocol address) is a unique number assigned to every device on a network.

It works like a phone number for computers. There are two main versions of IP addresses: IPv4 and IPv6.

IPv4 vs. IPv6

IPv4 (Internet Protocol version 4) is the older, more widely used system. It uses a 32-bit address format, written as four numbers (each 0–255) separated by dots, for example 192.168.1.1. This format allows for about 4.3 billion unique addresses.

But with the explosion of internet-connected devices, we quickly ran out of IPv4 addresses. That’s why IPv6 (Internet Protocol version 6) was introduced. IPv6 uses a 128-bit address format, written in hexadecimal and separated by colons: 2001:0db8:85a3:0000:0000:8a2e:0370:7334. This allows for a virtually unlimited number of addresses – over 340 undecillion (that’s 340 followed by 36 zeros)!

Let’s see a quick breakdown of the key details of each protocol:

IPv4 Address Format

Composed of four numbers separated by dots
Each number ranges from 0 to 255 (i.e., 8 bits per number)
Total: 32 bits (4 x 8)
Example: 192.168.1.1

IPv6 Address Format

Created to solve the address shortage in IPv4
Composed of eight blocks of hexadecimal values
Total: 128 bits
Example: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
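
If you want to poke at these formats yourself, Python's standard `ipaddress` module can parse both. This short sketch uses the two example addresses from above. The variable names are just for illustration.

```python
import ipaddress

v4 = ipaddress.ip_address("192.168.1.1")
v6 = ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")

# Version and address length in bits.
print(v4.version, v4.max_prefixlen)
print(v6.version, v6.max_prefixlen)

# IPv6 addresses are often written in a shortened form.
print(v6.compressed)

# Size of each address space.
print(2 ** 32)
print(2 ** 128)
```

The shortened IPv6 form drops leading zeros in each block and replaces one run of all-zero blocks with `::`.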

The Old IPv4 Class System

Originally, IPv4 addresses were grouped into classes to simplify allocation:

Class	Range	Default Subnet Mask	Use
A	1.0.0.0 – 126.255.255.255	255.0.0.0	Large networks
B	128.0.0.0 – 191.255.255.255	255.255.0.0	Medium networks
C	192.0.0.0 – 223.255.255.255	255.255.255.0	Small networks
D	224.0.0.0 – 239.255.255.255	N/A	Multicasting
E	240.0.0.0 – 255.255.255.255	N/A	Reserved for future use

But this system was too rigid. It wasted address space by assigning fixed block sizes, even when a network didn’t need that much.
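
Under the old system, the class was decided by the first number of the address alone. Here is a rough sketch of that rule. The function name `ipv4_class` is an assumption for this example, and it treats 0 and 127 as special cases because they fall outside the class ranges in the table.

```python
def ipv4_class(address: str) -> str:
    """Return the historical class of an IPv4 address from its first octet."""
    first = int(address.split(".")[0])
    if 1 <= first <= 126:
        return "A"
    if 128 <= first <= 191:
        return "B"
    if 192 <= first <= 223:
        return "C"
    if 224 <= first <= 239:
        return "D"
    if 240 <= first <= 255:
        return "E"
    return "special (0 or 127)"


for addr in ("10.0.0.1", "172.16.0.1", "192.168.1.1", "224.0.0.5"):
    print(addr, ipv4_class(addr))
```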

Enter CIDR: Classless Inter-Domain Routing

CIDR (pronounced "cider") replaced the old class system in the 1990s. CIDR allows for more flexible and efficient allocation of IP addresses. Instead of using predefined classes, CIDR uses a prefix length to specify how many bits represent the network portion.

Example: In 192.168.1.0/24, the first 24 bits are the network, and the last 8 bits are available for hosts.

CIDR made it easier to split (subnet) networks and slowed the exhaustion of IPv4 addresses. We’ll discuss this more below.

Does IPv6 Use Classes?

No, IPv6 does not use classes. It was designed from the start to avoid the inefficiencies of the class system. Instead, it uses a hierarchical structure and prefix notation similar to CIDR. IPv6 addresses are divided into:

Global unicast (like public IPv4 addresses)
Link-local (used within a local network)
Multicast (send to many devices at once)

IPv6’s design naturally supports efficient routing and address assignment without needing "classes" as a workaround.

After learning about IP addresses – especially the difference between IPv4 and IPv6 – it’s important to understand how networks manage and organize these addresses. That’s where subnetting comes in.

What Is Subnetting?

Think of a large network like a school compound. Subnetting is like dividing the school into classrooms or departments. It’s the process of dividing a larger network into smaller, more manageable subnetworks (subnets).

Subnetting helps with:

Efficient use of IP addresses: You don’t need to assign a huge range of addresses when only a few devices are needed.
Network organization: Departments or teams can be separated into their own subnets.
Better performance and security: Traffic stays local within each subnet, and issues in one subnet don’t affect the whole network.

How Subnet Masks Work

To understand subnetting, we need to talk about subnet masks.

Every IPv4 address is divided into two parts:

The network portion tells you which network it belongs to.
The host portion tells you which specific device (computer, phone, printer, and so on) it is on that network.

A subnet mask tells us how to separate those two parts.

Example:

IP Address: 192.168.1.10
Subnet Mask: 255.255.255.0

This means:

The first three numbers of the IP address (192.168.1) represent the network.
The last number (10) identifies the specific host on that network.

The subnet mask acts like a filter that shows which part of the IP is fixed (network) and which part can vary (host).
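
Under the hood, that “filter” is a bitwise AND. The sketch below applies it to the example address and mask using Python's standard `ipaddress` module. The variable names are mine.

```python
import ipaddress

ip = ipaddress.IPv4Address("192.168.1.10")
mask = ipaddress.IPv4Address("255.255.255.0")

# Keep only the bits where the mask is 1: the network portion.
network = ipaddress.IPv4Address(int(ip) & int(mask))

# Keep only the bits where the mask is 0: the host portion.
host = int(ip) & ~int(mask) & 0xFFFFFFFF

print(network)
print(host)
```

The extra `& 0xFFFFFFFF` keeps the inverted mask within 32 bits, since Python integers are not limited to a fixed width.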

CIDR Notation: A Modern Alternative

You might also see IP addresses written like this: 192.168.1.0/24. This is called CIDR notation (Classless Inter-Domain Routing), which we discussed briefly above.

CIDR is a more flexible and compact way to express IP addresses and subnet masks. The /24 tells us that the first 24 bits of the address are used for the network. The rest are for hosts.

CIDR Notation	Subnet Mask	Number of Addresses
/24	255.255.255.0	256 IPs (254 usable)
/26	255.255.255.192	64 IPs (62 usable)
/30	255.255.255.252	4 IPs (2 usable)

CIDR allows networks to be split or combined more precisely than the old Class A/B/C system, which had fixed sizes.
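
The table above can be reproduced with a little bit arithmetic. This is a sketch with hypothetical helper names (`prefix_to_mask`, `address_count`): it builds the mask by setting the first N bits of a 32-bit number to 1.

```python
def prefix_to_mask(prefix: int) -> str:
    """Convert a prefix length such as 24 into dotted form such as 255.255.255.0."""
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def address_count(prefix: int) -> int:
    """Total addresses in the block: 2 to the power of the host bits."""
    return 2 ** (32 - prefix)


for prefix in (24, 26, 30):
    total = address_count(prefix)
    print(f"/{prefix}  {prefix_to_mask(prefix)}  {total} IPs ({total - 2} usable)")
```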

How to Calculate a Subnet

Let’s walk through a basic example.

You’re given the network: 192.168.1.0/26

The /26 means 26 bits are used for the network and 6 bits remain for hosts (since IPv4 has 32 bits total).
Using the formula 2^number_of_host_bits, you get 2^6 = 64 total addresses.
But 2 addresses are reserved: one for the network itself, and one for the broadcast address.
So, you’re left with 62 usable addresses in that subnet.

This is helpful when dividing a network among departments, buildings, or device types.
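
You can check the worked example with Python's standard `ipaddress` module. The sketch below inspects 192.168.1.0/26 and then shows how a /24 could be divided into four /26 subnets, for instance one per department. The variable names are only for illustration.

```python
import ipaddress

net = ipaddress.ip_network("192.168.1.0/26")
hosts = list(net.hosts())  # usable addresses only

print(net.num_addresses)      # total addresses in the subnet
print(net.network_address)    # reserved: the network itself
print(net.broadcast_address)  # reserved: the broadcast address
print(len(hosts), hosts[0], hosts[-1])

# Splitting a /24 into /26 blocks.
parent = ipaddress.ip_network("192.168.1.0/24")
for subnet in parent.subnets(new_prefix=26):
    print(subnet)
```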

Public vs Private IP Addresses

Not all IP addresses are meant for use on the open internet. Some are private, used within internal networks.

Private IP Addresses:

Not routed over the internet.
Used in homes, schools, and offices.
Can be reused in different networks without conflict.

Range	Purpose
10.0.0.0 – 10.255.255.255	Private use
172.16.0.0 – 172.31.255.255	Private use
192.168.0.0 – 192.168.255.255	Private use

Devices with private IPs connect to the internet through a router that uses NAT (Network Address Translation).

Public IP Addresses:

Assigned by your ISP (Internet Service Provider).
Must be globally unique.
Used by websites, servers, and other devices reachable over the internet.
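
A simple way to tell the two apart is to check an address against the three private ranges listed above. This sketch writes those ranges in CIDR form, and the helper name `is_private_ipv4` is my own. Python's `ipaddress` module also has a built-in `is_private` property, but it covers additional special ranges (such as loopback), so the explicit list matches this section more closely.

```python
import ipaddress

# The three private ranges from the table, in CIDR form.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_ipv4(address: str) -> bool:
    """Return True if the address falls inside one of the private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_RANGES)


for addr in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.0.1", "8.8.8.8"):
    print(addr, "private" if is_private_ipv4(addr) else "public")
```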

Static vs Dynamic IP Addresses

IP addresses can also be either static or dynamic.

Static IP Address:
- Manually assigned to a device.
- Doesn’t change over time.
- Commonly used for servers, printers, or devices that need consistent access.
Dynamic IP Address:
- Assigned automatically using DHCP (Dynamic Host Configuration Protocol).
- Changes occasionally.
- Most home networks use dynamic IPs for convenience and flexibility.

Why This All Matters

Understanding subnetting, masks, and IP types helps you:

Design networks that scale and perform well.
Assign addresses efficiently.
Improve security through network isolation.
Troubleshoot and configure routers and firewalls effectively.

Subnetting felt confusing at first, but once I saw how it's like breaking down a neighborhood into streets and houses, it clicked. It's a powerful skill for anyone working in networking or IT. And with the rise of IPv6 and cloud-based systems, it's more relevant than ever.

Chapter 9: Routing and Switching — Directing Data on the Network

In this chapter, you will:

Understand the roles of routers and switches
Learn how data is directed within and across networks
Explore routing tables, packet forwarding, and switching techniques
Compare static vs. dynamic routing
Understand how LAN and WAN switching works

Every time we send an email or watch a video, data is being routed and switched through a maze of devices. It’s like navigating a city using both small alleyways (switching) and highways (routing).

These processes ensure that data goes from point A to point B efficiently, securely, and correctly, even if they’re continents apart.

What is Switching?

Switching happens within local networks (LANs). It’s all about moving data between devices on the same network.

What is a Switch?

A switch is a device used in LANs to connect computers, printers, and other networked devices. It operates at Layer 2 (Data Link Layer) of the OSI model and plays a crucial role in directing traffic inside a local network.

But how does a switch know where to send the data?

It uses something called a MAC address.

What Are MAC Addresses?

A MAC (Media Access Control) address is a unique identifier assigned to a device’s network interface card (NIC). It’s like a digital fingerprint for your laptop, printer, or phone.

Each MAC address is a 48-bit address usually displayed in hexadecimal format like this:
00:1A:2B:3C:4D:5E
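
To make the "48-bit" part concrete, here is a small Python sketch that splits the example address into its six bytes. The helper name parse_mac is made up for this illustration.

```python
def parse_mac(mac: str) -> bytes:
    """Convert a colon-separated MAC address string into its 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError("expected 6 colon-separated groups")
    # Each group is two hexadecimal digits, i.e. one byte (8 bits).
    return bytes(int(part, 16) for part in parts)


raw = parse_mac("00:1A:2B:3C:4D:5E")
print(len(raw) * 8)  # number of bits in the address
print(list(raw))     # the six bytes as decimal values
```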

When data is sent over a LAN, it’s broken into frames, which include both a source MAC address and a destination MAC address.

The switch reads the destination MAC address and forwards the frame only to the port where that specific device is connected. This makes switching faster and more secure than old-style hubs that sent data to all devices.
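
Here is a toy model of that behavior in Python. It assumes the common "learning switch" approach, which the text above doesn't spell out: the switch records the port each source MAC address arrives on, and it floods a frame to all other ports only when it hasn't seen the destination yet. The class and method names are hypothetical.

```python
class LearningSwitch:
    """A toy model of a switch's MAC address table (not a real device driver)."""

    def __init__(self, num_ports: int):
        self.ports = list(range(1, num_ports + 1))
        self.mac_table = {}  # maps MAC address -> port number

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> list:
        """Handle one frame and return the list of ports it is sent out on."""
        # Learn: remember which port the sender is reachable on.
        self.mac_table[src_mac] = in_port
        # Forward: a known destination goes out on exactly one port.
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        # Unknown destination: flood to every port except the one it came in on.
        return [port for port in self.ports if port != in_port]


switch = LearningSwitch(num_ports=4)
# First frame: the destination has not been learned yet, so it is flooded.
print(switch.receive(1, "00:1A:2B:3C:4D:5E", "00:1A:2B:3C:4D:5F"))
# Reply: the original sender was learned on port 1, so only that port is used.
print(switch.receive(2, "00:1A:2B:3C:4D:5F", "00:1A:2B:3C:4D:5E"))
```

A real switch also ages out old entries and handles broadcast addresses, which this sketch leaves out.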

LAN Switching Techniques

Switches use different techniques to decide when and how to forward frames. These include:

Store-and-Forward Switching: The switch receives the entire frame, checks it for errors using a CRC (Cyclic Redundancy Check), and then forwards it. It’s reliable but slightly slower.
Cut-Through Switching: The switch reads just the destination MAC address, which sits in the first 6 bytes of the frame, and immediately begins forwarding the frame. It’s faster but doesn’t check for errors.
Fragment-Free Switching: A hybrid approach. It reads the first 64 bytes before forwarding, enough to avoid most collision-related errors.

What is Routing?

While switching moves data within a single network, routing is what moves data between networks. This is how information travels from your home network to the wider internet.

What is a Router?

A router is a device that connects different networks and determines the best path for data to travel. It operates at Layer 3 (Network Layer) of the OSI model and forwards data based on IP addresses rather than MAC addresses.

You can think of a router like a GPS navigator for internet traffic. It chooses the best available route based on traffic, cost, and destination.

What is a Routing Table?

Each router has a routing table, which is like a map that tells the router:

Which destination networks it knows about
The next hop (which router to send the packet to next)
Which interface (port) to send it out on
The metric, which is a number representing the cost or preference of that path

When a router receives a data packet, it checks the routing table to decide where to send it next.
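
The lookup can be sketched in a few lines of Python. Everything in this table (networks, next hops, interface names, metrics) is invented for illustration. The sketch also assumes the usual selection rule, which isn't described above: among the routes that match, prefer the most specific one (longest prefix), and use the metric only to break ties.

```python
import ipaddress

# A hypothetical routing table: (destination network, next hop, interface, metric).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "203.0.113.1", "eth0", 10),  # default route
    (ipaddress.ip_network("192.168.1.0/24"), "directly connected", "eth1", 0),
    (ipaddress.ip_network("10.0.0.0/8"), "192.168.1.254", "eth1", 5),
    (ipaddress.ip_network("10.1.0.0/16"), "192.168.1.253", "eth1", 5),
]


def lookup(destination: str):
    """Pick the matching route with the longest prefix; break ties by lowest metric."""
    ip = ipaddress.ip_address(destination)
    matches = [route for route in ROUTES if ip in route[0]]
    return max(matches, key=lambda route: (route[0].prefixlen, -route[3]))


for address in ("10.1.2.3", "10.9.9.9", "8.8.8.8"):
    network, next_hop, interface, metric = lookup(address)
    print(f"{address} -> {network} via {next_hop} on {interface} (metric {metric})")
```

Because the default route 0.0.0.0/0 matches every address, the lookup always finds at least one entry.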

Static vs. Dynamic Routing

Routers can learn routes in two main ways: static or dynamic.

Static Routing

With static routing, a network administrator manually enters routes into the router's configuration. This method is:

Simple and efficient for small, stable networks
Very secure since routes never change unless manually updated
Limited because it doesn’t adapt if a network link goes down

Example: If you tell a router, “To reach network X, always go through Router A,” that route will stay in place until someone changes it.

Dynamic Routing

Dynamic routing uses protocols that allow routers to automatically share and update routing information with each other. This approach is:

Ideal for large or complex networks
Adaptive, since routes are recalculated if something changes or fails
Slightly more resource-intensive due to constant updates

Common dynamic routing protocols include:

RIP (Routing Information Protocol) – Simple, but outdated
OSPF (Open Shortest Path First) – Fast and widely used in large networks
EIGRP (Enhanced Interior Gateway Routing Protocol) – Developed by Cisco and originally proprietary, combining the best of both distance vector and link-state methods
BGP (Border Gateway Protocol) – The protocol that powers routing across the entire internet

Routing in Action

Let’s say I’m watching a YouTube video:

My device sends a request
The switch sends it to the router
The router consults its table and forwards it to another router
This process continues until the request reaches YouTube’s server
The server sends data back, following the same or a different route

Routers and switches never sleep. They’re working behind the scenes, 24/7, making sure our digital lives function smoothly.

Routing and switching may sound technical, but they are the backbone of modern networking. Knowing how they work has helped me troubleshoot issues and understand why certain delays or outages happen.

Switching keeps local communication efficient. Routing connects us to the world. Together, they are the traffic controllers of the internet.

Chapter 10: Network Infrastructure — Devices, Security, and the Modern Internet

As I continued my journey through networking and data communication, I could see that it's not theory alone – it's hardware, security, and innovation that are essential to the backbone of our everyday life on the internet.

This final chapter brings together the essential knowledge of networks: devices, security protocols, and the technologies behind new connectivity.

In this chapter, you will:

Understand common networking devices and their functions
Explore firewalls, intrusion detection, and best practices for security
Learn how the internet works (DNS, cloud computing, IoT)
Appreciate the role of protocols, encryption, and data integrity in today's connected world

Network Devices — The Building Blocks of Connectivity

Every time we send an email, stream a video, or browse the web, a collection of physical devices quietly work behind the scenes to make it all possible. These network devices form the infrastructure of both small local networks and the vast global internet. Let’s take a closer look at some of the key players.

Hub

The hub is one of the earliest and simplest network devices. It operates at the Physical Layer (Layer 1) of the OSI model and has a very basic job: when it receives data from one of its ports, it broadcasts that data to all other connected devices.

This method is inefficient, as it creates unnecessary traffic and poses security risks. Because of this, hubs are rarely used in modern networks, having been largely replaced by more intelligent devices like switches.

Switch

A switch is a more advanced and efficient version of a hub. It operates at Layer 2 (Data Link Layer) and uses MAC addresses to forward data only to the intended recipient. Instead of flooding the entire network with every transmission, a switch makes sure the data goes only where it's needed. This makes it the go-to device in most Local Area Networks (LANs) today.

Router

While switches handle local traffic, routers are responsible for sending data between different networks. Operating at Layer 3 (Network Layer), a router uses IP addresses to determine the best path for forwarding packets across the internet. In home and business environments, routers are essential for enabling access to the wider world beyond the local network.

Access Point (AP)

An Access Point bridges the gap between wired and wireless networking. It connects to a wired network and provides Wi-Fi so that wireless devices like laptops and smartphones can connect. Access points are especially important in large areas such as offices, schools, or public places where seamless wireless connectivity is needed.

Modem

A modem (short for modulator-demodulator) is the device that connects your local network to your Internet Service Provider (ISP). It converts digital data from your computer into signals that can travel over telephone lines or cable systems, and vice versa. In many homes, the modem is combined with a router in a single device.

Network Interface Card (NIC)

A NIC is the hardware component inside a device—like a laptop or desktop—that allows it to connect to a network. It can be built-in or external and can support either wired Ethernet or wireless Wi-Fi connections. Without a NIC, a device simply can’t participate in network communication.

Network Security — Protecting Our Digital Lives

I never thought much about network security – until I once received a very convincing spam email that nearly tricked me into sharing personal info. It was a wake-up call that our digital spaces aren’t always as safe as they seem.

In today’s connected world, network security is not just an IT concern – it’s a crucial part of everyday life. As we connect more devices and store more personal data online, the risks of cyberattacks and data breaches grow. Here’s a look at the major threats and how we protect against them.

Common Threats

There are many ways attackers can exploit vulnerabilities in a network. Some of the most common threats include:

Malware: This includes viruses, worms, and ransomware – malicious software that can damage files, steal information, or lock systems until a ransom is paid.
Phishing: Attackers send fake emails or create deceptive websites to trick users into revealing sensitive information like passwords or credit card numbers.
DDoS Attacks: A Distributed Denial of Service attack overwhelms a system with traffic from multiple sources, causing it to slow down or crash entirely.

Security Devices and Techniques

To defend against these threats, networks are equipped with various tools and strategies:

Firewalls: These act as gatekeepers between networks, blocking unauthorized access while allowing legitimate communication.
Intrusion Detection Systems (IDS): These monitor network traffic for suspicious behavior or known attack patterns.
Antivirus and Endpoint Security: These tools protect individual devices by scanning for and removing malicious software.
VPNs (Virtual Private Networks): VPNs encrypt data transmitted over the internet, shielding users from eavesdropping—especially on public Wi-Fi networks.

Best Practices

Technology alone isn’t enough – human behavior plays a big role in security. Some key habits include:

Using strong, unique passwords and changing them regularly
Keeping software and operating systems up to date, since patches often fix security holes
Enabling multi-factor authentication (MFA) to add an extra layer of protection
Educating users to recognize suspicious emails and links

Together, these tools and habits form a multi-layered defense that helps safeguard personal and organizational data.

The Modern Internet — DNS, Cloud, and IoT

Today’s internet is about far more than just connecting computers. It’s a complex, evolving ecosystem of services and smart devices, all working together to deliver seamless digital experiences. Let’s explore three key pillars of the modern internet: DNS, Cloud Computing, and the Internet of Things (IoT).

Domain Name System (DNS)

Imagine trying to access websites using IP addresses like 142.250.190.206 instead of just typing google.com. It would be nearly impossible to remember. That’s where the Domain Name System (DNS) comes in.

DNS works like the internet’s phonebook: it translates easy-to-remember domain names (like google.com) into the numerical IP addresses that computers use to communicate. Without DNS, web browsing as we know it wouldn’t exist.

Cloud Computing

The cloud has transformed how we store, process, and access information. Rather than relying on local hardware, cloud computing delivers services—like file storage, applications, or processing power—via the internet. Platforms like Google Drive, Amazon Web Services (AWS), and Microsoft Azure make it easy to scale up resources as needed, work from anywhere, and reduce infrastructure costs.

The benefits are clear: scalability, flexibility, and cost efficiency. But it also brings new challenges in terms of data privacy, security, and compliance.

Internet of Things (IoT)

The Internet of Things refers to everyday objects – like light bulbs, refrigerators, security cameras – that are connected to the internet and can communicate with each other. These devices offer convenience and automation, like turning off lights remotely or monitoring your home while away.

But the explosion of connected devices introduces challenges:

Security: Many IoT devices are poorly secured, making them easy targets for hackers.
Interoperability: With so many manufacturers and standards, getting devices to work together can be difficult.
Privacy: IoT devices often collect sensitive personal data, raising concerns about how that information is used.

Encryption and Secure Protocols

As data travels through this vast digital landscape, it must be protected from prying eyes. That’s where encryption and secure protocols come into play. These tools ensure that even if data is intercepted, it remains unreadable without the correct key.

Some of the most widely used secure protocols include:

HTTPS (Hypertext Transfer Protocol Secure): Ensures encrypted communication between your browser and websites.
SSL/TLS (Secure Sockets Layer / Transport Layer Security): Used behind HTTPS to secure web data.
IPSec: Encrypts IP packets and is commonly used in VPNs to secure network-level communication.
SSH (Secure Shell): Provides secure remote access to systems and devices.

These technologies form the backbone of secure internet communication, protecting users from data leaks, identity theft, and other forms of digital attack.

Wrapping Up

Looking back, it's amazing how far we've come – from learning what a bit is, to understanding how huge global networks function securely and efficiently.

Networking is more than routers and wires – it's a finely crafted system of trust, logic, and global cooperation. It's the very reason that we're able to learn, work, connect, and create anywhere.

And having established this foundation, I feel ready to go further.

Thank you for joining me on this journey.

How to Create Database Documentation Using dbdocs with DBML

Truong-Phat Nguyen — Mon, 14 Oct 2024 15:56:30 +0000

Database documentation plays a crucial role in maintaining and scaling systems. Clear and well-organized documentation can significantly improve communication between team members and enhance project longevity.

One of the most efficient ways to document a database is through dbdocs and DBML, an open-source Database Markup Language.

In this guide, I’ll show you how to create database documentation using these tools, step by step.

What is dbdocs?

dbdocs is a platform that generates database documentation from your schema, easily shareable via a link. Using DBML (Database Markup Language), you can create clear, shareable, and updatable documentation of your database structure.

Prerequisites

Before we begin, ensure you have the following:

Basic knowledge of databases and SQL.
A database schema to document (we’ll use a PostgreSQL example in this guide).

Step 1: Install DBML CLI and dbdocs

Start by installing the DBML CLI, which helps convert your database schema into a DBML format. You also need the dbdocs CLI to generate and publish your documentation.

npm install -g dbdocs

Step 2: Export Your Database Schema to DBML

If you’re working with an existing database, you can export the schema into DBML using the DBML CLI tool.

For PostgreSQL, run the following command:

$ dbdocs db2dbml postgres  -o database.dbml

✔ Connecting to database ... done.
✔ Generating DBML... done.
✔ Wrote to database.dbml

This command will export your database schema and save it into a file called database.dbml.

Here’s an example of how a generated DBML file might look:

Table users {
  id int [pk, increment]
  username varchar(50) [not null]
  email varchar(100) [not null, unique]
  created_at timestamp [not null]
}

Table orders {
  id int [pk, increment]
  user_id int [not null, ref: > users.id]
  total decimal [not null]
  created_at timestamp [not null]
}

In this example:

• The users and orders tables are defined.

• Fields are annotated with types and constraints.

• The relationship between orders.user_id and users.id is established using ref.

Step 3: Edit and Add Notes to the DBML File

You may want to clean up the generated file or add extra documentation, like table and field descriptions, to communicate with other members of the team.

Step 4: Generate Documentation with dbdocs

Once your DBML file is ready, the next step is to generate the documentation using dbdocs. First, you need to log in to dbdocs:

dbdocs login

After logging in, publish the DBML file:

dbdocs build database.dbml

This command will generate a shareable documentation link that you can access via the dbdocs platform. You can also set access permissions and collaborate with your team.

This seamless workflow ensures that your documentation always reflects the latest state of your database.

Benefits of Using dbdocs with DBML

Simplicity: The DBML syntax is simple and easy to learn, making it a perfect fit for teams.
Automation: You can automate your database documentation updates as part of your CI/CD pipeline.
Collaboration: Easily share documentation links with your team or stakeholders for easy access and discussion.
Version Control: Use schema changelog to track database schema changes over time.
Visualization: dbdocs provides a clean interface to visualize your database schema, relationships, and annotations. Try this demo to learn more.

Conclusion

In this tutorial, we explored how to export a database schema, customize it, and generate shareable documentation using dbdocs.

By incorporating this workflow into your development process, you’ll improve your team’s collaboration, enhance your project’s scalability, and ensure that everyone stays on the same page. Happy documenting!

How Do Numerical Conversions Work in Computer Systems? Explained With Examples

Zaira Hira — Wed, 29 May 2024 19:56:06 +0000

Computers perform complex calculations when carrying out their assigned tasks. At the very core, the calculations boil down to operations like comparisons, assignments, and addition.

Have you ever wondered how they are performed under the hood and why they are important? At a fundamental level, a computer works by performing various numerical conversions.

In this article, you'll learn the following concepts:

The importance of numerical systems in computers.
Types of numerical systems.
Numerical conversion techniques.
Application of different numerical systems.
Mini exercises to keep you engaged along the way.

Types of Numerical Systems

Numerical conversion is the process of converting numbers from one numeral system to another. In computer systems, the common numeral systems include decimal (base-10), binary (base-2), hexadecimal (base-16), and octal (base-8).

But What Is a Base?

In mathematics and computer science, the term "base" refers to the number of unique digits or symbols used in a positional numeral system. Each digit's value is multiplied by the base raised to the power of its position in the number, starting from the rightmost digit, which represents the units place.

Here's an explanation of the commonly encountered numeral systems:

Base-2 (Binary):
- Base-2, or binary, uses only two symbols: 0 and 1.
- Each digit's value is a power of 2, with positions increasing from right to left.
Base-10 (Decimal):
- Base-10, or decimal, uses ten symbols: 0 to 9.
- Each digit's value is a power of 10, with positions increasing from right to left.
Base-8 (Octal):
- Base-8, or octal, uses eight symbols: 0 to 7.
- Each digit's value is a power of 8, with positions increasing from right to left.
Base-16 (Hexadecimal):
- Base-16, or hexadecimal, uses sixteen symbols: 0 to 9 and A to F (representing 10 to 15).
- Each digit's value is a power of 16, with positions increasing from right to left.
- The table below shows the decimal value of each hexadecimal letter.

Hexadecimal digit	Decimal value
A	10
B	11
C	12
D	13
E	14
F	15

This notation is commonly used to simplify the representation of binary-coded values.
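
To make the idea of a base concrete, here is a minimal Python sketch that applies the positional rule described above: each digit's value is multiplied by the base raised to the power of its position. The function name to_decimal is made up for this example, and it only handles bases 2 through 16.

```python
DIGITS = "0123456789ABCDEF"


def to_decimal(number: str, base: int) -> int:
    """Evaluate a numeral written in the given base (2 to 16) as a decimal integer."""
    total = 0
    # Walk from the rightmost digit: position 0 is the units place.
    for position, symbol in enumerate(reversed(number.upper())):
        value = DIGITS.index(symbol)
        if value >= base:
            raise ValueError(f"digit {symbol!r} is not valid in base {base}")
        total += value * base ** position
    return total


print(to_decimal("10001", 2))  # binary
print(to_decimal("21", 8))     # octal
print(to_decimal("1F", 16))    # hexadecimal
```

Python's built-in int("1F", 16) does the same job; the loop is spelled out here only to show the positional arithmetic.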

Importance of Understanding Numerical Systems in Computers

Learning numeral conversions in computer science is essential for several reasons:

Understanding Data Representation: Computers store and manipulate data using binary (base-2) representation. Knowing how to convert between numeral systems helps in understanding how data is stored and processed at the fundamental level.
Addressing Memory: Memory addresses in computers are frequently represented in hexadecimal format. Knowing how to convert between decimal and hexadecimal is crucial for understanding memory management and for debugging.
Networking and Communication: In networking, MAC addresses and IPv6 addresses are written in hexadecimal format. Understanding hexadecimal conversion thus comes in handy for networking professionals.
Cryptography: In cryptography, hexadecimal numbers are frequently used to represent keys, cipher texts, and other cryptographic data. Understanding numeral conversions helps in understanding cryptographic operations.

Conversion Techniques

In this section, you'll learn techniques to convert one number system to another.

Decimal to Binary

Step-by-Step Conversion Process:

Divide the number by 2: The first step is to divide the number by 2 and record the remainder.
Divide the quotient by 2 repeatedly: Divide the quotient from step 1 by 2 and record the remainder. Continue dividing and recording the remainder until 1 remains as the quotient.
Read the result in reverse order: Start from the last quotient, which is 1, and read the remainders upwards to get the final answer.

Example Conversion:

Let's say you want the binary equivalent of 17, then the process would be like this:

Operation	Result	Remainder
17/2	8	1 ⬆️
8/2	4	0 ⬆️
4/2	2	0 ⬆️
2/2	1 ➡️	0 ⬆️

To find the final answer, follow the arrows. Start from the bottom where the result is 1 and go upwards. You'll get 10001.

So,

$$17_{10} = 10001_{2}$$

Let's try a bigger number, 55:

Operation	Result	Remainder
55/2	27	1 ⬆️
27/2	13	1 ⬆️
13/2	6	1 ⬆️
6/2	3	0 ⬆️
3/2	1 ➡️	1 ⬆️

So,

$$55_{10} = 110111_{2}$$
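
The same repeated-division method can be sketched in Python. The function name decimal_to_binary is hypothetical. One small difference from the tables above: the loop keeps dividing until the quotient reaches 0 rather than stopping at 1, which produces the same digits because the final division (1 divided by 2) leaves a remainder of 1.

```python
def decimal_to_binary(number: int) -> str:
    """Convert a non-negative integer to binary by repeated division by 2."""
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if number == 0:
        return "0"
    remainders = []
    while number > 0:
        number, remainder = divmod(number, 2)  # quotient and remainder in one step
        remainders.append(str(remainder))
    # The first remainder is the least significant bit, so read them in reverse.
    return "".join(reversed(remainders))


print(decimal_to_binary(17))
print(decimal_to_binary(55))
```

You can use it to check your answers to the exercises.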

Now, your turn:

What is 67 in binary?


Asynchronous vs Batch Data Processing in Distributed Systems – Explained with Examples

Anant Chowdhary — Wed, 20 Mar 2024 15:13:11 +0000

Distributed Systems often process and store huge amounts of data. Processing this data efficiently is typically an ongoing endeavor, and how it is designed almost always affects the end-user experience of a product.

Two popular modes of processing data are Batch Processing and Asynchronous Processing. We'll learn more about both in this article, along with when to use each approach.

Batch Processing of Data
What is Batch Processing?
When Do We Use Batch Processing?
Real World Example of Batch Processing of Data
What Does Batch Processing Look Like in Code?
Asynchronous Processing of Data
What is Asynchronous Processing?
When Do We Use Asynchronous Processing?
Real World Example of Asynchronous Processing of Data
What Does Async Processing Look Like in Code?
Summary

Batch Processing of Data

What is Batch Processing?

Batch Processing, as you may have guessed, waits for a certain amount of data to be accumulated, and then processes this batch of data in one go. In other words, this means that in most scenarios we would wait for some number of events to complete and then process the data.

This is different from asynchronous processing of data, where we process an event and its associated data as soon as it occurs. More on that soon.
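
Before looking at real systems, here is a minimal in-memory sketch of the idea in Python. The BatchProcessor class and its method names are invented for illustration: events are buffered, and the handler only runs once a full batch has accumulated (or when flush is called, for example on a schedule).

```python
class BatchProcessor:
    """Collects events and hands them to a handler one full batch at a time."""

    def __init__(self, batch_size, handler):
        self.batch_size = batch_size
        self.handler = handler  # called with a list of events
        self.pending = []

    def add(self, event):
        """Buffer an event; process the whole batch once enough have accumulated."""
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Process whatever is buffered (also useful at a scheduled time)."""
        if self.pending:
            self.handler(self.pending)
            self.pending = []


processed = []
batcher = BatchProcessor(batch_size=3, handler=processed.append)
for event in ["a", "b", "c", "d"]:
    batcher.add(event)
print(processed)        # batches handled so far
print(batcher.pending)  # events still waiting for the next batch
```

With a batch size of 3 and four events, the first three are processed together and the fourth waits for the next batch.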

Now that you know a bit more about batch processing, it'll be useful to see a couple of real world examples.

When Do We Use Batch Processing?

Batch processing is used in lots of scenarios, such as:

Large volume of data: When we have a very large amount of data, it is often more resource-efficient to let the data collect over a period of time and then process it.
Data that isn't time sensitive: Since batch processing waits for data to collect, it is generally not suitable for processing data that's very time sensitive. On the other hand, it is possible to process batches of data within short intervals of time.
Scheduled Processing of data: In lots of instances, we need a large amount of data to be processed at regular intervals. Automated system backup and updates, for example, are generally scheduled for particular intervals. Batch processing can be very useful in such scenarios.

Real World Example of Batch Processing of Data

A popular real world use case for batch processing is credit card transactions.

Many financial institutions choose to settle credit card transactions in batches instead of settling them in real time. Since the settlement of transactions is generally not very time sensitive, this gives systems the time to run various other analyses / jobs on the transactions such as fraud detection, currency conversions etc.

Credit Card Transactions and Batch Processing

The diagram above shows a very high level example of a lifecycle of a credit card transaction. The steps are as follows:

The credit card transaction takes place at the Point of Sale (POS).
A gateway forwards the request to a serverless component that writes the transaction to a staging database where the transaction is stored temporarily.
At the end of the business day, the transactions in the staging database are reconciled and go through fraud detection. This is the component where batch processing takes place (note that we waited for some data to collect, and processed a large amount of data).

What Does Batch Processing Look Like in Code?

We saw an example of a distributed system above. What would batch processing look like in code?

Below you'll see some code that lets you process a batch of SQS messages:

import boto3


def process_batch_messages(sqs, queue_url):
    # Ask the queue for a batch of messages
    partial_response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # This sets the maximum batch size to 10
        WaitTimeSeconds=10,  # We wait for a maximum of 10 seconds
    )
    if 'Messages' in partial_response:
        messages = partial_response['Messages']
        for message in messages:
            # do something with each message

            # remove the message from the queue after processing
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])


if __name__ == '__main__':
    # Initialize sqs client
    sqs = boto3.client(
        'sqs',
        aws_access_key_id='Your access key id',
        aws_secret_access_key='your secret access key',
        region_name='Your AWS region',
    )
    your_queue_url = 'your-queue-url'
    process_batch_messages(sqs, your_queue_url)

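The above code waits for the earlier of two events: either 10 seconds having passed, or a batch size of 10 being reached within the queue. (Strictly speaking, an SQS receive call can return earlier with fewer than 10 messages once any are available, so treat 10 as an upper bound on the batch size.)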

Asynchronous Processing of Data

What is Asynchronous Processing?

The word asynchronous is generally defined as "events that are not coordinated in time". As the definition suggests, asynchronous processing of data does not rely on coordination of data events, and these events are processed as and when they occur.

This means that as soon as an event occurs, the event is processed and the data corresponding to the event may be stored in a sub system, passed on to another component in the system, or may simply lead to another event being fired off.

When Do We Use Asynchronous Processing?

You'll use asynchronous processing of data (sometimes also referred to as async) in various scenarios.

Microservices: Microservices often involve a request that needs an immediate response. Since this processing is done "per event", this would require async processing of data, so in most cases results are returned to clients within a very short period of time (low latency).
User Interfaces: User-facing UI components often need to use async processing of data. For instance, multiple data fetches can be performed in the background using async calls while a user is using an application. This ensures that the application works smoothly and responsively without the UI components needing to "freeze".
Systems that require real time responses: Many interactive systems require real time processing of data. In the past few years, video calls and meetings have become increasingly popular. Since systems like these require immediate requests and responses (and in some cases streams of data being processed), async processing of data is used here.

Real World Example of Asynchronous Processing of Data

Chat apps are a great example of asynchronous processing of data. Here, if User 1 types a message and sends it to User 2, the message must be written to the required databases / systems, delivered to User 2, and possibly read by User 2 without any delay.

Since this is real time processing of the event that occurred here (the event being that a message was sent), this is an example of asynchronous processing of data.

Exchange of messages in a chat app

In the above diagram we see that User 1 sends a message through their phone. The message gets routed to a message server which ultimately creates an entry in a messages database (Messages DB).

Now that MessagesDB has an entry, an event is fired off that is consumed by the Notification Pusher. This then communicates with User 2's notification queue to put a notification related to the message in their notification queue.

Whenever User 2's device comes online or has access to the internet, they receive a message notification.

Note that we did not wait for any data to collect, nor did we process this data after any specific time delay. We processed the event as soon as it happened. So this is an example of asynchronous processing of data.
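
Here is a rough Python sketch of that flow using the standard asyncio module. It is only an analogy: an in-process asyncio.Queue stands in for the event stream coming out of the Messages DB, and the names notification_pusher and main, as well as the None shutdown signal, are made up for this example.

```python
import asyncio


async def notification_pusher(events: asyncio.Queue, delivered: list):
    """Handle each event the moment it arrives; nothing is batched."""
    while True:
        message = await events.get()  # sleeps until an event is available
        if message is None:           # hypothetical shutdown signal
            break
        delivered.append(f"notify user 2: {message}")


async def main():
    events = asyncio.Queue()
    delivered = []
    consumer = asyncio.create_task(notification_pusher(events, delivered))
    for text in ["hi", "are you there?"]:
        await events.put(text)  # an event fires as soon as a message is sent
        await asyncio.sleep(0)  # yield so the consumer can react right away
    await events.put(None)
    await consumer
    return delivered


print(asyncio.run(main()))
```

Each message triggers its own notification as soon as it is put on the queue, with no waiting for other messages to accumulate.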

What Does Async Processing Look Like in Code?

Can we modify the code that we saw in the section for batch processing to work for async processing? Remember that we said "this code waits for the earlier of two events: 10 seconds having passed, or a batch size of 10 being reached within the queue".

If we change the batch size to 1, we would effectively process a message as soon as it is received.

import boto3


def process_async_messages(sqs, queue_url, batch_size):
    # Ask the queue for at most batch_size messages
    partial_response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=batch_size,  # This sets the maximum batch size
        WaitTimeSeconds=10,  # We wait for a maximum of 10 seconds
    )
    if 'Messages' in partial_response:
        messages = partial_response['Messages']
        for message in messages:
            # do something with each message

            # remove the message from the queue after processing
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])


if __name__ == '__main__':
    # Initialize sqs client
    sqs = boto3.client(
        'sqs',
        aws_access_key_id='Your access key id',
        aws_secret_access_key='your secret access key',
        region_name='Your AWS region',
    )
    your_queue_url = 'your-queue-url'
    # A batch size of 1 means each message is handled as soon as it is received
    process_async_messages(sqs, your_queue_url, 1)

Note that in the above code we modified process_batch_messages to accept a batch_size parameter and renamed the function to process_async_messages. This function processes a message as soon as the queue receives one (assuming the queue receives a message within the wait time of 10 seconds).

Summary

Let's summarize batch and asynchronous data processing.

Batch Processing is a paradigm where you wait for an amount of data to collect or some time to pass before the data is processed.

Batch processing is often used in scenarios where you have large volumes of data, data that isn't time sensitive, and data that can be processed on a set schedule. The example we discussed above was that of a credit card transaction.

Asynchronous processing of data, on the other hand, is used to process data related to events as soon as they occur.

This approach is often used when dealing with data processed in microservices, user interfaces, and in general with systems needing real time request-response processing. We looked at an example of a chat app in the above discussion and learnt how asynchronous processing of data is applicable to the scenario.

How to Protect Data in Transit using HMAC and Diffie-Hellman in Node.js [Full Handbook]

Hamdaan Ali — Mon, 18 Mar 2024 23:00:22 +0000

Data integrity refers to the assurance that data will remain accurate, unaltered, and consistent throughout its lifecycle. In communication, data integrity is important in safeguarding against unintended alterations and malicious interventions during data transmission.

The integrity of Digital Data is accomplished using Hashing Algorithms. The crypto module in Node provides various built-in vetted library functions to provide means to not only verify the integrity of data but also the authenticity of its origin.

This handbook aims to highlight the internal workings of the functions in the crypto library and give you some insights into the internal workings of HMAC and Diffie-Hellman Key Exchange. This will help you make informed decisions about hash algorithms and key lengths depending on your business requirements.

The primary focus of this handbook is to emphasize the crucial aspect of data integrity rather than discussing the various encryption algorithms available.

Encryption is used to protect information by converting it into a secure format, which ensures its confidentiality. But data integrity is concerned with ensuring that the data remains accurate and unaltered.


Prerequisites
The Alice-Bob Paradigm
Modification Detection Code (MDC)
Message Authentication Code (MAC)
Hash-based MACs (HMAC)
The Diffie-Hellman-Merkle Protocol
Connecting the Dots
Invoking the APIs
Wrapping Up
References

Prerequisites

Node and Express: We'll create a TypeScript sample application using the Express framework. A basic understanding of the framework would be helpful. You will need the Node Runtime Environment to execute the scripts.
Postman Client: To make an API request and to test out the sample application, you will need a tool to make HTTP Requests. You may use your web browser's "Edit and Send" feature under the Network tab, but since not all browsers allow this, it's best to use a tool like Postman which provides a better UI to observe responses.

The Alice-Bob Paradigm

Throughout this handbook you will come across numerous sequence diagrams and mathematical proofs that use the Alice-Bob Paradigm.

The Alice-Bob paradigm is a common convention in cryptography where two generic entities, often named Alice and Bob, are used to illustrate various scenarios, protocols, or cryptographic principles.

The Alice-Bob Paradigm

These characters represent two parties engaged in communication, with Alice typically representing the sender or initiator, and Bob representing the receiver or responder.

We often introduce Eve as a third party, symbolizing an eavesdropper or potential attacker, adding an element of security risk and illustrating scenarios where external entities might attempt to intercept or manipulate the communication.

The sample application shown in the later sections is modeled on this Alice-Bob Paradigm, with Boost Inc. and Account Aggregator (AA) as the parties engaged in communication.

Modification Detection Code (MDC)

When Alice needs to send critical data to Bob over the internet, the data changes hands, jumping between routers and servers, each step carrying the potential risk of unintended alterations.

If Eve manages to get their hands on Alice's data, they might modify it. So the integrity of the data becomes questionable, emphasizing that its original state may have been compromised during transmission.

Note that we are talking about the integrity and not the confidentiality of the data. Even if Alice encrypts the data, encryption alone doesn't guarantee that the data hasn't been tampered with during transit.

Consider this scenario: even though Eve may be unable to decrypt the encrypted message, they might attempt to modify the ciphertext in transit. This could involve altering bits, rearranging packets, or injecting malicious code, potentially leading to unintended consequences upon decryption.

This is where a Modification Detection Code (MDC), or hash, comes into the picture. An MDC is a message digest or checksum that can prove the integrity of a message: that the message has not been changed [1].

The figure below explains how an MDC is used to verify the integrity of a message:

Modification Detection Code [1]

A Hash Function is used to generate the digest for any given message. This hash function processes the entire content of the message, producing a fixed-size string of characters that, for practical purposes, uniquely represents the message's contents. This is called the message digest or MDC.

Note that any cryptographic hash function, such as SHA-256 or SHA-3, can be used depending on your specific security requirements and preferences. Older functions such as MD5 are no longer collision resistant and are best avoided for security-sensitive use.

Once the digest is generated, it serves as a fingerprint for the original message. When Alice sends both the message and its corresponding digest to Bob, he can independently apply the same hash function to the received message. If the calculated digest matches the one received from Alice, it is strong evidence that the message has not been modified during transmission, provided the digest itself arrived intact.
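As a rough illustration of the digest check described above, here is a minimal sketch using createHash from Node's crypto module with SHA-256. The helper name computeDigest and the sample messages are made up for this example.

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: compute a SHA-256 digest (the MDC) of a message, hex encoded.
function computeDigest(message: string): string {
  return createHash('sha256').update(message, 'utf8').digest('hex');
}

// Alice's side: the digest that travels alongside the message.
const originalMessage = 'Transfer 100 to Bob';
const sentDigest = computeDigest(originalMessage);

// Bob's side: recompute the digest over whatever message arrived and compare.
const untouched = computeDigest('Transfer 100 to Bob') === sentDigest;
const tampered = computeDigest('Transfer 900 to Bob') === sentDigest;

console.log({ untouched, tampered });
```

A single changed character produces a completely different digest, which is why the comparison fails for the altered message.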

Message Authentication Code (MAC)

While MDC or the checksum is typically transferred over a safe channel, it may so happen that the safety of the channel or the trusted party itself is compromised. In such a case Eve can easily modify both the message and the digest and Bob will never know if the message actually came from Alice as intended.

What MDC lacks is a definitive guarantee of the message origin, leaving a potential vulnerability in confirming the true sender.

This is where the Message Authentication Code (MAC) comes in. MACs not only ensure the integrity of the message, detecting any unauthorized alterations, but they also provide a mechanism for authenticating the origin of the data. In other words, MACs offer assurance that the message indeed originates from Alice and not from someone else.

The figure below explains how MAC can help authenticate the origin of a message besides providing integrity check:

Message Authentication Code [1]

Notice that the difference between an MDC and a MAC is that a MAC also involves a secret key (K) shared between Alice and Bob. The hash function takes in the key (K) along with the message (M) to generate a MAC.

$$ h(K || M) = MAC $$

Now both the message and MAC can be sent over the same insecure channel. When Bob receives this ( M + MAC ), he can separate out the message M and compute the MAC for it using the same hash function and the secret key (K).

Bob will then compare the newly computed MAC with the one he received. If the two MACs match, the message is authentic and has not been modified by an adversary.

$$ \text{Alice: } S(K, M) = MAC \\ \text{Bob: } V(K, M, MAC) = \text{Accept / Reject} $$

Since Eve does not have this secret key (K), they cannot modify the message and generate a valid MAC. Consequently, the resulting MAC becomes a unique fingerprint, signifying not only the integrity of the message but also authenticating its origin.
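The signing and verification steps can be sketched as a pair of functions. This is only an illustration of the simple keyed-hash idea, h(K || M), using SHA-256. The names sign and verify, the demo key, and the messages are assumptions for this example, and this naive construction is not something to deploy.

```typescript
import { createHash, timingSafeEqual } from 'crypto';

// S(K, M): naive MAC computed as h(K || M). For illustration only.
function sign(key: Buffer, message: string): Buffer {
  // Successive update() calls hash the concatenation of their inputs.
  return createHash('sha256').update(key).update(message, 'utf8').digest();
}

// V(K, M, MAC): recompute the MAC and compare it with the received one.
function verify(key: Buffer, message: string, mac: Buffer): boolean {
  const expected = sign(key, message);
  return expected.length === mac.length && timingSafeEqual(expected, mac);
}

const sharedKey = Buffer.from('a-demo-key-shared-by-alice-and-bob', 'utf8');
const mac = sign(sharedKey, 'Transfer 100 to Bob');

const accepted = verify(sharedKey, 'Transfer 100 to Bob', mac);
const rejectedMessage = verify(sharedKey, 'Transfer 900 to Bob', mac);
const rejectedKey = verify(Buffer.from('eve-guess', 'utf8'), 'Transfer 100 to Bob', mac);

console.log({ accepted, rejectedMessage, rejectedKey });
```

Verification fails both when the message changes and when the verifier holds a different key, which is the property that separates a MAC from a plain digest.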

Hash-based MACs (HMAC)

While a MAC does provide authentication of the origin of a message, the simple construction h(K || M) still falls short in ensuring unforgeability. Eve can perform a Man in The Middle (MiTM) attack, intercept the MAC + Message pair and then perform a Length Extension Attack.

Given ( S = h( K || M) ) and the message (M), Eve can extend (M) to (M' = M || Pad || w) and create (MAC(M')), where (MAC(M')) is evaluated as ( S' = h( K || M || Pad || w) ).

Eve does not require knowledge of the secret key (K) to extend the message (M) to (M'). When Bob receives this modified (M') and (MAC(M')), he is unable to detect the modification.

HMAC or a Hash-based MAC is a specific method for constructing a MAC algorithm out of a collision resistant hash function. HMAC uses two passes of hash computation and provides better immunity against length extension attacks. The figure below explains the construction of HMACs.

Hash-based Message Authentication Code [1]

There are several steps involved in the implementation of HMACs [1]:

1. Divide the message into N blocks, each of b bits.
2. Take the secret key, pad it with 0s to create a b-bit key (note that RFC 2104 appends the zeros to the end of the key), and XOR it with a constant called ipad (input pad).
3. XOR the same padded key with another constant called opad.
4. Prepend the result of Step 2 to the message blocks. Apply the hash function to these (N+1) blocks to create an n-bit message digest called the intermediate HMAC.
5. Prepend 0s to the intermediate HMAC to make a b-bit block, then prepend the result of Step 3 to this block.
6. Apply the hash function again to the result of Step 5 to get the final n-bit HMAC.

The values of ipad and opad are fixed constants defined in the HMAC standard [3]. ipad is b/8 repetitions of the byte 00110110 (hex: 36) and opad is b/8 repetitions of the byte 01011100 (hex: 5C). These values were chosen to differ substantially from each other: the Hamming distance between an ipad byte and an opad byte is 4, meaning exactly half of the bits are flipped.

Mathematically, this can be represented as:

$$ S(k, m) = H(k \oplus \text{opad} || H(k \oplus \text{ipad} || m)) $$

Now if Eve tries to extend (M) to (M' = M || Pad || w), the resulting HMAC would have to be:

$$ HMAC(K, M') = H((K \oplus \text{opad}) || H((K \oplus \text{ipad}) || M || Pad || w)) $$

Because the inner digest is hashed again together with (K \oplus opad), the attacker cannot construct the outer hash (H((K \oplus opad) || <...>)) without knowing the key (K). Extending the inner hash changes the input to the outer hash, which Eve cannot recompute, so the attempt is thwarted.
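To make the two-pass construction concrete, here is a study-only sketch that builds HMAC-SHA256 by hand and compares it with Node's createHmac. It follows RFC 2104 as I understand it: the key is zero-padded at the end to the 64-byte SHA-256 block size (keys longer than a block are hashed first), and the inner digest is fed to the outer hash directly. The function name manualHmacSha256 and the sample inputs are assumptions for this example.

```typescript
import { createHash, createHmac } from 'crypto';

const BLOCK_SIZE = 64; // SHA-256 block size b, in bytes (512 bits)

// Hash the concatenation of the given buffers with SHA-256.
function sha256(...parts: Buffer[]): Buffer {
  const hash = createHash('sha256');
  for (const part of parts) hash.update(part);
  return hash.digest();
}

// Hand-rolled HMAC-SHA256 following RFC 2104. For study only; use createHmac in real code.
function manualHmacSha256(key: Buffer, message: Buffer): Buffer {
  // Keys longer than one block are hashed first, then zeros are appended up to b bytes.
  const baseKey = key.length > BLOCK_SIZE ? sha256(key) : key;
  const paddedKey = Buffer.alloc(BLOCK_SIZE); // zero-filled
  baseKey.copy(paddedKey);

  // XOR the padded key with the repeated ipad (0x36) and opad (0x5c) bytes.
  const innerKey = Buffer.from(paddedKey.map((byte) => byte ^ 0x36));
  const outerKey = Buffer.from(paddedKey.map((byte) => byte ^ 0x5c));

  // First pass: H((K xor ipad) || m). Second pass: H((K xor opad) || first pass).
  const intermediate = sha256(innerKey, message);
  return sha256(outerKey, intermediate);
}

const demoKey = Buffer.from('shared-secret-key', 'utf8');
const demoMessage = Buffer.from('Transfer 100 to Bob', 'utf8');

const manual = manualHmacSha256(demoKey, demoMessage).toString('hex');
const builtIn = createHmac('sha256', demoKey).update(demoMessage).digest('hex');

console.log(manual === builtIn);
```

The comparison against createHmac is the useful part: it lets you check your reading of the steps against the vetted implementation instead of trusting the hand-written one.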

The Diffie-Hellman-Merkle Protocol

One of the main challenges in Symmetric-key Ciphers is the distribution of keys. A fundamental question naturally arises: How will Bob know what keys Alice has used?

A very intuitive answer to this problem could be to use a Key Exchange or a Key Distribution Center (KDC). However, the utilization of a KDC or a Key Exchange introduces a notable caveat: the requirement of a secure channel for transmitting keys.

The security of a system employing a Key Distribution Center (KDC), such as in the case of the Kerberos authentication protocol, is heavily dependent on the security of the KDC itself. If the KDC is compromised, the cryptographic keys it manages and distributes can be exposed, leading to potential security vulnerabilities throughout the system as seen in a Golden Ticket Attack.

In 1976, Whitfield Diffie and Martin Hellman, building on ideas from Ralph Merkle, published a way to securely exchange cryptographic keys over public, insecure channels.

The Diffie-Hellman-Merkle Protocol provides a way for two parties to agree upon a shared secret key over an insecure channel without directly exchanging that key. The crypto module in Node.js contains the DiffieHellman class, which is a utility for creating Diffie-Hellman key exchanges.

Before we go through the functions defined in this class, it is important to understand the mathematics behind the Diffie-Hellman-Merkle Protocol. The UML Sequence Diagram below explains the steps involved in the protocol:

The Diffie-Hellman-Merkle Protocol

The process begins with whichever party wants to establish secure communication with the other. In this case, Alice wants to start the communication.

Alice will first pick a randomly chosen Generator g and a large prime number p. Increasing the length of the prime number results in heightened security, as it amplifies the difficulty for adversaries to execute certain cryptographic attacks.

However, enlarging the prime number also comes with computational costs. Longer prime numbers require more computational resources to perform the key generation.

Now, Alice needs to select a private key a and compute a modular exponentiation:

$$ A = g^a \bmod p $$

Alice will send over the Generator g, the large prime p and Alice's Public Key A to Bob. At this point, Bob has all the values he needs to select his own private key b and evaluate his own modular exponentiation:

$$ B = g^b \bmod p $$

He will send back this Public Key B to Alice.

Note that up until this point, all communication occurs over an insecure channel. The values g, p, A and B might as well be sent as plaintext. The actual secret key is evaluated when Alice and Bob use this data to compute what is known as a "Shared Secret".

Shared Secret computed by Alice:

$$ S = B^a \bmod p \\ S = g^{ab} \bmod p $$

Shared Secret computed by Bob:

$$ S = A^b \bmod p \\ S = g^{ab} \bmod p $$

Notice how the Shared Secret computed by both parties is the same.

This symmetrical outcome is the essence of the Diffie-Hellman key exchange, where each party independently computes the shared secret using their private key and the public key received from the other party. This ensures that both Alice and Bob arrive at an identical Shared Secret, establishing a secure foundation for further encrypted communication.
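The exchange can be traced by hand with toy numbers. The sketch below uses p = 23 and g = 5 with private keys 4 and 3, values chosen only so the arithmetic is easy to follow. Real parameters are thousands of bits long and would need big-integer arithmetic, and the modPow helper is written just for this example.

```typescript
// Square-and-multiply modular exponentiation, suitable only for small toy numbers.
function modPow(base: number, exponent: number, modulus: number): number {
  let result = 1;
  let b = base % modulus;
  let e = exponent;
  while (e > 0) {
    if (e % 2 === 1) result = (result * b) % modulus;
    b = (b * b) % modulus;
    e = Math.floor(e / 2);
  }
  return result;
}

const prime = 23;     // p, public
const generator = 5;  // g, public

const alicePrivate = 4; // a, never sent
const bobPrivate = 3;   // b, never sent

const alicePublic = modPow(generator, alicePrivate, prime); // A = g^a mod p = 4
const bobPublic = modPow(generator, bobPrivate, prime);     // B = g^b mod p = 10

// Each side combines its own private key with the other side's public key.
const aliceShared = modPow(bobPublic, alicePrivate, prime); // B^a mod p = 18
const bobShared = modPow(alicePublic, bobPrivate, prime);   // A^b mod p = 18

console.log({ alicePublic, bobPublic, aliceShared, bobShared });
```

Only the values 23, 5, 4 and 10 ever cross the wire here. The shared value 18 is computed independently on each side.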

Why is the Shared Secret Secure?

Diffie-Hellman key exchange relies on the mathematical principles of discrete logarithm, primitive roots and Modular exponentiation.

Modular exponentiation is the problem of computing (a^b mod n), where (a), (b), and (n) are known integers. Discrete logarithm is the problem of finding (x) such that (a^x mod n = b), where (a), (b), and (n) are known integers and (a) is a primitive root modulo (n).

The security of Diffie-Hellman is rooted in the computational complexity of calculating discrete logarithms.

For example, given g, p and a, it's easy to compute A because modular exponentiation is in P, meaning that there is a polynomial-time algorithm to solve it.

But the reverse does not hold. Given g, p, and A, computing a requires solving the discrete logarithm problem, which is widely believed to be computationally infeasible for suitably large primes [2].

Remember that both parties will compute the Shared Secret at their end and there is no need to send over this secret to the other party. This eliminates the risk of the Shared Secret getting intercepted by Eve and the only option they are left with is to solve the discrete logarithm problem.
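To get a feel for what Eve is up against, the sketch below recovers a private exponent by exhaustive search. It only works because the toy prime 23 is tiny. The function name bruteForceDiscreteLog and the numbers are assumptions for this example, and real attackers use smarter algorithms than this loop, although none are known to be practical against well-chosen 2048-bit parameters.

```typescript
// Try every exponent x until generator^x mod prime equals the observed public value.
// Returns null if no exponent below prime - 1 matches.
function bruteForceDiscreteLog(generator: number, publicValue: number, prime: number): number | null {
  let value = 1; // generator^0
  for (let x = 0; x < prime - 1; x++) {
    if (value === publicValue) return x;
    value = (value * generator) % prime;
  }
  return null;
}

// Eve sees g = 5, p = 23 and Alice's public key A = 4 on the wire.
const recoveredExponent = bruteForceDiscreteLog(5, 4, 23);

console.log(recoveredExponent); // 4, found after a handful of multiplications
```

The loop runs at most p - 2 times, so the work grows with the size of p itself, while the legitimate parties only pay for a number of multiplications proportional to the bit length of the exponent.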

Connecting the Dots

The key (K) that we provide in an HMAC has to be the same for both Alice and Bob. Now that we know how a Diffie-Hellman-Merkle key exchange works, it becomes intuitive that we can plug in the shared secret as the key for an HMAC.

Alice can use the shared key (S) in the HMAC function as a parameter and Bob can use the same shared secret (S), computed at their end, in the verification algorithm.

The crypto module in Node.js provides various built-in functions to implement cryptographic constructs such as HMACs and Diffie-Hellman Key Exchange. It is always recommended to use vetted cryptographic libraries rather than implementing cryptographic algorithms yourself, because hand-rolled implementations are prone to side-channel leaks and to memory-safety bugs of the kind behind Heartbleed.
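Before building the full application, the idea of feeding the shared secret into an HMAC can be sketched in a few lines. This example is an assumption-laden shortcut: it uses getDiffieHellman('modp14'), a predefined 2048-bit group from RFC 3526, so that no prime has to be generated, whereas the sample application below generates its own prime with createDiffieHellman. The variable names and payload are illustrative.

```typescript
import { createHmac, getDiffieHellman } from 'crypto';

// Both parties agree on the same predefined group (prime and generator).
const alice = getDiffieHellman('modp14');
const bob = getDiffieHellman('modp14');

// Each side generates its key pair; generateKeys() returns the public key.
const alicePublicKey = alice.generateKeys();
const bobPublicKey = bob.generateKeys();

// Each side derives the shared secret from its own private key and the peer's public key.
const aliceSecret = alice.computeSecret(bobPublicKey);
const bobSecret = bob.computeSecret(alicePublicKey);

// The shared secret is then used as the HMAC key on both ends.
const payload = JSON.stringify({ name: 'Boost User 1', phone: '1234567890' });
const hmacFromAlice = createHmac('sha256', aliceSecret).update(payload).digest('hex');
const hmacFromBob = createHmac('sha256', bobSecret).update(payload).digest('hex');

console.log(hmacFromAlice === hmacFromBob);
```

Production protocols usually pass the raw shared secret through a key derivation function before using it as a key. That step is left out here to keep the sketch aligned with the flow described in this handbook.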

Let's create a TypeScript/Node.js application to understand the implementation and prototypes of these functions. The two entities involved in communication in this application are Boost Inc. and Account Aggregator. Boost needs to send critical data over to the Account Aggregator.

We will first utilize the DiffieHellman class to create Secret Keys for both entities. Boost will then use the Secret Key to create an HMAC using the Hmac class in Node. Account Aggregator will receive this HMAC along with the message and verify it against an HMAC newly generated from the message it received.

Note that the code at Account Aggregator's end will be simulated and we will create API endpoints for each operation to show separation of concerns in this sample application.

The following sequence diagram explains what the application does:

UML Sequence Diagram for the sample application

Project Setup

In the root of your workspace, install Express, Axios, type definitions of Node, and type definitions of Express using the following command:

npm init -y && npm install axios express
npm install -D nodemon ts-node @types/express @types/node typescript

Configure tsconfig as per your liking and create a file called crypto.utils.ts under src/utils. Let's create an interface and import all necessary modules from the crypto library:

import { createHmac, createDiffieHellman, DiffieHellman, KeyObject, BinaryLike } from 'crypto';

export interface KeyPair {
  publicKey: Buffer;
  privateKey: Buffer;
  generator: Buffer;
  prime: Buffer;
  diffieHellman: DiffieHellman;
}

This interface will function as a blueprint for managing cryptographic key pairs throughout this application. It encapsulates the public and private keys, generator, prime, and a Diffie-Hellman object.

By using this interface we will ensure a structured and standardized approach to handle cryptographic key pair information, thus promoting clarity and consistency in cryptographic operations within a Node.js environment.

The createDiffieHellman Function

Next, we will define the function generateKeyPair, which will allow us to generate a party's private and public keys (for example, (a) and (A)) along with the large prime (p) and the generator (g) using the createDiffieHellman and generateKeys functions.

export function generateKeyPair(prime?: Buffer, generator?: Buffer): KeyPair {
  // Reuse the other party's prime and generator when given; otherwise generate a fresh 2048-bit prime
  const diffieHellman = prime && generator ? createDiffieHellman(prime, generator) : createDiffieHellman(2048);
  diffieHellman.generateKeys();

  return {
    publicKey: diffieHellman.getPublicKey(),
    privateKey: diffieHellman.getPrivateKey(),
    generator: diffieHellman.getGenerator(),
    prime: diffieHellman.getPrime(),
    diffieHellman,
  };
}

Notice that the parameters to this function – prime and generator – are optional. This is because the underlying createDiffieHellman has five defined overloads:

function createDiffieHellman(primeLength: number, generator?: number): DiffieHellman;

function createDiffieHellman(
    prime: ArrayBuffer | NodeJS.ArrayBufferView,
    generator?: number | ArrayBuffer | NodeJS.ArrayBufferView,
): DiffieHellman;

function createDiffieHellman(
    prime: ArrayBuffer | NodeJS.ArrayBufferView,
    generator: string,
    generatorEncoding: BinaryToTextEncoding,
): DiffieHellman;

function createDiffieHellman(
    prime: string,
    primeEncoding: BinaryToTextEncoding,
    generator?: number | ArrayBuffer | NodeJS.ArrayBufferView,
): DiffieHellman;

function createDiffieHellman(
    prime: string,
    primeEncoding: BinaryToTextEncoding,
    generator: string,
    generatorEncoding: BinaryToTextEncoding,
): DiffieHellman;

The first overload creates a Diffie-Hellman object with a randomly generated prime of the specified length. The call createDiffieHellman(2048) creates a Diffie-Hellman object where the length of the randomly generated prime is 2048 bits.

When no generator value is provided to this function, it takes the default value of 2. The prime needs to be large: a small length does not produce a secure key, and depending on your Node and OpenSSL versions it may be rejected with an error.

Instead of passing in the length of the prime, we can pass the prime as a buffer. This is what Account Aggregator will do at their end when Boost sends over the necessary details.

Similarly you can use the other function declarations as per your use case to pass the prime and generator as ArrayBuffer or ArrayBufferView types or as string with a specified encoding.
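As a small sketch of two of these overloads, the example below creates one party from a prime and generator passed as Buffers and another from the same values passed as hex strings. To avoid waiting for prime generation it borrows the prime and generator of the predefined modp14 group via getDiffieHellman, which is an assumption made for this demo and not what the sample application does.

```typescript
import { createDiffieHellman, getDiffieHellman } from 'crypto';

// Borrow a well-known 2048-bit prime and generator so the demo skips prime generation.
const knownGroup = getDiffieHellman('modp14');
const prime = knownGroup.getPrime();
const generator = knownGroup.getGenerator();

// Overload taking the prime and generator as Buffers.
const initiator = createDiffieHellman(prime, generator);
initiator.generateKeys();

// Overload taking the prime and generator as encoded strings.
const responder = createDiffieHellman(prime.toString('hex'), 'hex', generator.toString('hex'), 'hex');
responder.generateKeys();

const initiatorSecret = initiator.computeSecret(responder.getPublicKey());
const responderSecret = responder.computeSecret(initiator.getPublicKey());

console.log(initiatorSecret.equals(responderSecret));
```

Whichever overload is used, both parties must end up with the same prime and generator, otherwise the two computed secrets will not match.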

The computeSecret Function

Now, let's define a method generateSharedSecret that takes in a Key pair and a public key as parameter and computes the shared secret (S):

export function generateSharedSecret(keyPair: KeyPair, publicKey: Buffer): Buffer {
  return keyPair.diffieHellman.computeSecret(publicKey);
}

The computeSecret function also has four overloads, which allow you to provide the public key parameter either as a string or as an ArrayBufferView, along with options to specify the inputEncoding and outputEncoding.
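For instance, when public keys travel as hex strings in JSON, the string-based form of computeSecret saves a manual conversion. The sketch below assumes the predefined modp14 group (via getDiffieHellman) purely to keep the demo fast, and the variable names are illustrative.

```typescript
import { getDiffieHellman } from 'crypto';

const alice = getDiffieHellman('modp14');
const bob = getDiffieHellman('modp14');

// generateKeys(encoding) returns the public key as an encoded string.
const alicePublicHex = alice.generateKeys('hex');
const bobPublicHex = bob.generateKeys('hex');

// String in, string out: the peer's key is hex and the secret is returned as hex.
const aliceSecretHex = alice.computeSecret(bobPublicHex, 'hex', 'hex');

// Buffer in, Buffer out: the default form used by generateSharedSecret.
const bobSecretBuffer = bob.computeSecret(Buffer.from(alicePublicHex, 'hex'));

console.log(aliceSecretHex === bobSecretBuffer.toString('hex'));
```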

The createHmac Function

Now that we've computed our shared secret, let's create a function generateHMAC that consumes this shared secret and generates a digest against it.

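export function generateHMAC(data: any, secretKey: KeyObject | BinaryLike): string {
  // Serialize the payload so that objects and primitives are hashed as a string
  const serialized = JSON.stringify(data);
  const hmac = createHmac('sha256', secretKey);
  hmac.update(serialized);
  // Return the digest as a hex string
  return hmac.digest('hex');
}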

The first parameter of the createHmac function takes an algorithm. This is where you specify which underlying hash function you want to use.

Remember that the security of HMAC relies on various factors, including the cryptographic strength of the underlying hash function, the size of its hash output, and the quality and size of the key.

The algorithms available for this parameter depend on the OpenSSL version on the platform. To check which algorithms are available to you, execute the following command in the terminal:

openssl list -digest-algorithms

This will give you a list from which you can select your desired algorithm for the underlying hash function:

RSA-MD4 => MD4
RSA-MD5 => MD5
RSA-MDC2 => MDC2
RSA-RIPEMD160 => RIPEMD160
RSA-SHA1 => SHA1
RSA-SHA1-2 => RSA-SHA1
RSA-SHA224 => SHA224
RSA-SHA256 => SHA256
...
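The same information is available from inside Node through getHashes from the crypto module, which can be handy for checking support at startup. A minimal sketch follows; the variable names are illustrative.

```typescript
import { getHashes } from 'crypto';

// List the digest algorithm names supported by this Node/OpenSSL build.
const availableHashes: string[] = getHashes();

// Check for the algorithm used in generateHMAC before relying on it.
const hasSha256 = availableHashes.indexOf('sha256') !== -1;

console.log(`${availableHashes.length} digest algorithms available; sha256 supported: ${hasSha256}`);
```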

The secret key that the createHmac function takes could either be of type KeyObject or of type BinaryLike. Note that the type BinaryLike is a union type in TypeScript. It is a type that can be either a string or a NodeJS.ArrayBufferView.

The update method of the Hmac object accepts string, Buffer, TypedArray, and DataView values as its data parameter. To simplify the developer experience and minimize complexity, we intentionally set the data parameter type in the generateHMAC function as any. Internally, we handle the conversion to a string using JSON.stringify.

Initializing communication

Now on Boost's end create a file verification.controller.ts under src/controllers:

import { generateKeyPair, generateSharedSecret, generateHMAC, KeyPair } from '@boost/v1/utils/crypto.utils';
import { KeyObject, BinaryLike } from 'crypto';

const boostKeyPair: KeyPair = generateKeyPair();

export function shareKeys() {
    const boostPublicKey: Buffer = boostKeyPair.publicKey;
    const boostPrivateKey: Buffer = boostKeyPair.privateKey;
    const boostGenerator: Buffer = boostKeyPair.generator;
    const boostPrime: Buffer = boostKeyPair.prime;
    const boostDiffieHellman = boostKeyPair.diffieHellman;

    return {
        boostPublicKey,
        boostPrivateKey,
        boostGenerator,
        boostPrime,
        boostDiffieHellman,
    };
}

export function hmacDigest(data: any, secretKey: KeyObject | BinaryLike): any {
    return generateHMAC(JSON.stringify(data), secretKey);
}

This file imports the interface and the helper functions from crypto.utils.ts and defines two wrapper functions – shareKeys and hmacDigest. shareKeys only serves as a wrapper around generateKeyPair, which allows developers at Boost to send only the required keys over to the Account Aggregator.

Setting up the Account Aggregator

At the Account Aggregator's end, we need to set up a function that computes AA's public key and sends it over to Boost Inc. We will also need a function to verify the received HMAC of a data by comparing it against one that AA generates:

import { generateKeyPair, generateSharedSecret, generateHMAC, KeyPair } from '../utils/crypto.utils';  
import axios from 'axios';

let sharedSecret: Buffer;

export async function sendAAPublicKey(): Promise<Buffer> {
  try {
    const response = await axios.get('http://localhost:3000/init');

    const boostPublicKey: Buffer = Buffer.from(response.data.boostPublicKey, 'hex');
    const boostGenerator: Buffer = Buffer.from(response.data.boostGenerator, 'hex');
    const boostPrime: Buffer = Buffer.from(response.data.boostPrime, 'hex');

    const AA: KeyPair = generateKeyPair(boostPrime, boostGenerator);
    sharedSecret = generateSharedSecret(AA, boostPublicKey);

    return AA.publicKey;
  } catch (error) {
    console.error('Error sending AA public key:', (error as Error).message);
    throw error;
  }
}

export async function verifyData(data: any, hmac: string): Promise<string> {
  try {
    const calculatedHMAC = generateHMAC(JSON.stringify(data), sharedSecret);
    return calculatedHMAC === hmac ? "Integrity and authenticity verified" : "Integrity or authenticity compromised";
  } catch (error) {
    console.error('Error verifying data:', (error as Error).message);
    throw error;
  }
}

We make an Axios request to the /init endpoint defined at Boost and fetch (p), (g) and (A). Once we've computed the public key, we'll send that back to Boost. We will also compute our shared secret here which we'll use while verifying the HMAC in the verifyData method.
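One possible hardening of the comparison step, offered as a sketch and not as part of the sample application: comparing HMACs with === can leak timing information, and Node's crypto module provides timingSafeEqual for constant-time comparison. The helper name hmacMatches and its parameters are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

// Hypothetical helper: checks a received hex HMAC against a freshly computed one in constant time.
function hmacMatches(payload: string, receivedHmacHex: string, secretKey: Buffer): boolean {
  const expected = createHmac('sha256', secretKey).update(payload).digest();
  const received = Buffer.from(receivedHmacHex, 'hex');
  // timingSafeEqual throws when the lengths differ, so check the length first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}

const demoSecret = Buffer.from('demo-shared-secret', 'utf8');
const demoPayload = JSON.stringify({ name: 'Boost User 1', phone: '1234567890' });
const validHmac = createHmac('sha256', demoSecret).update(demoPayload).digest('hex');

const validResult = hmacMatches(demoPayload, validHmac, demoSecret);
const tamperedResult = hmacMatches(demoPayload.replace('1234567890', '0000000000'), validHmac, demoSecret);
const malformedResult = hmacMatches(demoPayload, 'not-hex', demoSecret);

console.log({ validResult, tamperedResult, malformedResult });
```

The length check also means a malformed or truncated HMAC is rejected cleanly instead of raising an exception.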

Setting up the Express APIs

Now that all the controllers and utility functions are in place, we'll create a few endpoints to facilitate communication between Boost Inc. and the Account Aggregator.

Boost:

import express, { Request, Response } from 'express';
import { hmacDigest, shareKeys } from '@boost/v1/controllers/verification.controller';
import { KeyPair, generateSharedSecret } from '@boost/v1/utils/crypto.utils';
import { DiffieHellman } from 'crypto';
import axios from 'axios';

const appBoost = express();
const PORT_BOOST = 3000;

let boostPublicKey: Buffer, boostPrivateKey: Buffer;
let boostGenerator: Buffer, boostPrime: Buffer;
let sharedSecret: Buffer;
let boostKeyPair: KeyPair, boostDiffieHellman: DiffieHellman;

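appBoost.get('/init', async (req: Request, res: Response) => {
    ({ boostPublicKey, boostPrivateKey, boostGenerator, boostPrime, boostDiffieHellman } = shareKeys());
    // Send the public values as hex strings, which is what AA decodes with Buffer.from(..., 'hex')
    res.send({
        boostPublicKey: boostPublicKey.toString('hex'),
        boostGenerator: boostGenerator.toString('hex'),
        boostPrime: boostPrime.toString('hex'),
    });
});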

// Simulated Data
const data = {
    name: 'Boost User 1',
    phone: '1234567890',
};

appBoost.get('/fetchData', async (req: Request, res: Response) => {
    const hmac = hmacDigest(data, sharedSecret);
    res.send({ data, hmac });
});

appBoost.listen(PORT_BOOST, () => {
  console.log(`Boost server is running on http://localhost:${PORT_BOOST}`);
});

The /init endpoint, hosted by Boost, is invoked by AA within its sendAAPublicKey function. When the shared secret is calculated, AA will invoke the endpoint /fetchData to retrieve the critical information.

Account Aggregator (AA):

// Assumption: AA's side is simulated in the same module as the Boost app above, so the imports and the
// boost* / sharedSecret variables declared there are reused here instead of being declared again.
import { sendAAPublicKey, verifyData } from '@AA/v1/controllers/aa.controller';

const appAA = express();
const PORT_AA = 3001;

let AAPublicKey: Buffer;

appAA.get('/fetchAAPublicKey', async (req: Request, res: Response) => {
    AAPublicKey = await sendAAPublicKey();
    res.send({ AAPublicKey: AAPublicKey.toString('hex') });

    boostKeyPair = {
        publicKey: boostPublicKey,
        privateKey: boostPrivateKey,
        generator: boostGenerator,
        prime: boostPrime,
        diffieHellman: boostDiffieHellman,
    }

    sharedSecret = generateSharedSecret(boostKeyPair, AAPublicKey);
});

appAA.get('/verifyData', async (req: Request, res: Response) => {
    const response = await axios.get('http://localhost:3000/fetchData');
    const { data, hmac } = response.data;
    const verified = await verifyData(data, hmac);
    res.send({ verified });
});

appAA.listen(PORT_AA, () => {
  console.log(`AA server is running on http://localhost:${PORT_AA}`);
});

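The /fetchAAPublicKey endpoint, hosted at AA's end, will be invoked by Boost when it wants to evaluate the Shared Secret. Because AA's side is simulated, the same handler also rebuilds Boost's key pair from the values stored by /init and computes Boost's copy of the Shared Secret. The verifyData method is wrapped in a GET endpoint, enabling either party to confirm the integrity of the transmitted data.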

Invoking the APIs

Head over to your Postman Client to test out these APIs. Since the sendAAPublicKey method takes care of the initiation, we need to start our communication using the /fetchAAPublicKey endpoint:

Postman Client: fetchAAPublicKey Endpoint

You will observe the AA's public key as a response. Now, Boost Inc. will use this Public Key and evaluate the Shared Secret.

Once that is done, it will use the Shared Secret to compute the message digest in the /fetchData endpoint. Since /verifyData invokes the former endpoint, we'll check this in action on our Postman Client:

Postman Client: verifyData Endpoint

You will notice that the /verifyData response declares the successful verification of both integrity and authenticity. This acknowledgment ensures that the transmitted data remains untampered and originates from the authenticated source, providing a layer of security for communication between the two entities.

Wrapping Up

And there you have it: by utilizing HMACs and the Diffie-Hellman-Merkle Key Exchange, you can verify the integrity and authenticity of your transmitted data, enhancing the security of your applications and ensuring a reliable API communication framework for developers.

By understanding the intricacies and mathematical underpinnings of these practices, you can now make informed decisions, fortifying your system against tampering threats.

Find the complete code snippets here — GitHub Gist | HamdaanAliQuatil.
You may find me on X (formerly Twitter) – Hamdaan Ali Quatil.

References

[1] Behrouz A. Forouzan – Introduction to Cryptography and Network Security

[2] Whitfield Diffie and Martin E. Hellman – New Directions in Cryptography (jhu.edu)

[3] Keying Hash Functions for Message Authentication, Mihir Bellare, Ran Canetti, Hugo Krawczyk https://cseweb.ucsd.edu/~mihir/papers/kmd5.pdf

How to Use OpenTelemetry to Trace Node.js Applications

Abraham Dahunsi — Sat, 03 Feb 2024 00:21:14 +0000

Observability refers to our ability to "see" and understand what's happening inside a system by looking at its external signals (like logs, metrics, and traces).

Observability involves collecting and analyzing data from sources within a system to monitor its performance and address problems effectively.

Why is Observability Useful?

Detecting and Troubleshooting Problems: Observability plays a role in identifying and diagnosing issues within a system. When something goes wrong, having access to data helps pinpoint the cause and resolve problems more quickly.
Optimizing Performance: Through monitoring metrics and performance indicators, observability helps in optimizing the performance of your system. This includes identifying bottlenecks, improving resource utilization, and ensuring smooth operation.
Planning for Future Capacity: Understanding how your system behaves over time is vital for planning capacity requirements. Observability data can reveal trends, peak usage periods, and resource needs, informing your decisions about scaling.
Enhancing User Experience: By observing user interactions with your system through logs and metrics, you can improve the user experience. It assists in recognizing patterns, preferences, and potential areas that can be enhanced for user satisfaction.

Why Should I Use OpenTelemetry?

Observability is essential for ensuring the reliability and availability of your Node.js applications. But manually instrumenting your code to collect and export telemetry data, such as traces, metrics, and logs, can become very stressful.

Manual instrumentation is very tedious, error-prone, and inconsistent. It can also introduce additional overhead and complexity to your application logic.

In this guide, you will learn how to use OpenTelemetry’s auto-instrumentation to help you achieve effortless Node.js monitoring.

Prerequisites

Before you go through this guide, make sure you have the following:

A Node.js application
A Datadog account and an API key. If you don't have one, you can sign up here to get one.
A backend service. You can use a backend service like Zipkin or Jaeger to store and analyze trace data. For this guide, we'll be using Jaeger.
Some basic knowledge of Linux commands. You should be familiar with using the command line and editing configuration files.

Prepare Your Application

In this guide, you will be using a Node.js application that has two services that transfer data between themselves. You will use OpenTelemetry’s Node.js client library to send trace data to an OpenTelemetry Collector.

First, clone the repo locally:

$ git clone https://github.com//nodejs-example.git

Then install the dependencies:

$ npm install

Go to the directory of the first service using this command:

$ cd

And start the first service.

$ node index.js

Then go to the directory of the second service

$ cd

And start the second service.

$ node index.js

Open Service A, in this case port 5555, and input some information. Then repeat the same for Service B.

How to Set Up OpenTelemetry

After starting the services, it's time to install the OpenTelemetry modules you'll need for auto-instrumentation.

Here is what we need to install:

$ npm install --save @opentelemetry/api

$ npm install --save @opentelemetry/instrumentation

$ npm install --save @opentelemetry/tracing

$ npm install --save @opentelemetry/exporter-trace-otlp-http

$ npm install --save @opentelemetry/resources

$ npm install --save @opentelemetry/semantic-conventions

$ npm install --save @opentelemetry/auto-instrumentations-node

$ npm install --save @opentelemetry/sdk-node

$ npm install --save @opentelemetry/exporter-jaeger

Here's a breakdown of what each module does:

@opentelemetry/api: This module provides the OpenTelemetry API for Node.js.
@opentelemetry/instrumentation: The instrumentation libraries provide automatic instrumentation for your Node.js application. They automatically capture telemetry data without requiring manual code modifications.
@opentelemetry/tracing: This module contains the core tracing functionality for OpenTelemetry in your Node.js application. It includes the Tracer and Span interfaces, which are important for capturing and representing distributed traces within your applications.
@opentelemetry/exporter-trace-otlp-http: This exporter module enables sending trace data to an OpenTelemetry Protocol (OTLP) compatible backend over HTTP.
@opentelemetry/resources: This module provides a way to define and manage resources associated with traces.
@opentelemetry/semantic-conventions: This module defines a set of semantic conventions for tracing. It establishes a common set of attribute keys and value formats to ensure consistency in how telemetry data is represented and interpreted.
@opentelemetry/auto-instrumentations-node: This module simplifies the process of instrumenting your application by automatically applying instrumentation to supported libraries.
@opentelemetry/sdk-node: The Software Development Kit (SDK) for Node.js provides the implementation of the OpenTelemetry API.
@opentelemetry/exporter-jaeger: This exporter module allows exporting trace data to Jaeger. Jaeger provides a user-friendly interface for monitoring and analyzing trace data.

Configure the Node.js Application

Next, add a Node.js SDK tracer to handle the instantiation and shutdown of the tracing.

To add the tracer, create a file tracer.js:

$ nano tracer.js

Then add the following code to the file:


"use strict";

const {
    BasicTracerProvider,
    SimpleSpanProcessor,
} = require("@opentelemetry/tracing");
// Import the JaegerExporter
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { Resource } = require("@opentelemetry/resources");
const {
    SemanticResourceAttributes,
} = require("@opentelemetry/semantic-conventions");

const opentelemetry = require("@opentelemetry/sdk-node");
const {
    getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");

// Create a new instance of JaegerExporter with the options
const exporter = new JaegerExporter({
    serviceName: "YOUR-SERVICE-NAME",
    host: "localhost", // optional, can be set by OTEL_EXPORTER_JAEGER_AGENT_HOST
    // optional; UDP port of the Jaeger agent (6832 is the default).
    // Port 16686 is the Jaeger UI, not the agent.
    port: 6832
});

const provider = new BasicTracerProvider({
    resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]:
            "YOUR-SERVICE-NAME",
    }),
});
// Add the JaegerExporter to the span processor
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

provider.register();
const sdk = new opentelemetry.NodeSDK({
    traceExporter: exporter,
    instrumentations: [getNodeAutoInstrumentations()],
});

sdk
    .start()
    .then(() => {
        console.log("Tracing initialized");
    })
    .catch((error) => console.log("Error initializing tracing", error));

// Shut down tracing cleanly when the process receives SIGTERM
process.on("SIGTERM", () => {
    sdk
        .shutdown()
        .then(() => console.log("Tracing terminated"))
        .catch((error) => console.log("Error terminating tracing", error))
        .finally(() => process.exit(0));
});

Here is a simple breakdown of the code:

The code starts by importing BasicTracerProvider and SimpleSpanProcessor from the OpenTelemetry tracing module to set up tracing.
It then imports the JaegerExporter module for exporting trace data to Jaeger.
The code creates a new instance of the JaegerExporter, specifying the service name, host, and port.
It then creates a BasicTracerProvider and adds the JaegerExporter to the span processor using SimpleSpanProcessor.
The provider is registered, setting it as the default provider for the application.
An OpenTelemetry SDK instance is created, configuring it with the JaegerExporter and enabling auto-instrumentations for Node.js.
The OpenTelemetry SDK is started, initializing tracing.
A handler for the SIGTERM signal is set up to shut down tracing when the application is terminated.
To verify the instrumentation, you can also attach a ConsoleSpanExporter, available from the same tracing package, through a second SimpleSpanProcessor so that spans are printed to the console as well. Note that the snippet above registers only the JaegerExporter.

How to Set Up OpenTelemetry to Export the Traces

Next, you'll need to write the configurations to collect and export data in the OpenTelemetry Collector.

Create a file config.yaml:


receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api: # Replace with your Datadog API key
      key: ""
    # Optional:
    #   - endpoint: https://app.datadoghq.eu  # For EU region

processors:
  batch:

extensions:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679
  health_check:

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

The configuration sets up the OpenTelemetry Collector with the OTLP (OpenTelemetry Protocol) receiver and the Datadog exporter. Here’s a breakdown of the configuration:

receivers: Specifies the components that receive the telemetry data. In this case, it includes the OTLP receiver, which supports both gRPC and HTTP protocols.
exporters: Defines the components responsible for exporting telemetry data. Here, it configures the Datadog exporter, providing the Datadog API key. Additionally, an optional endpoint is provided for using Datadog's EU region.
processors: Specifies the data processing components. In this case, the batch processor is used to batch and send data in larger chunks for efficiency.
extensions: Defines additional components that extend the functionality. Here, it includes extensions for pprof (profiling data), zpages (debugging pages), and a health check extension.
service: Configures the overall service behavior, including the extensions and pipelines. The extensions section lists the extensions to be used, and the pipelines section configures the telemetry data pipeline. Here, the traces pipeline includes the OTLP receiver, the batch processor, and the Datadog exporter.

This configuration sets up the collector with the Datadog exporter so that traces are sent to Datadog's distributed tracing service. However, there are other distributed tracing services that you can use, like New Relic, Logz.io, and Zipkin.

How to Start the Application

After correctly setting up auto-instrumentation, start the application again to test and verify the tracing configuration.

Begin by starting the OpenTelemetry Collector:

./otelcontribcol_darwin_amd64 --config ./config.yaml

By default, the collector's OTLP receiver listens on port 4317 for gRPC and on port 4318 for HTTP.

Next, go to the directory of the first service:

$ cd

Then start the first service with the “--require './tracer.js'” parameter to enable the application instrumentation.

$ node --require './tracer.js' index.js

Repeat this to start the second service.

Using a browser like Chrome, go to the endpoints of your two applications' services, add some data, and send some requests to test the tracing configuration.

Once the requests are made, these traces are picked up by the collector, which then dispatches them to the distributed tracing backend specified by the exporter configuration in the collector's configuration file.

It's worth noting that our tracer not only facilitates the transmission of traces to the designated backend, but also exports them to the console at the same time.

This dual functionality allows for real-time visibility into the traces being generated and sent, helping us in the monitoring and debugging processes.

Now, let’s use Jaeger UI to monitor the traces:

Start Jaeger with the following command:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.32

Using a browser, start Jaeger UI at the http://localhost:16686/ endpoint.

There you have it! You can see the trace starting at the entry point of one service and moving through a sequence of operations.

This path is created as the first service begins handling the request and calls the other service to fulfill the original request you made earlier.

The trace provides a visual narrative of what happens between these services, offering insights into each step of the process.

How Can You Use Observability Data?

Monitoring Metrics: Keep an eye on key metrics such as response times, error rates, and resource usage. Sudden spikes or anomalies can indicate issues that require attention.
Logging: Log data provides detailed information about events and actions within a system. Analyzing logs helps in understanding the sequence of activities and tracing the steps leading to an issue.
Tracing: Tracing involves tracking the flow of requests or transactions across different components of a system. This helps in understanding the journey of a request and identifying any bottlenecks or delays.
Alerting: Set up alerts based on specific conditions or thresholds. When certain metrics exceed predefined limits, alerts can notify you in real-time, allowing for immediate action.
Visualization: Use graphical representations and dashboards to visualize complex data. This makes it easier to identify patterns, trends, and correlations in the observability data.

Observability, when implemented effectively, empowers teams to proactively manage and improve the performance, reliability, and user experience of their systems. It's a crucial aspect of modern software development and operations.

Conclusion

In this guide you learned how to auto-instrument Node.js applications with little code by:

Installing and configuring the OpenTelemetry Node.js SDK and the auto-instrumentation package
Enabling automatic tracing and metrics collection for your Node.js applications and their dependencies
Exporting your telemetry data to a backend, Jaeger, and visualizing it there

Using OpenTelemetry’s auto-instrumentation can help you gain valuable insights into the performance and behavior of your Node.js applications without having to manually instrument each library or framework.

How to Use Pandas for Data Cleaning and Preprocessing

Oluwadamisi Samuel — Tue, 30 Jan 2024 14:55:00 +0000

Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

That estimate rings true. Data preparation encompasses a series of steps that ensure the data used in data science, machine learning, and analysis projects is complete, accurate, unbiased, and reliable.

The quality of your dataset plays a pivotal role in the success of your analysis or model. As the saying goes, “garbage in, garbage out”: the quality and reliability of your model and analysis depend heavily on the quality of your data.

Raw data collected from various sources is often messy and contains errors, inconsistencies, missing values, and outliers. Data cleaning and preprocessing aim to identify and rectify these issues so that model building and data analysis produce accurate, reliable, and meaningful results, since wrong conclusions can be costly.

This is where Pandas comes into play. It is a widely used tool for both data cleaning and preprocessing. In this article, we'll delve into the essential concepts of data cleaning and preprocessing using this powerful Python library.

Prerequisites
Introduction
What is Data Cleaning?
What is Data Preprocessing?
How to Import the Necessary Libraries
How to Load the Dataset
Exploratory Data Analysis (EDA)
How to Handle Missing Values
How to Remove Duplicate Records
Data Types and Conversion
How to Encode Categorical Variables
How to Handle Outliers
Conclusion

Prerequisites

A basic understanding of Python.
Basic understanding of data cleaning.

Introduction

Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use functions needed to work with structured data seamlessly.

Pandas also integrates seamlessly with other popular Python libraries, such as NumPy for numerical computing and Matplotlib for data visualization. This makes it a powerful asset for data driven tasks.

Pandas excels in handling missing data, reshaping datasets, merging and joining multiple datasets, and performing complex operations on data, making it exceptionally useful for data cleaning and manipulation.

At its core, Pandas introduces two key data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table with labeled axes (rows and columns). These structures allow users to manipulate, clean, and analyze datasets efficiently.
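
To make the two structures concrete, here is a small sketch. The car brands and prices are made-up illustration data, not part of any dataset used in this article.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical prices).
prices = pd.Series([250, 310, 199], name="price")

# A DataFrame: a two-dimensional table with labeled rows and columns.
cars = pd.DataFrame({
    "brand": ["Toyota", "Honda", "Ford"],
    "price": [250, 310, 199],
})

print(prices.shape)  # (3,)
print(cars.shape)    # (3, 2)

# Selecting a single column of a DataFrame gives you a Series.
print(type(cars["price"]).__name__)  # Series
```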

What is Data Cleaning?

Before we embark on our data adventure with Pandas, let's take a moment to explain the term "data cleaning." Think of it as a digital detox for your dataset, where we tidy up and prioritize accuracy above all else.

Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.

Why bother with data cleaning? Well, imagine trying to analyze sales trends when some entries are missing, or working with a dataset that has duplicate records throwing off your calculations. Not ideal, right?

In this digital detox, we use tools like Pandas to get rid of inconsistencies, straighten out errors, and let the true clarity of your data shine through.

What is Data Preprocessing?

You may be wondering, "Do data cleaning and data preprocessing mean the same thing?" The answer is no, they do not.

Picture this: you stumble upon an ancient treasure chest buried in the digital sands of your dataset. Data cleaning is like carefully unearthing that chest, dusting off the cobwebs, and ensuring that what's inside is authentic and reliable.

As for data preprocessing, you can think of it as taking that discovered treasure and preparing its contents for public display. It goes beyond cleaning; it's about transforming and optimizing the data for specific analyses or tasks.

Data cleaning is the initial phase of refining your dataset. It makes the data readable and usable with techniques like removing duplicates, handling missing values, and converting data types. Data preprocessing takes this refined data further with more advanced techniques, such as feature engineering, encoding categorical variables, and handling outliers, to get better results from your analysis or model.

The goal is to turn your dataset into a refined masterpiece, ready for analysis or modeling.

How to Import the Necessary Libraries

Before we embark on data cleaning and preprocessing, let's import the Pandas library.

To save time and typing, we often import Pandas as pd. This lets us use the shorter pd.read_csv() instead of pandas.read_csv() for reading CSV files, making our code more concise and readable.

import pandas as pd

How to Load the Dataset

Start by loading your dataset into a Pandas DataFrame.

In this example, we'll use a hypothetical dataset named your_dataset.csv. We will load the dataset into a variable called df.

#Replace 'your_dataset.csv' with the actual dataset name or file path
df = pd.read_csv('your_dataset.csv')

Exploratory Data Analysis (EDA)

EDA helps you understand the structure and characteristics of your dataset. Pandas provides several methods that give you a quick overview. You call them on the DataFrame variable, followed by the method name.

For example:

df.head() returns the first 5 rows of the dataset. You can specify a different number of rows in the parentheses.
df.describe() gives summary statistics such as the count, mean, standard deviation, and percentiles of the numerical columns of the Series or DataFrame.
df.info() prints the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.

Here's a code example below:

#Display the first few rows of the dataset
print(df.head())

#Summary statistics
print(df.describe())

#Information about the dataset (info() prints its report directly and returns None)
df.info()

How to Handle Missing Values

If you are new to this field, missing values can be a significant source of stress, as they come in different formats and can adversely impact your analysis or model.

Many machine learning models cannot be trained on data that has missing or "NaN" values, and leaving them in can skew the results of your analysis. But do not fret, Pandas provides methods to handle this problem.

One way to do this is by removing the rows with missing values altogether. Another is to fill them in. The code snippet below shows both:

#Check for missing values
print(df.isnull().sum())

#Drop rows with missing values and place the result in a new variable "df_cleaned"
df_cleaned = df.dropna()

#Fill missing values with the column mean for numerical data and place the result in a new variable called "df_filled"
df_filled = df.fillna(df.mean(numeric_only=True))

But if a large number of rows have missing values, dropping them discards too much data, so that method will be inadequate.

For numerical data, you can simply compute the mean and input it into the rows that have missing values. Code snippet below:

#Replace missing values in numerical columns with the mean of each column
df = df.fillna(df.mean(numeric_only=True))

#If you want to replace missing values in a specific column, you can do it this way:
#Replace 'column_name' with the actual column name
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

#Now, NaNs in the numerical columns of df have been replaced with the column mean
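
To see the difference between dropping and filling on real numbers, here is a small worked sketch. The DataFrame is made-up illustration data with one missing price and one missing brand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing number and one missing label.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "brand": ["A", "B", None],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Dropping removes every row that has at least one missing value.
dropped = df.dropna()
print(len(dropped))  # 1

# Filling with the mean only affects numerical columns.
filled = df.fillna(df.mean(numeric_only=True))
print(filled["price"].tolist())  # [10.0, 20.0, 30.0]
```

Notice that dropping keeps only one of the three rows, while filling keeps all three but leaves the missing brand untouched, because a mean only exists for numerical columns.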

How to Remove Duplicate Records

Duplicate records can distort your analysis by giving repeated rows extra weight, so the results no longer accurately reflect the trends and underlying patterns in the data.

Pandas makes it easy to identify duplicate rows and remove them, returning the de-duplicated data as a new DataFrame.

Code snippet below:

#Identify duplicates
print(df.duplicated().sum())

#Remove duplicates
df_no_duplicates = df.drop_duplicates()
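
Here is a small worked sketch of the same two calls on made-up order data, where one row is repeated exactly. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical orders: the row for order 2 appears twice.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [50, 75, 75, 20],
})

# duplicated() marks every repeat of an earlier row as True.
duplicate_count = orders.duplicated().sum()
print(duplicate_count)  # 1

# drop_duplicates() keeps the first occurrence by default.
deduped = orders.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 3]
```

If only some columns should decide what counts as a duplicate, drop_duplicates() also accepts a subset argument, for example subset=["order_id"].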

Data Types and Conversion

Data type conversion in Pandas is a crucial aspect of data preprocessing, allowing you to ensure that your data is in the appropriate format for analysis or modeling.

Data from various sources is usually messy, and some values may have the wrong data type. For example, numerical values may arrive as 'float' or 'string' instead of 'integer'. Mixing up these formats leads to errors and wrong results.

You can convert a Column of type int to float with the following code:

#Convert 'Column1' to float
df['Column1'] = df['Column1'].astype(float)

#Display updated data types
print(df.dtypes)

You can use df.dtypes to print column data types.
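
A common variation is a numeric column that was read as text because it contains a stray non-numeric entry. In that case astype(float) would raise an error. The sketch below, using a made-up mileage column, shows one way to handle it with pd.to_numeric, which is a different technique from the astype() call above.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, plus one bad entry.
raw = pd.DataFrame({"mileage": ["12000", "8500", "unknown"]})

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising an exception.
raw["mileage"] = pd.to_numeric(raw["mileage"], errors="coerce")

print(raw.dtypes)
print(raw["mileage"].isna().sum())  # 1
```

The unparseable entry becomes a missing value, which you can then drop or fill using the methods from the missing values section.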

How to Encode Categorical Variables

Categorical (non-numerical) values in your dataset can be just as important to a machine learning model as numerical ones.

These could be car brand names in a cars dataset for predicting car prices. But most machine learning algorithms cannot process this data type directly, so it must be converted to numerical data before it can be used.

Pandas provides the get_dummies function, which converts categorical values into a numerical (binary) format. Each category becomes its own column of 0s and 1s, so the algorithm treats the numbers as placeholders for categories rather than as ordered values. In other words, a 1 is not interpreted as greater than a 0. It only signals whether a row belongs to that category. The code snippet is shown below:

#To convert categorical data from the column "Car_Brand" to numerical data
df_encode = pd.get_dummies(df, columns=['Car_Brand'])

#The categorical data is converted to binary format of Numerical data
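
To see what the encoded result looks like, here is a small worked sketch on made-up car data. Passing dtype=int is an optional choice made here so the new columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical cars dataset with one categorical column.
cars = pd.DataFrame({
    "Car_Brand": ["Toyota", "Honda", "Toyota"],
    "Price": [250, 310, 270],
})

# Each brand becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(cars, columns=["Car_Brand"], dtype=int)

print(encoded.columns.tolist())
# ['Price', 'Car_Brand_Honda', 'Car_Brand_Toyota']
print(encoded["Car_Brand_Toyota"].tolist())  # [1, 0, 1]
```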

How to Handle Outliers

Outliers are data points that differ significantly from the majority of the data. They can distort statistical measures and adversely affect the performance of machine learning models.

They may be caused by human error or missing (NaN) values, or they could be accurate data that simply does not follow the pattern of the rest of the data.

There are several methods to identify and remove outliers:

Remove NaN values.
Visualize the data before and after removal.
Z-score method (for normally distributed data).
IQR (interquartile range) method, which is more robust for skewed data.

The IQR is useful for identifying outliers in a dataset. According to the IQR method, values that fall below Q1−1.5× IQR or above Q3+1.5×IQR are considered outliers.

The reasoning behind this rule is that, for normally distributed data, about 99.3% of values fall within this range, so anything outside it is unusual enough to inspect.

Here's a code snippet for the IQR method:

#Using the quartiles and the IQR, outliers are identified and the rows that contain them are removed
Q1 = df["column_name"].quantile(0.25)
Q3 = df["column_name"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[df["column_name"].between(lower_bound, upper_bound)]
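
To check the arithmetic by hand, here is the same method applied to a tiny made-up price column in which 95 is the obvious outlier.

```python
import pandas as pd

# Hypothetical prices: five similar values and one extreme value.
data = pd.DataFrame({"price": [10, 12, 11, 13, 12, 95]})

q1 = data["price"].quantile(0.25)  # 11.25
q3 = data["price"].quantile(0.75)  # 12.75
iqr = q3 - q1                      # 1.5

lower_bound = q1 - 1.5 * iqr       # 9.0
upper_bound = q3 + 1.5 * iqr       # 15.0

# Keep only the rows whose price lies inside the bounds.
filtered = data[data["price"].between(lower_bound, upper_bound)]
print(filtered["price"].tolist())  # [10, 12, 11, 13, 12]
```

Only the value 95 falls outside the range of 9.0 to 15.0, so it is the only row removed.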

Conclusion

Data cleaning and preprocessing are integral components of any data analysis, science or machine learning project. Pandas, with its versatile functions, facilitates these processes efficiently.

By following the concepts outlined in this article, you can ensure that your data is well-prepared for analysis and modeling, ultimately leading to more accurate and reliable results.

Web Storage API – How to Store Data on the Browser

Nathan Sebhastian — Fri, 12 Jan 2024 17:43:34 +0000

The Web Storage API is a set of APIs exposed by the browser so that you can store data in the browser.

The data stored in Web Storage uses the key/value pair format, and both the keys and the values are stored as strings.

There are two types of storage introduced in the Web Storage API: Local Storage and Session Storage.

In this article, I’m going to show you how to use the Web Storage API and why it’s useful for web developers.

How the Web Storage API Works

The Web Storage API exposes a set of objects and methods that you can use to store data in the browser. The data you store in Web Storage is private, which means no other website can access it.

In Google Chrome, you can view Web Storage by opening the developer tools window and going to the Application tab as shown below:

The web storage location in Google Chrome

In the picture above, you can see that the Storage menu also has other storage types like Indexed DB, Web SQL, and cookies. The Web SQL standard has been deprecated, and IndexedDB is rarely used because it’s too complex. Any data you store in IndexedDB might better be stored on the server.

As for cookies, it’s a more traditional mechanism of storing data that only allows you to store a maximum of about 4 KB of data per cookie. By contrast, browsers typically allow around 5 MB each for Local Storage and Session Storage per origin, though the exact limit varies by browser.

This is why we’re going to focus only on Local Storage and Session Storage in this article.

Local Storage and Session Storage Explained

Local Storage and Session Storage are the two standard mechanisms supported by the Web Storage API.

Web storage is domain specific, meaning data stored under one domain (netflix.com) can’t be accessed by another domain (www.netflix.com or members.netflix.com).

Web storage is also protocol specific. This means the data you store in a http:// site won’t be available under the https:// site.

The main difference between Local and Session Storage is that Local Storage keeps your data with no expiration date. If you want to remove the data, you need to use one of the available methods or clear it manually from the Application tab.

By contrast, the data stored in session storage is only available during the page session. When you close the browser or the tab, the session storage for that specific tab is removed.

Both Local and Session Storage can be accessed through the window object under the variables localStorage and sessionStorage, respectively. Let’s see the methods and properties of these storage types next.

Methods and Properties of Local and Session Storage

Both Local and Session Storage have the same methods and properties. To set a new key/value pair in the Local Storage, you can use the setItem() method of the localStorage object:

localStorage.setItem('firstName', 'Nathan');

If you look into the Local Storage menu in the browser, you should see the data above saved into the storage as follows:

Storing a key/value pair in Local Storage

The key you use in localStorage must be unique. If you set a value with a key that already exists, then the setItem() method will replace the previous value with the new one.

To get the value out of local storage, you need to call the getItem() method and pass the key you used when saving the data. If the key doesn’t exist, then getItem() will return null:

const firstName = localStorage.getItem('firstName');
console.log(firstName); // Nathan

const lastName = localStorage.getItem('lastName');
console.log(lastName); // null

To remove the data you have in local storage, call the removeItem() method and pass the key pointing to the data you want to remove:

localStorage.removeItem('firstName');

The removeItem() method will always return undefined. When the data you want to remove doesn’t exist, the method simply does nothing.

If you want to clear the storage, you can use the clear() method:

localStorage.clear();

The clear() method removes all key/value pairs from the storage object you are accessing.

Properties of Local and Session Storage

Both storage types have only one property, which is the length property that shows the number of key/value pairs stored in them.

sessionStorage.setItem('firstName', 'Nathan');
sessionStorage.setItem('lastName', 'Sebhastian');

console.log(sessionStorage.length); // 2

sessionStorage.clear();
console.log(sessionStorage.length); // 0

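Those are the main methods and properties you will use with localStorage and sessionStorage. Both objects also provide a key(index) method that returns the name of the key at a given position.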

How to Store JSON Strings in Web Storage

Since Web Storage always stores data as strings, you can store complex data as a JSON string, and then convert that string back into an object when you access it.

For example, suppose I want to store the following information about a user:

const user = {
  firstName: 'Nathan',
  lastName: 'Sebhastian',
  url: 'https://codewithnathan.com',
};

At first, I might store the data as a series of key/value pairs like this:

localStorage.setItem('firstName', user.firstName);
localStorage.setItem('lastName', user.lastName);
localStorage.setItem('url', user.url);

But a better way is to convert the JavaScript object into a JSON string, and then store the data under one key as follows:

const user = {
  firstName: 'Nathan',
  lastName: 'Sebhastian',
  url: 'https://codewithnathan.com',
};

const userData = JSON.stringify(user);

localStorage.setItem('user', userData);

Now the local storage will have only one key/value pair with the JSON string as its value. You can open the Application tab to see this:

Storing a JSON string in Local Storage

When you need the data, call the getItem() and JSON.parse() methods as follows:

const getUser = JSON.parse(localStorage.getItem('user'));

console.log(getUser);
// {firstName: 'Nathan', lastName: 'Sebhastian', url: 'https://codewithnathan.com'}

Here, you can see that the data is returned as a regular JavaScript object.

Local Storage vs Session Storage – Which One to Use?

Based on my experience, localStorage is the preferred Web Storage mechanism because the data will persist as long as you need it to. When you don’t need the data, you can remove it using the removeItem() method.

sessionStorage is only used when you need to store temporary data, like tracking whether a popup box has been shown to the user or not.

But this is also open to discussion because you might not want to show a popup every time the user logs into your web application, but only once. In that case, you should use localStorage instead.

My rule of thumb is to use localStorage first, and sessionStorage when the situation needs it.

Benefits of Using the Web Storage API

Now that you know how the Web Storage API works, you can see that there are some benefits of using it:

Storing data on the browser reduces the need to make a server request for a piece of information. This can improve the performance of your web applications.
The simple key/value pair format allows you to store user preferences and local settings that should persist between sessions.
The Web Storage API is simple to use, providing only a few methods and one property. It’s simple to set and retrieve data using JavaScript.
It has offline support. By storing necessary data locally, the Web Storage enables your web application to work offline.
The Web Storage is also a standardized API, meaning the code you write will work in many different browsers.

But of course, not all data should be stored in the Web Storage API. You still need a server database to keep records that are important for your application.

Conclusion

Web Storage is a useful API that allows you to quickly store and retrieve data from the browser. Using Web Storage, you can store the user’s preferences when accessing your application.

localStorage keeps your data until it’s removed manually, while sessionStorage keeps it only as long as the browser tab is open.

Some benefits of using the Web Storage API include reducing server requests, offline support, and a simple API that’s easy to use. It’s also standardized, so it will work on different browsers.

If you enjoyed this article and want to take your JavaScript skills to the next level, I recommend you check out my new book Beginning Modern JavaScript here.

The book is designed to be easy for beginners and accessible to anyone looking to learn JavaScript. It provides a step-by-step gentle guide that will help you understand how to use JavaScript to create a dynamic web application.

Here's my promise: You will actually feel like you understand what you're doing with JavaScript.

See you later!

Signal Processing and Systems in Programming – Guide for Beginners

Tiago Capelo Monteiro — Wed, 06 Sep 2023 14:49:58 +0000

Signal processing is an important field in engineering and programming.

Basically, it allows engineers and programmers to improve data so that people can use it more effectively.

For example, it is thanks to signal processing that much of the background noise in a phone call is removed. This way, only your voice arrives on the other end of the call.

Other examples are:

Audio and music software
Image and video processing software
Medical imaging software
Speech and language processing software
Wireless communication software

Understanding signal processing and systems is key for any programmer who needs to process, manipulate, and analyze these types of data.

This tutorial will explore the field of signal processing and four important characteristics of a system:

Causality
Memory
Time-invariance
Linearity

Here's what we'll cover:

What is Signal Processing?
Python Code Example – How to Filter a Signal
Background on the Fourier Transform
What is a System in Signal Processing?
Conclusion

What is Signal Processing?

Signal processing, simply explained, is the field where tools are created for engineers and programmers to manipulate certain signals to solve problems.

It involves analyzing sounds or images to extract only the needed data.

For example, the data from biosensors that shows how much oxygen there is in your blood is displayed in a pulse oximeter. This data is filtered with the help of tools from signal processing.

Photo by cottonbro studio

This data is processed by a program inside the oximeter with the help of signal processing software tools.

Also, when you're making a phone call to a friend, signal processing algorithms are running so that only your voice gets sent to your friend to reduce as much background noise as possible.

Photo by Karolina Grabowska

Often, signal processing works with the help of tools like the Fast Fourier Transform. And don't worry – I'll explain what this is.

Using the Fast Fourier Transform algorithm, we are able to decompose a signal to find the individual waves that make it up.

This way, we are able to remove the individual waves that we don't want (for example, the background noise of a phone call is a set of waves we can remove to improve quality).

The Fast Fourier Transform is also used as a building block or inspiration for some file compression algorithms.

In the end, this is what signal processing is all about: decomposing a signal to extract what we want from it.

Where is signal processing used in real life?

Audio processing – like removing the background noise from a movie
Image processing – like making the image black and white
Wireless communications systems – like modulating a signal so that it can travel further (frequency modulation)

Python Code Example – How to Filter a Signal

You don't need to understand the full code I am about to show you right now – this is just the code I used to generate the graphs I will show you to help you understand how the Fast Fourier Transform works.

I've shared the full code in the conclusion in a GitHub repository so you can check it out.

Here is the code that filters a signal:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 300, endpoint=False)
x = np.sin(2 * np.pi * 10 * t)
y = 0.5 * np.sin(2 * np.pi * 20 * t)
w = 0.2 * np.sin(2 * np.pi * 50 * t)
z = x + y + w

zf = np.fft.fft(z)

N = len(z)
freq = np.fft.fftfreq(N, d=t[1]-t[0])
spectrum = 2/N * np.abs(zf[:N//2])

mask = np.ones(len(freq), dtype=bool)
mask[(freq > 15) & (freq < 60)] = False
mask[(freq < -15) & (freq > -60)] = False

zf_filtered = zf.copy()
zf_filtered[~mask] = 0

z_filtered = np.fft.ifft(zf_filtered)

Below, I will show visually what each part of the code does, using graphs:

Step 1: Creating the signals

t = np.linspace(0, 1, 300, endpoint=False)
x = np.sin(2 * np.pi * 10 * t)
y = 0.5 * np.sin(2 * np.pi * 20 * t)
w = 0.2 * np.sin(2 * np.pi * 50 * t)
z = x + y + w

Three different signals and a green signal representing their sum

We can see here that the green signal is the sum of:

Red wave – X signal
Blue wave – Y signal
Orange wave – W signal

Note that any signal can be composed of a certain number of simple waves. In mathematics, these waves are the sine and cosine functions.

This incredibly important idea is called a Fourier series.

Below is a video I recommend that explains simply what a Fourier series is:

Step 2: Creating a Fast Fourier Transform on the signal Z

We can apply the Fast Fourier Transform like this:

zf = np.fft.fft(z)

To make a graph out of it, we still need to do the following:

N = len(z)
freq = np.fft.fftfreq(N, d=t[1]-t[0])
spectrum = 2/N * np.abs(zf[:N//2])

Seeing the green signal in terms of frequency instead of time - We are "seeing" the green signal from another point of view

Thanks to the Fast Fourier Transform, we are able to see the composition of the green signal.

As we can see, the green signal is composed of 3 waves with 3 different frequencies:

10 hertz – Red wave – X signal
20 hertz – Blue wave – Y signal
50 hertz – Orange wave – W signal
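If you'd rather read those three frequencies off the spectrum without a plot, here is a minimal sketch. The 0.1 threshold and the names positive_freqs and peaks are my own assumptions for illustration and are not part of the original code:

```python
import numpy as np

# Rebuild the same signal as in Step 1
t = np.linspace(0, 1, 300, endpoint=False)
z = (np.sin(2 * np.pi * 10 * t)
     + 0.5 * np.sin(2 * np.pi * 20 * t)
     + 0.2 * np.sin(2 * np.pi * 50 * t))

N = len(z)
zf = np.fft.fft(z)
freq = np.fft.fftfreq(N, d=t[1] - t[0])

# Keep only the positive half of the spectrum and scale it to wave amplitude
positive_freqs = freq[:N // 2]
spectrum = 2 / N * np.abs(zf[:N // 2])

# Assumed threshold: anything above 0.1 counts as a real component
peaks = spectrum > 0.1
for f, a in zip(positive_freqs[peaks], spectrum[peaks]):
    print(f"{f:.0f} Hz with amplitude {a:.2f}")
```

The printed amplitudes should line up with the 1, 0.5 and 0.2 factors used when the three waves were created.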

Step 3: Creating and applying the filter

mask = np.ones(len(freq), dtype=bool)
mask[(freq > 15) & (freq < 60)] = False
mask[(freq < -15) & (freq > -60)] = False

zf_filtered = zf.copy()
zf_filtered[~mask] = 0

z_filtered = np.fft.ifft(zf_filtered)

Filtered X signal from the green signal - Only the 10 hertz signal passes

This filter is called a band-stop filter, because it removes all the frequencies between 15 hertz and 60 hertz. The 20 hertz and 50 hertz waves fall inside that band, so only the 10 hertz wave passes.

So, this filtered red signal is essentially the original red signal.
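One way to check that claim numerically, rather than by eye, is to compare the filtered result with the 10 hertz wave sample by sample. This is a small sketch under my own assumptions: the name max_error and the 1e-9 tolerance are illustrative choices, and the two mask lines are merged into one using np.abs:

```python
import numpy as np

t = np.linspace(0, 1, 300, endpoint=False)
x = np.sin(2 * np.pi * 10 * t)
z = x + 0.5 * np.sin(2 * np.pi * 20 * t) + 0.2 * np.sin(2 * np.pi * 50 * t)

zf = np.fft.fft(z)
freq = np.fft.fftfreq(len(z), d=t[1] - t[0])

# Zero every bin whose frequency magnitude lies between 15 and 60 hertz
zf_filtered = zf.copy()
zf_filtered[(np.abs(freq) > 15) & (np.abs(freq) < 60)] = 0

# ifft returns complex numbers; the imaginary part is only rounding noise here
z_filtered = np.fft.ifft(zf_filtered).real

# Largest difference between the filtered signal and the original red wave
max_error = np.max(np.abs(z_filtered - x))
print(max_error < 1e-9)
```

The match is this clean only because 10, 20 and 50 hertz each complete a whole number of cycles in the one-second window. With real recordings you would expect some leftover error.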

Background on the Fourier Transform

The idea that any signal can be represented by the sum of simple waves was created by the mathematician Joseph Fourier.

These waves are called sine and cosine.

Note: you don't need to understand these equations completely – I'm just showing them so you can understand the history of the Fourier Transform.

This is what is called a Fourier series:

Equation for Fourier series

Here is a better image of the equation

The coefficients are given by the following:

Coefficients of the Fourier series

From the Fourier series the Fourier transform can be deduced:

Fourier transform equation

Here is a better image of the formula

However, the Fourier transform was developed by various mathematicians and physicists over the years.

So it was thanks to the work of many scholars over time that the Fourier transform was refined into the form we use today.

But this is not the pure mathematical complicated expression that is running in a computer.

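In a computer, the signal is a list of samples, so what runs is an algorithm called the Fast Fourier Transform. It computes the discrete version of the Fourier transform, which approximates the continuous one very well.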

That is where in the code the FFT comes from:

zf = np.fft.fft(z)

fft stands for Fast Fourier Transform.

Here are the docs with the function:

https://numpy.org/doc/stable/reference/generated/numpy.fft.fft.html

But you might be asking...

Why use the Fast Fourier Transform?

Because the Fast Fourier Transform runs much faster than the pure mathematical equation.
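To make this concrete, here is a rough sketch of the direct calculation next to NumPy's FFT. The function name naive_dft is my own. It evaluates the discrete Fourier transform sum directly, which takes on the order of N² multiplications, while the FFT needs on the order of N log N:

```python
import numpy as np

def naive_dft(signal):
    """Direct O(N^2) evaluation of the discrete Fourier transform."""
    n = len(signal)
    k = np.arange(n)
    # n-by-n matrix of complex exponentials exp(-2*pi*i*k*m/n)
    basis = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return basis @ signal

t = np.linspace(0, 1, 300, endpoint=False)
z = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# Both routes should give the same numbers, up to rounding error
same_result = np.allclose(naive_dft(z), np.fft.fft(z))
print(same_result)
```

For 300 samples the difference in speed is hard to notice, but it grows quickly as the signal gets longer.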

There is even a whole field of mathematics dedicated to finding algorithms that approximate pure math so that computers can run it really fast.

This field is called Numerical Analysis.

It is also used to find approximate solutions for problems that are impossible to find by hand.

For example, in the field of partial differential equations, many solutions to partial differential equations are only solved with numerical analysis methods running on computers.

This way, thanks to this field of mathematics, companies are able to save millions in energy costs.

If you are interested in learning more about numerical analysis, I've included some more resources in the conclusion.

Changing the topic slightly, now we will talk about systems.

What is a System in Signal Processing?

A system is a combination of many “things” that work together as if they were a whole.

An example of a system could be a computer or a car.

In signal processing, a system is often a combination of software and hardware in a technology that takes an input signal and produces an output signal.

For example, when pressing the acceleration pedal in a car (input), the car goes faster (output).

Knowing the characteristics of a system is important for understanding how it will process the signal.

Four important characteristics of a system are:

Causality
Memory
Time-invariance
Linearity

But, why is it important to understand the main characteristics of a system in programming?

By understanding these characteristics, you will understand better how to design software (in this case, the system can be seen as software) as well as how to optimize and integrate it.

Knowing these characteristics is very important in systems engineering where they are applied to software development. They help you manage the complexity of programs, define their requirements, and ensure quality, adaptability and scalability.

If you want to learn more about systems engineering, you can read my article on it.

So, let's learn more about what each of these characteristics is.

Causality

Causality is the property of a system where the output depends on past and present input only.

For example, when predicting the weather, it is only possible to use past weather data to make a forecast.

It is not possible to use future weather data to predict the weather.
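Here is a small sketch of the same idea in code. Both systems are toy examples I made up for illustration: one averages the current sample with the previous one, the other with the next one.

```python
def causal_average(x):
    """y[n] = (x[n] + x[n-1]) / 2, using only present and past samples."""
    return [(x[n] + (x[n - 1] if n > 0 else 0)) / 2 for n in range(len(x))]

def non_causal_average(x):
    """y[n] = (x[n] + x[n+1]) / 2, which peeks one sample into the future."""
    return [(x[n] + (x[n + 1] if n + 1 < len(x) else 0)) / 2 for n in range(len(x))]

signal = [1, 2, 3, 4]
changed_future = [1, 2, 3, 100]  # identical except for the last sample

# Outputs for n = 0..2 should only differ if the system looks ahead
causal_same = causal_average(signal)[:3] == causal_average(changed_future)[:3]
non_causal_same = non_causal_average(signal)[:3] == non_causal_average(changed_future)[:3]
print(causal_same, non_causal_same)  # True False
```

Changing a future sample leaves the earlier outputs of the causal system untouched, but it changes the output of the non-causal one.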

Memory

Memory is the property of a system where the output depends on past inputs.

Recommendation systems used by websites like Netflix and Amazon suggest movies or products based on a user's previous interactions.

The algorithms take into account products viewed or purchased, and use this information to recommend similar items.

With more data gathered over time, recommendations become more accurate and personalized.

Memory is important in programming because it allows us to create systems that can learn and adapt over time – for example, machine learning systems.
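As a rough illustration, compare a system without memory to one with memory. Both functions below are toy examples of my own, not taken from any library:

```python
def square(x):
    """Memoryless: y[n] depends only on x[n]."""
    return [value ** 2 for value in x]

def running_sum(x):
    """Has memory: y[n] is the sum of every input up to and including x[n]."""
    total, output = 0, []
    for value in x:
        total += value
        output.append(total)
    return output

# Same current input (the final 5), different histories
print(square([1, 2, 5])[-1], square([9, 9, 5])[-1])            # 25 25
print(running_sum([1, 2, 5])[-1], running_sum([9, 9, 5])[-1])  # 8 23
```

The memoryless system gives the same answer for the same current input, while the running sum gives a different answer depending on what came before.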

Time-invariance

Time-invariance is the property of a system whose behavior does not depend on when the input is applied: delaying the input simply delays the output by the same amount.

Real-time control systems used in robotics, manufacturing, and aerospace applications rely on time-invariant systems.

For instance, a flight control system must respond quickly and accurately to changes in an aircraft's position, irrespective of when they occur.
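A simple way to test this on a toy system is to delay the input and see whether the output is delayed by the same amount. The helper names below are hypothetical, and checking a single signal is a quick sanity test rather than a proof:

```python
def delay_input(x, k):
    """Shift a signal k samples later, padding the start with zeros."""
    return [0] * k + x[:len(x) - k]

def scale(x):
    """y[n] = 3 * x[n]: behaves the same at every time step."""
    return [3 * value for value in x]

def ramp_gain(x):
    """y[n] = n * x[n]: the gain grows with the time index n."""
    return [n * value for n, value in enumerate(x)]

def is_time_invariant(system, x, k):
    # Shifting the input first must match shifting the output afterwards
    return system(delay_input(x, k)) == delay_input(system(x), k)

signal = [1, 2, 3, 0, 0]
print(is_time_invariant(scale, signal, 2))      # True
print(is_time_invariant(ramp_gain, signal, 2))  # False
```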

Linearity

Linearity in programming is like a recipe where doubling the ingredients results in a proportionally doubled output, allowing for predictable and accurate results.

Linearity refers to the property of a system where scaling the input scales the output by the same factor, and the response to a sum of inputs is the sum of the individual responses.

This means that if the input is doubled, the output will also be doubled.

For example, in digital image processing, linearity is used in techniques such as contrast adjustment and color correction to ensure that the output is proportional to the input.

This results in predictable and accurate image processing.
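The usual way to check linearity is the superposition test: feed in a weighted sum of two inputs and compare the result with the same weighted sum of the two separate outputs. Here is a minimal sketch with two toy systems of my own choosing:

```python
def gain(x):
    """y[n] = 2 * x[n]"""
    return [2 * value for value in x]

def square(x):
    """y[n] = x[n] ** 2"""
    return [value ** 2 for value in x]

def obeys_superposition(system, a, b, alpha, beta):
    # Response to the combined input...
    combined = [alpha * p + beta * q for p, q in zip(a, b)]
    # ...versus the combination of the individual responses
    expected = [alpha * p + beta * q for p, q in zip(system(a), system(b))]
    return system(combined) == expected

a, b = [1, 2, 3], [4, 0, -1]
print(obeys_superposition(gain, a, b, 2, 3))    # True
print(obeys_superposition(square, a, b, 2, 3))  # False
```

Passing the test for one pair of inputs does not prove a system is linear, but failing it once is enough to show it is not.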

Conclusion

Signal processing and systems are essential for programming because they allow us to process, manipulate, and analyze data in a reliable and predictable way.

Whether you are working on audio processing, image processing, or any other application that involves signal processing, understanding the fundamentals of signal processing and systems is crucial for success.

Systems are closely related to signal processing, because they transform signals so that programmers and engineers can reach their desired goal.

If you are interested in learning more about Fourier Transform, here is a YouTube video explaining it in more depth:

Here is a YouTube video explaining the algorithm that actually runs on your computer:

And here is also a video detailing the history of the development of the Fast Fourier Transform:

Final Note

There are more transforms used for signal processing and other purposes, such as the Laplace transform (used in continuous signals) and the Z transform (used in discrete signals).

But, since there are so many mathematical transform formulas, what really is a transform?

A transform is simply a mathematical tool that helps us see something from a different point of view.

By seeing things a different way, we can learn about details we did not see originally.

For example, with the Fourier Transform, we can see the signal from the point of view of a frequency instead of the point of view of time.

This lets us see the same thing in a different way.

Mathematically, we can say we are changing the domain of the function. In other words, we are changing the x axis.

And I will leave this with you: the Laplace transform is a generalized Fourier transform.

Full code:

https://github.com/tiagomonteiro0715/Signal-Processing-and-Systems-in-Programming-Guide-for-Beginners

What is Steganography? How to Hide Data Inside Data

Daniel Iwugo — Thu, 13 Jul 2023 17:15:06 +0000

Ladies and Gentlemen, welcome to the world of Spies 🕵️.

In the movie Uncharted (great movie by the way), Tom Holland and his brother have a secret form of communication. They would write a message on a plain postcard with special ink that became invisible and then send it to the other person.

On the outside, it seemed like another plain old postcard. But if a lighter was lit just behind the paper, the ink would reappear, and a new message would be found 🔥.

This is one of the coolest hidden information tricks seen in movies. But what if we could do this on computers?

Well, turns out we sorta can. Using Steganography.

Disclaimer: This concept can be used for both good and bad. The content of this article is for educational purposes only and is not to be used to play pranks, or harm people and infrastructure.

And with that out of the way, here’s what we’re going to explore in this article:

What is Steganography?
Types of Steganography – Text, Image, Video, Audio, Network
Image steganography using Steghide

What is Steganography?

Steganography is the art of hiding secret data in plain sight. It sounds kind of counter-intuitive, but you’d be surprised how effective it is.

Hiding things such as source code, passwords, IP addresses, and other confidential information in pictures, music, or other random files tends to be the last place anyone would think of finding them.

You should note that steganography and cryptography are not mutually exclusive from each other. One may contain elements of the other or both. For example, you could perform steganography with an encryption algorithm or password, as you’ll find out soon.

Types of Steganography

There are various types of steganography, and we’ll look at five of them in this tutorial.

Text Steganography

This form involves hiding a message within a text. A common way to do this is substitution. It involves replacing certain characters with others and then substituting them back to retrieve the original data.

For example, take the following text.

Thi follow eng tixt contaens a sicrit missagi

Doesn’t really make sense right? But what if we replace the i’s with e’s and the e’s with i’s?

The follow ing text contains a secret message

I think that’s a little easier on the eyes. This is a pretty easy example, but there are much more complicated ones and even some you could come up with on your own.
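If you want to try the swap yourself, here is a short sketch using Python's built-in str.maketrans and str.translate. The variable names are my own:

```python
# Swap every "i" with "e" and every "e" with "i" in a single pass
swap_table = str.maketrans("ie", "ei")

hidden = "Thi follow eng tixt contaens a sicrit missagi"
revealed = hidden.translate(swap_table)
print(revealed)  # The follow ing text contains a secret message

# Applying the same swap again hides the message once more
assert revealed.translate(swap_table) == hidden
```

Because the swap is its own inverse, the same table both hides and reveals the message.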

Image Steganography

Frankly, this is my favourite. It involves hiding data behind digital images. There are various techniques for image steganography which include the Least Significant Bit technique, Masking and Filtering, and Coding and Cosine Transformation.

Take a look at the two images below and spot the difference:

Groot on Linux ¦ Credit: Mercury

Basically, no human on earth can tell the visual difference. But if you take a closer look at the file details…

Comparing the images ¦ Credit: Mercury

The only difference is the size of the images. That’s because the one on the right is hiding 260 words of text in it. How cool is that?

Video Steganography

In Video steganography, you can literally hide entire videos inside another video. Videos are basically a sequence of images with audio playing as the sequence progresses. This type of steganography allows each video frame to encode an image of the one you want to hide.

This technique can also be used to hide text as demonstrated in the software Steganosaurus by James Ridgeway. He shows how it works in this video.

Audio Steganography

This type of steganography enables hidden messages to be encoded inside an audio file. A common technique used in this is called Backmasking. Backmasking is hiding a message in the audio file and it can only be heard when played backwards.

The famous rapper, Eminem, did some backmasking in the song ‘Stimulate’ back in 2002.

Network Steganography

This is relatively rare, but nevertheless, it is a technique in which messages are passed by hiding them in network traffic. The messages could be found in the payload or headers of data packets when captured and analysed by the receiver.

Now let’s take a look at how to do some image steganography.

Steganography using Steghide

Steghide is an open source image steganography tool that uses the least significant bit (LSB) method to hide data in images.

Images are made up of pixels, which are made up of bits. The bit depth determines how many colours are present in an image. The higher the bit depth, the more colourful the image tends to look.

What LSB does is change the last bit of each byte (or pixel) in the image to one that represents the data you want to hide. This changes the image data, but if done properly is not perceivable. The higher the bit depth and resolution, the more data can be stored in the image.
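To get a feel for the idea, here is a bare-bones sketch of least significant bit embedding on raw bytes. This is a generic illustration and not Steghide's actual code: the function names embed and extract are hypothetical, and the cover is a plain run of bytes standing in for pixel data.

```python
def embed(cover: bytes, secret: bytes) -> bytearray:
    """Write each bit of `secret` into the lowest bit of successive cover bytes."""
    bits = [(byte >> shift) & 1 for byte in secret for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for the secret")
    stego = bytearray(cover)
    for index, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the secret bit
        stego[index] = (stego[index] & 0b11111110) | bit
    return stego

def extract(stego: bytes, length: int) -> bytes:
    """Read `length` bytes back out of the lowest bits."""
    out = bytearray()
    for start in range(0, length * 8, 8):
        value = 0
        for byte in stego[start:start + 8]:
            value = (value << 1) | (byte & 1)
        out.append(value)
    return bytes(out)

cover = bytes(range(200, 232))   # stand-in for 32 bytes of pixel data
stego = embed(cover, b"hi!")
print(extract(stego, 3))         # b'hi!'
print(max(abs(a - b) for a, b in zip(cover, stego)))  # never more than 1
```

Each cover byte changes by at most 1, which is why the difference is so hard to see. Note that the receiver needs to know the message length, which this sketch simply passes in by hand.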

Now that you understand how it works, let’s play a little hide and seek (no pun intended 👀).

First we’ll be needing a few things:

A Linux OS
An Internet Connection
An Image
A Text file

Install Steghide

First we need to install Steghide. Open your terminal and run the following command to do that:

sudo apt install steghide

You can always run steghide --help to get the command list to see all your options.

Get your image ready

Next, have an image and a text file in a directory. My files are ‘information.txt’ and ‘image.png’. I’ve also put some text in the file to hide in the image later.

Setting up files ¦ Credit: Mercury

Open up your terminal again and go to the directory where you stored the files. Mine is in ~/Documents/steganography_tutorial.

Looking for the files ¦ Credit: Mercury

Create a new image

Next, run the following command to create a new image that contains the text file you want to hide.

steghide embed -ef <data> -cf <image> -sf <stego_image> -v

Let’s take a look at the command:

steghide – We specify the tool to use
embed – Tells the tool we want to embed data
-ef – Embed file, specifies the file to hide
-cf – Cover file, specifies the cover image
-sf – Stego file, creates a duplicate of the original image with the embedded file in it
-v – Verbose, gives us more information about the process

When the command is run, you’ll be asked to enter a password. If you want an extra layer of security, you might want to do this. If you don’t, just hit enter twice. Here’s the result of what I ran:

Embedding the information ¦ Credit: Mercury

Inspect the new file

Now let’s take a look at the new file.

Comparing the images side by side ¦ Credit: Mercury

There seems to be no difference. We can take a closer look with a site called diffchecker.com.

Comparing the images details ¦ Credit: Mercury

Extract the data

The stego file is slightly bigger than the original because it contains information. We can extract the data from the stego file using the command below.

steghide extract -sf <stego_image> -xf <extracted_data>

Let’s review the command above:

-sf – stego file, the image containing hidden data
-xf – extract file, the file with extracted data

Below is the screenshot from running the command. The extracted text is also shown below.

Extracting the information ¦ Credit: Mercury

If you extracted the text, Congratulations 🎉🎊. You have successfully hidden and extracted the text from the image. You can do this with a number of things, even whole books.

Using a different tool called Stegcore, I hid a text file containing Quincy Larson’s new book, “How to Learn to Code & Get a Developer Job”, behind an image of the book🔍.

Here’s an excerpt from the book.

An excerpt from the book ¦ Credit: Quincy Larson

And just like before, the text was embedded into a new image. Here is the original and the stego image side by side.

The original image compared to the stego image ¦ Credit: Mercury

And as expected, the stego image is slightly larger in size than the original.

The image details side by side ¦ Credit: Mercury

Talk about hiding a book behind a book (bad joke, I know 🤧). If you want to try it out, you can check out the Github repository or the app.

Conclusion

You’ve learned what steganography is and how to implement it using tools. Keep in mind that steganography is a tool and can be used for both good and bad. Companies can hide sensitive information using these means. On the other hand, a hacker could use it to hide malicious code.

Once again, this tutorial is for educational purposes only and is to be used to help and defend information from black hat hackers. Stay safe in the online jungle and happy hacking 🙃.

Acknowledgements

Thanks to Anuoluwapo Victor, Chinaza Nwukwa, Holumidey Mercy, Favour Ojo, Georgina Awani, and my family for the inspiration, support and knowledge used to put this together. I appreciate all of you.

If you want articles similar to this one, hit me up on Upwork or read more of my articles here.

Cover image credit: Abstract Data Cube ¦ Credit: Shubham Dhage.

How to Create Data Validation Rules in Excel

freeCodeCamp — Fri, 26 May 2023 00:07:47 +0000

By Faith Oyama

Data validation is a feature in Excel used in restricting data entry in specific cells. It can also prompt the user to enter valid data into the cells based on the rules and restrictions provided by the creator of the Excel worksheet.

When setting up a workbook, you might want to make sure users input a specific type of data. For example, you might want to allow only dates, numbers, or letters in a specific range to be entered in a cell. This is crucial if you want to eliminate mistakes as much as possible in your data.

Types of Data Validation Rules in Excel

Here are a few data validation rules you can set up in Excel:

Only allow text or numeric values in a cell.
Only allow numbers within a specific range.
Display a warning message when a user inputs the wrong data.
Only allow dates and times outside a given range.
Validation rule based on criteria from another cell.

Steps to Create Data Validation Rules in Excel

To create a data validation rule in Excel, do the following:

First, select the row, column, or specific cell you want to apply a data validation rule to.

Then open the Data tab and click on Data Validation. Alternatively, you can go directly to the data validation dialogue box by pressing ALT, D, and L one after another. Do not hold the keys down together: press them separately and you will be taken to the dialogue box.

Create the data validation based on what data you want to be supplied in the cell or row.

You can supply the following validation criteria:

Allow: Make a rule based on the type of data you want to allow. You can choose one from the drop-down menu. You can uncheck the “Ignore blank” button if you do not want blank spaces.

Data: From the drop-down menu you can choose the criteria and also input the minimum and maximum values you want the user to input.

With the validation criteria set, click OK to close the window or click on the Input Message or Error Alert tab to give the user more information on the data validation rule.

Input message: While this is optional, you can input a message to be displayed when a user clicks on a cell that has a data validation rule defined on it.

Next, give your input message a title, and under the input message, make sure the message you provide is clear to the user. Click on OK to close the dialogue box or navigate to the Error Alert tab.

Then display an error message. This is optional, but it is good practice to display an error message to users when they enter data that is outside the validation rule you set.

There are three types of error alerts:

Stop: This is the default and is very strict, as it stops users from entering invalid data. You can only click on “Retry” or “Cancel”.

Warning: This will only warn the user and is not as strict as the stop alert. A user can ignore the message by clicking “Yes”, and the invalid data will be entered.

Here’s an example of the warning message a user will get:

Information: This is a permissive type of error alert as it only informs the user about invalid data inputted.

If the user clicks OK, the invalid data gets inserted into the worksheet. If the user clicks on Cancel, the data gets deleted.

Give a title to the error alert and also provide a message for your users to see. When you’re done, click on OK, and your data validation rule has been set.

Conclusion

Data validation in Excel is one powerful feature you should utilize when creating an Excel spreadsheet.

You can use the data validation feature in Excel to make rules that will ensure the data inputted meets certain criteria or follows predefined rules. Setting a data validation rule helps to maintain data accuracy, consistency, and integrity within your Excel worksheet.

YAML Commenting – How to Add a Multiline Comment in YAML

Ihechikara Abba — Mon, 01 May 2023 18:28:28 +0000

You can use a YAML file to store data in a format that can be easily read and understood by humans. It is a data serialization language that is often used for configuration files and data transfer between applications.

YAML is similar to XML and JSON as they can all be used to store data in different formats. The main difference is their syntax.

Here's what XML format looks like:

<user>
  <name>John Doe</name>
  <phone>00223344</phone>
  <age>80</age>
</user>

Here's what JSON format looks like:

{
  "user": {
    "name": "John Doe",
    "phone": "00223344",
    "age": 80
  }
}

Here's what YAML format looks like:

user:
  name: John Doe
  phone: 00223344
  age: 80

Each of the formats above is used to store data about a user's name, phone number, and age.

You can read more about the features, basic rules, and syntax of YAML, and its differences from JSON and XML in this article.

In this article, you'll learn about multiline comments in YAML.

How to Add a Multiline Comment in YAML

You can use comments for various reasons like documenting your code, collaborating with others, stopping a block of code from running, and so on.

You can use the # symbol to create comments in a YAML file. That is:

# The object below represents a user

user:
  name: John Doe
  email: john.doe@example.com
  age: 30

Unlike some other languages, YAML doesn't have a different format for creating block or multiline comments.

You'll have to use the # symbol on every line the comment spans. Here's an example:

# The object below is an example that represents a 
# user's name, email and age

user:
  name: John Doe
  email: john.doe@example.com
  age: 30

If you remove the # symbol on the second line, the text may still appear as a comment but the YAML parser may interpret it as plain text which may lead to an error.

To be on the safe side, use the # symbol at the start of each comment line.

Summary

In this article, we talked about YAML. It is mostly used to store and transfer data.

We saw how to create inline and multiline comments. In YAML, the # symbol is used for both inline and multiline comments.

Happy coding! Check out my blog for more programming content.
