freeCodeCamp.org

How to Apply Academic Theories to Human-Centered Web Design [Full Handbook]

Great John — Fri, 08 May 2026 18:22:33 +0000

Have you ever abandoned an app right at the sign‑up page? Or felt uneasy navigating a website because the buttons were scattered randomly, the colors clashed, and the layout felt confusing and unnecessarily complex?

Maybe you were asked to complete twenty fields in one go. You carefully filled everything out, hit Submit — and only then were you told that your password didn't meet some hidden, unspoken requirement. A requirement that was never communicated upfront.

Instead of helpful guidance, you were met with a vague message: “Invalid input.” Invalid how, you wonder?

Required fields weren’t marked. There was no real‑time validation. No helpful red outline showing which field was wrong. Just a generic prompt telling you to “go back and correct missing information,” as if you were supposed to magically know what the system wanted.

So you scroll.

You search.

You guess.

And you're now getting frustrated.

The reason you're frustrated is simple: no one enjoys repeating a task they thought they had already completed — especially when the mistakes could've been prevented with clear guidance along the way.

You manage to fill in the form and you tap the Submit button.

Nothing happens.

No loading spinner.

No subtle animation.

No confirmation message.

No success screen.

Just silence. For a brief moment, you’re left wondering: Did it go through? So you tap again. And maybe… one more time.

At this point, you're fed up. You either put off signing up until you have more time, or you never return.

Even if you haven’t experienced this exact scenario, you’ve almost certainly felt the same kind of friction: that moment when a digital interface makes you pause, hesitate, or wonder what you’re supposed to do next.

These frustrations often arise because frontend developers either overlook or are unaware of the essential design principles and theories that underpin a smooth, intuitive user experience.

As a frontend developer, your interface should minimise cognitive load, provide immediate clarity, and guide users effortlessly through every task.

In this handbook, I'll introduce the academic theories that should inform and elevate your frontend decisions.

1.0 Fitts’s Law:
2.0 Hick's Law:
- Design Takeaway from Hick's Law
3.0 Gestalt Principles:
4.0 Von Restorff Effect (The Isolation Effect):
- Design takeaways from Von Restorff
5.0 Jakob’s Law
- Design Takeaway from Jakob's Law
6.0 Miller’s Law
- Design Takeaway from Miller's Law
7.0 The Goal-Gradient Hypothesis
- Design Takeaway from Goal-Gradient Hypothesis
8.0 Zeigarnik Effect
- Design Takeaway from Zeigarnik Effect
9.0 Tesler’s Law:
- Design Takeaway from Tesler's Law
10.0 Peak End Rule:
- Design takeaway from Peak End Rule
11.0 Postel’s Law:
- Design Takeaway from Postel's Law
12.0 Doherty Threshold:
- Design Takeaways from Doherty Threshold
13.0 Serial Position Effect (Primacy and Recency):
- Design Takeaways Serial Position Effect
14.0 Occam’s Razor:
- Design Takeaway from Occam's Razor
15.0 Parkinson's Law
- Design Takeaway for Parkinson's law
Conclusion
References

You might wonder what academic theories have to do with frontend development.

The answer is simple. Academic theories aren't abstract ideas. They are the result of rigorous scientific investigation: controlled experiments, validated models, and decades of research into how humans think, learn, perceive, and interact with information.

Because these theories are grounded in evidence rather than opinion, they offer reliable guidance for building interfaces that align with how the human brain actually processes information.

Applying them to frontend development means you're not designing by guesswork or personal preference. Instead, you're applying tested, scientific insights to create clearer, faster, more humane user experiences.

In other words, when you build with academic theory in mind, your frontend becomes more than just visually appealing — it becomes cognitively efficient, behaviourally aligned, and measurably easier for users to navigate.

You can use the following laws and principles to guide your development work. Let’s start by looking at Fitts’s Law.

1.0 Fitts’s Law:

Fitts’s law is the brainchild of Paul Fitts. He was among the early psychologists who recognised that many human errors result from flawed design rather than simple human weakness.

During World War II, he studied airplane cockpit layouts and concluded that numerous incidents attributed to pilot error were actually caused by poor design decisions (Hall, 2023; Budiu, 2022).

Here's the formula:

$$T = a + b \cdot \log_2\left(1 + \frac{D}{W}\right)$$

T = Movement Time

D = Distance to the target

W = Width (size) of the target

a, b = Empirically determined constants

Based on his findings, Fitts postulated that the time required to acquire/reach a target is determined by the distance to the target and the size of the target.

Fig 1.0: Illustration of Fitts’s Law.

From the above, between Target B and Target C, it will be faster to interact with Target C than Target B simply because of the distance (Target B is farther away). Interestingly, though Target A and Target C are at the same distance, Target C will still be faster to interact with and less error-prone because of its larger size.

In simple terms, Fitts’s Law tells us that the time required to move to a target depends on two main factors: the distance to the target and the size of the target. The farther away an element is, the longer it takes to reach. The smaller it is, the more precision it demands, which increases the interaction time and the likelihood of errors.

Conversely, closer and larger targets reduce cognitive load, motor effort, and frustration.
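To make the formula concrete, here is a minimal TypeScript sketch that plugs numbers into it. The function name `fittsMovementTime`, the constants a = 50 ms and b = 150 ms per bit, and all pixel values are illustrative assumptions, not measured data. Real constants have to be determined empirically for a given device and group of users.

```typescript
// Fitts's Law: T = a + b * log2(1 + D / W)
// The defaults for a and b are placeholder values chosen for illustration.
function fittsMovementTime(
  distance: number, // D: distance to the target, in px
  width: number, // W: size of the target, in px
  a: number = 50, // assumed constant, in ms
  b: number = 150 // assumed constant, in ms per bit
): number {
  if (distance < 0 || width <= 0) {
    throw new RangeError("distance must be >= 0 and width must be > 0");
  }
  // Math.log(x) / Math.LN2 is log2(x); written this way so the sketch
  // also compiles for older language targets.
  return a + b * (Math.log(1 + distance / width) / Math.LN2);
}

// Same distance, different sizes: the larger target is faster to hit.
const smallTarget = fittsMovementTime(300, 20);
const largeTarget = fittsMovementTime(300, 100);

// Same size as largeTarget, but farther away: slower to hit.
const farTarget = fittsMovementTime(700, 100);

console.log({ smallTarget, largeTarget, farTarget });
```

With these assumed constants, the small target comes out at roughly 650 ms, the large one at roughly 350 ms, and the far one at roughly 500 ms. The absolute numbers mean little. The ordering is the point: bigger and closer wins.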

In a nutshell, Fitts’s main message to developers is to reduce the distance users must travel on the screen and to make important buttons large and visually dominant.

Fig 1.1: Showing Call-to-Action buttons are the largest and most visually prominent elements on each screen.

In the image above, the Call-to-Action button on each screen is the largest and most visually dominant element. Each one also sits within easy reach of the thumb, which makes it faster and easier to interact with.

You should also place your Call-to-Action button within the natural zone: the area of a phone screen that is easy to reach with the thumb (most people use their thumbs to select things on a phone screen). The diagram below shows the "natural zone" on a typical smartphone. It's much faster for a user to interact within the "natural zone" than within the "hard zone".

Fig 1.2: The three zones for button placement (natural, stretching, and hard)

1.1 Use Padding Wisely

You can apply Fitts’s Law directly by using padding wisely. Padding enlarges an element's interactive area, which in turn increases the size of the target.

This matters. Imagine a menu that disappears the moment your cursor drifts a few pixels outside it. You weren't trying to close it. You simply moved slightly, and suddenly the entire menu collapses. That tiny slip forces you to start the interaction all over again. It’s a small mistake, but it creates a disproportionately frustrating experience.

This happens because the interactive area is too narrow.

That’s why effective padding — or more broadly, generous interactive zones — is essential. By increasing the clickable or hoverable area around a menu, you are increasing the size of the targets, which makes the interaction more stable, more forgiving, and far less cognitively demanding.

This ensures users can move naturally without fear of accidentally “falling off” the target.
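As a rough illustration of the idea, the sketch below computes how much padding a small element needs to reach a comfortable target size. The helper name `paddingForTarget` is hypothetical, and the 44 px minimum is an assumption borrowed from common touch-target guidance (for example, WCAG 2.1 Success Criterion 2.5.5), so adjust it to your own design system.

```typescript
// Assumed minimum target size in CSS px (a guideline value, not a hard rule).
const MIN_TARGET_PX = 44;

// Returns the padding per side needed so that content plus padding
// reaches the minimum target size along one axis.
function paddingForTarget(
  contentSize: number,
  minSize: number = MIN_TARGET_PX
): number {
  if (contentSize < 0 || minSize < 0) {
    throw new RangeError("sizes must not be negative");
  }
  // Round up so the final target is never smaller than the minimum.
  return Math.max(0, Math.ceil((minSize - contentSize) / 2));
}

// A 16px icon needs 14px of padding on each side to become a 44px target.
const iconPadding = paddingForTarget(16);

// A 60px button is already large enough, so no extra padding is required.
const largeButtonPadding = paddingForTarget(60);

console.log({ iconPadding, largeButtonPadding });
```

The visible icon stays 16px wide, but the area that responds to a tap or click grows to 44px, which is the "larger target" Fitts’s Law rewards.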

1.2 Use Infinite Targets

Another fundamental principle that emerges from Fitts’s Law is the idea of infinite targets. When an interface element is placed at the very edge or corner of a screen, it becomes effectively “infinite” because the cursor can't move beyond the screen boundary. The edge acts as a physical barrier, allowing the user to fling the mouse in that direction without precision or careful aiming.

As a result, corners and edges become the fastest, easiest, and most reliable places for users to access important controls.

This is why operating systems such as Apple’s macOS and Microsoft Windows position their most essential menus and buttons at these locations. The macOS Apple Menu sits in the top‑left corner, Windows historically placed the Start button in the bottom‑left corner, and both systems anchor taskbars, docks, and notification areas along screen edges.

These placements reduce cognitive load, minimise motor effort, and increase interaction speed because users do not need to slow down or correct their cursor movement. The screen itself “catches” the pointer.

In essence, infinite targets transform small interface elements into large, easy‑to‑hit zones simply by leveraging the geometry of the screen.

What this means for you: place your most important and frequently used actions where users can reach them with the least effort. Screen edges and corners act as natural stopping points, meaning users can't overshoot them.
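The "screen catches the pointer" behaviour can be modelled in a few lines. This is a simplified sketch, assuming a 1920×1080 display and 40px buttons; the names `ScreenRect`, `clampToScreen`, and `isInside` are made up for the example.

```typescript
interface ScreenRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// The pointer cannot leave the screen, so any overshoot stops at the edge.
function clampToScreen(
  px: number,
  py: number,
  screenWidth: number,
  screenHeight: number
): { x: number; y: number } {
  return {
    x: Math.min(Math.max(px, 0), screenWidth - 1),
    y: Math.min(Math.max(py, 0), screenHeight - 1),
  };
}

// True when the point lies inside the rectangle.
function isInside(target: ScreenRect, px: number, py: number): boolean {
  return (
    px >= target.x &&
    px < target.x + target.width &&
    py >= target.y &&
    py < target.y + target.height
  );
}

const displayWidth = 1920;
const displayHeight = 1080;

const cornerButton: ScreenRect = { x: 0, y: displayHeight - 40, width: 40, height: 40 };
const centreButton: ScreenRect = { x: 940, y: 520, width: 40, height: 40 };

// A careless fling far past the bottom-left corner still lands on the button.
const flung = clampToScreen(-500, 2000, displayWidth, displayHeight);
const cornerHit = isInside(cornerButton, flung.x, flung.y);

// Overshooting the centre button by 60px misses it, because nothing stops the pointer.
const centreHit = isInside(centreButton, 1040, 540);

console.log({ cornerHit, centreHit });
```

Both buttons are the same size, yet the corner one tolerates any amount of overshoot while the centre one tolerates none.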

Design Takeaways from Fitts’s Law:

Place Primary Actions Where the Task Ends:
Placing a submit button at the top‑right forces users to travel all the way back after completing a long form. This increases interaction cost and breaks flow. The best place for a submit button is at the bottom of the form — exactly where the user finishes the task. This aligns with natural reading and interaction patterns.

Keep Related Actions Physically Close:
Separating “Add to Cart” and “Check Out” across opposite sides of the screen forces unnecessary thumb movement. Group related actions to reduce effort and speed up decisions.

Make Primary Targets Large and Visually Dominant:
Your main CTAs (“Subscribe Now,” “Pay Now,” “Create Account,” “Sign Up”) should be the most recognisable elements on the screen. Large, high‑contrast targets reduce errors and improve speed.

Place High‑Value Actions at Screen Edges and Corners:
Edges and corners act as “infinite targets” because the cursor can’t overshoot them. This makes them the fastest, easiest, and most reliable places for critical controls.

A tiny icon in the middle of the screen is hard to hit. The same icon placed at an edge becomes effectively huge because the boundary “catches” the pointer. Also, actions like navigation, primary CTAs, or global controls should live where users can reach them with minimal effort. Avoid burying important actions in the centre of the screen.

Increase Target Size With Generous Padding:
Small interactive zones force users to aim with pixel‑level precision. Adding padding expands the clickable or hoverable area, making interactions easier, faster, and more forgiving.

Prevent Accidental “Fall‑Off” With Larger Hit Areas:
Menus that collapse the moment the cursor drifts slightly create frustration. A wider interactive zone keeps the menu open during natural mouse movement, reducing accidental resets.

Users don’t move perfectly. Interfaces should accommodate slight slips without punishing them. Larger targets reduce cognitive load and eliminate unnecessary frustration. By increasing the effective size of buttons, menus, and controls, you create interactions that feel stable and predictable, and users can move confidently without fear of losing their place.

To Sum Up: The farther away an element is, the longer it takes to reach. The smaller it is, the more precision it demands, which increases the interaction time and the likelihood of errors. Conversely, closer and larger targets reduce cognitive load, motor effort, and frustration.

2.0 Hick's Law:

Hick’s Law is a psychological principle that describes the relationship between the number of choices presented to a user and the time it takes them to make a decision. It was formulated by William Edmund Hick in 1952 (Yablonski, 2022; Proctor & Schneider, 2018).

The law states that as the number of options increases, the decision time increases logarithmically. In simple terms, more choices slow users down, while fewer choices speed up decision-making.

$$T = a + b \cdot \log_2(n + 1)$$

Where:

T = time to make a decision,

n = number of choices,

a = a constant for the baseline time that doesn't depend on the number of choices,

b = a constant that depends on the task and the individual
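The formula can be evaluated directly to see what "logarithmically" means in practice. In this minimal sketch, the function name `hickDecisionTime` and the constants a = 200 ms and b = 150 ms are assumptions picked for illustration; they are not values from Hick's experiments.

```typescript
// Hick's Law: T = a + b * log2(n + 1)
// The defaults for a and b are placeholder values chosen for illustration.
function hickDecisionTime(
  choices: number, // n: number of equally likely choices
  a: number = 200, // assumed baseline time, in ms
  b: number = 150 // assumed time per bit of decision, in ms
): number {
  if (choices < 1 || Math.floor(choices) !== choices) {
    throw new RangeError("choices must be a positive integer");
  }
  // Math.log(x) / Math.LN2 is log2(x).
  return a + b * (Math.log(choices + 1) / Math.LN2);
}

const threeOptions = hickDecisionTime(3);
const fifteenOptions = hickDecisionTime(15);

console.log({ threeOptions, fifteenOptions });
```

With these assumed constants, three options take roughly 500 ms and fifteen options roughly 800 ms. Five times as many choices cost less than twice the time, which is the logarithmic growth the law describes. Every option you remove still saves time, though, and the saving is largest when the list is already short.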

Figure 2.0 illustrates the relationship between user experience, reaction time, and the number of actions.

This is how users feel, for example, when they encounter a form that asks for too much information upfront. The longer the form gets, the more frustrated they become.

Examples of this are overloading menus with too many items, presenting long, unorganised forms, giving too many calls-to-action on one screen, and building nested menus with excessive depth.

All of these create friction and can lead to cognitive overload.

Design Takeaway from Hick's Law

Avoid Overloading Users With Too Many Actions:
Too many buttons, menu items, or choices at once increases cognitive load and slows decision‑making. Users freeze when everything competes for attention.

Keep Navigation Clean and Focused:
Cluttered menus hurt usability, and they can hurt SEO too. Search engine crawlers may struggle with overly complex navigation structures, and users struggle to find what matters.

Use Progressive Disclosure to Reduce Complexity:
Hide advanced or rarely used options under “More” or expandable sections. Reveal complexity only when the user needs it.

Break Complex Tasks Into Smaller, Manageable Steps:
Progressive disclosure works beautifully for multi‑step forms and decision flows. Smaller steps reduce overwhelm and improve completion rates.

Group Related Options Into Logical Categories:
Organising actions into meaningful clusters helps users process information faster. For example, placing “Edit” and “Delete” together leverages natural mental grouping.

Video 2.0: Video description of Progressive Disclosure.

As the video above shows, it is better to hide the menu details initially than to show them all at once. The additional information only appears when the down-arrow button is pressed. This approach prevents overwhelming the user and keeps the interface clean and focused.

You should also reduce decision anxiety, as too many choices create doubt and friction (as they say, the more you ask from a user, the less you get).

Beyond this, use “Recommended” labels, brief descriptions, and visual previews to help users choose. When products have many characteristics, a comparison table can lay out the differences side by side. An example of a comparison table is shown below:

Figure 2.1: A comparison table being used to simplify complex information.

Also, rather than showing advanced configuration options by default, display only the most commonly used settings. Advanced options can be hidden under an expandable section like “Advanced” or “More Settings.” This makes your interface less cluttered and more visually organized.
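One possible way to model this in code is to mark each setting as common or advanced and split the list before rendering. The `SettingItem` shape, the `splitForDisclosure` helper, and the sample settings below are all hypothetical.

```typescript
interface SettingItem {
  id: string;
  label: string;
  advanced: boolean; // true = hide behind an "Advanced" section by default
}

// Separates the settings shown by default from those revealed on demand.
function splitForDisclosure(items: SettingItem[]): {
  shownByDefault: SettingItem[];
  behindAdvanced: SettingItem[];
} {
  return {
    shownByDefault: items.filter((item) => !item.advanced),
    behindAdvanced: items.filter((item) => item.advanced),
  };
}

const accountSettings: SettingItem[] = [
  { id: "display-name", label: "Display name", advanced: false },
  { id: "email", label: "Email address", advanced: false },
  { id: "proxy", label: "Proxy server", advanced: true },
  { id: "dns", label: "Custom DNS", advanced: true },
];

const disclosure = splitForDisclosure(accountSettings);

// Only two settings compete for attention until the user expands "Advanced".
console.log(disclosure.shownByDefault.map((item) => item.label));
console.log(disclosure.behindAdvanced.map((item) => item.label));
```

The user still has access to every option, but the number of choices visible at any one moment stays small, which is what Hick's Law asks for.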

And speaking of visual organization, this is the perfect moment to introduce Gestalt principles — the psychological rules that explain how users naturally group and interpret what they see.

To Sum Up: As the number of options increases, the decision time increases logarithmically.

3.0 Gestalt Principles:

In the 1920s, a group of German psychologists (Max Wertheimer, Kurt Koffka, and Wolfgang Köhler) introduced what are now known as the Gestalt Principles. Their work sought to understand how humans perceive and interpret visual information (Bustamante, 2023).

The word “Gestalt” is German for “unified whole,” reflecting the core idea behind the theory: people naturally perceive objects as organised patterns and complete forms rather than as separate, disconnected parts.

These principles explain how the human mind structures visual elements to make sense of the world. Over time, they have become highly influential in fields such as design, user experience (UX), psychology, and data visualization, where understanding perception is critical.

Key Gestalt Principles:

3.1 Proximity

Elements that are placed close to each other are perceived as a group, while those spaced far apart are seen as separate. This is why labels are placed directly next to their corresponding input fields.

For example: In a blog feed, the "Title," "Author," and "Date" should have small margins between them (8px), while the space between one blog post card and the next should be much larger (40px). This tells the user's brain: "These three text strings belong to this specific post."

Fig 3.0: Illustration of proximity (Gestalt principle)

In the figure above, the spacing within the blog feed plays a powerful role in how effortlessly users interpret what they see. When elements sit close together, the brain instinctively treats them as belonging to the same unit. This is why placing the author credit just 8px beneath the title creates an immediate mental link. The viewer doesn’t need to pause or decode who wrote which article; proximity does the cognitive work automatically, forming a tight, intuitive grouping.

Equally important is the generous 40px gap between individual cards. This larger spacing introduces “visual breathing room.” Without it, a feed can quickly collapse into a dense wall of text, overwhelming the user and discouraging exploration. The wider margin establishes a clear boundary—a natural stop-and-start rhythm—that makes each card feel distinct and the entire layout more scannable.

Finally, subtle spacing differences can guide behaviour, not just perception. The slightly larger 12px margin above the read‑more link separates it from the passive information above it. This spacing cues the user that the link represents an action rather than another piece of descriptive text. It’s a small adjustment, but it shifts the element’s role from informational to interactive, helping users understand what they can do next.

Together, these spacing decisions transform a simple list of posts into a structured, intuitive, and behaviourally clear interface—one where the user never has to think about the layout, because the layout is already thinking for them.

Proximity controls meaning: move elements closer to show connection, separate them to show difference.
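The spacing values from the blog-card example can be captured as named constants, with a small check that the gap between groups is clearly larger than the gap inside a group. The names `cardSpacing` and `readsAsSeparateGroups` are hypothetical, and the 2× ratio is an assumed rule of thumb rather than a figure from Gestalt research.

```typescript
// Spacing values (in px) taken from the blog-card example.
const cardSpacing = {
  titleToAuthor: 8, // tight: title and author belong together
  metaToReadMore: 12, // slightly looser: sets the action apart
  betweenCards: 40, // wide: each card is its own group
};

// Proximity only signals grouping when the outer gap is clearly larger
// than the inner gap. The default ratio of 2 is an assumption.
function readsAsSeparateGroups(
  innerGap: number,
  outerGap: number,
  minRatio: number = 2
): boolean {
  return outerGap >= innerGap * minRatio;
}

// 40px between cards against 12px inside a card: clearly separate groups.
const feedIsScannable = readsAsSeparateGroups(
  cardSpacing.metaToReadMore,
  cardSpacing.betweenCards
);

// 20px between cards against 16px inside a card: the feed blurs into one block.
const crampedFeedIsScannable = readsAsSeparateGroups(16, 20);

console.log({ feedIsScannable, crampedFeedIsScannable });
```

Keeping these values in one place also makes it harder for a one-off margin to creep in and weaken the grouping.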

3.2 Similarity

We naturally group elements that share similar visual characteristics, such as color, shape, size, or orientation.

For example, even if buttons are spread across a page, if they're all the same shade of blue, the user understands they perform similar functions.

If your primary "Submit" button is blue with rounded corners, every other primary action on your site should look exactly the same. If you suddenly use a square red button for a primary action, the user will be confused because the "similarity" is broken.

Fig 3.1: Illustration of similarity (Gestalt principle)

As you can see from above, the layout clearly demonstrates how the Gestalt Principle of Similarity works by showing two different visual situations: one where everything matches, and one where a single element breaks the pattern.

All three product cards share the same visual characteristics:

- Same card shape
- Same border and shadow
- Same image size and placement
- Same blue “Add to Cart” button
- Same font style and spacing

Because these elements look alike, your brain automatically groups them as one category — “products that belong together.”
You don’t have to think about it; the similarity creates instant visual unity.

This is the Gestalt Principle of Similarity in action.

In the second row, everything is still similar except one button:

- The middle product’s button is orange, not blue
- It has square corners, not rounded
- The text is italic, not regular
- The label changes to “Quick Buy”

Because this button breaks the shared pattern, your brain immediately notices it and treats it as different or special.

Developers can use broken similarity to intentionally highlight featured items, promotions, or urgent actions.

When similarity is broken, the different element stands out and draws attention.
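In code, similarity usually comes down to defining each button style once and reusing it everywhere. The sketch below is one way to do that; the variant names, the `CtaStyle` shape, and the colour values are assumptions made for the example.

```typescript
type CtaVariant = "primary" | "secondary" | "featured";

interface CtaStyle {
  background: string;
  borderRadiusPx: number;
  fontStyle: "normal" | "italic";
}

// A single source of truth: every button of a given variant looks identical.
const CTA_STYLES: Record<CtaVariant, CtaStyle> = {
  primary: { background: "#1d4ed8", borderRadiusPx: 8, fontStyle: "normal" },
  secondary: { background: "#e5e7eb", borderRadiusPx: 8, fontStyle: "normal" },
  // Deliberately different, reserved for the one item that should stand out.
  featured: { background: "#f97316", borderRadiusPx: 0, fontStyle: "italic" },
};

function ctaStyleFor(variant: CtaVariant): CtaStyle {
  return CTA_STYLES[variant];
}

// Every "Add to Cart" button shares the primary style...
const addToCartStyle = ctaStyleFor("primary");
// ...while a single "Quick Buy" button breaks the pattern on purpose.
const quickBuyStyle = ctaStyleFor("featured");

console.log({ addToCartStyle, quickBuyStyle });
```

Because styles are looked up by variant rather than written inline, a square orange button can only appear where someone explicitly asked for the featured variant.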

3.3 Continuity

The human eye prefers to follow a continuous path or curve rather than jagged or broken lines. We perceive items aligned on a line or curve as being related. This is often used in navigation menus or horizontal carousels to guide the user's gaze.

For example, you might have a horizontal carousel where the last visible card is slightly "cut off" at the edge of the screen. This visual break creates a path that encourages the user to keep scrolling as their eyes follow the line of cards.

Fig 3.2: Illustration of continuity (Gestalt principle)

As you can see, all four form fields — First Name, Last Name, Email Address, and Phone Number — are perfectly aligned along one continuous horizontal path. Because the human eye naturally prefers to follow an unbroken line, your gaze moves smoothly from left to right across the fields without effort.

The final field is slightly cut off at the edge, which creates a subtle visual cue that the line continues beyond the visible area. This encourages the user to keep scrolling or swiping, because their eyes are already following the direction of the form.

When elements are arranged along a straight path, curve, or flow, the brain automatically treats them as connected and expects the pattern to continue.

Another example is Instagram Stories, which are arranged in a smooth horizontal line at the top of the app. Instagram reinforces this by slightly revealing the next story circle at the edge of the screen. That tiny “peek” acts as a continuation cue — your eyes expect the line to keep going, so your finger follows.

Fig 3.3: Illustration of continuity (Gestalt principle)

As you can see from above, all the circular story icons are arranged in a straight horizontal line, and your visual system instinctively follows that line from left to right without effort.

The slight visibility of the next story at the edge of the screen strengthens this effect, signaling that the sequence continues beyond what's currently shown. Also, because the icons share the same size, spacing, and shape, there are no visual interruptions, allowing your eyes to glide across them in one continuous motion.

This seamless flow is exactly what continuity describes: the tendency of the human eye to follow the direction of a line or pattern, assuming it continues even when part of it is out of view.

Continuity is the tendency of the human eye to follow the direction of a line or pattern, assuming it continues even when part of it is out of view.

3.4 Closure

Closure refers to the mind’s ability to perceive a complete, unified form even when parts of that form are missing. Rather than requiring every boundary, line, or shape to be explicitly drawn, the brain instinctively fills in the gaps. When used intentionally, closure allows interfaces to feel cleaner, more elegant, and more cognitively efficient.

When we look at a complex arrangement of visual elements, we tend to look for a single, recognisable pattern. If an image is missing parts, our brains fill in the gaps to "close" the shape.

One of the most celebrated examples of closure in visual identity design is the panda symbol used by the World Wide Fund for Nature (WWF). This logo demonstrates how strategic omission can produce a memorable, emotionally resonant, and universally recognisable mark.

At first glance, the panda illustration appears simple, composed of a few bold black shapes arranged against a white background.

Yet a closer look reveals that the panda is not fully drawn. There are no outlines defining the body, no complete contours around the head, and no explicit boundaries separating limbs from background. Instead, the designer uses a series of carefully placed shapes (ears, eye patches, nose, and partial limbs) to imply the rest of the animal. The viewer’s mind fills in the missing information, completing the silhouette effortlessly.

This is closure at its most effective: the brain constructs a whole from fragments, creating a sense of completeness without visual overload.

Fig 3.4: Illustration of closure (Gestalt principle)

For example, a "hamburger menu" (three lines) isn't a literal drawer, but our brains "close" the shape to understand it represents a menu.

Fig 3.5: Illustration of closure (Gestalt principle)

An example of closure in practice can be seen in step indicators commonly used in checkout flows. These components often rely on partial shapes, implied boundaries, and incomplete outlines to guide the user through a sequence of actions.

For instance, upcoming steps may be represented by dashed circles. Although the circles aren't fully drawn, the viewer immediately recognises them as complete shapes. The brain resolves the missing segments, allowing the interface to communicate progression without heavy borders or fully rendered icons. This subtle use of closure reduces visual clutter while preserving clarity.

Closure refers to the mind’s ability to perceive a complete, unified form even when parts of that form are missing.

3.5 Figure/Ground

This principle describes the mind's tendency to separate an object (the figure) from its surrounding area (the ground or background). In web design, using a "modal" or "pop-up" relies on this: by blurring the background, you force the user to see the pop-up as the focal figure.

When a user clicks "Login" and a modal (lightbox) opens, the site behind it often dims (the "Ground") while the login box stays bright and centered (the "Figure"). This immediate depth change tells the user exactly where their attention belongs.

Video 3.5.0 Video description of Figure/Ground (Gestalt Principle)

From the video above, you can see that when the Quick View button is clicked, the selected figure stands out while the background darkens. This contrast guides the user’s attention and helps them focus on the figure. Developers can use this technique to direct users’ attention to what matters most or to what they want users to notice.

This principle describes the mind's tendency to separate an object (the figure) from its surrounding area (the ground or background).
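To see how strongly a dark overlay pushes the page into the background, you can compute the colour a user actually perceives behind the modal. This is a simplified sketch that blends a colour with a black overlay; the names `RgbColour` and `dimWithOverlay` and the 60% opacity are assumptions for the example.

```typescript
type RgbColour = [number, number, number];

// Blends a colour with a black overlay of the given opacity (0 to 1).
// An opacity of 0 leaves the colour unchanged; 1 turns it fully black.
function dimWithOverlay(colour: RgbColour, overlayOpacity: number): RgbColour {
  if (overlayOpacity < 0 || overlayOpacity > 1) {
    throw new RangeError("overlayOpacity must be between 0 and 1");
  }
  const keep = 1 - overlayOpacity;
  return [
    Math.round(colour[0] * keep),
    Math.round(colour[1] * keep),
    Math.round(colour[2] * keep),
  ];
}

const pageBackground: RgbColour = [255, 255, 255];

// Behind a 60% black overlay, a white page reads as mid-grey (102, 102, 102),
// while a white modal drawn on top of the overlay stays at (255, 255, 255).
const dimmedGround = dimWithOverlay(pageBackground, 0.6);

console.log(dimmedGround);
```

The larger the gap between the dimmed page and the untouched modal, the more clearly the modal reads as the figure.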

3.6 Common Fate

Elements that move in the same direction are perceived as more related than elements that are stationary or move in different directions. Think of a dropdown menu: when all sub-items slide down together, they are clearly part of the same "unit."

For example, when you click a FAQ header and five sub-items slide down at the exact same speed and direction, the "Common Fate" tells the user that all those items belong to that specific category. If they flew in from different directions, the relationship would be lost.

Video 3.6.1 Video description of common fate (Gestalt Principle)

Video 3.6.2 Video description of common fate (Gestalt Principle)

From the video shown above, the e‑commerce animation example demonstrates these principles clearly by using two distinct motion patterns: a group of regular products that move upward together, and a pair of special‑category items that enter dramatically from the left. Through these contrasting movements, the interface communicates category differences without relying on text labels or explicit instructions.

Therefore, developers can use this motion‑based differentiation as a design strategy to guide users’ perception—allowing the interface to signal hierarchy, category structure, and product importance purely through animated behaviour rather than through static visual labels.

Elements that move in the same direction are perceived as more related than elements that are stationary or move in different directions.

3.7 Focal Point

Whatever stands out visually will capture and hold the viewer’s attention first. This is essentially the principle of emphasis. A bright "Sign Up" button in a sea of gray text acts as the focal point, directing the user's primary action.

For example, an alert banner or a pricing table should stand out from its surroundings. Beyond this, in a three-tier pricing table (Basic, Pro, Enterprise), the "Pro" column is often slightly larger or a different color. This creates a focal point that draws the eye to the "recommended" option immediately.

Fig 3.7: Illustration of focal point (Gestalt principle)

In visual interface design, the Gestalt principle of Focal Point plays a crucial role in directing user attention toward the most important element on a screen.

A focal point is created when one element breaks the established pattern of surrounding elements, making it stand out immediately.

In e‑commerce interfaces, this principle is often applied to highlight primary actions such as purchasing, subscribing, or upgrading. The “Buy Now” button provides a clear and practical example of how focal points function within a layout.

From the example above, the first two buttons share the same visual characteristics: neutral colours and regular-weight text. This repetition establishes a visual pattern that the user quickly becomes familiar with.

But the “Buy Now” button intentionally disrupts this pattern. It uses a bright colour, which contrasts sharply with the muted tones of the other buttons. This colour difference alone is enough to draw the eye, as humans are naturally sensitive to changes in hue and saturation within a uniform environment.

The Focal Point may sound like it's similar to the principle of Similarity, but the two operate in completely opposite ways within perceptual psychology.

Similarity explains how the mind naturally groups elements that share visual characteristics – such as colour, shape, or size – into coherent units. Once this grouping is established, the interface gains structure and predictability.

Focal Point, on the other hand, works by intentionally breaking that structure. Instead of reinforcing uniformity, it introduces a deliberate contrast – through colour, scale, brightness, or motion – to draw the viewer’s attention to one specific element.

In other words, Similarity creates the background pattern, while Focal Point identifies the one element that must stand out against that pattern.

Whatever stands out visually will capture and hold the viewer’s attention first.

Design Takeaways from the Gestalt Principles

Use Spacing as Your Primary Grouping Tool:
Elements that belong together should sit closer to each other than to anything else. Spacing communicates structure faster than borders or boxes. Use tight internal spacing (6–12px) for related items and wide external spacing (24–48px) to separate groups.

Build a Strict, Consistent Visual System — and Stick to It:
Define clear rules for button types, text styles, icon sizes, and alignment patterns. Consistent left‑aligned text blocks, predictable carousel lines, and stable flow patterns reduce cognitive load and make interfaces feel trustworthy.

Guide the Brain With Spacing, Alignment, Consistency, Contrast, and Motion:
The human brain is always trying to group, follow, and prioritise what it sees. Your job is to guide that instinct through intentional layout decisions, not fight against it.

4.0 Von Restorff Effect (The Isolation Effect):

This effect was first described by Hedwig von Restorff in 1933. The principle states that an item that stands out is more noticeable, and more likely to be remembered, than the items around it (Hunt, 1995).

So unique or visually distinct elements grab attention and are more memorable – in other words, distinctiveness dictates memory. When a user interacts with an interface, their brain naturally seeks patterns to minimize cognitive effort.

While consistency is generally a virtue in design, a perfectly uniform layout can lead to "banner blindness" or habituation, where the user stops noticing details.

By strategically breaking a pattern through changes in color, size, shape, or spacing, the developer can "isolate" an element, triggering a biological response that flags the item as high-priority.

Note that although the Focal Point principle may initially seem similar to the Von Restorff Effect, they describe two different psychological processes.

Focal Point is a Gestalt visual principle that explains how one element becomes the centre of attention within a composition because it carries the strongest visual contrast – through size, colour, brightness, position, or motion. Its purpose is to guide the viewer’s eye toward the most important element in the layout.

The Von Restorff Effect comes from cognitive psychology, not Gestalt theory. It states that an item that is noticeably different from a group of similar items is not only more attention‑grabbing but also more memorable.

So Focal Point is about where the eye goes first, while the Von Restorff Effect is about what the brain remembers later.

Design takeaways from Von Restorff

Use Isolation to Make CTAs Impossible to Miss:
On a page filled with neutral text and standard links, a single high‑contrast button (like a bold “Primary Blue” or “Emergency Red”) instantly becomes the standout element. This leverages the Von Restorff Effect to pull the user’s eye toward the most important action.

Create a Visual “Hitch” in the Scan Path:
A distinct CTA interrupts the user’s natural left‑to‑right, top‑to‑bottom scanning rhythm. This makes actions like “Buy Now” or “Sign Up” the first thing they notice and the last thing they forget.

Make Critical Actions Visually Distinct:
Because users naturally notice the one element that breaks a pattern, your most important actions should use deliberate contrast — color, size, shape, weight, or motion. Isolate key information instead of letting it blend into surrounding UI noise.

Avoid Over‑Differentiation, or Nothing Stands Out:
If every button is loud, animated, or uniquely styled, the interface becomes chaotic. The Von Restorff Effect only works when there is a clear, stable pattern that you break once, intentionally.

To Sum Up: An item that stands out is more noticeable and more likely to be remembered than other items.

5.0 Jakob’s Law

Jakob’s Law states that users spend most of their time on other sites, so they expect your interface to behave like the ones they already know.

Familiar patterns — hamburger menus, top navigation, search icons, and clickable top‑left logos — reduce cognitive load because users don’t have to interpret anything new.

But while Jakob’s Law is foundational to UX, I think it can also unintentionally suppress innovation.

When developers over‑prioritise familiarity, they fall into a standardisation trap: endlessly optimising conventional patterns instead of exploring fundamentally better ones.

The Pie Menu is a perfect illustration of this. According to Fitts’s Law, the time required to reach a target depends on its distance and size. Linear menus place the last item much farther from the cursor than the first, creating uneven interaction costs.

Radial menus position every option at an equal distance from the centre, and their wedge‑shaped targets effectively grow larger as the pointer moves outward.
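The comparison between linear and radial menus can be sketched numerically. The snippet below uses the common Shannon formulation of Fitts's Law, T = a + b * log2(1 + D / W). The constants `a` and `b`, the item sizes, and the arc-length approximation for wedge width are illustrative assumptions, not measured values.

```typescript
// Fitts's Law (Shannon formulation): T = a + b * log2(1 + D / W).
// a and b depend on the device and user; the defaults are placeholders.
function fittsTimeMs(distancePx: number, widthPx: number, a = 50, b = 150): number {
  return a + b * Math.log2(1 + distancePx / widthPx);
}

// Linear menu: the centre of item i (0-based) sits (i + 0.5) item heights
// away from the cursor, so later items cost more to reach.
function linearMenuTimes(itemCount: number, itemHeightPx: number): number[] {
  return Array.from({ length: itemCount }, (_, i) =>
    fittsTimeMs((i + 0.5) * itemHeightPx, itemHeightPx)
  );
}

// Radial menu: every wedge is reached at the same radius from the centre.
// Wedge width is approximated by its arc length at that radius.
function radialMenuTimes(itemCount: number, radiusPx: number): number[] {
  const arcWidthPx = (2 * Math.PI * radiusPx) / itemCount;
  return Array.from({ length: itemCount }, () => fittsTimeMs(radiusPx, arcWidthPx));
}
```

Under these assumptions, the linear menu's predicted time grows with each item, while every radial item has the same predicted cost.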

Under Fitts’s Law, pie (radial) menus should therefore be faster to interact with and more efficient, yet they remain rare in mainstream web design because they violate users’ expectations. In other words, Jakob’s Law keeps us locked into a familiar but suboptimal pattern simply because “that’s how it’s always been done.”

The challenge, then, is not choosing between familiarity and innovation but balancing them.

This is where the Aesthetic–Usability Effect becomes powerful. Research shows that users perceive attractive interfaces as easier to use, and they are more forgiving of minor usability friction when the design is visually pleasing.

A beautifully crafted Pie Menu, for example, can encourage users to invest the small amount of learning required to use it. By applying aesthetic delight strategically, developers can introduce innovative patterns without overwhelming users.

The principle that emerges is simple: Be conventional where it matters, and innovative where it delights.

Design Takeaway from Jakob's Law

Keep Trust‑Critical Elements Predictable:
Navigation, search, authentication, and other high‑stakes interactions must follow established conventions. Users rely on these patterns for speed, confidence, and safety — this is where Jakob’s Law should be respected without exception.

Experiment Only in Low‑Risk, High‑Creativity Areas:
In creative or productivity‑focused zones — like editing tools in a photo app — you can safely introduce new interaction models such as radial menus, gesture wheels, or context‑aware tool selectors. These areas invite exploration and benefit from efficiency‑driven innovation.

To Sum Up: Be conventional where it matters, and innovative where it delights.

6.0 Miller’s Law

Miller’s Law originates from George A. Miller’s classic paper “The Magical Number Seven, Plus or Minus Two.” It states that the average person can hold only about 7 (±2) chunks of information in working memory at any given moment (Miller, 1956).

Crucially, Miller emphasised that the brain doesn’t store isolated items — it groups them into meaningful units called chunks. Because working memory is so limited, developers must structure information in ways that respect this cognitive boundary.

This principle has direct implications for interface design. Long, unbroken strings of information overwhelm users, whereas chunked formats are far easier to process.

For example, instead of displaying a phone number as 1234567890, formatting it as 123‑456‑7890 transforms ten digits into three manageable chunks. The same logic applies to navigation: aim for five to nine primary menu items, and if you need more, group them into categories. Users remember the category as a single chunk rather than each individual link.
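As a minimal sketch of the phone-number example, the helper below groups a ten-digit number as 3-3-4. The function name is hypothetical, and other lengths are left untouched because grouping conventions vary by country.

```typescript
// Groups a 10-digit number as 3-3-4 for display.
// Any other length is returned unchanged rather than guessed at.
function chunkPhoneNumber(raw: string): string {
  const digits = raw.replace(/\D/g, "");
  if (digits.length !== 10) return raw;
  return `${digits.slice(0, 3)}-${digits.slice(3, 6)}-${digits.slice(6)}`;
}
```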

Miller’s Law also explains why long forms are so intimidating. When a user sees 30 fields on one page, their brain interprets it as a single, massive task — far beyond the 7±2 limit.

A progressive stepper solves this by breaking the form into smaller stages of 5–7 fields each. This reduces cognitive load, creates a sense of progress, and significantly lowers abandonment rates.
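One way to prepare data for such a stepper is to split the flat field list into fixed-size steps. This is only a sketch: the field names are placeholders, and a real form would usually group fields by meaning (contact details, address, payment) rather than purely by count.

```typescript
// Splits a flat list into steps of at most `maxPerStep` items.
function splitIntoSteps<T>(items: T[], maxPerStep: number): T[][] {
  if (!Number.isInteger(maxPerStep) || maxPerStep < 1) {
    throw new RangeError("maxPerStep must be a positive integer");
  }
  const steps: T[][] = [];
  for (let i = 0; i < items.length; i += maxPerStep) {
    steps.push(items.slice(i, i + maxPerStep));
  }
  return steps;
}

// Hypothetical 30-field form, split into steps of 6 fields each.
const signupFields = Array.from({ length: 30 }, (_, i) => `field${i + 1}`);
const signupSteps = splitIntoSteps(signupFields, 6);
```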

The same principle applies to product listings or search results. Expecting users to compare 50 items at once is unrealistic. Instead, provide strong filtering tools so users can narrow the set to a manageable size — ideally within the range their working memory can meaningfully evaluate.

In essence, Miller’s Law reminds developers that humans don’t process information in bulk. They process it in structured, meaningful chunks.

Fig 6.0: Illustrating progressive stepper

In the example above, the interface uses both a progress bar and a stepper to guide the user through multiple stages of a task. After completing the first page and selecting “Continue,” the user moves to the next step, and the progress bar updates accordingly. This creates a clear sense of forward movement and accomplishment.

By breaking the process into smaller segments, the interface prevents cognitive overload. If all the information were presented on a single page, users might feel overwhelmed, unsure where to begin, or discouraged by the sheer volume of work.

A step‑by‑step flow transforms a large task into a sequence of manageable actions, increasing the likelihood of completion.

Design Takeaway from Miller's Law

Respect the 7±2 Working‑Memory Limit:
Users can only hold about seven chunks of information at once. Long, unbroken content overwhelms them, while chunked information is instantly easier to process.

Chunk Information Into Meaningful Units:
The brain doesn’t store isolated items — it groups them. Format data (like phone numbers), menus, and settings into clear, memorable chunks instead of long, flat lists.

Keep Navigation Within 5–9 Primary Items:
If you need more than nine options, group them into categories. Users remember the category as a single chunk, not each individual link.

Break Long Forms Into Smaller Steps:
A 30‑field form feels like one giant task. A progressive stepper with 5–7 fields per step keeps users below the cognitive overload threshold and dramatically reduces abandonment.

Reduce Comparison Load With Strong Filters:
Expecting users to compare 50 products at once is unrealistic. Provide filtering tools that shrink the decision set to something the working memory can actually handle.

Design for Chunked Thinking, Not Bulk Processing:
Humans don’t process information in bulk — they process structured, meaningful groups. Interfaces that respect this limitation feel lighter, faster, and more intuitive.

To Sum Up: A step‑by‑step flow transforms a large task into a sequence of manageable actions, increasing the likelihood of completion.

7.0 The Goal-Gradient Hypothesis

This is the perfect moment to introduce the Goal‑Gradient Hypothesis, originally proposed by behaviorist Clark Hull in 1932 (Yablonski, 2022). The hypothesis states that people become more motivated as they get closer to achieving a goal. In other words, users naturally accelerate their engagement when they sense they are nearing completion.

This principle is incredibly powerful in UX design, especially for progress tracking, gamification, and reward systems.

The takeaway is straightforward: Because users are more motivated near the finish line, progress indicators should be prominent and meaningful.

Percentages, progress bars, and step counters reinforce momentum. Micro‑achievements — such as badges, checkmarks, or subtle confetti — amplify motivation by celebrating small wins.

Tasks should be broken into measurable milestones so users can see themselves advancing.

This is why e‑learning platforms display messages like “You’ve completed 8 of 10 lessons — almost there!” and why fitness apps highlight progress with prompts such as “3 km done, 2 km to go.” These cues leverage the goal‑gradient effect to keep users engaged, energized, and eager to finish.
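Messages like these can be generated from two numbers. The sketch below is one possible approach; the 75% cut-off for adding "almost there" is an assumed value that a team would tune for its own product.

```typescript
interface ProgressInfo {
  percent: number; // 0 to 100, rounded
  message: string;
}

// `nearThreshold` is an assumed cut-off for when to add encouragement.
function describeProgress(
  done: number,
  total: number,
  unit: string,
  nearThreshold = 0.75
): ProgressInfo {
  if (total <= 0 || done < 0 || done > total) {
    throw new RangeError("expected 0 <= done <= total and total > 0");
  }
  const ratio = done / total;
  const percent = Math.round(ratio * 100);
  if (done === total) {
    return { percent, message: `All ${total} ${unit} completed!` };
  }
  const base = `You've completed ${done} of ${total} ${unit}`;
  const message = ratio >= nearThreshold ? `${base}, almost there!` : `${base}.`;
  return { percent, message };
}
```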

By combining progressive steppers with clear progress feedback, developers create interfaces that feel lighter, more encouraging, and far more motivating — ultimately improving completion rates and overall user satisfaction.

But what happens when a goal isn't completed? Why do we sometimes feel uncomfortable leaving things unfinished? That discomfort is explained by another psychological principle called the Zeigarnik Effect — the tendency for people to remember and feel tension about incomplete tasks. We will look at this next.

Design Takeaway from Goal-Gradient Hypothesis

Make Progress Visible to Boost Motivation:
According to the Goal‑Gradient Hypothesis, users naturally speed up as they sense they’re nearing completion. Prominent progress bars, percentages, and step counters tap into this instinct and keep momentum high.

Celebrate Micro‑Achievements to Reinforce Engagement:
Badges, checkmarks, subtle confetti, and “step completed” cues reward small wins. These micro‑rewards amplify motivation and make long tasks feel lighter and more achievable.

Break Tasks Into Measurable Milestones:
Users stay motivated when they can see themselves advancing. Divide complex flows into clear steps so progress feels tangible rather than overwhelming.

Use Progress Feedback to Drive Completion:
Messages like “8 of 10 lessons completed — almost there” or “3 km done, 2 km to go” leverage the goal‑gradient effect to energise users and pull them toward the finish line.

Combine Steppers With Clear Feedback for Maximum Impact:
Progressive steppers paired with strong visual feedback create interfaces that feel encouraging, structured, and motivating — dramatically improving completion rates.

Video 8.0 : Video illustrating goal gradient

To Sum Up: People become more motivated as they get closer to achieving a goal.

8.0 Zeigarnik Effect

The Zeigarnik Effect is a psychological principle stating that people remember unfinished or interrupted tasks better than completed ones (Cherry, 2024).

Memory begins with sensory input, which is processed into short-term memory. Unfinished tasks persist in our thoughts, leading to active recall. This ongoing engagement can turn them into long-term memories, enhancing recall until the task is resolved. In a product, that lingering tension increases engagement, encourages task completion, improves retention, and drives conversions.

So because people remember unfinished tasks better than completed ones (Zeigarnik Effect), developers use progress indicators to make users aware that something is incomplete and motivate them to finish it.

In your designs, you can break long forms into multi-step processes to encourage completion and display profile completion percentages (for example, 70% complete) to push users toward 100%.
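A profile completion figure can be derived from which required fields are still empty. The sketch below assumes a flat profile object with string values; the type and function names are hypothetical.

```typescript
type Profile = Record<string, string | undefined>;

// Returns the completion percentage and the required fields still empty,
// so the UI can show both "70% complete" and what is left to do.
function profileCompletion(
  profile: Profile,
  requiredFields: string[]
): { percent: number; missing: string[] } {
  const missing = requiredFields.filter((field) => {
    const value = profile[field];
    return value === undefined || value.trim() === "";
  });
  const filled = requiredFields.length - missing.length;
  const percent =
    requiredFields.length === 0 ? 100 : Math.round((filled / requiredFields.length) * 100);
  return { percent, missing };
}
```

Listing the missing fields next to the percentage tells the user exactly which step closes the gap.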

This is the main reason why e-commerce platforms send abandoned cart reminders to bring users back to complete their purchases. It's also why apps use streak systems to encourage daily engagement and habit formation and learning platforms show course completion bars to motivate users to finish modules.

Design Takeaway from Zeigarnik Effect

Unfinished Tasks Stay Active in Memory — Use That to Drive Completion:
Because incomplete tasks linger in working memory (Zeigarnik Effect), users naturally keep thinking about what they haven’t finished. This tension boosts recall, engagement, and the likelihood of returning to complete the task.

Make Incompleteness Visible With Progress Indicators:
Progress bars, percentages, and step counters remind users that something is still unfinished. This gentle psychological pressure motivates them to continue until the task is complete.

Break Long Flows Into Multi‑Step Processes:
A massive form feels overwhelming, but a stepper with smaller chunks keeps users moving. Showing “70% complete” nudges them toward finishing the last stretch.

Use Reminders to Re‑activate Unfinished Intent:
Abandoned cart emails, streak reminders, and “continue your lesson” prompts work because the unfinished task is already active in the user’s mind. The reminder simply pulls them back into the loop.

Celebrate Completion to Close the Cognitive Loop:
Checkmarks, confirmations, and completion badges give users closure. This resolves the mental tension created by the unfinished task and reinforces positive behaviour.

To Sum Up: Unfinished tasks persist in our thoughts, leading to active recall.

9.0 Tesler’s Law:

This law was proposed by Lawrence Tesler. He was a computer scientist known for his work on human-computer interaction, and he contributed significantly to making software more user-friendly, including work on cut, copy, and paste functionality.

This law is otherwise known as the Law of Conservation of Complexity. The core idea is that every process has a certain amount of “inherent complexity” that can't be removed. You can only decide who handles it: the user or the system.

Some examples of these inherent complexities might be:

translating user actions into correct operations behind the scenes,
handling unreliable or slow network connections,
connecting with third-party APIs, services, or legacy systems,
sorting large datasets quickly,
performing complex search operations,
managing version changes and compatibility issues,
managing state, interactions, and animations without confusing the user.

All of these can be inherently complex, but it's the job of the developer to deal with the complexity.

As a developer, you should push as much complexity as possible to the system. For example, instead of making a user type their full address manually, use an address autocomplete API (Google’s Places API is one option). The complexity of finding and validating the address still exists, but the software handles the work for them.

Here's a practical example: let’s say you're designing a student platform that requires users to enter their university name. A practical approach would be to store an array of all universities in the UK in your codebase (this is the inherent complexity Tesler described, absorbed by the system rather than the user).

As the user types, they don't need to enter the full name, and their full university name is shown (relating to what they have typed). For instance, if they intend to type “University of Sheffield,” simply typing “Sheff” should prompt the system to display the full university name, which they can then select.

In Dart, you can use a package like fuzzysearch to implement this kind of intelligent matching.
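For a web front end, the same idea can be sketched in TypeScript without any library. The version below is a simplification: it does case-insensitive substring matching on each typed word, so it handles partial input like “Sheff” but not misspellings, which a true fuzzy-search library would cover. The short list and the function name are placeholders.

```typescript
// A short sample list; a real app would load the full set of institutions.
const UNIVERSITIES: string[] = [
  "University of Sheffield",
  "Sheffield Hallam University",
  "University of Leeds",
  "University of Manchester",
];

// Every typed word must appear somewhere in the name (case-insensitive).
function suggestUniversities(
  query: string,
  names: string[] = UNIVERSITIES,
  limit = 5
): string[] {
  const tokens = query
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  if (tokens.length === 0) return [];
  return names
    .filter((candidate) => {
      const lower = candidate.toLowerCase();
      return tokens.every((token) => lower.includes(token));
    })
    .slice(0, limit);
}
```

Because the user picks from the suggestions instead of typing free text, the stored value is always one of the canonical names.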

The advantage of this approach is greater than it first appears. It improves data consistency because users often enter the same information in different ways. For example, some users might type “Uni of Sheff,” others “Sheffield University,” and others “Uni of Sheffield,” while all are referring to “University of Sheffield.”

This is how messy data is created, and it creates more work for data analysts. Little wonder that, by some estimates, data analysts spend up to 70% of their time cleaning data.

If developers invested more time in structuring how data is collected to ensure consistency, there would be far less work downstream for analysts. This same logic should be applied in how we collect date, time, and other information.

So apart from people's names and email addresses, you should try to standardize the data your app collects as much as possible. Use date and time pickers, stepper controls, input masks, checkboxes, dropdown menus, radio buttons, toggle switches, and so on.

The essence of removing complexity from the user is not only about improving usability, but also about ensuring that the data collected is standardised, structured, and consistent.

Design Takeaway from Tesler's Law

Push Complexity to the System, Not the User:
Every process contains unavoidable complexity. Your job is to handle it behind the scenes so the user experiences the simplest possible interaction.

Automate Tasks Users Shouldn’t Have to Think About:
Use tools like autocomplete, fuzzy search, intelligent defaults, and validation APIs to remove manual effort. The complexity still exists — but the system absorbs it instead of the user.

Standardise Inputs to Prevent Messy Data:
Users enter the same information in wildly different ways. Use pickers, dropdowns, input masks, radio buttons, and toggles to enforce consistent, structured data collection.

Handle Inherent Technical Complexity Internally:
Network issues, API quirks, large dataset sorting, search optimisation, state management, and animation logic are all developer responsibilities. Users should never feel this complexity.

To Sum Up: Every process contains unavoidable complexity. Your job as a developer is to handle it behind the scenes so the user experiences the simplest possible interaction.

10.0 Peak End Rule:

In 1993, Daniel Kahneman, Barbara Fredrickson, Charles Schreiber, and Donald Redelmeier invited volunteers into a lab for what sounded like a simple experiment. The task was straightforward: place a hand into a container of painfully cold water (Kahneman et al., 1993).

In the first round, participants kept their hand in 14°C water for 60 seconds. It was uncomfortable, sharp, and unpleasant but after one minute, it was over.

In the second round, they again endured 60 seconds in 14°C water. But this time, they were asked to keep their hand in for an extra 30 seconds. The temperature was raised slightly to 15°C. Still cold. Still unpleasant. Just slightly less intense.

Objectively, the second experience was worse. It lasted 90 seconds instead of 60. More total pain. More suffering.

Later, the researchers asked a simple question:

“If you had to repeat one of the trials, which would you choose?” Surprisingly, most participants chose the longer one.

Why would anyone choose more pain?

The researchers realised something profound: people don’t remember experiences by calculating total discomfort. Instead, the mind summarizes the experience using just two key moments — the most intense point (the peak) and the final moment (the end).

In both trials, the peak pain was the same: 14°C. But the longer trial ended slightly better, at 15°C. That small improvement at the end reshaped how the entire episode was remembered. The participants’ “experiencing self” suffered more during the longer trial. But their “remembering self” preferred it because it ended on a less painful note.

From this, the researchers introduced what became known as the Peak–End Rule: we judge experiences largely by their most intense moment and how they finish, not by how long they last.

Since people largely judge an experience by how it ends, developers should focus on designing satisfying confirmation screens and smooth exit interactions. You should concentrate less on making every single moment perfect and instead prioritise optimising the peak and final moments.

A negative ending can overshadow an otherwise good experience, so carefully avoid frustrating final steps such as unexpected fees or confusing confirmations.

Emotional intensity strongly shapes memory, which is why many apps incorporate celebration animations, rewards, or success messages at key moments to leave a lasting positive impression.

Design takeaway from Peak End Rule

People Judge Experiences by the Peak and the Ending — Not the Total Duration:
Users don’t remember every moment. They remember the most intense point and how the experience ends. A slightly better ending can completely reshape how the entire interaction is remembered.

Prioritise Strong, Positive Endings in Your UX Flows:
A smooth final step, a clear confirmation, or a satisfying success screen leaves a disproportionately strong impression. A bad ending can overshadow an otherwise great experience.

Design for Emotional Peaks at Key Moments:
Celebration animations, rewards, checkmarks, and success messages create memorable emotional spikes. These peaks anchor the experience in the user’s memory.

Don’t Try to Perfect Every Moment — Perfect the Right Moments:
Optimise the peak and the end of the journey. These two moments define how users recall the entire interaction.

Avoid Negative Surprises at the Finish Line:
Unexpected fees, confusing confirmations, or friction at the last step can ruin the memory of the whole process. Protect the ending carefully.

To Sum Up: We judge experiences largely by their most intense moment and how they finish, not by how long they last.

11.0 Postel’s Law:

Jon Postel’s famous principle – “Be conservative in what you send, be liberal in what you accept” – is a philosophy of kindness in software design. At its core, the principle argues that systems should be generous with what they accept from users, yet disciplined and predictable in what they output.

When developers follow this approach, users feel supported and understood. When they don’t, users feel punished for being human.

A user’s input is rarely perfect. People type quickly, make mistakes, follow their own habits, or rely on formats familiar to them. A robust system embraces this reality. It accepts messy, human input and quietly transforms it into clean, standardized data.

Real people don't think in strict formats. They write dates the way they learned in school, type phone numbers the way they say them aloud, and enter names and addresses in whatever structure feels natural to them.

A rigid system will reject anything that doesn’t match its narrow expectations, but a robust system, by contrast, adapts to the user.

Consider dates. A brittle interface might demand MM/DD/YYYY and reject everything else. A more humane system accepts a wide range of formats (“1 May 2024,” “2024‑05‑01,” “05/01/24,” or “May 1st, 2024”) and quietly converts them into a standard internal representation. This is where the complexity handling described by Tesler's Law comes into play: the system absorbs the complexity, not the user.
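A tolerant date field can be sketched as a small parser that tries several patterns and emits one ISO 8601 string. This is an illustration, not a complete solution: it assumes slash-separated dates are month-first (US style), treats two-digit years as 20xx, and all function names are hypothetical.

```typescript
const MONTH_NAMES: string[] = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Builds "YYYY-MM-DD", or returns null if the day does not exist (e.g. 30 Feb).
function toIsoDate(year: number, month: number, day: number): string | null {
  const probe = new Date(Date.UTC(year, month - 1, day));
  const exists =
    probe.getUTCFullYear() === year &&
    probe.getUTCMonth() === month - 1 &&
    probe.getUTCDate() === day;
  if (!exists) return null;
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return `${year}-${pad(month)}-${pad(day)}`;
}

// Accepts full names or abbreviations of three or more letters ("May", "Sept").
function monthNumber(word: string): number | null {
  const lower = word.toLowerCase();
  if (lower.length < 3) return null;
  const index = MONTH_NAMES.findIndex((monthName) => monthName.startsWith(lower));
  return index === -1 ? null : index + 1;
}

function normalizeDate(input: string): string | null {
  const text = input.trim();

  // 2024-05-01
  let m = /^(\d{4})-(\d{1,2})-(\d{1,2})$/.exec(text);
  if (m) return toIsoDate(Number(m[1]), Number(m[2]), Number(m[3]));

  // 05/01/24 or 05/01/2024. Assumption: month comes first (US style).
  m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}|\d{2})$/.exec(text);
  if (m) {
    const yearDigits = Number(m[3]);
    const year = yearDigits < 100 ? 2000 + yearDigits : yearDigits;
    return toIsoDate(year, Number(m[1]), Number(m[2]));
  }

  // 1 May 2024 or 1st May 2024
  m = /^(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)\.?,?\s+(\d{4})$/.exec(text);
  if (m) {
    const month = monthNumber(m[2] ?? "");
    return month === null ? null : toIsoDate(Number(m[3]), month, Number(m[1]));
  }

  // May 1st, 2024
  m = /^([A-Za-z]+)\.?\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})$/.exec(text);
  if (m) {
    const month = monthNumber(m[1] ?? "");
    return month === null ? null : toIsoDate(Number(m[3]), month, Number(m[2]));
  }

  return null; // Unrecognised: ask the user to clarify instead of guessing.
}
```

Returning `null` for anything unrecognised keeps the output side strict: the system either stores a valid ISO date or asks the user again.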

Phone numbers follow the same pattern. People might enter (555) 123 4567, 555‑123‑4567, 5551234567, or +1 555 123 4567. A fragile system throws errors. A robust one parses all of them using libraries like libphonenumber and moves on.
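To show the idea without depending on a particular library API, here is a hand-rolled sketch that normalises the four inputs above into one canonical string. It assumes numbers without a country code are US/Canada (+1) numbers and performs no real validation; a production system would rely on a dedicated library such as libphonenumber instead.

```typescript
// Assumption: numbers without a country code belong to the +1 region.
// Returns a "+1XXXXXXXXXX" string, or null if the digit count does not fit.
function normalizeUsPhone(input: string): string | null {
  const digits = input.replace(/\D/g, "");
  if (digits.length === 10) return `+1${digits}`;
  if (digits.length === 11 && digits.charAt(0) === "1") return `+${digits}`;
  return null;
}
```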

Addresses are equally varied. “221B Baker St,” “221‑B Baker Street,” and “221 Baker St., Apt B” all refer to the same place. A forgiving system normalizes these instead of rejecting them.

Even names can be surprisingly complex. Hyphens, apostrophes, multiple words, and titles are all part of real human identity. Rejecting “O’Connor,” “Jean‑Luc,” or “Dr. Sarah Lee” is not just technically incorrect — it's disrespectful to the user.

Search bars offer another clear example. A strict search bar demands perfect spelling and exact phrasing. A robust one handles typos (“restuarant”), partial words (“resta”), synonyms (“food places”), and natural language (“where can I eat nearby”). It meets the user where they are instead of forcing them to think like a machine.

Currency should be normalized to a clear format such as GBP 5.00, no matter whether the user typed “£5,” “5 pounds,” or “5 GBP.”
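A minimal sketch of that normalisation follows. It assumes the field only ever holds pound-sterling amounts, so a bare number is also read as pounds; the function name is hypothetical.

```typescript
// Assumption: this field only accepts pound-sterling amounts.
// Accepts inputs like "£5", "5 pounds", "5 GBP", or "5.50".
function normalizeGbp(input: string): string | null {
  const text = input.trim().toLowerCase();
  const match = /^(?:£|gbp)?\s*(\d+(?:\.\d{1,2})?)\s*(?:pounds?|gbp)?$/.exec(text);
  if (!match) return null;
  const amount = Number(match[1]);
  return `GBP ${amount.toFixed(2)}`;
}
```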

Even file uploads benefit from standardization: whether the user uploads .jpeg, .jpg, .JPG, or .JPEG, the system should store everything as .jpg.
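For file names, the normalisation can be as small as one case-insensitive replacement. The sketch below only rewrites the extension in the stored name and leaves other file types alone.

```typescript
// Rewrites .jpeg, .JPG, and .JPEG endings to .jpg; other names are unchanged.
function normalizeImageExtension(fileName: string): string {
  return fileName.replace(/\.jpe?g$/i, ".jpg");
}
```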

Error messages follow the same principle. Vague feedback like “Invalid password” leaves users confused and frustrated.

A clear, conservative message — “Incorrect password. Please try again.” — respects the user’s time. And instead of hiding password requirements, the system should state them upfront: minimum eight characters, at least one uppercase letter, at least one number.
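Those three requirements can be encoded as data, so the same list drives both the hint shown upfront and the feedback shown while typing. This is a sketch using the rules named above; the type and function names are hypothetical.

```typescript
interface PasswordRule {
  label: string; // shown to the user before and during typing
  test: (password: string) => boolean;
}

const PASSWORD_RULES: PasswordRule[] = [
  { label: "At least 8 characters", test: (p) => p.length >= 8 },
  { label: "At least one uppercase letter", test: (p) => /[A-Z]/.test(p) },
  { label: "At least one number", test: (p) => /\d/.test(p) },
];

// Returns the labels of the rules the password does not yet satisfy.
function unmetPasswordRules(password: string): string[] {
  return PASSWORD_RULES.filter((rule) => !rule.test(password)).map((rule) => rule.label);
}
```

An empty result means the password passes; otherwise the returned labels tell the user exactly what to fix.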

Predictability reduces friction.

Because users inevitably make mistakes or enter data in unexpected ways, developers should design input fields that are tolerant rather than brittle. This means accepting flexible formats, offering autocorrect or intelligent parsing, and using forgiving validation rules that interpret the user’s intent instead of rejecting their effort.

Clear instructions, tooltips, and visible requirements should appear before submission so users understand what the system expects without trial and error.

When errors do occur, the interface should handle them gently—never crashing, and never forcing the user to start over.

Even simple variations, such as phone numbers typed with spaces, dashes, or parentheses, should be accepted and normalized behind the scenes.

By embracing flexibility on the input side and clarity on the output side, developers create systems that feel humane, resilient, and respectful of the way real people actually behave.

Design Takeaway from Postel's Law

Accept Messy Human Input, Output Clean Structured Data:
Users type dates, names, phone numbers, and addresses in unpredictable ways. A humane system accepts this variability and quietly normalises it into a consistent internal format.

Rigid interfaces punish users for being human. Robust interfaces interpret intent — handling typos, partial matches, synonyms, and natural language without complaint.

Also accept variations in spacing, punctuation, casing, and structure. Let users type naturally — the system should handle the complexity, not them.

Be Flexible With Input, Be Strict With Output:
This is the heart of Postel’s Law. Let users express information naturally, but ensure your system stores and displays it in a predictable, standardised way.

Use Intelligent Parsing and Autocorrection to Reduce Errors:
Libraries like libphonenumber, fuzzy search, and natural‑language parsers allow systems to accept a wide range of formats while still producing clean, reliable data.

Normalise Everything Behind the Scenes:
Dates, phone numbers, currency, file extensions, and addresses should all be standardised internally. This prevents messy data and reduces downstream cleanup work.

Provide Clear, Predictable Feedback:
Error messages should be specific and helpful. Requirements should be visible upfront. Users should never be surprised, confused, or forced to start over.

Combine Postel’s Law With Tesler’s Law:
Shift complexity to the system. Intelligent handling of messy input reduces cognitive load, improves usability, and ensures consistent, high‑quality data.

To Sum Up: A rigid system will reject anything that doesn’t match its narrow expectations, but a robust system, by contrast, adapts to the user.

12.0 Doherty Threshold:

The Doherty Threshold is a principle in human–computer interaction which proposes that systems should respond quickly enough to keep users actively engaged (Mod, 2024).

When response times stay below a certain limit, users remain focused and productive. But once performance already meets this optimal responsiveness level, making the system even faster or adding extra capability doesn't significantly enhance satisfaction or efficiency.

The idea is credited to Walter J. Doherty, who published the finding with Ahrvind J. Thadani in the 1982 IBM paper “The Economic Value of Rapid Response Time.” Their research showed that keeping system feedback fast enough to sustain continuous interaction has a stronger impact on productivity than simply increasing system power or features beyond that point.

Doherty proposed that system response time should not exceed 400ms. If the system responds within this window, the user feels in total control. If the response takes longer, the user's attention begins to wander, and their "train of thought" is broken.

The challenge, of course, is that not every operation can realistically complete within 400ms. Some tasks require heavy computation, large network calls, or complex rendering. This is where the concept of perceived performance becomes essential.

Even when the system can't finish the work quickly, it can feel fast by responding instantly at the UI level. Developers can achieve this illusion of speed through a combination of thoughtful design patterns and disciplined engineering practices.

On the technical side, performance begins with reducing unnecessary work. Keeping the number of HTML elements low helps the browser render faster. Rendering only the visible portion of long lists prevents the Document Object Model (DOM) from becoming bloated. Splitting scripts and deferring non‑critical code ensures that essential interactions load first.

Using CSS transforms and opacity changes avoids expensive layout recalculations. Lazy‑loading images, videos, and scripts ensures that the interface becomes interactive long before all assets are downloaded.
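Rendering only the visible portion of a long list comes down to one calculation: which row indices intersect the viewport. The sketch below assumes every row has the same fixed height; the `overscan` parameter (an assumed default of 2) renders a few extra rows above and below to hide gaps during fast scrolling.

```typescript
interface VisibleRange {
  start: number; // first index to render (inclusive)
  end: number;   // last index to render (exclusive)
}

// Assumes fixed-height rows. Variable heights need a different approach.
function visibleRange(
  scrollTopPx: number,
  viewportHeightPx: number,
  itemHeightPx: number,
  itemCount: number,
  overscan = 2
): VisibleRange {
  const first = Math.floor(scrollTopPx / itemHeightPx);
  const lastExclusive = Math.ceil((scrollTopPx + viewportHeightPx) / itemHeightPx);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(itemCount, lastExclusive + overscan),
  };
}
```

With a 10,000-row list, only the handful of rows in the returned range needs DOM nodes at any moment.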

These optimizations don’t just improve raw speed — they create the foundation for interfaces that feel responsive.

Design Takeaways from Doherty Threshold

Instant Feedback: When a user clicks a button, provide a visual change (like a button press animation or a spinner) immediately, even if the background task takes longer.

Skeleton Screens: Use placeholder blocks that mimic the layout of the page while data loads. This makes the app feel like it is responding instantly.

Progressive Loading: Load text and basic structures first, then "pop in" high-resolution images later.

Optimistic UI: When a user hits "Save," don't wait for the server. Update the UI instantly (Doherty) and handle the "messy" data formatting on the backend (Postel).

Live Inline Validation: Show a green checkmark or a helpful error message as the user types. This keeps them below the 400ms "thought-break" limit.

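Debouncing: In search bars, wait for a short pause in typing before sending the query instead of firing a request on every keystroke. Results still appear as the user types, so the app feels like it is "predicting" their needs, without the lag caused by redundant requests.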
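Of the patterns above, debouncing is the easiest to show in a few lines. The sketch below is a generic helper, not tied to any framework; `runSearch` is a stand-in for a real search request, and the 250ms wait is an assumed value chosen to stay under the 400ms threshold.

```typescript
// Returns a wrapper that delays `fn` until `waitMs` has passed
// without another call. Each new call restarts the timer.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Hypothetical usage: record which terms actually trigger a search.
const searchedTerms: string[] = [];
const runSearch = (term: string): void => {
  searchedTerms.push(term);
};
const debouncedSearch = debounce(runSearch, 250);

debouncedSearch("r");
debouncedSearch("re");
debouncedSearch("res"); // Only this call reaches runSearch, after the pause.
```

Pair the debounced request with an instant visual cue, such as a subtle loading state in the input, so the pause never looks like an unresponsive field.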

To Sum Up: When response times stay below a certain limit, users remain focused and productive. But once performance already meets this optimal responsiveness level, making the system even faster or adding extra capability doesn't significantly enhance satisfaction or efficiency.

13.0 Serial Position Effect (Primacy and Recency):

Murdock’s study investigated how the position of a word in a list affects recall, known as the serial position effect. He presented 103 psychology students with lists of 10 to 40 words, one at a time, at either 1 or 2 seconds per word (McLeod, 2025).

Participants were divided into six groups, each experiencing a different combination of list length and presentation rate, and were asked to recall as many words as possible in any order.

The results showed that participants were most likely to remember words at the beginning of the list (primacy effect) and at the end of the list (recency effect), while words in the middle were recalled less often. The recency effect persisted even in longer lists, and the middle section of the recall curve formed a flat asymptote.

Murdock explained this using the multi-store model of memory: early words were rehearsed and transferred to long-term memory, last words remained in short-term memory, and middle words were neither sufficiently rehearsed nor retained, leading to poorer recall.

The experiment demonstrated that memory performance varies systematically with the position of information in a sequence.

This is the reason why the most important information or actions should never be buried in the middle.

As a developer, you should put your most critical navigation links (like "Home" or "Dashboard") at the far left or the top of a list. In a pricing table, put the most popular or recommended plan at one end of the row rather than in the middle. Place "Final Actions" (like "Log Out," "Cart," or "Support") at the end of a menu or the far right of a navigation bar.

In a long onboarding flow, put the most exciting benefit of the app on the very last slide so the user enters the app feeling motivated.

Avoid placing highly important buttons in the middle of a row. If you have a row of 7 buttons, the 4th is the one users are most likely to overlook.

Design Takeaways from Serial Position Effect

Place Critical Items at the Beginning or End — Never the Middle:
Users reliably remember the first and last items in any sequence (primacy and recency). Anything placed in the middle is statistically more likely to be forgotten or ignored. Also, actions such as “Log Out,” “Cart,” “Support,” or “Checkout” should sit at the far right or bottom — the natural recency position.

Put Essential Navigation Links at the Far Left or Top:
Links like “Home,” “Dashboard,” or “Overview” should appear at the start of a menu, where recall and recognition are strongest.

To Sum Up: The results showed that participants were most likely to remember words at the beginning of the list (primacy effect) and at the end of the list (recency effect), while words in the middle were recalled less often.

14.0 Occam’s Razor:

Although first articulated in the 14th century by the Franciscan friar William of Ockham, Occam’s Razor remains one of the most indispensable principles in a developer’s toolkit. In fact, skipping this law while discussing other theories and principles would be like skipping the glue that holds the entire framework together.

At its core, Occam’s Razor states that “among competing explanations, the simplest one is usually the best.”

For example, if two user interfaces achieve the same goal, the one with fewer visual elements is typically superior because it demands less mental processing from the user.

The fundamental takeaway for modern developers regarding Occam’s Razor is that complexity is a tax on the user’s cognitive resources.

In an era of information density, the developer's primary role is no longer to provide "more" features – rather, it's to curate the most direct path to a solution.

In practice, Occam’s Razor becomes a reminder to keep things as simple as possible. This “less is more” mindset shapes everything from navigation to forms.

A good rule for navigation is the Rule of Five: aim for three to five main menu items instead of a long, overwhelming list. This keeps choices clear and prevents users from freezing up when they see too many options.

The same idea applies to data entry. When you ask only for the information that truly matters, you respect the user’s time and reduce the chance of “form fatigue,” which is one of the biggest reasons people abandon sign‑ups or checkout flows.

Simplicity isn’t just elegant — it’s practical, humane, and far more effective.

Design Takeaway from Occam's Razor

Choose the Simplest Effective Solution:
When two designs achieve the same goal, the one with fewer elements is almost always better. Simplicity reduces cognitive load and speeds up user decision‑making.

Simplicity Is Not Just Aesthetic — It’s Humane:
Clear, minimal interfaces respect the user’s time, reduce friction, and make the product feel effortless. Simplicity is both a design strategy and an act of empathy.

To Sum Up: Simplicity isn’t just elegant — it’s practical, humane, and far more effective.

15.0 Parkinson's Law

Occam’s Razor teaches us to prefer the simplest solution that works. But why do we so often end up with complex systems in the first place? That tendency is explained by another principle: Parkinson’s Law.

Parkinson’s Law states that "work expands to fill the time available for its completion". In design, this means projects often become overly complex or take longer than necessary if given too much time, resulting in inefficient, over-designed, or cluttered interfaces.

This often shows up as feature creep. If you give yourself three months to build an app, you will spend three months adding "nice-to-have" animations, extra settings toggles, and niche edge cases that nobody asked for and that, in reality, aren’t that important.

All you have done is add layers of complexity that might end up violating some of the laws we discussed earlier. Occam’s Razor reminds us that the simplest solution is often the most effective.

By being aware of Parkinson’s Law and the tendency for work to expand, developers can manage their time intentionally and focus only on what truly matters.

Design Takeaways from Parkinson's Law

Set Clear Constraints to Keep Designs Focused:
Intentional time limits and scope boundaries prevent over‑designing. Constraints force clarity, prioritisation, and simplicity.

Build Only What Truly Matters for the User:
Parkinson’s Law reminds you to resist the urge to fill time with unnecessary features. Focus on the core experience, not the edge cases nobody asked for.

Use Occam’s Razor to Counterbalance Parkinson’s Law:
As work expands, complexity grows. Occam’s Razor pulls you back to the simplest effective solution. Together, the two principles prevent bloated, over‑engineered products.

To Sum Up: Work expands to fill the time available for its completion, so set deliberate limits on time and scope.

Conclusion

Human-centered design is deeply influenced by a set of psychological principles that explain how users perceive, process, and interact with digital systems.

Among these, Fitts’s Law establishes that the time required to acquire a target depends on its size and distance. In practice, this means that larger and closer elements are easier and faster to interact with.

To apply this in practice, developers should make primary call-to-action elements prominent, large, and easily reachable – especially in mobile interfaces where thumb accessibility is critical.

Closely related to decision-making is Hick’s Law, which states that the more choices a user is presented with, the longer it takes to make a decision. Excessive options can overwhelm users and lead to decision fatigue.

To address this, developers should simplify interfaces, minimise unnecessary options, and guide users through processes step-by-step rather than presenting everything at once.

Another important cognitive principle discussed is Miller’s Law, which suggests that the average person can hold approximately seven (plus or minus two) items in working memory at a time. This limitation highlights the need to present information in manageable chunks.

By breaking content into smaller groups and avoiding information overload, developers can improve comprehension and usability.

User expectations are strongly shaped by Jakob’s Law, which says that people spend most of their time on other websites and therefore expect similar patterns across digital products.

Instead of reinventing basic interactions, developers should follow familiar conventions such as placing the logo in the top‑left, the shopping cart in the top‑right, and keeping scrolling behaviour predictable.

But innovation is still possible where it truly adds value. As we discussed with the Aesthetic‑Usability Effect, users are far more tolerant of new or unusual design patterns when the interface is visually appealing and thoughtfully crafted.

The Gestalt Principles provided additional insight into how users visually organise information. The principle of proximity suggests that objects placed close together are perceived as related, so grouping related elements improves clarity. Similarity indicates that elements with consistent colours, shapes, or styles are seen as belonging together, reinforcing visual hierarchy and function. Closure explains that users can perceive incomplete shapes as complete, allowing for minimalistic designs where the brain fills in missing details. Continuity highlights that users naturally follow smooth visual paths, meaning layouts should guide the eye logically through alignment and structure.

We also looked at the Von Restorff Effect, which emphasizes that elements which stand out are more likely to be remembered. By using contrast in colour, size, or design, important features such as buttons or alerts can capture user attention.

Managing complexity was addressed by Tesler’s Law, which asserts that every system has inherent complexity that cannot be eliminated but only managed.

Developers must therefore shift complexity away from the user by simplifying interfaces while handling intricate processes behind the scenes.

The Zeigarnik Effect reveals that people remember unfinished tasks better than completed ones, creating a sense of mental tension. This can be leveraged by incorporating progress indicators, checklists, and reminders that encourage users to complete tasks.

Similarly, the Peak-End Rule suggests that users judge an experience based on its most intense moment and its conclusion. Developers should create memorable highlights and ensure a smooth, satisfying ending to user journeys.

We also discussed the Goal-Gradient Effect, which explains that users become more motivated as they approach the completion of a task. By showing progress – such as indicating that a process is “80% complete” – and breaking tasks into stages, developers can encourage users to finish what they have started.

In terms of system interaction, Postel’s Law advises developers to be flexible in accepting user input while maintaining strict standards for output. This means allowing different input formats while ensuring consistent and reliable system responses.

Performance is equally important, as highlighted by the Doherty Threshold, which shows that productivity increases when system response times stay under 400 milliseconds. Fast systems keep users engaged and create a sense of ease.

This means that developers should focus on building interfaces that feel instant, even when real processing takes longer, by combining smart engineering practices with thoughtful design patterns that maintain the illusion of speed.

Memory and attention are further explained by the Serial Position Effect, where users tend to remember the first and last items in a sequence more than those in the middle. Developers should position key information or actions at the beginning or end of lists.

Simplicity is reinforced by Occam’s Razor, which argues that the simplest solution is often the most effective. Eliminating unnecessary features reduces friction and enhances usability. Finally, we discussed Parkinson’s Law, which suggests that tasks expand to fill the time available, indicating the importance of setting constraints such as deadlines or timers to encourage timely action.

These principles collectively highlight the importance of simplicity, clarity, performance, and user psychology in design. By applying them thoughtfully, developers can create intuitive, efficient, and engaging user experiences that align with both human behaviour and user expectations.

References

Budiu, R. (2022). Fitts’s Law and Its Applications in UX. [online] Nielsen Norman Group. Available at: https://www.nngroup.com/articles/fitts-law/.

Bustamante, N. (2023). Gestalt Psychology? Definition, Principles, & Examples - Simply Psychology. [online] www.simplypsychology.org. Available at: https://www.simplypsychology.org/what-is-gestalt-psychology.html.

Cherry, K. (2024). The Zeigarnik Effect Is Why You Keep Thinking of Unfinished Work. [online] Verywell Mind. Available at: https://www.verywellmind.com/zeigarnik-effect-memory-overview-4175150.

Do, A.M., Rupert, A.V. and Wolford, G. (2008). Evaluations of pleasurable experiences: The peak-end rule. Psychonomic Bulletin & Review, 15(1), pp.96–98. doi:https://doi.org/10.3758/pbr.15.1.96.

Gupta, S., Gupta, S., Mahendra, A. and Gupta, S. (2006). Inverse Halo Nevus. Dermatologic Surgery, 32(6), pp.871–872. doi:https://doi.org/10.1097/00042728-200606000-00025.

Hall, D. (2023). Pilot Error, Chapanis and The Shape of Things to Come. [online] UX Magazine. Available at: https://uxmag.com/articles/pilot-error-chapanis-and-the-shape-of-things-to-come.

Hunt, R.R. (1995). The subtlety of distinctiveness: What von Restorff really did. Psychonomic Bulletin & Review, 2(1), pp.105–112. doi:https://doi.org/10.3758/bf03214414.

Kahneman, D., Fredrickson, B.L., Schreiber, C.A. and Redelmeier, D.A. (1993). When More Pain Is Preferred to Less: Adding a Better End. Psychological Science, 4(6), pp.401–405. doi:https://doi.org/10.1111/j.1467-9280.1993.tb00589.x.

Mod, D. (2024). Doherty Threshold: UX Law of Swift Interactions. [online] Articles on everything UX: Research, Testing & Design. Available at: https://blog.uxtweak.com/doherty-threshold/.

Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, [online] 101(2), pp.343–352. doi:https://doi.org/10.1037/0033-295x.101.2.343.

Proctor, R.W. and Schneider, D.W. (2018). Hick’s law for choice reaction time: A review. Quarterly Journal of Experimental Psychology, [online] 71(6), pp.1281–1299. doi:https://doi.org/10.1080/17470218.2017.1322622.

Yablonski, J. (2022). Hick’s Law. [online] Laws of UX. Available at: https://lawsofux.com/hicks-law/.

Yablonski, J. (2022). Goal-Gradient Effect. [online] Laws of UX. Available at: https://lawsofux.com/goal-gradient-effect/.

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Rakshath Naik — Tue, 05 May 2026 16:59:17 +0000

In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.

Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.

Done.

Except something looks odd.

When we take a closer look, we see that most customers are buying items worth $8 - $15. So where's $20 coming from?

In that case, the problem isn’t data – it’s the average. This is a clean textbook trap where everything works perfectly in the textbook, but real-world data doesn’t behave nicely.

Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.

In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?

Prerequisites
The Dataset
Mean: The Sensitive Giant
Median: The Robust Middle
Beyond Averages: Understanding Spread with Quartiles
Applying IQR to Our Dataset
Final Comparison and Insights
Conclusion
Connect with me

Prerequisites

To follow along here, you'll need:

Basic Python knowledge: Understanding of variables and functions.

The Pandas library: Familiarity with loading data and basic DataFrame operations.

A development environment: Access to a tool like Jupyter Notebook, VS Code, or Google Colab.

A Dataset: For this analysis, I used the Online Retail Dataset, which is available for download here.

The Dataset

We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.

Source: UCI Machine Learning Repository
Collected by: UK-based online retail company (2010–2011)
Size: 541,909 transactions
Features: 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)
Ownership: Public dataset hosted by UCI
License: Open for research and educational use

Mean: The Sensitive Giant

In statistics and data analysis, the terms "average" and "arithmetic mean" are often used interchangeably. We aim to find the mean total price in our dataset. Mean in the context of the Online Retail Dataset is given as:

$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$

In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, including unusually high or negative ones, directly influences the final average.

import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")

The results are as follows:

Average Order Value (Mean): 20.40

At first glance, the result may look reasonable: every transaction contributes equally. But that’s where the problem lies. A few extremely high or low transactions can shift the mean away from the range where most customers actually fall.

Take a look at the graph for the mean below.

The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)

The graph shows a right-skewed distribution where the calculated mean of 20.40 is actually a textbook trap. The tallest bar clearly shows that the majority of transactions lie in the $8 - $15 range, but the red line is being dragged to the right by the long tail of high-value bulk orders placed by some customers.

In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.

In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.

Median: The Robust Middle

When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.

Median is defined as the middle value after sorting the data.

In our dataset, we sort all the transactions and pick the middle one.

The formula for calculating the median is:

$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} & \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} & \text{if } n \text{ is even} \end{cases}$$

Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")

The results are as follows:

Typical Order Value (Median): 11.10

Now you'll notice that the result lies in the $8 - $15 range, where most of the transactions lie.

The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)

In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.

In the above figure the median graph accurately highlights the range where most of the customers lie.
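
To see this stability on a small scale, here is a minimal sketch using Python's built-in statistics module. The order values are made up for illustration and are not taken from the Online Retail Dataset.

```python
from statistics import mean, median

# Hypothetical order values in the typical $8 - $15 range
typical_orders = [8, 10, 11, 12, 15]

# The same orders plus one bulk purchase (300) and one return (-20)
with_extremes = typical_orders + [300, -20]

print(f"Typical orders -> mean: {mean(typical_orders):.2f}, median: {median(typical_orders):.2f}")
# Typical orders -> mean: 11.20, median: 11.00

print(f"With extremes  -> mean: {mean(with_extremes):.2f}, median: {median(with_extremes):.2f}")
# With extremes  -> mean: 48.00, median: 11.00
```

Two extreme values are enough to more than quadruple the mean, while the median does not move at all, because it depends only on which value sits in the middle after sorting.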

Beyond Averages: Understanding Spread with Quartiles

So far, we've studied the median, but knowing the center is not enough.

To truly understand customer spending, we need to see how the data is spread, and this is where quartiles come into play.

Quartiles divide the sorted dataset into four equal parts, marked by three cut points:

Q1 (25th percentile): 25% of transactions are below this.
Q2 (50th percentile): Median
Q3 (75th percentile): 75% of transactions are below this.

The distance between Q3 and Q1 is called the Interquartile Range (IQR):

$$IQR = Q_3 - Q_1$$

The IQR: Detecting Outliers

The IQR measures the spread of the middle 50%.

If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.

Outlier Rule:

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

A Simple Example to Understand IQR

Consider the following transaction values:

$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$

Step 1: Find the Median (Q2):

The middle value is:

$$Q_2 = 12$$

Step 2: Find Q1 (Lower Quartile):

The lower half is [5, 8, 10]. The median of the lower half is:

$$Q_1 = 8$$

Step 3: Find Q3 (Upper Quartile):

The upper half is [15, 18, 20]. The median of the upper half is:

$$Q_3 = 18$$

Step 4: Calculate IQR:

$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$

Step 5: Find Outlier Bounds:

$$\begin{aligned} \text{Lower Bound} &= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$

Any value below -7 or above 33 is an outlier (but in this demo problem, no outliers exist).
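
The five steps above can be checked with a short sketch in plain Python. The helper name quartiles_by_halves is hypothetical. It follows the "median of each half" method used in this example, leaving out the middle value when the list has an odd length.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd-length list, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


values = [5, 8, 10, 12, 15, 18, 20]
q1, q2, q3 = quartiles_by_halves(values)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(q1, q2, q3)                 # 8 12 18
print(iqr)                        # 10
print(lower_bound, upper_bound)   # -7.0 33.0
print(outliers)                   # []
```

Note that there are several conventions for computing quartiles. The pandas quantile method interpolates linearly by default and would report Q1 = 9 and Q3 = 16.5 for this same list, so small differences between a hand calculation and library output are expected.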

Applying IQR to Our Dataset

In our retail dataset, instead of neat values, we have bulk values and even negative returns.

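# 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] < lower_bound) | (df['TotalPrice'] > upper_bound)]
print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")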

When we calculate IQR for our dataset, we get:

Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180

The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)

As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.

Revisiting the Mean After Removing Outliers

Using the IQR method, we've removed extreme transactions that fell outside the typical spending range.

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] >= lower_bound) & (df['TotalPrice'] <= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")

After recomputing, we get:

Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63

Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much better mean of 11.63 as opposed to the right-stretched mean of 20.40 we got with outliers.

Final Comparison and Insights

Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than what most transactions were actually worth. The mean was pulled upward by a few high-value transactions and distorted by the outliers.

The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.

After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.

Conclusion

The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.

This highlights a key lesson: The mean isn't wrong, but it must be used with an understanding of the data.

Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.

Connect with me

If you want to dive deeper, you can visit: Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis.

How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway

Rakshath Naik — Thu, 30 Apr 2026 05:06:15 +0000

In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.

While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.

In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.

The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.

Prerequisites
Building the Brain: The Model
Deploying the Model to AWS
How to Run The Project Locally
Our Project Architecture
Conclusion: The Power of Serverless AI
Acknowledgment / References

1. Prerequisites

Fundamental skills: Basic proficiency in Python and understanding of Machine Learning concepts like classification.
AWS account: Access to an AWS account with permissions for Lambda, S3, and API Gateway.
Environment: Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.
AWS CLI: Configured on your local machine for file uploads.
HuggingFace account: You can directly download the model from my account.

2. Building the Brain: The Model

Photo by Steve A Johnson on Unsplash

At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.

1. Vectorization: Turning Text into Math

Machine Learning models can't read text. They require numerical input. To solve this, we used the TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

Here's the mathematical formula:

$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$

TF-IDF term definitions:

wᵢ,ⱼ (Weight): The final importance score of a specific word in a document.
tfᵢ,ⱼ (Term Frequency): How often a word appears in a single email.
N (Total Documents): The total count of all emails in your dataset.
dfᵢ (Document Frequency): The number of different emails that contain this specific word.
log(N/dfᵢ) (IDF): A penalty that lowers the score of common words like the or is that appear everywhere.

The vectorizer cleans the data by removing common words, converts all text to lowercase for consistency, and assigns more importance to rare and meaningful words while giving less importance to frequently used words.
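
To make the formula concrete, here is a minimal sketch that applies it by hand to three made-up messages. It uses the textbook formula exactly as written above. scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ, but the ranking idea is the same. The messages and the helper name tfidf are assumptions for illustration.

```python
import math
from collections import Counter

# Three hypothetical messages, already lowercased
docs = [
    "free prize winner",
    "free meeting today",
    "lunch meeting today",
]

tokenized = [doc.split() for doc in docs]
N = len(tokenized)  # total number of documents

# Document frequency: in how many messages does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Weight each word in one message as tf * log(N / df)."""
    tf = Counter(tokens)
    return {word: tf[word] * math.log(N / df[word]) for word in tf}


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
# free: 0.405
# prize: 1.099
# winner: 1.099
```

"free" appears in two of the three messages, so it scores lower than "prize" and "winner", which each appear in only one.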

2. Training: The Logistic Regression Engine

We'll use Logistic Regression here, a classification algorithm that predicts the probability of an outcome.

In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the Spam or Ham label.

During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_features, Y_train)

In our case, it calculates the probability that an email is spam or ham.

The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.

$$P(y=1|x) = \frac{1}{1 + e^{-(z)}}$$

where z = β₀ + β₁x₁ + … + βₙxₙ.
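
As a rough illustration of how this formula turns feature weights into a probability, here is a small sketch in plain Python. The intercept, coefficients, and feature values are invented for the example and are not the trained model's parameters.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Compute P(y=1|x) for z = b0 + b1*x1 + ... + bn*xn."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical values: two TF-IDF features and their learned weights
intercept = -1.0
coefficients = [2.0, 1.5]
features = [0.8, 0.4]

probability = predict_probability(intercept, coefficients, features)
print(f"{probability:.3f}")  # 0.769, since z = -1.0 + 1.6 + 0.6 = 1.2
```

A probability above 0.5 would be assigned to the class labelled 1. Which class that is (spam or ham) depends on how the labels were encoded during training.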

3. Evaluation: Testing the Intelligence

After training, we need to verify if the brain actually works on data it hasn't seen before.

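from sklearn.metrics import accuracy_score
X_test_features = feature_extraction.transform(X_test)  # reuse the fitted vectorizer; assumes X_test holds the raw test messages
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)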

By comparing the model’s predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us the confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).

4. Exporting the Logic (Serialization)

To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).

import joblib
joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')

We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.

We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.
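
The freeze-and-restore round trip can be sketched with Python's standard pickle module, which joblib builds on. The two dictionaries below are hypothetical stand-ins for the fitted vectorizer and model, used only so the example runs without scikit-learn.

```python
import pickle
from io import BytesIO

# Hypothetical stand-ins for the fitted vectorizer and model
vectorizer_state = {"vocabulary": {"free": 0, "winner": 1, "meeting": 2}}
model_state = {"intercept": -1.0, "coefficients": [2.0, 1.5, -0.7]}

# "Freeze" both objects into bytes, similar to writing .pkl files
frozen_vectorizer = pickle.dumps(vectorizer_state)
frozen_model = pickle.dumps(model_state)

# "Re-animate" them from in-memory buffers, as a cloud function
# would after downloading the bytes from storage
restored_vectorizer = pickle.load(BytesIO(frozen_vectorizer))
restored_model = pickle.load(BytesIO(frozen_model))

# The pair must stay in sync: one coefficient per vocabulary entry
in_sync = len(restored_model["coefficients"]) == len(restored_vectorizer["vocabulary"])
print(in_sync)  # True
```

Because unpickling can execute arbitrary code, only load .pkl files that come from a source you trust.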

The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: Get the model on HuggingFace.

3. Deploying the Model to AWS

Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and incurs nearly no maintenance costs.

1. Model Storage: Amazon S3

First, we'll upload our .pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. It makes the system highly maintainable.

2. The Production Backend: AWS Lambda

To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.

The deployment environment is AWS Lambda (Python 3.11). Since Lambda is a lightweight environment, it doesn't include Scikit-Learn or Joblib. To provide these, we'll package them as a ZIP file, store it in our S3 bucket, and attach it to the function as a Lambda layer.

Commands to build the layer and upload it with the AWS CLI:


# 1. Create a workspace
mkdir ml_layer && cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/

We store the Scikit-Learn library as a ZIP in S3 because the package is too large to upload to Lambda directly. Attaching it as a layer also keeps the heavy dependencies separate from the function's own code, so the core code stays small.

The Lambda Function:


import json
import boto3
import os
import sys
from io import BytesIO

# Ensure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }

Key features of the Lambda function:

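Warm start caching: By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. The first (cold) request still pays for the S3 download, but later requests served by the same warm container skip it, which significantly reduces their latency.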
Dynamic dependency loading: The sys.path.append('/opt/python') line allows us to import heavy libraries from S3/Layers without exceeding the upload limit.
Bimodal input handling: The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.
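The bimodal input handling can be isolated into a small helper and tried locally without AWS. This is a minimal sketch: extract_text is a hypothetical name, and the two sample events only approximate what the console test and API Gateway send.

```python
import json


def extract_text(event):
    """Return the 'text' field from a direct test event or an API Gateway proxy event."""
    body = event.get("body", event)  # proxy events wrap the payload in 'body'
    if isinstance(body, str):        # API Gateway delivers the body as a JSON string
        body = json.loads(body)
    if not isinstance(body, dict):   # e.g. a proxy event whose body is null
        return ""
    return body.get("text", "")


direct_event = {"text": "WIN free iPhone now"}
proxy_event = {"body": json.dumps({"text": "WIN free iPhone now"})}

print(extract_text(direct_event) == extract_text(proxy_event))
```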

3. The API Gateway - The Bridge to the Web

Photo by Growtika on Unsplash

Creating the REST API

Next, we'll create a REST API with a single POST method. Why POST, you might be wondering? Well, POST lets us send the user's text message as a JSON payload in the request body instead of exposing it in the URL.

First, navigate to the Amazon API Gateway console and select Create API -> REST API.
Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.
Then, in the left sidebar, click Resources, create a new resource, and enter a resource name (for example, /predict, which is what I used).
Next, click Create method, select POST, and then select Lambda Function as the integration type.
Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).

The CORS Configuration (The Troubleshooting Hub)

This is where many developers encounter the dreaded Connection Error. Our API is hosted on AWS, so if your front-end is served from a different origin, the browser’s Same-Origin Policy will block the request by default.

To fix this, we'll enable CORS:

Access-Control-Allow-Origin: Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.
The OPTIONS method: When you enable CORS in the console, API Gateway adds an OPTIONS method to the resource. This handles the preflight request, where the browser asks, “Am I allowed to send this cross-origin request?” before sending the actual text.
Access-Control-Allow-Headers: In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.

Image illustrates the CORS configuration for our project. (Image by author)
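With Lambda Proxy integration, the function's own response has to carry the CORS header. In the handler shown earlier, only the 200 response includes it, so a browser may report a CORS failure instead of the real 400 or 500 error. One way to keep this consistent is a small response helper. The sketch below uses a hypothetical build_response name and is not part of the original function.

```python
import json


def build_response(status_code, payload, allowed_origin="*"):
    """Build a Lambda proxy response that always carries the CORS header."""
    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json",
            # Use your site's origin instead of "*" to restrict access
            "Access-Control-Allow-Origin": allowed_origin,
        },
        "body": json.dumps(payload),
    }


ok = build_response(200, {"status": "success", "classification": "SPAM"})
error = build_response(400, {"error": "No text provided."})
print(error["headers"]["Access-Control-Allow-Origin"])
```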

Deployment Stages

Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: https://[api-id].execute-api.[region].amazonaws.com/prod/classify.

Connecting the Frontend (The JavaScript Layer)

With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the Analyze button on your site.


async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

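        // Surface HTTP errors (such as 400 or 500) instead of showing "undefined"
        if (!response.ok) {
            throw new Error(`Request failed with status ${response.status}`);
        }

        const data = await response.json();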
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}
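The same request can also be prepared from Python, which is handy for checking the endpoint from a script before wiring up the page. This is a minimal sketch using only the standard library. The URL is a placeholder, build_spam_request is a hypothetical helper name, and the network call itself is left commented out.

```python
import json
import urllib.request


def build_spam_request(api_url, message):
    """Prepare (but do not send) a POST request carrying the text as JSON."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL (assumption): replace with your own Invoke URL
request = build_spam_request("https://example.com/prod/classify", "WIN free iPhone now")
print(request.get_method(), request.full_url)

# To actually call the API:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response)["classification"])
```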

4. How to Run The Project Locally

You can save the front-end as a single HTML file. Once it's ready, you shouldn’t just double-click the .html file. Pages opened from a file:// path are subject to stricter browser security rules, which can interfere with the API request. Instead, you should serve it using a simple local server.

Step 1: Open the terminal or Command Prompt.

Step 2: Navigate to your project folder

cd [PATH_TO_YOUR_FOLDER]

Step 3: Start a local Python web server.

python -m http.server 8000

Step 4: Access the application.

Open your browser and navigate to:
http://localhost:8000/your-file-name.html


5. Our Project Architecture

The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)

Client Front-End Interaction: The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like WIN free iPhone now and trigger a request.
The Entry Point: API Gateway: The request hits the Amazon API Gateway, which acts as the security guard and translator.
(a) CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.
(b) Classification Request (POST) routes the actual message data to your backend logic.
The Engine: AWS Lambda (Python 3.11): The central “lightbulb” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.
Storage & Retrieval: S3 Bucket: Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.
Dependency and Model Download: The function reaches out to the S3 Bucket to pull in the sklearn_lib.zip (the engine) and the .pkl files (the intelligence).
Required Dependency and Model: These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.
The Inference Pipeline: Inside the Lambda, a three-step mathematical cycle occurs:
(a) Text Vectorizer: Translates the words into numbers.
(b) Logistic Regression: Calculates the probability of spam based on those numbers.
(c) Label: Assigns a final result (Spam or Ham).
The Result Delivery: The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “Result: SPAM” with a visual indicator.
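The three-step inference cycle can be sketched in plain Python. The vocabulary, weights, and bias below are made-up illustration values, and raw word counts stand in for the TF-IDF features that the real vectorizer produces.

```python
import math

# Made-up vocabulary and weights for illustration; a trained model learns these
VOCABULARY = {"win": 0, "free": 1, "now": 2, "meeting": 3}
SPAM_WEIGHTS = [1.5, 1.2, 0.4, -2.0]
BIAS = -1.0


def vectorize(text):
    """(a) Translate words into a vector of term counts."""
    vector = [0.0] * len(VOCABULARY)
    for word in text.lower().split():
        index = VOCABULARY.get(word)
        if index is not None:
            vector[index] += 1.0
    return vector


def spam_probability(vector):
    """(b) Logistic regression: sigmoid of the weighted sum plus bias."""
    z = BIAS + sum(w * x for w, x in zip(SPAM_WEIGHTS, vector))
    return 1.0 / (1.0 + math.exp(-z))


def label(probability, threshold=0.5):
    """(c) Turn the probability into a final label."""
    return "SPAM" if probability >= threshold else "HAM"


for message in ["WIN free iPhone now", "Meeting moved to Friday"]:
    p = spam_probability(vectorize(message))
    print(message, "->", label(p))
```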

6. Conclusion: The Power of Serverless AI

By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.

This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.

Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.

7. Acknowledgment / References

Pre-trained spam classification model: rakshath1/mail-spam-detector on Hugging Face
Scikit-learn Documentation
AWS Lambda Documentation
Amazon S3 Documentation
Amazon API Gateway Documentation


How to Build a Fashion App That Helps You Organize Your Wardrobe

Mokshita V P — Tue, 14 Apr 2026 16:26:39 +0000

I used to spend too long deciding what to wear, even when my closet was full.

That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better organization, better visibility, and better guidance when making outfit decisions.

So I built a fashion web app that helps users organize their wardrobe, get outfit suggestions, evaluate shopping decisions, and improve recommendations over time using feedback.

In this article, I’ll walk through what the app does, how I built it, the decisions I made along the way, and the challenges that shaped the final result.

Table of Contents
What the App Does
Why I Built It
Tech Stack
Product Walkthrough (What Users See)
How I Built It
Challenges I Faced
What I Learned
What I Want to Improve Next
Future Improvements
Conclusion

What the App Does

At a high level, the app combines six core capabilities:

Wardrobe management
Outfit recommendations
Shopping suggestions
Discard recommendations
Feedback and usage tracking
Secure multi-user accounts

Users can upload clothing items, explore suggested outfits, and mark recommendations as helpful or not helpful. They can also rate outfits and track whether items are worn, kept, or discarded.

That feedback becomes structured data for improving future recommendation quality.

Why I Built It

I wanted to create something that felt personal and actually useful. A lot of fashion apps look polished, but they do not always help with everyday decisions. My goal was to build something that could make wardrobe management easier and outfit selection less overwhelming. The app needed to do three things well:

store each user’s wardrobe data
personalize recommendations
learn from user feedback over time

That feedback loop mattered to me because it makes the app feel more alive instead of static.

Tech Stack

Here are the tools I used to build the app:

Frontend: React + Vite
Backend: FastAPI
Database: SQLite (local development)
Background jobs: Celery + Redis
Authentication: JWT (access + refresh token flow)
Deployment support: Docker and GitHub Codespaces

This ended up giving me a pretty modular setup, which helped a lot as features started increasing: fast frontend iteration, clean API boundaries, and room to evolve recommendations separately from UI.

Product Walkthrough (What Users See)

1. Onboarding and Account Setup

To start using the app, a user needs to register, verify their email, and complete some profile basics.

Each account is isolated, so wardrobe history and recommendations stay user-specific.

The onboarding screen above shows account creation, email verification, and profile fields for body shape, height, weight, and style preferences.

2. Wardrobe Upload

Users can upload clothing images.

Image analysis labels each item and makes it searchable for recommendations. The wardrobe upload form shows image analysis results with category, dominant color, secondary color, and pattern details listed.

3. Outfit Recommendations

Users can request recommendations, then rate outputs.

Above you can see the outfit recommendation dashboard that shows ranked outfit cards with feedback and rating actions. Recommendations are ranked by a weighted scoring model.

4. Shopping and Discard Assistants

The app evaluates new items against existing wardrobe data and flags low-value wardrobe items that may be worth removing.

You can see the recommendation scores, written reasons (not just a binary decision), and styling guidance for each item above. It also features a "how to style it" section in case the user still wants to keep the item.

How I Built It

1. Frontend Setup (React + Vite)

I used React + Vite because I wanted fast iteration and a clean component structure.

The frontend is split into feature areas like onboarding, wardrobe management, outfits, shopping, and discarded-item suggestions. I also keep API calls in a service layer so the UI components stay focused on rendering and interaction.

The snippet below is a simplified example of the API service pattern used in the app. It is not meant to be copy-pasted as-is, but it shows the same structure the frontend uses when talking to the backend.

Example API client pattern:

export async function getOutfitRecommendations(userId, params = {}) {
  const query = new URLSearchParams(params).toString();
  const url = `/users/${userId}/outfits/recommend${query ? `?${query}` : ""}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${localStorage.getItem("access_token")}`,
    },
  });

  if (!response.ok) {
    throw new Error("Failed to fetch outfit recommendations");
  }

  return response.json();
}

Here's what's happening in that snippet:

URLSearchParams builds optional query strings like occasion, season, or limit.
The request path is user-scoped, which keeps each user’s recommendations isolated.
The Authorization header sends the access token so the backend can verify the session.
The response is checked before parsing so the UI can surface a useful error if the request fails.

This pattern kept the frontend simple and reusable as the number of API calls grew.
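The URL-building step is the part of this pattern that is easiest to get wrong, so here it is sketched in Python using only the standard library. The helper name build_recommendation_url is hypothetical. Dropping unset filters is an added assumption that prevents sending the literal string "None" as a filter value.

```python
from urllib.parse import urlencode


def build_recommendation_url(user_id, params=None):
    """Build the user-scoped path, appending a query string only when needed."""
    # Drop unset filters so they are not sent as the literal string "None"
    filtered = {k: v for k, v in (params or {}).items() if v is not None}
    query = urlencode(filtered)
    path = f"/users/{user_id}/outfits/recommend"
    return f"{path}?{query}" if query else path


print(build_recommendation_url(7))
print(build_recommendation_url(7, {"occasion": "work", "season": None, "limit": 5}))
```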

2. Backend Architecture with FastAPI

The backend is organized around clear route groups:

auth routes for register, login, refresh, logout, and sessions
user analysis routes
wardrobe CRUD routes
recommendation routes for outfits, shopping, and discard analysis
feedback routes for ratings and helpfulness signals

One of the most important design choices was enforcing ownership checks on user-scoped resources. That prevented one user from accessing another user’s wardrobe or feedback data.

The backend snippet below is another simplified example from the app’s route layer. It shows the request validation and orchestration logic, while the actual scoring work stays in the recommendation service.

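from fastapi import FastAPI, HTTPException

app = FastAPI()

# Helpers such as get_user_or_404, get_user_wardrobe, and outfit_generator
# are defined elsewhere in the app and omitted here.
@app.get("/users/{user_id}/outfits/recommend")
def recommend_outfits(user_id: int, occasion: str | None = None, season: str | None = None, limit: int = 10):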
    user = get_user_or_404(user_id)
    wardrobe_items = get_user_wardrobe(user_id)

    if len(wardrobe_items) < 2:
        raise HTTPException(status_code=400, detail="Not enough wardrobe items")

    recommendations = outfit_generator.generate_outfit_recommendations(
        wardrobe_items=wardrobe_items,
        body_shape=user.body_shape,
        undertone=user.undertone,
        occasion=occasion,
        season=season,
        top_k=limit,
    )

    return {"user_id": user_id, "recommendations": recommendations}

Here's how to read that code:

get_user_or_404 loads the profile data needed for personalization.
get_user_wardrobe fetches only the current user’s items.
The minimum wardrobe check prevents the recommendation logic from running on incomplete data.
generate_outfit_recommendations handles the scoring logic separately, which keeps the route handler small and easier to test.
The response returns the results in a shape the frontend can consume directly.

That separation helped keep the API layer readable while the recommendation logic stayed isolated in its own service.

3. Recommendation Logic

I intentionally started with deterministic rules before introducing heavy ML. That made behavior easier to debug and explain.

The outfit recommender scores combinations using weighted signals:

$$\text{outfit score} = 0.4 \cdot \text{color harmony} + 0.4 \cdot \text{body-shape fit} + 0.2 \cdot \text{undertone fit}$$

The snippet below is a simplified example from the recommendation engine. It shows how the app combines multiple signals into a single score:

def score_outfit(combo, user_context):
    color_score = color_harmony.score(combo)
    shape_score = body_shape_rules.score(combo, user_context.body_shape)
    undertone_score = undertone_rules.score(combo, user_context.undertone)

    total = 0.4 * color_score + 0.4 * shape_score + 0.2 * undertone_score
    return round(total, 3)

The logic behind this approach is straightforward:

color harmony helps the outfit feel visually coherent
body-shape scoring helps the outfit feel flattering
undertone scoring helps the colors work better with the user’s profile

I used a similar structure for discard recommendations and shopping suggestions, but with different factors and thresholds.
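As a worked example of the formula, the sketch below scores two made-up outfits and ranks them. It assumes each signal is already normalized to the range 0 to 1. The candidate names and signal values are invented for illustration.

```python
# Weights from the formula above; they sum to 1, so scores stay between 0 and 1
WEIGHTS = {"color": 0.4, "shape": 0.4, "undertone": 0.2}


def outfit_score(color, shape, undertone):
    """Weighted sum of the three signals, rounded like the original snippet."""
    total = (
        WEIGHTS["color"] * color
        + WEIGHTS["shape"] * shape
        + WEIGHTS["undertone"] * undertone
    )
    return round(total, 3)


def rank_outfits(candidates, top_k=10):
    """Score every candidate and return the best top_k, highest score first."""
    scored = [(name, outfit_score(**signals)) for name, signals in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up signal values for two hypothetical outfits
candidates = {
    "red shirt + green pants": {"color": 0.2, "shape": 0.8, "undertone": 0.5},
    "navy blazer + white tee": {"color": 0.9, "shape": 0.5, "undertone": 1.0},
}

for name, score in rank_outfits(candidates):
    print(f"{score:.2f}  {name}")
```

For the blazer outfit, the calculation is 0.4 × 0.9 + 0.4 × 0.5 + 0.2 × 1.0 = 0.76, so strong color harmony and undertone fit outweigh a middling body-shape score.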

4. Authentication and Secure Multi-user Design

Security was one of the most important parts of this build.

I implemented:

short-lived access tokens
refresh tokens with JTI tracking
token rotation on refresh
session revocation (single session and all sessions)
email verification and password reset flows

The snippet below is a simplified example of the refresh-token lifecycle used in the app. It shows the important control points rather than every helper function:

def refresh_access_token(refresh_token: str):
    payload = decode_jwt(refresh_token)
    jti = payload["jti"]

    token_record = db.get_refresh_token(jti)
    if not token_record or token_record.revoked:
        raise AuthError("Invalid refresh token")

    new_refresh, new_jti = issue_refresh_token(payload["sub"])
    token_record.revoked = True
    token_record.replaced_by_jti = new_jti

    new_access = issue_access_token(payload["sub"])
    return {"access_token": new_access, "refresh_token": new_refresh}

What this code is doing:

It decodes the refresh token and looks up its JTI in the database.
It rejects reused or revoked sessions, which helps prevent replay attacks.
It rotates the refresh token instead of reusing it.
It issues a fresh access token so the session stays valid without forcing the user to log in again.

This design made multi-device sessions safer and gave me server-side control over logout behavior.
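The rotation rule can be exercised without JWTs or a database. In this minimal sketch, an in-memory dictionary stands in for the refresh-token table and the JTI itself stands in for the signed token. Both are simplifications for illustration, not the app's actual storage.

```python
import uuid


class AuthError(Exception):
    """Raised when a refresh token is unknown or already revoked."""


# In-memory stand-in for the refresh-token table: jti -> record
token_store = {}


def issue_refresh_token(user_id):
    """Create a new token record and return its JTI."""
    jti = uuid.uuid4().hex
    token_store[jti] = {"sub": user_id, "revoked": False, "replaced_by_jti": None}
    return jti


def rotate_refresh_token(jti):
    """Exchange a valid token for a new one and revoke the old one."""
    record = token_store.get(jti)
    if record is None or record["revoked"]:
        raise AuthError("Invalid refresh token")
    new_jti = issue_refresh_token(record["sub"])
    record["revoked"] = True
    record["replaced_by_jti"] = new_jti
    return new_jti


first = issue_refresh_token(user_id=42)
second = rotate_refresh_token(first)

# Presenting the old token again must fail, because it was rotated out
try:
    rotate_refresh_token(first)
    reuse_rejected = False
except AuthError:
    reuse_rejected = True
print(reuse_rejected)
```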

5. Background Jobs for Long-running Operations

Image analysis can be expensive, especially when the app needs to classify clothing, analyze colors, and estimate body-shape-related signals. To keep the request path responsive, I added Celery + Redis support for background tasks.

That gave the app two modes:

synchronous processing for simpler local development
queued processing for heavier or slower jobs

That tradeoff mattered because it let me keep the developer experience simple without blocking the app during more expensive work.
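The two modes can be sketched with a single dispatch function. A thread pool stands in for the Celery worker here, and the USE_TASK_QUEUE environment flag is a hypothetical name, not taken from the app's configuration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Thread pool as a stand-in for a Celery worker (assumption for this sketch)
_executor = ThreadPoolExecutor(max_workers=2)


def analyze_image(image_name):
    """Placeholder for the expensive analysis step."""
    return {"image": image_name, "category": "unknown"}


def submit_analysis(image_name, use_queue=None):
    """Run inline or hand off to the background pool, depending on the mode."""
    if use_queue is None:
        # Hypothetical environment flag, not taken from the app's config
        use_queue = os.environ.get("USE_TASK_QUEUE", "0") == "1"
    if use_queue:
        future = _executor.submit(analyze_image, image_name)
        return {"status": "queued", "future": future}
    return {"status": "done", "result": analyze_image(image_name)}


inline = submit_analysis("shirt.jpg", use_queue=False)
queued = submit_analysis("shirt.jpg", use_queue=True)
print(inline["status"], queued["status"])
```

In queued mode the caller gets a handle back immediately and can collect the result later, which is what keeps the request path responsive.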

6. Data Model and Feedback Capture

A recommendation system only improves if it captures the right signals.

So I added dedicated feedback tables for:

outfit ratings (1-5 + optional comments)
recommendation helpful/unhelpful feedback
item usage actions (worn/kept/discarded)

Here is the shape of one of those models:

class RecommendationFeedback(Base):
    __tablename__ = "recommendation_feedback"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    recommendation_type = Column(String(50), nullable=False)
    recommendation_id = Column(Integer, nullable=False)
    helpful = Column(Boolean, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)

How to read this model:

user_id ties feedback to the person who gave it.
recommendation_type tells me whether the feedback belongs to outfits, shopping, or discard suggestions.
recommendation_id identifies the exact recommendation.
helpful stores the user’s direct response.
created_at makes it possible to analyze feedback trends over time.

This part of the system gives the app a real learning foundation, even though the feedback-to-model-update loop is still a future improvement.
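To show how these rows can become a signal, the sketch below recreates the table with the standard-library sqlite3 module and computes the share of helpful votes per recommendation type. The rows are made-up, and the query is only one possible way to aggregate feedback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recommendation_feedback (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        recommendation_type TEXT NOT NULL,
        recommendation_id INTEGER NOT NULL,
        helpful INTEGER NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# Made-up feedback rows: (user_id, type, recommendation_id, helpful)
rows = [
    (1, "outfit", 10, 1),
    (1, "outfit", 11, 0),
    (2, "outfit", 10, 1),
    (2, "shopping", 5, 0),
]
conn.executemany(
    "INSERT INTO recommendation_feedback "
    "(user_id, recommendation_type, recommendation_id, helpful) VALUES (?, ?, ?, ?)",
    rows,
)


def helpful_rate_by_type(connection):
    """Share of 'helpful' votes for each recommendation type."""
    cursor = connection.execute(
        "SELECT recommendation_type, AVG(helpful) FROM recommendation_feedback "
        "GROUP BY recommendation_type ORDER BY recommendation_type"
    )
    return dict(cursor.fetchall())


print(helpful_rate_by_type(conn))
```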

Challenges I Faced

This was the section that taught me the most.

1. Image-heavy endpoints were slower than I wanted

The analyze and wardrobe upload flows were doing a lot of work at once: image validation, classification, color extraction, storage, and database writes.

At first, that made the request flow feel heavier than it should have.

What I changed:

I bounded concurrent image jobs so the app wouldn't try to do too much at once.
I separated slower jobs into background processing where possible.
I used load-test results to confirm which endpoints were actually expensive.

The practical effect was that heavy image requests stopped competing with each other so aggressively. Instead of letting many expensive tasks pile up inside the same request cycle, I limited the active work and pushed slower operations into the queue when needed.

Why this fixed it:

Bounding concurrency prevented the system from overloading CPU-bound tasks.
Moving expensive work into async jobs kept the main request/response cycle more responsive.
Load testing gave me evidence instead of guesswork, so I could tune the system based on real performance behavior.

In other words, I didn't just “optimize” the endpoint in theory. I changed the execution model so expensive analysis could not block every other request behind it.
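Bounding concurrency can be sketched with a semaphore. Eight simulated jobs are submitted at once, but only a fixed number may run the heavy section at the same time. The limit of 2 and the sleep-based "work" are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 2  # assumed limit for this sketch
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)
_lock = threading.Lock()
_active = 0
peak_active = 0


def analyze_image(image_name):
    """Simulated heavy job that may only run while holding a slot."""
    global _active, peak_active
    with _slots:  # blocks while MAX_CONCURRENT_JOBS jobs are already running
        with _lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(0.01)  # stand-in for CPU-heavy work
        with _lock:
            _active -= 1
    return image_name


with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(analyze_image, [f"img_{i}.jpg" for i in range(8)]))

# peak_active never exceeds MAX_CONCURRENT_JOBS, even with 8 worker threads
print(len(results), peak_active)
```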

2. JWT sessions needed real server-side control

A basic JWT setup is easy to get working, but it becomes less useful if you cannot revoke sessions or manage multiple devices cleanly.

What I changed:

I stored refresh tokens in the database.
I tracked token JTI values.
I rotated refresh tokens when users refreshed their session.
I added endpoints for logging out a single session or all sessions.

The important shift here was moving from “token exists, therefore session is valid” to “token exists, matches the database record, and has not been revoked or replaced.” That gave the server the authority to invalidate old sessions immediately.

Why this fixed it:

Server-side token tracking made revocation possible.
Rotation reduced the chance of token reuse.
Session management became visible to the user, which made the app feel more trustworthy.

This is what made logout-all and multi-device management work in a real way instead of just being cosmetic UI actions.
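Single-session and all-session logout can be sketched against an in-memory session table. The dictionary and the JTI values below are made-up stand-ins for the database records described above.

```python
# In-memory stand-in for the session table: jti -> record (made-up values)
sessions = {
    "jti-phone": {"user_id": 42, "revoked": False},
    "jti-laptop": {"user_id": 42, "revoked": False},
    "jti-other": {"user_id": 7, "revoked": False},
}


def is_session_valid(jti):
    """A session counts only if it exists and has not been revoked."""
    record = sessions.get(jti)
    return record is not None and not record["revoked"]


def revoke_session(jti, user_id):
    """Log out one device; only the owner may revoke their own session."""
    record = sessions.get(jti)
    if record is None or record["user_id"] != user_id:
        return False
    record["revoked"] = True
    return True


def revoke_all_sessions(user_id):
    """Log out every device for a user and report how many were revoked."""
    count = 0
    for record in sessions.values():
        if record["user_id"] == user_id and not record["revoked"]:
            record["revoked"] = True
            count += 1
    return count


revoke_session("jti-phone", user_id=42)        # log out the phone only
remaining = revoke_all_sessions(user_id=42)    # then log out everything else
print(remaining, is_session_valid("jti-other"))
```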

3. User data isolation had to be explicit

Because this is a multi-user app, I had to be careful that one account could never accidentally see another account’s wardrobe data.

What I changed:

I added ownership checks to user-scoped routes.
I kept all wardrobe and feedback queries filtered by user_id.
I used encrypted image storage instead of exposing raw paths.

In practice, this meant every route had to ask the same question: “Does this user own the resource they are trying to access?” If the answer was no, the request stopped immediately.

Why this fixed it:

Ownership checks made data access rules explicit.
User-filtered queries prevented accidental cross-account reads.
Encrypted storage improved privacy and reduced the risk of exposing image data directly.

That combination is what kept wardrobe data, feedback history, and images separated correctly across accounts.
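The ownership question can be sketched as two small data-access helpers. The Forbidden exception, the helper names, and the wardrobe rows are hypothetical. In the real app, these checks sit in the route and query layers.

```python
class Forbidden(Exception):
    """Raised when a user requests a resource they do not own."""


# Made-up wardrobe rows: item_id -> record with its owning user
WARDROBE_ITEMS = {
    101: {"owner_id": 1, "name": "denim jacket"},
    102: {"owner_id": 2, "name": "wool coat"},
}


def get_item_for_user(item_id, current_user_id):
    """Return the item only if the current user owns it."""
    item = WARDROBE_ITEMS.get(item_id)
    if item is None:
        raise KeyError(f"item {item_id} not found")
    if item["owner_id"] != current_user_id:
        raise Forbidden("this item belongs to another account")
    return item


def list_items_for_user(current_user_id):
    """Always filter by owner so cross-account rows never leave the data layer."""
    return [i for i in WARDROBE_ITEMS.values() if i["owner_id"] == current_user_id]


print(get_item_for_user(101, current_user_id=1)["name"])
try:
    get_item_for_user(102, current_user_id=1)
    blocked = False
except Forbidden:
    blocked = True
print(blocked)
```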

4. The multi-service setup had to be reproducible

The app includes the frontend, backend, Redis, Celery worker, and Celery Beat, so the challenge here was making the setup feel reproducible instead of fragile.

What I changed:

I defined the stack in Docker Compose.
I documented the required environment variables.
I kept the dev stack aligned with how the app runs in practice.

This removed a lot of setup ambiguity. Instead of asking someone to manually figure out how the frontend, backend, Redis, and workers fit together, I made the stack describe itself.

Why this fixed it:

Docker let contributors start the project with fewer manual steps.
Clear environment configuration reduced setup mistakes.
Matching the stack to the architecture made the app easier to understand and test.

That was important because the app depends on several moving parts, and the simplest way to make the project approachable was to make startup behavior predictable.

What I Learned

This project taught me a few important lessons:

Small features become much more valuable when they work together.
Feedback data is one of the strongest signals for improving recommendations.
Clean data modeling matters a lot when multiple users are involved.
Docker and clear setup instructions make a project much easier for other people to try.

I also learned that a project does not need to be huge to be useful. A focused app that solves one problem well can still feel meaningful.

What I Want to Improve Next

My roadmap from here:

Integrate feedback directly into ranking updates
Add visual analytics for recommendation quality trends
Improve mobile UX parity
Deploy with persistent cloud storage and production database defaults
Provide a public demo mode for easier evaluation

Future Improvements

There are still a few things I would like to add later:

a more advanced recommendation engine
visual analytics for user feedback
better mobile support
live deployment with persistent cloud storage
a public demo mode for easier testing

Conclusion

This project began as a personal frustration and turned into a full web application with authentication, wardrobe storage, recommendation logic, and feedback infrastructure.

The most rewarding part was seeing how practical software decisions, not just flashy UI, can help people make everyday choices faster.

If you want to explore or run the project, check out the repo. You can try the flows and share feedback. I would especially love input on recommendation quality, UX clarity, and what features would make this genuinely useful in daily life.

The Math Behind Artificial Intelligence: A Guide to AI Foundations [Full Book]

Tiago Capelo Monteiro — Tue, 06 Jan 2026 23:14:23 +0000

"To understand is to perceive patterns." - Isaiah Berlin

This is not a math book filled with complex formulas, theorems, and concepts that are hard to grasp.

Instead, it’s a detailed guide where we’ll break complex ideas down into simpler terms.

Even if you only have a general understanding of algebra, you should be able to easily follow along.

Here’s what we’ll cover:

Chapter 1: Background on this Book
Chapter 2: The Architecture of Mathematics
Chapter 3: The Field of Artificial Intelligence
Chapter 4: Linear Algebra - The Geometry of Data
Chapter 5: Multivariable Calculus - Change in Many Directions
Chapter 6: Probability & Statistics - Learning from Uncertainty
Chapter 7: Optimization Theory - Teaching Machines to Improve
Conclusion: Where Mathematics and AI Meet
About the Author

Chapter 1: Background on this Book

The Objective Here

My objective in this book is simple: Explain the key mathematical ideas you need to grasp in order to deeply understand AI and train machine learning models.

So you might be wondering: Why is it important to have a good math foundation before creating these models?

Well, there are many reasons, but some are:

It gives you the capacity to understand new AI research on your own.
You can use this same foundation to study other STEM concepts like signal theory and advanced statistical methods.
It helps you understand that AI models are just a mixture of different math ideas working together and gives you insight into how new innovations make LLMs more efficient.
It gives you a foundation so you know how to calibrate AI models and even create derivative models.

These skills are also important for startup founders, especially in Silicon Valley. Many startups begin with APIs or API wrappers but eventually need their own AI solutions.

Outsourcing all AI isn't ideal. This book will help you understand AI foundations so you can design better growth strategies and communicate effectively with investors – especially those who were successful technical co-founders.

Why is This Book About AI Different?

In this book, we’ll look at AI from an engineering perspective. This differs from the typical computer science approach to AI that most introductory courses take.

In doing so, I won’t spend a lot of time explaining formulas and theorems. Instead, I’ll explain why they matter, and how and why they are applied the way they are.

In this way, I hope to offer a unique viewpoint that emphasizes the engineering principles and good practices that underlie all modern AI technologies.

I will also explain how many of these strange math ideas make billion-dollar industries possible.

We’ll start with the fundamentals: the structure of the areas of mathematics and AI. After that, we’ll look at the four subareas of math that make AI possible:

Linear Algebra
Calculus
Probability Theory and Statistics
Optimization Theory

After going through all the math, we’ll connect it with the foundation of ChatGPT and all of these large language models.

This way, you’ll get a basic foundation in key math concepts that, when mixed together like the ingredients of a cake, make all AI models possible.

By knowing where the ideas come from, you’ll develop a system-level understanding of AI and a first-principles approach.

So just keep in mind that, even though concepts like integral calculus and eigenvalues/eigenvectors might not be widely used in AI, they’ll help you develop these system-level and first-principle approaches.

Also, this book will be a work in progress. After its first release, I’ll seek feedback on things I need to perfect, chapters to add, and so on.

Here is my email for any feedback you might have: monteiro.t@northeastern.edu

And here is the book’s GitHub repository with all code: https://github.com/tiagomonteiro0715/The-Math-Behind-Artificial-Intelligence-A-Guide-to-AI-Foundations

Let Me Introduce Myself

My name is Tiago Monteiro. I'm an electrical and computer engineer and an AI master's degree student at Northeastern University's Silicon Valley campus. I have authored 20+ articles with 240K+ views here on freeCodeCamp on math, AI, and tech.

If you’d like to know more about my background, I’ll share that at the end of the book.

Prerequisites

In terms of minimum requirements, you only need to know the basics of mathematics and programming:

Math: basic algebra, plus what functions and the coordinate system are.
Programming: enough Python to read code and understand things like variables, functions, and loops.

Chapter 2: The Architecture of Mathematics

Math is more than numbers. It’s the science of locating complex patterns that shape our world. To truly understand math, we must look beyond numbers and formulas to grasp its structures.

This chapter aims to show math as a growing tree of ideas, a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math deeply and how to apply it to programming.

I’ve included code examples to connect theory and practice, showing how math ideas apply to real problems. Whether you're new to advanced math or are more experienced, these examples will help you apply math in programming.

This way, before we start going over the different math pillars that sustain AI, you will understand the structure of the field.

The Tree of Mathematics: How Everything Connects

Photo by Lerkrat Tangsri

Imagine math as a vast, ever-growing tree.

The roots are the foundations: logic and set theory. From these roots, the main fields emerge: arithmetic, algebra, geometry, and analysis.

As the tree branches out, new subfields like topology and abstract algebra appear. Sometimes branches connect with each other.

This tree keeps growing in many directions. History shows that sometimes it grows rapidly due to scientific discoveries, while at other times, growth is slow.

And you might wonder: How many more branches and connections between them will keep appearing?

A Quick History of Mathematics: From Counting to Infinity

Early mathematical ideas emerged in different civilizations and eras, such as:

India's invention of zero
Islamic algebraic advances
Greek geometric rigor

Great mathematicians developed and shared these ideas through writing and lectures. Over time, new generations built on these ideas, creating new branches of mathematics. This endless growth is why Isaac Newton wrote to Robert Hooke in 1675:

“If I have seen further, it is by standing on the shoulders of giants.”

He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.

Yet, the real power of math lies in practicing it over and over and studying it more and more deeply.

As one of my professors once pointed out:

“More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.”

To solve problems, it's often necessary to think from first principles, and math teaches this. Math is not just an academic topic. It’s a global language for scientists and engineers.

By preserving and sharing it, new math can grow from old ideas, allowing the tree to keep expanding.

Foundations of Relativity: How Einstein Used Math to Understand Space and Time

Photo by Pixabay

Albert Einstein developed the general and special theories of relativity, which impact:

GPS and global communication
Satellite telecommunications
Space exploration and satellite launches

And more.

But this was only possible by combining geometry with calculus, a combination known as differential geometry. This field evolved over centuries, thanks to many great mathematicians. Here are a few of them, though the list is not exhaustive:

Euclid (circa 300 BCE): Contributed to geometry, laying the groundwork for later mathematical systems
Archimedes (circa 287–212 BCE): Pioneered the understanding of volume, surface area, and the principles of mechanics
René Descartes (1596–1650): Developed Cartesian coordinates and analytical geometry
Isaac Newton (1642–1727) & Gottfried Wilhelm Leibniz (1646–1716): Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.
Leonhard Euler (1707–1783): Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.
Gaspard Monge (1746–1818): The father of differential geometry and pioneer in descriptive geometry
Carl Friedrich Gauss (1777–1855): Made groundbreaking advances in geometry, including the concept of curved surfaces.
Bernhard Riemann (1826–1866): Introduced Riemannian geometry, a branch of differential geometry.

Going back to Albert Einstein, he saw what no one else in his time saw, thanks to these great math giants and countless others.

Gödel’s Biggest Paradox: Can Math Explain Itself?

One of the most surprising results in math, proved by Kurt Gödel, is his pair of incompleteness theorems. They show that in any consistent formal system capable of expressing basic arithmetic, there are true statements that cannot be proven within the system.

This means there are limits to what can be proven as true or false inside any single system. For mathematicians, this implies that some truths lie beyond the reach of a given set of axioms. No matter how much effort or AI is used, some statements will remain unprovable within the system in which they are stated.

What About Applied Math and Engineering?

Applied math and engineering involve adapting the pure math ideas in real-world scenarios.

Actually, in many cases, it’s the combination of many math ideas.

Let’s consider some examples:

In harmonic analysis, the Laplace, Fourier, and Z-transforms are ways to see the same signal in a new domain to get new insights. Integrals (or sums, in the case of the Z-transform) make this mapping possible.
Principal component analysis (PCA) is a widely used tool in data science. It is a mixture of linear algebra (the eigenvalues and eigenvectors of the data's covariance matrix) with optimization (keeping the directions with the largest eigenvalues, which capture the most variance) in order to represent a dataset with fewer dimensions.
In machine learning, logistic regression is a mixture of calculus with statistics and probability.
In deep learning, neural networks are essentially many matrices that are multiplied together and repeatedly updated until they model a dataset representing a system. This optimization of matrix values relies on activation functions, backpropagation (which computes how much each value needs to change), and a gradient descent-based optimization method (which applies those changes to all matrix values).
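To make the PCA item above more concrete, here is a minimal sketch using only NumPy. The toy dataset and all variable names are assumptions made for illustration. It shows the linear algebra step (eigenvalues and eigenvectors of the covariance matrix) and the optimization step (keeping the direction with the largest eigenvalue).

```python
import numpy as np

# Hypothetical toy dataset: 6 samples, 2 correlated features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7]])

# 1. Center the data so each feature has zero mean
X_centered = X - X.mean(axis=0)

# 2. Linear algebra: eigen-decomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is meant for symmetric matrices

# 3. Optimization: keep the direction with the largest eigenvalue (most variance)
order = np.argsort(eigenvalues)[::-1]
top_component = eigenvectors[:, order[0]]

# 4. Project the data onto that single direction (2 features -> 1)
X_reduced = X_centered @ top_component

explained = eigenvalues[order[0]] / eigenvalues.sum()
print("Reduced data shape:", X_reduced.shape)
print("Share of variance kept:", round(float(explained), 3))
```

In practice, a library such as scikit-learn would usually handle these steps, but writing them out shows where each math idea enters.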

But the best example of this fusion of math in engineering is in control theory. Control theory is the study of how to make a system behave the way we want by measuring its output and adjusting its input. Trains, cars, and airplanes all rely on it. It’s everywhere, in nearly all modern electronic devices. In electric circuits, control theory is also used heavily to guarantee circuit stability in the face of electric disturbances.

So as you can probably start to see, many of the tools we now have are just a mixture of many pure math ideas, like different recipes. In essence, applied math is the application of pure math as "ingredients" in "recipes" to solve problems.

So, we’ve explored the structure and evolution of mathematics. But it’s important to see how we can apply these ideas in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.

Code Examples: Analytical and Numerical Approaches

These code examples demonstrate a couple of ways you can use Python to solve math equations.

In the first code example, we’ll solve the problem in the same way that kids in school solve math exercises: essentially, by hand with a pencil. In the second example, we’ll solve the problem using numerical analysis.

Example 1: Solve a Problem Analytically

In this problem, we need to find the values of the variables x and y. So we’ll be moving variables from left to right to find their values.

When we solve math problems analytically, like we did in school, we are manipulating symbols to get exact values. Often these symbols are x, y, and z.

The code below solves a system of two equations with two unknown variables, x and y.

We will use the SymPy Python library to do this. It’s mainly used for symbolic mathematics.

from sympy import symbols, Eq, solve

x, y = symbols('x y')
eq1 = Eq(2*x + 3*y, 6)
eq2 = Eq(-x + y, 1)

solution = solve((eq1, eq2), (x, y))
print(solution)

This code finds the values of x and y that satisfy both equations at the same time.

Essentially, we’re solving this system of equations:

$$\begin{align} 2x + 3y &= 6 \\ -x + y &= 1 \end{align}$$

Which gives us the following result:

{x: 3/5, y: 8/5}

Or:

x = 0.6
y = 1.6

When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.

But many times, problems are harder and can’t be solved just by moving symbols to the right or left of the equation. Sometimes, there can be so many symbols and transformed versions of them, with things like derivatives and integrals, that the problem becomes very hard to manage and takes a lot of time.

For example, let’s look at this partial differential equation:

$$\begin{cases} \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}, & 0 < x < L, \; t > 0 \\ u(0,t) = 0, & t > 0 \\ u(L,t) = 0, & t > 0 \\ u(x,0) = f(x), & 0 < x < L \end{cases}$$

It can be solved with an analytical method called separation of variables.

But it requires many steps, and it’s easy to make mistakes. Even engineers who learned this often struggle to remember the process later.
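For reference, this is the standard form of the result that separation of variables gives for the problem above: the solution is written as a series of sine waves that decay over time. The coefficients $B_n$ are notation introduced here for illustration.

$$u(x,t) = \sum_{n=1}^{\infty} B_n \sin\left(\frac{n\pi x}{L}\right) e^{-\alpha \left(\frac{n\pi}{L}\right)^2 t}, \qquad B_n = \frac{2}{L}\int_0^L f(x) \sin\left(\frac{n\pi x}{L}\right) dx$$

Each term satisfies both boundary conditions because the sine is zero at $x = 0$ and $x = L$, and the coefficients $B_n$ are chosen so that the series matches $f(x)$ at $t = 0$.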

When I first encountered this type of math exercise in my electrical and computer engineering degree back in Portugal, it took me 20 to 30 minutes to solve it.

For this reason, there's a branch of mathematics called numerical analysis that focuses on algorithms for finding approximate solutions when exact ones are hard to obtain. It helps solve problems faster. This is the method we'll explore next.
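As a rough illustration of the numerical route, here is a sketch that approximates the heat equation above with finite differences, using only the Python standard library. The values of alpha, L, the grid size, and the initial condition f(x) = sin(πx/L) are all assumptions chosen for illustration. For that particular f(x) the exact solution is known, so the two results can be compared.

```python
import math

# Assumed values, for illustration only
alpha = 1.0                  # diffusion coefficient
L = 1.0                      # length of the rod
nx = 21                      # number of grid points along x
dx = L / (nx - 1)
dt = 0.4 * dx**2 / alpha     # this explicit scheme is stable when alpha*dt/dx^2 <= 0.5
steps = 200

# Initial condition f(x) = sin(pi * x / L); both ends are held at 0
u = [math.sin(math.pi * i * dx / L) for i in range(nx)]
u[0] = u[-1] = 0.0

r = alpha * dt / dx**2
for _ in range(steps):
    new_u = u[:]
    for i in range(1, nx - 1):
        # The second derivative in x is replaced by a finite difference
        new_u[i] = u[i] + r * (u[i + 1] - 2 * u[i] + u[i - 1])
    u = new_u

# Compare with the exact solution at the middle of the rod
t = steps * dt
mid = nx // 2
exact = math.exp(-alpha * (math.pi / L) ** 2 * t) * math.sin(math.pi * mid * dx / L)
print("numerical:", u[mid])
print("exact:    ", exact)
```

The two printed values agree only approximately. That gap is the price of the numerical approach, and it shrinks as the grid gets finer.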

Example 2: Solve Numerically (Approximation)

Now let’s solve a different problem: we’re going to find the values of each of the 5 variables:

$$\begin{bmatrix} 3 & 2 & -1 & 4 & 5 \\ 1 & 1 & 3 & 2 & -2 \\ 4 & -1 & 2 & 1 & 0 \\ 5 & 3 & -2 & 1 & 1 \\ 2 & -3 & 1 & 3 & 4 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 12 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}$$

Solving this by hand will take some time…but with Python code, it’s very fast.

We’ll also use the SciPy Python library for this example.

Let’s solve the system numerically:

import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])

b = np.array([12, 5, 7, 9, 10])

solution = solve(A, b)

print(solution)

This corresponds to solving the matrix equation shown above.

Again, solving it by hand takes time, and it’s very easy to make a simple mistake.

But in this code example, this line of code:

solution = solve(A, b)

Uses the solve function from SciPy:

from scipy.linalg import solve

This function finds the values of x in an equation A⋅x=b, where A is a square grid of numbers (a square matrix) and b is a list of numbers. That gives us the following:

[ 1.35022026 -0.79955947 -1.17180617  3.14317181 -0.83920705]

Which corresponds to:

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 1.35022026 \\ -0.79955947 \\ -1.17180617 \\ 3.14317181 \\ -0.83920705 \end{bmatrix}$$

And is the same thing as:

$$\begin{align} x_1 &= 1.35022026 \\ x_2 &= -0.79955947 \\ x_3 &= -1.17180617 \\ x_4 &= 3.14317181 \\ x_5 &= -0.83920705 \end{align}$$
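One way to gain confidence in a numerical answer is to plug it back into the original equation. The sketch below does that for the 5-variable system. As an assumption for illustration, it uses NumPy's own `np.linalg.solve` instead of the SciPy function, so that it depends on NumPy alone.

```python
import numpy as np

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])
b = np.array([12, 5, 7, 9, 10])

solution = np.linalg.solve(A, b)

# Residual: how far A·x is from b for the solution we found
residual = A @ solution - b
print("Largest error:", np.max(np.abs(residual)))

# allclose checks equality up to a small floating-point tolerance
print("Solution checks out:", np.allclose(A @ solution, b))
```

The largest error is tiny but usually not exactly zero, because the computer works with a limited number of digits.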

Why These Two Approaches Matter

We have solved two mathematical problems in two different ways:

Analytical: Exact solutions through algebraic manipulation
Numerical: Approximate solutions using algorithms

In engineering and in AI, we are constantly choosing between these approaches.

When training AI models with millions of parameters, analytical solutions are impractical or don’t exist at all. This is why, in these cases, we need numerical approaches.

When creating math theorems, we need analytical precision to make sure the result is exactly right.

This is one of the many things an engineering degree teaches you: often, in the real world, it’s better to just write some code to solve a problem than to actually solve it by hand with math. Other times, the best solution is to just think in first principles and from there create new theorems to solve a problem.
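To see both approaches side by side on one problem, here is a small sketch that fits a line y = w·x to a few points. The data, the learning rate, and the number of iterations are hypothetical values chosen for illustration. The analytical route uses the closed-form least-squares formula, and the numerical route uses gradient descent, the same family of methods used to train AI models.

```python
# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Analytical: closed-form least-squares slope for the model y = w * x
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Numerical: gradient descent on the mean squared error
w = 0.0
learning_rate = 0.01
for _ in range(1000):
    # Derivative of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print("analytical:", w_exact)
print("numerical: ", w)
```

With one parameter, the formula is clearly the simpler option. With millions of parameters and non-linear models, no such formula is available, and the iterative loop is what remains.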

Now let's step out of the code examples and see how different branches of mathematics connect.

The Impact of a Grand Unified Theory of Mathematics

Is it possible to unify all math?

In theory, yes. This is known as the Grand Unified Theory of Mathematics. It's the idea that all different areas of math can be linked together to discover deeper patterns in mathematics.

The Langlands program is trying to make this unification possible. It’s an attempt to interconnect the largest parts of the big tree of math to uncover new patterns in math.

With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.

What’s the Value of this Big Unification for Society?

By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:

In the 19th century, James Clerk Maxwell united the fields of electricity and magnetism with what are now known as Maxwell’s equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for much of the technological progress in the 20th and 21st centuries.
In the 20th century, the unification of algebra with logic led to the rise of digital systems. In turn, digital systems gave rise to processors and the evolution of computers and the modern laptop.
Also in the 20th century, the unification of probability and communication led to information theory. This became the foundation for the internet. This unification was carried out by a great mathematician named Claude Shannon.

In the end, a grand unified theory of mathematics could be one of the biggest achievements in modern society.

In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models and could also open the door to new material science advances.

It could help reveal – with math – the deep patterns we still haven’t found in these fields. Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.

A Final Lesson From History

From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it’s possible to see its role in finding the patterns of our universe.

I hope I was able to make you see math in this way. I hope you can also see that the unification of scientific fields helps lay the foundations for the creation of new innovations to help society go forward.

Many major societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.

Chapter 3: The Field of Artificial Intelligence

What is Artificial Intelligence?

Photo by Pavel Danilyuk

The term Artificial Intelligence was born from the work of John McCarthy, who is often called the "father of AI."

He used it when he, along with Marvin Minsky, Nathaniel Rochester, and Claude Shannon, wrote the 1955 proposal for the famous Dartmouth Summer Research Project on Artificial Intelligence, which took place in 1956.

The proposal was based on the conjecture that:

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

Since then, the field has evolved in waves of innovation, from early rules-based systems to modern neural networks.

But over time, rather than creating general intelligence, most AI systems have been designed to excel at narrow tasks.

For example:

Chess-playing programs like Deep Blue that defeated world champion Garry Kasparov
Image recognition systems that can identify objects in photographs with impressive accuracy
Natural language processing models that can translate between languages
Game-playing AI like AlphaGo that mastered the ancient game of Go

Artificial General Intelligence isn’t yet here

Only very narrow AI models have demonstrated human-level or superhuman performance in their narrow domains.

In my view, and as we will see in this book, AGI will come from different large language models interacting with each other and with the tools available to them.

Symbolic vs. Non-symbolic AI: What’s the Difference?

What is Symbolic AI?

Symbolic AI refers to the creation of a program based on many rules and symbols to simulate how humans think.

It uses symbols to represent concepts (like farms and distributors) and logical rules to reason about them.

The specific data about your domain is called facts. Facts are the pieces of information the rules operate on. For example, a fact might be "green_acres has high water usage and good pH levels."

Also, imagine someone wants to optimize farm distribution logistics. The symbols would represent farms, distributors, and transport methods. Then the rules would be:

If a farm has high water usage and good pH levels, then classify it as a high-yield producer
If a farm is a high-yield producer and a distributor has low demand, then prioritize a direct connection between them
If a direct connection is needed, then select the transport with the lowest environmental impact

The facts would be the actual data like "farm X has high water usage" or "distributor Y has low demand."

This way, the system combines these rules and facts through logical reasoning to make decisions. A very popular programming language in this field is Prolog, which was designed for creating rule-based systems.
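Before looking at Prolog, the three rules above can be sketched in plain Python to show how rules and facts combine. All names and values here (farm_x, dist_y, truck_a, van_b, and the impact scores) are hypothetical and are not taken from any real project.

```python
# Facts: hypothetical example data
farms = {
    "farm_x": {"water_usage": "high", "ph": "good"},
    "farm_y": {"water_usage": "low", "ph": "good"},
}
distributors = {"dist_y": {"demand": "low"}}
transports = {"truck_a": {"impact": 3}, "van_b": {"impact": 1}}

def is_high_yield(farm):
    # Rule 1: high water usage and good pH -> high-yield producer
    return farm["water_usage"] == "high" and farm["ph"] == "good"

def needs_direct_connection(farm, distributor):
    # Rule 2: high-yield producer and low-demand distributor -> direct connection
    return is_high_yield(farm) and distributor["demand"] == "low"

def pick_transport(options):
    # Rule 3: choose the transport with the lowest environmental impact
    return min(options, key=lambda name: options[name]["impact"])

# Reasoning: apply the rules to every farm/distributor pair
decisions = []
for farm_name, farm in farms.items():
    for dist_name, dist in distributors.items():
        if needs_direct_connection(farm, dist):
            decisions.append((farm_name, dist_name, pick_transport(transports)))

print(decisions)
```

In Prolog, the loops disappear: you state the rules and facts, and the language's built-in search finds the combinations that satisfy them.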

Symbolic AI program: Manage agricultural networks with a Prolog program.

Let’s look at an example project to understand this more clearly. The project we’ll examine is called SymbolicAIHarvest. It was part of a course at NOVA University during my undergraduate studies in Electrical and Computer Engineering. The course was titled "Modelation of Data in Engineering."

SymbolicAIHarvest is an AI system developed with Prolog to manage agricultural networks. Here’s the project on GitHub so you can check it out.

The project optimizes farm operations using rule-based reasoning. It monitors sensors for real-time data and improves route planning for machinery. It also coordinates produce movement to reduce delays and waste, enhancing productivity and sustainability.

Understanding the code below is not a priority for this book. I just want to show you an example of all the facts of the project:

% FARMERS(owner)
farmer(ana).
farmer(asdrubal).
farmer(miguel).
farmer(joao).
farmer(teresinha).
farmer(victor).
farmer(carlos).
farmer(anabela).

% FARMS(name, owner, region, type)
farm(q1, ana, alentejo, vinha).
farm(q2, ana, alentejo, olival).
farm(q3, asdrubal, lisboa, cenoureira).
farm(q4, asdrubal, lisboa, milharal).
farm(q5, asdrubal, lisboa, vinha).
farm(q6, miguel, evora, trigal).
farm(q7, miguel, evora, cenoureira).
farm(q8, miguel, evora, vinha).
farm(q9, miguel, evora, morangueira).
farm(q10, joao, porto, vinha).
farm(q11, joao, porto, trigal).
farm(q12, joao, porto, cenoureira).
farm(q13, teresinha, algarve, olival).
farm(q14, teresinha, algarve, vinha).
farm(q15, victor, setubal, olival).
farm(q16, victor, setubal, vinha).
farm(q17, victor, setubal, trigal).
farm(q18, carlos, sintra, milharal).
farm(q19, carlos, sintra, vinha).
farm(q20, anabela, coina, milharal).
farm(q21, anabela, coina, olival).
farm(q22, anabela, coina, trigal).

% SENSOR READINGS(name, type, value)
sensor_reading(q1,humidity,28).
sensor_reading(q2,humidity,35).
sensor_reading(q3,humidity,42).
sensor_reading(q4,humidity,38).
sensor_reading(q5,humidity,33).
sensor_reading(q6,humidity,45).
sensor_reading(q7,humidity,30).
sensor_reading(q8,humidity,36).
sensor_reading(q9,humidity,50).
sensor_reading(q10,humidity,41).
sensor_reading(q11,humidity,40).
sensor_reading(q12,humidity,44).
sensor_reading(q13,humidity,32).
sensor_reading(q14,humidity,29).
sensor_reading(q15,humidity,47).
sensor_reading(q16,humidity,39).
sensor_reading(q17,humidity,53).
sensor_reading(q18,humidity,27).
sensor_reading(q19,humidity,24).
sensor_reading(q20,humidity,31).
sensor_reading(q21,humidity,37).
sensor_reading(q22,humidity,46).
sensor_reading(q1, temperature, 25).
sensor_reading(q2, temperature, 25).
sensor_reading(q3, temperature, 25).
sensor_reading(q4, temperature, 25).
sensor_reading(q5, temperature, 25).
sensor_reading(q6, temperature, 25).
sensor_reading(q7, temperature, 25).
sensor_reading(q8, temperature, 25).
sensor_reading(q9, temperature, 25).
sensor_reading(q10, temperature, 25).
sensor_reading(q11, temperature, 25).
sensor_reading(q12, temperature, 25).
sensor_reading(q13, temperature, 25).
sensor_reading(q14, temperature, 25).
sensor_reading(q15, temperature, 25).
sensor_reading(q16, temperature, 25).
sensor_reading(q17, temperature, 25).
sensor_reading(q18, temperature, 25).
sensor_reading(q19, temperature, 25).
sensor_reading(q20, temperature, 25).
sensor_reading(q21, temperature, 25).
sensor_reading(q22, temperature, 25).
sensor_reading(q1, water, 47000).
sensor_reading(q2, water, 52500).
sensor_reading(q3, water, 39000).
sensor_reading(q5, water, 61000).
sensor_reading(q8, water, 58000).
sensor_reading(q10, water, 43000).
sensor_reading(q13, water, 72000).
sensor_reading(q16, water, 49000).
sensor_reading(q18, water, 35000).
sensor_reading(q21, water, 66500).
sensor_reading(q1, ph, 6.5).
sensor_reading(q2, ph, 4.7).
sensor_reading(q3, ph, 8.2).
sensor_reading(q4, ph, 7.0).
sensor_reading(q5, ph, 5.1).
sensor_reading(q6, ph, 8.0).
sensor_reading(q7, ph, 4.5).

% DISTRIBUTORS (name, region, capacity, demand level)
distributor(d1, alentejo, 1000, 2).
distributor(d2, lisboa, 800, 1).
distributor(d3, evora, 1200, 3).
distributor(d4, porto, 900, 2).
distributor(d5, algarve, 700, 2).
distributor(d6, setubal, 1100, 1).
distributor(d7, sintra, 950, 2).
distributor(d8, coina, 1000, 1).

% TRANSPORTS (name, capacity, type, autonomy, region, impact)
transport(t1, 1000, fossil, 100, alentejo, 3).
transport(t2, 500, electric, 10, alentejo, 1).
transport(t3, 800, fossil, 400, algarve, 5).
transport(t4, 700, hybrid, 300, setubal, 2).
transport(t5, 150, electric, 340, coina, 1).
transport(t6, 700, fossil, 220, porto, 3).
transport(t7, 900, hybrid, 350, evora, 2).
transport(t8, 1000, electric, 170, sintra, 1).

% Connections based on graph image

% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).

This Prolog code models an agricultural supply chain system that has:

Farmers
Farms
Sensor Readings
Distributors
Transports

In addition, in this part of the code on the facts of the system:

% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).

We connect farms with distributors. For example, we can see that the distance between farm q1 and distributor d1 is 7 km. With these distances, we can write algorithms that find the shortest path between any two points in the network.
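As an illustration of the shortest-path idea, here is a Python sketch of Dijkstra's algorithm running on the same link facts. This is not the project's Prolog implementation. It also assumes that every link can be travelled in both directions, which the facts themselves do not state.

```python
import heapq

# The link facts from the listing above, as (from, to, distance) tuples
links = [("q2", "d1", 5), ("q1", "d1", 7), ("q3", "d1", 6),
         ("q3", "q4", 8), ("q4", "d2", 6), ("q4", "d3", 7),
         ("q4", "q5", 5), ("q4", "d4", 6),
         ("q2", "d2", 8), ("q3", "d3", 7)]

# Assumption: links are bidirectional
graph = {}
for a, b, dist in links:
    graph.setdefault(a, []).append((b, dist))
    graph.setdefault(b, []).append((a, dist))

def shortest_distance(start, goal):
    """Dijkstra's algorithm: smallest total distance, or None if unreachable."""
    queue = [(0, start)]          # priority queue ordered by distance so far
    best = {start: 0}
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue              # a shorter route to this node was already found
        for neighbor, weight in graph.get(node, []):
            new_dist = dist + weight
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))
    return None

print("q1 -> d1:", shortest_distance("q1", "d1"))
print("q1 -> d4:", shortest_distance("q1", "d4"))
```

The first query uses a single direct link, while the second has to chain several links through d1, q3, and q4.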

In the end, symbolic AI just creates programs based on a context and rules applied to that context.

What is Non-Symbolic AI?

Non-symbolic AI doesn’t use symbols or rules to think. Instead, it’s data-driven. In other words, it learns patterns from large datasets. This is the approach used in machine learning and deep learning.

When we create an AI model, we can associate it with an API (Application Programming Interface) so that we can use the AI model in websites, applications, and other systems. Basically, the trained AI model is set up behind an API endpoint. An API endpoint is like a web service that lets other applications send requests to the model and get responses back.

For example, when you use ChatGPT in a web browser, your messages are sent through OpenAI's API to their language model, which processes your input and sends back a response.

An AI agent is a software program that can autonomously perform tasks by making decisions and taking actions to achieve specific goals.

Unlike basic chatbots that only reply to questions, AI agents can plan steps, use tools, and work towards achieving complex goals. They do this by combining language models with extra features like accessing outside data or working with other AI agents.

Here’s an example of a non-symbolic AI agent project I worked on. I developed it using the OpenAI API and crewAI, one of the most popular Python libraries for creating AI agents.

In this system, five AI agents collaborate to create optimized content:

Research and Fact Checker: Conducts research to find trends and data.
Audience Specialist: Analyzes audience needs for better engagement.
Lead Content Writer: Writes engaging content based on research.
Senior Editorial Director: Ensures content quality and consistency.
SEO Specialist: Optimizes content for search engines.

Through the OpenAI API, crewAI uses OpenAI’s language models (the ones behind ChatGPT) to have these agents work for me.

Before AI: Control Theory as the “First AI”

Before symbolic and non-symbolic AI, electrical engineering already had data-driven methods. One key area, which I’ve already mentioned, is control theory (which studies control systems for machines like cars and rockets). This field allows us to design systems that ensure stability despite disturbances and achieve goals beyond human capabilities.
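To give a feel for what "stability despite disturbances" means, here is a minimal simulation of a proportional-integral (PI) controller, one of the classic tools of control theory. The system model, the gains, and every number below are assumptions chosen for illustration, loosely inspired by a car's cruise control.

```python
# Hypothetical first-order system: d(speed)/dt = -speed + control + disturbance
target = 10.0          # the speed we want to hold
speed = 0.0
integral = 0.0
kp, ki = 2.0, 1.0      # assumed controller gains
dt = 0.01              # simulation time step

for step in range(3000):
    # Halfway through, a constant disturbance appears (e.g. the car starts climbing a hill)
    disturbance = -3.0 if step >= 1500 else 0.0

    error = target - speed                  # feedback: compare the goal with the measurement
    integral += error * dt                  # accumulated error over time
    control = kp * error + ki * integral    # PI control law

    speed += (-speed + control + disturbance) * dt

print("final speed:", round(speed, 3))
```

Even after the disturbance appears, the speed returns close to the target, because the integral term keeps pushing until the error vanishes. Every line of the controller can be read and explained, which is part of the interpretability advantage of control theory.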

Nowadays, after creating a control theory algorithm, we check if AI can improve the control system. In my experience, only some advanced deep learning methods are effective. Most machine learning methods don't outperform control theory in efficiency and security.

Control theory also offers better interpretability, allowing us to understand decisions, unlike advanced machine learning and deep learning.

Due to the historical importance of control theory, I will continue to mention its role and mathematical applications. This will help you learn AI's math foundations and understand its significance in electronic systems and AI applications in engineering beyond dataset predictions.

Chapter 4: Linear Algebra - The Geometry of Data

Photo by Nothing Ahead.

Linear algebra is like having organized containers for data.

Instead of playing with individual numbers, we can pack them into structured boxes that are easier to handle. These structured boxes are called matrices.

When you have a lot of variables like customer data, sensor readings, or images, these structured boxes are very helpful. Also, what we can do when we play around with these boxes is very valuable.

In AI, linear algebra is everywhere. Take matrices, for example – a key concept in Linear Algebra. LLMs perform many matrix multiplications as their core operation. The data that they take in is also organized into matrices. In image recognition, matrices are used to represent pixels of images.

So as you can see, this core Linear Algebra concept is important to understand. Let's start!

What Are Matrices and Why Do They Simplify Equations?

Very often, systems in the real world can be simplified and modeled with a system of equations.

Those equations are often differential equations of many orders. But to simplify, let’s choose a very simple system like the one below:

$$\begin{align} 2x + 3y - z &= 7 \\ x - 2y + 4z &= -1 \\ 3x + y + 2z &= 10 \end{align}$$

When dealing with many variables and equations, writing each equation separately quickly becomes frustrating. Matrices provide a compact way to represent these systems.

For example, here’s the system above as a single matrix equation:

$$\begin{bmatrix} 2 & 3 & -1 \\ 1 & -2 & 4 \\ 3 & 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 7 \\ -1 \\ 10 \end{bmatrix}$$
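Once a system is written this way, a computer can solve it directly. As a small sketch, here is the matrix equation above handed to NumPy's `np.linalg.solve`. The names A and b for the two sides are chosen here for illustration.

```python
import numpy as np

# Coefficient matrix and right-hand side of the matrix equation above
A = np.array([[2, 3, -1],
              [1, -2, 4],
              [3, 1, 2]])
b = np.array([7, -1, 10])

x, y, z = np.linalg.solve(A, b)
print("x =", x, "y =", y, "z =", z)

# Check: multiplying A by the solution should give back b
print(np.allclose(A @ np.array([x, y, z]), b))
```

The same call works for a system with hundreds of variables, which is exactly where the compact matrix form pays off.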

By seeing systems of equations as matrices, we can use linear algebra techniques to understand how the system behaves.

Some of these techniques are:

Linear Independence, Dependence, and Rank
Determinants
Eigenvalues and Eigenvectors

So to summarize:

A real world system can be represented as a system of equations
A system of equations can be compressed in a structured manipulable form called a matrix.
With matrices and linear algebra techniques, we can understand how the system works.

This way, we can study the basic behavior of a system with Linear Algebra.

For complex systems like a rocket, Linear Algebra is still the foundation. More advanced tools from control theory are used, but understanding simpler systems is essential for modeling and creating complex ones.

Vectors and Transformations: Moving in Multiple Directions

Vectors are matrices with a single row or a single column. You can also think of them as the building blocks of AI. They represent things like data points, model parameters, and much more.

For example, every data input (like an image or sentence) becomes a vector that the model can process.

Here are two examples of vectors:

$$\mathbf{A} = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix}$$

And:

$$\mathbf{B} = \begin{bmatrix} 3 \\ -1 \\ 8 \\ 0 \\ -4 \end{bmatrix}$$

All operations that you can perform on matrices can also be performed on vectors.

In Python, we can represent this by:

import numpy as np

# Define vectors A and B
A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

We’re using the NumPy library because it makes math with arrays easy and fast.

As a simplification of a system of equations, a vector with a single row represents:

$$\mathbf{A} = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix}$$

And this represents a single equation, where k is some constant:

$$4x_1 - 2x_2 + 7x_3 + x_4 + 5x_5 = k$$

A vector with a single column represents:

$$\mathbf{B} = \begin{bmatrix} 3 \\ -1 \\ 8 \\ 0 \\ -4 \end{bmatrix}$$

Which represents this system of equations:

$$\begin{align} x_1 &= 3 \\ x_2 &= -1 \\ x_3 &= 8 \\ x_4 &= 0 \\ x_5 &= -4 \end{align}$$

Now let’s see some matrix operations.

For example:

$$\mathbf{A} + \mathbf{B}^T = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} + \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix} = \begin{bmatrix} 7 & -3 & 15 & 1 & 1 \end{bmatrix}$$

vector_addition = A + B
print("A + B =", vector_addition)

Which gives the result of the equation above.

Often, vector addition is used to combine features. For example, adding many user preference vectors creates a profile of a user.

Here’s a scalar multiplication:

$$3\mathbf{A} = 3\begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} = \begin{bmatrix} 12 & -6 & 21 & 3 & 15 \end{bmatrix}$$

scalar_mult = 3 * A
print("3 * A =", scalar_mult)

Which gives the result of the equation above.

In AI, scaling vectors is usually done to adjust relevancy. For example, if we multiply a vector by the scalar 100, we are increasing its importance. If we multiply it by 0.3, we are reducing its importance.

Here's an outer product multiplication:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} 4 \\ -2 \\ 7 \\ 1 \\ 5 \end{bmatrix} \times \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix} = \begin{bmatrix} 12 & -4 & 32 & 0 & -16 \\ -6 & 2 & -16 & 0 & 8 \\ 21 & -7 & 56 & 0 & -28 \\ 3 & -1 & 8 & 0 & -4 \\ 15 & -5 & 40 & 0 & -20 \end{bmatrix}$$
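The outer product can be computed in NumPy as well. This sketch uses `np.outer`, where the entry in row i and column j of the result is A[i] * B[j]. The vectors are redefined so the snippet runs on its own.

```python
import numpy as np

A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

# np.outer multiplies every element of A by every element of B
outer_product = np.outer(A, B)
print("A ⊗ B =")
print(outer_product)
```

Two vectors of length 5 produce a 5 × 5 matrix, so the outer product builds a bigger structure out of two smaller ones.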

And here’s a dot product (also called an inner product):

$$\mathbf{A} \cdot \mathbf{B}^T = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} \cdot \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix}$$

$$= 4 \cdot 3 + (-2) \cdot (-1) + 7 \cdot 8 + 1 \cdot 0 + 5 \cdot (-4) = 50$$

We mainly use dot products to measure the similarity, or alignment, between two vectors. This is also their main role in machine learning.

import numpy as np

dot_product = np.dot(A, B)
print("A · B =", dot_product)

Which gives the result of the equation above.
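The raw dot product also grows with the length of the vectors, so in practice similarity is often measured with the cosine of the angle between them. Here is a minimal sketch of that idea, assuming the same A and B as above (the name `cosine_similarity` is introduced only for this example):

```python
import numpy as np

A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

print("Cosine similarity:", round(float(cosine_similarity), 3))
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are unrelated, and a value near -1 means they point in opposite directions.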

Linear Independence, Dependence, and Rank: Why It Matters

A lot of times, matrices can be made smaller and simpler. So it’s a good practice to reduce a matrix to its simplest form before we start to analyze its properties.

When at least one row of a matrix can be built by combining other rows, the rows of that matrix are linearly dependent. This means the matrix can be reduced further.

In contrast, a matrix has the property of linear independence when none of its rows can be created by combining the others.

For example, when we have a complex matrix like this one:

$$C = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 4 & 6 & 8 \\ 1 & 3 & 5 & 7 \\ 0 & 1 & 2 & 3 \end{bmatrix}$$

We can, with calculations, convert to this:

$$C_{\text{reduced}} = \begin{bmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

If you are not familiar with row reduction, I recommend this YouTube video.

Dropping the rows of zeros, the simplified matrix carries the same information as this:

$$C_{\text{reduced}} = \begin{bmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 3 \end{bmatrix}$$

This way, we conclude that the C matrix has a rank of 2.

In other words, since the simplest form of the matrix has only 2 rows with numbers, it has a rank of 2.

From this, we can conclude that the rows of the reduced matrix are linearly independent, because neither row can be made from the other. It’s the simplest possible form of the matrix.

The original matrix C is linearly dependent because some rows are just multiples or combinations of other rows. For example, row 2 of the original matrix C is exactly row 1 multiplied by 2.

Another way of seeing this is that we have 4 rows in the original matrix and the rank of matrix C is 2. Since they are not equal, C is linearly dependent.
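The rank can also be checked numerically. This is a small sketch using NumPy's `np.linalg.matrix_rank` on the matrix C from above:

```python
import numpy as np

C = np.array([
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, 3, 5, 7],
    [0, 1, 2, 3]
])

# Number of linearly independent rows (or columns) of C
rank_C = np.linalg.matrix_rank(C)

print("Rank of C:", rank_C)
print("Rows are linearly dependent:", rank_C < C.shape[0])
```

If the rank is smaller than the number of rows, at least one row is a combination of the others.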

Why are these concepts important?

Linear independence and rank are important in engineering because they show whether equations, represented as matrices, give unique information. In electrical circuits and control systems, knowing that equations, represented as matrices, are independent ensures that you have unique solutions and avoids confusion.

The matrix rank shows the maximum number of independent equations that can exist. This helps engineers model the simplest possible form of a system.

In LLMs like ChatGPT, Gemini, Grok, and Claude, linear independence, dependence, and rank are used in a very important technique called LoRA (Low-Rank Adaptation).

LoRA (Low-Rank Adaptation) is widely used to fine-tune these models so that they adapt efficiently to new tasks or domains without retraining the full model. Instead of updating every weight, it trains a small low-rank update. There are also variants of this technique, like Quantized LoRA (QLoRA). Because far fewer parameters are trained, LoRA reduces the compute, and therefore the energy and cooling water, that data centers spend on fine-tuning.
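The low-rank idea can be sketched with plain NumPy. This is only an illustration under assumed sizes (a 1024×1024 layer, rank 8, random stand-in weights), not how any particular model or library implements LoRA. The names `W`, `B_low`, and `A_low` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 1024, 8                     # hypothetical layer size and LoRA rank
W = rng.normal(size=(d, d))        # frozen pretrained weights (random stand-in)
B_low = np.zeros((d, r))           # trainable factor, starts at zero
A_low = rng.normal(size=(r, d))    # trainable factor

# LoRA-style update: only B_low and A_low would be trained.
# Their product has rank at most r, hence the name "low-rank".
W_adapted = W + B_low @ A_low

full_params = W.size
lora_params = B_low.size + A_low.size

print("Parameters when training W directly:", full_params)
print("Parameters when training the low-rank factors:", lora_params)
```

With these assumed sizes, the low-rank factors hold 64 times fewer trainable numbers than the full weight matrix.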

Determinants: Measuring Space and Scaling

Why are determinants important?

Determinants tell us whether a system of equations has exactly one solution, or whether it has either no solution or infinitely many, without having to solve the whole system.

This way, instead of immediately trying to solve a complex system, we can first use the determinant to find out if it is even worth solving in the first place.

Many engineers don’t really understand the importance of the determinant. The only thing they know is the formula and how to apply it.

So now let’s learn, with some examples, what exactly the determinant is and why it matters.

A determinant is just a number. It’s always calculated from a square matrix. By calculating the determinant, we can find certain properties about the system it represents.

The determinant of a given matrix A:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

can be represented by two notations:

$$\det(A) = ad - bc$$

$$|A| = ad - bc$$

Both are the same thing.

Let's see how to calculate a determinant:

$$|A| = \begin{vmatrix} 2 & 3 \\ 1 & 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$

Let’s see how to do this in Python:

import numpy as np

# Define the matrix
A = np.array([
    [2, 3],
    [1, 4]
])

# Calculate the determinant
det_A = np.linalg.det(A)

print("Determinant of A:", det_A)

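The same function works for any square matrix. Because NumPy computes the determinant numerically, the printed value can differ from 5 by a tiny floating-point rounding error.

Here's the determinant formula for a 3×3 matrix: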

$$|B|= \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = aei + bfg + cdh - ceg - bdi - afh.$$

Now let’s apply the formula to an example:

$$|B| = \begin{vmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 1 & 0 & 6 \end{vmatrix} = (1)(4)(6) + (2)(5)(1) + (3)(0)(0) - (3)(4)(1) - (2)(0)(6) - (1)(5)(0)$$

Assessing each term:

$$= (1)(4)(6) + (2)(5)(1) - (3)(4)(1) = 4 \cdot 6 + 2 \cdot 5 - ( 3 \cdot 4) = 24+10-12 = 22$$

In Python code:

import numpy as np

# Define the matrix
B = np.array([
    [1, 2, 3],
    [0, 4, 5],
    [1, 0, 6]
])

# Calculate the determinant
det_B = np.linalg.det(B)

print("Determinant of B:", det_B)

Now, let’s visualize matrix A by plotting its column vectors. Each column will become a vector: (3,1) and (-2,4). This shows us geometrically what the matrix is actually doing.

In a geogebra graph, it gives us this:

As we can see, the vectors define how each variable influences the system. By visualizing what the matrices are doing, we can find patterns that are harder to find just by looking at formulas.

What does this mean visually?

It means that in the space, this is what our matrix looks like. It’s also how our system of equations is represented.

C1 represents the "force" or the impact the variable x1 has. And C2 does the same thing for the variable x2.

Now we’ll focus on a 3×3 matrix example. This matrix D represents a system of three equations with three variables:

$$D = \begin{bmatrix} 2 & -1 & 3 \\ 4 & 0 & -2 \\ -1 & 5 & 1 \end{bmatrix}$$

$$\begin{align} 2x_1 - x_2 + 3x_3 &= p \\ 4x_1 + 0x_2 - 2x_3 &= q \\ -x_1 + 5x_2 + x_3 &= r \end{align}$$

Each column can be described as a separate vector:

$$\begin{equation} D = \left[ D_1 \mid D_2 \mid D_3 \right] = \left[ \begin{bmatrix} 2 \\ 4 \\ -1 \end{bmatrix} \mid \begin{bmatrix} -1 \\ 0 \\ 5 \end{bmatrix} \mid \begin{bmatrix} 3 \\ -2 \\ 1 \end{bmatrix} \right] \end{equation}$$

As we can see, D was decomposed into 3 column vectors:

$$\begin{equation} D_1 = \begin{bmatrix} 2 \\ 4 \\ -1 \end{bmatrix} \end{equation}$$

and:

$$\begin{equation} D_2 = \begin{bmatrix} -1 \\ 0 \\ 5 \end{bmatrix} \end{equation}$$

and:

$$\begin{equation} D_3 = \begin{bmatrix} 3 \\ -2 \\ 1 \end{bmatrix} \end{equation}$$

In a geogebra graph, it gives us this:

In 3D, each column vector points in its own direction. Each of the three equations, in turn, describes a plane. The point where all three planes intersect is the solution to the system.

This is a key advantage of matrices and linear algebra. They help us visualize both simple and complex systems, enhancing systems thinking and first principles thinking.

The determinant is directly connected to these visualizations. For example, in 2D it measures the area that the vectors stretch over. Now we’ll see how that’s possible.

Let's use matrix A and see what its determinant looks like in geometric terms:

$$A = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}$$

Which can be decomposed into 2 column vectors, u = (2, 1) and v = (3, 4):

It gives us this determinant:

$$|A| = \begin{vmatrix} 2 & 3 \\ 1 & 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$

Now let’s see the determinant visually.

From (2,1) and (3,4), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitudes as u and v. They meet at (5,5), and we have a parallelogram that’s completed with these points: (0,0), (2,1), (3,4), (5,5)

The area of the parallelogram is the determinant:

Let’s see another example.

Let’s use a matrix F and see what it truly is:

$$F = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$$

It gives us this determinant:

$$|F| = \begin{vmatrix} 1 & 2 \\ 2 & 4 \end{vmatrix} = (1)(4) - (2)(2) = 4 - 4 = 0$$

In geogebra, we can see that:

Now let’s try to see the determinant visually:

We can conclude that the area is 0.

Now let’s use a matrix G and see what it truly is:

$$G = \begin{bmatrix} 1 & 5 \\ 2 & 3 \end{bmatrix}$$

It gives us this determinant:

$$|G| = \begin{vmatrix} 1 & 5 \\ 2 & 3 \end{vmatrix} = (1)(3) - (5)(2) = 3 - 10 = -7$$

In geogebra, we can see that:

Now let’s try to see the determinant visually.

From (1,2) and (5,3), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitudes as u and v. They meet at (6,5). A parallelogram is completed with these points: (0,0), (1,2), (5,3), (6,5)

Again, the area of the parallelogram comes from the determinant. This time the determinant is negative, so the area is its absolute value, 7:

We just saw that the determinant is the area of a parallelogram formed by the vectors. When the determinant is 0, there is no area. In other cases, there is an area. But what does this mean, and why do we care about these different values?

When the det = 0:

The vectors are linearly dependent (one can be written as a combination of the others)
They lie on the same line or one is a scaled version of the other
The parallelogram collapses to a line, hence zero area
This tells us the matrix has no inverse
Systems of equations either have no solution or infinitely many solutions

When the det ≠ 0 (det > 0 or det < 0):

The vectors form a proper parallelogram with an area
- If det > 0, the signed area is positive and the transformation preserves orientation
- If det < 0, the signed area is negative and the orientation is flipped (the actual area is |det|)
The vectors are linearly independent
Systems of equations have exactly one solution
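The two cases above can be checked in code before trying to solve a system. This is a minimal sketch using the matrices F and G from the examples; the right-hand side `b` and the `results` dictionary are hypothetical names and values chosen for illustration:

```python
import numpy as np

G = np.array([[1.0, 5.0], [2.0, 3.0]])   # det = -7
F = np.array([[1.0, 2.0], [2.0, 4.0]])   # det = 0
b = np.array([1.0, 2.0])                 # hypothetical right-hand side

results = {}
for name, M in (("G", G), ("F", F)):
    if np.isclose(np.linalg.det(M), 0.0):
        # Zero determinant: no inverse, so there is no unique solution
        results[name] = None
    else:
        # Non-zero determinant: exactly one solution exists
        results[name] = np.linalg.solve(M, b)

print(results)
```

Here G has a unique solution, while F is flagged as having none or infinitely many without any solving attempt.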

In electrical engineering, determinants help verify if a control system is controllable and observable.

Control systems use matrices a lot. For this reason, checking if their determinants are zero or non-zero tells engineers:

If it is controllable, it means the system is reachable, which helps in stabilization and performance optimization.
If it is observable, it means the system is measurable, which helps in fault detection and system monitoring.

In finite element analysis, a very popular numerical method for solving partial differential equations, determinants help figure out quickly if the calculations will give reliable results.

This way, with finite element analysis, we can design safer buildings, optimize aircraft wings, and simulate medical implants – all of which have a large impact on human lives and safety.

In machine learning, determinants are crucial to understanding data transformations. In these methods, if a determinant with a value of zero shows up, it means you are losing information and can't recover original data.

Also in deep learning, it’s used to decide the first parameters of neural networks (weight initialization) to prevent problems like the vanishing/exploding gradients.

In a 3×3 matrix, the determinant represents the volume of a parallelepiped (a 3D "box") formed by three vectors in 3D space.

If det = 0: The three vectors lie in the same plane, so they don't span any 3D volume
If det ≠ 0: The vectors form a proper 3D shape with actual volume

The absolute value |det| gives you the exact volume of that parallelepiped.

For example, if you have vectors a, b, and c, the determinant tells you how much 3D space they "fill up" when you use them as the edges of a box.
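As a quick check of the volume interpretation, the sketch below uses the matrix D from earlier and compares the determinant with the scalar triple product of its columns, which is another standard way to compute the same volume. The variable names are chosen only for this example:

```python
import numpy as np

D = np.array([
    [2, -1, 3],
    [4, 0, -2],
    [-1, 5, 1]
])

# The three column vectors are the edges of the parallelepiped
d1, d2, d3 = D[:, 0], D[:, 1], D[:, 2]

volume_from_det = abs(np.linalg.det(D))
volume_from_triple = abs(np.dot(d1, np.cross(d2, d3)))  # scalar triple product

print("Volume from determinant:", round(float(volume_from_det), 6))
print("Volume from triple product:", volume_from_triple)
```

Both approaches give the same volume, up to floating-point rounding in the determinant.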

This is where it gets fascinating:

4×4 matrix: The determinant represents the "hypervolume" of a 4D parallelepiped formed by four vectors in 4-dimensional space.
1000×1000 matrix: The determinant represents the hypervolume in 1000-dimensional space!

So, to summarize, the determinant quickly tells us whether a system of equations, represented by a compact matrix, has exactly one solution (det ≠ 0) or has either no solution or infinitely many (det = 0).

What Are Mathematical Spaces and How Do They Simplify Calculations?

We now have a great foundation to understand the rest of this chapter on linear algebra.

Now, we will see how a linearly independent matrix creates something called a basis. We will also see that a basis is just a set of building blocks for mathematical spaces!

The row vectors of a linearly independent matrix form a basis.

For example in matrix A, which is linearly independent:

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

forms this set:

$$((1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1))$$

In this case, since matrix A is linearly independent, the set of matrix rows is called a basis. From this basis, you can create endless linear combinations, and every other vector can be written as one of them. The collection of all these possible combinations is called a mathematical space (more precisely, a vector space).

A mathematical space is an infinite set that contains all linear combinations of a basis. It’s called a basis because these vectors form the base for expressing any vector in the space as a linear combination.

This matrix B is linearly independent:

$$B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

And forms this set:

$$((1, 0), (0, 1))$$

And from this come all possible points in this Cartesian coordinate system:

For example, mathematically, we can get the point (2,3) by:

$$(x=2, y=3) = 2(1, 0) + 3(0, 1) = (2, 0) + (0, 3) = (2, 3)$$

Note: There are other bases for the Cartesian coordinate plane. I chose this one because it’s the easiest to understand.
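To make the note above concrete, here is a minimal sketch that writes the same point (2, 3) in a different basis. The basis (1, 1), (1, -1) is a hypothetical choice made only for illustration:

```python
import numpy as np

# Hypothetical alternative basis, stored as the columns of a matrix:
# first basis vector (1, 1), second basis vector (1, -1)
basis = np.array([
    [1.0, 1.0],
    [1.0, -1.0]
])
point = np.array([2.0, 3.0])

# Solve basis @ coefficients = point to get the coordinates in the new basis
coefficients = np.linalg.solve(basis, point)
print("Coefficients:", coefficients)

# Rebuild the point as a linear combination of the basis vectors
rebuilt = coefficients[0] * basis[:, 0] + coefficients[1] * basis[:, 1]
print("Rebuilt point:", rebuilt)
```

The point is the same, but the numbers that describe it depend on which basis we pick.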

Eigenvalues and Eigenvectors: Unlocking Hidden Patterns

Eigenvalues and eigenvectors, in my opinion, are far simpler than what mathematics professors make them out to be at university:

Eigenvalues tell you how much a matrix stretches or shrinks things.
Eigenvectors tell you which directions stay unchanged when the matrix transforms them.

This way, a matrix may have one or many eigenvalues which in turn result in many eigenvectors.

Let’s see an example:

For a square matrix A, eigenvalue λ, and eigenvector v:

$$Av=λv$$

The easiest way to find the eigenvalue is to calculate this:

$$det(A−λI)=0$$

or:

$$|A−λI|=0$$

Again, we have different notations for the determinant, but they’re the same thing.

Anyway, let’s define a very simple matrix A:

$$A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$$

Now let’s make some calculations.

This formula:

$$det(A−λI)=0$$

Can be decomposed into:

$$det(\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} - λ \times \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}) = 0$$

Which is the same as:

$$det(\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} - \begin{bmatrix} λ & 0 \\ 0 & λ \end{bmatrix}) = 0$$

Which gives us:

$$det(\begin{bmatrix} 2-λ & 0 \\ 0 & 3-λ \end{bmatrix}) = 0$$

By the calculations we made above on the determinant, we can conclude that:

$$(2-λ) \times (3-λ) = 0$$

Which is the same as:

$$2-\lambda = 0 \text{ or } 3-\lambda = 0$$

Which gives us these eigenvalues:

$$\lambda_1 = 2, \quad \lambda_2 = 3$$

And these eigenvectors:

$$\mathbf{v_1} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{v_2} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

This means that in the Cartesian coordinate system:

By applying the eigenvectors, we can see that:

The eigenvalue 2 is associated with the eigenvector v1:

$$A\mathbf{v_1} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

The eigenvalue 3 is associated with the eigenvector v2:

$$A\mathbf{v_2} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} = 3\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Here is the Python code to calculate this:

import numpy as np

# Define matrix A
A = np.array([[2, 0],
              [0, 3]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:")
print(eigenvalues)

print("Eigenvectors (columns):")
print(eigenvectors)

Eigenvalues and eigenvectors are key tools in engineering and machine learning because they reveal a matrix's fundamental behavior. Although a matrix transformation might seem complex, in reality:

Eigenvalues show how much stretching or compression occurs.
Eigenvectors identify the special directions where this stretching happens most naturally.

In machine learning, we can use Principal Component Analysis (PCA) to make datasets smaller.

So, for example, let's say you’re building a machine learning application to predict heart disease. You have 100 input features (data columns) and 1 target variable telling whether a person has it or not.

With PCA, you can convert the 100 features into, say, 40 new features. This way, you can train a smaller machine learning model and save computational resources.

PCA uses eigenvectors of covariance matrices to find important directions in data with many variables. It reduces data size without losing much detail, helping machine learning algorithms focus on key features and ignore unnecessary information.
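The steps behind PCA can be sketched directly with NumPy. This is a simplified illustration on synthetic stand-in data (200 samples, 5 features, reduced to 2), not a production implementation. Every name and size below is an assumption made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 5 features driven by 2 hidden factors
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 5))

# 1. Center the data and compute the covariance matrix of the features
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decomposition (eigh is meant for symmetric matrices like cov)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort directions from largest to smallest eigenvalue and keep the top 2
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# 4. Project the data onto those directions
X_reduced = X_centered @ components

explained = eigenvalues[order[:2]].sum() / eigenvalues.sum()
print("Reduced shape:", X_reduced.shape)
print("Share of variance kept:", round(float(explained), 3))
```

The eigenvectors with the largest eigenvalues are the directions where the data varies the most, which is why keeping only those loses little detail.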

Applications of Linear Algebra in AI and Control Theory

Linear algebra serves as a mathematical foundation for nearly every engineering field.

In addition, the principles of matrices and linear transformations provide the computational foundation that makes modern AI possible while enabling the control of complex systems.

All LLMs, from ChatGPT and Claude to Gemini and Grok, rely on linear operations.

All these systems carry out huge matrix multiplications to handle and create human language. So, when you type something into ChatGPT, probably millions of matrix multiplications are happening as you wait for a response!

In control theory, especially in an area called state-space control theory, matrices make it possible to create complex controllers. Linear algebra helps engineers design controllers for things like aircraft autopilots and robotic systems, among other applications.

For example, when a rocket adjusts its trajectory or a drone maintains stable flight, many matrix multiplications are happening to determine the best way to guarantee the system’s stability.

Thanks to GPUs, matrix operations are very efficient to compute. Also, any new matrix multiplication algorithm or specialized hardware for faster linear operations can greatly enhance AI and control systems.

In the end, linear algebra is the hidden mathematical engine powering the current AI revolution.

Chapter 5: Multivariable Calculus - Change in Many Directions

Photo by ThisIsEngineering

Limits and Continuity: Understanding Smooth Change

Calculus is one of the most valuable areas of mathematics, and it focuses on the study of continuous change.

Before we start learning a topic that makes many people give up on engineering degrees, I want to once again assure you that this chapter is very easily explained with a lot of images and code examples.

Also, just like linear algebra, many concepts in calculus are components of tools that have helped create billion-dollar industries.

What is continuity?

Before going and explaining topics like derivatives and integrals, we need to understand continuity.

In simple terms, continuity means that a function has no breaks, jumps, or holes.

Essentially, you can draw it without lifting your pencil from the paper.

For example, this function is continuous:

You can draw this graph without taking the pencil off the paper.

The above graph is represented by this function:

$$y = x^2 - 4x + 3$$

But the below function is not continuous:

This one, you can’t draw without taking the pencil off the paper.

It’s represented by this piecewise function:

$$y = \begin{cases} 1.5 + \frac{1}{x+1} & \text{if } -1 < x < 2 \\ 2 + \frac{2}{(x-1)^2} & \text{if } x > 2 \end{cases}$$

This piecewise function is essentially two individual functions for two different intervals of numbers. Since calculus is the study of continuous change, we can only realistically use it in continuous functions.

How do limits guarantee continuity?

We can only use tools like derivatives and integrals if a function is continuous.

How can we describe mathematically that a function is continuous – like drawing it without lifting our pencil from the paper?

Limits solve that problem.

When we take the limit of a function at a given point, we're asking: what value does a function approach as we get close to that point?

Let's look at some examples of this function at these points and also understand the notation used in limits:

What is the limit at the point x=0?

It is 3, which is also where the curve crosses the y axis.

In mathematical notation,

$$\begin{align} \lim_{x \to 0} (x^2 - 4x + 3) &= (0)^2 - 4(0) + 3 \\ &= 0 - 0 + 3 \\ &= 3 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 0. Think of x as being at 0.00000000000001 or -0.00000000000001. It gets so close that we can consider it near enough.

What is the limit at the point x=1?

Let’s see another example:

In this case, it’s 0.

$$\begin{align} \lim_{x \to 1} (x^2 - 4x + 3) &= (1)^2 - 4(1) + 3 \\ &= 1 - 4 + 3 \\ &= 0 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 1. Think of x as being at 0.99999999999999 or 1.00000000000001. It gets so close that we can consider it near enough.

What is the limit at the point x=2?

Let’s see another example:

Here, it’s -1.

$$\begin{align} \lim_{x \to 2} (x^2 - 4x + 3) &= (2)^2 - 4(2) + 3 \\ &= 4 - 8 + 3 \\ &= -1 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 2. Think of x as being at 1.99999999999999 or 2.00000000000001. It gets so close that we can consider it near enough.

Some more quick examples:

What is the limit at the point x=3?

It is 0.

What is the limit at the point x=4?

It is 3.

What is the limit at the point x=5?

It is 8.

Now let’s see another example:

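For the piecewise function shown earlier, the limit at the point x=2 is not well defined:

If we draw with a pencil from the left toward x=2, we end up at about 1.83333
If we draw with a pencil from the right toward x=2, we end up at 4

Since the two sides disagree, the limit at x=2 does not exist, and the function is not continuous there.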
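The two one-sided values can be checked numerically by evaluating the piecewise function just to the left and just to the right of x=2. This is a minimal sketch; the function name `y` and the step sizes are choices made for this example:

```python
def y(x):
    """Piecewise function from the example (undefined outside its two intervals)."""
    if -1 < x < 2:
        return 1.5 + 1 / (x + 1)
    if x > 2:
        return 2 + 2 / (x - 1) ** 2
    raise ValueError("y is not defined at this x")

# Approach x = 2 from both sides with smaller and smaller steps
for step in (0.1, 0.001, 0.00001):
    left = y(2 - step)
    right = y(2 + step)
    print(f"step={step}: from the left {left:.5f}, from the right {right:.5f}")
```

As the step shrinks, the left values settle near 1.83333 and the right values settle near 4, so the two sides never agree.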

Why are limits important to understand derivatives and integrals?

As we have seen, when we talk about limits, we are talking about a value that symbolizes the value that a function approaches as it comes toward a particular point.

It’s critical to note that we're not looking at the value of that point itself. We’re looking at what happens as we get so near to it that we can pin down what value the function is approaching.

I will now show a very simple example to demonstrate this concept using mathematical notation.

I know that limits can be a difficult concept to understand at first. But if you understand limits very well, then you'll be well-prepared to understand derivatives and integrals.

And, as you’ll see, derivatives are responsible for modern AI and integrals are important parts of tools widely used in billion-dollar industries.

I want you to understand the intuition behind this.

The function f(x) is continuous for every x greater than -2:

$$f(x) = \frac{3x + 7}{x + 2}$$

So to what value does this expression converge as x approaches infinity?

If you have a background in math, you might already see the answer. For those who aren’t sure: it converges to 3.

This time, x approaches infinity instead of a constant:

$$\begin{align} \lim_{x \to \infty} \frac{3x + 7}{x + 2} \end{align}$$

Let’s solve this in a very simple way:

For x = 1:

$$f(1) = \frac{3(1) + 7}{1 + 2} = \frac{10}{3} \approx 3.333...$$

For x = 5:

$$f(5) = \frac{3(5) + 7}{5 + 2} = \frac{22}{7} \approx 3.143...$$

For x = 10:

$$f(10) = \frac{3(10) + 7}{10 + 2} = \frac{37}{12} \approx 3.083...$$

For x = 50:

$$f(50) = \frac{3(50) + 7}{50 + 2} = \frac{157}{52} \approx 3.019...$$

For x = 100:

$$f(100) = \frac{3(100) + 7}{100 + 2} = \frac{307}{102} \approx 3.010...$$

For x = 1000:

$$f(1000) = \frac{3(1000) + 7}{1000 + 2} = \frac{3007}{1002} \approx 3.001...$$

For x = 10000:

$$f(10000) = \frac{3(10000) + 7}{10000 + 2} = \frac{30007}{10002} \approx 3.0001...$$

As x gets bigger and bigger, we get closer and closer to 3.

This is the main idea of limits: Describe the value a function approaches as the input approaches some point.
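One way to see why the values settle at 3, offered here as a supplementary derivation, is to divide the numerator and the denominator by x. As x grows, the terms 7/x and 2/x shrink toward 0:

$$\lim_{x \to \infty} \frac{3x + 7}{x + 2} = \lim_{x \to \infty} \frac{3 + \frac{7}{x}}{1 + \frac{2}{x}} = \frac{3 + 0}{1 + 0} = 3$$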

This same idea applies to derivatives: they’re just limits that measure rates of change (slopes of tangent lines).

Likewise, integrals are just limits that measure accumulated quantities (areas under curves).

Let’s now see how derivatives work in depth.

Derivatives: How Things Change and How Fast

As I said before, derivatives are just limits that measure rates of change (slopes of tangent lines).

But what does this actually mean?

Let’s see an example:

What is the rate of change at point A?

Hard question, right? Let’s think about how to answer this with limits.

We can find the limit of the rate of change at point A(0.72, 0.66), also called the instantaneous rate of change.

Let’s do that:

To find the slope, we take the coordinates of the points B(0.2, 0.2) and C(1.6, 1):

$$\text{slope} = \frac{1 - 0.2}{1.6 - 0.2} = \frac{0.8}{1.4} = \frac{4}{7} \approx 0.571$$

This gives us the line through B and C, whose slope is the average rate of change between them:

$$y=0.571x + 0.084$$

Let's approximate more:

Let’s also zoom in:

To find the slope, we use the coordinates of the points B(0.58, 0.55) and C(0.85, 0.75):

$$\text{slope} = \frac{0.75 - 0.55}{0.85 - 0.58} = \frac{0.20}{0.27} \approx 0.741$$

It gives us this line:

$$y=0.741x + 0.12$$

Now let's approximate a lot:

To find the slope, we use the coordinates of the points B(0.7242549, 0.6625776) and C(0.7242884, 0.66260026):

$$\text{slope} = \frac{0.66260026- 0.6625776}{0.7242884- 0.7242549} = \frac{0.0000226}{0.0000335} = \frac{0.226}{0.335} \approx 0.674$$

Now let’s zoom out:

As we can see, B and C are now so close that we can take the limit of the rate of change to be about 0.674.

It gives us this line:

$$y=0.674x + 0.174$$

This way, the limit of a rate of change is called a derivative.

To recap, here is an animation:

Here’s a Python code example that lets you find the derivative in point A:

import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)

# Derivative of sin(x), which is cos(x)
derivative_of_sin = sp.diff(f, x)

# Evaluate the derivative at x = 0.72, the x-coordinate of point A
val = derivative_of_sin.subs(x, 0.72).evalf()

print("Derivative of sin(x) at x=0.72:", val)

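The function that contains point A is a sine wave, sin(x).

We convert it to its derivative function, cos(x). From there we get the rate of change at x = 0.72, which is about 0.752. The last hand estimate above (0.674) differs slightly because the coordinates of B and C were rounded.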

When we do math by hand, we usually have many rules to convert a function to its derivative, and from these find the rate of change for a given point.

Before seeing it, let’s look at a very simple example to understand the definition of a derivative:

$$\frac{d}{dx}f(x) \approx \frac{f(\textcolor{green}{x + h}) - f(\textcolor{red}{x - h})}{(\textcolor{green}{x + h}) - (\textcolor{red}{x - h})} = \frac{f({x + h}) - f({x - h})}{2h}$$

h represents a small difference.

The derivative is the slope of the function over a very small interval around a point. In other words, it’s the limit of this rate of change as h approaches 0.
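The formula above can be tried out directly. This is a minimal sketch, assuming the sine wave from the earlier example and the point x = 0.72; `central_difference` is a helper name introduced here:

```python
import math

def central_difference(f, x, h):
    """Approximate f'(x) with the symmetric slope between x - h and x + h."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.72
for h in (0.5, 0.1, 0.001):
    slope = central_difference(math.sin, x0, h)
    print(f"h={h}: slope is approximately {slope:.6f}")

# The exact derivative of sin(x) is cos(x), shown here for comparison
print("cos(0.72) =", round(math.cos(x0), 6))
```

As h gets smaller, the approximate slope gets closer to the exact derivative, which is the limit idea in action.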

A simple derivative transformation might look like this one:

$$\frac{d}{dx}x^n = nx^{n-1}$$

Two examples are:

$$\frac{d}{dx}x^3 = 3x^2$$

And:

$$\frac{d}{dx}x^5 = 5x^4$$

There are many more. But we won’t go into deep detail on this topic.

Where and why are derivatives so important?

Derivatives are one of the most important math tools out there. They serve as the foundation for understanding change across nearly all fields of STEM.

In physics (classical mechanics), derivatives are very important for deriving new information from information that is already available.

For example, knowing how a body's position changes over time allows us to use derivatives to find its velocity and acceleration. This is crucial for self-driving cars, trains, rockets, and more.

Also, derivatives are the foundation of understanding how electricity works in depth. Without derivatives, there would’ve been no electromagnetic theory. Without electromagnetic theory, modern technology would not exist.

In machine learning, derivatives are so important that they are the basis of backpropagation, the algorithm that is one of the most important components of ChatGPT and other AI models.

Backpropagation is in fact so important that Geoffrey Hinton, one of the researchers who popularized it, shared the 2024 Nobel Prize in Physics with John Hopfield for foundational work on artificial neural networks.

Also, autonomous vehicles like Tesla and Waymo use AI models called neural networks that depend on backpropagation to work.

It’s awesome that a math concept created in the 17th century is now one of the foundations of the current AI revolution.

What About Integral Calculus?

Before explaining integrals further, I will ask you a question:

How can we find the area of the shape below?

In other words, how can we find the integral of the function over the given interval?

Let’s see how to do it step by step.

First, we’ll try using 2 rectangles to approximate the area under the curve:

Now the area of the rectangles is 6.282573.

But there is still a lot of error…

As we can see, the left rectangle does not completely cover the area under the curve, and the right rectangle covers too much.

So we’ll add more, smaller rectangles so that we can better approximate the curve.

Now let’s try using 4 rectangles:

Now the area is 6.497481. But there’s still some error.

As we can see, the error is getting smaller. In other words, the 4 rectangles cover the area of the curve better than just the 2 rectangles. But there’s still a lot of room to make it better.

Let’s try using 8 rectangles:

Now the area is 6.604935.

How about using 16 rectangles?

Now the area is 6.658662.

Let’s try using 32 rectangles:

Now the area is 6.685525.

Now how about using 64 rectangles:

Now the area is 6.698957.

And using 128 rectangles:

Now the area is 6.705673.

What about using 256 rectangles:

Now the area is 6.709031. And the error is now very small!

Now let’s see an animation of this:

As you can see, the approximation gets better as we add rectangles. The exact area is the limit of this sum as the number of rectangles approaches infinity.

This way, we can conclude that:

$$\int_0^{3.14} f(x) \, dx = \int_0^{3.14} (\sin(x) + 1.5) \, dx \approx 6.71$$

This means that the area between 0 and 3.14, limited by the math equation, is 6.71!

Or, mathematically, the integral of f(x) in the interval 0 and 3.14 is 6.71.
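The rectangle idea can be sketched in a few lines of Python. This example assumes rectangles whose height is taken at their left edge, so the intermediate values may not match the figures above exactly, but they converge to the same area. The helper name `left_riemann_sum` is introduced here:

```python
import math

def f(x):
    return math.sin(x) + 1.5

def left_riemann_sum(func, a, b, n):
    """Sum of n rectangles of equal width, each as tall as the curve at its left edge."""
    width = (b - a) / n
    return sum(func(a + i * width) for i in range(n)) * width

for n in (2, 4, 8, 16, 256, 4096):
    area = left_riemann_sum(f, 0, 3.14, n)
    print(f"{n:>5} rectangles: area is approximately {area:.6f}")
```

The more rectangles we use, the closer the sum gets to 6.71.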

Where and how is this applied?

In electrical engineering, integrals calculate total energy use in circuits by integrating power over time. For example, when designing a power supply for a device, engineers integrate the power to determine total energy costs and heat dissipation requirements.

In other words, the area under the power curve over time is the total energy used.

Let's see an example:

Imagine that in the image above:

The X axis can be the time in months.
The Y axis is the power used in Watts (Joules per second).

We can conclude that in 3.14 months (about 3 months and 4 days), the total amount of energy is 6.71 watt-months.

Here is the code to find that out:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Create Function
x = np.linspace(0, 3.14, 100)
y = np.sin(x) + 1.5

# Find the area under the function
area = np.trapezoid(y, x)

# Show the final image
plt.fill_between(x, y)
plt.title(f'Area = {area:.2f}')
plt.show()

In this code, we import the libraries, create the function, and find the area and plot it.

We used numpy.trapezoid to find the area because it quickly computes a numerical approximation of the integral of a function between two x values. It is available in NumPy 2.0 and later; older versions provide the same function as numpy.trapz.

numpy.trapezoid uses a numerical approximation method called the composite trapezoidal rule.

The basic idea of the composite trapezoidal rule is to divide the area under the curve into many trapezoids and sum all of them.

If you want to learn more about this, I recommend reading the NumPy documentation on this method.

From this value, we can convert to other units (assuming a 30-day month, so that 1 watt-month equals 720 watt-hours):

About 17,400,000 joules
About 4.83 kWh

By converting to other units, we can more easily compare this device with other devices and see if it obeys any technical standards and laws.

This is a real-life application of integrals in engineering.

In my degree, I used this a lot in classes related to power engineering. In simple words, power engineering is a subfield of electrical engineering focused on working with electricity with very high voltage values and electric motors.

In audio compression, the Fourier transform (built on integrals) decomposes sound waves into frequency components. MP3 encoders use this to identify and remove frequencies humans can't hear. This reduces file sizes while preserving quality.

Medical imaging relies on the Radon transform, which uses integrals to reconstruct 3D images from 2D X-ray projections. When you get a CT scan, the machine takes hundreds of X-ray "slices" at different angles. During this process, integrals combine "slices" into a detailed cross-sectional image of your body.

Applications in AI and Control Theory: Calculus in Action

Modern AI depends on derivatives, which are computed with the backpropagation algorithm.

When training a neural network, the system calculates partial derivatives of the error with respect to millions of parameters. This way, it finds out how to adjust each weight to improve performance. Without this, large language models like ChatGPT couldn't learn from data.

PID controllers, which stabilize the temperature in your oven or maintain altitude in aircraft autopilot systems, combine calculus ideas:

The proportional term responds to the current error.
The integral term accumulates past errors to eliminate steady-state drift.
The derivative term predicts future trends to prevent overshooting.
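
The three terms above can be sketched in a few lines of Python. This is a minimal illustration, not a production controller: the class name PIDController, the gains, and the simple simulated system (whose value decays toward zero unless the controller pushes it) are all assumptions chosen to keep the example small.

```python
class PIDController:
    """Minimal PID controller sketch with hypothetical gains."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0          # accumulated past error
        self.previous_error = None   # used to estimate the error's rate of change

    def update(self, setpoint, measurement):
        error = setpoint - measurement                 # proportional: current error
        self.integral += error * self.dt               # integral: sum of past errors
        if self.previous_error is None:
            derivative = 0.0                           # no history on the first step
        else:
            derivative = (error - self.previous_error) / self.dt  # derivative: trend
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Simulate a simple system described by: d(value)/dt = -value + control
dt = 0.01
controller = PIDController(kp=2.0, ki=1.0, kd=0.1, dt=dt)
setpoint = 1.0
value = 0.0

for _ in range(3000):  # 30 simulated seconds
    control = controller.update(setpoint, value)
    value += dt * (-value + control)

print(f"Final value: {value:.4f} (setpoint: {setpoint})")
```

Under these assumptions the simulated value should settle very close to the setpoint. Setting ki to 0 in this sketch leaves a permanent gap between the value and the setpoint, which is the steady-state drift that the integral term removes.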

And these are just some of the applications of calculus!

Chapter 6: Probability & Statistics - Learning from Uncertainty

Photo by Armando Are

It’s thanks to probabilities and statistics that many industries have grown so much. With statistics, we can make informed decisions and optimize many different processes. With probabilities, we can understand and model uncertainty in systems and, in this way, solve or even avoid problems.

While you may be familiar with some of the key concepts like median and mean, we’ll start with some basics to build up your intuition on more advanced stuff like the central limit theorem, Bayes’ theorem, and Markov chains.

Mean, Median, Mode: Measuring Central Tendency

Let's imagine you are a data scientist working in research. You’re going to work with data to optimize the output of farms in the Central Valley in California.

The idea is to take in a bunch of data, and by studying it, you can help farmers make better decisions.

Here’s the data from one year of activity:

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390

We have 6 farms in our dataset. For each farm, we know:

How much yield was obtained in tons per hectare
How much fertilizer was used in kilograms per hectare
How much rainfall happened during a year of activity

Now, let’s answer some questions we might have about the data to understand the mean, mode and median:

1. What is the average yield during one year of activity?

To find the average, we just need to sum all the yield values and divide by the number of farms. Like this:

$$\text{Mean} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$

This is what is called the mean. The mean is just the sum of all values divided by how many values there are.

In Python, we can do the following to calculate the mean:

def calculate_mean(values):
    return sum(values) / len(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_mean(data)
print(f"Mean: {result}")

2. What is the mode of fertilizer used?

The mode is just the most frequent value in a given dataset. In our case, it’s 200, since that’s the value that appears most often in the fertilizer column of our farm dataset.

In Python, we can do this to calculate the mode:

import statistics

def calculate_mode(values):
    return statistics.mode(values)

# Example usage
data = [150, 220, 120, 250, 200, 200]
result = calculate_mode(data)
print(f"Mode: {result}")

3. What is the median of the yield?

The median is just the value in the middle of a set of numbers. If the number of elements in the list is even, we take the mean of the two middle numbers. Here are our current yield values:

$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$

First, we sort the values:

$$3.9, 4.2, 4.7, 5.3, 5.8, 6.1$$

Since we have 6 values (even number), the median is the average of the two middle values:

$$\text{Median} = \frac{4.7 + 5.3}{2} = \frac{10}{2} = 5$$

In Python we can do this to calculate the median:

import statistics

def calculate_median(values):
    return statistics.median(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_median(data)
print(f"Median: {result}")

Variance and Standard Deviation: Measuring Spread

Knowing the mean, mode, and median of data is helpful. But it’s also important to know how far away data points are from each other.

That’s where measures of dispersion come in. Variance tells us, on average, how far numbers are from the mean, measured as squared distance.

Let’s see an example of how to calculate this.

Given yield data from the table:

$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$

The first step is to calculate the mean:

$$\bar{x} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$

The second step is to calculate the variance with the sample variance formula:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

Let's apply the formula little by little to understand how it works.

First, we will calculate the squared difference between each yield data point and the mean:

$$\begin{align*} (4.2 - 5.0)^2 &= (-0.8)^2 = 0.64 \\ (5.8 - 5.0)^2 &= (0.8)^2 = 0.64 \\ (3.9 - 5.0)^2 &= (-1.1)^2 = 1.21 \\ (6.1 - 5.0)^2 &= (1.1)^2 = 1.21 \\ (4.7 - 5.0)^2 &= (-0.3)^2 = 0.09 \\ (5.3 - 5.0)^2 &= (0.3)^2 = 0.09 \end{align*}$$

Then we will sum all the squared differences:

$$\sum(x_i - \bar{x})^2 = 0.64 + 0.64 + 1.21 + 1.21 + 0.09 + 0.09 = 3.88$$

Now, we will finally find the variance:

$$s^2 = \frac{3.88}{6-1} = \frac{3.88}{5} = 0.776$$

The standard deviation is just the square root of the variance.

$$s = \sqrt{s^2} = \sqrt{0.776} \approx 0.881 \text{ tons/ha}$$

Why is this useful?

It puts the spread back into the same units as the data, making it easier to interpret.

A small standard deviation means the data huddles close to the mean, while a large one means it’s widely scattered.

And here is a code example of how to calculate both:

import statistics

def calculate_variance_and_std(values):
    variance = statistics.variance(values)
    std_dev = statistics.stdev(values)
    return variance, std_dev

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
variance, std_dev = calculate_variance_and_std(data)
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
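
To connect the library calls with the hand calculation, here is a minimal sketch that follows the same steps one at a time: mean, squared differences, their sum, division by n - 1, and the square root. The variable names are only illustrative.

```python
import math

yields = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]

# Step 1: the mean
mean = sum(yields) / len(yields)

# Step 2: squared difference between each value and the mean
squared_diffs = [(value - mean) ** 2 for value in yields]

# Step 3: sum of the squared differences
sum_of_squares = sum(squared_diffs)

# Step 4: sample variance divides by n - 1, not n
sample_variance = sum_of_squares / (len(yields) - 1)

# Step 5: standard deviation is the square root of the variance
sample_std = math.sqrt(sample_variance)

print(f"Sum of squared differences: {sum_of_squares:.2f}")
print(f"Sample variance: {sample_variance:.3f}")
print(f"Sample standard deviation: {sample_std:.3f}")
```

Up to floating-point rounding, these values should match both the worked equations and the output of statistics.variance and statistics.stdev.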

What Is the Normal Distribution? The Bell Curve of Life

The normal distribution describes data that clusters around the average value. Most values are concentrated near the center, and extreme values become rarer toward the edges. This creates a bell curve.

By understanding this distribution, we can understand other distributions and also the central limit theorem.

To understand what the normal distribution is, let’s look at it:

The normal distribution looks like a mountain.

As you can see, most values are around the mean, which is where the curve peaks. Toward the extremes, the curve gets lower and lower. This means that in the extremes there are fewer and fewer values.

Normal distribution also has a formula associated with it:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

I won’t go in depth into how the formula works here. I just want you to understand the main idea behind the concept.
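
If you would like to see the formula at work, here is a small sketch that evaluates it directly and then checks the bell shape with random samples. The function name normal_pdf is hypothetical, and reusing the mean and standard deviation from the yield example is only an assumption for illustration (six farms are far too few to claim the yields are normally distributed).

```python
import math
import numpy as np

def normal_pdf(x, mu, sigma):
    """Evaluate the normal density formula at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative parameters borrowed from the yield example (an assumption)
mu, sigma = 5.0, 0.881

print(f"Density at the mean:            {normal_pdf(mu, mu, sigma):.4f}")
print(f"Density one std dev above mean: {normal_pdf(mu + sigma, mu, sigma):.4f}")

# Draw random samples and measure how many fall within one std dev of the mean
rng = np.random.default_rng(42)
samples = rng.normal(mu, sigma, size=100_000)
within_one_sigma = np.mean(np.abs(samples - mu) <= sigma)
print(f"Share of samples within one std dev: {within_one_sigma:.3f}")
```

The density is highest at the mean, and roughly 68% of the samples should land within one standard deviation of it, which is the "most values are around the mean" behavior described above.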

There are many other distributions besides the normal distribution. Some of the most common are:

Chi-squared distribution
Student’s t distribution
Bernoulli distribution
Binomial distribution
Poisson distribution

Each distribution can model different events and phenomena. For example, the Chi-squared distribution is widely used to test whether two phenomena are associated (sunburns and skin cancer, for example).

The Poisson distribution is used to model counts of events, like the number of clients that enter a store per hour or the number of data packets that are transmitted over an Ethernet cable.

But it’s also possible to approximate a lot of distributions to the normal distribution using one of the most important theorems in all of mathematics: the central limit theorem. This is what we will explore next.

How the Central Limit Theorem Helps Approximate the World

Photo by Porapak Apichodilok

The main idea of the central limit theorem is very simple:

When we average many independent samples from almost any distribution, the distribution of those averages approaches the normal distribution.

This is just like pouring sand into a funnel. Grains may fall randomly, but over time the pile of sand will always begin to form the shape of a mountain.

This way, we can take many data points and average them. As the number of data points in each average grows, the distribution of the averages converges to a normal distribution.

In other words, when many independent random variables (with finite variance) are summed together, their sum tends toward a normal distribution.

Here is the formula:

$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{or equivalently} \quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$$

You don’t need to understand in depth what it means. Just understand that the theorem says the average of n samples is approximately normally distributed around the true mean, with a spread that shrinks as n grows.
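
A quick simulation can make this tangible. The sketch below draws samples from an exponential distribution, which is strongly skewed and looks nothing like a bell curve, and then averages them in groups. The choice of distribution, the group size n = 50, and the number of repetitions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                # number of values in each average
repetitions = 10_000  # how many averages we compute

# Exponential distribution with mean 1 and standard deviation 1 (heavily skewed)
samples = rng.exponential(scale=1.0, size=(repetitions, n))
sample_means = samples.mean(axis=1)

# The formula predicts mean mu = 1 and standard deviation sigma / sqrt(n)
predicted_std = 1.0 / np.sqrt(n)

print(f"Mean of the averages:    {sample_means.mean():.3f} (predicted 1.000)")
print(f"Std dev of the averages: {sample_means.std(ddof=1):.3f} (predicted {predicted_std:.3f})")
```

Plotting a histogram of sample_means (for example with matplotlib's plt.hist) should show a bell-shaped curve, even though the original data is skewed.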

And why is this important?

Because this theorem makes many billion-dollar industries possible.

Instead of testing every single possible scenario, we can test for a smaller amount of scenarios and assume that if it works for the smaller one, it will work for the bigger one.

For example, in telecommunications, instead of testing every possible phone call or data transmission, we can just test a few connections. If it works for those few connections, we can assume it will work for millions of phone and data transmissions.

For clinical trials, instead of testing a drug on millions of people, we can just test a smaller number of patients. If it works for a (relative) few patients, we can assume it will work on most people with the same condition.

Without this idea, clinical trials would not be possible. The same with telecommunications and so many other areas of engineering.

Bayes Theorem: Learning from Evidence

Now we’ll start looking at probability more in depth based on the data table we have been using.

Here’s the table again so that you can reference it more easily:

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390

Now there are a lot of ideas and formulas related to probabilities. But here, I want to explain to you the core ones that are applied in AI and give you a high-level definition of things.

We’ll start with conditional probability, which is foundational to understanding Bayes’ theorem. Then we’ll get to the extended Bayes’ theorem formula.

So, let's get started!

What is Conditional Probability?

Photo by KOUSHIK BALA

Conditional probability is the probability that an event will happen given that another event has already taken place.

Confused? Don't worry! Let's see an example:

Let’s say that:

A = Farm has rainfall above or equal to 400 mm
B = Farm has a yield above or equal to 5.0 tons/ha

Here is the formula for Conditional Probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Now let’s look at this formula in more detail:

$$P(A)$$

This represents the probability that a farm has rainfall above or equal to 400 mm.

We have 6 farms, and 2 of them (farm B and D) have a rainfall above or equal to 400 mm.

So, the probability that a farm has rainfall above or equal to 400 mm is:

$$P(A) = \frac{2}{6} = \frac{1}{3} \approx 0.33$$

Now let’s see for event B:

$$P(B)$$

This represents the probability that a farm has a yield above or equal to 5.0 tons/ha.

We have 6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha.

So, the probability that a farm has a yield above or equal to 5.0 tons/ha is:

$$P(B) = \frac {3}{6} = \frac {1}{2} = 0.5$$

What about if we want to see both conditions’ probabilities at the same time?

$$P(A \cap B)$$

This refers to the probability of A and B being both true.

In our example, it means the probability that a farm has both a rainfall above or equal to 400 mm and a yield above or equal to 5.0 tons/ha.

We have:

6 farms and 2 of them (farm B and D) have a rainfall above or equal to 400 mm
6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha

For A and B to be true, only 2 farms (farm B and D) have both conditions.

This way:

$$P(A \cap B) = \frac{2}{6} = \frac{1}{3} \approx 0.33$$

Now we’re ready to find out the conditional probability:

$$P(A|B)$$

This means the probability of A, knowing that B is true.

In our example, we can conclude that:

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/3}{1/2} = \frac{2}{3} \approx 0.67$$

So, the probability that a farm has rainfall above or equal to 400 mm, knowing that it has a yield above or equal to 5.0 tons/ha, is about 0.67.
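
The same counting can be done in code. Here is a minimal sketch that stores the farm table as tuples and computes each probability as a fraction of farms. The helper names probability, high_rainfall, and high_yield are hypothetical and exist only for this illustration.

```python
# Farm data from the table: (name, yield in tons/ha, rainfall in mm)
farms = [
    ("A", 4.2, 280),
    ("B", 5.8, 420),
    ("C", 3.9, 230),
    ("D", 6.1, 480),
    ("E", 4.7, 340),
    ("F", 5.3, 390),
]

def probability(condition):
    """Fraction of farms for which the condition is true."""
    return sum(1 for farm in farms if condition(farm)) / len(farms)

def high_rainfall(farm):  # event A: rainfall of at least 400 mm
    return farm[2] >= 400

def high_yield(farm):     # event B: yield of at least 5.0 tons/ha
    return farm[1] >= 5.0

p_a = probability(high_rainfall)
p_b = probability(high_yield)
p_a_and_b = probability(lambda farm: high_rainfall(farm) and high_yield(farm))

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(f"P(A) = {p_a:.2f}")
print(f"P(B) = {p_b:.2f}")
print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(A|B) = {p_a_given_b:.2f}")
```

Working with the exact fractions instead of rounded decimals avoids small rounding differences in the final result.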

Bayes’ Theorem

This is one of the most important theorems in mathematics.

Bayes’ theorem is a formula that tells us how to change the probability of a prediction when new verified data becomes available.

In other words, it’s like a rule that tells us how to update our beliefs when new evidence appears.

Now, based on what we already know, let’s see how Bayes’ Theorem works.

Here is its formula:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}$$

Now, based on the previous values, we can very easily find the probability of B, given that A is true.

In other words, the probability that a farm has a yield above or equal to 5.0 tons/ha given that it has a rainfall above or equal to 400 mm.

Let’s find the answer:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} = \frac{\frac{2}{3} \cdot \frac{1}{2}}{\frac{1}{3}} = 1$$

So, the probability that a farm has a yield above or equal to 5.0 tons/ha, knowing it received 400 mm of rain or more, is 100%. This matches the table: both farms with at least 400 mm of rainfall (B and D) also have a yield of at least 5.0 tons/ha.

Now that we’ve gone through this formula step by step, hopefully it doesn’t feel as complex.

Where is this applied in real life?

As with many math ideas in this book, Bayes' Theorem has applications in many business sectors.

For example, what is the best way to make a control system for a self-driving car, robot, or really any other device?

One effective approach is to use a Kalman filter. Kalman filters rely heavily on Bayes' Theorem to handle control systems with incomplete data.

Kalman filters have a lot of applications in engineering. For example, thanks to Kalman filters, commercial jets can fly safely on autopilot.

So as you can see, Bayes’ Theorem is the foundation of many control systems used in safety-critical industries.

What Are Markov Models? Predicting the Next Step, One Step at a Time

Photo by lil artsy

How do you predict the future with math? Markov chains allow you to do this to a certain degree.

For this reason, Markov chains are widely used in science, engineering, economics, and many other areas.

In addition to this, Markov decision processes are a very important foundation for reinforcement learning. Reinforcement learning is a branch of AI where agents learn to make decisions by interacting with an environment to maximize rewards.

In this section, I’ll introduce you to Markov chains and decision processes with an analogy, a plain English explanation, and a code example.

If you want to dive in further, I recommend my freeCodeCamp article on the subject.

Markov Chain Analogy

Imagine that you want to predict the weather tomorrow, and it only depends on the weather today. The weather can be either sunny or rainy.

Here are the probabilities:

If it's sunny today, there's an 80% chance that it will be sunny again tomorrow, and a 20% chance that it will be rainy.
If it's rainy today, there's a 50% chance that it will be sunny tomorrow, and a 50% chance that it will be rainy.

In this scenario, we can predict future states of the weather based on current states using probabilities.

This idea of predicting the future based solely on probabilities of the present is called a Markov chain.

Here, the states are either sunny or rainy and the probabilities describe the chances of the weather changing based on the current state.
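
The weather example can be written as a transition matrix and explored with a few lines of NumPy. This is a minimal sketch: the variable names, the choice to start from a sunny day, and the random seed are assumptions for illustration.

```python
import numpy as np

states = ["sunny", "rainy"]

# Row = today's weather, column = tomorrow's weather
P = np.array([[0.8, 0.2],   # sunny -> sunny, sunny -> rainy
              [0.5, 0.5]])  # rainy -> sunny, rainy -> rainy

# Start from a day we know is sunny
today = np.array([1.0, 0.0])

tomorrow = today @ P
in_two_days = today @ np.linalg.matrix_power(P, 2)

print(f"Tomorrow:    sunny={tomorrow[0]:.2f}, rainy={tomorrow[1]:.2f}")
print(f"In two days: sunny={in_two_days[0]:.2f}, rainy={in_two_days[1]:.2f}")

# Simulate one possible week, one day at a time
rng = np.random.default_rng(0)
state = 0  # index of "sunny"
path = [states[state]]
for _ in range(7):
    state = rng.choice(2, p=P[state])  # the next state depends only on the current one
    path.append(states[state])

print(" -> ".join(path))
```

Multiplying by the matrix once moves the forecast one day ahead, so raising the matrix to a power gives the forecast several days ahead.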

Markov Chain Explained in Plain English

A Markov chain describes random processes where systems move between states, and a new state only depends on the current state, not on how it got there.

Mathematically, Markov chains are called stochastic models because they model (simulate) real life events that are random by nature (stochastic).

Markov chains are popular because they are easy to implement and efficient at modeling complex systems.

Another key advantage is their "memoryless" property: the next state depends only on the current state. This makes Markov chains fast to run on computers and a powerful tool for studying random processes and making predictions based on current conditions.

Applications of Markov Chains

Photo by Google DeepMind

At some level, almost all real-life events are stochastic. In other words, they involve randomness and uncertainty.

This is exactly why Markov chains are so widely used.

They can predict the behavior of systems based on current conditions:

In finance, they are used to model changes in credit ratings and to forecast market regimes.
In genetics, they help understand how proteins change over time (which is important when studying genetic variations).

These real-life examples show how effectively Markov chains can be used to solve real problems in different fields.

In AI, Markov chains are used to model an environment like a factory or home. When the model also includes the actions an agent can take and the rewards it receives, it is called a Markov decision process.

Using a Markov decision process, it’s possible to use reinforcement learning to create and optimize agents to act in the environment.

Of course, new and better variants of the Markov decision process have appeared over the years. But the key idea here is that it is thanks to Markov decision processes that the basis for reinforcement learning exists.

Reinforcement learning is widely used in advertising systems, logistics, robotics, video games, and many more applications.

Types of Markov Chains

There are many types of Markov chains. In this section, we'll only discuss the most important variants.

Discrete-Time Markov Chains (DTMCs)

In DTMCs, the system changes state at specific time steps. They are called discrete because the state transitions occur at distinct, separate time intervals.

They are used in queuing theory (study of the behavior of waiting lines), genetics, and economics because they are simple to analyze.

Continuous-Time Markov Chains (CTMCs)

CTMCs differ from DTMCs in that state transitions can occur at any continuous time point, not at fixed intervals.

This makes them a good fit for systems where a state change can happen at any moment. This is important in chemical reactions and reliability engineering.

Reversible Markov Chains

Reversible Markov chains are special. The process of state change is the same whether the direction is forwards or backwards, like rewinding a video and playing it again.

This property makes it easier to know when a system is stable and to study how a system behaves over time. They are widely used in statistical physics and economics.

Doubly Stochastic Markov Chains

Doubly stochastic Markov chains are defined by a transition probability matrix. In the matrix, the sum of the probabilities in each row and each column equals 1.

This means each row and each column represent a valid probability distribution. In other words, each row and column represent a list of chances for different outcomes.
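
Checking this property in code is straightforward. The sketch below uses a hypothetical helper, is_doubly_stochastic, and compares a symmetric example matrix (an assumption made for illustration) with the weather matrix from the analogy earlier in this chapter.

```python
import numpy as np

def is_doubly_stochastic(matrix):
    """Return True if all entries are non-negative and every row and column sums to 1."""
    matrix = np.asarray(matrix, dtype=float)
    return bool(
        np.all(matrix >= 0)
        and np.allclose(matrix.sum(axis=1), 1.0)   # rows sum to 1
        and np.allclose(matrix.sum(axis=0), 1.0)   # columns sum to 1
    )

symmetric_chain = [[0.7, 0.3],
                   [0.3, 0.7]]

weather_chain = [[0.8, 0.2],
                 [0.5, 0.5]]

print(is_doubly_stochastic(symmetric_chain))  # rows and columns both sum to 1
print(is_doubly_stochastic(weather_chain))    # rows sum to 1, columns do not
```

The weather matrix is a valid transition matrix because its rows sum to 1, but its columns sum to 1.3 and 0.7, so it is not doubly stochastic.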

This property is crucial in quantum computing and statistical mechanics.

Thanks to doubly stochastic Markov chains, systems change in a way that preserves probabilities and symmetry, making the modeling and analysis of quantum computing systems far more accurate.

Hidden Markov Chains Code Example

Photo by Kevin Ku

Before we jump into code examples, let’s first understand what Hidden Markov Chains are.

The main idea behind hidden Markov chains is to model systems that have hidden states (states for which we don’t know their values) which can only be discovered through observable events.

In other words, hidden Markov chains allow us to predict the behavior of a system by:

Considering the likelihood of moving from one state to another.
Knowing the probability of observing a certain event from each state.

We can understand this by observing how the states change from an indirect point of view.

We may not know the states’ original values. But by knowing the way they change, we can predict what their values will be in the future.

This way, hidden Markov chains are flexible in modeling sequences, capturing both the transitions between hidden states and the observable outcomes.

Because of this, hidden Markov models are used in fields such as engineering, financial modeling, speech recognition, bioinformatics, and many more.

Code Example:

In this code example, we’ll see a simple example with synthetic data.

Here is the full code:

import numpy as np
from hmmlearn import hmm

# Set random seed for reproducibility
np.random.seed(42)

# Define the HMM parameters
n_components = 2  # Number of states
n_features = 1    # Number of observation features

# Create a Gaussian HMM
model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

# Define initial state probabilities and transition matrix (rows must sum to 1)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

# Define means and covariances for each state
model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

# Generate synthetic observation data
X, Z = model.sample(100)  # 100 samples

# Create a new HMM instance
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

# Fit the model to the data
new_model.fit(X)

# Print the learned parameters
print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

# Predict the hidden states for the observed data
hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

Now let’s break the code down block by block:

Import libraries and set random seed:

import numpy as np
from hmmlearn import hmm

np.random.seed(42)

In this block of code, we imported two Python libraries:

NumPy: For numerical operations.
hmmlearn: For hidden Markov model implementation.

Next we defined a random seed with the NumPy library. A random seed is a value used to start a pseudorandom number generator.

With a fixed random seed, we can ensure that the sequence of pseudorandom numbers generated is always the same. This allows us to duplicate experiments and verify results.

The specific value of the seed doesn’t matter as long as it remains consistent.

Define the HMM parameters and create a Gaussian HMM:

n_components = 2  # Number of states
n_features = 1    # Number of observation features

model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

In this code block, we created an HMM with two hidden states and a single observed variable.

covariance_type "diag" means that each state's covariance matrix (which describes how the observed features vary together) is diagonal. In other words, within each state, every observed feature is assumed to be uncorrelated with the others.

Since our model has only one observed feature, the covariance matrix of each state reduces to a single variance value.

But there is still something we haven’t explained in the way we defined the hidden Markov chain:

What does “Gaussian” mean?

This is a very big topic in statistics, but in a few words, Markov chains can only be created when we specify the transition probabilities (chances of moving from one state to another in a Markov chain) and an initial probability distribution. A hidden Markov model also needs a distribution that describes the observations produced by each hidden state.

A Gaussian HMM assumes that the observations produced by each hidden state follow a Gaussian distribution, also called a normal distribution!

And recall, we have already seen before what a normal distribution is.

Here it is again:

From a normal distribution and other components, we can create a hidden Markov chain. And hidden Markov chains serve as a foundation for systems that affect millions of lives.

Define transition matrix, means, and covariances for each state:

model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

model.startprob_ = np.array([0.6, 0.4])

This line sets the initial state probabilities for a Hidden Markov Model (HMM). It specifies that there is a 60% probability of starting in state 0 and a 40% probability of starting in state 1.

model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])

This line of code sets the state transition probability matrix for the HMM.

The matrix specifies the probabilities of moving from one state to another:

From state 0, there is a 70% chance of staying in state 0 and a 30% chance of transitioning to state 1.
From state 1, there is a 40% chance of transitioning to state 0 and a 60% chance of staying in state 1.

model.means_ = np.array([[0.0], [3.0]])

This line sets the mean values for the observation distributions in each state.

It indicates that the observations are normally distributed with a mean of 0.0 in state 0 and a mean of 3.0 in state 1.

model.covars_ = np.array([[0.5], [0.5]])

This line sets the covariance values for the observation distributions in each state.

It specifies that the variance (covariance in this 1-dimensional case) of the observations is 0.5 for both state 0 and state 1.

Create data, new HMM instance, and fit the model with the data:

X, Z = model.sample(100)  # 100 samples

new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

new_model.fit(X)

print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

In this code, we generated 100 samples from the original model, created a new model that is allowed up to 100 training iterations (n_iter=100), fitted it to the samples, and printed the learned state transition matrix, means, and covariances.

In other words, we:

Generated 100 samples from the original model.
Fitted a new HMM to these samples.
Printed the learned parameters of this new model.

What do X and Z mean here?

X means the observed data samples generated by the original model, while Z means the hidden state sequences corresponding to the observed data samples generated by the original model.
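
To build intuition for where X and Z come from, here is a minimal NumPy sketch of how a two-state Gaussian HMM could generate data. It only illustrates the idea and is not hmmlearn's implementation. The function name sample_gaussian_hmm is hypothetical, and the random numbers it produces will differ from those of model.sample.

```python
import numpy as np

def sample_gaussian_hmm(n_samples, startprob, transmat, means, variances, rng):
    """Generate observations X and hidden states Z from a 1-feature Gaussian HMM."""
    n_states = len(startprob)
    Z = np.empty(n_samples, dtype=int)
    X = np.empty((n_samples, 1))

    # The first hidden state comes from the initial state probabilities
    Z[0] = rng.choice(n_states, p=startprob)
    for t in range(n_samples):
        if t > 0:
            # Each later state depends only on the previous state
            Z[t] = rng.choice(n_states, p=transmat[Z[t - 1]])
        # The observation is drawn from the current state's normal distribution
        X[t, 0] = rng.normal(means[Z[t]], np.sqrt(variances[Z[t]]))
    return X, Z

# Same parameters as the original model in the example
startprob = np.array([0.6, 0.4])
transmat = np.array([[0.7, 0.3],
                     [0.4, 0.6]])
means = np.array([0.0, 3.0])
variances = np.array([0.5, 0.5])

rng = np.random.default_rng(42)
X, Z = sample_gaussian_hmm(100, startprob, transmat, means, variances, rng)

print("First 10 hidden states:", Z[:10])
print("First 10 observations: ", np.round(X[:10, 0], 2))
```

In a real application we would only get to see X. The sequence Z stays hidden, which is why the fitted model has to infer it from the observations.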

The transition matrix prints out:

[[0.8100804  0.1899196 ]
 [0.49398918 0.50601082]]

Which means that the model tends to stay in state 0 and has nearly equal chances of switching or staying when in state 1.

The means print out:

[[0.01577373]
 [3.06245496]]

Which means that the average observed value is approximately 0.016 in state 0 and 3.062 in state 1.

The covariances print out:

[[[0.41987084]]
 [[0.53146802]]]

Which means that the variance of the observed values is about 0.420 in state 0 and 0.531 in state 1.

This way, we may never know the exact values of the states, but we know their average observed value and how they vary and tend to change with each other.

Predict the hidden states for the observed data:

hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

In this code, based on the X observed data samples, we inferred the most likely sequence of hidden states of the Markov model.

The hidden states print out:

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1
 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0]

Which means that the hidden states switch between state 0 and state 1, showing how the system changes states over time.

Applications in AI and Control Theory: Making Decisions Under Uncertainty

Photo by capt.sopon

I have been giving you a high-level overview of the field of probabilities and statistics. As I explained before, I wanted to make the explanations simple to understand.

As someone with a bachelor's degree in electrical and computer engineering, I can assure you that while this chapter seems simple, in probabilities and statistics, things can get very complicated very quickly.

Many more concepts are not as straightforward as the ideas I’ve just told you about, such as:

p-values
Advanced Monte Carlo methods
Bayesian networks
Statistical hypotheses

But as it is, probability and statistics are the starting points for making decisions where uncertainty exists in AI and control theory.

For example, Bayes’ theorem, besides being the foundation of the Kalman filter, is also the foundation of many probabilistic models in the field of AI. Probabilistic models are usually used in quant firms and banks to model risk.

In control theory, probabilities and statistics are widely used to design robust control systems (as is the case with Kalman filters).

So as you can see, the application of probabilities and statistics, as with calculus and linear algebra, is the foundation for many tools that impact millions of lives and move billions of dollars in the global economy.

Chapter 7: Optimization Theory - Teaching Machines to Improve

Photo by Pixabay

This is the most advanced math chapter of the book. To truly understand it, it’s very important that you’ve read the other chapters first.

We’re going to examine a few machine learning methods, and I’ll show you some recipes of how machine learning is just the use of linear algebra, calculus, probabilities and statistics, and optimization theory.

Just like making a cake!

What is Optimization Theory?

In AI, optimization theory is responsible for the algorithms that optimize data-driven AI models.

Often, big companies invest millions in research to create or refine algorithms that make training AI models faster.

This way, companies save far more money than the upfront research costs when scaling to train multiple large AI models.

It is thanks to optimization theory that deep learning was able to scale efficiently, eventually leading to the creation of ChatGPT and many other large language models.

But why is that?

In all data-driven machine learning models, there is a learning phase that has to happen. That is, there’s a period where the algorithms make predictions that are not correct and then need to change some parameters to make sure the next predictions are correct – or at least closer to being correct.

Without optimization, machine learning algorithms don't get anywhere on their learning path to the right solution. Without optimization, they spend too much time on a learning path that won’t increase their ability to predict things the right way.

So, let’s start learning!

Why Optimization Drives Learning in AI

Photo by Alex Knight

Optimization theory is the mathematical foundation that allows algorithms to improve their performance over many iterations.

When we combine an algorithm with a path to change its parameters to meet a certain objective (done with an optimization method), it’s called a machine learning algorithm.

This learning process always involves minimizing or maximizing a certain objective. For example, for many machine learning algorithms, the main objective is to minimize errors. To do this, over many iterations, the optimization method "tells" the internal components of an algorithm what to change after receiving feedback on how well it’s performing.

It’s like someone first learning how to drive a car. The first few times, it may be complicated. But after a while and some practice, the driver learns how to drive properly and not make the same mistakes they once did in the past with the help of the instructor.

The same applies to optimization methods when optimizing algorithms.

Types of Optimization Theory Methods in ML and Deep Learning

The field of optimization theory is huge! Just as with many fields of mathematics, it is constantly growing every year.

But for the purposes of this book, there are three main categories of optimization methods:

First-Order Methods

These are the most widely used methods in deep learning, including for training LLMs like Gemini, Grok, and others.

They are called first-order methods because they all use the first derivative of functions. The first derivative of a function measures how much a function's output changes when its input changes very little. The most widely used in deep learning are advanced variants of gradient descent.
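To make the idea of a first derivative concrete, here is a minimal sketch that estimates the slope of a toy function numerically and then repeatedly steps against it. The function f, the starting point, and the step size are all made up for this illustration:

```python
def f(x):
    """Toy objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


def numerical_derivative(func, x, h=1e-6):
    """Central-difference estimate of the first derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


x = 0.0                # hypothetical starting point
learning_rate = 0.1    # hypothetical step size

for step in range(50):
    slope = numerical_derivative(f, x)
    x = x - learning_rate * slope   # move against the slope

print(round(x, 4))
```

Real deep learning libraries compute derivatives exactly with automatic differentiation rather than with this finite-difference trick, but the update line is the same idea.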

While there are many variants, here are some popular examples:

Standard batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
RMSprop
Adam

In this chapter, we will look in depth at one of these methods called Adam (below).

Second-Order Methods

They are called second-order methods because they use (or, in the case of quasi-Newton methods like BFGS and L-BFGS, approximate) information from second derivatives for better updates. There are many methods, like:

BFGS
L-BFGS
Newton's method

But these are not often used in machine and deep learning. While they optimize with fewer iterations, for the type of optimization problems algorithms in AI create (high-dimensional problems), they’re very computationally expensive.

So they’re not widely used like first-order optimization methods.
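The trade-off can be seen on a tiny problem. The sketch below, which assumes a one-variable quadratic chosen only for illustration, counts how many steps plain gradient descent and Newton's method need, and then shows how quickly the second-derivative matrix (the Hessian) grows with the number of parameters:

```python
def grad(x):
    """First derivative of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def hess(x):
    """Second derivative of f(x) = (x - 3)^2 (constant for a quadratic)."""
    return 2.0


# Gradient descent: many small steps that use only the first derivative.
x_gd = 0.0
gd_steps = 0
while abs(grad(x_gd)) > 1e-6:
    x_gd -= 0.1 * grad(x_gd)
    gd_steps += 1

# Newton's method: each step is rescaled by the second derivative.
x_newton = 0.0
newton_steps = 0
while abs(grad(x_newton)) > 1e-6:
    x_newton -= grad(x_newton) / hess(x_newton)
    newton_steps += 1

print(gd_steps, newton_steps)

# Why this stops scaling: with d parameters the Hessian has d * d entries.
for d in (10, 1_000, 1_000_000):
    print(d, "parameters ->", d * d, "Hessian entries")
```

Newton's method needs far fewer steps here, but storing and inverting a matrix with d * d entries is not practical when d is in the millions or billions.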

Zeroth-Order and Other Methods

These methods do not require derivatives to optimize algorithms. Some examples of algorithms where derivatives are not used are:

Genetic algorithms
Dynamic programming algorithms
Particle swarm optimization methods

The problem with these algorithms is that they are often very slow for many variables.

But in certain AI contexts, they can help optimize the architecture of deep learning models to improve AI models from an architectural point of view (instead of a parameter point of view).
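As a rough illustration of optimizing without derivatives, here is a sketch of random local search, a much simpler relative of the methods listed above (it is not a genetic algorithm or particle swarm optimization). The objective function and all numbers are invented for the example:

```python
import random


def objective(x, y):
    """Toy function to minimize; its minimum is at (1, -2)."""
    return (x - 1.0) ** 2 + (y + 2.0) ** 2


random.seed(0)
best_point = (0.0, 0.0)
best_value = objective(*best_point)

for _ in range(2000):
    # Propose a random nearby point; no derivative is ever computed.
    candidate = (
        best_point[0] + random.gauss(0.0, 0.5),
        best_point[1] + random.gauss(0.0, 0.5),
    )
    value = objective(*candidate)
    if value < best_value:          # keep the candidate only if it is better
        best_point, best_value = candidate, value

print(best_point, best_value)
```

Most proposals are thrown away, which hints at why this family of methods becomes slow as the number of variables grows.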

How does optimization theory connect with linear algebra, calculus, and probability and statistics?

Essentially:

Calculus teaches you derivatives, which help you understand optimization theory.
Linear algebra teaches you matrices, which help you understand how different states relate and transform.
Probability and statistics teach you concepts like covariance and correlation, which help you understand how variables are connected with each other.

This way, with linear algebra and probability and statistics, you gain the knowledge necessary to understand the algorithms. With calculus you gain the basis to understand optimization theory and how it changes certain parameters of the fundamental algorithms to minimize/maximize a certain objective.

Simple Optimization Techniques: How Machines Learn Step by Step

Photo by LJ Checo

Now, we’re going to see examples of machine learning algorithms used for optimization and deconstruct them so that you can understand how these areas of mathematics apply to them.

In each example, I will explain their main idea with an analogy as well as how each math area is used in each algorithm.

Linear Regression

Imagine that you are solving a puzzle. To complete the puzzle, you need to arrange the pieces in the right design/order.

The same idea applies to linear regression.

We have matrices (linear algebra) that represent the parameters of the linear regression model and the data that flow into it.

And we can see over time how well the line is fitting the numbers, as well as its error (probabilities and statistics).

To find the best line for the linear regression, we need to know how much the parameters of the model need to change (calculus) and actually apply that change to the parameters (optimization theory).

This way, calculus tells us which direction to change the parameters, and optimization theory tells us how much to actually change them.

Let’s see how to code the linear regression above:

import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []

for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db

Let’s see the code block by block:

Import library:

import numpy as np

For this problem, we’ll import one of the most used Python libraries: NumPy (which we’ve worked with earlier in the book).

Create data points:

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

In this code, we define a base line that will help in generating the data points:

X = np.linspace(0, 10, 50)
y_true = 3 * X + 2

After this green line has been created, we will add noise to it to create the data points:

noise = np.random.normal(0, 2, 50)
y = y_true + noise

This is how we defined the data points for the line dataset.

Initializing linear regression parameters and others:

w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []

In this block of code, we initialize:

Linear regression parameters: Weight to be 0.1 and bias to be 0.5
One hyperparameter: Learning rate
How many iterations we are going to use to improve the linear regression
An array called saved_states to store values to later create graphs

This way, we start with this red line:

Making the linear regression learn with the data:

for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db

It may appear complicated, but let’s see in smaller blocks:

For loop

for epoch in range(max(iterations) + 1):

Making a prediction and seeing its error

y_pred = w * X + b
error = np.mean((y - y_pred) ** 2)

In this block of the code, we compute the values predicted by the current parameters and measure their error against the real values using the mean squared error.

Saving current iteration values for future statistics

if epoch in iterations:
     saved_states.append({
         'epoch': epoch,
         'w': w,
         'b': b,
         'y_pred': y_pred.copy(),
         'error': error
     })

Here we are just storing the values of the current iteration in the saved_states array to create graphs later.

Finding the gradients

dw = -2 * np.mean(X * (y - y_pred))
db = -2 * np.mean(y - y_pred)

In this block of code, we find the gradient values for the current prediction.

In other words, for the weight and the bias, we find out in which direction and how strongly each one needs to change so that the line fits the data points better.
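To show where these two lines come from, here is the derivation written out. The loss in the code is the mean squared error, and the symbols n, x_i, y_i, and \hat{y}_i are notation introduced here for illustration:

```latex
\begin{aligned}
L(w, b) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2, \qquad \hat{y}_i = w x_i + b \\
\frac{\partial L}{\partial w} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - \hat{y}_i\bigr) \\
\frac{\partial L}{\partial b} &= -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)
\end{aligned}
```

The first derivative corresponds to dw and the second to db, since np.mean performs both the sum and the division by n.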

Updating the parameters values

w = w - learning_rate * dw
b = b - learning_rate * db

Finally, we update the weight and the bias with the new values so that the line better approximates the data points:
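As a sanity check on this update rule, we can let it run much longer and compare the result with a closed-form least-squares fit. This is a sketch under two assumptions that are not in the walkthrough above: 5,000 iterations instead of 6, and np.polyfit as the reference solution:

```python
import numpy as np

# Same synthetic data as in the walkthrough above.
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 2, 50)

w, b = 0.1, 0.5
learning_rate = 0.01

# Assumption: 5,000 iterations instead of 6, so the updates have time to settle.
for _ in range(5000):
    y_pred = w * X + b
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    w -= learning_rate * dw
    b -= learning_rate * db

# Closed-form least-squares fit of a degree-1 polynomial, for comparison.
w_exact, b_exact = np.polyfit(X, y, 1)

print(f"gradient descent: w={w:.4f}, b={b:.4f}")
print(f"least squares:    w={w_exact:.4f}, b={b_exact:.4f}")
```

If the two printed lines agree, the gradients and the update step are doing their job. After only 6 iterations the line is still far from this final answer, which is why more iterations matter.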

Neural Networks

The same puzzle idea applies to neural networks. Neural networks are algorithmic models inspired by the brain that learn patterns from data. They are part of a machine learning field called deep learning, which uses neural networks to learn complex patterns.

Neural networks are important because they power modern AI applications like:

Image recognition
Language translation
Chatbots

For example, ChatGPT means Chat Generative Pre-trained Transformer. A transformer is an architecture of neural networks.

If you understand neural networks, you’ll understand the foundations that make ChatGPT work.

We have matrices (linear algebra) that represent the parameters of the neural network model and the data that flow into it.
And we can know over time how well the neural network model is converging to the dataset, fitting the numbers, and see its error (probabilities and statistics).
Calculus will tell us in which direction the parameters of the neural network need to change.
Optimization theory will tell us how much they need to change.

For example, this is a neural network:

This model has in total 13 parameters:

It has 10 lines (connections between circles). These are called weights.
It has 2 circles in the hidden layer and 1 in the output layer. Each circle has one bias.

Big question:

Imagine you work in a bank. You are in charge of deciding who gets credit cards or not. For that, you create the neural network above that takes 4 inputs:

Income
Credit score
Debt ratio
Bankruptcy history

With this neural network well optimized, you can figure it out!

Very simply, without going into things like activation functions, the network processes the 4 inputs through its weights and biases.

Each connection multiplies the input by its weight. After that, each node adds its bias.

The final output is a number between 0 and 1:

Numbers close to 0 mean "Not approved"
Numbers close to 1 mean "Approved"
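The flow described above can be sketched in a few lines of NumPy. The weights here are random and untrained, the applicant values are made up, and a sigmoid at the output is an assumption added so that the result lands between 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained parameters for a 4 -> 2 -> 1 network.
W_hidden = rng.normal(size=(4, 2))   # 8 weights: 4 inputs x 2 hidden nodes
b_hidden = np.zeros(2)               # 2 biases, one per hidden node
W_output = rng.normal(size=(2, 1))   # 2 weights: 2 hidden nodes x 1 output
b_output = np.zeros(1)               # 1 bias for the output node


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(features):
    """Multiply by weights, add biases, then squash the final score."""
    hidden = features @ W_hidden + b_hidden
    score = hidden @ W_output + b_output
    return sigmoid(score)


# Made-up, already-scaled inputs: income, credit score, debt ratio, bankruptcy.
applicant = np.array([0.8, 0.9, 0.2, 0.0])
approval = forward(applicant)

n_params = W_hidden.size + b_hidden.size + W_output.size + b_output.size
print(n_params)          # 8 + 2 + 2 + 1 = 13
print(approval)          # a value strictly between 0 and 1
```

Because the weights are random, the printed approval score is meaningless until an optimizer has adjusted those 13 numbers against real data.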

For example, a high income figure, a good credit score, and no bankruptcy history flow through the neural network and produce 0.92. This means that the application should be approved.

But a low income figure with a history of bankruptcy may produce 0.15, which results in "Not approved".

In reality, banks and other institutions use neural networks that take far more well-chosen inputs and make this decision automatically.

This is precisely how AI can be used for credit approval.

But a question remains: What is the best way to know how much the parameters need to change?

In the next part, we are going to see the most famous optimization theory algorithm that will help us decide that.

What is Adam? The Most Popular Way AI Models Find the Best Learning Path

Photo by Lum3n

To optimize neural network based AI models, one of the most popular methods is called Adam, which means Adaptive Moment Estimation.

The paper that introduced the method is one of the most influential in the 21st century in machine learning, with thousands of citations. As with all ideas in non-symbolic AI, Adam is a mixture of different math concepts.

It's composed of the ideas of two other optimization methods:

Momentum Gradient Descent: Accumulates velocity from previous gradients to move faster in consistent directions
Root Mean Square Propagation (RMSProp): Adapts learning rates based on recent gradient magnitudes

Let's understand them with an analogy.

Imagine that you are riding a bicycle down a mountain little by little. You already know the direction thanks to calculus.

But how do you descend safely without losing control or going too slowly?

First, you need to build up speed gradually using past momentum. This is one of the main ideas of momentum gradient descent.

It's also important that you adjust your speed based on the terrain's elevation. This is the main idea of RMSProp.

This way, you can safely accelerate and brake appropriately.

When optimizing a model with Adam, this is the same concept. With Adam, we want to optimize a model in a fast and stable way.

Momentum gradient descent ensures the fast part, and RMSProp ensures the stable part.
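To see both ideas working together, here is a minimal from-scratch sketch of the Adam update on a one-parameter toy problem. The toy loss and the learning rate of 0.1 are assumptions chosen for this illustration, while beta1, beta2, and eps are the commonly used default values:

```python
import math


def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# Common default hyperparameters; lr is raised for this toy problem.
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

theta = 0.0   # parameter being optimized
m = 0.0       # first moment: running average of gradients (momentum part)
v = 0.0       # second moment: running average of squared gradients (RMSProp part)

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(theta, 3))
```

The variable m plays the role of the bicycle's built-up speed, and dividing by the square root of v_hat is the braking that adapts the step to the terrain. In practice you would use a library implementation such as torch.optim.Adam rather than writing this loop yourself.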

Nowadays, for LLMs, which once again are just very big neural network models, a variant of Adam called AdamW is more often used.

Now, let's build a code example of using Adam.

Code example:

Using Adam, we are going to optimize this neural network based on fake data.

It will take 4 features:

Income
Credit score
Debt ratio
Bankruptcy history

And it will tell us if we should or should not approve credit for a given person.

Also, since this book is an introduction to the math of AI, I will not, in this code example, discuss hyperparameter optimization, regularization techniques, and other more advanced topics and good practices.

I want to show why this neural network fails with this data and explain the importance of using great data.

Here is the whole code (and we’ll see each part more in-depth below):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt

torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)

# Plot the loss recorded at every training step
plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()

Now let’s break it down:

Importing libraries:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt

In this block of code, we are importing code from 3 Python libraries:

PyTorch: One of the most popular Python libraries to create new AI models in AI research
PyTorch Lightning: A PyTorch wrapper that organizes training code and handles repetitive tasks automatically
Matplotlib: One of the most popular Python libraries to make graphs from data

Creating data:

torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)

In this part, we define a seed to make the random numbers reproducible. In other words, when we run the code many times, the same random numbers will be generated.

Next, we create 10,000 credit applications with 4 features in x and their approval decisions in y. Both are filled with random numbers, so the labels have no real relationship to the features. After that, we unify everything in the dataset variable.

We’ll use TensorDataset because it allows us to have the 4 features and the target paired together. This way, the data does not get mixed up during training.

Dividing data:

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

In this block of code, we divide the data into a training dataset and a validation dataset.

This way, we have one dataset that’s being used to train and find the parameters while comparing results with the validation dataset.

As we can see, 80% of the data will be training data, and 20% of the data will be validation data.

Loading data:

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

Here, we load the data into data loaders for the AI model to use.

This way, we have the data automatically split into small batches, and the training data is shuffled. So instead of processing all 8,000 training data points at once, the model will be trained on one batch, improved, then another batch, then improved again, and so forth. That makes training go faster.
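A little arithmetic shows how many parameter updates this setup produces. The sketch below only reuses numbers from the code (10,000 samples, an 80% training split, a batch size of 32, and 100 epochs) and assumes the data loader keeps a final partial batch, which is the DataLoader default:

```python
import math

n_samples = 10_000
train_size = int(0.8 * n_samples)      # 8,000 training examples
batch_size = 32
max_epochs = 100

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * max_epochs

print(steps_per_epoch)   # 250 parameter updates per epoch
print(total_steps)       # 25000 updates over the whole run
```

So the optimizer gets 250 chances to improve the model in every pass over the data, instead of just one.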

Creating AI model and training process:

class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

This code block appears to be complicated, but let’s see each method block by block:

Creating the class with inheritance:

class CreditApprovalNet(pl.LightningModule):

This way, in one line, we inherit everything we need to define both the model and how it will be trained.

__init__: Builds the model's layers and components:

    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []

In this section of the code, we are defining the architecture of the AI model.

forward: Processes input data through the network to make predictions:

    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))

In this part of the code, we are defining how data will flow in the AI model based on the architecture defined.

training_step: Calculates loss for each batch during training:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss

Here, we are defining how the model will be trained. In other words, how we will find the best parameters for the model to predict well.

configure_optimizers: Sets the Adam optimizer with learning rate:

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

Finally, here we are defining what optimizer we are going to use to, step by step, improve the AI model parameters.

Training AI model:

model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)

In this block of code:

We create the neural network model in the first line
In the 2nd and 3rd line, we prepare the training settings and train the model for 100 epochs

This way, in the command line, this appears:

The PyTorch code is essentially telling us the number of parameters in the AI model!

Seeing results and understanding why they are not good:


plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()

Using the Matplotlib library, we plot the results:

The AI model is not converging.

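We can see that because the loss stays near 0.69 over time. That value is roughly ln(2), the binary cross-entropy a model gets when it predicts 0.5 for every input, which is no better than a coin flip.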
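A quick way to see where a number close to 0.69 comes from is to compute the loss of a model that has learned nothing. The sketch below writes binary cross-entropy by hand (the helper name binary_cross_entropy is made up for this illustration, while the training code uses nn.BCELoss):

```python
import math


def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for one example with label y_true and prediction p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# A model that has learned nothing predicts 0.5 for every applicant.
loss_if_approved = binary_cross_entropy(1, 0.5)
loss_if_rejected = binary_cross_entropy(0, 0.5)

print(round(loss_if_approved, 4))   # 0.6931
print(round(math.log(2), 4))        # 0.6931
```

A loss that hovers around this value means the network is effectively guessing, whichever optimizer is used.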

The main reason the model is not converging well is that there is little to no relationship between the 4 features and the target variable.

In other words, we do not have good data.

The code works perfectly, but this shows the most important rule in machine learning: when we create an AI model, the MOST IMPORTANT thing is data.

It does not matter if you use a simple linear regression or a neural network based on transformers or whatever. If you do not have high quality data, the model is not going to perform well.

Even if we use a good optimizer, like Adam, it will not solve the data problem.

Next steps: Common beginner mistakes

I also wrote this exact code example to show you something very important: neural networks are not always the best models to use.

This is a very common beginner mistake. You may start with neural networks for everything, when often classical machine learning methods with little data preprocessing do the job well.

For this type of problem, the solution is to first try classical machine learning methods instead of going straight to neural networks.

There are many reasons for this, but the main ones are:

Classical machine learning methods are simpler and often quicker to train than neural networks
Classical machine learning methods are easier to interpret. In other words, we can understand how the model arrived at a prediction.
With computational learning theory, we can estimate for certain machine learning models how well they will predict in the future and provide theoretical guarantees about their performance.

Another common mistake is not dividing the data.

To simplify, I created only a training and validation division of the data.

In a serious project, you should always divide it into 3 parts: training, validation, and testing.

With the training set, you create the model. With the validation set, you check the model on data it was not trained on while you tune it. With the test set, you check whether the loss of the model is similar to the validation loss. If they are very different, it means that the AI model was tuned to fit the validation dataset but does not generalize to the test dataset.
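Here is one possible way to produce such a three-way split with NumPy. The 70/15/15 proportions are an assumption for this sketch, not a rule:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Shuffle the row indices once, then cut them into three disjoint groups.
indices = rng.permutation(n_samples)
train_end = n_samples * 70 // 100   # assumed 70 / 15 / 15 split
val_end = n_samples * 85 // 100

train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

print(len(train_idx), len(val_idx), len(test_idx))
```

The important property is that no row appears in more than one group, so the test set stays untouched until the very end.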

I challenge you to think further about how you could improve this code and to try to make the synthetic data more correlated in order to improve its quality.
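As a starting point for that challenge, here is one possible sketch of synthetic data where the labels actually depend on the features. The weights in true_weights are invented for illustration, and the feature names follow the four inputs used earlier:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(10_000, 4))

# Hypothetical rule: income and credit score help, debt and bankruptcy hurt.
true_weights = np.array([1.5, 2.0, -1.0, -2.5])
score = x @ true_weights

# Turn the score into an approval probability, then sample a 0/1 label from it.
probability = 1.0 / (1.0 + np.exp(-score))
y = (rng.random(10_000) < probability).astype(np.float32)

# Each feature now correlates with the label, unlike purely random labels.
for name, column in zip(["income", "credit", "debt", "bankruptcy"], x.T):
    print(name, round(float(np.corrcoef(column, y)[0, 1]), 2))
```

With labels generated this way, there is a real pattern for the network to find, so the training loss has room to drop below the coin-flip level.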

Applications of Optimization Theory in AI and Control Theory

Photo by Tara Winstead

Optimization theory serves as the engine behind AI and control systems that shape our lives.

From unlocking your phone with facial recognition to autopilot systems guiding planes, optimization algorithms are constantly at work.

When you ask ChatGPT a question, optimization theory determines the values of billions of parameters during training.

The same is true for all other LLMs like Gemini, Claude, Grok, DeepSeek, and others. All of them contain billions of parameters. The only way to find the best combination of the parameters to achieve a certain objective is with optimization theory.

In control theory, many systems like Model Predictive Control (MPC) and adaptive control systems only work thanks to optimization methods that balance how internal components of the control system should work together.

Beyond training neural networks and controlling physical systems, optimization powers recommendation systems, resource allocation, and so many other systems.

Some examples are:

Netflix movie recommendation system
Spotify's song suggestion system
Google systems to reduce data center cooling costs
Quantitative trading firms' high-frequency trading systems

To end this final chapter, I’ll share this:

It is optimization theory that makes math models into AI models that impact the lives of millions worldwide.

Conclusion: Where Mathematics and AI Meet

Photo by AXP Photography

When ancient civilizations first carved numbers into clay tablets, they likely didn’t imagine that these symbols would one day allow humanity to create the scientific, technological, and medical marvels we have today.

Yet here we are.

We’re in an era where mathematical ideas developed over many centuries – even millennia – have converged to create artificial intelligence.

Throughout this book, we've traced a path from the most basic math concepts to the cutting edge of AI. We have seen how:

Matrices compress complex systems into simple forms
Derivatives measure change
Probability helps us navigate uncertainty
Optimization guides algorithms toward better decisions to learn faster.

We’ve also learned how each math field has helped create tools that are responsible for many of the things we take for granted today.

Mathematics is the Foundation of AI

Photo by Jeswin Thomas

Always remember this: AI is not pure magic or a "being" we don't understand. It’s just the combination of many math ideas working very well together.

When you ask a question of ChatGPT or any other LLM, it generates a response. And in the process of generating that response, there are millions of matrix multiplications happening in seconds.

Or, for example, when a self-driving car decides to stop moving because it’s coming up to a crosswalk, there are a lot of math computations (related to calculus and probability and statistics) working very fast to ensure safety.

The great thing about mathematics is that it’s a common, standard language of logic. No matter the backgrounds of people or where they were born, a derivative will always be a derivative, and the same thing goes for key AI concepts.

This way, scientists and engineers worldwide can improve each other's work because everyone understands the same language.

The Future: On Device AI and the Democratization of AI

Photo by Steve Johnson

One shift happening now is the move toward edge AI. That is, AI that runs locally on your phone, computer, and really in all your devices (rather than in distant data centers).

This way, privacy improves because your data stays on the device. Waiting times for AI models decrease because no data needs to be sent over the network. AI can be used offline, and costs decrease.

And what about the massive data centers being built all over the world? Those will be used for more products that will help improve the lives of millions of people.

As AI becomes more local and more processing power is freed up from big data centers, new AI innovations will appear, and more benefits will come.

The same way that in the past century every computer got its own networking chip, every device will have (and in some cases, already has) AI accelerators.

And much of it will be thanks to the math you learned in this book.

Final Reflections

Isaac Newton wrote, "If I have seen further, it is by standing on the shoulders of giants."

Every algorithm you use, every model you train, and every new theorem you learn stands on centuries of mathematical progress. You now stand on the shoulders of those same giants!

Thank you for reading, and happy learning.

Here’s the full book GitHub repository with all the code.

Acknowledgements

First and foremost, I would like to thank Guilherme Mendes, currently a Master’s student in Electrical and Computer Engineering at NOVA University, specializing in Control Theory, for reviewing the mathematical and technical details of the 1st version of this book.

I am also grateful to the organizations that gave me opportunities to grow:

A special thank you goes to the freeCodeCamp editorial team, especially Abigail Rennemeyer, for their patience and for reviewing every chapter of this book.

I would also like to thank all the professors at NOVA FCT who have taught and guided me throughout my academic journey, especially those from the Department of Electrical and Computer Engineering.

About the Author

LinkedIn: https://www.linkedin.com/in/tiago-monteiro-
GitHub: https://github.com/tiagomonteiro0715
Email: monteiro.t@northeastern.edu

My name is Tiago Monteiro, and I’m now pursuing a master's degree in Artificial Intelligence at Northeastern University in the Silicon Valley Campus (San Jose) on a merit-based scholarship.

I’m not from the United States. I am a Portuguese national, born and raised in the district of Lisbon.

In Portugal, I completed a bachelor's degree in electrical and computer engineering at NOVA University, one of Portugal's best universities.

I have authored over 20 articles for freeCodeCamp, which have accumulated more than 240,000 views over the years, and completed the Deep Learning Specialization from DeepLearningAI, taught by Andrew Ng.

Also, I had the privilege of participating in the winter 2025 batch of the renowned Silicon Valley Fellowship program.

Why did I choose electrical and computer engineering?

After finishing the Portuguese national math exam in 12th grade, I chose Electrical and Computer Engineering (ECE) to challenge myself and learn new math on my own.

The ECE degree combined:

Advanced Mathematics
Programming (from Assembly to Python)
Physics (classical mechanics, electromagnetism)

What did I gain exactly?

I mastered the skills needed to quickly understand AI research, particularly after completing Andrew Ng's Deep Learning Specialization.

In Portugal, I also studied advanced STEM areas including, for example:

Partial Differential Equations for modeling real-world phenomena
Harmonic analysis (Fourier/Laplace transforms) for signal processing and alternative problem perspectives
Complex analysis involving derivatives and integrals in the complex domain
Numerical methods for approximating mathematical solutions computationally
Signal/control theory for ensuring system stability in dynamic environments
Physics classes in classical mechanics and electromagnetism fundamentals

While not directly applied to AI, these studies enhanced my systems thinking and ability to independently learn complex STEM concepts.

The State of Bluetooth in 2025: What’s New, What’s Possible, and How to Use It

Nikheel Vishwas Savant — Fri, 07 Nov 2025 17:10:25 +0000

Introduction: Why Bluetooth Still Matters

You probably don’t even think about Bluetooth anymore. It’s just there, quietly doing its job every single day. It’s what keeps your earbuds connected, your smartwatch synced, your car infotainment system talking to your phone, and your warehouse sensors awake and reporting.

The funny thing is, while most of us stopped paying attention, Bluetooth never stopped evolving. It just kept getting smarter.

Now it’s 2025, and Bluetooth has grown into something much bigger than a way to stream music. It has become a core ecosystem that connects nearly everything around us. From audio gear and IoT sensors to industrial automation and secure building access, Bluetooth is everywhere.

The newest versions, Bluetooth 5.4 and 6.0, completely redefine how devices talk to each other. We’re talking about encrypted broadcasts, smarter advertising, centimeter-level distance tracking, and a level of scalability that feels closer to magic than engineering.

In this article, we’ll take a tour through the newest Bluetooth technologies and see what’s happening under the hood. You’ll get a feel for what’s new, how these features work in real projects, and how developers can actually take advantage of them.

Grab your favorite dev board, and let’s dive in.

The Evolution: From Classic to Low Energy to 6.0
Deep Dive: Technical Enhancements
Real-World Applications in 2025
Developer Guide: Getting Started
Challenges and Trade-Offs
The Road Ahead: Bluetooth 6.1 and Beyond
Conclusion

The Evolution — From Classic to Low Energy to 6.0

If you’ve been around Bluetooth for a while, you probably remember the early days when pairing a headset felt like solving a riddle. Back then, Bluetooth Classic ruled the scene, focused mainly on short-range audio and simple data links. Over the years, though, the story changed completely.

Today, Bluetooth has transformed from a simple cable-replacement protocol into a flexible framework for everything from earbuds to industrial robots. Each new version added fresh layers of intelligence, speed, and energy efficiency. The table below gives a quick timeline of how that evolution unfolded.

Version	Year	Key Features
2.0 + EDR	2004	Faster data rate (3 Mbps)
4.0	2010	BLE introduced for low power
5.0	2016	2× speed, 4× range, 8× advertising capacity
5.1	2019	Direction Finding (AoA/AoD)
5.2	2020	LE Audio / Isochronous Channels
5.3	2021	Connection Subrating, Channel Classification Enhancement
5.4	2023	Encrypted Advertising Data, PAwR
6.0	2024	Channel Sounding, Decision-Based Advertising Filtering
6.1	2025	Randomized RPA updates (privacy and power efficiency)

The journey tells a bigger story. What started as a way to connect two devices for audio has turned into a foundation for massive IoT networks. Each revision introduced smarter physical layers, better energy profiles, and new roles for devices that once had very limited capability.

Source: MDPI Sensors (2025), Bluetooth Core Specification Summary.

The figure above provides a visual snapshot of how Bluetooth has evolved across its major versions. It shows a clear chronological progression of features, from the launch of Bluetooth Low Energy (BLE) in version 4.0, through secure connections, long-range PHYs, and direction finding, up to the latest additions such as Channel Sounding and Decision-Based Advertising Filtering in Bluetooth 6.0. The color-coded timeline highlights how each version refined both the physical and logical layers of communication, gradually expanding Bluetooth’s reach from simple peripherals to high-precision industrial and spatial applications. In essence, it maps Bluetooth’s transformation from a short-range wireless cable into a sophisticated, context-aware connectivity fabric that underpins modern audio, IoT, and automation ecosystems.

If you zoom out a bit, you’ll notice a clear pattern: Bluetooth keeps finding new neighborhoods to move into. From cars and headphones to factories and hospitals, the technology now feels less like a cable replacement and more like an invisible nervous system for the modern world.

What’s New in Bluetooth 5.4 and 6.0

When you hear that Bluetooth has a “new version,” it’s easy to shrug it off. After all, your headphones already work, right? But the jump from 5.3 to 5.4 and then 6.0 isn’t just a tiny step. It’s more like Bluetooth quietly taking on Wi-Fi’s job in certain places and pulling it off surprisingly well.

Let’s break it down by version so it’s easier to see what’s going on.

Bluetooth 5.4: Building the IoT Backbone

This release might not have made flashy headlines, but engineers loved it. It focuses on letting thousands of low-power devices talk to a single gateway without choking the airwaves.

Let’s look at some of the key features and why they matter:

Periodic Advertising with Responses (PAwR)

Think of it as Bluetooth’s group chat for sensors. Devices can broadcast messages and still get short replies, all without the full connection setup that usually drains batteries. It’s perfect for large sensor networks like smart warehouses or retail stores with electronic shelf labels.

Source: Nordic Semiconductor Developer Zone (2024)

The diagram above illustrates the timing structure of Bluetooth 5.4’s Periodic Advertising with Responses (PAwR) mechanism. Along the horizontal axis, it shows a repeating sequence of PAwR events separated by the overall periodic advertising interval. Within each PAwR event are several subevents (labeled #0, #1, #2, #3, and so on), each representing a defined window of time during which specific sensors or devices are allowed to communicate. The figure highlights that every subevent occurs at a fixed periodic advertising subevent interval, meaning devices can wake up only during their assigned slot, transmit or receive data, and then return to sleep. This predictable scheduling dramatically reduces radio collisions and power consumption, allowing a single gateway to coordinate thousands of low-power nodes such as electronic shelf labels or environmental sensors within a shared advertising cycle.

Encrypted Advertising Data

Broadcasts used to be open for anyone to sniff. Now they can be private and secure, which is essential for medical monitors and retail beacons carrying sensitive info.

Source: Raytac Technology (2024)

The diagram above breaks down the structure of the Encrypted Data Advertising Data (AD) type introduced in Bluetooth 5.4. It shows how encrypted advertising payloads are organized within a broadcast packet. At the top, the full advertising payload is represented, which includes the length (Len), Encrypted Data (ED Tag), and flags. Inside the encrypted section, the fields are expanded to show the Randomizer, Payload, and Message Integrity Check (MIC). The payload itself may contain various elements such as the Electronic Shelf Label (ESL) Tag, ESL Payload, Local Name (LN Tag), or other advertising segments. The color-coding differentiates which parts are encrypted (blue) versus unencrypted (gray or yellow), highlighting how Bluetooth 5.4 secures sensitive data while retaining key advertising identifiers for discovery. This layout helps engineers understand where encryption is applied within the advertising packet and how privacy and integrity are preserved during broadcast communication.

Electronic Shelf Labels (ESL) Support

Bluetooth 5.4 was practically written with supermarkets in mind. Imagine thousands of digital price tags blinking updates at once, all running for months on coin-cell batteries.

Source: Dani Data Systems (2023)

The image above illustrates the working architecture of a Bluetooth-based Electronic Shelf Label (ESL) system. On the left, a computer running ESL management software is shown, which allows retail staff to configure product data, prices, and display templates. The software communicates over a TCP/IP network connection with a Base Station positioned in the center of the diagram. This base station acts as a Bluetooth gateway, wirelessly transmitting the updated price and product information to numerous shelf labels throughout the store. On the right, a digital ESL display is shown featuring a price tag for a product labeled “Kaju Katali,” complete with product details, QR codes for mobile payments, and expiry dates. The blue wireless icon between the base station and ESL tag symbolizes Bluetooth communication. Together, the components demonstrate how Bluetooth 5.4 enables synchronized, low-power, and remotely managed price updates across thousands of retail shelf labels.

In short, 5.4 was the version that said, “Sure, we can handle massive IoT networks.”

Bluetooth 6.0: The Game Changer

Bluetooth 6.0 feels like the point where the technology matured from “just wireless” into “smart wireless.” This version brings features that start blurring the line between Bluetooth and more advanced location systems.

Channel Sounding

This is a big one. Instead of using signal strength (which can be messy), Bluetooth 6.0 measures phase differences in radio waves to calculate distance. That means centimeter-level accuracy (enough for digital keys), precise tracking, and even AR interactions.

Source: Bluetooth SIG (2025)

The image above explains the concept of Bluetooth Channel Sounding, a new feature introduced in Bluetooth 6.0 that enables precise distance measurement between devices. The top half of the diagram compares three levels of spatial awareness: presence detection through advertising, coarse distance estimation using RSSI (Received Signal Strength Indicator), and fine-grained ranging achieved with Channel Sounding. It also shows how Direction Finding complements these methods by determining angular orientation. On the left, a smartphone (the initiator) communicates with a smart lock (the reflector), demonstrating how Bluetooth can estimate distance and direction simultaneously.

The bottom portion visualizes two measurement techniques. The Phase-Based Ranging chart shows how two signals of different frequencies experience measurable phase shifts that correspond to distance. The Round Trip Time (RTT) diagram on the right depicts packets traveling between the initiator and reflector, with the elapsed time between transmission and reception used to calculate distance. Together, these visuals illustrate how Bluetooth 6.0 achieves centimeter-level accuracy for applications like digital keys, indoor navigation, and spatially aware IoT systems.

Decision-Based Advertising Filtering

Bluetooth devices now decide which advertisements to process and which to ignore, saving both power and bandwidth. It’s like teaching scanners to pay attention only when it’s worth it.

Source: Bluetooth SIG (2024)

The diagram above illustrates the architecture of Decision-Based Advertising Filtering, a new Bluetooth 6.0 feature that allows observers to process only relevant broadcast packets, reducing power consumption and unnecessary data handling. The figure depicts two parallel host–controller stacks: the Observer on the left and the Advertiser on the right. Each side includes an Application layer, Host Controller Interface (HCI), and Controller. On the advertiser side, the application generates Decision Data that passes through the HCI to the controller’s advertising engine, where it’s embedded into extended advertising packets known as Decision PDUs. On the observer side, incoming advertising data passes through a Filter Policy module in the controller, which selects or rejects packets according to preconfigured decision criteria before forwarding only the relevant Advertising Reports to the host application. Blue arrows show configuration and report flows, while the yellow HCI bands highlight the host–controller boundary. Together, the components show how Bluetooth 6.0 lets devices make context-aware filtering decisions at the controller level, improving efficiency in dense radio environments.

Advertiser Monitoring

Gateways can now keep tabs on the state of nearby advertisers, which is critical when hundreds of devices are broadcasting at once.

Source: Bluetooth SIG (2024)

The image above depicts the fundamental interaction between two Bluetooth Low Energy (BLE) device roles: advertising and scanning. On the left, a smartphone icon represents the scanning device, which actively listens for nearby Bluetooth broadcasts. On the right, a small sensor or tag icon represents the advertising device, periodically transmitting packets that announce its presence, capabilities, or data updates. Blue concentric rings radiate outward from both devices, symbolizing the propagation of radio signals and the overlapping wireless coverage area where scanning and advertising events intersect. The minimalist design highlights the asymmetric nature of BLE communication: the advertiser periodically transmits small bursts of information, while the scanner remains receptive to detect, filter, or connect with those broadcasts. This exchange forms the foundation of all Bluetooth discovery, pairing, and data exchange processes.

Negotiable Inter-Frame Spacing

This lets devices adjust timing between packets to improve throughput and avoid interference in noisy environments.

Source: Bluetooth SIG (2024)

The image above illustrates the concept of Negotiable Inter-Frame Spacing (IFS) in Bluetooth 6.0, which optimizes the timing between consecutive data packets to improve throughput and reduce interference. The diagram shows two sequences of communication between a Central (C) and a Peripheral (P) device, represented as alternating blue (C→P) and green (P→C) data blocks. In the first sequence, packets are transmitted with a short, fixed inter-frame spacing labeled T_IFS, showing a rapid exchange of packets within a connection event. The second sequence demonstrates the enhanced Bluetooth 6.0 model, where devices can dynamically negotiate a longer spacing interval (indicated by the notation “≥ T_IFS”) to accommodate environmental conditions, controller processing delays, or congestion. The red horizontal arrows mark the overall connection event duration, while the vertical lines represent packet boundaries. By allowing flexible timing adjustments between frames, Bluetooth 6.0 reduces airtime collisions and improves coexistence with other 2.4 GHz systems, particularly in dense or interference-prone environments.

ISOAL Enhancements

Audio data, especially LE Audio streams, now move more smoothly thanks to improved support for large frames.

Source: Bluetooth SIG (2024)

The diagram above illustrates the internal data flow and timing structure of the Isochronous Adaptation Layer (ISOAL) in Bluetooth 5.2 and later, which supports synchronized audio and data transmission over LE Isochronous Channels. The figure is divided into three main sections: the Upper Layer, the ISOAL, and the Link Layer. At the top, the Upper Layer handles isochronous data in the form of Service Data Units (SDUs). Within the ISOAL layer, SDUs undergo several key processes: Fragmentation and Segmentation break data into smaller protocol units, while Recombination and Reassembly merge received fragments back into complete SDUs. Two important timing-related steps occur in parallel: the Inclusion of Timing Offsets, which ensures proper packet scheduling, and Timing Reconstruction, which synchronizes the playback or reassembly timing for received streams. These operations produce either Framed or Unframed Protocol Data Units (PDUs), which are then passed to the Link Layer at the bottom for transmission over the Isochronous Stream. The diagram highlights how ISOAL bridges the upper and lower layers, managing timing alignment and packet structure to deliver low-latency, synchronized LE Audio or data streams across multiple devices.

When you put all that together, Bluetooth 6.0 starts looking a lot like Ultra-Wideband in terms of precision, but without needing a separate UWB radio, provided the Bluetooth chipset itself supports Channel Sounding. It’s faster, smarter, and somehow more polite on the airwaves.

Deep Dive: Technical Enhancements

This is where Bluetooth starts to feel less like “a thing your phone just does” and more like a finely tuned machine. The new specs add layers of intelligence that make devices more aware of distance, timing, and context. It’s the kind of stuff that gets engineers grinning because it solves problems we’ve all quietly complained about for years.

Let’s walk through a few of the most important ones.

Channel Sounding and Distance Awareness

If you’ve ever used RSSI values to guess how far a device is, you know how unpredictable it can be. RSSI measures how strong the signal sounds, not where it actually came from. A wall, a metal shelf, even a human body can distort it. Channel Sounding solves this by looking at phase instead of strength.

Here’s the idea: two devices exchange carefully crafted packets at multiple frequencies. Each frequency behaves like a different musical note. When those notes reach the receiver, their phases – how the peaks and troughs line up – shift slightly depending on distance. The receiver compares the original and received phases, then crunches the math:

$$ \text{Distance} = \frac{c \times \Delta \phi}{2\pi f} $$

where:

\( c \) is the speed of light,
\( \Delta \phi \) is the phase shift,
\( f \) is the carrier frequency.

This approach allows for precise distance measurement, achieving accuracy down to a few centimeters by analyzing the phase differences of signals received at multiple frequencies.
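To make the relation concrete, here is a minimal C sketch of the same formula. It is an idealized illustration that assumes a one-way path with no noise or multipath, and the function names are hypothetical rather than part of any vendor SDK. The same expression also applies when the phase shift is the difference between two tones and the frequency is their spacing, which is one way measuring at multiple frequencies extends the range before the phase wraps.

```c
#include <assert.h>

#define SPEED_OF_LIGHT_M_S 299792458.0
#define TWO_PI 6.283185307179586

/* Distance implied by a phase shift: d = c * delta_phi / (2 * pi * f).
 * Pass a carrier frequency for the single-tone case, or the spacing
 * between two tones when delta_phi is the difference of their phases. */
double distance_from_phase_m(double delta_phi_rad, double freq_hz)
{
    assert(freq_hz > 0.0);
    return SPEED_OF_LIGHT_M_S * delta_phi_rad / (TWO_PI * freq_hz);
}

/* Largest distance measurable before the phase wraps past 2*pi. */
double unambiguous_range_m(double freq_hz)
{
    return distance_from_phase_m(TWO_PI, freq_hz);
}
```

At a 2.4 GHz carrier the phase wraps roughly every 12.5 cm, while a 1 MHz tone spacing stretches that to roughly 300 m. Real Channel Sounding measures a round trip and has to deal with multipath, so treat this as intuition rather than a ranging implementation.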

That level of precision changes the game. Cars can unlock automatically only when you’re physically beside the door. Smart-building systems can tell which room you’re standing in. Mixed-reality headsets can map your movements without extra sensors.

From a development point of view, you’ll need hardware that supports the new Channel Sounding PHY. Nordic’s nRF54 and Silicon Labs’ BG24 families already expose low-level APIs for it. Expect to work closer to the metal than usual: calibration, antenna diversity, and clock stability all affect measurement accuracy. It’s worth the effort, though. Few wireless technologies can deliver this precision without expensive dedicated hardware.

Periodic Advertising with Responses (PAwR)

For years, BLE advertising worked like shouting into a room and hoping someone heard you. The moment you wanted a reply, you had to form a full connection. That model doesn’t scale when you have ten thousand tiny sensors that each wake up once a minute.

PAwR flips the model. Think of it as a scheduled town-hall meeting. A coordinator (the gateway) broadcasts a timeline. Each sensor has a reserved time slot to respond within that cycle. Because everyone speaks only during their assigned moment, collisions disappear and energy use plummets.

In practice, this lets one gateway handle tens of thousands of devices without ever maintaining individual connections. Supermarkets use it for electronic shelf labels that update prices in seconds. Factories deploy it for environmental sensors that report temperature and vibration periodically.

Developers integrating PAwR will notice that it doesn’t replace connections, it complements them. You can still open a full GATT session for configuration, but routine data flows through lightweight PAwR exchanges. Most modern SDKs, including Zephyr and ESP-IDF, now include PAwR APIs under their extended-advertising modules.
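The reserved-slot idea can be sketched in a few lines of C. This is a simplified model, not a real stack API: the struct, the field names, and the node-to-slot mapping are all assumptions made for illustration, and times are kept in microseconds rather than the units a controller would use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical schedule description; all times in microseconds. */
typedef struct {
    uint32_t subevent_interval_us;     /* start-to-start gap between subevents */
    uint32_t response_slot_delay_us;   /* subevent start to first response slot */
    uint32_t response_slot_spacing_us; /* start-to-start gap between slots */
    uint8_t  num_subevents;
    uint8_t  slots_per_subevent;
} pawr_schedule_t;

typedef struct {
    uint8_t subevent;
    uint8_t slot;
} pawr_assignment_t;

/* Spread nodes across subevents, filling each subevent's slots in order. */
pawr_assignment_t pawr_assign(const pawr_schedule_t *s, uint16_t node_index)
{
    assert(s->slots_per_subevent > 0);
    assert(node_index < (uint32_t)s->num_subevents * s->slots_per_subevent);
    pawr_assignment_t a = {
        .subevent = (uint8_t)(node_index / s->slots_per_subevent),
        .slot     = (uint8_t)(node_index % s->slots_per_subevent),
    };
    return a;
}

/* Offset from the start of the PAwR event to this node's response slot. */
uint32_t pawr_response_offset_us(const pawr_schedule_t *s, pawr_assignment_t a)
{
    return a.subevent * s->subevent_interval_us
         + s->response_slot_delay_us
         + a.slot * s->response_slot_spacing_us;
}
```

Because every node can compute its own offset from the shared schedule, it only needs to wake for its subevent and its slot, which is where the energy savings come from.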

Isochronous Audio Channels & LE Audio

Bluetooth’s original audio stack wasn’t built for what we expect today. It was designed for single-stream mono headsets, not for multi-earbud synchronized audio or broadcast systems. Isochronous Channels fix that by ensuring that every packet in a group shares the same clock reference.

Two modes exist:

Connected Isochronous Streams (CIS) handle one-to-one cases like stereo earbuds.
Broadcast Isochronous Streams (BIS) allow a transmitter to serve an unlimited audience, such as a gym or theater.

Both rely on the LC3 codec, which delivers audio quality comparable to or better than SBC at roughly half the bit rate.

In real life, this means earbuds that stay perfectly in sync even if you walk between interference zones, hearing aids that seamlessly share the same stream, and venues that broadcast announcements directly to phones without dedicated receivers. Android 14 and iOS 17 have already exposed system-level LE Audio support, so app developers can finally build end-user experiences without vendor-specific hacks.

For embedded engineers, implementing LE Audio requires controller firmware that supports ISOAL (Isochronous Adaptation Layer) and host-side stack integration. Nordic, Qualcomm, and Dialog all provide reference implementations, but testing is key – timing drift between links can break audio quality faster than you might expect.

Power & Efficiency Improvements

Battery life has always been Bluetooth’s quiet superpower, and version 6.0 tightens the screws even more. Rather than one big change, it’s a collection of small ones that add up.

Negotiable inter-frame spacing lets devices adjust the delay between packets, smoothing out contention when the air is busy. Controllers now enter deeper sleep states automatically, waking only when the radio truly needs them. Smarter advertising filters prevent devices from wasting time processing duplicates, and new firmware offloads push repetitive tasks (like connection parameter updates) away from the CPU.

When engineers combine all these tricks, the numbers look impressive: about a ten to twenty percent battery gain in dense environments. That might not sound huge, but for a coin-cell tag meant to last three years, it’s the difference between hitting the spec or not.
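A quick back-of-the-envelope calculation shows why a modest gain matters. The sketch below is a rough model with hypothetical function names: it averages sleep and active current over one duty cycle and divides the cell capacity by that average. It ignores self-discharge, temperature, and pulse-load effects, and the numbers you feed it are assumptions you should replace with measurements.

```c
#include <assert.h>

/* Average current over one period, given time spent active vs asleep. */
double average_current_ua(double sleep_ua, double active_ua,
                          double active_ms, double period_ms)
{
    assert(period_ms > 0.0);
    assert(active_ms >= 0.0 && active_ms <= period_ms);
    double duty = active_ms / period_ms;
    return active_ua * duty + sleep_ua * (1.0 - duty);
}

/* Idealized lifetime: capacity divided by average draw. */
double battery_life_years(double capacity_mah, double avg_current_ua)
{
    assert(avg_current_ua > 0.0);
    double hours = capacity_mah * 1000.0 / avg_current_ua;
    return hours / (24.0 * 365.0);
}
```

With an assumed 220 mAh coin cell, an average draw of 9 µA works out to about 2.8 years, while trimming that draw by 15 percent to 7.65 µA gives about 3.3 years. In this example, that is exactly the gap between missing and meeting a three-year target.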

Security & Privacy Upgrades

With great connectivity comes great responsibility. Bluetooth now sits at the heart of cars, locks, and health monitors, which makes security non-negotiable. The new stack finally treats it as a first-class citizen.

LE Secure Connections with numeric comparison are now standard, encrypted advertising data hides sensitive broadcasts, and Channel Sounding even enables distance-based access control. In plain language, a device can now verify that you’re physically nearby before sharing keys or unlocking features.
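As a sketch of what distance-based access control might look like in application code, the C snippet below only reports "in range" after several consecutive distance readings fall inside a threshold. Everything here is hypothetical: the type, the function names, and the idea of debouncing with a consecutive-reading count are illustrative choices, not something the specification prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    double  threshold_m;       /* unlock only inside this distance */
    uint8_t required_in_range; /* consecutive close readings needed */
    uint8_t consecutive;       /* running count of close readings */
} proximity_gate_t;

void proximity_gate_init(proximity_gate_t *g, double threshold_m, uint8_t required)
{
    assert(threshold_m > 0.0 && required > 0);
    g->threshold_m = threshold_m;
    g->required_in_range = required;
    g->consecutive = 0;
}

/* Feed one distance reading; returns true once enough consecutive
 * readings fall inside the threshold. Any far reading resets the count. */
bool proximity_gate_update(proximity_gate_t *g, double distance_m)
{
    if (distance_m >= 0.0 && distance_m <= g->threshold_m) {
        if (g->consecutive < g->required_in_range) {
            g->consecutive++;
        }
    } else {
        g->consecutive = 0;
    }
    return g->consecutive >= g->required_in_range;
}
```

Requiring a short run of agreeing measurements is a cheap way to avoid acting on a single outlier. A real lock would combine this with authenticated pairing rather than rely on distance alone.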

Still, protocol features alone aren’t enough. Developers should rotate identity-resolving keys regularly, invalidate old bonds on firmware updates, and avoid static passkeys. Security in Bluetooth is like security anywhere else: the spec provides the locks, but you’re responsible for turning the key.

Together, these improvements make Bluetooth feel more alive, more aware, and more efficient. The stack now senses distance, saves power, and defends privacy without breaking backward compatibility. It’s a quiet revolution hidden inside chips that most people never think about, yet it’s shaping how billions of devices will talk to each other over the next decade.

Real-World Applications in 2025

It’s one thing to read about Channel Sounding or PAwR in a spec sheet. It’s another to see these features come alive in everyday products.

Bluetooth has quietly spread into nearly every corner of our lives, from the shelves of supermarkets to the dashboards of cars. By 2025, it’s no exaggeration to call it the most widely deployed wireless ecosystem on Earth.

Let’s look at where these new capabilities are already making an impact.

Retail: Electronic Shelf Labels and Smart Inventory

Walk into a modern supermarket in 2025 and look closely at the price tags. They aren’t paper anymore. Those little digital labels, changing prices in real time, are powered by Bluetooth 5.4’s Periodic Advertising with Responses (PAwR) and Encrypted Advertising Data.

Each label is a low-power sensor node, quietly listening for broadcast schedules from a gateway mounted above the aisle. When it’s their turn, the tags wake up, confirm their slot, and update the display – all in milliseconds and without forming a traditional Bluetooth connection. The result is a network of tens of thousands of nodes that consumes almost no energy.

Security matters here too. Encrypted advertising ensures that a competing store or curious shopper can’t sniff price data or inject bogus updates. Everything runs on coin-cell batteries that last several years, which saves retailers both time and maintenance costs.

Smart Home: Context-Aware Unlocking and Personal Audio

If you’ve ever fumbled with your phone to unlock a smart door, Bluetooth 6.0 might finally fix that. Channel Sounding makes proximity detection precise enough to trust. The system can tell whether you’re standing by the door or ten meters away in the driveway. Only when you’re truly within range does it trigger the unlock sequence.

The same precision is reshaping personal audio. Imagine walking from your living room to the kitchen and having your smart speaker hand off the song to your earbuds automatically. That’s LE Audio working behind the scenes with isochronous channels, keeping streams perfectly aligned across multiple endpoints. It feels invisible, which is exactly how good technology should feel.

Healthcare: Reliable, Secure Patient Monitoring

Hospitals have long relied on wireless monitors, but interference and power limits made them tricky. With PAwR, a single access point can now coordinate thousands of small sensors that track vitals like heart rate, oxygen, or temperature. These devices communicate in brief, deterministic bursts, avoiding packet collisions that used to plague dense wards.

Privacy is critical, and that’s where encrypted advertising comes in. Patient identifiers and medical readings remain hidden even in broadcast form. Channel Sounding adds another layer by confirming proximity: only readers within a safe range can retrieve sensitive data.

Combined, these features help reduce misreads and protect patient confidentiality without adding extra setup steps for clinicians.

Industry 4.0: Asset Tracking and Condition Monitoring

Factories and warehouses are some of Bluetooth’s biggest playgrounds. Equipment now comes with embedded Bluetooth 6.0 modules that use Channel Sounding for ultra-precise location tracking. Pallets, forklifts, and tools broadcast their position continuously, helping logistics teams know what’s where, all the time.

Add PAwR, and you get scalable telemetry for thousands of machines. Vibration, temperature, or pressure data can flow reliably to a single gateway. Some systems even combine Bluetooth data with AI analytics to predict failures before they happen. The ability to measure distance accurately also helps robots navigate crowded spaces safely.

Wearables: Hearables, AR Glasses, and Health Bands

Wearable devices benefit more than any other category. Modern earbuds use LE Audio to keep both sides synchronized, whether you’re streaming a movie or on a call. Hearing aids receive direct broadcast audio in public venues without special adapters.

AR glasses are an even bigger frontier. They use Channel Sounding to sense spatial relationships between the wearer, nearby devices, and the environment. That allows context-aware overlays – navigation cues, health metrics, or notifications – that appear exactly where they make sense. Bluetooth’s low-power model keeps these systems lightweight enough to run all day.

Automotive: Digital Keys and Vehicle Telemetry

Cars are fast becoming Bluetooth hubs on wheels. Digital Key Systems already use Bluetooth 6.0’s distance measurement to ensure you’re physically close before unlocking or starting the engine. It’s safer than older RSSI-based solutions that could be fooled by signal relays.

Onboard sensors rely on secure connections and encrypted advertising to stream data about tire pressure, cabin air quality, or driver posture. Maintenance centers can access diagnostic data automatically when a car pulls in, without plugging in a cable. In short, Bluetooth has quietly replaced several proprietary systems once needed for short-range communication inside vehicles.

The Big Picture

What’s striking is how flexible Bluetooth has become. The same fundamental protocol now powers medical wearables, industrial sensors, and entertainment systems. Each use case leans on a different mix of features – PAwR for scale, Channel Sounding for precision, LE Audio for experience, and encrypted advertising for privacy – but the foundation is consistent.

It’s this adaptability that explains why Bluetooth continues to thrive despite predictions of its demise. Rather than being replaced by Wi-Fi or UWB, it’s learning from them, borrowing their strengths, and finding new roles.

Developer Guide: Getting Started

Bluetooth 6.0 may sound futuristic, but the good news is that you don’t have to wait years to use it. Most of the new features are already landing in chipsets, SDKs, and development kits. If you’re an engineer or hobbyist itching to get your hands dirty, this section walks you through what to look for, how to get started, and a few pitfalls to watch out for along the way.

Picking the Right Chipset

The chipset you choose sets the tone for your entire project. If you’re building something simple, like a smart tag or sensor, you’ll want a microcontroller with integrated Bluetooth Low Energy and minimal power draw. But if you plan to experiment with Channel Sounding, LE Audio, or PAwR, you’ll need silicon that explicitly supports Bluetooth 5.4 or 6.0 features.

Current front-runners include the Nordic nRF54 series, Dialog DA1470x, and Silicon Labs BG24 family. These are developer-friendly chips with mature SDKs and good documentation. They also have flexible radio subsystems, which matter a lot when you’re testing features like Channel Sounding that depend on timing and signal stability.

A small tip from experience: always check the vendor’s firmware release notes. Some Bluetooth 6.0-capable chips still require you to enable experimental PHY layers or SDK flags to unlock certain features.

SDK and Stack Support

Once you’ve got your hardware, the next step is setting up your software stack. Most Bluetooth development happens through vendor SDKs or open platforms like Zephyr RTOS, ESP-IDF, or BlueZ on Linux.

If you’re targeting embedded systems, Zephyr is a great place to start. It’s modular, stable, and already includes PAwR and LE Audio APIs under its bt_le_ext_adv and iso modules. Silicon Labs’ Simplicity Studio also has strong tooling around Bluetooth mesh and PAwR.

On desktop or gateway platforms, Linux’s BlueZ stack supports extended advertising and secure connections out of the box, and work is underway to integrate Channel Sounding support via new HCI commands.

Always verify that your controller firmware is up to date before testing new features. Many “missing API” errors trace back to outdated controller images that don’t yet recognize the relevant HCI opcodes.

Advertising Strategy

Advertising is still the heartbeat of Bluetooth, and now it’s smarter than ever. Here’s a simple example of setting up extended advertising in C-style pseudocode:

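// Advertising parameters for a generic BLE stack (names are illustrative).
ble_adv_params params = {
    .type = ADV_EXTENDED,  // use extended advertising (Bluetooth 5.0+)
    .interval = 160,       // 160 x 0.625 ms = 100 ms interval
    .tx_power = 0          // 0 dBm (default transmit power)
};

ble_set_adv_data(payload, sizeof(payload));  // load the advertising payload
ble_start_advertising(&params);              // start broadcasting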

The pseudocode above shows how a Bluetooth Low Energy (BLE) device initializes and starts broadcasting advertisements so that nearby devices can discover it.

The first block fills in a structure of type ble_adv_params, which holds the configuration settings for advertising. The .type = ADV_EXTENDED field selects Extended Advertising, a feature introduced in Bluetooth 5.0 that uses secondary channels to carry payloads larger than the 31-byte limit of legacy advertising and can optionally run on the long-range PHY. The .interval = 160 value sets the advertising interval in Bluetooth time units of 0.625 milliseconds, so the device advertises every 100 milliseconds (160 × 0.625 ms), frequent enough for responsive discovery without excessive power consumption. The .tx_power = 0 field sets the transmit power to 0 dBm, a common default that balances energy efficiency and signal range.

After configuring the parameters, ble_set_adv_data(payload, sizeof(payload)) loads the advertising data, typically a collection of identifiers such as the device name, UUIDs for available services, manufacturer-specific data, or other Bluetooth advertising fields. This is the information that other devices see when scanning nearby.

Finally, ble_start_advertising(&params) begins the actual transmission. The BLE controller announces the advertisement on the primary advertising channels (37, 38, and 39), and with extended advertising those announcements point scanners to the secondary channel that carries the payload. Once active, the device keeps advertising until it is stopped manually or a central device establishes a connection.

In essence, this short snippet covers the three fundamental steps of BLE advertising: configuring the radio parameters, defining the broadcast data, and enabling the periodic advertisements that make the device visible to others.

This kind of setup works well for extended advertising and PAwR broadcast scheduling. When designing your advertising payloads, remember that the new encrypted format (introduced in 5.4) limits available space slightly, so plan for tighter data packing if you’re including custom fields.
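To get a feel for how much room encryption takes, here is a small C helper for budgeting payload space. It assumes the Encrypted Data AD structure adds a 5-octet randomizer and a 4-octet MIC on top of the usual length and type octets. Treat those sizes as assumptions to verify against the Core Specification Supplement, and note that the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define AD_HEADER_LEN      2u /* length octet + AD type octet */
#define ENC_RANDOMIZER_LEN 5u /* assumed size, verify against the spec */
#define ENC_MIC_LEN        4u /* assumed size, verify against the spec */

/* Bytes left for the encrypted inner payload, given the total advertising
 * data capacity and the bytes already used by other AD structures. */
size_t encrypted_ad_capacity(size_t adv_data_max, size_t other_ad_bytes)
{
    size_t overhead = AD_HEADER_LEN + ENC_RANDOMIZER_LEN + ENC_MIC_LEN;
    assert(adv_data_max >= other_ad_bytes);
    size_t remaining = adv_data_max - other_ad_bytes;
    return remaining > overhead ? remaining - overhead : 0;
}
```

Under these assumptions, a 31-byte legacy payload that already carries a 3-byte Flags structure leaves 17 bytes for encrypted content, which is why tighter packing or extended advertising quickly becomes attractive.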

If you’re building something that needs connection-less updates (like a sensor network), use PAwR or periodic advertising. For interactive applications, where you expect users to connect via a phone or hub, extended connectable advertising remains the right choice.
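Since advertising intervals are configured in 0.625 ms units rather than milliseconds, a pair of conversion helpers can save some head-scratching. The function names below are hypothetical, and the rounding choice (nearest unit) is just one reasonable option.

```c
#include <assert.h>
#include <stdint.h>

/* Advertising intervals are expressed in units of 0.625 ms (625 us). */
uint32_t adv_interval_units_from_ms(uint32_t interval_ms)
{
    assert(interval_ms <= 4000000u); /* keeps the multiplication within uint32_t */
    return (interval_ms * 1000u + 312u) / 625u; /* round to the nearest unit */
}

uint32_t adv_interval_us_from_units(uint32_t units)
{
    assert(units <= 6000000u); /* keeps the multiplication within uint32_t */
    return units * 625u;
}
```

For instance, a 100 ms interval maps to the value 160 used in the pseudocode earlier in this section, and 160 units maps back to 100,000 µs.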

Connection Optimization

Tuning connection parameters is half art, half science. You’ll often find yourself trading latency for battery life. For streaming or LE Audio applications, intervals around 24–40 ms usually strike the right balance. For sensors or telemetry, you can stretch that interval out to save energy.

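Peripheral latency is another underrated feature. It lets a peripheral skip a set number of connection events and sleep longer while maintaining an active connection, reducing energy use without affecting responsiveness too much. Bluetooth 5.3 also added Connection Subrating, the LE counterpart to Bluetooth Classic’s sniff subrating, which lets a link switch quickly between low-duty and high-duty operation.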
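The trade-offs in this section come down to a few numeric parameters. As a rough sketch, the helper below checks a set of LE connection parameters against the commonly documented ranges and the supervision-timeout rule, and shows how peripheral latency (the number of connection events a peripheral may skip) stretches the time it can sleep. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t interval_units; /* 1.25 ms units, 6..3200 (7.5 ms to 4 s) */
    uint16_t latency;        /* connection events the peripheral may skip, 0..499 */
    uint16_t timeout_units;  /* 10 ms units, 10..3200 (100 ms to 32 s) */
} conn_params_t;

/* The supervision timeout must exceed (1 + latency) * interval * 2. */
bool conn_params_valid(const conn_params_t *p)
{
    assert(p != 0);
    if (p->interval_units < 6 || p->interval_units > 3200) return false;
    if (p->latency > 499) return false;
    if (p->timeout_units < 10 || p->timeout_units > 3200) return false;
    uint32_t timeout_us = p->timeout_units * 10000u;
    uint32_t minimum_us = (1u + p->latency) * p->interval_units * 1250u * 2u;
    return timeout_us > minimum_us;
}

/* Longest gap between events the peripheral must attend, in microseconds. */
uint32_t conn_max_sleep_us(const conn_params_t *p)
{
    return (1u + p->latency) * p->interval_units * 1250u;
}
```

A 30 ms interval with a latency of 4 lets the peripheral sleep up to 150 ms between events, but it also means a 200 ms supervision timeout is no longer valid, so the two settings need to be tuned together.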

If you’re testing with multiple devices, recreate busy airspace with extra Bluetooth and Wi-Fi traffic and capture it using tools like the Ellisys Bluetooth Analyzer or the nRF Sniffer. This helps uncover timing issues or packet loss that might only show up in dense radio environments.

Power Testing

It’s easy to claim low power on paper – but proving it is another story. Use your dev kit’s current profiling tools to measure sleep and active currents under different intervals and PHY settings.
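Before you measure, a back-of-the-envelope estimate helps you judge whether the measured numbers look plausible. The sketch below computes a time-weighted average current and a naive battery-life figure. All current, timing, and capacity values are made-up placeholders, and the model ignores retries, self-discharge, and regulator losses.

```python
def average_current_ua(active_ua, sleep_ua, active_ms, period_ms):
    """Time-weighted average current (in microamps) over one period."""
    if not 0 <= active_ms <= period_ms:
        raise ValueError("active_ms must be between 0 and period_ms")
    sleep_ms = period_ms - active_ms
    return (active_ua * active_ms + sleep_ua * sleep_ms) / period_ms


def battery_life_days(capacity_mah, avg_ua):
    """Naive battery life: capacity divided by average current."""
    hours = capacity_mah * 1000 / avg_ua
    return hours / 24


# Hypothetical device: 5 mA for 2 ms per radio event, 2 uA asleep.
fast = average_current_ua(5000, 2, active_ms=2, period_ms=100)
slow = average_current_ua(5000, 2, active_ms=2, period_ms=1000)
print(fast, slow)
print(battery_life_days(220, fast), battery_life_days(220, slow))
```

If the profiler reports an average far above the estimate, something is keeping the radio or CPU awake longer than you intended.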

Run your firmware through long-duration tests in “noisy” airspace – meaning multiple other Bluetooth or Wi-Fi devices nearby. The goal is to see how your firmware reacts when packet retries or interference increase. Sometimes small timing tweaks can make big differences in battery life.

As a general rule, always start testing on the 1M PHY (the default) and only switch to 2M for high-throughput use cases like audio. Long-range (Coded PHY) modes can be valuable for IoT, but remember that the extra range comes from longer on-air time per packet, which often costs extra current.

Security Checklist

Bluetooth 6.0 brings much stronger built-in security, but you’ll still need to wire it up correctly. Make sure to:

Use LE Secure Connections instead of legacy pairing.
Rotate Identity Resolving Keys (IRK) periodically.
Encrypt advertising payloads whenever transmitting private or medical data.
Handle key storage securely on your device, preferably with hardware-backed encryption or secure flash.

Also, watch for privacy gaps in the connection flow. Even encrypted devices can leak identity information if they reuse resolvable addresses or fail to clear bonds properly on reset.

Backward Compatibility

Real-world devices won’t all jump to Bluetooth 6.0 overnight. Your code should always detect peer capabilities and fall back gracefully. The HCI layer provides read commands that reveal which features the remote device supports.

For example, if Channel Sounding isn’t available, default to RSSI-based proximity or skip distance-based logic entirely. Similarly, if LE Audio isn’t supported, fall back to classic A2DP. Designing your firmware with this flexibility keeps your products compatible with millions of existing devices.
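The fallback logic described above can be sketched in a few lines of Python. The feature-name strings and the select_features function are hypothetical. In real firmware, the set of peer features would come from your stack's remote-feature query rather than a hard-coded set.

```python
def select_features(peer_features):
    """Pick the best supported option for each capability.

    peer_features is a set of (hypothetical) feature names reported
    by the remote device.
    """
    if "channel_sounding" in peer_features:
        proximity = "channel_sounding"
    else:
        proximity = "rssi"  # coarser, but widely available

    if "le_audio" in peer_features:
        audio = "le_audio"
    else:
        audio = "a2dp"  # classic Bluetooth audio fallback

    return {"proximity": proximity, "audio": audio}


print(select_features({"channel_sounding", "le_audio"}))
print(select_features(set()))  # an older peer gets both fallbacks
```

Keeping the decision in one function means the rest of the application never needs to know which Bluetooth version the peer speaks.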

Testing and Certification

Once your prototype works, you’ll need to qualify it through the Bluetooth SIG Qualification Program. This process ensures your product complies with the spec and interoperates correctly with others. It might sound intimidating, but many vendors offer pre-qualified modules or test reports you can reuse to simplify the paperwork.

For debugging and validation, tools like the Ellisys Bluetooth Analyzer, Frontline BPA 600, or Nordic’s nRF Sniffer can capture over-the-air traffic and help verify packet sequences, timing, and encryption states.

Bluetooth development can be frustrating at first, as there are lots of acronyms, layers, and hidden dependencies. But once you start seeing the system as a living conversation between devices, it clicks. The more you experiment with advertising intervals, connection timing, and PHY modes, the more you’ll appreciate how elegant and flexible the stack really is.

If you’ve ever wanted to build something that talks wirelessly and runs for months on a battery, this is your moment. The ecosystem has matured, the tools are ready, and the possibilities keep expanding.

Challenges & Trade-Offs

It’s tempting to think of Bluetooth 6.0 as flawless. After all, it’s faster, more efficient, and more scalable. But like every engineering advancement, it comes with trade-offs. Real deployments reveal quirks that the spec sheets don’t mention, and knowing these early can save hours of debugging (and a few late-night rants).

Adoption Lag

Every new Bluetooth spec sounds exciting on paper until you realize the hardware for it isn’t widely available yet. Controller vendors take time to integrate the latest features, and phone or OS support can lag by a year or two. You might find yourself reading about Channel Sounding or PAwR in the core spec, only to discover that your development kit still marks them as “experimental.”

This is normal. The Bluetooth SIG’s release cadence moves faster than the hardware ecosystem can follow. The best strategy is to design firmware that detects capabilities dynamically. Build your code to gracefully fall back to 5.0 or 5.2 modes if 6.0 features are missing. That way your product ships today, but it’s ready for the future.

Environmental Interference

Bluetooth still lives in the 2.4 GHz band, the same noisy neighborhood as Wi-Fi, microwaves, and countless IoT gadgets. In factories or dense apartments, you’ll see interference spikes that cause packet loss or delay. Even with adaptive frequency hopping, performance can dip if too many radios are talking at once.

Developers need to test in real environments, not just in quiet labs. Use spectrum analyzers or sniffers to visualize congestion. Adjust transmit power, advertisement intervals, or even antenna orientation to mitigate problems. Remember, radio design is part science, part art. Sometimes moving a board trace by a centimeter makes more difference than rewriting code.

Power Versus Performance

Every Bluetooth generation tries to squeeze more precision and range out of roughly the same battery. Channel Sounding and high-speed PHY modes improve accuracy and throughput, but they also increase radio-on time and CPU load. You gain features but spend more energy to get them.

There’s no universal setting that fits all products. A hearing aid might value low latency over battery life, while a temperature sensor prioritizes sleeping as much as possible. Developers must tune intervals, transmission power, and frame spacing through measurement, not guesswork. The good news is that once you find the sweet spot, Bluetooth tends to be remarkably stable over long periods.

Security Configuration

Modern Bluetooth has excellent built-in security, but only if you use it correctly. Misconfigured advertising, static passkeys, or unrotated identity keys can still leak information. Even encrypted advertising won’t help if your firmware accidentally reuses session data.

The takeaway: don’t assume “secure by default.” Review every pairing and bonding flow, handle key rotation on firmware updates, and wipe old bonds when a user resets the device. The protocol gives you powerful locks, but it’s up to you to actually turn the key.

Software Complexity

The Bluetooth stack is getting heavier. Features like PAwR, Channel Sounding, and Isochronous Audio require new roles, new timing models, and new APIs. Developers who are used to simple GATT servers now have to think about scheduling, synchronization, and PHY coordination. Testing these features on multi-role devices can be especially tricky, since a single controller might handle multiple concurrent roles (central, peripheral, broadcaster, and observer).

If you’re working on an embedded platform, modular firmware design becomes essential. Split radio control, connection management, and application logic into distinct layers. It’s easier to debug timing bugs when your architecture mirrors the Bluetooth stack’s separation of concerns.

Fragmentation

Perhaps the most persistent challenge is fragmentation. Not every OEM implements the same subset of features, and some phones or chipsets may partially support a spec while skipping optional sections. Developers quickly learn that “Bluetooth 6.0” can mean slightly different things depending on the vendor.

The practical fix is to build flexibility into your software. Use feature discovery at runtime, keep your update mechanism ready for OTA patches, and enable configuration flags for new features so you can toggle them per device. Testing across diverse hardware early in the process pays off more than any elegant design decision later.

Mitigation and Mindset

Despite these challenges, none of them are deal-breakers. They’re simply part of building systems that live in the real world. Think modular, plan for gradual rollouts, and make firmware updates painless. Bluetooth’s backward compatibility means your device won’t become obsolete overnight, and your users benefit from improvements as the ecosystem matures.

In short, the trick isn’t avoiding the trade-offs but managing them. When you design with flexibility, Bluetooth 6.0 becomes less of a moving target and more of a living platform that grows alongside your product.

The Road Ahead — Bluetooth 6.1 and Beyond

If Bluetooth 6.0 was about awareness – knowing distance, filtering intelligently, and optimizing communication – then Bluetooth 6.1 is about refinement. It takes what already works and polishes it into something smoother, faster, and a little more elegant. It’s not a revolution, but it’s an important step in Bluetooth’s quiet transformation from a “wireless cable” into a context-aware network fabric for everyday devices.

Small Tweaks, Big Payoffs

Bluetooth 6.1 focuses on tightening the nuts and bolts rather than changing the whole machine. The update improves Channel Sounding accuracy, enhances advertising efficiency, and introduces a few quality-of-life adjustments to make device coordination easier.

That might sound minor, but it matters. Channel Sounding, for example, becomes more reliable when multiple reflections or obstacles exist. In indoor positioning systems like airports, hospitals, or museums, even a five percent improvement in accuracy can reduce false detections by a wide margin. Advertising refinements also make large IoT deployments more predictable, allowing gateways to manage high-density environments with less radio congestion.

In simpler terms: Bluetooth 6.1 is like a firmware tune-up for an already fast car. You may not notice it day to day, but under heavy load, it performs better and wastes less energy.

The Emerging Themes

Beyond the incremental fixes, the Bluetooth community is thinking much bigger. The next few years will likely focus on four major themes: energy harvesting, AI-assisted radio optimization, hybrid positioning, and context-aware security.

1. Energy-Harvesting Bluetooth Devices

We’re starting to see early prototypes of Bluetooth tags and sensors that run entirely on harvested energy – light, heat, or vibration – with no traditional battery. This ties into the push for maintenance-free IoT devices, especially in logistics and environmental sensing. Future specifications will refine ultra-low-duty-cycle communication patterns to support these “powerless” nodes.

2. AI-Driven Radio Management

Imagine a Bluetooth controller that dynamically learns the noise profile of its environment and adjusts its PHY, transmit power, or advertising timing in real time. Instead of a static table of parameters, AI models embedded in the firmware could predict interference and choose the best channel map automatically. It sounds futuristic, but chipmakers are already experimenting with machine learning cores in connectivity modules.

3. Cross-Technology Fusion (Bluetooth + Wi-Fi + UWB)

The border between short-range radios is blurring. Some systems already use Wi-Fi for throughput, Bluetooth for discovery, and UWB for pinpoint accuracy – all orchestrated by a single chipset. The goal isn’t to replace one with another but to fuse them, creating hybrid location frameworks that are more reliable than any single technology. Bluetooth’s Channel Sounding makes it a perfect partner in this mix.

4. Context-Aware Security

Future Bluetooth devices might decide access rights based not just on identity, but on context. For example, your smartwatch could unlock your laptop only if it detects that you’re sitting still and within one meter. That combination of motion, distance, and authentication could drastically reduce spoofing or relay attacks.

The Quiet Backbone of Connectivity

What’s fascinating about Bluetooth’s evolution is how quietly it happens. While other technologies make noise about high throughput or low latency, Bluetooth’s progress feels invisible but omnipresent. It doesn’t chase raw speed anymore – it chases relevance. The protocol is learning to sense, adapt, and coordinate, all qualities that make it essential for the next generation of ambient computing.

So while you might not notice Bluetooth 6.1 when it arrives, you’ll definitely feel its effects. Devices will sync faster, connections will drop less, audio will sound cleaner, and proximity-based features will just “know” what you want them to do. That’s the beauty of mature engineering: when it works so seamlessly that people stop thinking about it altogether.

Conclusion

Bluetooth has come a long way from its early days as a clunky pairing protocol for headsets. It’s now one of the quietest yet most influential technologies shaping how devices around us communicate. The newer generations – 5.4, 6.0, and soon 6.1 – show that Bluetooth’s evolution isn’t about flashy upgrades. It’s about refinement, about making wireless communication more precise, more private, and more power-aware.

At its core, Bluetooth’s story is about context. It’s learning to understand where you are, how far you are from something, and what kind of connection makes sense in that moment. Channel Sounding adds spatial awareness, PAwR makes massive IoT networks practical, LE Audio brings synchronized sound to earbuds, hearing aids, and broadcast systems, and encrypted advertising protects the information flowing through all of it.

For developers, this era of Bluetooth is exciting because it’s full of creative possibilities. You can build smarter sensors, more responsive wearables, or secure access systems that simply know when you’re nearby. The ecosystem is mature enough that you don’t need to be a radio engineer to experiment, but it’s still evolving fast enough to keep pushing boundaries.

The challenge now is not whether Bluetooth can handle the future. It’s how we, as developers and designers, decide to use it. Whether it’s powering ambient computing, healthcare networks, or next-gen audio, the technology is already ready.

So maybe the next time you put on your earbuds or unlock your car, take a moment to appreciate the quiet genius working behind the scenes. Bluetooth is thriving, adapting, and quietly building the connective tissue of our digital lives.

And for those of us who like tinkering with the unseen layers of technology, that’s a future well worth exploring.

How to Write a PHP Script to Calculate the Area of a Triangle

AYUSH MISHRA — Thu, 19 Jun 2025 15:33:06 +0000

In programming, being able to find the area of a triangle is useful for many reasons. It can help you understand logic-building and syntax, and it’s a common programming problem used in school assignments. There are also many real-world applications, such as computer graphics, geometry-based simulations, or construction-related calculations.

In this article, we’ll look at a common problem: we are given the dimensions of a triangle, and our task is to calculate its area. You can calculate the area of a triangle using different formulas, depending on the information you have about the triangle. Here, you’re going to learn how to do it using PHP.

After reading this tutorial:

You will understand the basic logic behind calculating the area of a triangle.
You will know how to write PHP code that calculates the triangle’s area using pre-defined and user-entered values.
You will know how to apply this logic in small projects and assignments.

Prerequisites
Find the Area of a Triangle Using Direct Formulas
Find the Area of a Triangle Using the Base and Height Approach
Find the Area of a Triangle Using Heron's Formula
Find the Area of a Triangle Using Two Sides and Included Angle (Trigonometric Formula)
Conclusion

Prerequisites

You’ll understand this guide more easily if you have some knowledge about a few things:

Basic PHP

You’ll need to know basic PHP syntax to fully understand the problem. If you know how to write a simple echo statement or create a variable in PHP, then you should be good to go.

Local PHP Environment

To run the PHP code successfully, you should have a local PHP development environment, such as XAMPP or WAMP, on your machine. You can also use online PHP editors like PHP Fiddle or OnlineGDB to run a PHP script without any installation.

In this tutorial, we are going to explore three approaches to finding the area of a triangle in PHP, depending on what you know about the triangle.

Base and Height Formula Approach: This approach applies when you know the length of the base and the perpendicular height measured from that base.
Heron’s Formula: This approach is used to calculate the area of a triangle when you know the lengths of all three sides.
Trigonometric Formula Approach: This approach applies when you know the lengths of two sides and the included angle between them.

First, let’s go back to math class and use some direct formulas to find the area.

Find the Area of a Triangle Using Direct Formulas

Example 1:

In this first example, you’re given the input base and height of a triangle. You have to return the area of the triangle. For this example, you’ll use a direct formula to calculate the area of the triangle.

Input:

Base = 5,

Height = 10

You can calculate the area of the triangle using the formula:

$$Area = (Base * Height) / 2$$

So, if you plug in the values you have, you get: (5 * 10) / 2 = 25.

Output:

Area = 25

Example 2:

In this second example, you’re given the lengths of two sides of a triangle and the angle between them. You have to return the area of the triangle. In this example, you’ll use another direct formula to calculate the area of the triangle.

Input:

Side A = 7, Side B = 9, Angle between them = 60°

In this case, you’ll use the formula:

$$Area = (1/2) * A * B * sin(Angle)$$

Then substitute in the values you’ve been given: (1/2) * 7 * 9 * sin(60°) = 31.5 * 0.8660 ≈ 27.28.

Output:

Area = 27.28 (approximately)

Now let’s look at some different approaches to finding the area of a triangle using PHP.

Find the Area of a Triangle Using the Base and Height Approach

This is the simplest and most direct approach for calculating the area of a triangle when you know the base and height. In this approach, you’ll directly put values in the formula and find the area of the triangle – but you’ll do it with PHP code.

First, define the base and height of the triangle. Then apply the formula for the area of the triangle. As we saw above, the formula for the area of a triangle is:

$$Area = (Base * Height) / 2$$

After calculating the area of the triangle, output the answer.

Alright, so here’s how we can implement that in PHP:


<?php
// Define the base and height
$base = 5;
$height = 10;

// Calculate the area
$area = ($base * $height) / 2;

// Output the result
echo "The area of the triangle is: " . $area . " square units.";
?>

Output:

The area of the triangle is: 25 square units.

In the above code, we first store the base and height of the triangle in two variables. Then we plug those values into the area formula. PHP calculates the area of the triangle and displays the answer.

Time Complexity: In the above approach, we are using the direct formula to calculate and return the area of the triangle, so the time complexity will be constant at O(1). The constant time complexity is efficient as it will remain constant, regardless of the size or values of the base and height.

Space Complexity: The Space Complexity will be O(1). The space used by the above program is constant, which ensures minimal use of memory. This space complexity is ideal in environments where memory efficiency is a priority.

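We use the above approach when we know the length of the base and the perpendicular height of the triangle, whether they are given directly or are easy to measure. The formula works for any triangle, and it is especially convenient for right-angled triangles, where the two perpendicular sides serve as the base and height.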

Find the Area of a Triangle Using Heron's Formula

Heron’s formula is named after a Greek mathematician named Heron of Alexandria. Heron’s formula is useful when you know the lengths of all three sides of the triangle and you want to calculate the area without needing the height. This formula works for any type of triangle, including scalene triangles (triangles with sides of all different lengths).

Here’s Heron’s formula to calculate the area of a triangle:

$$Area = \sqrt{s(s-a)(s-b)(s-c)}$$

Where:

s = (a + b + c) / 2 is the semi-perimeter of the triangle.
a, b, and c are the lengths of the sides.

First, we define the three sides of the triangle. Then we check all three conditions of the Triangle Inequality Theorem, which states that the sum of any two sides of a triangle must be greater than the third side. If all three conditions hold, the given sides can form a valid triangle.

We can calculate the semi-perimeter of the triangle using the formula s = (a + b + c) / 2. Then we can apply Heron's formula to calculate the area. After calculating the area, we output the answer.

Here’s how you can implement this in PHP:


<?php
// Define the sides of the triangle
$a = 7;
$b = 9;
$c = 10;

// Check if the sides form a valid triangle using the Triangle Inequality Theorem
if (($a + $b > $c) && ($a + $c > $b) && ($b + $c > $a)) {

    // Calculate the semi-perimeter
    $s = ($a + $b + $c) / 2;

    // Calculate the area using Heron's formula
    $area = sqrt($s * ($s - $a) * ($s - $b) * ($s - $c));

    // Output the result
    echo "The area of the triangle is: " . $area . " square units.";

} else {
    // If the sides can't form a valid triangle
    echo "The given sides do not form a valid triangle.";
}
?>

Output (value rounded for readability):

The area of the triangle is: 30.594 square units.

In the above code, we first create three variables to store the lengths of the triangle’s sides, and check whether the given sides form a valid triangle using the Triangle Inequality Theorem. Then we calculate the semi-perimeter using the formula s = (a + b + c) / 2. We put the semi-perimeter and the lengths of all three sides into Heron’s formula to calculate the area, and then print the result.
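For the sides used in this example (a = 7, b = 9, c = 10), the calculation can be checked by hand:

$$s = (7 + 9 + 10) / 2 = 13$$

$$Area = \sqrt{13 \times (13 - 7) \times (13 - 9) \times (13 - 10)} = \sqrt{936} \approx 30.594$$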

Time Complexity: There is a total fixed number of operations such as addition, subtraction, multiplication, and square root. These operations don’t depend on input size as they are performed only a fixed number of times. This means that the time complexity is constant O(1).

Space Complexity: We have used a fixed number of variables to calculate the area of the triangle. We have not used any additional data structures such as arrays or objects. The memory usage in the program is constant, which is better for low-memory environments. The space complexity is constant O(1).

This approach works best when the lengths of all three sides are given. It is mainly used for scalene or isosceles triangles where the height is not given directly, but it works for any type of triangle: scalene, isosceles, or equilateral.

Find the Area of a Triangle Using Two Sides and Included Angle (Trigonometric Formula)

In this approach, we will see a different variation of the problem. When you know two sides of a triangle and the included angle between them, you can calculate the area using this formula:

$$Area = 1/2 × a × b × sin(θ)$$

Where:

a and b are the lengths of the two sides.
θ is the included angle between the two sides, measured in degrees or radians.

Using the above formula, you can calculate the area of a triangle without needing its height. First, you define the two sides of the triangle and the angle between them. Then you convert the angle from degrees to radians if needed (in PHP, you can use deg2rad() to convert degrees to radians). Then you apply the formula.

After calculating the area of the triangle, output the result.

Here’s how to implement this in PHP:


<?php
// Define the two sides and the included angle
$a = 7;
$b = 9;
$angle = 60; // Angle in degrees

// Convert the angle from degrees to radians
$angle_in_radians = deg2rad($angle);

// Calculate the area using the formula
$area = 0.5 * $a * $b * sin($angle_in_radians);

// Output the result
echo "The area of the triangle is: " . $area . " square units.";
?>

Output (value rounded for readability):

The area of the triangle is: 27.28 square units.

Explanation:

In the above case, we’re using the formula:

Area of Triangle = 1/2 × a × b × sin(θ)

And we’re substituting the following values into the formula:

Area = 1/2 × 7 × 9 × sin(60°) ≈ 27.28

In the code, we declared two variables to store the lengths of the two sides of the triangle, and the variable $angle holds the included angle in degrees. We used deg2rad(), a built-in PHP function that converts an angle from degrees to radians, because PHP's sin() expects its argument in radians. Then, we applied the actual formula: Area = 1/2 × 7 × 9 × sin(60°). PHP stores the final answer in the $area variable.

Time Complexity: We are using the direct formula to calculate the area of a triangle when the length of two sides and the angle between them are given. The constant time complexity is O(1).

Space Complexity: Similarly, it does not use any extra data structures. It uses a fixed number of variables to hold the inputs and the result, which is why the space complexity is constant O(1).

This approach is perfect for problems in which two sides and the included angle (the angle between those sides) are known. You can use it when you cannot easily calculate the height of the triangle. It has real-life applications in geometry problems, CAD applications, and physics simulations. The result is as precise as the inputs and floating-point arithmetic allow, and you don’t need the lengths of all three sides.

Conclusion

In this article, you’ve learned how you can calculate the area of a triangle, both manually and using PHP. You have seen different approaches and learned about which one is best given the information you have. First, we discussed the base and height approach, then looked at Heron’s formula, and finally examined how to handle things when two sides and the included angle are given.

Understanding the logic behind each of these approaches helps you choose the right one based on the given data.

And if you'd like to support me and my work directly so I can keep creating these tutorials, you can do so here. Thank you!

The Data Communication and Networking Handbook

valentine Gatwiri — Wed, 18 Jun 2025 18:29:46 +0000

When I was beginning to learn about networks, I didn't know how many things in my daily life depended on them – from texting on WhatsApp to watching YouTube.

I still vividly remember when I learned that computers communicate with one another. It was magic – telepathy, nearly. But there is a systematic, logical process behind the magic: computer networking. And I’m excited to help you discover how computers communicate and why it’s possible.

Essentially, data communication is all about exchanging information between two or more machines. But it's not just a question of sending – it's a matter of sending the right data, to the right machine, in the right format. And that's the brilliance of networking basics.

This handbook will teach you the fundamentals of the language of computers. You'll discover how data is passed from machine to machine, how operations are carried out on information, and how networks – from tiny home arrangements to massive worldwide networks – are constructed and managed.

We’ll start with the absolute basics: what a network is, what the hardware is, and how devices know each other and talk to each other. Next, we’ll examine crucial networking models like the OSI model and the TCP/IP stack, which split communication into layers to make it easier to understand and troubleshoot. You'll learn about IP addresses, DNS, routing, switching, and firewalls, and about security's role in keeping networks safe.

Whether you are a complete beginner starting from the ground up or a seasoned dev looking to solidify your foundation, this handbook will help you connect the dots. When you're finished, you won't only understand how your favorite sites and apps really function behind the scenes – you'll be able to speak networks in your sleep.

Chapter 1: Data and Communication Fundamentals
Chapter 2: Signals — The Language of Communication
Chapter 3: Bandwidth — Understanding How Much We Can Transmit
Chapter 4: Transmission Media — The Highways of Communication
Chapter 5: Network Topologies — How We Structure Our Connections
Chapter 6: The OSI Model — Understanding Layers of Communication
Chapter 7: Protocols and Ports — How Rules and Doors Guide Communication
Chapter 8: IP Addressing and Subnetting — Naming and Organizing the Network
Chapter 9: Routing and Switching — Directing Data on the Network
Chapter 10: Network Infrastructure — Devices, Security, and the Modern Internet

Chapter 1: Data and Communication Fundamentals

This introductory section lays the groundwork for the rest of the handbook. You’ll learn what data communication is, how it's different from "sending a message," and what's required for two computers (or phones, or servers) to exchange information efficiently.

You'll start to feel at home with fundamental ideas, technical terminology, and the machinery behind the scenes that works quietly in the background to make daily technology appear effortless.

By the end, you will be able to:

Explain what data communication is and how it works in real life
Identify the components involved in data communication systems
Differentiate between types of data and how they're represented
Understand different types of data flow (simplex, half duplex, full duplex)
Describe what a computer network is and its main categories (LAN, MAN, WAN)
Understand the importance of protocols and how they enable communication
Recognize the role of standards and standard organizations in making networking universal

Data vs Information

We throw around the word "data" a lot these days – "big data," "data science," "data plans" – but what does it mean?

Data is raw. It's unprocessed, meaningless on its own. Think of numbers on a spreadsheet with no labels.
Information is processed data – it's meaningful and helps us make decisions.

A personal example: I once received a CSV file from my school with hundreds of rows of marks. It looked like chaos – just student IDs and scores. But the moment I matched those IDs to names and applied the grading criteria, it became useful information about who passed, who failed, and who topped the class.

So, data is the ingredient. Information is the cooked dish.

So, What Exactly is Data Communication?

Imagine you're texting your friend. Your phone sends data to their phone using signals through cables, Wi-Fi, or even satellites. This entire process is called data communication, moving data from one place (you!) to another (your friend).

But it’s not as random as it sounds. It follows a set of agreed rules called protocols. Think of them as social etiquette for devices – how to talk, when to talk, and what to say.

This process involves:

Devices (sender and receiver)
A transmission medium (like cables or wireless)
A set of rules (protocols)

Let’s break it down further.

Characteristics of Data Communication

To be considered effective, data communication must exhibit the following characteristics:

Delivery: Data must reach the correct destination. If I send a message to John, it shouldn't land in Sarah's inbox.
Accuracy: No one wants a corrupted file. Data must be accurate, free from errors.
Timeliness: Some data, like live video, must arrive on time. Lag ruins the experience.
Jitter: Inconsistent arrival times of data packets (especially in audio/video) create disruption. A good system keeps jitter low.

I once experienced a video call where the sound lagged by 5 seconds. It turned into a game of "Guess what I said." That's poor timeliness in action, and when the lag keeps changing from one moment to the next, that's jitter.
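To see the difference between steady delay and jitter, here is a small Python sketch that measures how much the gap between packet arrivals varies. The arrival times are invented for illustration, and this simple average is only one way to quantify jitter, not the definition used by any particular protocol.

```python
from statistics import mean


def inter_arrival_gaps(arrival_ms):
    """Time between consecutive packet arrivals, in milliseconds."""
    return [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]


def mean_jitter_ms(arrival_ms):
    """Average absolute change between consecutive inter-arrival gaps."""
    gaps = inter_arrival_gaps(arrival_ms)
    changes = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return mean(changes) if changes else 0.0


steady = [0, 20, 40, 60, 80]   # a packet every 20 ms
bursty = [0, 20, 55, 60, 100]  # same packet count, uneven spacing

print(inter_arrival_gaps(bursty))  # [20, 35, 5, 40]
print(mean_jitter_ms(steady))      # no variation between gaps
print(mean_jitter_ms(bursty))      # noticeably higher
```

Both streams deliver five packets, but only the second one would make audio or video stutter.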

Meet the Cast: The Components of Data Communication

In every data conversation, five key players show up:

Sender – The device that starts the chat (like your phone).
Receiver – The one getting the message (your friend’s phone).
Message – The actual info, whether it’s "hi" or a TikTok.
Transmission Medium – The path your message travels (Wi-Fi, cables, and so on).
Protocol – The language they agree to speak (like TCP/IP).

Pretty cool, right?

Data Representation

Computers are not humans. They don’t understand language, pictures, or music – unless these are converted into a format they can process: bits (0s and 1s).

Let’s walk through the different types of data representation:

1. Text

Text is stored as a sequence of characters using encoding schemes like ASCII and Unicode. For example, the letter "A" in ASCII is 65, which in binary is 01000001.
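To see this for yourself, here is a minimal sketch in Python. The helper name `text_to_bits` is my own invention for illustration, and I'm assuming UTF-8 as the encoding (it matches ASCII for plain English letters).

```python
# Convert text into the bit patterns a computer actually stores.
# `text_to_bits` is a hypothetical helper name, not a standard function.
def text_to_bits(text: str, encoding: str = "utf-8") -> list[str]:
    """Return one 8-bit binary string per encoded byte."""
    return [format(byte, "08b") for byte in text.encode(encoding)]

print(ord("A"))           # 65
print(text_to_bits("A"))  # ['01000001']
print(text_to_bits("Hi")) # one 8-bit pattern per letter
```

Characters outside the ASCII range take more than one byte in UTF-8, which is why Unicode text can be larger than it looks.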

2. Numbers

Similarly, numeric data is stored as bit patterns. Computers can perform calculations using binary logic.

3. Images

An image is a matrix of pixels. Each pixel is represented by bits. A black-and-white image might only need 1 bit per pixel, while a full-color photo could use 24 bits per pixel or more.

Example: A 10x10 black and white image = 100 pixels = 100 bits.
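The same arithmetic works for any image. Here's a small sketch (the function name `image_size_bits` is just something I made up for this example) that ignores compression and file headers:

```python
# Raw (uncompressed) image size: width x height x bits per pixel.
# `image_size_bits` is a hypothetical helper for illustration only.
def image_size_bits(width: int, height: int, bits_per_pixel: int) -> int:
    return width * height * bits_per_pixel

print(image_size_bits(10, 10, 1))  # 100

# A full-HD frame at 24 bits per pixel, converted to megabytes.
full_hd_bits = image_size_bits(1920, 1080, 24)
print(full_hd_bits / 8 / 1_000_000, "MB")
```

Real image files are usually much smaller than this because formats like JPEG and PNG compress the pixel data.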

4. Audio

Audio is analog, but we digitize it for storage and transmission. For instance, voice notes are sampled at certain intervals and stored as bits.

5. Video

Video is a sequence of images (frames) along with synchronized audio. It’s high in data volume and needs compression to be practical, typically a codec such as H.264 stored in a container format like MP4.

How Does the Data Flow?

You might think data just zips across in one go – but it has modes, just like moods:

Simplex: One-way only (like a radio broadcast).
Half Duplex: You take turns – like walkie-talkies.
Full Duplex: Both sides talk at once – think phone calls.

Each has its own vibe depending on the situation.

What is a Computer Network?

A computer network is a system that allows devices to share data. These connected devices (nodes) use communication links to interact.

The main goals of a network are:

Reliability: Data should get there.
Security: Unwanted access should be blocked.
Performance: High speed, low delay.

When you connect your laptop at a café, for example, you’re part of a network. But networks come in all shapes:

PAN (Personal Area Network): Tiny – connects electronic devices within a user's immediate area.

LAN (Local Area Network): Small – like your home Wi-Fi.

MAN (Metropolitan Area Network): Covers a city – like a network linking a university's campuses across town.

WAN (Wide Area Network): Huge – think the entire internet!

The internet isn’t one big net – it’s a net of many, many nets.

What is a Protocol?

A protocol is a set of rules that devices follow to communicate. Without a protocol, it’s chaos.

Analogy: Think of a group project. If everyone agrees to use Google Docs and write in English (or any one language), it works. But if one person uses Word in French, and another emails a PDF in Mandarin, you have a mess.

Protocols define:

What data to send
How to send it
When to send it

Elements of a Protocol

Syntax: Format and structure (like grammar).
Semantics: Meaning of each section.
Timing: When to send and at what speed.
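To make syntax and semantics concrete, here's a toy protocol sketched in Python. Everything about it is an assumption for illustration: the 3-byte header layout, the `MSG_TEXT` type code, and the function names are mine, not part of any real protocol. Timing (when to send and how fast) isn't modeled here.

```python
import struct

# Syntax: every frame starts with a 3-byte header in network byte order:
# a 1-byte message type followed by a 2-byte payload length.
HEADER = struct.Struct("!BH")

# Semantics: what each type code means (hypothetical value).
MSG_TEXT = 1

def encode(msg_type: int, payload: bytes) -> bytes:
    """Build a frame: header followed by the payload."""
    return HEADER.pack(msg_type, len(payload)) + payload

def decode(frame: bytes) -> tuple[int, bytes]:
    """Split a frame back into its type and payload."""
    msg_type, length = HEADER.unpack(frame[:HEADER.size])
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("frame is truncated")
    return msg_type, payload

frame = encode(MSG_TEXT, b"hi")
print(frame.hex())    # 0100026869
print(decode(frame))  # (1, b'hi')
```

If the sender and receiver disagreed on the header layout (the syntax) or on what type 1 means (the semantics), the bytes would still arrive, but the message would be nonsense.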

Standards in Networking

Standards are agreements to ensure that different systems can work together. Without standards, each manufacturer would create isolated networks that couldn’t talk to others.

There are two types of standards:

De facto: By convention (used commonly but not formally approved)
De jure: By law (formally approved)

Standards Organizations

There are a few key organizations that help define these standards:

ISO – International Organization for Standardization
ITU-T – International Telecommunication Union, Telecommunication Standardization Sector
IEEE – Institute of Electrical and Electronics Engineers
ANSI – American National Standards Institute
EIA – Electronic Industries Association

Chapter 2: Signals — The Language of Communication

In this chapter, I’ll teach you about the invisible messengers – signals – that make it all possible. You will:

Understand what signals are and how they carry data
Distinguish between analog and digital signals, and when each is used
Learn about key signal characteristics like amplitude, frequency, phase, and wavelength
Visualize and compare time domain vs frequency domain representations
Appreciate how real-world signals are composed of multiple waves (composite signals)
Understand digital signal features like bit rate, baud rate, and bit interval
Learn about baseband vs broadband transmission methods
Identify challenges like attenuation, distortion, and noise
Grasp how bandwidth affects data quality and speed

When I was a teenager, I often wondered how my voice traveled through a phone and reached someone else in another town. I imagined tiny versions of myself running through wires with a message in hand. Turns out, while not exactly accurate, the idea of something carrying your message is spot on. That something is called a signal.

A signal is the form data takes to move through physical space. Whether it’s your mom calling you, your professor sending an email, or your friend uploading a reel – all of that happens through signals.

Data and Signals

What is a Signal?

I learned that data is like the message I wanted to send, and a signal is the delivery truck. Without the truck, the message goes nowhere.

Here’s where things get a bit science-y, but stay with me. When data travels, it becomes signals, kind of like waves. These waves can be classified in two common ways: by the nature of the signal, and by their pattern over time. We’ll talk about the nature of the signal first.

The Nature of the Signal: Analog vs Digital

Analog – A signal that varies smoothly over time and can take any value in a range. Like ocean waves or a human voice, it is continuous.
Digital – A signal that takes only discrete values, usually 0s and 1s. Like a staircase, it moves in clear, sharp steps, either up or down.

Analog Signals

The first time I visualized an analog signal, it looked like the ripples I saw after tossing a stone in water. Gentle curves moving outwards.

Key features of analog signals:

Amplitude: This reminded me of volume. Louder signals have taller waves.
Frequency: It’s the beat or rhythm. High frequency = rapid waves = higher pitch.
Period: Time for one full wave cycle. Shorter periods mean higher frequency.
Phase: Two waves can start at different points – just like dancers starting a move a second apart.
Wavelength: The distance one full cycle covers in space. It depends on how fast the wave moves and its frequency (wavelength = speed ÷ frequency).
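These features are all tied together by a few simple formulas. Here's a rough sketch in Python. The function names are my own, and I'm assuming the wave travels at roughly the speed of light (3 × 10^8 m/s), which is a reasonable approximation for radio waves in air.

```python
import math

SPEED_OF_LIGHT = 3e8  # metres per second (approximate, assumed for the example)

def period(frequency_hz: float) -> float:
    """Time for one full cycle, in seconds."""
    return 1 / frequency_hz

def wavelength(speed_mps: float, frequency_hz: float) -> float:
    """Distance one cycle covers in space, in metres."""
    return speed_mps / frequency_hz

def sine_value(amplitude: float, frequency_hz: float,
               phase_rad: float, t: float) -> float:
    """Value of a sine wave at time t (seconds)."""
    return amplitude * math.sin(2 * math.pi * frequency_hz * t + phase_rad)

print(period(50))                         # 50 Hz -> 0.02 s per cycle
print(wavelength(SPEED_OF_LIGHT, 100e6))  # 100 MHz radio wave -> about 3 m
print(sine_value(2.0, 1.0, math.pi / 2, 0.0))  # phase shift puts the wave at its peak at t = 0
```

Notice how the phase argument simply shifts where the wave starts, just like the dancers starting a moment apart.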

Time vs. Frequency Domain

Time Domain: Shows how signals change over time. Like watching a song’s audio waveform.
Frequency Domain: Shows the ingredients – how much bass, how much treble. It’s like the EQ settings on a music player.

Composite Signals and Fourier

Real-world signals are messy, made of multiple waves mixed. Fourier’s big idea was: Any messy signal can be broken down into simple sine waves. That insight changed how engineers understand and clean up signals.
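Here's a small sketch of Fourier's idea in Python. I mix two sine waves into one composite signal and then use a deliberately naive discrete Fourier transform to recover the ingredients. The sample rate, the two frequencies, and the function name `dft_magnitudes` are all choices I made for the example. Real tools use a much faster algorithm (the FFT).

```python
import cmath
import math

def dft_magnitudes(samples: list[float]) -> list[float]:
    """Naive DFT: amplitude of each frequency bin up to half the sample count."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(2 * abs(total) / n)
    return magnitudes

SAMPLE_RATE = 64  # samples per second; one second of signal, so bin k means k Hz

# Composite signal: a strong 3 Hz wave plus a weaker 10 Hz wave.
signal = [1.0 * math.sin(2 * math.pi * 3 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 10 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]

mags = dft_magnitudes(signal)
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(strongest))  # [3, 10]
```

The time-domain list `signal` looks messy, but the frequency-domain list `mags` has just two clear spikes: the two sine waves we started with.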

Digital Signals

Digital signals felt familiar to me. My laptop, my phone, even my microwave speak digital.

Key features of digital signals:

Bit Interval: One bit’s duration. Like how long I hold down a piano key.
Bit Rate: How many notes (bits) I can play per second.
Baud Rate: How often the signal actually changes. Not always the same as bit rate.
Levels: 2-level = 1s and 0s. More levels = more complex encoding.
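The link between these features is easy to compute. With L signal levels, each signal change (symbol) carries log2(L) bits, so bit rate = baud rate × log2(L). Here's a quick sketch (the function names are mine, chosen for illustration):

```python
import math

def bit_rate(baud_rate: float, levels: int) -> float:
    """Bits per second, given symbols per second and number of signal levels."""
    return baud_rate * math.log2(levels)

def bit_interval(bit_rate_bps: float) -> float:
    """Duration of a single bit, in seconds."""
    return 1 / bit_rate_bps

print(bit_rate(1000, 2))   # 2 levels: 1 bit per symbol -> 1000.0 bps
print(bit_rate(1000, 4))   # 4 levels: 2 bits per symbol -> 2000.0 bps
print(bit_interval(1000))  # 0.001 s per bit
```

So with only two levels, bit rate and baud rate are the same. Add more levels, and the bit rate pulls ahead.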

Square Waves

If analog signals are elegant curves, digital signals are sharp edges. A square wave is a bold, binary shout: ON-OFF-ON-OFF.

Digital Advantages and Struggles

Why I love them:

They’re clean and easy to work with.
Errors are easier to spot and fix.

But they’re not perfect:

They need more bandwidth.
They don’t travel well over long distances without help.

Pattern Over Time: Periodic vs Non-periodic Signals

Periodic Signals: Repeat at regular intervals over time (for example, sine waves, clock pulses).
Non-periodic Signals: Do not repeat – more random or unique (for example, a burst of data or speech waveform).

Periodic Signals

These feel like the rhythm of my favorite song. They’re predictable. Repeating. Reliable.

Key Features

Repetition: The same pattern, again and again. Like waves hitting the shore at steady intervals.
Cycle: One complete shape of the signal. Think of it as one heartbeat in a steady pulse.
Frequency: How many cycles per second? Measured in Hertz (Hz).

Why I like them

Easy to analyze – like having a beat to follow.
Great for systems that need synchronization, like clock signals in my devices.

But still...

They can’t carry surprise or variety. No space for one-time messages.

Non-periodic Signals

These are the jazz solos of the signal world. Wild. Unique. Unpredictable.

Key Features

No repetition: Each part is different – like my playlist on shuffle.
Spikes and silence: Sudden changes, long pauses. Perfect for one-off data transmissions.
Used in real-life data: Emails, videos, and downloads all love this format.

Why they’re cool

Great for representing actual information – each burst means something new.
More flexible for transmitting complex messages.

What’s tricky

Harder to analyze and predict.
Tougher to filter or compress efficiently.

Understanding signals helps us know how fast and cleanly information travels.

Channels: The Roads Signals Travel On

In the context of signals and communication, channels refer to the medium or path through which a signal travels from a sender (transmitter) to a receiver. Channels are like roads. You can’t just send a truck (signal) without knowing if the road (channel) allows it.

We can describe channels in different ways:

Physically: What the signal travels through (like a wire or air).
Functionally: How the signal is allowed to move through (based on frequency).
Logically: How we organize multiple data streams within the same physical path.

Physical Channels = The Road Itself

These are the real, tangible paths for signals:

Example	Medium
Ethernet cable	Copper wire
Fiber-optic link	Glass strand
Wi-Fi or Radio	Air (wireless)
Satellite transmission	Space (electromagnetic waves)

Frequency Behavior of Physical Channels

Just like roads are built for certain speeds, physical channels are better at carrying certain frequencies.

Here’s where low-pass, high-pass, band-pass, and band-stop come in – they describe how a physical channel behaves.

Channel Type	Behavior	Analogy	Common Use
Low-pass	Lets low frequencies pass	Quiet country road (slow cars only)	Telephone lines (voice)
Band-pass	Allows a specific frequency band	Toll road with speed range	FM radio, Wi-Fi
High-pass	Blocks low, passes high frequencies	Speedway (fast cars only)	Audio filtering
Band-stop	Blocks a range but passes others	Road under construction	Noise removal (for example, hum filter)

So when we say "low-pass channel," we're talking about how a physical channel filters signals.

Logical Channels = Lanes on the Road

A logical channel is a virtual path created within a physical one. It organizes or splits the signal flow so multiple people or devices can use the same channel without crashing into each other.

Feature	Description	Analogy
Frequency Division	Each user gets their own frequency	FM radio stations
Time Division	Each user gets a time slot	Taking turns at a speaking table
Virtual Circuits	Custom paths inside networks	Reserved bus seats

So yes – you can have many logical channels on one physical cable.
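Time division is the easiest one to sketch in code. In this simplified Python example, three senders take turns placing one item on a shared line, and an idle marker fills the slot when a sender has nothing left. The function names and the `-` idle marker are my own assumptions, not taken from any real multiplexing standard.

```python
from itertools import zip_longest

def tdm_multiplex(streams: list[list[str]], idle: str = "-") -> list[str]:
    """Interleave streams: each frame holds one slot per stream, in order."""
    frames = zip_longest(*streams, fillvalue=idle)
    return [slot for frame in frames for slot in frame]

def tdm_demultiplex(line: list[str], n_streams: int,
                    idle: str = "-") -> list[list[str]]:
    """Recover each stream by reading every n-th slot and dropping idle slots."""
    return [[slot for slot in line[i::n_streams] if slot != idle]
            for i in range(n_streams)]

streams = [list("AAA"), list("BB"), list("CCCC")]
line = tdm_multiplex(streams)
print("".join(line))             # ABCABCA-C--C
print(tdm_demultiplex(line, 3))  # the three original streams
```

The idle slots show the main cost of simple time division: a sender's slot is reserved even when it has nothing to say.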

How They Work Together

Let’s combine it all:

Imagine a fiber optic cable (physical channel) that’s designed to carry a specific frequency range (band-pass).
Within that frequency range, you can create many logical channels using time or frequency division.

Example: FM Radio

Physical Channel: Air (radio waves)
Type: Band-pass (88–108 MHz)
Logical Channels: Each station (for example, 98.4 FM) is a logical channel inside that band

Example: Internet over DSL

Physical Channel: Telephone line (copper wire)
Type: Low-pass for voice, high-pass for internet
Logical Channels: Browsing, streaming, and downloads running together via time/frequency division

Baseband vs Broadband Transmission: How We Use the Channel

There are two main ways to use the channel: baseband and broadband transmission.

Baseband Transmission is like talking directly to someone across a quiet room. Simple and unaltered. Common in local systems like Ethernet.

Broadband Transmission is a bit different. Here, we dress up the digital message in analog clothing using modulation. That’s how we send data over radio or fiber. It’s more complex, but necessary when you’re dealing with wider, noisier roads.

Signal Villains: What Goes Wrong on the Way

As your signal travels down the channel, it may face three big problems.

Attenuation: It’s like my voice getting quieter the farther I am from someone. Amplifiers help boost it.
Distortion: Imagine you and I agree to send square waves, but by the time it reaches you, it looks like mush. That’s distortion, especially bad over long cables.
Noise: Noise is anything extra that wasn’t supposed to be in the signal. From lightning strikes to microwaves, interference is real.

Types I learned about:

Thermal (heat-related)
Induced (nearby equipment)
Crosstalk (adjacent wires “talking”)
Impulse (sudden bursts)

We can reduce noise using better cables, filters, and digital corrections.
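Engineers usually measure attenuation and noise in decibels (dB), a logarithmic scale that compares two power levels. Here's a minimal sketch, with function names I picked for the example:

```python
import math

def power_change_db(power_in_watts: float, power_out_watts: float) -> float:
    """Negative result = attenuation (loss), positive = amplification (gain)."""
    return 10 * math.log10(power_out_watts / power_in_watts)

def snr_db(signal_power_watts: float, noise_power_watts: float) -> float:
    """Signal-to-noise ratio in decibels. Higher means a cleaner signal."""
    return 10 * math.log10(signal_power_watts / noise_power_watts)

# A signal that loses half its power along the cable: roughly -3 dB.
print(round(power_change_db(10, 5), 2))

# A signal 1000 times stronger than the noise: roughly 30 dB.
print(round(snr_db(1.0, 0.001), 2))
```

Because decibels are logarithmic, the losses and gains along a path can simply be added together instead of multiplied.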

Bandwidth

The word ‘bandwidth’ gets thrown around so much. For me, it used to just mean internet speed. But it’s deeper:

Analog Bandwidth: Range of frequencies a signal uses.
Digital Bandwidth: How much data we can push through per second.

More bandwidth = more room = faster, clearer communication.

We’ll talk more about bandwidth in the next chapter.

Learning about signals was like being handed the key to a secret code. Every beep, flash, and wave in our world is part of a language. Once you see it, you can’t unsee it. Signals are not just theory – they are the reason I can write this on a laptop, send it to the cloud, and have you read it anywhere in the world.

Chapter 3: Bandwidth — Understanding How Much We Can Transmit

When I first heard the term "bandwidth," I assumed it just meant how fast my internet was. And while that’s not entirely wrong, I came to learn there’s much more to it.

In this chapter, we’ll delve into the concept of bandwidth as the capacity of a communication path, examine its impact on signal quality and speed, and investigate how it's measured in both analog and digital systems.

By the end of this chapter, you will be able to explain:

What bandwidth means in different contexts
How analog and digital bandwidths are measured
The concept of throughput and how it differs from bandwidth
Factors that affect data transmission performance

What Bandwidth is All About

Bandwidth is the maximum amount of data that can be transmitted over a communication channel in a given amount of time.

Have you ever streamed a movie and it kept buffering? That frustrating lag led me to one of the most important concepts in networking: bandwidth. Bandwidth is like a highway. The wider the road, the more cars (or data) can pass at once.

I also like to think of it this way: If I’m trying to pour water (data) through a pipe (the communication channel), a narrow pipe limits how much water can flow through at a time. That’s low bandwidth. A wide pipe? Now we’re talking high bandwidth – fast and smooth.

Bandwidth Utilization

Efficiency

This is how well we use the available bandwidth. High efficiency means most of the bandwidth is being used for actual data (not overhead).

Overhead

Overhead includes headers, acknowledgments, and error-checking codes. It’s necessary, but it eats into our available bandwidth.

Idle Time

Sometimes the channel sits unused, due to waiting for acknowledgment, processing time, and so on. Minimizing idle time improves efficiency.

Bandwidth in Analog and Digital Terms

Analog Bandwidth

Analog bandwidth refers to the range of frequencies over which an analog signal can be accurately acquired, processed, or transmitted by a system. Beyond this range, the signal begins to degrade – either being attenuated or distorted, making it unreliable for precise use.

Key Concepts

Frequency Range: Analog bandwidth defines the spectrum of frequencies that a system can handle without significant degradation. It’s the system’s “comfort zone” for signal fidelity.
3 dB Bandwidth: One common method of defining analog bandwidth is the -3 dB point. At this point, the signal’s amplitude drops to about 70.7% of its original value, meaning about half its power is lost. Frequencies beyond this threshold experience much more signal loss or distortion.
Importance in Signal Fidelity: Analog bandwidth directly affects how well a system can reproduce or process real-world signals – especially in audio, video, instrumentation, and telecommunications. A narrow bandwidth results in muffled or distorted outputs, while a wider bandwidth ensures better detail and accuracy.
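If the 70.7% figure looks arbitrary, a few lines of Python show where it comes from. Power is proportional to amplitude squared, so an amplitude ratio of 1/√2 means exactly half the power. The function names below are mine, for illustration only.

```python
import math

def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert an amplitude (voltage) ratio to decibels."""
    return 20 * math.log10(ratio)

def power_ratio_from_amplitude(ratio: float) -> float:
    """Power scales with the square of amplitude."""
    return ratio ** 2

half_power_amplitude = 1 / math.sqrt(2)

print(round(half_power_amplitude, 3))                             # 0.707
print(round(power_ratio_from_amplitude(half_power_amplitude), 3)) # 0.5
print(round(amplitude_ratio_to_db(half_power_amplitude), 2))      # -3.01
```

That's why the -3 dB point is also called the half-power point.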

Bandwidth and Rise Time

In instruments like oscilloscopes, analog bandwidth is closely related to rise time – the time it takes for a signal to transition from low to high. A wider bandwidth enables faster transitions to be captured accurately, which is essential for analyzing high-speed or fast-changing signals.

Real-Life Example

Consider old telephone systems: they typically had an analog bandwidth ranging from 300 Hz to 3300 Hz, resulting in a 3000 Hz bandwidth. This range was enough for clear voice transmission, but not wide enough for high-fidelity music or modern audio standards.

Applications of Analog Bandwidth

Application Area	Role of Analog Bandwidth
Oscilloscopes	Determines how accurately signals (especially fast ones) are captured.
Amplifiers	Specifies which frequency ranges can be amplified without distortion.
Communication Systems	Defines signal capacity and transmission quality.
Data Acquisition	Affects how well fast-changing signals are measured and analyzed.

Digital Bandwidth

Digital bandwidth refers to the maximum capacity of a digital channel to transmit data over a specific period, usually measured in bits per second (bps). It’s a measure of how much data can “flow” through a communication path, much like how the width of a pipe controls how much water can pass through.

The wider the digital bandwidth, the more data can be transmitted simultaneously, resulting in faster downloads, smoother video streams, and better overall network performance.

Bandwidth vs. Data Rate

Although they’re often used interchangeably, they aren’t quite the same:

Bandwidth is the capacity of the channel – the maximum potential.
Data rate is the actual speed at which data is transmitted, which can vary based on factors like:
- Network congestion
- Hardware limitations
- Signal interference

Think of bandwidth as the size of a highway, and data rate as how fast cars are moving on it.

How Digital Bandwidth is Measured

Digital bandwidth is expressed in units such as:

bps – bits per second
Kbps – thousands of bits per second
Mbps – millions of bits per second
Gbps – billions of bits per second

Example: A 100 Mbps internet connection can, in theory, transfer 100 million bits of data every second.
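One thing that tripped me up early on: connection speeds are quoted in bits, but file sizes are quoted in bytes. Here's a small sketch of the best-case math (the function name is my own, and it ignores overhead, congestion, and everything else that slows real transfers down):

```python
def transfer_time_seconds(file_size_bytes: int, link_bps: float) -> float:
    """Best-case time to send a file: convert bytes to bits, divide by link rate."""
    return file_size_bytes * 8 / link_bps

# A 500 MB file over a 100 Mbps link.
print(transfer_time_seconds(500_000_000, 100_000_000))  # 40.0
```

So a "100 Mbps" plan moves at most about 12.5 megabytes per second, not 100.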

Why It Matters

Bandwidth plays a central role in modern digital life. Without enough bandwidth:

Streaming videos buffer
Video calls drop in quality or disconnect
Online games lag or stutter
Large files download painfully slowly

This becomes even more critical when multiple devices share the same network. Each device draws from the available bandwidth, which can quickly get overwhelmed if the demand is too high.

Digital vs. Analog Bandwidth

Aspect	Digital Bandwidth	Analog Bandwidth
Measured in	Bits per second (bps, Mbps, Gbps)	Hertz (Hz)
Focus	Data transmission rate	Frequency range
Example	Internet connection	FM radio signal (for example, 88–108 MHz)

Bandwidth in Shared Networks

In shared environments – like home Wi-Fi or public hotspots – everyone taps into the same bandwidth. If bandwidth is limited and several devices are streaming, gaming, or downloading, the network slows down for everyone.

Throughput – What Gets Delivered

While bandwidth is the potential capacity of a channel (the width of the road), throughput is the actual rate at which data travels end‑to‑end under real‑world conditions. It’s the number of cars that make it through the city per minute, after red lights, speed limits, and detours.

Key factors that influence throughput:

Interference & Noise (analog) or packet collisions (digital)
Hardware Constraints (CPU, NICs, switches)
Network Congestion (too many users/devices)
Error Retransmissions (when packets get lost or corrupted)

Example: A “100 Mbps” link (bandwidth) might only sustain 80 Mbps of throughput because of TCP overhead, competing traffic, and occasional packet losses.
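You can estimate throughput yourself by timing a transfer. Here's a rough sketch with made-up numbers and function names of my own choosing:

```python
def throughput_bps(payload_bytes: int, elapsed_seconds: float) -> float:
    """Measured throughput: useful bits delivered per second."""
    return payload_bytes * 8 / elapsed_seconds

def link_efficiency(throughput: float, bandwidth: float) -> float:
    """Fraction of the link's rated capacity that was actually achieved."""
    return throughput / bandwidth

# Hypothetical measurement: 100 MB delivered in 10 seconds on a 100 Mbps link.
measured = throughput_bps(100_000_000, 10)
print(measured / 1_000_000, "Mbps")
print(round(link_efficiency(measured, 100_000_000), 2))
```

The gap between the rated bandwidth and the measured figure is where overhead, congestion, and retransmissions live.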

Latency and Delay – The Time Dimension

Latency is the time it takes for a single bit (or packet) to travel from sender to receiver. Think of it as a travel time, whereas bandwidth and throughput are about volume.

Propagation Delay: Time for the signal to move through the medium (for example, light in fiber: ~200,000 km/s).
Transmission Delay: Time to push all the bits of a packet onto the wire:
$\text{Transmission Delay} = \text{Packet Size (bits)} \div \text{Link Bandwidth (bps)}$
Processing Delay: Time routers or switches spend examining headers, making forwarding decisions.
Queuing Delay: Time packets wait in buffers when traffic spikes.
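The four delays simply add up. Here's a sketch in Python using the fiber speed mentioned above (about 200,000 km/s). The function names, the 1,000 km distance, and the 1,500-byte packet are assumptions I picked for the example.

```python
FIBER_SPEED_MPS = 2e8  # roughly 200,000 km/s, expressed in metres per second

def propagation_delay(distance_m: float,
                      speed_mps: float = FIBER_SPEED_MPS) -> float:
    """Time for the signal to travel the length of the medium."""
    return distance_m / speed_mps

def transmission_delay(packet_bits: int, bandwidth_bps: float) -> float:
    """Time to push every bit of the packet onto the wire."""
    return packet_bits / bandwidth_bps

def total_delay(distance_m: float, packet_bits: int, bandwidth_bps: float,
                processing_s: float = 0.0, queuing_s: float = 0.0) -> float:
    """One-way latency as the sum of the four delay components."""
    return (propagation_delay(distance_m)
            + transmission_delay(packet_bits, bandwidth_bps)
            + processing_s + queuing_s)

# A 1,500-byte packet (12,000 bits) over 1,000 km of fiber on a 100 Mbps link.
delay = total_delay(1_000_000, 12_000, 100_000_000)
print(round(delay * 1000, 2), "ms")  # 5.12 ms
```

Notice that over long distances, propagation delay dominates. Buying more bandwidth shrinks only the transmission part.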

Real‑world story: During a long‑distance video call, a few hundred milliseconds of round‑trip latency can feel like talking through molasses – voices overlap, and the conversation feels stilted.

Jitter – Variability in Arrival

Jitter is the inconsistency in packet arrival times. Even if the average latency is low, high jitter disrupts:

Audio/Video Streams: Choppy playback when packets clump or arrive too late.
VoIP Calls: Glitches, echoes, or dropped words.

You can mitigate this with jitter buffers and Quality of Service (QoS) policies, which prioritize real‑time traffic to smooth out the delivery.
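To get a feel for jitter, you can measure how much the gaps between packet arrivals wander from their average. The sketch below uses a deliberately simple definition (mean absolute deviation of the gaps) and made-up arrival times. Real tools, such as those following the RTP specification, use a more elaborate running estimate.

```python
from statistics import mean

def inter_arrival_gaps(arrival_times_ms: list[float]) -> list[float]:
    """Time between each packet and the next one."""
    return [later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:])]

def mean_jitter_ms(arrival_times_ms: list[float]) -> float:
    """Simplified jitter: average deviation of the gaps from their mean."""
    gaps = inter_arrival_gaps(arrival_times_ms)
    average_gap = mean(gaps)
    return mean([abs(gap - average_gap) for gap in gaps])

steady = [0, 20, 40, 60, 80]  # a packet every 20 ms
bursty = [0, 5, 45, 50, 80]   # same average gap, but uneven

print(mean_jitter_ms(steady))  # 0
print(mean_jitter_ms(bursty))  # 15
```

Both streams deliver five packets in 80 ms, so their average rate is identical. Only the second one would sound choppy.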

How to Improve Performance

If I could go back in time and give myself one tip: Performance isn’t just about speed – it’s about reliability and consistency, too.

Here’s what affects performance:

Bandwidth: Think of this as the diameter of your internet pipe – the maximum amount of data that can move through it per second, usually in Mbps or Gbps.

Why it matters: More bandwidth means your connection can handle more data – like downloading big files fast or streaming in 4K. BUT: Just because your connection can go fast doesn't necessarily mean that it always does. That's where throughput comes in.
Throughput: Your actual speed – how much data is really passing through the pipe right now.

Why it matters: Your actual internet experience (web page loading, Netflix streaming, gaming) is throughput-dependent, not bandwidth-dependent. If your throughput is bad, your videos buffer, downloads crawl, and games lag – even when you're signed up for a "fast" plan.
Latency & Jitter: Latency is the lag – how long it takes information to travel from your machine to the server and back (in milliseconds). Jitter is the variation in that lag – how inconsistent the timing gets.

Why they're significant: High latency = frustrating delay in video calls, sluggish online gaming, or keyboard lag in remote desktops. High jitter = choppy audio, frozen faces, or desync'd video in live meetings or streams.
Packet Loss: Sometimes, data just doesn't get to where it’s supposed to go. Packets are tiny chunks of data, and if a few get lost along the way, your device has to ask for them again.

Why it matters: Small levels of packet loss can cause buffering, call drops, or rubberbanding during gaming. Greater loss = subpar performance, stuttery audio, or crashed streams.
Utilization & Overhead: Utilization refers to the fraction of your total bandwidth in use at any one time. Overhead is the extra information needed to manage your connection – like labels on a package.

Why they're important: High utilization is when your connection gets crowded – for example, rush hour. Everything slows down. High overhead absorbs your free bandwidth – less room for what you actually love (video, games, files).

Engineers use techniques like compression, efficient routing, better cabling, and load balancing to improve performance.

I now see bandwidth everywhere – not just in networks, but in life. Our mental bandwidth, emotional bandwidth – it's all about capacity. And just as in life, we need both capacity and consistency to function at our best.

Knowing how these metrics work helped me troubleshoot slow Wi‑Fi, plan file transfers, and appreciate what’s going on behind a simple Google search. It can help you do the same, and build networks that meet real user demands.

Chapter 4: Transmission Media — The Highways of Communication

How does data move across distances? What path does it take?

This chapter dives into the physical and wireless pathways data takes from one device to another – the transmission media. By the end of this chapter, you will understand:

What transmission media is and why it matters
The difference between guided (wired) and unguided (wireless) media
Various types of cables (twisted pair, coaxial, fiber optics)
Wireless media like radio waves, microwaves, and infrared
The strengths and limitations of each medium

What are Transmission Media?

Imagine needing to deliver a letter. Do you send it through a postal truck? Drop it by drone? Deliver it by hand? The method you choose is your transmission medium.

In the digital world, transmission media refers to the path data takes from the sender to the receiver. These paths can be physical (guided), like cables, or wireless (unguided), like airwaves.

When I finally understood that even invisible data needs a “road,” I realized how crucial this topic was to building fast, reliable networks.

Different Types of Transmission Media

Transmission media are classified into two broad categories:

Guided Media (Wired): The data follows a specific path (like a road or railway). Common types include twisted pair, coaxial, and fiber optic cables.
Unguided Media (Wireless): Data floats freely through the atmosphere, like radio signals or Wi-Fi. Types include Radio Waves, Microwaves, and Infrared Waves.

Let’s dive into each of these types of transmission media in a bit more detail.

Guided Transmission Media

1. Twisted Pair Cable

This was the first cable I ever handled – it looked like two wires twisted together. Signals are transmitted as tiny voltage differences between the two copper conductors. Because the wires are twisted, each one picks up nearly the same electromagnetic interference, so when the receiver takes the difference between them, the noise largely cancels out while the signal remains.
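The cancellation trick is easier to believe once you see the numbers. In this idealized Python sketch, both wires pick up exactly the same noise (real cables only get close to that), and the signal and noise values are invented for the example.

```python
def differential_receive(wire_a: list[float], wire_b: list[float]) -> list[float]:
    """The receiver only looks at the difference between the two wires."""
    return [a - b for a, b in zip(wire_a, wire_b)]

signal = [1, -1, 1, 1, -1]            # the data we want to send
noise = [0.4, -0.3, 0.8, 0.1, -0.6]   # interference hitting both wires equally

# Each wire carries half the signal with opposite polarity, plus the same noise.
wire_a = [s / 2 + n for s, n in zip(signal, noise)]
wire_b = [-s / 2 + n for s, n in zip(signal, noise)]

recovered = differential_receive(wire_a, wire_b)
print([round(value, 6) for value in recovered])  # matches the original signal
```

The noise appears identically on both wires, so subtracting one from the other removes it and leaves only the signal.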

Features & Use‑Cases:

Structure: Two insulated copper wires twisted to reduce interference.
Types:
- Unshielded Twisted Pair (UTP): Common in LANs, cheaper but more prone to noise.
- Shielded Twisted Pair (STP): Has shielding for better noise protection.
Usage: Telephones, Ethernet.
Bandwidth: Low to medium.
Distance: Up to 100 meters (for UTP).

2. Coaxial Cable

I remember unscrewing one from the back of our old TV. A single copper core carries the signal; an insulating layer and an outer metal shield form a concentric geometry. The signal propagates as an electromagnetic wave confined between the inner conductor and shield, which also blocks external noise.

Features & Use‑Cases:

Structure: A central copper core, surrounded by insulation, a metal shield, and an outer plastic cover.
Advantages: Better shielding, higher bandwidth than UTP.
Usage: Cable TV, broadband internet.
Distance: Up to several kilometers with amplifiers.

3. Fiber Optic Cable

This one blew my mind – light carrying data! Data is encoded into light pulses (laser or LED) sent down a glass or plastic core. Total internal reflection at the core–cladding interface traps light, allowing it to travel long distances with almost no loss.
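Total internal reflection only happens when light hits the core–cladding boundary at a shallow enough angle. The cutoff is called the critical angle, and it comes from Snell's law: sin(θc) = n_cladding ÷ n_core. Here's a sketch with illustrative refractive indices (1.48 and 1.46 are plausible values I chose for the example, not figures for any specific cable).

```python
import math

def critical_angle_deg(n_core: float, n_cladding: float) -> float:
    """Smallest angle (measured from the normal) at which light stays trapped."""
    if n_cladding >= n_core:
        raise ValueError("cladding index must be lower than core index")
    return math.degrees(math.asin(n_cladding / n_core))

angle = critical_angle_deg(n_core=1.48, n_cladding=1.46)
print(round(angle, 1), "degrees")
```

Light striking the boundary at more than this angle from the normal is reflected back into the core. That's why the cladding must have a slightly lower refractive index than the core.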

Features & Use‑Cases:

Structure: Glass or plastic core surrounded by cladding and a protective sheath.
Types:
- Single-Mode Fiber: For long distances, uses a laser.
- Multi-Mode Fiber: For shorter distances, uses LED.
Advantages:
- Immune to electromagnetic interference
- Higher bandwidth and longer distances
- More secure and reliable
Usage: Backbone of the internet, submarine cables, hospitals.

Unguided Transmission Media

When you connect to Wi-Fi or use Bluetooth, you are relying on unguided media. These don’t need a cable – just air.

There are several different kinds of unguided transmission media. Let’s talk about some of the most common.

1. Radio Waves

How It Works:
Antennas convert electrical signals into electromagnetic waves (and vice versa). Radio frequencies (3 kHz–1 GHz) propagate omnidirectionally (or in broad beams) through the air and can diffract around obstacles.

Pros: Penetrates walls; easy broadcast to many receivers.
Cons: Susceptible to interference and eavesdropping.
Applications: FM/AM radio, cordless phones, and (loosely speaking) Wi‑Fi and Bluetooth, which operate around 2.4 GHz, just above this range.

2. Microwaves

How It Works:
Highly directional beams (1 GHz–300 GHz) generated by parabolic dishes or waveguide antennas. Because they travel in straight lines (line‑of‑sight), they must be carefully aligned between towers or rooftop dishes.

Pros: High data rates; well suited to cellular backhaul and satellite links.
Cons: Rain fade, clear path required, more expensive antennas.
Applications: Mobile networks, satellite TV, point‑to‑point enterprise links.

3. Infrared

How It Works:
LED or laser diodes emit infrared light pulses, which are detected by photodiodes on the receiver. Because IR light cannot pass through walls, it works only within a confined space, along a direct line of sight or inside a reflective “cone.”

Pros: Highly secure (confined to room), no RF interference.
Cons: Very short range; blocked by obstacles; strict alignment.
Applications: TV remotes, short‑range device pairing, some industrial sensors.

Comparison Table

Medium	Speed	Distance	Interference	Cost	Usage
Twisted Pair	Low-Medium	~100m	High	Low	LAN, telephony
Coaxial	Medium	~2km (amplified)	Medium	Medium	Cable TV, broadband
Fiber Optic	Very High	>60km (with repeaters)	Very Low	High	Backbone, high-speed
Radio	Low-Medium	Long (via towers)	High	Low	Wi-Fi, radio, Bluetooth
Microwave	High	Long (LOS)	Medium	High	Mobile, satellites
Infrared	Low	Short	Very Low	Low	Remotes, IR sensors

How to Choose the Right Transmission Medium

When I set up my first home network, I had to think about speed, distance, and cost. That’s what engineers do when designing large networks, too.

Questions to ask yourself or your team:

How far does the data need to travel?
How fast do I need the connection?
Can I afford high-end cables or equipment?
Is the environment prone to interference?

Scenario	Best Medium	Why & How to Decide
Home LAN & Office Ethernet	Cat6 UTP	Affordable, easy to install, handles Gigabit speeds up to 100 m.
No‑Cable Wireless Access	Wi‑Fi (2.4/5 GHz)	Easy coverage of rooms; choose 5 GHz for less interference, higher speed.
Long‑Distance Fiber Backbone	Single‑Mode Fiber	Minimal signal loss over tens of kilometers; vital for ISP backbones.
Campus/Building Interconnect	Multi‑Mode Fiber	Supports 10–100 Gbps across campus; lower cost than single‑mode for short runs.
Point‑to‑Point Enterprise Link	Microwave Link	Rapid deployment between buildings; ensure clear LOS and proper dish alignment.
Industrial/Noisy Environments	Shielded Twisted‑Pair or Fiber	STP resists EMI; fiber is immune but costlier.
Room‑Confined, Secure Control Signals	Infrared	Perfect for IR‑controlled lighting or remote‑only devices in one room.
Broad Wireless Broadcast	Radio Waves	For wide‑area IoT sensors or broadcast audio; simple omnidirectional antennas.

Define Distance & Speed:
- Short run (<100 m) + moderate speed → UTP.
- Long haul → fiber or microwave.
Assess Environment:
- High EMI (factories) → fiber or STP.
- Indoor home/office → UTP or Wi‑Fi.
Consider Mobility:
- Devices moving around → wireless (Wi‑Fi, cellular).
Weigh Cost vs. Performance:
- Budget LAN → UTP
- Critical backbone → fiber
Security Needs:
- Room‑confined control → infrared
- Open campus → directional microwave or encrypted Wi‑Fi
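To make the checklist above concrete, here is a minimal sketch that encodes a few of those rules as a function. The function name, parameters, and the 100 m threshold are illustrative assumptions drawn from the rules of thumb in this chapter, not an engineering standard.

```python
def suggest_medium(distance_m: float, high_emi: bool = False, mobile: bool = False) -> str:
    """Return a rough medium suggestion based on the checklist above (hypothetical helper)."""
    if mobile:
        # Devices that move around need a wireless link.
        return "wireless (Wi-Fi or cellular)"
    if distance_m > 100:
        # Beyond a typical copper run, look at long-haul options.
        return "fiber or microwave"
    if high_emi:
        # Noisy environments call for shielding or light instead of copper.
        return "fiber or STP"
    return "UTP"


print(suggest_medium(30))                 # UTP
print(suggest_medium(5000))               # fiber or microwave
print(suggest_medium(50, high_emi=True))  # fiber or STP
```

A real design would also weigh cost and security, which this sketch leaves out on purpose.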

By matching distance, throughput requirements, environmental constraints, and budget, you can select the transmission medium that delivers optimal real‑world performance, just as engineers do when designing networks that power everything from our smartphones to submarine data cables.

Learning about transmission media made me realize how much effort goes into a simple text message. Whether it’s a copper wire under the road or a beam of light under the ocean, there’s always a path connecting us.

I now see cables and antennas not just as hardware, but as lifelines of human connection. They are the highways of our digital lives.

Chapter 5: Network Topologies — How We Structure Our Connections

The word “topology”, in the context of networking, refers to how devices are arranged and connected. This chapter helps you see that the structure of a network is just as important as the technology it uses.

By the end of this chapter, you will:

Understand what a network topology is and why it matters
Explore different types of physical and logical topologies
Learn the pros and cons of each layout (bus, ring, star, mesh, hybrid)
Recognize how topology affects performance, scalability, and fault tolerance

What is Topology?

If you’ve ever arranged chairs in a room for a meeting, you’ve thought about topology. Should everyone face forward? Sit in a circle? Group up in clusters?

Networking topology is the same idea – it’s about the layout of devices and how they connect. Whether you're designing a small home LAN or a vast corporate network, choosing the right topology affects everything: speed, cost, troubleshooting, and scalability.

Physical vs Logical Topology

Physical Topology

This is what you can see – the actual layout of wires and devices.

Example: You see computers in a classroom connected by cables to a central switch. That’s the physical topology.

Logical Topology

This is how data flows, regardless of how devices are physically connected.

Example: Even if computers are wired to a central hub (star), the hub repeats every frame to all ports, so the data travels as if on a shared bus – this makes it a logical bus topology (more on this below).

It’s like a subway map vs. the actual underground tunnels – one shows the concept, the other shows the reality.

Types of Network Topologies

Let’s go through the main types of network topologies. Each has strengths, weaknesses, and ideal use cases.

Bus Topology

Imagine one long cable – all devices “tap into” it.

In a bus topology, a single backbone cable connects all devices.

Pros:
- Simple and cheap
- Uses less cable
Cons:
- If the backbone fails, the whole network goes down
- Difficult to troubleshoot
- Performance degrades with more devices
Use case: Small temporary networks

Ring Topology

Here, each device connects to exactly two others, forming a circle.

In this case, data travels in one direction, passing through each node.

Pros:
- Easy to install
- Better than bus for managing traffic
Cons:
- Failure in one node can break the ring
- Adding/removing nodes is disruptive
Use case: Token Ring networks (rare today)

Star Topology

This is what I used when setting up a LAN in my home. All devices connect to a central hub or switch.

Pros:
- Easy to install and manage
- Failure of one device doesn’t affect the rest
Cons:
- If the central device fails, everything goes down
- Requires more cable
Use case: Modern Ethernet networks

Mesh Topology

This one fascinated me because of its complexity.

In a full mesh topology, every device is connected to every other device. In a partial mesh, each device connects to several others, but not necessarily all of them.

Pros:
- Redundant paths ensure reliability
- Excellent fault tolerance
Cons:
- Expensive and complex to install
- Requires lots of cabling
Use case: Military, critical systems, backbone networks
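To see why mesh "requires lots of cabling", it helps to count links. The sketch below compares star, ring, and full mesh for the same number of devices. The function name is hypothetical, and it assumes a full mesh (every pair connected) and a star where each device has one cable to the central switch.

```python
def links_needed(topology: str, devices: int) -> int:
    """Count the point-to-point links needed to connect `devices` end devices."""
    if topology in ("star", "ring"):
        # Star: one cable per device to the central switch.
        # Ring: each device links to the next one, closing the loop.
        return devices
    if topology == "full mesh":
        # Every pair of devices gets its own link: n * (n - 1) / 2.
        return devices * (devices - 1) // 2
    raise ValueError(f"unknown topology: {topology}")


for name in ("star", "ring", "full mesh"):
    print(name, links_needed(name, 10))
# star 10
# ring 10
# full mesh 45
```

With 50 devices a full mesh already needs 1,225 links, which is why partial meshes are more common in practice.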

Hybrid Topology

Like a recipe with ingredients from different cuisines.

A hybrid topology works by combining two or more topologies.

Pros:
- Flexible and scalable
- Can be tailored to specific needs
Cons:
- Complex design and management
Use case: Large organizations with diverse requirements

Comparison Table

Topology	Cost	Reliability	Scalability	Complexity	Use Case
Bus	Low	Low	Low	Low	Small LANs
Ring	Medium	Medium	Low	Medium	Outdated systems
Star	Medium	Medium-High	High	Low	Homes, offices
Mesh	High	Very High	Medium	Very High	Data centers, military
Hybrid	High	High	Very High	High	Enterprises

How to Choose the Right Topology

When I built my first network for a class project, I went with a star topology. Why? Because it was easy to set up and troubleshoot, and it matched our desk layout, with all PCs around a central switch. That hands-on experience taught me that the right topology isn’t just about wiring – it’s about reliability, cost, and how people use the network.

Think of it like planning a city:

Where are the busiest hubs?
Do you need alternate routes in case one fails?
Can you maintain all the connections?

Common Network Topologies and When to Use Them

Topology	How It Works	When to Use It	Pros	Cons
Bus	All devices share a single backbone cable	Very small networks, temporary setups, or budget constraints	Cheap, minimal cabling	Hard to troubleshoot, poor scalability, one break = network down
Star	Devices connect to a central hub or switch	Home networks, classrooms, offices	Easy to manage, isolate issues, scalable	Hub is single point of failure
Ring	Each device connects to two others forming a closed loop	Legacy systems or specialized industrial networks	Predictable data flow, fair traffic management	Break in loop can halt the network unless dual ring used
Mesh	Every device connects to multiple others	Critical systems (e.g. military, finance), where uptime is vital	Highly fault-tolerant, redundant paths	Expensive, complex, heavy cabling
Hybrid	Mix of two or more topologies	Large enterprises or campuses	Flexible, optimized for different departments	Can be complex and costly to manage

How to Actually Choose a Topology (Real-Life Scenarios)

Let’s move beyond theory. Here’s how you'd pick a topology depending on your network goals and constraints:

1. Need a simple setup with a tight budget?

Choose: Bus or Star
Why: Bus requires minimal cabling (but be warned—it’s fragile); Star uses affordable switches and is easy to expand.
Example: Setting up a temporary lab or a network for a rural clinic.

2. Setting up a home or small office?

Choose: Star
Why: It mirrors how devices are physically placed. One faulty PC won’t crash the whole network.
Example: Wi-Fi router (the central node) with laptops, smart TVs, and printers.

3. Running a business with multiple departments?

Choose: Hybrid (Star + Mesh or Star + Ring)
Why: Combine flexibility with reliability. Use star for offices, mesh for server interconnects.
Example: A university with classrooms (star) and data centers (mesh).

4. Downtime is a dealbreaker?

Choose: Mesh
Why: Redundant paths keep communication alive even if several links fail.
Example: Military control center or emergency dispatch system.

5. Working with legacy systems?

Choose: Ring
Why: Some older systems (like token ring networks or SONET) require ring layouts.
Example: Legacy manufacturing networks that still run on ring-based designs.

6. Expecting rapid growth?

Choose: Star or Hybrid
Why: You can easily add more nodes to the central hub or integrate new segments.
Example: A startup anticipating more staff and devices within 6–12 months.

Tips from Experience

Think long-term: Design for tomorrow’s load, not just today’s.
Plan for failures: Even if you don’t need full mesh, maybe add backup links for your star’s hub.
Sketch the layout: Visualizing devices and data flow helps you pick the best design.
Consider wireless topologies too: For mobile or flexible environments, wireless mesh or infrastructure-based topologies might be better than wired ones.

Just like roads and power lines shape how a city grows, your network topology shapes how your digital systems evolve. The best layout isn’t the one with the fanciest name – it’s the one that fits your users, your budget, and your goals.

Choose thoughtfully, and your network becomes more than wires – it becomes infrastructure for productivity, connection, and growth.

Network topology is the blueprint for that digital city. When done right, everything flows. When it’s messy, things get congested, slow, or fail. And that’s why I now look at every network not just as wires and switches, but as architecture, with a purpose and design.

Chapter 6: The OSI Model — Understanding Layers of Communication

The OSI model is like a translator – it helps all types of systems speak the same language. And it’s everywhere.

In this chapter, you will:

Understand what the OSI model is and why it was created
Learn what each of the 7 layers does
Discover how the layers work together during communication
Apply real-life analogies to remember each layer’s role

What is the OSI Model?

Picture this: you want to send a letter. You write it 📝 → put it in an envelope ✉️ → mail it 📮 → it goes to your friend’s house 🏠 → they open it 👐 → and read it 👀.

That’s basically how the OSI Model works. The OSI (Open Systems Interconnection) model is a conceptual framework that describes how data moves from one device to another in a network. Instead of all systems operating differently, the OSI model helps break down communication into 7 distinct layers.

Each layer has a specific task, and together they make communication structured, understandable, and interoperable.

Developed by the International Organization for Standardization (ISO), the OSI model was created to provide a universal standard for different systems to communicate.

Think of it like this: You’re building a house. You wouldn’t put the roof before the walls. Similarly, data follows an order, moving through each of these layers – from sender to receiver.

The 7 layers of the OSI model are:

Application (your browser or app)
Presentation (formatting, encrypting)
Session (starting/ending chats)
Transport (reliable delivery)
Network (finding the route)
Data Link (organizing the data)
Physical (the actual wires or Wi-Fi)

It’s teamwork that makes the stream work!

An easy mnemonic I used to memorize them (from top to bottom): “All People Seem To Need Data Processing.”

Let’s explore each layer from the bottom (Layer 1) to the top (Layer 7):

Layer 1 – Physical Layer

This is the hardware level.

Handles: cables, hubs, repeaters, voltages, pins
Responsible for: physically transmitting raw bits (0s and 1s)
Example: Ethernet cables, fiber optics

Analogy: The roads on which data travels.

Layer 2 – Data Link Layer

Ensures reliable transfer across the physical link.

Handles: MAC addresses, framing, error detection
Divided into:
- Logical Link Control (LLC)
- Media Access Control (MAC)
Example: Switches, MAC addressing

Analogy: Street signs and traffic signals managing who goes when.

Layer 3 – Network Layer

This is about routing – finding the best path to the destination.

Handles: IP addresses, packet forwarding
Devices: Routers
Protocols: IP, ICMP

Analogy: Google Maps calculating the best route.

Layer 4 – Transport Layer

Responsible for end-to-end communication and reliability.

Handles: segmentation, flow control, error recovery
Protocols: TCP (reliable), UDP (fast but no guarantee)

Analogy: Your personal driver, making sure you arrive safely.

Layer 5 – Session Layer

This layer manages dialogues (sessions) between systems.

Handles: session setup, management, and termination

Analogy: A host managing who gets to speak in a Zoom meeting.

Layer 6 – Presentation Layer

Responsible for data formatting and translation.

Handles: encryption, compression, data conversion
Example: JPEG, MP3, SSL, ASCII, EBCDIC

Analogy: A translator ensuring the data is understood.

Layer 7 – Application Layer

The layer closest to the user.

Handles: user interfaces, network services
Protocols: HTTP, FTP, SMTP, DNS

Analogy: The app you open – browser, email client, and so on.

Communication Flow

When I send a message:

It starts at Layer 7 and goes down to Layer 1 at my device
Then travels across the medium
And climbs back up from Layer 1 to Layer 7 on the receiving device

Each layer talks to its “peer” on the other device using a protocol.
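The down-then-up journey can be sketched as a toy program. This is only an illustration: each layer here adds a text label, whereas real layers add binary headers (and not every layer adds one). The `send` and `receive` names are made up for this example.

```python
LAYERS = [
    "Application", "Presentation", "Session",
    "Transport", "Network", "Data Link", "Physical",
]


def send(message: str) -> str:
    """Wrap the message layer by layer, from Layer 7 down to Layer 1."""
    data = message
    for layer in LAYERS:
        data = f"[{layer}]{data}"
    return data


def receive(data: str) -> str:
    """Unwrap in the opposite order, from Layer 1 back up to Layer 7."""
    for layer in reversed(LAYERS):
        label = f"[{layer}]"
        if not data.startswith(label):
            raise ValueError(f"expected {label} wrapper")
        data = data[len(label):]
    return data


on_the_wire = send("hi")
print(on_the_wire)
print(receive(on_the_wire))  # hi
```

Notice that each layer on the receiving side only removes the wrapper added by its peer on the sending side.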

Why the OSI Model Matters

The OSI model is more than theory. It’s a map of the journey your data takes, and it gives structure to what would otherwise feel like chaos. It has helped me think systematically about problems, identify where things break down, and appreciate the complexity behind “just sending a message.” When debugging a network problem, I ask:

Is the cable plugged in? (Layer 1)
Is the MAC address correct? (Layer 2)
Can I ping the destination? (Layer 3)
Is the application service running? (Layer 7)

It gave me a checklist to go through, along with some clarity.

Whether you’re a student or a network pro, these 7 layers are your best friends.

TCP/IP: The Real MVP of the Internet

While the OSI model is an ideal learning tool, the TCP/IP model is what the internet actually uses. It has only four layers, combining some of the OSI layers for simplicity and practicality:

TCP/IP Layer	Corresponds to OSI Layers	Examples
Application	Layers 5–7 (Session to Application)	HTTP, FTP, DNS, SMTP
Transport	Layer 4 (Transport)	TCP, UDP
Internet	Layer 3 (Network)	IP, ICMP
Network Access / Link	Layers 1–2 (Physical + Data Link)	Ethernet, Wi-Fi, MAC addresses

Why TCP/IP Matters:

Scalable: It powers everything from home routers to global telecom infrastructure.
Interoperable: Works across all hardware, operating systems, and devices.
Fault-tolerant: TCP handles dropped packets, reordering, and error checking.
Backbone of the Internet: Every website, email, or Zoom call runs over TCP/IP.

How TCP/IP Works (Simplified Walkthrough)

Let’s say you open your browser and type in www.example.com.

Application Layer (HTTP): Your browser sends a request for a web page.
Transport Layer (TCP): The request is broken into segments, with each piece numbered and prepared for reliable delivery.
Internet Layer (IP): Each segment is wrapped in a packet that carries the source and destination IP addresses, then routed across networks.
Network Access Layer: The data is turned into frames and signals, then physically transmitted over the internet (via cables or wireless).

At the other end, the process reverses, and you see the web page appear on your screen.
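The Transport Layer step in the walkthrough above can be sketched like this. It is a simplification: the helper names are hypothetical, and real TCP numbers bytes rather than whole segments, but the idea of "number the pieces, then put them back in order" is the same.

```python
import random


def segment(data: bytes, size: int) -> list[tuple[int, bytes]]:
    """Split data into (sequence_number, chunk) pairs."""
    return [
        (seq, data[start:start + size])
        for seq, start in enumerate(range(0, len(data), size))
    ]


def reassemble(segments: list[tuple[int, bytes]]) -> bytes:
    """Sort by sequence number and join the chunks back together."""
    return b"".join(chunk for _, chunk in sorted(segments, key=lambda s: s[0]))


request = b"GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
pieces = segment(request, 8)

# Simulate segments arriving out of order.
random.Random(0).shuffle(pieces)

print(reassemble(pieces) == request)  # True
```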

OSI vs. TCP/IP: Why Learn Both?

OSI	TCP/IP
Conceptual, educational model	Practical, real-world protocol suite
7 distinct layers	4 simplified layers
Rarely used directly in implementation	Foundation of the internet

Think of the OSI model as a textbook diagram – helpful for troubleshooting and interviews. TCP/IP is the actual engine – streamlined and optimized for real-world communication.

Chapter 7: Protocols and Ports — How Rules and Doors Guide Communication

Protocols and ports are the rules and gates that make it all happen smoothly. This chapter helps you appreciate how structured communication actually is.

By the end of this chapter, you will:

Understand what protocols are and why they’re essential
Learn about standard protocols used in networking
Explore the concept of ports and their numbers
Discover how protocols and ports work together to manage communication

The Importance of Protocols and Ports

When I tried setting up a local web server for the first time, nothing loaded. It took me a while to realize I hadn’t opened the right port or used the correct protocol.

Protocols are the rules that devices follow when talking to each other. Ports are like doors that allow specific types of data to come in and go out.

Without protocols and ports, communication would be total chaos.

What is a Protocol?

A protocol is an agreed-upon set of rules for sending and receiving data.

Think of it like:

A language: both sides must understand it
A traffic system: everyone follows the same rules to avoid crashes

Characteristics of Good Protocols

For a protocol to be effective in communication, it must clearly define how data is structured, understood, and managed in time. Let’s break that down:

1. Syntax – The Format and Structure of the Data

Think of syntax like grammar in language. It defines:

Data format (for example, header, payload, footer)
Order of fields in a message
Encoding rules (for example, binary, ASCII, JSON, XML)

Example: In an email protocol like SMTP, the syntax might require that the sender and recipient addresses come in a specific format like MAIL FROM: and RCPT TO:.

A good protocol syntax is:

Consistent and unambiguous
Easy to parse by machines
Designed to minimize errors in interpretation
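Here is a small sketch of what "syntax" means in practice. The message format is entirely hypothetical (a 1-byte version, a 1-byte message type, and a 2-byte payload length, followed by the payload), but it shows how a fixed field order and encoding let both sides parse a message without guessing.

```python
import struct

# Hypothetical header: version (1 byte), type (1 byte), payload length (2 bytes),
# all in network (big-endian) byte order.
HEADER = struct.Struct("!BBH")


def encode(version: int, msg_type: int, payload: bytes) -> bytes:
    """Build a message: header first, then the payload."""
    return HEADER.pack(version, msg_type, len(payload)) + payload


def decode(message: bytes) -> tuple[int, int, bytes]:
    """Parse a message, rejecting payloads shorter than the header claims."""
    version, msg_type, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return version, msg_type, payload


wire = encode(1, 2, b"hello")
print(wire)          # b'\x01\x02\x00\x05hello'
print(decode(wire))  # (1, 2, b'hello')
```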

2. Semantics – The Meaning of Each Field

Semantics defines what each piece of data means – what should be done with it.

What does a "200 OK" response mean in HTTP? (It means the request was successful.)
What does a SYN flag mean in TCP? (It initiates a new connection.)

Good protocol semantics:

Ensure that both sender and receiver interpret the data in the same way
Clearly define error codes, commands, and responses
Support meaningful actions tied to each instruction

3. Timing – When and How Fast to Communicate

Timing refers to:

When messages are sent (synchronization)
How fast messages should arrive (data rate)
How long to wait before assuming failure (timeouts)

A good protocol timing design:

Prevents collisions (two devices sending at the same time)
Supports flow control to avoid overwhelming slower devices
Includes retransmission logic in case of delay or loss

Common Networking Protocols

Before diving into details, here’s some context: A networking protocol is like a shared language for computers. It ensures that devices can communicate, share data, and coordinate actions reliably and securely.

TCP – Transmission Control Protocol

TCP is the backbone of reliable internet communication.

It is:

Connection-oriented: A session is established before data is sent.
Reliable: It ensures all data arrives correctly and in order using acknowledgments and retransmission.
Error-checked: Includes checksums to detect corruption; damaged segments are discarded and retransmitted.

You use TCP in web browsing (HTTP/HTTPS), email (SMTP), and file transfers (FTP). It’s like mailing a package with tracking and a required signature on delivery.
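To give a feel for the error checking, here is a sketch of the 16-bit one's-complement "Internet checksum" that TCP is based on. It is simplified: real TCP also includes a pseudo-header with the IP addresses in the calculation, which is omitted here, and the example keeps the data at an even length so the checksum can simply be appended.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carry bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = b"hello world!"  # even length, so the checksum lines up as the last word
checksum = internet_checksum(payload)
received = payload + checksum.to_bytes(2, "big")

# The receiver recomputes over data + checksum; 0 means "looks intact".
print(internet_checksum(received) == 0)   # True

# Flip one bit to simulate corruption in transit.
corrupted = bytes([received[0] ^ 0x01]) + received[1:]
print(internet_checksum(corrupted) == 0)  # False
```

The checksum only detects the damage. Recovery comes from TCP noticing the missing acknowledgment and sending the segment again.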

UDP – User Datagram Protocol

UDP is lightweight, fast, and doesn’t worry about delivery guarantees.

It is:

Connectionless: No handshake or setup, just send and forget.
Low overhead: No acknowledgments or retransmission.
Faster than TCP, but riskier for data loss.

You use it in online gaming, voice calls (VoIP), and live video streaming. It’s like shouting a message across a noisy room – quick, but no guarantee it’ll be heard.

HTTP / HTTPS – HyperText Transfer Protocol

HTTP is the protocol of the web – it enables your browser to request and display web pages.

It is:

Stateless: Each request is independent.
Based on the request-response model: Client sends a request; server responds.

HTTPS adds encryption via TLS (the successor to SSL), making it secure for sensitive data (for example, online banking, logins).

It’s used for activities like browsing websites and in REST APIs.
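The request-response model is easier to picture once you see the text that actually goes over the wire. The sketch below builds a minimal HTTP/1.1 GET request and reads the status line of a response. The helper names are illustrative, and the response here is a hand-written sample rather than something fetched from a real server.

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response: bytes) -> tuple[int, str]:
    """Return (status_code, reason) from the first line of a response."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii")
    _version, code, reason = status_line.split(" ", 2)
    return int(code), reason


print(build_request("www.example.com"))
sample_response = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
print(parse_status_line(sample_response))  # (200, 'OK')
```

Because HTTP is stateless, each request like this one has to carry everything the server needs to answer it.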

FTP – File Transfer Protocol

FTP is a classic protocol for transferring files between devices on a network.

It:

Works in client-server mode
Requires authentication (username/password)
Is not secure on its own – can be enhanced with FTPS or replaced by SFTP (uses SSH)

You can use it for website hosting and file backup systems.

SMTP, POP3, IMAP – Email Protocols

These are the three common email protocols, and each has its own features:

SMTP (Simple Mail Transfer Protocol): Used to send email from clients to servers or between servers.
POP3 (Post Office Protocol v3): Downloads emails to the device and usually deletes them from the server.
IMAP (Internet Message Access Protocol): Keeps email on the server and synchronizes across devices.

These are used in email clients like Outlook, Thunderbird, and Apple Mail.

DNS – Domain Name System

DNS is the internet’s phonebook – it converts human-readable names (like google.com) into IP addresses.

Hierarchical and distributed system
Uses caching to speed up lookups
Works behind the scenes of every website visit

It’s used in every internet-connected application that uses domain names.
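The caching idea can be sketched with a tiny resolver. Everything here is a stand-in: the `CachingResolver` class is hypothetical, the "upstream" is just a dictionary instead of real DNS servers, and the address comes from a range reserved for documentation.

```python
import time


class CachingResolver:
    """Toy resolver that remembers answers until their TTL expires."""

    def __init__(self, upstream: dict[str, str], ttl: float = 60.0, clock=time.monotonic):
        self.upstream = upstream          # stands in for real DNS servers
        self.ttl = ttl
        self.clock = clock
        self.cache: dict[str, tuple[str, float]] = {}
        self.upstream_queries = 0

    def resolve(self, name: str) -> str:
        now = self.clock()
        entry = self.cache.get(name)
        if entry and entry[1] > now:
            return entry[0]               # cache hit: no upstream lookup needed
        self.upstream_queries += 1
        address = self.upstream[name]
        self.cache[name] = (address, now + self.ttl)
        return address


resolver = CachingResolver({"example.com": "192.0.2.10"})
print(resolver.resolve("example.com"))  # 192.0.2.10
print(resolver.resolve("example.com"))  # 192.0.2.10 (served from cache)
print(resolver.upstream_queries)        # 1
```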

What is a Port?

A port is a virtual door on a device that allows certain kinds of data through.

Each application or service uses a specific port number, which ranges from 0 to 65535.

Port Ranges

Well-known ports: 0–1023 (assigned to common services)
Registered ports: 1024–49151 (used by user processes)
Dynamic/Private ports: 49152–65535 (temporary or private use)
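The three ranges above translate directly into a small helper. The function name is just for illustration.

```python
def port_range(port: int) -> str:
    """Classify a port number into one of the three ranges listed above."""
    if not 0 <= port <= 65535:
        raise ValueError("port must be between 0 and 65535")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


for port in (80, 443, 8080, 50000):
    print(port, port_range(port))
# 80 well-known
# 443 well-known
# 8080 registered
# 50000 dynamic/private
```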

Common Port Numbers

Service	Protocol	Port
HTTP	TCP	80
HTTPS	TCP	443
FTP	TCP	21
SSH	TCP	22
DNS	UDP/TCP	53
SMTP	TCP	25
POP3	TCP	110
IMAP	TCP	143

How Protocols and Ports Work Together

Imagine you’re throwing a party:

Protocol: The invitation format – RSVP, dress code, rules.
Port: The door your friends enter through.

A web browser knows to use HTTP (protocol) on port 80. A secure connection will use HTTPS on port 443.

Your computer and servers use these pairings to know what type of data to expect.

Once I understood protocols and ports, troubleshooting network issues got easier. Suddenly, firewall rules, web server configs, and error messages started to make sense.

Protocols ensure everyone speaks the same language. Ports ensure everyone enters through the correct door.

They are the silent heroes of every network conversation.

Chapter 8: IP Addressing and Subnetting — Naming and Organizing the Network

When I first saw an IP address like 192.168.0.1, I didn’t think much of it. But now I see it for what it is: the digital address that tells data where to go. In this chapter, you will learn:

What an IP address is and why it's necessary
The difference between IPv4 and IPv6
How subnetting works and why it's useful
How to calculate and interpret IP ranges, subnet masks, and CIDR notation

Imagine trying to mail a letter without an address – it would be lost forever. The same applies to data on a network. Every device needs a unique identifier called an IP address to send and receive information correctly.

IP addressing ensures that when I request a webpage, my data comes back to me, not someone else on the network.

What is an IP Address?

An IP address (Internet Protocol address) is a unique number assigned to every device on a network.

Every device on a network needs an IP address to identify it – like a phone number for computers. There are two main versions of IP addresses: IPv4 and IPv6.

IPv4 vs. IPv6

IPv4 (Internet Protocol version 4) is the older, more widely used system. It uses a 32-bit address format, written as four numbers (each 0–255) separated by dots—for example: 192.168.1.1. This format allows for about 4.3 billion unique addresses.

But with the explosion of internet-connected devices, we quickly ran out of IPv4 addresses. That’s why IPv6 (Internet Protocol version 6) was introduced. IPv6 uses a 128-bit address format, written in hexadecimal and separated by colons: 2001:0db8:85a3:0000:0000:8a2e:0370:7334. This allows for a virtually unlimited number of addresses – over 340 undecillion (that’s 340 followed by 36 zeros)!

Let’s see a quick breakdown of the key details of each protocol:

IPv4 Address Format

Composed of four numbers separated by dots
Each number ranges from 0 to 255 (i.e., 8 bits per number)
Total: 32 bits (4 x 8)
Example: 192.168.1.1

IPv6 Address Format

Created to solve the address shortage in IPv4
Composed of eight blocks of hexadecimal values
Total: 128 bits
Example: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
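If you want to poke at these formats yourself, Python's standard `ipaddress` module is a handy way to do it. This short sketch uses the two example addresses from above and shows the bit sizes, the integer hiding behind the dotted notation, and the shortened form of the IPv6 address.

```python
import ipaddress

v4 = ipaddress.ip_address("192.168.1.1")
v6 = ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")

print(v4.version, v4.max_prefixlen)  # 4 32
print(int(v4))                       # 3232235777 (the same address as one 32-bit number)
print(v6.version, v6.max_prefixlen)  # 6 128
print(v6.compressed)                 # 2001:db8:85a3::8a2e:370:7334

# Total IPv4 address space: 2 to the power of 32.
print(2 ** 32)                       # 4294967296
```

The compressed form drops leading zeros and replaces one run of all-zero blocks with `::`, which is how you will usually see IPv6 addresses written.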

The Old IPv4 Class System

Originally, IPv4 addresses were grouped into classes to simplify allocation:

Class	Range	Default Subnet Mask	Use
A	1.0.0.0 – 126.255.255.255	255.0.0.0	Large networks
B	128.0.0.0 – 191.255.255.255	255.255.0.0	Medium networks
C	192.0.0.0 – 223.255.255.255	255.255.255.0	Small networks
D	224.0.0.0 – 239.255.255.255	N/A	Multicasting
E	240.0.0.0 – 255.255.255.255	N/A	Reserved for future use

But this system was too rigid. It wasted address space by assigning fixed block sizes, even when a network didn’t need that much.
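Under the old system, the class was decided entirely by the first number of the address. A minimal sketch, with a made-up function name, looks like this:

```python
def ipv4_class(address: str) -> str:
    """Return the legacy class (A-E) based on the first octet."""
    first_octet = int(address.split(".")[0])
    if 1 <= first_octet <= 126:
        return "A"
    if 128 <= first_octet <= 191:
        return "B"
    if 192 <= first_octet <= 223:
        return "C"
    if 224 <= first_octet <= 239:
        return "D"
    if 240 <= first_octet <= 255:
        return "E"
    # 0 and 127 fall outside the class ranges (127 is used for loopback).
    return "special"


for addr in ("10.0.0.1", "172.16.5.4", "192.168.1.1", "224.0.0.5"):
    print(addr, ipv4_class(addr))
# 10.0.0.1 A
# 172.16.5.4 B
# 192.168.1.1 C
# 224.0.0.5 D
```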

Enter CIDR: Classless Inter-Domain Routing

CIDR (pronounced "cider") replaced the old class system in the 1990s. CIDR allows for more flexible and efficient allocation of IP addresses. Instead of using predefined classes, CIDR uses a prefix length to specify how many bits represent the network portion.

Example: 192.168.1.0/24: This means the first 24 bits are the network, and the last 8 bits are available for hosts.

CIDR made it easier to split (subnet) networks and slow the exhaustion of IPv4 addresses. We’ll discuss this more below.

Does IPv6 Use Classes?

No, IPv6 does not use classes. It was designed from the start to avoid the inefficiencies of the class system. Instead, it uses a hierarchical structure and prefix notation similar to CIDR. IPv6 addresses are divided into:

Global unicast (like public IPv4 addresses)
Link-local (used within a local network)
Multicast (send to many devices at once)

IPv6’s design naturally supports efficient routing and address assignment without needing "classes" as a workaround.

After learning about IP addresses – especially the difference between IPv4 and IPv6 – it’s important to understand how networks manage and organize these addresses. That’s where subnetting comes in.

What Is Subnetting?

Think of a large network like a school compound. Subnetting is like dividing the school into classrooms or departments. It’s the process of dividing a larger network into smaller, more manageable subnetworks (subnets).

Subnetting helps with:

Efficient use of IP addresses: You don’t need to assign a huge range of addresses when only a few devices are needed.
Network organization: Departments or teams can be separated into their own subnets.
Better performance and security: Traffic stays local within each subnet, and issues in one subnet don’t affect the whole network.

How Subnet Masks Work

To understand subnetting, we need to talk about subnet masks.

Every IPv4 address is divided into two parts:

The network portion tells you which network it belongs to.
The host portion tells you which specific device (computer, phone, printer, and so on) on that network.

A subnet mask tells us how to separate those two parts.

Example:

IP Address: 192.168.1.10
Subnet Mask: 255.255.255.0

This means:

The first three numbers of the IP address (192.168.1) represent the network.
The last number (10) identifies the specific host on that network.

The subnet mask acts like a filter that shows which part of the IP is fixed (network) and which part can vary (host).
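Under the hood, that "filter" is a bitwise AND. The sketch below does it by hand for the example address and mask. The two helper functions are written just for this illustration.

```python
def to_int(dotted: str) -> int:
    """Convert dotted-decimal notation to a 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def to_dotted(value: int) -> str:
    """Convert a 32-bit integer back to dotted-decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


ip = to_int("192.168.1.10")
mask = to_int("255.255.255.0")

network = ip & mask             # keep only the bits the mask marks as "network"
host = ip & ~mask & 0xFFFFFFFF  # keep only the remaining "host" bits

print(to_dotted(network))  # 192.168.1.0
print(host)                # 10
```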

CIDR Notation: A Modern Alternative

You might also see IP addresses written like this: 192.168.1.0/24. This is called CIDR notation (Classless Inter-Domain Routing), which we discussed briefly above.

CIDR is a more flexible and compact way to express IP addresses and subnet masks. The /24 tells us that the first 24 bits of the address are used for the network. The rest are for hosts.

CIDR Notation	Subnet Mask	Total Addresses (Usable Hosts)
/24	255.255.255.0	256 IPs (254 usable)
/26	255.255.255.192	64 IPs (62 usable)
/30	255.255.255.252	4 IPs (2 usable)

CIDR allows networks to be split or combined more precisely than the old Class A/B/C system, which had fixed sizes.
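You don't have to memorize that table. As a quick sketch, Python's standard `ipaddress` module can derive the mask and address counts from any prefix length. The `describe` helper and the 192.168.1.0 base network are just examples.

```python
import ipaddress


def describe(prefix: int) -> tuple[str, int, int]:
    """Return (subnet mask, total addresses, usable hosts) for a prefix length."""
    net = ipaddress.ip_network(f"192.168.1.0/{prefix}")
    # Subtract the network address and the broadcast address.
    usable = net.num_addresses - 2
    return str(net.netmask), net.num_addresses, usable


for prefix in (24, 26, 30):
    print(f"/{prefix}", *describe(prefix))
# /24 255.255.255.0 256 254
# /26 255.255.255.192 64 62
# /30 255.255.255.252 4 2
```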

How to Calculate a Subnet

Let’s walk through a basic example.

You’re given the network: 192.168.1.0/26

The /26 means 26 bits are used for the network and 6 bits remain for hosts (since IPv4 has 32 bits total).
Using the formula 2^number_of_host_bits, you get 2^6 = 64 total addresses.
But 2 addresses are reserved: one for the network itself, and one for the broadcast address.
So, you’re left with 62 usable addresses in that subnet.

This is helpful when dividing a network among departments, buildings, or device types.
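Here is the same /26 example worked out with Python's `ipaddress` module, so you can check the arithmetic. It also shows how a /24 splits into four /26 subnets, which is the kind of division you might do per department.

```python
import ipaddress

subnet = ipaddress.ip_network("192.168.1.0/26")
hosts = list(subnet.hosts())  # usable addresses only

print(subnet.num_addresses)      # 64
print(subnet.network_address)    # 192.168.1.0
print(subnet.broadcast_address)  # 192.168.1.63
print(hosts[0], hosts[-1])       # 192.168.1.1 192.168.1.62
print(len(hosts))                # 62

# Splitting the parent /24 into /26 blocks gives four subnets.
parent = ipaddress.ip_network("192.168.1.0/24")
print([str(s) for s in parent.subnets(new_prefix=26)])
# ['192.168.1.0/26', '192.168.1.64/26', '192.168.1.128/26', '192.168.1.192/26']
```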

Public vs Private IP Addresses

Not all IP addresses are meant for use on the open internet. Some are private, used within internal networks.

Private IP Addresses:

Not routed over the internet.
Used in homes, schools, and offices.
Can be reused in different networks without conflict.

Range	Purpose
10.0.0.0 – 10.255.255.255	Private use
172.16.0.0 – 172.31.255.255	Private use
192.168.0.0 – 192.168.255.255	Private use

Devices with private IPs connect to the internet through a router that uses NAT (Network Address Translation).
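A quick way to check whether an address falls in one of those three private ranges is to write each range in CIDR form and test membership. The function name below is illustrative. It deliberately checks only the three ranges from the table, rather than every special-purpose range.

```python
import ipaddress

# The three private ranges from the table, written in CIDR notation.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private_ipv4(address: str) -> bool:
    """Return True if the address is inside one of the private ranges above."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_RANGES)


for addr in ("192.168.0.1", "172.31.255.255", "172.32.0.1", "8.8.8.8"):
    print(addr, is_private_ipv4(addr))
# 192.168.0.1 True
# 172.31.255.255 True
# 172.32.0.1 False
# 8.8.8.8 False
```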

Public IP Addresses:

Assigned by your ISP (Internet Service Provider).
Must be globally unique.
Used by websites, servers, and other devices reachable over the internet.

Static vs Dynamic IP Addresses

IP addresses can also be either static or dynamic.

Static IP Address:
- Manually assigned to a device.
- Doesn’t change over time.
- Commonly used for servers, printers, or devices that need consistent access.
Dynamic IP Address:
- Assigned automatically using DHCP (Dynamic Host Configuration Protocol).
- Changes occasionally.
- Most home networks use dynamic IPs for convenience and flexibility.

Why This All Matters

Understanding subnetting, masks, and IP types helps you:

Design networks that scale and perform well.
Assign addresses efficiently.
Improve security through network isolation.
Troubleshoot and configure routers and firewalls effectively.

Subnetting felt confusing at first, but once I saw how it's like breaking down a neighborhood into streets and houses, it clicked. It's a powerful skill for anyone working in networking or IT. And with the rise of IPv6 and cloud-based systems, it's more relevant than ever.

Chapter 9: Routing and Switching — Directing Data on the Network

In this chapter, you will:

Understand the roles of routers and switches
Learn how data is directed within and across networks
Explore routing tables, packet forwarding, and switching techniques
Compare static vs. dynamic routing
Understand how LAN and WAN switching works

Every time we send an email or watch a video, data is being routed and switched through a maze of devices. It’s like navigating a city using both small alleyways (switching) and highways (routing).

These processes ensure that data goes from point A to point B efficiently, securely, and correctly, even if they’re continents apart.

What is Switching?

Switching happens within local networks (LANs). It’s all about moving data between devices on the same network.

What is a Switch?

A switch is a device used in LANs to connect computers, printers, and other networked devices. It operates at Layer 2 (Data Link Layer) of the OSI model and plays a crucial role in directing traffic inside a local network.

But how does a switch know where to send the data?

It uses something called a MAC address.

What Are MAC Addresses?

A MAC (Media Access Control) address is a unique identifier assigned to a device’s network interface card (NIC). It’s like a digital fingerprint for your laptop, printer, or phone.

Each MAC address is a 48-bit address usually displayed in hexadecimal format like this:
00:1A:2B:3C:4D:5E

When data is sent over a LAN, it’s broken into frames, which include both a source MAC address and a destination MAC address.

The switch reads the destination MAC address and forwards the frame only to the port where that specific device is connected. This makes switching faster and more secure than old-style hubs that sent data to all devices.
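How does the switch know which port a device is on? It learns by watching the source MAC address of incoming frames. The sketch below is a simplified model with a hypothetical `LearningSwitch` class: real switches also age out old entries and handle broadcast addresses, which is left out here.

```python
class LearningSwitch:
    """Toy model of a switch that learns which MAC address lives on which port."""

    def __init__(self, port_count: int):
        self.ports = list(range(1, port_count + 1))
        self.mac_table: dict[str, int] = {}

    def receive(self, in_port: int, src_mac: str, dst_mac: str) -> list[int]:
        """Return the list of ports the frame is forwarded to."""
        # Learn: the sender is reachable through the port the frame arrived on.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: send out of every port except the incoming one.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination is on the same port; nothing to forward
        return [out_port]


switch = LearningSwitch(port_count=4)
a, b = "00:1A:2B:3C:4D:5E", "00:1A:2B:3C:4D:5F"

print(switch.receive(1, a, b))  # [2, 3, 4]  (B not learned yet)
print(switch.receive(2, b, a))  # [1]        (A was learned on port 1)
print(switch.receive(1, a, b))  # [2]        (B is now known on port 2)
```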

LAN Switching Techniques

Switches use different techniques to decide when and how to forward frames. These include:

Store-and-Forward Switching: The switch receives the entire frame, checks it for errors using a CRC (Cyclic Redundancy Check), and then forwards it. It’s reliable but slightly slower.
Cut-Through Switching: The switch reads just the destination MAC address (the first 6 bytes of the frame) and immediately begins forwarding the frame. It’s faster but doesn’t check for errors.
Fragment-Free Switching: A hybrid approach. It reads the first 64 bytes before forwarding, enough to avoid most collision-related errors.

What is Routing?

While switching moves data within a single network, routing is what moves data between networks. This is how information travels from your home network to the wider internet.

What is a Router?

A router is a device that connects different networks and determines the best path for data to travel. It operates at Layer 3 (Network Layer) of the OSI model and forwards data based on IP addresses rather than MAC addresses.

You can think of a router like a GPS navigator for internet traffic. It chooses the best available route based on traffic, cost, and destination.

What is a Routing Table?

Each router has a routing table, which is like a map that tells the router:

Which destination networks it knows about
The next hop (which router to send the packet to next)
Which interface (port) to send it out on
The metric, which is a number representing the cost or preference of that path

When a router receives a data packet, it checks the routing table to decide where to send it next.
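A routing table lookup can be sketched with Python's standard ipaddress module. Every entry below (networks, next hops, interface names, metrics) is hypothetical. The selection rule is also an assumption the text does not state: the most specific matching prefix wins, with the metric breaking ties.

```python
import ipaddress

# (destination network, next hop, interface, metric); all values are made up
routing_table = [
    (ipaddress.ip_network("0.0.0.0/0"), "203.0.113.1", "eth0", 10),  # default route
    (ipaddress.ip_network("10.0.0.0/8"), "10.255.0.1", "eth1", 5),
    (ipaddress.ip_network("10.1.2.0/24"), "10.1.2.254", "eth2", 1),
]


def lookup(destination: str):
    """Return the best matching route for a destination IP address."""
    address = ipaddress.ip_address(destination)
    matches = [route for route in routing_table if address in route[0]]
    # prefer the longest prefix; among equal prefixes, prefer the lowest metric
    return max(matches, key=lambda route: (route[0].prefixlen, -route[3]))


for destination in ("10.1.2.7", "10.9.9.9", "8.8.8.8"):
    network, next_hop, interface, metric = lookup(destination)
    print(f"{destination} -> {network} via {next_hop} on {interface}")
```

The default route 0.0.0.0/0 matches every IPv4 address, so it acts as the fallback when nothing more specific is known.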

Static vs. Dynamic Routing

Routers can learn routes in two main ways: static or dynamic.

Static Routing

With static routing, a network administrator manually enters routes into the router's configuration. This method is:

Simple and efficient for small, stable networks
Very secure since routes never change unless manually updated
Limited because it doesn’t adapt if a network link goes down

Example: If you tell a router, “To reach network X, always go through Router A,” that route will stay in place until someone changes it.

Dynamic Routing

Dynamic routing uses protocols that allow routers to automatically share and update routing information with each other. This approach is:

Ideal for large or complex networks
Adaptive, since routes are recalculated if something changes or fails
Slightly more resource-intensive due to constant updates

Common dynamic routing protocols include:

RIP (Routing Information Protocol) – Simple, but outdated
OSPF (Open Shortest Path First) – Fast and widely used in large networks
EIGRP (Enhanced Interior Gateway Routing Protocol) – Developed by Cisco and originally proprietary, combining the best of both distance vector and link-state methods
BGP (Border Gateway Protocol) – The protocol that powers routing across the entire internet

Routing in Action

Let’s say I’m watching a YouTube video:

My device sends a request
The switch sends it to the router
The router consults its table and forwards it to another router
This process continues until the request reaches YouTube’s server
The server sends data back, following the same or a different route

Routers and switches never sleep. They’re working behind the scenes, 24/7, making sure our digital lives function smoothly.

Routing and switching may sound technical, but they are the backbone of modern networking. Knowing how they work has helped me troubleshoot issues and understand why certain delays or outages happen.

Switching keeps local communication efficient. Routing connects us to the world. Together, they are the traffic controllers of the internet.

Chapter 10: Network Infrastructure — Devices, Security, and the Modern Internet

As I continued my journey through networking and data communication, I could see that it's not theory alone – it's hardware, security, and innovation that are essential to the backbone of our everyday life on the internet.

This final chapter brings together the essential knowledge of networks: devices, security protocols, and the technologies behind new connectivity.

In this chapter, you will:

Understand common networking devices and their functions
Explore firewalls, intrusion detection, and best practices for security
Learn how the internet works (DNS, cloud computing, IoT)
Appreciate the role of protocols, encryption, and data integrity in today's connected world

Network Devices — The Building Blocks of Connectivity

Every time we send an email, stream a video, or browse the web, a collection of physical devices quietly work behind the scenes to make it all possible. These network devices form the infrastructure of both small local networks and the vast global internet. Let’s take a closer look at some of the key players.

Hub

The hub is one of the earliest and simplest network devices. It operates at the Physical Layer (Layer 1) of the OSI model and has a very basic job: when it receives data from one of its ports, it broadcasts that data to all other connected devices.

This method is inefficient, as it creates unnecessary traffic and poses security risks. Because of this, hubs are rarely used in modern networks, having been largely replaced by more intelligent devices like switches.

Switch

A switch is a more advanced and efficient version of a hub. It operates at Layer 2 (Data Link Layer) and uses MAC addresses to forward data only to the intended recipient. Instead of flooding the entire network with every transmission, a switch makes sure the data goes only where it's needed. This makes it the go-to device in most Local Area Networks (LANs) today.

Router

While switches handle local traffic, routers are responsible for sending data between different networks. Operating at Layer 3 (Network Layer), a router uses IP addresses to determine the best path for forwarding packets across the internet. In home and business environments, routers are essential for enabling access to the wider world beyond the local network.

Access Point (AP)

An Access Point bridges the gap between wired and wireless networking. It connects to a wired network and provides Wi-Fi so that wireless devices like laptops and smartphones can connect. Access points are especially important in large areas such as offices, schools, or public places where seamless wireless connectivity is needed.

Modem

A modem (short for modulator-demodulator) is the device that connects your local network to your Internet Service Provider (ISP). It converts digital data from your computer into signals that can travel over telephone lines or cable systems, and vice versa. In many homes, the modem is combined with a router in a single device.

Network Interface Card (NIC)

A NIC is the hardware component inside a device—like a laptop or desktop—that allows it to connect to a network. It can be built-in or external and can support either wired Ethernet or wireless Wi-Fi connections. Without a NIC, a device simply can’t participate in network communication.

Network Security — Protecting Our Digital Lives

I never thought much about network security – until I once received a very convincing spam email that nearly tricked me into sharing personal info. It was a wake-up call that our digital spaces aren’t always as safe as they seem.

In today’s connected world, network security is not just an IT concern – it’s a crucial part of everyday life. As we connect more devices and store more personal data online, the risks of cyberattacks and data breaches grow. Here’s a look at the major threats and how we protect against them.

Common Threats

There are many ways attackers can exploit vulnerabilities in a network. Some of the most common threats include:

Malware: This includes viruses, worms, and ransomware – malicious software that can damage files, steal information, or lock systems until a ransom is paid.
Phishing: Attackers send fake emails or create deceptive websites to trick users into revealing sensitive information like passwords or credit card numbers.
DDoS Attacks: A Distributed Denial of Service attack overwhelms a system with traffic from multiple sources, causing it to slow down or crash entirely.

Security Devices and Techniques

To defend against these threats, networks are equipped with various tools and strategies:

Firewalls: These act as gatekeepers between networks, blocking unauthorized access while allowing legitimate communication.
Intrusion Detection Systems (IDS): These monitor network traffic for suspicious behavior or known attack patterns.
Antivirus and Endpoint Security: These tools protect individual devices by scanning for and removing malicious software.
VPNs (Virtual Private Networks): VPNs encrypt data transmitted over the internet, shielding users from eavesdropping—especially on public Wi-Fi networks.

Best Practices

Technology alone isn’t enough – human behavior plays a big role in security. Some key habits include:

Using strong, unique passwords and changing them regularly
Keeping software and operating systems up to date, since patches often fix security holes
Enabling multi-factor authentication (MFA) to add an extra layer of protection
Educating users to recognize suspicious emails and links

Together, these tools and habits form a multi-layered defense that helps safeguard personal and organizational data.

The Modern Internet — DNS, Cloud, and IoT

Today’s internet is about far more than just connecting computers. It’s a complex, evolving ecosystem of services and smart devices, all working together to deliver seamless digital experiences. Let’s explore three key pillars of the modern internet: DNS, Cloud Computing, and the Internet of Things (IoT).

Domain Name System (DNS)

Imagine trying to access websites using IP addresses like 142.250.190.206 instead of just typing google.com. It would be nearly impossible to remember. That’s where the Domain Name System (DNS) comes in.

DNS works like the internet’s phonebook: it translates easy-to-remember domain names (like google.com) into the numerical IP addresses that computers use to communicate. Without DNS, web browsing as we know it wouldn’t exist.

Cloud Computing

The cloud has transformed how we store, process, and access information. Rather than relying on local hardware, cloud computing delivers services—like file storage, applications, or processing power—via the internet. Platforms like Google Drive, Amazon Web Services (AWS), and Microsoft Azure make it easy to scale up resources as needed, work from anywhere, and reduce infrastructure costs.

The benefits are clear: scalability, flexibility, and cost efficiency. But it also brings new challenges in terms of data privacy, security, and compliance.

Internet of Things (IoT)

The Internet of Things refers to everyday objects – like light bulbs, refrigerators, security cameras – that are connected to the internet and can communicate with each other. These devices offer convenience and automation, like turning off lights remotely or monitoring your home while away.

But the explosion of connected devices introduces challenges:

Security: Many IoT devices are poorly secured, making them easy targets for hackers.
Interoperability: With so many manufacturers and standards, getting devices to work together can be difficult.
Privacy: IoT devices often collect sensitive personal data, raising concerns about how that information is used.

Encryption and Secure Protocols

As data travels through this vast digital landscape, it must be protected from prying eyes. That’s where encryption and secure protocols come into play. These tools ensure that even if data is intercepted, it remains unreadable without the correct key.

Some of the most widely used secure protocols include:

HTTPS (Hypertext Transfer Protocol Secure): Ensures encrypted communication between your browser and websites.
SSL/TLS (Secure Sockets Layer / Transport Layer Security): Used behind HTTPS to secure web data.
IPSec: Encrypts IP packets and is commonly used in VPNs to secure network-level communication.
SSH (Secure Shell): Provides secure remote access to systems and devices.

These technologies form the backbone of secure internet communication, protecting users from data leaks, identity theft, and other forms of digital attack.

Wrapping Up

Looking back, it's amazing how far we've come – from learning what a bit is, to understanding how huge global networks function securely and efficiently.

Networking is more than routers and wires – it's a finely crafted system of trust, logic, and global cooperation. It's the very reason that we're able to learn, work, connect, and create anywhere.

And having established this foundation, I feel ready to go further.

Thank you for joining me on this journey.

Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Kuriko — Fri, 30 May 2025 18:21:29 +0000

The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:

Custom classifier
Scikit-learn’s MLPClassifier
Keras Sequential classifier using SGD and Adam optimizers.

This will help you learn about their various use cases and how they work.

What is a Perceptron?
How to Build a Single-Layered Classifier
What is a Multi-Layer Perceptron?
How to Build Multi-Layered Perceptrons
Understanding Optimizers
How to Build an MLP Classifier with SGD Optimizer
How to Build an MLP Classifier with Adam Optimizer
Final Results: Generalization
Conclusion

Prerequisites

Mathematics (Calculus, Linear Algebra, Statistics)
Coding in Python
Basic understanding of Machine Learning concepts

What is a Perceptron?

A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

The perceptron consists of four main parts:

Input layer: Takes the initial numerical values into the system for further processing.
Weights: Combines input values with weights (and bias terms).
Activation function: Determines whether the neuron should fire based on the threshold value.
Output layer: Produces classification result.

It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.

Applications of Perceptrons

Perceptrons are applied to tasks such as:

Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.
Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

How the Activation Function Works

For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq \theta \\ \\ 0 &\text{if } z < \theta \end{cases}$$

where:

ϕ(z): the output of the activation function.
z: the weighted sum of the inputs plus the bias:

$$z = \sum_{i=1}^m w_i x_i + b$$

(x_i: input values, w_i: weight associated with each input, b: bias term)

θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

In that case, the formula becomes:

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq 0 \\ \\ 0 &\text{if } z < 0 \end{cases}$$

When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

This occurs when the weighted sum is greater than or equal to zero, leading the perceptron to predict that the input belongs to this class.
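Here is a small numeric sketch of the weighted sum and the step function, following the z ≥ 0 convention of the formula. The weights, bias, and inputs are arbitrary values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# hypothetical parameters and inputs, chosen for easy mental arithmetic
w = np.array([0.5, -0.25])
b = -0.25
X = np.array([
    [1.0, 0.5],
    [0.5, 2.0],
    [1.0, 1.0],
])

# weighted sum plus bias for every row of X
z = X @ w + b

# step activation with the threshold at zero
phi = np.where(z >= 0, 1, 0)

print(z)    # [ 0.125 -0.5    0.   ]
print(phi)  # [1 0 1]
```

The third sample lands exactly on the boundary (z = 0) and is assigned to class one under this convention.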

While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

In modern implementations, we can use other activation functions like the sigmoid function:

$$\sigma (z) = \frac {1} {1 + e^{-z}}$$

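Unlike the step function, the sigmoid function outputs a continuous value between zero and one depending on the weighted sum (z). That value can be read as a probability and then thresholded to obtain a class label of zero or one.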
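A quick sketch of the sigmoid in NumPy shows the range of its output and how a class label can be derived from it. The 0.5 cut-off used here is a common convention and an assumption of this example, not something fixed by the sigmoid itself.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)

# turn probabilities into class labels with an assumed 0.5 cut-off
labels = (probs >= 0.5).astype(int)

print(np.round(probs, 3))
print(labels)
```

Because the output changes smoothly with z, the sigmoid has a usable derivative everywhere, which the step function lacks.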

How the Loss Function Works

The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.

Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.

In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

(h(x): the model's prediction, y: the true label, encoded as −1 or +1 for this loss)
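The hinge loss is easy to evaluate by hand. The sketch below assumes the usual convention for this loss, with true labels encoded as −1 or +1 and h(x) a raw score. The sample scores are made up.

```python
import numpy as np


def hinge_loss(y_true, scores):
    """Per-sample hinge loss; y_true must be -1 or +1."""
    return np.maximum(0.0, 1.0 - y_true * scores)


y_true = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, -3.0, 0.5])  # hypothetical model outputs

losses = hinge_loss(y_true, scores)
print(losses)         # [0.  0.5 0.  1.5]
print(losses.mean())  # 0.5
```

The second sample is classified correctly but still incurs a loss because it sits inside the margin (y·h(x) < 1). The fourth is misclassified and is penalized the most.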

How to Build a Single-Layered Classifier

Now, let’s build a simple single-layer perceptron for binary classification.

1. Custom Classifier

Initialize the classifier

We’ll first initialize the classifier with weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

def __init__(self, learning_rate=0.01, n_iterations=1000):
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = None
    self.bias = None

Define the activation function

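Use a step function that returns zero if the input (x) is less than or equal to the threshold, and one otherwise. By default, the threshold is set to zero. Note that this implementation uses a strict inequality, so an input of exactly zero maps to zero, whereas the formula above maps it to one.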

def _step_function(self, x, threshold: int = 0):
     return np.where(x > threshold, 1, 0)

Train the model

Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

This process is controlled by a specified number of training epochs defined by n_iterations.

In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

def fit(self, X, y):
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iterations):
        for i in range(n_samples):
            # compute weighted sum (z)
            z = np.dot(X[i], self.weights) + self.bias

            # apply the activation function
            y_pred = self._step_function(z)

            # update weights and bias
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)

How the weights work in the iteration loop

The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

Its iterative update in the for loop aims to reduce classification errors such that:

$$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j &\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

(w_j: j-th weight, η: learning rate, x_ij: j-th feature of the i-th sample, (y_i − ŷ_i): error)

This means that:

When the prediction is correct, the error is zero, so the weight is unchanged.
When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight moves in the direction of the input (w_j increases by η·x_ij), which raises the weighted sum for that sample.
When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight moves in the opposite direction (w_j decreases by η·x_ij), which pulls the weighted sum lower.

How the bias terms work in the iteration loop

The bias determines the decision boundary’s intercept (position from the origin).

Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:

$$\begin {align*} b &:= b + \Delta b \\ & := b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b &\text{(a) } y_i - \hat y_i = 0\\ b + \eta &\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.
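To see the update rules at work, here is a single training step traced in NumPy. The sample, the starting parameters, and the learning rate are arbitrary values picked so the arithmetic can be followed by hand. The prediction uses the same strict z > 0 test as the classifier's step function.

```python
import numpy as np

eta = 0.5                    # learning rate (arbitrary for this example)
w = np.array([0.0, 0.0])     # initial weights
b = 0.0                      # initial bias
x_i = np.array([2.0, -1.0])  # one training sample
y_i = 1                      # its true label

# forward pass before the update
z_before = np.dot(x_i, w) + b  # 0.0
y_hat = int(z_before > 0)      # 0, so the prediction is too low
error = y_i - y_hat            # 1, which is case (b)

# apply the update rules
w = w + eta * error * x_i      # [ 1.  -0.5]
b = b + eta * error            # 0.5

# forward pass after the update
z_after = np.dot(x_i, w) + b   # 2*1 + (-1)*(-0.5) + 0.5 = 3.0
print(w, b, z_after, int(z_after > 0))
```

After one update, the weighted sum for this sample moves from 0 to 3, so the same input is now classified as one.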

Make a prediction

Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

def predict(self, X):
      linear_output = np.dot(X, self.weights) + self.bias
      predictions = self._step_function(linear_output)
      return predictions

The entire classifier looks like this:

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def _step_function(self, x, threshold: int = 0):
        return np.where(x > threshold, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for i in range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        return self

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        return y_pred

Simulate with synthetic datasets

First, we generate a synthetic, linearly separable dataset using make_blobs, then train the classifier we created so that it learns a decision boundary.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np

# create a mock dataset
X, y = make_blobs(n_features=2, centers=2, n_samples=1000, random_state=12)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
perceptron = Perceptron(learning_rate=0.1, n_iterations=1000).fit(X_train, y_train)

# make a prediction
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

# evaluate the results
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"Accuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The classifier generated a clear, highly accurate linear decision boundary.

Accuracy (Train): 0.981
Accuracy (Test): 0.975

2. Leverage Scikit-learn’s MLPClassifier

For convenience, we’ll use scikit-learn’s built-in classifier (MLPClassifier) to build a similar, yet more robust classifier:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), # intentionally set empty to create a single layer perceptron
    activation='logistic', # choosing a sigmoid function as an activation function
    solver='sgd', # choosing SGD optimizer
    max_iter=1000,
    random_state=42, 
    learning_rate='constant', 
    learning_rate_init=0.1
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

Accuracy (Train): 0.985
Accuracy (Test): 0.995

Limitations of Single-Layer Perceptrons

Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

Because of its discontinuity at x=0, the step function is not differentiable there, and its derivative is zero everywhere else, so it never provides a useful gradient.

This property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (the partial derivatives of the cost function with respect to each parameter).

In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.

Other challenges of a single-layer perceptron include:

Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.
Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.
Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.
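The optimizer limitation can be checked numerically. The sketch below estimates derivatives with a central finite difference, a technique used here only for illustration. It compares the step function with the sigmoid at a few arbitrary points away from zero.

```python
import numpy as np


def step(z):
    return np.where(z > 0, 1.0, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)


z = np.array([-2.0, -0.5, 0.5, 2.0])  # sample points, none at the jump

step_grad = numeric_grad(step, z)
sigmoid_grad = numeric_grad(sigmoid, z)

print(step_grad)                  # zero everywhere: no learning signal
print(np.round(sigmoid_grad, 4))  # strictly positive values
```

With a zero gradient at every point, a gradient-based optimizer has nothing to follow, which is why the custom perceptron relies on the error-driven update rule instead.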

So, in the next section, you’ll learn how multi-layer perceptrons overcome these disadvantages.

What is a Multi-Layer Perceptron?

An MLP is a class of feedforward artificial neural network that consists of at least three layers of nodes:

an input layer,
one or more hidden layers, and
an output layer.

Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

MLPs are widely used for both classification and regression:

Classification tasks: Typical examples include handwriting recognition and speech recognition.
Regression analysis: MLPs are also applied to regression problems where the relationship between input and output is complex.

How to Build Multi-Layered Perceptrons

Let’s handle a binary classification task using a standard MLP architecture.

Outline of the Project

Objective

Detect fraudulent transactions

Evaluation Metrics

Considering the cost of misclassification, we’ll prioritize improving Recall and Precision scores
Then check the accuracy of classification with Accuracy Score ((TP + TN) / (TP + TN + FP + FN))

Cost of Misclassification (from high to low):

False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)
False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)
True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.
True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
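The metrics named above can be computed directly from the four confusion-matrix counts. The counts in this sketch are invented for illustration and are not results from the fraud dataset. The function name classification_metrics is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# hypothetical counts for a balanced set of 200 fraud and 200 non-fraud samples
metrics = classification_metrics(tp=160, tn=180, fp=20, fn=40)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall drops with every missed fraud (FN) and precision drops with every blocked legitimate transaction (FP), so these two scores track the two costliest error types listed above.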

Planning an MLP Architecture

In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.

During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

Especially in deeper networks, ReLU is advantageous in preventing vanishing gradient problems where gradients become extremely small as they are backpropagated from the output layers.
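The planned architecture (19 inputs, 30 ReLU hidden units, one sigmoid output) can be sketched as a plain NumPy forward pass. The weights below are random placeholders and the input batch is synthetic, so the printed probabilities carry no meaning. The point is only to show the shapes and the order of operations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 19, 30

# randomly initialized parameters (placeholders, not trained values)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)


def forward(X):
    """Forward pass: linear -> ReLU -> linear -> sigmoid."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU keeps positive values only
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid output in (0, 1)


X_batch = rng.normal(size=(4, n_features))  # four synthetic samples
probs = forward(X_batch)

print(probs.shape)  # (4, 1): one fraud probability per sample
```

Training consists of repeating this forward pass, measuring the loss, and letting the optimizer adjust W1, b1, W2, and b2 through backpropagation.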

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Preprocessing the Datasets

First, we consolidate three datasets – transaction, customer, and credit card – into a single DataFrame, independently sanitizing numerical and categorical data:

import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# download the raw data to local
import kagglehub
path = kagglehub.dataset_download("computingvictor/transactions-fraud-datasets")
dir = f'{path}/gd_card_flaud_demo'

def sanitize_df(amount_str):
    """Removes '$' and converts the string to a float."""
    if isinstance(amount_str, str):
        return float(amount_str.replace('$', ''))
    return amount_str

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int) # convert the datatype from string to integer
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.drop(columns=['client_id', 'acct_open_date', 'card_number', 'expires', 'cvv'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_y', 'card_id'], axis='columns')

# converts categorical variables into a new binary column (0 or 1)
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float) 
df = df.dropna().drop(['client_id', 'id_x'], axis=1)
print('\nDataFrame: \n', df.head(n=3))

DataFrame:

Our DataFrame shows an extremely skewed data distribution with:

Fraud samples: 1,191
Non-fraud samples: 11,477,397

For classification tasks, it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

For our data, we’ll:

split the 1,191 fraud samples into training, validation, and test sets,
add an equal number of randomly chosen non-fraud samples from the DataFrame, and
adjust split balances later if generalization challenges arise.

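# separate the two classes; df_fraud and df_non_fraud are used throughout the split below
df_fraud = df[df['is_fraud'] == 1]
df_non_fraud = df[df['is_fraud'] == 0]

# define the desired size of the fraud samples for the validation and test sets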
val_size_per_class = 200
test_size_per_class = 200

# create test sets
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=42)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=42)

# combine to form the balanced test set
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_test = X_test['is_fraud']
X_test = X_test.drop('is_fraud', axis=1)

# remove sampled rows from the original dataframes to avoid data leakage
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


# create validation sets
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=42)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=42)

# combine to form the balanced validation set
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_val = X_val['is_fraud']
X_val = X_val.drop('is_fraud', axis=1)

# remove sampled rows from the remaining dataframes
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


# create training sets
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=42)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=42)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_train = X_train['is_fraud']
X_train = X_train.drop('is_fraud', axis=1)


print("\n--- Final Dataset Shapes and Distributions ---")
print(f"X_train shape: {X_train.shape}, y_train distribution: {np.unique(y_train, return_counts=True)}")
print(f"X_val shape: {X_val.shape}, y_val distribution: {np.unique(y_val, return_counts=True)}")
print(f"X_test shape: {X_test.shape}, y_test distribution: {np.unique(y_test, return_counts=True)}")

After the operation, we secured 1,582 training, 400 validation, and 400 test samples, with each dataset maintaining a 50:50 split between fraud and non-fraud transactions.
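
The sample-and-drop pattern above can also be expressed with index arrays, which makes it easy to confirm that the three sets never share a row. The following is a minimal sketch under stated assumptions: the helper `balanced_disjoint_split` and the toy label vector are hypothetical and are not part of the article's pipeline.

```python
import numpy as np

def balanced_disjoint_split(y, n_val, n_test, seed=42):
    """Return train/val/test index arrays with equal class counts and no overlap."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    n_train = min(len(pos), len(neg)) - n_val - n_test  # per-class training size

    def take(start, stop):
        # take the same slice from both shuffled classes, then shuffle the union
        return rng.permutation(np.concatenate([pos[start:stop], neg[start:stop]]))

    test = take(0, n_test)
    val = take(n_test, n_test + n_val)
    train = take(n_test + n_val, n_test + n_val + n_train)
    return train, val, test

# toy labels: 30 fraud rows among 1,000 (hypothetical numbers)
y = np.zeros(1000, dtype=int)
y[:30] = 1
train_idx, val_idx, test_idx = balanced_disjoint_split(y, n_val=5, n_test=5)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because each split is a different slice of the same shuffled index arrays, the sets are disjoint by construction, which is the property the `drop(...)` calls enforce in the DataFrame version.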

Given the high-dimensional feature space (19 input features), we’ll apply SMOTE to enlarge the training data. SMOTE should be applied only to the training set, never to the validation or test sets, so that evaluation uses real samples only:

from imblearn.over_sampling import SMOTE
from collections import Counter

train_target = 2000

smote_train = SMOTE(
  sampling_strategy={0: train_target, 1: train_target},  # raise each class to 2,000 samples
  random_state=12
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE with custom sampling_strategy (target train: {train_target}):")
print(f"X_train_oversampled shape: {X_train.shape}")
print(f"y_train_oversampled distribution: {Counter(y_train)}")

We’ve secured 4,000 training samples (2,000 per class), maintaining a 50:50 split between fraud and non-fraud transactions.
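
To give an intuition for what SMOTE does, the core idea is to place each synthetic row somewhere on the line segment between an existing row and one of its nearest neighbors from the same class. The sketch below is a simplified NumPy illustration of that idea for numeric features only. It is not imblearn's implementation, and the function name `smote_like_samples` is hypothetical.

```python
import numpy as np

def smote_like_samples(X_class, n_new, k=5, seed=12):
    """Create n_new synthetic rows by interpolating toward a random k-nearest neighbor."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_class, dtype=float)
    # pairwise Euclidean distances; a row is never its own neighbor
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X), size=n_new)              # row to start from
    pick = rng.integers(0, neighbors.shape[1], size=n_new)  # which neighbor to use
    nbr = neighbors[base, pick]
    lam = rng.random((n_new, 1))                            # position along the segment
    return X[base] + lam * (X[nbr] - X[base])

X_class = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_class, n_new=6, k=2)
print(X_new.shape)
```

Every synthetic row lies between two real rows of the same class, which is why the new samples stay inside the region already covered by that class.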

Lastly, we’ll apply column transformers to numerical and categorical features separately.

Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
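
The leakage protection comes from calling `fit_transform` on the training set only and plain `transform` on the other sets. As a rough illustration of why that matters, here is a hand-rolled stand-in for the scaling step. The class `TrainOnlyScaler` and the toy arrays are assumptions made for this sketch, not the article's code.

```python
import numpy as np

class TrainOnlyScaler:
    """Standardize columns with statistics learned from the training split only."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # avoid dividing by zero on constant columns
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_

X_tr = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_va = np.array([[3.0, 70.0]])

scaler = TrainOnlyScaler().fit(X_tr)     # statistics come from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_va_scaled = scaler.transform(X_va)     # validation rows reuse those statistics
print(X_va_scaled)
```

The validation row is scaled with the training mean and standard deviation, so nothing about the validation distribution influences the preprocessing.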

Understanding Optimizers

In deep learning, an optimizer is the algorithm that adjusts a neural network’s parameters (weights and biases) during training. Its role is to minimize the model’s loss function, which in turn improves predictions.

Different optimizers use different strategies to reach good parameters efficiently.

In this article, we’ll use the SGD Optimizer and Adam Optimizer.

1. How an SGD (Stochastic Gradient Descent) Optimizer Works

SGD is a widely used optimization algorithm that computes the gradient (partial derivative of the cost function) from a small mini-batch of examples and updates the parameters at every iteration, so each epoch contains many updates:

$$\begin{align*} w_j &:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$

(w: weight, b: bias, J: cost function, η: learning rate)

In binary classification, the cost function (J) is the binary cross-entropy between the label y and the prediction ŷ, where ŷ is the sigmoid function (σ(z)) applied to z, the weighted sum of the inputs plus the bias term:

$$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y) \log(1-\hat y)] \\ \\ \hat y &= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
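
To make the update rule concrete, the sketch below applies a single mini-batch SGD step to one sigmoid unit and checks that the cross-entropy goes down. The helper names `sgd_step` and `bce`, the toy data, and the learning rate of 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def sgd_step(w, b, X_batch, y_batch, lr=0.1):
    """One mini-batch SGD update for a single sigmoid unit with cross-entropy loss."""
    z = X_batch @ w + b
    y_hat = 1.0 / (1.0 + np.exp(-z))
    error = y_hat - y_batch                      # dJ/dz for sigmoid + cross-entropy
    grad_w = X_batch.T @ error / len(y_batch)    # dJ/dw averaged over the batch
    grad_b = error.mean()                        # dJ/db averaged over the batch
    return w - lr * grad_w, b - lr * grad_b

def bce(w, b, X, y):
    """Mean binary cross-entropy J for the current parameters."""
    y_hat = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])   # the label simply copies the second feature

w, b = np.zeros(2), 0.0
loss_before = bce(w, b, X, y)
w, b = sgd_step(w, b, X, y)
loss_after = bce(w, b, X, y)
print(w, b, loss_before, loss_after)
```

Only the weight of the informative second feature moves in this step, because the gradient for the first feature and for the bias averages out to zero on this batch.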

2. How Adam (Adaptive Moment Estimation) Optimizer Works

Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

where:

α: The learning rate (default is 0.001)
ϵ: A small positive constant used to avoid division by zero
m̂: First moment (mean) estimate with a bias correction, leveraging Momentum:

$$\begin{align*} \hat m_t &= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$

(β1: Decay rate of the first moment estimate, typically set to β1=0.9)

v̂: Second moment (uncentered variance) estimate with a bias correction, leveraging RMSprop:

$$\begin{align*} \hat v_t &= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$

(β2: Decay rate of the second moment estimate, typically set to β2=0.999)

Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to prevent them from being biased toward zero during the first training steps.
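
A small worked example may help show what the bias correction buys. In the sketch below (the function name `adam_step` and the toy gradient values are assumptions), the very first step has a size close to the learning rate for every parameter, regardless of how large or small its gradient is.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0, -2.0])
grad = np.array([0.5, -40.0])      # very different gradient magnitudes
m, v = np.zeros_like(w), np.zeros_like(w)

w_new, m, v = adam_step(w, grad, m, v, t=1)
step = w - w_new
print(step)
```

At t=1 the corrected moments reduce to the gradient and its square, so the update is approximately the learning rate times the sign of the gradient. Without the correction, m would be only one tenth of the gradient while the square root of v would be about 3% of its magnitude, which would make the first step roughly three times larger.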

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

Each training step involves a forward pass and backpropagation, after which SGD updates the weights and biases using the computed gradients:

for i in range(0, n_samples, self.batch_size):
    # take the next mini-batch from the data shuffled at the start of the epoch
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    # A. forward pass
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[-1]  # final output of the network

    # B. backpropagation
    # 1) calculate gradients for the output layer
    delta = y_pred - y_batch
    dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
    db = np.sum(delta, axis=0) / X_batch.shape[0]

    # 2) update output layer parameters
    self.weights[-1] -= self.learning_rate * dW
    self.biases[-1] -= self.learning_rate * db

    # 3) iterate backward from last hidden layer to the input layer
    for l in range(len(self.weights) - 2, -1, -1):
        delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
        dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
        db = np.sum(delta, axis=0) / X_batch.shape[0]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
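
The line `delta = y_pred - y_batch` relies on a convenient identity: for a sigmoid output combined with cross-entropy loss, the derivative of the loss with respect to the pre-activation z is simply ŷ - y. The following sketch checks that identity numerically with a central finite difference. The helper names and the chosen values of z and y are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Cross-entropy of a single example, written as a function of the logit z."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, h = 0.7, 1.0, 1e-6
analytic = sigmoid(z) - y                                               # delta used in backprop
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)
print(analytic, numeric)
```

The two values agree to several decimal places, which is why the output layer needs no explicit call to the sigmoid derivative.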

During the forward pass, the network computes the weighted sum of each layer's inputs plus a bias (z), applies an activation function (ReLU) in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.

def _forward_pass(self, X):
    activations = [X]
    zs = []

    # forward through hidden layers
    for i in range(len(self.weights) - 1):
        z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) # using ReLU for hidden layers
        activations.append(a)

    # forward through output layer
    z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
    zs.append(z_output)

    # compute the final output using the sigmoid function (clipped for numerical stability)
    y_pred = 1 / (1 + np.exp(-np.clip(z_output, -500, 500)))
    activations.append(y_pred)
    return activations, zs

So the final classifier looks like this:

from sklearn.metrics import accuracy_score

class MLP_SGD:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.01, n_epochs=1000, batch_size=32):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        for epoch in range(self.n_epochs):
            # shuffle datasets
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1]

                delta = y_pred - y_batch
                dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
                db = np.sum(delta, axis=0) / X_batch.shape[0]
                self.weights[-1] -= self.learning_rate * dW
                self.biases[-1] -= self.learning_rate * db

                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    db = np.sum(delta, axis=0) / X_batch.shape[0]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self

    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten() # for 1D output

Training / Prediction

Train the model and make a prediction using training and validation datasets:

# 1. define the model
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(30, 30, ), # 2 hidden layers with 30 neurons each
  learning_rate=0.001,           # a step size
  n_epochs=1000,                 # number of epochs
  batch_size=32                  # mini-batch size
)

# 2. train the model
mlp_sgd.fit(X_train_processed, y_train)

# 3. make a prediction with training and validation datasets
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

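# 4. compute evaluation metrics on the training and validation sets
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

conf_matrix_val = confusion_matrix(y_val, y_pred_val)
precision_val = precision_score(y_val, y_pred_val, pos_label=1)
recall_val = recall_score(y_val, y_pred_val, pos_label=1)
f1_val = f1_score(y_val, y_pred_val, pos_label=1)

print(f"\nMLP (Custom SGD) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom SGD) Accuracy (Validation): {acc_val:.3f}")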

Results

Recall: 0.7930 → 0.6650 (from training to validation)
Precision: 0.7790 → 0.6786 (from training to validation)

The model learned the training patterns reasonably well, achieving a Recall of 79.3% on the training set (it identified roughly 80% of the fraud transactions). Recall fell by about 13 points on the validation set, so generalization is limited.
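
For readers who want to see how Recall and Precision are derived, here is a minimal pure-Python sketch that counts true positives, false positives, and false negatives directly. The function `binary_metrics` and the toy labels are assumptions, and the numbers are not the article's results.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class, labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged rows that are truly fraud
    recall = tp / (tp + fn) if tp + fn else 0.0      # fraud rows that were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy labels, not the article's data
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

In fraud detection, Recall is often the headline metric because a missed fraud case (a false negative) usually costs more than a false alarm.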

Loss history:

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

Leverage Scikit-learn’s MLPClassifier

We can use Scikit-learn’s MLPClassifier to define a similar model, incorporating:

Early stopping using internal validation to prevent overfitting, and
L2 regularization (alpha) together with a small optimization tolerance (tol).

from sklearn.neural_network import MLPClassifier

# define a model
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='sgd',
    learning_rate_init=0.001,
    learning_rate='constant',
    momentum=0.9,
    nesterovs_momentum=True,
    alpha=0.00001,           # L2 regularization strength
    max_iter=3000,           # max epochs (keep it high)
    batch_size=16,           # mini-batch size
    random_state=42,
    early_stopping=True,     # apply early stopping
    n_iter_no_change=50,     # stop the iteration if internal validation score doesn't improve for 50 epochs
    validation_fraction=0.1, # proportion of training data for internal validation (default is 0.1)
    tol=1e-4,                # tolerance for optimization
    verbose=False,
)

# training
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

# make a prediction
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
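
The interplay between `n_iter_no_change` and `tol` can be pictured as a patience counter. The sketch below is a simplified model of that rule, not Scikit-learn's internal code: the function `early_stopping_epoch` and the toy validation scores are assumptions.

```python
def early_stopping_epoch(scores, patience=50, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best + tol:
            best, stale = score, 0      # meaningful improvement: reset the counter
        else:
            stale += 1                  # no improvement larger than tol
        if stale >= patience:
            return epoch
    return None

val_scores = [0.60, 0.65, 0.66, 0.66, 0.659, 0.66]
stop_epoch = early_stopping_epoch(val_scores, patience=3, tol=1e-4)
print(stop_epoch)
```

A larger patience gives a noisy validation score more time to recover, at the cost of extra epochs after the best model has already been found.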

Results

Recall: 0.7830 → 0.6200 (from training to validation)
Precision: 0.8208 → 0.6703 (from training to validation)

The model performed well on the training set, achieving a Recall of 78.30%, but Recall declined by about 16 points to 62.00% on the validation set.

This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.

Leverage Keras Sequential Classifier

With the Keras Sequential model, we can further enhance the classifier by:

Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,
Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,
Including Precision and Recall in the model’s compilation metrics so that classification performance can be monitored during training,
Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and
Utilizing a separate validation dataset for monitoring performance during training, which helps detect overfitting and guides hyperparameter tuning.

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


# calculates an initial bias for the output layer 
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])


# defines the model
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1), # 10% of the neurons in that layer randomly dropped out
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', # binary classification
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) # to address the imbalanced datasets
])



# compiles the model with the SGD optimizer
opt = SGD(learning_rate=0.001)
model_keras_sgd.compile(
    optimizer=opt, 
    loss='binary_crossentropy',
    metrics=[
        'accuracy', # add several metrics to return
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)


# defines early stopping to prevent overfitting
early_stopping_callback = EarlyStopping(
    monitor='val_recall',  # monitor recall 
    mode='max',         # maximize recall
    patience=50,        # stop after 50 epochs without improvement in validation recall
    min_delta=1e-4,     # minimum change to be considered an improvement (tol)
    verbose=0
)


# compute the class weight
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


# train the model
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val), # use our external val set
    callbacks=[early_stopping_callback], # early stopping to prevent overfitting
    class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
    verbose=0
)

# evaluate
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")

# display model summary
model_keras_sgd.summary()
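
The initial bias and the class weights used above both follow from simple class counts. The sketch below reproduces the two formulas in NumPy: the log-odds of the positive class, and the n_samples / (n_classes * count) rule behind the 'balanced' option. The helper name and the toy label vectors are assumptions for illustration.

```python
import numpy as np

def output_bias_and_class_weights(y):
    """Log-odds initial bias and 'balanced' class weights for binary labels."""
    y = np.asarray(y)
    counts = np.bincount(y, minlength=2)          # [n_negative, n_positive]
    bias = np.log(counts[1] / counts[0])          # sigmoid(bias) equals the positive rate
    weights = len(y) / (2 * counts)               # n_samples / (n_classes * count)
    return bias, {0: weights[0], 1: weights[1]}

y_imbalanced = np.array([0] * 90 + [1] * 10)
bias, weights = output_bias_and_class_weights(y_imbalanced)

y_balanced = np.array([0] * 50 + [1] * 50)
bias_bal, weights_bal = output_bias_and_class_weights(y_balanced)
print(bias, weights, bias_bal, weights_bal)
```

On a 50:50 training set such as the SMOTE-resampled one, the bias comes out as 0 and both weights as 1, so these two techniques only change the training when the classes are actually imbalanced.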

Results

Recall: 0.7125 → 0.7250 (from training to validation)
Precision: 0.7607 → 0.7545 (from training to validation)

Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

It suggests that the regularization techniques are likely effective in preventing significant overfitting.
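
Dropout is one of the regularization techniques involved here. As a rough sketch of the mechanism, the snippet below implements the commonly used inverted dropout in NumPy: units are zeroed at random during training and the survivors are scaled up so that the expected activation stays the same. The function `dropout` is a hypothetical helper written for this illustration, not the Keras layer itself.

```python
import numpy as np

def dropout(a, rate=0.1, training=True, rng=None):
    """Inverted dropout: zero a fraction of activations and rescale the rest."""
    if not training or rate == 0.0:
        return a                                   # inference: pass through unchanged
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(a.shape) >= rate             # True for units that stay active
    return a * keep / (1.0 - rate)

a = np.ones((1000, 30))
a_train = dropout(a, rate=0.1, training=True)
a_eval = dropout(a, rate=0.1, training=False)
print(a_train.mean(), (a_train == 0).mean())
```

Because the rescaling happens at training time, no adjustment is needed at prediction time, which is why evaluating the model uses the full network.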

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

Within the mini-batch loop, Adam repeatedly updates the weights and biases using the moment estimates:

# apply Adam updates for output layer parameters
# 1) weights (w)
self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

# 2) bias (b)
self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)

Following the same forward and backward passes as MLP_SGD, we construct the final classifier, which additionally takes beta1, beta2, and epsilon and keeps the moment estimates for every parameter:

class MLP_Adam:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.001, n_epochs=1000, batch_size=32,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        # Adam optimizer internal states for each parameter (weights and biases)
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((1, fan_out)))
            self.v_biases.append(np.zeros((1, fan_out)))


    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        # global time step for Adam bias correction
        t = 0

        for epoch in range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # Mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += 1

                # 1. forward pass
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1] # Output of the network

                # 2. backpropagation
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[-2].T, delta) / X_batch.shape[0] # Average over batch
                grad_b_output = np.sum(delta, axis=0) / X_batch.shape[0]

                # apply Adam updates to weights
                self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
                self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
                m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
                v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
                self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                # apply Adam updates to bias
                self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
                self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
                m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
                v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
                self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                # Propagate gradients backward through hidden layers
                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    grad_b_hidden = np.sum(delta, axis=0) / X_batch.shape[0]

                    # apply Adam updates to weights
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (1 - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (1 - self.beta2) * (grad_w_hidden ** 2)
                    m_w_hat = self.m_weights[l] / (1 - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (1 - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    # apply Adam updates to bias
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (1 - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (1 - self.beta2) * (grad_b_hidden ** 2)
                    m_b_hat = self.m_biases[l] / (1 - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (1 - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self


    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten()

Training / Prediction

Train the model and make a prediction using training and validation datasets:

mlp_adam = MLP_Adam(hidden_layer_sizes=(30, 10), learning_rate=0.001, n_epochs=500, batch_size=32)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom Adam) Accuracy (Validation): {acc_val:.3f}")

Results

Recall: 0.9870 → 0.6150 (from training to validation)
Precision: 0.9811 → 0.6474 (from training to validation)

While the Adam optimizer fit the training data far better than SGD, the model overfit significantly: Recall fell by about 37 points and Precision by about 33 points between training and validation, and validation Recall (0.6150) ended below the custom SGD model's 0.6650.

Loss History

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

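Leverage Scikit-learn’s MLPClassifier

We’ve switched the optimizer from SGD to Adam and kept the architecture and training settings the same. The SGD-only momentum arguments are dropped, and note that alpha is 0.0001 here versus 0.00001 in the SGD run: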

model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='adam',             # update the optimizer from SGD to Adam
    learning_rate_init=0.001,
    learning_rate='constant',
    alpha=0.0001,
    max_iter=3000,
    batch_size=16,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=50,
    validation_fraction=0.1,
    tol=1e-4,
    verbose=False,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)

Results

Recall: 0.8975 → 0.6400 (from training to validation)
Precision: 0.8864 → 0.6305 (from training to validation)

Training scores are higher than with the SGD solver, but the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation indicates that the model is still overfitting.

Leverage Keras Sequential Classifier

Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=0.001)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss='binary_crossentropy', 
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=50,
    min_delta=1e-4,
    verbose=0
)

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=0
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")


model_keras_adam.summary()

Results

Recall: 0.7995–0.7500 (from training to validation)
Precision: 0.8409–0.8065 (from training to validation)

The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

This indicates good generalization, with only minor performance degradation on unseen data.

Final Results: Generalization

Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

# Custom classifiers
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# MLPClassifier
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# Keras Sequential
# Evaluated on the held-out test set, so the results are named *_test_*
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)

Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.

Conclusion

In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

Our findings underscore that effective machine learning hinges on three critical factors:

robust data preprocessing (tailored to objectives and data distribution),
judicious model selection, and
strategic framework or library choices.

Choosing the right framework

Generally speaking, choose MLPClassifier when:

You’re primarily working with tabular data,
You want to prioritize simplicity, quick iteration, and seamless integration,
You have simple, shallow architectures, and
You have a moderate dataset size (manageable on a CPU).

Choose Keras Sequential when:

You’re dealing with image, text, audio, or other sequential data,
You’re building deep learning models such as CNNs, RNNs, LSTMs,
You need fine-grained control over the model architecture, training process, or custom components,
You need to leverage GPU acceleration,
You’re planning for production deployment, and
You want to experiment with more advanced deep learning techniques.

Limitations of MLPs

While Multilayer Perceptrons (MLPs) proved valuable, their susceptibility to computational complexity and overfitting emerged as key challenges.

Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

You can find more info about me on my Portfolio / LinkedIn / Github.

LeetCode Meditations: A Visual Handbook of Data Structures and Algorithms Concepts

Eda Eren — Thu, 29 May 2025 19:52:13 +0000

It may seem like an oxymoron when the words "LeetCode" and "meditation" are used together – after all, one thing that almost everyone can agree on is that LeetCode is challenging. It's called grinding LeetCode for a reason.

It doesn't have anything to do with the platform, of course, but rather what it represents: tackling problems for hours on end, usually to find a solution that is even harder to understand.

But what is more challenging is finding a roadmap to solve those problems with very little knowledge of data structures and algorithms. This handbook is more or less based on the Blind 75 list that's included in neetcode.io's practice problems. This is an amazing resource that offers an organized study roadmap for solving LeetCode problems.

In fact, why not take a more structured and calmer approach? We can treat learning about the topics on the list like taking a brief walk in nature – a sort of meditation, if you will.

That said, this handbook is not about specific problems. Rather it’s about understanding the concepts behind them in a casual manner. It is also language agnostic – sometimes you’ll see TypeScript, sometimes Python, and sometimes JavaScript.

This handbook also requires you to be patient, to relax, to take a step back and pay attention. The mid-quality GIFs used in the handbook (maybe ironically!) intend to encourage this. They are not videos, so you can wait for them to come back around to a moment that you didn't understand or missed instead of hastily rewinding or rushing to a certain point in the future.

Solving hundreds of LeetCode problems may be the gate to go through to get an interview at big tech companies…but learning the topics that the problems are about is not under anyone's monopoly.

With that said, let's start the first chapter.

Prerequisites
Chapter One: Arrays & Hashing
Chapter Two: Two Pointers
- Palindrome example
- Squares of a sorted array example
Chapter Three: Sliding Window
- Fixed window size
- Dynamic window size
Chapter Four: Stack
Chapter Five: Binary Search
Chapter Six: Linked Lists
Interlude: Fast & Slow Pointers
- Finding the middle node of a linked list
Chapter Seven: Trees
- Binary trees, binary search trees (BSTs)
Chapter Eight: Heap / Priority Queue
Chapter Nine: Backtracking
- Subsets
Chapter Ten: Tries
Chapter Eleven: Graphs
- Representing graphs
- Traversals
  - Breadth-First Search
  - Depth-First Search
Chapter Twelve: Dynamic Programming
Chapter Thirteen: Intervals
Chapter Fourteen: Bit Manipulation
Conclusion
Resources & Credits

Prerequisites

Before diving in, some familiarity with TypeScript/JavaScript and Python may be helpful, as these are the languages I use for the examples. Also, a basic understanding of Big O notation is useful as we go over time and space complexities.

Even though we don't go through the mathematics behind the concepts, some basic mathematical knowledge can also help. That said, it's definitely not necessary to enjoy or learn something useful from this handbook.

Chapter One: Arrays & Hashing

Let's very briefly get to know our topics for this chapter: dynamic arrays, hash tables, and prefix sums.

Dynamic Arrays

Dynamic arrays are, well, dynamic. They're flexible and can change their size during execution.

Python's list type is a dynamic array. We can create an items list, for example:

items = [3, 5]

The length of items is 2, as you can see, but its capacity is greater than or equal to its length. Capacity refers to how many elements the allocated memory can hold, whereas length is the number of elements actually stored.

Since dynamic arrays are still arrays, they need a contiguous block of memory.

We can easily add an item to items:

items.append(7)

And add some more:

items.append(9)
items.append(11)
items.append(13)

All the while, the length and capacity of items keep growing dynamically.
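To make the difference between length and capacity more concrete, here is a toy sketch of a dynamic array. The DynamicArray class and its doubling growth rule are assumptions for illustration only; this is not how Python's list is actually implemented, and real growth strategies differ.

```python
class DynamicArray:
    """Toy dynamic array. Doubling the capacity is an assumed growth rule."""

    def __init__(self):
        self.capacity = 2                      # slots allocated
        self.length = 0                        # slots actually in use
        self._slots = [None] * self.capacity

    def append(self, item):
        # Only grow when every allocated slot is already taken
        if self.length == self.capacity:
            self._grow()
        self._slots[self.length] = item
        self.length += 1

    def _grow(self):
        # Allocate a bigger block and copy everything over: O(n), but rare
        self.capacity *= 2
        new_slots = [None] * self.capacity
        new_slots[:self.length] = self._slots[:self.length]
        self._slots = new_slots


items = DynamicArray()
history = []
for value in [3, 5, 7, 9, 11, 13]:
    items.append(value)
    history.append((items.length, items.capacity))

print(history)
# [(1, 2), (2, 2), (3, 4), (4, 4), (5, 8), (6, 8)]
```

Notice that the capacity only changes on a few of the appends. That is the intuition behind the amortized cost of appending being constant.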

Time and space complexity

Accessing an element is $O(1)$ as we have random access.

Inserting a new element or deleting an element is $O(n)$ (think about having to shift all the elements before inserting or after deleting an item). But, in order to not be too pessimistic, we can look at amortized analysis – in that case, inserting/deleting at the end of the array becomes $O(1)$.

Space complexity is $O(n)$, as the need for space will grow proportionately as the input increases.

If you need more info about time and space complexity, you can refer to this guide.

Hash Tables

A hash table maps keys to values, implementing an associative array.

Python's dict is one example:

number_of_petals = {
    'Euphorbia': 2, 
    'Trillium': 3, 
    'Columbine': 5,
}

Also JavaScript's "object"s:

let numberOfMoons = {
  'Earth': 1,
  'Mars': 2,
  'Jupiter': 95,
  'Saturn': 146,
  'Uranus': 27,
  'Neptune': 14,
};

There are two important ingredients for a hash table:

an array of "buckets" to store the data
a hash function to map the data to a specific index in the array

Hashes are usually large integers, so to find an index, we can take the result of the hash modulo the array's length.

Note: The hash function that's mapping the elements to buckets is not the hash() used in the visual (it's just a Python function to calculate the hash value of an object). The hash function in this case is the modulo ( % ) operation.

Here, with the hash value of each item's key, we calculate the remainder when it's divided by the length of the array to find which "bucket" it should go to.

The ratio of the number of elements to the number of buckets is called the load factor, and the higher it gets, the more collisions (when elements have to be inserted at the same place in the array) occur.
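Here is a small sketch of how keys can land in buckets. The bucket_index helper and the sample keys are made up for illustration, and it assumes CPython, where the hash of a small non-negative integer is the integer itself.

```python
from collections import Counter


def bucket_index(key, num_buckets):
    # Reduce the (possibly huge) hash value to a valid array index
    return hash(key) % num_buckets


keys = [3, 11, 19, 4]
num_buckets = 8

indices = [bucket_index(key, num_buckets) for key in keys]
load_factor = len(keys) / num_buckets

# Buckets that more than one key maps to, that is, collisions
crowded = {index: count for index, count in Counter(indices).items() if count > 1}

print(indices)      # [3, 3, 3, 4]
print(load_factor)  # 0.5
print(crowded)      # {3: 3}
```

Three of the four keys end up in bucket 3, even though the load factor is only 0.5. A low load factor makes collisions less likely, but it doesn't rule them out.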

There are some collision resolution tactics like linear probing (probing through the array until finding an empty bucket) and chaining (chaining multiple elements as linked lists), but we'll not go into those for now.

Time and space complexity

The average case for searching, inserting, and deleting operations is $O(1)$ as we use keys to look up the values.

Space complexity is $O(n)$ as it grows linearly with the number of elements.

Prefix Sums

A prefix sum is the sequence of running totals of another sequence. It's also called the cumulative sum.

The first element of the resulting array is the first element of the input array. That's fine. We start at the second item, and add the previous numbers each time as we go. That is:

$$result[i] = \begin{cases} nums[0] & \text{if } i \text{ is zero} \\ result[i - 1] + nums[i] & \text{if } i \geq 1 \end{cases}$$

In code, we can implement that easily:

def runningSum(nums):
    result = [nums[0]]

    for i in range(1, len(nums)):
        result.append(result[i - 1] + nums[i])

    return result

Time and space complexity

Time complexity for a prefix sum is $O(n)$ because we're iterating over each of the elements in the array.

The space complexity is also $O(n)$ because the need for external space grows as the length of the original array grows.
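As a side note, Python's standard library can produce the same running totals with itertools.accumulate. The sketch below also shows one common use of prefix sums: getting the sum of a range without looping over it again. The range_sum helper is a hypothetical name used only for this example.

```python
from itertools import accumulate

nums = [1, 2, 3, 4]

# Same running totals as the handwritten loop
prefix = list(accumulate(nums))
print(prefix)  # [1, 3, 6, 10]


def range_sum(prefix, i, j):
    """Sum of nums[i..j] (inclusive), using only the prefix sums."""
    return prefix[j] - (prefix[i - 1] if i > 0 else 0)


print(range_sum(prefix, 1, 3))  # 9, which is 2 + 3 + 4
print(range_sum(prefix, 0, 2))  # 6, which is 1 + 2 + 3
```

Building the prefix array is $O(n)$, but each range sum after that is $O(1)$.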

Chapter Two: Two Pointers

One of the techniques of iterating through an array is the two pointers technique, and it is as simple as it sounds: we just keep two pointers, one starting from the left, and the other from the right, gradually getting closer to each other.

Palindrome example

A very basic example can be the one where we check if a string is a palindrome or not. A palindrome is a string that reads the same forwards and backwards.

In an imaginary world where all the inputs always consist of lowercase English letters, we can do it like this:

// s consists of lowercase English letters
function isPalindrome(s: string) {
  let left = 0;
  let right = s.length - 1;

  while (left <= right) {
    if (s[left++] !== s[right--]) {
      return false;
    }
  }

  return true;
}

We initialize two pointers: left and right. left points to the first character of the string, while right points to the last one. As we loop while left is less than or equal to right, we check whether the characters they point to are equal. If not, we return false immediately. Otherwise, our left pointer is increased – that is, it's moved to the right one step, and our right pointer is decreased, meaning that it's moved to the left one step. When they eventually cross, the loop terminates, and we return true.

Let's say our string is 'racecar', which is a palindrome. It will go like this:
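As a rough text version of that walk-through, here is a Python sketch of the same idea that also records which pair of characters is compared at each step. The is_palindrome_trace name and the steps list are additions for illustration.

```python
def is_palindrome_trace(s):
    left, right = 0, len(s) - 1
    steps = []  # pairs of characters compared, in order

    while left <= right:
        steps.append((s[left], s[right]))
        if s[left] != s[right]:
            return False, steps
        left += 1    # move one step to the right
        right -= 1   # move one step to the left

    return True, steps


result, steps = is_palindrome_trace("racecar")
print(result)  # True
print(steps)   # [('r', 'r'), ('a', 'a'), ('c', 'c'), ('e', 'e')]
```

The last comparison is the middle character against itself, after which the pointers cross and the loop ends.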

Squares of a sorted array example

Another example where we can use the two pointers technique is the problem Squares of a Sorted Array.

The description says:

Given an integer array nums sorted in non-decreasing order, return an array of the squares of each number sorted in non-decreasing order.

For example, if the input is [-4, -1, 0, 3, 10], the output should be [0, 1, 9, 16, 100].

Now obviously, we can just square each one, and then sort the array with a built-in sort method, and be done with it. But a general comparison-based sort takes $O(n \log n)$ time in the worst case, so we can do better by using two pointers in just $O(n)$ time:

function sortedSquares(nums: number[]): number[] {
  let left = 0;
  let right = nums.length - 1;
  let result = [];

  while (left <= right) {
    if (Math.abs(nums[left]) > Math.abs(nums[right])) {
      result.push(nums[left++] ** 2);
    } else {
      result.push(nums[right--] ** 2);
    }
  }

  return result.reverse();
}

We compare the absolute value of the items that left and right are pointing to, and push the square of the greater one to our result array. And we return the reversed version of it.

Note: The reason we return the reversed result is that the array is initially already sorted, and we get the largest absolute value first. The reason that works is related to how two pointers work: as we start from both ends, we initially start with the smallest and largest values in the array.

Because we only make one pass through the array while comparing, and then later reversing, it ends up being $O(n)$, a better runtime than $O(n \log n)$.
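If the final reverse feels a bit roundabout, one possible variation is to fill the result from the back, so the largest squares land in their final positions right away. This is a sketch of that variation in Python, not the solution used above:

```python
def sorted_squares(nums):
    result = [0] * len(nums)
    left, right = 0, len(nums) - 1

    # Fill the result from the last index down to the first
    for write in range(len(nums) - 1, -1, -1):
        if abs(nums[left]) > abs(nums[right]):
            result[write] = nums[left] ** 2
            left += 1
        else:
            result[write] = nums[right] ** 2
            right -= 1

    return result


print(sorted_squares([-4, -1, 0, 3, 10]))  # [0, 1, 9, 16, 100]
```

It is still a single pass with two pointers, so the time complexity stays $O(n)$.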

Chapter Three: Sliding Window

Now that we're familiar with the Two Pointers technique, we can add another one to our toolbox: the Sliding Window. It's usually used for operations done on contiguous parts of the given data, such as subarrays or substrings. It also comes in two flavors: fixed window size and dynamic window size.

Fixed window size

If we have a size constraint in a given problem – say, we need to check a $k$-sized subarray – sliding window is an appropriate technique to use.

For example, getting the maximum subarray (of size $k$) sum of a given array can be done like this:

Note that the window size is $k$, and it doesn't change throughout the operation – hence, fixed size.

A very cool thing to notice here is that with each slide, what happens to our sum is that we add the new element on the right, and subtract the element that just left the window on the left.

Let's look at an example for getting the maximum sum of subarray with given size k:

function maxSubarray(numbers: number[], k: number) {
  if (numbers.length < k) {
    return 0;
  }

  let currentSum = 0;

  // Initial sum of the first window 
  for (let i = 0; i < k; i++) {
    currentSum += numbers[i];
  }

  let maxSum = currentSum;

  let left = 0;
  let right = k;

  while (right < numbers.length) {
    currentSum = currentSum - numbers[left++] + numbers[right++];
    maxSum = Math.max(maxSum, currentSum);
  }

  return maxSum;
}

Note: Updating the pointers can be done outside the brackets as well, like this:

while (right < numbers.length) {
  currentSum = currentSum - numbers[left] + numbers[right];
  maxSum = Math.max(maxSum, currentSum);
  left++;
  right++;
}

Since the postfix operator returns the value first, they can be used inside the brackets to be slightly more concise.

Here, we first get the initial sum of our window using the for loop, and set it as the maximum sum.

Then we initialize two pointers: left that points to the left end of the window, and right that points to the element just past the right end of the window (the next one to enter it). As we loop, we update our currentSum, subtracting the left value, and adding the right value. When our current sum is more than the maximum sum, the maxSum variable is updated as well.
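To see the window slide with actual numbers, here is a Python sketch of the same approach that also keeps the sum of every window it visits. The sample input and the window_sums list are made up for illustration.

```python
def max_subarray_sum(numbers, k):
    if len(numbers) < k:
        return 0, []

    # Initial sum of the first window
    current_sum = sum(numbers[:k])
    window_sums = [current_sum]
    max_sum = current_sum

    for right in range(k, len(numbers)):
        left = right - k
        # Add the element entering the window, subtract the one leaving it
        current_sum += numbers[right] - numbers[left]
        window_sums.append(current_sum)
        max_sum = max(max_sum, current_sum)

    return max_sum, window_sums


print(max_subarray_sum([2, 1, 5, 1, 3, 2], 3))
# (9, [8, 7, 9, 6])
```

Each new window sum is computed from the previous one with just one addition and one subtraction, instead of summing $k$ elements again.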

Dynamic window size

As opposed to the fixed window size version, the size of the window changes dynamically this time.

For example, let's take a brief look at the problem Best Time to Buy and Sell Stock. We need to choose a day to buy a stock, and sell it in the future. The numbers in the array are prices, and we need to buy the stock at as low a price as we can, and sell it as high as we can.

We can initialize left and right pointers again, but this time, we'll update them depending on a condition. When the left item is less than the one on the right, that means it's good – we can buy and sell at those prices, so we get their difference and update our maxDiff variable that holds the maximum difference between the two.

If, however, the left one is greater than the right one, we update our left pointer to be where the right is at. In both cases, we'll continue updating right until we reach the end of the array.

With the blue arrow indicating the left pointer, and the red the right one, the process looks like this:

The solution looks like this:

function maxProfit(prices: number[]): number {
  let left = 0;
  let right = left + 1;
  let maxDiff = 0;

  while (right < prices.length) {
    if (prices[left] < prices[right]) {
      let diff = prices[right] - prices[left];
      maxDiff = Math.max(maxDiff, diff);
    } else {
      left = right;
    }

    right++;
  }

  return maxDiff;
}

Note: This one is also called fast/catch-up version of dynamic sliding window, because the left pointer jumps to catch up with the right pointer in the else block.
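For a quick check with concrete numbers, here is the same logic sketched in Python. The price lists are sample inputs chosen for illustration.

```python
def max_profit(prices):
    left, right = 0, 1
    max_diff = 0

    while right < len(prices):
        if prices[left] < prices[right]:
            # Buying at left and selling at right makes a profit
            max_diff = max(max_diff, prices[right] - prices[left])
        else:
            # Found a price at least as low: left catches up with right
            left = right
        right += 1

    return max_diff


print(max_profit([7, 1, 5, 3, 6, 4]))  # 5 (buy at 1, sell at 6)
print(max_profit([7, 6, 4, 3, 1]))     # 0 (prices only fall)
```

In the second call, left jumps forward on every step, so the window never gets a chance to grow.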

Time and space complexity

Both examples have the same time and space complexity: The time complexity is $O(n)$ because in the worst case we iterate through all the elements in the array. The space complexity is $O(1)$ as we don't need additional space.

Chapter Four: Stack

A stack data type is perhaps one of the most well-known ones. A stack of books is a good way to visualize it: insertion and deletion can only happen at one end, the top. A stack operates through the last-in first-out (LIFO) principle: the last item to go in is the first to go out.

Usually we'll have methods for pushing an element onto the stack, and popping an element from the stack.

For example, let's say we're looking for valid parentheses in a given string, and the operation we'll do goes like this.

As we iterate over the characters in the string, we push the character onto the stack. If we pushed a closing parenthesis (one of ), }, or ]), then, if the previous pushed element is its opening pair, we'll pop that pair from the stack.

If, at the end, the stack is empty, the string consists of valid parentheses.
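The steps above can be sketched in Python roughly like this. It uses a plain list as the stack for brevity, and the is_valid name and pairs mapping are choices made for this example. Instead of pushing a closing parenthesis and then popping the pair, it simply pops the matching opening one, which has the same effect.

```python
def is_valid(s):
    # Each closing parenthesis mapped to its opening pair
    pairs = {')': '(', '}': '{', ']': '['}
    stack = []

    for char in s:
        if char in pairs and stack and stack[-1] == pairs[char]:
            # The top of the stack is the matching opening one: remove the pair
            stack.pop()
        else:
            stack.append(char)

    # Anything left on the stack was never matched
    return not stack


print(is_valid("()[]{}"))  # True
print(is_valid("([{}])"))  # True
print(is_valid("(]"))      # False
print(is_valid("(("))      # False
```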

A stack can be implemented as an array or a linked list. But using linked lists is more common because with arrays, we have a potential stack overflow when we predefine a maximum stack size. On the other hand, linked lists are not static when it comes to memory, so they are a good candidate to implement stacks.

Linked lists are also efficient because we are using one end of the stack for insertion and deletion, and doing these are constant time operations.

Let's look at one easy stack implementation in Python.

Now, we can use a list, but a list in Python is implemented as a dynamic array underneath, so although appending is $O(1)$ on average, an individual push can be an $O(n)$ operation if the list needs to be copied into another memory location. For that reason, we'll use a deque, which CPython implements as a doubly linked list of fixed-size blocks, so that we know push and pop operations at the end will be $O(1)$.

from collections import deque

class Stack:
    def __init__(self):
        self._stack = deque()

    def push(self, item):
        self._stack.append(item)

    def pop(self):
        return self._stack.pop()

    def peek(self):
        return self._stack[-1]

    def is_empty(self):
        return len(self._stack) == 0

    def size(self):
        return len(self._stack)

In addition to push and pop, we'll also usually have functions like peek to get the topmost item in the stack, is_empty to check if the stack is empty, and size to get the size of the stack.

We can also do it using JavaScript. Now, we can do it using an array, but we want to use a linked list instead. Since we don't have a robust built-in library like Python this time, we'll implement a very simple version of it ourselves. Even though we haven't seen linked lists so far, the basic idea is that we have nodes, each of which has a data value, and a next pointer pointing to the next node.

Let's create a simple node first:

class Node {
  constructor(data) {
    this.data = data;
    this.next = null;
  }
}

We can write our stack now:

class Stack {
  constructor() {
    this.top = null;
    this.length = 0;
  }

  push(item) {
    const node = new Node(item);
    node.next = this.top;
    this.top = node;
    this.length++;
  }

  pop() {
    if (this.isEmpty()) { return null; }

    const data = this.top.data;
    this.top = this.top.next;
    this.length--;

    return data;
  }

  peek() {
    if (this.isEmpty()) { return null; }

    return this.top.data;
  }

  isEmpty() {
    return this.size() === 0;
  }

  size() {
    return this.length;
  }
}

Now, let’s use it:

let myStack = new Stack();

myStack.push(5);
myStack.push(17);
myStack.push(55345);
myStack.push(0);
myStack.push(103)

console.log(myStack.size()) // 5
console.log(myStack.peek()) // 103

myStack.pop()

console.log(myStack.size()) // 4
console.log(myStack.peek()) // 0

Time and space complexity

Each method we defined for our stack has $O(1)$ time complexity, and it would be the same if we were to use an array as well. However, as mentioned above, arrays have limitations in that having to allocate a predefined stack size can lead to a stack overflow. And if we were to use a dynamic array, the whole array might need to be copied to go into another memory location after a certain size is reached, leading to $O(n)$ time. So, linked lists are ideal to implement a stack data type.

The space complexity is linear, $O(n)$: the stack grows linearly with the number of items in it.

Chapter Five: Binary Search

Binary search is one of the most well-known algorithms. It's also a divide-and-conquer algorithm, where we break the problem into smaller components.

The crux of binary search is to find a target element in a given sorted array. We have two pointers: high to point to the largest element, and low to point to the smallest element. We first initialize them for the whole array itself, with high being the last index and low being the first index.

Then, we calculate the midpoint. If the target is greater than the value at the midpoint, we adjust our low pointer to point to mid + 1; otherwise, if the target is less than the value at the midpoint, we adjust high to be mid - 1. With each iteration, we eliminate half the array until the value at the midpoint equals the target or the low pointer passes high.

If we find the index of the target, we can return it as soon as we find it. Otherwise, we can just return -1 to indicate that the target doesn't exist in the array.

For example, if we have a nums array [-1, 0, 3, 5, 9, 12] and our target is 9, the operation looks like this:

We can write it in TypeScript like this:

function search(nums: number[], target: number): number {
  let high = nums.length - 1;
  let low = 0;

  while (high >= low) {
    let mid = Math.floor((high + low) / 2);

    if (target > nums[mid]) {
      low = mid + 1;
    } else if (target < nums[mid]) {
      high = mid - 1;
    } else {
      return mid;
    }
  }

  return -1;
}

Time and space complexity

The time complexity of a binary search algorithm is $O(\log n)$ in the worst case. (For example, if the target is not in the array, we'll be halving the array until there is one element left.) The space complexity is $O(1)$ as we don't need extra space.
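To get a feel for how quickly the search space shrinks, here is a Python sketch of the same algorithm that also records every midpoint index it looks at. The search_with_trace name and the visited list are additions for illustration.

```python
def search_with_trace(nums, target):
    low, high = 0, len(nums) - 1
    visited = []  # midpoint indices, in the order they are checked

    while low <= high:
        mid = (low + high) // 2
        visited.append(mid)

        if target > nums[mid]:
            low = mid + 1
        elif target < nums[mid]:
            high = mid - 1
        else:
            return mid, visited

    return -1, visited


nums = [-1, 0, 3, 5, 9, 12]
print(search_with_trace(nums, 9))  # (4, [2, 4])
print(search_with_trace(nums, 2))  # (-1, [2, 0, 1])
```

Even when the target is missing, only three of the six elements are ever looked at.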

Chapter Six: Linked Lists

A linked list is a linear data structure that you're likely to be familiar with. It is also a data structure that can grow and shrink dynamically – so unlike arrays, there's no need to allocate memory beforehand.

An important part of a linked list is the head pointer that points to the beginning of the list. There may or may not be a tail pointer that also points to the end of the list.

The core ingredient of a linked list is a simple node, which consists of two parts: data and the next pointer. So, it is an important idea to remember: a node only knows about its data and its neighbor.

The very last node in the linked list points to null to indicate it's the end of the list.

But there are different types of linked lists that differ from each other slightly, so let's briefly take a look at them.

Singly linked lists

The core idea with singly linked lists is that each node, along with the data it has, has a pointer that points only to the next node:

class Node {
  constructor(data) {
    this.data = data;
    this.next = null;
  }
}

And here is an example where we have three nodes, holding the values 1, 2, and 3 consecutively:
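In text form, that three-node list could be wired up by hand like this. It's a minimal Python sketch, with a Python version of the node written just for this example:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


# 1 -> 2 -> 3 -> None
head = Node(1)
head.next = Node(2)
head.next.next = Node(3)

# Follow the next pointers until we fall off the end of the list
values = []
current = head
while current is not None:
    values.append(current.data)
    current = current.next

print(values)  # [1, 2, 3]
```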

Here is a simple implementation of a singly linked list in JavaScript:

class SinglyLinkedList {
  constructor() {
    this.head = null;
    this.tail = null;
    this.length = 0;
  }

  // Add value to the end of the list
  append(value) {
    let node = new Node(value);
    // If the list is empty
    if (this.head === null) {
      this.head = node;
      this.tail = this.head;
    } else {
      this.tail.next = node;
      this.tail = node;
    }

    this.length++;
    return this;
  }

  // Add value to the beginning of the list
  prepend(value) {
    let node = new Node(value);
    // If the list is empty
    if (this.head === null) {
      this.head = node;
      this.tail = this.head;
    } else {
      node.next = this.head;
      this.head = node;
    }

    this.length++;
    return this;
  }

  remove(value) {
    // If the list is empty, return null
    if (this.head === null) { 
      return null; 
    }

    // If it is the first element
    if (this.head.data === value) {
      this.head = this.head.next;
      this.length--;
      // If it is the only element 
      // (we don't have anything after removing it)
      if (this.head === null) {
        this.tail = null;
      } 
      return;
    }

    let currentNode = this.head;

    while (currentNode.next) {
      if (currentNode.next.data === value) {
        currentNode.next = currentNode.next.next;
        // If it is the last element, update tail
        if (currentNode.next === null) {
          this.tail = currentNode;
        } 
        this.length--;
        return;
      }
      currentNode = currentNode.next;
    }
  }

  search(value) {
    let currentNode = this.head;

    while (currentNode) {
      if (currentNode.data === value) {
        return currentNode;
      }
      currentNode = currentNode.next;
    }

    // If the value does not exist, return null
    return null;
  }

  printList() {
    let values = [];
    let currentNode = this.head;
    while (currentNode) {
      values.push(currentNode.data);
      currentNode = currentNode.next;
    }

    console.log(values);
  }
}

Note: We'll keep a tail pointer in all these examples for convenience. It doesn't hurt to have a tail pointer.

We can now use it:

const mySinglyLinkedList = new SinglyLinkedList();

mySinglyLinkedList.prepend(3);
mySinglyLinkedList.prepend(143);
mySinglyLinkedList.prepend(5);

mySinglyLinkedList.printList(); // [ 5, 143, 3 ]

mySinglyLinkedList.append(21);
mySinglyLinkedList.printList(); // [ 5, 143, 3, 21 ]

console.log(mySinglyLinkedList.search(143));
// Node {
//   data: 143,
//   next: Node { data: 3, next: Node { data: 21, next: null } }
// }

mySinglyLinkedList.remove(143);
mySinglyLinkedList.printList(); // [ 5, 3, 21 ]

console.log(mySinglyLinkedList.search(143)); // null

Doubly linked lists

Doubly linked lists differ from the "singly" ones in that each node also has another pointer that points to the previous element.

So, this time, a single node will look different:

class Node {
  constructor(data) {
    this.data = data;
    this.next = null;
    this.previous = null;
  }
}

Here is the same example as above, but as a doubly linked list:
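As a rough text version, the sketch below links three nodes in both directions and then walks the list backwards, which a singly linked list can't do. The DoublyNode class and the link helper are names made up for this Python example.

```python
class DoublyNode:
    def __init__(self, data):
        self.data = data
        self.next = None
        self.previous = None


def link(a, b):
    # Connect a and b in both directions: a <-> b
    a.next = b
    b.previous = a


first, second, third = DoublyNode(1), DoublyNode(2), DoublyNode(3)
link(first, second)
link(second, third)

# Start from the last node and follow the previous pointers
backwards = []
current = third
while current is not None:
    backwards.append(current.data)
    current = current.previous

print(backwards)  # [3, 2, 1]
```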

A simple implementation might look like this:

class DoublyLinkedList {
  constructor() {
    this.head = null;
    this.tail = null;
    this.length = 0;
  }

  // Add value to the end of the list
  append(value) {
    let node = new Node(value);
    // If the list is empty
    if (this.head === null) {
      this.head = node;
      this.tail = this.head;
    } else {
      node.previous = this.tail;
      this.tail.next = node;
      this.tail = node;
    }

    this.length++;
    return this;
  }

  // Add value to the beginning of the list
  prepend(value) {
    let node = new Node(value);
    // If the list is empty
    if (this.head === null) {
      this.head = node;
      this.tail = this.head;
    } else {
      this.head.previous = node;
      node.next = this.head;
      this.head = node;
    }

    this.length++;
    return this;
  }

  remove(value) {
    // If the list is empty, return null
    if (this.head === null) { 
      return null;
    }

    let currentNode = this.head;

    // If it is the first element
    if (currentNode.data === value) {
      this.head = currentNode.next;
      // If the removed element is not the only one,
      // make the previous pointer of the new head null
      if (this.head) {
        this.head.previous = null;
      // If the removed element was the only element,
      // point the tail to null as well
      } else {
        this.tail = null;
      }
      this.length--;
      return;
    }

    while (currentNode) {
      if (currentNode.data === value) {
        if (currentNode.previous) {
          currentNode.previous.next = currentNode.next;
        }
        if (currentNode.next) {
          currentNode.next.previous = currentNode.previous;
        // If it's the last element in the list, update tail
        // to point to the previous node
        } else {
          this.tail = currentNode.previous;
        }

        this.length--;
        return;
      }

      currentNode = currentNode.next;
    }
  }

  search(value) {
    let currentNode = this.head;
    while (currentNode) {
      if (currentNode.data === value) {
        return currentNode;
      }
      currentNode = currentNode.next;
    }

    // If the value does not exist, return null
    return null;
  }

  printList() {
    let values = [];
    let currentNode = this.head;

    while (currentNode) {
      values.push(currentNode.data);
      currentNode = currentNode.next;
    }

    console.log(values);
  }
}

Circular linked lists

With circular linked lists, we have the last node also pointing to the first element, creating circularity.

We'll only look at the singly circular linked list for simplicity's sake, so our node will look the same as in the first example:

class Node {
  constructor(data) {
    this.data = data;
    this.next = null;
  }
}

The same example, in a circular linked list fashion:
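In text form, the only difference from the singly linked example is the last pointer. Here is a minimal Python sketch, with the node class rewritten in Python for this example. Since there is no null at the end anymore, the traversal has to stop when it gets back to where it started.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


first, second, third = Node(1), Node(2), Node(3)
first.next = second
second.next = third
third.next = first  # the last node points back to the first one

# Go around exactly once
values = []
current = first
while True:
    values.append(current.data)
    current = current.next
    if current is first:
        break

print(values)           # [1, 2, 3]
print(third.next.data)  # 1
```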

Here is a simple implementation:

class CircularLinkedList {
  constructor() {
    this.head = null;
    this.tail = null;
    this.length = 0;
  }

  // Add value to the "end" of the list
  append(value) {
    let node = new Node(value);
    // If the list is empty
    if (this.head === null) {
      this.head = node;
      this.tail = node;
      // As the only node in the list, it should point to itself
      node.next = node;
    } else {
      // As the "last" node, it should point to the head (this.tail.next)
      node.next = this.tail.next;
      this.tail.next = node;
      this.tail = node;
    }

    this.length++;
    return this;
  }

  // Add value to the beginning of the list
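  prepend(value) {
    let node = new Node(value);
    // If the list is empty, the new node is both head and tail,
    // and it should point to itself
    if (this.head === null) {
      this.head = node;
      this.tail = node;
      node.next = node;
    } else {
      node.next = this.head;
      // Update last node's next pointer to point to the new node
      this.tail.next = node;
      this.head = node;
    }

    this.length++;
    return this;
  }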

  remove(value) {
    // If the list is empty, return null
    if (this.head === null) { 
      return null; 
    }

    // If it is the first element
    if (this.head.data === value) {
      // If it's the only element
      if (this.head.next === this.head) {
        this.head = null;
        this.tail = null;
        this.length--;
        return;
      }
      this.head = this.head.next;
      this.tail.next = this.head;
      this.length--;
      return;
    }

    let currentNode = this.head;
    let prevNode = null;

    // Iterate until you find the value or
    // you don't find it after traversing the whole list
    while (currentNode.data !== value || prevNode === null) {
      if (currentNode.next === this.head) { 
        break; 
      }
      prevNode = currentNode;
      currentNode = currentNode.next;
    }

    if (currentNode.data === value) {
      // The value was found, so the list shrinks by one
      this.length--;
      // If there is a previous node before the element to be removed,
      // update the previous node's next pointer to point to
      // the one after the element to be removed
      // (unlink it)
      if (prevNode) {
        prevNode.next = currentNode.next;
        // If the element to be removed is the last one,
        // update tail to be the previous node
        if (this.tail === currentNode) {
          this.tail = prevNode;
        }
      // If the element to be removed is the first one in the list
      } else {
        // If it's the only one in the list
        if (this.head.next === this.head) {
          this.head = null;
          this.tail = null;
        } else {
          this.head = this.head.next;
          this.tail.next = this.head;
        }
      }
    }
  }

  printList() {
    let nodes = [];
    let currentNode = this.head;
    if (this.head === null) { 
      console.log(nodes); 
      return;
    }

    // Traverse the list once to add the values, 
    // don't go in circles
    do {
      nodes.push(currentNode.data);
      currentNode = currentNode.next;
    } while (currentNode !== this.head);

    console.log(nodes);
  }
}

Time and space complexity

With linked lists, accessing an element takes $O(n)$ time in the worst case. The cost of prepending and appending depends on whether we have a tail pointer. If we have one, both operations are $O(1)$, as we only need to rearrange pointers. But if we don't have a tail pointer, appending an element requires traversing the whole list, so it is an $O(n)$ operation. Removing an element is similar – in the worst case, it is $O(n)$, because we may have to traverse the whole list to find it.

The space complexity is linear – $O(n)$ – because the amount of data to store grows linearly with the number of nodes in the list.
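To make the effect of a tail pointer concrete, here is a minimal sketch. It uses a plain (non-circular) singly linked list, and the ListNode class and the appendWithoutTail and appendWithTail helpers are hypothetical names made up for this example. The first helper also counts how many nodes it has to walk over:

```typescript
class ListNode {
  data: number;
  next: ListNode | null;

  constructor(data: number) {
    this.data = data;
    this.next = null;
  }
}

// Without a tail pointer: walk from the head to the last node, then link.
// Returns the head and the number of nodes we had to step over.
function appendWithoutTail(head: ListNode | null, value: number) {
  const node = new ListNode(value);
  if (head === null) {
    return { head: node, steps: 0 };
  }

  let current = head;
  let steps = 0;
  while (current.next !== null) {
    current = current.next;
    steps++;
  }
  current.next = node;

  return { head, steps };
}

// With a tail pointer: link the new node right after the tail, no traversal.
// Returns the new tail.
function appendWithTail(tail: ListNode, value: number): ListNode {
  const node = new ListNode(value);
  tail.next = node;
  return node;
}

let listHead: ListNode | null = null;
let lastSteps = 0;
for (const value of [1, 2, 3, 4]) {
  const appended = appendWithoutTail(listHead, value);
  listHead = appended.head;
  lastSteps = appended.steps;
}

console.log(lastSteps); // -> 2 (grows with the length of the list)
```

The number of steps keeps growing as the list gets longer, while appendWithTail always does the same small amount of work.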

Interlude: Fast & Slow Pointers

Let's take a quick look at a technique that comes in handy when it comes to working with linked lists.

We can keep two pointers while traversing a linked list: fast and slow. On each iteration, the fast pointer advances two steps, while the slow pointer advances just one.

Finding the middle node of a linked list

When the fast pointer reaches the end of the list, the slow pointer will be at the "middle" node.

Let's see how it might work:

let slow = head;
let fast = head;

while (fast !== null && fast.next !== null) {
  slow = slow.next;
  fast = fast.next.next;
}

We can think of a list like [1, 2, 3, 4, 5] (where each value is a node in the linked list).

Both fast and slow start pointing to the head, that is, 1.

After the first iteration, slow has moved one step to 2, and fast has moved two steps to 3.

After the second iteration, slow is at 3 and fast is at 5. Since fast is now at the last node, its next pointer is null, at which point our loop will stop iterating.

slow will end up pointing to the node with the value 3, which is the middle node.

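With an even number of nodes, there are two candidates for the middle node. For example, with a list like [1, 2, 3, 4], our current implementation will find the middle as 3, the second of the two candidates.

This technique is also useful for detecting cycles in a linked list: if the list has a cycle, the fast pointer will eventually catch up with the slow one.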
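Both uses of the technique can be sketched in a few lines. This is a minimal example that assumes a null-terminated singly linked list, and the ListNode class and the buildList, middleNode, and hasCycle functions are hypothetical names chosen for illustration:

```typescript
class ListNode {
  data: number;
  next: ListNode | null;

  constructor(data: number) {
    this.data = data;
    this.next = null;
  }
}

// Build a null-terminated linked list from an array of values
function buildList(values: number[]): ListNode | null {
  let head: ListNode | null = null;
  let tail: ListNode | null = null;
  for (const value of values) {
    const node = new ListNode(value);
    if (tail === null) {
      head = node;
    } else {
      tail.next = node;
    }
    tail = node;
  }
  return head;
}

function middleNode(head: ListNode | null): ListNode | null {
  let slow = head;
  let fast = head;
  while (fast !== null && fast.next !== null) {
    // slow is always behind fast, so it can't be null here
    slow = slow!.next;
    fast = fast.next.next;
  }
  return slow;
}

function hasCycle(head: ListNode | null): boolean {
  let slow = head;
  let fast = head;
  while (fast !== null && fast.next !== null) {
    slow = slow!.next;
    fast = fast.next.next;
    // The pointers can only meet again if the list loops back on itself
    if (slow === fast) {
      return true;
    }
  }
  return false;
}

console.log(middleNode(buildList([1, 2, 3, 4, 5]))?.data); // -> 3
console.log(middleNode(buildList([1, 2, 3, 4]))?.data); // -> 3

const first = new ListNode(1);
const second = new ListNode(2);
const third = new ListNode(3);
first.next = second;
second.next = third;
third.next = first; // close the loop

console.log(hasCycle(first)); // -> true
console.log(hasCycle(buildList([1, 2, 3]))); // -> false
```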

Chapter Seven: Trees

Let’s take a look at a non-linear data structure that is pretty familiar to many developers: trees.

Whether familiarity breeds contempt or not is arguable, so let's start with the simplest component of a tree: a node.

Trees, like linked lists, are made up of nodes. The simplest version of a tree is just the root node which doesn't have any edges (links) pointing to it; that is, it has no parent nodes. It is the starting point, in a way.

A tree can only have one root node, and when you think about it, if there are $n$ nodes in a tree, that means there are $n - 1$ edges (links) because there is no edge (link) pointing to the root node.

If you've looked at a tree long enough, you might've had a moment of epiphany: a tree has smaller trees within itself. A branch may as well be a trunk, having other branches for the little tree it constitutes.

The tree data structure is like this: it is recursive. A child node can be the root of a subtree.

Two terms that are important when it comes to a tree node are depth and height.

The depth of a node is how far away it is from the root node (how many edges (links) it takes to travel from the root node to it). The height of a node is how far away it is from its furthest leaf node (a leaf being a node that has no children).

Note: The height of the root node is the same as the height of the whole tree.

A balanced tree is one where the heights of the left and right subtrees of every node differ by at most 1.
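To make depth, height, and balance concrete, here is a minimal sketch. It assumes a node shape with val, left, and right fields, and the helper names heightOf, depthOf, and isBalanced are made up for this example. Heights and depths are counted in edges, so an empty subtree gets a height of -1:

```typescript
class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number, left: TreeNode | null = null, right: TreeNode | null = null) {
    this.val = val;
    this.left = left;
    this.right = right;
  }
}

// Number of edges on the longest path from this node down to a leaf
function heightOf(node: TreeNode | null): number {
  if (node === null) {
    return -1;
  }
  return 1 + Math.max(heightOf(node.left), heightOf(node.right));
}

// Number of edges from the root down to the node holding target (-1 if absent)
function depthOf(root: TreeNode | null, target: number, depth: number = 0): number {
  if (root === null) {
    return -1;
  }
  if (root.val === target) {
    return depth;
  }
  const inLeft = depthOf(root.left, target, depth + 1);
  return inLeft !== -1 ? inLeft : depthOf(root.right, target, depth + 1);
}

// Every node's left and right subtree heights may differ by at most 1
function isBalanced(node: TreeNode | null): boolean {
  if (node === null) {
    return true;
  }
  const difference = Math.abs(heightOf(node.left) - heightOf(node.right));
  return difference <= 1 && isBalanced(node.left) && isBalanced(node.right);
}

//       8
//      / \
//     3   10
//    / \    \
//   1   6    14
const sampleTree = new TreeNode(
  8,
  new TreeNode(3, new TreeNode(1), new TreeNode(6)),
  new TreeNode(10, null, new TreeNode(14))
);

// A "tree" that is really just a chain: 1 -> 2 -> 3
const chain = new TreeNode(1, null, new TreeNode(2, null, new TreeNode(3)));

console.log(heightOf(sampleTree)); // -> 2
console.log(depthOf(sampleTree, 14)); // -> 2
console.log(isBalanced(sampleTree)); // -> true
console.log(isBalanced(chain)); // -> false
```

Calling heightOf inside isBalanced keeps the sketch short, but it recomputes heights many times. It's fine for illustration, not for large trees.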

Binary trees, binary search trees (BSTs)

A binary tree is a tree where each node has at most two children. That is, a node can have a left child node and a right child node, and no more.

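The maximum number of nodes in a binary tree is $2^{h + 1} - 1$ where $h$ is the height of the tree, counted in edges as defined above (in other words, $2^L - 1$ for a tree with $L = h + 1$ levels). This is where the binary of the binary tree makes sense: each level can hold at most twice as many nodes as the one above it, so the counts follow the powers of $2$.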

For example, the number of nodes on the first level (the 0th level) is $2^0 = 1$, which is just the root node. The second level has at most 2 nodes: $2^1 = 2$ (remember that we're counting from $0$, so the second level is $1$).
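As a quick check of the total (writing $L$ for the number of levels, a symbol used here just for illustration), adding up the maximum number of nodes on levels $0$ through $L - 1$ gives a geometric series:

$$\sum_{i=0}^{L-1} 2^i = 2^0 + 2^1 + \dots + 2^{L-1} = 2^L - 1$$

For instance, a binary tree with $3$ completely filled levels holds $1 + 2 + 4 = 7 = 2^3 - 1$ nodes.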

A binary search tree is a binary tree where the values smaller than the node go to its left and those greater than it go to its right:

$$\text{left children } \lt \text{ node } \lt \text{ right children}$$

Here is an example:

We can define a tree node like this:

class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number, left?: TreeNode | null, right?: TreeNode | null) {
    this.val = val;
    this.left = (left === undefined ? null : left);
    this.right = (right === undefined ? null : right);
  }
}

Inserting into a binary search tree

If we want to insert a new node into a binary search tree, we need to insert it into its proper place to keep the properties of a BST intact.

Recursive solution:

function insertIntoBST(root: TreeNode | null, val: number) {
  if (root === null) {
    return new TreeNode(val);
  }

  if (val < root.val) {
    root.left = insertIntoBST(root.left, val);
  } else {
    root.right = insertIntoBST(root.right, val);
  }

  return root;
}

Here, we traverse the tree until we find a space (a null position) for our value that is waiting to be a TreeNode. We start with the root node. If the value of the node-to-be-inserted is less than the value of the root node, we go left (passing root.left as the root argument to the function). Otherwise, we go right (passing root.right as the root argument).

Time and space complexity

The time complexity is $O(h)$ where $h$ is the height of the tree. On each level in the tree, we either go left or right, so we don't necessarily visit every single node. The space complexity is also $O(h)$ because we use recursion, creating a new stack frame for each function call.

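Note that in the worst case, when the tree is completely unbalanced (for example, when every node has only a right child), the height is $n - 1$, so the time and space complexity become $O(n)$.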
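We can see how much the insertion order matters with a small experiment. This sketch repeats a simplified TreeNode class and the recursive insertIntoBST so that it runs on its own, and adds hypothetical heightOf and buildBST helpers. The height is counted in edges:

```typescript
class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number) {
    this.val = val;
    this.left = null;
    this.right = null;
  }
}

function insertIntoBST(root: TreeNode | null, val: number): TreeNode {
  if (root === null) {
    return new TreeNode(val);
  }

  if (val < root.val) {
    root.left = insertIntoBST(root.left, val);
  } else {
    root.right = insertIntoBST(root.right, val);
  }

  return root;
}

// Number of edges on the longest path from the node down to a leaf
function heightOf(node: TreeNode | null): number {
  if (node === null) {
    return -1;
  }
  return 1 + Math.max(heightOf(node.left), heightOf(node.right));
}

// Insert the values one by one, in the given order
function buildBST(values: number[]): TreeNode | null {
  let root: TreeNode | null = null;
  for (const value of values) {
    root = insertIntoBST(root, value);
  }
  return root;
}

// Already sorted input: every node only ever gets a right child
const sortedRoot = buildBST([1, 2, 3, 4, 5, 6, 7]);
// The same values in an order that fills each level
const mixedRoot = buildBST([4, 2, 6, 1, 3, 5, 7]);

console.log(heightOf(sortedRoot)); // -> 6
console.log(heightOf(mixedRoot)); // -> 2
```

With the same seven values, the sorted input produces a chain whose height is $n - 1$, while the other order produces a tree of height $2$.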

Iterative solution:

We can also do it iteratively, using pointers only:

function insertIntoBST(root: TreeNode | null, val: number) {
  if (root === null) {
    return new TreeNode(val);
  }

  let prevNode: TreeNode | null = null;
  let currentNode: TreeNode | null = root;

  while (currentNode !== null) {
    prevNode = currentNode;
    if (val < currentNode.val) {
      currentNode = currentNode.left;
    } else {
      currentNode = currentNode.right;
    }
  }

  if (prevNode) {
    if (val < prevNode.val) {
      prevNode.left = new TreeNode(val);
    } else {
      prevNode.right = new TreeNode(val);
    }
  }

  return root;
}

Here, we do the same thing: iterating until we find the correct place, but also keeping track of the parent node. Then, we insert the node as either the left or the right child of the parent, depending on its value.

Time and space complexity

The time complexity is again $O(h)$ (or if the tree is unbalanced, $O(n)$) for the same reason as in the recursive solution. But the space complexity is constant – $O(1)$ – as we only use pointers.

Deleting from a binary search tree

The challenging thing when deleting a node from a BST is keeping the BST as a BST. All smaller values should still go to the root node's left subtree, and all those that are larger should go to the root node's right subtree.

Let's take a look at how we might do it in TypeScript:

function deleteNode(root: TreeNode | null, key: number) {
  if (root === null) {
    return root;
  }

  if (key < root.val) {
    root.left = deleteNode(root.left, key);
  } else if (key > root.val) {
    root.right = deleteNode(root.right, key);
  } else {
    // Node-to-be-deleted has no children
    if (root.left === null && root.right === null) {
      return null;
    } 

    // If either the left or the right child exists,
    // return the one that exists as the new child 
    // of the parent node (of the node-to-be-deleted)
    if (root.left === null || root.right === null) {
      return root.left ? root.left : root.right;
    }

    // If both children exist, traverse the left subtree, get its maximum value...
    let currentNode = root.left;

    while (currentNode.right !== null) {
      currentNode = currentNode.right;
    }

    // ...copy that value into the node-to-be-deleted
    root.val = currentNode.val;
    // ...then apply the recursion to the left subtree to get rid of the duplicate value
    root.left = deleteNode(root.left, root.val);
  }

  return root;
}

We traverse the tree until we find the node to be deleted. Once we find it, there are several things to do.

In the case where it doesn't have any child nodes, we can return null and be done with it.

If it has one child node, we can return the one that exists using the ternary operation (return root.left ? root.left : root.right).

Note: In this case, we're essentially making the root of the subtree the child of the parent node.

For example, in the image, if the node-to-be-deleted is 10 (it has only a right child node, with the value 14), we make 14 the right child of 8. It doesn't break our BST, because the values that are larger than 8 continue to be in the right subtree of 8:

Otherwise, if both the left and right children of the node-to-be-deleted exist, we need to do something different.

In this case, we'll replace the node-to-be-deleted with the largest value in the left subtree.

But, after replacing, we'll have two nodes of the same value in both places, so we need to apply deleteNode itself to the subtree that we've taken our replacement node from.

This is all done to keep the BST a BST. It might be a bit difficult to wrap your head around at first, but NeetCode has a detailed explanation of this problem.

Note that we can use the smallest value in the right subtree instead. In that case, our code would look like this:

let currentNode = root.right;

while (currentNode.left !== null) {
  currentNode = currentNode.left;
}

root.val = currentNode.val;
root.right = deleteNode(root.right, root.val);

Time and space complexity

Similar to inserting into a BST, both the time and space complexity of deleting from a BST will be $O(h)$ where $h$ is the height of the tree.
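To see the three deletion cases in action, here is a self-contained sketch. It repeats a simplified TreeNode and insertIntoBST, uses a slightly more compact restatement of deleteNode (same approach: the largest value in the left subtree), and adds a hypothetical inorderValues helper that lists the values in sorted order:

```typescript
class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number) {
    this.val = val;
    this.left = null;
    this.right = null;
  }
}

function insertIntoBST(root: TreeNode | null, val: number): TreeNode {
  if (root === null) {
    return new TreeNode(val);
  }
  if (val < root.val) {
    root.left = insertIntoBST(root.left, val);
  } else {
    root.right = insertIntoBST(root.right, val);
  }
  return root;
}

function deleteNode(root: TreeNode | null, key: number): TreeNode | null {
  if (root === null) {
    return null;
  }

  if (key < root.val) {
    root.left = deleteNode(root.left, key);
  } else if (key > root.val) {
    root.right = deleteNode(root.right, key);
  } else {
    // No children, or only a right child
    if (root.left === null) {
      return root.right;
    }
    // Only a left child
    if (root.right === null) {
      return root.left;
    }

    // Two children: find the largest value in the left subtree...
    let currentNode = root.left;
    while (currentNode.right !== null) {
      currentNode = currentNode.right;
    }
    // ...copy it into this node, then delete the duplicate below
    root.val = currentNode.val;
    root.left = deleteNode(root.left, root.val);
  }

  return root;
}

function inorderValues(node: TreeNode | null, out: number[] = []): number[] {
  if (node === null) {
    return out;
  }
  inorderValues(node.left, out);
  out.push(node.val);
  inorderValues(node.right, out);
  return out;
}

let bstRoot: TreeNode | null = null;
for (const value of [8, 3, 10, 1, 6, 14, 4, 7, 13]) {
  bstRoot = insertIntoBST(bstRoot, value);
}

bstRoot = deleteNode(bstRoot, 13); // a leaf
console.log(inorderValues(bstRoot)); // -> [1, 3, 4, 6, 7, 8, 10, 14]

bstRoot = deleteNode(bstRoot, 10); // one child (14)
console.log(inorderValues(bstRoot)); // -> [1, 3, 4, 6, 7, 8, 14]

bstRoot = deleteNode(bstRoot, 8); // two children, replaced by 7
console.log(inorderValues(bstRoot)); // -> [1, 3, 4, 6, 7, 14]
```

After every deletion the values still come out in sorted order, which is a handy way to check that the tree is still a BST.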

Traversals

We'll take a brief look at two of the most famous ways to traverse a tree where the order in which we visit the nodes matters: depth-first search and breadth-first search.

1. Depth-First Search (DFS)

In a depth-first search, we traverse through a branch until we get to a leaf node. Then, we backtrack and do the same thing with another branch.

There are three common ways to do a depth-first search:

preorder traversal
inorder traversal
postorder traversal

Preorder traversal:

It goes like this: We first visit the node, then go on to its left subtree, then the right subtree.

node ➞ left subtree ➞ right subtree

We can do a preorder walk recursively:

function preorderWalk(node) {
  if (node === null) {
    return;
  }

  console.log(node.val);
  preorderWalk(node.left);
  preorderWalk(node.right);
}

Inorder traversal:

It goes like this: we first visit the left subtree, then the node, then the right subtree.

left subtree ➞ node ➞ right subtree

Note: For a binary search tree, the inorder traversal gives us the values in sorted order.

We can do an inorder walk recursively as well:

function inorderWalk(node) {
  if (node === null) {
    return;
  }

  inorderWalk(node.left);
  console.log(node.val);
  inorderWalk(node.right);
}

Postorder traversal:

It goes like this: we first visit the left subtree, then the right subtree, and finally the node.

left subtree ➞ right subtree ➞ node

We can do a postorder walk recursively:

function postorderWalk(node) {
  if (node === null) {
    return;
  }

  postorderWalk(node.left);
  postorderWalk(node.right);
  console.log(node.val);
}
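To compare the three orders side by side, here is a small sketch that collects the values into an array instead of logging them. The collectValues helper and the VisitOrder type are names made up for this example, and the TreeNode class is repeated so that the snippet runs on its own:

```typescript
class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number, left: TreeNode | null = null, right: TreeNode | null = null) {
    this.val = val;
    this.left = left;
    this.right = right;
  }
}

type VisitOrder = 'preorder' | 'inorder' | 'postorder';

// The only difference between the three walks is *when* we record the node
function collectValues(node: TreeNode | null, order: VisitOrder, out: number[] = []): number[] {
  if (node === null) {
    return out;
  }

  if (order === 'preorder') { out.push(node.val); }
  collectValues(node.left, order, out);
  if (order === 'inorder') { out.push(node.val); }
  collectValues(node.right, order, out);
  if (order === 'postorder') { out.push(node.val); }

  return out;
}

//       4
//      / \
//     2   6
//    / \ / \
//   1  3 5  7
const walkTree = new TreeNode(
  4,
  new TreeNode(2, new TreeNode(1), new TreeNode(3)),
  new TreeNode(6, new TreeNode(5), new TreeNode(7))
);

console.log(collectValues(walkTree, 'preorder')); // -> [4, 2, 1, 3, 6, 5, 7]
console.log(collectValues(walkTree, 'inorder')); // -> [1, 2, 3, 4, 5, 6, 7]
console.log(collectValues(walkTree, 'postorder')); // -> [1, 3, 2, 5, 7, 6, 4]
```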

2. Breadth-First Search (BFS)

In breadth-first search, we visit the nodes level by level, that is, we visit every node on one level before moving on to the next.

A queue is used when implementing a BFS. Since we don't have edges connecting all the children on one level together, it makes sense to keep them in a queue and visit each one when their time comes. When a node is added to the queue and has not been visited yet, it's called a discovered node.

A simple BFS operation looks like this (which is repeated until the queue is empty):

visit node
enqueue left child
enqueue right child

Note that the breadth-first search is also known as level-order traversal.

A simple example of a level-order traversal in JavaScript might look like this:

function levelOrderWalk(root) {
  if (root === null) {
    return;
  }

  let queue = [];
  queue.push(root);

  while (queue.length > 0) {
    let currentNode = queue[0];

    console.log(currentNode.val);

    if (currentNode.left !== null) {
      queue.push(currentNode.left);
    }

    if (currentNode.right !== null) {
      queue.push(currentNode.right);
    }

    // Remove the current node
    queue.shift();
  }
}

This example is based on Vaidehi Joshi's GitHub Gist.
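As a variation, we can also group the values by level. The sketch below is an assumption-laden alternative rather than a replacement for the version above: the levelOrderValues name is made up, and instead of calling shift (which has to move every remaining element in the array), it keeps an index pointing at the front of the queue:

```typescript
class TreeNode {
  val: number;
  left: TreeNode | null;
  right: TreeNode | null;

  constructor(val: number, left: TreeNode | null = null, right: TreeNode | null = null) {
    this.val = val;
    this.left = left;
    this.right = right;
  }
}

function levelOrderValues(root: TreeNode | null): number[][] {
  const levels: number[][] = [];
  if (root === null) {
    return levels;
  }

  const queue: TreeNode[] = [root];
  // Index of the next node to "dequeue"
  let front = 0;

  while (front < queue.length) {
    // Everything between front and the end of the queue is on the same level
    const levelSize = queue.length - front;
    const level: number[] = [];

    for (let i = 0; i < levelSize; i++) {
      const currentNode = queue[front];
      front++;
      level.push(currentNode.val);

      if (currentNode.left !== null) {
        queue.push(currentNode.left);
      }
      if (currentNode.right !== null) {
        queue.push(currentNode.right);
      }
    }

    levels.push(level);
  }

  return levels;
}

//       4
//      / \
//     2   6
//    / \ / \
//   1  3 5  7
const levelTree = new TreeNode(
  4,
  new TreeNode(2, new TreeNode(1), new TreeNode(3)),
  new TreeNode(6, new TreeNode(5), new TreeNode(7))
);

console.log(levelOrderValues(levelTree)); // -> [[4], [2, 6], [1, 3, 5, 7]]
```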

Chapter Eight: Heap / Priority Queue

It’s now time to take a look at a data structure called a heap, which is a great way to implement an abstract data type called a priority queue. They're so interrelated that priority queues are sometimes referred to as heaps – because heaps are a very efficient way to create a priority queue.

Heap properties

The kind of heap we're interested in is also called a binary heap because it's just a binary tree that has specific properties.

One of them is that it must be a complete binary tree, meaning that all the levels except possibly the last one must be completely filled, and all nodes in the last level should be as far left as possible.

For example, when it comes to shape, this is a complete binary tree:

But heaps must also be either a max heap or a min heap – all the parent nodes must be either greater than or equal to the values of their children (if it's a max heap) or less than or equal to the values of their children (if it's a min heap).

A max heap might look like this:

Note: Unlike in a binary search tree, a left child doesn't have to be less than the right child. Also, we can always have duplicate values in a heap.

A min heap, on the other hand, has the values of parent nodes less than those of their children:

Note: When we have a max heap, the root node will have the maximum value. And, if we have a min heap instead, the root node will have the minimum value.

Heaps with arrays

We can create a heap using an array. Since the root node is the most interesting element with either a maximum or minimum value, it'll be the first element in our array, residing at the 0th index.

What's nice about using an array is that, given a parent node's index $i$, its left child will be at the index $2i + 1$, and its right child will be at the index $2i + 2$.

Given that, a child node at index $i$ will have its parent at the index $\lfloor{\frac{(i - 1)}{2}}\rfloor$.

Note: $\lfloor$ and $\rfloor$ indicate the floor function.
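These index relationships are easy to check in code. The helper names below are hypothetical, and the array is the max heap that is used as an example later in this chapter:

```typescript
function leftChildIndex(i: number): number {
  return 2 * i + 1;
}

function rightChildIndex(i: number): number {
  return 2 * i + 2;
}

function parentIndex(i: number): number {
  return Math.floor((i - 1) / 2);
}

const maxHeap = [42, 19, 36, 17, 3, 25, 1, 2];

// The node at index 1 (19) has its children at indices 3 and 4...
console.log(maxHeap[leftChildIndex(1)]); // -> 17
console.log(maxHeap[rightChildIndex(1)]); // -> 3
// ...and both of them point back to index 1 as their parent
console.log(parentIndex(3), parentIndex(4)); // -> 1 1

// In a max heap, every node except the root is at most as large as its parent
function isMaxHeap(arr: number[]): boolean {
  for (let i = 1; i < arr.length; i++) {
    if (arr[parentIndex(i)] < arr[i]) {
      return false;
    }
  }
  return true;
}

console.log(isMaxHeap(maxHeap)); // -> true
console.log(isMaxHeap([42, 36, 17, 19, 3, 25, 1, 2])); // -> false
```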

One question we might ask at this point is: why should we use an array at all?

The answer lies in the word queue of a priority queue. Since a queue is mainly concerned with the first element (following the FIFO principle), an array can be an ideal choice. In a priority queue, each element has a priority, and the value with the highest priority is dequeued first.

Inserting/removing elements

Let's take a look at how we can add an element to a heap.

We know that we have to add the new element to the leftmost free place on the bottom level (the end of the array), but once we do that, it might violate the max heap or the min heap property. Then, how can we restore the heap-order property?

We'll heapify, of course!

Let's say that we want to add a node with the value 20:

So, heapify is the swapping of nodes until we know that the heap-order property is maintained.

A similar thing happens when we need to remove an element. But since we're mainly concerned with the maximum or the minimum element, we just need to remove the root node. So, how are we going to do that?

We start off by swapping the last element (the rightmost one on the bottom level) with the root. Now we can easily remove the "root," which resides as a leaf node. But we still need to maintain the heap-order property, so we need to heapify again.
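Here is a minimal sketch of both operations for a max heap stored in an array. The MaxHeap class and its insert and extractMax methods are names chosen for this example, not a standard API. The two loops are often called "sifting up" and "sifting down":

```typescript
class MaxHeap {
  values: number[];

  constructor() {
    this.values = [];
  }

  insert(value: number) {
    // Add to the first free place on the bottom level (the end of the array)
    this.values.push(value);
    let i = this.values.length - 1;

    // Sift up: swap with the parent for as long as the parent is smaller
    while (i > 0) {
      const parentIdx = Math.floor((i - 1) / 2);
      if (this.values[parentIdx] >= this.values[i]) {
        break;
      }
      [this.values[parentIdx], this.values[i]] = [this.values[i], this.values[parentIdx]];
      i = parentIdx;
    }
  }

  extractMax(): number | undefined {
    if (this.values.length === 0) {
      return undefined;
    }

    // Swap the root with the last element, then remove the old root
    const lastIdx = this.values.length - 1;
    [this.values[0], this.values[lastIdx]] = [this.values[lastIdx], this.values[0]];
    const max = this.values.pop();

    // Sift down: swap with the larger child for as long as a child is larger
    let i = 0;
    while (true) {
      let largest = i;
      const leftIdx = 2 * i + 1;
      const rightIdx = 2 * i + 2;

      if (leftIdx < this.values.length && this.values[leftIdx] > this.values[largest]) {
        largest = leftIdx;
      }
      if (rightIdx < this.values.length && this.values[rightIdx] > this.values[largest]) {
        largest = rightIdx;
      }
      if (largest === i) {
        break;
      }

      [this.values[i], this.values[largest]] = [this.values[largest], this.values[i]];
      i = largest;
    }

    return max;
  }
}

const heap = new MaxHeap();
for (const value of [42, 19, 36, 17, 3, 25, 1, 2]) {
  heap.insert(value);
}

heap.insert(20);
console.log(heap.values); // -> [42, 20, 36, 19, 3, 25, 1, 2, 17]

console.log(heap.extractMax()); // -> 42
console.log(heap.values); // -> [36, 20, 25, 19, 3, 17, 1, 2]
```

When we insert 20, it first swaps places with 17, then with 19, and stops below 42.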

Heapsort

Even better, if we have a heap and repeatedly heapify it, we can sort an array.

Let's build a max heap first:

function buildMaxHeap(arr: number[]) {
  /*
  Index of the last internal node
  (i.e., the parent of the last leaf node,
   or, the last non-leaf node).
  The last leaf node will reside at index arr.length - 1,
  so, we're getting its parent using the formula mentioned above:
  Math.floor((lastLeafIdx - 1) / 2).
  */
  let i = Math.floor((arr.length - 2) / 2);

  while (i >= 0) {
    heapify(arr, i, arr.length);
    i--;
  }

  return arr;
}

Then, the heapify function:

function heapify(arr: number[], i: number, maxLength: number) {
  while (i < maxLength) {
    let index = i;
    let leftChildIdx = 2 * i + 1;
    let rightChildIdx = leftChildIdx + 1;

    if (leftChildIdx < maxLength && arr[leftChildIdx] > arr[index]) {
      index = leftChildIdx;
    }

    if (rightChildIdx < maxLength && arr[rightChildIdx] > arr[index]) {
      index = rightChildIdx;
    }

    if (index === i) { return; }

    // Swap
    [arr[i], arr[index]] = [arr[index], arr[i]];

    i = index;
  }
}

With a given index i, we get the indices of its left and right children, and if those indices are within bounds, we check whether either child is larger than the node at i. If so, we set index to the index of the largest child and swap the two nodes. Then, we continue with that new index, assigning it to i.

Now, heapify is nice and all, but how can we actually use it for sorting?

function heapSort(arr: number[]) {
  buildMaxHeap(arr);

  let lastElementIdx = arr.length - 1;

  while (lastElementIdx > 0) {
    [arr[0], arr[lastElementIdx]] = [arr[lastElementIdx], arr[0]];

    heapify(arr, 0, lastElementIdx);
    lastElementIdx--;
  }

  return arr;
}

Note that our max heap [42, 19, 36, 17, 3, 25, 1, 2] won't change when used in the buildMaxHeap function, as it's already a max heap! But if it were to have 17 as the right child of 42, then 17 would have 25 as a child, which breaks the heap-order property. So, using buildMaxHeap with this broken version will correctly swap the 17 and 25, making it a max heap:

buildMaxHeap([42, 36, 17, 19, 3, 25, 1, 2]);

// -> [42, 36, 25, 19, 3, 17, 1, 2]

In heapSort, with our newly built max heap, we'll start with swapping the first and last nodes. Then, we'll keep heapifying until we get all the elements in their place. If we use it with our very own max heap, we can see that it returns the sorted array:

heapSort([42, 19, 36, 17, 3, 25, 1, 2]);
// -> [1, 2, 3, 17, 19, 25, 36, 42]

The examples are adapted from Vaidehi Joshi's article.

Time and space complexity

Heap sort, nice sorting algorithm that it is, runs in $O(n \log n)$ time.

In this example, building the max heap starts from the last non-leaf node and goes up to the root node, each time calling heapify. The heapify function has a time complexity of $O(\log n)$ as we're working with a binary tree, and in the worst case, we get to do it for all the levels. Since we do it about $n / 2$ times, overall, buildMaxHeap takes at most $O(n \log n)$ time.

In heapSort, we're swapping the first and last elements, and heapifying as we go through each element, so this is also overall an $O(n \log n)$ operation, which makes the time complexity of heapSort $O(n \log n)$.

Note: $O(n \log n)$ is a loose upper bound for buildMaxHeap. A tighter analysis shows that building the heap bottom-up like this takes $O(n)$ time, because most nodes sit near the bottom of the tree and only move down a level or two.

Since there is no use of auxiliary space, the space complexity is constant, $O(1)$.

Chapter Nine: Backtracking

Let's start with admitting this one fact: backtracking is hard. Or rather, understanding it the first time is hard. Or, it's one of those concepts that you think you've grasped, only to realize later that you actually haven't.

We'll focus on one problem of finding the subsets of an array, but before that, let's imagine that we're walking along a path.

Then, we reach a fork. We pick one of the paths, and walk.

Then, we reach another fork in the path. We pick one of the paths again, and go on walking, then we reach a dead end. So, we backtrack to the last point we had a fork, then go through the other path that we didn't choose the first time.

Then we reach another dead end. So, we backtrack once more and realize that there are no other paths we can go from there. So we backtrack again, and explore the other path we didn't choose the first time we came to this point.

We reach yet another dead end, so we backtrack. We see that there are no more paths to explore, so we backtrack once more.

Now, we're at our starting point. There are no more paths left to explore, so we can stop walking.

It was a nice but tiring walk, and it went like this:

Now, let's take a look at a LeetCode problem.

Subsets

The description for Subsets says:

Given an integer array nums of unique elements, return all possible subsets (the power set).

The solution set must not contain duplicate subsets. Return the solution in any order.

For example:

Input: nums = [1, 2, 3]
Output: [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]

Or:

Input: nums = [0]
Output: [[], [0]]

Before diving into the solution code, let's take a look at how backtracking will work in this case. Let's call the nums array items instead:

For each item in items, we have initially two choices: to include the item, or not to include it.

At each level of this decision tree, we decide whether to include the next item in items. With $n$ items, we have $2^n$ possible subsets in total.

Let's simplify the example a bit, and say that items is now ['a', 'b'] (We'll ignore the problem specifics for now).

In this case, we can use backtracking like this:

function subsets(items: string[]) {
  let result: string[][] = [];
  let currentSubset: string[] = [];

  function backtrack(idx: number) {
    if (idx >= items.length) {
      result.push([...currentSubset]);
      return;
    }

    currentSubset.push(items[idx]);
    backtrack(idx + 1);

    currentSubset.pop();
    backtrack(idx + 1);
  }

  backtrack(0);

  return result;
}

console.log(subsets(['a', 'b']));
// -> [['a', 'b'], ['a'], ['b'], []]

Well, it looks simple at first glance, but what's going on?

One thing to notice is that we pop from the currentSubset, then call backtrack again. In our example of walking, that's the part where we go back to our previous point and continue our walk along the other path.

In the first animation, we indicated a dead end with a cross mark, and in this case, a dead end is the base case we reach.

It might still be tough to understand, so let's add some helpful console.logs, and see the output:

function subsets(items: string[]) {
  let result: string[][] = [];
  let currentSubset: string[] = [];

  function backtrack(idx: number) {
    console.log(`======= this is backtrack(${arguments[0]}) =======`)
    if (idx >= items.length) {
      console.log(`idx is ${idx}, currentSubset is [${currentSubset}], adding it to result...`);
      result.push([...currentSubset]);
      console.log(`backtrack(${arguments[0]}) is returning...\n`)
      return;
    }

    currentSubset.push(items[idx]);
    console.log(`added ${items[idx]} to currentSubset, inside backtrack(${arguments[0]})`);
    console.log(`calling backtrack(${idx + 1})...`)
    backtrack(idx + 1);

    let item = currentSubset.pop();
    console.log(`popped ${item} from currentSubset, inside backtrack(${arguments[0]})`);
    console.log(`calling backtrack(${idx + 1})...`)
    backtrack(idx + 1);

    console.log(`******* done with backtrack(${arguments[0]}) *******\n`);
  }

  backtrack(0);

  return result;
}

console.log(subsets(['a', 'b']));

The output looks like this:

======= this is backtrack(0) =======
added a to currentSubset, inside backtrack(0)
calling backtrack(1)...
======= this is backtrack(1) =======
added b to currentSubset, inside backtrack(1)
calling backtrack(2)...
======= this is backtrack(2) =======
idx is 2, currentSubset is [a,b], adding it to result...
backtrack(2) is returning...

popped b from currentSubset, inside backtrack(1)
calling backtrack(2)...
======= this is backtrack(2) =======
idx is 2, currentSubset is [a], adding it to result...
backtrack(2) is returning...

******* done with backtrack(1) *******

popped a from currentSubset, inside backtrack(0)
calling backtrack(1)...
======= this is backtrack(1) =======
added b to currentSubset, inside backtrack(1)
calling backtrack(2)...
======= this is backtrack(2) =======
idx is 2, currentSubset is [b], adding it to result...
backtrack(2) is returning...

popped b from currentSubset, inside backtrack(1)
calling backtrack(2)...
======= this is backtrack(2) =======
idx is 2, currentSubset is [], adding it to result...
backtrack(2) is returning...

******* done with backtrack(1) *******

******* done with backtrack(0) *******

[ [ 'a', 'b' ], [ 'a' ], [ 'b' ], [] ]

If you noticed, Add 'a'? and Go ahead? arrows on the first level are calls to backtrack(0).

Add 'b'? and Go ahead? arrows on the second level are calls to backtrack(1).

backtrack(2) calls are when we reach the "dead ends". In those cases, we add currentSubset to the result. We always reach the base case in a backtrack(2) call because that's the only time idx equals items.length.

Note: We modified the function in the above examples to work with strings, but in the actual solution we'll only deal with numbers, so in TypeScript, result and currentSubset will look like this:

let result: number[][] = [];
let currentSubset: number[] = [];

Also, the function parameter and return types are different:

function subsets(nums: number[]): number[][] { ... }

Otherwise, everything stays the same.
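Putting those changes together, the number version might look like the sketch below. It is the same backtracking function with only the types swapped, and the allSubsets variable is just a name for this example:

```typescript
function subsets(nums: number[]): number[][] {
  const result: number[][] = [];
  const currentSubset: number[] = [];

  function backtrack(idx: number) {
    // Base case: a decision has been made for every item
    if (idx >= nums.length) {
      result.push([...currentSubset]);
      return;
    }

    // Choice 1: include nums[idx]
    currentSubset.push(nums[idx]);
    backtrack(idx + 1);

    // Choice 2: undo that decision and go on without nums[idx]
    currentSubset.pop();
    backtrack(idx + 1);
  }

  backtrack(0);

  return result;
}

const allSubsets = subsets([1, 2, 3]);

console.log(allSubsets.length); // -> 8
console.log(allSubsets);
// -> [[1, 2, 3], [1, 2], [1, 3], [1], [2, 3], [2], [3], []]
```

The order differs from the example output in the problem description, which is fine because the solution can be returned in any order.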

Time and space complexity

A subset is, in the worst case, of length $n$, which is the length of our input. We'll have $2^n$ subsets, and since we also use the spread operator in our example to copy currentSubset, the time complexity will be $O(n \cdot 2^n)$. The space complexity is $O(n \cdot 2^n)$ as well: the recursive call stack has a depth of $n$, and result holds $2^n$ subsets of up to $n$ items each.

Chapter Ten: Tries

The trie data structure gets its name from the word retrieval – and it's usually pronounced as "try," so that we don't get confused with another familiar and friendly data structure, "tree."

But a trie is still a tree (or tree-like) data structure whose nodes usually store individual letters. So, by traversing the nodes in a trie, we can retrieve strings.

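Tries are useful for applications such as autocompletion and spellchecking – and the larger our trie is, the less work we have to do for inserting a new value, because a new word can reuse the nodes of any prefix it shares with the words already stored.

Note: In this chapter, each node will keep its children in a fixed-size array. Using arrays is not very memory-efficient, but for now, we'll stick to the array implementation.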

First, let's see what a trie looks like:

In this trie, we can retrieve the strings "sea" and "see" – but not "sew", for example.

There is a lot going on, but we can try to understand it piece by piece.

Let's look at a trie node.

We'll create a TrieNode class that has children, which is an array of length 26 (so that each index corresponds to a letter in the English alphabet), and a flag variable isEndOfWord to indicate whether that node represents the last character of a word:

class TrieNode {
  children: (TrieNode | null)[];
  isEndOfWord: boolean;

  constructor() {
    this.children = Array.from({ length: 26 }, () => null);
    this.isEndOfWord = false;
  }
}

We're initializing children with null values. As we add a character to our trie, the index that corresponds to that character will be filled.

Note: We're not storing the actual character itself in this implementation – it's implicit in the usage of indices.
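The mapping between characters and indices is plain character code arithmetic. As a small illustration (charToIndex and indexToChar are hypothetical helpers, and lowercase English letters are assumed):

```typescript
// 'a' maps to 0, 'b' to 1, ..., 'z' to 25
function charToIndex(char: string): number {
  return char.charCodeAt(0) - 'a'.charCodeAt(0);
}

// The reverse direction: recover the letter from its index
function indexToChar(idx: number): string {
  return String.fromCharCode('a'.charCodeAt(0) + idx);
}

console.log(charToIndex('a')); // -> 0
console.log(charToIndex('s')); // -> 18
console.log(charToIndex('z')); // -> 25
console.log(indexToChar(4)); // -> 'e'
```

So, storing the word "sea" means following index 18, then index 4, then index 0.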

In a trie, we start with an empty root node.

class Trie {
  root: TrieNode;

  constructor() {
    this.root = new TrieNode();
  }
  // ...
}

To insert a word, we're going to loop through each character, and create a new TrieNode at the corresponding index if there isn't one already.

insert(word: string) {
  let currentNode = this.root;
  for (const char of word) {
    let idx = char.charCodeAt(0) - 'a'.charCodeAt(0);
    if (currentNode.children[idx] === null) {
      currentNode.children[idx] = new TrieNode();
    }
    currentNode = currentNode.children[idx];
  }

  currentNode.isEndOfWord = true;
}

Once we reach the node that indicates the last character of the word we inserted, we also mark the isEndOfWord variable as true.

Note: word is going to be lowercase in these examples – otherwise, we have to convert it, such as:

word = word.toLowerCase();

To check whether a word exists in the trie, we'll do a similar thing. We'll follow the nodes for each character, and if we reach the last one and it has isEndOfWord marked as true, that means we've found the word:

search(word: string) {
  let currentNode = this.root;
  for (const char of word) {
    let idx = char.charCodeAt(0) - 'a'.charCodeAt(0);
    if (currentNode.children[idx] === null) {
      return false;
    }      
    currentNode = currentNode.children[idx];
  }

  return currentNode.isEndOfWord;
}

Note: If we find the word we're looking for, then it's called a search hit. Otherwise, we have a search miss and the word doesn't exist in our trie.

Removing a word is a bit more challenging. Let's say that we want to remove the word "see." But, there is also another word "sea," with the same prefix ('s' and 'e'). So, we should remove only the nodes that we're allowed to.

For this reason, we'll define a recursive function. Once we reach the last character of the word we want to remove, we'll back up and remove the characters we can remove:

const removeRecursively = (node: TrieNode | null, word: string, depth: number) => {
  if (node === null) {
    return null;
  }

  if (depth === word.length) {
    if (node.isEndOfWord) {
      node.isEndOfWord = false;
    }
    if (node.children.every(child => child === null)) {
      node = null;
    }

    return node;
  }

  let idx = word[depth].charCodeAt(0) - 'a'.charCodeAt(0);
  node.children[idx] = removeRecursively(node.children[idx], word, depth + 1);

  if (node.children.every(child => child === null) && !node.isEndOfWord) {
    node = null;
  }

  return node;
}

depth indicates the index of the current character in the word, which is also the depth we've reached in the trie.

Once depth is equal to the word's length (one past the last character), we check if it's the end of the word. If that's the case, we'll mark it as false now, because that word won't exist from here on. Then, we can only mark the node as null if it doesn't have any children (in other words, if all of them are null). We'll apply this logic to each child node recursively until the word is removed as far as it can be removed.

Here is the final example implementation of a trie:

class TrieNode {
  children: (TrieNode | null)[];
  isEndOfWord: boolean;

  constructor() {
    this.children = Array.from({ length: 26 }, () => null);
    this.isEndOfWord = false;
  }
}

class Trie {
  root: TrieNode;

  constructor() {
    this.root = new TrieNode();
  }

  insert(word: string) {
    let currentNode = this.root;
    for (const char of word) {
      let idx = char.charCodeAt(0) - 'a'.charCodeAt(0);
      if (currentNode.children[idx] === null) {
        currentNode.children[idx] = new TrieNode();
      }
      currentNode = currentNode.children[idx]!;
    }

    currentNode.isEndOfWord = true;
  }

  search(word: string) {
    let currentNode = this.root;
    for (const char of word) {
      let idx = char.charCodeAt(0) - 'a'.charCodeAt(0);
      if (currentNode.children[idx] === null) {
        return false;
      }      
      currentNode = currentNode.children[idx];
    }

    return currentNode.isEndOfWord;
  }

  remove(word: string) {
    const removeRecursively = (node: TrieNode | null, word: string, depth: number) => {
      if (node === null) {
        return null;
      }

      if (depth === word.length) {
        if (node.isEndOfWord) {
          node.isEndOfWord = false;
        }
        if (node.children.every(child => child === null)) {
          node = null;
        }

        return node;
      }

      let idx = word[depth].charCodeAt(0) - 'a'.charCodeAt(0);
      node.children[idx] = removeRecursively(node.children[idx], word, depth + 1);

      if (node.children.every(child => child === null) && !node.isEndOfWord) {
        node = null;
      }

      return node;
    }

    removeRecursively(this.root, word, 0);
  }
}

let t = new Trie();

t.insert('sea');
t.insert('see');

console.log(t.search('sea')); // true
console.log(t.search('see')); // true

console.log(t.search('hey')); // false
console.log(t.search('sew')); // false

t.remove('see');

console.log(t.search('see')); // false 
console.log(t.search('sea')); // true

Time and space complexity

The time complexity of creating a trie is going to be $O(m * n)$ where $m$ is the length of the longest word and $n$ is the total number of words. Inserting, searching for, or deleting a single word is $O(a)$ where $a$ is the length of that word, because we visit one node per character and the size of the alphabet is constant.

When it comes to space complexity, in the worst case, each node can have children for all the characters in the alphabet we're representing. But, the size of the alphabet is constant, so the growth of storage needs will be proportionate to the number of nodes we have, which is $O(n)$ where $n$ is the number of nodes.

Chapter Eleven: Graphs

A graph is probably the data structure that everyone is familiar with, regardless of their profession or interests.

Graph theory is a very broad topic, but we'll simply look at some of the main ingredients of what makes a graph and how to represent it, as well as basic graph traversals.

In a graph, there are two main components: vertices (or nodes) and edges that connect those vertices.

Note: Here, we're going to use "vertex" and "node" interchangeably. The terms "adjacent vertices" and "neighbors" are used interchangeably as well.

A graph can be directed or undirected. With a directed edge, we have an origin and a destination vertex. On the other hand, an undirected edge is bidirectional: the origin and destination are not fixed.

Note: There might also be mixed graphs that have both directed and undirected edges.

A graph can also be weighted or unweighted. In a weighted graph, each edge has a weight, usually representing the cost of going from one vertex to the other.

We can define a graph like this:

$$G = (V, \ E)$$

$V$ is a set of vertices, and $E$ is a set of edges.

For example, if we have a directed graph like this:

Then, we have the vertices:

$$V = \{A, \ B, \ C, \ D\}$$

And, the edges are:

$$E = \{(A, \ B), \ (A, \ C), \ (C, \ B), \ (C, \ D), \ (D, \ C)\}$$

If we have an undirected graph such as this one:

We have the same vertices:

$$V = \{A, \ B, \ C, \ D\}$$

But our edges can look like this:

$$E = \{\{B, \ A\}, \{A, \ C\}, \{C, \ B\}, \{D, \ C\}\}$$

Note: We use parentheses when it comes to directed edges, but curly braces with undirected edges as there is no direction from one vertex to the other.
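As a rough sketch, the two example graphs above can be written down directly in TypeScript. The names `Vertex`, `Edge`, `sameDirectedEdge`, and `sameUndirectedEdge` are made up for this illustration and aren't part of the implementation later in the chapter:

```typescript
// Hypothetical types for illustration only.
type Vertex = 'A' | 'B' | 'C' | 'D';
type Edge = [Vertex, Vertex];

const V = new Set<Vertex>(['A', 'B', 'C', 'D']);

// Directed edges: order matters, [origin, destination].
const directedE: Edge[] = [
  ['A', 'B'], ['A', 'C'], ['C', 'B'], ['C', 'D'], ['D', 'C'],
];

// Undirected edges: the pair is unordered, so ['B', 'A'] is the same edge as ['A', 'B'].
const undirectedE: Edge[] = [
  ['B', 'A'], ['A', 'C'], ['C', 'B'], ['D', 'C'],
];

const sameDirectedEdge = (e1: Edge, e2: Edge): boolean =>
  e1[0] === e2[0] && e1[1] === e2[1];

const sameUndirectedEdge = (e1: Edge, e2: Edge): boolean =>
  sameDirectedEdge(e1, e2) || (e1[0] === e2[1] && e1[1] === e2[0]);

// Every endpoint of every edge should be a member of V.
const allEndpointsKnown = [...directedE, ...undirectedE].every(
  ([u, v]) => V.has(u) && V.has(v)
);

console.log(allEndpointsKnown);                          // true
console.log(sameDirectedEdge(['C', 'D'], ['D', 'C']));   // false
console.log(sameUndirectedEdge(['C', 'D'], ['D', 'C'])); // true
```

This is why the directed example needs both $(C, \ D)$ and $(D, \ C)$, while the undirected one needs only a single $\{D, \ C\}$.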

When two vertices share an edge, they are adjacent to each other. The degree of a vertex is the number of vertices adjacent to it. We can also define the degree as the number of edges coming out of the vertex. For example, in the above image, the vertex A has a degree of 2.
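To make the definition concrete, here is a minimal sketch that counts degrees from the undirected edge set shown earlier. The function name `degrees` is an assumption for this example, and it assumes a graph without self-loops:

```typescript
// Undirected edges of the example graph, as unordered pairs.
const exampleEdges: [string, string][] = [
  ['B', 'A'], ['A', 'C'], ['C', 'B'], ['D', 'C'],
];

// Count how many edges touch each vertex (assumes there are no self-loops).
function degrees(edgeList: [string, string][]): Map<string, number> {
  const result = new Map<string, number>();
  for (const [u, v] of edgeList) {
    result.set(u, (result.get(u) ?? 0) + 1);
    result.set(v, (result.get(v) ?? 0) + 1);
  }
  return result;
}

const degreeOf = degrees(exampleEdges);

console.log(degreeOf.get('A')); // 2
console.log(degreeOf.get('C')); // 3
console.log(degreeOf.get('D')); // 1
```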

A simple path is one in which we don't repeat any vertices while traversing the graph.

An example might look like this:

A cycle is a simple path, except that we end up at the vertex we started with:

Representing graphs

When it comes to representing graphs, there are several ways to do it, and we'll look at three of them: an edge list, an adjacency matrix, and an adjacency list.

Edge List

We can simply put all the edges in an array:

[ [A, B], [A, C], [B, C], [C, D] ]

But to find an edge in an edge list, we'll have to iterate through them, so it will have $O(E)$ time complexity, where in the worst case, we'll search the whole list to find an edge. Similarly, it needs $O(E)$ amount of space to represent all the edges.
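A small sketch of that linear search might look like the following. The names `EdgePair` and `hasEdgeInList` are hypothetical, and vertices are plain string labels here to keep the example short:

```typescript
type EdgePair = [string, string];

const edgeList: EdgePair[] = [['A', 'B'], ['A', 'C'], ['B', 'C'], ['C', 'D']];

// Linear scan: in the worst case every edge is inspected, hence O(E).
function hasEdgeInList(
  list: EdgePair[],
  v1: string,
  v2: string,
  isDirected = false
): boolean {
  return list.some(([from, to]) =>
    (from === v1 && to === v2) || (!isDirected && from === v2 && to === v1)
  );
}

console.log(hasEdgeInList(edgeList, 'C', 'A'));       // true (treated as undirected)
console.log(hasEdgeInList(edgeList, 'C', 'A', true)); // false (only A -> C is listed)
console.log(hasEdgeInList(edgeList, 'A', 'D'));       // false
```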

Adjacency Matrix

The adjacency matrix for our example might look like this:

$$\left[\begin{matrix}& A & B & C & D \\A & 0 & 1 & 1 & 0 \\B & 1 & 0 & 1 & 0 \\C & 1 & 1 & 0 & 1 \\D & 0 & 0 & 1 & 0\end{matrix}\right]$$

Each row is for a vertex, and the matching column shows the relationship between those vertices. For example, the vertex A doesn't have an edge pointing to D, so the cell that matches A and D is 0. On the other hand, A is connected to B and C, so those cells have the value 1.

Note: If the graph is weighted, we can simply put the weight instead of 1, and when there is no edge, the value can stay 0.

When a graph has no self-loops, as in our example, its adjacency matrix has 0s on the "main diagonal."

Let's try implementing it in TypeScript.

We'll start with a minimal graph vertex:

class GraphVertex {
  value: string | number;

  constructor(value: string | number) {
    this.value = value;
  }
}

Now we can define our graph. We'll make it really simple with three properties to hold: matrix to represent the graph as an adjacency matrix, vertices to hold vertices, and isDirected to indicate whether our graph is directed:

class Graph {
  matrix: number[][];
  vertices: GraphVertex[];
  isDirected: boolean;

  constructor(vertices: GraphVertex[], isDirected = true) {
    this.vertices = vertices;
    this.isDirected = isDirected;
    // ...
  }

  // ...
}

Initializing our adjacency matrix might look like this:

this.matrix = Array.from({ length: vertices.length }, () => {
  return Array.from({ length: vertices.length }, () => 0)
});

We'll have an array with the length of vertices. Each item in the array is an array with the length of vertices as well, but filled with zeroes.

In our example with four vertices, the initial adjacency matrix looks like this:

[ [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0] ]

Then, adding an edge is just marking the corresponding value as 1, so that we can represent a connection between two vertices:

this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] = 1;

Note: This implementation assumes that all vertices are distinct.

If we have an undirected graph, we can have it both ways:

if (!this.isDirected) {
  this.matrix[this.vertices.indexOf(v2)][this.vertices.indexOf(v1)] = 1;
}

Removing an edge, in this case, will be just resetting the value to 0:

this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] = 0;

And, checking for the existence of an edge is simply checking whether the corresponding value is 0 or not:

this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] !== 0;

And, here is the whole example with additional methods for adding and removing an edge, checking if there is an edge between two vertices, and checking if a specific vertex is in the graph:

class Graph {
  matrix: number[][];
  vertices: GraphVertex[];
  isDirected: boolean;

  constructor(vertices: GraphVertex[], isDirected = true) {
    this.vertices = vertices;
    this.matrix = Array.from({ length: vertices.length }, () => {
      return Array.from({ length: vertices.length }, () => 0)
    });
    this.isDirected = isDirected;
  }

  addEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] = 1;

    if (!this.isDirected) {
      this.matrix[this.vertices.indexOf(v2)][this.vertices.indexOf(v1)] = 1;
    }
  }

  /* 
  For a weighted graph:

  addEdge(v1: GraphVertex, v2: GraphVertex, weight: number) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] = weight;
    if (!this.isDirected) {
      this.matrix[this.vertices.indexOf(v2)][this.vertices.indexOf(v1)] = weight;
    }
  }
  */

  removeEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] = 0;

    if (!this.isDirected) {
      this.matrix[this.vertices.indexOf(v2)][this.vertices.indexOf(v1)] = 0;
    }
  }

  hasEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    return this.matrix[this.vertices.indexOf(v1)][this.vertices.indexOf(v2)] !== 0;
  }

  getAdjacencyMatrix() {
    return this.matrix;
  }

  _checkVertexIsInGraph(v: GraphVertex) {
    if (!this.vertices.includes(v)) {
      throw new Error('Vertex doesn\'t exist');
    }
  }
}


let a = new GraphVertex('A');
let b = new GraphVertex('B');
let c = new GraphVertex('C');
let d = new GraphVertex('D');

let graph = new Graph([a, b, c, d], false);

graph.addEdge(a, b);
graph.addEdge(a, c);
graph.addEdge(b, c);
graph.addEdge(c, d);

console.log(graph.getAdjacencyMatrix());
// -> [ [0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0] ]

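Reading or writing a cell of an adjacency matrix is $O(1)$ once we know its row and column indices. Note that the example class finds those indices with indexOf, which scans the vertices array and is therefore $O(V)$. The storage needs are $O(V^2)$ where $V$ is the number of vertices.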
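One way to keep edge lookups at constant time on average is to cache each vertex's position in a Map, rather than calling indexOf as the class above does. The following is only a sketch under that assumption: the class name `IndexedMatrixGraph` is invented for this example, and it uses plain string labels instead of GraphVertex objects:

```typescript
class IndexedMatrixGraph {
  private matrix: number[][];
  private indexOfVertex = new Map<string, number>();
  private isDirected: boolean;

  constructor(labels: string[], isDirected = true) {
    this.isDirected = isDirected;
    // Remember each label's row/column once, so later lookups don't scan an array.
    labels.forEach((label, i) => this.indexOfVertex.set(label, i));
    this.matrix = Array.from({ length: labels.length }, () =>
      Array.from({ length: labels.length }, () => 0)
    );
  }

  private idx(label: string): number {
    const i = this.indexOfVertex.get(label);
    if (i === undefined) {
      throw new Error(`Vertex ${label} doesn't exist`);
    }
    return i;
  }

  addEdge(v1: string, v2: string): void {
    this.matrix[this.idx(v1)][this.idx(v2)] = 1;
    if (!this.isDirected) {
      this.matrix[this.idx(v2)][this.idx(v1)] = 1;
    }
  }

  hasEdge(v1: string, v2: string): boolean {
    return this.matrix[this.idx(v1)][this.idx(v2)] !== 0;
  }
}

const indexed = new IndexedMatrixGraph(['A', 'B', 'C', 'D'], false);
indexed.addEdge('A', 'B');
indexed.addEdge('C', 'D');

console.log(indexed.hasEdge('B', 'A')); // true
console.log(indexed.hasEdge('A', 'D')); // false
```

The trade-off is a little extra memory for the Map in exchange for not scanning the vertices array on every operation.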

Adjacency List

In an adjacency list, usually a hashmap or an array of linked lists is used. For example:

let graph = {
  'A': ['B', 'C'],
  'B': ['A', 'C'],
  'C': ['A', 'B', 'D'],
  'D': ['C']
}

Let's see how we can modify our code above to use an adjacency list instead.

Instead of having a matrix which is an array of arrays, we can have a Map that maps the vertices to an array of their neighbors.

We can initialize it as a map that has the vertices as keys, each of which has a value of an empty array for now:

this.list = new Map();
for (const v of vertices) {
  this.list.set(v, []);
}

Adding an edge will be just pushing to the array of corresponding vertex:

this.list.get(v1)!.push(v2);

If our graph is undirected, we can do it both ways here as well:

if (!this.isDirected) {
  this.list.get(v2)!.push(v1);
}

Removing an edge will be deleting that vertex from the array:

this.list.set(v1, this.list.get(v1)!.filter(v => v !== v2));

Checking if an edge exists is just checking the existence of that vertex in the array:

this.list.get(v1)!.includes(v2);

Note: We're using a non-null assertion operator as we’re using TypeScript in these examples. As we'll see below, we first check if the vertex is in the graph. And since we're adding all the vertices in the graph as keys to this.list, we're sure that getting that vertex from the list is not undefined. But TypeScript will warn us because if a key is not found in a Map object, it could potentially return undefined.

Here is our graph:

class Graph {
  list: Map<GraphVertex, GraphVertex[]>;
  vertices: GraphVertex[];
  isDirected: boolean;

  constructor(vertices: GraphVertex[], isDirected = true) {
    this.vertices = vertices;
    this.list = new Map();
    for (const v of vertices) {
      this.list.set(v, []);
    }
    this.isDirected = isDirected;
  }

  addEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    this.list.get(v1)!.push(v2);

    if (!this.isDirected) {
      this.list.get(v2)!.push(v1);
    }
  }

  removeEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    this.list.set(v1, this.list.get(v1)!.filter(v => v !== v2));

    if (!this.isDirected) {
      this.list.set(v2, this.list.get(v2)!.filter(v => v !== v1));
    }
  }

  hasEdge(v1: GraphVertex, v2: GraphVertex) {
    this._checkVertexIsInGraph(v1);
    this._checkVertexIsInGraph(v2);

    return this.list.get(v1)!.includes(v2);
  }

  getAdjacencyList() {
    return this.list;
  }

  _checkVertexIsInGraph(v: GraphVertex) {
    if (!this.vertices.includes(v)) {
      throw new Error('Vertex doesn\'t exist');
    }
  }
}


let a = new GraphVertex('A');
let b = new GraphVertex('B');
let c = new GraphVertex('C');
let d = new GraphVertex('D');

let graph = new Graph([a, b, c, d], false);

graph.addEdge(a, b);
graph.addEdge(a, c);
graph.addEdge(b, c);
graph.addEdge(c, d);

console.log(graph.getAdjacencyList());

/* Output:

Map (4) {
  GraphVertex: { "value": "A" } => [
    GraphVertex: { "value": "B" }, 
    GraphVertex: { "value": "C" }
  ], 
  GraphVertex: { "value": "B" } => [
    GraphVertex: { "value": "A" }, 
    GraphVertex: { "value": "C" }
  ], 
  GraphVertex: { "value": "C" } => [
    GraphVertex: { "value": "A" }, 
    GraphVertex: { "value": "B" }, 
    GraphVertex: { "value": "D" }
  ], 
  GraphVertex: { "value": "D" } => [
    GraphVertex: { "value": "C" }
  ]
} 

*/

Getting the neighbors of a vertex is $O(1)$ because we're just looking up a key in a map. But finding a particular edge can be $O(d)$ where $d$ is the degree of the vertex, because we might need to traverse all the neighbors to find it. In the worst case, $d$ is $V - 1$ where $V$ is the number of vertices in the graph, which happens when that vertex has all the other vertices as its neighbors.

The space complexity can be $O(V + E)$ where $V$ is the number of vertices and $E$ is the number of edges.

Traversals

Continuing with the adjacency list representation, let's now take a look at two (very familiar) ways to traverse a graph: breadth-first search and depth-first search.

But first, we'll modify our graph a little bit. We'll add a new vertex 'E' and update some edges:

let a = new GraphVertex('A');
let b = new GraphVertex('B');
let c = new GraphVertex('C');
let d = new GraphVertex('D');
let e = new GraphVertex('E');


let graph = new Graph([a, b, c, d, e], false);

graph.addEdge(a, b);
graph.addEdge(a, c);
graph.addEdge(b, d);
graph.addEdge(c, e);

The important idea to remember is that there is no hierarchy of vertices, so we don't have a root node.

For a breadth-first or depth-first search, we can use an arbitrary node as a starting point.

Breadth-First Search

With our new graph, a breadth-first search traversal looks like this:

When it comes to breadth-first search, usually a queue is used, and the idea is simple: given a current node, we'll add its unvisited adjacent nodes to the queue first, marking them as visited as we go.

Inside the Graph class, we can implement a bfs method that does just that:

bfs(startNode: GraphVertex) {
  const visited = new Set();
  const queue = [startNode];
  visited.add(startNode);

  while (queue.length > 0) {
    const currentNode = queue.shift();
    // console.log(currentNode);
    this.list.get(currentNode as GraphVertex)!.forEach((node) => {
      if (!visited.has(node)) {
        visited.add(node);
        queue.push(node);
      }
    });
  }
}

If we call the bfs method with a as the starting vertex (graph.bfs(a)), and log currentNode to console each time we go, it's as we expected:

GraphVertex { value: 'A' }
GraphVertex { value: 'B' }
GraphVertex { value: 'C' }
GraphVertex { value: 'D' }
GraphVertex { value: 'E' }

With the adjacency list, using a BFS has $O(V + E)$ time complexity (sum of the vertices and edges) as we're traversing the whole graph.
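If we want the visiting order back as a value instead of logging it, a standalone variant could look like this. It is only a sketch: `AdjacencyList` and `bfsOrder` are names assumed for this example, and the graph is written as a plain object of string labels rather than the Graph class. It also moves a head index through the queue instead of calling shift(), which may need to re-index the array on each call:

```typescript
type AdjacencyList = Record<string, string[]>;

// The modified example graph with the edges A-B, A-C, B-D, and C-E.
const adjacency: AdjacencyList = {
  A: ['B', 'C'],
  B: ['A', 'D'],
  C: ['A', 'E'],
  D: ['B'],
  E: ['C'],
};

function bfsOrder(graph: AdjacencyList, start: string): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];

  // `head` points at the next node to process; nodes before it are done.
  for (let head = 0; head < queue.length; head++) {
    const current = queue[head];
    order.push(current);
    for (const neighbor of graph[current] ?? []) {
      if (!visited.has(neighbor)) {
        visited.add(neighbor);
        queue.push(neighbor);
      }
    }
  }

  return order;
}

console.log(bfsOrder(adjacency, 'A')); // [ 'A', 'B', 'C', 'D', 'E' ]
```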

Depth-First Search

With the same modified graph, a depth-first search looks like this:

With depth-first search there is usually recursion involved as we're traversing through a path until we have visited all the nodes in that path. Once we hit a dead end, we'll backtrack and continue exploring until we have visited all the vertices in the graph.

Again, inside the Graph class, we can add a dfs method:

dfs(startNode: GraphVertex, visited = new Set()) {
  visited.add(startNode);
  // console.log(startNode);
  this.list.get(startNode)!.forEach((node) => {
    if (!visited.has(node)) {
      this.dfs(node, visited);
    }
  });
}

Starting with a node, we check how deep we can go from there. Once we reach a dead end (when the dfs inside forEach returns), we continue checking other neighbors (with forEach) until none is left. We essentially do the same thing until all the vertices are visited.

Logging the output matches our expectation:

GraphVertex { value: 'A' }
GraphVertex { value: 'B' }
GraphVertex { value: 'D' }
GraphVertex { value: 'C' }
GraphVertex { value: 'E' }

The time complexity for a depth-first search traversal of a graph is the same as for BFS, $O(V + E)$.
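Recursion isn't the only option. As a sketch, the same traversal can be written with an explicit stack, which avoids deep call stacks on large graphs. The function name `dfsOrder` and the plain-object graph are assumptions for this example, not part of the Graph class:

```typescript
// The modified example graph with the edges A-B, A-C, B-D, and C-E.
const dfsGraph: Record<string, string[]> = {
  A: ['B', 'C'],
  B: ['A', 'D'],
  C: ['A', 'E'],
  D: ['B'],
  E: ['C'],
};

function dfsOrder(graph: Record<string, string[]>, start: string): string[] {
  const visited = new Set<string>();
  const stack: string[] = [start];
  const order: string[] = [];

  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (visited.has(current)) {
      continue; // a node can be pushed more than once before it is visited
    }
    visited.add(current);
    order.push(current);

    // Push neighbors in reverse so that the first neighbor is explored first,
    // which matches the order of the recursive version.
    const neighbors = graph[current] ?? [];
    for (let i = neighbors.length - 1; i >= 0; i--) {
      if (!visited.has(neighbors[i])) {
        stack.push(neighbors[i]);
      }
    }
  }

  return order;
}

console.log(dfsOrder(dfsGraph, 'A')); // [ 'A', 'B', 'D', 'C', 'E' ]
```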

Chapter Twelve: Dynamic Programming

Dynamic programming (DP) is one of those concepts that is a bit intimidating when you hear it for the first time. But the crux of it is simply breaking problems down into smaller parts and solving them. It’s also about storing those solutions so that we don't have to compute them again.

Breaking problems down into subproblems is nothing new: that's pretty much what problem-solving is all about. What dynamic programming is specifically concerned with is overlapping subproblems, the ones that keep repeating. We want to store the solutions to those subproblems so that we won't be calculating them again each time. Put another way, we want to remember the past so that we won't be condemned to repeat it.

For example, calculating 1 + 1 + 1 + 1 + 1 is very easy if we have already calculated 1 + 1 + 1 + 1. We can just remember the previous solution, and use it:

Calculating the Fibonacci sequence is one of the well-known examples when it comes to dynamic programming. Because we have to calculate the same functions each time for a new number, it lends itself to DP very well.

For example, to calculate fib(4) we need to calculate fib(3) and fib(2). But calculating fib(3) also involves calculating fib(2), so we'll be doing the same calculation, again.

A classic, good old recursive Fibonacci function might look like this:

function fib(n: number): number {
  if (n === 0 || n === 1) {
    return n;
  }

  return fib(n - 1) + fib(n - 2);
}

Though the issue we have just mentioned remains: we'll keep calculating the same values:

But, we want to do better.

Memoization is remembering the problems we have solved before so that we don't have to solve them again and waste our time. We can reuse the solution to the subproblem we've already memoized. So, we can keep a cache to store those solutions and use them:

function fib(n: number, cache: Map<number, number>): number {
  if (cache.has(n)) {
    return cache.get(n)!;
  }

  if (n === 0 || n === 1) {
    return n;
  }

  const result = fib(n - 1, cache) + fib(n - 2, cache);
  cache.set(n, result);

  return result;
}

For example, we can initially pass an empty Map as the argument for cache, and print the first 15 Fibonacci numbers:

let m = new Map<number, number>();

for (let i = 0; i < 15; i++) {
  console.log(fib(i, m));
}

/*
  0
  1
  1
  2
  3
  5
  8
  13
  21
  34
  55 
  89
  144
  233
  377 
 */

There are two different approaches with dynamic programming: top-down and bottom-up.

Top-down is like what it sounds: starting with a large problem, breaking it down to smaller components, memoizing them. It's what we just did with the fib example.

Bottom-up is also like what it sounds: starting with the smallest subproblem, finding out a solution, and working our way up to the larger problem itself. It also has an advantage: with the bottom-up approach, we don't need to store every previous value. We can keep only the last two values and use them to build up to our target.

With the bottom-up approach, our fib function can look like this:

function fib(n: number) {
  let dp = [0, 1];
  for (let i = 2; i <= n; i++) {
    dp[i] = dp[i - 1] + dp[i - 2];
  }

  return dp[n];
}

Just note that we are keeping an array whose size will grow linearly as the input increases. So, we can do better with constant space complexity, not using an array at all:

function fib(n: number) {
  if (n === 0 || n === 1) {
    return n;
  }

  let a = 0;
  let b = 1;

  for (let i = 2; i <= n; i++) {
    let tmp = a + b;
    a = b;
    b = tmp;
  }

  return b;
}

Time and space complexity

The time complexities for both the top-down and bottom-up approaches in the Fibonacci example are $O(n)$, as we solve each of the $n$ subproblems only once and each takes constant time.

Note: The time complexity of the recursive Fibonacci function that doesn't use DP is exponential (in fact, $O(\phi^{n})$ – yes the golden ratio as its base).
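One way to see the difference is to count function calls. The sketch below instruments both versions with counters; the names `naiveFib`, `memoFib`, `naiveCalls`, and `memoCalls` are introduced only for this comparison:

```typescript
let naiveCalls = 0;

function naiveFib(n: number): number {
  naiveCalls++;
  if (n === 0 || n === 1) {
    return n;
  }
  return naiveFib(n - 1) + naiveFib(n - 2);
}

let memoCalls = 0;

function memoFib(n: number, cache: Map<number, number> = new Map()): number {
  memoCalls++;
  if (cache.has(n)) {
    return cache.get(n)!;
  }
  if (n === 0 || n === 1) {
    return n;
  }
  const result = memoFib(n - 1, cache) + memoFib(n - 2, cache);
  cache.set(n, result);
  return result;
}

const naiveResult = naiveFib(20);
const memoResult = memoFib(20);

console.log(naiveResult, naiveCalls); // 6765 21891
console.log(memoResult, memoCalls);   // 6765 39
```

Both return the same value, but the memoized version needs a number of calls that grows linearly with n, while the naive one grows exponentially.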

But when it comes to space complexity, the bottom-up approach (the second version) is $O(1)$.

Note: The first version we've used for the bottom-up approach has $O(n)$ space complexity as we store the values in an array.

The top-down approach has $O(n)$ space complexity because we store a Map whose size will grow linearly as n increases.

Chapter Thirteen: Intervals

An interval simply has a start and an end. The easiest way to think about intervals is as time frames.

With intervals, the usual concern is whether they overlap or not.

For example, if we have an interval [1, 3] and another [2, 5], they are clearly overlapping, so they can be merged together to create a new interval [1, 5]:

In order for two intervals not to overlap:

the start of one should be strictly larger than the end of the other

newInterval[0] > interval[1]

Or:

the end of the one should be strictly smaller than the start of the other

newInterval[1] < interval[0]

If both of these are false, they are overlapping.

If they are overlapping, the new (merged) interval will have the minimum value from both intervals as its start, and the maximum as its end:

[
  min(newInterval[0], interval[0]),
  max(newInterval[1], interval[1])
]
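Put together, the two checks and the merge step can be sketched as small TypeScript functions. The names `Interval`, `overlaps`, `merge`, and `mergeAll` are assumptions for this example. `mergeAll` goes one step beyond the text: it sorts a list by start value and sweeps through it once, merging as it goes, which is one common way to apply the rule to many intervals:

```typescript
type Interval = [number, number];

// Two intervals overlap unless one starts strictly after the other ends.
function overlaps(a: Interval, b: Interval): boolean {
  return !(a[0] > b[1] || a[1] < b[0]);
}

// The merged interval spans from the smaller start to the larger end.
function merge(a: Interval, b: Interval): Interval {
  return [Math.min(a[0], b[0]), Math.max(a[1], b[1])];
}

// Sort by start, then merge each interval into the previous one when they overlap.
function mergeAll(intervals: Interval[]): Interval[] {
  const sorted = [...intervals].sort((x, y) => x[0] - y[0]);
  const result: Interval[] = [];

  for (const current of sorted) {
    const last = result[result.length - 1];
    if (last !== undefined && overlaps(last, current)) {
      result[result.length - 1] = merge(last, current);
    } else {
      result.push(current);
    }
  }

  return result;
}

console.log(overlaps([1, 3], [2, 5])); // true
console.log(overlaps([1, 3], [4, 5])); // false
console.log(merge([1, 3], [2, 5]));    // [ 1, 5 ]
console.log(mergeAll([[8, 10], [1, 3], [2, 5], [15, 18]]));
// [ [ 1, 5 ], [ 8, 10 ], [ 15, 18 ] ]
```

Because both comparisons are strict, intervals that only touch at an endpoint, such as [1, 3] and [3, 5], count as overlapping here.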

Chapter Fourteen: Bit Manipulation

A bitwise operation operates on a bit string, a bit array, or a binary numeral (considered as a bit string) at the level of its individual bits.

Let's first represent a number in binary (base 2). We can use toString method on a number, and specify the radix:

const n = 17;

console.log(n.toString(2)); // 10001

We can also parse an integer giving it a base:

console.log(parseInt('10001', 2)); // 17

Note: We can also represent a binary number with the prefix 0b:

console.log(0b10001); // 17
console.log(0b101); // 5

For example, these are the same number:

0b1 === 0b00000001 // true

All bitwise operations are performed on 32-bit binary numbers in JavaScript. That is, before a bitwise operation is performed, JavaScript converts numbers to 32-bit signed integers.

So, for example, 17 won't be simply 10001 but 00000000 00000000 00000000 00010001.

After the bitwise operation is performed, the result is converted back to 64-bit JavaScript numbers.
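We can observe this conversion directly. In the sketch below, `toInt32` is a hypothetical helper that applies a bitwise OR with zero: the OR itself changes no bits, so whatever differs in the result comes from the conversion to a 32-bit signed integer:

```typescript
// Hypothetical helper: `| 0` changes no bits but forces the 32-bit signed conversion.
function toInt32(x: number): number {
  return x | 0;
}

console.log(toInt32(17));          // 17
console.log(toInt32(17.9));        // 17 (the fractional part is dropped)
console.log(toInt32(2 ** 32 + 5)); // 5 (only the lowest 32 bits are kept)
console.log(toInt32(2 ** 31));     // -2147483648 (the leftmost of the 32 bits is the sign bit)
console.log(typeof toInt32(17));   // number
```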

Bitwise operators

AND (`&`)

If two bits are 1, the result is 1, otherwise 0.

The GIFs below show the numbers as 8-bit strings, but when doing bitwise operations, remember they are converted to 32-bit numbers.

const x1 = 0b10001;
const x2 = 0b101;

const result = x1 & x2; // 1 (0b1)

OR (`|`)

If either of the bits is 1, the result is 1, otherwise 0.

const x1 = 0b10001;
const x2 = 0b101;

const result = x1 | x2; // 21 (0b10101)

XOR (`^`)

If the bits are different (one is 1 and the other is 0), the result is 1, otherwise 0.

const x1 = 0b10001;
const x2 = 0b101;

const result = x1 ^ x2; // 20 (0b10100)

NOT (`~`)

Flips the bits (1 becomes 0, 0 becomes 1).

const n = 17;

const result = ~n; // -18

Note: Bitwise NOTing any 32-bit integer x yields -(x + 1).

If we use a helper function to see the binary representations, it is as we expected:

console.log(createBinaryString(n));
// -> 00000000 00000000 00000000 00010001

console.log(createBinaryString(result));
// -> 11111111 11111111 11111111 11101110
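The createBinaryString helper isn't defined in this section. One possible version, written here as an assumption rather than the exact helper behind the outputs shown, uses an unsigned right shift by zero to read the 32-bit pattern and then pads and groups the bits:

```typescript
// A possible helper: formats n as 32 bits (two's complement) in groups of 8.
function createBinaryString(n: number): string {
  // `>>> 0` reinterprets the 32-bit pattern as unsigned, so negative numbers
  // print their actual bits instead of a minus sign.
  const bits = (n >>> 0).toString(2).padStart(32, '0');
  return bits.match(/.{8}/g)!.join(' ');
}

console.log(createBinaryString(17));
// -> 00000000 00000000 00000000 00010001

console.log(createBinaryString(~17));
// -> 11111111 11111111 11111111 11101110
```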

The leftmost bit indicates the sign: whether the number is negative or positive.

Remember that we said JavaScript uses 32-bit signed integers for bitwise operations. The leftmost bit is 1 for negative numbers and 0 for positive numbers. Also, the operator operates on the operands' bit representations in two's complement. The operator is applied to each bit, and the result is constructed bitwise.

Note: Two's complement allows us to get the same number with the opposite sign. One way to do it is to invert the bits of the number in the positive representation and add 1 to it:

function twosComplement(n) {
  return ~n + 0b1;
}
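A quick check of this identity, using a typed copy of the function under the hypothetical name `negateViaTwosComplement`:

```typescript
function negateViaTwosComplement(n: number): number {
  return ~n + 1; // invert the bits, then add 1
}

console.log(negateViaTwosComplement(17));  // -17
console.log(negateViaTwosComplement(-17)); // 17
console.log(negateViaTwosComplement(0));   // 0
```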

Left shift (zero fill) (`<<`)

Shifts the given number of bits to the left, adding zero bits shifted in from the right.

const n = 17;
const result = n << 1; // 34


console.log(createBinaryString(17));
// -> 00000000 00000000 00000000 00010001

console.log(createBinaryString(34));
// -> 00000000 00000000 00000000 00100010

Note that the 32nd bit (the leftmost one) is discarded.

Right shift (sign preserving) (`>>`)

Shifts the given number of bits to the right, preserving the sign when adding bits from the left.

const n = 17;
const result = n >> 1; // 8


console.log(createBinaryString(17));
// -> 00000000 00000000 00000000 00010001

console.log(createBinaryString(8));
// -> 00000000 00000000 00000000 00001000

const n = -17;
const result = n >> 1; // -9

console.log(createBinaryString(-17));
// -> 11111111 11111111 11111111 11101111

console.log(createBinaryString(-9));
// -> 11111111 11111111 11111111 11110111

Right shift (unsigned) (`>>>`)

Shifts the given number of bits to the right, adding 0s when adding bits in from the left, no matter what the sign is.

const n = 17;
const result = n >>> 1; // 8


console.log(createBinaryString(17));
// -> 00000000 00000000 00000000 00010001

console.log(createBinaryString(8));
// -> 00000000 00000000 00000000 00001000

const n = -17;
const result = n >>> 1; // 2147483639

console.log(createBinaryString(-17));
// -> 11111111 11111111 11111111 11101111

console.log(createBinaryString(2147483639));
// -> 01111111 11111111 11111111 11110111

Getting a bit

To get a specific bit, we first need to create a bitmask. We can do this by shifting 1 to the left by the index of the bit we want to get. The result is the AND of the binary number and the bitmask.

But using JavaScript, we can also do an unsigned right shift by the index to put the bit in the first place (so that we don't get the actual value that is in that position, but whether it is a 1 or a 0):

function getBit(number, idx) {
  const bitMask = 1 << idx;
  const result = number & bitMask;

  return result >>> idx;
}

For example, let's try 13, which is 1101 in binary:

const binaryNumber = 0b1101;

console.log('Bit at position 0:', getBit(binaryNumber, 0));
console.log('Bit at position 1:', getBit(binaryNumber, 1));
console.log('Bit at position 2:', getBit(binaryNumber, 2));
console.log('Bit at position 3:', getBit(binaryNumber, 3));

/*
Output:

Bit at position 0: 1
Bit at position 1: 0
Bit at position 2: 1
Bit at position 3: 1
*/

Setting a bit

If we want to turn a bit to 1 (in other words, to "set a bit"), we can do a similar thing.

First, we can create a bitmask again by shifting 1 to the left by the index of the bit we want to set to 1. The result is the OR of the number and the bitmask:

function setBit(number, idx) {
  const bitMask = 1 << idx;
  return number | bitMask;    
}

Remember that in our example, 13 was 1101 in binary. Let's say we want to set the 0 at index 1:

const binaryNumber = 0b1101;
const newBinaryNumber = setBit(binaryNumber, 1);

console.log(createBinaryString(newBinaryNumber));
// -> 00000000 00000000 00000000 00001111

console.log('Bit at position 1:', getBit(newBinaryNumber, 1));
// -> Bit at position 1: 1

Conclusion

With some detours here and there, we took a look at fourteen (or fifteen, if you count our interlude) different concepts, from arrays and hashing to bit manipulation.

I have to admit that, with time, it’s easy to forget all that we've learned. But that's not a problem in itself because, as you might have realized, if there is one idea that should stick with you from this handbook, it’s that problems are best solved when they are broken into smaller parts. And, as with anything else, writing or talking to yourself (see duck debugging) works miracles.

Now, it's time to take a deep breath.

It was a delightful adventure to explore data structures and algorithms, and hopefully you found some value in it.

Have a beautiful journey ahead, and until then, happy coding.

Resources & Credits

This handbook was mainly inspired by the amazing BaseCS series by Vaidehi Joshi, which is an incredible resource for learning basic computer science concepts.

The visualization idea was inspired by Lydia Hallie's JavaScript Visualized series.

Of course, you can also check out NeetCode's courses which can be incredibly helpful for a serious study.

There are many other resources to check out if you want to go further, here are some of the ones I used in our exploration:

The Architecture of Mathematics – And How Developers Can Use it in Code

Tiago Capelo Monteiro — Fri, 23 May 2025 15:06:16 +0000

"To understand is to perceive patterns." - Isaiah Berlin

Math is not just numbers. It is the science of finding complex patterns that shape our world. This means that to truly understand it, we need to see beyond numbers, formulas, and theorems and understand its structures.

The main goal of this article is to show how math is just like a growing tree of ideas. I want to show that math is a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math more deeply and how you can apply it to programming.

I’ve also included some code examples here to help you connect theory and practice. I show them to demonstrate how math ideas are applied to real problems. Whether you are new to more advanced math or are more experienced, these code examples will help you understand how to apply math in programming.

This link between theory and application reflects my own studies. I am a final-year undergraduate student in Electrical and Computer Engineering at NOVA FCT, one of the best engineering faculties in Portugal.

My degree is heavy on math and physics, because a solid grasp of math is key to understanding electronics, telecommunications, control theory, and other areas of engineering.

Here’s a brief overview of some of the math and physics subjects I’ve learned:

Partial Differential Equations (PDEs): These equations model real-world phenomena, from heat diffusion to the economy of a country.
Harmonic Analysis (Fourier & Laplace): Integral transforms like the Fourier and Laplace transform allow us to understand problems in new domains.
Complex Analysis: Extending calculus into the complex plane gives rise to powerful tools used in physics and engineering.
Numerical Analysis: When analytical solutions are impossible or inefficient, numerical methods provide computer-based approximations. This is crucial for real-world applications.
Control and Signal Theory: These areas show us how to design stable systems like rockets, trains, and robots.
Physics: Courses in Classical Mechanics and Electromagnetism helped bridge theoretical math to physical laws.

During my years of study, besides technical skills, I’ve developed a deeper understanding of how the world works and the structure of the field of mathematics. And I’ve started to find patterns in how math is a framework of interconnected logic.

In this article, we’ll explore:

Simple Analogy: The Tree of Mathematics
The Structure and History of Mathematics
A Tree Example: Foundations of Relativity by Albert Einstein
The Biggest Paradox of Math, Discovered by Kurt Gödel
What About Applied Math and Engineering?
Code Examples – Analytical and Numerical Approaches
The Impact of a Grand Unified Theory of Mathematics
A Final Lesson From History

Simple Analogy: The Tree of Mathematics

Imagine math as a vast tree growing forever.

The roots of the tree are the foundations of mathematics: logic and set theory. From this foundation emerge the main basic fields of math: arithmetic, algebra, geometry, and analysis.

As the tree divides further and further into more branches, new, more complex subfields start to appear, like topology, abstract algebra, and complex analysis. Sometimes the branches are connected to each other.

And remember: this tree is always growing in many directions. From branches creating new branches to branches connecting to other branches. Little by little, it grows.

Throughout history, there have been times when big scientific discoveries made parts of the math tree grow very fast. At other times, decades and even centuries passed without many new branches. Imaginary numbers are one example: they first appeared in the 16th century but took centuries to be widely accepted.

And you might wonder: How many more branches and connections between them will keep appearing?

The Structure and History of Mathematics

The first mathematical ideas appeared independently across ancient civilizations. For example:

India’s invention of zero
Islamic algebraic advances
Greek geometric rigor

Over time, many great mathematicians developed these ideas and shared them through writing and lectures.

Eventually, these ideas were shared widely with new generations, and these new generations created new math based on old math.

This is how new branches are continuously born from previous branches of the tree of mathematics.

And this is why Isaac Newton wrote, in a letter to Robert Hooke in 1675:

If I have seen further, it is by standing on the shoulders of giants

He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.

Yet, the real power of math lies in practicing it over and over and understanding it more and more deeply. As one of my professors once explained:

More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.

Very often, to solve problems, it is necessary to think in terms of first principles and build from there. Math teaches exactly that. In this way, math is not just an academic subject. It is a language spoken by scientists and engineers around the globe.

By having it well preserved and shared, it is still possible to create new math from previous ideas. And it’s possible for the big tree to continue growing based on previous branches or nodes.

A Tree Example: Foundations of Relativity by Albert Einstein

Albert Einstein created the general and special theories of relativity. These have big consequences nowadays:

GPS and Global Communication
Advancements in Satellite Telecommunications
Space Exploration and Satellite Launches

But this was only possible through the unification of geometry with calculus, called differential geometry. The evolution of differential geometry happened over the centuries, thanks to many great mathematicians. Below are some of them, but this is not a complete list:

Euclid (circa 300 BCE): Contributed to geometry, laying the groundwork for later mathematical systems
Archimedes (circa 287–212 BCE): Pioneered the understanding of volume, surface area, and the principles of mechanics
René Descartes (1596–1650): Developed Cartesian coordinates and analytical geometry
Isaac Newton (1642–1727) & Gottfried Wilhelm Leibniz (1646–1716): Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.
Leonhard Euler (1707–1783): Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.
Gaspard Monge (1746–1818): The father of differential geometry and pioneer in descriptive geometry
Carl Friedrich Gauss (1777–1855): Made groundbreaking advances in geometry, including the concept of curved surfaces.
Bernhard Riemann (1826–1866): Introduced Riemannian geometry, a branch of differential geometry.

Once again, as Isaac Newton wrote, in a letter to Robert Hooke in 1675:

If I have seen further, it is by standing on the shoulders of giants.

Albert Einstein saw what no one else in his time saw, thanks to these great math giants and countless others.

The Biggest Paradox of Math, Discovered by Kurt Gödel

The biggest paradox in math, in my opinion, is what Kurt Gödel discovered. His research, published in 1931, revealed a limit to this cycle of building new math on top of old math.

This paradox – that is, his incompleteness theorems – shows that in any consistent formal system capable of expressing simple arithmetic, there will always be true mathematical statements that cannot be proven within the system itself.

This means that every such system has limits on what it can prove to be true or false. For mathematicians, this means that the tree will never be complete. There are true statements that a given formal system cannot prove, even though we may have good reasons to believe them.

It also means that no matter how many mathematicians work in the field or how much AI is used to find new mathematics, these limitations will always exist. Some true statements cannot be proven inside a given system, and our confidence in them has to come from outside it, for example from a stronger set of axioms or from numerical evidence that falls short of a formal proof.

What About Applied Math and Engineering?

Applied math and engineering involve interpreting the same pure math ideas in real-world scenarios. In many cases, an applied tool is actually a combination of several math ideas. Let’s consider some examples:

Principal component analysis (PCA) is a widely used tool in data science. It combines linear algebra (the eigenvalues and eigenvectors of the data’s covariance matrix) with optimization (keeping the directions with the largest eigenvalues, which capture the most variance) in order to represent a dataset with fewer dimensions.
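As a rough illustration of that mix, here is a minimal PCA sketch built directly on NumPy’s eigen-decomposition. The `pca` helper and the toy dataset are made up for this example, and real projects would usually reach for a library implementation instead.

```python
import numpy as np

def pca(data, n_components):
    """Project data (n_samples x n_features) onto its top principal components."""
    # Center each feature so the covariance matrix describes spread around the mean.
    centered = data - data.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features).
    cov = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Reorder so the directions with the most variance come first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep only the leading directions and project the data onto them.
    components = eigenvectors[:, :n_components]
    projected = centered @ components
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, explained_ratio

# Hypothetical toy data: two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

projected, explained_ratio = pca(data, n_components=1)
print(projected.shape)
print(explained_ratio)
```

Because the two features are almost perfectly correlated, a single component should keep nearly all of the variance, which is the sense in which PCA represents "more data with less data".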

In machine learning, logistic regression is a mixture of calculus with statistics and probability.
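To make that concrete, the sketch below fits a one-feature logistic regression with plain gradient descent. The probability model comes from statistics and the gradient comes from calculus. The data, the learning rate, and the function names are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivatives of the average log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
print(sigmoid(w * 1.0 + b))  # estimated probability of passing after 1 hour
print(sigmoid(w * 4.0 + b))  # estimated probability of passing after 4 hours
```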

In harmonic analysis, Laplace, Fourier, and Z-transforms are a way to see the same thing in a new domain to get new insights. In this case, integrals are used to make this mapping.
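On a computer, the integral becomes a sum over samples. The sketch below is a deliberately naive discrete Fourier transform, written out so the mapping from the time domain to the frequency domain is visible. The signal and sample rate are made-up values, and a real application would use an FFT routine rather than this quadratic-time loop.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N)."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, x in enumerate(samples))
        for k in range(n_samples)
    ]

# Hypothetical signal: a 5 Hz sine wave sampled at 64 Hz for exactly one second.
sample_rate = 64
signal = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(sample_rate)]

spectrum = dft(signal)
# For a real-valued signal, only the first half of the bins carries unique information.
magnitudes = [abs(c) for c in spectrum[: sample_rate // 2]]
dominant_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
# The window is one second long, so the bin index is the frequency in Hz.
print(dominant_bin)
```

In the time domain the signal is just a list of 64 numbers. In the frequency domain, almost all of the energy sits in a single bin, which is the "new insight" the transform provides.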

In deep learning, neural networks are essentially layers of matrices that multiply their inputs and update their own values until they model a dataset representing a system. Activation functions between the layers make the model non-linear, backpropagation computes how much each matrix value contributed to the error, and a gradient descent-based optimizer uses those gradients to update all the matrix values.

I have actually written an article where I teach why activation functions are important if you want to check it out.

But the best example of this fusion of math with engineering is in control theory.

Control theory is the study of how to make dynamic systems behave the way we want, usually through feedback. From trains to cars to airplanes, control theory is at work in nearly all modern electronic devices. In electric circuits, it is also used heavily to guarantee stability in the face of electrical disturbances.
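A very small feedback loop can be sketched in a few lines. The example below simulates a proportional controller acting as a cruise control. The vehicle model, the drag coefficient, and the gain are invented numbers, so treat it as an illustration of the idea rather than a realistic design.

```python
def simulate_cruise_control(target_speed, gain, drag=0.5, steps=200, dt=0.1):
    """Simulate a proportional controller driving a simple vehicle model."""
    speed = 0.0
    history = []
    for _ in range(steps):
        error = target_speed - speed            # feedback: compare goal with measurement
        throttle = gain * error                 # proportional control law: u = Kp * e
        acceleration = throttle - drag * speed  # hypothetical first-order vehicle model
        speed += acceleration * dt              # forward Euler integration step
        history.append(speed)
    return history

history = simulate_cruise_control(target_speed=30.0, gain=2.0)
final_speed = history[-1]
print(final_speed)
```

With these numbers the speed settles where throttle and drag balance, at gain * target / (gain + drag) = 24 instead of 30. That steady-state error is a classic limitation of purely proportional control, and one reason control engineers add integral action.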

So as you can probably start to see, many of the tools we now have are combinations of pure math ideas. In essence, applied math uses pure math ideas as "ingredients" in "recipes" to solve problems.

So, we’ve explored the structure and evolution of mathematics. Yet, it is important to see how these ideas can be applied in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.

Code Examples – Analytical and Numerical Approaches

These code examples demonstrate a couple of ways you can use Python to solve math equations.

In the first code example, we’ll solve the problem the way kids in school solve math exercises by hand with a pencil: moving variables from one side of the equation to the other to find their values. In the second example, we’ll solve the problem using numerical analysis.

Example 1: Solve a Problem Analytically

When we solve math problems analytically, like we did in school, we are manipulating symbols to get exact values. Often these symbols are x, y, and z. In Python, we can do this using the SymPy library:

from sympy import symbols, Eq, solve

x, y = symbols('x y')
eq1 = Eq(2*x + 3*y, 6)
eq2 = Eq(-x + y, 1)

solution = solve((eq1, eq2), (x, y))
print(solution)

Essentially, we are finding x and y in this system of equations:

$$\begin{align*} 2x + 3y &= 6 \\ -x + y &= 1 \end{align*}$$

Which gives us the following result:

{x: 3/5, y: 8/5}

Or:

x = 0.6
y = 1.6

When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.
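One way to see what "exact" means here is to substitute the result back into both equations using rational arithmetic. This small check uses Python’s standard `fractions` module instead of SymPy, and the variable names are just for illustration.

```python
from fractions import Fraction

# The solution SymPy reported, kept as exact fractions rather than floats.
x = Fraction(3, 5)
y = Fraction(8, 5)

# Substitute back into both equations of the system.
lhs1 = 2 * x + 3 * y   # left-hand side of 2x + 3y = 6
lhs2 = -x + y          # left-hand side of -x + y = 1

print(lhs1, lhs2)            # 6 1
print(float(x), float(y))    # 0.6 1.6
```

Both left-hand sides come out as exactly 6 and 1, with no rounding involved.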

But many times, problems are harder and can’t be solved just by moving symbols to the right or left of the equation.

Sometimes, there can be so many symbols and transformed versions of them, with things like derivatives and integrals, that it can become very hard to manage and takes a lot of time.

For this reason, there is an area of mathematics, called numerical analysis, devoted to finding approximate solutions to problems like these. It often makes them faster to solve, and sometimes it is the only practical option. This is the method we will explore next.

Example 2: Solve Numerically (Approximation)

We’ll now use SciPy to solve a larger system, with five equations and five unknowns, using numerical methods:

import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])

b = np.array([12, 5, 7, 9, 10])

solution = solve(A, b)

print(solution)

In this code example, this line of code:

solution = solve(A, b)

Uses the solve function from the SciPy Python library:

from scipy.linalg import solve

It’s a function that helps you find the values of x in an equation A⋅x=b, where A is a square matrix (a square grid of numbers) and b is a list of numbers. For the system above, it gives us the following:

[ 1.35022026 -0.79955947 -1.17180617  3.14317181 -0.83920705]
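A numerical answer is an approximation, so it is worth checking how well it satisfies the original equations. The sketch below multiplies each row of A by the printed solution and compares the result with b. It uses only the standard library, and the tolerance of 1e-6 is an arbitrary choice that allows for the rounding in the printed values.

```python
import math

A = [[3, 2, -1, 4, 5],
     [1, 1, 3, 2, -2],
     [4, -1, 2, 1, 0],
     [5, 3, -2, 1, 1],
     [2, -3, 1, 3, 4]]
b = [12, 5, 7, 9, 10]

# The solution as printed above (rounded to 8 decimal places).
x = [1.35022026, -0.79955947, -1.17180617, 3.14317181, -0.83920705]

# Residual of each equation: (row of A) . x - b, which should be close to zero.
residuals = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
             for row, b_i in zip(A, b)]
all_close = all(math.isclose(r, 0.0, abs_tol=1e-6) for r in residuals)

print(residuals)
print(all_close)
```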

Now imagine, in this simple case, that a matrix like A could represent the traffic flow between cities or intersections, and b could represent the traffic entering or leaving each city.

Solving the system could help us determine the distribution of traffic between cities needed to meet desired traffic conditions.

Of course, these types of problems are far more complex in real life. But to understand and solve the big problems, you need to first understand the smaller problems.

And by the way, a system of linear equations can always be written as a matrix equation. We represent systems of equations as matrices because it makes their properties clearer and easier to find.

The thing is that by using matrices, it is easier to make calculations and to perform linear algebra math to check for characteristics of the matrix and understand it better.

In essence, a matrix represents a system of equations. Also, systems of equations can represent real life phenomena like the economy of a country or the weather.
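To show the link between the two views, the sketch below writes the small system from Example 1 as a coefficient matrix and a right-hand-side vector, then solves it with Gaussian elimination. This is roughly the idea behind LU-based solvers, but SciPy delegates the real work to optimized LAPACK routines, so the `solve_linear_system` helper here is only an illustration and not SciPy’s implementation.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build the augmented matrix [A | b] as floats, leaving the inputs untouched.
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]

    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]

    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

# The system from Example 1 in matrix form:
#   2x + 3y = 6
#   -x +  y = 1
A = [[2, 3],
     [-1, 1]]
b = [6, 1]

solution = solve_linear_system(A, b)
print(solution)  # approximately [0.6, 1.6]
```

The row operations are the same "move things around until x is alone" steps from school, just organized so a computer can apply them to systems of any size.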

If you want to know more, I wrote an entire article on numerical analysis that you can check out.

The Impact of a Grand Unified Theory of Mathematics

Despite the biggest paradox in mathematics, what would happen with a Grand Unified Theory of Mathematics?

Remember that Gödel’s theorems tell us that there are true statements that are impossible to formally prove within a given system, and we just need to accept that. But even with this limitation, it is still possible to look for connections that unify math.

This is what the Langlands program is trying to do. It is an attempt to interconnect some of the largest parts of the big tree of math, such as number theory and geometry, to uncover new patterns in math.

With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.

What is the value of this big unification for society?

By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:

In the 19th century, James Clerk Maxwell united the fields of electricity and magnetism with his famous Maxwell’s equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for much of the technological progress of the 20th and 21st centuries.
In the 20th century, the unification of algebra with logic led to the rise of digital systems. In turn, digital systems gave rise to processors and the evolution of computers into the modern laptop.
Also in the 20th century, the unification of probability and communication led to information theory. This became a foundation for the internet. This unification was carried out by a great mathematician called Claude Shannon.

In the end, a Grand Unified Theory of Mathematics could be one of the biggest achievements in modern society.

It could lead to new discoveries in physics, such as in string theory or quantum gravity, where deep mathematical structures are needed to create new physics. In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models. It could also open the door to new cryptographic methods and material science advances, revealing, with math, the deep patterns still not found in these fields.

Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.

A Final Lesson From History

From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it is possible to see its role in finding the patterns of our universe. I hope I was able to make you see math in this way.

In addition, we can conclude that the unification of scientific fields lays the foundations for new innovations that help society move forward. Many profound societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.

Find the full code here.

How to Write Math Equations in Google Docs

Vikram Aruchamy — Fri, 16 May 2025 15:42:41 +0000

Math equations are a critical part of academic papers, research reports, and technical documentation. While LaTeX is widely used for professional typesetting, Google Docs offers a robust set of features for inserting and formatting math equations and also supports LaTeX-style input.

Whether you're a student submitting a math assignment or a professional documenting formulas, Google Docs provides multiple ways to insert and format equations efficiently.

In this article, you'll learn how to write math equations in Google Docs using different methods: using the built-in equation editor, typing LaTeX-style commands directly, inserting complex equations with the help of the Auto-LaTeX Equations add-on, and copying math equations from ChatGPT to Google Docs without losing formatting by using the ChatGPT to Google Docs or PDF Chrome extension.

How to Write Equations Using the Built-in Equation Editor
How to Write Equations Using LaTeX Commands
How to Use the Auto-LaTeX Equations Add-on for Writing Advanced Math Equations
How to Copy Math Equations from ChatGPT to Google Docs
Watch: How to Write Equations in Google Docs
Tips for Formatting Math Equations in Google Docs
Conclusion

How to Write Equations Using the Built-in Equation Editor

Google Docs has a built-in equation editor that makes it easy to insert mathematical symbols and expressions.

To insert an equation editor box:

Open your Google Docs document.
Go to the top menu and click Insert → Equation.
An equation editor will appear, and a new toolbar will show up with common math symbols like fractions, exponents, Greek letters, and more.

Alternatively, you can use the following keyboard shortcuts to insert an equation editor box.

Windows/Linux: Alt + I, then E
Mac: Control + Option + I, then E

This shortcut quickly opens the equation editor without clicking through menus.

Toolbar Symbols:

Once the toolbar appears, you’ll find buttons for:

Greek letters
Miscellaneous operations
Relations
Math operations
Arrows

The equation editor box and a toolbar look like the following:

Now let’s learn how to write equations using the equation editor with a practical example.

Example: Typing the Quadratic Formula

Follow these steps to insert the following quadratic formula in Google Docs:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

Go to Insert → Equation to insert an equation editor and enable the equation toolbar.
Type x=
Click the Math Operations dropdown (the one with templates like square roots, brackets), then select the fraction template. This inserts a placeholder with two parts: a numerator and a denominator.
Click inside the numerator field. Begin by typing -b.
Now insert the ± (plus-minus) symbol. To do this:
- Click the Miscellaneous Operations dropdown
- Select the ± symbol from the list.
  Your numerator should now show: -b ±
After the ± symbol, insert a square root:
- Go back to the Math Operations dropdown and select the square root template.
- Inside the root, type b^2 - 4ac.
  - Use ^ to enter exponents. For example, b^2 will be rendered as b².
Move to the denominator field and type 2a.

Now your full equation should appear as:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

The equation will be properly formatted using Google Docs’ equation rendering, making it easy to read and mathematically accurate. You can continue typing more text below or beside the equation as needed – it behaves like any other element in your document.

This approach is useful for inserting neatly formatted equations without relying on add-ons or external tools. It’s especially helpful for students, teachers, and professionals preparing technical documents directly in Google Docs.

How to Write Equations Using LaTeX Commands

If you're familiar with LaTeX, you can take advantage of Google Docs’ support for a subset of LaTeX-style commands inside the built-in equation editor. This can greatly speed up the process of entering complex mathematical expressions, especially if you're already comfortable with LaTeX syntax.

How to Use LaTeX Commands in Google Docs

Open your Google Docs document.
Go to Insert → Equation to activate the equation toolbar and equation input field.
Click inside the equation box. Instead of using the toolbar buttons, type LaTeX-style commands directly.
As you type, Google Docs will automatically render the commands into formatted math once you press space or enter after each command or expression.

Commonly Supported LaTeX Commands in Google Docs:

Instruction	Result
To insert a fraction	`\frac{a}{b}` → 𝑎⁄𝑏
To insert a square root	`\sqrt{x}` → √𝑥
To insert Greek letters like α, β	`\alpha, \beta` → α, β
To insert an integral with limits	`\int_a^b f(x)\,dx` → ∫ᵃᵇ 𝑓(𝑥) 𝑑𝑥
To insert x superscript 2	`x^2` → 𝑥²
To insert x subscript 1	`x_1` → 𝑥₁

Type these commands in the equation box, and when you press space or enter, they will be converted to properly formatted mathematical notation.

Example: Typing the Quadratic Formula Using LaTeX Commands

Let’s walk through how to enter the following quadratic formula using LaTeX-style commands:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

Steps:

Insert the equation box: Go to Insert → Equation.
In the equation input area, type the following:

x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}

\frac creates a fraction.
-b is the numerator’s first term.
\pm inserts the plus-minus symbol.
\sqrt creates a square root.
b^2 formats b squared.
- 4ac is written normally inside the square root.
2a is the denominator.

As you type, press space or enter after each LaTeX command. Google Docs will automatically convert the code into properly formatted math notation.

After rendering, the equation will appear as:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

This method is ideal for users who prefer keyboard-based input over clicking toolbar icons. It also allows you to enter complex expressions faster and more accurately, especially if you're familiar with standard LaTeX syntax.

Notes:

Not all LaTeX features are supported in Google Docs. The supported commands are limited to basic math formatting, Greek letters, and common symbols.
Make sure to press space after each LaTeX command so that Docs knows to render it.

How to Use the Auto-LaTeX Equations Add-on for Writing Advanced Math Equations

When generating mathematical content using tools like ChatGPT, you'll notice that equations are rendered visually on the webpage, but behind the scenes they’re created using LaTeX code. So when you copy content from ChatGPT into Google Docs, the equations come through as raw LaTeX code rather than rendered math expressions.

For example, a quadratic formula provided by ChatGPT might look like this when pasted into your document:

x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}

While this format is ideal for precision, Google Docs doesn’t render pasted LaTeX code automatically.

This is where the Auto-LaTeX Equations add-on becomes essential, especially if you're moving content from ChatGPT to Google Docs. It’s also incredibly useful when importing LaTeX-based documents into Google Docs, such as content originally written in Overleaf or other LaTeX editors.

Instead of manually reformatting equations, the add-on automatically renders LaTeX code into properly formatted math equations, preserving the typesetting and structure you’d expect from a LaTeX environment.

What is Auto-LaTeX Equations?

Auto-LaTeX Equations is a free and open-source Google Docs add-on that scans your document for LaTeX expressions and converts them into properly formatted equations.

It recognizes LaTeX code wrapped in these delimiters:

Inline: $$ ... $$
Display: \[ ... \]

Once detected, it renders the equations seamlessly within your document, eliminating the need to retype or manually format them.

Paste your LaTeX expression into the Google Docs document. Make sure the expression is enclosed using one of the supported delimiters:

$$ ... $$ or \[ ... \]
Open the add-on sidebar by clicking Extensions → Auto-LaTeX Equations → Start.
Once the sidebar opens, you’ll see a dropdown labeled “Delimiters” and a button called “Render Equations.”
Select the delimiter you used when enclosing your LaTeX equations – for example, $$ or \[ \].
Click the “Render Equations” button.

The add-on will automatically scan your document and convert all valid LaTeX expressions into properly formatted equation images.

This step-by-step process allows you to take any LaTeX-based math copied from ChatGPT and render it cleanly within Google Docs – ready for export to Word or PDF.

Example: Converting LaTeX Code to a Rendered Math Equation

Paste the following equation into Google Docs:

$$ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$

To convert it:

Go to Extensions → Auto-LaTeX Equations → Start.
Select the delimiter $$ ... $$ and click the Render Equations button. The equation will be rendered and look as follows:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

How to Install Auto-LaTeX Equations

In Google Docs, click Extensions → Add-ons → Get add-ons.
Search for Auto-LaTeX Equations.
Click Install and follow the prompts.
After installation, access it from Extensions → Auto-LaTeX Equations.

How to Copy Math Equations from ChatGPT to Google Docs

To easily transfer math equations and the surrounding content from ChatGPT into Google Docs without losing formatting, use the free ChatGPT to Google Docs or PDF Chrome extension.

This extension allows you to:

Export a single response (with equations and tables) into Google Docs while preserving formatting
Export an entire conversation, including math, code, and text, into one clean, organized Google Doc, with no need to export responses separately and merge multiple documents later
Save ChatGPT canvas content as a Google Docs or PDF
Export ChatGPT deep research documents directly into Google Docs
Export ChatGPT content directly into PDF format when no further edits are necessary, eliminating the need to first export to Google Docs and then convert Google Docs to PDF

It’s especially useful for students, researchers, and professionals who want to keep their AI-generated math, notes, and research well-organized in Google Docs or PDF format with minimal effort.

Watch: How to Write Equations in Google Docs

If you prefer visual learning, here’s a helpful video walkthrough that demonstrates all the methods discussed above – using the built-in equation editor, LaTeX-like commands, and the Auto-LaTeX Equations add-on.

This step-by-step tutorial covers:

Opening and using the built-in equation toolbar
Typing LaTeX-style commands directly in the equation editor
Converting AI-generated LaTeX (e.g., from ChatGPT) into clean equations

Tips for Formatting Math Equations in Google Docs

Use inline equations when:

Inserting short expressions like x², a/b, or single variables
Including math within a sentence to maintain the flow of text

Use block equations when:

Writing complex or multi-line formulas (e.g., the quadratic formula)
You want the equation to be clearly separated from the surrounding text for readability

Wrapping tips for rendered equations:

Rendered equations are treated as images in Google Docs, which may disrupt the document layout if not positioned correctly
To fix this:
- Click the equation image
- Choose from:
  - In line – aligns the equation with surrounding text (best for inline use)
  - Wrap text – wraps paragraph text around the equation image
  - Break text – places the equation on its own line, isolating it
- Use the margin handles or spacing options to fine-tune the layout and prevent overlap or crowding

Conclusion

Google Docs offers several flexible ways to write and manage math equations:

Use the built-in equation editor for basic symbols, fractions, exponents, and common operations. It’s easy to access and great for straightforward math tasks without needing special syntax.
Try LaTeX-like commands inside the equation editor for faster input. You can type commands like \frac, \sqrt, or \alpha to quickly insert structured equations without navigating menus.
Install add-ons like Auto-LaTeX Equations for advanced LaTeX rendering. This is especially useful if you're copying equations from Overleaf, ChatGPT, or LaTeX documents into Google Docs, as it preserves formatting and converts code into clean equation images.
Use external tools when copying from other formats, like the ChatGPT to Google Docs or PDF Chrome extension, which helps retain equation formatting when moving content from ChatGPT or other platforms.

Whether you’re completing math homework, preparing teaching materials, or writing a research paper, Google Docs, combined with these tools, gives you everything you need to create clear, professional-looking documents with math content.

The Cryptography Handbook: Exploring RSA PKCSv1.5, OAEP, and PSS

Hamdaan Ali — Wed, 02 Apr 2025 22:04:38 +0000

The RSA algorithm was introduced in 1978 in the seminal paper, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems". Over the decades, as RSA became integral to secure communications, various vulnerabilities and attacks have emerged, underscoring the importance of understanding and implementing RSA correctly.

This handbook will help you understand the internal workings of the RSA algorithm, how they have evolved over the years, and the schemes defined under various RFCs. This knowledge will help you make informed choices about the most suitable RSA schemes depending on your business requirements.

In this handbook, we’ll begin by exploring the foundational principles of the RSA algorithm. By examining its mathematical underpinnings and historical evolution, you will gain insight into the diverse array of attacks that have emerged over the years.

The narrative unfolds as an evolutionary journey: from the original, straightforward (textbook) RSA implementation, through the discovery of vulnerabilities, to the development of effective countermeasures, and further refinements as new challenges were encountered. This progression illuminates how RSA has transformed over time and also demonstrates how modern cryptographic libraries have integrated these advancements to achieve secure implementations in today’s applications.

You can also watch the associated video here:

Prerequisites
The Alice-Bob Paradigm
The Birth of the RSA Cryptosystem
RSA Operations
Issues with Euler’s Totient Function in RSA
The Carmichael Function
- Mathematical Implication of The Carmichael function
- The Carmichael Function in Modern Implementations
Issues with Raw RSA
Exploiting Textbook RSA’s Determinism and Malleability
Low-Exponent Attacks
Håstad’s Broadcast Attack: Low Exponent Meets Multiple Recipients
Introduction to Padding Schemes in RSA
Public Key Cryptography Standards (PKCS#1 v1.5)
- The Mathematics Behind PKCS#1 v1.5
The Bleichenbacher Attack
Optimal Asymmetric Encryption Padding (OAEP)
- The Mathematics Behind OAEP
Why SHA-1 or MD5 Are Safe in RSA-OAEP
- Label Hashing
- Mask Generation Function (MGF1)
Adoption in Cryptographic Libraries (PKCS#1 v1.5 vs OAEP)
Enhancing Digital Signatures: The Transition to PSS
The Road Ahead: Assessing RSA’s Long-Term Viability
References

Prerequisites

Linear Algebra: A foundational understanding of Linear Algebra and Modular Arithmetic will help you understand certain sections of the handbook, though it is not an absolute requirement. This handbook provides comprehensive explanations of mathematical expressions and their underlying concepts as they arise.

For a concise and relevant introduction to the Chinese Remainder Theorem (CRT) in the context of the handbook, you may find this resource helpful: CRT, RSA, and Low Exponent Attacks | YouTube.

Patience (and a Sense of Adventure): RFCs can sometimes get dull to read, and research papers can feel intimidating at first glance. This handbook is designed to make standard cryptographic concepts accessible to everyone, guiding you through each step with clarity and intuition. Every concept is reinforced with clear, step-by-step examples, ensuring not only a thorough understanding but also familiarity with widely used standard notations. So take your time, take a deep breath, and embrace the journey.

For visual learners, the associated video may offer a more engaging experience.

The Alice-Bob Paradigm

Throughout this handbook, you will come across numerous sequence diagrams and mathematical proofs that use the Alice-Bob Paradigm.

The Alice-Bob paradigm is a common convention in cryptography where two generic entities, often named Alice and Bob, are used to illustrate various scenarios, protocols, or cryptographic principles.

These characters represent two parties engaged in communication, with Alice typically representing the sender or initiator, and Bob representing the receiver or responder.

We often introduce Eve as a third party, symbolizing an eavesdropper or potential attacker, adding an element of security risk, and illustrating scenarios where external entities might attempt to intercept or manipulate the communication.

The Birth of the RSA Cryptosystem

The year 1978 witnessed the birth of a new era in cryptography with the introduction of the RSA cryptosystem, named after its inventors (Rivest, Shamir, and Adleman).

This development, introduced in the paper "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", provided a method for secure digital communication and laid the foundation for modern public-key cryptography.

At the heart of RSA lies elementary number theory – specifically, the properties of prime numbers and modular arithmetic. Let’s first understand how these key concepts form its mathematical foundations.

Prime Numbers and Composite Moduli

The algorithm starts by selecting two large prime numbers, denoted as p and q. Their product ($n = p \times q$) forms the modulus for both the public and private keys.

The security of RSA depends heavily on the fact that, while multiplying these primes is computationally straightforward, factoring the resulting large composite number n is considered infeasible for sufficiently large primes.

At this point, it’s important to note that p and q must be large prime numbers to ensure RSA’s security. Fortunately, modern libraries handle this automatically by using well-established prime-generation algorithms. As a result, you can focus on higher-level aspects of your applications without having to manage the low-level details of prime selection.

For instance, OpenSSL’s RSA key generation routine performs several checks to ensure that the resulting modulus $n = p \times q$ meets the desired bit-length requirements.

The snippet below right-shifts the product of the generated primes (stored in r1) by bitse - 4 bits to isolate the top 4 bits, which are then checked to ensure that the modulus meets the desired size criteria.

if (!BN_rshift(r2, r1, bitse - 4))
    goto err;
bitst = BN_get_word(r2);

The extracted bits (bitst) are then compared against a predefined range (from 0x9 to 0xF). This range ensures that the four most significant bits of the product are neither too small nor too large for the target bit length.

if (bitst < 0x9 || bitst > 0xF) {
    bitse -= bitsr[i];

If the significant bits do not fall within the desired range, the bit length is adjusted and the prime-generation process is retried. If the number of retries exceeds a set limit, the entire process is restarted.

if (!BN_GENCB_call(cb, 2, n++))
    goto err;
if (primes > 4) {
    if (bitst < 0x9)
        adj++;
    else
        adj--;
} else if (retries == 4) {
    i = -1;
    bitse = 0;
    sk_BIGNUM_pop_free(factors, BN_clear_free);
    factors = sk_BIGNUM_new_null();
    if (factors == NULL)
        goto err;
    continue;
}
retries++;
goto redo;

To ensure that the candidates are prime with overwhelming probability, these libraries combine sieving methods, which quickly eliminate candidates divisible by small primes, with probabilistic tests such as the Miller-Rabin primality test.
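To make the idea of a probabilistic primality test concrete, here is a minimal Python sketch of Miller-Rabin preceded by a small trial-division sieve. It is an illustration only and not OpenSSL's implementation; the function name `is_probable_prime` and the default of 40 rounds are assumptions made for this example.

```python
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin: False means composite, True means probably prime."""
    if n < 2:
        return False
    # Cheap sieving step: trial division by a few small primes.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as 2^s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)   # random base
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue                     # this base reveals nothing
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # a is a witness: n is composite
    return True

print(is_probable_prime(2**61 - 1))  # a known Mersenne prime
print(is_probable_prime(252601))     # 41 * 61 * 101, a Carmichael number
```

Each round that passes lowers the chance that a composite slips through, which is why libraries run several rounds rather than one.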

The Euler Totient Function

For a number n that is the product of two primes, the Euler totient function is given by:

$$\varphi(n) = (p-1)(q-1)$$

This function counts the number of positive integers less than $n$ that are co-prime to $n$. Euler’s theorem, which states that $a^{\varphi(n)} \equiv 1 \pmod{n}$ for any integer $a$ co-prime to $n$, plays a central role in proving why RSA’s operations are reversible.
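As a quick sanity check of both statements, the sketch below counts the co-prime integers by brute force for a toy modulus and tests Euler’s theorem on every one of them. The primes 5 and 11 and the helper name `phi_bruteforce` are chosen purely for illustration.

```python
from math import gcd

def phi_bruteforce(n: int) -> int:
    """Count the integers in [1, n) that are co-prime to n."""
    return sum(1 for a in range(1, n) if gcd(a, n) == 1)

p, q = 5, 11              # toy primes
n = p * q
phi = (p - 1) * (q - 1)   # closed form for a product of two distinct primes

# The brute-force count agrees with (p - 1)(q - 1).
assert phi_bruteforce(n) == phi

# Euler's theorem: a^phi(n) = 1 (mod n) for every a co-prime to n.
euler_holds = all(pow(a, phi, n) == 1 for a in range(1, n) if gcd(a, n) == 1)
print(phi, euler_holds)
```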

But most modern RSA cryptosystems use the Carmichael function instead of Euler’s totient function. We will examine the reasoning behind this shift in the next few sections.

Computing the Keys

Now we select an integer $e$ such that $1 < e < \varphi(n)$ and $\gcd(e, \varphi(n)) = 1$. This $e$ becomes the public exponent you see as a parameter in the RSA function calls you make.

With that done, let’s determine $d$ as the modular multiplicative inverse of $e$ modulo $\varphi(n)$. In other words, $d$ is computed such that:

$$e \times d \equiv 1 \pmod{\varphi(n)}$$

This step is the mathematical linchpin ensuring that decryption is the inverse operation of encryption.

In the 1978 paper, the authors explicitly provided these formulas and steps. They showed that if you encrypt a message m using $c = m^e \mod n$ and then decrypt using $m = c^d \mod n $ , the original message is recovered – thanks to the properties of modular exponentiation and Euler’s theorem. This mathematical framework was novel at the time and immediately set the stage for a new era in cryptography.
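The key computation described above can be sketched in a few lines of Python. The helper name `toy_rsa_keypair` and the toy values $p = 61$, $q = 53$, $e = 17$ are assumptions for this example; real keys use primes that are hundreds of digits long.

```python
from math import gcd

def toy_rsa_keypair(p: int, q: int, e: int):
    """Return ((n, e), d) for toy primes p and q (illustrative helper)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)   # modular inverse of e; needs Python 3.8+
    return (n, e), d

(n, e), d = toy_rsa_keypair(61, 53, 17)

m = 65                    # message, already encoded as an integer below n
c = pow(m, e, n)          # encryption: c = m^e mod n
recovered = pow(c, d, n)  # decryption: m = c^d mod n
print(n, d, recovered)
```

Python's three-argument `pow` performs modular exponentiation directly, so the intermediate value $m^e$ is never built in full.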

RSA Operations

Now that the mathematical foundations are laid, the RSA algorithm can be seen as a set of three core operations: Encryption, Decryption, and Signing. Throughout this handbook's next sections, we will critically analyze these operations and learn about several pitfalls in each. Then we will examine how these were averted with the birth of new schemes, each to solve a new issue discovered on the way.

Encryption

With the public key $(n, e)$ available to everyone, any user can encrypt a message $m$ (where $m$ is first encoded as an integer in the range $0 \leq m < n$ ) using the formula:

$$c = m^e \mod n$$

Here, $c$ is the ciphertext. Because the operation is based on modular exponentiation, recovering $m$ from $c$ without knowing $d$ is believed to be computationally hard, even though $n$ and $e$ are public.

Decryption

The intended recipient, who possesses the private key $d$, decrypts the cipher text $c$ by computing:

$$m = c^d \bmod n$$

Using the relationship ($e \times d \equiv 1 \pmod{\varphi(n)}$) and properties from Euler’s theorem, the above operation exactly inverts the encryption step, recovering the original message $m$.
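To spell out the step that Euler’s theorem provides, here is a short derivation for the case where $m$ is co-prime to $n$. The integer $k$ is introduced here only for illustration:

$$\begin{gather} \begin{split} e \cdot d = 1 + k \cdot \varphi(n) \quad \text{for some integer } k \geq 0 \\ c^d \equiv (m^e)^d = m^{1 + k\varphi(n)} = m \cdot \left(m^{\varphi(n)}\right)^k \equiv m \cdot 1^k = m \pmod{n} \end{split} \end{gather}$$

The remaining case, where $m$ shares a factor with $n$, also works out, but it needs a separate argument modulo $p$ and modulo $q$.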

This ensures that only the holder of the private key can read the encrypted message. This is the backbone of RSA’s use in secure communication.

The sequence diagram below wraps up our discussion so far:

Digital Signatures

Digital signatures fulfill a different security goal: authenticity and integrity rather than confidentiality. While encryption and decryption use the public key for “locking” and the private key for “unlocking,” digital signatures reverse these roles.

1. Signing

The author of a message uses their private key $d$ to compute a signature $s$ on the message $m$, guided by the formula mentioned below:

$$s = m^d \bmod n$$

This can later be verified by others using the corresponding public key. The purpose here is not to recover a secret message but to create a proof of authenticity.

2. Verification

Anyone with the public key $(n, e)$ can verify that the signature s indeed belongs to the message $m$ by computing:

$$m \equiv s^e \pmod{n}$$

If the equivalence holds, it confirms two key points: that the message has not been tampered with (integrity), and that the signature must have been generated using the private key $d$ (authenticity).

As long as $d$ is kept secret, only the legitimate signer can produce a valid signature. Take a look at the sequence diagram below to understand the complete process.
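In code, the raw signing and verification steps described above might look like the following sketch. It reuses an assumed toy key ($p = 61$, $q = 53$, $e = 17$, $d = 2753$), and the function names `sign` and `verify` are illustrative; the message is signed directly, without hashing or padding.

```python
n, e, d = 3233, 17, 2753   # toy key built from p = 61, q = 53

def sign(m: int, d: int, n: int) -> int:
    """Raw RSA signature: s = m^d mod n."""
    return pow(m, d, n)

def verify(m: int, s: int, e: int, n: int) -> bool:
    """Accept the signature if s^e mod n reproduces the message."""
    return pow(s, e, n) == m % n

m = 65
s = sign(m, d, n)

valid = verify(m, s, e, n)         # genuine message and signature
forged = verify(m + 1, s, e, n)    # altered message, same signature
print(valid, forged)
```

Changing the message without changing the signature makes the check fail, which is the integrity property in action.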

Issues with Euler’s Totient Function in RSA

While using Euler’s Totient Function works well in theory, implementers of the scheme realized its practical downsides. Simply put, the primary issue was that Euler’s Totient Function can lead to a larger private exponent $d$ than what was necessary.

To completely appreciate this fact, let’s take a step back to understand why the size of the private exponent $d$ matters in RSA.

RSA decryption (or signing) involves computing $c^d \bmod n$ (or $m^d \bmod n$), which is done via modular exponentiation. The running time of exponentiation algorithms (like square-and-multiply) grows with the number of bits in $d$. A larger $d$ means more multiplications and squarings, and therefore slower decryption.

In practice, if using Euler’s totient function makes $d$ roughly twice as large as what is required, then $d$ is about one bit longer, which costs roughly one extra squaring and at most one extra multiplication per exponentiation. The overhead is small, but it is avoidable. With the small public exponents in common use (such as 3 or 65537), $d$ is typically almost as long as the modulus it is inverted against, so inverting against $\varphi(n)$ rather than a smaller modulus directly yields a longer $d$.

Beyond performance, having an unnecessarily large $d$ can increase storage size slightly (a few more bytes for the key). This can also lead to interoperability quirks, which is why standards and protocols such as FIPS 186-4 [1] and RFC 8017 [2] expect $d$ to be below a certain size. We will take a detailed look at this in the next section.

To combat these issues, cryptographers utilized the Carmichael function to generate RSA keys. Before we dive into how the Carmichael function helps our case, let’s quickly understand what the Carmichael function actually is.

The Carmichael Function

The Carmichael function, represented by $\lambda(n)$ and also known as the reduced totient or least universal exponent, is defined as the smallest positive integer $m$ such that $a^m \equiv 1 \pmod{n}$ for every integer $a$ co-prime to $n$.

To put this in easy terms, $λ(n)$ is the exponent of the multiplicative group modulo $n$ (the least common multiple of the orders of all elements). For RSA-style moduli (product of primes), the Carmichael function is guided by the formula:

$$\lambda(n) = \operatorname{lcm}(p-1,\,q-1)$$

where $n = p \cdot q$, with $p$ and $q$ being the large primes.

Put another way, $\lambda(n)$ is the least common multiple of the values of $\lambda$ at each prime power dividing $n$. For a prime $p$, $\lambda(p) = \varphi(p) = p - 1$, so for a product of two distinct primes we take the $\operatorname{lcm}$ of $p-1$ and $q-1$.

Mathematical Implication of The Carmichael function

The Carmichael function $\lambda(n)$ is a “tighter” bound. What this means is that $\lambda(n)$ divides $\varphi(n)$ (since the exponent of a finite group always divides the group order by Lagrange’s Theorem [3]).

If $p$ and $q$ are both odd primes, then $p-1$ and $q-1$ are even, so their least common multiple is at most half of $(p-1)(q-1)$. Mathematically:

$$\lambda(n) = \dfrac{(p-1)(q-1)}{\gcd(p-1,\, q-1)}$$

We can observe that $\lambda(n)$ is less than or equal to $\varphi(n)$ and, for odd primes, at most half of it. This means $\lambda(n)$ provides the minimal exponent needed for RSA’s correctness, whereas $\varphi(n)$ might be a larger number that still works but isn’t necessary.

When you choose two large random primes $p$ and $q$, you have:

$$\varphi(n) = (p-1)(q-1) \approx n,$$

because for large primes, the subtracted ones make only a small difference compared to $p$ and $q$ themselves.

Now, since both $p-1$ and $q-1 $ are even, they each have a factor of 2. If those are their only common factors (which is often the case for random primes), then:

$$\lambda(n) = \mathrm{lcm}(p-1, q-1) \approx \frac{\varphi(n)}{2}.$$

When you compute the private exponent $d$ as the modular inverse of $e$ (a small number) modulo $ \varphi(n)$ versus modulo $\lambda(n)$, the range from which $d$ is chosen is roughly twice as large in the former case. That means the typical $d$ when computed modulo $\varphi(n)$ can be about twice as large as when computed modulo $\lambda(n)$. A larger $d$ means that during decryption (or signing) the modular exponentiation $c^d \mod n$ takes slightly more time.

Intuitively, using $λ(n)$ ensures we don’t “overshoot” the exponent required for the modular arithmetic to cycle back to 1.

A smaller $d$ makes every RSA decryption and signature operation slightly faster. For instance, if $\lambda(n)$ is roughly half of $\varphi(n)$, then $d$ will have about one bit fewer than it would otherwise, saving roughly one squaring (and possibly one multiplication) per exponentiation. The gain is modest, but it is free: we aren’t changing the security assumptions or the key size $n$, just using the mathematically tight value for the exponent. The RSA algorithm’s security is not weakened by this. The resulting $d$ is different but functionally equivalent.
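The sketch below compares the two private exponents for an assumed toy key ($p = 61$, $q = 53$, $e = 17$) and checks that the smaller one decrypts every possible message. The small `lcm` helper is defined locally so the example does not depend on a particular Python version.

```python
from math import gcd

def lcm(a: int, b: int) -> int:
    return a // gcd(a, b) * b

p, q, e = 61, 53, 17       # toy values, for illustration only
n = p * q

phi = (p - 1) * (q - 1)    # Euler's totient: 3120
lam = lcm(p - 1, q - 1)    # Carmichael function: 780, since gcd(60, 52) = 4

d_phi = pow(e, -1, phi)    # 2753
d_lam = pow(e, -1, lam)    # 413

# The lambda-based exponent is the phi-based one reduced modulo lambda(n).
assert d_phi % lam == d_lam

# Both exponents decrypt every message in [0, n) correctly.
both_work = all(
    pow(pow(m, e, n), d_lam, n) == m and pow(pow(m, e, n), d_phi, n) == m
    for m in range(n)
)
print(d_phi.bit_length(), d_lam.bit_length(), both_work)
```

In this toy case the two primes share a factor of 4 in $p-1$ and $q-1$, so the saving is larger than the single bit expected for typical random primes.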

The Carmichael Function in Modern Implementations

The critical property for RSA, $e \cdot d \equiv 1 \pmod{\lambda(n)}$, is both necessary and sufficient for correct decryption, thanks to Carmichael’s theorem. So there’s no need for $d$ to also satisfy the stronger condition modulo $\varphi(n)$.

By switching to computing $d$ modulo $\lambda(n)$ (that is, $d = e^{-1} \bmod \lambda(n)$), we directly get the smallest working private exponent. Ronald Rivest himself noted this optimization in his 1999 seminal paper [4], stating that solving for $d$ using $\lambda(n)$ instead of $\varphi(n)$ is slightly preferable because it can result in a smaller value for $d$.

Over time, the use of $ λ(n)$ in RSA moved from an academic suggestion to an industry standard. Today’s cryptographic standards explicitly acknowledge or require the $λ(n)$ approach.

For example, the official RSA standard (PKCS #1 v2.2, RFC 8017 [2]) defines the RSA key generation in terms of $\lambda(n)$. It specifies that the private exponent $d$ is chosen such that $e \cdot d \equiv 1 \pmod{\lambda(n)}$ (with $\lambda(n) = \operatorname{lcm}(p-1, q-1)$). In other words, PKCS #1 expects the Carmichael function to be used for the modulus of the exponent. Likewise, NIST’s FIPS 186-4 (Digital Signature Standard) mandates that $d$ be less than $\lambda(n)$.

Any RSA key where $d$ is larger than $\lambda(n)$ is considered non-compliant in those strict contexts. This effectively forces implementations to use the smaller $\lambda(n)$-based exponent, since any “oversized” $d$ can be reduced modulo $\lambda(n)$ to meet the criterion.

Standards such as FIPS 186-4 [1] (the Digital Signature Standard) and RFC 8017 [2] (which specifies PKCS#1 v2.2 for RSA Cryptography) include requirements or recommendations that imply the private exponent $d$ should be as small as possible and ideally less than $ \lambda(n)$. Using $\lambda(n)$ (the least common multiple of $p-1$ and $q-1$) directly produces the smallest valid $d$, whereas using $\varphi(n)$ often results in a $d$ that is larger than necessary. This not only improves performance (by reducing the number of modular multiplications needed during decryption/signing) but also helps maintain interoperability with protocols that expect d to be below a certain size.

The Python cryptography library (PyCA cryptography) explicitly documents [5] that it uses Carmichael’s totient to generate the “smallest working value of $d$,” noting that older implementations (including the original RSA paper) used Euler’s totient and ended up with larger exponents. OpenSSL also uses the Carmichael function in their low-level RSA APIs [6].

This shift to the Carmichael function ensures that under the hood your RSA key is a bit more efficient than the ones from the late 1970s while providing the same level of security.

Issues with Raw RSA

Raw or “Textbook” RSA soon turned out to be insecure when two major weaknesses were discovered.

The operations involved in RSA are entirely deterministic, which means that for a given plaintext $m$, encryption always produces the same cipher text $C = m^e \mod n$.

An eavesdropper or an attacker, say Eve, can guess or derive plain texts by exploiting the predictability of outputs. Since RSA encryption is a public operation, an attacker can encrypt likely messages and compare results to a target cipher text – a trivial chosen plaintext attack.

Besides this, textbook RSA is also malleable. This means that its algebraic structure allows attackers to manipulate ciphertexts in meaningful ways. For instance, given a ciphertext $C = M^e \bmod n$, an attacker can multiply it by the encryption of a known value $r$ to produce a new ciphertext $C' = C \cdot r^e \bmod n$, which decrypts to the plaintext $M \cdot r \bmod n$. If the attacker can then learn what the legitimate receiver obtains when decrypting $C'$, dividing by $r$ recovers $M$.

Let’s understand these vulnerabilities with a small practical example.

Exploiting Textbook RSA’s Determinism and Malleability

Key Generation (Setup)

For our toy example, we’ll choose small prime numbers and generate an RSA key pair:

Let’s select the values of $p =3$ and $q=11$. Both of these values are prime. Now, compute the modulus and Totient Function as follows:

$$\begin{gather} \begin{split} n = p \times q = 3 \times 11 = 33 \\ \varphi(n) = (p - 1) \times (q - 1) = 2 \times 10 = 20 \end{split} \end{gather}$$

Now choose the public exponent. Let’s consider $e=3$, since it is coprime with $\varphi(n) = 20$, that is, $\gcd(3, 20) = 1$.

Now let’s compute the private exponent. We know that $d$ is the modular inverse of $e$ modulo $\varphi(n)$, so we need to find $d$ such that $d \times e \equiv 1 \pmod{20}$. This gives $d = 7$, as $3 \times 7 = 21 \equiv 1 \pmod{20}$.

Finally, the public key is $(n = 33,\ e = 3)$ and the private key (secret) is $d = 7$.

Encryption Process

Now, let’s encrypt a simple message using the above key. Let us select our plaintext to be $M = 4$. The cipher text in this case would be:

$$\begin{gather} \begin{split} C = 4^3 \bmod 33 \\ C = 64 \bmod 33 \\ C = 64 - 33 \times 1 = 31 \end{split} \end{gather}$$

To consolidate the findings so far, if we encrypt message $4$ with the public key $(e=3, n=33)$, we will produce the cipher text $31$. Now, let’s try the exploits.

Determinism Exploit (Ciphertext Guessing Attack)

Textbook RSA is deterministic – the same plaintext always yields the same ciphertext (with no randomness involved). An attacker who intercepts the ciphertext $C=31$ can exploit this by encrypting likely plaintext guesses and comparing results:

The adversary, say Eve, will try encrypting candidate plaintexts with the public key and see which one produces $31$. They can try the most likely candidates first to reach a match sooner:

$$\begin{gather} \begin{aligned} \text{Guess } M = 1 \Rightarrow 1^3 \bmod 33 = 1 \\ \text{Guess } M = 2 \Rightarrow 2^3 \bmod 33 = 8 \\ \text{Guess } M = 3 \Rightarrow 3^3 \bmod 33 = 27 \\ \text{Guess } M = 4 \Rightarrow 4^3 \bmod 33 = 31 \\ \end{aligned} \end{gather}$$

By simply comparing ciphertexts, the attacker finds that encrypting $4$ yields 31, which matches the intercepted ciphertext. Thus, the attacker learns the original plaintext $M$ was $4$. This is possible because there’s no randomization in textbook RSA – an eavesdropper can identify a message by trial encryption of guesses, breaking confidentiality if the message space is small or guessable.
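The guessing attack above takes only a few lines to express. This is a minimal sketch using the toy public key $(n = 33, e = 3)$ and the intercepted ciphertext $31$ from the example; the function name `guess_plaintext` is an assumption for illustration.

```python
def guess_plaintext(c: int, e: int, n: int, candidates):
    """Encrypt each candidate with the public key and compare against c."""
    for m in candidates:
        if pow(m, e, n) == c:
            return m
    return None   # no candidate matched

n, e = 33, 3          # public key from the toy example
intercepted = 31      # ciphertext observed by Eve

# Eve needs only public information: no private key is involved.
recovered = guess_plaintext(intercepted, e, n, range(n))
print(recovered)
```

With a real key the message space is far too large to enumerate blindly, but the same loop still works whenever the set of plausible messages is small, such as "yes" or "no".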

Malleability Exploit (Ciphertext Manipulation Attack)

Raw RSA is also malleable. This means an attacker can take a ciphertext and modify it in a way that results in a predictable change in the decrypted plaintext. Let’s understand how this works.

RSA has a multiplicative property, that is, multiplying two ciphertexts corresponds to multiplying their plaintexts before encryption:

$$E(M_1) \cdot E(M_2) \mod n = (M_1^e \mod n)\times(M_2^e \mod n) \mod n = (M_1 \cdot M_2)^e \mod n$$

The sequence diagram below explains how the malleability exploit works in naive RSA.

Alice sends a ciphertext to Bob after the initialization phase. Note that by this point, n and e are public knowledge. Eve intercepts this ciphertext by using mechanisms such as a MiTM (Man in the Middle) attack.

Now, Eve picks a known value to manipulate the message. Let’s say the attacker chooses $X = 2$ (with the intent to double the original plaintext).

Then they compute the encryption of X using the public key:

$$E(X) = 2^3 \mod 33 = 8.$$

Now, Eve multiplies the original ciphertext by this value (mod n) to get a new ciphertext:

$$\begin{gather} \begin{split} C{\prime} = C \times E(X) \bmod n = 31 \times 8 \bmod 33 \\ C{\prime} = 248 \bmod 33 = 248 - 33 \times 7 = 248 - 231 = 17 \end{split} \end{gather}$$

This new ciphertext $C{\prime}$ is the encryption of the product of the original plaintext and $2$. If we directly encrypted $M \times X = 4 \times 2 = 8$ with RSA, we would get $8^3 \mod 33 = 512 \mod 33 = 17$. This means that $C′$ corresponds to the plaintext $8$, which is the original message $4$ multiplied by $2$.

In a real-world chosen ciphertext attack, the attacker may have access to a decryption oracle or observe a system response that reveals information about the modified plaintext $M{\prime} = M \times X$. Here the decryption result $8$ is exactly $M \times 2$ (the original message multiplied by the attacker’s chosen factor). Knowing the factor $X = 2$, the attacker can deduce the original message by dividing: $8 / 2 = 4$. In general this division is modular, that is, a multiplication by $X^{-1} \bmod n$.

Note that Eve has not broken the mathematical foundations behind RSA here. They have only used the public key to compute an encryption of $2$, and then combined it with the intercepted ciphertext. They don’t know the original plaintext yet, but they have manipulated the ciphertext in a way that they know the new plaintext is twice the original message.
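The whole exchange can be replayed in code with the toy key from this section. In this sketch the private exponent $d = 7$ is used only to simulate the receiver (or a decryption oracle); Eve’s own steps touch nothing but public values. The variable names are illustrative.

```python
n, e = 33, 3      # public key
d = 7             # private key, used here only to simulate Bob's decryption

m = 4
c = pow(m, e, n)                      # Alice's ciphertext: 31

# Eve's manipulation, using public information only.
x = 2
c_prime = (c * pow(x, e, n)) % n      # 31 * 8 mod 33 = 17

# Bob (or an oracle) decrypts the modified ciphertext.
m_prime = pow(c_prime, d, n)          # 8, which equals m * x mod n

# Eve undoes her factor with a modular division.
recovered = (m_prime * pow(x, -1, n)) % n
print(c, c_prime, m_prime, recovered)
```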

Low-Exponent Attacks

Beyond determinism and malleability exploits, textbook RSA is also vulnerable to Low-Exponent Attacks. Using a small public exponent like $e = 3$ (or sometimes $17$) was popular because it used to speed up encryption and signature verification. But this soon turned out to be a security concern.

When RSA uses a small public exponent (say, $e = 3$) and the plaintext is very short (so that $M^3$ is smaller than the modulus $n$), the encryption does not “wrap around” modulo $n$. Mathematically:

$$c = M^3 \mod n = M^3 \quad \text{(if $ M^3 < n $)}$$

Let’s understand this with an easy example:

Consider our plaintext to be: $M = 5$. We compute $M^3$ as $M^3 = 5^3 = 125$.

Now assume $n$ is a $4096$‑bit number which is large compared to $125$. In this case, the ciphertext is simply $c = 125$. Eve intercepting $c = 125$ can compute the cube root of $125$ to get the plaintext: $\sqrt[3]{125} = 5$ thus recovering $M$ directly.

This shows that if $M$ is small enough, the ciphertext leaks the plaintext when $e$ is low.
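A minimal sketch of this cube-root recovery follows. The integer cube root is computed by binary search to avoid floating-point rounding on large numbers. The modulus below is only a stand-in with a realistic-looking size (a product of two known Mersenne primes); it is not claimed to be a valid RSA modulus for $e = 3$, since only its size matters for the demonstration.

```python
def integer_cube_root(c: int) -> int:
    """Largest integer r with r**3 <= c, found by binary search."""
    lo, hi = 0, 1
    while hi ** 3 <= c:
        hi *= 2
    # Invariant: lo**3 <= c < hi**3
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= c:
            lo = mid
        else:
            hi = mid
    return lo

n = (2**127 - 1) * (2**89 - 1)   # stand-in for a large modulus
e = 3

M = int.from_bytes(b"hi", "big")  # a very short message
c = pow(M, e, n)

# M^3 is smaller than n, so the modular reduction never happened.
assert c == M ** 3

recovered = integer_cube_root(c)
print(recovered.to_bytes(2, "big"))
```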

Håstad’s Broadcast Attack: Low Exponent Meets Multiple Recipients

In 1985, Johan Håstad described a broadcast attack that illustrates the danger of a low exponent $e$ when the same message is sent to multiple parties.

Imagine Alice wants to send the same plaintext message M to three different recipients. Each recipient has their own RSA public key with modulus $N_1, N_2, N_3,$ but for speed all use $e = 3$ (a common practice historically). Alice encrypts $M$ with each public key, yielding ciphertexts:

$$\begin{gather} \begin{split} C_1 = M^3 \bmod N_1 \\ C_2 = M^3 \bmod N_2 \\ C_3 = M^3 \bmod N_3 \end{split} \end{gather}$$

Eve, who intercepts all three $C_1, C_2, C_3$ can recover M without breaking any single RSA key.

Since each $N_i$ is different (and we assume they are pairwise coprime, as RSA keys should be), the attacker can use the Chinese Remainder Theorem (CRT) to combine the three congruences $x \equiv C_i \pmod{N_i}$. Note that at this point Eve only has $C_1$, $C_2$ and $C_3$. They do not have the plaintext $M$ or $M^3$, and yet they can reconstruct $M^3$ with the intercepted data. To understand the Chinese Remainder Theorem and this reconstruction, you may follow this: CRT, RSA, and Low Exponent Attacks | YouTube.

There is a unique solution for $x$ modulo $N_1N_2N_3$, and that solution turns out to be the integer $x = M^3$ (because $M < N_i$ for each $i$, the true integer $M^3$ is smaller than the product $N_1N_2N_3$). In essence, CRT lets Eve reconstruct $M^3$ exactly. Once they have $M^3$ as an ordinary integer, they simply take the cube root to find $M$. There’s no need to factor any modulus or invert the RSA function, because the math falls out due to the low exponent.

The sequence diagram below aims to provide a high-level understanding of the attack:

Now let’s see this attack in action with a sample:

Suppose three different RSA public keys all use exponent $e=3$, with moduli $n_b = 187$ (for Bob), $n_c = 115$ (for Carol), and $n_d = 87$ (for Dave).

These $n_i$ are pairwise coprime (the $\gcd$ of each pair is $1$). Now assume the same plaintext message $M$ is encrypted with each public key. Let’s take a concrete value, $M=42$, for which we will have:

$$\begin{gather} \begin{split} c_b = M^3 \bmod n_b \\ c_c = M^3 \bmod n_c \\ c_d = M^3 \bmod n_d \\ \end{split} \end{gather}$$

On calculating these, we have:

$$\begin{gather} \begin{split} c_b = 42^3 \bmod 187 = 36 \\ c_c = 42^3 \bmod 115 = 28 \\ c_d = 42^3 \bmod 87 = 51 \\ \end{split} \end{gather}$$

So the three ciphertexts observed are $36$, $28$, and $51$, respectively. Eve who knows $n_b, n_c, n_d$ and these ciphertexts can now recover $M$ as follows:

Eve will compute the total modulus $N = n_b \cdot n_c \cdot n_d = 187 \times 115 \times 87 = 1,870,935$ (this is the modulus for the combined system of congruences).

Now Eve will compute the partial products for each congruence:

$$\begin{gather} \begin{split} N_b = \frac{N}{n_b} = \frac{1,870,935}{187} = 10,005 \\ N_c = \frac{N}{n_c} = \frac{1,870,935}{115} = 16,269 \\ N_d = \frac{N}{n_d} = \frac{1,870,935}{87} = 21,505 \end{split} \end{gather}$$

At this point, Eve needs the inverses of each $N_i$ modulo its corresponding $n_i$:
- First Eve computes $M_b = (N_b)^{-1} \bmod n_b$, i.e. the number $M_b$ such that $N_b \cdot M_b \equiv 1 \pmod{187}$. In this case, $N_b = 10005$. Using the extended Euclidean algorithm, Eve can find $M_b = 2$ (since $10005 \times 2 = 20010 \equiv 1 \pmod{187}$).
- Then Eve computes $M_c = (N_c)^{-1} \bmod n_c$. Here $N_c = 16269$. The inverse mod $115$ turns out to be $M_c = 49$ (For verification: $16269 \times 49 \equiv 1 \pmod{115}$).
- Next up, Eve computes $M_d = (N_d)^{-1} \bmod n_d$. For $N_d = 21505$, the inverse mod $87$ is $M_d = 49$ as well (coincidentally the same value in this case, since $21505 \times 49 \equiv 1 \pmod{87}$).

Now Eve reconstructs the combined value using the Chinese Remainder Theorem for three congruences. The construction of this formula is beyond the scope of this handbook, but to completely understand how this springs into action, you may go through this video: CRT, RSA and Low Exponent Attacks | YouTube.

$$C \;=\; c_b \cdot N_b \cdot M_b \;+\; c_c \cdot N_c \cdot M_c \;+\; c_d \cdot N_d \cdot M_d \pmod{N}$$

On substituting the numbers:

$$C = 36 \cdot 10005 \cdot 2 \;+\; 28 \cdot 16269 \cdot 49 \;+\; 51 \cdot 21505 \cdot 49 \pmod{1,870,935}$$

Let’s carefully evaluate each term:

$$\begin{gather} \begin{split} 36 \cdot 10005 \cdot 2 = 720,360 \\ 28 \cdot 16269 \cdot 49 = 22,321,068 \\ 51 \cdot 21505 \cdot 49 = 53,740,995 \\ \end{split} \end{gather}$$

Summing these gives a raw total of $720,360 + 22,321,068 + 53,740,995 = 76,782,423$. Now reduce this modulo $N = 1,870,935$. Since $76,782,423 = 41 \times 1,870,935 + 74,088$, we get:

$$\begin{align} \begin{split} C \equiv 76,782,423 \pmod{1,870,935}\\ C = 74,088 \\ \end{split} \end{align}$$

Now Eve will simply take the cube root of $C$: $\sqrt[3]{74088} = 42$, which is the original plaintext. Eve has successfully recovered $M$.
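The worked example above can be reproduced with a short script. This is a sketch under the same numbers ($n_b = 187$, $n_c = 115$, $n_d = 87$ and ciphertexts $36$, $28$, $51$); the helper names `crt` and `integer_cube_root` are assumptions for illustration.

```python
def crt(residues, moduli):
    """Combine x = r_i (mod n_i) for pairwise coprime moduli."""
    N = 1
    for n_i in moduli:
        N *= n_i
    total = 0
    for r_i, n_i in zip(residues, moduli):
        N_i = N // n_i                 # partial product
        M_i = pow(N_i, -1, n_i)        # inverse of N_i modulo n_i
        total += r_i * N_i * M_i
    return total % N

def integer_cube_root(c: int) -> int:
    """Largest integer r with r**3 <= c, found by binary search."""
    lo, hi = 0, 1
    while hi ** 3 <= c:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= c:
            lo = mid
        else:
            hi = mid
    return lo

moduli = [187, 115, 87]        # Bob, Carol, Dave
ciphertexts = [36, 28, 51]     # what Eve intercepted

m_cubed = crt(ciphertexts, moduli)     # M^3 as an ordinary integer
recovered = integer_cube_root(m_cubed)
print(m_cubed, recovered)
```

The attack succeeds because $M^3$ is smaller than the product of the three moduli, so the CRT result is the true integer $M^3$ rather than a reduced value.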

The key takeaway from these attacks is that, without proper defenses, RSA alone does not satisfy modern definitions of security. It is not resistant to chosen-plaintext or chosen-ciphertext attacks. This gap between the theoretical one-way function (RSA’s trapdoor permutation) and a secure encryption scheme became evident as implementers found that naive RSA could be “broken” by various clever tricks.

To counter these weaknesses, standards bodies introduced padding schemes to strengthen RSA encryption. In the following sections, you will learn about each of these padding schemes and how they’ve been exploited over the years.

Introduction to Padding Schemes in RSA

Before we dive into the padding schemes and how they help our case, let’s quickly recap the need for padding in RSA.

Textbook RSA encryption is deterministic. The same plaintext always produces the same ciphertext under a given public key. This determinism makes raw RSA insecure. An attacker can guess possible messages, encrypt them with the public key, and compare with the target ciphertext to see which guess matches.

Beyond determinism, small-exponent attacks illustrate why padding is critical. If the message $m$ is too small relative to the modulus, raising it to a small public exponent (like $e=3$) might not wrap around $N$. Padding the plaintext with random data before encryption remedies these problems by making the ciphertext unpredictable and ensuring $m^e$ spans the modulus’ range.

Public Key Cryptography Standards (PKCS#1 v1.5)

In 1998, Kaliski and RSA Laboratories published PKCS#1 v1.5 as a public document [7]. In PKCS#1 v1.5, every RSA‐encrypted message is wrapped inside a special “encryption block” $EB$. This block ensures that the raw message is both the right size for RSA and padded in a way that’s hard to tamper with.

In this scheme, the plaintext is padded to the size of the modulus $N$ (in bytes) as:

$$EB = 00 ~||~ BT ~||~ PS ~||~ 00 ~||~ M$$

Here, $0x00$ (Leading Zero Byte) is always at the front. It ensures that, when the concatenated string $EB$ is converted to a big‐endian integer, the value is less than the RSA modulus (that is, we don’t end up with a number too large for RSA to handle). You will better appreciate this fact when we dive into the mathematics behind this.

The next octet is the Block Type, $BT$, which tells us the “type” of padding being used. The standard defines three possible $BT$ values ($00$, $01$, and $02$) to support different operations: $BT = 00$ and $BT = 01$ are used for private-key operations (such as digital signatures), and $BT = 02$ is used for public-key operations. For encryption under PKCS#1 v1.5, this is always $0x02$. It’s basically a label that says, “This is an encryption block, not something else”.

The next block is the Padding String $PS$. This is a string of nonzero random bytes. This is crucial for security because it introduces randomness into each encryption. If the same message is encrypted multiple times, these random bytes ensure that each ciphertext looks different, foiling many simple attacks that rely on seeing repeated patterns.

The next octet, $0x00$, is a Delimiter. This single zero byte marks the end of the padding. During decryption, this helps the recipient quickly identify where the padding stops and the real message begins.

Finally, we have the actual data you want to protect – $M$. Once the recipient has verified the padding, they know exactly where to find this message.

This mechanism helped solve the deterministic issue of naive RSA. In the next sections, let’s understand the mathematics involved in PKCS#1 v1.5 padding and its security implications.

The Mathematics Behind PKCS#1 v1.5

Before we begin, let’s get our symbols and abbreviations correct. We will use upper-case symbols (such as $EB$) to denote octet strings and bit strings. We will use lower-case symbols (such as $n$) to denote integers.

In PKCS#1 v1.5, we will use $k$ to represent the length of the RSA modulus $n$ in bytes. For example, if you have a $1024$-bit RSA key, then the RSA modulus $n$ is a $1024$-bit number. Since there are $8$ bits in a byte, if your RSA modulus is $L$ bits long, then:

$$k = \left\lceil \frac{L}{8} \right\rceil = \frac{1024}{8} = 128 \text{ bytes}$$

The total length of the encryption block will be equal to this RSA key length $k$ (in bytes). The length of the data $M$ must not exceed $k-11$ octets: three octets are taken by the fixed bytes ($0x00$, $0x02$, and the $0x00$ delimiter), and the padding string $PS$ must be at least eight octets long, which is a security condition in PKCS#1 v1.5. The length of $PS$ is therefore:

$$|PS| = k - |M| - 3$$

For example, with a $1024$-bit RSA modulus, the value of $k$ comes out to be $128$. Here Alice could encrypt up to $128 - 11 = 117$ bytes of data. The $11$ bytes are used for the $0x00 ~||~ 0x02 ~||~ PS ~||~ 0x00$ structure. The random $PS $ ensures that each encryption of the same message produces a different ciphertext, preventing the deterministic encryption problem.
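The block layout and length rules described above can be sketched as follows. This is an illustration only, not a vetted implementation: the function names are assumptions, and the unpadding routine deliberately ignores the side-channel concerns that real libraries must handle.

```python
import os

def pkcs1_v15_pad(message: bytes, k: int) -> bytes:
    """Build EB = 00 || 02 || PS || 00 || M for a k-byte modulus."""
    if len(message) > k - 11:
        raise ValueError("message too long for this modulus size")
    ps_len = k - len(message) - 3
    ps = bytearray()
    while len(ps) < ps_len:
        # Draw random bytes and keep only the nonzero ones.
        ps.extend(b for b in os.urandom(ps_len - len(ps)) if b != 0)
    return b"\x00\x02" + bytes(ps) + b"\x00" + message

def pkcs1_v15_unpad(eb: bytes) -> bytes:
    """Strip the padding; illustration only, not safe against oracles."""
    if len(eb) < 11 or eb[0] != 0x00 or eb[1] != 0x02:
        raise ValueError("bad padding")
    sep = eb.find(b"\x00", 2)      # first zero byte after the block type
    if sep < 10:                   # also covers "not found" and PS < 8 bytes
        raise ValueError("bad padding")
    return eb[sep + 1:]

k = 128                            # 1024-bit modulus
eb1 = pkcs1_v15_pad(b"Hi", k)
eb2 = pkcs1_v15_pad(b"Hi", k)

print(len(eb1), eb1 != eb2, pkcs1_v15_unpad(eb1))
```

Padding the same message twice almost certainly yields two different blocks, which is exactly what removes the determinism of textbook RSA.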

RSA doesn’t directly operate on the bytes. Once the padded string $EB$ is ready, it needs to be converted into an integer guided by the Octet String to Integer Primitive (OS2IP) formula:

$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EB}_i$$

where $EB_i$ are the octets of $EB$ from first to last. In other words, $EB_1$ (the first byte) is the most significant byte, and $EB_k$ (the last byte) is the least significant. Now Alice can simply encrypt this block using $C = x^e \bmod n$.

To solidify our learnings so far, let’s apply this to a sample plaintext and find the padded blocks.

Let’s assume the RSA modulus is $8$ bytes long ($k=8$). Suppose we want to encrypt a message $M$ that is $2$ bytes long. Then the padding string $PS$ must fill the remaining space:

$$\text{Total bytes} = k = 8 = 1~(0x00) + 1~(BT) + |PS| + 1~(\text{delimiter}) + |M|$$

Since $|M| = 2$ and there are $3$ fixed bytes, we can find the required length of the padding string:

$$|PS| = 8 - 3 - 2 = 3 \text{ bytes}$$

Let’s pick 3 arbitrary nonzero bytes for $PS$, say $0xA3, ~0x5F, ~0xC2$. And let’s say the message is the ASCII text “Hi”. In hexadecimal, that’s: $0x48$ for 'H' and $0x69$ for 'i'.

Thus, the complete encryption block becomes:

$$EB = 0x00 ~||~ 0x02 ~||~ 0xA3 ~||~ 0x5F ~||~ 0xC2 ~||~ 0x00 ~||~ 0x48 ~||~ 0x69$$

Now we will convert this octet string to an integer using the OS2IP formula we discussed above:

$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EB}_i$$

For our example, with $k=8$ the conversion is:

$$x = 0x00 \times 256^7 + 0x02 \times 256^6 + 0xA3 \times 256^5 + 0x5F \times 256^4 + 0xC2 \times 256^3 + 0x00 \times 256^2 + 0x48 \times 256^1 + 0x69 \times 256^0$$

Note that the hexadecimal values can be converted to decimal as needed. For instance, $0xA3 = 163, 0x5F = 95, 0xC2 = 194, 0x48 = 72,$ and $0x69 = 105$.

There is an interesting observation in the application of this formula. Because the first two bytes are fixed ($0x00$ and $0x02$), the integer $x$ has a known lower bound. The contribution of the first two bytes is:

$$0 \times 256^7 + 2 \times 256^6 = 2 \times 256^6$$

The rest of the bytes ($PS$, the delimiter, and $M$) add some value that is at least $0$ and at most just less than $256^6$ (since the second byte is fixed as $0x02$ and cannot be $0x03$). Thus, $x$ is in the range:

$$2 \times 256^6 \le x < 3 \times 256^6$$

This property, which makes the range predictable, paved the way for the Bleichenbacher attack (also known as the “padding oracle” attack). If a system reveals whether a decrypted block is “correctly padded,” an attacker can systematically probe different ciphertexts and narrow down the plaintext, because the attacker knows it must lie in that narrow range. Let’s take a detailed look at the Bleichenbacher attack in the next sections and understand how the exploit works.
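The worked example and the range bound above can be checked numerically. The short sketch below rebuilds the $8$-byte block from the example, converts it to an integer, and confirms that it falls between $2 \times 256^6$ and $3 \times 256^6$. The variable name `B` for $256^{k-2}$ follows the notation used in the attack description.

```python
# The example block: 0x00 || 0x02 || PS (A3 5F C2) || 0x00 || "Hi"
EB = bytes([0x00, 0x02, 0xA3, 0x5F, 0xC2, 0x00, 0x48, 0x69])
k = len(EB)

x = int.from_bytes(EB, "big")     # OS2IP
B = 2 ** (8 * (k - 2))            # 256^6 when k = 8

print(hex(x))                     # 0x2a35fc2004869
print(2 * B <= x < 3 * B)         # True
```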

The Bleichenbacher Attack

In 1998, Daniel Bleichenbacher published a seminal paper [8] demonstrating an adaptive chosen-ciphertext attack against RSA with PKCS#1 v1.5 padding. The Bleichenbacher attack, also dubbed the “million message” attack, showed that if an attacker has access to an oracle that tells whether a submitted ciphertext decrypts to a properly padded plaintext (that is, whether the PKCS#1 v1.5 formatting is correct), the attacker can gradually recover the full plaintext. Let’s break down how this attack works:

First, Eve needs an oracle. The attack assumes the attacker can query a system, such as an SSL/TLS server, and find out if a given ciphertext $C$ is PKCS#1 v1.5 conformant. In the 1998 paper, Bleichenbacher exploited the fact that a server, when presented with an improperly padded RSA-encrypted premaster secret, would respond with a specific error alert. Essentially, the server acted as an oracle: it would decrypt $C$ with its private key and simply tell the attacker “padding OK” or “padding error” (the error could be timing-based or an explicit alert).

Note that the oracle does not reveal the plaintext. It only reveals a single bit of information at a time: “valid padding or not.” This might seem harmless, but Bleichenbacher showed that it’s enough to eventually recover the plaintext.

To quickly recap, the attacker’s goal is to find the unknown message integer $m$ (the PKCS#1-padded plaintext as an integer) given its ciphertext $C = m^e \bmod N$, using the oracle. We know that if $m$ is properly padded, it lies in a specific numeric range: $2B \le m < 3B$ where $B = 2^{8*(k-2)}$, as defined earlier.

If $k=128$ bytes, then $B=2^{8*126}$, and a correctly padded $m$ will start with $0x00 ~||~0x02$, so it’s between $2B$ and $3B$. The attacker, Eve, initially only knows that $m$ is in the range $[2B, 3B)$.

In the Bleichenbacher Attack, Eve will exploit RSA’s multiplicative property. They will choose a number $s$ (called the multiplier) and compute a new ciphertext $C' = (C s^e) \bmod N$. This $C'$ here corresponds to a new plaintext: $m' = m s \bmod N$ (because $C' \equiv m^e * s^e \equiv (ms)^e \pmod{N}$).
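The multiplicative property is easy to see on a toy key. The numbers below ($p = 61$, $q = 53$, $e = 17$, $m = 1234$, $s = 7$) are our own illustrative choices and are far too small for real use. The private exponent is computed only to confirm what $C'$ decrypts to, something Eve herself cannot do. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
# Toy RSA key (insecure, for illustration only)
p, q = 61, 53
N = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # modular inverse of e

m = 1234                 # the plaintext Eve does not know
C = pow(m, e, N)         # the ciphertext Eve intercepted

s = 7                    # Eve's chosen multiplier
C_prime = (C * pow(s, e, N)) % N

# Decrypting C' gives m * s mod N, even though Eve never saw m.
m_prime = pow(C_prime, d, N)
print(m_prime == (m * s) % N)   # True
```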

To begin the attack, Eve finds some $s_0$ such that $C_0 = C \times (s_0)^e \bmod N$ decrypts to a validly padded block. This is referred to as the blinding step. When $C$ is itself a properly padded PKCS#1 v1.5 ciphertext (the usual case when Eve has intercepted an encrypted message), this step is trivial because $s_0 = 1$ already works. Otherwise, Eve tries random values of $s_0$ until the oracle accepts one. The attacker does not know $m$ and so cannot verify this directly. They rely on the padding oracle’s yes/no response to infer that the blinded plaintext $(m \times s_0) \bmod N$ falls in the correct range.

If the oracle returns “valid padding” for a given $s_0$, it tells the attacker that $(m \times s_0) \bmod N$ lies between $2B$ and $3B$. Mathematically:

$$2B \le (m \times s_0) \bmod N < 3B$$

Now, Eve will try to narrow down this range in a loop, which is often referred to as the interval halving step. Initially, Eve has one wide interval $[a, b] = [2B, 3B - 1]$ that contains $m$ (more precisely, the blinded plaintext $(m \times s_0) \bmod N$, which equals $m$ when $s_0 = 1$). In each iteration, Eve tries increasing values of $s$ (starting from a certain minimum) until the oracle returns “padding OK” for $C' = C_0 \times s^e \bmod N$. Suppose this happens at some $s = s_i$. Given this feedback, Eve now knows:

$$2B \le (m \times s_i) \bmod N < 3B$$

This congruence implies there exists some integer $r$ such that:

$$2B ≤ ( m×s_i)−rN < 3B$$

Rearranging, we get a constraint on $m$:

$$\frac{2B+rN}{s_i} ≤ m < \frac{3B+rN}{s_i}$$

Eve doesn’t know $r$ outright, but they can solve for the possible range of $r$ by considering the current interval $[a,b]$ for $m$. Essentially, Eve uses the previous bounds on $m$ to guess which $r$ would make the inequality true, then updates the new bounds $[a, b]$ as the intersection of all possible solutions for $m$. This dramatically shrinks the interval.

Each successful oracle query yields such a constraint. Eventually, the interval $[a,b]$ collapses to a single value, $[a,a]$, which is the blinded plaintext. Eve then undoes the blinding from the first step to find the original plaintext using:

$$m = (a \times s_0^{-1}) \bmod N$$

At that point, Eve has recovered the entire padded plaintext $m$, and by stripping off the padding, the original message itself.
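To make the narrowing step more concrete, here is a minimal sketch of a single iteration, assuming $s_0 = 1$ so that the interval bounds $m$ directly. The modulus, the secret `m`, and the helper names `ceil_div`, `narrow`, and `oracle` are all our own illustrative choices. The oracle is simulated by looking at `m` directly, whereas a real attacker would learn the same yes/no answer by submitting $C \times s^e \bmod N$ to the server. The starting multiplier $\lceil N / 3B \rceil$ is the one suggested in Bleichenbacher's paper.

```python
def ceil_div(x: int, y: int) -> int:
    return -(-x // y)

def narrow(a: int, b: int, s: int, N: int, B: int) -> list:
    """Given m in [a, b] and a multiplier s with 2B <= (m*s mod N) < 3B,
    return the sub-intervals of [a, b] that can still contain m."""
    intervals = []
    # Possible values of r in 2B <= m*s - r*N <= 3B - 1, given a <= m <= b.
    r_min = ceil_div(a * s - 3 * B + 1, N)
    r_max = (b * s - 2 * B) // N
    for r in range(r_min, r_max + 1):
        lo = max(a, ceil_div(2 * B + r * N, s))
        hi = min(b, (3 * B - 1 + r * N) // s)
        if lo <= hi:
            intervals.append((lo, hi))
    return intervals

# Toy setup: a 5-byte modulus, far too small for real use.
N = 65537 * 65539
k = (N.bit_length() + 7) // 8
B = 2 ** (8 * (k - 2))
m = 2 * B + 123456            # secret padded plaintext, used only by the oracle

def oracle(s: int) -> bool:
    """Simulated padding oracle: does m*s mod N start with 0x00 0x02?"""
    return 2 * B <= (m * s) % N < 3 * B

a, b = 2 * B, 3 * B - 1       # all Eve knows at the start
s = ceil_div(N, 3 * B)        # smallest multiplier worth trying
while not oracle(s):
    s += 1

intervals = narrow(a, b, s, N, B)
print(s, intervals)
print(any(lo <= m <= hi for lo, hi in intervals))   # True
```

A full attack repeats this step with new multipliers, intersecting the surviving intervals each time, until only one candidate value remains.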

The sequence diagram below consolidates our learning of the attack:

The Bleichenbacher attack showed that the format of the padding in PKCS#1 v1.5 leaked just enough information to enable a full private-key operation (decrypting the message) without ever factoring $N$. The attack leveraged the fact that it’s possible to craft ciphertexts that will decrypt to a valid-looking plaintext without knowing the plaintext. In essence, PKCS#1 v1.5 padding gave a random blob roughly a $1$ in $2^{16}$ chance of appearing as “valid padding.” That was enough for an adaptive attack to succeed with a feasible number of queries.

This is precisely what later padding designs like OAEP fixed. OAEP’s design makes such random valid ciphertexts astronomically unlikely (plaintext aware). We will learn about RSA-OAEP in the next sections.

To mitigate the Bleichenbacher attack without immediately changing the padding scheme, practitioners implemented defensive measures. For example, TLS implementations treat all decryption failures the same way (so an attacker can’t distinguish padding errors from other errors), and servers generate a fake premaster secret on padding failure so that the handshake continues and timing leaks are avoided. Nonetheless, the safest course has been to deprecate PKCS#1 v1.5 encryption in favor of schemes like RSA-OAEP.

Optimal Asymmetric Encryption Padding (OAEP)

By the end of 1995, Bellare and Rogaway proposed Optimal Asymmetric Encryption Padding (OAEP) with the goal of achieving provable security. This padding aimed to make RSA encryption resistant not just to passive attacks but also to adaptive chosen-ciphertext attacks. In other words, even if an attacker can trick a system into decrypting chosen ciphertexts (as an “oracle”), they should learn nothing useful about the plaintext. OAEP was subsequently standardized in PKCS#1 v2.0 (published as RFC 2437 in 1998) and later versions.

Compared to PKCS#1 v1.5, OAEP has a more complex encoding that uses hash functions and a mask generation function (MGF) to thoroughly randomize the plaintext before RSA encryption, providing stronger guarantees.

OAEP’s design can be viewed as a two-layer Feistel-like network using a random seed. It takes the input message and randomizes it in a way that is reversible only with the correct seed. The scheme was proven plaintext-aware in the random oracle model which means that an adversary cannot concoct a valid ciphertext without knowing the corresponding plaintext. If an attacker tries to forge or tamper with ciphertexts, they almost surely produce an invalid padding that will be rejected. This property directly counters padding-oracle attacks.

OAEP (with a proper hash/MGF) is semantically secure against adaptive chosen ciphertext attacks, assuming RSA is hard to invert and treating the hash functions as random oracles. Unlike PKCS#1 v1.5, which lacked a formal proof, OAEP comes with a proof sketch that breaking RSA-OAEP is as hard as breaking RSA itself.

In practice, this means OAEP drastically reduces the risk of any padding oracle: an attacker can no longer easily find ciphertexts that slip through the padding check except by brute force, which has a $2^{-hLen*8}$ success probability. For example, the success probability with SHA-1 would be $2^{-160}$.

The block diagram below is a visual representation of the OAEP encoding schema:

Let’s understand what these mathematical notations mean and how RSA-OAEP works, up next.

The Mathematics Behind OAEP

Optimal Asymmetric Encryption Padding requires a hash function for two operations we will discuss in this section. We will use SHA-1 as the hash function in our OAEP walkthrough, and $hLen$ denotes the length in octets of the hash function output. Later, we will see why even MD5 or SHA-1 is an acceptable choice for OAEP even though neither is collision resistant.

Before we dive into the mathematics, let’s recap a few notations and define the main pieces we’ll be using:

In RSA, $N$ is the modulus, and $k$ is the size of $N$ in bytes. For a $2048$-bit modulus, $k=256$ bytes.
$M$ is the message or plaintext to be encrypted. This plaintext must be short enough to fit into the padded block (at most $k−2⋅hLen−2$ bytes).
$Hash$ refers to the cryptographic hash function (for example, SHA-1 or SHA-256) with output length $hLen$. For example, if using SHA-1, $hLen=20$ bytes.

We will also use an optional string associated with the message (often empty). This is the Label $L$. If this label is empty, its hash is a fixed value. (For example: the SHA-1 of an empty string).

The hash of this label $L$ is represented by $lHash$, where $lHash=Hash(L)$. As mentioned earlier, if $L$ is empty, $lHash$ is simply $Hash('')$. This means that in any case $lHash$ will hold a value.

We will also use a Mask Generation Function, $MGF$, which is often mentioned as $MGF1$. This function takes an input (seed or masked data) and produces an output of a specified length by iterating the underlying hash function. We’ll write $MGF(input,length)$ to indicate “generate a mask of $length$ bytes from $input$”.
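As a minimal sketch, $MGF1$ (as specified in RFC 8017) can be written in a few lines of Python: it hashes the input together with a $4$-byte counter, concatenates the digests, and truncates to the requested length. The function name `mgf1` and the `hash_name` parameter are our own choices.

```python
import hashlib

def mgf1(seed: bytes, length: int, hash_name: str = "sha1") -> bytes:
    """MGF1: concatenate Hash(seed || counter) for counter = 0, 1, 2, ..."""
    output = b""
    counter = 0
    while len(output) < length:
        c = counter.to_bytes(4, "big")       # 4-byte big-endian counter
        output += hashlib.new(hash_name, seed + c).digest()
        counter += 1
    return output[:length]

mask = mgf1(b"seed", 50)
print(len(mask))                             # 50
```

Asking for a longer mask from the same input only appends more bytes, so the first $20$ bytes of a $50$-byte mask equal the $20$-byte mask.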

Now that you are familiar with all the necessary notations, we are ready to begin the encoding step.

Step 1: Constructing the Data Block (DB)

We will compute $lHash=Hash(L)$. If $L$ is empty, $lHash$ is a constant (For example, the SHA-1 of the empty string).

Next, we form the padding string $PS$. Its length is chosen so that the entire block $DB$ has length $(k−hLen−1)$ bytes. Concretely, $PS$ consists of $(k−mLen−2⋅hLen−2)$ bytes of $0x00$, where $mLen$ is the length of the message $M$.

Now we simply concatenate the blocks to generate the octet string for the Data Block ($DB$):

$$DB=lHash~∣∣~PS~∣∣~0x01~∣∣~M$$

Here the single byte $0x01$ acts as a delimiter which marks where the zero padding ends and the actual message $M$ begins. $DB$ ends up being $(k−hLen−1)$ bytes.

Step 2: Generating a Mask for the Data Block

First, we pick a random string called $seed$ of length $hLen$ bytes. For example, when using SHA-1 where $hLen=20$, then we say that the seed consists of $20$ random bytes.

Now we use the mask generation function, $MGF$, on the $seed$ to create a mask the same length as $DB$:

$$dbMask=MGF(seed,k−hLen−1)$$

The idea is to spread the randomness of the seed across the entire $DB$.

Step 3: Mask the Data Block

Now, we will combine $DB$ and $dbMask$ with the bitwise $XOR$ operation:

$$maskedDB=DB \oplus dbMask$$

This step “scrambles” $DB$ with the random seed.

Step 4: Generate a Mask for the Seed

Next, we will produce a mask for the seed itself, based on $maskedDB$:

$$seedMask=MGF(maskedDB,hLen)$$

This step simply ensures that the seed is not left in the clear.

Step 5: Mask the Seed

Now we will combine the original seed and the new mask with an $XOR$ operation:

$$maskedSeed=seed \oplus seedMask$$

Now the seed is also “scrambled” by the data block.

Step 6: Form the Final Encoded Message (EM)

We are now ready to build our final block. Simply concatenate everything into a $k$-byte string:

$$EM=0x00~∣∣~maskedSeed~∣∣~maskedDB$$

The leading $0x00$ byte ensures that when $EM$ is interpreted as an integer, it’s less than the RSA modulus $N$. At this point, $EM$ is your OAEP-padded message of length $k$.

Step 7: Convert the Concatenated String to an Integer

Remember from our earlier discussion of PKCS#1 v1.5 that RSA cannot directly operate on this concatenated string of bytes. We need to convert the $EM$ block to a non-negative integer using the OS2IP formula:

$$x = \sum_{i=1}^{k} 2^{8(k - i)} \,\mathrm{EM}_i$$

Step 8: Perform RSA Encryption

Now that we have the encoded message ($EM$) as an integer $x$, we are ready to perform RSA guided by the formula:

$$C =x^e \bmod N$$

where $(e,N)$ is the public key. The resulting $C$ is our ciphertext generated using RSA-OAEP.

When decrypting, the process is reversed: the recipient uses their private key $d$ to compute $x = C^d \bmod N$, converts $x$ back into the octet string $EM$, then splits it into the $0x00$ byte, $maskedSeed$, and $maskedDB$, and uses the same $MGF$ and hash function to undo the $XOR$ operations in reverse order. Finally, they check that the recovered $lHash'$ matches the expected hash and that the block contains the proper structure ($...||0x01||...$).

If any check fails, the padding is invalid. Only if everything checks out is the message $M$ returned. The result is that an invalid ciphertext will almost always be detected and rejected without giving an attacker any useful information.
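Steps 1 to 6 and the decoding checks can be sketched together in Python. This is a simplified illustration with SHA-1, not a production implementation: the function names `oaep_encode` and `oaep_decode` are our own, the RSA operation itself (Steps 7 and 8) is left out, and the checks are not written in constant time as a real library would require.

```python
import hashlib
import os
from typing import Optional

H_LEN = 20  # SHA-1 output length in bytes

def mgf1(seed: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha1(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def oaep_encode(message: bytes, k: int, label: bytes = b"",
                seed: Optional[bytes] = None) -> bytes:
    if len(message) > k - 2 * H_LEN - 2:
        raise ValueError("message too long")
    l_hash = hashlib.sha1(label).digest()
    ps = b"\x00" * (k - len(message) - 2 * H_LEN - 2)
    db = l_hash + ps + b"\x01" + message                  # Step 1
    seed = os.urandom(H_LEN) if seed is None else seed    # Step 2
    masked_db = xor(db, mgf1(seed, k - H_LEN - 1))        # Steps 2 and 3
    masked_seed = xor(seed, mgf1(masked_db, H_LEN))       # Steps 4 and 5
    return b"\x00" + masked_seed + masked_db              # Step 6

def oaep_decode(em: bytes, k: int, label: bytes = b"") -> bytes:
    if len(em) != k or k < 2 * H_LEN + 2:
        raise ValueError("decryption error")
    masked_seed, masked_db = em[1:1 + H_LEN], em[1 + H_LEN:]
    seed = xor(masked_seed, mgf1(masked_db, H_LEN))
    db = xor(masked_db, mgf1(seed, k - H_LEN - 1))
    rest = db[H_LEN:].lstrip(b"\x00")                     # skip the zero padding PS
    ok = (em[0] == 0
          and db[:H_LEN] == hashlib.sha1(label).digest()
          and rest[:1] == b"\x01")
    if not ok:
        raise ValueError("decryption error")              # same error for every failure
    return rest[1:]

k = 128
em = oaep_encode(b"Hi", k)
print(len(em), oaep_decode(em, k))    # 128 b'Hi'
```

Flipping even a single bit of `em` changes the recovered seed, which scrambles the whole data block, so the $lHash$ comparison fails and the decoder raises the same generic error.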

By design, OAEP effectively foils the padding oracle problem. The chance that a random guess produces a valid OAEP encoding is negligible: on the order of $2^{-hLen*8}$. In fact, Daniel Bleichenbacher (after breaking PKCS#1 v1.5) advocated for exactly such a “plaintext-aware” padding where forging a valid padding is infeasible.

Why SHA-1 or MD5 Are Safe in RSA-OAEP

In the section above, we mentioned that we’d be using SHA-1 for our mathematical formulation and examples. When you see SHA-1 or MD5 used in the context of RSA-OAEP, don’t let the fact that these hash functions are considered broken for collision resistance alarm you. If you look carefully at the previous section, the hash functions serve two very specific roles that do not rely on their collision resistance. Let’s break them down one by one:

Label Hashing

The hash function is used to compute a fixed-length hash of an optional label $L$ (often empty).

Now let’s see why this is safe in this context. This hash, called $lHash$, acts as a domain separator. Its job is simply to ensure that the label is correctly associated with the ciphertext during decryption. As long as the label is chosen wisely (that is, not built from adversary-controlled parts), collision resistance isn’t critical here.

Mask Generation Function (MGF1)

The hash function is also used inside $MGF1$ to create a pseudorandom mask. This mask is applied both to the data block $DB$ and to the random seed used in the encoding process.

In this context, the hash function is treated as a random oracle. The job is to spread the randomness of the seed across a larger block of data. For this purpose, properties like length extension or collision resistance are not relevant. What matters is that the output appears random, and even SHA-1 or MD5 can deliver that when used in this controlled, fixed-input scenario.

Adoption in Cryptographic Libraries (PKCS#1 v1.5 vs OAEP)

After the Bleichenbacher attack, standards and libraries migrated to OAEP or at least added support for it, while treating PKCS#1 v1.5 as a legacy option. Modern cryptographic libraries and protocols reflect these lessons.

In 1998, the RSA standard was updated. PKCS#1 v2.0 introduced RSAES-OAEP as the new recommended encryption scheme. PKCS#1 v2.1 (RFC 3447) recommends OAEP for new applications, and v2.2 (RFC 8017) requires that new applications support it, with PKCS#1 v1.5 included only for backward compatibility.

OpenSSL discourages users from using PKCS#1 v1.5 as it leaks information that can potentially be used to mount a Bleichenbacher padding oracle attack [10]. The documentation clearly mentions that it is highly recommended to use RSA_PKCS1_OAEP_PADDING in new applications.

The Python cryptography library (PyCA cryptography) also asks developers to use OAEP for encryption instead of PKCS#1 v1.5 [11].

After Bleichenbacher’s 1998 attack, it was impractical to instantly replace PKCS#1 v1.5 everywhere. Instead, protocol designers issued countermeasures.

TLS, for example, responded by changing the error handling: the server would not reveal a padding failure distinctly. It would generate a fake premaster secret and proceed to prevent timing clues, and always return a generic handshake failure at a later stage, making it harder for the attacker to distinguish why decryption failed.
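The idea behind this countermeasure can be sketched as follows. The function name `premaster_or_fake` is hypothetical, and the code only illustrates the control flow: whatever the decrypted block looks like, the caller always receives $48$ bytes and never an error. Real TLS stacks perform these checks in constant time and also verify the protocol version bytes, which plain Python code like this does not attempt.

```python
import os

PREMASTER_LEN = 48

def premaster_or_fake(eb: bytes, k: int) -> bytes:
    """Return the premaster secret from a decrypted block, or a random one."""
    fake = os.urandom(PREMASTER_LEN)        # prepared before inspecting the padding
    sep = eb.find(b"\x00", 2)               # first 0x00 after the two header bytes
    ok = (len(eb) == k
          and eb[:2] == b"\x00\x02"
          and sep >= 10                     # at least 8 bytes of PS
          and len(eb) - sep - 1 == PREMASTER_LEN)
    # No exception and no distinct alert: the handshake simply fails later
    # because the fake secret will not match the client's.
    return eb[sep + 1:] if ok else fake

secret = bytes(range(48))
good = b"\x00\x02" + b"\xaa" * (128 - 3 - 48) + b"\x00" + secret
bad = b"\x00\x01" + good[2:]                # wrong block type byte

print(premaster_or_fake(good, 128) == secret)   # True
print(len(premaster_or_fake(bad, 128)))         # 48
```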

These countermeasures reduced the oracle’s fidelity but were tricky to get right across different implementations. In fact, not everyone got it right – the Bleichenbacher attack continued to resurface in various forms when implementations made mistakes in error handling.

In December 2017, researchers disclosed the ROBOT attack (Return Of Bleichenbacher’s Oracle Threat): several TLS implementations had subtle bugs that recreated a padding oracle, allowing the attack to succeed 19 years later. The ROBOT paper, published in 2018, showed that even with countermeasure guidelines, the complexity of uniformly handling errors led to slip-ups in popular products.

This underscores that patching an insecure scheme is often error-prone – a design that is secure by construction (like OAEP) is preferable.

PKCS#1 v1.5 continues to exist because of these patchwork security measures and the fact that it cannot be abruptly removed from all existing systems. It is generally regarded as "legacy" or maintained "for compatibility" purposes. The collective wisdom is clear: use OAEP for RSA encryption whenever possible.

Enhancing Digital Signatures: The Transition to PSS

Now that you understand how OAEP transformed RSA encryption by mitigating vulnerabilities in deterministic padding, it’s time to turn our attention to RSA digital signatures – a critical function for ensuring message integrity and authenticity.

Early RSA signature schemes suffered from similar problems as raw encryption: their deterministic nature made them prone to forgery and replay attacks. This vulnerability paved the way for an improvement: the Probabilistic Signature Scheme (PSS).

Before we dive into PSS itself, let’s quickly understand the pain points with early RSA signatures.

Problems with Early RSA Signature Schemes

Traditional RSA signatures were generated by simply applying the RSA decryption function on a message digest (often with minimal formatting):

$$s=m^d \bmod N$$

where $m$ is the hash (or encoded hash) of the message. This approach was deterministic which meant that each time the same message was signed, the exact signature was produced. Such determinism had two major drawbacks:

Predictability and Replay

Since the signature for a given message was always identical, an attacker could replay a captured signature with impunity or forge signatures if they could deduce patterns in the signature scheme.

Forgery Risks

In a deterministic setting, if an attacker finds any structure or mathematical relationship in the signature, they might be able to forge a valid signature for a new message. In certain scenarios, weak formatting could allow an adversary to create a “signature transformation” that produces a valid signature without having access to the private key.
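Both drawbacks can be reproduced on a toy key. The sketch below uses illustrative values of our own choosing ($p = 61$, $q = 53$, $e = 17$, and small integers standing in for message digests) and unformatted “textbook” signatures, which is exactly the setting the two points above describe. The forgery relies on the same multiplicative property we saw in the encryption setting.

```python
# Toy RSA key (insecure, for illustration only)
p, q = 61, 53
N, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))    # requires Python 3.8+

def sign(m: int) -> int:
    return pow(m, d, N)              # s = m^d mod N, no padding

def verify(m: int, s: int) -> bool:
    return pow(s, e, N) == m % N

m1, m2 = 42, 55                      # stand-ins for two message digests
s1, s2 = sign(m1), sign(m2)

# Determinism: signing the same value again gives the identical signature.
print(sign(m1) == s1)                            # True

# Forgery: the product of two signatures is a valid signature on the
# product of the two values, produced without touching the private key.
forged = (s1 * s2) % N
print(verify((m1 * m2) % N, forged))             # True
```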

These issues highlighted that a signature scheme must be probabilistic to be secure against adaptive forgery attempts and to ensure non-repudiation. This means that the signer should not be able to repudiate a signature because it is bound to a random value known only at signing time.

Birth of the Probabilistic Signature Scheme (PSS)

Towards the end of 1998, Bellare and Rogaway also proposed a scheme to overcome the inherent limitations of deterministic RSA signatures [12]. The core idea was to introduce randomness into the signature generation process so that even when signing the same message twice, the resulting signatures would be different. This randomness comes from a salt value and a carefully designed encoding process. The result is a signature method with strong, provable security guarantees.

This randomness prevents attackers from exploiting patterns in the signature process. The Probabilistic Signature Scheme was designed to be provably secure in the random oracle model, meaning that forging a signature would be as hard as breaking RSA itself under certain assumptions [13].

The block diagram below is a visual representation of the PSS encoding schema:

Let’s understand what these mathematical notions mean as well as the workings of RSA-PSS, up next.

The Mathematics Behind PSS

Before diving into the mechanics of RSA-PSS, it’s helpful to define the notations and terms you’ll see in the steps ahead.

In RSA, $N$ is the modulus, a large integer that is the product of two primes. $k$ is the length of $N$ in bytes. For a $2048$-bit key, $k=256$ bytes.

$M$ represents the message data or document you want to sign. In RSA-PSS, you’ll typically first compute a hash of $M$. $Hash$ refers to a cryptographic hash function (for example, SHA-256) that maps data to a fixed-size output. The output length is denoted $hLen$. For SHA-256, $hLen=32$ bytes.

We will use a salt, $S$, a randomly generated string of fixed length (often the same as $hLen$). This randomness is essential in ensuring that each signature is unique, even for the same message.

$H$ or $mHash$ is the hash of the message $M$, and $H'$ is a secondary hash that covers both $mHash$ and the salt $S$. This appears in the PSS encoding step.

The Mask Generation Function, $MGF$, is a function that uses the hash internally to produce a pseudorandom output of arbitrary length. In PSS, it is used to “mask” parts of the data block so that the signature is hard to forge.

A fixed byte, $0xbc$ (in hex), is appended at the end of the encoded message to mark the boundary of the PSS structure. This serves as a simple integrity check during decoding. After a successful encoding we receive an encoded message $EM$, which is an octet string of length $emLen = \left\lceil{\frac{emBits}{8}}\right\rceil$, where $emBits$ is one less than the bit length of the modulus $N$.

Now that you are familiar with all the necessary notations, we are ready to begin the encoding step.

Step 1: Message Hashing and Salt Generation

We compute the hash of the message as $H = mHash = Hash(M)$, where $M$ is our message. We will also create a random salt $S$ (of fixed length, say $20$ bytes if you use SHA-1).

Step 2: Encoding the Hash with the Salt (PSS-Encode)

We first construct an intermediate block, $M'$, by combining a padding with the hash and the salt. The padding is a sequence of eight $0x00$ bytes that ensures a fixed layout. Mathematically:

$$M' = (0x)~00 ~00 ~00 ~00 ~00 ~00 ~00 ~00 ~||~ mHash ~||~ salt$$

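Now we compute the hash of this block as $H' = Hash(M')$. We will generate another octet string $PS$, consisting of $emLen - sLen - hLen - 2$ zero bytes (where $sLen$ is the length of the salt), and concatenate it with $0x01$ as a delimiter and the salt to form the Data Block $DB$: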

$$DB = PS ~||~ 0x01 ~||~ salt$$

Note that $DB$ is an octet string of length $emLen - hLen - 1$. The mask that you see in the visual representation above must be of this length, and it is generated from $H'$. Mathematically:

$$dbMask = MGF(H', emLen - hLen - 1)$$

We will then apply this mask on the $DB$ block using an $XOR$ operation to produce our $maskedDB$:

$$maskedDB = DB \oplus dbMask$$

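Recall that $emLen$ is the intended length of the Encoded Message $EM$ and $hLen$ is the length of the hash output. RFC 8017 additionally clears the leftmost $8 \cdot emLen - emBits$ bits of $maskedDB$ so that the resulting integer stays below $N$. Now we append the hash $H'$ and a fixed trailer field $0xbc$ to produce the encoded message in its octet string representation:

$$EM = maskedDB ~||~ H' ~||~ 0xbc$$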

This encoding process ensures that both the salt and the hash are mixed together in a non-reversible, pseudorandom manner. The randomness from the salt is “spread” over the data block by the $MGF$, making it extremely difficult for any adversary to manipulate the signature.

Step 3: RSA Signature Generation

Once you have the encoded message $EM$, the RSA signature is produced by using the RSA private key. First, convert the Octet String to its integer representation using the OS2IP method we’ve discussed before. Then apply the RSA Private Key Operation:

$$s=m^d \bmod N$$

where $m$ is the integer representation of $EM$, $d$ is the private exponent, and $N$ is the RSA modulus.

Step 4: Signature Verification

At the receiver end, when any recipient wants to verify a signature, they reverse the process:

$$m′= s^e \bmod N$$

and convert $m'$ back to an encoded message $EM$. The verifier then extracts the components $(maskedDB, H', trailer)$, checks that the trailer is $0xbc$, regenerates the mask from $H'$ to unmask $DB$, and reads the salt out of $DB$. Finally, the verifier recomputes $H'$ from the message hash and the recovered salt, and confirms that it matches the $H'$ embedded in $EM$. If everything checks out, the signature is valid.
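The encoding and verification steps can be sketched in Python as follows, loosely following the EMSA-PSS operations in RFC 8017 with SHA-256. The function names `pss_encode` and `pss_verify` are our own, and the RSA private and public key operations are left out, so the sketch works directly on the encoded message $EM$. Treat it as an illustration rather than a vetted implementation.

```python
import hashlib
import hmac
import os
from typing import Optional

H_LEN = 32  # SHA-256 output length in bytes

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mgf1(seed: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += sha256(seed + counter.to_bytes(4, "big"))
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pss_encode(message: bytes, em_bits: int, salt: Optional[bytes] = None) -> bytes:
    em_len = (em_bits + 7) // 8
    salt = os.urandom(H_LEN) if salt is None else salt
    if em_len < H_LEN + len(salt) + 2:
        raise ValueError("encoding error")
    h_prime = sha256(b"\x00" * 8 + sha256(message) + salt)    # H' = Hash(M')
    db = b"\x00" * (em_len - len(salt) - H_LEN - 2) + b"\x01" + salt
    masked_db = bytearray(xor(db, mgf1(h_prime, em_len - H_LEN - 1)))
    masked_db[0] &= 0xFF >> (8 * em_len - em_bits)            # clear unused top bits
    return bytes(masked_db) + h_prime + b"\xbc"

def pss_verify(message: bytes, em: bytes, em_bits: int, s_len: int = H_LEN) -> bool:
    em_len = (em_bits + 7) // 8
    if len(em) != em_len or em_len < H_LEN + s_len + 2 or em[-1] != 0xBC:
        return False
    masked_db, h_prime = em[:em_len - H_LEN - 1], em[em_len - H_LEN - 1:-1]
    top_bits = 8 * em_len - em_bits
    if masked_db[0] >> (8 - top_bits):                        # unused bits must be zero
        return False
    db = bytearray(xor(masked_db, mgf1(h_prime, len(masked_db))))
    db[0] &= 0xFF >> top_bits
    ps_len = em_len - H_LEN - s_len - 2
    if any(db[:ps_len]) or db[ps_len] != 0x01:                # PS || 0x01 structure
        return False
    salt = bytes(db[ps_len + 1:])
    expected = sha256(b"\x00" * 8 + sha256(message) + salt)
    return hmac.compare_digest(expected, h_prime)

em = pss_encode(b"contract v1", em_bits=2047)    # 2048-bit modulus
print(len(em), pss_verify(b"contract v1", em, 2047), pss_verify(b"contract v2", em, 2047))
# 256 True False
```

Encoding the same message twice yields two different $EM$ values because a fresh salt is drawn each time, yet both verify against the same message.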

The Road Ahead: Assessing RSA’s Long-Term Viability

In 1994, Peter Shor’s algorithm [14] demonstrated that a sufficiently large quantum computer can factor large integers in polynomial time, thereby efficiently breaking RSA’s underlying hard problem: the difficulty of factoring $N = p \times q$.

Although experimental quantum computers have made progress, they remain far from having the number of stable qubits required to break RSA keys of practical sizes (2048 or 4096 bits).

In anticipation of large-scale quantum computers, the cryptographic community is actively developing and standardizing algorithms believed to be resistant to quantum attacks. These include lattice-based schemes (such as CRYSTALS-Kyber and NTRU), code-based cryptography (such as the McEliece cryptosystem), hash-based signatures (such as XMSS), and multivariate polynomial cryptosystems.

It’s important to note that while OAEP and PSS improve the security of RSA against classical attacks, they do not protect RSA from quantum attacks. In a post-quantum world, even the most secure classical padding will not prevent a quantum computer from breaking RSA using Shor’s algorithm.

In the near term, RSA remains in widespread use and, when implemented with padding schemes such as OAEP and PSS, continues to provide strong security against classical adversaries. But looking ahead, it’s expected that organizations will gradually migrate to post-quantum algorithms as they mature and become standardized.

References

[1] FIPS 186-5: Digital Signature Standard (DSS)

[2] RFC 8017 PKCS #1: RSA Cryptography Specifications

[3] Lagrange's theorem

[4] Ronald L. Rivest, Robert D. Silverman: Are Strong Primes Needed for RSA?

[5] pyca/cryptography

[6] OpenSSL Github: rsa_chk.c

[7] RFC 2313: PKCS #1: RSA Encryption

[8] Daniel Bleichenbacher: Chosen Ciphertext Attacks Against Protocols Based on the RSA Encryption Standard PKCS #1

[9] RFC 8017: PKCS #1 RSA Cryptography Specifications Version 2.2

[10] RSA_public_encrypt: Warnings

[11] pyca/PKCS1v1

[12] Probabilistic signature scheme

[13] RFC 8017: RSASSA-PSS

[14] Algorithms for quantum computation: discrete logarithms and factoring

How to Connect, Read, and Process Sensor Data on Microcontrollers – A Beginner's Guide

Soham Banerjee — Fri, 14 Mar 2025 16:30:15 +0000

In today’s world, computers are ubiquitous and generally serve two primary purposes.

The first is general-purpose computing, where they handle a wide range of tasks, including running diverse applications and programs. Examples include laptops, desktops, servers, and supercomputers.

The second is embedded systems, which are specialized computers designed for specific functions. Commonly found in devices such as thermostats, refrigerators, cars, and other smart appliances, they rely on sensors to collect environmental data and execute their tasks efficiently.

The Role of Sensors

Sensors play a critical role in both types of computing. In embedded systems, sensors gather environmental data to help devices like autonomous vehicles, home appliances, and industrial machines perform tasks. In general-purpose computers, sensors primarily monitor internal conditions such as temperature and voltage, ensuring safe operation and preventing issues like overheating or electrical faults.

As Artificial Intelligence (AI) and the Internet of Things (IoT) evolve, sensors have become indispensable for gathering real-world data to support intelligent decision-making. Embedded systems leverage sensors to perceive their environment, transforming raw data into actionable insights that power automation and improve efficiency across industries.

This means that understanding sensor interfacing and designing robust sensor-driven software has become a vital skill for engineers and hobbyists alike.

Whether you're a beginner or experienced engineer, this guide will help you build a solid understanding of sensor interfacing software.

What You’ll Learn and Article Scope

In this article, you’ll learn how to connect sensors to microcontrollers (MCUs) and design sensor software pipelines that turn raw data into meaningful, usable information. You’ll also explore practical techniques for processing sensor data accurately and efficiently in embedded systems.

Here’s a breakdown of what we’ll cover:

What sensors are and how they work – An introduction to sensors, common types, and how sensor pipelines help process sensor data.
Key sensor characteristics – Important parameters like sensitivity, accuracy, precision, range, drift, and response time to help you choose the right sensor for your project.
How to interface sensors with microcontrollers – Hardware connections and communication protocols like SPI, I²C, and GPIO that allow microcontrollers to read sensor data.
Software architecture for sensor data – A high-level overview of the software pipeline that processes sensor data, including drivers, ADC support, scaling, calibration, and post-processing.
Detailed design of pipeline components – A closer look at each step in the pipeline, focusing on scaling raw data, calibrating sensors, and applying filters to clean up noisy signals.
Practical tips for power management – Best practices for handling power efficiently using low-power modes, FIFO buffers, and DMA when working with sensor data in embedded systems.

By the end of this article, you’ll know how to design and implement a complete sensor data pipeline for an embedded system, from reading raw sensor data to preparing it for real-world use in intelligent, connected devices.

Note: Advanced data processing, high-resolution ADCs, and hardware circuit design for sensors are outside the scope of this article.

Prerequisites

To get the most out of this article, you should have:

Basic knowledge of microcontrollers: Understanding of common peripherals like ADCs (Analog-to-Digital Converters), SPI (Serial Peripheral Interface), I2C (Inter-Integrated Circuit) and GPIO (General Purpose Input/Output). If you’re new to these protocols, this article provides a great overview.
Basic knowledge of electronics: Familiarity with circuits and signals, including analog and digital interfaces.
Programming in C: Familiarity with embedded software development, including driver development.
(Optional) Basic knowledge of sensors: Understanding different types of sensors (like temperature, pressure, motion) is helpful but not required.

Also, this article assumes the following:

You are working with a microcontroller equipped with the peripherals needed for sensor integration. The details of a microcontroller’s peripherals can be found in its reference manual. For example, the reference manual for an STM32F4 series microcontroller documents all of its peripherals.
You are familiar with compilers, debuggers, and IDEs used in embedded systems. Some common tools include:
- Compilers: GCC, Clang
- Debuggers: GDB, LLDB
- IDEs: Visual Studio Code (VSCode) is a popular choice, especially with extensions for embedded development and debugging.
You aim to build reliable, sensor-driven embedded systems, capable of collecting and processing real-world data efficiently.

What is a Sensor and Sensor Pipeline?
Sensor Characteristics
How to Interface with a Microcontroller
Software Architecture
Detailed Design of Components
Conclusion

What is a Sensor and Sensor Pipeline?

A sensor detects changes in physical properties such as temperature, pressure, or light and converts them into electrical signals that can be measured or interpreted. For example, a thermistor is a type of resistor whose resistance changes with temperature. As the temperature varies, the resistance of the thermistor changes, altering the voltage across it. The system then interprets this voltage change to determine the temperature.

To better understand sensors, consider the natural sensors in the human body: the eyes, ears, skin, nose, and tongue. These natural sensors constantly send signals about the environment to the brain for processing. Different regions of the brain interpret these signals and use the information to drive actions and responses. Just like the brain processes signals from natural sensors, a microcontroller processes signals from electronic sensors using a sensor pipeline.

Sensors come in many types, each designed to detect specific physical properties. Some sensors have a sensing element that changes its properties in response to conditions like heat, light, or pressure. Examples include thermistors, infrared receivers, and photodiodes.

For detecting movement, such as acceleration and rotation, MEMS (Microelectromechanical Systems) sensors—like accelerometers and gyroscopes—are widely used.

To measure distance, sensors like sonars, ultrasonic sensors, and radars are common. These are just a few examples of the many types of sensors available.

Beyond the types of physical properties they detect, sensors also differ in their levels of integration. Some sensors are raw sensors, consisting only of a sensing element and a transducer with simple leads for direct connection to an external circuit.

Others, known as smart sensors, include additional components such as an ADC (analog-to-digital converter) and onboard processing capabilities, enabling them to handle more of the data processing independently.

The choice between a raw sensor and a smart sensor depends on your application requirements, including factors like cost, size, and the processing load on the interfacing microcontroller.

Returning to our human analogy, consider how vision works as a sensor pipeline. When light enters our eyes, photoreceptor cells (rods and cones) in the retina act as sensing elements, converting the light into electrical signals. These signals travel via the optic nerve to the brain’s visual cortex, where they undergo processing to form a recognizable image. The brain then interprets this information and initiates a response, like smiling when you see beautiful scenery.

Similarly, a sensor pipeline for an embedded system can be defined as shown in the picture below:

Each of these steps may have different requirements based on the application. Creating a requirements document for the sensor is helpful when selecting the appropriate sensor and configuring the pipeline.

Sensor Characteristics

Before you dive into the blocks of the sensor pipeline, let’s review some important characteristics of a sensor.

Sensitivity

Sensitivity is the ability of a sensor to detect small changes in the physical property it’s designed to measure.

Sensitivity can vary based on factors like manufacturing processes, cost, and the design of the sensing element.

Sensors designed for a specific property often come in different sensitivity levels, allowing users to select an appropriate sensitivity based on the application requirements.

Accuracy

Accuracy is the degree to which a sensor’s measurement matches the true value of the physical property it’s measuring. Testing a sensor’s accuracy typically requires comparing its readings to those of a reference instrument.

A sensor may have gain and offset errors—issues that calibration can help correct. Calibration adjusts for these systematic errors, which are often due to manufacturing tolerances or design factors.

Once calibrated, the sensor’s output can be verified against a reference to confirm its accuracy. The required level of accuracy should be determined based on the application’s needs.

Precision

Precision refers to the consistency or repeatability of a sensor's measurements, regardless of how close those measurements are to the true value. It indicates the sensor's ability to produce the same output under identical conditions and how finely it can resolve and report values.

For example, if the true temperature of an object is 12.53°C:

A precise sensor will consistently measure values like 12.52°C, 12.53°C, or 12.54°C, even if those values are slightly offset from the true temperature.
A highly accurate sensor, on the other hand, will measure values close to 12.53°C but may lack precision if those readings vary widely (e.g., 12.50°C, 12.53°C, and 12.56°C).

For applications requiring exact measurements, a sensor with both high accuracy (closeness to the true value) and high precision (low variability) is essential. This is especially important in distinguishing small differences, such as between 12.5°C and 12.53°C.

In contrast, applications with less stringent requirements might use sensors with broader tolerances, such as ±1°C, which are sufficient for general monitoring purposes.

Range

The range of a sensor refers to the span between the maximum and minimum values of the physical property it can measure while maintaining its specified precision and accuracy. A sensor's operating range may extend beyond its measurement range, but the measurement range defines the limits within which the sensor reliably adheres to its specified sensitivity, accuracy, and response time.

Drift

Drift is when a sensor's output changes over time due to conditions like temperature or humidity. Components within the sensor, including the sensing element, may be sensitive to these conditions, leading to gradual shifts in measurements.

For example, many components are affected by temperature and humidity changes, which can alter sensor readings. Also, sensors with internal oscillators may experience time-based drift, impacting accuracy.

Regular calibration with an accurate external reference (such as a precise clock) can help correct for drift and maintain reliable measurements. For certain applications, selecting a sensor with acceptable drift characteristics is crucial.

Response Time

Response time is the duration a sensor takes to detect and reflect a change in the measured physical property. For example, if the temperature rises by 5°C, the response time indicates how long the temperature sensor takes to reflect this change in its output.

Response time depends on the sensor’s design, manufacturing quality, and internal components, such as the ADC (Analog-to-Digital Converter), averaging circuits, and filters within the sensor pipeline.

All the parameters mentioned above are documented in the sensor’s datasheet. In practice, it’s a good idea to create a sensor requirements document for each specific application, detailing these key parameters as a baseline for sensor selection.

Now that you’ve examined the key characteristics of sensors, let’s explore how you can connect them to a microcontroller for real-world applications.

How to Interface with a Microcontroller

Choosing a Communication Protocol

Another essential aspect of sensor requirements is specifying the communication interface between the sensor and the MCU or processor in the system. It’s important to understand how the sensor will be interfaced based on its output signal type and the available pins on the microcontroller.

For instance, certain sensors may connect directly to an analog or digital input pin on a microcontroller. A raw sensor, such as a temperature sensor, typically connects to an analog input pin, which is then read by the microcontroller’s internal ADC (Analog-to-Digital Converter).

In contrast, a digital-output sensor connects to a digital GPIO (General Purpose Input/Output) pin. For instance, speed sensors generate square waves with variable pulse widths to indicate speed. These signals are usually connected to a GPIO pin configured as an external interrupt or timer capture input, allowing the microcontroller to measure pulse width accurately.

A smart sensor, on the other hand, often supports communication protocols like SPI (Serial Peripheral Interface) or I2C (Inter-Integrated Circuit). These interfaces enable the microcontroller to configure the sensor, check its status, and retrieve data through register reads and writes.

Choosing the appropriate communication protocol for interfacing a sensor depends on the available pins in the system and the specific requirements of the application.

Tip: When working with protocols like I²C or SPI, using tools such as Saleae logic analyzers can greatly simplify debugging and validation. Logic analyzers capture and visualize communication signals, and tools like Saleae offer built-in protocol interpreters to help you decode sensor communication in real time. This can be especially helpful when troubleshooting configuration issues, timing problems, or communication errors during sensor interfacing.

Figure 2 below shows an example of a microcontroller connected to 4 sensors having different interfaces.

Determining Power Requirements

Power requirements are another key consideration when interfacing a sensor. Sensors may operate at different voltages (for example, 3.3V or 5V), so ensuring the microcontroller can accommodate these levels is essential. Level converters can bridge voltage mismatches, ensuring compatibility between the sensor and microcontroller voltage levels.

Timing and sampling requirements must also be evaluated, especially for sensors generating high-frequency data. Configuring external interrupts on GPIO pins can ensure timely data capture, while techniques like using DMA can streamline data transfer for sensors sampling at high frequencies without CPU involvement.

Now that you’ve learned about communication protocols and hardware connections, let’s focus on designing the software architecture that acquires, processes, and prepares sensor data for use. Designing effective software is crucial for obtaining clean, reliable data from the sensor.

Software Architecture

Now that we’ve chosen the sensor and communication protocol, let’s design the software architecture for the sensor pipeline. This software runs on the microcontroller connected to the sensor and processes raw data to make it clean and usable.

While application-level data processing is beyond the scope of this article, let’s focus on interfacing with the sensor and preparing the data for application use.

The sensor processing pipeline can be broken into the following components:

Sensor Driver
Analog-to-Digital Conversion (ADC) Support
Scaling
Calibration
Data Post-Processing

Let’s examine a high-level overview of these components for both smart and raw sensors.

High-Level Overview of Components

Sensor Driver
1. Smart sensors: The driver configures the sensor, manages power, and handles read and write operations to the sensor registers over a communication protocol like SPI or I2C.
2. Raw sensors: The driver may only control GPIOs for power management, as raw sensors typically lack registers.
Analog-to-Digital Conversion (ADC) Support
1. Smart sensors: Include an onboard ADC, which is configured through the sensor driver.
2. Raw sensors: Require an ADC outside the sensor (often the microcontroller’s internal ADC), along with an ADC driver implemented in software to configure the ADC, initiate conversions, and retrieve data.
Scaling: Scaling is necessary for both smart and raw sensors. It converts the digital counts produced by the analog-to-digital conversion into meaningful physical quantities using formulas provided in the sensor datasheet. For example, a temperature sensor will use a formula to convert digital counts to degrees Celsius.
Calibration: Once the measured physical quantity is obtained, calibration adjusts the value by applying offsets, gains, or both to correct errors. This process ensures the sensor output aligns with reference values across its entire measurement range. A detailed discussion of the calibration process will follow in the next section.
Data Post-Processing: Post-processing techniques, such as filtering, are applied to improve data quality and reduce noise. Common filters such as low-pass or high-pass filters can remove unwanted frequency components.

Accessing Data from the Sensor

The method of accessing data depends on whether it’s a raw sensor or a smart sensor. Smart sensors typically have onboard ADCs and may also have FIFOs. Before delving into how data is accessed, it’s important to first understand sampling frequency.

Sampling Frequency:

The rate at which you take measurements from the sensor must satisfy the Nyquist-Shannon sampling theorem. It states that the sampling rate must be at least twice the highest frequency component of the signal (strictly, greater than twice) to accurately reconstruct the measured data. For example, a signal with components up to 50 Hz needs a sampling rate above 100 Hz.

The sampling frequency defines how often the sensor captures data, which affects how the data is accessed. Depending on whether the sensor is a raw sensor or a smart sensor, the approach to handling this sampled data varies.

Smart Sensors:

Data register: The sensor writes each sample into a data register at the sampling frequency configured during setup. The microcontroller reads this data register when the sensor signals a data conversion complete interrupt.
FIFO buffer: Some sensors include FIFO (First-In, First-Out) buffers to store multiple data points. When enabled, the FIFO is filled at the configured sampling frequency and triggers an interrupt when it becomes full or reaches a predefined level.
The benefits of FIFO include:
1. Power efficiency: The MCU can process data in batches, reducing CPU overhead and allowing it to enter low-power mode during data collection.
2. Sampling and processing rate matching: FIFO buffers help reconcile differences between the sensor’s sampling rate and the MCU’s data processing rate.
3. DMA support: For MCUs with Direct Memory Access (DMA), data transfer from the sensor to MCU memory can occur without CPU intervention, further reducing power consumption.
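The batching idea behind a FIFO can be sketched on a PC with a small software model. This is only an illustration: a real FIFO lives inside the sensor, and the names `sensor_fifo_t`, `fifo_push`, `fifo_drain`, `FIFO_DEPTH`, and `FIFO_WATERMARK` are hypothetical, not taken from any sensor's datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH     32u  /* hypothetical FIFO size, in samples */
#define FIFO_WATERMARK 8u   /* hypothetical interrupt threshold */

/* Software model of a sensor FIFO (a real one lives inside the sensor). */
typedef struct {
    int16_t  samples[FIFO_DEPTH];
    uint32_t head;   /* next write position */
    uint32_t tail;   /* next read position */
    uint32_t count;  /* samples currently stored */
} sensor_fifo_t;

/* Called once per sample; returns 1 when the watermark "interrupt"
 * would fire, 0 otherwise. */
int fifo_push(sensor_fifo_t *f, int16_t sample)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH) {
        /* Overflow: drop the oldest sample to make room. */
        f->tail = (f->tail + 1u) % FIFO_DEPTH;
        f->count--;
    }
    f->samples[f->head] = sample;
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count++;
    return f->count >= FIFO_WATERMARK;
}

/* Called by the MCU after it wakes up: copies up to max samples
 * into out in one batch and returns how many were copied. */
size_t fifo_drain(sensor_fifo_t *f, int16_t *out, size_t max)
{
    assert(f != NULL && out != NULL);
    size_t n = 0;
    while (f->count > 0u && n < max) {
        out[n++] = f->samples[f->tail];
        f->tail = (f->tail + 1u) % FIFO_DEPTH;
        f->count--;
    }
    return n;
}
```

In this model the MCU would stay asleep while `fifo_push` returns 0 and call `fifo_drain` once when it returns 1, so it handles eight samples per wake-up instead of one.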

Raw Sensors:

For raw sensors, the MCU triggers ADC conversions at the sampling frequency, often using a timer interrupt. Data is read upon the ADC conversion complete interrupt, allowing the MCU to sleep during conversions and between samples to save power.

Sensor Power Management

Power management is critical for energy-sensitive applications. Strategies include:

Low-power modes: Many sensors support low-power modes configurable through sensor registers.
GPIO-controlled power cycling (duty-cycling): For sensors without built-in low-power modes, the microcontroller can toggle the sensor’s power line using a GPIO pin, reducing power consumption further. For example, a temperature sensor can be kept powered down and activated only when temperature readings are required. Figure 3 below shows the diagram of a raw temperature sensor whose power is controlled using a GPIO from the MCU.

The above techniques ensure efficient use of power while maintaining the required data sampling rate and sensor responsiveness.

With the high-level architecture in mind, we’ll now dive into the detailed design of each pipeline component.

Detailed Design of Components

In this section, you’ll delve into the key components of the sensor pipeline outlined in the Software Architecture section.

1. Sensor Driver

The sensor driver is responsible for managing communication, configuration, power, and data acquisition for both smart and raw sensors.

Smart Sensor Driver:

Communication driver: Generic I2C or SPI drivers on the MCU can be adapted using wrapper functions to handle sensor-specific requirements, such as 1-byte, 2-byte, or 4-byte transfers.
Configuration: Typical tasks include setting the sampling rate, configuring interrupts, managing FIFO buffers, and, if needed, clock settings.
Power management: APIs should allow higher software layers to transition sensors between power modes by writing to specific registers or controlling GPIO lines for sensors without built-in power modes.
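As a rough sketch of the wrapper-function idea, the code below layers sensor-specific 1-byte and 2-byte accessors on top of a generic bus interface. Everything here is an assumption made for illustration: the `bus_read_fn`/`bus_write_fn` signatures, the high-byte-first ordering, and the in-memory `fake_regs` map that stands in for a real I2C or SPI driver so the code can run on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic bus access: transfer len bytes starting at register reg.
 * Returns 0 on success. On a real MCU these would call the vendor's
 * I2C or SPI driver; the signatures are assumptions. */
typedef int (*bus_read_fn)(uint8_t reg, uint8_t *buf, size_t len);
typedef int (*bus_write_fn)(uint8_t reg, const uint8_t *buf, size_t len);

typedef struct {
    bus_read_fn  read;
    bus_write_fn write;
} sensor_bus_t;

/* Wrapper for a 1-byte register write (e.g. a configuration register). */
int sensor_write_u8(const sensor_bus_t *bus, uint8_t reg, uint8_t value)
{
    assert(bus != NULL && bus->write != NULL);
    return bus->write(reg, &value, 1u);
}

/* Wrapper for a 2-byte read, assuming the high byte comes first. */
int sensor_read_s16(const sensor_bus_t *bus, uint8_t reg, int16_t *out)
{
    assert(bus != NULL && bus->read != NULL && out != NULL);
    uint8_t raw[2];
    int status = bus->read(reg, raw, sizeof raw);
    if (status != 0) {
        return status;  /* propagate bus errors to the caller */
    }
    *out = (int16_t)(((uint16_t)raw[0] << 8) | raw[1]);
    return 0;
}

/* In-memory fake register map so the wrappers can be exercised on a PC. */
static uint8_t fake_regs[256];

int fake_bus_read(uint8_t reg, uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] = fake_regs[(uint8_t)(reg + i)];
    }
    return 0;
}

int fake_bus_write(uint8_t reg, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        fake_regs[(uint8_t)(reg + i)] = buf[i];
    }
    return 0;
}
```

Keeping the bus functions behind pointers means the same sensor code could be reused with a different bus, or tested without hardware, by swapping the `sensor_bus_t` instance.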

Raw Sensor Driver:

For raw sensors, the driver primarily manages power, often through GPIO-controlled toggling.

2. ADC Support

ADC support is required only for raw sensors. In this article, we’re focusing on SAR ADCs, which are commonly embedded in microcontrollers.

How SAR ADCs Work

A SAR ADC converts an analog signal to a digital value over multiple clock cycles, with the number of cycles equal to its bit resolution (for example, 10 cycles for a 10-bit ADC).

Reference Voltage (VRef): Represents the maximum voltage the ADC can measure. Analog signals exceeding this limit must be scaled down.
Resolution: Determines the smallest detectable voltage change. For example, a 10-bit ADC with a 3.3V VRef has a resolution of about 3.22 mV:

$$V_{\text{Res}} = V_{\text{Ref}} /2^{10}$$

The ADC result is stored in a data register, which can then be scaled to meaningful physical units.
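To make the one-bit-per-cycle behavior concrete, here is a small software model of a successive-approximation conversion. It is a simplified sketch, not how any particular MCU's ADC is implemented: real hardware uses a comparator and an internal DAC, and the function names `sar_convert` and `adc_lsb_volts` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of an N-bit SAR conversion: one comparison per bit,
 * from the most significant bit down to the least significant bit. */
uint32_t sar_convert(double vin, double vref, unsigned bits)
{
    assert(bits > 0u && bits <= 16u && vref > 0.0);
    uint32_t code = 0u;
    const double full_scale = (double)(1u << bits);

    for (unsigned step = 0u; step < bits; step++) {
        uint32_t trial = code | (1u << (bits - 1u - step));
        /* The internal DAC outputs trial * Vref / 2^N; keep the bit
         * if the input is at or above that level. */
        double vdac = (double)trial * vref / full_scale;
        if (vin >= vdac) {
            code = trial;
        }
    }
    return code;
}

/* Size of one ADC step (LSB) in volts: Vref / 2^N. */
double adc_lsb_volts(double vref, unsigned bits)
{
    assert(bits > 0u && bits <= 16u);
    return vref / (double)(1u << bits);
}
```

With a 3.3 V reference and 10 bits, the loop runs ten times, and an input of 1.0 V settles on code 310, the largest code whose DAC level does not exceed the input.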

3. Scaling

Scaling converts ADC counts into meaningful physical values, such as temperature (°C) or acceleration (g) depending on the sensor type. Sensor datasheets typically provide the necessary formulas or lookup tables.

For example, the ADC counts from a raw temperature sensor are first converted to a voltage and then to a temperature, as shown below (the voltage must be expressed in mV to match the °C/mV coefficient):

$$V_{\text{Measured}} = Counts_{\text{ADC}} / 2^{10} * V_{\text{Ref}} \quad \text{(Get V_Measured from ADC Counts)}$$

$$Temperature_{\text{Measured}} = V_{\text{Measured}} * T_{\text{C/mV}} \quad \text{(Get Temperature physical value)}$$

Similarly, a 3-axis accelerometer maps counts on the X, Y, and Z axes to acceleration values in g or milli-g.
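The two scaling equations for the temperature sensor could be written in C roughly as follows. The 10-bit resolution and 3.3 V reference match the earlier ADC example, but the coefficient `TEMP_C_PER_MV` is a made-up value for a hypothetical sensor; a real value would come from the sensor's datasheet.

```c
#include <assert.h>
#include <stdint.h>

#define ADC_BITS       10u
#define ADC_VREF_MV    3300.0f  /* 3.3 V reference, in millivolts */
#define TEMP_C_PER_MV  0.1f     /* hypothetical: 10 mV per degree Celsius */

/* Step 1: ADC counts -> measured voltage in millivolts. */
float adc_counts_to_mv(uint16_t counts)
{
    assert(counts < (1u << ADC_BITS));  /* counts must fit the ADC range */
    return ((float)counts / (float)(1u << ADC_BITS)) * ADC_VREF_MV;
}

/* Step 2: measured voltage -> temperature in degrees Celsius. */
float mv_to_celsius(float millivolts)
{
    return millivolts * TEMP_C_PER_MV;
}

/* Both steps combined, as a scaling layer would expose them. */
float adc_counts_to_celsius(uint16_t counts)
{
    return mv_to_celsius(adc_counts_to_mv(counts));
}
```

Working in millivolts keeps the units consistent with a coefficient given in °C/mV. Under these assumptions, 78 counts correspond to roughly 251 mV, or about 25.1°C.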

4. Calibration

Figure 4a (left) shows calibration with gain and offset, while Figure 4b (right) shows calibration with a fixed offset.

$$x_{\text{calibrated}} = Gain * x_{\text{raw}} + Offset \quad \text{(Figure 4a - Linear Calibration)}$$

$$x_{\text{calibrated}} = x_{\text{raw}} + Offset \quad \text{(Figure 4b - Fixed offset Calibration)}$$

Calibration ensures the sensor’s output aligns with reference measurements, correcting for errors introduced by design, materials, or manufacturing.

Types of Errors:

Offset error: A constant deviation of the sensor’s output from the true reference value, regardless of input magnitude.
Gain error: A proportional error where the sensor’s output scale deviates from the expected value, causing the output to increase or decrease incorrectly relative to the input.

Calibration Methods:

2/3-Point calibration: This type of calibration may involve either applying a fixed offset to the raw value or applying both gain and offset. Figure 4a illustrates an example of a gain/offset calibration, while Figure 4b depicts offset calibration. In both figures, the y-axis represents the reference value measured by an accurate instrument, while the x-axis represents the raw value measured by the sensor after ADC.
N-Point calibration: Involves multiple points for more complex, non-linear error correction.

Implementation:

Calibration points shall cover the sensor’s entire measurement range for accuracy.
Parameters like gain and offset, once estimated, shall be stored in non-volatile memory so that they persist across power cycles.
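For the linear case, gain and offset can be estimated from two (raw, reference) pairs. The sketch below assumes the sensor error really is linear between the two points. The names `calib_t`, `calib_two_point`, and `calib_apply` are hypothetical, and storing the result in non-volatile memory is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float gain;
    float offset;
} calib_t;

/* Derives gain and offset from two (raw, reference) pairs so that
 * reference = gain * raw + offset holds at both points. */
calib_t calib_two_point(float raw_lo, float ref_lo, float raw_hi, float ref_hi)
{
    assert(raw_hi != raw_lo);  /* the two raw readings must differ */
    calib_t c;
    c.gain   = (ref_hi - ref_lo) / (raw_hi - raw_lo);
    c.offset = ref_lo - c.gain * raw_lo;
    return c;
}

/* Applies x_calibrated = Gain * x_raw + Offset to a raw reading. */
float calib_apply(const calib_t *c, float raw)
{
    assert(c != NULL);
    return c->gain * raw + c->offset;
}
```

For instance, if the sensor reads 2 when the reference reads 0, and 52 when the reference reads 100, the estimated gain is 2 and the offset is -4, so a raw reading of 27 is corrected to 50.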

5. Data Post-Processing

The post-processing covered in this section focuses on removing noise and unwanted signal components, which improves data reliability.

Filtering

Filtering is the process of removing unwanted frequency components from a signal to improve data quality. There are several different types of filters:

Low-Pass Filters: Allow low-frequency signals to pass while attenuating high-frequency noise.
High-Pass Filters: Allow high-frequency signals to pass while attenuating low-frequency components (for example, gravitational acceleration in accelerometer data).
Band-Pass Filters: Retains only signals within a specific frequency range, removing both lower and higher frequencies outside the desired band.

These filters are often implemented as FIR (Finite Impulse Response) or IIR (Infinite Impulse Response) filters. IIR filters are easy to implement and computationally efficient while FIR filters are computationally intensive but have better control over the frequency response.

Here, we will explore a simple low-pass filter known as the Exponential Moving Average (EMA), a type of IIR filter. A moving average filter is a mathematical technique that smooths short-term fluctuations while highlighting longer-term trends.

Unlike other moving average filters, EMA does not require maintaining a buffer, making it more memory-efficient. It is also more responsive to data changes while still providing smoothing, making it well-suited for real-time filtering. EMA assigns greater weight to recent data samples than older ones, allowing it to adapt quickly to changes in sensor readings.

EMA can be calculated like this:

$$EMA_{\text{t}} = \alpha * x_{\text{t}} + (1 - \alpha) * EMA_{\text{t - 1}}$$

$$\alpha = 2 / (N + 1) \quad \text{(Smoothing Factor, N - filter window size)}$$

$$EMA_{\text{t}} \quad \text{(Exponential Moving Average in current iteration)}$$

$$x_{\text{t}} \quad \text{(New Data Sample in Current Iteration)}$$

$$EMA_{\text{t - 1}} \quad \text{(Exponential Moving Average in the last iteration)}$$

Now that we understand the Exponential Moving Average (EMA) filter, here are two key factors to consider when tuning it for an application:

Smoothing vs. Responsiveness: A higher smoothing factor (closer to 1, smaller filter window size) gives more weight to recent data, making the filter more responsive to changes but less effective at noise reduction. A lower smoothing factor (closer to 0, larger filter window size) provides better noise reduction but reacts more slowly to data changes.
Application-Specific Tuning: The smoothing factor should be chosen based on the sampling rate, sensor sensitivity, and application requirements. Real-time systems often require a balance between quick responsiveness and stable output.

Here’s a code sample for EMA:

#include <stdio.h>  // printf

// Exponential Moving Average (EMA) filter implementation
#define FILTER_WINDOW 5

// Function to calculate EMA
float calculateEMA(float ema, float new_value, float alpha) {
    return (alpha * new_value) + (1 - alpha) * ema;
}

int main() {
    float sensorReadings[] = {26.0, 27.5, 28.2, 27.0, 26.8, 26.5, 27.2};
    int numReadings = sizeof(sensorReadings) / sizeof(sensorReadings[0]);

    float alpha = 2.0f / (FILTER_WINDOW + 1); // Standard EMA formula
    float ema = sensorReadings[0];  // Initialize EMA with the first reading

    printf("EMA Filtered Sensor Data:\n");

    for (int i = 0; i < numReadings; i++) {
        ema = calculateEMA(ema, sensorReadings[i], alpha);
        printf("Reading %d: Raw = %.2f, EMA = %.2f\n", i + 1, sensorReadings[i], ema);
    }

    return 0;
}
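The same EMA can also serve as the basis for a simple high-pass filter: subtracting the low-pass (EMA) output from the input leaves the fast-changing part of the signal. This is one common approach rather than the only one, and the names `hp_filter_t`, `hp_init`, and `hp_update` are hypothetical. It could, for example, be used to suppress the slowly varying gravity component in accelerometer data.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float alpha;   /* EMA smoothing factor, 0 < alpha <= 1 */
    float ema;     /* current low-pass (EMA) estimate */
    int   primed;  /* 0 until the first sample has been seen */
} hp_filter_t;

void hp_init(hp_filter_t *f, float alpha)
{
    assert(f != NULL && alpha > 0.0f && alpha <= 1.0f);
    f->alpha  = alpha;
    f->ema    = 0.0f;
    f->primed = 0;
}

/* Returns the high-pass output: input minus its EMA (low-pass) estimate. */
float hp_update(hp_filter_t *f, float x)
{
    assert(f != NULL);
    if (!f->primed) {
        f->ema    = x;  /* seed the EMA with the first sample */
        f->primed = 1;
    } else {
        f->ema = f->alpha * x + (1.0f - f->alpha) * f->ema;
    }
    return x - f->ema;
}
```

A constant input (such as 1 g on a stationary axis) produces an output of 0, while a sudden change shows up immediately and then decays as the EMA catches up.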

Conclusion

In summary, sensors are the backbone of modern smart devices, bridging the gap between the physical world and digital systems. From consumer electronics to industrial automation and medical devices, they enable devices to perceive and interact with their environments.

Understanding how sensors work, the components of their data pipeline, and their integration with microcontrollers is essential for engineers and hobbyists alike. By designing effective pipelines, developers can ensure accurate, clean, and reliable data, enabling systems to meet performance and power efficiency goals.

If you have questions or want to talk more about this topic, feel free to reach out on Twitter or LinkedIn. Always happy to connect.

SVM Kernels Explained: How to Tackle Nonlinear Data in Machine Learning

Josiah Adesola — Mon, 06 Jan 2025 22:32:35 +0000

Have you ever considered how your phone can recognize handwritten text and convert it into regular computer text? Or how your email can separate messages automatically into spam and non-spam categories?

Both of these examples work based on classification tasks, as does the facial recognition feature on your phone.

When building a classification algorithm, real-world data often has a non-linear relationship. And many machine learning classification algorithms struggle with non-linear data. But in this article, we'll be looking at how Support Vector Machine (SVM) kernel functions can help to solve this problem. We’ll go in-depth into a Python implementation of non-linear classification and SVM kernel functions.

Prerequisites

Overview of the Support Vector Machine (SVM) Technique
Fundamentals of SVM
SVM Objective Function
Understanding Kernel Functions
Popular Kernel Functions
How to Choose the Right Kernel
SVM Kernel Implementation
Conclusion

Overview of the Support Vector Machine (SVM) Technique

Support Vector Machine (SVM) is a supervised learning algorithm. It uses a hyperplane that divides features inside a feature space into distinct categories. It’s effective for both classification and regression applications.

By identifying the optimal dividing line or plane that will serve as the decision boundary, SVM seeks to maximize the margin between the various target variables. It’s primarily utilized in classification tasks and is very helpful in ignoring outliers. It categorizes the data points of the features in the dataset into distinct outputs or classes.

SVM seeks to achieve the maximum margin and an ideal or near-perfect separation. There are various applications for SVM, such as image classification, face detection, text classification, and bioinformatics. SVM is also efficient in linear and non-linear classification problems.

Importance of Kernel methods in SVM

Nonlinear classification is a sort of classification that involves categorizing features that have non-linear, curved, or complex decision boundaries. Decision boundaries are regions of space that separate two different classes.

In linear classification tasks, the different classes (for example, whether an email is spam or not) can be easily separated with a straight line. But in non-linear relationships, the decision boundary could be circular, parabolic, or have some other complex shape.

Non-linear classification tasks have patterns that cannot be discovered by linear models. This is because the features have a non-linear relationship with each other.

SVM, as a linear classification algorithm, isn’t effective on non-linear data. To handle this sort of data, it requires a kernel method, which is the core topic of this article.

A kernel method is a technique used in SVM to transform non-linear data into higher dimensions. For example, if the data has a complex decision boundary in a 2-Dimensional space (as I’ll explain further in the later part of this article), it can be transformed into a 3-Dimensional space. This allows efficient classification just with a linear plane.

The goal of the article is to teach you about SVM kernels and their application to non-linear classification tasks.

Fundamentals of SVM

Linear Classifiers and Margin Maximization

Linear classifiers are classification algorithms that make predictions by using a straight line (or, in higher dimensions, a hyperplane) as a decision boundary between two or more categories.

Marginal planes are used to determine the support vector in the classification task. Support vectors are the data points in the dataset that are used to separate the different target variable categories – they are data points very close to the decision boundary.

In the image below, the marginal planes are the yellow lines, while the hyperplane is the red line. The hyperplane serves as the line of best fit or decision boundary. The data points that are closest to the marginal plane are the support vectors – the data points encircled in green in the image below.

The marginal planes lie at an equal distance on either side of the hyperplane, and the aim is to maximize the margin between each marginal plane and the hyperplane to achieve the best classification. The hyperplane in the image above shows a perfect linear separation in the space of feature x1 and feature x2. The support vectors also help to establish the location of the marginal planes.

We have the hard margin and the soft margin, serving as model optimization methodologies for the SVM. The hard margin shows that you cannot find a data point of feature x1 in the same area where there are feature x2 data points and vice versa. It is used to describe a perfect classification by the algorithm. The image above gives a representation of a hard margin.

A soft margin shows that the classification is imperfect, because you can find some data points of feature x1 in the same area where we have data points of feature x2, which could be caused by outliers. The image below gives a representation of a soft margin.

SVM Objective Function

For a binary classification, such as a dog or a cat, the dog can be represented as class 1 and the cat as class -1. The decision boundary or hyperplane is the determining factor: any data point above the plane is assigned to class 1, and any data point below the plane is assigned to class -1.

The mathematical function for the hyperplane is given as:

$$f(x) = \mathbf{w}^T\mathbf{x} + b$$

$$\begin{array}{l} \text{ The variables used are:} \\ \mathbf{w}: \text{Weight vector (defining the orientation of the hyperplane)} \\ b: \text{Bias term (defining the position of the hyperplane)} \\ \mathbf{x}: \text{Input feature vector} \\ \\ \text{The classification decision is based on the sign of } f(x)\text{:} \\ f(x) > 0: \text{Class 1} \\ f(x) < 0: \text{Class -1} \end{array}$$
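As a quick worked example of the sign rule, take a two-feature case with made-up values for the weight vector and bias (these numbers are assumptions chosen only for illustration):

$$\begin{array}{l} \text{Assumed values: } \mathbf{w} = (1, -1), \;\; b = 0.5 \\ \\ \mathbf{x} = (2, 1): \;\; f(x) = (1)(2) + (-1)(1) + 0.5 = 1.5 > 0 \;\Rightarrow\; \text{Class 1} \\ \mathbf{x} = (0, 2): \;\; f(x) = (1)(0) + (-1)(2) + 0.5 = -1.5 < 0 \;\Rightarrow\; \text{Class -1} \end{array}$$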

Hard Margin SVM

The Hard Margin SVM ensures that all the data points are properly classified without error, so that no data point ends up on the wrong side of the hyperplane, while also maximizing the margin. It’s an effective method for a “noise-free” dataset. This is achieved by minimizing the objective function given below:

$$\begin{array}{l} \text{Hard Margin SVM Objective Function:} \ \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \\ \\ \text{Subject to:} \\ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \,\, \forall i \\ \\ \text{Where:} \\ y_i: \text{ Class label of the }i\text{-th sample } (+1 \text{ or } -1) \\ \mathbf{x}_i: \text{ Feature vector of the }i\text{-th sample} \end{array}$$

The constraint given above ensures that no data point is misclassified and that all data points stay outside the margin.

Soft Margin SVM

The Soft Margin SVM is lenient, as it allows some misclassifications. It’s suitable for real-world datasets, which are noisy, and it handles non-linearly separable data. It introduces a slack variable that penalizes incorrect predictions.

$$\begin{array}{l} \text{Objective Function:} \ \min_{\mathbf{w},b,\xi} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \\ \\ \text{Subject to:} \\ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \,\, \forall i \\ \xi_i \geq 0, \,\, \forall i \\ \\ \text{Where:} \\ \xi_i: \text{ Slack variables representing the degree of misclassification or} \\ \text{margin violation.} \\ C: \text{ Regularization parameter controlling the trade-off between} \\ \text{margin maximization and error minimization.} \end{array}$$

The hyperparameter C controls the trade-off between margin maximization and error minimization. A large C penalizes margin violations heavily, which reduces classification errors on the training data but produces a narrower margin. A small C tolerates more violations and produces a wider margin.
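The effect of C can be observed directly. The sketch below is an illustration under assumptions rather than part of the original walkthrough: the overlapping make_blobs dataset, the two C values, and the helper name fit_and_summarize are all chosen here for demonstration. It fits a linear SVC for a small and a large C, then reports the margin width 2/‖w‖ and the total slack.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable.
X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y01 == 0, -1, 1)  # use the -1/+1 labels from the formulas above


def fit_and_summarize(C):
    """Fit a linear soft-margin SVM; return (margin width, total slack)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    # Slack xi_i = max(0, 1 - y_i (w^T x_i + b)); zero for points outside the margin.
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 2.0 / np.linalg.norm(w), slack.sum()


margin_small_C, slack_small_C = fit_and_summarize(C=0.01)
margin_large_C, slack_large_C = fit_and_summarize(C=100.0)

print(f"C=0.01 -> margin width {margin_small_C:.3f}, total slack {slack_small_C:.2f}")
print(f"C=100  -> margin width {margin_large_C:.3f}, total slack {slack_large_C:.2f}")
```

On data like this, the small C should give the wider margin and the larger total slack, which matches the trade-off described above.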

Nonlinear Classification Problems

Non-linear classification problems involve datasets with non-linear patterns that are difficult for linear SVM models to capture. This is a drawback, but SVM kernels can help.

These datasets contain complicated relationships, so linear models can’t accurately identify the trends in them or generate reliable predictions.

Understanding Kernel Functions

A kernel function corresponds to transforming the dataset used in the classification task into a higher-dimensional feature space. In that space, a hyperplane (a linear decision boundary) can split data that was not linearly separable in the original space.

For example, if a dataset lies in a 2D plane and its classes can’t be split by a straight line, the kernel function maps the data into 3D space, where it becomes much simpler to partition the dataset using a basic hyperplane. This technique can be used to capture non-linear relationships in data.

To provide a clearer mental image, consider three distinct groups of points in the 2D plane (x and y). The kernel machine can lift them into 3D space, where groups x1 and x2 may lie in the x-y plane and be readily divided by a simple hyperplane, while group x3 may lie in the y-z plane, already separated from the others.

The Kernel Trick Explained

Transformation into a higher dimensional space is computationally intensive and is not the best option. But we know the importance of kernel functions in classifying non-linear data. So, what’s the way forward to still achieve the same feat while bypassing the cost of computation? It’s called the kernel trick. The kernel trick explains the “magic power” of the kernel functions.

The kernel trick computes the inner (dot) product that the data points would have in the higher-dimensional space by applying a kernel function to them in the original space, instead of first transforming the data into the higher-dimensional space and then doing the computation.

The right side of the equation below is the dot product of the transformed vectors ϕ(x_i) and ϕ(x_j) in the higher-dimensional space, which is inefficient to compute directly. It’s equal to the kernel function on the left side, which is evaluated on the original vectors:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

The purpose of the kernel trick is to perform computation based on the data point in its original dimensional space, instead of performing calculations on complex data that might require an infinite number of dimensions.

Mathematical Implementation of the Kernel Trick

Suppose we have two data points from different classes in the 2D space representing the original feature space. We’ll use them to compare the dot product computed after an explicit mapping with the value that the kernel returns directly from the original coordinates.

$$\begin{array}{l} \textbf{Mapping Without Kernel Trick: }\\ \\ \begin{align*} \textbf{The 2D data is given as: } \\ & \mathbf{x}_1 = (1,1), & y_1 = +1 \\ & \mathbf{x}_2 = (-1,-1), & y_2 = -1 \end{align*} \\ \\ \textbf{Let's use a mapping function: } \\ \\ \phi(x, y) = (x^2, \sqrt{2}xy, y^2)\ \\ \\ Mapping\ \mathbf{x}_1 \ and \ \ \mathbf{x}_2: \\ \\ \begin{array}{l} - \ \phi(\mathbf{x}_1) = (1^2, \sqrt{2}(1)(1), 1^2) = (1, \sqrt{2}, 1) \\ \\ -\ \phi(\mathbf{x}_2) = ((-1)^2, \sqrt{2}(-1)(-1), (-1)^2) = (1, \sqrt{2}, 1) \end{array} \\ \\ \\ \textbf{Dot Product in Higher-Dimensional Space:} \\ \\ \phi(\mathbf{x}_1) \cdot \phi(\mathbf{x}_2) = (1)(1) + (\sqrt{2})(\sqrt{2}) + (1)(1) = 1 + 2 + 1 = 4 \\ \\ \\ \begin{array}{l} \text{This is the dot product of }\mathbf{x}_1\text{ and }\mathbf{x}_2\text{ after explicitly} \\ \text{mapping them to the higher-dimensional space.} \end{array} \end{array}$$

$$\begin{array}{l} \textbf{Using the Kernel Trick: }\\ \\ \textbf{Polynomial Kernel Definition:} \\ \\ K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + c)^d \\ \\ \textbf{For this example:} \\ \\ d = 2 \ (\text{degree of the polynomial}), \quad c = 0 \ (\text{no bias term}) \\ \\ \textbf{Given: } \\ \\ \mathbf{x}_1 = (1, 1), \quad \mathbf{x}_2 = (-1, -1) \\ \\ \textbf{Compute } K(\mathbf{x}_1, \mathbf{x}_2): \\ \\ \begin{align*} K(\mathbf{x}_1, \mathbf{x}_2) &= ((1)(-1) + (1)(-1))^2 \\ &= (-1 - 1)^2 \\ &= (-2)^2 \\ &= 4 \end{align*} \\ \\ \begin{array}{l} \text{Using the kernel trick, we directly compute the dot product in the higher} \\ \text{dimensional space without explicitly mapping the points.} \end{array} \end{array}$$
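The same comparison can be reproduced numerically. This is a small sketch, assuming the points x1 = (1, 1) and x2 = (-1, -1) and the mapping ϕ(x, y) = (x², √2·xy, y²) from the worked example. The function names phi and poly_kernel are introduced here for illustration.

```python
import numpy as np


def phi(v):
    """Explicit feature map from the worked example: (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = v
    return np.array([x**2, np.sqrt(2) * x * y, y**2])


def poly_kernel(a, b, c=0.0, d=2):
    """Polynomial kernel (a . b + c)^d, evaluated in the original 2D space."""
    return (np.dot(a, b) + c) ** d


x1 = np.array([1.0, 1.0])
x2 = np.array([-1.0, -1.0])

explicit = np.dot(phi(x1), phi(x2))  # map to 3D first, then take the dot product
implicit = poly_kernel(x1, x2)       # kernel trick: never leaves 2D

print("Explicit mapping:", explicit)
print("Kernel trick:    ", implicit)

# The identity holds for any pair of 2D vectors, not just this one.
rng = np.random.default_rng(0)
a, b = rng.normal(size=2), rng.normal(size=2)
print("Random pair agrees:", np.isclose(np.dot(phi(a), phi(b)), poly_kernel(a, b)))
```

Both routes give 4 for the example points (up to floating-point rounding), but the kernel needs only one 2D dot product and a square.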

Popular Kernel Functions

Linear kernel

For a dataset that is linearly separable, the linear kernel is ideal. When used for non-linear datasets, which are the main topic of this article, it still creates a linear decision boundary and may underfit. It’s defined as the dot product of the input feature vectors.

This kernel simply constructs a separating line or hyperplane in the original feature space. It does not perform any particular transformation to a higher dimension.

$$\text{Linear Kernel Function: } K(x_i, x_j) = x_i \cdot x_j$$

Polynomial kernel

The polynomial kernel transforms the data into a polynomial feature space of order d. It takes the dot product of the two feature vectors, adds a constant c, and raises the result to the power d. A higher degree lets the kernel capture more complex relationships in the nonlinear dataset, at the risk of overfitting.

$$\text{Polynomial Kernel Function: } K(x_i, x_j) = (x_i \cdot x_j + c)^d$$

Gaussian or Radial Basis Function (RBF) kernel

The Gaussian kernel, also known as the RBF kernel, is often used in SVM to map the input feature vector to an infinite-dimensional feature space using a Gaussian function. This kernel can handle more complex relationships.

$$\text{RBF Kernel Function: } K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$$

Sigmoid kernel

The sigmoid kernel acts similarly to the activation function in neural networks. It behaves like a two-layer perceptron network and can map data into a higher-dimensional feature space.

$$\text{Sigmoid Kernel Function: } K(x_i, x_j) = \tanh(\alpha(x_i \cdot x_j) + c)$$

There are other kernel functions such as Laplacian kernels, hyperbolic kernels, exponential kernels, and custom kernels that you can look into if you’re curious.
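To make the four formulas above concrete, here is a minimal NumPy sketch that evaluates each kernel on a single pair of vectors. The function names, the example vectors, and the parameter values (c, d, gamma, alpha) are arbitrary choices made for this illustration.

```python
import numpy as np


def linear_kernel_fn(xi, xj):
    """K(xi, xj) = xi . xj"""
    return xi @ xj


def polynomial_kernel_fn(xi, xj, c=1.0, d=3):
    """K(xi, xj) = (xi . xj + c)^d"""
    return (xi @ xj + c) ** d


def rbf_kernel_fn(xi, xj, gamma=0.5):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)"""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))


def sigmoid_kernel_fn(xi, xj, alpha=0.1, c=0.0):
    """K(xi, xj) = tanh(alpha * (xi . xj) + c)"""
    return np.tanh(alpha * (xi @ xj) + c)


# Example vectors chosen arbitrarily; their dot product is 4.
xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

k_linear = linear_kernel_fn(xi, xj)
k_poly = polynomial_kernel_fn(xi, xj)
k_rbf = rbf_kernel_fn(xi, xj)
k_sigmoid = sigmoid_kernel_fn(xi, xj)

print("linear: ", k_linear)
print("poly:   ", k_poly)
print("rbf:    ", k_rbf)
print("sigmoid:", k_sigmoid)
```

scikit-learn provides matrix versions of these kernels in sklearn.metrics.pairwise. Note that its polynomial and sigmoid kernels name the parameters gamma and coef0, so the symbols there differ slightly from the formulas in this article.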

How to Choose the Right Kernel

The choice of kernel depends on whether the relationships in the feature space are linear or nonlinear. The linear kernel is simple and fast. It works well with linearly separable data, including high-dimensional data such as text features, but it can’t capture non-linear patterns.

The polynomial kernel is well-suited for data with non-linear or polynomial relationships, as well as low-dimensional data. The RBF kernel is a common default when you have no prior knowledge of the data. Finally, the sigmoid kernel is sometimes used for binary and categorical data points. In practice, it’s worth comparing the candidate kernels on held-out data before committing to one.
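One practical way to choose is to compare kernels with cross-validation. The sketch below is an illustration under assumptions, not a prescribed procedure: it reuses the circle dataset settings from the implementation section (with fewer samples to keep it quick), and the pipeline, degree=2, C=1.0, and 5-fold setup are choices made here.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Concentric circles: a dataset that no straight line can separate.
X, y = make_circles(n_samples=600, noise=0.10, random_state=46)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # degree is only used by the polynomial kernel; the other kernels ignore it.
    model = make_pipeline(
        StandardScaler(),
        SVC(kernel=kernel, degree=2, gamma="scale", C=1.0),
    )
    # Mean accuracy over 5 cross-validation folds.
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

best_kernel = max(scores, key=scores.get)

for kernel, score in scores.items():
    print(f"{kernel:>8}: {score:.3f}")
print("Best kernel on this dataset:", best_kernel)
```

On circular data like this, the linear kernel should score close to chance while the RBF kernel does noticeably better. The exact numbers depend on the noise level and the scikit-learn version.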

SVM Kernel Implementation

Let’s now go through an example showing how you can use this technique.

Step 1: Import the necessary libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC  # used in Step 5 to fit the separating plane

Step 2: Generate the non-linear dataset

The non-linear dataset used in this article is the concentric-circles dataset produced by make_circles from sklearn.datasets. We use 1500 samples and a random_state of 46 so that the dataset is reproducible, and we add Gaussian noise with a standard deviation of 0.10. The generate_circle_data function below generates this dataset.

def generate_circle_data(n_samples=1500, noise=0.10, random_state=46):
    """
    Generate two concentric circles dataset.

    Parameters:
    -----------
    n_samples : int
        The total number of points generated
    noise : float
        Standard deviation of Gaussian noise added to the data
    random_state : int
        Random seed for reproducibility

    Returns:
    --------
    X : array of shape [n_samples, 2]
        The generated samples
    y : array of shape [n_samples]
        The integer labels (0 or 1) for class membership of each sample
    """
    return make_circles(n_samples=n_samples, 
                       noise=noise, 
                       random_state=random_state)

Step 3: Plot the 2D Data

The data generated above is two-dimensional. We plot the points with the Matplotlib library, using a different color for each of the two classes, which lets us see the circular structure of the dataset.

def plot_2d_data(X, y, title="2D Circle Dataset"):
    """
    Plot the 2D dataset with different colors for each class.

    Parameters:
    -----------
    X : array-like of shape (n_samples, 2)
        The input samples
    y : array-like of shape (n_samples,)
        The target values (class labels)
    title : str
        The title of the plot
    """
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y, marker='.', cmap='viridis')
    plt.title(title)
    plt.xlabel('X₁')
    plt.ylabel('X₂')
    plt.colorbar(label='Class')
    plt.grid(True, alpha=0.3)
    plt.show()

The output image of the dataset is given below:

Step 4: Transform into a Higher-Dimensional Space

The 2D data is transformed into a 3D space with an explicit polynomial feature map. We create a third feature, X3 = X1² + X2² (the squared distance from the origin), so that the data is mapped into a higher-dimensional space where it’s easier to separate.

def transform_to_3d(X):
    """Transform 2D data to 3D using radius-based transformation"""
    X1 = X[:, 0].reshape(-1, 1)
    X2 = X[:, 1].reshape(-1, 1)
    # Modified transformation to create better separation
    X3 = X1**2 + X2**2
    return np.hstack((X1, X2, X3))
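
To see what the extra feature buys, here is a minimal sketch that fits the same linear classifier on the original 2D features and on the 3D features. It rebuilds the data with make_circles using the article's settings. The train/test split, the helper name linear_accuracy, and the comparison itself are additions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=1500, noise=0.10, random_state=46)

# Same transformation as transform_to_3d: append X1^2 + X2^2 as a third column.
X3 = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X_3d = np.hstack((X, X3))


def linear_accuracy(features):
    """Held-out accuracy of a linear SVM trained on the given features."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, random_state=0, stratify=y
    )
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


acc_2d = linear_accuracy(X)
acc_3d = linear_accuracy(X_3d)

print(f"Linear SVM on 2D features: {acc_2d:.3f}")
print(f"Linear SVM on 3D features: {acc_3d:.3f}")
```

The 2D accuracy should sit near chance, while the 3D accuracy should be clearly higher because the third coordinate encodes the radius. With noise of 0.10 the two rings overlap somewhat, so the 3D score is not expected to reach 100%.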

Step 5: Plot the 3D Transformation

The next step is to plot the 3D transformed dataset. It now looks like a U-shaped bowl, and it’s separated by a plane obtained by fitting a LinearSVC model from the sklearn library to the transformed features. This shows a practical example of the concepts you’ve learned so far:

def plot_3d_transformation_with_separator(X_transformed, y, title="3D Transformed Dataset with Linear Separator"):
    """Plot the 3D transformed dataset with a clear linear separating plane"""

    # Scale the transformed features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_transformed)

    # Fit linear SVM with adjusted parameters for better separation
    svm = LinearSVC(C=1.0, dual="auto", max_iter=5000)
    svm.fit(X_scaled, y)

    # Create the 3D plot
    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')

    # Plot the two classes with different colors and markers for clarity
    class_0 = y == 0
    class_1 = y == 1

    ax.scatter(X_transformed[class_0, 0], 
              X_transformed[class_0, 1], 
              X_transformed[class_0, 2],
              c='blue', 
              marker='o',
              label='Class 0',
              alpha=0.6)

    ax.scatter(X_transformed[class_1, 0], 
              X_transformed[class_1, 1], 
              X_transformed[class_1, 2],
              c='red', 
              marker='^',
              label='Class 1',
              alpha=0.6)

    # Create a grid for the separator plane
    x_min, x_max = X_transformed[:, 0].min() - 0.2, X_transformed[:, 0].max() + 0.2
    y_min, y_max = X_transformed[:, 1].min() - 0.2, X_transformed[:, 1].max() + 0.2

    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 50),
                        np.linspace(y_min, y_max, 50))

    # Get the separating plane coefficients
    w = svm.coef_[0]
    b = svm.intercept_[0]

    # Calculate z coordinates of the plane
    grid_points = np.c_[xx.ravel(), yy.ravel(), np.zeros(xx.ravel().shape[0])]
    scaled_grid = scaler.transform(grid_points)

    # Calculate the separator plane
    z = (-w[0] * scaled_grid[:, 0] - w[1] * scaled_grid[:, 1] - b) / w[2]
    z = z.reshape(xx.shape)
    z = scaler.inverse_transform(np.c_[xx.ravel(), yy.ravel(), z.ravel()])[:, 2].reshape(xx.shape)

    # Plot the separating plane with adjusted transparency
    surface = ax.plot_surface(xx, yy, z, alpha=0.3, cmap='coolwarm')

    # Customize the plot
    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.set_zlabel('X₁² + X₂²')
    ax.set_title(title)

    # Add legend
    ax.legend()

    # Adjust the viewing angle for better visualization
    ax.view_init(elev=20, azim=45)

    # Add text description
    ax.text2D(0.05, 0.95, 
              "Polynomial Kernel Transformation:\nΦ(x₁,x₂) → (x₁,x₂,x₁²+x₂²)\n\nClasses are linearly separable\nin transformed space", 
              transform=ax.transAxes, 
              bbox=dict(facecolor='white', alpha=0.8))

    plt.show()

def main():
    # Generate and plot the dataset
    X, y = generate_circle_data()
    plot_2d_data(X, y)

    # Transform and plot 3D data with clear separator
    X_transformed = transform_to_3d(X)
    plot_3d_transformation_with_separator(X_transformed, y)

if __name__ == "__main__":
    main()

The main function ties together the other functions, such as generate_circle_data, transform_to_3d and plot_3d_transformation_with_separator, to build and visualize the model. The image below shows the better separation achieved with the polynomial transformation.

Conclusion

In this article, you learned how SVM kernels handle non-linear classification. Kernel functions achieve the effect of mapping input data into a higher-dimensional space, as shown in the example, while the kernel trick avoids the storage and processing cost of computing that mapping explicitly.

SVM can be used in a variety of classification tasks, including image and text classification, and it has proven to be extremely efficient.

References

Park, H., & Son, J.-H. (2021). Machine learning techniques for THz imaging and time-domain spectroscopy. Sensors, 21(4), 1186. https://doi.org/10.3390/s21041186
Scikit-learn developers. (2024). Support vector machines. Scikit-learn. https://scikit-learn.org/1.5/modules/svm.html

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390

How to Apply Academic Theories to Human-Centered Web Design [Full Handbook]

Table of Contents

1.0 Fitts’s Law:

1.1 Use Padding Wisely

1.2 Use Infinite Targets

Design Takeaways from Fitts Law:

2.0 Hick's Law:

Design Takeaway from Hick's Law

3.0 Gestalt Principles:

Key Gestalt Principles:

3.1 Proximity

3.2 Similarity

3.3 Continuity

3.4 Closure

3.5 Figure/Ground

3.6 Common Fate

3.7 Focal Point

Design Takeaways from the Gestalt Principles

4.0 Von Restorff Effect (The Isolation Effect):

Design takeaways from Von Restorff

5.0 Jakob’s Law

Design Takeaway from Jakob's Law

6.0 Miller’s Law

Design Takeaway from Miller's Law

7.0 The Goal-Gradient Hypothesis

Design Takeaway from Goal-Gradient Hypothesis

8.0 Zeigarnik Effect

Design Takeaway from Zeigarnik Effect

9.0 Tesler’s Law:

Design Takeaway from Tesler's Law

10.0 Peak End Rule:

Design takeaway from Peak End Rule

11.0 Postel’s Law:

Design Takeaway from Postel's Law

12.0 Doherty Threshold:

Design Takeaways from Doherty Threshold

13.0 Serial Position Effect (Primacy and Recency):

Design Takeaways Serial Position Effect

14.0 Occam’s Razor:

Design Takeaway from Occam's Razor

15.0 Parkinson's Law

Design Takeaway for Parkinson's law

Conclusion

References

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Table Of Contents

Prerequisites

The Dataset

Mean: The Sensitive Giant

Median: The Robust Middle

Beyond Averages: Understanding Spread with Quartiles

The IQR: Detecting Outliers

A Simple Example to Understand IQR

Step 1: Find the Median (Q2):

Step 2: Find Q1 (Lower Quartile):

Step 3: Find Q3 (Upper Quartile):

Step 4: Calculate IQR:

Step 5: Find Outlier Bounds:

Applying IQR to Our Dataset

Revisiting the Mean After Removing Outliers

Final Comparison and Insights

Conclusion

Connect with me

How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway

Table of Contents

1. Prerequisites

2. Building the Brain: The Model

1. Vectorization: Turning Text into Math

2. Training: The Logistic Regression Engine

3. Evaluation: Testing the Intelligence

4. Exporting the Logic (Serialization)

3. Deploying the Model to AWS

1. Model Storage: Amazon S3

2. The Production Backend: AWS Lambda

3. The API Gateway - The Bridge to the Web

Creating the REST API

Deployment Stages

Connecting the Frontend (The JavaScript Layer)

4. How to Run The Project Locally
