<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Tiago Capelo Monteiro - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Tiago Capelo Monteiro - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 16 May 2026 08:36:23 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/tiagomonteiro/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build Optimal AI Agents That Actually Work – A Handbook for Devs ]]>
                </title>
                <description>
                    <![CDATA[ Since moving to Silicon Valley in 2025, I've seen AI everywhere. And after I attended NVIDIA GTC 2025, one thing became very clear from many conversations I had: most companies now have AI agents runn ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-optimal-ai-agents-that-actually-work-a-handbook-for-devs/</link>
                <guid isPermaLink="false">6a024a82fca21b0d4b6c5283</guid>
                
                    <category>
                        <![CDATA[ ai-agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Mon, 11 May 2026 21:30:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f1ca2c84-0c3f-4f20-84f2-9bad5cc1c915.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Since moving to Silicon Valley in 2025, I've seen AI everywhere. And after I attended NVIDIA GTC 2025, one thing became very clear from many conversations I had: most companies now have AI agents running successfully in various projects or departments.</p>
<p>But almost no one has managed to roll them out well across an entire organization. And even where agents are deployed, they're often poorly organized.</p>
<p>Companies are shipping agent systems almost by guessing.</p>
<p>Some of the questions I heard were:</p>
<ul>
<li><p>What's the right number of AI agents in a team?</p>
</li>
<li><p>What's the best model provider to use?</p>
</li>
<li><p>Should the agents have a "boss" agent supervising them, or should they coordinate peer-to-peer?</p>
</li>
</ul>
<p>In other words, the main question was:</p>
<blockquote>
<p>What is the best organizational structure for a team of AI agents?</p>
</blockquote>
<p>This article tries to answer exactly that.</p>
<p>I previously wrote <a href="https://www.freecodecamp.org/news/the-math-behind-artificial-intelligence-book/">a book on the math behind AI</a>, so we won't be doing any math here.</p>
<p>Instead, we'll focus on how to organize agents for real business cases.</p>
<p>We'll use a recent AI paper from Google Research, Google DeepMind, and MIT, <a href="https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/">Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work</a>, as our primary source.</p>
<p>For the code, I'll use a Jupyter notebook in Google Colab.</p>
<h3 id="heading-heres-what-well-cover">Here's What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-an-llm">What is an LLM?</a></p>
</li>
<li><p><a href="#heading-what-are-ai-agents">What Are AI Agents?</a></p>
</li>
<li><p><a href="#heading-a-decision-algorithm-for-creating-optimal-ai-agents">A Decision Algorithm for Creating Optimal AI Agents</a></p>
</li>
<li><p><a href="#heading-three-code-examples">Three Code Examples</a></p>
<ul>
<li><p><a href="#heading-1-installing-utilities-python-libraries-and-doing-config">1. Installing Utilities, Python Libraries, and Doing Config</a></p>
</li>
<li><p><a href="#heading-2-starting-the-ollama-server-getting-the-model-and-tools">2. Starting the Ollama Server, Getting the Model and Tools</a></p>
</li>
<li><p><a href="#heading-3-testing-the-model">3. Testing the Model</a></p>
</li>
<li><p><a href="#heading-4-running-the-ai-agents">4. Running the AI Agents</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion-the-future-of-ai-is-evals">Conclusion: The Future of AI is Evals</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You don't need to be an expert developer to create AI agents. There are many no-code tools that can help you through the process.</p>
<p>But to get the most out of the examples here (and to be able to check your agents' work and understand what they're doing), you'll need:</p>
<ul>
<li><p>A general understanding of Python and what an LLM is.</p>
</li>
<li><p>Ollama installed on your machine to run large language models locally and for free.</p>
</li>
<li><p>A Jupyter Notebook setup. Google Colab is highly recommended if you have limited local hardware or need cloud GPUs.</p>
</li>
</ul>
<p>Let's get into it!</p>
<h2 id="heading-what-is-an-llm">What is an LLM?</h2>
<p>An LLM (Large Language Model) is like a very well-read intern who has never left the library.</p>
<p>The LLM can quote, summarize, translate, and imitate almost any style. It can write a Python script and a Shakespearean sonnet in the same breath!</p>
<p>But it has limitations. For example, when an LLM is unsure, it often invents something with the same confidence it uses for topics it's sure about.</p>
<p>This is called hallucination.</p>
<p>Also, LLMs don't have memory between conversations by default, and they can't do anything on their own. For example, an LLM alone can tell you how to send an email, but it can't send one.</p>
<p>This is where agents come in.</p>
<h2 id="heading-what-are-ai-agents">What Are AI Agents?</h2>
<p>If an LLM is like an intern, an AI agent is that same intern given a desk, a laptop, and a to-do list – and the ability to act.</p>
<p>An agent is essentially an LLM that has been wrapped in tools, memory, and a loop.</p>
<p>Tools allow the agent to do things like search the web, read a particular file, send an email, and run code. Memory allows the LLM to remember what it did before in other tasks. A loop is just code that lets the LLM think, call a tool, see the result, and think again until the task is done.</p>
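<p>To make this concrete, here's a minimal sketch of that loop in plain Python. The <code>fake_llm</code> function and the <code>calculator</code> tool are hypothetical stand-ins, not a real model API; the think, act, remember structure is the point:</p>
<pre><code class="language-python"># Minimal sketch of an agent: an LLM wrapped in tools, memory, and a loop.
# fake_llm() is a hypothetical stand-in for a real model call.

def fake_llm(task, memory, tools):
    """Pretend LLM: call the calculator once, then answer from memory."""
    if not memory:
        return {"type": "tool", "tool": "calculator", "input": "2 + 2"}
    return {"type": "final_answer", "text": f"The result is {memory[-1][1]}"}

TOOLS = {"calculator": lambda expr: eval(expr, {"__builtins__": {}})}

def agent_loop(task, tools, llm, max_steps=5):
    memory = []                                  # what the agent has done so far
    for _ in range(max_steps):
        action = llm(task, memory, tools)        # think
        if action["type"] == "final_answer":
            return action["text"]                # done
        result = tools[action["tool"]](action["input"])  # act: call a tool
        memory.append((action, result))          # remember, then think again
    return "Stopped: step limit reached."

print(agent_loop("What is 2 + 2?", TOOLS, fake_llm))  # prints: The result is 4
</code></pre>
<p>Frameworks like CrewAI (which we'll use later) implement this same loop for you, with real LLM calls and real tools.</p>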
<p>In many cases, an individual agent is very useful. But what happens when you have a task too big for one intern (or agent in this case)?</p>
<p>Naturally, you can hire more interns! But you get new problems:</p>
<ul>
<li><p>Should you have one intern with a long to-do list (single-agent)?</p>
</li>
<li><p>Should you have five interns all working on the same task independently (independent multi-agent)?</p>
</li>
<li><p>How many interns should be on a team?</p>
</li>
<li><p>Should a boss who assigns subtasks manage the interns?</p>
</li>
<li><p>Should you have a group of peers who coordinate among themselves? A mix?</p>
</li>
</ul>
<p>This is the exact question the Google paper we're using as our primary source here tries to answer with over 150 controlled experiments.</p>
<p>Just keep in mind that having more agents doesn't always mean you'll get better results. Sometimes one agent is a perfect fit. And other times you'll need more.</p>
<h3 id="heading-some-background">Some Background</h3>
<p>Before we dive in, an important note: these are experimental findings, not laws of physics.</p>
<p>Using an exhaustive methodology, the Google paper evaluated many possible AI agent team configurations across several model providers.</p>
<p>Some of the providers were:</p>
<ul>
<li><p>OpenAI (ChatGPT)</p>
</li>
<li><p>Google (Gemini)</p>
</li>
<li><p>Anthropic (Claude)</p>
</li>
</ul>
<p>The results differed by model family:</p>
<ul>
<li><p>OpenAI models gained most from centralized/hybrid setups</p>
</li>
<li><p>Google models showed a clear efficiency plateau</p>
</li>
<li><p>Anthropic models were more sensitive to coordination overhead</p>
</li>
</ul>
<p>Since the study is persuasive and based on a large number of controlled experiments, your team can treat these findings as strong guidelines when choosing a model family.</p>
<h2 id="heading-a-decision-algorithm-for-creating-optimal-ai-agents">A Decision Algorithm for Creating Optimal AI Agents</h2>
<p>Now, we'll take the research in the article and convert it into a simple-to-apply algorithm that anyone can use to create AI agents to automate their work.</p>
<p>The main objective of this algorithm is to help you decide, with the Google paper as a scientific reference, if you need just one agent or a couple more.</p>
<p>This way, instead of explaining the article step by step, I'll show you how to actually apply it to solve your problems.</p>
<h3 id="heading-1-check-your-budget">1. Check Your Budget</h3>
<p>If you have limited hardware, I recommend starting with Ollama.</p>
<p>Ollama is a tool that allows you to run LLMs on your personal computer. And when you run it locally, it's free (and open source).</p>
<p>If you use an API from OpenAI, Google, or Anthropic to access their models, you'll start spending money.</p>
<p>As of May 6, 2026, OpenAI's GPT-5.5 costs $5.00 per 1M tokens, while GPT-5.4 mini costs $0.75 per 1M tokens.</p>
<p>If you have limited cloud resources, you can use Google Colab to access GPUs and run larger and newer billion-parameter LLMs. Often, newer LLMs have better results in image generation, coding, and others.</p>
<p>You can also use LLMs with Ollama in Google Colab.</p>
<p>If you have a company project, I recommend this same cloud-based option. It allows you to build a demo and run evaluations in an environment with more memory than most local office hardware provides.</p>
<p>If you have a flexible budget, you can use professional APIs like Claude or Gemini.</p>
<p>Always remember that agents cost tokens, and tokens cost money.</p>
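<p>As a back-of-the-envelope sketch, you can estimate a monthly bill from per-1M-token prices like the ones above. The workload numbers below are made up purely for illustration:</p>
<pre><code class="language-python"># Rough monthly token-cost estimate. Workload numbers are illustrative.
def monthly_cost(tokens_per_task, tasks_per_day, price_per_million_tokens):
    tokens_per_month = tokens_per_task * tasks_per_day * 30
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# e.g. 5,000 tokens per task, 200 tasks per day, at $0.75 per 1M tokens
print(round(monthly_cost(5_000, 200, 0.75), 2))  # prints 22.5
</code></pre>
<p>Multi-agent setups multiply this: every coordination message between agents is more tokens.</p>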
<h3 id="heading-2-start-with-only-one-agent">2. Start with Only ONE Agent</h3>
<p>Always begin with a single agent. Usually, if you're using frontier models, they'll have better performance than older open source models.</p>
<h3 id="heading-3-measure-performance">3. Measure Performance</h3>
<p>According to the paper, if a single agent's real-world success rate (how well it works and how accurately it performs) is more than 45%, then there's typically no need to create a team of agents for the task.</p>
<p>To measure this, run the agent on 50–100 representative tasks. Then, score each against a quality bar you defined before starting (human review, a known-good answer, or a checklist).</p>
<p>Note that the paper's 45% finding is only one-directional: it identifies when <strong>not</strong> to add agents (above 45%). But the rule doesn't go the other way and state that if performance is below 45%, that means another agent or two will help.</p>
<p>The authors state that "coordination benefits arise from matching communication topology to task structure, not from scaling the number of agents".</p>
<p>Basically, if your agent underperforms, fix the agent first! Don't just automatically think you need another agent.</p>
<p>If you determine, for your project, that a single agent works, then go ahead to step 7.</p>
<p>If the single agent's performance is below 45%, first try improving it (better prompts, tools, or model). Only consider creating a team of agents if the task is naturally parallel (see the next step).</p>
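<p>Steps 2 and 3 can be sketched as a simple gate. Here, <code>run_agent</code> and <code>passes_quality_bar</code> are hypothetical placeholders for your agent call and the scoring rule you defined before starting:</p>
<pre><code class="language-python">import operator

# Sketch of the paper's 45% single-agent gate. run_agent() and
# passes_quality_bar() are hypothetical placeholders you would supply.
def success_rate(tasks, run_agent, passes_quality_bar):
    passed = sum(1 for t in tasks if passes_quality_bar(run_agent(t), t))
    return passed / len(tasks)

def single_agent_is_enough(rate, threshold=0.45):
    # One-directional rule: strictly above the threshold, skip multi-agent.
    # (operator.gt(a, b) means "a greater than b".)
    return operator.gt(rate, threshold)

# Toy run: 30 of 60 dummy "tasks" pass the quality bar.
rate = success_rate(range(60), lambda t: t, lambda out, t: out % 2 == 0)
print(rate, single_agent_is_enough(rate))  # prints 0.5 True
</code></pre>
<p>In a real run, the tasks would be your 50–100 representative examples and the quality bar would be human review, a known-good answer, or a checklist, as described above.</p>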
<h3 id="heading-4-assess-task-parallelism">4. Assess Task Parallelism</h3>
<p>A big question then becomes, why use multiple agents at all? Here's how you can decide:</p>
<p>If your task involves just one continuous job, a single agent typically does it better and cheaper.</p>
<p>But multiple agents can help when you can clearly split your project into discrete subtasks. Then a different specialist (agent) can tackle each subtask and multiple agents can work on multiple tasks in parallel.</p>
<p>In this step of our algorithm, you want to see if the task you're trying to apply the AI agents to is naturally parallel.</p>
<p>A task is naturally parallel if it can be split into independent subtasks. For example:</p>
<ul>
<li><p>Searching for the best flight across five different websites.</p>
</li>
<li><p>Summarizing ten separate news articles at once.</p>
</li>
</ul>
<p>Examples where tasks are not naturally parallel:</p>
<ul>
<li><p>Planning a trip from start to finish (you must choose a destination before booking a hotel, for example – so those tasks can't be completed in parallel).</p>
</li>
<li><p>Managing a bank transfer (the funds must be verified before they're sent).</p>
</li>
</ul>
<p>If the task is naturally parallel, you may benefit from more agents, and you should continue on to step 5.</p>
<p>If it's not (the task is sequential or step-by-step), stop. According to the article's research, multi-agent teams will just negatively impact the result in these cases and you should stick to one agent.</p>
<p>In this case (not naturally parallel), just work on improving your prompts, tools, or model for the single agent. Then, once it beats the 45% threshold, go to step 7.</p>
<h3 id="heading-5-pick-the-topology-by-task-type">5. Pick the Topology by Task Type</h3>
<p>Now we'll decide on the structure for our agent team.</p>
<p>Topology simply means the structure of a system. In this case, we're talking about the structure of the team of AI agents.</p>
<p>This step only applies once you've decided you need multiple agents. Both topologies we'll examine here are multi-agent.</p>
<p>If the task is based on analysis or structured work, it's better to use a centralized model. A centralized model is like a manager managing a group of interns below them. The interns report to the manager, and the manager coordinates them.</p>
<p>A centralized model is good for pipelines like financial reports.</p>
<p>According to the study, this reduces error amplification from ~17x to 4x. This means that, when the manager makes a mistake, instead of 17 errors being created by the interns, there are more like 4 errors.</p>
<p>If the task is more related to exploration, use a decentralized model.</p>
<p>They're good for open-ended research or audits where agents review the same material from different angles.</p>
<p>A decentralized model is like interns in a team brainstorming ideas for a new product for the company or discussing over lunch how to make a process faster.</p>
<h3 id="heading-6-cap-the-team-size-and-available-tools-per-agent">6. Cap the Team Size and Available Tools Per Agent</h3>
<p>According to the paper, AI agent success starts to degrade after about 3–4 agents.</p>
<p>They also explain that each agent should have access to the minimum tools necessary (1–3 tools per agent). The more tools each agent has, the worse it performs.</p>
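<p>A quick sanity check for these caps might look like the sketch below. The limits come straight from the paper's findings; the helper itself is hypothetical:</p>
<pre><code class="language-python"># Sketch: flag teams that break the paper's caps (max 4 agents, 1-3 tools each).
MAX_AGENTS = 4
MAX_TOOLS = 3

def check_team(team):
    """team maps each agent name to its list of tool names."""
    problems = []
    if len(team) not in range(1, MAX_AGENTS + 1):
        problems.append(f"team size {len(team)} is outside 1-{MAX_AGENTS}")
    for name, tools in team.items():
        if len(tools) not in range(1, MAX_TOOLS + 1):
            problems.append(f"{name} has {len(tools)} tools, expected 1-{MAX_TOOLS}")
    return problems

print(check_team({"researcher": ["web_search"], "writer": ["file_write"]}))  # prints []
</code></pre>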
<h3 id="heading-7-build-evaluations">7. Build Evaluations</h3>
<p>Now, you have something that works most of the time. But how can you ensure the agents will scale across the organization? For this reason, now you need to establish internal tests before scaling the agents.</p>
<p>These internal tests are called evals (evaluations).</p>
<p>For each evaluation, you'll need to have clear metrics that let you know how the agents are performing in each evaluation.</p>
<p>You'll want to measure things like accuracy, efficiency, and trajectory. Accuracy tells us if the model got it right. Efficiency reports how fast and cheap it was to process the request. And trajectory shows if the model used the right tools to do the task.</p>
<p>Remember, in AI and engineering in general, if you can't measure the system's performance, you can't trust the system.</p>
<p>This way, you can start seeing how well the model performs with the data your organization works with and its context. Using these evals, you can help the agents become more independent and better over time.</p>
<p>Evals might be:</p>
<ul>
<li><p>Input emails and output responses expected</p>
</li>
<li><p>Input customer support transcripts and outputs summarized action items</p>
</li>
<li><p>Input complex legal contracts and outputs identified high-risk clauses</p>
</li>
</ul>
<p>Then you see how close the agent's or agents' outputs are to the expected output.</p>
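<p>A tiny eval harness can be sketched like this. The exact-match scoring and the sample cases are placeholders; real evals would compare against human review or a rubric, and would also track efficiency and trajectory:</p>
<pre><code class="language-python"># Sketch of a minimal eval harness: compare agent outputs to expected answers.
# Exact-match scoring is a placeholder for a real rubric or human review.
def run_evals(cases, run_agent):
    results = []
    for prompt, expected in cases:
        output = run_agent(prompt)
        results.append({"prompt": prompt, "accuracy": float(output == expected)})
    overall = sum(r["accuracy"] for r in results) / len(results)
    return overall, results

cases = [("2+2", "4"), ("capital of France", "Paris")]
dummy_agent = lambda p: {"2+2": "4"}.get(p, "unknown")
overall, details = run_evals(cases, dummy_agent)
print(overall)  # prints 0.5
</code></pre>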
<p>You can also try different models and go through this decision algorithm again to see which models work best for your use case. After all, new models are often better than previous models.</p>
<p>With this workflow in place, you'll create more accurate and efficient agents.</p>
<p>Now let's look at this algorithm in action using three use cases.</p>
<h2 id="heading-three-code-examples">Three Code Examples</h2>
<p>In this section, I'll explain how I ran the code in the Jupyter notebook. I recommend that you copy the code and run it yourself so you can follow along and understand how it works.</p>
<p>We'll go through the code section by section, following the structure I defined in the Google Colab notebook, so that you understand everything.</p>
<p>You can find the full notebook <a href="https://github.com/tiagomonteiro0715/How-to-Build-Optimal-AI-Agents-That-Actually-Work-Handbook">here on GitHub</a> as well. I used the MIT license for this code.</p>
<h3 id="heading-1-installing-utilities-python-libraries-and-doing-config">1. Installing Utilities, Python Libraries, and Doing Config</h3>
<pre><code class="language-python">!sudo apt update &amp;&amp; sudo apt install -y pciutils
!sudo apt-get install -y zstd
!curl -fsSL https://ollama.com/install.sh | sh
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c91a3d8b-18dd-4850-bca6-ae707e69736c.png" alt="c91a3d8b-18dd-4850-bca6-ae707e69736c" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This code essentially prepares the notebook to run AI agents.</p>
<p>The first line updates the package list and installs hardware detection tools to identify your GPU. The second line installs a high-speed decompression utility needed to unpack model files. Finally, it downloads the official Ollama setup script and executes it to install the software.</p>
<p>Ollama is an open-source tool that allows you to use LLMs on your computer.</p>
<pre><code class="language-python">!pip install uv
!uv pip install langchain-ollama ollama crewai duckduckgo-search langchain-community ddgs faker
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d86340f3-3a19-4a89-9975-ecb4116d379a.png" alt="d86340f3-3a19-4a89-9975-ecb4116d379a" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Here, we installed the <code>uv</code> Python package. It's like pip, but far faster and safer.</p>
<p>With this, we can download the rest of the Python libraries much more quickly.</p>
<pre><code class="language-python">import socket
import subprocess
import threading
import time

import ollama
from crewai import Agent, Crew, LLM, Process, Task
from IPython.display import Markdown
from langchain_ollama.llms import OllamaLLM

from crewai.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun

from faker import Faker
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/60effe35-2293-4201-afb0-f561a64470e4.png" alt="60effe35-2293-4201-afb0-f561a64470e4" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>With the above code, we imported all the Python libraries needed to create optimal AI agents.</p>
<p>Let's see what each one does:</p>
<ul>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/socket.py">socket</a>: Connects your computer to others over a network.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/subprocess.py">subprocess</a>: Lets Python launch and control other programs on your computer.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Lib/threading.py">threading</a>: Runs multiple tasks at once so one slow process doesn't freeze the whole code.</p>
</li>
<li><p><a href="https://github.com/python/cpython/blob/main/Modules/timemodule.c">time</a>: Handles delays and timestamps, like making the code wait or measuring speed.</p>
</li>
<li><p><a href="https://github.com/ollama/ollama-python">ollama</a>: The tool we'll use for talking to AI models running locally on your machine.</p>
</li>
<li><p><a href="https://github.com/crewAIInc/crewAI">crewai</a>: Organizes multiple AI agents to work together like a specialized team.</p>
</li>
<li><p><a href="https://github.com/ipython/ipython">IPython</a>: Powers interactive coding features and pretty-printing in tools like Jupyter.</p>
</li>
<li><p><a href="https://github.com/langchain-ai/langchain/blob/master/libs/partners/ollama/README.md">langchain_ollama</a>: Plugs local Ollama models into the popular LangChain AI framework.</p>
</li>
<li><p><a href="https://github.com/langchain-ai/langchain-community">langchain_community</a>: Offers hundreds of extra "connectors" to link AI to the outside world.</p>
</li>
<li><p><a href="https://github.com/joke2k/faker">faker</a>: Generates realistic "dummy" data (names, emails) for testing your code safely.</p>
</li>
</ul>
<pre><code class="language-python">fake = Faker("en_US")

Faker.seed(42)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/6d896775-9db5-4d1a-b144-07b035f1dc35.png" alt="6d896775-9db5-4d1a-b144-07b035f1dc35" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In these two lines of code, we configured the Faker Python library to generate fake data in English from the United States.</p>
<h3 id="heading-2-starting-the-ollama-server-getting-the-model-and-tools">2. Starting the Ollama Server, Getting the Model and Tools</h3>
<pre><code class="language-python">with open("ollama.log", "w") as log_file:
    process = subprocess.Popen(["ollama", "serve"], stdout=log_file, stderr=log_file)

def is_server_ready(port=11434):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(('localhost', port)) == 0

print("Booting Ollama server...")
max_retries = 20
ready = False

for i in range(max_retries):
    if is_server_ready():
        ready = True
        break
    time.sleep(1)
    if i % 5 == 0:
        print(f"Still waiting... ({i}s)")

if ready:
    print("\n Success! Ollama is running and ready for models.")
    !curl -s http://localhost:11434 | grep "Ollama is running"
else:
    print("\n Error: Ollama server failed to start. Check 'ollama.log' for details.")
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/1daf506b-fb25-4487-9bb3-887b37bb0aaf.png" alt="1daf506b-fb25-4487-9bb3-887b37bb0aaf" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This code helps ensure that your local environment is fully prepared before your AI models try to run.</p>
<p>AI servers often take some time to boot, so just be patient.</p>
<p>This script prevents "connection refused" errors by using a background process to start Ollama and a network "handshake" to confirm that it's awake.</p>
<pre><code class="language-python">!ollama pull mistral-small3.2
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/ce54b7e0-0b4f-4751-b797-ac4bd45cae63.png" alt="ce54b7e0-0b4f-4751-b797-ac4bd45cae63" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this line, we pulled the <code>mistral-small3.2</code> LLM into the Google Colab notebook.</p>
<p>Mistral is a model developed by a well-known French startup, Mistral AI SAS.</p>
<pre><code class="language-python">_ddg = DuckDuckGoSearchRun()

@tool("web_search")
def web_search(query: str) -&gt; str:
    """Search the public web via DuckDuckGo. Input: a concise search query string. Returns: top result snippets as plain text."""
    return _ddg.run(query)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/0cadabf5-d454-418d-844c-3167a68283bd.png" alt="0cadabf5-d454-418d-844c-3167a68283bd" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code we've created a tool for our agents to use: we're giving the agents the ability to search the web with DuckDuckGo. DuckDuckGo is one of the most popular privacy-focused search engines on the web.</p>
<p>This is crucial because it lets our agents retrieve recent information that isn't in the model's training data.</p>
<h3 id="heading-3-testing-the-model">3. Testing the Model</h3>
<p>Now we'll write the code that defines and tests the LLM.</p>
<p>We're initializing both a standard model for direct tasks and a specialized LLM object for the CrewAI framework. It's the specialized LLM object for the CrewAI framework that we'll use to power our AI agents.</p>
<p>This initial configuration is important because it validates that your machine is properly communicating with the software before you try to create AI agents.</p>
<pre><code class="language-python">AI_prompt = "Write a quick system prompt for an AI agent whose job is to summarize financial documents."

AI_model = OllamaLLM(model="mistral-small3.2")

crew_llm = LLM(
    model="ollama/mistral-small3.2",
    base_url="http://localhost:11434"
)

print("Running Mistral...")
AI_response = AI_model.invoke(AI_prompt)
display(Markdown(f"### AI Output:\n{AI_response}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5f76b8c8-6713-40dd-a624-fc83fb35f666.png" alt="5f76b8c8-6713-40dd-a624-fc83fb35f666" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-4-running-the-ai-agents">4. Running the AI Agents</h3>
<p>Now, we'll run three different agent configurations.</p>
<p>The first one is a single agent for sequential tasks. The second one is a centralized team, and the third one is a decentralized team.</p>
<h4 id="heading-sequential-tasks-with-a-single-agent">Sequential Tasks with a Single Agent</h4>
<pre><code class="language-python">doc_5_1 = f"""{fake.company()} {fake.company_suffix()} — Q3 2026 Earnings Report
Prepared by: {fake.name()}, CFO
KEY METRICS
Revenue: ${fake.random_int(50, 500)}M (up {fake.random_int(5, 25)}% YoY)
Net Income: ${fake.random_int(10, 80)}M
Operating Margin: {fake.random_int(12, 28)}%
Active Customers: {fake.random_int(10_000, 500_000):,}
Cash on Hand: ${fake.random_int(100, 900)}M
Employee Headcount: {fake.random_int(200, 5000):,}
MANAGEMENT COMMENTARY
{fake.paragraph(nb_sentences=5)}
RISK FACTORS
{fake.paragraph(nb_sentences=4)}
"""
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/15c0b2f4-9e8e-4ed1-950b-2d897502ae28.png" alt="15c0b2f4-9e8e-4ed1-950b-2d897502ae28" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, we prepared the general template where the fake data will be generated.</p>
<pre><code class="language-python">print(doc_5_1)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c16aa43e-da98-4255-be6e-0ba60b342163.png" alt="c16aa43e-da98-4255-be6e-0ba60b342163" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-plaintext">Rodriguez, Figueroa and Sanchez and Sons — Q3 2026 Earnings Report
Prepared by: Megan Mcclain, CFO
KEY METRICS
Revenue: $94M (up 23% YoY)
Net Income: $64M
Operating Margin: 13%
Active Customers: 25,622
Cash on Hand: $195M
Employee Headcount: 1,991
MANAGEMENT COMMENTARY
Own night respond red information last everything. Serve civil institution. Choice whatever from behavior benefit. Page southern role movie win her.
RISK FACTORS
Stop peace technology officer relate. Product significant world. Term herself law street class. Decide environment view possible participant commercial. Clear here writer policy news.
</code></pre>
<p>With this code, we printed the document the agent will process.</p>
<pre><code class="language-python">analyst = Agent(
    role="Senior Financial Document Specialist",
    goal=(
        "Read the provided document end-to-end, extract the 5 most decision-relevant KPIs "
        "(with units, period, and source line when available), and produce a CEO-ready summary. "
        "When a figure is missing or ambiguous, use web_search to verify it against public sources."
    ),
    backstory=(
        "You have 10+ years auditing 10-Ks, earnings releases, and investor decks at a Big Four firm. "
        "You work linearly, cite page/section for every metric, and never invent numbers — "
        "if a value isn't in the text, you search for it or mark it as 'not disclosed'."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/528b2693-3b24-4119-b88e-3eda4d1d9141.png" alt="528b2693-3b24-4119-b88e-3eda4d1d9141" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, we defined an agent that acts as an analyst. This analyst will analyze the report that's generated. It will also have access to DuckDuckGo.</p>
<pre><code class="language-python">task_1 = Task(
    description=(
        "Analyze the following document for KPI metrics.\n\n"
        "DOCUMENT:\n"
        f"{doc_5_1}"
    ),
    agent=analyst,
    expected_output="A list of 5 key KPIs found in the text.",
)

task_2 = Task(
    description="Based on the KPIs extracted in the previous task, write a professional executive summary.",
    agent=analyst,
    expected_output="A 200-word summary suitable for a CEO.",
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/b5737a9c-ccc8-477c-b859-bf6de5a82f87.png" alt="b5737a9c-ccc8-477c-b859-bf6de5a82f87" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The analyst will only have two tasks: one is to find KPI metrics and the second is to write a report of the document. So, in this way we have sequential tasks performed by only one AI agent, and we're following the empirical guidelines of the Google paper.</p>
<pre><code class="language-python">sequential_crew = Crew(
    agents=[analyst],
    tasks=[task_1, task_2],
    process=Process.sequential
)

print("Running Case 1: Sequential...")
result_1 = sequential_crew.kickoff()
display(Markdown(f"### Case 1 Result:\n{result_1}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c1a24352-e8e3-4f49-a2d0-c7e0bd75d3db.png" alt="c1a24352-e8e3-4f49-a2d0-c7e0bd75d3db" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-plaintext">Dear CEO,

I am pleased to present a concise overview of Rodriguez, Figueroa and Sanchez and Sons Q3 2026 Earnings Report. Our company has demonstrated strong financial performance this quarter. We reported a significant increase in revenue, achieving $94 million, which represents a substantial 23% year-over-year growth. This growth is a testament to our effective business strategies and the increasing demand for our products or services.

Our net income for the quarter stands at $64 million, showcasing our ability to maintain robust profitability. The operating margin of 13% further highlights our efficient cost management and operational excellence. Customer satisfaction and engagement continue to be a priority, as evidenced by our growing base of 25,622 active customers.

In terms of liquidity, we have a solid cash position of $195 million, ensuring that we have the necessary resources to seize new opportunities and navigate any challenges that may arise. Our employee headcount of 1,991 reflects our commitment to talent acquisition and development.

In conclusion, this quarter's results underscore our strong market position and the successful execution of our business strategies. We remain optimistic about our future prospects and are committed to driving sustainable growth and shareholder value. Let's continue to build on this momentum in the coming quarters.

Best Regards, [Your Name]
</code></pre>
<p>Finally, we run the agent we created, and above is the summary it produced.</p>
<h4 id="heading-centralized-team-of-four-agents">Centralized Team of Four Agents</h4>
<p>Now we'll create a team of four agents so you can see how multiple agents work.</p>
<p>This team researches lithium market trends, carries out financial modeling, and generates a data-driven investment proposal.</p>
<p>A centralized team works here because each step feeds into the next. We start our research, then we study the research, and finally we make a recommendation.</p>
<p>Let's build the first one that will research the market:</p>
<pre><code class="language-python">researcher = Agent(
    role="Commodity Market Researcher (Battery Metals)",
    goal=(
        "Produce dated, sourced price data points for 2026 lithium carbonate and lithium hydroxide forecasts. "
        "Always pull from web_search; never guess. Return each data point as: value, unit, date, source URL."
    ),
    backstory=(
        "Ex-analyst at a commodities desk. You trust only primary sources (IEA, Benchmark Mineral Intelligence, "
        "Fastmarkets, company filings) and you flag any figure that lacks a verifiable source."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/6d204267-0a65-4b0a-b93a-844282724550.png" alt="6d204267-0a65-4b0a-b93a-844282724550" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The first agent we created will search the web for data related to lithium. For this task it will have access to DuckDuckGo.</p>
<p>Now we'll create an agent that knows and works in finance to model the data the researcher got.</p>
<pre><code class="language-python">finance_pro = Agent(
    role="Capex Financial Modeler",
    goal=(
        "Take the researcher's price data and run a 10-year NPV and IRR simulation at a 10% discount rate, "
        "stating all assumptions explicitly and returning a table plus a short narrative."
    ),
    backstory=(
        "You've built DCF models for gigafactory investments. You show your formulas, label base/bull/bear cases, "
        "and refuse to produce a number without stating the inputs behind it."
    ),
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/375e5943-3bd4-4c05-8ab1-4fcc10dab892.png" alt="375e5943-3bd4-4c05-8ab1-4fcc10dab892" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The finance agent will use the researcher's information and make simulations of it.</p>
<p>From there, we'll define another agent that will advise us on strategy based on the financial model:</p>
<pre><code class="language-python">strategy_advisor = Agent(
    role="Investment Strategy Advisor",
    goal=(
        "Synthesize the researcher's price data and the modeler's NPV/IRR results into a "
        "clear go/no-go recommendation, with the top 3 risks and the conditions under which "
        "the recommendation flips."
    ),
    backstory=(
        "Former MD at a project-finance fund. You translate models into decisions and always "
        "name the sensitivities that would change your call."
    ),
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/daf6b079-cb53-410b-a2cb-5d7d933a13f6.png" alt="daf6b079-cb53-410b-a2cb-5d7d933a13f6" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This way, we have one agent to do the research, another to do the modeling, and a final one to advise us on strategy.</p>
<pre><code class="language-python">centralized_crew = Crew(
    agents=[researcher, finance_pro, strategy_advisor],
    tasks=[
        Task(description="Research 2026 lithium price forecasts.", agent=researcher, expected_output="Price data points."),
        Task(description="Run an NPV simulation using prices.", agent=finance_pro, expected_output="Full NPV report."),
        Task(description="Issue a go/no-go recommendation based on the NPV report.", agent=strategy_advisor, expected_output="Go/no-go memo with top 3 risks."),
    ],
    process=Process.hierarchical,
    manager_llm=crew_llm
)

print("Running Case 2: Centralized (Hierarchical)...")
result_2 = centralized_crew.kickoff()
display(Markdown(f"### Case 2 Result:\n{result_2}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/90723254-2519-4187-a208-d014c7b20b66.png" alt="90723254-2519-4187-a208-d014c7b20b66" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now we add the fourth agent: setting <code>manager_llm</code> makes CrewAI auto-spawn a manager that coordinates and reviews the other agents' work.</p>
<p>Then, we run the three agents together.</p>
<h4 id="heading-decentralized-team-of-three-agents">Decentralized Team of Three Agents</h4>
<p>Now we'll create a decentralized team of three agents. Once again, the first step is to create the data.</p>
<p>A decentralized model fits here because the auditors review the same data from different angles. Also, the auditors cross-reference findings.</p>
<pre><code class="language-python">groups = ["Group A (men)", "Group B (women)", "Group C (under-40)", "Group D (over-40)"]
hiring_stats = "\n".join(
    f"{g}: {fake.random_int(40, 120)} applicants, {fake.random_int(5, 25)} hired"
    for g in groups
)
feedback = "\n".join(
    f'- Candidate {fake.name()}: "{fake.sentence(nb_words=12)}"'
    for _ in range(6)
)
doc_5_3 = f"""Q1 2026 Hiring Audit Data — {fake.company()}
APPLICANT POOL &amp; SELECTION RATES
{hiring_stats}
INTERVIEWER FEEDBACK NOTES (sample)
{feedback}
"""
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5ff84edc-306e-460b-bb3a-181254cbab79.png" alt="5ff84edc-306e-460b-bb3a-181254cbab79" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>We also defined a general template to generate the fake data.</p>
<pre><code class="language-python">print(doc_5_3)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d68ddc9a-15c6-4f0f-aa12-ecdf08e6c7d0.png" alt="d68ddc9a-15c6-4f0f-aa12-ecdf08e6c7d0" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-plaintext">Q1 2026 Hiring Audit Data — Zimmerman Inc
APPLICANT POOL &amp; SELECTION RATES
Group A (men): 81 applicants, 6 hired
Group B (women): 69 applicants, 6 hired
Group C (under-40): 80 applicants, 17 hired
Group D (over-40): 74 applicants, 7 hired
INTERVIEWER FEEDBACK NOTES (sample)
- Candidate Tommy Walter: "Defense material those poor central cause seat much section investment on gun."
- Candidate Brenda Snyder PhD: "Check civil quite others his other life edge."
- Candidate Terri Frazier: "Race Mr environment political born itself law west."
- Candidate Deborah Mason: "Medical blood personal success medical current hear claim well."
- Candidate Tamara George: "Affect upon these story film around there water beat magazine attorney set she campaign."
- Candidate Joshua Baker: "Institution deep much role cut find yet practice just military building different full open discover detail."
</code></pre>
<p>Above is the fake data we generated.</p>
<p>Now, we'll create three auditors.</p>
<p>The first auditor focuses on selection rates across the demographic groups in the hiring data.</p>
<pre><code class="language-python">auditor_a = Agent(
    role="Statistical Hiring Auditor",
    goal=(
        "Compute selection-rate ratios across demographic groups for the Q1 hiring batch, "
        "apply the 4/5ths rule, and flag any group where the ratio falls below 0.80. "
        "Use web_search only to confirm regulatory definitions."
    ),
    backstory=(
        "Former EEOC compliance analyst. You are rigorously numerical, cite the Uniform "
        "Guidelines on Employee Selection Procedures, and never draw qualitative conclusions "
        "outside your lane."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/bd05e48c-156e-4f34-aaa7-6ded4e460a46.png" alt="bd05e48c-156e-4f34-aaa7-6ded4e460a46" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Then we'll define the second auditor for recruitment processing. This one seeks to find bias in the way interviews are conducted.</p>
<pre><code class="language-python">auditor_b = Agent(
    role="Qualitative Bias Reviewer",
    goal=(
        "Read interview notes and written feedback for coded language, inconsistent rubric "
        "application, and sentiment skew across candidate groups. Combine your findings with "
        "the statistical auditor's numbers into one final report."
    ),
    backstory=(
        "I/O psychologist with a focus on structured-interview research. You cite specific "
        "phrases as evidence and distinguish 'concerning pattern' from 'isolated incident'."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=False,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/bcb01353-cab0-4fa1-8ca5-22aacc8ed88e.png" alt="bcb01353-cab0-4fa1-8ca5-22aacc8ed88e" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Finally, we create a third auditor that will focus on whether the various hiring policies are being followed.</p>
<pre><code class="language-python">auditor_c = Agent(
    role="Process &amp; Policy Compliance Auditor",
    goal=(
        "Review the hiring process for adherence to documented policy: structured-interview "
        "use, rubric consistency, and required approval steps. Cross-check the statistical "
        "and qualitative findings to surface root-cause process gaps."
    ),
    backstory=(
        "Internal audit lead with an HR-ops background. You map findings to specific policy "
        "clauses and recommend concrete process fixes."
    ),
    tools=[web_search],
    llm=crew_llm,
    verbose=True,
    allow_delegation=True,
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/d1be79dd-7346-4d6a-b794-672050a97aa4.png" alt="d1be79dd-7346-4d6a-b794-672050a97aa4" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Note that only the final auditor is initialized with <code>allow_delegation=True</code>. This lets it delegate questions back to the other two auditors and cross-check their findings; for a fully peer-to-peer setup, you would enable delegation on every auditor.</p>
<p>Then we give each auditor a task.</p>
<pre><code class="language-python">task_audit_stats = Task(
    description=(
        "Audit the Q1 hiring batch for structural bias. "
        "Compute selection rates per group and flag any disparities.\n\n"
        "DATA:\n"
        f"{doc_5_3}"
    ),
    agent=auditor_a,
    expected_output="A report highlighting any group disparities found.",
)

task_audit_review = Task(
    description=(
        "Review the findings of the Statistical Auditor and add qualitative "
        "context from the interviewer notes in the original document."
    ),
    agent=auditor_b,
    expected_output="A final combined audit report with numbers and narrative.",
)

task_audit_process = Task(
    description=(
        "Using the statistical and qualitative findings above, identify process-level root "
        "causes (e.g. unstructured interviews, missing rubrics, approval gaps) and propose fixes."
    ),
    agent=auditor_c,
    expected_output="A process-gap list with policy references and recommended fixes.",
)
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/5af5e0b0-14d7-4a5b-a274-a0df4b7012cb.png" alt="5af5e0b0-14d7-4a5b-a274-a0df4b7012cb" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Finally, we assemble the auditor team:</p>
<pre><code class="language-python">decentralized_crew = Crew(
    agents=[auditor_a, auditor_b, auditor_c],
    tasks=[task_audit_stats, task_audit_review, task_audit_process],
    process=Process.sequential,
)

print("Running Case 3: Decentralized (Peer Review)...")
result_3 = decentralized_crew.kickoff()
display(Markdown(f"### Case 3 Result:\n{result_3}"))
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/66b61567c4f938d4d78aca50/c9cfff42-eb86-4f57-9840-7f85cc83768a.png" alt="c9cfff42-eb86-4f57-9840-7f85cc83768a" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-plaintext">
Case 3 Result:
Combined Audit Report: Q1 Hiring Batch Audit for Structural Bias
Statistical Audit Findings:

    Applicant Pool and Selection Rates:
        Group A (men): 81 applicants, 6 hired
            Selection Rate: 6/81 = 0.074074 (7.41%)
        Group B (women): 69 applicants, 6 hired
            Selection Rate: 6/69 = 0.08696 (8.70%)
        Group C (under-40): 80 applicants, 17 hired
            Selection Rate: 17/80 = 0.2125 (21.25%)
        Group D (over-40): 74 applicants, 7 hired
            Selection Rate: 7/74 = 0.094595 (9.46%)

    Selection Rate Ratios:
        Group A / Group B: 0.074074 / 0.08696 = 0.85 (85%)
        Group C / Group D: 0.2125 / 0.094595 = 2.24 (224%)

    Application of the 4/5ths Rule:
        Group A (men) vs Group B (women): The selection rate ratio is 0.85, which is above the 0.80 threshold.
        Group C (under-40) vs Group D (over-40): The selection rate ratio is 2.24, which is above the 0.80 threshold.

    Conclusion: Based on the selection rate analysis, no group disparities are flagged as falling below the 0.80 threshold according to the 4/5ths rule.

Qualitative Audit Findings:
Group A (men) vs Group B (women):

    Concerning Patterns:
        Feedback Inconsistency:
            Isolated Incident: "Candidate lacked experience but showed strong potential."
                This feedback was given to a female candidate but not to similarly situated male candidates.
        Sentiment Skew:
            Concerning Pattern: More frequently in female candidate assessments the phrases "needs improvement in leadership skills" and "less assertive" were observed.

Group C (under-40) vs Group D (over-40):

    Concerning Patterns:
        Feedback Inconsistency:
            Concerning Pattern: Phrases like "strong strategic thinker" and "in-depth industry knowledge" frequently used to describe over-40 candidates.
                Similar competence indicators were not noted in feedback for candidates under 40.
        Sentiment Skew:
            Isolated Incident: For a few under-40 candidates, feedback noted "lacks experience in leading teams."
                This sentiment was not applied to under-40 candidates with similar profiles but differed in gender.

Additional Notes:

    Rubric Application:
        Concerning Pattern: The rubric application was inconsistent when evaluating "leadership skills" and "assertiveness" especially between male and female candidates.
        Isolated Incident: Some reviewers emphasized "cultural fit" for female candidates which was not a requirement and was not consistently applied.

Final Conclusion:

Based on the selection rate analysis, no group disparities are flagged as falling below the 0.80 threshold according to the 4/5ths rule. However, qualitative findings indicate potential biases in feedback and rubric application which could influence hiring decisions. Recommendations:

    Standardize evaluation criteria and implement unbiased language in evaluations.
    Conduct further training to ensure consistent understanding and application of rubric standards across all reviewers.
    Monitor the impact of these interventions in future hiring cycles to ensure equitable selection practices.
</code></pre>
<p>Above, you can see the report from the three auditors about the hiring process.</p>
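<p>This report is also a good reminder of why evals matter: you can reproduce the statistical auditor's 4/5ths-rule arithmetic deterministically in plain Python and check it. Below is a minimal sketch using the group counts from the generated document above (it isn't part of the crew, just a verification script). Interestingly, applying the rule in its conventional direction (lower selection rate divided by higher) flags the over-40 group, which the agent's own comparison glossed over — exactly the kind of discrepancy your evals should catch.</p>
<pre><code class="language-python"># Minimal 4/5ths-rule check on the generated hiring data (counts from doc_5_3 above).
groups = {
    "Group A (men)": (81, 6),
    "Group B (women)": (69, 6),
    "Group C (under-40)": (80, 17),
    "Group D (over-40)": (74, 7),
}

# Selection rate = hired / applicants for each group.
rates = {g: hired / applicants for g, (applicants, hired) in groups.items()}

def four_fifths_ratio(focal: str, reference: str) -> float:
    """Ratio of the focal group's selection rate to the reference group's.
    A ratio below 0.80 flags potential adverse impact under the 4/5ths rule."""
    return rates[focal] / rates[reference]

for focal, ref in [("Group A (men)", "Group B (women)"),
                   ("Group D (over-40)", "Group C (under-40)")]:
    ratio = four_fifths_ratio(focal, ref)
    flag = "FLAG" if ratio &lt; 0.80 else "ok"
    print(f"{focal} vs {ref}: {ratio:.2f} ({flag})")
</code></pre>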
<h2 id="heading-conclusion-the-future-of-ai-is-evals">Conclusion: The Future of AI is Evals</h2>
<p>If you remember one thing from this article, let it be this: <strong>The organizations that win with AI agents are not the ones with the most agents. They are the ones with the best evals.</strong></p>
<p>The Google paper gave us simple rules for picking agent architectures. Those rules are very useful, and I've laid them out in the form of an algorithm.</p>
<p>But those rules were derived from benchmarks, not an organization's data. For that reason, you have to build your own evals. Nobody knows what "correct" looks like in your domain except you.</p>
<p>This is the same point made by Sam Bhagwat in <a href="https://mastra.ai/blog/principles-of-ai-engineering">Principles of Building AI Agents</a>, which I'd recommend to anyone shipping agents.</p>
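<p>What does "build your own evals" look like in practice? Here is a deliberately tiny, framework-free sketch: a handful of deterministic checks run against an agent's output string before you trust it. The specific check names and thresholds are my own illustrative assumptions, not from the paper — but notice that the Case 1 summary above literally signed off with "[Your Name]", which even a check this cheap would catch.</p>
<pre><code class="language-python"># A minimal, framework-free eval harness: deterministic checks on agent output.
# The checks and thresholds here are illustrative assumptions, not from the paper.
import re

def eval_kpi_summary(output: str) -&gt; dict:
    """Score an executive-summary string on a few cheap, deterministic checks."""
    checks = {
        # Did the agent include any concrete figures at all?
        "has_figures": bool(re.search(r"\$?\d[\d,.]*", output)),
        # Is it roughly the requested length (~200 words)?
        "length_ok": 100 &lt;= len(output.split()) &lt;= 300,
        # Does it avoid leaking template placeholders?
        "no_placeholders": "[Your Name]" not in output,
    }
    checks["pass"] = all(checks.values())
    return checks

result = eval_kpi_summary(
    "Revenue reached $94 million, up 23% year-over-year. " * 20
)
print(result)
</code></pre>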
<p>So here's the playbook again:</p>
<ol>
<li><p><strong>Check your budget first:</strong> Tokens cost money. Know what you can spend per task.</p>
</li>
<li><p><strong>Always start with one agent:</strong> If it solves the task &gt;45% of the time, ship it. Don't add agents.</p>
</li>
<li><p><strong>Only build a team if the task is naturally parallel:</strong> Sequential tasks get worse with a team.</p>
</li>
<li><p><strong>Match topology to task:</strong> Analysis favors a centralized team, open-ended web research favors a decentralized team, and sequential work is best left to a single agent.</p>
</li>
<li><p><strong>Cap teams at 3–4 agents and no more than 3 tools per agent:</strong> As in real life, smaller teams are more agile and make fewer mistakes.</p>
</li>
<li><p><strong>Put a supervisor on any parallel setup:</strong> According to the study, unchecked swarms amplify errors ~17×. Supervised ones ~4×.</p>
</li>
<li><p><strong>Build evals before you scale:</strong> Synthetic tests, historical back-tests, LLM-as-judge with human calibration.</p>
</li>
</ol>
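<p>The seven points above collapse neatly into a small decision function. This is my own encoding of the playbook as a sketch — the thresholds come from the points above, but the function itself is illustrative, not from the paper:</p>
<pre><code class="language-python">def choose_architecture(budget_ok: bool,
                        single_agent_success_rate: float,
                        task_is_parallel: bool,
                        task_type: str) -&gt; str:
    """Encode the playbook. task_type: 'analysis', 'web_research', or 'sequential'."""
    if not budget_ok:
        return "reduce scope: tokens cost money"
    # Rule 2: a single agent that clears ~45% success is good enough to ship.
    if single_agent_success_rate &gt; 0.45:
        return "single agent"
    # Rule 3: only naturally parallel tasks justify a team.
    if not task_is_parallel or task_type == "sequential":
        return "single agent (sequential tasks get worse with a team)"
    # Rules 4-6: match topology, cap at 3-4 agents, always supervised.
    if task_type == "analysis":
        return "centralized team of 3-4 agents with a supervisor"
    if task_type == "web_research":
        return "decentralized team of 3-4 agents with a supervisor"
    return "single agent (default when unsure)"

print(choose_architecture(True, 0.30, True, "web_research"))
</code></pre>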
<p>And keep humans in the loop for high-stakes decisions.</p>
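<p>To see why the supervision point matters, here's a back-of-the-envelope compounding calculation. The 2% per-step error rate and 8 handoffs are illustrative assumptions; only the ~17× and ~4× amplification figures come from the study cited above:</p>
<pre><code class="language-python"># Back-of-the-envelope: how a small per-step error rate compounds across handoffs.
# The 2% rate and 8 handoffs are illustrative assumptions, not figures from the paper.
base_error = 0.02   # assumed error rate of a single agent on a single step
handoffs = 8        # assumed number of agent-to-agent handoffs

# Probability that at least one step in the chain goes wrong:
p_any_error = 1 - (1 - base_error) ** handoffs
print(f"Chance of at least one error across {handoffs} handoffs: {p_any_error:.1%}")

# Applying the study's amplification factors to the single-agent error rate:
print(f"Unchecked swarm (~17x): {min(1.0, base_error * 17):.1%}")
print(f"Supervised team (~4x):  {min(1.0, base_error * 4):.1%}")
</code></pre>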
<p>Once again, agents are like interns: whether they produce great work or burn down the organization depends on how well you organize and check their work.</p>
<p>You can find the <a href="https://github.com/tiagomonteiro0715/How-to-Build-Optimal-AI-Agents-That-Actually-Work-Handbook">code on GitHub here</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Math Behind Artificial Intelligence: A Guide to AI Foundations [Full Book] ]]>
                </title>
                <description>
                    <![CDATA[ "To understand is to perceive patterns." - Isaiah Berlin This is not a math book filled with complex formulas, theorems, and concepts that are hard to grasp. Instead, it’s a detailed guide where we’l ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-math-behind-artificial-intelligence-book/</link>
                <guid isPermaLink="false">695d974f512957bf332d653a</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Mathematics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ book ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 06 Jan 2026 23:14:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767723634484/4748bd8a-26a1-4d9c-89c3-1a6d07bde69e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <blockquote>
<p>"To understand is to perceive patterns." - Isaiah Berlin</p>
</blockquote>
<p>This is <strong>not</strong> a math book filled with complex formulas, theorems, and concepts that are hard to grasp.</p>
<p>Instead, it’s a detailed guide where we’ll break complex ideas down into simpler terms.</p>
<p>Even if you only have a general understanding of algebra, you should be able to easily follow along.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ol>
<li><p><a href="#heading-chapter-1-background-on-this-book">Chapter 1: Background on this Book</a></p>
<ul>
<li><p><a href="#heading-the-objective-here">The Objective Here</a></p>
</li>
<li><p><a href="#heading-why-is-this-book-about-ai-different">Why is This Book About AI Different?</a></p>
</li>
<li><p><a href="#heading-let-me-introduce-myself">Let Me Introduce Myself</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-2-the-architecture-of-mathematics">Chapter 2: The Architecture of Mathematics</a></p>
<ul>
<li><p><a href="#heading-the-tree-of-mathematics-how-everything-connects">The Tree of Mathematics: How Everything Connects</a></p>
</li>
<li><p><a href="#heading-a-quick-history-of-mathematics-from-counting-to-infinity">A Quick History of Mathematics: From Counting to Infinity</a></p>
</li>
<li><p><a href="#heading-foundations-of-relativity-how-einstein-used-math-to-understand-space-and-time">Foundations of Relativity: How Einstein Used Math to Understand Space and Time</a></p>
</li>
<li><p><a href="#heading-godels-biggest-paradox-can-math-explain-itself">Gödel’s Biggest Paradox: Can Math Explain Itself?</a></p>
</li>
<li><p><a href="#heading-what-about-applied-math-and-engineering">What About Applied Math and Engineering?</a></p>
</li>
<li><p><a href="#heading-code-examples-analytical-and-numerical-approaches">Code Examples: Analytical and Numerical Approaches</a></p>
</li>
<li><p><a href="#heading-the-impact-of-a-grand-unified-theory-of-mathematics">The Impact of a Grand Unified Theory of Mathematics</a></p>
</li>
<li><p><a href="#heading-a-final-lesson-from-history">A Final Lesson From History</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-3-the-field-of-artificial-intelligence">Chapter 3: The Field of Artificial Intelligence</a></p>
<ul>
<li><p><a href="#heading-what-is-artificial-intelligence">What is Artificial Intelligence?</a></p>
</li>
<li><p><a href="#heading-symbolic-vs-non-symbolic-ai-whats-the-difference">Symbolic vs. Non-symbolic AI: What’s the Difference?</a></p>
</li>
<li><p><a href="#heading-before-ai-control-theory-as-the-first-ai">Before AI: Control Theory as the “First AI”</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-4-linear-algebra-the-geometry-of-data">Chapter 4: Linear Algebra - The Geometry of Data</a></p>
<ul>
<li><p><a href="#heading-what-are-matrices-and-why-do-they-simplify-equations">What Are Matrices and Why Do They Simplify Equations?</a></p>
</li>
<li><p><a href="#heading-vectors-and-transformations-moving-in-multiple-directions">Vectors and Transformations: Moving in Multiple Directions</a></p>
</li>
<li><p><a href="#heading-linear-independence-dependence-and-rank-why-it-matters">Linear Independence, Dependence, and Rank: Why It Matters</a></p>
</li>
<li><p><a href="#heading-determinants-measuring-space-and-scaling">Determinants: Measuring Space and Scaling</a></p>
</li>
<li><p><a href="#heading-what-are-mathematical-spaces-and-how-do-they-simplify-calculations">What Are Mathematical Spaces and How Do They Simplify Calculations?</a></p>
</li>
<li><p><a href="#heading-eigenvalues-and-eigenvectors-unlocking-hidden-patterns">Eigenvalues and Eigenvectors: Unlocking Hidden Patterns</a></p>
</li>
<li><p><a href="#heading-applications-of-linear-algebra-in-ai-and-control-theory">Applications of Linear Algebra in AI and Control Theory</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-5-multivariable-calculus-change-in-many-directions">Chapter 5: Multivariable Calculus - Change in Many Directions</a></p>
<ul>
<li><p><a href="#heading-limits-and-continuity-understanding-smooth-change">Limits and Continuity: Understanding Smooth Change</a></p>
</li>
<li><p><a href="#heading-why-are-limits-important-to-understand-derivatives-and-integrals">Why are limits important to understand derivatives and integrals?</a></p>
</li>
<li><p><a href="#heading-derivatives-how-things-change-and-how-fast">Derivatives: How Things Change and How Fast</a></p>
</li>
<li><p><a href="#heading-what-about-integral-calculus">What About Integral Calculus?</a></p>
</li>
<li><p><a href="#heading-applications-in-ai-and-control-theory-calculus-in-action">Applications in AI and Control Theory: Calculus in Action</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-6-probability-amp-statistics-learning-from-uncertainty">Chapter 6: Probability &amp; Statistics - Learning from Uncertainty</a></p>
<ul>
<li><p><a href="#heading-mean-median-mode-measuring-central-tendency">Mean, Median, Mode: Measuring Central Tendency</a></p>
</li>
<li><p><a href="#heading-variance-and-standard-deviation-measuring-spread">Variance and Standard Deviation: Measuring Spread</a></p>
</li>
<li><p><a href="#heading-what-is-the-normal-distribution-the-bell-curve-of-life">What Is the Normal Distribution? The Bell Curve of Life</a></p>
</li>
<li><p><a href="#heading-how-the-central-limit-theorem-helps-approximate-the-world">How the Central Limit Theorem Helps Approximate the World</a></p>
</li>
<li><p><a href="#heading-bayes-theorem-learning-from-evidence">Bayes Theorem: Learning from Evidence</a></p>
</li>
<li><p><a href="#heading-what-are-markov-models-predicting-the-next-step-one-step-at-a-time">What Are Markov Models? Predicting the Next Step, One Step at a Time</a></p>
</li>
<li><p><a href="#heading-applications-in-ai-and-control-theory-making-decisions-under-uncertainty">Applications in AI and Control Theory: Making Decisions Under Uncertainty</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-chapter-7-optimization-theory-teaching-machines-to-improve">Chapter 7: Optimization Theory - Teaching Machines to Improve</a></p>
<ul>
<li><p><a href="#heading-what-is-optimization-theory">What is Optimization Theory?</a></p>
</li>
<li><p><a href="#heading-why-optimization-drives-learning-in-ai">Why Optimization Drives Learning in AI</a></p>
</li>
<li><p><a href="#heading-simple-optimization-techniques-how-machines-learn-step-by-step">Simple Optimization Techniques: How Machines Learn Step by Step</a></p>
</li>
<li><p><a href="#heading-what-is-adam-the-most-popular-way-ai-models-finds-the-best-learning-path">What is Adam? The Most Popular Way AI Models Finds the Best Learning Path</a></p>
</li>
<li><p><a href="#heading-applications-in-ai-and-control-theory-of-optimization-theory">Applications in AI and Control Theory of Optimization Theory</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion-where-mathematics-and-ai-meet">Conclusion: Where Mathematics and AI Meet</a></p>
<ul>
<li><p><a href="#heading-mathematics-is-the-foundation-of-ai">Mathematics is the Foundation of AI</a></p>
</li>
<li><p><a href="#heading-the-future-on-device-ai-and-the-democratization-of-ai">The Future: On Device AI and the Democratization of AI</a></p>
</li>
<li><p><a href="#heading-final-reflections">Final Reflections</a></p>
</li>
<li><p><a href="#heading-acknowledgements">Acknowledgements</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-about-the-author">About the Author</a></p>
</li>
</ol>
<h2 id="heading-chapter-1-background-on-this-book">Chapter 1: Background on this Book</h2>
<h3 id="heading-the-objective-here">The Objective Here</h3>
<p>My objective in this book is simple: Explain the key mathematical ideas you need to grasp in order to deeply understand AI and train machine learning models.</p>
<p>So you might be wondering: Why is it important to have a good math foundation before creating these models?</p>
<p>Well, there are many reasons, but some are:</p>
<ul>
<li><p>It gives you the capacity to understand new AI research on your own.</p>
</li>
<li><p>You can use this same foundation to study other STEM concepts like signal theory and advanced statistical methods.</p>
</li>
<li><p>It helps you understand that AI models are just a mixture of different math ideas working together and gives you insight into how new innovations make LLMs more efficient.</p>
</li>
<li><p>It gives you a foundation so you know how to calibrate AI models and even create derivative models.</p>
</li>
</ul>
<p>These skills are also important for startup founders, especially in Silicon Valley. Many startups begin with APIs or API wrappers but eventually need their own AI solutions.</p>
<p>Outsourcing all AI isn't ideal. This book will help you understand AI foundations so you can design better growth strategies and communicate effectively with investors – especially those who were successful technical co-founders.</p>
<h3 id="heading-why-is-this-book-about-ai-different">Why is This Book About AI Different?</h3>
<p>In this book, we’ll look at AI from an engineering perspective. This differs from the typical computer science approach to AI that most introductory courses take.</p>
<p>In doing so, I won’t spend a lot of time explaining formulas and theorems. Instead, I’ll explain their importance, how and why they are applied the way they are.</p>
<p>In this way, I hope to offer a unique viewpoint that emphasizes the engineering principles and good practices that underlie all modern AI technologies.</p>
<p>I will also explain how many of these strange math ideas make billion dollar industries possible.</p>
<p>We’ll start with the fundamentals: the structure of the areas of mathematics and AI. After that, we’ll look at the four subareas of math that make AI possible:</p>
<ul>
<li><p>Linear Algebra</p>
</li>
<li><p>Calculus</p>
</li>
<li><p>Probability Theory and Statistics</p>
</li>
<li><p>Optimization Theory</p>
</li>
</ul>
<p>After going through all the math, we’ll connect it with the foundation of ChatGPT and all of these large language models.</p>
<p>This way, you’ll get a basic foundation in key math concepts that, when mixed together like the ingredients of a cake, make all AI models possible.</p>
<p>By knowing where the ideas come from, you’ll develop a system-level understanding of AI and a first-principles approach.</p>
<p>So just keep in mind that, even though concepts like integral calculus and eigenvalues/eigenvectors might not be widely used in AI, they’ll help you develop these system-level and first-principle approaches.</p>
<p>Also, this book will be a work in progress. After its first release, I’ll seek feedback on things I need to perfect, chapters to add, and so on.</p>
<p>Here is my email for any feedback you might have: <a href="mailto:monteiro.t@northeastern.edu">monteiro.t@northeastern.edu</a></p>
<p>And here is the book’s GitHub repository with all code: <a href="https://github.com/tiagomonteiro0715/The-Math-Behind-Artificial-Intelligence-A-Guide-to-AI-Foundations">https://github.com/tiagomonteiro0715/The-Math-Behind-Artificial-Intelligence-A-Guide-to-AI-Foundations</a></p>
<h3 id="heading-let-me-introduce-myself">Let Me Introduce Myself</h3>
<p>My name is Tiago Monteiro. I’m an electrical and computer engineer and an AI master's degree student at Northeastern University's Silicon Valley campus. I’ve authored 20+ articles here on freeCodeCamp, with 240K+ views, on math, AI, and tech.</p>
<p>If you’d like to know more about my background, I’ll share that at the end of the book.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>In terms of minimum requirements, you only need to know the basics of mathematics and programming:</p>
<ul>
<li><p>Basic algebra, plus a sense of what functions and the coordinate system are.</p>
</li>
<li><p>You should be able to read Python code and understand things like variables, functions, and loops.</p>
</li>
</ul>
<h2 id="heading-chapter-2-the-architecture-of-mathematics">Chapter 2: The Architecture of Mathematics</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766099739986/049ff3c0-0150-495e-97e9-4f16f3861058.png" alt="Cover of the chapter the architecture of mathematics" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Math is more than numbers. It’s the science of locating complex patterns that shape our world. To truly understand math, we must look beyond numbers and formulas to grasp its structures.</p>
<p>This chapter aims to show math as a growing tree of ideas, a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math deeply and how to apply it to programming.</p>
<p>I’ve included code examples to connect theory and practice, showing how math ideas apply to real problems. Whether you're new to advanced math or are more experienced, these examples will help you apply math in programming.</p>
<p>This way, before we start going over the different math pillars that sustain AI, you will understand the structure of the field.</p>
<h3 id="heading-the-tree-of-mathematics-how-everything-connects">The Tree of Mathematics: How Everything Connects</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765001557970/7ac6c8c8-d0fd-4a67-be6a-6d8b9a1a6615.jpeg" alt="Seeing a tree from its root to a tree" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/bottom-view-of-green-leaved-tree-during-daytime-91153/">Lerkrat Tangsri</a></p>
<p>Imagine math as a vast, ever-growing tree.</p>
<p>The roots are the foundations: logic and set theory. From these roots, the main fields emerge: arithmetic, algebra, geometry, and analysis.</p>
<p>As the tree branches out, new subfields like topology and abstract algebra appear. Sometimes branches connect with each other.</p>
<p>This tree keeps growing in many directions. History shows that sometimes it grows rapidly due to scientific discoveries, while at other times, growth is slow.</p>
<p>And you might wonder: How many more branches and connections between them will keep appearing?</p>
<h3 id="heading-a-quick-history-of-mathematics-from-counting-to-infinity">A Quick History of Mathematics: From Counting to Infinity</h3>
<p>The first mathematical ideas emerged independently in different ancient civilizations. Consider, for example:</p>
<ul>
<li><p>India's invention of zero</p>
</li>
<li><p>Islamic algebraic advances</p>
</li>
<li><p>Greek geometric rigor</p>
</li>
</ul>
<p>Great mathematicians developed and shared these ideas through writing and lectures. Over time, new generations built on these ideas, creating new branches of mathematics. This endless growth is why Isaac Newton wrote to Robert Hooke in 1675:</p>
<blockquote>
<p>“If I have seen further, it is by standing on the shoulders of giants.”</p>
</blockquote>
<p>He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.</p>
<p>Yet, the real power of math lies in practicing it over and over and studying it more and more deeply.</p>
<p>As one of my professors once pointed out:</p>
<blockquote>
<p><em>“More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.”</em></p>
</blockquote>
<p>To solve problems, it's often necessary to think from first principles, and math teaches this. Math is not just an academic topic. It’s a global language for scientists and engineers.</p>
<p>By preserving and sharing it, new math can grow from old ideas, allowing the tree to keep expanding.</p>
<h3 id="heading-foundations-of-relativity-how-einstein-used-math-to-understand-space-and-time">Foundations of Relativity: How Einstein Used Math to Understand Space and Time</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766903578928/a4102586-cb63-4410-8793-72950145726d.jpeg" alt="A satellite in space" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/gray-and-white-satellite-41006/">Pixabay</a></p>
<p>Albert Einstein developed the general and special theories of relativity, which impact:</p>
<ul>
<li><p>GPS and global communication</p>
</li>
<li><p>Satellite telecommunications</p>
</li>
<li><p>Space exploration and satellite launches</p>
</li>
</ul>
<p>And more.</p>
<p>But this was only possible by combining geometry with calculus, known as <strong>differential geometry.</strong> This field evolved over centuries, thanks to many great mathematicians. Here are a few of them, though the list is not exhaustive:</p>
<ul>
<li><p><strong>Euclid (circa 300 BCE):</strong> Contributed to geometry, laying the groundwork for later mathematical systems</p>
</li>
<li><p><strong>Archimedes (circa 287–212 BCE):</strong> Pioneered the understanding of volume, surface area, and the principles of mechanics</p>
</li>
<li><p><strong>René Descartes (1596–1650):</strong> Developed Cartesian coordinates and analytical geometry</p>
</li>
<li><p><strong>Isaac Newton (1642–1727) &amp; Gottfried Wilhelm Leibniz (1646–1716):</strong> Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.</p>
</li>
<li><p><strong>Leonhard Euler (1707–1783):</strong> Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.</p>
</li>
<li><p><strong>Gaspard Monge (1746–1818):</strong> The father of differential geometry and pioneer in descriptive geometry</p>
</li>
<li><p><strong>Carl Friedrich Gauss (1777–1855):</strong> Made groundbreaking advances in geometry, including the concept of curved surfaces.</p>
</li>
<li><p><strong>Bernhard Riemann (1826–1866):</strong> Introduced Riemannian geometry, a branch of differential geometry.</p>
</li>
</ul>
<p>Going back to Albert Einstein, he saw what no one else in his time saw, thanks to these great math giants and countless others.</p>
<h3 id="heading-godels-biggest-paradox-can-math-explain-itself">Gödel’s Biggest Paradox: Can Math Explain Itself?</h3>
<p>The biggest paradox in math, discovered by Kurt Gödel, is his incompleteness theorems. They show that in any consistent formal system capable of simple arithmetic, there are true statements that cannot be proven within the system.</p>
<p>This means there are limits to what can be proven as true or false. For mathematicians, this implies that some truths are beyond formal proofs, yet we assume they are true. It demonstrates that no matter how much effort or AI is used, some things remain unprovable, known only through approximations and non-exact methods.</p>
<h3 id="heading-what-about-applied-math-and-engineering">What About Applied Math and Engineering?</h3>
<p>Applied math and engineering involve adapting pure math ideas to real-world scenarios.</p>
<p>In fact, in many cases, a single applied tool is a combination of several math ideas.</p>
<p>Let’s consider some examples:</p>
<ul>
<li><p>In <strong>harmonic analysis</strong>, Laplace, Fourier, and Z-transforms are a way to see the same thing in a new domain to get new insights. In this case, integrals are used to make this mapping possible.</p>
</li>
<li><p><strong>Principal component analysis (PCA)</strong> is a widely used tool in data science. Yet it is a mixture of linear algebra (eigenvalues and eigenvectors) with optimization (ranking eigenvalues so that fewer dimensions capture most of the data) in order to reduce the size of datasets.</p>
</li>
<li><p>In <strong>machine learning</strong>, logistic regression is a mixture of calculus with statistics and probability.</p>
</li>
<li><p>In <strong>deep learning</strong>, neural networks are, at their core, many matrices that multiply and update themselves to model the dataset representing a system. This optimization of the matrix values happens through activation functions, a gradient descent-based optimization method (which tells how much each value needs to change), and backpropagation (which applies those changes to all the matrix values).</p>
</li>
</ul>
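<p>To make one of these “recipes” concrete, here’s a minimal PCA sketch using only NumPy: center the data, build the covariance matrix (linear algebra), then keep the eigenvectors with the largest eigenvalues (ranking directions by how much variance they explain). This is an illustrative sketch under simplifying assumptions, not a production implementation:</p>
<pre><code class="language-python">import numpy as np

def pca(data, n_components):
    """Minimal PCA: project data onto its top principal components."""
    # Center each feature (column) around zero
    centered = data - data.mean(axis=0)
    # Covariance matrix of the features (linear algebra)
    cov = np.cov(centered, rowvar=False)
    # Eigen-decomposition (eigh is for symmetric matrices like cov)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Rank directions by explained variance, keep the top ones
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Project the centered data onto the chosen directions
    return centered @ components

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))  # 100 samples, 3 features
reduced = pca(points, n_components=2)
print(reduced.shape)  # (100, 2)
</code></pre>
<p>Libraries like scikit-learn do the same thing more robustly (via singular value decomposition), but the core idea really is eigenvalues plus ranking.</p>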
<p>But the best example of this fusion of math in engineering is <a href="https://www.freecodecamp.org/news/basic-control-theory-with-python/">control theory</a>, the study of how to design systems that regulate their own behavior. From trains to cars to airplanes, everything is based on control theory. It’s everywhere, in nearly all modern electronic devices. In electric circuits, control theory is also used heavily to guarantee stability in the face of electrical disturbances.</p>
<p>So as you can probably start to see, many of the tools we now have are just a mixture of many pure math ideas – like different recipes. In essence, applied math is the application of pure math as “ingredients“ in "recipes" to solve problems.</p>
<p>So, we’ve explored the structure and evolution of mathematics. But it’s important to see how we can apply these ideas in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.</p>
<h3 id="heading-code-examples-analytical-and-numerical-approaches">Code Examples: Analytical and Numerical Approaches</h3>
<p>These code examples demonstrate a couple of ways you can use Python to solve math equations.</p>
<p>In the first code example, we’ll solve the problem in the same way that kids in school solve math exercises: essentially, by hand with a pencil. In the second example, we’ll solve the problem using numerical analysis.</p>
<h4 id="heading-example-1-solve-a-problem-analytically">Example 1: Solve a Problem Analytically</h4>
<p>In this problem, we need to find the values of the variables x and y. To do that, we’ll move terms from one side of each equation to the other until the variables are isolated.</p>
<p>When we solve math problems analytically, like we did in school, we are manipulating symbols to get exact values. Often these symbols are x, y, and z.</p>
<p>The code below solves a system of two equations with two unknown variables, x and y.</p>
<p>We will use the <a href="https://www.sympy.org">SymPy</a> Python library to do this. It’s mainly used for symbolic mathematics.</p>
<pre><code class="language-python">from sympy import symbols, Eq, solve

x, y = symbols('x y')
eq1 = Eq(2*x + 3*y, 6)
eq2 = Eq(-x + y, 1)

solution = solve((eq1, eq2), (x, y))
print(solution)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747160359386/7a21cddc-f4ba-4f9f-afa0-d1cc11fb27d6.png" alt="Image of the equations and analytical method in Python" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Once again, with this code we are finding the values of the variables x and y.</p>
<p>Essentially, we’re finding x and y from this system of equations:</p>
<p>$$\begin{align} 2x + 3y &amp;= 6 \\ -x + y &amp;= 1 \end{align}$$</p>
<p>Which gives us the following result:</p>
<pre><code class="language-python">{x: 3/5, y: 8/5}
</code></pre>
<p>Or:</p>
<ul>
<li><p>x= 0.6</p>
</li>
<li><p>y = 1.6</p>
</li>
</ul>
<p>When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.</p>
<p>But many problems are harder, and solving them means manipulating more and more symbols on each side of the equation. Sometimes there are so many symbols, and so many transformed versions of them (derivatives, integrals, and so on), that the process becomes very hard to manage and takes a lot of time.</p>
<p>For example, let’s look at this partial differential equation:</p>
<p>$$\begin{cases} \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}, &amp; 0 &lt; x &lt; L,\ t &gt; 0 \\ u(0,t) = 0, &amp; t &gt; 0 \\ u(L,t) = 0, &amp; t &gt; 0 \\ u(x,0) = f(x), &amp; 0 &lt; x &lt; L \end{cases}$$</p>
<p>It can be solved with an analytical method called separation of variables.</p>
<p>But it requires many steps, and it’s easy to make mistakes. Even engineers who learned this often struggle to remember the process later.</p>
<p>When I first encountered this type of math exercise in my electrical and computer engineering degree back in Portugal, it took me 20 to 30 minutes to solve it.</p>
<p>For this reason, there's a branch of mathematics called numerical analysis that focuses on finding approximations of existing formulas. It helps solve problems faster. This is the method we'll explore next.</p>
<h4 id="heading-example-2-solve-numerically-approximation">Example 2: Solve Numerically (Approximation)</h4>
<p>Now let’s solve a different problem: we’re going to find the values of each of the 5 variables:</p>
<p>$$\begin{bmatrix} 3 &amp; 2 &amp; -1 &amp; 4 &amp; 5 \\ 1 &amp; 1 &amp; 3 &amp; 2 &amp; -2 \\ 4 &amp; -1 &amp; 2 &amp; 1 &amp; 0 \\ 5 &amp; 3 &amp; -2 &amp; 1 &amp; 1 \\ 2 &amp; -3 &amp; 1 &amp; 3 &amp; 4 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 12 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}$$</p>
<p>Solving this by hand will take some time…but with Python code, it’s very fast.</p>
<p>We’ll also use the <a href="https://scipy.org">SciPy</a> Python library for this example.</p>
<p>Let’s solve the system numerically:</p>
<pre><code class="language-python">import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])

b = np.array([12, 5, 7, 9, 10])

solution = solve(A, b)

print(solution)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747160347486/d1f17aa6-b288-4e41-9be7-0810c45e778c.png" alt="Image of equations and numerical method" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Which corresponds to this operation:</p>
<p>$$\begin{bmatrix} 3 &amp; 2 &amp; -1 &amp; 4 &amp; 5 \\ 1 &amp; 1 &amp; 3 &amp; 2 &amp; -2 \\ 4 &amp; -1 &amp; 2 &amp; 1 &amp; 0 \\ 5 &amp; 3 &amp; -2 &amp; 1 &amp; 1 \\ 2 &amp; -3 &amp; 1 &amp; 3 &amp; 4 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 12 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}$$</p>
<p>Again, solving this by hand takes time, and it’s very easy to make a small mistake.</p>
<p>But in this code example, this line of code:</p>
<pre><code class="language-python">solution = solve(A, b)
</code></pre>
<p>Uses the <code>solve</code> method from SciPy:</p>
<pre><code class="language-python">from scipy.linalg import solve
</code></pre>
<p>It’s a function that finds the values of x in the equation A⋅x=b, where A is a square matrix (a grid of numbers) and b is a vector (a list of numbers). Running it gives us the following:</p>
<pre><code class="language-python">[ 1.35022026 -0.79955947 -1.17180617  3.14317181 -0.83920705]
</code></pre>
<p>Which corresponds to:</p>
<p>$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 1.35022026 \\ -0.79955947 \\ -1.17180617 \\ 3.14317181 \\ -0.83920705 \end{bmatrix}$$</p>
<p>And is the same thing as:</p>
<p>$$\begin{align} x_1 &amp;= 1.35022026 \\ x_2 &amp;= -0.79955947 \\ x_3 &amp;= -1.17180617 \\ x_4 &amp;= 3.14317181 \\ x_5 &amp;= -0.83920705 \end{align}$$</p>
<h4 id="heading-why-these-two-approaches-matter">Why These Two Approaches Matter</h4>
<p>We have solved two mathematical problems in two different ways:</p>
<ul>
<li><p>Analytical: Exact solutions through algebraic manipulation</p>
</li>
<li><p>Numerical: Approximate solutions using algorithms</p>
</li>
</ul>
<p>In engineering and in AI, we are constantly choosing between these approaches.</p>
<p>When training AI models with millions of parameters, analytical solutions are impossible. This is why, in these cases, we need numerical approaches.</p>
<p>When proving math theorems, we need analytical precision to be sure the result is exactly right.</p>
<p>This is one of the many things an engineering degree teaches you: often, in the real world, it’s better to just write some code to solve a problem than to actually solve it by hand with math. Other times, the best solution is to just think in first principles and from there create new theorems to solve a problem.</p>
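<p>As a small illustration of this trade-off, here’s a sketch that solves the same kind of 2x2 system as Example 1 in both styles: exactly, with Python’s built-in <code>Fraction</code> type (the analytical spirit: exact rational arithmetic), and approximately, with floating-point numbers (the numerical spirit). The helper name <code>solve_2x2</code> is made up for this example:</p>
<pre><code class="language-python">from fractions import Fraction

def solve_2x2(a, b, c, d, e, f, exact=True):
    """Solve a*x + b*y = e and c*x + d*y = f using Cramer's rule.

    exact=True uses rational arithmetic (analytical spirit);
    exact=False uses floating point (numerical spirit).
    """
    num = Fraction if exact else float
    a, b, c, d, e, f = map(num, (a, b, c, d, e, f))
    det = a * d - b * c  # determinant of the coefficient matrix
    x = (e * d - b * f) / det
    y = (a * f - e * c) / det
    return x, y

# The system from Example 1: 2x + 3y = 6 and -x + y = 1
print(solve_2x2(2, 3, -1, 1, 6, 1, exact=True))   # (Fraction(3, 5), Fraction(8, 5))
print(solve_2x2(2, 3, -1, 1, 6, 1, exact=False))  # (0.6, 1.6)
</code></pre>
<p>For a small system both answers agree, but the exact version returns 3/5 as a true fraction, while the float version returns a binary approximation of 0.6 – the same distinction that separates SymPy from SciPy in the examples above.</p>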
<p>Now let's step out of the code examples and see how different branches of mathematics connect.</p>
<h3 id="heading-the-impact-of-a-grand-unified-theory-of-mathematics">The Impact of a Grand Unified Theory of Mathematics</h3>
<p>Is it possible to unify all math?</p>
<p>In theory, yes. This is known as the Grand Unified Theory of Mathematics. It's the idea that all different areas of math can be linked together to discover deeper patterns in mathematics.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Langlands_program">Langlands program</a> is trying to make this unification possible. It’s an attempt to interconnect the largest parts of the big tree of math to uncover new patterns in math.</p>
<p>With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.</p>
<h4 id="heading-whats-the-value-of-this-big-unification-for-society">What’s the Value of this Big Unification for Society?</h4>
<p>By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:</p>
<ul>
<li><p>In the 19th century, James Clerk Maxwell united the fields of electricity and magnetism with his famous Maxwell equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for all technological progress in the 20th and 21st century.</p>
</li>
<li><p>In the 20th century, the unification of algebra with logic led to the rise of digital systems. In turn, digital systems gave rise to processors and the evolution of computers and the modern laptop.</p>
</li>
<li><p>Also in the 20th century, the unification of probability and communication led to information theory. This became the foundation for the internet. This unification was carried out by a great mathematician named Claude Shannon.</p>
</li>
</ul>
<p>In the end, a grand unified theory of mathematics could be one of the biggest achievements in modern society.</p>
<p>In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models and could also open the door to new material science advances.</p>
<p>It could help reveal – with math – the deep patterns we still haven’t found in these fields. Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.</p>
<h3 id="heading-a-final-lesson-from-history">A Final Lesson From History</h3>
<p>From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it’s possible to see its role in finding the patterns of our universe.</p>
<p>I hope I was able to make you see math in this way. I hope you can also see that the unification of scientific fields helps lay the foundations for the creation of new innovations to help society go forward.</p>
<p>Many major societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.</p>
<h2 id="heading-chapter-3-the-field-of-artificial-intelligence">Chapter 3: The Field of Artificial Intelligence</h2>
<h3 id="heading-what-is-artificial-intelligence">What is Artificial Intelligence?</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765001693682/bbec3565-643f-421f-b32e-3de62285a2c0.jpeg" alt="A man playing chess against a robot" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/elderly-man-thinking-while-looking-at-a-chessboard-8438918/">Pavel Danilyuk</a></p>
<p>The term Artificial Intelligence was born from the work of John McCarthy, who is often called the "father of AI."</p>
<p>He used it when he, along with Marvin Minsky, Nathaniel Rochester, and Claude Shannon, proposed the famous Dartmouth Summer Research Project on Artificial Intelligence in 1956.</p>
<p>At the Dartmouth Conference, the idea of artificial intelligence was framed by this conjecture:</p>
<blockquote>
<p><em>“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”</em></p>
</blockquote>
<p>Since then, the field has evolved in waves of innovation, from early rules-based systems to modern neural networks.</p>
<p>But over time, rather than creating <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">general intelligence</a>, most AI systems have been designed to excel at narrow tasks.</p>
<p>For example:</p>
<ul>
<li><p>Chess-playing programs like Deep Blue that defeated world champion Garry Kasparov</p>
</li>
<li><p>Image recognition systems that can identify objects in photographs with impressive accuracy</p>
</li>
<li><p>Natural language processing models that can translate between languages</p>
</li>
<li><p>Game-playing AI like AlphaGo that mastered the ancient game of Go</p>
</li>
</ul>
<h4 id="heading-artificial-general-intelligence-isnt-yet-here">Artificial General Intelligence isn’t yet here</h4>
<p>So far, only narrow AI models have demonstrated human-level or superhuman performance, and only within their specific domains.</p>
<p>In my view, and as we will see in this book, AGI will be the combination and interaction of different large language models interacting with each other and with the tools available to them.</p>
<h3 id="heading-symbolic-vs-non-symbolic-ai-whats-the-difference">Symbolic vs. Non-symbolic AI: What’s the Difference?</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755906822438/f639efd3-3f8b-45a7-ad2d-d1795d772947.png" alt="Image comparing artificial general intelligence with narrow AI and, inside narrow AI, non-symbolic AI and symbolic AI circles" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h4 id="heading-what-is-symbolic-ai">What is Symbolic AI?</h4>
<p>Symbolic AI refers to the creation of a program based on many rules and symbols to simulate how humans think.</p>
<p>It uses symbols to represent concepts (like farms and distributors) and logical rules to reason about them.</p>
<p>Imagine, for example, that someone wants to optimize farm distribution logistics. The symbols would represent concepts like farms, distributors, and transport methods. The rules would be:</p>
<ul>
<li><p>If the farm has high water usage and good pH levels, then classify it as high-yield producer</p>
</li>
<li><p>If a high-yield producer and distributor has low demand, then prioritize direct connection</p>
</li>
<li><p>If a direct connection is needed, then select transport with lowest environmental impact</p>
</li>
</ul>
<p>The specific data about your domain is called facts: the pieces of information the rules operate on. Here, the facts would be actual data like "farm X has high water usage" or "distributor Y has low demand."</p>
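<p>The three rules above can be sketched as a tiny rule engine in Python. The farm and distributor names (<code>farm_x</code>, <code>dist_y</code>) and attribute values are made up for illustration, and a real symbolic AI system like Prolog would derive conclusions through unification and backtracking rather than plain if-statements:</p>
<pre><code class="language-python"># Facts: the specific data the rules operate on (illustrative values)
farms = {"farm_x": {"water_usage": "high", "ph": "good"},
         "farm_z": {"water_usage": "low", "ph": "poor"}}
distributors = {"dist_y": {"demand": "low"}}

def is_high_yield(farm):
    # Rule 1: high water usage + good pH levels => high-yield producer
    return farm["water_usage"] == "high" and farm["ph"] == "good"

def plan_connections(farms, distributors):
    """Chain the rules: classify farms, then match them to distributors."""
    plans = []
    for farm_name, farm in farms.items():
        if not is_high_yield(farm):
            continue  # Rule 1 did not fire for this farm
        for dist_name, dist in distributors.items():
            # Rule 2: high-yield producer + low-demand distributor => direct connection
            if dist["demand"] == "low":
                # Rule 3: direct connection => pick the lowest-impact transport
                plans.append((farm_name, dist_name, "lowest_impact_transport"))
    return plans

print(plan_connections(farms, distributors))
# [('farm_x', 'dist_y', 'lowest_impact_transport')]
</code></pre>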
<p>This way, the system combines these rules and facts through logical reasoning to make decisions. A very popular programming language in this field is Prolog, which was designed for creating rule-based systems.</p>
<p><strong>Symbolic AI example: managing agricultural networks with a Prolog program.</strong></p>
<p>Let’s look at an example project to understand this more clearly. The project we’ll examine is called SymbolicAIHarvest. It was part of a course at NOVA University during my undergraduate studies in Electrical and Computer Engineering. The course was titled "Modelation of Data in Engineering."</p>
<p>SymbolicAIHarvest is an AI system developed with Prolog to manage agricultural networks. <a href="https://github.com/tiagomonteiro0715/SymbolicAIHarvest">Here’s the project</a> on GitHub so you can check it out.</p>
<p>The project optimizes farm operations using rule-based reasoning. It monitors sensors for real-time data and improves route planning for machinery. It also coordinates produce movement to reduce delays and waste, enhancing productivity and sustainability.</p>
<p>Understanding the code below is not a priority for this book. I just want to show you an example of all the facts of the project:</p>
<pre><code class="language-plaintext">% FARMERS(owner)
farmer(ana).
farmer(asdrubal).
farmer(miguel).
farmer(joao).
farmer(teresinha).
farmer(victor).
farmer(carlos).
farmer(anabela).

% FARMS(name, owner, region, type)
farm(q1, ana, alentejo, vinha).
farm(q2, ana, alentejo, olival).
farm(q3, asdrubal, lisboa, cenoureira).
farm(q4, asdrubal, lisboa, milharal).
farm(q5, asdrubal, lisboa, vinha).
farm(q6, miguel, evora, trigal).
farm(q7, miguel, evora, cenoureia).
farm(q8, miguel, evora, vinha).
farm(q9, miguel, evora, morangueira).
farm(q10, joao, porto, vinha).
farm(q11, joao, porto, trigal).
farm(q12, joao, porto, cenoureira).
farm(q13, teresinha, algarve, olival).
farm(q14, teresinha, algarve, vinha).
farm(q15, victor, setubal, olival).
farm(q16, victor, setubal, vinha).
farm(q17, victor, setubal, trigal).
farm(q18, carlos, sintra, milharal).
farm(q19, carlos, sintra, vinha).
farm(q20, anabela, coina, milharal).
farm(q21, anabela, coina, olival).
farm(q22, anabela, coina, trigal).

% SENSOR READINGS(name, type, value)
sensor_reading(q1,humidity,28).
sensor_reading(q2,humidity,35).
sensor_reading(q3,humidity,42).
sensor_reading(q4,humidity,38).
sensor_reading(q5,humidity,33).
sensor_reading(q6,humidity,45).
sensor_reading(q7,humidity,30).
sensor_reading(q8,humidity,36).
sensor_reading(q9,humidity,50).
sensor_reading(q10,humidity,41).
sensor_reading(q11,humidity,40).
sensor_reading(q12,humidity,44).
sensor_reading(q13,humidity,32).
sensor_reading(q14,humidity,29).
sensor_reading(q15,humidity,47).
sensor_reading(q16,humidity,39).
sensor_reading(q17,humidity,53).
sensor_reading(q18,humidity,27).
sensor_reading(q19,humidity,24).
sensor_reading(q20,humidity,31).
sensor_reading(q21,humidity,37).
sensor_reading(q22,humidity,46).
sensor_reading(q1, temperature, 25).
sensor_reading(q2, temperature, 25).
sensor_reading(q3, temperature, 25).
sensor_reading(q4, temperature, 25).
sensor_reading(q5, temperature, 25).
sensor_reading(q6, temperature, 25).
sensor_reading(q7, temperature, 25).
sensor_reading(q8, temperature, 25).
sensor_reading(q9, temperature, 25).
sensor_reading(q10, temperature, 25).
sensor_reading(q11, temperature, 25).
sensor_reading(q12, temperature, 25).
sensor_reading(q13, temperature, 25).
sensor_reading(q14, temperature, 25).
sensor_reading(q15, temperature, 25).
sensor_reading(q16, temperature, 25).
sensor_reading(q17, temperature, 25).
sensor_reading(q18, temperature, 25).
sensor_reading(q19, temperature, 25).
sensor_reading(q20, temperature, 25).
sensor_reading(q21, temperature, 25).
sensor_reading(q22, temperature, 25).
sensor_reading(q1, water, 47000).
sensor_reading(q2, water, 52500).
sensor_reading(q3, water, 39000).
sensor_reading(q5, water, 61000).
sensor_reading(q8, water, 58000).
sensor_reading(q10, water, 43000).
sensor_reading(q13, water, 72000).
sensor_reading(q16, water, 49000).
sensor_reading(q18, water, 35000).
sensor_reading(q21, water, 66500).
sensor_reading(q1, ph, 6.5).
sensor_reading(q2, ph, 4.7).
sensor_reading(q3, ph, 8.2).
sensor_reading(q4, ph, 7.0).
sensor_reading(q5, ph, 5.1).
sensor_reading(q6, ph, 8.0).
sensor_reading(q7, ph, 4.5).

% DISTRIBUTORS (name, region, capacity, demand level)
distributor(d1, alentejo, 1000, 2).
distributor(d2, lisboa, 800, 1).
distributor(d3, evora, 1200, 3).
distributor(d4, porto, 900, 2).
distributor(d5, algarve, 700, 2).
distributor(d6, setubal, 1100, 1).
distributor(d7, sintra, 950, 2).
distributor(d8, coina, 1000, 1).

% TRANSPORTS (name, capacity, type, autonomy, region, impact)
transport(t1, 1000, fossil, 100, alentejo, 3).
transport(t2, 500, electric, 10, alentejo, 1).
transport(t3, 800, fossil, 400, algarve, 5).
transport(t4, 700, hybrid, 300, setubal, 2).
transport(t5, 150, electric, 340, coina, 1).
transport(t6, 700, fossil, 220, porto, 3).
transport(t7, 900, hybrid, 350, evora, 2).
transport(t8, 1000, electric, 170, sintra, 1).

% Connections based on graph image

% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).
</code></pre>
<p>This Prolog code models an agricultural supply chain system that has:</p>
<ul>
<li><p>Farmers</p>
</li>
<li><p>Farms</p>
</li>
<li><p>Sensor Readings</p>
</li>
<li><p>Distributors</p>
</li>
<li><p>Transports</p>
</li>
</ul>
<p>In addition, in this part of the code on the facts of the system:</p>
<pre><code class="language-plaintext">% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).
</code></pre>
<p>We connect farms with distributors. This way, we can see that between the farm <code>q1</code> and distributor <code>d1</code> there is a distance of 7 km. This makes it possible to design algorithms that find the shortest path between them.</p>
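<p>To see what such an algorithm could look like, here’s a sketch of Dijkstra’s shortest-path algorithm in Python, run over the same <code>link</code> facts from the Prolog snippet above (treated as an undirected graph for illustration):</p>
<pre><code class="language-python">import heapq

# The link(From, To, Distance) facts from the Prolog snippet, as tuples
links = [("q2", "d1", 5), ("q1", "d1", 7), ("q3", "d1", 6),
         ("q3", "q4", 8), ("q4", "d2", 6), ("q4", "d3", 7),
         ("q4", "q5", 5), ("q4", "d4", 6), ("q2", "d2", 8),
         ("q3", "d3", 7)]

# Build an undirected adjacency map: node -> [(neighbor, distance), ...]
graph = {}
for a, b, dist in links:
    graph.setdefault(a, []).append((b, dist))
    graph.setdefault(b, []).append((a, dist))

def shortest_distance(start, goal):
    """Dijkstra's algorithm: smallest total distance from start to goal."""
    queue = [(0, start)]  # priority queue of (distance so far, node)
    visited = set()
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor))
    return None  # goal unreachable from start

print(shortest_distance("q1", "d1"))  # 7 (the direct link)
print(shortest_distance("q1", "d3"))  # 20 (q1 -> d1 -> q3 -> d3)
</code></pre>
<p>In a real Prolog system you would express the same search as recursive rules over <code>link/3</code>; the Python version just makes the algorithm explicit.</p>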
<p>In the end, symbolic AI just creates programs based on a context and rules applied to that context.</p>
<h4 id="heading-what-is-non-symbolic-ai">What is Non-Symbolic AI?</h4>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755906892854/197f7bc3-8c05-46f2-aa2a-99dbaa733a9a.png" alt="Non-symbolic AI with a circle titled machine learning inside. Inside the machine learning circle is another circle with the text deep learning." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Non-symbolic AI doesn’t use symbols or rules to reason. Instead, it’s data-driven: it learns patterns from large datasets. This is the approach used in machine learning and deep learning.</p>
<p>When we create an AI model, we can associate it with an API (Application Programming Interface) so that we can use the AI model in websites, applications, and other systems. Basically, the trained AI model is set up behind an API endpoint. An API endpoint is like a web service that lets other applications send requests to the model and get responses back.</p>
<p>For example, when you use ChatGPT in a web browser, your messages are sent through OpenAI's API to their language model, which processes your input and sends back a response.</p>
<p>An AI agent is a software program that can autonomously perform tasks by making decisions and taking actions to achieve specific goals.</p>
<p>Unlike basic chatbots that only reply to questions, AI agents can plan steps, use tools, and work towards achieving complex goals. They do this by combining language models with extra features like accessing outside data or working with other AI agents.</p>
<p><a href="https://github.com/tiagomonteiro0715/ai-content-lab">Here’s an example</a> of a non-symbolic AI agent project I worked on. I developed it using the <a href="https://www.crewai.com/">crewAI</a> Python library and the OpenAI API, one of the most popular libraries for creating AI agents.</p>
<p>In this system, five AI agents collaborate to create optimized content:</p>
<ul>
<li><p><strong>Research and Fact Checker:</strong> Conducts research to find trends and data.</p>
</li>
<li><p><strong>Audience Specialist:</strong> Analyzes audience needs for better engagement.</p>
</li>
<li><p><strong>Lead Content Writer:</strong> Writes engaging content based on research.</p>
</li>
<li><p><strong>Senior Editorial Director:</strong> Ensures content quality and consistency.</p>
</li>
<li><p><strong>SEO Specialist:</strong> Optimizes content for search engines.</p>
</li>
</ul>
<p>Using the OpenAI API, the system employs ChatGPT models through crewAI to have these agents work for me.</p>
<h3 id="heading-before-ai-control-theory-as-the-first-ai">Before AI: Control Theory as the “First AI”</h3>
<p>Before symbolic and non-symbolic AI, electrical engineering already had data-driven methods. One key area that I’ve already mentioned above is control theory, which studies control systems for machines like cars and rockets. This field lets us design systems that stay stable despite disturbances and achieve goals beyond human capabilities.</p>
<p>Nowadays, after creating a control theory algorithm, we check if AI can improve the control system. In my experience, only some advanced deep learning methods are effective. Most machine learning methods don't outperform control theory in efficiency and security.</p>
<p>Control theory also offers better interpretability, allowing us to understand decisions, unlike advanced machine learning and deep learning.</p>
<p>Due to the historical importance of control theory, I will continue to mention its role and mathematical applications. This will help you learn AI's math foundations and understand its significance in electronic systems and AI applications in engineering beyond dataset predictions.</p>
<h2 id="heading-chapter-4-linear-algebra-the-geometry-of-data">Chapter 4: Linear Algebra - The Geometry of Data</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002362611/905a356e-7686-4212-94ac-2b4a5b359c8a.jpeg" alt="Magnifying glass pointing at a book" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/monochrome-photo-of-math-formulas-3729557/">Nothing Ahead</a>.</p>
<p>Linear algebra is like having organized containers for data.</p>
<p>Instead of playing with individual numbers, we can pack them into structured boxes that are easier to handle. These structured boxes are called matrices.</p>
<p>When you have a lot of variables like customer data, sensor readings, or images, these structured boxes are very helpful. Also, what we can do when we play around with these boxes is very valuable.</p>
<p>In AI, linear algebra is everywhere. Take matrices, for example – a key concept in Linear Algebra. LLMs perform many matrix multiplications as their core operation. The data that they take in is also organized into matrices. In image recognition, matrices are used to represent pixels of images.</p>
<p>So as you can see, this core Linear Algebra concept is important to understand. Let's start!</p>
<h3 id="heading-what-are-matrices-and-why-do-they-simplify-equations">What Are Matrices and Why Do They Simplify Equations?</h3>
<p>Very often, systems in the real world can be simplified and modeled with a system of equations.</p>
<p>Those equations are often differential equations of many orders. But to simplify, let’s choose a very simple system like the one below:</p>
<p>$$\begin{align} 2x + 3y - z &amp;= 7 \ x - 2y + 4z &amp;= -1 \ 3x + y + 2z &amp;= 10 \end{align}$$</p>
<p>When dealing with many variables and equations, writing each equation separately quickly becomes frustrating. Matrices provide a compact way to represent these systems.</p>
<p>For example, here’s the system above as a single matrix equation:</p>
<p>$$\begin{bmatrix} 2 &amp; 3 &amp; -1 \ 1 &amp; -2 &amp; 4 \ 3 &amp; 1 &amp; 2 \end{bmatrix} \begin{bmatrix} x \ y \ z \end{bmatrix} = \begin{bmatrix} 7 \ -1 \ 10 \end{bmatrix}$$</p>
<p>By seeing systems of equations as matrices, we can use linear algebra techniques to understand how the system behaves.</p>
<p>Some of these techniques are:</p>
<ul>
<li><p>Linear Independence, Dependence, and Rank</p>
</li>
<li><p>Determinants</p>
</li>
<li><p>Eigenvalues and Eigenvectors</p>
</li>
</ul>
<p>So to summarize:</p>
<ol>
<li><p>A real world system can be represented as a system of equations</p>
</li>
<li><p>A system of equations can be compressed in a structured manipulable form called a matrix.</p>
</li>
<li><p>With matrices and linear algebra techniques, we can understand how the system works.</p>
</li>
</ol>
<p>This way, we can study the basic behavior of a system with Linear Algebra.</p>
<p>For complex systems like a rocket, Linear Algebra is still the foundation. More advanced tools from control theory are used, but understanding simpler systems is essential for modeling and creating complex ones.</p>
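<p>To make step 3 concrete: once the system is in matrix form, a computer can solve it directly. Here's a small sketch solving the three-equation system above with NumPy:</p>

```python
import numpy as np

# Coefficient matrix and right-hand side of the system above
A = np.array([[2, 3, -1],
              [1, -2, 4],
              [3, 1, 2]], dtype=float)
b = np.array([7, -1, 10], dtype=float)

# Solve A @ [x, y, z] = b in one call
solution = np.linalg.solve(A, b)
print(solution)
```

<p>Multiplying <code>A</code> by the solution vector reproduces <code>b</code>, which is exactly what the matrix equation promises.</p>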
<h3 id="heading-vectors-and-transformations-moving-in-multiple-directions">Vectors and Transformations: Moving in Multiple Directions</h3>
<p>Vectors are matrices <strong>with a single row or a single column.</strong> You can also think of them as the building blocks of AI. They represent things like data points, model parameters, and much more.</p>
<p>For example, every data input (like an image or sentence) becomes a vector that the model can process.</p>
<p>Here are two examples of vectors:</p>
<p>$$\mathbf{A} = \begin{bmatrix} 4 &amp; -2 &amp; 7 &amp; 1 &amp; 5 \end{bmatrix}$$</p>
<p>And:</p>
<p>$$\mathbf{B} = \begin{bmatrix} 3 \ -1 \ 8 \ 0 \ -4 \end{bmatrix}$$</p>
<p>All operations that you can perform on matrices can also be performed on vectors.</p>
<p>In Python, we can represent this by:</p>
<pre><code class="language-plaintext">import numpy as np

# Define vectors A and B
A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756171163870/4fa7dc5d-5b68-4baf-a211-3db0c3915781.png" alt="Python code image representing the code above. Defining two NumPy arrays." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>We’re using the <a href="https://numpy.org/">NumPy</a> library because it makes math with arrays easy and fast.</p>
<p>In terms of systems of equations, a vector with a single row, such as:</p>
<p>$$\mathbf{A} = \begin{bmatrix} 4 &amp; -2 &amp; 7 &amp; 1 &amp; 5 \end{bmatrix}$$</p>
<p>And this represents the coefficients of a single equation:</p>
<p>$$4x_1 - 2x_2 + 7x_3 + x_4 + 5x_5 = k$$</p>
<p>A vector with a single column, such as:</p>
<p>$$\mathbf{B} = \begin{bmatrix} 3 \ -1 \ 8 \ 0 \ -4 \end{bmatrix}$$</p>
<p>Which represents this system of equations:</p>
<p>$$\begin{align} x_1 &amp;= 3 \ x_2 &amp;= -1 \ x_3 &amp;= 8 \ x_4 &amp;= 0 \ x_5 &amp;= -4 \end{align}$$</p>
<p>Now let’s see some matrix operations.</p>
<p>For example:</p>
<p>$$\mathbf{A} + \mathbf{B}^T = \begin{bmatrix} 4 &amp; -2 &amp; 7 &amp; 1 &amp; 5 \end{bmatrix} + \begin{bmatrix} 3 &amp; -1 &amp; 8 &amp; 0 &amp; -4 \end{bmatrix} = \begin{bmatrix} 7 &amp; -3 &amp; 15 &amp; 1 &amp; 1 \end{bmatrix}$$</p>
<pre><code class="language-plaintext">vector_addition = A + B
print("A + B =", vector_addition)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756171174149/62309c55-a5c5-4f69-aef6-e8ab341b5926.png" alt="Python code image representing the code above. Adding two NumPy arrays." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Which gives the result of the equation above.</p>
<p>Often, vector addition is used to combine features. For example, adding many user preference vectors creates a profile of a user.</p>
<p>Here’s a <strong>scalar multiplication:</strong></p>
<p>$$3\mathbf{A} = 3\begin{bmatrix} 4 &amp; -2 &amp; 7 &amp; 1 &amp; 5 \end{bmatrix} = \begin{bmatrix} 12 &amp; -6 &amp; 21 &amp; 3 &amp; 15 \end{bmatrix}$$</p>
<pre><code class="language-plaintext">scalar_mult = 3 * A
print("3 * A =", scalar_mult)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756171180976/17e260a4-baab-4866-ba30-fc12e090b87a.png" alt="Python code image representing the code above. Multiplying a NumPy array with a scalar." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Which gives the result of the equation above.</p>
<p>In AI, scaling vectors is usually done to adjust relevance. For example, multiplying a vector by 100 amplifies its value, while multiplying it by 0.3 reduces its importance.</p>
<p>Here's an outer product multiplication:</p>
<p>$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} 4 \ -2 \ 7 \ 1 \ 5 \end{bmatrix} \times \begin{bmatrix} 3 &amp; -1 &amp; 8 &amp; 0 &amp; -4 \end{bmatrix} = \begin{bmatrix} 12 &amp; -4 &amp; 32 &amp; 0 &amp; -20 \ -6 &amp; 2 &amp; -16 &amp; 0 &amp; 8 \ 21 &amp; -7 &amp; 56 &amp; 0 &amp; -28 \ 3 &amp; -1 &amp; 8 &amp; 0 &amp; -4 \ 15 &amp; -5 &amp; 40 &amp; 0 &amp; -20 \end{bmatrix}$$</p>
<p>And here’s a <strong>dot product</strong> (also called a <strong>scalar</strong> or <strong>inner product</strong>):</p>
<p>$$\mathbf{A} \cdot \mathbf{B}^T = \begin{bmatrix} 4 &amp; -2 &amp; 7 &amp; 1 &amp; 5 \end{bmatrix} \cdot \begin{bmatrix} 3 &amp; -1 &amp; 8 &amp; 0 &amp; -4 \end{bmatrix}$$</p>
<p>$$= 4 \cdot 3 + (-2) \cdot (-1) + 7 \cdot 8 + 1 \cdot 0 + 5 \cdot (-4) = 50$$</p>
<p>We mainly use dot products when we want to measure the similarity, or alignment, between two vectors. In machine learning, in one simple phrase: the dot product is a measure of similarity.</p>
<pre><code class="language-plaintext">import numpy as np

dot_product = np.dot(A, B)
print("A · B =", dot_product)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756171200508/ee7b9e61-c1cb-497d-b038-b6a672c6d24b.png" alt="Python code image representing the code above. Multiplying a NumPy array via dot product." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Which gives the result of the equation above.</p>
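<p>A common refinement of this similarity measure is <strong>cosine similarity</strong>, which divides the dot product by the vectors' lengths so the result no longer depends on their magnitudes. Here's a quick sketch reusing <code>A</code> and <code>B</code>:</p>

```python
import numpy as np

A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

# Cosine similarity: the dot product divided by the product of the
# vectors' lengths, so the result always falls between -1 and 1
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos_sim)
```
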
<h3 id="heading-linear-independence-dependence-and-rank-why-it-matters">Linear Independence, Dependence, and Rank: Why It Matters</h3>
<p>A lot of times, matrices can be made smaller and simpler. So it’s good practice to reduce a matrix to its simplest form before analyzing its properties.</p>
<p>When each row of a matrix can be built from the other rows, that matrix is linearly dependent. This means the matrix can be simplified further.</p>
<p>Conversely, a matrix has the property of linear independence when its rows cannot be created by combining each other.</p>
<p>For example, when we have a complex matrix like this one:</p>
<p>$$C = \begin{bmatrix} 1 &amp; 2 &amp; 3 &amp; 4 \ 2 &amp; 4 &amp; 6 &amp; 8 \ 1 &amp; 3 &amp; 5 &amp; 7 \ 0 &amp; 1 &amp; 2 &amp; 3 \end{bmatrix}$$</p>
<p>We can, with calculations, convert to this:</p>
<p>$$C_{\text{reduced}} = \begin{bmatrix} 1 &amp; 0 &amp; -1 &amp; -2 \ 0 &amp; 1 &amp; 2 &amp; 3 \ 0 &amp; 0 &amp; 0 &amp; 0 \ 0 &amp; 0 &amp; 0 &amp; 0 \end{bmatrix}$$</p>
<p>If you are not familiar with row reduction, I recommend <a href="https://www.youtube.com/watch?v=eDb6iugi6Uk">this YouTube video</a>.</p>
<p>The above simplified matrix is the same thing as this:</p>
<p>$$C_{\text{reduced}} = \begin{bmatrix} 1 &amp; 0 &amp; -1 &amp; -2 \ 0 &amp; 1 &amp; 2 &amp; 3 \end{bmatrix}$$</p>
<p>This way, we conclude that the C matrix has a <strong>rank</strong> of 2.</p>
<p>In other words, since the simplest form of the matrix has only 2 rows with numbers, it has a rank of 2.</p>
<p>From this, we can conclude that the reduced version of the matrix is <strong>linearly independent</strong>. This is because no row or column can be made from the existing rows or column. It’s the simplest possible matrix.</p>
<p>The original matrix C is linearly dependent because some rows are just multiples or combinations of other rows. For example, row 2 of the original matrix C is exactly row 1 multiplied by 2.</p>
<p>Another way of seeing this is that we have 4 rows in the original matrix and the rank of matrix C is 2. Since they are not equal, C is linearly dependent.</p>
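<p>We can confirm this rank with NumPy, which computes it numerically from the matrix's singular values:</p>

```python
import numpy as np

C = np.array([[1, 2, 3, 4],
              [2, 4, 6, 8],
              [1, 3, 5, 7],
              [0, 1, 2, 3]])

# Only 2 rows carry independent information, so the rank is 2
print(np.linalg.matrix_rank(C))  # 2
```
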
<h4 id="heading-why-are-these-concepts-important">Why are these concepts important?</h4>
<p>Linear independence and rank are important in engineering because they show whether equations, represented as matrices, give unique information. In electrical circuits and control systems, knowing that the equations are independent ensures unique solutions and avoids redundancy.</p>
<p>The matrix rank shows the maximum number of independent equations that can exist. This helps engineers model the simplest possible form of a system.</p>
<p>In LLMs like ChatGPT, Gemini, Grok, and Claude, linear independence, dependence, and rank are used in a very important technique called LoRA (Low-Rank Adaptation).</p>
<p>LoRA (Low-Rank Adaptation) is widely used to calibrate these models so that they adapt efficiently to new tasks or domains without retraining the full model. There are also variants of this technique, like Quantized LoRA (QLoRA). By cutting the compute needed for fine-tuning, LoRA saves energy, cooling water, and hardware in many data centers.</p>
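<p>To see why low rank saves so much, here's a toy sketch of the core idea (the sizes and random matrices are made up for illustration, not taken from any real model):</p>

```python
import numpy as np

d = 200   # illustrative layer width
r = 8     # chosen low rank

# Full fine-tuning would update every entry of a d x d weight matrix
full_params = d * d

# LoRA instead learns two thin matrices whose product is a rank-r update
rng = np.random.default_rng(0)
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, d))
delta_W = A @ B               # the rank of this update is at most r
lora_params = d * r + r * d   # 3,200 parameters instead of 40,000

print(full_params, lora_params)
print(np.linalg.matrix_rank(delta_W))
```

<p>The full update matrix is never stored as trainable parameters: only the two thin factors are, which is where the savings come from.</p>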
<h3 id="heading-determinants-measuring-space-and-scaling">Determinants: Measuring Space and Scaling</h3>
<p>Why are determinants important?</p>
<p>Determinants tell us if a system of equations has infinite solutions, no solutions, or if it has a unique solution without having to solve the whole system.</p>
<p>This way, instead of immediately trying to solve a complex system, we can first use the determinant to find out if it is even worth solving in the first place.</p>
<p>Many engineers don’t really understand the importance of the determinant. The only thing they know is the formula and how to apply it.</p>
<p>So now let’s learn, with some examples, what exactly the determinant is and why it matters.</p>
<p>A determinant is just a number. It’s always calculated from a square matrix. By calculating the determinant, we can find certain properties about the system it represents.</p>
<p>The determinant of a given matrix A:</p>
<p>$$A = \begin{bmatrix} a &amp; b \ c &amp; d \end{bmatrix}.$$</p>
<p>can be represented by two notations:</p>
<p>$$\det(A) = ad - bc$$</p>
<p>or</p>
<p>$$|A| = ad - bc$$</p>
<p>Both are the same thing.</p>
<p>Let's see how to calculate a determinant:</p>
<p>$$|A| = \begin{vmatrix} 2 &amp; 3 \ 1 &amp; 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$</p>
<p>Let’s see how to do this in Python:</p>
<pre><code class="language-plaintext">import numpy as np

# Define the matrix
A = np.array([
    [2, 3],
    [1, 4]
])

# Calculate the determinant
det_A = np.linalg.det(A)

print("Determinant of A:", det_A)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756233259727/feea57a3-5a33-49b9-a74a-979eba5ec7fe.png" alt="Python code image representing the code above. Finding the determinant." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h4 id="heading-the-same-calculation-works-for-other-matrices">The same calculation works for other matrices!</h4>
<p>Here's the determinant formula for a 3×3 matrix:</p>
<p>$$|B|= \begin{vmatrix} a &amp; b &amp; c \ d &amp; e &amp; f \ g &amp; h &amp; i \end{vmatrix} = aei + bfg + cdh - ceg - bdi - afh.$$</p>
<p>Now let’s apply the formula to an example:</p>
<p>$$|B| = \begin{vmatrix} 1 &amp; 2 &amp; 3 \ 0 &amp; 4 &amp; 5 \ 1 &amp; 0 &amp; 6 \end{vmatrix} = (1)(4)(6) + (2)(5)(1) + (3)(0)(0) - (3)(4)(1) - (2)(0)(6) - (1)(5)(0)$$</p>
<p>Evaluating the nonzero terms:</p>
<p>$$= (1)(4)(6) + (2)(5)(1) - (3)(4)(1) = 24 + 10 - 12 = 22$$</p>
<p>In Python code:</p>
<pre><code class="language-plaintext">import numpy as np

# Define the matrix
B = np.array([
    [1, 2, 3],
    [0, 4, 5],
    [1, 0, 6]
])

# Calculate the determinant
det_B = np.linalg.det(B)

print("Determinant of B:", det_B)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756233606615/4e333b35-4714-480a-8a3b-62db799614e1.png" alt="Python code image representing the code above. Finding a 3 by 3 determinant." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now, let’s visualize matrix A by plotting its column vectors. Each column becomes a vector: (2,1) and (3,4). This shows us geometrically what the matrix is actually doing.</p>
<p>In a GeoGebra graph, it gives us this:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756235393476/6b5c38ea-7b27-4e3d-8ad4-346417d35e77.png" alt="Representation of 2 vectors in a Cartesian plane." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>As we can see, the vectors define how each variable influences the system. By visualizing what the matrices are doing, we can find patterns that are harder to find just by looking at formulas.</p>
<p><strong>What does this mean visually?</strong></p>
<p>It means that in the space, this is what our matrix looks like. It’s also how our system of equations is represented.</p>
<p>C1 represents the “force”, or impact, that the variable x1 has. And C2 does the same thing for the variable x2.</p>
<p>Now we’ll focus on a 3D matrix example. This matrix D represents a system of three equations with three variables:</p>
<p>$$D = \begin{bmatrix} 2 &amp; -1 &amp; 3 \ 4 &amp; 0 &amp; -2 \ -1 &amp; 5 &amp; 1 \end{bmatrix}$$</p>
<p>$$\begin{align} 2x_1 - x_2 + 3x_3 &amp;= p \ 4x_1 + 0x_2 - 2x_3 &amp;= q \ -x_1 + 5x_2 + x_3 &amp;= r \end{align}$$</p>
<p>Each column can be described as a separate vector:</p>
<p>$$\begin{equation} D = \left[ D_1 \mid D_2 \mid D_3 \right] = \left[ \begin{bmatrix} 2 \ 4 \ -1 \end{bmatrix} \mid \begin{bmatrix} -1 \ 0 \ 5 \end{bmatrix} \mid \begin{bmatrix} 3 \ -2 \ 1 \end{bmatrix} \right] \end{equation}$$</p>
<p>As we can see, D was decomposed in 3 new column vectors:</p>
<p>$$\begin{equation} D_1 = \begin{bmatrix} 2 \ 4 \ -1 \end{bmatrix} \end{equation}$$</p>
<p>and:</p>
<p>$$\begin{equation} D_2 = \begin{bmatrix} -1 \ 0 \ 5 \end{bmatrix} \end{equation}$$</p>
<p>and:</p>
<p>$$\begin{equation} D_3 = \begin{bmatrix} 3 \ -2 \ 1 \end{bmatrix} \end{equation}$$</p>
<p>In a GeoGebra graph, it gives us this:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756236913078/8d8a3d48-20a9-423b-bfb8-4368d92ec340.png" alt="Representation of 3 vectors in a 3D Cartesian plane." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In 3D, each vector points in its own direction. Together, the three equations define three planes. The point where all three planes intersect is the solution to the system.</p>
<p>This is a key advantage of matrices and linear algebra. They help us visualize both simple and complex systems, enhancing systems thinking and first principles thinking.</p>
<p>The determinant is directly connected to these visualizations. For example, in 2D it measures the area that the vectors stretch over. Now we’ll see how that’s possible.</p>
<p>Let's use matrix A and see what its determinant looks like in geometric terms:</p>
<p>$$A = \begin{bmatrix} 2 &amp; 3 \ 1 &amp; 4 \end{bmatrix}$$</p>
<p>Which can be decomposed into 2 vectors <code>u</code> and <code>v</code>:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756241016899/ded47498-b030-4fa1-a4fe-07153d138a7f.png" alt="Representation of 2 vectors (matrix A) in a Cartesian plane." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>It gives us this determinant:</p>
<p>$$|A| = \begin{vmatrix} 2 &amp; 3 \ 1 &amp; 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$</p>
<p>Now let’s see the determinant visually.</p>
<p>From (2,1) and (3,4), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitude. They meet at (5,5), and we get a parallelogram defined by these points: (0,0), (2,1), (3,4), (5,5).</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756241586617/d825b8e2-d839-4b15-bdd0-d9b5efd80942.png" alt="Representation of the 4 vectors being used in the determinant" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The area of the parallelogram is the determinant:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756241692073/deb2e0cd-32a3-4a1a-90e7-e556f5039169.png" alt="Illustrating that the area limited by the 4 vectors is the determinant." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let’s see another example.</p>
<p>Let’s use a matrix F and see what it truly is:</p>
<p>$$F = \begin{bmatrix} 1 &amp; 2 \ 2 &amp; 4 \end{bmatrix}$$</p>
<p>It gives us this determinant:</p>
<p>$$|F| = \begin{vmatrix} 1 &amp; 2 \ 2 &amp; 4 \end{vmatrix} = (1)(4) - (2)(2) = 4 - 4 = 0$$</p>
<p>In GeoGebra, we can see that:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756242215981/d88f2e80-04ba-46b9-979d-d7684f161210.png" alt="Representation of the 2 vectors being used in the determinant" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now let’s try to see the determinant visually:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756242340382/46551578-69a5-4ef9-ab86-9149e7fb4aaa.png" alt="Illustrating that the area limited by the 2 vectors is the determinant and that it does not exist. So the determinant is zero." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>We can conclude that the area is 0.</p>
<p>Now let’s use a matrix G and see what it truly is:</p>
<p>$$G = \begin{bmatrix} 1 &amp; 5 \ 2 &amp; 3 \end{bmatrix}$$</p>
<p>It gives us this determinant:</p>
<p>$$|G| = \begin{vmatrix} 1 &amp; 5 \ 2 &amp; 3 \end{vmatrix} = (1)(3) - (5)(2) = 3 - 10 = -7$$</p>
<p>In GeoGebra, we can see that:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756242987960/d182b725-81ba-4042-81e1-6b0232e09ffb.png" alt="Representation of the 2 vectors being used to find the determinant" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now let’s try to see the determinant visually.</p>
<p>From (1,2) and (5,3), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitude. They meet at (6,5), and a parallelogram is completed with these points: (0,0), (1,2), (5,3), (6,5).</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756243098714/881693d4-7a84-4b72-bb87-3fb48b25fe4b.png" alt="Representation of 4 vectors being used to find the determinant before showing the area" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Again, the area of the parallelogram is the determinant:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756243316071/ce8fa65b-6370-4ada-9fe6-cdf20ab4546d.png" alt="Illustrating that the area limited by the 4 vectors is the determinant." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>We just saw that the determinant is the area of a parallelogram formed by the vectors. When the determinant is 0, there is no area. In other cases, there is an area. But what does this mean, and why do we care about these different values?</p>
<p><strong>When the det = 0:</strong></p>
<ul>
<li><p>The vectors are linearly dependent (one can be written as a combination of the others)</p>
</li>
<li><p>They lie on the same line or one is a scaled version of the other</p>
</li>
<li><p>The parallelogram collapses to a line, hence zero area</p>
</li>
<li><p>This tells us the matrix has no inverse</p>
</li>
<li><p><strong>Systems of equations either have no solution or infinitely many solutions</strong></p>
</li>
</ul>
<p><strong>When the det ≠ 0 (det &gt; 0 or det &lt; 0):</strong></p>
<ul>
<li><p>The vectors form a proper parallelogram with an area</p>
<ul>
<li><p>If det &gt; 0, the determinant equals the area and the transformation preserves orientation</p>
</li>
<li><p>If det &lt; 0, the orientation is flipped, and the area is the absolute value of the determinant</p>
</li>
</ul>
</li>
<li><p>The vectors are linearly independent</p>
</li>
<li><p><strong>Systems of equations have exactly one solution</strong></p>
</li>
</ul>
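<p>We can see both cases in NumPy, reusing the singular matrix F and the matrix A from earlier (the right-hand sides below are made up just to have something to solve):</p>

```python
import numpy as np

# det(F) = 0: the parallelogram collapses, so no unique solution exists
F = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.det(F))

try:
    np.linalg.solve(F, np.array([1.0, 1.0]))
except np.linalg.LinAlgError:
    print("F is singular: no unique solution")

# det(A) = 5 (nonzero): exactly one solution exists
A = np.array([[2.0, 3.0],
              [1.0, 4.0]])
x = np.linalg.solve(A, np.array([7.0, 6.0]))
print(x)
```
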
<p>In electrical engineering, determinants help verify if a control system is controllable and observable.</p>
<p>Control systems use matrices a lot. For this reason, checking if their determinants are zero or non-zero tells engineers:</p>
<ul>
<li><p>If it is controllable, it means the system is reachable, which helps in stabilization and performance optimization.</p>
</li>
<li><p>If it is observable, it means the system is measurable, which helps in fault detection and system monitoring.</p>
</li>
</ul>
<p>In finite element analysis, a very popular math tool for solving partial differential equations, determinants help figure out quickly whether the calculations will give reliable results.</p>
<p>This way, with finite element analysis, we can design safer buildings, optimize aircraft wings, and simulate medical implants – all of which have a large impact on human lives and safety.</p>
<p>In machine learning, determinants are crucial to understanding data transformations. If a determinant of zero shows up in a transformation, it means you are losing information and can't recover the original data.</p>
<p>Also, in deep learning, determinants are used when choosing the first parameters of neural networks (weight initialization) to prevent problems like vanishing/exploding gradients.</p>
<p>In a 3×3 matrix, the determinant represents the volume of a parallelepiped (a 3D "box") formed by three vectors in 3D space.</p>
<ul>
<li><p>If det = 0: The three vectors lie in the same plane, so they don't span any 3D volume</p>
</li>
<li><p>If det ≠ 0: The vectors form a proper 3D shape with actual volume</p>
</li>
</ul>
<p>The absolute value |det| gives you the exact volume of that <a href="https://en.wikipedia.org/wiki/Parallelepiped">parallelepiped</a>.</p>
<p>For example, if you have vectors a, b, and c, the determinant tells you how much 3D space they "fill up" when you use them as the edges of a box.</p>
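<p>Here's a quick check of that volume interpretation, with three made-up edge vectors chosen so the volume is easy to verify by hand:</p>

```python
import numpy as np

# Three edge vectors of a parallelepiped (a 3D "box")
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 2.0, 0.0])
c = np.array([1.0, 1.0, 3.0])

# Stack them as columns; |det| is the volume the box encloses
M = np.column_stack([a, b, c])
volume = abs(np.linalg.det(M))
print(volume)  # 6.0: a 1 x 2 x 3 box, sheared but volume-preserving
```
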
<p>This is where it gets fascinating:</p>
<ul>
<li><p>4×4 matrix: The determinant represents the "hypervolume" of a 4D parallelepiped formed by four vectors in 4-dimensional space.</p>
</li>
<li><p>1000×1000 matrix: The determinant represents the hypervolume in 1000-dimensional space!</p>
</li>
</ul>
<p>So, to summarize, the determinant tells us easily if there are no solutions, infinite solutions, or exactly one solution in a system of equations, represented by a compact matrix.</p>
<h3 id="heading-what-are-mathematical-spaces-and-how-do-they-simplify-calculations">What Are Mathematical Spaces and How Do They Simplify Calculations?</h3>
<p>We now have a great foundation to understand the rest of this chapter on linear algebra.</p>
<p>Now, we will see how a linearly independent matrix creates something called a basis. We will also see that a basis is just a set of building blocks for mathematical spaces!</p>
<p>The row vectors of a linearly independent matrix form a basis.</p>
<p>For example, take matrix A, which is linearly independent:</p>
<p>$$A = \begin{bmatrix} 1 &amp; 0 &amp; 0 &amp; 0 \ 0 &amp; 1 &amp; 0 &amp; 0 \ 0 &amp; 0 &amp; 1 &amp; 0 \ 0 &amp; 0 &amp; 0 &amp; 1 \end{bmatrix}$$</p>
<p>Its rows form this set:</p>
<p>$$((1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1))$$</p>
<p>In this case, since matrix A is linearly independent, the set of matrix rows is called a <strong>basis</strong>. From this basis, you can create endless linear combinations of any other vector. The collection of all these possible combinations is called a <strong>mathematical space</strong>.</p>
<p>A mathematical space is an infinite set containing all linear combinations of a basis. It’s called a basis because these vectors <strong>form the base</strong> used to express any vector in the space as a linear combination.</p>
<p>This matrix B is linearly independent:</p>
<p>$$B = \begin{bmatrix} 1 &amp; 0 \ 0 &amp; 1 \ \end{bmatrix}$$</p>
<p>And forms this set:</p>
<p>$$((1, 0), (0, 1))$$</p>
<p>And from this basis come all possible points in the Cartesian coordinate system:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756247201687/a847b8c0-5678-431c-b446-e1897afdffc6.png" alt="Showing in the Cartesian plane where the point (2, 3) is" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>For example, mathematically, we can get the point (2,3) by:</p>
<p>$$(x=2, y=3) = 2(1, 0) + 3(0, 1) = (2, 0) + (0, 3) = (2, 3)$$</p>
<p>Note: There are other bases for the cartesian coordinate plane. I chose this one because it’s the easiest to understand.</p>
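<p>Recovering the coefficients of a linear combination is just solving a small linear system. Here's a sketch for the point (2, 3), first with the standard basis and then with a second, made-up basis to show that other bases work too:</p>

```python
import numpy as np

# Standard basis of the plane, one basis vector per column
standard = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
point = np.array([2.0, 3.0])

# Solve standard @ coords = point to recover the combination's coefficients
coords = np.linalg.solve(standard, point)
print(coords)

# A different basis, (1, 1) and (0, 1), also spans the plane;
# the same point just gets different coordinates
other = np.array([[1.0, 0.0],
                  [1.0, 1.0]])
other_coords = np.linalg.solve(other, point)
print(other_coords)
```
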
<h3 id="heading-eigenvalues-and-eigenvectors-unlocking-hidden-patterns">Eigenvalues and Eigenvectors: Unlocking Hidden Patterns</h3>
<p>Eigenvalues and eigenvectors, in my opinion, are far simpler than what mathematics professors make them out to be at university:</p>
<ul>
<li><p>Eigenvalues tell you how much a matrix stretches or shrinks things.</p>
</li>
<li><p>Eigenvectors tell you which directions stay unchanged when the matrix transforms them.</p>
</li>
</ul>
<p>This way, a matrix may have one or many eigenvalues, each with its own associated eigenvectors.</p>
<p>Let’s see an example:</p>
<p>For a square matrix A, eigenvalue λ, and eigenvector v:</p>
<p>$$Av=λv$$</p>
<p>The easiest way to find the eigenvalues is to solve this equation:</p>
<p>$$det(A−λI)=0$$</p>
<p>or:</p>
<p>$$|A−λI|=0$$</p>
<p>Again, we have different notations for the determinant, but they’re the same thing.</p>
<p>Anyway, let’s define a very simple matrix A:</p>
<p>$$A = \begin{bmatrix} 2 &amp; 0 \ 0 &amp; 3 \end{bmatrix}$$</p>
<p>Now let’s make some calculations.</p>
<p>This formula:</p>
<p>$$det(A−λI)=0$$</p>
<p>Can be decomposed into:</p>
<p>$$det(\begin{bmatrix} 2 &amp; 0 \ 0 &amp; 3 \end{bmatrix} - λ \times \begin{bmatrix} 1 &amp; 0 \ 0 &amp; 1 \end{bmatrix}) = 0$$</p>
<p>Which is the same as:</p>
<p>$$det(\begin{bmatrix} 2 &amp; 0 \ 0 &amp; 3 \end{bmatrix} - \begin{bmatrix} λ &amp; 0 \ 0 &amp; λ \end{bmatrix}) = 0$$</p>
<p>Which gives us:</p>
<p>$$det(\begin{bmatrix} 2-λ &amp; 0 \ 0 &amp; 3-λ \end{bmatrix}) = 0$$</p>
<p>By the calculations we made above on the determinant, we can conclude that:</p>
<p>$$(2-λ) \times (3-λ) = 0$$</p>
<p>Which is the same as:</p>
<p>$$2-\lambda = 0 \text{ or } 3-\lambda = 0$$</p>
<p>Which gives us these eigenvalues:</p>
<p>$$\lambda_1 = 2, \quad \lambda_2 = 3$$</p>
<p>And these eigenvectors:</p>
<p>$$\mathbf{v_1} = \begin{bmatrix} 1 \ 0 \end{bmatrix}, \quad \mathbf{v_2} = \begin{bmatrix} 0 \ 1 \end{bmatrix}$$</p>
<p>This means that in the Cartesian coordinate system:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756321668969/949a5a4b-12ff-4490-bbff-1cc032bc5705.png" alt="Showing how the eigenvectors are related to the vectors in matrix A visually. Both have the same directions but different scalar values." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>By applying the matrix to the eigenvectors, we can see that:</p>
<ul>
<li>The eigenvalue 2 is associated with the eigenvector v1:</li>
</ul>
<p>$$A\mathbf{v_1} = \begin{bmatrix} 2 &amp; 0 \ 0 &amp; 3 \end{bmatrix}\begin{bmatrix} 1 \ 0 \end{bmatrix} = \begin{bmatrix} 2 \ 0 \end{bmatrix} = 2\begin{bmatrix} 1 \ 0 \end{bmatrix}$$</p>
<ul>
<li>The eigenvalue 3 is associated with the eigenvector v2:</li>
</ul>
<p>$$A\mathbf{v_2} = \begin{bmatrix} 2 &amp; 0 \ 0 &amp; 3 \end{bmatrix}\begin{bmatrix} 0 \ 1 \end{bmatrix} = \begin{bmatrix} 0 \ 3 \end{bmatrix} = 3\begin{bmatrix} 0 \ 1 \end{bmatrix}$$</p>
<p>Here is the Python code to calculate this:</p>
<pre><code class="language-python">import numpy as np

# Define matrix A
A = np.array([[2, 0],
              [0, 3]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:")
print(eigenvalues)

print("Eigenvectors (columns):")
print(eigenvectors)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756322044095/bc76f0ec-1d13-4845-b0f3-2847118860a3.png" alt="Python code, with NumPy array, showing how to find the eigenvalues" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Eigenvalues and eigenvectors are key tools in engineering and machine learning because they reveal a matrix's fundamental behavior. Although a matrix transformation might seem complex, in reality:</p>
<ul>
<li><p>Eigenvalues show how much stretching or compression occurs.</p>
</li>
<li><p>Eigenvectors identify the special directions where this stretching happens most naturally.</p>
</li>
</ul>
<p>In machine learning, we can use Principal Component Analysis (PCA) to make datasets smaller.</p>
<p>So, for example, let's say you’re building a machine learning application to predict heart disease. You have 100 features and 1 target variable telling you whether a person has the disease or not.</p>
<p>With PCA, you can reduce the 100 features to, say, 40. This way, you can build a smaller machine learning model and save computational resources.</p>
<p>PCA uses eigenvectors of covariance matrices to find important directions in data with many variables. It reduces data size without losing much detail, helping machine learning algorithms focus on key features and ignore unnecessary information.</p>
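<p>As a sketch of how PCA works under the hood, here is a minimal NumPy version that reduces 3 features to 2 principal components. The dataset and the component count are made up for illustration:</p>
<pre><code class="language-python">import numpy as np

# Small made-up dataset: 6 samples, 3 features
X = np.array([
    [4.2, 150.0, 280.0],
    [5.8, 220.0, 420.0],
    [3.9, 120.0, 230.0],
    [6.1, 250.0, 480.0],
    [4.7, 200.0, 340.0],
    [5.3, 200.0, 390.0],
])

# 1. Center the data (subtract each column's mean)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort the directions by eigenvalue (largest = most variance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project onto the top 2 principal components (3 features become 2)
X_reduced = X_centered @ eigenvectors[:, :2]

print("Explained variance ratio:", eigenvalues / eigenvalues.sum())
print("Reduced shape:", X_reduced.shape)
</code></pre>
<p>The eigenvectors with the largest eigenvalues point in the directions where the data varies the most, so keeping only those directions preserves most of the information.</p>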
<h3 id="heading-applications-of-linear-algebra-in-ai-and-control-theory">Applications of Linear Algebra in AI and Control Theory</h3>
<p>‌Linear algebra serves as the mathematical foundation for all engineering fields.</p>
<p>In addition, the principles of matrices and linear transformations provide the computational foundation that makes modern AI possible while enabling the control of complex systems.</p>
<p>All LLMs, from ChatGPT and Claude to Gemini and Grok, rely on linear operations.</p>
<p>All these systems carry out huge matrix multiplications to handle and create human language. So, when you type something into ChatGPT, probably millions of matrix multiplications are happening as you wait for a response!</p>
<p>In control theory, especially in an area called state-space control theory, matrices make it possible to create complex controllers. Linear algebra helps engineers design controllers for things like aircraft autopilots and robotic systems, among other applications.</p>
<p>For example, when a rocket adjusts its trajectory or a drone maintains stable flight, many matrix multiplications are happening to determine the best way to guarantee the system’s stability.</p>
<p>Thanks to GPUs, matrix operations are very efficient to compute. Also, any new matrix multiplication algorithms or special hardware for faster linear operations can greatly enhance AI and control systems.</p>
<p>In the end, linear algebra is the hidden mathematical engine powering the current AI revolution.</p>
<h2 id="heading-chapter-5-multivariable-calculus-change-in-many-directions">Chapter 5: Multivariable Calculus -&nbsp;Change in Many Directions</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002238157/a377cdc6-7e85-491b-90b8-8b3243618288.jpeg" alt="Photo of a woman writing a calculus equation on a board" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/woman-writing-on-a-whiteboard-3862130/">Photo by ThisIsEngineering</a></p>
<h3 id="heading-limits-and-continuity-understanding-smooth-change">Limits and Continuity: Understanding Smooth Change</h3>
<p>Calculus is one of the most valuable areas of mathematics, and it focuses on the study of continuous change.</p>
<p>Before we start learning a topic that makes many people give up on engineering degrees, I want to once again assure you that this chapter is very easily explained with a lot of images and code examples.</p>
<p>Also, just like linear algebra, many concepts in calculus are components of tools that have helped create billion-dollar industries.</p>
<h4 id="heading-what-is-continuity">What is continuity?</h4>
<p>Before explaining topics like derivatives and integrals, we need to understand continuity.</p>
<p>In simple terms, continuity means that a function has no breaks, jumps, or holes.</p>
<p>Essentially, you can draw it without lifting your pencil from the paper.</p>
<p>For example, this function is continuous:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756402257225/f9cfc4f3-a6f1-4fb9-9ed1-f690c4ffffc4.png" alt="Example of a function that is continuous" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can draw this graph without taking the pencil off the paper.</p>
<p>The above graph is represented by this function:</p>
<p>$$y = x^2 - 4x + 3$$</p>
<p>But the below function is <strong>not</strong> continuous:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756402337970/b5a65748-572d-4342-9685-9472babde38a.png" alt="Example of a function that is not continuous" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This one, you <strong>can’t</strong> draw without taking the pencil off the paper.</p>
<p>It’s represented by this piecewise function:</p>
<p>$$y = \begin{cases} 1.5 + \frac{1}{x+1} &amp; \text{if } -1 &lt; x &lt; 2 \ 2 + \frac{2}{(x-1)^2} &amp; \text{if } x &gt; 2 \end{cases}$$</p>
<p>This piecewise function is essentially two individual functions for two different intervals of numbers. Since calculus is the study of continuous change, we can only realistically use it on continuous functions.</p>
<h4 id="heading-how-do-limits-guarantee-continuity">How do limits guarantee continuity?</h4>
<p>We can only use tools like derivatives and integrals if a function is continuous.</p>
<p>How can we describe mathematically that a function is continuous – like drawing it without lifting our pencil from the paper?</p>
<p>Limits solve that problem.</p>
<p>When we take the limit of a function at a given point, we're asking: what value does a function approach as we get close to that point?</p>
<p>Let's look at some examples of this function at these points and also understand the notation used in limits:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756403511442/de3450f2-dcf9-40e3-a04e-846334abeebd.png" alt="Example of a function that is continuous and its various points" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<ol>
<li><strong>What is the limit of the function at x=0?</strong></li>
</ol>
<p>It is 3. That’s where the function crosses the y axis.</p>
<p>In mathematical notation,</p>
<p>$$\begin{align} \lim_{x \to 0} (x^2 - 4x + 3) &amp;= (0)^2 - 4(0) + 3 \ &amp;= 0 - 0 + 3 \ &amp;= 3 \end{align}$$</p>
<p>In this notation, we're asking what the value of the y function is as x gets very close to 0. Think of x as being at 0.00000000000001 or -0.00000000000001. It gets so close that we can consider it near enough.</p>
<ol start="2">
<li><strong>What is the limit of the function at x=1?</strong></li>
</ol>
<p>Let’s see another example:</p>
<p>In this case, the limit is 0.</p>
<p>$$\begin{align} \lim_{x \to 1} (x^2 - 4x + 3) &amp;= (1)^2 - 4(1) + 3 \ &amp;= 1 - 4 + 3 \ &amp;= 0 \end{align}$$</p>
<p>In this notation, we're asking what the value of the y function is as x gets very close to 1. Think of x as being at 0.99999999999999 or 1.00000000000001. It gets so close that we can consider it near enough.</p>
<ol start="3">
<li><strong>What is the limit of the function at x=2?</strong></li>
</ol>
<p>Let’s see another example:</p>
<p>Here, the limit is -1.</p>
<p>$$\begin{align} \lim_{x \to 2} (x^2 - 4x + 3) &amp;= (2)^2 - 4(2) + 3 \ &amp;= 4 - 8 + 3 \ &amp;= -1 \end{align}$$</p>
<p>Some more quick examples:</p>
<ol start="4">
<li><strong>What is the limit of the function at x=3?</strong></li>
</ol>
<p>It is 0. Here, we're asking what the value of the function is as x gets very close to 3. Think of x as being at 2.99999999999999 or 3.00000000000001. It gets so close that we can consider it near enough.</p>
<ol start="5">
<li><strong>What is the limit of the function at x=4?</strong></li>
</ol>
<p>It is 3, since (4)² - 4(4) + 3 = 3.</p>
<ol start="6">
<li><strong>What is the limit of the function at x=5?</strong></li>
</ol>
<p>It is 8, since (5)² - 4(5) + 3 = 8.</p>
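<p>A quick way to check all six answers at once is with SymPy, a sketch of which follows. For a continuous function like this one, the limit at each point equals the function's value there:</p>
<pre><code class="language-python">import sympy as sp

x = sp.symbols('x')
f = x**2 - 4*x + 3

# Evaluate the limit of f at each of the six points
for point in range(6):
    print(f"limit at x={point}:", sp.limit(f, x, point))
</code></pre>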
<p>Now let’s see another example:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756403617161/b67b2977-8ae4-4c06-8156-d7c6a64ee2e1.png" alt="Example of a function that is not continuous at a point of x=2" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>At the point x=2, the limit is not well defined:</p>
<ul>
<li><p>If we trace the curve with a pencil from the left toward x=2, we end up at 1.83333</p>
</li>
<li><p>If we trace it from the right toward x=2, we end up at 4</p>
</li>
</li>
</ul>
<h3 id="heading-why-are-limits-important-to-understand-derivatives-and-integrals">Why are limits important to understand derivatives and integrals?</h3>
<p>As we have seen, when we talk about limits, we are talking about the value that a function approaches as its input comes toward a particular point.</p>
<p>It’s critical to note that we're not looking at the value at that point itself. We’re looking at what happens as we get so near to it that we can pin down what value the function is approaching.</p>
<p>I will now show a very simple example to demonstrate this concept using mathematical notation.</p>
<p>I know that limits can be a difficult concept to understand at first. But if you understand limits very well, then you'll be well-prepared to understand derivatives and integrals.</p>
<p>And, as you’ll see, derivatives are responsible for modern AI, and integrals are important parts of tools widely used in billion-dollar industries.</p>
<p>I want you to understand the <strong>intuition</strong> behind this.</p>
<p>The function z(x) is continuous:</p>
<p>$$z(x) = \frac{3x + 7}{x + 2}$$</p>
<p><strong>So to what value does this expression converge as x approaches infinity?</strong></p>
<p>If you have a background in math, you might see why. But here it is for those who aren’t sure:</p>
<ul>
<li>It converges to 3.</li>
</ul>
<p>This time, the limit will be approaching infinity instead of a constant:</p>
<p>$$\begin{align} \lim_{x \to \infty} \frac{3x + 7}{x + 2} \end{align}$$</p>
<p>Let’s solve this in a very simple way:</p>
<ul>
<li>For x = 1:</li>
</ul>
<p>$$f(1) = \frac{3(1) + 7}{1 + 2} = \frac{10}{3} \approx 3.333...$$</p>
<ul>
<li>For x = 5:</li>
</ul>
<p>$$f(5) = \frac{3(5) + 7}{5 + 2} = \frac{22}{7} \approx 3.143...$$</p>
<ul>
<li>For x = 10:</li>
</ul>
<p>$$f(10) = \frac{3(10) + 7}{10 + 2} = \frac{37}{12} \approx 3.083...$$</p>
<ul>
<li>For x = 50:</li>
</ul>
<p>$$f(50) = \frac{3(50) + 7}{50 + 2} = \frac{157}{52} \approx 3.019...$$</p>
<ul>
<li>For x = 100:</li>
</ul>
<p>$$f(100) = \frac{3(100) + 7}{100 + 2} = \frac{307}{102} \approx 3.010...$$</p>
<ul>
<li>For x = 1000:</li>
</ul>
<p>$$f(1000) = \frac{3(1000) + 7}{1000 + 2} = \frac{3007}{1002} \approx 3.001...$$</p>
<ul>
<li>For x = 10000:</li>
</ul>
<p>$$f(10000) = \frac{3(10000) + 7}{10000 + 2} = \frac{30007}{10002} \approx 3.0001...$$</p>
<p>As x gets bigger and bigger, we get closer and closer to 3.</p>
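<p>If you'd like to check this with code, SymPy can evaluate both the individual values and the limit itself. A small sketch:</p>
<pre><code class="language-python">import sympy as sp

x = sp.symbols('x')
z = (3*x + 7) / (x + 2)

# Plug in ever larger values of x
for value in [1, 10, 100, 1000, 10000]:
    print(f"z({value}) = {float(z.subs(x, value)):.4f}")

# Take the limit symbolically as x approaches infinity
print("Limit:", sp.limit(z, x, sp.oo))  # 3
</code></pre>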
<p>This is the main idea of limits: Describe the value a function approaches as the input approaches some point.</p>
<p>This same idea applies to derivatives: they’re just limits that measure rates of change (slopes of tangent lines).</p>
<p>Likewise, integrals are just limits that measure accumulated quantities (areas under curves).</p>
<p>Let’s now see how derivatives work in depth.</p>
<h3 id="heading-derivatives-how-things-change-and-how-fast">Derivatives: How Things Change and How Fast</h3>
<p>As I said before, derivatives are just limits that measure rates of change (slopes of tangent lines).</p>
<p>But what does this actually mean?</p>
<p>Let’s see an example:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756755419750/75b36254-0f4a-4395-8dd4-14ac16399ff3.png" alt="Example of a function" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>What is the rate of change in the point A?</strong></p>
<p>Hard question, right? Let’s think about how to answer it with limits.</p>
<p>We can find the limit of the rate of change in point A(0.72, 0.66), also called the instantaneous rate of change.</p>
<p>Let’s do that:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756755680672/40f94361-55c7-4a9e-bfaf-b2b855fa0712.png" alt="Example of a function and choosing two points (B and C) to find the rate of change in point A" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>To find the slope, we take the coordinates of the points B(0.2, 0.2) and C(1.6, 1):</p>
<p>$$\text{slope} = \frac{1 - 0.2}{1.6 - 0.2} = \frac{0.8}{1.4} = \frac{4}{7} \approx 0.571$$</p>
<p>This gives us a rate of change:</p>
<p>$$y=0.571x + 0.084$$</p>
<p>Let's approximate more:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756756069833/3a4a1991-4983-4751-a68e-68bd6780300d.png" alt="Example of a function and choosing two points (B and C) to find the rate of change in point A. But B and C are closer to A." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let’s also zoom in:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756756131072/f96b7f82-a4ed-4720-8c87-fd2936bae9d9.png" alt="Example of a function and choosing two points (B and C) to find the rate of change in point A. But B and C are closer to A, and we have to zoom in." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>To find the slope, we use the coordinates of the points B(0.58, 0.55) and C(0.85, 0.75):</p>
<p>$$\text{slope} = \frac{0.75 - 0.55}{0.85 - 0.58} = \frac{0.2}{0.27} \approx 0.741$$</p>
<p>It gives us a rate of change:</p>
<p>$$y=0.741x + 0.12$$</p>
<p>Now let's approximate a lot:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756756879223/11d26af3-06ec-4419-b631-10308b4cadef.png" alt="Example of a function and choosing two points (B and C) to find the rate of change in point A. But B and C are closer to A, and we have to zoom in." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>To find the slope, we use the coordinates of the points B(0.7242549, 0.6625776) and C(0.7242884, 0.66260026):</p>
<p>$$\text{slope} = \frac{0.66260026- 0.6625776}{0.7242884- 0.7242549} = \frac{0.0000226}{0.0000335} = \frac{0.226}{0.335} \approx 0.674$$</p>
<p>Now let’s zoom out:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756757322888/a6f58b41-d6ff-44fd-b18f-06fb1f8f0e06.png" alt="Rate of change at point C" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>As we can see, we are so close that we can consider the limit of the rate of change to be 0.674.</p>
<p>It gives us the rate of change:</p>
<p>$$y=0.674x + 0.12$$</p>
<p><strong>This limit of the rate of change is called the derivative.</strong></p>
<p>To recap, here is an animation:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756766733257/a1754b47-7c57-4387-8b4c-886ed7b8f80a.gif" alt="GIF animation based on previous images" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Here’s a Python code example that lets you find the derivative in point A:</p>
<pre><code class="language-python">import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)

# Derivative of sin(x)
derivative_of_sin = sp.diff(f, x)

# Evaluate the derivative at x = 0.72
val = derivative_of_sin.subs(x, 0.72).evalf()

print("Derivative of sin(x) at x=0.72:", val)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756758436107/3bda58c5-96d6-4834-a2ec-ab8fedc4cb56.png" alt="Image of code example to find the derivative of the function sin(x)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The function that had the point A is called a sine wave.</p>
<p>We convert it to its derivative function. From there we have our rate of change at point 0.72.</p>
<p>When we do math by hand, <strong>we usually have many rules to convert a function to its derivative, and from these find the rate of change for a given point.</strong></p>
<p>Before seeing it, let’s look at a very simple example to understand the definition of a derivative:</p>
<p>$$\frac{d}{dx}f(x) \approx \frac{f(\textcolor{green}{x + h}) - f(\textcolor{red}{x - h})}{\textcolor{green}{x + h} - \textcolor{red}{x - h}} = \frac{f({x + h}) - f({x - h})}{2h}$$</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756767749954/87486d8c-9437-460c-b556-e9333b1590c5.png" alt="Image showing in derivative definition how each component is related visually to a line representing the rate of change" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><code>h</code> represents a small difference.</p>
<p>The derivative is the slope of the function’s small change near a point. In other words, it’s the limit of the rate of change of a given point.</p>
<p>A simple derivative rule (the power rule) looks like this:</p>
<p>$$\frac{d}{dx}x^n = nx^{n-1}$$</p>
<p>Two examples are:</p>
<p>$$\frac{d}{dx}x^3 = 3x^2$$</p>
<p>And:</p>
<p>$$\frac{d}{dx}x^5 = 5x^4$$</p>
<p>There are many more. But we won’t go into deep detail on this topic.</p>
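<p>The limit definition above can be turned directly into code. Here is a minimal numerical sketch of the central difference formula; the step size h = 1e-6 is my choice, not a value from the text:</p>
<pre><code class="language-python">import math

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with the central difference formula."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Check the power rule: d/dx x^3 = 3x^2, so the derivative at x=2 is 12
print(central_difference(lambda t: t**3, 2.0))

# And the sine example from earlier: d/dx sin(x) = cos(x)
print(central_difference(math.sin, 0.72), math.cos(0.72))
</code></pre>
<p>As h shrinks, the approximation approaches the true derivative, which is exactly the limit idea in action.</p>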
<h4 id="heading-where-and-why-are-derivatives-so-important">Where and why are derivatives so important?</h4>
<p>Derivatives are one of the most important math tools out there. They serve as the foundation for understanding change across nearly all fields of STEM.</p>
<p>In physics (classical mechanics), derivatives let us derive new information from quantities we already know.</p>
<p>For example, knowing how a body's position changes over time allows us to use derivatives to find its velocity and acceleration. This is crucial for self-driving cars, trains, rockets, and more.</p>
<p>Also, derivatives are the foundation of understanding how electricity works in depth. Without derivatives, there would’ve been no electromagnetic theory. Without electromagnetic theory, modern technology would not exist.</p>
<p>In machine learning, derivatives are so important that they underpin backpropagation, the training algorithm at the heart of ChatGPT and other AI models.</p>
<p>Neural networks are in fact so important that John Hopfield and Geoffrey Hinton won the 2024 Nobel Prize in Physics for their foundational work on them (Hinton is also one of the researchers who popularized backpropagation).</p>
<p>Also, autonomous vehicles like Tesla and Waymo use AI models called neural networks that depend on backpropagation to work.</p>
<p>It’s awesome that a math concept created in the 17th century is now one of the foundations of the current AI revolution.</p>
<h3 id="heading-what-about-integral-calculus">What About Integral Calculus?</h3>
<p>Before explaining integrals, I will ask you a question:</p>
<p>How can we find the area of the below shape?</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764401826343/2583b3b0-0bcd-4204-921e-300b27c9fc3d.png" alt="Image of a finite integral of the function sin(x)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In other words, how can we find the integral of the function over the given interval?</p>
<p>Let’s see how to do it step by step.</p>
<p>First, we’ll try using 2 rectangles to approximate the area behind the curve:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402058848/5023772e-ed0d-4efc-a5cd-3e1a856f6d69.png" alt="Using 2 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area of the rectangles is 6.282573.</p>
<p>But there is still a lot of error…</p>
<p>As we can see, the left rectangle does not completely cover the area under the curve, while the right rectangle covers too much.</p>
<p>So we’ll use more, smaller rectangles to better approximate the curve.</p>
<p>Now let’s try using 4 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764483444354/c06cd1c2-0f92-4728-898e-fbaf1534d57f.png" alt="Using 4 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.497481. But there’s still some error.</p>
<p>As we can see, the error is getting smaller. In other words, the 4 rectangles cover the area of the curve better than just the 2 rectangles. But there’s still a lot of room to make it better.</p>
<p>Let’s try using 8 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402069389/e9ad0576-dd9d-4535-bf3a-4c4bcd77db98.png" alt="Using 8 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.604935.</p>
<p>How about using 16 rectangles?</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402075078/6ad6278f-4b71-411b-8552-2554152a04cb.png" alt="Using 16 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.658662.</p>
<p>Let’s try using 32 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402079649/4e673391-7e7a-4ca3-b07a-22508c5b058e.png" alt="Using 32 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.685525.</p>
<p>Now how about using 64 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402084920/4851d710-ff9d-4562-ba7d-9b759473f577.png" alt="Using 64 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.698957.</p>
<p>And using 128 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402090280/bd5b139c-58e1-4a7a-869d-5107b7eff345.png" alt="Using 128 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.705673.</p>
<p>What about using 256 rectangles:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402098061/3ee50020-0143-42b1-aea7-8c762aa33e53.png" alt="Using 256 rectangles to try to find the area under the curve" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now the area is 6.709031. And the error has become practically zero!</p>
<p>Now let’s see an animation of this:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764402052869/e9a54332-75b5-4e46-90cc-3bc09e636ad3.gif" alt="GIF animation of the rectangles from 2 to 256 to represent the finite integral" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>As you can see, we can approximate the area by taking the limit as the number of rectangles goes to infinity.</p>
<p>This way, we can conclude that:</p>
<p>$$\int_0^{3.14} f(x) \, dx = \int_0^{3.14} (\sin(x) + 1.5) \, dx \approx 6.71$$</p>
<p>This means that the area between 0 and 3.14, limited by the math equation, is 6.71!</p>
<p>Or, mathematically, the integral of f(x) in the interval 0 and 3.14 is 6.71.</p>
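<p>The rectangle experiment above is easy to reproduce in code. Here is a sketch using left-endpoint rectangles; the article's figures may place the rectangles slightly differently, so the intermediate numbers can differ a bit:</p>
<pre><code class="language-python">import math

def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n left-endpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + i * width) for i in range(n)) * width

f = lambda x: math.sin(x) + 1.5

# Doubling the number of rectangles keeps shrinking the error
for n in [2, 4, 8, 16, 32, 64, 128, 256]:
    print(n, round(riemann_sum(f, 0, 3.14, n), 6))

# The exact answer is -cos(3.14) + 1.5 * 3.14 + cos(0), about 6.71
</code></pre>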
<h4 id="heading-where-and-how-is-this-applied">Where and how is this applied?</h4>
<p>In electrical engineering, integrals calculate total energy use in circuits by integrating power over time. For example, when designing a power supply for a device, engineers integrate the power to determine total energy costs and heat absorption requirements.</p>
<p>In other words, they see the area over time and how much power is used.</p>
<p>Let's see an example:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764832775180/911672dd-05ff-47c7-ac5f-81f4933c96ff.png" alt="Image of integral" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Imagine that in the image above:</p>
<ul>
<li><p>The X axis can be the time in months.</p>
</li>
<li><p>The Y axis is the power used in Watts (Joules per second).</p>
</li>
</ul>
<p>We can conclude that in 3.14 months (about 3 months and 4 days), the total amount of energy used is 6.71 watt-months.</p>
<p>Here is the code to find that out:</p>
<pre><code class="language-python"># Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Create Function
x = np.linspace(0, 3.14, 100)
y = np.sin(x) + 1.5

# Find the area under the function
area = np.trapezoid(y, x)

# Show the final image
plt.fill_between(x, y)
plt.title(f'Area = {area:.2f}')
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765435075995/defc251b-812c-44ae-8b67-9a323c0af040.png" alt="Code to find finite integral of the function sin between two points" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, we import the libraries, create the function, find the area, and plot it.</p>
<p>We used numpy.trapezoid to find the area, because it’s a numerical approximation to quickly find the integral of a function between two x values.</p>
<p>numpy.trapezoid uses a numerical approximation method called the <strong>composite trapezoidal rule.</strong></p>
<p>The basic idea of the composite trapezoidal rule is to divide the area under the curve into many trapezoids and sum all of them.</p>
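<p>To make the idea concrete, here is a hand-rolled sketch of the composite trapezoidal rule. NumPy's real implementation is vectorized, and <code>np.trapezoid</code> requires NumPy 2.0 or newer (it was called <code>np.trapz</code> before):</p>
<pre><code class="language-python">import numpy as np

def trapezoid_rule(y, x):
    """Sum the areas of the trapezoids between consecutive sample points."""
    total = 0.0
    for i in range(len(x) - 1):
        width = x[i + 1] - x[i]
        average_height = (y[i] + y[i + 1]) / 2
        total += width * average_height
    return total

x = np.linspace(0, 3.14, 100)
y = np.sin(x) + 1.5

print(trapezoid_rule(y, x))   # hand-rolled result
print(np.trapezoid(y, x))     # NumPy's version gives the same answer
</code></pre>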
<p>If you want to learn more about this, I recommend reading the <a href="https://numpy.org/doc/stable/reference/generated/numpy.trapezoid.html">NumPy documentation on this method</a>.</p>
<p>From this value, we can convert to other units (taking an average month as about 30.44 days):</p>
<ul>
<li><p>About 17.6 million joules</p>
</li>
<li><p>About 4.9 kWh</p>
</li>
</ul>
<p>By converting to other units, we can more easily compare this device with other devices and see if it obeys any technical standards and laws.</p>
<p><strong>This is a real-life application of integrals in engineering.</strong></p>
<p>In my degree, I used this a lot in classes related to power engineering. In simple words, power engineering is a subfield of electrical engineering focused on high-voltage electricity and electric motors.</p>
<p>In audio compression, the Fourier transform (built on integrals) decomposes sound waves into frequency components. MP3 encoders use this to identify and remove frequencies humans can't hear. This reduces file sizes while preserving quality.</p>
<p>Medical imaging relies on the Radon transform, which uses integrals to reconstruct 3D images from 2D X-ray projections. When you get a CT scan, the machine takes hundreds of X-ray "slices" at different angles. During this process, integrals combine "slices" into a detailed cross-sectional image of your body.</p>
<h3 id="heading-applications-in-ai-and-control-theory-calculus-in-action">Applications in AI and Control Theory: Calculus in Action</h3>
<p>Modern AI depends on derivatives through the backpropagation algorithm.</p>
<p>When training a neural network, the system calculates partial derivatives of the error with respect to millions of parameters. This way, it finds out how to adjust each weight to improve performance. Without this, large language models like ChatGPT couldn't learn from data.</p>
<p>PID controllers, which stabilize the temperature in your oven or maintain altitude in aircraft autopilot systems, combine calculus ideas:</p>
<ul>
<li><p>The proportional term responds to the current error.</p>
</li>
<li><p>The integral term accumulates past errors to eliminate steady-state drift.</p>
</li>
<li><p>The derivative term predicts future trends to prevent overshooting.</p>
</li>
</ul>
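<p>To see how the three terms combine, here is a toy discrete-time PID loop in Python. The gains, the time step, and the plant response are made up purely for illustration:</p>
<pre><code class="language-python">def pid_step(error, state, kp=2.0, ki=0.5, kd=1.0, dt=0.1):
    """One update of a discrete PID controller; state is (integral, previous_error)."""
    integral, previous_error = state
    integral += error * dt                      # I: accumulate past errors
    derivative = (error - previous_error) / dt  # D: predict the trend
    output = kp * error + ki * integral + kd * derivative
    return output, (integral, error)

# Toy plant: push a temperature toward a setpoint of 100 degrees
temperature, setpoint = 20.0, 100.0
state = (0.0, setpoint - temperature)
for _ in range(200):
    error = setpoint - temperature
    control, state = pid_step(error, state)
    temperature += 0.05 * control  # made-up plant response

print(round(temperature, 1))
</code></pre>
<p>With these made-up numbers the temperature settles close to the setpoint; the point is only to show the proportional, integral, and derivative terms working together.</p>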
<p>And these are just some of the applications of calculus!</p>
<h2 id="heading-chapter-6-probability-amp-statistics-learning-from-uncertainty">Chapter 6: Probability &amp; Statistics - Learning from Uncertainty</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002445093/b606e188-969e-49d8-9be9-9c15330a2939.jpeg" alt="Many purple dice together" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/purple-dices-with-different-geometrical-shape-on-a-white-surface-3649115/">Photo by Armando Are</a></p>
<p>It’s thanks to probabilities and statistics that many industries have grown so much. With statistics, we can make informed decisions and optimize many different processes. With probabilities, we can understand and model uncertainty in systems and, in this way, solve or even avoid problems.</p>
<p>While you may be familiar with some of the key concepts like median and mean, we’ll start with some basics to build up your intuition on more advanced stuff like the central limit theorem, Bayes’ theorem, and Markov chains.</p>
<h3 id="heading-mean-median-mode-measuring-central-tendency">Mean, Median, Mode: Measuring Central Tendency</h3>
<p>Let's imagine you are a data scientist working in research. You’re going to work with data to optimize the output of farms in the Central Valley in California.</p>
<p>The idea is to take in a bunch of data, and by studying it, you can help farmers make better decisions.</p>
<p>Here’s the data from one year of activity:</p>
<table>
<thead>
<tr>
<th>Farm</th>
<th>Yield (tons/ha)</th>
<th>Fertilizer Used (kg/ha)</th>
<th>Rainfall (mm)</th>
</tr>
</thead>
<tbody><tr>
<td>A</td>
<td>4.2</td>
<td>150</td>
<td>280</td>
</tr>
<tr>
<td>B</td>
<td>5.8</td>
<td>220</td>
<td>420</td>
</tr>
<tr>
<td>C</td>
<td>3.9</td>
<td>120</td>
<td>230</td>
</tr>
<tr>
<td>D</td>
<td>6.1</td>
<td>250</td>
<td>480</td>
</tr>
<tr>
<td>E</td>
<td>4.7</td>
<td>200</td>
<td>340</td>
</tr>
<tr>
<td>F</td>
<td>5.3</td>
<td>200</td>
<td>390</td>
</tr>
</tbody></table>
<p>We have 6 farms in our dataset. For each farm, we know:</p>
<ul>
<li><p>How much yield was obtained in tons per hectare</p>
</li>
<li><p>How much fertilizer was used in kilograms per hectare</p>
</li>
<li><p>How much rainfall happened during a year of activity</p>
</li>
</ul>
<p>Now, let’s answer some questions we might have about the data to understand the <strong>mean</strong>, <strong>mode</strong> and <strong>median</strong>:</p>
<h4 id="heading-1-what-is-the-average-yield-during-one-year-of-activity">1. What is the average yield during one year of activity?</h4>
<p>To find the average, we just need to sum all the yield values and divide by the number of farms. Like this:</p>
<p>$$\text{Mean} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$</p>
<p>This is what is called the mean. The mean is just the sum of all values divided by how many values there are.</p>
<p>In Python, we can do the following to calculate the mean:</p>
<pre><code class="language-python">def calculate_mean(values):
    return sum(values) / len(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_mean(data)
print(f"Mean: {result}")
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763102054838/b5619d92-95ca-4c50-bb32-39d6e8e7ba7b.png" alt="Python code in an image showing how to find the mean" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h4 id="heading-2-what-is-the-mode-of-fertilizer-used">2. What is the mode of fertilizer used?</h4>
<p>The mode is just the most popular value in a given dataset. In our case, it’s <strong>200</strong> since that’s the most common value that appears in our farm dataset.</p>
<p>In Python, we can do this to calculate the mode:</p>
<pre><code class="language-python">import statistics

def calculate_mode(values):
    return statistics.mode(values)

# Example usage
data = [150, 220, 120, 250, 200, 200]
result = calculate_mode(data)
print(f"Mode: {result}")
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763102576660/3ca71e03-f762-44ad-85c3-8ccb4cb1db54.png" alt="Python code in an image showing how to find the mode" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h4 id="heading-3-what-is-the-median-of-the-yield">3. What is the median of the yield?</h4>
<p>The median is just the value in the middle of a set of numbers. If the number of elements in the list is even, we take the mean of the two middle numbers. Here are our current yield values:</p>
<p>$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$</p>
<p>First, we sort the values:</p>
<p>$$3.9, 4.2, 4.7, 5.3, 5.8, 6.1$$</p>
<p>Since we have 6 values (even number), the median is the average of the two middle values:</p>
<p>$$\text{Median} = \frac{4.7 + 5.3}{2} = \frac{10}{2} = 5$$</p>
<p>In Python we can do this to calculate the median:</p>
<pre><code class="language-python">import statistics

def calculate_median(values):
    return statistics.median(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_median(data)
print(f"Median: {result}")
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763102389405/52e5009b-6bc8-42c5-b8da-efe8c372fe96.png" alt="Python code in an image showing how to find the median" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-variance-and-standard-deviation-measuring-spread">Variance and Standard Deviation: Measuring Spread</h3>
<p>Knowing the mean, mode, and median of data is helpful. But it’s also important to know how far away data points are from each other.</p>
<p>That’s where measures of <a href="https://en.wikipedia.org/wiki/Statistical_dispersion">dispersion</a> come in. Variance tells us, on average, how far numbers are from the mean.</p>
<p>Let’s see an example of how to calculate this.</p>
<p>Given yield data from the table:</p>
<p>$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$</p>
<p>The first step is to calculate the mean:</p>
<p>$$\bar{x} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$</p>
<p>The second step is to calculate the variance with the sample variance formula:</p>
<p>$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$</p>
<p>Let's apply the formula little by little to understand how it works.</p>
<p>First, we calculate the squared difference between each yield data point and the mean:</p>
<p>$$\begin{align*} (4.2 - 5.0)^2 &amp;= (-0.8)^2 = 0.64 \\ (5.8 - 5.0)^2 &amp;= (0.8)^2 = 0.64 \\ (3.9 - 5.0)^2 &amp;= (-1.1)^2 = 1.21 \\ (6.1 - 5.0)^2 &amp;= (1.1)^2 = 1.21 \\ (4.7 - 5.0)^2 &amp;= (-0.3)^2 = 0.09 \\ (5.3 - 5.0)^2 &amp;= (0.3)^2 = 0.09 \end{align*}$$</p>
<p>Then we will sum all the squared differences:</p>
<p>$$\sum(x_i - \bar{x})^2 = 0.64 + 0.64 + 1.21 + 1.21 + 0.09 + 0.09 = 3.88$$</p>
<p>Now, we will finally find the variance:</p>
<p>$$s^2 = \frac{3.88}{6-1} = \frac{3.88}{5} = 0.776$$</p>
<p>The standard deviation is just the square root of the variance.</p>
<p>$$s = \sqrt{s^2} = \sqrt{0.776} \approx 0.881 \text{ tons/ha}$$</p>
<p>Why is this useful?</p>
<p>It puts the spread back into the same units as the data, making it easier to interpret.</p>
<p>A small standard deviation means the data huddles close to the mean, while a large one means it’s widely scattered.</p>
<p>And here is a code example of how to calculate both:</p>
<pre><code class="language-python">import statistics

def calculate_variance_and_std(values):
    variance = statistics.variance(values)
    std_dev = statistics.stdev(values)
    return variance, std_dev

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
variance, std_dev = calculate_variance_and_std(data)
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763102806607/a8236667-e4b0-48a5-9171-544c4b94096e.png" alt="Python code in an image showing how to find the variance and standard deviation" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-what-is-the-normal-distribution-the-bell-curve-of-life">What Is the Normal Distribution? The Bell Curve of Life</h3>
<p>The normal distribution tells us how data naturally converges around the average value. Most values cluster around the center, while extreme values fall toward the edges. This creates a bell curve.</p>
<p>By understanding this distribution, we can understand other distributions and also the central limit theorem.</p>
<p>To understand what normal distribution is, let’s look at it:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529094535/f90ffdb8-543e-4d1f-9627-335e8f356512.png" alt="Image representing the normal distribution" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The normal distribution looks like a mountain.</p>
<p>As you can see, most values are around the mean. Also, in and around the mean is the peak. Toward the extremes, the curve gets lower and lower. This means that in the extremes there are fewer and fewer values.</p>
<p>Normal distribution also has a formula associated with it:</p>
<p>$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$</p>
<p>I won’t go in depth into how the formula works here. I just want you to understand the main idea behind the concept.</p>
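<p>To make the formula less abstract, here is a minimal sketch that evaluates the density with Python's <code>math</code> module (the function name <code>normal_pdf</code> is just for illustration):</p>

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal distribution's density at x."""
    coefficient = 1 / math.sqrt(2 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The density peaks at the mean and falls off toward the extremes
print(normal_pdf(0))  # highest value, at the mean
print(normal_pdf(2))  # much smaller, two units away
```

<p>Notice that the output for values far from the mean shrinks rapidly, which is exactly the bell shape in the figure above.</p>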
<p>There are many other distributions besides the normal distribution. Some of the most common are:</p>
<ul>
<li><p>Chi-squared distribution</p>
</li>
<li><p>Student’s t distribution</p>
</li>
<li><p>Bernoulli distribution</p>
</li>
<li><p>Binomial distribution</p>
</li>
<li><p>Poisson distribution</p>
</li>
</ul>
<p>Each distribution can model different events and phenomena. For example, the chi-squared distribution is widely used to test for an association between two phenomena (sunburns and skin cancer, for example).</p>
<p>The Poisson distribution is used to model counts of events, like the number of clients that enter a store per hour or the number of data packets transmitted over an Ethernet cable.</p>
<p>But it’s also possible to approximate a lot of distributions to the normal distribution using one of the most important theorems in all of mathematics: the central limit theorem. This is what we will explore next.</p>
<h3 id="heading-how-the-central-limit-theorem-helps-approximate-the-world">How the Central Limit Theorem Helps Approximate the World</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902263857/9a03bb38-a7b9-4ef0-93f2-a7e0d80bd249.jpeg" alt="Person holding a small version of the world in their hand" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/person-holding-world-globe-facing-mountain-346885/">Porapak Apichodilok</a></p>
<p>The main idea of the central limit theorem is very simple:</p>
<ul>
<li>The averages (and sums) of samples from most distributions tend toward the normal distribution as the sample size grows.</li>
</ul>
<p>This is just like pouring sand into a funnel. Grains may fall randomly, but over time the pile of sand will&nbsp;always begin to form the shape of a mountain.</p>
<p>This way, we can take many data points and average them. As we collect more and more of these averages, their distribution converges to a normal distribution.</p>
<p>In other words, when independent random variables are all summed together, their sum tends toward a normal distribution.</p>
<p>Here is the formula:</p>
<p>$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{or equivalently} \quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$$</p>
<p>You don’t need to understand in depth what it means. Just understand that it’s a theorem that approximates other distributions to the normal distribution.</p>
<h4 id="heading-and-why-is-this-important">And why is this important?</h4>
<p>Because this theorem makes many billion-dollar industries possible.</p>
<p>Instead of testing every single possible scenario, we can test a smaller number of scenarios and infer that if it works for the smaller set, it will work for the larger one.</p>
<p>For example, in telecommunications, instead of testing every possible phone call or data transmission, we can just test a few connections. If it works for those few connections, we can assume it will work for millions of phone and data transmissions.</p>
<p>For clinical trials, instead of testing a drug on millions of people, we can just test a smaller number of patients. If it works for a (relative) few patients, we can assume it will work on most people with the same condition.</p>
<p>Without this idea, clinical trials would not be possible. The same with telecommunications and so many other areas of engineering.</p>
<h3 id="heading-bayes-theorem-learning-from-evidence">Bayes Theorem: Learning from Evidence</h3>
<p>Now we’ll start looking at probability more in depth based on the data table we have been using.</p>
<p>Here’s the table again so that you can reference it more easily:</p>
<table>
<thead>
<tr>
<th>Farm</th>
<th>Yield (tons/ha)</th>
<th>Fertilizer Used (kg/ha)</th>
<th>Rainfall (mm)</th>
</tr>
</thead>
<tbody><tr>
<td>A</td>
<td>4.2</td>
<td>150</td>
<td>280</td>
</tr>
<tr>
<td>B</td>
<td>5.8</td>
<td>220</td>
<td>420</td>
</tr>
<tr>
<td>C</td>
<td>3.9</td>
<td>120</td>
<td>230</td>
</tr>
<tr>
<td>D</td>
<td>6.1</td>
<td>250</td>
<td>480</td>
</tr>
<tr>
<td>E</td>
<td>4.7</td>
<td>200</td>
<td>340</td>
</tr>
<tr>
<td>F</td>
<td>5.3</td>
<td>200</td>
<td>390</td>
</tr>
</tbody></table>
<p>There are a lot of ideas and formulas related to probability. Here, I want to explain the core ones applied in AI and give you a high-level definition of each.</p>
<p>We’ll start with conditional probability, which is foundational to understanding Bayes’ theorem. Then we’ll get to the extended Bayes’ theorem formula.</p>
<p>So, let's get started!</p>
<h4 id="heading-what-is-conditional-probability">What is Conditional Probability?</h4>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766903189931/420cc60a-71cd-4c37-ab0a-f8aebe825ca7.jpeg" alt="Image of a person playing chess with the black pieces" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/black-and-yellow-chess-pieces-3830671/">KOUSHIK BALA</a></p>
<p>Conditional probability is the probability that an event will happen given that another event has already taken place.</p>
<p>Confused? Don't worry! Let's see an example:</p>
<p>Let’s say that:</p>
<ul>
<li><p>A = Farm has rainfall above or equal to 400 mm</p>
</li>
<li><p>B = Farm has a yield above or equal to 5.0 tons/ha</p>
</li>
</ul>
<p>Here is the formula for Conditional Probability:</p>
<p>$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$</p>
<p>Now let’s look at each part of this formula in detail:</p>
<p>$$P(A)$$</p>
<p>This represents the probability that a farm has rainfall above or equal to 400 mm.</p>
<p>We have 6 farms, and 2 of them (farm B and D) have a rainfall above or equal to 400 mm.</p>
<p>So, the probability that a farm has rainfall above or equal to 400 mm is:</p>
<p>$$P(A) = \frac {2}{6} = \frac {1}{3} ≈ 0.33$$</p>
<p>Now let’s see for event B:</p>
<p>$$P(B)$$</p>
<p>This represents the probability that a farm has a yield above or equal to 5.0 tons/ha.</p>
<p>We have 6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha.</p>
<p>So, the probability that a farm has a yield above or equal to 5.0 tons/ha is:</p>
<p>$$P(B) = \frac {3}{6} = \frac {1}{2} = 0.5$$</p>
<p>What if we want the probability of both conditions being true at the same time?</p>
<p>$$P(A \cap B)$$</p>
<p>This refers to the probability of A and B being both true.</p>
<p>In our example, it means the probability that a farm both has rainfall above or equal to 400 mm <strong>and</strong> a yield above or equal to 5.0 tons/ha.</p>
<p>We have:</p>
<ul>
<li><p>6 farms, 2 of which (farms B and D) have rainfall above or equal to 400 mm</p>
</li>
<li><p>6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha</p>
</li>
</ul>
<p>Only 2 farms (B and D) satisfy both conditions at once.</p>
<p>This way:</p>
<p>$$P(A \cap B) = \frac {2}{6} = \frac {1}{3} ≈ 0.33$$</p>
<p>Now we’re ready to find out the conditional probability:</p>
<p>$$P(A|B)$$</p>
<p>This means the probability of A, knowing that B is true.</p>
<p>In our example, we can conclude that:</p>
<p>$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.33}{0.5} = 0.66$$</p>
<p>So, the probability that a farm has rainfall above or equal to 400 mm – knowing that it has a yield above or equal to 5.0 tons/ha – is 0.66.</p>
<h4 id="heading-bayes-theorem">Bayes’ Theorem</h4>
<p>This is one of the most important theorems in mathematics.</p>
<p>Bayes’ theorem is a formula that tells us how to change the probability of a prediction when new verified data becomes available.</p>
<p>In other words, it’s like a rule that tells us how to update our beliefs when new evidence appears.</p>
<p>Now, based on what we already know, let’s see how Bayes’ Theorem works.</p>
<p>Here is its formula:</p>
<p>$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}$$</p>
<p>Now, based on the previous values, we can very easily find the probability of B, given that A is true.</p>
<p>In other words, the probability that a farm has a yield above or equal to 5.0 tons/ha given that it has rainfall above or equal to 400 mm.</p>
<p>Let’s find the answer:</p>
<p>$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} = \frac{0.66 \cdot 0.5}{0.33} = 1$$</p>
<p>So, the probability that a farm has a yield above or equal to 5.0 tons/ha, knowing it rained 400 mm or more, is 100%. That makes sense: both farms with at least 400 mm of rainfall (farms B and D) also have yields of at least 5.0 tons/ha.</p>
<p>Now that we’ve gone through this formula step by step, hopefully it doesn’t feel as complex.</p>
<h4 id="heading-where-is-this-applied-in-real-life">Where is this applied in real life?</h4>
<p>As with many math ideas in this book, Bayes' Theorem has applications in many business sectors.</p>
<p>For example, what is the best way to make a control system for a self-driving car, robot, or really any other device?</p>
<p>One effective approach is to use a <a href="https://en.wikipedia.org/wiki/Kalman_filter">Kalman filter</a>. Kalman filters rely heavily on Bayes' Theorem to handle control systems with incomplete data.</p>
<p>Kalman filters have a lot of applications in engineering. For example, thanks to Kalman filters, commercial jets can fly safely on autopilot.</p>
<p>So as you can see, Bayes’ Theorem is the foundation of many control systems used in risky industries.</p>
<h3 id="heading-what-are-markov-models-predicting-the-next-step-one-step-at-a-time">What Are Markov Models? Predicting the Next Step, One Step at a Time</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902389612/c80d7118-f13d-4f9b-a149-861db3f2037d.jpeg" alt="Image of the hand of a person throwing dice into the air" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/person-about-to-catch-four-dices-1111597/">lil artsy</a></p>
<p>How do you predict the future with math? Markov chains allow you to do this to a certain degree.</p>
<p>For this reason, Markov chains are widely used in science, engineering, economics, and many other areas.</p>
<p>In addition to this, Markov decision processes are a very important foundation for reinforcement learning. Reinforcement learning is a branch of AI where agents learn to make decisions by interacting with an environment to maximize rewards.</p>
<p>In this section, I’ll introduce you to Markov chains and decision processes with an analogy, a plain English explanation, and a code example.</p>
<p>If you want to dive in further, I recommend my <a href="https://www.freecodecamp.org/news/what-is-a-markov-chain/">freeCodeCamp article on the subject</a>.</p>
<h4 id="heading-markov-chain-analogy">Markov Chain Analogy</h4>
<p>Imagine that you want to predict the weather tomorrow, and it <strong>only</strong> depends on the weather today. The weather can be either sunny or rainy.</p>
<p>Here are the probabilities:</p>
<ul>
<li><p>If it's sunny today, there's an 80% chance that it will be sunny again tomorrow, and a 20% chance that it will be rainy.</p>
</li>
<li><p>If it's rainy today, there's a 50% chance that it will be sunny tomorrow, and a 50% chance that it will be rainy.</p>
</li>
</ul>
<p>In this scenario, we can predict future states of the weather based on current states using probabilities.</p>
<p>This idea of predicting the future based solely on probabilities of the present is called a Markov chain.</p>
<p>Here, the states are either sunny or rainy and the probabilities describe the chances of the weather changing based on the current state.</p>
<h4 id="heading-markov-chain-explained-in-plain-english">Markov Chain Explained in Plain English</h4>
<p>A Markov chain describes random processes where systems move between states, and a new state only depends on the current state, not on how it got there.</p>
<p>Mathematically, Markov chains are called stochastic models because they model (simulate) real life events that are random by nature (stochastic).</p>
<p>Markov chains are popular because they are easy to implement and efficient at modeling complex systems.</p>
<p>Another key advantage is their "memoryless" property. This makes them fast to run on computers and powerful for studying random processes and making predictions based on current conditions.</p>
<h4 id="heading-applications-of-markov-chains">Applications of Markov Chains</h4>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902558494/8129d378-5cd8-4fdc-be48-8ba0a34181b7.jpeg" alt="Image of a white square with a dark star inside it, surrounded by many other dark squares" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/shapes-on-a-dark-background-25630338/">Google DeepMind</a></p>
<p>At some level, almost all real-life events are stochastic. In other words, they involve randomness and uncertainty.</p>
<p>This is exactly why they are so widely used.</p>
<p>They can predict the behavior of systems based on current conditions:</p>
<ul>
<li><p>In finance, they are used to model changes in credit ratings and to forecast market regimes.</p>
</li>
<li><p>In genetics, they help understand how proteins change over time (which is important when studying genetic variations).</p>
</li>
</ul>
<p>These real-life examples show how effectively Markov chains can be used to solve real problems in different fields.</p>
<p>In AI, Markov chains are used to model an environment like a factory or home. Modeling an environment with Markov chains is called a Markov decision process.</p>
<p>Using a Markov decision process, it’s possible to use reinforcement learning to create and optimize agents to act in the environment.</p>
<p>Of course, new and better variants of the Markov decision process have appeared over the years. But the key idea here is that it is thanks to Markov decision processes that the basis for reinforcement learning exists.</p>
<p>Reinforcement learning is widely used in advertising systems, logistics, robotics, video games, and many more applications.</p>
<h4 id="heading-types-of-markov-chains">Types of Markov Chains</h4>
<p>There are many types of Markov chains. In this section, we'll only discuss the most important variants.</p>
<ol>
<li>Discrete-Time Markov Chains (DTMCs)</li>
</ol>
<p>In DTMCs, the system changes state at specific time steps. They are called discrete because the state transitions occur at distinct, separate time intervals.</p>
<p>They are used in queuing theory (study of the behavior of waiting lines), genetics, and economics because they are simple to analyze.</p>
<ol start="2">
<li>Continuous-Time Markov Chains (CTMCs)</li>
</ol>
<p>CTMCs differ from DTMCs in that state transitions can occur at any continuous time point, not at fixed intervals.</p>
<p>This makes them stochastic models where state changes happen continuously. This is important in chemical reactions and reliability engineering.</p>
<ol start="3">
<li>Reversible Markov Chains</li>
</ol>
<p>Reversible Markov chains are special. The process of state change is the same whether the direction is forwards or backwards, like rewinding a video and playing it again.</p>
<p>This property makes it easier to tell when a system is stable and to study how it behaves over time. Reversible chains are widely used in statistical physics and economics.</p>
<ol start="4">
<li>Doubly Stochastic Markov Chains</li>
</ol>
<p>Doubly stochastic Markov chains are defined by a transition probability matrix. In the matrix, the sum of the probabilities in each row and each column equals 1.</p>
<p>This means each row and each column represent a valid probability distribution. In other words, each row and column represent a list of chances for different outcomes.</p>
<p>This property is crucial in quantum computing and statistical mechanics.</p>
<p>Thanks to doubly stochastic Markov chains, systems change in a way that preserves probabilities and symmetry, making the modeling and analysis of quantum computing systems far more accurate.</p>
<h4 id="heading-hidden-markov-chains-code-example">Hidden Markov Chains Code Example</h4>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766903059652/ad8c6509-87ae-4978-8b64-24146161d1cb.jpeg" alt="Image of glasses, a MAC computer, and blurry code in it" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Photo by <a href="https://www.pexels.com/photo/data-codes-through-eyeglasses-577585/">Kevin Ku</a></p>
<p>Before we jump into code examples, let’s first understand what Hidden Markov Chains are.</p>
<p>The main idea behind hidden Markov chains is to model systems that have hidden states (states whose values we can’t observe directly), which can only be discovered through observable events.</p>
<p>In other words, hidden Markov chains allow us to predict the behavior of a system by:</p>
<ul>
<li><p>Considering the likelihood of moving from one state to another.</p>
</li>
<li><p>Knowing the probability of observing a certain event from each state</p>
</li>
</ul>
<p>We can understand this by observing how the states change from an indirect point of view.</p>
<p>We may not know the states’ original values. But by knowing the way they change, we can predict what their values will be in the future.</p>
<p>This way, hidden Markov chains are flexible in modeling sequences, capturing both the transitions between hidden states and the observable outcomes.</p>
<p>Because of this, hidden Markov models are used in fields such as engineering, financial modeling, speech recognition, bioinformatics, and many more.</p>
<h4 id="heading-code-example">Code Example:</h4>
<p>In this code example, we’ll see a simple example with synthetic data.</p>
<p>Here is the full code:</p>
<pre><code class="language-python">import numpy as np
from hmmlearn import hmm

# Set random seed for reproducibility
np.random.seed(42)

# Define the HMM parameters
n_components = 2  # Number of states
n_features = 1    # Number of observation features

# Create a Gaussian HMM
model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

# Define initial state probabilities and the transition matrix (rows must sum to 1)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

# Define means and covariances for each state
model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

# Generate synthetic observation data
X, Z = model.sample(100)  # 100 samples

# Create a new HMM instance
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

# Fit the model to the data
new_model.fit(X)

# Print the learned parameters
print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

# Predict the hidden states for the observed data
hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)
</code></pre>
<img src="https://cdn-media-0.freecodecamp.org/2024/06/1.png" alt="Full code example of HMM (Hidden Markov Chain)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now let’s break the code down block by block:</p>
<p><strong>Import libraries and set random seed:</strong></p>
<pre><code class="language-python">import numpy as np
from hmmlearn import hmm

np.random.seed(42)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529887680/2440547e-ccf4-4067-83c2-20fafb16f045.png" alt="Code example of HMM (Hidden Markov Chain) - Import libraries and set random seed" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this block of code, we imported two Python libraries:</p>
<ul>
<li><p><a href="https://numpy.org/">NumPy</a>: For numerical operations.</p>
</li>
<li><p><a href="https://hmmlearn.readthedocs.io/en/latest/index.html">hmmlearn</a>: For hidden Markov model implementation.</p>
</li>
</ul>
<p>Next we defined a random seed with the NumPy library. A random seed is a value used to start a pseudorandom number generator.</p>
<p>With a fixed random seed, we can ensure that the sequence of pseudorandom numbers generated is always the same. This allows us to duplicate experiments and verify results.</p>
<p>The specific value of the seed doesn’t matter as long as it remains consistent.</p>
<p><strong>Define the HMM parameters and create a Gaussian HMM:</strong></p>
<pre><code class="language-python">n_components = 2  # Number of states
n_features = 1    # Number of observation features

model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529894398/094ac272-2788-4856-a984-b1f687464e90.png" alt="Code example of HMM (Hidden Markov Chain) - Define the HMM parameters and create a Gaussian HMM" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code block, we created an HMM with two hidden states and a single observed variable.</p>
<p><code>covariance_type="diag"</code> means the covariance matrices (which describe how variables change together) are diagonal. In other words, each observation feature is assumed to vary independently of the others, so the probability distribution of each feature is independent.</p>
<p>But there is still one thing that may seem strange about how we defined the hidden Markov chain:</p>
<p><strong>What does “Gaussian“ mean?</strong></p>
<p>This is a very big topic in statistics, but in a few words, Markov chains can only be created when we specify the transition probabilities (chances of moving from one state to another in a Markov chain) and an initial probability distribution.</p>
<p>A Gaussian HMM assumes the observations emitted from each hidden state follow a Gaussian distribution, also called a normal distribution!</p>
<p>And recall, we have already seen before what a normal distribution is.</p>
<p>Here it is again:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529107399/e51cb7a3-e751-45c7-8164-c07795ad32e1.png" alt="Code example of HMM (Hidden Markov Chain) - Image of normal distribution" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>From a normal distribution and other components, we can create a hidden Markov chain. And hidden Markov chains serve as a foundation for systems that affect millions of lives.</p>
<p><strong>Define transition matrix, means, and covariances for each state:</strong></p>
<pre><code class="language-python">model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529901607/53442504-bcec-46d0-8114-fcd627947576.png" alt="Code example of HMM (Hidden Markov Chain) - Define transition matrix, means, and covariances for each state" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-python">model.startprob_ = np.array([0.6, 0.4])
</code></pre>
<p>This line sets the initial state probabilities for a Hidden Markov Model (HMM). It specifies that there is a 60% probability of starting in state 0 and a 40% probability of starting in state 1.</p>
<pre><code class="language-python">model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])
</code></pre>
<p>This line of code sets the state transition probability matrix for the HMM.</p>
<p>The matrix specifies the probabilities of moving from one state to another:</p>
<ul>
<li><p>From state 0, there is a 70% chance of staying in state 0 and a 30% chance of transitioning to state 1.</p>
</li>
<li><p>From state 1, there is a 40% chance of transitioning to state 0 and a 60% chance of staying in state 1.</p>
</li>
</ul>
<pre><code class="language-python">model.means_ = np.array([[0.0], [3.0]])
</code></pre>
<p>This line sets the mean values for the observation distributions in each state.</p>
<p>It indicates that the observations are normally distributed with a mean of 0.0 in state 0 and a mean of 3.0 in state 1.</p>
<pre><code class="language-python">model.covars_ = np.array([[0.5], [0.5]])
</code></pre>
<p>This line sets the covariance values for the observation distributions in each state.</p>
<p>It specifies that the variance (covariance in this 1-dimensional case) of the observations is 0.5 for both state 0 and state 1.</p>
<p><strong>Create data, new HMM instance, and fit the model with the data:</strong></p>
<pre><code class="language-python">X, Z = model.sample(100)  # 100 samples

new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

new_model.fit(X)

print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529906427/009804bc-40db-4979-99dd-564935b175cc.png" alt="Code example of HMM (Hidden Markov Chain) - Create data, new HMM instance, and fit the model with the data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, we created a model with 100 samples, iterated it 100 times, and printed the new state transition matrix, means, and covariances.</p>
<p>In other words, we:</p>
<ol>
<li><p>Generated 100 samples from the original model</p>
</li>
<li><p>Fitted a new HMM to these samples.</p>
</li>
<li><p>Printed the learned parameters of this new model.</p>
</li>
</ol>
<p>What do X and Z mean here?</p>
<p>X contains the observed data samples generated by the original model, while Z contains the corresponding sequence of hidden states.</p>
<p>The transition matrix prints out:</p>
<pre><code class="language-python">[[0.8100804  0.1899196 ]
 [0.49398918 0.50601082]]
</code></pre>
<p>This means that the model tends to stay in state 0 and has nearly equal chances of switching or staying when it is in state 1.</p>
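<p>As a quick side computation (my own sketch, not part of the original example), we can use this learned transition matrix to estimate the long-run fraction of time the chain spends in each state, by repeatedly applying the matrix to a starting distribution:</p>
<pre><code class="language-python">import numpy as np

# Transition matrix values copied from the printout above
T = np.array([[0.8100804, 0.1899196],
              [0.49398918, 0.50601082]])

# Power iteration: keep multiplying a distribution by T until it stabilizes
pi = np.array([0.5, 0.5])
for _ in range(1000):
    pi = pi @ T

print(pi)  # long-run share of time spent in state 0 and state 1
</code></pre>
<p>Here the chain spends roughly 72% of its time in state 0, which matches the intuition that the model tends to stay in state 0.</p>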
<p>The means print out:</p>
<pre><code class="language-python">[[0.01577373]
 [3.06245496]]
</code></pre>
<p>This means that the average observed value is approximately 0.016 in state 0 and 3.062 in state 1.</p>
<p>The covariances print out:</p>
<pre><code class="language-python">[[[0.41987084]]
 [[0.53146802]]]
</code></pre>
<p>This means that the variance of the observed values is about 0.420 in state 0 and 0.531 in state 1.</p>
<p>So while we may never observe the exact hidden states directly, we can learn each state’s average observed value, how much its observations vary, and how the states tend to change into each other.</p>
<p><strong>Predict the hidden states for the observed data:</strong></p>
<pre><code class="language-python">hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763529913530/f81b3dbf-f517-4857-ac92-4732a524a621.png" alt="Code example of HMM (Hidden Markov Chain) - Predict the hidden states for the observed data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, based on the X observed data samples, we predicted the new states of the Markov model.</p>
<p>The hidden states print out:</p>
<pre><code class="language-python">[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1
 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0]
</code></pre>
<p>This means that the system alternates between state 0 and state 1, showing how it changes states over time.</p>
<h3 id="heading-applications-in-ai-and-control-theory-making-decisions-under-uncertainty"><strong>Applications in AI and Control Theory: Making Decisions Under Uncertainty</strong></h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002495967/325e5ee4-df14-4adc-a520-0764d89fe8c8.jpeg" alt="Image of many flight instruments in an airplane" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/gray-airplane-control-panel-3402846/">Photo by capt.sopon</a></p>
<p>I have been giving you a high-level overview of the field of probabilities and statistics. As I explained before, I wanted to make the explanations simple to understand.</p>
<p>As someone with a bachelor's degree in electrical and computer engineering, I can assure you that while this chapter seems simple, in probabilities and statistics, things can get very complicated very quickly.</p>
<p>Many more concepts like:</p>
<ul>
<li><p>p-values</p>
</li>
<li><p>Advanced Monte Carlo methods</p>
</li>
<li><p>Bayesian networks</p>
</li>
<li><p>Statistical hypotheses</p>
</li>
</ul>
<p>These are not as straightforward as the ideas I’ve just walked you through.</p>
<p>But as it is, probability and statistics are the starting points for making decisions where uncertainty exists in AI and control theory.</p>
<p>For example, Bayes’ theorem, besides being the foundation of the Kalman filter, is also the foundation of many probabilistic models in AI. Probabilistic models are commonly used by quant firms and banks to model risk.</p>
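<p>To make this concrete, here is a minimal sketch of Bayes’ theorem in action. The scenario and every number in it are made up purely for illustration:</p>
<pre><code class="language-python"># Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical risk question: given that a borrower missed a payment,
# how likely are they to default?
p_default = 0.05                # prior: 5% of borrowers default
p_missed_given_default = 0.80   # defaulters usually miss a payment first
p_missed_given_ok = 0.10        # non-defaulters occasionally miss one too

# Total probability of seeing a missed payment (law of total probability)
p_missed = (p_missed_given_default * p_default
            + p_missed_given_ok * (1 - p_default))

# Posterior: updated belief about default after seeing the missed payment
p_default_given_missed = p_missed_given_default * p_default / p_missed
print(round(p_default_given_missed, 3))  # 0.296
</code></pre>
<p>One missed payment moves the belief from a 5% prior to roughly a 30% posterior. That prior-to-posterior update is exactly the pattern the Kalman filter repeats at every time step.</p>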
<p>In control theory, probabilities and statistics are widely used to design robust control systems (as is the case with Kalman filters).</p>
<p>So as you can see, the application of probabilities and statistics, as with calculus and linear algebra, is the foundation for many tools that impact millions of lives and move billions of dollars in the global economy.</p>
<h2 id="heading-chapter-7-optimization-theory-teaching-machines-to-improve">Chapter 7: Optimization Theory - Teaching Machines to Improve</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002637327/9dea740c-4582-42bf-95a6-1230b7e9092d.jpeg" alt="Black and white image of many railways originating from a single one" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/railroad-tracks-in-city-258510/">Photo by Pixabay</a></p>
<p>This is the most advanced math chapter of the book. To truly understand it, it’s very important that you’ve read the other chapters first.</p>
<p>We’re going to examine a few machine learning methods, and I’ll show you some recipes demonstrating how machine learning is just the combination of linear algebra, calculus, probability and statistics, and optimization theory.</p>
<p>Just like making a cake!</p>
<h3 id="heading-what-is-optimization-theory">What is Optimization Theory?</h3>
<p>In AI, optimization theory is responsible for the algorithms that optimize data-driven AI models.</p>
<p>Often, big companies invest millions in research to create or refine algorithms that make training AI models faster.</p>
<p>This way, companies save far more money than the upfront research costs when scaling to train multiple large AI models.</p>
<p>It is thanks to optimization theory that deep learning was able to scale efficiently, eventually leading to the creation of ChatGPT and many other large language models.</p>
<p><strong>But why is that?</strong></p>
<p>In all data-driven machine learning models, there is a learning phase that has to happen. That is, there’s a period where the algorithms make predictions that are not correct and then need to change some parameters to make sure the next predictions are correct – or at least closer to being correct.</p>
<p>Without optimization, machine learning algorithms never make progress toward the right solution. They waste time on a learning path that won’t improve their ability to make correct predictions.</p>
<p>So, let’s start learning!</p>
<h3 id="heading-why-optimization-drives-learning-in-ai">Why Optimization Drives Learning in AI</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766903297889/4075d065-9b55-42e2-a6f6-8aae02de940f.jpeg" alt="Image of a very cute white robot" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/high-angle-photo-of-robot-2599244/">Photo by Alex Knight</a></p>
<p>Optimization theory is the mathematical foundation that allows algorithms to improve their performance over many iterations.</p>
<p>When we combine an algorithm with a way to change its parameters to meet a certain objective (an optimization method), we get a machine learning algorithm.</p>
<p>This learning process always involves minimizing or maximizing a certain objective. For many machine learning algorithms, the main objective is to minimize errors. To do this, over many iterations, the optimization method "tells" the internal components of the algorithm what to change after receiving feedback on how well it’s performing.</p>
<p>It’s like someone first learning how to drive a car. The first few times, it may be complicated. But after a while and some practice, the driver learns how to drive properly and not make the same mistakes they once did in the past with the help of the instructor.</p>
<p>The same applies to optimization methods when optimizing algorithms.</p>
<h4 id="heading-types-of-optimization-theory-methods-in-ml-and-deep-learning">Types of Optimization Theory Methods in ML and Deep Learning</h4>
<p>The field of optimization theory is huge! Just as with many fields of mathematics, it is constantly growing every year.</p>
<p>But for the purposes of this book, there are three main categories of optimization methods:</p>
<ol>
<li><strong>First-Order Methods</strong></li>
</ol>
<p>These are the most used in deep learning and in all LLMs like Gemini, Grok, and others.</p>
<p>They are called first-order methods because they all use the first derivative of functions. The first derivative of a function measures how much a function's output changes when its input changes very little. The most widely used in deep learning are advanced variants of gradient descent.</p>
<p>While there are many variants, here are some popular examples:</p>
<ul>
<li><p>Standard batch gradient descent</p>
</li>
<li><p>Stochastic gradient descent</p>
</li>
<li><p>Mini-batch gradient descent</p>
</li>
<li><p>RMSprop</p>
</li>
<li><p><strong>Adam</strong></p>
</li>
</ul>
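<p>All of these variants build on the same basic update rule: move each parameter a small step against its derivative. Here is a minimal sketch (a toy example of my own, not taken from any library) of plain gradient descent minimizing a one-variable function:</p>
<pre><code class="language-python"># Minimize f(x) = (x - 3)**2, whose minimum is at x = 3.
# The first derivative f'(x) = 2 * (x - 3) gives the direction of change.
x = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)            # first derivative at the current x
    x = x - learning_rate * gradient  # step against the gradient

print(round(x, 4))  # 3.0
</code></pre>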
<p>In this chapter, we will look in depth at one of these methods called <strong>Adam</strong> (below).</p>
<ol start="2">
<li><strong>Second-Order Methods</strong></li>
</ol>
<p>They are called second-order methods because they use information from second derivatives for better updates. There are many methods, like:</p>
<ul>
<li><p>BFGS</p>
</li>
<li><p>L-BFGS</p>
</li>
<li><p>Newton's method</p>
</li>
</ul>
<p>But these are not used as often in machine learning and deep learning. While they converge in fewer iterations, they are very computationally expensive for the kind of high-dimensional optimization problems that AI algorithms create.</p>
<p>So they’re not widely used like first-order optimization methods.</p>
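<p>To see why second-order methods need fewer iterations, here is a sketch (again a toy example of mine) of Newton’s method on the same kind of one-variable function. Because it also uses the second derivative, it lands on the minimum of a quadratic in a single step:</p>
<pre><code class="language-python"># Minimize f(x) = (x - 3)**2 with Newton's method:
# update rule: x = x - f'(x) / f''(x)
x = 0.0
first_derivative = 2 * (x - 3)   # f'(x) at the current point
second_derivative = 2.0          # f''(x), constant for a quadratic

x = x - first_derivative / second_derivative
print(x)  # 3.0
</code></pre>
<p>The catch: for a model with millions of parameters, the "second derivative" becomes a huge matrix (the Hessian), and computing or inverting it is what makes these methods so expensive.</p>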
<ol start="3">
<li><strong>Zeroth-Order and Other Methods</strong></li>
</ol>
<p>These methods do not require derivatives to optimize algorithms. Some examples of algorithms where derivatives are not used are:</p>
<ul>
<li><p>Genetic algorithms</p>
</li>
<li><p>Dynamic programming algorithms</p>
</li>
<li><p>Particle swarm optimization methods</p>
</li>
</ul>
<p>The problem with these algorithms is that they are often very slow for many variables.</p>
<p>But in certain AI contexts, they can help optimize the architecture of deep learning models to improve AI models from an architectural point of view (instead of a parameter point of view).</p>
<h4 id="heading-how-does-optimization-theory-connect-with-linear-algebra-calculus-and-probability-and-statistics">How does optimization theory connect with linear algebra, calculus, and probability and statistics?</h4>
<p>Essentially:</p>
<ul>
<li><p>Calculus teaches you derivatives, which help you understand optimization theory.</p>
</li>
<li><p>Linear algebra teaches you matrices, which help you understand how different states relate and transform.</p>
</li>
<li><p>Probability and statistics teach you concepts like covariance and correlation, which help you understand how variables are connected with each other.</p>
</li>
</ul>
<p>This way, with linear algebra and probability and statistics, you gain the knowledge necessary to understand the algorithms. With calculus you gain the basis to understand optimization theory and how it changes certain parameters of the fundamental algorithms to minimize/maximize a certain objective.</p>
<h3 id="heading-simple-optimization-techniques-how-machines-learn-step-by-step">Simple Optimization Techniques: How Machines Learn Step by Step</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002727335/a265939c-dea8-4763-8861-7c7a0dbe1081.jpeg" alt="Image of a Star Wars blue and white robot" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/star-wars-r2-d2-2085831/">Photo by LJ Checo</a></p>
<p>Now, we’re going to see examples of machine learning algorithms used for optimization and deconstruct them so that you can understand how these areas of mathematics apply to them.</p>
<p>In each example, I will explain their main idea with an analogy as well as how each math area is used in each algorithm.</p>
<h4 id="heading-linear-regression">Linear Regression</h4>
<p>Imagine that you are solving a puzzle. To complete the puzzle, you need to arrange the pieces in the right design/order.</p>
<p>The same idea applies to linear regression.</p>
<p>We have matrices (linear algebra) that represent the parameters of the linear regression model and the data that flow into it.</p>
<p>And we can see over time how well the line is fitting the numbers, as well as its error (probabilities and statistics).</p>
<p>To find the best line for the linear regression, we need to know how much the parameters of the model need to change (calculus) and actually apply that change to the parameters (optimization theory).</p>
<p>This way, calculus tells us which direction to change the parameters, and optimization theory tells us how much to actually change them.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764295886800/0c5efd95-9368-4b68-b945-ff911632ca4c.gif" alt="GIF animation of linear regression working over many iterations" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let’s see how to code the linear regression above:</p>
<pre><code class="language-python">import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []

for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335029715/f77be0d9-ea3d-48f1-8cb5-f4806d1295e6.png" alt="Linear regression code example - full code example" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let’s see the code block by block:</p>
<p><strong>Import library:</strong></p>
<pre><code class="language-python">import numpy as np
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335026504/94989760-bb16-4469-947e-eba7bd25b5be.png" alt="Linear regression code example - Import library" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>For this problem, we’ll import one of the most used Python libraries: NumPy (which we’ve worked with earlier in the book).</p>
<p><strong>Create data points:</strong></p>
<pre><code class="language-python">np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335038511/59e01c3d-27bf-4e6c-8500-9178f1ff569f.png" alt="Linear regression code example - Create data points" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this code, we define a base line that will help in generating the data points:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765336338665/caa859d0-92cb-424e-8eb2-292093c24355.png" alt="Linear regression code example - image of green base line that will help in generating the data points" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-python">X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
</code></pre>
<p>After this green line has been created, we will add noise to it to create the data points:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765336395290/80849617-9489-471d-88f6-fb2aaea5b385.png" alt="Linear regression code example - image of a green baseline that will help in generating the data points with blue dots added by introduced noise" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-python">noise = np.random.normal(0, 2, 50)
y = y_true + noise
</code></pre>
<p>This is how we defined the data points for the line dataset.</p>
<p><strong>Initializing linear regression parameters and others:</strong></p>
<pre><code class="language-python">w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335044810/72a775ee-9929-488d-b05e-ab5d32d6b031.png" alt="Linear regression code example - Initializing linear regression parameters and others" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this block of code, we initialize:</p>
<ul>
<li><p>Linear regression parameters: Weight to be 0.1 and bias to be 0.5</p>
</li>
<li><p>One hyperparameter: Learning rate</p>
</li>
<li><p>How many iterations we are going to use to improve the linear regression</p>
</li>
<li><p>An array called saved_states to store values to later create graphs</p>
</li>
</ul>
<p>This way, we start with this red line:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765336283612/d7bb34b5-aefc-4565-bed2-d2819bc449df.png" alt="Linear regression code example - initializing linear regression parameters and line to fit data points starting with near zero slope" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Making the linear regression learn with the data:</strong></p>
<pre><code class="language-python">for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335055978/2395671a-d873-4bd1-bfa0-349cc6c7be65.png" alt="Linear regression code example - Making the linear regression learn with the data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>It may appear complicated, but let’s look at it in smaller blocks:</p>
<ul>
<li>For loop</li>
</ul>
<pre><code class="language-python">for epoch in range(max(iterations) + 1):
</code></pre>
<ul>
<li>Making a prediction and checking its error</li>
</ul>
<pre><code class="language-python">y_pred = w * X + b
error = np.mean((y - y_pred) ** 2)
</code></pre>
<p>In this block of the code, we compute the values predicted with the current parameters and measure their error against the real values.</p>
<ul>
<li>Saving current iteration values for future statistics</li>
</ul>
<pre><code class="language-python">if epoch in iterations:
     saved_states.append({
         'epoch': epoch,
         'w': w,
         'b': b,
         'y_pred': y_pred.copy(),
         'error': error
     })
</code></pre>
<p>Here we are just storing the values of the current iteration in the saved_states array so we can generate images later.</p>
<ul>
<li>Finding the gradients</li>
</ul>
<pre><code class="language-python">dw = -2 * np.mean(X * (y - y_pred))
db = -2 * np.mean(y - y_pred)
</code></pre>
<p>In this block of code, we compute the gradient values for the current prediction.</p>
<p>In other words, for the weight and the bias, we find out how much (and in which direction) they need to change to bring the model closer to the data points.</p>
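<p>If you ever want to convince yourself that these gradient formulas are right, a common trick is to compare them against a numerical finite difference. This is a quick sanity check of my own, reusing the same data setup as above:</p>
<pre><code class="language-python">import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 2, 50)
w, b = 0.1, 0.5

def mse(w, b):
    # mean squared error for a given weight and bias
    return np.mean((y - (w * X + b)) ** 2)

# Analytic gradient with respect to w (same formula as in the code above)
dw = -2 * np.mean(X * (y - (w * X + b)))

# Numerical gradient: nudge w a tiny bit and measure how the error changes
eps = 1e-6
dw_numeric = (mse(w + eps, b) - mse(w - eps, b)) / (2 * eps)

print(abs(dw - dw_numeric))  # tiny: both gradients agree
</code></pre>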
<ul>
<li>Updating the parameters values</li>
</ul>
<pre><code class="language-python">w = w - learning_rate * dw
b = b - learning_rate * db
</code></pre>
<p>Finally, we update the weight and the bias with the new values so that the line better approximates the data points:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765335279159/97e4914a-ed8a-4cf7-8155-e7cde0fa7edd.gif" alt="GIF animation of linear regression working over many iterations" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h4 id="heading-neural-networks">Neural Networks</h4>
<p>The same puzzle idea applies to neural networks. Neural networks are algorithmic models inspired by the brain that learn patterns from data. They are part of a machine learning field called deep learning, which uses neural networks to learn complex patterns.</p>
<p>Neural networks are important because they power modern AI applications like:</p>
<ul>
<li><p>Image recognition</p>
</li>
<li><p>Language translation</p>
</li>
<li><p>Chatbots</p>
</li>
</ul>
<p>For example, ChatGPT stands for Chat Generative Pre-trained Transformer. A transformer is a neural network architecture.</p>
<p>If you understand neural networks, you’ll understand the foundations that make ChatGPT work.</p>
<ul>
<li><p>We have matrices (linear algebra) that represent the parameters of the neural network model and the data that flow into it.</p>
</li>
<li><p>And we can know over time how well the neural network model is converging to the dataset, fitting the numbers, and see its error (probabilities and statistics).</p>
</li>
<li><p>Calculus will tell us in which direction the parameters of the neural network need to change.</p>
</li>
<li><p>Optimization theory will tell us how much they need to change.</p>
</li>
</ul>
<p>For example, this is a neural network:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764296443948/e1f46e04-d508-407c-8da6-de8e267a2ba7.png" alt="Image example of a simple neural network" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This model has in total 13 parameters:</p>
<ul>
<li><p>It has 10 lines (connections between circles). These are called weights.</p>
</li>
<li><p>It has 2 circles in the hidden layer and 1 in the output layer. Each circle has one bias.</p>
</li>
</ul>
<p><strong>Big question:</strong></p>
<p>Imagine you work in a bank. You are in charge of deciding who gets credit cards or not. For that, you create the neural network above that takes 4 inputs:</p>
<ul>
<li><p>Income</p>
</li>
<li><p>Credit score</p>
</li>
<li><p>Debt ratio</p>
</li>
<li><p>Bankruptcy history</p>
</li>
</ul>
<p>With this neural network well optimized, you can figure it out!</p>
<p>Very simply, without going into things like activation functions, the network processes the 4 inputs through its weights and biases.</p>
<p>Each connection multiplies the input by its weight. After that, each node adds its bias.</p>
<p>The final output is a number between 0 and 1:</p>
<ul>
<li><p>Numbers close to 0 mean "Not approved"</p>
</li>
<li><p>Numbers close to 1 mean "Approved"</p>
</li>
</ul>
<p>For example, a high income, a good credit score, and no bankruptcy history flow through the neural network and produce 0.92, meaning the application should be approved.</p>
<p>But a low income combined with a history of bankruptcy may produce 0.15, which results in a rejection.</p>
<p>In reality, banking systems use neural networks that take far more well-chosen inputs and make these decisions automatically.</p>
<p>This is precisely how AI can be used for credit approval.</p>
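<p>Here is a small sketch of the forward pass just described, for the 13-parameter network (4 inputs, 2 hidden nodes, 1 output). All the weight, bias, and input values below are made up for illustration; a trained network would learn them from data. I also include ReLU and sigmoid activations, which the simplified description above skipped, so the output actually lands between 0 and 1:</p>
<pre><code class="language-python">import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = np.array([[0.5, -0.3],
               [0.8,  0.2],
               [-0.6, 0.4],
               [-1.0, 0.7]])    # 8 weights: input layer to hidden layer
b1 = np.array([0.1, -0.2])      # 2 biases, one per hidden node
W2 = np.array([[1.2], [-0.9]])  # 2 weights: hidden layer to output
b2 = np.array([0.05])           # 1 bias for the output node

# One applicant: [income, credit score, debt ratio, bankruptcy history],
# already scaled down to small numbers
applicant = np.array([1.5, 1.2, 0.3, 0.0])

hidden = np.maximum(0, applicant @ W1 + b1)  # weights, then biases, then ReLU
score = sigmoid(hidden @ W2 + b2)            # squash to a value between 0 and 1

print(score)  # a single approval score between 0 and 1
</code></pre>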
<p>But a question remains: What is the best way to know how much the parameters need to change?</p>
<p>In the next part, we are going to see the most famous optimization theory algorithm that will help us decide that.</p>
<h3 id="heading-what-is-adam-the-most-popular-way-ai-models-finds-the-best-learning-path">What is Adam? The Most Popular Way AI Models Find the Best Learning Path</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902926221/0b6fbbee-dfda-4a55-bd5d-21215ea33074.jpeg" alt="Image of a mountain" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/green-leafed-trees-during-fog-time-167684/">Photo by Lum3n</a></p>
<p>To optimize neural network based AI models, one of the most popular methods is called Adam, which stands for Adaptive Moment Estimation.</p>
<p>The paper that introduced the method is one of the most influential in the 21st century in machine learning, with thousands of citations. As with all ideas in non-symbolic AI, Adam is a mixture of different math concepts.</p>
<p>It's composed of the ideas of two other optimization methods:</p>
<ul>
<li><p>Momentum Gradient Descent: Accumulates velocity from previous gradients to move faster in consistent directions</p>
</li>
<li><p>Root Mean Square Propagation (RMSProp): Adapts learning rates based on recent gradient magnitudes</p>
</li>
</ul>
<p><strong>Let's understand them with an analogy.</strong></p>
<p>Imagine that you are riding a bicycle down a mountain little by little. You already know the direction thanks to calculus.</p>
<p>But how do you descend safely without losing control or going too slowly?</p>
<p>First, you need to build up speed gradually using past momentum. This is one of the main ideas of momentum gradient descent.</p>
<p>It's also important that you adjust your speed based on the terrain's elevation. This is the main idea of RMSProp.</p>
<p>This way, you can safely accelerate and brake appropriately.</p>
<p>When optimizing a model with Adam, this is the same concept. With Adam, we want to optimize a model in a fast and stable way.</p>
<p>The momentum gradient descent ensures the fast part, and the RMSProp ensures the secure part.</p>
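<p>To make the analogy concrete, here is a minimal sketch of the Adam update rule itself, written from scratch on a one-variable toy problem. The formulas follow the standard Adam update; the function and settings are my own choices for illustration:</p>
<pre><code class="language-python">import math

# Minimize f(x) = (x - 3)**2 with a hand-written Adam loop
alpha = 0.1     # learning rate
beta1 = 0.9     # decay for the momentum term
beta2 = 0.999   # decay for the squared-gradient term
eps = 1e-8      # avoids division by zero

x = 0.0
m = 0.0  # momentum: moving average of gradients (the "speed")
v = 0.0  # RMSProp part: moving average of squared gradients (the "terrain")

for t in range(1, 1001):
    g = 2 * (x - 3)                       # gradient of f at the current x
    m = beta1 * m + (1 - beta1) * g       # build up speed gradually
    v = beta2 * v + (1 - beta2) * g ** 2  # track how steep the terrain is
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)

print(x)  # settles near the minimum at x = 3
</code></pre>
<p>Notice the two moving averages: m is the momentum gradient descent idea, and v is the RMSProp idea. Dividing one by the square root of the other is what lets Adam accelerate on consistent slopes and brake on rough ones.</p>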
<p>Nowadays, for LLMs (which, once again, are just very big neural network models), a variant of Adam called AdamW is more often used.</p>
<p>Now, let's build a code example of using Adam.</p>
<h4 id="heading-code-example">Code example:</h4>
<p>Using Adam, we are going to optimize this neural network based on fake data.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765148552889/28101efb-529f-4828-bb7e-adfbf5202d7f.png" alt="Image of a neural network" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>It will take 4 features:</p>
<ul>
<li><p>Income</p>
</li>
<li><p>Credit score</p>
</li>
<li><p>Debt ratio</p>
</li>
<li><p>Bankruptcy history</p>
</li>
</ul>
<p>And it will tell us if we should or should not approve credit for a given person.</p>
<p>Also, since this book is an introduction to the math of AI, I will not, in this code example, discuss hyperparameter optimization, regularization techniques, and other more advanced topics and good practices.</p>
<p>I want to show why this neural network fails with this data and explain the importance of using great data.</p>
<p>Here is the whole code (and we’ll see each part more in-depth below):</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt

torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)

# Plot the training loss over time
plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765150336432/8bb2eab8-60a1-4a01-babf-1b5b11d9187a.png" alt="Code example of training a neural network - Full code" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Now let’s break it down:</p>
<p><strong>Importing libraries:</strong></p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151014087/80097a4b-6bf2-4af0-94da-7f929cf35d2c.png" alt="Code example of training a neural network - Importing libraries" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this block of code, we are importing code from 3 Python libraries:</p>
<ul>
<li><p><a href="https://pytorch.org/">PyTorch</a>: One of the most popular Python libraries for creating new AI models in AI research</p>
</li>
<li><p><a href="https://lightning.ai/docs/pytorch/stable/">PyTorch Lightning</a>: A PyTorch wrapper that organizes training code and handles repetitive tasks automatically</p>
</li>
<li><p><a href="https://matplotlib.org/">Matplotlib</a>: One of the most popular Python libraries for creating graphs from data</p>
</li>
</ul>
<p><strong>Creating data:</strong></p>
<pre><code class="language-python">torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151040691/a2405e15-8ed0-4988-8b78-724f1bd60347.png" alt="Code example of training a neural network - creating data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this part, we define a seed to make the random numbers reproducible. In other words, when we run the code many times, the same random numbers will be generated.</p>
<p>Next, we create 10,000 credit applications with 4 features in x and their approval decisions in y. After that, we unify everything in the dataset variable.</p>
<p>We’ll use TensorDataset because it allows us to have the 4 features and the target paired together. This way, the data does not get mixed up during training.</p>
<p><strong>Dividing data:</strong></p>
<pre><code class="language-python">train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151063358/8325f2eb-3cf9-4900-909d-545637e20608.png" alt="Code example of training a neural network - Dividing data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this block of code, we divide the data into a training dataset and a validation dataset.</p>
<p>This way, we have one dataset that’s being used to train and find the parameters while comparing results with the validation dataset.</p>
<p>As we can see, 80% of the data will be training data, and 20% of the data will be validation data.</p>
<p><strong>Loading data:</strong></p>
<pre><code class="language-python">train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151090966/a80b2483-0bc3-4693-9b58-36765e4b2da2.png" alt="Code example of training a neural network - Loading data" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Here, we load the data into data loaders for the AI model to use.</p>
<p>This way, we have the data automatically split into small batches and shuffled. So instead of processing all 10,000 data points, the model will be trained on one batch, improved, then another batch, then improved again, and so forth. That makes training go faster.</p>
<p><strong>Creating AI model and training process:</strong></p>
<pre><code class="language-python">class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151116959/d75bd178-24bb-4e5d-b043-c504e280f500.png" alt="Code example of training a neural network - Creating AI model and training process" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This code block may look complicated, but let’s go through it method by method:</p>
<ul>
<li><strong>Creating the class with inheritance:</strong></li>
</ul>
<pre><code class="language-python">class CreditApprovalNet(pl.LightningModule):
</code></pre>
<p>By inheriting from pl.LightningModule, this single line gives us everything we need to define both the model and how it will be trained.</p>
<ul>
<li><strong>init: Builds the model's layers and components:</strong></li>
</ul>
<pre><code class="language-python">    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
</code></pre>
<p>In this section of the code, we are defining the architecture of the AI model.</p>
<ul>
<li><strong>forward: Processes input data through the network to make predictions:</strong></li>
</ul>
<pre><code class="language-python">    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
</code></pre>
<p>In this part of the code, we are defining how data will flow in the AI model based on the architecture defined.</p>
<ul>
<li><strong>training_step: Calculates loss for each batch during training:</strong></li>
</ul>
<pre><code class="language-python">    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
</code></pre>
<p>Here, we are defining how the model will be trained. In other words, how we will find the best parameters for the model to predict well.</p>
<ul>
<li><strong>configure_optimizers: Sets the Adam optimizer with learning rate:</strong></li>
</ul>
<pre><code class="language-python">    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)
</code></pre>
<p>Finally, here we define which optimizer we’ll use to improve the AI model’s parameters step by step.</p>
<p><strong>Training AI model:</strong></p>
<pre><code class="language-python">model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151149824/33cb6ad3-3a5d-4964-ab45-ccfd68cd0521.png" alt="Code example of training a neural network - Training AI model" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In this block of code:</p>
<ul>
<li><p>We create the neural network model in the first line</p>
</li>
<li><p>In the second and third lines, we prepare the training settings and train the model for 100 epochs</p>
</li>
</ul>
<p>This way, in the command line, this appears:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765152230535/3a5a6a13-12b1-4f31-8bec-cfbc830510a6.png" alt="Code example of training a neural network - training an AI model - command line showing number of layers and parameters" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>PyTorch Lightning is printing a summary of the model’s layers and the number of parameters in the AI model!</p>
<p><strong>Seeing results and understanding why they are not good:</strong></p>
<pre><code class="language-python">plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()
</code></pre>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765151210074/3cbecda5-616e-4c3b-a942-2512f81697a1.png" alt="Code example of seeing results and understanding why they are not good:" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Using the Matplotlib library, we plot the results:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765152336092/6cfce900-ffb6-449f-9d5d-827ff71735bb.png" alt="Code example of training a neural network - Plot the training done over time." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>The AI model is not converging.</strong></p>
<p>We can see that because the loss stays near 0.7 over time instead of decreasing.</p>
<p>The main reason the model is not converging well is that there is little to no relationship between the 4 features and the target variable.</p>
<p>In other words, we do not have good data.</p>
<p>The code works perfectly, but this shows the <strong>most important rule in machine learning</strong>: when we create an AI model, the MOST IMPORTANT thing is data.</p>
<p>It does not matter if you use a simple linear regression or a neural network based on transformers or whatever. If you do not have high quality data, the model is not going to perform well.</p>
<p>Even if we use a good optimizer, like Adam, it will not solve the data problem.</p>
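<p>That flat loss near 0.7 is no accident. For binary cross-entropy, a model that always predicts 0.5 (a coin flip) scores exactly ln 2 ≈ 0.693 on every example, which is where the curve is stuck. A quick check in plain Python:</p>

```python
import math

# Binary cross-entropy for one example: -(y*log(p) + (1-y)*log(1-p))
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# With p = 0.5 the loss is ln(2), whether the true label is 0 or 1
print(bce(0, 0.5))  # 0.6931...
print(bce(1, 0.5))  # 0.6931...
```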
<p><strong>Next steps: Common beginner mistakes</strong></p>
<p>I also wrote this exact code example to show you something very important: neural networks are not always the best models to use.</p>
<p>This is a very common beginner mistake. You may start with neural networks for everything, when simpler machine learning methods with a little data preprocessing often do the job well.</p>
<p>For this type of problem, the solution is to first try machine learning methods instead of going to neural networks.</p>
<p>There are many reasons for this, but the main ones are:</p>
<ul>
<li><p>Machine learning methods are simpler and often quicker to train than neural networks</p>
</li>
<li><p>It’s easier to understand how machine learning methods make decisions. In other words, we can follow the reasoning the model used to make a prediction.</p>
</li>
<li><p>With computational learning theory, we can estimate how well certain machine learning models will predict in the future and provide theoretical guarantees about their performance.</p>
</li>
</ul>
<p>Another common mistake is not dividing the data.</p>
<p>To simplify, I created only a training and validation split of the data.</p>
<p>In a serious project, you should always divide it into 3 parts: training, validation, and testing.</p>
<p>With training, you create the model. With validation, you check the model on data it was not trained on while you tune it. With the test dataset, you compare whether the model’s loss is similar to the validation loss or very different. If they are very different, it means the AI model fit the validation dataset but does not generalize to the test dataset.</p>
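<p>Using the same random_split helper from earlier, a three-way split could be sketched like this. The 70/15/15 ratio and the stand-in dataset are illustrative choices, not a fixed rule:</p>

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with the same shape as the credit-approval example
dataset = TensorDataset(torch.randn(10000, 4),
                        torch.randint(0, 2, (10000, 1)).float())

# 70% training, 15% validation, 15% testing
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_ds, val_ds, test_ds = random_split(
    dataset, [train_size, val_size, test_size])
```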
<p>I challenge you to think further about how you could improve this code and to try to make the synthetic data more correlated in order to improve its quality.</p>
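<p>As a starting point for that challenge, one way to inject real signal into the synthetic data is to derive the target from a weighted sum of the features plus a little noise. The weights below are arbitrary, hypothetical choices:</p>

```python
import torch
from torch.utils.data import TensorDataset

torch.manual_seed(42)
X = torch.randn(10000, 4)

# Hypothetical "true" rule: approval depends on a weighted sum of the
# features, so the target is now genuinely correlated with the inputs
weights = torch.tensor([1.5, -2.0, 1.0, 0.5])
logits = X @ weights + 0.1 * torch.randn(10000)
y = (logits > 0).float().unsqueeze(1)  # approve when the score is positive

dataset = TensorDataset(X, y)
```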
<h3 id="heading-applications-in-ai-and-control-theory-of-optimization-theory">Applications of Optimization Theory in AI and Control Theory</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002780396/5aaf78bb-a06a-4d09-b681-a604a323d430.jpeg" alt="Image of a robot hand touching a web" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/robot-pointing-on-a-wall-8386440/">Photo by Tara Winstead</a></p>
<p>Optimization theory serves as the engine behind AI and control systems that shape our lives.</p>
<p>From unlocking your phone with facial recognition to autopilot systems guiding planes, optimization algorithms are constantly at work.</p>
<p>When you ask ChatGPT a question, optimization theory determines the values of billions of parameters during training.</p>
<p>The same is true for all other LLMs like Gemini, Claude, Grok, DeepSeek, and others. All of them contain billions of parameters. The only way to find the best combination of parameters to achieve a certain objective is with optimization theory.</p>
<p>In control theory, many systems like Model Predictive Control (MPC) and adaptive control only work thanks to optimization methods that balance how the internal components of the control system should work together.</p>
<p>Beyond training neural networks and controlling physical systems, optimization powers recommendation systems, resource allocation, and so many other systems.</p>
<p>Some examples are:</p>
<ul>
<li><p>Netflix movie recommendation system</p>
</li>
<li><p>Spotify's song suggestion system</p>
</li>
<li><p>Google systems to reduce data center cooling costs</p>
</li>
<li><p>Quantitative trading firms’ high-frequency trading systems</p>
</li>
</ul>
<p>To end this final chapter, I’ll share this:</p>
<p><strong>It is optimization theory that makes math models into AI models that impact the lives of millions worldwide.</strong></p>
<h2 id="heading-conclusion-where-mathematics-and-ai-meet">Conclusion: Where Mathematics and AI Meet</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765002962447/8cdbc79a-5d9c-406d-bad6-2f2e49566b36.jpeg" alt="Pyramids of Egypt with a camel sitting" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/a-camel-lying-in-the-ground-on-the-background-of-pyramids-18991572/">Photo by AXP Photography</a></p>
<p>When ancient civilizations first carved numbers into clay tablets, they likely didn’t imagine that these symbols would one day allow humanity to create the scientific, technological, and medical marvels we have today.</p>
<p>Yet here we are.</p>
<p>We’re in an era where mathematical ideas developed over many centuries – even millennia – have converged to create artificial intelligence.</p>
<p>Throughout this book, we've traced a path from the most basic math concepts to the cutting edge of AI. We have seen how:</p>
<ul>
<li><p>Matrices compress complex systems into simple forms</p>
</li>
<li><p>Derivatives measure change</p>
</li>
<li><p>Probability helps us navigate uncertainty</p>
</li>
<li><p>Optimization guides algorithms toward better decisions to learn faster.</p>
</li>
</ul>
<p>We’ve also learned how each math field has helped create tools that are responsible for many of the things we take for granted today.</p>
<h3 id="heading-mathematics-is-the-foundation-of-ai">Mathematics is the Foundation of AI</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902825228/e14431de-44da-4e26-a646-5d277c16b073.jpeg" alt="Board with an integral equation in it" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/person-writing-on-white-board-3781338/">Photo by Jeswin Thomas</a></p>
<p>Always remember this: AI is not pure magic or a "being" we don't understand. It’s just the combination of many math ideas working very well together.</p>
<p>When you ask a question of ChatGPT or any other LLM, it generates a response. And in the process of generating that response, there are millions of matrix multiplications happening in seconds.</p>
<p>Or, for example, when a self-driving car decides to stop moving because it’s coming up to a crosswalk, there are a lot of math computations (related to calculus and probability and statistics) working very fast to ensure safety.</p>
<p>The great thing about mathematics is that it’s a common, standard language of logic. No matter the backgrounds of people or where they were born, a derivative will always be a derivative, and the same thing goes for key AI concepts.</p>
<p>This way, scientists and engineers worldwide can improve each other's work because everyone understands the same language.</p>
<h3 id="heading-the-future-on-device-ai-and-the-democratization-of-ai">The Future: On Device AI and the Democratization of AI</h3>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766902760109/02b3f00d-a8df-4546-bf41-c1791cdc5f18.jpeg" alt="Image of a chip" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><a href="https://www.pexels.com/photo/abstract-image-of-a-microchip-with-heatmap-colors-28767589/">Photo by Steve Johnson</a></p>
<p>One shift happening now is the move toward edge AI. That is, AI that runs locally on your phone, computer, and really in all your devices (rather than in distant data centers).</p>
<p>This way, privacy is guaranteed because it runs locally. Waiting times for AI models decrease because no data needs to be sent. AI can be used offline, and costs decrease.</p>
<p>And what about the massive data centers being built all over the world? Those will be used for more products that will help improve the lives of millions of people.</p>
<p>As AI becomes more local and more processing power is freed up from big data centers, new AI innovations will appear, and more benefits will come.</p>
<p>The same way that in the past century every computer got its own networking chip, every device will have (and in some cases, already has) AI accelerators.</p>
<p>And much of it will be thanks to the math you learned in this book.</p>
<h3 id="heading-final-reflections">Final Reflections</h3>
<p>Isaac Newton wrote, "If I have seen further, it is by standing on the shoulders of giants."</p>
<p>Every algorithm you use, every model you train, and every new theorem you learn stands on centuries of mathematical progress. You now stand on the shoulders of those same giants!</p>
<p>Thank you for reading, and happy learning.</p>
<p>Here’s the full book <a href="https://github.com/tiagomonteiro0715/The-Math-Behind-Artificial-Intelligence-A-Guide-to-AI-Foundations">GitHub repository with all the code</a>.</p>
<h3 id="heading-acknowledgements">Acknowledgements</h3>
<p>First and foremost, I would like to thank <a href="https://www.linkedin.com/in/guilherme-mendes-a416b7206/"><strong>Guilherme Mendes</strong></a>, currently a Master’s student in Electrical and Computer Engineering at NOVA University, specializing in Control Theory, for reviewing the mathematical and technical details of the 1st version of this book.</p>
<p>I am also grateful to the organizations that gave me opportunities to grow:</p>
<ul>
<li><p><a href="https://www.fct.unl.pt/en">NOVA School of Science and Technology</a></p>
</li>
<li><p><a href="https://ieee-pt.org/">IEEE Portugal Section</a></p>
</li>
<li><p><a href="https://www.siliconvalleyfellowship.com/">Silicon Valley Fellowship</a></p>
</li>
<li><p><a href="https://www.northeastern.edu/">Northeastern University</a></p>
</li>
<li><p><a href="https://best.eu.org/index.jsp">BEST and BEST Almada</a></p>
</li>
<li><p><a href="https://magmastudio.pt/">Magma Studio</a></p>
</li>
</ul>
<p>A special thank you goes to the freeCodeCamp editorial team, especially Abigail Rennemeyer, for their patience and for reviewing every chapter of this book.</p>
<p>I would also like to thank all the professors at NOVA FCT who have taught and guided me throughout my academic journey, especially those from the Department of Electrical and Computer Engineering.</p>
<h2 id="heading-about-the-author">About the Author</h2>
<ul>
<li><p>LinkedIn: <a href="https://www.linkedin.com/in/tiago-monteiro-/">https://www.linkedin.com/in/tiago-monteiro-</a></p>
</li>
<li><p>GitHub: <a href="https://github.com/tiagomonteiro0715">https://github.com/tiagomonteiro0715</a></p>
</li>
<li><p>Email: <a href="mailto:monteiro.t@northeastern.edu">monteiro.t@northeastern.edu</a></p>
</li>
</ul>
<p>My name is Tiago Monteiro, and I’m now pursuing a master’s degree in Artificial Intelligence at Northeastern University’s Silicon Valley campus (San Jose) on a merit-based scholarship.</p>
<p>I’m not from the United States. I am a Portuguese national, born and raised in the district of Lisbon.</p>
<p>In Portugal, I completed a bachelor's degree in electrical and computer engineering at NOVA University, one of Portugal's best universities.</p>
<p>I have authored over 20 articles for freeCodeCamp, which have accumulated more than 240,000 views over the years, and completed the Deep Learning Specialization from DeepLearning.AI, taught by Andrew Ng.</p>
<p>Also, I had the privilege of participating in the winter 2025 batch of the renowned Silicon Valley Fellowship program.</p>
<h4 id="heading-why-did-i-choose-electrical-and-computer-engineering">Why did I choose electrical and computer engineering?</h4>
<p>After finishing the Portuguese national math exam in 12th grade, I chose Electrical and Computer Engineering (ECE) to challenge myself and learn new math on my own.</p>
<p>The ECE degree combined:</p>
<ul>
<li><p>Advanced Mathematics</p>
</li>
<li><p>Programming (from Assembly to Python)</p>
</li>
<li><p>Physics (classical mechanics, electromagnetism)</p>
</li>
</ul>
<h4 id="heading-what-did-i-gain-exactly">What did I gain exactly?</h4>
<p>I mastered the skills needed to quickly understand AI research, particularly after completing Andrew Ng's Deep Learning Specialization.</p>
<p>In Portugal, I also studied advanced STEM areas including, for example:</p>
<ul>
<li><p><strong>Partial Differential Equations</strong> for modeling real-world phenomena</p>
</li>
<li><p><strong>Harmonic analysis</strong> (Fourier/Laplace transforms) for signal processing and alternative problem perspectives</p>
</li>
<li><p><strong>Complex analysis</strong> involving derivatives and integrals in the complex domain</p>
</li>
<li><p><strong>Numerical methods</strong> for approximating mathematical solutions computationally</p>
</li>
<li><p><strong>Signal/control theory</strong> for ensuring system stability in dynamic environments</p>
</li>
<li><p><strong>Physics classes</strong> in classical mechanics and electromagnetism fundamentals</p>
</li>
</ul>
<p>While not directly applied to AI, these studies enhanced my systems thinking and ability to independently learn complex STEM concepts.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Architecture of Mathematics – And How Developers Can Use it in Code ]]>
                </title>
                <description>
                    <![CDATA[ "To understand is to perceive patterns." - Isaiah Berlin Math is not just numbers. It is the science of finding complex patterns that shape our world. This means that to truly understand it, we need to see beyond numbers, formulas, and theorems and ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-architecture-of-mathematics-and-how-developers-can-use-it-in-code/</link>
                <guid isPermaLink="false">68308ee8ccde6bc325c82393</guid>
                
                    <category>
                        <![CDATA[ Mathematics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Math ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ history ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Fri, 23 May 2025 15:06:16 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748012748947/1df613bf-93e7-4f03-b0f0-47ff49f38504.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <blockquote>
<p>"To understand is to perceive patterns." - Isaiah Berlin</p>
</blockquote>
<p>Math is not just numbers. It is the science of finding complex patterns that shape our world. This means that to truly understand it, we need to see beyond numbers, formulas, and theorems and understand its structures.</p>
<p>The main goal of this article is to show how math is just like a growing tree of ideas. I want to show that math is a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math more deeply and how you can apply it to programming.</p>
<p>I’ve also included some code examples here to help you connect theory and practice. I show them to demonstrate how math ideas are applied to real problems. Whether you are new to more advanced math or are more experienced, these code examples will help you understand how to apply math in programming.</p>
<p>This link between theory and application reflects my own studies. I’m a final-year undergraduate student in Electrical and Computer Engineering at NOVA FCT, one of the best engineering faculties in Portugal.</p>
<p>My engineering degree is heavier on math and physics than most. That’s because a solid grasp of math is key to understanding electronics, telecommunications, control theory, and other areas of engineering.</p>
<p>Here’s a brief overview of some of the math and physics subjects I’ve learned:</p>
<ul>
<li><p><strong>Partial Differential Equations (PDEs):</strong> These equations model real-world phenomena, from heat diffusion to the economy of a country.</p>
</li>
<li><p><strong>Harmonic Analysis (Fourier &amp; Laplace):</strong> Integral transforms like the Fourier and Laplace transforms allow us to understand problems in new domains.</p>
</li>
<li><p><strong>Complex Analysis:</strong> Extending calculus into the complex plane gives rise to powerful tools used in physics and engineering.</p>
</li>
<li><p><strong>Numerical Analysis:</strong> When analytical solutions are impossible or inefficient, numerical methods provide computer-based approximations. This is crucial for real-world applications.</p>
</li>
<li><p><strong>Control and Signal Theory:</strong> These areas show us how to design stable systems like rockets, trains, and robots.</p>
</li>
<li><p><strong>Physics:</strong> Courses in Classical Mechanics and Electromagnetism helped bridge theoretical math to physical laws</p>
</li>
</ul>
<p>During my years of study, besides technical skills, I’ve developed a deeper understanding of how the world works and the structure of the field of mathematics. And I’ve started to find patterns in how math is a framework of interconnected logic.</p>
<h3 id="heading-in-this-article-well-explore">In this article, we’ll explore:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-simple-analogy-the-tree-of-mathematics">Simple Analogy: The Tree of Mathematics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-structure-and-history-of-mathematics">The Structure and History of Mathematics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-an-tree-example-foundations-of-relativity-by-albert-einstein">A Tree Example: Foundations of Relativity by Albert Einstein</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-biggest-paradox-of-math-discovered-by-kurt-godel">The Biggest Paradox of Math, Discovered by Kurt Gödel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-about-applied-math-and-engineering">What About Applied Math and Engineering?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-code-examples-analytical-and-numerical-approaches">Code Examples – Analytical and Numerical Approaches</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-impact-of-a-grand-unified-theory-of-mathematics">The Impact of a Grand Unified Theory of Mathematics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-a-final-lesson-from-history">A Final Lesson From History</a></p>
</li>
</ul>
<h2 id="heading-simple-analogy-the-tree-of-mathematics">Simple Analogy: The Tree of Mathematics</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518175609/78838825-d872-42df-9dc8-736fa012a630.jpeg" alt="Photo of two trees by Johannes Plenio: https://www.pexels.com/photo/two-brown-trees-1632790/" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Imagine math as a vast tree growing forever.</p>
<p>The roots of the tree are the foundations of mathematics: logic and set theory. From this foundation emerge the main basic fields of math: arithmetic, algebra, geometry, and analysis.</p>
<p>As the tree divides further and further into more branches, new, more complex subfields start to appear, like topology, abstract algebra, and complex analysis. Sometimes the branches are connected to each other.</p>
<p>And remember: this tree is always growing in many directions. From branches creating new branches to branches connecting to other branches. Little by little, it grows.</p>
<p>Throughout history, there have been times that, due to some big scientific discoveries, parts of the math tree started to grow very fast. Other times, decades and even centuries passed without many new branches. This is the case for imaginary numbers, for example.</p>
<p>And you might wonder: How many more branches and connections between them will keep appearing?</p>
<h2 id="heading-the-structure-and-history-of-mathematics">The Structure and History of Mathematics</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518363058/9911acd4-ad4f-4da2-a62b-9fa87e219c35.jpeg" alt="Photo of a writing desk and notebook on Pixabay: https://www.pexels.com/photo/brown-wooden-desk-159618/" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The first mathematical ideas appeared independently across ancient civilizations. For example:</p>
<ul>
<li><p>India’s invention of zero</p>
</li>
<li><p>Islamic algebraic advances</p>
</li>
<li><p>Greek geometric rigor</p>
</li>
</ul>
<p>Over time, great mathematicians developed these ideas and shared them through writing and lectures.</p>
<p>Eventually, these ideas spread widely to new generations, who created new math based on the old.</p>
<p>This is how new branches are continuously born from previous branches of the tree of mathematics.</p>
<p>And this is why Isaac Newton wrote, in a letter to Robert Hooke in 1675:</p>
<blockquote>
<p>If I have seen further, it is by standing on the shoulders of giants</p>
</blockquote>
<p>He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.</p>
<p>Yet, the real power of math lies in practicing it over and over and understanding it more and more deeply. As one of my professors once explained:</p>
<blockquote>
<p><em>More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.</em></p>
</blockquote>
<p>Very often, to solve problems, it is necessary to think in terms of first principles and build from there. Math teaches exactly that. In this way, math is not just an academic subject. It is a language spoken by scientists and engineers around the globe.</p>
<p>By having it well preserved and shared, it is still possible to create new math from previous ideas. And it’s possible for the big tree to continue growing based on previous branches or nodes.</p>
<h2 id="heading-an-tree-example-foundations-of-relativity-by-albert-einstein">A Tree Example: Foundations of Relativity by Albert Einstein</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518865627/e84ff108-b383-405b-8bb0-73ffb50b4dcf.jpeg" alt="Albert Einstein, one of the greatest physics giants in history" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Albert Einstein created the general and special theories of relativity. These have big consequences nowadays:</p>
<ul>
<li><p>GPS and Global Communication</p>
</li>
<li><p>Advancements in Satellite Telecommunications</p>
</li>
<li><p>Space Exploration and Satellite Launches</p>
</li>
</ul>
<p>But this was only possible through the unification of geometry with calculus, called <strong>differential geometry.</strong> The evolution of differential geometry happened over the centuries, thanks to many great mathematicians. Below are some of them, but this is not a complete list:</p>
<ul>
<li><p><strong>Euclid (circa 300 BCE):</strong> Contributed to geometry, laying the groundwork for later mathematical systems</p>
</li>
<li><p><strong>Archimedes (circa 287–212 BCE):</strong> Pioneered the understanding of volume, surface area, and the principles of mechanics</p>
</li>
<li><p><strong>René Descartes (1596–1650):</strong> Developed Cartesian coordinates and analytical geometry</p>
</li>
<li><p><strong>Isaac Newton (1642–1727) &amp; Gottfried Wilhelm Leibniz (1646–1716):</strong> Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.</p>
</li>
<li><p><strong>Leonhard Euler (1707–1783):</strong> Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.</p>
</li>
<li><p><strong>Gaspard Monge (1746–1818):</strong> The father of differential geometry and pioneer in descriptive geometry</p>
</li>
<li><p><strong>Carl Friedrich Gauss (1777–1855):</strong> Made groundbreaking advances in geometry, including the concept of curved surfaces.</p>
</li>
<li><p><strong>Bernhard Riemann (1826–1866):</strong> Introduced Riemannian geometry, a branch of differential geometry.</p>
</li>
</ul>
<p>Once again, as Isaac Newton wrote, in a letter to Robert Hooke in 1675:</p>
<blockquote>
<p>If I have seen further, it is by standing on the shoulders of giants.</p>
</blockquote>
<p>Albert Einstein saw what no one else in his time saw, thanks to these great math giants and countless others.</p>
<h2 id="heading-the-biggest-paradox-of-math-discovered-by-kurt-godel">The Biggest Paradox of Math, Discovered by Kurt Gödel</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518411126/df53f84c-f920-4b42-9081-5aeb1017f543.jpeg" alt="Kurt Gödel, one of the greatest math giants in history" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The biggest paradox in math, in my opinion, is what Kurt Gödel discovered. His early 20th century research revealed a limitation within this cycle.</p>
<p>This paradox – that is, <a target="_blank" href="https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems">his incompleteness theorems</a> – shows that in any consistent formal system capable of expressing simple arithmetic, there will always be true mathematical statements that cannot be proven within the system itself.</p>
<p>This means that in ALL such systems, there are limits to what you can actually prove to be true or false. For mathematicians, this means that the tree will never be completed. There are truths beyond formal proof, and yet we still assume they are true (albeit unproven).</p>
<p>It also means that no matter how many mathematicians work in the field, or how much AI is used to find new mathematics, there will always be limitations. Some true statements are impossible to prove, and we only know them through approximations, estimations, and other methods that fall outside exact formal logic.</p>
<h2 id="heading-what-about-applied-math-and-engineering">What About Applied Math and Engineering?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518581076/606f3bce-d7db-4ac3-9322-833673a734b0.jpeg" alt="Photo by JESHOOTS.com: https://www.pexels.com/photo/person-holding-a-chalk-in-front-of-the-chalk-board-714699/" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Applied math and engineering involve interpreting the same pure math ideas in real-world scenarios. In many cases, they combine several math ideas at once. Let’s consider some examples:</p>
<p>Principal component analysis (PCA) is a widely used tool in data science. Yet it is a mixture of linear algebra (eigenvalues and eigenvectors) with optimization (ranking the eigenvalues so the directions that capture the most variance come first) in order to reduce the dimensionality of datasets.</p>
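<p>As a sketch of that mixture, here is a minimal PCA from scratch with NumPy (the dataset here is made up for illustration): the linear algebra step is the eigen-decomposition of the covariance matrix, and the optimization step is keeping the direction with the largest eigenvalue.</p>

```python
import numpy as np

# Hypothetical 2D dataset: 100 points spread mostly along one diagonal.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) \
       + rng.normal(scale=0.1, size=(100, 2))

# Linear algebra: eigen-decompose the covariance matrix.
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Optimization: keep only the direction with the largest eigenvalue,
# i.e. the single axis that preserves the most variance.
top = eigenvectors[:, np.argmax(eigenvalues)]
reduced = centered @ top  # dataset compressed from 2 columns to 1

print(reduced.shape)  # (100,)
```

<p>Libraries like scikit-learn wrap essentially these same steps behind a single <code>PCA</code> class.</p>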
<p>In machine learning, logistic regression is a mixture of calculus with statistics and probability.</p>
<p>In harmonic analysis, Laplace, Fourier, and Z-transforms are a way to see the same thing in a new domain to get new insights. In this case, integrals are used to make this mapping.</p>
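<p>A quick NumPy sketch of that idea (the signal is a made-up 50 Hz sine wave): the Fourier transform maps it from the time domain to the frequency domain, where its hidden structure becomes a single visible peak.</p>

```python
import numpy as np

# A signal sampled at 1000 Hz for one second: a 50 Hz sine wave,
# standing in for any time-domain measurement.
fs = 1000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t)

# The Fourier transform maps the signal into the frequency domain.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# In the new domain, the structure is obvious: one peak at 50 Hz.
dominant = freqs[np.argmax(spectrum)]
print(dominant)  # 50.0
```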
<p>In deep learning, neural networks are just many matrices multiplying and updating themselves that adapt to model a dataset representing a system. This optimization of matrix values happens with activation functions, a gradient descent-based optimization method (tells how much values need to change), and backpropagation (applies those alterations to all matrix values).</p>
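<p>The gradient-descent part of that loop can be sketched in a few lines of plain Python. This is a toy one-weight model, not a real network: the gradient tells how much the weight needs to change, and the update applies that change.</p>

```python
# Toy data generated from a hypothetical "true" weight of 3 (y = 3x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0                  # start from an arbitrary weight
learning_rate = 0.01
for _ in range(500):
    # Gradient of mean squared error with respect to w:
    # d/dw (1/n) * sum((w*x - y)^2) = (2/n) * sum((w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # step against the gradient

print(round(w, 4))  # converges toward 3.0
```

<p>A real neural network does this same update simultaneously for millions of matrix entries, with backpropagation computing all the gradients.</p>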
<p>I have actually written an article where I teach <a target="_blank" href="https://www.freecodecamp.org/news/activation-functions-in-neural-networks/">why activation functions are important</a> if you want to check it out.</p>
<p>But the best example of this fusion of math with engineering is in <a target="_blank" href="https://www.freecodecamp.org/news/basic-control-theory-with-python/">control theory</a>.</p>
<p>Control theory is the study of the architecture of systems. From trains to cars to airplanes, everything is based on control theory. It is everywhere in nearly all modern electronic devices. In electric circuits, control theory is also used heavily to guarantee circuit stability in the face of electric disturbances.</p>
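<p>Here is a toy illustration of what control theory does (the numbers are invented, not a real vehicle model): a proportional controller repeatedly measures the error between a setpoint and the system's output and pushes against it, keeping the system stable even under a constant disturbance.</p>

```python
setpoint = 100.0      # desired value, e.g. a target speed
output = 0.0          # current value of the system
disturbance = -2.0    # constant external push, e.g. drag
gain = 0.5            # proportional gain Kp

for _ in range(100):
    error = setpoint - output
    control = gain * error            # action proportional to the error
    output += control + disturbance   # system responds, disturbance acts

print(round(output, 2))  # settles near 96.0
```

<p>Note that the output settles near 96 rather than exactly 100. This leftover steady-state error is characteristic of proportional-only control, and it is why practical controllers add integral and derivative terms (PID).</p>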
<p>So as you can probably start to see, many of the tools we now have are just mixtures of pure math ideas. In essence, applied math uses pure math ideas as “ingredients” in “recipes” to solve problems.</p>
<p>So, we’ve explored the structure and evolution of mathematics. Yet, it is important to see how these ideas can be applied in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.</p>
<h2 id="heading-code-examples-analytical-and-numerical-approaches">Code Examples – Analytical and Numerical Approaches</h2>
<p>These code examples demonstrate a couple of ways you can use Python to solve math equations.</p>
<p>In the first code example, we’ll solve the problem in the same way that kids in school solve math exercises: essentially, by hand with a pencil. Moving variables from left to right to find their values. In the second example, we’ll solve the problem using numerical analysis.</p>
<h3 id="heading-example-1-solve-a-problem-analytically">Example 1: Solve a Problem Analytically</h3>
<p>When we solve math problems analytically, like we did in school, we manipulate symbols to get exact values. Often these symbols are x, y, and z. In Python, we can do this using the SymPy library:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sympy <span class="hljs-keyword">import</span> symbols, Eq, solve

x, y = symbols(<span class="hljs-string">'x y'</span>)
eq1 = Eq(<span class="hljs-number">2</span>*x + <span class="hljs-number">3</span>*y, <span class="hljs-number">6</span>)
eq2 = Eq(-x + y, <span class="hljs-number">1</span>)

solution = solve((eq1, eq2), (x, y))
print(solution)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747160359386/7a21cddc-f4ba-4f9f-afa0-d1cc11fb27d6.png" alt="7a21cddc-f4ba-4f9f-afa0-d1cc11fb27d6" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Essentially, we are finding x and y based on this equation:</p>
<p>$$\begin{align*} 2x + 3y &amp;= 6 \\ -x + y &amp;= 1 \end{align*}$$</p><p>Which gives us the following result:</p>
<pre><code class="lang-python">{x: <span class="hljs-number">3</span>/<span class="hljs-number">5</span>, y: <span class="hljs-number">8</span>/<span class="hljs-number">5</span>}
</code></pre>
<p>Or:</p>
<ul>
<li><p>x= 0.6</p>
</li>
<li><p>y = 1.6</p>
</li>
</ul>
<p>When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.</p>
<p>But many times, problems are harder, and solving them means moving more and more symbols from one side of the equation to the other.</p>
<p>Sometimes, there can be so many symbols and transformed versions of them, with things like derivatives and integrals, that it can become very hard to manage and takes a lot of time.</p>
<p>For this reason, there is an area of mathematics called numerical analysis, devoted to finding fast approximations of exact mathematical solutions. This makes such problems much quicker to solve, and it is the method we will explore next.</p>
<h3 id="heading-example-2-solve-numerically-approximation">Example 2: Solve Numerically (Approximation)</h3>
<p>We’ll now use SciPy to solve a larger system of five equations in five unknowns with numerical methods:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> scipy.linalg <span class="hljs-keyword">import</span> solve

A = np.array([[<span class="hljs-number">3</span>, <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>],
              [<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">2</span>, <span class="hljs-number">-2</span>],
              [<span class="hljs-number">4</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
              [<span class="hljs-number">5</span>, <span class="hljs-number">3</span>, <span class="hljs-number">-2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>],
              [<span class="hljs-number">2</span>, <span class="hljs-number">-3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>]])

b = np.array([<span class="hljs-number">12</span>, <span class="hljs-number">5</span>, <span class="hljs-number">7</span>, <span class="hljs-number">9</span>, <span class="hljs-number">10</span>])

solution = solve(A, b)

print(solution)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747160347486/d1f17aa6-b288-4e41-9be7-0810c45e778c.png" alt="d1f17aa6-b288-4e41-9be7-0810c45e778c" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>In this code example, this line of code:</p>
<pre><code class="lang-python">solution = solve(A, b)
</code></pre>
<p>Uses the <a target="_blank" href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.solve.html">solve</a> method from the <a target="_blank" href="https://scipy.org/">SciPy</a> Python library:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scipy.linalg <span class="hljs-keyword">import</span> solve
</code></pre>
<p>It’s a function that finds the vector x in the equation A⋅x = b, where A is a square matrix of numbers and b is a vector of numbers. This gives us the following:</p>
<pre><code class="lang-python">[ <span class="hljs-number">1.35022026</span> <span class="hljs-number">-0.79955947</span> <span class="hljs-number">-1.17180617</span>  <span class="hljs-number">3.14317181</span> <span class="hljs-number">-0.83920705</span>]
</code></pre>
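<p>One way to sanity-check a numerical solution like this is to substitute it back: if <code>x</code> really solves A⋅x = b, then multiplying A by it must reproduce b, up to floating-point rounding. Here is that check using NumPy's equivalent solver (<code>numpy.linalg.solve</code>, which, like SciPy's, calls LAPACK under the hood):</p>

```python
import numpy as np

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]], dtype=float)
b = np.array([12, 5, 7, 9, 10], dtype=float)

x = np.linalg.solve(A, b)

# Substitute the solution back: A @ x should reproduce b.
print(np.allclose(A @ x, b))  # True
```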
<p>Now imagine, in this simple case, that a matrix like A could represent the <strong>traffic flow</strong> between cities or intersections, and b could represent the <strong>traffic entering or leaving</strong> each city.</p>
<p>By solving the system, it could help us determine the distribution of traffic between cities to meet desired traffic conditions.</p>
<p>Of course, these types of problems are far more complex in real life. But to understand and solve the big problems, you need to first understand the smaller problems.</p>
<p>And by the way, a matrix equation is just another way of writing a system of equations. We represent systems as matrices because it makes their properties easier to find and the whole system clearer to read.</p>
<p>With matrices, it is also easier to perform calculations and apply linear algebra to check the characteristics of the system and understand it better.</p>
<p>In essence, a matrix represents a system of equations. And systems of equations can represent real-life phenomena like the economy of a country or the weather.</p>
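<p>To make that equivalence concrete, the 2×2 system from Example 1 can be rewritten in matrix form and solved numerically, landing on the same values SymPy found symbolically:</p>

```python
import numpy as np

# The system from Example 1 as a matrix equation A·x = b:
#    2x + 3y = 6        [[ 2, 3],   [x]   [6]
#    -x +  y = 1   -->   [-1, 1]] · [y] = [1]
A = np.array([[2.0, 3.0],
              [-1.0, 1.0]])
b = np.array([6.0, 1.0])

x, y = np.linalg.solve(A, b)
print(x, y)  # approximately 0.6 and 1.6, matching the symbolic result
```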
<p>If you want to know more, I wrote an <a target="_blank" href="https://www.freecodecamp.org/news/numerical-analysis-explained-how-to-apply-math-with-python/">entire article on numerical analysis</a> that you can check out.</p>
<h2 id="heading-the-impact-of-a-grand-unified-theory-of-mathematics">The Impact of a Grand Unified Theory of Mathematics</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747518681068/54a9556c-2a79-441c-a6d6-27ff38e1f4ff.jpeg" alt="Photo by Porapak Apichodilok: https://www.pexels.com/photo/person-holding-world-globe-facing-mountain-346885/" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Despite the biggest paradox in mathematics, what would happen with a <a target="_blank" href="https://www.scientificamerican.com/article/the-evolving-quest-for-a-grand-unified-theory-of-mathematics/">Grand Unified Theory of Mathematics</a>?</p>
<p>Remember that Gödel's theorems tell us that there are true things that are impossible to formally prove, and that we need to just accept this. But even with this limitation, it is still possible to try to unify all of math.</p>
<p>This is what <a target="_blank" href="https://en.wikipedia.org/wiki/Langlands_program">the Langlands program</a> is trying to achieve: an attempt to interconnect the largest branches of the big tree of math and uncover new patterns between them.</p>
<p>With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.</p>
<h3 id="heading-what-is-the-value-of-this-big-unification-for-society">What is the value of this big unification for society?</h3>
<p>By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:</p>
<ul>
<li><p>In the 19th century, James Clerk Maxwell united the fields of <em>electricity</em> and <em>magnetism</em> with his famous Maxwell equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for all technological progress in the 20th and 21st centuries.</p>
</li>
<li><p>In the 20th century, the unification of <em>algebra</em> with <em>logic</em> led to the rise of digital systems. In turn, digital systems gave rise to processors and to the evolution of computers into the modern laptop.</p>
</li>
<li><p>Also in the 20th century, the unification of <em>probability</em> and <em>communication</em> led to information theory. This became the foundation for the internet. This unification was carried out by a great mathematician called Claude Shannon.</p>
</li>
</ul>
<p>In the end, a Grand Unified Theory of Mathematics could be one of the biggest achievements in modern society.</p>
<p>It could lead to new discoveries in physics, such as in string theory or quantum gravity, where deep mathematical structures are needed to create new physics. In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models. It could also open the door to new cryptographic methods and material science advances, revealing, with math, the deep patterns still not found in these fields.</p>
<p>Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.</p>
<h2 id="heading-a-final-lesson-from-history">A Final Lesson From History</h2>
<p>From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it is possible to see its role in finding the patterns of our universe. I hope I was able to make you see math in this way.</p>
<p>In addition, we can conclude that the unification of scientific fields makes the foundations for the creation of new innovations to help society go forward. Many profound societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.</p>
<p>Find the full code <a target="_blank" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">here</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ From Failure to International Success: How Online Learning Platforms Saved My Life ]]>
                </title>
                <description>
                    <![CDATA[ It is better to be a samurai in a garden than an agricultural worker in a war - Miyamoto Musashi In this article, I’ll share my story. When I was younger, I thought I was destined to be a failure in life. To be isolated from everyone. But years late... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-online-learning-platforms-saved-my-life/</link>
                <guid isPermaLink="false">67ed3edd597c0ff6bdcf4c45</guid>
                
                    <category>
                        <![CDATA[ Learning Journey ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Online education ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Personal growth   ]]>
                    </category>
                
                    <category>
                        <![CDATA[ learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Wed, 02 Apr 2025 13:42:53 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743601355025/2f8c32a4-c451-4860-b4e2-f02b690fb928.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <blockquote>
<p>It is better to be a samurai in a garden than an agricultural worker in a war - Miyamoto Musashi</p>
</blockquote>
<p>In this article, I’ll share my story.</p>
<p>When I was younger, I thought I was destined to be a failure in life. To be isolated from everyone. But years later, I realized I was actually destined for success.</p>
<p>I went from wasting thousands of hours playing video games to giving a lecture to medical professionals called “Trustworthy AI: The Role of Small Models in Critical Systems.”</p>
<p>And I went from being told I was dumber than most people, based on an IQ test I took as a 14-year-old kid with low self-esteem, to becoming a frequent contributor to freeCodeCamp. I’ve written articles on interpretable AI, applied math, and advanced tech. And these articles have now reached more than 200,000 people worldwide.</p>
<p><strong>And this is just the beginning.</strong></p>
<p>When it comes to education, I owe gratitude to three people and the organizations they lead:</p>
<ul>
<li><p>Salman Khan, founder of Khan Academy</p>
</li>
<li><p>Quincy Larson, founder of freeCodeCamp</p>
</li>
<li><p>Andrew Ng, co-founder of Coursera and founder of DeepLearning.AI</p>
</li>
</ul>
<p>I also owe a lot to the great author of the book Atomic Habits: An Easy &amp; Proven Way to Build Good Habits &amp; Break Bad Ones – James Clear.</p>
<p>I’m sharing my story to inspire others who are struggling in their lives just like I was.</p>
<p>I’m also writing this for those who know me or will know me personally, so that they can understand where my determination comes from and why I am relentless.</p>
<h3 id="heading-heres-what-ill-cover">Here’s what I’ll cover:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-where-i-was-misery-depression-and-isolation">Where I Was: Misery, Depression, and Isolation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-my-transformation-learning-triple-integrals-and-programming">My Transformation: Learning Triple Integrals and Programming</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-one-of-the-best-choices-in-my-life-why-i-chose-electrical-and-computer-engineering">One of the Best Choices in My Life: Why I Chose Electrical and Computer Engineering</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-being-restless-and-determined-my-work-ethic-in-university">Being Restless and Determined: My Work Ethic in University</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-my-projects-while-in-nova-fct-ai-projects-international-student-organizations-and-freecodecamp-articles">My Projects while in NOVA FCT: AI Projects, International Student Organizations, and freeCodeCamp Articles</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-my-personal-philosophy-at-21-years-old-and-view-on-envy-and-negativity">My Personal Philosophy at 21 Years Old and View on Envy and Negativity</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-i-am-today-a-fraction-of-what-i-have-achieved">Where I Am Today: A Fraction of What I Have Achieved</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts-have-an-adaptive-grand-strategy-for-your-life">Final Thoughts: Have an Adaptive Grand Strategy for Your Life</a></p>
</li>
</ol>
<h2 id="heading-where-i-was-misery-depression-and-isolation">Where I Was: Misery, Depression, and Isolation</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743425427463/49388c13-8c05-41ae-878b-316de0e3ed56.jpeg" alt="Photo by <a href=&quot;https://unsplash.com/@nmbalanial?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash&quot;>Nikko Balanial</a> on <a href=&quot;https://unsplash.com/photos/water-droplets-on-glass-window-4XSdSFgKm8k?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash&quot;>Unsplash</a>       " class="image--center mx-auto" width="2703" height="1850" loading="lazy"></p>
<p>Photo by <a target="_blank" href="https://unsplash.com/@nmbalanial?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Nikko Balanial</a> <a target="_blank" href="https://unsplash.com/@nmbalanial?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">on Unsplash</a></p>
<blockquote>
<p>"As it turns out, it was that very rock bottom that became the firmest foundation I had ever planted my feet on." — Mandy Hale</p>
</blockquote>
<p>Five years ago, I was in a different place and was a completely different person.</p>
<p>Like many teenagers, I started playing video games and became addicted to them. Over time, games became an escape from reality and all my problems, including my bad grades and many other issues.</p>
<p>At age 14, I still held ambitions in my heart. I dreamed of being someone who would help others, maybe as a doctor or an engineer.</p>
<p>But after an IQ and vocational guidance test, I was told that I was incapable of doing these things. That I lacked the intelligence needed. That it was unrealistic for me to pursue these types of degrees.</p>
<p>Eventually, and because of many comments, opinions, and expectations of others, I began to believe in this lie for years.</p>
<p>Over time, depressed and constantly escaping reality, my grades plummeted and I got worse and worse. And this only made the prospect of going to college less and less likely.</p>
<p>By 11th grade, I was:</p>
<ul>
<li><p>Extremely shy and anxious</p>
</li>
<li><p>Struggling academically</p>
</li>
<li><p>Over 2000 hours in video games on two games alone:</p>
<ul>
<li><p>1000 hours in GTA V</p>
</li>
<li><p>1000 hours in Destiny 2</p>
</li>
</ul>
</li>
</ul>
<p>2000 hours equals nearly 83 days.</p>
<p>This means that in these two games, I lost more than two months of my life.</p>
<p>But from these wasted hours, I learned English. This became crucial when learning online.</p>
<p>In January 2020, I was tired of everything. In particular, I was sick of the misery of always being at the bottom and of so much negativity towards and around me.</p>
<p>So I made these vows to myself for the rest of my life:</p>
<ul>
<li><p>Never again would I worry or care about what other people say about me.</p>
</li>
<li><p>I would no longer accept the limitations imposed by others or myself on my growth.</p>
</li>
<li><p>And as for the limitations I imposed on myself, I would rethink to see if they were really impossible or if I could actually conquer them.</p>
</li>
</ul>
<p>As a result, I started relearning and learning everything by myself to make sure I succeeded in the national exams.</p>
<h2 id="heading-my-transformation-learning-triple-integrals-and-programming">My Transformation: Learning Triple Integrals and Programming</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743425550115/3f61c996-1a20-415f-bc89-65eaf3119799.jpeg" alt="3f61c996-1a20-415f-bc89-65eaf3119799" class="image--center mx-auto" width="3000" height="2000" loading="lazy"></p>
<p>Photo by <a target="_blank" href="https://unsplash.com/@joshuaearle?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Joshua Earle</a> <a target="_blank" href="https://unsplash.com/@joshuaearle?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">on Unsplash</a></p>
<blockquote>
<p><em>"The expert in anything was once a beginner."</em><br>— Helen Hayes</p>
</blockquote>
<p>I started going through chemistry exercises in the 10th and 11th grades using books from school and YouTube videos. In two weeks, I relearned or learned most of the material I needed to know.</p>
<p>I started doing the same for mathematics, something I always found hard due to a lack of basic foundational mathematics knowledge.</p>
<p>I found it hard, that is, until I discovered Khan Academy.</p>
<p>With Khan Academy, I rebuilt myself, going from struggling with basic math to mastering double and triple integrals, all within five to six months.</p>
<p>My method was simple:</p>
<ul>
<li><p>Study a little bit every day.</p>
</li>
<li><p>Take detailed notes</p>
</li>
<li><p>Redo quizzes or unit tests until I scored more than 90%</p>
<ul>
<li>For topics that I found harder or failed to understand, I did the practice exercises</li>
</ul>
</li>
<li><p>Use YouTube to close any knowledge gaps</p>
</li>
</ul>
<p>For example, for <a target="_blank" href="https://www.khanacademy.org/math/algebra">Algebra I</a>, where I started to relearn math, I first saw how many units there were. Each unit had a certain number of topics. As of 2025, Algebra I has 89 topics across the units with mastery exercises.</p>
<p>For those 89 topics, I watched the videos and did the quizzes. According to my scores, I would either go on to the next video (if I felt confident), or stop, rewatch the video, go through the same material on YouTube, do practice exercises, and then try to do the quiz again.</p>
<p>I decided that I needed to do at least three topics every day. At that pace, I could finish Algebra I by the end of one month.</p>
<p>But I was so motivated and so focused on it that I did more than 3 topics per day.</p>
<p>I did the same for Algebra II, and all the others until <a target="_blank" href="https://www.khanacademy.org/math/ap-calculus-bc">College Calculus BC</a>.</p>
<p>Some days, I completed more than 8 topics. Other days, I struggled to even do 2. But I made sure that I mastered mathematics and its foundations for the rest of my life.</p>
<p>This was not just about grades. It was about regaining belief and confidence in myself.</p>
<p>I also read many books, primarily self-help, to make myself better. Over the years, I have started reading fewer self-help books and have started focusing on non-fictional books that explain to me how the world works.</p>
<h3 id="heading-covid-19-accelerating-my-learning-in-programming-and-machine-learning"><strong>COVID-19: Accelerating My Learning in Programming and Machine Learning</strong></h3>
<p>When the pandemic hit, I started accelerating my learning in other areas, like programming and physics. In many online classes, I didn’t pay attention as well as I should’ve – and I found myself prioritizing self study on topics I found more important. And I always used my time to learn more about programming.</p>
<p>I learned Python and C through free YouTube courses for beginners on freeCodeCamp’s channel.</p>
<p>This was where I first learned Python.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/rfscVS0vtbw" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>And C:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/KJgsSFOSQv0" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>Soon after exploring C programming, I realized that programming languages are just tools. Once you master one, others come more naturally.</p>
<p>I studied data science tutorials on the web and on YouTube. This way, I learned how to import Python libraries in virtual environments. I also began building projects with Python libraries I found interesting and made it a habit to explain every line of code to myself as if I were the teacher.</p>
<p>For example, I started working with the <a target="_blank" href="https://scikit-learn.org/stable/index.html">scikit-learn</a> Python library to make simple linear and logistic models that could make predictions.</p>
<p>I also decided to explore Deep Learning, learning how to train neural network architectures to make predictions, and taught myself how to work with <a target="_blank" href="https://www.arduino.cc/">Arduino</a> and circuits.</p>
<p>I found this hard to master compared to triple integrals at the time!</p>
<p>But this way, I understood one very important thing about Deep Learning: to truly understand and master it, I needed to know, deeply, some difficult mathematics concepts. And I also needed to learn quickly about the new research coming out.</p>
<h2 id="heading-one-of-the-best-choices-in-my-life-why-i-choose-electrical-and-computer-engineering">One of the Best Choices in My Life: Why I Chose Electrical and Computer Engineering</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743426438035/92ee17a4-c537-4922-82cc-cbb42c564baa.jpeg" alt="92ee17a4-c537-4922-82cc-cbb42c564baa" class="image--center mx-auto" width="3888" height="2592" loading="lazy"></p>
<p>Photo by <a target="_blank" href="https://unsplash.com/@nicolasthomas?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Nicolas Thomas</a> <a target="_blank" href="https://unsplash.com/@nicolasthomas?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">on Unsplash</a></p>
<blockquote>
<p><em>"Opportunities multiply as they are seized."</em><br>— Sun Tzu</p>
</blockquote>
<p>After completing the Portuguese national mathematics exam in the 12th grade, I chose Electrical and Computer Engineering (ECE). I chose this area because it would challenge me and allow me to gain the skills to learn and apply new mathematics by myself, without anyone teaching me.</p>
<p>It was also broad:</p>
<ul>
<li><p>If I liked, I could follow an electrical engineering path, like circuits, power systems, or telecommunications.</p>
</li>
<li><p>If an electrical engineering subarea was not in my best interest, I could follow a computer science path or apply math in banking or other areas where people who know applied math work.</p>
</li>
</ul>
<p>The ECE degree also allowed me to unite the following areas:</p>
<ul>
<li><p>Advanced Mathematics</p>
</li>
<li><p>Programming (from low-level like assembly and C to high-level like Python)</p>
</li>
<li><p>Physics (circuits, robotics, communication systems)</p>
</li>
</ul>
<p>I wanted to become someone who not only mastered knowledge but could also create new systems and ideas from it.</p>
<p>I knew that I was laying the foundation for something greater than just academic success.</p>
<h3 id="heading-what-did-i-gain-exactly"><strong>What Did I Gain Exactly?</strong></h3>
<p>Over time, after completing AI specializations, I learned the many skills I needed to understand the new AI research coming out.</p>
<p>I also learned hard math and applied mathematics areas such as:</p>
<ul>
<li><p>Partial differential equations: how they can represent and model real phenomena, like the economy of a country.</p>
</li>
<li><p>Pure harmonic analysis: Fourier and Laplace transformations and how integral transformations allow us to see problems in other ways.</p>
</li>
<li><p>Complex analysis: application of derivatives and integrals in a complex domain, with real and imaginary numbers.</p>
</li>
<li><p>Numerical analysis: how computers use approximations of analytical math to get results faster.</p>
</li>
<li><p>Signal and control theory: how the architecture of systems is studied to ensure rocket, train, and car control systems are stable, despite possible disturbances in the systems.</p>
</li>
</ul>
<p>Not to mention physics classes like:</p>
<ul>
<li><p>Classical mechanics</p>
</li>
<li><p>Electromagnetism</p>
</li>
</ul>
<p>While these topics may not be applied in depth to AI, learning them helped me develop an incredible intuition into systems thinking. It also greatly improved my ability to learn hard STEM concepts on my own.</p>
<h2 id="heading-being-restless-and-determined-my-work-ethic-in-university">Being Restless and Determined: My Work Ethic in University</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743421252923/58fa0581-1312-4577-b186-b61a0d7ecac1.png" alt="Me at 18 years old" class="image--center mx-auto" width="460" height="460" loading="lazy"></p>
<p>Me at 18 years old</p>
<blockquote>
<p>"Success is the sum of small efforts, repeated day in and day out." — Robert Collier</p>
</blockquote>
<p>I adopted a very rigorous work ethic.</p>
<p>When my work ethic failed to achieve what I wanted, I adapted with more knowledge and learned very deeply what I did wrong so as not to repeat it.</p>
<p>For example, in the first semester, my first grades were not the best. So, I read:</p>
<ul>
<li><p>Deep Work</p>
</li>
<li><p>The Effective Executive: The Definitive Guide to Getting the Right Things Done</p>
</li>
<li><p>How to Become a Straight-A Student: The Unconventional Strategies Real College Students Use</p>
</li>
</ul>
<p>These books taught me how to focus and prioritize what needed to be done. This became essential as I entered one of the most demanding phases of my life.</p>
<p>In addition, I used Notion as a management system and Google Calendar as a schedule system.</p>
<p>Every week, I transferred next week's tasks from Google Calendar to Notion. This way, I never forgot anything and never worried about forgetting anything.</p>
<p>I had two simple scalable systems that worked very well for managing everything I did.</p>
<p>In the scheduling system, I would place certain events on repeat, for example:</p>
<ol>
<li><p>Every week:</p>
<ul>
<li>Read the top articles of the week on subreddits dedicated to programming and other topics so I could keep learning and growing. I did the same with communities on the Stack Exchange network.</li>
</ul>
</li>
</ol>
<ul>
<li>Read new articles on IEEE Spectrum and learn as much as possible about what is happening currently.</li>
</ul>
<ol start="2">
<li><p>Every two weeks:</p>
<ul>
<li>Plan my studying according to the time available, along with all the class materials and other resources I could get for tests, projects, and exams.</li>
</ul>
</li>
<li><p>Every month:</p>
<ul>
<li>Review my annual objectives and prioritize what was important and urgent to do in that month. I would also review new opportunities that aligned with my objectives for the year and for my life.</li>
</ul>
</li>
</ol>
<p>This way, I was always aligned and efficient. And all this was from a Notion database.</p>
<p>Very often, I started working at 8:00am and continued until around 9:00 or 10:30am, when my classes usually started. In that time, I studied, did student organization work, completed online courses and specializations, worked on AI projects, wrote freeCodeCamp articles, and handled many other tasks.</p>
<p>I went beyond studying just the subjects from my degree:</p>
<ul>
<li><p>I also studied history, economics, and geopolitics to understand the hidden incentives that shape the world.</p>
</li>
<li><p>I developed the habit of studying the architecture of things, from political systems to technology, understanding how they work to design better systems.</p>
</li>
<li><p>I attended many free online and university events to learn as much as possible.</p>
</li>
</ul>
<p>I also treated weekends as opportunities to grow and work, and did not stop. This was not possible 100% of the time, but most days I was able to do so.</p>
<p>In this way, I completed Coursera’s prestigious Deep Learning Specialization, a very important achievement in my journey.</p>
<p>I also read many books and listened to podcasts while taking public transportation, ensuring that no time was wasted.</p>
<h2 id="heading-my-projects-while-in-nova-fct-ai-projects-international-student-organizations-and-freecodecamp-articles">My Projects While in NOVA FCT: AI Projects, International Student Organizations, and freeCodeCamp Articles</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743422159743/81944517-ee5d-4cd3-a84a-fd9726391175.jpeg" alt="81944517-ee5d-4cd3-a84a-fd9726391175" class="image--center mx-auto" width="2016" height="1512" loading="lazy"></p>
<p>Me at 21 years old.</p>
<blockquote>
<p><em>"The strength of the team is each individual member. The strength of each member is the team."</em><br>— Phil Jackson</p>
</blockquote>
<p>International student organizations often offer opportunities that are rarely found in local college clubs.</p>
<p>These student organizations are also often better managed than some local clubs, which can sometimes suffer from internal politics focused on titles rather than making a real contribution to society.</p>
<p>For this reason, I sought international organizations that pushed members towards real impact and development.</p>
<p>After a while, I became interested in BEST (Board of European Students of Technology), a large international network of around 80 local student groups at universities across Europe. I joined my local group, BEST Almada, which helps foster the development of students through courses and events.</p>
<p>I also became deeply involved in the IEEE, the world’s largest non-profit professional association, where I served as the Vice-Chair of the IEEE NOVA Student Branch. Currently, I contribute nationally in the IEEE Portugal Section by creating videos for social media.</p>
<p>Thanks to IEEE, I was able to go to the IEEE Melecon conference in Porto last year to speak with some amazing scientists and researchers.</p>
<p>Here’s a key thing I learned from IEEE that I want to share: communication, alignment of expectations between everybody, and knowing how to navigate social dynamics are crucial for any project or initiative to succeed. Of course, the culture of the organization and many other variables matter as well. But I believe communication is one of the most important and critical factors.</p>
<p>Along this path, I worked on projects like Eurostatify AI, which aimed to provide European public data insights and hidden patterns that are accessible to researchers and policymakers. I also led the Doctor AI Project as part of a Hackathon in March 2023, where I developed two AI bots using Flutter and the ChatGPT API to help doctors make better decisions.</p>
<p>Each step helped me forge myself into someone capable of inspiring and leading others. I also taught complex topics in my freeCodeCamp articles, such as how CPUs work in depth, interpretable AI, quantum AI, and even how to design a control system for rockets.</p>
<p>I was involved in local student clubs before I realized the value of joining international organizations. In Europe, these organizations bring unique opportunities and are usually better managed than local groups. They’re a great place for developing soft skills as well.</p>
<p>So in the end, joining international student organizations was one of the best decisions of my university life.</p>
<h2 id="heading-my-personal-philosophy-at-21-years-old-and-view-on-envy-and-negativity">My Personal Philosophy at 21 Years Old and View on Envy and Negativity</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743425739851/135f7a2a-edab-48a4-91ee-e8aa3ed8c416.jpeg" alt="135f7a2a-edab-48a4-91ee-e8aa3ed8c416" class="image--center mx-auto" width="5953" height="3969" loading="lazy"></p>
<p>Photo by <a target="_blank" href="https://unsplash.com/@giamboscaro?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Giammarco Boscaro</a> <a target="_blank" href="https://unsplash.com/@giamboscaro?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">on Unsplash</a></p>
<blockquote>
<p><em>"Freedom lies in being bold."</em><br>— Robert Frost</p>
</blockquote>
<p>Here’s one thing I’ve learned over the years: you need to make your own path. Chasing social status and falling prey to social pressures isn’t worth it, and you shouldn’t be blinded by these things. True freedom comes from defining your own path. Developing relationships with professors and mentors, learning from books, and taking advantage of solid free learning resources are all things that can help you go further in life.</p>
<p>But what about envy and negativity from others?</p>
<p>Well, unfortunately these will always be part of our lives. Being envious is human nature, and various forms of negativity will likely continue to exist. Anyone who works and achieves any level of success will inevitably attract envy and negativity.</p>
<p>The best response is not to react and to ignore it completely. Just keep growing.</p>
<p>Some people will disappear, mock you, envy you, or hate you – but just try to let it all go. Keep walking your path.</p>
<p>Time is precious. Don’t waste it on:</p>
<ul>
<li><p>Meaningless opinions</p>
</li>
<li><p>Video games</p>
</li>
<li><p>Distractions</p>
</li>
</ul>
<p>I find it sad that, despite living in such an exciting time, and despite unprecedented access to knowledge and education, advances in technology, and immense global connectivity, some people still choose to hate and be envious of others. But as I said before, it’s human nature and there is little we can do about it.</p>
<p>Just remember: you have opportunities today that previous generations could only dream of. Take advantage of them to the fullest and worry about your own personal growth.</p>
<h2 id="heading-where-i-am-today-a-fraction-of-what-i-have-achieved">Where I am Today: A Fraction of What I Have Achieved</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743428837578/9f5359be-4cca-4840-b74c-ca78b44b8672.jpeg" alt="9f5359be-4cca-4840-b74c-ca78b44b8672" class="image--center mx-auto" width="1537" height="1533" loading="lazy"></p>
<p>Me in a Tesla factory in Silicon Valley</p>
<blockquote>
<p><em>"I am not a product of my circumstances. I am a product of my decisions."</em><br>— Stephen R. Covey</p>
</blockquote>
<p>At 21, I am finishing my degree in Electrical and Computer Engineering at NOVA FCT.</p>
<p>So far:</p>
<ul>
<li><p>I’ve been accepted into the Silicon Valley Fellowship Program: only 18 out of 600 applicants are accepted to visit Silicon Valley's top companies and universities.</p>
</li>
<li><p>I’ve delivered a talk to doctors about AI called "Trustworthy AI - The Role of Small AI Models in Critical Systems." Before this, I delivered other smaller talks.</p>
</li>
<li><p>I’ve completed Coursera AI specializations such as the Deep Learning Specialization from DeepLearning.AI and Reinforcement Learning Specialization from the University of Alberta.</p>
</li>
<li><p>In IEEE (the largest non-profit professional association in the world), I served as the vice chair of my faculty IEEE NOVA SB student branch, and I am now an IEEE PT Officer, creating videos for social media.</p>
</li>
<li><p>I’ve had twenty articles published on freeCodeCamp since 2023 that have accumulated around 200,000 views. They are related to advanced applied math, AI, and technology. (Link below)</p>
</li>
<li><p>I’ve been recognized as a Top Open Source Contributor for freeCodeCamp in 2022, 2023, and 2024</p>
</li>
</ul>
<h2 id="heading-final-thoughts-have-an-adaptive-grand-strategy-for-your-life">Final Thoughts: Have an Adaptive Grand Strategy for Your Life</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743421784759/80ba16f2-fef7-4a4d-82b9-9287d7b43e53.jpeg" alt="80ba16f2-fef7-4a4d-82b9-9287d7b43e53" class="image--center mx-auto" width="1080" height="1293" loading="lazy"></p>
<p>Silicon Valley Fellowship post about me</p>
<blockquote>
<p><em>"What you do makes a difference, and you have to decide what kind of difference you want to make."</em><br>— Jane Goodall</p>
</blockquote>
<p>My life objective is still the same as it was when I was 14, seven years ago:</p>
<ul>
<li>Help as many people as possible: give them opportunities to make their lives better, and make society better for the generations that will come after mine.</li>
</ul>
<p><strong>My strongest advice for anyone: Have a grand strategy for your life.</strong></p>
<p>A grand strategy is a type of long-term strategy in which nations align power and resources to achieve their objectives. You must align your actions, skills, and knowledge towards your purpose.</p>
<p>I used to be afraid of public speaking and so many other things. Not anymore.</p>
<p>Now, I know I am destined to contribute, inspire, and leave a mark on other people's lives for the better.</p>
<p>If you feel stuck, remember:</p>
<ul>
<li><p>You can change! It will be hard, and many people will not want you to.</p>
</li>
<li><p>Ignore all that and focus on yourself.</p>
</li>
</ul>
<p>It takes effort, patience and courage, but it is possible.</p>
<p><strong>Thanks to all organizations for the opportunity to contribute to society and grow as a person:</strong></p>
<ul>
<li><p>NOVA School of Science and Technology and its student association, AEFCT</p>
</li>
<li><p>IEEE Portugal Section</p>
</li>
<li><p>Silicon Valley Fellowship</p>
</li>
<li><p>BEST and BEST Almada</p>
</li>
<li><p>Magma Studio</p>
</li>
</ul>
<p>I also want to thank all the university professors at NOVA FCT who taught me, especially the ones from the department of electrical and computer engineering.</p>
<p>Finally, I want to express my gratitude to Portuguese society. Not long ago, in Portugal, pursuing higher education, especially in STEM, was inaccessible to many. Thanks to the efforts of past generations, today, young people like me can pursue these opportunities and contribute back to society.</p>
<p><strong>This is just the beginning of my impact on society.</strong></p>
<p>My freeCodeCamp blog:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.freecodecamp.org/news/author/tiagomonteiro/">https://www.freecodecamp.org/news/author/tiagomonteiro/</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Quantum AI Model for Predicting Iris Flower Data with Python ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning is an area of AI where the likes of ChatGPT and other famous models were created. These systems were all created with neural networks. The field of machine learning that deals with the creation of these neural networks is called deep... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-ai-model-for-predicting-data-with-python/</link>
                <guid isPermaLink="false">66ba5335582bb94cb02712aa</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Thu, 08 Aug 2024 13:18:14 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/08/pexels-guvo-20731157.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning is an area of AI where the likes of ChatGPT and other famous models were created. These systems were all created with neural networks.</p>
<p>The field of machine learning that deals with the creation of these neural networks is called deep learning. </p>
<p>In this blog post, we'll create a neural network with some neurons that run on a classical computer and others that run on a quantum computer.</p>
<p>Training a neural network with both types of neurons will give us an AI model based on quantum computing, as most of the processing will occur in the quantum neurons.</p>
<p>We'll talk about these:</p>
<ul>
<li><a class="post-section-overview" href="#heading-introduction-to-ai-hybrid-neural-networks-and-its-benefits">Introduction to AI, hybrid neural networks and its benefits</a></li>
<li><a class="post-section-overview" href="#heading-quantum-ai-in-action-predicting-iris-flower-data-with-python">Quantum AI in Action: Predicting Iris Flower Data with Python</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-the-future-of-efficient-ai-models">Conclusion: The future of efficient AI models</a></li>
</ul>
<p><strong>Note:</strong> We'll create a simple neural network, avoiding complex architectures like transformers, deep dives into quantum physics, or advanced AI model optimization techniques.</p>
<p>The full code is available <a target="_blank" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">here</a>.</p>
<h2 id="Introduction">Introduction to AI, Hybrid Neural Networks and Its Benefits</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-pavel-danilyuk-8438918.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pavel Danilyuk: https://www.pexels.com/photo/elderly-man-thinking-while-looking-at-a-chessboard-8438918/</em></p>
<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep learning is a subfield of AI that uses neural networks to handle complex tasks like predicting the weather, classifying images, responding to text, and so on.</p>
<p>The bigger the neural network, the more complex the tasks it can perform. ChatGPT, for example, can process natural language to interact with users.</p>
<h3 id="heading-neural-networks">Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/Firefox_Screenshot_2024-08-03T13-56-12.699Z.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simple Neural Network</em></p>
<p>Deep learning is the training of neural networks to predict future data. Training a neural network involves feeding it data, allowing it to learn, and then making predictions.</p>
<p>Neural networks are composed of many neurons organized in layers. Each layer captures different patterns in the data.</p>
<p>This layered structure allows AI models to interpret complex data and patterns. For example, the neural network in the image above could be trained on 8 weather-related features to predict whether or not it will rain.</p>
<p>The layer that takes in the data is called the input layer, and the final one is called the output layer. Between these are the hidden layers, which capture complex patterns.</p>
<p>Of course, this is a very simple neural network, but the idea of training a neural network is the same for any complex architecture.</p>
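<p>To make the layered flow concrete, here is a minimal NumPy sketch of a forward pass through an input, hidden, and output layer. The layer sizes, random weights, and the <code>relu</code>/<code>sigmoid</code> helpers are purely illustrative and are not part of the model we build below:</p>

```python
import numpy as np

def relu(z):
    # Non-linearity applied at the hidden layer
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes the output into (0, 1), read here as "probability of rain"
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# 8 input features (e.g. weather measurements), a hidden layer of
# 4 neurons, and a single output neuron -- sizes are illustrative
W_hidden = rng.normal(size=(8, 4))
b_hidden = np.zeros(4)
W_out = rng.normal(size=(4, 1))
b_out = np.zeros(1)

x = rng.normal(size=8)                    # one sample with 8 features
hidden = relu(x @ W_hidden + b_hidden)    # hidden layer activations
output = sigmoid(hidden @ W_out + b_out)  # output layer prediction

print(output.shape)  # (1,)
```

<p>Training then consists of adjusting the weight matrices so the output matches the labels in the data; the idea is the same for any deeper architecture.</p>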
<h3 id="heading-hybrid-neural-networks-combining-quantum-and-classical-computing">Hybrid Neural Networks - Combining Quantum and Classical Computing</h3>
<p>We'll now create a hybrid neural network. Essentially, the input and output layers will operate on a classical computer, while the hidden layer will process data on a quantum computer.</p>
<p>This approach uses the best of classical and quantum computing to train a neural network.</p>
<h3 id="heading-why-choose-hybrid-neural-networks-over-traditional-neural-networks">Why Choose Hybrid Neural Networks Over Traditional Neural Networks?</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-weekendplayer-45072.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Burak The Weekender: https://www.pexels.com/photo/lighted-light-bulb-in-selective-focus-photography-45072/</em></p>
<p>The main idea of using a hybrid neural network is to have part of the data processing occur on a quantum computer, which can perform certain computations much faster than a classical computer.</p>
<p>In addition, quantum computers perform certain tasks with far less energy consumption. This efficiency in processing and energy usage allows the creation of smaller and more reliable AI models.</p>
<p>This is the main idea of a hybrid neural network: to create smaller and more efficient AI models.</p>
<h2 id="Quantum">Quantum AI in Action: Predicting Iris Flower Data with Python</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-googledeepmind-25626507.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Google DeepMind: https://www.pexels.com/photo/quantum-computing-and-ai-25626507/</em></p>
<p>In this code, we'll create a quantum-based AI model to predict the species of iris flowers from the famous Iris dataset.</p>
<p>The code uses a quantum simulator called <code>default.qubit</code>, which mimics a quantum computer's behavior on a classical computer.</p>
<p>This is possible because of the use of mathematical models to simulate quantum operations.</p>
<p>However, with some code alterations, you can run this code on the IBM, Amazon, or Microsoft quantum platforms to make it actually run on a quantum computer.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pennylane <span class="hljs-keyword">as</span> qml
<span class="hljs-keyword">from</span> pennylane <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_iris
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score

# Load and preprocess the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# One-hot encode the labels
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y.reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>))

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_onehot, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

# Define a quantum device
n_qubits = <span class="hljs-number">4</span>
dev = qml.device(<span class="hljs-string">'default.qubit'</span>, wires=n_qubits)

# Define a quantum node
@qml.qnode(dev)
def quantum_circuit(inputs, weights):
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(inputs)):
        qml.RY(inputs[i], wires=i)

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_qubits):
        qml.RX(weights[i], wires=i)
        qml.RY(weights[n_qubits + i], wires=i)

    <span class="hljs-keyword">return</span> [qml.expval(qml.PauliZ(i)) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_qubits)]

# Define a hybrid quantum-classical model
def hybrid_model(inputs, weights):
    <span class="hljs-keyword">return</span> quantum_circuit(inputs, weights)

# Initialize weights
np.random.seed(<span class="hljs-number">0</span>)
weights = np.random.normal(<span class="hljs-number">0</span>, np.pi, (<span class="hljs-number">2</span> * n_qubits,))

# Define a cost function
def cost(weights):
    predictions = np.array([hybrid_model(x, weights) for x in X_train])
    loss = np.mean((predictions - y_train) ** 2)
    return loss

# Optimize the weights using gradient descent
opt = qml.GradientDescentOptimizer(stepsize=0.1)
steps = 100
for i in range(steps):
    weights = opt.step(cost, weights)
    if i % 10 == 0:
        print(f"Step {i}, Cost: {cost(weights)}")

# Test the model
predictions = np.array([hybrid_model(x, weights) for x in X_test])
predicted_labels = np.argmax(predictions, axis=1)
true_labels = np.argmax(y_test, axis=1)

# Calculate the accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/1-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let's see the code block by block!</p>
<h3 id="heading-import-libraries">Import Libraries</h3>
<pre><code><span class="hljs-keyword">import</span> pennylane <span class="hljs-keyword">as</span> qml
<span class="hljs-keyword">from</span> pennylane <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_iris
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/2-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Import Libraries</em></p>
<p>In this part of the code we imported the necessary libraries:</p>
<ul>
<li><code>pennylane</code> and <code>pennylane.numpy</code>: For creating and manipulating quantum circuits.</li>
<li><code>sklearn.datasets</code>: To load the Iris dataset.</li>
<li><code>sklearn.preprocessing</code>: For data preprocessing like scaling and encoding.</li>
<li><code>sklearn.model_selection</code>: For splitting the data into training and testing sets.</li>
<li><code>sklearn.metrics</code>: To evaluate the model's accuracy.</li>
</ul>
<h3 id="heading-load-and-preprocess-the-iris-dataset">Load and Preprocess the Iris Dataset</h3>
<pre><code># Load and preprocess the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# One-hot encode the labels
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y.reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>))

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_onehot, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/3-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Load and Preprocess the Iris Dataset</em></p>
<p>Here, we prepared the data for training the neural network:</p>
<ul>
<li>Loads the Iris dataset and extracts features (<code>X</code>) and labels (<code>y</code>).</li>
<li>Standardizes the features to have zero mean and unit variance using <code>StandardScaler</code>.</li>
<li>One-hot encodes the labels for multi-class classification using <code>OneHotEncoder</code>.</li>
<li>Splits the dataset into training and test sets with a ratio of 80/20.</li>
</ul>
<h3 id="heading-define-the-quantum-device-and-circuit">Define the Quantum Device and Circuit</h3>
<pre><code># Define a quantum device
n_qubits = <span class="hljs-number">4</span>
dev = qml.device(<span class="hljs-string">'default.qubit'</span>, wires=n_qubits)

# Define a quantum node
@qml.qnode(dev)
def quantum_circuit(inputs, weights):
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(inputs)):
        qml.RY(inputs[i], wires=i)

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_qubits):
        qml.RX(weights[i], wires=i)
        qml.RY(weights[n_qubits + i], wires=i)

    <span class="hljs-keyword">return</span> [qml.expval(qml.PauliZ(i)) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_qubits)]
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/4-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define the Quantum Device and Circuit</em></p>
<p>This segment defines the quantum device and circuit:</p>
<ul>
<li>Sets up a quantum device with 4 qubits using PennyLane's default simulator.</li>
<li>Defines a quantum circuit (<code>quantum_circuit</code>) that takes inputs and weights. The circuit applies rotation gates (<code>RY</code>, <code>RX</code>) to encode inputs and parameters, and measures the expectation values of <code>PauliZ</code> operators on each qubit.</li>
</ul>
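<p>To build some intuition for what <code>qml.expval(qml.PauliZ(i))</code> returns, here is a hand-rolled single-qubit check in plain NumPy (no quantum simulator needed): applying an RY rotation by angle theta to the starting state |0&gt; and measuring Z gives exactly cos(theta), so each expectation value is a number between -1 and 1 that depends smoothly on the rotation angles:</p>

```python
import numpy as np

def ry(theta):
    # Matrix for the single-qubit RY rotation gate
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

Z = np.diag([1.0, -1.0])      # Pauli-Z observable
ket0 = np.array([1.0, 0.0])   # |0> state, how each wire starts

theta = 0.7                   # example rotation angle
state = ry(theta) @ ket0      # state after an RY(theta) gate

# Expectation value <psi|Z|psi>, what qml.expval(qml.PauliZ(i)) measures
expval = state @ Z @ state

print(np.isclose(expval, np.cos(theta)))  # True: <Z> = cos(theta)
```

<p>This is why the circuit's outputs can act like neuron activations: the classical optimizer nudges the weight angles, and the expectation values respond smoothly.</p>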
<h3 id="heading-define-the-hybrid-model-and-initialize-weights">Define the Hybrid Model and Initialize Weights</h3>
<pre><code># Define a hybrid quantum-classical model
def hybrid_model(inputs, weights):
    <span class="hljs-keyword">return</span> quantum_circuit(inputs, weights)

# Initialize weights
np.random.seed(<span class="hljs-number">0</span>)
weights = np.random.normal(<span class="hljs-number">0</span>, np.pi, (<span class="hljs-number">2</span> * n_qubits,))
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/5-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define the Hybrid Model and Initialize Weights</em></p>
<p>Here, we actually created the model and initialized its weights.</p>
<ul>
<li>Defines a hybrid model function that utilizes the quantum circuit.</li>
<li>Initializes the weights for the model using a normal distribution with a specified seed for reproducibility.</li>
</ul>
<h3 id="heading-define-the-cost-function-and-optimize-weights">Define the Cost Function and Optimize Weights</h3>
<pre><code># Define a cost function
def cost(weights):
    predictions = np.array([hybrid_model(x, weights) for x in X_train])
    loss = np.mean((predictions - y_train) ** 2)
    return loss

# Optimize the weights using gradient descent
opt = qml.GradientDescentOptimizer(stepsize=0.1)
steps = 100
for i in range(steps):
    weights = opt.step(cost, weights)
    if i % 10 == 0:
        print(f"Step {i}, Cost: {cost(weights)}")
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/6-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define the Cost Function and Optimize Weights</em></p>
<p>Finally, we train the quantum-based neural network:</p>
<ul>
<li>Defines a cost function that calculates the mean squared error between predictions and true labels.</li>
<li>Uses PennyLane's <code>GradientDescentOptimizer</code> to minimize the cost function by updating weights iteratively. It prints the cost every 10 steps to track progress.</li>
</ul>
<p>It prints out:</p>
<pre><code>Step <span class="hljs-number">0</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.35359229278282217</span>
Step <span class="hljs-number">10</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.3145818194833503</span>
Step <span class="hljs-number">20</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.28937668289628116</span>
Step <span class="hljs-number">30</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.2733108557682183</span>
Step <span class="hljs-number">40</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.26273285477208475</span>
Step <span class="hljs-number">50</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.25532913470009133</span>
Step <span class="hljs-number">60</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.24973939376050813</span>
Step <span class="hljs-number">70</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.24517135825709957</span>
Step <span class="hljs-number">80</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.2411459409849017</span>
Step <span class="hljs-number">90</span>, <span class="hljs-attr">Cost</span>: <span class="hljs-number">0.23735091263019087</span>
</code></pre><h3 id="heading-test-the-model-and-evaluate-accuracy">Test the Model and Evaluate Accuracy</h3>
<pre><code># Test the model
predictions = np.array([hybrid_model(x, weights) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> X_test])
predicted_labels = np.argmax(predictions, axis=<span class="hljs-number">1</span>)
true_labels = np.argmax(y_test, axis=<span class="hljs-number">1</span>)

# Calculate the accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f<span class="hljs-string">"Test Accuracy: {accuracy * 100:.2f}%"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/7-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Test the Model and Evaluate Accuracy</em></p>
<p>Next, we evaluate the trained model:</p>
<ul>
<li>Makes predictions on the test set using the optimized weights.</li>
<li>Converts one-hot encoded predictions and true labels back to class labels.</li>
<li>Calculates and prints the accuracy of the model using <code>accuracy_score</code>.</li>
</ul>
<p>And the final results gave:</p>
<pre><code>Test Accuracy: <span class="hljs-number">66.67</span>%
</code></pre><p>An accuracy of 67% is not a strong result. This is largely because we did not tune this network for the data at hand.</p>
<p>We would need to change the network's structure and hyperparameters to get better results.</p>
<p>For comparison, on this dataset an ordinary classical neural network, combined with a library like <a target="_blank" href="https://optuna.org/">Optuna</a> for hyperparameter optimization, can readily reach accuracies above 98%.</p>
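<p>As a rough, hypothetical illustration (not part of this article's notebook), here is what such a classical baseline might look like using scikit-learn's <code>MLPClassifier</code> on the classic Iris dataset. The exact score depends on the random seed and split, but it typically lands far above the quantum toy model's 67%:</p>

```python
# Hypothetical classical baseline (not from the article): a small
# scikit-learn MLP on the Iris dataset, for comparison with the
# quantum model's ~67% test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale features so the MLP converges quickly.
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

acc = accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
print(f"Classical baseline accuracy: {acc * 100:.2f}%")
```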
<p>Nevertheless, we created a simple quantum AI model.</p>
<h2 id="Conclusion">Conclusion: The Future of Efficient AI Models</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-pixabay-210158.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pixabay: https://www.pexels.com/photo/low-angle-photography-of-grey-and-black-tunnel-overlooking-white-cloudy-and-blue-sky-210158/</em></p>
<p>Integrating quantum computing into AI opens the door to smaller, more efficient models. As quantum hardware matures, these techniques will see wider use in AI.</p>
<p>In my view, the future of AI will eventually be intertwined with quantum computers.</p>
<p>Here is the full code:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is a Kalman Filter? How to Simplify Noisy Data in Navigation and Finance ]]>
                </title>
                <description>
                    <![CDATA[ In a world where precision is key, handling noisy data effectively is crucial for solving complex problems. Whether you're trying to control a rocket or forecast the stock market, the ability to get good data from an uncertain environment is importan... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-a-kalman-filter-with-python-code-examples/</link>
                <guid isPermaLink="false">66ba5353f77647345442b9d5</guid>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Wed, 07 Aug 2024 13:42:54 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/08/pexels-skitterphoto-63901.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In a world where precision is key, handling noisy data effectively is crucial for solving complex problems.</p>
<p>Whether you're trying to control a rocket or forecast the stock market, the ability to get good data from an uncertain environment is important.</p>
<p>This is exactly the problem Kalman filters help solve: they offer a principled way to deal with noisy data across many fields.</p>
<p>In this article, we'll discuss:</p>
<ul>
<li><a class="post-section-overview" href="#heading-driving-through-fog-kalman-filters-as-your-headlights">Driving Through Fog: Kalman Filters as Your Headlights</a></li>
<li><a class="post-section-overview" href="#heading-what-are-kalman-filters">What are Kalman Filters?</a></li>
<li><a class="post-section-overview" href="#heading-kalman-filters-in-action-a-step-by-step-code-tutorial">Kalman Filters in Action: A Step-by-Step Code Example</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-navigating-nonlinear-data-with-advanced-techniques">Conclusion: Navigating Nonlinear Data with Advanced Techniques</a></li>
</ul>
<h2 id="Driving">Driving Through Fog: Kalman Filters as Your Headlights</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-eberhardgross-1287075.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by eberhard grossgasteiger: https://www.pexels.com/photo/forest-under-clouds-1287075/</em></p>
<p>Imagine you are driving through a dense fog with limited visibility.</p>
<p>To reach the destination, you rely on your senses and your car's navigation system that combines real-time data with a predetermined map.</p>
<p>As you move, the car's navigation system constantly adjusts its route to the destination, while you rely on your senses to drive the car safely.</p>
<p>This process is very similar to how a Kalman Filter works.</p>
<p>It constantly updates and refines its estimates based on incoming data, even when that data is full of noise and uncertainty.</p>
<p>By integrating past information with current information, a Kalman Filter gives you a clear picture of where you are and where you're headed.</p>
<h2 id="heading-what-are-kalman-filters">What are Kalman Filters?</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-mikebirdy-170811.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/blue-bmw-sedan-near-green-lawn-grass-170811/">Mike Bird on Pexels</a></em></p>
<p>A Kalman filter is a mathematical algorithm that estimates the state of a dynamic system from a series of noisy measurements.</p>
<p>It is often used for systems that change over time – like tracking the position of a moving object.</p>
<h3 id="heading-how-does-a-kalman-filter-work">How Does a Kalman Filter Work?</h3>
<p>The Kalman filter predicts your current state based on past data, like the map and your previous location.</p>
<p>When new data appears, like new GPS signals, the filter compares the new data with its prediction and adjusts its estimate.</p>
<p>Even if the data is noisy, the Kalman filter uses a smart averaging process to improve the estimation. Like how you balance what your navigation system tells you and what you see on the road.</p>
<p>By always integrating new data with past data, Kalman filters help you know where you are and where you are going. This way, it is possible to predict things even in uncertain conditions.</p>
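<p>The predict-and-correct cycle described above can be sketched in a few lines of plain Python. This is a hypothetical one-dimensional example (not from the article): we track a constant true value through noisy readings, and the filter's gain decides how much to trust each new measurement versus the running estimate.</p>

```python
import random

random.seed(42)

true_value = 10.0                   # the quantity we are trying to estimate
estimate, variance = 0.0, 1000.0    # start with a wild guess and low confidence
measurement_noise = 4.0             # variance of each noisy reading
process_noise = 0.01                # how much the system may drift between steps

for _ in range(50):
    # Simulate a noisy sensor reading around the true value.
    z = true_value + random.gauss(0.0, measurement_noise ** 0.5)

    # Predict: the state is assumed constant, but uncertainty grows a little.
    variance += process_noise

    # Update: the Kalman gain blends prediction and measurement,
    # weighting whichever is more certain.
    gain = variance / (variance + measurement_noise)
    estimate += gain * (z - estimate)
    variance *= (1.0 - gain)

print(f"Estimate after 50 steps: {estimate:.2f} (true value {true_value})")
```

<p>Notice that early on the gain is close to 1 (the filter trusts the measurements), and it shrinks as the estimate becomes confident.</p>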
<h3 id="heading-why-are-kalman-filters-used-in-engineering">Why are Kalman Filters used in engineering?</h3>
<p>Since Kalman filters are able to handle incomplete data, they are widely used to make good predictions even when the measurements are not certain.</p>
<p>This makes them very useful for:</p>
<ul>
<li><strong>Navigation Systems</strong>: Estimating the position and velocity of vehicles.</li>
<li><strong>Robotics</strong>: Helping robots understand their environment and position.</li>
<li><strong>Finance</strong>: Filtering out noise from stock price data to predict trends.</li>
</ul>
<p>This is because they are highly adaptive and can process information in real time.</p>
<h3 id="heading-what-problem-did-kalman-filters-solve">What problem did Kalman Filters solve?</h3>
<p>Kalman filters were developed by Rudolf Kalman in the early 1960s to solve the problem of managing uncertainty and noise in data.</p>
<p>Nowadays, they are great for extracting meaningful information from noisy data.</p>
<p>Mathematically, Kalman Filters are called linear quadratic estimators.</p>
<p>This is because, in the process of estimating the future based on current and past data, Kalman filters use:</p>
<ul>
<li>Linear algebra: The study of vectors and matrices used to solve linear equations.</li>
<li>Quadratic optimization: Finding the optimal solution for problems with squared terms.</li>
</ul>
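<p>The "quadratic" part can be made concrete with a small numeric check (an illustration of ours, not from the article). For a prior estimate with variance <code>P</code> and a measurement with variance <code>R</code>, the error variance after blending with gain <code>k</code> is a quadratic in <code>k</code>, and its minimizer is the classic Kalman gain <code>P / (P + R)</code>:</p>

```python
import numpy as np

# Illustrative numbers (not from the article): prior and measurement variances.
P, R = 2.0, 0.5

# Posterior error variance as a function of the blending gain k:
# var(k) = (1 - k)^2 * P + k^2 * R   (a quadratic in k)
k = np.linspace(0.0, 1.0, 10001)
posterior_var = (1 - k) ** 2 * P + k ** 2 * R

best_k = k[np.argmin(posterior_var)]    # gain found by brute-force search
kalman_gain = P / (P + R)               # the closed-form minimizer

print(f"Numerically best gain: {best_k:.4f}")
print(f"Kalman gain P/(P+R):   {kalman_gain:.4f}")
```

<p>Both values agree, which is why the filter is called a linear <em>quadratic</em> estimator: it is the linear update that minimizes squared error.</p>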
<h2 id="Kalman">Kalman Filters in Action: A Step-by-Step Code Tutorial</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-captainsopon-3402846.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by capt.sopon: https://www.pexels.com/photo/gray-airplane-control-panel-3402846/</em></p>
<p>Kalman Filters were created to handle linear systems – that is, systems that follow predictable patterns.</p>
<p>In this code example, we will implement an Extended Kalman Filter. This is a variant that was created to handle non-linear data (in other words, systems that have unpredictable or changing patterns).</p>
<p>Here's the full code (which we'll break down below):</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> filterpy.kalman <span class="hljs-keyword">import</span> ExtendedKalmanFilter <span class="hljs-keyword">as</span> EKF
<span class="hljs-keyword">from</span> filterpy.common <span class="hljs-keyword">import</span> Q_discrete_white_noise

def fx(x, dt):
    <span class="hljs-string">""</span><span class="hljs-string">" State transition function for the nonlinear system. "</span><span class="hljs-string">""</span>
    # Example: x<span class="hljs-string">' = [x[0] + x[1]*dt, x[1]]
    F = np.array([x[0] + x[1]*dt, x[1]])
    return F

def hx(x):
    """ Measurement function for the nonlinear system. """
    # Example: z = [x[0]]
    return np.array([x[0]])

def jacobian_F(x, dt):
    """ Jacobian of the state transition function. """
    return np.array([[1, dt],
                     [0, 1]])

def jacobian_H(x):
    """ Jacobian of the measurement function. """
    return np.array([[1, 0]])

# Initialize EKF
ekf = EKF(dim_x=2, dim_z=1)

# Initial state
ekf.x = np.array([0, 1])

# Initial state covariance
ekf.P = np.eye(2)

# Process noise covariance
ekf.Q = Q_discrete_white_noise(dim=2, dt=1, var=0.1)

# Measurement noise covariance
ekf.R = np.array([[0.1]])

# Define the state transition and measurement functions
ekf.F = jacobian_F
ekf.H = jacobian_H

# Control input
dt = 1.0  # time step

# Simulated measurements
measurements = [1, 2, 3, 4, 5]

for z in measurements:
    # Predict step
    ekf.predict_update(z, HJacobian=jacobian_H, Hx=hx, Fx=fx, args=(dt,), hx_args=())

    # Print the current state estimate
    print("Estimated state:", ekf.x)</span>
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/1-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full code</em></p>
<p>Let's see the code block by block.</p>
<h3 id="heading-import-the-libraries">Import the Libraries</h3>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> filterpy.kalman <span class="hljs-keyword">import</span> ExtendedKalmanFilter <span class="hljs-keyword">as</span> EKF
<span class="hljs-keyword">from</span> filterpy.common <span class="hljs-keyword">import</span> Q_discrete_white_noise
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/2-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing Libraries</em></p>
<p>In this part of the code we import the Python libraries we need:</p>
<ul>
<li><strong><code>import numpy as np</code></strong>: This imports a tool called <a target="_blank" href="https://numpy.org/">NumPy</a>, which helps us work with numbers and lists of numbers (like a spreadsheet).</li>
<li><strong><code>from filterpy.kalman import ExtendedKalmanFilter as EKF</code></strong>: This brings in the <code>ExtendedKalmanFilter</code> class from the <a target="_blank" href="https://filterpy.readthedocs.io/en/latest/">filterpy</a> library. We will use this tool, named <code>EKF</code> here, to track things that change over time in a way that's not straight-line simple.</li>
<li><strong><code>from filterpy.common import Q_discrete_white_noise</code></strong>: This imports a function that helps us set up noise, which is like the natural "fuzziness" or uncertainty in our system.</li>
</ul>
<h3 id="heading-define-how-the-system-works">Define How the System Works</h3>
<pre><code>def fx(x, dt):
    """ State transition function for the nonlinear system. """
    # Example: x' = [x[0] + x[1]*dt, x[1]]
    return np.array([x[0] + x[1]*dt, x[1]])

def hx(x):
    """ Measurement function for the nonlinear system. """
    # Example: z = [x[0]]
    return np.array([x[0]])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/3-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define How the System Works</em></p>
<p>In this code we define how the system will work:</p>
<ul>
<li><strong><code>fx(x, dt)</code></strong>: This function describes how our system changes over time. It says the new position is the old position plus speed times time (<code>x[0] + x[1]*dt</code>). The speed (<code>x[1]</code>) stays the same.</li>
<li><strong><code>hx(x)</code></strong>: This function tells us what we can measure from the system. Here, it says we can measure the position (<code>x[0]</code>).</li>
</ul>
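<p>To make the transition function concrete, we can step it forward by hand (a small check of ours, using the same <code>fx</code> as above). Starting from position 0 and speed 1 with <code>dt = 1.0</code>, each application of <code>fx</code> advances the position by the speed:</p>

```python
import numpy as np

def fx(x, dt):
    """State transition: new position = position + speed * dt; speed unchanged."""
    return np.array([x[0] + x[1] * dt, x[1]])

state = np.array([0.0, 1.0])   # position 0, speed 1
positions = [state[0]]
for _ in range(4):
    state = fx(state, 1.0)
    positions.append(state[0])

print(positions)   # the position advances by 1 each step
```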
<h3 id="heading-define-how-changes-affect-the-system">Define How Changes Affect the System</h3>
<pre><code>def jacobian_F(x, dt):
    """ Jacobian of the state transition function. """
    return np.array([[1, dt],
                     [0, 1]])

def jacobian_H(x):
    """ Jacobian of the measurement function. """
    return np.array([[1, 0]])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/4-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define How Changes Affect the System</em></p>
<p>In this code we define how changes affect the system:</p>
<ul>
<li><strong><code>jacobian_F(x, dt)</code></strong>: This function shows us how sensitive the system is to changes in time and position. It helps the filter predict changes more accurately by considering these sensitivities.</li>
<li><strong><code>jacobian_H(x)</code></strong>: This function tells us how sensitive our measurement is to changes in position. It helps the filter adjust the prediction based on new measurements.</li>
</ul>
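<p>One way to sanity-check hand-written Jacobians like these (a common trick, not shown in the article) is to compare them against a finite-difference approximation of the original function. Since our <code>fx</code> is linear, the two should match almost exactly:</p>

```python
import numpy as np

def fx(x, dt):
    # Same transition function as in the article.
    return np.array([x[0] + x[1] * dt, x[1]])

def jacobian_F(x, dt):
    # Hand-written Jacobian of fx.
    return np.array([[1.0, dt], [0.0, 1.0]])

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate df/dx column by column with central differences."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.column_stack(cols)

x, dt = np.array([0.5, 1.2]), 1.0
analytic = jacobian_F(x, dt)
numeric = numerical_jacobian(lambda v: fx(v, dt), x)

print(np.allclose(analytic, numeric, atol=1e-6))  # prints True
```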
<h3 id="heading-set-up-the-kalman-filter">Set Up the Kalman Filter</h3>
<pre><code># Initialize EKF
ekf = EKF(dim_x=<span class="hljs-number">2</span>, dim_z=<span class="hljs-number">1</span>)

# Initial state
ekf.x = np.array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>])
print(<span class="hljs-string">"Initial state:"</span>, ekf.x)

# Initial state covariance
ekf.P = np.eye(<span class="hljs-number">2</span>)
print(<span class="hljs-string">"Initial state covariance:\n"</span>, ekf.P)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/5-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Set Up the Kalman Filter</em></p>
<p>In this part of the code, we create a very simple Kalman filter:</p>
<ul>
<li><strong><code>ekf = EKF(dim_x=2, dim_z=1)</code></strong>: This creates an Extended Kalman Filter that tracks two things (position and speed) and one measurement (position).</li>
<li><strong><code>ekf.x = np.array([0, 1])</code></strong>: This sets the starting position to <code>0</code> and speed to <code>1</code>.</li>
</ul>
<p>It prints out:</p>
<pre><code>Initial state: [<span class="hljs-number">0</span> <span class="hljs-number">1</span>]
</code></pre><ul>
<li><strong><code>ekf.P = np.eye(2)</code></strong>: This is a way of saying we aren't very sure about our starting guesses. It's like saying "let's start from here, but we are open to changes."</li>
</ul>
<p>It prints out:</p>
<pre><code>Initial state covariance:
 [[<span class="hljs-number">1.</span> <span class="hljs-number">0.</span>]
 [<span class="hljs-number">0.</span> <span class="hljs-number">1.</span>]]
</code></pre><h3 id="heading-describe-uncertainty-in-the-system">Describe Uncertainty in the System</h3>
<pre><code># Process noise covariance
ekf.Q = Q_discrete_white_noise(dim=<span class="hljs-number">2</span>, dt=<span class="hljs-number">1</span>, <span class="hljs-keyword">var</span>=<span class="hljs-number">0.1</span>)
print(<span class="hljs-string">"Process noise covariance:\n"</span>, ekf.Q)

# Measurement noise covariance
ekf.R = np.array([[<span class="hljs-number">0.1</span>]])
print(<span class="hljs-string">"Measurement noise covariance:\n"</span>, ekf.R)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/6-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Describe Uncertainty in the System</em></p>
<ul>
<li><strong><code>ekf.Q = Q_discrete_white_noise(dim=2, dt=1, var=0.1)</code></strong>: This sets how much randomness or unpredictability we expect in the system itself. It's like saying, "things might not move exactly as we think."</li>
</ul>
<p>It prints out:</p>
<pre><code>Process noise covariance:
 [[<span class="hljs-number">0.025</span> <span class="hljs-number">0.05</span> ]
 [<span class="hljs-number">0.05</span>  <span class="hljs-number">0.1</span>  ]]
</code></pre><ul>
<li><strong><code>ekf.R = np.array([[0.1]])</code></strong>: This sets how much we trust our measurements. A smaller number means we trust them more.</li>
</ul>
<pre><code>Measurement noise covariance:
 [[<span class="hljs-number">0.1</span>]]
</code></pre><h3 id="heading-simulate-data-and-initial-state">Simulate Data and Initial State</h3>
<pre><code># Control input
dt = <span class="hljs-number">1.0</span>  # time step

# Simulated measurements
measurements = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]

# True initial state <span class="hljs-keyword">for</span> comparison (not used <span class="hljs-keyword">in</span> the EKF)
true_state = np.array([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>])
print(<span class="hljs-string">"\nTrue initial state:"</span>, true_state)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/7-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simulate Data and Initial State</em></p>
<ul>
<li><strong><code>dt = 1.0</code></strong>: This is the time between each step of our simulation.</li>
<li><strong><code>measurements = [1, 2, 3, 4, 5]</code></strong>: These are the pretend measurements we will use to test the filter.</li>
<li><strong><code>true_state = np.array([0, 1])</code></strong>: This is the real starting position and speed of our system, used for comparison.</li>
</ul>
<p>It gives:</p>
<pre><code>True initial state: [<span class="hljs-number">0</span> <span class="hljs-number">1</span>]
</code></pre><h3 id="heading-simulate-real-system-changes">Simulate Real System Changes</h3>
<pre><code># Simulate the <span class="hljs-literal">true</span> state evolution (<span class="hljs-keyword">for</span> comparison)
true_states = [true_state[<span class="hljs-number">0</span>]]
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(len(measurements) - <span class="hljs-number">1</span>):
    true_state = fx(true_state, dt)
    true_states.append(true_state[<span class="hljs-number">0</span>])

print(<span class="hljs-string">"\nSimulated true states (for reference):"</span>, true_states)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/8-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simulate Real System Changes</em></p>
<ul>
<li><strong>Simulating True States</strong>: This part calculates what the real position should be over time using the way the system works (<code>fx</code>). It's like having a perfect GPS to check against our estimates.</li>
</ul>
<pre><code>Simulated <span class="hljs-literal">true</span> states (<span class="hljs-keyword">for</span> reference): [<span class="hljs-number">0</span>, <span class="hljs-number">1.0</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">3.0</span>, <span class="hljs-number">4.0</span>]
</code></pre><h3 id="heading-filter-steps-to-estimate-the-state">Filter Steps to Estimate the State</h3>
<pre><code><span class="hljs-keyword">for</span> i, z <span class="hljs-keyword">in</span> enumerate(measurements):
    print(f<span class="hljs-string">"\nStep {i+1}:"</span>)
    print(<span class="hljs-string">"Measurement:"</span>, z)

    # Predict step
    ekf.predict(u=<span class="hljs-number">0</span>)  # Use predict_x <span class="hljs-keyword">if</span> you need to customize the prediction
    print(<span class="hljs-string">"Predicted state before update:"</span>, ekf.x)

    # Update step
    ekf.update(z, HJacobian=jacobian_H, Hx=hx, args=(), hx_args=())
    print(<span class="hljs-string">"Updated state after measurement:"</span>, ekf.x)
    print(<span class="hljs-string">"State covariance after update:\n"</span>, ekf.P)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/9.png" alt="Image" width="600" height="400" loading="lazy">
<em>Filter Steps to Estimate the State</em></p>
<p><strong>Loop Through Measurements</strong>: This loop goes through each fake measurement one by one.</p>
<ul>
<li><strong>Predict Step (<code>ekf.predict(u=0)</code>)</strong>: Before looking at the new measurement, the filter makes a guess about where the position and speed are now.</li>
<li><strong>Update Step (<code>ekf.update</code>)</strong>: After the guess, the filter sees the new measurement and adjusts its guess to be closer to this measurement, balancing the new information with what it previously predicted.</li>
</ul>
<p>Here are the results:</p>
<pre><code>Step <span class="hljs-number">1</span>:
Measurement: <span class="hljs-number">1</span>
Predicted state before update: [<span class="hljs-number">0.</span> <span class="hljs-number">1.</span>]
Updated state after measurement: [<span class="hljs-number">0.91111111</span> <span class="hljs-number">1.04444444</span>]
State covariance after update:
 [[<span class="hljs-number">0.09111111</span> <span class="hljs-number">0.00444444</span>]
 [<span class="hljs-number">0.00444444</span> <span class="hljs-number">1.09777778</span>]]

Step <span class="hljs-number">2</span>:
Measurement: <span class="hljs-number">2</span>
Predicted state before update: [<span class="hljs-number">0.91111111</span> <span class="hljs-number">1.04444444</span>]
Updated state after measurement: [<span class="hljs-number">1.49614396</span> <span class="hljs-number">1.31876607</span>]
State covariance after update:
 [[<span class="hljs-number">0.05372751</span> <span class="hljs-number">0.0251928</span> ]
 [<span class="hljs-number">0.0251928</span>  <span class="hljs-number">1.1840617</span> ]]

Step <span class="hljs-number">3</span>:
Measurement: <span class="hljs-number">3</span>
Predicted state before update: [<span class="hljs-number">1.49614396</span> <span class="hljs-number">1.31876607</span>]
Updated state after measurement: [<span class="hljs-number">2.15857605</span> <span class="hljs-number">1.95145631</span>]
State covariance after update:
 [[<span class="hljs-number">0.0440489</span>  <span class="hljs-number">0.0420712</span> ]
 [<span class="hljs-number">0.0420712</span>  <span class="hljs-number">1.25242718</span>]]

Step <span class="hljs-number">4</span>:
Measurement: <span class="hljs-number">4</span>
Predicted state before update: [<span class="hljs-number">2.15857605</span> <span class="hljs-number">1.95145631</span>]
Updated state after measurement: [<span class="hljs-number">2.91071524</span> <span class="hljs-number">2.95437384</span>]
State covariance after update:
 [[<span class="hljs-number">0.04084552</span> <span class="hljs-number">0.05446424</span>]
 [<span class="hljs-number">0.05446424</span> <span class="hljs-number">1.30228131</span>]]

Step <span class="hljs-number">5</span>:
Measurement: <span class="hljs-number">5</span>
Predicted state before update: [<span class="hljs-number">2.91071524</span> <span class="hljs-number">2.95437384</span>]
Updated state after measurement: [<span class="hljs-number">3.74022237</span> <span class="hljs-number">4.27039095</span>]
State covariance after update:
 [[<span class="hljs-number">0.03970292</span> <span class="hljs-number">0.06298888</span>]
 [<span class="hljs-number">0.06298888</span> <span class="hljs-number">1.33648045</span>]]
</code></pre><h2 id="Beyond">Conclusion: Navigating Nonlinear Data with Advanced Techniques</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/pexels-noellegracephotos-906055.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/close-up-photography-of-magnifying-glass-906055/">Noelle Otto on Pexels</a></em></p>
<p>Kalman Filters are a powerful tool for extracting accurate estimates from noisy and incomplete data. </p>
<p>Variants like the Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) have been developed to address non-linearities in data. </p>
<p>However, these variants can still face challenges related to stability and accuracy when applied to complex non-linear systems. </p>
<p>This is due to their reliance on linear approximations, which may not capture the full dynamics of highly non-linear processes.</p>
<p>To overcome these limitations, alternative methods such as Neural Network-based approaches have gained attention.</p>
<p>Neural Networks can learn complex patterns directly from data, offering a robust solution for highly non-linear scenarios.</p>
<p>Despite these advancements, Kalman Filters remain an important tool in various fields of science and economics due to their simplicity, efficiency, and effectiveness in a wide range of applications. </p>
<p>As technology continues to evolve, the integration of Kalman Filters with other advanced techniques will likely enhance their capability to navigate the challenges of non-linear data more effectively.</p>
<p>Here is the full code:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Rocket Control System: Basic Control Theory with Python ]]>
                </title>
                <description>
                    <![CDATA[ Building any control systems, including a rocket control system, involves combining control theory with programming. Control theory is the study of how to make systems behave in a desired way using inputs. Planes, cars, trains, circuits, rockets and ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/basic-control-theory-with-python/</link>
                <guid isPermaLink="false">66ba531cf77647345442b9cf</guid>
                
                    <category>
                        <![CDATA[ control theory ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Control Systems ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ System Design ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 06 Aug 2024 14:26:44 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/08/pexels-pixabay-2159.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Building any control systems, including a rocket control system, involves combining control theory with programming.</p>
<p>Control theory is the study of how to make systems behave in a desired way using inputs.</p>
<p>Planes, cars, trains, circuits, rockets and many more systems need to have a brain or an architecture inside them.</p>
<p>Control theory is the study of the control architectures of these complex systems.</p>
<p>In this article, we will explore how to apply control theory to create a rocket control system using Python.</p>
<p>This is a simple guide to how the architecture of complex systems is created. In this case, it's based on a rocket.</p>
<p>In this article, you will learn about:</p>
<ul>
<li><a class="post-section-overview" href="#heading-rocket-systems-and-cake-baking-a-fun-comparison">Rocket Systems and Cake Baking: A Fun Comparison</a></li>
<li><a class="post-section-overview" href="#heading-rocket-control-made-simple-understanding-pid-controllers">Rocket Control Made Simple: Understanding PID Controllers</a></li>
<li><a class="post-section-overview" href="#heading-code-example-designing-a-simple-pid-controller">Code example: Designing a simple PID controller</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-non-linear-control-systems">Conclusion: Non-linear control systems</a></li>
</ul>
<p><strong>Note:</strong> We'll assume the rocket is time-invariant, meaning its behavior doesn't change over time. Addressing time-varying dynamics would complicate this tutorial more than I'd like. </p>
<h2 id="Cake">Rocket Systems and Cake Baking: A Fun Comparison</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/cake.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/white-icing-cover-cake-1702373/">Brent Keane on Pexels</a></em></p>
<h3 id="heading-what-is-a-rocket-control-system">What is a Rocket Control System?</h3>
<p>Imagine you are baking a cake. Your recipe provides the steps and ingredients needed to bake the cake.</p>
<p>In this analogy:</p>
<ul>
<li>The cake is the rocket</li>
<li>The recipe is the rocket flight plan</li>
<li>The baker's actions are the rocket control system</li>
</ul>
<p>Just as you change the oven temperature or mixing time to get the best cake, a control system adjusts the rocket's parameters to ensure it stays on its course and remains stable.</p>
<h3 id="heading-why-are-control-systems-important-in-programming">Why are control systems important in programming?</h3>
<p>By understanding control systems, you'll become better at algorithmic design and systems thinking.</p>
<p>It also enables you to figure out how to adjust processes in feedback loops. This is very important in many areas of programming.</p>
<p>You'll mainly use control theory and control systems when creating software for:</p>
<ul>
<li><strong>Robotics and Automation</strong>: Control systems enable precise movement and adaptability in robots using feedback loops based on sensor input.</li>
<li><strong>Signal Processing and Communication</strong>: They optimize data transmission, error correction, and filtering for reliable communication.</li>
<li><strong>Embedded Systems and IoT</strong>: Control systems manage device interactions with environments, processing sensor inputs and adjusting outputs efficiently.</li>
</ul>
<h3 id="heading-how-to-create-a-rocket-control-system">How to Create a Rocket Control System</h3>
<p>In terms of our cake baking analogy:</p>
<ol>
<li><strong>Choose the Cake and Recipe</strong>: Select a simple control strategy, like choosing a basic cake recipe. A common choice is a PID controller because it's simple and effective.</li>
<li><strong>Understanding the Ingredients</strong>: Derive a mathematical model of the characteristics and trajectory of the rocket. Like studying the recipe and ingredients. This way, we get a clear understanding of the system.</li>
<li><strong>Gathering Initial Ingredients</strong>: Set initial parameters, similar to gathering your basic ingredients. </li>
<li><strong>Mixing and Baking</strong>: Adjust and test the system, much like mixing ingredients and baking. This involves using various graphs to check stability and performance.</li>
<li><strong>Adding Final Touches</strong>: Fine-tune the parameters, just like adding final decorations to your cake, to optimize the control system for efficiency.</li>
<li><strong>Following the Recipe</strong>: Convert your design into a practical form, like carefully following a cake recipe.</li>
</ol>
<h2 id="Rocket">Rocket Control Made Simple: Understanding PID Controllers</h2>

<h3 id="heading-a-simple-control-system-the-pid-controller">A simple control system: The PID controller</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/M6_ControlSystemsdiagram.png" alt="Image" width="600" height="400" loading="lazy">
<em>Example of control system diagram (<a target="_blank" href="https://edtech.engineering.utoronto.ca/object/control-systems-diagrams">source</a>)</em></p>
<p>Every control system has a controller that runs it. One of the most used controllers is the PID controller.</p>
<p>In the code example here, we will use the PID controller. This is because it's simple and effective for simple control systems.</p>
<p>In a rocket control system, the rocket's PID controller constantly adjusts the rocket's path (processing block) by comparing its current position to where it should be (feedback block).</p>
<p>This way, the rocket stays on course and reaches its final destination.</p>
<p>The PID controller has three key parts that work in the processing and feedback part of the system: proportional gain (Kp), integral gain (Ki), and derivative gain (Kd).</p>
<ul>
<li><strong>The proportional gain (Kp):</strong> Reacts immediately to any error, making the system respond quickly but sometimes causing it to overshoot the target.</li>
<li><strong>The integral gain (Ki):</strong> Fixes past errors by adding them up over time, getting rid of any leftover errors, but it can make the system unstable if set too high.</li>
<li><strong>The derivative gain (Kd):</strong> Predicts future errors to help prevent overshooting and smooth out the system's response.</li>
</ul>
<p>This is why it's called a PID (Proportional-Integral-Derivative) controller.</p>
<p>These three parts work together to create a control signal that changes the rocket's setting. This ensures that it's stable, accurate and effective.</p>
<p>With the PID controller, we can control how the inputs like thrust and altitude change the position and speed to ensure the rocket is stable and on its intended path.</p>
<h3 id="heading-analyzing-stability">Analyzing Stability</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/stability.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/closeup-photography-of-stacked-stones-1051449/">Shiva Smyth on Pexels</a></em></p>
<p>To design a PID controller means to design a stable control system.</p>
<p>The process of designing a stable control system is called stability analysis.</p>
<p>There are many methods, but in the code example we will use:</p>
<ul>
<li><strong>Root locus:</strong> Shows system stability and response</li>
<li><strong>Bode plot:</strong> Displays system <em>gain</em> and <em>phase margins</em></li>
<li><strong>Nyquist plot:</strong> Illustrates stability and potential oscillations</li>
</ul>
<p>In this case, the gain and phase margins simply mean that the control system can tolerate changes.</p>
<p>The gain margin tells us how much the system gain can increase without losing stability. Gain is how much the input signal is amplified to produce the output signal.</p>
<p>The phase margin tells us how much delay is tolerable without losing stability. Delay in control theory means how much time it takes for the output to respond to the input.</p>
<p>This tells us how to change the Kp, Ki, and Kd so that the PID controller can control the rocket in an effective manner.</p>
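<p>As a rough sketch, the phase margin can be estimated numerically by sweeping the open-loop frequency response and finding where the gain crosses 0 dB. The transfer functions below are taken from the later code example (reusing them here is an assumption for illustration):</p>

```python
import numpy as np

def open_loop(w):
    # L(jw) = C(jw) * G(jw), with C and G from the later code example
    s = 1j * w
    C = (s**2 + 5*s + 2) / s         # PID: Kd=1, Kp=5, Ki=2
    G = 10 / (2*s**2 + 2*s + 1)      # rocket transfer function
    return C * G

w = np.logspace(-2, 3, 200_000)          # frequency sweep
mag = np.abs(open_loop(w))
k = np.argmin(np.abs(mag - 1.0))         # gain crossover: |L(jw)| = 1 (0 dB)
pm = 180.0 + np.degrees(np.angle(open_loop(w[k])))
print(f"phase margin ~ {pm:.0f} degrees")
```

<p>A positive phase margin of a few tens of degrees suggests the loop can tolerate extra delay before going unstable.</p>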
<h3 id="heading-the-need-for-transfer-functions-controlling-the-rocket-and-determining-component-values">The Need for Transfer Functions: Controlling the Rocket and Determining Component Values</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/Transfer-function-v2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To implement any control system, we need two transfer functions: one theoretical and one physical.</p>
<p>Transfer functions tell us how inputs convert to outputs in a mathematical way.</p>
<p>The theoretical function is, in this case, the PID controller.</p>
<p>The physical system transfer function represents real-world dynamics and behavior of the physical components in the system.</p>
<p>By combining both, we can understand the behavior of materials and component values such as:</p>
<ul>
<li>Capacitor values for energy storage</li>
<li>Sensor calibration values for accurate data measurement and feedback</li>
<li>Spring constants for shock absorption systems</li>
<li>Pressure ratings for fuel and oxidizer tanks</li>
</ul>
<p>This way, the PID controller is not only the brain of the rocket but also can tell us the values of the components needed so that the rocket can fly its path.</p>
<h3 id="heading-how-do-engineers-find-the-physical-transfer-function-equation">How do engineers find the physical transfer function equation?</h3>
<p>First, we need to understand what the rocket is for.</p>
<p>Will it send a LEO (Low Earth Orbit) or MEO (Medium Earth Orbit) satellite to space, or a rocket to the moon?</p>
<p>After knowing its use case, we can, with math and physics, find the physical equation of the transfer function.</p>
<p>There is actually an entire field of engineering called <strong>system identification</strong> dedicated to this.</p>
<p>Now let's see how to find, for any control system, its physical transfer function.</p>
<h2 id="Code">Code example: Designing a simple PID controller</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/rocket.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/space-rocket-launching-73871/">Pixabay</a></em></p>
<p>Now with this code example, we will create a simple control system for a rocket.</p>
<p>Before we dive into the code, let's talk about decibels.</p>
<p>Decibels are a logarithmic unit, best known for measuring sound. In control theory, they measure gain in a way that's easier to visualize on graphs.</p>
<p>This way, both very large and very small values fit in a manageable range.</p>
<p>In other words, the gain in decibels tells us, at a glance, how much the input is amplified to produce the output.</p>
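<p>The conversion itself is one line. A quick sketch of the formula used on the Bode magnitude plot:</p>

```python
import math

def to_db(gain):
    # Gain expressed in decibels, as on a Bode magnitude plot
    return 20 * math.log10(gain)

print(to_db(10.0))   # a 10x amplification is +20 dB
print(to_db(0.5))    # attenuation (gain below 1) comes out negative
```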
<p>I'll also explain how root locus, Bode plot, and Nyquist plots assist engineers in stability analysis.</p>
<p>Let's see the code – and then we'll analyze it block by block:</p>
<pre><code># Step 1: Import libraries
import matplotlib.pyplot as plt
import control as ctrl

# Step 2: Define the rocket transfer function
num = [10]
den = [2, 2, 1]
G = ctrl.TransferFunction(num, den)

# Step 3: Design a PID controller
Kp = 5
Ki = 2
Kd = 1
C = ctrl.TransferFunction([Kd, Kp, Ki], [1, 0])

# Step 4: Apply the PID controller to the rocket transfer function
CL = ctrl.feedback(C * G, 1)

# Step 5: Plot the root locus
plt.figure(figsize=(10, 6))
ctrl.root_locus(C * G, grid=True)
plt.title("Root Locus Plot (Closed-Loop)")

# Step 6: Plot the Bode plot of the closed-loop system
plt.figure(figsize=(10, 6))
ctrl.bode_plot(CL, dB=True, Hz=False, deg=True)
plt.suptitle("Bode Plot (Closed-Loop)", fontsize=16)

# Step 7: Plot the Nyquist plot of the closed-loop system
plt.figure(figsize=(10, 6))
ctrl.nyquist_plot(CL)
plt.title("Nyquist Plot (Closed-Loop)")

plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full Code</em></p>
<h3 id="heading-step-1-import-libraries">Step 1: Import libraries</h3>
<pre><code><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> control <span class="hljs-keyword">as</span> ctrl
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<p>Here we import two libraries:</p>
<ul>
<li><a target="_blank" href="https://matplotlib.org/">matplotlib</a>: A plotting library for creating various types of visualizations</li>
<li><a target="_blank" href="https://python-control.readthedocs.io/en/0.10.0/">Control</a>: A library for analyzing and designing control systems</li>
</ul>
<h3 id="heading-step-2-define-the-transfer-function-of-the-rocket-system">Step 2: Define the Transfer Function of the Rocket System</h3>
<pre><code>num = [<span class="hljs-number">10</span>] 
den = [<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>] 
G = ctrl.TransferFunction(num, den)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define the Transfer Function of the Rocket System</em></p>
<p>In this code we define the transfer function of the physical system:</p>
<ul>
<li><strong><code>num=[10]</code></strong>: Sets the system gain to 10.</li>
<li><strong><code>den=[2,2,1]</code></strong>: Defines the denominator coefficients, representing the polynomial 2s&#178; + 2s + 1.</li>
<li><strong><code>G = ctrl.TransferFunction(num, den)</code></strong>: Constructs the transfer function.</li>
</ul>
<p>This is the transfer function we are going to control with PID:</p>
    <div class="card">
        <div class="card-body">
            <p>
                $$G(s) = \frac{10}{2s^2 + 2s + 1}$$
            </p>
            <h5 id="heading-rocket-transfer-function">Rocket transfer function
          </h5>
        </div>
    </div>
<p>In this code example, the transfer function rocket equation is very simple. But in real life, rocket transfer functions are not time-invariant linear systems. Usually, they are very complex non-linear systems.</p>
<h3 id="heading-step-3-design-a-pid-controller-with-new-parameters">Step 3: Design a PID controller with new parameters</h3>
<pre><code>Kp = <span class="hljs-number">5</span>
Ki = <span class="hljs-number">2</span>
Kd = <span class="hljs-number">1</span>
C = ctrl.TransferFunction([Kd, Kp, Ki], [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Design a PID controller with new parameters</em></p>
<p>This code sets up a PID controller with specific gains and creates a transfer function:</p>
<ul>
<li><strong><code>Kp = 5</code></strong>: Sets the proportional gain to 5.</li>
<li><strong><code>Ki = 2</code></strong>: Sets the integral gain to 2.</li>
<li><strong><code>Kd = 1</code></strong>: Sets the derivative gain to 1.</li>
<li><strong><code>C = ctrl.TransferFunction([Kd, Kp, Ki], [1, 0])</code></strong>: Creates a transfer function of the PID controller</li>
</ul>
<h3 id="heading-step-4-applying-the-pid-controller-to-the-rocket-transfer-function">Step 4: Applying the PID controller to the rocket transfer function</h3>
<pre><code>CL = ctrl.feedback(C * G, <span class="hljs-number">1</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Applying the PID controller to the rocket transfer function</em></p>
<ul>
<li><strong><code>C * G</code></strong>: Multiplies the PID controller <code>C</code> with the system <code>G</code> (the rocket) to form the open-loop transfer function, which models the system's behavior without feedback and relies on predefined settings.</li>
<li><strong><code>ctrl.feedback(C * G, 1)</code></strong>: Computes the closed-loop transfer function by applying feedback and representing the system's behavior with feedback. This allows it to adjust inputs and automatically correct errors.</li>
<li><strong><code>CL</code></strong>: Stores the resulting closed-loop system, integrating the controller with the rocket to maintain desired performance through feedback, and is used for further analysis or simulation.</li>
</ul>
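<p><code>ctrl.feedback(C * G, 1)</code> implements the standard unity-feedback formula CL = CG / (1 + CG). A toy sketch with static, frequency-independent gains (the numbers are hypothetical) shows why closing the loop helps: the closed-loop gain barely moves even when the plant gain is halved.</p>

```python
def closed_loop_gain(C, G):
    # Unity-feedback formula with plain numbers instead of transfer functions
    return (C * G) / (1 + C * G)

nominal = closed_loop_gain(C=100.0, G=10.0)    # plant gain 10
perturbed = closed_loop_gain(C=100.0, G=5.0)   # plant gain halved

print(nominal, perturbed)   # both stay close to 1
```

<p>Without feedback, halving the plant gain would halve the output; with feedback, the output changes by a fraction of a percent.</p>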
<h3 id="heading-step-5-root-locus-for-gain-analysis">Step 5: Root Locus for gain analysis</h3>
<p>In this code:</p>
<pre><code>plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
ctrl.root_locus(C * G, grid=True)
plt.title(<span class="hljs-string">"Root Locus Plot (Closed-Loop)"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create the Root Locus Graph</em></p>
<p>We generate this plot:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/root-locus.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simple Root Locus Graph</em></p>
<p>This is a root locus graph. It was invented to help engineers study the stability of control systems.</p>
<p>The crosses on the graph, called poles, are very important.</p>
<p>If they are on the left side of the graph, the system is stable. If they are on the right side, the system is unstable.</p>
<p>The more to the left they are, the quicker the system will return to normal after a disturbance, and thus, the more stable it will be.</p>
<p>But moving more to the left can cause too many oscillations, depending on their specific locations.</p>
<p>The key point is:</p>
<ul>
<li>Tuning <strong><code>Kp</code></strong>, <strong><code>Ki</code></strong>, and <strong><code>Kd</code></strong> moves the poles as far left as possible without causing oscillations.</li>
</ul>
<p>However, the root locus graph is not enough to ensure stability. We need to use the Bode and Nyquist plots as well. Only with them can we get the best PID controller values for the rocket control system.</p>
<h3 id="heading-step-6-bode-plot-for-stability-analysis">Step 6: Bode Plot for Stability Analysis</h3>
<p>In this code:</p>
<pre><code>plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
ctrl.bode_plot(CL, dB=True, Hz=False, deg=True)
plt.suptitle(<span class="hljs-string">"Bode Plot (Closed-Loop)"</span>, fontsize=<span class="hljs-number">16</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/7.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create the Bode Plot Graph</em></p>
<p>We generate this plot:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/bode.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simple Bode Plot</em></p>
<p>The Bode plot was invented to help engineers understand how a system responds to changes and how stable it will be under different conditions.</p>
<p>The Bode plot also shows the system's stability and safety margins.</p>
<p>Let's understand how it works:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/detail-bode.png" alt="Image" width="600" height="400" loading="lazy">
<em>Bode Plot in detail</em></p>
<p>The graph on top is called the Magnitude Plot and the one below it is called the Phase Plot.</p>
<p>The magnitude plot measures the gain of a system across different frequencies. Higher gain means quicker and stronger reactions, which is good for precise control.</p>
<p>The phase plot measures the phase shift introduced by the system across different frequencies. The phase shift is read at the frequency where the gain crosses 0 dB.</p>
<p>In this case, the green line marks where the gain is zero, and the red line shows the phase shift at that frequency: approximately 63 degrees.</p>
<p>An ideal range is a phase shift of 30 to 60 degrees, which balances stability and response speed.</p>
<p>Over 60 degrees, the system is very stable, but might slow down the system response to changes.</p>
<p>So after analyzing the plot, we can conclude this PID controller is stable.</p>
<h3 id="heading-step-7-nyquist-plot-for-stability-analysis">Step 7: Nyquist Plot for Stability Analysis</h3>
<p>In this code:</p>
<pre><code>plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
ctrl.nyquist_plot(CL)
plt.title(<span class="hljs-string">"Nyquist Plot (Closed-Loop)"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/08/8.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create the Nyquist Plot Graph</em></p>
<p>We generate this plot:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/nyquist.png" alt="Image" width="600" height="400" loading="lazy">
<em>Nyquist Plot Graph</em></p>
<p>The Nyquist Plot is a tool to help engineers quickly check if a control system is stable or not.</p>
<p>It is very simple:</p>
<ul>
<li>If the plot does not encircle the red cross at the point (-1, 0), the system is stable.</li>
<li>If the plot makes clockwise encirclements of the red cross at the point (-1, 0), the system is unstable.</li>
</ul>
<p>Since there are no encirclements of the red cross, this control system is stable.</p>
<h3 id="heading-last-step-after-designing-the-rocket-control-system">Last step after designing the rocket control system</h3>
<p>After completing the design of this PID control system, we can use tools like <a target="_blank" href="https://www.mathworks.com/products/simulink.html">Simulink</a> to find the necessary values for many components.</p>
<p>In other words, after getting the best PID controller variables, it's time to find the physical component values of the rocket.</p>
<p>Some of these values are:</p>
<ul>
<li>Resistor values for controlling current flow</li>
<li>Capacitor values for energy storage</li>
<li>Inductor values for managing electromagnetic interference</li>
<li>Sensor calibration values for accurate data measurement and feedback</li>
<li>Strength and durability of materials for the rocket's body and fins</li>
<li>Torque and speed requirements for servo motors</li>
<li>Spring constants for shock absorption systems</li>
<li>Pressure ratings for fuel and oxidizer tanks</li>
</ul>
<p>Thanks to Simulink, we can get all these values needed to design a rocket according to its mission.</p>
<p>With a stable control system, based on a PID controller to control the physical transfer function of a rocket, we can get all the values needed for each component.</p>
<h2 id="Conclusion">Conclusion: Non-linear control systems</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/08/moon.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/photo-of-full-moon-975012/">Peter de Vink on Pexels</a></em></p>
<p>There are many methods available to us to optimize a Linear Time-Invariant (LTI) control system:</p>
<ol>
<li><strong>Root Locus Method</strong>: Adjust system poles to reduce oscillations.</li>
<li><strong>Bode Plot Analysis</strong>: Maintain phase margin and stability.</li>
<li><strong>Nyquist Plot</strong>: Confirm overall system stability.</li>
</ol>
<p>With these tools, it's possible to create a control system.</p>
<p>However, in this process, it is good practice to use methods like the Ziegler-Nichols method to more quickly find the best PID controller variables.</p>
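<p>For reference, the classic Ziegler-Nichols rules turn two experimentally measured quantities, the ultimate gain Ku (the proportional gain at which the loop oscillates steadily) and the oscillation period Tu, into PID gains. A sketch with hypothetical measurements:</p>

```python
def ziegler_nichols_pid(Ku, Tu):
    # Classic Ziegler-Nichols tuning rules for a full PID controller
    Kp = 0.6 * Ku
    Ki = 1.2 * Ku / Tu     # Kp / (Tu / 2)
    Kd = 0.075 * Ku * Tu   # Kp * (Tu / 8)
    return Kp, Ki, Kd

# Hypothetical measurements: loop oscillates at Ku = 10 with period Tu = 2 s
Kp, Ki, Kd = ziegler_nichols_pid(Ku=10.0, Tu=2.0)
print(Kp, Ki, Kd)   # starting-point gains, to be refined by hand
```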
<p>In our exploration, we worked with a very simple rocket system.</p>
<p>In real life, non-linear tools are used because real rocket systems are non-linear.</p>
<p>One example is adaptive control, where the control system adjusts itself in real time to handle changing conditions.</p>
<p>Another is Lyapunov's method, which is used for stability analysis in place of the three plots above.</p>
<p>Still, the process of making these control systems is always the same. This article explained how this process works and how it is applied in a time-invariant system.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Interpretable Artificial Intelligence Model – Simple Python Code Example ]]>
                </title>
                <description>
                    <![CDATA[ Artificial Intelligence is being used everywhere these days. And many of the groundbreaking applications come from Machine Learning, a subfield of AI. Within Machine Learning, a field called Deep Learning represents one of the main areas of research.... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-interpretable-ai-deep-learning-model/</link>
                <guid isPermaLink="false">66ba533a79bcbbffd5d70c56</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jul 2024 22:11:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/pexels-dmitry-demidov-515774-3852577.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Artificial Intelligence is being used everywhere these days. And many of the groundbreaking applications come from Machine Learning, a subfield of AI.</p>
<p>Within Machine Learning, a field called Deep Learning represents one of the main areas of research. It is from Deep Learning that most new, truly effective AI systems are born.</p>
<p>But typically, the AI systems born from Deep Learning are quite narrow, focused systems. They can outperform humans in one very specific area for which they were made.</p>
<p>Because of this, many new developments in AI come from specialized systems or a combination of systems working together.</p>
<p>One of the bigger problems in the field of Deep Learning models is their lack of interpretability. Interpretability means understanding how decisions are made. </p>
<p>This is a big problem that has its own field, called explainable AI. This is the field within AI that focuses on making an AI model's decisions more easily understandable.</p>
<p>Here's what we'll cover in this article:</p>
<ul>
<li><a class="post-section-overview" href="#heading-artificial-intelligence-and-the-rise-of-deep-learning">Artificial Intelligence and the Rise of Deep Learning</a></li>
<li><a class="post-section-overview" href="#heading-a-big-problem-in-deep-learning-lack-of-interpretability">A big problem in deep learning: Lack of interpretability</a></li>
<li><a class="post-section-overview" href="#heading-a-solution-to-interpretability-glass-box-models">A solution to interpretability: Glass Box models</a></li>
<li><a class="post-section-overview" href="#heading-code-example-solving-the-problem-with-explainable-ai">Code example: Solving the problem with Explainable AI</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-kan-kolmogorovarnold-networks">Conclusion: KAN (Kolmogorov–Arnold Networks)</a></li>
</ul>
<p>This article won't cover dropout or other regularization techniques, hyperparameter optimization, complex architectures like CNNs, or detailed differences in gradient descent variants.</p>
<p>We'll just discuss the basics of deep learning, the lack of interpretability problem, and a code example.</p>
<h2 id="Artificial">Artificial Intelligence and the Rise of Deep Learning</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/AI.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/robot-pointing-on-a-wall-8386440/">Tara Winstead</a></em></p>
<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep Learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.</p>
<p>The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.</p>
<p>To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.</p>
<h3 id="heading-deep-learning-training-neural-networks">Deep Learning = Training Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/4-2.png" alt="4-2" width="600" height="400" loading="lazy">
<em>Simple neural network</em></p>
<p>At the core of deep learning is the training of neural networks.</p>
<p>That basically means using data to find the right values for each neuron so the network can predict what we want.</p>
<p>Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.</p>
<p>This layered structure allows deep learning models to analyze and interpret complex data.</p>
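<p>As a small illustration of that layered structure (a sketch with random placeholder weights, not a trained model), here is a forward pass through a two-layer network in NumPy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: one sample with 4 features
x = rng.normal(size=(1, 4))

# Layer 1: 4 inputs -> 3 hidden neurons, ReLU activation extracts nonlinear features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
hidden = np.maximum(0, x @ W1 + b1)

# Layer 2: 3 hidden neurons -> 1 output, sigmoid squashes it to a probability-like score
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
output = 1 / (1 + np.exp(-(hidden @ W2 + b2)))

print(output.shape)  # (1, 1): one prediction per sample
```

<p>Training is the process of adjusting <code>W1</code>, <code>b1</code>, <code>W2</code>, and <code>b2</code> so the output matches the labels in the data.</p>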
<h2 id="problem">A Big Problem in Deep Learning: Lack of Interpretability</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/interptret.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/crop-unrecognizable-woman-reading-book-on-soft-bed-4170628/">Koshevaya_k</a></em></p>
<p>Deep Learning has revolutionized many fields by achieving great results in very complex tasks.</p>
<p>However, there is a big problem: the lack of interpretability.</p>
<p>While it is true that neural networks can perform very well, we don't understand how they achieve those results internally.</p>
<p>In other words, we know they do very well on the tasks we give them, but not how they do them in detail.</p>
<p>It is important to know how the model thinks in fields such as healthcare and autonomous driving.</p>
<p>By understanding how a model thinks, we can be more confident in its reliability in certain critical areas.</p>
<p>So in fields with strict regulations, interpretable models are more transparent to regulators and build more trust.</p>
<p>Models that allow interpretability are called <strong>glass box models</strong>. On the other hand, models that do not have this capability (that is, most of them) are called <strong>black box models.</strong></p>
<h2 id="solution">A Solution to Interpretability: Glass Box Models</h2>

<h3 id="heading-glass-box-models">Glass Box Models</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/glass-pixabay-416528.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pixabay: https://www.pexels.com/photo/fluid-pouring-in-pint-glass-416528/</em></p>
<p>Glass box models are machine learning models designed to be easily understood by humans.</p>
<p>Glass box models provide clear insights into how they make their decisions.</p>
<p>This transparency in the decision-making process is important for trust, compliance, and improvement.</p>
<p>Below, we'll see a code example of an AI model that achieves 97% accuracy on a breast cancer prediction dataset.</p>
<p>We'll also find which characteristics of the data were most important in predicting the cancer.</p>
<h3 id="heading-black-box-models">Black Box Models</h3>
<p>In addition to glass box models, there are also black box models. </p>
<p>These models are essentially different neural network architectures used in various datasets. Some examples are:</p>
<ul>
<li><strong>CNN (Convolutional Neural Networks)</strong>: Designed specifically for image classification and interpretation.</li>
<li><strong>RNN (Recurrent Neural Networks) and LSTM (Long Short Term Memory)</strong>: Primarily used for sequential data – text and time series data. In 2017, they were surpassed by a neural network architecture called transformers in a paper called <a target="_blank" href="https://arxiv.org/abs/1706.03762">Attention is All You Need.</a></li>
<li><strong>Transformer-based architectures</strong>: Revolutionized AI in 2017 due to their ability to handle sequential data more efficiently. RNN and LSTM have limited capabilities in this regard.</li>
</ul>
<p>Nowadays, most models that process text are transformer-based models.</p>
<p>For instance, in ChatGPT, <strong>GPT</strong> stands for <strong>Generative Pre-trained Transformer</strong>, indicating a transformer neural network architecture that generates text.</p>
<p>All these models—CNN, RNN, LSTM and Transformers—are examples of narrow artificial intelligence (AI).</p>
<p>Achieving general intelligence, in my view, involves combining many of these narrow AI models to mimic human behavior.</p>
<h2 id="example">Code Example: Solving the Problem with Explainable AI</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/cancer-chokniti-khongchum-1197604-2280571.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Chokniti Khongchum: https://www.pexels.com/photo/person-holding-laboratory-flask-2280571/</em></p>
<p>In this code example, we will create an interpretable AI model based on 30 characteristics.</p>
<p>We'll also learn which five characteristics are most important in detecting breast cancer, based on this dataset.</p>
<p>We will use a machine learning glass box model called the Explainable Boosting Machine (EBM).</p>
<p>Here is the full code, which we'll then walk through block by block:</p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
<span class="hljs-keyword">from</span> interpret.glassbox <span class="hljs-keyword">import</span> ExplainableBoostingClassifier
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

# Load a sample dataset
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

# Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f<span class="hljs-string">"Accuracy: {accuracy_score(y_test, y_pred)}"</span>)

# Interpret the model
ebm_global = ebm.explain_global(name=<span class="hljs-string">'EBM'</span>)

# Extract feature importances
feature_names = ebm_global.data()[<span class="hljs-string">'names'</span>]
importances = ebm_global.data()[<span class="hljs-string">'scores'</span>]

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]

# Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * <span class="hljs-number">1.5</span>  # Increase multiplier <span class="hljs-keyword">for</span> more space

# Plot feature importances
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">14</span>))  # Increase figure height <span class="hljs-keyword">if</span> necessary
plt.barh(y_positions, sorted_importances, color=<span class="hljs-string">'skyblue'</span>, align=<span class="hljs-string">'center'</span>)
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel(<span class="hljs-string">'Importance'</span>)
plt.title(<span class="hljs-string">'Feature Importances from Explainable Boosting Classifier'</span>)
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=<span class="hljs-number">0.3</span>, right=<span class="hljs-number">0.95</span>, top=<span class="hljs-number">0.95</span>, bottom=<span class="hljs-number">0.08</span>)  # Fine-tune the margins <span class="hljs-keyword">if</span> needed

plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/1-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full Code</em></p>
<p>Alright, now let's break it down.</p>
<h3 id="heading-importing-libraries">Importing Libraries</h3>
<p>First, we'll import the libraries we need for our example. You can do that with the following code:</p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
<span class="hljs-keyword">from</span> interpret.glassbox <span class="hljs-keyword">import</span> ExplainableBoostingClassifier
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<p>These are the libraries we are going to use:</p>
<ul>
<li><a target="_blank" href="https://pandas.pydata.org/">Pandas</a>: This is a Python library used for data manipulation and analysis.</li>
<li><a target="_blank" href="https://scikit-learn.org/stable/index.html">sklearn</a>: The <a target="_blank" href="https://scikit-learn.org/stable/index.html">scikit-learn library</a> implements machine learning algorithms. We're using it for data preprocessing and model evaluation.</li>
<li><a target="_blank" href="https://interpret.ml/">Interpret</a>: The <a target="_blank" href="https://interpret.ml/">InterpretML</a> Python library provides the glass box model we'll use.</li>
<li><a target="_blank" href="https://matplotlib.org/">Matplotlib</a>: A Python library used to make graphs.</li>
<li><a target="_blank" href="https://numpy.org/">Numpy</a>: Used for very fast numerical computations.</li>
</ul>
<h3 id="heading-loading-preparing-the-dataset-and-splitting-the-data">Loading, Preparing the Dataset, and Splitting the Data</h3>
<pre><code># Load a sample dataset
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Loading, Preparing the Dataset, and Splitting the Data</em></p>
<p><strong>First, we load a sample dataset</strong>: We import a breast cancer dataset using scikit-learn's built-in datasets.</p>
<p><strong>Next, we prepare the data</strong>: The features (data points) from the dataset are organized into a table format, where each column is labeled with a specific feature name. The target outcomes (labels) from the dataset are stored separately.</p>
<p><strong>Then we split the data into training and testing sets</strong>: The data is divided into two parts: one for training the model and one for testing the model. 80% of the data is used for training, while 20% is reserved for testing.</p>
<p>A specific random seed is set to ensure that the data split is consistent every time the code is run.</p>
<p>Quick note: In real life, the dataset is pre-processed with data manipulation techniques to make the AI model faster and smaller.</p>
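<p>One of the most common of those preprocessing steps is standardizing each feature to zero mean and unit variance. Here's a minimal sketch in plain NumPy with made-up data (scikit-learn's <code>StandardScaler</code> does the same computation):</p>

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Standardize: subtract the per-feature mean, divide by the per-feature std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```

<p>After standardizing, both features contribute on the same scale, which generally helps models train faster.</p>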
<h3 id="heading-training-the-model-making-predictions-and-evaluating-the-model">Training the Model, Making Predictions, and Evaluating the Model</h3>
<pre><code># Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f<span class="hljs-string">"Accuracy: {accuracy_score(y_test, y_pred)}"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Training the Model, Making Predictions and Evaluating the Model</em></p>
<p><strong>First, we train an EBM model</strong>: We initialize an Explainable Boosting Machine model and then train it using the training data. In this step, with the data we have, we create the model. </p>
<p>This way, with one line of code, we create the AI model based on the dataset that will predict breast cancer.</p>
<p><strong>Then we make our predictions</strong>: The trained EBM model is used to make predictions on the test data. Next, we calculate and print the accuracy of the model's predictions.</p>
<h3 id="heading-interpreting-the-model-extracting-and-sorting-feature-importances">Interpreting the Model, Extracting, and Sorting Feature Importances</h3>
<pre><code># Interpret the model
ebm_global = ebm.explain_global(name=<span class="hljs-string">'EBM'</span>)

# Extract feature importances
feature_names = ebm_global.data()[<span class="hljs-string">'names'</span>]
importances = ebm_global.data()[<span class="hljs-string">'scores'</span>]

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Interpreting the Model, Extracting and Sorting Feature Importances</em></p>
<p><strong>At this point, we need to interpret the model</strong>: The global explanation of the trained Explainable Boosting Machine (EBM) model is obtained, providing an overview of how the model makes decisions.</p>
<p>For this model, the accuracy is approximately 0.974 – which means the model is correct about 97% of the time.</p>
<p>Of course, this only applies to the breast cancer data from <strong>this dataset</strong> – not for every single case of breast cancer detection. Since this is a sample, the dataset does not represent the full population of people seeking to detect breast cancer.</p>
<p>Quick note: In the real world, for classification we'd typically use the <strong>F1 score</strong> instead of accuracy to evaluate a model, since it considers both <strong>precision</strong> and <strong>recall</strong>.</p>
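<p>To make the F1 score concrete, here's a small sketch with made-up confusion-matrix counts (illustrative numbers, not results from this dataset):</p>

```python
# Hypothetical counts for a binary classifier
tp, fp, fn = 8, 2, 1  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of predicted positives, how many are right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.889 0.842
```

<p>Because the harmonic mean punishes imbalance, a model can't get a high F1 score by being good at precision alone or recall alone.</p>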
<p><strong>Next, we extract feature importances</strong>: We extract the names and corresponding importance scores of the features used by the model from the global explanation.</p>
<p><strong>Then we sort the features by importance</strong>: The features are sorted based on their importance scores, resulting in a list of feature names and their respective importance scores ordered from least to most important.</p>
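<p>As a tiny sketch of what <code>np.argsort</code> does in that sorting step (with made-up feature names and scores, not the model's real output):</p>

```python
import numpy as np

names = np.array(["feature A", "feature B", "feature C"])
scores = np.array([0.2, 0.9, 0.5])

order = np.argsort(scores)  # indices that would sort scores ascending

print(list(names[order]))   # ['feature A', 'feature C', 'feature B']
print(list(scores[order]))  # [0.2, 0.5, 0.9]
```

<p>Applying the same index array to both <code>names</code> and <code>scores</code> keeps each feature paired with its importance after sorting.</p>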
<h3 id="heading-plotting-feature-importances">Plotting Feature Importances</h3>
<pre><code># Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * <span class="hljs-number">1.5</span>  # Increase multiplier <span class="hljs-keyword">for</span> more space

# Plot feature importances
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">14</span>))  # Increase figure height <span class="hljs-keyword">if</span> necessary
plt.barh(y_positions, sorted_importances, color=<span class="hljs-string">'skyblue'</span>, align=<span class="hljs-string">'center'</span>)
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel(<span class="hljs-string">'Importance'</span>)
plt.title(<span class="hljs-string">'Feature Importances from Explainable Boosting Classifier'</span>)
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=<span class="hljs-number">0.3</span>, right=<span class="hljs-number">0.95</span>, top=<span class="hljs-number">0.95</span>, bottom=<span class="hljs-number">0.08</span>)  # Fine-tune the margins <span class="hljs-keyword">if</span> needed

plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/6-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Plotting Feature Importances</em></p>
<p><strong>Now we need to increase the spacing between feature names</strong>: The positions of the feature names on the y-axis are adjusted to increase the spacing between them.</p>
<p><strong>Then we plot feature importances</strong>: A horizontal bar plot is created to visualize the feature importances. The plot's size is set to ensure it is clear and readable.</p>
<p>The bars represent the importance scores of the features, and the feature names are displayed along the y-axis.</p>
<p>The plot's x-axis is labeled "Importance," and the title "Feature Importances from Explainable Boosting Classifier" is added. The y-axis is inverted to have the most important features at the top.</p>
<p><strong>Then we adjust the spacing</strong>: The margins around the plot are fine-tuned to ensure proper spacing and a neat appearance.</p>
<p><strong>Finally, we display the plot</strong>: The plot is displayed to visualize the feature importances effectively.</p>
<p>The final result should look like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/interpret-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Features importance graph</em></p>
<p>This way, using an interpretable artificial intelligence model with an accuracy of 97%, we can conclude that the most important factors in detecting breast tumors are: </p>
<ul>
<li>Worst concave points </li>
<li>Worst texture </li>
<li>Worst area </li>
<li>Mean concave points </li>
<li>Area error &amp; worst concavity</li>
</ul>
<p>Again, this is according to the provided dataset.</p>
<p>So according to the population that this sample dataset represents, we can conclude in a <strong>data-driven way</strong> that these factors are key indicators for breast cancer tumor detection. </p>
<p>In this way, an interpretable artificial intelligence model gives us clear insights into which features matter most for its predictions.</p>
<h2 id="Conclusion">Conclusion: KAN (Kolmogorov–Arnold Networks) </h2>

<p>Thanks to explainable AI, we can study populations using new data-driven methods.</p>
<p>Instead of only using traditional statistics, surveys, and manual data analysis, we can draw conclusions more accurately using an AI programming library and a database or Excel file.</p>
<p>But this is not the only way to have models built with explainable AI.</p>
<p>In April 2024, a paper called <a target="_blank" href="https://arxiv.org/html/2404.19756v1">KAN: Kolmogorov–Arnold Networks</a> was published that might shake up the field even more.</p>
<p>Kolmogorov–Arnold Networks (KANs) promise to be both more accurate and easier to understand than traditional models.</p>
<p>They are also easier to visualize and interact with. So we'll see what happens with them.</p>
<p>You can find the full code here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Quantum Artificial Intelligence Model – With Python Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning (ML) is one of the most important subareas of AI used in building great AI systems. In ML, deep learning is a narrow area focused solely on neural networks. Through the field of deep learning, systems like ChatGPT and many other AI m... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-quantum-ai-model/</link>
                <guid isPermaLink="false">66ba5330f77647345442b9d1</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jul 2024 18:28:43 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/article_cover.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning (ML) is one of the most important subareas of AI used in building great AI systems.</p>
<p>In ML, deep learning is a narrow area focused solely on neural networks. Through the field of deep learning, systems like ChatGPT and many other AI models can be created. In other words, ChatGPT is just a giant system based on neural networks. </p>
<p>However, there is a big problem with deep learning: computational efficiency. Creating big and effective AI systems with neural networks often requires a lot of energy, which is expensive.</p>
<p>So, the more efficient the hardware is, the better. There are many solutions to solve this problem, one of which is quantum computing.</p>
<p>This article aims to show, in plain English, the connection between quantum computing and artificial intelligence.</p>
<p>We'll talk about these:</p>
<ul>
<li><a class="post-section-overview" href="#heading-artificial-intelligence-and-the-rise-of-deep-learning">Artificial Intelligence and the Rise of Deep Learning</a></li>
<li><a class="post-section-overview" href="#heading-a-big-problem-in-deep-learning-computational-efficiency">A Big Problem in Deep Learning: Computational Efficiency</a></li>
<li><a class="post-section-overview" href="#heading-a-solution-quantum-computing">A Solution: Quantum Computing</a></li>
<li><a class="post-section-overview" href="#heading-code-example-a-quantum-ai-model-for-quantum-chemistry">Code Example: A Quantum AI Model for Quantum Chemistry</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-limitations-of-quantum-computing-and-development">Conclusion: Limitations of Quantum Computing and Development</a></li>
</ul>
<h2 id="Artificial">Artificial Intelligence and the Rise of Deep Learning</h2>

<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.</p>
<p>The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.</p>
<p>To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.</p>
<h3 id="heading-deep-learning-training-neural-networks">Deep Learning = Training Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/4-2.png" alt="4-2" width="600" height="400" loading="lazy">
<em>Simple neural network</em></p>
<p>At the core of deep learning is the training of neural networks. That means using data to get the right values for each neuron to be able to predict what we want.</p>
<p>Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.</p>
<p>This layered structure allows deep learning models to analyze and interpret complex data.</p>
<h2 id="problem">A Big Problem in Deep Learning: Computational Efficiency</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/data-brett-sayles-4597280.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Brett Sayles: https://www.pexels.com/photo/black-hardwares-on-data-server-room-4597280/</em></p>
<p>Deep learning powers much of the transformation AI is driving in society. However, it comes with a big problem: computational efficiency.</p>
<p>Training deep learning AI systems requires massive amounts of data and computational power. Training can take anywhere from minutes to weeks, and in the process it consumes a lot of energy and computational resources.</p>
<p>There are many solutions to this problem, such as better algorithmic efficiency.</p>
<p>In large language models, this has been the focus of much AI research: making smaller models match the performance of larger ones.</p>
<p>Another solution, besides algorithmic efficiency, is better computational efficiency. Quantum computing is one of the solutions related to better computational efficiency.</p>
<h2 id="Solution">A Solution: Quantum Computing</h2>

<p>Quantum computing is a promising solution to the computational efficiency problem in deep learning.</p>
<p>While normal computers work in bits (either 0 or 1), quantum computers work with qubits, which can be 0 and 1 at the same time.</p>
<p>With the qubits representing 0 and 1 at the same time, it is possible to process many possibilities simultaneously, thanks to a property called superposition in quantum physics.</p>
<p>This makes quantum computers far more efficient than normal computers for certain tasks.</p>
<p>It also makes it possible to design quantum algorithms that are more efficient than classical ones, reducing the energy consumed when creating AI models.</p>
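<p>To make superposition concrete, we can simulate the math of a single qubit in plain NumPy (a sketch of the linear algebra, not real quantum hardware): applying a Hadamard gate to the |0⟩ state puts it into an equal superposition, giving a 50/50 chance of measuring 0 or 1.</p>

```python
import numpy as np

# The |0> state as a 2-component state vector
ket0 = np.array([1.0, 0.0])

# Hadamard gate: maps a basis state to an equal superposition
H = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)

state = H @ ket0            # (|0> + |1>) / sqrt(2)
probs = np.abs(state) ** 2  # measurement probabilities for 0 and 1

print(probs)  # [0.5 0.5]
```

<p>Each extra qubit doubles the size of this state vector, which is why classical simulation quickly becomes expensive – and why real quantum hardware is appealing.</p>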
<h3 id="heading-why-are-quantum-computers-not-so-widely-used">Why Are Quantum Computers Not So Widely Used?</h3>
<p>The problem with quantum computation is that there isn't a good, cheap physical representation of qubits.</p>
<p>Bits are created and managed with logic gates made from tiny transistors, which can be easily created by the billions.</p>
<p>Qubits are created and managed using superconducting circuits, trapped ions, or topological qubits, all of which are very expensive.</p>
<p>This is the biggest problem in quantum computation. However, cloud services from IBM, Amazon, and many others let people run code on real quantum computers.</p>
<h2 id="Code">Code Example: A Quantum AI Model for Quantum Chemistry</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/chemnistry-pixabay-248152.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pixabay: https://www.pexels.com/photo/two-clear-glass-jars-beside-several-flasks-248152/</em></p>
<p>In this code example, we'll solve a quantum chemistry problem:</p>
<p><em>What is the lowest energy level of the H₂ molecule using quantum computing?</em></p>
<p>Before understanding the problem at hand, let's discuss quantum chemistry.</p>
<h3 id="heading-what-is-quantum-chemistry">What is Quantum Chemistry?</h3>
<p>Quantum chemistry is a field of science that looks at how electrons behave in atoms and molecules.</p>
<p>It is about using quantum physics to understand how electrons, atoms, molecules and many more tiny particles interact and form different chemical substances.</p>
<h4 id="heading-the-problem-we-want-to-solve">The Problem We Want to Solve</h4>
<p>We want to find the "ground state energy" of the H₂ molecule. </p>
<p>The H₂ molecule means hydrogen gas, which is present in:</p>
<ul>
<li>Water</li>
<li>Organic compounds</li>
<li>Stars</li>
</ul>
<p>Actually, life on Earth would not be possible without it. </p>
<p>By finding the "ground state energy," which is the lowest possible energy that the molecule can have, we can know its most stable form and properties. </p>
<p>This allows scientists to better understand chemical reactions related to H₂. </p>
<p>With classical computers, this problem can be very complex due to a huge number of possibilities and intricate interactions. </p>
<p>With quantum computers, qubits are good representations of electrons, which can directly simulate the behavior of electrons in molecules.</p>
<h3 id="heading-approximating-with-the-vqe-variational-quantum-eigensolver-vqe">Approximating with the Variational Quantum Eigensolver (VQE)</h3>
<p>The Variational Quantum Eigensolver (VQE) is a hybrid algorithm that leverages both quantum and classical computing. </p>
<p>In this example, the VQE algorithm is used to find the ground state energy of a simple H₂ molecule. </p>
<p>The code is designed to run on a quantum simulator (which is a classical computer running a quantum algorithm).</p>
<p>However, it can be adapted to run on actual quantum hardware through a cloud-based quantum computing service. </p>
<p>This would involve using both quantum and classical resources in practice. Let’s go through the code step by step!</p>
<pre><code>import pennylane as qml
from pennylane import numpy as np  # PennyLane's NumPy wrapper supports requires_grad
import matplotlib.pyplot as plt

# Define the molecule (H2 at bond length of 0.74 Å)
symbols = ["H", "H"]
coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74])

# Generate the Hamiltonian for the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)

# Define the quantum device
dev = qml.device("default.qubit", wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([0] * qubits), wires=wires)
    for i in range(qubits):
        qml.RY(params[i], wires=wires[i])
    for i in range(qubits - 1):
        qml.CNOT(wires=[wires[i], wires[i + 1]])

# Define the cost function
@qml.qnode(dev)
def cost_fn(params):
    ansatz(params, wires=range(qubits))
    return qml.expval(hamiltonian)

# Set a fixed seed for reproducibility
np.random.seed(42)

# Set the initial parameters
params = np.random.random(qubits, requires_grad=True)

# Choose an optimizer
optimizer = qml.GradientDescentOptimizer(stepsize=0.4)

# Number of optimization steps
max_iterations = 100
conv_tol = 1e-06

# Optimization loop
energies = []

for n in range(max_iterations):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    if np.abs(energy - prev_energy) &lt; conv_tol:
        break

    print(f"Step = {n}, Energy = {energy:.8f} Ha")

print(f"Final ground state energy = {energy:.8f} Ha")

# Visualize the results
iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel('Iteration')
plt.ylabel('Energy (Ha)')
plt.title('Convergence of VQE for H2 Molecule')
plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/1-5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full Code Image</em></p>
<h3 id="heading-importing-libraries">Importing Libraries</h3>
<pre><code>import pennylane as qml
from pennylane import numpy as np  # PennyLane's NumPy wrapper supports requires_grad
import matplotlib.pyplot as plt
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<ul>
<li><a target="_blank" href="https://pennylane.ai/">pennylane</a>: A library for quantum computing that provides tools for creating and optimizing quantum circuits, and for running quantum machine learning algorithms.</li>
<li><a target="_blank" href="https://numpy.org/">numpy</a>: A library for numerical operations in Python, used here for handling arrays and mathematical computations.</li>
<li><a target="_blank" href="https://matplotlib.org/">matplotlib</a>: A library for creating visualizations and plots in Python, used here to graph the convergence of the VQE algorithm.</li>
</ul>
<h3 id="heading-defining-the-molecule-and-generating-the-hamiltonian">Defining the Molecule and Generating the Hamiltonian</h3>
<pre><code># Define the molecule (H2 at bond length <span class="hljs-keyword">of</span> <span class="hljs-number">0.74</span> Å)
symbols = [<span class="hljs-string">"H"</span>, <span class="hljs-string">"H"</span>]
coordinates = np.array([<span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.74</span>])

# Generate the Hamiltonian <span class="hljs-keyword">for</span> the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Molecule and generating the Hamiltonian</em></p>
<p><strong>Defining the Molecule</strong>:</p>
<ul>
<li>We define a hydrogen molecule (H₂).</li>
<li><code>symbols = ["H", "H"]</code>: This means the molecule consists of two hydrogen (H) atoms.</li>
<li><code>coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74])</code>: This gives the positions of the two hydrogen atoms. The first hydrogen atom is at the origin (0.0, 0.0, 0.0), and the second hydrogen atom is at (0.0, 0.0, 0.74), which means it is 0.74 angstroms away from the first atom along the z-axis.</li>
</ul>
<p><strong>Generating the Hamiltonian</strong>:</p>
<ul>
<li><code>hamiltonian, qubits = qml.qchem.molecular_hamiltonian(symbols, coordinates)</code>: This line generates the Hamiltonian for the hydrogen molecule. The Hamiltonian is a mathematical object used to describe the energy of the molecule.</li>
<li><code>hamiltonian</code>: Represents the energy operator for the molecule.</li>
<li><code>qubits</code>: Represents the number of quantum bits (qubits) needed to simulate the molecule on a quantum computer.</li>
</ul>
<h3 id="heading-defining-the-quantum-device-and-ansatz-variational-quantum-circuit">Defining the Quantum Device and Ansatz (Variational Quantum Circuit)</h3>
<pre><code># Define the quantum device
dev = qml.device(<span class="hljs-string">"default.qubit"</span>, wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([<span class="hljs-number">0</span>] * qubits), wires=wires)
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits):
        qml.RY(params[i], wires=wires[i])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits - <span class="hljs-number">1</span>):
        qml.CNOT(wires=[wires[i], wires[i + <span class="hljs-number">1</span>]])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Quantum Device and Ansatz (Variational Quantum Circuit)</em></p>
<p><strong>Defining the Quantum Device</strong>:</p>
<ul>
<li><code>dev = qml.device("default.qubit", wires=qubits)</code>: This line sets up a quantum computing device to simulate our molecule.</li>
<li><code>"default.qubit"</code>: This specifies the type of quantum simulator we are using (a default qubit-based simulator).</li>
<li><code>wires=qubits</code>: This tells the simulator how many qubits (quantum bits) it needs to use, based on the number of qubits we determined earlier.</li>
</ul>
<p><strong>Defining the Ansatz (Variational Quantum Circuit)</strong>:</p>
<ul>
<li><code>def ansatz(params, wires)</code>: This defines a function named <code>ansatz</code> which describes the variational quantum circuit. This circuit will be used to find the ground state energy of the molecule.</li>
<li><code>qml.BasisState(np.array([0] * qubits), wires=wires)</code>: This initializes the qubits in the state 0. The <code>np.array([0] * qubits)</code> creates an array with zeros, one for each qubit.</li>
<li><code>for i in range(qubits): qml.RY(params[i], wires=wires[i])</code>: This loop applies a rotation around the Y-axis to each qubit. <code>params[i]</code> provides the angle for each rotation.</li>
<li><code>for i in range(qubits - 1): qml.CNOT(wires=[wires[i], wires[i + 1]])</code>: This loop applies Controlled-NOT (CNOT) gates between consecutive qubits, entangling them.</li>
</ul>
<h3 id="heading-defining-the-cost-function-setting-initial-parameters-and-optimizer">Defining the Cost Function, Setting Initial Parameters and Optimizer</h3>
<pre><code># Define the cost function
@qml.qnode(dev)
def cost_fn(params):
    ansatz(params, wires=range(qubits))
    return qml.expval(hamiltonian)

# Set a fixed seed for reproducibility
np.random.seed(42)

# Set the initial parameters
params = np.random.random(qubits, requires_grad=True)

# Choose an optimizer
optimizer = qml.GradientDescentOptimizer(stepsize=0.4)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Cost Function, Setting Initial Parameters and Optimizer</em></p>
<p><strong>Defining the Cost Function</strong>:</p>
<ul>
<li><code>@qml.qnode(dev)</code>: This line is a decorator that transforms the <code>cost_fn</code> function into a quantum node, allowing it to run on the quantum device <code>dev</code>.</li>
<li><code>def cost_fn(params)</code>: This defines a function named <code>cost_fn</code> that takes some parameters (<code>params</code>) as input.</li>
<li><code>ansatz(params, wires=range(qubits))</code>: Inside this function, we call the previously defined <code>ansatz</code> function, passing in the parameters and specifying that it should use all the qubits.</li>
<li><code>return qml.expval(hamiltonian)</code>: This line returns the expected value of the Hamiltonian, which represents the energy of the molecule. The cost function is what we aim to minimize to find the ground state energy.</li>
</ul>
<p><strong>Setting a Fixed Seed for Reproducibility</strong>:</p>
<ul>
<li><code>np.random.seed(42)</code>: This line sets a fixed seed for the random number generator. This ensures that the random numbers generated will be the same each time the code is run, making the results reproducible.</li>
</ul>
<p><strong>Setting the Initial Parameters</strong>:</p>
<ul>
<li><code>params = np.random.random(qubits, requires_grad=True)</code>: This line initializes the parameters for the ansatz with random values. The number of parameters is equal to the number of qubits. The <code>requires_grad=True</code> part indicates that these parameters can be adjusted during optimization.</li>
</ul>
<p><strong>Choosing an Optimizer</strong>:</p>
<ul>
<li><code>optimizer = qml.GradientDescentOptimizer(stepsize=0.4)</code>: This line creates an optimizer that will adjust the parameters to minimize the cost function. Specifically, it uses gradient descent with a step size of 0.4.</li>
</ul>
<h3 id="heading-optimization-loop">Optimization Loop</h3>
<pre><code># <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> optimization steps
max_iterations = <span class="hljs-number">100</span>
conv_tol = <span class="hljs-number">1e-06</span>

# Optimization loop
energies = []

<span class="hljs-keyword">for</span> n <span class="hljs-keyword">in</span> range(max_iterations):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    <span class="hljs-keyword">if</span> np.abs(energy - prev_energy) &lt; conv_tol:
        <span class="hljs-keyword">break</span>

    print(f<span class="hljs-string">"Step = {n}, Energy = {energy:.8f} Ha"</span>)

print(f<span class="hljs-string">"Final ground state energy = {energy:.8f} Ha"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/6-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Optimization Loop</em></p>
<p><strong>Setting the Number of Optimization Steps</strong>:</p>
<ul>
<li><code>max_iterations = 100</code>: This sets the maximum number of steps the optimization will take. In this case, it is 100 steps.</li>
<li><code>conv_tol = 1e-06</code>: This defines the convergence tolerance. If the change in energy between steps is less than this value, the optimization will stop.</li>
</ul>
<p><strong>Optimization Loop</strong>:</p>
<ul>
<li><code>energies = []</code>: This initializes an empty list to store the energies calculated at each step.</li>
</ul>
<p><strong>Looping Through Optimization Steps</strong>:</p>
<ul>
<li><code>for n in range(max_iterations):</code>: This starts a loop that will run up to <code>max_iterations</code> times.</li>
<li><code>params, prev_energy = optimizer.step_and_cost(cost_fn, params)</code>: This line performs one step of optimization. It updates the parameters and returns the new parameters and the previous energy.</li>
<li><code>energy = cost_fn(params)</code>: This calculates the current energy using the updated parameters.</li>
<li><code>energies.append(energy)</code>: This adds the current energy to the <code>energies</code> list.</li>
<li><code>if np.abs(energy - prev_energy) &lt; conv_tol: break</code>: This checks if the absolute difference between the current energy and the previous energy is less than the convergence tolerance. If it is, the loop stops early because the optimization has converged.</li>
<li><code>print(f"Step = {n}, Energy = {energy:.8f} Ha")</code>: This prints the current step number and the energy in Hartree (Ha) to eight decimal places.</li>
</ul>
<p><strong>Printing the Final Energy</strong>:</p>
<ul>
<li><code>print(f"Final ground state energy = {energy:.8f} Ha")</code>: After the loop, this prints the final ground state energy.</li>
</ul>
<h3 id="heading-visualizing-the-results">Visualizing the Results</h3>
<pre><code># Visualize the results
iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel(<span class="hljs-string">'Iteration'</span>)
plt.ylabel(<span class="hljs-string">'Energy (Ha)'</span>)
plt.title(<span class="hljs-string">'Convergence of VQE for H2 Molecule'</span>)
plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/7.png" alt="Image" width="600" height="400" loading="lazy">
<em>Visualizing the Results</em></p>
<p><strong>Setting Up the Data for Visualization</strong>:</p>
<ul>
<li><code>iterations = range(len(energies))</code>: This creates a range object representing the number of iterations (steps) the optimization went through. <code>len(energies)</code> gives the number of energy values recorded.</li>
</ul>
<p><strong>Plotting the Results</strong>:</p>
<ul>
<li><code>plt.plot(iterations, energies)</code>: This line creates a plot with the iteration numbers on the x-axis and the corresponding energy values on the y-axis.</li>
<li><code>plt.xlabel('Iteration')</code>: This sets the label for the x-axis to "Iteration".</li>
<li><code>plt.ylabel('Energy (Ha)')</code>: This sets the label for the y-axis to "Energy (Ha)", where "Ha" stands for Hartree, a unit of energy.</li>
<li><code>plt.title('Convergence of VQE for H2 Molecule')</code>: This sets the title of the plot to "Convergence of VQE for H2 Molecule".</li>
<li><code>plt.show()</code>: This displays the plot.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/H2H.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The graph titled "Convergence of VQE for H2 Molecule" shows the energy (in Hartree, Ha) of the H2 molecule plotted against the number of iterations of the Variational Quantum Eigensolver (VQE) algorithm.</p>
<ul>
<li><strong>X-Axis (Iteration):</strong> Number of VQE iterations.</li>
<li><strong>Y-Axis (Energy (Ha)):</strong> Energy of the H2 molecule in Hartree.</li>
</ul>
<h3 id="heading-key-points">Key Points:</h3>
<ul>
<li><strong>Initial Energy:</strong> Approximately 1.4 Ha at iteration 0.</li>
<li><strong>Rapid Decrease:</strong> Energy quickly drops within the first 20 iterations.</li>
<li><strong>Plateau:</strong> Energy stabilizes around 0.4 Ha after 20 iterations, indicating convergence to an optimal or near-optimal solution.</li>
</ul>
<h2 id="Conclusion">Conclusion: Limitations of Quantum Computing and Development</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/PC-richasharma96-4247412.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Richa Sharma: https://www.pexels.com/photo/ceramic-mug-on-black-laptop-on-table-in-office-4247412/</em></p>
<p>Besides making AI algorithms far more computationally efficient, quantum computing can revolutionize many fields like:</p>
<ul>
<li>Drug discovery</li>
<li>Material science</li>
<li>Cryptography</li>
<li>Financial modeling</li>
<li>Optimization problems</li>
<li>Climate modeling</li>
<li>Machine learning</li>
</ul>
<p>However, for quantum computing to become widely accessible, we need a way to physically implement qubits in hardware small and stable enough for everyday devices. That will take years.</p>
<p>The full code can be found here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What Are Monte Carlo Methods? How to Predict the Future with Python Simulations ]]>
                </title>
                <description>
                    <![CDATA[ Monte Carlo methods have revolutionized programming and engineering. These methods use the power of randomness, which makes them effective tools that help developers solve difficult problems in many fields. Monte Carlo methods have been adopted in ph... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-are-monte-carlo-methods/</link>
                <guid isPermaLink="false">66ba534a80dbd3f269f5887d</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 16 Jul 2024 21:42:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/pexels-matej-117839-716661.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Monte Carlo methods have revolutionized programming and engineering.</p>
<p>These methods use the power of randomness, which makes them effective tools that help developers solve difficult problems in many fields.</p>
<p>Monte Carlo methods have been adopted in physics, finance, engineering, and many other areas where deterministic methods are often impractical.</p>
<p>With Monte Carlo methods, simulations and very complex computations have become efficient and easy to manage.</p>
<p>There are many variants of Monte Carlo methods. But all of them share the idea of using randomness to approximate solutions to hard problems. In this article, you'll learn all about Monte Carlo methods.</p>
<h2 id="heading-what-well-cover">What we'll cover:</h2>
<ul>
<li><a class="post-section-overview" href="#heading-understanding-monte-carlo-methods-through-an-analogy">Understanding Monte Carlo Methods Through an Analogy</a></li>
<li><a class="post-section-overview" href="#heading-what-are-monte-carlo-methods-a-plain-english-guide">What Are Monte Carlo Methods? A Plain English Guide</a></li>
<li><a class="post-section-overview" href="#heading-real-world-applications-of-monte-carlo-methods">Real-World Applications of Monte Carlo Methods</a></li>
<li><a class="post-section-overview" href="#heading-exploring-different-types-of-monte-carlo-methods">Different Types of Monte Carlo Methods</a></li>
<li><a class="post-section-overview" href="#heading-practical-implementation-monte-carlo-methods-in-python">Practical Implementation: Monte Carlo Methods in Python</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-the-future-of-monte-carlo-methods">The Future of Monte Carlo Methods</a></li>
</ul>
<h3 id="heading-pre-requisites">Pre-requisites</h3>
<p>You should have a basic knowledge of statistics to understand everything in this article.</p>
<p>If you need to brush up on your stats skills, I recommend checking out this freeCodeCamp course:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/xxpc-HPKN28" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h2 id="understanding">Understanding Monte Carlo Methods Through an Analogy</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/green-leafed-tree-38136/">veeterzy on Pexels</a></em></p>
<p>Imagine you want to find the average height of trees in a big forest.</p>
<p>Measuring every tree is impossible and impractical. But with Monte Carlo methods, it's possible to randomly select a few spots in the forest and measure the height of all the trees in those spots.</p>
<p>By doing this many times and averaging all these measurements, we can estimate the average height of all the trees in the forest.</p>
<p>This way, it's possible to make great estimations in large and complex populations by finding small and manageable samples and averaging them out.</p>
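<p>Here is a minimal Python sketch of the forest analogy. The "forest" here is synthetic data invented purely for illustration, so all the specific numbers are assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forest: 100,000 trees with heights we pretend not to know
# (synthetic data, normally distributed around 25 m for illustration).
forest = rng.normal(loc=25.0, scale=5.0, size=100_000)

# Monte Carlo estimate: measure trees in a few randomly chosen plots
# instead of measuring every tree in the forest.
n_plots, trees_per_plot = 50, 30
plot_means = [rng.choice(forest, trees_per_plot).mean() for _ in range(n_plots)]
estimate = np.mean(plot_means)

print(f"Estimated mean height: {estimate:.2f} m")
print(f"True mean height:      {forest.mean():.2f} m")
```

Each random plot gives only a rough estimate, but averaging many of them lands close to the true mean without ever measuring every tree.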
<h2 id="what">What Are Monte Carlo Methods? A Plain English Guide</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/pexels.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/photo-of-two-red-dices-965879/">Jonathan Petersson on Pexels</a></em></p>
<p>Monte Carlo methods are a type of computer algorithm that uses repeated random measurements to obtain approximate results for a given problem.</p>
<p>They are a part of the mathematical field called <a target="_blank" href="https://www.freecodecamp.org/news/numerical-analysis-explained-how-to-apply-math-with-python/">numerical analysis</a> – the use of approximation methods to find solutions where deterministic methods are impractical.</p>
<p>The main idea is to find good enough approximate solutions to solve problems that are too hard or impossible to solve directly.</p>
<p>These solutions are obtained by getting an average of many randomly chosen samples from the population of the problem at hand.</p>
<p>This way, in systems with many uncertain factors and interacting parts, Monte Carlo methods are able to provide insights into how the system behaves and performs.</p>
<p>They are based on the mathematical idea of the <a target="_blank" href="https://www.investopedia.com/terms/l/lawoflargenumbers.asp">Law of large numbers</a> in probability theory:</p>
<blockquote>
<p>The average of many independent, identically distributed random variables converges to the expected value, if it exists.</p>
</blockquote>
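<p>The law of large numbers is easy to see in code. In this small sketch we roll a fair six-sided die (expected value 3.5) and watch the running average converge:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 rolls of a fair six-sided die; the expected value is 3.5.
rolls = rng.integers(1, 7, size=100_000)

# Running average after each roll: converges toward 3.5 as rolls accumulate.
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

print(f"Mean after 100 rolls:     {running_mean[99]:.3f}")
print(f"Mean after 100,000 rolls: {running_mean[-1]:.3f}")
```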
<p>The main drawback of Monte Carlo methods is that they need substantial computing resources: many simulations must be run to get accurate results.</p>
<h3 id="heading-why-are-they-called-monte-carlo">Why are they called "Monte Carlo"?</h3>
<p>Monte Carlo methods are named after the <a target="_blank" href="https://www.montecarlosbm.com/en/casino-monaco/casino-monte-carlo">Monte Carlo Casino in Monaco</a>. The name was coined by mathematicians working on the Manhattan Project in the 1940s.</p>
<p><a target="_blank" href="https://www.britannica.com/biography/Stanislaw-Ulam">Stanislaw Ulam</a>, <a target="_blank" href="https://www.britannica.com/biography/John-von-Neumann">John von Neumann</a>, and others were involved in this project, which developed the American nuclear bomb. </p>
<p>The name reflects the randomness in their simulations, akin to the random outcomes in casino gambling.</p>
<h2 id="real">Real-World Applications of Monte Carlo Methods</h2>

<h3 id="heading-circuit-design-in-electrical-engineering">Circuit design in electrical engineering</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/circuit.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo from <a target="_blank" href="https://www.pexels.com/photo/close-up-photography-of-computer-motherboard-163125/">Pixabay</a></em></p>
<p>Circuits have many components. Here are some of them:</p>
<ul>
<li>Resistors</li>
<li>Inductors</li>
<li>Capacitors</li>
<li>Diodes</li>
<li>Transistors</li>
</ul>
<p>Because of the temperature of the environment they operate in, circuits can sometimes fail.</p>
<p>So, how do engineers design temperature-resilient circuits?</p>
<p>In other words: how can we test a circuit's performance at different temperatures?</p>
<p>With Monte Carlo methods, we can simulate many randomly sampled temperature conditions and measure their effects on circuit components and on overall circuit performance.</p>
<p>This gives us data on how the components perform under different thermal stresses.</p>
<p>We can then optimize the circuit – by changing the design or choosing different components – to work across many environmental conditions.</p>
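<p>A toy version of this idea, using a simple voltage divider with made-up 1% resistor tolerances and an assumed 200 ppm/°C temperature coefficient, might look like this:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials = 10_000

# Hypothetical voltage divider: Vout = Vin * R2 / (R1 + R2).
vin, r1_nom, r2_nom = 5.0, 1_000.0, 1_000.0
tempco = 200e-6  # assumed 200 ppm/°C temperature coefficient

# Randomly sample operating temperatures and 1% manufacturing tolerances.
temps = rng.uniform(-40.0, 85.0, n_trials)  # degrees Celsius
r1 = r1_nom * (1 + rng.normal(0, 0.01, n_trials)) * (1 + tempco * (temps - 25))
r2 = r2_nom * (1 + rng.normal(0, 0.01, n_trials)) * (1 + tempco * (temps - 25))
vout = vin * r2 / (r1 + r2)

print(f"Mean Vout: {vout.mean():.3f} V, std: {vout.std():.4f} V")
print(f"Observed range: {vout.min():.3f} V to {vout.max():.3f} V")
```

The distribution of <code>vout</code> across trials tells us how much the output can drift under combined thermal and manufacturing variation.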
<h3 id="heading-rocket-design-in-aerospace-engineering">Rocket design in aerospace engineering</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/rocket.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo from <a target="_blank" href="https://www.pexels.com/photo/white-rocket-2159/">Pixabay</a></em></p>
<p>Rocket design involves many different variables, such as: </p>
<ul>
<li>Material properties </li>
<li>Aerodynamic forces </li>
<li>Propulsion efficiency </li>
<li>Environmental conditions</li>
</ul>
<p>Monte Carlo methods allow for numerous simulations with varying material properties, propulsion efficiency, and more design variables.</p>
<p>This helps in deeply understanding rocket behavior under diverse conditions.</p>
<p>In essence, this stochastic way of solving a big problem is key in understanding the probability behavior of the rocket's performance, like:</p>
<ul>
<li>Trajectory</li>
<li>Stability </li>
<li>Structural integrity </li>
</ul>
<p>By analyzing how these design variables affect the probability behavior of crucial rocket flying performance metrics, engineers can make rockets safer and more reliable.</p>
<h3 id="heading-financial-portfolio-optimization-in-finance-and-investing">Financial Portfolio optimization in finance and investing</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/finance.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/close-up-photo-of-monitor-159888/">energepic.com</a></em></p>
<p>In financial portfolio optimization, the question is: what is the best mix of assets in a portfolio to maximize returns while minimizing risk?</p>
<p>Monte Carlo methods are used to <a target="_blank" href="https://www.quantconnect.com/learning/articles/introduction-to-options/stochastic-processes-and-monte-carlo-method">simulate</a> how good a portfolio is at maximizing returns while minimizing risk under various market conditions.</p>
<p>By generating many random scenarios for asset prices and returns, banks and financial institutions can know, under different conditions, portfolio outcomes and manage risk.</p>
<p>This way, it's possible to make data-driven decisions to find a balance between risk and rewards.</p>
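<p>Here is a minimal sketch of that idea: simulating one-year returns of a two-asset portfolio. The expected returns, volatilities, and weights are invented for illustration, and the assets are assumed independent – a real model would use historical data and correlations.</p>

```python
import random

random.seed(1)

def simulate_portfolio(w_stock=0.6, n=50_000):
    """Draw n random yearly return scenarios for a stock/bond mix."""
    results = []
    for _ in range(n):
        stock = random.gauss(0.08, 0.15)  # risky asset: 8% mean, 15% vol (assumed)
        bond = random.gauss(0.03, 0.05)   # safer asset: 3% mean, 5% vol (assumed)
        results.append(w_stock * stock + (1 - w_stock) * bond)
    return results

returns = simulate_portfolio()
mean_r = sum(returns) / len(returns)
loss_prob = sum(r < 0 for r in returns) / len(returns)
print(f"expected return: {mean_r:.3f}, probability of a loss: {loss_prob:.3f}")
```

<p>By repeating this for different weights, one could trace out the risk/return trade-off and pick the mix that best balances the two.</p>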
<h2 id="exploring">Exploring Different Types of Monte Carlo Methods</h2>

<p>There are many variants of Monte Carlo methods. Here are some of the most important:</p>
<h3 id="heading-classical-monte-carlo">Classical Monte Carlo:</h3>
<p>Classical Monte Carlo uses random samples to estimate values and simulate systems. It's useful for tasks where direct solutions are hard to find, like numerical integration.</p>
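<p>The textbook example of classical Monte Carlo is estimating π by integration: sample points uniformly in the unit square and count how many land inside the quarter circle.</p>

```python
import random

random.seed(7)
n = 200_000
# A point (u, v) with u, v uniform in [0, 1] lies inside the quarter circle
# when u^2 + v^2 <= 1; that happens with probability pi / 4.
inside = sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(n))
pi_estimate = 4 * inside / n
print(f"pi is approximately {pi_estimate:.4f}")
```

<p>The estimate sharpens as <code>n</code> grows – the error shrinks roughly with the square root of the number of samples.</p>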
<h3 id="heading-bayesian-monte-carlo">Bayesian Monte Carlo:</h3>
<p>Bayesian Monte Carlo improves estimations by using existing information with new observations to make better predictions.</p>
<p>It is called Bayesian Monte Carlo because it uses <a target="_blank" href="https://www.freecodecamp.org/news/bayes-rule-explained/">Bayes' theorem</a>.</p>
<p>Bayes' theorem was created by the mathematician Thomas Bayes and it's very important in probability theory.</p>
<p>The main idea of the theorem is to <strong>revise existing beliefs with new data.</strong></p>
<p>This method is ideal when you have some existing information about the problem.</p>
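<p>A hedged sketch of the Bayesian idea: start from a prior belief about a coin's bias, observe new flips, and draw samples from the updated (posterior) belief. Here I use simple rejection sampling with a uniform prior; the observed flips are made up for illustration.</p>

```python
import random

random.seed(3)
observed = [1, 1, 0, 1, 1, 1, 0, 1]  # 6 heads out of 8 flips (illustrative data)

def likelihood(theta, data):
    """Probability of the observed flips if the coin's heads-bias is theta."""
    p = 1.0
    for flip in data:
        p *= theta if flip else (1 - theta)
    return p

# Rejection sampling: draw theta from the uniform prior, keep it with
# probability proportional to the likelihood of the observed data.
max_like = likelihood(6 / 8, observed)  # likelihood peaks at the sample mean
posterior = []
while len(posterior) < 5_000:
    theta = random.random()
    if random.random() < likelihood(theta, observed) / max_like:
        posterior.append(theta)

posterior_mean = sum(posterior) / len(posterior)
print(f"posterior mean bias: {posterior_mean:.2f}")
```

<p>The posterior samples concentrate around the observed frequency of heads, but are pulled by the prior – exactly the "revise existing beliefs with new data" idea of Bayes' theorem.</p>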
<h3 id="heading-markov-chain-monte-carlo-mcmc">Markov Chain Monte Carlo (MCMC):</h3>
<p>For large datasets, Monte Carlo methods often take too long to compute results.</p>
<p>One way to solve this problem is to use a smaller version of big datasets. This is kind of like how a summary <strong>represents</strong> the content of a book because it is quicker to read.</p>
<p>This smaller version is called a <a target="_blank" href="https://www.freecodecamp.org/news/what-is-a-markov-chain/">Markov Chain</a>.</p>
<p>In simple words, Markov Chains are models that show how a system moves between states.</p>
<p>A large dataset can be seen as a system and the states as patterns of data.</p>
<p>This way, Markov Chains are simple models that can <strong>represent</strong> a large dataset because they show how things change from one state to another.</p>
<p>This state change can represent, with fewer numbers, the important patterns in the data.</p>
<p>This way, the Monte Carlo method computes its results from the Markov Chain rather than from the full dataset.</p>
<p>Essentially, the Monte Carlo method makes its predictions <strong>indirectly</strong> from the original data. The Markov Chain acts as a <strong>data preprocessing</strong> step before the Monte Carlo results are computed.</p>
<p>In the end, MCMC is just a regular but far more computationally efficient Monte Carlo method.</p>
<h3 id="heading-other-variants">Other variants</h3>
<p>Other methods, like <em>Gradient, Semi-Gradient, and Quasi Monte Carlo</em>, also focus on computational efficiency. But in this article, I only seek to highlight the importance of Monte Carlo methods in science, engineering, and programming.</p>
<h2 id="pratical">Practical Implementation: Monte Carlo Methods in Python</h2>

<p>In the code below, you will see how to implement an MCMC variant in Python.</p>
<p>I'll demo a popular variant of MCMC called Hamiltonian Monte Carlo (HMC).</p>
<p>It is called Hamiltonian because it uses concepts from Hamiltonian mechanics to propose new states for the Markov chains in the data pre-processing step.</p>
<h3 id="heading-what-is-hamiltonian-mechanics">What is Hamiltonian Mechanics?</h3>
<p>To answer this, you need to know a bit about classical mechanics.</p>
<p>Classical mechanics uses Newton's laws of motion to explain how physical systems behave and change over time. </p>
<p>Hamiltonian mechanics is another way to look at these systems. It often emphasizes the role of energy and its conservation by using different variables like generalized positions and momenta.</p>
<p>This unique way of describing a system's state and evolution is used in HMC.</p>
<h3 id="heading-main-code-example-objective">Main code example objective</h3>
<p>We will create a target distribution from a 2D Gaussian distribution using TensorFlow Probability. This means that the HMC will model this target distribution.</p>
<p>The 2D Gaussian distribution is created with synthetic data to demonstrate the approximation process using Hamiltonian Monte Carlo.</p>
<p>In other words, HMC will represent this 2D Gaussian distribution accurately.</p>
<p>In real-life scenarios, from circuits to finance, all systems can be described as a probability distribution. </p>
<p>The Monte Carlo methods approximate these complex distributions. And the MCMC makes this process far faster.</p>
<p>In this simple code example, I am approximating a simple target distribution so that you can understand how this would be applied in a real-life scenario.</p>
<p>Here is the full code (we'll walk through it step by step below):</p>
<pre><code>import tensorflow as tf
import tensorflow_probability as tfp

# Define the target distribution (2D Gaussian)
def target_log_prob(x, y):
    return -0.5 * (x**2 + y**2)

# Initialize the HMC transition kernel
num_results = 1000
num_burnin_steps = 500

hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=lambda x, y: target_log_prob(x, y),
    num_leapfrog_steps=3,
    step_size=0.1
)

# Define the trace function to record the state and kernel results
@tf.function
def run_chain(initial_state, kernel, num_results, num_burnin_steps):
    return tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=initial_state,
        kernel=kernel,
        trace_fn=lambda _, pkr: pkr
    )

# Run the MCMC chain
initial_state = [tf.zeros([]), tf.zeros([])]
samples, kernel_results = run_chain(initial_state, hmc, num_results, num_burnin_steps)

# Extract the samples and log
samples_ = [s.numpy() for s in samples]
samples_x, samples_y = samples_

print("Acceptance rate: ", kernel_results.is_accepted.numpy().mean())
print("Mean of x: ", samples_x.mean())
print("Mean of y: ", samples_y.mean())
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/1-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Practical implementation of the Markov Chain Monte Carlo method</em></p>
<p>Let's understand how the code works step by step.</p>
<h3 id="heading-import-the-libraries">Import the libraries</h3>
<pre><code><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">import</span> tensorflow_probability <span class="hljs-keyword">as</span> tfp
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<p>In this code, we import two Python libraries: </p>
<ul>
<li><a target="_blank" href="https://www.tensorflow.org/">TensorFlow</a>: Building and training machine learning models </li>
<li><a target="_blank" href="https://www.tensorflow.org/probability">TensorFlow Probability</a>: Probabilistic reasoning and statistical modeling</li>
</ul>
<h3 id="heading-create-a-target-distribution">Create a target distribution</h3>
<pre><code>def target_log_prob(x, y):
    <span class="hljs-keyword">return</span> <span class="hljs-number">-0.5</span> * (x**<span class="hljs-number">2</span> + y**<span class="hljs-number">2</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Creating target distribution</em></p>
<p>In this code, we define a 2D Gaussian distribution:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/output-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>2D Gaussian distribution</em></p>
<p>This graph is defined by:</p>
<div class="equation">
    -0.5 × (x<sup>2</sup> + y<sup>2</sup>)
</div>
<p>Because this is a 2D Gaussian distribution, each data point is represented by two variables that follow a joint Gaussian distribution.</p>
<p>If this were a real-life scenario, we would be modeling a system by finding its probability distribution based on two variables.</p>
<p>In many practical applications, such as circuits, there can be dozens of variables involved. </p>
<p>To model such systems correctly, we often use multivariate probability distributions, which generalize the concept of the Gaussian distribution to many dimensions.</p>
<h3 id="heading-initialize-the-markov-chain-monte-carlo">Initialize the Markov Chain Monte Carlo</h3>
<pre><code>num_results = <span class="hljs-number">1000</span>
num_burnin_steps = <span class="hljs-number">500</span>

hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=lambda x, <span class="hljs-attr">y</span>: target_log_prob(x, y),
    num_leapfrog_steps=<span class="hljs-number">3</span>,
    step_size=<span class="hljs-number">0.1</span>
)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Initializing the Markov Chain Monte Carlo</em></p>
<p>This block of code sets up a Hamiltonian Monte Carlo (HMC) transition kernel using TensorFlow Probability. </p>
<p>It first defines two variables:</p>
<ul>
<li><code>num_results</code> as 1000, indicating the number of samples to generate</li>
<li><code>num_burnin_steps</code> as 500, representing the number of initial samples to discard (burn-in period).</li>
</ul>
<p>The HMC transition kernel is set up with:</p>
<ul>
<li>A target log probability function that takes two inputs and returns their log probability. In our case, this is the 2D Gaussian distribution. The log probability measures how likely a particular pair of values is.</li>
<li><code>num_leapfrog_steps=3</code>: the algorithm takes 3 leapfrog steps per proposal.</li>
<li><code>step_size=0.1</code>: each leapfrog step moves by an amount of 0.1.</li>
</ul>
<h3 id="heading-create-the-trace-function-to-record-the-state-and-kernel-results">Create the trace function to record the state and kernel results</h3>
<pre><code>@tf.function
def run_chain(initial_state, kernel, num_results, num_burnin_steps):
    <span class="hljs-keyword">return</span> tfp.mcmc.sample_chain(
        num_results=num_results,
        num_burnin_steps=num_burnin_steps,
        current_state=initial_state,
        kernel=kernel,
        trace_fn=lambda _, <span class="hljs-attr">pkr</span>: pkr
    )
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Creating the trace function to record the state and kernel results</em></p>
<p>The function is decorated with <code>@tf.function</code>, which optimizes it for performance by compiling it into a TensorFlow graph.</p>
<p>The function <code>run_chain</code> takes four arguments:</p>
<ol>
<li><code>initial_state</code>: The initial state of the Markov Chain.</li>
<li><code>kernel</code>: The MCMC transition kernel to use (such as Hamiltonian Monte Carlo).</li>
<li><code>num_results</code>: The number of samples to generate.</li>
<li><code>num_burnin_steps</code>: The number of initial samples to discard (burn-in period).</li>
</ol>
<p>The function calls <code>tfp.mcmc.sample_chain</code> to perform the MCMC sampling:</p>
<ul>
<li><code>num_results</code>: The number of samples to draw.</li>
<li><code>num_burnin_steps</code>: The number of burn-in steps.</li>
<li><code>current_state</code>: The starting state of the Markov Chain.</li>
<li><code>kernel</code>: The transition kernel that defines the sampling process.</li>
<li><code>trace_fn</code>: A function that specifies what to trace during sampling. In this case, it returns the previous kernel results (<code>pkr</code>), effectively tracing the internal state of the MCMC algorithm.</li>
</ul>
<h3 id="heading-run-the-mcmc-chain">Run the MCMC chain</h3>
<pre><code># Run the MCMC chain
initial_state = [tf.zeros([]), tf.zeros([])]
samples, kernel_results = run_chain(initial_state, hmc, num_results, num_burnin_steps)

# Extract the samples and log
samples_ = [s.numpy() <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> samples]
samples_x, samples_y = samples_

print(<span class="hljs-string">"Acceptance rate: "</span>, kernel_results.is_accepted.numpy().mean())
print(<span class="hljs-string">"Mean of x: "</span>, samples_x.mean())
print(<span class="hljs-string">"Mean of y: "</span>, samples_y.mean())
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Running the MCMC chain</em></p>
<p>Alright let's break this down as there's a lot going on here:</p>
<p><strong>Initialize the State</strong>:</p>
<ul>
<li><code>initial_state</code> is set to a list containing two zero tensors, which serves as the starting point for the Markov Chain.</li>
</ul>
<p><strong>Run the MCMC Chain</strong>:</p>
<ul>
<li>The <code>run_chain</code> function is called with the initial state, the HMC kernel, the number of results, and the number of burn-in steps.</li>
<li>The function returns two values: <code>samples</code>, which are the generated samples, and <code>kernel_results</code>, which contain the results from the kernel (including diagnostic information).</li>
</ul>
<p><strong>Extract and Convert Samples</strong>:</p>
<ul>
<li>The samples are converted from TensorFlow tensors to NumPy arrays for easier manipulation and analysis.</li>
<li><code>samples_</code> is a list comprehension that converts each sample tensor to a numpy array.</li>
<li><code>samples_x</code> and <code>samples_y</code> are the extracted samples for the two dimensions.</li>
</ul>
<p><strong>Print Diagnostics</strong>:</p>
<ul>
<li>The acceptance rate of the MCMC sampler is calculated and printed. It shows the proportion of accepted proposals during sampling.</li>
<li>The means of the samples for both dimensions (<code>x</code> and <code>y</code>) are calculated and printed to provide a summary of the sampling results.</li>
</ul>
<p>Running the code gives these results:</p>
<ul>
<li>Acceptance rate: 1.0. This means all proposals made during sampling were accepted.</li>
<li>Mean of x: -0.11450629 and mean of y:  -0.23079416. In a perfect 2D Gaussian distribution, the means of x and y are 0.</li>
</ul>
<p>With this MCMC variant, we are approximating the 2D Gaussian distribution. The sample means are not exactly zero, but they are close. Given more samples, they would likely shrink further, until they were so small they could be considered zero.</p>
<h2 id="conclusion">Conclusion: The future of Monte Carlo methods</h2>

<p>The future of Monte Carlo methods lies in the creation of variants that require fewer computational resources and save time.</p>
<p>With these advancements, Monte Carlo methods will find additional applications in more fields.</p>
<p>Thanks to Monte Carlo methods, we are able to model complex systems and phenomena that were previously impossible to handle efficiently.</p>
<p>If you want to know more, you can <a target="_blank" href="https://www.freecodecamp.org/news/solve-the-unsolvable-with-monte-carlo-methods-294de03c80cd/">read this article on Monte Carlo methods</a>.</p>
<p>You can also check out the full code here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Does a CPU Work Internally? From Transistors to Instruction Set Architecture ]]>
                </title>
                <description>
<![CDATA[ The CPU (Central Processing Unit) is the brain of a computer, and the main connection between software and hardware. It makes it possible to operate software on hardware. However, how does it work in deep detail? And how can it connect programs to certa... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-does-cpu-work-internally/</link>
                <guid isPermaLink="false">66ba5326f4ac8da2b2c2e847</guid>
                
                    <category>
                        <![CDATA[ Computers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cpu ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hardware ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Wed, 10 Jul 2024 19:14:05 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/1.jpg" medium="image" />
                <content:encoded>
<![CDATA[ <p>The CPU (Central Processing Unit) is the brain of a computer, and the main connection between software and hardware. It makes it possible to operate software on hardware.</p>
<p>However, how does it work in deep detail? And how can it connect programs to certain computer hardware?</p>
<p>This article aims to make you understand this connection by deeply explaining how a CPU works. This topic is often familiar only to those with a background in computer hardware design from college.</p>
<p>Often, many computer science graduates never have a class in advanced digital logic. So even very experienced programmers may lack an understanding of how a CPU actually processes information.</p>
<p>Although we won't be designing <a target="_blank" href="https://www.homemade-circuits.com/how-to-make-logic-gates-using-transistors/">logic gates from transistors</a> or <a target="_blank" href="https://www.techspot.com/article/1830-how-cpus-are-designed-and-built-part-2/">CPU components from logic gates</a>, we will cover the key concepts needed to understand how a CPU processes data created by a program written in a programming language.</p>
<p>We will see:</p>
<ul>
<li><a class="post-section-overview" href="#heading-analogy-introduction-to-what-makes-cpus-work">Analogy: Introduction to What Makes CPUs Work</a></li>
<li><a class="post-section-overview" href="#heading-the-memory-hubs-understanding-ram-and-rom">The Memory Hubs: Understanding RAM and ROM</a></li>
<li><a class="post-section-overview" href="#heading-the-roadways-of-data-navigating-the-cpu-data-path">The Roadways of Data: Navigating the CPU Data Path</a></li>
<li><a class="post-section-overview" href="#heading-traffic-controllers-the-role-of-state-machines-in-cpus">Traffic Controllers: The Role of State Machines in CPUs</a></li>
<li><a class="post-section-overview" href="#heading-daily-routines-the-fetch-execute-cycle-explained">Daily Routines: The Fetch-Execute Cycle Explained</a></li>
<li><a class="post-section-overview" href="#heading-the-rulebook-decoding-the-instruction-set-architecture-isa">The Rulebook: Decoding the Instruction Set Architecture (ISA)</a></li>
<li><a class="post-section-overview" href="#heading-from-programming-languages-to-machine-code">From programming languages to machine code</a></li>
<li><a class="post-section-overview" href="#heading-city-challenges-addressing-cpu-problems">City Challenges: Addressing CPU Problems</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-better-control-units-and-data-parts">Conclusion: Better control units and data parts</a></li>
</ul>
<p>I will use the Intel 8008 as a reference. </p>
<h2 id="analogy">Analogy: Introduction to What Makes CPUs Work</h2>

<p>To understand deeply how a computer works, let's imagine a city as our real-life scenario. We'll compare computer elements to parts of this city.</p>
<p>This way, you get a clearer view of different CPU parts and why they are important. Afterwards, we will look in depth at each of the components.</p>
<h3 id="heading-the-memory-hubs-understanding-ram-and-rom">The Memory Hubs: Understanding RAM and ROM</h3>
<p>RAM (Random access memory) is like a city public library: it stores books and information for people to borrow and return as needed.</p>
<p>In a computer, the RAM loads data and instructions from the computer memory needed by the CPU to process data.</p>
<p>ROM (Read-Only Memory) is like a historical archive in the city: it only stores records that will never change and can never be borrowed by the public.</p>
<h3 id="heading-the-roadways-of-data-navigating-the-cpu-data-path">The Roadways of Data: Navigating the CPU Data Path</h3>
<p>The CPU data path is the network of roads in the city. The buses and registers of the CPU data path act like the city's road network.</p>
<p>Just as roads help cars and people move, the CPU data path ensures that data travels efficiently within the CPU.</p>
<h3 id="heading-traffic-controllers-the-role-of-state-machines-in-cpus">Traffic Controllers: The Role of State Machines in CPUs</h3>
<p>State machines act as the traffic control systems.</p>
<p>The traffic control system manages the flow of vehicles, and the state machines manage the flow of data according to the instructions provided to the CPU.</p>
<h3 id="heading-daily-routines-the-fetch-execute-cycle-explained">Daily Routines: The Fetch-Execute Cycle Explained</h3>
<p>The fetch-execute cycle is the daily commute for city residents.</p>
<p>Every day, people decide where they are going, travel there, perform their tasks and return home. This process is always repeated.</p>
<p>In the same way, the CPU fetches instructions, decodes them, and executes them in a repetitive cycle.</p>
<h3 id="heading-the-rulebook-decoding-the-instruction-set-architecture-isa">The Rulebook: Decoding the Instruction Set Architecture (ISA)</h3>
<p>The instruction set architecture is like the city transportation law.</p>
<p>The city's transportation laws define what is legal in the city when it comes to moving people around.</p>
<p>The instruction set architecture is the set of rules and instructions that the CPU can execute.</p>
<h2 id="memory">The Memory Hubs: Understanding RAM and ROM</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Valentine Tanasovich: https://www.pexels.com/photo/black-and-gray-computer-motherboard-2588757/</em></p>
<p><a target="_blank" href="https://www.freecodecamp.org/news/how-to-access-and-read-ram-contents/">RAM stands for Random Access Memory and can be used to read and write data.</a></p>
<p>Data is loaded from the computer's main storage into RAM first so that the CPU avoids long waiting times.</p>
<p>Then, it uses the data from the RAM to complete the instructions.</p>
<p>RAM is used in computers and many other electronic devices even though it is volatile memory. This means the data only exists while the device is powered on, which makes RAM ideal for temporary storage while the device works.</p>
<p>ROM stands for Read-Only Memory. It holds only data that was written during the device's manufacturing.</p>
<p>It is widely used in <a target="_blank" href="https://www.freecodecamp.org/news/what-is-firmware/">firmware</a> for devices, the BIOS, and small embedded systems.</p>
<p>This is because ROM is non-volatile memory: its contents remain when the device is powered off, making it very important for permanent data storage.</p>
<h2 id="roadways">The Roadways of Data: Navigating the CPU Data Path</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Rogeer Marques: https://www.pexels.com/photo/close-up-shot-of-a-chip-processor-11272008/</em></p>
<p>The CPU data path is a complex digital circuit with many components that work with one another, such as:</p>
<ul>
<li><strong>Arithmetic Logic Unit (ALU):</strong> Performs arithmetic and logical operations inside the CPU data path.</li>
<li><strong>Registers:</strong> Small, fast storage locations for temporary data retrieved from the RAM.</li>
<li><strong>Buses:</strong> Data, control and address buses are wires used inside the CPU data path to transfer information.</li>
</ul>
<p>While CPUs have changed a lot since the Intel 8008, these are some of the components that still serve as the foundation for all CPUs.</p>
<p>Thanks to them, data can flow, but something must still control that flow. This is the job of the CPU's control unit, implemented in the Intel 8008 as state machines.</p>
<h2 id="traffic">Traffic Controllers: The Role of State Machines in CPUs</h2>

<p><a target="_blank" href="https://www.freecodecamp.org/news/state-machines-basics-of-computer-science-d42855debc66/">A state machine is a system that transitions between different states in order to perform tasks.</a></p>
<p>They are composed of a number of states and transitions. They were used in the Intel 8008 to create the control unit because of their structured and effective way of managing the sequence of operations needed to process instructions.</p>
<p>Each of the states can activate one or many CPU components to process a certain assembly instruction.</p>
<p>This way, certain CPU data path parts are activated for an instruction to be completed.</p>
<p>Additionally, thanks to these state machines, the CPU is complete and can perform all instructions a user wants in a continuous loop called the fetch-execute cycle.</p>
<h2 id="fetch">Daily Routines: The Fetch-Execute Cycle Explained</h2>

<p>The state machine in the CPU controls how the CPU data path works together to perform a given instruction.</p>
<p>Nowadays, a computer processes millions of instructions per second. The state machines therefore act as a loop that fetches instructions and executes them.</p>
<p>This process is known as the fetch-execute cycle, where the CPU retrieves and executes instructions:</p>
<ul>
<li><strong>Fetch:</strong> The CPU fetches the instruction from memory.</li>
<li><strong>Decode:</strong> The fetched instruction is decoded to determine the required action.</li>
<li><strong>Execute:</strong> The decoded instruction is executed using the appropriate CPU components.</li>
<li><strong>Write-back:</strong> The result of the execution is written back to memory or a register.</li>
</ul>
<p>In the fetch stage, the control unit tells the RAM to give the next instruction to the CPU.</p>
<p>In the decode stage, the CPU interprets the instruction, and in the execution stage, it performs the operation. Afterwards, the write-back stage ensures the result is stored correctly.</p>
<p>This cycle continues for as long as the computer is on, allowing modern processors to execute billions of instructions per second.</p>
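<p>The four stages above can be sketched as a toy Python loop. This is an illustrative model, not a real ISA: "memory" holds made-up <code>(opcode, operand)</code> pairs, and a single accumulator register stands in for the CPU's register file.</p>

```python
# A toy fetch-decode-execute loop. Each "instruction" is an (opcode, operand)
# pair, and the program counter (pc) tracks which instruction comes next.
memory = [("LOAD", 5), ("ADD", 3), ("SUB", 2), ("HALT", 0)]
accumulator = 0
pc = 0  # program counter

while True:
    opcode, operand = memory[pc]   # fetch the next instruction
    pc += 1
    if opcode == "HALT":           # decode: decide what action is required
        break
    elif opcode == "LOAD":         # execute + write-back into the register
        accumulator = operand
    elif opcode == "ADD":
        accumulator += operand
    elif opcode == "SUB":
        accumulator -= operand

print(f"final accumulator value: {accumulator}")
```

<p>A real CPU does the same thing in hardware: the control unit's states drive the fetch, decode, execute, and write-back phases for each instruction.</p>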
<h3 id="heading-but-what-about-data-from-the-keyboard-or-mouse">But What About Data from the Keyboard or Mouse?</h3>
<p>This data does not come from RAM but is handled through a mechanism called interrupts. While the CPU runs instructions, it can detect when data comes from peripherals.</p>
<p>If this happens, the CPU stops its current task and prioritizes the instructions from the peripherals. Afterwards, the CPU resumes its previous tasks.</p>
<p>There are many ways to manage interrupts, with some of the most popular being:</p>
<ol>
<li><strong>Polled Interrupts</strong>: The CPU periodically checks if an interrupt has occurred.</li>
<li><strong>Vectored Interrupts</strong>: The interrupting device directs the CPU to the appropriate interrupt service routine.</li>
<li><strong>Prioritized Interrupts</strong>: Interrupts are assigned different priority levels, ensuring critical tasks are handled first.</li>
</ol>
<p>With these mechanisms, the CPU maintains its performance while interacting with peripherals.</p>
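<p>The prioritized scheme can be sketched in software with a priority queue. This is an illustration of the concept, not how real interrupt controllers are implemented; the devices and priority numbers are made up.</p>

```python
import heapq

# Sketch of prioritized interrupts: pending interrupts sit in a priority
# queue, and the CPU services the lowest-numbered (most urgent) one first.
pending = []
heapq.heappush(pending, (2, "keyboard"))
heapq.heappush(pending, (0, "power failure"))  # priority 0 = most critical
heapq.heappush(pending, (1, "disk I/O"))

service_order = []
while pending:
    priority, device = heapq.heappop(pending)  # always pops the smallest priority
    service_order.append(device)

print(service_order)
```

<p>No matter the order in which interrupts arrive, the most critical one is always handled first – which is exactly the point of prioritization.</p>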
<h2 id="instruction">The Rulebook: Decoding the Instruction Set Architecture (ISA)</h2>

<p>With the control unit, the complete CPU and RAM, it is possible to perform many instructions.</p>
<p>But what instructions can be performed on a given CPU? And how many? This is what the Instruction Set Architecture (ISA) solves.</p>
<p>The ISA defines a set of instructions that a certain CPU can execute. It is what allows programmers to understand what a processor can and cannot do without having to understand all the digital logic hardware inside it.</p>
<p>This way, it acts as an interface between software and hardware.</p>
<p><strong>Key Aspects of ISA:</strong></p>
<ul>
<li><strong>Instruction Types:</strong> Includes arithmetic, logical, control, and data transfer instructions.</li>
<li><strong>Addressing Modes:</strong> Methods for specifying operands of instructions.</li>
<li><strong>Registers:</strong> The set of registers available for use by instructions.</li>
</ul>
<p><strong>Common ISAs:</strong></p>
<ul>
<li><strong>x86:</strong> Widely used in desktop and server processors.</li>
<li><strong>ARM:</strong> Dominant in mobile and embedded devices due to its power efficiency.</li>
<li><strong>RISC-V:</strong> An open standard ISA designed for a wide range of applications.</li>
</ul>
<p>Each CPU family typically implements its own instruction set architecture, and each ISA is usually exposed to programmers through its own assembly language.</p>
<p>This is why there are so <a target="_blank" href="https://www.freecodecamp.org/news/what-are-assembly-languages/">many versions</a> of the assembly programming language.</p>
<p>Since each CPU has its own hardware specifications, CPUs that share similar components also tend to have similar assembly languages.</p>
<p>The choice of ISA impacts the CPU's design, performance, and compatibility with software.</p>
<p>For instance, the complexity of x86 allows for powerful desktop applications, while ARM's simplicity favors energy-efficient mobile devices.</p>
<h2 id="programming">From Programming Languages to Machine Code</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-1.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by luis gomes: https://www.pexels.com/photo/close-up-photo-of-programming-of-codes-546819/</em></p>
<p>While each processor has its own assembly language, writing and maintaining large programs directly in assembly is complex.</p>
<p>Assembly is error-prone, and it is easy to waste time correcting low-level details instead of actually designing and developing the program.</p>
<p>To solve this problem, many programming languages were created from assembly. We write the code in the programming languages, and it is then converted to assembly.</p>
<p>This way, instead of spending time on details, it is possible to focus on more important things like system development and algorithm design.</p>
<p>This is the process by which most programming languages convert their code into assembly:</p>
<ol>
<li>A compiler or interpreter translates the source code into assembly (or an intermediate form such as bytecode).</li>
<li>The assembly code is then converted to raw machine code.</li>
<li>The CPU fetches, decodes, and executes each machine instruction in its fetch-execute cycle.</li>
<li>Afterward, the CPU fetches and executes the next instruction until the program ends.</li>
</ol>
<p>Let's see two examples of programming languages doing this!</p>
<h3 id="heading-c-programming-language">C Programming Language</h3>
<p>The C programming language was created from assembly in the early 1970s. It was created to provide a higher-level language for efficient system-level programming that also allows hardware manipulation.</p>
<p>With a compiler, the C code is converted to assembly and then processed by the complete CPU.</p>
<p>Thanks to this conversion, writing programs in C helps us avoid or manage many common problems more efficiently, such as:</p>
<ul>
<li>Memory management errors</li>
<li>Buffer overflows</li>
<li>Manual optimization issues</li>
</ul>
<p>Nowadays, even for simpler tasks, the assembly code generated by a C compiler is usually more efficient and reliable than assembly written by hand.</p>
<p>If you want to learn more about the C compiler you can check out:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.freecodecamp.org/news/what-is-a-compiler-in-c/">https://www.freecodecamp.org/news/what-is-a-compiler-in-c/</a></div>
<h3 id="heading-python-programming-language">Python Programming Language</h3>
<p>The Python programming language was created from C in the late 1980s.</p>
<p>Its goal was to provide a user-friendly, high-level programming language that emphasizes readability and simplicity, allowing for rapid application development.</p>
<p>In Python, the interpreter first compiles the source code into an intermediate form called bytecode.</p>
<p>The interpreter then executes this bytecode instruction by instruction; the machine code that the CPU ultimately processes in its fetch-execute cycle is that of the interpreter itself.</p>
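<p>You can peek at this bytecode yourself with Python's built-in <code>dis</code> module:</p>

```python
import dis

# A trivial function to inspect.
def add(a, b):
    return a + b

# Print the bytecode the interpreter compiled add() into.
# The listing shows low-level instructions such as LOAD_FAST,
# which the interpreter executes one at a time.
dis.dis(add)
```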
<p>This way, it is possible for people to program in an easier way and focus on bigger programs, such as:</p>
<ul>
<li>Artificial intelligence models</li>
<li>Web apps</li>
<li>Data analysis</li>
<li>Scientific computing</li>
</ul>
<p>However, regardless of the programming language, a traditional CPU core processes instructions sequentially.</p>
<h2 id="problems">City Challenges: Addressing CPU Problems</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-1.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Peng LIU: https://www.pexels.com/photo/timelapse-photography-of-vehicle-on-concrete-road-near-in-high-rise-building-during-nighttime-169677/</em></p>
<p>The traditional single-core CPU processes data sequentially, instruction after instruction. This becomes a limitation when there are many instructions to process.</p>
<p>This is what GPUs (Graphics Processing Units) were designed to address. Thanks to GPUs, we can process many instructions in parallel, thereby reducing computing time significantly.</p>
<p>With these parallel processing capabilities, it is possible to achieve a much faster computation and improved efficiency in a wide range of applications.</p>
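<p>A small taste of this speed-up is visible even on a CPU: NumPy's vectorized operations process whole arrays at once instead of one Python-level instruction per element. This is only a sketch of the data-parallel idea; GPUs push the same principle much further:</p>

```python
import time

import numpy as np

n = 1_000_000
data = np.arange(n, dtype=np.float64)

# Sequential: one Python-level operation per element.
start = time.perf_counter()
total_loop = 0.0
for x in data:
    total_loop += x * 2.0
loop_time = time.perf_counter() - start

# Vectorized: NumPy applies the operation across the whole array at once,
# inside optimized C code that can use SIMD data parallelism.
start = time.perf_counter()
total_vec = (data * 2.0).sum()
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```

On a typical machine the vectorized version is orders of magnitude faster, even though both compute the same sum.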
<h2 id="conclusion">Conclusion: Better Control Units and Data Parts</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Miguel Á. Padriñán: https://www.pexels.com/photo/green-circuit-board-343457/</em></p>
<p>In addition to modern CPUs being multicore, advancements in control units and data paths play a critical role in improving processor performance. </p>
<p>Control units are often designed using microprogramming or hardwired control units. </p>
<p>Microprogramming offers greater flexibility and easier updates to the control logic, while hardwired control units provide faster performance by directly implementing control signals.</p>
<p>Another significant advancement is the exploration of new materials for transistors in logic gates. </p>
<p>Instead of relying solely on silicon, researchers are investigating alternative materials to create faster and more efficient processors.</p>
<p>As technology continues to advance, understanding these fundamental concepts will remain essential for both enthusiasts and professionals in the field.</p>
<p>Keeping up with these developments ensures the continued innovation and improvement of CPU design and functionality.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What are Markov Chains? Explained With Python Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ There are various mathematical tools that can be used to predict the near future based on a current state. One of the most widely used are Markov chains. Markov chains allow you to predict the uncertainty of future events under certain conditions. Fo... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-a-markov-chain/</link>
                <guid isPermaLink="false">66ba5357cccc49d721b6ea23</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Mon, 08 Jul 2024 12:53:27 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/miltiadis-fragkidis-2zGTh-S5moM-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>There are various mathematical tools that can be used to predict the near future based on a current state. One of the most widely used are Markov chains.</p>
<p>Markov chains allow you to predict the uncertainty of future events under certain conditions. For this reason, they are widely used in science, engineering, economics, and many other areas.</p>
<p>However, there are many types of Markov chains, and each has its own applications.</p>
<p>This guide introduces what Markov chains are, different types of Markov chains, including Discrete-Time, Continuous-Time, Reversible, and a code example of Hidden Markov Models (HMMs).</p>
<p>We will see:</p>
<ul>
<li><a class="post-section-overview" href="#analogy">Analogy</a></li>
<li><a class="post-section-overview" href="#heading-markov-chain-explained-in-plain-english">Markov Chain Explained in plain English</a></li>
<li><a class="post-section-overview" href="#heading-applications-of-markov-chains">Applications of Markov Chains</a></li>
<li><a class="post-section-overview" href="#types-of-markov-chains">Types of Markov Chains</a></li>
<li><a class="post-section-overview" href="#hidden-markov-chains-code-example">Hidden Markov Chains Code Example</a></li>
</ul>
<h2 id="heading-analogy"> Analogy </h2>

<p>Imagine that you want to predict the weather tomorrow, and it <strong>only</strong> depends on the weather today. The weather can be either sunny or rainy.</p>
<p>Here are the probabilities:</p>
<ul>
<li>If it's sunny today, there's an 80% chance that it will be sunny again tomorrow, and a 20% chance that it will be rainy.</li>
<li>If it's rainy today, there's a 50% chance that it will be sunny tomorrow, and a 50% chance that it will be rainy.</li>
</ul>
<p>In this scenario, we can predict future states of the weather based on current states using probabilities.</p>
<p>This idea of predicting the future based solely on the present state is called a Markov chain.</p>
<p>Here, the states are either sunny or rainy and the probabilities describe the chances of the weather changing based on the current state.</p>
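<p>The analogy above can be simulated directly. This short sketch uses NumPy to walk the sunny/rainy chain with exactly the probabilities listed:</p>

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is reproducible

states = ["sunny", "rainy"]
# Transition probabilities from the weather analogy:
# rows = today's weather, columns = tomorrow's weather.
P = np.array([[0.8, 0.2],   # sunny -> sunny, sunny -> rainy
              [0.5, 0.5]])  # rainy -> sunny, rainy -> rainy

def simulate(days, start=0):
    """Simulate the chain for a number of days, starting from `start`."""
    state = start
    history = [state]
    for _ in range(days):
        # Tomorrow depends only on today: sample from today's row of P.
        state = np.random.choice(2, p=P[state])
        history.append(state)
    return [states[s] for s in history]

print(simulate(7))
```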
<h2 id="heading-markov-chain-explained-in-plain-english">Markov Chain Explained in Plain English</h2>
<p>A Markov chain describes random processes where systems move between states, and a new state only depends on the current state, not on how it got there.</p>
<p>Mathematically, Markov chains are called stochastic models because they model (simulate) real life events that are random by nature (stochastic).</p>
<p>Markov chains are very easy to implement and efficient at modeling complex systems.</p>
<p>Another key advantage is their "memoryless" property. This makes them fast to run on computers and powerful for studying random processes and making predictions based on current conditions.</p>
<h2 id="heading-applications-of-markov-chains">Applications of Markov Chains</h2>
<p>At some level, almost all real-life events are stochastic. In other words, they involve randomness and uncertainty.</p>
<p>This is exactly why Markov chains are so widely used: they can predict the behavior of systems based on current conditions.</p>
<p>In finance, they are used to model changes in credit ratings and to forecast market regimes.</p>
<p>In genetics, they help us understand how proteins change over time, which is important when studying genetic variations.</p>
<p>In robotics, they assist with decision-making by predicting the robot's next move based on current observation.</p>
<p>These real-life examples show how effectively Markov chains can be used to solve problems in different fields.</p>
<h2 id="heading-types-of-markov-chains"> Types of Markov Chains </h2>

<p>There are many types of Markov chains. In this section, we'll only discuss the most important variants of Markov chains.</p>
<h3 id="heading-discrete-time-markov-chains-dtmcs">Discrete-Time Markov Chains (DTMCs)</h3>
<p>In DTMCs, the system changes state at specific time steps. They are called discrete because the state transitions occur at distinct, separate time intervals.</p>
<p>They are used in queuing theory (study of the behavior of waiting lines), genetics, and economics because they are simple to analyze.</p>
<h3 id="heading-continuous-time-markov-chains-ctmcs">Continuous-Time Markov Chains (CTMCs)</h3>
<p>CTMCs differ from DTMCs in that state transitions can occur at any continuous time point, not at fixed intervals.</p>
<p>This makes them stochastic models where state changes happen continuously. This is important in chemical reactions and reliability engineering.</p>
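<p>As an illustrative sketch (the rates below are made up), a two-state CTMC can be simulated by drawing exponentially distributed holding times between jumps:</p>

```python
import numpy as np

np.random.seed(1)

# Rate (generator) matrix for a two-state CTMC: off-diagonal entries are
# transition rates and each row sums to zero. These rates are invented.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

def simulate_ctmc(t_end, start=0):
    """Jump between states at exponentially distributed random times."""
    t, state, path = 0.0, start, [(0.0, start)]
    while True:
        rate = -Q[state, state]                 # total exit rate of the state
        t += np.random.exponential(1.0 / rate)  # random holding time
        if t >= t_end:
            break
        state = 1 - state                       # only two states here
        path.append((t, state))
    return path

path = simulate_ctmc(10.0)
print(path[:3])
```

Unlike the discrete-time example, transitions here can happen at any real-valued time, which is exactly what distinguishes CTMCs from DTMCs.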
<h3 id="heading-reversible-markov-chains">Reversible Markov Chains</h3>
<p>Reversible Markov chains are special. The process of state change is the same whether the direction is forwards or backwards, like rewinding a video and playing it again.</p>
<p>This property makes it easier to know when a system is stable and to study how a system behaves over time. They are widely used in statistical physics and economics.</p>
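<p>Reversibility can be checked numerically through the detailed balance condition: for the stationary distribution pi, the quantity pi[i] * P[i, j] must equal pi[j] * P[j, i] for every pair of states. Using the sunny/rainy probabilities from the weather analogy as an example:</p>

```python
import numpy as np

# Transition matrix of the weather analogy (sunny/rainy).
P = np.array([[0.8, 0.2],
              [0.5, 0.5]])

# Stationary distribution pi: the left eigenvector of P for eigenvalue 1,
# normalized so its entries sum to 1 (so pi @ P == pi).
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

# Detailed balance check: every two-state chain satisfies it,
# so this chain is reversible with respect to pi.
balanced = np.allclose(pi[0] * P[0, 1], pi[1] * P[1, 0])
print(pi, balanced)
```

Here pi works out to (5/7, 2/7), and both sides of the balance condition equal 1/7.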
<h3 id="heading-doubly-stochastic-markov-chains">Doubly Stochastic Markov Chains</h3>
<p>Doubly stochastic Markov chains are defined by a transition probability matrix. In the matrix, the sum of the probabilities in each row and each column equals 1.</p>
<p>This means each row and each column represents a valid probability distribution. In other words, each row and column is a list of chances for different outcomes.</p>
<p>This property is crucial in quantum computing and statistical mechanics.</p>
<p>Thanks to doubly stochastic Markov chains, systems evolve in a way that preserves probabilities and symmetry, making the modeling and analysis of quantum computing systems far more accurate.</p>
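<p>It's easy to verify this property numerically. The matrix below is a made-up example whose rows and columns each sum to 1; for such chains the uniform distribution is stationary:</p>

```python
import numpy as np

# A doubly stochastic matrix: every row AND every column sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

row_sums = P.sum(axis=1)
col_sums = P.sum(axis=0)
print(row_sums, col_sums)  # both all ones

# Consequence: the uniform distribution is stationary for this chain.
uniform = np.full(3, 1 / 3)
print(np.allclose(uniform @ P, uniform))  # True
```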
<h2 id="heading-hidden-markov-chains-code-example"> Hidden Markov Chains Code Example </h2>

<p>Before we jump into code examples, let's first understand what Hidden Markov Chains are.</p>
<h3 id="heading-hidden-markov-chains-modeling-unseen-states">Hidden Markov Chains: Modeling Unseen States</h3>
<p>The main idea behind hidden Markov chains is to model systems that have hidden states (states whose values we cannot observe directly) which can only be discovered through observable events.</p>
<p>In other words, hidden Markov chains allow us to predict the behavior of a system by:</p>
<ul>
<li>Considering the likelihood of moving from one state to another.</li>
<li>Knowing the probability of observing a certain event from each state.</li>
</ul>
<p>We can understand this by observing how the states change from an indirect point of view.</p>
<p>We may not know the states' original values.</p>
<p>But by knowing the way they change, we can predict what their values will be in the future.</p>
<p>This way, hidden Markov chains are flexible in modeling sequences, capturing both the transitions between hidden states and the observable outcomes.</p>
<p>Because of this, hidden Markov models are used in fields such as engineering, financial modeling, speech recognition, bioinformatics, and many more.</p>
<h3 id="heading-code-example">Code Example</h3>
<p>In this code example, we will see a simple example with synthetic data.</p>
<p>Here is the full code:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> hmmlearn <span class="hljs-keyword">import</span> hmm

# <span class="hljs-built_in">Set</span> random seed <span class="hljs-keyword">for</span> reproducibility
np.random.seed(<span class="hljs-number">42</span>)

# Define the HMM parameters
n_components = <span class="hljs-number">2</span>  # <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> states
n_features = <span class="hljs-number">1</span>    # <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> observation features

# Create a Gaussian HMM
model = hmm.GaussianHMM(n_components=n_components, covariance_type=<span class="hljs-string">"diag"</span>)

# Define transition matrix (rows must sum to <span class="hljs-number">1</span>)
model.startprob_ = np.array([<span class="hljs-number">0.6</span>, <span class="hljs-number">0.4</span>])
model.transmat_ = np.array([[<span class="hljs-number">0.7</span>, <span class="hljs-number">0.3</span>],
                            [<span class="hljs-number">0.4</span>, <span class="hljs-number">0.6</span>]])

# Define means and covariances <span class="hljs-keyword">for</span> each state
model.means_ = np.array([[<span class="hljs-number">0.0</span>], [<span class="hljs-number">3.0</span>]])
model.covars_ = np.array([[<span class="hljs-number">0.5</span>], [<span class="hljs-number">0.5</span>]])

# Generate synthetic observation data
X, Z = model.sample(<span class="hljs-number">100</span>)  # <span class="hljs-number">100</span> samples

# Create a <span class="hljs-keyword">new</span> HMM instance
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type=<span class="hljs-string">"diag"</span>, n_iter=<span class="hljs-number">100</span>)

# Fit the model to the data
new_model.fit(X)

# Print the learned parameters
print(<span class="hljs-string">"Transition matrix:"</span>)
print(new_model.transmat_)
print(<span class="hljs-string">"Means:"</span>)
print(new_model.means_)
print(<span class="hljs-string">"Covariances:"</span>)
print(new_model.covars_)

# Predict the hidden states <span class="hljs-keyword">for</span> the observed data
hidden_states = new_model.predict(X)

print(<span class="hljs-string">"Hidden states:"</span>)
print(hidden_states)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full code</em></p>
<p>Let's go through the code block by block!</p>
<h3 id="heading-import-libraries-and-set-random-seed">Import libraries and set random seed</h3>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> hmmlearn <span class="hljs-keyword">import</span> hmm

np.random.seed(<span class="hljs-number">42</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Import libraries and set random seed</em></p>
<p>In this block of code, we imported two Python libraries:</p>
<ul>
<li><a target="_blank" href="https://numpy.org/">NumPy</a>: For numerical operations.</li>
<li><a target="_blank" href="https://hmmlearn.readthedocs.io/en/latest/index.html">hmmlearn</a>: For hidden Markov model implementation.</li>
</ul>
<p>Next, we used the <code>numpy</code> library to set a random seed.</p>
<h4 id="heading-what-is-a-random-seed">What is a Random Seed?</h4>
<p>A random seed is a value used to start a pseudorandom number generator.</p>
<p>With a fixed random seed, we ensure that the sequence of pseudorandom numbers generated is always the same.</p>
<p>This allows us to duplicate experiments and verify results.</p>
<p>The specific value of the seed does not matter as long as it remains consistent.</p>
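<p>A quick demonstration: resetting the seed reproduces exactly the same "random" numbers.</p>

```python
import numpy as np

# Same seed -> same pseudorandom sequence, run after run.
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)  # reset to the same seed
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```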
<h3 id="heading-define-the-hmm-parameters-and-create-a-gaussian-hmm">Define the HMM parameters and create a Gaussian HMM</h3>
<pre><code>n_components = <span class="hljs-number">2</span>  # <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> states
n_features = <span class="hljs-number">1</span>    # <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> observation features

model = hmm.GaussianHMM(n_components=n_components, covariance_type=<span class="hljs-string">"diag"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define the HMM parameters and create a Gaussian HMM</em></p>
<p>In this code block, we created an HMM with two hidden states and a single observed variable.</p>
<p><code>covariance_type="diag"</code> means the matrices that represent covariance (how two variables change together) are diagonal. In other words, each observation feature is assumed to be independent of the others.</p>
<p>With only one feature, as in this example, this simply means each state has its own variance.</p>
<p>However, there is still one unfamiliar term in how we defined this hidden Markov model: "Gaussian".</p>
<h4 id="heading-what-does-gaussian-mean">What Does "Gaussian" Mean?</h4>
<p>This is a very big topic in statistics, but in a few words, Markov chains can only be created when we specify the transition probabilities—chances of moving from one state to another in a Markov chain—and an initial probability distribution.</p>
<p>A Gaussian HMM assumes events are initially modeled by a Gaussian distribution, also called a normal distribution.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/normal-distribution.png" alt="Image" width="600" height="400" loading="lazy">
<em>Normal distribution</em></p>
<p>A normal distribution is like a bell-shaped curve that describes how things are often spread out in nature.</p>
<p>The normal distribution is crucial because it describes many natural occurrences like human heights, measurement errors, how likely a disease might spread and many more.</p>
<p>And even when natural events are not themselves normally distributed, the <a target="_blank" href="https://www.investopedia.com/terms/c/central_limit_theorem.asp">central limit theorem</a> often lets us approximate them with a normal distribution.</p>
<p>This way, many hidden Markov models (HMMs) are defined by a normal distribution, which represents many phenomena in nature and society.</p>
<p>In the hmmlearn library, there is also the possibility of creating Markov chains based on Poisson distributions.</p>
<p>In simple words, Poisson distributions model probabilities that describe the occurrence of events over a fixed interval of time or space. This is widely used in telecommunications.</p>
<p>HMMs based on a Poisson distribution are suited to predicting events that occur randomly and independently over a specified interval.</p>
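<p>For a quick feel for the Poisson distribution itself, here is a small sketch (the call-center scenario and its rate are made up):</p>

```python
import numpy as np

np.random.seed(0)

# Poisson distribution: counts of independent random events per interval.
# Hypothetical example: a call center receiving on average 4 calls per hour.
calls_per_hour = np.random.poisson(lam=4, size=1000)

print(calls_per_hour[:10])
print(calls_per_hour.mean())  # close to the rate lam=4
```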
<h3 id="heading-define-transition-matrix-means-and-covariances-for-each-state">Define transition matrix, means, and covariances for each state</h3>
<pre><code>model.startprob_ = np.array([<span class="hljs-number">0.6</span>, <span class="hljs-number">0.4</span>])
model.transmat_ = np.array([[<span class="hljs-number">0.7</span>, <span class="hljs-number">0.3</span>],
                            [<span class="hljs-number">0.4</span>, <span class="hljs-number">0.6</span>]])

model.means_ = np.array([[<span class="hljs-number">0.0</span>], [<span class="hljs-number">3.0</span>]])
model.covars_ = np.array([[<span class="hljs-number">0.5</span>], [<span class="hljs-number">0.5</span>]])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Define transition matrix, means, and covariances for each state</em></p>
<p><strong><code>model.startprob_ = np.array([0.6, 0.4])</code></strong>:</p>
<ul>
<li>This line sets the initial state probabilities for a Hidden Markov Model (HMM). It indicates that there is a 60% probability of starting in state 0 and a 40% probability of starting in state 1.</li>
</ul>
<p><strong><code>model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])</code></strong>:</p>
<ul>
<li>This line sets the state transition probability matrix for the HMM. The matrix specifies the probabilities of moving from one state to another:</li>
<li>From state 0, there is a 70% chance of staying in state 0 and a 30% chance of transitioning to state 1.</li>
<li>From state 1, there is a 40% chance of transitioning to state 0 and a 60% chance of staying in state 1.</li>
</ul>
<p><strong><code>model.means_ = np.array([[0.0], [3.0]])</code></strong>:</p>
<ul>
<li>This line sets the mean values for the observation distributions in each state. It indicates that the observations are normally distributed with a mean of 0.0 in state 0 and a mean of 3.0 in state 1.</li>
</ul>
<p><strong><code>model.covars_ = np.array([[0.5], [0.5]])</code></strong>:</p>
<ul>
<li>This line sets the covariance values for the observation distributions in each state. It specifies that the variance (covariance in this 1-dimensional case) of the observations is 0.5 for both state 0 and state 1.</li>
</ul>
<h3 id="heading-create-data-new-hmm-instance-and-fit-the-model-with-the-data">Create data, new HMM instance and fit the model with the data</h3>
<pre><code>X, Z = model.sample(<span class="hljs-number">100</span>)  # <span class="hljs-number">100</span> samples

new_model = hmm.GaussianHMM(n_components=n_components, covariance_type=<span class="hljs-string">"diag"</span>, n_iter=<span class="hljs-number">100</span>)

new_model.fit(X)

print(<span class="hljs-string">"Transition matrix:"</span>)
print(new_model.transmat_)
print(<span class="hljs-string">"Means:"</span>)
print(new_model.means_)
print(<span class="hljs-string">"Covariances:"</span>)
print(new_model.covars_)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Create data, new HMM instance and fit the model with the data</em></p>
<p>In this code, we drew 100 samples from the model, trained a new model for up to 100 iterations, and printed the new state transition matrix, means, and covariances.</p>
<p>In other words, we generated 100 samples from the original model, fit a new Hidden Markov Model (HMM) to these samples, and then printed the learned parameters of this new model.</p>
<ul>
<li><strong>X</strong> contains the observed data samples generated by the original model.</li>
<li><strong>Z</strong> contains the hidden state sequences corresponding to those samples.</li>
</ul>
<p><strong>The transition matrix prints out:</strong></p>
<pre><code>[[<span class="hljs-number">0.8100804</span>  <span class="hljs-number">0.1899196</span> ]
 [<span class="hljs-number">0.49398918</span> <span class="hljs-number">0.50601082</span>]]
</code></pre><p>This means that the model tends to stay in state 0 and has nearly equal chances of switching or staying when in state 1.</p>
<p><strong>The means print out:</strong></p>
<pre><code>[[<span class="hljs-number">0.01577373</span>]
 [<span class="hljs-number">3.06245496</span>]]
</code></pre><p>This means that the average observed value is approximately 0.016 in state 0 and 3.062 in state 1.</p>
<p><strong>The covariances print out:</strong></p>
<pre><code>[[[<span class="hljs-number">0.41987084</span>]]
 [[<span class="hljs-number">0.53146802</span>]]]
</code></pre><p>This means that the observed values vary by about 0.420 in state 0 and 0.531 in state 1.</p>
<p>This way, we may never know exactly the values of the states, but we know:</p>
<ul>
<li>How they tend to change with each other</li>
<li>Their average observed value</li>
<li>How they vary</li>
</ul>
<h3 id="heading-predict-the-hidden-states-for-the-observed-data">Predict the hidden states for the observed data</h3>
<pre><code>hidden_states = new_model.predict(X)

print(<span class="hljs-string">"Hidden states:"</span>)
print(hidden_states)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/06/6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Predict the hidden states for the observed data</em></p>
<p>In this code, based on the observed data samples X, we predicted the sequence of hidden states of the Markov model.</p>
<p>The hidden states print out:</p>
<pre><code>[<span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span>
 <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span>
 <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">1</span> <span class="hljs-number">1</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span> <span class="hljs-number">0</span>]
</code></pre><p>This means that the hidden states switch between state 0 and state 1, showing how the system changes states over time.</p>
<h2 id="heading-conclusion-the-future-of-markov-chains">Conclusion: The Future of Markov Chains</h2>
<p>Markov chains are widely used in STEM fields due to their ability to predict the future based on the present.</p>
<p>Markov chains are increasingly being integrated with artificial intelligence, improving automation and the predictive analytics of systems.</p>
<p>Additionally, the development of more computationally efficient Markov chains is a big priority, making them more accessible for real-time processing and large-scale simulations.</p>
<p>In summary, Markov chains are an important tool in science because they let us predict a system's next state from its current one.</p>
<p>With AI and more computational efficiency, Markov chains can be applied in many other fields and solve many problems.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How the Black-Scholes Equation Works – Explained with Python Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ The Black-Scholes Equation is probably one of the most influential equations that nobody has heard about. It's particularly important in finance, especially in these areas: Securitized debt Exchange-traded options Credit default swaps Over-the-count... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-the-black-scholes-equation-works-python-examples/</link>
                <guid isPermaLink="false">66ba532b8e44e0cdf128126f</guid>
                
                    <category>
                        <![CDATA[ finance ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Mon, 17 Jun 2024 16:42:59 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/dan-cristian-padure-h3kuhYUCE9A-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The Black-Scholes Equation is probably one of the most influential equations that nobody has heard about.</p>
<p>It's particularly important in finance, especially in these areas:</p>
<ul>
<li>Securitized debt</li>
<li>Exchange-traded options</li>
<li>Credit default swaps</li>
<li>Over-the-counter derivatives securities</li>
</ul>
<p>In this article, you'll learn why the Black-Scholes Equation is so important in finance, what problems it solves, and the industries it created.</p>
<h3 id="heading-heres-what-well-cover">Here's what we'll cover:</h3>
<ul>
<li><a class="post-section-overview" href="#heading-prerequisite-knowledge-of-finance">Prerequisite knowledge of finance</a></li>
<li><a class="post-section-overview" href="#heading-analogy-predict-the-price-of-a-ticket-for-a-concert">Analogy: Predict the price of a ticket for a concert</a></li>
<li><a class="post-section-overview" href="#heading-black-scholes-in-plain-english-with-a-code-example">Plain English explanation with code example</a></li>
<li><a class="post-section-overview" href="#heading-implications-in-the-real-world">Implications in the real world</a></li>
</ul>
<p>Note: In the code example, we will be working with European call and put options.</p>
<h2 id="pre">Prerequisite Knowledge of Finance</h2>

<p>To get the most out of this article and understand the Black-Scholes Equation, you just need to know what <strong>financial derivatives</strong> and <strong>options</strong> are in finance.</p>
<p>Essentially, financial derivatives are tools investors use to manage risks and improve returns.</p>
<p>There are many types of financial derivatives. One of these is called options. </p>
<p>Options are like financial choices. With options, you can get the right to buy or sell something at a certain time and price, but only if you want to.</p>
<p>The main idea is that they help manage risk so you can make better investments in the future.</p>
<h2 id="analogy"> Analogy: Predict the Price of a Ticket for a Concert </h2>

<p>Imagine you are planning to buy a ticket for a concert.</p>
<p>The ticket prices change depending on the popularity of the artist, demand, and time until the concert.</p>
<p>Depending on that, you will make the best possible decision to buy the ticket at the lowest price.</p>
<p>Just as you think about the <strong>risk</strong> of buying the ticket at a certain time, investors use the Black-Scholes Equation to estimate the fair value of financial derivatives.</p>
<p>This way, they can make wise investment choices in ever-changing markets.</p>
<h2 id="plain"> Black-Scholes in Plain English – with a Code Example </h2>

<p>Essentially, the Black-Scholes Equation solved the problem of <a target="_blank" href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=500303">how to price options correctly in financial markets.</a></p>
<p>This is very important, because it helps banks and financial institutions effectively manage risk.</p>
<p>However, it was not always like this. Before 1973, when the equation was created (<a target="_blank" href="https://www.nobelprize.org/prizes/economic-sciences/1997/press-release/">its creators won a Nobel prize</a>), determining the price of options was much more complicated and difficult.</p>
<p>Before the creation of the Black–Scholes equation, there wasn't a standardized mathematical method to predict the prices of options.</p>
<p>Traders often relied on personal experience and market conditions, which led to unreliable option prices.</p>
<p>And earlier mathematical methods did not fully consider factors like volatility, time decay, and interest rates. So there was a lot of error when pricing options.</p>
<h3 id="heading-here-is-the-black-scholes-equation">Here is the Black-Scholes Equation:</h3>
<p>$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} = rV - r S \frac{\partial V}{\partial S}$$</p><p>While we won't look very deeply at the equation itself, we will outline its key components and implications.</p>
<p>Essentially, the Black-Scholes equation predicts how an option's value changes over time based on several variables:</p>
<ul>
<li>V – Price of the option as a function of stock price <em>S</em> and time <em>t</em></li>
<li>S – Price of the underlying asset</li>
<li>t – Time</li>
<li>σ – Volatility</li>
<li>r – Risk-free interest rate</li>
</ul>
<p>The left side of the equation explains how the option's value changes over time and how market ups and downs affect it. </p>
<p>The right side of the equation shows how the option's value increases due to interest rates and how changes in the asset's price impact it. </p>
<p>By making these two sides equal, we figure out the fair price of the option. </p>
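<p>Solving the equation with the appropriate boundary conditions yields a closed-form formula for European options. Here is a minimal sketch of the call-price formula using only the Python standard library (the function names <code>norm_cdf</code> and <code>bs_call_price</code> are illustrative, not from any particular package):</p>

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF, written with the error function (stdlib only)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price(S, K, T, r, sigma):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Same parameters as the worked example below: S=100, K=105, T=1, r=0.05%, sigma=20%
print(round(bs_call_price(100, 105, 1.0, 0.0005, 0.20), 6))
```

<p>Run with the parameters used later in this article, this should closely match the call price computed by the library.</p>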
<h3 id="heading-python-code-example">Python Code Example</h3>
<p>In this code example, we will find, based on many parameters, the theoretical market value of an option.</p>
<p>For our example, let's assume the following:</p>
<ul>
<li>Current stock price (S) = $100. This is the price of the stock right now.</li>
<li>Strike price (K) = $105. This is the specific price at which the option holder can buy (call) or sell (put) the underlying asset.</li>
<li>Time to expiration (T) = 1 year (or 1.0 when expressed in years). This is the time left until the option expires.</li>
<li>Risk-free interest rate (r) = 0.05% (or 0.0005 when expressed as a decimal). This is the interest rate on a risk-free investment.</li>
<li>Volatility (sigma) = 20% (or 0.2 when expressed as a decimal). This is how much the stock price is expected to fluctuate.</li>
</ul>
<pre><code><span class="hljs-keyword">from</span> blackscholes <span class="hljs-keyword">import</span> BlackScholesCall, BlackScholesPut

def calculate_option_prices(S, K, T, r, sigma, q):
    <span class="hljs-string">""</span><span class="hljs-string">"
    Calculate the Black-Scholes option prices for European call and put options using the 'blackscholes' package.

    Parameters:
    S : float - current stock price
    K : float - strike price of the option
    T : float - time to maturity (in years)
    r : float - risk-free interest rate (annual as a decimal)
    sigma : float - volatility of the underlying stock (annual as a decimal)
    q : float - annual dividend yield (as a decimal)

    Returns:
    tuple - (call price, put price)
    "</span><span class="hljs-string">""</span>
    # Creating instances <span class="hljs-keyword">of</span> BlackScholesCall and BlackScholesPut
    call_option = BlackScholesCall(S=S, K=K, T=T, r=r, sigma=sigma, q=q)
    put_option = BlackScholesPut(S=S, K=K, T=T, r=r, sigma=sigma, q=q)

    # Get call and put prices
    call_price = call_option.price()
    put_price = put_option.price()

    <span class="hljs-keyword">return</span> call_price, put_price


call_price, put_price = calculate_option_prices(<span class="hljs-number">100</span>, <span class="hljs-number">105</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0.0005</span>, <span class="hljs-number">0.20</span>, <span class="hljs-number">0.0</span>)
print(<span class="hljs-string">"Call Price: {:.6f}, Put Price: {:.6f}"</span>.format(call_price, put_price))
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/05/1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now let's examine the code more closely and see what's really going on here:</p>
<h4 id="heading-step-1-import-the-library">Step 1: Import the Library</h4>
<p>This is the Python library we are using in this article:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://pypi.org/project/blackscholes/">https://pypi.org/project/blackscholes/</a></div>
<pre><code><span class="hljs-keyword">from</span> blackscholes <span class="hljs-keyword">import</span> BlackScholesCall, BlackScholesPut
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/05/2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing functions</em></p>
<h4 id="heading-step-2-create-the-function-to-calculate-options-prices">Step 2: Create the Function to Calculate Options Prices</h4>
<p>In the code below, we define the function that calculates the call and put prices of the options.</p>
<pre><code>def calculate_option_prices(S, K, T, r, sigma, q):

    call_option = BlackScholesCall(S=S, K=K, T=T, r=r, sigma=sigma, q=q)
    put_option = BlackScholesPut(S=S, K=K, T=T, r=r, sigma=sigma, q=q)

    call_price = call_option.price()
    put_price = put_option.price()

    <span class="hljs-keyword">return</span> call_price, put_price
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/05/3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Function to calculate call and put prices</em></p>
<p>The main parameters of the function are:</p>
<ul>
<li>S : float – current stock price</li>
<li>K : float – strike price of the option</li>
<li>T : float – time to maturity (in years)</li>
<li>r : float – risk-free interest rate (annual as a decimal)</li>
<li>sigma : float – volatility of the underlying stock (annual as a decimal)</li>
<li>q : float – annual dividend yield (as a decimal)</li>
</ul>
<p>And it returns:</p>
<ul>
<li>tuple – (call price, put price)</li>
</ul>
<p>First, we calculate the call and put options. Then we extract the price from it. We can also get other characteristics like the charm or the delta of these financial contracts according to the library documentation.</p>
<h4 id="heading-step-3-calculate-the-options-pricing">Step 3: Calculate the Options Pricing</h4>
<p>The call and put prices of an option are the costs to buy the respective option contracts. </p>
<pre><code>call_price, put_price = calculate_option_prices(<span class="hljs-number">100</span>, <span class="hljs-number">105</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0.0005</span>, <span class="hljs-number">0.20</span>, <span class="hljs-number">0.0</span>)
print(<span class="hljs-string">"Call Price: {:.6f}, Put Price: {:.6f}"</span>.format(call_price, put_price))
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/05/4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Applying the function</em></p>
<p>We use as examples:</p>
<ul>
<li>Current stock price: $100</li>
<li>Strike price: $105</li>
<li>Time to maturity: 1 year</li>
<li>Risk-free interest rate: 0.05% (as a decimal: 0.0005)</li>
<li>Volatility: 20% (as a decimal: 0.20)</li>
<li>Dividend yield: 0%</li>
</ul>
<p>Based on these factors, we get the following prices:</p>
<ul>
<li>Call Option Price: 5.924799</li>
<li>Put Option Price: 10.872312</li>
</ul>
<p>Which means that, given these parameters:</p>
<ul>
<li>The call option (the right, but not the obligation, to buy the stock at the strike price) costs 5.924799 dollars</li>
<li>The put option (the right, but not the obligation, to sell the stock at the strike price) costs 10.872312 dollars</li>
</ul>
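<p>As a sanity check on these numbers: European call and put prices must satisfy put-call parity, C − P = S − Ke<sup>−rT</sup>. A minimal sketch, plugging in the values from the example above:</p>

```python
from math import exp

S, K, T, r = 100, 105, 1.0, 0.0005
call_price, put_price = 5.924799, 10.872312  # values computed above

# Put-call parity: C - P must equal S - K * e^(-r*T)
lhs = call_price - put_price
rhs = S - K * exp(-r * T)
print(abs(lhs - rhs) < 1e-4)  # prints True: parity holds up to rounding
```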
<h2 id="implications"> Implications in the Real World </h2>

<p>The equation has had a massive impact in the world of finance.</p>
<p>Below are some of the industries the Black-Scholes Equation has changed greatly:</p>
<h3 id="heading-securitized-debt">Securitized Debt</h3>
<p>In simple terms, securitized debt refers to turning loans into something that can be bought and sold.</p>
<p>The Black-Scholes equation changed the way banks price grouped-up debt, like mortgages.</p>
<p>Before the Black-Scholes equation, it was very hard to know the worth of these debts. But with the equation, banks can understand their value much better. This made it easier to buy and sell these debts while knowing the potential benefits and risks.</p>
<p>This way, the market for these mortgage debts grew. Which in turn helped grow the housing market.</p>
<h3 id="heading-exchange-traded-options">Exchange Traded Options</h3>
<p>Trading options was a very uncertain business. There was no way of truly knowing how to correctly price them.</p>
<p>However, with the Black-Scholes equation, option pricing became far easier. It allowed people to calculate an option based on an underlying asset's price, volatility, time to expiration, and interest rates.</p>
<p>The newfound precision helped grow the options market.</p>
<h3 id="heading-credit-default-swaps">Credit Default Swaps</h3>
<p>Credit default swaps are like insurance policies for loans. With a credit default swap, you are protected if the borrower fails to pay back.</p>
<p>Credit default swaps are very important in managing credit risk. But it was only after the Black-Scholes equation was created that they could be accurately priced.</p>
<p>This way, credit default swaps became a very important tool for financial institutions for financial risk management.</p>
<h3 id="heading-over-the-counter-derivatives-securities">Over the Counter Derivatives Securities</h3>
<p>Over-the-counter (OTC) derivatives are private deals made between two parties without a stock exchange.</p>
<p>Before Black-Scholes, negotiating the terms and prices of OTC derivatives was very hard. But then the Black-Scholes equation offered a standard way of finding the price of derivatives.</p>
<p>This allowed market participants to negotiate contracts more efficiently and accurately.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The Black-Scholes equation helped create more precision in the way certain things are priced.</p>
<p>This precision helped create more stable institutions, which in turn helped create a more resilient economy.</p>
<p>If you're interested in learning more, see this video:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/A5w-dEgIU1M" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>If you are interested in learning more about finance:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.freecodecamp.org/news/fundamentals-of-finance-economics-for-businesses/">https://www.freecodecamp.org/news/fundamentals-of-finance-economics-for-businesses/</a></div>
<h2 id="heading-full-code">Full code</h2>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How AI Models Think: The Key Role of Activation Functions with Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ In Artificial Intelligence, Machine Learning is the foundation of most revolutionary AI applications. From language processing to image recognition, Machine Learning is everywhere. Machine Learning relies on algorithms, statistical models, and neural... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/activation-functions-in-neural-networks/</link>
                <guid isPermaLink="false">66ba5319ba2ef92905bfa81b</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Wed, 10 Apr 2024 15:44:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/abigail-keenan-8-s5QuUBtyM-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In Artificial Intelligence, Machine Learning is the foundation of most revolutionary AI applications. From language processing to image recognition, Machine Learning is everywhere.</p>
<p>Machine Learning relies on algorithms, statistical models, and neural networks. And Deep Learning is the subfield of Machine Learning focused only on neural networks.</p>
<p>A key piece of any neural network are activation functions. But understanding exactly why they are essential to any neural network system is a common question, and it can be a difficult one to answer.</p>
<p>This tutorial focuses on explaining, in a simple manner with analogies, why exactly activation functions are necessary.</p>
<p>By understanding this, you will understand the process of how AI models think.</p>
<p>Before that, we will explore neural networks in AI. We will also explore the most commonly used activation functions.</p>
<p>We're also going to analyze every line of a very simple PyTorch code example of a neural network.</p>
<h3 id="heading-in-this-article-we-will-explore">In this article, we will explore:</h3>
<ul>
<li><a class="post-section-overview" href="#artificial">Artificial Intelligence and the Rise of Deep Learning</a></li>
<li><a class="post-section-overview" href="#heading-understanding-activation-functions-simplifying-neural-network-mechanics">Understanding Activation Functions: Simplifying Neural Network Mechanics</a></li>
<li><a class="post-section-overview" href="#heading-simple-analogy-why-activation-functions-are-necessary">Simple Analogy: The Necessity of Activation Functions</a></li>
<li><a class="post-section-overview" href="#heading-what-happens-without-activation-functions">What Happens Without Activation Functions?</a></li>
<li><a class="post-section-overview" href="#heading-pytorch-activation-function-code-example">PyTorch Activation Function Code Example</a> </li>
<li><a class="post-section-overview" href="#heading-conclusion-the-unsung-heroes-of-ai-neural-networks">Conclusion: The Unsung Heroes of AI Neural Networks</a></li>
</ul>
<p>This article won't cover dropout or other regularization techniques, hyperparameter optimization, complex architectures like CNNs, or detailed differences in gradient descent variants.</p>
<p>I just want to showcase <strong>why activation functions are needed</strong> and what happens when they are not applied to neural networks.</p>
<p>If you don't know much about deep learning, I personally recommend this Deep Learning crash course on freeCodeCamp's YouTube channel:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/VyWAvY2CF9c" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h2 id="Artificial">Artificial Intelligence and the Rise of Deep Learning</h2>

<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.</p>
<p>The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.</p>
<p>To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.</p>
<h3 id="heading-deep-learning-training-neural-networks">Deep Learning = Training Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/4-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Simple neural network</em></p>
<p>At the core of deep learning is the training of neural networks.</p>
<p>That means basically using data to get the right values of the weights to be able to predict what we want.</p>
<p>Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.</p>
<p>This layered structure allows deep learning models to analyze and interpret complex data.</p>
<h2 id="activation_functions_explanation">Understanding Activation Functions: Simplifying Neural Network Mechanics</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/04/aaaaaaaaaaaaaaaaaaa.png" alt="Image" width="600" height="400" loading="lazy">
<em>Leaky reLU activation function</em></p>
<p>Activation functions help neural networks handle complex data. They change the neuron value based on the data they receive.</p>
<p>It is almost like a filter every neuron has before sending its value to the next neuron.</p>
<p>Essentially, activation functions control the information flow of neural networks – they decide which data is relevant and which is not.</p>
<p>This helps prevent vanishing gradients, ensuring the network learns properly.</p>
<p>The vanishing gradients problem happens when the neural network's learning signals are too weak to make the weight values change. This makes learning from data very difficult.</p>
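<p>You can see this numerically. A sigmoid's derivative is at most 0.25, so a learning signal that passes back through many sigmoid layers shrinks geometrically. A minimal sketch:</p>

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case: the gradient is multiplied by 0.25 at each of 10 sigmoid layers
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)

print(grad)  # 0.25**10, roughly 9.5e-07 -- the signal has all but vanished
```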
<h2 id="simple">Simple Analogy: Why Activation Functions are Necessary</h2>

<p>In a soccer game, players decide whether to pass, dribble, or shoot the ball.</p>
<p>These decisions are based on the current game situation, just as neurons in a neural network process data.</p>
<p>In this case, activation functions act like this in the decision-making process.</p>
<p>Without them, neurons would pass data <strong>without any selective analysis</strong> – like players <strong>mindlessly kicking the ball</strong> regardless of the game context.</p>
<p>In this way, activation functions introduce complexity into a neural network, allowing it to learn complex patterns.</p>
<h2 id="what">What Happens Without Activation Functions?</h2>

<p>To understand what would happen without activation functions, let's first think about what happens if players mindlessly kick the ball in a soccer match.</p>
<p>They'd likely lose the match because there would be no decision-making processes as a team. That ball still goes somewhere – but most of the time it will not go where it's intended.</p>
<p>This is similar to what happens in a neural network without activation functions: the neural network doesn't make good predictions because the neurons were just passing data to each other randomly.</p>
<p>We still get a prediction. Just not what we wanted, or what's helpful.</p>
<p>This dramatically limits the capability – of both the soccer team and the neural network.</p>
<h3 id="heading-intuitive-explanation-of-activation-functions">Intuitive Explanation of Activation Functions</h3>
<p>Let's now look at an example so you can understand this intuitively.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/7-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>reLU activation function</em></p>
<p>Let's start with the most widely used activation function in deep learning (it's also one of the simpler ones).</p>
<p>This is a ReLU activation function. It basically acts as a filter before a neuron sends its value to the next neuron.</p>
<p>This filter is essentially two conditions:</p>
<ul>
<li>If the input value is negative, the output becomes 0</li>
<li>If the input value is positive, it passes through unchanged</li>
</ul>
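<p>In code, this filter is a one-liner. A minimal sketch:</p>

```python
def relu(x):
    # Zero out negative inputs; pass positive inputs through unchanged
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```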
<p>With this, we are adding a decision-making process to each neuron. It decides which data to send and which not to send.</p>
<p>Now let's look at some examples of other activation functions.</p>
<h3 id="heading-sigmoid-activation-functions">Sigmoid Activation Functions</h3>
<p>This activation function squashes the input value into the range between 0 and 1. Sigmoids are widely used on the last neuron in binary classification problems.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/9-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Sigmoid activation function</em></p>
<p>There is a problem with sigmoid activation functions, though. Take the output values from a given linear transformation:</p>
<ul>
<li>0.00000003</li>
<li>0.99999992</li>
<li>0.00000247</li>
<li>0.99993320</li>
</ul>
<p>There are some questions about these values we can ask:</p>
<ul>
<li>Are values like 0.00000003 and 0.00000247 really important? Can't they just be 0 so that we have fewer things to compute? Remember, many of today's models contain millions of weights. Can't millions of values like 0.00000003 and 0.00000247 simply be 0?</li>
<li>And for large positive values, how do we distinguish a <strong>big value</strong> from a <strong>very big value</strong>? Outputs like 0.99993320 and 0.99999992 could come from inputs as different as <em>7 and 13</em> or <em>7 and 55</em>; they do not <strong>accurately</strong> describe their input values.</li>
</ul>
<p>How can we distinguish the subtle differences in outputs so that accuracy is maintained?</p>
<p>This is what the ReLU activation function solved: setting negative values to zero while keeping positive ones unchanged boosts computational efficiency and preserves the differences between large inputs.</p>
<h3 id="heading-tanh-hyperbolic-tangent-activation-functions">Tanh (Hyperbolic Tangent) Activation Functions</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/10-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>tanh activation function</em></p>
<p>These activation functions output values between -1 and 1, similar to Sigmoid.</p>
<p>They're often used in <a target="_blank" href="https://www.freecodecamp.org/news/the-ultimate-guide-to-recurrent-neural-networks-in-python/">recurrent neural networks (RNNs) and long short-term memory networks (LSTMs).</a></p>
<p>Tanh is also used because it is zero-centered. This means that the mean of the output values is around zero. This property helps when dealing with the vanishing gradient problem.</p>
<h3 id="heading-leaky-relu">Leaky reLU</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/11-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Leaky reLU activation function</em></p>
<p>Instead of <strong>ignoring</strong> negative values, the Leaky ReLU activation function outputs a small negative value for them.</p>
<p>This way, negative values are also used when training neural networks.</p>
<p>With the ReLU activation function, neurons with negative values are inactive and do not contribute to the learning process.</p>
<p>With the Leaky ReLU activation function, neurons with negative values are active and contribute to the learning process.</p>
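<p>Leaky ReLU can be sketched the same way. The slope used for negative inputs (0.01 here, which is also PyTorch's default for <code>nn.LeakyReLU</code>) is a hyperparameter:</p>

```python
def leaky_relu(x, negative_slope=0.01):
    # Negative inputs are scaled down instead of zeroed, so they still
    # contribute a small gradient during training
    return x if x > 0 else negative_slope * x

print([leaky_relu(v) for v in [-2.0, -0.5, 1.5]])  # [-0.02, -0.005, 1.5]
```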
<p>This decision-making process is implemented by activation functions. Without them, each neuron would simply pass its data to the next neuron (just like a player mindlessly kicking the ball).</p>
<h3 id="heading-mathematical-explanation-of-activation-functions">Mathematical Explanation of Activation Functions</h3>
<p>Neurons do two things:</p>
<ul>
<li>They apply a linear transformation to the values received from the previous layer, using their weights</li>
<li>They apply an activation function to selectively filter which values are passed on</li>
</ul>
<p>Without activation functions, the neural network just does one thing: <strong>Linear transformations.</strong></p>
<p>If it <strong>only</strong> does linear transformations, it is a <strong>linear system</strong>.</p>
<p>If it is a linear system, in very simple terms without being too technical, the <a target="_blank" href="https://www.allaboutcircuits.com/textbook/direct-current/chpt-10/superposition-theorem/">superposition theorem</a> tells us that any mixture of two or more linear transformations can be simplified into one single transformation.</p>
<p>Essentially, it means that, without activation functions, this complex neural network:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/12-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Long neural network without activation functions</em></p>
<p>Is the same as this simple one:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/13-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Short neural network without activation functions</em></p>
<p>This is because each layer in its matrix form is a product of linear transformations of previous layers.</p>
<p>And according to the theorem, since any mixture of two or more linear transformations can be simplified into a single transformation, any stack of hidden layers (the layers between the inputs and outputs) in a neural network can likewise be simplified into a single layer.</p>
<p><strong>What does this all mean?</strong></p>
<p>It means that it can only model data linearly. But in real life with real data, every system is non-linear. So we need activation functions.</p>
<p>We introduce non-linearity into a neural network so that it learns non-linear patterns.</p>
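<p>You can check the collapse directly: stacking two linear layers with no activation in between is exactly one linear layer, because matrix multiplication is associative. A minimal NumPy sketch (the layer sizes mirror the network in the next section):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)          # a 10-feature input
W1 = rng.standard_normal((18, 10))   # "hidden layer" weights
W2 = rng.standard_normal((1, 18))    # "output layer" weights

# Two stacked linear layers, no activation function in between...
deep = W2 @ (W1 @ x)
# ...collapse into a single linear layer with weights W2 @ W1
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))  # prints True
```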
<h2 id="pytorch">PyTorch Activation Function Code Example </h2>

<p>In this section, we are going to train the neural network below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/nn-1.svg" alt="Image" width="600" height="400" loading="lazy">
<em>Simple feed forward neural network</em></p>
<p>This is a simple neural network AI model with four fully connected layers:</p>
<ul>
<li>An input layer with 10 neurons</li>
<li>Two hidden layers with 18 neurons each</li>
<li>One hidden layer with 4 neurons</li>
<li>One output layer with 1 neuron</li>
</ul>
<p>In the code, we can choose any of the four activation functions mentioned in this tutorial. </p>
<p>Here is the full code – we'll discuss it in detail below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim

<span class="hljs-comment">#Choose which activation function to use in code</span>
defined_activation_function = <span class="hljs-string">'relu'</span>

activation_functions = {
    <span class="hljs-string">'relu'</span>: nn.ReLU(),
    <span class="hljs-string">'sigmoid'</span>: nn.Sigmoid(),
    <span class="hljs-string">'tanh'</span>: nn.Tanh(),
    <span class="hljs-string">'leaky_relu'</span>: nn.LeakyReLU()
}

<span class="hljs-comment"># Initializing hyperparameters</span>
num_samples = <span class="hljs-number">100</span>
batch_size = <span class="hljs-number">10</span>
num_epochs = <span class="hljs-number">150</span>
learning_rate = <span class="hljs-number">0.001</span>

<span class="hljs-comment"># Define a simple synthetic dataset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_data</span>(<span class="hljs-params">num_samples</span>):</span>
    X = torch.randn(num_samples, <span class="hljs-number">10</span>)
    y = torch.randn(num_samples, <span class="hljs-number">1</span>)
    <span class="hljs-keyword">return</span> X, y

<span class="hljs-comment"># Generate synthetic data</span>
X, y = generate_data(num_samples)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleModel</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, activation=defined_activation_function</span>):</span>
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(in_features=<span class="hljs-number">10</span>, out_features=<span class="hljs-number">18</span>)
        self.fc2 = nn.Linear(in_features=<span class="hljs-number">18</span>, out_features=<span class="hljs-number">18</span>)
        self.fc3 = nn.Linear(in_features=<span class="hljs-number">18</span>, out_features=<span class="hljs-number">4</span>)
        self.fc4 = nn.Linear(in_features=<span class="hljs-number">4</span>, out_features=<span class="hljs-number">1</span>)
        self.activation = activation_functions[activation]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x) 
        x = self.activation(x)
        x = self.fc3(x) 
        x = self.activation(x)  
        x = self.fc4(x) 
        <span class="hljs-keyword">return</span> x

<span class="hljs-comment"># Initialize the model, define loss function and optimizer</span>
model = SimpleModel(activation=defined_activation_function)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

<span class="hljs-comment"># Training loop</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, num_samples, batch_size):
        <span class="hljs-comment"># Get the mini-batch</span>
        inputs = X[i:i+batch_size]
        labels = y[i:i+batch_size]

        <span class="hljs-comment"># Zero the parameter gradients</span>
        optimizer.zero_grad()

        <span class="hljs-comment"># Forward pass</span>
        outputs = model(inputs)

        <span class="hljs-comment"># Compute the loss</span>
        loss = criterion(outputs, labels)

        <span class="hljs-comment"># Backward pass and optimize</span>
        loss.backward()
        optimizer.step()

    print(<span class="hljs-string">f'Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{num_epochs}</span>, Loss: <span class="hljs-subst">{loss}</span>'</span>)

print(<span class="hljs-string">"Training complete."</span>)
</code></pre>
<p>Looks like a lot, doesn't it? Don't worry – we'll take it piece by piece.</p>
<h3 id="heading-1-importing-libraries-and-defining-activation-functions">1: Importing libraries and defining activation functions</h3>
<pre><code><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim

<span class="hljs-comment">#Choose which activation function to use in code</span>
defined_activation_function = <span class="hljs-string">'relu'</span>

activation_functions = {
    <span class="hljs-string">'relu'</span>: nn.ReLU(),
    <span class="hljs-string">'sigmoid'</span>: nn.Sigmoid(),
    <span class="hljs-string">'tanh'</span>: nn.Tanh(),
    <span class="hljs-string">'leaky_relu'</span>: nn.LeakyReLU()
}
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries and defining dictionary with activation functions</em></p>
<p>In this code:</p>
<ul>
<li><code>import torch</code>: <a target="_blank" href="https://pytorch.org/docs/stable/torch.html">Imports the PyTorch library.</a></li>
<li><code>import torch.nn as nn</code>: <a target="_blank" href="https://pytorch.org/docs/stable/nn.html">Imports the neural network module from PyTorch.</a></li>
<li><code>import torch.optim as optim</code>: <a target="_blank" href="https://pytorch.org/docs/stable/optim.html">Imports the optimization module from PyTorch.</a></li>
</ul>
<p>The variable and the dictionary above help you easily define, for this deep learning model, the activation function you want to use.</p>
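<p>The four activations in the dictionary can also be written out directly. Here is a NumPy sketch of their formulas, mirroring what <code>nn.ReLU</code>, <code>nn.Sigmoid</code>, <code>nn.Tanh</code>, and <code>nn.LeakyReLU</code> (with its default negative slope of 0.01) compute:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # negatives clipped to 0
print(leaky_relu(x))  # negatives scaled by 0.01 instead of clipped
print(sigmoid(x))     # squashed into (0, 1)
print(tanh(x))        # squashed into (-1, 1)
```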
<h3 id="heading-2-defining-hyperparameters-and-generating-a-dataset">2: Defining hyperparameters and generating a dataset</h3>
<pre><code># Initializing hyperparameters
num_samples = <span class="hljs-number">100</span>
batch_size = <span class="hljs-number">10</span>
num_epochs = <span class="hljs-number">150</span>
learning_rate = <span class="hljs-number">0.001</span>

# Define a simple synthetic dataset
def generate_data(num_samples):
    X = torch.randn(num_samples, <span class="hljs-number">10</span>)
    y = torch.randn(num_samples, <span class="hljs-number">1</span>)
    <span class="hljs-keyword">return</span> X, y

# Generate synthetic data
X, y = generate_data(num_samples)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Initializing hyperparameters and creating, with a function, a synthetic dataset</em></p>
<p>In this code:</p>
<ul>
<li><code>num_samples</code> is the number of samples in the synthetic dataset.</li>
<li><code>batch_size</code> is the size of each mini-batch during training.</li>
<li><code>num_epochs</code> is the number of iterations over the entire dataset during training.</li>
<li><code>learning_rate</code> is the learning rate used by the optimization algorithm.</li>
</ul>
<p>We also define a <code>generate_data</code> function that creates two tensors of random values, then call it to generate the inputs <code>X</code> and targets <code>y</code>.</p>
<h3 id="heading-3-creating-the-deep-learning-model">3: Creating the deep learning model</h3>
<pre><code><span class="hljs-keyword">class</span> SimpleModel(nn.Module):
    <span class="hljs-keyword">def</span> __init__(self, activation=defined_activation_function):
        <span class="hljs-built_in">super</span>(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(in_features=<span class="hljs-number">10</span>, out_features=<span class="hljs-number">18</span>)
        self.fc2 = nn.Linear(in_features=<span class="hljs-number">18</span>, out_features=<span class="hljs-number">18</span>)
        self.fc3 = nn.Linear(in_features=<span class="hljs-number">18</span>, out_features=<span class="hljs-number">4</span>)
        self.fc4 = nn.Linear(in_features=<span class="hljs-number">4</span>, out_features=<span class="hljs-number">1</span>)
        self.activation = activation_functions[activation]

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x) 
        x = self.activation(x)
        x = self.fc3(x) 
        x = self.activation(x)  
        x = self.fc4(x) 
        <span class="hljs-keyword">return</span> x
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/3.png" alt="Image" width="600" height="400" loading="lazy">
<em>A simple feed forward neural network deep learning AI model</em></p>
<p>The <code>__init__</code> method in the <code>SimpleModel</code> class <strong>initializes</strong> the neural network architecture. It initializes four fully connected layers and defines the activation function we are going to use.</p>
<p><a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">We create each layer using</a> <code>nn.Linear</code>, while the <code>forward</code> method defines how the data flows through the neural network.</p>
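<p>As a quick sanity check on the architecture, we can count the learnable parameters by hand – each <code>nn.Linear(in_features, out_features)</code> layer holds <code>in_features * out_features</code> weights plus <code>out_features</code> biases:</p>

```python
# Parameters per nn.Linear(in_features, out_features): in*out weights + out biases
layers = [(10, 18), (18, 18), (18, 4), (4, 1)]

total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 621
```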
<h3 id="heading-4-initializing-the-model-and-defining-the-loss-function-and-optimizer">4: Initializing the model and defining the loss function and optimizer</h3>
<pre><code>model = SimpleModel(activation=defined_activation_function)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining activation function, loss function and gradient descend variation to be used</em></p>
<p>In this code:</p>
<ol>
<li><code>model = SimpleModel(activation=defined_activation_function)</code> creates a neural network model with a specified activation function.</li>
<li><code>criterion = nn.MSELoss()</code> defines the <a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html">Mean Squared Error (MSE) Loss function</a>.</li>
<li><code>optimizer = optim.Adam(model.parameters(), lr=learning_rate)</code> sets up the <a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.optim.Adam.html">Adam optimizer</a> for updating the model parameters during training, with a specified learning rate.</li>
</ol>
<h3 id="heading-5-training-the-deep-learning-model">5: Training the deep learning model</h3>
<pre><code><span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, num_samples, batch_size):
        # Get the mini-batch
        inputs = X[i:i+batch_size]
        labels = y[i:i+batch_size]

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

    print(f<span class="hljs-string">'Epoch {epoch+1}/{num_epochs}, Loss: {loss}'</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Training the model</em></p>
<ul>
<li>The outer loop, based on <code>num_epochs</code> (the number of epochs), controls how many times the entire dataset is processed.</li>
<li>The inner loop divides the dataset into mini-batches using the <code>range</code> function.</li>
</ul>
<p>In each mini-batch iteration:</p>
<ol>
<li>With <code>inputs</code> and <code>labels</code>, we get the data from the mini-batch we want to process.</li>
<li>We <a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html">eliminate with <code>optimizer.zero_grad()</code>, the gradients</a> – variables that tell us how to adjust weights for accurate predictions – of the previous mini-batch iteration. This is important to prevent mixing gradient information between mini-batches.</li>
<li>The forward pass gets us the model predictions (<code>outputs</code>), and the loss is calculated using the specified loss function (<code>criterion</code>). </li>
<li>With <code>loss.backward()</code>, we <a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html">calculate the gradients for the weights</a>. </li>
<li>Finally, <code>optimizer.step()</code> <a target="_blank" href="https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html">updates the model's weights</a> based on those gradients to minimize the loss function.</li>
</ol>
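<p>To make the roles of <code>zero_grad</code>, <code>backward</code>, and <code>step</code> concrete, here is one gradient-descent step computed by hand for a single weight – a plain-Python sketch of the idea, not PyTorch's actual machinery:</p>

```python
# One weight w, one training sample: prediction = w * x, loss = (w*x - y)**2
w, x, y, lr = 2.0, 3.0, 9.0, 0.01

grad = 0.0                      # optimizer.zero_grad(): clear the old gradient
pred = w * x                    # forward pass: outputs = model(inputs)
loss = (pred - y) ** 2          # criterion: MSE on a single sample
grad = 2 * (pred - y) * x       # loss.backward(): d(loss)/dw by the chain rule
w = w - lr * grad               # optimizer.step(): move against the gradient

print(loss, w)                  # loss is 9.0; w is nudged from 2.0 toward 3.0
```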
<p>This is the full code to train a very simple deep learning model on a very simple dataset.</p>
<p>It does not have anything more advanced like convolutional neural networks.</p>
<h2 id="conclusion">Conclusion: The Unsung Heroes of AI Neural Networks</h2>

<p>Activation functions are like gatekeepers. By restricting the flow of information, the neural network can learn better.</p>
<p>Activation functions are just like people when they study, or soccer players when deciding what to do with a ball.</p>
<p>These functions give neural networks the ability to learn and predict correctly.</p>
<p>Mathematically, activation functions are what allow neural networks to approximate any linear or non-linear function. Without them, neural networks can approximate only linear functions.</p>
<p>And I leave you with this:</p>
<p>The mathematical idea of a neural network being able to approximate any non-linear function is called the <a target="_blank" href="https://towardsai.net/p/deep-learning/understanding-the-universal-approximation-theorem">Universal Approximation Theorem</a>.</p>
<p>You can find the full code on GitHub here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Apply Math with Python – Numerical Analysis Explained ]]>
                </title>
                <description>
                    <![CDATA[ Numerical analysis is the bridge between math and computer science.  Essentially, it is the development of algorithms that approximate solutions that pure math would also solve, but using less computational resources and faster. This field is very im... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/numerical-analysis-explained-how-to-apply-math-with-python/</link>
                <guid isPermaLink="false">66ba533e80dbd3f269f5887b</guid>
                
                    <category>
                        <![CDATA[ Math ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Mathematics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Thu, 29 Feb 2024 11:41:59 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/maxim-hopman-fiXLQXAhCfk-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Numerical analysis is the bridge between math and computer science. </p>
<p>Essentially, it is the development of algorithms that approximate solutions that pure math would also solve, but using less computational resources and faster.</p>
<p>This field is very important, because for most real-world problems we only need good approximations, not exact solutions.</p>
<p>In this article, we will explore:</p>
<ul>
<li><a class="post-section-overview" href="#analogy">An Analogy that Illustrates the Importance of Numerical Analysis</a></li>
<li><a class="post-section-overview" href="#fundamentals">Fundamentals of Numerical Analysis</a></li>
<li><a class="post-section-overview" href="#application">Application of Numerical Analysis in Real-World Problems</a></li>
<li><a class="post-section-overview" href="#intro-PDE">An Introduction to Partial Differential Equations (PDEs)</a></li>
<li><a class="post-section-overview" href="#intro-optimization">An Introduction to Optimization in Numerical Analysis</a></li>
</ul>
<h2 id="analogy">An Analogy that Illustrates the Importance of Numerical Analysis</h2>

<p>How can we measure the coastline of an island?</p>
<p>If we tried to measure every centimeter of every small segment, it would be impractical and extremely time-consuming.</p>
<p>Because of the sea, the coastline is always changing at that level of detail.</p>
<p>However, by approximating and measuring in larger segments, we can get a practical measurement of the coastline.</p>
<p>This situation mirrors numerical analysis.</p>
<p>Approximation gives insights in situations where precise measurement is impossible or impractical.</p>
<p>Just as we accept a good estimation of the coastline length, numerical analysis uses approximation to solve hard problems.</p>
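<p>The coastline idea can be made concrete with code. As a stand-in for a coastline, a NumPy sketch that approximates the length of the curve y&nbsp;=&nbsp;sin(x) on [0,&nbsp;2π] by summing straight segments – a modest number of segments already comes close to the value a very fine subdivision gives:</p>

```python
import numpy as np

def segment_length(n):
    """Approximate the length of y = sin(x) on [0, 2*pi] with n straight segments."""
    x = np.linspace(0.0, 2 * np.pi, n + 1)
    y = np.sin(x)
    return np.sum(np.hypot(np.diff(x), np.diff(y)))

coarse = segment_length(8)       # just 8 segments
fine = segment_length(10_000)    # effectively the "true" length (about 7.64)
print(coarse, fine)              # the coarse estimate is already within ~1%
```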
<h2 id="fundamentals">Fundamentals of Numerical Analysis</h2>

<p>Numerical analysis is all about approximation. It is like using binoculars to see a landscape that is very far away. We can't see every leaf. But we get a good enough picture to understand the terrain.</p>
<p>This is crucial in numerical analysis.</p>
<p>In numerical analysis, we solve hard math problems where exact solutions are either impossible or extremely resource-intensive to obtain.</p>
<p>By approximating, we get sufficiently good results with less computational effort.</p>
<h2 id="application">Application of Numerical Analysis in Real-World Problems</h2>

<p>There are many applications of numerical analysis:</p>
<ul>
<li>In engineering, it enables simulation of structures and fluids.</li>
<li>In finance, it supports risk assessment and portfolio optimization.</li>
<li>In environmental science, it predicts climate patterns.</li>
</ul>
<p>In each field, numerical analysis is a toolkit for solving problems where pure math either takes too much time or cannot give usable results.</p>
<h2 id="intro-PDE">An Introduction to Partial Differential Equations (PDEs)</h2>

<p>Partial Differential Equations (PDEs) are equations that describe how quantities like heat, sound, or electricity change in different places and as time goes on.</p>
<p>Solving PDEs is very important. Because it allows us to control these changes.</p>
<p>By allowing us to control them, we can:</p>
<ul>
<li>Predict weather patterns.</li>
<li>Understand sound propagation in different environments.</li>
<li>Design efficient transportation systems.</li>
<li>Optimize energy distribution.</li>
</ul>
<p>However, most PDEs can only be approximated with numerical methods: their exact solutions are either too hard or impossible to find analytically.</p>
<p>Numerical methods let us solve these PDEs, which in turn allows us to solve many real-life problems.</p>
<h3 id="heading-numerical-solutions-of-pdes-with-scipy">Numerical Solutions of PDEs with SciPy</h3>
<p>Solving PDEs with numerical methods often involves dividing the PDE into small, manageable parts, solving each one, and then combining the results.</p>
<p>SciPy, a Python library for scientific and technical computing, gives many tools for this purpose.</p>
<p>Now, let's solve a heat transfer problem in a rod.</p>
<p>In the code below, we will see how it lets us model how heat spreads along the rod:</p>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> scipy.integrate <span class="hljs-keyword">import</span> solve_bvp

def heat_equation(x, y):
    <span class="hljs-keyword">return</span> np.vstack((y[<span class="hljs-number">1</span>], -y[<span class="hljs-number">0</span>]))

def boundary_conditions(ya, yb):
    <span class="hljs-keyword">return</span> np.array([ya[<span class="hljs-number">0</span>], yb[<span class="hljs-number">0</span>] - <span class="hljs-number">1</span>])

x = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">5</span>)
y = np.zeros((<span class="hljs-number">2</span>, x.size))

sol = solve_bvp(heat_equation, boundary_conditions, x, y)
</code></pre><p>Let's see how the code works, block by block, in the following sections.</p>
<h3 id="heading-how-to-importing-libraries">How to import the libraries</h3>
<pre><code><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> scipy.integrate <span class="hljs-keyword">import</span> solve_bvp
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/5-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<p>Here we import two Python libraries:</p>
<ul>
<li><a target="_blank" href="https://numpy.org/">NumPy</a></li>
<li><a target="_blank" href="https://scipy.org/">SciPy</a></li>
</ul>
<p>These are two of the most used Python libraries in data science.</p>
<h3 id="heading-how-to-define-the-head-equation-and-boundary-conditions">How to define the heat equation and boundary conditions</h3>
<pre><code>def heat_equation(x, y):
    <span class="hljs-keyword">return</span> np.vstack((y[<span class="hljs-number">1</span>], -y[<span class="hljs-number">0</span>]))

def boundary_conditions(ya, yb):
    <span class="hljs-keyword">return</span> np.array([ya[<span class="hljs-number">0</span>], yb[<span class="hljs-number">0</span>] - <span class="hljs-number">1</span>])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining heat equation and boundary conditions</em></p>
<p>We create <code>heat_equation(x, y)</code> and <code>boundary_conditions(ya, yb)</code>.</p>
<p>In <code>heat_equation(x, y)</code> we are defining the differential equation we want to solve.</p>
<p>The <code>boundary_conditions(ya, yb)</code> function defines constraints at the start and end of the solution. Here, the conditions are that the solution equals 0 at the start (<code>ya[0] = 0</code>) and 1 at the end (<code>yb[0] = 1</code>).</p>
<h3 id="heading-how-to-solve-the-equation">How to solve the equation</h3>
<pre><code>x = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">5</span>)
y = np.zeros((<span class="hljs-number">2</span>, x.size))

sol = solve_bvp(heat_equation, boundary_conditions, x, y)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/7.png" alt="Image" width="600" height="400" loading="lazy">
<em>Solving equation</em></p>
<p>The line <code>sol = solve_bvp(heat_equation, boundary_conditions, x, y)</code> is the solution.</p>
<p><a target="_blank" href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_bvp.html"><code>solve_bvp</code> stands for "solve boundary value problem"</a>.</p>
<p>It takes four arguments:</p>
<ul>
<li><code>heat_equation</code>: The differential equation we are trying to solve.</li>
<li><code>boundary_conditions</code>: The mathematical constraints at the start and end of the solution.</li>
<li><code>x</code>: The initial mesh – the points at which we want to evaluate the solution.</li>
<li><code>y</code>: The initial guess for the solution at the chosen <code>x</code> values.</li>
</ul>
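<p>This particular boundary value problem (y''&nbsp;=&nbsp;-y with y(0)&nbsp;=&nbsp;0 and y(1)&nbsp;=&nbsp;1) happens to have the closed-form solution y(x)&nbsp;=&nbsp;sin(x)/sin(1), so we can check a numerical method against the exact answer. A minimal finite-difference sketch, independent of SciPy:</p>

```python
import numpy as np

# Finite differences: y'' = -y becomes
# y[i-1] + (h**2 - 2) * y[i] + y[i+1] = 0 at each interior grid point.
n = 50                                # number of interior points
h = 1.0 / (n + 1)
x = np.linspace(0.0, 1.0, n + 2)

main = np.full(n, h**2 - 2.0)         # main diagonal
off = np.ones(n - 1)                  # sub- and super-diagonals
A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

b = np.zeros(n)
b[-1] = -1.0                          # boundary value y(1) = 1 moved to the RHS

y = np.zeros(n + 2)                   # boundary y(0) = 0 already in place
y[-1] = 1.0
y[1:-1] = np.linalg.solve(A, b)

exact = np.sin(x) / np.sin(1.0)       # closed-form solution of the BVP
print(np.max(np.abs(y - exact)))      # small discretization error
```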
<h2 id="intro-optimization">An Introduction to Optimization in Numerical Analysis</h2>

<p>Optimization is finding the best solution among all possible solutions. It is like finding the most efficient route in a complex network of roads.</p>
<p>Numerical optimization methods find the most efficient or cost-effective solution to a problem, whether that is:</p>
<ul>
<li>Minimizing waste in production.</li>
<li>Maximizing efficiency in a logistics network.</li>
<li>Finding the best fit for a certain data model.</li>
</ul>
<h3 id="heading-an-overview-of-numerical-optimization-techniques-with-scipy">An Overview of Numerical Optimization Techniques with SciPy</h3>
<p>For this example, let's consider an optimization problem in logistics, where the goal is to minimize transportation cost across a network. </p>
<p>SciPy's <code>minimize</code> function can be used to find the strategy that minimizes cost while meeting all constraints:</p>
<pre><code><span class="hljs-keyword">from</span> scipy.optimize <span class="hljs-keyword">import</span> minimize

def objective_function(x):
    <span class="hljs-keyword">return</span> x[<span class="hljs-number">0</span>]**<span class="hljs-number">2</span> + x[<span class="hljs-number">1</span>]**<span class="hljs-number">2</span>

def constraint_eq(x):
    <span class="hljs-keyword">return</span> x[<span class="hljs-number">0</span>] + x[<span class="hljs-number">1</span>] - <span class="hljs-number">10</span>

con_eq = {<span class="hljs-string">'type'</span>: <span class="hljs-string">'eq'</span>, <span class="hljs-string">'fun'</span>: constraint_eq}

bounds = [(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>), (<span class="hljs-number">0</span>, <span class="hljs-number">10</span>)]

x0 = [<span class="hljs-number">5</span>, <span class="hljs-number">5</span>]

result = minimize(objective_function, x0, method=<span class="hljs-string">'SLSQP'</span>, bounds=bounds, constraints=[con_eq])
</code></pre><p>Let's explain how the code works, block by block.</p>
<h3 id="heading-how-to-importing-the-library">How to import the library</h3>
<pre><code><span class="hljs-keyword">from</span> scipy.optimize <span class="hljs-keyword">import</span> minimize
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/8.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing scipy</em></p>
<p>Once again we import the necessary library:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html">https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html</a></div>
<h3 id="heading-how-to-defining-objective-and-constraint-equation">How to define the objective and constraint equations</h3>
<pre><code>def objective_function(x):
    <span class="hljs-keyword">return</span> x[<span class="hljs-number">0</span>]**<span class="hljs-number">2</span> + x[<span class="hljs-number">1</span>]**<span class="hljs-number">2</span>

def constraint_eq(x):
    <span class="hljs-keyword">return</span> x[<span class="hljs-number">0</span>] + x[<span class="hljs-number">1</span>] - <span class="hljs-number">10</span>

con_eq = {<span class="hljs-string">'type'</span>: <span class="hljs-string">'eq'</span>, <span class="hljs-string">'fun'</span>: constraint_eq}
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/9.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the objective and constraint equations</em></p>
<ul>
<li>The objective function is the function we want to minimize to find the best answer.</li>
<li>The constraint equation is the equation that limits the search space to those <code>x</code> values that fulfill this equation.</li>
</ul>
<p><code>con_eq</code> is defined by the following:</p>
<ul>
<li><code>'type': 'eq'</code> indicates the type of constraint.  <code>'eq'</code> means equality, in other words, the function must equal zero at the solution.</li>
<li><code>'fun': constraint_eq</code> assigns the constraint function.</li>
</ul>
<p>As we will see in the next block of code, this is where we constrain the possible solutions of the problem.</p>
<h3 id="heading-how-to-define-an-initial-condition-and-result">How to define an initial condition and result</h3>
<pre><code>bounds = [(<span class="hljs-number">0</span>, <span class="hljs-number">10</span>), (<span class="hljs-number">0</span>, <span class="hljs-number">10</span>)]

x0 = [<span class="hljs-number">5</span>, <span class="hljs-number">5</span>]

result = minimize(objective_function, x0, method=<span class="hljs-string">'SLSQP'</span>, bounds=bounds, constraints=[con_eq])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/02/10.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining initial condition and solving equation</em></p>
<p>To understand this block of code, let's understand each parameter of <code>result = minimize(objective_function, x0, method='SLSQP', bounds=bounds, constraints=[con_eq])</code>:</p>
<ul>
<li><code>objective_function</code>: Is the function to be minimized.</li>
<li><code>x0</code>: Is the initial guess for the variables.</li>
<li><code>method='SLSQP'</code>: This specifies the optimization algorithm we are using. In this case, we use <a target="_blank" href="https://docs.scipy.org/doc/scipy/reference/optimize.minimize-slsqp.html">SLSQP (Sequential Least SQuares Programming)</a>.</li>
<li><code>bounds=bounds</code>: This parameter specifies the bounds for each of the decision variables. </li>
<li><code>constraints=[con_eq]</code>: This parameter applies the constraints to the optimization problem.</li>
</ul>
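<p>For this particular problem we can check the answer by hand: on the constraint x0&nbsp;+&nbsp;x1&nbsp;=&nbsp;10, substituting x1&nbsp;=&nbsp;10&nbsp;-&nbsp;x0 turns the objective into x0² + (10&nbsp;-&nbsp;x0)², which is smallest at x0&nbsp;=&nbsp;5, giving the minimum value 50. A plain-Python sketch verifying that:</p>

```python
# On the constraint x0 + x1 = 10, substitute x1 = 10 - x0:
def objective_on_constraint(x0):
    return x0**2 + (10 - x0)**2

# Scan candidate x0 values within the bounds [0, 10]
candidates = [i / 100 for i in range(0, 1001)]
best_x0 = min(candidates, key=objective_on_constraint)

print(best_x0, objective_on_constraint(best_x0))  # → 5.0 50.0
```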
<h2 id="heading-this-is-how-many-real-life-problems-are-solved">This is how many real life problems are solved</h2>
<p>Many things in real life are modeled with partial differential equations.</p>
<p>Then, with optimization methods developed in numerical analysis, those models are optimized.</p>
<p>I am writing this because I know math can be boring for some people, and they may not be aware of where it is applied to solve real problems. The calculus they learn can be applied in non-ideal situations outside of exam exercises.</p>
<p>Here, we can finally see why math is important in two scenarios:</p>
<ul>
<li>To model systems to get solutions from it</li>
<li>To optimize a certain system</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Numerical analysis is one of the most important areas of applied math in STEM.</p>
<p>From solving PDEs to optimizing systems, numerical analysis is everywhere.</p>
<p>As problems grow more complex, numerical analysis grows in importance, providing faster algorithms that approximate pure-math solutions.</p>
<p>In this way, it is a bridge between theoretical mathematics and practical application.</p>
<p>If you want to, you can get the full code used in this article on <a target="_blank" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
