book - freeCodeCamp.org

The Math Behind Artificial Intelligence: A Guide to AI Foundations [Full Book]

Tiago Capelo Monteiro — Tue, 06 Jan 2026 23:14:23 +0000

"To understand is to perceive patterns." - Isaiah Berlin

This is not a math book filled with complex formulas, theorems, and concepts that are hard to grasp.

Instead, it’s a detailed guide where we’ll break complex ideas down into simpler terms.

Even if you only have a general understanding of algebra, you should be able to easily follow along.

Here’s what we’ll cover:

Chapter 1: Background on this Book
Chapter 2: The Architecture of Mathematics
Chapter 3: The Field of Artificial Intelligence
Chapter 4: Linear Algebra - The Geometry of Data
Chapter 5: Multivariable Calculus - Change in Many Directions
Chapter 6: Probability & Statistics - Learning from Uncertainty
Chapter 7: Optimization Theory - Teaching Machines to Improve
Conclusion: Where Mathematics and AI Meet
About the Author

Chapter 1: Background on this Book

The Objective Here

My objective in this book is simple: Explain the key mathematical ideas you need to grasp in order to deeply understand AI and train machine learning models.

So you might be wondering: Why is it important to have a good math foundation before creating these models?

Well, there are many reasons, but some are:

It gives you the capacity to understand new AI research on your own.
You can use this same foundation to study other STEM concepts like signal theory and advanced statistical methods.
It helps you understand that AI models are just a mixture of different math ideas working together and gives you insight into how new innovations make LLMs more efficient.
It gives you a foundation so you know how to calibrate AI models and even create derivative models.

These skills are also important for startup founders, especially in Silicon Valley. Many startups begin with APIs or API wrappers but eventually need their own AI solutions.

Outsourcing all AI isn't ideal. This book will help you understand AI foundations so you can design better growth strategies and communicate effectively with investors – especially those who were successful technical co-founders.

Why is This Book About AI Different?

In this book, we’ll look at AI from an engineering perspective. This differs from the typical computer science approach to AI that most introductory courses take.

In doing so, I won’t spend a lot of time explaining formulas and theorems. Instead, I’ll explain why they matter, and how and why they are applied the way they are.

In this way, I hope to offer a unique viewpoint that emphasizes the engineering principles and good practices that underlie all modern AI technologies.

I will also explain how many of these strange math ideas make billion dollar industries possible.

We’ll start with the fundamentals: the structure of the areas of mathematics and AI. After that, we’ll look at the four subareas of math that make AI possible:

Linear Algebra
Calculus
Probability Theory and Statistics
Optimization Theory

After going through all the math, we’ll connect it with the foundation of ChatGPT and all of these large language models.

This way, you’ll get a basic foundation in key math concepts that, when mixed together like the ingredients of a cake, make all AI models possible.

By knowing where the ideas come from, you’ll develop a system-level understanding of AI and a first-principles approach.

So just keep in mind that, even though concepts like integral calculus and eigenvalues/eigenvectors might not be widely used in AI, they’ll help you develop these system-level and first-principles approaches.

Also, this book will be a work in progress. After its first release, I’ll seek feedback on things I need to perfect, chapters to add, and so on.

Here is my email for any feedback you might have: monteiro.t@northeastern.edu

And here is the book’s GitHub repository with all code: https://github.com/tiagomonteiro0715/The-Math-Behind-Artificial-Intelligence-A-Guide-to-AI-Foundations

Let Me Introduce Myself

My name is Tiago Monteiro, an electrical and computer engineer and AI master's degree student at Northeastern University's Silicon Valley campus. I have authored 20+ articles with 240K+ views here on freeCodeCamp on math, AI, and tech.

If you’d like to know more about my background, I’ll share that at the end of the book.

Prerequisites

In terms of minimum requirements, you only need to know the basics of mathematics and programming:

You should know basic algebra, including what functions and the coordinate system are.
You should be able to read Python code and understand things like variables, functions, and loops.

Chapter 2: The Architecture of Mathematics

Math is more than numbers. It’s the science of locating complex patterns that shape our world. To truly understand math, we must look beyond numbers and formulas to grasp its structures.

This chapter aims to show math as a growing tree of ideas, a living system of logic, not just formulas to memorize. With analogies, history, and code examples, I want to help you understand math deeply and how to apply it to programming.

I’ve included code examples to connect theory and practice, showing how math ideas apply to real problems. Whether you're new to advanced math or are more experienced, these examples will help you apply math in programming.

This way, before we start going over the different math pillars that sustain AI, you will understand the structure of the field.

The Tree of Mathematics: How Everything Connects

Photo by Lerkrat Tangsri

Imagine math as a vast, ever-growing tree.

The roots are the foundations: logic and set theory. From these roots, the main fields emerge: arithmetic, algebra, geometry, and analysis.

As the tree branches out, new subfields like topology and abstract algebra appear. Sometimes branches connect with each other.

This tree keeps growing in many directions. History shows that sometimes it grows rapidly due to scientific discoveries, while at other times, growth is slow.

And you might wonder: How many more branches and connections between them will keep appearing?

A Quick History of Mathematics: From Counting to Infinity

The first mathematical ideas emerged independently in ancient civilizations, such as:

India's invention of zero
Islamic algebraic advances
Greek geometric rigor

Great mathematicians developed and shared these ideas through writing and lectures. Over time, new generations built on these ideas, creating new branches of mathematics. This endless growth is why Isaac Newton wrote to Robert Hooke in 1675:

“If I have seen further, it is by standing on the shoulders of giants.”

He meant that by working from previous knowledge, he was able to create and (re)discover new ideas.

Yet, the real power of math lies in practicing it over and over and studying it more and more deeply.

As one of my professors once pointed out:

“More important than knowing the theorems is knowing the ideas behind them and the history of how they were created.”

To solve problems, it's often necessary to think from first principles, and math teaches this. Math is not just an academic topic. It’s a global language for scientists and engineers.

By preserving and sharing it, new math can grow from old ideas, allowing the tree to keep expanding.

Foundations of Relativity: How Einstein Used Math to Understand Space and Time

Photo by Pixabay

Albert Einstein developed the general and special theories of relativity, which impact:

GPS and global communication
Satellite telecommunications
Space exploration and satellite launches

And more.

But this was only possible by combining geometry with calculus, known as differential geometry. This field evolved over centuries, thanks to many great mathematicians. Here are a few of them, though the list is not exhaustive:

Euclid (circa 300 BCE): Contributed to geometry, laying the groundwork for later mathematical systems
Archimedes (circa 287–212 BCE): Pioneered the understanding of volume, surface area, and the principles of mechanics
René Descartes (1596–1650): Developed Cartesian coordinates and analytical geometry
Isaac Newton (1642–1727) & Gottfried Wilhelm Leibniz (1646–1716): Newton’s laws of motion and gravitation, alongside Leibniz’s development of calculus, formed the basis of classical mechanics that Einstein sought to extend and modify in his theory of relativity.
Leonhard Euler (1707–1783): Contributed to the development of differential equations, which are essential in the mathematical foundations of physics.
Gaspard Monge (1746–1818): The father of differential geometry and pioneer in descriptive geometry
Carl Friedrich Gauss (1777–1855): Made groundbreaking advances in geometry, including the concept of curved surfaces.
Bernhard Riemann (1826–1866): Introduced Riemannian geometry, a branch of differential geometry.

Going back to Albert Einstein, he saw what no one else in his time saw, thanks to these great math giants and countless others.

Gödel’s Biggest Paradox: Can Math Explain Itself?

The biggest paradox in math, discovered by Kurt Gödel, is his incompleteness theorems. They show that in any consistent formal system capable of simple arithmetic, there are true statements that cannot be proven within the system.

This means there are limits to what can be proven as true or false. For mathematicians, this implies that some truths are beyond formal proofs, yet we assume they are true. It demonstrates that no matter how much effort or AI is used, some things remain unprovable, known only through approximations and non-exact methods.

What About Applied Math and Engineering?

Applied math and engineering involve adapting the pure math ideas in real-world scenarios.

Actually, in many cases, it’s the combination of many math ideas.

Let’s consider some examples:

In harmonic analysis, Laplace, Fourier, and Z-transforms are a way to see the same thing in a new domain to get new insights. In this case, integrals are used to make this mapping possible.
Principal component analysis (PCA) is a widely used tool in data science. Yet, it is a mixture of linear algebra (the eigenvalues and eigenvectors of the data's covariance matrix) with optimization (keeping the few directions that capture the most variance) in order to reduce the number of dimensions in a dataset.
In machine learning, logistic regression is a mixture of calculus with statistics and probability.
In deep learning, neural networks are just many matrices multiplying and updating themselves until they model a dataset that represents a system. This optimization of matrix values relies on activation functions, backpropagation (which computes how much each matrix value contributed to the error), and a gradient descent-based optimization method (which uses that information to update all the matrix values).
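To make the PCA ingredient concrete, here is a minimal NumPy sketch of the idea. The toy dataset, the variable names, and the choice to keep two components are assumptions made for illustration. It is not meant to mirror any particular library's PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 200 points with 3 features,
# where the third feature is almost a copy (times 2) of the first.
base = rng.normal(size=(200, 2))
noise = 0.01 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], 2.0 * base[:, 0] + noise])

# Linear algebra part: eigenvalues and eigenvectors of the covariance matrix
X_centered = X - X.mean(axis=0)
covariance = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Optimization part: rank the directions by how much variance they capture
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep only the top k directions (k = 2 is an assumption for this toy data)
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]
explained = eigenvalues[:k].sum() / eigenvalues.sum()

print(X_reduced.shape)
print(explained)
```

Because the third feature carries almost no new information, two directions should capture nearly all of the variance, so the dataset can be stored with two columns instead of three.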

But the best example of this fusion of math in engineering is in control theory. Control theory is the study of the architecture of systems. From trains to cars to airplanes, everything is based on control theory. It’s everywhere, in nearly all modern electronic devices. In electric circuits, control theory is also used heavily to guarantee circuit stability in the face of electric disturbances.

So as you can probably start to see, many of the tools we now have are just a mixture of many pure math ideas – like different recipes. In essence, applied math is the application of pure math as "ingredients" in "recipes" to solve problems.

So, we’ve explored the structure and evolution of mathematics. But it’s important to see how we can apply these ideas in real life. Pure math makes the framework, and applied math applies that framework to solve problems. To understand this, we’ll examine two code examples that show how you can use math ideas as programming tools.

Code Examples: Analytical and Numerical Approaches

These code examples demonstrate a couple of ways you can use Python to solve math equations.

In the first code example, we’ll solve the problem in the same way that kids in school solve math exercises: essentially, by hand with a pencil. In the second example, we’ll solve the problem using numerical analysis.

Example 1: Solve a Problem Analytically

In this problem, we need to find the values of the variables x and y. So we’ll be moving variables from left to right to find their values.

When we solve math problems analytically, like we did in school, we are manipulating symbols to get exact values. Often these symbols are x, y, and z.

The code below solves a system of two equations with two unknown variables, x and y.

We will use the SymPy Python library to do this. It’s mainly used for symbolic mathematics.

from sympy import symbols, Eq, solve

x, y = symbols('x y')
eq1 = Eq(2*x + 3*y, 6)
eq2 = Eq(-x + y, 1)

solution = solve((eq1, eq2), (x, y))
print(solution)

Once again with this code we are finding the values of the variables x and y.

Essentially, we’re finding x and y based on this equation:

$$\begin{align} 2x + 3y &= 6 \\ -x + y &= 1 \end{align}$$

Which gives us the following result:

{x: 3/5, y: 8/5}

Or:

x = 0.6
y = 1.6

When we say that we’re solving this analytically, it means that we’re finding an exact mathematical solution using formulas or equations.
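For comparison, the pencil-and-paper route for the same system might look like the steps below. This is only one of several valid orders of elimination: isolate y in the second equation, then substitute it into the first.

$$\begin{align} y &= 1 + x \\ 2x + 3(1 + x) &= 6 \\ 5x &= 3 \\ x &= \tfrac{3}{5}, \quad y = 1 + \tfrac{3}{5} = \tfrac{8}{5} \end{align}$$

The result matches the SymPy output: x = 3/5 and y = 8/5.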

But many times, problems are harder and can’t be solved just by moving symbols to the right or left of the equation. Sometimes, there can be so many symbols and transformed versions of them, with things like derivatives and integrals, that the work becomes very hard to manage and takes a lot of time.

For example, let’s look at this partial differential equation:

$$\begin{cases} \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}, & 0 < x < L,\, t > 0 \\ u(0,t) = 0, & t > 0 \\ u(L,t) = 0, & t > 0 \\ u(x,0) = f(x), & 0 < x < L \end{cases}$$

It can be solved with an analytical method called separation of variables.

But it requires many steps, and it’s easy to make mistakes. Even engineers who learned this often struggle to remember the process later.
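As a small sanity check on what separation of variables produces, the sketch below uses SymPy to confirm that one candidate solution satisfies the equation and both boundary conditions. The candidate (a sine in x multiplied by a decaying exponential in t) is a single building block. The full solution for a general f(x) is a sum of such terms and is not derived here. The symbol names are chosen for illustration.

```python
import sympy as sp

x, t = sp.symbols("x t", real=True)
alpha, L = sp.symbols("alpha L", positive=True)
n = sp.symbols("n", integer=True, positive=True)

# One separated solution of the form u(x, t) = X(x) * T(t)
u = sp.sin(n * sp.pi * x / L) * sp.exp(-alpha * (n * sp.pi / L) ** 2 * t)

# Plug u into the equation: du/dt - alpha * d2u/dx2 should simplify to 0
residual = sp.simplify(sp.diff(u, t) - alpha * sp.diff(u, x, 2))

# Boundary conditions: u(0, t) = 0 and u(L, t) = 0
left_boundary = u.subs(x, 0)
right_boundary = sp.simplify(u.subs(x, L))

print(residual, left_boundary, right_boundary)
```

Checking a proposed answer this way is much quicker than deriving it, which is part of why symbolic tools are handy even when the derivation is done by hand.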

When I first encountered this type of math exercise in my electrical and computer engineering degree back in Portugal, it took me 20 to 30 minutes to solve it.

For this reason, there's a branch of mathematics called numerical analysis that focuses on finding approximate solutions when exact formulas are impractical. It helps solve problems faster. This is the method we'll explore next.

Example 2: Solve Numerically (Approximation)

Now let’s solve a different problem: we’re going to find the values of each of the 5 variables:

$$\begin{bmatrix} 3 & 2 & -1 & 4 & 5 \\ 1 & 1 & 3 & 2 & -2 \\ 4 & -1 & 2 & 1 & 0 \\ 5 & 3 & -2 & 1 & 1 \\ 2 & -3 & 1 & 3 & 4 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 12 \\ 5 \\ 7 \\ 9 \\ 10 \end{bmatrix}$$

Solving this by hand will take some time…but with Python code, it’s very fast.

We’ll also use the SciPy Python library for this example.

Let’s solve the system numerically:

import numpy as np
from scipy.linalg import solve

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])

b = np.array([12, 5, 7, 9, 10])

solution = solve(A, b)

print(solution)

This code solves the matrix equation above for all five variables at once.

Again, it takes time to solve this by hand, and it’s very easy to make a simple mistake.

But in this code example, this line of code:

solution = solve(A, b)

Uses the solve method from SciPy:

from scipy.linalg import solve

It’s a function that finds the values of x in an equation A⋅x=b, where A is a square grid of numbers (a matrix) and b is a list of numbers (a vector). That gives us the following:

[ 1.35022026 -0.79955947 -1.17180617  3.14317181 -0.83920705]

Which corresponds to:

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 1.35022026 \\ -0.79955947 \\ -1.17180617 \\ 3.14317181 \\ -0.83920705 \end{bmatrix}$$

And is the same thing as:

$$\begin{align} x_1 &= 1.35022026 \\ x_2 &= -0.79955947 \\ x_3 &= -1.17180617 \\ x_4 &= 3.14317181 \\ x_5 &= -0.83920705 \end{align}$$
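A quick way to gain confidence in a numerical answer is to substitute it back into the original equation. The sketch below does that with the rounded values printed above. The tolerance is an assumption, chosen loosely because the printed values are rounded to eight decimal places.

```python
import numpy as np

A = np.array([[3, 2, -1, 4, 5],
              [1, 1, 3, 2, -2],
              [4, -1, 2, 1, 0],
              [5, 3, -2, 1, 1],
              [2, -3, 1, 3, 4]])
b = np.array([12, 5, 7, 9, 10])

# The rounded solution printed earlier
x = np.array([1.35022026, -0.79955947, -1.17180617, 3.14317181, -0.83920705])

# If x solves the system, A @ x should reproduce b up to rounding error
residual = A @ x - b
matches = np.allclose(A @ x, b, atol=1e-6)

print(residual)
print(matches)
```

The residual is not exactly zero, which is the practical meaning of "approximate": the answer is correct to the precision we kept, not to infinite precision.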

Why These Two Approaches Matter

We have solved two mathematical problems in two different ways:

Analytical: Exact solutions through algebraic manipulation
Numerical: Approximate solutions using algorithms

In engineering and in AI, we are constantly choosing between these approaches.

When training AI models with millions of parameters, exact analytical solutions are practically impossible to derive. This is why, in these cases, we need numerical approaches.
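To give a feel for what a numerical approach looks like in this setting, here is a minimal sketch that solves the small two-equation system from Example 1 by repeated small corrections instead of algebra. It uses plain gradient descent on the squared error. The step size and the number of steps are assumptions picked for this tiny problem, not general recommendations.

```python
import numpy as np

# Same system as Example 1: 2x + 3y = 6 and -x + y = 1
A = np.array([[2.0, 3.0],
              [-1.0, 1.0]])
b = np.array([6.0, 1.0])

x = np.zeros(2)        # initial guess for (x, y)
learning_rate = 0.1    # assumed step size, small enough for this matrix

for step in range(500):
    error = A @ x - b              # how far the current guess is from satisfying the equations
    gradient = A.T @ error         # gradient of 0.5 * ||A x - b||^2
    x = x - learning_rate * gradient

print(x)
```

The loop never manipulates symbols. It only measures the error and nudges the guess, and it ends up very close to the exact answer x = 0.6, y = 1.6. Training an AI model follows the same pattern, just with far more values to adjust.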

When creating math theorems, we need analytical precision to make sure the result is exact.

This is one of the many things an engineering degree teaches you: often, in the real world, it’s better to just write some code to solve a problem than to actually solve it by hand with math. Other times, the best solution is to just think in first principles and from there create new theorems to solve a problem.

Now let's step out of the code examples and see how different branches of mathematics connect.

The Impact of a Grand Unified Theory of Mathematics

Is it possible to unify all math?

In theory, yes. This is known as the Grand Unified Theory of Mathematics. It's the idea that all different areas of math can be linked together to discover deeper patterns in mathematics.

The Langlands program is trying to make this unification possible. It’s an attempt to interconnect the largest parts of the big tree of math to uncover new patterns in math.

With a Grand Unified Theory of Mathematics, we would be able to understand how every branch of the tree connects with the others and all the relationships between them.

What’s the Value of this Big Unification for Society?

By studying history, we can find patterns. The unification of various fields has created many massive impacts on society, such as:

In the 19th century, James Clerk Maxwell united the fields of electricity and magnetism with his famous Maxwell equations. This allowed the creation of radios and electric grids around the globe. In turn, it served as a foundation for all technological progress in the 20th and 21st century.
In the 20th century, the unification of algebra with logic led to the rise of digital systems. In turn, digital systems gave rise to processors and the evolution of computers and the modern laptop.
Also in the 20th century, the unification of probability and communication led to information theory. This became the foundation for the internet. This unification was carried out by a great mathematician named Claude Shannon.

In the end, a grand unified theory of mathematics could be one of the biggest achievements in modern society.

In AI, it could help unify all machine learning models in a common architecture. This would help accelerate the development of new AI models and could also open the door to new material science advances.

It could help reveal – with math – the deep patterns we still haven’t found in these fields. Just as uniting electricity and magnetism led to modern technology, a unified math framework would lead to a wave of innovation.

A Final Lesson From History

From Greek geometry to AI, math has grown like a tree over centuries. By understanding its structure, it’s possible to see its role in finding the patterns of our universe.

I hope I was able to make you see math in this way. I hope you can also see that the unification of scientific fields helps lay the foundations for the creation of new innovations to help society go forward.

Many major societal transformations only came to be thanks to abstract math ideas. When these are shared and refined, they become the hidden architecture of progress in society. Innovation begins when disconnected ideas are united, well-linked, and widely shared.

Chapter 3: The Field of Artificial Intelligence

What is Artificial Intelligence?

Photo by Pavel Danilyuk

The term Artificial Intelligence was born from the work of John McCarthy, who is often called the "father of AI."

He used it in 1955 when he, along with Marvin Minsky, Nathaniel Rochester, and Claude Shannon, proposed the famous Dartmouth Summer Research Project on Artificial Intelligence, which was held in 1956.

The proposal was built on the conjecture that:

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

Since then, the field has evolved in waves of innovation, from early rules-based systems to modern neural networks.

But over time, rather than creating general intelligence, most AI systems have been designed to excel at narrow tasks.

For example:

Chess-playing programs like Deep Blue that defeated world champion Garry Kasparov
Image recognition systems that can identify objects in photographs with impressive accuracy
Natural language processing models that can translate between languages
Game-playing AI like AlphaGo that mastered the ancient game of Go

Artificial General Intelligence isn’t yet here

Only very narrow AI models have demonstrated human-level or superhuman performance in their narrow domains.

In my view, and as we will see in this book, AGI will emerge from different large language models interacting with each other and with the tools available to them.

Symbolic vs. Non-symbolic AI: What’s the Difference?

What is Symbolic AI?

Symbolic AI refers to the creation of a program based on many rules and symbols to simulate how humans think.

It uses symbols to represent concepts (like farms and distributors) and logical rules to reason about them.

The specific data about your domain is called facts. Facts are the pieces of information the rules operate on. For example, a fact might be "green_acres has high water usage and good pH levels."

Also, imagine someone wants to optimize farm distribution logistics. The symbols would represent farms, distributors, and transport methods. Then the rules would be:

If the farm has high water usage and good pH levels, then classify it as high-yield producer
If a farm is a high-yield producer and its distributor has low demand, then prioritize direct connection
If a direct connection is needed, then select transport with lowest environmental impact

The facts would be the actual data like "farm X has high water usage" or "distributor Y has low demand."

This way, the system combines these rules and facts through logical reasoning to make decisions. A very popular programming language in this field is Prolog, which was designed for creating rule-based systems.
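Before looking at Prolog, the three rules above can be sketched in plain Python to show how rules and facts fit together. Every farm, distributor, and transport name below, along with the values attached to them, is hypothetical and exists only for this illustration. A real rule-based system would typically use a logic language rather than hand-written functions.

```python
# Hypothetical facts: every name and value below is made up for illustration.
farms = {
    "farm_x": {"water_usage": "high", "ph": "good"},
    "farm_z": {"water_usage": "low", "ph": "good"},
}
distributors = {"distributor_y": {"demand": "low"}}
transports = {"truck_a": {"impact": 3}, "van_b": {"impact": 1}}


def is_high_yield(farm):
    # Rule 1: high water usage and good pH levels -> high-yield producer
    return farm["water_usage"] == "high" and farm["ph"] == "good"


def needs_direct_connection(farm, distributor):
    # Rule 2: high-yield producer and low-demand distributor -> direct connection
    return is_high_yield(farm) and distributor["demand"] == "low"


def lowest_impact_transport(options):
    # Rule 3: pick the transport with the lowest environmental impact
    return min(options, key=lambda name: options[name]["impact"])


def decide(farm_name, distributor_name):
    farm = farms[farm_name]
    distributor = distributors[distributor_name]
    if needs_direct_connection(farm, distributor):
        return lowest_impact_transport(transports)
    return None  # no rule fired for this pair


print(decide("farm_x", "distributor_y"))  # van_b
print(decide("farm_z", "distributor_y"))  # None
```

Notice that nothing here is learned from data. The behavior comes entirely from the rules someone wrote and the facts they are applied to.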

Symbolic AI program: Manage agricultural networks with a Prolog program.

Let’s look at an example project to understand this more clearly. The project we’ll examine is called SymbolicAIHarvest. It was part of a course at NOVA University during my undergraduate studies in Electrical and Computer Engineering. The course was titled "Modelation of Data in Engineering."

SymbolicAIHarvest is an AI system developed with Prolog to manage agricultural networks. Here’s the project on GitHub so you can check it out.

The project optimizes farm operations using rule-based reasoning. It monitors sensors for real-time data and improves route planning for machinery. It also coordinates produce movement to reduce delays and waste, enhancing productivity and sustainability.

Understanding the code below is not a priority for this book. I just want to show you an example of all the facts of the project:

% FARMERS(owner)
farmer(ana).
farmer(asdrubal).
farmer(miguel).
farmer(joao).
farmer(teresinha).
farmer(victor).
farmer(carlos).
farmer(anabela).

% FARMS(name, owner, region, type)
farm(q1, ana, alentejo, vinha).
farm(q2, ana, alentejo, olival).
farm(q3, asdrubal, lisboa, cenoureira).
farm(q4, asdrubal, lisboa, milharal).
farm(q5, asdrubal, lisboa, vinha).
farm(q6, miguel, evora, trigal).
farm(q7, miguel, evora, cenoureira).
farm(q8, miguel, evora, vinha).
farm(q9, miguel, evora, morangueira).
farm(q10, joao, porto, vinha).
farm(q11, joao, porto, trigal).
farm(q12, joao, porto, cenoureira).
farm(q13, teresinha, algarve, olival).
farm(q14, teresinha, algarve, vinha).
farm(q15, victor, setubal, olival).
farm(q16, victor, setubal, vinha).
farm(q17, victor, setubal, trigal).
farm(q18, carlos, sintra, milharal).
farm(q19, carlos, sintra, vinha).
farm(q20, anabela, coina, milharal).
farm(q21, anabela, coina, olival).
farm(q22, anabela, coina, trigal).

% SENSOR READINGS(name, type, value)
sensor_reading(q1,humidity,28).
sensor_reading(q2,humidity,35).
sensor_reading(q3,humidity,42).
sensor_reading(q4,humidity,38).
sensor_reading(q5,humidity,33).
sensor_reading(q6,humidity,45).
sensor_reading(q7,humidity,30).
sensor_reading(q8,humidity,36).
sensor_reading(q9,humidity,50).
sensor_reading(q10,humidity,41).
sensor_reading(q11,humidity,40).
sensor_reading(q12,humidity,44).
sensor_reading(q13,humidity,32).
sensor_reading(q14,humidity,29).
sensor_reading(q15,humidity,47).
sensor_reading(q16,humidity,39).
sensor_reading(q17,humidity,53).
sensor_reading(q18,humidity,27).
sensor_reading(q19,humidity,24).
sensor_reading(q20,humidity,31).
sensor_reading(q21,humidity,37).
sensor_reading(q22,humidity,46).
sensor_reading(q1, temperature, 25).
sensor_reading(q2, temperature, 25).
sensor_reading(q3, temperature, 25).
sensor_reading(q4, temperature, 25).
sensor_reading(q5, temperature, 25).
sensor_reading(q6, temperature, 25).
sensor_reading(q7, temperature, 25).
sensor_reading(q8, temperature, 25).
sensor_reading(q9, temperature, 25).
sensor_reading(q10, temperature, 25).
sensor_reading(q11, temperature, 25).
sensor_reading(q12, temperature, 25).
sensor_reading(q13, temperature, 25).
sensor_reading(q14, temperature, 25).
sensor_reading(q15, temperature, 25).
sensor_reading(q16, temperature, 25).
sensor_reading(q17, temperature, 25).
sensor_reading(q18, temperature, 25).
sensor_reading(q19, temperature, 25).
sensor_reading(q20, temperature, 25).
sensor_reading(q21, temperature, 25).
sensor_reading(q22, temperature, 25).
sensor_reading(q1, water, 47000).
sensor_reading(q2, water, 52500).
sensor_reading(q3, water, 39000).
sensor_reading(q5, water, 61000).
sensor_reading(q8, water, 58000).
sensor_reading(q10, water, 43000).
sensor_reading(q13, water, 72000).
sensor_reading(q16, water, 49000).
sensor_reading(q18, water, 35000).
sensor_reading(q21, water, 66500).
sensor_reading(q1, ph, 6.5).
sensor_reading(q2, ph, 4.7).
sensor_reading(q3, ph, 8.2).
sensor_reading(q4, ph, 7.0).
sensor_reading(q5, ph, 5.1).
sensor_reading(q6, ph, 8.0).
sensor_reading(q7, ph, 4.5).

% DISTRIBUTORS (name, region, capacity, demand level)
distributor(d1, alentejo, 1000, 2).
distributor(d2, lisboa, 800, 1).
distributor(d3, evora, 1200, 3).
distributor(d4, porto, 900, 2).
distributor(d5, algarve, 700, 2).
distributor(d6, setubal, 1100, 1).
distributor(d7, sintra, 950, 2).
distributor(d8, coina, 1000, 1).

% TRANSPORTS (name, capacity, type, autonomy, region, impact)
transport(t1, 1000, fossil, 100, alentejo, 3).
transport(t2, 500, electric, 10, alentejo, 1).
transport(t3, 800, fossil, 400, algarve, 5).
transport(t4, 700, hybrid, 300, setubal, 2).
transport(t5, 150, electric, 340, coina, 1).
transport(t6, 700, fossil, 220, porto, 3).
transport(t7, 900, hybrid, 350, evora, 2).
transport(t8, 1000, electric, 170, sintra, 1).

% Connections based on graph image

% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).

This Prolog code models an agricultural supply chain system that has:

Farmers
Farms
Sensor Readings
Distributors
Transports

In addition, consider this part of the system's facts:

% Top of the network
link(q2, d1, 5).
link(q1, d1, 7).
link(q3, d1, 6).

% Network center
link(q3, q4, 8).
link(q4, d2, 6).
link(q4, d3, 7).
link(q4, q5, 5).
link(q4, d4, 6).

% Additional connections
link(q2, d2, 8).
link(q3, d3, 7).

Here, we connect farms with distributors (and some farms with each other). This way, we can see that the distance between farm q1 and distributor d1 is 7k. This makes it possible to write algorithms that find the shortest path between them.
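As a rough illustration of such a shortest-path search, the sketch below copies the link facts into Python and runs Dijkstra's algorithm over them. Two things are assumptions here: that every link can be travelled in both directions, and that the third argument of each link is a distance. The original project implements its logic in Prolog, so this is not its actual code.

```python
import heapq

# The link facts from the Prolog program, as (from, to, distance) tuples
links = [
    ("q2", "d1", 5), ("q1", "d1", 7), ("q3", "d1", 6),
    ("q3", "q4", 8), ("q4", "d2", 6), ("q4", "d3", 7),
    ("q4", "q5", 5), ("q4", "d4", 6),
    ("q2", "d2", 8), ("q3", "d3", 7),
]


def build_graph(links):
    # Assumption: each link works in both directions
    graph = {}
    for a, b, distance in links:
        graph.setdefault(a, []).append((b, distance))
        graph.setdefault(b, []).append((a, distance))
    return graph


def shortest_path(graph, start, goal):
    # Dijkstra's algorithm: always expand the cheapest known route first
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, distance in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + distance, neighbor, path + [neighbor]))
    return None  # goal not reachable from start


graph = build_graph(links)
print(shortest_path(graph, "q1", "d1"))  # (7, ['q1', 'd1'])
print(shortest_path(graph, "q1", "d4"))  # (27, ['q1', 'd1', 'q3', 'q4', 'd4'])
```

Farm q1 has no direct link to distributor d4, so the search chains several facts together, which is the same kind of reasoning over facts that the Prolog program performs.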

In the end, symbolic AI just creates programs based on a context and rules applied to that context.

What is Non-Symbolic AI?

Non-symbolic AI doesn’t use symbols or rules to think. Instead, it’s data driven. In other words, it learns patterns from large datasets. This is the approach used in machine learning and deep learning.

When we create an AI model, we can associate it with an API (Application Programming Interface) so that we can use the AI model in websites, applications, and other systems. Basically, the trained AI model is set up behind an API endpoint. An API endpoint is like a web service that lets other applications send requests to the model and get responses back.

For example, when you use ChatGPT in a web browser, your messages are sent through OpenAI's API to their language model, which processes your input and sends back a response.
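The request and response pattern can be sketched without any network at all. In the toy example below, `toy_model` and `handle_request` are hypothetical names invented for illustration, and the "model" just counts words. A real service would run a trained model behind a web framework, but the shape of the exchange (structured request in, structured response out) is the same idea.

```python
import json


def toy_model(text):
    # Stand-in for a trained model: counts words instead of doing real inference
    return {"word_count": len(text.split())}


def handle_request(request_body):
    # Hypothetical endpoint handler: takes a JSON string, returns a JSON string
    payload = json.loads(request_body)
    prediction = toy_model(payload["input"])
    return json.dumps({"output": prediction})


# What a client application would send and receive
request_body = json.dumps({"input": "matrices are everywhere in AI"})
response_body = handle_request(request_body)
print(response_body)  # {"output": {"word_count": 5}}
```

The client never sees how the model works internally. It only needs to know what to send and what comes back, which is what makes a model easy to plug into websites and applications.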

An AI agent is a software program that can autonomously perform tasks by making decisions and taking actions to achieve specific goals.

Unlike basic chatbots that only reply to questions, AI agents can plan steps, use tools, and work towards achieving complex goals. They do this by combining language models with extra features like accessing outside data or working with other AI agents.

Here’s an example of a non-symbolic AI agent project I worked on. I developed it using the OpenAI API and crewAI, one of the most popular Python libraries for creating AI agents.

In this system, five AI agents collaborate to create optimized content:

Research and Fact Checker: Conducts research to find trends and data.
Audience Specialist: Analyzes audience needs for better engagement.
Lead Content Writer: Writes engaging content based on research.
Senior Editorial Director: Ensures content quality and consistency.
SEO Specialist: Optimizes content for search engines.

Using the OpenAI API, it employs ChatGPT with crewAI to have these agents work for me.

Before AI: Control Theory as the “First AI”

Before symbolic and non-symbolic AI, electrical engineering already had data-driven methods. One key area, which I mentioned above, is control theory (which studies control systems for machines like cars and rockets). This field allows us to design systems that ensure stability despite disturbances and achieve goals beyond human capabilities.

Nowadays, after creating a control theory algorithm, we check if AI can improve the control system. In my experience, only some advanced deep learning methods are effective. Most machine learning methods don't outperform control theory in efficiency and security.

Control theory also offers better interpretability, allowing us to understand decisions, unlike advanced machine learning and deep learning.

Due to the historical importance of control theory, I will continue to mention its role and mathematical applications. This will help you learn AI's math foundations and understand its significance in electronic systems and AI applications in engineering beyond dataset predictions.

Chapter 4: Linear Algebra - The Geometry of Data

Photo by Nothing Ahead.

Linear algebra is like having organized containers for data.

Instead of playing with individual numbers, we can pack them into structured boxes that are easier to handle. These structured boxes are called matrices.

When you have a lot of variables like customer data, sensor readings, or images, these structured boxes are very helpful. Also, what we can do when we play around with these boxes is very valuable.

In AI, linear algebra is everywhere. Take matrices, for example – a key concept in Linear Algebra. LLMs perform many matrix multiplications as their core operation. The data that they take in is also organized into matrices. In image recognition, matrices are used to represent pixels of images.

So as you can see, this core Linear Algebra concept is important to understand. Let's start!

What Are Matrices and Why Do They Simplify Equations?

Very often, systems in the real world can be simplified and modeled with a system of equations.

Those equations are often differential equations of many orders. But to simplify, let’s choose a very simple system like the one below:

$$\begin{align} 2x + 3y - z &= 7 \\ x - 2y + 4z &= -1 \\ 3x + y + 2z &= 10 \end{align}$$

When dealing with many variables and equations, writing each equation separately quickly becomes frustrating. Matrices provide a compact way to represent these systems.

For example, here’s the system above as a single matrix equation:

$$\begin{bmatrix} 2 & 3 & -1 \\ 1 & -2 & 4 \\ 3 & 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 7 \\ -1 \\ 10 \end{bmatrix}$$

By seeing systems of equations as matrices, we can use linear algebra techniques to understand how the system behaves.
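As a minimal sketch of why the matrix form is convenient, the three-equation system from this section can be handed to NumPy in one call. The variable names `M`, `b`, and `solution` are just illustrative choices:

```python
import numpy as np

# Coefficient matrix and right-hand side of the system above
M = np.array([[2, 3, -1],
              [1, -2, 4],
              [3, 1, 2]], dtype=float)
b = np.array([7, -1, 10], dtype=float)

# Solve M @ [x, y, z] = b for the unknowns
solution = np.linalg.solve(M, b)
print("x, y, z =", solution)

# Sanity check: plugging the solution back in reproduces b
print("Check passes:", np.allclose(M @ solution, b))
```

Writing the system as a matrix is what lets a single library call do the work that would otherwise take several lines of substitution by hand.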

Some of these techniques are:

Linear Independence, Dependence, and Rank
Determinants
Eigenvalues and Eigenvectors

So to summarize:

A real world system can be represented as a system of equations
A system of equations can be compressed in a structured manipulable form called a matrix.
With matrices and linear algebra techniques, we can understand how the system works.

This way, we can study the basic behavior of a system with Linear Algebra.

For complex systems like a rocket, Linear Algebra is still the foundation. More advanced tools from control theory are used, but understanding simpler systems is essential for modeling and creating complex ones.

Vectors and Transformations: Moving in Multiple Directions

Vectors are matrices with a single row or a single column. You can also think of them as the building blocks of AI. They represent things like data points, model parameters, and much more.

For example, every data input (like an image or sentence) becomes a vector that the model can process.

Here are two examples of vectors:

$$\mathbf{A} = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix}$$

And:

$$\mathbf{B} = \begin{bmatrix} 3 \\ -1 \\ 8 \\ 0 \\ -4 \end{bmatrix}$$

All operations that you can perform on matrices can also be performed on vectors.

In Python, we can represent this by:

import numpy as np

# Define vectors A and B
A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

We’re using the NumPy library because it makes math with arrays easy and fast.

As a simplification of a system of equations, a vector with a single row represents:

$$\mathbf{A} = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix}$$

And this represents this system of equations:

$$4x_1 - 2x_2 + 7x_3 + x_4 + 5x_5 = k$$

A vector with a single column represents:

$$\mathbf{B} = \begin{bmatrix} 3 \\ -1 \\ 8 \\ 0 \\ -4 \end{bmatrix}$$

Which represents this system of equations:

$$\begin{align} x_1 &= 3 \\ x_2 &= -1 \\ x_3 &= 8 \\ x_4 &= 0 \\ x_5 &= -4 \end{align}$$

Now let’s see some matrix operations.

For example:

$$\mathbf{A} + \mathbf{B}^T = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} + \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix} = \begin{bmatrix} 7 & -3 & 15 & 1 & 1 \end{bmatrix}$$

vector_addition = A + B
print("A + B =", vector_addition)

Which gives the result of the equation above.

Often, vector addition is used to combine features. For example, adding many user preference vectors creates a profile of a user.

Here’s a scalar multiplication:

$$3\mathbf{A} = 3\begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} = \begin{bmatrix} 12 & -6 & 21 & 3 & 15 \end{bmatrix}$$

scalar_mult = 3 * A
print("3 * A =", scalar_mult)

Which gives the result of the equation above.

In AI, scaling vectors is usually done to adjust relevancy. For example, if we multiply a vector by a scalar of 100, we are increasing its importance. If we multiply it by 0.3, we are reducing its importance.

Here's an outer product multiplication:

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} 4 \\ -2 \\ 7 \\ 1 \\ 5 \end{bmatrix} \times \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix} = \begin{bmatrix} 12 & -4 & 32 & 0 & -16 \\ -6 & 2 & -16 & 0 & 8 \\ 21 & -7 & 56 & 0 & -28 \\ 3 & -1 & 8 & 0 & -4 \\ 15 & -5 & 40 & 0 & -20 \end{bmatrix}$$
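In Python, a short sketch of the outer product uses `np.outer`. Each entry at row i and column j of the result is simply `A[i] * B[j]`:

```python
import numpy as np

A = np.array([4, -2, 7, 1, 5])
B = np.array([3, -1, 8, 0, -4])

# Outer product: a 5x5 matrix where entry (i, j) equals A[i] * B[j]
outer_product = np.outer(A, B)
print("A outer B =")
print(outer_product)
```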

And here’s a dot product multiplication (also called an inner product):

$$\mathbf{A} \cdot \mathbf{B}^T = \begin{bmatrix} 4 & -2 & 7 & 1 & 5 \end{bmatrix} \cdot \begin{bmatrix} 3 & -1 & 8 & 0 & -4 \end{bmatrix}$$

$$= 4 \cdot 3 + (-2) \cdot (-1) + 7 \cdot 8 + 1 \cdot 0 + 5 \cdot (-4) = 50$$

We mainly use dot products when we want to measure similarity, or alignment, between two vectors. In machine learning, that is its main role: a quick measure of how similar two vectors are.

import numpy as np

dot_product = np.dot(A, B)
print("A · B =", dot_product)

Which gives the result of the equation above.

Linear Independence, Dependence, and Rank: Why It Matters

A lot of times, matrices can be made smaller and simpler. So it’s a good practice to reduce a matrix to its simplest form before we start to analyze its properties.

When at least one row of a matrix can be made by combining other rows, that matrix is linearly dependent. This means the matrix can be simplified further.

Conversely, a matrix is linearly independent when none of its rows can be created by combining the others.

For example, when we have a complex matrix like this one:

$$C = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 4 & 6 & 8 \\ 1 & 3 & 5 & 7 \\ 0 & 1 & 2 & 3 \end{bmatrix}$$

We can, with calculations, convert to this:

$$C_{\text{reduced}} = \begin{bmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

If you are not familiar with row reduction, I recommend this YouTube video.

The above simplified matrix is the same thing as this:

$$C_{\text{reduced}} = \begin{bmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 3 \end{bmatrix}$$

This way, we conclude that the C matrix has a rank of 2.

In other words, since the simplest form of the matrix has only 2 rows with numbers, it has a rank of 2.

From this, we can conclude that the rows of the reduced version of the matrix are linearly independent. This is because neither row can be made from the other. It’s the simplest possible form of the matrix.

The original matrix C is linearly dependent because some rows are just multiples or combinations of other rows. For example, row 2 of the original matrix C is exactly row 1 multiplied by 2.

Another way of seeing this is that we have 4 rows in the original matrix and the rank of matrix C is 2. Since they are not equal, C is linearly dependent.
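To check the rank without reducing the matrix by hand, here is a small sketch using NumPy's `np.linalg.matrix_rank` on the matrix C from above:

```python
import numpy as np

C = np.array([[1, 2, 3, 4],
              [2, 4, 6, 8],
              [1, 3, 5, 7],
              [0, 1, 2, 3]])

# The rank is the number of linearly independent rows
rank_C = np.linalg.matrix_rank(C)
n_rows = C.shape[0]

print("Rank of C:", rank_C)
print("Number of rows:", n_rows)
print("Rows are linearly dependent:", rank_C < n_rows)
```

When the rank is smaller than the number of rows, at least one row carries no new information.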

Why are these concepts important?

Linear independence and rank are important in engineering because they show whether equations, represented as matrices, give unique information. In electrical circuits and control systems, knowing that equations, represented as matrices, are independent ensures that you have unique solutions and avoids confusion.

The matrix rank shows the maximum number of independent equations that can exist. This helps engineers model the simplest possible form of the systems.

In LLMs like ChatGPT, Gemini, Grok, and Claude, linear independence, dependence, and rank are used in a very important technique called LoRA (Low-Rank Adaptation).

LoRA (Low-Rank Adaptation) is widely used to fine-tune these models so that they adapt efficiently to new tasks or domains without retraining the full model. There are also variants of this technique, like Quantized LoRA. Because far fewer parameters are trained, LoRA saves energy, water for cooling, and other resources in many data centers.
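The core idea behind a low-rank update can be sketched in a few lines. This is only an illustration of the linear algebra, not how any specific model implements it: the sizes `d` and `r`, the random stand-in weights, and the names `W`, `B_low`, and `A_low` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4  # hypothetical layer width and adapter rank

# Stand-in for a frozen pretrained weight matrix (d x d)
W = rng.normal(size=(d, d))

# Two small trainable matrices whose product has rank at most r
B_low = rng.normal(size=(d, r))
A_low = rng.normal(size=(r, d))

# Low-rank update added on top of the frozen weights
delta_W = B_low @ A_low
W_adapted = W + delta_W

full_params = W.size
lora_params = B_low.size + A_low.size

print("Rank of the update:", np.linalg.matrix_rank(delta_W))
print("Parameters in W:", full_params)
print("Parameters in the low-rank update:", lora_params)
```

Instead of adjusting all d × d entries of W, only the 2 × d × r entries of the two small matrices would be trained, which is where the savings come from.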

Determinants: Measuring Space and Scaling

Why are determinants important?

Determinants tell us whether a system of equations has exactly one solution, or whether it has no solutions or infinitely many, without having to solve the whole system.

This way, instead of immediately trying to solve a complex system, we can first use the determinant to find out if it is even worth solving in the first place.

Many engineers don’t really understand the importance of the determinant. The only thing they know is the formula and how to apply it.

So now let’s learn, with some examples, what exactly the determinant is and why it matters.

A determinant is just a number. It’s always calculated from a square matrix. By calculating the determinant, we can find certain properties about the system it represents.

The determinant of a given matrix A:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}.$$

can be represented by two notations:

$$\det(A) = ad - bc$$

$$|A| = ad - bc$$

Both are the same thing.

Let's see how to calculate a determinant:

$$|A| = \begin{vmatrix} 2 & 3 \\ 1 & 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$

Let’s see how to do this in Python:

import numpy as np

# Define the matrix
A = np.array([
    [2, 3],
    [1, 4]
])

# Calculate the determinant
det_A = np.linalg.det(A)

print("Determinant of A:", det_A)

The same calculation works for other matrices!

Here's the determinant formula for a 3×3 matrix:

$$|B|= \begin{vmatrix} a & b & c \\ d & e & f \\ g & h & i \end{vmatrix} = aei + bfg + cdh - ceg - bdi - afh.$$

Now let’s apply the formula to an example:

$$|B| = \begin{vmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 1 & 0 & 6 \end{vmatrix} = (1)(4)(6) + (2)(5)(1) + (3)(0)(0) - (3)(4)(1) - (2)(0)(6) - (1)(5)(0)$$

Assessing each term:

$$= (1)(4)(6) + (2)(5)(1) - (3)(4)(1) = 4 \cdot 6 + 2 \cdot 5 - ( 3 \cdot 4) = 24+10-12 = 22$$

In Python code:

import numpy as np

# Define the matrix
B = np.array([
    [1, 2, 3],
    [0, 4, 5],
    [1, 0, 6]
])

# Calculate the determinant
det_B = np.linalg.det(B)

print("Determinant of B:", det_B)

Now, let’s visualize a 2×2 matrix by plotting its column vectors. Take a matrix whose columns are the vectors C1 = (3,1) and C2 = (-2,4). This shows us geometrically what the matrix is actually doing.

In a geogebra graph, it gives us this:

As we can see, the vectors define how each variable influences the system. By visualizing what the matrices are doing, we can find patterns that are harder to find just by looking at formulas.

What does this mean visually?

It means that in the space, this is what our matrix looks like. It’s also how our system of equations is represented.

C1 represents the “force“ or the impact the variable x1 has. And C2 does the same thing for the variable x2.

Now we’ll focus on a 3×3 matrix example. This matrix D represents a system of three equations with three variables:

$$D = \begin{bmatrix} 2 & -1 & 3 \\ 4 & 0 & -2 \\ -1 & 5 & 1 \end{bmatrix}$$

$$\begin{align} 2x_1 - x_2 + 3x_3 &= p \\ 4x_1 + 0x_2 - 2x_3 &= q \\ -x_1 + 5x_2 + x_3 &= r \end{align}$$

Each column can be described as a separate vector:

$$\begin{equation} D = \left[ D_1 \mid D_2 \mid D_3 \right] = \left[ \begin{bmatrix} 2 \\ 4 \\ -1 \end{bmatrix} \mid \begin{bmatrix} -1 \\ 0 \\ 5 \end{bmatrix} \mid \begin{bmatrix} 3 \\ -2 \\ 1 \end{bmatrix} \right] \end{equation}$$

As we can see, D was decomposed into 3 new column vectors:

$$\begin{equation} D_1 = \begin{bmatrix} 2 \\ 4 \\ -1 \end{bmatrix} \end{equation}$$

and:

$$\begin{equation} D_2 = \begin{bmatrix} -1 \\ 0 \\ 5 \end{bmatrix} \end{equation}$$

and:

$$\begin{equation} D_3 = \begin{bmatrix} 3 \\ -2 \\ 1 \end{bmatrix} \end{equation}$$

In a geogebra graph, it gives us this:

In 3D, each column vector points in its own direction. Each of the three equations, in turn, describes a plane. The point where all three planes intersect is the solution to the system.

This is a key advantage of matrices and linear algebra. They help us visualize both simple and complex systems, enhancing systems thinking and first principles thinking.

The determinant is directly connected to these visualizations. For example, in 2D it measures the area that the vectors stretch over. Now we’ll see how that’s possible.

Let's use matrix A and see what its determinant looks like in geometric terms:

$$A = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}$$

Which can be decomposed into 2 column vectors, u = (2,1) and v = (3,4).

It gives us this determinant:

$$|A| = \begin{vmatrix} 2 & 3 \\ 1 & 4 \end{vmatrix} = (2)(4) - (3)(1) = 8 - 3 = 5.$$

Now let’s see the determinant visually.

From (2,1) and (3,4), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitude. They meet at (5,5), and we have a parallelogram that’s completed with these points: (0,0),(2,1),(3,4),(5,5)

The area of the parallelogram is the determinant:

Let’s see another example.

Let’s use a matrix F and see what it truly is:

$$F = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$$

It gives us this determinant:

$$|F| = \begin{vmatrix} 1 & 2 \\ 2 & 4 \end{vmatrix} = (1)(4) - (2)(2) = 4 - 4 = 0$$

In geogebra, we can see that:

Now let’s try to see the determinant visually:

We can conclude that the area is 0.

Now let’s use a matrix G and see what it truly is:

$$G = \begin{bmatrix} 1 & 5 \\ 2 & 3 \end{bmatrix}$$

It gives us this determinant:

$$|G| = \begin{vmatrix} 1 & 5 \\ 2 & 3 \end{vmatrix} = (1)(3) - (5)(2) = 3 - 10 = -7$$

In geogebra, we can see that:

Now let’s try to see the determinant visually.

From (1,2) and (5,3), we can draw vectors parallel to u and v. These are called u' and v' and have the same magnitude. They meet at (6,5). A parallelogram is completed with these points: (0,0),(1,2),(5,3),(6,5)

Again, the area of the parallelogram is given by the determinant. Its size is the absolute value, 7, and the negative sign means the orientation is flipped:

We just saw that the determinant is the area of a parallelogram formed by the vectors. When the determinant is 0, there is no area. In other cases, there is an area. But what does this mean, and why do we care about these different values?

When the det = 0:

The vectors are linearly dependent (one can be written as a combination of the others)
They lie on the same line or one is a scaled version of the other
The parallelogram collapses to a line, hence zero area
This tells us the matrix has no inverse
Systems of equations either have no solution or infinitely many solutions

When the det ≠ 0 (det > 0 or det < 0):

The vectors form a proper parallelogram with an area
- If det > 0, the area is positive and transformation preserves orientation
- If det < 0, the area is negative and the orientation is flipped
The vectors are linearly independent
Systems of equations have exactly one solution
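The two cases above can be sketched as a small check that runs before solving. The helper name `describe_system` and the right-hand side `b` are hypothetical, chosen only for this illustration; the matrices A and F are the ones from this section.

```python
import numpy as np

def describe_system(M, b):
    """Use the determinant to decide whether solving M @ x = b makes sense."""
    det = np.linalg.det(M)
    if np.isclose(det, 0.0):
        return "det = 0: no unique solution (none or infinitely many)"
    x = np.linalg.solve(M, b)
    return f"det = {det:.1f}: unique solution {x}"

A = np.array([[2.0, 3.0],
              [1.0, 4.0]])
F = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([1.0, 2.0])  # hypothetical right-hand side

print("A:", describe_system(A, b))
print("F:", describe_system(F, b))
```

Comparing against zero with `np.isclose` rather than `==` is a deliberate choice, since floating-point determinants are rarely exactly zero.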

In electrical engineering, determinants help verify if a control system is controllable and observable.

Control systems use matrices a lot. For this reason, checking if their determinants are zero or non-zero tells engineers:

If it is controllable, it means the system is reachable, which helps in stabilization and performance optimization.
If it is observable, it means the system is measurable, which helps in fault detection and system monitoring.

In finite element analysis, a very popular math tool to solve partial differential equations, determinants help figure out quickly if the calculations will give reliable results.

This way, with finite element analysis, we can design safer buildings, optimize aircraft wings, and simulate medical implants – all of which have a large impact on human lives and safety.

In machine learning, determinants are crucial to understanding data transformations. In these methods, if a determinant with a value of zero shows up, it means you are losing information and can't recover original data.

Also in deep learning, it’s used to decide the first parameters of neural networks (weight initialization) to prevent problems like the vanishing/exploding gradients.

In a 3×3 matrix, the determinant represents the volume of a parallelepiped (a 3D "box") formed by three vectors in 3D space.

If det = 0: The three vectors lie in the same plane, so they don't span any 3D volume
If det ≠ 0: The vectors form a proper 3D shape with actual volume

The absolute value |det| gives you the exact volume of that parallelepiped.

For example, if you have vectors a, b, and c, the determinant tells you how much 3D space they "fill up" when you use them as the edges of a box.
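As a quick sketch of the volume interpretation, the code below reuses the three column vectors of matrix D from earlier and cross-checks the determinant against the scalar triple product, which is another standard way to compute the same volume:

```python
import numpy as np

# Column vectors of matrix D from earlier in this chapter
D1 = np.array([2.0, 4.0, -1.0])
D2 = np.array([-1.0, 0.0, 5.0])
D3 = np.array([3.0, -2.0, 1.0])

# Stack the vectors as columns and take |det| to get the volume
D = np.column_stack([D1, D2, D3])
volume_from_det = abs(np.linalg.det(D))

# Cross-check: |D1 . (D2 x D3)| is the same volume
volume_from_triple = abs(np.dot(D1, np.cross(D2, D3)))

print("Volume from determinant:", volume_from_det)
print("Volume from triple product:", volume_from_triple)
```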

This is where it gets fascinating:

4×4 matrix: The determinant represents the "hypervolume" of a 4D parallelepiped formed by four vectors in 4-dimensional space.
1000×1000 matrix: The determinant represents the hypervolume in 1000-dimensional space!

So, to summarize, the determinant tells us easily whether a system of equations, represented by a compact matrix, has exactly one solution (det ≠ 0) or either no solutions or infinitely many (det = 0).

What Are Mathematical Spaces and How Do They Simplify Calculations?

We now have a great foundation to understand the rest of this chapter on linear algebra.

Now, we will see how a linearly independent matrix creates something called a basis. Also, we will see that a basis is just a set of building blocks for mathematical spaces!

The row vectors of a linearly independent matrix form a basis.

For example in matrix A, which is linearly independent:

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

forms this set:

$$((1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1))$$

In this case, since matrix A is linearly independent, the set of matrix rows is called a basis. From this basis, you can build any other vector as a linear combination, and there are endlessly many such combinations. The collection of all these possible combinations is called a mathematical space.

A mathematical space is an infinite set where all linear combinations of a basis exist. It’s called a basis because these vectors form the base to express any vector in the space as a linear combination.

This matrix B is linearly independent:

$$B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

And forms this set:

$$((1, 0), (0, 1))$$

And from this come all possible points in this cartesian coordinate system:

For example, mathematically, we can get the point (2,3) by:

$$(x=2, y=3) = 2(1, 0) + 3(0, 1) = (2, 0) + (0, 3) = (2, 3)$$

Note: There are other bases for the cartesian coordinate plane. I chose this one because it’s the easiest to understand.
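To make the note above concrete, here is a small sketch that builds the point (2, 3) from the standard basis and then expresses the same point in a different basis. The second basis, (1, 1) and (1, -1), is a hypothetical choice made only for illustration:

```python
import numpy as np

# Standard basis of the plane
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# The point (2, 3) as a linear combination of the standard basis
point = 2 * e1 + 3 * e2
print("Point:", point)

# A different (hypothetical) basis for the same plane
b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])

# Find c1, c2 such that c1*b1 + c2*b2 = point
basis_matrix = np.column_stack([b1, b2])
coords = np.linalg.solve(basis_matrix, point)
print("Coordinates in the new basis:", coords)

# Rebuilding the point from the new coordinates gives (2, 3) again
print("Rebuilt point:", coords[0] * b1 + coords[1] * b2)
```

The point itself never changes. Only the numbers used to describe it depend on which basis we pick.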

Eigenvalues and Eigenvectors: Unlocking Hidden Patterns

Eigenvalues and eigenvectors, in my opinion, are far simpler than what mathematics professors make them out to be at university:

Eigenvalues tell you how much a matrix stretches or shrinks things.
Eigenvectors tell you which directions stay unchanged when the matrix transforms them.

This way, a matrix may have one or many eigenvalues, and each eigenvalue has its own eigenvectors.

Let’s see an example:

For a square matrix A, eigenvalue λ, and eigenvector v:

$$Av = \lambda v$$

The easiest way to find the eigenvalue is to calculate this:

$$\det(A - \lambda I) = 0$$

or:

$$|A - \lambda I| = 0$$

Again, we have different notations for the determinant, but they’re the same thing.

Anyway, let’s define a very simple matrix A:

$$A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$$

Now let’s make some calculations.

This formula:

$$\det(A - \lambda I) = 0$$

Can be decomposed into:

$$\det\left(\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) = 0$$

Which is the same as:

$$\det\left(\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix}\right) = 0$$

Which gives us:

$$\det\left(\begin{bmatrix} 2-\lambda & 0 \\ 0 & 3-\lambda \end{bmatrix}\right) = 0$$

By the calculations we made above on the determinant, we can conclude that:

$$(2-\lambda)(3-\lambda) = 0$$

Which is the same as:

$$2-\lambda = 0 \text{ or } 3-\lambda = 0$$

Which gives us these eigenvalues:

$$\lambda_1 = 2, \quad \lambda_2 = 3$$

And these eigenvectors:

$$\mathbf{v_1} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{v_2} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

This means that in the Cartesian coordinate system:

By applying the eigenvectors, we can see that:

The eigenvalue 2 is associated with the eigenvector v1:

$$A\mathbf{v_1} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

The eigenvalue 3 is associated with the eigenvector v2:

$$A\mathbf{v_2} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \end{bmatrix} = 3\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Here is the Python code to calculate this:

import numpy as np

# Define matrix A
A = np.array([[2, 0],
              [0, 3]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Eigenvalues:")
print(eigenvalues)

print("Eigenvectors (columns):")
print(eigenvectors)

Eigenvalues and eigenvectors are key tools in engineering and machine learning because they reveal a matrix's fundamental behavior. Although a matrix transformation might seem complex, in reality:

Eigenvalues show how much stretching or compression occurs.
Eigenvectors identify the special directions where this stretching happens most naturally.

In machine learning, we can use Principal Component Analysis (PCA) to make datasets smaller.

So, for example, let's say you’re building a machine learning application to predict heart disease. You have 100 data categories and 1 target variable telling whether a person has it or not.

With PCA, you can convert the 100 categories into, say, 40 categories. This way, you can make a smaller machine learning model and save computational resources.

PCA uses eigenvectors of covariance matrices to find important directions in data with many variables. It reduces data size without losing much detail, helping machine learning algorithms focus on key features and ignore unnecessary information.
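The PCA recipe described above can be sketched with plain NumPy. This is a minimal illustration on random stand-in data, not a production implementation: the dataset, its size (200 samples, 10 features), and the choice to keep 3 components are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 200 samples, 10 features
n_samples, n_features, n_components = 200, 10, 3
X = rng.normal(size=(n_samples, n_features))

# 1. Center the data so each feature has mean zero
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (10 x 10)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort directions from most to least variance and keep the top ones
order = np.argsort(eigenvalues)[::-1]
top_vectors = eigenvectors[:, order[:n_components]]

# 5. Project the data onto those directions
X_reduced = X_centered @ top_vectors

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
```

Each eigenvalue measures how much of the data's variance lies along its eigenvector, which is why the largest eigenvalues mark the directions worth keeping.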

Applications of Linear Algebra in AI and Control Theory

Linear algebra serves as the mathematical foundation for all engineering fields.

In addition, the principles of matrices and linear transformations provide the computational foundation that makes modern AI possible while enabling the control of complex systems.

All LLMs, from ChatGPT and Claude to Gemini and Grok, rely on linear operations.

All these systems carry out huge matrix multiplications to handle and create human language. So, when you type something into ChatGPT, probably millions of matrix multiplications are happening as you wait for a response!

In control theory, especially in an area called state-space control theory, matrices make it possible to create complex controllers. Linear algebra helps engineers design controllers for things like aircraft autopilots and robotic systems, among other applications.

For example, when a rocket adjusts its trajectory or a drone maintains stable flight, many matrix multiplications are happening to determine the best way to guarantee the system’s stability.

Thanks to GPUs, matrix operations are very efficient to compute. Also, any new matrix multiplication algorithms or special hardware for faster linear operations can greatly enhance AI and control systems.

In the end, linear algebra is the hidden mathematical engine powering the current AI revolution.

Chapter 5: Multivariable Calculus - Change in Many Directions

Photo by ThisIsEngineering

Limits and Continuity: Understanding Smooth Change

Calculus is one of the most valuable areas of mathematics, and it focuses on the study of continuous change.

Before we start learning a topic that makes many people give up on engineering degrees, I want to once again assure you that this chapter is very easily explained with a lot of images and code examples.

Also, just like linear algebra, many concepts in calculus are components of tools that have helped create billion-dollar industries.

What is continuity?

Before going and explaining topics like derivatives and integrals, we need to understand continuity.

In simple terms, continuity means that a function has no breaks, jumps, or holes.

Essentially, you can draw it without lifting your pencil from the paper.

For example, this function is continuous:

You can draw this graph without taking the pencil off the paper.

The above graph is represented by this function:

$$y = x^2 - 4x + 3$$

But the below function is not continuous:

This one, you can’t draw without taking the pencil off the paper.

It’s represented by this piecewise function:

$$y = \begin{cases} 1.5 + \frac{1}{x+1} & \text{if } -1 < x < 2 \\ 2 + \frac{2}{(x-1)^2} & \text{if } x > 2 \end{cases}$$

This piecewise function is essentially two individual functions for two different intervals of numbers. Since calculus is the study of continuous change, we can only realistically use it in continuous functions.

How do limits guarantee continuity?

We can only use tools like derivatives and integrals if a function is continuous.

How can we describe mathematically that a function is continuous – like drawing it without lifting our pencil from the paper?

Limits solve that problem.

When we take the limit of a function at a given point, we're asking: what value does a function approach as we get close to that point?

Let's look at some examples of this function at these points and also understand the notation used in limits:

What is the limit at the point x=0?

It is 3. It actually crosses the y axis.

In mathematical notation,

$$\begin{align} \lim_{x \to 0} (x^2 - 4x + 3) &= (0)^2 - 4(0) + 3 \\ &= 0 - 0 + 3 \\ &= 3 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 0. Think of x as being at 0.00000000000001 or -0.00000000000001. It gets so close that we can consider it near enough.

What is the limit at the point x=1?

Let’s see another example:

In this case, it’s 0.

$$\begin{align} \lim_{x \to 1} (x^2 - 4x + 3) &= (1)^2 - 4(1) + 3 \\ &= 1 - 4 + 3 \\ &= 0 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 1. Think of x as being at 0.99999999999999 or 1.00000000000001. It gets so close that we can consider it near enough.

What is the limit at the point x=2?

Let’s see another example:

Here, it’s -1.

$$\begin{align} \lim_{x \to 2} (x^2 - 4x + 3) &= (2)^2 - 4(2) + 3 \\ &= 4 - 8 + 3 \\ &= -1 \end{align}$$

In this notation, we're asking what the value of the y function is as x gets very close to 2. Think of x as being at 1.99999999999999 or 2.00000000000001. It gets so close that we can consider it near enough.

Some more quick examples:

What is the limit at the point x=3?

It is 0.

What is the limit at the point x=4?

It is 3.

What is the limit at the point x=5?

It is 8.

Now let’s see another example:

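At the point x=2, the limit is not well defined:

If we draw with a pencil from the left to x=2, we end up with 1.83333
If we draw with a pencil from the right to x=2, we end up with 4

Since the two sides approach different values, there is no single limit at x=2, and the function is not continuous there.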
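The two pencil strokes can be checked with SymPy's one-sided limits. This is a small sketch using the two pieces of the piecewise function from earlier; the names `left_piece` and `right_piece` are just illustrative:

```python
import sympy as sp

x = sp.symbols('x')

# The two pieces of the piecewise function
left_piece = sp.Rational(3, 2) + 1 / (x + 1)   # valid for -1 < x < 2
right_piece = 2 + 2 / (x - 1) ** 2             # valid for x > 2

# Approach x = 2 from the left and from the right
left_limit = sp.limit(left_piece, x, 2, dir='-')
right_limit = sp.limit(right_piece, x, 2, dir='+')

print("From the left:", left_limit, "=", float(left_limit))
print("From the right:", right_limit)
print("Do both sides agree?", left_limit == right_limit)
```

When the left and right limits disagree, the function has a jump at that point.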

Why are limits important to understand derivatives and integrals?

As we have seen, when we talk about limits, we are talking about a value that symbolizes the value that a function approaches as it comes toward a particular point.

It’s critical to note that we're not looking at the value of that point itself. We’re looking at what happens as we get so near to it that we can pin down what value the function is approaching.

I will now show a very simple example to demonstrate this concept using mathematical notation.

I know that limits can be a difficult concept to understand at first. But if you understand limits very well, then you'll be well-prepared to understand derivatives and integrals.

And, as you’ll see, derivatives are responsible for modern AI and integrals are important parts of tools widely used in billion-dollar industries.

I want you to understand the intuition behind this.

The function f(x) is continuous for x > -2:

$$f(x) = \frac{3x + 7}{x + 2}$$

So to what value does this expression converge as x approaches infinity?

If you have a background in math, you might see why. But here it is for those who aren’t sure:

It converges to 3.

This time, the limit will be approaching infinity instead of a constant:

$$\begin{align} \lim_{x \to \infty} \frac{3x + 7}{x + 2} \end{align}$$

Let’s solve this in a very simple way:

For x = 1:

$$f(1) = \frac{3(1) + 7}{1 + 2} = \frac{10}{3} \approx 3.333...$$

For x = 5:

$$f(5) = \frac{3(5) + 7}{5 + 2} = \frac{22}{7} \approx 3.143...$$

For x = 10:

$$f(10) = \frac{3(10) + 7}{10 + 2} = \frac{37}{12} \approx 3.083...$$

For x = 50:

$$f(50) = \frac{3(50) + 7}{50 + 2} = \frac{157}{52} \approx 3.019...$$

For x = 100:

$$f(100) = \frac{3(100) + 7}{100 + 2} = \frac{307}{102} \approx 3.010...$$

For x = 1000:

$$f(1000) = \frac{3(1000) + 7}{1000 + 2} = \frac{3007}{1002} \approx 3.001...$$

For x = 10000:

$$f(10000) = \frac{3(10000) + 7}{10000 + 2} = \frac{30007}{10002} \approx 3.0001...$$

As x gets bigger and bigger, we get closer and closer to 3.
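The same table of values, plus the exact limit, can be reproduced with a short SymPy sketch:

```python
import sympy as sp

x = sp.symbols('x')
expr = (3 * x + 7) / (x + 2)

# Evaluate the function at increasingly large inputs
for value in (1, 10, 100, 1000, 10000):
    print(f"x = {value:>5}: {float(expr.subs(x, value)):.4f}")

# Ask SymPy for the limit as x goes to infinity
limit_at_infinity = sp.limit(expr, x, sp.oo)
print("Limit as x -> infinity:", limit_at_infinity)
```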

This is the main idea of limits: Describe the value a function approaches as the input approaches some point.

This same idea applies to derivatives: they’re just limits that measure rates of change (slopes of tangent lines).

Likewise, integrals are just limits that measure accumulated quantities (areas under curves).

Let’s now see how derivatives work in depth.

Derivatives: How Things Change and How Fast

As I said before, derivatives are just limits that measure rates of change (slopes of tangent lines).

But what does this actually mean?

Let’s see an example:

What is the rate of change at the point A?

Hard question, right? Let’s think about how to answer this with limits.

We can find the limit of the rate of change at point A(0.72, 0.66), also called the instantaneous rate of change.

Let’s do that:

To find the slope, we take the coordinates of the points B(0.2, 0.2) and C(1.6, 1):

$$\text{slope} = \frac{1 - 0.2}{1.6 - 0.2} = \frac{0.8}{1.4} = \frac{4}{7} \approx 0.571$$

This gives us a rate of change:

$$y=0.571x + 0.084$$

Let's approximate more:

Let’s also zoom in:

To find the slope, we use the coordinates of the points B(0.58, 0.55) and C(0.85, 0.75):

$$\text{slope} = \frac{0.75 - 0.55}{0.85 - 0.58} = \frac{0.2}{0.27} \approx 0.741$$

It gives us a rate of change:

$$y=0.741x + 0.12$$

Now let's approximate a lot:

To find the slope, we use the coordinates of the points B(0.7242549, 0.6625776) and C(0.7242884, 0.66260026):

$$\text{slope} = \frac{0.66260026- 0.6625776}{0.7242884- 0.7242549} = \frac{0.0000226}{0.0000335} = \frac{0.226}{0.335} \approx 0.674$$

Now let’s zoom out:

As we can see, we are so close that we can consider the limit of the rate of change to be roughly 0.67.

It gives us the rate of change:

$$y=0.674x + 0.12$$

This way, the limit of a rate of change is called a derivative.

To recap, here is an animation:

Here’s a Python code example that lets you find the derivative at point A:

import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)

# Derivative of sin(x), which is cos(x)
f_prime = sp.diff(f, x)

# Evaluate the derivative at x = 0.72 (the x-coordinate of point A)
val = f_prime.subs(x, 0.72).evalf()

print("Derivative of sin(x) at x=0.72:", val)

The function that had the point A is called a sine wave.

We convert it to its derivative function. From there we have our rate of change at point 0.72.

When we do math by hand, we usually have many rules to convert a function to its derivative, and from these find the rate of change for a given point.

Before seeing them, let’s look at a very simple example to understand the definition of a derivative:

$$\frac{d}{dx}f(x) \approx \frac{f(\textcolor{green}{x + h}) - f(\textcolor{red}{x - h})}{(\textcolor{green}{x + h}) - (\textcolor{red}{x - h})} = \frac{f({x + h}) - f({x - h})}{2h}$$

h represents a small difference.

The derivative is the slope of the function’s small change near a point. In other words, it’s the limit of the rate of change of a given point.
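To see this limit happen numerically, here is a small sketch that applies the formula above to the sine wave at x = 0.72 with smaller and smaller values of h. The helper name `central_difference` is just an illustrative choice. Since the exact derivative of sin(x) is cos(x), the estimates should settle near cos(0.72), which is about 0.752. The slopes read by hand from the graph earlier used rounded coordinates, so they are rougher estimates of the same number.

```python
import math

def central_difference(f, x, h):
    """Slope between the points x - h and x + h."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.72
for h in (0.5, 0.1, 0.01, 0.001):
    estimate = central_difference(math.sin, x0, h)
    print(f"h = {h:<6} slope estimate = {estimate:.6f}")

print("Exact derivative cos(0.72) =", math.cos(x0))
```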

A simple derivative transformation might look like this one:

$$\frac{d}{dx}x^n = nx^{n-1}$$

Two examples are:

$$\frac{d}{dx}x^3 = 3x^2$$

And:

$$\frac{d}{dx}x^5 = 5x^4$$

There are many more. But we won’t go into deep detail on this topic.
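The two examples above can be double-checked symbolically. This short sketch asks SymPy to apply the rule for us:

```python
import sympy as sp

x = sp.symbols('x')

# Differentiate x^3 and x^5 and compare with n * x^(n-1)
for n in (3, 5):
    derivative = sp.diff(x ** n, x)
    print(f"d/dx x^{n} =", derivative)
```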

Where and why are derivatives so important?

Derivatives are one of the most important math tools out there. They serve as the foundation for understanding change across nearly all fields of STEM.

In physics (classical mechanics), derivatives are very important because they let us find new information from information we already have.

For example, knowing how a body's position changes over time allows us to use derivatives to find its velocity and acceleration. This is crucial for self-driving cars, trains, rockets, and more.

Also, derivatives are the foundation of understanding how electricity works in depth. Without derivatives, there would’ve been no electromagnetic theory. Without electromagnetic theory, modern technology would not exist.

In machine learning, derivatives are so important that they are the basis of backpropagation, the algorithm that is one of the most important components of ChatGPT and other AI models.

The neural networks trained this way are in fact so important that John Hopfield and Geoffrey Hinton won the 2024 Nobel Prize in Physics for their foundational work on artificial neural networks. Hinton also co-authored the 1986 paper that popularized backpropagation.

Also, autonomous vehicles like those from Tesla and Waymo use AI models called neural networks that depend on backpropagation to work.

It’s awesome that a math concept created in the 17th century is now one of the foundations of the current AI revolution.

What About Integral Calculus?

Before explaining derivatives further, I will ask you a question:

How can we find the area of the below shape?

In other words, how can we find the integral of the function in the given interval?

Let’s see how to do it step by step.

First, we’ll try using 2 rectangles to approximate the area under the curve:

Now the area of the rectangles is 6.282573.

But there is still a lot of error…

As we can see, the left rectangle does not completely cover the area under the curve and the right rectangle covers too much.

So we’ll add more, smaller rectangles so that we can better approximate the curve.

Now let’s try using 4 rectangles:

Now the area is 6.497481. But there’s still some error.

As we can see, the error is getting smaller. In other words, the 4 rectangles cover the area of the curve better than just the 2 rectangles. But there’s still a lot of room to make it better.

Let’s try using 8 rectangles:

Now the area is 6.604935.

How about using 16 rectangles?

Now the area is 6.658662.

Let’s try using 32 rectangles:

Now the area is 6.685525.

Now how about using 64 rectangles:

Now the area is 6.698957.

And using 128 rectangles:

Now the area is 6.705673.

What about using 256 rectangles:

Now the area is 6.709031. And the remaining error is now tiny!

Now let’s see an animation of this:

As you can see, as the number of rectangles grows toward infinity, the total area of the rectangles approaches the exact area under the curve.

This way, we can conclude that:

$$\int_0^{3.14} f(x) \, dx = \int_0^{3.14} (\sin(x) + 1.5) \, dx \approx 6.71$$

This means that the area between 0 and 3.14, under the curve of the function, is about 6.71!

Or, mathematically, the integral of f(x) over the interval from 0 to 3.14 is about 6.71.
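The rectangle idea can be sketched in a few lines of Python. This is only an illustration: the helper name `riemann_sum` and the choice of using each rectangle's left edge for its height are assumptions, so the printed values can differ slightly from the ones in the figures, even though they approach the same area.

```python
import math

def f(x):
    return math.sin(x) + 1.5

def riemann_sum(func, start, end, n_rectangles):
    """Add up n equal-width rectangles, using the left edge of each one for its height."""
    width = (end - start) / n_rectangles
    return sum(func(start + i * width) * width for i in range(n_rectangles))

# Double the number of rectangles each time and watch the area settle
for n in [2, 4, 8, 16, 32, 64, 128, 256]:
    print(f"{n:>3} rectangles -> area = {riemann_sum(f, 0, 3.14, n):.6f}")

area_256 = riemann_sum(f, 0, 3.14, 256)
```

Each time the number of rectangles doubles, the result changes less, which is exactly the "limit" behavior described above.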

Where and how is this applied?

In electrical engineering, integrals calculate total energy use in circuits by integrating power over time. For example, when designing a power supply for a device, engineers integrate the power to determine total energy costs and heat dissipation requirements.

In other words, the area under the power curve over time tells them how much energy is used.

Let's see an example:

Imagine that in the image above:

The X axis can be the time in months.
The Y axis is the power used in Watts (Joules per second).

We can conclude that in 3.14 months (3 months and 4 days) the total amount of energy used is 6.71 watt-months.

Here is the code to find that out:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Create Function
x = np.linspace(0, 3.14, 100)
y = np.sin(x) + 1.5

# Find the area under the function
# (np.trapezoid needs NumPy 2.0 or newer; older versions call it np.trapz)
area = np.trapezoid(y, x)

# Show the final image
plt.fill_between(x, y)
plt.title(f'Area = {area:.2f}')
plt.show()

In this code, we import the libraries, create the function, and find the area and plot it.

We used numpy.trapezoid to find the area, because it’s a numerical approximation to quickly find the integral of a function between two x values.

numpy.trapezoid uses a numerical approximation method called the composite trapezoidal rule.

The basic idea of the composite trapezoidal rule is to divide the area under the curve into many trapezoids and sum all of them.

If you want to learn more about this, I recommend reading the NumPy documentation on this method.

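From this value, we can convert to other units (assuming a 30-day month, which is 720 hours):

About 17,400,000 joules (6.71 W × 720 h × 3,600 s/h)
About 4.83 kWh (6.71 W × 720 h ÷ 1,000)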

By converting to other units, we can more easily compare this device with other devices and see if it obeys any technical standards and laws.

This is a real-life application of integrals in engineering.

In my degree, I used this a lot in classes related to power engineering. In simple words, power engineering is a subfield of electrical engineering focused on working with electricity with very high voltage values and electric motors.

In audio compression, the Fourier transform (built on integrals) decomposes sound waves into frequency components. MP3 encoders use this to identify and remove frequencies humans can't hear. This reduces file sizes while preserving quality.

Medical imaging relies on the Radon transform, which uses integrals to reconstruct 3D images from 2D X-ray projections. When you get a CT scan, the machine takes hundreds of X-ray "slices" at different angles. During this process, integrals combine "slices" into a detailed cross-sectional image of your body.

Applications in AI and Control Theory: Calculus in Action

Modern AI depends on the backpropagation algorithm, which is built on derivatives.

When training a neural network, the system calculates partial derivatives of the error with respect to millions of parameters. This way, it finds out how to adjust each weight to improve performance. Without this, large language models like ChatGPT couldn't learn from data.
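To see the idea on the smallest possible scale, here is a sketch with a single weight instead of millions. The function name `train_single_weight`, the data point, and the learning rate are all hypothetical values chosen for illustration. Real neural networks repeat this same step for every weight.

```python
def train_single_weight(x, y_true, learning_rate=0.1, steps=50):
    """Fit prediction = weight * x to one data point by following the derivative."""
    weight = 0.0
    for _ in range(steps):
        prediction = weight * x
        # The error is (prediction - y_true)^2.
        # Its derivative with respect to the weight is 2 * (prediction - y_true) * x.
        gradient = 2 * (prediction - y_true) * x
        # Move the weight a small step in the direction that reduces the error
        weight -= learning_rate * gradient
    return weight

learned_weight = train_single_weight(x=1.5, y_true=3.0)
print("Learned weight:", learned_weight)
```

Since 2 × 1.5 = 3.0, the weight should settle very close to 2. The derivative tells the training loop which direction to move in at every step.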

PID controllers, which stabilize the temperature in your oven or maintain altitude in aircraft autopilot systems, combine calculus ideas:

The proportional term responds to the current error.
The integral term accumulates past errors to eliminate steady-state drift.
The derivative term predicts future trends to prevent overshooting.
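The three terms above can be sketched as a tiny simulation. Everything here is a simplified assumption for illustration: the oven model, the gains `kp`, `ki`, and `kd`, and the function name `simulate_oven_pid` are hypothetical and not tuned for any real device.

```python
def simulate_oven_pid(setpoint, kp=2.0, ki=1.0, kd=0.1, dt=0.01, steps=3000):
    """Simulate a toy oven driven by a PID controller and return the temperature history."""
    ambient = 20.0
    temperature = ambient
    integral = 0.0
    previous_error = setpoint - temperature
    history = []

    for _ in range(steps):
        error = setpoint - temperature              # proportional: the current error
        integral += error * dt                      # integral: accumulated past error
        derivative = (error - previous_error) / dt  # derivative: how fast the error changes

        heater_power = kp * error + ki * integral + kd * derivative

        # Hypothetical oven model: the heater raises the temperature,
        # while heat leaks away in proportion to the gap with the room.
        temperature += dt * (heater_power - 0.5 * (temperature - ambient))

        previous_error = error
        history.append(temperature)

    return history

temperatures = simulate_oven_pid(setpoint=180.0)
print(f"Final temperature: {temperatures[-1]:.2f}")
```

In this sketch, the integral term is what lets the oven hold the setpoint even though heat keeps leaking out. With only the proportional term, the temperature would settle below the target.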

And these are just some of the applications of calculus!

Chapter 6: Probability & Statistics - Learning from Uncertainty

Photo by Armando Are

It’s thanks to probabilities and statistics that many industries have grown so much. With statistics, we can make informed decisions and optimize many different processes. With probabilities, we can understand and model uncertainty in systems and, in this way, solve or even avoid problems.

While you may be familiar with some of the key concepts like median and mean, we’ll start with some basics to build up your intuition on more advanced stuff like the central limit theorem, Bayes’ theorem, and Markov chains.

Mean, Median, Mode: Measuring Central Tendency

Let's imagine you are a data scientist working in research. You’re going to work with data to optimize the output of farms in the Central Valley in California.

The idea is to take in a bunch of data, and by studying it, you can help farmers make better decisions.

Here’s the data from one year of activity:

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390

We have 6 farms in our dataset. For each farm, we know:

How much yield was obtained in tons per hectare
How much fertilizer was used in kilograms per hectare
How much rainfall happened during a year of activity

Now, let’s answer some questions we might have about the data to understand the mean, mode and median:

1. What is the average yield during one year of activity?

To find the average, we just need to sum all the yield values and divide by the number of farms. Like this:

$$\text{Mean} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$

This is what is called the mean. The mean is just the sum of all values divided by how many values there are.

In Python, we can do the following to calculate the mean:

def calculate_mean(values):
    return sum(values) / len(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_mean(data)
print(f"Mean: {result}")

2. What is the mode of fertilizer used?

The mode is just the most popular value in a given dataset. In our case, it’s 200 since that’s the most common value that appears in our farm dataset.

In Python, we can do this to calculate the mode:

import statistics

def calculate_mode(values):
    return statistics.mode(values)

# Example usage
data = [150, 220, 120, 250, 200, 200]
result = calculate_mode(data)
print(f"Mode: {result}")

3. What is the median of the yield?

The median is just the value in the middle of a set of numbers. If the number of elements in the list is even, we take the mean of the two middle numbers. Here are our current yield values:

$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$

First, we sort the values:

$$3.9, 4.2, 4.7, 5.3, 5.8, 6.1$$

Since we have 6 values (even number), the median is the average of the two middle values:

$$\text{Median} = \frac{4.7 + 5.3}{2} = \frac{10}{2} = 5$$

In Python we can do this to calculate the median:

import statistics

def calculate_median(values):
    return statistics.median(values)

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
result = calculate_median(data)
print(f"Median: {result}")

Variance and Standard Deviation: Measuring Spread

Knowing the mean, mode, and median of data is helpful. But it’s also important to know how far away data points are from each other.

That’s where measures of dispersion come in. Variance tells us, on average, how far numbers are from the mean, measured as squared distances.

Let’s see an example of how to calculate this.

Given yield data from the table:

$$4.2, 5.8, 3.9, 6.1, 4.7, 5.3$$

The first step is to calculate the mean:

$$\bar{x} = \frac{4.2 + 5.8 + 3.9 + 6.1 + 4.7 + 5.3}{6} = \frac{30}{6} = 5$$

The second step is to calculate the variance with the sample variance formula:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

Let's apply the formula little by little to understand how it works.

First, we will calculate the squared difference from the mean for each yield data point:

$$\begin{align*} (4.2 - 5.0)^2 &= (-0.8)^2 = 0.64 \\ (5.8 - 5.0)^2 &= (0.8)^2 = 0.64 \\ (3.9 - 5.0)^2 &= (-1.1)^2 = 1.21 \\ (6.1 - 5.0)^2 &= (1.1)^2 = 1.21 \\ (4.7 - 5.0)^2 &= (-0.3)^2 = 0.09 \\ (5.3 - 5.0)^2 &= (0.3)^2 = 0.09 \end{align*}$$

Then we will sum all the squared differences:

$$\sum(x_i - \bar{x})^2 = 0.64 + 0.64 + 1.21 + 1.21 + 0.09 + 0.09 = 3.88$$

Now, we will finally find the variance:

$$s^2 = \frac{3.88}{6-1} = \frac{3.88}{5} = 0.776$$

The standard deviation is just the square root of the variance.

$$s = \sqrt{s^2} = \sqrt{0.776} \approx 0.881 \text{ tons/ha}$$

Why is this useful?

It puts the spread back into the same units as the data, making it easier to interpret.

A small standard deviation means the data huddles close to the mean, while a large one means it’s widely scattered.

And here is a code example of how to calculate both:

import statistics

def calculate_variance_and_std(values):
    variance = statistics.variance(values)
    std_dev = statistics.stdev(values)
    return variance, std_dev

# Example usage
data = [4.2, 5.8, 3.9, 6.1, 4.7, 5.3]
variance, std_dev = calculate_variance_and_std(data)
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")

What Is the Normal Distribution? The Bell Curve of Life

The normal distribution describes data that naturally clusters around the average value. Most values sit near the center, and extreme values become rarer toward the edges. This creates a bell curve.

By understanding this distribution, we can understand other distributions and also the central limit theorem.

To understand what the normal distribution is, let’s look at it:

The normal distribution looks like a mountain.

As you can see, most values are around the mean. Also, in and around the mean is the peak. Toward the extremes, the curve gets lower and lower. This means that in the extremes there are fewer and fewer values.

The normal distribution also has a formula associated with it:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

I won’t go in depth into how the formula works here. I just want you to understand the main idea behind the concept.
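If you're curious, the formula translates almost directly into code. The sketch below uses a hypothetical helper called `normal_pdf` and, purely as an example, plugs in the mean (5.0) and standard deviation (about 0.881) from the yield data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the bell curve at x for a given mean (mu) and standard deviation (sigma)."""
    variance = sigma ** 2
    coefficient = 1.0 / math.sqrt(2 * math.pi * variance)
    exponent = -((x - mu) ** 2) / (2 * variance)
    return coefficient * math.exp(exponent)

mu, sigma = 5.0, 0.881  # mean and standard deviation of the yield data

# The curve is highest at the mean and drops off symmetrically on both sides
for x in [mu - 2 * sigma, mu - sigma, mu, mu + sigma, mu + 2 * sigma]:
    print(f"x = {x:.3f} -> height = {normal_pdf(x, mu, sigma):.4f}")
```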

There are many other distributions besides the normal distribution. Some of the most common are:

Chi-squared distribution
Student’s t distribution
Bernoulli distribution
Binomial distribution
Poisson distribution

Each distribution can model different events and phenomena. For example, the Chi-squared distribution is widely used to test whether two phenomena are associated (sunburns and skin cancer, for example).

The Poisson distribution is also used in modeling counts of events, like the number of clients that enter a store per hour or the number of data packets that are transmitted over an Ethernet cable.

But it’s also possible to approximate a lot of distributions to the normal distribution using one of the most important theorems in all of mathematics: the central limit theorem. This is what we will explore next.

How the Central Limit Theorem Helps Approximate the World

Photo by Porapak Apichodilok

The main idea of the central limit theorem is very simple:

When we average many independent samples, the distribution of those averages approaches the normal distribution, almost regardless of the distribution the samples came from.

This is just like pouring sand into a funnel. Grains may fall randomly, but over time the pile of sand will always begin to form the shape of a mountain.

This way, we can take many groups of data points and average each group. As the groups get larger, the distribution of those averages converges to a normal distribution.

In other words, when many independent random variables are summed together, their (properly scaled) sum tends toward a normal distribution.

Here is the formula:

$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{or equivalently} \quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$$

You don’t need to understand in depth what it means. Just understand that it’s a theorem that tells us the averages of samples from other distributions behave like the normal distribution.
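A quick simulation can make this less abstract. The sketch below is an illustration under assumed settings: it draws samples from an exponential distribution (which looks nothing like a bell curve), averages groups of 50 values, and compares the spread of those averages with the σ/√n from the formula. The sample sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

sample_size = 50        # n: how many values go into each average
n_experiments = 10_000  # how many averages we compute

# Exponential distribution with mean 1 and standard deviation 1 (very skewed)
samples = rng.exponential(scale=1.0, size=(n_experiments, sample_size))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
std_of_means = sample_means.std()
expected_std = 1.0 / np.sqrt(sample_size)  # sigma / sqrt(n) from the formula

# For a normal distribution, about 68% of values lie within one standard deviation
within_one_std = np.abs(sample_means - mean_of_means) <= std_of_means
share_within_one_std = within_one_std.mean()

print("Mean of the averages:", mean_of_means)
print("Spread of the averages:", std_of_means, "expected:", expected_std)
print("Share within one standard deviation:", share_within_one_std)
```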

And why is this important?

Because this theorem makes many billion-dollar industries possible.

Instead of testing every single possible scenario, we can test a smaller number of scenarios and assume that if it works for the smaller sample, it will work for the bigger one.

For example, in telecommunications, instead of testing every possible phone call or data transmission, we can just test a few connections. If it works for those few connections, we can assume it will work for millions of phone calls and data transmissions.

For clinical trials, instead of testing a drug on millions of people, we can just test a smaller number of patients. If it works for a (relative) few patients, we can assume it will work on most people with the same condition.

Without this idea, clinical trials would not be possible. The same with telecommunications and so many other areas of engineering.

Bayes Theorem: Learning from Evidence

Now we’ll start looking at probability more in depth based on the data table we have been using.

Here’s the table again so that you can reference it more easily:

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390

Now there are a lot of ideas and formulas related to probabilities. But here, I want to explain to you the core ones that are applied in AI and give you a high-level definition of things.

We’ll start with conditional probability, which is foundational to understanding Bayes’ theorem. Then we’ll get to the extended Bayes’ theorem formula.

So, let's get started!

What is Conditional Probability?

Photo by KOUSHIK BALA

Conditional probability is the probability that an event will happen given that another event has already taken place.

Confused? Don't worry! Let's see an example:

Let’s say that:

A = Farm has rainfall above or equal to 400 mm
B = Farm has a yield above or equal to 5.0 tons/ha

Here is the formula for Conditional Probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Now let’s see this formula more in detail:

$$P(A)$$

This represents the probability that a farm has rainfall above or equal to 400 mm.

We have 6 farms, and 2 of them (farm B and D) have a rainfall above or equal to 400 mm.

So, the probability that a farm has rainfall above or equal to 400 mm is:

$$P(A) = \frac {2}{6} = \frac {1}{3} ≈ 0.33$$

Now let’s see for event B:

$$P(B)$$

This represents the probability that a farm has a yield above or equal to 5.0 tons/ha.

We have 6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha.

So, the probability that a farm has a yield above or equal to 5.0 tons/ha is:

$$P(B) = \frac {3}{6} = \frac {1}{2} = 0.5$$

What about if we want to see both conditions’ probabilities at the same time?

$$P(A \cap B)$$

This refers to the probability of A and B both being true.

In our example, it means the probability that a farm has both a rainfall above or equal to 400 mm and a yield above or equal to 5.0 tons/ha.

We have:

6 farms and 2 of them (farm B and D) have a rainfall above or equal to 400 mm
6 farms and 3 of them (farm B, D and F) have a yield above or equal to 5.0 tons/ha

Only 2 farms (farm B and D) meet both conditions A and B.

This way:

$$P(A \cap B) = \frac {2}{6} = \frac {1}{3} ≈ 0.33$$

Now we’re ready to find out the conditional probability:

$$P(A|B)$$

This means the probability of A, knowing that B is true.

In our example, we can conclude that:

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/3}{1/2} = \frac{2}{3} \approx 0.67$$

So, the probability that a farm has rainfall above or equal to 400 mm – knowing that it has a yield above or equal to 5.0 tons/ha – is about 0.67.
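The same calculation can be done by counting farms in code. This is a small sketch using the table values. The dictionary layout and the helper names are assumptions made for illustration, and `Fraction` is used so the results stay exact.

```python
from fractions import Fraction

farms = {
    "A": {"yield": 4.2, "rainfall": 280},
    "B": {"yield": 5.8, "rainfall": 420},
    "C": {"yield": 3.9, "rainfall": 230},
    "D": {"yield": 6.1, "rainfall": 480},
    "E": {"yield": 4.7, "rainfall": 340},
    "F": {"yield": 5.3, "rainfall": 390},
}

def has_high_rainfall(farm):   # event A
    return farm["rainfall"] >= 400

def has_high_yield(farm):      # event B
    return farm["yield"] >= 5.0

def probability(condition):
    """Share of farms in the table that satisfy the condition."""
    matches = sum(1 for farm in farms.values() if condition(farm))
    return Fraction(matches, len(farms))

p_a = probability(has_high_rainfall)
p_b = probability(has_high_yield)
p_a_and_b = probability(lambda farm: has_high_rainfall(farm) and has_high_yield(farm))
p_a_given_b = p_a_and_b / p_b

print("P(A) =", p_a)
print("P(B) =", p_b)
print("P(A and B) =", p_a_and_b)
print("P(A|B) =", p_a_given_b, "which is about", round(float(p_a_given_b), 2))
```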

Bayes’ Theorem

This is one of the most important theorems in mathematics.

Bayes’ theorem is a formula that tells us how to change the probability of a prediction when new verified data becomes available.

In other words, it’s like a rule that tells us how to update our beliefs when new evidence appears.

Now, based on what we already know, let’s see how Bayes’ Theorem works.

Here is its formula:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}$$

Now, based on the previous values, we can very easily find the probability of B, given that A is true.

In other words, the probability that a farm has a yield above or equal to 5.0 tons/ha given that it has a rainfall above or equal to 400 mm.

Let’s find the answer:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} = \frac{\frac{2}{3} \cdot \frac{1}{2}}{\frac{1}{3}} = 1$$

So, the probability that a farm has a yield above or equal to 5.0 tons/ha, knowing it rained equal to or more than 400 mm, is 100%. This matches the table: the only farms with at least 400 mm of rainfall are B and D, and both have a yield of at least 5.0 tons/ha.
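One way to build trust in Bayes' theorem is to compute P(B|A) twice: once with the standard form of the theorem, P(B|A) = P(A|B) · P(B) / P(A), and once by counting farms directly. The sketch below does both with the table data. The variable names are illustrative assumptions.

```python
from fractions import Fraction

# (yield in tons/ha, rainfall in mm) for farms A to F, taken from the table
farms = [(4.2, 280), (5.8, 420), (3.9, 230), (6.1, 480), (4.7, 340), (5.3, 390)]

high_rain = [rain >= 400 for _, rain in farms]        # event A
high_yield = [crop >= 5.0 for crop, _ in farms]       # event B
both = [a and b for a, b in zip(high_rain, high_yield)]

total = len(farms)
p_a = Fraction(sum(high_rain), total)
p_b = Fraction(sum(high_yield), total)
p_a_given_b = Fraction(sum(both), sum(high_yield))

# Route 1: Bayes' theorem
p_b_given_a_bayes = p_a_given_b * p_b / p_a

# Route 2: count directly among the farms where A is true
p_b_given_a_direct = Fraction(sum(both), sum(high_rain))

print("P(B|A) via Bayes' theorem:", p_b_given_a_bayes)
print("P(B|A) by direct counting:", p_b_given_a_direct)
```

Both routes give the same number, which is the whole point of the theorem: it lets us flip a conditional probability around when counting directly isn't possible.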

Now that we’ve gone through this formula step by step, hopefully it doesn’t feel as complex.

Where is this applied in real life?

As with many math ideas in this book, Bayes' Theorem has applications in many business sectors.

For example, what is the best way to make a control system for a self-driving car, robot, or really any other device?

One effective approach is to use a Kalman filter. Kalman filters rely heavily on Bayes' Theorem to handle control systems with incomplete data.

Kalman filters have a lot of applications in engineering. For example, thanks to Kalman filters, commercial jets can fly safely on autopilot.

So as you can see, Bayes’ Theorem is the foundation of many control systems used in risky industries.

What Are Markov Models? Predicting the Next Step, One Step at a Time

Photo by lil artsy

How do you predict the future with math? Markov chains allow you to do this to a certain degree.

For this reason, Markov chains are widely used in science, engineering, economics, and many other areas.

In addition to this, Markov decision processes are a very important foundation for reinforcement learning. Reinforcement learning is a branch of AI where agents learn to make decisions by interacting with an environment to maximize rewards.

In this section, I’ll introduce you to Markov chains and decision processes with an analogy, a plain English explanation, and a code example.

If you want to dive in further, I recommend my freeCodeCamp article on the subject.

Markov Chain Analogy

Imagine that you want to predict the weather tomorrow, and it only depends on the weather today. The weather can be either sunny or rainy.

Here are the probabilities:

If it's sunny today, there's an 80% chance that it will be sunny again tomorrow, and a 20% chance that it will be rainy.
If it's rainy today, there's a 50% chance that it will be sunny tomorrow, and a 50% chance that it will be rainy.

In this scenario, we can predict future states of the weather based on current states using probabilities.

This idea of predicting the future based solely on probabilities of the present is called a Markov chain.

Here, the states are either sunny or rainy and the probabilities describe the chances of the weather changing based on the current state.
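The weather example can be written as a small transition matrix. In this sketch, each row is "today" and each column is "tomorrow". The variable names are assumptions, and multiplying by the matrix repeatedly is just one simple way to look further ahead.

```python
import numpy as np

states = ["sunny", "rainy"]

# Row = today's weather, column = tomorrow's weather
transition_matrix = np.array([[0.8, 0.2],   # sunny -> sunny, sunny -> rainy
                              [0.5, 0.5]])  # rainy -> sunny, rainy -> rainy

today = np.array([1.0, 0.0])  # we know it is sunny today

tomorrow = today @ transition_matrix
day_after = tomorrow @ transition_matrix

# Far into the future, today's weather stops mattering
long_run = today @ np.linalg.matrix_power(transition_matrix, 50)

for label, forecast in [("Tomorrow", tomorrow), ("Day after", day_after), ("Long run", long_run)]:
    print(label, dict(zip(states, np.round(forecast, 3))))
```

Notice that each forecast only needs the previous day's probabilities, never the full weather history.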

Markov Chain Explained in Plain English

A Markov chain describes random processes where systems move between states, and a new state only depends on the current state, not on how it got there.

Mathematically, Markov chains are called stochastic models because they model (simulate) real life events that are random by nature (stochastic).

Markov chains are popular because they are easy to implement and efficient at modeling complex systems.

Another key advantage is their "memoryless" property. This makes them fast to run on computers and a powerful tool for studying random processes and making predictions based on current conditions.

Applications of Markov Chains

Photo by Google DeepMind

At some level, almost all real-life events are stochastic. In other words, they involve randomness and uncertainty.

This is exactly why they are so widely used.

They can predict the behavior of systems based on current conditions:

In finance, they are used to model changes in credit ratings and to forecast market regimes.
In genetics, they help understand how proteins change over time (which is important when studying genetic variations).

These real-life examples show how effectively Markov chains can be used to solve real problems in different fields.

In AI, Markov chains are used to model an environment like a factory or home. When we extend a Markov chain with actions an agent can take and rewards for the outcomes, the result is called a Markov decision process.

Using a Markov decision process, it’s possible to use reinforcement learning to create and optimize agents to act in the environment.

Of course, new and better variants of the Markov decision process have appeared over the years. But the key idea here is that it is thanks to Markov decision processes that the basis for reinforcement learning exists.

Reinforcement learning is widely used in advertising systems, logistics, robotics, video games, and many more applications.

Types of Markov Chains

There are many types of Markov chains. In this section, we'll only discuss the most important variants.

Discrete-Time Markov Chains (DTMCs)

In DTMCs, the system changes state at specific time steps. They are called discrete because the state transitions occur at distinct, separate time intervals.

They are used in queuing theory (study of the behavior of waiting lines), genetics, and economics because they are simple to analyze.

Continuous-Time Markov Chains (CTMCs)

CTMCs differ from DTMCs in that state transitions can occur at any continuous time point, not at fixed intervals.

This makes them stochastic models where state changes happen continuously. This is important in chemical reactions and reliability engineering.

Reversible Markov Chains

Reversible Markov chains are special. The process of state change is the same whether the direction is forwards or backwards, like rewinding a video and playing it again.

This property makes it easier to know when a system is stable and to study how a system behaves over time. They are widely used in statistical physics and economics.

Doubly Stochastic Markov Chains

Doubly stochastic Markov chains are defined by a transition probability matrix. In the matrix, the sum of the probabilities in each row and each column equals 1.

This means each row and each column represent a valid probability distribution. In other words, each row and column represent a list of chances for different outcomes.
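Checking this property in code is straightforward. The sketch below uses a hypothetical helper, `is_doubly_stochastic`, and two example matrices: a symmetric one made up for illustration, and the sunny/rainy weather matrix from earlier, whose columns do not sum to 1.

```python
import numpy as np

def is_doubly_stochastic(matrix, tolerance=1e-9):
    """Return True if all entries are non-negative and every row and column sums to 1."""
    matrix = np.asarray(matrix, dtype=float)
    non_negative = np.all(matrix >= 0)
    rows_ok = np.allclose(matrix.sum(axis=1), 1.0, atol=tolerance)
    columns_ok = np.allclose(matrix.sum(axis=0), 1.0, atol=tolerance)
    return bool(non_negative and rows_ok and columns_ok)

symmetric_example = [[0.7, 0.3],
                     [0.3, 0.7]]   # hypothetical matrix: rows and columns both sum to 1

weather_example = [[0.8, 0.2],
                   [0.5, 0.5]]     # rows sum to 1, but the columns sum to 1.3 and 0.7

print("Symmetric example:", is_doubly_stochastic(symmetric_example))
print("Weather example:  ", is_doubly_stochastic(weather_example))
```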

This property is crucial in quantum computing and statistical mechanics.

Thanks to doubly stochastic Markov chains, systems change in a way that preserves probabilities and symmetry, which helps make the modeling and analysis of quantum computing systems more accurate.

Hidden Markov Chains Code Example

Photo by Kevin Ku

Before we jump into code examples, let’s first understand what Hidden Markov Chains are.

The main idea behind hidden Markov chains is to model systems that have hidden states (states for which we don’t know their values) which can only be discovered through observable events.

In other words, hidden Markov chains allow us to predict the behavior of a system by:

Considering the likelihood of moving from one state to another.
Knowing the probability of observing a certain event from each state.

We can understand this by observing how the states change from an indirect point of view.

We may not know the states’ original values. But by knowing the way they change, we can predict what their values will be in the future.

This way, hidden Markov chains are flexible in modeling sequences, capturing both the transitions between hidden states and the observable outcomes.

Because of this, hidden Markov models are used in fields such as engineering, financial modeling, speech recognition, bioinformatics, and many more.

Code Example:

In this code example, we’ll see a simple example with synthetic data.

Here is the full code:

import numpy as np
from hmmlearn import hmm

# Set random seed for reproducibility
np.random.seed(42)

# Define the HMM parameters
n_components = 2  # Number of states
n_features = 1    # Number of observation features

# Create a Gaussian HMM
model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

# Define initial state probabilities and transition matrix (rows must sum to 1)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

# Define means and covariances for each state
model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

# Generate synthetic observation data
X, Z = model.sample(100)  # 100 samples

# Create a new HMM instance
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

# Fit the model to the data
new_model.fit(X)

# Print the learned parameters
print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

# Predict the hidden states for the observed data
hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

Now let’s break the code down block by block:

Import libraries and set random seed:

import numpy as np
from hmmlearn import hmm

np.random.seed(42)

In this block of code, we imported two Python libraries:

NumPy: For numerical operations.
hmmlearn: For hidden Markov model implementation.

Next we defined a random seed with the NumPy library. A random seed is a value used to start a pseudorandom number generator.

With a fixed random seed, we can ensure that the sequence of pseudorandom numbers generated is always the same. This allows us to duplicate experiments and verify results.

The specific value of the seed doesn’t matter as long as it remains consistent.

Define the HMM parameters and create a Gaussian HMM:

n_components = 2  # Number of states
n_features = 1    # Number of observation features

model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

In this code block, we created an HMM with two hidden states and a single observed variable.

covariance_type "diag" means the matrices that represent covariance (how two variables change together) are diagonal. In other words, each observed feature is assumed to vary independently of the others within a state.

In our example there is only one observed feature, so each state's covariance is simply a single variance value.

But there is still something strange when we defined the hidden Markov chain:

What does “Gaussian” mean?

This is a very big topic in statistics, but in a few words, a hidden Markov model needs three ingredients: an initial probability distribution, the transition probabilities (chances of moving from one state to another), and a rule describing which observations each hidden state tends to produce.

A Gaussian HMM assumes the observations produced by each hidden state follow a Gaussian distribution, also called a normal distribution!

And recall, we have already seen before what a normal distribution is.

Here it is again:

From a normal distribution and other components, we can create a hidden Markov chain. And hidden Markov chains serve as a foundation for systems that affect millions of lives.

Define transition matrix, means, and covariances for each state:

model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

model.startprob_ = np.array([0.6, 0.4])

This line sets the initial state probabilities for a Hidden Markov Model (HMM). It points out that there is a 60% probability of starting in state 0 and a 40% probability of starting in state 1.

model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])

This line of code sets the state transition probability matrix for the HMM.

The matrix specifies the probabilities of moving from one state to another:

From state 0, there is a 70% chance of staying in state 0 and a 30% chance of transitioning to state 1.
From state 1, there is a 40% chance of transitioning to state 0 and a 60% chance of staying in state 1.

model.means_ = np.array([[0.0], [3.0]])

This line sets the mean values for the observation distributions in each state.

It indicates that the observations are normally distributed with a mean of 0.0 in state 0 and a mean of 3.0 in state 1.

model.covars_ = np.array([[0.5], [0.5]])

This line sets the covariance values for the observation distributions in each state.

It specifies that the variance (covariance in this 1-dimensional case) of the observations is 0.5 for both state 0 and state 1.

Create data, new HMM instance, and fit the model with the data:

X, Z = model.sample(100)  # 100 samples

new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

new_model.fit(X)

print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

In this code, we generated 100 samples from the original model, fitted a new model to them using up to 100 training iterations, and printed the new state transition matrix, means, and covariances.

In other words, we:

Generated 100 samples from the original model
Fitted a new HMM to these samples.
Printed the learned parameters of this new model.

What do X and Z mean here?

X means the observed data samples generated by the original model, while Z means the hidden state sequences corresponding to the observed data samples generated by the original model.

The transition matrix prints out:

[[0.8100804  0.1899196 ]
 [0.49398918 0.50601082]]

Which means that the model tends to stay in state 0 and has nearly equal chances of switching or staying when in state 1.

The means print out:

[[0.01577373]
 [3.06245496]]

Which means that the average observed value is approximately 0.016 in state 0 and 3.062 in state 1.

The covariances print out:

[[[0.41987084]]
 [[0.53146802]]]

Which means that the variance of the observed values is about 0.420 in state 0 and 0.531 in state 1.

This way, we may never know the exact values of the states, but we know their average observed value and how they vary and tend to change with each other.

Predict the hidden states for the observed data:

hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

In this code, based on the X observed data samples, we inferred the most likely hidden state behind each observation.

The hidden states print out:

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1
 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0]

Which means that the hidden states switch between state 0 and state 1, showing how the system changes states over time.
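Because the data here was sampled from a known model, the predicted states can be compared with the true ones stored in Z. One caveat is that a fitted HMM may number its states differently from the original model, so a fair comparison should allow the labels 0 and 1 to be swapped. The helper below, aligned_accuracy, is a hypothetical function written for this illustration, and the two short sequences are made up rather than taken from the run above.

```python
import numpy as np

def aligned_accuracy(true_states, predicted_states):
    """Fraction of matching states for a 2-state model, allowing the
    labels 0 and 1 to be swapped between the two sequences."""
    true_states = np.asarray(true_states)
    predicted_states = np.asarray(predicted_states)
    direct = np.mean(true_states == predicted_states)
    swapped = np.mean(true_states == 1 - predicted_states)
    return max(direct, swapped)

# Toy sequences (made up for illustration, not output of the model above)
z_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
z_pred = np.array([1, 1, 0, 0, 1, 1, 1, 1])  # labels swapped, one mistake

accuracy = aligned_accuracy(z_true, z_pred)
print(accuracy)  # 0.875
```

With the variables from this section, the call would be aligned_accuracy(Z, hidden_states).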

Applications in AI and Control Theory: Making Decisions Under Uncertainty

Photo by capt.sopon

I have been giving you a high-level overview of the field of probabilities and statistics. As I explained before, I wanted to make the explanations simple to understand.

As someone with a bachelor's degree in electrical and computer engineering, I can assure you that while this chapter seems simple, in probabilities and statistics, things can get very complicated very quickly.

Many other concepts are not as straightforward as the ideas I’ve just told you about, such as:

p-values
Advanced Monte Carlo methods
Bayesian networks
Statistical hypotheses

But as it is, probability and statistics are the starting points for making decisions where uncertainty exists in AI and control theory.

For example, Bayes’ theorem, besides being the foundation of the Kalman filter, is also the foundation of many probabilistic models in the field of AI. Probabilistic models are usually used in quant firms and banks to model risk.

In control theory, probabilities and statistics are widely used to design robust control systems (as is the case with Kalman filters).

So as you can see, the application of probabilities and statistics, as with calculus and linear algebra, is the foundation for many tools that impact millions of lives and move billions of dollars in the global economy.

Chapter 7: Optimization Theory - Teaching Machines to Improve

Photo by Pixabay

This is the most advanced math chapter of the book. To truly understand it, it’s very important that you’ve read the other chapters first.

We’re going to examine a few machine learning methods, and I’ll show you some recipes of how machine learning is just the use of linear algebra, calculus, probabilities and statistics, and optimization theory.

Just like making a cake!

What is Optimization Theory?

In AI, optimization theory is responsible for the algorithms that optimize data-driven AI models.

Often, big companies invest millions in research to create or refine algorithms that make training AI models faster.

This way, companies save far more money than the upfront research costs when scaling to train multiple large AI models.

It is thanks to optimization theory that deep learning was able to scale efficiently, eventually leading to the creation of ChatGPT and many other large language models.

But why is that?

In all data-driven machine learning models, there is a learning phase that has to happen. That is, there’s a period where the algorithms make predictions that are not correct and then need to change some parameters to make sure the next predictions are correct – or at least closer to being correct.

Without optimization, machine learning algorithms don't get anywhere on their learning path to the right solution. Without optimization, they spend too much time on a learning path that won’t increase their ability to predict things the right way.

So, let’s start learning!

Why Optimization Drives Learning in AI

Photo by Alex Knight

Optimization theory is the mathematical foundation that allows algorithms to improve their performance over many iterations.

When we combine an algorithm with a path to change its parameters to meet a certain objective (done with an optimization method), it’s called a machine learning algorithm.

This learning process always involves minimizing or maximizing a certain objective. For example, for many machine learning algorithms, the main objective is to minimize errors. To do this, over many iterations, the optimization method "tells" the internal components of an algorithm what to change after receiving feedback on how well it’s performing.

It’s like someone first learning how to drive a car. The first few times, it may be complicated. But after a while and some practice, the driver learns how to drive properly and not make the same mistakes they once did in the past with the help of the instructor.

The same applies to optimization methods when optimizing algorithms.

Types of Optimization Theory Methods in ML and Deep Learning

The field of optimization theory is huge! Just as with many fields of mathematics, it is constantly growing every year.

But for the purposes of this book, there are three main categories of optimization methods:

First-Order Methods

These are the most used in deep learning, including for training LLMs like Gemini, Grok, and others.

They are called first-order methods because they all use the first derivative of functions. The first derivative of a function measures how much a function's output changes when its input changes very little. The most widely used in deep learning are advanced variants of gradient descent.

While there are many variants, here are some popular examples:

Standard batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
RMSprop
Adam

In this chapter, we will look in depth at one of these methods called Adam (below).
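Before getting to Adam, it may help to see the simplest first-order method, plain gradient descent, on a one-variable problem. The sketch below minimizes a toy function f(x) = (x - 3)^2. The function, starting point, and learning rate are arbitrary choices made for illustration.

```python
# Minimize f(x) = (x - 3)^2, whose minimum is at x = 3
def f(x):
    return (x - 3) ** 2

def df(x):
    # First derivative of f: how fast f changes as x changes
    return 2 * (x - 3)

x = 0.0              # starting guess (arbitrary)
learning_rate = 0.1  # step size (arbitrary choice for this sketch)

for step in range(100):
    x = x - learning_rate * df(x)  # move against the slope

print(x, f(x))
```

Each step only needs the first derivative at the current point, which is why this family of methods stays cheap even when there are billions of parameters.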

Second-Order Methods

They are called second-order methods because they use information from second derivatives (or, in the case of BFGS and L-BFGS, an approximation of it built from gradients) for better updates. There are many methods, like:

BFGS
L-BFGS
Newton's method

But these are not often used in machine and deep learning. While they optimize with fewer iterations, for the type of optimization problems algorithms in AI create (high-dimensional problems), they’re very computationally expensive.

So they’re not widely used like first-order optimization methods.
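To illustrate the trade-off, here is a small sketch comparing one Newton step with plain gradient descent on the toy function f(x) = (x - 3)^2. The function and step size are assumptions chosen for illustration. On a quadratic like this, Newton's method lands on the minimum in a single step, because the second derivative tells it exactly how far to go.

```python
def df(x):
    return 2 * (x - 3)   # first derivative of f(x) = (x - 3)^2

def d2f(x):
    return 2.0           # second derivative (constant for a quadratic)

# Newton's method: divide the slope by the curvature
x_newton = 0.0
x_newton = x_newton - df(x_newton) / d2f(x_newton)  # a single Newton step

# Gradient descent: count the steps needed to get within 1e-6 of the minimum
x_gd, gd_steps = 0.0, 0
while abs(x_gd - 3) > 1e-6:
    x_gd = x_gd - 0.1 * df(x_gd)
    gd_steps += 1

print("Newton after 1 step:", x_newton)
print("Gradient descent steps needed:", gd_steps)
```

The catch is that with n parameters the second derivative becomes an n-by-n matrix (the Hessian). For a model with a billion parameters, that matrix would have 10^18 entries, which is why these methods are rarely practical in deep learning.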

Zeroth-Order and Other Methods

These methods do not require derivatives to optimize algorithms. Some examples of algorithms where derivatives are not used are:

Genetic algorithms
Dynamic programming algorithms
Particle swarm optimization methods

The problem with these algorithms is that they are often very slow for many variables.

But in certain AI contexts, they can help optimize the architecture of deep learning models to improve AI models from an architectural point of view (instead of a parameter point of view).
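As a minimal sketch of the derivative-free idea, the code below uses random hill climbing, a much simpler relative of the methods listed above, which I have swapped in here for brevity. It proposes a random nearby value and keeps it only if the objective improves. The objective, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def f(x):
    return (x - 3) ** 2   # toy objective; no derivative is used below

rng = np.random.default_rng(0)
best_x = 0.0
best_value = f(best_x)

for _ in range(500):
    candidate = best_x + rng.normal(0.0, 0.5)  # random nearby guess
    value = f(candidate)
    if value < best_value:                     # keep it only if it is better
        best_x, best_value = candidate, value

print(best_x, best_value)
```

Many of the 500 evaluations are wasted on rejected guesses, and with many variables a random guess is much less likely to point in a useful direction.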

How does optimization theory connect with linear algebra, calculus, and probability and statistics?

Essentially:

Calculus teaches you derivatives, which help you understand optimization theory.
Linear algebra teaches you matrices, which help you understand how different states relate and transform.
Probability and statistics teach you concepts like covariance and correlation, which help you understand how variables are connected with each other.

This way, with linear algebra and probability and statistics, you gain the knowledge necessary to understand the algorithms. With calculus you gain the basis to understand optimization theory and how it changes certain parameters of the fundamental algorithms to minimize/maximize a certain objective.

Simple Optimization Techniques: How Machines Learn Step by Step

Photo by LJ Checo

Now, we’re going to see examples of machine learning algorithms that rely on optimization and deconstruct them so that you can understand how these areas of mathematics apply to them.

In each example, I will explain their main idea with an analogy as well as how each math area is used in each algorithm.

Linear Regression

Imagine that you are solving a puzzle. To complete the puzzle, you need to arrange the pieces in the right design/order.

The same idea applies to linear regression.

We have matrices (linear algebra) that represent the parameters of the linear regression model and the data that flow into it.

And we can see over time how well the line is fitting the numbers, as well as its error (probabilities and statistics).

To find the best line for the linear regression, we need to know how much the parameters of the model need to change (calculus) and actually apply that change to the parameters (optimization theory).

This way, calculus tells us which direction to change the parameters, and optimization theory tells us how much to actually change them.

Let’s see how to code the linear regression above:

import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []

for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db

Let’s see the code block by block:

Import library:

import numpy as np

For this problem, we’ll import one of the most used Python libraries: NumPy (which we’ve worked with earlier in the book).

Create data points:

np.random.seed(42)
X = np.linspace(0, 10, 50)
y_true = 3 * X + 2
noise = np.random.normal(0, 2, 50)
y = y_true + noise

In this code, we define a base line that will help in generating the data points:

X = np.linspace(0, 10, 50)
y_true = 3 * X + 2

After this base line has been created, we add noise to it to create the data points:

noise = np.random.normal(0, 2, 50)
y = y_true + noise

This is how we defined the data points for the line dataset.

Initializing linear regression parameters and others:

w = 0.1 
b = 0.5
learning_rate = 0.01
iterations = [0, 1, 2, 3, 4, 5]
saved_states = []

In this block of code, we initialize:

Linear regression parameters: Weight to be 0.1 and bias to be 0.5
One hyperparameter: Learning rate
How many iterations we are going to use to improve the linear regression
An array called saved_states to store values to later create graphs

This way, we start with this red line:

Making the linear regression learn with the data:

for epoch in range(max(iterations) + 1):
    y_pred = w * X + b
    error = np.mean((y - y_pred) ** 2)
    
    if epoch in iterations:
        saved_states.append({
            'epoch': epoch,
            'w': w,
            'b': b,
            'y_pred': y_pred.copy(),
            'error': error
        })
    
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    
    w = w - learning_rate * dw
    b = b - learning_rate * db

It may appear complicated, but let’s see in smaller blocks:

For loop

for epoch in range(max(iterations) + 1):

Making a prediction and seeing its error

y_pred = w * X + b
error = np.mean((y - y_pred) ** 2)

In this block of the code, we compute the values predicted with the current parameters and measure their error against the real values using the mean squared error.

Saving current iteration values for future statistics

if epoch in iterations:
    saved_states.append({
        'epoch': epoch,
        'w': w,
        'b': b,
        'y_pred': y_pred.copy(),
        'error': error
    })

Here we are just storing the values of the current iteration in the saved_states array so that we can create graphs from them later.

Finding the gradients

dw = -2 * np.mean(X * (y - y_pred))
db = -2 * np.mean(y - y_pred)

In this block of code, we find the gradient values for the current prediction.

In other words, for the weight and bias, we find out in which direction and how strongly the error changes when each of them changes, so that we know how to adjust them to bring the line closer to the data points.
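A common way to check gradient formulas like these is to compare them with a numerical estimate: nudge a parameter slightly up and down, and see how much the error changes. The sketch below rebuilds the same dataset and does that comparison. The helper mse and the step h are names I introduce for illustration.

```python
import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 2, 50)

def mse(w, b):
    # Mean squared error of the line w * X + b on the dataset
    return np.mean((y - (w * X + b)) ** 2)

w, b = 0.1, 0.5
y_pred = w * X + b

# Analytic gradients, same formulas as in the training loop
dw = -2 * np.mean(X * (y - y_pred))
db = -2 * np.mean(y - y_pred)

# Numerical gradients via central finite differences
h = 1e-6
dw_numeric = (mse(w + h, b) - mse(w - h, b)) / (2 * h)
db_numeric = (mse(w, b + h) - mse(w, b - h)) / (2 * h)

print(dw, dw_numeric)
print(db, db_numeric)
```

The two pairs of numbers should agree to several decimal places, which confirms that dw and db really are the derivatives of the mean squared error with respect to w and b.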

Updating the parameter values

w = w - learning_rate * dw
b = b - learning_rate * db

Finally, we update the weight and the bias with the new values so that the line better approximates the data points:
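As a sanity check on the whole loop, the sketch below runs the same update rule for many more epochs than the six used above and compares the result with the closed-form least-squares fit from np.polyfit. The choice of 5,000 epochs is an assumption made so that the loop has time to converge.

```python
import numpy as np

np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 3 * X + 2 + np.random.normal(0, 2, 50)

w, b, learning_rate = 0.1, 0.5, 0.01
for epoch in range(5000):          # many more epochs than the 6 used above
    y_pred = w * X + b
    dw = -2 * np.mean(X * (y - y_pred))
    db = -2 * np.mean(y - y_pred)
    w = w - learning_rate * dw
    b = b - learning_rate * db

# Closed-form least-squares fit of a straight line, for comparison
w_exact, b_exact = np.polyfit(X, y, 1)

print("Gradient descent:", w, b)
print("Least squares:   ", w_exact, b_exact)
```

Both approaches should arrive at essentially the same line. For linear regression the closed-form solution exists, but the step-by-step approach is the one that carries over to neural networks, where no such formula is available.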

Neural Networks

The same puzzle idea applies to neural networks. Neural networks are algorithmic models inspired by the brain that learn patterns from data. They are part of a machine learning field called deep learning, which uses neural networks to learn complex patterns.

Neural networks are important because they power modern AI applications like:

Image recognition
Language translation
Chatbots

For example, ChatGPT means Chat Generative Pre-trained Transformer. A transformer is an architecture of neural networks.

If you understand neural networks, you’ll understand the foundations that make ChatGPT work.

We have matrices (linear algebra) that represent the parameters of the neural network model and the data that flow into it.
And we can know over time how well the neural network model is converging to the dataset, fitting the numbers, and see its error (probabilities and statistics).
Calculus will tell us in which direction the parameters of the neural network need to change.
Optimization theory will tell us how much they need to change.

For example, this is a neural network:

This model has in total 13 parameters:

It has 10 lines (connections between circles). These are called weights.
It has 2 circles in the hidden layer and 1 in the output layer. Each circle has one bias.
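The count can be checked by writing down the shape of each group of parameters. The sketch below uses NumPy arrays filled with zeros purely to count entries. The array names are my own.

```python
import numpy as np

# Shapes match the network described above: 4 inputs -> 2 hidden -> 1 output
hidden_weights = np.zeros((4, 2))   # 8 connections into the hidden layer
hidden_biases = np.zeros(2)         # one bias per hidden node
output_weights = np.zeros((2, 1))   # 2 connections into the output node
output_biases = np.zeros(1)         # one bias for the output node

n_weights = hidden_weights.size + output_weights.size
n_biases = hidden_biases.size + output_biases.size
print(n_weights, n_biases, n_weights + n_biases)  # 10 3 13
```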

Big question:

Imagine you work in a bank. You are in charge of deciding who gets credit cards or not. For that, you create the neural network above that takes 4 inputs:

Income
Credit score
Debt ratio
Bankruptcy history

With this neural network well optimized, you can figure it out!

Very simply, without going into things like activation functions, the network processes the 4 inputs through its weights and biases.

Each connection multiplies the input by its weight. After that, each node adds up its weighted inputs and then adds its bias.

The final output is a number between 0 and 1:

Numbers close to 0 mean "Not approved"
Numbers close to 1 mean "Approved"

For example, a high income figure, a good credit score, and no bankruptcy history data flow through the neural networks and produce 0.92. This means that it should be approved.

But a low income figure with a history of bankruptcy may produce 0.15, which results in a not approved.
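The flow described above can be sketched in a few lines of NumPy. This is only an illustration: the weights and biases are hypothetical, hand-picked numbers (a trained model would learn them), the inputs are assumed to be standardized, and I assume a ReLU in the hidden layer and a sigmoid at the output, which is what the PyTorch model later in this chapter uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes any number into (0, 1)

def forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # weighted sum + bias, then ReLU
    return sigmoid(hidden @ W2 + b2)       # weighted sum + bias, then sigmoid

# Hypothetical, hand-picked parameters (a trained model would learn these)
W1 = np.array([[0.8, -0.2], [0.9, -0.1], [-0.7, 0.6], [-1.2, 0.9]])
b1 = np.array([0.1, 0.0])
W2 = np.array([[1.5], [-2.0]])
b2 = np.array([-0.5])

# Standardized inputs: income, credit score, debt ratio, bankruptcy history
applicant = np.array([1.2, 1.0, -0.5, 0.0])
score = forward(applicant, W1, b1, W2, b2)
print(score)  # a single value between 0 and 1
```

The matrix products x @ W1 and hidden @ W2 are the linear algebra part. Finding good values for W1, b1, W2, and b2 is the job of calculus and optimization.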

In reality, bank systems and others have neural networks that take far more well-chosen inputs and decide this automatically.

This is precisely how AI can be used for credit approval.

But a question remains: What is the best way to know how much the parameters need to change?

In the next part, we are going to see the most famous optimization theory algorithm that will help us decide that.

What is Adam? The Most Popular Way AI Models Find the Best Learning Path

Photo by Lum3n

To optimize neural network based AI models, one of the most popular methods is called Adam, which means Adaptive Moment Estimation.

The paper that introduced the method is one of the most influential in the 21st century in machine learning, with thousands of citations. As with all ideas in non-symbolic AI, Adam is a mixture of different math concepts.

It's composed of the ideas of two other optimization methods:

Momentum Gradient Descent: Accumulates velocity from previous gradients to move faster in consistent directions
Root Mean Square Propagation (RMSProp): Adapts learning rates based on recent gradient magnitudes

Let's understand them with an analogy.

Imagine that you are riding a bicycle down a mountain little by little. You already know the direction thanks to calculus.

But how do you descend safely without losing control or going too slowly?

First, you need to build up speed gradually using past momentum. This is one of the main ideas of momentum gradient descent.

It's also important that you adjust your speed based on the terrain's elevation. This is the main idea of RMSProp.

This way, you can safely accelerate and brake appropriately.

When optimizing a model with Adam, this is the same concept. With Adam, we want to optimize a model in a fast and stable way.

Momentum gradient descent ensures the fast part, and RMSProp ensures the stable part.
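To show how the two ideas combine, here is a minimal from-scratch sketch of the Adam update on a toy one-variable function, f(x) = (x - 3)^2. The objective, the learning rate of 0.1, and the number of steps are assumptions made for illustration. In practice you would use a library implementation such as the one in PyTorch.

```python
import numpy as np

def df(x):
    return 2 * (x - 3)   # gradient of the toy objective f(x) = (x - 3)^2

x = 0.0
learning_rate = 0.1                   # arbitrary choice for this sketch
beta1, beta2, eps = 0.9, 0.999, 1e-8  # defaults suggested in the Adam paper
m, v = 0.0, 0.0                       # running averages start at zero

for t in range(1, 501):
    g = df(x)
    m = beta1 * m + (1 - beta1) * g        # momentum: average of past gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # RMSProp: average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(x)
```

The m line is the momentum part (the speed built up on the bicycle), and dividing by the square root of v_hat is the RMSProp part (adjusting to the terrain).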

Nowadays, for LLMs, which once again are just very big neural network models, a variant of Adam called AdamW is more often used.

Now, let's build a code example of using Adam.

Code example:

Using Adam, we are going to optimize this neural network based on fake data.

It will take 4 features:

Income
Credit score
Debt ratio
Bankruptcy history

And it will tell us if we should or should not approve credit for a given person.

Also, since this book is an introduction to the math of AI, I will not, in this code example, discuss hyperparameter optimization, regularization techniques, and other more advanced topics and good practices.

I want to show why this neural network fails with this data and explain the importance of using great data.

Here is the whole code (and we’ll see each part more in-depth below):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt

torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)

# Plot the training loss recorded at each training step
plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()

Now let’s break it down:

Importing libraries:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import pytorch_lightning as pl
import matplotlib.pyplot as plt

In this block of code, we are importing code from 3 Python libraries:

PyTorch: One of the most popular Python libraries to create new AI models in AI research
PyTorch Lightning: A PyTorch wrapper that organizes training code and handles repetitive tasks automatically
Matplotlib: One of the most popular Python libraries to make graphs from data

Creating data:

torch.manual_seed(42)
x = torch.randn(10000, 4)
y = torch.randint(0, 2, (10000, 1)).float()
dataset = TensorDataset(x, y)

In this part, we define a seed to make the random numbers reproducible. In other words, when we run the code many times, the same random numbers will be generated.

Next, we create 10,000 applications for credit with 4 features in x and their approval decisions in y. Note that y is drawn at random, independently of x. After that, we unify everything in the dataset variable.

We’ll use TensorDataset because it allows us to have the 4 features and the target paired together. This way, the data does not get mixed up during training.

Dividing data:

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

In this block of code, we divide the data into a training dataset and a validation dataset.

This way, we have one dataset that’s being used to train and find the parameters while comparing results with the validation dataset.

As we can see, 80% of the data will be training data, and 20% of the data will be validation data.

Loading data:

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

Here, we load the data into data loaders for the AI model to use.

This way, we have the data automatically split into small batches, and the training data is shuffled. So instead of processing all 8,000 training data points at once, the model will be trained on one batch of 32, improved, then another batch, then improved again, and so forth. That makes training go faster.

Creating AI model and training process:

class CreditApprovalNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []
    
    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

This code block appears to be complicated, but let’s see each method block by block:

Creating the class with inheritance:

class CreditApprovalNet(pl.LightningModule):

This way, in one line, we inherit everything we need to define both the model and how it will be trained.

init: Builds the model's layers and components:

    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 2)
        self.relu = nn.ReLU()
        self.output = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()
        self.loss_fn = nn.BCELoss()
        self.train_losses = []

In this section of the code, we are defining the architecture of the AI model.

forward: Processes input data through the network to make predictions:

    def forward(self, x):
        x = self.relu(self.hidden(x))
        return self.sigmoid(self.output(x))

In this part of the code, we are defining how data will flow in the AI model based on the architecture defined.

training_step: Calculates loss for each batch during training:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y)
        self.log('train_loss', loss)
        self.train_losses.append(loss.item())
        return loss

Here, we are defining how the model will be trained. In other words, how we will find the best parameters for the model to predict well.

configure_optimizers: Sets the Adam optimizer with learning rate:

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=0.0001)

Finally, here we are defining what optimizer we are going to use to, step by step, improve the AI model parameters.

Training AI model:

model = CreditApprovalNet()
trainer = pl.Trainer(max_epochs=100, logger=False, enable_checkpointing=False)
trainer.fit(model, train_loader, val_loader)

In this block of code:

We create the neural network model in the first line
In the 2nd and 3rd lines, we prepare the training settings and train the model for 100 epochs

This way, in the command line, this appears:

The PyTorch code is essentially telling us the number of parameters in the AI model!

Seeing results and understanding why they are not good:


plt.plot(model.train_losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Credit Approval Training')
plt.grid(True, alpha=0.3)
plt.show()

Using the Matplotlib library, we plot the results:

The AI model is not converging.

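We can see that because the loss stays near 0.69 over time. That is the value (ln 2 ≈ 0.693) a binary classifier gets from always predicting 0.5, so the model is doing no better than guessing.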

The main reason the model is not converging well is that there is little to no relationship between the 4 features and the target variable.

In other words, we do not have good data.
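The plateau in the loss can be reproduced without training anything. The sketch below computes the binary cross-entropy (the quantity that nn.BCELoss averages over a batch) for a stand-in model that always outputs 0.5 on random labels. The labels here are generated with NumPy, so they are not the same random values as in the PyTorch code.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=10_000).astype(float)  # random 0/1 labels
p = np.full_like(y, 0.5)                           # always predict 0.5

# Binary cross-entropy averaged over all examples
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce, np.log(2))
```

Both printed numbers are about 0.693. A training loss that hovers around this value is a sign that the model has found nothing in the features that helps predict the labels.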

The code works perfectly, but this shows the most important rule in machine learning: when we create an AI model, the MOST IMPORTANT thing is data.

It does not matter if you use a simple linear regression or a neural network based on transformers or whatever. If you do not have high quality data, the model is not going to perform well.

Even if we use a good optimizer, like Adam, it will not solve the data problem.

Next steps: Common beginner mistakes

I also wrote this exact code example to show you something very important: neural networks are not always the best models to use.

This is a very common beginner mistake. You may start with neural networks for everything, when often classical machine learning methods with little data preprocessing do the job well.

For this type of problem, the solution is to first try classical machine learning methods (like linear or logistic regression and decision trees) instead of going straight to neural networks.

There are many reasons for this, but the main ones are:

Classical machine learning methods are simpler and often quicker to train than neural networks
It is simpler to understand how classical machine learning methods make decisions. In other words, we can understand how the model arrived at a prediction.
With computational learning theory, we can estimate for certain machine learning models how well they will predict in the future and provide theoretical guarantees about their performance.

Another common mistake is not dividing the data.

To simplify, I created only a training and validation division of the data.

In a serious project, you should always divide it into 3 parts: training, validation, and testing.

With training, you create the model. With validation, you check the model on data it was not trained on and use the results to tune it. With the test dataset, which you only use at the end, you check whether the loss is similar to the validation loss. If they are very different, it means that the AI model was fitted too closely to the validation dataset and does not generalize to new data.
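A three-way split can be sketched with shuffled indices. The 70% / 15% / 15% proportions below are a common choice that I use only as an example, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000
indices = rng.permutation(n_samples)   # shuffle before splitting

# 70% / 15% / 15%, computed with integer arithmetic to avoid rounding issues
n_train = n_samples * 70 // 100
n_val = n_samples * 15 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 7000 1500 1500
```

In the PyTorch code from this chapter, the same effect can be obtained by passing three sizes to random_split instead of two.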

I challenge you to think further about how you could improve this code and to try to make the synthetic data more correlated in order to improve its quality.
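As one possible starting point for that challenge, the sketch below generates labels that actually depend on the features. The rule and its weights are hypothetical: I simply assume that income and credit score help, while debt ratio and bankruptcy history hurt, and add some noise so the relationship is not perfect.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(10_000, 4))   # income, credit score, debt ratio, bankruptcy

# Hypothetical rule: income and credit score help, debt and bankruptcy hurt
true_weights = np.array([1.5, 2.0, -1.0, -2.5])
noise = rng.normal(0.0, 0.5, size=10_000)
y = (x @ true_weights + noise > 0).astype(float)  # 1 = approved, 0 = not

# Sanity check: a rule that uses the features now beats a coin flip
rule_accuracy = np.mean((x @ true_weights > 0) == (y == 1))
print(rule_accuracy)
```

With labels built this way, there is a real pattern for the network to find, so the training loss should be able to drop well below the 0.69 plateau seen earlier.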

Applications in AI and Control Theory of Optimization Theory

Photo by Tara Winstead

Optimization theory serves as the engine behind AI and control systems that shape our lives.

From unlocking your phone with facial recognition to autopilot systems guiding planes, optimization algorithms are constantly at work.

When you ask ChatGPT a question, optimization theory determines the values of billions of parameters during training.

The same is true for all other LLMs like Gemini, Claude, Grok, DeepSeek, and others. All of them contain billions of parameters. The only way to find the best combination of the parameters to achieve a certain objective is with optimization theory.

In control theory, many systems like Model Predictive Control (MPC) and adaptive control systems only work thanks to optimization methods that balance how internal components of the control system should work together.

Beyond training neural networks and controlling physical systems, optimization powers recommendation systems, resource allocation, and so many other systems.

Some examples are:

Netflix movie recommendation system
Spotify's song suggestion system
Google systems to reduce data center cooling costs
Quantitative trading firms' high-frequency trading systems

To end this final chapter, I’ll share this:

It is optimization theory that makes math models into AI models that impact the lives of millions worldwide.

Conclusion: Where Mathematics and AI Meet

Photo by AXP Photography

When ancient civilizations first carved numbers into clay tablets, they likely didn’t imagine that these symbols would one day allow humanity to create the scientific, technological, and medical marvels we have today.

Yet here we are.

We’re in an era where mathematical ideas developed over many centuries – even millennia – have converged to create artificial intelligence.

Throughout this book, we've traced a path from the most basic math concepts to the cutting edge of AI. We have seen how:

Matrices compress complex systems into simple forms
Derivatives measure change
Probability helps us navigate uncertainty
Optimization guides algorithms toward better decisions and faster learning

We’ve also learned how each math field has helped create tools that are responsible for many of the things we take for granted today.

Mathematics is the Foundation of AI

Photo by Jeswin Thomas

Always remember this: AI is not pure magic or a "being" we don't understand. It’s just the combination of many math ideas working very well together.

When you ask a question of ChatGPT or any other LLM, it generates a response. And in the process of generating that response, there are millions of matrix multiplications happening in seconds.

Or, for example, when a self-driving car decides to stop moving because it’s coming up to a crosswalk, there are a lot of math computations (related to calculus and probability and statistics) working very fast to ensure safety.

The great thing about mathematics is that it’s a common, standard language of logic. No matter the backgrounds of people or where they were born, a derivative will always be a derivative, and the same thing goes for key AI concepts.

This way, scientists and engineers worldwide can improve each other's work because everyone understands the same language.

The Future: On Device AI and the Democratization of AI

Photo by Steve Johnson

One shift happening now is the move toward edge AI. That is, AI that runs locally on your phone, computer, and really in all your devices (rather than in distant data centers).

This way, privacy improves because your data stays on the device. Waiting times for AI models decrease because no data needs to be sent over the network. AI can be used offline, and costs decrease.

And what about the massive data centers being built all over the world? Those will be used for more products that will help improve the lives of millions of people.

As AI becomes more local and more processing power is freed up from big data centers, new AI innovations will appear, and more benefits will come.

The same way that in the past century every computer got its own networking chip, every device will have (and in some cases, already has) AI accelerators.

And much of it will be thanks to the math you learned in this book.

Final Reflections

Isaac Newton wrote, "If I have seen further, it is by standing on the shoulders of giants."

Every algorithm you use, every model you train, and every new theorem you learn stands on centuries of mathematical progress. You now stand on the shoulders of those same giants!

Thank you for reading, and happy learning.

Here’s the full book GitHub repository with all the code.

Acknowledgements

First and foremost, I would like to thank Guilherme Mendes, currently a Master’s student in Electrical and Computer Engineering at NOVA University, specializing in Control Theory, for reviewing the mathematical and technical details of the 1st version of this book.

I am also grateful to the organizations that gave me opportunities to grow:

A special thank you goes to the freeCodeCamp editorial team, especially Abigail Rennemeyer, for their patience and for reviewing every chapter of this book.

I would also like to thank all the professors at NOVA FCT who have taught and guided me throughout my academic journey, especially those from the Department of Electrical and Computer Engineering.

About the Author

LinkedIn: https://www.linkedin.com/in/tiago-monteiro-
GitHub: https://github.com/tiagomonteiro0715
Email: monteiro.t@northeastern.edu

My name is Tiago Monteiro, and I’m now pursuing a master's degree in Artificial Intelligence at Northeastern University in the Silicon Valley Campus (San Jose) on a merit-based scholarship.

I’m not from the United States. I am a Portuguese national, born and raised in the district of Lisbon.

In Portugal, I completed a bachelor's degree in electrical and computer engineering at NOVA University, one of Portugal's best universities.

I have authored over 20 articles for freeCodeCamp, which have accumulated more than 240,000 views over the years, and completed the Deep Learning Specialization from DeepLearningAI, taught by Andrew Ng.

Also, I had the privilege of participating in the winter 2025 batch of the renowned Silicon Valley Fellowship program.

Why did I choose electrical and computer engineering?

After finishing the Portuguese national math exam in 12th grade, I chose Electrical and Computer Engineering (ECE) to challenge myself and learn new math on my own.

The ECE degree combined:

Advanced Mathematics
Programming (from Assembly to Python)
Physics (classical mechanics, electromagnetism)

What did I gain exactly?

I mastered the skills needed to quickly understand AI research, particularly after completing Andrew Ng's Deep Learning Specialization.

In Portugal, I also studied advanced STEM areas including, for example:

Partial Differential Equations for modeling real-world phenomena
Harmonic analysis (Fourier/Laplace transforms) for signal processing and alternative problem perspectives
Complex analysis involving derivatives and integrals in the complex domain
Numerical methods for approximating mathematical solutions computationally
Signal/control theory for ensuring system stability in dynamic environments
Physics classes in classical mechanics and electromagnetism fundamentals

While not directly applied to AI, these studies enhanced my systems thinking and ability to independently learn complex STEM concepts.

How to Design Structured Database Systems Using SQL [Full Book]

Daniel García Solla — Wed, 13 Aug 2025 18:03:10 +0000

This book will guide you, step-by-step, through designing a relational database using SQL. SQL is one of the most recognized relational languages for managing and querying data in databases.

You’ll learn the fundamental concepts related to both data and the databases where they are stored and managed – from how data is transformed into information and subsequently into knowledge, to the architecture of a database management system (DBMS). We’ll also cover the different stages of the database design process, as well as its key principles, focusing specifically on the design of relational databases.

By the end of the book, you’ll have a solid understanding of how to design and maintain efficient, secure databases that can support complex data-driven applications, all aimed at meeting a series of requirements imposed by end users or clients. You’ll also learn the SQL fundamentals you’ll need to implement this design on a DBMS, and then maintain and query data on it.

So, whether you're a beginner or looking to enhance your skills, this book will provide the knowledge and tools you need to succeed in the world of data management.

Prerequisites
The Role of Data in Today's Digital World
Chapter 1: What is Data?
- DIKW Pyramid
Chapter 2: What is a Database?
Chapter 3: Data Management Models and Technologies
Chapter 4: Database Design
- Database Design Levels
Chapter 5: Relational Model (Structured Data)
Chapter 6: Relational Schema Diagram
Chapter 7: Normalization
Chapter 8: Query Languages
- Formal vs practical query languages
Chapter 9: SQL (Structured Query Language)
- DDL
- DCL
- DML
- Views
- Database Administration
Chapter 10: Database Design Process Example
- Entity-relation to logical model
- How to create the database
Chapter 11: Example Queries
Conclusion

Prerequisites:

Before going through this book, there are a few useful prerequisites you should have:

Fundamentals:

Basic programming knowledge like variables, data types (string/number/boolean), and conditionals/loops
Familiarity with spreadsheet terms/basic functions (rows, columns, sorting/filtering) as this will help map to tables/tuples/attributes
Command-line basics like how to open a terminal, run a command, set PATH (you’ll use CLI tools occasionally here), and so on

Environment to set up

A relational DBMS like PostgreSQL (recommended, as it’s what we’ll use here)
A SQL client like psql, pgAdmin, TablePlus, or DBeaver (pick one)
An Entity Relationship Diagram tool like draw.io/diagrams.net, Lucidchart, or dbdiagram.io
A code editor like VS Code (SQL and ERD extensions are a helpful addition)

Helpful background

Familiarity with some math/logic basics like sets/subsets, relations, functions as well as basic propositional logic (AND/OR/NOT, implication).
Basic knowledge of data modeling terms (entity, attribute, relationship, cardinality, and so on)
Version control basics

With that sorted, let’s dive in.

The Role of Data in Today's Digital World

These days, every action we take on the internet leaves behind a trail of information or data – whether it's conducting a bank transaction or shopping online.

But you may sometimes wonder whether it actually makes sense for these digital actions to be recorded. Do we need records of keystrokes when using a keyboard app, images saved in a gallery app, files in a file management program, notes saved in a note-taking app, or even vehicle routes with integrated Android Auto technology?

Some of these actions may not seem particularly useful at first, but they help developers and designers provide better, more advanced, and efficient services to users.

For instance, understanding how a user types on a keyboard app can improve the real-time typing experience by adapting the internal dictionary to the user's typing style and correcting errors more effectively. It also improves gesture typing, a feature based on artificial intelligence techniques that requires a large number of examples to be deployed successfully on a product.

Similarly, simple images saved in a gallery may not seem significant enough to be recorded on external sites, or even registered at all.

Image files, for example, can contain EXIF metadata with information about the image, such as the location where it was captured, the date of creation, its resolution, orientation, and the camera model used – among other data. While a user may not be interested in this data, it serves as the foundation for various application services, including classifying images into albums based on location, creating visual timelines, and generating "memories." These features significantly enhance the user experience.

Besides metadata, the content of images also creates a "digital trail" on third-party servers, which might initially seem intrusive and not beneficial to the user. However, it can lead to enhanced services. Since these third parties have the resources to train large machine learning models, they can recognize objects and faces in images. This improves album classification and allows users to search through images using text. Third parties can also identify which people or items are in a photo and link their data with other services.

Regarding the data generated by smart vehicles or IoT devices in general, the purpose is fundamentally the same: to provide users with better services, such as route optimization, maintenance prediction, prevention of possible failures, driving assistance, and integration with other smart devices in the environment.

These features are implemented using artificial intelligence techniques that learn from examples. Typically, the more data available, the better the underlying models "learn," leading to better results.

Ultimately, regardless of the legal and privacy issues related to these practices, recording what we do is not an end in itself. Rather, it’s a means of turning scattered information into useful knowledge, which can then be used to create services that enhance our productivity or user experience.

One clear example of this is in this very article on the Hashnode platform. It provides writers with translation, rewriting, and keyword optimization services for SEO searches, all of which are based on artificial intelligence models that have been trained using large amounts of text – that is, data.

So to make sure this is all technically feasible, we had to develop specific techniques for collecting, storing, and managing information securely, efficiently, and consistently.

Collection involves capturing information from various sources, such as IoT sensors, mobile devices, and social interactions, either manually or automatically.
This information can then be stored and accessed again when we need to transform it or apply processes that improve query efficiency or reduce storage space. This is precisely why data compression is a critical aspect of data storage.
Lastly, data management involves organizational, protective, governance, and analytical tasks.

In this book, we will focus on storage, which is the key aspect handled by databases. But databases are used for more than just storing and accessing data, as we will see later. They also provide a set of functionalities that allow us to organize, protect, and ensure the integrity of data, as well as to query it efficiently and concurrently.

This makes databases a fundamental component of the infrastructure for these services, which are often offered to a large number of users.

More precisely, we will focus on explaining all the necessary theoretical concepts you need to know to design and maintain a database. There are many ways to store data, depending on its nature or the client's needs, but we will focus on one specific structure.

To grasp the fundamentals of storing and managing data, we should begin with the most straightforward cases, which involve the simplest possible data structures. You’ll also learn the SQL language and its relevance to database maintenance through examples.

Chapter 1: What is Data?

Before we start working with databases, it's helpful to have a clear understanding of what data is. More specifically, we need to understand what data means in the context of working with databases and SQL.

At its most basic level, data is formally defined as a symbolic representation of a quantitative or qualitative attribute or variable that describes an empirical fact, event, or entity.

It's important to note that data has no inherent meaning. In other words, data is merely a value representing something observable or measurable – it doesn't provide any interpretable meaning.

For instance, the number 27 is data to which we initially can't assign meaning, though we can still store, transform, compress, or encrypt it. Later, if we discover that this value stems from a variable representing temperatures, then we have more than just data – we have semantics, or meaning.

In this example, the number 27 is considered raw data. Raw data is data that has been collected from a source, yet it lacks meaning or semantics and has not been processed or organized.

In the context of databases, the term variable is occasionally used to denote the origin of the data. But the term attribute is more common, as we will see later. An attribute is comparable to a variable in programming. It represents a feature of an entity, such as a person's age. It’s characterized by a data type and a domain, which define the form its values take and the set of values it can actually hold, respectively.

Data types are the internal formats and operations supported by an attribute's data. They can include:

integers (ints), which are typically encoded in computer science as 32-bit sequences
text strings encoded in UTF-8 format
decimal numbers (floats or doubles, among others), represented using the IEEE-754 floating-point standard, and
boolean values that can be true or false and are encoded with bits as 0 or 1.
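To make these encodings concrete, here is a minimal Python sketch that uses the standard `struct` module to show how each kind of value could be laid out in bytes. The little-endian byte order and the one-byte boolean are assumptions made for illustration; a real DBMS chooses its own internal layout.

```python
import struct

# A 32-bit signed integer: the value 27 packed into exactly 4 bytes
# (little-endian byte order is an assumption made for this sketch).
int_bytes = struct.pack("<i", 27)

# A text string encoded as UTF-8: plain ASCII letters take one byte each.
text_bytes = "Juan".encode("utf-8")

# A double-precision IEEE-754 float: 3.24 occupies 8 bytes.
float_bytes = struct.pack("<d", 3.24)

# A boolean: conceptually a single bit, stored here in one byte (1 = true).
bool_bytes = struct.pack("?", True)

for label, raw in [
    ("int", int_bytes),
    ("text", text_bytes),
    ("float", float_bytes),
    ("bool", bool_bytes),
]:
    print(label, raw.hex(), len(raw), "bytes")
```

Notice that the bytes alone do not say which type they belong to. The same four bytes could be read as an integer or as four characters, which is why an attribute needs a declared data type.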

As you can see, the data type defines the form an attribute's values take, while the domain is the set of all acceptable attribute values. A domain consists of a data type that limits the form of the data and a series of constraints that restrict which values of that base data type are allowed.

For instance, if an attribute is labeled as an integer and represents a person's age, it's evident that the domain can’t contain negative numbers, despite the int data type allowing them. Consequently, the domain can be defined as all possible integer values with an additional constraint that excludes values less than zero, leaving only the non-negative integers needed.
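The age example can be sketched in a few lines. This is only an illustration under some assumptions: it uses Python's built-in `sqlite3` module with an in-memory database instead of PostgreSQL so that it runs without any setup, and the `person` table and `try_insert` helper are hypothetical names. The `CHECK` clause plays the role of the constraint that narrows the integer data type down to the domain.

```python
import sqlite3

# In-memory SQLite database (an assumption so the sketch is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " name TEXT NOT NULL,"
    " age INTEGER NOT NULL CHECK (age >= 0)"  # domain: integers, but never negative
    ")"
)


def try_insert(name, age):
    """Return True if the row is accepted, False if the constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO person (name, age) VALUES (?, ?)", (name, age)
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("Juan", 27))  # a valid age is stored
print(try_insert("Ana", -5))   # a negative age violates the CHECK constraint
```

The data type alone would accept -5. It is the extra constraint that turns "any integer" into "any valid age".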

Through these concepts, we can understand data in its most basic form. If we take a decimal number like 3.24, it may indicate a measurement for scientific purposes. A text string like "Juan", on the other hand, may represent a person's name. In other words, the semantics of a sequence of characters define its meaning. Taken individually, the characters don't represent anything – but together, they can form a Spanish word with a meaning, such as someone's name.

Beyond atomic data, which are the most basic elements that can contain information, there’s also much more complex data out there. This includes document data, spatial and geographic data, network or graph data, and multidimensional data. The only difference between the "atomic" data we saw earlier and these complex forms of data is that the latter are composed of relationships or associations between simpler data.

For instance, a document consists of sequences of characters (strings) related to each other, where one string might represent the title and another a paragraph. In a computer network modeled as a graph, there could be IP addresses at the nodes, which we can think of as encoded strings, and references to other nodes, which are also IP addresses.

We won't delve into the complex nature of such data here because it’s managed by specialized databases that are more difficult to understand, and where SQL is not always present.

DIKW Pyramid

So far, we’ve seen that data itself is just 'symbols' that can be stored, with no inherent meaning unless their origin or interpretation is known.

But it’s also possible to train machine learning models to provide services that appear much more complex compared to the data they were built with. In other words, we can build complex information systems from raw data that contain higher-level knowledge than the data we have discussed.

The DIKW (Data, Information, Knowledge, Wisdom) pyramid models this transformation from data to knowledge, establishing a hierarchy through which we can acquire knowledge about some aspect of reality based on data. To understand this, let’s look at the four levels of knowledge organization.

Data: At this level, our knowledge of the world – or rather, what we know about it – is represented as raw data. As previously mentioned, raw data is devoid of semantics. The only options here are to store and analyze the data. Although they don't explicitly provide high-level knowledge, we can clean the data to avoid missing or corrupt values and calculate statistical measures.

Example: As before, a raw value, such as the integer 27, is data from which we can only calculate certain statistical metrics. We can’t interpret it because we don't know its meaning until we get more context.
Information: After advancing from the previous level, the raw data is provided with semantics, which offers meaning to the stored and analyzed values. Now, the data is better organized because it’s contextualized with respect to its semantics.

This is the primary feature of this level, though certain relationships between the data also allow for more complex statistics to be calculated and more valuable questions to be answered about the data. The knowledge at this level is more abstract and valuable than the previous one.

Example: Continuing with the previous example, the number 27 could represent a person's age. So, here we can interpret and organize it with deeper comprehension and analyze it more precisely.
Knowledge: At this point, knowledge resides in models that capture the patterns of the analyzed and organized data according to their semantics. That is, data follow hidden patterns that aren’t easily discernible, but can be revealed through advanced statistical techniques or machine learning.

So, at this level, information is compressed and summarized. More precisely, a model generates an understanding of the information, allowing it to be synthesized.

This level is higher than the previous one because it extracts even more abstract knowledge from the information. Such knowledge describes the data itself, serves to make predictions, and achieves certain outcomes by leveraging the higher-level relationships between the data.

Example: Once we have meaningful data, we can build models to describe or summarize it in order to make predictions about unseen data or draw conclusions.

For example, we can use a statistical metric, such as the mean, as a model to determine the average age of a given dataset. Later, by comparing this mean with the ages of other people, we can determine whether they are above or below the average. But the models used to describe data at this stage are usually more complex and practical.
Wisdom: Building on the knowledge from the previous level, we reach a point where it’s no longer possible to extract higher-level relationships from the data. This means that no further abstraction is possible. The only remaining task is to combine our description of the data from the previous level with a social and ethical context, along with the professional experience of people who intend to use this knowledge to guide strategic decision-making and evaluate its consequences over time.

Example: At this final level, we can take people's ages, together with the models that describe them, and combine them with information about the context in which the data was collected to inform strategic decisions.

Note that the data may emerge from an organization, which is the context in which strategic decisions are made. The key point here is that the purpose of having such high-level knowledge is to inform strategic decisions.
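The first three levels of the pyramid can be sketched with the age example used above. The names and numbers below are made up for illustration, and the "model" is deliberately the simplest one mentioned in the text, the mean.

```python
from statistics import mean

# Data: raw values with no semantics attached.
raw_values = [27, 34, 19, 45]

# Information: the same values, now labelled as the ages of (hypothetical) people.
ages = {"Juan": 27, "Ana": 34, "Luis": 19, "Eva": 45}

# Knowledge: a very simple model that summarizes the information.
average_age = mean(list(ages.values()))


def is_above_average(age):
    """Use the model to say something about a value it has not seen before."""
    return age > average_age


print(average_age)
print(is_above_average(40))
```

The wisdom level is left out on purpose. It depends on context, ethics, and professional judgment, which a short script cannot capture.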

By studying this hierarchy, we can see that the interpretation of raw data leads to the acquisition of knowledge, which lets us make informed decisions. Databases help in this process primarily by storing data, which is one of their main objectives. But they also assist us with the analysis process by adapting the data's storage and organization methods.

At this point, you might ask the question: How do we want the database to store and analyze this data? First, we need to store the data persistently in secondary memory so that it can be retrieved at any time, rather than in volatile memory such as RAM.

On the other hand, analyzing data involves a wide variety of operations, ranging from simple searches and filtering to complex aggregations, pattern detection, statistical calculations, executing elaborate SQL queries, and processing text or images. Each type of data and operational need requires different algorithms and data structures for efficiency.

Since a database must provide functionality at both the storage and analysis layers, you might wonder whether a "general" database system exists that is capable of storing and analyzing data of any kind, regardless of its complexity or user needs. As we will see below, such a general system can’t exist. But there are systems built to handle any type of data-related problem, as long as the data is in a specific “shape”.

Chapter 2: What is a Database?

Once you learn about the main functions a database needs to provide, you can understand its advantages and why it exists – especially when compared to trying to implement these functions without a database. To help illustrate this, we’ll start by analyzing a case where we try to solve a problem involving data without a database. This will show the problems that can arise and how they are resolved.

Storing Data Without a Database

In terms of data storage, the raw data could be stored directly in binary files in secondary memory. For analysis, we can implement a software layer, which we can call the "processing layer." It contains programs that manipulate the stored data by accessing it and performing transformations based on implemented logic. And to facilitate data manipulation by users, there can be a graphical interface component that simplifies the use of these programs.

A practical example will illustrate this better. Suppose we are working with a domain that contains data about people and their financial information. Our objective is to analyze this data and make economic predictions. This data may originate from government sources, surveys, or other information systems. So we’ll need to store it in our system as binary files.

But we’re faced with a couple of problems: first, we need to choose the most suitable file type. Then we need to choose the best way to represent the data in the file to minimize problems in future stages when designing programs for access and analysis.

For example, storing the data in a sequential file, where the data is stored contiguously, is different from storing it in an indexed file, where the information is organized by an index. Suppose the index organizes the data by name, so that all people whose names (or a similar characteristic) begin with the same letter are stored contiguously in the same block, separated from the remaining letters. The same principle can be applied recursively to the subsequent letters of the name. It's as if the data were sorted alphabetically, though generally, a single level of recursion is enough.

Let’s look at an example: In a sequential file, people's names are stored in a "disorganized" way, which requires us to search through the entire file to retrieve a specific person's record. In contrast, an indexed file sorts people alphabetically by name. By consulting the index, we can determine where names beginning with a certain letter start, thus avoiding the need to look through the entire file. In other words, the index is similar to the table of contents in a book, which tells us on which page each chapter begins.
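The difference between the two file types can be sketched in memory. This is a simplified illustration, not how files are laid out on disk: the `records` list stands in for a sequential file, the `index` dictionary stands in for a single-level index on the first letter of the name, and all names and figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical records, kept in arrival order as in a sequential file.
records = [
    ("Maria", 52000),
    ("Juan", 31000),
    ("Ana", 47000),
    ("Jorge", 28000),
    ("Alba", 39000),
]


def sequential_search(name):
    """Scan the records in order; returns (record, comparisons made)."""
    for comparisons, record in enumerate(records, start=1):
        if record[0] == name:
            return record, comparisons
    return None, len(records)


# Index: first letter -> positions of the records whose name starts with it.
index = defaultdict(list)
for position, (name, _) in enumerate(records):
    index[name[0]].append(position)


def indexed_search(name):
    """Only inspect the records listed under the name's first letter."""
    comparisons = 0
    for position in index.get(name[0], []):
        comparisons += 1
        if records[position][0] == name:
            return records[position], comparisons
    return None, comparisons


print(sequential_search("Alba"))
print(indexed_search("Alba"))
```

With only five records the saving is small, but the sequential scan grows with the size of the file while the indexed lookup only grows with the number of names sharing the same first letter. The cost is that the index has to be built and kept up to date whenever records change.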

This type of decision affects how efficient searches and queries on data are, as well as its processing. Each file type has its own advantages and disadvantages, as you might expect.

Similarly, there is a wide variety of decisions we can consider when designing programs to access and operate on the data. These are directly influenced by the previous decisions. For example, if we change the file type, the software of these programs will most likely need to be reprogrammed. You can think of these programs as Python scripts that automate certain analysis processes.

Also, when we’re implementing these programs, we need to account for details such as concurrent access to data, which is difficult to implement from scratch, as well as other security features, such as data encryption, compression, and detecting erroneous or incomplete data. These features are essential to providing a good analysis service, but they are difficult to program and maintain.

In short, without a database, it’s possible to solve the problem of storing and analyzing data – but implementing all the software is potentially quite complicated, especially if we aim to do so from scratch. If we have the right resources, it may be possible to complete this process and end up with a sufficiently efficient system. But in most cases, using a database is more convenient.

Storing Data Using a Database

One way to simplify these processes is to use a database, which is an organized collection of data that models a domain and provides storage and analytical support for the processes we need to apply to the data. Without a database, data had to be stored in "single files" – but using a database, it’s stored according to a model that defines the type of information and its internal relationships. This is why the definition uses the term "organized."

As for the term "collection," it refers to the idea that a database is a set of data from the same domain. Here, by "domain" we mean the problem we are dealing with, for which we need to store and analyze data. In our example above, the domain would be the "universe" of people and all the tax concepts associated with them – that is, the set of concepts and information from the real environment that may be relevant to solve a problem using those data.

The advantages of a database extend beyond just storage. They also include the normalization of storage and organization, allowing for efficient queries on the stored data. These queries form the basic operations of any analysis process (querying). They’re also the fundamental support for other tasks such as the technical maintenance of the information system, data management, or even features like the system's scalability.

DBMS (Database Management Systems)

Data management involves a series of additional functionalities that are provided by a component on which the vast majority of databases are currently based: the DBMS (Database Management System). As its name suggests, this component is a software element responsible for centrally and efficiently managing the entire life cycle of stored data.

In this context, management refers primarily to the storage, extraction, modification, deletion, and search of data. These are the fundamental operations necessary for a database to be considered operational.

But management also involves additional functionalities that are useful in a database:

Centralization: Storing all the data in one system avoids having information scattered across many files, which may lead to unnecessary duplication of information, such as data references or the data itself. If the information system is not designed and implemented correctly, this can lead to inconsistencies and errors. But this is not a concern if we use a database.
Data integrity and security: The management system controls who can access the data through access controls and permissions for different database users. It also ensures data integrity, a topic we will discuss later.
Concurrent access and sharing: Information systems typically support applications used by many users simultaneously, which causes synchronization issues handled automatically by the DBMS. Fortunately, this means we don't have to implement specific logic in our database to ensure concurrent access to data by many users.

Finally, DBMSs also streamline the development and maintenance of the information systems built on top of them. There are many different DBMS software programs, such as MySQL, MariaDB, PostgreSQL, MongoDB, and Neo4j, among others. Here, we will focus on PostgreSQL.

ACID Properties and Transactional DBMS

Beyond the basic operations we’ve discussed, it's important to highlight the significance of transactional support in modern DBMSs for applications such as banking, online invoicing, and healthcare.

In these areas, it’s usually essential that any modification or query of the data follow a transaction mechanism. In other words, the operations performed on the database must be composed of a block of low-level instructions (reads and writes), and the manager must ensure that these operations are executed as a whole or not at all. This is often called an atomic operation.

This helps prevent technical failures from causing inconsistencies in databases (or similar problems). For example, if a user sends money via the internet and an error occurs, the entire transaction is canceled, as if it had never occurred. This protects the database from remaining in an inconsistent state, such as when one party has sent money, but the other has not received it. So the DBMS is responsible for ensuring this atomicity in database operations, which requires it to fulfill the ACID properties. They are:

1. Atomicity: A transaction operates under the "all or nothing" principle, meaning that either all of its low-level instructions are completed, or none of them are executed.

Example: A bank transaction must be completed fully, not left in an intermediate state where one party sends the money and the other does not receive it.

2. Consistency: Every transaction updates the database, ensuring it remains in a valid state and preserves data integrity.

Example: If a transaction changes a person's age, the final age can’t be negative.

3. Isolation: Concurrent transactions should not interfere with each other in a way that produces inconsistent results.

Example: Two people try to book the last seat on a flight at the same time. Isolation ensures that only one booking succeeds and the seat isn't double-booked.

4. Durability: Once a transaction has been completed, its effects are permanent. Even if the system fails, it must be ensured that the changes remain by writing them to persistent storage.

Example: If you transfer money between bank accounts and the system crashes right after, the transfer should still be reflected when the system comes back online.
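Atomicity and consistency can be observed with a small experiment. The sketch below is an illustration under some assumptions: it uses Python's built-in `sqlite3` module with an in-memory database rather than PostgreSQL, and the `account` table, the owners, and the `transfer` helper are hypothetical. The receiver is credited first on purpose, so that a failure in the second step leaves something to undo.

```python
import sqlite3

# In-memory SQLite database (an assumption so the sketch is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " owner TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0)"  # consistency rule
    ")"
)
with conn:
    conn.executemany(
        "INSERT INTO account (owner, balance) VALUES (?, ?)",
        [("alice", 100), ("bob", 50)],
    )


def transfer(sender, receiver, amount):
    """Move money in one transaction; True on commit, False on rollback."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE owner = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE owner = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return the current balances as a dictionary of owner -> balance."""
    return dict(conn.execute("SELECT owner, balance FROM account ORDER BY owner"))


print(transfer("alice", "bob", 30), balances())   # succeeds and is committed
print(transfer("alice", "bob", 500), balances())  # fails and is rolled back
```

In the second call, the credit to the receiver has already been applied when the debit violates the `CHECK` constraint. Because both statements belong to the same transaction, the rollback undoes the credit as well, and the balances are the same as before the call. Durability is not visible here, since an in-memory database disappears when the program ends.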

Finally, it’s important to understand that not all database management systems (DBMSs) need to be transactional, although many of them support such functionalities.

Database Management System Architecture

After seeing what a DBMS is at a high level, we can examine how its functionalities are implemented in greater detail. We won’t look at the lowest possible level, but rather at the architectural level.

To better understand how a DBMS operates, we can focus on each of its component's roles when receiving a user request, whether it's a data modification, management operation, or data retrieval query.

Overall, each DBMS is unique, with components specific to its design and needs. Broadly speaking, though, they all share the following components:

Precompiler: This component extracts and separates individual language statements embedded in applications based on the user query, which is usually in a language like SQL, before handing them off to the parser.
Parser: Processes and validates the syntax of the user query, generating an intermediate parse tree.
Authorization Control: Verifies the user's permissions to ensure that only authorized actions are performed.
Query Processor: Converts the user query into a logical execution plan before optimizing it.
Integrity Checker: Validates that the data meets all the constraints defined on the database while the query executes its statements.
Optimizer: Analyzes and rewrites the execution plan to choose the most efficient execution strategy.
Executable Code Generation: Transforms the optimized execution plan into specific calls to the storage engine API.
Transaction Manager: Coordinates the start, commit, or rollback of transaction operations to ensure atomicity and isolation.
Log (Transaction Record): Sequentially records all modifications to ensure durability and recovery support.
Recovery Manager: Uses the log to restore the database to a consistent state after failures.
Dictionary Manager (Catalog): Maintains and queries the metadata (schemas, statistics, permissions) of the database.
Data Manager: Implements the physical storage data structures and the operations for accessing data.
I/O Processor: Manages reading and writing of data to disk, that is, persistent memory.
Result Generator: Formats and sends the result sets (queried data) to the user or the application layer.
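Most of these components are internal, but a couple of them can be observed from the outside. The sketch below is only an illustration and assumes SQLite (through Python's built-in `sqlite3` module) rather than PostgreSQL. The `person` table, the `person_name_idx` index, and both helper functions are hypothetical names. It shows the parser rejecting a malformed statement and asks the engine which access strategy its optimizer selected.

```python
import sqlite3

# In-memory SQLite database (an assumption so the sketch is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE INDEX person_name_idx ON person (name)")


def is_valid_sql(statement):
    """Parser stage: a malformed statement is rejected before it runs."""
    try:
        # EXPLAIN compiles the statement without executing it.
        conn.execute("EXPLAIN " + statement)
        return True
    except sqlite3.OperationalError:
        return False


def query_plan(statement):
    """Optimizer stage: ask the engine which access strategy it selected."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + statement).fetchall()
    return [row[-1] for row in rows]  # last column holds the readable step


print(is_valid_sql("SELEC name FROM person"))
print(query_plan("SELECT name FROM person WHERE name = 'Juan'"))
print(query_plan("SELECT * FROM person WHERE age > 30"))
```

The exact wording of the plan depends on the SQLite version, so treat the printed text as informative rather than stable. The point is that the same engine picks the index for the filter on `name` and falls back to scanning the table for the filter on `age`, which has no index.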

Finally, although most databases rely on a DBMS, this is not always the case. For technical or performance reasons, implementing a custom database from scratch may work better for some teams than using a common DBMS-based solution. So a DBMS is not necessary for a database to exist, though it’s present in the vast majority of databases because of its inherent benefits.

Chapter 3: Data Management Models and Technologies

As you’ve been learning, applications that rely on databases typically involve large amounts of diverse, complex data. Because of this, there’s no single database model that effectively addresses all scenarios. Rather, there are different families, each specializing in specific tasks or sets of tasks.

So here, we’ll explore a range of options that can help you select a database to use in a project, depending on the data and the system's requirements. More specifically, we’ll examine some models or approaches on which a database may be based. But keep in mind that there are many others apart from the ones we’ll discuss here.

Types of Data According to Structure

First, the most relevant factor in determining a database's paradigm in a project is the data itself – particularly its complexity. Data complexity is defined by its structure, variability, and internal relationships. This mainly determines how the data is stored and processed.

So, before analyzing the different paradigms or approaches available, you should understand the meaning of data complexity.

Complexity is a concept that we can informally understand as the degree to which data is "complicated." For instance, a list of integers is different from a graph with integers at each node or a list of numbers encoded in binary, encrypted, or compressed.

Thus, complexity has several dimensions.

Volume: Clearly, the more data we have, the harder it will be to manage. It's likely that not all of it will fit on a single machine, resulting in longer processing or query latency times.
Heterogeneity: This alludes to the vast variety of formats, structures, and origins that data can exhibit within a given information ecosystem. Each of these characteristics constitutes a specific type of heterogeneity. This concept is more related to the world of data integration than to databases themselves, because it’s the main problem we face when integrating data into a system, regardless of whether it includes a database.
- Example: If we are going to build a database of cities and populate it with data from different sources, it’s likely that the city names will be written slightly differently in each source. This is referred to as syntactic heterogeneity.
Structure: In our case, this is the key dimension, as it allows us to classify the data into different categories, each of which is associated with a specific database paradigm. Essentially, the structure of the data refers to the extent to which it adheres to a predefined schema.

For now, we can understand the schema as a formal definition that determines how data is organized, as well as the features of this organization depending on the nature of the data and the database. Later, we will focus on the concept of schema in a structured (relational) database.

So the complexity of the data depends mainly on two dimensions: the flexibility of the schema and the volume or heterogeneity of the data. The more flexible the schema and the greater the volume or heterogeneity of the data, the more complicated it will be to process it, requiring an appropriate database model.

This means that, regarding the structural dimension of complexity, we can categorize data according to how "rigid" it is.

First, we have unstructured data. This is data that does not follow a fixed schema or set of rules, so it can't be automatically interpreted or labeled without prior processing. It's usually the most complex kind because it lacks metadata, that is, additional information that describes or organizes it. This category includes images, videos, audio, and other kinds of multimedia, as well as spatial data.

Next, we have semi-structured data. Unlike unstructured data, it carries tags that act as metadata and organize it. This allows the data to be grouped around these tags, which makes it easier to interpret, query, and process. It can also be self-organized using key-value pairs or internal hierarchies.

Essentially, this data contains meta-information that enables its self-organization, though it does not adhere to the strict schema of structured data. For example, we can have data in XML or JSON format where data is presented as key-value pairs, with a key associated with one or more pieces of data. As such, the key-value pair scheme is not rigid enough to perfectly characterize the structure of the data since it does not explicitly limit the amount of data that can be associated with a tag.

Finally, we have structured data. This data is organized by a strict schema that restricts it to tabular form. In other words, its organization follows the schema and a series of rules. Each data point is a sequence of values, one for each of a finite number of attributes, and each attribute is single-valued.

We can think of the schema as the table header that determines the attributes for which each data point takes on values. In this way, a data point is a tuple or row of the table in which it’s stored.

There is one additional restriction: each attribute can only have one value, meaning that an attribute or cell of the table can’t contain more than one value.
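To make the idea of a schema as a "table header" more concrete, here is a minimal Python sketch. The attribute names, the second user, and the `conforms` helper are illustrative assumptions, not part of any real DBMS. It only checks that a row supplies exactly one value of the expected type for each attribute.

```python
# Schema: the "table header". Attribute names and types are illustrative.
schema = (("id", int), ("name", str), ("email", str))

def conforms(row: tuple) -> bool:
    """Return True if the row has one value of the expected type per attribute."""
    if len(row) != len(schema):
        return False
    return all(isinstance(value, expected)
               for value, (_, expected) in zip(row, schema))

structured_row = (27, "Juan", "juan@juan.com")
# Hypothetical row that breaks the single-value rule: two emails in one cell.
multi_valued_row = (28, "Ana", ["ana@a.com", "ana@b.com"])

print(conforms(structured_row))    # True
print(conforms(multi_valued_row))  # False
```

A real relational database enforces far more than this (keys, constraints, and so on), but the sketch shows why a fixed header with single-valued cells makes each row predictable.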

Each of these categories leads to one or more database paradigms adapted to their nature. The easiest to deal with are the structured ones, as their rigidity does not allow for sufficient variation for the analytical techniques used on them to be considered "complex." In contrast, the most difficult to deal with are the unstructured ones, due to their variety and high flexibility.

Limitations of Structured Data

To keep things simple, we’ll focus on structured data and the databases supporting it. These databases are built using the relational model, which we’ll discuss later.

Since structured data is organized in tables, operating on them is simpler since tables have properties that make them easier to traverse and process. For example, knowing that each cell holds only one value allows us to programmatically traverse all the data in the table by traversing all its rows, regardless of the contents of each cell. This way, we avoid exploring an indeterminate number of values per cell, which would make it much less efficient.

This simplicity also allows tables to be implemented using record- or field-oriented data structures. These provide the necessary efficiency for structuring data within the relational model designed for this type of data. This model is a database "paradigm" that, when used with a query language such as SQL, lets us store and process most structured data, which is why it’s so important.

Keep in mind, though, that its status as a "general" paradigm that addresses almost any problem involving structured data introduces certain limitations:

1. Scalability

Most relational or structured database implementations use a monolithic architecture. This means the database runs on a single machine and can only be scaled vertically by allocating more resources to the machine.

Fortunately, distributed implementations use networks of multiple machines to run the database. This approach allows for horizontal scaling by adding more machines to the distributed system, providing greater scalability. Such scalability is critical for products like social networks, ensuring system availability.

2. Schema Flexibility

With such a rigid schema, if we need to store semi-structured or unstructured data (like JSON documents or images), this requires transformation or an alternative to structured databases, such as NoSQL databases. We’ll discuss this more later. These databases allow greater flexibility in data schemas and support heterogeneous data.

3. Complex Data Types

In addition to having a flexible schema, the type of data we are dealing with may be complex, making querying insufficient. Operations on structured data are usually designed for simple data that will often be queried. But when storing images, graphs, or other complex entities, we may need to perform complex operations on them.

For example, we could need to perform object detection in images or calculate neighborhood and centrality metrics in graphs. This leads to the development of specific database models (which we’ll cover later) that support these operations and the storage of such data, which is usually kept in BLOBs.

4. Data Volume (Big Data)

As previously mentioned, the data volume has an impact on almost every database model, since storing a large amount of data slows down processes. But Warehousing and Data Lake models can mitigate this effect by leveraging their ability to scale horizontally and accelerate computation to process massive amounts of data faster. This is achieved through techniques like data pipelines or cluster computing (similar to distributed computing).

5. Real-Time Requirements

Finally, databases are expected to have low latency when performing operations, since the speed at which users are served is often determined by the latency of these operations. Also, as the number of users is usually large, the database must support concurrency.

But the persistent storage operations conducted during these processes, plus the mutual exclusion locks that make concurrent access safe (and help comply with ACID principles), slow down data processing. As a result, in-memory database implementations are frequently preferred to mitigate this issue. In addition to saving data in persistent storage, these implementations use RAM as a cache to store some of the data and respond to queries more quickly, achieving close-to-real-time latency.
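The caching idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `SlowStore` is a made-up stand-in for persistent storage with an artificial delay, and `CachedStore` keeps a copy in RAM of every value it has already read. Real in-memory databases also handle writes, eviction, and consistency, which are omitted here.

```python
import time

class SlowStore:
    """Stand-in for persistent storage; the delay is artificial."""
    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0  # counts how often "disk" is actually touched

    def get(self, key):
        self.reads += 1
        time.sleep(0.01)  # simulate slow disk access
        return self._data[key]

class CachedStore:
    """Serves repeated reads from a dictionary held in RAM."""
    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def get(self, key):
        if key not in self._cache:          # cache miss: go to storage once
            self._cache[key] = self._backend.get(key)
        return self._cache[key]             # cache hit: answer from RAM

disk = SlowStore({"bike:1": "available"})
db = CachedStore(disk)
first = db.get("bike:1")    # read from the slow store
second = db.get("bike:1")   # served from the cache
print(first, second, disk.reads)  # available available 1
```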

So despite being the simplest and most effective at modeling everyday problems, structured databases have certain disadvantages. These have led to the development of alternative database models and approaches. Each of these models attempts to address a specific issue with structured databases, providing support for more complex data and more technically challenging requirements.

Big Data

Before examining specific database models, we should consider a problem that affects all of them: the volume of data. When a problem involves a sufficiently large amount of data, the term "Big Data" is typically applied. It’s not a model or set of models, but rather a concept referring to massive, complex data sets.

And given how much data is currently produced every day, it’s more and more common to encounter problems where massive volume becomes a limitation.

In a Big Data project, we can divide its lifecycle into several stages.

First, data is captured from multiple sources and integrated into common formats.
Then, it’s cleaned to ensure correct integration and, when necessary, manually annotated or tagged to feed machine learning models.
The data is then stored in scalable infrastructures or directly in databases, ensuring availability and fast access.

These "preprocessing" tasks can account for a significant portion of the work needed before the data is ready for use.

Once processed, we primarily use data to create knowledge models so we can understand the nature of the data. This also lets us generate predictions and informed decisions in professional environments. This process is usually referred to as business intelligence or data-driven decision-making.

These business intelligence processes go hand in hand with other tasks, such as statistical analysis and visualization, which are considered part of the Big Data ecosystem and are fundamental to data processing. They build on the earlier tasks, like the management of databases and information systems. So it’s essential to correctly define from the start what data is needed for a project, how it will be processed, and what results are expected.

What Constitutes “Big Data”?

It’s worth noting that there are no strict conditions for determining whether a project counts as Big Data. Still, a number of factors contribute to this designation:

The first is volume. As we’ve already discussed, the volume of data refers to the amount of data generated and stored within a given project. The more data that’s generated and stored, the more likely the project is to be categorized as Big Data. Still, there is no specific amount that defines this distinction, as it also depends on other factors, including the availability and complexity of the data.

The next is velocity. This is the rate at which data is generated and must be processed. For example, in a project consisting of a social network or an IoT device network, data may be generated at a very high velocity – that is, a large amount of data per unit of time. This data must be processed as quickly as possible. This means that the faster the data is generated, the more likely it is to be considered part of the Big Data ecosystem.

The last main factor is variety, also called data heterogeneity. This means the more heterogeneous the data, the more difficult it is to process. This requires greater computing power, which makes the project more likely to be considered Big Data.

For instance, integrating data from sources that use the same formats is easier than integrating data from those that use different ones.
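As a small illustration of this kind of integration work, the following Python sketch reconciles a city name that three hypothetical sources write slightly differently. The `normalize_city` helper and the sample spellings are assumptions made for the example. Real integration pipelines usually need much more than this, such as alias tables or fuzzy matching.

```python
import unicodedata

def normalize_city(name: str) -> str:
    """Reduce a city name to a canonical form for comparison (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed
                              if not unicodedata.combining(ch))
    # Collapse surrounding/repeated whitespace and ignore case.
    return " ".join(without_accents.split()).casefold()

# The same city as it might appear in three made-up sources.
variants = ["Málaga", "MALAGA", "  malaga "]
canonical = {normalize_city(v) for v in variants}
print(canonical)  # {'malaga'}
```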

Heterogeneity is affected not only by the formats, but also by how they are encoded, transmitted, and so on. We also need to consider the level of data structuring because unstructured or unlabeled data likely requires machine learning techniques (such as clustering) to extract information from it.

These are the main factors, although more have been added over time thanks to technological advances in these processes. Among them are:

Veracity: Degree of reliability of the information received in terms of data quality and accuracy, in order to avoid decisions based on incorrect or biased information.
Viability: The degree to which the data can be effectively used in the project, as sometimes their volume or other factors make their processing technically unfeasible.
Visualization: It's the ease with which data can be transformed into understandable dashboards for users, allowing them to explore it intuitively.
Value: The expected value to be obtained from processing the data. Generally, it's economic value, although it doesn't need to be economic – it mainly depends on the application domain.
Viscosity: This is the significance that data have in decision-making. Not the value added by their processing, but the relevance they have when making a decision.

In summary, although volume is one of the key factors determining whether a problem or project is considered Big Data, it’s not the only one. The speed at which data is generated and the heterogeneity of the data require a large amount of computation to process it, which is the primary issue that led to the concept of Big Data.

NoSQL Databases

The first model or database approach we’ll examine is NoSQL. Its name, commonly read as "Not only SQL," indicates that these databases aren’t limited to structured data and that the data they hold can vary in structure.

The main characteristic of this database approach is its flexibility in storing data: it doesn’t force data to adhere to a fixed schema, such as a tabular one. NoSQL databases also focus on offering easy horizontal scalability, which allows the computational capacity of the database to be expanded by increasing the number of machines. This makes them efficient at processing complex, large-volume data and thus supporting Big Data problems.

To understand what they entail in practice, we could consider a use case involving a database for a bicycle rental system that lets users rent bicycles through a subscription.

To implement this system, we can choose from a wide variety of databases or information systems. For example, in a relational database, the information is organized in tables, whereas NoSQL databases use different types of structures to organize the data. Each structure yields a specific type of NoSQL database.

Without delving into the specifics of the use case, we can see that using a relational database for such a project may pose challenges in the following areas:

Volume: If the system is deployed nationally or on a continental scale, a large number of users will perform transactions in our system, either by using or returning bicycles or by contracting or canceling their subscription to the service. Here, the difficulty of scaling a relational database has the greatest impact on the system. To manage such a large number of users, the system requires computing capacity that matches its needs, which means that the database must be able to scale horizontally. In relational databases, vertical scaling is usually applied, but adding more computing capacity becomes costly beyond a certain threshold.
Velocity: The system must respond quickly to user requests, such as displaying available bikes within a certain area or managing subscriptions. If the system uses a relational database, ensuring concurrency is computationally expensive, which causes high latency when many users query or modify the same information simultaneously.
Rigid schema: In a relational database, the schema does not frequently change. So if our system requires regular updates (like updates to bike models, the addition of new bike sensors, or significant modifications to the subscription service, especially the addition of functionalities or new features), these changes will require updating the database schema by adding or removing columns. This process is costly and complicated once the system is in production and its tables contain a large amount of data.
Temporal Analysis: Since structured databases are composed of tables (as we will learn later), a time series analysis, or any analysis of data that spans a long period and contains a large number of records, will have high response latency. For example, consider calculating metrics on bike usage over the last 10 years, during which time there may have been a massive number of transactions between users and bicycles. These types of queries are often called analytical queries.
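The rigid-schema point in the list above can be seen in a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical. Adding a new sensor means altering the table itself, and every existing row then carries the new column, here filled with NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE bikes (id INTEGER PRIMARY KEY, model TEXT, status TEXT)")
conn.execute("INSERT INTO bikes VALUES (1, 'model1', 'available')")

# A new bike model reports cadence, so the schema has to change for ALL rows.
conn.execute("ALTER TABLE bikes ADD COLUMN sensor_cadence INTEGER")
conn.execute("INSERT INTO bikes VALUES (2, 'model2', 'in_use', 85)")

rows = conn.execute("SELECT id, sensor_cadence FROM bikes ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, 85)]
conn.close()
```

On a two-row table this is instant. The cost the text describes appears when the table is large and in production, and depends on the specific DBMS.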

NoSQL databases offer different solutions to these problems, depending on how the data needs to be structured. So for each of these ways of organizing and storing data, there is a certain type of NoSQL database with a series of advantages and disadvantages depending on the nature of the project and the data involved. Let’s look at them now.

Key-Value model

The simplest option is to store all the data in a dictionary of key-value pairs, where each key is a unique identifier that acts as a tag linked to a single value. The type of content of each value depends on how we need to organize the data.

Here, we use the term "dictionary" to refer to the data structure used in languages such as Python and Java, as well as in formats such as JSON, where objects made up of key-value pairs are the main way of representing information. In our use case, if we want to store user information, each user could be represented as a dictionary with the following key-value pairs:

{
  "id": 27,
  "name": "Juan",
  "email": "juan@juan.com",
  "birth": "1984-01-05"
}

As you can see, keys serve as names that identify the value we are storing in a given pair. For example, the key "name" identifies the value "Juan". Values aren't limited to text or numbers: we can also save binary content or a Boolean as a value.

Among this model's characteristics are:

its simplicity, which enables humans to easily understand it
its low latency, which benefits from data structures such as hash tables with very low access times, and
its ease of distribution on several machines, since a dictionary can be seamlessly partitioned by its keys.

In practice, Redis is one of the most widely used DBMSs for this kind of database.
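The partitioning by keys mentioned above can be sketched as follows. This is a toy model, not how Redis or any specific DBMS distributes data: each "machine" is just a Python dictionary, the cluster size `NUM_NODES` is an assumption, and a checksum of the key decides where a pair lives.

```python
import zlib

NUM_NODES = 3  # hypothetical cluster size
nodes = [dict() for _ in range(NUM_NODES)]  # each dict plays the role of one machine

def node_for(key: str) -> dict:
    """Pick the node responsible for a key from a deterministic checksum."""
    return nodes[zlib.crc32(key.encode("utf-8")) % NUM_NODES]

def put(key: str, value) -> None:
    node_for(key)[key] = value

def get(key: str):
    return node_for(key).get(key)  # None if the key was never stored

put("user:27", {"name": "Juan", "email": "juan@juan.com"})
put("bike:1", "available")

print(get("user:27")["name"])  # Juan
print(get("bike:1"))           # available
```

Because the node is computed from the key alone, a lookup goes straight to one machine without searching the others. Production systems typically use more elaborate schemes so that adding a machine does not force most keys to move.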

Document model

In this model, the information management unit is not a key-value pair, but rather a set of them, known as a "document."

The main difference from the previous key-value model is that the values are no longer "opaque." Here, a document holds its information in a nested, hierarchical structure. This means that a value might be a dictionary containing key-value pairs, some of which can also be dictionaries. Thus, a hierarchy is established within the stored information, rather than allowing the values to be of any kind as in the key-value model.

Some characteristics of the document model are its flexible schema and the hierarchical storage of heterogeneous data. For example, in our use case, we can store bike information as follows:

bike1 = {
  "id": 1,
  "model": "model1",
  "status": "available"
}

bike2 = {
  "id": 2,
  "model": "model2",
  "status": "in_use",
  "sensors": {
    "cadence": 85,
    "speed": 24.5
  }
}

bike3 = {
  "id": 3,
  "model": "model3",
  "status": "maintenance",
  "sensors": {
    "gps": {
      "latitude": 40.4168,
      "longitude": -3.7038
    },
    "camera": "front_hd"
  },
  "acquisitionDate": "2024-11-15"
}

Here, you can see that all the dictionaries represent bikes. But some contain more fields than others, depending on the information that the specific bike model yields. This avoids the need to create a separate table for each model or type of bike. You can also see that some fields have a dictionary as a value, which gives the data a hierarchy. Also, not all fields need to be structured equally, since the model allows for some heterogeneity in this regard.

Finally, it’s important to emphasize that, in this model, the documents are self-descriptive, as the names of the keys or tags identify the stored information. MongoDB is one of the main DBMSs for implementing this model.
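To show how such documents can be queried even though their fields differ, here is a minimal Python sketch over the three bike documents. The `get_path` helper and its dotted-path notation are assumptions made for illustration and are not MongoDB's actual API.

```python
bikes = [
    {"id": 1, "model": "model1", "status": "available"},
    {"id": 2, "model": "model2", "status": "in_use",
     "sensors": {"cadence": 85, "speed": 24.5}},
    {"id": 3, "model": "model3", "status": "maintenance",
     "sensors": {"gps": {"latitude": 40.4168, "longitude": -3.7038},
                 "camera": "front_hd"},
     "acquisitionDate": "2024-11-15"},
]

def get_path(doc: dict, path: str, default=None):
    """Follow a dotted path such as 'sensors.gps.latitude' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # the field is simply absent in this document
        current = current[key]
    return current

with_gps = [b["id"] for b in bikes if get_path(b, "sensors.gps") is not None]
speeds = {b["id"]: get_path(b, "sensors.speed") for b in bikes}

print(with_gps)  # [3]
print(speeds)    # {1: None, 2: 24.5, 3: None}
```

Missing fields are treated as absent rather than as errors, which is what lets documents of different shapes live in the same collection.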

Column-oriented model

This model is similar to the structured model (the one used in relational databases) where information is stored in tables – but instead of each data point being kept in a row, it’s stored in a column. For example, in our use case, we could have:

Attribute	bike1	bike2	bike3
model	model1	model2	model3
status	available	in_use	maintenance
sensor_cadence	–	85	–
sensor_speed	–	24.5	–

In this type of database, the points are still rows in a table. But the items that the management system considers to compose the table aren’t the rows, but the columns.

In a relational database, a set of rows composes a table, where each row is a data point holding values taken for certain attributes, which are the columns. Similarly, in the column-oriented model, the management system treats a column as a "data point" on which operations are performed.

The transposed table above illustrates this: each line now gathers every bike's value for a single attribute, and that group of values is the unit the system stores and processes together. Analytical queries run quickly because an aggregation only needs to read the attributes it uses rather than every complete row, which significantly speeds up aggregation operations.

Furthermore, better data compression is generally achieved since all the data in a column is of the same type. Simple horizontal scalability is also possible through techniques such as column sharding. Popular DBMSs for this model include Apache Cassandra and Apache HBase, the latter of which runs on top of Hadoop.
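The difference between the two layouts can be sketched in plain Python. This is only a model of the idea, using the example's attributes. Real column-oriented systems add compression, on-disk formats, and distribution on top of it.

```python
# Row-oriented view: one dictionary per bike.
rows = [
    {"model": "model1", "status": "available",   "sensor_cadence": None, "sensor_speed": None},
    {"model": "model2", "status": "in_use",      "sensor_cadence": 85,   "sensor_speed": 24.5},
    {"model": "model3", "status": "maintenance", "sensor_cadence": None, "sensor_speed": None},
]

# Column-oriented view: one list per attribute, holding every bike's value.
columns = {attr: [row[attr] for row in rows] for attr in rows[0]}

# An aggregation only needs to touch the column it is about.
bikes_in_use = columns["status"].count("in_use")
known_cadences = [c for c in columns["sensor_cadence"] if c is not None]

print(columns["status"])  # ['available', 'in_use', 'maintenance']
print(bikes_in_use)       # 1
print(known_cadences)     # [85]
```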

Graph model

Alternatively, there is the graph model, which relies on graphs as fundamental data structures for storing information and relationships between data.

In our use case, for instance, each node can represent entities ranging from people to bicycles, connected by edges representing relationships between them, such as subscriptions or rentals. Both nodes and edges can contain attributes, allowing us to further organize the information.

This model is characterized by its support for analysis and big data projects since problems that tend to be modeled with graphs often involve large volumes of information, such as social networks. Also, graphs as data structures allow for the modeling of complex information and relationships. Neo4j is a popular option here, but there’s a variety of other DBMSs oriented toward specific uses within this model.
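As a rough sketch of how nodes, edges, and their attributes fit together, the following Python example models a few rentals. The identifiers, the second user, the `RENTED` relationship name, and the `minutes` attribute are all made up for illustration. A graph DBMS would provide its own query language instead of these helper functions.

```python
# Nodes: entities with attributes.
nodes = {
    "user:27": {"type": "user", "name": "Juan"},
    "user:31": {"type": "user", "name": "Ana"},
    "bike:2": {"type": "bike", "model": "model2"},
    "bike:3": {"type": "bike", "model": "model3"},
}

# Edges: (source, relationship, target, attributes).
edges = [
    ("user:27", "RENTED", "bike:2", {"minutes": 35}),
    ("user:31", "RENTED", "bike:2", {"minutes": 12}),
    ("user:31", "RENTED", "bike:3", {"minutes": 50}),
]

def rented_by(user_id: str) -> list:
    """Bikes reachable from a user through a RENTED edge."""
    return [dst for src, rel, dst, _ in edges if src == user_id and rel == "RENTED"]

def users_sharing_a_bike(user_id: str) -> list:
    """Other users who rented at least one of the same bikes (a two-hop query)."""
    bikes = set(rented_by(user_id))
    return sorted({src for src, rel, dst, _ in edges
                   if rel == "RENTED" and dst in bikes and src != user_id})

print(rented_by("user:31"))             # ['bike:2', 'bike:3']
print(users_sharing_a_bike("user:27"))  # ['user:31']
```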

Data Warehousing

Apart from the different options offered by the NoSQL model, you may have other needs that require different types of models. NoSQL is currently focused primarily on efficient data storage and querying. It’s especially useful in projects where data generation is the bottleneck – that is, a system that specializes in storing data is needed.

Conversely, other projects, especially those related to organizations, require a system that not only stores data efficiently but also manages the difficulty of extracting strategic information, as data lacks value on its own. The Warehousing model offers support for the centralization, organization, and subsequent transformation of data into knowledge that guides decision-making.

What is a Data Warehouse?

A Data Warehouse is essentially a specialized database for centrally storing large volumes of data from multiple sources. Besides storing all the data in "a single system" in a centralized way, its main purpose involves optimizing analytical queries on the data and generating dashboards or reports from the analysis itself. This is all aimed at supporting the efficient analysis and storage of the data.

By "analytical queries," I mean queries that require information over a certain period of time (or a different dimension) to calculate a metric on the data, such as the average magnitude over a 10-year period.

Returning to the previous example of the bicycle rental system, the Warehousing model provides advantages in terms of efficiency in storing users, bicycles, and transactions such as rentals or subscriptions. It also supports complex analytical queries on the data that contribute to strategic decision-making regarding the system. Such queries aim to predict demand and revenue, detect which parking areas are used more or less frequently, and so on.

Main Features of Data Warehouses

Now let’s look at some of the main features of a Data Warehouse so you understand how they work.

They’re Integrated

A data warehouse is typically a database that stores information from various sources. It integrates this information using transformations and processes that address the heterogeneity of the data, adapting it to the warehouse's common schema.

In our example, data can stem from various systems, including GPS positioning of bicycles, parking occupancy sensors, payment and subscription systems, and mobile applications. The warehouse then integrates all of this data, standardizing it into a common format to make its collective analysis easier. Note that these sources can vary greatly in nature, with some being structured and others not.

They Have a Historical Dimension

Over time, the Warehouse accumulates information from different sources to enable analytical queries. In our example, this would correspond to analyzing the data itself, such as examining user and bicycle behavior and usage, analyzing demand or revenue, among other possibilities.

They’re Optimized for Reading

Given the objectives we want to achieve with a warehouse, it’s optimized primarily for queries that only access data without modifying it, which is precisely what analytical queries require.

In our example, it would not be very efficient to implement the entire information system in a warehouse, because the rental service also depends on fast write operations, which a warehouse is not optimized for. One possible solution would be to use the warehouse only to store data reserved for analysis, while providing the actual service to users with a more suitable system.

In other words, even if we use a different database to implement the bike rental service, we can also have a warehouse into which we periodically insert information that needs to be analyzed.

Different Data Warehousing Schemas

In addition to these characteristics, a data warehouse is primarily a database consisting of tables. So, if the data is highly complex or has too many dimensions, we can organize it into different data models.

1. Star Schema

Here, the data or measurements are mainly stored in a central table called the fact table, which is related to other tables representing possible dimensions for analyzing the data in the fact table. The main feature of this model is that the dimensional tables aren’t usually subdivided into more specific dimensions, as the goal here is to find a simple way to store data to speed up analytical queries as much as possible.

In our example, if you only need to build dashboards for usage, billing, or similar purposes, prioritizing query speed, you could opt for a star schema with a large rentals table containing fields like user, bike, origin/destination station, date, and cost, and surrounding tables for each of those entities that can be considered "dimensions" when analyzing that data.
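A star schema along these lines can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical. SQLite is used only because it ships with Python, not because it is a typical warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: the perspectives we analyze by.
CREATE TABLE dim_user    (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_station (station_id INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per rental, pointing at the dimensions.
CREATE TABLE fact_rental (
    rental_id         INTEGER PRIMARY KEY,
    user_id           INTEGER REFERENCES dim_user(user_id),
    origin_station_id INTEGER REFERENCES dim_station(station_id),
    rental_date       TEXT,
    cost              REAL
);

INSERT INTO dim_user    VALUES (27, 'Juan'), (31, 'Ana');
INSERT INTO dim_station VALUES (1, 'Madrid'), (2, 'Lisbon');
INSERT INTO fact_rental VALUES
    (1, 27, 1, '2024-11-01', 2.5),
    (2, 27, 2, '2024-11-02', 3.0),
    (3, 31, 1, '2024-11-02', 1.5);
""")

# Analytical query: revenue per city, joining the fact table to one dimension.
revenue = conn.execute("""
    SELECT s.city, SUM(f.cost)
    FROM fact_rental AS f
    JOIN dim_station AS s ON s.station_id = f.origin_station_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(revenue)  # [('Lisbon', 3.0), ('Madrid', 4.0)]
conn.close()
```

Each analysis needs at most one join per dimension, which is the simplicity the star schema is aiming for.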

2. Snowflake Schema

Unlike the star data model, with a snowflake schema each surrounding table can be further subdivided into specific sub-dimensions, meaning smaller tables related to each other. This often saves space and improves data quality by reducing redundancy, as there are specific tables storing specific information and relating it to the rest of the tables, avoiding the duplication of information in too many tables. This streamlines the management of larger, more complex data sets.

ETL (Extraction, Transformation, and Load)

As you’ve now learned, a Data Warehouse is populated with data from multiple sources, all potentially different in nature. So Data Warehouses need to have a component responsible for extracting data from the sources, processing it, and inserting it into the data warehouse. This component is the ETL, which is a specific software piece for each data source that handles:

Extraction: Obtains the data from the source in the format the source provides.
Transformation: Applies a series of transformations to clean the data, eliminate heterogeneity, and adapt it to the schema defined in our Warehouse. The complexity and detail of these transformations mainly depend on the problem being addressed, and can even involve deriving or predicting new data from existing records.
Load: Inserts the transformed data into the Warehouse.

ETL processes are typically run periodically to populate the Data Warehouse or update the data within it.
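The three steps can be sketched as three small Python functions. Everything here is an assumption made for illustration: the source is a made-up CSV export with inconsistent name formatting and a decimal comma, and the "warehouse" is an in-memory SQLite table.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from one source system.
RAW_CSV = 'user,bike,cost\n juan ,2,"2,50"\nANA,3,1.5\n'

def extract(text: str) -> list:
    """Extraction: read the records in the format the source provides."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records: list) -> list:
    """Transformation: clean values and adapt them to the warehouse schema."""
    cleaned = []
    for r in records:
        name = r["user"].strip().title()           # ' juan ' -> 'Juan'
        cost = float(r["cost"].replace(",", "."))  # '2,50'   -> 2.5
        cleaned.append((name, int(r["bike"]), cost))
    return cleaned

def load(conn: sqlite3.Connection, rows: list) -> None:
    """Load: insert the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO rentals VALUES (?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE rentals (user_name TEXT, bike_id INTEGER, cost REAL)")

clean_rows = transform(extract(RAW_CSV))
load(warehouse, clean_rows)

total = warehouse.execute("SELECT SUM(cost) FROM rentals").fetchone()[0]
print(clean_rows)  # [('Juan', 2, 2.5), ('Ana', 3, 1.5)]
print(total)       # 4.0
```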

OLAP

As you’ve already seen, Data Warehousing is designed to support analytical queries, commonly known as OLAP (Online Analytical Processing). Unlike OLTP (Online Transactional Processing), which focuses on reading or modifying records individually, OLAP allows for analyzing data across various dimensions to discover trends or patterns that support strategic decision-making.

To understand this, it's very common to think about queries on the time dimension, which is the easiest to see, such as calculating an average over data from a time period or any similar metric.

More specifically, in an OLAP environment, data is organized into multidimensional cubes, where each dimension represents a perspective of analysis like time, product, region, and so on, and the data or measures are the quantitative values that are aggregated according to the dimensions we are interested in.

Some basic navigation and aggregation operations are defined on these cubes:

Drill-Down: It involves moving from a high level of aggregation to a more detailed one. For example, after reviewing the total quarterly bike rentals, we apply drill-down to see those that occurred by month, and from there by day or even by parking spot, allowing us to quickly detect usage variations in specific periods.
Roll-Up: This is the opposite operation to drill-down: it groups data into higher levels of aggregation, that is, with less detail. Starting from daily rentals, with a roll-up, we can obtain monthly rentals, rentals by region, or the annual total, helping summarize large volumes of data and provide an overall view of the modeled domain.
Slice: Here, a subset of data is selected by setting a value in one dimension. For example, a "slice" in the bike rental cube by setting the dimension "region = Spain" will show all bike rentals that have occurred in Spain, while leaving the other dimensions, such as time or type of service (service subscription), unrestricted.
Dice: Similar to slicing, a "filter" is applied to the cube across multiple dimensions simultaneously. For example, querying bike rentals in a specific geographic region and during a certain time period. The main difference is that a range is defined in several dimensions at once, creating a sub-cube with more specific results.
Pivot: This involves rearranging the dimensions of the cube to change the analysis perspective without altering the data. For example, swapping rows and columns in a report to view regions in columns and periods in rows, making it easier to compare different dimensions and discover correlations between them.
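
To make these operations more concrete, here is a minimal sketch that imitates them with plain SQL aggregations on an in-memory SQLite database. The `rentals` table, its columns, and the sample figures are assumptions made up for illustration. A real OLAP engine would work on a proper cube rather than on a single flat table.

```python
import sqlite3

# Hypothetical fact table: one row per day and region with a rental count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (day TEXT, region TEXT, rentals INTEGER)")
conn.executemany(
    "INSERT INTO rentals VALUES (?, ?, ?)",
    [
        ("2024-01-10", "Spain", 120),
        ("2024-01-11", "Spain", 80),
        ("2024-02-03", "Spain", 150),
        ("2024-01-10", "France", 60),
        ("2024-02-03", "France", 90),
    ],
)

# Roll-up: aggregate the daily rows into one total per month.
roll_up = conn.execute(
    "SELECT substr(day, 1, 7) AS month, SUM(rentals) "
    "FROM rentals GROUP BY month ORDER BY month"
).fetchall()

# Drill-down: return to the finer daily level inside one month.
drill_down = conn.execute(
    "SELECT day, SUM(rentals) FROM rentals "
    "WHERE substr(day, 1, 7) = '2024-01' GROUP BY day ORDER BY day"
).fetchall()

# Slice: fix a single dimension (region = 'Spain').
slice_spain = conn.execute(
    "SELECT day, rentals FROM rentals WHERE region = 'Spain' ORDER BY day"
).fetchall()

# Dice: restrict several dimensions at once (a region and a date range).
dice = conn.execute(
    "SELECT day, region, rentals FROM rentals "
    "WHERE region = 'Spain' AND day BETWEEN '2024-01-01' AND '2024-01-31' "
    "ORDER BY day"
).fetchall()

# Pivot: show regions as columns and months as rows.
pivot = conn.execute(
    "SELECT substr(day, 1, 7) AS month, "
    "SUM(CASE WHEN region = 'Spain' THEN rentals ELSE 0 END) AS spain, "
    "SUM(CASE WHEN region = 'France' THEN rentals ELSE 0 END) AS france "
    "FROM rentals GROUP BY month ORDER BY month"
).fetchall()

print(roll_up)
print(pivot)
```

Notice that every operation reads the same stored rows. Only the grouping and filtering change, which is the sense in which these operations "navigate" the cube.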

Data Lakes

In addition to the Warehousing model, we have Data Lakes, which are like Warehouses except that data is not stored following a common schema but is kept as it arrives from its respective sources. That is, to populate a Warehouse with data, ETL components are needed to transform and adapt it to a schema. But with a data lake, such components do not exist because there is no schema that the data must follow – instead, it’s simply stored in its original format and structure.

The main reason for this is that a Data Lake aims to store the data so it can be analyzed later, while a Warehouse aims to integrate the data through transformations to turn it into knowledge that supports high-level business decision analysis.

Normally, data is stored in its raw form in a data lake without any processing, although it can be organized according to the project's needs. This implies that the associated costs are generally lower than those of a Warehouse, as it saves all the computation resources related to its transformation, which can sometimes be complex and computationally expensive.

Since Data Lakes focus on storing data rather than integrating it, they are suitable for machine learning tasks and exploratory analysis. It's easy to apply algorithms to find patterns in raw data. But don't confuse non-integrated data with unlabeled data. Labeled data can be stored in a Data Lake and used to train supervised machine learning models. It all depends on the project's needs and the level of abstraction you want to work with.
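
As a rough illustration of storing data "as is", the sketch below keeps two raw payloads in their original formats and only interprets them at analysis time. The file names, field names, and the `read_temperatures` helper are hypothetical and chosen just for this example.

```python
import csv
import io
import json

# Hypothetical raw payloads, kept exactly as their sources produced them.
data_lake = [
    ("sensor.json", '{"city": "Paris", "temp_c": 7}'),
    ("legacy.csv", "city,temperature\nMadrid,20\n"),
]


def read_temperatures(lake):
    """Interpret each raw payload only when it is analyzed (no upfront ETL)."""
    rows = []
    for name, payload in lake:
        if name.endswith(".json"):
            record = json.loads(payload)
            rows.append((record["city"], float(record["temp_c"])))
        elif name.endswith(".csv"):
            for record in csv.DictReader(io.StringIO(payload)):
                rows.append((record["city"], float(record["temperature"])))
    return rows


temperatures = read_temperatures(data_lake)
print(temperatures)
```

In a Warehouse, the conversion inside `read_temperatures` would instead run once, during ETL, before the data is stored under a common schema.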

Semantic Web

In addition to the previous database models, there are other types of technologies and tools that can organize data and its semantics. One of these technologies is the Semantic Web, which arises from the need to provide meaning to the terms used on the traditional web.

For example, in an HTML document, the word "user1" might appear, which by itself is just data without any meaning. So to integrate semantics, the Semantic Web is used as a "layer" of software that associates meaning to the terms that appear on the web.

While a simple HTML document serves to structure a series of data at the layout level, the Semantic Web provides meaning, usually through tags or annotations, so they can be interpreted by both humans and machines. In this way, the data "user1" can be associated with a tag like "name", indicating that the data is a username.

This technology is based on a series of components:

RDF (Resource Description Framework): A standard where information is represented through Subject – Predicate – Object triples. The subject is usually a resource or entity within the domain, the predicate is an attribute or relationship of that entity, and the object is the value it is related to. This way of representing information is easily understandable by people and easily processed by machines, and it is independent of the syntax used to write the triples down (such as RDF/XML or Turtle).
```
  <http://example.org/users/user1> domain:name "Juan" .
```
Vocabularies: A set of terms used to describe data in a specific domain. We can see this as a language or dictionary of concepts with their associated meanings, all belonging to a common domain. More specifically, it can have meanings associated with classes (sets of entities), properties of those entities, or relationships between them.
- Example: https://en.wikipedia.org/wiki/Dublin_Core
Ontologies: A formal conceptualization of a domain, where the meanings of the entities within it are defined, along with their properties, relationships with other entities, hierarchies they form among themselves, and their constraints. In summary, they provide richer semantics than vocabularies due to the complexity with which they can model semantics.
- Example: http://musicontology.com/docs/getting-started.html
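
To see how triples can be processed by a machine, here is a minimal sketch that stores them as plain Python tuples and queries them by pattern. The `domain:livesIn` predicate, the Madrid resource, and the `match` helper are assumptions added for illustration. A real project would normally rely on an RDF library and a query language such as SPARQL.

```python
# Each triple is (subject, predicate, object). The IRIs are illustrative only.
triples = [
    ("http://example.org/users/user1", "domain:name", "Juan"),
    ("http://example.org/users/user1", "domain:livesIn",
     "http://example.org/cities/madrid"),
    ("http://example.org/cities/madrid", "domain:name", "Madrid"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything we know about user1.
about_user1 = match(triples, subject="http://example.org/users/user1")

# Every resource that has a name, whatever kind of entity it is.
names = [o for (_, _, o) in match(triples, predicate="domain:name")]
print(names)
```

Because every fact has the same three-part shape, the same `match` function works for users, cities, or any other entity in the domain.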

In relation to the web, there are multiple ways we can store our data, whether on our own infrastructure or someone else's. On one hand, we can choose to have a complete infrastructure of our own where all data is handled locally (on-premise), which offers advantages like having full control over it or faster access. But this also has drawbacks such as high costs since we have to maintain the entire infrastructure ourselves, ensure good scalability, and minimize the risk of failures that could reduce service availability.

On the other hand, you can choose to use someone else's infrastructure, usually by renting it. Here, the data is in the cloud, which provides greater scalability, reduced costs since you only pay for the infrastructure you use, broad geographic access with services like GCP or AWS, and backup services that minimize the risk of data loss, which would be potentially very expensive to achieve using local infrastructure.

Still, this approach also has drawbacks, such as the dependency on an internet connection to use the infrastructure as a service, or security and privacy issues since the data is in a place we don't know well.

Finally, keep in mind that these two types of solutions aren’t mutually exclusive. You can use them simultaneously in hybrid solutions where the most sensitive or valuable data is kept locally and the rest on external infrastructure, although this strongly depends on the project's requirements.

Chapter 4: Database Design

Now that you’ve learned about some existing database models and the technologies that support them, it's important to understand what database design means.

In short, database design refers to a database’s creation. When you have a project involving data, the first order of business is to consider whether you actually need a database. This typically depends on factors like requirements provided by a client.

If you need a database, its design typically follows a series of stages. These stages start with the client's requirements, which determine what needs to be stored and how it needs to be stored. Then, the schema or structure that the data should follow once stored is planned. This allows you to further explore how to store and process the data computationally at a low level to optimize the most critical operations.

For example, in projects like product sales platforms, it may be more important to optimize operations related to product searches, while in others such as social networks, optimizing the writing of new posts may be more significant.

In addition to deciding the structure of the data, user requirements also help determine which data needs to be stored, as it's not always necessary to keep all available data in a database. Generally, only the data that might be retrieved or used in some operation is stored, although this strongly depends on the project's requirements and nature.

Database Design Levels

When you’re developing a data project and working on designing the database, you can divide it into a series of stages or design levels. These are related to the level of abstraction with which you can view the implementation of the database. Think of them as steps to follow to achieve a functional database that meets user requirements, which are also considered part of the database design.

Apart from these design levels, there is a distinction based on the area of the development they are oriented towards, usually distinguishing three areas in which the different design levels are classified.

On one hand, there is the analysis of the client's needs and requirements, which determines what our information system must do.
Then we have the design of the database itself, which provides a description of the solution, its practical implementation, and the software/hardware components that form it.
Finally, we have the technology used for this implementation, where the tools, programs, and specific modules involved in the development are decided.

Now let’s look at the different design levels.

1. Analysis (Functional and Data Requirements)

This level is considered part of database design due to its influence on the other stages or levels. Here, information about the domain is first gathered, which can stem from clients, users, or any stakeholder with knowledge about the domain. The main goal is to obtain as much information as possible to then extract user requirements from it. These are a series of axioms that determine what the system must do to function according to the client's needs.

These requirements can be of many types, all studied in depth in the field of software engineering. A significant feature about them is that they determine what the system must do, not how it should do it, although in certain systems there are requirements for correctness or security that might restrict how the system should perform certain actions.

For example, if we design a database for a critical system like a nuclear power plant, it’s very likely that some of those requirements will require the system to respond to certain critical queries within a short time frame for safety reasons.

2. Conceptual Design (High-Level ERD/UML)

Once the requirements that the system or database must meet are clear, the conceptual design is responsible for describing how the data will be organized within the database. This is always done according to the database model you’ve selected for the project, as using NoSQL is different from using a structured database.

To correctly understand this level, let’s consider a case where the database being used is relational/structured. At this level, the data is first described, along with their possible associated constraints, such as data types, attribute domains, and so on. Then, software engineering tools like an entity-relationship diagram are used to describe the tables that comprise the database and their relationships. This helps us formalize the structure in which the data will be organized once the system is in production.

It’s important to remember that regardless of the tool used for this process (whether a diagram or any other representation method), the organization depicted in the diagram must later be translated into a software implementation, which heavily depends on the DBMS. Designing a structured database differs from designing a graph-oriented database, so you’ll need to select an appropriate tool at this level to represent the data organization.

So the main focus at this level, beyond understanding the requirements, is to organize how the information is stored according to the operations the system will support. You’ll also need to properly document the descriptions provided, whether with diagrams or other tools, so they are understandable later and can be implemented on a specific DBMS.

3. Logical Design (Relational Schema)

Assuming the database is structured, at this level, you’ll use the diagram you created in the previous level to implement the database schema on a DBMS. This means you define the tables that the database will have on the DBMS.

If you didn’t use a diagram in the previous level or the database is not structured, you’ll follow the same process – although instead of tables, you’ll use the appropriate structures, such as graphs. Ultimately, here the entity-relationship diagram is translated into a relational schema, as we will see later, which is responsible for representing the tables that exist in the database at the DBMS layer.

When dealing with tables (or the corresponding structure according to the database model you’re using), it’s easy to understand how the database is organizing the information. But this is only the high-level view that the DBMS shows us, since eventually everything has to be converted into low-level data structures and algorithms that operate on files holding information encoded in binary. In other words, although we see tables, internally the DBMS operates with other types of computational tools at a lower level, closer to the hardware, which do not necessarily have to resemble tables, graphs, key-value pairs, and so on.

This offers an advantage: when managing the database, you can do so by focusing on the tables it contains, without needing to worry about how the data is actually stored in memory (or how the data structures and algorithms used to implement the database operations are working).

In other words, the database, more specifically the DBMS, automatically translates table-level management into the lowest level management, closer to the hardware, which is called logical-physical independence. This allows us to manipulate the database by working directly with the tables, not with the content at the hardware level, which would complicate things.

Finally, at this level, you’ll often perform schema refinement. This refers to restructuring the schema with tables to make certain operations more efficient, or to improve certain aspects of the implementation according to the requirements. We do this because, when translating from the previous level to the logical one, you can modify certain design patterns to better use the tools provided by the DBMS, whether table-oriented or not.

4. Physical Design (Logical Indexes, Clustering, Partitions)

At this level, the DBMS automatically implements the schema we previously defined at the level closest to the hardware. It translates the set of tables and associations we defined into specific data structures like B-trees, indexes, and algorithms that support their operations. In essence, this level is the computational implementation of the DBMS, which manages disk memory or calls the operating system, among other details.

This implementation of our schema by the DBMS is automatic. We simply need to provide a definition based on the relational schema we created earlier, including the tables, associations between them, and the data we want to insert or delete.

With this, the DBMS translates these "relational" operations into low-level operations like assembly instructions. This helps us maintain logical-physical independence, as the DBMS implementation can be modified at any time without affecting our relational schema or its functionality. This lets us optimize the DBMS code without needing to rewrite all the "relational" programs that define the databases.
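
The following sketch illustrates logical-physical independence using SQLite, assuming a simplified `City` table and a hypothetical index name. We change something at the physical level (we add an index) and check that the same query, written against the same table, still returns the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT, Country TEXT)"
)
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?)",
    [(1, "Madrid", "Spain"), (2, "Athens", "Greece"), (5, "Paris", "France")],
)

query = "SELECT Name FROM City WHERE Country = 'France'"

# Result and access plan before touching the physical design.
before = conn.execute(query).fetchall()
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Physical-level change: add an index. The table and the query stay untouched.
conn.execute("CREATE INDEX idx_city_country ON City (Country)")

after = conn.execute(query).fetchall()
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before == after)  # the logical result does not change
print(plan_before)
print(plan_after)
```

The text of the plans depends on the SQLite version, but you will typically see a full table scan replaced by a search that uses the index, while the query itself never had to be rewritten.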

5. Storage Level (Block Formats, Disk Structures, and Access)

You can think of this level as a subset of the previous one, as it’s responsible for storing data in secondary memory according to the relational schema managed by the DBMS. It performs the necessary requests to the operating system to allocate memory and usually manages information on the disk at the byte level.

For this purpose, it employs low-level techniques that determine how available disk memory will be used, including the implementation of disk structures and the formatting of memory blocks, among others.

6. Implementation of Applications and Security (Views, Permissions, Procedures)

Finally, once the database is built, you can design new layers on top of it where you can install applications and services that facilitate the interaction with the database. That is, you can simplify its operation for the end user, for example by developing a web application in HTML, CSS, and JavaScript to obtain the data in a friendly way, instead of with SQL code.

Some of these layers are also oriented to guarantee the security of the data, establishing higher level access controls than the DBMS where the user must authenticate to access the data. You can also encrypt the data using some of the functionalities of these layers.

Chapter 5: Relational Model (Structured Data)

Now that you understand some of the processes we use to design databases, we will focus on the simplest databases, which are those that operate with structured data. These databases are usually called relational or structured. They are formally designed using the relational model, which is the formalization of the conceptual level used to design this type of database.

The reason relational databases are the simplest lies in the nature of the data they usually store and the constraints imposed on them, as we will see now. We’ll discuss both the conceptual and logical design levels simultaneously, where the fundamental elements of this type of system are mainly represented.

It’s important to differentiate between how these elements are viewed from the conceptual level and from the logical level, as they essentially refer to very similar, and sometimes equivalent, concepts – but formally they are different concepts. In a relational database, the information is structured in entities related to each other and composed of a series of attributes, which is the conceptual view of the model.

Table (Relation)

As mentioned before, structured data is that which follows a rigid schema and is organized in the form of tables. So the fundamental component for storing information in a relational database is the table, which is sometimes also called a relation. This component is part of the logical design, since we define it in the DBMS. So whenever we deal with tables, we are referring to the logical design level.

Here’s an example:

CityID	Name	Country	Population	Area
1	Madrid	Spain	3,223,000	604.3
2	Athens	Greece	664,046	38.96
3	New York	USA	8,398,748	783.8
4	Tokyo	Japan	13,929,286	2,191.1
5	Paris	France	2,140,526	105.4

Schema

The example City table above stores data about different cities. The table has a schema, which is the metadata that describes the structure of the table. That is, the schema consists of the table name, which in this case is City, along with the name and type of all the attributes it has, corresponding to the columns.

For example, if we are storing cities in this table, the Name column corresponds to the Name attribute of each city, which is a property that city entities have, in addition to the associations between entities, which in certain contexts are also called properties. This attribute must have a type, such as string in this case, to determine what kind of data it will contain.

So the table name along with the names and types of the attributes form the schema of a table, which is mainly determined by user requirements. But it’s the database designers who decide how to model the domain entities, what attributes are necessary to include, and the types of each one.
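
As a small sketch of what a schema looks like in practice, the code below declares the City table in SQLite and then asks the DBMS to describe it. The column types are one possible choice for this example, not the only valid one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: table name plus the name and type of each attribute.
conn.execute(
    "CREATE TABLE City ("
    "CityID INTEGER, Name TEXT, Country TEXT, Population INTEGER, Area REAL)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(City)")]
print(schema)

# The instance is still empty: a schema exists independently of the tuples.
row_count = conn.execute("SELECT COUNT(*) FROM City").fetchone()[0]
```

Here the schema is fully defined even though the table contains no rows yet.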

Tuple

In addition to a schema, a table also has an instance, which is the set of tuples it contains at a given moment in time. Here, by tuple, we mean a row of the table, as we can mathematically view it as a tuple (value1, value2, value3…) where all the values for a certain city are present for all the table's attributes.

A peculiarity of the instance is that there can never be multiple identical tuples. This means, in this case, that there can’t be two or more cities that have the same values for all attributes at the same time. This restriction is imposed in the pure relational model, although we will see in practice that this restriction may not be followed to facilitate certain tasks.

This is the case because, in the pure relational model, the instance is considered a set of tuples, and mathematically, a set can’t have repeated elements. But in the practical implementation we will see, the instance is formally modeled with a multiset that does allow duplicates, as each tuple is internally associated with a value indicating how many times it’s repeated in the multiset.
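
The difference between a set and a multiset is easy to see with Python's built-in types. This is only an analogy for how an instance behaves, not a description of how any particular DBMS stores its rows.

```python
from collections import Counter

# Three rows, two of which are identical.
rows = [
    (5, "Paris", "France"),
    (5, "Paris", "France"),
    (1, "Madrid", "Spain"),
]

# Pure relational model: the instance is a set, so duplicates collapse.
as_set = set(rows)

# Practical view: a multiset keeps a count for each distinct tuple.
as_multiset = Counter(rows)

print(len(as_set))
print(as_multiset[(5, "Paris", "France")])
```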

Attribute Domain

Previously, we mentioned that each attribute has a domain, which allows the DBMS to determine how the data in that column will be stored. But we might have an attribute like Population where it doesn't make sense to store negative numbers, similar to Area.

To prevent these situations, the domain of the attribute can be restricted. For example, if we set Population to have only the INTEGER data type by default, it can take any value from the set/domain of integers. But if we only want it to take positive values, we need to add a constraint (which we’ll discuss later) so that the possible values for that attribute, meaning its domain, are only the positive integers.
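
As a preview of how such a restriction can look, here is a minimal sketch using a `CHECK` constraint in SQLite. The simplified table and the sample city "Nowhere" are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CHECK narrows each attribute's domain to strictly positive values.
conn.execute(
    "CREATE TABLE City ("
    "CityID INTEGER PRIMARY KEY, Name TEXT, "
    "Population INTEGER CHECK (Population > 0), "
    "Area REAL CHECK (Area > 0))"
)
conn.execute("INSERT INTO City VALUES (5, 'Paris', 2140526, 105.4)")

try:
    # A negative population falls outside the restricted domain.
    conn.execute("INSERT INTO City VALUES (6, 'Nowhere', -10, 12.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM City").fetchone()[0]
print(rejected, stored)
```

The DBMS refuses the second insert, so only the valid tuple ends up in the table.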

Derived attribute

A special case of attributes is derived attributes. Their value is not stored, but is rather calculated from the value of other attributes.

Continuing with the example of the City table, suppose we have an attribute Density that should indicate the population density of a city. In this case, we can define it as a derived attribute, instead of calculating the values beforehand and inserting them into the database. Thus, every time Density is queried, the operation Population/Area will be performed, returning the value to the user in the corresponding tuple.

We can see a clearer example of this if we have an attribute BirthDate and we want to calculate the value of another attribute like Age. Here, we can calculate the attribute Age directly from BirthDate as if it were a "view" on that attribute. That is, we can see a birth date as if it were an age, from which we can derive the value of the attribute Age. We’ll discuss the concept of a view later in more detail at the implementation level.
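
One possible way to realize a derived attribute is with a view, sketched below in SQLite. The view name `CityWithDensity` is hypothetical, and other DBMSs offer alternatives such as generated columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE City ("
    "CityID INTEGER PRIMARY KEY, Name TEXT, Population INTEGER, Area REAL)"
)
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?, ?)",
    [(1, "Madrid", 3223000, 604.3), (5, "Paris", 2140526, 105.4)],
)

# Density is not stored anywhere: it is computed each time the view is queried.
conn.execute(
    "CREATE VIEW CityWithDensity AS "
    "SELECT CityID, Name, Population / Area AS Density FROM City"
)
density = dict(conn.execute("SELECT Name, Density FROM CityWithDensity"))

# Changing a base attribute changes the derived value automatically.
conn.execute("UPDATE City SET Population = 2200000 WHERE Name = 'Paris'")
updated_density = conn.execute(
    "SELECT Density FROM CityWithDensity WHERE Name = 'Paris'"
).fetchone()[0]

print(round(density["Paris"], 1), round(updated_density, 1))
```

Since the value is recalculated on every query, it can never drift out of sync with Population and Area.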

Before moving on to the representation at the conceptual design level of a table, it's important to understand why a table is sometimes called a relation. A relation is a subset of the Cartesian product of the domains that the attributes have, but you can understand it more simply as a set of tuples that comply with a defined schema. For example:

Letter	Number
A	1
A	2
B	1
B	2

In this table, we can assume that the attributes Letter and Number have the domains {A, B} and {1, 2} respectively, so the entire set of possible tuples we can form with these domains are the tuples shown in the table itself.

These tuples come from the Cartesian product of both domains. So if we had larger domains, we would get a much larger Cartesian product. Any subset of its tuples is called a relation, and we can associate it with the instance of a table, which is why the term relation is sometimes used to refer to what is actually a table.
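
The same idea can be checked with a few lines of Python. The particular subset chosen as `relation` below is arbitrary and only serves as an example.

```python
from itertools import product

letters = {"A", "B"}
numbers = {1, 2}

# Cartesian product of the two domains: every possible (Letter, Number) tuple.
cartesian = set(product(letters, numbers))

# A relation is any subset of that product.
relation = {("A", 1), ("B", 2)}
is_relation = relation <= cartesian

# A tuple with a value outside the domains cannot belong to any relation.
invalid = {("C", 1)} <= cartesian

print(len(cartesian), is_relation, invalid)
```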

Conceptual Representation

Putting this aside, it's not as important to focus on formal details like the name relation, but rather to understand the structure of a table and how data is stored in it. So far, everything we've seen about tables refers to the logical design level, which is where we actually work with tables. But at the conceptual level, there is an element very similar to a table called an entity.

According to the conceptual level, a relational database is a set of entities, where each one can be likened to a table. Each entity has a series of attributes, each with a domain, where instead of attribute, it’s usually called a property at the conceptual level.

Following the example of the City table from before, at the conceptual level, there is an entity called City, shown above in a UML (Unified Modeling Language) entity-relationship diagram, which is the most common way to formally represent this type of information.

Sometimes we can use Crow's foot notation for the diagram, but here we’ll use the same notation as a class diagram in software engineering for simplicity. It’s equivalent, which is why entities are sometimes called classes.

To correctly understand what an entity is, think of it as if it were the schema of the table, or rather a class in object oriented programming that serves as a template to instantiate tuples. Just keep in mind that at the conceptual level they aren’t called tuples but rather instances or occurrences of an entity.

Intuitively, we can see it as if the attributes represented in the entity were the actual values of the first row of the equivalent table – that is, its schema. In this way, if we have a schema (that is, a template), we can create instances of that entity/schema/template simply by assigning values to those attributes. So when we assign values to the properties of an entity, we have an entity occurrence, which at the logical level we can see as a tuple.

For example, the entity City can be "instantiated" in a "tuple" like [5, Paris, France, 2140526, 105.4]. But at the conceptual level we should call it an occurrence instead of a tuple, since “instance” might cause confusion with the concept of instance we discussed earlier at the logical level.

Entity: [CityID,Name,Country,Population,Area]    --->    Tuple=Occurrence: [5,Paris,France,2140526,105.4]

So every time we see a box with a name and properties in an entity-relationship diagram, it refers to an entity that is logically equivalent to a table.
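
Since we compared an entity to a class, here is a minimal sketch of that analogy in Python. The `City` dataclass plays the role of the entity (the template), and the object created from it plays the role of an occurrence. The Python types are an assumption for this example and don't need to match the DBMS types.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class City:
    """The entity: a template listing properties and their types."""
    CityID: int
    Name: str
    Country: str
    Population: int
    Area: float


# An occurrence: the template with concrete values assigned.
paris = City(5, "Paris", "France", 2140526, 105.4)

# At the logical level, the same occurrence can be seen as a tuple.
as_tuple = astuple(paris)
print(as_tuple)
```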

Regarding the concept of an instance we saw earlier, here it’s called an entity set, and it contains all the existing occurrences of that entity at a given point in time. In the diagram, we only see the template, not the set with the occurrences of that entity (think of the tuples of a table). In other words, the diagram at the conceptual level is used to see how the database is structured, not to see its specific instances or occurrences, which are more related to the logical level.

Regarding notation, in the entity-relationship diagram, the entity is represented in a box where all its properties (attributes) are listed by name and type. Here, the type does not have to match exactly the type offered by the DBMS in the logical design, as a translation from the conceptual to the logical level is done later, as we saw before.

To the left of each attribute, a - is usually placed to indicate that it’s a private attribute. But this concept is not relevant in this database context, as it comes from the uses given in software engineering to the class diagram notation we use here.

Lastly, attribute names are usually all in lowercase, although according to the style guide you follow, this can vary – like here, where we allow uppercase to minimize changes to attribute names when translating to logical design.

Repeating Group

Once you know what an entity, or table, is, and that the database is a set of them related to each other, you’ll need to consider an important restriction about the table itself as a storage structure.

CityID	Name	Country	Temperature
5	Paris	France	7,44,20,90,1

For example, if we have a City table similar to the one above where we only want to record the temperatures of the city at different points in time, the first option we might consider is to store all the temperatures that each city has or has had in a single Temperature attribute, all together.

But this is not allowed in structured databases for efficiency reasons (as well as formally, which you’ll see later). Specifically, this situation is known as a repeating group, and it occurs when we have to store an indeterminate number of values in an attribute.

For example, if we only need to store a maximum of 5 temperatures that a city can have, we could make the data type of Temperature an array of integers with a length of 5, which would be filled as we get temperature measurements. But if we don't know how many temperatures we will measure, we can’t set an upper limit on the size of the value we are going to store, so we can’t define a specific size for the length of the data type of that attribute. This creates a repeating group.

Anyway, even if we could set a size for data structures like an array, they are usually not allowed due to the uncertainty of the size the developer might set for that array (also considered a repeating group).

At the same time, this uncertainty is the reason why repeating groups pose a problem for the physical implementation of the database. Since we don't know how much space we’ll need to represent them, we might end up wasting lots of memory trying to manage this uncertainty, as well as fragmenting it, or complicating the implementation logic in an attempt to minimize the impact of this waste and memory fragmentation.

How to avoid a repeating group

One way to solve the problem of repeating groups is to store each temperature measurement in a separate tuple. If all measurements can’t be stored in a single attribute value, then one option is to duplicate the information of the other attributes to create multiple tuples, each storing a specific temperature measurement.

CityID	Name	Country	Temperature
5	Paris	France	7
5	Paris	France	44
5	Paris	France	20
5	Paris	France	90
5	Paris	France	1

As you can see, we have duplicated information to store each temperature measurement in a tuple, which avoids accumulating them all in a single value of the same tuple. But repeating data creates (unnecessary) redundancy in the database, which is a problem.

Redundancy is not an issue in all situations, as it can sometimes be good for ensuring data availability. But in this case, we can see that it’s completely unnecessary. First, because it greatly increases the space needed to store city data by repeating the city’s information. Also, because having city data repeated so many times means that every time this data needs to be modified, you’ll have to make changes to all tuples recording the temperatures, causing operations to take too long. And if the schema is modified to add or remove attributes, the data in the affected columns must be added or deleted for every tuple – so if there is a lot of repeated data in them, those operations will also have high latency.

Data Inconsistency

On the other hand, if in the previous example we insert a temperature measurement and for some reason an error occurs during the operation, we might end up in a situation like the following:

CityID	Name	Country	Temperature
5	Paris	France	7
5	Paris	China	44

Here you can see that when inserting the measurement with a temperature of 44, an error occurred, and the tuple was recorded with an incorrect Country value. This is not common, but if we choose to solve the repeating group problem this way, we will be inserting duplicate values more often than necessary, making it more likely for these types of errors to occur.

Having the same information duplicated but with contradictory values indicates that our database has an inconsistency. This happens when the same information is duplicated in various places in the database, and the values are contradictory, such as in this example where we have multiple temperature measurements for what appears to be the same city but with the incorrect country value.

To confirm that it’s an inconsistency, we should look at the key values that uniquely identify each tuple, which we will discuss later. But intuitively, we need to focus on those attribute values that allow us to uniquely identify a tuple. If those values repeat in several tuples and the other attributes that describe the same entity contradict each other, then we have an inconsistency.

On the other hand, if in the last example the Country value of the second tuple were "France," we wouldn't have any inconsistency, even though the temperature values don't match. So it's important to understand that inconsistency mainly depends on the schema's semantics, meaning what each attribute signifies.
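
A simple way to look for this kind of problem is to group the tuples by the identifying value and check whether the descriptive attributes disagree. The sketch below assumes that CityID identifies a city and uses a hypothetical table name, `CityTemperature`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CityTemperature ("
    "CityID INTEGER, Name TEXT, Country TEXT, Temperature INTEGER)"
)
conn.executemany(
    "INSERT INTO CityTemperature VALUES (?, ?, ?, ?)",
    [(5, "Paris", "France", 7), (5, "Paris", "China", 44)],
)

# Same CityID but more than one Name or Country means contradictory data.
# Different Temperature values are expected, so they are not checked.
inconsistent = conn.execute(
    "SELECT CityID FROM CityTemperature GROUP BY CityID "
    "HAVING COUNT(DISTINCT Country) > 1 OR COUNT(DISTINCT Name) > 1"
).fetchall()

print(inconsistent)
```

The query flags CityID 5 because of the two different countries, while the two different temperatures are not treated as a contradiction.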

Finally, to solve the problem of repeating groups, you’ll typically need to refine the schema – that is, to transform it. In this specific case, we’ll perform a normalization operation, which we’ll see how to do later. This involves separating a table like the one we had before with duplicated information into several tables:

CityID	Name	Country
5	Paris	France

ReadingID	CityID	Temperature
1	5	7
2	5	44
3	5	20
4	5	90
5	5	1

Now there is a City table very similar to the original, but with the difference that it only stores a record of the existing cities, not the temperatures recorded in them.

Also, there is another table we can call Readings, which contains the temperature measurements for each city. In this table, each tuple contains a measurement and an identifier that determines the city where the measurement was taken, which in this case is CityID.

For example, if the measurement was taken in Paris and that city has a CityID value of 5, then the CityID in the Readings table will be 5 for the measurements of that city. This avoids duplicating all the city information as happened before.

By doing this, we avoid the potential inconsistency problems that arose before, and we also save disk space by not duplicating unnecessary information. More importantly, it prevents the appearance of the repetitive group.
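To make the split concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the two tables are named City and Readings, and the column types are assumptions as well, since the text only shows the values. The join at the end rebuilds the original "wide" view on demand, while the city data itself is stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Column types are assumptions; the text only shows the values.
conn.executescript("""
CREATE TABLE City (
    CityID  INTEGER PRIMARY KEY,
    Name    TEXT NOT NULL,
    Country TEXT NOT NULL
);
CREATE TABLE Readings (
    ReadingID   INTEGER PRIMARY KEY,
    CityID      INTEGER NOT NULL REFERENCES City(CityID),
    Temperature INTEGER NOT NULL
);
""")

# The city is stored once; each reading only carries its CityID.
conn.execute("INSERT INTO City VALUES (5, 'Paris', 'France')")
conn.executemany(
    "INSERT INTO Readings VALUES (?, ?, ?)",
    [(1, 5, 7), (2, 5, 44), (3, 5, 20), (4, 5, 90), (5, 5, 1)],
)

# Rebuild the combined view by joining on CityID.
rows = conn.execute("""
    SELECT r.ReadingID, c.Name, c.Country, r.Temperature
    FROM Readings AS r
    JOIN City AS c ON c.CityID = r.CityID
    ORDER BY r.ReadingID
""").fetchall()

for row in rows:
    print(row)
```

Because Country lives in a single City row, a reading can no longer carry a contradictory country value of its own.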

For this, we have had to "complicate" or rather enrich the database schema to some extent, meaning the tables that compose it and the schemas that form it. But the complexity in structured databases doesn’t come from being structured, but from the domain being modeled and its operations. In other words, the relational model of structured data is not complex by itself, as it is simply a model. What truly causes complexity is how we use that model to reflect the domain requirements.

Entity Associations

In the context of conceptual design, a relational database is not only made up of entities (tables), as this only allows us to model the existence of "objects" in the domain. Most of the time, these objects will have associations with each other, meaning they will be related.

So, in conceptual design, we have the concept of entity association, which describes how "objects" are linked to one another. This is essential for reflecting the actual structure of the information.

For example, in a domain, we can have entities like the ones above, City and Person. These model the existence of people and cities in the domain. But besides the existence of the entities themselves, it's possible that they have relationships with each other that we can model in our diagram – such as a person living in a certain city.

In this case, we use an association to allow a person in our system to live in a city, meaning we use an association to model that relationship between both entities.

At first glance, we can see that the association is represented in the entity-relationship diagram as a relationship established between entities – but it's important to remember that entities are "templates" from which occurrences of entities are generated when implementing the system (that is, specific tuples). So when we introduce an association at the conceptual level, we have to view it in terms of the tuples that will later be generated from the related entities.

For example, here the relationship can occur between one occurrence (tuple) of a city and many occurrences of a person, since many people can live in a city. But the reverse may not be true depending on the domain requirements, which may determine that a person (occurrence of the entity Person, or tuple of the table Person) can only live in one city, as we’re assuming in this case.

Association Role

In an entity-relationship diagram, the notation of the association is usually represented with a line connecting two entities, known as a binary association. But there are higher-degree associations (which we won't cover here for simplicity) that relate an arbitrary number of entities in a single association.

A role and a direction are usually added to this line to clarify the semantics of the relationship. The role is a word or phrase written above the association line and denotes the role that an entity has in the represented relationship with respect to the direction defined alongside the role.

For example, in the diagram below we have an association between a person and a city. So in the association, the role given to the person is "lives" in the city with which they are associated, as the direction has been defined from the person to the city with the arrow next to the role. In other words, in this relationship between both entities, the function that the person performs is to "live" in the city with which they are associated.

This role doesn't need to be included in all associations, nor is it necessary to establish a direction. But in some cases, it helps us understand the diagram and the domain, which is the goal of the diagram itself.

Also, the role isn't rigid and can be modeled in many ways. For example, in this case, we can reverse the direction of the association and say that the city has the role of "having residents," which are the people it’s associated with. This would model the existence of people living in the city.

Cardinality

Continuing with the different elements of an association, we have cardinality, which describes how many occurrences (tuples) of one entity can or should be associated with how many occurrences of another entity. We represent this with numbers on both sides of the association line that denote the minimum and maximum cardinality, respectively.

To understand this using the previous example, we know that a person can only live in one city, so each occurrence of Person will be associated with at most one city. In turn, we can also assume that every person must live in some city, meaning there are no people living in the woods outside of society. So since every person must be associated with exactly one city, the multiplicity we put on the City side is 1..1, which is simply written as 1.

Here, the first 1 is the minimum cardinality, indicating that each person must be associated with at least one city, while the other 1 is the maximum cardinality, indicating that each person can be associated with at most one city. For simplicity in the diagram, the number 1 is usually used to denote both cardinalities at once. Also, when we talk about people and cities here, we are referring to the actual occurrences of the entities, which at a logical level are tuples.

If we look at the other side of the association, we see it has cardinality (sometimes called multiplicity) 1..*, where 1 is the minimum cardinality, indicating that a city must be related to at least one person. This means that in all the cities within our domain, there must be at least one inhabitant.

On the other hand, the * in the maximum cardinality is a way to denote that there is no specific value that must be given to that cardinality – it can be any amount. This means a city can be associated with an arbitrary number of people, indicating that the cities in our domain can have any number of inhabitants.

Since the asterisk denotes any, unbounded amount, we don't have to worry about it being consistent with the minimum cardinality. That is, even if we set the minimum cardinality to 1, by using an asterisk for the maximum, we are indicating that the maximum can be any number from 1 to infinity. This means that cities will have at least one inhabitant and at most an infinite number.

From the minimum cardinality, we can introduce the concepts of optionality and obligation. For example, the minimum cardinalities we used before were greater than 0, which indicates that a person must always be associated with a city, and a city must always be associated with at least one person. This means that when an occurrence of one of these entities is created, it must immediately be associated with an occurrence of the entity on the other side of the association, so that the minimum cardinality is respected.

To see this at the logical design level, we first need to introduce the tools of that level with which associations are implemented – although for now, we can view it by thinking in the object-oriented paradigm, where if we instantiate a person object, it must have a reference to another city object, and vice versa.
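The object-oriented view mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class and field names are hypothetical, and it only enforces the Person side of the association (every person must reference exactly one city), not the reverse requirement that every city has at least one resident.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str


@dataclass
class Person:
    name: str
    city: City  # mandatory reference: minimum cardinality 1 on the City side

    def __post_init__(self):
        # Reject a person created without a city (obligation, not optionality).
        if not isinstance(self.city, City):
            raise ValueError("a person must be associated with exactly one city")


paris = City("Paris")
alice = Person("Alice", paris)  # valid: associated with a city at creation
print(alice)
```

Trying `Person("Bob", None)` raises a ValueError, which mirrors the idea that an occurrence can't be created without satisfying the minimum cardinality.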

Regarding optionality, let's consider another possible case where a city can have an arbitrary number of residents, including being uninhabited, meaning empty, since we have set the minimum cardinality to 0 and the maximum to *. This can also be represented more simply by just using the asterisk, indicating an arbitrary amount including 0.

Now, let's also assume that a person can live in one or two cities, so their corresponding cardinality is modified to 1..2, indicating that a person must be associated with at least one city and at most two cities simultaneously at any point in their life cycle.

This occurs since the entity-relationship diagram is instantaneous, not historical, which means that what we see in the diagram is an instantaneous representation of our domain, not a representation of its life cycle or evolution over time.

So when we see that an association has a multiplicity of 1..2 as in this case, we must think that at any given moment, a person must be associated with at least one city and at most two cities. We shouldn’t think that a person must have been related to at least one city and at most two cities throughout their whole lifetime.

Here we can see that a city may have no residents due to the minimum cardinality of 0 that we have set on the person side, indicating that a city may not be associated with any person. With this, we can model optionality, which refers to allowing an association not to occur. That is, when we create a city from its entity (template), we don't have to associate it with a person, since it can be associated with 0 people at minimum. This means that it's not necessary to add a reference to any person because a city may be abandoned and have no residents.

To correctly understand optionality, we can modify the example again so that a person can be associated with either no city or one city, indicating that the person may not live in any city or may live in one. Also, on the other side of the association, we also change the maximum cardinality to 500, indicating that a city can have an arbitrary number of residents between 0 and 500, meaning it can be associated with any number of people from 0 to 500, inclusive. This means having residents is optional.

With this, it should be clear that we can set cardinalities as we want according to the domain and requirements – but we always need to ensure they are correct and make sense. For example, you can’t set a maximum cardinality that is strictly less than the minimum cardinality.

In this case, something peculiar happens: on both sides, we have a minimum cardinality of 0, meaning we have optionality. So when we create new instances of the entities, they don't have to be associated with instances of entities on the other side of the association. We can see this as if the association we modeled is entirely optional.

To conclude, although we can set any number for minimum and maximum cardinalities depending on the modeled domain, the most common ones are 1..1, 1..M, or M..N, where M and N can be arbitrary numbers, including 0 in the case of M..N, as long as they aren’t both 0 at the same time (because in that case, the association could not exist).
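As a small sketch of the rules in this section, the two hypothetical helpers below check that a cardinality is well formed (the maximum is not below the minimum) and that a given number of associated occurrences respects it. Using None to stand for the asterisk is an assumption made for this example.

```python
from typing import Optional


def valid_cardinality(minimum: int, maximum: Optional[int]) -> bool:
    """Return True if min..max is well formed; maximum=None stands for '*'."""
    if minimum < 0:
        return False
    return maximum is None or maximum >= minimum


def satisfies(count: int, minimum: int, maximum: Optional[int]) -> bool:
    """Return True if `count` associated occurrences respect min..max."""
    return count >= minimum and (maximum is None or count <= maximum)


print(valid_cardinality(1, None))  # 1..*
print(valid_cardinality(0, 500))   # 0..500
print(valid_cardinality(3, 2))     # maximum below minimum: not allowed
print(satisfies(0, 0, 500))        # an empty city is fine under 0..500
print(satisfies(0, 1, 2))          # a person with no city violates 1..2
```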

Recursive Associations

On the other hand, an association does not necessarily have to relate multiple entities. We can use it to model a relationship between occurrences of the same entity. For example, if we want to model the friendship relationship between people in our domain, we can use a recursive association in the entity Person:

First of all, it’s very convenient to establish a role in recursive associations, as it’s the simplest way to represent their semantics so we can easily understand them when looking at the diagram.

But in this case, it’s not as useful to specify the direction of the association since the friendship relationship can be considered symmetric. Here, we have modeled the friendship relationship so that one occurrence of Person can be associated with any number of other occurrences of Person, including none, which indicates that in our domain, a person (occurrence of Person entity) can have an arbitrary number of friends, including 0.

Regarding notation, it makes no difference to use 0..* or *, as they indicate the same thing – but we should always use the shortest and simplest notation to understand.

In summary, a recursive association is simply one where both related entities are the same. In this case, the friendship association necessarily relates people to people, meaning it establishes which people are friends with each other.
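One possible way a recursive association like this could later look at the logical level is sketched below with Python's sqlite3 module. The Friendship table, its column names, and the choice to store each symmetric pair only once are all assumptions made for illustration, not something the conceptual diagram prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL
);
-- Both columns reference the same table: the association is recursive.
CREATE TABLE Friendship (
    PersonA INTEGER NOT NULL REFERENCES Person(PersonID),
    PersonB INTEGER NOT NULL REFERENCES Person(PersonID),
    PRIMARY KEY (PersonA, PersonB),
    CHECK (PersonA < PersonB)  -- store each symmetric pair only once
);
""")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave")],
)
conn.executemany("INSERT INTO Friendship VALUES (?, ?)", [(1, 2), (1, 3)])


def friends_of(person_id):
    """Return the names of a person's friends, looking at both sides of the pair."""
    rows = conn.execute(
        """
        SELECT p.Name FROM Friendship f
        JOIN Person p ON p.PersonID = f.PersonB WHERE f.PersonA = ?
        UNION
        SELECT p.Name FROM Friendship f
        JOIN Person p ON p.PersonID = f.PersonA WHERE f.PersonB = ?
        ORDER BY 1
        """,
        (person_id, person_id),
    ).fetchall()
    return [name for (name,) in rows]


print(friends_of(1))
print(friends_of(4))  # Dave has no friends, which the * cardinality allows
```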

Associative Entity

Now that we know what associations are, let’s learn about the concept of an associative entity. In some cases it’s also called a property just like the associations themselves. In the following example, each city in the domain must have at least one resident and can host at most 500 (cardinality 1..500). Also, a person can live in anywhere from 0 to 3 cities.

The above conceptual diagram would model this situation. As it stands, we can’t store any information about the person's stay in the city, meaning we can’t save information like the dates they started living in that city or moved to another. If we try to do so, we’ll have several options that lead to certain problems in the database.

On one hand, we could choose to add attributes like StartDate and EndDate to the Person entity to determine the respective dates when a person started living in a city or moved to another. But this wouldn't work even if the multiplicity 0..3 on the City side were 1..1. Even though a person could then live in only one city at a time, they may still move several times throughout their life. This would require multiple pairs (StartDate, EndDate) to be recorded. So since we need to store multiple pairs of these dates, a repeating group would be generated in the respective properties (attributes), forcing us to refine the schema.

On the other hand, we could store those attributes in the City entity, but we would encounter a very similar problem here. We would have to record multiple pairs of (StartDate, EndDate) values for each person, with the added complexity that a city can have many residents. This would also create a repeating group, along with the issue of associating each (StartDate, EndDate) pair with the correct person.

To address this situation, ideally, we should be able to store these attributes within the association itself. This way, when a person starts living in a city, their association would contain these attributes, and they could record the date the person started living in the city as well as the date they stop. This value (when they stop living in the city) can be left blank or set to "NULL" until they actually leave and the association is no longer valid.

To achieve this, at a conceptual level, associative entities are used. These are entities whose main purpose is to allow our database to store information about the associations between entities.

As you can see, associative entities (also called associative classes) are "related" to associations between entities, not directly to other entities, and they don't have multiplicity or roles. This is because they exist only when the association between the entities is actually established. For example, when a person starts living in a city, an association between that person and that city is created, and this association is linked to an occurrence of the associative entity where the respective attributes like StartDate and EndDate are stored.

So for each person-city association we have, there will also be an occurrence of the Residence entity with the values of its corresponding properties. Also, keep in mind that this association doesn't exist all the time, as the person may stop living in that city – so the association itself may cease to be valid or, rather, cease to exist conceptually.

But depending on how we translate the entity-relationship diagram to the logical design of the database, we might want to keep the StartDate and EndDate values that the occurrence of the associative entity had, even after the association has ended.

If we want this, we will need to specify it in the logical model of the database or in the conceptual model with a note in the diagram's margin. This is because, at a conceptual level, there are no specific tools beyond notes to specify these kinds of details, which are more related to the logical design.
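As a rough preview of how an associative entity like Residence might be translated, the sketch below stores one row per person-city stay using sqlite3. The table layout, the key choice, and the sample dates are assumptions made for illustration. In this variant, ended stays are kept as history and a NULL EndDate marks a stay that is still ongoing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE City   (CityID   INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- Residence holds the attributes that belong to the association itself.
CREATE TABLE Residence (
    PersonID  INTEGER NOT NULL REFERENCES Person(PersonID),
    CityID    INTEGER NOT NULL REFERENCES City(CityID),
    StartDate TEXT NOT NULL,
    EndDate   TEXT,              -- NULL while the person still lives there
    PRIMARY KEY (PersonID, CityID, StartDate)
);
""")
conn.execute("INSERT INTO Person VALUES (1, 'Alice')")
conn.executemany("INSERT INTO City VALUES (?, ?)", [(5, "Paris"), (6, "Madrid")])
conn.executemany(
    "INSERT INTO Residence VALUES (?, ?, ?, ?)",
    [
        (1, 5, "2015-01-01", "2019-06-30"),  # past stay, kept as history
        (1, 6, "2019-07-01", None),          # current stay
    ],
)

# Only rows without an EndDate describe associations that are valid right now.
current = conn.execute("""
    SELECT p.Name, c.Name
    FROM Residence r
    JOIN Person p ON p.PersonID = r.PersonID
    JOIN City   c ON c.CityID   = r.CityID
    WHERE r.EndDate IS NULL
""").fetchall()
history_count = conn.execute("SELECT COUNT(*) FROM Residence").fetchone()[0]

print(current)
print(history_count)
```

Each (StartDate, EndDate) pair gets its own row, so neither Person nor City needs a repeating group.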

Aggregation and Composition

Since a UML entity-relationship diagram is used at the conceptual level, there are modifiers we can use in the associations to give them a particular meaning. But this has no effect at the logical level – meaning the introduction of these modifiers in the conceptual diagram doesn't imply any kind of change at the logical level. They are simply used to clarify the details of the modeled domain.

On one hand, an association can be of the aggregation type, like between Person and City, where aggregation is denoted by an unfilled diamond and signifies that a city can be composed of people. This means that the entity with the diamond is composed of entities on the other side of the association.

Also, in the specific case where we create and destroy entity occurrences at the same time, the aggregation becomes a composition, denoted by a filled diamond. It then works the same way as aggregation – the only difference being the meaning it conveys.

For example, in the above diagram we have modeled that a person is composed of a single brain. Since a person's brain can’t exist independently of the person, the association is denoted as a composition. This is because aggregation would allow the brain to exist independently, which is not possible.

If we look at it inversely, the composition does not prevent the person from existing without being related to a brain, although the 1..1 cardinality we have placed on both sides models this situation, requiring all people to have exactly one brain.

The important thing to understand is that both composition and aggregation are just associations with additional meaning. This means that they don’t influence the logical design of the database itself, much less at the implementation level.

Generalization and Specialization

Another feature of the entity-relationship model is that, besides modeling associations between entities as we have seen, it can also model other types of relationships between entities. This can be useful in many situations.

For example, if we have a domain where there are people who can be clients or employees, we can use a generalization and specialization relationship like the following:

Generalization-specialization relationships work the same way as in object orientation. We have a class like Person with a set of attributes, allowing for specializations of that class like Client or Employee, where all instances are also people but with more specific attributes.

In the case of Client, it’s a specialized entity derived from the Person entity, so it inherits all the attributes of its parent entity since a client is also a person. In addition to these inherited attributes, it has others specific to being a client. So when an instance of the Client entity is created, think of it as having all the attributes of both Client and Person at the same time. The same happens with Employee but with its respective attributes.

If we look at it from a set theory perspective, first we have the entity Person, which gives rise to a set of entities that are people, meaning the occurrences of that entity. Within this entire set, it's possible that, in addition to occurrences of Person, there could be occurrences of Client, since every client is a person. So in the set of people, there will be some who are clients. This also happens with Employee, where in the set of people, there will also be employees, with all of them being people.

Also, nothing prevents a person from being both a client and an employee at the same time, so there will also be elements in the set that are both a client and an employee. But this detail is closer to the logical design of the database than to the conceptual representation of generalization and specialization presented here. The names generalization and specialization reflect this structure: an entity like Person is more general than Client or Employee, which are its specializations.
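Since generalization and specialization work as in object orientation, a short Python sketch can illustrate the inheritance and the set view. The attribute names (birth, loyalty_points, salary) are hypothetical, and this simple form doesn't model a person who is both a client and an employee.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    birth: str


@dataclass
class Client(Person):
    loyalty_points: int = 0  # attribute specific to clients (hypothetical)


@dataclass
class Employee(Person):
    salary: float = 0.0      # attribute specific to employees (hypothetical)


people = [
    Person("Alice", "1985-07-12"),
    Client("Bob", "1990-03-05", loyalty_points=120),
    Employee("Carol", "1978-11-23", salary=3000.0),
]

# Every client and every employee is also a person (set inclusion).
clients = [p for p in people if isinstance(p, Client)]
employees = [p for p in people if isinstance(p, Employee)]

print(all(isinstance(p, Person) for p in people))
print([p.name for p in clients], [p.name for p in employees])
```

A Client instance carries the inherited name and birth attributes plus its own, which matches the idea that a specialization has all the attributes of its parent entity.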

Entity Association Pitfalls

When we are in the conceptual design stage and create the entity-relationship diagram, it's common to encounter association structures that initially seem correct but, when implemented in a DBMS, lead to ambiguities or unexpected problems that require us to refine the conceptual design. One of these structures is the Fan Trap:

The Fan Trap appears when we have a "central" class like City that is associated in a "fan" shape with two others, Person and Pool, where each has a maximum cardinality of many (*) on its side. This means a city can be associated with many people and many pools at the same time.

This situation is initially correct, but the problem arises when we want to know which people from a certain city go to which pool. This becomes complicated because if we are given a certain person, we can know their city, as we have defined that a person can only live in one city. But the city can have many pools, so we don't know which specific pool the person goes to. We can only know which pools the city has where they live. Also, the city might have no pools, given the minimum cardinality of 0 on the pool side.

On the other hand, if we are given a pool, we can determine which city it belongs to. Then with that city, we can find out the group of people living there, which we can use to solve the previous question – but in a much more complex way.

To solve this problem, there are many alternatives, although the simplest in this case is to add an explicit association between Person and Pool to model the fact that a person goes to a pool. But if we’re not going to make these types of queries frequently, it might not be worthwhile to complicate the diagram.

There is also the Chasm Trap, which is similar to a Fan Trap but with important differences. For example, in the diagram above, you can see a Chasm Trap. It occurs when we are given a city and asked to find the pools located in it. The only thing we can do is get the group of people living in that city and, from that group, identify some of the pools the city has.

In other words, each pool may or may not have an association with a person, since a pool doesn't need to have any visitors. So, if we try to find all the pools in a city by simply looking at the pools the city's residents go to, we will miss any pool that no resident visits. The 0..30 cardinality on the Person side allows a pool to have no associated people at all, meaning no one goes to that pool.

So if there are pools that no one visits, we won't be able to find them through a group of people. This means that, given a city, we might not know all the pools it has, because if we solve the query this way, we can only be sure of knowing the pools that the city's residents visit. But if there's a pool that no one visits, then that pool won't be accessible through a person. In other words, people won't see those pools, since the 1..* relationship requires them to visit some pool – but it can still happen that no one visits a certain pool.
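The Chasm Trap can be reproduced with a small sqlite3 sketch. The table and column names are assumptions, and the Pool table here already includes a direct CityID column (one possible fix) so that the two ways of answering the question can be compared side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE City   (CityID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
                     CityID INTEGER NOT NULL REFERENCES City(CityID));
-- CityID on Pool is the explicit City-Pool association added as a fix.
CREATE TABLE Pool   (PoolID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
                     CityID INTEGER NOT NULL REFERENCES City(CityID));
CREATE TABLE Visits (PersonID INTEGER NOT NULL REFERENCES Person(PersonID),
                     PoolID   INTEGER NOT NULL REFERENCES Pool(PoolID),
                     PRIMARY KEY (PersonID, PoolID));

INSERT INTO City   VALUES (5, 'Paris');
INSERT INTO Person VALUES (1, 'Alice', 5), (2, 'Bob', 5);
INSERT INTO Pool   VALUES (10, 'North Pool', 5), (11, 'South Pool', 5);
INSERT INTO Visits VALUES (1, 10), (2, 10);  -- nobody visits South Pool
""")

# Path through Person: only finds pools that some resident visits.
via_people = [name for (name,) in conn.execute("""
    SELECT DISTINCT Pool.Name
    FROM Person
    JOIN Visits ON Visits.PersonID = Person.PersonID
    JOIN Pool   ON Pool.PoolID     = Visits.PoolID
    WHERE Person.CityID = 5
    ORDER BY Pool.Name
""")]

# Direct City-Pool association: finds every pool in the city.
direct = [name for (name,) in conn.execute(
    "SELECT Name FROM Pool WHERE CityID = 5 ORDER BY Name"
)]

print(via_people)
print(direct)
```

The pool that no one visits only shows up through the direct association, which is exactly the gap the Chasm Trap describes.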

The solution to this problem is practically the same as for the Fan Trap, although there are many alternatives depending on the domain and requirements. There are also more situations that can lead to these problems or ambiguities which you can read more about here.

Keys

So far, we have talked about entities and associations at the conceptual level, as well as tables at the logical level. Continuing with the logical level, we have not yet introduced any mechanism to uniquely identify the tuples contained in a table. This can be very useful since tuples are data points – that is, occurrences of an entity, like people, cities, and so on.

Uniquely identifying them makes it easier to perform operations or queries on the table. It also allows us to implement associations between entities at the logical level through references between tables.

Keys are sets of attributes used to uniquely identify each tuple in a table. The combination of values in these attributes must be different for every tuple, so that no two tuples are the same.

To understand this concept, let’s start by looking at the different types of keys and their main utility.

Superkeys

Superkeys are sets of attributes that uniquely identify each tuple in a table. They are the most general type of key. As long as the combination of values for those attributes is unique for every tuple, the set of attributes qualifies as a superkey.

Here’s an example:

ID	SSN	Name	Birth	Email
30	74	Alice Johnson	1985-07-12	alice.johnson@example.com
22	59	Bob Smith	1990-03-05	bob.smith@example.org
95	10	Carol Davis	1978-11-23	carol.davis@example.net
21	32	David Brown	2001-01-30	david.brown@example.com
47	61	Emily Wilson	1995-09-14	emily.wilson@example.co.uk

In this case, we have a table called Person where each row stores a person's data. Each person has a government ID number, as well as a Social Security Number (SSN), name, and other details.

A possible superkey would be the attributes {ID, Name}, because among all the people that exist, no two people can have the same name and the same government ID number. But if we choose only {Name} as a superkey and try to uniquely identify all the rows in the table, depending on the data in the rows, we might encounter a situation where two people have exactly the same name, with identical first and last names. In this case, we couldn't uniquely identify both by their name alone.

So by including the ID in the superkey, we can differentiate between the two people/rows, as they can’t have the same government ID. We could also have chosen {ID, SSN} or even {SSN, Name} as a superkey, since the combinations of values in those attributes can’t repeat among different people. It’s impossible, for example, for multiple people to have the same name and Social Security Number.

Here’s another way to look at this: if we choose {ID, Name} as a superkey, then there can't be multiple rows in the table with the same ID and Name values. In other words, if we choose that superkey, it's because we are sure that this situation won’t occur, ensuring that all rows have a unique combination of values for the ID and Name attributes.

This mainly depends on the domain, as identifying a superkey formally is not simple. It involves knowing all the domains and associated constraints of the attributes in detail, as well as the functional dependencies between them (which we’ll discuss later).

In summary, although you can identify a superkey by formal methods, we won’t go into detail about them here. They’re usually not simple, as they combine techniques like closure or backtracking, which aren't needed to understand the concept of a superkey. So for now, it's enough to focus on the semantics of each attribute and stick to those attributes that we know can't be repeated in multiple rows, like identifying codes of entities or other properties that the domain guarantees to be unique.

Lastly, regarding the above table, we have seen some of the possible superkeys that can exist. But if we want to find all of them, we’ll first assume that the attributes whose values can repeat across tuples are Name, Birth, and Email, since multiple people can have the same name, email, or birth date. Considering that ID and SSN do not repeat because they are government identifiers, every set of attributes that contains ID or SSN is a superkey. Ordered by their size or cardinality, they are:

Cardinality 1: {ID}, {SSN}
Cardinality 2: {ID, SSN}, {ID, Name}, {ID, Birth}, {ID, Email}, {SSN, Name}, {SSN, Birth}, {SSN, Email}
Cardinality 3: {ID, SSN, Name}, {ID, SSN, Birth}, {ID, SSN, Email}, {ID, Name, Birth}, {ID, Name, Email}, {ID, Birth, Email}, {SSN, Name, Birth}, {SSN, Name, Email}, {SSN, Birth, Email}
Cardinality 4: {ID, SSN, Name, Birth}, {ID, SSN, Name, Email}, {ID, SSN, Birth, Email}, {ID, Name, Birth, Email}, {SSN, Name, Birth, Email}
Cardinality 5: {ID, SSN, Name, Birth, Email}
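The list above can be reproduced mechanically. The sketch below is a small illustration, under the same assumption as the text that only ID and SSN are guaranteed to be unique: it enumerates every non-empty subset of the attributes and keeps those containing ID or SSN.

```python
from itertools import combinations

attributes = ["ID", "SSN", "Name", "Birth", "Email"]
unique_attributes = {"ID", "SSN"}  # assumption taken from the text

# A set of attributes is a superkey here if it contains ID or SSN.
superkeys = [
    set(combo)
    for size in range(1, len(attributes) + 1)
    for combo in combinations(attributes, size)
    if unique_attributes & set(combo)
]

# Count the superkeys of each cardinality, from 1 to 5.
counts_by_size = [
    sum(1 for key in superkeys if len(key) == size)
    for size in range(1, len(attributes) + 1)
]

print(len(superkeys))
print(counts_by_size)
```

Out of the 31 non-empty subsets of five attributes, the 7 that contain neither ID nor SSN are excluded, which leaves the 24 superkeys listed above.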

Candidate Keys

Next, we have candidate keys. Their main purpose is the same as superkeys, with the only difference being that in this case, they use the minimum number of attributes possible for identification.

For example, before, as a superkey, we could choose {ID, Name}, among other options. But that superkey contains the ID attribute, which represents the government identifier for each person, and we have legal assurance that it is unique for each person.

So, since we know that each person's ID is unique, as is their Social Security Number because it’s also a number related to government procedures, we can reduce the number of attributes needed to uniquely identify each tuple and choose a candidate key like {ID} or {SSN}. We could also consider {Email} as a candidate key, although we assume that several people could have the same email, so we do not count it as a candidate key.

As you can see, conceptually the candidate keys play the same role as superkeys, but here the goal is to achieve identification with fewer attributes, specifically with the minimum number possible. In this example, by considering candidate keys with a single attribute like {ID}, we have managed to uniquely identify tuples with the smallest possible number of attributes, since you can’t form any type of key with fewer than one attribute.

Also, to verify that a key is a candidate key and not just a superkey, you can check that no proper subset of its attributes forms a key by itself.

For example, if we have a key like {ID, Name} and want to check if it is a candidate key, we just need to check all the proper subsets of its attributes, which are {ID} and {Name} (larger keys would also have proper subsets with more than one attribute). And remember that several people can have the same name, but if we look at the subset {ID}, we will see that no person has the same ID as another.

So since there is a subset that can uniquely identify the tuples, that subset fulfills the fundamental property of any key. This means that the {ID, Name} we were checking is not a candidate key, as there is a proper subset of its attributes that is already a key.

If we repeat this process exhaustively, we are guaranteed to find a candidate key, that is, a minimal set of attributes that serves as a key to identify the tuples.

So basically, a candidate key is just a minimal superkey: it uniquely identifies each tuple, and if we remove any column from it, it no longer uniquely identifies tuples.
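The minimality check can be sketched against sample data. Keep in mind this is only an approximation: real keys come from the semantics of the domain, while this code can only test uniqueness in the rows it is given. The helper names are hypothetical, and the last row is an invented second "Bob Smith" added so that Name visibly repeats.

```python
from itertools import combinations

rows = [
    {"ID": 30, "SSN": 74, "Name": "Alice Johnson"},
    {"ID": 22, "SSN": 59, "Name": "Bob Smith"},
    {"ID": 95, "SSN": 10, "Name": "Carol Davis"},
    {"ID": 21, "SSN": 32, "Name": "David Brown"},
    {"ID": 47, "SSN": 61, "Name": "Emily Wilson"},
    {"ID": 88, "SSN": 45, "Name": "Bob Smith"},  # hypothetical extra row
]


def is_superkey(rows, attrs):
    """True if no two rows share the same values for attrs (in this data)."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def is_candidate_key(rows, attrs):
    """True if attrs is a superkey and no proper subset of it is one."""
    if not is_superkey(rows, attrs):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )


print(is_superkey(rows, ("Name",)))            # names repeat
print(is_superkey(rows, ("ID", "Name")))       # unique, so a superkey
print(is_candidate_key(rows, ("ID", "Name")))  # not minimal: {ID} is enough
print(is_candidate_key(rows, ("ID",)))         # minimal
```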

In practice, we rarely enumerate every superkey or worry about the labels. We just look for a set of attributes that uniquely identifies each tuple, preferably with as few attributes as possible. In design, at the logical level, we could define multiple candidate keys (and, implicitly, many superkeys), but the important step is choosing one candidate key as the primary key to uniquely identify tuples.

Primary Keys

Once we have all the candidate keys that exist (since there can be several depending on the domain and tables we are dealing with), we need to select one of them as the primary key to implement in the DBMS. This way, we can have a key that uniquely identifies the tuples. In other words, a table can have many candidate keys, but these keys are subsets of attributes that we analyze theoretically.

To make them practical and actually identify the tuples in a table, we need to implement one of them in the logical model. Basically, we need to tell the DBMS which of all the candidate keys is the primary one we’ve selected for identification.

With this, we can infer that the name "candidate key" comes from the fact that there can be many minimal subsets of attributes with which we can identify the tuples. But in practice, we only use one of them, which is the one we indicate to the DBMS, that is, the primary key.

In the previous example, from all the superkeys {ID, Name}, {SSN, Name}, {ID, Email}, and so on, we can derive the candidates {ID} and {SSN}, from which we can choose {ID} as the primary. You technically have the option to make this choice arbitrarily, but it’s better to consider the technical details of the implementation, as well as the semantics of the attributes that form the key to keep it easy to understand, among other factors.

Even though the primary key is selected for use at the logical level, it can also be represented in the entity-relationship diagram at the conceptual level. If it consists of a single attribute, it’s marked with {id} next to its data type. But if the primary key is composite (meaning it’s made up of several attributes, none of which is enough on its own to uniquely identify the tuples, but which do so together), then all of them are marked with {id}. Other candidate keys and superkeys aren’t specially marked in the diagram because there can be many.

Alternate Keys

Of all the candidate keys we have, we only choose one as the primary, leaving all the others aside. These keys that aren’t selected as primary are called alternate keys, and their main use is the same as that of a primary key: to uniquely identify the tuples in case the primary key is not accessible or it’s not convenient to use it.

You can also use alternate keys to improve the efficiency of certain operations or queries on the table, as indexes can be defined on them. But we won’t go into detail about this type of optimization technique here.

In our example, if the candidate keys were {ID} and {SSN} and we choose {ID} as the primary, then {SSN} will be the only alternate key we have.
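As a minimal sketch of how this choice might look once it reaches a DBMS, the following uses Python's built-in sqlite3 module. The column types and the SSN values are assumptions made up for illustration. ID is declared as the primary key, and the alternate key SSN is declared UNIQUE so that it keeps identifying tuples as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ID is the candidate key chosen as primary key.
# SSN is the alternate key: not primary, but still required to be unique.
conn.execute("""
    CREATE TABLE Person (
        ID    INTEGER PRIMARY KEY,
        SSN   TEXT NOT NULL UNIQUE,
        Name  TEXT NOT NULL,
        Email TEXT
    )
""")
# The SSN values below are invented placeholders.
conn.execute(
    "INSERT INTO Person VALUES (1, '000-00-0001', 'Alice Johnson', 'alice.johnson@example.com')"
)

def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO Person VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Same SSN under a new ID: rejected by the alternate key.
duplicate_ssn = try_insert((2, "000-00-0001", "Bob Smith", "bob.smith@example.org"))
# New ID and new SSN: accepted.
fresh_row = try_insert((2, "000-00-0002", "Bob Smith", "bob.smith@example.org"))
print(duplicate_ssn, fresh_row)  # False True
```

Either key could be used to look up a person here; the difference is only which one we told the DBMS to treat as primary.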

Composite Keys

Another type of key is a composite key, which is defined as a candidate key composed strictly of more than one attribute because each attribute alone is not enough to uniquely identify the tuples in the table.

CityName	Country	Population	Area
Madrid	Spain	3,223,000	604.3
Athens	Greece	664,046	38.96
Nantes	France	320,732	65.19
Tokyo	Japan	13,929,286	2,191.1
Paris	France	2,140,526	105.4
San José	Costa Rica	333,980	44.6
San José	USA	1,013,240	469.7

For example, here we have a City table with information about cities around the world. As you can see, the attributes CityName and Country alone can’t uniquely identify each city, since there are cities in the world that share a country, like Nantes and Paris, and there are also cities with the same name that are located in different countries.

This means that neither of these attributes can serve as a candidate key on its own, as there are multiple cities with the same value in each of them when viewed individually.

But if we look at them together and consider the composite key {CityName, Country}, we see that no two cities in our list share both the same name and the same country. And since neither attribute is unique on its own, the set is also minimal, so it meets the requirements to be a candidate key. It’s also a superkey, since all candidate keys are superkeys.
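The reasoning above can be checked mechanically on the sample rows. This is only a sketch: the helper names is_superkey and is_candidate_key are made up here, and passing the check on a sample only shows that the data doesn't contradict the key. Whether the attributes really form a key is a statement about the domain, not about seven rows.

```python
from itertools import combinations

# Sample rows from the City table above (only the two attributes under study).
cities = [
    {"CityName": "Madrid", "Country": "Spain"},
    {"CityName": "Athens", "Country": "Greece"},
    {"CityName": "Nantes", "Country": "France"},
    {"CityName": "Tokyo", "Country": "Japan"},
    {"CityName": "Paris", "Country": "France"},
    {"CityName": "San José", "Country": "Costa Rica"},
    {"CityName": "San José", "Country": "USA"},
]

def is_superkey(rows, attrs):
    """True if no two rows share the same values on all attributes in attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)

def is_candidate_key(rows, attrs):
    """True if attrs is a superkey and no proper, non-empty subset of it is one."""
    if not is_superkey(rows, attrs):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )

print(is_superkey(cities, ("CityName",)))                 # False: two cities named San José
print(is_superkey(cities, ("Country",)))                  # False: two cities in France
print(is_candidate_key(cities, ("CityName", "Country")))  # True
```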

This way, we ensure that it’s indeed a composite key, which we can then select as the primary key. This is why sometimes in the definition of a composite key, the term primary key is used instead of candidate key.
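If we do select this composite key as the primary key, a DBMS lets us declare it over both columns at once. The following is a small sketch with Python's sqlite3 module; the column types are assumptions, and the helper insert_city is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key spans two columns: the pair must be unique, not each column.
conn.execute("""
    CREATE TABLE City (
        CityName   TEXT NOT NULL,
        Country    TEXT NOT NULL,
        Population INTEGER,
        Area       REAL,
        PRIMARY KEY (CityName, Country)
    )
""")
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?, ?)",
    [
        ("San José", "Costa Rica", 333980, 44.6),
        ("San José", "USA", 1013240, 469.7),
        ("Paris", "France", 2140526, 105.4),
        ("Nantes", "France", 320732, 65.19),
    ],
)

def insert_city(name, country):
    """Return True if the (name, country) pair is accepted as a new city."""
    try:
        conn.execute(
            "INSERT INTO City (CityName, Country) VALUES (?, ?)", (name, country)
        )
        return True
    except sqlite3.IntegrityError:
        return False

repeated_pair = insert_city("Paris", "France")  # same pair again: rejected
new_pair = insert_city("Paris", "USA")          # same name, other country: accepted
print(repeated_pair, new_pair)  # False True
```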

Surrogate Keys

So far, we have seen keys formed by choosing a set of attributes from a table that can uniquely identify tuples. But sometimes this may not be possible.

Name	Birth date	Email
Alice Johnson	1985-07-12	alice.johnson@example.com
Bob Smith	1990-03-05	bob.smith@example.org
Carol Davis	1978-11-23	carol.davis@example.net
David Brown	2001-01-30	david.brown@example.com
Emily Wilson	1995-09-14	emily.wilson@example.co.uk

For example, in this Person table, we have the same attributes as before except for ID and SSN, which were the only government identifiers we could use to uniquely distinguish people or tuples in the table.

Now, no matter which subset of attributes we choose, it can’t serve as a key, since we assume there could be multiple people with the same name, born on the exact same date, and using the same email address (this is an assumption here and may not be true depending on the modeled domain).

Since we can’t choose a key with the attributes we have, we need to artificially generate an attribute that can serve as a key. This attribute is known as a surrogate key, and in its most common form it contains sequential numeric values for all the tuples. This means that to ensure each one has a unique value in this attribute, tuples are numbered with consecutive integers starting from 1 as they are inserted, guaranteeing the key property.

SurrogateKey	Name	Birth	Email
1	Alice Johnson	1985-07-12	alice.johnson@example.com
2	Bob Smith	1990-03-05	bob.smith@example.org
3	Carol Davis	1978-11-23	carol.davis@example.net
4	David Brown	2001-01-30	david.brown@example.com
5	Emily Wilson	1995-09-14	emily.wilson@example.co.uk
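A sketch of how such an auto-incrementing surrogate key could be produced, here with SQLite through Python's sqlite3 module. Other DBMSs use different syntax for the same idea, and the column types are assumptions. Note that the insert never mentions SurrogateKey: the DBMS assigns it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        SurrogateKey INTEGER PRIMARY KEY AUTOINCREMENT,
        Name  TEXT,
        Birth TEXT,
        Email TEXT
    )
""")

people = [
    ("Alice Johnson", "1985-07-12", "alice.johnson@example.com"),
    ("Bob Smith", "1990-03-05", "bob.smith@example.org"),
    # An identical second "Alice" is still a distinct tuple thanks to the key.
    ("Alice Johnson", "1985-07-12", "alice.johnson@example.com"),
]
conn.executemany("INSERT INTO Person (Name, Birth, Email) VALUES (?, ?, ?)", people)

keys = [
    row[0]
    for row in conn.execute("SELECT SurrogateKey FROM Person ORDER BY SurrogateKey")
]
print(keys)  # [1, 2, 3]
```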

In addition to this auto-incremental approach, where the surrogate key is an integer value that increases as tuples are inserted into the table, the attribute can instead assign each tuple a UUID (Universally Unique Identifier), a 128-bit value usually represented as a string, which allows us to assign a unique value to each tuple.

SurrogateKey	Name	Birth	Email
e9e5a22b-d90c-4e5a-8d49-bbc24ff9335e	Alice Johnson	1985-07-12	alice.johnson@example.com
374d6cbe-fc29-4db0-91db-d21a1e2fef3c	Bob Smith	1990-03-05	bob.smith@example.org
57f182c5-47e2-4b71-b82c-63dc1795f9f5	Carol Davis	1978-11-23	carol.davis@example.net
a979dd61-daa4-4d88-a9f3-9a60c23d5b16	David Brown	2001-01-30	david.brown@example.com
179f4e15-0124-4a80-a25d-80e94a8e4ed9	Emily Wilson	1995-09-14	emily.wilson@example.co.uk
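As a rough sketch of the UUID variant, the key can be generated by the application before the insert. This example uses Python's uuid module (random, version 4 UUIDs) and stores the value as text in SQLite. The helper add_person is hypothetical, and the generated values will differ on every run.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person (SurrogateKey TEXT PRIMARY KEY, Name TEXT, Birth TEXT, Email TEXT)"
)

def add_person(name, birth, email):
    """Insert a person under a freshly generated UUID and return that key."""
    key = str(uuid.uuid4())  # 128-bit value in its usual 36-character text form
    conn.execute("INSERT INTO Person VALUES (?, ?, ?, ?)", (key, name, birth, email))
    return key

alice_key = add_person("Alice Johnson", "1985-07-12", "alice.johnson@example.com")
bob_key = add_person("Bob Smith", "1990-03-05", "bob.smith@example.org")
print(alice_key, bob_key)
```

Unlike the auto-increment approach, no central counter is needed here, so keys can be created before the tuple ever reaches the database.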

Lastly, it’s important to note that the surrogate key is simply a mechanism to identify tuples, so it has no semantics in our domain. In other words, the values taken by the artificial attribute we have generated do not mean anything concerning the tuples or the domain in which they are represented.

Secondary Keys

The previous types of keys generally help solve the problem of uniquely identifying tuples. But besides identifying them, it’s important to operate on them and query them efficiently.

To do this, indexes are usually defined on attributes that do not necessarily identify the tuples, such as the name or birth date of the previous Person table. By defining an index on one of these attributes, we can efficiently perform certain operations on the tuples of the table, all based on the values taken by the attributes on which we have defined an index. These attributes are called secondary keys, although we won’t go into detail about what an index is here.
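Without going into how indexes work, here is a small sketch of what declaring a secondary key could look like in SQLite via Python's sqlite3 module. The index name idx_person_name is made up for the example. The query plan is printed only to see whether the DBMS reports using the index; its exact wording depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT, Birth TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Person (Name, Birth, Email) VALUES (?, ?, ?)",
    [
        ("Alice Johnson", "1985-07-12", "alice.johnson@example.com"),
        ("Bob Smith", "1990-03-05", "bob.smith@example.org"),
        ("Carol Davis", "1978-11-23", "carol.davis@example.net"),
    ],
)

# Name does not identify tuples, but we still want fast lookups by name.
conn.execute("CREATE INDEX idx_person_name ON Person (Name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Person WHERE Name = ?", ("Bob Smith",)
).fetchall()
# The human-readable description is the last column of each plan row.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```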

Foreign key

To finish with the types of keys, the ones we have seen before mainly focus on solving the problem of uniquely identifying tuples, which is the purpose of keys, as well as contributing to the optimization of operations and queries on tables.

But keys also help implement certain elements of conceptual design on the logical design of the DBMS. Specifically, with the type of key we have yet to see, the foreign key, we can implement associations between entities at the logical level, which can occur in situations like this:

Here we return to the example where we conceptually model a domain with cities and people, where a person lives in exactly one city, and a city can have any number of people living in it, from 0 to infinity. Given the 1..1 multiplicity on the City side, every person must live in some city, but the 0..* multiplicity on the other side means cities may have no inhabitants.

The below diagram represents the conceptual design of our database, capturing certain details of the domain that we later need to transfer to the logical level. On one hand, we transfer the entities themselves to the logical level by creating a table for each entity directly:

ID	Name	Birth	Email
1	Alice Johnson	1985-07-12	alice.johnson@example.com
2	Bob Smith	1990-03-05	bob.smith@example.org
3	Carol Davis	1978-11-23	carol.davis@example.net
4	David Brown	2001-01-30	david.brown@example.com
5	Emily Wilson	1995-09-14	emily.wilson@example.co.uk

CityID	Name	Country	Population	Area
1	Madrid	Spain	3,223,000	604.3
2	Athens	Greece	664,046	38.96
3	New York	USA	8,398,748	783.8
4	Tokyo	Japan	13,929,286	2,191.1
5	Paris	France	2,140,526	105.4

Given the tables for both entities, we now need to implement at the logical level the association we defined at the conceptual level. This means using a mechanism that allows us to know which city each person lives in or the people who live in a certain city.

If we think about this problem in terms of tables, we’ll see that the only way to do this is to add an additional attribute in one of the two tables so that this attribute takes as values the city where a person lives or the people who live in a certain city.

To understand this correctly, let's first assume that the primary key of the Person table is {ID}, which could be their government ID or an auto-incrementing surrogate key. Also, the primary key of the City table is the attribute {CityID}. This way, we can uniquely identify the tuples of City and Person using their primary keys, which take unique values for each of their tuples.

ID	CityID (FK)	Name	Birth	Email
1	5	Alice Johnson	1985-07-12	alice.johnson@example.com
2	5	Bob Smith	1990-03-05	bob.smith@example.org
3	4	Carol Davis	1978-11-23	carol.davis@example.net
4	2	David Brown	2001-01-30	david.brown@example.com
5	3	Emily Wilson	1995-09-14	emily.wilson@example.co.uk

If we want to know the city where a person lives, we could add an attribute to the Person table so that the values it takes belong to the CityID attribute as shown above. That is, if the person "Alice Johnson" lives in the city "Paris," then in that row, the value of the new attribute CityID (FK) we added is 5, which corresponds to the CityID of the city "Paris" in its respective table. Similarly, if the person "Carol Davis" lives in the city of "Tokyo," then the new attribute will take the value 4, which corresponds to the CityID of that city in its respective table.

As you can see, the new attribute we added tells us which city the person represented in each row lives in, as it takes the primary key of the City table as its value. So, by knowing the CityID value, we can identify which city it is among all those stored in that table.

This additional attribute we add to represent the association is the foreign key. It mainly serves to implement associations between entities at the conceptual level, through attributes that serve as references or pointers to other tables. This is why it’s sometimes called an association pointer.
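To make this concrete, here is a minimal sketch of the two tables in SQLite using Python's sqlite3 module, with only a few of the columns and rows shown above. The helper names city_of and residents_of are hypothetical. One SQLite-specific detail: foreign keys are only enforced after running PRAGMA foreign_keys = ON on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT, Country TEXT)")
conn.execute("""
    CREATE TABLE Person (
        ID     INTEGER PRIMARY KEY,
        CityID INTEGER NOT NULL REFERENCES City (CityID),  -- the foreign key
        Name   TEXT
    )
""")
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?)",
    [(2, "Athens", "Greece"), (4, "Tokyo", "Japan"), (5, "Paris", "France")],
)
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [(1, 5, "Alice Johnson"), (2, 5, "Bob Smith"), (3, 4, "Carol Davis")],
)

def city_of(person_name):
    """Follow the foreign key from a person to the city they live in."""
    row = conn.execute(
        "SELECT City.Name FROM Person JOIN City ON Person.CityID = City.CityID "
        "WHERE Person.Name = ?",
        (person_name,),
    ).fetchone()
    return row[0] if row else None

def residents_of(city_name):
    """Collect every person whose foreign key points at the given city."""
    rows = conn.execute(
        "SELECT Person.Name FROM Person JOIN City ON Person.CityID = City.CityID "
        "WHERE City.Name = ? ORDER BY Person.Name",
        (city_name,),
    ).fetchall()
    return [r[0] for r in rows]

# A reference to a city that does not exist is rejected.
try:
    conn.execute("INSERT INTO Person VALUES (6, 99, 'Nobody')")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False

print(city_of("Alice Johnson"), residents_of("Paris"), dangling_allowed)
# Paris ['Alice Johnson', 'Bob Smith'] False
```

Both questions from the text (which city a person lives in, and who lives in a city) are answered from the single CityID column in Person.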

Before continuing, it's worth considering what would happen if, instead of placing the foreign key CityID (FK) in the Person table, a foreign key PersonID (FK) was placed in the City table. If we do this intending to reference all the people who are residents of a certain city, we would encounter a significant problem. That is, if we do this, we must keep in mind that a city can have an arbitrary number of residents, so in the value of its foreign key, we would have to store all the PersonIDs of its residents one after another in the same cell. This would result in a repeating group that is prohibited in the relational model.

To avoid this repeating group, we would have to refine or normalize the design, which would bring us back to where we started: the attribute CityID (FK) in the Person table. That route is more complicated than simply placing the foreign key in the right table from the beginning.

Now that we understand the basis of what a foreign key is, it's important to note that, for an attribute to truly serve as a foreign key, it must reference an attribute in another table that is a primary key on its own.

In this case, the foreign key is composed of a single attribute, CityID (FK), which references CityID in the City table. If it referenced the Name attribute instead, there could be multiple different cities with the same name. This would mean that if we say a person lives in a certain city and use its name to identify it, we wouldn't be able to know exactly which city they live in if there are multiple cities with the same name.

That's why the foreign key references CityID, which we can guarantee uniquely identifies cities on its own, as it’s the primary key of City.

Composite Foreign key

Still, we don't always have domains and schemas as simple as these, where primary keys are a single attribute.

For example, we might have a diagram like the following, where there are people who own pools. Each person must own exactly one pool, but it's possible for several people to agree or partner up so that together they can own a pool. This means that each person will own a small percentage of the pool, which in this domain is not relevant. So a pool can be owned by an arbitrary number of people, including none, since there will be pools that aren’t yet owned by anyone.

Given the attributes of each entity, we can easily see that the primary key of Person is their {ID}, while to uniquely identify a pool, using just PoolName or CityName is not enough, since there could be multiple pools located in the same city or with the same name.

But if we assume that there can’t be multiple pools with the same name in the same city, we can establish a composite primary key as {PoolName, CityName}, where these attributes will uniquely identify each pool. When trying to translate this to the logical level, we first create the tables corresponding to both entities.

ID	Name	Birth	Email
1	Alice Johnson	1985-07-12	alice.johnson@example.com
2	Bob Smith	1990-03-05	bob.smith@example.org
3	Carol Davis	1978-11-23	carol.davis@example.net
4	David Brown	2001-01-30	david.brown@example.com
5	Emily Wilson	1995-09-14	emily.wilson@example.co.uk

PoolName	CityName	Length	Width
Olympic Stadium Pool	Los Angeles	50.0	25.0
Community Center Pool	Chicago	25.0	12.5
Lakeside Aquatic Center	Seattle	33.3	15.0
Riverside Neighborhood Pool	Austin	30.0	10.0
Sunset Community Pool	Miami	25.0	10.0

Later, if we want to model the association between both entities with a foreign key, we first need to consider the cardinality of the association. On one hand, on the Person side, we have a cardinality of 0..*, indicating that a pool can belong to many people. On the other side of the association, we have a multiplicity of 1..1, indicating that a person can only have one pool.

With this, we can infer that if we place the foreign key in the Pool table, we would have to reference all the people who own each pool, resulting in repeating groups in cases where there are multiple owners for the same pool (because we’d need to reference each and every owner from the same pool). That is, the pool would have an attribute whose value would be references to all its owners, and since there can be an arbitrary number of them, a repeating group is formed.

To avoid this problem, whenever we have an association with cardinality 1 on one side and * on the other, or equivalents, we need a foreign key to model it at the logical level. Also, it should generally be placed in the table whose cardinality contains * as the maximum cardinality, indicating an arbitrary amount. Here, by equivalents, we refer to cardinalities like 0..1, which we can treat similarly to 1..1, or 5..*, which is equivalent to 0..* because the maximum cardinality is still an arbitrary amount.

ID	PoolName (FK)	CityName (FK)	Name	Birth	Email
1	Olympic Stadium Pool	Los Angeles	Alice Johnson	1985-07-12	alice.johnson@example.com
2	Riverside Neighborhood Pool	Austin	Bob Smith	1990-03-05	bob.smith@example.org
3	Sunset Community Pool	Miami	Carol Davis	1978-11-23	carol.davis@example.net
4	Sunset Community Pool	Miami	David Brown	2001-01-30	david.brown@example.com
5	Olympic Stadium Pool	Los Angeles	Emily Wilson	1995-09-14	emily.wilson@example.co.uk

As you can see, in this case, the foreign key is placed in the Person table, which is the one with the * in its cardinality on the diagram, since each person can only own one pool. This prevents the foreign key from having to store an arbitrary number of references.

In this specific case, instead of a single attribute, we need to add PoolName (FK) and CityName (FK) because the primary key of Pool is not a single attribute but two. So the foreign key in Person will be a composite foreign key – meaning that instead of one attribute referencing another in a different table, there are two that simultaneously reference two attributes in another table.

For this to be valid, each attribute of the foreign key must reference an attribute of the primary key in the Pool table: PoolName (FK) refers to PoolName, and CityName (FK) refers to the CityName attribute of Pool. So together they reference the entire primary key of Pool.
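A sketch of such a composite foreign key in SQLite through Python's sqlite3 module follows. The column types and the helper add_owner are assumptions for illustration. The second insert is the interesting one: both values exist somewhere in Pool, but not together in the same tuple, so the reference is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("""
    CREATE TABLE Pool (
        PoolName TEXT NOT NULL,
        CityName TEXT NOT NULL,
        Length   REAL,
        Width    REAL,
        PRIMARY KEY (PoolName, CityName)
    )
""")
conn.execute("""
    CREATE TABLE Person (
        ID       INTEGER PRIMARY KEY,
        PoolName TEXT NOT NULL,
        CityName TEXT NOT NULL,
        Name     TEXT,
        -- Both columns together reference the whole primary key of Pool.
        FOREIGN KEY (PoolName, CityName) REFERENCES Pool (PoolName, CityName)
    )
""")
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?, ?, ?)",
    [
        ("Olympic Stadium Pool", "Los Angeles", 50.0, 25.0),
        ("Sunset Community Pool", "Miami", 25.0, 10.0),
    ],
)

def add_owner(person_id, pool_name, city_name, name):
    """Return True if the person can be stored as an owner of the given pool."""
    try:
        conn.execute(
            "INSERT INTO Person VALUES (?, ?, ?, ?)",
            (person_id, pool_name, city_name, name),
        )
        return True
    except sqlite3.IntegrityError:
        return False

valid_owner = add_owner(1, "Olympic Stadium Pool", "Los Angeles", "Alice Johnson")
mismatched_pair = add_owner(2, "Olympic Stadium Pool", "Miami", "Bob Smith")
print(valid_owner, mismatched_pair)  # True False
```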

Finally, as we’ve just seen, foreign keys are a tool of logical design that we use to implement associations from the conceptual model. That's why in the conceptual model (in the entity-relationship diagram), we do not write the attributes that form the foreign keys. This is because at the conceptual level, the associations themselves indicate the relationships between entities. So even though tables have more attributes than we see in the diagram due to foreign keys, these extra attributes are never written at the conceptual level.

As for their naming, there are many style guides to follow. Here, we have added an (FK) to the attribute names to make it clear that they are foreign keys or part of one, although they can be named in any other way.

Weak Entities

Now that we’ve defined how foreign keys allow us to implement associations between entities, we’ll continue by analyzing a case where one of the associated entities can’t be identified on its own with its attributes. Instead, it needs a foreign key that references another entity to be correctly identified – this means that the entity is considered weak in identification.

Existence weakness

Before continuing, you should know that there are several types of weaknesses in this context. One is existence weakness, which means that an entity called weak can’t exist if there isn't another entity called owner with which it’s associated.

We can understand this with the previous example, where a person is composed of a brain, and a brain must always be part of a person. So, when an instance of the Brain entity is created, meaning a tuple representing a brain is created, a person must also be created to be associated with that brain.

In summary, a brain can’t exist without the Person entity it’s related to. This leads to an existence weakness where we say the Brain entity is weak and the Person entity is the owner or strong. The composition allows Person to exist without a Brain, even though we prevent it here with cardinality.

Aside from this, when we have an association where all its cardinalities are 1..1, it’s very likely that we can combine those two entities into one, like Person, adding attributes like Neurons, instead of having two entities. But this doesn't always have to be done this way, as it depends on how we want to model the domain and the requirements.

Identification weakness

In addition to existence weakness, we can have an identification weakness. Here, by identification, we mean the mechanism by which each tuple in a table is uniquely distinguished from all others, as we have seen before with keys.

To understand this type of weakness, when it occurs, and how it’s managed, we can look at the following case:

Here we have some entities:

City, which models cities in the domain
Person, which does the same for people, and
Residence, which models a person's stay in a specific city.

This means people can live in a city for a certain time and then move to another. So according to this diagram, they would leave behind an occurrence or tuple of Residence with the date they started living in the city and the date they moved to another.

Regarding cardinalities, we can see that a city can be related to many residences, as it may have or have had many inhabitants, while a residence is only related to one person because the residence focuses on recording that a certain person has lived in a certain city. So the 1..1 multiplicities force a residence to link a person with a city, as introducing optionality here would imply that a residence can link a city or person with "nothing," which doesn't make sense.

Meanwhile, on the Residence side, we have 0..* multiplicities with optionality because a person may not live in any city, or conversely, a city may have or have never had any inhabitants, so it may not be related to any occurrence of Residence.

Next, when we translate this diagram to the logical level, we first try to define the primary keys for all the entities or tables. In this case, for City and Person, it's straightforward, as we assume CityID is a unique identifier for each city, and ID is a unique government identifier for each person (tuple).

But when we define the primary key for Residence, we have several options. On one hand, we could choose {StartDate} or {EndDate} as the primary key, but this isn't feasible because multiple people might start or stop living in the same or different cities on the same date. We can't even choose {StartDate, EndDate} as the primary key, since, in the worst-case scenario, multiple people might start and stop living in a city on the same dates.

This means that the Residence entity needs the other entities it’s associated with to have a primary key and be identifiable. It's important to note that at the logical level, we would have two foreign keys in Residence due to its two associations with City and Person. Specifically, it has a foreign key CityID (FK) and another ID (FK) that model these associations, respectively.

We can infer this at a glance without "seeing" the logical model because we have associations with cardinalities 1..1 and 0..*. So on the 0..* side, there must be a foreign key to implement this association as we’ve seen before.

Given these foreign keys, we might consider choosing {CityID (FK)} or {ID (FK)} as primary keys, but this wouldn’t guarantee the identification of all tuples because multiple people can be living in one or several cities at the same time. Also, a city can have multiple residents simultaneously, leading to repeated values in the foreign key attributes for tuples that should be considered distinct.

We also can’t choose {CityID (FK), ID (FK)} as a key because a person may have moved to a city multiple times during different periods, even if they lived in other cities in between. This would result in multiple tuples with the same values in both foreign keys but different values in the dates.

Given this situation, the only option left is to consider a key that includes one of the date attributes of Residence together with both foreign keys, CityID (FK) and ID (FK), since nothing prevents a person from having multiple residences at the same time (where each residence indicates they are living in a city). This is normal because we haven't restricted this situation in any way in the conceptual diagram.

So, since a person can live in multiple cities at once, to identify a Residence tuple, we need to know which person is living in which city, plus at what point in time they are doing so. This we can determine with StartDate or EndDate. One of the dates is sufficient here, because a person can only live in a city once at the same moment in time, meaning a person can’t start or stop living in the same city multiple times at the same moment.

So to sum up, if we want to uniquely identify the Residence entity, we need to select {StartDate, CityID (FK), ID (FK)} as the primary key, although we could also select {EndDate, CityID (FK), ID (FK)} as long as we are sure that EndDate always exists. If the end date is not defined until the person leaves the city, we couldn't consider EndDate for identifying Residence.
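The resulting table can be sketched in SQLite with Python's sqlite3 module as follows. The dates and the helper add_residence are invented for illustration, and dates are stored as ISO text for simplicity. The primary key combines StartDate with both foreign keys, and EndDate stays optional because it may be unknown while the person still lives there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE City   (CityID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (
        StartDate TEXT NOT NULL,
        EndDate   TEXT,  -- may be unknown until the person leaves
        CityID    INTEGER NOT NULL REFERENCES City (CityID),
        ID        INTEGER NOT NULL REFERENCES Person (ID),
        PRIMARY KEY (StartDate, CityID, ID)
    );
    INSERT INTO City VALUES (1, 'Madrid'), (5, 'Paris');
    INSERT INTO Person VALUES (1, 'Alice Johnson');
""")

def add_residence(start, end, city_id, person_id):
    """Return True if the residence is accepted as a new, distinct tuple."""
    try:
        conn.execute(
            "INSERT INTO Residence VALUES (?, ?, ?, ?)",
            (start, end, city_id, person_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False

first_stay = add_residence("2010-01-01", "2012-06-30", 5, 1)  # Alice in Paris
second_stay = add_residence("2015-03-01", None, 5, 1)         # back in Paris later
other_city = add_residence("2015-03-01", None, 1, 1)          # Madrid, same start date
exact_repeat = add_residence("2010-01-01", None, 5, 1)        # same date, city, person
print(first_stay, second_stay, other_city, exact_repeat)  # True True True False
```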

So we can see here that we can’t identify the entity without using the respective foreign keys. This means the entity is considered weak in identification, as it depends on the two entities City and Person, which in this context are considered the owners of the weak entity. In other words, the owner entities can be identified by themselves, while the weak entity depends on other entities for its identification.

To denote this in the entity-relationship diagram, we can use a «weak» role on the sides of the weak entity to indicate that the foreign keys of these associations are needed to identify the weak entity.

To correctly understand what weak identification means, we can now consider the same diagram as before. But now, let’s assume that a person can only live in one city at a given moment in time, unlike before when they could live in many cities at once. This restriction can’t be modeled with UML elements, so it's enough to add a textual note in the diagram to reflect the restriction.

In this case, since a person can only live in one city at a time, we don't need to include the foreign key CityID (FK) in the primary key of Residence. If a person is living in a city at a given moment, they can't be living in another, so there won't be more tuples in the table with that person and that start and end date of residence.

Consequently, the primary key of Residence becomes {StartDate, ID (FK)}, for example. The only thing that changes besides this primary key is the conceptual diagram itself, where now the only owner entity of Residence is Person because the foreign key to City is no longer strictly necessary for its identification. So even though Residence remains weak, its only owner entity is Person. This is why the role "weak" is only written in the association that gives rise to the foreign key ID (FK), which is indeed in the primary key of Residence (unlike the previous scenario where we placed the role in both associations).

So as you can imagine, with the "weak" roles, we can not only know which entities are weak but also which entities own them. The role is always on the side of the association where the weak entity is found – that is, where the foreign key referencing the owner entity is located, which corresponds with the cardinality * seen before. Then on the other side of the association with the "weak" role, we find the owner entity.

If we want to convert Residence into an entity that is not weak, we need to add enough attributes to identify it without relying on other entities. For example, if we add a surrogate key ResidenceID that works through auto-increment or UUID, then we can automatically identify each tuple of Residence uniquely, so the primary key of Residence would become {ResidenceID}, and the entity would no longer be weak.
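A sketch of this non-weak version of Residence, again in SQLite via Python's sqlite3 module with invented dates. The foreign keys are still there to implement the associations, but they no longer take part in the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE City   (CityID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (
        ResidenceID INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        StartDate   TEXT NOT NULL,
        EndDate     TEXT,
        CityID      INTEGER NOT NULL REFERENCES City (CityID),
        ID          INTEGER NOT NULL REFERENCES Person (ID)
    );
    INSERT INTO City VALUES (5, 'Paris');
    INSERT INTO Person VALUES (1, 'Alice Johnson');
""")

def add_residence(start, end, city_id, person_id):
    """Insert a residence and return the surrogate key the DBMS assigned to it."""
    cur = conn.execute(
        "INSERT INTO Residence (StartDate, EndDate, CityID, ID) VALUES (?, ?, ?, ?)",
        (start, end, city_id, person_id),
    )
    return cur.lastrowid

first_id = add_residence("2010-01-01", "2012-06-30", 5, 1)
second_id = add_residence("2015-03-01", None, 5, 1)
print(first_id, second_id)  # 1 2
```

One trade-off of this design is that the DBMS no longer rejects two residences with the same start date, city, and person unless we also declare that combination as unique.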

Finally, if we consider the domain we initially proposed and its requirements, we see that Residence is weak in identification, needing both foreign keys to be identified. So in addition to being represented with the "weak" roles in both associations, it’s worth noting the possibility of representing it using an associative entity like the following:

We can make the diagram this way in this situation because Residence has Person and City as owner entities. Since it’s linked with the association between both entities and needs both to be identified, it can be denoted as an associative entity.

But an associative entity and a weak entity are completely different concepts, as weakness in identification is a property of entities, while an associative entity is a way to represent entities in UML at a conceptual level.

For example, if Residence had only Person as an owner entity, then it would no longer make sense to represent it as an associative entity at the conceptual level. This is because it’s only a weak entity in identification with respect to one owner entity, Person, not two owner entities that can have an N:M association between them.

In addition to the representation as an associative entity, the cardinalities on both sides of the association must be 0..*, since it was previously stated that a city could have an arbitrary number of residences, where each one had only one person, necessarily. So if we represent Residence as an associative entity, the association between City and Person must have a 0..* on Person. This indicates that a city can be related to an arbitrary number of people through the Residence entity, with the same occurring in the reverse direction.

Navigability

In relation to the previous example and the concept of association or foreign key, it's sometimes important to analyze the navigability of our entity-relationship diagram before implementing the logical design of the database. This is because efficiency problems, ambiguities, or even the impossibility of performing certain operations or queries may arise.

To begin with, navigability refers to the capacity we have to “navigate” the entity-relationship diagram through the associations between entities. In other words, if we are located on a certain entity, it is the ability to follow the associations that affect that entity and retrieve information from other entities.

To understand this with an example, we can refer to the last diagram from the previous section where we introduce a surrogate key to Residence. In that diagram, we have an entity Residence with two foreign keys pointing to City and Person. So if we are given a tuple from Residence, we can use its foreign keys to determine which tuple from City or Person is associated with the occurrence of the Residence entity. This allows us to navigate those associations to the corresponding classes.

This is useful, for example, when we query the database to find the person who lived in the city corresponding to that Residence. For this, we can look at the value of the foreign key ID (FK), which corresponds to an identifier of a person recorded in the Person table. This allows us to navigate from the Residence entity to the Person entity, meaning we’ve gotten information from the Person entity starting from Residence.

We can repeat this step multiple times, navigating from entity to entity through the diagram. But the important thing is to know which associations are navigable in a certain direction.

For example, if we are given a person, that tuple doesn't have any foreign keys, so with a tuple representing a person, we can't get information about any other entity in our diagram – not even Residence. If we only look at the values of the Person tuple, we won't know which Residence tuples are associated, because we would need to query and traverse the entire Residence table to find out.

To sum up, the Residence-Person association is not navigable in both directions – we can only go from Residence to Person, but not the other way around. The same applies to City.
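The asymmetry can be illustrated with plain Python dictionaries standing in for the tables, keyed by primary key. This is only a toy model: the sample residences and the function names are made up. Going from a residence to its person is a direct lookup through the stored foreign key, while going from a person to their residences means inspecting every residence.

```python
# Tuples keyed by primary key; the residences are hypothetical sample data.
persons = {1: {"Name": "Alice Johnson"}, 2: {"Name": "Bob Smith"}}
cities = {4: {"Name": "Tokyo"}, 5: {"Name": "Paris"}}
residences = {
    1: {"ID": 1, "CityID": 5, "StartDate": "2010-01-01"},
    2: {"ID": 1, "CityID": 4, "StartDate": "2015-03-01"},
    3: {"ID": 2, "CityID": 5, "StartDate": "2018-09-15"},
}

def person_of(residence_id):
    """Residence -> Person: follow the foreign key stored in the residence."""
    return persons[residences[residence_id]["ID"]]["Name"]

def residences_of(person_id):
    """Person -> Residence: the person holds no reference, so all residences are checked."""
    return [rid for rid, res in residences.items() if res["ID"] == person_id]

print(person_of(2))      # Alice Johnson
print(residences_of(1))  # [1, 2]
```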

Navigability is important, because it's useful to know the direction in which the diagram's associations can be navigated before implementing anything. If our system needs to support a query like obtaining the city where a person currently lives, it might be more efficient to add an association directly from Person to City instead of having to go through all the Residence tuples to resolve the query.

Although this association might seem redundant, if we need to focus heavily on optimizing the query we mentioned earlier, it may be worthwhile to "complicate" the diagram in this way so that certain critical queries in our system run faster.

It’s also important to note that a person may not live in any city, which is why the cardinality of the new association on the City side is 0..1, with a minimum of 0. This is because the foreign key resulting from this association may "not exist," as we will see later, representing that a certain person does not live in any city.
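One common way to represent a 0..1 side at the logical level is a foreign key column that is allowed to be empty. Here is a brief sketch with SQLite and Python's sqlite3 module; the column name CurrentCityID and the helper current_city are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE Person (
        ID            INTEGER PRIMARY KEY,
        Name          TEXT,
        CurrentCityID INTEGER REFERENCES City (CityID)  -- no NOT NULL: 0..1
    )
""")
conn.execute("INSERT INTO City VALUES (5, 'Paris')")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [(1, "Alice Johnson", 5), (2, "Bob Smith", None)],  # Bob lives in no city
)

def current_city(person_id):
    """Return the name of the city a person currently lives in, or None."""
    row = conn.execute(
        "SELECT City.Name FROM Person "
        "LEFT JOIN City ON Person.CurrentCityID = City.CityID "
        "WHERE Person.ID = ?",
        (person_id,),
    ).fetchone()
    return row[0] if row else None

print(current_city(1), current_city(2))  # Paris None
```

With this direct reference, the query no longer needs to touch the Residence table at all.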

Finally, not everything relevant about navigability is related to efficiency, such as when detecting navigation cycles. If several exist, we would need to ensure in the implementation that the DBMS optimizer chooses the shortest one in the corresponding queries.

Navigability also helps us see if certain queries can be resolved, meaning if certain data can be obtained from the system based on some input. And keep in mind that this concept of navigability that we have introduced refers to navigability over the conceptual diagram itself, not to the possibility of obtaining information about other entities at the logical level, as we’ll see later.

Constraints

Continuing with the elements of the relational model, the only thing left to discuss is constraints. These are conditions imposed on the data to correctly model the domain and meet its requirements. They are a set of rules that must always be followed so that the stored data is correct, consistent, integral, and aligns with the semantics given by the domain.

We can define constraints both at the conceptual and logical levels. On one hand, in the conceptual model, constraints are mainly modeled using the tools provided by UML when creating the entity-relationship diagram.

For example, let’s say that in our domain we have a business rule or condition stating that a city can have a maximum of 500 inhabitants. Then if we model the domain with a diagram similar to those created earlier, we will have an association between person, inhabitant, and city, where we use the cardinality of that association (specifically the maximum cardinality) to represent the constraint of the maximum number of inhabitants.

But not all constraints can be modeled at the conceptual level with UML tools. For example, consider the case where we have a social network with people who can follow other people. We can model this with an entity Person and a recursive relationship where a person can follow an unlimited number of people, including the case where they follow no one.

But nothing prevents a person from following themselves, which doesn't make much sense in a social network. So we could leave it as it is if the client doesn't specify otherwise. But if the domain itself or a requirement indicates that a person can’t follow themselves, we will need to add that restriction to the diagram.

Unfortunately, we can’t do this with the tools provided by UML, as there is no mechanism to indicate that this association can’t occur between the same occurrence (tuple) of the Person entity.

In this case, we have several options to reflect the restriction in the conceptual design. The first and simplest is to add a textual note on the margin of the diagram where we briefly explain the situation and indicate the rule that makes up the restriction. Notes in UML are standard elements consisting of a box with text where things that can’t be properly modeled with the diagram's own elements are specified.

On the other hand, instead of using a text note, which is less formal and more prone to misinterpretations or confusion, we can use a specific language to represent constraints like OCL (Object Constraint Language), where we define the restriction using the language's own code.

context Person
inv noSelfFollow:
    self.follows->forAll( p | p <> self )

Here, we won't go into detail about how constraints are modeled in OCL. The important thing is to know that there are constraints that we can’t directly represent with diagram elements, so they need to be reflected in the conceptual design using notes or specialized language code.

Data Integrity

As we’ve mentioned, constraints are validity conditions imposed on the data. They help ensure that the data stored in our database can be checked for correctness, consistency, and integrity, all verified automatically by the DBMS. This is because the constraints themselves are usually implemented at the logical level in the DBMS, which has specific functionalities to check constraints and ensure the correctness and integrity of the data.

So far, we have assumed that the data are stored correctly in their respective tables. We’ve also assumed that they respect the attribute domains, as well as many other details that can affect the validity of what is stored.

So to avoid issues, the database automatically checks the validity of the data, which differs from the correctness of the data. To understand the difference between these concepts, consider the following example:

CityID	Name	Country	Temperature (Kelvin)
5	Paris	France	280
1	Madrid	Spain	-3

Here we have a Temperature table that stores tuples with the temperatures in a city at different times. As you can guess, the temperature attribute is of type integer, which means it can hold any integer, including negatives. But temperatures can’t be negative if measured in Kelvin, so if we are measuring temperatures in Kelvin here, we must add a restriction like Temperature >= 0 to prevent the Temperature attribute from taking negative values. This is called a domain constraint.

Domain constraints, as you’ve just seen, are used to define the domains of table attributes, restricting the possible values they can take and ensuring that the stored data is of the appropriate type.
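As a minimal sketch of how such a restriction might look in practice, the following uses Python's built-in sqlite3 module as a convenient stand-in for a DBMS. The column name TemperatureK is an assumption made for this illustration. The CHECK clause plays the role of the explicit domain constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE Temperature (
        CityID       INTEGER,
        Name         TEXT,
        Country      TEXT,
        TemperatureK INTEGER CHECK (TemperatureK >= 0)  -- explicit domain constraint
    )
""")

# Valid tuple: accepted (whether 280 K is *correct* is a separate question).
conn.execute("INSERT INTO Temperature VALUES (5, 'Paris', 'France', 280)")

# Invalid tuple: rejected by the DBMS because it violates the CHECK.
try:
    conn.execute("INSERT INTO Temperature VALUES (1, 'Madrid', 'Spain', -3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT Name, TemperatureK FROM Temperature").fetchall()
print(rejected, stored)
```

Note that the DBMS can only reject the negative value. It has no way of knowing whether the 280 stored for Paris matches what a thermometer would show.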

Given this restriction, we can see that the first tuple meets all the constraints, so it could be considered valid data. But with the information we have, we can’t ensure that this data is correct. That is, we have not taken a thermometer and measured the temperature in Paris, so we do not know if that 280 is the actual temperature in Paris or if it’s incorrect data. So even if data meets the constraints, we must ensure that it’s correct.

This is a very complicated task that we won’t go into detail about here. We can implement mechanisms for error detection and correction in data, or we can conduct audits to verify that the data corresponds to reality – that is, the domain. Or third parties can supervise the data, because if the person who took that measurement tells us that the 280 is not what they recorded with the thermometer, then we know that data is incorrect. Otherwise, we would have no way to guarantee its correctness.

On the other hand, in the second tuple, the temperature takes a negative value, so we can conclude that this data is not only incorrect but also invalid. It’s invalid because no Kelvin temperature can be negative, violating the domain constraint imposed earlier. It’s incorrect because if it’s invalid, then that value must necessarily be different from the true temperature of the city.

So now you know what it means for data to be invalid or incorrect. You also understand domain constraints that can ensure data integrity in terms of data type and possible values that the attribute can take.

But data integrity goes beyond simply checking that data is in the correct format and within an attribute's domain. For example, data must be reliable and accurate, which we verify with its correctness. It must also be consistent, meaning there can’t be duplicate tuples with information that leads to contradictions as seen earlier. It must also have other high-level characteristics like availability, durability, data timeliness, security, and so on which we won’t delve into because they aren’t essential here.

Integrity Constraints

In addition to the previous characteristics, there is another one that’s essential for maintaining data integrity: completeness. In this context, completeness can have several meanings, with the simplest being that all data points are present in the database as tuples. This means all the "individuals" of the domain are represented in the database.

For instance, if we are storing a domain with 10 people and only see 9 tuples in a table like Person, we know that the data is not complete because the entire domain is not represented by the 9 tuples. On the other hand, completeness also means that each data point must necessarily have a value for each attribute of the table that defines it.

CityID	Name	Country	Population
1	Madrid	Spain	3,223,000
2	Athens	NULL	664,046
3	New York	USA	8,398,748
4	Tokyo	Japan	13,929,286
5	Paris	France	NULL

For example, if we have a domain with cities, and in our database we include a City table with these attributes, then each city (data point) we represent with a tuple must have a value for each of these attributes. This means that for the data to be complete, no cell in the table can be empty.

In the table above, you can see that the cities named "Athens" and "Paris" prevent the data from being complete, as one does not have a value in the Country attribute and the other in Population, respectively. Instead of leaving the corresponding cells empty, the special value NULL is stored in them to represent that they contain nothing.

To ensure the completeness property of the stored data, NULL values should be avoided in the tables. But we will later see that by default, DBMSs do not usually enforce the restriction that table values can’t be NULL. In other words, when we create a table by default, the values of the tuples can be NULL unless we define otherwise through a restriction.

We typically define this restriction at the attribute level, where we specify that the values in the column corresponding to that attribute can’t be NULL. So all tuples we save in the table must have a value other than NULL for that attribute.

This affects the attribute's domain, since by default, the special value NULL is included in the set of all values an attribute can take. But we can exclude this value from the set using a restriction.
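The effect of excluding NULL from an attribute's domain can be sketched as follows, again using Python's sqlite3 module only as a convenient stand-in for a DBMS. Which columns are declared NOT NULL here is an assumption chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE City (
        CityID     INTEGER,
        Name       TEXT NOT NULL,
        Country    TEXT NOT NULL,   -- NULL excluded from this attribute's domain
        Population INTEGER          -- no restriction: NULL is allowed by default
    )
""")

# Allowed: Population has no NOT NULL restriction.
conn.execute("INSERT INTO City VALUES (5, 'Paris', 'France', NULL)")

# Rejected: Country may not be NULL.
try:
    conn.execute("INSERT INTO City VALUES (2, 'Athens', NULL, 664046)")
    athens_rejected = False
except sqlite3.IntegrityError:
    athens_rejected = True

names = [row[0] for row in conn.execute("SELECT Name FROM City")]
print(athens_rejected, names)
```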

In light of all this, and after introducing the concept of NULL, we can define integrity as a property that ensures that throughout its entire lifecycle, the stored data is valid, correct, consistent, complete, and reliable.

To ensure that all these characteristics are met (except for the last one, which is at a higher level), we use special types of constraints in the database, known as integrity constraints. In other words, we can categorize database constraints based on their purpose.

Some constraints are dedicated to modeling business domain requirements and rules, while others are integrity constraints specifically aimed at enforcing the aforementioned integrity characteristics (but some of them may also indirectly model part of the business rules).

These last constraints are validity conditions automatically checked by the DBMS every time an operation is performed on the entity (table) or entities affected by these constraints, all with the goal of ensuring data integrity at all times.

These validity conditions, as we’ve seen, must be met for all stored tuples, ensuring that none of them can have an empty cell or a disallowed value. In other words, although the conditions are defined at the attribute (column) level, it's the stored tuples that must adhere to them, which is why the conditions are checked for every tuple. So when all the tuples stored in a table meet all the defined integrity constraints, the instance of that table is said to be legal.

Integrity constraints, depending on their logical purpose, can be classified into several types:

First, we have domain constraints. These are the ones we just discussed, and they mainly serve to define the data type of the attributes and their domain.

On one hand, implicit domain constraints include those that define the data type of the attributes, as this is something we must do when creating a table, not something we add later to limit the attribute's domain.

On the other hand, there are explicit domain constraints, which we add in addition to the data type definition to limit the values that attributes can take, such as preventing them from containing the special value NULL, or preventing an attribute that stores temperatures in Kelvin from taking negative values, as we have seen. We can also consider it implicit that the DBMS allows cells to take NULL values, which we can prevent by setting an explicit constraint.

Next, we have identification constraints. Regarding the identification of tuples, we previously saw that a primary key is chosen for each table so that its attributes can uniquely identify all the tuples stored in it. The explicit definition of a primary key is an integrity constraint that we define on the table.

But by doing this, the DBMS internally applies several sub-integrity constraints, one of which ensures that the combinations of values taken by the primary key attributes are all different (meaning unique). This is what characterizes a key. Also, none of the attributes can take NULL as a value, because if they could, multiple tuples could share NULL in the primary key and could no longer be uniquely identified.

ID	Name	Birth
1	Alice Johnson	1985-07-12
NULL	Bob Smith	1990-03-05
3	Carol Davis	1978-11-23
4	David Brown	2001-01-30
NULL	Emily Wilson	1995-09-14

For example, if our primary key is a single attribute {ID}, then it can’t take NULL as a value, because in that case, we could have multiple tuples with NULL in that attribute as seen above, preventing them from being uniquely identified.
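The two sub-constraints of a primary key, uniqueness and no NULL values, can be sketched like this. One caveat about the tool rather than the theory: SQLite, used here through Python's sqlite3 module, treats a plain INTEGER PRIMARY KEY specially and replaces a NULL with an auto-generated value. To get the behavior described above, this sketch declares NOT NULL explicitly and uses WITHOUT ROWID. The helper try_insert is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        ID    INTEGER NOT NULL,  -- stated explicitly because of the SQLite quirk
        Name  TEXT,
        Birth TEXT,
        PRIMARY KEY (ID)
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO Person VALUES (1, 'Alice Johnson', '1985-07-12')")

def try_insert(row):
    """Return True if the DBMS accepts the tuple, False if it rejects it."""
    try:
        conn.execute("INSERT INTO Person VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert((1, "Bob Smith", "1990-03-05"))     # same ID as Alice
null_ok = try_insert((None, "Emily Wilson", "1995-09-14"))    # NULL in the key
fresh_ok = try_insert((3, "Carol Davis", "1978-11-23"))       # new, non-NULL ID
print(duplicate_ok, null_ok, fresh_ok)
```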

Lastly, we have referential constraints. Related to the previous constraints are referential integrity constraints, which ensure that the relationships between tables are consistent at all times. These constraints are implicit, meaning the DBMS automatically ensures that they’re fulfilled. Still, we must explicitly define which attributes are foreign keys for it to do so.

Here, by consistent, we’re not referring to the same concept as data consistency. Rather, we mean that a foreign key must reference a valid tuple in the table it points to.

For example, if we have a weak entity Pool whose logical table has a foreign key attribute like CityID (FK), then whenever this attribute references a city, it must contain a valid CityID value. This means it must exist in the City table. If the value doesn't exist, then it wouldn't be referencing any city.

Also, note that the foreign key attribute itself can be NULL by default unless we specify otherwise, because the foreign key constraint doesn't behave like the primary key constraint, which implicitly prevents NULL values. Instead, the foreign key constraint is solely focused on ensuring consistency in references, not on preventing NULL values.

To understand this, we need to look at the 1..1 multiplicity on the City side, which requires all pools to belong to exactly one city, ensuring no pool is "loose" or outside a city. This means all pools must have a value in their foreign key CityID (FK), as they must belong to one and only one city.

For this restriction (which we've conceptually modeled with a minimum cardinality) to be translated to the logical level, we need to explicitly indicate a domain integrity constraint on the CityID (FK) attribute so it can’t contain NULL values. This means it must always refer to a city. This, in turn, allows the Pool entity to be identified by the pool's name and the city where it's located, as the name can be repeated in several tuples/pools. But the combination of the name and city where they are located is assumed to never repeat in our domain. In other words, in the same city, there are no multiple pools with the same name.

Assuming this, if in our database we have a series of tuples in both tables and we want to delete a city from the record, then we need to check if there is any pool referencing that city. This would prevent the city record from being deleted to maintain integrity and ensure that the respective foreign key of the pool continues to reference an existing city.

To resolve this situation, there are many policies that we will see later, although the most common is to prevent the deletion operation from being executed or to also delete the pool record that references the city we want to delete. This could cause more recursive deletions if there are foreign keys pointing to Pool.
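The two deletion policies mentioned here can be compared side by side in a small sketch. It assumes SQLite through Python's sqlite3 module, where foreign key checks must be switched on with a PRAGMA. The helper names build and delete_city are hypothetical, and RESTRICT and CASCADE are the SQL keywords assumed to correspond to "prevent the deletion" and "also delete the pool".

```python
import sqlite3

def build(policy):
    """Create City and Pool with the given ON DELETE policy and one pool in city 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute("CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
    conn.execute(f"""
        CREATE TABLE Pool (
            Name   TEXT NOT NULL,
            CityID INTEGER NOT NULL REFERENCES City(CityID) ON DELETE {policy},
            PRIMARY KEY (Name, CityID)   -- weak entity: identified by name + city
        )
    """)
    conn.execute("INSERT INTO City VALUES (1, 'Madrid')")
    conn.execute("INSERT INTO Pool VALUES ('Central', 1)")
    return conn

def delete_city(conn):
    """Try to delete city 1; report whether it worked and how many pools remain."""
    try:
        conn.execute("DELETE FROM City WHERE CityID = 1")
        deleted = True
    except sqlite3.IntegrityError:
        deleted = False
    pools_left = conn.execute("SELECT COUNT(*) FROM Pool").fetchone()[0]
    return deleted, pools_left

restrict_result = delete_city(build("RESTRICT"))  # deletion blocked, pool kept
cascade_result = delete_city(build("CASCADE"))    # city and its pool both removed
print(restrict_result, cascade_result)
```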

On the other hand, if the minimum cardinality on the City side is 0, this means that at the logical level, the foreign key of Pool may not exist – meaning the pool might not be in any city. So its foreign key can take the value NULL because it's the only simple way to implement that the foreign key itself "does not exist."

If we do this, we won't have to define the explicit constraint that the foreign key attribute is non-null, and when deleting a city record, we can set the deletion policy so that the foreign key in Pool is set to NULL.

As for the weakness in identifying Pool, it disappears here because it can't use its foreign key for identification since it can take the value NULL and the pool name can be repeated. Because of this, we decide to add a surrogate key PoolID to identify the Pool entity.
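A minimal sketch of this second variant, assuming SQLite via Python's sqlite3 module: the foreign key is nullable, Pool is identified by the surrogate key PoolID, and the deletion policy sets the reference to NULL. The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Pool (
        PoolID INTEGER PRIMARY KEY,   -- surrogate key
        Name   TEXT NOT NULL,
        CityID INTEGER REFERENCES City(CityID) ON DELETE SET NULL  -- nullable: 0..1
    )
""")
conn.execute("INSERT INTO City VALUES (1, 'Madrid')")
conn.execute("INSERT INTO Pool VALUES (100, 'Central', 1)")
conn.execute("INSERT INTO Pool VALUES (101, 'Central', NULL)")  # pool in no city

# Deleting the city leaves the pool in place but clears its reference.
conn.execute("DELETE FROM City WHERE CityID = 1")
pools = conn.execute(
    "SELECT PoolID, Name, CityID FROM Pool ORDER BY PoolID").fetchall()
print(pools)
```

Both pools share the name "Central", which is why the name alone could not serve as an identifier here.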

Finally, nothing prevents a foreign key from modeling a recursive relationship, meaning the DBMS implicitly allows it by default. So if we want to avoid situations where a tuple references itself, we must add explicit constraints, which we can categorize as referential integrity.
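Returning to the social network example, one way such an explicit constraint might look is a CHECK that compares the two references. This is only a sketch under assumptions: the table Follows and its column names are hypothetical, and SQLite is used through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Follows (
        FollowerID INTEGER NOT NULL REFERENCES Person(PersonID),
        FollowedID INTEGER NOT NULL REFERENCES Person(PersonID),
        PRIMARY KEY (FollowerID, FollowedID),
        CHECK (FollowerID <> FollowedID)   -- explicit rule: no self-follow
    )
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.execute("INSERT INTO Follows VALUES (1, 2)")   # Alice follows Bob: fine

try:
    conn.execute("INSERT INTO Follows VALUES (1, 1)")  # Alice follows herself
    self_follow_rejected = False
except sqlite3.IntegrityError:
    self_follow_rejected = True

follow_pairs = conn.execute("SELECT * FROM Follows").fetchall()
print(self_follow_rejected, follow_pairs)
```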

Chapter 6: Relational Schema Diagram

After introducing the relational model at the conceptual level, we must remember that this is the first level of database design. Now, based on the entity-relationship diagram, we need to determine the tables that will make up the database, as well as the keys they will have to identify and reference each other. We also need to define the constraints that ensure the validity and integrity of the data.

So even though we’ve already introduced certain concepts of logical design, here we’ll formalize the logical design itself through relational schema diagrams, sometimes called relational diagrams for simplicity.

As you can see, here we have a relational diagram representing the logical design associated with the last entity-relationship diagram from the previous section.

First, instead of entities, we have tables here, each with a series of attributes. If any attribute is used in a primary key, it’s underlined like PoolID or CityID, with all other attributes being "normal" table attributes. Also, foreign keys are represented directly with arrows. In this case, CityFK is a foreign key that references the CityID attribute of the City table because it’s a primary key, which is why it's denoted with an arrow pointing from the foreign key attribute to the corresponding attribute in the other table.

Regarding the foreign key, keep in mind that an attribute can only point to one other attribute – meaning CityFK can only have one arrow pointing to one attribute, not several, as the foreign key references a single attribute in another table. If we were asked to convert this relational diagram into an entity-relationship diagram, the foreign key itself would determine the cardinalities of the association (at least the maximum cardinalities, since, for that foreign key to make sense, at the conceptual level, it would translate to a pool being in only one city at most, while a city can have an arbitrary number of pools).

These types of diagrams aren’t standard like UML. They only need to meet the characteristics mentioned earlier. That's why, in many cases, tables are represented with squares similar to UML entities instead of being shown in textual format with Datalog.

But unlike UML diagrams, there are very few implicit restrictions here. Most restrictions need to be added with notes in the margins. For example, to indicate that an attribute can’t have a NULL value, we can’t do it with diagram elements – instead, it must be represented by other means, such as a note or a piece of code in OCL.

1-1 association

Given the nature of relational diagrams, we can infer that entities are directly transferred to the logical model with tables, where each entity corresponds to a table. But in addition to the tables, we have to implement the associations between entities at the logical level.

To do this, we start with the simplest case, which is an association where the maximum cardinality on both sides is 1, as in the example we saw earlier where we had an entity Person composed of an entity Brain, whose translation to a relational diagram would be as follows.

As you can see, both entities are represented with tables, where the attributes of their primary keys are underlined. Also, even though they don't appear in the conceptual diagram, we need to reflect the existence of foreign key attributes used to implement the association itself.

So we’ve added attributes with names that best indicate that they are foreign keys. In this case, the name ends with FK, although you can use any name you like. So for a brain to be associated with a person, its corresponding foreign key refers to the primary key of the table that stores people. Since the other direction of the association is symmetrical, we do the same with the foreign key of Person (which refers to the primary key of Brain so that a person can be associated with their corresponding brain). We do this with foreign keys for simplicity and because it's the only way to determine which brain each person has and to whom each brain belongs.
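To make the two-directional foreign keys concrete, here is a hedged sketch using SQLite through Python's sqlite3 module. The attribute names are assumptions. Because each table references the other, neither row can be inserted first without temporarily breaking a reference, so the sketch declares both foreign keys as deferred, meaning they are checked only when the transaction commits. That is one possible way around the problem, not necessarily the approach this book takes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        BrainFK  INTEGER NOT NULL UNIQUE      -- UNIQUE keeps the maximum at 1
            REFERENCES Brain(BrainID) DEFERRABLE INITIALLY DEFERRED
    )
""")
conn.execute("""
    CREATE TABLE Brain (
        BrainID  INTEGER PRIMARY KEY,
        PersonFK INTEGER NOT NULL UNIQUE
            REFERENCES Person(PersonID) DEFERRABLE INITIALLY DEFERRED
    )
""")

# Both rows go in within one transaction; references are verified at commit.
conn.execute("INSERT INTO Person VALUES (1, 'Alice', 10)")
conn.execute("INSERT INTO Brain VALUES (10, 1)")
conn.commit()

pair = conn.execute("""
    SELECT Person.Name, Brain.BrainID
    FROM Person JOIN Brain ON Person.BrainFK = Brain.BrainID
    WHERE Brain.PersonFK = Person.PersonID
""").fetchone()
print(pair)
```

The extra machinery needed just to insert one pair of rows hints at the overhead of keeping a 1-1 association as two separate tables.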

With a 1-1 association, you typically shouldn’t leave the design as it is, because of the overhead caused by foreign keys referencing in both directions and the redundancy at the conceptual level. If each person has one brain and only one, and vice versa, it's likely that both can be "merged" and modeled as a single concept, moving all the attributes that characterize Brain to the Person entity, for example. But there are other ways to refine the schema, or there are times when the domain or requirements force us to keep this type of relationship, in which case it would be perfectly valid.

1-M association

Another type of association we need to translate to the logical level is called 1-M (or 1-N), which refers to associations where the maximum cardinalities on both sides are 1 and * respectively, where M means an arbitrary amount.

For example, here we have a 1-M relationship between the entities House and Person, where a house must belong to a person, and a person can have an arbitrary number of houses, including none. At the logical level, we can represent this as:

Just like before, we implement both entities with tables, and the 1-M association between them with a foreign key in the entity on the side where the maximum cardinality is *. Specifically, to avoid repetitive groups, we place the foreign key in House, since a house can only have one person as its owner. This means it won't be necessary to store an arbitrary number of references in the attribute of its foreign key – one is enough.

And as always, the foreign key refers to the primary key of Person, so that it can reference a value of an attribute that can uniquely identify a person, and thus determine the owner of a house.
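The 1-M translation described above can be sketched as follows, assuming SQLite via Python's sqlite3 module. The Address attribute and the sample rows are made up for illustration. The foreign key lives in House and is NOT NULL, matching the 1..1 multiplicity on the Person side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE House (
        HouseID  INTEGER PRIMARY KEY,
        Address  TEXT NOT NULL,
        PersonFK INTEGER NOT NULL REFERENCES Person(PersonID)  -- exactly one owner
    )
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO House VALUES (?, ?, ?)",
                 [(10, "1 Main St", 1), (11, "2 Oak Ave", 1)])

# One person, many houses: each house row carries a single reference.
houses_per_person = conn.execute("""
    SELECT Person.Name, COUNT(House.HouseID)
    FROM Person LEFT JOIN House ON House.PersonFK = Person.PersonID
    GROUP BY Person.PersonID
    ORDER BY Person.PersonID
""").fetchall()
print(houses_per_person)
```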

Minimum cardinality issues

Regarding the previous entity-relationship diagram, we can see that the 1..1 side indicates that at a minimum, a house must always be associated with a person who will be its owner. This means that a house must always have an owner. But this is not realistic, as when a house is built, it may be without an owner for some time, causing the cardinality on that side of the association to become 0..1.

In turn, the minimum cardinality of 0 means that a house may not have an owner – so its foreign key should not exist while the house has no owner. To model this, attributes, including foreign keys, are allowed to take NULL as a value by default (as we’ve seen before). This way, to represent that the foreign key does not point anywhere, we simply choose not to restrict the possibility of it taking this NULL value. So when a house has no owner, its foreign key attribute will be NULL until it references a person – that is, a tuple in the Person table.

This situation, where a foreign key is allowed to take the NULL value, is not explicitly indicated in the relational diagram. Instead, it’s indicated when the opposite situation occurs – where if the foreign key can’t be NULL, we need to add a note clearly indicating this (as is the case in the original entity-relationship diagram we just saw).

On the other hand, the association in the diagram has a multiplicity of 0..* on the House side, indicating that a person doesn’t have to own any house. But if we had a minimum cardinality greater than 0, then this restriction would need to be defined with a note in the relational diagram, as well as with specific SQL tools (since there are no standard elements to model this type of requirement caused by minimum cardinalities in such a situation).
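Since a foreign key alone can't express "at least one" on the House side, one possible approach is a query that lists the tuples breaking the rule. The sketch below assumes SQLite via Python's sqlite3 module and a hypothetical requirement that every person owns at least one house. It is an illustration of the idea, not necessarily the SQL tools the text refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE House (
        HouseID  INTEGER PRIMARY KEY,
        PersonFK INTEGER REFERENCES Person(PersonID)   -- NULL: no owner yet
    )
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO House VALUES (?, ?)", [(10, 1), (11, None)])

# Hypothetical rule: every person must own at least one house (1..* on the House side).
violators = conn.execute("""
    SELECT Name FROM Person
    WHERE NOT EXISTS (SELECT 1 FROM House WHERE House.PersonFK = Person.PersonID)
""").fetchall()

# Houses whose foreign key is NULL, i.e. houses without an owner.
unowned = conn.execute("SELECT HouseID FROM House WHERE PersonFK IS NULL").fetchall()
print(violators, unowned)
```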

N-M association

To conclude with the types of associations according to their cardinality, the only one left to translate is N-M. In this case, N and M denote arbitrary quantities, meaning associations whose maximum cardinalities are both * at the same time.

As an example, we could have a domain where a person can own many houses, and a house can be owned by many people at the same time. To model this situation at the conceptual level, the first thing we might think of is to create a diagram like this, where we only put an association with cardinality 0..* on both sides.

Conceptually this would be consistent, but logically it can’t be translated in any way. That is, if we have an association with a maximum cardinality of * on both sides and try to implement it logically using foreign keys as we’ve done so far, we’ll find that even if we put a foreign key in both entities referencing the entity on the other side of the association, the problem of the repeating group will always appear in both entities, regardless of what else we do.

To understand this, we can look at it conceptually. If a person has an indeterminate number of houses and we put a foreign key in Person referencing House, then that foreign key would need to contain references to each of the possible houses the person might have. Since it's not a fixed number, a repetitive group appears in the foreign key.

The same happens in reverse: if a house can have an arbitrary number of owners, then including a foreign key in House referencing Person would cause a repetitive group in the foreign key. So this type of association does not have a direct implementation at the logical level.

But in reality, these situations usually don't occur this way. Instead, it's common for there to be an intermediate class in the association that allows for its implementation at the logical level, as in the following example:

Here, we have a situation similar to the previous one, where a person can own an arbitrary number of houses, and a house can be owned by an arbitrary number of people. The difference here is that we assume one of the domain requirements is to record when a person buys and sells a house, as well as the price at which it was bought. We don’t need the selling price because it will be the purchase price for another occurrence of Property.

For this, we use an intermediate Property entity that stores this data, where we must keep in mind that SellDate should not "exist" until the house is actually sold, if it’s sold at all. So to translate this to the logical level, the simplest approach is to allow SellDate to be NULL until the house is sold.

As we can see, this situation can now be translated into a relational diagram, meaning at the logical level. This is because all entities can be represented as tables. And since the associations are of the 1-M type, we already know how to implement them using foreign keys, specifically in the Property entity referencing the primary keys of the other two entities, respectively.

This doesn't mean that whenever we have an N-M relationship, we need to introduce an intermediate entity to implement it. Sometimes we need an intermediate class to store information, as in this case, and in other situations, we might need to refine the schema because the N-M relationship doesn't best represent the domain.

But if we really need to implement an N-M relationship and we’re sure that this relationship is conceptually correct, we can always add an artificial intermediate entity that has no attributes other than the foreign keys of both associations (with both being the primary key), thus making it a weak entity in identification.
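Such an artificial intermediate entity might be sketched like this, assuming SQLite via Python's sqlite3 module. The table name Ownership and the sample data are hypothetical. The table holds nothing but the two foreign keys, which together form its primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("CREATE TABLE House (HouseID INTEGER PRIMARY KEY, Address TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Ownership (   -- hypothetical name for the artificial entity
        PersonFK INTEGER NOT NULL REFERENCES Person(PersonID),
        HouseFK  INTEGER NOT NULL REFERENCES House(HouseID),
        PRIMARY KEY (PersonFK, HouseFK)
    )
""")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO House VALUES (?, ?)",
                 [(10, "1 Main St"), (11, "2 Oak Ave")])
# Alice co-owns both houses; Bob co-owns house 10.
conn.executemany("INSERT INTO Ownership VALUES (?, ?)", [(1, 10), (1, 11), (2, 10)])

owners_of_10 = [r[0] for r in conn.execute("""
    SELECT Person.Name FROM Ownership
    JOIN Person ON Person.PersonID = Ownership.PersonFK
    WHERE Ownership.HouseFK = 10 ORDER BY Person.Name
""")]
houses_of_alice = [r[0] for r in conn.execute(
    "SELECT HouseFK FROM Ownership WHERE PersonFK = 1 ORDER BY HouseFK")]
print(owners_of_10, houses_of_alice)
```

Each row stores exactly one reference in each foreign key, so no repeating group appears on either side.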

For example, considering the situation where we do need to store information in an intermediate class, we previously saw that Property had its own primary key, PropertyID, probably derived from a surrogate key. But if there is no surrogate key, we must try to identify the tuples of Property through their attributes. In this case, this isn’t possible given their semantics – meaning the significance of what they store – as there could be multiple tuples with the same dates, prices, and so on.

So, knowing that two foreign keys will appear in Property referencing House and Person when translated to the logical level, we can use them to define the primary key of Property using BuyDate and the foreign key attributes themselves.

We do this because if we only make the primary key consist of the foreign keys, then Property can’t be uniquely identified if a person buys and sells the same house during multiple different time periods. So we add BuyDate to the primary key to also distinguish by purchase date. We use BuyDate rather than SellDate because SellDate can be NULL, which would violate the fundamental integrity constraint of primary keys that none of their attributes can be NULL. With this, the Property entity becomes weak in identification, which is why we’ve added «weak» to both sides, indicating that we need both foreign keys for identification.

Similarly, in this case, since the weakness in identification affects the entities on both sides (meaning we need the foreign keys referencing the entities on both sides of Property), it can be represented conceptually with an associative entity linked to the N-M association between House and Person. This is still equivalent to the previous diagram, with the only difference being that the intermediate class is represented differently.

Also, it’s important to note that if Property had a surrogate key and did not need foreign keys for its identification, this representation using an associative entity would not be valid. Ultimately, the associative entity is only valid to use in this context when the intermediate entity depends on the two linked entities for its identification, with these being its owning entities.

Regarding the logical-level translation of this last case, we do it in the same way – the difference being that we no longer have the PropertyID attribute in Property. Also, its primary key is now {BuyDate, HouseFK, PersonFK}, so we underline all those attributes.
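A sketch of the Property table with the composite primary key {BuyDate, HouseFK, PersonFK}, assuming SQLite via Python's sqlite3 module. The Price column name and the sample dates and amounts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute("CREATE TABLE House (HouseID INTEGER PRIMARY KEY, Address TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Property (
        BuyDate  TEXT NOT NULL,
        SellDate TEXT,                -- NULL until the house is sold
        Price    INTEGER NOT NULL,    -- purchase price
        HouseFK  INTEGER NOT NULL REFERENCES House(HouseID),
        PersonFK INTEGER NOT NULL REFERENCES Person(PersonID),
        PRIMARY KEY (BuyDate, HouseFK, PersonFK)
    )
""")
conn.execute("INSERT INTO Person VALUES (1, 'Alice')")
conn.execute("INSERT INTO House VALUES (10, '1 Main St')")

# Alice buys, sells, and later buys the same house again: two distinct tuples.
conn.execute("INSERT INTO Property VALUES ('2015-01-10', '2018-06-01', 200000, 10, 1)")
conn.execute("INSERT INTO Property VALUES ('2021-03-15', NULL, 250000, 10, 1)")

try:  # same (BuyDate, HouseFK, PersonFK) again: the key rejects it
    conn.execute("INSERT INTO Property VALUES ('2021-03-15', NULL, 999, 10, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

periods = conn.execute(
    "SELECT BuyDate, SellDate FROM Property ORDER BY BuyDate").fetchall()
print(duplicate_rejected, periods)
```

With only the two foreign keys in the key, the second purchase would have been rejected as a duplicate.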

As a general rule, when a foreign key is underlined in a relational diagram, it indicates that the conceptual-level entity corresponding to the table is weak in identification. This lets us know how many entities it depends on – that is, its owning entities.

IS-A Hierarchy

After seeing how entities and associations from the relational model are translated to the logical level, let’s now understand how the special relationships of generalization and specialization among the entities themselves are translated.

To do this, we’ll start with an example of an IS-A hierarchy. This basically means that one or more entities, like CityPool, are a specialization of another more general entity, Pool. This is very similar to what happens in object-oriented programming with inheritance.

The inheritance hierarchy is called IS-A because if CityPool inherits from Pool, then it’s more specific than Pool. This means that every city pool is a pool, but it has specific attributes that characterize city pools, such as their maximum user capacity or the ticket price.

Before seeing how they are translated to the logical level, it's important to know that IS-A hierarchies have a series of specialization constraints that determine the "relationship" the parent entity (Pool here, sometimes called superclass) has with the specific entities. In other words, if we consider that entities are actually sets containing all their occurrences (tuples) (which we’ll call individuals here to keep it a more "general" concept), away from the details of the conceptual and logical model, then we can establish constraints like completeness or disjunction of a hierarchy.

To understand completeness using this example, we can first have hierarchies that are complete, where all individuals of the entity Pool must necessarily belong to the sets of individuals of one of the specific entities that inherit from the superclass Pool.

In this case, the superclass Pool is an entity that contains all existing pools. So some of them might be city pools, belonging to the set of individuals formed by the inheriting entity CityPool. Others might be Olympic pools, which belong to the set of individuals of OlympicPool.

In our model, we have only specified these two types of pools, while in reality, there are many other types of pools. In this hierarchy, they’d be represented by individuals in the set generated by Pool, as they don’t have any inheriting class to belong to. So in this case, our hierarchy would not be complete, but partial, since there are pools that do not belong to any inheriting entity.

On the other hand, disjunction refers to the possibility of individuals belonging to more than one inheriting entity at the same time. For example, in our case, a pool is either a city pool or an Olympic pool, or it’s neither of those types – so we will never find a pool that is both a city and an Olympic pool at the same time.

If we consider the sets of individuals of the inheriting entities, the hierarchy is considered disjoint when those sets are disjoint, as in this case where pools are either one type or the other, but not both at the same time. Conversely, in cases where the latter occurs, the hierarchy is called overlapping.

1 table

Knowing now that the hierarchy in our example is incomplete (called partial) and disjoint, we need to implement what’s shown in the entity-relationship diagram at the logical level.

We have several options for this. One option is to implement the entire IS-A hierarchy with a single table, Pool, that gathers all the attributes of the tables in the hierarchy.

As you can see, in this option we implement a table that contains all the attributes of the three tables, where PoolID functions as the primary key of the entire Pool table, since in the conceptual design specific identifiers aren’t usually assigned to inheriting entities unless required. This is why PoolID appears underlined. As for the rest, they work the same as if they were in their respective entities.

On one hand, this option has the advantage of using only one table for the entire hierarchy, which makes it easier to understand and maintain. It also avoids the possible redundancy of storing the same information in multiple tables.

But on the other hand, it presents significant problems. First, we have no simple way to distinguish a pool from a city pool or an Olympic pool, meaning the only way to know the specific type of pool that a tuple in the Pool table represents is to have some attributes be NULL.

For example, if a tuple represents a pool from the Pool entity, then all the attributes of CityPool and OlympicPool must be NULL so that the corresponding tuple only takes values in the attributes of the Pool entity. This lets us determine that the tuple represents an "individual" of the set of occurrences of the Pool entity.

The same thing happens when we try to distinguish city pools, where all the attributes of OlympicPool must be NULL, since CityPool inherits all the attributes of the Pool entity. So all those attributes plus those specific to CityPool will have values, while those of OlympicPool will be NULL to indicate that the pool is a city pool. This also happens when we want to know if a tuple represents an Olympic pool, where the attributes of CityPool will be NULL.

So if we implement the IS-A hierarchy with a single table, we will have the problem of distinguishing the types of pools – that is, knowing if a tuple represents an occurrence of the superclass entity or one of the inheriting entities. This could lead to a potentially large number of NULL values occupying unnecessary space in the table, even though working with such a table might be easy to understand.

We can also consider the ease with which the schema can be extended or modified as an advantage. If a foreign key referencing another entity is later added in our domain to any of the three entities of the hierarchy, we would simply need to add a foreign key attribute to the Pool table. Similarly, if an external foreign key points to any of the entities in the hierarchy, it would only need to reference PoolID.
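To make the single-table option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The attribute names (Name, MaxCapacity, TicketPrice, LaneCount) and the sample rows are assumptions made up for illustration, since the original diagram isn't reproduced here. It shows how the type of a pool has to be inferred from which attributes are NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Pool (
        PoolID      INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL,   -- superclass attribute (assumed name)
        MaxCapacity INTEGER,         -- CityPool attribute (assumed name)
        TicketPrice REAL,            -- CityPool attribute (assumed name)
        LaneCount   INTEGER          -- OlympicPool attribute (assumed name)
    )
""")
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Riverside", 200, 3.5, None),        # a city pool
        (2, "Aquatics Centre", None, None, 10),  # an Olympic pool
        (3, "Hotel Pool", None, None, None),     # neither type
    ],
)

def pool_type(pool_id):
    """Infer the pool type from which groups of attributes are NULL."""
    max_capacity, ticket_price, lane_count = conn.execute(
        "SELECT MaxCapacity, TicketPrice, LaneCount FROM Pool WHERE PoolID = ?",
        (pool_id,),
    ).fetchone()
    if max_capacity is not None or ticket_price is not None:
        return "CityPool"
    if lane_count is not None:
        return "OlympicPool"
    return "Pool"

types = {pool_id: pool_type(pool_id) for pool_id in (1, 2, 3)}
```

Note that every row carries NULLs for the attributes of the types it doesn't belong to, which is the wasted space discussed above.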

2 tables

To address the previous problem of distinction, another option we have for implementing the hierarchy is to use two tables, or as many as there are inheriting entities. The basis of this is that all inheriting entities have the same attributes as the superclass they inherit from, plus a series of specific attributes that characterize them.

So to logically distinguish the inheriting entities, we can implement specific tables for each one, where they have the same attributes as the superclass plus their own.

As you can see, in this option we implement the CityPool and OlympicPool tables, which are responsible for storing tuples that represent city pools and Olympic pools, respectively. Since each contains the same attributes as the superclass, even though they aren’t explicitly copied in the inheriting entities in the conceptual diagram, both have the same primary key, PoolID.

This implementation offers various advantages: first, we eliminate unnecessary NULL values used to distinguish pool types, at least those modeled through the inheriting entities. Also, the schema remains simple, being easily understandable and maintainable due to the semantics of each table.

But there is also a distinction problem here, as our hierarchy is not complete. This means that there will be pools that are neither city nor Olympic, so they can’t be represented with tuples from CityPool or OlympicPool. In other words, this option doesn’t work for representing incomplete hierarchies, as the only way we could represent a pool that is neither of these types would be to insert an identical tuple in both CityPool and OlympicPool with all attributes not belonging to the superclass set to NULL. But this would be very counterproductive in terms of memory usage, and would also be complicated to manage.

On the other hand, even if the hierarchy were complete, a possible disadvantage to consider is the repetition of the superclass attributes in all tables, where this repetition wastes space in our database.

But even if we have extra space and can afford to repeat all those attributes, if we want to gather all the data about all the pools (or individuals) that exist, we would need to collect the data present in all the tables, which may not be entirely efficient.
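As a rough sketch of this cost, the following Python snippet (using sqlite3, with assumed attribute names and made-up rows) builds one table per inheriting entity and then has to combine both tables with UNION ALL just to list every pool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each table repeats the superclass attribute Name (assumed name).
    CREATE TABLE CityPool (
        PoolID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
        MaxCapacity INTEGER, TicketPrice REAL
    );
    CREATE TABLE OlympicPool (
        PoolID INTEGER PRIMARY KEY, Name TEXT NOT NULL,
        LaneCount INTEGER
    );
    INSERT INTO CityPool VALUES (1, 'Riverside', 200, 3.5);
    INSERT INTO OlympicPool VALUES (2, 'Aquatics Centre', 10);
""")

# Listing all pools means visiting every table in the hierarchy.
all_pools = conn.execute("""
    SELECT PoolID, Name, 'CityPool' AS PoolType FROM CityPool
    UNION ALL
    SELECT PoolID, Name, 'OlympicPool' AS PoolType FROM OlympicPool
    ORDER BY PoolID
""").fetchall()
```

With more inheriting entities, the query grows by one SELECT per table. There is also no natural place in this schema to store a pool that is neither type.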

Lastly, if our conceptual model has a foreign key referencing the superclass entity Pool, we need to consider that the primary key of Pool has now been transferred to the two tables. Instead of referencing an attribute of one table, that foreign key would have to reference PoolID from both CityPool and OlympicPool at the same time, which isn’t possible. This would complicate the implementation at the logical level, or even make it impossible.

Regarding foreign keys, this option would indeed allow us to easily implement a foreign key in one of the entities, CityPool or OlympicPool, that references another entity (or even foreign keys that reference these entities in a straightforward manner).

But, if we insist on using two tables to implement the hierarchy, we could refine the logical schema to solve the problem of a foreign key referencing the superclass in this way.

As you can see above, we have two tables where one is exclusively dedicated to storing tuples that contain the attributes characterizing an Olympic pool. The other entity encompasses all pools, including city pools and Olympic pools. This is because an Olympic pool also inherits the attributes of the superclass, so to represent it in this schema, we create two tuples: one in Pool that stores the values of the superclass attributes, leaving the rest as NULL, and another tuple in OlympicPool that stores the remaining attributes, with its foreign key (which is also the primary key), referencing the corresponding tuple in Pool with the superclass attribute values.

The main advantage of this option is that it solves the problem of having an external foreign key referencing Pool – as in this case, it would simply need to point to the primary key {PoolID} of Pool, instead of several attributes at once as it did before.

But this leaves us with a significantly more complex schema to understand and work with, as the way to store a city pool is entirely different from storing an Olympic pool. This complicates certain operations like inserting an Olympic pool, where we’d need to create two tuples in Pool and OlympicPool so that the primary/foreign key of OlympicPool points to the tuple created in Pool. It also complicates counting the pools that are neither Olympic nor city pools in the system, where all those tuples in Pool with NULL in the attributes characterizing city pools must be found.

Finally, although we see that the primary key of OlympicPool is also foreign in this implementation, this doesn’t imply that conceptually it’s a weak entity in identification. There are many ways to implement the hierarchy, and this is not necessarily the one that must be carried out.

3 tables

So, if we have an incomplete hierarchy and really want to make sure that the implementation lets us distinguish between the different types of pools and identify those pools that don’t belong to any inheriting entity, we can use three tables – one for each entity, respectively.

The peculiarity of this schema is that since all pools contain the attributes of the superclass Pool, whenever there is a pool in our database, it will be represented by a tuple in the Pool table that contains only the values of the superclass attributes. And, if a pool is of a specific type, it will be represented not only by the Pool tuple but also by a tuple in one of the tables reserved for the inheriting entities, where each has a foreign key pointing to the primary key {PoolID} of Pool.

For example, a city pool can be represented by a tuple in CityPool (which only has the specific attributes that characterize it as a city pool) plus a foreign key pointing to a specific tuple in Pool that holds the values of the other inherited attributes.

The advantage of this schema is that it minimizes wasted space from duplicated information or the appearance of NULL values (as the only thing being "duplicated" here is the PoolID attribute as foreign keys in the inheriting entities). It’s also easy to understand because each entity is represented by a specific table at the logical level.

Also, the schema is easy to modify in cases where we need to add foreign keys to the entities, where it would simply require adding an additional foreign key attribute or implementing external foreign keys pointing to the entities themselves, which we can do by referencing their own primary and foreign keys {PoolID}.

If we add a new type of pool to the domain later, it’s easy to add a table very similar to the ones we already have. This is unlike the previous options we saw where adding a new type of pool would be more costly because of the elements that need modification. Also, having three tables makes it easy to model the constraints related to the completeness and disjunction of the hierarchy.

But this schema also presents certain problems. On one hand, if we have a city pool and want to know its name, we’ll need to access the Pool table to find its name, plus the CityPool table. This complicates the query and affects its efficiency and latency.

Aside from this, if we have a tuple from Pool and want to know if it’s a city pool, an Olympic pool, or neither, we’ll have to go through all the tuples in CityPool and OlympicPool to see if the foreign key of any of them points to the Pool tuple we are trying to identify.

Also, the presence of three tables is more complex than having just one or two, making the logical model somewhat more complicated to operate because there are more tables and more relationships between them.
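The three-table option and its trade-offs can be sketched as follows. This is only an illustration in Python with sqlite3: the attribute names and rows are assumptions, and foreign key enforcement relies on SQLite's `PRAGMA foreign_keys`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    -- In each inheriting table, PoolID is both primary key and foreign key.
    CREATE TABLE CityPool (
        PoolID INTEGER PRIMARY KEY REFERENCES Pool(PoolID),
        MaxCapacity INTEGER, TicketPrice REAL
    );
    CREATE TABLE OlympicPool (
        PoolID INTEGER PRIMARY KEY REFERENCES Pool(PoolID),
        LaneCount INTEGER
    );
    INSERT INTO Pool VALUES (1, 'Riverside'), (2, 'Aquatics Centre'), (3, 'Hotel Pool');
    INSERT INTO CityPool VALUES (1, 200, 3.5);
    INSERT INTO OlympicPool VALUES (2, 10);
""")

# Reading the name of a city pool requires a join with Pool.
city_pool = conn.execute("""
    SELECT p.Name, c.MaxCapacity, c.TicketPrice
    FROM CityPool AS c JOIN Pool AS p ON p.PoolID = c.PoolID
    WHERE c.PoolID = ?
""", (1,)).fetchone()

# Pools that belong to no inheriting entity: no matching row in either table.
generic = conn.execute("""
    SELECT p.Name FROM Pool AS p
    LEFT JOIN CityPool AS c ON c.PoolID = p.PoolID
    LEFT JOIN OlympicPool AS o ON o.PoolID = p.PoolID
    WHERE c.PoolID IS NULL AND o.PoolID IS NULL
""").fetchall()
```

No NULL values are needed to tell the types apart, but both queries have to touch more than one table.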

When to Model Each Entity as a Table

These alternatives aren’t the only options for implementing an IS-A hierarchy at the logical level. Depending on the domain needs and requirements, we can choose other more appropriate schemas that are similar to those we’ve already discussed.

To summarize which is the best schema we can implement to model an IS-A hierarchy at the logical level, we need to know when it’s appropriate to introduce a table for each entity.

First, we have the superclass. This is useful to model with a specific table in cases where the hierarchy is incomplete, as in the example hierarchy. We saw that without a dedicated Pool table to represent occurrences of the superclass entity, it’s difficult to distinguish when a pool is of the generic type of the superclass or is instead of a specific type (like that of the inheriting entities). It’s also helpful to implement a table for the superclass when there is a foreign key pointing to the superclass entity itself. Otherwise, it’s very likely that we’ll have trouble knowing which attribute the foreign key should reference, as we saw before.

And to finish with the superclass, we should also implement a table for it when the hierarchy is non-disjoint or overlapping. For example, if a pool could be of several types at once and we didn't have the Pool table, we would be forced to duplicate information in specific tables for the inheriting entities. This would greatly complicate database operations.

So, with our Pool table, we can have tuples in the respective tables of the inheriting entities, all with their foreign keys pointing to the same Pool tuple, which simplifies queries.

If we have a Pool table where all existing pools are stored, it’s likely that we would want to efficiently know the type of a pool from a Pool tuple. Instead of having to go through all the tuples of the inheriting entities' tables, we can add an attribute in Pool that determines its type. This attribute will be NULL when the hierarchy is incomplete and the pool doesn't belong to any type.

This is called an explicit discriminator. If there isn't one, we typically say that there is an implicit discriminator. These are the foreign keys of the other tables that we would have to go through to find out the type.
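A minimal sketch of an explicit discriminator, again in Python with sqlite3, could look like this. The column name PoolType and the CHECK constraint are assumptions chosen for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Pool (
        PoolID   INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        -- Explicit discriminator: NULL means the pool has no specific type.
        PoolType TEXT CHECK (PoolType IN ('CityPool', 'OlympicPool'))
    )
""")
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?, ?)",
    [(1, "Riverside", "CityPool"),
     (2, "Aquatics Centre", "OlympicPool"),
     (3, "Hotel Pool", None)],
)

def pool_type(pool_id):
    """Read the type directly from Pool, without scanning other tables."""
    row = conn.execute(
        "SELECT PoolType FROM Pool WHERE PoolID = ?", (pool_id,)
    ).fetchone()
    return row[0]
```

Here `pool_type(3)` returns `None`, which marks a pool that is neither a city pool nor an Olympic pool.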

Regarding the inheriting entities, we should create a table for each one when they have many attributes, which would result in many NULL values if we were to implement this with just one or two tables. Besides the attributes, the inheriting entities themselves may have specific domain constraints that are greatly simplified at the logical level if we implement tables for each of them. This would avoid the need to apply constraints on just one or two tables, complicating the semantics of the constraints.

In short, the more entities we combine into a single table, the more NULL values we will encounter, since to distinguish them, the table attributes that do not correspond to the concept or entity we want to represent must be NULL, as if they don’t exist.

This would also complicate database operations, as operations would need to consider which attributes should or should not be NULL – as well as the constraints – which must account for the presence of NULL values to be verified.

On the other hand, if we know the hierarchy is complete, then instead of implementing a table for the superclass, we can decide to implement tables for each inheriting entity, where each one has the attributes of the superclass. But this option loses its purpose when we have a superclass with too many attributes, which would be repeated in several tables, potentially many.

Chapter 7: Normalization

When trying to translate an IS-A hierarchy to the logical level, it's very likely that we’ll end up with a design that exhibits redundancy. This is because the same information, such as that of the superclass, can end up being stored in multiple places. This poses multiple problems in a database. So to understand it, let's consider the following example:

PersonID	PersonName	CityID	CityName
1	Alice Johnson	1	Madrid
2	Bob Smith	5	Paris
3	Carol Davis	3	New York
4	David Brown	1	Madrid
5	Emily Wilson	5	Paris

Here we have a Person table that stores data about people – specifically their ID, name, and the city where they live. But to represent the city, all the attributes that our database stores about cities are included in the Person table itself. This makes it so that in one row we can know information about the person as well as information about the city where they live.

At first glance, this may seem convenient, since if we have a Person row, we not only have all the information about the person but also all the information about the respective city. This then lets us avoid having to look up this information in other tables. But this creates a significant redundancy problem.

Redundancy means that the same information is stored in multiple places, that is, repeated unnecessarily. And this doesn't mean the information has to be in different tables. In this example, we have redundancy because the same city information can be stored multiple times in the same Person table (as is the case with the city "Paris" or "Madrid").

This actually leads to problems when inserting new cities into the database. If we only store them in this table but don't have any person living in the city we want to insert, we won't be able to insert it unless we do so in a row of the Person table with the rest of the attributes that don't characterize a city set to NULL. And this will greatly complicate database operations.

Redundancy also poses a problem for memory consumption, as duplicating all the information of a city for each person living there uses up unnecessary space. Similarly, if we want to delete a city or update its information, we have to do it for every instance where that information is repeated. This complicates operations and makes them much less efficient.

For example, if we store a Population attribute in this table to represent each city's population, every time we update the population of a certain city, we have to do it for all the Person tuples. This becomes inefficient if many people live in that city, as we have to change the population data in all the tuples representing those people.

Just as it affects efficiency, redundancy also increases the chances of data inconsistency. If we forget to change one value when updating Population data, or if there's an error and a certain value in a tuple doesn't update, then that value will contradict the rest of the Population values in the repeated tuples for that city, causing an inconsistency.
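The update problem can be reproduced with a small sketch in Python using sqlite3. The Population figures below are placeholder numbers invented for the example, not real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY, PersonName TEXT,
        CityID INTEGER, CityName TEXT, Population INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?, ?, ?)",
    [(1, "Alice Johnson", 1, "Madrid", 1000),
     (2, "Bob Smith", 5, "Paris", 2000),
     (3, "Carol Davis", 3, "New York", 3000),
     (4, "David Brown", 1, "Madrid", 1000),
     (5, "Emily Wilson", 5, "Paris", 2000)],
)

# Cities whose repeated rows disagree about the population.
find_inconsistent = """
    SELECT CityID, COUNT(DISTINCT Population) FROM Person
    GROUP BY CityID HAVING COUNT(DISTINCT Population) > 1
"""

# A faulty update that only touches one of Madrid's two rows.
conn.execute("UPDATE Person SET Population = 1100 WHERE PersonID = 1")
inconsistent = conn.execute(find_inconsistent).fetchall()

# The correct update has to modify every row that repeats the city.
rows_touched = conn.execute(
    "UPDATE Person SET Population = 1100 WHERE CityID = 1"
).rowcount
still_inconsistent = conn.execute(find_inconsistent).fetchall()
```

After the faulty update, Madrid appears with two different populations. The correct update fixes that, but it has to rewrite one row per resident.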

To solve these types of situations, it's best to plan ahead by creating a good design at the conceptual level. We can try to separate concepts into entities that are distinct enough to avoid storing information about semantically different ideas in the same entity (as this could cause redundancy when moving to the logical level).

But if we reach the logical level with a certain diagram that we couldn't refine further at the conceptual level and we need to refine it, one of the transformations we can apply here is called decomposition.

Decomposition

PersonID	PersonName	CityID (FK)
1	Alice Johnson	1
2	Bob Smith	5
3	Carol Davis	3
4	David Brown	1
5	Emily Wilson	5

CityID	Name
1	Madrid
3	New York
5	Paris

Before looking into what decomposition involves, it's helpful to examine the specific problems of combining information about people and cities in a single entity.

For example, if we store many attributes of a city, a lot of space will be wasted if we have many people living in that city. This is because all the city's attribute values are unnecessarily repeated in all the tuples of the people living there.

Another potential problem related to memory waste could happen if we needed to insert a person into the database and we don’t know the city they live in. This would force us to leave all the city's attributes as NULL, wasting all the space those NULLs occupy.

Similarly, if we delete all the people living in "Madrid," for example, then our database will no longer contain any information about that city, as no one lives there. This means it isn't explicitly stored in the table. Lastly, we previously saw the problems that arose when updating the information of a certain city.

As a solution to these issues, we can apply decomposition. If you consider the example above, you may be able to see that this involves breaking down the Person table into several tables. On one hand, we keep the Person table dedicated to storing information about people. On the other, for the cities, we store all their information in a specific City table.

Once the information is separated into multiple tables, we can maintain the CityID (FK) attribute in the Person table, where it’s now converted into a foreign key that references the CityID of the new City table, indicating the city where the person lives.

As you can see in the example, decomposition involves replacing one table with two or more tables, each containing a subset of the attributes from the original. By combining them, we can retrieve the original attributes.

For instance, here we have split one table into two, where one retains all the attributes related to people and the other holds attributes related to cities. Together, these attributes form the original table we had. We do this mainly to solve problems caused by redundancy. Now, in the Person table, we only store an identifier for the city where the person lives, and in the City table, we store the city's information only once, allowing it to be used by more tables in the database.

But in order to do decomposition correctly, we must ensure that certain conditions are met. One is that the decomposition is lossless. This means that if we now take the two tables generated by the decomposition and combine all their information back into a single table, we should get the information we had in the original Person table before the decomposition.

So if we now take the resulting Person table and add the information provided by the tuple from the City table identified by the foreign key defined in the decomposition to each tuple, we should get the same information as we had in the original Person table before the decomposition – without losing any tuples or creating new ones.

This join operation easily shows that it returns the data we originally had before the decomposition. And this indicates that the decomposition was done without loss. But this doesn't guarantee it will be lossless for any possible tuple. To ensure this, we need to analyze the functional dependencies present among the table's attributes, which must also be preserved after decomposition.
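As a quick check on the sample data, the following Python sketch splits the original table into Person and City, joins them back on CityID, and compares the result with the original. For contrast, it also tries a hypothetical bad split that drops the CityID foreign key from Person.

```python
# The original, undecomposed Person table as a set of tuples.
original = {
    (1, "Alice Johnson", 1, "Madrid"),
    (2, "Bob Smith", 5, "Paris"),
    (3, "Carol Davis", 3, "New York"),
    (4, "David Brown", 1, "Madrid"),
    (5, "Emily Wilson", 5, "Paris"),
}

# Decomposition: project onto the attributes of each new table.
person = {(pid, name, cid) for pid, name, cid, _ in original}
city = {(cid, cname) for _, _, cid, cname in original}

# Join the two tables back together on CityID.
rejoined = {
    (pid, name, cid, cname)
    for pid, name, cid in person
    for city_id, cname in city
    if cid == city_id
}
is_lossless_here = rejoined == original

# Hypothetical bad split: Person keeps no CityID, so the tables share no
# attribute and the only way to combine them is a Cartesian product.
person_without_fk = {(pid, name) for pid, name, _, _ in original}
spurious = {
    (pid, name, cid, cname)
    for pid, name in person_without_fk
    for cid, cname in city
}
```

The first join reproduces exactly the five original tuples, while the bad split yields 15 tuples, most of which never existed. As noted above, passing this check on one set of rows is not a proof that the decomposition is lossless for every possible table.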

Lastly, after performing the decomposition, the database might receive queries such as obtaining information about the city a given person lives in. To implement this query, we usually perform operations similar to the join we described earlier, which can be computationally expensive. So if it becomes so costly that it's impractical, we might consider not doing the decomposition. Or we could even do a partial one, where we keep the city attributes that are queried most frequently in the Person table to make certain queries more efficient, even if some redundancy exists.

Functional dependency

Continuing with these conditions, to understand them correctly, you’ll need to know what functional dependencies are.

To introduce this concept, we can look at the simplest case, which is the attributes PersonID and PersonName of the Person table. These store a person's government identification number and their name, respectively. So if we find several tuples in the Person table with the same PersonID value, we would expect their respective PersonName values to also be the same. This is because if several tuples store information about people with the same ID, then they must necessarily be the same person (as we assume the government identification number is unique for each person).

So whenever there are several tuples with the same ID, we can say that the names of the people represented by those tuples must also be the same.

But the reverse does not have to be true, because if several different people have the same name, they will have the same name but different IDs. So if several tuples have the same PersonName, their respective PersonIDs do not have to match.

This situation we just saw is a case of functional dependency between PersonID and PersonName, specifically denoted as PersonID→PersonName, since it’s the PersonID attribute that uniquely determines the person's name.

PersonID	PersonName	CityID	CityName
1	Alice Johnson	1	Madrid
2	Bob Smith	5	Paris
3	Carol Davis	3	New York
4	David Brown	1	Madrid
5	Emily Wilson	5	Paris

Formally, we can define a functional dependency as a constraint or relationship that exists between two sets of attributes, such that the values taken by one set of attributes uniquely determine the values that the other set of attributes must take.

For example, using the same example without decomposing, we can see that there is a functional dependency between the set of attributes X={PersonID} and the set Y={PersonName}, denoted as X→Y. This means that for any pair of tuples in the table, if those tuples have the same values in the set of attributes X, then they must necessarily have the same values in the set of attributes Y.

But we don't discover this by simple observation. These dependencies are mainly given by the characteristics of the attributes and the domain we are modeling, as well as the requirements. That is, to discover these dependencies, we need to focus on the semantics of the attributes.
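Even so, it can be useful to test whether a particular set of rows is consistent with a dependency. The sketch below is a small Python helper (the function name `fd_holds` is made up for this example) that applies the definition directly: it looks for two rows that agree on X but differ on Y. A result of True only means the sample doesn't contradict the dependency, not that the dependency holds in the domain.

```python
columns = ("PersonID", "PersonName", "CityID", "CityName")
data = [
    (1, "Alice Johnson", 1, "Madrid"),
    (2, "Bob Smith", 5, "Paris"),
    (3, "Carol Davis", 3, "New York"),
    (4, "David Brown", 1, "Madrid"),
    (5, "Emily Wilson", 5, "Paris"),
]
rows = [dict(zip(columns, values)) for values in data]

def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # Remember the first rhs value seen for this lhs value.
        if seen.setdefault(key, value) != value:
            return False
    return True

id_determines_name = fd_holds(rows, ["PersonID"], ["PersonName"])
city_id_determines_city_name = fd_holds(rows, ["CityID"], ["CityName"])
city_name_determines_person = fd_holds(rows, ["CityName"], ["PersonID"])
```

The last check fails because two different people live in Madrid, so CityName cannot determine PersonID.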

The formal definition of this concept states that a functional dependency is a relationship between sets of attributes, so they don't have to be single-attribute sets – they can contain any number of them, depending on the complexity of the dependency.

For example, if we assume that a person always lives in the same house and never moves, then we can say there is a functional dependency {PersonID}→{CityID}, as well as {PersonID}→{CityName}. This results in functional dependencies for all possible combinations of attributes on the right-hand side, which must take the same value for several tuples if they have the same value for the attributes on the left-hand side.

Specifically, this means that given the dependencies we know exist, the following also exist:

{PersonID}→{PersonName,CityID}
{PersonID}→{PersonName,CityName}
{PersonID}→{CityID,CityName}
{PersonID}→{CityID,CityName,PersonName}

This occurs due to the union property of functional dependencies, where if we have dependencies X→Y and X→Z, then the dependency X→(Y ∪ Z) also exists, where the uppercase letters denote sets of attributes.

Without going into more detail about these properties, it's worth highlighting that this rule can be derived from Armstrong's inference rules (reflexivity, augmentation, and transitivity), whose main purpose is to infer all the functional dependencies that exist in a table. Specifically, these inference rules ensure that, starting from a series of initial functional dependencies, all the dependencies that actually exist in a table can be inferred.

With this, the important thing to know is that functional dependencies can have multiple attributes in their sets. This in turn can lead to a classification of the dependencies based on the number of attributes they have in each set.

One of the main uses of functional dependencies is to determine if a decomposition is valid, meaning if all the functional dependencies are preserved.

For example, in the original table, there are the functional dependencies {PersonID}→{PersonName} and {PersonID}→{CityID}, primarily, or {CityID}→{CityName} because the identifier of a city uniquely determines the name of the city itself. So, considering these dependencies as a base, we can infer others like {PersonID}→{CityName} by transitivity using {PersonID}→{CityID} and {CityID}→{CityName}.
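One common way to carry out this kind of inference mechanically is to compute the closure of a set of attributes, that is, everything those attributes determine under the given dependencies. The following Python sketch is a minimal version of that idea. The function name `closure` and the data layout are choices made for this example.

```python
# Base functional dependencies of the original table, as (X, Y) pairs.
base_fds = [
    (frozenset({"PersonID"}), frozenset({"PersonName"})),
    (frozenset({"PersonID"}), frozenset({"CityID"})),
    (frozenset({"CityID"}), frozenset({"CityName"})),
]

def closure(attrs, fds):
    """Return every attribute determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already determine all of lhs, we also determine rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

person_closure = closure({"PersonID"}, base_fds)
city_closure = closure({"CityID"}, base_fds)

# {PersonID} -> {CityName} is inferred because CityName is in the closure.
inferred = "CityName" in person_closure
```

Since the closure of {PersonID} contains all four attributes, PersonID determines every attribute of the original table, while CityID only determines CityName.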

The ones we have considered as base are those generated directly by the domain's semantics. This means that if a city is uniquely identified by its CityID, then it doesn't make sense to consider {PersonID}→{CityName} as a base, since we have the other dependencies that relate the person's identifier with their name and city identifier, from which we can infer it.

In summary, the base dependencies are the most fundamental ones from which all others can be inferred. There is no single algorithm to find them all. Instead, it’s a more open process that we need to follow based on our domain, requirements, and the semantics of the attributes.

Once we’ve found the base dependencies, the important thing is to ensure that they are preserved after decomposing a table. We can see this in the resulting tables, where {PersonID}→{PersonName} remains in Person, as does {PersonID}→{CityID}, with the only peculiarity being that now CityID in Person is a foreign key, and {CityID}→{CityName}, which is preserved in the new City table after decomposition.

So by preserving all the base functional dependencies, we are assured that the decomposition of Person into Person and City is correct.
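A simple way to check this for our example is to confirm that each base dependency fits entirely inside one of the resulting tables. The Python sketch below does that. Note that this is a sufficient test rather than a complete one, since a dependency can also be preserved indirectly through the others, and the helper name is invented for this example.

```python
# Attributes of each table after the decomposition.
schemas = {
    "Person": {"PersonID", "PersonName", "CityID"},
    "City": {"CityID", "CityName"},
}

base_fds = [
    ({"PersonID"}, {"PersonName"}),
    ({"PersonID"}, {"CityID"}),
    ({"CityID"}, {"CityName"}),
]

def fits_in_one_table(lhs, rhs, schemas):
    """True if all attributes of X -> Y live together in a single table."""
    return any((lhs | rhs) <= attrs for attrs in schemas.values())

all_preserved = all(fits_in_one_table(lhs, rhs, schemas) for lhs, rhs in base_fds)

# The derived dependency {PersonID} -> {CityName} spans both tables, but it
# can still be inferred by transitivity from the preserved base dependencies.
derived_fits = fits_in_one_table({"PersonID"}, {"CityName"}, schemas)
```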

Finally, functional dependencies can have many more classifications besides being base or not. For example, in some of the formal definitions of the normal forms that we’ll see, we often check if a functional dependency is trivial. A dependency X→Y is trivial when all the attributes of set Y are present in set X.

For example, {A, B} → {A} is trivial because {A} ⊆ {A, B}, and {A, B} → {B, A} is also trivial because {B, A} ⊆ {A, B}. But {A} → {B} is not trivial because {B} ⊄ {A}, meaning there is an attribute in set {B} that is not present in set {A}.
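The same test can be written as a one-line subset check. This is a small Python sketch, with a function name chosen for illustration:

```python
def is_trivial(lhs, rhs):
    """A dependency X -> Y is trivial when Y is a subset of X."""
    return set(rhs) <= set(lhs)

first = is_trivial({"A", "B"}, {"A"})        # Y is contained in X
second = is_trivial({"A", "B"}, {"B", "A"})  # Y equals X
third = is_trivial({"A"}, {"B"})             # B is not in X
```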

Normal forms

After understanding what functional dependencies are, it's important to note that there are many other types of dependencies, such as multivalued, join, or inclusion dependencies. All of these also aim to eliminate or minimize the problems associated with data redundancy that we saw earlier through normal forms.

Normal forms are a series of refinement levels of a relational schema defined by increasingly strict conditions intended to eliminate or progressively minimize the issues caused by redundancy in a schema. Among all the levels, we will only look at those that use functional dependencies between attributes as criteria for their conditions. But there are others we won’t cover here whose criteria include multivalued or join dependencies.

1NF

First, we have the first normal form (1NF), whose main condition is that each attribute is atomic. This means that the table cells do not contain an arbitrary number of values, which we can also call the non-existence of repeating groups. But it also imposes basic conditions such as the requirement for a primary key in the table so that each tuple can be uniquely identified. This prohibits duplicate tuples, as well as attributes with duplicate names, meaning there can’t be columns with duplicate names.

These conditions must be met for all tables in a database schema to be in 1NF. In this case, we can easily verify them by ensuring that each cell contains exactly one value, that there are no duplicates in rows or columns, and that there is a primary key.

A DBMS does not necessarily enforce these last three conditions, which means that when implementing a table at the logical level, we can have duplicates or even not define any key – and although the database may function, its schema won’t be in 1NF. So if we find a table that does not meet the normal form conditions, we can apply certain transformations to normalize it and bring it to 1NF.

2NF

The first normal form focuses mainly on prohibiting repeating groups, which eliminates the possibility of redundancies at the cell level – but does not eliminate redundancies caused by functional dependencies.

Despite prohibiting the existence of duplicate tuples, we saw in a previous example that city information in a table could be unnecessarily duplicated in multiple different tuples because the people living in that city were different. This meets 1NF but presents redundancy problems.

To address these redundancy cases, we use the second normal form (2NF). It includes all the conditions of 1NF plus an additional stricter condition: all attributes that aren’t the primary key of a table must depend on the entire selected primary key for the table – meaning all its attributes, not just one. This prevents partial dependency on the primary key.

BikeID	Model	Brand	BrandCountry	PurchasePrice	OwnerName	OwnerEmail
1	Roadster	SpeedX	USA	1200	John Doe	john@example.com
2	TrailBlazer	MountainCo	Canada	1500	Alice Smith	alice@example.com
3	Roadster	SpeedX	USA	1150	Bob Lee	bob@example.org
4	CityCruiser	UrbanRide	USA	800	John Doe	john@example.com
5	EcoCruiser	GreenMotion	Germany	1300	Carol Johnson	carol@example.com

For example, here we have a Bike table whose primary key is {BikeID}, and the basic functional dependencies are {BikeID}→{Model}, {BikeID}→{PurchasePrice}, {BikeID}→{OwnerName}, {BikeID}→{OwnerEmail} because if BikeID uniquely identifies each bike, then the information about the model, price, and owner will directly depend on that attribute.

We also have the dependencies {Model}→{Brand}, {Model}→{BrandCountry}, and {OwnerEmail}→{OwnerName}, since knowing the bike model can uniquely determine its brand. We can also determine the owner's name from their email, which we can’t do in reverse because multiple people can have the same name and different email addresses.
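As a rough illustration, we can test whether a functional dependency is contradicted by the sample rows of the Bike table. The helper `fd_holds` is a hypothetical name. Keep in mind that sample data can only refute a dependency: a dependency that happens to hold in five rows, such as {OwnerName}→{OwnerEmail}, is not necessarily true in the domain.

```python
COLUMNS = ("BikeID", "Model", "Brand", "BrandCountry",
           "PurchasePrice", "OwnerName", "OwnerEmail")
DATA = [
    (1, "Roadster", "SpeedX", "USA", 1200, "John Doe", "john@example.com"),
    (2, "TrailBlazer", "MountainCo", "Canada", 1500, "Alice Smith", "alice@example.com"),
    (3, "Roadster", "SpeedX", "USA", 1150, "Bob Lee", "bob@example.org"),
    (4, "CityCruiser", "UrbanRide", "USA", 800, "John Doe", "john@example.com"),
    (5, "EcoCruiser", "GreenMotion", "Germany", 1300, "Carol Johnson", "carol@example.com"),
]
BIKES = [dict(zip(COLUMNS, values)) for values in DATA]


def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left side.
        if seen.setdefault(left, right) != right:
            return False
    return True


print(fd_holds(BIKES, ["Model"], ["Brand"]))            # consistent with the data
print(fd_holds(BIKES, ["BrandCountry"], ["Brand"]))     # refuted: USA has two brands
print(fd_holds(BIKES, ["OwnerName"], ["OwnerEmail"]))   # holds here only by chance
```

This is why the dependencies in the text are justified by the meaning of the attributes rather than by the rows that happen to be stored.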

Given these dependencies, since the primary key has only one attribute, we see that all others have a dependency on the entire primary key. This means that the primary key itself uniquely determines the rest of the table's attributes. So we can formally denote that, for all attributes A that aren’t the primary key, there’s the functional dependency {Primary Key}→A.

In this case, even though some dependencies are transitive, we can see that in the end, all attributes end up depending on the primary key. For example, with {BikeID}→{Model} and {Model}→{Brand}, we infer the dependency {BikeID}→{Brand}, which is not basic.
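The inference of {BikeID}→{Brand} can be sketched with the usual attribute closure computation: starting from a set of attributes, we keep adding everything that the known dependencies let us derive. The function name `closure` and the way dependencies are written as pairs of tuples are choices made for this sketch.

```python
# Basic functional dependencies of the Bike table, as (left side, right side).
FDS = [
    (("BikeID",), ("Model",)),
    (("BikeID",), ("PurchasePrice",)),
    (("BikeID",), ("OwnerName",)),
    (("BikeID",), ("OwnerEmail",)),
    (("Model",), ("Brand",)),
    (("Model",), ("BrandCountry",)),
    (("OwnerEmail",), ("OwnerName",)),
]


def closure(attrs, fds):
    """All attributes that can be derived from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply a dependency when its whole left side is already known.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


bike_closure = closure(["BikeID"], FDS)
model_closure = closure(["Model"], FDS)
print(sorted(bike_closure))
print(sorted(model_closure))
```

Since the closure of {BikeID} contains every attribute of the table, including Brand and BrandCountry, the primary key determines all the other attributes, either directly or transitively.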

When this condition is met, the table is in 2NF, which avoids redundancies caused by attributes that depend only on part of the primary key, not the whole key.

This might not be as clear here because the primary key in the example has only one attribute, but sometimes we have primary keys with more attributes. In such cases, the rest of the table's attributes must depend on all the attributes in the primary key in order to be in 2NF (in addition to meeting the conditions of 1NF).

If they depend only on part of the key, there could be repeated values in those attributes. This would cause redundancy issues because it’s the entire primary key (all its attributes) that can uniquely identify each tuple.
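To see a partial dependency with a composite key, consider a hypothetical Rental table (not the one from the Bike example) whose primary key is {PersonID, BikeID, RentalDate} and which also stores PersonName. The sketch below assumes this key is the only candidate key and looks for non-key attributes that are already determined by a proper part of it.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes that can be derived from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, attributes, fds):
    """Map each non-key attribute to the smallest proper part of the key that determines it."""
    non_key = set(attributes) - set(key)
    found = {}
    # Try proper subsets of the key from smallest to largest.
    for size in range(1, len(key)):
        for part in combinations(sorted(key), size):
            for attr in sorted(closure(part, fds) & non_key):
                found.setdefault(attr, part)
    return found


# Hypothetical schema and dependencies, assumed for illustration.
ATTRIBUTES = ["PersonID", "BikeID", "RentalDate", "Duration", "PersonName"]
KEY = ["PersonID", "BikeID", "RentalDate"]
FDS = [
    (("PersonID", "BikeID", "RentalDate"), ("Duration",)),
    (("PersonID",), ("PersonName",)),
]

report = partial_dependencies(KEY, ATTRIBUTES, FDS)
print(report)
```

Here PersonName depends only on PersonID, so the same name would be repeated in every rental made by that person, and the table would not be in 2NF. Duration needs the whole key, so it causes no problem.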

3NF

Continuing with normal forms, 3NF is defined similarly. First, for a schema to be in 3NF, it must meet all the conditions of 2NF plus a specific one that states there can’t be functional dependencies between non-prime attributes.

Prime attributes are those that belong to any candidate key of the table. So we can restate the previous condition of 3NF by saying that no attribute that does not belong to any candidate key can functionally depend on any other attribute that does not belong to any candidate key.

For example, in the Bike table we had earlier, we assume that the only candidate key that exists is {BikeID}, since no other set of attributes can uniquely identify the tuples in the table. We can verify this by looking at the semantics of the attributes. So, seeing that there are functional dependencies like {Model}→{BrandCountry} between non-prime attributes, meaning they do not belong to any candidate key, we conclude that the table is not in 3NF, and we’ll need to normalize it.

BikeID	Model (FK)	PurchasePrice	OwnerEmail (FK)
1	Roadster	1200	john@example.com
2	TrailBlazer	1500	alice@example.com
3	Roadster	1150	bob@example.org
4	CityCruiser	800	john@example.com
5	EcoCruiser	1300	carol@example.com

Model	Brand	BrandCountry
Roadster	SpeedX	USA
TrailBlazer	MountainCo	Canada
CityCruiser	UrbanRide	USA
EcoCruiser	GreenMotion	Germany

OwnerEmail	OwnerName
john@example.com	John Doe
alice@example.com	Alice Smith
bob@example.org	Bob Lee
carol@example.com	Carol Johnson

To normalize the table, we’ll need to apply an algorithm to the tables to convert the schema to 3NF, ensuring there are no functional dependencies between non-prime attributes.

To understand this algorithm, we’ll start with the original Bike table we had before. We’ll focus on the functional dependencies between non-prime attributes that break 3NF, that aren’t derived transitively from simpler ones, and whose set of attributes on the left side does not form a superkey.

For example, if we have {A}→{B}, {B}→{C}, and {A}→{C}, we do not consider {A}→{C} since it can be derived transitively from the other two. Specifically, the problematic ones in our example, which aren’t derived transitively and whose left side is not a superkey, are {Model}→{Brand}, {Model}→{BrandCountry}, and {OwnerEmail}→{OwnerName}, which are the base functional dependencies.

Now, we need to decompose the table guided by these functional dependencies. But as you can see, we can apply the union property of functional dependencies to know that the functional dependency {Model}→{Brand, BrandCountry} also exists. We derived it from the previous problematic ones to simplify the application of the algorithm.

In short, to make the algorithm easier to apply, whenever we see multiple functional dependencies with the same determinant (set of attributes on the left side), it’s useful to apply the union property mentioned earlier to simplify them into one.

So now we have that the problematic functional dependencies are {Model}→{Brand, BrandCountry} and {OwnerEmail}→{OwnerName}. We can create a specific table for each of them where its schema is made up of all the attributes of the dependency – that is, all the attributes on both sides. We can formally denote this as the union of both sets of attributes.

As you might guess, by doing this, the primary keys in the new tables will be the attributes of the determinants of these dependencies (which in this case are {Model} and {OwnerEmail}, respectively).

We also need to remove these attributes that we have separated into additional tables from the original Bike table, leaving only the attributes of the determinants of these dependencies and converting them into foreign keys to reference the corresponding primary keys of the new tables. By convention, the attributes that make up the primary key of a table are usually placed first on the left, like Model and OwnerEmail here.

After this process, we can see that all the functional dependencies that were previously problematic are now in new tables where their determinants are now primary keys. This avoids violating the condition imposed by 3NF.

Note that after applying this algorithm, we don’t need to apply it recursively to the tables generated by the decomposition, as there is a guarantee that the resulting schema is already in 3NF after applying this process. In summary, by applying this normal form to our schema using the described algorithm, known as the relational synthesis algorithm, we manage to avoid or minimize the occurrence of redundancies caused by transitive functional dependencies.
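The steps described above can be sketched in a few lines of Python. This is only an illustration of the procedure as applied to the Bike example, not a general 3NF synthesis implementation: it assumes the problematic dependencies are already identified and that no determinant is itself removed by another dependency. The function name and the table labels are hypothetical.

```python
BIKE = ["BikeID", "Model", "Brand", "BrandCountry",
        "PurchasePrice", "OwnerName", "OwnerEmail"]

# Problematic dependencies identified in the text, as (left side, right side).
PROBLEMATIC = [
    (("Model",), ("Brand",)),
    (("Model",), ("BrandCountry",)),
    (("OwnerEmail",), ("OwnerName",)),
]


def decompose_by_dependencies(attributes, problematic_fds):
    """Split one table into a main table plus one table per determinant."""
    # Union property: merge dependencies that share the same determinant.
    merged = {}
    for lhs, rhs in problematic_fds:
        dependents = merged.setdefault(tuple(lhs), [])
        for attr in rhs:
            if attr not in dependents:
                dependents.append(attr)

    # The main table keeps the determinants (now foreign keys) and drops the dependents.
    removed = {attr for deps in merged.values() for attr in deps}
    tables = {"main": [a for a in attributes if a not in removed]}

    # Each new table holds both sides of its dependency, with the determinant as primary key.
    for lhs, deps in merged.items():
        tables["_".join(lhs)] = list(lhs) + deps
    return tables


tables = decompose_by_dependencies(BIKE, PROBLEMATIC)
for name, columns in tables.items():
    print(name, columns)
```

The result has the same three schemas as the normalized tables in the example: the Bike table with Model and OwnerEmail as foreign keys, a table keyed by Model, and a table keyed by OwnerEmail.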

BCNF

The three previous normal forms are the most basic ones we can apply to a schema to eliminate most problems caused by redundancies. But there is another normal form in addition to 3NF that is more restrictive and ensures a better result in this regard, which is BCNF.

As we’ve seen, the normal forms become increasingly restrictive in the conditions they apply. In this case, BCNF stands for Boyce-Codd Normal Form, and it’s characterized by allowing only those functional dependencies X→Y in the tables where it’s true that either the dependency is trivial or X is a superkey of the table.

If these conditions are met, we can formally demonstrate that all the conditions of 3NF must also be automatically met (and so also 2NF and 1NF). We won’t perform this demonstration here, as the important thing is to know how to normalize a schema to adhere to the BCNF. So if we start with a schema like the one we originally had for the unnormalized Bike table, we can apply a specific algorithm to transform it to BCNF.

BikeID	Model	Brand	BrandCountry	PurchasePrice	OwnerName	OwnerEmail
1	Roadster	SpeedX	USA	1200	John Doe	john@example.com
2	TrailBlazer	MountainCo	Canada	1500	Alice Smith	alice@example.com
3	Roadster	SpeedX	USA	1150	Bob Lee	bob@example.org
4	CityCruiser	UrbanRide	USA	800	John Doe	john@example.com
5	EcoCruiser	GreenMotion	Germany	1300	Carol Johnson	carol@example.com

The algorithm to convert to BCNF is very similar to the one we looked at for 3NF. The difference is that here, the decomposition is done in more steps.

First, we need to identify the functional dependencies that prevent compliance with BCNF, which are exactly {Model}→{Brand}, {Model}→{BrandCountry}, and {OwnerEmail}→{OwnerName}. We choose these because, as you can see, {Model} can’t be a superkey, nor can {OwnerEmail} on its own. But in other functional dependencies like {BikeID}→{PurchasePrice}, we see that {BikeID} is indeed a superkey, as it’s actually the primary key of the table. So we don’t include those when applying the algorithm.

Also, keep in mind that a functional dependency X→Y meets the definition of BCNF even if X is not a superkey, as long as it is trivial, meaning that the set of attributes Y is a subset of the set of attributes X.
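The two BCNF conditions translate directly into a small check: for each listed dependency X→Y, either Y is contained in X, or the closure of X covers the whole table. The sketch below applies it to the dependencies of the Bike table. The function names are assumptions made for this example.

```python
BIKE = {"BikeID", "Model", "Brand", "BrandCountry",
        "PurchasePrice", "OwnerName", "OwnerEmail"}
FDS = [
    (("BikeID",), ("Model",)),
    (("BikeID",), ("PurchasePrice",)),
    (("BikeID",), ("OwnerName",)),
    (("BikeID",), ("OwnerEmail",)),
    (("Model",), ("Brand",)),
    (("Model",), ("BrandCountry",)),
    (("OwnerEmail",), ("OwnerName",)),
    (("Model", "Brand"), ("Brand",)),  # trivial: right side is part of the left side
]


def closure(attrs, fds):
    """All attributes that can be derived from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Listed dependencies that are neither trivial nor determined by a superkey."""
    violations = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        superkey = closure(lhs, fds) >= set(attributes)
        if not trivial and not superkey:
            violations.append((lhs, rhs))
    return violations


violations = bcnf_violations(BIKE, FDS)
for lhs, rhs in violations:
    print(lhs, "->", rhs)
```

Only the dependencies with {Model} or {OwnerEmail} on the left side are reported. The ones starting from {BikeID} pass because {BikeID} is a superkey, and the last one passes because it is trivial.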

Now, to simplify the application of the algorithm, we can focus on the determinant of the dependencies that break the normal form – that is, on the set of attributes on the left side, looking for several that have the same determinant. If there are several with the same determinant, as is the case with those that have {Model} on their left side, then we can use the union property of Armstrong's inference rules to simplify them all into one like {Model}→{Brand,BrandCountry}. Here, on the right side, we have gathered all the attributes from the right sides of the dependencies we had.

In this way, we reduce the number of dependencies to consider in the algorithm, which simplifies its execution. Note that this step is not mandatory in this algorithm (nor in the conversion to 3NF), as it’s not part of the algorithm's definition itself, but rather something additional we do to simplify it without affecting its correctness.

Afterward, we end up with the dependencies {Model}→{Brand,BrandCountry} and {OwnerEmail}→{OwnerName}, which guide the decomposition we will perform on the table, similar to the 3NF conversion algorithm. But the main difference is that now we select the dependencies one by one and perform a decomposition for each, not all at once. Each time the table is decomposed, the dependencies and keys change, so we have to do it one by one to ensure that the recombination of the decomposed tables remains lossless.

Although we won't go into detail about why this happens, the important thing to remember is that we use this method because this algorithm doesn’t guarantee the preservation of all functional dependencies due to the conditions that define this normal form. These conditions are restrictive enough that, in certain situations, some dependencies may not be preserved after decomposition.

When selecting one of the dependencies like {Model}→{Brand,BrandCountry} (we can actually choose any of them), we decompose the Bike table guided by this functional dependency. We remove all the attributes on the right side of the dependency from the original table and make the attributes of the determinant (left side) foreign keys. These foreign keys point to the corresponding attributes of a new table where we store all the attributes involved in the dependency (meaning from both sides).

BikeID	Model (FK)	PurchasePrice	OwnerName	OwnerEmail
1	Roadster	1200	John Doe	john@example.com
2	TrailBlazer	1500	Alice Smith	alice@example.com
3	Roadster	1150	Bob Lee	bob@example.org
4	CityCruiser	800	John Doe	john@example.com
5	EcoCruiser	1300	Carol Johnson	carol@example.com

Model	Brand	BrandCountry
Roadster	SpeedX	USA
TrailBlazer	MountainCo	Canada
CityCruiser	UrbanRide	USA
EcoCruiser	GreenMotion	Germany

Formally, if our original table is the set of attributes R, then we keep R-{Brand,BrandCountry}, convert {Model} into the foreign key {Model (FK)} referencing the set of attributes {Model} of the new table generated by the decomposition, whose attributes are given by {Model}U{Brand,BrandCountry}, and whose primary key is the set {Model} that was previously in the determinant of the dependency.

Now, we repeat this process recursively on the resulting tables, as this decomposition has solved the problem caused by the dependency {Model}→{Brand,BrandCountry}. But we still have the dependency {OwnerEmail}→{OwnerName} in the Bike table. So we apply another decomposition step guided by the only remaining dependency that violates the BCNF conditions.

By doing this, we remove the set of attributes {OwnerName} from the Bike table and convert {OwnerEmail} into a foreign key that references the same set {OwnerEmail} but from the new table generated by the decomposition. In this case it’s formed by the attributes {OwnerEmail}U{OwnerName}={OwnerEmail,OwnerName}.

BikeID	Model (FK)	PurchasePrice	OwnerEmail (FK)
1	Roadster	1200	john@example.com
2	TrailBlazer	1500	alice@example.com
3	Roadster	1150	bob@example.org
4	CityCruiser	800	john@example.com
5	EcoCruiser	1300	carol@example.com

Model	Brand	BrandCountry
Roadster	SpeedX	USA
TrailBlazer	MountainCo	Canada
CityCruiser	UrbanRide	USA
EcoCruiser	GreenMotion	Germany

OwnerEmail	OwnerName
john@example.com	John Doe
alice@example.com	Alice Smith
bob@example.org	Bob Lee
carol@example.com	Carol Johnson

As you can see, after these steps, every functional dependency X→Y in the schema is either trivial or has a superkey X as its determinant. This is because when decomposing into tables, we define their primary keys as the determinants of the dependencies that originally did not comply with the normal form.

So after performing these steps for all dependencies that prevent the schema from adhering to BCNF, we end up with a normalized schema that does comply with BCNF. During the process, it’s possible that some of the generated tables still have functional dependencies that violate BCNF, which is why these steps are applied recursively. This means that decomposition is not only done from the original table, but it may also be necessary to decompose a table generated by previous steps, especially in more complex schemas.
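The recursive nature of this process can be sketched as follows. This is a simplified version of the textbook BCNF decomposition, written under a few assumptions: tables are just sets of attribute names, violations are searched by brute force over all attribute subsets (which is only reasonable for very small tables), and the function names are made up for this example.

```python
from itertools import combinations

BIKE = {"BikeID", "Model", "Brand", "BrandCountry",
        "PurchasePrice", "OwnerName", "OwnerEmail"}
FDS = [
    (("BikeID",), ("Model", "PurchasePrice", "OwnerName", "OwnerEmail")),
    (("Model",), ("Brand", "BrandCountry")),
    (("OwnerEmail",), ("OwnerName",)),
]


def closure(attrs, fds):
    """All attributes that can be derived from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(relation, fds):
    """Return (X, attributes of the table determined by X) for a BCNF violation, or None."""
    attrs = sorted(relation)
    for size in range(1, len(attrs)):
        for lhs in combinations(attrs, size):
            determined = closure(lhs, fds) & relation
            # Violation: X determines something new but not the whole table.
            if determined != set(lhs) and determined != relation:
                return set(lhs), determined
    return None


def bcnf_decompose(relation, fds):
    """Split a table one violating dependency at a time, recursing on both parts."""
    relation = frozenset(relation)
    violation = find_violation(relation, fds)
    if violation is None:
        return {relation}
    lhs, determined = violation
    new_table = determined                      # both sides of the dependency
    remaining = lhs | (relation - determined)   # determinant stays as a foreign key
    return bcnf_decompose(new_table, fds) | bcnf_decompose(remaining, fds)


result = bcnf_decompose(BIKE, FDS)
for table in sorted(result, key=sorted):
    print(sorted(table))
```

For the Bike table, the recursion stops after two decomposition steps and produces the same three schemas shown in the tables above.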

In the example we have, the final schema that meets the BCNF conditions is exactly the same as the one we got when transforming it to 3NF. But this is a coincidence. In most practical cases, schemas tend to be more complex, and after converting them to 3NF, they may not comply with BCNF, or it may even be impossible to convert them to BCNF without losing some functional dependencies. That is, converting a schema to 3NF while preserving all its functional dependencies is always guaranteed to be possible, while there is no such guarantee for BCNF.

In short, BCNF is more restrictive than 3NF, which prevents redundancies caused by functional dependencies where a set of attributes that do not uniquely identify the tuples of a table determine the values of another set of attributes. This makes the information of the determining attributes redundant, similar to what happens in 3NF with transitive dependencies.

Also, being more restrictive, it may not be achievable without losing dependencies if a table has multiple overlapping candidate keys, as applying the BCNF decomposition algorithm would break the functional dependencies between attributes of different keys. So by relaxing the conditions of BCNF, we get 3NF, which correctly handles situations where overlapping candidate keys exist, meaning they share some attribute.
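A classic illustration of overlapping candidate keys is an address table, which is a hypothetical example not taken from the Bike schema. We assume {Street, City}→{Zip} and {Zip}→{City}, which gives the candidate keys {Street, City} and {Street, Zip}. The sketch below uses a common formulation of both normal forms: a non-trivial dependency X→A is acceptable in BCNF only if X is a superkey, while 3NF also accepts it when A is a prime attribute.

```python
from itertools import combinations

ADDRESS = ["Street", "City", "Zip"]
FDS = [
    (("Street", "City"), ("Zip",)),
    (("Zip",), ("City",)),
]


def closure(attrs, fds):
    """All attributes that can be derived from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure covers the whole table (brute force)."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = set(combo)
            is_superkey = closure(combo, fds) >= set(attributes)
            if is_superkey and not any(key <= combo for key in keys):
                keys.append(combo)
    return keys


def violations(attributes, fds, allow_prime_rhs):
    """Violations of 3NF (allow_prime_rhs=True) or BCNF (allow_prime_rhs=False)."""
    prime = set().union(*candidate_keys(attributes, fds))
    found = []
    for lhs, rhs in fds:
        for attr in rhs:
            if attr in lhs:
                continue  # trivial dependency
            if closure(lhs, fds) >= set(attributes):
                continue  # determinant is a superkey
            if allow_prime_rhs and attr in prime:
                continue  # tolerated by 3NF only
            found.append((lhs, attr))
    return found


keys = candidate_keys(ADDRESS, FDS)
nf3_violations = violations(ADDRESS, FDS, allow_prime_rhs=True)
bcnf_violations = violations(ADDRESS, FDS, allow_prime_rhs=False)
print(keys, nf3_violations, bcnf_violations)
```

The table is in 3NF but {Zip}→{City} breaks BCNF. Decomposing on that dependency would separate Street and City from Zip, so {Street, City}→{Zip} could no longer be checked within a single table.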

Other normal forms

Besides the normal forms based on functional dependencies, which we have just seen, there are others that eliminate redundancies caused by different types of relationships between attributes or characteristics.

For example, 4NF deals with multivalued dependencies, 5NF with join dependencies, 6NF represents the highest level of normalization of a relational schema, and DKNF (Domain–Key Normal Form) also imposes the condition that all schema constraints must result solely from domain and key definitions, meaning it only allows domain and key constraints.

When to check compliance with each normal form?

Lastly, we’ve seen that each normal form establishes a series of characteristics that a database schema has to follow and the problems it aims to solve.

Practically speaking, the most important normal forms we need to ensure for almost any schema are 1NF and 2NF. In the case of 1NF, most DBMSs guarantee it automatically – but we have to design the conceptual model so that it avoids the appearance of repeating groups that don’t meet the conditions of 1NF. On the other hand, 2NF is essential for identifying tuples in tables, so we should make sure it’s met in a real project database.

Beyond these, if we’re working with a system that processes frequent transactions, as in OLTP, the database schema should also meet the conditions of 3NF, especially when the schema needs to handle queries or undergo updates frequently. This helps resolve these queries and updates as efficiently as possible.

Beyond 3NF, we’ll want to meet BCNF when business rules are very complex. That is, when data has to meet complex constraints, we can help minimize the impact of redundancy issues through BCNF conditions, as they are more restrictive than those of 3NF. Then, if our schema allows multivalued attributes or associations of degree higher than 2, it may be useful to check other types of normal forms like 4NF, 5NF, and so on.

Chapter 8: Query Languages

At this point, you’ve learned about all the elements with which we can organize or structure stored data in relational databases using the relational model. But in practice, we don’t only want to store data, as we could do that with simple files. We also need tools to manipulate and query these data. This means we need to use a query language.

In simple terms, query languages are designed to manipulate and query (or access) the data stored in a database through a set of operations. Querying is the most fundamental operation of all, because if we think about how some of the other operations work (like updating or deleting data, for example), we need to be able to select or query the data in order to perform any operations on them. So basically, almost any modification starts by first identifying which records will be affected by the operation.

The query languages we’ll learn about here are relational, meaning they are created to manipulate and query data in relational databases. Fundamentally, most of them base the logic of operations on table manipulations that result in another table. Then we can continue applying operations to that resulting table. So when we operate on a relational database, we are transforming tables into other tables until we reach a table with data that interests us.

Formal vs practical query languages

There are some query languages known as formal languages, which consist of theoretical definitions where operators or transformations that can be applied to tables are formally defined. This also helps optimize operations on them significantly, as these formal tools allow us to verify equivalences between operations or queries, enabling us to choose the one with the least computational cost among several equivalents.

On the other hand, to apply this to a database, there are practical query languages like SQL, which are implementations of formal query languages adapted to be used on real systems.

Although we call them languages, it's important not to confuse them with general-purpose languages. Query languages, as their name suggests, are dedicated to manipulating and querying data, not performing any type of computation. Examples of formal query languages include:

Relational algebra

This is a formal imperative language, which means that when we program in it, we must think about how to obtain the result we want. In other words, we define a sequence of operations using the language's operators that progressively transform the tables until we reach one or more resulting tables with the data we need.

This idea of a sequence of operations is very similar to how we’d actually plan and execute a query in a practical query language like SQL. This, along with the similarity of formal operators to the statements offered by these practical languages, helps the end user optimize the query, verify its correctness formally, or demonstrate its equivalence with another query that requires fewer computational resources, among other uses.

Example: If we want to get all the ages from a Person table that are greater than 50, we can apply the relational algebra operators π Age ( σ (Age > 50) (Person) ) that we will see later. First, we filter all tuples that meet the condition of having an age >50 using the corresponding operator, and then we apply another operator to the resulting table with those tuples to keep only the ages of those tuples.
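To make the sequence of operations tangible, here is a small Python sketch of the two operators used in this example, selection (σ) and projection (π), over a table stored as a list of dictionaries. The function names and the sample Person rows are assumptions for illustration. Projection removes duplicates because relations are sets of tuples.

```python
# Hypothetical sample rows for the Person table.
PERSON = [
    {"PersonID": 1, "Name": "Ana", "Age": 62},
    {"PersonID": 2, "Name": "Luis", "Age": 35},
    {"PersonID": 3, "Name": "Marta", "Age": 62},
    {"PersonID": 4, "Name": "Ivan", "Age": 51},
]


def select(table, predicate):
    """Selection: keep only the tuples that satisfy the condition."""
    return [row for row in table if predicate(row)]


def project(table, attributes):
    """Projection: keep only the given attributes and drop duplicate tuples."""
    result = []
    for row in table:
        projected = {name: row[name] for name in attributes}
        if projected not in result:
            result.append(projected)
    return result


# Step 1: filter the tuples. Step 2: keep only the Age attribute.
older = select(PERSON, lambda row: row["Age"] > 50)
ages = project(older, ["Age"])
print(ages)
```

Each operator takes a table and returns a table, so the output of `select` can be passed straight into `project`, just as the inner expression feeds the outer one in the formula.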

Relational calculus

Unlike the relational algebra, relational calculus is a declarative language. This means we program by thinking about the properties the result must have, not about which operators to apply to certain tables to achieve it. In other words, we don’t define something similar to an execution plan or sequence of operators to get the result. Instead, we simply declare the properties it must have to meet our needs, and the system itself finds an execution plan that produces exactly what we are looking for.

There are several ways to pose a query or modification on the data. One is based on Tuple Relational Calculus (TRC), where we declare conditions that the attributes of the tuples must meet to be included in our result. The other is Domain Relational Calculus (DRC), which involves using variables over the domains of the attributes to set conditions on them using a methodology similar to first-order logic.

Example: Following the same example as before, in TRC we would have something like { t.Age | Person(t) ∧ t.Age > 50 }, where we declare that the tuples t we want to obtain must belong to the Person table and have a value greater than 50 in the Age attribute. Meanwhile, in DRC we would have { ⟨a⟩ | ∃id ( Person(id, a) ∧ a > 50) }, where we are assuming that the table only stores an ID attribute and an Age attribute, because if more were stored, we would have to use more domain variables. In summary, here the conditions are imposed on the domain variables, which represent the values that the tuples take in their respective attributes.
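Python set comprehensions read much like these calculus expressions, so they make a convenient, if informal, analogy. In the sketch below, the sample data and variable names are assumptions. The first comprehension ranges over tuples as in TRC, and the second ranges over domain values and uses `any` in the role of the existential quantifier, as in DRC.

```python
# Hypothetical Person table with only an ID and an Age attribute.
PERSON = [
    {"ID": 1, "Age": 62},
    {"ID": 2, "Age": 35},
    {"ID": 3, "Age": 62},
    {"ID": 4, "Age": 51},
]

# TRC style: { t.Age | Person(t) AND t.Age > 50 }
ages_trc = {t["Age"] for t in PERSON if t["Age"] > 50}

# DRC style: { <a> | EXISTS id ( Person(id, a) AND a > 50 ) }
person_pairs = {(t["ID"], t["Age"]) for t in PERSON}
id_domain = {pid for pid, _ in person_pairs}
age_domain = {age for _, age in person_pairs}
ages_drc = {
    a
    for a in age_domain
    if any((pid, a) in person_pairs for pid in id_domain) and a > 50
}

print(sorted(ages_trc))
print(sorted(ages_drc))
```

In both cases we only state the property the result must satisfy and never say in which order the table should be scanned or filtered.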

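Lastly, regardless of which of these formal languages we use, they have the same expressive capacity: any query that can be written in relational algebra can also be written in relational calculus, and vice versa. This can be formally demonstrated, as both are grounded in first-order logic.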

Chapter 9: SQL (Structured Query Language)

In addition to formal languages, there are implementations like Structured Query Language (SQL) that are based on the operations of these formal languages. They allow us to manipulate and query data through relational database management systems (DBMS).

Specifically, SQL is a commercially used language with various standards, to which various functionalities have been added over time. Most systems have versions installed that are newer than SQL-92. But that version already includes all the necessary functions to perform the vast majority of operations needed on a database, so it’s the standard we’ll explore here. And while we aim for portable SQL, several examples use features introduced after SQL-92 or PostgreSQL-specific extensions (like BOOLEAN, XML/JSON, UUID, and psql meta-commands).

SQL is a declarative language, where we define what data we want to get, not the exact sequence of operations to get it. The DBMS does the latter internally by translating the statements we write into relational algebra operations, which transform the tables through an execution plan until reaching the final resulting table.
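As a quick illustration, the query from the relational algebra example can be written in SQL and run from Python. This sketch uses SQLite through Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the Person table with an Age column is a hypothetical example. `EXPLAIN QUERY PLAN` is SQLite's way of showing the plan it chose, and its output format varies between versions, so we only print it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INT, Name VARCHAR(255), Age INT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [(1, "Ana", 62), (2, "Luis", 35), (3, "Marta", 62), (4, "Ivan", 51)],
)

# We declare what we want: the distinct ages greater than 50.
query = "SELECT DISTINCT Age FROM Person WHERE Age > 50 ORDER BY Age"
ages = [row[0] for row in conn.execute(query)]
print(ages)

# The engine decides how to obtain it; this prints the plan it selected.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)

conn.close()
```

The statement never mentions how to scan the table or in which order to filter and project. PostgreSQL offers a similar `EXPLAIN` command for inspecting its own execution plans.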

Before proceeding with the elements that make up the SQL language itself, we should distinguish these elements or statements based on their purpose or application area.

On one hand, we have the statements that form the Data Definition Language (DDL), which is a set of statements dedicated to managing the tables in the database (such as their creation, deletion, modification, and so on).

Then we have the Data Control Language (DCL), which is another set of language statements dedicated to controlling user permissions in the database, managing who can read or modify the tables.

On the other hand, we have the Data Manipulation Language (DML). Its statements are oriented towards managing the data contained in the tables, such as insertion, deletion, transformation, or querying.

Apart from these sets of statements or instructions, we can also consider the Transaction Control Language (TCL), which is a series of statements that allow us to manage transactions that occur in the database. Here, we will focus only on the first three sets, which contain the most fundamental instructions.

DDL

To start with SQL, the most basic thing we can do is create, modify, and delete tables in the database. This means that we use instructions that allow us to define our logical design in the DBMS.

Here, we’ll use PostgreSQL as the DBMS, although these examples can be applied to any DBMS that supports the SQL-92 standard, which is what we will focus on.

For these examples, let's assume a domain where people rent bicycles. We have a weak entity called Rental that models when a person has rented a certain bike, and the attribute Duration represents the number of rental days.

As we have seen in previous examples, the primary key of Rental is composed of the rental date and the foreign keys that identify the person who rented a certain bike. This makes Rental weak in identification because both foreign keys are needed to uniquely identify the tuples in that table.

Also, in our domain, we prohibit a person from renting a bike when it’s already being rented by someone else. This means that although everyone can rent as many bikes as they want, they can’t rent one that is already being used by another person or by themselves.

When translating it to the logical level, we simply underline the attributes that are keys and add the foreign key attributes in Rental, as they aren’t represented at the conceptual level. These are underlined because the Rental entity is weak in identification, as we just discussed.

So if we want to implement this logical design in a DBMS like PostgreSQL, we first need to install and configure it on a machine. Then, once we’ve opened it in a terminal, we can navigate it with different commands (keep in mind that, with the exception of SQL CODE;, these are psql meta-commands, that is, client shortcuts, not SQL):

\? Shows help about the DBMS commands.
\! [command] Executes an operating system terminal command.
\h [command] Shows help about the SQL syntax, that is, its statements like \h CREATE TABLE.
\q Quits psql (the client session), which can also be done with exit.
\l or \list Lists all available databases.
\c or \connect Connects to the database with the given name.
\conninfo Shows information about the current database connection (host, port, user, database).
\dn Lists all schemas, which are groupings of elements like tables, views, types, and so on.
\dt Shows the tables of the database we are connected to.
\dv Similar to the previous command, this one shows the views.
\di Lists the indexes.
\df Lists the functions.
\d[+] object Describes the object whose name we provide as an input argument (table, view, function, and so on). With + it includes additional details.
SQL CODE; In the terminal, we can execute SQL code, typically ending with a semicolon.
\timing Used to turn on/off the measurement of query execution time.
\copy table TO 'file.csv' CSV HEADER; Exports table to a CSV file.
\copy table FROM 'file.csv' CSV HEADER; Imports data from a CSV file into a table without emptying it first.
\i path/file.sql Executes an SQL script saved in a file with a .sql extension to avoid having to copy and paste lengthy SQL code into the terminal.

When you’re using a DBMS, you should check its documentation to see if it’s case sensitive or not. In this case, PostgreSQL folds unquoted identifiers to lower-case. Quoted identifiers preserve case and must be matched exactly. SQL keywords aren’t case-sensitive.

CREATE

Once we have entered the DBMS, the first thing we can do is create elements using the CREATE statement. There are many elements we can create, but the most important one for now is DATABASE, which allows us to create a new database.

CREATE DATABASE sampledb;

If we enter this command directly into the terminal, we will create an empty database that we can connect to using the previous PostgreSQL commands. Once we are in the database, we can create the tables of our logical design with the following:

CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE,
    Email VARCHAR(255)
);
CREATE TABLE Bike (
    BikeID INT,
    Model VARCHAR(255),
    Weight DOUBLE PRECISION
);
CREATE TABLE Rental (
    PersonFK INT,
    BikeFK INT,
    RentalDate DATE,
    Duration INT,
    Price DOUBLE PRECISION
); --Important, don't forget the ; after each statement--

As you can see, when creating tables, you need to specify the schema for each one. Don’t confuse this with what PostgreSQL calls a schema at the DBMS level. In PostgreSQL, a schema is a namespace within the database that groups and isolates elements like tables, views, functions, and so on. This makes it easier to organize, manage, and control permissions, and avoid name conflicts.

Here, the table schema refers to the attributes that define it, which is why we declare their names along with their data types, including:

Data Type	Category	Description
`BIT`	Bit String	Fixed-size bit string (for example, `BIT(1)` stores a single 0 or 1).
`SMALLINT`	Exact Numeric	Integer typically from –32,768 to 32,767 (2 bytes).
`INTEGER` / `INT`	Exact Numeric	Integer typically from –2,147,483,648 to 2,147,483,647 (4 bytes).
`BIGINT`	Exact Numeric	Integer typically from –9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (8 bytes).
`DECIMAL(p, s)`	Exact Numeric	Fixed-point number with precision `p` and scale `s` (for example, money).
`NUMERIC(p, s)`	Exact Numeric	Synonym for `DECIMAL`, same fixed-point behavior.
`FLOAT(p)`	Approximate Numeric	Floating-point with precision of at least `p` bits.
`REAL`	Approximate Numeric	Single-precision (typically 24-bit) floating-point.
`DOUBLE PRECISION`	Approximate Numeric	Double-precision (typically 53-bit) floating-point.
`CHAR(n)`	Character String	Fixed-length text of exactly `n` characters (padded if shorter).
`VARCHAR(n)`	Character String	Variable-length text up to `n` characters (no padding).
`CLOB`	Character String	Character Large Object for very long text (for example, articles).
`BINARY(n)`	Binary String	Fixed-length binary data of exactly `n` bytes.
`VARBINARY(n)`	Binary String	Variable-length binary data up to `n` bytes.
`BLOB`	Binary String	Binary Large Object for large binary data (for example, images).
`DATE`	Date/Time	Calendar date in `YYYY-MM-DD` format.
`TIME(p)`	Date/Time	Time of day `HH:MM:SS[.fraction]` with `p` fractional seconds.
`TIMESTAMP(p)`	Date/Time	Combined date and time with fractional seconds precision `p`.
`INTERVAL`	Date/Time	Period of time (for example, `INTERVAL '1-2' YEAR TO MONTH`).
`BOOLEAN`	Boolean	Logical value `TRUE`, `FALSE`, or `UNKNOWN` (NULL).
`XML`	Other Standard	Stores XML document or fragment.
`JSON`	Other Standard	Stores JSON-formatted text for semi-structured data.
`UUID`	Other Standard	128-bit universally unique identifier (for example, `550e8400-e29b-41d4-a716-446655440000`).

In this list, some types such as BLOB seem, at first glance, to allow storing an arbitrary amount of data in a single cell, since a BLOB can be as large as we want. This might look like a repeating group problem. But a column of type BLOB doesn't store multiple BLOBs in the same cell, only one. The DBMS is then responsible for managing the disk storage of this type of data efficiently.

In other words, we can see this as if the BLOB itself is not stored in a table cell, but rather a memory pointer is stored pointing to another memory area where the entire BLOB is stored (although the exact technique used heavily depends on the DBMS).

Also, if we look at other data types like VARCHAR for storing text, in PostgreSQL you can use VARCHAR with or without a length (or TEXT). In standard SQL, VARCHAR(n) requires a length.

Besides creating databases and tables, we might want to create a custom data type like ageDataType or colorDataType, which we can do using CREATE DOMAIN.

CREATE DOMAIN ageDataType AS INTEGER CHECK (VALUE >= 0 AND VALUE <= 150);
CREATE DOMAIN colorDataType AS VARCHAR(8) CHECK (VALUE IN ('red', 'green', 'BlUe'));

Here we just created new data types called ageDataType and colorDataType, where the first one is used to represent ages and the other colors. We could do this by imposing constraints on the values that columns can take, rather than defining a new data type, or rather a domain. But if there are many attributes with the same constraints on their domain, meaning they have the same domain like color or age, then it makes sense to define a custom one.

We mainly do this using a CHECK clause, which, as we'll see, is used to define constraints. In this case, the constraint applies to the values of the base data type we chose when creating the domain (INTEGER and VARCHAR(8) above, respectively).

ALTER

In addition to creating elements like tables or databases, we can also modify them using the ALTER statement. For example, if we forgot to add the AuxEmail column to the Person table, we can use the following statement to add it after the table has been created.

ALTER TABLE Person
  ADD COLUMN AuxEmail VARCHAR(255);

As you can see, we first specify the table where we want to add the new column, and then we specify the name and type of that attribute. But it's important to consider the value assigned to its cells when this table extension occurs.

By default, columns in SQL allow NULL values, so if the table already contains rows, the DBMS fills the new column with NULL in each of them. But if we want to assign a custom default value to the cells of the new column instead of NULL when there is data already inserted in the table, we can add the default value property to the column we are adding:

ALTER TABLE Person
  ADD COLUMN AuxEmail VARCHAR(255) DEFAULT 'noEmail@gmail.com';

This way, when we insert a tuple and leave the AuxEmail value undefined, the DBMS will automatically fill the cell for that attribute with its default value. This also applies when adding the column itself when there is already data in the table. We can also remove this default value property using:

ALTER TABLE Person
  ALTER COLUMN AuxEmail DROP DEFAULT;
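To see how a default behaves for rows that already exist, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database, which is an assumption made so the example runs without a PostgreSQL server (SQLite's dialect differs from PostgreSQL's in places). The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE Person (PersonID INT, Name VARCHAR(255))")
conn.execute("INSERT INTO Person VALUES (1, 'Alice')")  # row that exists before the ALTER

# Add the column with a default; the pre-existing row reads back with that default.
conn.execute(
    "ALTER TABLE Person ADD COLUMN AuxEmail VARCHAR(255) DEFAULT 'noEmail@gmail.com'"
)

# A later insert that omits AuxEmail also receives the default.
conn.execute("INSERT INTO Person (PersonID, Name) VALUES (2, 'Bob')")

aux_emails = [
    row[0] for row in conn.execute("SELECT AuxEmail FROM Person ORDER BY PersonID")
]
print(aux_emails)
```

Both the old row and the new one end up with the default address, which matches the behavior described above.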

Similarly, ALTER also allows us to remove an attribute:

ALTER TABLE Person
  DROP COLUMN Email;

We can also change the data type of an attribute (PostgreSQL syntax):

ALTER TABLE Person
  ALTER COLUMN Name TYPE CHAR(25);

And rename elements, among many other actions:

ALTER TABLE Person
  RENAME COLUMN Birth TO BirthDate; --Renames the column Birth of the table Person--
ALTER TABLE Person
  RENAME TO People; --Renames the table Person--
ALTER DATABASE sampledb
  RENAME TO otherName; --Renames the database sampledb--

In short, ALTER allows us to modify elements that have already been created in the database without deleting and recreating them with the changes. Otherwise, we would have to export the data stored in those elements and reinsert it into the new schemas, which would be inefficient.

DROP

We can remove elements with the DROP statement. Its operation is very simple, as we just need to specify the name of the element to remove, such as the database we just created:

DROP DATABASE sampledb;

When executing this statement, SQL tries to delete the database, although we might get an error if we are connected to it. Besides simply deleting it, we can check if it exists before trying to delete it with:

DROP DATABASE IF EXISTS sampledb;

Similarly, we can have a schema like this example where there are foreign keys in Rental that reference or point to other tables like Bike and Person.

If we delete Rental, nothing would happen since no foreign key points to Rental. But if we want to delete one of the other two tables, a referential integrity problem will arise. For example, deleting Bike would leave the foreign key reference in Rental that points to Bike orphaned. So to delete Bike and all the constraints or SQL elements that depend on Bike, meaning those that reference it, we can use CASCADE:

DROP TABLE Bike CASCADE;

By doing this, not only would Bike be deleted, but also the foreign key constraint in Rental that we haven't introduced yet, as well as all others that point to Bike.

It's important to note that the CASCADE in a DROP statement is not related to the CASCADE we can define in a CREATE statement to set deletion or update policies. If, instead of deleting an entire table, we only delete certain tuples, we might end up with a situation where a tuple has a foreign key value that doesn't correspond to any tuple in the referenced table because we deleted it. For those cases, we can establish deletion policies where the tuples pointing to the deleted one are also removed, or similar actions.

INSERT

To insert tuples into tables, we use the INSERT statement, where we specify the name of the table where we want to insert, as well as the attributes of its schema and the values to insert into the new tuple.

INSERT INTO Person (PersonID, Name, Birth, Email)
VALUES (5, 'Carol Johnson', '1985-07-15', 'carol@example.com');

INSERT INTO Bike (BikeID, Model, Weight)
VALUES (5, 'EcoCruiser', 14.2);

INSERT INTO Rental (PersonFK, BikeFK, RentalDate, Duration, Price)
VALUES (5, 5, '2025-07-10', 3, 25.50);

But, if we don't have some of the values for the tuple, we can omit them by inserting values only for the attributes we do have. We can even insert a tuple with DEFAULT values for certain attributes. But this only works if a default value was defined when creating the table or added with an ALTER statement.

INSERT INTO Bike (BikeID, Model)
VALUES (6, 'Speedster');
INSERT INTO Bike (BikeID, Model, Weight)
VALUES (7, 'Commuter', DEFAULT);
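As a rough illustration of what happens to omitted attributes, the sketch below uses Python's built-in sqlite3 module (an assumption, chosen so it runs without a database server). The default weight of 12.5 is hypothetical. SQLite doesn't accept the DEFAULT keyword inside VALUES, so the sketch relies on omitting the column instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Bike (BikeID INT, Model VARCHAR(255), Weight DOUBLE PRECISION DEFAULT 12.5)"
)

# Weight omitted: the column has a default, so the default is stored.
conn.execute("INSERT INTO Bike (BikeID, Model) VALUES (6, 'Speedster')")
# Model omitted: the column has no default, so NULL is stored.
conn.execute("INSERT INTO Bike (BikeID, Weight) VALUES (7, 14.2)")

rows = conn.execute("SELECT BikeID, Model, Weight FROM Bike ORDER BY BikeID").fetchall()
print(rows)
```

The second row comes back with None for Model, which is how Python's sqlite3 module represents NULL.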

DELETE

To delete tuples, you can use DELETE, which at a logical level is very similar to the SELECT clause that we will see later (that’s used to retrieve data in response to database queries).

To use DELETE, we impose a set of conditions that the tuples in the table must meet to be selected. Those tuples that meet the conditions are then deleted by DELETE.

DELETE FROM Rental
WHERE PersonFK = 5
  AND BikeFK   = 5
  AND RentalDate = '2025-07-10';

For example, here all tuples with a value of 5 in PersonFK and BikeFK and a rental date of 2025-07-10 will be deleted.
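The following sketch runs the same DELETE against a few hypothetical rentals, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made for runnability). It shows that only the tuples matching every condition are removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Rental (PersonFK INT, BikeFK INT, RentalDate DATE, Duration INT, Price DOUBLE PRECISION)"
)
conn.executemany(
    "INSERT INTO Rental VALUES (?, ?, ?, ?, ?)",
    [
        (5, 5, "2025-07-10", 3, 25.50),  # matches all three conditions
        (5, 5, "2025-07-11", 1, 9.00),   # different date
        (4, 5, "2025-07-10", 2, 18.00),  # different person
    ],
)

cur = conn.execute(
    "DELETE FROM Rental WHERE PersonFK = 5 AND BikeFK = 5 AND RentalDate = '2025-07-10'"
)
deleted = cur.rowcount  # number of tuples removed by the DELETE

remaining = conn.execute(
    "SELECT PersonFK, BikeFK, RentalDate FROM Rental ORDER BY RentalDate, PersonFK"
).fetchall()
print(deleted, remaining)
```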

UPDATE

Similarly, we can update the values of tuples using UPDATE. We first select the tuples that will be affected by the change we want to make by imposing conditions on them, and then we use SET to change one of their attribute values or apply a transformation.

UPDATE Bike
SET Weight = 13.8
WHERE BikeID = 5;

UPDATE Person
SET Email = 'carol.johnson@example.com'
WHERE PersonID = 5;
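SET can also compute the new value from the old one. Here is a small sketch of that kind of transformation, again using Python's built-in sqlite3 module as an assumed stand-in for PostgreSQL, with a simplified Rental table and hypothetical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (PersonFK INT, BikeFK INT, Duration INT)")
conn.executemany(
    "INSERT INTO Rental VALUES (?, ?, ?)",
    [(5, 5, 3), (5, 6, 1), (4, 5, 2)],
)

# Extend every rental of person 5 by one day; other tuples are left untouched.
cur = conn.execute("UPDATE Rental SET Duration = Duration + 1 WHERE PersonFK = 5")
updated = cur.rowcount  # number of tuples the WHERE condition selected

rentals = conn.execute(
    "SELECT PersonFK, BikeFK, Duration FROM Rental ORDER BY PersonFK, BikeFK"
).fetchall()
print(updated, rentals)
```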

Constraints implementation

Given these DDL statements, we can create different elements where data is stored. But as we’ve seen, in most domains we model, we need to implement a series of constraints to ensure that the data adheres to the requirements of our problem. (This is in addition to the integrity constraints inherent in the relational model, such as the existence of keys.)

Although this distinction is not as strong in SQL, most constraints we impose help ensure data integrity, whether they refer to the relational model's own rules or the business rules of our problem.

To implement constraints in SQL, we can start with the simplest ones: constraints that affect a single table. These are usually implemented using a CHECK clause within another statement like CREATE TABLE, where we specify a condition that every tuple in the table must meet. The DBMS verifies it whenever we insert or update tuples.

CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE,
    Email VARCHAR(255),
    CHECK (Birth <= CURRENT_DATE)
);
CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE,
    Email VARCHAR(255),
    CONSTRAINT BirthConstraint CHECK (Birth <= CURRENT_DATE)
);

For example, we can assume that a person's birth date is always validated before being saved in the database. If a user enters an invalid date in the application layer, the application itself will generate an error and prevent saving an invalid date in the database. But it's still a good idea to add this type of constraint to ensure data integrity.

In this case, a person can’t be born on a date later than the current date, which we can get in SQL with CURRENT_DATE. So, we define a constraint where the Birth attribute must be less than or equal to the current date for all rows in the Person table.

These constraints are usually defined below the attribute declaration, and we can also give them a specific name using CONSTRAINT. This declares the constraint and assigns it a name we can use to identify it. We can add this name not only to a CHECK constraint but also to any similar declaration, such as PRIMARY KEY, FOREIGN KEY, or UNIQUE, among others.
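To watch a named CHECK constraint reject a row, here is a minimal sketch using Python's built-in sqlite3 module (an assumption made so it runs without a PostgreSQL server). It uses a simple numeric rule, Weight > 0, instead of the date rule so the outcome doesn't depend on the current date. The rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Bike (
        BikeID INT,
        Model VARCHAR(255),
        Weight DOUBLE PRECISION,
        CONSTRAINT ConstraintBikeWeight CHECK (Weight > 0)
    )
    """
)
conn.execute("INSERT INTO Bike VALUES (1, 'EcoCruiser', 14.2)")  # satisfies the CHECK

try:
    conn.execute("INSERT INTO Bike VALUES (2, 'Ghost', -1.0)")  # violates Weight > 0
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

# A NULL weight is accepted: a CHECK only fails when its condition is false,
# and NULL > 0 evaluates to unknown rather than false.
conn.execute("INSERT INTO Bike VALUES (3, 'Unknown', NULL)")

bike_count = conn.execute("SELECT COUNT(*) FROM Bike").fetchone()[0]
print(rejected, bike_count)
```

The NULL case is worth remembering: if a column must always hold a value, a CHECK alone isn't enough.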

Continuing with constraints on a specific table, if we need to ensure that an attribute can’t take NULL values, we can use either a CHECK or a NOT NULL along with declaring the corresponding attribute (to which we can also give a specific name using CONSTRAINT).

CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE NOT NULL,
    Email VARCHAR(255)
);
CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE CONSTRAINT BirthNotNull NOT NULL,
    Email VARCHAR(255)
);
CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255),
    Birth DATE,
    Email VARCHAR(255),
    CONSTRAINT BirthNotNull CHECK (Birth IS NOT NULL)
);

These three ways are equivalent if we want to require people to save their birth date in the database, preventing NULL values in the respective column.

The main difference between using CHECK and putting NOT NULL next to the attribute declaration is that with CHECK we have to write, in parentheses, a boolean condition like the ones we use in the WHERE of a SQL query. That condition can only refer to the attributes of the table we are working on.

In contrast, NOT NULL next to an attribute is an implicit way to indicate this restriction. Note that CHECK constraints are per-row boolean expressions – they can’t contain subqueries, aggregates, or window functions in standard SQL and most DBMS. For cross-table conditions, use triggers (portable) rather than CHECK.

After understanding what CHECK involves, we can see how almost any domain restriction on attributes can be specified in one of these statements. However, SQL offers us more functionalities, such as setting a default value for attributes with DEFAULT.

CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255) DEFAULT 'No name',
    Birth DATE,
    Email VARCHAR(255)
);
CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255) CONSTRAINT NameDefaultValue DEFAULT 'No name', --We can name the default value too--
    Birth DATE,
    Email VARCHAR(255)
);

As we’ve seen before, we use DEFAULT so that when a tuple is inserted with a missing value for a certain attribute, if that attribute has a default value defined, the tuple will be inserted with that default value in the corresponding attribute instead of NULL.

This is important because if we include the NOT NULL restriction and don’t define a default value for an attribute, the DBMS may generate an error here. This also applies when a new attribute is added to the table using ALTER, where we can define a default value at the same time.
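The interaction between NOT NULL and DEFAULT can be sketched as follows, using Python's built-in sqlite3 module (an assumption, chosen so the example is runnable as-is). The helper is_rejected and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Person (
        PersonID INT,
        Name VARCHAR(255) NOT NULL DEFAULT 'No name',
        Birth DATE NOT NULL
    )
    """
)

def is_rejected(sql: str) -> bool:
    """Run one INSERT and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Name omitted: the default fills it in, so NOT NULL is satisfied.
default_rejected = is_rejected(
    "INSERT INTO Person (PersonID, Birth) VALUES (1, '1985-07-15')"
)
# Birth omitted: there is no default, so the value would be NULL and the row is rejected.
missing_birth_rejected = is_rejected(
    "INSERT INTO Person (PersonID, Name) VALUES (2, 'Carol')"
)
# Name explicitly NULL: the default only applies to omitted values, so this is rejected too.
explicit_null_rejected = is_rejected(
    "INSERT INTO Person (PersonID, Name, Birth) VALUES (3, NULL, '1990-01-01')"
)

stored_names = [row[0] for row in conn.execute("SELECT Name FROM Person")]
print(default_rejected, missing_birth_rejected, explicit_null_rejected, stored_names)
```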

CREATE TABLE Bike (
    BikeID INT,
    Model VARCHAR(255),
    Weight DOUBLE PRECISION,
    CONSTRAINT ModelValues CHECK (Model IN ('Model1', 'Model2', 'Model3'))
);

As a curiosity, if we want to explicitly define the possible values an attribute can take, we can use a CHECK like the one above. This is the same expression we use when creating a new domain with CREATE DOMAIN. We can then assign it as the data type to the Model attribute. So we have the option to create a custom domain for an attribute or define a constraint with CHECK to model its domain (although in most cases, it's better to use CREATE DOMAIN for better maintainability).

Continuing with constraints that affect a single table, we also have those more related to data integrity concerning the relational model. For example, to uniquely identify the tuples of a table, we have candidate keys in the relational model, which we can declare in SQL using UNIQUE in combination with NOT NULL.

CREATE TABLE Bike (
    BikeID INT,
    Model VARCHAR(255),
    Weight DOUBLE PRECISION,
    UNIQUE (Model)
);

For example, if we assume that in our problem there aren't multiple different bikes with the same model name, then we can use Model as a candidate key to uniquely identify all the tuples in the table.

So to explicitly declare that Model can serve for tuple identification, we use UNIQUE. This indicates that the values this attribute takes must be different in all the tuples of the table.

We can also apply this to more than one attribute, where UNIQUE would determine that the combination of values of all those attributes included in the constraint must be different in all the tuples of the table.

The main usefulness of UNIQUE is that it ensures certain attributes meet the definition of a candidate key. So, if we insert multiple tuples with the same repeated values in attributes that form a candidate key defined with UNIQUE, the DBMS will generate an error. But beyond this, we don’t have to define all candidate keys that exist unless the domain or problem requirements force us to do so.
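Here is a short sketch of both forms of UNIQUE, on one attribute and on a combination of attributes. It uses Python's built-in sqlite3 module as an assumed stand-in for PostgreSQL. The composite constraint on Rental and the helper try_insert are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Bike (BikeID INT, Model VARCHAR(255), Weight DOUBLE PRECISION, UNIQUE (Model))"
)
conn.execute(
    "CREATE TABLE Rental (PersonFK INT, BikeFK INT, RentalDate DATE, "
    "UNIQUE (PersonFK, BikeFK, RentalDate))"
)

def try_insert(sql: str) -> bool:
    """Return True if the INSERT succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

single = [
    try_insert("INSERT INTO Bike VALUES (1, 'EcoCruiser', 14.2)"),
    try_insert("INSERT INTO Bike VALUES (2, 'EcoCruiser', 13.0)"),  # repeated Model
]
composite = [
    try_insert("INSERT INTO Rental VALUES (5, 5, '2025-07-10')"),
    try_insert("INSERT INTO Rental VALUES (5, 5, '2025-07-11')"),  # only the date differs: allowed
    try_insert("INSERT INTO Rental VALUES (5, 5, '2025-07-10')"),  # whole combination repeated
]
print(single, composite)
```

With the composite constraint, individual values may repeat as long as the full combination doesn't.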

Usually, we’d just declare the primary key of a table with PRIMARY KEY, without also having to declare the remaining candidate keys with UNIQUE.

CREATE TABLE Person (
    PersonID INT,
    Name VARCHAR(255) DEFAULT 'No name',
    Birth DATE,
    Email VARCHAR(255),
    CONSTRAINT PersonPK PRIMARY KEY (PersonID) --The constraint is named PersonPK--
);

When we introduce the primary key constraint on a set of attributes, we are implicitly declaring that these attributes can’t contain NULL values, and the combinations of values they take must all be unique in the table's tuples (just like with UNIQUE).

It’s as if we’re implicitly defining UNIQUE and NOT NULL on the attributes that form the primary key, making sure that they meet all the necessary conditions to truly form a primary key (which can also be referenced by a foreign key).

To declare the existence of foreign keys, we use FOREIGN KEY on the attributes that constitute it.

CREATE TABLE Rental (
    PersonFK INT,
    BikeFK INT,
    RentalDate DATE,
    Duration INT,
    Price DOUBLE PRECISION,
    CONSTRAINT RentalPK PRIMARY KEY (PersonFK, BikeFK, RentalDate),
    CONSTRAINT FK_Rental_Person FOREIGN KEY (PersonFK) REFERENCES Person(PersonID) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT FK_Rental_Bike FOREIGN KEY (BikeFK) REFERENCES Bike(BikeID) ON DELETE SET NULL ON UPDATE CASCADE
);

As you can see, declaring foreign keys is very similar to declaring primary keys, except that we use the FOREIGN KEY clause. But for the DBMS to ensure referential integrity in the database, we need to define what happens when inserting, updating, or deleting tuples from tables that are referenced by foreign keys.

To understand this, the simplest case is when a tuple is inserted into a table like Rental, where values must be provided for its foreign keys. By default, SQL allows a foreign key to take NULL values, meaning NULL satisfies the foreign key constraint. But in this case, we should add a NOT NULL constraint on these attributes because, in the conceptual model, a Rental entity was related to at least one Bike entity and one Person entity, as indicated by the minimum cardinality.

So if we insert a tuple with a NULL value in the foreign key attribute and we had the NOT NULL constraint, we’d receive an error. On the other hand, if we insert a value that is not NULL but doesn’t exist in the attribute of the table we are referencing, then the DBMS won’t allow that insertion either – as that foreign key won’t be referencing an existing tuple in the table it points to.
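These three insertion cases can be sketched with Python's built-in sqlite3 module (an assumption made for runnability; note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON). The tables are simplified, and the RentalID column and the helper try_insert are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off unless this is set
conn.execute("CREATE TABLE Bike (BikeID INT PRIMARY KEY, Model VARCHAR(255))")
conn.execute(
    "CREATE TABLE Rental (RentalID INT PRIMARY KEY, BikeFK INT, "
    "FOREIGN KEY (BikeFK) REFERENCES Bike(BikeID))"
)
conn.execute("INSERT INTO Bike VALUES (5, 'EcoCruiser')")

def try_insert(sql: str) -> bool:
    """Return True if the INSERT succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("INSERT INTO Rental VALUES (1, 5)"),     # references an existing bike
    try_insert("INSERT INTO Rental VALUES (2, 99)"),    # no bike 99: dangling reference
    try_insert("INSERT INTO Rental VALUES (3, NULL)"),  # NULL satisfies the foreign key
]
print(results)
```

The third insert succeeds only because BikeFK has no NOT NULL constraint in this sketch.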

To indicate where it points, we use REFERENCES in the FOREIGN KEY constraint itself, where the table and the attribute the foreign key should point to are specified. A foreign key must reference a candidate key in the parent table—either the primary key or another column (or column set) declared UNIQUE and NOT NULL. The referencing and referenced columns must match in number, order, and compatible data types.

Afterward, if we try to delete a tuple from the Bike or Person table that is referenced by a tuple in the Rental table, we can set several deletion policies.

First, by deleting the tuple from Bike or Person, we would have a tuple in Rental that does not reference any valid tuple from another table, creating a referential integrity problem due to an orphaned reference.

One option to solve this is to also delete the tuple in the Rental table and recursively delete the tuples that point to the tuples being removed by this process. We declare this with ON DELETE CASCADE. But if we want to keep the tuple in Rental, instead of deleting it, we can assign a particular value to the foreign key that no longer points to any valid tuple (such as NULL or the column's default value). We declare this with ON DELETE SET [value], where [value] can be NULL or DEFAULT.

But we need to be careful with NULL, because if the foreign key attribute is also part of the primary key, as in this example, it will conflict with the implicit PRIMARY KEY constraint that prevents it from being NULL.

We aren’t required to declare ON DELETE in these constraints, so if we don't, the default action (called NO ACTION) will be executed. This means rejecting the deletion of the tuple in Bike or Person, and showing an error to the user.
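To compare ON DELETE CASCADE with the default NO ACTION behavior, here is a minimal sketch using Python's built-in sqlite3 module (an assumed stand-in for PostgreSQL, with foreign key enforcement switched on explicitly). The tables are reduced to their key columns and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce foreign keys
conn.execute("CREATE TABLE Person (PersonID INT PRIMARY KEY)")
conn.execute("CREATE TABLE Bike (BikeID INT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE Rental (
        PersonFK INT NOT NULL,
        BikeFK INT NOT NULL,
        FOREIGN KEY (PersonFK) REFERENCES Person(PersonID) ON DELETE CASCADE,
        FOREIGN KEY (BikeFK) REFERENCES Bike(BikeID)  -- no ON DELETE: NO ACTION applies
    )
    """
)
conn.executemany("INSERT INTO Person VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO Bike VALUES (10)")
conn.executemany("INSERT INTO Rental VALUES (?, ?)", [(1, 10), (2, 10)])

# NO ACTION: deleting a bike that rentals still reference is rejected.
try:
    conn.execute("DELETE FROM Bike WHERE BikeID = 10")
    bike_delete_rejected = False
except sqlite3.IntegrityError:
    bike_delete_rejected = True

# CASCADE: deleting person 1 also removes the rental that points to them.
conn.execute("DELETE FROM Person WHERE PersonID = 1")
rentals_after = conn.execute("SELECT PersonFK, BikeFK FROM Rental").fetchall()
print(bike_delete_rejected, rentals_after)
```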

Similarly, this issue can also occur when updating a tuple, so the same ON DELETE mechanism applies to tuple modifications, which we can define with ON UPDATE.

Finally, a foreign key can reference the same table it’s in, and using the CASCADE policy is completely valid. This is because it recursively deletes tuples that cause referential integrity issues, not entire tables. Even if there are tuples that reference themselves, this poses no problem, as the DBMS can handle these edge cases.

These are the basic constraints that we can apply to a single table, although there are more advanced tools that help ensure data integrity or even optimize its manipulation and querying.

But there are some constraints that don’t only affect one table in the schema but can involve conditions on multiple tables. To implement them, we have several options, such as assertions, which are conditions very similar to CHECK that are verified every time any of the tables involved in the condition are modified.

CREATE ASSERTION RentalEmailConstraint CHECK (
    NOT EXISTS (
        SELECT 1
        FROM Rental r
            JOIN Person p ON r.PersonFK = p.PersonID
        WHERE p.Email IS NULL
    )
);

For example, here we create an assertion that checks we haven’t rented a bike to any person who doesn't have an Email defined. For this type of constraint, we usually use complete SQL queries within the CHECK, as they are more complex to model than the CHECK constraints we place on a single table.

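In principle, we could also express this in a table CHECK constraint instead of an assertion, but as noted earlier, most DBMSs don't allow subqueries inside CHECK. Keep in mind, too, that although CREATE ASSERTION is part of the SQL standard, PostgreSQL and most other popular DBMSs don't implement it.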

Lastly, besides assertions, we can implement constraints on multiple tables with triggers, which are statements composed of an event, a condition, and an action. When the defined event occurs, the condition that constitutes the constraint is checked, and depending on whether it’s true or false, a certain action is executed or not on the database.
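As a rough sketch of the event, condition, and action structure, the example below enforces the email rule with a trigger. It uses Python's built-in sqlite3 module and SQLite's trigger syntax, which is an assumption made for runnability: PostgreSQL triggers are written differently (they call a separate trigger function). The trigger name and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INT PRIMARY KEY, Email VARCHAR(255))")
conn.execute("CREATE TABLE Rental (PersonFK INT, BikeFK INT)")
conn.execute(
    """
    CREATE TRIGGER RentalRequiresEmail
    BEFORE INSERT ON Rental                                             -- event
    WHEN (SELECT Email FROM Person WHERE PersonID = NEW.PersonFK) IS NULL  -- condition
    BEGIN
        SELECT RAISE(ABORT, 'renter has no email');                     -- action
    END
    """
)
conn.execute("INSERT INTO Person VALUES (1, 'carol@example.com')")
conn.execute("INSERT INTO Person VALUES (2, NULL)")

def try_insert(sql: str) -> bool:
    """Return True if the INSERT succeeds, False if the trigger aborts it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False

results = [
    try_insert("INSERT INTO Rental VALUES (1, 5)"),  # person 1 has an email
    try_insert("INSERT INTO Rental VALUES (2, 5)"),  # person 2 has none: aborted
]
print(results)
```

This only guards insertions into Rental. To cover everything the assertion expresses, we would also need triggers for the other modifications involved, such as setting a renter's Email to NULL afterwards.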

Now that we know how to set constraints on a relational schema, we can refine the logical implementation of our example by adding the necessary constraints, resulting in the following code:

DROP TABLE IF EXISTS Rental;
DROP TABLE IF EXISTS Bike;
DROP TABLE IF EXISTS Person;
CREATE TABLE Person (
    PersonID INT NOT NULL,
    Name VARCHAR(50) NOT NULL DEFAULT 'No name',
    Birth DATE NOT NULL,
    Email VARCHAR(50) NOT NULL UNIQUE,
    CONSTRAINT PersonPK PRIMARY KEY (PersonID),
    CONSTRAINT ConstraintPersonBirth CHECK (Birth <= CURRENT_DATE)
);
CREATE TABLE Bike (
    BikeID INT NOT NULL,
    Model VARCHAR(50) NOT NULL,
    Weight DOUBLE PRECISION NOT NULL,
    --This constraint is redundant due to the definition of PRIMARY KEY constraint--
    UNIQUE (BikeID),
    CONSTRAINT BikePK PRIMARY KEY (BikeID),
    CONSTRAINT ConstraintBikeWeight CHECK (Weight > 0) --Weight must be positive--
);
CREATE TABLE Rental (
    PersonFK INT NOT NULL,
    BikeFK INT NOT NULL,
    RentalDate DATE NOT NULL,
    Duration INT NOT NULL,
    Price DOUBLE PRECISION NOT NULL,
    CONSTRAINT RentalPK PRIMARY KEY (PersonFK, BikeFK, RentalDate),
    CONSTRAINT FKRentalPerson FOREIGN KEY (PersonFK) REFERENCES Person(PersonID) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT FKRentalBike FOREIGN KEY (BikeFK) REFERENCES Bike(BikeID) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT ConstraintRentalDuration CHECK (Duration > 0),
    CONSTRAINT ConstraintRentalPrice CHECK (Price >= 0),
    CONSTRAINT ConstraintRentalDate CHECK (RentalDate <= CURRENT_DATE)
);

As you can see, in the creation script, we have added some DROP statements to remove the tables before creating the final ones with all the correct constraints. We usually do this when there is no data in the tables, as a DROP would delete everything stored in them. Also, when we delete several tables that are related through foreign keys, we want to avoid the DBMS generating referential integrity errors. Because of this, it’s common to first delete the tables that do not have any foreign keys pointing to them, and then continue with the rest.

DCL

Now that you’ve seen how to define the basic elements of the relational model in a DBMS with SQL, we should consider the security with which these operations are performed (as well as those we’ll see in DML). After all, not all database users may have good intentions when operating on the DBMS.

So in DCL, we can define a series of statements for managing users, roles, and permissions, which establish who can do what on the database.

User roles

The first thing we can do is create roles. As the name suggests, a role is assigned to a database user and determines what they can or can’t do with the database. Basically, a role functions as a set of permissions.

By default, a PostgreSQL role can’t log in unless it’s created WITH LOGIN (or via CREATE USER). So to simplify this section, we can assume that when a user wants access to the database, it’s enough to give them a role with login permission (although these mechanisms may depend on the DBMS we are using).

CREATE ROLE user1 WITH LOGIN PASSWORD 'userPassword';
DROP ROLE user1; --If we want to remove the role--

This way, user1 can authenticate to the DBMS using the password we define here. In PostgreSQL, roles can typically connect to a database by default because CONNECT is granted to PUBLIC. To restrict access, you first REVOKE CONNECT ON DATABASE ... FROM PUBLIC and then GRANT CONNECT selectively:

GRANT CONNECT ON DATABASE sampledb TO user1;

By default, the user won't be able to do anything other than connect. So by using GRANT in the following way, we can give them the permissions needed to execute statements on certain elements of the database.

GRANT SELECT, UPDATE
ON TABLE Rental
TO user1;

For example, here we are giving permission to execute the SELECT and UPDATE statements on the Rental table.

Or if we want to give all possible permissions to do anything on an element, we can use ALL, like this:

GRANT ALL PRIVILEGES
ON TABLE Bike
TO user1;

Or, if we want to be more precise, we can even control which columns of a table certain statements can be executed on:

GRANT SELECT (PersonID, Name)
ON TABLE Person
TO user1;

Similarly, if instead of using GRANT we use REVOKE, we remove certain permissions that the role has:

REVOKE ALL PRIVILEGES
ON TABLE Bike
FROM user1;

This is just a part of what can be controlled for a role in a database using DCL statements, as security is a critical aspect.

DML

After setting up user permissions to control what a user can do in the database, we have enough elements to start manipulating and querying the data. So now it’s time to introduce the set of statements that make up the DML of SQL, which mainly handles the management of stored data.

CRUD

To understand data management, you should think about how you’ll operate on the data. This is guided by the needs of the user or end client. From this arises the CRUD pattern (Create, Retrieve, Update, and Delete), which defines the fundamental operations performed on the data of a real project and that the database must support.

As you can see from its acronym, at the most fundamental level in our database, new data can be inserted (Create), queried once stored (Retrieve), and can also be modified (Update) or deleted (Delete) when they are no longer useful for the domain.

Of all these operations, the most important one is querying the data. If we think about it, any service provided to the end user can be reduced to a query on stored data.

For example, simply viewing saved information means it has to be retrieved through a query. Really any metric that needs to be calculated on the data also involves querying and then computing on it. So even though DML involves a wide variety of statements with diverse objectives, we will focus here on those that form the fundamental blocks for performing queries – CRUD.

When working with relational databases, there’s a certain mechanism that queries follow to obtain the data we request from the DBMS.

First, we have a series of tables where information is stored in tuples. These we will call base tables, meaning the ones we initially create with CREATE TABLE. We don’t modify these base tables directly – instead, we apply a series of operations to them, many from relational algebra, resulting in intermediate tables. These intermediate tables pass through the sequence of operators until we reach a final table with the results we asked for.

In other words, a query consists of obtaining a resulting table with data from a set of base tables.

From a formal perspective, this is sometimes interpreted in relational algebra as if the query were a relational tree where the leaf nodes are the base tables. As operators, which can be either unary or binary, are applied, new intermediate tables are generated, representing the intermediate nodes of the tree until reaching the root node, which is the final table, or the query result.

With this, we can see each operator as if it were a mathematical function that takes one or more tables as input, performs a certain operation on them, and returns another table as output.
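The idea of operators as functions from tables to tables can be sketched in plain Python. This is only an illustration under simplifying assumptions: tables are lists of dictionaries, the function names select and project are hypothetical, and duplicates are kept (as SQL does) rather than removed (as set-based relational algebra would).

```python
from typing import Any, Callable

Row = dict[str, Any]
Table = list[Row]

def select(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Unary operator: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table: Table, attributes: list[str]) -> Table:
    """Unary operator: keep only the listed attributes of every row."""
    return [{name: row[name] for name in attributes} for row in table]

# Base table (a leaf of the query tree).
bike: Table = [
    {"BikeID": 5, "Model": "EcoCruiser", "Weight": 14.2},
    {"BikeID": 6, "Model": "Speedster", "Weight": 9.8},
]

# Intermediate table: bikes lighter than 12.
light_bikes = select(bike, lambda row: row["Weight"] < 12)

# Root of the tree: the final result keeps only the Model attribute.
result = project(light_bikes, ["Model"])
print(result)
```

Each call takes a table and returns a new one, so the base table is never modified, and the nesting of calls mirrors the tree described above.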

In contrast, when we program in SQL, we don't directly use these relational operators, as they are formal tools that support data querying. Instead, we use a series of DML statements, some of which resemble relational operators but are actually meant to be combined with other statements to form a query.

SQL is not a formal language like relational algebra – it’s an implementation based on this formal language, as well as on relational calculus, which allows us to abstract certain formal details. So when we’re executing a SQL query, the DBMS will transform it from a sequence of SQL statements into an execution plan more similar to a sequence of relational algebra operators. Then it’s internally resolved with advanced techniques that work on the formal operators themselves.

It's also important to note that most of the optimization is done by the DBMS when analyzing the structure of the query. Despite this, we should always try to "help" the DBMS optimizer by writing SQL queries that aim to minimize its workload. For this, there are certain techniques you should follow (but that we won’t cover in detail here).

Before introducing DML statements, it's a good idea to have the schema loaded with the Person, Bike, and Rental tables, as well as some sample data. In addition to creating the tables, to ensure that the queries return some data and we can verify they actually work, you’ll need to insert data into them using INSERT.

SELECT and FROM

The first statements we'll look at for building a query are the most basic ones: SELECT and FROM. You almost always need a FROM to construct a SQL query, as it determines which table the data comes from. (Depending on the DBMS, you can run some queries without a FROM, for example SELECT 1;, though some systems use alternatives like VALUES or FROM DUAL.) Here’s how it works:

SELECT *
FROM Bike;

For example, if we run this query, it will return all the tuples stored in the Bike table. This is because we have provided one table to FROM, in this case Bike, from which the data will be obtained. (In general, FROM can reference one or more tables, including joins, subqueries, or CTEs.) Then, after getting the data from that table, SELECT * selects the data from all its columns, which is what we return to the user.

Although this example uses a single base table in the FROM, we can also perform a series of operations on several base tables and use that result as the table in the FROM. In other words, we can make the result of a SQL query, which is itself a table, the table used in the FROM, as shown here:

SELECT *
FROM (SELECT * FROM Bike) AS B; --PostgreSQL versions before 16 require an alias for a subquery in FROM--

This isn’t common to do with such a simple example, but it’s useful to show that we can provide anything we can build with a SQL query as the table to FROM (since the result of all the queries we can construct is actually a table).

When trying to transfer the functionality of these statements to relational algebra operators, we’ll see that there is no specific operator for FROM that does something similar.

But for SELECT there is an operator that does almost the same thing. Specifically, in relational algebra, there is the projection operator π(Table, ListAttributes). It takes as input a table with data and a list of some of its attributes, and returns another table constructed from the input where only the attributes in the list are kept – with all the data from their columns – discarding the rest of the attributes not appearing in the list.

This is exactly what SELECT does: we have an input table given by the FROM clause, and then we define a series of attributes we want the resulting table to have, discarding the rest.

SELECT Name, Birth
FROM Person;

For example, when FROM gets the data from the Person table, it provides it as input to SELECT. This then returns a table where only the attributes Name and Birth that we listed are present, with all the data from their columns. If we need to get all the attributes, we can use SELECT *, and we’ll get the input table with all its attributes and data as it was received.
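To see projection in action, here is a minimal sketch that runs the queries above through Python's built-in sqlite3 module. The in-memory SQLite database, the column types, and the two sample rows are assumptions made for illustration; only the table and attribute names come from the text.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for whatever DBMS you use.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT, Birth TEXT, Email TEXT)"
)
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?, ?)",
    [(1, "Carol King", "1990-04-12", "carol@example.com"),
     (2, "John Doe", "1985-11-03", None)],
)

# Projection: only the listed attributes survive, with all their data.
cur = conn.execute("SELECT Name, Birth FROM Person")
projected_columns = [d[0] for d in cur.description]
projected_rows = cur.fetchall()

# SELECT * keeps every attribute of the input table.
cur = conn.execute("SELECT * FROM Person")
all_columns = [d[0] for d in cur.description]

print(projected_columns, projected_rows)
print(all_columns)
conn.close()
```

The number of tuples is the same in both results. Only the set of attributes changes, which is what the projection operator π describes.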

Aliases

Another operator that we have in SQL in an almost equivalent form is the renaming operator. As its name suggests, we use it to provide alternative names to the tables or attributes we use, to avoid ambiguity problems or to shorten long names.

SELECT P.Name, Birth AS B
FROM Person P;

In relational algebra, the operator is denoted as ρ(Object, Alias), and its function is to assign an alias to an object, which can be either a table or an attribute.

In SQL, there are several ways to use it. On one hand, in the FROM clause, we can use AS [alias] or directly place the alias name after the table or tables involved in the query. This lets us refer to them by their alias instead of their full name, especially if we use the same one multiple times.

Also, in the SELECT clause, we should use AS to avoid ambiguities when assigning aliases to the attributes we’re going to return. The main utility here is to rename the returned attributes to have more descriptive or context-appropriate names.

For example, instead of returning the attribute Birth, its data is returned with the name B, which is shorter, while the Name attribute from table P is returned with the same name it has at the time of performing the SELECT.

DISTINCT

Another important statement is DISTINCT, which we use to remove duplicate tuples from the query result. To understand this, it's important to note that SQL doesn’t use sets to represent the tuples of a table. Instead, the tuples are represented in a multiset, allowing for identical tuples, especially in intermediate tables where primary key constraints and others don’t apply. So if we want the result to have no duplicate tuples, we need to add DISTINCT at the beginning of the attribute list in the SELECT statement.

SELECT DISTINCT P.Name 
FROM Person P;

When executing this query, we should see fewer names because some people have the same name. Also, DISTINCT is not only used at the beginning of the attribute list. We can also use it to count or perform aggregation operations that affect only non-repeated values, as we’ll see later.

This statement doesn’t have a direct equivalent with any relational algebra operator, as relational algebra formally works with sets where duplicate tuples do not exist, eliminating the need for a specific operator to remove duplicates.
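The multiset behavior is easy to observe with a small experiment. The sketch below uses Python's sqlite3 module with an in-memory database and three made-up people, two of whom share a name. All of the data is hypothetical.

```python
import sqlite3

# Hypothetical data: two different people share the name "Ana".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "Ana"), (2, "Ana"), (3, "Luis")],
)

# Without DISTINCT the result is a multiset: "Ana" appears twice.
all_names = sorted(r[0] for r in conn.execute("SELECT P.Name FROM Person P"))

# With DISTINCT the duplicate tuple is removed.
distinct_names = sorted(
    r[0] for r in conn.execute("SELECT DISTINCT P.Name FROM Person P")
)

print(all_names)
print(distinct_names)
conn.close()
```

The primary key keeps the base table free of duplicate tuples, but once we project onto Name alone, the intermediate result can contain repeats.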

WHERE

With what we've seen so far, we can retrieve data from tables, even removing duplicates or unnecessary attributes for the result – but we haven't introduced a way to keep only those tuples that meet certain conditions.

This is precisely what the WHERE clause in SQL does. Its relational algebra counterpart is the selection operator (not to be confused with SELECT), denoted as σ(Table, Condition). This operator takes a table with data and a condition applied to the tuples stored in the table, so that only those tuples that meet the condition are included in the output table.

Like every operator, selection outputs a table. In this case, the output has exactly the same schema as the input table, but it contains only those tuples that meet the condition we gave to the operator. This lets us perform more complex filtering on the stored data, such as retrieving rentals that have a price higher than a certain amount.

For example, by executing the following query, we’ll get all the tuples from Rental that have a price greater than 10. Specifically, we will get all their attributes, since we used * in the SELECT statement.

SELECT * 
FROM Rental AS R 
WHERE R.Price > 10;

There are many possible conditions we can use in the WHERE clause. First, we can compare numeric attributes and strings with operators like =, >, <, <=, or <>. The last one, <>, checks whether two values are different.

SELECT * 
FROM Rental 
WHERE Price > 50 AND Duration <> 7;
--The <> operator means values of the Duration attribute that differ from 7--

SELECT Name 
FROM Person 
WHERE Name > 'M';

SELECT * 
FROM Person 
WHERE Name = 'Carol King';

As you can see, the operators work the same with numbers as with text. But when using them with text, like in the comparison Name > 'M', we get all the tuples with a Name value that is lexicographically after 'M'.

There are many options we can set for conditions regarding text values. For example, there are functions like LOWER() and UPPER() that convert text to lowercase and uppercase, respectively. We can also use LIKE to compare text with a pattern similar to a regular expression, where we have wildcard characters % and _ (% denotes an arbitrary number of characters and _ a single character).

We can also use the BETWEEN operator to check if a text is lexicographically between two others, but we can use it to compare other data types as well.

SELECT * 
FROM Person 
WHERE Email LIKE '%@example.com';

SELECT * 
FROM Person 
WHERE LOWER(Name) = 'carol king';

SELECT * 
FROM Person 
WHERE Name BETWEEN 'A' AND 'M';

SELECT * 
FROM Rental 
WHERE RentalDate BETWEEN '2025-06-01' AND '2025-06-30';
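The text conditions above can be tried out with a short script. This is only a sketch: it uses Python's sqlite3 module, three invented rows, and SQLite's default rules (binary string ordering, and LIKE that ignores ASCII case), which other DBMS may not share.

```python
import sqlite3

# Hypothetical rows; collation and LIKE case sensitivity vary between DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [(1, "Carol King", "carol@example.com"),
     (2, "John Doe", "john@other.org"),
     (3, "Mary Poe", "mary@example.com")],
)

def names(sql):
    """Run a query returning a single Name column and return the sorted names."""
    return sorted(row[0] for row in conn.execute(sql))

# % matches any run of characters, _ matches exactly one character.
like_suffix = names("SELECT Name FROM Person WHERE Email LIKE '%@example.com'")
like_single = names("SELECT Name FROM Person WHERE Name LIKE 'J_hn%'")

# LOWER() normalizes case before comparing.
lowered = names("SELECT Name FROM Person WHERE LOWER(Name) = 'carol king'")

# BETWEEN is inclusive, but 'Mary Poe' sorts after 'M', so it is left out.
between = names("SELECT Name FROM Person WHERE Name BETWEEN 'A' AND 'M'")

print(like_suffix, like_single, lowered, between)
conn.close()
```

The last query is worth a second look: a name that merely starts with M is lexicographically greater than the one-character string 'M', so BETWEEN 'A' AND 'M' does not include it.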

Continuing with text operations, we also have the SIMILAR TO operator from the SQL-99 standard, which compares text with regular expressions that use the same wildcard characters as LIKE. But these regular expressions aren’t the ones we find in POSIX or Perl. They are simply expressions formed by the LIKE wildcard characters together with a series of operators (such as | for alternatives or [ ] for character ranges) similar to those of conventional regular expressions.

SELECT * 
FROM Person 
WHERE Name SIMILAR TO '(John|Jane)%'; --Match names starting with John or Jane--

SELECT * 
FROM Bike 
WHERE Model SIMILAR TO '%[0-9]'; --Bike models ending in a number between 0 and 9--

In addition to these operators, there are also the logical operators AND, OR, and NOT, which let us describe more complex conditions.

SELECT * 
FROM Rental 
WHERE (RentalDate BETWEEN '2025-07-01' AND '2025-07-31') AND (Price > 50);

SELECT * 
FROM Bike 
WHERE Weight < 9.0 OR Model LIKE '%Trek%';
--Parentheses are not mandatory, but highly recommended--

SELECT 1 AS ColumnOfOnes
FROM Bike 
WHERE NOT (Weight > 10.0);

Here we can see how in the SELECT clause of the last query, instead of returning an attribute, we return a literal, which is a numeric value of 1. If we look at the result, we’ll get a table with a single attribute, ColumnOfOnes, which is what we want to get by putting it in the SELECT list.

As for the tuples, it returns as many as there are in Bike that meet the WHERE condition, although we won't see their values. Instead, each tuple will only have the value 1 for the attribute ColumnOfOnes, which is what we've named these 1 values.

SELECT *, (Price / Duration) AS Ratio 
FROM Rental 
WHERE (Price / Duration) > 5;

SELECT *, (Price*1.0 / Duration) AS Ratio 
FROM Rental 
WHERE (Price*1.0 / Duration) > 5;

When we’re using arithmetic operators, it's important to consider the data types being used. We have all the usual arithmetic operators +, -, *, and /. But when using division, if we don't perform any explicit casting, the division might be done as an integer division, providing a rounded result that may be far from what we need.

To get an exact division with all decimals, we can multiply either of the operands by 1.0 to force the DBMS to treat it as a decimal value. Alternatively, we can multiply the numerator by a certain amount like 100 before dividing, so that the final result is a scaled integer instead of a decimal, which is handy when calculating ratios.
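The difference between the two kinds of division can be checked directly. The sketch below assumes SQLite (through Python's sqlite3 module) and a single invented rental with integer Price and Duration. Division rules differ between systems (MySQL's / operator, for instance, returns a decimal result), so treat the output as specific to this setup.

```python
import sqlite3

# Hypothetical rental: Price and Duration are both stored as integers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (Price INTEGER, Duration INTEGER)")
conn.execute("INSERT INTO Rental VALUES (10, 4)")

rows = conn.execute(
    """
    SELECT Price / Duration        AS IntegerRatio,  -- integer division
           Price * 1.0 / Duration  AS ExactRatio,    -- forced to decimal
           Price * 100 / Duration  AS ScaledRatio    -- integer scaled by 100
    FROM Rental
    """
).fetchall()

print(rows)
conn.close()
```

With these values the integer division drops the fractional part, the version multiplied by 1.0 keeps it, and the scaled version stays an integer.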

Of course, in addition to arithmetic operations, SQL offers a series of functions that allow us to perform more advanced mathematical operations like the following:

SELECT
  ABS(-3.5)      AS abs,
  CEIL(2.1)      AS ceil,
  FLOOR(2.9)     AS floor,
  ROUND(2.345,2) AS round,
  TRUNC(2.345,1) AS trunc,
  SQRT(16)       AS sqrt,
  POWER(3,4)     AS power,
  MOD(17,5)      AS mod;

SELECT 
  EXP(1)       AS e_to_1, --The number e raised to the 1 power--
  LN(10)       AS ln10,
  LOG(10,100)  AS logBase10Of100; --Logarithm base 10 of the number 100--

SELECT
  SIN(PI()/2)   AS sin90deg,
  COS(0)        AS cos0deg,
  TAN(PI()/4)   AS tan45deg;

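On the other hand, SQL allows performing bit-level logical operations, such as a bitwise AND of the binary representation of two numbers, or a shift of their bits, among others. The exact operators depend on the DBMS: for example, # is the bitwise XOR operator in PostgreSQL, while MySQL uses ^.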

SELECT
  9  & 5   AS bitwiseAnd,
  9  | 5   AS bitwiseOr,
  9  # 5   AS bitwiseXor,
  1 << 3   AS shiftLeft,
  16 >> 2  AS shiftRight;

Finally, if we want to check whether an attribute contains the value NULL or not, we can’t use the = operator. Instead, we have to use a specific operator called IS for this comparison:

SELECT * 
FROM Person 
WHERE Email IS NOT NULL; --NULL can't be compared with = operator, but with IS --
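To make the difference between = and IS concrete, here is a small sketch using Python's sqlite3 module. The two rows are invented for illustration, and one of them has no email.

```python
import sqlite3

# Hypothetical data: John has no email, which is stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [(1, "Carol King", "carol@example.com"), (2, "John Doe", None)],
)

# Comparing with = never matches, because the comparison is not TRUE.
eq_null = conn.execute("SELECT Name FROM Person WHERE Email = NULL").fetchall()

# IS NULL and IS NOT NULL are the comparisons that actually test for NULL.
is_null = conn.execute("SELECT Name FROM Person WHERE Email IS NULL").fetchall()
not_null = conn.execute("SELECT Name FROM Person WHERE Email IS NOT NULL").fetchall()

print(eq_null, is_null, not_null)
conn.close()
```

The first query returns nothing even though a tuple with a NULL email exists, which is why IS has to be used instead.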

UNION, INTERSECT, and EXCEPT

There are other relational algebra operators that are useful and have equivalent SQL statements, like those that operate on sets of tuples. So far, we have treated tables as if they were multisets because SQL allows duplicate tuples by default. But there are situations where it’s clearer to use operations on tables by treating them as if they were sets of tuples.

SELECT BikeFK AS BikeID 
FROM Rental 
WHERE Duration > 3 
UNION 
SELECT BikeFK 
FROM Rental 
WHERE Price <= 15;

For example, when we make a query, it returns a table with tuples, which we can see as a set of tuples. So, if we have several queries that return tables with the same number of columns and all of them have compatible data types (meaning they’re either the same or convertible by the DBMS), then we can perform a set operation between them, like a union of both sets of tuples. This in turn results in another set of tuples containing all those from both initial sets.

We do this using the UNION operator, which by default removes duplicate tuples since it treats the tables as sets of tuples. In this specific example, we’re performing a union between a set of tuples with the schema (BikeID) and another (BikeFK). Since both schemas have the same number of attributes with the same data types, regardless of their names, we can perform their union, resulting in a final table that contains all the tuples from both, removing duplicates.

SELECT PersonFK, RentalDate AS DateName 
FROM Rental 
WHERE RentalDate < '2025-01-01' 
INTERSECT 
SELECT PersonFK, RentalDate AS DateName2 /*This name is not preserved; the one from the first query is*/
FROM Rental 
WHERE RentalDate > '2024-01-01';

Besides performing a union, we can also carry out other common set operations like intersection or difference. For example, with INTERSECT, we only keep the tuples that are in both sets of tuples, removing duplicates, as long as we’ve made sure that both sets are valid for performing a set operation between them.

This means that to apply INTERSECT, we have to ensure that the schema of both sets is compatible, both in the number of columns, in this case, 2, and in their respective data types. As for the names, we see here that it doesn't matter what the attributes are called, since the result will always retain the schema name from the first set in the operation.

SELECT PersonFK, RentalDate 
FROM Rental 
WHERE RentalDate < '2025-01-01' 
EXCEPT ALL
SELECT PersonFK, RentalDate 
FROM Rental 
WHERE RentalDate > '2024-01-01';

Lastly, we can also calculate the difference between two sets with EXCEPT, which in some DBMS is called MINUS. This is the only one of these operators where the order of the sets matters: the tuples that exist in the second set are discarded from the first, so we are left with all the tuples that are in the first set but not in the second. Like the previous ones, this operator removes duplicate tuples by default, so if we need to keep them, we have to add ALL after the set operator, as in the query above.
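The three set operators can be compared side by side on a tiny table. The following sketch uses Python's sqlite3 module with four invented rentals. It sticks to plain EXCEPT because SQLite does not support EXCEPT ALL, so the duplicate-preserving variant is only shown for UNION.

```python
import sqlite3

# Hypothetical rentals: (BikeFK, Duration, Price).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (BikeFK INTEGER, Duration INTEGER, Price REAL)")
conn.executemany(
    "INSERT INTO Rental VALUES (?, ?, ?)",
    [(1, 5, 20.0), (1, 2, 10.0), (2, 1, 12.0), (3, 4, 30.0)],
)

long_rentals = "SELECT BikeFK AS BikeID FROM Rental WHERE Duration > 3"   # bikes 1, 3
cheap_rentals = "SELECT BikeFK FROM Rental WHERE Price <= 15"            # bikes 1, 2

def run(operator):
    """Combine both queries with a set operator; return column names and sorted ids."""
    cur = conn.execute(f"{long_rentals} {operator} {cheap_rentals}")
    columns = [d[0] for d in cur.description]
    return columns, sorted(row[0] for row in cur)

union_cols, union_ids = run("UNION")      # duplicates removed
_, union_all_ids = run("UNION ALL")       # duplicates kept
_, intersect_ids = run("INTERSECT")       # only ids present in both results
_, except_ids = run("EXCEPT")             # in the first result but not the second

print(union_cols, union_ids, union_all_ids, intersect_ids, except_ids)
conn.close()
```

The result column takes its name from the first query (BikeID), and bike 1 appears only once in the UNION even though both queries return it.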

Nested query

We talked about nested queries back at the beginning as a way to use the result of one query within another query. Essentially, that's what they are, but SQL also provides a series of specific operators that are useful when working with nested queries in a WHERE clause, for example, since nested queries aren’t limited to the FROM clause.

SELECT *
FROM (
    SELECT PersonFK,
      RentalDate
    FROM Rental
    WHERE RentalDate > '2024-01-01'
  ) AS T
WHERE T.RentalDate <= '2024-06-06';

To start, nested queries take advantage of the fact that a query always returns a table, allowing us to use that result as an intermediate table in another query's computation.

For example, here we first get the tuples from Rental with a date later than '2024-01-01' in the subquery of the FROM clause. Then in the “outer” query, we assign the alias T to the result of this subquery, and from it we keep all the tuples with a date no later than '2024-06-06'.

SELECT *
FROM Rental R
WHERE R.RentalDate > '2024-01-01' AND R.RentalDate <= '2024-06-06';

As you might guess, when doing this, SQL internally first resolves the subquery in the FROM clause. This means it retrieves all the tuples that the subquery needs to return, and then applies the filter defined in the WHERE clause to all of them. So a condition is first evaluated on all the tuples from Rental, and then another condition is applied to all the resulting tuples from the query. This creates extra work (computation) to first obtain and potentially store in memory the tuples from the subquery and then filter them again.

Just note that conceptually, a derived table is evaluated first, but optimizers may rewrite/flatten the query – so don’t rely on a specific evaluation order.

On the other hand, this query could have been resolved more simply, as shown above. Here, the Rental table is used directly in the FROM clause, and both conditions on RentalDate are applied together in a single WHERE clause. This means that the tuples from Rental only need to be traversed once, instead of traversing them and then filtering the tuples from a subquery again. This saves unnecessary computation, as well as the memory the DBMS might use to store the tuples resulting from the subquery.
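We can confirm that the nested and the flat formulation select the same tuples with a quick check. This sketch uses Python's sqlite3 module and four invented rentals. It compares results only and says nothing about how a particular DBMS would actually execute either query.

```python
import sqlite3

# Hypothetical rentals: (PersonFK, RentalDate), dates stored as ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (PersonFK INTEGER, RentalDate TEXT)")
conn.executemany(
    "INSERT INTO Rental VALUES (?, ?)",
    [(1, "2023-12-31"), (1, "2024-03-15"), (2, "2024-06-06"), (2, "2024-09-01")],
)

# Nested form: filter in a derived table, then filter its result again.
nested = sorted(conn.execute(
    """
    SELECT *
    FROM (SELECT PersonFK, RentalDate
          FROM Rental
          WHERE RentalDate > '2024-01-01') AS T
    WHERE T.RentalDate <= '2024-06-06'
    """
))

# Flat form: both conditions in a single WHERE clause.
flat = sorted(conn.execute(
    """
    SELECT R.PersonFK, R.RentalDate
    FROM Rental R
    WHERE R.RentalDate > '2024-01-01' AND R.RentalDate <= '2024-06-06'
    """
))

print(nested)
print(flat)
conn.close()
```

Both queries return the same two rentals, so the choice between them is about readability and, depending on the optimizer, possibly about cost.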

With this example, we’ve seen that the same query can be resolved in a more or less computationally efficient way depending on how we write it. That said, modern DBMS generally include an optimizer component in their architecture, which automatically applies certain optimization techniques to the query without us having to worry about it. We won’t go into detail about these techniques here.

In turn, nesting these queries allows us to solve more complex problems with the help of operators like EXISTS. Specifically, we mainly use EXISTS in a WHERE statement before a nested query to check if the nested query contains any tuples or not. In other words, if we consider it as a multiset of tuples, EXISTS tells us whether that multiset is empty or not.

SELECT B.*
FROM Bike AS B
WHERE EXISTS (
    SELECT *
    FROM Rental AS R
    WHERE R.BikeFK = B.BikeID
  );

For example, to find out which bikes from Bike have been rented at least once, we select all those tuples from Bike that have a tuple in Rental associated with the bike we are checking.

To understand this, you need to keep in mind that a SQL query is usually executed by scanning the tuples of the tables from top to bottom. So the WHERE clause of the outer query is actually executed for each bike in Bike, which is the table we traverse in the FROM clause.

So for each bike, we execute a nested query that returns all rentals of that bike, as it keeps the tuples from Rental whose foreign key BikeFK points to the BikeID attribute of the table with alias B. This is called correlated nesting because we’re using the table from the outer query in the nested query. This means we may be forcing SQL to recalculate it each time the WHERE condition is checked on a tuple from Bike (but engines commonly rewrite it as a semi-join, avoiding per-row re-execution).

With this, if the nested query contains any tuple, it implies that the bike has been rented at least once. And we can detect this with EXISTS, which checks if the resulting table from the nested query returns any tuple.

Since we’re simply interested in knowing if it contains any tuple, we don’t need to return any specific attribute in the nested query, although it’s generally considered good practice to return *, or a constant like 1.

Another way to solve the previous query with a different operator is by using IN. This operator checks if a certain value or tuple is contained in a column or table.

SELECT B.*
FROM Bike AS B
WHERE B.BikeID IN (
    SELECT Rental.BikeFK
    FROM Rental
  );

For example, in this case, we build a nested query in the WHERE clause that contains only the foreign key BikeFK from the Rental table, where all the BikeID values referenced by the rental tuples are found. In the outer query, all the tuples from Bike are traversed. It checks a condition where the BikeID from the Bike table must belong to the resulting table from the nested query to be considered a bike that’s been rented at least once.

So to solve this query, we need to know, for each bike, if its primary key BikeID is referenced by the corresponding foreign key of any tuple in Rental.

For this, we can use EXISTS as before to check if there is any tuple in Rental that references the specific primary key value of Bike, or we can use IN to directly check if the primary key value BikeID of the bike we are traversing in the outer query is present in the foreign key column of Rental that we get with the nested query.
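The equivalence between the EXISTS and IN formulations can be verified on a small data set. The sketch below relies on Python's sqlite3 module. The three bikes and three rentals are invented, and bike 2 is deliberately never rented.

```python
import sqlite3

# Hypothetical data: bike 2 has no rentals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bike (BikeID INTEGER PRIMARY KEY, Model TEXT)")
conn.execute("CREATE TABLE Rental (RentalID INTEGER PRIMARY KEY, BikeFK INTEGER)")
conn.executemany("INSERT INTO Bike VALUES (?, ?)",
                 [(1, "Trek A"), (2, "Giant B"), (3, "Scott C")])
conn.executemany("INSERT INTO Rental VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 3)])

# Correlated subquery: is there at least one rental for this bike?
with_exists = sorted(row[0] for row in conn.execute(
    """
    SELECT B.BikeID
    FROM Bike AS B
    WHERE EXISTS (SELECT 1 FROM Rental AS R WHERE R.BikeFK = B.BikeID)
    """
))

# Uncorrelated subquery: is this BikeID among the referenced foreign keys?
with_in = sorted(row[0] for row in conn.execute(
    """
    SELECT B.BikeID
    FROM Bike AS B
    WHERE B.BikeID IN (SELECT Rental.BikeFK FROM Rental)
    """
))

print(with_exists, with_in)
conn.close()
```

Bike 1 is rented twice but still appears once in each result, because both operators only test for membership and do not multiply the tuples of Bike.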

Continuing with the equivalent ways to solve the previous query, we can also replace the IN operator with = ANY. Intuitively, we can understand this as checking if the value B.BikeID is equal to any of the values in the column that we got with the nested query (which is equivalent to what the IN operator does).

SELECT B.*
FROM Bike AS B
WHERE B.BikeID = ANY (
    SELECT Rental.BikeFK
    FROM Rental
  );

In other words, conceptually, checking if something belongs to a set is equivalent to checking if it’s equal to any of the elements contained in the set. Ultimately, the ANY operator allows us to check if a certain value meets a condition with respect to any of the values stored in a nested query – that is, in a multiset, since we can do it with tuples as well as values.

SELECT B.*
FROM Bike AS B
WHERE (1, B.BikeID) = ANY (
    SELECT R.PersonFK, R.BikeFK
    FROM Rental R
  );

For example, instead of checking if a specific value of a single attribute is in the column from the nested query, we can perform the check with a complete tuple.

Here, the nested query returns the foreign key values of the tuples from Rental, so in the outer query, we can check which bikes have been rented at least once by the person with the primary key PersonID=1. Or put another way, for each tuple in Bike, we check if there is any tuple in the nested query table in the form (1, B.BikeID). This would indicate that the person with the primary key PersonID=1 has rented the bike at least once.

Lastly, the IN operator is also equivalent to the NOT <> ALL operation, which is more complicated to understand. Essentially, we want to check if the tuple (1, B.BikeID) is contained in the result of the nested query.

SELECT B.*
FROM Bike AS B
WHERE NOT (1, B.BikeID) <> ALL (
    SELECT R.PersonFK, R.BikeFK
    FROM Rental R
  );

With <> ALL, we check if the tuple is different from each and every tuple stored in the nested query. Then, by negating that result with NOT, we can determine if that condition is not met (that is, the tuple is not different from each and every tuple in the nested query). This would mean it’s equal to at least one of them, or in other words, it’s contained in the multiset returned by the nested query.

To understand the ALL operator, we can try to get the bike with the lowest weight in the entire Bike table. To do this, with a nested query, we can get all the weights from the Bike table. Then in the outer query, we can go through all the tuples in Bike and check if each one’s weight B.Weight is less than or equal to each weight gotten with the nested query using <= ALL.

SELECT *
FROM   Bike B
WHERE  B.Weight <= ALL (
    SELECT Weight
    FROM   Bike
);

If this is true, then that weight will match the lowest in the entire table, so the WHERE condition will be TRUE, and the corresponding tuple from Bike will be returned in our outer query.

In SQL, conditions usually evaluate to TRUE or FALSE depending on whether they are met. But a comparison with a NULL value evaluates to the special value UNKNOWN instead, and a WHERE clause only keeps the tuples whose condition is TRUE. This matters with nested queries, which sometimes return NULL values unexpectedly. In the query above, for example, if any bike had a NULL Weight, the <= ALL comparison could never be TRUE, and no tuple would be returned.

JOIN

The JOIN operators also have an equivalent in relational algebra. Their main purpose is to gather information spread across multiple tables so that all the data can be operated on in a single intermediate table.

For example, when we look at the information in the Rental table, we see that it has foreign keys referencing Bike and Person, but the Rental table itself doesn’t contain all the information we might need about the bikes or the people. So, if we want to query the rentals and the names of the people involved in those rentals, we’ll need to apply a JOIN operation on both tables.

SELECT *
FROM Rental, Person;

There are several types of JOIN, all of which have an almost direct equivalence in relational algebra operators. The simplest one is the implicit JOIN shown above, which is denoted by using multiple tables in a FROM statement separated by commas. We can use as many tables as we want here, as long as there are no ambiguities in their names.

Note that if we perform an implicit JOIN of a table with itself, we’ll need to assign different aliases to the different uses we make of it.

Before seeing what the query does, it's useful to understand the Cartesian product operation in detail, as it’s the foundation of all SQL JOIN operators.

The Cartesian product is a mathematical operation that takes two sets as input, which in SQL are tables or multisets with tuples, such as table A with tuples {{a},{b},{c}}, and table B with tuples {{1},{2},{3}}. As output, the operation generates a new multiset of tuples where each row of A is combined with each row of B, resulting in the table or multiset A×B={{a,1},{a,2},{a,3},{b,1},{b,2},{b,3},{c,1},{c,2},{c,3}}.

As you can see, if table A has n tuples and table B has m tuples, the Cartesian product will generate n*m tuples, where each one takes values from all the attributes of table A and table B (since the result of the operation includes all possible “pairings“ we can make between tuples from both tables).
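The n*m count is easy to check. The sketch below recreates the two small tables A and B from the explanation using Python's sqlite3 module (the column names x and y are made up) and compares the implicit JOIN with itertools.product, which computes the same pairing in plain Python.

```python
import itertools
import sqlite3

# Tables A and B from the explanation; the column names x and y are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (x TEXT)")
conn.execute("CREATE TABLE B (y INTEGER)")
conn.executemany("INSERT INTO A VALUES (?)", [("a",), ("b",), ("c",)])
conn.executemany("INSERT INTO B VALUES (?)", [(1,), (2,), (3,)])

# Implicit JOIN: every row of A is paired with every row of B.
pairs = conn.execute("SELECT x, y FROM A, B").fetchall()

# The same pairing computed without SQL, for comparison.
expected = set(itertools.product(["a", "b", "c"], [1, 2, 3]))

print(len(pairs))
print(sorted(pairs))
conn.close()
```

With 3 tuples in each table, the result has 3*3 = 9 tuples, one for each possible pairing.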

So going back to our query, as you can see in the result, the implicit JOIN performs the Cartesian product of the two tables. It doesn't matter if their names repeat, as each repetition can be accessed through a different alias.

Regarding the tuples it contains, we see that the Cartesian product returns tuples where each possible tuple of Rental is combined with each possible tuple of Person. This forms tuples with values in all the attributes of the resulting JOIN table.

The implicit join has no filtering criteria or additional functionality – it simply returns the complete Cartesian product of the tables involved in the operation.

Its name, implicit, comes from the fact that the JOIN operator and the type of JOIN we want to perform aren’t explicitly written. Instead, it's enough to list several tables separated by a comma in the FROM clause.

In addition to the implicit JOIN, we also have the explicit JOIN. It can be of various types depending on the filtering or conditions applied to the Cartesian product.

For example, instead of performing a Cartesian product between both tables with an implicit join, we can also do it explicitly with a CROSS JOIN. This does exactly the same thing but with explicit syntax: we specify the JOIN operation to perform and its type, CROSS. This indicates the execution of a Cartesian product like the previous one.

SELECT *
FROM Rental CROSS JOIN Person;

Besides the CROSS type, there are other types that provide additional functionalities to the JOIN, allowing us to filter the tuples we get from a Cartesian product.

For example, so far with the Cartesian product, we have obtained all combinations of tuples from Rental and Person. If there are N tuples in Rental and M tuples in Person, then the Cartesian product will return N*M tuples – meaning all possible combinations of tuples from both tables we are working with.

If we look at the resulting table from this operation, we will see that in some tuples the values of the attributes PersonFK and PersonID match. This means a tuple from Rental has been combined with the tuple from Person that its foreign key references. In other words, we have a tuple that not only contains the information from Rental but also the information from the Person tuple representing the person who made that rental, "concatenated" or combined with it.

So if we want to keep only those tuples from the Cartesian product where PersonFK matches PersonID from the Person table, we could apply a condition in a WHERE clause to filter those tuples. Conceptually, this is a Cartesian product followed by a filter, although the optimizer typically rewrites it into an equivalent inner join without materializing the full product.

There are specific types of JOINs that let us express this filtering more directly:

SELECT *
FROM Rental AS R CROSS JOIN Person AS P
WHERE R.PersonFK=P.PersonID;

SELECT *
FROM Rental R INNER JOIN Person AS P ON R.PersonFK=P.PersonID;

To implement this query, we can use a condition in a WHERE clause, or we can use an INNER JOIN, which allows us to set a condition in the ON clause.

If we use a WHERE clause, we’ll conceptually be filtering all the tuples obtained from the complete Cartesian product resulting from the CROSS JOIN. To state the intent directly (and avoid relying on the optimizer to skip building the entire Cartesian product), we can use an explicit INNER JOIN. Here, we provide a condition in the ON clause so that only the combinations of tuples that meet that condition appear in the result.
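Whichever form we write, the result is the same. The sketch below, which uses Python's sqlite3 module with two invented people and three invented rentals, runs the unfiltered CROSS JOIN, the CROSS JOIN with a WHERE filter, and the INNER JOIN with ON, and compares what they return. It does not measure how the DBMS executes them.

```python
import sqlite3

# Hypothetical data: Carol has two rentals, John has one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Rental (RentalID INTEGER PRIMARY KEY, PersonFK INTEGER, Price REAL)")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [(1, "Carol King"), (2, "John Doe")])
conn.executemany("INSERT INTO Rental VALUES (?, ?, ?)",
                 [(10, 1, 20.0), (11, 1, 8.0), (12, 2, 15.0)])

# Full Cartesian product: 3 rentals x 2 people.
product = conn.execute("SELECT * FROM Rental CROSS JOIN Person").fetchall()

# Cartesian product followed by a filter.
filtered = sorted(conn.execute(
    "SELECT * FROM Rental AS R CROSS JOIN Person AS P WHERE R.PersonFK = P.PersonID"
))

# The same condition expressed in the ON clause of an INNER JOIN.
inner = sorted(conn.execute(
    "SELECT * FROM Rental R INNER JOIN Person AS P ON R.PersonFK = P.PersonID"
))

print(len(product), len(filtered), len(inner))
conn.close()
```

Only the combinations where the foreign key matches the primary key remain, so each rental ends up paired with exactly one person.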

In the ON clause of an INNER JOIN, we can put any type of condition on the tuples we want to get. But there are times when these conditions are simple and only involve equality between attributes, which may even have the same name.

SELECT *
FROM Person P1 CROSS JOIN Person P2
WHERE P1.PersonID=P2.PersonID;

SELECT *
FROM Person P1 INNER JOIN Person P2 ON P1.PersonID=P2.PersonID;

For example, if we perform the Cartesian product between the Person table and itself, and we want to keep only those tuples where the PersonID attributes of both tables match, we can use an INNER JOIN with the condition that the PersonIDs of both tables being combined are equal. This way, only the tuples that meet this condition appear in the result (whereas the previous query with a CROSS JOIN conceptually constructs all tuples of the Cartesian product before filtering them, which requires more computation if the optimizer doesn’t rewrite it).

In these types of situations, instead of using an INNER JOIN, we can take advantage of another type of JOIN like the NATURAL JOIN. This returns only those tuples where the values of all attributes with the same name match.

SELECT *
FROM Person P1 NATURAL JOIN Person P2;

SELECT *
FROM Person P1
  NATURAL JOIN (
    SELECT PersonID,
      Name AS Name2,
      Birth AS Birth2,
      Email AS Email2
    FROM Person
) AS P2;

To understand this, we can perform a NATURAL JOIN between the Person table and itself. First, if we don't rename any attribute, then all will have the same name in both tables – so the NATURAL JOIN will impose an equality condition for each attribute. This means that it’ll return only those tuples that satisfy P1.PersonID=P2.PersonID, P1.Name=P2.Name, and so on for the rest of the attributes, since they have the same name despite being in tables with different aliases. This will result in the same Person table, as the NATURAL JOIN, in addition to imposing these conditions, "merges" attributes that meet these conditions. So if they have the same name, it leaves only one occurrence of them, not both (as happens in other types of JOINs).

But if we rename the attributes of one of the tables except for PersonID, we’ll see that NATURAL JOIN only imposes the equality condition P1.PersonID=P2.PersonID, since PersonID is the only attribute that’s exactly the same in both tables.

In the resulting table, we’ll get the same as before but with the renamed attributes included, as they aren’t discarded or subjected to any condition that makes them unnecessary. Even if we rename PersonID as well, we’ll get the Cartesian product of Person with itself – because if none of the attributes have the same name in both tables, then NATURAL JOIN doesn’t impose any equality condition.

Another option we have to impose equality conditions on attributes with the same name in both tables is to use an INNER JOIN. Instead of declaring conditions in an ON clause, we use a USING clause where we define the attributes on which equality conditions are imposed. These must have exactly the same name in both tables.

SELECT *
FROM Person P1 INNER JOIN Person P2 USING (PersonID);

For example, in the query above, we are getting the tuples from the Cartesian product of Person with itself that satisfy P1.PersonID=P2.PersonID.

The main difference with NATURAL JOIN is that NATURAL JOIN imposes this equality condition on every pair of attributes with the same name. But with an INNER JOIN and USING, we decide which attributes the equality conditions are imposed on. Each attribute listed in USING must exist with the same name in both tables. Otherwise, the DBMS will generate an error.

Also, when we use USING in combination with an INNER JOIN, only one occurrence of the attributes with the same name appears in the resulting table, just like with NATURAL JOIN.

Lastly, it’s important to note that when using ON to declare a condition, no attribute is removed from the resulting table of the JOIN operation, since the condition can be very diverse in nature. This means it doesn't necessarily have to be an equality between several attributes.

But when you’re using USING in combination with an INNER JOIN (and imposing an equality condition on the attributes declared in the USING clause), all repetitions of those attributes will be removed from the resulting table. So, if we impose an equality condition on several attributes with the same name, all but one of their occurrences will be deleted.

For example, in a table with two attributes called PersonID but coming from different tables or elements with different aliases (same Person table but different alias), USING would remove one of their occurrences. This would leave only one PersonID attribute in the resulting JOIN table, while ON would not remove any of the occurrences. This would result in the final table containing both original PersonID attributes.
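One way to observe how ON, USING, and NATURAL JOIN treat repeated attributes is to count the columns each one returns. This sketch uses Python's sqlite3 module and a reduced, hypothetical Person table with only two attributes, so the numbers are easy to follow.

```python
import sqlite3

# Hypothetical Person table reduced to two attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [(1, "Carol King"), (2, "John Doe")])

def columns(sql):
    """Return the names of the columns produced by a query."""
    return [d[0] for d in conn.execute(sql).description]

# ON keeps every attribute from both sides: 2 + 2 columns.
on_cols = columns(
    "SELECT * FROM Person P1 INNER JOIN Person P2 ON P1.PersonID = P2.PersonID"
)

# USING merges only the listed attribute: PersonID appears once, Name twice.
using_cols = columns(
    "SELECT * FROM Person P1 INNER JOIN Person P2 USING (PersonID)"
)

# NATURAL JOIN merges every attribute with a shared name: the original schema.
natural_cols = columns("SELECT * FROM Person P1 NATURAL JOIN Person P2")

print(len(on_cols), len(using_cols), len(natural_cols))
conn.close()
```

The column counts drop from four (ON) to three (USING) to two (NATURAL JOIN) as more same-named attributes are merged.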

SELECT *
FROM Person P LEFT JOIN Rental R ON R.PersonFK = P.PersonID;

Continuing with the types of JOIN, there might be a case where a person has never rented a bike, so there won't be any tuple in the Rental table referencing that person. This is possible due to the minimum multiplicities on the Rental side in the entity-relationship diagram (that don’t require any person to have rented a bike).

So if we want to build a table that shows information about all people along with information about all the rentals they’ve made, the first thing we may think of is performing an INNER JOIN between them. And we’d add a certain equality condition on the foreign key attribute of Rental that references the primary key of the Person table.

But there may be people who have never rented a bike, so if we do an INNER JOIN, the information about these people won’t appear in the table. To make sure that they appear, we need to use an OUTER JOIN instead of an INNER JOIN. We also need to specify which table we want to force to have its data appear by putting LEFT or RIGHT before the type of OUTER JOIN (or we can simply use LEFT JOIN, for example).

This way, if we use LEFT JOIN, we’re forcing the data from the table on the left of the JOIN to appear in the resulting table. If they have no match in the table on the right (meaning if they have no rental), then the other attributes will be filled with NULL values, as we saw in the result of the previous query.

SELECT *
FROM Rental R RIGHT OUTER JOIN Person P ON R.PersonFK = P.PersonID;

In the same way, if we use RIGHT JOIN and reverse the order of the tables, we’ll do the same but force the data from the table on the right to appear in the resulting table, filling the attributes of the left table with NULL in case there’s no match.

With Rental RIGHT JOIN Person, all persons appear – for persons without rentals, the Rental side will be NULL.

Finally, if we want to use both RIGHT and LEFT in a join and force the data from both tables to appear (which would fill in NULL on the side that corresponds to each tuple), we can use a FULL JOIN.
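A FULL JOIN could look like the following minimal sketch. The tables DemoPerson and DemoRental, their rows, and the rental with a NULL PersonFK are assumptions for illustration only (the book's real schema may not allow a rental without a person):

```sql
BEGIN;

-- Hypothetical demo tables; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoPerson (
  PersonID INTEGER PRIMARY KEY,
  Name     TEXT
);
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  PersonFK INTEGER,   -- left nullable here so one rental has no person
  Price    NUMERIC
);
INSERT INTO DemoPerson VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO DemoRental VALUES (10, 1, 15.00), (11, NULL, 8.00);

-- Rows from both sides are kept, and the missing side is filled with NULL.
SELECT P.PersonID, P.Name, R.RentalID, R.Price
FROM DemoPerson P FULL OUTER JOIN DemoRental R ON R.PersonFK = P.PersonID;

ROLLBACK;
```

With this sample data the result has three rows: Alice matched with rental 10, Bob with NULL in the rental columns, and rental 11 with NULL in the person columns.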

SELECT *
FROM Person P JOIN Rental R ON R.PersonFK = P.PersonID;

In this last type of JOIN, we've seen that specifying OUTER is optional when using RIGHT, LEFT, or FULL. But by default, if nothing is specified, the JOIN operator is treated as an INNER type, requiring a condition with ON or USING afterward.

Aggregation

With joins, we can now combine several tables and gather their information into one. But there are still certain operations we can't do easily, like counting the rows in a table, summing the values of a column, calculating their average, and so on.

All operations of this nature that involve values from a multiset (table) of tuples are called aggregation operations. Their goal is to perform a calculation on a series of tuples, and they are the basis of analytical queries.

SELECT COUNT(*) AS rentalCount,
  SUM(Price) AS income,
  AVG(Price) AS averageRentalPrice,
  MAX(Price) AS maxRentalPrice,
  MIN(Price) AS minRentalPrice
FROM Rental;

SQL offers a number of them (which don’t have a direct equivalent with relational algebra operators): COUNT(), SUM(), AVG(), MIN(), and MAX().

COUNT()

We can use COUNT() to count how many rows are in a table, including tuples where all values are NULL. So by declaring COUNT(*) in the SELECT clause, we’ll get the number of tuples in the table specified in the FROM clause.

SELECT COUNT(*), COUNT(Price), COUNT(DISTINCT Price)
FROM Rental;

But the function can also perform aggregation on a specific column. So instead of counting tuples, it counts how many values exist in a certain attribute, including duplicate values and ignoring NULLs.

So if we want to count only how many distinct values there are in Price, we can use DISTINCT as shown above.

As for the column names we get from these operations, it's not mandatory to assign them an alias and rename them, but it's very convenient for identifying which calculation is stored in each column of the resulting table.
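To see how the three forms of COUNT() differ, here is a small sketch on a hypothetical DemoRental table whose rows (including one NULL price) are assumptions chosen for illustration:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  Price    NUMERIC
);
INSERT INTO DemoRental VALUES (1, 10), (2, 10), (3, 20), (4, NULL);

SELECT COUNT(*)              AS allRows,         -- 4: every tuple counts
       COUNT(Price)          AS pricedRows,      -- 3: the NULL price is ignored
       COUNT(DISTINCT Price) AS distinctPrices   -- 2: only 10 and 20
FROM DemoRental;

ROLLBACK;
```

The aliases also show why renaming is convenient: without them, all three result columns would simply be called count.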

SELECT COUNT(*) 
FROM (SELECT DISTINCT PersonFK, BikeFK FROM Rental) AS t;

In addition to a single attribute, we can count how many combinations of values from a certain set of attributes are in the table. Specifically, in this example, the subquery keeps each distinct (PersonFK, BikeFK) combination once, and the outer COUNT(*) counts those combinations. This may not match the total number of tuples, since repeated combinations are only counted once. In PostgreSQL, we can also use DISTINCT directly inside COUNT(), as in COUNT(DISTINCT (PersonFK, BikeFK)), as long as the attributes whose value combinations we want to count are in parentheses.

SUM()

SELECT SUM(2*Price), AVG(Price)
FROM Rental;

SUM() calculates the sum of a numeric attribute of a table, or of an attribute that can be converted to numeric. It takes as input the attribute whose values we want to add up. Note that, besides a plain attribute, SUM() accepts any expression that produces a single value per tuple. That is, if instead of Price we provide 2*Price or Price+Price, the expression is evaluated for each tuple first, and SUM() then adds up those results.

If all the values of the attribute are NULL, SUM() returns NULL rather than 0, because NULLs are ignored and there is nothing left to add. Unlike COUNT(), in this case, we can’t sum several attributes at once, meaning SUM() only takes one input, regardless of whether it's a plain attribute or an arithmetic expression.

AVG()

Similarly, AVG() calculates the average of the values taken by a single attribute, ignoring NULLs. Like SUM(), this function returns NULL when all the values of the input attribute are NULL. Conceptually, it can be calculated as SUM()/COUNT() over the non-NULL values.

So when the attribute is full of NULLs, there are no values left to add up or to count, the average is undefined, and AVG() returns NULL. It’s also important to note that if we use DISTINCT, both the sum and the average can be different, since repeated values are only considered once.
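The NULL handling of SUM() and AVG() can be checked with a short sketch. In PostgreSQL (and standard SQL), both functions return NULL when every input is NULL, and COALESCE() can turn that into 0 if needed. The DemoRental table and its rows are assumptions for illustration:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  Price    NUMERIC
);
INSERT INTO DemoRental VALUES (1, NULL), (2, NULL);

-- Every Price is NULL: SUM and AVG are both NULL, COALESCE gives 0.
SELECT SUM(Price)              AS total,
       AVG(Price)              AS average,
       COALESCE(SUM(Price), 0) AS totalOrZero
FROM DemoRental;

-- Add the prices 10, 10, and 20 and compare with the DISTINCT variants.
INSERT INTO DemoRental VALUES (3, 10), (4, 10), (5, 20);

SELECT SUM(Price)          AS total,          -- 40
       SUM(DISTINCT Price) AS distinctTotal,  -- 30 (10 + 20)
       AVG(Price)          AS average,        -- 40 / 3, about 13.33
       AVG(DISTINCT Price) AS distinctAverage -- 30 / 2 = 15
FROM DemoRental;

ROLLBACK;
```

In the second query, the two NULL prices are still ignored, which is why the average divides by 3 and not by 5.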

MIN() and MAX()

SELECT MIN(Price), MAX(Price)
FROM Rental;

Finally, the MIN() and MAX() operations take an attribute as input and return the minimum or maximum value found in the tuples stored in the table, respectively. If all the values of that attribute are NULL, they also return NULL, as a coherent minimum or maximum value can’t be established since NULLs are ignored.

GROUP BY

If we try to use aggregate functions in the SELECT clause along with other attributes, the DBMS will give us an error unless those attributes are listed in a GROUP BY statement (which also doesn't have a direct equivalent in relational algebra).

To understand how GROUP BY works, we can start by calculating the total price of all the rentals that a certain person has made in the system.

SELECT SUM(Price)
FROM Rental R
WHERE R.PersonFK=5;

To do this, we access the Rental table and use a WHERE clause to filter all rental tuples for a certain person using their foreign key that references the person making the rental. Then, with SUM, we get the sum of the Price attribute from the final table, which contains the prices of all rentals made by that person.

If we wanted to do it by name instead of PersonID, we would need to do a JOIN with the Person table and filter by the Name attribute of Person (although this isn’t important for understanding GROUP BY).

SELECT SUM(Price) AS PriceSum
FROM Rental R INNER JOIN Person P ON R.PersonFK=P.PersonID
WHERE P.Name='Carol King';

Now, if we want to calculate this value for the rest of the people in the database who have rented a bike at least once, we would have to run this query once for each person in the system, which isn’t practical. Instead, we can take advantage of the fact that the Rental table itself has the foreign key PersonFK for people who have rented bikes, and we can use this to calculate the sum for all of them more simply using GROUP BY.

SELECT R.PersonFK, SUM(Price) AS PriceSum 
FROM Rental R 
GROUP BY R.PersonFK;

As you can see, this query returns all the people who have ever rented a bike – meaning those referenced from the Rental table. For each one, it calculates the sum of the prices of their rentals. This is possible thanks to GROUP BY, which groups all the tuples in the Rental table by the PersonFK attribute.

Since each person can have multiple rentals in the Rental table, we need to get all the tuples that reference each person and group them so that we can perform an aggregation operation like SUM() on one of the attributes.

In this case, we do the grouping with the PersonFK attribute, which identifies the person who made the rental. So since all the tuples in Rental with the same value in that attribute belong to the same person, they are grouped by that attribute to form groups of tuples, one for each person.

With this, we can then return the attribute that was grouped along with the results of the aggregation operations calculated on those groups. Keep in mind that, in general, any attribute in the SELECT clause that isn't inside an aggregate function must also appear in the GROUP BY.

SELECT DISTINCT Price
FROM Rental;

SELECT Price
FROM Rental
GROUP BY Price;

When we use GROUP BY and partition the tuples of the table into groups, each group is "identified" or represented by one value of the attribute we are grouping by. This means that when we return a result to the user, for each group, they receive a single tuple where the attribute used for grouping takes the value of the "representative" of that group, instead of receiving multiple tuples per group.

For example, to get all the distinct prices from the Rental table, we can use DISTINCT directly, or we can also group by that attribute, which results in forming different groups of tuples, one for each distinct price. Finally, when returning Price after grouping, the distinct values of Price that form the different groups of tuples are returned, meaning only the distinct Price values are obtained.

It’s also worth noting that we can group by several attributes at once, not just one. In this case, we would generate groups of tuples based on the unique combinations of values those attributes take in the table.
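Grouping by several attributes could be sketched like this. The DemoRental table and its rows are assumptions for illustration, and each distinct (PersonFK, BikeFK) combination forms its own group:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  PersonFK INTEGER,
  BikeFK   INTEGER,
  Price    NUMERIC
);
INSERT INTO DemoRental VALUES
  (1, 1, 100, 10),
  (2, 1, 100, 12),
  (3, 1, 200, 8),
  (4, 2, 100, 10);

-- One result tuple per (PersonFK, BikeFK) combination.
SELECT PersonFK, BikeFK, COUNT(*) AS timesRented, SUM(Price) AS PriceSum
FROM DemoRental
GROUP BY PersonFK, BikeFK;

ROLLBACK;
```

With this sample data there are three groups: (1, 100) with 2 rentals and a sum of 22, (1, 200) with 1 rental and a sum of 8, and (2, 100) with 1 rental and a sum of 10. The order of the result rows isn't guaranteed.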

Finally, when we use the GROUP BY statement in a query, we might want to filter and keep only the tuples whose aggregation operation results meet a certain condition. For example, to get only the people whose total rental price sum is greater than 100, we might think of using a WHERE clause with the following condition:

SELECT R.PersonFK, SUM(Price) AS PriceSum
FROM Rental R
WHERE PriceSum > 100
GROUP BY R.PersonFK;

SELECT R.PersonFK, SUM(Price) AS PriceSum
FROM Rental R
GROUP BY R.PersonFK
HAVING SUM(Price) > 100;

But if we use that condition in the WHERE clause, the DBMS will give us an error because we can’t impose conditions on the aggregation calculations in the groups in a WHERE clause. We also can’t refer to them with the alias we give them, since the alias is applied at the end of the query when the result is provided to the user.

So instead of using WHERE, when we want to implement this type of condition, we use HAVING. Instead of the alias, we use the expression SUM(Price) itself to refer to the sum of Price in each group. Using WHERE together with GROUP BY isn’t prohibited, though: WHERE filters the tuples of the FROM table before the grouping, so fewer tuples are grouped, while HAVING filters the groups afterwards.
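WHERE and HAVING can appear in the same query, each filtering at a different moment. The following sketch uses a hypothetical DemoRental table, and the rows and the thresholds (5 and 100) are assumptions for illustration:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  PersonFK INTEGER,
  Price    NUMERIC
);
INSERT INTO DemoRental VALUES
  (1, 1, 60),
  (2, 1, 70),
  (3, 2, 120),
  (4, 2, 3),
  (5, 3, 40);

SELECT PersonFK, SUM(Price) AS PriceSum
FROM DemoRental
WHERE Price >= 5          -- filters tuples before grouping (drops rental 4)
GROUP BY PersonFK
HAVING SUM(Price) > 100;  -- filters groups after the sums are calculated

ROLLBACK;
```

With this data, person 1 (sum 130) and person 2 (sum 120, since the rental priced at 3 was already filtered out) are returned, while person 3 (sum 40) is discarded by HAVING.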

ORDER BY

Finally, if we want to sort the tuples of a table, we can use the ORDER BY clause. It lets us specify one or more attributes on which the sorting is performed as well as a direction (which can be ASC or DESC for ascending or descending order, respectively).

SELECT *
FROM Person
ORDER BY Name ASC;

SELECT *
FROM Person
ORDER BY PersonID ASC, Name ASC;

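In sorting, certain attributes have higher priority. Those we place further to the left are used first, and the following attributes only decide the order between tuples that tie on the previous ones. This is what happens in this last query, which sorts the tuples of Person by their PersonID values and then by name.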
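Each attribute in ORDER BY can have its own direction. Here is a minimal sketch, where the DemoPerson table and its rows are assumptions for illustration:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoPerson (
  PersonID INTEGER PRIMARY KEY,
  Name     TEXT
);
INSERT INTO DemoPerson VALUES (1, 'Bob'), (2, 'Alice'), (3, 'Bob');

-- Sort by Name first; PersonID only breaks the tie between the two Bobs.
SELECT *
FROM DemoPerson
ORDER BY Name ASC, PersonID DESC;
-- Result order: (2, Alice), (3, Bob), (1, Bob)

ROLLBACK;
```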

So using all these clauses, we can start making SQL queries to get almost any type of result we need. As we have seen, queries are composed of a series of statements or clauses where each one performs a certain action on the tuples of a table.

These statements usually follow an order of appearance in the query that is important to follow to avoid DBMS errors. The order is as follows:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY

But at a low level, the execution of these statements or equivalent relational algebra operators follows a different order than the one we use when writing the query. It is as follows:

FROM
JOIN … ON
WHERE
GROUP BY
HAVING
SELECT
ORDER BY

First, data is fetched from a table with the FROM clause, which may need to perform certain JOIN operations between multiple tables to have the data ready. Then, the data is filtered using the conditions we set in the WHERE clause, if we use it. After that, the tuples are grouped if we use GROUP BY, and the resulting groups are filtered again if we use HAVING. Finally, the SELECT clause is applied to extract the attributes we are interested in from the final table, which we rename and order if necessary.
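As a rough illustration of this logical order, the query below uses every clause, and the comments number the steps as listed above. The demo tables, their rows, and the thresholds are assumptions for illustration. Note how the alias PriceSum can be used in ORDER BY (which comes after SELECT in the logical order) but not in WHERE or HAVING:

```sql
BEGIN;

-- Hypothetical demo tables; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoPerson (
  PersonID INTEGER PRIMARY KEY,
  Name     TEXT
);
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  PersonFK INTEGER,
  Price    NUMERIC
);
INSERT INTO DemoPerson VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO DemoRental VALUES
  (1, 1, 60), (2, 1, 70), (3, 2, 120), (4, 2, 30), (5, 3, 40);

SELECT P.Name, SUM(R.Price) AS PriceSum          -- 6. SELECT
FROM DemoRental R                                -- 1. FROM
  JOIN DemoPerson P ON R.PersonFK = P.PersonID   -- 2. JOIN ... ON
WHERE R.Price > 0                                -- 3. WHERE
GROUP BY P.Name                                  -- 4. GROUP BY
HAVING SUM(R.Price) > 100                        -- 5. HAVING
ORDER BY PriceSum DESC;                          -- 7. ORDER BY (alias is visible)
-- Result: (Bob, 150), then (Alice, 130)

ROLLBACK;
```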

So as you can see, when we write a SQL query, we must use the clauses in a specific order. But we should keep in mind that the DBMS, at the physical and storage level, doesn’t execute these statements in the same order we write them. In fact, we don't have to worry too much about this internal order because it’s transparent (that is, handled automatically and hidden) to the user. This means we don't have direct control over, or directly "see", how the execution of the clauses is carried out internally by the DBMS, although in PostgreSQL we can inspect the plan with EXPLAIN or EXPLAIN ANALYZE.

Regarding the internal execution order, the DBMS usually reorders, combines, or transforms the clauses into others, all while constructing a physical execution plan for the query. This involves generating a plan for the operations and internal resources needed to execute it optimally (hence the reordering).
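In PostgreSQL, the plan the DBMS builds can be inspected by putting EXPLAIN in front of a query. The sketch below uses a hypothetical DemoRental table (an assumption for illustration). The exact plan text depends on the PostgreSQL version, the data, and the available statistics, so no output is shown here:

```sql
BEGIN;

-- Hypothetical demo table; names and rows are assumptions for illustration.
CREATE TEMP TABLE DemoRental (
  RentalID INTEGER PRIMARY KEY,
  PersonFK INTEGER,
  Price    NUMERIC
);
INSERT INTO DemoRental VALUES (1, 1, 60), (2, 1, 70), (3, 2, 120);

-- Shows the estimated plan without running the query.
EXPLAIN
SELECT PersonFK, SUM(Price) AS PriceSum
FROM DemoRental
GROUP BY PersonFK;

-- Runs the query and adds the actual timings and row counts to the plan.
EXPLAIN ANALYZE
SELECT PersonFK, SUM(Price) AS PriceSum
FROM DemoRental
GROUP BY PersonFK;

ROLLBACK;
```

Because EXPLAIN ANALYZE really executes the statement, it's safer to wrap it in a transaction like this one when trying it on statements that modify data.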

This is important to know when constructing a query, as the way you write it can affect its efficiency, even though the DBMS can help by automating much of the optimization process. You don’t have to use all these statements in a query, of course. But those you do use should respect the order in which they should be written. Otherwise, the DBMS will likely end up throwing an error.

Views

To finish with DML, let's look at a possible application of queries when defining DDL elements in SQL. Originally, we saw that DDL statements allowed us to create databases, tables, and similar elements. One of them worth highlighting is views, which are virtual tables that let us abstract information from the tables in a database.

Our database is made up of a schema or set of tables where the information is stored, but we might need to "view" that information differently than how it's defined in the schema itself. For this, we define a view that lets us query that information from the database using a different structure than the one used to store it.

CREATE VIEW RentalOverview AS
SELECT P.PersonID AS PersonID,
  P.Name AS ClientName,
  EXTRACT(YEAR FROM AGE(CURRENT_DATE, P.Birth)) AS ClientAge,
  B.BikeID AS BikeID,
  B.Model AS BikeModel,
  R.RentalDate AS RentalDate,
  R.Duration AS RentalDurationDays,
  R.Price AS RentalTotalPrice
FROM Rental R
  JOIN Person P ON R.PersonFK = P.PersonID
  JOIN Bike B ON R.BikeFK = B.BikeID;

For example, in our database, we have the tables Rental, Bike, and Person, but for convenience or requirements, we might need to see all that information from the tables integrated into a single table with attributes (PersonID, ClientName, ClientAge, BikeID, BikeModel, RentalDate, RentalDurationDays, RentalTotalPrice).

By default, every time we want to see this integrated information, we would have to manually run a query (or several, depending on the circumstances) to get and integrate that information into a table.

But to simplify this process, there are views that allow us to define a "virtual" table containing the integrated information. So, whenever we need that integrated information, we can refer to the virtual table (and this is built using the query we would have had to run manually to construct it). This query is the definition with which we declare a view, and the view itself saves us from having to run it manually to get the integrated information.

That's why we create a new view in the database that acts as a virtual table, meaning it doesn't actually store any information. A view receives user queries just like a table does, but to resolve them, the DBMS has to fetch the information from the different base tables in the database.

So, as you can see in the view above, the virtual table RentalOverview is defined with a SQL query on the tables that do store information. So when we query RentalOverview, the DBMS is actually transforming our query using the view's definition to obtain the attribute ClientName, for example, which is defined as the name of the person who rented a bike.

In this specific case, our view is gathering all the information from the three tables into one, so when we query it, we have the complete information about the person, bike, and rental that occurred. We don't have to perform the JOINs ourselves, as they are part of the view's definition.

SELECT *
FROM RentalOverview;

When querying the virtual table, we’ll get information derived from the base tables, which is shown to us according to the schema we defined in the view. For example, in the database, the birth date of people is stored in the Birth attribute. But the view shows that data differently, displaying age instead of the birth date. Both refer to the same information but are viewed in different ways.
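Since a view stores no data of its own, changes in the base tables show up the next time the view is queried. This can be sketched with a small temporary view, where DemoPerson, DemoPersonOverview, and the rows are assumptions for illustration:

```sql
BEGIN;

-- Hypothetical demo table and view; names and rows are assumptions.
CREATE TEMP TABLE DemoPerson (
  PersonID INTEGER PRIMARY KEY,
  Name     TEXT,
  Birth    DATE
);
INSERT INTO DemoPerson VALUES (1, 'Alice', DATE '1985-07-12');

CREATE TEMP VIEW DemoPersonOverview AS
SELECT PersonID, Name AS ClientName
FROM DemoPerson;

SELECT COUNT(*) FROM DemoPersonOverview;  -- 1

-- Insert into the base table only; the view isn't touched or redefined.
INSERT INTO DemoPerson VALUES (2, 'Bob', DATE '1990-03-05');

SELECT COUNT(*) FROM DemoPersonOverview;  -- 2: the view reflects the new row

ROLLBACK;
```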

Database Administration

At the logical level where we implement the database with SQL, we need to perform ongoing database maintenance (in addition to data modeling, modification, and querying). This ensures that our data and services are available, optimizes query performance, and provides certain guarantees of security and integrity. This process is part of what is considered database administration, which is a task carried out by experts.

Database users

Before introducing the concept of administration, let’s talk about the different types of users that might use a database. Each of them has a certain objective, responsibilities, and competencies.

To start, we have the client user, who uses the services provided by the database. We can see this type of user as an average user of mobile or web applications, or on any platform, using a series of services that involve a database.

Then, we have the developer user, who is dedicated to technically implementing the infrastructure, both software and hardware, that supports the applications and services. Developer users are also responsible for defining the business logic of the database, its structure, requirements, and so on. In short, they follow the different design stages we saw at the beginning, especially the conceptual and logical design, although they don’t interact with the DBMS. They simply propose the schema that the data should follow for a specialist to implement on a DBMS.

This specialist is the database administrator user, who is responsible for implementing the logical design of the database on a DBMS. To do this, they perform tasks such as choosing the appropriate DBMS for the project in question, installing it, and keeping it updated. They create the database, tables, and other logical elements, manage the security of the DBMS by defining roles, permissions, and security policies, and monitor the database's performance to ensure its availability. They also provide technical support to other types of users and define data backup protocols.

So basically, the administrator is in charge of the implementation during the logical design stage, as well as subsequent stages of possible physical design and storage. They’re also responsible for maintaining the DBMS. Among all these tasks, one of the most critical is optimizing the queries users might make to the system and refining the schemas if necessary to improve performance.

Database metadata

So far, we have only considered that the database is responsible for storing information (data). This data, such as the tuples of the tables, is ultimately generated by the project or application that the database supports.

But in addition to this data, the database contains metadata used to manage it. Essentially, metadata describes another piece of data or provides additional information that helps organize it within the database. Here’s an example:

Name	Birth	Email
Alice Johnson	1985-07-12	alice.johnson@example.com
Bob Smith	1990-03-05	bob.smith@example.org
Carol Davis	1978-11-23	carol.davis@example.net
David Brown	2001-01-30	david.brown@example.com
Emily Wilson	1995-09-14	emily.wilson@example.co.uk

To understand the idea of metadata, we can introduce the concept of a schema as metadata. In a table, we have a table name, which is metadata that describes the table. This allows us to know which table we are referring to when using that name in a query or other situations.

Besides the name, all tables have a header composed of the names of the attributes located in the first row, which make up the table's schema. These names are used to refer to the attributes or columns, just as the table name is used to refer to the table itself as an object. So the schema is part of the metadata, as it provides meaning to the data stored in the columns, allowing them to be organized.

In other words, if we didn't have the first row with the attribute names, we would have no information about the stored data, as we would lack their semantics. This is precisely what the schema provides as metadata, which lets us manage them.

Apart from table and attribute names, tables usually have associated technical metadata from the DBMS. This metadata indicates which users own the table or have certain permissions to perform actions on it. It can also include the creation and last modification dates of the table, which help ensure data security, as well as existing connections and information about events or locks for managing concurrency.

The table as an object does not store its name and all metadata within itself, but rather in specific places within the DBMS. These specific places are reserved tables for the DBMS called dictionaries, or sometimes catalogs. They utilize the structured nature of the DBMS to store this metadata in a simple way, similar to the storage of the actual data.

Since these places are tables, they also have a name, schema, and metadata, stored in the DBMS in physical data structures, not in other tables. As for their schemas, they are specially referred to as metaschemas.

The metadata in a DBMS varies significantly depending on the specific DBMS we’re using. But in all of them, we’ll always find fundamental information about the database we have implemented, like its name, table names, schemas, constraints, and so on.

Specifically, in PostgreSQL, we can find them in the "schemas" pg_catalog and information_schema. Here, PostgreSQL refers to a "schema" as a logical container that holds certain tables, views, and similar elements of a database, where many of them are responsible for storing metadata. So a logical container is nothing more than a folder used to group elements to make them more hierarchical and organized.

On one hand, pg_catalog is the internal catalog of PostgreSQL, which means it contains all the information necessary to manage the DBMS's operation. But this catalog is very technical and dense, as it’s aimed at managing the entire operation of the system, involving a lot of details that aren’t always necessary for an administrator.

Because of this, there’s a standard abstraction of this logical container called information_schema, introduced with the SQL-92 standard. It primarily serves to abstract the specific details related to the DBMS's operation and to provide the database administrator with a series of views to better visualize and manage the metadata.

To know what pg_catalog contains, you can use psql commands like \dt pg_catalog.* to see its tables (or \dv pg_catalog.* to see its views). Among all the elements it contains, the most important are:

pg_catalog.pg_class: Stores metadata of database objects, such as tables or views, among others.
pg_catalog.pg_namespace: Stores the names of the schemas (logical containers) of the DBMS.
pg_catalog.pg_attribute: Stores the names of the attributes of tables or views, meaning their schemas, as well as their data types or user-defined domains.
pg_catalog.pg_type: Stores the default data types and user-defined types.
pg_catalog.pg_attrdef: Stores the default values defined for the attributes.
pg_catalog.pg_constraint: Stores the definitions of constraints on tables, such as PRIMARY KEY, UNIQUE, FOREIGN KEY, CHECK, and EXCLUSION, including information about the table they apply to (conrelid), the columns involved (conkey), the update and delete actions on foreign keys (confupdtype, confdeltype), and the name of the constraint (conname), among others.
pg_catalog.pg_stat_activity: Provides real-time information about active sessions on the PostgreSQL server.

As you can see, if we explore the content of pg_catalog, we’ll find that it’s very dense and detailed. That's why we have the standard alternative information_schema, which simplifies metadata management. It works similarly to pg_catalog, serving as a logical container that provides views of the DBMS tables we've seen before to abstract their functionality.

The most significant ones are:

information_schema.tables: Stores a list of all the tables and views in the database.
information_schema.columns: Stores metadata for all the columns of all tables and views.
information_schema.table_constraints: Stores a list of all table-level constraints (primary key, unique, foreign, check...).
information_schema.key_column_usage: Stores a list of columns involved in key constraints (primary, unique, or foreign).
information_schema.referential_constraints: Stores metadata about FOREIGN KEY constraints, such as actions triggered after a deletion or update, among others.

To query the information contained in all these tables or views, you can simply use queries as if you were retrieving data from any other user table. But keep in mind that many of them also contain metadata about the DBMS dictionary or catalog tables themselves, which can complicate understanding the results.

SELECT *
FROM information_schema.tables
WHERE table_name='rental';

SELECT *
FROM pg_catalog.pg_class
WHERE relname = 'bike';

SELECT *
FROM pg_catalog.pg_stat_activity;

/*Get metadata of the PRIMARY KEY constraints we named with "PK"*/
SELECT *
FROM pg_catalog.pg_constraint
WHERE conname LIKE '%pk%';

SELECT *
FROM information_schema.table_constraints
WHERE constraint_name LIKE '%pk%';
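These views can also be combined with each other. As a sketch, the following query joins information_schema.table_constraints with information_schema.key_column_usage to list which columns take part in each primary or foreign key. The schema name 'public' is an assumption (it's PostgreSQL's default schema), so adjust it if your tables live elsewhere:

```sql
-- List the columns involved in PRIMARY KEY and FOREIGN KEY constraints.
-- Assumption: the user tables are in the default 'public' schema.
SELECT tc.table_name,
       tc.constraint_name,
       tc.constraint_type,
       kcu.column_name
FROM information_schema.table_constraints tc
  JOIN information_schema.key_column_usage kcu
    ON kcu.constraint_name = tc.constraint_name
   AND kcu.constraint_schema = tc.constraint_schema
WHERE tc.table_schema = 'public'
  AND tc.constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
ORDER BY tc.table_name, tc.constraint_name, kcu.ordinal_position;
```

The query only reads metadata, so it doesn't modify anything in the database. Its result depends entirely on the tables and constraints you have created.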

Chapter 10: Database Design Process Example

So far, you’ve learned about the entire relational model and some basic SQL. Now you can create a relational database on the PostgreSQL DBMS, manage it, and perform queries on it. So let’s apply all this knowledge to a real-world use case.

Database Levels

To do this, we need to remember some of the different levels of the database design process. First, we have the analysis phase where we gather the project requirements from the end user or client. Then we create a conceptual design, which we subsequently transform into a logical design that we can implement on a DBMS.

These are the main levels we need to worry about here. But in addition to these, we have the physical level, which focuses on the internal representation of the logical model implementation of the database in the DBMS using DBMS objects like indexes. We also have the storage level, which is the closest to the hardware, and is mainly dedicated to organizing the disk files that implement the database functionality on the DBMS. Lastly, we also have the application design level that aims to provide the database as a service to the user.

We won’t cover these additional levels in this example due to their complexity and because they aren’t as closely related to the actual database design.

The Database Design Process

When faced with a real problem that requires designing a database, the first thing we need to do is gather as much information as possible from the user or client. We do this to formalize the requirements of the system we’re going to build.

We can interview the client, survey potential users of the service, or use other similar methods. In this case, we won’t directly perform any of these tasks. Instead, we’ll assume that we have certain requirements, and from them we’ve been able to construct an entity-relationship diagram that captures them and correctly models the domain of our system. Let’s say it looks like this:

As you can see from the diagram above (you can enlarge it by opening it in a new tab), the project we’ll work on in this example is an extension of the bike rental domain we’ve used so far.

In addition to a bike rental service, we’ll include other elements that may be present in a real world database model, such as vehicles, places, cities, and so on. We’ll also include actions that can be performed between these elements, like owning a car, residing in a city, booking a cruise trip, or getting a bus pass, among others.

When we’re building this diagram, our most significant decisions involve which concepts are modeled with entities, which are represented through relationships, and which aren’t worth including in our system.

From the entire domain, it's common to encounter a lot of information provided by the client or users that doesn't directly help us model the system, as they don’t expect it to be stored in the database. So all concepts related to information that is not intended to be stored persistently are usually not included in the design.

As for the other issues, they are very subjective, and there is no set of rules to follow to know unequivocally which concepts to model with entities or relationships – or even to determine the degree of these relationships (which in this context we will assume is always 2 to avoid complicating the design with relationships involving more than two classes).

Entity-Relationship to Logical Model

But to understand how we can and should make these design decisions, it's useful to understand the purpose of each entity in this entity-relationship diagram, as well as the meaning of the elements it comprises or relates to. We also need to understand how it’s been translated to the logical design level.

Before explaining each of the entities, below is the relational diagram we have after the entire logical design phase:

This diagram is what we will gradually build as we transform entities into tables. Make sure you have both this diagram and the entity-relationship diagram open in separate windows so you can refer to them during the following chapters. This will make everything we discuss easier to understand.

As you can see in the diagram above, it includes some modifications like foreign keys pointing to "loose" attributes such as Sanction.SanctionID, instead of the same attribute in the table of the diagram. This aims to prevent the foreign key arrows from crossing excessively. Although this isn’t a standard way to represent the relational logical model, as long as its meaning is specified it’s completely valid.

Some constraints aren’t modeled in the system due to their complexity, which we’ll see as we explain all the entities. That's why there are no notes included in the relational diagram, and we don’t indicate the attributes that can or can’t be NULL. These are helpful to show in the diagram, but it’s not required.

Lastly, during the explanations, we’ll show the SQL code used to create each table. You’ll find the SQL script for creating the entire database at the very end, after I’ve explained all the entities. This is because we’re not going to discuss the tables in the order they need to be created. That creation order has to respect referential integrity constraints, and creating the tables in a different order would cause errors in the DBMS.

Person entity

First, we have the entity Person, whose main goal is to model the existence of people in our system. It's important to note that in our domain, there are physical people, where each one is a physical entity that we can abstract through the concept of a person, which has a set of associated characteristics. In other words, even though there are many different people, they all share a set of characteristics that define them as people.

These characteristics are what we’ll model as the attributes of the Person entity. These can then be "instantiated," as we saw earlier, resulting in a set of entity occurrences – or in other words, specific people defined by the values of their characteristics or attributes.

To better understand this, we can translate this entity to the logical design level, where, being a single entity, we model it with a single table named Person with the corresponding attributes and data types that match the characteristics of people. In this way, the table schema will be the structure that defines "all people," like a template, while the specific people whose information we want to store in the system correspond to the tuples of the table, which will be inserted as we register people in the system.

For the attributes of the entity, we’ll include those that need to be stored persistently, such as name, date of birth, email, and so on. Among all of them, we choose PersonID as the primary key, which we assume holds the person's government ID. But to illustrate the concept of surrogate key in SQL, in the implementation on the DBMS, we’ll implement the PersonID attribute as a surrogate key instead of the person's actual ID (since both can uniquely identify each person). So each tuple in Person will have a unique and distinct value in that attribute, serving as a superkey, candidate key, and ultimately being selected as the primary key.

In addition to the attributes represented in the entity-relationship diagram, the table we use to model the Person entity has other attributes that help implement associations with other entities, specifically foreign keys.

If we look only at the entity-relationship diagram, we will see a series of associations that "leave" or "enter" the Person entity. In other words, all the relationships this entity has are 1-*, which means the maximum cardinalities on both sides are 1 and *, respectively. These maximum cardinalities tell us how many occurrences of the entities can be related to each other. So with this information, we can determine where to place the foreign keys and which attributes they should reference from specific entities.

In the case of Person, we have 12 associations with such multiplicities, of which only one is a relationship where the "many" side (the side with the maximum cardinality *) is in the Person entity itself. This means that to implement the association between Person and CruiseLine, for example, at the logical level, there should be a foreign key on the many side pointing to the entity on the 1 side. Otherwise, if we place the foreign key in CruiseLine and have it reference Person, its attribute could contain an arbitrary number of references to people, leading to the appearance of a repeating group.

On the other hand, the other 11 associations have the "1 side" in Person, indicating that there are 11 entities that must have a foreign key pointing to Person.

Thus, we know that Person has a foreign key pointing to CruiseLine, even though the attributes that make it up do not appear explicitly in the conceptual diagram. And, since the foreign key has to reference tuples from CruiseLine, it will consist of as many attributes as the primary key of CruiseLine, with the same data types, respectively.

This happens because the foreign key must uniquely reference tuples. So the values of the foreign key attributes should allow us to go to the CruiseLine table, look at the columns of its primary key attributes, and easily find the referenced tuple. So the foreign key in Person will have 2 attributes, not just one.

CREATE TABLE Person (
    PersonID SERIAL PRIMARY KEY,
    Name VARCHAR(32) NOT NULL,
    Birth DATE NOT NULL CHECK (Birth < CURRENT_DATE),
    Email VARCHAR(32) NOT NULL,
    Phone BIGINT NOT NULL CHECK (Phone > 0),
    Nationality VARCHAR(32) NOT NULL,
    NameFK VARCHAR(32),
    FoundationDateFK DATE,
    FOREIGN KEY (NameFK, FoundationDateFK) REFERENCES CruiseLine(Name, FoundationDate)
);

Furthermore, as you can see in its DDL, the attributes (NameFK, FoundationDateFK) that make up the foreign key don’t have the NOT NULL constraint. This is because the foreign key in Person may not reference any tuple from CruiseLine due to the minimum multiplicity of the association on the CruiseLine side (which, being 0, implies that a person might not be a customer of any cruise line).

Semantically, this association, implemented with the foreign key, represents the possibility that a person can be a customer of a certain cruise line, where if they aren’t a customer of any, their foreign key will be NULL.

At the same time, a cruise line does not necessarily have to have any customers, as it can be related to zero people at a minimum, according to the minimum multiplicity on the other side. So with both minimum multiplicities at 0, the association as a whole becomes optional, meaning it may not exist at all, as there is nothing that requires it to exist.
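As a rough illustration of this optional association, the statements below insert one cruise line, one person who is its customer, and one person who isn’t a customer of any cruise line. All names and values are invented sample data, and the sketch assumes the CruiseLine and Person tables from this section have already been created.

```sql
-- Invented sample data, used only for illustration.
INSERT INTO CruiseLine (Name, FoundationDate, ContactPhone, Rating)
VALUES ('Blue Horizon', '1998-04-12', 34911222333, 4.5);

-- A person who is a customer of that cruise line:
-- both foreign key attributes match the primary key of the CruiseLine tuple.
INSERT INTO Person (Name, Birth, Email, Phone, Nationality, NameFK, FoundationDateFK)
VALUES ('Ana Ruiz', '1990-06-01', 'ana@example.com', 34600111222, 'Spanish',
        'Blue Horizon', '1998-04-12');

-- A person who is not a customer of any cruise line:
-- the omitted foreign key attributes are stored as NULL.
INSERT INTO Person (Name, Birth, Email, Phone, Nationality)
VALUES ('Tom Baker', '1985-02-17', 'tom@example.com', 447700900123, 'British');
```

If the first person referenced a (Name, FoundationDate) pair that doesn’t exist in CruiseLine, the DBMS would reject the insert because of the foreign key constraint.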

If we look at the relational diagram, to represent this entity or table, it's enough to write it in Datalog notation, with its name and attributes. The only thing to keep in mind is that the attributes that make up the primary key are underlined, and those that represent foreign keys each have an arrow coming from them pointing to the corresponding attribute of the primary key of the entity or table they reference.

In cases like this where the foreign key is composite, each of its attributes has an arrow pointing to the corresponding attribute of the referenced entity. But the order in which the attributes are written in this diagram is not entirely relevant – meaning we can write them in any order as long as we correctly represent which are primary or foreign keys.

Regarding the DDL, since we will consider PersonID as a surrogate key, we declare it as SERIAL so the column stores auto-incrementing values. This way, to uniquely identify each tuple, the attribute will use an integer value that increases by one as tuples are inserted. This allows us to differentiate all of them by that number.
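To see the surrogate key in action, here is a minimal sketch that assumes a PostgreSQL-style DBMS, where both SERIAL and the RETURNING clause are available. The person’s data is invented. Note that PersonID is left out of the column list so that the DBMS generates it.

```sql
-- PersonID is not supplied: the SERIAL column takes the next value
-- from its underlying sequence.
INSERT INTO Person (Name, Birth, Email, Phone, Nationality)
VALUES ('Lia Chen', '1992-11-03', 'lia@example.com', 8613800138000, 'Chinese')
RETURNING PersonID;  -- returns the identifier generated for this tuple
```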

We’ll specify the primary key with PRIMARY KEY, which we can place directly in the attribute declaration if it’s not composite. We’ll specify the foreign key with FOREIGN KEY, indicating which attributes reference the primary key of CruiseLine.

The only thing to be careful about is the order of the attributes. Although you can arrange the foreign key in any order in FOREIGN KEY, in REFERENCES, we must ensure that the attributes of the primary key of CruiseLine are in the same order as those of the foreign key in order to be referenced correctly.

For example, if NameFK should reference Name, then those attributes will occupy the same position in the tuples where we declare the foreign key and the primary key it points to, without needing to appear in a specific position, as long as the correspondence is maintained.
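The following sketch illustrates this positional correspondence. PersonOrderDemo and DemoID are hypothetical names introduced only for this example, and it assumes the CruiseLine table already exists and that the DBMS (as PostgreSQL does) accepts the referenced columns in a different order than the one used in the primary key declaration.

```sql
-- Hypothetical table, used only to illustrate column ordering.
CREATE TABLE PersonOrderDemo (
    DemoID SERIAL PRIMARY KEY,
    NameFK VARCHAR(32),
    FoundationDateFK DATE,
    -- Valid: the lists pair up position by position
    -- (FoundationDateFK -> FoundationDate, NameFK -> Name).
    FOREIGN KEY (FoundationDateFK, NameFK)
        REFERENCES CruiseLine(FoundationDate, Name)
    -- Invalid pairing, left commented out: it would match NameFK (text)
    -- with FoundationDate (date), which have incompatible types.
    -- FOREIGN KEY (NameFK, FoundationDateFK)
    --     REFERENCES CruiseLine(FoundationDate, Name)
);
```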

Now let’s look at what a Person can do.

Rental entity

In our domain, people can rent bikes, and for each bike rental, we want to store certain information like the time the rental occurred, the duration in hours, the price per hour, and so on. So if we modeled this as an M-N association between Bike and Person, we couldn't store all this information unless we used an associative entity (which is only valid when the entity itself is weak in identification). But here we prefer to use a surrogate key to uniquely identify the rentals, which avoids making the entity representing them weak in identification.

This is necessary because each rental requires storing associated information, in addition to the person and the bike involved. So we’ll introduce an entity that relates to both Bike and Person through 1-* associations (each Rental associates a bike with a person), storing information about that "event." Then, as it has two associations with the many side in Rental, this entity will have two foreign keys, one to implement each association. One will reference the primary key of the Bike entity and the other the primary key of the Person entity.

Here, we need to distinguish between both foreign keys, as each is composed of one attribute, unlike the previous case where Person had only one foreign key composed of several attributes. That is, regardless of the attributes that comprise each foreign key, it’s important to distinguish that one aims to uniquely identify a bike while the other uniquely identifies a person.

CREATE TABLE Rental (
    RentalID SERIAL PRIMARY KEY,
    StartTimestamp TIMESTAMP NOT NULL,
    Duration INT NOT NULL CHECK (Duration >= 0), /*Duration of the rental period in hours*/
    HourPrice DOUBLE PRECISION NOT NULL CHECK (HourPrice >= 0),
    BikeFK INT NOT NULL,
    PersonFK INT NOT NULL,
    FOREIGN KEY (BikeFK) REFERENCES Bike(BikeID),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID)
);

When writing your DDL, the attributes are declared the same as before – the main difference here being that each foreign key has its own FOREIGN KEY constraint, which references the primary key attribute of the corresponding table. This is the case because here, both Bike and Person have primary keys with a single attribute.

Another important detail to consider is the minimum multiplicity on the Person and Bike sides in the associations of the conceptual diagram, where the 1 side of the associations has a minimum multiplicity of 1. This means that a Rental must always be associated with a person and a bike, so their foreign keys can never be NULL. This is why the NOT NULL constraint is used in the attributes.

As before, at the conceptual level, we don’t show the attributes that form the foreign keys, as the associations themselves and their cardinalities implicitly indicate the existence of foreign keys. But in the relational diagram, we do show these attributes, where arrows indicate the primary key attributes of other entities to which they point. And, since the entity is not weak in identification, none of the foreign key attributes should be underlined.

Regarding the other constraints, we don’t allow any attribute to be NULL, as it doesn't make sense for a timestamp to be null, for example, when it’s precisely the valuable information about a rental that we want to store in the database. The other attributes also have constraints like non-negativity, since the duration or the hourly rate can’t be negative amounts.

This way, if someone tries to insert negative values for these attributes, the DBMS will automatically reject them as invalid, since the actual duration and price can never be negative. In other words, the values for those attributes must be zero or greater to be accepted.
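Here is a short sketch of how these constraints behave on insertion. It assumes that a bike with BikeID = 1 and a person with PersonID = 1 already exist. The timestamp and prices are invented values.

```sql
-- Accepted: every value satisfies the constraints.
INSERT INTO Rental (StartTimestamp, Duration, HourPrice, BikeFK, PersonFK)
VALUES ('2024-05-01 10:30:00', 2, 3.50, 1, 1);

-- Rejected: Duration = -1 violates CHECK (Duration >= 0).
INSERT INTO Rental (StartTimestamp, Duration, HourPrice, BikeFK, PersonFK)
VALUES ('2024-05-01 10:30:00', -1, 3.50, 1, 1);

-- Rejected: BikeFK is omitted, so it would be NULL,
-- which violates its NOT NULL constraint.
INSERT INTO Rental (StartTimestamp, Duration, HourPrice, PersonFK)
VALUES ('2024-05-01 10:30:00', 2, 3.50, 1);
```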

CarOwnership entity

Another entity related to Person in the diagram – that is, representing something else a Person can do – is CarOwnership. This aims to model that people can have cars, whether bought, rented, or leased. For this, we use the same conceptual structure as with Rental, where a person can have multiple cars and a car can belong to many people.

As before, this implicit N-M association between Car and Person must store information about the ownership, such as its type, start date, price, and so on. So we’ll use an intermediate entity with 1-* associations towards both entities, with the 1 side on them.

CREATE TYPE CarOwnershipType AS ENUM('buy', 'rental', 'lease');
CREATE TABLE CarOwnership (
    InsuranceID SERIAL PRIMARY KEY,
    BuyDate TIMESTAMP NOT NULL, /*Date when ownership starts*/
    BuyPrice DOUBLE PRECISION NOT NULL CHECK (BuyPrice >= 0), /*Ownership price, if rental or lease, this price represents a monthly amount*/
    WarrantyEndDate DATE NOT NULL CHECK (WarrantyEndDate >= DATE(BuyDate)),
    OwnershipType CarOwnershipType NOT NULL,
    PlateFK VARCHAR(32) NOT NULL,
    PersonFK INT NOT NULL,
    FOREIGN KEY (PlateFK) REFERENCES Car(Plate),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID)
);

The table implemented at the logical level is very similar to Rental, as we have a surrogate key that uniquely identifies the tuples, thus preventing the entity from being weak in identification. You can see this directly in the conceptual diagram. There, we have an attribute marked with {id} that we provide with semantics equivalent to that of a surrogate key. This means we don't need its identification to depend on any other entity.

In other words, at the conceptual level, InsuranceID is a unique identifier provided by an insurance company. To generate it, they likely used a technique similar to SQL's SERIAL auto-increment, although it doesn't necessarily have to be that, as there are many others with very specific applications.

The value of InsuranceID might be provided to us when inserting tuples into our system, where this value would have to meet the primary key constraint and not repeat for any pair of possible tuples. But still, we decided to implement it with a SERIAL to make the generation of synthetic data for this database simpler.

Just keep in mind that, in a real situation, if we are provided with this value, we should avoid using SERIAL and save the identifier that each tuple has. Since InsuranceID is the primary key, no pair of tuples can have the same value in this attribute, but they can have the same start date, price, and so on.

In this table, to restrict the values that the attribute OwnershipType can take, instead of using a CHECK, we’ll create a new data type. We could have done this equally well with CREATE DOMAIN. But instead, we’ll use CREATE TYPE ... AS ENUM to show another way of defining the set of values an attribute can take. It defines the possible values for the attribute, representing an ownership where a person buys, rents, or leases a car. Finally, that enumerated type is assigned as the data type of the attribute.
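For comparison, here is a sketch of the two alternatives mentioned above, written for PostgreSQL. The table and domain names (CarOwnershipCheckDemo, CarOwnershipDomain, CarOwnershipDomainDemo) and the VARCHAR(8) length are assumptions made up for this example. They aren’t part of the actual schema.

```sql
-- Alternative 1: a CHECK constraint directly on the column.
CREATE TABLE CarOwnershipCheckDemo (
    DemoID SERIAL PRIMARY KEY,
    OwnershipType VARCHAR(8) NOT NULL
        CHECK (OwnershipType IN ('buy', 'rental', 'lease'))
);

-- Alternative 2: a reusable domain that carries the same restriction.
CREATE DOMAIN CarOwnershipDomain AS VARCHAR(8)
    CHECK (VALUE IN ('buy', 'rental', 'lease'));

CREATE TABLE CarOwnershipDomainDemo (
    DemoID SERIAL PRIMARY KEY,
    OwnershipType CarOwnershipDomain NOT NULL
);
```

All three approaches reject a value such as 'gift'. The main practical difference is reuse: a domain or an enumerated type can be shared by several tables, while a column-level CHECK has to be repeated in each one.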

We’ve implemented the most basic domain constraints and problem requirements here, which only involve the CarOwnership table itself. For example, we have those requiring the price to be non-negative and the warranty end date to be on or after the ownership start date.

On the other hand, we can see that the attribute BuyDate has been assigned a TIMESTAMP data type, which doesn't exactly match the attribute's name. In this example, such details aren’t as important, since the TIMESTAMP was declared this way to provide a time in addition to the date of purchase. But in a real project, you should be stricter with naming attributes according to their characteristics. This will help improve schema clarity and make database management easier.

Residence entity

A person can also reside in a city, so our database must be able to store information about a person's stay in a certain city. We’ll do this using the entity Residence, which functions similarly to the previous entities Rental and CarOwnership, but with some differences.

First, the attributes it stores are:

the start date of a person's stay in a city (which can’t be null because if the stay exists, it must have started on a date),
the end date of the stay (which can be NULL because the person may reside in the city for an indefinite amount of time), and
the address where they reside within the city.

When the EndDate attribute is NULL, it means the person is still residing in the city, as the end date of the stay is not defined. Similarly, if this date exists and is later than the current date, we know that the person will keep living in the city until the specified date.

This has implications for identifying the Residence entity, as there is no set of attributes within the entity itself that uniquely identifies the tuples of Residence. Instead, it’s the start date, along with the references to the person and city, that together uniquely identify it. So since the identification of the entity depends on other entities, Residence is a weak entity in terms of identification.

These references work similarly to what we saw earlier in Rental, for example, where we had several 1-* associations with the many side in the Residence entity. This implies that for each association, the foreign key is located in the Residence entity, pointing to the entity on the other side of the association.

Since there are two such associations in total, there are two foreign keys, each formed by one attribute, because the primary keys of the entities they point to are also formed by a single attribute.

CREATE TABLE Residence (
    StartDate DATE NOT NULL,
    EndDate DATE CHECK (
        EndDate IS NULL
        OR EndDate >= StartDate
    ),
    Address VARCHAR(32) NOT NULL,
    PersonFK INT NOT NULL,
    CityFK INT NOT NULL,
    PRIMARY KEY (StartDate, PersonFK, CityFK),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (CityFK) REFERENCES City(CityID)
);

If we look at the relational diagram, we’ll see that the table implementing this entity has its foreign keys underlined because they’re part of the primary key. This helps us identify that the corresponding conceptual entity is weak in identification, with some of its primary key attributes being foreign keys.

Also, if we wanted to reconstruct the conceptual entity from the relational diagram, it would be enough to look at the table's foreign keys, which other entities they reference, whether their attributes are underlined or not, and any possible constraints indicated in the relational diagram.

With this, if any of the foreign keys are underlined, the entity is necessarily weak in identification, and the «weak» role would be specifically placed on the association modeled by that foreign key. The many side of that association would be placed on the side of the entity from which the foreign key originates. And we wouldn’t include its foreign key attributes in the conceptual diagram entity.

In its DDL, we can see that the primary key is composed of StartDate along with the foreign key attributes, where each one represents a different foreign key pointing to a certain entity like Person or City – hence the addition of two FOREIGN KEY constraints. We’ve also added the NOT NULL constraint to both foreign keys due to the minimum multiplicity of the 1 side of the associations, which requires a Residence tuple to relate a person with a city. If we had 0..1 instead of 1..1 on those sides of the associations, then each foreign key of Residence might not reference any person or city, meaning it could be NULL.

Regarding the remaining constraints, no attribute can be NULL except EndDate. If it’s not NULL, then the date it stores must be on or after the date the residence began, as it wouldn't make sense for it to be earlier than the start date.
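The convention that a NULL EndDate means "still residing" can be used directly in queries. As a minimal sketch, assuming the Residence table above has been created and populated, this query lists the stays that are active today:

```sql
-- Stays that have started and have not ended yet:
-- either the end date is undefined (NULL) or it lies today or in the future.
SELECT PersonFK, CityFK, StartDate, EndDate
FROM Residence
WHERE StartDate <= CURRENT_DATE
  AND (EndDate IS NULL OR EndDate >= CURRENT_DATE);
```

The explicit IS NULL test is needed because comparing NULL with a date using >= never evaluates to true, so open-ended stays would otherwise be left out.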

ShipAssignment entity

Another entity that is practically the same as the previous one is ShipAssignment, responsible for modeling the assignment of certain cruise ships to cruise lines. That is, a cruise can belong to or be assigned to a cruise line that operates it under its brand for a certain period, just like a person can reside in a city for a certain period.

Being a weak entity in identification, as we can see in its conceptual diagram, we could have represented it with an associative entity and an N-M association between CruiseShip and CruiseLine. But to be consistent with the notation we used in Residence, we won’t use an associative entity. Instead, we’ll have the entity interpose in the N-M association, resulting in two 1-* associations with the many side in ShipAssignment.

This implicitly indicates that there are two foreign keys pointing to CruiseShip and CruiseLine, respectively.

Also, note that just focusing on the many side (which is an easy rule to apply to determine where to place foreign keys just by looking at the conceptual diagram) isn’t by itself a good practice without further consideration. When you have a conceptual diagram, you should look at all the elements of the entity to make informed and reasoned decisions about its logical design.

CREATE TABLE ShipAssignment (
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL CHECK (EndDate >= StartDate),
    NameFK VARCHAR(32) NOT NULL,
    FoundationDateFK DATE NOT NULL,
    ShipFK INT NOT NULL,
    PRIMARY KEY (StartDate, NameFK, FoundationDateFK, ShipFK),
    FOREIGN KEY (NameFK, FoundationDateFK) REFERENCES CruiseLine(Name, FoundationDate),
    FOREIGN KEY (ShipFK) REFERENCES CruiseShip(ShipID)
);

Here, we assume the end date of the assignment is always defined, meaning cruises are assigned to cruise lines through "contracts" that always start and end on specific dates, and assignments don’t last indefinitely. This implies that EndDate can never be NULL. So in the DDL, we include the NOT NULL constraint and a CHECK to ensure that EndDate is not earlier than the start date, guaranteeing that only valid tuples are inserted into the database.

One of the foreign keys is formed solely by the attribute ShipFK, which refers to the CruiseShip entity. We use it to reference the cruise assigned to a certain cruise line. The other foreign key, which implements the other 1-* association, is composed of the attributes (NameFK, FoundationDateFK), which refer to the primary key of CruiseLine. That primary key is itself composite and contains two attributes (Name, FoundationDate).

If we only look at the relational diagram, we’ll see that three attributes are part of foreign keys because there are arrows coming from them. Specifically, the arrow from one of them (ShipFK) will point to an attribute in a certain table. So we know that this attribute forms a foreign key by itself, while the other two have arrows pointing to attributes of another different entity (but both referencing the same one).

So together, they form another foreign key because the entity or table they reference is different from the one referenced by the other attribute ShipFK.

These attributes, in turn, serve to uniquely identify each tuple in ShipAssignment – because, with just the start and end dates, we can’t distinguish between any possible pair of tuples.

For example, if several ships are assigned to the same cruise line during the same time period, the start and end dates will match in both tuples, but they’ll represent different assignments even though the dates are the same. So the primary key of the table includes the attributes of the foreign keys, so that their values can distinguish any pair of tuples we might have in the table. Specifically, we include the foreign keys because a cruise can be or have been assigned to several cruise lines, just as a cruise line can have had multiple cruises assigned to it.

The values of both foreign keys are necessary to uniquely identify this "event" between a cruise and a cruise line, because according to the domain, the same cruise can’t be assigned multiple times to the same cruise line on the same date.

CREATE ASSERTION EveryCruiseLineHasAssignment CHECK (
    NOT EXISTS (
        SELECT *
        FROM CruiseLine cl
        WHERE NOT EXISTS (
                SELECT *
                FROM ShipAssignment sa
                WHERE sa.NameFK = cl.Name
                    AND sa.FoundationDateFK = cl.FoundationDate
            )
    ) 
);

Lastly, given the minimum multiplicity of 1..* on the ShipAssignment side, we need a constraint ensuring that every cruise line has at least one ship assignment, which in turn is always associated with a cruise. The assertion above expresses exactly this rule.

To enforce it, we can use either an ASSERTION like that one or a TRIGGER, as this is a constraint involving multiple tables. But for simplicity, we’ll assume that the data inserted always meets this constraint. This means we don’t need to include assertions and triggers in the final DDL.
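Keep in mind that CREATE ASSERTION is part of the SQL standard but isn’t implemented by PostgreSQL, so the assertion shown earlier serves mainly as a specification of the rule. If we only assume that the data meets the constraint, a simple way to check that assumption is to run the inner query of the assertion on its own. This is just a sketch of a manual check, not an enforcement mechanism:

```sql
-- Lists the cruise lines that violate the rule,
-- that is, those with no ship assignment at all.
SELECT cl.Name, cl.FoundationDate
FROM CruiseLine cl
WHERE NOT EXISTS (
    SELECT 1
    FROM ShipAssignment sa
    WHERE sa.NameFK = cl.Name
      AND sa.FoundationDateFK = cl.FoundationDate
);
```

An empty result means every cruise line currently has at least one assignment.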

Now let’s discuss some other important entities in our system.

City entity

This entity is similar to Person and is used to store information about cities in the system. Specifically, for each city, it stores its name, the country where it’s located, population, area, and coordinates in latitude and longitude. Each physical city in our domain within the system will be represented by a tuple in the City table, which is how we implement this entity at the logical level.

Of all the associations that this entity has, none are of the 1-* type with the many side in City. Instead, they all have their 1 side in City. This means there will be exactly 4 foreign keys from other entities pointing to City, but the City table itself won’t have any foreign keys pointing to another entity.

This might not be straightforward to see at the conceptual level, as we need to look at the maximum cardinalities of the associations to know which ones result in foreign keys in the entity we’re implementing.

In contrast, in the relational diagram, this is more direct. References implemented with foreign keys are arrows, so we can directly know how many arrows refer to the primary key of a certain table or how many point to other tables, also clearly indicating from which attributes and tables they originate.

CREATE TABLE City (
    CityID SERIAL PRIMARY KEY,
    Name VARCHAR(32) NOT NULL,
    Country VARCHAR(32) NOT NULL,
    Population INT NOT NULL CHECK (Population >= 0),
    Area DOUBLE PRECISION NOT NULL CHECK (Area >= 0),
    Latitude DOUBLE PRECISION NOT NULL CHECK (
        Latitude BETWEEN -90 AND 90
    ),
    Longitude DOUBLE PRECISION NOT NULL CHECK (
        Longitude BETWEEN -180 AND 180
    )
);

Regarding the DDL, we implement the identifier CityID with a SERIAL surrogate key, as at the conceptual level we have defined that the attribute CityID is the primary key of City.

It's important to note that when modeling a domain or solving a problem for a client, we might be required to use identifiers specific to the domain we are modeling, which would mean CityID would be of the same type as the identifier to be stored. But for simplicity, let’s assume that we construct the city identifiers ourselves using a surrogate key.

In addition to the NOT NULL constraints that prevent all attributes from being NULL, since it doesn't make sense for a city to have no name or no defined population, we impose a restriction on the range of values that Latitude and Longitude can take. This is to ensure the values are valid, even though we can't verify if they are correct, as this mainly depends on the data source.

To do this, we can use the BETWEEN operator, which performs the same check as Latitude >= -90 AND Latitude <= 90 but in a more readable way.
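One detail worth confirming is that BETWEEN includes both bounds. A quick check, using arbitrary test values:

```sql
SELECT
    90 BETWEEN -90 AND 90   AS on_upper_bound,    -- true: the bounds are inclusive
    90.5 BETWEEN -90 AND 90 AS above_upper_bound; -- false: outside the range
```

This is why a latitude of exactly 90 or -90 (the poles) is accepted by the CHECK in the City table.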

Port entity

In addition to cities, our domain also includes ports, which are represented by the entity Port. Like before, each port will be a tuple in the table, with its primary key composed of the port's name stored in the Name attribute and a foreign key that references City, modeling the city where the port is located.

We can infer the existence of this foreign key by looking at the entity's associations, where all of them are of the 1-* type, and only one has the many side in Port. This precisely models this relationship between Port and City. The others have their 1 side in Port, indicating that they point to Port, meaning they reference some tuple in the Port table.

At the same time, the foreign key of Port is also part of its primary key because a port can’t be identified by its name alone – we also need to know the city where it’s located.

For example, in this domain, we assume that there can be several ports with the same name, but not located in the same city. So if two ports are in the same city, according to the domain, we have the guarantee that their names can’t be the same. This allows us to define the primary key as the combination (Name, CityFK).

We’re making these assumptions here as an example, but in a real project they would need to be confirmed with domain experts and the client's requirements to ensure they are met. This would allow us to make design decisions such as establishing the keys of an entity. So once we know that Port has a foreign key that is part of its primary key, we know that the entity is weak in identification. In the relational diagram, we will have to underline not only Name but also the CityFK attribute.

CREATE TABLE Port (
    Name VARCHAR(32),
    TerminalCount INT NOT NULL CHECK (TerminalCount >= 0),
    MaxShipLength INT NOT NULL CHECK (MaxShipLength >= 0),
    Area DOUBLE PRECISION NOT NULL CHECK (Area >= 0),
    CityFK INT NOT NULL,
    PRIMARY KEY(Name, CityFK),
    FOREIGN KEY (CityFK) REFERENCES City(CityID)
);

The DDL is similar to the previous ones: we have the declaration of attributes and constraints like PRIMARY KEY, where the set of attributes (Name, CityFK) is defined as those that uniquely identify the tuples of Port. We also have the corresponding FOREIGN KEY that references the CityID attribute, the primary key of the City table.

A peculiarity of this CREATE TABLE statement is that we don’t add a NOT NULL constraint to the Name attribute because we don’t need to declare it explicitly in this case. That is, since Name is part of the primary key, and a primary key never allows NULL values in its attributes, we can skip declaring NOT NULL, as PRIMARY KEY does so implicitly to ensure the primary key integrity constraint.

This also applies to the foreign key attribute, which can’t be NULL due to the minimum cardinality (minimum cardinality 1 in 1..1) on the City entity side, which requires all ports to be associated with a city. But to more clearly reflect this minimum cardinality, we add NOT NULL explicitly to the CityFK attribute, even though it’s not strictly necessary.
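To make the behavior of this composite primary key more concrete, here is a small sketch with invented sample data. It assumes the City and Port tables above exist and that each city name used here appears only once in City, so the subqueries return a single CityID.

```sql
-- Invented sample cities (figures are illustrative, not real statistics).
INSERT INTO City (Name, Country, Population, Area, Latitude, Longitude)
VALUES ('Valencia', 'Spain', 800000, 135.0, 39.47, -0.38),
       ('Porto', 'Portugal', 230000, 41.0, 41.15, -8.61);

-- Accepted: the same port name in two different cities.
INSERT INTO Port (Name, TerminalCount, MaxShipLength, Area, CityFK)
VALUES ('Central Port', 3, 300, 2.5,
        (SELECT CityID FROM City WHERE Name = 'Valencia')),
       ('Central Port', 2, 250, 1.8,
        (SELECT CityID FROM City WHERE Name = 'Porto'));

-- Rejected: (Name, CityFK) duplicates an existing primary key value.
INSERT INTO Port (Name, TerminalCount, MaxShipLength, Area, CityFK)
VALUES ('Central Port', 1, 100, 0.5,
        (SELECT CityID FROM City WHERE Name = 'Valencia'));

-- Rejected: Name is NULL, and primary key attributes are implicitly NOT NULL.
INSERT INTO Port (Name, TerminalCount, MaxShipLength, Area, CityFK)
VALUES (NULL, 1, 100, 0.5,
        (SELECT CityID FROM City WHERE Name = 'Porto'));
```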

Lastly, if we want to ensure that the logical design represented in the relational diagram is correct with respect to the conceptual diagram, we can try reconstructing the conceptual entity from the table in the relational diagram.

To do this, after creating the entity with its name and attributes (except for those that are foreign keys), we have to infer the associations implemented through these foreign keys. So for each of them, we introduce an association that relates Port to the entity the foreign key points to, where the many side is on Port and the 1 is on the other entity's side.

In addition to the maximum cardinalities 1 and *, we also have to define the minimums, which we can determine through the constraints indicated in the relational diagram.

For example, if one of the foreign keys can be NULL, then its minimum cardinality on the 1 side of the association will be 0, resulting in that side having a cardinality of 0..1.

On the other hand, if it can’t be NULL, the minimum cardinality is 1. On the other side of the association, we’ll default to a minimum cardinality of 0 unless there are constraints requiring, for example, cities to have at least one port. In that case, the minimum cardinality would be 1, resulting in the Port side of the association having a cardinality of 1..*.

Finally, we can repeat this process with the foreign keys that point to the primary key of Port, leading to more associations with other entities.

For example, if we are reconstructing the conceptual entity City from the relational diagram, we will see that there is a foreign key from Port pointing to CityID of City. So City will have a 1-* association with Port, where the many side is on the Port side because the foreign key originates from Port.

In this way, when we have fully reconstructed the conceptual entity, we’ll determine if it’s weak in identification by checking if any of its foreign keys is underlined, which means it’s also part of the primary key. In that case, we’ll add the role «weak» to the associations that have arisen from these foreign keys, always on the side from which the foreign key originates.

CruiseLine entity

This entity is responsible for representing cruise lines in our system, which can have customers and cruises assigned. Conceptually, this entity is very similar to those we have already seen, as it has a primary key made up of two attributes of the entity itself, and no foreign keys pointing to other entities. But there are foreign keys in other entities that point to CruiseLine, which we can see from the 1-* type associations.

CREATE TABLE CruiseLine (
    Name VARCHAR(32) NOT NULL,
    FoundationDate DATE NOT NULL,
    ContactPhone BIGINT NOT NULL CHECK (ContactPhone > 0),
    Rating DOUBLE PRECISION NOT NULL CHECK (Rating >= 0),
    PRIMARY KEY (Name, FoundationDate)
);

Specifically, the primary key of this entity is made up of the company name and the foundation date. This combination of values might seem unique across the tuples we can store in the table, as it’s very unlikely that multiple cruise lines with the same name would be founded on the same date. But we shouldn’t make these assumptions ourselves – instead, we have to confirm with the client, target user, or domain experts of our system that these conditions actually hold.

Here, for simplicity, we directly assume that no cruise line has the same name as another founded on the same date, but you should always verify if this holds true in the domain.

So we set (Name, FoundationDate) as the primary key, which in turn imposes an implicit NOT NULL constraint on both attributes. This means we don’t strictly need to declare it, although the DDL above spells it out anyway. In the DDL, we can also see that the ContactPhone attribute is not of type INTEGER, but BIGINT. This is because phone numbers, once the country code is included, can be larger than the maximum value of a 4-byte INTEGER (2,147,483,647). For text-type attributes, a maximum length of 32 characters is used for all strings, which we assume is enough to accommodate any cruise line name or similar information.

We could also represent the phone number with a string, allowing the storage of the country code in text format, but this can complicate processing since the number would need to be parsed from text.
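For comparison, here is a minimal sketch of the text-based alternative. The table name CruiseLineTextPhone and the phone format (an optional leading + followed by 4 to 15 digits) are assumptions chosen for the example, not requirements from our domain.

```sql
-- Hypothetical variant of CruiseLine that stores the phone number as text.
CREATE TABLE CruiseLineTextPhone (
    Name VARCHAR(32) NOT NULL,
    FoundationDate DATE NOT NULL,
    -- Assumed format: optional leading '+', then 4 to 15 digits.
    ContactPhone VARCHAR(16) NOT NULL
        CHECK (ContactPhone ~ '^\+?[0-9]{4,15}$'),
    PRIMARY KEY (Name, FoundationDate)
);

-- The extra parsing step mentioned above: the text has to be cleaned
-- and cast before it can be treated as a number.
SELECT Name,
       CAST(REPLACE(ContactPhone, '+', '') AS BIGINT) AS PhoneAsNumber
FROM CruiseLineTextPhone;
```

The text version keeps the country code prefix and any leading zeros, at the cost of the cast shown in the query.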

Vehicle entity

In our domain, there can be certain types of vehicles, such as cars, cruise ships, bikes, or city buses. They all share a series of common characteristics like model, weight, color, or odometer reading to know how far they have traveled since they were manufactured.

These attributes are common to all vehicles in our domain, as they will always have a model name or weight, among other things, regardless of the type of vehicle they are. Because of this, we’ve decided to abstract these common characteristics in the conceptual design into a superclass entity called Vehicle. And from this, all entities representing specific types of vehicles must inherit.

In other words, at a conceptual level, we have an IS-A hierarchy where the parent entity is Vehicle, which contains all the characteristics that define all vehicles. From it, a series of entities inherit that represent specific types of vehicles (where each has more specific characteristics of the corresponding vehicle type).

In summary, we use an IS-A hierarchy because we need to model a situation where a series of "individuals" in our domain share a set of common features. Formally, an IS-A hierarchy can be defined as a specialization/generalization relationship between a superclass entity and some inheriting entities. The inheriting entities are composed of all the characteristics or attributes of the superclass plus some of their own attributes.

But, practically, what matters to us is that a hierarchy allows us to have a superclass (the Vehicle entity in this case) where we have attributes corresponding to these common characteristics, and then a series of entities that inherit from it and represent specific types of individuals (each having specific characteristics depending on their type).

With this, we gain clarity and maintainability in the diagram, as adding a new common characteristic to all vehicles only requires adding it to Vehicle – not to each and every inheriting entity. Similarly, if a new type of vehicle needs to be added to the system, we won’t need to include all the common attributes of vehicles in that entity.

How is this IS-A hierarchy implemented with tables?

At this point, we need to decide how to implement the hierarchy using tables in the logical model. Specifically, we have to determine the number of tables to use and the keys each will have concerning the implementation of the hierarchy itself.

To start, it's important to see that Vehicle has VehicleID as its primary key, which we assume is a surrogate key. With this, we know that if we had to implement any table for the inheriting entities, they should have a foreign key pointing to VehicleID, as it’s the primary key that can uniquely identify tuples of Vehicle.

We see that the hierarchy here is complete and disjoint. It’s complete because all the "individuals" in the hierarchy must always be represented by the inheriting entities. In other words, we will never find a vehicle that only has the attributes of Vehicle – instead, all vehicles in our domain are necessarily of one of the types defined in the inheriting entities (or so we assume). It’s disjoint because a vehicle can’t be of multiple types at once, meaning it can’t be both a car and a cruise ship, which makes sense.

All this means that each of them will be implemented with a specific table. Our system stores many types of vehicles and will likely need to expand with even more types of vehicles. To simplify this process of adding new types of vehicles and to avoid the appearance of too many NULL values in tables, we’ll implement a table for each inheriting entity of the hierarchy.

For the superclass, we’ll also implement a specific table, as each vehicle that exists in our system will be represented in one of the tables of the inheriting entities – but it’ll need to take values in the characteristics (attributes) of the superclass.

Here, we have several options. One option is not to implement a table for the superclass, duplicating all its attributes in each of the tables of the inheriting entities. This is easy to understand and initially seems practical, but it has significant drawbacks.

Another option is to implement a table for the superclass and include a foreign key in all the inheriting entities that point to the primary key of Vehicle.

We can easily dismiss the first option because duplicating attributes in all the tables for different types of vehicles leads to a lot of redundancy at the metadata level or schema, meaning duplicated attributes in multiple tables without a clear need for duplication.

Beyond the redundancy issue, duplicating the same attributes in multiple tables makes certain schema modifications more complicated. For example, adding an additional common feature in Vehicle would require adding an attribute in each table. Or we could change how common attributes like color are represented, such as switching color names from uppercase to lowercase (or any change in their representation). We’d need to make these changes across all the vehicle type tables.

With the other option, we implement a specific table for the superclass, avoiding these problems by centralizing the storage of common features in a single table. This makes it easier to perform the operations mentioned before, or even additional ones like counting how many vehicles are in our system.

We can easily do this by counting the tuples in the Vehicle table, instead of adding up the tuple counts from each of the tables for different types of vehicles. We can resolve this query this way because all vehicles will have a tuple in Vehicle that stores the common features, as well as one in their specific vehicle type table that stores the rest of the features defining it as a car, cruise, bike, and so on.

In this tuple, there’s a foreign key that references the tuple in the superclass table. This associates the information from both tuples, so we can query it and know all the information about a vehicle – both the features common to all vehicles and the specific ones of its type.
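To make the counting argument concrete, here is a sketch of both approaches. It assumes the one-table-per-entity implementation with the table names used in this chapter (Vehicle, CruiseShip, Bike, Car, and CityBus).

```sql
-- With a superclass table, every vehicle has exactly one row in Vehicle,
-- so a single count is enough.
SELECT COUNT(*) AS VehicleCount
FROM Vehicle;

-- Without it, the same figure needs one count per vehicle type table,
-- and the query has to change whenever a new type is added.
SELECT (SELECT COUNT(*) FROM CruiseShip)
     + (SELECT COUNT(*) FROM Bike)
     + (SELECT COUNT(*) FROM Car)
     + (SELECT COUNT(*) FROM CityBus) AS VehicleCount;
```

The two results only agree if every vehicle appears in exactly one of the type tables, which is what the complete and disjoint assumptions state.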

CREATE TYPE ColorType AS ENUM ( 'red', 'green', 'blue', 'yellow', 'black', 'white' ); 
CREATE TABLE Vehicle (
    VehicleID SERIAL PRIMARY KEY,
    Model VARCHAR(32) NOT NULL,
    Weight DOUBLE PRECISION NOT NULL CHECK (Weight >= 0),
    Color ColorType NOT NULL,
    Odometer DOUBLE PRECISION NOT NULL CHECK (Odometer >= 0)
);

Finally, we decide to implement a table for all entities in the hierarchy, using foreign keys in the tables of the inheriting entities to reference tuples in Vehicle that store the common features of the vehicles.

In its DDL, we can see that the primary key is implemented with an attribute of type SERIAL because it’s a surrogate key. For the Color attribute, we create a TYPE ENUM with the possible colors in our system. This is a good practice because if we need a color attribute in more areas of the system, we’ll have its domain (or data type) defined in ColorType. And this allows us to reuse it and ensure that all color attributes can take values from exactly the same data set.
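As a small sketch of that reuse, the hypothetical PaintJob table below (not part of the schema) takes its color from the same ColorType. The inserted VehicleID of 1 is also an assumption and only works if such a vehicle exists.

```sql
-- Hypothetical table that reuses ColorType for its own color attribute.
CREATE TABLE PaintJob (
    PaintJobID SERIAL PRIMARY KEY,
    VehicleID INT NOT NULL REFERENCES Vehicle(VehicleID),
    NewColor ColorType NOT NULL
);

-- Accepted: 'blue' is one of the labels declared in ColorType.
INSERT INTO PaintJob (VehicleID, NewColor) VALUES (1, 'blue');

-- Rejected if uncommented: 'purple' is not a ColorType label.
-- INSERT INTO PaintJob (VehicleID, NewColor) VALUES (1, 'purple');

-- Adding a label in one place makes it available to every column
-- declared with ColorType.
ALTER TYPE ColorType ADD VALUE 'orange';
```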

But if we try to reconstruct the IS-A hierarchy of the entity-relationship diagram just by looking at the implementation represented in the relational diagram, we’ll realize that it’s somewhat more complex than what we saw before.

This is because there’s not a single way to translate an IS-A hierarchy into a relational diagram. Depending on the semantics of the features and entities, plus the system requirements, it may be better to use more or fewer tables to implement it. But in cases like this where we have a table for each entity in the hierarchy, we can clearly see that there’s a Vehicle table with a primary key VehicleID (which is referenced by multiple tables, each having exactly the same foreign key referencing Vehicle).

If we only look at this, we might think that Vehicle is an entity that has 1-* associations with other entities – and this is entirely possible when looking only at the relational diagram.

But to derive the conceptual design from the logical one and infer the existence of an IS-A hierarchy, we have to focus on the semantics of the tables and attributes. That's where we'll see that Vehicle contains attributes common to all types of vehicles that have foreign keys pointing to Vehicle. This gives us clues that Vehicle could be the superclass of a hierarchy, and the rest of the tables with foreign keys pointing to Vehicle could be inheriting entities.

But inferring the existence of an IS-A hierarchy in the conceptual design simply by observing the logical implementation is not always unequivocal. This is because, for example, here we could perfectly consider that Vehicle is an entity associated with the other inheriting entities through 1-* type associations. This would also be correct from a conceptual and logical point of view.

Still, even though conceptually we can transform the hierarchy into a series of 1-* associations between Vehicle and the other entities, this is only true to the implementation if we implement one table per entity. Otherwise, we wouldn’t be correctly reflecting in the conceptual design what is actually implemented in the logical one.

In summary, when we see an IS-A hierarchy, it doesn't necessarily mean there are foreign keys between the inheriting entities and the superclass, because the hierarchy isn't always implemented with one table per entity. So to reconstruct a hierarchy at the conceptual level from the logical one, the most reliable thing to focus on is the constraints, notes, or indications left in the relational diagram explaining why certain tables were implemented – that is, where they come from.

Implementing a hierarchy at the logical level usually involves a series of design decisions that must be properly justified, which we can then use to infer the existence of the hierarchy at the conceptual level.

This exercise of trying to reconstruct the conceptual level is important to approach clearly, as understanding this reverse process is key to comprehending the elements of the different design levels and how they translate from one to another.

CruiseShip entity

To illustrate how we’ve implemented the IS-A hierarchy of Vehicle, let's look at the different inheriting entities that make it up.

First, we have CruiseShip, which models the existence of cruise ship-type vehicles in our system, where each cruise ship is a tuple in the table. Regarding the specific features of the cruise ship that make it a cruise ship-type vehicle, we have its length, passenger capacity, and the speed at which it travels. It also has the features that all vehicles must have, such as model, color, and so on. These are stored in the Vehicle table, in the tuple that holds the common feature values of each cruise ship.

To relate this information from both tables, there is a foreign key in CruiseShip that points to Vehicle, meaning it references the tuple in Vehicle where these feature values are stored, for each cruise ship (tuple of CruiseShip).

This way, we ensure that the attributes repeated in all vehicle types are centralized in one table where they can be easily modified or consulted, much better than having them all duplicated in the different tables of vehicle types.

CREATE TYPE ClassType AS ENUM('first', 'second', 'third', 'economy');
CREATE TABLE CruiseShip (
    ShipID SERIAL PRIMARY KEY,
    Speed DOUBLE PRECISION NOT NULL CHECK (Speed >= 0),
    Length DOUBLE PRECISION NOT NULL CHECK (Length >= 0),
    PassengerCapacity INT NOT NULL CHECK (PassengerCapacity >= 0),
    Class ClassType NOT NULL,
    VehicleID INT UNIQUE NOT NULL,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

In the DDL, we see that the attributes and their types are declared, where ShipID is defined as a surrogate key using the SERIAL data type. This allows us to uniquely identify each cruise ship. But since every cruise ship is also a vehicle, we could also identify it by making its primary key {VehicleID}, because this attribute, even though it’s a foreign key, will never be NULL since a cruise ship needs to have the features that classify it as a vehicle.

So the foreign key must reference a valid tuple in Vehicle where the values for the features common to all vehicles are stored. Consequently, VehicleID is an alternative key declared with the UNIQUE constraint, although we aren’t required to add this constraint since the surrogate key ShipID is sufficient to identify it.
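For comparison, this is roughly what the table would look like if we dropped the surrogate key and identified each cruise ship directly by VehicleID. The table name CruiseShipSharedKey is hypothetical. It's a sketch of the alternative described above, not the design we use.

```sql
-- Hypothetical alternative: the foreign key doubles as the primary key.
CREATE TABLE CruiseShipSharedKey (
    -- PRIMARY KEY already implies UNIQUE and NOT NULL.
    VehicleID INT PRIMARY KEY REFERENCES Vehicle(VehicleID),
    Speed DOUBLE PRECISION NOT NULL CHECK (Speed >= 0),
    Length DOUBLE PRECISION NOT NULL CHECK (Length >= 0),
    PassengerCapacity INT NOT NULL CHECK (PassengerCapacity >= 0),
    Class ClassType NOT NULL
);
```

In this variant, the foreign key is part of the primary key, so the table would be weak in identification with respect to Vehicle.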

The important thing about this attribute is to correctly define the NOT NULL and FOREIGN KEY constraints, ensuring it correctly references the primary key VehicleID of the Vehicle table.
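Here is a sketch of how the two tables work together in practice. The inserted values are made up for the example. The first statement stores the common features in Vehicle and reuses the generated VehicleID for the CruiseShip tuple, and the second joins both tuples back together.

```sql
-- Insert the common features first, capture the generated VehicleID,
-- then insert the cruise-ship-specific features that reference it.
WITH NewVehicle AS (
    INSERT INTO Vehicle (Model, Weight, Color, Odometer)
    VALUES ('Example Liner', 50000, 'white', 0)
    RETURNING VehicleID
)
INSERT INTO CruiseShip (Speed, Length, PassengerCapacity, Class, VehicleID)
SELECT 22.5, 300, 2500, CAST('first' AS ClassType), VehicleID
FROM NewVehicle;

-- Read back everything we know about each cruise ship:
-- common features from Vehicle, specific ones from CruiseShip.
SELECT v.VehicleID, v.Model, v.Color, v.Weight,
       s.ShipID, s.Speed, s.Length, s.PassengerCapacity, s.Class
FROM CruiseShip AS s
JOIN Vehicle AS v ON v.VehicleID = s.VehicleID;
```

Because VehicleID is declared NOT NULL with a FOREIGN KEY constraint, the second insert fails if it doesn't reference an existing Vehicle tuple.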

In the conceptual design, we see that this entity has multiple 1-* associations, which indicate that there are three foreign keys from other entities pointing to CruiseShip. But if we only have the conceptual design, we can’t say anything about the possible foreign key generated by the IS-A hierarchy. That is, if we only have the conceptual diagram, we can’t "guess" how many tables have been used to implement the hierarchy – we only know that after creating the logical design. At most, we could consider all possible options for implementing the hierarchy and, for each one, analyze whether there is a foreign key coming from CruiseShip.

But if in addition to the entity-relationship diagram we know that there is a foreign key originating from CruiseShip and pointing to another entity, then the entity it points to must necessarily be Vehicle. This is because 1-* type associations are elements that we know will generate foreign keys. But certain types of associations like 1-1 or 0..1-0..1 can lead to ambiguities, as we have seen before when trying to infer the existence of a hierarchy at the conceptual level.

So by discarding entities related through 1-* associations, the only option left would be Vehicle. With all this, we can also know that the implementation of the hierarchy at the logical level has been done by creating a table for the superclass and for the CruiseShip entity – but we couldn’t be sure whether the other entities have also been implemented with a table or not, as that heavily depends on the semantics.

Bike entity

Continuing with the different types of vehicles, we also have bicycles represented in the entity Bike, which inherits from Vehicle. Here, it’s clearer that the attributes of the inheriting entities are more specific than those of the superclass, as only bikes have features like FrameHeight or Foldable.

If we only look at the conceptual diagram, we can’t be certain if Bike has a foreign key pointing to Vehicle. This is precisely because, as mentioned before, without knowing the specific implementation at the logical level, we can’t guarantee that there is a foreign key in Bike. But considering the semantics of the hierarchy, we could propose the different options available for implementing it and determine in each case whether such a foreign key exists.

CREATE TABLE Bike (
    BikeID SERIAL PRIMARY KEY,
    Electric BOOLEAN NOT NULL,
    Foldable BOOLEAN NOT NULL,
    HasLights BOOLEAN NOT NULL,
    FrameHeight DOUBLE PRECISION NOT NULL CHECK (FrameHeight >= 0),
    VehicleID INT UNIQUE NOT NULL,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

Since we decided to use a table to implement each entity in the hierarchy, in this case, there is indeed such a foreign key, just as in CruiseShip. And we can see it declared in the same way, with the FOREIGN KEY constraint.

Also, the primary key of Bike is not the foreign key that uniquely identifies the vehicles. Instead, it’s the BikeID identifier. Here we’re assuming that our domain requires each type of vehicle to have its own identifier. That is, in addition to the VehicleID identifier that serves for any type of vehicle, each of these types must have its own type-specific identifier. This means that cruise ships, bikes, cars, and buses will each have a way to identify themselves, even though all of them can be distinguished from each other by the VehicleID identifier they possess indirectly through their foreign key referencing Vehicle. This is why the foreign key attribute is declared as UNIQUE.

And since this foreign key is not part of the primary key, the entity is not weak in identification. Even if it were, it wouldn’t be marked in any way at the conceptual level. This is because at that level, we have a hierarchy of entities that can be implemented at the logical level in many ways, and not all of them will have entities weak in identification.

To understand this, we can consider a simpler example of a hierarchy with only two inheriting entities (as you can see in the diagram above). If we only have the conceptual design, we still won't know which tables we'll use to implement the hierarchy – although we know we have several options, such as:

implementing or not implementing a table for the superclass
implementing a table to represent all inheriting entities, or just one table for each entity
or even more complex implementations like using a single table for the superclass and some of the inheriting entities.

Each of these options has its peculiarities. If we don't implement a table for the superclass, then there will necessarily be no foreign keys pointing to it. Or if we decide to create a table to represent the superclass and some inheriting entity together, then that inheriting entity won’t have any foreign key pointing to the superclass.
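To give an idea of what one of these alternatives looks like, here is a sketch of the single-table option for a simplified hierarchy with only two inheriting entities, cars and bikes. The names VehicleKind and VehicleSingleTable, and the reduced set of attributes, are assumptions made for the example.

```sql
-- Hypothetical discriminator type: which inheriting entity a row belongs to.
CREATE TYPE VehicleKind AS ENUM ('car', 'bike');

-- One table for the superclass and all inheriting entities.
CREATE TABLE VehicleSingleTable (
    VehicleID SERIAL PRIMARY KEY,
    Model VARCHAR(32) NOT NULL,
    Kind VehicleKind NOT NULL,
    -- Car-only attributes: NULL for bikes.
    Plate VARCHAR(32) UNIQUE,
    HorsePower INT CHECK (HorsePower >= 0),
    -- Bike-only attributes: NULL for cars.
    Foldable BOOLEAN,
    FrameHeight DOUBLE PRECISION CHECK (FrameHeight >= 0),
    -- Each row fills exactly the attributes of its own kind.
    CHECK (
        (Kind = 'car'  AND Plate IS NOT NULL AND HorsePower IS NOT NULL
                       AND Foldable IS NULL AND FrameHeight IS NULL)
     OR (Kind = 'bike' AND Foldable IS NOT NULL AND FrameHeight IS NOT NULL
                       AND Plate IS NULL AND HorsePower IS NULL)
    )
);
```

There are no foreign keys between the entities of the hierarchy here, but every row carries NULL values for the attributes of the other type, and the CHECK constraint grows with each new type of vehicle.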

Regarding weakness in identification, depending on whether we are required to have each type of vehicle with its own identifier, we could have a global identifier in the superclass, or as in our diagram, multiple identifiers, one for each type of vehicle in addition to the Vehicle superclass identifier that identifies any vehicle. So we see that weakness in identification does not always exist, as it mainly depends on the domain and the requirements of the problem.

For example, if we see identifiers in each of the inheriting entities, and we know that those identifiers can serve as primary keys, then this suggests that the inheriting entities may have been implemented with a table each, where their primary key does not include any foreign key attributes. But the conceptual diagram can’t guarantee this, as the existence or absence of a foreign key that may or may not be part of the primary key when implementing a hierarchy is a design decision specific to the logical level.

Another aspect we can infer from the conceptual diagram is the 1-* associations where the 1 is in one of the entities of the hierarchy. Necessarily, if any foreign key points to the superclass (the 1 side of the association is in the superclass), then a table must be implemented for it.

On the other hand, if it points to one of the inheriting entities, it’s not a sufficient condition to infer that there is a table for that inheriting entity. This is because there may be a table representing the superclass along with that inheriting entity, with the foreign key perfectly pointing to the identifier of that table.

So in the IS-A hierarchies of the conceptual model, the "weak" role is never used to indicate possible identification weakness that the tables implementing the entities might have. There are many ways to implement the hierarchy with tables, and the chosen method is not 100% determined by the conceptual diagram.

But it’s very important to be clear that the entities in the hierarchy can have associations with other entities that make them weak in identification. In that case, even though they are part of an IS-A hierarchy, the "weak" role would be used to indicate that the entity is weak in identification (but not due to the hierarchy – rather because of an association with another entity).

Car entity

Similar to the previous entities, we have Car, which represents the existence of cars in the system. Its primary key is Plate, which we assume is unique for each car. As we can see in the DDL, the data type of Plate is VARCHAR. This is perfectly valid, as attributes don’t have to be integers or numeric to be part of a primary key. Any data type can serve, as long as the attribute's values are unique across the tuples stored in the table.

CREATE TYPE CarFuelType AS ENUM ( 'gas', 'diesel', 'electric', 'hybrid', 'hydrogen' ); 
CREATE TABLE Car (
    Plate VARCHAR(32) PRIMARY KEY,
    FuelType CarFuelType NOT NULL,
    DoorCount INT NOT NULL CHECK (DoorCount >= 0),
    TrunkCapacity INT NOT NULL CHECK (TrunkCapacity >= 0),
    HorsePower INT NOT NULL CHECK (HorsePower >= 0),
    Doors INT NOT NULL CHECK (Doors > 0),
    AirConditioning BOOLEAN NOT NULL,
    VehicleID INT UNIQUE NOT NULL,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

Regarding the foreign keys related to this entity, we have the same one as before, which serves to reference a tuple of Vehicle that stores information about the car model, color, and so on. But looking at the conceptual diagram, we can see that there are two other foreign keys in other entities pointing to Car, since the 1 side of the corresponding 1-* associations is in Car.

Even though we can’t see this in the DDL of the Car table, these foreign keys are in other entities referencing Car. But we can see them in the relational diagram as arrows pointing to the attributes of the primary key of Car.

Lastly, we also create an ENUM TYPE to restrict the domain of the FuelType attribute. We could implement this perfectly well with a CHECK constraint, but to reuse this data type in other entities that might need it, we should define a DOMAIN or an ENUM TYPE (as in this case) that can be assigned as a data type to the attribute.

Also, defining a set of values this way is especially useful when the attribute holds text. For numeric attributes, it may be easier to restrict their possible values with conditions like (HorsePower >= 0).
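If the same numeric condition had to be repeated in many tables, a DOMAIN could package it in the same way the ENUM TYPE packages the fuel types. The following is only a sketch. NonNegativeInt and EngineSpec are hypothetical names that don't appear in our schema.

```sql
-- Hypothetical domain: an INT that can never be negative.
CREATE DOMAIN NonNegativeInt AS INT
    CHECK (VALUE >= 0);

-- Hypothetical table using the domain instead of repeating the CHECK.
CREATE TABLE EngineSpec (
    EngineSpecID SERIAL PRIMARY KEY,
    HorsePower NonNegativeInt NOT NULL,
    Cylinders NonNegativeInt NOT NULL
);
```

Whether this is worth it depends on how often the condition is reused. For a single attribute, the inline CHECK used in Car is shorter and easier to read.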

CityBus entity

To finish with the vehicle hierarchy, we have CityBus, which represents city buses in our domain. In this entity, we also have Plate as the primary key, which stores the bus's license plate and serves to uniquely identify it (meaning it differentiates it from any other city bus).

But the license plate does not directly differentiate it from other types of vehicles like cars or bikes, as the semantics of each attribute are different for each type of vehicle, as mentioned before.

For example, although cars and buses both have a license plate, if we try to differentiate cars from buses using their Plate attributes, we will see that cars may have a different license plate structure than buses, as determined by the domain and project requirements.

So to distinguish and uniquely identify them, we need to use the VehicleID identifier, since Plate is specific to the vehicle types Car and CityBus.
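Since VehicleID is the one identifier shared by every type, it's also what lets us check the complete and disjoint assumptions of the hierarchy. The query below is a sketch that assumes the one-table-per-entity implementation, where each type table has a VehicleID foreign key. It lists the vehicles that don't appear in exactly one type table.

```sql
-- Collect every (VehicleID, type) pair across the inheriting tables.
WITH TypedVehicle AS (
    SELECT VehicleID, 'CruiseShip' AS VehicleType FROM CruiseShip
    UNION ALL
    SELECT VehicleID, 'Bike' FROM Bike
    UNION ALL
    SELECT VehicleID, 'Car' FROM Car
    UNION ALL
    SELECT VehicleID, 'CityBus' FROM CityBus
)
-- 0 type rows breaks completeness; 2 or more breaks disjointness.
SELECT v.VehicleID, COUNT(t.VehicleType) AS TypeRows
FROM Vehicle AS v
LEFT JOIN TypedVehicle AS t ON t.VehicleID = v.VehicleID
GROUP BY v.VehicleID
HAVING COUNT(t.VehicleType) <> 1;
```

The UNIQUE constraint on each foreign key only prevents duplicates within one table. Nothing in the DDL stops the same VehicleID from being referenced by both Car and CityBus, so a check like this (or application logic) is what keeps the hierarchy disjoint.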

In addition to representing the existence of buses, this entity has 1-* type associations with Person and City that model the person driving each bus and the city where it operates. So in the conceptual diagram, we can see that the association with Person has the role “drives”. This indicates that a person can drive an arbitrary number of buses, while a bus is driven by one and only one person.

This association results in CityBus having a foreign key pointing to Person, allowing us to know, given a bus, the person who drives it by accessing the Person tuple referenced by the foreign key attribute.

Similarly, CityBus also has an association with City that represents the city to which each bus belongs and in which it operates. Conceptually, we can see this as each bus having to operate in only one city, and each city having an arbitrary number of buses operating in it, including none (since the minimum cardinality on the CityBus side is 0).

Logically, this is implemented with a foreign key in CityBus pointing to City. So if we need to know the city where a certain bus operates, we simply check the value of its foreign key, which will uniquely identify a tuple in City, indicating the city we are looking for.

CREATE TABLE CityBus (
    Plate VARCHAR(32) PRIMARY KEY,
    RouteNumber INT NOT NULL CHECK (RouteNumber >= 0),
    Seats INT NOT NULL CHECK (Seats > 0),
    FreeWifi BOOLEAN NOT NULL,
    VehicleID INT UNIQUE NOT NULL,
    DriverFK INT NOT NULL,
    CityFK INT NOT NULL,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID),
    FOREIGN KEY (DriverFK) REFERENCES Person(PersonID),
    FOREIGN KEY (CityFK) REFERENCES City(CityID)
);

In addition to the previous foreign keys, there is another one in the BusTrip entity that references CityBus, which we can see with the 1-* type association it has with that entity. This last one isn’t directly reflected in the DDL of CityBus, but it’s in the relational diagram where we have an arrow pointing to the primary key Plate. And, as usual, we do not add the NOT NULL constraint to the Plate attribute, since we are imposing the PRIMARY KEY constraint, which implicitly includes NOT NULL in all the attributes that comprise it.

The foreign keys for Person and City also can’t be NULL due to the minimum cardinalities of the associations, where having 1..1 implies that a city bus must have a driver and a city to operate in, hence NOT NULL is explicitly added in the DDL.
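As a sketch of how these two foreign keys are used, the queries below follow them from a bus to its city and driver, and then count buses per city. The plate value is made up, and only the key columns CityID and PersonID are selected because they're the only City and Person columns referenced in the DDL above.

```sql
-- From a bus to the city it operates in and the person who drives it.
SELECT b.Plate, b.RouteNumber, c.CityID, p.PersonID
FROM CityBus AS b
JOIN City AS c ON c.CityID = b.CityFK
JOIN Person AS p ON p.PersonID = b.DriverFK
WHERE b.Plate = 'BUS-0001';

-- Buses per city. The LEFT JOIN keeps cities with no buses,
-- matching the minimum cardinality of 0 on the CityBus side.
SELECT c.CityID, COUNT(b.Plate) AS BusCount
FROM City AS c
LEFT JOIN CityBus AS b ON b.CityFK = c.CityID
GROUP BY c.CityID;
```

The first query can use plain inner joins without losing any bus, precisely because both foreign keys are NOT NULL.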

CarRegistration entity

In our domain, cars can belong to a person through a record in the CarOwnership table. They can also be registered as fit to drive through CarRegistration, which associates cars with driver's licenses to model their legal registration. That is, a car can exist at any time, but to be able to drive, it must be registered and associated with a driver's license. This is why CarRegistration is dedicated to associating cars with driver's licenses.

The entity is very similar to some we have seen before, like Residence (while the entities it relates to here are different, as well as the reason for its existence). Implicitly, a car can be registered and associated with many driver's licenses, while the same driver's license can have an arbitrary number of cars associated with it. We can determine this by observing the cardinalities and navigability of the associations in the conceptual diagram.

For example, if we have a car, then by conducting an exhaustive search in the tuples of CarRegistration, we can find out how many records it’s in or has participated in. Also, for each of those records, we automatically know the driver's license it has been associated with – so from one car, we can learn about many driver's licenses.

Conversely, the same applies: if we have a certain license, we can indirectly find out by looking in the CarRegistration table how many records associate cars with that license. And for each of those records, we would obtain the associated car.

We’ve now analyzed navigability at the logical level. Previously, we saw the concept of navigability at the conceptual level, where associations could only be traversed in a certain direction depending on their cardinality. But in the logical model, we have access to all the tuples of all the tables in the database schema.

So, even though the Car-CarRegistration association is not navigable towards CarRegistration at the conceptual level, it is at the logical level. That is, if we have a car, we can find out which tuples in CarRegistration refer to that car, using the foreign keys that implement the association. With that information, we can then navigate to DrivingLicense once we know which tuples in CarRegistration pointed to the car.

This type of navigation is considered more typical of the logical level. With it, we can obtain information from other entities in a broader way than with the concept of navigation we saw at the conceptual level.

Here, on the entity-relationship diagram, we can see that there is an implicit N-M association between Car and DrivingLicense, which we just navigated through.

To do this, we had to go through the 1-* type associations, which are divided so that there can be an "intermediate" entity that stores information related to the N-M association, and to enable its implementation at the logical level. But we need to keep in mind the cardinalities of the 1-* associations that make up the implicit N-M association, where on the CarRegistration side we have optionality because the minimum cardinality is 0. This means that a car may not be registered, so there would be no tuple in CarRegistration referring to a certain car, thus preventing navigation to DrivingLicense.

This is completely valid because if a car is not registered, it won’t be associated with any driving license, and so we won’t be able to access any information from DrivingLicense.

Despite this, since there is a possibility that it’s registered, we consider these associations at the logical level to be navigable (all of this is equivalent to analyzing it in the opposite direction, from DrivingLicense to Car through CarRegistration).

For this process to be carried out, CarRegistration needs to have two foreign keys: one pointing to the primary key of Car to uniquely identify the car being registered, and another pointing to DrivingLicense to identify the driver's license associated with the car.

CREATE TABLE CarRegistration (
    RegistrationID SERIAL PRIMARY KEY,
    RegistrationDate DATE NOT NULL,
    ExpirationDate DATE NOT NULL CHECK (ExpirationDate > RegistrationDate),
    PlateFK VARCHAR(32) NOT NULL,
    LicenseFK INT NOT NULL,
    FOREIGN KEY (PlateFK) REFERENCES Car(Plate),
    FOREIGN KEY (LicenseFK) REFERENCES DrivingLicense(LicenseID)
);

When implementing the CarRegistration table, we see in its DDL that the primary key is declared as SERIAL because it’s a surrogate key. Also, the foreign keys all have the NOT NULL constraint due to the minimum multiplicity of 1 for the corresponding associations, which requires every tuple in CarRegistration to reference exactly one car and one driving license.
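The logical-level navigation described earlier can be sketched with two queries over this table, one in each direction. The plate and license values are made up for the example.

```sql
-- From a car to the driving licenses it has been registered with.
SELECT r.RegistrationID, r.LicenseFK, r.RegistrationDate, r.ExpirationDate
FROM CarRegistration AS r
WHERE r.PlateFK = 'ABC-1234';

-- From a driving license to the cars registered with it.
SELECT c.Plate, c.FuelType, r.RegistrationDate, r.ExpirationDate
FROM CarRegistration AS r
JOIN Car AS c ON c.Plate = r.PlateFK
WHERE r.LicenseFK = 42;
```

In both cases we start from the CarRegistration tuples and filter by one of its foreign keys, which is what makes the association navigable in either direction at the logical level.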

Regarding the information stored in the car registration, we mainly have the registration date or expiration date. Neither can be NULL, since we assume that in our domain, cars are always registered for a certain period that can later be extended through other registrations.

Here, we could have defined the CarRegistration entity as weak in identification, including both foreign keys in a primary key like {RegistrationDate, PlateFK, LicenseFK}. But for simplicity, a surrogate key is preferred, which simplifies database operations. In fact, the only advantage of not using the surrogate key would be saving the space occupied by the values of that additional column (and we could remove it if need be). But doing so would complicate the identification of CarRegistration tuples, as well as make certain queries less efficient and less readable.

And if we delve into the physical level, we would realize that having a primary key composed of more attributes would cause the DBMS to use more space to manage it. This would counteract the savings from removing the surrogate key – so the surrogate key remains the preferred option.

In summary, at the conceptual level, we’ve learned that navigation from Car to DrivingLicense is not entirely possible, as there is no foreign key in Car pointing to CarRegistration. But at the logical level, we can get information from CarRegistration because we can examine all the tuples of CarRegistration, allowing us to know which of them has their corresponding foreign key referencing the car we started from.

That is, conceptually, 1-* type associations are only navigable from the many side to the one side, but at the logical level, they are considered bidirectional. Also, the two 1-* associations surrounding CarRegistration implicitly give rise to an N-M association between Car and DrivingLicense, which is another reason why we can actually navigate from Car to DrivingLicense through CarRegistration.
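To make this navigation concrete, here is a minimal sketch that walks from a car to its driving licenses through CarRegistration. It's written with Python's built-in sqlite3 module, with SQLite standing in for PostgreSQL as an assumption. The tables are trimmed to their key columns, and the sample plates and the helper licenses_for are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; tables keep only their key columns.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Car (Plate TEXT PRIMARY KEY);
CREATE TABLE DrivingLicense (LicenseID INTEGER PRIMARY KEY);
CREATE TABLE CarRegistration (
    RegistrationID INTEGER PRIMARY KEY,
    PlateFK TEXT NOT NULL REFERENCES Car(Plate),
    LicenseFK INTEGER NOT NULL REFERENCES DrivingLicense(LicenseID)
);
INSERT INTO Car VALUES ('AAA-111'), ('BBB-222');
INSERT INTO DrivingLicense VALUES (1), (2);
INSERT INTO CarRegistration (PlateFK, LicenseFK)
VALUES ('AAA-111', 1), ('AAA-111', 2);
""")

def licenses_for(plate):
    """Navigate Car -> CarRegistration -> DrivingLicense for one plate."""
    rows = conn.execute(
        "SELECT DISTINCT L.LicenseID "
        "FROM CarRegistration R "
        "JOIN DrivingLicense L ON L.LicenseID = R.LicenseFK "
        "WHERE R.PlateFK = ? ORDER BY L.LicenseID",
        (plate,),
    )
    return [row[0] for row in rows]

registered = licenses_for("AAA-111")    # car with two registrations
unregistered = licenses_for("BBB-222")  # car that was never registered
```

The unregistered car simply returns an empty list, which matches the optionality we discussed: with no tuple in CarRegistration, there's nothing to navigate through.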

DrivingLicenseRequest entity

In our domain, people can request a driver's license from a public entity, which in this case doesn't matter to us – we only care that it’s responsible for accepting or rejecting these requests. If a request is accepted, it should become the driver's license of the person who requested it, while if it is rejected, it will remain in the database as a failed request.

To model this in our database, we have two main options:

creating a DrivingLicenseRequest entity with a boolean attribute Accepted to represent whether the request status is accepted or not, or
creating an IS-A hierarchy as seen in the conceptual diagram, where we have a superclass DrivingLicenseRequest dedicated to recording all requests that exist or have existed. In turn, we have inheriting entities that are created once the request has been resolved, with one entity representing accepted requests and the other modeling those that are rejected.

On one hand, using a single entity with attributes that determine its status is not the best option, because besides knowing if it has been rejected or not, the request can be in process. This would mean that it’s neither accepted nor rejected yet.

This causes multiple problems that complicate implementation, such as needing to make the Accepted attribute NULL until the request has been resolved, or even using this NULL value to represent the request's status. This "mixes" the semantics of the Accepted attribute with the representation of the request's status. This is not necessarily a serious problem, just a lack of clarity in representing the status and outcome of a request.

This option would also generate NULL values in the specific attributes of rejected or accepted requests, since each of them requires specific attributes that the other type does not have (such as the number of points for an accepted driver's license). So with this option, besides distinguishing the type of request, you would need to manage the NULL values in all attributes that don’t correspond with the type represented in the Accepted attribute. And this greatly complicates the semantics and operations when managing the data, as well as wasting unnecessary space storing these NULLs.

On the other hand, using an IS-A hierarchy to conceptually model driver's license requests can bring other disadvantages, such as greater complexity in the schema from the high number of tables that can be generated. You can also have more complexity when adding constraints to ensure that a request is not accepted and rejected at the same time. Or you can even have data fragmentation across multiple tables, where part of the information is stored in a superclass and the rest in an entity representing the specific type of request, whether accepted or rejected.

Still, using an IS-A hierarchy solves all the problems we saw with the previous option, providing a simpler and more consistent semantics with which we can operate on the database more easily. It also keeps us from having to worry about managing NULL values in certain attributes or the consistency between attributes that determine the request's status, as there are none of those here.

Thus, in the conceptual model, we have an IS-A hierarchy representing these requests, where a superclass represents requests that have just been created and are in process. In inheriting entities, these same requests are represented once they have been resolved. If they are rejected, they become a specific type RejectedDrivingLicense, and if accepted, another type called simply DrivingLicense.

In other words, at the conceptual level, we can view each request as an "individual" that can be found in the set of entity occurrences of the superclass, indicating that the request is in process. When it’s rejected or accepted, that individual then belongs to the set of the corresponding inheriting entity.

How can we model the driving license requests hierarchy at the logical level?

As we have seen before, an IS-A hierarchy doesn’t have a direct and unique translation at the logical level, as its semantics and domain requirements determine which possibilities are better or worse in aspects like query efficiency, ease of management, and so on.

So to translate this entity hierarchy formed by the superclass entity DrivingLicenseRequest and the inheriting entities RejectedDrivingLicense and DrivingLicense into tables in a DBMS, we need to analyze its characteristics to determine what implementation best suits the domain we’re modeling. We also need to analyze the other entities and the associations that connect with them, such as the association between Person and DrivingLicense, which models the relationship between a person and their driving license.

The first thing we need to check is whether the hierarchy is complete or not. In this case, there can be requests in process that haven’t been accepted or rejected, and so aren’t represented in any of the inheriting entities.

Since there are requests that don’t necessarily need to be represented by any inheriting entity, we see that the hierarchy is not complete (partial). Given that it’s not complete, the only way for our database to correctly store those requests that are in process is to create a specific table for the superclass DrivingLicenseRequest. Without it, it would be more complicated to know when a request has been resolved or not.

Later, knowing that all requests are stored in DrivingLicenseRequest, our system must be able to store information that determines whether it has been resolved or not, as well as the result of its resolution. For this, when a request is resolved and accepted, an occurrence of the entity DrivingLicense is created. But if it’s rejected, an occurrence of the other inheriting entity is created.

So in no case will the request be represented by occurrences of multiple inheriting entities at the same time, so the hierarchy is disjoint. This ensures that the previous decision to implement a table for the superclass is the correct option.

To translate the inheriting entities to the logical level, we need to decide whether to implement a table for each one, a single table for both, or even a table with all the entities in the hierarchy.

The most important thing to consider in this decision is the number of attributes each entity has and the tuples expected to exist in the database if we implement a table for each entity. In other words, we need to consider how many occurrences of each entity are expected to exist in the domain.

Initially, we can assume that there are always more rejected requests than accepted ones, as a request is very likely to be rejected at least once before being accepted. With this, we could decide to implement a table for the DrivingLicenseRequest and RejectedDrivingLicense entities together (since there are more rejected than accepted) and another for DrivingLicense that has a foreign key pointing to that table. But this would generate NULL values in the attributes from RejectedDrivingLicense when representing accepted driving licenses.

Since implementing the entire hierarchy with a single table also leads to too many NULL values in the attributes when representing accepted or rejected licenses, the best solution in this case is again to implement a table for each entity in the hierarchy.

The main reason for choosing this option is the number of NULL values generated when representing accepted or rejected licenses. In general, if the inheriting entities had only one attribute, then it would be clear that it could be implemented more simply with a single table for the entire hierarchy, or two.

But when there is more than one attribute, the queries become especially complicated because we need to check if several attributes are NULL at the same time (and manage the database and the possible extension of the hierarchy to more types of requests).

CREATE TABLE DrivingLicenseRequest (
    LicenseID SERIAL PRIMARY KEY,
    RequestDate DATE NOT NULL,
    Fee DOUBLE PRECISION NOT NULL CHECK (Fee >= 0),
    PersonFK INT NOT NULL,
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID)
);

Once we decide to use a table for each entity in the hierarchy, we need to reflect this decision in both the relational diagram and the SQL DDL. We made this choice mainly because the hierarchy is incomplete and disjoint, and because its inheriting entities have multiple attributes that would lead to too many NULL values.

So in the relational diagram, we create the corresponding tables, where DrivingLicense and RejectedDrivingLicense add foreign keys pointing to DrivingLicenseRequest to identify the request that has been rejected or accepted.

In other words, all requests are stored in the superclass table. Then when they are accepted or rejected, a tuple is added to the corresponding table so that its foreign key references the DrivingLicenseRequest tuple representing the request itself. This way, the superclass table is dedicated to storing requests, while the other tables focus on representing which requests have been rejected or accepted.
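As a rough illustration of this layout, the sketch below derives each request's status purely from which table holds a matching tuple. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption), trims the tables to a few columns, and uses made-up sample dates.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; columns are trimmed for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE DrivingLicenseRequest (
    LicenseID INTEGER PRIMARY KEY,
    RequestDate TEXT NOT NULL
);
CREATE TABLE DrivingLicense (
    LicenseID INTEGER PRIMARY KEY REFERENCES DrivingLicenseRequest(LicenseID),
    ApprovalDate TEXT NOT NULL
);
CREATE TABLE RejectedDrivingLicense (
    LicenseID INTEGER PRIMARY KEY REFERENCES DrivingLicenseRequest(LicenseID),
    RejectionDate TEXT NOT NULL
);
INSERT INTO DrivingLicenseRequest
VALUES (1, '2024-01-10'), (2, '2024-02-01'), (3, '2024-03-05');
INSERT INTO DrivingLicense VALUES (1, '2024-01-20');          -- request 1 accepted
INSERT INTO RejectedDrivingLicense VALUES (2, '2024-02-15');  -- request 2 rejected
""")

# A request's status is read from the subclass tables, not from a status column.
statuses = dict(conn.execute("""
SELECT Q.LicenseID,
       CASE WHEN A.LicenseID IS NOT NULL THEN 'accepted'
            WHEN R.LicenseID IS NOT NULL THEN 'rejected'
            ELSE 'in process'
       END
FROM DrivingLicenseRequest Q
LEFT JOIN DrivingLicense A ON A.LicenseID = Q.LicenseID
LEFT JOIN RejectedDrivingLicense R ON R.LicenseID = Q.LicenseID
"""))
```

Request 3 has no tuple in either inheriting table, so it shows up as still in process. This is the partial hierarchy in action.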

Regarding the foreign keys pointing to or present in any of the tables in the hierarchy, we can see that to know which person a certain request belongs to, there is a foreign key in DrivingLicenseRequest pointing to Person. So for every request, that foreign key indicates the person associated with that request.

On the other hand, given the associations we see in the conceptual diagram, there are two other foreign keys from other entities pointing to DrivingLicense. We need to consider all of this because it can affect the decision we made earlier about how to implement the hierarchy. If there are foreign keys pointing to the superclass, for example, we would necessarily have to implement a table for it.

Finally, we identify the requests using a surrogate key in DrivingLicenseRequest, which uniquely identifies all requests, regardless of their status. We can also see this in the inheriting entities, which don’t have any type of identification on their own, but are assumed to be identified by the primary key of DrivingLicenseRequest.

In other words, even though there is no clear identifier in the inheriting entities, it’s important to remember that the attributes of the superclass are inherited. So when we’re implementing the hierarchy at the logical level, no matter how we do it, we will always do it in such a way that each accepted or rejected request can be identified by the primary key of DrivingLicenseRequest. This is the table that stores the resolved request.

RejectedDrivingLicense entity

Continuing with the previous hierarchy, given its implementation, we have the table RejectedDrivingLicense, which represents its corresponding entity. Its tuples will store information regarding the rejection of requests that have been denied, such as the date or reason for rejection.

Also, to know which request each tuple's information corresponds to, there is a foreign key pointing to DrivingLicenseRequest, specifically referencing the tuple in that table that stores the rest of the request information (including the primary key that identifies it).

To avoid having to include a surrogate key in this table or define a primary key from the entity's attributes, we’ll choose the primary key to be the foreign key itself. This in turn references the primary key of the superclass table that uniquely identifies all requests, regardless of their status.

This means that the RejectedDrivingLicense table is weak in identification, as it requires the primary key of the owning entity DrivingLicenseRequest to identify it. But as we’ve seen before, this shouldn’t be reflected in the conceptual diagram because this way of implementing the hierarchy is not always unique. So depending on how we do it, the table may cease to be weak in identification.

In summary, the variability that exists when implementing an IS-A hierarchy with tables means that concepts like identification weakness aren’t indicated in the conceptual diagram, as they are only generated by certain implementations.

Generally, if we only look at the diagram, all we can do is consider all the options available for implementing the hierarchy, analyze each one, and even decide which one to implement in the end. But this is a decision we make and doesn’t imply what we have represented in the hierarchy of the diagram.

In other words, from the diagram, it’s impossible to infer which specific logical implementation has been used, although in very specific cases, it can be easier and more straightforward to "guess."

For example, say we have a hierarchy with only one superclass Request that is pointed to by a foreign key and an inheriting entity AcceptedRequest that’s very similar to the one in our domain. We can see at the conceptual level that the hierarchy is incomplete, as there may be requests that have not yet been accepted. It’s also disjoint, which in this case makes analyzing this aspect of the hierarchy irrelevant given the number of inheriting entities it has.

So, since it’s incomplete, we’ll need a table for the superclass. Also, to avoid the appearance of NULL values if the hierarchy is implemented with a single table, we’ll use a table for the inheriting entity.

But if the hierarchy were complete, we would only have one clear way to implement the hierarchy: with a single table. This is because there would never be NULLs in the attributes of AcceptedRequest, despite the option of using a table for each entity and complicating the database logic by inserting a tuple in each table for each request.

With this example, we see that in very specific cases, it’s possible to infer the clearest way to implement a hierarchy at the logical level, even though there will always be some variability that prevents us from "guessing" the exact implementation chosen for the DBMS. Finally, it’s also important to consider the identification of the tuples with which we model the hierarchy, where multiple options also arise.

On one hand, if the domain or requirements dictate that certain entities need to have their own identifiers, then we’ll have to define them as primary keys of the corresponding entities. In our domain, all requests must be uniquely identified by an attribute in DrivingLicenseRequest, so we add a surrogate key to that entity.

If the requirements tell us that some of the attributes of an entity serve as an identifier, then we’ll use them as the primary key – but here for simplicity, we assume there is no domain-specific identifier, and we are the ones adding the surrogate key as an identifier to store the data in our system.

On the other hand, if we don't have information on how the entities should be identified, then we have the freedom to do so however we want, mainly depending on the implementation chosen in the end.

But regardless of the source of this identification, in general, it all comes down to whether or not each inheriting entity can be identified by its own attributes. This determines if the table it converts to at the logical level is weak in identification or not – because if we ultimately decide to define a primary key for an inheriting entity, then we will necessarily implement it with a concrete table.

As you can see, the identification of each entity can give us clues about how the hierarchy will be implemented, but it’s not something unequivocal that always guarantees a single way to implement it.

Sometimes, it’s our task to define how we identify them, and that will depend on how many tables we choose for the implementation and how we associate them with each other.

CREATE TABLE RejectedDrivingLicense (
    LicenseID INT PRIMARY KEY, -- reuses the request's LicenseID, so it must not be auto-generated
    RejectionDate DATE NOT NULL,
    ReapplicationDate DATE NOT NULL CHECK (ReapplicationDate >= RejectionDate),
    Reason VARCHAR(32) NOT NULL,
    FOREIGN KEY (LicenseID) REFERENCES DrivingLicenseRequest(LicenseID)
);

In the DDL corresponding to this entity, we see that a specific table is created for it with the respective attributes shown in the conceptual diagram. Also, we include one that serves as a foreign key to reference the tuple of DrivingLicenseRequest that represents the rejected application (and also serves as the primary key of this table).

We could include a surrogate key here as well, but since we already have the LicenseID value from the superclass table, we don’t need to do so (and the domain doesn’t require us to identify rejected applications in a special way).

So we add the PRIMARY KEY and FOREIGN KEY constraints to that attribute at the same time, so it can’t contain NULL values because of the implicit NOT NULL restriction added by PRIMARY KEY. It must also reference the primary key LicenseID attribute of DrivingLicenseRequest.

To reflect this in the relational diagram, we can just underline the foreign key attribute, indicating that the table is weak in identification and that attribute can’t take NULL values. But to clearly indicate that other attributes that aren’t part of the key (like RejectionDate) can’t be NULL, we need to use other elements like margin notes or any other technique that clearly reflects this condition.

CREATE ASSERTION RejectionDateConstraint CHECK (
    NOT EXISTS (
        SELECT *
        FROM RejectedDrivingLicense R
            JOIN DrivingLicenseRequest D USING (LicenseID)
        WHERE R.RejectionDate < D.RequestDate
    )
);
CREATE ASSERTION ApprovalDateConstraint CHECK (
    NOT EXISTS (
        SELECT *
        FROM DrivingLicense D
            JOIN DrivingLicenseRequest R USING (LicenseID)
        WHERE D.ApprovalDate < R.RequestDate
    )
);

Finally, although we can define some constraints on the table itself – such as that the reapplication date can’t be earlier than the rejection date – there are other constraints that a table-level CHECK can’t express, like requiring that the rejection date isn’t earlier than the request date stored in DrivingLicenseRequest.

These types of constraints involving information from multiple tables need to be implemented with assertions or triggers. The simplest option is to use assertions as shown above, but PostgreSQL doesn’t implement the CREATE ASSERTION statement yet, so attempting to define them will result in an error from the DBMS.

So we’ll choose not to implement these types of constraints at this point, assuming that the inserted data already meets them for simplicity.
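If you’d like to see what the trigger alternative could look like, here is a minimal sketch of the rejection date rule. It uses Python's built-in sqlite3 module and SQLite's trigger syntax as a stand-in (an assumption). In PostgreSQL, the same idea would need a trigger function, typically written in PL/pgSQL. The trigger name, the helper try_reject, and the sample dates are all hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; dates are ISO-8601 text,
# which compares correctly as plain strings.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE DrivingLicenseRequest (
    LicenseID INTEGER PRIMARY KEY,
    RequestDate TEXT NOT NULL
);
CREATE TABLE RejectedDrivingLicense (
    LicenseID INTEGER PRIMARY KEY REFERENCES DrivingLicenseRequest(LicenseID),
    RejectionDate TEXT NOT NULL
);
-- Hypothetical trigger: refuse a rejection dated before its request.
CREATE TRIGGER RejectionDateCheck
BEFORE INSERT ON RejectedDrivingLicense
WHEN NEW.RejectionDate < (SELECT RequestDate
                          FROM DrivingLicenseRequest
                          WHERE LicenseID = NEW.LicenseID)
BEGIN
    SELECT RAISE(ABORT, 'rejection date precedes request date');
END;
INSERT INTO DrivingLicenseRequest VALUES (1, '2024-01-10'), (2, '2024-01-10');
""")

def try_reject(license_id, rejection_date):
    """Return True if the rejection tuple was stored, False if the DBMS refused it."""
    try:
        conn.execute(
            "INSERT INTO RejectedDrivingLicense VALUES (?, ?)",
            (license_id, rejection_date),
        )
        return True
    except sqlite3.DatabaseError:
        return False

ok = try_reject(1, "2024-02-01")         # after the request date
too_early = try_reject(2, "2024-01-01")  # before the request date
```

This sketch only guards inserts. A complete version would also need a similar trigger for updates on both tables.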

CREATE ASSERTION NoSimultaneousApprovalRejection CHECK (
    NOT EXISTS (
        SELECT *
        FROM DrivingLicense d
            JOIN RejectedDrivingLicense r USING (LicenseID)
    )
);

Also, the same application can’t be both accepted and rejected, so with this assertion, we could prevent this inconsistency. Basically, we define here that no tuple in DrivingLicense can share its LicenseID with a tuple in RejectedDrivingLicense. This means that no application (LicenseID) can appear simultaneously in both tables, as the domain requires people to submit a new application when the one they have submitted is rejected.
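The same disjointness rule can be approximated with a pair of triggers, one per inheriting table. The following is only a sketch using Python's built-in sqlite3 module and SQLite's trigger syntax (an assumption, since PostgreSQL would use trigger functions instead). The tables are reduced to their key column, and the trigger names and the helper try_insert are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; tables keep only the shared key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DrivingLicense (LicenseID INTEGER PRIMARY KEY);
CREATE TABLE RejectedDrivingLicense (LicenseID INTEGER PRIMARY KEY);

-- Hypothetical triggers: each table refuses a LicenseID the other already holds.
CREATE TRIGGER NoApprovalIfRejected
BEFORE INSERT ON DrivingLicense
WHEN EXISTS (SELECT 1 FROM RejectedDrivingLicense WHERE LicenseID = NEW.LicenseID)
BEGIN
    SELECT RAISE(ABORT, 'request already rejected');
END;

CREATE TRIGGER NoRejectionIfApproved
BEFORE INSERT ON RejectedDrivingLicense
WHEN EXISTS (SELECT 1 FROM DrivingLicense WHERE LicenseID = NEW.LicenseID)
BEGIN
    SELECT RAISE(ABORT, 'request already accepted');
END;
""")

def try_insert(table, license_id):
    """Return True if the tuple was stored, False if a trigger refused it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?)", (license_id,))
        return True
    except sqlite3.DatabaseError:
        return False

accepted = try_insert("DrivingLicense", 1)               # request 1 accepted
also_rejected = try_insert("RejectedDrivingLicense", 1)  # same request: refused
rejected = try_insert("RejectedDrivingLicense", 2)       # different request: fine
```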

DrivingLicense entity

To conclude with this hierarchy, when a driver's license application is accepted, a tuple gets created in the DrivingLicense table with which we have implemented its respective entity. Thus, the main goal of this entity is to model a person's driver's license, because once it’s accepted, it can be used to register cars, and is indirectly associated with the person who holds the license.

To do this, first, the DrivingLicense has a foreign key pointing to DrivingLicenseRequest, just like the previous inheriting entity in the hierarchy. In turn, the request, regardless of its status, always refers to a person through its foreign key, to model whose license it is. And for cars to be registered in association with a person's driver's license, the CarRegistration entity has a foreign key pointing to DrivingLicense, so each registration necessarily refers to a specific license.

We can see all of this in the entity-relationship diagram through the 1-* type associations, as well as in their minimum cardinalities. But we can’t directly infer the existence of the foreign key in DrivingLicense, specific to the hierarchy implementation, because we can implement the hierarchy in many ways.

To understand this last point, imagine being told that there is a foreign key in DrivingLicense pointing to another entity. With this information, we can directly know that this foreign key points to the superclass of the hierarchy and exists because of the implementation where there is at least a specific table for DrivingLicense.

This is because the rest of the associations of the DrivingLicense entity are of the 1-* type, with the 1 on the DrivingLicense side – so these associations result in foreign keys pointing to DrivingLicense, not the other way around. In summary, with just the conceptual diagram, you can’t know exactly how a hierarchy has been implemented, but with some additional information, you can.

CREATE TABLE DrivingLicense (
    LicenseID INT PRIMARY KEY, -- reuses the request's LicenseID, so it must not be auto-generated
    ApprovalDate DATE NOT NULL CHECK (ApprovalDate <= CURRENT_DATE),
    Points INT NOT NULL CHECK (
        Points BETWEEN 0 AND 15
    ),
    FOREIGN KEY (LicenseID) REFERENCES DrivingLicenseRequest(LicenseID)
);

The DDL of this entity is very similar to the previous one, where we have a primary key composed of the LicenseID attribute, which is also a foreign key that identifies the request that has been accepted. This refers to the tuple in the superclass table where the request is stored and uniquely identified, along with the entity's own attributes.

The table constraints in this case declare that the approval date can’t be later than the current date, as it’s impossible for a request to have been accepted on a date that has not yet occurred.

For this, we use CURRENT_DATE to get the current date in SQL and compare it with another date like the one stored in the attribute. There’s also the Points attribute, which determines the remaining points on the person's driver's license. According to the domain, this value is an integer between 0 and 15, so we restrict the possible values it can take with a CHECK, as well as with the INTEGER data type itself, preventing it from taking decimal values.

Given the simplicity of the attribute's domain, using a CHECK is the easiest option, although we could have defined a DOMAIN or TYPE ENUM and assigned it as the data type to the attribute. This would be useful if we had more attributes with the same domain in the rest of the schema.

BusTrip entity

People in our domain can use CityBus buses as a means of transportation. To this end, we have an entity called BusTrip that models specific routes buses take across the city. Each time a bus travels from one point to another, it’s considered a trip recorded in this entity through a tuple. This tuple stores information such as the starting and ending addresses of the trip, the date it takes place, and the time it took.

To uniquely identify the tuples in this table, the primary key uses the attributes TripDate, the starting and ending addresses, and the foreign key attribute that identifies the specific bus that made the trip. We have to include the foreign key in the primary key because there could be several BusTrips with the same date and starting and ending addresses, all conducted by different buses.

So to uniquely distinguish all of them, we need to include the information of the bus making the trip, which means the value of the foreign key pointing to CityBus.

Regarding the semantics of this entity, we can see that no bus can make the same trip multiple times on the same date, as this would result in duplicate tuples, violating the primary key constraint. We assume that this is the case because of the characteristics of the domain.

In the design process, sometimes we have to model situations that may not be entirely intuitive, such as a bus not making the same trip more than once a day.

Since the TripDate attribute is of type DATE, it can only store dates with a resolution up to days. This means that we can’t represent the exact moment the trip occurs in our database (in the same way we could using the TIMESTAMP data type, which allows representing moments in time with date and time).

So, given the granularity of the DATE data type, we comply with the restriction that a bus can’t make the same trip multiple times a day (because in that case, several tuples with exactly the same date would be stored, since DATE can only represent up to days).

This is an example of a restriction that is implicitly modeled by the data type of the attribute itself. If it were TIMESTAMP, we could have multiple trips by the same bus on the same day but at different times.

CREATE TABLE BusTrip (
    TripDate DATE NOT NULL,
    StartAddress VARCHAR(32) NOT NULL,
    EndAddress VARCHAR(32) NOT NULL,
    Duration INT NOT NULL CHECK (Duration >= 0),
    PlateFK VARCHAR(32) NOT NULL,
    PRIMARY KEY (TripDate, StartAddress, EndAddress, PlateFK),
    FOREIGN KEY (PlateFK) REFERENCES CityBus(Plate)
);
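To see how this primary key enforces the "one identical trip per bus per day" rule, here is a small sketch with Python's built-in sqlite3 module (SQLite standing in for PostgreSQL, as an assumption). SQLite has no dedicated DATE type, so dates are stored as ISO text. The addresses, plates, and the helper add_trip are hypothetical, and the foreign key to CityBus is left out to keep the example short.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE BusTrip (
    TripDate TEXT NOT NULL,
    StartAddress TEXT NOT NULL,
    EndAddress TEXT NOT NULL,
    PlateFK TEXT NOT NULL,
    PRIMARY KEY (TripDate, StartAddress, EndAddress, PlateFK)
)""")

def add_trip(trip_date, plate):
    """Insert a trip on a fixed route; return False if the primary key rejects it."""
    try:
        conn.execute(
            "INSERT INTO BusTrip VALUES (?, 'Main St 1', 'Park Ave 9', ?)",
            (trip_date, plate),
        )
        return True
    except sqlite3.IntegrityError:
        return False

first = add_trip("2024-05-01", "BUS-1")
repeat_same_day = add_trip("2024-05-01", "BUS-1")    # duplicate key: refused
other_bus = add_trip("2024-05-01", "BUS-2")          # different bus: allowed
with_time = add_trip("2024-05-01 18:30:00", "BUS-1") # finer resolution: allowed
```

The last call mimics what would happen with TIMESTAMP: once the value carries a time of day, the same bus can repeat the route on the same date without violating the key.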

When constructing the relational diagram, we must also underline the foreign key attribute that points to CityBus, since it’s part of the primary key (BusTrip is weak in identification). More specifically, we can infer this from the entity-relationship diagram by looking at where the «weak» role is located, which indicates the owner entity of BusTrip, meaning the one it depends on for identification.

In the DDL, this is reflected in the attributes that make up the primary key, where we find the three from the table itself and PlateFK, which is the foreign key responsible for referencing the bus that makes the trip. We won’t impose any additional restrictions on the StartAddress and EndAddress attributes, even though not just any text should be stored in them – only texts that represent valid addresses in a city (specifically where the bus operates).

For simplicity, we’ll assume that if an address is not valid, it’s the responsibility of another part of the system to check this, such as software in the application layer that validates addresses before inserting tuples into the database.

On the other hand, we will add the non-negativity restriction on the duration, as it doesn't make sense for it to be negative. We could name these restrictions to make database administration easier, but since we aren’t going to work on them here, we won’t do so.

BusTicket entity

For a person to travel on a bus route, they must have a ticket that allows them to board a bus represented in the CityBus table. So in our domain, we’ll model the existence of tickets with the BusTicket entity. Its only attribute is used to store the timestamp when it was issued.

It’s important to use the TIMESTAMP data type here and not DATE because a person can buy multiple tickets on the same day for different routes, which is why we need to clarify which ticket was generated first.

When we see how this is represented in the conceptual diagram, you might notice the XOR restriction that appears between the associations connecting BusTicket with Person and BusPass. This restriction represents that all existing tickets are either directly associated with a person who owns the ticket or are associated with a BusPass that’s owned by a person and allows multiple trips with a pass.

This is how we’d semantically explain the restriction we want to model – but conceptually, when we have a restriction represented by a dashed line and a logical condition like XOR, it means that either the BusTicket-Person association exists, or the BusTicket-BusPass association exists. It’s not possible for neither to exist or for both to exist at the same time.

When one of these associations exists for a ticket, the foreign key in BusTicket pointing to the respective entity is not NULL. That is, both associations are of type 1-*, so they are clearly implemented with foreign keys in BusTicket. But the minimum cardinalities on the Person and BusPass sides are 0, indicating that each association may not exist for a given ticket. In other words, it means that the values of the foreign key attributes can be NULL.

At this point, if we didn't have the XOR restriction, the foreign keys could both be NULL at the same time, indicating that a ticket is not associated with any person or pass, making it impossible to identify the passenger taking the trip.

On the other hand, if both foreign keys had values in their attributes, we would be modeling that the person has used the pass to travel by bus through the non-nullity of the BusTicket→BusPass foreign key, while at the same time modeling that the same person has not used the pass but has obtained a ticket directly through the non-nullity of BusTicket→Person.

That is, if the foreign key BusTicket→BusPass is not NULL, then we are modeling the situation where a person uses their pass to travel by bus, while the other foreign key, when not NULL, represents that the person is not using a pass to travel but is doing so directly with a ticket.

So both situations can’t occur at the same time thanks to the domain restrictions that dictate that a person either travels with a ticket or with a pass – but not both at once, and not neither. This is because a ticket is necessary to travel. This is why we use the XOR condition to represent that either one association exists or the other, but not both at the same time. It also prohibits neither from existing.

PersonFK	PassFK	Valid	Meaning
NOT NULL	NULL	✔️	Ticket purchased directly by the person.
NULL	NOT NULL	✔️	Ticket charged to the person's BusPass.
NULL	NULL	❌	There is no information on which person is traveling (orphan ticket).
NOT NULL	NOT NULL	❌	Indicates both direct purchase and use of a pass at the same time (inconsistent).

It’s also important to emphasize that for the foreign keys to be NULL, the minimum cardinality on the side of Person and BusPass in the associations must be 0.
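The four combinations in the table above can be tried out against a CHECK constraint that encodes the XOR condition. Below is a minimal sketch using Python's built-in sqlite3 module (SQLite standing in for PostgreSQL, as an assumption). The table is reduced to the two nullable foreign key columns plus a hypothetical TicketID, the REFERENCES clauses are omitted, and the helper accepts is made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; only the XOR-related columns are kept.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE BusTicket (
    TicketID INTEGER PRIMARY KEY,  -- hypothetical key, used only in this sketch
    PersonFK INTEGER,
    PassFK INTEGER,
    CHECK (
        (PersonFK IS NULL AND PassFK IS NOT NULL)
        OR (PersonFK IS NOT NULL AND PassFK IS NULL)
    )
)""")

def accepts(person_fk, pass_fk):
    """Return True if the DBMS stores the ticket, False if the CHECK rejects it."""
    try:
        conn.execute(
            "INSERT INTO BusTicket (PersonFK, PassFK) VALUES (?, ?)",
            (person_fk, pass_fk),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "person only": accepts(1, None),
    "pass only": accepts(None, 7),
    "neither": accepts(None, None),
    "both": accepts(1, 7),
}
```

Only the first two combinations are stored, matching the two valid rows of the table.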

To model that a person travels using a pass, we might consider associating the BusPass entity directly with BusTrip instead of with BusTicket. But doing this would result in an N-M relationship between BusPass and BusTrip, since a pass can lead to an indefinite number of trips, while multiple people can travel using their pass on a single trip.

To avoid having to add another intermediate entity to implement the N-M association, let’s associate BusTicket with BusPass, so that we can see each trip made using a pass by checking the foreign key values of the ticket.

CREATE TABLE BusTicket (
    IssueTime TIMESTAMP,
    TripDateFK DATE,
    StartAddressFK VARCHAR(32),
    EndAddressFK VARCHAR(32),
    PlateFK VARCHAR(32),
    PersonFK INT,
    PassFK INT,
    PRIMARY KEY(
        IssueTime,
        TripDateFK,
        StartAddressFK,
        EndAddressFK,
        PlateFK
    ),
    FOREIGN KEY (
        TripDateFK,
        StartAddressFK,
        EndAddressFK,
        PlateFK
    ) REFERENCES BusTrip(TripDate, StartAddress, EndAddress, PlateFK),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (PassFK) REFERENCES BusPass(PassID),
    CONSTRAINT XORConstraint CHECK (
        (
            PersonFK IS NULL
            AND PassFK IS NOT NULL
        )
        OR (
            PersonFK IS NOT NULL
            AND PassFK IS NULL
        )
    )
);

To uniquely identify each ticket, we have to use both the IssueTime attribute of the table and the foreign key pointing to BusTrip, which determines which trip will be made with that ticket. So we have a weak entity in identification again – and it’s peculiar in that the foreign key in this case is composed of several attributes, since the primary key of BusTrip (which is the owning entity on which it depends for identification) is itself composed of multiple attributes. Specifically, this primary key has 4 attributes – so in BusTicket, as the foreign key must reference the primary key of BusTrip, it will be composed of exactly 4 attributes (meaning as many as the primary key it points to).

To declare this foreign key, we use the same FOREIGN KEY constraint as always, the only difference being that here we use several attributes instead of just one.

The most important thing about this constraint when we have multiple attributes is to declare them in order. For example, if we want TripDateFK to point to the TripDate attribute of the BusTrip primary key, then we must put those two attributes in the same order in the constraint tuple. Here, for example, they are in the first position, but we could place them both in the second position after StartAddressFK and StartAddress, or in the third (and so on), as long as they correspond.
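
To make the pairing rule concrete, here is a small sketch with two hypothetical tables, TripDemo and TicketDemo, which aren't part of our schema. What matters is that the column in a given position of the FOREIGN KEY list is matched with the column in the same position of the REFERENCES list.

```sql
-- Hypothetical parent table with a two-attribute primary key.
CREATE TABLE TripDemo (
    TripDate DATE,
    StartAddress VARCHAR(32),
    PRIMARY KEY (TripDate, StartAddress)
);

-- Hypothetical child table. The date pair is in first position here:
-- TripDateFK -> TripDate, StartAddressFK -> StartAddress.
CREATE TABLE TicketDemo (
    IssueTime TIMESTAMP PRIMARY KEY,
    TripDateFK DATE NOT NULL,
    StartAddressFK VARCHAR(32) NOT NULL,
    FOREIGN KEY (TripDateFK, StartAddressFK)
        REFERENCES TripDemo (TripDate, StartAddress)
);

-- Same pairing with the date pair moved to second position.
-- This constraint is redundant and is added only for illustration.
ALTER TABLE TicketDemo
    ADD FOREIGN KEY (StartAddressFK, TripDateFK)
        REFERENCES TripDemo (StartAddress, TripDate);

-- A crossed pairing would be wrong: it would match TripDateFK (DATE)
-- against StartAddress (VARCHAR), which PostgreSQL rejects because
-- the column types are incompatible.
-- ALTER TABLE TicketDemo
--     ADD FOREIGN KEY (TripDateFK, StartAddressFK)
--         REFERENCES TripDemo (StartAddress, TripDate);
```

In this sketch the crossed pairing is caught because the types differ. When both columns happen to share a type, the DBMS can't detect the mistake, which is why the order deserves attention.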

Since this is the only foreign key that can’t be NULL in the table, we need to ensure that all its attributes have the NOT NULL constraint. But since they’re part of the primary key, we don’t need to explicitly declare the constraint.

On the other hand, we shouldn’t add this constraint to the other foreign keys, which model the associations with Person and BusPass, because they need to take a NULL value in certain situations. So no attribute in this table needs an explicit NOT NULL declaration: the primary key attributes get it implicitly, and the remaining two must stay nullable.

Finally, TIMESTAMP isn’t the only data type that can store a date and time in the IssueTime attribute. We also have alternatives like TIMESTAMP WITH TIME ZONE, or DATETIME, which is available in DBMSs such as MySQL but not in PostgreSQL. These have specific uses, such as taking the time zone into account in addition to the time itself. For simplicity, in this example, we’ll use TIMESTAMP for all attributes that need to store date and time.
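
As a quick sketch of the difference in PostgreSQL, the query below builds one value of each type from a literal. The date used is an arbitrary placeholder.

```sql
-- TIMESTAMP stores a date and time with no time zone information.
-- TIMESTAMP WITH TIME ZONE reads the offset in the literal (+00 here)
-- and displays the value in the time zone of the current session.
SELECT
    TIMESTAMP '2025-03-01 08:30:00' AS issue_time_plain,
    TIMESTAMP WITH TIME ZONE '2025-03-01 08:30:00+00' AS issue_time_with_zone;
```

The second column may display a different hour depending on the session's time zone setting, while the first is always shown exactly as written.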

BusPass entity

We can already infer the semantics of this entity from what we’ve just seen. Specifically, if a person plans to take multiple bus trips and doesn’t want to buy individual tickets for each of those trips, they can purchase a pass (represented by the BusPass entity) which allows them to take multiple trips without worrying about tickets.

If we look at the conceptual diagram, we’ll see that it has several associations of type 1-*, where one of them is affected by the XOR constraint. So given the minimum cardinalities of this association, it may not exist, as we explained earlier. So we know that there won't always be a foreign key pointing to BusPass, since the association is optional due to its minimum cardinalities.

But on the other hand, we have the other association with Person that results in a foreign key in BusPass that always has to exist, because all passes must have an associated person (meaning every pass must have an owner).

CREATE TYPE ModalityType AS ENUM ('single', 'round_trip', 'daily', 'weekly', 'monthly', 'annual');
CREATE TABLE BusPass (
    PassID SERIAL PRIMARY KEY,
    IssueDate DATE NOT NULL,
    ExpirationDate DATE NOT NULL CHECK (ExpirationDate > IssueDate),
    Modality ModalityType NOT NULL,
    RemainingTrips INT NOT NULL CHECK (RemainingTrips >= 0),
    PersonFK INT NOT NULL,
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID)
);

For the identification of this entity, we include a surrogate key to simplify the process. We could have selected another set of attributes as the primary key, although this would result in the entity being weak in identification, as well as a primary key with more attributes than it has using a surrogate key. So the simpler solution is generally preferred.

We should include a CHECK to ensure that the expiration date of the pass is after the issue date, which prevents inserting tuples with inconsistent dates. Also, none of the attributes can be NULL. For the foreign key, this follows from the minimum cardinality of its association, since every pass must have an owner. Other attributes like Modality can’t be NULL either. For Modality, we implement a custom ENUM type that defines the different pass modalities, which determine how a person can use that pass.

Lastly, we can indicate the constraint that we modeled at the conceptual level with an XOR in the same way in the relational diagram using a dashed line between the foreign keys involved. We can also indicate it with a textual note. But in the DDL, the simplest way to code it is with a CHECK in BusTicket, which is where the foreign keys involved in the integrity condition originate.
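
As a sketch of an alternative spelling, PostgreSQL (version 9.6 and later) provides the num_nonnulls function, which counts how many of its arguments are not NULL, so the same XOR condition can be written as a single comparison. The table TicketXorCompact is hypothetical and omits the real foreign keys so that it can run on its own.

```sql
-- Hypothetical table showing a compact form of the XOR check.
CREATE TABLE TicketXorCompact (
    TicketID SERIAL PRIMARY KEY,
    PersonFK INT,
    PassFK INT,
    -- Exactly one of the two columns must be non-NULL.
    CONSTRAINT XORConstraintCompact CHECK (num_nonnulls(PersonFK, PassFK) = 1)
);
```

This form is specific to PostgreSQL, whereas the explicit IS NULL / IS NOT NULL version used in BusTicket is more portable across DBMSs.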

Voyage entity

Continuing with the ways people in our domain travel, this time by cruise, we have the entity Voyage. This models the trips taken by the cruises. Specifically, the entity stores information about the trip, such as the departure and arrival dates, as well as the ports where the trip begins and ends.

We can also see that it has an attribute called Distance, which might initially seem irrelevant – but Distance records the total distance traveled by the cruise during the trip. And this doesn’t necessarily have to match the shortest distance between the departure and arrival ports.

The decision to use this meaning for this attribute came from our domain and its constraints. That is, if we are required to record the total distance the cruise travels, in addition to the distance between both ports, the simplest option would be to add an attribute in this entity that records that magnitude.

In other words, if we didn't need to know the distance traveled by the cruise itself, we could be satisfied with knowing the distance between the departure and arrival ports (which we can determine from the port information). But we’ll use the attribute Distance here, which records the actual distance traveled by the cruise during the trip (since we need this information).

If we look at the conceptual model, we’ll see that this entity has two identical associations of the same type 1-* with the entity Port, all with the aim of conceptually modeling that a voyage is associated with two ports, one for departure and one for arrival, where both can be the same. Regarding this last point, if they could not be the same, we would need to indicate that restriction with a note, as there are no standard elements in an entity-relationship diagram or in the relational model to represent such a situation.
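
Although the diagrams can only express such a restriction with a note, the DDL could enforce it with a CHECK. The following is a sketch of that hypothetical rule only: our domain allows both ports to be the same, so we don't add it to the schema. It assumes the Voyage table defined in this section already exists, and the constraint name is made up for illustration.

```sql
-- Hypothetical constraint: not part of our schema.
-- A port is identified by (Name, CityFK), so two ports differ
-- when at least one of those two components differs.
ALTER TABLE Voyage
    ADD CONSTRAINT DifferentPortsDemo CHECK (
        DepartureNameFK <> ArrivalNameFK
        OR DepartureCityFK <> ArrivalCityFK
    );
```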

On the other hand, we could also consider modeling the trip so that it has departure and arrival ports through a single Voyage-Port association with a cardinality of 2 on the Port side. But if we did this, conceptually, we wouldn't distinguish which port was for departure and which was for arrival. Rather, we would be modeling that the cruise passes through two ports on that trip – but we wouldn't know for sure which was the arrival or departure port (at least at the conceptual level) since at the logical level there would necessarily have to be two foreign keys pointing to Port.

So to easily distinguish between the arrival and departure ports for a trip and to clarify the semantics of the association between Voyage and Port, we’ll use multiple associations, each with a role that explicitly indicates the relationship the port has with the trip.

In addition to these associations, the Voyage entity needs to reference CruiseShip to know which cruise has made that trip. That's why, in the conceptual diagram, there is a 1-* association with CruiseShip, where one cruise ship can make many trips, but a trip is only made by one cruise ship.

To identify this entity, we’ll take advantage of the fact that both the start and end dates of the trip are always defined to include them in the primary key. This means both dates can’t be NULL as they define the trip's duration.

But, to truly uniquely identify the Voyage tuples, we have to distinguish them using the departure and arrival ports of the trip, as well as the cruise ship that performs it. That's why we include all foreign keys in the primary key. If we didn’t do this, there could be several tuples of different trips made by different cruise ships or passing through different ports that could still have the same value in the departure and arrival dates. So we need to include information about the cruise ship making the trip, as well as the ports involved.

By defining the primary key this way, we are making the entity weak in identification, where its owning entities are CruiseShip and Port, even though part of its primary key is composed of attributes from the entity itself.

CREATE TABLE Voyage (
    DepartureDate DATE,
    ArrivalDate DATE CHECK (ArrivalDate >= DepartureDate),
    Distance DOUBLE PRECISION NOT NULL CHECK (Distance >= 0),
    DepartureNameFK VARCHAR(32) NOT NULL,
    DepartureCityFK INT NOT NULL,
    ArrivalNameFK VARCHAR(32) NOT NULL,
    ArrivalCityFK INT NOT NULL,
    ShipFK INT NOT NULL,
    PRIMARY KEY (
        DepartureDate,
        ArrivalDate,
        DepartureNameFK,
        DepartureCityFK,
        ArrivalNameFK,
        ArrivalCityFK,
        ShipFK
    ),
    FOREIGN KEY (ShipFK) REFERENCES CruiseShip(ShipID),
    FOREIGN KEY (DepartureNameFK, DepartureCityFK) REFERENCES Port(Name, CityFK),
    FOREIGN KEY (ArrivalNameFK, ArrivalCityFK) REFERENCES Port(Name, CityFK)
);

To implement this entity at the logical level, in its DDL, you can see that we first define the attributes of the entity itself, as well as those of the foreign key for the departure port, which are (DepartureNameFK, DepartureCityFK).

Note that the foreign keys pointing to Port must have two attributes since the primary key of Port has two attributes. So we’ll need a total of 4 attributes to model the foreign keys that reference the departure and arrival ports of the trip, both referencing the Name and CityFK attributes of the Port table (which make up its primary key as we saw earlier). Also, we need another attribute, ShipFK, to reference CruiseShip and thus determine which cruise ship made the trip.

With all this, the primary key of Voyage is defined as the set of attributes (DepartureDate, ArrivalDate, DepartureNameFK, DepartureCityFK, ArrivalNameFK, ArrivalCityFK, ShipFK).

If we had to infer the attributes that make up the primary key using only the conceptual diagram, we would need to look at the entities that the foreign keys reference. These are represented with the 1-* associations.

For example, in CruiseShip, we would see that its primary key has only one attribute, so necessarily in Voyage, the corresponding foreign key that references it must have one attribute, ShipFK. Meanwhile, the other two foreign keys that reference Port need to have two attributes each, since we can see that Port is identified by its name and the city where it’s located. So its primary key has two attributes (Name, CityFK) that we will need to reference from Voyage.

In the relational diagram, this is easier to interpret. We’ll see that one attribute references an attribute of the CruiseShip table, so we know it’s a foreign key that leads to a 1-* association in the conceptual model.

Also, there are two other attributes that together reference two attributes of Port – and together, they also form a foreign key that creates a 1-* association in the conceptual diagram, where the many side is in the entity from which the foreign key originates (that is, in Voyage).

With this last foreign key, we can represent the departure port of the trip. There’s another pair of attributes (ArrivalNameFK, ArrivalCityFK) that follow the same pattern to represent the arrival port of the trip. From them, we can also infer that at the conceptual level, there’s another association with the same characteristics.
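
Because both foreign keys point to the same table, a query that needs both ports has to join Port twice, once per role. Here is a minimal sketch, assuming the Voyage and Port tables from this chapter exist. Port's descriptive attributes aren't shown in this section, so the query selects only its key columns, and the aliases dep and arr are arbitrary.

```sql
-- Join Port once per role: dep for the departure port, arr for the arrival port.
SELECT
    v.DepartureDate,
    v.ArrivalDate,
    dep.Name AS DeparturePort,
    arr.Name AS ArrivalPort
FROM Voyage AS v
JOIN Port AS dep
    ON dep.Name = v.DepartureNameFK
    AND dep.CityFK = v.DepartureCityFK
JOIN Port AS arr
    ON arr.Name = v.ArrivalNameFK
    AND arr.CityFK = v.ArrivalCityFK;
```

Each join condition compares both attributes of the composite foreign key, since either one alone doesn't identify a port.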

And, since all these foreign keys are underlined, this implies they are part of the primary key of Voyage. From that we can infer that Voyage is weak in identification.

Lastly, if we look at the data types of the foreign keys, we’ll see that they match exactly with the types of the attributes they reference. This is especially important because a foreign key, by definition, is an attribute that holds the value of another attribute it references, so both must be of the same type for this to be possible.

Since foreign keys are made up of multiple attributes, in this case, we also need to consider the relative order of the attributes that form the foreign key with the order of the attributes they reference. This is unlike what happens in the PRIMARY KEY constraint, where the order in which the primary key attributes are declared doesn’t matter. In that case, in PRIMARY KEY, we are declaring a set of attributes, where what matters is that they appear in the constraint (not that they follow a specific order).

CruiseBooking entity

For a person to travel on a cruise, they must make a reservation for a specific voyage. So in our domain, we have the entity CruiseBooking, which is responsible for storing the reservations people make to travel on a cruise.

The data stored for each reservation includes the booking date, cabin number, price, and payment method. To know which person has booked which voyage, the entity has 1-* associations with Person and Voyage, which logically translate into two foreign keys pointing to the respective entities.

To uniquely identify each booking, we could choose the easy option of including a surrogate key attribute to serve as the primary key. But to illustrate the complexity of not doing this, we’ll use only attributes from the table itself to identify its tuples. So the primary key of this entity is composed of the attributes BookingDate, CabinNumber, the foreign key to Person, and the other foreign key to Voyage.

We do this because we assume that multiple people can book the same cabin for the same voyage, all on the same date. For example, this can happen if several people from the same family decide to book a certain voyage. Each of those people will have a record or tuple in the CruiseBooking table with the same values in BookingDate, CabinNumber, and the foreign key of Voyage, but a different value in the foreign key of Person. This allows the tuples to be uniquely distinguished.

The foreign key to Person has a single attribute since the primary key of Person has only one attribute. But the other foreign key that refers to the voyage being booked has exactly 7 attributes (as the Voyage entity requires 7 attributes to be uniquely identified).

With this, we realize that the primary key of CruiseBooking will have a total of 10 attributes, making it a much more complex solution than simply using a surrogate key. So you can see why it’s very convenient to use surrogate keys whenever possible for this type of entity – especially when the foreign keys that will be part of the primary key have too many attributes, as in this case.

CREATE TYPE PaymentMethodType AS ENUM ('card', 'paypal', 'bank', 'cash', 'mobile');
CREATE TABLE CruiseBooking (
    BookingDate DATE NOT NULL,
    CabinNumber INT NOT NULL CHECK (CabinNumber > 0),
    Price DOUBLE PRECISION NOT NULL CHECK (Price >= 0),
    PaymentMethod PaymentMethodType NOT NULL,
    PersonFK INT NOT NULL,
    DepartureDateFK DATE NOT NULL,
    ArrivalDateFK DATE NOT NULL,
    DepartureNameFK VARCHAR(32) NOT NULL,
    DepartureCityFK INT NOT NULL,
    ArrivalNameFK VARCHAR(32) NOT NULL,
    ArrivalCityFK INT NOT NULL,
    ShipFK INT NOT NULL,
    PRIMARY KEY (
        BookingDate,
        CabinNumber,
        PersonFK,
        DepartureDateFK,
        ArrivalDateFK,
        DepartureNameFK,
        DepartureCityFK,
        ArrivalNameFK,
        ArrivalCityFK,
        ShipFK
    ),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (
        DepartureDateFK,
        ArrivalDateFK,
        DepartureNameFK,
        DepartureCityFK,
        ArrivalNameFK,
        ArrivalCityFK,
        ShipFK
    ) REFERENCES Voyage(
        DepartureDate,
        ArrivalDate,
        DepartureNameFK,
        DepartureCityFK,
        ArrivalNameFK,
        ArrivalCityFK,
        ShipFK
    )
);

If we look at the DDL, it seems much more complex than the previous ones. But actually, the elements we used are the same. We define the primary key with PRIMARY KEY, and the attributes of the foreign keys with the same data types as the attributes they reference. We also use the NOT NULL constraint to correctly implement what's indicated in the minimum cardinalities of the associations.

We declare each foreign key with FOREIGN KEY, which is longer in this case due to the number of attributes that make up each one. The only important thing to keep in mind here is that one of the FOREIGN KEYs is exclusively dedicated to declaring the foreign key to Person (meaning the association between CruiseBooking and Person) while the other models the association with Voyage.

We do this without mixing attributes of both foreign keys in the same FOREIGN KEY – as this would be an error since we wouldn't be modeling the conceptual diagram correctly. Each foreign key is independent of the others, so each FOREIGN KEY includes only the attributes that make up each corresponding foreign key.

To simplify the domain of the PaymentMethod attribute, we can define an ENUM type, since the payment method is an attribute that will likely be used in other parts of the domain. Even if it's not needed elsewhere now, it's possible that in a future expansion of the domain, we might need to reuse it in the schema. This is why it's worth declaring it as a separate type: it makes database management easier in a potential expansion.
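
For comparison, the surrogate-key option we set aside could look roughly like the sketch below. CruiseBookingSurrogate and BookingID are hypothetical names. The sketch assumes the Person and Voyage tables and the PaymentMethodType type already exist, and it keeps the former 10-attribute key as a UNIQUE constraint so that duplicate bookings are still rejected.

```sql
-- Hypothetical alternative to CruiseBooking that uses a surrogate key.
CREATE TABLE CruiseBookingSurrogate (
    BookingID SERIAL PRIMARY KEY,
    BookingDate DATE NOT NULL,
    CabinNumber INT NOT NULL CHECK (CabinNumber > 0),
    Price DOUBLE PRECISION NOT NULL CHECK (Price >= 0),
    PaymentMethod PaymentMethodType NOT NULL,
    PersonFK INT NOT NULL,
    DepartureDateFK DATE NOT NULL,
    ArrivalDateFK DATE NOT NULL,
    DepartureNameFK VARCHAR(32) NOT NULL,
    DepartureCityFK INT NOT NULL,
    ArrivalNameFK VARCHAR(32) NOT NULL,
    ArrivalCityFK INT NOT NULL,
    ShipFK INT NOT NULL,
    -- The natural key is no longer the primary key, but it stays unique.
    UNIQUE (
        BookingDate, CabinNumber, PersonFK,
        DepartureDateFK, ArrivalDateFK,
        DepartureNameFK, DepartureCityFK,
        ArrivalNameFK, ArrivalCityFK, ShipFK
    ),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (
        DepartureDateFK, ArrivalDateFK,
        DepartureNameFK, DepartureCityFK,
        ArrivalNameFK, ArrivalCityFK, ShipFK
    ) REFERENCES Voyage(
        DepartureDate, ArrivalDate,
        DepartureNameFK, DepartureCityFK,
        ArrivalNameFK, ArrivalCityFK, ShipFK
    )
);
```

The table itself isn't shorter, but any other table that references a booking would now need a single BookingID attribute instead of 10.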

Pool entity

In our domain, there are also pools, which are represented in the IS-A hierarchy with the entity Pool as the superclass. This allows people to interact with pools in different ways, as we will see below. So we consider that our domain includes different types of pools, such as cruise pools found on cruise ships modeled with CruiseShip, city pools found in cities, or Olympic pools also found in cities.

Since they all share common attributes, we do the same as in the Vehicle hierarchy, using a superclass that includes these common attributes like the pool's name, its address, minimum and maximum depths, or the current state of the pool.

We also include a 1-* association between Pool and City to represent that all pools are located in a city, except for those of type CruisePool, which are on a cruise ship and not in a city. In that specific case, the semantics of the association are different, as we’ll see later. From this, we can define different types of pools with distinct characteristics, where all of them inherit all the attributes of their superclass, including the association with City.

As we can see, CityPool and OlympicPool have no issue with this, but CruisePool models pools on cruise ships, so its association with City does not have the same semantics as the others. In other words, the pool is not located in a city but on a cruise ship – so we assume that the associated city is its place of manufacture.

As you can guess, this is not the only way to model this domain, nor is it the best, since the "locatedAt" semantics indicated in the conceptual diagram's association between City and Pool does not capture the meaning of that relationship when the pool is of type CruisePool. But once we clarify this, the model is correct in the sense that all essential elements are represented correctly, even if not in the best possible way.

How can we translate the pool hierarchy entities at the conceptual level to tables?

Once we’ve clarified the semantics of the hierarchy, we can follow the same process as before to implement it at the logical level.

First, we note that the hierarchy is not complete, as we assume that in our domain there are many types of pools, of which we only model 3 with specific entities, while the rest are pools modeled with occurrences of the Pool entity.

In other words, if a pool is one of the types of the inheriting entities, it will be represented as an occurrence of that entity, while if it’s of a different type, it will be represented by an occurrence of the superclass. So, in the hierarchy, pools aren’t required to belong to the inheriting entities, making it incomplete.

On the other hand, the types of pools are all disjoint, meaning a pool can’t be both Olympic and cruise at the same time, or city and Olympic at the same time. So the hierarchy is disjoint because there won’t be any pool that is of multiple types at once.

Just like in the DrivingLicenseRequest hierarchy, pools here are also uniquely identified with a surrogate key in the PoolID attribute, while the rest of the entities in the hierarchy initially do not have any type of identification.

This might lead us to think that the best way to implement the hierarchy is, once again, with a table for each entity. But this doesn't necessarily have to be the case because a single table can be used to implement multiple entities at once, using the table's identifier to distinguish between the entities. This is because we assume that the domain does not impose any restrictions, unlike in the Vehicle hierarchy where each type of vehicle had to have its own identifier.

Regarding the decision to implement a table for the superclass, whenever we have an incomplete hierarchy, we’ll need a specific table for the superclass – specifically to store information about pools that don’t belong to any type present in the inheriting entities. This means we need to include a Pool table.

Later, to decide whether to use that table to implement all entities in the hierarchy, only some of them, or to include a table for each inheriting entity, we need to look at the number of attributes the inheriting entities have. In this case, we see they have too many attributes, especially CruisePool and CityPool, so the simplest option is to implement a table for each entity in the hierarchy.

Another option we would have is to use the Pool table to also represent OlympicPool (which has the fewest attributes) and model the rest of the entities with specific tables. But this has disadvantages, such as the division in how we represent each type of pool.

For example, while we represent OlympicPool with some attributes in Pool that may or may not be NULL depending on whether the pool is of that type, the other types of pools would be represented differently. This can be confusing when querying the database.

We also need to consider that some foreign keys point to OlympicPool, so those foreign keys would only be valid for tuples in Pool whose corresponding attributes SpectatorMaxCapacity and CompetitionLanes aren’t NULL, greatly complicating database management, creating more constraints, and possibly complicating certain queries.

Although, no matter how complicated this option is, it would be possible to implement it, and it would be just as valid as implementing a table for each entity. That is, the complexity of an implementation can make it unfeasible but not incorrect – as long as the corresponding constraints are defined to maintain data integrity.
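
To give an idea of the extra constraints this option would need, here is a reduced sketch. PoolWithOlympicDemo is a hypothetical table that keeps only a few columns. It stores the Olympic attributes as nullable columns and uses a CHECK to make sure they are either both present or both absent.

```sql
-- Hypothetical single-table variant: Olympic attributes live in the pool table.
CREATE TABLE PoolWithOlympicDemo (
    PoolID SERIAL PRIMARY KEY,
    Name VARCHAR(32) NOT NULL,
    -- Nullable: only meaningful when the pool is Olympic.
    SpectatorMaxCapacity INT CHECK (SpectatorMaxCapacity >= 0),
    CompetitionLanes INT CHECK (CompetitionLanes > 0),
    -- Either both Olympic attributes are set (the pool is Olympic)
    -- or both are NULL (the pool is of another type).
    CONSTRAINT OlympicAttributesTogether CHECK (
        (SpectatorMaxCapacity IS NULL AND CompetitionLanes IS NULL)
        OR (SpectatorMaxCapacity IS NOT NULL AND CompetitionLanes IS NOT NULL)
    )
);
```

Even with this constraint, a foreign key that should only reference Olympic pools would still be able to point at any row of the table, so additional mechanisms would be needed to enforce that rule.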

So even though in this case the simplest option is to use a table per entity, that doesn't mean there aren't other correct ways to implement the hierarchy. This means that from the entity-relationship diagram, we can’t infer the exact way it’s finally implemented, although it can be useful for making that decision.

CREATE TYPE PoolStatusType AS ENUM ('open', 'closed', 'maintenance', 'renovation');
CREATE TABLE Pool (
    PoolID SERIAL PRIMARY KEY,
    Name VARCHAR(32) NOT NULL,
    Address VARCHAR(32) NOT NULL,
    MinDepth INT NOT NULL CHECK (MinDepth >= 0),
    MaxDepth INT NOT NULL CHECK (MaxDepth >= MinDepth),
    Status PoolStatusType NOT NULL,
    CityFK INT NOT NULL,
    FOREIGN KEY (CityFK) REFERENCES City(CityID)
);

After deciding how to translate the hierarchy to the logical level, we add the table to the relational diagram and code it in the SQL DDL. As you can see, it’s very similar to the Vehicle table, with a surrogate key as an identifier, the entity attributes that characterize all pools, and a foreign key that references the City table. This determines the city where the pool is located (or manufactured in the case of a pool of the type CruisePool).

Regarding the status of the pool, we can see that it’s modeled here with a Status attribute. We define an ENUM TYPE for it to limit its domain. This design decision is justified because in this hierarchy we’re representing the types of pools in the inheriting entities, not their statuses. So to represent the statuses of the pools, we’ll need to use a different mechanism than generalization/specialization, such as a simple Status attribute.

There are other ways to model this, but they’d be more complex. This doesn’t make them wrong, but we won’t discuss or show them here.

To represent this attribute's data type in the entity-relationship diagram, we have chosen to define an «Enum» entity in UML with the possible values the attribute can take. Entities with the «Enum» type serve the same purpose as using an ENUM type in SQL. This defines a set of values that can then be used as a data type for an attribute, thus restricting its domain.

But in general, this doesn’t have to be fully specified at the conceptual level. We could’ve simply used string as the data type and omitted this «Enum» entity, limiting its domain later at the logical level. Or rather, when the logical model is implemented in the DBMS.

Still, if we want our design to be as clear and self-descriptive as possible at all levels, we should indicate the possible values that attributes can take at all levels, as restricting the domain implicitly imposes an integrity constraint. We can do this through «Enum» entities, side notes, or by using other applicable standard UML elements.

CruisePool entity

Just as we did in the Vehicle hierarchy, here each type of pool is represented with a dedicated table. This way, when registering a new CruisePool type pool in our system, a tuple will be created in this table where the data characterizing cruise pools is stored. But the data that characterizes it as a pool won’t be stored there, as those can only be stored in the Pool table.

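So to logically model the inheritance of all Pool attributes to the specific type of pool, we’ll use a foreign key to point to the Pool tuple that contains the rest of the pool information. Specifically, we’ll choose PoolID as the foreign key, as it’s the identifier of the pools in our system. We declare it as the same SERIAL type to reference that same attribute in the Pool table, where it’s the primary key. Keep in mind that in PostgreSQL, SERIAL is an integer column with its own sequence as the default value, so the PoolID should always be supplied explicitly when inserting into this table (a plain INT column would work equally well for this foreign key).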

As you can guess, the Pool table not only stores information about pools that aren’t specifically modeled in our system, but it also contains information about pools of each of these types. So, if we want to get information about all the pools in our system, regardless of their type, we just need to query the Pool table.

This is possible because we have a table for the superclass, whereas in other hierarchies we might not implement it, which would require us to query multiple tables to get information about all the pools in the system. This is not necessarily a problem, but it’s worth considering when implementing the hierarchy or even modeling certain aspects of our domain with hierarchies.

For example, if we have a Pool and want to know its type, we must check the rest of the tables in the hierarchy to see if there is any tuple referencing that pool. This results in a very computationally expensive operation because it has to go through all the stored data. If our system needs to prioritize efficiency in such a query, it’d be helpful to modify the hierarchy implementation to make sure that this query runs as quickly as possible.

For instance, adding a redundant attribute in Pool to indicate the type, even though it introduces redundancy and unnecessary additional space, can greatly optimize the latency of certain queries. Just make sure you make these decisions according to project requirements, such as the latency that queries must have, the space the database should occupy, and so on.
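
A possible shape for such a redundant attribute is sketched below. PoolKindType and PoolKind are hypothetical names that aren't part of our schema, and the statements assume the Pool, CruisePool, CityPool, and OlympicPool tables from this section already exist.

```sql
-- Hypothetical discriminator type and column; neither is part of our schema.
CREATE TYPE PoolKindType AS ENUM ('generic', 'cruise', 'city', 'olympic');

-- Existing rows start as 'generic' and are corrected below.
ALTER TABLE Pool
    ADD COLUMN PoolKind PoolKindType NOT NULL DEFAULT 'generic';

-- Backfill the redundant column from the subtype tables.
UPDATE Pool SET PoolKind = 'cruise'
WHERE EXISTS (SELECT 1 FROM CruisePool AS c WHERE c.PoolID = Pool.PoolID);

UPDATE Pool SET PoolKind = 'city'
WHERE EXISTS (SELECT 1 FROM CityPool AS c WHERE c.PoolID = Pool.PoolID);

UPDATE Pool SET PoolKind = 'olympic'
WHERE EXISTS (SELECT 1 FROM OlympicPool AS o WHERE o.PoolID = Pool.PoolID);
```

With this column in place, finding the type of a pool only requires reading one row of Pool. The price is that every insert or delete in a subtype table must also keep PoolKind in sync, which is exactly the kind of trade-off described above.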

CREATE TABLE CruisePool (
    PoolID SERIAL PRIMARY KEY,
    DeckNumber INT NOT NULL CHECK (DeckNumber >= 0),
    MaxCapacity INT NOT NULL CHECK (MaxCapacity >= 0),
    WaterTemperature DOUBLE PRECISION NOT NULL,
    SlideCount INT NOT NULL CHECK (SlideCount >= 0),
    ShipFK INT NOT NULL,
    FOREIGN KEY (PoolID) REFERENCES Pool(PoolID),
    FOREIGN KEY (ShipFK) REFERENCES CruiseShip(ShipID)
);

In addition to the foreign key pointing to Pool (which also serves as the primary key, making this table weak in identification with Pool as its owner entity), we have another foreign key referencing CruiseShip to determine the cruise on which the pool is located. And, since all cruise pools must be on a cruise ship to be of that type, the foreign key pointing to CruiseShip can’t be NULL. It must always reference a valid cruise. This is why we include the NOT NULL constraint, which we don’t do for PoolID because we are declaring it as the primary key.
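
Since a cruise pool's data is split between the two tables, reading all of it takes a join on PoolID. The query below is a minimal sketch that assumes the Pool and CruisePool tables defined in this section.

```sql
-- Combine the inherited attributes (from Pool) with the specific ones (from CruisePool).
SELECT
    p.PoolID,
    p.Name,
    p.Status,
    c.DeckNumber,
    c.MaxCapacity,
    c.ShipFK
FROM CruisePool AS c
JOIN Pool AS p
    ON p.PoolID = c.PoolID;
```

An inner join is enough here because every CruisePool row must reference an existing Pool row.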

CityPool entity

Another type of pool we can find is a municipal pool, represented by the entity CityPool and implemented with its specific table. Its DDL is very similar to the previous one, with the unique feature that in this case, we have a foreign key pointing to CityPool, which can be directly inferred from the 1-* type association connecting CityPool with Entry in the conceptual diagram.

CREATE TABLE CityPool (
    PoolID SERIAL PRIMARY KEY,
    MaxCapacity INT NOT NULL CHECK (MaxCapacity >= 0),
    AnnualBudget DOUBLE PRECISION NOT NULL CHECK (AnnualBudget >= 0),
    AccessibilityFeatures VARCHAR(32) NOT NULL,
    FreeWifi BOOLEAN NOT NULL,
    FOREIGN KEY (PoolID) REFERENCES Pool(PoolID)
);

In the relational diagram, it's important that the foreign key PoolID is underlined, indicating that this attribute, despite being a foreign key, is used to uniquely identify the tuples in CityPool. This means that when referencing the primary key of Pool, the foreign key contains exactly the value that identifies the pool in Pool.

So if a query simply needs the identifier of a pool of a specific type, we don’t need to access the Pool table, as the foreign key attribute of the CityPool, CruisePool, or OlympicPool table, for example, is enough to know it.

There are even times when we can access data from other tables that are indirectly associated through more levels of association, as in CruiseBooking, where we can access the identifier of a CruiseShip through the value of its foreign key, which doesn't point directly to CruiseShip, but to Voyage.
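
As a small sketch of that last case, the query below lists the cruise ships a person has booked without joining Voyage or CruiseShip. It assumes the CruiseBooking table defined earlier, and the person identifier 1 is just a placeholder.

```sql
-- ShipFK belongs to the foreign key that points to Voyage, yet its value
-- is the identifier of the cruise ship, so no join is needed.
SELECT DISTINCT ShipFK
FROM CruiseBooking
WHERE PersonFK = 1;  -- placeholder person identifier
```

A join would only become necessary if we needed other attributes of the cruise ship besides its identifier.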

OlympicPool entity

Regarding the last type of pool in our schema, there's OlympicPool, which represents Olympic pools. The implementation of this entity as a table is the same as the previous ones, with the difference that in the entity-relationship diagram, we can see that there are two foreign keys pointing to OlympicPool. Otherwise, the only differences are in the attributes that characterize the type of pool.

CREATE TABLE OlympicPool (
    PoolID SERIAL PRIMARY KEY,
    SpectatorMaxCapacity INT NOT NULL CHECK (SpectatorMaxCapacity >= 0),
    CompetitionLanes INT NOT NULL CHECK (CompetitionLanes > 0),
    FOREIGN KEY (PoolID) REFERENCES Pool(PoolID)
);

Entry entity

Continuing with what a person can do in a pool in our system, we have the entity Entry. This is responsible for storing tickets that a person can use to enter a municipal pool, meaning one that is represented by the CityPool entity only.

To ensure that a person can only access a municipal pool with these tickets, the entity has a 1-* association with CityPool, and not directly with Pool, as that would give access to any pool regardless of type. To know which person the ticket belongs to, it also has a 1-* association with Person, where a person can have an arbitrary number of tickets, but a ticket can only belong to one person.

On the other hand, we also have a 1-* association where the 1 side is in Entry, modeling that the tickets can have associated penalties, which we’ll see later. So, with all this, we can know that at a logical level, Entry will have 2 foreign keys pointing to other entities, as well as a foreign key from another entity pointing to Entry.

To uniquely identify the tickets, the most important attribute is EntryTimestamp. This records the exact time the ticket was purchased. But several people can buy tickets at the same time to enter the same pool, leading to multiple tuples with the same EntryTimestamp value, so the primary key must have more attributes to uniquely identify all the tickets.

Specifically, the primary key needs the foreign key attributes PersonFK and PoolFK to differentiate entries by the person who bought them and the pool they enter, as well as the exact time of purchase. So if we consider the possible situations and combinations of values that can occur for the primary key of Entry, we’ll see that a person can’t buy multiple entries at the exact same moment to enter the same pool.

This makes sense when the domain states that each ticket is associated with a single person and that a person can’t buy a ticket for someone else. In other words, if a person buys a ticket, they must use it themselves. They can’t buy multiple tickets for several people to enter. This doesn't have to be the case in all domains – we're just assuming here that people can't buy tickets for others.

In other domains, this might need to be modeled differently depending on the requirements. So, we’ll need to make sure that our model meets these types of requirements imposed by the domain, especially when defining primary keys or UNIQUE constraints.

So by requiring the attributes PersonFK and PoolFK to be present in the primary key, the Entry entity becomes weak in identification with two owning entities, Person and CityPool, respectively. In the DDL, we have explicitly added the NOT NULL constraint to all attributes for clarity, although it wouldn't be necessary for those present in the primary key.

CREATE TABLE Entry (
    EntryTimestamp TIMESTAMP NOT NULL,
    Price DOUBLE PRECISION NOT NULL CHECK (Price >= 0),
    PaymentMethod PaymentMethodType NOT NULL,
    AppliedDiscount DOUBLE PRECISION NOT NULL CHECK (AppliedDiscount >= 0),
    Duration INT NOT NULL CHECK (Duration >= 0),
    PersonFK INT NOT NULL,
    PoolFK INT NOT NULL,
    PRIMARY KEY (EntryTimestamp, PersonFK, PoolFK),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (PoolFK) REFERENCES CityPool(PoolID)
);

On the other hand, the EntryTimestamp attribute of type TIMESTAMP is named differently from the IssueTime attribute of the BusTicket entity, for example.

This isn't very important, but in a real design process, we might be required to use style guides that determine how we should name each attribute depending on its semantics, type, or constraints, as well as when and how we should declare certain constraints. In this specific case, we didn't follow any style guide – we simply named the attributes as descriptively as possible according to the circumstances. Still, following a style guide offers advantages in system maintainability and ease of administration, among others.
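Going back to the primary key of Entry: if the domain did allow a person to buy several tickets for the same pool at the same instant, one possible alternative is a surrogate key. This is only a sketch and not part of this schema. The table name EntryWithSurrogateKey and the column EntryID are hypothetical, and only some of the original attributes are repeated to keep the example short.

```sql
-- Hypothetical variant of Entry for a domain where one person may buy
-- several tickets for the same pool at the same instant.
CREATE TABLE EntryWithSurrogateKey (
    EntryID SERIAL PRIMARY KEY,      -- surrogate key (assumption, not in the original schema)
    EntryTimestamp TIMESTAMP NOT NULL,
    Price DOUBLE PRECISION NOT NULL CHECK (Price >= 0),
    PersonFK INT NOT NULL,
    PoolFK INT NOT NULL,
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (PoolFK) REFERENCES CityPool(PoolID)
);
```

With this variant, two rows with the same EntryTimestamp, PersonFK, and PoolFK would both be accepted, which is exactly what the composite primary key of Entry forbids.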

Team entity

To hold competitions in Olympic pools, our system needs to be able to model sports teams made up of people who participate in these competitions. So we have the entity Team, which represents sports teams that have a reference Olympic pool, are made up of people, and participate in competitions in Olympic pools.

This is modeled by using the attributes of Team to store the characteristics of the sports teams, such as the name, creation date, uniform color, and so on. We can also use associations with other entities to determine which Olympic pool is the team's official pool, which people belong to the team, who coaches the team, and which competitions they participate in.

First, to model the Olympic pool considered the team's official pool, we’ll use a 1-* association with OlympicPool, which becomes a foreign key due to its cardinality. As you can see in the conceptual diagram, the role of the association specifies the semantics, since without it, you can’t directly infer what is being modeled with that association.

The same applies to the 1-* association with Person, which we use to determine who coaches the team, so we need to specify its semantics to avoid confusion about what that association actually models. That said, its cardinalities make it clear that a team can only have one person as a coach, not an arbitrary number of people, so we can rule out that the association models the people who belong to the team.

On the other hand, besides the foreign keys that Team has, there are others that point to Team and are responsible for modeling the people who make up the team, such as the one from Membership, or the team's participation in competitions, like the one we see from Participation.

CREATE TYPE SportType AS ENUM ('waterpolo', 'swimming', 'diving');
CREATE TABLE Team (
    Name VARCHAR(32),
    CreationDate DATE NOT NULL,
    ClothColor ColorType NOT NULL,
    Sport SportType NOT NULL,
    Budget INT NOT NULL CHECK (Budget >= 0),
    ContactEmail VARCHAR(32) NOT NULL,
    CoachFK INT NOT NULL,
    HomePoolFK INT NOT NULL,
    PRIMARY KEY (Name, CoachFK),
    FOREIGN KEY (CoachFK) REFERENCES Person(PersonID),
    FOREIGN KEY (HomePoolFK) REFERENCES OlympicPool(PoolID)
);

To uniquely identify each team, the attribute that can serve us best from the table itself is Name. But in this domain, we assume that multiple teams can have the same name, so the primary key can’t be formed solely by that attribute.

So from all the other attributes we have, we finally include the foreign key CoachFK in the primary key, meaning we also use the information of the person who coaches the team to uniquely identify it. This works because we assume that there can’t be multiple teams with the same name coached by the same person.

At first glance, this might seem entirely possible, but consider that some domain requirements might impose this condition, which we can leverage to define (Name, CoachFK) as the primary key. In any case, before making such a decision, make sure that the set of attributes meets the primary key restriction, either due to domain requirements or the semantics of the attributes themselves.

We can declare foreign keys with FOREIGN KEY referencing the primary key of Person and OlympicPool. We impose the NOT NULL restriction on them since all teams must have a coach and an official Olympic pool. Here, we have also assumed the necessity of these elements, but in other cases it might not be mandatory to have an official pool or a coach – it all depends on the domain.

If having a coach were not mandatory, we couldn’t include the foreign key attribute CoachFK in the primary key, as it could be NULL and would violate the primary key restriction. So for an entity to be weak in identification and another to be its owner, the association between them must be mandatory, meaning its minimum cardinality on the owner's side can’t be 0.

Finally, we define an ENUM type here for the type of sport the team plays, which is stored in the Sport attribute. But we don’t need to define a new one for the ClothColor attribute, as we had the ENUM ColorType defined earlier. This is a good example of how a data type is reused across attributes with the same domain in different entities.

Membership entity

We’ll continue with the semantics of the previous entity. To model the possibility of people being part of a team, the simplest approach would be to include an N-M association between Person and Team. This would be in addition to the 1-* association that already exists to model the person who coaches the team. This way, a person can belong to an arbitrary number of teams, while a team can be composed of an arbitrary number of people.

But since this association requires an intermediate entity to be implemented at the logical level, and we also need to store information about a person's membership in a team, we’ll introduce the Membership entity. This entity divides the N-M association into several 1-* associations, indirectly connecting Person with Team. In this way, each person belonging to a team will have a tuple in this table representing their membership. It’ll store information such as the start or end date of membership or the fee they must contribute to the team to be part of it.

At the conceptual level, we can see that this entity has many similarities with others like Residence. For example, we define the primary key of this entity with the attribute JoinDate and the foreign keys that determine which Person belongs to which Team. This is because the attributes that appear exclusively in the entity can’t uniquely identify each membership. That is, multiple people can join different teams on the same date, so if JoinDate alone formed the primary key, multiple tuples in Membership would share the same primary key value.

So even though the foreign key attributes don’t explicitly appear at the conceptual level, it’s clear that a Membership tuple must be identified not only by the start date but also by the person and team it relates to. This way, two tuples that share only the person and start date, or only the team and start date, aren’t treated as the same membership. So we know it’s a weak entity in identification that’s dependent on Person and Team.

Since it depends on both entities it’s related to for identification, we could have represented it as an associative entity connected with the possible N-M association between Person and Team. But to make the diagram as clear and close to the logical level as possible, we should instead use an "intermediate" entity like the one represented here with 1-* associations.

CREATE TYPE PaymentFrequencyType AS ENUM ('monthly', 'annual', 'weekly', 'quarterly');
CREATE TABLE Membership (
    JoinDate DATE NOT NULL,
    LeaveDate DATE CHECK (
        LeaveDate IS NULL
        OR LeaveDate >= JoinDate
    ),
    FeeAmount INT NOT NULL CHECK (FeeAmount >= 0),
    PaymentFrequency PaymentFrequencyType NOT NULL,
    AutoRenewal BOOLEAN NOT NULL,
    PersonFK INT NOT NULL,
    TeamNameFK VARCHAR(32) NOT NULL,
    CoachFK INT NOT NULL,
    PRIMARY KEY (JoinDate, PersonFK, TeamNameFK, CoachFK),
    FOREIGN KEY (PersonFK) REFERENCES Person(PersonID),
    FOREIGN KEY (TeamNameFK, CoachFK) REFERENCES Team(Name, CoachFK)
);

To implement the foreign keys, we’ll create the corresponding attributes: PersonFK, which is the foreign key pointing to Person, and (TeamNameFK, CoachFK), where both constitute the other foreign key referencing the team to which the person belongs. Both keys are not null because a Membership tuple must associate a person with a team.

Once we’ve declared the attributes and FOREIGN KEY constraints, we can define the primary key as the set of attributes consisting of JoinDate, the foreign key attribute PersonFK, and the other two attributes (TeamNameFK, CoachFK) of the foreign key referencing the team. We can declare them in any order in the PRIMARY KEY constraint, as long as they all appear.

Finally, according to the domain, we assume that people don’t know exactly when they will stop being members of a team, so LeaveDate doesn’t always have to be defined. This means it can be NULL until the person leaves the team or plans to leave on a specific date. So we have to define a CHECK constraint on that attribute to make sure that it’s either NULL or a date on or after JoinDate, as a person can’t leave a team before the start date of membership.
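To see how the nullable LeaveDate might be used in practice, here's a small sketch that lists the current members of one team. The team name 'Sharks' and the coach identifier 1 are made-up values for illustration.

```sql
-- Current members of a given team: memberships with no leave date yet,
-- or with a leave date that is still in the future.
SELECT PersonFK, JoinDate, FeeAmount, PaymentFrequency
FROM Membership
WHERE TeamNameFK = 'Sharks'      -- hypothetical team name
  AND CoachFK = 1                -- hypothetical coach identifier
  AND (LeaveDate IS NULL OR LeaveDate > CURRENT_DATE);
```

As a side note, in PostgreSQL a CHECK constraint is also satisfied when its expression evaluates to NULL, so LeaveDate >= JoinDate on its own would already accept a NULL LeaveDate. Writing LeaveDate IS NULL explicitly, as the DDL does, simply makes the intent easier to read.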

Participation entity

Similarly, a sports team can also participate in sports competitions registered in SwimmingCompetition. So we have an entity called Participation that indirectly links Team with SwimmingCompetition through 1-* associations. This is just like we saw earlier with Membership, but with a different meaning. Specifically, what mainly changes is the information stored about the team's participation in a competition, such as the date they register to participate, their ranking position after the competition, or the time it took to complete the competition.

To uniquely identify the tuples of Participation, the simplest way is to use a custom database identifier as a surrogate key, just as we have done before with certain entities. But if the domain requirements don’t allow us to include surrogate keys or any additional database-specific identifier, we’ll need to choose a set of attributes that enable identification.

So if we assume that no team participates more than once in the same competition (as this wouldn't make sense), we can declare a primary key formed by the foreign keys referencing Team and SwimmingCompetition. This way, we ensure that different tuples of Participation don’t associate the same team with the same competition, as that situation can’t occur.

As you can see, identifying this entity completely depends on other entities like Team and SwimmingCompetition, meaning there is no attribute at the conceptual level of the entity that forms part of the primary key.

This isn’t necessarily a bad thing, but rather a consequence of the domain requirements preventing us from using a surrogate key. In fact, this dependency in identification can have certain advantages, such as avoiding additional columns, which might be a domain-imposed requirement (to use the fewest columns possible).

CREATE TABLE Participation (
    RegistrationDate DATE NOT NULL,
    Rank INT NOT NULL CHECK (Rank > 0),
    RecordedTime DOUBLE PRECISION NOT NULL CHECK (RecordedTime >= 0),
    NameFK VARCHAR(32),
    StartDateFK DATE,
    EndDateFK DATE,
    TeamNameFK VARCHAR(32),
    CoachFK INT,
    PRIMARY KEY (
        NameFK,
        StartDateFK,
        EndDateFK,
        TeamNameFK,
        CoachFK
    ),
    FOREIGN KEY (TeamNameFK, CoachFK) REFERENCES Team(Name, CoachFK),
    FOREIGN KEY (NameFK, StartDateFK, EndDateFK) REFERENCES SwimmingCompetition(Name, StartDate, EndDate)
);

At the logical level, the foreign key referencing Team has two attributes, which make up the primary key of Team. The foreign key pointing to SwimmingCompetition has three for the same reason. So we’ll use two FOREIGN KEY constraints: one to declare the foreign key pointing to Team and the other for the one pointing to SwimmingCompetition, respectively.

Note that a FOREIGN KEY constraint only allows one REFERENCES clause. So if we have multiple foreign keys pointing to various entities, we have to use a separate FOREIGN KEY constraint for each foreign key. Declaring them all in a single constraint would require naming several referenced tables, and therefore several REFERENCES clauses, which the syntax doesn’t allow.

After declaring the foreign keys, whose attributes can’t be NULL because it’s mandatory for a participation to relate a team with a competition, we declare the primary key as the set of attributes that form both foreign keys together. Since these attributes belong to the primary key, they’re implicitly NOT NULL even without an explicit constraint. So in our system, there can be Participation tuples with different Rank or RegistrationDate values without any problem – but there can’t be multiple tuples with the same value in their primary key (meaning they can’t relate the same team with the same competition multiple times).
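As an illustration of how these composite foreign keys are used in queries, the following sketch lists the ranking of every competition along with its prize. It only uses tables and columns declared in this chapter. The output aliases Competition and Team are chosen here for readability.

```sql
-- Ranking of every competition, with the prize of each one.
-- Each composite foreign key is joined on all of its attributes.
SELECT c.Name AS Competition,
       c.StartDate,
       t.Name AS Team,
       t.Sport,
       p.Rank,
       c.PrizeAmount
FROM Participation AS p
JOIN Team AS t
  ON t.Name = p.TeamNameFK
 AND t.CoachFK = p.CoachFK
JOIN SwimmingCompetition AS c
  ON c.Name = p.NameFK
 AND c.StartDate = p.StartDateFK
 AND c.EndDate = p.EndDateFK
ORDER BY c.Name, c.StartDate, p.Rank;
```

Leaving out one attribute of a composite key in a join condition would match a participation with every team or competition that shares the remaining values, so all attributes of each key appear in the ON clauses.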

Finally, if we try to reconstruct the conceptual entity from the relational diagram, the first thing we should notice is that all the foreign keys are underlined, and therefore form the primary key. As they are foreign keys, these attributes won’t appear in the conceptual entity of Participation.

To determine how many foreign keys we actually have, and know how many 1-* associations to introduce and with which entities to connect them, we can see that a subset of attributes like (TeamNameFK, CoachFK) refers to the same entity – so there will be a 1-* relationship with that entity, with the many side in Participation. Doing the same with the attributes (NameFK, StartDateFK, EndDateFK), we see that they all refer to attributes of the same entity. So they form a foreign key that results in a 1-* association like the previous one, but connecting with another entity.

To infer the minimum cardinalities, we should look at the constraints indicated in the relational diagram: which foreign keys can or can’t be NULL, or how many participations each competition must have (as well as the participations each team must have).

In this case, we haven’t indicated any constraints in the relational diagram for simplicity. But, for example, if we were told that a foreign key can’t be NULL in the relational model, this conceptually translates to its respective 1-* association having a minimum cardinality of 1 on the 1 side. Similarly, if there are special constraints that require each team to have 2 participations, for example, then we know that in its corresponding association, the minimum cardinality on the Participation side would be 2.

This is the reverse of the process we initially followed to implement the entity at the logical level, where we inferred the attributes that make up each foreign key and selected those that form the primary key.

SwimmingCompetition entity

To model the sports competitions that can take place in an Olympic pool, in the conceptual diagram we have the entity called SwimmingCompetition that’s responsible for storing information about the competitions held in all the Olympic pools registered in the system. In these, any number of sports teams can participate.

The information that SwimmingCompetition stores mainly depends on the domain and requirements. In this case, we assume that we only need to store the name of the competition, start and end dates that will always be determined, a RecordTime attribute to store any record times achieved during the course of that competition, and the monetary amount of the prize for that competition.

With these attributes, the simplest way to uniquely identify each tuple in the SwimmingCompetition table is to define the set of attributes (Name, StartDate, EndDate) as the primary key.

For example, there can be competitions in the database with exactly the same name, but they can never have the same start and end dates simultaneously (because that would mean they were the same competition). Ultimately, by declaring this primary key, we’re assuming that there are no different competitions with the same name and start and end dates – so if this condition aligns with the domain requirements, it would be correct.

Consequently, there can be different competitions in the database with different combinations of values for the primary key attributes – but they might have the same record time, or the same prize in PrizeAmount, since there are no restrictions preventing it.

CREATE TABLE SwimmingCompetition (
    Name VARCHAR(32),
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL CHECK (EndDate >= StartDate),
    RecordTime DOUBLE PRECISION CHECK (RecordTime >= 0),
    PrizeAmount INT NOT NULL CHECK (PrizeAmount >= 0),
    PRIMARY KEY (Name, StartDate, EndDate)
);

In the conceptual diagram, we see that the entity has several associations, one of which leads to a foreign key pointing to OlympicPool, since the competition must necessarily take place in an Olympic pool. So this foreign key references the specific pool where the competition is held, making its existence mandatory. In other words, the foreign key can’t be NULL because, in the conceptual model, we set the minimum cardinality to 1 to ensure that every competition is associated with a pool where it takes place.
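Note that the CREATE TABLE statement shown above for SwimmingCompetition doesn't declare this foreign key. One way it could be added is sketched below. The column name PoolFK is an assumption, chosen to match the naming used in Entry. The statement also assumes the table is still empty, since adding a NOT NULL column without a default fails on a table that already has rows.

```sql
-- Hypothetical column: the Olympic pool where the competition is held.
-- NOT NULL reflects the minimum cardinality of 1 in the conceptual model.
ALTER TABLE SwimmingCompetition
    ADD COLUMN PoolFK INT NOT NULL
    REFERENCES OlympicPool(PoolID);
```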

Another peculiarity of this entity is that the attribute RecordTime may not always be defined. For example, when we register a competition in the database and need to provide a value for this attribute, such a value might not exist because it’s the first time the competition is being held. So, the simplest way to model it would be to set that attribute to the maximum or minimum possible value, depending on whether we consider lower or higher times to be better.

Additionally, since in our domain we also model the possibility of athletes participating in a competition being sanctioned, there is a chance that in a certain competition held for the first time, all participants could be sanctioned. This means that none of them would contribute to initializing the value of the RecordTime attribute. This is why it needs to be allowed to be NULL.

But this is a decision we must make primarily considering the domain and its requirements, as we may want to initialize the attribute with a default or special value if all athletes are penalized and the record time can’t be initialized, for example.
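The following sketch shows why NULL arises naturally in that situation. It computes, for each competition, the best time among participations that have no sanction. It rests on two assumptions that the text doesn't state: that a lower RecordedTime is better, and that a participation counts as sanctioned when a row in SportSanction (described later in this chapter) references it.

```sql
-- Best recorded time per competition, ignoring sanctioned participations.
SELECT c.Name,
       c.StartDate,
       c.EndDate,
       MIN(p.RecordedTime) AS BestValidTime
FROM SwimmingCompetition AS c
LEFT JOIN Participation AS p
  ON p.NameFK = c.Name
 AND p.StartDateFK = c.StartDate
 AND p.EndDateFK = c.EndDate
 AND NOT EXISTS (
         SELECT 1
         FROM SportSanction AS s
         WHERE s.NameFK = p.NameFK
           AND s.StartDateFK = p.StartDateFK
           AND s.EndDateFK = p.EndDateFK
           AND s.TeamNameFK = p.TeamNameFK
           AND s.CoachFK = p.CoachFK
     )
GROUP BY c.Name, c.StartDate, c.EndDate;
```

When a competition has no valid participation, MIN has no values to aggregate and returns NULL, which is the same "no record yet" state that a nullable RecordTime represents.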

Sanction entity

Given everything a person can do in our domain in relation to other entities, they might break a rule that results in a sanction. So in our schema, we can introduce an IS-A hierarchy where the superclass is the entity Sanction, and its inherited entities are the different types of sanctions we define, all depending on their scope of application.

Specifically, deciding to use a hierarchy to model sanctions is driven by the specific information that needs to be stored for each type of sanction. For this reason, if we tried to use a single Sanction entity to represent all these types, its semantics would be very complicated (as some attributes would only be useful if the sanction were of a certain type – and the same goes for many others). We would also need to use a specific attribute to represent the sanction type, since otherwise knowing the exact type might depend on which attributes were NULL, and this would complicate queries.

So with this hierarchy, we can have a set of common attributes for all sanctions in Sanction, such as the monetary amount of the fine, the description, the date of the sanction, or the status, while in the inherited entities, we have specific attributes that characterize each type of sanction.

How is the IS-A hierarchy implemented with tables?

Just as we have done with other hierarchies, we need to analyze it to know how to implement it at the logical level. We need to keep in mind that the conceptual design doesn’t unequivocally determine the implementation that we’ll ultimately carry out, especially when working with IS-A hierarchies. Rather, it’s a decision we should make based not only on the conceptual design itself, but also on the domain and data requirements.

To do this, we first check whether the hierarchy is complete or not. In this case, all existing sanctions will be of a specific type represented by the inherited entities. This means that all individuals in the hierarchy will belong to one of the sets generated by these entities, which implies that the hierarchy is complete.

On the other hand, the types of sanctions are all disjoint, meaning a sanction can only be of one type, not several at once. This means that the hierarchy is disjoint because no individual will be represented by multiple inherited entities at the same time.

To identify each sanction, we’ll use a SanctionID attribute in the superclass Sanction, which we’ll implement using a surrogate key. This avoids the inherited entities needing to use their own identifiers, as we assume that the domain requirements don’t require us to identify each type of sanction differently.

So given the number of attributes each inherited entity has, it’s clear that we’ll need a table to implement each inherited entity. Otherwise, too many NULL values would be generated in the corresponding attributes, complicating both database management and queries, and potentially leading to unnecessary constraints aimed at ensuring schema integrity.

On the other hand, we have several options for implementing the superclass at the logical level. One option is to duplicate all attributes in each of the tables of the specific sanction types.

This has advantages, such as identifying the tuples of each table using the SanctionID attribute inherited from the superclass, but it leads to schema management problems. If we later want to delete, modify, or add an attribute of Sanction, we would have to perform that operation on all the tables of the different sanction types, increasing the likelihood of errors in the process. Also, if we want to query all the sanctions in our database, with this option, we would have to go through all the tuples of all the tables of each sanction type, which could be inefficient because it requires accessing multiple tables.

To minimize errors in the database management process, we need to simplify the operations involved as much as possible. To do this, we can implement a specific table for the superclass of the hierarchy in the same way as we did with the Vehicle hierarchy (and for similar reasons).

Each of the tables for the inherited entities will have a foreign key referencing the superclass table, where we’ll store the information of attributes common to all sanctions. This makes it easier to modify these attributes. It also simplifies other operations, such as adding a new type of sanction. For this, only a new table for that type needs to be created, and we just need to make sure that its foreign key references Sanction.

If we look at the inherited entities, we’ll see that each one has a 1-* association with other entities where the many side is always in the entities of the hierarchy. This means that using a single table to implement the entire hierarchy isn’t a good idea, as it would combine all those foreign keys into one table. This would lead to much more complicated integrity constraints.

In other words, if a sanction is of a specific type, the attributes and foreign keys of the remaining types must be NULL, so ensuring this for all types involves overly elaborate and complex constraints.
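To make this concrete, here's a rough sketch of what a single-table version might need for just two sanction types. The table name SanctionSingleTable and the SanctionKind column are hypothetical, the foreign key declarations are left out, and only a few attributes per type are included.

```sql
-- Hypothetical single-table version covering only two sanction types.
CREATE TABLE SanctionSingleTable (
    SanctionID SERIAL PRIMARY KEY,
    SanctionKind VARCHAR(16) NOT NULL
        CHECK (SanctionKind IN ('driving', 'sport')),
    -- Attributes that only make sense for driving sanctions
    PointsDeducted INT,
    LicenseFK INT,
    -- Attributes that only make sense for sport sanctions
    SuspendedCompetitions INT,
    RefereeName VARCHAR(32),
    -- Each kind must fill in its own attributes and leave the others NULL
    CHECK (
        (SanctionKind = 'driving'
            AND PointsDeducted IS NOT NULL
            AND LicenseFK IS NOT NULL
            AND SuspendedCompetitions IS NULL
            AND RefereeName IS NULL)
        OR
        (SanctionKind = 'sport'
            AND SuspendedCompetitions IS NOT NULL
            AND RefereeName IS NOT NULL
            AND PointsDeducted IS NULL
            AND LicenseFK IS NULL)
    )
);
```

Every additional sanction type would add another branch to this CHECK and more NULL conditions to the existing ones, which is the complexity that one table per entity avoids.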

CREATE TYPE SanctionStatusType AS ENUM ('created', 'active', 'expired');
CREATE TABLE Sanction (
    SanctionID SERIAL PRIMARY KEY,
    Amount DOUBLE PRECISION NOT NULL CHECK (Amount >= 0),
    Description VARCHAR(32),
    IssueDate DATE NOT NULL,
    ExpirationDate DATE CHECK (
        ExpirationDate IS NULL
        OR ExpirationDate >= IssueDate
    ),
    Status SanctionStatusType NOT NULL
);

In the DDL of the Sanction table, we can see that its primary key {SanctionID} consists of a single attribute of type SERIAL. This is a surrogate key that will be used to identify all sanctions in the database, regardless of their type.

This table also stores the status of the sanction in an attribute, since the type of sanction is represented with inherited entities from the superclass at the conceptual level. So the status must be modeled as an attribute to avoid mixing the semantics of what we represent with each tool of the entity-relationship diagram.

In other words, we could include new inherited entities that model the states of the sanctions, but we have to consider that each type of sanction could be in any of those states – leading to an unnecessarily complicated multi-level hierarchy.

Because of this, we should separate the semantics of what we represent with inherited entities from the semantics of the sanction's status, modeling it with an attribute in the superclass, as any type of sanction can be in any state.

In this Status attribute, we define a TYPE ENUM to restrict the possible states a sanction can have. But for the Description, if we want to save a description of the sanction written in natural language, we shouldn’t add restrictions unless it’s required by the project specifications.

A description in natural language can be very diverse, so the simplest approach is not to limit the possible values the attribute can take, not even with a NOT NULL constraint. This can indicate that a sanction has no description, although this isn’t necessarily correct.

In general, decisions to allow NULL values also depend on the domain and requirements. For example, sanctions may or may not have an expiration date, which is why the CHECK constraint defined on ExpirationDate specifies that this attribute can either be NULL or must hold a date on or after the issuance date of the sanction.

DrivingSanction entity

Let’s now talk about the different types of sanctions in our system. First, we have DrivingSanction, which are sanctions associated with driver's licenses. So in the conceptual diagram, it has a 1-* association with the DrivingLicense entity, resulting in a foreign key in DrivingSanction that references the driver's license with the sanction. This refers to the license of the person who committed a traffic violation, leading to the existence of the fine.

The specific attributes of this type of sanction store information about why the fine was issued, such as the speed the vehicle was going, as well as the effect the sanction has on the license (like deducting a certain number of points or suspending it for a certain period).

In its DDL, we can see that all attributes have been declared as NOT NULL, which at first might seem unnecessary in the case of RecordedSpeed, since not all sanctions are caused by speed. But this illustrates that an attribute doesn’t have to be NULL to indicate that it doesn’t apply to a particular tuple.

For example, if a sanction is not related to speed, instead of using a NULL value in the RecordedSpeed attribute, we can use a special value like 0, as long as it respects the integrity constraints and system domain requirements. This still allows us to distinguish whether the sanction is related to a possible speeding violation. So the decision to allow NULL or not is initially made when modeling the entity at the logical level. This freedom holds as long as we aren’t forced to use a specific semantics, like setting the attribute to 0 when it’s not necessary.

If we consider whether other attributes can be NULL or not, we can see that PermanentSuspension always has the option to take the value false (as the suspension might not be permanent). Similarly, if the suspension is permanent, the SuspensionDays attribute can always be set to 0, or to a different special value. We could also simply ignore its value and check first if the suspension is permanent before accessing the SuspensionDays attribute, among other options.

CREATE TABLE DrivingSanction (
    SanctionID SERIAL PRIMARY KEY,
    RecordedSpeed DOUBLE PRECISION NOT NULL CHECK (RecordedSpeed >= 0),
    PointsDeducted INT NOT NULL CHECK (PointsDeducted >= 0),
    SuspensionDays INT NOT NULL CHECK (SuspensionDays >= 0),
    PermanentSuspension BOOLEAN NOT NULL,
    LicenseFK INT NOT NULL,
    FOREIGN KEY (SanctionID) REFERENCES Sanction(SanctionID),
    FOREIGN KEY (LicenseFK) REFERENCES DrivingLicense(LicenseID)
);
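Under the special-value convention described above, a query can tell speed-related sanctions apart without any NULL checks. This is only a sketch: treating RecordedSpeed = 0 as "not related to speed" is the assumed convention, and EffectiveSuspensionDays is an illustrative output name.

```sql
-- Driving sanctions caused by speeding, under the assumed convention
-- that RecordedSpeed = 0 means "not related to speed".
SELECT SanctionID,
       RecordedSpeed,
       CASE
           WHEN PermanentSuspension THEN NULL   -- day count is meaningless here
           ELSE SuspensionDays
       END AS EffectiveSuspensionDays
FROM DrivingSanction
WHERE RecordedSpeed > 0;
```

The CASE expression reflects the option of checking PermanentSuspension first and ignoring SuspensionDays when the suspension is permanent.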

On the other hand, the NOT NULL is indeed necessary for both foreign keys in the table, as SanctionID is the foreign key that references the tuple in Sanction that holds the rest of the sanction information. Its primary key attribute SanctionID serves as the primary key of the DrivingSanction table itself, and it’s the only way to uniquely identify the sanctions. Also, the foreign key that references the driving license that received the sanction can’t be NULL either, because if the sanction is of type DrivingSanction, it must necessarily be associated with a license.
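Because a driving sanction is split across two tables, registering one means inserting a row in Sanction and a row in DrivingSanction with the same identifier. One way to do this in PostgreSQL is a data-modifying WITH clause, sketched below. All the values are made up, and the statement assumes that a license with LicenseID = 1 already exists in DrivingLicense.

```sql
-- Insert the common part first and reuse the generated identifier
-- for the type-specific part.
WITH new_sanction AS (
    INSERT INTO Sanction (Amount, Description, IssueDate, Status)
    VALUES (150.0, 'Speeding on highway', DATE '2025-03-01', 'active')
    RETURNING SanctionID
)
INSERT INTO DrivingSanction (
    SanctionID, RecordedSpeed, PointsDeducted,
    SuspensionDays, PermanentSuspension, LicenseFK
)
SELECT SanctionID, 142.5, 3, 0, FALSE, 1   -- hypothetical values
FROM new_sanction;
```

Note that SanctionID is supplied explicitly for DrivingSanction, so the default value that SERIAL would otherwise generate for that column is not used. The identifier has to come from Sanction for the foreign key to be satisfied.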

SportSanction entity

Another type of sanction is represented in the entity SportSanction. This models those sanctions that occur in sports competitions, specifically those caused by a sports team while participating in a competition. Like the previous entity, it has attributes that characterize this type of sanction, such as the number of competitions the team is suspended for or the name of the referee who issued the sanction.

In addition to this information, each sanction of this type needs to know which specific team received the sanction, as well as the competition they were participating in when they were sanctioned. So to model this, we have multiple options. We could use two 1-* associations to connect SportSanction with Team and with SwimmingCompetition, so that from the sanction you can identify the corresponding team and competition. But this is redundant and unnecessary, as it would lead to two foreign keys that can actually be reduced to one.

Remember that in our schema, we have an entity called Participation that relates teams to the competitions they participate in. So instead of two 1-* associations in SportSanction, we can use just one that connects it with Participation, since from Participation we can determine the team and competition.

CREATE TABLE SportSanction (
    SanctionID SERIAL PRIMARY KEY,
    SuspendedCompetitions INT NOT NULL CHECK (SuspendedCompetitions >= 0),
    RefereeName VARCHAR(32) NOT NULL,
    NameFK VARCHAR(32) NOT NULL,
    StartDateFK DATE NOT NULL,
    EndDateFK DATE NOT NULL,
    TeamNameFK VARCHAR(32) NOT NULL,
    CoachFK INT NOT NULL,
    FOREIGN KEY (SanctionID) REFERENCES Sanction(SanctionID),
    FOREIGN KEY (
        NameFK,
        StartDateFK,
        EndDateFK,
        TeamNameFK,
        CoachFK
    ) REFERENCES Participation(
        NameFK,
        StartDateFK,
        EndDateFK,
        TeamNameFK,
        CoachFK
    )
);

To implement this in SQL, we can create a table very similar to DrivingSanction, where its primary key is the attribute SanctionID, also declared as a foreign key referencing the Sanction table of the superclass. We can declare attributes in the same way as we have been doing so far, both for the attributes of the entity itself and for the foreign keys. The foreign keys must have the same data types as the attributes they reference.

In this case, to declare the foreign key that points to Participation, we need as many attributes as its respective primary key has, which is a total of 5. To simplify this process, the ideal approach is to look directly at the PRIMARY KEY constraint of the table we want to reference. Then for each of those attributes, we can declare it in our table with a characteristic name and the corresponding data type. We finally add it to the FOREIGN KEY constraint so that it references the attribute that originated it, as we have already seen.

For example, if the primary key of Participation is (NameFK, StartDateFK, EndDateFK, TeamNameFK, CoachFK), then we declare an attribute NameFK for the foreign key of SportSanction that points to the NameFK attribute of that primary key, another StartDateFK that points to the StartDateFK attribute of the primary key of Participation, and so on.
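
To see the composite foreign key in action, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. It is an illustration under assumptions, not the schema from the script: the Sanction superclass and the non-key attributes are left out, the column types are simplified, and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled

conn.executescript("""
CREATE TABLE Participation (
    NameFK      TEXT NOT NULL,
    StartDateFK TEXT NOT NULL,
    EndDateFK   TEXT NOT NULL,
    TeamNameFK  TEXT NOT NULL,
    CoachFK     INTEGER NOT NULL,
    PRIMARY KEY (NameFK, StartDateFK, EndDateFK, TeamNameFK, CoachFK)
);
CREATE TABLE SportSanction (
    SanctionID  INTEGER PRIMARY KEY,
    NameFK      TEXT NOT NULL,
    StartDateFK TEXT NOT NULL,
    EndDateFK   TEXT NOT NULL,
    TeamNameFK  TEXT NOT NULL,
    CoachFK     INTEGER NOT NULL,
    FOREIGN KEY (NameFK, StartDateFK, EndDateFK, TeamNameFK, CoachFK)
        REFERENCES Participation (NameFK, StartDateFK, EndDateFK, TeamNameFK, CoachFK)
);
""")

# One participation (hypothetical values) and a sanction that points to it.
participation = ("City Cup", "2025-05-01", "2025-05-03", "Sharks", 7)
conn.execute("INSERT INTO Participation VALUES (?, ?, ?, ?, ?)", participation)
conn.execute("INSERT INTO SportSanction VALUES (1, ?, ?, ?, ?, ?)", participation)

# A sanction whose five values match no participation must be rejected.
wrong_team = ("City Cup", "2025-05-01", "2025-05-03", "Dolphins", 7)
try:
    conn.execute("INSERT INTO SportSanction VALUES (2, ?, ?, ?, ?, ?)", wrong_team)
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored_sanctions = conn.execute("SELECT COUNT(*) FROM SportSanction").fetchone()[0]
```

All five values have to match an existing Participation tuple at once. Changing just one of them, like the team name here, is enough for the DBMS to refuse the insertion.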

PoolSanction entity

To conclude the hierarchy of sanctions, we have PoolSanction. These, as you can guess, are sanctions imposed on people who have entered a CityPool and violated the pool rules. In this case, we store start and end dates as attributes, indicating the period during which the person can’t enter the pool. We can also include an amount as compensation if necessary, or a number of community service hours that the person must complete.

To determine from the sanction which person and pool are affected by the sanction, we can use a 1-* association with Entry. This results in a foreign key in PoolSanction that points to Entry because the many side is placed in PoolSanction. This way, we can identify the entry the person used when they received the sanction.

Besides the person, the entry also provides information about the pool they will no longer be able to enter freely. The sanction determines when they can re-enter or the action they must take due to being sanctioned.

CREATE TABLE PoolSanction (
    SanctionID SERIAL PRIMARY KEY,
    BanStartDate DATE,
    BanEndDate DATE,
    CompensationRequired INT NOT NULL CHECK (CompensationRequired >= 0),
    CommunityServiceHours INT NOT NULL CHECK (CommunityServiceHours >= 0),
    EntryFK TIMESTAMP NOT NULL,
    PersonFK INT NOT NULL,
    PoolFK INT NOT NULL,
    FOREIGN KEY (SanctionID) REFERENCES Sanction(SanctionID),
    FOREIGN KEY (EntryFK, PersonFK, PoolFK) REFERENCES Entry(EntryTimestamp, PersonFK, PoolFK),
    CHECK (
        (
            BanEndDate IS NULL
            AND BanStartDate IS NULL
        )
        OR BanEndDate >= BanStartDate
    )
);

In its DDL, we can see that the table identification is the same as the previous ones, with a primary key composed of the foreign key pointing to Sanction, as well as another foreign key pointing to Entry (which consists of three attributes). In this case, the sanction can impose several conditions on the sanctioned user: it can either prohibit them from entering for a period of time, require them to pay compensation, or serve a certain number of community service hours.

So if we assume that the domain and requirements don’t force us to store NULL values in any attribute and that we can make any decision about how data is stored in the system, we’ll decide to allow BanStartDate and BanEndDate to be NULL for sanctions that don’t prohibit the sanctioned user from entering the pool. Thus, in the CHECK constraint defined at the end, we see that as an integrity condition for all tuples in the table, either both attributes must be NULL or the end date must be on or after the start date of the prohibition. This helps ensure that only valid data is stored in the table.
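
The behavior of this CHECK constraint can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and keeps only the two date attributes, so the table below is a simplified assumption rather than the full PoolSanction definition. The helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE PoolSanction (
    SanctionID   INTEGER PRIMARY KEY,
    BanStartDate TEXT,  -- ISO dates stored as text in this SQLite sketch
    BanEndDate   TEXT,
    CHECK (
        (BanEndDate IS NULL AND BanStartDate IS NULL)
        OR BanEndDate >= BanStartDate
    )
)
""")

def try_insert(sanction_id, start, end):
    """Return True if the tuple is accepted, False if the CHECK rejects it."""
    try:
        conn.execute(
            "INSERT INTO PoolSanction VALUES (?, ?, ?)", (sanction_id, start, end)
        )
        return True
    except sqlite3.IntegrityError:
        return False

no_ban = try_insert(1, None, None)                      # accepted
valid_ban = try_insert(2, "2025-06-01", "2025-06-30")   # accepted
reversed_ban = try_insert(3, "2025-06-30", "2025-06-01")  # rejected
half_open = try_insert(4, "2025-06-01", None)           # accepted, see note
```

Note the last case: when only one of the dates is NULL, the comparison evaluates to unknown instead of false, and a CHECK constraint only rejects tuples for which the condition is false. If a ban with a single missing date should be invalid in our domain, we would need to add that condition explicitly.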

Lastly, we can see that some attributes of the foreign key pointing to Entry are named exactly the same as the attributes they reference, like PersonFK or PoolFK. This is neither a problem nor an error, although in a larger project where each table has more attributes, we should follow a proper style guide when naming attributes, especially those reserved for foreign keys. This way, we can more clearly understand their purpose without having to spend time analyzing the schema in detail.

How to Create the Database

Now that you understand the domain semantics and have completed the conceptual and logical design phases, we can implement the logical model on the DBMS.

The easiest way to do this is by creating a script with a .sql extension that contains all the necessary DDL code to build the database schema – that is, the statements we just reviewed where we create tables, data types, and constraints.

But since we aren’t working with a real project database here, we don't need to worry about the data that might be in tables that already exist in the database, especially those with the same name as any of the tables we’re going to create. So for simplicity, before creating them, we’ll execute some DROP statements to remove tables with names matching any of the tables we are going to create. This will make sure that the new tables contain no tuples.

Following this process, we’ll arrive at a DDL script like this (it’s quite long, so I’ve left it in the gist).

When we run the script, keep in mind that the statements will execute one by one from top to bottom. So we first use the DROP statements to remove any tables in the database that have the same name as any of those we’ll create.

This process is equivalent to deleting our entire database – that is, our logical model that was once created – so we first need to remove the tables that aren’t referenced by any foreign keys to maintain integrity while deleting the remaining tables.

Then, under the same condition, all corresponding tables that aren’t referenced by any foreign keys are successively deleted until no tables remain to be deleted.

There should now be no tables in our database whose names conflict with the tables in our logical model, so we write the CREATE TABLE statements we saw earlier for each table in the logical model.

We also need to do this in a specific order, specifically the reverse of the deletion process. Here, we first need to create tables that don’t have any foreign keys pointing to another entity. If we create a table at the beginning that needs to reference another table that hasn't been created yet, the DBMS will raise an error because the referenced table doesn't exist. So as you can see in the script, we place the statements in an order such that whenever a table with foreign keys pointing to other tables is created, those tables have already been created beforehand.

To figure out how we need to order both the DROP and CREATE TABLE statements, there are algorithms like topological sorting that we can apply to the relational diagram. This way, we treat the database schema as a directed graph made up of nodes (tables) and directed edges (foreign keys). With this algorithm, for example, we can progressively remove minimal or maximal nodes from the graph, creating or deleting the table they represent. But this is not the only method available.
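
As a small sketch of this idea, we can let Python's standard graphlib module (available since Python 3.9) compute a valid order. The dictionary below only covers the sanction tables from this chapter and lists, for each table, the tables its foreign keys point to. For simplicity it assumes that Sanction, DrivingLicense, Participation, and Entry have no dependencies of their own, which is not true in the full schema.

```python
from graphlib import TopologicalSorter

# table -> set of tables referenced by its foreign keys (simplified assumption)
references = {
    "Sanction": set(),
    "DrivingLicense": set(),
    "Participation": set(),
    "Entry": set(),
    "DrivingSanction": {"Sanction", "DrivingLicense"},
    "SportSanction": {"Sanction", "Participation"},
    "PoolSanction": {"Sanction", "Entry"},
}

# static_order() yields every table after all the tables it references.
create_order = list(TopologicalSorter(references).static_order())

# Dropping works the other way around: referencing tables go first.
drop_order = list(reversed(create_order))

print("CREATE order:", create_order)
print("DROP order:  ", drop_order)
```

Several orders can be valid for the same schema, so the exact output may differ from the order used in the script. If the foreign keys formed a cycle, TopologicalSorter would raise a CycleError, which is a hint that some constraints need to be added later with ALTER TABLE.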

Regarding data types and constraints defined in assertions or triggers, the order of creation is easier to infer. This is because the ENUM or DOMAIN types must always be created before being used in a table's attribute declaration. So the simplest approach is to create them at the very beginning, or just before we use them for the first time (which is what we’ve done here).

On the other hand, it's best to define assertions or triggers at the end. We also want to give them names descriptive enough of the constraints they model, as their definitions may involve multiple tables that we need to create before defining the constraint. Also, since these elements don’t contain data (tuples), we don’t need to delete them at the start of the script unless we are going to modify the schema itself. In that case, some constraints might become obsolete, meaning they access attributes or tables that no longer exist.

In summary, with this SQL script, we create the tables, data types, and constraints that make up our database schema, ensuring that none of them contain tuples immediately after being created.
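
The following sketch shows the drop-then-create pattern on a tiny scale, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The two tables are cut down to a couple of attributes, and the use of DROP TABLE IF EXISTS is an assumption about how the real script avoids errors on an empty database.

```python
import sqlite3

SCRIPT = """
-- Drop the referencing table first, then the table it references.
DROP TABLE IF EXISTS DrivingSanction;
DROP TABLE IF EXISTS Sanction;

-- Create in the reverse order: referenced table first.
CREATE TABLE Sanction (
    SanctionID INTEGER PRIMARY KEY
);
CREATE TABLE DrivingSanction (
    SanctionID INTEGER PRIMARY KEY REFERENCES Sanction(SanctionID),
    PointsDeducted INTEGER NOT NULL CHECK (PointsDeducted >= 0)
);
"""

conn = sqlite3.connect(":memory:")

# First run: the database is empty, so the DROP statements do nothing.
conn.executescript(SCRIPT)
conn.execute("INSERT INTO Sanction VALUES (1)")
conn.execute("INSERT INTO DrivingSanction VALUES (1, 3)")
conn.commit()
rows_before_rerun = conn.execute("SELECT COUNT(*) FROM Sanction").fetchone()[0]

# Second run: the tables exist, so they are dropped and recreated empty.
conn.executescript(SCRIPT)
rows_after_rerun = conn.execute("SELECT COUNT(*) FROM Sanction").fetchone()[0]
```

Running the script a second time leaves us with the same schema but no tuples, which is exactly why this approach is only suitable for an example database and not for one holding real data.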

But to run the script, we need to create a database in the DBMS. Let’s use the CREATE DATABASE statement to create a new database with a specific name:

CREATE DATABASE ExampleDataBase OWNER postgres;

If we run this command on the DBMS terminal, we will create a completely empty database named "exampledatabase". Note that in PostgreSQL, SQL keywords are not case-sensitive and unquoted element names are folded to lowercase. So even if we write an element's name in uppercase, when we later check the name value stored by the DBMS for the database, we’ll see it in lowercase. Only a name written inside double quotes keeps its exact case, and it must then always be written with those quotes.

We can also assign an owner user, who will have all the privileges over that element. By default, we can make the owner user postgres, but we can change it later with a statement like the following:

ALTER DATABASE exampledatabase
OWNER TO user3; /*user3 is a sample user*/

Once we’ve created the database, we can connect to it using the psql command \c exampledatabase. Finally, we can execute the .sql script with the command \i /path_to_script/script.sql. The DBMS should then notify us that the DROP statements have had no effect since there is no table with the corresponding name to delete (the database is empty). But if we run the script again after the tables have been created, the DROP statements will actually delete them, so the DBMS won’t show these notifications.

Similarly, if any statements encounter errors that prevent their execution, or in special situations like the one we just mentioned, the DBMS will notify us – but it won’t stop the execution of the script. It will simply move on to execute the next declared statements (at the syntactic level, it executes the next statement we have separated with the corresponding ;).

Chapter 11: Example Queries

Once we have done all this, we’ll have the database created and populated with tables. But these tables are empty, meaning they don't contain any tuples. So if we want to run queries on them that return any results, we need to execute INSERT statements to add tuples to all the tables.

In this case, since the database is an example, we don't have real data to use for populating the tables, and there's no simple and automatic way to fill them with synthetic data. The best option is to use the Python library faker and create a script to generate this synthetic data (I’ve explained this in this Jupyter Notebook).
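
To give a rough idea of what such a generation script does, here is a minimal sketch that deliberately avoids faker and relies only on Python's standard library, with sqlite3 standing in for PostgreSQL. The name lists, the helper fake_person, and the column types of Person are all assumptions made for illustration. Only the attribute names PersonID, Name, Email, and Birth come from the schema.

```python
import random
import sqlite3
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so the synthetic data is reproducible

FIRST_NAMES = ["Carol", "Luis", "Marta", "Noah", "Aiko"]
LAST_NAMES = ["Smith", "Garcia", "Tanaka", "Okafor"]

def fake_person(person_id):
    """Build one synthetic Person tuple (hypothetical helper)."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    birth = date(1950, 1, 1) + timedelta(days=rng.randrange(20000))
    # Including the id keeps every email unique.
    email = f"{first}.{last}.{person_id}@example.com".lower()
    return (person_id, f"{first} {last}", email, birth.isoformat())

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL,
    Email    TEXT NOT NULL UNIQUE,
    Birth    TEXT NOT NULL
)
""")

people = [fake_person(i) for i in range(1, 51)]
conn.executemany("INSERT INTO Person VALUES (?, ?, ?, ?)", people)

inserted = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
```

A library like faker replaces the hand-written name lists with much more varied and realistic values, but the overall flow stays the same: generate tuples that respect the constraints, then insert them table by table in an order that satisfies the foreign keys.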

There is also always the option to look for real data sources to populate our database. But when doing this, those data sources might provide information in table schemas that don't exactly match those of our database tables, requiring us to integrate and then insert the information through a process like an ETL. These ETL processes of integration and insertion are often applied in Data Warehouses, which can also be a database like ours.

Running Basic Queries

So, assuming we already have the database populated with tables and tuples within them, we can run different queries on them. After all – the main operation that other services from other software layers use from the database is querying. This lets them obtain data that they can then transform, use to calculate certain metrics, or simply display to the end user.

For example, right after inserting the data, the first query we can run to ensure that the insertion process worked is the following:

SELECT 'person' AS tableName, COUNT(*) AS numberOfTuples
FROM Person;

As you can see, we use the FROM clause to get all the information stored in the Person table (which we could have written entirely in lowercase). Then, we use the aggregation function COUNT(*) to count the total number of tuples in the table, naming the column where this number is stored numberOfTuples.

But, if we also want to display the table name in the same tuple as the previous count, we can add another column in the SELECT statement where all its values are 'person'. This way, when the query is executed, it will return a table with two columns, one tableName and another numberOfTuples. Since the aggregation function only returns one value, the resulting table will have only one tuple, where the tableName column will have the value 'person' and the other column will show the number of tuples in the Person table.

If we want to count the tuples of all the tables in the database, we have the option to create a larger query that gathers all the results of the sub-queries that count the tuples of each table. For this, we can use UNION ALL, which combines the tuples from all resulting tables into a single table. This works as long as all resulting tables have the same number of columns with compatible data types, as in this case. The column names of the final table are taken from the first sub-query.

SELECT 'person' AS tableName, COUNT(*) AS numberOfTuples
FROM Person
UNION ALL
SELECT 'city' AS tableName, COUNT(*) AS numberOfTuples
FROM city;
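
When there are many tables, writing this query by hand gets tedious. As a sketch, we can build it from a list of table names in Python, here with sqlite3 as a stand-in for PostgreSQL. The two tables and their contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE City (CityID INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO Person (Name) VALUES ('Carol'), ('Luis'), ('Marta');
INSERT INTO City (Name) VALUES ('Lisbon'), ('Porto');
""")

# Table names are hard-coded here. Never build SQL like this from user input.
tables = ["Person", "City"]

query = "\nUNION ALL\n".join(
    f"SELECT '{t.lower()}' AS tableName, COUNT(*) AS numberOfTuples FROM {t}"
    for t in tables
)

counts = dict(conn.execute(query).fetchall())
print(counts)
```

Each sub-query contributes exactly one tuple, so the final table has as many rows as there are tables in the list.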

Lastly, when we say "obtain information" about an element of the domain or database schema in this context, we mean getting its data stored in the attributes of the table that represents it.

For example, information about a person could be the Name or Email attribute of the Person table, among others. We won’t detail that info here, as in most cases, it’s easy to modify which attributes are selected to return as a query result. But in a real environment, it’s convenient and important to pay attention to the attributes the query should return, the names/aliases they should have, and the order in which they should be returned. The functionality of other software layers often depends on this step being performed correctly.

Tuple Filtering

The query we just looked at is useful for managing the database. Knowing how much information is stored in each table helps us make sure that certain normalization or schema transformation operations have run correctly (and even that the information itself is correct).

Let’s now look at some other queries that allow us to execute services provided to the end user. We use these to operate on the domain according to its semantics, so they can be very diverse. Here, we’ll distinguish between different types of queries based on their approach and the SQL tools used in their construction.

First, we have a series of queries for tuple filtering. These queries apply a filter on a table to keep only certain tuples that meet specific conditions. Note that the table containing the tuples we want to filter can be generated in any way, whether through a JOIN, a set operation, or whatever we want to do. But if you need to perform a grouping with GROUP BY, the resulting table must be filtered using a HAVING clause, which differs from the usual WHERE clause used to filter tuples.

SELECT *
FROM person P
WHERE P.name LIKE 'Carol%';

The above is a simple example that retrieves the tuples from the Person table for people whose names start with “Carol”. As you can see, the only clause we need to filter tuples is WHERE. In it, we define all the conditions required for filtering, regardless of their number or nature (some may even use subquery results).

In this specific case, the query has the condition that a person's name must start with exactly the string that appears in the LIKE operator. Since it’s case-sensitive, the string has to match exactly what we want to search or filter. Then all tuples that meet this condition will be returned in the resulting table. We’ll get all of its attributes because of the SELECT * notation we used.

To illustrate that it doesn't matter whether you use uppercase or lowercase when naming schema elements in SQL statements, we can see in the below query that both the table City and its attributes are in lowercase (except for one that’s written exactly as it was declared, with the first letter capitalized). If we run this query, it will work the same as if we use C. to reference the attributes, since using only one table means there’s no ambiguity when referring to the table's columns.

SELECT *
FROM city C
WHERE (population > 20000 AND C.Latitude >= 0) OR C.longitude <= 0;

Ultimately, with these conditions, we get all the tuples from City that have a population greater than 20,000 and a latitude of 0 or more, plus those that simply have a longitude of 0 or less.
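
To check how the parentheses affect the result, here is a small sketch with four made-up cities, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The column types and values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE city (
    Name TEXT PRIMARY KEY,
    population INTEGER NOT NULL,
    Latitude REAL NOT NULL,
    longitude REAL NOT NULL
)
""")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?, ?)",
    [
        ("Alpha", 50000, 40.0, 10.0),    # big, northern, eastern
        ("Beta", 50000, -30.0, 20.0),    # big, southern, eastern
        ("Gamma", 1000, -10.0, -60.0),   # small, southern, western
        ("Delta", 1000, 10.0, 5.0),      # small, northern, eastern
    ],
)

def names(where):
    sql = f"SELECT C.Name FROM city C WHERE {where} ORDER BY C.Name"
    return [row[0] for row in conn.execute(sql)]

# Same grouping as the query above.
original = names("(population > 20000 AND C.Latitude >= 0) OR C.longitude <= 0")

# Moving the parentheses changes which tuples pass the filter.
regrouped = names("population > 20000 AND (C.Latitude >= 0 OR C.longitude <= 0)")
```

With the original grouping we get Alpha and Gamma, while the regrouped condition only keeps Alpha. Since AND binds more tightly than OR in SQL, the first query would behave the same without parentheses, but writing them makes the intent explicit.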

Let’s look at a similar example: here, we get all cruise bookings with a price below 500, an even cabin number, and a payment method of cash. In this case, we can see how we can apply different types of operators to build the condition.

SELECT *
FROM cruiseBooking CB
WHERE CB.price < 500 AND MOD(CB.cabinnumber, 2) = 0 AND CB.paymentmethod = 'cash';

On one hand, if we want to declare that all the conditions we impose must be met, we’ll use the logical operator AND. This performs a logical conjunction of those conditions so that the selected tuple is added to the resulting table only when all of them are met at the same time.

In other words, we can see the WHERE clause as a logical function that runs once for each tuple present in the table we want to filter. So if the result of that logical function is TRUE, then the tuple meets the conditions. Otherwise, it’s discarded and not included in the result table of the query.

So now we know that all the conditions we can define in a WHERE clause must be composed of a sequence of simpler logical conditions like “CB.price < 500” joined by logical operators. Each of these simpler conditions can itself be seen as a logical function composed of even simpler conditions joined by logical operators. This allows for recursion, enabling us to use parentheses like in (C1 AND (C2 OR C3)) to adjust the priority and precedence of these operators at different levels of recursion in our condition (just like in other programming languages).

On the other hand, we can also encounter conditions where arithmetic or comparison operators are used, such as in this case when checking if the string containing the payment method is exactly the value ‘cash’.

While in other languages we might write CB.paymentmethod=='cash', in SQL we write the comparison operator with a single character =. If we want to negate it, we can do this either by using the logical operator NOT (affecting the entire equality condition) or by using CB.paymentmethod<>'cash', which checks that the payment method is not ‘cash’, meaning it’s different from that value.

In addition to these operators, we also have a series of mathematical functions available. For example, to check if a number is even or odd, in most general-purpose programming languages we have the modulo operator % which calculates the remainder of dividing the number by 2 – so if it’s 0, the number is even.

But in SQL, operations like this are typically provided as functions rather than arithmetic operators. Specifically, to calculate the modulo, we use MOD(Dividend, Divisor), although there are many other similar functions. (PostgreSQL also accepts the % operator for this.)
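
The filter on cruise bookings can be tried on a few made-up tuples. This sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and because not every SQLite build ships a MOD() function, it writes the parity test with the % operator instead. The BookingID column and all values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cruiseBooking (
    BookingID INTEGER PRIMARY KEY,
    price REAL NOT NULL,
    cabinnumber INTEGER NOT NULL,
    paymentmethod TEXT NOT NULL
)
""")
conn.executemany(
    "INSERT INTO cruiseBooking VALUES (?, ?, ?, ?)",
    [
        (1, 450, 12, "cash"),  # cheap, even cabin, cash: kept
        (2, 450, 13, "cash"),  # odd cabin
        (3, 450, 14, "card"),  # not paid in cash
        (4, 900, 16, "cash"),  # too expensive
    ],
)

def ids(where):
    sql = f"SELECT CB.BookingID FROM cruiseBooking CB WHERE {where} ORDER BY 1"
    return [row[0] for row in conn.execute(sql)]

# In PostgreSQL the middle condition would be MOD(CB.cabinnumber, 2) = 0.
matches = ids(
    "CB.price < 500 AND CB.cabinnumber % 2 = 0 AND CB.paymentmethod = 'cash'"
)

# Two equivalent ways of negating the equality condition.
not_cash_operator = ids("CB.paymentmethod <> 'cash'")
not_cash_negation = ids("NOT (CB.paymentmethod = 'cash')")
```

Only the first booking satisfies all three conditions at once, and both ways of writing the negation select the same tuple.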

We can use some of the operators mentioned earlier to perform calculations using entire columns. This results in other columns containing the results of those operations.

SELECT *, (CURRENT_DATE - CB.bookingDate) AS DateDifference1, ABS(CB.ArrivalDateFK-CB.DepartureDateFK)
FROM cruiseBooking CB;

For example, in this query, we want to calculate several date differences, one being the number of days between the booking date and the current date, and another being the number of days between the departure and arrival dates of the cruise trip.

To do this for each tuple in the CruiseBooking table, the simplest way is to add several columns that take the results of these calculations as their values. Specifically, we create these columns in the SELECT statement. This selects the corresponding attributes from the resulting table of the query and displays them to the user. Only those attributes are visible to the user, even though we got them from a table with more attributes.

But, besides selecting attributes, we can also define new columns that didn't exist in the table we’re selecting from. For example, in this query, using the notation *, we select all the attributes present in the table from the FROM statement, which in this case is CruiseBooking.

In addition to those, we concatenate more attributes with a comma, like DateDifference1 or the difference between the departure and arrival dates of the corresponding trip. If we look at the result of the query after adding these additional attributes, we’ll see a new column in the resulting table called DateDifference1, which will take as values the difference between the current date gotten with CURRENT_DATE and the booking date, which is CB.bookingDate.

So we see that in the SELECT statement, we can perform operations with the values of the tuples to generate new columns with intermediate calculations, or simply calculations required by the query, as in this case.

Specifically, the operation performed on each tuple to generate the value of the new column is defined in the SELECT statement itself. In this case, with CURRENT_DATE - CB.bookingDate, we define that the value of each tuple equals the current date minus the booking date. When both values are of type DATE, PostgreSQL returns the difference as a whole number of days.

Then to get the difference between the departure and arrival dates of the cruise trip, we use the values of the DepartureDateFK and ArrivalDateFK attributes from the foreign key pointing to Voyage. This avoids having to query data from other tables that contain them.

If we simply subtract them, depending on the order, we could get negative results, since one date is earlier than the other. So if we just want the absolute difference, we can wrap the operation with the ABS() function. And if we don't assign a specific name to that additional column, PostgreSQL by default names it after the function, “abs”. But we’ll want to change it sooner or later to avoid ambiguity problems if we use the ABS() function again to create another new column.
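
Here is a sketch of the same kind of computed columns using Python's built-in sqlite3 module. SQLite has no date subtraction operator, so julianday() stands in for PostgreSQL's DATE arithmetic, and a fixed date replaces CURRENT_DATE to keep the result reproducible. The BookingID column, the alias TripDays, and the sample dates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cruiseBooking (
    BookingID INTEGER PRIMARY KEY,
    bookingDate TEXT NOT NULL,
    DepartureDateFK TEXT NOT NULL,
    ArrivalDateFK TEXT NOT NULL
)
""")
conn.execute(
    "INSERT INTO cruiseBooking VALUES (1, '2025-03-01', '2025-04-01', '2025-04-08')"
)

today = "2025-03-10"  # fixed stand-in for CURRENT_DATE

cursor = conn.execute(
    """
    SELECT CB.*,
           CAST(julianday(?) - julianday(CB.bookingDate) AS INTEGER)
               AS DateDifference1,
           CAST(ABS(julianday(CB.DepartureDateFK) - julianday(CB.ArrivalDateFK))
                AS INTEGER) AS TripDays
    FROM cruiseBooking CB
    """,
    (today,),
)

column_names = [col[0] for col in cursor.description]
row = cursor.fetchone()
```

The result keeps the four original attributes and appends the two computed columns. Giving both of them an explicit alias means the code reading the result doesn't depend on whatever default name the DBMS would pick.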

In the previous query, we saw that all the information we needed was present in the CruiseBooking table from the FROM clause – but this is not always the case.

For example, in the below query, we want to do a few things: first, we want to get all the bookings made by people whose names start with the letter ‘L’ or a later one. They should also meet a series of conditions like the ones we saw before. Finally, we want to calculate the difference in days between the current date and the booking date as we saw before.

SELECT *, (CURRENT_DATE - CB.bookingDate) AS DateDifferenceColumn 
FROM cruiseBooking CB INNER JOIN Person P ON CB.PersonFK = P.PersonID
WHERE CB.price < 2000
    AND MOD(CB.cabinnumber, 2) = 0
    AND CB.paymentmethod = 'cash'
    AND CB.bookingDate BETWEEN '2025-01-01' AND CURRENT_DATE
    AND P.Name > 'L';

For this, if we only use the CruiseBooking table in the FROM clause, we won't be able to access the name of the person who made the booking, as that’s an attribute of the Person table. We can get information from that table using the foreign key PersonFK from CruiseBooking. So to use the Name attribute from the Person in our query, we need to somehow "concatenate" or join the columns of the Person table with the information from the CruiseBooking table that we had before.

In SQL, the JOIN operation allows us to do this. We just need to choose a type and conditions that let us obtain only the tuples with the information we want.

Among all the types of JOINs, the least likely to be used in production or in complex queries is the implicit join. When we use an implicit join, we are performing a Cartesian product between all the tuples involved in that JOIN. So if we want to keep only certain tuples from that Cartesian product, we have to use a WHERE clause to impose certain conditions on the attributes.

Implicit joins are harder to read and maintain. In large or complex queries, we need to separate the join itself from the conditions on the Cartesian product. That means the logic is split between the FROM list and the WHERE clause, so you have more places to check when you modify or refactor the query.

Also, in implicit JOINs, we can’t perform operations equivalent to an OUTER JOIN because there’s no way to fill certain attributes with NULL if they’re not referenced in the other table of the JOIN (among other disadvantages). So the type of JOIN we choose will depend on the condition we need to impose on the tuples of the Cartesian product.

Just keep in mind that there are certain cases where it might be convenient to use implicit joins, such as in queries involving very few tables (at most 2 to keep the code as simple as possible) with simple restrictions, or when maintaining legacy code, meaning old or inherited code that uses implicit joins.

In this case, when performing the Cartesian product, we’ll get a series of tuples that combine all those from CruiseBooking and Person. This will result in tuples with information about these two tables where the person's information does not correspond with the person referenced by the foreign key of the CruiseBooking tuple.

For that reason, we don't need those tuples from the Cartesian product – or in other words, we want to get all those where the foreign key PersonFK of CruiseBooking points to the person whose information is indeed in that same tuple of the Cartesian product.

Formally, we express this condition as CB.PersonFK = P.PersonID. In this case, we need to assign alias names to the tables to differentiate their attributes and resolve possible ambiguity issues. So the most suitable type of JOIN for this query is an INNER JOIN, as it allows us to declare this equality condition exactly as we have written it here in an ON clause, as seen above.

In this way, by using a specific type of JOIN that’s not implicit, we can isolate all the filtering conditions of the tuples in the WHERE clause (dedicating the FROM to obtaining the data). Through the JOIN, we can concatenate the attributes of other tables to the resulting table of the query, and apply a specific filter to the tuples of the Cartesian product of that operation with an equality condition.
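
To make the relationship between the Cartesian product, the implicit join, and the INNER JOIN concrete, here is a small sketch with made-up data, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The BookingID column and the sample tuples are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE cruiseBooking (
    BookingID INTEGER PRIMARY KEY,
    price REAL NOT NULL,
    PersonFK INTEGER NOT NULL REFERENCES Person(PersonID)
);
INSERT INTO Person VALUES (1, 'Carol'), (2, 'Luis'), (3, 'Marta');
INSERT INTO cruiseBooking VALUES (10, 450, 1), (11, 900, 1), (12, 300, 3);
""")

# Cartesian product: every booking paired with every person (3 x 3 tuples).
cartesian_size = conn.execute(
    "SELECT COUNT(*) FROM cruiseBooking CB, Person P"
).fetchone()[0]

# Implicit join: the pairing condition lives in WHERE.
implicit = conn.execute("""
    SELECT CB.BookingID, P.Name
    FROM cruiseBooking CB, Person P
    WHERE CB.PersonFK = P.PersonID
    ORDER BY CB.BookingID
""").fetchall()

# Explicit INNER JOIN: the pairing condition lives in ON.
explicit = conn.execute("""
    SELECT CB.BookingID, P.Name
    FROM cruiseBooking CB INNER JOIN Person P ON CB.PersonFK = P.PersonID
    ORDER BY CB.BookingID
""").fetchall()
```

Both queries return the same three tuples out of the nine in the Cartesian product. The difference is only in where the pairing condition is written, which is what makes the explicit form easier to maintain once more filters are added to WHERE.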

Regarding the WHERE conditions in this query, we’ve added one that makes sure the booking date is between '2025-01-01' and the current date we get with CURRENT_DATE. We could’ve used the comparison operators >= and <= for this, but SQL offers us a more convenient alternative using BETWEEN, where we define that the date of the bookingDate attribute must be between '2025-01-01' and CURRENT_DATE, both included.

The BETWEEN operator would also work to check if a string is between a pair of strings, all compared alphabetically. In this query, the only condition we impose on the lexicographical order of a string is P.Name > 'L'. This ensures that the name of the person who made the booking starts with a letter greater than or equal to L. (If their name is composed of text that starts with L followed by more letters, that text will automatically be considered strictly greater than the text 'L'.)

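If we wanted to keep only the people whose names start strictly with a letter greater than L, we could use the condition P.Name >= 'M'. Writing P.Name > 'M' gives almost the same result, but it would also leave out a name that is exactly 'M'.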
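
These string comparisons can be checked on a handful of made-up names. The sketch uses Python's built-in sqlite3 module, whose default collation compares text byte by byte. In PostgreSQL the exact order depends on the collation of the database, so treat this as an illustration for simple capitalized names only. It also includes the condition >= 'M' for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO Person (Name) VALUES (?)",
    [("Karen",), ("L",), ("Laura",), ("M",), ("Maria",), ("Zoe",)],
)

def names(condition):
    sql = f"SELECT P.Name FROM Person P WHERE {condition} ORDER BY P.Name"
    return [row[0] for row in conn.execute(sql)]

after_l = names("P.Name > 'L'")    # 'Laura' is longer than 'L', so it is greater
from_m = names("P.Name >= 'M'")    # every name starting with M or later
after_m = names("P.Name > 'M'")    # same, except the name that is exactly 'M'
```

A string that starts with the same characters but is longer always sorts after the shorter one, which is why 'Laura' passes the first filter while the name 'L' itself does not.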

What if we need to get a list of all the people in the database, their information, and also the details of all the cruise bookings they have made? We’d need a list where all the registered people in the database appear.

SELECT *
FROM cruiseBooking CB RIGHT JOIN Person P ON CB.PersonFK = P.PersonID
ORDER BY P.PersonID;

For example, if someone has made 2 bookings, there will be 2 rows with their information plus the details of the two bookings they made. Meanwhile, people who have never made a booking will appear in the list with a row containing their information and a series of NULL values in the columns where the booking information would be.

This query isn’t common in real cases, but this structure can be useful for solving other types of queries. So to build this list, the first operator we might think of is an OUTER JOIN. In this type of join, we specify the side of the table whose rows should always appear in the final list, filling in with nulls in the other table when necessary.

To understand this, note that in this example a person doesn’t need to have any associated booking, so there may be some people with no bookings at all. If we did an INNER JOIN with the CruiseBooking table, those people wouldn’t appear in the resulting table from the query.

That's why, instead of an INNER JOIN where we impose a strict condition that all tuples from the operation must meet, we use an OUTER JOIN. So, if we want all people to appear in the list even if they haven't made any bookings, we need to specify the side of the OUTER JOIN where we placed the Person table in the JOIN operation.

In this case, the Person table is on the right side, meaning its attributes are concatenated to the right of those in the CruiseBooking table. So in the OUTER JOIN, we must specify the RIGHT side so that all tuples from the table on the right side appear in the list, and for those people who don't have any associated bookings, their corresponding tuple will be filled with NULL values in the respective attributes that hold booking information.

If we had placed the Person table on the left side, then to achieve the same result as the previous query but with the columns of both tables reordered, we just need to change RIGHT to LEFT in the JOIN operation. This way, all tuples from the table on the left (meaning Person) must appear in the resulting table. The right side gets filled in with NULL values in this case, since that's where the attributes of the CruiseBooking table are.

SELECT *
FROM Person P LEFT JOIN cruiseBooking CB ON CB.PersonFK = P.PersonID
ORDER BY P.PersonID;

On the other hand, in both queries, you can see that we used ON to define the equality condition on the tuples of the Cartesian product produced by JOIN. We have to do this because if we use USING instead of ON, both attributes on which we want to impose the equality condition must be named exactly the same – so we can’t use USING here.
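
To make the difference between the outer and inner join concrete, here is a minimal sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for the DBMS used in this book, and the three people and their bookings are made-up rows, not data from the real schema.

```python
import sqlite3

# In-memory database with hypothetical rows: Bruno has no bookings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE CruiseBooking (
        BookingID INTEGER PRIMARY KEY,
        PersonFK  INTEGER REFERENCES Person(PersonID)
    );
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO CruiseBooking VALUES (10, 1), (11, 1), (12, 3);
""")

# LEFT JOIN: every person appears; missing bookings become NULL (None in Python).
left_rows = conn.execute("""
    SELECT P.PersonID, P.Name, CB.BookingID
    FROM Person P LEFT JOIN CruiseBooking CB ON CB.PersonFK = P.PersonID
    ORDER BY P.PersonID, CB.BookingID
""").fetchall()

# INNER JOIN: people without a matching booking are dropped.
inner_rows = conn.execute("""
    SELECT P.PersonID, P.Name, CB.BookingID
    FROM Person P INNER JOIN CruiseBooking CB ON CB.PersonFK = P.PersonID
    ORDER BY P.PersonID, CB.BookingID
""").fetchall()

print(left_rows)
print(inner_rows)
```

With these rows, Ana appears twice (one row per booking), Bruno appears once with None in the booking column of the LEFT JOIN result, and he's missing entirely from the INNER JOIN result.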

Aside from the JOIN operation from which the data is extracted, we often need to return the result sorted by an attribute. It may also simply be useful to have the result sorted so we can make checks more quickly, as in this case.

SELECT P.Birth
FROM Person P LEFT JOIN cruiseBooking CB ON CB.PersonFK = P.PersonID
ORDER BY P.PersonID;

To do this, at the end of the query, we can add an ORDER BY statement, which sorts the tuples of the resulting table according to the PersonID attribute of the person. This attribute doesn’t need to appear in the SELECT, as we might need other attributes that aren’t the ones defining the order, as shown above.

To finish with this type of JOIN, besides defining one side as RIGHT or LEFT, in an OUTER JOIN we might also need all the tuples from both sides' tables to appear. In the query below, for example, we need to get a list of all driving license applications, so that all of them appear, one in each tuple, with all the information regarding their rejection or acceptance.

SELECT *
FROM DrivingLicense D FULL OUTER JOIN RejectedDrivingLicense R USING (LicenseID);

To resolve this query, first, keep in mind that the schema constraints prevent us from having a driving license application both accepted and rejected at the same time. So for each application registered in the database, there will be either a tuple in RejectedDrivingLicense or in DrivingLicense, depending on whether it has been rejected or not. So when obtaining the query list, if the resulting table contains all the attributes from both tables, there will always be NULLs in some of them (either in RejectedDrivingLicense or in DrivingLicense).

To make sure that all applications appear, we can perform a FULL OUTER JOIN, where the OUTER specification is optional as we have seen on other occasions. This forces the tuples from both tables to appear in the final result, filling with NULL on the corresponding side for each tuple.

For example, if a license is accepted and we try to find it in the RejectedDrivingLicense table, it clearly won't be there. So, if we did an INNER JOIN, we wouldn't get a tuple for that application, which happens similarly with rejected applications and the DrivingLicense table. So with a FULL OUTER JOIN, we ensure that all applications appear, filling with NULL in RejectedDrivingLicense when the application is accepted and in the other table when it’s rejected. In this case, it’s also possible to use USING in the JOIN, since the equality condition is based on attributes in different tables that have exactly the same name.

Another JOIN we might encounter in real queries is the NATURAL JOIN, which is very similar to the INNER JOIN but with simpler syntax.

SELECT PersonFK, RequestDate, Fee, ApprovalDate, Points
FROM DrivingLicenseRequest NATURAL JOIN DrivingLicense;

For example, the query above can help us verify that the schema's integrity constraints are met. It returns a list of all driving license requests that have been approved.

To do this, we perform a NATURAL JOIN between the DrivingLicense table and its superclass DrivingLicenseRequest. Since the only attribute the two tables share by name is LicenseID, SQL automatically imposes the condition that each combined tuple has the same LicenseID value in both tables, and it keeps a single LicenseID column in the result instead of one copy per table.

This automatically imposed condition, as well as removing the duplicate attributes, is what characterizes the NATURAL JOIN. It’s often preferable to an INNER JOIN because of these characteristics. By eliminating identical attributes, we keep the information that actually represents the people in those tuples. We can then use it to calculate various metrics or even as the result of a subquery in a more general query.
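
As a rough illustration of how NATURAL JOIN treats the shared column, the sketch below compares the column lists produced by a NATURAL JOIN and by an INNER JOIN with ON. It assumes SQLite via Python's sqlite3 module, and the two tables are simplified, hypothetical versions of the real ones (fewer attributes, invented rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DrivingLicenseRequest (
        LicenseID INTEGER PRIMARY KEY, PersonFK INTEGER, RequestDate TEXT
    );
    CREATE TABLE DrivingLicense (
        LicenseID INTEGER PRIMARY KEY, ApprovalDate TEXT, Points INTEGER
    );
    INSERT INTO DrivingLicenseRequest VALUES
        (1, 100, '2024-01-10'), (2, 101, '2024-02-05');
    INSERT INTO DrivingLicense VALUES (1, '2024-01-20', 12);
""")

# NATURAL JOIN: equality on LicenseID is implicit, and the column appears once.
cur = conn.execute(
    "SELECT * FROM DrivingLicenseRequest NATURAL JOIN DrivingLicense"
)
natural_columns = [col[0] for col in cur.description]
natural_rows = cur.fetchall()

# INNER JOIN ... ON: same rows, but LicenseID shows up once per table.
cur = conn.execute("""
    SELECT *
    FROM DrivingLicenseRequest R
        INNER JOIN DrivingLicense D ON R.LicenseID = D.LicenseID
""")
inner_columns = [col[0] for col in cur.description]

print(natural_columns)
print(inner_columns)
```

Only request 1 has an approved license here, so both joins return a single row. The difference is in the column list, where the INNER JOIN version carries LicenseID twice.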

In this specific case, since all accepted requests have to be recorded in the DrivingLicenseRequest table, this query should return all tuples from DrivingLicense. But if any aren’t recorded in DrivingLicenseRequest, the foreign key won’t reference any valid tuple in DrivingLicenseRequest, revealing a database integrity issue.

Fortunately, we never have to manually check this situation with these queries, as the DBMS automatically verifies that all integrity constraints are met with each database modification, especially those related to keys.

In real queries, multiple JOIN operations are usually used in the same FROM statement because we need to gather data from multiple tables (or even from within the same table).

SELECT *
FROM Person P
    INNER JOIN Residence R1 ON (P.PersonID = R1.PersonFK)
    INNER JOIN Residence R2 ON (
        P.PersonID = R2.PersonFK
        AND R1.CityFK <> R2.CityFK
    )
ORDER BY P.personID;

For example, say we want to find people who have lived in several different cities at some point in their lives, regardless of when they did so. Since our schema lets people live in multiple cities at once, we’ll have to use several JOIN operations to gather data from Person and Residence and join them.

But given the condition we impose on the people, to know if someone has lived in more than one city, we need to check the Residence table and see if there are multiple Residence tuples for the same person with different cities.

Specifically, the query we want to make should get all those people who have lived in at least two different cities. If we only impose the condition that a person appears in at least two tuples of the Residence table, we’d get people who have had at least two residences – not those who have lived in different cities in those residences.

Therefore, the final condition ends up being that the person appears in at least two tuples of Residence where the associated city they have lived in is different in both tuples. Also, by checking this condition, we aren’t ensuring that the person only has those two tuples – we just need to know if they appear in at least two tuples with the previous characteristics (as a person may have had many residences).

To implement this query, we might first think of using set operations and subqueries – but there is a way to solve it using only JOIN operations.

When we do a JOIN between two tables, we are really doing the Cartesian product, from which we only keep some tuples that meet certain conditions. For example, when doing a JOIN between Person and Residence, the foreign key PersonFK in Residence must refer to the person from that same tuple in the Cartesian product. This means it must match the PersonID attribute from the Person table. With this, we can see that we obtain all the residences each person has or has had.

Then, from all of them, if we want to check that there are at least two with different foreign key CityFK values (meaning that there are two residences in different cities), we can do another JOIN of the intermediate table resulting from the previous JOIN with the Residence table.

This way, in addition to saying that its foreign key PersonFK has to refer to the corresponding person from each tuple resulting from the JOIN, we’re also declaring that the city it refers to must be different from the city referenced by the previous Residence table used in the previous JOIN.

To understand this in a more programmatic way, when doing a JOIN between Residence and itself, we’re getting tuples that represent pairs of residences. So we’re obtaining a series of tuples that together represent the Cartesian product between the tuples of the Residence table with themselves.

In other words, we end up with a series of tuples where, in each one, we can find information from exactly 2 tuples of the Residence table, for each possible pair of Residence tuples (including cases where both tuples are the same). If we add the restriction that these pairs must refer to a certain person, then they will be all the possible pairs of residences that a person has had.

Then if we also add the condition that for each pair of residences the cities they refer to must be different, we’ll end up with tuples where the person who has had those residences has lived in at least two different cities. This doesn’t ensure that it’s exactly two, as they may have lived in many more (which we can see in the resulting tuples from these JOIN operations).

When implementing this in SQL, we see that in both ON clauses, we declare the condition that the Residence tuples must refer to the same person of the tuple we want to construct – with that person and a pair of their residences. Also, in the second JOIN, we declare the condition that the cities of the pair of residences must be different using the operator <>. Finally, we order the result according to the values of the PersonID attribute.

SELECT DISTINCT P.Name
FROM Person P
    INNER JOIN Residence R1 ON (P.PersonID = R1.PersonFK)
    INNER JOIN Residence R2 ON (
        P.PersonID = R2.PersonFK
        AND R1.CityFK <> R2.CityFK
    )
ORDER BY P.personID; /*Error*/

As you can see from the query result, there are people who have had many residences, resulting in many pairs of residences that meet the imposed conditions. This creates multiple tuples in the resulting table where the same person's information appears.

So, if we only want to get the person's name, we can replace * with P.Name in the SELECT statement to select only that attribute. To avoid duplicate values, we can use DISTINCT. Without DISTINCT, the same person's name may appear multiple times, depending on the number of residence pairs they have had in different cities. This also happens because SQL by default models tables with multisets, allowing such duplicates.
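
The effect of DISTINCT on the residence pairs can be sketched as follows. This assumes SQLite via Python's sqlite3 module and a handful of invented rows: Ana has lived in three cities, Bruno has two residences in the same city, and Carla has a single residence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (
        ResidenceID INTEGER PRIMARY KEY, PersonFK INTEGER, CityFK INTEGER
    );
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO Residence VALUES
        (1, 1, 10), (2, 1, 20), (3, 1, 30),
        (4, 2, 10), (5, 2, 10),
        (6, 3, 20);
""")

# Same double self-join as in the text; only the SELECT list changes.
pairs_sql = """
    SELECT {columns}
    FROM Person P
        INNER JOIN Residence R1 ON (P.PersonID = R1.PersonFK)
        INNER JOIN Residence R2 ON (
            P.PersonID = R2.PersonFK
            AND R1.CityFK <> R2.CityFK
        )
"""

# One row per ordered pair of residences in different cities.
all_names = conn.execute(pairs_sql.format(columns="P.Name")).fetchall()

# DISTINCT collapses those pairs to one row per name.
distinct_names = conn.execute(
    pairs_sql.format(columns="DISTINCT P.Name")
).fetchall()

print(len(all_names), distinct_names)
```

Ana's three cities produce six ordered pairs, so her name comes back six times without DISTINCT and once with it. Bruno and Carla never appear, because none of their residence pairs differ in city.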

If we care about removing duplicates, we should use DISTINCT – but this decision can affect other statements like ORDER BY. In this example, we’re ordering by the values of the PersonID attribute, which we don't need in the resulting table where only the Name attribute appears.

Since PersonID doesn’t appear in the SELECT after using DISTINCT, a DBMS such as PostgreSQL will give us an error. We have several options to fix it.

On one hand, we can remove DISTINCT, which will result in duplicate person data but that’s ordered by their PersonID (even though it won't be shown in the result).

On the other hand, we can keep DISTINCT and remove ORDER BY, because if the attribute we are ordering by does not appear in the SELECT after using DISTINCT, we will get an error that will prevent us from executing the query.

Another alternative we have is to show all the information about the person, not just the name. This way, we can order the result by the PersonID attribute and remove duplicate people. Instead of writing the entire list of attributes from the Person table in the SELECT, we can use the notation P.* to refer to all the attributes of the table with alias P.

SELECT DISTINCT P.*
FROM Person P
    INNER JOIN Residence R1 ON (P.PersonID = R1.PersonFK)
    INNER JOIN Residence R2 ON (
        P.PersonID = R2.PersonFK
        AND R1.CityFK <> R2.CityFK
    )
ORDER BY P.personID;

Finally, in SQL, it's common to encounter queries where we need to work with dates. For example, in our schema, we might have a query to get all the people who were born in May.

SELECT *, EXTRACT(MONTH FROM Birth) AS BirthMonth
FROM Person
WHERE EXTRACT(MONTH FROM Birth) = 5;

We can solve this by imposing a single condition on the birth date, with the peculiarity that we can't treat the data type exactly as if it were entirely numeric or text. Instead, we need to extract characteristics from the date to operate with.

In this case, the clearest characteristic to obtain is the month. By using the EXTRACT() function and the MONTH characteristic, we extract the month number from the Birth attribute's date to check if it’s May or not.

Note that the function generally returns numbers for day, month, year, and so on, not strings. So we treat the month as if it were a number from 1 to 12.

We can convert between number and string using other SQL tools, all in the appropriate format according to the time zone and geographic area. Then, if we want that date characteristic to appear as an additional attribute in the resulting table, we simply treat the EXTRACT() function as if it were any SQL function that returns a value when given certain values from a tuple.

But even if we assign it an alias, we can’t use that alias in the WHERE clause to declare the condition that it equals 5. Instead, we must write the entire calculation in the WHERE clause. Although this may seem inefficient in terms of readability, without using additional Common Table Expression (CTE) techniques like those we will see later, we have no choice but to duplicate the attribute calculation in the WHERE clause if we want to impose a condition on it.
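
Here is a small runnable sketch of the same pattern, with the expression repeated in both SELECT and WHERE. It assumes SQLite via Python's sqlite3 module, which has no EXTRACT() function, so CAST(strftime('%m', Birth) AS INTEGER) is swapped in to play the role of EXTRACT(MONTH FROM Birth). The birth dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT, Birth TEXT);
    INSERT INTO Person VALUES
        (1, 'Ana',   '1990-05-14'),
        (2, 'Bruno', '1985-11-02'),
        (3, 'Carla', '2001-05-30');
""")

# The month expression is written twice: once for the output column and
# once for the filter, mirroring the EXTRACT() query in the text.
may_rows = conn.execute("""
    SELECT Name, CAST(strftime('%m', Birth) AS INTEGER) AS BirthMonth
    FROM Person
    WHERE CAST(strftime('%m', Birth) AS INTEGER) = 5
    ORDER BY PersonID
""").fetchall()

print(may_rows)
```

The month comes back as the number 5 rather than a string, so it can be compared numerically.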

SELECT *, EXTRACT(WEEK FROM Birth)
FROM Person;

In addition to the day, month, and year, the EXTRACT() function allows us to obtain all kinds of characteristics from a date, like the week number with WEEK as shown above, or the quarter the date falls in with QUARTER.

Subqueries

There are some SQL queries that require subqueries. A subquery is simply a query inside another query. It helps you solve a smaller problem so the main query can solve a bigger one.

Let’s dive in a little deeper. When you run a query in SQL, you get a result table (a multiset, since rows can repeat). A subquery lets the outer query use that result – for example, to check membership or existence.

SELECT *
FROM Person P
WHERE P.PersonID IN (SELECT PersonFK FROM Residence);

This returns every person whose identifier appears in Residence.PersonFK, that is, everyone who has (or had) a recorded residence. The subquery produces the set of referenced person IDs, while the outer query keeps rows where P.PersonID is in that set.

Note that this is a non-correlated subquery (it doesn’t reference the outer query), which many databases may materialize once or rewrite as a semi-join before applying the IN filter. In practice, this is usually comparable to an equivalent EXISTS or JOIN-based formulation. We’ll just choose the form that’s clearest and add appropriate indexes (for example, Residence(PersonFK), Person(PersonID)) for speed.

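If the subquery can return NULL, IN uses three-valued logic, which matters most for NOT IN: a single NULL in the subquery result makes the condition unknown for every non-matching row. A foreign key column can hold NULL unless it’s also declared NOT NULL, so this is only a non-issue when Residence.PersonFK has that constraint.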
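
To see why NULL matters for membership tests, consider this sketch. It assumes SQLite via Python's sqlite3 module and, purely for illustration, a Residence table whose PersonFK column is nullable and contains one NULL. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    -- Hypothetical: PersonFK is nullable here and one row has no person.
    CREATE TABLE Residence (ResidenceID INTEGER PRIMARY KEY, PersonFK INTEGER);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno');
    INSERT INTO Residence VALUES (1, 1), (2, NULL);
""")

# IN: Ana matches the value 1; Bruno's test is unknown (not true), so he is dropped.
in_rows = conn.execute("""
    SELECT Name FROM Person
    WHERE PersonID IN (SELECT PersonFK FROM Residence)
""").fetchall()

# NOT IN: Ana is excluded (she matches), and Bruno's test "2 NOT IN (1, NULL)"
# is unknown rather than true, so no row is returned at all.
not_in_rows = conn.execute("""
    SELECT Name FROM Person
    WHERE PersonID NOT IN (SELECT PersonFK FROM Residence)
""").fetchall()

print(in_rows, not_in_rows)
```

The IN query behaves as expected, but the NOT IN query returns an empty result even though Bruno has no residence, which is the kind of surprise a NOT NULL constraint on the column avoids.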

On the other hand, we can solve the query using JOIN operations as shown below:

SELECT DISTINCT P.*
FROM Person P INNER JOIN Residence R ON R.PersonFK = P.PersonID;

Here, we combine the data from Person and Residence using the equality condition that requires the foreign key of Residence to reference the person in the same tuple of the Cartesian product. This way, we only get those tuples that have the information of a residence and the person associated with it.

Then, to keep only the data of the people, we use P.* as before – but here we need to use DISTINCT, since a person may have multiple residences. Specifying DISTINCT prevents this from duplicating the data of the same person.

Conceptually, a JOIN is a Cartesian product whose tuples are then filtered using the conditions we declare, which is why it’s often considered inefficient. In practice, a DBMS rarely builds the whole product: its query planner usually picks a strategy such as a hash join, a merge join, or index lookups, and some specialized systems also speed joins up with hardware like GPUs.

Still, here we need to remove duplicates with DISTINCT, which adds an extra processing step over the query result, so at first glance this version seems less efficient.

But depending on how the DBMS implements these operations at a physical level, it can be more or less efficient than using subqueries (as the hardware also makes a difference).

Here’s another construction based on subqueries that we can use to solve the previous query. As you can see, we build a correlated subquery where we use the PersonID attribute from the "higher-level" query to get all the residences (tuples) from the Residence table that belong to the person indicated by the PersonID identifier. In other words, since the WHERE clause is executed for each tuple of Person, we can construct a subquery where, given a certain person with that identifier, we can get all the residences registered in their name. That would be those whose foreign key PersonFK refers to the PersonID identifier of the person.

SELECT *
FROM Person P
WHERE EXISTS (
        SELECT *
        FROM Residence R
        WHERE R.PersonFK = P.PersonID
    );

With this correlated subquery, SQL must build its result for each person in the Person table, as the result depends on the specific person being processed. So to only keep those people who have a residence, we use the EXISTS operator to verify that the resulting multiset of the subquery contains at least one tuple (indicating that the person has a residence).

SQL has to go through the Residence table for each person in the Person table, although it only goes through Residence until it finds the first tuple whose foreign key points to the corresponding person. This avoids unnecessary checks of the rest of the tuples in Residence because EXISTS only requires at least one tuple in the subquery.

Still, in the worst-case scenario, it would have to go through the entire table for each person if no person has or has had residences.
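
The three formulations discussed so far (IN, JOIN with DISTINCT, and EXISTS) can be checked against each other with a short sketch. It assumes SQLite via Python's sqlite3 module and invented rows in which Bruno has no residence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (ResidenceID INTEGER PRIMARY KEY, PersonFK INTEGER);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO Residence VALUES (1, 1), (2, 1), (3, 3);
""")

queries = {
    "IN": """
        SELECT P.PersonID FROM Person P
        WHERE P.PersonID IN (SELECT PersonFK FROM Residence)
    """,
    "JOIN": """
        SELECT DISTINCT P.PersonID
        FROM Person P INNER JOIN Residence R ON R.PersonFK = P.PersonID
    """,
    "EXISTS": """
        SELECT P.PersonID FROM Person P
        WHERE EXISTS (
            SELECT * FROM Residence R WHERE R.PersonFK = P.PersonID
        )
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}

print(results)
```

All three return persons 1 and 3 on this data. Which one runs fastest depends on the DBMS and its indexes, so this only checks that the results agree, not how they perform.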

Another way we can use membership or existence operators is on a list of values. This is declared very similarly to a tuple and a subquery but is not necessarily a tuple or a subquery.

SELECT *
FROM Pool P
WHERE Status IN ('closed', 'renovation')
    AND mindepth IN (
        SELECT mindepth
        FROM Pool
        WHERE mindepth > 4
    );

For example, above we have a query that returns all pools whose status is ‘closed’ or ‘renovation’ and whose minimum depth is greater than 4.

To check the first condition, we could easily use the logical OR operator and declare two simpler conditions to check whether the Status value is either ‘closed’ or ‘renovation’. But we can do this more simply using the IN operator. So by using the notation ('closed', 'renovation'), we declare a list with those two values, checking with IN if the value contained in the Status attribute is in the list or not. This has the same effect as using the OR operator, but with clearer syntax and similar efficiency.

This check we do with IN is like a membership check on the result of a subquery, as the syntax is very similar. But don’t confuse the list declaration with a subquery, since ('closed', 'renovation') doesn’t represent a multiset with tuples, but rather a list of values. We can also view it as if it were a column on which we perform a check.

On the other hand, the simplest way to check if the pool's minimum depth is greater than 4 is with the condition mindepth > 4 directly. But to show an equivalent way of checking with subqueries, you can see above that the subquery for the condition retrieves all mindepth values from the Pool table that are strictly greater than 4. Then it uses IN to check if the mindepth value from the outer query's Pool table is in the subquery's result.

So instead of writing mindepth > 4 directly, the subquery first selects all mindepth values greater than 4, and the outer query uses IN to keep a pool row only if its mindepth is in that set. In practice, although this can also be a solution to the query, we should keep the code as simple as possible. We generally avoid these techniques.
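
As a quick check that the roundabout version and the direct version agree, the sketch below runs both on a tiny, invented Pool table. It assumes SQLite via Python's sqlite3 module, and the table has only the columns needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Status TEXT, mindepth REAL);
    INSERT INTO Pool VALUES
        (1, 'closed',     5.0),
        (2, 'renovation', 3.0),
        (3, 'open',       6.0),
        (4, 'renovation', 4.5);
""")

# Version from the text: value list for Status, subquery for mindepth.
with_in = conn.execute("""
    SELECT PoolID
    FROM Pool P
    WHERE Status IN ('closed', 'renovation')
        AND mindepth IN (
            SELECT mindepth
            FROM Pool
            WHERE mindepth > 4
        )
    ORDER BY PoolID
""").fetchall()

# Direct version: OR for the status, plain comparison for the depth.
direct = conn.execute("""
    SELECT PoolID
    FROM Pool
    WHERE (Status = 'closed' OR Status = 'renovation')
        AND mindepth > 4
    ORDER BY PoolID
""").fetchall()

print(with_in, direct)
```

Both queries return pools 1 and 4: pool 2 is too shallow and pool 3 has the wrong status.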

Also, we don’t need the alias P to refer to the mindepth of the outer query in its WHERE clause, as at that level it’s the only attribute with that name. But if we had to refer to the outer query’s mindepth from inside the subquery, we’d need to write P.mindepth to distinguish it from the mindepth attribute of the Pool table in the subquery. (The Pool table in the subquery doesn’t need an alias either, because it’s a simple subquery without another subquery inside it. Nesting one is possible, and sometimes even necessary.)

Here’s another equivalent way to solve the query using subqueries:

SELECT *
FROM Pool P
WHERE Status IN ('closed', 'renovation')
    AND P.mindepth > ALL (
        SELECT mindepth
        FROM Pool
        WHERE mindepth <= 4
    );

The main difference is that here, the subquery gets all the mindepth values that are <=4, which is the opposite condition of what we want the tuples to meet. So in the outer query, we have the result of this subquery, which includes all the mindepth values we’re not interested in.

To check if a tuple meets the condition of having a minimum depth >4 using these values, we use the > ALL operator to verify if the mindepth of the tuple we are checking is strictly greater than all the values present in the subquery.

This equivalent way of solving the query is more elaborate than the simplest and most efficient solution, which is to use the mindepth>4 condition directly. This is simply an example to demonstrate that there's often more than one way to get the same result for any state of the database. This is the definition of equivalent queries.

Also, in many situations, it’s useful to use operators like ANY, IN, ALL, EXISTS, and so on in combination with other arithmetic operators on a subquery to define conditions that certain tuples must meet, as shown in these examples.

So far, the subqueries we’ve seen essentially behave as if they were queries themselves, meaning we could execute them directly on the DBMS as regular queries. So nothing prevents a subquery from containing its own subqueries at a lower nesting level, just as it sits at a lower nesting level than the query it’s in.

Basically, SQL allows us to chain as many subqueries as we want within a query or subquery. This helps us solve problems like the query below, which retrieves a list with information on all the people who don’t have a valid driver's license:

SELECT *
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM DrivingLicense D
        WHERE D.LicenseID IN (
                SELECT LicenseID
                FROM DrivingLicenseRequest R
                WHERE R.PersonFK = P.PersonID
            )
    );

We could approach this query so that it’d require JOIN operations to solve it. But here we’ve deliberately structured it in a "nested" manner, so that it’s solved with subqueries at several levels.

So to get this list, we first go through all the people in the Person table. For each one, we check that there is no driver's license whose associated request was created by that person. We can implement this condition by applying the NOT EXISTS operator to a subquery that returns all valid driver's licenses associated with a person. We get these by filtering DrivingLicense to licenses whose matching DrivingLicenseRequest row has PersonFK = P.PersonID – that is, licenses requested by the current person.

Regarding this last point, as you can see in the code, the simplest way to implement it with subqueries is to check that the LicenseID of the valid driver's license exists in the set of LicenseID values from the requests in the DrivingLicenseRequest table whose foreign key points to the person being iterated over in Person. That makes this subquery correlated with the outer query we are making, as it includes the attribute P.PersonID.

In short, we’ve implemented this query by nesting subqueries, where SQL allows us to reach an arbitrary level of nesting according to the needs of the query. But we could’ve done it in other ways like using JOIN operations, which in certain situations are easier to understand than the approach we just followed.

Just remember that nesting queries is not always the best way to solve a problem, especially when multiple levels of nesting are created (whether correlated or not with each other). We’re just showing what’s possible here. It’s only worthwhile when it improves the efficiency or clarity of the query sufficiently compared to other alternatives.
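
To illustrate the trade-off, the sketch below runs the nested version next to one possible JOIN-based alternative, where the inner IN subquery is replaced by a join between the request and license tables. It assumes SQLite via Python's sqlite3 module, simplified hypothetical tables, and invented rows: person 1 has a license, person 2 only has a request, and person 3 has neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE DrivingLicenseRequest (
        LicenseID INTEGER PRIMARY KEY, PersonFK INTEGER
    );
    CREATE TABLE DrivingLicense (LicenseID INTEGER PRIMARY KEY);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO DrivingLicenseRequest VALUES (1, 1), (2, 2);
    INSERT INTO DrivingLicense VALUES (1);
""")

# Two levels of nesting, as in the text.
nested = conn.execute("""
    SELECT P.PersonID
    FROM Person P
    WHERE NOT EXISTS (
            SELECT *
            FROM DrivingLicense D
            WHERE D.LicenseID IN (
                    SELECT LicenseID
                    FROM DrivingLicenseRequest R
                    WHERE R.PersonFK = P.PersonID
                )
        )
    ORDER BY P.PersonID
""").fetchall()

# One level of nesting: the inner lookup becomes a JOIN.
with_join = conn.execute("""
    SELECT P.PersonID
    FROM Person P
    WHERE NOT EXISTS (
            SELECT *
            FROM DrivingLicenseRequest R
                INNER JOIN DrivingLicense D ON D.LicenseID = R.LicenseID
            WHERE R.PersonFK = P.PersonID
        )
    ORDER BY P.PersonID
""").fetchall()

print(nested, with_join)
```

Both versions return persons 2 and 3. Which one reads better is a judgment call, and this is just one of several ways the nesting could be flattened.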

Let’s talk about where or in which statement subqueries can be nested. In the code below, you can see how the subquery is nested in the FROM clause.

SELECT P.Name
FROM Person P
    INNER JOIN (
        SELECT DISTINCT PersonFK
        FROM Rental
    ) R ON R.PersonFK = P.PersonID;

Since it returns a table with tuples, we’ll often use that query result in a FROM clause to get the information from the tuples and return it to the user through a SELECT. Or we could even combine it with another table using a JOIN operation, as in this case. Specifically, this query will get information about all the people who have rented a bike at least once.

So the approach we follow to resolve the query is to perform a JOIN between the Person table (that contains all the people in the system) and a table that has the identifiers of the people pointed to by the foreign key PersonFK of any tuple in Rental. This means anyone whose identifier is referenced by any tuple in Rental, implying that they’ve rented a bike at least once.

We can construct this list of person identifiers using a subquery that extracts all the PersonFK values from the Rental table while removing duplicates. A person may have made an arbitrary number of rentals throughout their history, but we’re interested in whether they have made at least one. So, we simply need to know if they appear in the list of PersonFK values.

Then, using an INNER JOIN, we combine the information of PersonFK returned by the subquery with the tuples from the Person table. This gives us all the information of the people identified by PersonFK, which in turn points to PersonID. But since we want, for example, the names of the people and not just their identifiers, both the JOIN and the subquery are essential, because if we only needed the identifier, it would be enough to return what the subquery provides.
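
The role of DISTINCT inside the FROM subquery can be seen in a small sketch. It assumes SQLite via Python's sqlite3 module and invented rows where Ana has three rentals and Bruno has none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Rental (RentalID INTEGER PRIMARY KEY, PersonFK INTEGER);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno');
    INSERT INTO Rental VALUES (1, 1), (2, 1), (3, 1);
""")

# Subquery in FROM with DISTINCT: each renter's identifier appears once.
with_subquery = conn.execute("""
    SELECT P.Name
    FROM Person P
        INNER JOIN (
            SELECT DISTINCT PersonFK
            FROM Rental
        ) R ON R.PersonFK = P.PersonID
""").fetchall()

# Joining Rental directly: one output row per rental.
direct_join = conn.execute("""
    SELECT P.Name
    FROM Person P
        INNER JOIN Rental R ON R.PersonFK = P.PersonID
""").fetchall()

print(with_subquery, len(direct_join))
```

With the deduplicated subquery Ana is listed once, while the direct join lists her three times, once per rental.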

In addition to nesting subqueries in the FROM clause, we can also do it in the SELECT clause, where the main goal is to calculate a metric or get more information for each tuple in the query. That is, if in the SELECT we get attributes P.PersonID and P.Name from each of the tuples returned to the user, we might want to get more information beyond these two attributes that needs to be calculated with a query. In this case, this query will be nested as a subquery in the SELECT, and its result will be the value added to the additional attribute representing the subquery in the SELECT.

SELECT P.PersonID,
    P.Name,
    (
        SELECT COUNT(*)
        FROM Residence R
        WHERE R.PersonFK = P.PersonID
    ) AS NumResidences
FROM Person P;

In these cases where the subquery is nested in the SELECT statement, the subquery must meet a basic requirement: it has to return at most one tuple and one column. This is because the result of the subquery will be added in a new additional column (and only one) in our SELECT. Then we’ll calculate its result and add it in each tuple of the outer query – so the subquery can’t return more than one tuple.

For example, in this query, we want to list all the people in the database along with a column that contains the number of residences they have had. To solve this, the simplest approach is to go through all the tuples of Person and, for each one, count how many tuples of Residence have their foreign key PersonFK referencing that person.

Going through the tuples of Person is simple: we just use a combination of SELECT and FROM. But in order to count how many tuples of Residence meet this condition for each person, we need a correlated subquery – specifically with the person being processed. We can uniquely identify this with P.PersonID.

We need to do this because to count tuples in Residence, we have to compare the values of their foreign key PersonFK with the identifier P.PersonID. To get the value of this count, we can use a subquery: the aggregation function COUNT(*) lets us count all the tuples present in Residence. It does this after filtering them with the condition that their foreign key PersonFK references the person being processed in the Person table.

It’s important to note that the subquery will only return one value generated by COUNT(), and only one column generated by this function. This meets the requirement that every subquery used in the SELECT statement must fulfill.

Finally, it’s worth mentioning that this value generated in the subquery populates an additional column which we’ve added by including the subquery itself in the SELECT for each tuple of our query. In other words, each tuple will need a value for this new column, which they’ll get by executing the correlated subquery on that specific tuple.
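
The per-row evaluation described above can be observed in a minimal sketch. It assumes SQLite via Python's sqlite3 module and invented rows; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (ResidenceID INTEGER PRIMARY KEY, PersonFK INTEGER);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO Residence VALUES (1, 1), (2, 1), (3, 3);
""")

# The correlated subquery yields exactly one value (a count) per person.
counts = conn.execute("""
    SELECT P.PersonID,
        P.Name,
        (
            SELECT COUNT(*)
            FROM Residence R
            WHERE R.PersonFK = P.PersonID
        ) AS NumResidences
    FROM Person P
    ORDER BY P.PersonID
""").fetchall()

print(counts)
```

Bruno has no residences, yet he still gets a row with a count of 0, because COUNT(*) over an empty set returns one tuple containing 0.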

SELECT and FROM aren’t the only statements where subqueries are allowed. We can also use them in a WHERE, HAVING, or even ORDER BY clause. More importantly, a query can have an arbitrary number of subqueries (nested or not) depending on its needs.

SELECT P.PoolID,
    P.Name AS PoolName,
    C.Name AS CityName,
    P.Status,
    (
        SELECT CP.PoolID
        FROM CityPool CP
        WHERE CP.PoolID = P.PoolID
    ) AS CityPoolID
FROM (
        SELECT *
        FROM Pool
        WHERE Status = 'maintenance'
    ) AS P
    JOIN City C ON P.CityFK = C.CityID;

For example, in a query like the one above, we can see that there is not only a subquery in the SELECT but also one in the FROM.

In this specific case, the query gets information about all the pools currently under maintenance, including details about the city the pools are located in (such as its name). There’s also an additional column indicating the pool's identifier if it’s of the CityPool type, leaving it blank if it’s not.

So to resolve this query, we first need to get information about the pools under maintenance. This simply involves going through the tuples in the Pool table and selecting those whose Status value is ‘maintenance’.

Then, to gather information about the city where each pool is located (along with the Pool tuples we just obtained), we can use a JOIN that operates on the previous tuples and the City table. This is why we’re extracting all Pool tuples using a subquery.

So although the type of JOIN is not explicitly specified, by using the ON clause SQL automatically interprets it as an INNER JOIN (it would also be interpreted as INNER type if we had used the USING clause). But this practice is not recommended, as in most situations where the JOIN type is omitted, the readability of the code is compromised, especially when there are many JOINs in the same query.

Here, in the ON clause, the JOIN condition states that in the same tuple of the Cartesian product, the foreign key CityFK – which represents the city where the pool is located – must have the same value as the CityID identifier of the city in the tuple.

Then, to attach the extra column with the pool identifier from CityPool for those tuples that represent pools of that type, we’ll use a subquery. This subquery searches the CityPool table for a tuple whose PoolID matches the PoolID from Pool. This checks if the pool from Pool is actually of the CityPool type or not.

In this way, the subquery will return the identifier value if it’s of the CityPool type – otherwise, it will return nothing, meaning it will return a table without tuples (or in other words, an empty set or multiset, rather).

This is allowed in SQL, but the resulting NULL values can lead to mistakes later on, so it's generally not a good practice to use subqueries in the SELECT that aren’t guaranteed to return at least some tuple.

So for those pools that aren’t of the CityPool type, the subquery will return nothing. This means that the value of the extra column in the SELECT will be NULL as we can see when executing the query.

Since it doesn’t return any tuple with any value, we’ll insert an unknown value. The way to represent this in SQL is with the special value NULL. Also, this extra column by default has no name, so we can assign it a recognizable alias using the AS clause as shown in the query.
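To see this NULL behavior in isolation, here is a minimal sketch that runs a reduced version of the query through Python's built-in sqlite3 module. The two-row Pool table, the single CityPool row, and the alias CityPoolID are made-up assumptions for illustration, and SQLite only stands in for whichever DBMS you actually use.

```python
import sqlite3

# Hypothetical miniature of the Pool / CityPool tables (SQLite as a stand-in DBMS).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Name TEXT, Status TEXT);
    CREATE TABLE CityPool (PoolID INTEGER PRIMARY KEY REFERENCES Pool (PoolID));
    INSERT INTO Pool VALUES (1, 'North', 'maintenance'), (2, 'South', 'maintenance');
    INSERT INTO CityPool VALUES (1);  -- only pool 1 is of the CityPool type
""")

# The subquery in the SELECT finds one tuple for pool 1 and none for pool 2.
rows = conn.execute("""
    SELECT P.PoolID,
        (SELECT CP.PoolID FROM CityPool CP WHERE CP.PoolID = P.PoolID) AS CityPoolID
    FROM Pool P
    WHERE P.Status = 'maintenance'
    ORDER BY P.PoolID
""").fetchall()

print(rows)  # the second value of pool 2 is None, Python's view of SQL NULL
```

Pool 2 still appears in the result because the subquery only fills a column; it never filters tuples out of the outer query.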

On the other hand, if we want to avoid having NULL values in the additional column, we can have this column contain boolean values where TRUE indicates that the pool is of the CityPool type and FALSE that it’s not.

Starting from the same query as before, the only change we need to make to achieve this is to add an IS NOT NULL check. For each tuple, it checks whether the value inserted in the additional CityPoolType column is NULL or not. Thus, if its type is indeed CityPool, the value in the additional column provided by the original subquery won’t be NULL. This meets the IS NOT NULL condition and returns TRUE. Conversely, if it’s not of that type, IS NOT NULL won’t be met, and the additional column in this case will be filled with FALSE.

SELECT P.PoolID,
    P.Name AS PoolName,
    C.Name AS CityName,
    P.Status,
    (
        SELECT CP.PoolID
        FROM CityPool CP
        WHERE CP.PoolID = P.PoolID
    ) IS NOT NULL AS CityPoolType
FROM (
        SELECT *
        FROM Pool
        WHERE Status = 'maintenance'
    ) AS P
    INNER JOIN City C ON P.CityFK = C.CityID;

Here, we need to be careful about where we place the IS NOT NULL condition. On one hand, we might think of applying it to the PoolID attribute of the CityPool table itself, inside the SELECT clause of the subquery. If we do this, the check only runs on tuples that the subquery actually finds, so the result of the subquery will be TRUE if the pool is of the CityPool type.

But if it’s of another type, there won't be any tuple with that PoolID in the CityPool table, so the check won’t be executed at all and the subquery will return nothing. This will result in the final query output having the additional column contain NULL values for pools that aren’t of the CityPool type and TRUE for those that are, so we never get FALSE.

This happens because the check is evaluated once per tuple returned by the subquery. If the subquery returns no tuples, there is nothing to evaluate.

Instead, we should perform this check on the result of the entire subquery. That result is NULL when the pool is not of type CityPool, so IS NOT NULL evaluates to FALSE. Or it contains a valid identifier different from NULL, which satisfies the IS NOT NULL condition and gives TRUE.

In short, to make sure the additional column is always of boolean type, the IS NOT NULL check must be applied to the result of the entire subquery (which is either NULL or a specific value), once for each tuple in our resulting table.
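The difference between the two placements is easier to trust once we run both. The following sketch uses the same made-up two-pool data as an assumption and SQLite as a stand-in DBMS. Note that SQLite shows TRUE and FALSE as 1 and 0.

```python
import sqlite3

# Hypothetical data: pool 1 is a CityPool, pool 2 is not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE CityPool (PoolID INTEGER PRIMARY KEY REFERENCES Pool (PoolID));
    INSERT INTO Pool VALUES (1, 'North'), (2, 'South');
    INSERT INTO CityPool VALUES (1);
""")

# Check placed INSIDE the subquery: it only runs on tuples the subquery finds.
inside = conn.execute("""
    SELECT P.PoolID,
        (SELECT CP.PoolID IS NOT NULL FROM CityPool CP WHERE CP.PoolID = P.PoolID)
    FROM Pool P
    ORDER BY P.PoolID
""").fetchall()

# Check applied to the RESULT of the whole subquery: it runs for every pool.
outside = conn.execute("""
    SELECT P.PoolID,
        (SELECT CP.PoolID FROM CityPool CP WHERE CP.PoolID = P.PoolID) IS NOT NULL
    FROM Pool P
    ORDER BY P.PoolID
""").fetchall()

print(inside)   # pool 2 gets None (NULL): the check never ran for it
print(outside)  # pool 2 gets 0 (FALSE): the check ran on the NULL result
```

Only the second form produces a column that is always a boolean, which is what we wanted when we set out to avoid NULL values.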

In summary, although it's not good practice to use subqueries in the SELECT clause that may result in an empty set, we can do so long as it doesn't make the readability or efficiency of the query worse. We also need to have certain guarantees that it does what it’s expected to do.

So far, we've performed membership checks with IN, as well as checks with other operators, on individual attributes. For example, we’ve verified whether the value of a certain attribute was in a set formed by the values of another attribute. But sometimes we need these conditions to involve more than one attribute at once.

SELECT E.EntryTimestamp, E.PersonFK, E.PoolFK
FROM Entry E
WHERE (E.EntryTimestamp, E.PersonFK, E.PoolFK) IN (
        SELECT PS.EntryFK, PS.PersonFK, PS.PoolFK
        FROM PoolSanction PS
    );

For example, above we have a query that retrieves all the tuples from Entry that have been sanctioned with some pool sanction from the PoolSanction table. To do this, we simply need to go through the tuples in Entry and, for each one, check if it has a sanction. In other words, we verify if there is a tuple in PoolSanction whose foreign key to Entry references the tuple we’re examining.

When doing this, the first thing we notice is that the primary key of Entry doesn’t consist of a single attribute, but rather three. The same is true of the foreign key in PoolSanction that references it: it identifies the sanctioned entry using three attributes instead of one.

So under normal conditions, we could use a subquery to get all the foreign key values from PoolSanction, then check if the identifier (primary key) of each entry belongs to that set of values using the IN operator. But here we can’t do it the same way because we need to work with three attributes instead of one.

That's why, in the subquery, instead of returning a single attribute, we return all those that make up the foreign key to Entry (these are (EntryFK, PersonFK, PoolFK)). With this, we have a set of tuples where each one refers to a tuple in Entry that has been sanctioned.

Specifically, each of these tuples in the set refers to the three attributes that make up the primary key of Entry, which are (EntryTimestamp, PersonFK, PoolFK). So to check if an entry belongs to this set, we simply go through it, looking to see if any of the tuples match exactly with the tuple of the entry's primary key (with all three attributes having equal values).

We do this using the IN operator, where instead of specifying a single attribute, we can specify an arbitrary number of them in parentheses. Thus, the IN operator will perform the same operation as in previous cases, taking the primary key (EntryTimestamp, PersonFK, PoolFK) of each entry and comparing it with each of the tuples from the subquery, attribute by attribute. If any of them match, then it belongs to the set, fulfilling the condition.

Here, it's very important to note that the tuples compared by IN must be the same size. This means that they need to have the same number of attributes, the same data type (or at least be comparable), and their semantics must be the same. That is, if for each tuple in Entry we use (EntryTimestamp, PersonFK, PoolFK) to check if that three-attribute tuple is in the subquery set, then that subquery must contain tuples of three attributes where:

the first one is EntryFK, which refers to the EntryTimestamp attribute of the primary key of Entry,
the second one is PersonFK, referring to PersonFK of the primary key,
and the third one is PoolFK, referring to PoolFK of the primary key.

This ensures that the comparison is semantically correct, even though in the DDL the primary key might have been defined in a completely different order.
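To make the "all three attributes in the same tuple" requirement concrete, the sketch below compares the tuple form of IN with a tempting but wrong alternative that checks each attribute on its own. The rows, the SanctionID column, and the use of SQLite are assumptions made only for this illustration.

```python
import sqlite3

# Hypothetical miniature of Entry and PoolSanction with a three-attribute key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entry (
        EntryTimestamp TEXT, PersonFK INTEGER, PoolFK INTEGER,
        PRIMARY KEY (EntryTimestamp, PersonFK, PoolFK)
    );
    CREATE TABLE PoolSanction (
        SanctionID INTEGER PRIMARY KEY,
        EntryFK TEXT, PersonFK INTEGER, PoolFK INTEGER
    );
    INSERT INTO Entry VALUES
        ('2025-01-01 10:00', 1, 1),   -- never sanctioned
        ('2025-01-01 10:00', 2, 1),   -- sanctioned
        ('2025-01-02 09:00', 1, 2);   -- sanctioned
    INSERT INTO PoolSanction VALUES
        (1, '2025-01-01 10:00', 2, 1),
        (2, '2025-01-02 09:00', 1, 2);
""")

# Tuple comparison: all three values must match within one PoolSanction tuple.
tuple_in = conn.execute("""
    SELECT E.EntryTimestamp, E.PersonFK, E.PoolFK
    FROM Entry E
    WHERE (E.EntryTimestamp, E.PersonFK, E.PoolFK) IN (
            SELECT PS.EntryFK, PS.PersonFK, PS.PoolFK
            FROM PoolSanction PS
        )
    ORDER BY E.EntryTimestamp, E.PersonFK
""").fetchall()

# Three independent checks: each value may match a DIFFERENT sanction tuple.
separate_in = conn.execute("""
    SELECT E.EntryTimestamp, E.PersonFK, E.PoolFK
    FROM Entry E
    WHERE E.EntryTimestamp IN (SELECT EntryFK FROM PoolSanction)
        AND E.PersonFK IN (SELECT PersonFK FROM PoolSanction)
        AND E.PoolFK IN (SELECT PoolFK FROM PoolSanction)
    ORDER BY E.EntryTimestamp, E.PersonFK
""").fetchall()

print(tuple_in)     # only the two sanctioned entries
print(separate_in)  # all three entries, including the unsanctioned one
```

The unsanctioned entry slips through the second query because its timestamp, person, and pool each appear somewhere in PoolSanction, just never together in one tuple.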

Another variation of this query is to list all the sanctioned entries along with the information of the person who has the entry.

SELECT P.*, E.EntryTimestamp, E.PersonFK, E.PoolFK
FROM Person P INNER JOIN Entry E ON P.PersonID=E.PersonFK
WHERE (E.EntryTimestamp, E.PersonFK, E.PoolFK) IN (
        SELECT PS.EntryFK, PS.PersonFK, PS.PoolFK
        FROM PoolSanction PS
    );

For this, based on the previous solution where we got the list of sanctioned entries, the only additional step we need to take is to perform a JOIN between Entry and Person. In doing this, we only keep those tuples from the Cartesian product where the foreign key PersonFK from Entry refers to the primary key {PersonID} from the information coming from Person.

We can also see that the condition checking whether the entry is sanctioned or not is the same. With this example, we can more clearly see the purpose of the JOIN operation, which is to gather information from multiple tables. So for each sanctioned entry we had before, if we now need to concatenate the information of the person to whom the foreign key PersonFK points, we can simply perform the Cartesian product between both tables and impose a condition to ensure that the reference of PersonFK is indeed the person present in the tuple.

Continuing with the uses of this last technique, where we use operators like IN to check if a certain combination of attribute values belongs to a set of tuples, in the following example we have a query that lists all the trips from the Voyage table for which there is a return trip. That is, we need to find all trips going from city A to city B for which there is at least one other different trip going from B to A.

SELECT *
FROM Voyage V
WHERE (V.DepartureCityFK, V.ArrivalCityFK) IN (
        SELECT V2.ArrivalCityFK,
            V2.DepartureCityFK
        FROM Voyage V2
    )
ORDER BY (V.DepartureCityFK, V.ArrivalCityFK);

To do this, the first thing we need to realize is that the primary key of Voyage includes the attributes DepartureCityFK and ArrivalCityFK, which refer to the start and end cities of the trip, respectively. So if we have multiple trips with different values in these attributes, we’ll definitely know that both trips are different. This is because even if the rest of the primary key attributes were the same, as long as at least one of them is different, the trips must necessarily be different.

So we can formulate the query similarly to the previous ones, going through all the tuples in Voyage and, for each one, checking if there is a trip whose start and end cities are the same as the end and start cities of the trip being checked. To do this, we construct a subquery where we again go through all the tuples in Voyage and only get the values of the DepartureCityFK and ArrivalCityFK attributes. This subquery doesn't reference the outer query, so it isn't correlated. Then, we check if the values of these attributes from the trip in the "higher level" query are in the set of tuples we just built.

But in this case, if we look at the code, the order of the attributes is swapped compared to the order of those same attributes in the subquery. What we really want to check is that the value of the DepartureCityFK attribute of the tuple we are checking in the query matches the value of the ArrivalCityFK attribute of some tuple in the subquery. Also, we need to check that the value of the ArrivalCityFK attribute of the query's tuple matches the value of the DepartureCityFK attribute of the same tuple that matched the previous pair of attributes.

We can more easily understand this by viewing the pair (V.DepartureCityFK, V.ArrivalCityFK) as if they were the start and end cities, A and B, of a trip. What we want to check is if there is any tuple in the subquery that has B and A as the start and end cities, respectively.

The simplest way to make this check is to reverse the order of the attributes, either in the tuple (V.DepartureCityFK, V.ArrivalCityFK) or in the SELECT of the subquery. The latter is what we’ve decided to do here, which is why ArrivalCityFK is returned before DepartureCityFK.

Finally, to more easily check for the existence of these round trips, we can add the ORDER BY clause, which orders by multiple attributes instead of just one. That is, we use the attributes (V.DepartureCityFK, V.ArrivalCityFK) as the sorting criteria. SQL orders by the pairs of values for each tuple, as if each possible pair of values were considered a single value that could be compared with others.

By doing this, we can easily focus on the departure city of a trip and then look for another trip whose arrival city has the same value. Then we can find one whose departure city matches the arrival city of the original trip, thus finding a pair of trips that form a round trip to a city.
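As a quick check of the swapped-attribute idea, here is a small sketch with three made-up trips, where cities are just the numbers 1, 2, and 3. The reduced Voyage schema and SQLite are assumptions for illustration. The ORDER BY is written as a plain comma-separated list, which is a choice made here for portability across DBMSs.

```python
import sqlite3

# Hypothetical trips: 1 -> 2 and 2 -> 1 form a round trip, 1 -> 3 has no return.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Voyage (
        DepartureCityFK INTEGER, ArrivalCityFK INTEGER, DepartureDate TEXT,
        PRIMARY KEY (DepartureCityFK, ArrivalCityFK, DepartureDate)
    );
    INSERT INTO Voyage VALUES
        (1, 2, '2025-01-01'),
        (2, 1, '2025-01-05'),
        (1, 3, '2025-02-01');
""")

# The subquery returns (arrival, departure), so (A, B) is looked up as (B, A).
round_trips = conn.execute("""
    SELECT V.DepartureCityFK, V.ArrivalCityFK
    FROM Voyage V
    WHERE (V.DepartureCityFK, V.ArrivalCityFK) IN (
            SELECT V2.ArrivalCityFK, V2.DepartureCityFK
            FROM Voyage V2
        )
    ORDER BY V.DepartureCityFK, V.ArrivalCityFK
""").fetchall()

print(round_trips)
```

The trip from city 1 to city 3 is dropped because no tuple in the subquery has 3 as its departure city and 1 as its arrival city.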

Finally, let’s look at another query where we need to compare values of multiple attributes at once. Here, all trips are listed whose associated cruise (the one making the trip) has been assigned to its cruise line on the same start date of the trip.

SELECT V.*
FROM Voyage V
WHERE (V.ShipFK, V.DepartureDate)
  IN (
    SELECT SA.ShipFK, SA.StartDate
    FROM ShipAssignment SA
  );

To implement this, we need to consider all the attributes that Voyage has, including the information contained in the foreign keys, to avoid having to perform unnecessary operations like a JOIN with the CruiseShip or CruiseLine table.

In this case, we structure the query similarly, first going through all the tuples of Voyage and checking if the cruise has been assigned to its cruise line on the same date as the start of the trip.

To make this check easier, we construct a subquery that returns all the values of the ShipFK and StartDate attributes from the ShipAssignment table. This way, later in our query we can check if the cruise making the trip (which is referenced with the foreign key ShipFK of Voyage) was assigned on the DepartureDate of the Voyage tuple (start date of the trip) to any cruise line.

As you can see, we can simplify the query if we think of it as getting all trips made by a cruise ship whose assignment to some cruise line starts on the trip's departure date. In other words, it doesn't have to be a specific line, but any line to which the ship was assigned on the date indicated by DepartureDate of Voyage. So in the WHERE clause, we check if the pair of values taken by the attributes (V.ShipFK, V.DepartureDate) is found in the subquery. This time the attributes keep the same order on both sides, since ShipFK of Voyage must match ShipFK of ShipAssignment, and DepartureDate of Voyage must match StartDate of ShipAssignment, respectively.

On one hand, the match of ShipFK ensures that the cruise ship making the trip is the same as the one assigned in the ShipAssignment tuple. Likewise, the match of the date attributes ensures that this assignment was made on the start date of the trip.

We have also solved this query using a subquery and the IN operator (non-correlated in this case, since the subquery doesn't reference Voyage), although it's not the only way. As you can guess, there's always the option to use JOIN operations and conditions to filter the tuples, which can be more or less efficient in certain cases. This is why it's important to understand what SQL does under the hood, like whether it actually builds and stores all the tuples of a subquery or Cartesian product in memory, and when it does so.
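The IN and JOIN formulations aren't always interchangeable row for row, and a tiny experiment shows why. In the sketch below, the VoyageID and LineFK columns and all the data are assumptions, including the deliberately awkward case of one ship assigned to two lines on the same date. SQLite is used as a stand-in DBMS.

```python
import sqlite3

# Hypothetical data: ship 10 is assigned to two lines on 2025-03-01.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Voyage (VoyageID INTEGER PRIMARY KEY, ShipFK INTEGER, DepartureDate TEXT);
    CREATE TABLE ShipAssignment (ShipFK INTEGER, LineFK INTEGER, StartDate TEXT);
    INSERT INTO Voyage VALUES
        (1, 10, '2025-03-01'), (2, 10, '2025-04-01'), (3, 20, '2025-03-01');
    INSERT INTO ShipAssignment VALUES
        (10, 1, '2025-03-01'), (10, 2, '2025-03-01'), (20, 1, '2025-01-01');
""")

# IN is a membership test: each voyage appears at most once.
with_in = [row[0] for row in conn.execute("""
    SELECT V.VoyageID
    FROM Voyage V
    WHERE (V.ShipFK, V.DepartureDate) IN (
            SELECT SA.ShipFK, SA.StartDate
            FROM ShipAssignment SA
        )
    ORDER BY V.VoyageID
""")]

# JOIN pairs tuples: a voyage appears once per matching assignment.
with_join = [row[0] for row in conn.execute("""
    SELECT V.VoyageID
    FROM Voyage V
        INNER JOIN ShipAssignment SA
            ON SA.ShipFK = V.ShipFK AND SA.StartDate = V.DepartureDate
    ORDER BY V.VoyageID
""")]

print(with_in)    # voyage 1 once
print(with_join)  # voyage 1 twice, once per matching assignment
```

If we chose the JOIN version for this query, adding DISTINCT to the SELECT would be one way to get back the same result as the IN version.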

Common Table Expressions

We have seen that subqueries allow us to use the result of one query within another query. We can construct this once during the execution of the entire query if it’s not correlated, or once for each tuple of the table with which it’s correlated. In other words, we can see a subquery as a set of tuples that we operate with in a query.

But we don't always need queries to be correlated. We’ve seen that some queries can be resolved by non-correlated queries, meaning sets of tuples that are constructed only once and are sufficient to resolve the entire query in which they are contained.

In these situations, to simplify notation, we can use a tool called CTE (Common Table Expression) in SQL. These typically use the WITH clause. With this, we can define and store the result of a subquery in a temporary table that needs an alias. So instead of using a subquery in the construction of a query, we define a temporary intermediate table (Common Table Expression) that only exists during the execution of the query and contains all the tuples generated by a certain subquery. Again, we need to use an alias to refer to it, just as we have to provide tables with a name and a schema when we create them in the DDL.

To understand the WITH clause with an example, we can consider the query that gets information about all currently active cruises. Here, active means assigned to a cruise line at the current date when the query is executed.

Before writing code, it's helpful to think about how the query will be structured, meaning where we’ll get the data to respond, how we should combine the different tables with that data, what conditions or operations need to be applied to them or the tuples resulting from the operations performed, and so on.

WITH ActiveShips AS (
    SELECT ShipFK
    FROM ShipAssignment
    WHERE StartDate <= CURRENT_DATE
        AND EndDate >= CURRENT_DATE
)
SELECT CS.ShipID, CS.Speed, CS.PassengerCapacity
FROM ActiveShips A INNER JOIN CruiseShip CS ON CS.ShipID = A.ShipFK;

In this case, the information for all cruise assignments, whether current or not, is in the ShipAssignment table. So, to know which cruises are currently assigned to a cruise line, we can take advantage of the fact that this table has a foreign key ShipFK that identifies the assigned cruise in each tuple of ShipAssignment.

So, if we set the condition that the StartDate of the assignment should be on or before the current date obtained from CURRENT_DATE, and that the EndDate is on or after the current date, we’ll get all those assignments that are valid on the current date. By extracting the values taken by the foreign key ShipFK for those assignments, we can identify the cruises that are currently assigned.

But the query not only asks us to identify them – but also to get information about them stored in CruiseShip. So, we save the identifiers of the cruises we got earlier in a temporary table to use in the query. In other words, we could make the conditions on StartDate and EndDate apply to ShipAssignment in a subquery. But to simplify the notation and demonstrate how to use CTEs, we’ll use the WITH clause where we define all the subquery code and assign an alias to that temporary table (see above code).

Specifically, by doing this, we’ll be saving the identifiers of the currently active cruises in the temporary table named ActiveShips. This is the alias we assigned using the AS keyword, but it works in reverse in the WITH clause: first you write the alias name, and then you write the code that gets the data for the intermediate table (the element to which the alias name is assigned).

So, when we use the WITH statement, we see that we have constructed an ActiveShips table with the result of what could be a non-correlated subquery – but for simplicity, we’ve refactored it so that its result is stored in an intermediate table with a certain alias.

Now, we can treat ActiveShips as if it were another table in the database, performing a JOIN between it and CruiseShip to get all the information about the active cruises. We impose an equality condition on the ShipFK and ShipID attributes of the ActiveShips and CruiseShip tables, respectively. This means we only keep those tuples from the Cartesian product where the foreign key ShipFK refers to the ShipID identifier of that same tuple. This allows us to find the complete information about a specific cruise.
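Here is a runnable sketch of the ActiveShips query on a toy database. The two ships and their assignment dates are assumptions, chosen far in the past and future so that the result doesn't depend on the day you run it. SQLite is used as a stand-in DBMS and stores the dates as ISO-formatted text.

```python
import sqlite3

# Hypothetical data: ship 1 has an assignment covering today, ship 2 does not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CruiseShip (ShipID INTEGER PRIMARY KEY, Speed REAL, PassengerCapacity INTEGER);
    CREATE TABLE ShipAssignment (ShipFK INTEGER, StartDate TEXT, EndDate TEXT);
    INSERT INTO CruiseShip VALUES (1, 22.5, 3000), (2, 20.0, 1500);
    INSERT INTO ShipAssignment VALUES
        (1, '2000-01-01', '2999-12-31'),
        (2, '2000-01-01', '2001-01-01');
""")

# The CTE is built once, then joined like any other table.
active = conn.execute("""
    WITH ActiveShips AS (
        SELECT ShipFK
        FROM ShipAssignment
        WHERE StartDate <= CURRENT_DATE
            AND EndDate >= CURRENT_DATE
    )
    SELECT CS.ShipID, CS.Speed, CS.PassengerCapacity
    FROM ActiveShips A INNER JOIN CruiseShip CS ON CS.ShipID = A.ShipFK
""").fetchall()

print(active)
```

Ship 2 never reaches the JOIN because its only assignment ended in 2001, so it isn't part of ActiveShips in the first place.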

In the previous query, we could have easily skipped using WITH and made ActiveShips a subquery to which we could’ve also assigned an alias. But when using a subquery, even if we assign it an alias, we can’t use it in just any part of the query. That is, if we have a subquery in a FROM or a SELECT, we can’t refer to it again from other parts of the query. An intermediate table defined in a WITH, in contrast, can be referenced at any point in the query, including inside its subqueries.

WITH VoyageDistance AS (
  SELECT * 
  FROM Voyage
  WHERE Distance > 1000
)
SELECT DepartureDate, ArrivalDate, Distance
FROM VoyageDistance 
WHERE DepartureDate BETWEEN '2025-01-01' AND '2025-06-30';

We have another similar example in the query above. Here, we consider a query that gets information about all voyages that started in the first half of 2025 (approximately) and have a distance greater than 1000 kilometers. The approach in this case is simpler since all the information we need is found in the Voyage table. So the condition that the distance is greater than 1000 kilometers is easily modeled with a WHERE clause and the expression Distance > 1000.

Just like before, in this query we could also skip using WITH and include both the distance and the condition on the start date of the voyage in a single WHERE. But often we might need to modify or expand a query – for example, in the future we might be asked for a query based on this one, but with more or fewer conditions. So if we conducted an analysis of our domain, user requirements, and the query code, we might conclude that tuples with voyages over 1000 kilometers could be needed in multiple parts of the same query.

In this example, this phenomenon might not occur, but it illustrates that in a real situation, we may need to consider various factors that affect query design.

So, say we assume that the Voyage tuples with Distance>1000 could potentially be used multiple times in a single query across multiple statements (in future modifications of this query). Then the most maintainable option is to use a WITH clause where we temporarily store these tuples and then use them in the query through the alias of this intermediate table (as if it were a regular database table). Then, we can add another WHERE clause at the very end of the query, declaring the condition that the start date of the voyage is in the first half of 2025. We can model this with the BETWEEN operator, the EXTRACT() function, or many other ways.
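To show what "used multiple times in a single query" could look like, here is one possible future variation, sketched under assumptions: it lists the long voyages whose distance is above the average of the long voyages. The VoyageID column, the four rows, and this particular requirement are all hypothetical, and SQLite stands in for the DBMS.

```python
import sqlite3

# Hypothetical data: voyages 2, 3, and 4 are longer than 1000 kilometers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Voyage (VoyageID INTEGER PRIMARY KEY, Distance INTEGER);
    INSERT INTO Voyage VALUES (1, 500), (2, 1200), (3, 2000), (4, 1600);
""")

# VoyageDistance is defined once and referenced twice:
# in the outer FROM and again inside the subquery of the WHERE.
above_average = conn.execute("""
    WITH VoyageDistance AS (
        SELECT *
        FROM Voyage
        WHERE Distance > 1000
    )
    SELECT VoyageID, Distance
    FROM VoyageDistance
    WHERE Distance > (SELECT AVG(Distance) FROM VoyageDistance)
""").fetchall()

print(above_average)
```

With a plain subquery in the FROM, we would have to repeat the Distance > 1000 filter inside the WHERE as well, which is exactly the kind of duplication the WITH clause lets us avoid.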

Finally, it’s worth noting that using the WITH clause without a clear reason isn’t considered a good practice. (Examples of such a clear reason might include a design decision based on user requirements or a thorough analysis of the query that concludes that it might be useful to have an intermediate table like VoyageDistance in the future).

This is mainly because, in situations like this, a WHERE clause is being used both in the construction of the intermediate table and in the resulting table from the query. This means multiple filters might be applied internally, which can be inefficient.

But the DBMS often automatically applies certain techniques like inlining to optimize query execution through refactorizations of the execution plan. In other words, even if our code is not the most optimal, the DBMS can automatically find an equivalent and more optimal way to resolve the query.

To illustrate that the intermediate tables we define in the WITH clause can be constructed with subqueries as "complex" as we want, consider this query:

WITH Pending AS (
    SELECT D.*
    FROM DrivingLicenseRequest D
        LEFT JOIN DrivingLicense A USING (LicenseID)
        LEFT JOIN RejectedDrivingLicense R USING (LicenseID)
    WHERE A.LicenseID IS NULL
        AND R.LicenseID IS NULL
)
SELECT P.Name, Pending.*
FROM Pending INNER JOIN Person P ON P.PersonID = Pending.PersonFK;

We’re getting information on all driving license requests currently being processed, meaning those that have not yet been accepted or rejected. We’re also including information about the person who made each request.

In this case, the key point is to realize that the requests in process are represented by tuples in DrivingLicenseRequest that aren’t referenced by any tuple in either DrivingLicense or RejectedDrivingLicense (since they aren’t yet accepted or rejected).

In this case, we can use LEFT JOIN so that by combining all these tables with LEFT JOIN operations, we can gather complete information about the requests. This means constructing a table formed by all the attributes of the three tables in the hierarchy, where some of them will be NULL or not in each tuple, depending on whether they represent accepted, rejected, or pending requests.

Specifically, since the foreign keys of the inheriting entities in the hierarchy are both called LicenseID (matching the identifier {LicenseID} of the superclass), the LEFT OUTER JOINs are performed by applying an equality condition on this attribute. This ensures that the tuples we get contain information about the same request, rather than multiple requests in the same tuple of the Cartesian product.

We use LEFT JOIN because the first table we combine is DrivingLicenseRequest, which represents the superclass of the hierarchy and contains every request in the database, regardless of its status. So by placing this table on the left of the JOIN operation, we ensure that all of its tuples appear in the result, with NULL filled in for the attributes from the other table, DrivingLicense, whenever there's no matching tuple.

Then, we do another LEFT JOIN with RejectedDrivingLicense following the same process. This results in a table where, despite using USING in the JOIN operations, we can impose conditions on the LicenseID attributes of all the tables. So for a tuple of the resulting Cartesian product to represent a pending request, the LicenseID attributes of the DrivingLicense and RejectedDrivingLicense tables must be NULL. This indicates that there are no tuples in the respective tables because the LEFT JOIN has been filled in with NULL if they didn't exist. We declare this condition using a WHERE clause and the IS operator, as you can’t compare an attribute with NULL directly using the = operator.
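It can help to look at the table produced by the two LEFT JOINs before any filtering. The sketch below does that on three made-up requests (one accepted, one rejected, one pending), with SQLite as a stand-in DBMS. As an assumption made to keep the sketch simple, the joins are written with ON instead of USING. It also shows what happens if we mistakenly compare with NULL using the = operator.

```python
import sqlite3

# Hypothetical data: request 1 accepted, request 2 rejected, request 3 pending.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DrivingLicenseRequest (LicenseID INTEGER PRIMARY KEY, PersonFK INTEGER);
    CREATE TABLE DrivingLicense (LicenseID INTEGER PRIMARY KEY);
    CREATE TABLE RejectedDrivingLicense (LicenseID INTEGER PRIMARY KEY);
    INSERT INTO DrivingLicenseRequest VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO DrivingLicense VALUES (1);
    INSERT INTO RejectedDrivingLicense VALUES (2);
""")

joined_sql = """
    SELECT D.LicenseID, A.LicenseID, R.LicenseID
    FROM DrivingLicenseRequest D
        LEFT JOIN DrivingLicense A ON A.LicenseID = D.LicenseID
        LEFT JOIN RejectedDrivingLicense R ON R.LicenseID = D.LicenseID
"""

# Every request appears once; missing matches are filled with NULL (None).
all_rows = conn.execute(joined_sql + " ORDER BY D.LicenseID").fetchall()

# Pending requests: no match in either subclass table.
pending = conn.execute(
    joined_sql + " WHERE A.LicenseID IS NULL AND R.LicenseID IS NULL"
).fetchall()

# Comparing with = NULL never evaluates to true, so nothing is returned.
wrong = conn.execute(
    joined_sql + " WHERE A.LicenseID = NULL AND R.LicenseID = NULL"
).fetchall()

print(all_rows)
print(pending)
print(wrong)
```

The empty result of the last query is why the condition has to be written with IS NULL.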

At this point, to simplify the query syntax and avoid chaining too many JOINs, we can create an intermediate table with the result we got by performing these LEFT JOINs and applying the previous condition. This way, we can later perform an INNER JOIN in the query to get the information of the person who made the request. We do this all through the PersonID attribute of the Person table and the foreign key PersonFK of the intermediate table Pending, which comes from the DrivingLicenseRequest table and refers to the person associated with the request.

In this query, we could also consider combining all the joins in a single FROM clause and skipping the WITH. This would be correct, but it would complicate the code by having all the JOINs chained. And, although this strategy can be more efficient under certain circumstances, we should seek a balance between code readability and efficiency.

To illustrate that the intermediate tables in the WITH clause can be defined by queries that contain subqueries, we’ll consider the same query as before and try to solve it using a different approach.

WITH Pending AS (
    SELECT D.*
    FROM DrivingLicenseRequest D
    WHERE NOT EXISTS (
            SELECT *
            FROM DrivingLicense A
            WHERE A.LicenseID = D.LicenseID
        )
        AND NOT EXISTS (
            SELECT *
            FROM RejectedDrivingLicense R
            WHERE R.LicenseID = D.LicenseID
        )
)
SELECT P.Name, Pending.*
FROM Pending INNER JOIN Person P ON P.PersonID = Pending.PersonFK;

To get pending requests, keep the tuples in DrivingLicenseRequest whose primary key {LicenseID} is not referenced (via the foreign key LicenseID) by any tuple in DrivingLicense or RejectedDrivingLicense.

The simplest option to implement this is to go through all the tuples in DrivingLicenseRequest using the FROM clause, and for each of them, construct two very similar correlated queries.

We can have one that gets all the tuples from DrivingLicense whose foreign key LicenseID refers to the primary key LicenseID of the tuple in DrivingLicenseRequest that we are going through, and
We can have another subquery that does the same but gets tuples from the RejectedDrivingLicense table.

In this way, we can later check if any of the tables returned by the subqueries contain tuples or not using the EXISTS operator.

If any of the subqueries return tuples, then the request is either accepted or rejected. But if both subqueries return an empty set, it means that for a certain request in DrivingLicenseRequest, there is no tuple in the respective DrivingLicense or RejectedDrivingLicense tables that references it. This then indicates that the request is being processed.

With this process, we get the pending requests, which we store in an intermediate table using the WITH clause. To combine the information of the person who made each request, we use the intermediate table in the query, specifically in an INNER JOIN operation with the Person table, just as we did before.

So with this example, we’ve seen that there are multiple SQL constructions that lead to the same result – meaning a query doesn't necessarily have to be solved in just one way.
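We can check that both constructions agree by running them side by side on the same toy data. Everything in this sketch is an assumption for illustration: the three requests, the three people, the use of ON instead of USING in the joins, and SQLite as a stand-in DBMS. Only Name and LicenseID are selected to keep the output short.

```python
import sqlite3

# Hypothetical data: only request 3, made by Cal, is still pending.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE DrivingLicenseRequest (LicenseID INTEGER PRIMARY KEY, PersonFK INTEGER);
    CREATE TABLE DrivingLicense (LicenseID INTEGER PRIMARY KEY);
    CREATE TABLE RejectedDrivingLicense (LicenseID INTEGER PRIMARY KEY);
    INSERT INTO Person VALUES (100, 'Ana'), (200, 'Ben'), (300, 'Cal');
    INSERT INTO DrivingLicenseRequest VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO DrivingLicense VALUES (1);
    INSERT INTO RejectedDrivingLicense VALUES (2);
""")

# Both versions share the same final query; only the Pending definition changes.
final_query = """
    SELECT P.Name, Pending.LicenseID
    FROM Pending INNER JOIN Person P ON P.PersonID = Pending.PersonFK
    ORDER BY Pending.LicenseID
"""

via_left_join = conn.execute("""
    WITH Pending AS (
        SELECT D.*
        FROM DrivingLicenseRequest D
            LEFT JOIN DrivingLicense A ON A.LicenseID = D.LicenseID
            LEFT JOIN RejectedDrivingLicense R ON R.LicenseID = D.LicenseID
        WHERE A.LicenseID IS NULL
            AND R.LicenseID IS NULL
    )
""" + final_query).fetchall()

via_not_exists = conn.execute("""
    WITH Pending AS (
        SELECT D.*
        FROM DrivingLicenseRequest D
        WHERE NOT EXISTS (
                SELECT * FROM DrivingLicense A WHERE A.LicenseID = D.LicenseID
            )
            AND NOT EXISTS (
                SELECT * FROM RejectedDrivingLicense R WHERE R.LicenseID = D.LicenseID
            )
    )
""" + final_query).fetchall()

print(via_left_join)
print(via_not_exists)
```

Because the final query is identical in both cases, swapping one Pending definition for the other doesn't affect any code that uses the intermediate table.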

Also, by using the WITH clause, we can define each intermediate table with SQL code that’s as "complex" as we need it to be. We can include subqueries, conditions, and generally any query. Some DBMSs even accept a WITH nested inside the definition of another CTE, but others don't, and nesting quickly hurts readability.

So if we need another intermediate table to build a CTE, it's clearer and more portable to define it at the same level as the other intermediate tables in our query, meaning in a single WITH statement (as we’ll see below).

So far, we have seen that we can use the WITH statement to define an intermediate table that we use to solve the query more comfortably and easily in certain situations. But, we might need several intermediate tables to solve a query, not just one.

For example, in the below code we have a query that gets information about people who have lived in at least two cities. We solved this query in the “Tuple filtering” section using JOIN operations – but we can also follow a similar approach where we first create several different intermediate tables and finally solve the query based on the results of these intermediate tables.

WITH R1 AS (
    SELECT PersonFK, CityFK AS CityA
    FROM Residence
),
R2 AS (
    SELECT PersonFK, CityFK AS CityB
    FROM Residence
),
CityPairs AS (
    SELECT DISTINCT R1.PersonFK
    FROM R1 INNER JOIN R2 ON (R1.PersonFK = R2.PersonFK
        AND R1.CityA <> R2.CityB)
)
SELECT P.*
FROM CityPairs MC INNER JOIN Person P ON MC.PersonFK = P.PersonID
ORDER BY P.PersonID;

First, we can create several intermediate tables that contain all the tuples from Residence – specifically the information about the person and city that make up each residence. We can do this by obtaining the attributes PersonFK and CityFK, which are foreign keys that refer to the person who has lived in a certain city during that residence. By constructing several intermediate tables with this information, we can rename CityFK with an alias like CityA in one of them and CityB in the other intermediate table, so that later the JOIN between them has a clearer syntax.

To construct several intermediate tables in a single WITH statement, we can chain them with commas. Instead of using the WITH keyword multiple times, we have to use it only once and chain all the intermediate tables we want with commas, as shown above.

Subsequently, with the intermediate tables R1 and R2 containing this information, we can create another intermediate table where we get the identifiers of all the people who have had a residence in several different cities (or in at least two cities).

To do this, we can perform an INNER JOIN between R1 and R2 (a Cartesian product of their tuples) and keep the tuples from the Cartesian product where the foreign key values PersonFK match and the CityFK values do not match. This way, we keep those tuples from the Cartesian product that represent information about several residences of the same person in different cities.

These identifiers are for the people whose information we need to get from the Person table. So now we can finally perform an INNER JOIN between the intermediate table CityPairs and Person, so that the final result of the query is the information of the people who have had at least two residences in different cities. (They would not have appeared in a tuple of the Cartesian product between R1 and R2 otherwise.)

The important point about this query is to note that we have used multiple intermediate tables in the same WITH clause to solve it – and this is entirely possible but not always recommended. We can resolve this query in various ways, each with its own advantages or disadvantages depending on the characteristics we need the code to have, such as clarity, efficiency, maintainability, and so on.
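To see the chained CTEs at work, here is a sketch with three made-up people. One lived in two different cities, one in a single city, and one in the same city twice. The StartDate column, the names, and the use of SQLite as a stand-in DBMS are assumptions for illustration.

```python
import sqlite3

# Hypothetical data:
#   Ana (1) lived in cities 10 and 20, Ben (2) only in city 10,
#   Cal (3) lived in city 30 twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Residence (PersonFK INTEGER, CityFK INTEGER, StartDate TEXT);
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cal');
    INSERT INTO Residence VALUES
        (1, 10, '2020-01-01'), (1, 20, '2022-01-01'),
        (2, 10, '2021-01-01'),
        (3, 30, '2019-01-01'), (3, 30, '2023-01-01');
""")

# Three CTEs in one WITH, separated by commas; CityPairs builds on R1 and R2.
people = conn.execute("""
    WITH R1 AS (
        SELECT PersonFK, CityFK AS CityA
        FROM Residence
    ),
    R2 AS (
        SELECT PersonFK, CityFK AS CityB
        FROM Residence
    ),
    CityPairs AS (
        SELECT DISTINCT R1.PersonFK
        FROM R1 INNER JOIN R2 ON (R1.PersonFK = R2.PersonFK
            AND R1.CityA <> R2.CityB)
    )
    SELECT P.PersonID, P.Name
    FROM CityPairs MC INNER JOIN Person P ON MC.PersonFK = P.PersonID
    ORDER BY P.PersonID
""").fetchall()

print(people)
```

Cal is excluded even though he has two residences, because both are in the same city and so never satisfy CityA <> CityB. Without DISTINCT, Ana would appear twice, once for the pair (10, 20) and once for (20, 10).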

To conclude this CTE section, let's consider another query where we need to get information about bus trips that have taken place in 2025 or later and where the bus has WiFi. The simplest way to create this query would be to gather information from the CityBus and BusTrip tables using a JOIN, and then apply conditions on the tuples of the corresponding Cartesian product. But to illustrate using multiple intermediate tables (CTEs) in a single WITH clause, in this case, we’ll divide the query resolution into several parts.

WITH WifiBuses AS (
    SELECT Plate, RouteNumber
    FROM CityBus
    WHERE FreeWifi = TRUE
),
AvailableTrips AS (
    SELECT TripDate, StartAddress, EndAddress, PlateFK
    FROM BusTrip
    WHERE EXTRACT(YEAR FROM TripDate) >= 2025 
)
SELECT T.TripDate, T.StartAddress, T.EndAddress, B.RouteNumber
FROM AvailableTrips T INNER JOIN WifiBuses B ON B.Plate = T.PlateFK;

First, we’ll get information about buses with WiFi in an intermediate table. To construct this table, we simply apply the condition FreeWifi=TRUE on the tuples of the CityBus table. In this case, when we do a SELECT * FROM CityBus; we can see that in the FreeWifi attribute, the boolean values are represented with the letters ‘t’ or ‘f’ – so we might think that in the query we should compare the attribute with ‘t’.

But boolean values in SQL are TRUE and FALSE, even though the DBMS represents them with another type of notation. So the correct way to check if the attribute contains the logical value true is to compare it with TRUE. Even though the representation of the boolean value might change, in SQL we should always operate with boolean values using the literals TRUE and FALSE.

Second, we construct another intermediate table with information about bus trips that have occurred in 2025 or later. We do this by getting all the tuples from BusTrip and filtering them using the EXTRACT() function, which pulls the YEAR field out of the date.

Finally, in the query, we perform a JOIN between both intermediate tables to gather all the information about trips and buses. This way, we get tuples with trips that occurred on dates equal to or after the year 2025, along with the information about the bus with WiFi that made that trip.

But in this case, we only return the route number of the bus to the user, which is also part of the information in the CityBus table. If this isn’t enough to identify the bus, we could also return its license plate in the SELECT, for example. This decision depends on what the end user needs.

Also, with this query, we can more clearly see how coding a query with multiple intermediate tables can affect how efficiently it executes. For example, if we coded the query without WITH (and instead with a JOIN between the CityBus and BusTrip tables and conditions imposed on the resulting tuples), then, at least conceptually, the entire Cartesian product would be computed first and only afterwards filtered by the conditions.

But by using intermediate tables where each one imposes a certain condition on the tuples of each table, we can reduce the number of tuples in each intermediate table, since WifiBuses won’t contain all existing buses, but only those with WiFi (which will be fewer).

By applying this technique (known as early filtering), we ensure that when performing the final JOIN between the intermediate tables, the Cartesian product results in fewer tuples – meaning it works with smaller tables and is therefore more efficient.

Just keep in mind that in modern DBMSs, the query optimizer can often carry out this optimization automatically even without intermediate tables, depending on the nature of the query. Still, as long as it doesn’t significantly worsen the clarity of the query, it’s good practice to filter the tables to be combined via a JOIN as early as possible, which is why we have used multiple intermediate tables in a WITH clause.
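To check that both formulations return the same tuples, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for the DBMS. The sample rows are invented for illustration, and because SQLite has no EXTRACT() and stores booleans as 0/1, the filters are written in an equivalent SQLite-friendly form.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CityBus (Plate TEXT PRIMARY KEY, RouteNumber INTEGER, FreeWifi INTEGER);
CREATE TABLE BusTrip (TripDate TEXT, StartAddress TEXT, EndAddress TEXT, PlateFK TEXT);
INSERT INTO CityBus VALUES ('AAA-1', 10, 1), ('BBB-2', 20, 0);
INSERT INTO BusTrip VALUES
    ('2024-06-01', 'Main St', 'Park Ave',  'AAA-1'),
    ('2025-03-15', 'Main St', 'Harbor Rd', 'AAA-1'),
    ('2025-04-02', 'Hill Rd', 'Park Ave',  'BBB-2');
""")

# Version 1: each CTE filters its own table before the JOIN (early filtering).
# FreeWifi = 1 plays the role of FreeWifi = TRUE, and the date-string
# comparison replaces EXTRACT(YEAR FROM TripDate) >= 2025.
with_ctes = """
WITH WifiBuses AS (
    SELECT Plate, RouteNumber FROM CityBus WHERE FreeWifi = 1
),
AvailableTrips AS (
    SELECT TripDate, StartAddress, EndAddress, PlateFK
    FROM BusTrip WHERE TripDate >= '2025-01-01'
)
SELECT T.TripDate, T.StartAddress, T.EndAddress, B.RouteNumber
FROM AvailableTrips T INNER JOIN WifiBuses B ON B.Plate = T.PlateFK
ORDER BY T.TripDate
"""

# Version 2: a single JOIN with both conditions in the WHERE clause.
single_join = """
SELECT T.TripDate, T.StartAddress, T.EndAddress, B.RouteNumber
FROM BusTrip T INNER JOIN CityBus B ON B.Plate = T.PlateFK
WHERE B.FreeWifi = 1 AND T.TripDate >= '2025-01-01'
ORDER BY T.TripDate
"""

rows_ctes = conn.execute(with_ctes).fetchall()
rows_join = conn.execute(single_join).fetchall()
print(rows_ctes)
print(rows_ctes == rows_join)
```

Both versions describe the same result, so the choice between them is mostly about readability, and about performance only when the optimizer doesn’t push the filters down on its own.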

Set Operations

We have seen that in most queries, we need to impose conditions on table tuples to filter them and keep only those that interest us. Sometimes, we even need to use logical operators to chain multiple conditions together.

But using logical operators AND, OR, and NOT is not the only way to chain multiple conditions. We can also take a different approach where, instead of applying a filter on all tuples, we divide the conditions that must be met and apply multiple filters, one for each condition based on logical operators. Finally, we get the resulting tuples from those filters and combine them using set theory operators, which perform functions equivalent to logical operators.

In other words, in SQL, we can chain multiple conditions in a WHERE clause, for example, using logical operators. Or we can use set theory operators to combine the resulting tuples from multiple filters, each applying one of those conditions, all without using logical operators.

As we will see below, the decision to use logical or set operators largely depends on the clarity of the resulting code and the efficiency we want to achieve in the query.

To start, let's consider a query where we need to get information on all pools that are currently in a maintenance or closed state. If we wanted to do this with what we already know, the simplest way would be to use a filter with a WHERE clause, combining the conditions that the Status attribute is 'closed' or 'maintenance' using the logical OR operator (that is, WHERE Status='maintenance' OR Status='closed'). The query below shows the alternative based on set operators, which we’ll build step by step.

SELECT *
FROM Pool
WHERE Status='maintenance'
UNION
SELECT *
FROM Pool
WHERE Status='closed';

Besides using the OR operator, we can rethink the query to solve it using set theory operators, as in the query above. Since the original condition uses only one logical OR operator, we can split it at that OR into two separate conditions: Status='maintenance' and Status='closed'.

Doing this, we can resolve two queries: one applying the first condition and another applying the second. This gives us two resulting tables, one with information about pools under maintenance and another table with all the closed pools.

But we wanted all of them in a single output table, not in several. So to combine all the tuples from both tables into one (so that they all appear in the resulting table), we use the UNION operator between both queries. It treats the results of both queries as multisets of tuples and produces a set containing every tuple that appears in either result. This is just like in set theory, where the union A ∪ B is another set containing all the elements of A and all the elements of B, including those that belong to both.
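The equivalence between the OR filter and the UNION of two filters can be checked on a small example. This is only a sketch: it uses Python’s built-in sqlite3 module as a stand-in DBMS, and the Pool columns other than Status, as well as the rows, are assumptions made up for the demonstration.

```python
import sqlite3

# Hypothetical Pool table: only Status comes from the text, the rest is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Name TEXT, Status TEXT);
INSERT INTO Pool VALUES
    (1, 'North', 'open'),
    (2, 'South', 'closed'),
    (3, 'East',  'maintenance'),
    (4, 'West',  'closed');
""")

# One filter with both conditions chained by OR.
with_or = conn.execute("""
    SELECT * FROM Pool
    WHERE Status = 'maintenance' OR Status = 'closed'
    ORDER BY PoolID
""").fetchall()

# Two filters, one per condition, combined with UNION.
with_union = conn.execute("""
    SELECT * FROM Pool WHERE Status = 'maintenance'
    UNION
    SELECT * FROM Pool WHERE Status = 'closed'
    ORDER BY PoolID
""").fetchall()

print(with_union)
print(with_or == with_union)
```

The two results match here because every Pool tuple is distinct (it carries its primary key). If the selected columns could contain duplicate tuples, UNION would remove them while the OR filter would not.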

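To apply a set theory operator, the tables returned by the queries we operate with must have compatible schemas. This means that they must have the same number of attributes, with compatible data types, and in the same order. The attribute names don’t need to match (the result normally takes its column names from the first query), which is why some queries in this section rename an attribute with AS only in the first query. If the number or the types of the attributes differ, we can’t compare tuples from both tables, and it wouldn’t be possible to determine if a tuple belongs to one of the sets involved in the operation.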

When we run a query that includes a UNION, we can see that if there are duplicate tuples in the resulting tables from the queries we’re working with, those duplicate tuples disappear in the resulting table of the query. This happens because, by default, all set theory operators in SQL take multisets with tuples as input and produce a set of tuples, which means it won’t contain duplicate tuples. So if we want to force the appearance of duplicate tuples because the query requires it, we must add the ALL modifier after the corresponding UNION, INTERSECT, or EXCEPT operator.

For example, in this case, we want to get information about all the people who have rented at least one bike or owned at least one car in our system:

SELECT PersonFK
FROM Rental
UNION ALL 
SELECT PersonFK
FROM CarOwnership;

To do this, we first create a query that gets all the PersonID identifiers of people referenced by the tuples in the Rental table through their foreign key PersonFK. In other words, each tuple in Rental has a value in its foreign key PersonFK that corresponds to a certain person's identifier, which matches the value of their unique identifier PersonID. So selecting the PersonFK attribute is enough to get the identifiers of all the people who have at least one record in this table.

We do the same with another query on the CarOwnership table, which also has a foreign key PersonFK with the same characteristics.

Finally, when reviewing these partial results from the queries, we’ll see that some people may have rented several bikes or simply made several rentals, and they may also have multiple ownership records in CarOwnership, leading to duplicate tuples. So when building the final resulting table, we need to get the people who are present in one table or the other. This means we need to combine both tables with the UNION operator to get a final set with all the tuples from both tables.

In this specific case, we shouldn’t add the ALL modifier, as we simply want to know which people meet the condition of having at least one rental or at least one car ownership (so a person who has made multiple rentals doesn’t need to appear multiple times in the final table).

But if we wanted to keep the duplicate tuples, the query above shows how: by adding ALL after the UNION operator, all duplicate tuples from both tables are preserved, so the final table shows the same person's identifier as many times as the number of rentals and car ownership records they have had. In other words, the ALL modifier forces UNION to return a multiset, not a set that removes duplicate tuples.

We can see the effect of the ALL modifier at a glance by examining the resulting table from the query. But if we are working with a very large query or a database with a lot of information, we may want to wrap our query in another outer query that uses the aggregation function COUNT() to count how many tuples it returns.

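For example, in the code below you can see that we have used the query we just looked at (here without the ALL modifier) as a subquery in the FROM clause, so we get all the tuples it contains. Then in the SELECT, we use the COUNT(*) function to count how many tuples there are. Running it once with UNION and once with UNION ALL shows how many duplicate tuples the default behavior removes.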

SELECT COUNT(*)
FROM (
        SELECT PersonFK
        FROM Rental
        UNION
        SELECT PersonFK
        FROM CarOwnership
    ) AS PersonTable;

It’s also important here to provide an alias to the subquery, because when subqueries appear in the FROM clause, many DBMSs require them to have aliases to distinguish them and avoid ambiguities regarding the origin of the attributes that are later selected or used in other clauses.
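As a quick illustration of the counting technique, the sketch below wraps both variants of the query in COUNT(*) and compares the results. It runs on Python’s built-in sqlite3 module with invented rows; the helper name count_rows and the extra columns are hypothetical.

```python
import sqlite3

# Hypothetical rows: person 1 has two rentals, person 3 owns two cars,
# and person 2 appears in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Rental (PersonFK INTEGER, BikeFK INTEGER);
CREATE TABLE CarOwnership (PersonFK INTEGER, CarFK INTEGER);
INSERT INTO Rental VALUES (1, 10), (1, 11), (2, 10);
INSERT INTO CarOwnership VALUES (2, 500), (3, 501), (3, 502);
""")

def count_rows(operator):
    """Count the tuples returned when the two queries are combined with `operator`."""
    query = f"""
        SELECT COUNT(*)
        FROM (
            SELECT PersonFK FROM Rental
            {operator}
            SELECT PersonFK FROM CarOwnership
        ) AS PersonTable
    """
    return conn.execute(query).fetchone()[0]

print(count_rows("UNION"))      # distinct people only
print(count_rows("UNION ALL"))  # one tuple per rental or ownership record
```

With this data, UNION counts 3 distinct people while UNION ALL counts all 6 records, so the difference between the two numbers is the number of duplicates removed.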

Continuing with the different set theory operators that SQL offers, we have INTERSECT, which combines the results of several tables to return a set with all the tuples that appear in all the tables simultaneously. To understand this, we have the query below that retrieves information about all the people who have a driver's license and have also rented at least one bike.

SELECT DR.PersonFK AS PersonID
FROM DrivingLicenseRequest DR INNER JOIN DrivingLicense DL 
  ON DR.LicenseID = DL.LicenseID
INTERSECT
SELECT R.PersonFK
FROM Rental R;

We could write this query using only JOIN operations. But this time it might be more natural and straightforward to think of it as a set operation.

First, as shown above, we can construct a query that returns the identifier of all the people who currently have an active driver's license. To do this, we perform a JOIN between DrivingLicenseRequest and DrivingLicense, so we can gather the driver's license information with the foreign key PersonFK from DrivingLicenseRequest, which identifies the person who applied for the license. Then, we can construct another query that returns all the people who have rented at least one bike using the foreign key PersonFK from the Rental table.

And finally, to know which people have a driver's license and have also rented at least one bike, we need to keep the tuples that are in both tables. In other words, if the tables are multisets containing tuples, we need to keep those that appear in both multisets at the same time.

Instead of using the UNION operation, which represents the union of sets, we can use INTERSECT. It performs the intersection between sets, resulting in the final table of the query where we only have those people who meet all the conditions.

In the previous query, we only got the identifier of each person, since knowing the value of their primary key {PersonID} is enough to uniquely identify a person. With the foreign keys PersonFK pointing to the primary key {PersonID} of the Person table, we don't need the query to return more information about the person, as we can identify them with their primary key.

But there are situations where we may want to get more information about the person in the same query, such as their name in addition to the identification. So, if we needed to modify the query to return this, the simplest way would be to use a WITH clause that stores the identifiers of the people and then perform an INNER JOIN with the Person table in the body of the query, thus obtaining all the information present in the Person table.

SELECT DR.PersonFK AS PersonID, (SELECT Name FROM Person WHERE PersonID=DR.PersonFK)
FROM DrivingLicenseRequest DR INNER JOIN DrivingLicense DL 
  ON DR.LicenseID = DL.LicenseID
INTERSECT
SELECT R.PersonFK, (SELECT Name FROM Person WHERE PersonID=R.PersonFK)
FROM Rental R
ORDER BY PersonID;

But to illustrate other ways to get more information about people in the system from their identification in the same query without using WITH, we have the option to use subqueries in the SELECT clause, as you can see in the example above. Specifically, if we only need to return the name in addition to the PersonID, we can create a correlated subquery that, for each tuple to be returned, gets the name of the person with a specific identifier. In other words, it searches the Person table for the tuple with a certain PersonID, retrieves the name, and adds it as an additional column.

In general, using correlated subqueries in the SELECT is not a good practice because, as you might guess, the result of the subquery must be computed for each tuple to be returned. It also hurts code clarity, complicates maintenance, and in most cases makes optimization by the DBMS more difficult.

Basically, with a simple WITH and an INNER JOIN, you can avoid searching the Person table once for each person to obtain a specific characteristic, and instead gather the data from the intermediate table of the WITH and the Person table in a single JOIN.
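The WITH plus INNER JOIN alternative described above could look like the following sketch. It runs on Python’s built-in sqlite3 module with invented rows, and the CTE name LicensedRenters is a hypothetical name chosen for this example.

```python
import sqlite3

# Hypothetical data: persons 1 and 3 hold a license, persons 1, 2 and 3 have rentals.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE DrivingLicenseRequest (LicenseID INTEGER, PersonFK INTEGER);
CREATE TABLE DrivingLicense (LicenseID INTEGER);
CREATE TABLE Rental (PersonFK INTEGER, BikeFK INTEGER);
INSERT INTO Person VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
INSERT INTO DrivingLicenseRequest VALUES (100, 1), (101, 2), (102, 3);
INSERT INTO DrivingLicense VALUES (100), (102);
INSERT INTO Rental VALUES (1, 10), (1, 11), (2, 10), (3, 12);
""")

# The CTE keeps only the identifiers; the names come from one JOIN with Person
# instead of one correlated subquery per returned tuple.
query = """
WITH LicensedRenters AS (
    SELECT DR.PersonFK AS PersonID
    FROM DrivingLicenseRequest DR INNER JOIN DrivingLicense DL
      ON DR.LicenseID = DL.LicenseID
    INTERSECT
    SELECT R.PersonFK
    FROM Rental R
)
SELECT P.PersonID, P.Name
FROM LicensedRenters LR INNER JOIN Person P ON P.PersonID = LR.PersonID
ORDER BY P.PersonID
"""

licensed_renters = conn.execute(query).fetchall()
print(licensed_renters)
```

Because the intersection is computed on identifiers alone, adding more Person attributes later only means extending the final SELECT.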

Just like with other set theory operators, INTERSECT takes multisets as input and by default outputs a set, so there can't be duplicate tuples in the output table, as we saw earlier with UNION. So if we need to force the operator to always work with multisets and also return a multiset as the output of the intersection operation, we can add the ALL modifier, as shown in the example query below:


SELECT BikeFK AS BikeID
FROM Rental
WHERE StartTimestamp >= '2024-01-01'
INTERSECT ALL
SELECT BikeFK
FROM Rental
WHERE Duration <= 3
ORDER BY BikeID;

In this query, we get information on bikes that have had at least one rental starting in 2024 or later and at least one rental (not necessarily the same one) lasting at most 3 hours. To do this, we construct a query that gets all bikes that have had at least one rental on a date >= '2024-01-01', and then another that gets those that have had at least one rental with a maximum duration of 3 hours.

Then, to keep the bikes that meet both conditions at once, we perform the intersection between the multisets returned by the queries, keeping the tuples that are in both multisets at once. And, since the same bike may have had several rentals with these characteristics, there may be duplicates in the final query table. If we want to preserve them, we’ll have to add the ALL modifier. In that case, a tuple that appears x times in the first result and y times in the second appears min(x, y) times in the output.
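To make the duplicate counts concrete, the sketch below models the two query results as Python multisets with collections.Counter instead of running them on a DBMS. The BikeFK values are invented, and the `&` operator on counters keeps the minimum count of each element, which is the rule standard SQL defines for INTERSECT ALL.

```python
from collections import Counter

# Hypothetical BikeFK values returned by each side of the set operation.
recent_rentals = [1, 1, 1, 2, 3]  # rentals starting on or after 2024-01-01
short_rentals = [1, 1, 2, 2, 4]   # rentals lasting at most 3 hours

# INTERSECT ALL: each bike appears min(x, y) times.
intersect_all = sorted((Counter(recent_rentals) & Counter(short_rentals)).elements())

# INTERSECT: duplicates are removed, so each bike appears at most once.
intersect = sorted(set(recent_rentals) & set(short_rentals))

print(intersect_all)  # [1, 1, 2]
print(intersect)      # [1, 2]
```

Bike 1 appears three times on one side and twice on the other, so INTERSECT ALL keeps two copies, while bikes 3 and 4 disappear because they are missing from one of the sides.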

By default, we don’t usually use ALL in this type of query, since we simply want to know which bikes meet the conditions. But it’s possible that, due to some user requirement, the query needs to return duplicates, in which case we’d use ALL.

In most situations, the performance difference between using ALL or not is negligible. But the more data there is in the database, the more noticeable that difference will be. This also depends on the algorithms the DBMS uses internally to implement this operation.

The last set operator we'll look at here is EXCEPT, which is called MINUS in some DBMSs. Basically, it implements the difference operation between sets, meaning that if we have two sets A and B with tuples, the difference A-B returns a set with all the tuples that are in A and are not in B.

SELECT p.EntryFK, p.PersonFK, p.PoolFK
FROM PoolSanction p
WHERE p.BanEndDate < CURRENT_DATE
EXCEPT
SELECT p.EntryFK, p.PersonFK, p.PoolFK
FROM PoolSanction p INNER JOIN Sanction s ON p.SanctionID = s.SanctionID
WHERE s.Status = 'active';

For example, in the query above, we get information about all pool sanctions where the ban end date is before the current date when the query is run and that aren’t active.

As always, to do this, we might think the simplest way is through JOIN operations and a WHERE clause where the conditions are implemented. But to illustrate the use of EXCEPT, we can frame the query as finding all sanctions that meet the first condition of having a ban end date before CURRENT_DATE, and then from all of them, selecting only those that aren’t active. That is, from all those that meet the date condition, we keep only those that aren’t active.

Viewed another way, by getting all those that meet the date condition, we get a set A with tuples, where each one represents a sanction. Among all of them, there may be some that are active and some that are not. So to keep the ones we’re interested in, we need to remove from set A all those that are active. In other words, if we consider all the active ones to be in another set called B, then the sanctions we are interested in will be in the difference A-B (this means all the sanctions that meet the date condition (are in A) and are not active (are not in B)).

So in the query, you can see that we use the EXCEPT operator to work with the queries that get sets A and B, respectively. That is, the first query constructs set A by imposing the condition p.BanEndDate < CURRENT_DATE on the tuples of PoolSanction, while the query following the EXCEPT operator constructs set B by imposing the condition s.Status = 'active', gathering data from PoolSanction and Sanction to filter by the Status attribute, which is in Sanction instead of PoolSanction.

To implement the difference A-B, we use EXCEPT, where the query above the EXCEPT is set A and the one below is set B. This is important to keep in mind because EXCEPT is the only operator where the order of the operands can change the result of the query.

For example, with the other operators UNION and INTERSECT, we can clearly see that it doesn't matter if we unite or intersect two sets as A and B or as B and A – in either order, the result will be the same. This is not the case with the difference A-B, which doesn't necessarily have to be equivalent to B-A. This property is called commutativity, and EXCEPT is the only set operator that is not commutative.
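A tiny experiment makes the role of operand order visible. The sketch below uses Python’s built-in sqlite3 module with two hypothetical single-column tables A and B, and a hypothetical helper named run that applies an operator in a given order.

```python
import sqlite3

# Two hypothetical tables playing the role of sets A = {1, 2, 3} and B = {3, 4}.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (x INTEGER);
CREATE TABLE B (x INTEGER);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO B VALUES (3), (4);
""")

def run(operator, left, right):
    """Apply a set operator with `left` as the first operand and `right` as the second."""
    query = f"SELECT x FROM {left} {operator} SELECT x FROM {right} ORDER BY x"
    return [row[0] for row in conn.execute(query)]

# UNION and INTERSECT give the same result in either order.
print(run("UNION", "A", "B"), run("UNION", "B", "A"))
print(run("INTERSECT", "A", "B"), run("INTERSECT", "B", "A"))

# EXCEPT does not: A-B and B-A are different sets.
print(run("EXCEPT", "A", "B"), run("EXCEPT", "B", "A"))
```

Here A-B is [1, 2] while B-A is [4], so swapping the queries around an EXCEPT changes the answer.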

Lastly, in this query, we can see that the table aliases are all in lowercase. This is allowed in SQL, and we can even declare an alias in uppercase and then use it in lowercase, or vice versa. But if we enclose the aliases in double quotes, like in Person "P", we can only refer to the table with the alias exactly as it’s written in the quotes.

On one hand, not quoting it provides flexibility when writing the code, as we don’t need to remember exactly how it was written, and in most SQL code quotes aren’t commonly used. On the other hand, an unquoted alias can cause ambiguity issues if it is named exactly the same as a table or another element in the database. Quoting it avoids these potential collisions with names of other elements.

In the end, the decision of which alias to assign to each table or element in the query mainly depends on its complexity and the style guide followed, among other factors.

Just like with other set operators, EXCEPT also takes multisets as input by default and returns sets, removing any duplicate tuples. So if we need to keep those duplicate tuples, we simply need to add the ALL modifier.

SELECT BikeFK AS BikeID
FROM Rental
EXCEPT ALL
SELECT BikeFK
FROM Rental
WHERE Duration < 2;

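For example, this query is meant to get information on all the bikes that have been rented at least once and have never been rented for less than 2 hours. Strictly speaking, that description matches the version with plain EXCEPT. With EXCEPT ALL, as written above, a bike that has more rentals in total than short rentals still appears in the result, as we’ll see below.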

To implement this, we first construct a query that retrieves all the bikes that have been rented at least once using the foreign key attribute BikeFK from the Rental table. Then, with another query, we get all the bikes that have been rented at least once for less than 2 hours. Finally, to keep only those we’re interested in, we get the difference between the first set and the second.

As you can see, a bike may have been rented an arbitrary number of times, so the first query might return many duplicate tuples. If we don't use the ALL modifier, all those duplicates will be lost, resulting in a table where each bike is guaranteed to appear at most once.

But if we want to keep the duplicates, using ALL will return a result table with many more tuples, as many of them correspond to duplicate bikes. Keep in mind that ALL changes not only how many times each bike appears, but also which bikes are returned.

Here, we also need to consider that, despite the possibility of duplicate bikes existing in both set A and set B, when we perform the difference A-B using EXCEPT, we won't get any bike that's in B, regardless of how many times it appears in either set.

But when performing the operation using EXCEPT ALL, if we have x repetitions of a tuple in set A and y repetitions of that same tuple in B, then in the resulting table, we will get max{x-y,0} repetitions of that tuple. That is, when there are more repetitions in A than in B, we will get x-y repetitions of the tuple in the final table. If there are more repetitions in B than in A, then x-y is negative, so we will simply get 0 repetitions of the tuple. This means that tuple won’t appear in the resulting table of the difference operation implemented with EXCEPT ALL.
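The max{x-y,0} rule can be tried out without a DBMS by modeling the two results as multisets. In the sketch below, collections.Counter stands in for the query results, subtraction between counters drops non-positive counts (which matches the rule above), and the bike identifiers and counts are invented.

```python
from collections import Counter

# Hypothetical multisets: how many times each bike appears in A and in B.
a = Counter({"bike-1": 3, "bike-2": 1, "bike-3": 2})  # x repetitions in A
b = Counter({"bike-1": 1, "bike-2": 2})               # y repetitions in B

# EXCEPT ALL: each tuple is kept max(x - y, 0) times.
except_all = a - b

# EXCEPT: any tuple present in B is removed entirely, and duplicates collapse.
except_plain = sorted(set(a) - set(b))

print(dict(except_all))  # {'bike-1': 2, 'bike-3': 2}
print(except_plain)      # ['bike-3']
```

Notice that bike-1 survives EXCEPT ALL (3 - 1 = 2 copies) but is removed by plain EXCEPT, while bike-2 disappears in both cases because it has more repetitions in B than in A.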

To correctly understand the difference operator, let's consider a query where we need to get information on all cruise trips for which there is no return trip. In other words, we want to find all trips going from a city x to another city y with no return to city x.

SELECT V.DepartureCityFK, V.ArrivalCityFK
FROM Voyage V
EXCEPT
SELECT V2.ArrivalCityFK, V2.DepartureCityFK
FROM Voyage V2
ORDER BY DepartureCityFK, ArrivalCityFK;

The approach we'll take for this query is based on set theory, as it’s particularly easy to solve in this case. First, we construct a query that returns all existing trips. From these, we can get all the information present in the Voyage table – but for simplicity, we'll only focus on the attributes that determine the departure and arrival cities of the trip (which are their foreign keys DepartureCityFK and ArrivalCityFK).

Then, from all those trips returned by the first query, we need to remove the return trips. That is, for each existing trip that departs from city x and arrives at city y, we look for a return trip in the Voyage table that departs from city y and arrives at city x. If it exists, we remove the original trip from the result table of the first query.

We could implement this using the IN operator and a subquery. But it's simpler, and often more efficient, to build a second query that gets all the trips from Voyage but swaps the departure and arrival cities for each trip as shown above. This second query is responsible for getting all possible return trips that might exist by swapping the values of the foreign keys DepartureCityFK and ArrivalCityFK, meaning swapping the departure cities with the arrival cities.

Finally, with these two queries, we apply the EXCEPT operator, where we remove from the result table of the first query (which contains all the trips from Voyage) all those contained in the result table of the second query. In other words, from all existing trips, we are removing those considered return trips because they were generated by the second query by swapping the departure and arrival cities.

Some of the swapped tuples generated by the second query may not correspond to any trip that actually exists in Voyage (which can happen). In that case, the EXCEPT operator simply has nothing to remove for that tuple, since it doesn't exist in the set from which it would be removed.
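The following sketch runs the one-way trip query on a small invented Voyage table using Python’s built-in sqlite3 module. The city identifiers are hypothetical, and the ORDER BY uses column positions only to keep the example portable.

```python
import sqlite3

# Hypothetical voyages: 1 -> 2 has a return trip (2 -> 1); 1 -> 3 and 3 -> 4 do not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Voyage (DepartureCityFK INTEGER, ArrivalCityFK INTEGER);
INSERT INTO Voyage VALUES (1, 2), (2, 1), (1, 3), (3, 4);
""")

# All trips, minus every trip that matches a swapped (return) trip.
one_way = conn.execute("""
    SELECT V.DepartureCityFK, V.ArrivalCityFK
    FROM Voyage V
    EXCEPT
    SELECT V2.ArrivalCityFK, V2.DepartureCityFK
    FROM Voyage V2
    ORDER BY 1, 2
""").fetchall()

print(one_way)  # [(1, 3), (3, 4)]
```

The swapped tuples (3, 1) and (4, 3) produced by the second query match no existing trip, so they remove nothing, while (1, 2) and (2, 1) cancel each other out.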

To wrap up this type of query, so far we have seen some that apply a single set operator. But SQL allows us to use any number of them in the same query or subquery as needed. This is applied in the query below, which retrieves information about all the people who have or have had a driver's license application registered in the database, or an approved license, and who have never had a rejected application.

WITH Persons AS (
    SELECT DR.PersonFK
    FROM DrivingLicenseRequest DR
        INNER JOIN DrivingLicense DL ON DR.LicenseID = DL.LicenseID
    UNION
    SELECT DR2.PersonFK
    FROM DrivingLicenseRequest DR2
    EXCEPT
    SELECT DR3.PersonFK
    FROM DrivingLicenseRequest DR3
        INNER JOIN RejectedDrivingLicense RD ON DR3.LicenseID = RD.LicenseID
)
SELECT PersonFK
FROM Persons
ORDER BY PersonFK;

Regarding the implementation, we can see that in the first query used to build the intermediate table Persons, we get all the people who have or have had an accepted driver's license by using an INNER JOIN between the DrivingLicenseRequest and DrivingLicense tables. This way, with the data from DrivingLicenseRequest, we can access the foreign key PersonFK that identifies the person who made the application.

Then, we make a union with a second query that retrieves people who have made an application, whether pending, accepted, or rejected. This time we get the data from the DrivingLicenseRequest table, which encompasses all existing applications in the database.

By performing the union, we get all the people who have or have had pending, accepted, or rejected applications, since the first query returns only those with an approved license – but the second returns people who have also had rejected applications.

To exclude those with rejected applications, the EXCEPT operator is used along with another query that retrieves these people with rejected applications. So they are all excluded from the final query result – or rather from the intermediate table Persons. From this table, we finally get all its tuples and order them by the attribute PersonFK – that is, by the identifier of the people we obtain.

As you can see, the set operations in this query are performed from top to bottom: first the UNION, then the EXCEPT. That is, these operators act on tables containing query results, and operators of equal precedence are applied in a top-down order (although we can use parentheses to change the order according to the needs of the query). Be aware that in standard SQL, and in DBMSs such as PostgreSQL, INTERSECT binds more tightly than UNION and EXCEPT, so when mixing it with the other operators it’s safer to add parentheses. Also, in this case, we can see that the UNION operation is redundant since everything contained in the first query is also contained in the second.

In other words, the second query retrieves people who have or have had applications of any type, whether pending, accepted, or rejected, so all the people who have had accepted applications are included in this set of people generated by the second query. In a real-world environment, we should optimize it by dispensing with the first query, which also implies the elimination of the UNION operation. But here we leave it as is to illustrate its equivalence with the optimized query we have described.
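The equivalence between the full query and its simplified form can be checked on sample data. The sketch below uses Python’s built-in sqlite3 module with invented rows; the explicit AS PersonFK alias in the first query of each CTE is an addition made here so the CTE’s column name is unambiguous.

```python
import sqlite3

# Hypothetical data: person 1 has an approved license, person 2 has one rejected
# and one approved application, and person 3 has a pending application.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DrivingLicenseRequest (LicenseID INTEGER, PersonFK INTEGER);
CREATE TABLE DrivingLicense (LicenseID INTEGER);
CREATE TABLE RejectedDrivingLicense (LicenseID INTEGER);
INSERT INTO DrivingLicenseRequest VALUES (100, 1), (101, 2), (102, 3), (103, 2);
INSERT INTO DrivingLicense VALUES (100), (103);
INSERT INTO RejectedDrivingLicense VALUES (101);
""")

full_query = """
WITH Persons AS (
    SELECT DR.PersonFK AS PersonFK
    FROM DrivingLicenseRequest DR
        INNER JOIN DrivingLicense DL ON DR.LicenseID = DL.LicenseID
    UNION
    SELECT DR2.PersonFK
    FROM DrivingLicenseRequest DR2
    EXCEPT
    SELECT DR3.PersonFK
    FROM DrivingLicenseRequest DR3
        INNER JOIN RejectedDrivingLicense RD ON DR3.LicenseID = RD.LicenseID
)
SELECT PersonFK FROM Persons ORDER BY PersonFK
"""

# Same query without the redundant first SELECT and its UNION.
simplified_query = """
WITH Persons AS (
    SELECT DR2.PersonFK AS PersonFK
    FROM DrivingLicenseRequest DR2
    EXCEPT
    SELECT DR3.PersonFK
    FROM DrivingLicenseRequest DR3
        INNER JOIN RejectedDrivingLicense RD ON DR3.LicenseID = RD.LicenseID
)
SELECT PersonFK FROM Persons ORDER BY PersonFK
"""

full_result = [row[0] for row in conn.execute(full_query)]
simplified_result = [row[0] for row in conn.execute(simplified_query)]
print(full_result, simplified_result)
```

Person 2 is excluded in both versions because of the rejected application, even though a later application was approved.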

Aggregation Queries

Next, we’ll look at a type of query that’s often used in temporal data analysis, calculating metrics, building dashboards aimed at strategic decision-making, and so on. These are aggregation queries, and they’re based on treating the tuples of a table as if they were groups on which we can perform certain operations. For example, we can use them to sum all the values of an attribute in a certain group, find their average, and calculate the maximum or minimum value, among others.

As you might guess, the basic statements to implement this type of query are GROUP BY, HAVING, and the different aggregation functions offered by SQL.

With GROUP BY, we can choose a series of attributes whose values will determine how we form groups of tuples in a particular table. This means that each of these groups will be formed by a combination of values taken by the selected attributes. Then with the aggregation functions, we will calculate a certain metric for each group.

With HAVING, we’ll impose conditions related to the characteristics of each group, mainly the value taken by the metrics we calculate on them.

To understand all this with examples, let's first consider a query where we want to get a list of all the nationalities present in the Person table and the number of people with each nationality.

SELECT Nationality, COUNT(*) AS NumPersons
FROM Person
GROUP BY Nationality
ORDER BY NumPersons DESC;

At first glance, we realize that we need to count a certain number of people for each value that the Nationality attribute takes. So, the approaches we've seen so far aren’t straightforward for performing this operation (which in some cases may not be possible without grouping).

For example, here we could list all the different nationalities that appear in Nationality and, for each one, use a subquery in the SELECT statement to count how many people in the Person table have that nationality.

But this approach would be very inefficient since, for each different nationality, we would have to go through the entire Person table looking for people with that nationality. Even though these searches can be optimized, they are generally not as efficient as the approach we’re going to follow using grouping.

Instead, we use grouping with the GROUP BY statement. Specifically, we indicate the table attributes that guide generating the tuple groups in the table. In this case, since we want to calculate a metric for each value of the Nationality attribute, we use that attribute to group the tuples. This way, for each value of that attribute, a group of tuples is generated, represented by that value, which will represent all the people with that same nationality.

Then, if we want to count how many people have that nationality, we simply count how many tuples each group has. So, in the SELECT statement, we add an extra attribute where we use COUNT(*). This time it won’t count all the tuples in the table, but those in each group. Since using GROUP BY means that every attribute in the SELECT must either be one of the attributes we are grouping by or appear inside an aggregation function, the final table will only show the distinct values of Nationality, meaning the "representative" values of each group of tuples.

For each of those values, we’ll attach the value of COUNT(*) in the same tuple of the output table, which will correspond to the number of tuples in the corresponding group. This conceptually represents the number of people with that nationality.

Finally, we can apply sorting with the ORDER BY statement – but we should keep in mind that in this case we can only sort with respect to the grouping attributes or the aggregated values, such as the ones we return in SELECT. This is because in the query, we’re creating groups represented by Nationality values, which means we can’t "return" the rest of the attributes in the SELECT as we did before.

We can only calculate metrics with them and return those – but not the attributes themselves with all their values. This is because when grouping, the resulting table necessarily contains only the "representative" values of each group and metrics of other attributes calculated from those groups (or metrics of the group itself, such as the number of tuples it has in this case).
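To see the grouping in action, here is a minimal sketch that runs the nationality query on a handful of invented Person rows using Python’s built-in sqlite3 module. The names and nationalities are purely illustrative.

```python
import sqlite3

# Hypothetical Person rows: three Spanish, two French, and one Italian person.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT, Nationality TEXT);
INSERT INTO Person VALUES
    (1, 'Ana',   'Spanish'),
    (2, 'Ben',   'French'),
    (3, 'Cleo',  'Spanish'),
    (4, 'Dario', 'Italian'),
    (5, 'Elsa',  'French'),
    (6, 'Fran',  'Spanish');
""")

# One output tuple per group: the representative Nationality value
# plus the number of tuples in that group.
per_nationality = conn.execute("""
    SELECT Nationality, COUNT(*) AS NumPersons
    FROM Person
    GROUP BY Nationality
    ORDER BY NumPersons DESC
""").fetchall()

print(per_nationality)  # [('Spanish', 3), ('French', 2), ('Italian', 1)]
```

Six input tuples collapse into three output tuples, one per distinct nationality, and attributes such as Name can no longer be returned directly because they vary within each group.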

There are various metrics we can calculate with the basic aggregation functions that SQL provides by default. Below we see a query where we get, for each possible pool status, the smallest minimum depth and the largest maximum depth of the pools with that status, as well as the average depth and the number of pools in that status.

SELECT Status,
    MIN(MinDepth) AS Shallowest,
    AVG((MinDepth + MaxDepth) / 2.0) AS AvgDepth,
    MAX(MaxDepth) AS Deepest,
    COUNT(*) AS NumPools
FROM Pool
GROUP BY Status;

The implementation of this query is very similar to the previous one, as we need to group the Pool tuples by the values of their Status attribute, which determines the status of the pools.

So in the GROUP BY clause, we only specify the Status attribute. This way, we group the tuples into as many groups as there are values present in the Status attribute in the table, and in each of these groups, we have all the tuples representing pools in that status.

So along with the information for each status, we can calculate metrics for its associated group of tuples – that is, for the pools in that status. For example, with MIN(MinDepth), we get the smallest value of the MinDepth attribute present in the group for which this metric is being calculated. In this case, it represents the smallest minimum depth of all pools in a certain status.

Similarly, with the aggregation operation MAX(MaxDepth), we get the largest maximum depth, or in other words, the largest value of the MaxDepth attribute in the corresponding group of pools. With COUNT(*), we get the number of pools in each group.

On the other hand, the average depth associated with the pools in each group is calculated with AVG((MinDepth + MaxDepth) / 2.0). First, it’s worth noting that both in the SELECT clause and in the input argument of an aggregation function like AVG(), we can perform arithmetic operations on the attributes.

For example, in this case, with (MinDepth + MaxDepth) / 2.0, we calculate the average value between the minimum and maximum depth of each pool – not of each group, but of each tuple in the group. We divide by the decimal value 2.0 rather than 2 so that, if the depths are stored as integers, the DBMS doesn’t perform an integer division and truncate the result. Then, with this value calculated for each tuple, we use the aggregation function AVG() to calculate the average of this value for each group.

That is, with (MinDepth + MaxDepth) / 2.0, we get a certain value for each tuple, and then with AVG(), we take all those calculated values for the tuples of a certain group and calculate their average. Thus, for each possible state of a pool, we obtain the average depth of all the pools in that state, first calculating the average depth of each pool and then calculating the average of these depths across all pools in a certain state.
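As a minimal sketch of this two-step calculation, the snippet below runs the same expression against an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite, the integer column types, the two sample pools, and the status value 'Open' are assumptions made only for illustration. They are not taken from the book's database.

```python
import sqlite3

# Hypothetical toy data: two pools with the same status and integer depths.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Status TEXT, "
    "MinDepth INTEGER, MaxDepth INTEGER)"
)
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?, ?, ?)",
    [(1, "Open", 1, 2), (2, "Open", 1, 4)],
)

# With 2.0 the per-tuple midpoints are 1.5 and 2.5, so their average is 2.0.
# With the integer literal 2, SQLite truncates the midpoints to 1 and 2,
# so the average of the truncated values is 1.5.
row = conn.execute(
    """
    SELECT Status,
        AVG((MinDepth + MaxDepth) / 2.0) AS AvgDepth,
        AVG((MinDepth + MaxDepth) / 2) AS AvgDepthTruncated
    FROM Pool
    GROUP BY Status
    """
).fetchone()
print(row)
```

The second column is there only to show what integer division would do to the per-tuple value before AVG() ever sees it.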

But, in addition to calculating metrics for each group of tuples, we might need to keep only those groups whose metrics meet certain conditions, depending on the query to be resolved. For example, here we consider a query where we get, for each person, the number of bike rentals they have made since records began, as long as that person has made at least three rentals.

SELECT P.PersonID,
    P.Name,
    COUNT(*) AS RentalCount
FROM Rental R
    INNER JOIN Person P ON R.PersonFK = P.PersonID
GROUP BY P.PersonID, P.Name
HAVING COUNT(*) > 2
ORDER BY RentalCount;

To implement this query, we might have considered using a correlated subquery in the SELECT statement that counts how many tuples in Rental have their foreign key PersonFK pointing to each person. But this would be inefficient, since groupings are usually much faster for this type of task.

It’s also not possible to impose a condition WHERE COUNT(*) > 2, either in the subquery of the SELECT clause or in the main query (in general, conditions on aggregation functions can’t be imposed in a WHERE clause). So in this case, we would have to use another subquery in the WHERE clause that counts the number of rentals each person has and that then checks that this number is > 2.

To avoid using subqueries and make our implementation as fast as possible, we first perform an INNER JOIN between the Rental and Person tables. We combine all their information into tuples of their Cartesian product where we have rentals and data about the person who made them. We can do this by imposing the condition in the JOIN that the foreign key PersonFK of Rental points to the PersonID identifier of its same tuple in the Cartesian product.

After performing the JOIN, we use the GROUP BY clause to group the resulting tuples by the PersonID and Name attributes of the person table. We do this because we want to calculate a metric for each person, so we have to include their identifier (primary key) in the grouping of the GROUP BY statement (meaning all the attributes that form their primary key).

Also, since we want to return each person's name along with their identifier, we can include the Name attribute in the GROUP BY. But it's important to note that the attributes we group by must uniquely identify each group of tuples that is formed.

In other words, by grouping by PersonID, we are forming groups of tuples that contain all the rentals made by a certain person, identified by a value of their primary key PersonID. This serves as the "representative" of the group of tuples.

But since this PersonID attribute is enough to identify the group, it's fine if we include more information about the person in this "representative value" of the group. So instead of containing only their primary key, it includes more information about the person, like their name.

As you can guess, if instead of grouping by {PersonID} we group by a candidate key (or rather a superkey as in this case {PersonID, Name}), we’ll get the same groups as grouping by {PersonID}. This means that the same number of groups will still be generated as there are people in the table (since with a superkey we can uniquely identify each person, and therefore each group).

Adding the Name attribute to the grouping is not an arbitrary decision – we have to use the Name attribute in the SELECT statement. When using GROUP BY, we can only return in the SELECT statement those attributes that we have used in the GROUP BY clause (so, those we have used for grouping). So to get the person's name and not just their identifier, one option is to include the attribute in the GROUP BY so we can return it in the SELECT – or in other words, use the Name attribute for grouping.

But this won’t always work. There are times when we group by an attribute A and want to return information about another attribute B, but for each value of A we have multiple tuples with different values of B. This prevents us from using B for grouping, although we can still calculate metrics on B.

On the other hand, to count how many rentals each person has made, we just need to use the aggregation function COUNT(*) after grouping by {PersonID, Name}. This forms groups of tuples from the Cartesian product where we have the same information for the same person, but each represents a different rental. By counting how many tuples each group has, we get the number of rentals made.

To get only those groups (people) who have made more than 2 rentals, we use the HAVING clause to impose that condition, since aggregation functions can’t be used in the WHERE clause. Also, we can’t use the alias given to the attribute constructed with COUNT(*) that’s returned in the SELECT in HAVING. Instead, we need to rewrite the definition of the attribute in HAVING.

That is, just like with WHERE, we can’t impose conditions on the attributes or columns of the resulting table we return by simply referring to their aliases – we have to use their definitions, as in this case with COUNT(*).
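The difference between WHERE and HAVING can be checked with a small experiment. The sketch below assumes an in-memory SQLite database driven from Python's sqlite3 module, with a simplified, hypothetical Rental table holding four rows. It tries the aggregate condition in WHERE first, then in HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (RentalID INTEGER PRIMARY KEY, PersonFK INTEGER)")
# Hypothetical data: person 1 has three rentals, person 2 has one.
conn.executemany("INSERT INTO Rental (PersonFK) VALUES (?)", [(1,), (1,), (1,), (2,)])

# An aggregate condition in WHERE is rejected by the engine.
try:
    conn.execute("SELECT PersonFK FROM Rental WHERE COUNT(*) > 2 GROUP BY PersonFK")
    where_rejected = False
except sqlite3.OperationalError:
    where_rejected = True

# The same condition in HAVING is evaluated once per group.
frequent = conn.execute(
    """
    SELECT PersonFK, COUNT(*) AS RentalCount
    FROM Rental
    GROUP BY PersonFK
    HAVING COUNT(*) > 2
    """
).fetchall()
print(where_rejected, frequent)
```

Only the group for person 1 passes the HAVING filter, since it is the only one with more than two tuples.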

It's worth noting that including the Name attribute in the GROUP BY to "return" it in the SELECT isn’t the only option we have to do this (or even to get more information about the person). We always have the option to save the query result in an intermediate table with a WITH clause and then join it with the Person table or the appropriate one.
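The WITH-clause option can be sketched as follows. This is only an illustration under assumptions: it uses an in-memory SQLite database from Python, simplified Person and Rental tables, and made-up names. The intermediate table RentalCounts is also a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Rental (
        RentalID INTEGER PRIMARY KEY,
        PersonFK INTEGER REFERENCES Person(PersonID)
    );
    INSERT INTO Person VALUES (1, 'Ana'), (2, 'Luis');
    INSERT INTO Rental (PersonFK) VALUES (1), (1), (1), (2);
    """
)

# Group only by the foreign key inside the CTE, then join the small
# intermediate result with Person to pick up any extra attributes.
rows = conn.execute(
    """
    WITH RentalCounts AS (
        SELECT R.PersonFK AS PersonID, COUNT(*) AS RentalCount
        FROM Rental R
        GROUP BY R.PersonFK
        HAVING COUNT(*) > 2
    )
    SELECT P.PersonID, P.Name, RC.RentalCount
    FROM RentalCounts RC
        INNER JOIN Person P ON P.PersonID = RC.PersonID
    ORDER BY RC.RentalCount
    """
).fetchall()
print(rows)
```

With this shape, adding more attributes of the person only means listing more columns of P in the final SELECT. The grouping itself stays untouched.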

But we have another option, as shown below, which involves grouping only by the PersonID attribute and then using correlated subqueries in the SELECT to get the rest of the information for each person.

SELECT P.PersonID,
    (SELECT Name FROM Person WHERE PersonID=P.PersonID),
    COUNT(*) AS RentalCount
FROM Rental R
    INNER JOIN Person P ON R.PersonFK = P.PersonID
GROUP BY P.PersonID
HAVING COUNT(*) > 2
ORDER BY RentalCount;

But this option isn’t optimal: a correlated subquery in the SELECT can only return one attribute, which forces us to write one subquery per attribute we want to add and is usually inefficient. Also, correlating the subqueries reduces maintainability and possibly also the clarity of the code, which are qualities worth considering.

With these queries, we have seen how we can use GROUP BY to group tuples by one attribute, or even several if we need to add more information to the resulting table from the query. Also, we’ve seen the correct way to impose conditions on expressions with aggregation functions, which is by using the HAVING clause.

But, we’re not always trying to return more information to the user every time we use multiple attributes in the GROUP BY statement. Sometimes, we need to group tuples by more than one attribute.

For example, in the query below, we get all the person-pool pairs (only those present in CityPool) that exist in the system. Then for each of them, we calculate the average duration the user has spent in that pool across all their entries.

SELECT E.PersonFK AS PersonID,
    E.PoolFK AS PoolID,
    COUNT(*) AS VisitCount,
    AVG(E.Duration) AS AvgDuration
FROM Entry E
GROUP BY E.PersonFK, E.PoolFK
ORDER BY E.PersonFK, E.PoolFK;

As you can see, we get the data from the Entry table, where we have to perform a grouping with the attributes PersonFK and PoolFK, since we need to calculate metrics for each person-pool pair. With this grouping, each pair of person-pool values is a group formed by all the tuples in Entry that represent times the person entered that pool.

In this way, with AVG(E.Duration), we calculate the average of the Duration attribute for each group (so how long, on average, a person stayed at the pool on each visit) while COUNT(*) counts the number of those entries.

Finally, it's important to note that in this query, we’re only getting the person-pool pairs that appear in the Entry table – we’re not constructing all possible pairs. So we won't find any tuple in the resulting table of the query where a person has never entered a certain pool.

If we wanted to include this information, we would need to structure the query differently, constructing all combinations of person-pool in an intermediate table and then calculating how many entries each person has in each pool in another way (either using subqueries, OUTER JOIN operations, or even more advanced functions that aren’t covered here).
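One possible way to structure such a query is sketched below, using a CROSS JOIN to build every person-pool combination and a LEFT JOIN (an OUTER JOIN) to attach the entries that exist. It assumes an in-memory SQLite database from Python and simplified, hypothetical versions of the Person, CityPool, and Entry tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY);
    CREATE TABLE CityPool (PoolID INTEGER PRIMARY KEY);
    CREATE TABLE Entry (PersonFK INTEGER, PoolFK INTEGER, Duration REAL);
    INSERT INTO Person VALUES (1), (2);
    INSERT INTO CityPool VALUES (10), (20);
    INSERT INTO Entry VALUES (1, 10, 30.0), (1, 10, 50.0), (2, 20, 40.0);
    """
)

# CROSS JOIN builds all person-pool pairs. LEFT JOIN keeps the pairs that
# have no entry, filling the Entry columns with NULL. COUNT(E.PoolFK)
# ignores those NULLs, so such pairs get a visit count of 0.
rows = conn.execute(
    """
    SELECT P.PersonID, CP.PoolID,
        COUNT(E.PoolFK) AS VisitCount,
        AVG(E.Duration) AS AvgDuration
    FROM Person P
        CROSS JOIN CityPool CP
        LEFT JOIN Entry E
            ON E.PersonFK = P.PersonID AND E.PoolFK = CP.PoolID
    GROUP BY P.PersonID, CP.PoolID
    ORDER BY P.PersonID, CP.PoolID
    """
).fetchall()
for row in rows:
    print(row)
```

Note that COUNT(*) would report 1 instead of 0 for the pairs without entries, because the LEFT JOIN still produces one tuple for each of them. The average duration for those pairs is NULL (None in Python).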

In the GROUP BY clause, we can use an arbitrary number of attributes for grouping. The query below shows how, for each cruise and route between two ports, we can get the number of times the cruise has traveled that route and the sum of the distances covered on those trips. We can also display the information of the cities where those ports are located.

SELECT V.ShipFK AS ShipID,
    V.DepartureNameFK AS DeparturePort,
    V.DepartureCityFK AS DepartureCity,
    V.ArrivalNameFK AS ArrivalPort,
    V.ArrivalCityFK AS ArrivalCity,
    COUNT(*) AS TripCount,
    SUM(V.Distance) AS TotalDistance
FROM Voyage V
GROUP BY V.ShipFK,
    V.DepartureNameFK,
    V.DepartureCityFK,
    V.ArrivalNameFK,
    V.ArrivalCityFK
ORDER BY V.ShipFK, TotalDistance DESC;

As you can see, in this case, we get all this information from the Voyage table, as it has multiple foreign keys to CruiseShip and Port, as well as City. These can help us implement this query easily.

Of all the attributes it has, we group by ShipFK, DepartureNameFK, DepartureCityFK, ArrivalNameFK, and ArrivalCityFK. This allows us to group the tuples of the Voyage table based on the combinations of values representing a cruise-route pair (where a route is considered as a pair of ports along with the city values where they are located).

The city attributes are redundant for the grouping itself, as clearly all ports belong to one and only one city (according to the domain). But if we want to know the city where the port is located, the simplest option is to include the DepartureCityFK and ArrivalCityFK attributes in the grouping so we can return them in the SELECT.

So for each cruise-route pair, we can count how many trips the cruise has made on that route using COUNT(*), and with SUM(V.Distance) we can get the sum of all distances covered on those trips (as the tuples of each group in this case are the trips the cruise makes or has made on the corresponding route).

On the other hand, in this type of query, it’s also common to use the DISTINCT modifier to count values in a group or perform a specific aggregation operation on them. For example, in the query below, we get all the people who have ever lived in a city. Then for each of them, we count how many different cities they have lived in.

SELECT R.PersonFK AS PersonID,
    COUNT(DISTINCT R.CityFK) AS NumCities
FROM Residence R
GROUP BY R.PersonFK
ORDER BY NumCities DESC;

To do this, we use the data stored in the Residence table, which has a foreign key PersonFK that can determine, in each tuple, the person associated with that residence record. Since we want to calculate a certain metric for each person who has had at least one residence, we group by the PersonFK attribute, as selecting data from the Residence table ensures that all these people have or have had at least one residence record.

Then, for each group of tuples formed, we could use COUNT(*) to count how many residences the "representative" person of each group has had. But in this case, we want to count the number of different cities they have lived in.

To do this, we will give COUNT() the CityFK attribute as the input argument, which is a foreign key that determines the city associated with a residence record. This would count all the values the CityFK attribute takes for each group, but not the distinct values. So, we’ll need to add the DISTINCT modifier before the attribute and inside the COUNT() aggregation function so that it only counts the distinct values that the CityFK attribute takes in each group. This corresponds to the number of different cities a certain person has been associated with through Residence, meaning where they have lived.
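To make the three counting variants concrete, here is a small sketch that puts them side by side. It assumes an in-memory SQLite database from Python and a simplified, hypothetical Residence table in which person 1 has lived twice in city 100 and once in city 200.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Residence (PersonFK INTEGER, CityFK INTEGER);
    INSERT INTO Residence VALUES (1, 100), (1, 100), (1, 200), (2, 300);
    """
)

# COUNT(*) counts tuples, COUNT(CityFK) counts non-NULL values,
# and COUNT(DISTINCT CityFK) counts different values only.
rows = conn.execute(
    """
    SELECT R.PersonFK,
        COUNT(*) AS NumResidences,
        COUNT(R.CityFK) AS NumCityValues,
        COUNT(DISTINCT R.CityFK) AS NumCities
    FROM Residence R
    GROUP BY R.PersonFK
    ORDER BY R.PersonFK
    """
).fetchall()
print(rows)
```

For person 1, the first two counts are 3 while the DISTINCT count is 2, which is the number of different cities.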

When using the DISTINCT modifier in aggregation functions, we may also need to apply conditions on these aggregation functions. So we’ll need to use the same DISTINCT modifier in other clauses like HAVING, in addition to SELECT where most aggregations are calculated and returned.

WITH PersonsTable AS (
    SELECT CB.PersonFK AS PersonID,
        COUNT(DISTINCT CB.PaymentMethod) AS NumPaymentMethods
    FROM CruiseBooking CB
    GROUP BY CB.PersonFK
    HAVING COUNT(DISTINCT CB.PaymentMethod) > 1
)
SELECT *
FROM PersonsTable PT
    INNER JOIN Person P USING (PersonID)
-- Sort in the outer query: row order inside a CTE isn't guaranteed to survive the JOIN.
ORDER BY NumPaymentMethods DESC;

For example, above we have a query that retrieves information about all the people who have made at least one cruise booking. It also gets the number of different payment methods they have used to pay for all those bookings, as long as that number is at least two different payment methods.

First, to implement this query, we use the CruiseBooking table and group by PersonFK – as when needing to calculate a number for each person, we should group the tuples of the table by the PersonFK attribute. This way, each group corresponds to the bookings made by a certain person.

So we can easily use COUNT(DISTINCT CB.PaymentMethod) to count how many distinct values the PaymentMethod attribute takes in each group of tuples. This corresponds to the number of different payment methods the representative person of that group of tuples has used to pay for their bookings.

Also, to require that they’ve used at least two payment methods, we use a HAVING clause where we declare that the value returned by the aggregation function COUNT(DISTINCT CB.PaymentMethod) must be >1.

We can’t use its alias name NumPaymentMethods to declare this condition (some DBMS allow it, but for portability reasons, it’s better to code it without using the alias in the condition), nor can we use a WHERE because it’s an aggregation function. The correct way is using a HAVING.

Although it may seem that the value of NumPaymentMethods is being calculated multiple times unnecessarily, internally the DBMS can optimize the query automatically to avoid this type of unnecessary calculation. But we still have to write the aggregation function multiple times: once to define the attribute NumPaymentMethods in the SELECT and another in the HAVING to impose a filtering condition on the tuples of the resulting table from a grouping.

Finally, we save the result of this grouping in an intermediate table called PersonsTable, where only the identifier of each person and their corresponding number of payment methods are stored. Later, we can use this intermediate table in a JOIN operation with the Person table. This will gather all the information of each person along with the attribute containing the number of payment methods in one table. Then this is ultimately returned as the query output to the user.

So as expected, if we use the DISTINCT modifier on an attribute in an aggregation function in the SELECT clause and want to impose a condition on it, we have to write the aggregation function in the HAVING clause exactly as it appears in the SELECT, including the DISTINCT modifier if it uses one.

So far, we have seen that we can give an aggregation function input attributes with which it will perform the corresponding aggregation operation. Also, if we need only the distinct values of a certain attribute or result of an arithmetic operation between attributes, we can add the DISTINCT modifier.

But DISTINCT is not only used for a single attribute – we can also use it to get unique combinations of values from a series of attributes, or even unique results obtained from an arithmetic operation involving multiple attributes. For example, in the query below, we want to get all the cruises along with the number of pairs of departure and arrival cities they have visited throughout their journeys.

SELECT V.ShipFK,
    COUNT(DISTINCT (V.DepartureCityFK, V.ArrivalCityFK)) AS NumRoutes
FROM Voyage V
GROUP BY V.ShipFK
ORDER BY NumRoutes DESC;

To do this, we simply group the tuples of the Voyage table by the ShipFK attribute, since we want to calculate a number for each cruise (and the ShipFK foreign key is what determines the cruise that made the journey). Thus, each group of tuples will be “represented” by a certain value of ShipFK that uniquely identifies a cruise. Those tuples, in turn, will represent all the journeys that cruise has made.

So to count how many distinct pairs of departure and arrival cities each cruise has traveled to, we can use the aggregation function COUNT(DISTINCT (V.DepartureCityFK, V.ArrivalCityFK)).

As you might guess, a cruise can make the same trip several times, which means that within the same group of tuples, we might find the same combination of values for the attributes (V.DepartureCityFK, V.ArrivalCityFK) multiple times. These attributes uniquely identify the departure and arrival cities of the trip, so if the trip is made several times, there must be several "duplicate" tuples – or at least tuples with the same values in those attributes, since there can be multiple different ports in the same city.

If we look at all the combinations of values that the attributes (V.DepartureCityFK, V.ArrivalCityFK) take in each group, we’ll see that they represent the departure and arrival cities of the cruise in each trip it has made. By applying the DISTINCT modifier, we treat each pair of values as a unit and keep only one copy of each. So the count refers to the pairs of different departure and arrival cities in the group on which this aggregation operation is calculated, ignoring all duplicate pairs (which would inflate the count artificially). Without DISTINCT, the count would instead represent the total number of trips the ship has made.
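The tuple form COUNT(DISTINCT (a, b)) may not be accepted by every DBMS. As a hedged alternative, the sketch below first removes duplicate pairs with SELECT DISTINCT in a subquery and then counts the remaining tuples per ship. It assumes an in-memory SQLite database from Python and a simplified, hypothetical Voyage table with numeric city identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Voyage (ShipFK INTEGER, DepartureCityFK INTEGER, ArrivalCityFK INTEGER);
    -- Ship 1 repeats the 1 -> 2 trip once and also sails 2 -> 1.
    INSERT INTO Voyage VALUES (1, 1, 2), (1, 1, 2), (1, 2, 1), (2, 1, 3);
    """
)

# Step 1 (subquery): keep one tuple per ship and city pair.
# Step 2 (outer query): count the surviving tuples for each ship.
routes = conn.execute(
    """
    SELECT D.ShipFK, COUNT(*) AS NumRoutes
    FROM (
        SELECT DISTINCT V.ShipFK, V.DepartureCityFK, V.ArrivalCityFK
        FROM Voyage V
    ) AS D
    GROUP BY D.ShipFK
    ORDER BY NumRoutes DESC
    """
).fetchall()

# For comparison: counting without DISTINCT gives the total number of trips.
trips = conn.execute(
    "SELECT ShipFK, COUNT(*) FROM Voyage GROUP BY ShipFK ORDER BY ShipFK"
).fetchall()
print(routes, trips)
```

Ship 1 has made three trips but only two different city pairs, which is exactly the gap between the two counts.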

Finally, when solving a query using grouping, we might need to calculate metrics using aggregation functions with the DISTINCT modifier. Because of this, we have to consider that if we need to impose a condition on that metric, we’ll need to use the aggregation function in the HAVING clause exactly as declared in the SELECT (with the same DISTINCT modifier and attributes present in the input argument of the aggregation function).

SELECT CB.PersonFK AS PersonID,
    COUNT(DISTINCT (CB.CabinNumber, CB.PaymentMethod))
FROM CruiseBooking CB
GROUP BY CB.PersonFK
HAVING COUNT(DISTINCT (CB.CabinNumber, CB.PaymentMethod)) > 1
ORDER BY count DESC;

For example, in the query above, we get the identifiers of certain people, and for each one, we calculate the number of distinct pairs composed of cabin number-payment method that the person has generated through their various CruiseBooking reservations in their name. Here, we’re only interested in those people whose number of pairs is at least two.

To implement this, we first group by the PersonFK attribute of the CruiseBooking table, as it contains all the information we need to calculate the previous metric for each person. This is why we only need to group by the foreign key PersonFK attribute. So for each group of CruiseBooking tuples, we’ll have all the reservations associated with a single person.

Then, with COUNT(DISTINCT (CB.CabinNumber, CB.PaymentMethod)), we can calculate the number of distinct combinations of values that the attributes (CB.CabinNumber, CB.PaymentMethod) take in the group of tuples. As you can see, when we want to count combinations of values from several attributes, we declare both attributes in a "tuple" (CB.CabinNumber, CB.PaymentMethod), which we provide as input to the COUNT() aggregation function. We use the DISTINCT modifier to ensure it only counts distinct combinations of these two attributes.

Later, if we want to say that the result of the aggregation function has to be greater than 1 to consider the corresponding group of tuples in constructing the query output, we use the HAVING clause. In it, we can declare the condition by rewriting the aggregation function again in the HAVING clause in the same way we declared it in the SELECT.

Note that we haven't assigned an alias to the attribute we built with the aggregation function, so the DBMS automatically assigns a default name to this additional attribute, in this case “count”. This corresponds to the name of the aggregation function used in its construction.

But if we create more attributes using the same aggregation function COUNT(), then all those attributes will be called “count” by default at the same time, creating an ambiguity problem. This is why it's essential to use aliases for attributes that are expected to be named the same way by SQL (the DBMS is responsible for assigning default aliases).

Finally, it's important to note that queries don't necessarily have to contain only one grouping. The GROUP BY clause can be used an indefinite number of times in a query, especially if it includes a subquery, which can use the GROUP BY clause again.

But it's important to remember that you can't have multiple groupings “at the same time” in the same query. If we use GROUP BY multiple times, it must be because our query consists of subqueries, with each subquery performing a grouping, avoiding doing them all at the same level. In other words, for each GROUP BY clause, there must be one and only one SELECT clause.
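A query with two groupings at different levels might look like the sketch below. The inner subquery groups rentals by person, and the outer query groups those results again by the number of rentals. The setup is an assumption for illustration only: an in-memory SQLite database from Python, a simplified Rental table, and the hypothetical subquery alias PerPerson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rental (RentalID INTEGER PRIMARY KEY, PersonFK INTEGER)")
# Hypothetical data: person 1 has three rentals, persons 2 and 3 have one each.
conn.executemany(
    "INSERT INTO Rental (PersonFK) VALUES (?)",
    [(1,), (1,), (1,), (2,), (3,)],
)

# Each GROUP BY belongs to its own SELECT: the inner one counts rentals
# per person, the outer one counts how many people share each rental count.
rows = conn.execute(
    """
    SELECT PerPerson.RentalCount, COUNT(*) AS NumPeople
    FROM (
        SELECT R.PersonFK, COUNT(*) AS RentalCount
        FROM Rental R
        GROUP BY R.PersonFK
    ) AS PerPerson
    GROUP BY PerPerson.RentalCount
    ORDER BY PerPerson.RentalCount
    """
).fetchall()
print(rows)
```

The result reads as "two people have made one rental, and one person has made three".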

Division Queries

At this point, with everything we've seen, we have enough tools to solve practically any query with SQL. (But there are some that we won't focus on here because they require operating on data recursively or hierarchically, which is more complex.)

Now, we’ll look at a series of queries where we use these previous tools to implement a relational algebra operator that doesn't have a direct implementation in SQL. For example, we have seen that the SELECT statement represents the implementation of the projection operator in relational algebra. And other statements such as certain types of JOIN or UNION, INTERSECT, and EXCEPT have direct equivalent operators in relational algebra as well.

But the division operator doesn’t have an equivalent clause or function in SQL due to its nature. In short, if we have several tables, this operator is responsible for getting all the tuples from one of the tables that are “associated” with each and every tuple of the other table.

To understand how this works, let's consider the following example:

SELECT R.PersonFK
FROM Rental R
GROUP BY R.PersonFK
HAVING COUNT(DISTINCT R.BikeFK) = (
        SELECT COUNT(*)
        FROM Bike
    );

Say we want to find the people who have rented each and every bike registered in our database. To implement this, we’ll have to use the division operator, since in the query setup we will have two tables: one with tuples containing information about a person and a bike, indicating that the person has rented the bike, and another table with all the bikes registered in the system. If we apply a division operation from relational algebra on these tables, we can find all the people who appear in enough tuples in the first table to have rented all the bikes in the second table.

This can be implemented in many ways, and here we will focus on two. The first method involves counting how many different/distinct bikes each person has rented and checking if that number matches the total number of bikes in the system (we’ll see how to do this below). If it matches, then we know that person has rented all the bikes.

To do this, we can group the tuples in the Rental table by the foreign key PersonFK attribute, since we need to calculate how many bikes each person has rented. So we form groups of tuples representing the rentals of each person who has rented at least once (people who have never rented don’t appear in PersonFK of the Rental table).

Next, using the HAVING clause, we count how many different bikes each person has rented with COUNT(DISTINCT R.BikeFK). This means that for each group of tuples, we count how many different values the BikeFK attribute takes in that group. This represents the number of different bikes rented, since BikeFK is the foreign key pointing to Bike and uniquely identifies the bike that has been rented.

Finally, we compare this number with the total number of bikes in our database, which we can get through a subquery using the aggregation function COUNT(*). Remember that we can use COUNT(*) without the GROUP BY clause being present in the subquery.

We can also approach the division of tables from a perspective closer to set theory. For example, below is the same query solved using NOT EXISTS and subqueries only:

SELECT P.PersonID
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM Bike B
        WHERE NOT EXISTS (
                SELECT *
                FROM Rental R
                WHERE R.PersonFK = P.PersonID
                    AND R.BikeFK = B.BikeID
            )
    );

Here, if a person has rented every bike in the database, then no bike exists that they haven't rented. Let’s try to translate this concept into a SQL query literally: first, with a SELECT and FROM, we can “traverse” all the tuples of Person. Then for each one, we check that there is no bike the person hasn't rented. To do this, we construct a correlated subquery that returns all bikes that person hasn’t rented.

This subquery is simply constructed by "traversing" all the tuples of Bike and checking that there is no rental record between the person and the bike being traversed. We do this with another correlated subquery that traverses the tuples of Rental and returns only those in which the person has rented the bike. If it doesn't return any tuples, then there were none with those characteristics, which tells us that the person traversed has never rented the bike traversed. In this case, we’re not interested in that person since they have not rented all the bikes in the database.

If someone really had rented them all, then the correlated subquery that traverses the tuples of Rental would always return at least one tuple. Then the correlated subquery that traverses the tuples of Bike would never return tuples. And this would satisfy the NOT EXISTS condition that we imposed on the people.

If we read the SQL code we’ve implemented "literally," we’re traversing the people, and for each one we’re checking that they’ve rented every bike. So we finally get the same people as we do with the query that uses a grouping and a count of bikes rented by each person.

If we execute either of these queries, they probably won't return any results. After all, the probability of a person having rented each and every bike registered in the database is small.

But if we want to check whether the queries work or not, we can always manually insert tuples into Person, Bike, and Rental, especially in Rental. Then there would be a person who has a tuple in Rental for each bike in the Bike table, and who would therefore be present in the result of the division operation.
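Such a check can be sketched with a handful of hand-inserted tuples. The snippet below assumes an in-memory SQLite database from Python and simplified, hypothetical versions of Person, Bike, and Rental, where only person 1 has rented both bikes. It runs the two division queries and compares their results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY);
    CREATE TABLE Bike (BikeID INTEGER PRIMARY KEY);
    CREATE TABLE Rental (PersonFK INTEGER, BikeFK INTEGER);
    INSERT INTO Person VALUES (1), (2), (3);
    INSERT INTO Bike VALUES (10), (20);
    -- Person 1 rents both bikes (bike 10 twice), person 2 only bike 10,
    -- and person 3 never rents.
    INSERT INTO Rental VALUES (1, 10), (1, 20), (1, 10), (2, 10);
    """
)

# Division by counting distinct rented bikes per person.
by_count = conn.execute(
    """
    SELECT R.PersonFK
    FROM Rental R
    GROUP BY R.PersonFK
    HAVING COUNT(DISTINCT R.BikeFK) = (SELECT COUNT(*) FROM Bike)
    """
).fetchall()

# Division with a double NOT EXISTS.
by_not_exists = conn.execute(
    """
    SELECT P.PersonID
    FROM Person P
    WHERE NOT EXISTS (
        SELECT * FROM Bike B
        WHERE NOT EXISTS (
            SELECT * FROM Rental R
            WHERE R.PersonFK = P.PersonID AND R.BikeFK = B.BikeID
        )
    )
    """
).fetchall()
print(by_count, by_not_exists)
```

Both queries return only person 1 on this data. The repeated rental of bike 10 doesn't affect either result, thanks to DISTINCT in the first query and EXISTS semantics in the second.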

Another query we can solve using a division operation from relational algebra is the one below. In it, we find all the people who have entered all the pools in a certain city (specifically the one with the CityID value of 55).

SELECT E.PersonFK
FROM Entry E
    INNER JOIN CityPool CP ON E.PoolFK = CP.PoolID
    INNER JOIN Pool P ON CP.PoolID = P.PoolID
WHERE P.CityFK = 55
GROUP BY E.PersonFK
HAVING COUNT(DISTINCT E.PoolFK) = (
        SELECT COUNT(*)
        FROM CityPool CP2
            INNER JOIN Pool P2 ON CP2.PoolID = P2.PoolID
        WHERE P2.CityFK = 55
    );

In this case, we first gather all the information from the Entry, CityPool, and Pool tables. This lets us get the information of the people who have entered the pools and the information of the city where the pool is located. So after gathering this information with the INNER JOIN operations, we impose the condition that the foreign key CityFK must reference a certain city, specifically the one identified with the value 55 in its primary key {CityID}. We do this to filter the resulting tuples from the JOINs, so that we only have those where people have entered pools in the specific city we are considering in the query.

But this condition doesn’t necessarily have to be in that WHERE clause, as there are other equally valid alternatives (using subqueries, CTEs, and so on).

Then we group the tuples by PersonFK, so that each group of tuples represents all the pool entries of a certain person (specifically entries in pools of the city identified by the value CityID=55). To find only the people who have entered all the pools in that city, we use COUNT(DISTINCT E.PoolFK) to count the number of different pools they have entered. This equals the number of distinct values taken by the foreign key PoolFK within that person's group of tuples. We then compare this number with the total number of city pools located in the city with CityID=55, which we get through a simple uncorrelated subquery.

In this subquery, we perform another INNER JOIN between CityPool and Pool to gather data on all city pools with the foreign key CityFK that determines the city they are located in. This lets us declare the condition P2.CityFK = 55 to count all the pools in that city using COUNT(*). Also, the advantage of the subquery being uncorrelated is that it only needs to be computed once, since the number of pools in that city doesn't change while the query is running.

If we try to solve the previous query using the approach closest to set theory, as we did earlier, we will end up with an implementation that mainly uses the NOT EXISTS operator and correlated subqueries.

Conceptually, we can solve the query by going through all the people in the database and checking that there’s no city pool located in the city with CityID=55 for which there is no entry associating the person with the pool. In other words, for each person, there must be an entry for every city pool in the city with CityID=55.

SELECT * 
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM CityPool CP
            INNER JOIN Pool PL ON CP.PoolID = PL.PoolID
        WHERE PL.CityFK = 55
            AND NOT EXISTS (
                SELECT *
                FROM Entry E
                WHERE E.PersonFK = P.PersonID
                    AND E.PoolFK = CP.PoolID
            )
    );

Now, when coding all this, we arrive at a query very similar to the previous one where we looked for people who had rented all the bikes. As you can see, we first go through all the tuples of Person, and for each one, we construct a correlated subquery that gets all the city pools the corresponding person has not entered.

To do this, we perform a JOIN between CityPool and Pool and impose the condition that ensures all the city pools we consider are located in the city with CityID=55. Also, we verify with another correlated subquery that there is no entry of the person in the pool we are examining.

These two approaches to the same query can differ significantly in performance. In this last implementation, we nest several subqueries, which leads to traversing many tuples, often unnecessarily. On the other hand, using grouping tends to be faster since the traversal of tuples mainly depends on how the GROUP BY operation is implemented (which usually provides adequate performance for most queries).

Besides performance, in this last query, we can easily return all the information for each person because we directly use the Person table in the implementation. In contrast, in the previous approach using GROUP BY, we only returned each person's identifier, forcing us to use CTEs to perform a JOIN with the Person table if we want to return more information besides the identifier.

So when performing a division in SQL, we should consider not only the efficiency of the implementation but also the ease of modifying the query, as well as its clarity and maintainability.
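To make the comparison concrete, here is a minimal sketch that runs both implementations on a tiny dataset. It uses Python's built-in sqlite3 module only as a harness. The table columns, the sample rows, and the aliases in the grouping query (which is reconstructed from the description above) are assumptions made for illustration.

```python
import sqlite3

# Toy schema: only the columns the two queries touch are modelled (an assumption).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person   (PersonID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Pool     (PoolID INTEGER PRIMARY KEY, CityFK INTEGER);
CREATE TABLE CityPool (PoolID INTEGER PRIMARY KEY);
CREATE TABLE Entry    (PersonFK INTEGER, PoolFK INTEGER);
INSERT INTO Person   VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
INSERT INTO Pool     VALUES (10, 55), (11, 55), (12, 7);
INSERT INTO CityPool VALUES (10), (11), (12);
-- Ana entered both pools of city 55 (one of them twice); Ben only one of them.
INSERT INTO Entry VALUES (1, 10), (1, 11), (1, 10), (2, 10), (2, 12);
""")

# Division by grouping and counting.
grouping = """
SELECT E.PersonFK
FROM Entry E
    INNER JOIN CityPool CP ON E.PoolFK = CP.PoolID
    INNER JOIN Pool P ON CP.PoolID = P.PoolID
WHERE P.CityFK = 55
GROUP BY E.PersonFK
HAVING COUNT(DISTINCT E.PoolFK) = (
        SELECT COUNT(*)
        FROM CityPool CP2
            INNER JOIN Pool P2 ON CP2.PoolID = P2.PoolID
        WHERE P2.CityFK = 55
    )
"""

# Division by double negation (NOT EXISTS ... NOT EXISTS).
double_negation = """
SELECT P.PersonID
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM CityPool CP
            INNER JOIN Pool PL ON CP.PoolID = PL.PoolID
        WHERE PL.CityFK = 55
            AND NOT EXISTS (
                SELECT *
                FROM Entry E
                WHERE E.PersonFK = P.PersonID
                    AND E.PoolFK = CP.PoolID
            )
    )
"""

by_grouping = sorted(row[0] for row in con.execute(grouping))
by_not_exists = sorted(row[0] for row in con.execute(double_negation))
print(by_grouping, by_not_exists)  # both lists contain only person 1
```

On this data both queries keep only person 1, the only one who has entered every pool of city 55. The repeated entry in pool 10 doesn't inflate the count because of the DISTINCT modifier.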

The previous two queries were formulated considering the city with CityID=55, although this is an arbitrary decision. If we want to choose an appropriate value for CityID so that the two previous queries return data, since there may be cities where no person has entered all their pools, we can use the query below. For each person and city, it gets the number of pools in that city the person has entered, along with the total number of pools in that city.

SELECT E.PersonFK,
    P.CityFK,
    COUNT(DISTINCT E.PoolFK) AS EnteredPools,
    (
        SELECT COUNT(*)
        FROM CityPool AS CP2
            INNER JOIN Pool AS P2 ON CP2.PoolID = P2.PoolID
        WHERE P2.CityFK = P.CityFK
    ) AS TotalPools
FROM Entry AS E
    INNER JOIN CityPool AS CP ON E.PoolFK = CP.PoolID
    INNER JOIN Pool AS P ON CP.PoolID = P.PoolID
GROUP BY E.PersonFK, P.CityFK
ORDER BY EnteredPools, TotalPools;

As you can see, several JOIN operations are first performed to gather all the information from Entry, CityPool, and Pool, so we can later group the resulting tuples by PersonFK and CityFK. This means grouping the tuples into groups where each represents the entries a certain person has made in the pools of a certain city. Then, with COUNT(DISTINCT E.PoolFK), we count the pools they have entered, since PoolFK is the foreign key in the Entry table that determines the pool the person has entered. Finally, with a correlated subquery in the SELECT, we get the total number of pools in the city identified by CityFK.

Finally, it's important to note that with this query, we will never get a value of 0 in the EnteredPools attribute. If a person has never entered any pool in a certain city, there will be no resulting tuples from these JOIN operations with (E.PersonFK, P.CityFK) attributes that reference both the person and the city.

This happens because no Entry tuple combines a PersonFK that references that person with a PoolFK that references a pool in that city (since they haven't visited any pool there). As a result, the JOINs produce nothing to group for that (person, city) pair.

So if we also want to include in our query's resulting table tuples with (E.PersonFK, P.CityFK) pairs of people who have never visited any pool in the city CityFK, we would need to use set operations to find those (E.PersonFK, P.CityFK) pairs and add the tuples constructed from these pairs to the final resulting table.
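Besides set operations, another possibility is to build every (person, city) pair first and then attach the entries with LEFT JOINs, so that pairs without entries survive with a count of 0. The sketch below is only one way to do it, not necessarily the approach intended above. It uses Python's sqlite3 module as a harness, and for brevity it assumes every pool is a city pool (so CityPool is omitted) and that Person and City only need their key columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY);
CREATE TABLE City   (CityID INTEGER PRIMARY KEY);
CREATE TABLE Pool   (PoolID INTEGER PRIMARY KEY, CityFK INTEGER);
CREATE TABLE Entry  (PersonFK INTEGER, PoolFK INTEGER);
INSERT INTO Person VALUES (1), (2);
INSERT INTO City   VALUES (55), (7);
INSERT INTO Pool   VALUES (10, 55), (11, 55), (12, 7);
INSERT INTO Entry  VALUES (1, 10), (1, 11), (2, 12);
""")

# CROSS JOIN produces every (person, city) pair; the LEFT JOINs keep a pair
# even when the person has no entry in that city, leaving E.PoolFK as NULL.
query = """
SELECT PE.PersonID, C.CityID,
    COUNT(DISTINCT E.PoolFK) AS EnteredPools
FROM Person PE
    CROSS JOIN City C
    LEFT JOIN Pool P ON P.CityFK = C.CityID
    LEFT JOIN Entry E ON E.PoolFK = P.PoolID AND E.PersonFK = PE.PersonID
GROUP BY PE.PersonID, C.CityID
ORDER BY PE.PersonID, C.CityID
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

Since COUNT ignores NULL values, the pairs (1, 7) and (2, 55) appear with EnteredPools equal to 0 instead of disappearing from the result.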

We can see another fundamental case we might encounter when implementing a division operation in SQL below. Here, we have a query where we get all the people who have or have had pool sanctions in all possible states.

SELECT PS.PersonFK
FROM PoolSanction PS
    INNER JOIN Sanction S ON PS.SanctionID = S.SanctionID
GROUP BY PS.PersonFK
HAVING COUNT(DISTINCT S.Status) = 3;

To implement this, we first perform a JOIN between PoolSanction and Sanction, ensuring with the first table that the sanction occurred in a pool and using the second table to get the sanction's state (which is stored in the Status attribute).

Then, as we need to count how many states each person has sanctions in, we group by the PersonFK attribute, creating groups of tuples that represent the sanctions each person has or has had. This way, we can use HAVING to require that the number of states in which a person has sanctions is equal to the total number of possible states a sanction can have.

On one hand, with COUNT(DISTINCT S.Status), we can count how many different values the Status attribute takes in each group – or in other words, the number of states of the sanctions associated with a person. And, since there are three possible states ('created', 'active', 'expired'), we simply compare the resulting count from the aggregation function with 3.

But if we use the constant 3 in the comparison and later modify the database to include more or fewer states in the sanctions, we will be forced to change that number. This makes the query not as maintainable as it could be.

So another option we have for declaring the condition in the HAVING clause is to compare the result of the aggregation function COUNT(DISTINCT S.Status) with the result of a subquery that calculates how many possible states a sanction can have.

SELECT PS.PersonFK
FROM PoolSanction PS
    INNER JOIN Sanction S ON PS.SanctionID = S.SanctionID
GROUP BY PS.PersonFK
HAVING COUNT(DISTINCT S.Status) = (SELECT COUNT(DISTINCT Status) FROM Sanction);

As you can see above, this subquery is non-correlated, as it simply counts how many distinct values the Status attribute takes in the Sanction table. But implementing the query this way has a problem: we’re assuming that in the Sanction table, specifically in the Status attribute, we can find all the possible values that Status can take. But this might not be the case: for example, if the table is empty, no distinct values can be counted in the Status attribute.

This means that this last implementation of the query only works when the Sanction table contains tuples representing sanctions where there is at least one sanction in all possible states. If we can guarantee that the database meets this condition, then the above implementation is more convenient for us because it requires no maintenance.

But this condition is not usually met, so it’s not a good practice to assume that we’ll find all possible values an attribute can take in that attribute. For example, if we think of an integer attribute, it’s clear that there don’t have to be tuples that take a different value for every possible value the attribute can have.
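The following sketch illustrates the risk on a small dataset where no sanction has reached the 'expired' state yet. It runs the SQL through Python's sqlite3 module as a harness. The SanctionStatus lookup table is hypothetical: it isn't part of the schema described here and is introduced only to show one way of storing the full domain of Status.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sanction     (SanctionID INTEGER PRIMARY KEY, Status TEXT);
CREATE TABLE PoolSanction (SanctionID INTEGER, PersonFK INTEGER);
-- Hypothetical lookup table holding the full domain of Status.
CREATE TABLE SanctionStatus (Status TEXT PRIMARY KEY);
INSERT INTO SanctionStatus VALUES ('created'), ('active'), ('expired');
-- No sanction is 'expired' yet, so the data covers only 2 of the 3 states.
INSERT INTO Sanction     VALUES (1, 'created'), (2, 'active');
INSERT INTO PoolSanction VALUES (1, 100), (2, 100);
""")

template = """
SELECT PS.PersonFK
FROM PoolSanction PS
    INNER JOIN Sanction S ON PS.SanctionID = S.SanctionID
GROUP BY PS.PersonFK
HAVING COUNT(DISTINCT S.Status) = ({total})
"""
# Total taken from the data: only counts the states that happen to appear.
from_data = template.format(total="SELECT COUNT(DISTINCT Status) FROM Sanction")
# Total taken from the (hypothetical) domain table: counts every possible state.
from_domain = template.format(total="SELECT COUNT(*) FROM SanctionStatus")

wrongly_included = con.execute(from_data).fetchall()
correct = con.execute(from_domain).fetchall()
print(wrongly_included, correct)
```

Person 100 has sanctions in only two states, yet the first variant returns them because the Sanction table itself only shows two states. The second variant compares against all three states and returns nobody.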

Another option we have is to skip grouping and use a set theory-based approach. As you can see in the implementation below, we go through all the tuples of Person, and for each one, we check that there exists a pool sanction with the status ‘created’. Also, using the logical operator AND, we require that at the same time there exists another pool sanction with the status ‘active’. Finally, we use another logical operator AND to also require that for that person there exists another pool sanction in the status ‘expired’.

SELECT p.PersonID
FROM Person p
WHERE EXISTS (
        SELECT *
        FROM PoolSanction ps
            INNER JOIN Sanction s ON ps.SanctionID = s.SanctionID
        WHERE ps.PersonFK = p.PersonID
            AND s.Status = 'created'
    )
    AND EXISTS (
        SELECT *
        FROM PoolSanction ps
            INNER JOIN Sanction s ON ps.SanctionID = s.SanctionID
        WHERE ps.PersonFK = p.PersonID
            AND s.Status = 'active'
    )
    AND EXISTS (
        SELECT *
        FROM PoolSanction ps
            INNER JOIN Sanction s ON ps.SanctionID = s.SanctionID
        WHERE ps.PersonFK = p.PersonID
            AND s.Status = 'expired'
    );

As you can see above, this implementation is equivalent to the previous one where we used groupings and counts to implement the division operation. But this approach is clearly less maintainable, though possibly easier to understand in some aspects.

For example, in this implementation, the names of the different Status values appear explicitly in the conditions we impose in each correlated subquery, which is generally not a good practice. If you want to modify the database domain, you will also have to modify these values.

Also, we’re duplicating the same code multiple times, making the query code as a whole less maintainable. This is because if the query itself needs to be modified, it’s very likely that we’ll need to make changes in all three subqueries, slowing down the management process.

So although the set theory-based approach may be impractical in certain situations, it can work for a small database like the one we're dealing with here. But, whenever possible, it's best to choose solutions that are more maintainable and require fewer changes in the future.

In this specific case, the best option would be to use the grouping approach where the number of sanction statuses for a person is compared with the total number of statuses, as changing that number in the query is easier than modifying the code of several subqueries. This also avoids having to make assumptions about the Sanction tuples.

Let’s look at another query that’s similar to the previous ones, where we can see a different way the division operator can appear. It retrieves all the people who, for every city they have lived in, have visited at least one pool located in that city.

SELECT P.PersonID
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM Residence R
        WHERE R.PersonFK = P.PersonID
            AND NOT EXISTS (
                SELECT *
                FROM Entry E
                    INNER JOIN Pool PO ON E.PoolFK = PO.PoolID
                WHERE E.PersonFK = P.PersonID
                    AND PO.CityFK = R.CityFK
            )
    );

For each person, keep them only if there is no residence of theirs that lacks a matching pool-visit in the same city. Equivalently: for every city a person has lived in (from Residence), there must be at least one record showing they visited a pool in that city.

The implementation of this approach is very similar to how we express it in natural language. On one hand, we go through the tuples of Person with a SELECT and a FROM, and we set the condition that the result of a subquery is empty using NOT EXISTS.

In this correlated subquery, we go through the tuples of Residence for the person we are currently checking the condition for, so to keep only the residences we are interested in, we impose the condition R.PersonFK = P.PersonID in the subquery. This ensures that the selected Residence tuples have their foreign key PersonFK pointing to the person we are going through, whose identification is given by P.PersonID.

On the other hand, within this subquery, we also check that another correlated and nested subquery doesn’t return any tuples either. This last subquery is dedicated to getting all the entries where the person identified by P.PersonID has entered a pool located in the city identified by R.CityFK – that is, the city of the residence we are going through at the time of executing this subquery.

In summary, in this query, we have seen that divisions don’t always refer to situations where the tuples we want to obtain are "associated" with all the tuples of another table. Instead, as in this case, they can also refer to the output tuples of our query needing to meet a certain condition in relation to all the tuples of another table.
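To see how this behaves in practice, the sketch below runs the query on a few hand-made rows through Python's sqlite3 module (used only as a harness). The columns and the sample data are assumptions. Note the third person, who has no recorded residence at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person    (PersonID INTEGER PRIMARY KEY);
CREATE TABLE Residence (PersonFK INTEGER, CityFK INTEGER);
CREATE TABLE Pool      (PoolID INTEGER PRIMARY KEY, CityFK INTEGER);
CREATE TABLE Entry     (PersonFK INTEGER, PoolFK INTEGER);
INSERT INTO Person VALUES (1), (2), (3);
INSERT INTO Pool   VALUES (10, 55), (12, 7);
-- Person 1 lived in cities 55 and 7 and visited a pool in each.
-- Person 2 lived in cities 55 and 7 but only visited a pool in 55.
-- Person 3 has no recorded residence.
INSERT INTO Residence VALUES (1, 55), (1, 7), (2, 55), (2, 7);
INSERT INTO Entry     VALUES (1, 10), (1, 12), (2, 10);
""")

query = """
SELECT P.PersonID
FROM Person P
WHERE NOT EXISTS (
        SELECT *
        FROM Residence R
        WHERE R.PersonFK = P.PersonID
            AND NOT EXISTS (
                SELECT *
                FROM Entry E
                    INNER JOIN Pool PO ON E.PoolFK = PO.PoolID
                WHERE E.PersonFK = P.PersonID
                    AND PO.CityFK = R.CityFK
            )
    )
"""
result = sorted(row[0] for row in con.execute(query))
print(result)
```

Person 3 is returned together with person 1: with no residences to check, the condition "for every city they have lived in" holds trivially. If that isn't the desired behavior, an extra EXISTS condition on Residence would be needed.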

Similar to the previous query, we can consider another one where we need to find people who have or have had at least one travel booking in all existing cruise classes.

SELECT CB.PersonFK
FROM CruiseBooking CB
    INNER JOIN CruiseShip CS ON CB.ShipFK = CS.ShipID
GROUP BY CB.PersonFK
HAVING COUNT(DISTINCT CS.Class) = (
        SELECT COUNT(DISTINCT Class)
        FROM CruiseShip
    );

In this case, we start by setting up the division operation through grouping and counting. First, we perform an INNER JOIN between the CruiseBooking and CruiseShip tables. This allows us to gather information about the person who made each travel booking using the foreign key PersonFK from CruiseBooking and the information about the cruise class for the trip. This same table has a foreign key ShipFK that uniquely identifies the cruise ship for the trip, from which we can determine its class.

So after this operation, we group by the PersonFK attribute, as we’ll need to count how many different cruise classes each person has booked to perform the division.

Regarding this quantity, we calculate it using the aggregation function COUNT(DISTINCT CS.Class), which is executed once for each group of tuples. Then we compare it with the total number of cruise classes in our database.

In this case, we could have directly written the number instead of using an uncorrelated subquery to get the total number of classes by looking at the distinct values of the Class attribute from the CruiseShip table. So as it stands, with the subquery, we’re implicitly assuming that the CruiseShip table contains cruises in all existing classes (but this may not be the case).

Imagine if the table is empty, for example – the subquery would result in a total of 0 cruise classes, when in reality, there may be more (the domain of the Class attribute may contain more values than those actually appearing in the table).

But it’s important to clarify here that by “all cruise classes” we mean all possible values that the Class attribute can take – that is, the values we define as the domain of the attribute. On the other hand, in some circumstances, we can assume that all cruise classes correspond to the distinct values that the Class attribute takes in the CruiseShip table, all depending on the domain we are working with.

For simplicity, from now on in this domain, we’ll assume that the distinct values of an attribute like Class found in its corresponding table are equivalent to all the values it can take. If we think about it, this makes sense because if there are only 2 distinct values in the Class attribute of the CruiseShip table, then all the bookings made throughout the time the database has existed will have to reference some cruise in the CruiseShip table (whose Class will have one of those two values). There might be no bookings referencing cruises of a certain class, but if there are cruises of two classes, then it makes sense to assume that those two classes make up “all the classes the Class attribute can hold.”

So, by assuming that the Class attribute of the CruiseShip table contains "all the classes" of cruises, we can solve the query using a set theory approach as shown in the query below.

Here, we first go through all the tuples in CruiseBooking that represent bookings. In each one, we check that there is no cruise (or rather, no cruise class) for which no booking has been made by the person referenced by the foreign key PersonFK of the CruiseBooking tuple we are examining.

SELECT DISTINCT CB.PersonFK
FROM CruiseBooking CB
WHERE NOT EXISTS (
        SELECT *
        FROM CruiseShip C
        WHERE NOT EXISTS (
                SELECT *
                FROM CruiseBooking CB2
                    INNER JOIN CruiseShip CS2 ON CB2.ShipFK = CS2.ShipID
                WHERE CB2.PersonFK = CB.PersonFK
                    AND CS2.Class = C.Class
            )
    );

That is, since we are only interested in getting the people who have ever booked, we start the query by going through CruiseBooking, not Person, because there may be people in Person who have never booked.

So to check that there is no cruise with these characteristics, we use the NOT EXISTS operator and a correlated subquery in which we go through all the cruises registered in the CruiseShip table. For each one, we check that there is no booking made by the person we are currently examining on a cruise whose class matches the class of the cruise we are going through.

By doing this, we ensure that for all the people returned by our query, there is no cruise of any class that hasn’t been booked at least once by that person. But we can do this correctly only if we are sure that the Class attribute in CruiseShip includes what we consider as “all possible cruise classes”. If we didn't have this assurance, then this set theory approach would not be correct, because the correlated subquery that goes through the tuples of CruiseShip might not be covering all possible cruise classes.

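For example, imagine the CruiseShip table is empty while CruiseBooking still contains tuples (something a foreign key constraint on ShipFK would normally prevent). In that case, this approach would return more people than it should, since that subquery would never return tuples.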

On the contrary, in the other approach based on groupings, if the CruiseShip table is empty, then the uncorrelated subquery that counts the total number of classes would return 0. Also, the HAVING condition would never be met, preventing the return of people who do not meet the condition defined in the query statement.
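The difference between the two approaches on an empty CruiseShip table can be checked with a small sketch. It uses Python's sqlite3 module as a harness and deliberately declares no foreign key, which is an assumption needed so that a booking can exist without its ship.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- No foreign key is declared, so a booking may reference a missing ship.
CREATE TABLE CruiseShip    (ShipID INTEGER PRIMARY KEY, Class TEXT);
CREATE TABLE CruiseBooking (PersonFK INTEGER, ShipFK INTEGER);
INSERT INTO CruiseBooking VALUES (1, 900);
""")

set_based = """
SELECT DISTINCT CB.PersonFK
FROM CruiseBooking CB
WHERE NOT EXISTS (
        SELECT *
        FROM CruiseShip C
        WHERE NOT EXISTS (
                SELECT *
                FROM CruiseBooking CB2
                    INNER JOIN CruiseShip CS2 ON CB2.ShipFK = CS2.ShipID
                WHERE CB2.PersonFK = CB.PersonFK
                    AND CS2.Class = C.Class
            )
    )
"""
grouped = """
SELECT CB.PersonFK
FROM CruiseBooking CB
    INNER JOIN CruiseShip CS ON CB.ShipFK = CS.ShipID
GROUP BY CB.PersonFK
HAVING COUNT(DISTINCT CS.Class) = (SELECT COUNT(DISTINCT Class) FROM CruiseShip)
"""
set_result = con.execute(set_based).fetchall()
grouped_result = con.execute(grouped).fetchall()
print(set_result, grouped_result)
```

The set-based query returns person 1, because with no cruises to check the double negation is trivially satisfied. The grouping query returns nothing, since the JOIN with the empty CruiseShip table leaves no tuples to group.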

So as you can see, it’s not always better to use just one approach based on either groupings or set theory – it varies depending on the situation.

In this specific case, it’s more practical to use groupings – mainly for efficiency (since internally a grouping is usually faster than executing a correlated subquery multiple times) but also for simplicity of maintenance and code clarity.

To wrap up our discussion on the division operator of relational algebra, it’s important to note that there are times when we have to do division using intermediate tables (CTEs) instead of tables from the database itself.

For example, in the query below, we obtain the ships that have made at least one trip departing from and arriving at each pair of cities whose areas are both less than 11 km². In other words, all ships that have made at least one trip between each pair of cities with this characteristic.

WITH AllPairs AS (
    SELECT C1.CityID AS Dep, C2.CityID AS Arr
    FROM City C1 CROSS JOIN City C2
    WHERE C1.CityID <> C2.CityID AND C1.Area<11 AND C2.Area<11
),
ShipVisits AS (
    SELECT V.ShipFK,
        V.DepartureCityFK AS Dep,
        V.ArrivalCityFK AS Arr
    FROM Voyage V
        INNER JOIN City C1 ON V.DepartureCityFK = C1.CityID
        INNER JOIN City C2 ON V.ArrivalCityFK = C2.CityID
    WHERE C1.Area<11 AND C2.Area<11
)
SELECT SV.ShipFK
FROM ShipVisits SV
GROUP BY SV.ShipFK
HAVING COUNT(DISTINCT (SV.Dep, SV.Arr)) = (
        SELECT COUNT(*)
        FROM AllPairs
    );

As you can see in the implementation, to make the code simpler, we can first build an intermediate table with all possible pairs of cities with an area value <11. We can do this by executing a CROSS JOIN between the City table and itself, as it contains all the cities registered in our database. Then we require that both cities in the pair have an area <11.

It’s important to note that the Area attribute of the City table contains values representing square kilometers, so it’s straightforward to declare the <11 condition in our query. But if this attribute had values in other units, we’d need to adapt to them or convert them to other units that we could easily work with. It’s crucial to consider the units in which the values we compare are measured to correctly code the query.

Finally, this table includes all possible ordered pairs of cities that meet the area characteristic. This means the same two cities appear both as (A,B) and as (B,A), since a trip from A to B is not the same as a trip from B to A.

Then, we build another intermediate table where we store the different pairs of cities each cruise has visited throughout all its trips, considering only those cities that meet the query conditions (area <11).

To do this, we simply extract the foreign key attributes ShipFK, DepartureCityFK, and ArrivalCityFK. These determine the cruise that made the trip and the departure and arrival cities of the trip from the resulting table of the INNER JOIN operations between the Voyage table and the City table itself.

We perform these operations to access the area information of each city, allowing us to impose the same area conditions as in the first intermediate table AllPairs. If we didn't do this, we might consider cruise trips between cities that don't meet the conditions we are looking for. This would increase the number of “valid” city pairs the cruise has traveled between. Since we’re going to structure the query using a grouping and a count, it’s essential not to count irrelevant elements for our query.

Once both CTEs are built, we perform a grouping on the ShipVisits table based on the ShipFK attribute. We do this to calculate, for each cruise, the number of distinct pairs of cities it has traveled between. We easily calculate this using the aggregation function COUNT(DISTINCT (SV.Dep, SV.Arr)). Then we can compare the returned value with the total number of city pairs that exist and that we have stored in the first CTE called AllPairs, all within the HAVING clause.

To keep only those cruises that have traveled through every pair of cities calculated in AllPairs, we compare the output of the COUNT() aggregation function with the result of an uncorrelated subquery that simply counts how many tuples the intermediate table AllPairs has.

In the total count of pairs, we don’t have to use the DISTINCT modifier, since the CROSS JOIN never generates repeated city pairs given the very definition of the cross product operation. And there are no identical tuples in the City table, meaning there are no identical cities in our database (much less with the same value of their primary key CityID). But if we wanted to use the DISTINCT modifier to count how many distinct tuples are in AllPairs, we could use the syntax COUNT(DISTINCT AllPairs.*).

Regarding this last subquery, we could have avoided explicitly constructing all the city pairs in AllPairs if we had directly performed the same computation as in AllPairs – but returning only COUNT(*). This would directly count all the city pairs with the characteristics we are looking for. But we can only do this if we code the query using grouping and counting, as we’ll see that it can also be implemented based on set operations, for which we’ll necessarily need to construct and store the pairs in AllPairs.

So just as we have shown with other queries, we can also approach this one using set theory operators. As you can see below, the intermediate tables are constructed in the same way except for ShipVisits, where we don't need the cities involved in the trips to meet the condition of having an area <11.

This is because those ShipVisits tuples will later be compared with the city pairs in AllPairs, which do meet the condition. This way, we end up with cruises that have made a trip in all the pairs of AllPairs, regardless of additional tuples in ShipVisits with trips between cities that don't meet the condition we're looking for.

Although this isn't crucial for resolving the query, it's important to note that ShipVisits contains more tuples than necessary, which might slow down the query since ShipVisits will later be used in a correlated subquery, resulting in multiple scans of its tuples.

WITH AllPairs AS (
    SELECT C1.CityID AS Dep, C2.CityID AS Arr
    FROM City C1 CROSS JOIN City C2
    WHERE C1.CityID <> C2.CityID AND C1.Area<11 AND C2.Area<11
),
ShipVisits AS (
    SELECT V.ShipFK,
        V.DepartureCityFK AS Dep,
        V.ArrivalCityFK AS Arr
    FROM Voyage V
)
SELECT DISTINCT SV.ShipFK
FROM ShipVisits SV
WHERE NOT EXISTS (
        SELECT *
        FROM AllPairs AP
        WHERE NOT EXISTS (
                SELECT *
                FROM ShipVisits SV2
                WHERE SV2.ShipFK = SV.ShipFK
                    AND SV2.Dep = AP.Dep
                    AND SV2.Arr = AP.Arr
            )
    );

After constructing the CTEs, we solve the query in a way similar to the other divisions we've seen. First, we go through the tuples of ShipVisits (although we could also choose to go through those of CruiseShip, since what we want is to go through all the cruises in the database, or at least those that have made a trip). So instead of using CruiseShip, which might contain cruises that have never made a trip, we choose to go through the tuples of ShipVisits, where we can find cruises referenced by the foreign key ShipFK from the Voyage table, which we know have made at least one trip.

In each of these tuples, we check that there is no pair of cities from AllPairs for which there is no trip made by the cruise we are currently going through between the cities of that pair.

To do this, we use the NOT EXISTS operator and two nested correlated subqueries. In the first, we go through the tuples of AllPairs – that is, the pairs of cities that do meet the condition of having an area <11. Then for each pair, we use NOT EXISTS again on another correlated subquery that gets all the trips made by the cruise currently being processed in the query execution over the cities of the corresponding pair from AllPairs.

In a more intuitive way, we’re getting all the cruises for which there is no pair of cities from AllPairs where the cruise hasn't traveled at least once. As you can guess, since the cities in AllPairs do meet the condition of having an area less than 11 km², it doesn't matter that ShipVisits has trips with cities that don't meet this condition – because in the query we check that for a certain pair of cities from AllPairs there is no trip of a cruise in those cities. So it’s really indifferent which cities are present in the trips of ShipVisits, as those that meet the condition will definitely be there since we don't impose any condition when constructing that intermediate table.
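Both ways of implementing this division can be compared on a handful of rows. The sketch below uses Python's sqlite3 module as a harness, and it departs from the queries in the text in two labeled ways. First, the row-value form COUNT(DISTINCT (Dep, Arr)) is replaced by a SELECT DISTINCT in a derived table followed by COUNT(*), a rewrite chosen for portability. Second, the relevant trips are filtered by joining Voyage with AllPairs instead of joining City twice. The columns and the sample data are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE City   (CityID INTEGER PRIMARY KEY, Area REAL);
CREATE TABLE Voyage (ShipFK INTEGER, DepartureCityFK INTEGER, ArrivalCityFK INTEGER);
-- Cities 1 and 2 are small (Area < 11); city 3 is not.
INSERT INTO City VALUES (1, 5.0), (2, 8.0), (3, 40.0);
-- Ship 100 covers both small-city pairs (1,2) and (2,1); ship 200 covers one.
INSERT INTO Voyage VALUES (100, 1, 2), (100, 2, 1), (100, 1, 3),
                          (200, 1, 2), (200, 1, 2);
""")

all_pairs = """
WITH AllPairs AS (
    SELECT C1.CityID AS Dep, C2.CityID AS Arr
    FROM City C1 CROSS JOIN City C2
    WHERE C1.CityID <> C2.CityID AND C1.Area < 11 AND C2.Area < 11
)
"""
grouped = all_pairs + """
SELECT D.ShipFK
FROM (
        -- One row per distinct (ship, pair) that also appears in AllPairs.
        SELECT DISTINCT V.ShipFK, V.DepartureCityFK AS Dep, V.ArrivalCityFK AS Arr
        FROM Voyage V
            INNER JOIN AllPairs AP
                ON AP.Dep = V.DepartureCityFK AND AP.Arr = V.ArrivalCityFK
    ) D
GROUP BY D.ShipFK
HAVING COUNT(*) = (SELECT COUNT(*) FROM AllPairs)
"""
set_based = all_pairs + """
SELECT DISTINCT V.ShipFK
FROM Voyage V
WHERE NOT EXISTS (
        SELECT *
        FROM AllPairs AP
        WHERE NOT EXISTS (
                SELECT *
                FROM Voyage V2
                WHERE V2.ShipFK = V.ShipFK
                    AND V2.DepartureCityFK = AP.Dep
                    AND V2.ArrivalCityFK = AP.Arr
            )
    )
"""
ships_grouped = [row[0] for row in con.execute(grouped)]
ships_set_based = [row[0] for row in con.execute(set_based)]
print(ships_grouped, ships_set_based)
```

Only ship 100 has traveled between every pair in AllPairs, so both queries return it alone. Its trip to city 3 and the repeated trip of ship 200 don't affect either result.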

In summary, with this approach, we can solve the query just as we did before using groupings and counts. But the difference here is that we can omit the conditions (area <11) that we imposed when constructing the tuples of ShipVisits.

At first glance, this might seem like an improvement in code clarity, as it’s shorter. This makes it more maintainable in this case because fewer operations and statements are needed to construct the CTE. But the resulting CTE contains more tuples, specifically those that represent all the trips each cruise has made, not just those made between cities that meet the condition of having an area <11.

This additional number of tuples impacts the computation of the query. But to analyze this impact, we must also consider that constructing ShipVisits this way saves two costly JOIN operations, as well as the amount of data the query is expected to run on.

For example, if the amount of data in the involved tables is small, the performance difference won’t be significantly noticeable. But if it’s large, it’s more beneficial to have the smallest possible number of tuples in ShipVisits, even if it requires performing an additional JOIN.

This is because the correlated subquery that goes through the tuples of ShipVisits is executed once for each tuple of AllPairs, and all of this is executed once for each tuple of ShipVisits (we could have replaced this last one with CruiseShip to improve performance, as the number of cruises is fixed and tends to be smaller than the number of trips).

So the computation involved in going through all the tuples of ShipVisits is much greater than the computation of a simple JOIN used to construct the CTE itself – which, despite being computationally costly, only needs to be executed once (not multiple times depending on the number of tuples in other tables).

To finish with the division operation, we've seen that we can implement it in SQL using the EXISTS operator (either as is or negated with the logical NOT operator) and a correlated subquery. In it, the SELECT statement uses the * notation to return all the attributes of the corresponding table. This means that to check if the subquery returns any tuple or not, we construct its result so that each tuple possibly has multiple attributes – meaning all those that result from using the SELECT * notation. But sometimes instead of returning several attributes, it simply returns a column with a fixed value like the integer 1.

In general, using the SELECT * notation in a correlated subquery to which the EXISTS operator is applied is considered good practice, so it’s coded this way by default. But there are also other possibilities like SELECT 1, which at first glance might seem more efficient because it doesn't return unnecessary attributes since it only checks if the subquery results in any tuple or not.

In summary, the decision on which attributes to return in a correlated subquery using the EXISTS operator is mainly determined by the characteristics of the DBMS, as each implementation of the DBMS handles these operations differently at the physical level.
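As a quick check that the choice of SELECT list doesn't change the result of an EXISTS test, the sketch below runs the same query with SELECT * and with SELECT 1. It uses Python's sqlite3 module as a harness with assumed toy tables, and it says nothing about performance, which depends on the DBMS.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY);
CREATE TABLE Entry  (PersonFK INTEGER, PoolFK INTEGER);
INSERT INTO Person VALUES (1), (2);
INSERT INTO Entry  VALUES (1, 10), (1, 11);
""")

# The same EXISTS test, parameterized only by what the subquery selects.
template = """
SELECT P.PersonID
FROM Person P
WHERE EXISTS (
        SELECT {select_list}
        FROM Entry E
        WHERE E.PersonFK = P.PersonID
    )
"""
with_star = con.execute(template.format(select_list="*")).fetchall()
with_one = con.execute(template.format(select_list="1")).fetchall()
print(with_star, with_one)
```

Both variants return only person 1, the only one with entries, because EXISTS only cares about whether the subquery produces at least one tuple.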

Ranking Queries

To conclude with the different "types" of queries we might encounter, there are queries where we need to calculate a ranking – that is, ordering elements based on the value they have for a certain metric. For example, ordering people by the number of bike rentals they have made, allowing us to find out who has made the most or fewest rentals, among many other similar tasks.

In this case, these approaches don’t have any equivalent operator in relational algebra. This is because the calculation of rankings is based on the combination of multiple techniques and tools like groupings, aggregations, or uncorrelated subqueries that aren’t present in relational algebra as specific operators.

This is mainly because in relational algebra, there is no concept of order, and since tables are treated as sets of tuples, there is no unique way to number the tuples positionally to establish an internal order of the set. In other words, within a set, its elements don’t necessarily have an order among them unless we explicitly define it.

We can start by finding the maximum value of whatever we’re ranking for (and, optionally, where in the table this occurs). For example, in the query below, we get the maximum passenger capacity among all the cruise ships in the CruiseShip table.

SELECT MAX(PassengerCapacity) AS MaxCapacity
FROM CruiseShip;

In terms of approach, solving this query involves establishing a ranking of the cruise ships based on their passenger capacity (this is their metric). The one with the highest capacity occupies the first place in the ranking, followed by the other cruise ships. So if we take the first in the ranking and access its metric, we will have the maximum passenger capacity, which is what we want to obtain.

In SQL, implementing this query is very simple if we only want to get the metric value and its values are already calculated in an attribute. As you can see, we simply use the MAX() aggregation function, which we give the attribute where the metric values are calculated as an input argument. Finally, when we execute the query, we will see that only one tuple is returned with that maximum value in the attribute we have named with the alias MaxCapacity.

But the implementation is not always that simple. For example, if we want to get not only the maximum value of the metric but also the specific element associated with that metric – in this case, the cruise ship with the highest passenger capacity – we first need to go through the tuples in CruiseShip and check each one to see if it corresponds to the cruise ship with the highest passenger capacity.

Specifically, what we check in each tuple is whether its passenger capacity is equal to the maximum or not, so that we only keep those tuples where the PassengerCapacity value is exactly equal to the maximum value of that attribute.

SELECT ShipID, PassengerCapacity
FROM CruiseShip
WHERE PassengerCapacity = (
    SELECT MAX(PassengerCapacity)
    FROM CruiseShip
);

This is reflected in the query code above. In the WHERE clause, we check that the PassengerCapacity attribute value is equal to the result of the uncorrelated subquery that returns its maximum value. We use the MAX() aggregation function for this just like before.

If the values match, we will have found the tuple of the cruise ship with the highest passenger capacity. But there may be several cruise ships with that same capacity, so our query will return them as well.

If we want to get only one cruise ship, we have the option to add an additional clause LIMIT 1 at the very end of the query, which basically returns only the first tuple of the resulting table from the query.

The LIMIT clause is not part of the SQL-92 standard, but we can use it in any query as long as the DBMS supports it. PostgreSQL, MySQL, and SQLite do, while some other systems use a different syntax for the same purpose, such as FETCH FIRST or TOP. Its use is simple: we give it the number of tuples we want to keep, counted from the top of the resulting table, and the rest are ignored. Keep in mind that without an ORDER BY clause, which of the tied tuples ends up at the top is not guaranteed.
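
As a rough illustration of the tie behavior and of LIMIT 1, here is a minimal sketch that runs the query against an in-memory SQLite database through Python's sqlite3 module. The use of SQLite instead of PostgreSQL, the reduced two-column CruiseShip table, and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical sample rows; only the two columns used by the queries are modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CruiseShip (ShipID INTEGER PRIMARY KEY, PassengerCapacity INTEGER)"
)
conn.executemany(
    "INSERT INTO CruiseShip VALUES (?, ?)",
    [(1, 2500), (2, 4000), (3, 4000), (4, 1800)],
)

MAX_SHIPS = """
    SELECT ShipID, PassengerCapacity
    FROM CruiseShip
    WHERE PassengerCapacity = (SELECT MAX(PassengerCapacity) FROM CruiseShip)
    ORDER BY ShipID
"""

# Without LIMIT, every ship tied at the maximum is returned.
all_top_ships = conn.execute(MAX_SHIPS).fetchall()

# With LIMIT 1, only the first row survives; ORDER BY makes "first" deterministic.
one_top_ship = conn.execute(MAX_SHIPS + " LIMIT 1").fetchall()

print(all_top_ships)
print(one_top_ship)
conn.close()
```

Ships 2 and 3 share the maximum capacity here, so the first query returns both and the second keeps only the one that sorts first.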

Another option we have is to do without the MAX() aggregation function. As you can see below, most of the query code is the same, except for the subquery. Instead of returning the maximum value of the PassengerCapacity attribute, it returns the attribute itself – meaning all the values in its corresponding column in the CruiseShip table.

SELECT ShipID, PassengerCapacity
FROM CruiseShip
WHERE PassengerCapacity >= ALL (
    SELECT PassengerCapacity
    FROM CruiseShip
);

In this way, the condition of the WHERE clause uses the operator >= along with the ALL modifier, which establishes that, for a certain tuple of CruiseShip, its PassengerCapacity value must be greater than or equal to all the values returned by the subquery. Or put another way, our query retrieves information on all cruise ships whose passenger capacity is greater than or equal to each and every capacity stored in the PassengerCapacity attribute of the CruiseShip table.

Specifically, here we have to use the operator >=, not >, because if we are going through the tuple of a cruise ship that does have the highest passenger capacity, its capacity will at most be equal to the maximum capacity of the CruiseShip table (but never greater). That is, if the maximum is a value X, then there will be no cruise ship with a capacity >X, but there will be one or more with a capacity =X, which are the ones we want to find.

At the same time, these have a capacity X that is greater than the rest of the capacities of the other cruise ships, which is why we use the operator >=.

The ALL modifier is necessary to ensure that the value of the PassengerCapacity attribute meets the condition imposed by the >= operator with respect to each and every tuple returned by the subquery. In this case, it only returns tuples with an attribute or column with the values of all the passenger capacities that it must be compared with.

As you can guess, even though this way of implementing the query is equivalent to the previous one, here the subquery returns a series of values that are compared for each tuple of CruiseShip. That is, for each cruise ship, all the tuples of the subquery are traversed to perform the comparison declared in the WHERE clause. This requires much more computation than simply comparing with a number like the maximum capacity obtained from the MAX() function.

So even though the subquery is non-correlated and is computed only once, the previous approach with MAX() is more efficient: this one uses more space to store the subquery tuples and more time traversing them to make the comparisons.
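
To make the cost difference concrete, here is a small pure-Python sketch that mimics a naive evaluation of both conditions over a hypothetical list of capacities and counts the comparisons. It is only a mental model: a real DBMS optimizer may evaluate either query more cleverly, and the sample values are invented.

```python
capacities = [2500, 4000, 4000, 1800]  # hypothetical PassengerCapacity column

# ">= ALL (subquery)": each row is checked against the subquery values,
# stopping at the first value that is larger.
comparisons_all = 0
winners_all = []
for cap in capacities:
    is_max = True
    for other in capacities:
        comparisons_all += 1
        if not cap >= other:
            is_max = False
            break
    if is_max:
        winners_all.append(cap)

# "= (SELECT MAX(...))": one pass to find the maximum, one pass to filter.
maximum = max(capacities)
comparisons_max = len(capacities) - 1  # comparisons needed to find the maximum
winners_max = []
for cap in capacities:
    comparisons_max += 1
    if cap == maximum:
        winners_max.append(cap)

print(winners_all, comparisons_all)
print(winners_max, comparisons_max)
```

Both strategies select the same rows, but the nested loop grows roughly with the square of the table size while the MAX() version grows linearly.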

Continuing with queries where we need to calculate a maximum, here we have another one where we get the person (or people) who have had the most residences in cities, along with that maximum number of residences.

SELECT R.PersonFK AS PersonID, COUNT(*) AS NumResidences
FROM Residence AS R
GROUP BY R.PersonFK
HAVING COUNT(*) >= ALL (
        SELECT COUNT(*)
        FROM Residence
        GROUP BY PersonFK
    );

We can find the information to solve this query in the Residence table – specifically in the tuples themselves, where each one represents a residence. The person is referenced by the foreign key PersonFK and the city where the person has lived is referenced by the foreign key CityFK.

So, in this table, we don't have a number in an attribute that tells us the number of residences a person has had. Instead, the tuples themselves represent the residences of the people, and we need to count them to know which person has or has had the most residences.

To do this, we can group the tuples in Residence by the attribute PersonFK, since we need to count residences for each person. In this way, we form groups of tuples that represent all the residences a person has had.

Once the groups are made, we can use COUNT(*) to count how many residences the "representative" person of that group of tuples has or has had. Then, to ensure that this number is the maximum, we use the operator >= along with the ALL modifier and a subquery.

In this case, the subquery calculates, for each person, the total number of residences they have or have had in the same way as in the main query, using a grouping by the PersonFK attribute of Residence and the aggregation function COUNT(*).

With this, we can verify, in the HAVING clause, that the number of residences of a certain person is greater than or equal to all the numbers of residences that all the people present in the Residence table have or have had.

On the other hand, we could try to implement the query without using the >= operator and the ALL modifier, and instead use only a non-correlated subquery and the aggregation function MAX().

SELECT
  R.PersonFK   AS PersonID,
  COUNT(*)     AS NumResidences
FROM Residence AS R
GROUP BY R.PersonFK
HAVING COUNT(*) = (
  SELECT MAX(COUNT(*))
  FROM Residence
  GROUP BY PersonFK
);

As you can see above, the query construction is very similar, except that in the HAVING clause, we directly compare COUNT(*), which returns the number of residences a person has or has had, with the result of the subquery, which seems to obtain the maximum number of residences any person has had.

But if we look at the SELECT clause of the subquery, it contains a nested aggregation function, MAX(COUNT(*)), intended to calculate the maximum of the numbers of residences people have had. This is not allowed in standard SQL. In fact, if we run the query in PostgreSQL, the DBMS will give us an error because an aggregation function can't be used as an input argument to another aggregation function.

If we really want to use the aggregation function MAX() to solve the query, we first need to compute the counts in an intermediate table, for example a CTE, where we store all the people who have ever had a residence and their respective number of residences.

You can see this in the code below, and it’s very similar to the approach we followed before to solve the query. This involves grouping the residence tuples by their foreign key attribute PersonFK and using COUNT(*) to count how many tuples each group has, that is, how many residences each person has.

WITH ResCount AS (
    SELECT PersonFK AS PersonID, COUNT(*) AS NumResidences
    FROM Residence
    GROUP BY PersonFK
)
SELECT RC.PersonID,
    RC.NumResidences
FROM ResCount RC
WHERE RC.NumResidences = (
        SELECT MAX(NumResidences)
        FROM ResCount
    );

Then, once this intermediate table ResCount is built, we are in the same situation as in the queries at the beginning of this section, where the numbers of residences are now values stored in an attribute.

So we can follow the usual approach to get the tuple or tuples from ResCount with the maximum value in their attribute NumResidences. This involves going through all its tuples and checking if their NumResidences value matches the maximum. We can easily calculate this with a non-correlated subquery and the aggregation function MAX().
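
The following sketch tries both versions against an in-memory SQLite database via Python's sqlite3 module. SQLite stands in for PostgreSQL here, and the sample Residence rows are invented, so treat the exact error text as engine-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Residence (PersonFK INTEGER, CityFK INTEGER)")
# Hypothetical data: persons 1 and 4 have three residences each.
conn.executemany(
    "INSERT INTO Residence VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 10), (3, 11), (3, 12), (4, 10), (4, 11), (4, 13)],
)

# Attempt 1: nested aggregates are rejected by the engine.
nested_failed = False
try:
    conn.execute("SELECT MAX(COUNT(*)) FROM Residence GROUP BY PersonFK").fetchall()
except sqlite3.Error as exc:
    nested_failed = True
    print("Nested aggregate rejected:", exc)

# Attempt 2: count in a CTE first, then take MAX() over a plain column.
most_residences = conn.execute(
    """
    WITH ResCount AS (
        SELECT PersonFK AS PersonID, COUNT(*) AS NumResidences
        FROM Residence
        GROUP BY PersonFK
    )
    SELECT PersonID, NumResidences
    FROM ResCount
    WHERE NumResidences = (SELECT MAX(NumResidences) FROM ResCount)
    ORDER BY PersonID
    """
).fetchall()

print(most_residences)
conn.close()
```

Once the counts live in a named column, MAX() operates on ordinary values and the nesting problem disappears.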

After these queries, we can consider solving them by obtaining the element with the lowest value in its metric in the ranking.

For example, in this last case, it would correspond to finding the person or people who have had the fewest residences (which doesn't make much sense in this query, but it does in others).

So, to calculate minimums instead of maximums in SQL, you use exactly the same constructions we just saw and only swap the operators and aggregation functions: the operator >= becomes <=, and the aggregation function MIN() replaces MAX().
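
To show how little changes between the two cases, here is a sketch with a hypothetical helper, ships_at_extreme, that builds the same query with either MAX or MIN. It runs on SQLite through Python's sqlite3 module, and the table contents are sample data invented for the example.

```python
import sqlite3

def ships_at_extreme(conn, aggregate):
    """Return the ships whose capacity equals the MAX or MIN of the column."""
    # Whitelist the function name because it is spliced into the SQL text.
    if aggregate not in ("MAX", "MIN"):
        raise ValueError("aggregate must be 'MAX' or 'MIN'")
    query = f"""
        SELECT ShipID, PassengerCapacity
        FROM CruiseShip
        WHERE PassengerCapacity = (
            SELECT {aggregate}(PassengerCapacity) FROM CruiseShip
        )
        ORDER BY ShipID
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CruiseShip (ShipID INTEGER PRIMARY KEY, PassengerCapacity INTEGER)"
)
conn.executemany(
    "INSERT INTO CruiseShip VALUES (?, ?)",
    [(1, 2500), (2, 4000), (3, 4000), (4, 1800)],
)

largest = ships_at_extreme(conn, "MAX")
smallest = ships_at_extreme(conn, "MIN")
print(largest)
print(smallest)
conn.close()
```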

In addition to calculating maximums and minimums, in SQL it's sometimes useful to calculate the ranking positions of elements based on the value of their metrics.

SELECT
  P1.PoolID,
  P1.Name,
  P1.MaxDepth,
  (
    SELECT COUNT(*) + 1
    FROM Pool AS P2
    WHERE P2.MaxDepth > P1.MaxDepth
  ) AS DepthRank
FROM Pool AS P1
ORDER BY DepthRank;

For example, in the query above, we get a list of all the pools in the database, where for each one, we calculate its position in the pool ranking ordered by the value of its MaxDepth attribute, that is, by its maximum depth.

Also, there can be multiple pools with the same MaxDepth value, in which case all of them share the same position in the ranking. So the pool with the next lower MaxDepth value won't take the immediately following position. Instead, its position is the shared position plus the number of pools that share it.

PoolID	Name	MaxDepth	DepthRank
1	Sample Pool Name 1	5	1
2	Sample Pool Name 2	5	1
3	Sample Pool Name 3	3	3
4	Sample Pool Name 4	2	4
5	Sample Pool Name 5	2	4

To understand this, here we have a table where you can see that the first two pools have the same position (DepthRank) in the pool ranking because they have the same MaxDepth value. Then, the next pool with PoolID=3 has position 3 in the ranking, as there are two pools before it in the ranking. Finally, the next two pools with PoolID=4 and PoolID=5 again have the same position in the ranking for the same reason as before.

As we can see, this way of defining and building the ranking is not what we might expect, where each pool has a unique position. Instead, we slightly modify the ranking definition to allow pools with the same MaxDepth value to share the same position in the ranking, so SQL implementation doesn't require more advanced functions.

Regarding the implementation, if we look at the attributes of the example table, specifically MaxDepth and its relationship with DepthRank, we can conclude that the position we should assign to each pool in the ranking matches the number of pools with a MaxDepth strictly greater than its own plus 1.

For example, for the pool with PoolID=2, we see that there is no pool with a MaxDepth greater than its own – at most, there are some with an equivalent MaxDepth, but never greater because this pool has the highest MaxDepth value (meaning the maximum). Meanwhile, the pool with PoolID=3 has two pools with a MaxDepth greater than its own.

So if we add one to the number of pools with a metric value, which in this case we can find in the MaxDepth attribute, greater than the MaxDepth value of a certain pool, then the amount we obtain is the ranking position of that pool.

The simplest way to implement this calculation in SQL is through a correlated subquery in the SELECT, where, as you can see, we get all Pool tuples with a MaxDepth greater than the pool we are iterating over in the query. And finally, with COUNT(*)+1, we add 1 to the number of tuples returned by the subquery, thus generating the position in the ranking of the pool being iterated over in the query.
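
As a quick check of this rule, the sketch below loads the five pools from the example table into an in-memory SQLite database (through Python's sqlite3 module, an assumption made so the example is runnable anywhere) and computes the rank with the same correlated subquery. A secondary sort on PoolID is added only to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, Name TEXT, MaxDepth INTEGER)"
)
# Same values as the example table: depths 5, 5, 3, 2, 2.
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?, ?)",
    [
        (pool_id, f"Sample Pool Name {pool_id}", depth)
        for pool_id, depth in enumerate([5, 5, 3, 2, 2], start=1)
    ],
)

ranked = conn.execute(
    """
    SELECT P1.PoolID,
           P1.MaxDepth,
           (SELECT COUNT(*) + 1
            FROM Pool AS P2
            WHERE P2.MaxDepth > P1.MaxDepth) AS DepthRank
    FROM Pool AS P1
    ORDER BY DepthRank, P1.PoolID
    """
).fetchall()

for pool_id, max_depth, depth_rank in ranked:
    print(pool_id, max_depth, depth_rank)
conn.close()
```

The ranks come out as 1, 1, 3, 4, 4, matching the DepthRank column of the example table.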

Continuing with the idea of getting the ranking position of the elements, we also have the option to select only those elements with a ranking position greater or less than a certain amount we need to set.

SELECT PoolID, Name, MaxDepth
FROM Pool AS P
WHERE (
        SELECT COUNT(*)
        FROM Pool AS P2
        WHERE P2.MaxDepth > P.MaxDepth
    ) < 5
ORDER BY MaxDepth DESC;

For example, above we have a query where we get the pools that occupy the top 5 positions in the ranking. In other words, we don't get the first 5 rows with pools ordered by their MaxDepth value. We get every pool whose ranking position is 5 or lower, so tied pools are all included and the result can contain more than 5 rows.

As you can see, the implementation is simple. We go through all the Pool tuples and for each one, we execute a subquery like the one we saw in the previous query: it gets the number of pools with a MaxDepth greater than the pool we are iterating over – that is, its position in the ranking. Then, we compare that number with 5 to ensure it’s strictly less.

Also, here we need to note that we have not added 1 to COUNT(*), which means the ranking starts counting at position 0, not 1, so we can later check that the position is among the top 5 distinct ones with < 5 and not < 6. This doesn't have to be done this way necessarily, as we could have added 1 to COUNT(*) and declared the comparison using <6, or <=5.

In summary, in this query, we used a correlated subquery to get the ranking position (starting from position 0) of each pool, so we only keep those whose position is strictly less than 5. But we could have also pre-calculated the positions of each pool in a CTE and then applied this condition to an attribute instead of the value returned by a subquery.

This alternative will likely use more memory than necessary, since the subquery that calculates the ranking position has to be executed whether we use a CTE or not. So the better approach is to avoid the intermediate table unless we really need that information for other uses.

So now we've seen a series of queries that follow certain patterns that are the most basic and fundamental in SQL. But there are many other queries we could perform on the schema of this example with a wide variety of purposes. It's essential to know how to formulate and code them.
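
For comparison, here is a sketch of both variants side by side on an in-memory SQLite database through Python's sqlite3 module. The eight sample depths are invented and chosen so that two pools tie at the cutoff. The CTE name Ranked is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pool (PoolID INTEGER PRIMARY KEY, MaxDepth INTEGER)")
# Hypothetical depths; pools 5 and 6 tie at ranking position 5.
conn.executemany(
    "INSERT INTO Pool VALUES (?, ?)",
    list(enumerate([9, 9, 8, 7, 6, 6, 5, 4], start=1)),
)

# Variant 1: filter directly on the correlated subquery (position counted from 0).
subquery_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT PoolID
        FROM Pool AS P
        WHERE (SELECT COUNT(*) FROM Pool AS P2 WHERE P2.MaxDepth > P.MaxDepth) < 5
        ORDER BY MaxDepth DESC, PoolID
        """
    )
]

# Variant 2: pre-compute the position (counted from 1) in a CTE, then filter on it.
cte_ids = [
    row[0]
    for row in conn.execute(
        """
        WITH Ranked AS (
            SELECT P1.PoolID,
                   P1.MaxDepth,
                   (SELECT COUNT(*) + 1
                    FROM Pool AS P2
                    WHERE P2.MaxDepth > P1.MaxDepth) AS DepthRank
            FROM Pool AS P1
        )
        SELECT PoolID
        FROM Ranked
        WHERE DepthRank <= 5
        ORDER BY MaxDepth DESC, PoolID
        """
    )
]

print(subquery_ids)
print(cte_ids)
conn.close()
```

Both variants return the same six pools, one more than five, because the two pools tied at position 5 are both kept.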

To learn more queries, you can visit the following resource: PostgreSQL.ipynb.

https://github.com/cardstdani/sql-storage/blob/main/PostgreSQL.ipynb

This is a Jupyter notebook that you can run from Google Colab. It contains Python code and Bash commands that allow you to install the PostgreSQL DBMS on a Linux virtual machine like those used by Google Compute Engine (the backend of Google Colab). You can also execute SQL code to create the database from the DDL and then run queries and obtain their results.

The notebook contains a series of query statements with solutions, along with everything needed to execute them. These queries aren’t ordered or classified like those we saw in the last chapter, as the goal is for you to try to solve them from the statements without looking at the solution. This way, you can later see how they were solved and gain practice in formulating queries, which is one of the most valuable skills for providing services to end users from the database.

You don't necessarily have to do this in a Google Colab environment. You can also do it on a PostgreSQL installation on a local machine and execute the queries by copying and pasting the query code into the PostgreSQL terminal. But doing it in a remote environment like the one offered by Google Colab has certain advantages: you don't have to install anything manually, since everything is set up automatically by running the code cells, and you can read the statements in the notebook rendered with markdown.

Still, there are some disadvantages, such as the database being stored on a Google virtual machine, which means you don't have full control over the machine and environment in which the DBMS runs. Its execution can also be interrupted depending on how you use the virtual machine and the plan you have with Google Colab.

So even though it may not be an environment where you can deploy a fully functional production database, it's sufficiently similar to a real environment where you might have a database deployed for a project, making it worthwhile to work in Google Colab.

Conclusion

In this book, we’ve covered all the key concepts you need to know to design a database, based on certain requirements, for a software project.

But again, these concepts and commands are only the most basic and fundamental ones. So to learn more about SQL database design, check out other resources as well like reference books, articles, or the many resources available on the internet.

Your goal should be to gain a deeper understanding of what you’ve learned here. This will help you design robust DBs according to client requirements and code even more efficient queries.

Thank you for reading!

AI in Agriculture: How AI-Enhanced Farming Can Increase Crop Yields [Full Book]

Vahe Aslanyan — Tue, 14 Jan 2025 15:11:36 +0000

Artificial intelligence is revolutionizing the agriculture industry, paving the way for a future of smarter, more efficient farming practices. Imagine a world where crops are grown with precision and care, maximizing yields like never before. With AI at the forefront, this vision is becoming a reality.

By harnessing the power of AI in agriculture, crop yields are projected to soar by an impressive 70% come 2030. But how exactly does AI-enhanced farming achieve such remarkable results? Let's dig deeper into the exciting realm of AI in agriculture and explore the boundless potential it holds.

What You’ll Learn Here

In this book, we’ll delve into the fascinating ways in which AI technologies are transforming farming practices and boosting crop productivity to unprecedented levels.

Here's a glimpse of what you can expect to learn:

The role of AI in optimizing crop cultivation techniques
How AI-powered tools enhance pest and disease management in agriculture
Real-life examples showcasing the impact of AI on farm efficiency
The future prospects and potential challenges of AI in agriculture

Join me as we uncover the game-changing advancements in AI-driven farming and discover how these innovative solutions are reshaping the landscape of agriculture for the better.

What to Expect from this Book
The Role of AI in Transforming Agriculture
Chapter 1: Precision Agriculture – Techniques and Benefits
Chapter 2: How to Enhance Crop Yields and Productivity
Chapter 3: Labor Optimization Solutions Through AI in Agriculture
Chapter 4: Predictive Analytics and Machine Learning in Crop Yield Improvement
Chapter 5: How to Leverage Big Data and Computer Vision in Farming
Chapter 6: Optimizing Soil Moisture and Quality with AI Models
Chapter 7: Sustainable Land Use Strategies with Agricultural Technology
Chapter 8: Efficient Water Use and Irrigation Systems with AI Guidance

What to Expect from this Book

As the agricultural landscape evolves at a rapid pace, farmers, researchers, and industry leaders find themselves at a pivotal juncture.

Conventional methods that once guided decision-making—reliance on manual field assessments, guesswork in resource allocation, and labor-intensive processes—are quickly becoming outdated. In their place, data-driven insights, machine learning algorithms, and AI-enhanced technologies are redefining how we grow our food and manage our farms.

This book unravels the transformative potential of AI in agriculture, illustrating the tangible benefits and strategic advantages offered by this new era of farming.

By leveraging cutting-edge tools and analytics, the agricultural community can unlock untapped efficiencies, conserve vital resources, and achieve unprecedented boosts in productivity.

Above all, this integration of AI with agriculture isn’t about replacing human intelligence or experience—it’s about complementing it, magnifying the inherent wisdom farmers possess with the power of machine-driven insights.

Some of the major topics we’ll cover include:

Foundations of AI in Farming: Gain a solid understanding of the core principles of AI and how these technologies are applied to solve enduring farming challenges. Learn how sensors, drones, big data, and machine learning models come together to inform real-time decisions.
Precision Agriculture at Scale: Discover how AI refines traditional practices by honing in on micro-level conditions—soil moisture, nutrient profiles, and localized weather patterns. Understand how precision agriculture tools empower you to apply the right resources at the right time, eliminating waste and maximizing yields.
Adaptive Resource Management: Delve into predictive analytics that forecast weather events, identify pest infestations early, and recommend timely interventions. Explore how AI-driven recommendations save precious water, optimize fertilizer usage, and reduce overall costs, all while promoting long-term soil health and environmental stewardship.
Robotics and Automation for Enhanced Efficiency: Uncover how AI, when paired with robotics and automation, tackles labor shortages, repetitive tasks, and harvest timing with surgical precision. From autonomous planting and weeding to advanced sorting systems, learn how farming operations can gain speed, accuracy, and reliability.
Data-Driven Decision Making for Sustainability: Understand the data behind sustainable farming. Explore how integrating AI with ecological principles results in farming methods that are better for the planet and more profitable. See how smarter irrigation, targeted crop protection, and efficient land use not only improve the bottom line but also strengthen the resilience of farms against climate uncertainties.
Global Food Security and Climate Adaptation: Examine the broader implications of AI adoption—from scaling food production to meet the needs of a rapidly growing global population, to adapting to extreme weather patterns. AI technology acts as a buffer, helping farmers pivot swiftly in response to environmental changes and market fluctuations.
Overcoming Barriers and Realizing Potential: Identify the barriers to AI adoption, whether they be cost, technical literacy, or data sharing challenges. Learn strategies to overcome these hurdles, ensuring that farms of all sizes, from family-owned parcels to large commercial operations, can access and leverage AI insights.
Financial Incentives and Market Opportunities: Explore how AI transforms farming from a precarious venture into a more predictable, profitable enterprise. Understand the financial incentives, loan programs, and investment avenues that encourage adopting advanced technologies. Discover how a data-driven approach not only lowers risks but opens doors to premium markets, certifications, and consumer trust.

By the end of this book, you will have the confidence to integrate AI tools into your existing farm operations, knowing when and where each technology adds the most value.

You’ll also possess a refined set of strategies and best practices to make more informed, data-backed decisions that increase efficiency and reduce waste.

Your perspective on resource management, environmental stewardship, and long-term planning will also shift. You’ll learn how to achieve sustainable intensification, producing more with less and preserving the farm for future generations.

You’ll gain insights into how precision agriculture, robotics, data analytics, and predictive modeling directly contribute to better yields and higher returns on investment, building a financially resilient agricultural operation.

And finally, you will appreciate AI not as a complex, inaccessible science, but as a practical, essential toolkit for modern agriculture. This will position you at the forefront of an industry that’s poised for exponential growth and innovation, ready to increase crop yields by a remarkable 70% in the near future.

As you turn the pages ahead, prepare to envision a new era of farming, one where the synergy of human expertise and AI capabilities ensures a prosperous, sustainable, and secure food supply for all.

I’ve also recorded a podcast on this topic if you’d like to listen to that as well.

The Role of AI in Transforming Agriculture

In recent years, the integration of artificial intelligence with agriculture has dramatically transformed traditional farming techniques, heralding a new era of productivity and sustainability.

This chapter examines the profound impact of AI on agriculture, offering an all-encompassing perspective on how AI can revolutionize farming practices, optimize crop yields, and promote environmental sustainability.

Precision Agriculture through AI

Precision agriculture stands as a flagship application of AI within the agricultural domain. By allowing farmers to make highly informed decisions derived from granular data, AI elevates farming practices to unprecedented levels of efficiency and precision.

AI-driven systems analyze multifaceted data inputs, such as soil conditions, weather patterns, and crop performance metrics, creating a cohesive picture that empowers farmers to optimize every facet of crop management.

Rather than relying on broad-spectrum agricultural practices, precision agriculture tailors interventions to the unique needs of individual fields and even specific zones within those fields.

This hyper-local management not only maximizes crop yields but also curbs resource wastage, ultimately leading to a more sustainable and profitable farming operation. These data-driven decisions extend to optimal planting times, irrigation schedules, and fertilization plans, crafting an intricate roadmap to agricultural success.

In this example, we'll simulate how AI can help in precision agriculture by collecting soil data, weather data, and crop performance metrics. A model will be used to suggest optimal irrigation schedules and fertilization plans based on this data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sample data for soil moisture, temperature, and crop performance
soil_moisture = np.array([30, 35, 32, 45, 40])  # percentage
temperature = np.array([18, 21, 19, 23, 22])    # Celsius
crop_yield = np.array([80, 85, 83, 90, 88])     # yield per hectare

# Target values the models learn to predict
irrigation = np.array([20, 25, 22, 30, 28])   # water in percentage
fertilizer = np.array([5, 6, 5, 7, 6])        # fertilizer in kg/ha

# Feature matrix: one row per observation (soil moisture, temperature, yield)
features = np.column_stack((soil_moisture, temperature, crop_yield))

# Train a model for irrigation schedule (fixed seed for reproducible output)
irrigation_model = RandomForestRegressor(random_state=42)
irrigation_model.fit(features, irrigation)

# Train a model for fertilizer schedule
fertilizer_model = RandomForestRegressor(random_state=42)
fertilizer_model.fit(features, fertilizer)

# Simulating new data for a prediction
new_soil_moisture = 38
new_temperature = 20
new_crop_yield = 85

predicted_irrigation = irrigation_model.predict([[new_soil_moisture, new_temperature, new_crop_yield]])
predicted_fertilizer = fertilizer_model.predict([[new_soil_moisture, new_temperature, new_crop_yield]])

print(f"Predicted irrigation schedule: {predicted_irrigation[0]:.2f}% water")
print(f"Predicted fertilizer plan: {predicted_fertilizer[0]:.2f} kg/ha")

Machine Learning: Pioneering Predictive Crop Management

In the realm of modern agriculture, machine learning algorithms have emerged as indispensable assets. These algorithms digest vast, complex datasets encompassing soil moisture levels, plant health monitoring indicators, and meteorological forecasts, to develop predictive analytics models.

These models empower farmers to anticipate crop outcomes, facilitating proactive interventions designed to mitigate potential risks and bolster productivity.

For instance, by forecasting potential pest infestations or disease outbreaks, farmers can implement timely preventive measures, safeguarding crop health and ensuring optimal yield. This predictive capability extends beyond immediate crop management, aiding in long-term planning for resource allocation and operational logistics. The integration of machine learning not only enhances current farming practices but also fortifies the agricultural sector against future challenges.

In this code snippet, a machine learning model predicts the likelihood of a pest infestation based on factors like soil moisture and weather conditions.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample data (soil moisture, temperature, pest infestation - 0 means no infestation, 1 means infestation)
data = np.array([[30, 22, 0], [35, 25, 0], [40, 28, 1], [25, 20, 0], [45, 30, 1]])
X = data[:, :2]  # Soil moisture, temperature
y = data[:, 2]   # Pest infestation

# Train a Logistic Regression model
pest_model = LogisticRegression()
pest_model.fit(X, y)

# Predicting on new data
new_soil_moisture = 33
new_temperature = 27

predicted_pest_risk = pest_model.predict([[new_soil_moisture, new_temperature]])
predicted_prob = pest_model.predict_proba([[new_soil_moisture, new_temperature]])[0][1]

if predicted_pest_risk[0] == 1:
    print(f"High risk of pest infestation! Probability: {predicted_prob:.2f}")
else:
    print(f"Low risk of pest infestation. Probability: {predicted_prob:.2f}")

Farm Operations Transformed by Computer Vision

Computer vision technology propels agriculture into a new frontier, where machines possess the ability to "see" and interpret visual data with astounding accuracy. Employing sophisticated cameras and sensors, computer vision systems meticulously monitor crop health, detect and identify pest infestations, and evaluate soil quality in real-time.

The precision of computer vision enables the early detection of subtle changes in crop health that might elude the human eye. By identifying stressors such as nutrient deficiencies or water stress early, farmers can initiate targeted interventions, promoting healthier crops and improved yields.

This technology not only ensures timely management but also reduces the reliance on chemical treatments, fostering a more sustainable approach to pest and disease control.

Here, we simulate a simple computer vision task to detect unhealthy crops using image data, where red areas in the crop image might indicate stress or disease.

import cv2
import numpy as np

# Simulate a crop image with one fixed red patch (signifying stress)
image = np.zeros((100, 100, 3), dtype="uint8")
cv2.rectangle(image, (30, 30), (70, 70), (0, 0, 255), -1)  # Simulating stress area

# Convert to HSV to detect red areas
hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
lower_red = np.array([0, 120, 70])
upper_red = np.array([10, 255, 255])
mask = cv2.inRange(hsv_image, lower_red, upper_red)

# Calculate percentage of red (stressed) area
red_area_percentage = np.sum(mask > 0) / (image.shape[0] * image.shape[1]) * 100

if red_area_percentage > 10:
    print(f"Alert! {red_area_percentage:.2f}% of the crop area shows signs of stress.")
else:
    print(f"Healthy crops. Only {red_area_percentage:.2f}% of the area shows stress.")
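
One limitation of the threshold above is worth a small sketch. In OpenCV's 8-bit HSV representation, hue runs from 0 to 179 and red sits at both ends of that scale, so a single 0 to 10 range misses reds that lean toward magenta. The snippet below uses Python's standard colorsys module to approximate the OpenCV hue of two colors. The helper names and the second range starting at 170 are assumptions for illustration, not part of the original example.

```python
import colorsys

def opencv_hue(r, g, b):
    """Approximate the 8-bit OpenCV hue (0-179) of an RGB color given as 0-255 ints."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return int(round(h * 180)) % 180

def in_single_range(hue):
    """The check used in the example above: hue between 0 and 10."""
    return 0 <= hue <= 10

def in_wrapped_range(hue):
    """Hypothetical wider check that also accepts hues near the top of the scale."""
    return hue <= 10 or hue >= 170

pure_red = opencv_hue(255, 0, 0)
crimson = opencv_hue(220, 20, 60)

print(pure_red, in_single_range(pure_red), in_wrapped_range(pure_red))
print(crimson, in_single_range(crimson), in_wrapped_range(crimson))
```

With real imagery, this usually means building two masks with cv2.inRange, one per hue band, and combining them before computing the stressed-area percentage.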

AI-Driven Sustainability in Agriculture

One of the most compelling promises of AI in agriculture lies in its potential to drive sustainability. Through optimized land use and resource management, AI models contribute to reducing the environmental footprint of farming activities. AI algorithms can recommend precise dosages of water, fertilizers, and pesticides, minimizing overuse and runoff that can harm surrounding ecosystems.

AI's ability to analyze and predict climate patterns also supports the development of resilient agricultural practices. By helping farmers adapt to changing weather conditions and extreme events, AI fosters a more stable and sustainable food production system. This aspect is particularly crucial in the face of global climate change and the increasing demand for food from a growing population.

In this example, a simple rule-based function stands in for an AI model: it recommends water and fertilizer amounts from forecast rainfall, soil type, and crop stage, with the aim of minimizing resource waste.

# Environmental and crop data
rainfall_forecast = 50  # mm
soil_type = 'clay'  # clay, sand, silt
crop_stage = 'vegetative'  # stages: seedling, vegetative, reproductive

def recommend_water(rainfall, soil, stage):
    base_water = 20  # base liters per hectare
    if soil == 'sand':
        base_water += 5
    if stage == 'reproductive':
        base_water += 10

    if rainfall > 30:
        base_water -= 5  # reduce water if heavy rain predicted

    return max(base_water, 5)

def recommend_fertilizer(stage):
    if stage == 'seedling':
        return 3  # kg/ha
    elif stage == 'vegetative':
        return 6
    else:
        return 10

# Predictions for optimal resources
optimal_water = recommend_water(rainfall_forecast, soil_type, crop_stage)
optimal_fertilizer = recommend_fertilizer(crop_stage)

print(f"Optimal water usage: {optimal_water:.2f} liters per hectare")
print(f"Optimal fertilizer dosage: {optimal_fertilizer:.2f} kg/ha")
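A slightly more quantitative variant of these rules treats irrigation as the gap between crop water demand and the rainfall that actually reaches the roots. The following is a minimal sketch under stated assumptions: the function name irrigation_need_mm, the rain_efficiency factor, and the per-stage demand figures are all hypothetical and chosen only for illustration.

```python
def irrigation_need_mm(crop_demand_mm, rainfall_forecast_mm, rain_efficiency=0.8):
    """Irrigation depth (mm) needed to cover demand not met by rain.

    rain_efficiency is an assumed fraction of rainfall that reaches the
    root zone; the rest is treated as lost to runoff or evaporation.
    """
    effective_rain = rainfall_forecast_mm * rain_efficiency
    return max(crop_demand_mm - effective_rain, 0.0)

# Hypothetical weekly crop water demand (mm) per growth stage
demand_by_stage = {"seedling": 20, "vegetative": 35, "reproductive": 50}

for stage, demand in demand_by_stage.items():
    need = irrigation_need_mm(demand, rainfall_forecast_mm=30)
    print(f"{stage}: irrigate {need:.1f} mm")
```

Working in millimeters keeps the units easy to convert: 1 mm of water over 1 hectare corresponds to 10,000 liters.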

Addressing Future Agricultural Challenges with AI

The agricultural sector stands at a crossroads, confronted by an array of challenges including labor shortages, extreme weather events, and the imperative for enhanced decision-making tools.

AI-powered solutions present a beacon of hope, offering tools and methodologies to navigate these obstacles effectively. By automating labor-intensive tasks such as planting and harvesting, AI eases the burden on the agricultural workforce.

Beyond this, AI's analytical capabilities provide farmers with the insights needed to adapt to evolving environmental and market conditions. Enhanced resilience is key, as the ability to swiftly respond to unforeseen challenges ensures the continuity of agricultural production and security of food supplies.

The transformation is not limited to technological or productivity aspects alone. AI also cultivates a mindset of continuous improvement and learning within the agricultural community. By embracing data-centric approaches and fostering an environment of innovation, AI nurtures a new generation of farmers equipped to tackle the intricacies of modern agriculture.

This example demonstrates how AI can assist in automating tasks like identifying ripened crops for automated harvesting using basic image processing.

import cv2
import numpy as np

# Simulate crop image with different shades (representing ripened and unripened crops)
image = np.zeros((100, 100, 3), dtype="uint8")
cv2.circle(image, (30, 30), 20, (0, 255, 0), -1)  # Green (unripe crop)
cv2.circle(image, (70, 70), 20, (0, 0, 255), -1)  # Red (ripe crop)

# Convert image to HSV to detect red (ripened crops)
hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
lower_red = np.array([0, 120, 70])
upper_red = np.array([10, 255, 255])
mask = cv2.inRange(hsv_image, lower_red, upper_red)

# Identify ripe crops for harvesting
ripe_area_percentage = np.sum(mask > 0) / (image.shape[0] * image.shape[1]) * 100

if ripe_area_percentage > 10:
    print(f"Ripe crops detected! {ripe_area_percentage:.2f}% of the area is ready for harvest.")
else:
    print(f"Insufficient ripeness. {ripe_area_percentage:.2f}% of the area is ready for harvest.")

As you can now start to see, the integration of AI in agriculture is shaping the future of farming by moving beyond traditional methods and unlocking a plethora of possibilities for enhanced crop management, sustainability, and resilience.

By leveraging precision agriculture, machine learning, computer vision, and sustainability-focused AI models, the agricultural sector is poised to meet future challenges head-on, ensuring food security and environmental stewardship for generations to come.

The cumulative impact of these advanced technologies holds the potential to increase crop yields significantly, setting a path toward a more productive and sustainable agricultural industry by 2030 and beyond.

Chapter 1: Precision Agriculture – Techniques and Benefits

AI and other cutting-edge technologies are revolutionizing the agriculture industry, providing innovative solutions to enhance crop yields and address the myriad challenges faced by farmers globally. With the advent of AI models, predictive analytics, and machine learning algorithms, the agricultural sector can now leverage real-time data for more informed decision-making.

This chapter explores the profound impact of these technologies, offering a comprehensive analysis of their applications and benefits.

For each subsection below, you’ll find code snippets that demonstrate how these practices can work. These examples incorporate Large Language Models (LLMs) to enhance various agricultural applications.

The code primarily uses Python and calls OpenAI's GPT models through their API. The examples use the openai.ChatCompletion interface, which was removed in version 1.0 of the openai library, so install a pre-1.0 release and set up your API key before running them.

pip install "openai<1.0"

import openai
import os

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")
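A missing environment variable is an easy mistake to make, and os.getenv simply returns None in that case, which only surfaces later as an authentication error. The helper below is a small sketch that fails early with a clearer message. The function name get_api_key is a hypothetical addition, not part of the OpenAI library.

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell before running the examples."
        )
    return key

# Usage (hypothetical): openai.api_key = get_api_key()
```

Reading the key from the environment, rather than pasting it into the script, also keeps it out of version control.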

Now that you’re all set, let’s examine some of the different ways that AI can have an impact on agricultural practices.

Predictive Analytics in Agriculture

Predictive analytics represents a significant advancement in the agricultural domain. By meticulously analyzing weather patterns, soil conditions, and historical crop data, farmers can proactively adapt their strategies to mitigate risks and optimize yields.

For instance, predictive models can forecast the likelihood of drought or pest infestations, allowing farmers to deploy preventive measures well in advance. This data-driven approach ensures farming practices are not only more responsive but also tailored to specific soil types and crop needs.

Consider a farmer in the Midwest United States dealing with unpredictable weather patterns. By using predictive analytics, this farmer can receive timely alerts about incoming weather changes, enabling them to adjust crop schedules, irrigation, and even planting strategies accordingly. The integration of satellite imagery and IoT sensors provides a holistic view of the farm’s health, ensuring that every decision is backed by robust data.

Example of predictive analysis in agriculture:

Objective: Utilize an LLM to generate actionable insights from predictive analytics models, such as forecasting drought risks or pest infestations.

import openai
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sample data: [soil_moisture, temperature, humidity]
X = np.array([
    [30, 25, 40],
    [35, 30, 50],
    [20, 15, 30],
    [25, 20, 35],
    [40, 35, 60]
])

# Labels: 0 - No pest infestation, 1 - Pest infestation
y = np.array([0, 1, 0, 0, 1])

# Train a predictive model (fixed seed so repeated runs give the same result)
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# New data point
new_data = np.array([[28, 22, 45]])

# Predict pest infestation
prediction = model.predict(new_data)[0]
probability = model.predict_proba(new_data)[0][1]

# Summarize the model's prediction as a short risk statement
if prediction == 1:
    risk = f"High risk of pest infestation (estimated probability {probability*100:.2f}%)."
else:
    risk = f"Low risk of pest infestation (estimated probability {probability*100:.2f}%)."

# Use LLM to create a comprehensive report
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an agricultural data analyst."},
        {"role": "user", "content": f"Generate a report based on the following risk assessment: {risk}"}
    ]
)

report = response.choices[0].message['content']
print(report)

Sample Output:

Based on the latest data analysis, there is a high risk of pest infestation with a probability of 70.00%. It is recommended to implement preventive measures such as targeted pesticide application and increased monitoring in the affected areas to mitigate potential damage and ensure optimal crop health.
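A binary high/low label hides how close a prediction is to the decision boundary. One option is to map the predicted probability to several tiers before passing it to the LLM. This is a minimal sketch: the function names risk_tier and risk_summary and the 0.3 and 0.6 cut-offs are assumptions chosen for illustration, not values from the example above.

```python
def risk_tier(probability, moderate=0.3, high=0.6):
    """Map an infestation probability (0-1) to 'low', 'moderate', or 'high'.

    The cut-offs are illustrative assumptions and should be tuned per crop.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "high"
    if probability >= moderate:
        return "moderate"
    return "low"

def risk_summary(probability):
    """Build a one-sentence risk statement suitable for an LLM prompt."""
    tier = risk_tier(probability).capitalize()
    return (f"{tier} risk of pest infestation "
            f"(estimated probability {probability * 100:.0f}%).")

for p in (0.1, 0.45, 0.7):
    print(risk_summary(p))
```

Keep in mind that a model trained on only a handful of rows, like the five samples above, produces coarse probabilities, so the tiers are only as reliable as the underlying data.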

Precision Agriculture Techniques

AI-powered machine learning algorithms are central to the practice of precision agriculture, a method that optimizes the management of farming practices. Machine learning aids in monitoring various critical parameters such as soil moisture, nutrient levels, and crop health with unparalleled precision.

By utilizing computer vision technology, farmers can remotely assess the health of their crops through high-resolution images. This technology identifies areas requiring immediate attention, thereby significantly reducing waste and enhancing productivity.

For example, a farmer in the rice-producing regions of Asia can use drones equipped with multi-spectral cameras to monitor crop conditions. The data captured is processed through AI algorithms that provide actionable insights on which areas need additional water or which sections are experiencing nutrient deficiencies. This precise targeting ensures resources are utilized efficiently, promoting sustainable farming practices while increasing yields.

Example of using precision agriculture techniques

Objective: Use an LLM to interpret data from precision agriculture sensors and provide tailored recommendations.

import openai

# Sample sensor data
sensor_data = {
    "soil_moisture": 35,  # in percentage
    "temperature": 22,    # in Celsius
    "nutrient_levels": {
        "nitrogen": 50,    # ppm
        "phosphorus": 30,  # ppm
        "potassium": 40    # ppm
    },
    "crop_stage": "vegetative"
}

# Convert sensor data to a descriptive text
data_description = (
    f"Soil moisture is at {sensor_data['soil_moisture']}%, "
    f"temperature is {sensor_data['temperature']}°C, "
    f"nitrogen levels are {sensor_data['nutrient_levels']['nitrogen']} ppm, "
    f"phosphorus levels are {sensor_data['nutrient_levels']['phosphorus']} ppm, "
    f"potassium levels are {sensor_data['nutrient_levels']['potassium']} ppm, "
    f"and the crop is in the {sensor_data['crop_stage']} stage."
)

# Use LLM to generate recommendations
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in precision agriculture."},
        {"role": "user", "content": f"Based on the following sensor data, provide recommendations for irrigation and fertilization: {data_description}"}
    ]
)

recommendations = response.choices[0].message['content']
print(recommendations)

Sample Output:

Based on the current sensor data, here are the recommendations:

**Irrigation:**
- Soil moisture is at 35%, which is within the optimal range for the vegetative stage. Continue with the current irrigation schedule but monitor closely for any fluctuations due to temperature changes.

**Fertilization:**
- **Nitrogen (50 ppm):** Adequate for the vegetative stage. No additional nitrogen fertilizer is needed at this time.
- **Phosphorus (30 ppm):** Levels are slightly low. Consider applying a phosphorus-based fertilizer to support root development.
- **Potassium (40 ppm):** Adequate. Maintain current potassium levels to ensure balanced nutrient availability.

Overall, maintain regular monitoring and adjust as necessary based on plant responses and environmental conditions.
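Because LLM output can vary between runs, it can help to run a deterministic check on the sensor readings first and pass the flags along with the raw data. The sketch below assumes hypothetical target ranges. TARGET_RANGES and flag_readings are illustrative names, and the numbers are not agronomic reference values.

```python
# Hypothetical target ranges; real values depend on crop, soil, and lab method
TARGET_RANGES = {
    "soil_moisture": (30, 45),  # percent
    "nitrogen": (40, 60),       # ppm
    "phosphorus": (35, 50),     # ppm
    "potassium": (35, 55),      # ppm
}

def flag_readings(readings, targets=TARGET_RANGES):
    """Label each reading as 'low', 'ok', or 'high' against its target range."""
    flags = {}
    for name, value in readings.items():
        low, high = targets[name]
        if value < low:
            flags[name] = "low"
        elif value > high:
            flags[name] = "high"
        else:
            flags[name] = "ok"
    return flags

readings = {"soil_moisture": 35, "nitrogen": 50, "phosphorus": 30, "potassium": 40}
for name, flag in flag_readings(readings).items():
    print(f"{name}: {readings[name]} ({flag})")
```

The flags can be appended to the prompt, or used on their own as a sanity check against whatever the model recommends.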

Enhancing Soil Quality and Productivity

Soil quality is a critical factor in determining crop productivity. AI-enhanced farm management software equips farmers with the tools to monitor and improve soil health continuously.

By understanding the specific characteristics of their soil, such as pH levels, nutrient content, and organic matter, farmers can implement targeted interventions. This precision management approach maximizes the use of resources while promoting soil sustainability.

Consider a farmer in sub-Saharan Africa struggling with nutrient-poor soils. AI can analyze soil samples and recommend precise formulations of fertilizers tailored to the specific needs of the soil. Over time, the software can track the impact of these interventions, providing feedback and suggesting further improvements. This continuous optimization cycle not only boosts crop yields but also enhances soil health, ensuring long-term sustainability.

Example of enhancing soil quality and productivity

Objective: Leverage an LLM to analyze soil data and recommend precise fertilizer formulations tailored to specific soil needs.

import openai

# Sample soil data
soil_data = {
    "pH": 5.8,
    "organic_matter": 3.2,  # percentage
    "nutrient_content": {
        "nitrogen": 40,       # ppm
        "phosphorus": 25,     # ppm
        "potassium": 35       # ppm
    },
    "crop_type": "corn"
}

# Create a descriptive text from soil data
soil_description = (
    f"The soil pH is {soil_data['pH']}, organic matter is {soil_data['organic_matter']}%, "
    f"nitrogen level is {soil_data['nutrient_content']['nitrogen']} ppm, "
    f"phosphorus level is {soil_data['nutrient_content']['phosphorus']} ppm, "
    f"potassium level is {soil_data['nutrient_content']['potassium']} ppm, "
    f"and the crop type is {soil_data['crop_type']}."
)

# Use LLM to recommend fertilizer formulations
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a soil fertility expert."},
        {"role": "user", "content": f"Based on the following soil data, recommend precise fertilizer formulations for optimal corn growth: {soil_description}"}
    ]
)

fertilizer_recommendations = response.choices[0].message['content']
print(fertilizer_recommendations)

Sample Output:

Based on the provided soil data, here are the fertilizer recommendations for optimal corn growth:

**Soil pH: 5.8**
- Slightly acidic for corn, which prefers a pH between 6.0 and 6.8. To raise the pH, consider applying agricultural lime at a rate of 1-2 tons per acre. Conduct a soil test after a few months to determine if further adjustments are necessary.

**Organic Matter: 3.2%**
- Adequate organic matter content. Maintain or slightly increase it by incorporating compost or well-decomposed manure to enhance soil structure and nutrient retention.

**Nutrient Content:**
- **Nitrogen (40 ppm):** Adequate for early growth stages. Apply a balanced nitrogen fertilizer, such as urea (46-0-0), at a rate of 50-60 lbs per acre at planting, followed by a side-dress application of 30-40 lbs per acre when plants reach the V6 stage.

- **Phosphorus (25 ppm):** Slightly low for corn, which requires higher phosphorus for root development. Apply a phosphorus fertilizer like triple superphosphate (0-46-0) at a rate of 20-30 lbs per acre during planting.

- **Potassium (35 ppm):** Adequate for corn growth. Maintain current levels by applying potassium sulfate (0-0-50) if necessary, but based on current data, additional potassium may not be required.

**Crop Type: Corn**
- Corn has high nutrient demands, especially nitrogen and phosphorus. Regularly monitor plant growth and soil nutrient levels throughout the growing season to adjust fertilizer applications as needed.

**Additional Recommendations:**
- Implement a crop rotation plan to prevent nutrient depletion and reduce pest and disease pressure.
- Utilize cover crops during off-season periods to enhance soil fertility and organic matter.
- Ensure proper irrigation management to facilitate nutrient uptake and prevent leaching.

These tailored fertilizer formulations will support robust corn growth, improve yield, and maintain long-term soil health.
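The idea of tracking interventions over time can be sketched without any AI at all: store each soil test and compute the change from one season to the next. The records and the function name changes_between_tests below are hypothetical and serve only to illustrate the feedback loop.

```python
# Hypothetical soil test history for one field
soil_tests = [
    {"season": "season 1", "pH": 5.8, "phosphorus": 25},
    {"season": "season 2", "pH": 6.1, "phosphorus": 29},
    {"season": "season 3", "pH": 6.3, "phosphorus": 33},
]

def changes_between_tests(tests, keys):
    """Return the season-over-season change for each requested measurement."""
    rows = []
    for prev, curr in zip(tests, tests[1:]):
        row = {"from": prev["season"], "to": curr["season"]}
        for key in keys:
            row[key] = round(curr[key] - prev[key], 2)
        rows.append(row)
    return rows

for row in changes_between_tests(soil_tests, ["pH", "phosphorus"]):
    print(f"{row['from']} -> {row['to']}: "
          f"pH {row['pH']:+.2f}, phosphorus {row['phosphorus']:+} ppm")
```

A summary of these changes could be added to the LLM prompt so that the recommendations account for how the soil responded to earlier treatments.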

Improving Crop Management through AI-Enhanced Decision Support Systems

AI-enhanced decision support systems integrate various data sources to provide farmers with actionable insights. These systems analyze data from weather forecasts, soil sensors, and market trends to offer comprehensive advice on crop management.

For instance, a farmer in Europe growing wheat can use these systems to decide the optimal planting time, anticipate pest outbreaks, and estimate the best harvest period based on market prices. Such integrative approaches ensure that farmers can make knowledgeable decisions that balance productivity and profitability.

In the framework of smart greenhouses, AI algorithms control environmental conditions such as lighting, temperature, and humidity. An example is the use of AI in tomato greenhouses in the Netherlands, where machine learning algorithms autonomously adjust these parameters to create optimal growing conditions. This results in enhanced growth rates, improved fruit quality, and higher yields.

Example of improving crop management through AI-enhanced decision support systems

Objective: Integrate an LLM into a decision support system to provide comprehensive advice based on multiple data sources, including weather forecasts, soil sensors, and market trends.

import openai

# Sample data inputs
data = {
    "weather_forecast": {
        "temperature": "25°C",
        "precipitation": "Low",
        "humidity": "60%",
        "wind_speed": "15 km/h"
    },
    "soil_sensors": {
        "soil_moisture": "40%",
        "pH": "6.5",
        "nutrient_levels": {
            "nitrogen": "45 ppm",
            "phosphorus": "30 ppm",
            "potassium": "40 ppm"
        }
    },
    "market_trends": {
        "wheat_price": "$200 per ton",
        "demand_growth": "5% annually"
    },
    "crop_type": "wheat",
    "crop_stage": "flowering"
}

# Create a descriptive summary
summary = (
    f"Weather Forecast: Temperature is {data['weather_forecast']['temperature']}, "
    f"precipitation is {data['weather_forecast']['precipitation']}, "
    f"humidity is {data['weather_forecast']['humidity']}, and wind speed is {data['weather_forecast']['wind_speed']}. "
    f"Soil Sensors: Soil moisture is {data['soil_sensors']['soil_moisture']}, pH is {data['soil_sensors']['pH']}, "
    f"nitrogen level is {data['soil_sensors']['nutrient_levels']['nitrogen']}, "
    f"phosphorus level is {data['soil_sensors']['nutrient_levels']['phosphorus']}, "
    f"and potassium level is {data['soil_sensors']['nutrient_levels']['potassium']}. "
    f"Market Trends: Wheat price is {data['market_trends']['wheat_price']} with a demand growth of {data['market_trends']['demand_growth']}. "
    f"Crop Type: {data['crop_type']} in the {data['crop_stage']} stage."
)

# Use LLM to generate decision support advice
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI-powered agricultural decision support system."},
        {"role": "user", "content": f"Provide comprehensive advice based on the following data: {summary}"}
    ]
)

advice = response.choices[0].message['content']
print(advice)

Sample Output:

**Comprehensive Crop Management Advice for Wheat in the Flowering Stage**

**Weather Considerations:**
- **Temperature (25°C):** Optimal for wheat flowering. Maintain current irrigation levels to support continued growth.
- **Precipitation (Low):** Monitor soil moisture closely. Consider implementing supplemental irrigation if forecasts indicate prolonged dry periods.
- **Humidity (60%):** Moderate humidity levels are conducive to wheat health. Ensure adequate air circulation to prevent fungal diseases.
- **Wind Speed (15 km/h):** Manage wind exposure to reduce the risk of lodging (plants falling over). Implement windbreaks if necessary.

**Soil Management:**
- **Soil Moisture (40%):** Adequate moisture levels. Continue regular irrigation to sustain optimal growth.
- **pH (6.5):** Ideal pH for wheat. No immediate adjustments needed.
- **Nutrient Levels:**
  - **Nitrogen (45 ppm):** Sufficient for the flowering stage. Avoid over-fertilization to prevent lodging.
  - **Phosphorus (30 ppm):** Adequate. Continue monitoring to ensure availability for grain development.
  - **Potassium (40 ppm):** Optimal levels. Maintains plant health and stress resistance.

**Market Trends:**
- **Wheat Price ($200 per ton):** Favorable market conditions. Maximize yield and quality to capitalize on high prices.
- **Demand Growth (5% annually):** Positive outlook. Invest in strategies that enhance yield and sustainability to meet growing demand.

**Recommendations:**
1. **Irrigation Management:**
   - Maintain current irrigation schedules.
   - Prepare for potential supplemental irrigation if dry conditions persist.

2. **Pest and Disease Control:**
   - With moderate humidity, remain vigilant for signs of fungal diseases such as powdery mildew.
   - Implement preventive measures, including appropriate fungicide applications if necessary.

3. **Nutrient Management:**
   - Continue with balanced fertilization practices.
   - Avoid excess nitrogen to prevent lodging; consider applying a controlled-release fertilizer if additional nutrients are needed.

4. **Mechanical Practices:**
   - Assess fields for signs of lodging and take corrective actions if required.
   - Ensure harvesting equipment is calibrated to minimize grain loss and maintain quality.

5. **Harvest Planning:**
   - Monitor wheat maturity closely to determine the optimal harvest window.
   - Coordinate harvesting activities to align with favorable market prices and minimize weather-related risks.

6. **Sustainability Practices:**
   - Implement crop rotation strategies to maintain soil health.
   - Utilize cover crops post-harvest to prevent soil erosion and enhance organic matter content.

By adhering to these recommendations, you can optimize wheat yield and quality, capitalize on favorable market conditions, and ensure sustainable farming practices for future growth.
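Hand-writing a long f-string for every data source gets tedious as more inputs are added, and it is easy to repeat a unit that the value already contains. One alternative is to flatten the nested dictionary generically. This is a sketch under stated assumptions: flatten and build_summary are hypothetical helper names, and the sample dictionary is a shortened stand-in for the data used above.

```python
def flatten(data, prefix=""):
    """Yield (label, value) pairs from a nested dict, joining keys with ' > '."""
    for key, value in data.items():
        label = f"{prefix} > {key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, label)
        else:
            yield label.replace("_", " "), value

def build_summary(data):
    """Turn a nested dict of readings into a single line of prompt text."""
    return "; ".join(f"{label}: {value}" for label, value in flatten(data))

# Shortened, hypothetical version of the decision-support inputs
sample = {
    "weather_forecast": {"temperature": "25°C", "precipitation": "Low"},
    "soil_sensors": {"pH": "6.5", "nutrient_levels": {"nitrogen": "45 ppm"}},
    "crop_type": "wheat",
}
print(build_summary(sample))
```

Because each value is inserted exactly as stored, units appear once, and new data sources show up in the prompt without any changes to the formatting code.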

Addressing Global Agricultural Challenges with AI

AI technologies are not just limited to enhancing yields but are also pivotal in addressing global challenges such as climate change, food security, and sustainable resource management.

In regions prone to climate variability, AI models can predict and simulate different climate scenarios and recommend adaptive strategies for resilient farming. In doing so, AI helps secure food production against the changing climate.

For instance, in India, where farmers are heavily dependent on monsoon rains, AI-based systems can provide early warnings about deficient rainfalls. This allows farmers to switch to more drought-resistant crop varieties or alter their cropping patterns, thus safeguarding their livelihoods.

Example of addressing global agricultural challenges with AI

Objective: Use an LLM to generate adaptive farming strategies based on climate predictions and other global challenges.

import openai

# Sample climate data
climate_data = {
    "region": "India",
    "climate_challenge": "Deficient monsoon rains",
    "current_crop": "rice",
    "alternative_crops": ["millet", "sorghum", "pulses"],
    "forecast": "El Niño event expected to reduce rainfall by 30% in the upcoming season."
}

# Create a descriptive summary
climate_summary = (
    f"Region: {climate_data['region']}. "
    f"Climate Challenge: {climate_data['climate_challenge']}. "
    f"Current Crop: {climate_data['current_crop']}. "
    f"Alternative Crops: {', '.join(climate_data['alternative_crops'])}. "
    f"Forecast: {climate_data['forecast']}"
)

# Use LLM to recommend adaptive strategies
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in sustainable agriculture and climate adaptation."},
        {"role": "user", "content": f"Given the following climate data, suggest adaptive farming strategies: {climate_summary}"}
    ]
)

strategies = response.choices[0].message['content']
print(strategies)

Sample Output:

**Adaptive Farming Strategies for India Amidst Deficient Monsoon Rains**

**1. Crop Diversification:**
   - **Shift to Drought-Resistant Crops:** Transition from rice to more drought-tolerant crops such as millet, sorghum, and pulses. These crops require less water and can thrive under reduced rainfall conditions.
   - **Intercropping:** Implement intercropping practices by planting multiple crop species simultaneously. This enhances resource utilization and reduces the risk of total crop failure.

**2. Water Management:**
   - **Rainwater Harvesting:** Construct rainwater harvesting systems to capture and store residual rainfall during the monsoon for use during dry periods.
   - **Drip Irrigation:** Adopt efficient irrigation techniques like drip or sprinkler systems to minimize water wastage and ensure targeted water delivery to crops.
   - **Soil Moisture Conservation:** Use mulching and cover cropping to retain soil moisture and reduce evaporation rates.

**3. Soil Health Improvement:**
   - **Organic Amendments:** Incorporate organic matter such as compost or manure to improve soil structure, enhance water retention, and increase nutrient availability.
   - **Conservation Tillage:** Practice conservation tillage methods to reduce soil erosion, maintain soil moisture, and promote microbial activity.

**4. Climate-Resilient Practices:**
   - **Agroforestry:** Integrate trees and shrubs into agricultural landscapes to provide shade, reduce wind speed, and improve microclimates for crops.
   - **Weather Forecasting Utilization:** Leverage advanced weather forecasting tools to make informed decisions about planting, irrigation, and harvesting schedules.

**5. Financial and Policy Support:**
   - **Subsidies for Drought-Resistant Varieties:** Advocate for government subsidies and incentives for farmers adopting drought-resistant crop varieties and water-efficient technologies.
   - **Insurance Schemes:** Promote crop insurance schemes that protect farmers against losses due to climate-induced risks.

**6. Community Engagement and Education:**
   - **Training Programs:** Organize training sessions to educate farmers about climate-resilient farming techniques and the benefits of crop diversification.
   - **Collaborative Platforms:** Foster community-based platforms for knowledge sharing, enabling farmers to learn from each other's experiences and adopt best practices.

**7. Technological Integration:**
   - **IoT and Sensors:** Deploy IoT devices and soil moisture sensors to monitor environmental conditions in real-time, allowing for timely interventions.
   - **AI-Driven Decision Support:** Utilize AI-powered tools to analyze climate data and provide personalized recommendations for crop management and resource allocation.

**8. Market Adaptation:**
   - **Value Addition:** Explore value-added products and alternative markets for drought-resistant crops to enhance profitability.
   - **Supply Chain Optimization:** Improve supply chain logistics to reduce post-harvest losses and ensure timely access to markets despite climatic challenges.

Implementing these adaptive strategies will help mitigate the adverse effects of deficient monsoon rains, ensure sustained agricultural productivity, and enhance the resilience of farming communities in India.
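The crop-switching decision can also be framed as a simple scenario check: given a forecast rainfall reduction, which crops' water needs can still be met? The sketch below is purely illustrative. The water requirement figures, the normal rainfall value, and the function name feasible_crops are hypothetical assumptions, not agronomic reference data.

```python
# Hypothetical seasonal water requirements in mm (illustrative only)
WATER_REQUIREMENT_MM = {"rice": 1200, "sorghum": 500, "millet": 400, "pulses": 350}

def feasible_crops(normal_rainfall_mm, reduction_fraction, irrigation_mm=0):
    """Return crops whose assumed water need fits the expected supply.

    Supply is the reduced rainfall plus any available irrigation.
    """
    available = normal_rainfall_mm * (1 - reduction_fraction) + irrigation_mm
    return sorted(
        crop for crop, need in WATER_REQUIREMENT_MM.items() if need <= available
    )

# Scenario: rainfall 30% below an assumed normal of 700 mm
print("Rain only:", feasible_crops(700, 0.30))
print("With 100 mm irrigation:", feasible_crops(700, 0.30, irrigation_mm=100))
```

A shortlist like this could be passed to the LLM in place of a fixed list of alternative crops, so that its suggestions stay within what the water budget allows.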

Advancing Agricultural Research through AI

AI is also making significant inroads into agricultural research. By fostering the development of new crop varieties, AI accelerates the breeding process. Machine learning models analyze vast datasets to identify traits associated with disease resistance, drought tolerance, and higher nutritional content. These insights expedite the breeding programs, leading to the development of superior crop varieties in record time.

For instance, in the quest to develop a rust-resistant wheat variety, researchers can use AI to sift through genetic data and pinpoint the genes responsible for resistance. This targeted approach not only saves time but also increases the likelihood of successful trait incorporation.

Example of advancing agricultural research through AI

Objective: Employ an LLM to assist in analyzing genetic data for breeding programs aimed at developing disease-resistant or drought-tolerant crop varieties.

import openai

# Sample genetic data summary
genetic_data = {
    "crop": "wheat",
    "goal": "develop rust-resistant variety",
    "current_breeding_data": {
        "gene_X": "associated with leaf rust resistance",
        "gene_Y": "no significant association",
        "gene_Z": "linked to stem rust resistance"
    },
    "existing_varieties": ["Variety_A", "Variety_B"],
    "desired_traits": ["high yield", "drought tolerance"]
}

# Create a descriptive summary
genetic_summary = (
    f"Crop: {genetic_data['crop']}. "
    f"Goal: {genetic_data['goal']}. "
    f"Current Breeding Data: {', '.join([f'{gene}: {desc}' for gene, desc in genetic_data['current_breeding_data'].items()])}. "
    f"Existing Varieties: {', '.join(genetic_data['existing_varieties'])}. "
    f"Desired Traits: {', '.join(genetic_data['desired_traits'])}."
)

# Use LLM to analyze genetic data and suggest next steps
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a geneticist specializing in crop breeding."},
        {"role": "user", "content": f"Analyze the following genetic data and suggest next steps for developing a rust-resistant wheat variety with high yield and drought tolerance: {genetic_summary}"}
    ]
)

analysis = response.choices[0].message['content']
print(analysis)

Sample Output:

**Analysis and Recommendations for Developing a Rust-Resistant Wheat Variety with High Yield and Drought Tolerance**

**1. Genetic Analysis:**
   - **Gene X:** Associated with leaf rust resistance. This gene shows promise for enhancing the plant's ability to withstand foliar rust infections.
   - **Gene Y:** No significant association with rust resistance. It may be deprioritized in the breeding program.
   - **Gene Z:** Linked to stem rust resistance. Incorporating this gene can provide comprehensive rust resistance, targeting both leaf and stem infections.

**2. Breeding Strategy:**
   - **Marker-Assisted Selection (MAS):** Utilize molecular markers linked to Gene X and Gene Z to facilitate the selection of individuals carrying these resistance genes. This approach accelerates the breeding process by enabling the identification of desired traits at the seedling stage.
   - **Pyramiding Resistance Genes:** Combine Gene X and Gene Z within a single genotype to ensure broad-spectrum rust resistance. This strategy reduces the likelihood of rust pathogens overcoming resistance through mutation.
   - **Incorporate Desired Traits:**
     - **High Yield:** Select parent lines known for their high-yield potential. Ensure that these lines are compatible with the rust-resistant varieties to maintain yield performance.
     - **Drought Tolerance:** Integrate genes or quantitative trait loci (QTLs) associated with drought tolerance. This can be achieved through traditional breeding methods or by employing genomic selection techniques.

**3. Crossbreeding Plan:**
   - **Parent Selection:** Choose existing varieties (e.g., Variety_A and Variety_B) that exhibit high yield and possess either Gene X or Gene Z.
   - **Hybridization:** Perform crosses between these parent lines to combine rust resistance with high yield traits.
   - **Progeny Evaluation:** Assess the offspring for rust resistance, yield performance, and drought tolerance through phenotypic screening and molecular assays.

**4. Genomic Tools and Techniques:**
   - **Genomic Selection:** Implement genomic selection models to predict the performance of breeding lines based on their genetic makeup. This enhances the accuracy of selecting superior genotypes.
   - **CRISPR-Cas9 Gene Editing:** Consider utilizing gene editing technologies to precisely insert or enhance Gene X and Gene Z in elite wheat varieties, reducing the time required for conventional breeding.

**5. Field Trials and Validation:**
   - **Multi-Location Trials:** Conduct field trials across different environments to evaluate the stability and effectiveness of rust resistance and drought tolerance under varying conditions.
   - **Pathogen Monitoring:** Continuously monitor rust pathogen populations to ensure that the resistance conferred by Gene X and Gene Z remains effective over time.

**6. Collaboration and Data Sharing:**
   - **Research Partnerships:** Collaborate with research institutions and agricultural organizations to share genetic data, breeding lines, and best practices.
   - **Data Management:** Maintain a comprehensive database of genetic markers, phenotypic traits, and breeding outcomes to inform future breeding decisions and track progress.

**7. Sustainability and Farmer Adoption:**
   - **Seed Distribution:** Develop a strategy for the distribution of the new rust-resistant, high-yield, and drought-tolerant wheat varieties to farmers.
   - **Training and Support:** Provide training to farmers on the benefits and cultivation practices of the new varieties to ensure successful adoption and maximize impact.

**Conclusion:**
By integrating Gene X and Gene Z through marker-assisted selection and genomic tools, and by incorporating high yield and drought tolerance traits, the breeding program can successfully develop a robust wheat variety. This variety will not only resist rust pathogens but also thrive under drought conditions, ensuring food security and enhancing agricultural sustainability.

These examples demonstrate how Large Language Models (LLMs) like OpenAI's GPT-4 can be integrated into various agricultural applications to enhance decision-making, provide actionable insights, and support sustainable farming practices.

Just a quick note: make sure you handle API keys securely and comply with OpenAI's usage policies when implementing these solutions.

These strategies represent a paradigm shift towards more resilient, efficient, and sustainable farming practices. By enabling predictive analytics, precision agriculture, and enhanced soil management, AI empowers farmers to make smarter decisions, optimize resource use, and achieve higher yields.

Chapter 2: How to Enhance Crop Yields and Productivity

Modern agriculture faces a plethora of challenges, including climate variability, resource scarcity, and the need for increased productivity. To navigate these complexities, contemporary farmers are increasingly turning to cutting-edge soil mapping techniques facilitated by advancements in computer vision and machine learning.

Soil mapping involves the systematic collection, analysis, and visualization of soil properties across agricultural fields. By incorporating technologies like AI, farmers can now produce high-resolution soil maps that reveal intricate details about soil quality, moisture levels, and nutrient content.

This knowledge is foundational for precision agriculture, a practice that emphasizes resource efficiency and sustainability by tailoring farming inputs to the specific needs of each soil type.

Within precision agriculture, Large Language Models (LLMs) can generate insights, recommendations, and explanations based on soil maps, crop health data, and sustainability metrics.

As above, I’ll include code snippets for each section in this chapter where an LLM, such as GPT-4, is used to enhance efficiency, improve crop health, and promote sustainable farming practices.

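Ensure that you have the openai Python package installed and have set up your API key properly before running the following code. The snippets use the openai.ChatCompletion interface, which was removed in version 1.0 of the package, so install a pre-1.0 release:

pip install "openai<1.0"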

import openai
import os

# Set your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")
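
A note on package versions: the snippets in this chapter call openai.ChatCompletion.create, which belongs to the pre-1.0 releases of the openai package and was removed in version 1.0. If you have a newer release installed, a thin wrapper along the following lines is one way to get the same result. This is a sketch rather than part of the original examples: ask_llm is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without a network call.

```python
def ask_llm(client, system_prompt, user_prompt, model="gpt-4"):
    """Send one system + user message pair and return the reply text.

    `client` is expected to be an `openai.OpenAI()` instance (openai >= 1.0),
    which reads OPENAI_API_KEY from the environment by default.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    # In the 1.x client the reply text is an attribute, not a dict key
    return response.choices[0].message.content


# Usage (requires the openai package and a valid API key):
# from openai import OpenAI
# client = OpenAI()
# print(ask_llm(client, "You are an agricultural expert.", "Summarize Zone A."))
```

With a helper like this, each example only needs to supply its own system prompt and data description.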

Alright, now we can dive into learning about the advantages and challenges of precision agriculture – with our code examples to guide us.

The Advantages of Precision Agriculture

1. Enhanced Efficiency

The central tenet of precision agriculture is maximizing efficiency. By using soil maps, farmers can precisely calibrate the application of water, fertilizers, and pesticides.

Traditional farming methods often involve uniform applications across an entire field, leading to overuse in some areas and underuse in others. Soil mapping helps farmers identify zones with varying needs, ensuring each section of the field receives the optimal amount of inputs.

For instance, an area identified as nutrient-rich may require minimal fertilization, whereas nutrient-poor zones can be targeted with customized fertilizer applications. This targeted approach conserves resources while enhancing overall farm productivity.

Consider a wheat farm that used traditional uniform fertilization methods. By switching to precision agriculture guided by detailed soil maps, the farmer could reduce fertilizer use by, say, 20% while increasing yield by 15%. This not only cuts costs but also minimizes environmental impact, showcasing a win-win scenario both economically and ecologically.

Now, let’s look at a code example to put this into practice.

Objective: Use LLMs to generate optimized fertilization schedules based on soil maps, minimizing resource usage and enhancing farm productivity.

import openai

# Sample soil data for a wheat farm (soil nutrient levels in different zones)
soil_map_data = {
    "Zone_A": {"nutrients": "high", "water_requirement": "low", "fertilizer_recommendation": "minimal"},
    "Zone_B": {"nutrients": "low", "water_requirement": "medium", "fertilizer_recommendation": "high"},
    "Zone_C": {"nutrients": "medium", "water_requirement": "high", "fertilizer_recommendation": "moderate"}
}

# Convert soil data into a descriptive text
soil_description = (
    f"Zone A has high nutrients and low water requirement. Zone B has low nutrients and medium water requirement. "
    f"Zone C has medium nutrients and high water requirement."
)

# Use LLM to generate a targeted fertilization plan based on soil map data
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an agricultural expert specializing in precision farming."},
        {"role": "user", "content": f"Based on the following soil map data, create an optimized fertilization plan: {soil_description}"}
    ]
)

fertilization_plan = response.choices[0].message['content']
print(fertilization_plan)

Sample Output:

**Optimized Fertilization Plan:**

- **Zone A:** Since nutrients are high and water requirements are low, apply minimal fertilizer (around 10% of the recommended rate) and avoid excessive watering. Focus on maintaining nutrient levels and monitor soil moisture regularly.

- **Zone B:** Nutrients are low, so apply a high dose of nitrogen-based fertilizer to boost soil fertility. Watering should be done at medium levels to ensure proper nutrient absorption. Use 80-90% of the recommended fertilizer rate for nutrient-poor soils.

- **Zone C:** Apply a moderate amount of fertilizer (50-60% of the recommended rate) to ensure nutrient balance. Since water requirements are high, implement a regular irrigation schedule to maintain soil moisture at optimal levels.

By applying this plan, fertilizer usage can be reduced by 20%, while maximizing crop yield and minimizing environmental impact.
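
A figure like the 20% reduction quoted in the sample output is generated text, so it is worth checking against the rates in the plan. The sketch below compares variable-rate application with a uniform full-rate baseline. The zone areas and rate fractions are made-up illustrative values, not data from a real farm, and fertilizer_saving is a hypothetical helper.

```python
def fertilizer_saving(zone_areas_ha, rate_fractions):
    """Fraction of fertilizer saved versus applying the full rate everywhere.

    zone_areas_ha:  {zone: area in hectares}
    rate_fractions: {zone: fraction of the full recommended rate, 0.0-1.0}
    """
    total_area = sum(zone_areas_ha.values())
    # Fertilizer used, in units of "full rate x hectares"
    used = sum(zone_areas_ha[z] * rate_fractions[z] for z in zone_areas_ha)
    return 1 - used / total_area


# Hypothetical areas and rates (assumptions for illustration only)
areas = {"Zone_A": 10, "Zone_B": 60, "Zone_C": 30}
rates = {"Zone_A": 0.10, "Zone_B": 0.90, "Zone_C": 0.80}

print(f"Fertilizer saved: {fertilizer_saving(areas, rates):.0%}")
```

With these numbers the saving comes to about 21%. Different zone areas or rates would give a different figure, which is why the calculation is worth running on your own field data.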

2. Improved Crop Health

Soil is the lifeblood of crops, and its condition directly affects plant health. Detailed soil mapping enables farmers to monitor and address issues proactively.

For instance, if a specific area within a field shows signs of nutrient deficiency or excess salinity, remedial measures can be taken immediately. This proactive stance prevents problems before they escalate, ensuring that crops grow in optimal conditions throughout their life cycle.

In a vineyard, soil mapping may reveal high salinity levels in a particular section, which could adversely affect grape quality. By identifying and treating these areas with appropriate soil amendments, the vineyard can improve grape quality and yield, leading to better wine production and higher profits.

Now let’s look at a code example to help show how proactive soil monitoring can actually improve crop health.

Objective: Utilize an LLM to provide recommendations for addressing soil salinity and nutrient deficiencies based on real-time soil health data.

import openai

# Sample data from soil monitoring in a vineyard
soil_health_data = {
    "Zone_A": {"salinity": "high", "nutrient_deficiency": "none"},
    "Zone_B": {"salinity": "normal", "nutrient_deficiency": "low phosphorus"},
    "Zone_C": {"salinity": "normal", "nutrient_deficiency": "low nitrogen"}
}

# Convert soil health data into a descriptive text
soil_health_description = (
    "Zone A has high salinity but no nutrient deficiency. "
    "Zone B has normal salinity but is low in phosphorus. "
    "Zone C has normal salinity but is low in nitrogen."
)

# Use LLM to generate recommendations for improving crop health based on soil data
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in soil health and crop management."},
        {"role": "user", "content": f"Based on the following soil health data, provide recommendations to improve crop health: {soil_health_description}"}
    ]
)

crop_health_recommendations = response.choices[0].message['content']
print(crop_health_recommendations)

Sample Output:

**Crop Health Recommendations:**

- **Zone A (High Salinity):** Implement soil amendments, such as gypsum, to reduce salinity levels. Ensure that irrigation water is low in salt content to prevent further salinity buildup. Consider deep leaching to flush salts from the root zone.

- **Zone B (Low Phosphorus):** Apply phosphorus-rich fertilizers, such as superphosphate or bone meal, to address the deficiency. Focus on early applications during the growing season to promote root development.

- **Zone C (Low Nitrogen):** Apply a nitrogen-rich fertilizer, such as urea or ammonium nitrate, to boost nitrogen levels. Ensure that applications are spaced out to prevent nitrogen leaching and optimize absorption by the crops.

These actions will enhance grape quality and overall crop yield, improving profitability and sustainability.

3. Sustainable Farming Practices

Precision agriculture is synonymous with sustainability. Traditional farming methods often involve excessive use of water, fertilizers, and pesticides, contributing to resource depletion and environmental degradation.

Precise soil mapping helps in reducing these inputs to only what is necessary, fostering sustainable agricultural practices. This not only conserves resources but also minimizes the ecological footprint of farming activities.

For example, a rice grower in a water-scarce region can use soil moisture maps to implement a precise irrigation schedule. This approach could reduce water use by as much as 30%, conserve groundwater resources, and enhance crop yield by ensuring consistent soil moisture levels.

Let’s go through a code example that shows how precision irrigation can be implemented using AI tools.

Objective: Leverage an LLM to generate irrigation schedules based on soil moisture maps for sustainable water use.

import openai

# Sample soil moisture data for a rice grower
soil_moisture_map = {
    "Field_A": {"moisture_level": "high", "irrigation_requirement": "low"},
    "Field_B": {"moisture_level": "moderate", "irrigation_requirement": "medium"},
    "Field_C": {"moisture_level": "low", "irrigation_requirement": "high"}
}

# Convert soil moisture data into a descriptive text
moisture_description = (
    f"Field A has high soil moisture and low irrigation requirements. "
    f"Field B has moderate soil moisture and medium irrigation requirements. "
    f"Field C has low soil moisture and high irrigation requirements."
)

# Use LLM to generate a water-saving irrigation schedule
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in sustainable farming and irrigation management."},
        {"role": "user", "content": f"Based on the following soil moisture data, generate an efficient irrigation schedule: {moisture_description}"}
    ]
)

irrigation_schedule = response.choices[0].message['content']
print(irrigation_schedule)

Sample Output:

**Water-Efficient Irrigation Schedule:**

- **Field A (High Moisture):** No immediate irrigation is needed. Monitor moisture levels over the next 7-10 days and consider irrigation only if the moisture level drops below optimal thresholds. Focus on water conservation in this zone.

- **Field B (Moderate Moisture):** Irrigate this field at medium intensity (50-60% of the standard rate) to maintain consistent soil moisture. Irrigation can be scheduled every 3-4 days based on weather conditions.

- **Field C (Low Moisture):** Prioritize this field for irrigation with high-intensity watering (80-90% of the standard rate). Schedule irrigation every 2 days to ensure sufficient moisture levels, especially during the critical growth phase.

By following this schedule, water usage can be reduced by 30%, conserving resources while ensuring optimal soil moisture for crop growth.

4. Data-Driven Decision Making

The integration of AI in soil mapping transforms raw data into actionable insights. AI-powered models can analyze soil characteristics and predict how different crops will respond to specific conditions.

This predictive capability empowers farmers to make informed decisions that optimize productivity and profitability. It also allows for real-time monitoring and adjustments, ensuring that farming practices evolve dynamically based on current data.

And lastly, let’s see how combining LLMs and precision agriculture can help you make data-driven decisions.

Objective: Integrate an LLM into a decision-making system that takes into account various precision agriculture metrics (soil health, moisture, nutrients) to suggest comprehensive farming strategies.

import openai

# Comprehensive data for a wheat farm
precision_agriculture_data = {
    "soil_nutrients": {
        "Zone_A": {"nitrogen": "high", "phosphorus": "moderate", "potassium": "low"},
        "Zone_B": {"nitrogen": "low", "phosphorus": "high", "potassium": "moderate"},
        "Zone_C": {"nitrogen": "moderate", "phosphorus": "low", "potassium": "high"}
    },
    "moisture_levels": {
        "Zone_A": "low",
        "Zone_B": "moderate",
        "Zone_C": "high"
    },
    "crop_type": "wheat"
}

# Convert precision agriculture data into a descriptive text
precision_data_description = (
    f"Zone A has high nitrogen, moderate phosphorus, and low potassium with low moisture levels. "
    f"Zone B has low nitrogen, high phosphorus, and moderate potassium with moderate moisture levels. "
    f"Zone C has moderate nitrogen, low phosphorus, and high potassium with high moisture levels."
)

# Use LLM to generate a comprehensive farming strategy
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an agricultural consultant specializing in precision farming."},
        {"role": "user", "content": f"Based on the following precision agriculture data, provide a comprehensive farming strategy: {precision_data_description}"}
    ]
)

farming_strategy = response.choices[0].message['content']
print(farming_strategy)

Sample Output:

**Comprehensive Farming Strategy for Wheat:**

- **Zone A:** 
  - **Nutrient Management:** Since nitrogen levels are high and potassium is low, apply a potassium-rich fertilizer (e.g., potassium sulfate) to balance nutrient availability. Avoid applying additional nitrogen to prevent over-fertilization.
  - **Moisture Management:** Moisture levels are low, so prioritize irrigation in this zone. Implement drip irrigation to target water delivery effectively without wastage.

- **Zone B:** 
  - **Nutrient Management:** Low nitrogen levels suggest the need for a nitrogen-based fertilizer (e.g., urea or ammonium nitrate). Since phosphorus is already high, avoid adding phosphorus-rich fertilizers. Focus on nitrogen supplementation for optimal growth.
  - **Moisture Management:** Moderate moisture levels are sufficient. Irrigate at a moderate intensity (50-60% of the standard rate) every 3-4 days.

- **Zone C:** 
  - **Nutrient Management:** Moderate nitrogen levels are acceptable, but low phosphorus levels require attention. Apply a phosphorus-rich fertilizer (e.g., superphosphate) to boost phosphorus content. Maintain potassium levels by applying a balanced fertilizer as needed.
  - **Moisture Management:** Since moisture levels are high, irrigation can be minimized or delayed. Monitor soil moisture closely and irrigate only if levels drop below optimal thresholds.

This strategy will optimize nutrient management, reduce water usage, and ensure higher wheat yields across all zones. By implementing targeted interventions, you can increase crop productivity while minimizing resource inputs.

In these examples, you saw how LLMs can help you analyze data from precision agriculture, provide actionable recommendations, and generate optimized strategies for enhancing efficiency, improving crop health, and promoting sustainable practices.

LLMs can handle a variety of agricultural data inputs and deliver personalized insights that help farmers make informed decisions, optimizing their farming processes.
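
The descriptions passed to the model in the examples above are typed out by hand, so they can drift out of sync with the underlying dictionaries. One option is to build the text from the data itself. The helper below is a hypothetical sketch (the name describe_zones is not from any library) that works on flat per-zone dictionaries like soil_map_data.

```python
def describe_zones(zone_data):
    """Turn {zone: {attribute: value}} into one sentence per zone."""
    sentences = []
    for zone, attributes in zone_data.items():
        # "water_requirement": "low" -> "water requirement low"
        parts = [f"{name.replace('_', ' ')} {value}" for name, value in attributes.items()]
        sentences.append(f"{zone.replace('_', ' ')}: {', '.join(parts)}.")
    return " ".join(sentences)


soil_map_data = {
    "Zone_A": {"nutrients": "high", "water_requirement": "low"},
    "Zone_B": {"nutrients": "low", "water_requirement": "medium"},
}

print(describe_zones(soil_map_data))
# Zone A: nutrients high, water requirement low. Zone B: nutrients low, water requirement medium.
```

The resulting string can be dropped into the user message in place of a hand-written description, so a change to the data automatically changes the prompt.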

Challenges of Precision Agriculture

1. The Initial Investment

One of the primary challenges in adopting precision agriculture is the significant initial investment. Advanced soil mapping technologies, AI models, and precision farming equipment require substantial capital outlay. But the long-term benefits – heightened crop yields, reduced input costs, and sustainable farming practices – often justify this upfront expenditure.

Financial aid and subsidies from governments and agricultural bodies can also mitigate the initial costs, making these technologies more accessible to small and medium-sized farmers.

As a solution, financial planning and incremental investments can ease the transition to precision agriculture. Farmers can start with essential technologies and gradually expand their toolkit as the initial benefits begin to materialize, thereby reducing financial strain.

2. Data Accuracy and Security

The effectiveness of AI-driven soil mapping hinges on the accuracy and security of data. Inaccurate data can lead to poor decision-making, negating the benefits of precision agriculture. Also, data privacy concerns and the potential for cyber threats necessitate robust security measures.

To combat these challenges, try implementing rigorous data validation protocols. These can help ensure the accuracy of collected data. Also, employ advanced cybersecurity measures that protect against data breaches, thereby maintaining the integrity and confidentiality of valuable agricultural data.
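
As a rough illustration of what a validation step might look like, the sketch below rejects sensor readings that are missing or fall outside a plausible range before they reach any model. The field names and ranges are assumptions chosen for the example. Real limits depend on the sensors and soils involved.

```python
# Plausible ranges per field (illustrative assumptions, not agronomic standards)
VALID_RANGES = {
    "soil_moisture_pct": (0.0, 100.0),
    "ph": (3.0, 10.0),
    "temperature_c": (-40.0, 60.0),
}


def validate_reading(reading):
    """Return a list of problems found in one sensor reading (empty if clean)."""
    problems = []
    for field, (low, high) in VALID_RANGES.items():
        value = reading.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not isinstance(value, (int, float)) or isinstance(value, bool):
            problems.append(f"{field}: not a number")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


readings = [
    {"soil_moisture_pct": 42.5, "ph": 6.8, "temperature_c": 21.0},
    {"soil_moisture_pct": 140.0, "ph": 6.5},  # impossible moisture, no temperature
]

for r in readings:
    issues = validate_reading(r)
    print("OK" if not issues else issues)
```

Readings that fail the check can be logged and excluded, so a faulty sensor does not silently skew a recommendation.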

Soil Mapping + AI For the Win

Soil mapping techniques, augmented by AI and machine learning, are revolutionizing precision agriculture. By providing detailed insights into soil conditions, these technologies enable farmers to enhance efficiency, improve crop health, adopt sustainable practices, and make informed decisions.

Despite challenges such as initial investment and data security, the long-term benefits of precision agriculture are profound, promising increased crop yields and reduced environmental impact.

As the agricultural sector continues to innovate, soil mapping will undoubtedly play a pivotal role in shaping the future of farming, fostering a more productive and sustainable agricultural landscape for generations to come.

Chapter 3: Labor Optimization Solutions Through AI in Agriculture

Agricultural enterprises worldwide are increasingly leveraging Artificial Intelligence (AI) to address one of the most pressing challenges: labor shortages. AI technologies offer transformative solutions that enhance efficiency and optimize various operations within the sector.

By examining AI's role in enhancing farm labor management, precision agriculture, and AI-driven robotics and automation, we can appreciate its profound impact on overcoming workforce scarcity.

Enhanced Farm Labor Management

Farm labor management has traditionally been resource-intensive, often hindered by inefficiencies resulting from manual planning and unpredictable variables like weather.

AI models integrated into farm management software revolutionize this space by enabling highly precise resource allocation and task assignment. Machine learning algorithms analyze extensive datasets encompassing soil conditions, weather patterns, crop growth stages, and historical farm performance to devise actionable insights.

For example, AI can identify the optimal times for planting, irrigating, and harvesting by processing current and forecast data. This predictive capability ensures farming activities are synchronized with peak resource availability, minimizing labor bottlenecks. This means that farms can plan their workforce requirements more effectively, reducing downtime and enhancing overall productivity.

But AI's potential extends beyond mere task scheduling. It supports decision-making processes through real-time feedback mechanisms, allowing farm managers to adjust strategies dynamically. For instance, if an unexpected weather change is detected, AI can prompt adjustments to irrigation schedules or suggest protective measures, thereby safeguarding crops and ensuring labor is utilized efficiently.

Let’s look at an example of how you’d put this into practice.

Objective: Utilize an LLM to generate dynamic task scheduling for farm labor management based on weather, soil, and crop growth data. The system adapts in real-time to changing environmental conditions.

import openai

# Sample environmental data (weather, soil moisture, crop growth)
environmental_data = {
    "weather_forecast": {
        "today": {"temp": 28, "precipitation": 20, "wind_speed": 10},
        "tomorrow": {"temp": 30, "precipitation": 50, "wind_speed": 5}
    },
    "soil_conditions": {
        "moisture_level": 60,  # percentage
        "fertility_level": "high"
    },
    "crop_stage": "vegetative"
}

# Convert environmental data into a readable description
# (both days are included so the model can plan around tomorrow's forecast)
today = environmental_data['weather_forecast']['today']
tomorrow = environmental_data['weather_forecast']['tomorrow']
environment_description = (
    f"Today's weather forecast: temperature {today['temp']}°C, "
    f"precipitation {today['precipitation']}mm, wind speed {today['wind_speed']} km/h. "
    f"Tomorrow's weather forecast: temperature {tomorrow['temp']}°C, "
    f"precipitation {tomorrow['precipitation']}mm, wind speed {tomorrow['wind_speed']} km/h. "
    f"Soil moisture level is {environmental_data['soil_conditions']['moisture_level']}% and fertility level is {environmental_data['soil_conditions']['fertility_level']}. "
    f"The crop is currently in the {environmental_data['crop_stage']} stage."
)

# Use LLM to generate a farm labor schedule based on environmental conditions
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in farm labor management using AI."},
        {"role": "user", "content": f"Given the following environmental data, provide a dynamic labor schedule for planting, irrigation, and harvesting: {environment_description}"}
    ]
)

labor_schedule = response.choices[0].message['content']
print(labor_schedule)

Sample Output:

**Dynamic Farm Labor Schedule for Today:**

- **Planting:** The weather forecast suggests light precipitation (20mm), which is suitable for planting. Labor should focus on planting in Zone A and B during the morning hours when the temperature is cooler (28°C). Adjustments may be required if precipitation increases.

- **Irrigation:** Soil moisture levels are at 60%, which is adequate for today. No immediate irrigation is needed, but continue to monitor moisture levels. If levels drop below 50%, schedule irrigation for tomorrow morning before temperatures rise.

- **Harvesting:** There are no immediate harvesting requirements as the crop is in the vegetative stage. However, labor should be allocated to check crop growth and ensure pest control measures are in place.

- **General Maintenance:** Given the weather conditions and wind speed of 10 km/h, it’s advisable to check equipment and infrastructure stability. Allocate a small team to inspect irrigation systems and prepare for tomorrow's forecasted heavier rain (50mm).

This example focused on enhancing farm labor management by dynamically generating a labor schedule for farming tasks (for example, planting, irrigation, harvesting) based on real-time environmental data such as weather, soil conditions, and crop growth stages. The LLM ensured that the labor schedule adapted to changing conditions.
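
The snippet above produces a schedule once. To adapt as conditions change, something has to decide when a new forecast differs enough to justify asking for a fresh schedule. A minimal sketch of that check follows. The thresholds are arbitrary placeholder values, and needs_reschedule is a hypothetical helper.

```python
# Change in each forecast value that should trigger a new schedule
# (placeholder thresholds, chosen only for illustration)
THRESHOLDS = {"temp": 5, "precipitation": 15, "wind_speed": 20}


def needs_reschedule(previous, current, thresholds=THRESHOLDS):
    """Return the forecast fields whose change meets or exceeds its threshold."""
    return [
        field
        for field, limit in thresholds.items()
        if abs(current[field] - previous[field]) >= limit
    ]


morning = {"temp": 28, "precipitation": 20, "wind_speed": 10}
midday_update = {"temp": 29, "precipitation": 50, "wind_speed": 12}

changed = needs_reschedule(morning, midday_update)
if changed:
    print("Regenerate schedule; changed:", changed)
```

Here only precipitation crosses its threshold, so the schedule would be regenerated with the updated forecast in the prompt.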

Precision Agriculture for Labor Optimization

Precision agriculture exemplifies the integration of AI and predictive analytics to optimize labor usage. This approach tailors farming practices to the specific needs of different field zones by analyzing real-time data on soil moisture levels, crop health, and weather conditions. Integrating AI into precision agriculture amplifies its effectiveness.

Imagine a farmer managing a vast field with varying soil types and fertility levels. Traditionally, uniform treatment would have been applied across the entire field, leading to inefficiencies and potential wastage of resources.

But AI can create detailed field maps, segmenting the land into manageable zones, each with tailored treatment plans. This ensures that labor-intensive tasks such as fertilization and pest control are precisely directed where needed, maximizing their impact and conserving resources.

AI's real-time data processing capabilities also enable predictive maintenance of equipment. By continuously monitoring machinery and identifying signs of wear or potential failure, AI-driven systems can schedule preemptive repairs, preventing costly downtime and labor disruptions. This predictive maintenance significantly enhances operational efficiency and prolongs the lifespan of equipment, leading to long-term cost savings.

Now let’s see an example of how you could use precision agriculture with LLMs to optimize labor and resources:

Objective: Integrate an LLM to analyze real-time precision agriculture data and provide recommendations for labor allocation in specific zones based on soil moisture, crop health, and machine maintenance needs.

import openai

# Sample precision agriculture data for a large field
precision_ag_data = {
    "zones": {
        "Zone_1": {"soil_moisture": 40, "crop_health": "good", "fertilization_need": "low"},
        "Zone_2": {"soil_moisture": 30, "crop_health": "moderate", "fertilization_need": "high"},
        "Zone_3": {"soil_moisture": 25, "crop_health": "poor", "fertilization_need": "high"}
    },
    "machinery_status": {
        "tractor_1": {"status": "operational", "maintenance_due_in_days": 5},
        "tractor_2": {"status": "requires_maintenance", "maintenance_due_in_days": 0}
    }
}

# Convert precision agriculture data into a readable description
agriculture_description = (
    f"Zone 1 has soil moisture at 40%, crop health is good, and low fertilization is needed. "
    f"Zone 2 has soil moisture at 30%, crop health is moderate, and high fertilization is needed. "
    f"Zone 3 has soil moisture at 25%, crop health is poor, and high fertilization is needed. "
    f"Tractor 1 is operational and requires maintenance in 5 days. Tractor 2 requires immediate maintenance."
)

# Use LLM to generate labor allocation recommendations
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert specializing in precision agriculture labor optimization."},
        {"role": "user", "content": f"Based on the following precision agriculture data, provide labor recommendations for today: {agriculture_description}"}
    ]
)

labor_recommendations = response.choices[0].message['content']
print(labor_recommendations)

Sample Output:

**Labor Allocation Recommendations for Today:**

- **Zone 1:** Since soil moisture is at 40% and crop health is good, allocate minimal labor for fertilization here. Focus on general crop monitoring, with labor redirected to other zones.

- **Zone 2:** This zone requires more attention due to moderate crop health and lower soil moisture (30%). Allocate a team for high-level fertilization tasks and schedule irrigation in the afternoon when temperatures are lower. Ensure laborers monitor soil moisture to avoid overwatering.

- **Zone 3:** Given the poor crop health and low soil moisture (25%), prioritize labor here. Allocate labor for both high-level fertilization and immediate irrigation. Additionally, plan a follow-up visit to assess crop recovery within 48 hours. 

- **Machinery:** Tractor 2 requires immediate maintenance and should not be used today. Tractor 1 is operational but will require maintenance in the coming days. Assign a small maintenance crew to inspect Tractor 1 and prepare it for upcoming tasks.

These labor recommendations will help optimize workforce distribution while ensuring efficient resource use and timely crop interventions.

In this example, you saw how you can use precision agriculture with LLMs to analyze zone-specific data (soil moisture, crop health) and provide optimized labor allocation recommendations. It also considered machinery maintenance requirements to prevent downtime.
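
Not every part of this needs a language model. Ordering machines by how soon they need servicing is a deterministic step that can run before the prompt is built. The sketch below uses the same shape as machinery_status above. The 7-day planning window and the extra sprayer_1 entry are assumptions added for illustration.

```python
def maintenance_queue(machinery_status, window_days=7):
    """List machines due for maintenance within the window, most urgent first."""
    due = [
        (name, info["maintenance_due_in_days"])
        for name, info in machinery_status.items()
        if info["maintenance_due_in_days"] <= window_days
    ]
    return sorted(due, key=lambda item: item[1])


machinery_status = {
    "tractor_1": {"status": "operational", "maintenance_due_in_days": 5},
    "tractor_2": {"status": "requires_maintenance", "maintenance_due_in_days": 0},
    "sprayer_1": {"status": "operational", "maintenance_due_in_days": 30},  # hypothetical
}

for name, days in maintenance_queue(machinery_status):
    label = "now" if days == 0 else f"in {days} days"
    print(f"{name}: service {label}")
```

The sorted list can then be included in the data description, leaving the model to reason about labor allocation rather than date arithmetic.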

AI-Driven Robotics and Automation

One of the most profound applications of AI in agriculture is in robotics and automation. AI-driven robots are designed to perform tasks traditionally requiring manual labor, such as planting, harvesting, and sorting. These robots are not only faster and more accurate but also capable of operating in conditions that might be challenging for human workers.

Take autonomous tractors, for instance. These vehicles use AI to navigate fields, planting seeds with pinpoint accuracy. They can work tirelessly, undeterred by fatigue or harsh weather, resulting in more consistent and higher-quality planting.

Similarly, harvesting robots equipped with advanced sensors and machine learning algorithms can distinguish between ripe and unripe fruits, ensuring optimal harvest times and reducing wastage.

Robotic process automation extends to post-harvest activities as well. Automated systems for sorting and packaging crops enhance the speed and accuracy of these labor-intensive tasks. These robots can be trained to recognize various crop qualities, ensuring only the best produce reaches the market.

AI-driven robotics can also adapt to various environmental conditions and crop varieties. This adaptability ensures that farms employing AI technologies enjoy consistent performance regardless of changes in soil types or weather patterns, overcoming one of the significant limitations of traditional farming methods.

Sustainable Farming Practices

The integration of AI technologies in agriculture also paves the way for sustainable farming practices. By optimizing resource utilization and minimizing wastage, AI helps in reducing the environmental footprint of agricultural activities. For instance, precision irrigation systems using AI algorithms ensure water is used efficiently, addressing sustainability concerns in water-scarce regions.

Furthermore, AI can assist in monitoring and managing the health of crops with minimal chemical inputs. Machine learning algorithms can analyze data from sensors and detect signs of diseases or pest attacks early, allowing for targeted intervention with minimal pesticide use. This approach not only ensures healthier crops but also contributes to better environmental and consumer health.
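
One simple way to flag a reading worth a closer look is to compare it with the recent history of the same sensor. The sketch below uses a z-score computed with Python's statistics module. It stands in for the machine learning models mentioned above rather than reproducing them, and the readings and the threshold of 3 are invented for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, new_value, z_threshold=3.0):
    """True if new_value is more than z_threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return new_value != history[0]
    return abs(new_value - mean(history)) / spread > z_threshold


# Hypothetical leaf-wetness sensor readings (%)
recent = [31, 29, 30, 32, 30, 31, 29, 30]

print(is_anomalous(recent, 31))  # False: within normal variation
print(is_anomalous(recent, 45))  # True: far outside recent readings
```

A flagged reading is only a prompt for inspection. Confirming a disease or pest problem still needs a trained model or a person in the field.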

Now you have a better idea about how AI can work to address the persistent issue of labor shortages in agriculture. By enhancing farm labor management, enabling precision agriculture, and driving robotics and automation, AI technologies significantly boost operational efficiency and productivity. These innovations ensure that farmers can manage their resources more effectively, maintain sustainable practices, and ultimately achieve higher crop yields.

Chapter 4: Predictive Analytics and Machine Learning in Crop Yield Improvement

Advances in AI are opening a transformative era for agriculture, one in which crop yields could rise by as much as 70% by 2030. This leap hinges on the effective use of predictive analytics and machine learning, two potent tools that are reshaping modern farming.

Let's look at how these technologies can elevate agricultural practices and drive substantial improvements in crop yield.

Predictive Analytics: Optimizing Agricultural Processes

Predictive analytics leverages historical data, real-time information, and weather patterns to provide farmers with actionable insights. This highly nuanced approach facilitates precise decision-making, thus optimizing the entire agricultural value chain.

Imagine a farmer who has consistently struggled with unpredictable weather and its impact on planting schedules. By utilizing predictive analytics, historical weather patterns can be analyzed alongside real-time meteorological data to forecast the optimal planting period. This allows the farmer to sow crops under conditions most conducive to their growth, thus enhancing the probability of higher yields.

Predictive analytics also helps in fine-tuning irrigation strategies. Water scarcity is a persistent challenge in agriculture, particularly in arid regions. By analyzing soil moisture levels and weather forecasts, farmers can precisely schedule irrigation, ensuring plants receive the exact amount of water they need without wastage. This not only conserves water but also promotes healthier crop growth, which directly translates to improved yields.
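To make the irrigation idea concrete, here is a minimal sketch of a rule that combines a soil moisture reading with forecast rain. The function name `irrigation_mm` and the calibration values `target_pct` and `mm_per_pct` are assumptions for illustration, not values from a real agronomic model.

```python
def irrigation_mm(soil_moisture_pct, forecast_rain_mm,
                  target_pct=40.0, mm_per_pct=2.0):
    """Return millimetres of irrigation to apply today (0 if none needed).

    target_pct and mm_per_pct are hypothetical calibration values:
    the moisture level we want to reach, and how many mm of water
    raise soil moisture by one percentage point in this field.
    """
    deficit_pct = max(target_pct - soil_moisture_pct, 0.0)
    needed_mm = deficit_pct * mm_per_pct
    # Expected rain covers part of the need; never return a negative amount.
    return max(needed_mm - forecast_rain_mm, 0.0)


print(irrigation_mm(35, forecast_rain_mm=5))  # 5.0 (10 mm needed, 5 mm covered by rain)
print(irrigation_mm(45, forecast_rain_mm=0))  # 0.0 (already above target)
```

A production system would likely replace the fixed conversion factor with a soil-specific water balance, but the structure stays the same: estimate the deficit, subtract expected rain, and irrigate only the remainder.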

Plant protection is another area where predictive analytics excels. By observing historical pest invasion data and current climatic conditions, farmers can predict pest outbreaks and implement timely, targeted interventions. Such foresight prevents extensive crop damage and reduces the dependency on chemical pesticides, fostering a more sustainable agricultural practice.

Machine Learning in Intelligent Decision-Making

Machine learning algorithms further elevate the capabilities of predictive analytics by enabling the creation of highly personalized AI models. These models are specifically tailored to a farm's unique characteristics—soil type, crop variety, local climate conditions—and can process vast datasets to offer precision farming recommendations.

Consider a scenario where a farm's soil is nutrient-deficient. Traditional methods might rely on broad-spectrum fertilizers, often leading to nutrient imbalance and soil degradation. But with machine learning, farmers can analyze soil samples to determine the specific nutrient deficiencies and develop custom fertilizer blends that address these gaps precisely. Over time, as the model ingests more data, its recommendations become more accurate, ensuring that crops receive optimal nutrition, which significantly boosts yields.
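As a rough illustration of the step from soil sample to custom blend, the sketch below computes each nutrient's shortfall against a target and expresses the shortfalls as shares of a blend. The target levels in `TARGET_PPM` and the helper names are hypothetical, and treating ppm shortfalls as blend proportions is a deliberate simplification.

```python
# Hypothetical target levels in ppm; real targets depend on crop and lab method.
TARGET_PPM = {"nitrogen": 50, "phosphorus": 25, "potassium": 40}


def nutrient_deficits(measured_ppm, targets=TARGET_PPM):
    """Return the shortfall (ppm) for each nutrient that is below target."""
    return {
        nutrient: target - measured_ppm.get(nutrient, 0)
        for nutrient, target in targets.items()
        if measured_ppm.get(nutrient, 0) < target
    }


def blend_ratio(deficits):
    """Express deficits as percentage shares of a custom blend."""
    total = sum(deficits.values())
    if total == 0:
        return {}
    return {n: round(100 * d / total, 1) for n, d in deficits.items()}


soil_sample = {"nitrogen": 30, "phosphorus": 15, "potassium": 40}
deficits = nutrient_deficits(soil_sample)
print(deficits)               # {'nitrogen': 20, 'phosphorus': 10}
print(blend_ratio(deficits))  # {'nitrogen': 66.7, 'phosphorus': 33.3}
```

A learned model would refine the targets over time as more samples and yield results come in, which is the improvement loop described above.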

Machine learning can also revolutionize crop variety selection. Season after season, choosing the right crop variety to plant is a critical yet challenging decision. By analyzing data from past harvests, climate patterns, and market demands, machine learning models can predict which crop varieties are most likely to thrive and be profitable in a given region and season. This data-driven approach minimizes the guesswork and enhances the likelihood of successful harvests.

Empowering Farmers with Data-Driven Insights

The integration of predictive analytics and machine learning empowers farmers with real-time, data-driven insights, transforming agriculture into a precision-driven industry. Access to such precise information enables quick and informed decisions that maximize resources and mitigate risks.

Take, for example, the task of monitoring soil health. Traditionally, farmers relied on sporadic soil tests, which might miss critical variations in soil conditions. With continuous data collection through sensors and real-time analytics, farmers can monitor soil health consistently. If a sudden drop in soil moisture is detected, an immediate analysis can identify the cause, prompting timely corrective actions such as adjusted irrigation or the application of mulching to conserve moisture.
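One simple way to flag a sudden drop in a stream of sensor readings is to compare each reading with the previous one. The sketch below assumes hourly moisture percentages and a hypothetical alert threshold `max_drop_pct`; real deployments would tune the threshold per sensor and soil type.

```python
def sudden_drops(readings, max_drop_pct=5.0):
    """Return (index, drop) pairs where moisture fell by more than
    max_drop_pct between consecutive readings.

    max_drop_pct is a hypothetical threshold chosen for illustration.
    """
    alerts = []
    for i in range(1, len(readings)):
        drop = readings[i - 1] - readings[i]
        if drop > max_drop_pct:
            alerts.append((i, drop))
    return alerts


hourly_moisture = [38.0, 37.5, 37.0, 30.0, 29.5]
print(sudden_drops(hourly_moisture))  # [(3, 7.0)]
```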

Weather predictions enhanced through machine learning algorithms also play a pivotal role. Real-time weather data can be continuously analyzed to detect emerging patterns or anomalies that might affect crop growth. For instance, an impending storm that could potentially cause flooding can be predicted, allowing farmers to apply preemptive measures such as improving drainage systems or temporarily covering crops to protect them.

Moreover, management practices can be adjusted dynamically based on insights from data on plant health. Advanced sensors can monitor plant conditions, identifying early signs of disease or nutrient deficiency. With immediate feedback, farmers can apply the necessary treatments long before visible symptoms appear, thus saving crops and increasing yields.

Advanced Insights for Sustainable Farming

Beyond immediate yield improvements, predictive analytics and machine learning promote sustainable farming practices by optimizing resource use and minimizing environmental impact.

Precision in fertilizer application, as discussed earlier, prevents over-fertilization and reduces the risk of groundwater contamination. Similarly, efficient water use strategies ensure that valuable freshwater resources are conserved, which is especially crucial in regions facing water scarcity.

By promoting sustainable practices, these technologies help build resilient agricultural systems capable of withstanding the adverse effects of climate change. For example, predictive models that anticipate climate variability and its impact on crop cycles enable farmers to adapt their strategies proactively. This adaptive capacity is vital for maintaining productivity as weather patterns become increasingly unpredictable.

Concrete Examples of Success

Real-world applications of these technologies offer compelling evidence of their efficacy. In the United States, the USDA has been leveraging predictive analytics to forecast corn yield with remarkable accuracy. By integrating satellite imagery, weather data, and advanced analytics, the USDA can predict yield variations and guide farmers in optimizing their practices accordingly.

In India, machine learning models have been employed to improve rice yields. By analyzing soil health, weather patterns, and pest data, these models provide tailored advice to farmers, resulting in significant yield increases. The models' success in one of the most challenging agricultural environments underscores the transformative potential of AI-driven solutions in diverse settings.

Code Examples

Here are two examples that demonstrate how large language model (LLM) applications can be integrated into the predictive analytics and machine learning aspects of agriculture to enhance crop yield optimization and sustainable farming practices.

Example 1: Predictive analytics for optimizing agricultural processes

Objective: Utilize an LLM to generate insights for a farmer on the optimal planting, irrigation, and pest control schedules based on historical weather patterns, real-time meteorological data, and soil moisture levels.

import openai

# Sample data on historical and current weather, soil moisture, and pest data
agricultural_data = {
    "historical_weather": "Over the past 10 years, this region has experienced optimal planting conditions between March 15 and April 10, with a dry spell in mid-April.",
    "current_weather": {
        "today": {"temperature": 25, "humidity": 60, "precipitation": 0, "wind_speed": 10},
        "forecast": [
            {"date": "2024-03-18", "temperature": 22, "humidity": 55, "precipitation": 5},
            {"date": "2024-03-19", "temperature": 24, "humidity": 50, "precipitation": 0}
        ]
    },
    "soil_moisture": 35,  # percentage
    "pest_risk": "Based on historical pest data and current climate conditions, there is a high risk of pest outbreaks in late April."
}

# Summarize the forecast so the LLM can reason about specific dates
forecast_summary = "; ".join(
    f"{day['date']}: {day['temperature']}°C, {day['humidity']}% humidity, "
    f"{day['precipitation']}mm precipitation"
    for day in agricultural_data['current_weather']['forecast']
)

# Create a readable summary of the data for the LLM
today = agricultural_data['current_weather']['today']
data_summary = (
    f"Historical weather data: {agricultural_data['historical_weather']} "
    f"Today's weather: Temperature {today['temperature']}°C, "
    f"Humidity {today['humidity']}%, "
    f"Precipitation {today['precipitation']}mm, "
    f"and Wind Speed {today['wind_speed']} km/h. "
    f"Forecast: {forecast_summary}. "
    f"Soil moisture is currently {agricultural_data['soil_moisture']}%. "
    f"Pest risk: {agricultural_data['pest_risk']}"
)

# Use an LLM to generate actionable insights for the farmer based on this data.
# Requires the openai package (version 1.0 or later) and an OPENAI_API_KEY environment variable.
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in agriculture with a focus on predictive analytics."},
        {"role": "user", "content": f"Based on the following data, suggest optimal planting, irrigation, and pest control strategies: {data_summary}"}
    ]
)

recommendations = response.choices[0].message.content
print(recommendations)

Sample Output:

**Optimal Planting Strategy:**
Based on historical data, the ideal planting window is between March 15 and April 10. Given the current weather forecast and soil moisture level of 35%, it is advisable to begin planting on March 19, when temperatures will be around 24°C and precipitation is expected to be minimal.

**Irrigation Strategy:**
With soil moisture at 35%, irrigation is not urgently required today. However, monitor moisture levels closely over the next week, especially after March 19. If the soil moisture drops below 30%, consider scheduling irrigation in the early morning or late evening to reduce evaporation.

**Pest Control Strategy:**
There is a high risk of pest outbreaks in late April. It is recommended to implement preventative measures, such as applying organic pest deterrents, during the second week of April. Regular monitoring of pest activity during this period is crucial to prevent damage to crops.

Example 2: Machine Learning for intelligent decision-making in agriculture

Objective: Use an LLM to generate recommendations for custom fertilizer blends and optimal crop variety selection based on machine learning models that analyze soil type, nutrient levels, and local climate data.

import openai

# Sample soil and climate data for a farm
farm_data = {
    "soil_type": "clay",
    "soil_nutrients": {"nitrogen": 30, "phosphorus": 15, "potassium": 40},  # ppm
    "climate_conditions": {"average_temperature": 28, "rainfall": "moderate", "humidity": 65},
    "historical_crop_yield": {
        "wheat": {"yield_per_hectare": 3000},
        "corn": {"yield_per_hectare": 2800},
        "rice": {"yield_per_hectare": 4000}
    }
}

# Convert farm data to a readable description, reading every value from farm_data
nutrients = farm_data['soil_nutrients']
climate = farm_data['climate_conditions']
yield_summary = ", ".join(
    f"{crop}: {record['yield_per_hectare']} kg/ha"
    for crop, record in farm_data['historical_crop_yield'].items()
)
farm_description = (
    f"The farm's soil is {farm_data['soil_type']}-based, with nutrient levels of nitrogen at {nutrients['nitrogen']} ppm, "
    f"phosphorus at {nutrients['phosphorus']} ppm, and potassium at {nutrients['potassium']} ppm. "
    f"Climate conditions include an average temperature of {climate['average_temperature']}°C, "
    f"{climate['rainfall']} rainfall, and humidity at {climate['humidity']}%. "
    f"Historical yields: {yield_summary}."
)

# Use an LLM to suggest custom fertilizer blends and optimal crop variety based on this data.
# Requires the openai package (version 1.0 or later) and an OPENAI_API_KEY environment variable.
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an agricultural expert with a focus on machine learning and crop yield optimization."},
        {"role": "user", "content": f"Based on the following farm data, suggest a custom fertilizer blend and optimal crop variety for the upcoming season: {farm_description}"}
    ]
)

crop_and_fertilizer_recommendations = response.choices[0].message.content
print(crop_and_fertilizer_recommendations)

Sample Output:

**Custom Fertilizer Blend Recommendation:**
Given the nutrient levels in your clay soil (30 ppm nitrogen, 15 ppm phosphorus, 40 ppm potassium), it is recommended to apply a balanced fertilizer with the following ratio:
- Nitrogen: 40%
- Phosphorus: 25%
- Potassium: 35%

You can achieve this blend by combining urea (for nitrogen), triple superphosphate (for phosphorus), and potassium sulfate. Apply the fertilizer before the planting season and follow up with additional nitrogen during the growth phase, especially for nitrogen-hungry crops like wheat.

**Optimal Crop Variety Recommendation:**
Based on the climate conditions (28°C average temperature, moderate rainfall, and 65% humidity), the optimal crop variety for your farm would be rice. Rice has historically produced the highest yield on your farm (4000 kg/hectare) and performs well in clay soil with moderate water availability. Choose a high-yield, drought-resistant rice variety for this season to maximize output while minimizing water usage.

Wheat is also a viable option, but with lower yield potential. However, if market demand is higher for wheat, consider alternating crops or employing crop rotation to maintain soil health.

Example 1 demonstrates the use of predictive analytics with an LLM to provide actionable recommendations for optimal planting, irrigation, and pest control schedules based on historical weather patterns, real-time data, and soil conditions.

Example 2 showcases machine learning applied to agriculture, where an LLM generates custom fertilizer recommendations and suggests the optimal crop variety based on farm-specific data such as soil nutrients, climate conditions, and historical crop yield performance.

In both examples, LLMs act as a powerful interface between the data and the farmer, providing tailored insights to optimize decision-making and enhance crop yields.

As you can see, integrating predictive analytics and machine learning into agriculture marks a shift toward farming driven by precision, sustainability, and higher productivity. By harnessing historical data and real-time information, farmers can optimize every aspect of crop management, from planting to harvest, ensuring higher yields and promoting environmental stewardship.

For farmers, researchers, and policymakers alike, the challenge is to embrace these tools, continually innovate, and drive the agricultural sector towards a future of smart, sustainable, and highly productive farming practices.

Chapter 5: How to Leverage Big Data and Computer Vision in Farming

To see how AI can improve agricultural practices, we need to look at how big data and computer vision technologies help achieve such ambitious goals.

This chapter will give you a comprehensive overview of the transformative impact that these technologies have on modern agriculture, offering detailed insights and practical examples that highlight their significance and implementation.

The Role of Big Data in Precision Agriculture

Big data analytics is a cornerstone of precision agriculture, where the primary aim is to monitor and manage field variability more effectively.

Farmers collect vast amounts of data through sensors, drones, and satellite imagery, encompassing soil conditions, weather patterns, and crop health. This data is then analyzed to elucidate trends and patterns that inform decision-making.

For instance, understanding soil moisture levels can help optimize irrigation schedules, while tracking weather conditions enables better planning for planting and harvesting.

The predictive power of big data can also guide the application of fertilizers and pesticides, ensuring they are used only when necessary and in precisely the right amounts. This not only saves costs but also minimizes the environmental impact of agricultural practices, addressing the pressing issues of sustainability and resource conservation.

Enhancing Crop Monitoring with Computer Vision

Computer vision technologies significantly enhance crop monitoring by providing high-resolution, real-time images of fields. Drones equipped with multispectral and hyperspectral cameras can fly over large areas, capturing detailed images that reveal information invisible to the naked eye—a critical advantage for early detection of stress factors such as pests, diseases, and nutrient deficiencies.

For instance, a farmer can use drone imagery to identify sections of a field suffering from water stress. By pinpointing these areas precisely, irrigation can be targeted and regulated accordingly, avoiding over-watering or under-watering, which can detrimentally affect crop yield.

Similarly, early detection of pest infestation through computer vision allows for timely intervention, mitigating damage and potential yield loss.
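A common first step in turning multispectral imagery into a stress map is a vegetation index. The sketch below computes the Normalized Difference Vegetation Index (NDVI) per pixel from near-infrared and red reflectance and flags low values. The tiny 2x2 bands and the `threshold` of 0.4 are made-up illustration values, and NDVI is only one of several indices a real pipeline might use.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def stressed_pixels(nir_band, red_band, threshold=0.4):
    """Return (row, col) positions whose NDVI is below threshold.

    threshold is a hypothetical cut-off; it would be calibrated per crop.
    """
    flagged = []
    for r, (nir_row, red_row) in enumerate(zip(nir_band, red_band)):
        for c, (nir, red) in enumerate(zip(nir_row, red_row)):
            if ndvi(nir, red) < threshold:
                flagged.append((r, c))
    return flagged


# Reflectance values between 0 and 1 for a 2x2 patch of field.
nir_band = [[0.8, 0.7], [0.5, 0.9]]
red_band = [[0.1, 0.2], [0.4, 0.1]]
print(stressed_pixels(nir_band, red_band))  # [(1, 0)]
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so a low index in one corner of the patch is a cue to inspect or irrigate that spot.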

AI Models for Predicting Crop Yields

AI-powered predictive analytics are revolutionizing the way farmers forecast crop yields. By integrating various data sources, including current and historical soil quality data, weather patterns, and crop health metrics, AI models generate accurate yield predictions. These models use machine learning algorithms to continuously improve their accuracy as they are exposed to more data.

For example, if historical data indicates that a particular crop yield decreases under specific weather conditions, the AI model can predict similar outcomes and recommend proactive measures. This might include adjusting planting dates, choosing drought-resistant crop varieties, or optimizing irrigation schedules.
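The simplest version of such a yield model is a straight-line fit between one weather variable and past yields. The sketch below uses ordinary least squares on made-up seasons; the data, the single rainfall feature, and the helper name `fit_line` are assumptions for illustration, and real models combine many more inputs.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar


# Made-up seasons: monthly rainfall (mm) and yield (kg/ha).
rainfall = [60, 80, 100, 120]
yields = [4000, 4500, 5000, 5500]

slope, intercept = fit_line(rainfall, yields)
predicted = slope * 90 + intercept
print(round(predicted))  # 4750
```

With a fitted line in hand, a forecast of lower rainfall translates directly into a lower expected yield, which is the signal that would prompt measures such as adjusting irrigation.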

Such insights empower farmers to make informed decisions that enhance productivity and reduce risks associated with unforeseen variables.

Empowering Farm Management with Data-Driven Insights

Farm management software integrated with big data analytics and AI provides a holistic view of farm operations. These platforms consolidate data on everything from soil moisture levels to fertilizer usage, making it easier for farmers to plan and execute their activities efficiently. By offering real-time insights and recommendations, these tools help in optimizing resource allocation, thus enhancing productivity and sustainability.

Consider a scenario where a farmer uses farm management software to track the efficiency of different watering systems. The software can analyze data from various sections of the farm, revealing which system operates most efficiently under different conditions. This allows the farmer to make data-driven decisions on where to invest in irrigation infrastructure, thereby improving water use efficiency and reducing costs.
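The comparison in this scenario can be reduced to a single metric: yield per unit of water. The sketch below ranks three irrigation systems by kilograms of yield per cubic metre; the system names and season totals are hypothetical.

```python
# Hypothetical season totals per irrigation system.
systems = {
    "drip": {"water_m3": 3000, "yield_kg": 12000},
    "sprinkler": {"water_m3": 5000, "yield_kg": 15000},
    "flood": {"water_m3": 9000, "yield_kg": 18000},
}


def water_productivity(records):
    """Return kg of yield per cubic metre of water for each system."""
    return {name: r["yield_kg"] / r["water_m3"] for name, r in records.items()}


productivity = water_productivity(systems)
best = max(productivity, key=productivity.get)
print(productivity)  # {'drip': 4.0, 'sprinkler': 3.0, 'flood': 2.0}
print(best)          # drip
```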

Sustainable Farming Practices Through Data Integration

Integrating data from multiple sources not only optimizes individual farming practices but also promotes overall sustainability. By combining data on soil health, weather patterns, and crop performance, farmers can adopt practices that improve soil fertility, reduce chemical inputs, and conserve water. For instance, data-driven crop rotation schedules can enhance soil health and reduce pest and disease pressure, consequently lowering reliance on synthetic fertilizers and pesticides.

Additionally, big data and computer vision can support the adoption of precision irrigation and fertigation techniques. For example, data on soil moisture levels and plant growth stages can be used to apply water and nutrients precisely when and where they are needed, reducing waste and environmental impact. This aligns with broader goals of sustainability and resource conservation, ensuring that agricultural practices remain viable and productive in the face of climate change and a growing global population.

Code Examples

Below are three examples that demonstrate how LLM applications can be integrated into AI-enhanced farming in support of the goal of raising crop yields by up to 70% by 2030. These examples showcase how LLMs can be used to analyze big data, interpret computer vision inputs, and generate predictive analytics for decision-making.

Example 1: Big data in precision agriculture for irrigation and fertilization

Objective: Use an LLM to analyze data from sensors, satellite imagery, and weather forecasts. Based on the analysis, the LLM generates an optimal irrigation and fertilization schedule.

import openai

# Sample big data inputs: weather forecasts, soil sensors, and satellite imagery
big_data = {
    "weather_forecast": {
        "today": {"temp": 28, "humidity": 50, "precipitation": 10},
        "next_week": [
            {"day": "Monday", "temp": 30, "precipitation": 5},
            {"day": "Tuesday", "temp": 32, "precipitation": 0}
        ]
    },
    "soil_conditions": {
        "moisture_level": 35,  # in percentage
        "nutrient_levels": {"nitrogen": 40, "phosphorus": 20, "potassium": 30}  # ppm
    },
    "satellite_imagery": {
        "crop_health_index": 0.8,  # normalized index (0 to 1)
        "vegetation_density": "moderate"
    }
}

# Generate a description for the LLM, reading every value from big_data
today = big_data['weather_forecast']['today']
soil = big_data['soil_conditions']
imagery = big_data['satellite_imagery']
next_week = "; ".join(
    f"{d['day']}: {d['temp']}°C, {d['precipitation']}mm" for d in big_data['weather_forecast']['next_week']
)
big_data_description = (
    f"The weather forecast indicates a temperature of {today['temp']}°C "
    f"with {today['humidity']}% humidity and {today['precipitation']}mm of precipitation today. "
    f"Next week's forecast: {next_week}. Soil moisture is at {soil['moisture_level']}%. "
    f"Nutrient levels are: nitrogen at {soil['nutrient_levels']['nitrogen']} ppm, phosphorus at "
    f"{soil['nutrient_levels']['phosphorus']} ppm, and potassium at {soil['nutrient_levels']['potassium']} ppm. "
    f"The crop health index from satellite imagery is {imagery['crop_health_index']}, with {imagery['vegetation_density']} vegetation density."
)

# Use an LLM to generate optimal irrigation and fertilization recommendations.
# Requires the openai package (version 1.0 or later) and an OPENAI_API_KEY environment variable.
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI agricultural assistant specializing in big data analysis for irrigation and fertilization."},
        {"role": "user", "content": f"Based on the following big data, provide irrigation and fertilization recommendations: {big_data_description}"}
    ]
)

recommendations = response.choices[0].message.content
print(recommendations)

Sample Output:

**Irrigation Recommendations:**
Given the current soil moisture level of 35%, and the precipitation forecast of 10mm today, additional irrigation is not required today. However, as the temperature rises to 30-32°C next week, plan for irrigation on Tuesday, especially if soil moisture drops below 30%.

**Fertilization Recommendations:**
- Nitrogen levels are at 40 ppm, which is slightly below the optimal range for active growth phases. Apply nitrogen-rich fertilizer at 25% of the recommended dose over the next two days.
- Phosphorus levels are low at 20 ppm. Apply phosphorus-rich fertilizer at 50% of the standard rate to improve root development.
- Potassium levels are adequate but can be boosted with a light application to support flowering and fruiting.

Example 2: Computer vision for detecting crop diseases and nutrient deficiencies

Objective: Integrate computer vision data from drones into an LLM to analyze crop health and generate early disease detection and nutrient deficiency recommendations.

import openai

# Sample data from drone-based computer vision system
vision_data = {
    "field_images": {
        "zones": {
            "Zone_1": {"water_stress": "none", "nutrient_deficiency": "low nitrogen", "disease_spots": "none"},
            "Zone_2": {"water_stress": "moderate", "nutrient_deficiency": "none", "disease_spots": "possible fungal infection"}
        }
    },
    "crop_health_metrics": {
        "average_growth_rate": "good",
        "vegetation_health_index": 0.85,  # 0 to 1 scale
        "detected_pests": "none"
    }
}

# Generate a description for the LLM from the vision data itself
zone_notes = " ".join(
    f"{zone.replace('_', ' ')}: water stress is {obs['water_stress']}, nutrient deficiency is "
    f"{obs['nutrient_deficiency']}, disease spots: {obs['disease_spots']}."
    for zone, obs in vision_data['field_images']['zones'].items()
)
metrics = vision_data['crop_health_metrics']
vision_data_description = (
    f"{zone_notes} Average growth rate is {metrics['average_growth_rate']}, with a vegetation health index of "
    f"{metrics['vegetation_health_index']}, and detected pests: {metrics['detected_pests']}."
)

# Use an LLM to generate recommendations based on computer vision analysis.
# Requires the openai package (version 1.0 or later) and an OPENAI_API_KEY environment variable.
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in agricultural disease management and nutrient analysis."},
        {"role": "user", "content": f"Based on the following computer vision data, provide recommendations for nutrient deficiency and disease management: {vision_data_description}"}
    ]
)

crop_health_recommendations = response.choices[0].message.content
print(crop_health_recommendations)

Sample Output:

**Zone 1 Recommendations:**
- Address the nitrogen deficiency by applying nitrogen-rich fertilizer, such as urea, at a rate of 30% of the recommended dose. Monitor crop growth over the next week for improvement.

**Zone 2 Recommendations:**
- The moderate water stress should be alleviated by implementing targeted irrigation immediately. Focus on ensuring consistent soil moisture levels to reduce plant stress.
- The possible fungal infection should be treated with an appropriate fungicide. Apply a broad-spectrum fungicide as a preventative measure, and closely monitor the affected areas for further spread.

Example 3: Predictive analytics for crop yield forecasting

Objective: Use LLMs to process historical data and predictive models to estimate crop yields based on real-time weather patterns and soil conditions.

import openai

# Sample historical and real-time data for predictive analytics
historical_data = {
    "crop_type": "corn",
    "historical_yield_per_hectare": 5000,  # kg/ha
    "historical_weather_patterns": {
        "optimal_temp_range": [25, 30],  # °C
        "optimal_precipitation": 100  # mm/month
    }
}

real_time_data = {
    "current_temp": 28,  # °C
    "current_precipitation": 90,  # mm this month
    "soil_moisture": 50  # percentage
}

# Generate a description of the data for the LLM, reading every value from the dictionaries
weather = historical_data['historical_weather_patterns']
data_description = (
    f"The crop is {historical_data['crop_type']}, with a historical average yield of "
    f"{historical_data['historical_yield_per_hectare']} kg/hectare. The optimal temperature range for growth is between "
    f"{weather['optimal_temp_range'][0]}°C and {weather['optimal_temp_range'][1]}°C, "
    f"and optimal precipitation is {weather['optimal_precipitation']} mm per month. "
    f"Current conditions show a temperature of {real_time_data['current_temp']}°C, "
    f"precipitation of {real_time_data['current_precipitation']} mm this month, "
    f"and soil moisture at {real_time_data['soil_moisture']}%."
)

# Use an LLM to generate a crop yield forecast based on this data.
# Requires the openai package (version 1.0 or later) and an OPENAI_API_KEY environment variable.
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert in crop yield forecasting using predictive analytics."},
        {"role": "user", "content": f"Based on the following data, provide an estimated crop yield and suggestions for improving yield potential: {data_description}"}
    ]
)

yield_forecast = response.choices[0].message.content
print(yield_forecast)

Sample Output:

**Crop Yield Forecast:**
Given the current temperature of 28°C, which falls within the optimal range for corn growth (25-30°C), and a slightly lower-than-optimal precipitation level of 90 mm (optimal is 100 mm), the crop yield is projected to be around 4800 kg/hectare. The current soil moisture level of 50% supports healthy growth.

**Suggestions for Improving Yield:**
- To maximize yield potential, consider increasing irrigation to make up for the slightly lower precipitation levels this month. Aim to maintain soil moisture at 60-70% to support optimal growth during the reproductive phase of the corn crop.
- Regular monitoring of soil moisture and weather conditions is crucial to adjust irrigation and nutrient inputs dynamically throughout the season.

In Example 1, we used LLMs to analyze large datasets from sensors, satellite imagery, and weather forecasts to provide irrigation and fertilization schedules, ensuring that crops receive the right amount of water and nutrients.

In Example 2, you learned how LLMs can interpret data from drone-based computer vision systems to detect signs of water stress, nutrient deficiencies, and potential diseases. The model generates targeted interventions to improve crop health.

And in Example 3, we used LLMs to process historical and real-time data to forecast crop yields and recommend adjustments to optimize yield, such as increasing irrigation or adjusting nutrient levels based on environmental factors.

In all three examples, LLMs helped process complex data and provide actionable insights for farmers, supporting decisions that improve crop yields, sustainability, and resource efficiency.

The integration of big data and computer vision technologies is undeniably transforming agriculture, making it more efficient, sustainable, and resilient. By leveraging these advanced tools, farmers are better equipped to navigate the complexities of modern farming, addressing challenges such as climate variability, resource limitations, and the need for increased productivity.

Chapter 6: Optimizing Soil Moisture and Quality with AI Models

The Importance of Soil Moisture Management

Effective soil moisture management is fundamental for optimizing crop yields, a goal that resonates universally within the agricultural sector. Inadequate or excessive moisture levels can lead to various complications like root diseases, nutrient leaching, and even yield reduction.

As AI-integrated farming techniques become more sophisticated, they offer a seamless solution to these age-old problems. By employing AI models, farmers can ensure crops consistently receive just the right amount of water.

A powerful aspect of these AI models is their ability to monitor and interpret various data points in real-time, providing insights that would be impossible through manual methods. For instance, imagine a system that analyzes weather forecasts, soil types, and plant needs daily, adjusting irrigation schedules to match this dynamic environment precisely. It's like having a digital agronomist tirelessly working to keep your soil in perfect condition. This heightened level of precision translates directly to higher yields and better crop health.

Not only does this help address specific concerns like drought or over-irrigation, but it also integrates seamlessly into larger farm management systems. By identifying optimal times for water distribution, AI allows for more strategic planning and resource allocation. Think of it as a cycle: healthier soil leads to healthier crops, which in turn require less intervention. Thus, the benefits cascade, leading to more efficient and sustainable farming practices.

Benefits of AI in Optimizing Soil Quality

One of the most compelling advantages of using artificial intelligence in soil quality optimization is its precision. Traditional farming often relies on blanket treatments—broadly applying water or fertilizer across entire fields. AI transforms this into a surgical procedure, tailored to the specific needs of different soil segments.

For example, a farmer might employ an AI model to identify that a particular section of a field is nutrient-deficient. Rather than fertilizing the entire field, resources can be directed precisely where they are needed most.

Predictive analytics represent another revolutionary facet of AI, eliminating the guesswork from farming. By analyzing a rich history of data—soil tests, weather conditions, crop performance—AI enables farmers to anticipate future conditions and prepare accordingly. This kind of foresight can be invaluable when planning crop rotations, anticipating pest invasions, or deciding on the optimal planting and harvesting times. Imagine having a crystal ball that tells you exactly when to plant each year, aligning perfectly with the best-growing conditions.

The key takeaway here is that AI can help provide sustainable solutions. As AI models become more sophisticated, their ability to adapt to changing climates and soil conditions grows, providing a robust platform for future farming endeavors. In this way, AI-enabled soil quality management systems are contributing towards global food security, a critical need underscored in discussions on agricultural advancements.

Integration with Existing Farming Practices

The integration of AI into existing farming practices should be seamless, enhancing rather than disrupting daily operations. Many farmers may be wary of adopting new technologies, fearing complexity or disruption. But today's AI systems are designed for usability. They often integrate directly with existing farm management software, providing a unified interface for all your agricultural needs. For example, systems like John Deere's Operations Center offer modules that incorporate AI-driven insights into traditional farm management tools.

Farmers can see real-time data on soil moisture levels, nutrient content, and irrigation needs, all in one place. These platforms often offer mobile applications, allowing farmers to access this information from anywhere and make decisions on the go. The ease of use and accessibility of these tools demystify the technology, making it more approachable. The aim is not to replace the farmer's expertise but to augment it with tools that enable smarter, more efficient farming.

Full integration into irrigation systems means the AI can automatically adjust water levels without manual intervention. This automation ensures that even the minutest changes in soil conditions are addressed immediately, maintaining optimal growing conditions at all times. Think of it as a smart home system but for your crops—a digital assistant that ensures everything runs smoothly, even when you cannot be present.
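One simple way to picture this kind of automatic adjustment is a threshold controller with hysteresis: the valve opens when moisture drops below a lower bound and closes once it rises above an upper bound. The sketch below is only an illustration. The 30% and 40% bounds, the sensor readings, and the function name are assumptions, and a production system would add forecasts, safety limits, and sensor validation:

```python
def irrigation_command(moisture_pct, valve_open, low=30.0, high=40.0):
    """Hysteresis control: open below `low`, close above `high`, otherwise keep the current state."""
    if moisture_pct < low:
        return True
    if moisture_pct > high:
        return False
    return valve_open


# Hypothetical sequence of soil moisture readings (percent).
readings = [42, 35, 29, 33, 38, 41]

valve_open = False
states = []
for reading in readings:
    valve_open = irrigation_command(reading, valve_open)
    states.append(valve_open)

print(states)  # [False, False, True, True, True, False]
```

The gap between the two bounds keeps the valve from switching on and off rapidly when readings hover near a single threshold.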

Balancing Technological Advancements and Practical Applications

While the promise of AI in optimizing soil moisture and quality is enormous, its practical application requires a balanced approach. Not all farms are the same, and the variance in soil types, climate conditions, and crop types means a one-size-fits-all solution isn’t feasible.

Tailoring AI models to fit specific needs is crucial for maximizing their effectiveness. Customizable AI platforms are gaining traction because they allow for this level of specificity.

Take, for instance, a farm situated in a semi-arid region. The soil here typically has lower organic content and higher salinity levels. An AI model geared towards this specific environment will focus on conserving water while improving soil quality through targeted fertilization techniques and organic amendments.

Contrast this with a farm in a temperate climate, where the AI might prioritize managing periodic heavy rains to prevent soil erosion and nutrient loss. The customization of AI applications ensures that solutions are relevant and effective, driving meaningful improvements in any farming context.

The interdisciplinary nature of AI-powered farming highlights the need for collaboration between technology developers, agronomists, and the farmers themselves. Each stakeholder brings invaluable expertise, and their combined efforts can help overcome the initial hurdles.

Training programs and workshops can further this integration, empowering farmers to use these technologies effectively. Enhancing the farmers' understanding of how these tools work allows them to make more informed decisions, unlocking the full potential of AI in agriculture.

Addressing Challenges and Ethical Considerations

As with any technological advancement, the implementation of AI in soil moisture and quality management comes with its own set of challenges. One significant concern is data privacy. Farms collect vast amounts of data—weather conditions, soil properties, crop performance—that is valuable not just to farmers but to numerous stakeholders, including corporations and governments. Ensuring this data is used ethically and remains secure is paramount.

Another challenge is accessibility. While larger, well-funded farms can afford to implement advanced AI systems, smaller farms often operate on tighter budgets. Ensuring equitable access to this transformative technology is crucial for its widespread adoption. Public funding, subsidies, and collaborative efforts between private sectors and government bodies can create pathways for smaller farms to benefit from AI advancements.

While AI systems can alleviate many manual tasks, reliance on technology should not come at the expense of traditional farming knowledge. The wisdom and experience of seasoned farmers offer insights that cannot be wholly replicated by algorithms. Thus, a balanced approach that combines the best of both worlds—traditional agriculture knowledge and modern AI capabilities—will yield the most robust, sustainable farming practices.

Towards Sustainable and Resilient Agriculture

The future of agriculture lies in leveraging technological advancements like AI to create systems that are not only high-yielding but also sustainable and resilient. AI-powered soil moisture and quality management systems offer a glimpse into this future, where data-driven decisions replace guesswork, and precise interventions lead to optimal outcomes. The cascading benefits—from increased crop yields and reduced resource use to enhanced food security—highlight the immense potential of this approach.

The adoption of these AI models is an essential step towards realizing the goals set out in AI in Agriculture: How AI-Enhanced Farming Could Increase Crop Yields by 70% by 2030. With every farm that integrates AI technology, we get closer to a world where agricultural practices are sustainable, efficient, and resilient to the challenges posed by climate change and growing populations.

Code Examples

The examples below show how Large Language Models (LLMs) can be incorporated into AI systems for soil moisture and quality management in agriculture. They follow the topics covered earlier in this chapter.

Example 1: AI-driven real-time soil moisture management

Objective: Use an LLM to dynamically adjust irrigation schedules based on soil moisture sensor data, weather forecasts, and crop needs. The system optimizes water distribution in real-time, considering potential root diseases and nutrient leaching.

import openai

# Sample input data from real-time sensors and weather forecasts
soil_data = {
    "moisture_level": 40,  # Soil moisture percentage
    "root_zone_temperature": 25,  # Temperature in Celsius
    "potential_root_disease_risk": "moderate"
}

weather_forecast = {
    "today": {"temp": 30, "humidity": 60, "precipitation": 5},  # °C, %, mm
    "tomorrow": {"temp": 32, "precipitation": 10}  # °C, mm
}

crop_needs = {
    "growth_stage": "flowering",
    "water_requirement": "high"
}

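# Describe the current data to the LLM, reading every value from the dictionaries above
today = weather_forecast["today"]
tomorrow = weather_forecast["tomorrow"]
data_description = (
    f"The current soil moisture level is {soil_data['moisture_level']}%. "
    f"Root zone temperature is {soil_data['root_zone_temperature']}°C. "
    f"There is a {soil_data['potential_root_disease_risk']} risk of root disease. "
    f"Today's weather forecast shows a temperature of {today['temp']}°C "
    f"with {today['precipitation']}mm of precipitation and {today['humidity']}% humidity. "
    f"Tomorrow's forecast is {tomorrow['temp']}°C with {tomorrow['precipitation']}mm of precipitation. "
    f"The crop is in the {crop_needs['growth_stage']} stage, "
    f"and its water requirement is {crop_needs['water_requirement']}."
)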

# Use an LLM to adjust the irrigation schedule based on real-time data.
# openai.ChatCompletion was removed in openai>=1.0, so this uses the client interface.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert specializing in soil moisture management and irrigation."},
        {"role": "user", "content": f"Based on the following data, provide an optimized irrigation schedule: {data_description}"}
    ]
)

irrigation_schedule = response.choices[0].message.content
print(irrigation_schedule)

Sample Output:

**Optimized Irrigation Schedule:**
- Given the current soil moisture level of 40%, irrigation should be scheduled for early tomorrow morning, especially considering the high water requirement during the flowering stage.
- With 5mm of precipitation expected today and 10mm tomorrow, delay any additional irrigation until after the forecasted rain, and reassess moisture levels.
- Monitor root zone temperature and soil moisture closely over the next 24 hours to avoid overwatering, which could exacerbate the moderate risk of root disease. Ensure that irrigation is balanced to prevent nutrient leaching.

Example 2: AI-enhanced soil quality analysis and fertilization strategy

Objective: Use an LLM to analyze soil quality based on nutrient levels and crop requirements. The system recommends precise fertilization strategies based on real-time and historical data, helping avoid over-fertilization and nutrient leaching.

import openai

# Sample input data from soil tests and crop requirements
soil_data = {
    "pH": 6.5,
    "nutrient_levels": {"nitrogen": 30, "phosphorus": 15, "potassium": 25},  # ppm
    "organic_matter": 3.0  # percentage
}

crop_data = {
    "crop_type": "wheat",
    "growth_stage": "early vegetative",
    "nutrient_requirement": {"nitrogen": "high", "phosphorus": "moderate", "potassium": "low"}
}

# Generate description for the LLM based on the input data
nutrients = soil_data["nutrient_levels"]
requirements = crop_data["nutrient_requirement"]
data_description = (
    f"The soil pH is {soil_data['pH']}, and the nutrient levels are nitrogen at {nutrients['nitrogen']} ppm, "
    f"phosphorus at {nutrients['phosphorus']} ppm, and potassium at {nutrients['potassium']} ppm. "
    f"The organic matter content is {soil_data['organic_matter']}%. "
    f"The crop type is {crop_data['crop_type']}, which is in the {crop_data['growth_stage']} stage "
    f"and has {requirements['nitrogen']} nitrogen, {requirements['phosphorus']} phosphorus, "
    f"and {requirements['potassium']} potassium requirements."
)

# Use an LLM to generate a precise fertilization strategy.
# openai.ChatCompletion was removed in openai>=1.0, so this uses the client interface.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI agronomist specializing in soil quality and fertilization."},
        {"role": "user", "content": f"Based on the following soil and crop data, provide a fertilization strategy: {data_description}"}
    ]
)

fertilization_strategy = response.choices[0].message.content
print(fertilization_strategy)

Sample Output:

**Fertilization Strategy:**
- **Nitrogen:** The current nitrogen level is 30 ppm, which is below the optimal range for wheat in the early vegetative stage. Apply a nitrogen-rich fertilizer, such as urea, at a rate of 50 kg/ha to meet the high nitrogen demands.

- **Phosphorus:** Phosphorus levels are moderately low at 15 ppm. Apply phosphorus-based fertilizer, such as triple superphosphate, at a rate of 25 kg/ha to support early root development.

- **Potassium:** Potassium levels are sufficient for this stage, so no additional potassium fertilization is needed at this time.

- Monitor the soil pH to ensure it remains within the optimal range for wheat growth (6.0-7.0). If pH begins to drop below 6.0, consider applying lime to balance the soil acidity.

Example 3: AI-powered predictive analytics for soil moisture and quality optimization

Objective: Use an LLM to combine predictive analytics and historical data to forecast future soil moisture conditions, nutrient levels, and irrigation needs. The AI provides a long-term soil management strategy based on weather predictions and crop growth stages.

import openai

# Historical and predictive input data for AI analysis
historical_data = {
    "soil_moisture_trend": [40, 35, 30, 25],  # % moisture over past 4 weeks
    "nutrient_depletion": {"nitrogen": 2, "phosphorus": 1, "potassium": 0.5},  # ppm depletion rate per week
    "weather_trends": [
        {"week": 1, "precipitation": 20},  # mm of rain
        {"week": 2, "precipitation": 10},
        {"week": 3, "precipitation": 0},
        {"week": 4, "precipitation": 5}
    ]
}

current_conditions = {
    "soil_moisture": 30,  # current soil moisture percentage
    "weather_forecast": {"next_week_precipitation": 15},  # mm of expected rain
    "growth_stage": "mid-vegetative"
}

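# Generate description for the LLM from the data above
trend = historical_data["soil_moisture_trend"]
depletion = historical_data["nutrient_depletion"]
last_week_rain = historical_data["weather_trends"][-1]["precipitation"]
next_week_rain = current_conditions["weather_forecast"]["next_week_precipitation"]
data_description = (
    f"Over the past {len(trend)} weeks, soil moisture has decreased from {trend[0]}% to {trend[-1]}%. "
    f"Nitrogen has been depleting at a rate of {depletion['nitrogen']} ppm per week. "
    f"The precipitation levels have been fluctuating, with only {last_week_rain}mm last week "
    f"and {next_week_rain}mm expected next week. "
    f"The crop is currently in the {current_conditions['growth_stage']} stage."
)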

# Use an LLM to provide a long-term soil moisture and quality optimization strategy.
# openai.ChatCompletion was removed in openai>=1.0, so this uses the client interface.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI agronomist specializing in predictive analytics for soil moisture and quality."},
        {"role": "user", "content": f"Based on the following historical and predictive data, provide a soil moisture and quality management strategy: {data_description}"}
    ]
)

soil_management_strategy = response.choices[0].message.content
print(soil_management_strategy)

Sample Output:

**Soil Moisture and Quality Management Strategy**

Based on the historical data and current conditions, the following strategy is recommended to optimize soil moisture and maintain soil quality:

1. **Irrigation Management**
   - **Scheduled Irrigation:** Implement a drip irrigation system to provide consistent moisture levels, targeting a soil moisture percentage between 30% and 35%. This helps compensate for the recent decline from 40% to 25%.
   - **Rainfall Utilization:** With an expected 15mm of precipitation next week, adjust the irrigation schedule to reduce water input accordingly, preventing waterlogging and conserving water resources.

2. **Nutrient Management**
   - **Nitrogen Supplementation:** Given the depletion rate of 2 ppm per week, apply a nitrogen-rich fertilizer bi-weekly to replenish soil nitrogen levels and support plant growth during the mid-vegetative stage.
   - **Phosphorus and Potassium Maintenance:** Continue monitoring phosphorus and potassium levels, applying supplements as needed to maintain balanced nutrient availability.

3. **Soil Conservation Practices**
   - **Mulching:** Apply organic mulch around crops to reduce soil evaporation, maintain moisture levels, and improve soil structure.
   - **Cover Cropping:** Introduce cover crops during off-seasons to enhance soil organic matter, prevent erosion, and improve nutrient retention.

4. **Weather Adaptation**
   - **Drainage Management:** Ensure proper drainage systems are in place to handle the variability in precipitation, especially during weeks with heavy rainfall.
   - **Weather Monitoring:** Utilize weather forecasting tools to make informed decisions on irrigation and nutrient application, adapting strategies based on real-time data.

5. **Crop Management**
   - **Growth Stage Optimization:** During the mid-vegetative stage, focus on practices that support robust leaf and stem development, ensuring that soil conditions do not limit plant growth.
   - **Pest and Disease Monitoring:** Regularly inspect crops for signs of stress, pests, or diseases that may arise from fluctuating soil moisture and nutrient levels.

6. **Long-Term Soil Health**
   - **Soil Testing:** Conduct quarterly soil tests to monitor nutrient levels, pH, and organic matter content, allowing for data-driven adjustments to management practices.
   - **Sustainable Practices:** Invest in sustainable farming practices such as crop rotation and reduced tillage to enhance soil health and resilience against environmental stressors.

7. **Technology Integration**
   - **Soil Moisture Sensors:** Deploy soil moisture sensors to obtain real-time data, enabling precise irrigation control and timely interventions.
   - **Data Analytics:** Utilize data analytics platforms to track historical trends and predict future soil moisture and nutrient needs, optimizing resource allocation.

**Implementation Timeline:**
- **Immediate (Next 1-2 Weeks):**
  - Install or calibrate drip irrigation systems.
  - Apply nitrogen-based fertilizers.
  - Begin mulching around crop areas.

- **Short-Term (Next 1-3 Months):**
  - Monitor soil moisture and nutrient levels weekly.
  - Adjust irrigation schedules based on rainfall and sensor data.
  - Introduce cover crops during off-seasons.

- **Long-Term (6 Months - 1 Year):**
  - Conduct comprehensive soil health assessments.
  - Implement sustainable farming practices.
  - Invest in advanced soil monitoring technologies.

By following this strategy, you can effectively manage soil moisture levels, replenish essential nutrients, and maintain overall soil health, leading to sustained crop productivity and resilience against environmental challenges.
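The moisture trend in Example 3 can also be extrapolated numerically, without an LLM. The sketch below fits an ordinary least-squares line through the four weekly readings from that example and projects one week ahead. The function name is made up for illustration, and a straight line is a deliberate simplification, since real soil moisture depends on rainfall, evaporation, and irrigation:

```python
def linear_trend(values):
    """Fit y = intercept + slope * week by ordinary least squares (weeks are 0, 1, 2, ...)."""
    n = len(values)
    weeks = range(n)
    mean_x = sum(weeks) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Weekly soil moisture readings (percent) from the example above.
soil_moisture_trend = [40, 35, 30, 25]

slope, intercept = linear_trend(soil_moisture_trend)
next_week = intercept + slope * len(soil_moisture_trend)

print(slope)      # -5.0
print(next_week)  # 20.0
```

A projected drop to 20% with no intervention is the kind of number that could be included in the prompt, giving the LLM a concrete forecast to reason about.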

Chapter 7: Sustainable Land Use Strategies with Agricultural Technology

In the landscape of modern agriculture, the promise of AI-enhanced farming sets a compelling context for exploring sustainable land use strategies supported by technological advancements.

The confluence of artificial intelligence and sustainable agricultural practices not only addresses the need for increased productivity but also emphasizes the importance of environmental stewardship.

This chapter delves into how the integration of AI and other cutting-edge technologies can revolutionize land use, optimizing resource management while promoting ecological balance.

Precision Agriculture for Resource Optimization

Precision agriculture, a hallmark of modern farming, leverages AI models and predictive analytics to refine agricultural practices at an unprecedented scale. By employing advanced data analytics, farmers can monitor vital parameters such as soil conditions, weather patterns, and crop health with pinpoint accuracy.

For example, soil moisture sensors connected to AI platforms can provide real-time data, enabling farmers to optimize irrigation schedules to conserve water without compromising crop health.

This level of precision empowers farmers to tailor their use of fertilizers and pesticides, reducing waste and enhancing soil quality. AI-driven soil quality assessments can guide the application of nutrients specifically where they are needed, rather than blanket coverage, which can lead to pollution and soil degradation. By focusing on data-driven decisions, precision agriculture not only enhances yield but also aligns farming practices with sustainable land management.

AI-Powered Farm Management Software

AI-powered farm management software represents the next frontier in agricultural efficiency. These platforms offer comprehensive tools to streamline farm operations, from resource allocation to day-to-day task management. The integration of computer vision technology allows for early detection of crop anomalies, such as nutrient deficiencies or pest infestations, through the analysis of high-resolution images.

This proactive approach can significantly reduce crop losses and minimize the need for chemical interventions, fostering more sustainable farming practices. Agricultural robotics and automation can also ease labor shortages by handling routine tasks such as planting, weeding, and harvesting. This reduces operational strain and lets farmers focus on strategic decision-making and long-term planning.

Sustainable Practices for Enhanced Yields

Sustainable agricultural practices supported by AI technologies embrace the dual goals of maximizing productivity and minimizing environmental impact. AI-powered precision irrigation systems, for example, use weather forecasts and soil moisture data to deliver water only when and where it is needed. This not only conserves water but also ensures that crops receive optimal hydration for maximum growth.

Also, the adoption of AI solutions for sustainable land use often comes with financial incentives. Governments and international bodies increasingly recognize the importance of sustainable farming and offer subsidies or grants to farmers who implement eco-friendly technologies. These incentives not only offset the initial cost of adopting new technologies but also promote long-term benefits such as improved soil health, reduced pollution, and enhanced biodiversity.

Embracing the Future of Agriculture with AI

The future of agriculture lies in the seamless integration of AI technologies, transforming traditional farming into a sophisticated, data-driven practice. By addressing critical challenges such as climate variability, labor shortages, and resource constraints, AI technologies ensure the resilience and sustainability of the global food system.

For example, machine learning algorithms can predict climate-related risks, allowing farmers to adapt their planting schedules and crop selections accordingly. This adaptive approach is essential in a world where climate change poses an increasing threat to food security. By leveraging AI, farmers can make informed decisions that not only enhance productivity but also safeguard the environment for future generations.

Optimizing Resource Management through Precision Agriculture

Precision agriculture stands at the forefront of resource management optimization. Through the use of AI models and big data analytics, farmers can monitor and manage resources with precision, leading to significant improvements in efficiency and sustainability.

Soil moisture sensors are a prime example of technology enabling precise irrigation management. These sensors provide real-time data on soil moisture levels, helping farmers determine the exact amount of water needed. This ensures optimal crop hydration, reduces water wastage, and prevents over-irrigation, which can lead to soil erosion and nutrient runoff.
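To make "the exact amount of water needed" concrete, here is a rough sketch of how a sensor reading could be turned into an irrigation depth. It assumes the sensor reports volumetric water content in percent, that the root zone depth is known, and that all forecast rain reaches the root zone. The function name and the numbers are illustrative only:

```python
def irrigation_depth_mm(current_pct, target_pct, root_zone_depth_mm, expected_rain_mm=0.0):
    """Water depth (mm) needed to raise volumetric moisture to the target.

    Assumes moisture is volumetric water content in percent, so a 1% deficit
    over a 300 mm root zone corresponds to 3 mm of water.
    """
    deficit_mm = (target_pct - current_pct) / 100 * root_zone_depth_mm
    # Subtract rain that is expected anyway; never return a negative amount.
    return round(max(0.0, deficit_mm - expected_rain_mm), 1)


print(irrigation_depth_mm(30, 35, 300))                      # 15.0
print(irrigation_depth_mm(30, 35, 300, expected_rain_mm=5))  # 10.0
print(irrigation_depth_mm(45, 35, 300))                      # 0.0
```

The third call shows the over-irrigation guard: when the soil is already wetter than the target, the function returns zero rather than a negative amount.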

Beyond irrigation, precision agriculture plays a vital role in managing soil quality. AI-powered tools analyze soil samples to assess nutrient levels and composition. Farmers can then tailor fertilizer application to the specific needs of different soil sections, avoiding overuse and minimizing environmental impact. This targeted approach not only enhances crop yield but also promotes soil health and reduces the risk of contamination in nearby water sources.

The integration of weather pattern analysis further enhances resource management. Predictive analytics can forecast short-term weather conditions well enough for farmers to plan their activities around them. Whether it's adjusting planting schedules to avoid adverse weather or applying protective measures against frost or drought, precision agriculture helps farmers make informed decisions that optimize resource use.

Enhancing Farm Efficiency with AI Technologies

One of the most significant contributions of AI to agriculture is the development of advanced farm management software. These platforms leverage AI algorithms to streamline farm operations, resulting in increased efficiency and productivity. By tracking and managing resources such as labor, equipment, and inputs, these systems offer a holistic view of farm activities.

Computer vision technology, integrated into farm management software, gives farmers insight into crop health. High-resolution images captured by drones or sensors are analyzed in detail, enabling early detection of issues such as nutrient deficiencies, pest infestations, or disease outbreaks. Timely intervention can prevent these problems from spreading and causing extensive damage. AI-powered recommendation engines can then suggest appropriate remedial actions, helping farmers address issues effectively.
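A common first step in image-based crop monitoring is a vegetation index. The text does not name a specific index, so the sketch below swaps in the widely used Normalized Difference Vegetation Index (NDVI), computed from near-infrared and red reflectance. The reflectance values, zone names, and the 0.3 stress threshold are assumptions for illustration:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red), or 0.0 if both are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


# Hypothetical mean reflectance per zone as (near-infrared, red) pairs.
zone_reflectance = {
    "zone_1": (0.60, 0.20),
    "zone_2": (0.50, 0.10),
    "zone_3": (0.30, 0.25),
}

# Assumed cut-off: zones with NDVI below this are flagged for inspection.
STRESS_THRESHOLD = 0.3

stressed_zones = [
    zone for zone, (nir, red) in zone_reflectance.items()
    if ndvi(nir, red) < STRESS_THRESHOLD
]
print(stressed_zones)  # ['zone_3']
```

Healthy vegetation reflects much more near-infrared light than red light, so a low NDVI is a cheap signal that a zone deserves a closer look before a more detailed model is run.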

Robotics and automation form another key component in enhancing farm efficiency. Automating repetitive and labor-intensive tasks such as planting, weeding, and harvesting reduces the reliance on human labor and improves precision and consistency. This, in turn, can lead to higher productivity and reduced operational costs.

Promoting Sustainable Farming Practices with AI

Sustainable land use practices are integral to achieving long-term agricultural productivity while minimizing environmental impact. AI technologies play a pivotal role in promoting these practices by optimizing land use and conserving natural resources. Precision irrigation systems, powered by AI, exemplify the synergy between technology and sustainability. By delivering water precisely when and where it is needed, these systems reduce water wastage and ensure that crops receive optimal hydration.

AI-driven solutions for nutrient management also contribute to sustainable farming by minimizing the use of chemical fertilizers. By analyzing soil nutrient levels, AI models recommend targeted fertilization, so that nutrients are applied only where required. This enhances crop yield and prevents over-fertilization, which can lead to soil and water pollution.

Fuel consumption is another significant area where AI can drive sustainability. Autonomous machinery equipped with AI algorithms optimizes fuel use by planning efficient routes and minimizing idle time. This reduces greenhouse gas emissions and lowers operational costs, contributing to both environmental and economic sustainability.
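Route planning of this kind can be illustrated with a nearest-neighbor heuristic: from the current position, always drive to the closest unvisited stop. This is a sketch under simplifying assumptions (straight-line distances, made-up coordinates, hypothetical function names), and the heuristic does not guarantee the shortest possible route:

```python
import math


def nearest_neighbor_route(start, stops):
    """Greedy route: repeatedly visit the closest remaining stop."""
    route = [start]
    remaining = list(stops)
    while remaining:
        current = route[-1]
        closest = min(remaining, key=lambda point: math.dist(current, point))
        remaining.remove(closest)
        route.append(closest)
    return route


def route_length(route):
    """Total straight-line distance along the route."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


# Hypothetical field coordinates in arbitrary distance units.
depot = (0, 0)
stops = [(5, 0), (1, 0), (1, 3), (5, 3)]

planned = nearest_neighbor_route(depot, stops)
print(planned)                          # [(0, 0), (1, 0), (1, 3), (5, 3), (5, 0)]
print(route_length(planned))            # 11.0
print(route_length([depot] + stops))    # 16.0
```

Visiting the stops in the order they were listed costs 16 units, while the greedy route costs 11. Shorter distances mean less fuel burned, which is the effect the paragraph describes.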

Financial Incentives for Sustainable Farming

The adoption of sustainable land use strategies is often facilitated by financial incentives provided by governments and organizations. These incentives encourage farmers to invest in AI-driven technologies that promote sustainability and long-term benefits. Subsidies, grants, and tax incentives help offset the initial costs of implementing new technologies, making them more accessible to farmers.

For instance, governments may offer subsidies for the installation of precision irrigation systems or provide grants for adopting AI-powered soil analysis tools. These financial incentives not only support the transition to sustainable farming practices but also recognize the broader societal benefits, such as improved water quality, reduced greenhouse gas emissions, and enhanced biodiversity.

Sustainable farming practices driven by AI technologies can also lead to increased profitability for farmers. By optimizing resource use, reducing input costs, and enhancing crop yield, these practices contribute to higher economic returns. Farmers who embrace AI-driven solutions are better positioned to achieve long-term financial stability while contributing to a more sustainable food system.

Building a Resilient Future with AI in Agriculture

The integration of AI technologies in agriculture represents a paradigm shift that addresses critical challenges and paves the way for a resilient and sustainable future. By harnessing the power of AI, farmers can navigate the complexities of modern farming, optimize resource use, and mitigate environmental impact.

AI-driven predictive analytics help farmers adapt to changing climatic conditions. By analyzing historical weather data and current trends, AI models can estimate likely future weather patterns, though always with some uncertainty. This enables farmers to make proactive decisions, such as adjusting planting schedules, selecting resilient crop varieties, and implementing protective measures. Such adaptive strategies are essential in the face of climate change and help sustain agricultural productivity.
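As a toy illustration of forecasting from historical data, the sketch below applies simple exponential smoothing, which weights recent observations more heavily than older ones. The rainfall figures and the smoothing factor of 0.5 are assumptions, and real weather and climate prediction relies on far richer models:

```python
def exponential_smoothing_forecast(values, alpha=0.5):
    """One-step-ahead forecast: level = alpha * value + (1 - alpha) * previous level."""
    if not values:
        raise ValueError("need at least one observation")
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical average summer rainfall (mm/month) over four past years.
summer_rainfall_mm = [80, 70, 60, 50]

forecast = exponential_smoothing_forecast(summer_rainfall_mm)
print(forecast)  # 58.75
```

A forecast that keeps falling year over year is the kind of signal that might prompt a switch to more drought-tolerant varieties or an earlier planting date.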

Labor shortages, a persistent challenge in agriculture, can be eased by AI-powered automation. Robots and autonomous machinery perform labor-intensive tasks with precision and reliability, reducing the dependence on human labor. This increases operational efficiency and allows farmers to focus on strategic planning and innovation.

Code Examples

Here are three examples of how Large Language Models (LLMs) can be integrated into AI tools to support sustainable land use strategies with agricultural technology:

Example 1: AI-driven precision agriculture for resource optimization

Objective: Use an LLM to analyze data from soil sensors, satellite imagery, and weather forecasts to optimize irrigation and fertilizer use while maintaining sustainability. This example aims to help farmers optimize resource use, reduce environmental impact, and promote sustainable land management.

import openai

# Sample input data from soil sensors, satellite imagery, and weather forecasts
farm_data = {
    "soil_moisture": {
        "zone_A": 45,  # percentage
        "zone_B": 30,  # percentage
    },
    "satellite_imagery": {
        "vegetation_health_index": 0.85,  # normalized between 0 to 1
    },
    "weather_forecast": {
        "today": {"temperature": 28, "humidity": 65, "precipitation": 3},  # in °C, %, mm
        "next_week_precipitation": 15,  # mm of rain expected over the next week
    }
}

# Describe data for LLM input, reading every value from farm_data
moisture = farm_data["soil_moisture"]
forecast = farm_data["weather_forecast"]
data_summary = (
    f"Zone A soil moisture is at {moisture['zone_A']}%, while Zone B is at {moisture['zone_B']}%. "
    f"The satellite imagery shows a vegetation health index of {farm_data['satellite_imagery']['vegetation_health_index']}. "
    f"Today's weather forecast indicates a temperature of {forecast['today']['temperature']}°C, "
    f"with {forecast['today']['humidity']}% humidity and {forecast['today']['precipitation']}mm precipitation. "
    f"The forecasted rainfall for the next week is {forecast['next_week_precipitation']}mm."
)

# Use an LLM to generate sustainable irrigation and fertilization recommendations.
# openai.ChatCompletion was removed in openai>=1.0, so this uses the client interface.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI agricultural assistant specializing in sustainable precision farming."},
        {"role": "user", "content": f"Based on the following data, provide sustainable irrigation and fertilization recommendations: {data_summary}"}
    ]
)

recommendations = response.choices[0].message.content
print(recommendations)

Sample Output:

**Sustainable Irrigation and Fertilization Recommendations:**

- **Zone A Irrigation:** Since soil moisture is at 45%, no immediate irrigation is needed. Reassess after the next rainfall. Depending on the forecasted 15mm rain, irrigation may not be necessary for at least 5 days.

- **Zone B Irrigation:** Soil moisture in Zone B is at 30%, which is approaching a critical threshold. Schedule light irrigation (20mm) for Zone B tomorrow to maintain optimal soil moisture, then reassess after the next week's rain.

- **Fertilization Strategy:** The vegetation health index of 0.85 indicates good crop health. Continue applying fertilizer at 60% of the standard rate, focused only in areas of Zone B where soil nutrient data indicates low nitrogen. This approach will reduce overuse of fertilizers and protect the soil from degradation.

Example 2: AI-powered farm management software for crop monitoring and anomaly detection

Objective: Integrate an LLM with AI-powered farm management software that uses computer vision and predictive analytics to identify crop anomalies like nutrient deficiencies or pest infestations and provide sustainable intervention strategies.

import openai

# Sample data from farm management software using computer vision for anomaly detection
crop_data = {
    "drone_images": {
        "zones": {
            "zone_1": {"anomaly_detected": "nitrogen deficiency", "severity": "moderate"},
            "zone_2": {"anomaly_detected": "early-stage pest infestation", "severity": "low"}
        }
    },
    "crop_health": {
        "growth_stage": "mid-vegetative",
        "projected_yield": 4000  # kg/ha
    }
}

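# Describe data for LLM input, reading every value from crop_data
zones = crop_data["drone_images"]["zones"]
health = crop_data["crop_health"]
crop_data_summary = (
    f"Drone images have detected a {zones['zone_1']['anomaly_detected']} in Zone 1, with {zones['zone_1']['severity']} severity, "
    f"and an {zones['zone_2']['anomaly_detected']} in Zone 2, with {zones['zone_2']['severity']} severity. "
    f"The crops are in the {health['growth_stage']} growth stage, and the projected yield is currently {health['projected_yield']} kg/ha."
)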

# Use an LLM to generate sustainable recommendations for addressing detected anomalies.
# openai.ChatCompletion was removed in openai>=1.0, so this uses the client interface.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert specializing in sustainable crop monitoring and intervention strategies."},
        {"role": "user", "content": f"Based on the detected anomalies and current crop health, provide sustainable intervention strategies: {crop_data_summary}"}
    ]
)

sustainable_strategy = response.choices[0].message.content
print(sustainable_strategy)

Sample Output:

**Sustainable Intervention Strategies:**

- **Zone 1 Nitrogen Deficiency:** Apply a nitrogen-rich organic fertilizer such as composted manure to address the deficiency in a sustainable manner. Spread the fertilizer evenly across the affected area, ensuring a slow-release approach to prevent nitrogen runoff and soil contamination.

- **Zone 2 Pest Infestation:** Given the early stage and low severity of the pest infestation, implement biological pest control methods such as introducing natural predators or using neem oil to minimize chemical pesticide use. Continue monitoring the affected area closely for any escalation in pest activity.

- **General Management:** Maintain regular soil testing and drone-based monitoring to ensure nutrient levels are balanced and pest control measures are effective. This proactive approach will protect yield potential while minimizing environmental impact.

Example 3: AI-enhanced predictive analytics for climate-adaptive sustainable farming

Objective: Use an LLM to analyze predictive climate data and provide sustainable, climate-adaptive strategies for planting, crop selection, and soil management. The goal is to optimize land use in light of changing weather patterns and minimize environmental risks.

import openai

# Sample predictive climate data and historical trends
climate_data = {
    "historical_weather": {
        "average_temp_summer": 32,  # °C
        "average_rainfall_summer": 80  # mm/month
    },
    "predictive_weather_model": {
        "next_summer": {"projected_temp": 35, "projected_rainfall": 50},  # °C, mm
        "risk_assessment": {"drought_risk": "high", "heatwave_risk": "moderate"}
    },
    "soil_data": {
        "organic_matter": 2.5,  # percentage
        "soil_type": "loamy",
        "moisture_retention": "moderate"
    }
}

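# Describe data for LLM input, reading the values from climate_data
hist = climate_data["historical_weather"]
forecast = climate_data["predictive_weather_model"]["next_summer"]
risk = climate_data["predictive_weather_model"]["risk_assessment"]
soil = climate_data["soil_data"]
climate_data_summary = (
    f"Historically, the average summer temperature has been {hist['average_temp_summer']}°C with {hist['average_rainfall_summer']}mm of rainfall per month. "
    f"However, next summer's predictive model suggests temperatures may rise to {forecast['projected_temp']}°C with reduced rainfall of {forecast['projected_rainfall']}mm. "
    f"There is a {risk['drought_risk']} risk of drought and a {risk['heatwave_risk']} risk of heatwaves. "
    f"The soil is {soil['soil_type']} with {soil['organic_matter']}% organic matter and {soil['moisture_retention']} moisture retention."
)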

# Use LLM to generate climate-adaptive, sustainable land use strategies
# (openai>=1.0 client; it reads the API key from the OPENAI_API_KEY environment variable)
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert in sustainable land use and climate-adaptive farming."},
        {"role": "user", "content": f"Based on the predictive climate and soil data, provide sustainable, climate-adaptive farming strategies: {climate_data_summary}"}
    ]
)

climate_adaptive_strategy = response.choices[0].message.content
print(climate_adaptive_strategy)

Sample Output:

**Climate-Adaptive Sustainable Farming Strategies:**

- **Crop Selection:** Choose drought-resistant crop varieties such as sorghum, millet, or certain legumes that are well-suited to withstand higher temperatures and lower rainfall. Consider crop rotation that improves soil health and enhances water retention.

- **Soil Management:** Improve soil organic matter content by incorporating cover crops or applying organic compost. This will enhance soil moisture retention and provide a buffer against heatwaves and drought conditions. Mulching is also recommended to conserve soil moisture and reduce evaporation.

- **Irrigation Strategy:** Given the high risk of drought, implement drip irrigation systems to deliver water directly to the plant roots, maximizing water efficiency. Utilize AI-powered precision irrigation tools to monitor real-time soil moisture and minimize water waste.

- **Heatwave Mitigation:** Use shade cloth or other protective structures during the peak heat periods to shield sensitive crops from excessive heat stress. Additionally, schedule irrigation during early morning or late evening to reduce water loss due to evaporation.

In Example 1, we used LLMs to analyze data from sensors, satellite imagery, and weather forecasts to optimize irrigation and fertilizer use, focusing on sustainable land use and conservation of resources.

In Example 2, LLMs helped detect crop anomalies (such as nutrient deficiencies and pest infestations) through computer vision, providing sustainable, targeted interventions that minimize chemical use and prevent further damage.

And in Example 3, LLMs helped us analyze predictive climate models and soil data to offer sustainable land use strategies, advising farmers on adaptive practices that mitigate the risks of drought and heatwaves, promote soil health, and optimize resource use.

In these examples, LLMs enhanced decision-making in agriculture by processing complex data and providing actionable, sustainable strategies that increased productivity while reducing environmental impact. These AI-enhanced systems promote the long-term sustainability of agricultural practices.

The integration of AI technologies in sustainable land use strategies holds transformative potential for the agricultural sector. Precision agriculture, AI-powered farm management software, and sustainable farming practices driven by AI collectively optimize resource management, enhance crop yield, and minimize environmental impact.

Financial incentives further support the adoption of these technologies, making sustainable farming practices accessible to a broader range of farmers.

As we embrace the future of agriculture with AI, we move towards a more efficient, productive, and environmentally conscious approach to farming. By leveraging data-driven insights and innovative solutions, farmers can contribute to building a resilient and sustainable food system that meets the needs of a growing global population. The journey towards sustainable agriculture is not without challenges, but with AI as a powerful ally, we are well-equipped to navigate these challenges and shape a prosperous future for farming.

Chapter 8: Efficient Water Use and Irrigation Systems with AI Guidance

Efficient water management is a critical element in effective farming practices. And it’s one where AI's intervention can make a profound difference.

As climate change intensifies water scarcity, innovative solutions become more necessary. AI-guided irrigation systems stand out as revolutionary tools that promise not only to optimize water usage but also to potentially transform agricultural practices.

This chapter looks at how AI-based irrigation systems support sustainable agriculture, from precision watering and automated scheduling to soil moisture monitoring and smart water delivery.

Precision Irrigation Techniques: Tailoring Watering Strategies

AI-powered precision irrigation is changing how water resources are managed. Traditional irrigation methods often involve a one-size-fits-all approach, causing either excessive or insufficient watering. But AI algorithms can tailor water distribution by analyzing a wealth of data, including soil moisture levels, weather conditions, and plant health. For instance, a vineyard might use AI to monitor soil moisture across different zones, ensuring each vine receives the optimal amount of water without wastage.

These AI systems gather real-time data from sensors embedded in the soil and parse this information to determine precise watering needs, ensuring that crops receive just the right amount of moisture when they need it. This intelligent approach reduces water waste significantly and enhances crop yield.
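As a rough illustration of tailoring water to each zone, the sketch below turns soil moisture readings into an irrigation depth per zone. The zone names, the 40% target, and the `MM_PER_MOISTURE_POINT` conversion factor are hypothetical values chosen for the example, not agronomic recommendations.

```python
# Minimal sketch: decide how much water each zone needs from its soil moisture.
# All thresholds and conversion factors are illustrative assumptions.

TARGET_MOISTURE = 40.0        # desired soil moisture, in percent (assumed)
MM_PER_MOISTURE_POINT = 1.5   # mm of water assumed to raise moisture by 1 point


def irrigation_depth_mm(current_moisture: float) -> float:
    """Return the irrigation depth (mm) needed to reach the target moisture."""
    deficit = max(0.0, TARGET_MOISTURE - current_moisture)
    return round(deficit * MM_PER_MOISTURE_POINT, 1)


def plan_irrigation(zone_moisture: dict[str, float]) -> dict[str, float]:
    """Map each zone to the irrigation depth it needs; 0.0 means no watering."""
    return {zone: irrigation_depth_mm(m) for zone, m in zone_moisture.items()}


readings = {"zone_a": 45.0, "zone_b": 30.0, "zone_c": 38.0}
plan = plan_irrigation(readings)
print(plan)  # {'zone_a': 0.0, 'zone_b': 15.0, 'zone_c': 3.0}
```

A real system would likely replace the single conversion factor with a model fitted to the soil and crop, but the shape of the decision (measure, compare to a target, water only the deficit) stays the same.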

Imagine an arid region where water scarcity is a daily challenge. AI-guided systems can stretch each drop of water to its fullest potential, safeguarding both the crops and the environment.

Automated Irrigation Scheduling: Dynamic and Responsive Systems

Predictive analytics and weather forecasting are pivotal in AI-driven automated irrigation scheduling. Traditional methods often fail to account for unpredictable weather variations, leading to inefficiencies. AI systems transform this by autonomously adjusting irrigation schedules in response to real-time environmental inputs.

For example, predictive models can anticipate a week of heavy rainfall. The AI system preemptively adjusts irrigation schedules, avoiding unnecessary watering and conserving water for drier times. This adaptability is essential for regions experiencing erratic weather patterns due to climate change.
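To make the scheduling idea concrete, here is a small sketch that subtracts forecast rainfall from each day's crop water need. The `RAIN_EFFECTIVENESS` factor, the carry-over rule, and the daily figures are assumptions made for illustration; a production system would use a proper soil water balance.

```python
# Minimal sketch: adjust a daily irrigation schedule using a rainfall forecast.
# The effectiveness factor and demand figures are illustrative assumptions.

RAIN_EFFECTIVENESS = 0.8  # assumed fraction of forecast rain that reaches the root zone


def adjust_schedule(daily_need_mm: list[float], forecast_rain_mm: list[float]) -> list[float]:
    """Return irrigation per day (mm) after subtracting expected effective rain.

    Rain that exceeds a day's need is carried over to the following days,
    as a crude stand-in for water stored in the soil.
    """
    schedule = []
    carry_over = 0.0
    for need, rain in zip(daily_need_mm, forecast_rain_mm):
        available = carry_over + rain * RAIN_EFFECTIVENESS
        irrigation = max(0.0, need - available)
        carry_over = max(0.0, available - need)
        schedule.append(round(irrigation, 1))
    return schedule


need = [5.0, 5.0, 5.0, 5.0]        # crop water need per day, mm
forecast = [0.0, 15.0, 0.0, 0.0]   # forecast rain per day, mm
print(adjust_schedule(need, forecast))  # [5.0, 0.0, 0.0, 3.0]
```

Here the 15mm of rain forecast for day two covers days two and three and part of day four, so the system only irrigates on the first and last day.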

Farmers benefit immensely, as they can ensure water resources are used efficiently without the constant need to manually adjust schedules, leading to better crop management and resource use efficiency.

Soil Moisture Monitoring: Foundation of Data-Driven Decisions

Soil moisture monitoring using AI represents the synthesis of technology and agronomy. By utilizing advanced sensors and computer vision technologies, AI systems provide high-fidelity soil moisture data, crucial for informed irrigation decisions. In practical terms, a farmer overseeing vast fields can install soil moisture sensors at various depths and locations. The AI system continuously aggregates this data, presenting actionable insights to the farmer about when and where to irrigate.

Consider the delicate balance required in cultivating crops such as tomatoes that are sensitive to both drought stress and water logging. Continuous soil moisture monitoring aids in maintaining this balance, ensuring that water is neither overused nor insufficiently applied.
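The balance described above can be sketched as a simple band check on averaged sensor readings. The location names, sensor depths, and the 25% and 45% thresholds are hypothetical; real thresholds depend on the crop and the soil.

```python
# Minimal sketch: aggregate multi-depth sensor readings and classify each location.
# The 25%/45% band is an assumed comfort range, not a measured agronomic value.
from statistics import mean

LOW_THRESHOLD = 25.0   # below this, assume drought stress (percent moisture)
HIGH_THRESHOLD = 45.0  # above this, assume waterlogging risk (percent moisture)


def classify(readings_by_depth: dict[str, float]) -> str:
    """Average the readings across depths and label the location."""
    avg = mean(readings_by_depth.values())
    if avg < LOW_THRESHOLD:
        return "irrigate"
    if avg > HIGH_THRESHOLD:
        return "hold: waterlogging risk"
    return "ok"


field = {
    "north": {"10cm": 20.0, "30cm": 26.0},
    "center": {"10cm": 35.0, "30cm": 39.0},
    "south": {"10cm": 48.0, "30cm": 50.0},
}
status = {location: classify(depths) for location, depths in field.items()}
print(status)
```

With these sample readings, the north location falls below the band and is flagged for irrigation, the center is within range, and the south is held back because it is already too wet.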

These systems provide peace of mind, enabling farmers to focus on other critical agricultural tasks, knowing that their irrigation needs are being managed with precision.

Smart Water Delivery Systems: Customizing for Optimal Efficiency

AI algorithms can fine-tune the delivery of water, considering variables like soil type, crop requirements, and field topography. This approach transforms generic irrigation practices into targeted strategies tailored to specific agricultural ecosystems.

Let’s take an example of a diverse farm with sections of sandy and clay-based soils. AI systems analyze these soil conditions and create bespoke irrigation plans for each section, ensuring optimal water absorption and minimal run-off.
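One way to picture a per-section plan is to split the same weekly water requirement into events sized for each soil type. The section names and the per-event caps below are assumed values for illustration only, and the equal-split rule is just one simple option.

```python
# Minimal sketch: split a weekly water requirement into events sized per soil type.
# The per-event limits are assumed values for illustration only.
import math

MAX_MM_PER_EVENT = {
    "sandy": 10.0,  # drains fast and holds little water, so apply small amounts often (assumed)
    "clay": 15.0,   # absorbs slowly, so cap each event to limit run-off (assumed)
}


def irrigation_events(soil_type: str, weekly_need_mm: float) -> list[float]:
    """Return equal-sized irrigation events that stay under the soil's per-event cap."""
    cap = MAX_MM_PER_EVENT[soil_type]
    count = math.ceil(weekly_need_mm / cap)
    if count == 0:
        return []
    return [round(weekly_need_mm / count, 1)] * count


sections = {"east_block": "sandy", "west_block": "clay"}
plans = {name: irrigation_events(soil, 30.0) for name, soil in sections.items()}
print(plans)  # {'east_block': [10.0, 10.0, 10.0], 'west_block': [15.0, 15.0]}
```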

This precision maximizes water use efficiency, improving crop yields and conserving water resources. The benefits extend beyond just individual farms—such practices can lead to regional water conservation efforts, potentially alleviating local water scarcity issues. The ability to customize irrigation strategies means that farmers can cultivate a wider variety of crops, confident that their water needs will be met efficiently.

Enhancing Crop Yields: The Ripple Effect of Efficient Water Use

Efficient water management is not solely about conserving water—it's intrinsically linked to crop productivity. AI-guided irrigation systems, with their precision and accuracy, ensure that crops receive consistent, optimal hydration. This leads to healthier plants, better growth, and ultimately, higher yields. For instance, a study on cotton farming demonstrated that precision irrigation using AI improved yield by 25% compared to traditional practices.

Implementing such systems on a global scale can revolutionize agricultural productivity. In regions where water scarcity and food insecurity are interlinked, AI-driven irrigation can break this cycle, providing reliable water supply to crops and thereby boosting food production. This has far-reaching implications for global food security, highlighting the critical role of AI in addressing complex agricultural challenges.

Sustainable Practices: Bridging Technology and Environmental Stewardship

Oil extraction, industrial activity, and misuse have diminished freshwater reserves worldwide. AI in irrigation promotes sustainability by reducing unnecessary water usage and preserving natural resources. For example, the use of AI in Israel's arid regions helps farmers optimize scarce water supplies, demonstrating that technology can be an ally in environmental stewardship.

These AI systems contribute to sustainable agricultural practices, balancing the needs of present and future generations. Farmers are not just incentivized to conserve water but also to adopt practices that reduce soil degradation and promote biodiversity. The integration of AI technologies in farming becomes a model for other industries, showcasing how advanced technology can aid in achieving environmental goals.

Overcoming Challenges: Addressing Implementation Barriers

Despite the numerous advantages, the integration of AI-guided irrigation systems isn't devoid of challenges. High initial costs and the need for technical expertise can be significant barriers for smallholder farmers. Addressing these challenges requires a multipronged approach involving policy incentives, financing options, and educational programs.

For instance, government subsidies and low-interest loans can make AI technologies more accessible. Collaborative efforts between agritech firms and agricultural extensions can also play a vital role in educating farmers about the operational and financial benefits of these systems. Creating a support ecosystem is essential for widespread adoption, ensuring that no farmer is left behind in the transition towards smarter irrigation practices.

Future Prospects: Evolving Technologies and Expanding Horizons

As technology evolves, so do the possibilities for AI in irrigation management. Future developments may include enhanced machine learning models that can predict long-term trends and AI systems that integrate seamlessly with other smart farming technologies, such as autonomous tractors and drones. Imagine an ecosystem where various AI technologies interact, creating a self-regulating agricultural environment.

Continuous advancements will expand the scope of AI applications, making them more robust and scalable. The potential to integrate AI with renewable energy sources, like solar-powered irrigation systems, can further enhance sustainability efforts. The horizon is vast, and as AI technology matures, its impact on agriculture can only increase.

The future of agriculture is intertwined with advancements in AI technology. As we prepare for this future, understanding the current capabilities and potential of AI-guided irrigation systems is imperative. This knowledge equips stakeholders with the insights needed to leverage these technologies for maximum benefit.

The Path Forward

AI-guided irrigation systems exemplify how technology can revolutionize water management in agriculture, offering solutions that are both sustainable and efficient. By leveraging data, real-time analysis, and predictive models, these systems optimize water usage and enhance crop yields, addressing pressing issues like water scarcity and food security. Embracing these technologies requires overcoming certain barriers, but the potential benefits make the effort worthwhile.

As you move forward, consider how the integration of AI in your irrigation practices can align with broader goals of sustainability and increased productivity. Encourage a proactive approach—explore financing options, seek educational resources, and engage with technology providers. The path forward is paved with opportunities, and the fusion of AI and agriculture is a promising frontier, ready to redefine the future of farming.

Conclusion

The integration of AI in agriculture presents an exciting opportunity to revolutionize farming practices and significantly boost crop yields. The potential of AI-enhanced farming to increase productivity by 70% by 2030 is a game-changer for the agriculture industry.

By leveraging AI technologies such as machine learning and predictive analytics, farmers can make more informed decisions and optimize resource utilization to achieve higher yields. Investing in AI solutions for agriculture is not just an option but a necessity for staying competitive in the rapidly evolving field.

Embracing this technology can lead to sustainable practices, reduced waste, and increased profitability for farmers worldwide. As we look towards the future of farming, it is clear that AI will play a crucial role in ensuring food security.

FAQ

What is AI in agriculture?

AI in agriculture refers to the use of artificial intelligence technology and techniques in the farming and agricultural industry. This can include AI-powered tools and systems that help farmers optimize crop growth, monitor weather patterns, and make data-driven decisions for increased efficiency and productivity.

Will AI replace human labor in agriculture?

AI in agriculture is not meant to replace human labor, but rather enhance it. AI technology can provide valuable insights and recommendations to help farmers make more informed decisions and increase crop yields. With the use of AI, farmers can save time and resources while also increasing their productivity.

What are the potential benefits of using AI in agriculture?

Some potential benefits of using AI in agriculture include increased crop yields, reduced costs, improved efficiency, and better decision-making.

With AI technology, farmers can analyze data and make informed decisions about planting, harvesting, and managing crops. It can also help with predicting weather patterns, optimizing irrigation schedules, and identifying diseases and pests early on.

How does AI help in increasing crop yields?

AI in agriculture can help increase crop yields by using advanced technologies such as machine learning and data analytics to optimize farming practices. This can include predicting optimal planting and harvesting times, identifying potential pest or disease outbreaks, and optimizing irrigation and fertilizer use. By using AI, farmers can make more informed decisions and improve efficiency, leading to higher crop yields.

How does AI help with sustainable agriculture?

AI can help with sustainable agriculture in several ways, such as:

Predicting weather patterns and optimizing irrigation schedules to reduce water waste.
Analyzing soil data and recommending the best crops and fertilizers to maximize yield and minimize environmental impact.
Monitoring crop health and detecting pests and diseases early on, allowing for targeted treatment and reducing the need for harmful pesticides.
Optimizing planting and harvesting schedules for maximum efficiency and reducing labor and fuel costs.

What are some examples of AI technology used in farming?

Some examples of AI technology used in farming include:

Automated tractors and harvesters that use computer vision and machine learning algorithms to optimize planting and harvesting processes.
Soil sensors and drones that collect data on soil moisture, nutrient levels, and crop health, allowing farmers to make data-driven decisions.
Predictive analytics software that uses AI to analyze weather patterns and predict crop yields, helping farmers plan more effectively.
Robotic weeders and pest control systems that use AI to identify and target specific plants or pests, reducing the use of harmful chemicals.

How Can You Dive Deeper?

After studying this guide, if you're keen to dive even deeper and structured learning is your style, consider joining us at LunarTech. We offer an AI Engineering Bootcamp, 77+ individual courses, and a Bootcamp in Data Science, Machine Learning, and AI.

You can check out our Ultimate Data Science Bootcamp and join a free trial to try the content first hand. This has earned the recognition of being one of the Best Data Science Bootcamps of 2023, and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur and more. This is your chance to be a part of a community that thrives on innovation and knowledge.

Connect with Me

If you want to learn more about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job, you can download this free Data Science and AI Career Handbook.

The Microservices Book – Learn How to Build and Manage Services in the Cloud

Adekola Olawale — Thu, 28 Nov 2024 15:08:48 +0000

In today’s fast-paced tech landscape, microservices have emerged as one of the most efficient ways to architect and manage scalable, flexible, and resilient cloud-based systems.

Whether you're working with large-scale applications or building something new from scratch, understanding microservices architecture is crucial to developing software that meets modern business needs.

This book is designed to provide you with a comprehensive understanding of microservices, from building robust services to managing them effectively in the cloud.

What Will You Learn?

Throughout this book, we’ll walk you through the fundamental principles of microservices architecture, focusing on:

Designing and building microservices: We’ll cover how to structure services, choose the right technology stack, define clear APIs and contracts, and utilize essential design patterns.
Managing microservices in the cloud: You'll learn about cloud platforms like AWS, Azure, and Google Cloud, as well as containerization with Docker and orchestration using Kubernetes.
Testing, deployment, and scaling strategies: We’ll dive into how to test microservices effectively, set up continuous integration/continuous deployment (CI/CD) pipelines, and use automation to deploy and scale your services.
Security, monitoring, and troubleshooting: We’ll discuss security considerations and real-time monitoring solutions for microservices in-depth, so you can keep your system resilient and secure.
Case studies and real-world examples: We'll explore how companies like Netflix, Amazon, and Uber use microservices to handle millions of requests daily and how you can apply these concepts to your projects.
Common pitfalls and solutions: Finally, you’ll learn about the common challenges that arise when implementing microservices and how to address them.

By the end of this book, you’ll have a solid understanding of the best practices for building and managing microservices, with the confidence to deploy and scale these architectures in a cloud environment.

Prerequisites

To get the most out of this guide, I recommend that you have:

Basic knowledge of programming: While we’ll use JavaScript/Node.js for many examples, prior experience with any backend programming language will help you follow along.
Familiarity with REST APIs: Since microservices often communicate over HTTP, understanding how REST APIs work will be beneficial.
A basic understanding of cloud services: Experience with cloud platforms (AWS, Azure, Google Cloud) will help as we dive into cloud-native services.
Installed Tools:
- Docker: We’ll use Docker for creating and managing containers.
- Node.js: If you’re following along with the JavaScript examples, make sure you have Node.js installed on your machine.
- Postman: For testing APIs, Postman will be useful.
- Git: Version control knowledge and Git installed on your machine to work with repositories.
- A cloud provider account (for example, AWS, Azure, or Google Cloud) to deploy your microservices into the cloud.
- Kubernetes (Optional): If you’d like to experiment with orchestration locally.
- A code editor (like Visual Studio Code) to write and manage your code.
- Cloud CLI tools (for example AWS CLI, Google Cloud SDK): These will be essential for deploying and managing microservices in your cloud provider.

This book is structured to guide you from the basics to advanced concepts, with practical examples, step-by-step tutorials, and real-world scenarios that will prepare you for building modern microservices in a cloud environment.

Whether you’re a developer looking to improve your microservices skills or an architect designing complex cloud-native systems, this book will equip you with the knowledge to succeed.

Let’s begin the journey toward mastering microservices and cloud management!

What are Microservices?
- What is a Microservices Architecture?
- Key Characteristics of Microservices
- Benefits of Microservices
- Challenges of Microservices
Microservices vs Monolithic Architecture
Core Microservices Concepts and Components
- Microservices Design Principles
- Service Communication: Synchronous vs Asynchronous
- RESTful APIs
- gRPC and Protocol Buffers
- Message Brokers (like RabbitMQ and Kafka)
Data Management in Microservices
- Database per Service Pattern
- Data Consistency and Synchronization
Service Discovery and Load Balancing
How to Build and Design Microservices
How to Implement Microservices
How to Test Microservices
How to Deploy Microservices
How to Manage Microservices in the Cloud
- Cloud Platforms and Services
Containerization and Orchestration
- Introduction to Containers (Docker)
- Container Orchestration Tools (Kubernetes, Docker Swarm)
- Helm Charts and Kubernetes Operators
Continuous Integration and Continuous Deployment (CI/CD)
- CI/CD Pipelines and Best Practices
- Tools and Platforms for CI/CD
- Automated Testing and Deployment Strategies
Monitoring and Logging
Security Considerations
Case Studies and Real-World Examples
- Case Study 1: E-Commerce Platform
- Case Study 2: Streaming Media Service
- Case Study 3: Financial Services Application
Real-World Examples of Microservices
- 1. Netflix: Scaling Content and Recommendations
- 2. Amazon: Managing Orders and Products at Scale
- 3. Uber: Managing Rides, Drivers, and Payments
- Benefits of Using Microservices in These Companies
Common Pitfalls and How to Avoid Them in Microservices
Strategies to Address and Avoid Common Issues
Future Trends and Innovations
Conclusion

What are Microservices?

This section introduces microservices architecture by exploring its foundational principles and distinguishing it from traditional monolithic approaches. It covers the defining features of microservices—like scalability, independent deployment, and support for diverse technologies—that make it a preferred architecture for modern applications.

You’ll also gain insights into the advantages of microservices, such as enhanced fault isolation and flexibility, as well as the challenges, including increased complexity in managing inter-service communication, maintaining data consistency, and ensuring security.

By understanding the key trade-offs involved, you’ll develop a comprehensive view of microservices and their role in contemporary application development. This foundation should equip you, as a developer and architect, with the necessary perspective to assess whether microservices are the right fit for your projects.

Microservices, or the microservices architecture, is a modern approach to designing software systems.

Unlike traditional monolithic applications, which are built as a single, unified unit, a microservices-based application is divided into a set of smaller, independent services.

Each service in a microservices architecture is responsible for a specific function—such as user authentication, payment processing, or data storage—and is designed to be independently deployable and scalable.

These services communicate with each other over a network, typically using lightweight protocols like HTTP or messaging queues, enabling them to operate as separate entities while contributing to the functionality of the larger system.
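As a minimal sketch of two services talking over HTTP, the example below starts a tiny "user service" and has a caller fetch a user from it as JSON. It uses only Python's standard library to stay self-contained (the book's own examples use JavaScript/Node.js). The service name, the `/users/<id>` route, and the data are hypothetical.

```python
# Minimal sketch: one service exposes an HTTP endpoint, another calls it.
# Service names, the /users/<id> route, and the data are hypothetical.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

USERS = {"1": {"id": "1", "name": "Ada"}}


class UserServiceHandler(BaseHTTPRequestHandler):
    """A tiny 'user service' that answers GET /users/<id> with JSON."""

    def do_GET(self):
        user = USERS.get(self.path.rsplit("/", 1)[-1])
        body = json.dumps(user or {"error": "not found"}).encode()
        self.send_response(200 if user else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


def fetch_user(base_url: str, user_id: str) -> dict:
    """What an 'order service' would do: ask the user service over HTTP."""
    opener = build_opener(ProxyHandler({}))  # ignore any system proxy for this local call
    with opener.open(f"{base_url}/users/{user_id}", timeout=5) as resp:
        return json.load(resp)


# Port 0 lets the OS pick a free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), UserServiceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

user = fetch_user(base_url, "1")
print(user)  # {'id': '1', 'name': 'Ada'}

server.shutdown()
server.server_close()
```

The caller only knows the URL and the JSON shape of the response. It has no access to the `USERS` dictionary, which is what keeps the two services separate entities.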

The primary advantage of microservices lies in their independence. Each service can be built, deployed, and managed independently, allowing development teams to work on different parts of the system simultaneously.

This setup promotes flexibility, speed in development and deployment, and the ability to scale each service according to specific demands without affecting others. Microservices are particularly well-suited for cloud environments, where resources can be allocated dynamically based on real-time needs.

What is a Microservices Architecture?

Microservices architecture is an approach to designing and developing software applications where a single application is composed of multiple loosely coupled, independently deployable services.

Each service corresponds to a specific business functionality and operates as an independent unit that communicates with other services through well-defined APIs.

Key Points about Microservices

Modular Design: Microservices break down an application into small, self-contained modules, each responsible for a distinct piece of functionality. This modular approach promotes better organization and separation of concerns.
Independence: Each microservice can be developed, deployed, and scaled independently. This independence allows for more flexible and agile development practices.
Autonomy: Microservices operate independently and are loosely coupled, meaning that changes in one service do not necessarily impact others. This autonomy enhances fault tolerance and resilience.

Key Characteristics of Microservices

Decentralized Data Management

Each microservice manages its own database or data store, which keeps its data encapsulated and reduces dependencies between services. This decentralization helps with scaling and optimizing data access, although keeping data consistent across services then takes deliberate effort.
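The idea of each service owning its data can be sketched with two in-memory SQLite databases. This is an illustration under simplifying assumptions: the class names, tables, and data are hypothetical, and the services call each other in-process rather than over a network.

```python
# Minimal sketch of the database-per-service idea using two in-memory SQLite stores.
# Class names, tables, and data are hypothetical.
import sqlite3


class UserService:
    def __init__(self):
        self._db = sqlite3.connect(":memory:")  # private to this service
        self._db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        self._db.execute("INSERT INTO users VALUES (1, 'Ada')")

    def get_user_name(self, user_id: int):
        row = self._db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


class OrderService:
    def __init__(self, user_service: UserService):
        self._db = sqlite3.connect(":memory:")  # a separate store, never shared
        self._db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)"
        )
        self._users = user_service

    def place_order(self, user_id: int, item: str) -> str:
        # Ask the user service through its public method instead of reading its tables.
        name = self._users.get_user_name(user_id)
        if name is None:
            raise ValueError(f"unknown user {user_id}")
        self._db.execute(
            "INSERT INTO orders (user_id, item) VALUES (?, ?)", (user_id, item)
        )
        return f"order placed for {name}: {item}"


orders = OrderService(UserService())
print(orders.place_order(1, "keyboard"))  # order placed for Ada: keyboard
```

Because `OrderService` never queries the `users` table directly, the user service is free to change its schema or storage engine without breaking the order service.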

Service Boundaries

Microservices are designed around business capabilities, and each service is responsible for a specific business function. This clear delineation of service boundaries helps in achieving a modular and organized system.

API-Based Communication

Services communicate with each other using APIs (Application Programming Interfaces). This ensures that services remain loosely coupled and can interact without direct knowledge of each other’s implementation details.

Independent Deployment

Each microservice can be developed, tested, and deployed independently. This allows teams to deploy updates to individual services without impacting the entire system, leading to faster release cycles.

Technology Diversity

Microservices can use different technologies, frameworks, and programming languages based on their specific needs. This enables the use of the most suitable tools for each service.

Fault Tolerance and Resilience

The decentralized nature of microservices allows for better fault isolation. If one service fails, the rest of the system can continue to function, enhancing overall system resilience.
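Fault isolation can be sketched as a caller that degrades gracefully when one dependency is down. The function names and data below are hypothetical stand-ins for remote services, and the empty-list fallback is just one possible design choice.

```python
# Minimal sketch: the caller degrades gracefully when a dependency fails.
# Function names and data are hypothetical.


def recommendation_service(user_id: str) -> list[str]:
    """Stand-in for a remote recommendation service that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def streaming_service(video_id: str) -> str:
    """Stand-in for the core streaming service, which is healthy."""
    return f"stream://videos/{video_id}"


def build_watch_page(user_id: str, video_id: str) -> dict:
    """Assemble the page; a failed recommendation call must not block playback."""
    try:
        recommendations = recommendation_service(user_id)
    except ConnectionError:
        recommendations = []  # fall back to an empty list and keep going
    return {
        "stream_url": streaming_service(video_id),
        "recommendations": recommendations,
    }


page = build_watch_page("user-1", "42")
print(page)  # {'stream_url': 'stream://videos/42', 'recommendations': []}
```

In a monolith, the same failure could surface as an unhandled error for the whole request. Here the page still loads, only without recommendations.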

Continuous Delivery and DevOps Practices

Microservices align well with DevOps practices and continuous delivery models. They enable automated testing, deployment, and monitoring, facilitating a more agile and iterative development process.

Benefits of Microservices

Scalability and Flexibility: One of the standout advantages of microservices is their ability to scale specific components individually. For example, a service handling user traffic spikes, like a login service, can be scaled up independently without scaling the entire application, conserving resources and lowering operational costs.
- Imagine a restaurant where each kitchen station can expand its capacity independently. If more people order pizza, the pizza station can add more ovens without affecting the salad or dessert stations.
  
  Benefit: This flexibility makes microservices ideal for applications with varying workloads and dynamic growth patterns.
Independent Deployment and Development: Microservices allow teams to work on different services independently. This means that a change or deployment to one service does not necessitate changes or redeployments to other parts of the application, enhancing development speed and reducing downtime.
- Like a construction project where different teams (plumbing, electrical, carpentry) work independently on separate sections of a building, leading to faster overall completion.
  
  Benefit: Independent deployment reduces the risk of deploying new features or updates, as changes in one service do not directly impact others.
Fault Isolation and Resilience: In a microservices architecture, if one service fails, it does not necessarily bring down the entire application. For example, if a recommendation service in a streaming application fails, the core streaming functionality can continue to operate. This isolation makes applications more resilient and fault-tolerant.
- Consider a series of interconnected power grids. If one grid fails, the others continue to function, preventing a total blackout.
  
  Benefit: This fault isolation ensures higher availability and reliability, which is critical for modern applications that require constant uptime.
Technology Diversity and Optimization: Microservices enable teams to choose the best-suited technologies for each service. One service might benefit from being written in Python for data processing, while another might leverage JavaScript for its real-time, event-driven needs. This flexibility allows teams to optimize each service for performance, reliability, and maintainability.
- Similar to a craftsman selecting the best tool for each task, developers can use different programming languages, databases, and frameworks for different services.
  
  Benefit: This technology diversity enables teams to leverage the strengths of various tools, leading to more efficient and tailored solutions.

Challenges of Microservices

While microservices provide significant benefits, they also come with their own set of challenges:

Complexity in Management and Orchestration: Microservices increase the complexity of managing multiple services, each with its own dependencies, configurations, and monitoring requirements. Tools like Kubernetes and Docker Swarm help orchestrate and manage these services, but they require additional setup and expertise.
- Like managing a fleet of ships in a convoy, where each ship must be coordinated, tracked, and directed, the complexity grows with the number of ships.
  
  Challenge: Organizations need to invest in orchestration tools like Kubernetes and service meshes to handle this complexity.
Data Consistency and Transaction Management: In monolithic systems, data consistency is easier to maintain because all components share a single database. With microservices, each service may have its own database, complicating transactions across services. Strategies like the Saga pattern or eventual consistency models are often employed to address this issue, though they can increase system complexity.
- Imagine trying to keep multiple ledgers synchronized across different offices.
  Ensuring that every ledger reflects the same transactions simultaneously can be difficult.
  
  Challenge: Developers often need to implement eventual consistency models and use patterns like Saga to manage distributed transactions.
Inter-Service Communication: Microservices rely heavily on network communication to exchange information. Issues like network latency, service timeouts, and retries can impact system performance. Choosing the right communication protocols (for example, REST, gRPC) and implementing practices like circuit breakers are essential for reliability.
- Like ensuring clear communication between different departments in a company, where messages need to be delivered quickly and accurately, and with the right level of security.
  
  Challenge: Developers must choose appropriate communication protocols (for example, REST, gRPC) and manage inter-service communication failures gracefully.
Security Considerations: Managing security in a microservices architecture is more complex, as each service needs its own access controls, authentication, and encryption measures. Technologies like OAuth2 and JWT (JSON Web Tokens) are commonly used to secure inter-service communication, but they require careful configuration and ongoing management.
- Like securing a multi-building campus where each building has its own security protocols, and ensuring that the entire campus remains secure requires careful planning.
  
  Challenge: Implementing security best practices, such as zero trust models and secure API gateways, is essential to protect microservices from threats.
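The circuit breaker practice mentioned under Inter-Service Communication can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the class name CircuitBreaker and the options failureThreshold, resetTimeoutMs, and now are hypothetical names chosen for this example.

```javascript
// Minimal circuit breaker sketch (hypothetical names, not a library API).
class CircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.action = action;                 // async function that calls the remote service
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;                       // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null;                 // null means the circuit is closed
  }

  async call(...args) {
    if (this.openedAt !== null) {
      const elapsed = this.now() - this.openedAt;
      if (elapsed < this.resetTimeoutMs) {
        // Fail fast instead of sending more traffic to a struggling service.
        throw new Error('Circuit open: request rejected');
      }
      // The reset window has elapsed: let one trial request through.
    }
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.openedAt = null;               // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();       // open (or re-open) the circuit
      }
      throw err;
    }
  }
}
```

A caller would wrap each remote call once, for example new CircuitBreaker(() => fetch(url)), and treat the "Circuit open" error as a signal to fall back to cached or default data.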

The microservices architecture is an advanced, modular approach to building applications that prioritizes scalability, resilience, and flexibility.

While it offers substantial benefits over traditional monolithic architectures, especially in terms of independent service management, it also introduces new challenges in orchestration, communication, and security.

Understanding both the strengths and weaknesses of microservices is crucial for developers, architects, and business leaders aiming to make informed decisions about their application architecture.

Microservices vs Monolithic Architecture

In a monolithic architecture, all components of an application—such as the user interface, business logic, and data layer—are interconnected within a single codebase.

This approach simplifies deployment and can be easier to start with, but it also has limitations.

As applications grow, a monolithic structure can become unwieldy, making it challenging to update or scale specific parts without affecting the entire system.

For instance, updating one feature in a monolithic application may require testing and redeploying the entire application, increasing both the time and potential risks involved.

Microservices, on the other hand, embrace a decentralized architecture, where each service can evolve independently.

This is ideal for complex applications where different teams can develop, test, and deploy their components independently.

But microservices do introduce additional complexity, such as managing service-to-service communication, handling data consistency across distributed services, and maintaining overall system security.

Despite these challenges, microservices offer a more modular, scalable approach that fits well with modern development and deployment practices, especially in agile and DevOps environments.

So to summarize, here are the key differences:

Structure

Monolithic: All functionalities are tightly integrated and managed within a single codebase. The application is usually deployed as a single unit.
Microservices: The application is divided into multiple services, each with its own codebase, data storage, and deployment lifecycle.

Deployment

Monolithic: Any change requires redeploying the entire application. This can lead to longer deployment cycles and higher risk of introducing bugs.
Microservices: Services can be deployed independently, allowing for more frequent updates and easier rollback in case of issues.

Scalability

Monolithic: Scaling requires scaling the entire application, which can be resource-intensive and inefficient.
Microservices: Individual services can be scaled independently based on their specific load and requirements, leading to more efficient resource utilization.

Development and Maintenance

Monolithic: A single codebase can become large and complex, making it difficult to maintain and understand. Development can become slower as the codebase grows.
Microservices: Each service is smaller and more focused, making it easier to manage and develop. Teams can work on different services simultaneously without interfering with each other.

Fault Isolation

Monolithic: A failure in one part of the application can affect the entire system.
Microservices: Failures in one service do not necessarily impact other services, improving the overall fault tolerance of the system.

Core Microservices Concepts and Components

In this section, we’ll delve into the essential building blocks of microservices architecture, breaking down the principles and mechanisms that make it functional, scalable, and adaptable.

This section will cover key concepts such as service boundaries, API communication, and data management. Each component plays a vital role in enabling microservices to operate independently yet cohesively as part of a larger system.

You’ll explore the architectural practices that will let you deploy, scale, and manage microservices separately, while also understanding the importance of orchestration, inter-service communication, and monitoring.

These foundational elements are crucial for building reliable microservices applications and will provide a deeper look at the architecture's inner workings. This understanding will help you apply microservices principles effectively, ensuring that they add value to complex, distributed applications.

Microservices Design Principles

Here are some important principles to keep in mind when you’re designing microservices:

Single Responsibility Principle

Each microservice should focus on a single responsibility or business capability. This principle ensures that each service is specialized and manageable.

Think of a microservice as a specialized department in a company. For example, a company has separate departments for HR, Finance, and Sales, each handling its specific tasks.

// User Service - Manages user-related functionalities
class UserService {
  createUser(user) {
    // Code to create a user
  }
  getUser(userId) {
    // Code to get a user by ID
  }
}

// Order Service - Manages order-related functionalities
class OrderService {
  createOrder(order) {
    // Code to create an order
  }
  getOrder(orderId) {
    // Code to get an order by ID
  }
}

In this code, you can see how each class—UserService and OrderService—is created to focus on a single responsibility.

The UserService class is solely responsible for user-related tasks, such as creating a new user (createUser(user)) and retrieving a user by their ID (getUser(userId)).

By keeping these responsibilities separate, changes in user-related logic can be managed within UserService without affecting other services.

Similarly, OrderService is dedicated to managing order-related tasks, providing functions to create orders (createOrder(order)) and retrieve orders by their ID (getOrder(orderId)).

This approach aligns with the Single Responsibility Principle by ensuring that each service can evolve or scale based on its specific function without cross-dependencies.

For instance, if new features for handling complex user interactions are added, only UserService will require updates, leaving OrderService unaffected.

This isolation not only simplifies maintenance and testing but also supports independent scaling, as each service can be deployed, scaled, and optimized independently based on demand.

By encapsulating distinct business capabilities in individual services, this approach enables a cleaner, more modular, and manageable architecture—a crucial benefit for systems that may grow in complexity over time.

Decentralized Data Management

Each microservice manages its own database or data storage, avoiding shared databases between services.

Imagine each department in a company has its own filing cabinet. HR, Finance, and Sales each store their documents separately, so they don’t interfere with each other.

// Simulating a decentralized database approach
const userDatabase = {}; // Simulated database for user service
const orderDatabase = {}; // Simulated database for order service

class UserService {
  createUser(user) {
    userDatabase[user.id] = user;
  }
  getUser(userId) {
    return userDatabase[userId];
  }
}

class OrderService {
  createOrder(order) {
    orderDatabase[order.id] = order;
  }
  getOrder(orderId) {
    return orderDatabase[orderId];
  }
}

In this code, you can see how each microservice independently manages its own data. Here’s how it works in detail:

Separate Data Stores: The userDatabase object simulates a standalone database dedicated to user data, while the orderDatabase object serves as a separate storage for order data. Each service accesses only its respective database, following the decentralized data management principle.
UserService Class: The UserService class provides methods to create and retrieve user data. The createUser method adds a user to the userDatabase, using user.id as the unique key, and the getUser method retrieves a user based on their userId. This class is isolated from the OrderService, meaning changes to user-related logic or data will not interfere with order data.
OrderService Class: Similarly, the OrderService class manages its own data. The createOrder method stores an order in the orderDatabase, with order.id serving as a unique identifier, and getOrder retrieves an order by its ID.

By isolating data management responsibilities to each service, this code snippet ensures that the user-related and order-related data remain distinct.

This reduces interdependencies between services, which is crucial for achieving high reliability and scalability in a microservices architecture.

In a real-world scenario, each microservice would likely use a separate database instance (for example, separate SQL or NoSQL databases) rather than simple objects, but the principle remains the same.

Each service has full ownership and control over its data, which allows for independent scaling, maintenance, and updates without affecting other services.
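When one service needs data that another service owns, it asks through that service’s interface instead of reading its store. The sketch below illustrates this under some assumptions: the userId, name, and total fields are hypothetical, and passing a UserService instance into OrderService stands in for what would normally be an HTTP or gRPC client.

```javascript
// Each service keeps its store private; other services go through its methods.
class UserService {
  #users = new Map(); // private store: code outside this class cannot touch it

  createUser(user) {
    this.#users.set(user.id, { ...user });
  }
  getUser(userId) {
    const user = this.#users.get(userId);
    return user ? { ...user } : undefined; // return a copy, not the stored record
  }
}

class OrderService {
  #orders = new Map();

  constructor(userService) {
    this.userService = userService; // stands in for a network client
  }

  createOrder(order) {
    // Ask the owning service instead of reading its database directly.
    if (!this.userService.getUser(order.userId)) {
      throw new Error(`Unknown user: ${order.userId}`);
    }
    this.#orders.set(order.id, { ...order });
  }
  getOrder(orderId) {
    return this.#orders.get(orderId);
  }
}

const users = new UserService();
const orders = new OrderService(users);
users.createUser({ id: 'u1', name: 'Ada' });
orders.createOrder({ id: 'o1', userId: 'u1', total: 25 });
console.log(orders.getOrder('o1'));
```

Because the stores are private fields, the only way for order logic to learn about users is the getUser method, which is the property a real database-per-service setup enforces at the network level.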

API-First Design

It’s a good idea to design APIs before implementing the services to ensure clear interaction contracts between services.

Before building a bridge, engineers create detailed blueprints to define how vehicles and pedestrians will use it. Similarly, designing APIs defines how services will communicate.

// Define API contract for User Service
function createUser(user) {
  // POST /users endpoint
}

function getUser(userId) {
  // GET /users/:id endpoint
}

// Define API contract for Order Service
function createOrder(order) {
  // POST /orders endpoint
}

function getOrder(orderId) {
  // GET /orders/:id endpoint
}

In the code above, you can see how each function represents a different API endpoint, specifying the action that each endpoint should perform and the HTTP methods associated with each action.

This allows for an organized approach to creating APIs for our services and ensures that each service's interface is clearly defined before implementation.

Here’s how each function works and the purpose it serves:

The functions createUser(user) and getUser(userId) are defined for the User Service, representing the expected API contract for handling user data.

The createUser function corresponds to a POST /users endpoint, indicating that this function is designed to create a new user.

The choice of the POST method is intentional, as it aligns with standard HTTP practices for creating resources. This endpoint would typically accept a user object as input in the request body and save that data in the user service's database.
The getUser(userId) function, represented by a GET /users/:id endpoint, is designed to retrieve a user's information based on their unique identifier, userId.

The GET method reflects a read operation, meaning this endpoint will fetch data rather than modify it.

Similarly, the Order Service has two endpoint definitions, createOrder(order) and getOrder(orderId), corresponding to POST /orders and GET /orders/:id endpoints, respectively.
The createOrder function is intended to handle new order creation, taking an order object and saving it within the service.
The getOrder function retrieves order details based on the orderId, providing the necessary data for the requesting client or service.

By defining these endpoints upfront, the API-First Design approach emphasizes creating a clear and well-documented blueprint for how each service should be used.

This approach is comparable to engineers designing blueprints before building a bridge—where these API “blueprints” ensure that services can reliably interact with one another.

These API contracts serve as a formalized communication agreement between services, reducing the risk of misinterpretation or errors during integration.
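One way to make such a contract concrete is to write it down as data that both sides can check requests against. The sketch below is only an illustration: the requiredBody field and the checkRequest helper are hypothetical, and real projects often use a specification format such as OpenAPI for the same purpose.

```javascript
// Hypothetical contract for the User Service, written before any handler exists.
const userApiContract = [
  { name: 'createUser', method: 'POST', path: '/users', requiredBody: ['id', 'name'] },
  { name: 'getUser', method: 'GET', path: '/users/:id', requiredBody: [] },
];

// Turn '/users/:id' into a RegExp that matches '/users/42'.
function pathToRegExp(path) {
  const pattern = path.replace(/:[^\/]+/g, '[^/]+');
  return new RegExp(`^${pattern}$`);
}

// Find the contract entry a request maps to, or report why it is invalid.
function checkRequest(contract, { method, path, body = {} }) {
  const entry = contract.find(
    (e) => e.method === method && pathToRegExp(e.path).test(path)
  );
  if (!entry) {
    return { ok: false, reason: 'no matching endpoint' };
  }
  const missing = entry.requiredBody.filter((field) => !(field in body));
  if (missing.length > 0) {
    return { ok: false, reason: `missing fields: ${missing.join(', ')}` };
  }
  return { ok: true, endpoint: entry.name };
}

console.log(checkRequest(userApiContract, { method: 'GET', path: '/users/42' }));
console.log(checkRequest(userApiContract, { method: 'POST', path: '/users', body: { id: '1' } }));
```

The first call resolves to the getUser endpoint, while the second is rejected because the name field is missing, so a contract violation is caught before any service code runs.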

Autonomous Deployment and Scaling

Each microservice can be deployed and scaled independently of others.

Imagine each department in a company has its own office space. If the HR department grows, it can expand its office without affecting the Sales department’s office.

// Simulated deployment and scaling
class UserService {
  deploy() {
    console.log("Deploying User Service...");
  }
  scale() {
    console.log("Scaling User Service...");
  }
}

class OrderService {
  deploy() {
    console.log("Deploying Order Service...");
  }
  scale() {
    console.log("Scaling Order Service...");
  }
}

const userService = new UserService();
const orderService = new OrderService();

userService.deploy();
orderService.deploy();

userService.scale();

In the code above, you can see how each service is treated independently with its own methods for deployment and scaling.

The UserService and OrderService classes both contain deploy() and scale() methods that simulate the ability to launch and adjust the resources dedicated to each service individually.
The deploy() method in each class outputs a message that reflects the action of deploying the service. This action is critical in a cloud environment where services must be managed remotely, often across distributed infrastructure.

Deployment here means making the service available to handle requests, such as by creating new instances of the service in the cloud.
The scale() method simulates increasing the resources allocated to each service, an essential feature in microservices architectures where scaling allows a service to handle an increased load.

For instance, if there is a high demand for user-related actions, only the UserService needs to scale, without impacting the resources or operations of OrderService.

This approach, much like how each department in a company might manage its office space, allows for resource allocation to be both responsive and resource-efficient.

By creating separate instances for userService and orderService and then calling the deploy() and scale() methods, the code highlights how, in practice, these services are intended to operate independently.

This independent operation is fundamental in microservices, ensuring that each service can be scaled or deployed as needed based on demand or new releases, without disrupting or overburdening other parts of the system.
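To make the independence visible in state rather than in log messages, the simulation can track a version and a replica count per service. This is a toy model with hypothetical names (ServiceDeployment, capacityPerReplica, scaleFor); in practice an orchestrator such as Kubernetes would hold this state.

```javascript
// Toy model: each service tracks its own version and replica count.
class ServiceDeployment {
  constructor(name, capacityPerReplica) {
    this.name = name;
    this.capacityPerReplica = capacityPerReplica; // requests per second one replica handles
    this.version = null;
    this.replicas = 0;
  }

  deploy(version) {
    this.version = version;
    if (this.replicas === 0) {
      this.replicas = 1; // first deployment starts a single replica
    }
  }

  // Choose enough replicas for the given load, never fewer than one.
  scaleFor(requestsPerSecond) {
    this.replicas = Math.max(1, Math.ceil(requestsPerSecond / this.capacityPerReplica));
    return this.replicas;
  }
}

const userDeployment = new ServiceDeployment('user-service', 100);
const orderDeployment = new ServiceDeployment('order-service', 100);

userDeployment.deploy('1.0.0');
orderDeployment.deploy('1.0.0');

userDeployment.scaleFor(450);   // a login spike hits only the user service
userDeployment.deploy('1.1.0'); // a new release of the user service only

console.log(userDeployment.replicas, orderDeployment.replicas); // 5 1
console.log(userDeployment.version, orderDeployment.version);   // 1.1.0 1.0.0
```

The order service keeps its single replica and its original version throughout, which is the behavior the deploy() and scale() messages above are meant to suggest.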

Service Communication: Synchronous vs Asynchronous

We’ll discuss two types of communication here: synchronous and asynchronous. Let’s start with the synchronous variety.

In synchronous communication, services wait for a response from another service before continuing. This is like making a phone call where you wait for the person on the other end to respond.

async function fetchUser(userId) {
  const response = await fetch(`/users/${userId}`);
  const user = await response.json();
  return user;
}

In the code above, you can see how the function uses the fetch API to send a request to a specified endpoint (/users/${userId}).

Here’s how it works in detail:

Request Setup: When fetchUser is called, it takes userId as a parameter and builds a request to an endpoint. The URL (/users/${userId}) is set up to retrieve information specifically for that user.
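Awaiting the Response: Using await, the function pauses execution until the response arrives from the server. This is the core of synchronous (request-response) communication: the caller cannot move on to its next step until the reply is in. Note that await pauses only this function, so the JavaScript runtime remains free to run other work in the meantime.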
Extracting Data: After the server responds, await response.json() extracts the user data from the response as JSON.
Returning Data: Finally, the function returns the user object containing the requested user data.

This synchronous approach is useful when a service depends on data from another service to continue processing.

For instance, if an e-commerce microservice needs user details before creating an order, it might pause at this point, waiting until fetchUser retrieves the required data. This ensures that all necessary information is available before moving forward.
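Because the caller is stuck until the reply arrives, synchronous calls are usually given an upper bound on how long they may wait. The variant below is a sketch under assumptions: the function name fetchUserWithTimeout and the options baseUrl, fetchImpl, and timeoutMs are hypothetical, and fetchImpl exists only so the function can be exercised without a real server.

```javascript
// Wait for the User Service, but only up to timeoutMs milliseconds.
async function fetchUserWithTimeout(
  userId,
  { baseUrl = '', fetchImpl = globalThis.fetch, timeoutMs = 2000 } = {}
) {
  const controller = new AbortController();
  // Abort the request if no response has arrived in time.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(`${baseUrl}/users/${userId}`, {
      signal: controller.signal,
    });
    if (!response.ok) {
      // fetch resolves for HTTP error statuses, so check explicitly.
      throw new Error(`User Service responded with status ${response.status}`);
    }
    return await response.json();
  } finally {
    clearTimeout(timer); // avoid a stray timer once the call has settled
  }
}
```

With the built-in fetch, an aborted request rejects, so the order-creation code can catch the error and decide whether to retry or to report the failure.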

In asynchronous communication, on the other hand, services send messages and continue processing without waiting for a response.

This is like sending a letter in the mail. You don’t wait for the recipient’s reply before continuing with your day.

function sendMessage(queue, message) {
  setTimeout(() => {
    console.log(`Message sent to ${queue}: ${message}`);
  }, 1000); // Simulate asynchronous operation
}

sendMessage('orderQueue', 'New order created');

In this code example, the sendMessage function takes two arguments: queue and message. Here:

queue: Represents the name of the message queue, which is the target for the message. Think of it as the destination where the message will be processed asynchronously, like "orderQueue" in this example.
message: The content or payload of the message being sent, here being "New order created".

The setTimeout function is used to simulate an asynchronous operation by delaying the console.log output for 1 second (1000 milliseconds).

This delay stands in for the time a message might take to be delivered and processed. The call to sendMessage itself returns immediately, so the program can continue with other tasks without waiting.

After calling sendMessage, the program doesn’t wait for any confirmation and immediately continues with its other operations, reflecting the non-blocking nature of asynchronous communication in microservices.

And in this code, you can see how setTimeout simulates asynchronous behavior by delaying the message output to demonstrate that sendMessage doesn’t hold up any further actions while it "sends" the message.

This mirrors the real-world asynchronous messaging between microservices, where they communicate by posting messages to queues or topics without waiting for an immediate reply.

This approach helps systems stay decoupled and scalable by allowing different services to operate independently, even if they depend on one another for data.
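The same idea can be shown with a consumer attached, so the ordering of events becomes visible. This is a minimal in-memory stand-in for a queue, with hypothetical names (InMemoryQueue, subscribe, publish); it does not persist or retry anything the way a real broker would.

```javascript
// Minimal in-memory stand-in for a message queue (illustrative only).
class InMemoryQueue {
  constructor() {
    this.subscribers = [];
  }
  subscribe(handler) {
    this.subscribers.push(handler);
  }
  publish(message) {
    // Schedule delivery for later and return right away.
    for (const handler of this.subscribers) {
      setTimeout(() => handler(message), 0);
    }
  }
}

const log = [];
const orderQueue = new InMemoryQueue();
orderQueue.subscribe((message) => log.push(`consumer handled: ${message}`));

orderQueue.publish('New order created');
log.push('producer moved on'); // runs before the consumer sees the message

// Logs the producer entry first, then the consumer entry.
setTimeout(() => console.log(log), 10);
```

The producer’s entry lands in the log first because publish only schedules delivery, which is the non-blocking behavior described above.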

RESTful APIs

REST (Representational State Transfer) uses standard HTTP methods (GET, POST, PUT, DELETE) for service communication.

Think of RESTful APIs like a menu in a restaurant. Each item on the menu (endpoint) corresponds to a specific request (for example, GET to retrieve, POST to create).

// Fetch user using RESTful API
async function getUser(userId) {
  const response = await fetch(`/api/users/${userId}`);
  const user = await response.json();
  return user;
}

This code demonstrates the use of a RESTful API to fetch user data based on a unique userId identifier.

RESTful APIs rely on a standardized set of HTTP methods—such as GET, POST, PUT, and DELETE—to interact with resources.

In this example, the fetch API is used to retrieve user data from a specified endpoint (/api/users/${userId}) by issuing a GET request.

This method is asynchronous, which allows the code to wait for the response without blocking other processes.

Here’s how each part of the code functions:

Function Definition: getUser is an async function, meaning it returns a Promise and can utilize the await keyword for asynchronous operations, making it ideal for handling HTTP requests that may take time to return.
Fetching Data: Within getUser, the fetch function initiates an HTTP GET request to the specified URL endpoint (/api/users/${userId}). This URL is dynamically generated based on the userId provided when the function is called. Here, fetch represents an API request to retrieve a user's information, acting similarly to ordering a specific item from a menu in a restaurant based on a user-supplied request.
Parsing JSON: After receiving the response from the server, await response.json() is used to parse the JSON data, which contains the user’s information. JSON (JavaScript Object Notation) is the most common format for data exchange in REST APIs, making it easy for different services to communicate with one another.
Return Value: Once the data is parsed, it’s returned as a JavaScript object containing the user’s information, which can then be utilized elsewhere in the application.

In this code, you can see how the asynchronous nature of fetch and await works to ensure that the function doesn’t block the program while waiting for the response.

This approach allows the function to perform RESTful communication efficiently, reflecting how microservices interact seamlessly via HTTP requests to fetch, update, or delete resources without impacting the rest of the system.
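The other HTTP methods follow the same pattern. The sketch below wraps them in a small client; the function name createUserClient, the PUT and DELETE routes, and the injectable fetchImpl parameter are assumptions made for illustration, since only the GET endpoint appears in the example above.

```javascript
// Small REST client sketch covering GET, POST, PUT, and DELETE.
function createUserClient(fetchImpl = globalThis.fetch, baseUrl = '/api/users') {
  async function request(method, path, body) {
    const options = { method, headers: { Accept: 'application/json' } };
    if (body !== undefined) {
      // Only requests that carry data need a body and a content type.
      options.headers['Content-Type'] = 'application/json';
      options.body = JSON.stringify(body);
    }
    const response = await fetchImpl(`${baseUrl}${path}`, options);
    if (!response.ok) {
      throw new Error(`${method} ${baseUrl}${path} failed with status ${response.status}`);
    }
    // 204 No Content has no body to parse.
    return response.status === 204 ? null : response.json();
  }

  return {
    getUser: (id) => request('GET', `/${id}`),              // read
    createUser: (user) => request('POST', '', user),        // create
    updateUser: (id, user) => request('PUT', `/${id}`, user), // replace
    deleteUser: (id) => request('DELETE', `/${id}`),        // remove
  };
}
```

In a browser, createUserClient() with no arguments would use the built-in fetch, and a call such as client.updateUser('7', { name: 'Ada' }) would issue a PUT to /api/users/7.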

gRPC and Protocol Buffers

gRPC is a high-performance RPC framework that uses Protocol Buffers for serialization.

gRPC and Protocol Buffers are like a highly efficient postal service that uses a compact and precise form to send messages quickly.

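// gRPC server setup
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Load the service definition from user.proto
const packageDefinition = protoLoader.loadSync('user.proto');
const userProto = grpc.loadPackageDefinition(packageDefinition).user;

function getUser(call, callback) {
  // Implementation here
}

const server = new grpc.Server();
server.addService(userProto.UserService.service, { getUser });

// @grpc/grpc-js binds asynchronously and reports the result in a callback
server.bindAsync('127.0.0.1:50051', grpc.ServerCredentials.createInsecure(), (error, port) => {
  if (error) {
    console.error(`Failed to bind: ${error.message}`);
    return;
  }
  // Needed on older @grpc/grpc-js releases; newer ones start serving once bound
  server.start();
  console.log(`gRPC server listening on port ${port}`);
});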

This code sets up a basic gRPC server using Protocol Buffers to define the structure and communication format of messages between the client and server.

gRPC (a recursive acronym for gRPC Remote Procedure Calls) is a high-performance framework, originally developed at Google, that uses Protocol Buffers (protobuf) for efficient serialization and deserialization of data.

This setup allows for fast, strongly typed communication between microservices, which is particularly useful in distributed systems.

Here’s how each part of the code works:

Library Imports: The code first imports the gRPC library (@grpc/grpc-js) and the Protocol Buffer loader (@grpc/proto-loader). These packages are needed to create a gRPC server and to load Protocol Buffer files.
Loading Protocol Buffer Definition: The line protoLoader.loadSync('user.proto') loads a Protocol Buffer file called user.proto. This file defines the structure of the UserService and its getUser method. After loading the Protocol Buffer file, the grpc.loadPackageDefinition() function converts the package definition into a usable JavaScript object, making the userProto service available to the server.
Defining the getUser Function: The getUser function is a placeholder for handling incoming getUser requests. The function uses two parameters: call, which contains request data sent by the client, and callback, which sends back a response. In a production implementation, this function would interact with a database or perform other business logic before responding.
Setting up the Server: The code initializes a new gRPC server with const server = new grpc.Server(). This server will listen for client requests and respond according to the services and methods defined in the Protocol Buffer.
Adding the Service: The line server.addService(userProto.UserService.service, { getUser }) registers the UserService service and assigns it the getUser function as the handler for its requests.
Binding the Server to an Address: The server is then bound to the local address 127.0.0.1 and port 50051 for listening to incoming requests. Here, grpc.ServerCredentials.createInsecure() sets up an insecure connection. In a real-world application, you’d typically use SSL/TLS certificates for secure communication.
Starting the Server: Finally, server.start() begins listening for requests on the specified address and port.

In the code, you can see how the gRPC framework, along with Protocol Buffers, is used to create an efficient and structured server-client communication channel.

This setup enables microservices to communicate rapidly and precisely by using protobuf, which is more compact than JSON or XML and allows for faster message parsing.

This is similar to a well-organized postal service where both the sender and receiver understand the same structured language, ensuring quick and accurate message delivery between services.
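The getUser placeholder could be filled in along the following lines. This is a sketch under assumptions: user.proto is not shown, so the id field on the request and the shape of the returned user are hypothetical, and the in-memory Map stands in for the service’s database. The status code is inlined so the handler can be tried without installing the library.

```javascript
// Hypothetical in-memory data; a real handler would query the service's database.
const users = new Map([['1', { id: '1', name: 'Ada' }]]);

// Numeric value of grpc.status.NOT_FOUND, inlined so the sketch runs on its own.
const NOT_FOUND = 5;

// Unary handler: call.request holds the decoded request message,
// and callback(error, response) sends the reply.
function getUser(call, callback) {
  const user = users.get(call.request.id);
  if (!user) {
    callback({ code: NOT_FOUND, details: `User ${call.request.id} not found` });
    return;
  }
  callback(null, user);
}

// Exercise the handler with a fake call object, as a unit test might.
getUser({ request: { id: '1' } }, (error, response) => {
  console.log(error, response);
});
```

Because the handler only depends on the call and callback arguments, it can be tested in isolation and then registered with server.addService unchanged.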

Message Brokers (like RabbitMQ and Kafka)

Message brokers manage and route messages between services, enabling asynchronous communication.

A message broker is like a post office that handles and delivers messages between senders and receivers.

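const amqp = require('amqplib');

async function sendMessage(queue, message) {
  const connection = await amqp.connect('amqp://localhost');
  try {
    const channel = await connection.createChannel();
    await channel.assertQueue(queue);
    channel.sendToQueue(queue, Buffer.from(message));
    console.log(`Message sent to ${queue}: ${message}`);
    // Close the channel before the connection so the pending publish is written out first
    await channel.close();
  } finally {
    await connection.close();
  }
}

// Report connection or publish errors instead of leaving the rejection unhandled
sendMessage('orderQueue', 'New order created').catch(console.error);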

This code demonstrates how to send a message to a RabbitMQ message queue using the amqplib library in Node.js. Message brokers like RabbitMQ act as intermediaries, managing and routing messages between services asynchronously.

They help decouple services, meaning that services don’t need to wait for responses to continue functioning. RabbitMQ is particularly useful in microservices architectures for distributing tasks, such as order processing or notifications.

The sendMessage function encapsulates the message-sending process. Here’s how each part works:

Connecting to RabbitMQ: The line const connection = await amqp.connect('amqp://localhost'); establishes a connection to the RabbitMQ server. Here, amqp://localhost refers to a locally hosted RabbitMQ instance. In a production environment, this would typically be a remote server URL.
Creating a Channel: The await connection.createChannel(); line creates a channel for sending messages. Channels are lightweight virtual connections that share a single TCP connection to the broker. Each channel operates independently, so multiple channels can be used simultaneously without interfering with each other.
Declaring the Queue: By calling await channel.assertQueue(queue);, the code ensures that the specified queue (orderQueue in this case) exists. If it doesn’t exist, RabbitMQ will create it. This declaration helps RabbitMQ know where the message should be sent.
Sending the Message: The line channel.sendToQueue(queue, Buffer.from(message)); sends the message to the specified queue by converting it to a Buffer. Buffers handle binary data, which is how RabbitMQ expects messages to be sent. In this case, the message "New order created" is sent to orderQueue.
Closing the Connection: Finally, await connection.close(); closes the connection to RabbitMQ, ensuring that resources are freed up after the message has been sent.

This setup is similar to a post office that receives and distributes mail. Just as a post office routes letters to their recipients, RabbitMQ ensures messages reach the correct service queues, allowing services to process them when they’re ready.

This code shows how RabbitMQ’s asynchronous communication helps prevent services from blocking each other, enabling a more scalable, reliable application design.
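The receiving side can be sketched in the same style. The function below takes an amqplib channel as a parameter so it can be tried with a stand-in object; the name consumeMessages and the choice to reject failed messages without requeueing are assumptions made for this example, not a recommendation for every workload.

```javascript
// Consumer sketch: `channel` is expected to be an amqplib channel.
async function consumeMessages(channel, queue, handleMessage) {
  await channel.assertQueue(queue);
  await channel.consume(queue, async (msg) => {
    if (msg === null) {
      return; // amqplib passes null when the consumer is cancelled
    }
    try {
      await handleMessage(msg.content.toString());
      channel.ack(msg); // tell the broker the message was processed
    } catch (err) {
      // Reject this one message and do not put it back on the queue.
      channel.nack(msg, false, false);
    }
  });
}

// Wiring it to a real broker (needs the amqplib package and a running RabbitMQ):
// const amqp = require('amqplib');
// const connection = await amqp.connect('amqp://localhost');
// const channel = await connection.createChannel();
// await consumeMessages(channel, 'orderQueue', (text) => console.log(`Received: ${text}`));
```

The producer and the consumer never call each other directly. Each one only knows the queue name, which is what keeps the two services decoupled.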

Data Management in Microservices

Database per Service Pattern

Each microservice has its own database, ensuring data encapsulation and independence.

It's like how each department in a company keeps its own filing system, so records stay separate and are managed independently.

// Simulating separate databases for User and Order services
const userDatabase = {};
const orderDatabase = {};

function addUser(user) {
  userDatabase[user.id] = user;
}

function addOrder(order) {
  orderDatabase[order.id] = order;
}

In this code, you can see how separate databases are being simulated for the User and Order services. Each microservice manages its own isolated database (userDatabase and orderDatabase), ensuring that the data for users and orders is kept separate, just like how different departments within a company manage their own filing systems to avoid interference.

User Service Database: The userDatabase object acts as the storage for all user-related data. The addUser function adds new users to this database by storing user information with a unique user.id as the key. This means that all user data is managed and stored by the User Service independently of any other service.
Order Service Database: Similarly, the orderDatabase object stores all order-related data, with the addOrder function adding orders using their unique order.id. Again, the order data is managed and stored by the Order Service independently, without any interference from the User Service.

The key concept demonstrated here is the Database per Service pattern, which is a fundamental aspect of microservices architectures.

By ensuring that each service (for example, User Service, Order Service) has its own database, you prevent issues related to tight coupling between services.

Each service can evolve and scale independently, managing its own data in a way that best suits its functionality.

In this scenario, if the User service needs to change its database schema (for example, adding more fields to the user data), it can do so without affecting the Order service.

Similarly, if the Order service needs to optimize its data management or scale independently, it can do so without relying on the User service's database.

This approach makes each service self-contained, thus supporting easier maintenance and greater scalability.
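One consequence of this pattern is that a service needing another service's data has to ask for it through that service's public interface instead of reading its database directly. The following is a minimal sketch of that idea, assuming in-memory stores like the ones above. The names createUserService, createOrderService, and getOrderWithUser are hypothetical and introduced only for illustration.

```javascript
// Each factory closes over its own private store; nothing outside can touch it.
function createUserService() {
  const userDatabase = {}; // private to the User service
  return {
    addUser(user) { userDatabase[user.id] = user; },
    getUser(id) { return userDatabase[id] ?? null; },
  };
}

function createOrderService(userService) {
  const orderDatabase = {}; // private to the Order service
  return {
    addOrder(order) { orderDatabase[order.id] = order; },
    // Orders store only a userId; user details come from the User service's API.
    getOrderWithUser(orderId) {
      const order = orderDatabase[orderId];
      if (!order) return null;
      return { ...order, user: userService.getUser(order.userId) };
    },
  };
}

const userService = createUserService();
const orderService = createOrderService(userService);
userService.addUser({ id: 'u1', name: 'Ada' });
orderService.addOrder({ id: 'o1', userId: 'u1', total: 42 });
console.log(orderService.getOrderWithUser('o1'));
```

In a real deployment the call to userService.getUser would typically be an HTTP or gRPC request rather than a local function call, but the boundary is the same: the Order service never sees the User service's storage.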

Data Consistency and Synchronization

Ensuring consistency across services and handling data synchronization challenges are key when working with microservices.

This is like synchronizing calendars across multiple devices to ensure all appointments are up-to-date.

There are various strategies you can use to handle these issues:

Event Sourcing

Event sourcing involves storing changes to data as a sequence of events rather than a single state. It’s like keeping a diary of every change rather than just recording the final status.

const events = []; // Event log

function addUserEvent(user) {
  events.push({ type: 'USER_CREATED', payload: user });
}

function replayEvents() {
  events.forEach(event => {
    if (event.type === 'USER_CREATED') {
      console.log('Replaying event:', event.payload);
    }
  });
}

In the code above, you can see how events are logged and replayed in an event-sourcing pattern:

Event Logging with addUserEvent: The addUserEvent function simulates adding a "user created" event to an event log (events array). Each event includes a type property, which identifies the type of event (in this case, 'USER_CREATED'), and a payload property that contains the actual data for the event. Every time a new user is created, the addUserEvent function captures this change as a new entry in the events array, keeping a record of the action.
Replaying Events with replayEvents: The replayEvents function demonstrates how to go through the recorded events and process them. It iterates over each event in the events array, checking the type of each event. If an event is of type 'USER_CREATED', it logs the payload of the event. This replaying process is central to event sourcing, as it enables the system to "recreate" the state based on the sequence of events. Here, the console.log statement serves as a placeholder, which could be replaced with any logic needed to actually apply or process the event data.

This example illustrates the event sourcing principle of retaining a record of each significant change as a discrete event, rather than just updating the state directly.

By capturing changes as events, we gain a historical log of all actions, which can be replayed for auditing, debugging, or reconstructing the system state at any specific point in time.

This concept is similar to maintaining a detailed diary rather than just summarizing the current state—each entry preserves context about changes that occurred over time.
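To make the idea of reconstructing state more concrete, here is a small sketch that folds an event log into a state object. It assumes a second event type, USER_RENAMED, and the helper names applyEvent and rebuildState, none of which appear in the example above.

```javascript
// Hypothetical event log with two event types (USER_RENAMED is an assumption).
const eventLog = [
  { type: 'USER_CREATED', payload: { id: 1, name: 'Ada' } },
  { type: 'USER_CREATED', payload: { id: 2, name: 'Linus' } },
  { type: 'USER_RENAMED', payload: { id: 1, name: 'Ada L.' } },
];

// Apply one event to the current state and return the next state.
function applyEvent(state, event) {
  switch (event.type) {
    case 'USER_CREATED':
      return { ...state, [event.payload.id]: { ...event.payload } };
    case 'USER_RENAMED':
      return {
        ...state,
        [event.payload.id]: { ...state[event.payload.id], name: event.payload.name },
      };
    default:
      return state; // unknown events are ignored
  }
}

// Replay the first `count` events to get the state at that point in time.
function rebuildState(log, count = log.length) {
  return log.slice(0, count).reduce(applyEvent, {});
}

console.log(rebuildState(eventLog));    // current state
console.log(rebuildState(eventLog, 2)); // state before the rename
```

Because the state is derived entirely from the log, replaying a prefix of the events gives the state as it was at that moment, which is what makes auditing and debugging possible.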

CQRS (Command Query Responsibility Segregation)

This involves separating command (write) and query (read) operations.

It’s like having separate teams for handling customer service requests (commands) and handling customer inquiries (queries).

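// In-memory store standing in for a real database (illustrative only)
const users = {};

// Command: Modify data
function createUser(user) {
  users[user.id] = user; // Changes state and returns nothing
}

// Query: Retrieve data
function getUser(userId) {
  return users[userId]; // Reads state and never changes it
}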

In this code, you can see how commands and queries are separated in CQRS:

Command - createUser: The createUser function represents a command. In the context of CQRS, a command is an operation that modifies the state of the application. Here, createUser would include logic to add a new user to the system, modifying the database by inserting new user data. Commands in CQRS focus solely on changing the data: they don’t return the updated data or information about the system state but rather indicate an action to be performed.
Query - getUser: The getUser function represents a query. In CQRS, queries are used solely to retrieve data without altering the system state. This function could contain logic to look up and return user information based on the provided userId. Since queries only retrieve data, they don’t impact the underlying data and can be optimized for fast reads, enabling the system to scale read operations as needed.

By separating these operations into distinct functions, CQRS helps enforce the idea that reading and modifying data should not be intermixed.

This separation improves clarity, as each function has a clear purpose and responsibility.

It also allows the system to handle high volumes of read requests without impacting write operations (and vice versa), making the architecture more resilient and scalable for complex applications.

The analogy to separate teams handling different tasks is helpful here. Just as one team might handle customer service requests (for example, resolving issues or making changes) and another team handles customer inquiries (for example, answering questions or providing information), the code separates commands and queries into distinct functions for specialized purposes.
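The scaling benefit usually comes from giving the read side its own data model. The sketch below is one possible way to show that, assuming a write store and a separate, denormalized read view. The names writeStore, userListView, createUserCommand, and listUsersQuery are hypothetical.

```javascript
// Write side: the source of truth, changed only by commands.
const writeStore = new Map();
// Read side: a denormalized view shaped for queries (hypothetical projection).
const userListView = [];

// Command: changes state and returns nothing.
function createUserCommand(user) {
  if (writeStore.has(user.id)) {
    throw new Error(`User ${user.id} already exists`);
  }
  writeStore.set(user.id, user);
  // Keep the read model in sync. In a real system this step is often asynchronous.
  userListView.push({ id: user.id, displayName: `${user.firstName} ${user.lastName}` });
}

// Query: reads from the view and never changes state.
function listUsersQuery() {
  return userListView.map(entry => ({ ...entry }));
}

createUserCommand({ id: 1, firstName: 'Grace', lastName: 'Hopper' });
console.log(listUsersQuery());
```

Here the query returns copies of the view entries, so callers cannot modify the read model by accident.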

Service Discovery and Load Balancing

Service Discovery Mechanisms

Service discovery mechanisms help you automatically locate and interact with services in a distributed system.

It’s like a company directory where employees can find the contact details of their colleagues.

// Simulated service discovery using a mock service discovery
const services = {
  userService: 'http://localhost:3001',
  orderService: 'http://localhost:3002'
};

function getServiceUrl(serviceName) {
  return services[serviceName];
}

console.log('User Service URL:', getServiceUrl('userService'));

In this code, you can see how service discovery is implemented with a simple lookup structure:

Service Directory (Mock Service Discovery): The services object acts as a mock directory that maps service names (like userService and orderService) to their URLs (for example, http://localhost:3001 for the User Service). In real-world applications, this directory would be managed by a dedicated service discovery tool (such as Consul, Eureka, or etcd) rather than a static object. These tools keep track of available service instances and their locations, handling updates when services start or stop.
Dynamic URL Resolution: The getServiceUrl function accepts a service name as an argument and returns the corresponding URL by looking it up in the services directory. Here, the code getServiceUrl('userService') returns http://localhost:3001. This allows a client or another service to dynamically resolve and access the URL for userService, decoupling the services by avoiding hardcoded URLs.
Example Output: The final console.log line demonstrates fetching the User Service URL using the getServiceUrl function, allowing dynamic access. The returned URL can be used by other services to make HTTP requests to the User Service.

The analogy here is like using a company directory to look up a colleague's contact details rather than remembering each individual’s location or number.

In a microservices architecture, service discovery mechanisms like this make the system more resilient and flexible, as services can be added, removed, or scaled without directly impacting other services that depend on them.
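To illustrate how services can be added and removed at runtime, here is a rough sketch of a registry that instances register with and deregister from. It is an in-memory stand-in, not the API of Consul, Eureka, or etcd, and the function names register, deregister, and resolve are assumptions.

```javascript
// Hypothetical in-memory registry: service name -> set of instance URLs.
const registry = new Map();

function register(serviceName, url) {
  if (!registry.has(serviceName)) registry.set(serviceName, new Set());
  registry.get(serviceName).add(url);
}

function deregister(serviceName, url) {
  registry.get(serviceName)?.delete(url);
}

// Return all known instances; callers decide which one to use.
function resolve(serviceName) {
  return [...(registry.get(serviceName) ?? [])];
}

register('userService', 'http://localhost:3001');
register('userService', 'http://localhost:3003');
deregister('userService', 'http://localhost:3001'); // instance shut down
console.log(resolve('userService'));
```

Clients that call resolve before each request automatically stop seeing instances that have deregistered, without any change to their own code.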

Load Balancing Strategies

Load balancing involves distributing network traffic across multiple servers to ensure efficient use of resources.

It’s like a traffic light that directs cars to different lanes to manage traffic flow.

// Simulated load balancing
const servers = ['http://localhost:3001', 'http://localhost:3002'];

function getServer() {
  return servers[Math.floor(Math.random() * servers.length)];
}

console.log('Selected Server:', getServer());

In the code above, you can see how load balancing is simulated using an array of server URLs and a simple randomization technique:

Server Pool: The servers array contains a list of URLs representing different servers or instances of the same service (for example, two instances of a web application running on different ports, http://localhost:3001 and http://localhost:3002). In a production environment, this list would typically include the actual IP addresses or URLs of servers that can handle the load.
Random Load Balancing Strategy: The getServer function picks a server at random by selecting an index within the servers array. It generates a random number in the range [0, 1) using Math.random() and multiplies it by the length of the servers array. Then, Math.floor() rounds this value down to a whole number, ensuring it corresponds to a valid index in the servers array. This strategy simulates random load balancing by choosing one server for each request. Over many requests, traffic is spread roughly evenly, although the choice ignores how busy each server currently is.
Output: Finally, console.log('Selected Server:', getServer()); demonstrates which server was selected. Each time getServer() is called, it may pick a different server, showing how incoming requests would be balanced across the available options.

In real-world scenarios, load balancers often use more sophisticated strategies, such as round-robin (cycling through servers in sequence) or least connections (sending traffic to the server with the fewest active connections).

The analogy here is like a traffic light directing cars into different lanes: each lane is a server, and the traffic light (load balancer) distributes vehicles (requests) to prevent congestion.

This simple load-balancing code illustrates the concept of spreading requests across servers, which can improve performance and system resilience by reducing the chances of overloading any single server.
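The round-robin and least-connections strategies mentioned above can be sketched in a few lines. This is a simplified illustration. The activeConnections map is a hypothetical structure that a real load balancer would maintain itself, and the helper names are assumptions.

```javascript
const serverPool = ['http://localhost:3001', 'http://localhost:3002'];

// Round-robin: hand out servers in order, wrapping around at the end.
function createRoundRobin(servers) {
  let next = 0;
  return function pick() {
    const server = servers[next];
    next = (next + 1) % servers.length;
    return server;
  };
}

// Least connections: pick the server with the fewest active connections.
// `activeConnections` is a hypothetical map the caller keeps up to date.
function pickLeastConnections(servers, activeConnections) {
  return servers.reduce((best, server) =>
    (activeConnections[server] ?? 0) < (activeConnections[best] ?? 0) ? server : best
  );
}

const pick = createRoundRobin(serverPool);
console.log(pick(), pick(), pick());
console.log(pickLeastConnections(serverPool, {
  'http://localhost:3001': 5,
  'http://localhost:3002': 2,
}));
```

Round-robin needs no knowledge of server load, which keeps it simple, while least connections adapts when some requests take much longer than others.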

How to Build and Design Microservices

In this section, I’ll guide you through the process of designing and developing microservices, focusing on best practices and practical techniques for creating effective, resilient services.

We’ll cover essential steps like setting up a microservices environment, structuring services for modularity, and choosing the right tools and frameworks to streamline development.

You will learn about key aspects of service creation, including defining service boundaries, establishing inter-service communication, and implementing APIs for seamless integration.

We’ll also explore important considerations like data management, security, and deployment strategies specific to microservices.

By the end of this section, you'll have a comprehensive understanding of the techniques and tools that support efficient microservices development, providing a strong foundation for creating scalable, flexible, and high-performing microservices-based applications.

Define Service Boundaries

It’s important to identify the distinct business functions that each microservice will handle. This involves defining clear responsibilities and interfaces.

Think of service boundaries like different departments in a company. Each department (HR, Sales, Support) has a clear function and operates independently.

// Define service boundaries
class UserService {
  constructor() {
    this.users = []; // Manages user-related data
  }

  createUser(user) {
    this.users.push(user);
    return user;
  }

  getUser(userId) {
    return this.users.find(user => user.id === userId);
  }
}

class OrderService {
  constructor() {
    this.orders = []; // Manages order-related data
  }

  createOrder(order) {
    this.orders.push(order);
    return order;
  }

  getOrder(orderId) {
    return this.orders.find(order => order.id === orderId);
  }
}

In this code, you can see how each service has its own distinct responsibilities:

UserService: This class is dedicated to managing user-related data and functionalities. The this.users array simulates a database, storing user data exclusively within the UserService scope. The createUser method allows for adding a new user to this array, while getUser retrieves a user by their ID. By defining these methods within UserService, the code makes sure that all user-related data is encapsulated and handled only within this service, ensuring clear separation from other services.
OrderService: Similarly, OrderService is exclusively responsible for order-related data and operations. It maintains its own this.orders array to store order data and provides createOrder and getOrder methods to add and retrieve orders, respectively. Like UserService, this approach confines order-related data management within OrderService, creating a clear boundary between the two services.

In practice, these service boundaries are like separate departments in a company, such as HR and Sales, where each department operates independently with its specific set of responsibilities.

UserService and OrderService each manage their own data (users and orders, respectively) without interfering with each other, thus minimizing dependencies and enabling each service to evolve independently.

This design makes it easier to scale, modify, and maintain individual services without impacting other parts of the application.

Decide on Data Storage

You’ll need to choose the appropriate data storage solution for each microservice, considering factors such as scalability and consistency.

It’s just like choosing the right type of storage (for example, filing cabinet, cloud storage) based on what you need to store and how you need to access it.

// Simple in-memory storage for demonstration
const userDatabase = {}; // For UserService
const orderDatabase = {}; // For OrderService

class UserService {
  createUser(user) {
    userDatabase[user.id] = user;
  }

  getUser(userId) {
    return userDatabase[userId];
  }
}

class OrderService {
  createOrder(order) {
    orderDatabase[order.id] = order;
  }

  getOrder(orderId) {
    return orderDatabase[orderId];
  }
}

In this code, you can see how each service is designed to operate with its own isolated storage:

UserService: The UserService class interacts solely with the userDatabase object. When the createUser method is called, it stores the user’s data in userDatabase, using the user’s ID as the key to make retrieval efficient. The getUser method retrieves user data by accessing this in-memory "database" with the user ID. This approach confines user data management entirely within the UserService, preventing other services from directly accessing or modifying it, which aligns with the microservices goal of encapsulating data within the responsible service.
OrderService: Similarly, the OrderService class interacts only with orderDatabase, a separate in-memory object dedicated to storing order-related data. The createOrder method adds order information to this object, using each order’s unique ID as a key. The getOrder method then retrieves orders from orderDatabase as needed. As with UserService, OrderService maintains strict data separation, ensuring that order data is accessible only within the context of this service.

This structure emphasizes decoupling data management for each service, which offers several advantages in a microservices architecture. For instance, by isolating each service’s data, this model allows each service to choose the most suitable data storage solution based on its specific requirements.

Just as an organization might choose cloud storage for accessible files and secure storage for sensitive documents, each microservice could adopt a different database type (for example, SQL, NoSQL) depending on its workload.

This separation also supports scalability, as each service can independently scale its storage layer without affecting others.
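One way to keep the choice of storage open is to have each service depend on a small storage interface instead of a concrete database. The sketch below assumes a hypothetical InMemoryStore adapter with get and set methods. A SQL or NoSQL adapter exposing the same two methods could be swapped in without changing the service.

```javascript
// Hypothetical storage adapter: any object with get(id) and set(id, value) works.
class InMemoryStore {
  constructor() { this.items = new Map(); }
  get(id) { return this.items.get(id); }
  set(id, value) { this.items.set(id, value); }
}

// The service depends on the adapter's interface, not on a specific database.
class UserService {
  constructor(store) { this.store = store; }
  createUser(user) { this.store.set(user.id, user); return user; }
  getUser(userId) { return this.store.get(userId); }
}

const users = new UserService(new InMemoryStore());
users.createUser({ id: 7, name: 'Ada' });
console.log(users.getUser(7));
```

Passing the store in through the constructor also makes the service easy to test, since a test can supply an in-memory store while production code supplies a real one.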

Choose the Right Technology Stack

Selecting the appropriate technology stack is a crucial step in building microservices.

This decision impacts your microservices architecture's performance, scalability, maintainability, and overall success.

The flexibility of microservices allows you to choose different programming languages, frameworks, and tools for various services, optimizing each one for its specific needs.

Programming Languages

In a microservices architecture, you can use different programming languages for different services based on their requirements.

For instance, you might choose JavaScript (Node.js) for real-time services, Python for data processing, and Java for high-performance backend services.

Here’s what to consider:

Team Expertise: Choose languages your team is proficient in to reduce the learning curve and increase productivity.
Ecosystem and Libraries: Consider the availability of frameworks, libraries, and community support for the language.
Performance Needs: Some languages offer better performance for specific tasks. For example, Go is often chosen for its concurrency capabilities in high-performance applications.

// Node.js example for a simple microservice
const express = require('express');
const app = express();

app.get('/hello', (req, res) => {
    res.send('Hello, World!');
});

app.listen(3000, () => {
    console.log('Service running on port 3000');
});

In the code above, you can see how a basic Node.js-based microservice works by using the Express framework to handle a simple HTTP GET request.

This example demonstrates setting up a microservice with minimal code, illustrating how microservices can efficiently serve specific functionalities.

In this code, you can see:

Express Setup: The code starts by importing the express module, which is a lightweight, flexible Node.js framework commonly used for building microservices and web applications. express() initializes an application instance named app, allowing us to define routes and behaviors.
Defining a Route: Next, we define a route handler using app.get('/hello', (req, res) => { ... }). This line sets up an endpoint, /hello, which will respond to HTTP GET requests. When a request is made to this endpoint, the callback function sends back a response of "Hello, World!". This function demonstrates how specific endpoints can be easily created within a microservice to handle different requests and responses.
Starting the Server: The line app.listen(3000, ...) instructs the app to listen on port 3000, meaning it will respond to incoming requests on this port. When the server successfully starts, a message, "Service running on port 3000", is logged to the console. This line is crucial for making the microservice operational, as it opens up the specified port for client communication.

This setup is a typical approach for a simple microservice, where each microservice can run independently, serve specific routes, and perform unique actions.

It demonstrates the concept of service boundaries by limiting the functionality of this microservice to a specific purpose: handling requests to the /hello endpoint and responding with a message.

This design can be expanded by adding more endpoints, handling more request types, and incorporating additional logic as needed.

Frameworks

Depending on the complexity and requirements of your service, you might choose a lightweight framework (like Express.js for Node.js) or a more comprehensive one (like Spring Boot for Java).

Some frameworks are specifically designed for microservices, offering built-in support for service discovery, configuration management, and other essential features. Examples include Spring Boot (Java) and Micronaut (Java, Groovy, Kotlin).

Here’s what to consider:

Scalability: Ensure the framework supports horizontal scaling and distributed systems.
Ease of Integration: Choose frameworks that integrate well with your existing systems and technologies.
Developer Productivity: Frameworks with higher levels of abstraction can speed up development but may also limit flexibility.

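// Spring Boot example for a simple microservice
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api")
public class HelloWorldController {

    @GetMapping("/hello")
    public String hello() {
        return "Hello, World!";
    }
}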

This code illustrates how a simple Spring Boot microservice works, specifically by defining a REST endpoint that responds to HTTP requests.

You have a HelloWorldController class, annotated with @RestController, which marks it as a RESTful web service controller in Spring Boot. This annotation allows the class to handle incoming HTTP requests and writes each method's return value directly to the response body. Objects are serialized to JSON by default, while a plain String like the one returned here is sent as text. This makes it well suited to building microservices.
The @RequestMapping("/api") annotation specifies a base URI for all endpoints in this controller. In this case, all routes managed by HelloWorldController will begin with /api, organizing the API endpoints under a single base path.
Within the class, the @GetMapping("/hello") annotation is used on the hello() method, designating it as an HTTP GET endpoint. This means that whenever the /api/hello route is accessed with a GET request, the hello() method will be triggered.
The hello() method is a simple function that returns the string "Hello, World!". When a client makes a GET request to /api/hello, Spring Boot invokes this method and writes the returned string to the body of the HTTP response, with a 200 OK status by default.

This setup forms the basis of a simple microservice endpoint, as it defines a clear URI path, method type, and response format, encapsulated within a RESTful API.

This example shows how Spring Boot's annotations streamline the development process for RESTful services. The @RestController and route-mapping annotations handle much of the boilerplate, allowing developers to focus on building individual endpoints.

This simplicity is especially beneficial in microservices architecture, where small, single-purpose services can be rapidly developed, tested, and scaled independently.

Technology Stack Alignment

While microservices allow for different stacks across services, it’s important to strike a balance between consistency (to avoid operational overhead) and flexibility (to optimize individual services). For example, you might standardize certain tools for monitoring, logging, and CI/CD, even if you use different languages.

You should also consider how your chosen technology stack works within containers (like Docker). Containerization enables consistent environments across development, testing, and production.

Defining APIs and Contracts

Defining clear and well-structured APIs is a cornerstone of successful microservices architecture.

APIs serve as the communication bridge between microservices, enabling them to work together while remaining loosely coupled.

API Design Principles: RESTful vs. gRPC

RESTful APIs: REST (Representational State Transfer) is widely used due to its simplicity, human-readability, and ease of integration with HTTP. RESTful APIs are typically designed around resources and use standard HTTP methods (GET, POST, PUT, DELETE).

GET /api/users/{id}

In this request line, you can see how a RESTful API request is structured to retrieve user information by ID. This endpoint, represented by GET /api/users/{id}, is a commonly used RESTful pattern for accessing specific resources, in this case, user data.

Here’s a breakdown of what this endpoint does and how it works:

The GET method is used to request data from the server, and it’s specifically designed to retrieve information without modifying any data on the server. In this context, the GET request is directed to the /api/users/{id} endpoint, where {id} represents a variable placeholder for the specific user’s unique identifier.
When a request is made to this endpoint (for example, GET /api/users/123), the server interprets {id} as the ID of the user whose data is being requested.
The server then retrieves the relevant user information from its database and sends it back to the client, typically in JSON format.

This approach aligns with the principles of REST (Representational State Transfer), which emphasizes stateless communication and the use of standard HTTP methods (like GET, POST, PUT, DELETE) to interact with resources.

By separating the endpoint path (/api/users) and the method (GET), this design provides a clear, intuitive interface for retrieving data, making it easy for clients to understand that this request will fetch user information based on the unique user ID provided.

Using specific paths with parameters like {id} keeps the API flexible, allowing clients to dynamically request data for any user by substituting the appropriate ID in the request URL.

This is especially useful in microservice or RESTful architectures, where clear, predictable endpoints improve communication efficiency and maintain data access consistency across distributed services.
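To show how a server might interpret the {id} placeholder, here is a minimal sketch of matching a request path against a template. Frameworks such as Express do this for you. The matchPath function is a hypothetical helper written only to illustrate the idea.

```javascript
// Match a concrete path against a template such as '/api/users/{id}'.
// Returns an object of extracted parameters, or null if the path does not match.
function matchPath(template, path) {
  const templateParts = template.split('/');
  const pathParts = path.split('/');
  if (templateParts.length !== pathParts.length) return null;

  const params = {};
  for (let i = 0; i < templateParts.length; i++) {
    const part = templateParts[i];
    if (part.startsWith('{') && part.endsWith('}')) {
      if (pathParts[i] === '') return null; // parameter must not be empty
      params[part.slice(1, -1)] = decodeURIComponent(pathParts[i]);
    } else if (part !== pathParts[i]) {
      return null;
    }
  }
  return params;
}

console.log(matchPath('/api/users/{id}', '/api/users/123'));
console.log(matchPath('/api/users/{id}', '/api/orders/123'));
```

Note that the extracted id is a string. A handler that expects a numeric ID would still need to validate and convert it.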

gRPC: gRPC is a high-performance, open-source RPC (Remote Procedure Call) framework developed by Google. It uses HTTP/2 and Protocol Buffers for efficient communication, making it suitable for low-latency, high-throughput systems.

service UserService {
    rpc GetUser (UserRequest) returns (UserResponse);
}

In this code, you can see how gRPC service definitions are created to specify the RPC (Remote Procedure Call) interface for the UserService.

This example uses Protocol Buffers (protobuf) syntax, a language-neutral format for defining service contracts in gRPC.

Here’s a detailed breakdown of how this code works and what it represents:

The service UserService declaration defines a service named UserService. In gRPC, a "service" is essentially a collection of remotely callable functions. It organizes these functions (or RPC methods) under a single service name, which can be easily referenced by clients wishing to interact with it.
Inside UserService, the line rpc GetUser (UserRequest) returns (UserResponse); defines a specific RPC method called GetUser. The keyword rpc indicates that this function will be accessible remotely via gRPC calls. The name GetUser indicates its purpose—to retrieve user information—and helps to standardize the naming of this action.
The GetUser method specifies two important details: the request and response types, represented here as (UserRequest) and (UserResponse). UserRequest is the type of data the client must send when calling GetUser, which could include user identifiers (like a user ID) or any necessary parameters. UserResponse defines the format of the data that will be returned by the server, such as the user’s profile or account details.

When a client makes a call to GetUser, they send a UserRequest message, and the server responds with a UserResponse message.

This structure allows for a well-defined and efficient way for clients to retrieve user information without dealing with the details of network communication.

By defining service contracts at this level, gRPC enables type safety, performance optimization, and scalability across distributed systems.

Choosing Between REST and gRPC: REST is more flexible and easier to use for external APIs, while gRPC offers better performance and is often preferred for internal microservices communication.

Versioning

APIs evolve over time, and maintaining backward compatibility is crucial. API versioning strategies include path versioning (for example, /v1/users) and query parameter versioning (for example, /users?version=1).

GET /api/v1/users/123

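In the request above, you can see how a versioned RESTful API endpoint retrieves a resource, specifically a user, using the HTTP GET method. The v1 segment in the path pins the request to version 1 of the API, so a later version (for example, /api/v2/users/123) can change the response format without breaking existing clients.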

This is a simple and effective way to interact with web services over HTTP, which is the backbone of REST (Representational State Transfer) design.

RESTful APIs are structured around the concept of resources—objects or data that can be accessed or manipulated via standard HTTP methods like GET, POST, PUT, and DELETE.

The endpoint GET /api/v1/users/{id} follows this design pattern. Here's how it works in detail:

GET is the HTTP method used to request data from the server. In RESTful design, the GET method is used for retrieving data from a server without making any changes. In this case, the GET request is specifically used to fetch the details of a user.
/api/v1/users/{id} is the resource path that identifies the target resource, in this case a user, within version 1 of the API. The {id} part is a variable path parameter, which means the client must provide a specific user identifier (ID) when making the request. This allows the server to understand which user's data is being requested. For example, GET /api/v1/users/123 would fetch the user with the ID of 123.
The resource, in this case, is a user. RESTful APIs focus on representing data in the form of resources, which are typically accessed using URLs. The GET method on the /api/v1/users/{id} path tells the server to return the data associated with the user corresponding to the given ID.

In RESTful design, the simplicity and human-readability of the HTTP protocol make it easy to integrate with other systems. Each endpoint can be understood in terms of standard HTTP methods and the structure of the resource being accessed, which makes it intuitive for both developers and clients.

The resource-oriented approach is scalable, and by using HTTP status codes, developers can communicate the results of each request (such as 200 OK for success or 404 Not Found when the resource doesn’t exist).

Thus, GET /api/v1/users/{id} is an example of how versioned RESTful APIs allow clients to easily query specific resources with clear, readable paths and standard methods for interaction.
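Both versioning strategies mentioned earlier, path versioning and query parameter versioning, can be handled by a small helper on the server side. The sketch below is one possible approach. The resolveVersion function and its default of version 1 are assumptions made for illustration.

```javascript
// Resolve the requested API version from either the path (/api/v1/users)
// or a query parameter (/users?version=1). Falls back to a default version.
function resolveVersion(requestUrl, defaultVersion = 1) {
  // The base is a placeholder so relative URLs can be parsed.
  const url = new URL(requestUrl, 'http://localhost');

  // Path versioning: look for a segment such as /v1/ or a trailing /v1.
  const pathMatch = url.pathname.match(/\/v(\d+)(\/|$)/);
  if (pathMatch) return Number(pathMatch[1]);

  // Query parameter versioning: accept only whole numbers.
  const queryVersion = url.searchParams.get('version');
  if (queryVersion !== null && /^\d+$/.test(queryVersion)) {
    return Number(queryVersion);
  }
  return defaultVersion;
}

console.log(resolveVersion('/api/v1/users/123'));
console.log(resolveVersion('/users?version=2'));
console.log(resolveVersion('/users'));
```

A router can then use the resolved number to dispatch to the handler for that version, which lets old and new versions of an endpoint run side by side.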

Error Handling

You’ll need to define a consistent approach to handling errors in your APIs. Use standardized error codes and messages to make troubleshooting easier for clients.

{
    "error": {
        "code": "USER_NOT_FOUND",
        "message": "The user with ID 123 was not found."
    }
}

In this code, you can see how error handling works within an API response by providing standardized error information.

The JSON object returned represents an error response when a client attempts to access a resource, such as a user, that cannot be found.

The structure of the error is consistent, making it easier for both the server and client to handle errors effectively.

The outer structure of the response is an object containing an error key, which signifies that this is an error response, as opposed to a successful one. This helps clients easily distinguish between regular data responses and error responses.

Inside the error object, there are two key elements:

code: The error code (USER_NOT_FOUND) is a standardized identifier that describes the type of error. It helps developers and clients understand exactly what went wrong. In this case, USER_NOT_FOUND indicates that the user could not be found in the system based on the provided identifier (ID 123).
message: The error message (The user with ID 123 was not found.) provides a human-readable explanation of the error. This message offers clarity to the user or developer about the nature of the problem, giving a more detailed description of what happened. In this case, it explicitly informs the client that the requested user is missing from the database.

By using this approach, the error response is consistent, and clients can easily handle errors in a standardized way.

Handling an error might involve logging it, displaying the message to the user, or retrying the operation where that is safe.

The standardized error codes and messages make troubleshooting and debugging easier, as developers and clients can quickly identify the nature of the issue.

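Moreover, this structure can be extended with additional information, such as timestamps or, in non-production environments, stack traces, to provide even more context if needed. Stack traces are best kept out of production responses, because they can reveal internal details to clients.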

This consistent method for error handling ensures that both the client and server maintain clear communication, allowing developers to create more reliable and user-friendly APIs.

When errors are returned in a consistent and structured format like this, it also promotes better integration between different services or teams that might consume the API.
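
One way to keep error bodies uniform is to build them in a single helper. The sketch below is a framework-free illustration under stated assumptions: the errorResponse function, the STATUS_BY_CODE table, and the VALIDATION_FAILED code are hypothetical names invented for this example, not part of any library.

```javascript
// Hypothetical mapping from application error codes to HTTP status codes.
const STATUS_BY_CODE = new Map([
  ['USER_NOT_FOUND', 404],
  ['VALIDATION_FAILED', 400],
]);

// Build a response in the { error: { code, message } } shape shown above.
// Unknown codes fall back to 500 so every error still carries a status.
function errorResponse(code, message) {
  return {
    status: STATUS_BY_CODE.get(code) ?? 500,
    body: { error: { code, message } },
  };
}

const notFound = errorResponse('USER_NOT_FOUND', 'The user with ID 123 was not found.');
console.log(notFound.status, JSON.stringify(notFound.body));
```

In an Express handler, the result could be sent with res.status(notFound.status).json(notFound.body), so every route produces errors in the same shape.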

API Contracts

Contracts as Agreements

An API contract defines the rules for how services interact, specifying the expected inputs, outputs, and behavior. It serves as an agreement between teams, ensuring that changes in one service do not break others.

Schema Definition

Use schema definition tools like OpenAPI (formerly Swagger) or Protocol Buffers (for gRPC) to formally define your API contracts. These tools allow for the automatic generation of client libraries, documentation, and testing tools.

openapi: 3.0.0
info:
  title: User API
  version: 1.0.0
paths:
  /users/{id}:
    get:
      summary: Get a user by ID
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/User'
components:
  schemas:
    User:
      type: object
      properties:
        id:
          type: string
        name:
          type: string
        email:
          type: string

This YAML example uses OpenAPI 3.0 to formally define the structure and behavior of an endpoint that retrieves a user by their ID.

OpenAPI, formerly known as Swagger, is a widely used specification for describing API contracts, which are essentially agreements about how API requests and responses should look.

This helps create consistency, enables the automatic generation of client libraries, documentation, and testing tools, and makes integration smoother for clients who interact with the API.

The openapi: 3.0.0 line specifies the OpenAPI version, ensuring compatibility with OpenAPI 3.0 tools.

Under info, details about the API itself are defined, including the title (User API) and version (1.0.0), helping clients and developers understand what API version they are working with.

The paths section details the available endpoints, with /users/{id} representing a path to retrieve a user by their unique identifier.

The get section describes the specifics of this GET request, including:

The summary field (Get a user by ID) briefly explains the purpose of this endpoint.
The parameters list specifies that this endpoint accepts a single parameter, id, which is required, appears in the path (in: path), and must be of type string.

The responses section specifies possible responses:

A 200 status indicates a successful retrieval of the user data.
Under content, the schema of the JSON response is defined, referencing a reusable User schema from the components section.

In the components section, a User schema is defined to outline the structure of the user data returned by this API. The User schema is defined as an object with id, name, and email properties, each with specific types (string), detailing the expected structure of the user data.

This formal schema helps API clients understand exactly how to use the endpoint and what kind of data they will receive in response.

By defining the API in OpenAPI, this schema also enables automated documentation tools to generate visual documentation for developers. It also allows client libraries to be automatically generated to interact with the API, reducing errors and improving efficiency.

This example showcases how OpenAPI enables clear, consistent, and reusable API contracts that facilitate easier integration and maintenance.
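
To make the idea of a contract concrete, here is a minimal hand-written check of a response object against the User schema. It is only a sketch: real projects would more likely rely on validators generated from the OpenAPI document. The USER_SCHEMA constant and the contractViolations function are hypothetical names for this example. Because the schema above lists no required properties, the check only verifies the types of properties that are present.

```javascript
// Property names and types copied from the User schema in the OpenAPI example.
const USER_SCHEMA = { id: 'string', name: 'string', email: 'string' };

// Return a list of contract violations; an empty list means the object conforms.
function contractViolations(value, schema) {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return ['value is not an object'];
  }
  const problems = [];
  for (const [field, expectedType] of Object.entries(schema)) {
    // Absent properties are allowed; present ones must have the declared type.
    if (field in value && typeof value[field] !== expectedType) {
      problems.push(`${field}: expected ${expectedType}, got ${typeof value[field]}`);
    }
  }
  return problems;
}

console.log(contractViolations({ id: '123', name: 'Ada', email: 'ada@example.com' }, USER_SCHEMA));
console.log(contractViolations({ id: 123, name: 'Ada' }, USER_SCHEMA));
```

A check like this could run in a test suite on the consumer side, so that a breaking change in the provider's response is caught before deployment.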

API Gateways and Security

Implementing an API gateway allows you to manage cross-cutting concerns such as authentication, rate limiting, logging, and request routing. It acts as a single entry point for clients accessing microservices.

Security is also an important concern. You can secure your APIs using authentication mechanisms like OAuth2, API keys, or JWT (JSON Web Tokens). Also, ensure that sensitive data is encrypted both in transit and at rest.

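// Example of securing a route in Express.js
const express = require('express');
const jwt = require('jsonwebtoken');

const app = express();

// Only requests that pass authenticateToken reach this handler
app.get('/api/secure-data', authenticateToken, (req, res) => {
    res.json({ data: 'This is secured data' });
});

// Middleware that rejects requests without a valid JWT
function authenticateToken(req, res, next) {
    const token = req.headers['authorization'];
    if (!token) return res.sendStatus(401); // no token supplied

    jwt.verify(token, process.env.ACCESS_TOKEN_SECRET, (err, user) => {
        if (err) return res.sendStatus(403); // invalid or expired token
        req.user = user; // decoded payload, available to later handlers
        next();
    });
}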

This code shows how to secure a route in an Express.js application with JSON Web Tokens (JWT), a common method of protecting API endpoints.

The route '/api/secure-data' is accessible only to authenticated users, a restriction enforced by the middleware function authenticateToken.

In the authenticateToken function, the code extracts the token from the request headers (req.headers['authorization']).

If no token is present, it sends a 401 Unauthorized status, indicating that access is denied. This check is crucial for restricting access to sensitive endpoints, ensuring that only requests with a valid authorization token proceed.

Next, the code uses the jwt.verify() function to verify the token against a secret key (process.env.ACCESS_TOKEN_SECRET). This secret is known only to the server, which makes it possible to authenticate the validity of the token. If the token is invalid or expired, jwt.verify passes an error to the callback, and the middleware returns a 403 Forbidden response, blocking access.

When verification succeeds, the decoded user information from the token is attached to the req object (req.user = user), enabling subsequent middleware or route handlers to access user-specific data.

The next() function then passes control to the actual route handler, which, in this case, sends back a JSON object with secured data (res.json({ data: 'This is secured data' })).

This approach is often part of a larger API gateway or security strategy, as it ensures that sensitive routes can only be accessed by authenticated clients.

When this middleware runs in the API gateway rather than in each service, token-based authentication is enforced in one place, enhancing security without needing to modify each microservice individually.
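
Clients commonly send the token as "Authorization: Bearer <token>". The middleware above passes the whole header value to jwt.verify(), so a "Bearer " prefix would make verification fail. The helper below is a minimal sketch of how the prefix could be stripped first; extractBearerToken is a hypothetical name introduced for this example.

```javascript
// Extract the token from an Authorization header value.
// Returns null when the header is missing or does not use the Bearer scheme.
function extractBearerToken(headerValue) {
  if (typeof headerValue !== 'string') return null;
  const parts = headerValue.trim().split(/\s+/);
  if (parts.length !== 2) return null;
  const [scheme, token] = parts;
  // The scheme name is compared case-insensitively.
  return scheme.toLowerCase() === 'bearer' ? token : null;
}

console.log(extractBearerToken('Bearer abc.def.ghi'));
console.log(extractBearerToken('Basic dXNlcjpwYXNz'));
```

Inside authenticateToken, the first line could then become const token = extractBearerToken(req.headers['authorization']); and the existing 401 check would cover both a missing header and a malformed one.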

How to Implement Microservices

In this chapter, we will begin applying the concepts we discussed earlier as we go through the practical steps. We’ll dive into building a sample project to demonstrate the core aspects of microservices architecture. By focusing on a simple use case, we will walk through how to develop and deploy microservices that are loosely coupled, independently deployable, and scalable.

The scenario we will cover involves developing a microservice system for an e-commerce platform, where we will focus on creating RESTful APIs. These APIs will allow different services, such as product catalog, user management, and order processing, to interact seamlessly while maintaining independence.

You will learn how to design each service with clear boundaries, handle communication between them, and ensure that the services remain decoupled yet cohesive.

We’ll cover topics like designing and implementing RESTful APIs, integrating services via HTTP or message queues, and introducing important concepts such as service discovery and API gateways. Each subsection will build on the previous one, so by the end of the chapter, you’ll have a solid understanding of how to create and deploy a functioning microservices application, ready for further expansion and integration.

Creating RESTful APIs

You’ll implement APIs that follow REST principles to allow communication between services.

Think of RESTful APIs as menus in a restaurant, where each menu item (API endpoint) corresponds to a specific dish (service functionality).

// Node.js with Express
const express = require('express');
const app = express();
app.use(express.json());

const users = [];

app.post('/users', (req, res) => {
  const user = req.body;
  users.push(user);
  res.status(201).send(user);
});

app.get('/users/:id', (req, res) => {
  const user = users.find(u => u.id === parseInt(req.params.id));
  if (user) {
    res.send(user);
  } else {
    res.status(404).send('User not found');
  }
});

app.listen(3000, () => console.log('User service running on port 3000'));

This code shows how a simple RESTful API is implemented in Node.js using the Express framework. It covers the Create and Read operations of CRUD for a users resource, adhering to REST principles by providing endpoints that represent specific operations on the users data.

The app.use(express.json()); line enables Express to parse incoming JSON data, allowing the server to handle POST requests with JSON bodies. This is essential because microservices often communicate in JSON, making it a standard format for data exchange in RESTful APIs.

The POST /users route allows clients to create a new user by sending JSON data representing the user. In the route, the req.body object captures this incoming data. The server then stores this data in the users array.

It responds with a status code 201 (indicating resource creation) and sends back the user object to confirm the successful addition. This design aligns with REST principles by using a standard HTTP method (POST) for creating resources and returning meaningful HTTP status codes.

The GET /users/:id route allows clients to retrieve a specific user by their id. This endpoint uses req.params.id to access the id parameter provided in the request URL.

The code converts the id to an integer (URL parameters always arrive as strings), searches the users array for a user with that id, and sends back the user data if found. Because the comparison is strict, this assumes users were created with numeric ids.

If no match is found, the server responds with a 404 status code, indicating that the user was not found. This standard error handling approach makes the API client-friendly by providing clear feedback.

The final part, app.listen(3000), starts the server on port 3000 and logs a message to confirm the service is running. This allows other services or clients to access the API by making HTTP requests to this port.

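This code exemplifies a RESTful approach to creating a simple API for managing users in a microservice, with endpoints that map intuitively to create and read operations on a user resource. Because the users live in an in-memory array, the data is lost when the process restarts; a real service would persist it in a database.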
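
The example covers only create and read. Update and delete would follow the same pattern, typically exposed as PUT /users/:id and DELETE /users/:id. The sketch below shows the logic without Express so it can run on its own; the function names and the { status, body } return shape are assumptions made for this illustration.

```javascript
// In-memory store standing in for the users array in the Express example.
const users = new Map();

const NOT_FOUND = { status: 404, body: 'User not found' };

function createUser(user) {
  users.set(user.id, user);
  return { status: 201, body: user };
}

function getUser(id) {
  return users.has(id) ? { status: 200, body: users.get(id) } : NOT_FOUND;
}

function updateUser(id, changes) {
  if (!users.has(id)) return NOT_FOUND;
  // Merge the changes but keep the original id.
  const updated = { ...users.get(id), ...changes, id };
  users.set(id, updated);
  return { status: 200, body: updated };
}

function deleteUser(id) {
  // 204 No Content signals success without a response body.
  return users.delete(id) ? { status: 204, body: null } : NOT_FOUND;
}

createUser({ id: 1, name: 'John Doe' });
console.log(updateUser(1, { name: 'John Smith' }).body);
console.log(deleteUser(1).status, getUser(1).status);
```

Each function could be wired to a route handler that calls res.status(result.status).send(result.body).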

Handling Authentication and Authorization

You’ll want to implement mechanisms to secure access to your microservices.

This is like issuing badges to employees to ensure only authorized personnel can enter specific areas of a building.

// Using JWT for Authentication
const jwt = require('jsonwebtoken');
const express = require('express');
const app = express();
app.use(express.json());

// Generate JWT Token
app.post('/login', (req, res) => {
  const user = req.body; // Assume user validation here
  // 'secret_key' is hard-coded for the demo only; load a real secret from configuration
  const token = jwt.sign({ userId: user.id }, 'secret_key', { expiresIn: '1h' });
  res.send({ token });
});

// Middleware to protect routes
function authenticateToken(req, res, next) {
  const token = req.headers['authorization'];
  if (!token) return res.sendStatus(401); // no token supplied
  jwt.verify(token, 'secret_key', (err, user) => {
    if (err) return res.sendStatus(403); // invalid or expired token
    req.user = user; // decoded payload, available to later handlers
    next();
  });
}

app.get('/protected', authenticateToken, (req, res) => {
  res.send('This is a protected route');
});

app.listen(3000, () => console.log('Authentication service running on port 3000'));

In this snippet, you can see that JSON Web Tokens (JWTs) are used to handle authentication and authorization in a Node.js application. The code demonstrates the entire flow, from generating a token when a user logs in, to using that token to protect specific routes in the application.

First, in the POST /login route, the application generates a JWT token for a user. Here, the user’s information is expected to be provided in req.body, simulating a login process. In a real-world scenario, this step would include user validation (such as checking the username and password against a database).

Upon a successful "login," the jwt.sign() method creates a token using the user.id as the payload and a secret_key. This token is returned to the user and serves as a kind of "badge" that represents their identity and access rights. The client can store this token and send it with future requests to authenticate themselves.

The authenticateToken middleware function demonstrates how the server can validate this token on protected routes. When a request is made to a secured route, the middleware checks for a token in the Authorization header (req.headers['authorization']).

If no token is found, the server responds with a 401 Unauthorized status, indicating that the client has not authenticated. If a token is present, the jwt.verify() method checks its validity using the same secret_key that was used to create it.

If the token is invalid (for example, expired or tampered with), the server sends a 403 Forbidden status. If the token is valid, the middleware adds the user information to req.user and calls next() to allow the request to proceed to the protected route.

The protected route GET /protected demonstrates the benefit of using JWT for securing routes. Only requests containing a valid token can reach this route, providing controlled access to sensitive parts of the application.

This approach centralizes the responsibility for verifying the token, streamlining authentication across different services if used in a microservices context. It allows other services to quickly verify user access by using the token without needing to query a central user database on each request, a critical efficiency in distributed systems.

By including this kind of token-based authentication, developers create a more secure and efficient system for controlling access within their microservices architecture.
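
To make the signing and verification steps less opaque, here is a dependency-free sketch of what checking an HS256-signed token involves, using Node's built-in crypto module. signToken and verifyToken are hypothetical helpers written for this illustration. They skip checks that a full library performs (such as validating the algorithm named in the header), so treat them as a teaching aid, not a replacement for the jsonwebtoken package.

```javascript
const nodeCrypto = require('node:crypto');

// Encode a JSON-serializable value as base64url, the encoding JWT segments use.
const encode = (value) => Buffer.from(JSON.stringify(value)).toString('base64url');

// HMAC-SHA256 signature over "header.payload", base64url-encoded.
const signSegment = (data, secret) =>
  nodeCrypto.createHmac('sha256', secret).update(data).digest('base64url');

function signToken(payload, secret) {
  const data = `${encode({ alg: 'HS256', typ: 'JWT' })}.${encode(payload)}`;
  return `${data}.${signSegment(data, secret)}`;
}

// Returns the payload when the signature matches and the token has not
// expired; otherwise returns null.
function verifyToken(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = String(token).split('.');
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  const expected = Buffer.from(signSegment(`${header}.${payload}`, secret));
  const actual = Buffer.from(signature);
  // Constant-time comparison avoids leaking how many characters matched.
  if (expected.length !== actual.length || !nodeCrypto.timingSafeEqual(expected, actual)) {
    return null;
  }
  let claims;
  try {
    claims = JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
  } catch {
    return null;
  }
  if (claims === null || typeof claims !== 'object') return null;
  // exp is expressed in seconds since the Unix epoch.
  if (typeof claims.exp === 'number' && nowSeconds >= claims.exp) return null;
  return claims;
}

const demoExp = Math.floor(Date.now() / 1000) + 3600;
const demoToken = signToken({ userId: 1, exp: demoExp }, 'secret_key');
console.log(verifyToken(demoToken, 'secret_key'));
console.log(verifyToken(demoToken, 'wrong_key'));
```

The key point is that any service holding the secret can recompute the signature and trust the payload without contacting a user database, which is why token verification scales well across services.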

API Gateway Pattern

The API Gateway pattern is a crucial design pattern in microservices architecture. It acts as an entry point for all client requests, routing them to the appropriate microservices. The API Gateway abstracts the underlying complexity of microservices, providing a unified interface for clients to interact with.

Think of the API Gateway as a receptionist in a large office building. The receptionist directs visitors to the appropriate office based on their needs, ensuring they don’t have to navigate the entire building on their own.

Responsibilities of an API Gateway

Request Routing: The gateway directs incoming requests to the appropriate microservice based on the request's endpoint.
Authentication and Authorization: It handles authentication, ensuring that only authorized users can access specific services.
Rate Limiting: The gateway can limit the number of requests a client can make in a given time to prevent abuse.
Load Balancing: It can distribute incoming requests across multiple instances of a service to ensure a balanced load and high availability.
Caching: The gateway can cache responses from services to reduce load and improve response times for frequently requested data.
Protocol Translation: It can translate between different protocols (e.g., HTTP to WebSocket) to enable communication between services using different protocols.

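const express = require('express');
// The 'request' package is deprecated but still illustrates stream-based proxying
const request = require('request');
const app = express();

app.use('/users', (req, res) => {
    // Forward the request to the user service
    const userServiceUrl = 'http://user-service:3001';
    // Inside app.use('/users', ...), req.url no longer includes the '/users' prefix;
    // use req.originalUrl instead if the user service expects the full path
    req.pipe(request({ url: userServiceUrl + req.url })).pipe(res);
});

app.listen(3000, () => {
    console.log('API Gateway running on port 3000');
});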

Here, you can see how an API Gateway is set up in Node.js using Express to act as an entry point for all client requests, routing them to the appropriate microservice—in this case, a user service.

The API Gateway abstracts the complexity of microservices architecture by providing a single unified interface, ensuring that clients do not have to know about or navigate the underlying service endpoints directly.

The code begins by setting up an Express application, which represents the gateway service. The route '/users' is defined to handle requests to the user service. When a request is made to this route, the code dynamically forwards (or "proxies") the request to the designated URL of the user service, which in this example is http://user-service:3001.

The req.pipe(request({ url: userServiceUrl + req.url })).pipe(res); line forwards the client's request to the user service's endpoint and streams the service's response back to the client as it arrives.

This forwarding mechanism uses streams (req.pipe and .pipe(res)) to efficiently pass data between the client and the user service, enabling the API Gateway to seamlessly route requests and responses without needing to manually handle each request component.

In this setup, the API Gateway could also potentially handle other responsibilities like authentication, rate limiting, caching, or load balancing by adding relevant middleware before or after forwarding the request to the user service.

By centralizing these responsibilities in the gateway, developers can ensure consistency and simplify configuration across microservices. Furthermore, this design is highly flexible: the API Gateway could be extended to route requests to other services (e.g., order, payment) as the architecture grows, without exposing the direct endpoints of these services to the client.

This way, the API Gateway efficiently manages communication between clients and the underlying microservices, while also allowing for streamlined security and protocol management across the system.
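
As an illustration of one such cross-cutting concern, here is a minimal fixed-window rate limiter that a gateway could consult before forwarding a request. It is a sketch under simple assumptions (a single process, counters held in memory); createRateLimiter and its options are hypothetical names for this example, and production gateways usually keep counters in a shared store.

```javascript
// Fixed-window limiter: at most `limit` requests per client in each window.
// `now` is injectable so the behavior can be exercised with a fake clock.
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const windows = new Map(); // clientId -> { start, count }

  return function isAllowed(clientId) {
    const time = now();
    let entry = windows.get(clientId);
    // Start a fresh window for new clients or when the old window has elapsed.
    if (!entry || time - entry.start >= windowMs) {
      entry = { start: time, count: 0 };
      windows.set(clientId, entry);
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  };
}

const isAllowed = createRateLimiter({ limit: 2, windowMs: 60000 });
console.log(isAllowed('client-a'), isAllowed('client-a'), isAllowed('client-a'));
```

In an Express gateway, a middleware registered before the proxy route could call isAllowed(req.ip) and respond with status 429 (Too Many Requests) when it returns false.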

Advantages of an API Gateway:

Simplifies client interactions by providing a single entry point.
Centralizes cross-cutting concerns like security, logging, and monitoring.
Improves security by hiding the internal architecture of microservices from external clients.

Challenges of an API Gateway:

The API Gateway can become a bottleneck if not properly scaled.
It introduces additional latency due to the extra network hop.
Complexity in managing and configuring the gateway increases as the number of services grows.

Strangler Fig Pattern

The Strangler Fig pattern is a strategy for gradually replacing a legacy monolithic application with a new microservices-based architecture. The pattern is named after the strangler fig tree, which grows around and eventually replaces its host tree.

Imagine a vine slowly growing around a tree. Over time, the vine strengthens and eventually replaces the tree. Similarly, the new microservices gradually replace the old monolithic system until the legacy application is completely phased out.

Steps to Implement Strangler Fig:

Identify Components: Begin by identifying the components of the monolithic application that can be isolated and replaced by microservices.
Build and Deploy New Services: Develop microservices that replicate the functionality of the identified components.
Route Traffic: Use an API Gateway or similar routing mechanism to direct relevant traffic to the new microservices while the rest of the traffic continues to flow to the monolith.
Incremental Replacement: Gradually replace more components of the monolith with microservices, routing traffic accordingly until the entire monolithic application is replaced.
Decommission the Monolith: Once all functionality has been transferred to microservices, the legacy system can be decommissioned.

Example of Using the Strangler Fig Pattern:

Phase 1: A monolithic e-commerce application handles product listing, user authentication, and order processing. You’d start by creating a microservice for user authentication.
Phase 2: Traffic related to authentication is routed to the new microservice while the rest continues to be handled by the monolith.
Phase 3: Over time, you’d add more microservices for product listing and order processing, gradually strangling the monolith until it's completely replaced.
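
The routing step in these phases can be sketched as a small lookup that sends migrated paths to new services and everything else to the monolith. The path prefixes, host names, and the resolveTarget function below are all hypothetical, chosen only to mirror the e-commerce example.

```javascript
// Path prefixes that have already been migrated to microservices.
// Phase 2 has only authentication; Phase 3 would add more entries here.
const MIGRATED_ROUTES = [
  { prefix: '/auth', target: 'http://auth-service:3001' },
];

// Everything that has not been migrated still goes to the legacy application.
const MONOLITH_URL = 'http://legacy-monolith:8080';

function resolveTarget(path) {
  const match = MIGRATED_ROUTES.find(
    // Match the prefix itself or anything beneath it, but not look-alikes
    // such as '/authors'.
    (route) => path === route.prefix || path.startsWith(route.prefix + '/')
  );
  return match ? match.target : MONOLITH_URL;
}

console.log(resolveTarget('/auth/login'));
console.log(resolveTarget('/orders/42'));
```

Migrating another component then amounts to deploying the new service and adding one entry to the table, which keeps each step small and easy to roll back.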

Advantages of the Strangler Fig Pattern:

Minimizes risk by allowing a gradual transition to microservices.
Reduces downtime and disruption since changes are made incrementally.
Allows for continuous improvement and refactoring during the transition.

Challenges of the Strangler Fig Pattern:

Requires careful planning and coordination to avoid disrupting the existing application.
The coexistence of monolithic and microservices components can complicate deployment and operations.
Managing data consistency between the monolith and microservices during the transition can be challenging.

Backend for Frontend (BFF) Pattern

The Backend for Frontend (BFF) pattern involves creating separate backend services tailored to the needs of different user interfaces or client types (for example, web, mobile, IoT).

Each BFF acts as a specialized API Gateway that aggregates data from various microservices and presents it in a format optimized for the specific client.

Imagine different versions of a product manual for various audiences—one for engineers, one for customers, and one for marketing.

Each version presents the same core information but is tailored to meet the specific needs and language of its audience.

Steps to Implement the BFF Pattern:

Client-Specific Backends: Develop a separate BFF for each client type. For example, you might have one BFF for a web application and another for a mobile app.
Aggregation of Data: Each BFF aggregates and processes data from multiple microservices to provide a cohesive response to the client. This reduces the number of requests a client needs to make and tailors the response to the client’s needs.
Custom Business Logic: Each BFF can include custom business logic that is specific to the client type, such as formatting data differently for mobile versus web or implementing client-specific optimizations.

const express = require('express');
const app = express();

// BFF for mobile clients
app.get('/mobile/products', async (req, res) => {
    const productData = await fetchProductService();
    const reviewData = await fetchReviewService();
    res.json({ products: productData, reviews: reviewData });
});

// BFF for web clients
app.get('/web/products', async (req, res) => {
    const productData = await fetchProductService();
    const reviewData = await fetchReviewService();
    const recommendationData = await fetchRecommendationService();
    res.json({ products: productData, reviews: reviewData, recommendations: recommendationData });
});

app.listen(4000, () => {
    console.log('BFF for Frontend running on port 4000');
});

async function fetchProductService() {
    // Logic to fetch product data
}

async function fetchReviewService() {
    // Logic to fetch review data
}

async function fetchRecommendationService() {
    // Logic to fetch recommendation data
}

In this implementation, you can see how the Backend for Frontend (BFF) pattern is implemented using Node.js and Express, creating tailored endpoints specifically for different types of clients (such as mobile and web).

The BFF pattern is useful when different clients—such as a mobile app and a web app—need to access similar but customized data from the backend. Here, the server defines two routes: /mobile/products for mobile clients and /web/products for web clients.

Both endpoints retrieve product and review data, but the web client’s endpoint also fetches recommendation data, which in this example is displayed only in the web interface.

In the first route, app.get('/mobile/products'), a request is handled by fetching product and review data through the helper functions fetchProductService and fetchReviewService, which are async functions that simulate calls to backend services or databases.

The results are then aggregated and sent as a single JSON response back to the mobile client, reducing the number of requests the client needs to make. This approach optimizes the experience for mobile users by delivering only essential information, which minimizes data usage and speeds up response times.

Similarly, in the second route, app.get('/web/products'), the server fetches the same product and review data but also includes recommendation data via fetchRecommendationService.

This endpoint is more tailored to the needs of a web interface, where users might benefit from additional recommendations displayed alongside product listings. This custom response aggregation, specific to each client, embodies the BFF pattern by structuring responses based on client requirements, optimizing the client-server interaction, and making backend processing more efficient.

The server listens on port 4000, acting as a dedicated layer for frontend communication that hides the complexity of backend services from clients.

Here a single Express app hosts both routes for brevity; in practice, each BFF is often deployed as its own service. Either way, each client’s needs are met through dedicated logic, improving efficiency, reducing overhead, and allowing each client to access precisely the data it needs in a single request.

This code provides a clear example of how data aggregation and client-specific tailoring can simplify and streamline API interactions in a microservices architecture.
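
The handlers above await each service call in turn. When the calls do not depend on one another, they can run concurrently with Promise.all, so the response time is bounded by the slowest call instead of the sum of all calls. The sketch below shows that variation without Express; the fetch functions return placeholder data, and buildMobileProducts and buildWebProducts are hypothetical names for this example.

```javascript
// Stand-ins for calls to downstream services; the returned data is placeholder only.
async function fetchProductService() {
  return [{ id: 1, name: 'Sample product' }];
}

async function fetchReviewService() {
  return [{ productId: 1, rating: 5 }];
}

async function fetchRecommendationService() {
  return [{ productId: 2 }];
}

// Mobile response: products and reviews only, fetched concurrently.
async function buildMobileProducts() {
  const [products, reviews] = await Promise.all([
    fetchProductService(),
    fetchReviewService(),
  ]);
  return { products, reviews };
}

// Web response: the same data plus recommendations.
async function buildWebProducts() {
  const [products, reviews, recommendations] = await Promise.all([
    fetchProductService(),
    fetchReviewService(),
    fetchRecommendationService(),
  ]);
  return { products, reviews, recommendations };
}

buildWebProducts().then((result) => console.log(Object.keys(result)));
```

Note that Promise.all rejects as soon as any call fails, so a real handler would need to decide whether to return an error or a partial response in that case.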

Advantages of the BFF Pattern:

Tailors the backend services to the specific needs of each client, improving performance and user experience.
Reduces the complexity of front-end code by offloading aggregation and transformation tasks to the BFF.
Allows for independent evolution of different clients and their corresponding backends.

Challenges of the BFF Pattern:

Increases the number of services to maintain, as each client type requires its own BFF.
Potential for code duplication if similar logic is required across multiple BFFs.
Coordination between BFFs and the underlying microservices is required to ensure consistency and efficiency.

How to Test Microservices

Testing is an essential part of ensuring the reliability, scalability, and performance of microservices. Given that microservices are composed of multiple independent services that communicate over the network, rigorous testing becomes even more critical.

With each service potentially evolving independently, it’s crucial to identify and address issues early to prevent cascading failures and disruptions in the overall system. Without comprehensive testing, microservices can become prone to hidden bugs, integration issues, and performance bottlenecks.

In this section, we’ll explore the different types of testing that are important for microservices. Each type serves a specific purpose, from validating individual components to ensuring that the entire system works together as expected.

You'll learn how to apply unit testing, integration testing, contract testing, and end-to-end testing to create a robust and reliable microservice-based architecture.

By the end of this section, you'll understand how to approach testing in a microservices environment, enabling you to deliver high-quality applications.

Unit Testing

Testing individual components of a microservice is important to ensure that they work correctly in isolation.

This is like testing each part of a machine separately to ensure each part functions properly before assembling the entire machine.

// Using Mocha and Chai
const { expect } = require('chai');
const UserService = require('./userService'); // Assume UserService is in another file

describe('UserService', () => {
  let userService;

  beforeEach(() => {
    userService = new UserService();
  });

  it('should create a user', () => {
    const user = { id: 1, name: 'John Doe' };
    userService.createUser(user);
    expect(userService.getUser(1)).to.deep.equal(user);
  });
});

This code demonstrates how you can use Mocha and Chai to perform unit testing on the UserService class. The purpose of this test is to verify that the UserService class's createUser and getUser methods work as expected, ensuring that individual components of this microservice are reliable when tested in isolation.

This is essential for microservices, where each component must be robust to ensure that the system as a whole functions smoothly.

Here, the test suite begins with describe('UserService', ...), which serves as a container for grouping multiple related test cases about UserService. Inside the suite, a new instance of UserService is created before each test by using the beforeEach() function, which resets the state of the userService instance, making each test independent and repeatable.

The actual test case, it('should create a user', ...), simulates adding a user to the service. It defines a user object, { id: 1, name: 'John Doe' }, which it then passes to createUser.

The expect assertion from Chai is used to compare the result of userService.getUser(1) to the expected user object.

By using deep.equal, the test confirms that the user retrieved by getUser has the same properties as the user added by createUser, checking both the ID and name fields.

This test validates that each part of UserService works as intended, fulfilling the principle of unit testing by ensuring components function correctly in isolation.

This approach is analogous to testing individual parts of a machine separately to ensure reliability before integrating them into the larger system, helping catch issues at the component level early in the development process.
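
The test assumes that a UserService class exists in another file. A minimal version that would satisfy it might look like the sketch below; the Map-based storage is an assumption made for this example, since the original does not show the class.

```javascript
// Minimal in-memory UserService matching the methods the test calls.
class UserService {
  constructor() {
    this.users = new Map(); // id -> user object
  }

  // Store the user under its id and return it.
  createUser(user) {
    this.users.set(user.id, user);
    return user;
  }

  // Return the stored user, or undefined when the id is unknown.
  getUser(id) {
    return this.users.get(id);
  }
}

// In userService.js, the file would end with: module.exports = UserService;

const service = new UserService();
service.createUser({ id: 1, name: 'John Doe' });
console.log(service.getUser(1));
```

Because each instance owns its own Map, creating a new UserService in beforeEach gives every test a clean, isolated state.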

Integration Testing

Integration testing involves testing the interactions between microservices to ensure that they work together correctly.

It’s like testing different departments in a company to ensure their workflows align and function seamlessly together.

// Using Supertest with Mocha and Chai
const request = require('supertest');
const { expect } = require('chai');
const app = require('./app'); // Assume app.js exports your Express application

describe('Integration Tests', () => {
  it('should create and retrieve a user', async () => {
    const user = { id: 1, name: 'Jane Doe' };

    // Test creating a user
    await request(app)
      .post('/users')
      .send(user)
      .expect(201);

    // Test retrieving the user
    const response = await request(app)
      .get('/users/1')
      .expect(200);

    expect(response.body).to.deep.equal(user);
  });
});

In this code, you can see how integration testing is performed using the Supertest library to verify interactions within the Express application. Integration testing is crucial for microservices as it checks that different components work correctly together, just as different departments in a company need to collaborate seamlessly.

The code defines a test suite describe('Integration Tests', ...), where Supertest is used to make HTTP requests to the Express app and assert the responses. First, it tests creating a user by sending a POST request to /users with user data, { id: 1, name: 'Jane Doe' }, which is expected to return a status code 201, indicating successful creation.

The test then proceeds to check if this user can be retrieved by making a GET request to /users/1. This call is expected to return a 200 status, confirming that the user retrieval is functioning as expected.

The expect assertion is used here to ensure the response data (response.body) matches the created user data, { id: 1, name: 'Jane Doe' }. This comparison validates that the app correctly processes and returns data across different endpoints, verifying that the service’s internal workflows are cohesive.

This approach of combining Supertest and assertions provides a reliable way to validate that the app's interconnected parts work as intended, allowing for early detection of issues that could disrupt service integrations in real-world deployments.
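The test above assumes an app module that answers POST /users and GET /users/:id. As a rough, dependency-free sketch of the behavior the test expects, the routing logic could look like the following. The createUserRoutes and handle names are hypothetical and not part of the original example, and a real service would use Express route handlers and a database rather than an in-memory Map.

```javascript
// Hypothetical in-memory stand-in for the routes the integration test exercises.
// A real service would use Express handlers and a persistent data store.
function createUserRoutes() {
  const users = new Map(); // id (as string) -> user object

  // Returns { status, body } for a given method, path, and parsed JSON body.
  function handle(method, path, body) {
    if (method === 'POST' && path === '/users') {
      if (!body || body.id === undefined || !body.name) {
        return { status: 400, body: { error: 'id and name are required' } };
      }
      users.set(String(body.id), { id: body.id, name: body.name });
      return { status: 201, body: users.get(String(body.id)) };
    }

    const match = /^\/users\/([^/]+)$/.exec(path);
    if (method === 'GET' && match) {
      const user = users.get(match[1]);
      return user
        ? { status: 200, body: user }
        : { status: 404, body: { error: 'User not found' } };
    }

    return { status: 404, body: { error: 'Not found' } };
  }

  return { handle };
}

// Usage: the same create-then-retrieve flow as the test.
const routes = createUserRoutes();
const created = routes.handle('POST', '/users', { id: 1, name: 'Jane Doe' });
const fetched = routes.handle('GET', '/users/1');
console.log(created.status, fetched.status, fetched.body);
```

Keeping the routing logic in a plain function like this makes the expected status codes (201, 200, 404) easy to see before wiring them into a real HTTP framework.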

End-to-End Testing

End-to-End testing makes sure that the entire application works from start to finish and checks that all components work together as expected.

It’s like running a full simulation of a business process to ensure everything from start to finish operates correctly.

// Using Cypress
describe('End-to-End Test', () => {
  it('should create a user and verify its details', () => {
    cy.request('POST', '/users', { id: 1, name: 'Jack Doe' })
      .then(response => {
        expect(response.status).to.eq(201);
      });

    cy.request('/users/1')
      .then(response => {
        expect(response.status).to.eq(200);
        expect(response.body).to.have.property('name', 'Jack Doe');
      });
  });
});

This code illustrates how you can use Cypress to conduct an end-to-end test of a microservice application.

The test suite, named describe('End-to-End Test', ...), is designed to create a user and verify its details. The cy.request method is used to simulate HTTP requests, interacting with the application’s endpoints as a real client would.

First, it sends a POST request to the /users endpoint, adding a user with { id: 1, name: 'Jack Doe' }. After this request, an assertion checks that the response status is 201, indicating the successful creation of the user resource.

The test then moves to the second part, where it retrieves the user with cy.request('/users/1'). The test verifies that the status code is 200, meaning the user was found successfully. Also, expect(response.body).to.have.property('name', 'Jack Doe') confirms that the user’s name property matches the expected value, 'Jack Doe'.

This test validates the entire flow of creating and retrieving a user in the system, ensuring that the application’s different components, such as database interactions and HTTP request handling, function cohesively.

Cypress is particularly effective for E2E testing because it runs these requests in a controlled environment, allowing developers to test real-world scenarios with reliable assertions. This type of testing can catch integration issues that may not appear in unit or integration tests, providing greater confidence in the system's overall stability.

How to Deploy Microservices

Deploying microservices efficiently is a key part of building scalable and resilient applications. As microservices are typically small, independent services, they must be deployed in a way that allows them to function together seamlessly within a larger ecosystem.

Unlike traditional monolithic applications, microservices require a different approach to deployment, focusing on automation, scalability, and continuous delivery. Deployment also involves dealing with challenges such as service discovery, load balancing, and ensuring fault tolerance.

In this section, I’ll guide you through the various strategies and tools for deploying microservices. From containerization with Docker to orchestrating services with Kubernetes, we’ll explore how these technologies simplify the deployment process.

We will also cover essential topics such as continuous integration/continuous deployment (CI/CD) pipelines, automated scaling, and monitoring to ensure that your microservices architecture remains robust and adaptable in production environments.

By the end of this section, you will have a clear understanding of how to deploy microservices efficiently and how to maintain them as your application grows.

Containerization with Docker

Packaging microservices into Docker containers helps you consistently deploy across different environments.

It’s like using standardized shipping containers to transport goods efficiently and predictably.

# Dockerfile for a Node.js app

# Use Node.js image
FROM node:14

# Set working directory
WORKDIR /usr/src/app

# Copy package.json and install dependencies
COPY package*.json ./
RUN npm install

# Copy application code
COPY . .

# Expose port
EXPOSE 3000

# Run the application
CMD [ "node", "app.js" ]

Here, the code illustrates how you can use Docker to create a containerized environment for a Node.js application, ensuring that it can be deployed consistently across different environments.

Containerization with Docker works by encapsulating all the necessary application components, like code, runtime, libraries, and dependencies, into a standardized container image.

This approach provides predictable, repeatable deployments, similar to how standardized shipping containers are used to transport goods reliably across various transportation systems.

Starting with FROM node:14, the Dockerfile specifies a base image, in this case, an official Node.js image with version 14. This base image provides a pre-configured environment with Node.js installed, reducing the setup time and complexity required to run the app. Note that Node.js 14 has reached end of life, so a currently supported LTS version is a better choice for new projects.

By using a standardized base, this Dockerfile also ensures compatibility and eliminates potential inconsistencies that could occur with different Node.js versions.

The WORKDIR /usr/src/app command sets the working directory inside the container to /usr/src/app, which organizes the application’s code files and simplifies file path references later in the Dockerfile.

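The COPY package*.json ./ line then copies the package.json and package-lock.json files into this working directory, and RUN npm install installs the necessary Node.js dependencies. Copying these files before the rest of the code lets Docker cache the dependency layer, so npm install reruns only when they change. Keep in mind that package.json usually declares version ranges. Exact versions are pinned by package-lock.json, and npm ci installs strictly from that lockfile.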

Next, COPY . . copies the rest of the application files from the host system into the container’s working directory.

The EXPOSE 3000 instruction documents that the application listens on port 3000. It does not publish the port by itself. You map it to the host when starting the container, for example with docker run -p 3000:3000. Finally, CMD ["node", "app.js"] sets the container’s default command, so Docker runs node app.js to start the application when the container is launched.

This Dockerfile showcases the fundamental steps in building a Docker image for a Node.js app, enabling consistent and reproducible deployments. By following these steps, developers ensure that the application can be easily transferred between development, testing, and production environments without compatibility issues.

This predictable deployment approach streamlines operations, making it ideal for scaling and managing microservices in a production ecosystem.

Continuous Integration and Continuous Deployment (CI/CD)

CI/CD helps you automate the process of building, testing, and deploying microservices.

It’s like having an automated assembly line that assembles, tests, and packages products without manual intervention.

# Using GitHub Actions for Node.js

# .github/workflows/node.js.yml
name: Node.js CI

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '14'

      - name: Install dependencies
        run: npm install

      - name: Run tests
        run: npm test

The code above shows the process of how GitHub Actions is used to automate the Continuous Integration (CI) process for a Node.js application. The CI/CD pipeline ensures that code is automatically built, tested, and prepared for deployment without manual intervention, much like an automated assembly line that assembles, tests, and packages products seamlessly.

The file begins with the line name: Node.js CI, which sets the name of the workflow. The on: section specifies when the workflow should be triggered. In this case, it’s set to trigger on push events to the main branch.

This means every time a developer pushes changes to the main branch, GitHub Actions will automatically start the pipeline to check the quality and functionality of the code.

The jobs: section defines the tasks to be executed in this pipeline, and it specifies that the job will run on ubuntu-latest, a virtual machine environment provided by GitHub to run the workflow. Inside the build job, there are several steps that execute sequentially.

The first step, Checkout code, uses the actions/checkout@v3 action to check out the repository’s code so that the subsequent steps can operate on it.

The next step, Set up Node.js, uses actions/setup-node@v3 to install Node.js version 14. This step ensures that the correct version of Node.js is used for the application, avoiding discrepancies between environments.

After setting up Node.js, the step Install dependencies runs the command npm install, which installs all the dependencies defined in the project’s package.json file. This ensures that the necessary packages are available for the tests to run.

Finally, the last step, Run tests, runs the command npm test, which triggers the tests for the Node.js application. This step ensures that any changes made in the code do not break the functionality of the application, as the tests will validate that everything works as expected.

Through this GitHub Actions configuration, the CI process is fully automated. Every time changes are pushed to the main branch, the pipeline builds the project, installs dependencies, and runs the tests.

This process ensures that issues are caught early, streamlining development and improving code quality by providing automated feedback on the state of the application. It also saves time by removing the need to run the tests manually. Note that this workflow covers continuous integration only. Deploying the application would require additional steps.

Orchestration with Kubernetes

Kubernetes helps you manage the deployment, scaling, and operation of containerized applications.

Like a conductor orchestrating a symphony, Kubernetes manages and coordinates the deployment and scaling of your containerized services.

# Kubernetes YAML for a Node.js app

# Deployment definition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:latest
          ports:
            - containerPort: 3000

---
# Service definition
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer

This code illustrates how you can use Kubernetes to orchestrate the deployment and management of a Node.js application, specifically the user-service.

This YAML configuration file contains two main sections: the Deployment and the Service.

The Deployment section is where you define how your application should be deployed in the Kubernetes cluster. It specifies the apiVersion, which indicates which version of the Kubernetes API should be used to create the resource, and the kind, which identifies the type of resource being defined (in this case, a Deployment).

The metadata section contains basic information about the deployment, such as its name (user-service). Under spec, you define the desired state for the application.

The replicas: 3 field indicates that Kubernetes should maintain three identical instances of the user-service pod running at all times, which helps ensure high availability and load balancing.

The selector field defines a label selector that is used to identify the set of pods that this deployment should manage. The template section defines the pod’s metadata and its spec.

This includes a container definition, where the image is set to user-service:latest, pointing to the Docker image to be used for the container. The ports section specifies that the container will listen on port 3000, which is the port your Node.js app will use.

In the Service section, you define how Kubernetes exposes the deployed application so that other services or external clients can access it. The Service is defined with apiVersion: v1 and kind: Service, indicating that it belongs to the core Kubernetes API. The metadata section defines the service name (user-service), while the spec section describes the service's behavior.

The selector here refers to the same label as the deployment (app: user-service), ensuring that the service will route traffic to the pods created by the deployment. The ports section specifies that the service will listen on port 80 (the external port) and forward traffic to port 3000 (the port inside the container where the app is running).

Finally, the type: LoadBalancer setting asks Kubernetes to provision an external load balancer (on platforms that support it, such as the major cloud providers), distributing incoming traffic across the multiple instances of the user-service pods and further supporting high availability and fault tolerance.

Through this orchestration, Kubernetes ensures that your user-service is deployed, scaled, and exposed in a highly available manner, much like a conductor ensuring that all sections of a symphony play in time and tune.
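To make the label matching concrete, here is a small JavaScript sketch of how an equality-based selector picks its pods. It is a simplified model for illustration, not Kubernetes source code, and the pod names and the matchesSelector helper are hypothetical.

```javascript
// Simplified model of equality-based label selection (not Kubernetes source code).
// A pod matches when every key/value pair in the selector appears in its labels.
function matchesSelector(podLabels, selector) {
  return Object.entries(selector).every(
    ([key, value]) => podLabels[key] === value
  );
}

// Hypothetical pods in the cluster.
const pods = [
  { name: 'user-service-a', labels: { app: 'user-service' } },
  { name: 'user-service-b', labels: { app: 'user-service', tier: 'backend' } },
  { name: 'order-service-a', labels: { app: 'order-service' } },
];

// The selector from the Service definition above.
const serviceSelector = { app: 'user-service' };

const targets = pods
  .filter((pod) => matchesSelector(pod.labels, serviceSelector))
  .map((pod) => pod.name);

console.log(targets);
```

Extra labels on a pod do not prevent a match. Only the pairs listed in the selector have to be present, which is why the Service and the Deployment can share the single app: user-service label.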

Beyond deployment, building microservices also involves choosing the right technology stack, defining APIs and contracts, and understanding key design patterns.

Selecting appropriate programming languages and frameworks is crucial for optimizing each microservice, while well-defined APIs and contracts ensure clear and reliable communication between services.

Key design patterns such as the API Gateway Pattern, Strangler Fig Pattern, and Backend for Frontend (BFF) Pattern are explained to help manage and optimize microservices architecture.

How to Manage Microservices in the Cloud

This section delves into the essential practices, tools, and strategies needed to effectively operate and scale microservices in cloud environments. As more organizations migrate to the cloud, understanding the nuances of managing microservices in these dynamic settings has become crucial.

Here, we will look at how cloud platforms like AWS, Google Cloud, and Azure support microservices and enable seamless deployment, autoscaling, and load balancing.

This section also introduces key tools for orchestrating and monitoring microservices in the cloud, from Kubernetes for container orchestration to observability solutions like Prometheus and Grafana.

With microservices requiring intricate handling of distributed components, we’ll cover practices for maintaining service health, achieving resilience, and ensuring security across cloud-based microservices.

By exploring these foundational elements, readers will gain insights into managing, scaling, and optimizing microservices effectively within cloud infrastructures, equipping them with knowledge to handle real-world complexities.

Cloud Platforms and Services

1. Amazon Web Services (AWS):

AWS offers a broad range of services tailored for microservices architecture. Some relevant services include Elastic Container Service (ECS) for container management and Elastic Kubernetes Service (EKS) for orchestrating Kubernetes clusters.

Example: Running Node.js microservices in Docker containers managed by ECS.

2. Microsoft Azure:

Azure provides Azure Kubernetes Service (AKS) for Kubernetes orchestration, Azure Service Fabric for building scalable microservices, and Azure Functions for serverless microservices.

Example: Deploying an Express.js app on Azure Functions as a microservice.

3. Google Cloud Platform (GCP):

GCP offers Google Kubernetes Engine (GKE) for orchestrating microservices using Kubernetes and Cloud Run for running containerized apps in a fully managed environment.

Example: Deploying a microservice with Google Kubernetes Engine.

Cloud-Native Services for Microservices

Cloud providers offer specialized services for microservices that simplify scaling and management:

AWS ECS: Manages Docker containers on a cluster, with integration to AWS services.
Google Kubernetes Engine (GKE): Manages Kubernetes clusters with autoscaling features for microservices.

Running a simple Node.js container in GCP Cloud Run:

gcloud run deploy --image gcr.io/my-project/my-node-service --platform managed

In this terminal command, you can see how to deploy a containerized Node.js application using Google Cloud Run, which is a fully managed platform that automatically handles your application’s infrastructure. This allows you to focus on writing and deploying code without managing servers.

The gcloud run deploy command deploys your application to Cloud Run and is the primary command for starting a deployment. gcloud itself is the command-line tool for interacting with Google Cloud services.

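The --image gcr.io/my-project/my-node-service flag specifies the Docker image to be deployed. This image is hosted in Google Cloud's Container Registry (GCR), indicated by gcr.io. Google has since deprecated Container Registry in favor of Artifact Registry, so newer projects typically use an Artifact Registry image path instead.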

The my-project is the ID of your Google Cloud project, and my-node-service refers to the specific Docker image built for your Node.js application. This image contains everything that the application needs to run: the Node.js runtime, dependencies, and your application code.

The --platform managed flag tells Google Cloud Run to use the managed platform for hosting the service. Cloud Run offers both a managed and an Anthos-based platform, and by specifying managed, you're opting for the fully managed service where Google automatically handles things like scaling, networking, and availability.

This ensures that the application will automatically scale up or down based on incoming traffic, without you needing to manually configure or manage the infrastructure.

When you run this command, Cloud Run takes the specified Docker image, deploys it as a service, and makes it available for incoming HTTP requests. This deployment model abstracts away much of the complexity of managing the underlying infrastructure, allowing you to focus purely on application development.

Cloud Run automatically provisions resources, monitors the health of the service, and ensures that scaling is handled as traffic fluctuates.

In this setup, you can take advantage of Cloud Run’s ease of use, as it integrates well with Google Cloud’s serverless offerings, helping you run your containerized Node.js application with minimal setup or management.
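One practical detail when targeting Cloud Run: the platform tells the container which port to listen on through the PORT environment variable, which defaults to 8080. The sketch below shows one way a Node.js service might honor that. The resolvePort and createServer helpers are hypothetical names chosen for illustration, and the response body is a placeholder.

```javascript
const http = require('http');

// Resolve the listening port from the environment, falling back to a default.
// Cloud Run sets PORT for the container; 8080 is its default value.
function resolvePort(env, fallback = 8080) {
  const port = Number.parseInt(env.PORT, 10);
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
}

// Build the server without starting it, so the module stays easy to test.
function createServer() {
  return http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
  });
}

const port = resolvePort(process.env);
// To start the service, call: createServer().listen(port);
console.log(`Service would listen on port ${port}`);
```

Reading the port from the environment instead of hard-coding it lets the same image run unchanged locally, in Kubernetes, and on Cloud Run.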

Containerization and Orchestration

Introduction to Containers (Docker)

Containers encapsulate microservices along with their dependencies, ensuring they run consistently across different environments. Docker is the most common containerization tool.

Containers are like shipping containers for software. No matter where you send them, the contents (code and dependencies) remain the same.

Dockerfile for Node.js Microservice:

# Use the Node.js 16 image
FROM node:16

# Create app directory
WORKDIR /usr/src/app

# Install dependencies
COPY package*.json ./
RUN npm install

# Copy app source code
COPY . .

# Expose port and start app
EXPOSE 8080
CMD ["node", "app.js"]

In this snippet, you can see how to define a Dockerfile for a Node.js microservice, which is used to build and containerize the application for deployment. The Dockerfile provides a series of steps that Docker will follow to create an image that can be run anywhere that Docker is supported.

The first line, FROM node:16, specifies the base image to use for the container. In this case, it uses the official Node.js image with version 16.

By using a specific version like this, you ensure that your application runs consistently in a controlled environment with Node.js version 16, regardless of the machine or platform it is deployed to. This guarantees compatibility with the dependencies and features available in Node.js 16.

The WORKDIR /usr/src/app line sets the working directory within the container to /usr/src/app. This is where your application code will live inside the container. By setting the working directory explicitly, all subsequent commands like COPY and RUN will be relative to this location, helping to keep things organized within the container’s filesystem.

The COPY package*.json ./ command copies the package.json and package-lock.json files (or any matching files in the pattern) into the container. This is a crucial step as these files contain the metadata and dependencies required for the Node.js application.

This allows Docker to install all necessary dependencies without copying the entire application code first, which takes advantage of Docker’s caching mechanism to avoid reinstalling dependencies when they haven’t changed.

Next, the RUN npm install command installs the dependencies listed in the package.json file. This command is run during the image-building process, meaning all the dependencies will be available when the container is started. This installation is done inside the Docker container, ensuring that the app has everything it needs to run.

The COPY . . command copies the rest of the application code into the container’s working directory. This step ensures that all the source code, such as your app.js file and any other necessary files, is available inside the container so that it can be executed by Node.js.

The EXPOSE 8080 line tells Docker that the container will listen on port 8080. This is the port the application listens on inside the container.

The EXPOSE instruction does not publish the port by itself. It serves as documentation, and the port becomes reachable from the host only when you publish it at run time, for example with docker run -p 8080:8080.

Finally, CMD ["node", "app.js"] defines the default command to run when the container starts. In this case, it tells Docker to run the app.js file using Node.js. This is the entry point of your application, and once the container starts, Node.js will execute this file to run your application.

Overall, this Dockerfile is a simple and efficient way to package a Node.js microservice into a container. By specifying the environment, dependencies, and instructions on how to start the application, it ensures that the service can run in any environment where Docker is supported, with consistent behavior across development, staging, and production systems.

Container Orchestration Tools (Kubernetes, Docker Swarm)

Kubernetes is the most widely used container orchestration platform, providing features like automatic scaling, load balancing, and self-healing.

Kubernetes is like a traffic controller, managing how containers (microservices) are deployed, scaled, and routed.

Kubernetes (Simple Deployment YAML):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-microservice
spec:
  replicas: 3
  selector:
    matchLabels:
      app: node-microservice
  template:
    metadata:
      labels:
        app: node-microservice
    spec:
      containers:
      - name: node-microservice
        image: node-microservice:latest
        ports:
        - containerPort: 8080

In this code, you can see how a simple Kubernetes Deployment YAML configuration is used to define the deployment of a Node.js microservice in a Kubernetes cluster. Kubernetes, as a container orchestration tool, automates many critical tasks such as scaling, load balancing, and self-healing.

This configuration ensures that your Node.js microservice is deployed in a controlled and repeatable manner, handling the lifecycle of the application containers effectively.

The first line, apiVersion: apps/v1, specifies the version of the Kubernetes API that this configuration is using. The apps/v1 API version is commonly used for managing applications deployed within Kubernetes, such as Deployments, StatefulSets, and DaemonSets. This ensures compatibility with the Kubernetes cluster where the configuration will be applied.

The kind: Deployment field specifies that this configuration defines a Deployment resource in Kubernetes. A Deployment ensures that a specified number of identical Pods (which run the containers of your application) are running at all times.

It is used for managing the rollout and scaling of applications while also handling updates in a declarative manner. This is one of the most commonly used resources in Kubernetes to maintain application availability.

The metadata section defines basic information about the deployment, such as the name of the deployment (name: node-microservice). This name identifies the deployment resource within the Kubernetes cluster, making it easier to reference and manage.

In the spec section, the deployment's configuration is defined in detail. The replicas: 3 line specifies that Kubernetes should maintain three copies (replicas) of the Node.js microservice running at all times.

This ensures high availability, as Kubernetes will automatically replace any failed Pods with new ones. If one Pod goes down for any reason, another will be started in its place.

The selector field defines how Kubernetes identifies which Pods are managed by this Deployment. The matchLabels section specifies that the Pods with the label app: node-microservice should be included.

This allows Kubernetes to group and manage related Pods based on labels, ensuring that the correct set of Pods is scaled, updated, and rolled back as needed.

The template field defines the structure of the Pods that will be created by this Deployment. Inside the template, metadata defines labels that will be applied to the Pods, ensuring they match the selector defined earlier.

The spec field specifies the container details for the Pod, including the container name (name: node-microservice), the container image (image: node-microservice:latest), and the ports to be exposed (containerPort: 8080). The image refers to a Docker image stored in a registry. The latest tag is simply the default tag. It does not guarantee the most recent build, so pinning a specific version tag is generally safer in production.

Specifying the container port as 8080 tells Kubernetes which port the application inside the container will be listening on. This matters for networking within the cluster, as other services can connect to the Pods using this port.

Overall, this Deployment YAML is a simple yet powerful configuration for managing a Node.js microservice in Kubernetes. Kubernetes will handle the scaling (with three replicas), the application’s high availability, and the management of the Pods that run the application, making it much easier to deploy and manage microservices in a production environment.
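The self-healing behavior described above comes from a control loop that keeps comparing the desired state with what is actually running. The following JavaScript sketch models one such comparison. It is a deliberately simplified illustration rather than actual Kubernetes code, and the reconcile function, the pod names, and the status values are assumptions made for this example.

```javascript
// Simplified model of a reconciliation step (not actual Kubernetes code).
// Compares the desired replica count with the pods that are currently healthy
// and returns the actions needed to converge on the desired state.
function reconcile(desiredReplicas, pods) {
  const healthy = pods.filter((pod) => pod.status === 'Running');
  const failed = pods.filter((pod) => pod.status !== 'Running');

  // Failed pods are removed so replacements can take their place.
  const actions = failed.map((pod) => ({ type: 'delete', pod: pod.name }));

  // Create pods if there are fewer healthy ones than desired.
  const missing = desiredReplicas - healthy.length;
  for (let i = 0; i < missing; i += 1) {
    actions.push({ type: 'create' });
  }
  // Scale down if there are more healthy pods than desired.
  for (let i = 0; i < -missing; i += 1) {
    actions.push({ type: 'delete', pod: healthy[i].name });
  }
  return actions;
}

// Hypothetical cluster state: one of three replicas has crashed.
const observed = [
  { name: 'node-microservice-1', status: 'Running' },
  { name: 'node-microservice-2', status: 'Failed' },
  { name: 'node-microservice-3', status: 'Running' },
];

const actions = reconcile(3, observed);
console.log(actions);
```

With replicas: 3 and one failed Pod, the sketch yields one delete and one create action, which mirrors how the Deployment restores the declared replica count.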

Helm Charts and Kubernetes Operators

Helm is a package manager for Kubernetes, simplifying deployment. Kubernetes Operators extend Kubernetes functionalities to manage complex applications.

Helm can deploy an entire microservices stack (for example, a web service, database, and so on) with a single command.

helm install my-app ./chart

This code illustrates how you can use Helm to install an application on a Kubernetes cluster. Helm acts as a package manager for Kubernetes, simplifying the process of deploying and managing applications by using Helm Charts. Helm Charts are pre-configured application templates that define the resources necessary to deploy an application in Kubernetes.

With a single command like helm install my-app ./chart, you can deploy an entire microservice stack or application on Kubernetes, including web services, databases, and other components, all with the configuration specified in the chart.

The command helm install my-app ./chart is performing several key actions. First, it tells Helm to install a new application named my-app. The ./chart path refers to the location of the Helm Chart on your local file system.

This chart contains all the Kubernetes manifest files, configurations, and templates required to deploy the application. When you run this command, Helm takes these resources, processes any templates with user-specific values, and then communicates with the Kubernetes API server to create the necessary Kubernetes resources, such as Pods, Deployments, Services, ConfigMaps, and more.

By using Helm, you abstract away the complexity of managing multiple Kubernetes resources and dependencies. Instead of manually creating and configuring each resource (which can be error-prone and time-consuming), you use the Helm Chart to define everything in one place.

This makes Helm a powerful tool for managing complex applications, particularly microservices, by encapsulating everything needed for deployment and ensuring consistency across different environments.

Kubernetes Operators complement Helm by extending Kubernetes itself with custom resources and controllers that automate the management of complex, stateful applications.

While Helm can handle the deployment, Operators can manage the lifecycle of the application after deployment, including tasks such as backups, scaling, and updates.

This combination of Helm and Kubernetes Operators ensures that your microservices are not only deployed efficiently but also managed intelligently through their entire lifecycle.

CI/CD Pipelines and Best Practices

CI/CD pipelines automate the process of integrating code changes, testing, and deploying them into production.

This enables rapid and frequent delivery of updates while maintaining high-quality code.

Best Practices:

Use small, frequent commits to enable easier testing and rollback.
Ensure each service can be tested and deployed independently.

Tools and Platforms for CI/CD

Jenkins: Open-source automation tool for building CI/CD pipelines.
GitLab CI/CD: Integrated with GitLab, it provides built-in CI/CD tools.
CircleCI: Offers fast and efficient pipelines for continuous delivery.

Jenkins Pipeline for Microservice Deployment:

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'npm install'
            }
        }
        stage('Test') {
            steps {
                sh 'npm test'
            }
        }
        stage('Deploy') {
            steps {
                sh 'docker build -t my-app .'
                sh 'docker push my-app:latest'
            }
        }
    }
}

In this snippet, you can see how a Jenkins Pipeline is defined to automate the process of building, testing, and deploying a Node.js microservice using Docker. This declarative pipeline is specified in a Jenkinsfile and uses three stages: Build, Test, and Deploy.

Each stage in the pipeline represents a distinct step in the continuous integration (CI) and continuous deployment (CD) lifecycle for a microservice.

In the Build stage, the pipeline runs the command npm install to install all the dependencies specified in the package.json file. This step is essential for setting up the application's environment and ensuring that all required libraries are in place for subsequent stages.

The command sh is a Jenkins Pipeline step that allows the use of shell commands, such as those for Node.js package management.

In the Test stage, the pipeline executes npm test to run the test suite defined in the project. Testing at this stage ensures that the microservice’s code functions correctly before it’s packaged for deployment.

This stage is critical for catching issues early in the CI/CD process, allowing developers to detect and address bugs before they reach the deployment environment.

The Deploy stage begins with the command docker build -t my-app ., which creates a Docker image-tagged my-app from the application source code and configuration files in the current directory (.).

After building the Docker image, the command docker push my-app:latest uploads the image to a container registry. Note that the name my-app carries no registry host or namespace, so in a real pipeline you would normally tag the image with the full repository path of your registry and make sure the build agent is logged in to it. This step makes the built container image available for deployment to any environment that pulls images from this registry.

By organizing these steps in a Jenkins pipeline, you create a streamlined, automated workflow that allows you to easily reproduce the process of building, testing, and deploying the application across multiple environments.

This setup reduces the risk of human error, accelerates deployment, and ensures consistent results with every commit or code change.

Automated Testing and Deployment Strategies

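Blue/Green Deployment: Runs two versions of the service side by side (blue, the current version, and green, the new one). Once the new version passes its checks, traffic is switched over to it, which keeps downtime close to zero and makes rollback a matter of switching back.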
Canary Releases: Gradually introduce a new version of a service to a subset of users, allowing for monitoring and rollback in case of issues.
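
To make the idea of shifting a subset of traffic more concrete, here is a minimal sketch of weighted routing of the kind a canary release relies on. The function names pickVersion and simulate, the version labels, and the 10% weight are illustrative assumptions. In practice this decision usually lives in a load balancer or service mesh rather than in application code.

```javascript
// Hypothetical helper: choose which version serves a request.
// canaryWeight is the fraction of traffic (0..1) sent to the new version.
// random is injectable so the behaviour can be tested deterministically.
function pickVersion(canaryWeight, random = Math.random) {
  return random() < canaryWeight ? 'canary' : 'stable';
}

// Count how many of the simulated requests land on each version.
function simulate(requests, canaryWeight) {
  const counts = { stable: 0, canary: 0 };
  for (let i = 0; i < requests; i++) {
    counts[pickVersion(canaryWeight)]++;
  }
  return counts;
}

console.log(simulate(1000, 0.1)); // roughly 900 stable and 100 canary
```

Raising the weight step by step while watching error rates is the gradual rollout. Setting it straight from 0 to 1 corresponds to a blue/green style switch.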

Monitoring and Logging

Effective monitoring and logging are fundamental to maintaining the health and performance of a microservices-based application. As microservices often operate in distributed environments, it becomes challenging to track, diagnose, and troubleshoot issues. Without proper visibility into the system’s behavior, you risk operational inefficiencies, performance bottlenecks, and increased downtime.

In this section, we will focus on how to implement robust monitoring and logging practices that ensure you can effectively track and manage the behavior of microservices in real-time.

We'll explore the tools and frameworks available for monitoring system health, gathering performance metrics, and collecting logs from different microservices in your application.

We'll also discuss how these practices can support proactive issue resolution by allowing for timely alerts and more insightful data for debugging.

Then we’ll dive into the importance of centralized logging systems like ELK Stack (Elasticsearch, Logstash, and Kibana), and how monitoring solutions such as Prometheus and Grafana provide metrics and visualizations to observe your services' health.

Finally, we’ll cover tracing techniques that can help pinpoint the flow of requests across microservices, ensuring quick resolution of performance or failure issues.

By the end of this section, you'll understand how to implement a comprehensive monitoring and logging strategy that ensures your microservices architecture operates smoothly and reliably.

Centralized Logging Solutions (ELK Stack, Fluentd)

Microservices generate logs across many instances. Centralized logging solutions collect and store logs in a single location, simplifying analysis.

ELK Stack (Elasticsearch, Logstash, Kibana): Common for centralized logging, enabling full-text search and visualizations.
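
Centralized logging works best when every service emits logs in a consistent, machine-readable shape. The sketch below shows one possible way to produce a JSON log line per event. The helper name formatLog and the field names are assumptions chosen for illustration, not something the ELK Stack requires.

```javascript
// Hypothetical helper: build one JSON log line per event.
// Field names (timestamp, service, level, message) are illustrative choices.
function formatLog(service, level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    service,
    level,
    message,
    ...fields
  });
}

// Writing to stdout lets a log shipper collect and forward the output.
console.log(formatLog('order-service', 'info', 'Order created', { orderId: 42 }));
```

Because each line is valid JSON with a service field, logs from many instances can be filtered and searched by service, level, or any extra field once they reach the central store.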

Monitoring and Observability Tools (Prometheus, Grafana, Datadog)

Monitoring tools track the performance and health of microservices. Prometheus collects metrics, and Grafana visualizes them in dashboards.

Prometheus (Monitoring Node.js Microservice):

const express = require('express');
const client = require('prom-client');

const app = express();

// Create a counter metric
const requestCounter = new client.Counter({
    name: 'node_requests_total',
    help: 'Total number of requests'
});

// Increment counter on each request
app.use((req, res, next) => {
    requestCounter.inc();
    next();
});

The code above shows how Prometheus metrics are integrated into a Node.js application using the prom-client library to monitor API requests.

Prometheus is a popular tool for monitoring and alerting in microservices environments, often used to track and visualize system health metrics like request counts, response times, and error rates.

Here, the code is focused on implementing a simple counter metric to monitor the total number of requests the application receives.

First, the prom-client module is imported to set up Prometheus-compatible metrics in the application. The Counter class from prom-client is used to define a new counter metric, named node_requests_total, with a description (via the help property) of "Total number of requests."

Counters in Prometheus are designed for tracking cumulative values, like the count of requests or the number of errors, and are ideal for metrics that always increase, such as a request count.

The middleware function then increments this counter on every incoming request by calling requestCounter.inc(). This middleware is added to the Express app instance using app.use(), which means it will execute for every incoming request, incrementing the requestCounter metric.

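Each call to inc() updates the counter that the client library holds in the application's memory. Prometheus reads the current value whenever it scrapes the application, which allows the total count of requests to be monitored over time.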

This setup allows Prometheus to pull these metrics at regular intervals from the application’s /metrics endpoint (if configured).

By tracking the node_requests_total counter, you can gain insights into traffic patterns and detect sudden increases or decreases in request volume, which can be crucial for monitoring system performance and ensuring service reliability.

This basic example demonstrates how to set up and use Prometheus metrics to gain visibility into microservice activity.
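
To show what Prometheus actually receives when it scrapes a counter, here is a small sketch that renders a counter in the Prometheus text exposition format without any external library. The SimpleCounter class is a hypothetical stand-in written for illustration. With prom-client, the library's registry produces this output for you.

```javascript
// Stand-in for a client-library counter; for illustration only.
class SimpleCounter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.value = 0;
  }

  inc(amount = 1) {
    this.value += amount; // counters only ever go up
  }

  // Render the metric in the Prometheus text exposition format.
  render() {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`
    ].join('\n') + '\n';
  }
}

const counter = new SimpleCounter('node_requests_total', 'Total number of requests');
counter.inc();
counter.inc();
console.log(counter.render());
// # HELP node_requests_total Total number of requests
// # TYPE node_requests_total counter
// node_requests_total 2
```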

Distributed Tracing (Jaeger, Zipkin)

In microservices, tracking a request's journey across services is crucial. Distributed tracing tools like Jaeger and Zipkin provide visibility into how requests propagate across services.

Distributed tracing is like tracking a package’s journey through multiple shipping hubs, providing insights into where delays occur.
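
The core mechanism behind tracing is that every service passes along the same identifier for a given request. The sketch below illustrates that propagation with plain objects. The header name x-trace-id and the helper withTraceId are assumptions for illustration only, since tracing systems such as Jaeger and Zipkin define their own header formats and also record timing data for each hop.

```javascript
const { randomUUID } = require('crypto');

// Reuse the incoming trace ID if present, otherwise start a new trace.
function withTraceId(incomingHeaders = {}) {
  const traceId = incomingHeaders['x-trace-id'] || randomUUID();
  return { ...incomingHeaders, 'x-trace-id': traceId };
}

// Service A receives a request with no trace ID, then calls service B
// and forwards its headers.
const headersAtA = withTraceId({});
const headersAtB = withTraceId(headersAtA);

console.log(headersAtA['x-trace-id'] === headersAtB['x-trace-id']); // true
```

Because both services log the same ID, their log entries and timings can be stitched together into one end-to-end view of the request.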

Security Considerations

Securing APIs and Inter-Service Communication (OAuth, JWT)

OAuth 2.0: A framework that allows users to grant third-party applications access to their resources without sharing credentials.
JWT (JSON Web Tokens): Used for secure, stateless authentication between services.

Securing API with JWT in Node.js:

const jwt = require('jsonwebtoken');

// Middleware to verify JWT
function verifyToken(req, res, next) {
    // Accept either a raw token or the common "Bearer <token>" form
    const header = req.headers['authorization'] || '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : header;
    if (!token) return res.status(401).send('No token provided.');

    // 'secretkey' is a placeholder; load the real secret from configuration
    jwt.verify(token, 'secretkey', (err, decoded) => {
        if (err) return res.status(401).send('Failed to authenticate token.');
        req.userId = decoded.id;
        next();
    });
}

app.use(verifyToken);

This implementation shows JWT (JSON Web Token) authentication in a Node.js application, using the jsonwebtoken library to secure API access. JWT is commonly used to verify the identity of a user and ensure that only authenticated users can access certain endpoints or perform sensitive actions.

Here, a middleware function verifyToken is defined to check the presence and validity of a JWT on each request. In Express applications, middleware is a function that has access to the request (req) and response (res) objects and can perform operations before passing control to the next middleware or route handler.

By setting up this middleware, you enforce token verification on every request, ensuring that all subsequent routes are protected.

The verifyToken function first looks for a token in the authorization request header, stripping the "Bearer " prefix if the client sent one. If no token is provided, it immediately returns a 401 (Unauthorized) status with the message "No token provided," blocking access for unauthenticated callers.

If a token is present, the function uses jwt.verify() to decode and validate the token against a secret key, shown here as the placeholder 'secretkey'. In a real deployment, the secret should come from configuration or an environment variable rather than being hardcoded. If verification fails (for example, if the token is expired or has been tampered with), the middleware returns a 401 status code with the message "Failed to authenticate token," because the problem lies with the caller's credentials rather than with the server.

If the token is valid, the decoded token’s id (which could represent the user's ID or other identifying information) is assigned to req.userId, making it available for any downstream functions to use, and the next() function is called to proceed to the next middleware or route handler.

Finally, app.use(verifyToken); applies this middleware globally to all routes, meaning every incoming request to the API will go through this authentication check. This setup is useful in securing sensitive routes, as it prevents unauthorized users from accessing data or functionalities they shouldn’t have access to.

With this structure, you can also customize the JWT verification process or apply this middleware selectively to specific routes depending on the security requirements of your application.
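
To see what signature verification involves, the sketch below signs and checks an HS256 token using only Node's built-in crypto module. It is a teaching aid under simplifying assumptions: the helper names are hypothetical, the secret is a demo value, and it ignores expiry and other claims. In a real service, keep using a maintained library such as jsonwebtoken.

```javascript
const crypto = require('crypto');

// Base64url-encode a JSON-serializable value.
function encodePart(value) {
  return Buffer.from(JSON.stringify(value)).toString('base64url');
}

// Compute the HS256 signature over "header.payload".
function signHS256(data, secret) {
  return crypto.createHmac('sha256', secret).update(data).digest('base64url');
}

// Create a token (illustration only; no expiry or other claims handled).
function createToken(payload, secret) {
  const data = `${encodePart({ alg: 'HS256', typ: 'JWT' })}.${encodePart(payload)}`;
  return `${data}.${signHS256(data, secret)}`;
}

// Return the payload if the signature matches, otherwise null.
function verifyTokenSignature(token, secret) {
  const parts = token.split('.');
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  const expected = Buffer.from(signHS256(`${header}.${payload}`, secret));
  const actual = Buffer.from(signature);
  // Compare in constant time; lengths must match before timingSafeEqual.
  if (expected.length !== actual.length || !crypto.timingSafeEqual(expected, actual)) {
    return null;
  }
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}

const token = createToken({ id: 7 }, 'demo-secret');
console.log(verifyTokenSignature(token, 'demo-secret'));  // { id: 7 }
console.log(verifyTokenSignature(token, 'wrong-secret')); // null
```

The key point is that the payload is only trusted after the signature has been recomputed with the server's secret and found to match, which is why a tampered token or a token signed with a different key is rejected.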

Network Security and Firewall Configurations

Securing the network layer involves setting up firewall rules, VPNs, and Virtual Private Clouds (VPCs) to control access between services.

Example: Configure AWS Security Groups to restrict access to a microservice only from specific IP addresses or other services.

Microservices handling sensitive data must comply with data protection regulations like GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act). This involves:

Data encryption (in transit and at rest).
Role-based access control (RBAC).
Regular auditing and reporting.
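
As a rough illustration of role-based access control, the following sketch maps roles to permissions and checks a request against that mapping. The role names, permission strings, and the can helper are all hypothetical. Real systems usually load this mapping from a policy store or identity provider.

```javascript
// Hypothetical role-to-permission mapping.
const rolePermissions = new Map([
  ['admin', ['records:read', 'records:write', 'audit:read']],
  ['clinician', ['records:read', 'records:write']],
  ['auditor', ['audit:read']]
]);

// Return true only if the role is known and grants the permission.
function can(role, permission) {
  const permissions = rolePermissions.get(role) || [];
  return permissions.includes(permission);
}

console.log(can('auditor', 'audit:read'));    // true
console.log(can('auditor', 'records:write')); // false
console.log(can('guest', 'records:read'));    // false
```

Unknown roles get no permissions by default, so access is denied unless it has been explicitly granted.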

Managing microservices in the cloud requires leveraging cloud-native tools, container orchestration, CI/CD practices, monitoring, and security measures.

By implementing these strategies, microservices can be deployed and managed effectively in the cloud environment while ensuring reliability, scalability, and security.

Case Studies and Real-World Examples

This section explores how microservices architecture has been implemented across various industries, offering insights into the successes, challenges, and innovations from leading companies.

By examining real-world applications, you’ll see how microservices are used to solve complex scalability and flexibility issues and how different companies have approached architecture, deployment, and management.

This section includes detailed case studies from technology giants and enterprises in sectors such as e-commerce, finance, and media, showcasing how each adapted microservices to meet unique demands.

By analyzing both the strategies that drove successful implementations and the lessons learned from obstacles encountered, this part provides a practical perspective on microservices adoption and illustrates how abstract concepts are applied in real-world environments.

Through these examples, you should be able to grasp how microservices might benefit your own applications, gaining actionable insights for building, scaling, and optimizing microservices in diverse operational contexts.

Case Study 1: E-Commerce Platform

First, we’ll look at the case of an e-commerce platform with multiple microservices handling product listings, user management, order processing, and payment transactions.

Think of the platform as a large department store with separate sections for clothing, electronics, and groceries. Each section (microservice) manages its own inventory and operations.

Architecture

Microservices involved:

Product Service: Manages product catalog and search functionality.
User Service: Handles user registration, authentication, and profile management.
Order Service: Processes orders and manages order history.
Payment Service: Handles payment processing and transactions.

// Service Definitions

// Product Service
class ProductService {
  constructor() {
    this.products = [];
  }

  addProduct(product) {
    this.products.push(product);
    return product;
  }

  searchProducts(query) {
    return this.products.filter(p => p.name.includes(query));
  }
}

// User Service
class UserService {
  constructor() {
    this.users = [];
  }

  registerUser(user) {
    this.users.push(user);
    return user;
  }

  authenticateUser(username, password) {
    return this.users.find(u => u.username === username && u.password === password);
  }
}

// Order Service
class OrderService {
  constructor() {
    this.orders = [];
  }

  createOrder(order) {
    this.orders.push(order);
    return order;
  }

  getOrder(orderId) {
    return this.orders.find(o => o.id === orderId);
  }
}

// Payment Service
class PaymentService {
  processPayment(paymentInfo) {
    // Simulate payment processing
    return `Payment of ${paymentInfo.amount} processed successfully`;
  }
}

The code above illustrates how each of the four services in a microservices-oriented application is defined independently, with dedicated methods for handling distinct functionalities related to products, users, orders, and payments.

This approach exemplifies how each service in a microservice architecture is specialized and modular, with minimal dependencies on other services, which makes the codebase easier to manage, test, and scale.

The ProductService class manages a list of products, providing methods like addProduct to add a product to the list and searchProducts to filter products based on a search query. The addProduct method appends a new product to an array, simulating a lightweight in-memory data store.

The searchProducts method then allows users to search for products by name, providing a simple but effective mechanism for retrieving relevant products based on the user’s input.

The UserService class represents the logic for handling user-related operations. It includes a registerUser method to add new users to the system, and an authenticateUser method to validate credentials.

When a user attempts to log in, authenticateUser checks for a user entry that matches both the provided username and password, simulating a basic form of user authentication.

This demonstrates how user authentication can be encapsulated within a single service, ensuring the functionality is cohesive and logically separated from other service responsibilities.
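
The example above compares passwords in plain text to keep the simulation short. A production user service would store a salted hash instead. Here is a minimal sketch of that idea using Node's built-in crypto module. The helper names hashPassword and verifyPassword and the chosen parameters are assumptions for illustration.

```javascript
const crypto = require('crypto');

// Derive a salted hash; store the salt and hash instead of the password.
function hashPassword(password) {
  const salt = crypto.randomBytes(16).toString('hex');
  const hash = crypto.scryptSync(password, salt, 64).toString('hex');
  return { salt, hash };
}

// Re-derive the hash with the stored salt and compare in constant time.
function verifyPassword(password, stored) {
  const candidate = crypto.scryptSync(password, stored.salt, 64);
  const expected = Buffer.from(stored.hash, 'hex');
  return crypto.timingSafeEqual(candidate, expected);
}

const stored = hashPassword('s3cret');
console.log(verifyPassword('s3cret', stored)); // true
console.log(verifyPassword('wrong', stored));  // false
```

With this approach, registerUser would save the salt and hash, and authenticateUser would call a check like verifyPassword rather than comparing strings directly.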

The OrderService class is focused on managing orders. The createOrder method allows for creating a new order, appending it to the orders array, and returning the created order as confirmation.

The getOrder method retrieves a specific order based on its ID, offering a way to access individual order details. This separation of concerns keeps the order-handling logic contained within its own service, making it easy to scale independently as order volumes increase.

Finally, the PaymentService class provides a processPayment method to simulate payment processing. This method takes payment information, such as an amount, and returns a confirmation message to indicate successful processing.

Although the processPayment method here is simple, in a real-world scenario, it would interact with external payment processing systems. By isolating payment logic in its own service, it becomes straightforward to modify or replace the payment processing mechanism without affecting other parts of the application.

This setup demonstrates how each service can independently perform its designated tasks, enabling scalable and maintainable code. Each service manages its own state and operations without interfering with others, allowing for independent development, testing, and deployment of each service, which is a key benefit of microservice architecture.

Challenges and Solutions

Challenge: Ensuring consistent data across services, such as synchronizing user data with orders.
Solution: Implementing a shared data store or using event-driven architecture to keep data in sync.

It’s like having a central inventory system that updates stock levels across all departments in real time.
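
The event-driven option can be sketched with Node's built-in EventEmitter standing in for a message broker. The event name user.emailChanged, the simplified service classes, and the contactEmails map are hypothetical. A real system would use a broker and would need to handle retries and delivery order.

```javascript
const { EventEmitter } = require('events');

// In-process stand-in for a message broker.
const bus = new EventEmitter();

class UserService {
  constructor(bus) {
    this.bus = bus;
    this.users = new Map();
  }

  updateEmail(userId, email) {
    this.users.set(userId, { userId, email });
    // Announce the change instead of writing to another service's data.
    this.bus.emit('user.emailChanged', { userId, email });
  }
}

class OrderService {
  constructor(bus) {
    this.contactEmails = new Map(); // local copy used for order notifications
    bus.on('user.emailChanged', ({ userId, email }) => {
      this.contactEmails.set(userId, email);
    });
  }
}

const users = new UserService(bus);
const orders = new OrderService(bus);
users.updateEmail(1, 'ada@example.com');
console.log(orders.contactEmails.get(1)); // "ada@example.com"
```

Each service keeps ownership of its own data, and the order side stays in sync by reacting to events rather than by reading the user service's storage directly.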

Lessons Learned:

Scalability: Separating services allowed the platform to scale individual components (for example, product search) based on demand.
Resilience: Microservices architecture improved fault tolerance. If one service failed, the rest continued to operate.

Case Study 2: Streaming Media Service

The next case we’ll look at is a streaming service providing video content with features like recommendation engines, user profiles, and content delivery.

It’s similar to a cable TV provider with different channels (services) for live TV, on-demand content, and user recommendations.

Architecture

Microservices involved:

Content Service: Manages video content and metadata.
Recommendation Service: Provides personalized content recommendations based on user behavior.
User Profile Service: Handles user profiles, preferences, and watch history.
Streaming Service: Manages video streaming and delivery.

// Service Definitions

// Content Service
class ContentService {
  constructor() {
    this.contents = [];
  }

  addContent(content) {
    this.contents.push(content);
    return content;
  }

  getContent(id) {
    return this.contents.find(c => c.id === id);
  }
}

// Recommendation Service
class RecommendationService {
  constructor() {
    this.recommendations = {};
  }

  generateRecommendations(userId) {
    // Simulate recommendation logic
    return this.recommendations[userId] || [];
  }
}

// User Profile Service
class UserProfileService {
  constructor() {
    this.profiles = [];
  }

  getUserProfile(userId) {
    return this.profiles.find(p => p.userId === userId);
  }
}

// Streaming Service
class StreamingService {
  streamContent(contentId) {
    return `Streaming content with ID: ${contentId}`;
  }
}

In the code above, you can see how each service encapsulates specific functionalities related to content management, user recommendations, user profiles, and streaming, typical in a media platform with a microservices architecture.

Each service class represents a distinct part of the application, ensuring modularity and separation of concerns, which aligns with the microservice philosophy.

The ContentService class is designed to manage content data. It contains an array, this.contents, which acts as a temporary in-memory storage for content objects. The addContent method allows new content to be added to this array and returns the added content, allowing confirmation of a successful addition.

The getContent method retrieves a specific content item by ID, simulating a database search. Together, addContent and getContent cover the create and read parts of CRUD (Create, Read, Update, Delete). Update and delete operations, along with a persistent data store, could be added later.

The RecommendationService class focuses on providing content recommendations based on user IDs. Here, this.recommendations is an object where recommendations for each user can be stored and accessed.

The generateRecommendations method fetches recommendations for a given userId, providing a placeholder for more sophisticated recommendation logic, such as algorithms that analyze user preferences or historical data.

Also, you can see how generateRecommendations works to encapsulate user-specific recommendations, allowing for customization and personalization of content, which is crucial for engagement in media services.
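
As one example of what the placeholder logic could grow into, the sketch below ranks unwatched titles by how often their genre appears in a user's watch history. The function name recommendByGenre and the data shapes (items with an id and a genre) are assumptions for illustration. Production recommendation systems are far more sophisticated.

```javascript
// Hypothetical data shapes: each item has an id and a genre.
function recommendByGenre(watchHistory, catalog, limit = 3) {
  // Count how often each genre appears in the user's history.
  const genreCounts = new Map();
  for (const item of watchHistory) {
    genreCounts.set(item.genre, (genreCounts.get(item.genre) || 0) + 1);
  }

  const watchedIds = new Set(watchHistory.map(item => item.id));
  const score = item => genreCounts.get(item.genre) || 0;

  return catalog
    .filter(item => !watchedIds.has(item.id)) // skip already-watched titles
    .sort((a, b) => score(b) - score(a))      // most-watched genres first
    .slice(0, limit)
    .map(item => item.id);
}

const history = [
  { id: 1, genre: 'sci-fi' },
  { id: 2, genre: 'sci-fi' },
  { id: 3, genre: 'drama' }
];
const catalog = [
  { id: 1, genre: 'sci-fi' },
  { id: 4, genre: 'comedy' },
  { id: 5, genre: 'drama' },
  { id: 6, genre: 'sci-fi' }
];

console.log(recommendByGenre(history, catalog)); // [ 6, 5, 4 ]
```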

The UserProfileService class manages user profile data. The getUserProfile method retrieves a specific user profile based on userId, making it possible to access user-specific information like preferences or watch history.

This service has its own in-memory array, this.profiles, which represents user profile storage. In this code, you can see how getUserProfile works independently to fetch relevant profile information without relying on other services, allowing it to operate autonomously and at scale.

Lastly, the StreamingService class is responsible for handling content streaming. It includes the streamContent method, which takes a contentId and simulates streaming functionality by returning a message confirming the stream of the specified content.

This class doesn’t maintain state. It simply performs an action based on a request, which keeps it lightweight when handling multiple streaming requests. Because streamContent focuses solely on providing a streaming response, it follows the single-responsibility principle and keeps streaming functionality isolated from other application logic.

These services illustrate how dividing an application into focused, specialized services allows each to operate independently. Each service’s methods are designed to be extensible, meaning they can grow in functionality without interfering with other parts of the application.

This architecture is highly advantageous for complex applications, as it allows for individual services to be scaled, modified, and maintained without impacting the overall system.

Challenges and Solutions:

Challenge: Handling high traffic and ensuring smooth streaming during peak times.
Solution: Implementing content delivery networks (CDNs) and optimizing streaming protocols.

It’s like distributing TV signals through multiple antennas to ensure clear reception even in high-demand areas.

Lessons Learned:

Performance: CDN integration improved content delivery speed and reduced latency.
Personalization: Personalized recommendations increased user engagement and satisfaction.

Case Study 3: Financial Services Application

For our third case study, we’ll consider a financial services application with microservices for account management, transaction processing, and fraud detection.

It’s similar to a bank with different departments for account services, transaction handling, and security checks.

Architecture

Microservices involved:

Account Service: Manages user accounts and balances.
Transaction Service: Handles transactions and transfers.
Fraud Detection Service: Monitors and detects suspicious activities.

// Service Definitions

// Account Service
class AccountService {
  constructor() {
    this.accounts = [];
  }

  createAccount(account) {
    this.accounts.push(account);
    return account;
  }

  getAccount(accountId) {
    return this.accounts.find(a => a.id === accountId);
  }
}

// Transaction Service
class TransactionService {
  constructor() {
    this.transactions = [];
  }

  processTransaction(transaction) {
    this.transactions.push(transaction);
    return transaction;
  }
}

// Fraud Detection Service
class FraudDetectionService {
  detectFraud(transaction) {
    // Simulate fraud detection
    if (transaction.amount > 10000) {
      return 'Suspicious transaction detected';
    }
    return 'Transaction is safe';
  }
}

Here, the code illustrates how each class represents a specific service within a financial application, reflecting the modular approach of a microservices architecture.

Each service focuses on a single aspect of the financial domain—account management, transaction handling, and fraud detection—ensuring the code remains organized, reusable, and scalable as each class can operate independently.

The AccountService class is responsible for managing user accounts. Within the constructor, this.accounts is initialized as an empty array to serve as temporary in-memory storage for account objects.

The createAccount method allows new accounts to be created and added to the accounts array, returning the created account for verification or further use. The getAccount method searches through this.accounts to find an account that matches a specific accountId. Together, createAccount and getAccount provide the create and read operations for managing account data. Update and delete operations are not implemented in this example.

The TransactionService class focuses on processing and recording transactions. The this.transactions array is set up within the constructor to store individual transaction records. The processTransaction method receives a transaction object, adds it to the transactions array, and returns it, simulating a simple method to store and track transactions.

The processTransaction method is the core feature of this service, and it handles transaction management independently of other services like fraud detection or account management.

The FraudDetectionService class is built to monitor transactions for potential fraud. It includes a single method, detectFraud, that evaluates a given transaction object based on a simple rule: if the transaction amount exceeds $10,000, it is considered “suspicious.” If the amount is less than or equal to $10,000, it is classified as “safe.”

While this is a basic example, it demonstrates how logic specific to fraud detection can be encapsulated within its own service, allowing for future expansion or integration with advanced fraud detection algorithms. You can also see how detectFraud works to isolate and centralize fraud detection logic, making it easy to refine this logic independently as requirements evolve.
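
One way the single threshold could be expanded is into a list of independent rules, each of which can flag a transaction. The sketch below shows that pattern. The rule thresholds and the transaction fields country, homeCountry, and recentCount are illustrative assumptions, not part of the original service.

```javascript
// Each rule returns a reason string when it fires, or null otherwise.
// Thresholds and field names are illustrative assumptions.
const rules = [
  tx => (tx.amount > 10000 ? 'amount exceeds 10000' : null),
  tx => (tx.country !== tx.homeCountry ? 'unusual country' : null),
  tx => (tx.recentCount > 5 ? 'too many recent transactions' : null)
];

// Run every rule and collect the reasons that fired.
function evaluateTransaction(tx) {
  const reasons = rules.map(rule => rule(tx)).filter(Boolean);
  return { suspicious: reasons.length > 0, reasons };
}

console.log(evaluateTransaction({
  amount: 12000, country: 'FR', homeCountry: 'FR', recentCount: 1
}));
// { suspicious: true, reasons: [ 'amount exceeds 10000' ] }
```

Keeping rules in a list means new checks can be added or tuned inside the fraud detection service without touching the account or transaction services.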

Overall, this setup illustrates how microservices can enhance modularity by separating concerns and isolating different areas of functionality. Each class has its specific responsibilities, ensuring that each service can be developed, scaled, or maintained independently without affecting the others.

This approach aligns well with a microservices architecture, as it supports scalability, code reusability, and ease of testing, allowing each service to evolve alongside the needs of the application.

Challenges and Solutions:

Challenge: Ensuring security and compliance with financial regulations.
Solution: Implementing robust encryption, secure authentication mechanisms, and regular audits.

It’s like having a secure vault and stringent checks to protect and verify financial transactions.

Lessons Learned:

Security: Advanced fraud detection algorithms improved the system's ability to identify and prevent fraudulent transactions.
Compliance: Regular updates and compliance checks ensured adherence to financial regulations.

Real-World Examples of Microservices

Microservices are widely adopted by some of the largest tech companies to scale their platforms, provide high availability, and manage complex functionalities.

Let's look at how companies like Netflix, Amazon, and Uber implement microservices. We'll look at some conceptual examples in JavaScript to help illustrate how these architectures work.

1. Netflix: Scaling Content and Recommendations

Netflix, one of the pioneers of microservices architecture, uses microservices to handle multiple facets of its service, such as managing its vast content library, personalized recommendations, and streaming capabilities.

Each microservice is responsible for a specific part of the platform, making it easier to scale and update independently.

Key Microservices at Netflix

Content Service: Manages the catalog of shows and movies.
Recommendation Service: Handles personalized recommendations based on user behavior.
Streaming Service: Ensures content is delivered seamlessly to users across the globe.

Conceptual Example: Netflix Microservice

// Content service microservice responsible for handling the content catalog
class ContentService {
  getContent(contentId) {
    return `Fetching content with ID: ${contentId}`;
  }
}

// Recommendation service microservice responsible for generating recommendations
class RecommendationService {
  generateRecommendations(userId) {
    return `Generating recommendations for user: ${userId}`;
  }
}

// Streaming service microservice responsible for streaming content
class StreamingService {
  streamContent(contentId) {
    return `Streaming content with ID: ${contentId}`;
  }
}

// NetflixService acting as an orchestrator
class NetflixService {
  constructor() {
    this.contentService = new ContentService();
    this.recommendationService = new RecommendationService();
    this.streamingService = new StreamingService();
  }

  recommend(userId) {
    return this.recommendationService.generateRecommendations(userId);
  }

  stream(contentId) {
    return this.streamingService.streamContent(contentId);
  }
}

// Example usage
const netflix = new NetflixService();
console.log(netflix.recommend(101)); // "Generating recommendations for user: 101"
console.log(netflix.stream(200)); // "Streaming content with ID: 200"

This code demonstrates a modular, microservice-oriented design in which individual services manage specific tasks (content retrieval, recommendation generation, and content streaming) while a central orchestrator, NetflixService, coordinates them to provide a cohesive service interface.

The ContentService class represents a microservice dedicated to managing the content catalog. It includes the getContent method, which takes a contentId as input and returns a message indicating that the content with that ID is being fetched.

This setup allows the ContentService to handle any actions related to retrieving or interacting with content independently, encapsulating content management functionality within its own service.

The RecommendationService class focuses on generating recommendations for users. It contains the generateRecommendations method, which receives a userId and returns a message showing that recommendations are being created for the specified user.

In this code, you can see how generateRecommendations works to simulate a recommendation service that could later integrate with recommendation algorithms to provide personalized suggestions based on the user’s profile, history, or preferences.

The StreamingService class is dedicated to streaming content to the user. Its streamContent method takes a contentId and returns a message that the specified content is being streamed.

This method showcases how streaming functionalities are encapsulated separately, allowing for the potential integration of streaming protocols or optimizations that enhance the user experience.

The NetflixService class acts as an orchestrator that ties together the individual services into a unified interface. In the constructor, instances of ContentService, RecommendationService, and StreamingService are created, enabling NetflixService to coordinate these services and manage user requests.

The recommend method uses recommendationService to generate recommendations for a specified user, while the stream method calls streamContent on the streamingService to initiate content streaming.

This code demonstrates how NetflixService functions as a single point of entry that abstracts the internal microservices from the client, allowing clients to interact with a cohesive, streamlined interface without needing to know the details of each underlying service.

This design demonstrates the principles of service orchestration in a microservices architecture. Each individual service can evolve or be replaced independently, without disrupting the entire application, while NetflixService provides a high-level API that clients can use for a smooth user experience.

This type of architecture makes the application more scalable and easier to maintain, as each service focuses on a specific domain while the orchestrator manages their interactions.
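
To illustrate how an orchestrator can keep working when one dependency is replaced or unavailable, here is a small sketch in which the orchestrator receives its services from outside and falls back to a generic result on failure. The class names and the fallback value are hypothetical. Real calls would be asynchronous network requests with timeouts.

```javascript
// Stand-in for a dependency that is currently failing.
class FailingRecommendationService {
  generateRecommendations(userId) {
    throw new Error('recommendation service unavailable');
  }
}

// The orchestrator receives its dependencies instead of creating them,
// so any implementation with the same method can be swapped in.
class Orchestrator {
  constructor(recommendationService) {
    this.recommendationService = recommendationService;
  }

  recommend(userId) {
    try {
      return this.recommendationService.generateRecommendations(userId);
    } catch (err) {
      // Degrade gracefully rather than failing the whole request.
      return ['popular title'];
    }
  }
}

const orchestrator = new Orchestrator(new FailingRecommendationService());
console.log(orchestrator.recommend(101)); // [ 'popular title' ]
```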

In Netflix's real-world architecture, each of these services is built as an independent microservice, allowing Netflix to deploy, scale, and evolve each one independently based on demand.

2. Amazon: Managing Orders and Products at Scale

Amazon's vast e-commerce platform depends heavily on microservices for handling everything from product searches to order management, customer service, and payment processing.

By breaking these responsibilities into independent services, Amazon can handle millions of orders daily and ensure a smooth customer experience.

Key Microservices at Amazon

Product Service: Manages the product catalog, including search and filtering.
Order Service: Processes and manages orders, tracking, and order history.
Customer Service: Handles customer-related inquiries and support.

Conceptual Example: Amazon Microservice

// Product service microservice responsible for product search
class ProductService {
  searchProducts(query) {
    return `Searching for products related to: ${query}`;
  }
}

// Order service microservice responsible for creating and managing orders
class OrderService {
  createOrder(order) {
    return `Placing order for items: ${JSON.stringify(order)}`;
  }
}

// AmazonService acting as an orchestrator
class AmazonService {
  constructor() {
    this.productService = new ProductService();
    this.orderService = new OrderService();
  }

  searchProducts(query) {
    return this.productService.searchProducts(query);
  }

  placeOrder(order) {
    return this.orderService.createOrder(order);
  }
}

// Example usage
const amazon = new AmazonService();
console.log(amazon.searchProducts('laptop')); // "Searching for products related to: laptop"
console.log(amazon.placeOrder([{ product: 'laptop', qty: 1 }])); // Placing order for items: [{"product":"laptop","qty":1}]

This code demonstrates an orchestrated microservices architecture: each microservice fulfills a unique purpose, such as handling product searches or managing orders, and an orchestrator service, AmazonService, coordinates them to create a cohesive interface for the client.

The ProductService class represents a microservice responsible for handling product-related operations, specifically product search. The searchProducts method takes a query parameter, simulating a product search by returning a message that specifies the search query.

This design allows ProductService to be focused on product-related functionality, making it modular and easy to maintain or extend as product search functionality grows more complex.

The OrderService class encapsulates order-related operations. It includes the createOrder method, which accepts an order parameter and returns a message that simulates placing an order.

This method takes advantage of JSON serialization to display the order details in a structured format, showing how each order can be individually managed within OrderService.

By isolating order management functions in their own service, this design makes it possible to scale and maintain order-specific logic without impacting other parts of the application.

AmazonService is an orchestrator that coordinates the operations of the ProductService and OrderService classes. In the constructor, instances of ProductService and OrderService are created and stored as properties, allowing AmazonService to call their methods and aggregate their functionalities.

The searchProducts method in AmazonService invokes searchProducts on productService, while the placeOrder method uses createOrder on orderService. This orchestrator provides a simplified interface that abstracts the complexity of the underlying microservices.

The above example shows how AmazonService streamlines client interactions by acting as a single point of access that conceals each microservice's implementation specifics.

This setup demonstrates the modularity and scalability of an orchestrated microservices architecture. Each microservice can be developed, maintained, and scaled independently, while AmazonService coordinates them into a streamlined workflow for the client.

This architecture is especially beneficial in complex applications, such as e-commerce platforms, where each service can focus on its specific domain, ensuring a robust, flexible, and manageable system.

Amazon’s services are decoupled, enabling teams to work on different features independently.

For example, updates to the product search system don’t affect order processing, which improves agility and resilience.

3. Uber: Managing Rides, Drivers, and Payments

Uber's platform heavily relies on microservices to support its real-time operations, including ride requests, driver matching, fare calculation, and payment processing.

Microservices allow Uber to efficiently scale its system across cities and countries, supporting millions of users simultaneously.

Key Microservices at Uber

Request Service: Manages ride requests from users.
Driver Service: Matches users with drivers in real-time.
Payment Service: Handles fare calculations and payment processing.

Conceptual Example: Uber Microservice

// Request service microservice responsible for creating ride requests
class RequestService {
  createRequest(userId, location) {
    return `Creating ride request for user: ${userId} at location: ${location}`;
  }
}

// Driver service microservice responsible for matching drivers to requests
class DriverService {
  matchDriver(requestId) {
    return `Matching driver for request ID: ${requestId}`;
  }
}

// Payment service microservice responsible for processing payments
class PaymentService {
  processPayment(paymentInfo) {
    return `Processing payment: ${JSON.stringify(paymentInfo)}`;
  }
}

// UberService acting as an orchestrator
class UberService {
  constructor() {
    this.requestService = new RequestService();
    this.driverService = new DriverService();
    this.paymentService = new PaymentService();
  }

  requestRide(userId, location) {
    return this.requestService.createRequest(userId, location);
  }

  matchDriver(requestId) {
    return this.driverService.matchDriver(requestId);
  }

  processPayment(paymentInfo) {
    return this.paymentService.processPayment(paymentInfo);
  }
}

// Example usage
const uber = new UberService();
console.log(uber.requestRide(301, 'Downtown')); // "Creating ride request for user: 301 at location: Downtown"
console.log(uber.matchDriver(401)); // "Matching driver for request ID: 401"
console.log(uber.processPayment({ amount: 20, method: 'Credit Card' })); // Processing payment: {"amount":20,"method":"Credit Card"}

You can see how each service in this code represents a unique step in the ride-hailing process, allowing each microservice to handle a specific operation in the flow, from creating ride requests to matching drivers and processing payments. This setup follows the microservice architecture pattern, where each service encapsulates a unique piece of business logic.

By defining these services separately, the code improves maintainability and scalability, as each service can operate independently and be scaled based on specific demands, such as more driver matches or payment processing.

The RequestService class represents a microservice dedicated to handling ride requests from users. It includes the createRequest method, which takes a userId and a location as input parameters.

This method simulates the process of creating a ride request by returning a message that contains both the user’s ID and the specified location. This service isolates the ride-request logic, allowing it to be managed independently of other processes, such as driver matching or payment processing.

The DriverService class encapsulates the logic for finding available drivers for ride requests. It includes a matchDriver method that takes a requestId as input, representing a specific ride request.

The method simulates the driver-matching process by returning a message that includes the request ID. By isolating this functionality, DriverService can be scaled or enhanced as needed without impacting other services, such as the request or payment services.

The PaymentService class is responsible for handling payment transactions. Its processPayment method takes paymentInfo as an input, which includes payment details such as the amount and payment method.

This method returns a message that simulates the payment processing operation, with JSON.stringify(paymentInfo) formatting the payment information as a JSON string for clarity. This approach isolates payment logic, which makes it easier to secure and maintain, as it operates independently from the ride request and driver services.

The UberService class serves as an orchestrator, coordinating the functionality of RequestService, DriverService, and PaymentService. In its constructor, it initializes instances of each service and assigns them to properties, allowing UberService to interact with these services easily.

The requestRide method calls createRequest on requestService to initiate a ride request, while matchDriver and processPayment invoke the respective methods on driverService and paymentService. This orchestration provides a simplified interface for clients by abstracting the implementation details of each microservice.

This example demonstrates how an orchestrated microservice architecture allows for separation of concerns, where each service manages a unique part of the business logic while the orchestrator unifies them into a cohesive API.

This design supports flexibility, scalability, and ease of maintenance, as each service can evolve independently based on business requirements. For instance, the DriverService could be enhanced with more sophisticated driver-matching algorithms without affecting other services, while the PaymentService could be scaled independently to handle high transaction volumes.

Uber’s microservices architecture allows them to handle spikes in demand (such as during rush hour or bad weather) by independently scaling their ride request service, driver matching service, and payment service as needed.

Benefits of Using Microservices in These Companies

Scalability: Each microservice can be scaled individually based on demand. For example, Netflix can scale its streaming service more aggressively than its recommendation service during peak hours.
Fault Isolation: If one microservice fails (for example, Uber’s payment service), well-isolated services such as ride requests or driver matching can keep running.
Flexibility: Microservices enable teams to work independently on different parts of the system. For example, Amazon can develop new features for its product search without touching the order or customer service modules.
Technology Diversity: Different microservices can be developed using the best technology for the job. For instance, Uber might use Node.js for their real-time driver matching service and Python for their data-heavy analytics services.
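
The fault-isolation benefit above can be sketched in a few lines of JavaScript. This is only an illustration: the three service functions are hypothetical stand-ins, and Promise.allSettled is used so that one rejected call does not prevent the others from reporting their results.

```javascript
// Hypothetical stand-ins for three independent services.
const services = {
  rideRequests: async () => 'ride requests OK',
  driverMatching: async () => 'driver matching OK',
  payments: async () => {
    throw new Error('payment service unavailable');
  },
};

// Call every service and report each outcome separately, so one
// rejected call does not hide the results of the others.
async function checkAll(serviceMap) {
  const names = Object.keys(serviceMap);
  const results = await Promise.allSettled(names.map((name) => serviceMap[name]()));
  const report = {};
  results.forEach((result, i) => {
    report[names[i]] =
      result.status === 'fulfilled' ? result.value : `FAILED: ${result.reason.message}`;
  });
  return report;
}

checkAll(services).then((report) => console.log(report));
```

In a real deployment the isolation comes from separate processes and networks rather than from a single function, but the idea is the same: a failure is recorded for one service while the others still return their results.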

Common Pitfalls and How to Avoid Them in Microservices

While microservices offer significant benefits, they also come with complexities that can lead to failure if not properly managed.

Here, we will recap some common pitfalls that organizations face when adopting microservices (building on what we covered earlier), provide examples of failed projects, and offer strategies to avoid these issues.

1. Overcomplicating the Architecture Too Early

Pitfall: One of the most common mistakes companies make when transitioning to microservices is breaking down the system into too many services prematurely. This results in an overly complex architecture that is hard to manage and maintain.

Example of Failure:

A large-scale retailer attempted to move its entire e-commerce platform from a monolithic architecture to microservices overnight.

The result was a sprawling number of poorly defined services, with no clear ownership, leading to miscommunication between teams and inconsistent data.

This severely hampered performance, leading to a complete rollback to their monolithic architecture.

How to Avoid It:

Start Small: Begin by breaking down only a few core components into microservices, such as user authentication or product search.
Gradual Decomposition: Use patterns like the Strangler Fig to incrementally refactor a monolith into microservices.
Define Service Boundaries: Make sure you understand the bounded context of each service. Don’t split services until you’re clear about their responsibilities.
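
To make the Strangler Fig idea above more concrete, here is a minimal sketch of a routing facade. The handler functions and the /auth prefix are hypothetical. The point is only that a single table decides which requests go to a new service and which still go to the monolith.

```javascript
// Hypothetical handlers: the legacy monolith and one extracted service.
const monolith = (path) => `monolith handled ${path}`;
const authService = (path) => `auth-service handled ${path}`;

// Path prefixes that have already been migrated out of the monolith.
// Adding an entry here moves one more slice of traffic to a new service.
const migratedRoutes = [{ prefix: '/auth', handler: authService }];

// The facade sits in front of both systems and decides who serves a request.
function route(path) {
  const match = migratedRoutes.find(
    ({ prefix }) => path === prefix || path.startsWith(prefix + '/')
  );
  return match ? match.handler(path) : monolith(path);
}

console.log(route('/auth/login')); // auth-service handled /auth/login
console.log(route('/orders/42')); // monolith handled /orders/42
```

Because every unmigrated path still falls through to the monolith, the migration can proceed one route at a time and be reversed by removing an entry from the table.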

2. Lack of Proper Service Ownership

Pitfall: Without clear ownership of individual microservices, it's easy for problems to arise, such as uncoordinated updates, duplicated efforts, and insufficient monitoring.

This can also cause confusion regarding which team is responsible for the health and performance of specific services.

Example of Failure:

A major online platform divided its application into hundreds of microservices but failed to assign proper ownership.

This resulted in deployment delays, as it was unclear who was responsible for maintaining and scaling each service, and some services became neglected.

Bugs were not addressed quickly, and performance issues worsened.

How to Avoid It:

Clear Ownership: Assign a specific team or individual responsible for each microservice. This team should handle the development, testing, deployment, and maintenance.
Team Autonomy: Ensure that the teams responsible for the services have the authority to make decisions about their service’s architecture, scaling, and deployment strategy.
Service Registries: Maintain a registry or catalog of services, including their owners, so there is clear visibility across the organization.

3. Poorly Managed Inter-Service Communication

Pitfall: Microservices rely heavily on communication over the network, making them vulnerable to issues like high latency, network failures, and over-complicated APIs.

Without proper design, inter-service communication can lead to bottlenecks and increase the risk of cascading failures.

Example of Failure:

A financial services company implemented microservices but failed to plan for efficient inter-service communication.

They used synchronous API calls (REST) extensively, and as the number of services grew, response times degraded significantly.

In addition, when one critical service went down, it caused a cascading failure across the entire system.

How to Avoid It:

Use Asynchronous Communication: Wherever possible, use asynchronous messaging (for example, using message queues like Kafka or RabbitMQ) to avoid tight coupling between services.
Implement Circuit Breakers: Use the circuit breaker pattern to prevent cascading failures. When calls to a service keep failing, the breaker trips and callers fail fast instead of waiting on the unhealthy service, which lets the rest of the system keep operating.
Retry Logic and Timeouts: Include retry mechanisms and appropriate timeouts in inter-service communication to handle transient failures.
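
The circuit breaker pattern mentioned above can be sketched as a small wrapper class. This is a simplified illustration, not a production library: the class name, option names, and the always-failing downstream call are assumptions made for the example.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures it
// "opens" and rejects calls immediately until `resetTimeoutMs` has passed.
class CircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, which makes the breaker easy to test
    this.failures = 0;
    this.openedAt = null;
  }

  get isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs;
  }

  async call(...args) {
    if (this.isOpen) {
      throw new Error('Circuit open: failing fast');
    }
    try {
      const result = await this.action(...args);
      this.failures = 0; // a success closes the circuit again
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

// Hypothetical downstream call that always fails.
const flaky = async () => {
  throw new Error('inventory service timed out');
};
const breaker = new CircuitBreaker(flaky, { failureThreshold: 2 });

(async () => {
  for (let i = 1; i <= 3; i++) {
    try {
      await breaker.call();
    } catch (err) {
      console.log(`Attempt ${i}: ${err.message}`);
    }
  }
})();
```

The first two attempts reach the failing service. The third is rejected by the breaker itself, so the caller gets an immediate error instead of waiting on a service that is known to be unhealthy.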

4. Ignoring Data Consistency and Transactions

Pitfall: In a monolithic architecture, transactions are often straightforward. In microservices, maintaining consistency across distributed services can be difficult, especially when transactions span multiple services.

Ignoring this complexity can lead to data inconsistencies, such as duplicated or missing records.

Example of Failure:

A payments platform that adopted microservices faced issues where transactions between its order management and payment services would fail midway.

For instance, payments were processed, but the order was not placed due to a network failure.

This inconsistency damaged customer trust and led to costly chargebacks.

How to Avoid It:

Use Sagas: Implement the Saga pattern for long-running transactions across multiple services. Each service commits its own local transaction, and if a later step fails, the earlier steps are undone.
Eventual Consistency: Accept that not all data will be consistent in real-time. Use event-driven approaches to ensure that services eventually synchronize their data, which is suitable for many business cases.
Compensating Transactions: In the event of failure, ensure that services can undo any changes already committed in a transaction through compensating transactions.
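
The Saga and compensating-transaction ideas above can be sketched as a small runner. This is a minimal in-process illustration under stated assumptions: the runSaga helper and the step names are hypothetical, and a real saga would coordinate separate services over a message broker or an orchestrator.

```javascript
// Run saga steps in order. If one fails, run the compensations of the
// steps that already succeeded, in reverse order.
async function runSaga(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action(context);
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        await done.compensate(context);
      }
      return { ok: false, failedStep: step.name, error: err.message };
    }
  }
  return { ok: true };
}

// Hypothetical order flow: the payment succeeds, then order creation fails.
const log = [];
const steps = [
  {
    name: 'chargePayment',
    action: async () => log.push('payment charged'),
    compensate: async () => log.push('payment refunded'),
  },
  {
    name: 'createOrder',
    action: async () => {
      throw new Error('network failure');
    },
    compensate: async () => log.push('order cancelled'),
  },
];

runSaga(steps, {}).then((result) => {
  console.log(result); // { ok: false, failedStep: 'createOrder', error: 'network failure' }
  console.log(log); // [ 'payment charged', 'payment refunded' ]
});
```

This mirrors the failure described earlier: the payment went through but the order did not, so the saga refunds the payment rather than leaving the two services out of sync.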

5. Lack of Monitoring, Logging, and Observability

Pitfall: With multiple services running independently, it becomes difficult to track the overall health of the system if there is no central monitoring or logging.

A lack of observability makes it nearly impossible to diagnose issues, detect bottlenecks, or trace failures in production.

Example of Failure:

An e-commerce platform switched to microservices but lacked a unified logging and monitoring strategy.

When performance issues arose during a major sales event, they couldn’t pinpoint the failing services in time, leading to downtime and lost revenue.

How to Avoid It:

Centralized Logging: Use tools like the ELK stack (Elasticsearch, Logstash, and Kibana) or Fluentd to collect and centralize logs across all services.
Distributed Tracing: Implement distributed tracing tools like Jaeger or Zipkin to trace requests across services, helping to quickly identify bottlenecks.
Monitoring Tools: Use monitoring and alerting systems such as Prometheus and Grafana to get real-time insights into service health and performance.
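
Centralized logging and tracing work best when every service emits structured entries that share a request identifier. The sketch below shows one way to do that with plain Node.js. The createLogger helper, the field names, and the service names are assumptions for illustration, not part of any of the tools listed above.

```javascript
const { randomUUID } = require('crypto');

// Build a logger that emits one JSON object per line. Every entry carries
// the service name and a correlation ID shared by all services that touch
// the same request, so a log aggregator can stitch the entries together.
function createLogger(service, correlationId, write = console.log) {
  return (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      correlationId,
      message,
      ...fields,
    };
    write(JSON.stringify(entry));
    return entry;
  };
}

// Hypothetical request flowing through two services.
const correlationId = randomUUID();
const orderLog = createLogger('order-service', correlationId);
const paymentLog = createLogger('payment-service', correlationId);

orderLog('info', 'order received', { orderId: 42 });
paymentLog('error', 'card declined', { orderId: 42 });
```

Searching the aggregated logs for a single correlationId then returns the full path of one request across services, which is the kind of visibility the failed sales-event example was missing.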

6. Security Vulnerabilities in Microservices

Pitfall: The decentralized nature of microservices introduces new security challenges, including securing API endpoints, managing inter-service communication, and preventing unauthorized access to sensitive data.

Example of Failure:

A ride-sharing company built a microservices architecture but failed to secure inter-service communication properly.

An attacker was able to exploit an insecure API to access customer data, resulting in a major data breach and damage to the company's reputation.

How to Avoid It:

Secure APIs: Use secure tokens (for example, OAuth 2.0 or JWT) for authenticating and authorizing API requests.
Mutual TLS (mTLS): Implement mTLS so that all communication between services is encrypted and both sides of each connection authenticate each other with certificates.
Network Security: Use virtual private clouds (VPCs), firewalls, and secure access controls to limit who and what can access your services.
Regular Audits: Ensure compliance with data protection regulations such as GDPR or HIPAA through regular security audits and testing.
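
To show the idea behind signed tokens, here is a simplified sketch that signs and verifies a payload with an HMAC using Node's built-in crypto module. It is not a real JWT or OAuth 2.0 implementation: the token format, the SECRET value, and the function names are assumptions for illustration, and production systems should rely on a vetted library.

```javascript
const crypto = require('crypto');

// Hypothetical shared secret; in practice, load it from a secrets manager.
const SECRET = 'replace-with-a-long-random-secret';

const sign = (data) => crypto.createHmac('sha256', SECRET).update(data).digest('base64url');

// Issue a token of the form <payload>.<signature>.
function issueToken(claims) {
  const payload = Buffer.from(JSON.stringify(claims)).toString('base64url');
  return `${payload}.${sign(payload)}`;
}

// Return the claims if the signature is valid and the token has not
// expired, otherwise return null.
function verifyToken(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  const [payload, signature] = String(token).split('.');
  if (!payload || !signature) return null;
  const expected = Buffer.from(sign(payload));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length || !crypto.timingSafeEqual(expected, received)) {
    return null;
  }
  const claims = JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
  return claims.exp && claims.exp < nowSeconds ? null : claims;
}

const token = issueToken({ sub: 'driver-service', exp: Math.floor(Date.now() / 1000) + 60 });
console.log(verifyToken(token)); // the original claims
console.log(verifyToken(token + 'x')); // null, because the signature no longer matches
```

A service that receives such a token can check who issued the request without calling back to the issuer, as long as the secret stays private.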

Strategies to Address and Avoid Common Issues

Adopt an Incremental Approach: Move to microservices gradually, rather than in one big shift. Start with non-critical services and build expertise.
Service Contracts and APIs: Ensure that your APIs and contracts between services are well-documented and stable. Changes should be versioned to avoid breaking dependencies.
Use Proper Orchestration Tools: Utilize container orchestration tools like Kubernetes to manage the deployment, scaling, and operation of services. Service meshes like Istio can handle networking complexities.
Emphasize DevOps and CI/CD: Implement CI/CD pipelines with automated testing and monitoring. Microservices should be easy to deploy frequently and with minimal risk.
Strong Team Collaboration: Foster a culture of collaboration between development and operations teams. Break down silos and ensure everyone understands how services interact.

Microservices architecture, as demonstrated by companies like Netflix, Amazon, and Uber, showcases the immense potential for scalability, flexibility, and innovation.

Each of these organizations effectively leveraged microservices to enhance their core operations—whether it's delivering content, managing vast product catalogs, or facilitating ride-sharing.

These examples highlight how breaking down applications into independent services empowers teams to deploy faster, scale efficiently, and innovate rapidly.

But the journey to a successful microservices architecture is not without its challenges.

Common pitfalls, such as overcomplicating the architecture, poor service ownership, and unreliable inter-service communication, can derail even the most well-intentioned projects.

To avoid these issues, it’s essential to start small, establish clear service boundaries, adopt asynchronous communication, and implement robust monitoring and security measures.

By learning from real-world successes and failures, and implementing strategies to mitigate common risks, organizations can fully unlock the potential of microservices while maintaining operational stability, security, and performance.

Proper planning, gradual adoption, and continuous monitoring are key to building a resilient and scalable microservices-based system.

Future Trends and Innovations

In this section, we will discuss some cutting-edge developments and emerging trends that are shaping the future of microservices architecture. We will examine the impact of new technologies and methodologies, such as serverless computing, micro frontends, and the use of AI-driven automation in service orchestration and management.

We’ll also look at the evolving role of DevOps and continuous integration/continuous delivery (CI/CD) pipelines in enhancing microservices deployment and maintenance.

Then we’ll discuss advancements in service mesh technologies, the increasing importance of observability and monitoring tools, and the rise of event-driven architecture as a complement to traditional request-response communication in microservices.

By the end of this section, you’ll gain insights into how these innovations are pushing microservices architecture forward, helping organizations further streamline, scale, and optimize their applications.

This forward-looking view will equip you with knowledge on potential tools and strategies that can keep your applications competitive and adaptable in a rapidly changing technological landscape.

Serverless Architecture

Serverless architecture allows you to build and run applications without managing servers.

Functions are executed in response to events, and resources are automatically scaled based on demand.

Imagine a coffee shop where you order coffee through an app. The coffee shop only needs to prepare coffee when an order is placed, and you don’t need to worry about the kitchen staff or equipment.

AWS Lambda Function:

// Example of an AWS Lambda function
exports.handler = async (event) => {
  console.log('Event received:', event);
  // Process the event and return a response
  return {
    statusCode: 200,
    body: JSON.stringify({ message: 'Hello from Lambda!' }),
  };
};

This code shows how an AWS Lambda function is defined to handle and process events. AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers.

In this code example, the function is set up to run in response to an event—whether that’s an HTTP request, an update in a data source, or any other event that can trigger a Lambda function.

The function's entry point is the exports.handler, which is structured as an asynchronous function with an event parameter. This event parameter contains the data relevant to the trigger, like request details if invoked through API Gateway or object information if triggered by S3.

The console.log('Event received:', event); line logs the event data to AWS CloudWatch, which is useful for debugging and tracking the input data Lambda received. This log output helps monitor and troubleshoot the function's operation and behavior by examining the event data and ensuring it is processed as expected.

Following the logging statement, the code returns a response object. Here, it returns an object with statusCode set to 200, indicating a successful request, and a body field containing a JSON stringified message. This JSON message ({ message: 'Hello from Lambda!' }) is typical for RESTful APIs and provides a response payload that a client can interpret.

The statusCode and body fields are crucial when the Lambda function is integrated with API Gateway, as they enable Lambda to respond to HTTP requests in a format that is directly consumable by web clients or applications.

This example shows how Lambda functions can perform a wide range of tasks triggered by various events, making them suitable for microservices and scalable cloud applications where functions execute code only when invoked, minimizing costs and resource usage.

Declaring the handler as async means it returns a promise and can use await for network calls or data fetching without blocking the event loop, which suits serverless environments where efficiency and quick execution are prioritized.
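
As a slightly fuller sketch of the same idea, the handler below reads a JSON body in the shape that an API Gateway proxy integration passes in event.body, and it can be invoked locally with a hand-built event. The name field and the greeting are hypothetical and only serve the example.

```javascript
// Handler that reads a JSON body from an API Gateway proxy-style event.
// The `name` field in the request body is a hypothetical input.
const handler = async (event) => {
  let payload;
  try {
    payload = JSON.parse(event.body || '{}');
  } catch (err) {
    return { statusCode: 400, body: JSON.stringify({ error: 'Invalid JSON body' }) };
  }
  const name = (payload && payload.name) || 'world';
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: `Hello, ${name}!` }),
  };
};

exports.handler = handler;

// Local invocation with a hand-built event, no AWS account required.
handler({ body: JSON.stringify({ name: 'Ada' }) }).then((res) => console.log(res));
```

Because the handler is just an async function, it can be exercised in unit tests this way before it is ever deployed.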

Benefits and Challenges:

Benefits: Reduced infrastructure management, automatic scaling, and pay-per-use pricing.
Challenges: Cold start latency, limited execution time, and complexity in debugging and monitoring.

It’s like ordering takeout from a restaurant—convenient and flexible, but you rely on the restaurant’s setup and might have to wait if they’re busy.

Future Directions:

Improved Cold Start Times: Techniques to reduce latency for serverless functions.
Enhanced Monitoring and Debugging: Better tools for tracking and debugging serverless applications.

Service Meshes

A service mesh is an infrastructure layer that provides features like service-to-service communication, load balancing, and security for microservices.

Think of a service mesh as a network of interconnected communication channels within a company, ensuring secure and efficient data flow between departments.

Conceptual with Istio:

# Example of an Istio VirtualService configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-virtualservice
spec:
  hosts:
    - example-service
  http:
    - route:
        - destination:
            host: example-service
            port:
              number: 80

In this code, you can see how Istio’s VirtualService configuration is used to define the routing of HTTP traffic within a microservices architecture. Istio is a popular service mesh that helps manage microservices traffic, security, and observability in a Kubernetes environment.

A VirtualService is one of Istio’s core traffic-management resources and is used to control how traffic is directed to specific services within the mesh.

The configuration starts with the apiVersion and kind fields, which specify that this is an Istio VirtualService resource and the API version used to define it. The metadata section gives the virtual service a name, example-virtualservice, which can be used to reference it within the Istio mesh.

The spec section defines the main functionality of the VirtualService. The hosts field lists the hosts that clients address when they send requests, in this case a service called example-service.

The routing rules defined within this VirtualService apply to traffic sent to that host.

In the http section, we define how HTTP traffic should be routed. The route field specifies that requests to the example-service should be forwarded to the host example-service on port 80.

This is a basic routing rule where all incoming HTTP traffic that matches the example-service will be directed to the service on port 80. More complex routing rules could be added here, such as load balancing between multiple instances of a service, routing based on request headers, or applying retries and timeouts.

This example is a simple yet powerful demonstration of Istio’s traffic management capabilities. Istio enables fine-grained control over how microservices communicate with each other, making it possible to implement advanced traffic routing strategies such as A/B testing, blue-green deployments, and canary releases.
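
Canary releases and A/B tests rest on splitting traffic by weight. The JavaScript sketch below simulates that split so you can see its effect. It is not how Istio is implemented or configured (in Istio the weights are declared on the route destinations), and the service names and the 90/10 split are hypothetical.

```javascript
// Hypothetical route table: 90% of traffic to v1, 10% to the canary v2.
const routes = [
  { destination: 'example-service-v1', weight: 90 },
  { destination: 'example-service-v2', weight: 10 },
];

// Pick a destination in proportion to its weight. `random` is injectable
// so that the choice can be made deterministic in tests.
function pickDestination(routeTable, random = Math.random) {
  const total = routeTable.reduce((sum, r) => sum + r.weight, 0);
  let threshold = random() * total;
  for (const r of routeTable) {
    threshold -= r.weight;
    if (threshold < 0) return r.destination;
  }
  return routeTable[routeTable.length - 1].destination;
}

// Simulate 10,000 requests and count where they land.
const counts = {};
for (let i = 0; i < 10000; i++) {
  const dest = pickDestination(routes);
  counts[dest] = (counts[dest] || 0) + 1;
}
console.log(counts); // roughly 9,000 for v1 and 1,000 for v2
```

Shifting the weights gradually, for example from 90/10 to 50/50 to 0/100, is what turns this split into a canary rollout.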

Benefits and Challenges:

Benefits: Simplified communication management, security, and observability.
Challenges: Additional complexity in setup and management.

It’s like using a company-wide intranet to manage internal communication, which adds layers of control but requires proper setup.

Future Directions:

Better Integration with CI/CD: Improved integration of service meshes with continuous integration and deployment pipelines.
Advanced Security Features: Enhanced mechanisms for securing service-to-service communication.

Artificial Intelligence and Machine Learning Integration

This trend involves incorporating AI and machine learning into microservices to enable predictive analytics, automation, and intelligent decision-making.

It’s like adding a personal assistant to your team that can analyze data and provide recommendations or automate repetitive tasks.

Using TensorFlow.js:

const tf = require('@tensorflow/tfjs');

// Define a simple model
const model = tf.sequential();
model.add(tf.layers.dense({ units: 1, inputShape: [1] }));

model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });

// Training data
// Four samples with one feature each (shape [4, 1]), matching inputShape: [1]
const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);

// Train the model
model.fit(xs, ys, { epochs: 10 }).then(() => {
  // Predict the output for a new input, passed as a [1, 1] tensor
  model.predict(tf.tensor2d([5], [1, 1])).print();
});

The above example demonstrates how TensorFlow.js is used to define and train a simple machine learning model in JavaScript. TensorFlow.js is a popular library that allows you to train and deploy machine learning models directly in the browser or in Node.js environments.

This example demonstrates how to create a model, train it with some data, and make predictions using that model.

The first line imports the TensorFlow.js library (const tf = require('@tensorflow/tfjs');), making its functionality available for use in this script. TensorFlow.js provides a rich set of APIs for building, training, and evaluating machine learning models.

The code then proceeds to define a simple machine learning model using the tf.sequential() function, which creates a linear stack of layers. This is a simple model composed of a single layer: a dense layer (tf.layers.dense). The dense layer has 1 unit and expects an input shape of 1, meaning it will take in a single numeric input per training sample.

Once the model structure is defined, it is compiled with the model.compile() method. This step sets up the model for training by specifying the optimizer and loss function. The optimizer: 'sgd' indicates that stochastic gradient descent (SGD) will be used to update the model's weights during training.

The loss: 'meanSquaredError' specifies that the model will minimize the mean squared error (MSE) during training, which is commonly used for regression tasks (where the goal is to predict continuous values).

Next, the training data is defined. The input data (xs) is a 2-dimensional tensor of shape [4, 1] holding the values [1, 2, 3, 4], one row per sample, and the target output data (ys) is another tensor of the same shape with the corresponding values [1, 3, 5, 7]. A dense layer expects a batch of samples, which is why the data is shaped as rows of one feature rather than as a flat 1-dimensional tensor. This dataset follows a simple linear relationship: y = 2x - 1.

The model is trained using the model.fit() function. This method takes in the training data (xs, ys) and the number of epochs to train for, where one epoch is a full pass over the training data. In this case, the model is trained for 10 epochs. During each epoch, the model updates its internal weights to reduce the loss function (mean squared error). After training, the model is capable of making predictions, although with only 10 epochs the result will generally be a rough approximation. Training for more epochs brings it closer to the true relationship.

Finally, after the model is trained, the model.predict() function is called with new input data (tf.tensor2d([5], [1, 1])). This predicts the output for an unseen input (in this case, x = 5), for which the underlying relationship gives 9. The print() method is used to display the predicted result.

Through this code, you can see how TensorFlow.js provides an easy and flexible way to create, train, and use machine learning models in JavaScript.

The model here performs a simple linear regression, but TensorFlow.js can be used to tackle much more complex tasks, including deep learning and neural networks, in both the browser and server-side environments.
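
To see what the optimizer and loss function are doing, the same line can be fitted in plain JavaScript without any library. This is a minimal sketch of full-batch gradient descent on mean squared error, not TensorFlow.js's internal implementation. The fitLine name, the learning rate, and the epoch count are choices made for this example.

```javascript
// Fit y = w * x + b by full-batch gradient descent on mean squared error.
function fitLine(xs, ys, { learningRate = 0.05, epochs = 5000 } = {}) {
  let w = 0;
  let b = 0;
  const n = xs.length;
  for (let epoch = 0; epoch < epochs; epoch++) {
    let gradW = 0;
    let gradB = 0;
    for (let i = 0; i < n; i++) {
      const error = w * xs[i] + b - ys[i];
      gradW += (2 / n) * error * xs[i]; // d(MSE)/dw
      gradB += (2 / n) * error; // d(MSE)/db
    }
    w -= learningRate * gradW;
    b -= learningRate * gradB;
  }
  return { w, b };
}

const { w, b } = fitLine([1, 2, 3, 4], [1, 3, 5, 7]);
console.log(w.toFixed(3), b.toFixed(3)); // approaches 2 and -1
console.log((w * 5 + b).toFixed(3)); // approaches 9
```

Each pass nudges w and b in the direction that lowers the average squared error, which is the same principle the 'sgd' optimizer applies to the dense layer's weight and bias.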

Benefits and Challenges:

Benefits: Enhanced capabilities such as predictive analytics, automation, and personalized user experiences.
Challenges: Complexity in integrating AI/ML models, and the need for large datasets and computational resources.

It’s like hiring a data scientist who can provide insights and automate processes but requires careful integration and resources.

Future Directions:

Increased Use of AutoML: Simplified processes for training and deploying machine learning models.
More Advanced AI Models: Incorporation of more sophisticated models and techniques for various use cases.

Edge Computing

Edge computing involves processing data closer to the data source (for example, IoT devices) rather than relying solely on centralized cloud servers.

It’s like having a local technician who can handle immediate issues on-site rather than sending everything to a central repair facility.

Benefits and Challenges:

Benefits: Reduced latency, improved performance, and decreased bandwidth usage.
Challenges: Complexity in managing distributed edge devices and ensuring data consistency.

It’s like managing multiple local warehouses to reduce shipping times, but requiring coordination and consistency.

Future Directions:

More Advanced Edge Devices: Development of more powerful and intelligent edge devices.
Improved Data Management: Enhanced tools for managing and syncing data across edge and central systems.

Enhanced Security Practices

Implementation of advanced security practices such as zero-trust models, encryption, and secure APIs to protect microservices.

It’s like having a comprehensive security system with surveillance, access control, and encryption to protect your premises and data.

Using Crypto for Encryption:

const crypto = require('crypto');

// Derive a 32-byte key from a passphrase (demo values only, never hard-code secrets)
const key = crypto.scryptSync('password', 'salt', 32);

// Encrypt data
function encrypt(text) {
  const iv = crypto.randomBytes(16); // a fresh random IV for every message
  const cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
  let encrypted = cipher.update(text, 'utf8', 'hex');
  encrypted += cipher.final('hex');
  return iv.toString('hex') + ':' + encrypted; // keep the IV with the ciphertext
}

// Decrypt data
function decrypt(payload) {
  const [ivHex, encrypted] = payload.split(':');
  const decipher = crypto.createDecipheriv('aes-256-cbc', key, Buffer.from(ivHex, 'hex'));
  let decrypted = decipher.update(encrypted, 'hex', 'utf8');
  decrypted += decipher.final('utf8');
  return decrypted;
}

const text = 'Hello World';
const encryptedText = encrypt(text);
const decryptedText = decrypt(encryptedText);

console.log('Encrypted:', encryptedText);
console.log('Decrypted:', decryptedText);

This code shows how encryption and decryption are implemented in Node.js using the crypto module, which provides a variety of cryptographic functionality, including hashing, signing, and encryption.

The encryption used here follows the AES-256-CBC algorithm, which is a widely used symmetric encryption algorithm. This means that the same key is used for both encryption and decryption.

The older crypto.createCipher() and crypto.createDecipher() helpers, which accepted only a password, are deprecated and have been removed from recent Node.js releases. The example therefore uses crypto.createCipheriv() and crypto.createDecipheriv(), which take an explicit key and initialization vector (IV). The 32-byte key that AES-256 requires is derived from the passphrase 'password' with crypto.scryptSync().

The encrypt() function demonstrates the process of encrypting a plain text message. It first generates a random 16-byte IV with crypto.randomBytes(), then creates a cipher instance using crypto.createCipheriv(), specifying 'aes-256-cbc' as the encryption algorithm along with the key and the IV. The createCipheriv() method returns a cipher object that is used to process the text.

The encryption process is done in two stages. First, the cipher.update() method is used to encrypt the input text, in this case 'Hello World'. The method takes three arguments: the input text, the encoding of the input text (here it's 'utf8'), and the encoding of the output (here it's 'hex').

This means the encrypted text will be output in hexadecimal format. The second part, cipher.final('hex'), ensures the final padding and encryption are properly applied, returning the rest of the encrypted text. The function returns the IV and the encrypted text joined by a colon, because the same IV is needed again for decryption.

The decrypt() function works similarly but in reverse. It splits the payload back into the IV and the encrypted text, then creates a decipher instance using crypto.createDecipheriv(), again specifying 'aes-256-cbc' as the algorithm together with the same key and the recovered IV.

The decipher.update() method is used to decrypt the data, converting it back from hexadecimal format to UTF-8. As with the encryption function, decipher.final('utf8') ensures the complete decryption of the data, returning the decrypted string.

In the example, the text 'Hello World' is first encrypted and then immediately decrypted. The output demonstrates how the original text is converted into an encrypted format and then restored back to its original form.

The use of 'password' as a static passphrase, together with a fixed salt, is not secure for real-world applications, but it serves to illustrate the basic encryption and decryption process.

This example also highlights the importance of using strong, unique keys for cryptographic operations in practice, as well as ensuring that encrypted data is safely stored and transmitted.

The crypto module, which is built into Node.js, makes it easy to implement secure encryption and decryption in any application requiring data protection.

Benefits and Challenges:

Benefits: Enhanced protection against data breaches and cyber-attacks.
Challenges: Increased complexity in implementation and management.

It’s like upgrading from a basic lock to a high-security system with multiple layers of protection.

Future Directions:

Zero Trust Architectures: Increased adoption of zero trust models where verification is required for every request.
Advanced Encryption Techniques: Continued development of more secure and efficient encryption methods.

Multi-Cloud and Hybrid Cloud Strategies

Using multiple cloud providers (multi-cloud) or combining on-premises infrastructure with cloud services (hybrid cloud) to improve flexibility and avoid vendor lock-in.

It’s like having accounts with multiple banks to take advantage of different services and avoid reliance on a single provider.

Conceptual with Multiple Cloud Providers:

// Example of interacting with multiple cloud providers
const AWS = require('aws-sdk');
const azure = require('azure-storage');

// AWS S3 interaction
const s3 = new AWS.S3();
s3.listBuckets((err, data) => {
  if (err) console.log(err, err.stack);
  else console.log('S3 Buckets:', data.Buckets);
});

// Azure Blob Storage interaction
// createBlobService() reads credentials from the environment (for example AZURE_STORAGE_CONNECTION_STRING)
const blobService = azure.createBlobService();
// Passing null as the continuation token requests the first page of results
blobService.listContainersSegmented(null, (err, result) => {
  if (err) console.log(err);
  else console.log('Azure Containers:', result.entries);
});

This code shows how you can interact with two distinct cloud providers, AWS and Azure, and specifically their storage services. It uses the AWS S3 and Azure Blob Storage APIs to list buckets and containers, respectively.

The first part of the code shows how to interact with AWS S3. It imports the aws-sdk package, which is a Node.js SDK that allows applications to interact with AWS services.

A new instance of the S3 service is created using new AWS.S3(). The listBuckets() method is then called on the S3 instance to retrieve a list of all buckets within the configured AWS account.

This method is asynchronous, so it takes a callback function as an argument. If the operation is successful, the callback logs the list of buckets to the console. If there's an error, the error message is printed instead.

This demonstrates a basic interaction with AWS's S3 service, where you can programmatically access and manage your storage containers (called "buckets").

Next, the code switches to Azure Blob Storage. It uses the azure-storage package, Microsoft's original Node.js SDK for Azure's storage services (newer projects typically use @azure/storage-blob instead). The createBlobService() method is used to create a blob service client that interacts with Azure Blob Storage.

The listContainersSegmented() method is called on the blob service client to list the containers in the account. Results are returned one page at a time, and passing null as the continuation token requests the first page. As with AWS, this method is asynchronous, and the result is provided via a callback. If successful, the list of containers (stored in the entries property) is logged to the console.

This code shows how developers can integrate with multiple cloud platforms to manage cloud storage resources, using the APIs provided by each service. The primary takeaway is that both AWS and Azure provide SDKs for interacting with their services, making it easy to automate and manage cloud resources programmatically.

These APIs allow you to perform basic tasks such as listing storage containers, which is a common requirement when working with cloud storage solutions. Each SDK is specific to its provider, so applications that need to stay cloud-agnostic usually wrap these calls behind a common interface of their own while still using each platform's storage offerings.

Benefits and Challenges:

Benefits: Greater flexibility, reduced risk of vendor lock-in, and optimization of services across providers.
Challenges: Increased complexity in managing and integrating services across different environments.

It’s like using different suppliers for various needs to get the best deals but requiring careful coordination and management.

Future Directions:

Improved Integration Tools: Development of better tools and platforms for managing multi-cloud and hybrid cloud environments.
Advanced Orchestration: Enhanced orchestration and management capabilities across diverse cloud environments.

Conclusion

The rapid evolution of technology has significantly transformed how applications are built and managed, and microservices have become a central component of this transformation.

Let’s go over the key points we’ve discussed throughout this book. I’ll reinforce the importance of microservices, and provide guidance on how to leverage these insights for future development.

Microservices Architecture

Microservices involve breaking down applications into smaller, independent services that communicate over well-defined APIs.

This contrasts with monolithic architectures, where all components are interwoven into a single, cohesive application.

Key characteristics include independent deployment, decentralized data management, and resilience through the isolation of services.

Core Concepts and Components

Service Discovery: Mechanisms for locating and interacting with microservices.
API Gateways: Centralized entry points that manage traffic, enforce security, and handle requests.
Data Management: Strategies for managing data consistency and storage across distributed services.
Security: Implementing authentication, authorization, and encryption to protect services.
Monitoring and Logging: Tools and practices for tracking performance and diagnosing issues.

Building Microservices

Design Principles: Focus on domain-driven design, scalability, and fault tolerance.
Development Practices: Best practices include using lightweight communication protocols, managing service dependencies carefully, and employing CI/CD pipelines for automation.
Testing Strategies: Testing microservices involves unit tests, integration tests, and end-to-end tests to ensure robustness and reliability.

Managing Microservices in the Cloud

Deployment: Techniques for deploying microservices, including containerization with Docker and orchestration with Kubernetes.
Service Meshes: Infrastructure layers that manage service communication, security, and observability.
Configuration Management: Tools and practices for managing and updating configurations across services.

Future Trends and Innovations

Serverless Architectures: Enabling scalable and cost-efficient computing by removing server management responsibilities.
Service Meshes: Enhancing communication and security between microservices.
AI and Machine Learning Integration: Leveraging advanced analytics and automation within microservices.
Edge Computing: Bringing processing closer to data sources to reduce latency and improve performance.
Enhanced Security Practices: Adopting advanced security models and encryption techniques.
Multi-Cloud and Hybrid Cloud Strategies: Using multiple cloud providers and combining cloud and on-premises infrastructure for flexibility and resilience.

The Importance of Microservices

Microservices offer numerous advantages that align with the demands of modern software development:

Scalability: Microservices enable horizontal scaling by allowing individual services to scale independently based on demand. This ensures optimal performance and resource utilization.

Like expanding a retail store by adding more registers during peak hours without having to rebuild the entire store.

Flexibility: Developers can choose different technologies, frameworks, and languages for different services, enhancing overall flexibility and innovation.

Like having different specialists working on various parts of a project, each using the best tools for their specific tasks.

Resilience: By isolating services, failures in one part of the system do not necessarily impact others, improving overall system reliability.

Like having a modular power grid where the failure of one line does not disrupt the entire grid.

Faster Time-to-Market: Microservices facilitate continuous integration and continuous delivery (CI/CD) practices, enabling faster development and deployment cycles.

Like producing different components of a product simultaneously rather than waiting to assemble everything at once.

Looking Ahead

As technology continues to evolve, so will the practices and tools related to microservices. Here’s how you can prepare for the future:

Stay Informed: Keep up with industry trends, new tools, and best practices through continuous learning and professional development.

Recommendation: Follow industry blogs, attend conferences, and participate in relevant workshops.

Experiment with Emerging Technologies: Integrate new trends and innovations such as serverless computing, AI, and edge computing into your microservices architecture to stay ahead of the curve.

Recommendation: Start with small projects or pilot programs to evaluate the benefits and challenges of new technologies.

Adopt Agile Practices: Embrace agile methodologies to enhance collaboration, flexibility, and iterative development, which align well with the principles of microservices.

Recommendation: Implement agile frameworks such as Scrum or Kanban to improve project management and delivery.

Focus on Security: Prioritize security in your microservices architecture to protect against evolving threats and ensure data integrity.

Recommendation: Regularly review and update security practices, and invest in tools and training for secure coding and compliance.

Optimize for Performance: Continuously monitor and optimize the performance of your microservices to ensure they meet user expectations and handle growing demands efficiently.

Recommendation: Use performance monitoring tools and conduct regular performance reviews to identify and address bottlenecks.

Final Thoughts

Microservices represent a powerful paradigm shift in software architecture, offering significant benefits in terms of scalability, flexibility, and resilience.

However, they also come with challenges that require thoughtful planning and management.

By understanding the core concepts, embracing best practices, and staying abreast of emerging trends, you can effectively leverage microservices to build robust, scalable, and innovative applications.

The journey of adopting and mastering microservices is ongoing. As technology advances, so will the methodologies and tools that support microservices.

Embrace this journey with curiosity and adaptability, and you’ll be well-positioned to harness the full potential of microservices for your projects and organizations.

Learn Linux for Beginners: From Basics to Advanced Techniques [Full Book]

Zaira Hira — Fri, 12 Jul 2024 13:18:40 +0000

Learning Linux is one of the most valuable skills in the tech industry. It can help you get things done faster and more efficiently. Many of the world's powerful servers and supercomputers run on Linux.

While empowering you in your current role, learning Linux can also help you transition into other tech careers like DevOps, Cybersecurity, and Cloud Computing.

In this handbook, you'll learn the basics of the Linux command line, and then transition to more advanced topics like shell scripting and system administration. Whether you are new to Linux or have been using it for years, this book has something for you.

Important Note: All examples in this book are demonstrated in Ubuntu 22.04.2 LTS (Jammy Jellyfish). Most command line tools are more or less the same in other distributions. However, some GUI applications and commands may differ if you are working on another Linux distribution.

Part 1: Introduction to Linux
- 1.1. Getting Started with Linux
Part 2: Introduction to Bash Shell and System Commands
Part 3: Understanding Your Linux System
- 3.1. Discovering Your OS and Specs
Part 4: Managing Files From the Command line
Part 5: The Essentials of Text Editing in Linux
- 5.1. Mastering Vim: The Complete Guide
- 5.2. Mastering Nano
Part 6: Bash Scripting
Part 7: Managing Software Packages in Linux
Part 8: Advanced Linux Topics
Conclusion

Part 1: Introduction to Linux

1.1. Getting Started with Linux

What is Linux?

Linux is an open-source operating system modeled on the Unix operating system. It was created by Linus Torvalds in 1991.

Open source means that the source code of the operating system is available to the public. This allows anyone to modify the original code, customise it, and distribute the new operating system to potential users.

Why should you learn about Linux?

In today's data center landscape, Linux and Microsoft Windows stand out as the primary contenders, with Linux having a major share.

Here are several compelling reasons to learn Linux:

Given the prevalence of Linux hosting, there is a high chance that your application will be hosted on Linux. So learning Linux as a developer becomes increasingly valuable.
With cloud computing becoming the norm, chances are high that your cloud instances will rely on Linux.
Linux serves as the foundation for many operating systems for the Internet of Things (IoT) and mobile applications.
In IT, there are many opportunities for those skilled in Linux.

What does it mean that Linux is an open-source operating system?

First, what is open source? Open source software is software whose source code is freely accessible, allowing anyone to utilize, modify, and distribute it.

Whenever source code is created, it is automatically considered copyrighted, and its distribution is governed by the copyright holder through software licenses.

In contrast to open source, proprietary or closed-source software restricts access to its source code. Only the creators can view, modify, or distribute it.

Linux is primarily open source, which means that its source code is freely available. Anyone can view, modify, and distribute it. Developers from anywhere in the world can contribute to its improvement. This lays the foundation of collaboration which is an important aspect of open source software.

This collaborative approach has led to the widespread adoption of Linux across servers, desktops, embedded systems, and mobile devices.

The most interesting aspect of Linux being open source is that anyone can tailor the operating system to their specific needs without being restricted by proprietary limitations.

Chrome OS, used by Chromebooks, is based on Linux. Android, which powers many smartphones globally, is also based on Linux.

What is a Linux Kernel?

The kernel is the central component of an operating system that manages the computer and its hardware operations. It handles memory operations and CPU time.

The kernel acts as a bridge between applications and the hardware-level data processing using inter-process communication and system calls.

The kernel loads into memory first when an operating system starts and remains there until the system shuts down. It is responsible for tasks like disk management, task management, and memory management.

If you are curious about what the Linux kernel looks like, here is the GitHub link.

What is a Linux distribution?

By this point, you know that you can re-use the Linux kernel code, modify it, and create a new kernel. You can further combine different utilities and software to create a completely new operating system.

A Linux distribution or distro is a version of the Linux operating system that includes the Linux kernel, system utilities, and other software. Being open source, a Linux distribution is a collaborative effort involving multiple independent open-source development communities.

What does it mean that a distribution is derived? When you say that a distribution is "derived" from another, the newer distro is built upon the base or foundation of the original distro. This derivation can include using the same package management system (more on this later), kernel version, and sometimes the same configuration tools.

Today, there are thousands of Linux distributions to choose from, offering differing goals and criteria for selecting and supporting the software provided by their distribution.

Distributions vary from one to the other, but they generally have several common characteristics:

A distribution consists of a Linux kernel.
It supports user space programs.
A distribution may be small and single-purpose or include thousands of open-source programs.
Some means of installing and updating the distribution and its components should be provided.

If you view the Linux Distributions Timeline, you'll see two major distros: Slackware and Debian. Several distributions are derived from them. For example, Ubuntu and Kali are derived from Debian.

What are the advantages of derivation? There are various advantages of derivation. Derived distributions can leverage the stability, security, and large software repositories of the parent distribution.

When building on an existing foundation, developers can drive their focus and effort entirely on the specialized features of the new distribution. Users of derived distributions can benefit from the documentation, community support, and resources already available for the parent distribution.
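
You can often check a distribution's lineage from the command line. The sketch below assumes your system ships the /etc/os-release file (Ubuntu and most modern distributions do). The ID field names the distribution itself, and the optional ID_LIKE field lists the distributions it is derived from. The variable names distro_id and distro_parent are made up for this example.

```shell
# Load the distribution metadata into shell variables, if the file exists.
if [ -r /etc/os-release ]; then
  . /etc/os-release
fi

# ID is the distribution; ID_LIKE (optional) names its parent family.
distro_id="${ID:-unknown}"
distro_parent="${ID_LIKE:-none listed}"

echo "Distribution: $distro_id"
echo "Derived from: $distro_parent"
```

On Ubuntu, this reports ubuntu as the distribution and debian as the parent.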

Some popular Linux distributions are:

Ubuntu: One of the most widely used and popular Linux distributions. It is user-friendly and recommended for beginners. Learn more about Ubuntu here.
Linux Mint: Based on Ubuntu, Linux Mint provides a user-friendly experience with a focus on multimedia support. Learn more about Linux Mint here.
Arch Linux: Popular among experienced users, Arch is a lightweight and flexible distribution aimed at users who prefer a DIY approach. Learn more about Arch Linux here.
Manjaro: Based on Arch Linux, Manjaro provides a user-friendly experience with pre-installed software and easy system management tools. Learn more about Manjaro here.
Kali Linux: Kali Linux provides a comprehensive suite of security tools and is mostly focused on cybersecurity and hacking. Learn more about Kali Linux here.

How to install and access Linux

The best way to learn is to apply the concepts as you go. In this section, we'll learn how to install Linux on your machine so you can follow along. You'll also learn how to access Linux on a Windows machine.

I recommend that you follow any one of the methods mentioned in this section to get access to Linux so you may follow along.

Install Linux as the primary OS

Installing Linux as the primary OS is the most efficient way to use Linux, as you can use the full power of your machine.

In this section, you will learn how to install Ubuntu, which is one of the most popular Linux distributions. I have left out other distributions for now, as I want to keep things simple. You can always explore other distributions once you are comfortable with Ubuntu.

Step 1 – Download the Ubuntu ISO: Go to the official website and download the ISO file. Make sure to select a stable release that is labeled "LTS". LTS stands for Long Term Support, which means you get free security and maintenance updates for a long time (usually 5 years).
Step 2 – Create a bootable pendrive: Several tools can create a bootable pendrive. I recommend using Rufus, as it is quite easy to use. You can download it from here.
Step 3 – Boot from the pendrive: Once your bootable pendrive is ready, insert it and boot from it. The key that opens the boot menu depends on your laptop, so search online for the boot menu key for your model.
Step 4 – Follow the prompts: Once the boot process starts, select "Try or Install Ubuntu".

The process will take some time. Once the GUI appears, you can select the language, and keyboard layout and continue. Enter your login and name. Remember the credentials as you will need them to log in to your system and access full privileges. Wait for the installation to complete.
Step 5 – Restart: Click on restart now and remove the pen drive.
Step 6 – Login: Login with the credentials you entered earlier.

And there you go! Now you can install apps and customize your desktop.

For advanced installation, you can explore the following topics:

Disk partitioning.
Setting swap memory for enabling hibernation.

Accessing the terminal

An important part of this handbook is learning about the terminal where you'll run all the commands and see the magic happen. You can search for the terminal by pressing the "windows" key and typing "terminal". You can pin the Terminal in the dock where other apps are located for easy access.

💡 The shortcut for opening the terminal is ctrl+alt+t

You can also open the terminal from inside a folder. Right click where you are and click on "Open in Terminal". This will open the terminal in the same path.

How to use Linux on a Windows machine

Sometimes you might need to run both Linux and Windows side by side. Luckily, there are some ways you can get the best of both worlds without getting different computers for each operating system.

In this section, you'll explore a few ways to use Linux on a Windows machine. Some of them are browser-based or cloud-based and do not need any OS installation before using them.

Option 1: "Dual-boot" Linux + Windows

With dual boot, you can install Linux alongside Windows on your computer, allowing you to choose which operating system to use at startup.

This requires partitioning your hard drive and installing Linux on a separate partition. With this approach, you can only use one operating system at a time.

Option 2: Use Windows Subsystem for Linux (WSL)

Windows Subsystem for Linux provides a compatibility layer that lets you run Linux binary executables natively on Windows.

Using WSL has some advantages. The setup for WSL is simple and not time-consuming. It is lightweight compared to VMs where you have to allocate resources from the host machine. You don't need to install any ISO or virtual disc image for Linux machines which tend to be heavy files. You can use Windows and Linux side by side.

How to install WSL2

First, enable the Windows Subsystem for Linux option in settings.

Go to Start. Search for "Turn Windows features on or off."
Check the option "Windows Subsystem for Linux" if it isn't already.
Next, open Command Prompt as an administrator.
Run the installation command below:

wsl --install

Note: By default, Ubuntu will be installed.

Once installation is complete, you'll need to reboot your Windows machine. So, restart your Windows machine.

After restarting, you might see a window like this:

Once installation of Ubuntu is complete, you'll be prompted to enter your username and password.

And, that's it! You are ready to use Ubuntu.

Launch Ubuntu by searching from the start menu.

And here we have your Ubuntu instance launched.

Option 3: Use a Virtual Machine (VM)

A virtual machine (VM) is a software emulation of a physical computer system. It allows you to run multiple operating systems and applications on a single physical machine simultaneously.

You can use virtualization software such as Oracle VirtualBox or VMware to create a virtual machine running Linux within your Windows environment. This allows you to run Linux as a guest operating system alongside Windows.

VM software provides options to allocate and manage hardware resources for each VM, including CPU cores, memory, disk space, and network bandwidth. You can adjust these allocations based on the requirements of the guest operating systems and applications.

Option 4: Use a Browser-based Solution

Browser-based solutions are particularly useful for quick testing, learning, or accessing Linux environments from devices that don't have Linux installed.

You can either use online code editors or web-based terminals to access Linux. Note that you usually don't have full administration privileges in these cases.

Online code editors

Online code editors offer editors with built-in Linux terminals. While their primary purpose is coding, you can also utilize the Linux terminal to execute commands and perform tasks.

Replit is an example of an online code editor, where you can write your code and access the Linux shell at the same time.

Web-based Linux terminals:

Online Linux terminals allow you to access a Linux command-line interface directly from your browser. These terminals provide a web-based interface to a Linux shell, enabling you to execute commands and work with Linux utilities.

One such example is JSLinux, which gives you a ready-to-use Linux environment in the browser.

Option 5: Use a Cloud-based Solution

Instead of running Linux directly on your Windows machine, you can consider using cloud-based Linux environments or virtual private servers (VPS) to access and work with Linux remotely.

Services like Amazon EC2, Microsoft Azure, or DigitalOcean provide Linux instances that you can connect to from your Windows computer. Note that some of these services offer free tiers, but they are not usually free in the long run.

Part 2: Introduction to Bash Shell and System Commands

2.1. Getting Started with the Bash shell

Introduction to the bash shell

The Linux command line is provided by a program called the shell. Over the years, several shell programs have been developed, each offering different features.

Different users can be configured to use different shells. But most users prefer to stick with the current default shell. The default shell for many Linux distros is the GNU Bourne-Again Shell (bash). Bash is the successor to the Bourne shell (sh).

To find out your current shell, open your terminal and enter the following command:

echo $SHELL

Command breakdown:

The echo command is used to print on the terminal.
$SHELL is an environment variable that holds the path to your default login shell, which is normally the shell your terminal starts.

In my setup, the output is /bin/bash. This means that I am using the bash shell.

# output
echo $SHELL
/bin/bash
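
Strictly speaking, $SHELL records your default login shell, which is usually, but not always, the shell you are typing into. As a small sketch, assuming a Linux system with /proc mounted, you can compare it with the program that is actually running your commands. The variable name current_shell is chosen just for this example.

```shell
# $SHELL holds the path of your default login shell.
echo "Login shell:   $SHELL"

# $$ expands to the process ID of the shell running right now;
# /proc/<pid>/exe is a link to the program behind that process.
current_shell="$(readlink "/proc/$$/exe")"
echo "Running shell: $current_shell"
```

If you start another shell by hand (for example, by typing sh), the second line changes while $SHELL stays the same.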

Bash is very powerful as it can simplify certain operations that are hard to accomplish efficiently with a GUI (or Graphical User Interface). Remember that most servers do not have a GUI, and it is best to learn to use the powers of a command line interface (CLI).

Terminal vs Shell

The terms "terminal" and "shell" are often used interchangeably, but they refer to different parts of the command-line interface.

The terminal is the interface you use to interact with the shell. The shell is the command interpreter that processes and executes your commands. You'll learn more about shells in Part 6 of the handbook.

What is a prompt?

When a shell is used interactively, it displays a $ when it is waiting for a command from the user. This is called the shell prompt.

[username@host ~]$

If the shell is running as root (you'll learn more about the root user later on), the prompt is changed to #.

[root@host ~]#
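
The choice between $ and # comes down to the user ID of whoever is running the shell. Here is a minimal sketch of that rule (it illustrates the convention and is not how bash builds its prompt internally). The variable name prompt_symbol is made up for this example.

```shell
# id -u prints the numeric user ID; root is always 0.
if [ "$(id -u)" -eq 0 ]; then
  prompt_symbol='#'
else
  prompt_symbol='$'
fi

echo "Your prompt ends with: $prompt_symbol"
```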

2.2. Command Structure

A command is a program that performs a specific operation. Once you have access to the shell, you can enter any command after the $ sign and see the output on the terminal.

Generally, Linux commands follow this syntax:

command [options] [arguments]

Here is the breakdown of the above syntax:

command: This is the name of the command you want to execute. ls (list), cp (copy), and rm (remove) are common Linux commands.
[options]: Options, or flags, often preceded by a hyphen (-) or double hyphen (--), modify the behavior of the command. They can change how the command operates. For example, ls -a uses the -a option to list all files in the current directory, including hidden ones.
[arguments]: Arguments are the inputs for the commands that require one. These could be filenames, user names, or other data that the command will act upon. For example, in the command cat access.log, cat is the command and access.log is the input. As a result, the cat command displays the contents of the access.log file.

Options and arguments are not required for all commands. Some commands can be run without any options or arguments, while others might require one or both to function correctly. You can always refer to the command's manual to check the options and arguments it supports.
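
To see the three parts working together, here is a small sketch that runs ls with and without an option. It works in a temporary directory so nothing on your system is touched. The file names notes.txt and .hidden, and the variable names, are made up for this example.

```shell
# Create a throwaway directory with one regular file and one hidden file.
demo_dir="$(mktemp -d)"
touch "$demo_dir/notes.txt" "$demo_dir/.hidden"

# Command + argument: list the directory (hidden files are skipped).
plain="$(ls "$demo_dir")"

# Command + option + argument: -a also shows names that start with a dot.
with_hidden="$(ls -a "$demo_dir")"

echo "ls:"
echo "$plain"
echo "ls -a:"
echo "$with_hidden"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

The first listing contains only notes.txt, while the second also includes .hidden along with the special entries . and .. for the current and parent directories.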

💡Tip: You can view a command's manual using the man command.

You can access the manual page for ls with man ls, and it'll look like this:

Manual pages are a great and quick way to access the documentation. I highly recommend going through man pages for the commands that you use the most.

2.3. Bash Commands and Keyboard Shortcuts

When you are in the terminal, you can speed up your tasks by using shortcuts.

Here are some of the most common terminal shortcuts:

Operation	Shortcut
Look for the previous command	Up Arrow
Jump to the beginning of the previous word	Ctrl+LeftArrow
Clear characters from the cursor to the end of the command line	Ctrl+K
Complete commands, file names, and options	Pressing Tab
Jumps to the beginning of the command line	Ctrl+A
Displays the list of previous commands	history

2.4. Identifying Yourself: The `whoami` Command

You can get the username you are logged in with by using the whoami command. This command is useful when you are switching between different users and want to confirm the current user.

Just after the $ sign, type whoami and press enter.

whoami

This is the output I got.

zaira@zaira-ThinkPad:~$ whoami
zaira

Part 3: Understanding Your Linux System

3.1. Discovering Your OS and Specs

Print system information using the `uname` Command

You can get detailed system information from the uname command.

When you provide the -a option, it prints all the system information.

uname -a
# output
Linux zaira 6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

In the output above,

Linux: Indicates the operating system.
zaira: Represents the hostname of the machine.
6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2: Provides information about the kernel version, build date, and some additional details.
x86_64 x86_64 x86_64: Indicates the architecture of the system.
GNU/Linux: Represents the operating system type.
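If you only need one of these fields, uname can print them individually. A small sketch using the standard -s, -r, and -m options; the variable names are chosen here for illustration.

```shell
kernel_name=$(uname -s)      # kernel name, for example Linux
kernel_release=$(uname -r)   # kernel release string
machine=$(uname -m)          # hardware architecture, for example x86_64

echo "Kernel:  $kernel_name $kernel_release"
echo "Machine: $machine"
```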

Find details of the CPU architecture using the `lscpu` Command

The lscpu command in Linux is used to display information about the CPU architecture. When you run lscpu in the terminal, it provides details such as:

The architecture of the CPU (for example, x86_64)
CPU op-mode(s) (for example, 32-bit, 64-bit)
Byte Order (for example, Little Endian)
CPU(s) (number of CPUs), and so on

Let's try it out:

lscpu
# output
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 5500U with Radeon Graphics
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            1
    CPU max MHz:         4056.0000
    CPU min MHz:         400.0000

That was a whole lot of information, but useful too! Remember you can always skim the relevant information using specific flags. See the command manual with man lscpu.

Part 4: Managing Files From the Command line

4.1. The Linux File-system Hierarchy

All files in Linux are stored in a file-system. It follows an inverted-tree-like structure because the root is at the topmost part.

The / is the root directory and the starting point of the file system. The root directory contains all other directories and files on the system. The / character also serves as a directory separator between path names. For example, /home/alice forms a complete path.

The image below shows the complete file system hierarchy. Each directory serves a specific purpose.

Note that this is not an exhaustive list and different distributions may have different configurations.

Here is a table that shows the purpose of each directory:

Location	Purpose
/bin	Essential command binaries
/boot	Static files of the boot loader, needed in order to start the boot process.
/etc	Host-specific system configuration
/home	User home directories
/root	Home directory for the administrative root user
/lib	Essential shared libraries and kernel modules
/mnt	Mount point for mounting a filesystem temporarily
/opt	Add-on application software packages
/usr	Installed software and shared libraries
/var	Variable data that is also persistent between boots
/tmp	Temporary files that are accessible to all users

💡 Tip: You can learn more about the file system using the man hier command.

You can check your file system using the tree -d -L 1 command. You can modify the -L flag to change the depth of the tree.

tree -d -L 1
# output
.
├── bin -> usr/bin
├── boot
├── cdrom
├── data
├── dev
├── etc
├── home
├── lib -> usr/lib
├── lib32 -> usr/lib32
├── lib64 -> usr/lib64
├── libx32 -> usr/libx32
├── lost+found
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin -> usr/sbin
├── snap
├── srv
├── sys
├── tmp
├── usr
└── var

25 directories

This list is not exhaustive and different distributions and systems may be configured differently.

4.2. Navigating the Linux File-system

Absolute path vs relative path

The absolute path is the full path from the root directory to the file or directory. It always starts with a /. For example, /home/john/documents.

The relative path, on the other hand, is the path from the current directory to the destination file or directory. It does not start with a /. For example, documents/work/project.

Locating your current directory using the `pwd` command

It is easy to lose your way in the Linux file system, especially if you are new to the command line. You can locate your current directory using the pwd command.

Here is an example:

pwd
# output
/home/zaira/scripts/python

Changing directories using the `cd` command

The command to change directories is cd and it stands for "change directory". You can use the cd command to navigate to a different directory.

You can use a relative path or an absolute path.

For example, if you want to navigate the below file structure (following the red lines):

and you are standing at "home", the command would be like this:

cd bob/documents/work/project

Some other commonly used cd shortcuts are:

Command	Description
`cd ..`	Go up one directory (to the parent directory)
`cd ../..`	Go up two directories
`cd` or `cd ~`	Go to the home directory
`cd -`	Go to the previous path
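To see how absolute paths, relative paths, and these shortcuts fit together, here is a minimal sketch that builds a small directory tree in a temporary location and walks through it. The tree mirrors the example path used above, but its location under a temporary directory is an assumption made for the demo.

```shell
base=$(mktemp -d)
mkdir -p "$base/bob/documents/work/project"

cd "$base/bob"               # absolute path: starts with /
cd documents/work/project    # relative path: resolved from the current directory
deep=$(pwd)

cd ../..                     # up two levels, to .../bob/documents
up_two=$(pwd)

cd - > /dev/null             # back to the previous directory (cd - also prints it)
back=$(pwd)

echo "deep:   $deep"
echo "up two: $up_two"
echo "back:   $back"
```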

4.3. Managing Files and Directories

When working with files and directories, you might want to copy, move, remove, and create new files and directories. Here are some commands that can help you with that.

💡Tip: You can differentiate between a file and a folder by looking at the first character of each line in the output of ls -l. A '-' represents a file and a 'd' represents a folder.

Creating new directories using the `mkdir` command

You can create an empty directory using the mkdir command.

# creates an empty directory named "foo" in the current folder
mkdir foo

You can also create directories recursively using the -p option.

mkdir -p tools/index/helper-scripts
# output of tree
.
└── tools
    └── index
        └── helper-scripts

3 directories, 0 files

Creating new files using the `touch` command

The touch command creates an empty file. You can use it like this:

# creates empty file "file.txt" in the current folder
touch file.txt

The file names can be chained together if you want to create multiple files in a single command.

# creates empty files "file1.txt", "file2.txt", and "file3.txt" in the current folder

touch file1.txt file2.txt file3.txt

Removing files and directories using the `rm` and `rmdir` command

You can use the rm command to remove both files and non-empty directories.

Command	Description
`rm file.txt`	Removes the file `file.txt`
`rm -r directory`	Removes the directory `directory` and its contents
`rm -f file.txt`	Removes the file `file.txt` without prompting for confirmation
`rmdir directory`	Removes an empty directory

🛑 Note that you should use the -f flag with caution as you won't be asked before deleting a file. Also, be careful when running rm commands in the root folder as it might result in deleting important system files.
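The difference between rmdir and rm -r is easiest to see in a scratch directory. A minimal sketch, with made-up names, that runs entirely inside a temporary folder:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir -p tools/index/helper-scripts   # create nested directories in one go
touch tools/index/notes.txt           # create an empty file

# rmdir only removes empty directories, so this fails and leaves "tools" in place
if rmdir tools 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused (directory not empty)"
fi
echo "rmdir tools: $rmdir_result"

rmdir tools/index/helper-scripts      # works: this directory is empty
rm -r tools                           # removes "tools" and everything inside it
```

Because rmdir refuses to touch a directory that still has contents, it is the safer choice when you expect the directory to be empty.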

Copying files using the `cp` command

To copy files in Linux, use the cp command.

Syntax to copy files: cp source_file destination_of_file

This command copies a file named file1.txt to a new file location /home/adam/logs.

cp file1.txt /home/adam/logs

The cp command also creates a copy of one file with the provided name.

This command copies a file named file1.txt to another file named file2.txt in the same folder.

cp file1.txt file2.txt

Moving and renaming files and folders using the `mv` command

The mv command is used to move files and folders from one directory to the other.

Syntax to move files: mv source_file destination_directory

Example: Move a file named file1.txt to a directory named backup:

mv file1.txt backup/

To move a directory and its contents:

mv dir1/ backup/

Renaming files and folders in Linux is also done with the mv command.

Syntax to rename files: mv old_name new_name

Example: Rename a file from file1.txt to file2.txt:

mv file1.txt file2.txt

Rename a directory from dir1 to dir2:

mv dir1 dir2
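Whether cp and mv copy into a directory, move, or rename depends on what the destination is. The sketch below shows each case in a temporary folder; the file and directory names are illustrative only.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir backup
echo "hello" > file1.txt

cp file1.txt file2.txt      # destination is a new name: a copy next to the original
cp file1.txt backup/        # destination is a directory: the copy goes inside it
mv file2.txt renamed.txt    # destination is a new name: a rename
mv renamed.txt backup/      # destination is a directory: a move

ls backup
```

After these commands, file1.txt is still in the current folder, while backup/ holds file1.txt and renamed.txt.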

4.4. Locating Files and Folders Using the `find` Command

The find command lets you efficiently search for files, folders, and character and block devices.

Below is the basic syntax of the find command:

find /path/ -type f -name file-to-search

Where,

/path is the path where the file is expected to be found. This is the starting point for the search. The path can also be / or ., which represent the root directory and the current directory, respectively.
-type specifies the type of file to search for. It can be any of the below:
f – Regular file such as text files, images, and hidden files.
d – Directory. These are the folders under consideration.
l – Symbolic link. Symbolic links point to files and are similar to shortcuts.
c – Character devices. Files that are used to access character devices are called character device files. Drivers communicate with character devices by sending and receiving single characters (bytes, octets). Examples include keyboards, sound cards, and mice.
b – Block devices. Files that are used to access block devices are called block device files. Drivers communicate with block devices by sending and receiving entire blocks of data. Examples include USB drives and CD-ROM drives.
-name is the name, or a name pattern, of the file that you want to search for.

How to search files by name or extension

Suppose we need to find files whose names start with "style". We'll use this command:

find . -type f -name "style*"
#output
./style.css
./styles.css

Now let's say we want to find files with a particular extension like .html. We'll modify the command like this:

find . -type f -name "*.html"
# output
./services.html
./blob.html
./index.html

How to search hidden files

A dot at the beginning of the filename represents hidden files. They are normally hidden but can be viewed with ls -a in the current directory.

We can modify the find command as shown below to search for hidden files:

find . -type f -name ".*"

List and find hidden files

ls -la
# folder contents
total 5
drwxrwxr-x  2 zaira zaira 4096 Mar 26 14:17 .
drwxr-x--- 61 zaira zaira 4096 Mar 26 14:12 ..
-rw-rw-r--  1 zaira zaira    0 Mar 26 14:17 .bash_history
-rw-rw-r--  1 zaira zaira    0 Mar 26 14:17 .bash_logout
-rw-rw-r--  1 zaira zaira    0 Mar 26 14:17 .bashrc

find . -type f -name ".*"
# find output
./.bash_logout
./.bashrc
./.bash_history

Above you can see a list of hidden files in my home directory.

How to search log files and configuration files

Log files usually have the extension .log, and we can find them like this:

 find . -type f -name "*.log"

Similarly, we can search for configuration files like this:

 find . -type f -name "*.conf"

How to search other files by type

We can search for character device files by providing c to -type:

find / -type c

Similarly, we can find block device files by using b:

find / -type b

How to search directories

In the example below, we are finding the folders using the -type d flag.

ls -l
# list folder contents
drwxrwxr-x 2 zaira zaira 4096 Mar 26 14:22 hosts
-rw-rw-r-- 1 zaira zaira    0 Mar 26 14:23 hosts.txt
drwxrwxr-x 2 zaira zaira 4096 Mar 26 14:22 images
drwxrwxr-x 2 zaira zaira 4096 Mar 26 14:23 style
drwxrwxr-x 2 zaira zaira 4096 Mar 26 14:22 webp 

find . -type d 
# find directory output
.
./webp
./images
./style
./hosts

How to search files by size

An incredibly helpful use of the find command is to list files based on a particular size.

find / -size +250M

Here, we are listing files whose size exceeds 250MB.

Other units include:

G: gibibytes (GiB).
M: mebibytes (MiB).
k: kibibytes (KiB). Note the lowercase k.
c: bytes.

Just replace N with a number and append the relevant unit:

find /path -type f -size +N[unit]

How to search files by modification time

By using the -mtime flag, you can filter files and folders based on the modification time.

find /path -name "*.txt" -mtime -10

For example,

-mtime +10 matches files modified more than 10 days ago.
-mtime -10 matches files modified less than 10 days ago.
-mtime 10 If you skip + or -, it matches files modified exactly 10 days ago.
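The size and modification-time tests can be tried safely in a scratch directory. This is a minimal sketch that assumes GNU find and GNU touch (the -d "15 days ago" form is a GNU extension); all file names are made up for the demo.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

touch fresh.txt fresh.log
touch -d "15 days ago" old.txt       # GNU touch: backdate the modification time
head -c 2048 /dev/zero > big.bin     # a 2 KiB file filled with zero bytes

recent_txt=$(find . -type f -name "*.txt" -mtime -10)   # modified under 10 days ago
old_txt=$(find . -type f -name "*.txt" -mtime +10)      # modified over 10 days ago
large=$(find . -type f -size +1k)                       # larger than 1 KiB

echo "recent: $recent_txt"
echo "old:    $old_txt"
echo "large:  $large"
```

Combining tests such as -name and -mtime narrows the result, because find only prints files that match every test on the command line.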

4.5. Basic Commands for Viewing Files

Concatenate and display files using the `cat` command

The cat command in Linux is used to display the contents of a file. It can also be used to concatenate files and create new files.

Here is the basic syntax of the cat command:

cat [options] [file]

The simplest way to use cat is to pass it a file name without any options. This will display the contents of the file on the terminal.

For example, if you want to view the contents of a file named file.txt, you can use the following command:

cat file.txt

This will display all the contents of the file on the terminal at once.

Viewing text files interactively using `less` and `more`

While cat displays the entire file at once, less and more allow you to view the contents of a file interactively. This is useful when you want to scroll through a large file or search for specific content.

The syntax of the less command is:

less [options] [file]

The more command is similar to less but has fewer features. It is used to display the contents of a file one screen at a time.

The syntax of the more command is:

more [options] [file]

For both commands, you can use the spacebar to scroll one page down, the Enter key to scroll one line down, and the q key to exit the viewer.

To move backward you can use the b key, and to move forward you can use the f key.

Displaying the last part of files using `tail`

Sometimes you might need to view just the last few lines of a file instead of the entire file. The tail command in Linux is used to display the last part of a file.

For example, tail file.txt will display the last 10 lines of the file file.txt by default.

If you want to display a different number of lines, you can use the -n option followed by the number of lines you want to display.

# Display the last 50 lines of the file file.txt
tail -n 50 file.txt

💡Tip: Another usage of the tail is its follow-along (-f) option. This option enables you to view the contents of a file as they are being written. This is a useful utility for viewing and monitoring log files in real-time.

Displaying the beginning of files using `head`

Just like tail displays the last part of a file, you can use the head command in Linux to display the beginning of a file.

For example, head file.txt will display the first 10 lines of the file file.txt by default.

To change the number of lines displayed, you can use the -n option followed by the number of lines you want to display.
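Here is a short sketch of head and tail with the -n option. It uses seq to generate a 100-line sample file in a temporary location, which is an assumption made for the demo rather than a file from the handbook.

```shell
log_file=$(mktemp)
seq 1 100 > "$log_file"          # 100 lines containing 1, 2, ..., 100

first_three=$(head -n 3 "$log_file")             # lines 1 to 3
last_two=$(tail -n 2 "$log_file")                # lines 99 and 100
default_head_count=$(head "$log_file" | wc -l)   # head prints 10 lines by default

echo "$first_three"
echo "$last_two"
echo "default head length: $default_head_count"
```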

Counting words, lines, and characters using `wc`

You can count words, lines and characters in a file using the wc command.

For example, running wc syslog.log gave me the following output:

1669 9623 64367 syslog.log

In the output above,

1669 represents the number of lines in the file syslog.log.
9623 represents the number of words in the file syslog.log.
64367 represents the number of bytes in the file syslog.log. For plain ASCII text, this equals the number of characters.

So, the command wc syslog.log counted 1669 lines, 9623 words, and 64367 bytes in the file syslog.log.
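wc also accepts options that print a single count: -l for lines, -w for words, and -c for bytes. A minimal sketch on a tiny sample file created just for this demo:

```shell
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample")   # 2 lines
words=$(wc -w < "$sample")   # 3 words
bytes=$(wc -c < "$sample")   # 14 bytes (8 in the first line, 6 in the second)

echo "lines=$lines words=$words bytes=$bytes"
```

Reading the file through < instead of passing it as an argument makes wc print only the number, without the file name.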

Comparing files line by line using `diff`

Comparing and finding differences between two files is a common task in Linux. You can compare two files right within the command line using the diff command.

The basic syntax of the diff command is:

diff [options] file1 file2

Here are two files, hello.py and also-hello.py, that we will compare using the diff command:

# contents of hello.py

def greet(name):
    return f"Hello, {name}!"

user = input("Enter your name: ")
print(greet(user))

# contents of also-hello.py

def greet(name):
    return f"Hello, {name}!"

user = input("Enter your name: ")
print(greet(user))
print("Nice to meet you")

Check whether the files are the same or not

diff -q hello.py also-hello.py
# Output
Files hello.py and also-hello.py differ

See how the files differ. For that, you can use the -u flag to see a unified output:

diff -u hello.py also-hello.py
--- hello.py    2024-05-24 18:31:29.891690478 +0500
+++ also-hello.py    2024-05-24 18:32:17.207921795 +0500
@@ -3,4 +3,5 @@

 user = input("Enter your name: ")
 print(greet(user))
+print("Nice to meet you")

In the above output:

--- hello.py 2024-05-24 18:31:29.891690478 +0500 indicates the file being compared and its timestamp.
+++ also-hello.py 2024-05-24 18:32:17.207921795 +0500 indicates the other file being compared and its timestamp.
@@ -3,4 +3,5 @@ is the hunk header. It shows where the change occurs: the hunk covers 4 lines starting at line 3 of the original file and 5 lines starting at line 3 of the modified file.
user = input("Enter your name: ") is an unchanged context line that appears in both files.
print(greet(user)) is another unchanged context line.
+print("Nice to meet you") is the additional line in the modified file.

To see the diff in a side-by-side format, you can use the -y flag:

diff -y hello.py also-hello.py
# Output
def greet(name):                            def greet(name):
    return f"Hello, {name}!"                    return f"Hello, {name}!"

user = input("Enter your name: ")           user = input("Enter your name: ")
print(greet(user))                          print(greet(user))
                                          > print("Nice to meet you")

In the output:

The lines that are the same in both files are displayed side by side.
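Lines that differ are marked with a symbol in the middle column. Here, > indicates that the line is present only in the second file. Similarly, < marks a line present only in the first file, and | marks a line that differs between the two.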
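In scripts, you often only need to know whether two files differ. diff reports this through its exit status: 0 when the files are identical and 1 when they differ. The sketch below relies on that; the two sample files are created only for this demo.

```shell
sandbox=$(mktemp -d)
printf 'print("hi")\n' > "$sandbox/a.py"
printf 'print("hi")\nprint("bye")\n' > "$sandbox/b.py"

# diff exits with 0 for identical files and 1 for differing files
if diff -q "$sandbox/a.py" "$sandbox/b.py" > /dev/null; then
    verdict="identical"
else
    verdict="different"
fi

# Count added lines in the unified diff: lines starting with a single +
added_lines=$(diff -u "$sandbox/a.py" "$sandbox/b.py" | grep -c '^+[^+]')

echo "The files are $verdict; lines added: $added_lines"
```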

Part 5: The Essentials of Text Editing in Linux

Text editing skills using the command line are one of the most crucial skills in Linux. In this section, you will learn how to use two popular text editors in Linux: Vim and Nano.

I suggest that you master any one text editor of your choice and stick to it. It will save you time and make you more productive. Vim and nano are safe choices as they are present on most Linux distributions.

5.1. Mastering Vim: The Complete Guide

Introduction to Vim

Vim is a popular text editing tool for the command line. Vim comes with its advantages: it is powerful, customizable, and fast. Here are some reasons why you should consider learning Vim:

Most servers are accessed via a CLI, so in system administration, you don't necessarily have the luxury of a GUI. But Vim has got your back – it'll always be there.
Vim uses a keyboard-centric approach, as it is designed to be used without a mouse, which can significantly speed up editing tasks once you have learned the keyboard shortcuts. This also makes it faster than GUI tools.
Some Linux utilities, for example editing cron jobs, work in the same editing format as Vim.
Vim is suitable for all – beginners and advanced users. Vim supports complex string searches, highlighting searches, and much more. Through plugins, Vim provides extended capabilities to developers and system admins that include code completion, syntax highlighting, file management, version control, and more.

Vim has two variations: Vim (vim) and Vim tiny (vi). Vim tiny is a smaller version of Vim that lacks some features of Vim.

How to start using `vim`

Start using Vim with this command:

vim your-file.txt

your-file.txt can either be a new file or an existing file that you want to edit.

Navigating Vim: Mastering movement and command modes

In the early days of the CLI, the keyboards didn't have arrow keys. Hence, navigation was done using the set of available keys, hjkl being one of them.

Being keyboard-centric, using hjkl keys can greatly speed up text editing tasks.

Note: Although the arrow keys work totally fine, you can still experiment with the hjkl keys to navigate. Some people find this way of navigating more efficient.

💡Tip: To remember the hjkl sequence, use this: hang back, jump down, kick up, leap forward.

The three Vim modes

You need to know the 3 operating modes of Vim and how to switch between them. Keystrokes behave differently in each mode. The three modes are as follows:

Command mode.
Edit mode.
Visual mode.

Command Mode. When you start Vim, you land in the command mode by default. This mode allows you to access other modes.

⚠ To switch to other modes, you need to be present in the command mode first

Edit Mode

This mode allows you to make changes to the file. To enter edit mode, press i while in command mode. Note the '-- INSERT --' indicator at the bottom of the screen.

Visual mode

This mode allows you to work on a single character, a block of text, or lines of text. Let's break it down into simple steps. Remember, use the below combinations when in command mode.

Shift + V → Line mode (select multiple lines)
Ctrl + V → Block mode
v → Character mode

The visual mode comes in handy when you need to copy and paste or edit lines in bulk.

Extended command mode.

The extended command mode allows you to perform advanced operations like searching, setting line numbers, and highlighting text. We'll cover extended mode in the next section.

How to stay on track? If you forget your current mode, just press ESC twice and you will be back in Command Mode.

Editing Efficiently in Vim: Copy/pasting and searching

1. How to copy and paste in Vim

Copy-paste is known as 'yank' and 'put' in Vim terms. To copy-paste, follow these steps:

Select text in visual mode.
Press 'y' to copy/ yank.
Move your cursor to the required position and press 'p'.

2. How to search for text in Vim

You can search for any string in Vim using / in command mode. To search, type /string-to-match and press Enter.

In the command mode, type :set hls and press enter. Search using /string-to-match. This will highlight the searches.

Let's search a few strings:

3. How to exit Vim

First, move to command mode (by pressing escape twice) and then use these commands:

Exit without saving → :q!
Exit and save → :wq!

Shortcuts in Vim: Making Editing Faster

Note: All these shortcuts work in the command mode only.

Basic Navigation
- h: Move left
- j: Move down
- k: Move up
- l: Move right
- 0: Move to the beginning of the line
- $: Move to the end of the line
- gg: Move to the beginning of the file
- G: Move to the end of the file
- Ctrl+d: Move half-page down
- Ctrl+u: Move half-page up
Editing
- i: Enter insert mode before the cursor
- I: Enter insert mode at the beginning of the line
- a: Enter insert mode after the cursor
- A: Enter insert mode at the end of the line
- o: Open a new line below the current line and enter insert mode
- O: Open a new line above the current line and enter insert mode
- x: Delete the character under the cursor
- dd: Delete the current line
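- yy: Yank (copy) the current line (in visual mode, use y to yank the selection)
- p: Paste after the cursor (yanked lines go below the current line)
- P: Paste before the cursor (yanked lines go above the current line)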
Searching and Replacing
- /: Search for a pattern which will take you to its next occurrence
- ?: Search for a pattern that will take you to its previous occurrence
- n: Repeat the last search in the same direction
- N: Repeat the last search in the opposite direction
- :%s/old/new/g: Replace all occurrences of old with new in the file
Exiting
- :w: Save the file but don't exit
- :q: Quit Vim (fails if there are unsaved changes)
- :wq or :x: Save and quit
- :q!: Quit without saving
Multiple Windows
- :split or :sp: Split the window horizontally
- :vsplit or :vsp: Split the window vertically
- Ctrl+w followed by h/j/k/l: Navigate between split windows

5.2. Mastering Nano

Getting started with Nano: The user-friendly text editor

Nano is a user-friendly text editor that is easy to use and is perfect for beginners. It is pre-installed on most Linux distributions.

To create a new file using Nano, use the following command:

nano

To start editing an existing file with Nano, use the following command:

nano filename

List of key bindings in Nano

Let's study the most important key bindings in Nano. You'll use the key bindings to perform various operations like saving, exiting, copying, pasting, and more.

Write to a file and save

Once you open Nano using the nano command, you can start writing text. To save the file, press Ctrl+O. You'll be prompted to enter the file name. Press Enter to save the file.

Exit nano

You can exit Nano by pressing Ctrl+X. If you have unsaved changes, Nano will prompt you to save the changes before exiting.

Copying and pasting

To select a region, use Alt+A. A marker will show. Use the arrow keys to select the text. Once selected, copy the text with Alt+6 (also written Alt+^).

To paste the copied text, press Ctrl+U.

Cutting and pasting

Select the region with ALT+A. Once selected, cut the text with Ctrl+K. To paste the cut text, press Ctrl+U.

Navigation

Use Alt \ to move to the beginning of the file.

Use Alt / to move to the end of the file.

Viewing line numbers

When you open a file with nano -l filename, you can view line numbers on the left side of the file.

Searching

You can jump to a specific line number with Alt+G. Enter the line number at the prompt and press Enter.

You can also initiate search for a string with CTRL + W and press Enter. If you want to search backwards, you can press Alt+W after initiating the search with Ctrl+W.

Summary of keybindings in Nano

General
- Ctrl+X: Exit Nano (prompting to save if changes are made)
- Ctrl+O: Save the file
- Ctrl+R: Read a file into the current file
- Ctrl+G: Display the help text
Editing
- Ctrl+K: Cut the current line and store it in the cutbuffer
- Ctrl+U: Paste the contents of the cutbuffer into the current line
- Alt+6: Copy the current line and store it in the cutbuffer
- Ctrl+J: Justify the current paragraph
Navigation
- Ctrl+A: Move to the beginning of the line
- Ctrl+E: Move to the end of the line
- Ctrl+C: Display the current line number and file information
- Ctrl+_ (Ctrl+Shift+-): Go to a specific line (and optionally, column) number
- Ctrl+Y: Scroll up one page
- Ctrl+V: Scroll down one page
Search and Replace
- Ctrl+W: Search for a string (then Enter to search again)
- Alt+W: Repeat the last search but in the opposite direction
- Ctrl+\: Search and replace
Miscellaneous
- Ctrl+T: Invoke the spell checker, if available
- Ctrl+D: Delete the character under the cursor (does not cut it)
- Ctrl+L: Refresh (redraw) the current screen
- Alt+U: Undo the last operation
- Alt+E: Redo the last undone operation

Part 6: Bash Scripting

6.1. Definition of Bash scripting

A bash script is a file containing a sequence of commands that are executed by the bash program line by line. It allows you to perform a series of actions, such as navigating to a specific directory, creating a folder, and launching a process using the command line.

By saving commands in a script, you can repeat the same sequence of steps multiple times and execute them by running the script.

6.2. Advantages of Bash Scripting

Bash scripting is a powerful and versatile tool for automating system administration tasks, managing system resources, and performing other routine tasks in Unix/Linux systems.

Some advantages of shell scripting are:

Automation: Shell scripts allow you to automate repetitive tasks and processes, saving time and reducing the risk of errors that can occur with manual execution.
Portability: Shell scripts can be run on various platforms and operating systems, including Unix, Linux, macOS, and even Windows through the use of emulators or virtual machines.
Flexibility: Shell scripts are highly customizable and can be easily modified to suit specific requirements. They can also be combined with other programming languages or utilities to create more powerful scripts.
Accessibility: Shell scripts are easy to write and don't require any special tools or software. They can be edited using any text editor, and most operating systems have a built-in shell interpreter.
Integration: Shell scripts can be integrated with other tools and applications, such as databases, web servers, and cloud services, allowing for more complex automation and system management tasks.
Debugging: Shell scripts are easy to debug, and most shells have built-in debugging and error-reporting tools that can help identify and fix issues quickly.

6.3. Overview of Bash Shell and Command Line Interface

The terms "shell" and "bash" are often used interchangeably. But there is a subtle difference between the two.

The term "shell" refers to a program that provides a command-line interface for interacting with an operating system. Bash (Bourne-Again SHell) is one of the most commonly used Unix/Linux shells and is the default shell in many Linux distributions.

Till now, the commands that you have been entering were basically being entered in a "shell".

Although Bash is a type of shell, there are other shells available as well, such as Korn shell (ksh), C shell (csh), and Z shell (zsh). Each shell has its own syntax and set of features, but they all share the common purpose of providing a command-line interface for interacting with the operating system.

You can determine your shell type using the ps command:

ps
# output:

    PID TTY          TIME CMD
  20506 pts/0    00:00:00 bash <--- the shell type
  20931 pts/0    00:00:00 ps

In summary, while "shell" is a broad term that refers to any program that provides a command-line interface, "Bash" is a specific type of shell that is widely used in Unix/Linux systems.

Note: In this section, we will be using the "bash" shell.

6.4. How to Create and Execute Bash scripts

Script naming conventions

By naming convention, bash scripts end with .sh. However, bash scripts can run perfectly fine without the .sh extension.

Adding the Shebang

Bash scripts start with a shebang. The shebang is a combination of a hash (#) and a bang (!), followed by the absolute path to the bash interpreter. It must be the first line of the script. When you run the script directly, the shebang tells the system to execute it with the bash shell.

Below is an example of the shebang statement.

#!/bin/bash

You can find your bash shell path (which may vary from the above) using the command:

which bash

Creating your first bash script

Our first script prompts the user to enter a path. In return, its contents will be listed.

Create a file named run_all.sh using any editor of your choice.

vim run_all.sh

Add the following commands in your file and save it:

#!/bin/bash
echo "Today is " `date`

echo -e "\nenter the path to directory"
read the_path

echo -e "\n your path has the following files and folders: "
ls "$the_path"

Let's take a deeper look at the script line by line. I am displaying the same script again, but this time with line numbers.

  1 #!/bin/bash
  2 echo "Today is " `date`
  3
  4 echo -e "\nenter the path to directory"
  5 read the_path
  6
  7 echo -e "\n your path has the following files and folders: "
  8 ls "$the_path"

Line #1: The shebang (#!/bin/bash) points toward the bash shell path.
Line #2: The echo command displays the current date and time on the terminal. Note that the date is in backticks.
Line #4: We want the user to enter a valid path.
Line #5: The read command reads the input and stores it in the variable the_path.
Line #8: The ls command takes the variable with the stored path and lists the files and folders in that path.
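
One caveat: if the user enters a path that does not exist, ls simply reports an error. As a minimal sketch (the list_path function name and the temporary demo directory are assumptions made for illustration, not part of the original script), the same idea can be written with quoting and a directory check:

```shell
#!/bin/bash
# Hypothetical helper: list a directory only if the given path is one.
list_path() {
  local the_path=$1
  if [[ -d "$the_path" ]]; then
    ls "$the_path"   # quotes keep a path with spaces as a single argument
  else
    echo "Not a directory: $the_path" >&2
    return 1
  fi
}

# Demo on a throwaway directory instead of asking for input with read.
demo_dir=$(mktemp -d)
touch "$demo_dir/notes.txt"
list_path "$demo_dir"
```

In an interactive script, you could call list_path "$the_path" right after the read command.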

Executing the bash script

To make the script executable, assign execution rights to your user using this command:

chmod u+x run_all.sh

Here,

chmod modifies the permissions (mode) of a file. The u refers to the user who owns the file.
+x adds execute permission for that user. This means that the user who is the owner can now run the script.
run_all.sh is the file we wish to run.

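You can run the script using any of the following methods. Note that sh is not always bash (on Ubuntu it is dash), so bash-specific features may behave differently when a script is run with sh: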

sh run_all.sh
bash run_all.sh
./run_all.sh
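
To see the effect of the execute permission, here is a small sketch that creates a throwaway script in a temporary directory (the hello.sh name and its contents are made up for illustration), checks whether it is executable, and then runs it through bash:

```shell
#!/bin/bash
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# Write a tiny script to disk.
cat > "$script" <<'EOF'
#!/bin/bash
echo "hello from script"
EOF

# A newly created file is normally not executable.
[[ -x "$script" ]] || echo "not executable yet"

chmod u+x "$script"
[[ -x "$script" ]] && echo "executable now"

# Passing the file to bash works with or without the execute permission.
bash "$script"
```

Running the file directly (for example with ./hello.sh from inside that directory) is the form that needs the execute permission.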

Let's see it running in action 🚀

6.5. Bash Scripting Basics

Comments in bash scripting

Comments start with a # in bash scripting. This means that any line that begins with a # is a comment and will be ignored by the interpreter.

Comments are very helpful in documenting the code, and it is a good practice to add them to help others understand the code.

These are examples of comments:

# This is an example comment
# Both of these lines will be ignored by the interpreter

Variables and data types in Bash

Variables let you store data. You can use variables to read, access, and manipulate data throughout your script.

There are no data types in Bash. In Bash, a variable is capable of storing numeric values, individual characters, or strings of characters.

In Bash, you can use and set the variable values in the following ways:

1. Assign the value directly:

country=Netherlands

2. Assign the value from another variable, or from the output of a program or command using command substitution. Note that $ is required to access an existing variable's value.

same_country=$country

This assigns the value of country to the new variable same_country.

To access the variable value, prepend $ to the variable name.

country=Netherlands
echo $country
# output
Netherlands
new_country=$country
echo $new_country
# output
Netherlands

Above, you can see an example of assigning and printing variable values.
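
Command substitution, mentioned above, captures the output of a command in a variable using the $(command) form. A minimal sketch (the variable names are illustrative):

```shell
#!/bin/bash
# $(...) runs the command and substitutes its output.
current_year=$(date +%Y)
kernel_name=$(uname -s)

echo "Year: $current_year"
echo "Kernel: $kernel_name"

# Backticks are the older form of the same feature.
also_year=`date +%Y`
echo "Same year: $also_year"
```

The $(...) form is usually preferred because it is easier to read and can be nested.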

Variable naming conventions

In Bash scripting, the following are the variable naming conventions:

Variable names should start with a letter or an underscore (_).
Variable names can contain letters, numbers, and underscores (_).
Variable names are case-sensitive.
Variable names should not contain spaces or special characters.
Use descriptive names that reflect the purpose of the variable.
Avoid using reserved keywords, such as if, then, else, fi, and so on as variable names.

Here are some examples of valid variable names in Bash:

name
count
_var
myVar
MY_VAR

And here are some examples of invalid variable names:

# invalid variable names

2ndvar (variable name starts with a number)
my var (variable name contains a space)
my-var (variable name contains a hyphen)

Following these naming conventions helps make Bash scripts more readable and easier to maintain.

Input and output in Bash scripts

Gathering input

In this section, we'll discuss some methods to provide input to our scripts.

1. Reading the user input and storing it in a variable

We can read the user input using the read command.

#!/bin/bash
echo "What's your name?"
read entered_name
echo -e "\nWelcome to bash tutorial" $entered_name

2. Reading from a file

This code reads each line from a file named input.txt and prints it to the terminal. We'll study while loops later in this section.

# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line
do
  echo "$line"
done < input.txt

3. Command line arguments

In a bash script or function, $1 denotes the initial argument passed, $2 denotes the second argument passed, and so forth.

This script takes a name as a command-line argument and prints a personalized greeting.

#!/bin/bash
echo "Hello, $1!"

If we supply Zaira as the argument to the script, it prints Hello, Zaira!
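
Positional parameters work the same way inside functions. The sketch below uses a hypothetical greet function (not part of the original script) to show $1 and $2, along with $#, which holds the number of arguments passed:

```shell
#!/bin/bash
# Hypothetical function: $1 and $2 are its first and second arguments.
greet() {
  if [[ $# -lt 1 ]]; then
    echo "Usage: greet NAME [GREETING]" >&2
    return 1
  fi
  local name=$1
  local greeting=${2:-Hello}   # fall back to "Hello" when $2 is missing
  echo "$greeting, $name!"
}

greet Zaira
greet Zaira "Good morning"
```

Quoting "Good morning" keeps it as a single argument, so it arrives in $2 as one value.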

Displaying output

Here we'll discuss some methods to receive output from the scripts.

1. Printing to the terminal:

echo "Hello, World!"

This prints the text "Hello, World!" to the terminal.

2. Writing to a file:

echo "This is some text." > output.txt

This writes the text "This is some text." to a file named output.txt. Note that the > operator overwrites a file if it already has some content.

3. Appending to a file:

echo "More text." >> output.txt

This appends the text "More text." to the end of the file output.txt.

4. Redirecting output:

ls > files.txt

This lists the files in the current directory and writes the output to a file named files.txt. You can redirect output of any command to a file this way.
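
The difference between > and >> is easiest to see side by side. The sketch below writes to a temporary file (created with mktemp so that no real file is touched) and captures its content after each step:

```shell
#!/bin/bash
out_file=$(mktemp)

echo "first line" > "$out_file"     # > creates or overwrites the file
echo "second line" >> "$out_file"   # >> appends to it
after_append=$(cat "$out_file")

echo "replacement" > "$out_file"    # > again discards the earlier content
after_overwrite=$(cat "$out_file")

echo "After append:"
echo "$after_append"
echo "After overwrite:"
echo "$after_overwrite"
```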

You'll learn about output redirection in detail in section 8.5.

Conditional statements (if/else)

Expressions that produce a boolean result, either true or false, are called conditions. There are several ways to evaluate conditions, including if, if-else, if-elif-else, and nested conditionals.

Syntax:

if [[ condition ]];
then
    statement
elif [[ condition ]]; then
    statement 
else
    do this by default
fi

Syntax of bash conditional statements

We can use logical operators such as AND (-a) and OR (-o) inside single brackets to combine several conditions into one test. Inside double brackets [[ ]], use && and || instead.

if [ $a -gt 60 -a $b -lt 100 ]

This statement checks if both conditions are true: a is greater than 60 AND b is less than 100.
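
As a sketch of both styles, the example below runs the same check with single brackets and -a, and with double brackets and && or ||. The both_in_range function and the sample values are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: succeeds when a > 60 AND b < 100.
both_in_range() {
  local a=$1 b=$2
  [[ $a -gt 60 && $b -lt 100 ]]
}

a=75
b=40
if [ "$a" -gt 60 -a "$b" -lt 100 ]; then
  echo "single brackets with -a: both conditions are true"
fi
if both_in_range "$a" "$b"; then
  echo "double brackets with &&: both conditions are true"
fi
if [[ $a -lt 60 || $b -lt 100 ]]; then
  echo "double brackets with ||: at least one condition is true"
fi
```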

Let's see an example of a Bash script that uses if, if-else, and if-elif-else statements to determine if a user-inputted number is positive, negative, or zero:

#!/bin/bash

# Script to determine if a number is positive, negative, or zero

echo "Please enter a number: "
read num

if [ $num -gt 0 ]; then
  echo "$num is positive"
elif [ $num -lt 0 ]; then
  echo "$num is negative"
else
  echo "$num is zero"
fi

The script first prompts the user to enter a number. Then, it uses an if statement to check if the number is greater than 0. If it is, the script outputs that the number is positive. If the number is not greater than 0, the script moves on to the next branch, which is the elif statement.

Here, the script checks if the number is less than 0. If it is, the script outputs that the number is negative.

Finally, if the number is neither greater than 0 nor less than 0, the script uses an else statement to output that the number is zero.
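
The script above assumes the user types a whole number. With any other input, the -gt and -lt tests report an error. As a sketch of one way to guard against that, the hypothetical classify_number function below wraps the same if/elif/else logic and validates its argument first:

```shell
#!/bin/bash
# Hypothetical function wrapping the same if/elif/else logic.
classify_number() {
  local num=$1
  # Reject anything that is not an optionally signed whole number.
  if ! [[ $num =~ ^-?[0-9]+$ ]]; then
    echo "'$num' is not a whole number" >&2
    return 1
  fi
  if [ "$num" -gt 0 ]; then
    echo "$num is positive"
  elif [ "$num" -lt 0 ]; then
    echo "$num is negative"
  else
    echo "$num is zero"
  fi
}

classify_number 42
classify_number -7
classify_number 0
```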

Seeing it in action 🚀

Looping and branching in Bash

While loop

While loops check a condition and keep looping as long as the condition remains true. We need to provide a counter statement that increments the counter to control loop execution.

In the example below, (( i += 1 )) is the counter statement that increments the value of i. The loop will run exactly 10 times.

#!/bin/bash
i=1
while [[ $i -le 10 ]] ; do
  echo "$i"
  (( i += 1 ))
done

For loop

The for loop, just like the while loop, allows you to execute statements a specific number of times. Each loop differs in its syntax and usage.

In the example below, the loop will iterate 5 times.

#!/bin/bash

for i in {1..5}
do
    echo $i
done
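
Loops become more useful once they do some work on each pass. The sketch below (variable names are illustrative) uses a while loop to add the numbers 1 to 10, and Bash's C-style for loop, an alternative to the {1..5} form, to collect the squares of 1 to 5:

```shell
#!/bin/bash
# while loop: add the numbers 1 to 10.
total=0
i=1
while [[ $i -le 10 ]]; do
  (( total += i ))
  (( i += 1 ))
done
echo "Sum of 1..10: $total"

# C-style for loop: the counter is initialised, tested, and incremented in one line.
squares=""
for (( n = 1; n <= 5; n++ )); do
  squares+="$(( n * n )) "
done
echo "Squares: $squares"
```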

Case statements

In Bash, case statements are used to compare a given value against a list of patterns and execute a block of code based on the first pattern that matches. The syntax for a case statement in Bash is as follows:

case expression in
    pattern1)
        # code to execute if expression matches pattern1
        ;;
    pattern2)
        # code to execute if expression matches pattern2
        ;;
    pattern3)
        # code to execute if expression matches pattern3
        ;;
    *)
        # code to execute if none of the above patterns match expression
        ;;
esac

Here, "expression" is the value that we want to compare, and "pattern1", "pattern2", "pattern3", and so on are the patterns that we want to compare it against.

The double semicolon ";;" separates each block of code to execute for each pattern. The asterisk "*" represents the default case, which executes if none of the specified patterns match the expression.

Let's see an example:

fruit="apple"

case $fruit in
    "apple")
        echo "This is a red fruit."
        ;;
    "banana")
        echo "This is a yellow fruit."
        ;;
    "orange")
        echo "This is an orange fruit."
        ;;
    *)
        echo "Unknown fruit."
        ;;
esac

In this example, since the value of fruit is apple, the first pattern matches, and the block of code that echoes This is a red fruit. is executed. If the value of fruit were instead banana, the second pattern would match and the block of code that echoes This is a yellow fruit. would execute, and so on.

If the value of fruit does not match any of the specified patterns, the default case is executed, which echoes Unknown fruit.
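
Patterns in a case statement can do more than match one exact string. As a sketch, the hypothetical describe_fruit function below combines two patterns with | and uses * as a wildcard inside a pattern:

```shell
#!/bin/bash
# Hypothetical function: map a fruit name to a colour description.
describe_fruit() {
  case $1 in
    apple|cherry)          # | lets one block serve several patterns
      echo "This is a red fruit."
      ;;
    banana)
      echo "This is a yellow fruit."
      ;;
    *berry)                # * matches any run of characters
      echo "This is some kind of berry."
      ;;
    *)
      echo "Unknown fruit."
      ;;
  esac
}

describe_fruit cherry
describe_fruit blueberry
describe_fruit kiwi
```

Because the first matching pattern wins, the catch-all * must come last.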

Part 7: Managing Software Packages in Linux

Linux comes with several built-in programs. But you might need to install new programs based on your needs. You might also need to upgrade the existing applications.

7.1. Packages and Package Management

What is a package?

A package is a collection of files that are bundled together. These files are essential for a particular program to run. These files contain the program's executable files, libraries, and other resources.

In addition to the files required for the program to run, packages also contain installation scripts, which copy the files to where they are needed. A program may contain many files and dependencies. With packages, it is easier to manage all the files and dependencies at once.

What is the difference between source and binary?

Programmers write source code in a programming language. This source code is then compiled into machine code that the computer can understand. The compiled code is called binary code.

When you download a package, you can either get the source code or the binary code. The source code is the human-readable code that can be compiled into binary code. The binary code is the compiled code that the computer can understand.

Source packages can be used with any type of machine if the source code is compiled properly. Binary, on the other hand, is compiled code that is specific to a particular type of machine or architecture.

You can find the architecture of your machine using the uname -m command.

uname -m
# output
x86_64

Package dependencies

Programs often share files. Instead of including these files in each package, a separate package can provide them for all programs.

To install a program that needs these files, you must also install the package containing them. This is called a package dependency. Specifying dependencies makes packages smaller and simpler by reducing duplicates.

When you install a program, its dependencies must also be installed. Most required dependencies are usually already installed, but a few extra ones might be needed. So, don't be surprised if several other packages are installed along with your chosen package. These are the necessary dependencies.

Package managers

Linux offers a comprehensive package management system for installing, upgrading, configuring, and removing software.

With package management, you can get access to an organized base of thousands of software packages along with having the ability to resolve dependencies and check for software updates.

Packages can be managed using either command-line utilities that can be easily automated by system administrators, or through a graphical interface.

Software channels/repositories

⚠️ Package management is different for different distros. Here, we are using Ubuntu.

Installing software is a bit different in Linux as compared to Windows and Mac.

Linux uses repositories to store software packages. A repository is a collection of software packages that are available for installation via a package manager.

A package manager also stores an index of all of the packages available from a repo. Sometimes the index is rebuilt to ensure that it is up to date and to know which packages have been upgraded or added to the channel since it last checked.

The generic process of downloading software from a repo looks something like this:

If we talk specifically about Ubuntu,

The index is fetched using apt update (apt is explained in the next section).
The required files and dependencies are requested according to the index using apt install.
The packages and dependencies are installed locally.
Dependencies and packages are updated when required using apt update and apt upgrade.

On Debian-based distros, you can find the list of repos (repositories) in /etc/apt/sources.list.

7.2. Installing a Package via Command Line

The apt command is a powerful command-line tool, which works with Ubuntu’s "Advanced Packaging Tool (APT)".

apt, along with the commands bundled with it, provides the means to install new software packages, upgrade existing software packages, update the package list index, and even upgrade the entire Ubuntu system.

To view the logs of the installation using apt, you can view the /var/log/dpkg.log file.

Following are the uses of the apt command:

Installing packages

For example, to install the htop package, you can use the following command:

sudo apt install htop

Updating the package list index

The package list index is a list of all the packages available in the repositories. To update the local package list index, you can use the following command:

sudo apt update

Upgrading the packages

Installed packages on your system can get updates containing bug fixes, security patches, and new features.

To upgrade the packages, you can use the following command:

sudo apt upgrade

Removing packages

To remove a package, like htop, you can use the following command:

sudo apt remove htop

7.3. Installing a Package via an Advanced Graphical Method – Synaptic

If you are not comfortable with the command line, you can use a GUI application to install packages. You can achieve the same results as the command line, but with a graphical interface.

Synaptic is a GUI package management application that helps in listing the installed packages, their status, pending updates, and so on. It offers custom filters to help you narrow down the search results.

You can also right-click on a package and view further details like the dependencies, maintainer, size, and the installed files.

7.4. Installing downloaded packages from a website

You may want to install a package you have downloaded from a website, rather than from a software repository. On Debian-based distros such as Ubuntu, these packages are distributed as .deb files.

Using dpkg to install packages: dpkg is a command-line tool used to install packages. To install a package with dpkg, open the Terminal and type the following:

cd directory
sudo dpkg -i package_name.deb

Note: Replace "directory" with the directory where the package is stored and "package_name" with the filename of the package.

Alternatively, you can right-click, select "Open With Other Application," and choose a GUI app of your choice.

💡 Tip: In Ubuntu, you can see a list of installed packages with dpkg --list.
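
In scripts, it is often useful to check whether a package is already installed before installing it. Here is a minimal sketch, assuming a Debian-based system where the dpkg-query tool is available (the is_installed function name is made up for this example):

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the package is fully installed.
is_installed() {
  local status
  # ${Status} is a dpkg-query format field, not a shell variable, hence the single quotes.
  status=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
  [[ $status == "install ok installed" ]]
}

if is_installed htop; then
  echo "htop is already installed"
else
  echo "htop is not installed; install it with: sudo apt install htop"
fi
```

On a system without dpkg-query, the function simply reports the package as not installed.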

Part 8: Advanced Linux Topics

8.1. User Management

There can be multiple users with varying levels of access in a system. In Linux, the root user has the highest level of access and can perform any operation on the system. Regular users have limited access and can only perform operations they have been granted permission to do.

What is a user?

A user account provides separation between different people and programs that can run commands.

Humans identify users by a name, as names are easy to work with. But the system identifies users by a unique number called the user ID (UID).

When human users log in using the provided username, they have to use a password to authenticate themselves.

User accounts form the foundations of system security. File ownership is also associated with user accounts and it enforces access control to the files. Every process has an associated user account that provides a layer of control for the admins.

There are three main types of user accounts:

Superuser: The superuser has complete access to the system. The name of the superuser is root. It has a UID of 0.
System user: System users are accounts used to run system services. They are not meant for human interaction.
Regular user: Regular users are human users who have access to the system.

The id command displays the user ID and group ID of the current user.

id
uid=1000(john) gid=1000(john) groups=1000(john),4(adm),24(cdrom),27(sudo),30(dip)... output truncated

To view the basic information of another user, pass the username as an argument to the id command.

id username

To view user-related information for all running processes, use the ps command with the aux options.

ps aux
# Output
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  16968  3920 ?        Ss   18:45   0:00 /sbin/init splash
root         2  0.0  0.0      0     0 ?        S    18:45   0:00 [kthreadd]

By default, systems use the /etc/passwd file to store user information.

Here is a line from the /etc/passwd file:

root:x:0:0:root:/root:/bin/bash

The /etc/passwd file contains the following information about each user:

Username: root – The username of the user account.
Password: x – A placeholder indicating that the hashed password for the user account is stored in the /etc/shadow file for security reasons.
User ID (UID): 0 – The unique numerical identifier for the user account.
Group ID (GID): 0 – The primary group identifier for the user account.
User Info: root – The real name for the user account.
Home directory: /root – The home directory for the user account.
Shell: /bin/bash – The default shell for the user account. A system user might use /sbin/nologin if interactive logins are not allowed for that user.
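
Because the fields are separated by colons, an entry is easy to split in a script. The sketch below parses the sample line shown above. The variable names are chosen for this example, and the entry is hard-coded rather than read from the real file:

```shell
#!/bin/bash
entry="root:x:0:0:root:/root:/bin/bash"

# Split the entry on ":" into seven variables, one per field.
IFS=: read -r username password uid gid user_info home_dir login_shell <<< "$entry"

echo "Username:       $username"
echo "UID / GID:      $uid / $gid"
echo "Home directory: $home_dir"
echo "Shell:          $login_shell"
```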

What is a group?

A group is a collection of user accounts that share access and resources. Groups have group names to identify them. The system identifies groups by a unique number called the group ID (GID).

By default, the information about groups is stored in the /etc/group file.

Here is an entry from the /etc/group file:

adm:x:4:syslog,john

Here is the breakdown of the fields in the given entry:

Group name: adm – The name of the group.
Password: x – The password for the group is stored in the /etc/gshadow file for security reasons. The password is optional and appears empty if not set.
Group ID (GID): 4 – The unique numerical identifier for the group.
Group members: syslog,john – The list of usernames that are members of the group. In this case, the group adm has two members: syslog and john.

In this specific entry, the group name is adm, the group ID is 4, and the group has two members: syslog and john. The password field is typically set to x to indicate that the group password is stored in the /etc/gshadow file.
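
A group entry can be split in the same way, with one extra step to separate the comma-separated member list into an array. This is a sketch using the sample entry above, with variable names chosen for the example:

```shell
#!/bin/bash
entry="adm:x:4:syslog,john"

# Split the entry on ":" first, then split the member list on ",".
IFS=: read -r group_name password gid member_list <<< "$entry"
IFS=, read -r -a members <<< "$member_list"

echo "Group $group_name (GID $gid) has ${#members[@]} members:"
for member in "${members[@]}"; do
  echo "- $member"
done
```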

The groups are further divided into 'primary' and 'supplementary' groups.

Primary Group: Each user is assigned one primary group by default. This group usually has the same name as the user and is created when the user account is made. Files and directories created by the user are typically owned by this primary group.
Supplementary Groups: These are extra groups a user can belong to in addition to their primary group. Users can be members of multiple supplementary groups. These groups give a user permissions for resources shared among the members of those groups, without changing the permissions granted to everyone else on the system. While a user must belong to one primary group, belonging to supplementary groups is optional.

Access control: finding and understanding file permission

File ownership can be viewed using the ls -l command. The first column in the output of the ls -l command shows the permissions of the file. Other columns show the owner of the file and the group that the file belongs to.

Let's have a closer look into the mode column:

Mode defines two things:

File type: The first character defines the type of the file. For regular files that contain simple data, it is a dash (-). For other, special file types the symbol is different. For a directory, which is a special file, it is d. Special files are treated differently by the OS.
Permission classes: The next set of characters defines the permissions for user, group, and others respectively.
– User: The owner of the file belongs to this class.
– Group: The members of the file’s group belong to this class.
– Other: Any users that are not part of the user or group classes belong to this class.

💡Tip: Directory ownership can be viewed using the ls -ld command.

How to Read Symbolic Permissions or the `rwx` permissions

The rwx representation is known as the Symbolic representation of permissions. In the set of permissions,

r stands for read. It is indicated in the first character of the triad.
w stands for write. It is indicated in the second character of the triad.
x stands for execution. It is indicated in the third character of the triad.
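
To make the layout concrete, the sketch below splits a mode string into its file type and three triads using Bash substring expansion. The sample mode string and the variable names are chosen for illustration:

```shell
#!/bin/bash
mode="-rw-rw-r--"   # example mode string as printed by ls -l

file_type=${mode:0:1}     # first character: file type
user_perms=${mode:1:3}    # next three: owner (user) triad
group_perms=${mode:4:3}   # next three: group triad
other_perms=${mode:7:3}   # last three: others triad

echo "type=$file_type user=$user_perms group=$group_perms other=$other_perms"
```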

Read:

For regular files, read permissions allow the file to be opened and read only. Users can't modify the file.

Similarly for directories, read permissions allow the listing of directory content without any modification in the directory.

Write:

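When files have write permissions, the user can modify the contents of the file and save it. Deleting or renaming a file is controlled by the write permission on the directory that contains it.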

For folders, write permissions enable a user to modify its contents (create, delete, and rename the files inside it), and modify the contents of files that the user has write permissions to.

Execute:

For files, execute permissions allow the user to run the file as a program or script. For directories, execute permissions allow the user to enter the directory and access details about the files in it.

Examples of permissions in Linux

Now that we know how to read permissions, let's see some examples.

-rwx------: A file that is only accessible and executable by its owner.
-rw-rw-r--: A file that is open to modification by its owner and group but not by others.
drwxrwx---: A directory that can be modified by its owner and group.

How to Change File Permissions and Ownership in Linux using `chmod` and `chown`

Now that we know the basics of ownerships and permissions, let's see how we can modify permissions using the chmod command.

Syntax of chmod:

chmod permissions filename

Where,

permissions can be read, write, execute or a combination of them.
filename is the name of the file for which the permissions need to change. This parameter can also be a list of files to change permissions in bulk.

We can change permissions using two modes:

Symbolic mode: this method uses symbols like u, g, o to represent users, groups, and others. Permissions are represented as r, w, x for read, write, and execute, respectively. You can modify permissions using +, - and =.
Absolute mode: this method represents permissions as three octal digits, each ranging from 0 to 7.

Now, let's see them in detail.

How to Change Permissions using Symbolic Mode

The table below summarizes the user representation:

USER REPRESENTATION	DESCRIPTION
u	user/owner
g	group
o	other

We can use mathematical operators to add, remove, and assign permissions. The table below shows the summary:

OPERATOR	DESCRIPTION
+	Adds a permission to a file or directory
-	Removes the permission
=	Sets the permission if not present before. Also overrides the permissions if set earlier.

Example:

Suppose I have a script and I want to make it executable for the owner of the file zaira.

Current file permissions are as follows:

Let's split the permissions like this:

To add execution rights (x) to owner (u) using symbolic mode, we can use the command below:

chmod u+x mymotd.sh

Output:

Now, we can see that the execution permissions have been added for owner zaira.

Additional examples for changing permissions via symbolic method:

Removing read and write permission for group and others: chmod go-rw filename.
Removing read permissions for others: chmod o-r filename.
Assigning write permission to group and overriding existing permission: chmod g=w filename.
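
The sketch below applies a few symbolic changes to a temporary file so you can follow how the mode changes at each step. It assumes GNU stat (the -c option, as shipped on Ubuntu) to print the result, and the starting permissions are chosen for the example:

```shell
#!/bin/bash
demo_file=$(mktemp)

chmod u=rw,g=rw,o=r "$demo_file"   # start from rw-rw-r--
chmod u+x "$demo_file"             # owner gains execute:       rwxrw-r--
chmod go-rw "$demo_file"           # group and others lose rw:  rwx------
chmod g=w "$demo_file"             # group set to exactly w:    rwx-w----

# stat -c is the GNU coreutils form; %A prints the symbolic mode.
final_mode=$(stat -c '%A' "$demo_file")
echo "$final_mode"
```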

How to Change Permissions using Absolute Mode

Absolute mode uses numbers to represent permissions and mathematical operators to modify them.

The below table shows how we can assign relevant permissions:

PERMISSION	PROVIDE PERMISSION
read	add 4
write	add 2
execute	add 1

Permissions can be revoked using subtraction. The below table shows how you can remove relevant permissions.

PERMISSION	REVOKE PERMISSION
read	subtract 4
write	subtract 2
execute	subtract 1

Example:

Set read (add 4) for user, read (add 4) and execute (add 1) for group, and only execute (add 1) for others.

chmod 451 file-name

This is how we performed the calculation:

Note that this is the same as r--r-x--x.
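
The calculation can be sketched in Bash: each triad contributes 4 for r, 2 for w, and 1 for x. The triad_to_digit helper below is a made-up name used only for this illustration:

```shell
#!/bin/bash
# Hypothetical helper: convert one triad such as "r-x" into its octal digit.
triad_to_digit() {
  local triad=$1 digit=0
  [[ ${triad:0:1} == r ]] && (( digit += 4 ))
  [[ ${triad:1:1} == w ]] && (( digit += 2 ))
  [[ ${triad:2:1} == x ]] && (( digit += 1 ))
  echo "$digit"
}

user_digit=$(triad_to_digit "r--")    # 4
group_digit=$(triad_to_digit "r-x")   # 4 + 1 = 5
other_digit=$(triad_to_digit "--x")   # 1
echo "chmod ${user_digit}${group_digit}${other_digit} file-name"
```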

Remove execution rights from other and group.

To remove execution from other and group, subtract 1 from the execute part of the last two digits.

Assign read, write and execute to user, read and execute to group and only read to others.

This would be the same as rwxr-xr--.
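
To check these examples on a real file, the sketch below applies each mode to a temporary file and records the result. It assumes GNU stat (the -c option, as shipped on Ubuntu), and it assumes that the starting point for removing execute rights is the 451 mode from the first example, which gives 440:

```shell
#!/bin/bash
demo_file=$(mktemp)

chmod 451 "$demo_file"
mode_451=$(stat -c '%a %A' "$demo_file")   # GNU stat: octal and symbolic mode

chmod 440 "$demo_file"   # 451 with execute (1) subtracted from group and others
mode_440=$(stat -c '%a %A' "$demo_file")

chmod 754 "$demo_file"   # user 4+2+1, group 4+1, others 4
mode_754=$(stat -c '%a %A' "$demo_file")

printf '%s\n' "$mode_451" "$mode_440" "$mode_754"
```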

How to Change Ownership using the `chown` Command

Next, we will learn how to change the ownership of a file. You can change the ownership of a file or folder using the chown command. In some cases, changing ownership requires sudo permissions.

Syntax of chown:

chown user filename

How to change user ownership with `chown`

Let's transfer the ownership from user zaira to user news.

sudo chown news mymotd.sh

Output:

How to change user and group ownership simultaneously

We can also use chown to change user and group simultaneously.

chown user:group filename

How to change directory ownership

You can change ownership recursively for the contents of a directory. The example below changes the owner of the /opt/script folder, and everything inside it, to the user admin.

chown -R admin /opt/script

How to change group ownership

In case we only need to change the group owner, we can use chown by preceding the group name by a colon :

chown :admins /opt/script
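
Changing ownership to a different user normally requires sudo, so the sketch below only demonstrates the user:group form safely: it sets a temporary file's owner and group to the current user's own numeric IDs and then reads them back. It assumes GNU stat for the -c option, and the variable names are illustrative:

```shell
#!/bin/bash
demo_file=$(mktemp)

# Numeric IDs of the current user and primary group (names work too).
my_uid=$(id -u)
my_gid=$(id -g)

# user:group form; changing to your own IDs needs no sudo.
chown "$my_uid:$my_gid" "$demo_file"

# GNU stat: %u and %g print the numeric owner and group of the file.
owner_ids=$(stat -c '%u:%g' "$demo_file")
echo "Owner and group IDs: $owner_ids"
```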

How to switch between users

You can switch between users using the su command.

[user01@host ~]$ su user02
Password:
[user02@host ~]$

How to gain superuser access

The superuser or the root user has the highest level of access on a Linux system. The root user can perform any operation on the system. The root user can access all files and directories, install and remove software, and modify or override system configurations.

With great power comes great responsibility. If the root user is compromised, someone can gain complete control over the system. It is advised to use the root user account only when necessary.

If you omit the username, the su command switches to the root user account by default.

[user01@host ~]$ su
Password:
[root@host ~]#

Another variation of the su command is su -. The plain su command switches to the root user account but keeps most of the current environment. The su - command starts a login shell, so it switches to the root user account and changes the environment variables to those of the target user.

Running commands with sudo

To run commands as the root user without switching to the root user account, you can use the sudo command. The sudo command allows you to run commands with elevated privileges.

Running commands with sudo is safer than running them while logged in as the root user. This is because only a specific set of users can be granted permission to run commands with sudo. This is defined in the /etc/sudoers file.

Also, sudo logs all commands that are run with it, providing an audit trail of who ran which commands and when.

In Ubuntu, you can find the audit logs here:

grep sudo /var/log/auth.log

If a user who does not have sudo access tries to use it, the attempt is flagged in the logs and a message like this is shown:

user01 is not in the sudoers file.  This incident will be reported.

Managing local user accounts

Creating users from the command line

The command used to add a new user is:

sudo useradd username

Depending on the distribution's defaults, this command may also set up the user's home directory (on Ubuntu, pass the -m option to create it) and create a private group named after the user. At this point, the account lacks a valid password, preventing the user from logging in until a password is created.

Modifying existing users

The usermod command is used to modify existing users. Here are some examples of the usermod command in Linux:

Change a user's login name:

sudo usermod -l newusername oldusername

Change a user's home directory:

sudo usermod -d /new/home/directory -m username

Add a user to a supplementary group:

sudo usermod -aG groupname username

Change a user's shell:

sudo usermod -s /bin/bash username

Lock a user's account:

sudo usermod -L username

Unlock a user's account:

sudo usermod -U username

Set an expiration date for a user account:

sudo usermod -e YYYY-MM-DD username

Change a user's user ID (UID):

sudo usermod -u newUID username

Change a user's primary group:

sudo usermod -g newgroup username

Remove a user from a supplementary group:

sudo gpasswd -d username groupname

Deleting users

The userdel command is used to delete a user account and related files from the system.

sudo userdel username: removes the user's details from /etc/passwd but keeps the user's home directory.
sudo userdel -r username: removes the user's details from /etc/passwd and also deletes the user's home directory.

Changing user passwords

The passwd command is used to change a user's password.

sudo passwd username: sets the initial password or changes the existing password of username. Running passwd without a username changes the password of the currently logged-in user.

8.2. Connecting to Remote Servers via SSH

Accessing remote servers is one of the essential tasks for system administrators. You can connect to different servers or access databases through your local machine and execute commands, all using SSH.

What is the SSH protocol?

SSH stands for Secure Shell. It is a cryptographic network protocol that allows secure communication between two systems.

The default port for SSH is 22.

The two participants while communicating via SSH are:

The server: the machine that you want access to.
The client: The system that you are accessing the server from.

Connection to a server follows these steps:

Initiate Connection: The client sends a connection request to the server.
Exchange of Keys: The server sends its public key to the client. Both agree on the encryption methods to use.
Session Key Generation: The client and server use the Diffie-Hellman key exchange to create a shared session key.
Client Authentication: The client logs in to the server using a password, private key, or another method.
Secure Communication: After authentication, the client and server communicate securely with encryption.

How to connect to a remote server using SSH?

The ssh command is the standard SSH client on Linux and is available by default on most distributions. It makes accessing servers quite easy and secure.

Here, we are talking about how the client would make a connection to the server.

Prior to connecting to a server, you need to have the following information:

The IP address or the domain name of the server.
Your username and password on the server.
The port number on which the server accepts SSH connections.

The basic syntax of the ssh command is:

ssh username@server_ip

For example, if your username is john and the server IP is 192.168.1.10, the command would be:

ssh john@192.168.1.10

After that, you'll be prompted to enter the secret password. Your screen will look similar to this:

john@192.168.1.10's password: 
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-70-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Fri Jun  5 10:17:32 UTC 2024

  System load:  0.08               Processes:           122
  Usage of /:   12.3% of 19.56GB   Users logged in:     1
  Memory usage: 53%                IP address for eth0: 192.168.1.10
  Swap usage:   0%

Last login: Fri Jun  5 09:34:56 2024 from 192.168.1.2
john@hostname:~$ # start entering commands

Now you can execute the relevant commands on the server 192.168.1.10.

⚠️ The default port for ssh is 22, which also makes it a common target, as attackers will usually try it first. Your server administrator may expose SSH on a different port and share it with you. To connect to a different port, use the -p flag.

ssh -p port_number username@server_ip
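As a small sketch of how these pieces fit together, the snippet below assembles the ssh command from variables and only prints it, so no server is actually contacted. The port 2222 is a made-up value for illustration; use whatever port your server administrator gives you.

```shell
# Hypothetical connection details (replace with your own)
user="john"
server_ip="192.168.1.10"
port_number=2222   # assumed non-default port, for illustration only

# Build the command and print it instead of running it (a dry run)
ssh_command="ssh -p $port_number $user@$server_ip"
echo "$ssh_command"
```

Once the printed command looks right, you can run it directly in your terminal.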

8.3. Advanced Log Parsing and Analysis

Log files, when configured, are generated by your system for a variety of useful reasons. They can be used to track system events, monitor system performance, and troubleshoot issues. They are specifically useful for system administrators where they can track application errors, network events, and user activity.

Here is an example of a log file:

# sample log file
2024-04-25 09:00:00 INFO Startup: Application starting
2024-04-25 09:01:00 INFO Config: Configuration loaded successfully
2024-04-25 09:02:00 DEBUG Database: Database connection established
2024-04-25 09:03:00 INFO User: New user registered (UserID: 1001)
2024-04-25 09:04:00 WARN Security: Attempted login with incorrect credentials (UserID: 1001)
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)
2024-04-25 09:06:00 INFO Email: Notification email sent (UserID: 1001)
2024-04-25 09:07:00 DEBUG API: API call with response time over threshold (Duration: 350ms)
2024-04-25 09:08:00 INFO Session: User session ended (UserID: 1001)
2024-04-25 09:09:00 INFO Shutdown: Application shutdown initiated

A log file usually contains the following columns:

Timestamp: The date and time when the event occurred.
Log Level: The severity of the event (INFO, DEBUG, WARN, ERROR).
Component: The component of the system that generated the event (Startup, Config, Database, User, Security, Network, Email, API, Session, Shutdown).
Message: A description of the event that occurred.
Additional Information: Additional information related to the event.

In real-time systems, log files tend to be thousands of lines long and are generated every second. They can be very wordy depending on the configuration. Every column in a log file is a piece of information that can be used to track down issues. This makes log files difficult to read and understand manually.

This is where log parsing comes in. Log parsing is the process of extracting useful information from log files. It involves breaking down the log files into smaller, more manageable pieces, and extracting the relevant information.

The filtered information can also be useful for creating alerts, reports, and dashboards.

In this section, you will explore some techniques for parsing log files in Linux.

Text extraction using `grep`

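grep is a standard command-line utility available on virtually every Linux system (it is a separate program, not a Bash built-in). Its name stands for "global regular expression print". grep is used to find the lines in files that match a string or pattern.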

Here are some common uses of grep:

Search for a specific string in a file:
```
 grep "search_string" filename
```
This command searches for "search_string" in the file named filename.
Search recursively in directories:
```
 grep -r "search_string" /path/to/directory
```
This command searches for "search_string" in all files within the specified directory and its subdirectories.
Ignore case while searching:
```
 grep -i "search_string" filename
```
This command performs a case-insensitive search for "search_string" in the file named filename.
Display line numbers with matching lines:
```
 grep -n "search_string" filename
```
This command shows the line numbers along with the matching lines in the file named filename.
Count the number of matching lines:
```
 grep -c "search_string" filename
```
This command counts the number of lines that contain "search_string" in the file named filename.
Invert match to display lines that do not match:
```
 grep -v "search_string" filename
```
This command displays all lines that do not contain "search_string" in the file named filename.
Search for a whole word:
```
 grep -w "word" filename
```
This command searches for the whole word "word" in the file named filename.
Use extended regular expressions:
```
 grep -E "pattern" filename
```
This command allows the use of extended regular expressions for more complex pattern matching in the file named filename.

💡 Tip: If there are multiple files in a folder, you can use the command below to list only the names of the files that contain the desired string. The -r flag is needed so that grep searches inside the directory.

# find the list of files containing the desired strings
grep -rl "String to Match" /path/to/directory
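To see a few of these flags working together, here is a minimal sketch that runs grep against a temporary copy of the sample log shown earlier in this section. The temporary file and the variable names are only there for the demonstration.

```shell
# Create a temporary copy of the sample log (mktemp picks the file name)
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2024-04-25 09:00:00 INFO Startup: Application starting
2024-04-25 09:01:00 INFO Config: Configuration loaded successfully
2024-04-25 09:02:00 DEBUG Database: Database connection established
2024-04-25 09:03:00 INFO User: New user registered (UserID: 1001)
2024-04-25 09:04:00 WARN Security: Attempted login with incorrect credentials (UserID: 1001)
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)
2024-04-25 09:06:00 INFO Email: Notification email sent (UserID: 1001)
2024-04-25 09:07:00 DEBUG API: API call with response time over threshold (Duration: 350ms)
2024-04-25 09:08:00 INFO Session: User session ended (UserID: 1001)
2024-04-25 09:09:00 INFO Shutdown: Application shutdown initiated
EOF

# Count the INFO entries
info_count=$(grep -c "INFO" "$log_file")
echo "INFO lines: $info_count"

# Show warnings and errors together, with their line numbers
problems=$(grep -n -E "WARN|ERROR" "$log_file")
echo "$problems"

# Count every line that is not a DEBUG entry
non_debug_count=$(grep -v -c "DEBUG" "$log_file")
echo "Non-DEBUG lines: $non_debug_count"

# Remove the temporary file
rm -f "$log_file"
```

Combining flags like this (for example -n with -E, or -v with -c) is usually quicker than chaining several separate grep commands.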

Text extraction using `sed`

sed stands for "stream editor". It processes data stream-wise, meaning it reads data one line at a time. sed allows you to search for patterns and perform actions on the lines that match those patterns.

Basic syntax of sed:

The basic syntax of sed is as follows:

sed [options] 'command' file_name

Here, command is the operation to perform on the text data, such as substitution, deletion, or insertion. file_name is the name of the file you want to process.

sed usage:

1. Substitution:

The s command is used to replace text. The old-text is replaced with new-text:

sed 's/old-text/new-text/' filename

By default, only the first match on each line is replaced. Add the g flag to replace every match. For example, to change all instances of "error" to "warning" in the log file system.log:

sed 's/error/warning/g' system.log

2. Printing lines containing a specific pattern:

Using sed to filter and display lines that match a specific pattern:

sed -n '/pattern/p' filename

For instance, to find all lines containing "ERROR":

sed -n '/ERROR/p' system.log

3. Deleting lines containing a specific pattern:

You can delete lines from the output that match a specific pattern:

sed '/pattern/d' filename

For example, to remove all lines containing "DEBUG":

sed '/DEBUG/d' system.log

4. Extracting specific fields from a log line:

You can use regular expressions to extract parts of lines. Suppose each log line starts with a date in the format "YYYY-MM-DD". You could extract just the date from each line:

sed -n 's/^\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' system.log
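The sketch below chains a few of these sed operations on a small temporary log, so you can see the effect of each one. The CRITICAL label and the variable names are invented for the example.

```shell
# Create a small temporary log for the demonstration
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2024-04-25 09:02:00 DEBUG Database: Database connection established
2024-04-25 09:04:00 WARN Security: Attempted login with incorrect credentials (UserID: 1001)
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)
EOF

# Two commands in one run: delete DEBUG lines, then relabel ERROR as CRITICAL
cleaned=$(sed -e '/DEBUG/d' -e 's/ERROR/CRITICAL/g' "$log_file")
echo "$cleaned"

# On lines containing ERROR, keep only the second field (the time)
error_times=$(sed -n '/ERROR/s/^[^ ]* \([^ ]*\) .*/\1/p' "$log_file")
echo "$error_times"

# Remove the temporary file
rm -f "$log_file"
```

Note that sed prints its result to the terminal and leaves the original file untouched, which makes it safe to experiment with.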

Text parsing with `awk`

awk has the ability to easily split each line into fields. It's well-suited for processing structured text like log files.

Basic syntax of awk

The basic syntax of awk is:

awk 'pattern { action }' file_name

Here, pattern is a condition that must be met for the action to be performed. If the pattern is omitted, the action is performed on every line.

In the coming examples, you'll use the sample log file from earlier in this section (saved with one empty line at the end):

2024-04-25 09:00:00 INFO Startup: Application starting
2024-04-25 09:01:00 INFO Config: Configuration loaded successfully
2024-04-25 09:02:00 DEBUG Database: Database connection established
2024-04-25 09:03:00 INFO User: New user registered (UserID: 1001)
2024-04-25 09:04:00 WARN Security: Attempted login with incorrect credentials (UserID: 1001)
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)
2024-04-25 09:06:00 INFO Email: Notification email sent (UserID: 1001)
2024-04-25 09:07:00 DEBUG API: API call with response time over threshold (Duration: 350ms)
2024-04-25 09:08:00 INFO Session: User session ended (UserID: 1001)
2024-04-25 09:09:00 INFO Shutdown: Application shutdown initiated

Accessing columns using awk

The fields in awk (separated by whitespace by default) can be accessed using $1, $2, $3, and so on. $0 refers to the whole line.

zaira@zaira-ThinkPad:~$ awk '{ print $1 }' sample.log
# output
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25
2024-04-25

zaira@zaira-ThinkPad:~$ awk '{ print $2 }' sample.log
# output
09:00:00
09:01:00
09:02:00
09:03:00
09:04:00
09:05:00
09:06:00
09:07:00
09:08:00
09:09:00

Print lines containing a specific pattern (for example, ERROR)

awk '/ERROR/ { print $0 }' logfile.log

# output
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)

This prints all lines that contain "ERROR".

Extract the first two fields (date and time)

awk '{ print $1, $2 }' logfile.log
# output
2024-04-25 09:00:00
2024-04-25 09:01:00
2024-04-25 09:02:00
2024-04-25 09:03:00
2024-04-25 09:04:00
2024-04-25 09:05:00
2024-04-25 09:06:00
2024-04-25 09:07:00
2024-04-25 09:08:00
2024-04-25 09:09:00

This will extract the first two fields from each line, which in this case would be the date and time.

Summarize occurrences of each log level

awk '{ count[$3]++ } END { for (level in count) print level, count[level] }' logfile.log

# output
 1
WARN 1
ERROR 1
DEBUG 2
INFO 6

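The output is a summary of the number of occurrences of each log level. The order of the lines is not guaranteed and can vary. The first line, a count with no level name, most likely comes from a blank line in the file, whose third field is empty.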

Filter lines by a field value (for example, where the 3rd field is INFO)

awk '$3 == "INFO" { print }' sample.log

# output
2024-04-25 09:00:00 INFO Startup: Application starting
2024-04-25 09:01:00 INFO Config: Configuration loaded successfully
2024-04-25 09:03:00 INFO User: New user registered (UserID: 1001)
2024-04-25 09:06:00 INFO Email: Notification email sent (UserID: 1001)
2024-04-25 09:08:00 INFO Session: User session ended (UserID: 1001)
2024-04-25 09:09:00 INFO Shutdown: Application shutdown initiated

This command extracts all lines where the 3rd field is "INFO". Note the double equals sign: $3 == "INFO" compares the field, while $3 = "INFO" would overwrite the field on every line.

💡 Tip: By default, awk splits fields on whitespace (spaces and tabs). If your log file uses a different separator, you can specify it using the -F option. For example, if your log file uses a colon as a separator, you can use awk -F: '{ print $1 }' logfile.log to extract the first field.
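Here is a short sketch that puts the awk ideas from this section together: a condition on a field, a per-level count, and a custom separator. It runs on a shortened temporary copy of the sample log, and the variable names are only for illustration.

```shell
# Create a shortened temporary copy of the sample log
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2024-04-25 09:00:00 INFO Startup: Application starting
2024-04-25 09:02:00 DEBUG Database: Database connection established
2024-04-25 09:04:00 WARN Security: Attempted login with incorrect credentials (UserID: 1001)
2024-04-25 09:05:00 ERROR Network: Network timeout on request (ReqID: 456)
2024-04-25 09:09:00 INFO Shutdown: Application shutdown initiated
EOF

# Condition on a field: print time, level, and component of non-INFO entries
non_info=$(awk '$3 != "INFO" { print $2, $3, $4 }' "$log_file")
echo "$non_info"

# Count entries per level, then pipe through sort for a predictable order
level_counts=$(awk '{ count[$3]++ } END { for (level in count) print level, count[level] }' "$log_file" | sort)
echo "$level_counts"

# Custom separator: split the first line on ": " and print the message part
first_message=$(awk -F': ' 'NR == 1 { print $2 }' "$log_file")
echo "$first_message"

# Remove the temporary file
rm -f "$log_file"
```

Piping the counts through sort is a simple way to get stable output, since awk does not promise any particular order when looping over an array.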

Parsing log files with `cut`

The cut command is a simple yet powerful command used to extract sections of text from each line of input. As log files are structured and each field is delimited by a specific character, such as a space, tab, or a custom delimiter, cut does a very good job of extracting those specific fields.

The basic syntax of the cut command is:

cut [options] [file]

Some commonly used options for the cut command:

-d : Specifies a delimiter used as the field separator.
-f : Selects the fields to be displayed.
-c : Specifies character positions.

For example, the command below would extract the first field (separated by a space) from each line of the log file:

cut -d ' ' -f 1 logfile.log

Examples of using cut for log parsing

Assume you have a log file structured as follows, where fields are space-separated:

2024-04-25 08:23:01 INFO 192.168.1.10 User logged in successfully.
2024-04-25 08:24:15 WARNING 192.168.1.10 Disk usage exceeds 90%.
2024-04-25 08:25:02 ERROR 10.0.0.5 Connection timed out.
...

cut can be used in the following ways:

Extracting the time from each log entry:

cut -d ' ' -f 2 system.log

# Output
08:23:01
08:24:15
08:25:02
...

This command uses a space as a delimiter and selects the second field, which is the time component of each log entry.

Extracting the IP addresses from the logs:

cut -d ' ' -f 4 system.log

# Output
192.168.1.10
192.168.1.10
10.0.0.5

This command extracts the fourth field, which is the IP address from each log entry.

Extracting log levels (INFO, WARNING, ERROR):

cut -d ' ' -f 3 system.log

# Output
INFO
WARNING
ERROR

This extracts the third field which contains the log level.

Combining cut with other commands:

The output of other commands can be piped to the cut command. Let's say you want to filter logs before cutting. You can use grep to extract lines containing "ERROR" and then use cut to get specific information from those lines:

grep "ERROR" system.log | cut -d ' ' -f 1,2 

# Output
2024-04-25 08:25:02

This command first filters lines that include "ERROR", then extracts the date and time from these lines.

Extracting multiple fields:

It is possible to extract multiple fields at once by specifying a range or a comma-separated list of fields:

cut -d ' ' -f 1,2,3 system.log

# Output
2024-04-25 08:23:01 INFO
2024-04-25 08:24:15 WARNING
2024-04-25 08:25:02 ERROR
...

The above command extracts the first three fields from each log entry: the date, the time, and the log level.
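The following sketch shows the range forms and the -c option on the same three log lines. The temporary file and variable names are just for the demonstration.

```shell
# Create a temporary log with the entries used in this section
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2024-04-25 08:23:01 INFO 192.168.1.10 User logged in successfully.
2024-04-25 08:24:15 WARNING 192.168.1.10 Disk usage exceeds 90%.
2024-04-25 08:25:02 ERROR 10.0.0.5 Connection timed out.
EOF

# A field range: fields 1 through 3 (same result as -f 1,2,3)
first_three=$(cut -d ' ' -f 1-3 "$log_file")
echo "$first_three"

# An open-ended range: everything from field 5 onward (the message)
messages=$(cut -d ' ' -f 5- "$log_file")
echo "$messages"

# Character positions: the first 10 characters of each line are the date
dates=$(cut -c 1-10 "$log_file")
echo "$dates"

# Remove the temporary file
rm -f "$log_file"
```

Keep in mind that cut treats every single delimiter as a field boundary, so it works best on logs with exactly one space between fields. For logs padded with multiple spaces, awk is usually the better fit.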

Parsing log files with `sort` and `uniq`

Sorting and removing duplicates are common operations when working with log files. The sort and uniq commands are powerful commands used to sort and remove duplicates from the input, respectively.

Basic syntax of sort

The sort command organizes lines of text alphabetically or numerically.

sort [options] [file]

Some key options for the sort command:

-n: Sorts the file assuming the contents are numerical.
-r: Reverses the order of sort.
-k: Specifies a key or column number to sort on.
-u: Sorts and removes duplicate lines.

The uniq command is used to filter or count and report repeated lines in a file. It only compares adjacent lines, which is why its input is usually sorted first.

The syntax of uniq is:

uniq [options] [input_file] [output_file]

Some key options for the uniq command are:

-c: Prefixes lines by the number of occurrences.
-d: Only prints duplicate lines.
-u: Only prints unique lines.

Examples of using `sort` and `uniq` together for log parsing

Let's assume the following example log entries for these demonstrations:

2024-04-25 INFO User logged in successfully.
2024-04-25 WARNING Disk usage exceeds 90%.
2024-04-26 ERROR Connection timed out.
2024-04-25 INFO User logged in successfully.
2024-04-26 INFO Scheduled maintenance.
2024-04-26 ERROR Connection timed out.

Sorting log entries by date:

sort system.log

# Output
2024-04-25 INFO User logged in successfully.
2024-04-25 INFO User logged in successfully.
2024-04-25 WARNING Disk usage exceeds 90%.
2024-04-26 ERROR Connection timed out.
2024-04-26 ERROR Connection timed out.
2024-04-26 INFO Scheduled maintenance.

This sorts the log entries alphabetically, which effectively sorts them by date if the date is the first field.

Sorting and removing duplicates:

sort system.log | uniq

# Output
2024-04-25 INFO User logged in successfully.
2024-04-25 WARNING Disk usage exceeds 90%.
2024-04-26 ERROR Connection timed out.
2024-04-26 INFO Scheduled maintenance.

This command sorts the log file and pipes it to uniq, removing duplicate lines.

Counting occurrences of each line:

sort system.log | uniq -c

# Output
2 2024-04-25 INFO User logged in successfully.
1 2024-04-25 WARNING Disk usage exceeds 90%.
2 2024-04-26 ERROR Connection timed out.
1 2024-04-26 INFO Scheduled maintenance.

This sorts the log entries and then counts each unique line. According to the output, the line '2024-04-25 INFO User logged in successfully.' appeared 2 times in the file.

Identifying unique log entries:

sort system.log | uniq -u

# Output

2024-04-25 WARNING Disk usage exceeds 90%.
2024-04-26 INFO Scheduled maintenance.

This command shows only the lines that appear exactly once in the file.

Sorting by log level:

sort -k2,2 system.log

# Output
2024-04-26 ERROR Connection timed out.
2024-04-26 ERROR Connection timed out.
2024-04-25 INFO User logged in successfully.
2024-04-25 INFO User logged in successfully.
2024-04-26 INFO Scheduled maintenance.
2024-04-25 WARNING Disk usage exceeds 90%.

This sorts the entries based on the second field only, which is the log level. Writing the key as -k2,2 matters: a plain -k2 would use everything from the second field to the end of the line as the sort key.
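A common pattern is to combine cut, sort, and uniq into one pipeline to build a quick frequency report. Here is a minimal sketch on the same six log entries, with illustrative variable names:

```shell
# Create a temporary log with the entries used in this section
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2024-04-25 INFO User logged in successfully.
2024-04-25 WARNING Disk usage exceeds 90%.
2024-04-26 ERROR Connection timed out.
2024-04-25 INFO User logged in successfully.
2024-04-26 INFO Scheduled maintenance.
2024-04-26 ERROR Connection timed out.
EOF

# Take the level (field 2), group identical values, count them,
# then order the counts from highest to lowest
level_summary=$(cut -d ' ' -f 2 "$log_file" | sort | uniq -c | sort -rn)
echo "$level_summary"

# Show only the lines that occur more than once
repeated=$(sort "$log_file" | uniq -d)
echo "$repeated"

# Remove the temporary file
rm -f "$log_file"
```

The first sort groups identical levels next to each other so that uniq can count them, and the final sort -rn ranks the counts numerically in descending order.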

8.4. Managing Linux Processes via Command Line

A process is a running instance of a program. A process consists of:

An address space of the allocated memory.
Process states.
Properties such as ownership, security attributes, and resource usage.

A process also has an environment that consists of:

Local and global variables
The current scheduling context
Allocated system resources, such as network ports or file descriptors.

When you run the ls -l command, the operating system creates a new process to execute the command. The process has an ID, a state, and runs until the command completes.

Understanding process creation and lifecycle

In Ubuntu, all processes originate from the initial system process called systemd, which is the first process started by the kernel during boot.

The systemd process has a process ID (PID) of 1 and is responsible for initializing the system, starting and managing other processes, and handling system services. All other processes on the system are descendants of systemd.

A parent process duplicates its own address space (fork) to create a new (child) process structure. Each new process is assigned a unique process ID (PID) for tracking and security purposes. The PID and the parent's process ID (PPID) are part of the new process environment. Any process can create a child process.

Through the fork routine, a child process inherits security identities, previous and current file descriptors, port and resource privileges, environment variables, and program code. A child process may then execute its own program code.

Typically, a parent process sleeps while the child process runs, setting a request (wait) to be notified when the child completes.

Upon exiting, the child process has already closed or discarded its resources and environment. The only remaining resource, known as a zombie, is an entry in the process table. The parent, signaled awake when the child exits, cleans the process table of the child's entry, thus freeing the last resource of the child process. The parent process then continues executing its own program code.
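You can watch this parent and child relationship from the shell itself. In the sketch below, the shell plays the parent, a background subshell plays the child, and wait collects the child's exit status (which is also what clears its process table entry). The exit status 3 is an arbitrary value chosen for the demo.

```shell
# The current shell acts as the parent; $$ holds its PID
parent_pid=$$

# Start a child in the background. The subshell sleeps briefly,
# then exits with status 3 (an arbitrary value for this demo).
( sleep 1; exit 3 ) &
child_pid=$!   # $! holds the PID of the most recent background process

echo "parent $parent_pid started child $child_pid"

# wait blocks until the child exits and returns the child's exit status.
# The "|| child_status=$?" form records a non-zero status without
# aborting scripts that run with "set -e".
child_status=0
wait "$child_pid" || child_status=$?
echo "child exited with status $child_status"
```

If the parent never called wait, the finished child would linger as a zombie until the parent itself exited.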

Understanding process states

Processes in Linux assume different states during their lifecycle. The state of a process indicates what the process is currently doing and how it is interacting with the system. The processes transition between states based on their execution status and the system's scheduling algorithm.

The processes in a Linux system can be in one of the following states:

State	Description
(new)	Initial state when a process is created via a fork system call.
Runnable (ready) (R)	Process is ready to run and waiting to be scheduled on a CPU.
Running (user) (R)	Process is executing in user mode, running user applications.
Running (kernel) (R)	Process is executing in kernel mode, handling system calls or hardware interrupts.
Sleeping (S)	Process is waiting for an event (for example, I/O operation) to complete and can be easily awakened.
Sleeping (uninterruptible) (D)	Process is in an uninterruptible sleep state, waiting for a specific condition (usually I/O) to complete, and cannot be interrupted by signals.
Sleeping (killable) (K)	Process is in a killable sleep: it behaves like the uninterruptible D state, except that it can still respond to a signal telling it to be killed.
Sleeping (idle) (I)	Process is idle, not doing any work, and waiting for an event to occur.
Stopped (T)	Process execution has been stopped, typically by a signal, and can be resumed later.
Zombie (Z)	Process has completed execution but still has an entry in the process table, waiting for its parent to read its exit status.

The processes transition between these states in the following ways:

Transition	Description
Fork	Creates a new process from a parent process, transitioning from (new) to Runnable (ready) (R).
Schedule	Scheduler selects a runnable process, transitioning it to Running (user) or Running (kernel) state.
Run	Process transitions from Runnable (ready) (R) to Running (kernel) (R) when scheduled for execution.
Preempt or Reschedule	Process can be preempted or rescheduled, moving it back to Runnable (ready) (R) state.
Syscall	Process makes a system call, transitioning from Running (user) (R) to Running (kernel) (R).
Return	Process completes a system call and returns to Running (user) (R).
Wait	Process waits for an event, transitioning from Running (kernel) (R) to one of the Sleeping states (S, D, K, or I).
Event or Signal	Process is awakened by an event or signal, moving it from a Sleeping state back to Runnable (ready) (R).
Suspend	Process is suspended, transitioning from Running (kernel) or Runnable (ready) to Stopped (T).
Resume	Process is resumed, moving from Stopped (T) back to Runnable (ready) (R).
Exit	Process terminates, transitioning from Running (user) or Running (kernel) to Zombie (Z).
Reap	Parent process reads the exit status of the zombie process, removing it from the process table.
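A few of these transitions are easy to observe yourself. The sketch below, which assumes the ps command from the procps package is available, starts a sleep process, suspends it, resumes it, and reads its state letter after each step. The one-second pauses are only there to give the system time to apply each change.

```shell
# Start a long-running child in the background
sleep 300 &
pid=$!
sleep 1

# A waiting process normally shows S (interruptible sleep)
state_sleeping=$(ps -o stat= -p "$pid")

# Suspend: SIGSTOP moves the process to the Stopped (T) state
kill -STOP "$pid"
sleep 1
state_stopped=$(ps -o stat= -p "$pid")

# Resume: SIGCONT makes it runnable again, and sleep goes back to waiting
kill -CONT "$pid"
sleep 1
state_resumed=$(ps -o stat= -p "$pid")

echo "before: $state_sleeping, stopped: $state_stopped, resumed: $state_resumed"

# Clean up the demo process
kill "$pid"
wait "$pid" 2>/dev/null || true
```

The state may be followed by extra characters (such as + or N), so it is the first letter that tells you which state the process is in.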

How to view processes

You can use the ps command along with a combination of options to view processes on a Linux system. The ps command is used to display information about a selection of active processes. For example, ps aux displays all processes running on the system.

zaira@zaira:~$ ps aux
# Output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 168140 11352 ?        Ss   May21   0:18 /sbin/init splash
root           2  0.0  0.0      0     0 ?        S    May21   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   May21   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   May21   0:00 [rcu_par_gp]
root           5  0.0  0.0      0     0 ?        I<   May21   0:00 [slub_flushwq]
root           6  0.0  0.0      0     0 ?        I<   May21   0:00 [netns]
root          11  0.0  0.0      0     0 ?        I<   May21   0:00 [mm_percpu_wq]
root          12  0.0  0.0      0     0 ?        I    May21   0:00 [rcu_tasks_kthread]
root          13  0.0  0.0      0     0 ?        I    May21   0:00 [rcu_tasks_rude_kthread]
*... output truncated ....*

The output above shows a snapshot of the currently running processes on the system. Each row represents a process with the following columns:

USER: The user who owns the process.
PID: The process ID.
%CPU: The CPU usage of the process.
%MEM: The memory usage of the process.
VSZ: The virtual memory size of the process.
RSS: The resident set size, that is the non-swapped physical memory that a task has used.
TTY: The controlling terminal of the process. A ? indicates no controlling terminal.
STAT: The process state.
- R: Running or runnable
- S: Interruptible sleep (waiting for an event to complete)
- I: Idle kernel thread
- D: Uninterruptible sleep (usually IO)
- T: Stopped (either by a job control signal or because it is being traced)
- Z: Zombie (terminated but not reaped by its parent)
- Extra characters can follow the state letter. For example, in Ss the first S indicates the sleeping state and the second s marks a session leader (a process that has started a session, leads a group of processes, and can control terminal signals). A < marks a high-priority process.
START: The starting time or date of the process.
TIME: The cumulative CPU time.
COMMAND: The command that started the process.

Background and foreground processes

In this section, you'll learn how you can control jobs by running them in the background or foreground.

A job is a process that is started by a shell. When you run a command in the terminal, it is considered a job. A job can run in the foreground or the background.

To demonstrate control, you'll first create 3 processes and then run them in the background. After that, you'll list the processes and alternate them between the foreground and background. You'll see how to put them to sleep or exit completely.

Create Three Processes

Open a terminal and start three long-running processes. Use the sleep command, which keeps the process running for a specified number of seconds.

# run sleep command for 300, 400, and 500 seconds
sleep 300 &
sleep 400 &
sleep 500 &

The & at the end of each command moves the process to the background.

Display Background Jobs

Use the jobs command to display the list of background jobs.

jobs

The output should look something like this:

jobs
[1]   Running                 sleep 300 &
[2]-  Running                 sleep 400 &
[3]+  Running                 sleep 500 &

Bring a Background Job to the Foreground

To bring a background job to the foreground, use the fg command followed by the job number. For example, to bring the first job (sleep 300) to the foreground:

fg %1

This will bring job 1 to the foreground.

Move the Foreground Job Back to the Background

While the job is running in the foreground, you can suspend it by pressing Ctrl+Z. Once it is suspended, you can move it back to the background.

A suspended job will look like this:

zaira@zaira:~$ fg %1
sleep 300

^Z
[1]+  Stopped                 sleep 300

zaira@zaira:~$ jobs
# suspended job 
[1]+  Stopped                 sleep 300
[2]   Running                 sleep 400 &
[3]-  Running                 sleep 500 &

Now use the bg command to resume the job with ID 1 in the background.

# Press Ctrl+Z to suspend the foreground job
# Then, resume it in the background
bg %1

Display the jobs again

jobs
[1]   Running                 sleep 300 &
[2]-  Running                 sleep 400 &
[3]+  Running                 sleep 500 &

In this exercise, you:

Started three background processes using sleep commands.
Used jobs to display the list of background jobs.
Brought a job to the foreground with fg %job_number.
Suspended the job with Ctrl+Z and moved it back to the background with bg %job_number.
Used jobs again to verify the status of the background jobs.

Now you know how to control jobs.

Killing processes

It is possible to terminate an unresponsive or unwanted process using the kill command. The kill command sends a signal to a process ID, asking it to terminate.

A number of signals can be sent with the kill command. Use kill -l to list them:

# Signals available with kill

kill -l
 1) SIGHUP     2) SIGINT     3) SIGQUIT     4) SIGILL     5) SIGTRAP
 6) SIGABRT     7) SIGBUS     8) SIGFPE     9) SIGKILL    10) SIGUSR1
11) SIGSEGV    12) SIGUSR2    13) SIGPIPE    14) SIGALRM    15) SIGTERM
16) SIGSTKFLT    17) SIGCHLD    18) SIGCONT    19) SIGSTOP    20) SIGTSTP
21) SIGTTIN    22) SIGTTOU    23) SIGURG
... output truncated ...

Here are some examples of the kill command in Linux:

Kill a process by PID (Process ID):
```
 kill 1234
```
This command sends the default SIGTERM signal to the process with PID 1234, requesting it to terminate.
Kill a process by name:
```
 pkill process_name
```
This command sends the default SIGTERM signal to all processes with the specified name.
Forcefully kill a process:
```
 kill -9 1234
```
This command sends the SIGKILL signal to the process with PID 1234, forcefully terminating it.
Send a specific signal to a process:
```
 kill -s SIGSTOP 1234
```
This command sends the SIGSTOP signal to the process with PID 1234, stopping it.
Kill all processes owned by a specific user:
```
 pkill -u username
```
This command sends the default SIGTERM signal to all processes owned by the specified user.

These examples demonstrate various ways to use the kill command to manage processes in a Linux environment.
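To see the difference between a graceful and a forced termination, here is a small sketch that kills two throwaway sleep processes and inspects how they ended. It relies on the convention used by shells such as bash and dash of reporting a process killed by a signal as 128 plus the signal number.

```shell
# Start two long-running demo processes
sleep 300 &
pid_term=$!
sleep 300 &
pid_kill=$!

# kill -0 sends no signal; it only checks that the process exists
kill -0 "$pid_term" && echo "process $pid_term is alive"

# Graceful request: the default signal is SIGTERM (15)
kill "$pid_term"
status_term=0
wait "$pid_term" 2>/dev/null || status_term=$?

# Forced termination: SIGKILL (9) cannot be caught or ignored
kill -9 "$pid_kill"
status_kill=0
wait "$pid_kill" 2>/dev/null || status_kill=$?

# 143 = 128 + 15 (SIGTERM), 137 = 128 + 9 (SIGKILL)
echo "SIGTERM exit status: $status_term"
echo "SIGKILL exit status: $status_kill"
```

In practice, try a plain kill first and reach for kill -9 only when the process does not respond, since SIGKILL gives the program no chance to clean up.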

The table below summarizes the most common kill command options and signals used in Linux for managing processes.

Command / Option	Signal	Description
`kill`	`SIGTERM`	Requests the process to terminate gracefully (default signal).
`kill -9`	`SIGKILL`	Forces the process to terminate immediately without cleanup.
`kill -SIGKILL`	`SIGKILL`	Forces the process to terminate immediately without cleanup.
`kill -15`	`SIGTERM`	Explicitly sends the `SIGTERM` signal to request graceful termination.
`kill -SIGTERM`	`SIGTERM`	Explicitly sends the `SIGTERM` signal to request graceful termination.
`kill -1`	`SIGHUP`	Traditionally means "hang up"; can be used to reload configuration files.
`kill -SIGHUP`	`SIGHUP`	Traditionally means "hang up"; can be used to reload configuration files.
`kill -2`	`SIGINT`	Requests the process to terminate (same as pressing `Ctrl+C` in terminal).
`kill -SIGINT`	`SIGINT`	Requests the process to terminate (same as pressing `Ctrl+C` in terminal).
`kill -3`	`SIGQUIT`	Causes the process to terminate and produce a core dump for debugging.
`kill -SIGQUIT`	`SIGQUIT`	Causes the process to terminate and produce a core dump for debugging.
`kill -19`	`SIGSTOP`	Pauses the process.
`kill -SIGSTOP`	`SIGSTOP`	Pauses the process.
`kill -18`	`SIGCONT`	Resumes a paused process.
`kill -SIGCONT`	`SIGCONT`	Resumes a paused process.
`killall`	`SIGTERM` (default)	Sends a signal to all processes with the given name.
`killall -9`	`SIGKILL`	Force kills all processes with the given name.
`pkill`	`SIGTERM` (default)	Sends a signal to processes based on a pattern match.
`pkill -9`	`SIGKILL`	Force kills all processes matching the pattern.
`xkill`	None	Graphical utility that lets you click on a window to force the X server to close that program's connection, which usually ends the program.

8.5. Standard Input and Output Streams in Linux

Reading an input and writing an output is an essential part of understanding the command line and shell scripting. In Linux, every process has three default streams:

Standard Input (stdin): This stream is used for input, typically from the keyboard. When a program reads from stdin, it receives data entered by the user or redirected from a file. The file descriptor for stdin is 0. (A file descriptor is a unique identifier that the operating system assigns to an open file in order to keep track of it.)
Standard Output (stdout): This is the default output stream where a process writes its output. By default, the standard output is the terminal. The output can also be redirected to a file or another program. The file descriptor for stdout is 1.
Standard Error (stderr): This is the default error stream where a process writes its error messages. By default, the standard error is the terminal, allowing error messages to be seen even if stdout is redirected. The file descriptor for stderr is 2.

Redirection and Pipelines

Redirection: You can redirect the error and output streams to files or other commands. For example:

# Redirecting stdout to a file
ls > output.txt

# Redirecting stderr to a file
ls non_existent_directory 2> error.txt

# Redirecting both stdout and stderr to a file
ls non_existent_directory > all_output.txt 2>&1

In the last command,

ls non_existent_directory: lists the contents of a directory named non_existent_directory. Since this directory does not exist, ls will generate an error message.
> all_output.txt: The > operator redirects the standard output (stdout) of the ls command to the file all_output.txt. If the file does not exist, it will be created. If it does exist, its contents will be overwritten.
2>&1: Here, 2 represents the file descriptor for standard error (stderr). &1 represents the file descriptor for standard output (stdout). The & character is used to specify that 1 is not the file name but a file descriptor.

So, 2>&1 means "redirect stderr (2) to wherever stdout (1) is currently going," which in this case is the file all_output.txt. Therefore, both the output (if there were any) and the error message from ls will be written to all_output.txt.
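The sketch below makes the two streams visible by using a small helper function, emit, that writes one line to each. The function and the temporary files are invented for this demonstration. The last example also shows that the order of redirections matters.

```shell
# A small helper that writes one line to stdout and one to stderr
emit() {
    echo "normal output"
    echo "error output" >&2
}

out_file=$(mktemp)
err_file=$(mktemp)
both_file=$(mktemp)

# Send each stream to its own file
emit > "$out_file" 2> "$err_file"

# Send both streams to the same file: stdout first, then stderr follows it
emit > "$both_file" 2>&1

# Order matters: here stderr is pointed at the current stdout (the one
# captured by the command substitution) BEFORE stdout goes to /dev/null,
# so only the error line is captured
only_errors=$(emit 2>&1 > /dev/null)

# Read the results back and remove the temporary files
out_content=$(cat "$out_file")
err_content=$(cat "$err_file")
both_lines=$(wc -l < "$both_file" | tr -d ' ')
rm -f "$out_file" "$err_file" "$both_file"

echo "stdout file:   $out_content"
echo "stderr file:   $err_content"
echo "combined file: $both_lines lines"
echo "captured:      $only_errors"
```

Redirections are processed from left to right, which is why > file 2>&1 sends both streams to the file while 2>&1 > file does not.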

Pipelines:

You can use pipes (|) to pass the output of one command as the input to another:

ls | grep image
# Output
image-10.png
image-11.png
image-12.png
image-13.png
... Output truncated ...

8.6. Automation in Linux – Automate Tasks with Cron Jobs

Cron is a powerful utility for job scheduling that is available in Unix-like operating systems. By configuring cron, you can set up automated jobs to run on a daily, weekly, monthly, or other specific time basis. The automation capabilities provided by cron play a crucial role in Linux system administration.

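The cron daemon (a daemon is a program that runs in the background) provides this functionality. It is named cron on Ubuntu and crond on some other distributions. The daemon reads the crontab (cron table) files to find out which commands or scripts to run and when to run them.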

By using a specific syntax, you can configure a cron job to schedule scripts or other commands to run automatically.

What are cron jobs in Linux?

Any task that you schedule through cron is called a cron job.

Now, let's see how cron jobs work.

How to control access to crons

In order to use cron jobs, an admin needs to allow cron jobs to be added for users in the /etc/cron.allow file.

If you get a message saying that you are not allowed to use crontab, it means you don't have permission to use cron.

To allow John to use crons, include his name in /etc/cron.allow. Create the file if it doesn't exist. This will allow John to create and edit cron jobs.

Users can also be denied access to cron jobs by entering their usernames in the file /etc/cron.deny.

How to add cron jobs in Linux

First, to use cron jobs, you'll need to check the status of the cron service. If cron is not installed, you can easily download it through the package manager. Just use this to check:

# Check cron service on Linux system
sudo systemctl status cron.service

Cron job syntax

Crontabs use the following flags for adding and listing cron jobs:

crontab -e: edits crontab entries to add, delete, or edit cron jobs.
crontab -l: list all the cron jobs for the current user.
crontab -u username -l: list another user's crons.
crontab -u username -e: edit another user's crons.

When you list crons and they exist, you'll see something like this:

# Cron job example
* * * * * sh /path/to/script.sh

In the above example,

The five asterisks (*) represent the minute, hour, day of the month, month, and day of the week, in that order. An asterisk means "every value" for that field. See the details of these values below:

FIELD	VALUE	DESCRIPTION
Minutes	0-59	The minute at which the command runs.
Hours	0-23	The hour at which the command runs.
Days	1-31	The day of the month on which the command runs.
Months	1-12	The month in which the command runs.
Weekdays	0-6	The day of the week on which the command runs. Here, 0 is Sunday.

sh runs the script with the system's default shell, /bin/sh. On Ubuntu, /bin/sh is dash rather than bash, so use bash explicitly if the script relies on bash-only features.
/path/to/script.sh specifies the path to the script.

Below is a summary of the cron job syntax:

*   *   *   *   *  sh /path/to/script/script.sh
|   |   |   |   |              |
|   |   |   |   |      Command or Script to Execute        
|   |   |   |   |
|   |   |   |   |
|   |   |   |   |
|   |   |   | Day of the Week(0-6)
|   |   |   |
|   |   | Month of the Year(1-12)
|   |   |
|   | Day of the Month(1-31)  
|   |
| Hour(0-23)  
|
Min(0-59)
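To make the field order concrete, here is a minimal sketch that splits a crontab line into its five schedule fields and the command. The function name describe_cron_line is hypothetical, and the sketch only labels the fields; it does not validate their ranges.

```shell
# Hypothetical helper: label the fields of one crontab line.
describe_cron_line() {
  # read splits on whitespace; the last variable receives the rest of the line.
  printf '%s\n' "$1" | {
    read -r minute hour dom month dow command
    printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s command=%s\n' \
      "$minute" "$hour" "$dom" "$month" "$dow" "$command"
  }
}

describe_cron_line '5 4 * * 6 sh /path/to/script.sh'
```

For the line above, the sketch reports minute 5, hour 4, any day of the month, any month, and day-of-week 6 (Saturday), followed by the command.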

Cron job examples

Below are some examples of scheduling cron jobs.

SCHEDULE	SCHEDULED VALUE
`5 0 * 8 *`	At 00:05 every day in August.
`5 4 * * 6`	At 04:05 on Saturday.
`0 22 * * 1-5`	At 22:00 on every day-of-week from Monday through Friday.

It's okay if you are unable to grasp this all at once. You can practice and generate cron schedules with the crontab guru website.

How to set up a cron job

In this section, we will look at an example of how to schedule a simple script with a cron job.

1. Create a script called date-script.sh which prints the system date and time and appends it to a file. The script is shown below:

#!/bin/bash

echo `date` >> date-out.txt

2. Make the script executable by giving it execution rights.

chmod 775 date-script.sh

3. Add the script in the crontab using crontab -e.

Here, we have scheduled it to run per minute.

*/1 * * * * /bin/sh /root/date-script.sh

4. Check the output of the file date-out.txt. According to the script, the system date should be printed to this file every minute.

cat date-out.txt
# output
Wed 26 Jun 16:59:33 PKT 2024
Wed 26 Jun 17:00:01 PKT 2024
Wed 26 Jun 17:01:01 PKT 2024
Wed 26 Jun 17:02:01 PKT 2024
Wed 26 Jun 17:03:01 PKT 2024
Wed 26 Jun 17:04:01 PKT 2024
Wed 26 Jun 17:05:01 PKT 2024
Wed 26 Jun 17:06:01 PKT 2024
Wed 26 Jun 17:07:01 PKT 2024

How to troubleshoot crons

Crons are really helpful, but they might not always work as intended. Fortunately, there are some effective methods you can use to troubleshoot them.

1. Check the schedule.

First, you can try verifying the schedule that's set for the cron. You can do that with the syntax you saw in the above sections.

2. Check cron logs.

Next, check whether the cron job ran at the intended time. On Ubuntu, you can verify this from the cron logs located at /var/log/syslog.

If there is an entry in these logs at the correct time, it means the cron has run according to the schedule you set.

Below are the logs of our cron job example. Note the first column which shows the timestamp. The path of the script is also mentioned at the end of the line. Line #1, 3, and 5 show that the script ran as intended.

1 Jun 26 17:02:01 zaira-ThinkPad CRON[27834]: (zaira) CMD (/bin/sh /home/zaira/date-script.sh)
2 Jun 26 17:02:02 zaira-ThinkPad systemd[2094]: Started Tracker metadata extractor.
3 Jun 26 17:03:01 zaira-ThinkPad CRON[28255]: (zaira) CMD (/bin/sh /home/zaira/date-script.sh)
4 Jun 26 17:03:02 zaira-ThinkPad systemd[2094]: Started Tracker metadata extractor.
5 Jun 26 17:04:01 zaira-ThinkPad CRON[28538]: (zaira) CMD (/bin/sh /home/zaira/date-script.sh)

3. Redirect cron output to a file.

You can redirect a cron's output to a file and check the file for any possible errors.

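# Redirect cron output (stdout and stderr) to a file.
# cron runs commands with /bin/sh by default, so use the portable form instead of bash's &>
* * * * * sh /path/to/script.sh > log_file.log 2>&1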
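The log check in step 2 can also be scripted. Below is a minimal sketch that prints the timestamps at which cron started a given command, assuming the traditional syslog timestamp format shown in the sample log. The function name cron_runs is hypothetical.

```shell
# Hypothetical helper: print when cron started a command.
# Reads syslog-style lines on stdin; $1 is a fixed string to look for.
cron_runs() {
  grep -F 'CRON[' | grep -F "$1" | awk '{ print $1, $2, $3 }'
}

cron_runs date-script.sh <<'EOF'
Jun 26 17:02:01 zaira-ThinkPad CRON[27834]: (zaira) CMD (/bin/sh /home/zaira/date-script.sh)
Jun 26 17:02:02 zaira-ThinkPad systemd[2094]: Started Tracker metadata extractor.
Jun 26 17:03:01 zaira-ThinkPad CRON[28255]: (zaira) CMD (/bin/sh /home/zaira/date-script.sh)
EOF
```

On a live system you would feed it the log file instead, for example cron_runs date-script.sh < /var/log/syslog. Reading that file may require elevated privileges.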

8.7. Linux Networking Basics

Linux offers a number of commands to view network related information. In this section we will briefly discuss some of the commands.

View network interfaces with `ifconfig`

The ifconfig command gives information about network interfaces. Here is an example output:

ifconfig

# Output
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.100  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::a00:27ff:fe4e:66a1  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:4e:66:a1  txqueuelen 1000  (Ethernet)
        RX packets 1024  bytes 654321 (654.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 512  bytes 123456 (123.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 256  bytes 20480 (20.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 256  bytes 20480 (20.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The output of the ifconfig command shows the network interfaces configured on the system, along with details such as IP addresses, MAC addresses, packet statistics, and more.

These interfaces can be physical or virtual devices.

To extract IPv4 and IPv6 addresses, you can use ip -4 addr and ip -6 addr, respectively.
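If you only have ifconfig output to work with, the IPv4 addresses can be pulled out with awk. This is a minimal sketch that assumes the output layout shown above (an interface header line followed by indented inet lines); older ifconfig versions print a different layout. The function name list_ipv4 is hypothetical.

```shell
# Hypothetical helper: print "interface address" pairs from ifconfig-style text on stdin.
list_ipv4() {
  awk '
    /^[^ \t]/    { iface = $1; sub(/:$/, "", iface) }  # header line, e.g. "eth0: flags=..."
    $1 == "inet" { print iface, $2 }                   # IPv4 line, e.g. "inet 192.168.1.100 ..."
  '
}

list_ipv4 <<'EOF'
eth0: flags=4163  mtu 1500
        inet 192.168.1.100  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::a00:27ff:fe4e:66a1  prefixlen 64  scopeid 0x20

lo: flags=73  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
EOF
```

On a live system the same function can be used as ifconfig | list_ipv4.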

View network activity with `netstat`

The netstat command shows network activity and statistics, such as open sockets, listening ports, routing tables, and per-interface counters.

Here are some examples of using the netstat command in the command line:

Display all listening and non-listening sockets:
```
 netstat -a
```
Show only listening ports:
```
 netstat -l
```
Display network statistics:
```
 netstat -s
```
Show routing table:
```
 netstat -r
```
Display TCP connections:
```
 netstat -t
```
Display UDP connections:
```
 netstat -u
```
Show network interfaces:
```
 netstat -i
```
Display PID and program names for connections:
```
 netstat -p
```
Show statistics for a specific protocol (for example, TCP):
```
 netstat -st
```
Display extended information:
```
netstat -e
```

Check network connectivity between two devices using `ping`

ping is used to test network connectivity between two devices. It sends ICMP packets to the target device and waits for a response.

ping google.com

ping tests if you get a response back without getting a timeout.

ping google.com
PING google.com (142.250.181.46) 56(84) bytes of data.
64 bytes from fjr04s06-in-f14.1e100.net (142.250.181.46): icmp_seq=1 ttl=60 time=78.3 ms
64 bytes from fjr04s06-in-f14.1e100.net (142.250.181.46): icmp_seq=2 ttl=60 time=141 ms
64 bytes from fjr04s06-in-f14.1e100.net (142.250.181.46): icmp_seq=3 ttl=60 time=205 ms
64 bytes from fjr04s06-in-f14.1e100.net (142.250.181.46): icmp_seq=4 ttl=60 time=100 ms
^C
--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 78.308/131.053/204.783/48.152 ms

You can stop the response with Ctrl + C.

Testing endpoints with the `curl` command

The curl command stands for "client URL". It is used to transfer data to or from a server. It can also be used to test API endpoints, which helps in troubleshooting system and application errors.

As an example, you can use http://www.official-joke-api.appspot.com/ to experiment with the curl command.

The curl command without any options uses the GET method by default.

curl http://www.official-joke-api.appspot.com/random_joke
{"type":"general",
"setup":"What did the fish say when it hit the wall?","punchline":"Dam.","id":1}

curl -o saves the output to the mentioned file.

curl -o random_joke.json http://www.official-joke-api.appspot.com/random_joke
# saves the output to random_joke.json

curl -I fetches only the headers.

curl -I http://www.official-joke-api.appspot.com/random_joke
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Vary: Accept-Encoding
X-Powered-By: Express
Access-Control-Allow-Origin: *
ETag: W/"71-NaOSpKuq8ChoxdHD24M0lrA+JXA"
X-Cloud-Trace-Context: 2653a86b36b8b131df37716f8b2dd44f
Content-Length: 113
Date: Thu, 06 Jun 2024 10:11:50 GMT
Server: Google Frontend
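When testing endpoints from a script, the status code is often the only part of the response you need. Here is a minimal sketch that extracts it from the headers printed by curl -I. The function name http_status is hypothetical, and the sketch assumes a single response (no redirects being followed).

```shell
# Hypothetical helper: print the status code from HTTP response headers on stdin.
http_status() {
  awk 'NR == 1 { print $2 }'
}

# Sample status line and header, as printed by curl -I.
printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n' | http_status
```

Against a live endpoint this becomes curl -sI URL | http_status, where -s hides the progress meter. curl can also print the code itself with -w '%{http_code}'.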

8.8. Linux Troubleshooting: Tools and Techniques

System activity report with `sar`

The sar command in Linux is a powerful tool for collecting, reporting, and saving system activity information. It's part of the sysstat package and is widely used for monitoring system performance over time.

To use sar, you first need to install sysstat using sudo apt install sysstat.

Once installed, start the service with sudo systemctl start sysstat.

Verify the status with sudo systemctl status sysstat.

Once the status is active, the system will start collecting various stats that you can use to access and analyze historical data. We'll see that in detail soon.

The syntax of the sar command is as follows:

sar [options] [interval] [count]

For example, sar -u 1 3 will display CPU utilization statistics every second, three times.

sar -u 1 3
# Output
Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

19:09:26        CPU     %user     %nice   %system   %iowait    %steal     %idle
19:09:27        all      3.78      0.00      2.18      0.08      0.00     93.96
19:09:28        all      4.02      0.00      2.01      0.08      0.00     93.89
19:09:29        all      6.89      0.00      2.10      0.00      0.00     91.01
Average:        all      4.89      0.00      2.10      0.06      0.00     92.95
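A common way to read this output is to treat everything that is not idle as busy, that is, busy = 100 - %idle. The sketch below applies that to the Average line; the function name cpu_busy is hypothetical, and it assumes %idle is the last column, as in the output above.

```shell
# Hypothetical helper: print average CPU busy percentage from sar -u output on stdin.
cpu_busy() {
  awk '$1 == "Average:" && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}

cpu_busy <<'EOF'
19:09:29        all      6.89      0.00      2.10      0.00      0.00     91.01
Average:        all      4.89      0.00      2.10      0.06      0.00     92.95
EOF
```

For the sample above, the average idle value of 92.95 corresponds to 7.05% busy. On a live system you can run sar -u 1 3 | cpu_busy.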

sar can be used for a variety of purposes. Here are some common use cases and examples:

1. Memory usage

To check memory usage (free and used), use:

sar -r 1 3

Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

19:10:46    kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
19:10:47      4600104   8934352   5502124     36.32    375844   4158352  15532012     65.99   6830564   2481260       264
19:10:48      4644668   8978940   5450252     35.98    375852   4165648  15549184     66.06   6776388   2481284        36
19:10:49      4646548   8980860   5448328     35.97    375860   4165648  15549224     66.06   6774368   2481292       116
Average:      4630440   8964717   5466901     36.09    375852   4163216  15543473     66.04   6793773   2481279       139

This command displays memory statistics every second three times.

2. Swap space utilization

To view swap space utilization statistics, use:

sar -S 1 3

sar -S 1 3
Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

19:11:20    kbswpfree kbswpused  %swpused  kbswpcad   %swpcad
19:11:21      8388604         0      0.00         0      0.00
19:11:22      8388604         0      0.00         0      0.00
19:11:23      8388604         0      0.00         0      0.00
Average:      8388604         0      0.00         0      0.00

This command helps monitor the swap usage, which is crucial for systems running out of physical memory.

3. I/O devices load

To report activity for block devices and block device partitions:

sar -d 1 3

This command provides detailed stats about data transfers to and from block devices, and is useful for diagnosing I/O bottlenecks.

5. Network statistics

To view network statistics, like number of packets received (transmitted) by the network interface:

sar -n DEV 1 3
# -n DEV tells sar to report network device interfaces
sar -n DEV 1 3
Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

19:12:47        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
19:12:48           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
19:12:48       enp2s0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
19:12:48       wlp3s0     10.00      3.00      1.83      0.37      0.00      0.00      0.00      0.00
19:12:48    br-5129d04f972f      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
.
.
.

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       enp2s0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
...output truncated...

This displays network statistics every second for three seconds, helping in monitoring network traffic.

6. Historical data

Recall that previously we installed the sysstat package and ran the service. Follow the steps below to enable and access historical data.

Enable data collection: Edit the sysstat configuration file to enable data collection.

 sudo nano /etc/default/sysstat

Change ENABLED="false" to ENABLED="true".

 vim /etc/default/sysstat
 #
 # Default settings for /etc/init.d/sysstat, /etc/cron.d/sysstat
 # and /etc/cron.daily/sysstat files
 #

 # Should sadc collect system activity informations? Valid values
 # are "true" and "false". Please do not put other values, they
 # will be overwritten by debconf!
 ENABLED="true"

Configure data collection interval: Edit the cron job configuration to set the data collection interval.
```
 sudo nano /etc/cron.d/sysstat
```
By default, it collects data every 10 minutes. You can adjust the interval by modifying the cron job schedule. The relevant files will go to the /var/log/sysstat folder.
View historical data: Use the sar command to view historical data. For example, to view CPU usage for the current day:
```
 sar -u
```
To view data from a specific date:
```
 sar -u -f /var/log/sysstat/sa<DD>
```
Replace <DD> with the two-digit day of the month for which you want to view the data.

In the below command, /var/log/sysstat/sa04 gives stats for the 4th day of the current month.

sar -u -f /var/log/sysstat/sa04
Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

15:20:49     LINUX RESTART    (12 CPU)

16:13:30     LINUX RESTART    (12 CPU)

18:16:00        CPU     %user     %nice   %system   %iowait    %steal     %idle
18:16:01        all      0.25      0.00      0.67      0.08      0.00     99.00
Average:        all      0.25      0.00      0.67      0.08      0.00     99.00
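Because the data files are named after the day of the month, the file name can be built instead of typed. A minimal sketch, assuming the /var/log/sysstat directory mentioned above; the function name sa_file is hypothetical.

```shell
# Hypothetical helper: build the sysstat data file path for a day of the month.
sa_file() {
  day=${1#0}    # drop one leading zero so "08" is not parsed as an octal number
  printf '/var/log/sysstat/sa%02d\n' "$day"
}

sa_file 4
```

The call above prints /var/log/sysstat/sa04. To view today's file, you could run sar -u -f "$(sa_file "$(date +%d)")".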

7. Real-time CPU interrupts

To observe real-time interrupts per second served by the CPU, use this command:

sar -I SUM 1 3

# Output
Linux 6.5.0-28-generic (zaira-ThinkPad)     04/06/24     _x86_64_    (12 CPU)

19:14:22         INTR    intr/s
19:14:23          sum   5784.00
19:14:24          sum   5694.00
19:14:25          sum   5795.00
Average:          sum   5757.67

This command helps in monitoring how frequently the CPU is handling interrupts, which can be crucial for real-time performance tuning.

These examples illustrate how you can use sar to monitor various aspects of system performance. Regular use of sar can help in identifying system bottlenecks and ensuring that applications keep running efficiently.

8.9. General Troubleshooting Strategy for Servers

Why do we need to understand monitoring?

System monitoring is an important aspect of system administration. Critical applications demand a high level of proactiveness to prevent failure and reduce the outage impact.

Linux offers very powerful tools to gauge system health. In this section, you'll learn about the various methods available to check your system's health and identify the bottlenecks.

Find load average and system uptime

System reboots can sometimes disrupt configurations. To check how long the machine has been up, use the uptime command. In addition to the uptime, the command also displays the load average.

[user@host ~]$ uptime
 19:15:00 up 1:04, 0 users, load average: 2.92, 4.48, 5.20

Load average is the system load over the last 1, 5, and 15 minutes. A quick glance indicates whether the system load appears to be increasing or decreasing over time.

Note: The ideal CPU queue length is 0. This is only possible when no processes are waiting for the CPU.

Per-CPU load can be calculated by dividing the load average by the total number of CPUs available.

To find the number of CPUs, use the command lscpu.

lscpu
# output
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
.
.
.
output omitted

If the load average keeps increasing and does not come down, the CPUs are overloaded. A process may be stuck, or there may be a memory leak.
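The per-CPU calculation described above is a single division. Here is a minimal sketch using the 1-minute load average from the uptime example (2.92) and the 12 CPUs reported by lscpu; the function name per_cpu_load is hypothetical.

```shell
# Hypothetical helper: divide a load average by the number of CPUs.
# Usage: per_cpu_load LOAD_AVERAGE CPU_COUNT
per_cpu_load() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

per_cpu_load 2.92 12
```

This prints 0.24. On a live Linux system, the inputs can be read directly with per_cpu_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". As a rough rule of thumb, a value that stays above 1 suggests that processes are queuing for CPU time.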

Calculating free memory

Sometimes, high memory utilization might be causing problems. To check the available memory and the memory in use, use the free command.

free -mh
# output
               total        used        free      shared  buff/cache   available
Mem:            14Gi       3.5Gi       7.7Gi       109Mi       3.2Gi        10Gi
Swap:          8.0Gi          0B       8.0Gi

Calculating disk space

To ensure the system is healthy, don't forget about disk space. To list all the available mount points and the percentage of space used on each, use the command below. Ideally, disk utilization should not exceed 80%.

The df command reports disk space usage for each mounted filesystem.

df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.5G  2.4M  1.5G   1% /run
/dev/nvme0n1p2  103G   34G   65G  35% /
tmpfs           7.3G   42M  7.2G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        246K   93K  149K  39% /sys/firmware/efi/efivars
/dev/nvme0n1p3  130G   47G   77G  39% /home
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
tmpfs           1.5G  140K  1.5G   1% /run/user/1000
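The 80% guideline can be checked automatically. Below is a minimal sketch that reads df output and prints the mount points above a threshold. The function name over_threshold is hypothetical, and the sketch assumes mount points without spaces. The demo uses a threshold of 38 so that the sample rows from the output above produce a match.

```shell
# Hypothetical helper: print mount points whose usage exceeds a threshold.
# Reads df output on stdin; $1 is the threshold in percent (default 80).
over_threshold() {
  awk -v limit="${1:-80}" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # "39%" -> "39"
    if (use + 0 > limit + 0) print $6, $5
  }'
}

over_threshold 38 <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  103G   34G   65G  35% /
/dev/nvme0n1p3  130G   47G   77G  39% /home
EOF
```

The demo prints /home 39%. On a live system, df -P | over_threshold uses the default of 80; the -P option asks df for the POSIX output format, which keeps each filesystem on one line.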

Determining process states

Process states can be monitored to see any stuck process with a high memory or CPU usage.

We saw previously that the ps command gives useful information about a process. Have a look at the %CPU and %MEM columns.

[user@host ~]$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 runner         1  0.1  0.0 1535464 15576 ?       S  19:18   0:00 /inject/init
 runner        14  0.0  0.0  21484  3836 pts/0    S   19:21   0:00 bash --norc
 runner        22  0.0  0.0  37380  3176 pts/0    R+   19:23   0:00 ps aux

Real-time system monitoring

Real-time monitoring gives a window into the current state of the system.

One utility you can use to do this is the top command.

The top command displays a dynamic view of the system's processes, displaying a summary header followed by a process or thread list. Unlike its static counterpart ps, top continuously refreshes the system stats.

With top, you can see well-organised details in a compact window. There are a number of flags, shortcuts, and highlighting methods that come along with top.

You can also kill processes using top. For that, press k and then enter the process id.

Interpreting logs

System and application logs carry tons of information about what the system is going through. They contain useful information and error codes that point towards errors. If you search for error codes in logs, issue identification and rectification time can be greatly reduced.

Network ports analysis

The network aspect should not be ignored as network glitches are common and may impact the system and traffic flows. Common network issues include port exhaustion, port choking, unreleased resources, and so on.

To identify such issues, we need to understand port states.

Some of the port states are explained briefly here:

State	Description
LISTEN	Represents ports that are waiting for a connection request from any remote TCP and port.
ESTABLISHED	Represents connections that are open and data received can be delivered to the destination.
TIME_WAIT	Represents waiting time to ensure acknowledgment of its connection termination request.
FIN_WAIT2	Represents waiting for a connection termination request from the remote TCP.

Let's explore how we can analyze port-related information in Linux.

Port ranges: Port ranges are defined in the system, and the range can be increased or decreased as needed. In the snippet below, the range is from 15000 to 65000, which makes roughly 50,000 (65000 - 15000) available ports. If the number of ports in use approaches or reaches this limit, there is an issue.

[user@host ~]$ /sbin/sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000    65000

The errors reported in logs in such cases can include "Failed to bind to port" or "Too many connections".
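To compare the ports in use against the configured range, you need two numbers: the size of the range and a count of sockets per state. The sketch below computes both. The function names are hypothetical, and count_states assumes the column layout printed by netstat -ant, where the state is the sixth column.

```shell
# Hypothetical helper: number of ports in an inclusive range LOW..HIGH.
port_range_size() {
  echo $(( $2 - $1 + 1 ))
}

# Hypothetical helper: count TCP sockets per state from netstat -ant output on stdin.
count_states() {
  awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -k2
}

port_range_size 15000 65000

count_states <<'EOF'
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:22             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:443            10.0.0.7:40022          TIME_WAIT
tcp        0      0 10.0.0.5:443            10.0.0.8:40023          TIME_WAIT
EOF
```

Counting both endpoints, the range 15000 to 65000 contains 50001 ports. The addresses in the demo are made up for illustration. On a live system, use netstat -ant | count_states and watch for a state whose count climbs toward the range size.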

Identifying packet loss

In system monitoring, we need to ensure that the outgoing and incoming communication is intact.

One helpful command is ping. ping hits the destination system and brings the response back. Note the last few lines of statistics that show packet loss percentage and time.

# ping destination IP
[user@host ~]$ ping 10.13.6.113
 PING 10.13.6.113 (10.13.6.113) 56(84) bytes of data.
 64 bytes from 10.13.6.113: icmp_seq=1 ttl=128 time=0.652 ms
 64 bytes from 10.13.6.113: icmp_seq=2 ttl=128 time=0.593 ms
 64 bytes from 10.13.6.113: icmp_seq=3 ttl=128 time=0.478 ms
 64 bytes from 10.13.6.113: icmp_seq=4 ttl=128 time=0.384 ms
 64 bytes from 10.13.6.113: icmp_seq=5 ttl=128 time=0.432 ms
 64 bytes from 10.13.6.113: icmp_seq=6 ttl=128 time=0.747 ms
 64 bytes from 10.13.6.113: icmp_seq=7 ttl=128 time=0.379 ms
 ^C
 --- 10.13.6.113 ping statistics ---
 7 packets transmitted, 7 received, 0% packet loss, time 6001ms
 rtt min/avg/max/mdev = 0.379/0.523/0.747/0.134 ms
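For scripted checks, the packet loss figure can be pulled out of the summary line. This is a minimal sketch, assuming the summary format shown above; the function name packet_loss is hypothetical.

```shell
# Hypothetical helper: print the packet loss percentage from ping output on stdin.
packet_loss() {
  sed -n 's/.*[ ,]\([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

echo '7 packets transmitted, 7 received, 0% packet loss, time 6001ms' | packet_loss
```

The example prints 0. On a live system, ping -c 4 10.13.6.113 | packet_loss sends four packets and prints only the loss percentage, which a monitoring script can compare against a limit.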

Packets can also be captured at runtime using tcpdump. We'll look into it later.

Gathering stats for issue post mortem

It is always a good practice to gather certain stats that would be useful for identifying the root cause later. Usually, after a system reboot or service restart, we lose the earlier system snapshot and logs.

Below are some of the methods to capture a system snapshot.

Logs Backup

Before making any changes, copy log files to another location. This is crucial for understanding what condition the system was in at the time of the issue. Sometimes log files are the only window into past system states, as other runtime stats are lost.
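One way to take such a copy is a timestamped archive, so that repeated backups don't overwrite each other. A minimal sketch follows; the function name backup_logs, the archive naming scheme, and the demo directories are all assumptions for illustration.

```shell
# Hypothetical helper: archive a log directory under a timestamped name.
# Usage: backup_logs SRC_DIR DEST_DIR   (prints the path of the archive)
backup_logs() {
  src=$1 dest=$2
  archive="$dest/logs-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest" && tar -czf "$archive" -C "$src" . && echo "$archive"
}

# Demo on temporary directories rather than real logs.
src_dir=$(mktemp -d)
dest_dir=$(mktemp -d)
echo "sample log line" > "$src_dir/app.log"
archive_path=$(backup_logs "$src_dir" "$dest_dir")
tar -tzf "$archive_path"
rm -rf "$src_dir" "$dest_dir"
```

On a real server the source would typically be /var/log, which usually requires root privileges to read in full. Keep the destination outside the source directory so the archive does not include itself.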

TCP Dump

Tcpdump is a command-line utility that allows you to capture and analyze incoming and outgoing network traffic. It is mostly used to help troubleshoot network issues. If you feel that system traffic is being impacted, take tcpdump as follows:

sudo tcpdump -i any -w <filename>.pcap

# Where,
# -i any captures traffic from all interfaces
# -w specifies the output filename

# Stop the command after a few mins as the file size may increase
# use file extension as .pcap

Once tcpdump is captured, you can use tools like Wireshark to visually analyze the traffic.

8.10 Diagnosing Hardware Problems

Troubleshooting unexpected issues is a part of the learning process. Sometimes, you may notice frequent segmentation faults (SIGSEGV), overheating, or random crashes across unrelated applications. The issue could either be software or hardware related. While software-related issues depend on the specific application itself, hardware issues can be diagnosed with some standard steps.

In this section, we will discuss how to diagnose and rule out hardware issues related to memory, CPU, system sensors, power supply, and more.

8.10.1 Analyzing Memory Performance

Determine Available RAM

If you feel your system is getting slow and taking longer to finish tasks, check your system's available memory. This will ensure there is enough available memory including the swap memory.

The command to check available memory is free -mh, where -h is for human-readable output and -m is for displaying memory in MB.

free -mh
               total        used        free      shared  buff/cache   available
Mem:            14Gi       5.1Gi       2.4Gi        77Mi       7.3Gi       9.3Gi
Swap:          4.0Gi          0B       4.0Gi

In the above output, look at the "available" column in the "Mem" row. This shows how much RAM is free for use.

Another way to check the memory in real time is to use the top command. There are 2 ways to do this:

When you are in top, press Shift + M to sort the processes by memory usage.
Alternatively, press m to see the memory usage in a progress-bar-like format.

If memory consumption is close to 100%, identify the process that is consuming the memory and take the necessary action. You might also want to consider adding more memory to your system.
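To turn the free output into a single percentage, you can compare the total against the available column. A minimal sketch, assuming the column order shown in the output above and plain numeric output (free -m without -h); the function name mem_used_pct is hypothetical, and the numbers in the demo are made up for illustration.

```shell
# Hypothetical helper: percentage of RAM that is not available.
# Reads free -m output on stdin: $2 is total, $7 is available.
mem_used_pct() {
  awk '$1 == "Mem:" { printf "%.1f\n", ($2 - $7) / $2 * 100 }'
}

mem_used_pct <<'EOF'
               total        used        free      shared  buff/cache   available
Mem:           16000        5000        3000         100        8000       10000
Swap:           4000           0        4000
EOF
```

The demo prints 37.5, because 6000 of the 16000 MiB are not available. On a live system, run free -m | mem_used_pct.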

Run a stress test on your hardware

The memtester command is a utility used for diagnosing memory-related issues by stressing the memory and checking for faults. It is often used in situations where you suspect faulty RAM might be causing system instability or crashes.

Here's how to use it effectively:

First, install memtester.
```
  sudo apt install memtester
```
Determine the amount of RAM to test and the number of passes you’d like your RAM to go through. In the command below, 1G is the amount of RAM to test (1 GB), and 5 is the number of test passes:
```
  sudo memtester 1G 5
```

If all tests pass, your RAM is likely error-free. If errors are reported, your RAM might be faulty and could require replacement or further inspection. You can always run the test again with a different amount of RAM or test passes.

Note that you shouldn't test too much memory at once, as your system also needs memory for running processes. If you have more RAM than can be tested at once, test in smaller segments sequentially.

Below is a snippet of the memtester output if all tests pass. Notice the "ok" status for each test.

memtester version 4.5.1 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got  1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/5:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok
.
.
.

Below is a snippet of the output when tests fail. Notice the FAILURE status for each failed test.

memtester version 4.5.1 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got  1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/5:
  Stuck Address       : testing   1FAILURE: possible bad address line at offset 0x25378a58.
Skipping to next test...
  Random Value        : FAILURE: 0x4df704aaafdf8848 != 0x4df704aaafdfc848 at offset 0x05379a48.
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : testing   6FAILURE: 0x00000000 != 0x00004000 at offset 0x05379a48.
  Block Sequential    : testing   3FAILURE: 0x303030303030303 != 0x303030303034303 at offset 0x05379a48.
  Checkerboard        : testing   0FAILURE: 0xaaaaaaaaaaaaaaaa != 0xaaaaaaaaaaaaeaaa at offset 0x05379a48.
  Bit Spread          : testing  12FAILURE: 0xffffffffffffafff != 0xffffffffffffefff at offset 0x05379a48.
  Bit Flip            : testing   0FAILURE: 0x00000001 != 0x00004001 at offset 0x05379a48.
  Walking Ones        : ok
  Walking Zeroes      : testing   0FAILURE: 0x00000001 != 0x00001001 at offset 0x053af9f8.
  8-bit Writes        : -FAILURE: 0x57c7c8ba7d6f5b3b != 0x57c7c8ba7d6f1b3b at offset 0x0537da28.
  16-bit Writes       : -FAILURE: 0xd7768894fbf79099 != 0xd7768894fbf7d099 at offset 0x05379a48.
FAILURE: 0xfffc5633ffefca5d != 0xfffc5633ffefda5d at offset 0x053a5a38.
.
.
.

If errors persist across all test loops, it strongly suggests hardware issues, not transient software glitches.
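With several loops, the output gets long, so counting the failure lines in a saved copy can be quicker than reading it. A minimal sketch, assuming the output was saved to a file; the function name memtester_failures is hypothetical.

```shell
# Hypothetical helper: count lines that report a FAILURE in memtester output on stdin.
memtester_failures() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c 'FAILURE' || true
}

memtester_failures <<'EOF'
  Stuck Address       : ok
  Random Value        : FAILURE: 0x4df704aaafdf8848 != 0x4df704aaafdfc848 at offset 0x05379a48.
  Compare XOR         : ok
EOF
```

The demo prints 1. With a saved log, the call would be memtester_failures < memtester.log, where memtester.log is a file name chosen for this example.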

8.10.2 Identifying Overheating Issues

Overheating can cause unexpected errors and crashes. To diagnose overheating issues, you can use the lm-sensors command-line utility.

lm-sensors allows you to monitor hardware health by reading data from various sensors. It provides information about system temperatures, voltages, and fan speeds.

Here's how you can identify and monitor your system temperature using lm-sensors:

First, install lm-sensors:
```
  sudo apt install lm-sensors
```
Detect the available sensors on your system:
```
  sudo sensors-detect
```
Follow the prompts and answer “YES” to detect the available sensors on your system.
Once the available sensors are detected, you can view the temperature of your system using the sensors command:
```
  sensors
```
In the output below, you can see the temperature reading at the edge of the GPU, which is 41.0 degrees Celsius. You can also see other pieces of information, like the voltage supplied and the power consumption.
```
  amdgpu-pci-0400
  Adapter: PCI adapter
  vddgfx:      731.00 mV 
  vddnb:       687.00 mV 
  edge:         +41.0°C  
  PPT:           7.00 W
```
Using lm-sensors ensures that the system is operating within safe parameters. It helps to detect potential hardware problems early and take corrective actions to prevent hardware damage.
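To watch for overheating without reading every line, the temperature readings can be filtered against a limit. This is a minimal sketch that assumes the "name: +NN.N°C" layout shown above. The function name hot_sensors is hypothetical, and a sensible limit depends on your hardware.

```shell
# Hypothetical helper: print sensor readings above a limit in degrees Celsius.
# Reads sensors output on stdin; $1 is the limit.
hot_sensors() {
  awk -F: -v limit="$1" '/°C/ {
    name = $1; gsub(/^[ \t]+/, "", name)
    # Take the first decimal number after the colon as the reading.
    if (match($2, /[0-9]+\.[0-9]+/)) {
      temp = substr($2, RSTART, RLENGTH)
      if (temp + 0 > limit + 0) print name, temp
    }
  }'
}

hot_sensors 40 <<'EOF'
amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:      731.00 mV
edge:         +41.0°C
PPT:           7.00 W
EOF
```

With a limit of 40, the demo prints edge 41.0. On a live system, run sensors | hot_sensors 80, replacing 80 with a limit appropriate for your components.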

8.10.3 Evaluating Hard Drive Health

Disk errors can also cause application crashes. To identify disk issues, you can run a disk check using smartmontools:

First, install smartmontools:
```
  sudo apt install smartmontools
```
Run a quick health check using the command below and replace /dev/sdX with your disk name (check with lsblk).
```
  sudo smartctl -H /dev/sdX
```

Here is the result I got when I ran the command on my disk /dev/nvme0n1:

  sudo smartctl -H /dev/nvme0n1
  smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-52-generic] (local build)
  Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED

You can also run a detailed test:
```
  sudo smartctl -a /dev/nvme0n1
```

The detailed test provides a full report, including:

Temperature
Power-on hours
Error counts
Wear leveling (for SSDs), and more.
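For a quick scripted check, the overall result can be extracted from the health report. A minimal sketch, assuming the result line has the wording shown in the output above; the function name smart_health is hypothetical.

```shell
# Hypothetical helper: print the overall health result from smartctl -H output on stdin.
smart_health() {
  awk -F': *' '/overall-health self-assessment test result/ { print $2 }'
}

smart_health <<'EOF'
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
EOF
```

The demo prints PASSED. On a live system, run sudo smartctl -H /dev/sdX | smart_health. Some drive types may report the result with different wording, in which case this sketch prints nothing and the full report should be read instead.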

8.10.4 Conducting a CPU Stress Test

Faulty CPUs can also lead to a number of performance issues. To test your CPU, you can use the stress-ng utility:

Install stress-ng:
```
  sudo apt install stress-ng
```
Run a CPU stress test:
```
  stress-ng --cpu 4 --timeout 60
```

In the above command, 4 is the number of CPU cores you’d like to test and 60 is the duration in seconds. The command will stress all 4 CPU cores for 60 seconds. While the test runs, a monitor such as top should show the tested cores at or near 100% load.

If the system crashes during this test, the CPU may be faulty.

8.10.5 Examining System Logs for Errors

systemd is a Linux system manager responsible for booting the system, managing system processes, and handling system services.

journalctl is a command to query the systemd journal logs. It provides detailed logging for system processes, kernel events, user applications, and more.

You can check system logs for hardware-related errors using the command: journalctl -k | grep -iE "error|fault|panic".

In the logs, look for messages about:

Memory faults.
I/O errors.
Hardware timeouts.

Here is what errors in the log file can look like:

Feb 11 10:15:32 hostname kernel: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: b200000000070f0f
Feb 11 10:15:32 hostname kernel: [Hardware Error]: TSC 0 ADDR fef1c000 MISC 38a0000086 
Feb 11 10:15:32 hostname kernel: [Hardware Error]: PROCESSOR 0:306a9 TIME 1613045732 SOCKET 0 APIC 0 microcode 1f
Feb 11 10:16:45 hostname kernel: EXT4-fs error (device sda1): ext4_find_entry:1453: inode #2: comm ls: reading directory lblock 0
Feb 11 10:17:12 hostname kernel: [drm:drm_atomic_helper_commit_cleanup_done [drm_kms_helper]] *ERROR* [CRTC:36:pipe A] flip_done timed out
Feb 11 10:18:05 hostname kernel: Kernel panic - not syncing: Fatal exception
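When the log is long, a quick count per category can show where to look first. The following is a rough sketch, not a standard tool: `summarize_kernel_errors` is a hypothetical function name, and the patterns are assumptions based only on the sample lines above, so real logs may need different ones.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines per error category.
# It reads log text on stdin, so it works with `journalctl -k` output
# or with a saved log file.
summarize_kernel_errors() {
  awk '
    BEGIN { hw = 0; io = 0; timeout = 0; panic = 0 }
    /\[Hardware Error\]/        { hw++ }       # machine check reports
    /I\/O error|EXT4-fs error/  { io++ }       # disk and filesystem errors
    /timed out/                 { timeout++ }  # hardware timeouts
    /Kernel panic/              { panic++ }    # fatal kernel errors
    END {
      printf "hardware=%d io=%d timeout=%d panic=%d\n", hw, io, timeout, panic
    }
  '
}

# Demo using three of the sample lines shown above.
summarize_kernel_errors <<'EOF'
Feb 11 10:15:32 hostname kernel: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: b200000000070f0f
Feb 11 10:16:45 hostname kernel: EXT4-fs error (device sda1): ext4_find_entry:1453: inode #2: comm ls: reading directory lblock 0
Feb 11 10:18:05 hostname kernel: Kernel panic - not syncing: Fatal exception
EOF
```

To run it against the live journal, pipe the kernel messages in: `journalctl -k | summarize_kernel_errors`.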

Conclusion

Thank you for reading the book until the end. If you found it helpful, consider sharing it with others.

This book doesn't end here, though. I will continue to improve it and add new materials in the future. If you found any issues or if you would like to suggest any improvements, feel free to open a PR or an issue.

Stay Connected and Continue Your Learning Journey!

Your journey with Linux doesn't have to end here. Stay connected and take your skills to the next level:

Follow Me on Social Media:
- X: I share useful short form content there. My DMs are always open.
- LinkedIn: I share articles and posts on tech there. Leave a recommendation on LinkedIn and endorse me on relevant skills.
Get access to exclusive content: For one-on-one help and exclusive content go here.

My articles and books, like this one, are part of my mission to increase accessibility to quality content for everyone. This book will also be open to translation in other languages. Each piece takes a lot of time and effort to write. This book will be free, forever. If you've enjoyed my work and want to keep me motivated, consider buying me a coffee.

Thank you once again and happy learning!

How to Learn to Code and Get a Developer Job [Full Book]

Quincy Larson — Thu, 11 Jul 2024 23:52:00 +0000

If you want to learn to code and get a job as a developer, you're in the right place. This book will show you how.

And yes, this is the full book – for free – right here on this page of freeCodeCamp.

Also, I've recorded a FREE full-length audiobook version of this book, which I published as episode #100 of The freeCodeCamp Podcast. You can search for it in your favorite podcast player. Be sure to subscribe. I've also embedded it below for your convenience.

A few years back, one of the Big 5 book publishers from New York City reached out to me about a book deal. I met with them, but didn't have time to write a book.

Well, I finally had time. And I decided to just publish this book for free, right here on freeCodeCamp.

Information wants to be free, right? 🙂

It will take you a few hours to read all this. But this is it. My insights into learning to code and getting a developer job.

I learned all of this while:

learning to code in my 30s
then working as a software engineer
then running freeCodeCamp.org for the past 8 years. Today, more than a million people visit this website each day to learn about math, programming, and computer science.

I was an English teacher who had never programmed before. And I was able to learn enough coding to get my first software development job in just one year.

All without spending money on books or courses.

(I did spend money to travel to nearby cities and participate in tech events. And as you'll see later in the book, this was money well spent.)

After working as a software engineer for a few years, I felt ready. I wanted to teach other people how to make this career transition, too.

I built several technology education tools that nobody was interested in using. But then one weekend, I built freeCodeCamp.org. A vibrant community quickly gathered around it.

Along the way, we all helped each other. And today, people all around the world have used freeCodeCamp to prepare for their first job in tech.

You may be thinking: I don't know if I have time to read this entire book.

No worries. You can bookmark it. You can come back to it and read it across as many sittings as you need to.

And you can share it on social media. Sharing: "check out this book I'm reading" and linking to it is a surprisingly effective way to convince yourself to finish reading a book.

I say this because I'm not trying to sell you this book. You already "bought" this book when you opened this webpage. Now my goal is to reassure you that it will be worth investing your time to finish reading this book. 😉

I promise to be respectful of your time. There's no hype or fluff here – just blunt, actionable tips.

I'm going to jam as much insight as I can into every chapter of this book.

Which reminds me: where's the table of contents?

Ah. Here it is:

Preface: Who is this book for?
500 Word Executive Summary
Chapter 1: How to Build Your Skills
Chapter 2: How to Build Your Network
Chapter 3: How to Build Your Reputation
Chapter 4: How to Get Paid to Code – Freelance Clients and the Job Search
Chapter 5: How to Succeed in Your First Developer Job
Epilogue: You Can Do This

Preface: Who is This Book For?

This book is for anyone who is considering a career in software development.

If you're looking for a career that's flexible, high-paying, and involves a lot of creative problem solving, software development may be for you.

Of course, each of us approaches our own coding journey with certain resources: time, money, and opportunity.

You may be older, and may have kids or elderly relatives you're taking care of. So you may have less time.

Or you may be younger, and may have had less time to build up any savings, or acquire skills that boost your income. So you may have less money.

And you may live far away from the major tech cities like San Francisco, Berlin, Tokyo, or Bengaluru.

Or you may live with disabilities, physical or mental. Ageism, racism, and sexism are real. Immigration status can complicate the job search. So can a criminal record.

So you may have less opportunity.

Learning to code and getting a developer job is going to be harder for some people than it will be for others. Everyone approaches this challenge from their own starting point, with whatever resources they happen to have on hand.

But wherever you may be starting out from – in terms of time, money, and opportunity – I'll do my best to give you actionable advice.

In other words: don't worry – you are in the right place.

A Quick Note on Terminology

Whenever I use new terms, I'll do my best to define them.

But there are a few terms I'll be saying all the time.

I'll use the words "programming" and "coding" interchangeably.

I'll use the word "app" as it was intended – as shorthand for any sort of application, regardless of whether it runs on a phone, laptop, game console, or refrigerator. (Sorry, Steve Jobs. iPhone does not have a monopoly on the word app.)

I will also use the words "software engineer" and "software developer" interchangeably.

You may encounter people in tech who take issue with this. As though software engineering is some fancy-pants field with a multi-century legacy, like mechanical engineering or civil engineering are. Who knows – maybe that will be true for your grandkids. But we are still very much in the early days of software development as a field.

I'll just drop this quote here for you, in case you feel awkward calling yourself a software engineer:

"If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization." – Gerald Weinberg, Programmer, Author, and University Professor

Can Anyone Learn to Code?

Yes. I believe that any sufficiently motivated person can learn to code. At the end of the day, learning to code is a motivational challenge – not a question of aptitude.

On the savannas of Africa – where early humans lived for thousands of years before spreading to Europe, Asia, and the Americas – were there computers?

Programming skills were never something that was selected for over the millennia. Computers as we know them (desktops, laptops, smartphones) emerged in the 80s, 90s, and 00s.

Yes – I do believe that aptitude plays a part. But at the end of the day, anyone who wants to become a professional developer will need to put in time at the keyboard.

A vast majority of people who try to learn to code will get frustrated and give up.

I sure did. I got frustrated and gave up. Several times.

But like other people who eventually succeeded, I kept coming back after a few days, and tried again.

I say all this because I want to acknowledge: learning to code and getting a developer job is hard. And it's even harder for some people than others, due to circumstance.

I'm not going to pretend to have faced true adversity in learning to code. Yes, I was in my 30s, and I had no formal background in programming or computer science. But consider this:

I grew up middle class in the United States – a 4th-generation American from an English-speaking home. I went to university. My father went to university. And his father went to university. (His parents before him were farmers from Sweden.)

I benefitted from a sort of intergenerational privilege. A momentum that some families are able to pick up over time when they are not torn apart by war, famine, or slavery.

So that is my giant caveat to you: I am not some motivational figure to pump you up to overcome adversity.

If you need inspiration, there are a ton of people in the developer community who have overcome real adversity. You can seek them out.

I'm not trying to elevate the field of software development. I'm not going to paint pictures of science fiction utopias that can come about if everyone learns to code.

Instead, I'm just going to give you practical tips for how you can acquire these skills. And how you can go get a good job, so you can provide for your family.

There's nothing wrong with learning to code because you want a good, stable job.

There's nothing wrong with learning to code so you can start a business.

You may encounter people who say that you must be so passionate about coding that you dream about it. That you clock out of your full-time job, then spend all weekend contributing to open source projects.

I do know people who are that passionate about coding. But I also know plenty of people who, after finishing a hard week's work, just want to go spend time in nature, or play board games with friends.

People generally enjoy doing things they're good at doing. And you can develop a reasonable level of passion for coding just by getting better at coding.

So in short: who is this book for? Anyone who wants to get better at coding, and get a job as a developer. That's it.

You don't need to be a self-proclaimed "geek", an introvert, or an ideologically-driven activist. Or any of those stereotypes.

It's fine if you are. But you don't need to be.

So if that's you – if you're serious about learning to code well enough to get paid to code – this book is for you.

And you should start by reading this quick summary of the book. And then reading the rest of it.

500 Word Executive Summary

Learning to code is hard. Getting a job as a software developer is even harder. But for many people, it's worth the effort.

Coding is a high-paying, intellectually challenging, creatively rewarding field. There is a clear career progression ahead of you: senior developer, tech lead, engineering manager, CTO, and perhaps even CEO.

You can find work in just about any industry. About two thirds of developer jobs are outside of what we traditionally call "tech" – in agriculture, manufacturing, government, and service industries like banking and healthcare.

If you're worried your job might be automated before you reach retirement, consider this: coding is the act of automating things. Thus it is by definition the last career that will be completely automated.

Automation will impact coding. It already has. For decades.

Generative AI tools like GPT-4 and Copilot can help us move from Imperative Programming – where you tell computers exactly what to do – closer to Declarative Programming – where you give computers higher-level objectives. In other words: Star Trek-style programming.

You should still learn math even though we now have calculators. And you should still learn programming even though we now have AI tools that can write code.

Have I sold you on coding as a career for you?

Good. Here's how to break into the field.

Build your skills.

You need to learn:

Front End Development: HTML, CSS, JavaScript
Back End Development: SQL, Git, Linux, and Web Servers
Scientific Computing: Python and its many libraries

These are all mature, 20+ year old technologies. Whichever company you work for, you will almost certainly use most of these tools.

The best way to learn these tools is to build projects. Try to code at least some every day. If you do the freeCodeCamp curriculum from top to bottom, you'll learn all of this and build dozens of projects.

Some of the certifications in the freeCodeCamp core curriculum.

Build your network.

So much of getting a job is who you know.

It's OK to be an introvert, but you do need to push your boundaries.

Create GitHub, Twitter, LinkedIn, and Discord accounts.

Go to tech meetups and conferences. Travel if you have to. (Most of your "learn to code" budget should go toward travel and event tickets – not books and courses.)

Greet people who are standing by themselves. Let others do most of the talking, and really listen. Remember people's names.

Add people on LinkedIn, follow them on Twitter, and go to after-parties.

Build your reputation.

Share short video demos of your projects.

Keep applying to speak at bigger and bigger conferences.

Hang out at hackerspaces and help people who are even newer to coding than you.

Contribute to open source. The work is similar to professional software development.

Build your Skills, Network, and Reputation at the same time. Don't let yourself procrastinate on the scariest parts.

Instead of applying for jobs through the "front door," use your network to land job interviews through the "side door." Recruiters can help, too.

Keep interviewing until you start getting job offers. You don't need to accept the first offer you get, though. Be patient.

Your first developer job will be the hardest. Try to stay there for at least 2 years, and essentially get paid to learn.

The real learning begins once you're on-the-job, working alongside a team, and with large legacy codebases.

Most importantly, sleep and exercise.

Any sufficiently-motivated person can learn to code well enough to get a job as a developer.

It's just a question of how badly you want it, and how persistent you can be in the job search.

Remember: you can do this.

This Book is Dedicated to the Global freeCodeCamp Community.

Thank you to all of you who have supported our charity and our mission over the past 9 years.

It is through your volunteerism and through your philanthropy that we've been able to help so many people learn to code and get their first developer job.

The community has grown so much from the humble open source project I first deployed in 2014. I am now just a small part of this global community.

It is a privilege to still be here, working alongside you all. Together, we face the fundamental problems of our time. Access to information. Access to education. And access to the tools that are shaping the future.

These are still early days. I have no illusion that everyone will know how to code within my lifetime. But just like the Gutenberg Bible accelerated literacy in 1455, we can continue to accelerate technology literacy through free, open learning resources.

Again, thank you all.

And special thanks to Abbey Rennemeyer for her editorial feedback, and to Estefania Cassingena Navone for designing the book cover.

And now, the book.

Chapter 1: How to Build Your Skills

"Every artist was first an amateur." ― Ralph Waldo Emerson

The road to knowing how to code is a long one.

For me, it was an ambiguous one.

But it doesn't have to be like that for you.

In this chapter, I'm going to share some strategies for learning to code as smoothly as possible.

First, allow me to walk you through how I learned to code back in 2011.

Then I'll share what I learned from this process.

I'll show you how to learn much more efficiently than I did.

Story Time: How Did a Teacher in His 30s Teach Himself to Code?

I was a teacher running an English school. We had about 100 adult-aged students who had traveled to California from all around the world. They were learning advanced English so they could get into grad school.

Most of our school's teachers loved teaching. They loved hanging out with students around town, and helping them improve their conversational English.

What these teachers didn't love was paperwork: Attendance reports. Grade reports. Immigration paperwork.

I wanted our teachers to be able to spend more time with students. And less time chained to their desks doing paperwork.

But what did I know about computers?

Programming? Didn't you have to be smart to do that? I could barely configure a WiFi router. And I sucked at math.

Well one day I just pushed all that aside and thought "You know what: I'm going to give it a try. What do I have to lose?"

I started googling questions like "how to automatically click through websites." And "how to import data from websites into Excel."

I didn't realize it at the time, but I was learning how to automate workflows.

And the learning began. First with Excel macros. Then with a tool called AutoHotkey where you can program your mouse to move to certain coordinates of a screen, click around, copy text, then move to different coordinates and paste it.

After a few weeks of grasping in the dark, I figured out how to automate a few tasks. I could open an Excel spreadsheet and a website, run my script, then come back 10 minutes later and the spreadsheet would be fully populated.

It was the work of an amateur. What developers might call a "dirty hack". But it got the job done.

I used my newfound automation skills to continue streamlining the school.

Soon teachers barely had to touch a computer. I was doing the work of several teachers, just with my rudimentary skills.

This had a visible impact on the school. So much of our time had been tied up with rote work on the computer. And now we were free.

The teachers were happier. They spent more time with students.

The students were happier. They told all their friends back in their home country "you've got to check out this school."

Soon we were one of the most successful schools in the entire school system.

This further emboldened me. I remember thinking to myself: "Maybe I can learn to code."

I knew some software engineers from my board game night. They had traditional backgrounds, with degrees from Caltech, Harvey Mudd, and other famous Computer Science programs.

At the time, it was far less common for people in their 30s to learn to code.

I worked up the courage to share my dreams with some of these friends.

I wanted to learn how to program properly. I wanted to be able to write code for a living like they did. And to maybe even write software that could power schools.

I would share these dreams with my developer friends. "I want to do what you do."

But they would sort of shrug. Then they'd say something like:

"I mean, you could try. But you're going to have to drink an entire ocean of knowledge."

And: "It's a pretty competitive field. How are you going to hang with people who grew up coding from an early age?"

And: "You're already doing fine as a teacher. Why don't you just stick with what you're good at?"

And that would knock me off course for a few weeks. I would go on long, soul-searching walks at night. I would ponder my future under the stars. Were these people right? I mean – they would know, right?

But every morning I'd be back at my desk. Watching my scripts run. Watching my reports compile themselves at superhuman speeds. Watching as my computer did my bidding.

A thought did occur to me: maybe these friends were just trying to save me from heartache. Maybe they just don't know anyone who learned to code in their 30s. So they don't think it's possible.

It's like... for years doctors thought that it would be impossible for someone to run a mile in 4 minutes. They thought your heart would explode from running so fast.

But then somebody managed to do it. And his heart did not explode.

Once Roger Bannister – a 25-year-old Oxford student – broke that psychological barrier, a ton of other people did it, too. To date, more than 1,000 people have run a sub-4-minute mile.

Roger Bannister running like a champ. (Image: Britannica)

And it's not like I was doing something as bold and unprecedented as running a 4 minute mile here. Plenty of famous developers have managed to teach themselves coding over the years.

Heck, Ada Lovelace taught herself programming in the 1840s. And she didn't even have a working computer. She just had an understanding of how her friend Charles Babbage's computer would work in theory.

She wrote several of the first computer algorithms. And she's widely regarded as the world's first computer programmer. Nobody taught her. Because there was nobody to teach her. Whatever self doubt she may have had, she clearly overcame it.

Now, I was no Ada Lovelace. I was just some teacher who already had a working computer, a decent internet connection, and the ability to search through billions of webpages with Google.

I cracked my knuckles and narrowed my gaze. I was going to do this.

Stuck in Tutorial Hell

"If you work for 10 years, do you get 10 years of experience or do you get 1 year of experience 10 times? You have to reflect on your activities to get true experience. If you make learning a continuous commitment, you’ll get experience. If you don’t, you won’t, no matter how many years you have under your belt." – Steve McConnell, Software Engineer

I spent the next few weeks googling around, and doing random tutorials that I encountered online.

Oh look, a Ruby tutorial.

Uh-oh, it's starting to get hard. I'm getting error messages not mentioned in the tutorial. Hm... what's going on here...

Oh look, a Python tutorial.

Human psychology is a funny thing. The moment something starts to get hard, we ask: am I doing this right?

Maybe this tutorial is out of date. Maybe its author didn't know what they were talking about. Does anybody even still use this programming language?

When you're facing ambiguous error messages hours into a coding session, the grass on the other side starts to look a lot greener.

It was easy to pretend I'd made progress. Time to go grab lunch.

I'd see a friend at the café. "How's your coding going?" they'd ask.

"It's going great. I already coded 4 hours today."

"Awesome. I'd love to see what you're building sometime."

"Sure thing," I'd say, knowing that I'd built nothing. "Soon."

Maybe I'd go to the library and check out a new JavaScript book.

There's that old saying that buying books gives you the best feeling in the world. Because it also feels like you're buying the time to read them.

And this is precisely where I found myself a few weeks into learning to code.

I had read the first 100 pages of several programming books, but finished none.

I had written the first 100 lines of code from several programming tutorials, but finished none.

I didn't know it, but I was trapped in a place that developers lovingly call "tutorial hell."

Tutorial hell is where you jump from one tutorial to the next, learning and then relearning the same basic things. But never really going beyond the fundamentals.

Because going beyond the fundamentals? Well, that requires some real work.

It Takes a Village to Raise a Coder

Learning to code was absorbing all of my free time. But I wasn't making much progress. I could now type the { and * characters without looking at the keyboard. But that was about it.

I knew I needed help. Perhaps some Yoda-like mentor, who could teach me the ways. Yes – if such a person existed, surely that would make all the difference.

I found out about a nearby place called a "hackerspace." When I first heard the name, I was a bit apprehensive. Don't hackers do illegal things? I was an English teacher who liked playing board games. I was not looking for trouble.

Well I called the number listed and talked with a guy named Steve. I nervously asked: "You all don't do anything illegal, do you?" And Steve laughed.

It turns out the word "hack" is what he called an overloaded term. Yes – "to hack" can mean to maliciously break into a software system. But "to hack" can also mean something more mundane: to write computer code.

Something can be "hacky" meaning it's not an elegant solution. And yet you can have "a clever hack" – an ingenious trick to make your code work more efficiently.

In short: don't be scared of the term "hack."

Facebook's corporate campus has the word "hack" written in giant letters on the concrete. (Image: Bloomberg)

I, for one, scarcely use the term because it's so confusing. And I think recently a lot of hackerspaces have picked up on the ambiguity. Many of them now call themselves "makerspaces" instead.

Because that's what a hackerspace is all about – making things.

Steve invited me to visit the hackerspace on Saturday afternoon. He said several developers from the area would be there.

The first time I walked through the doors of the Santa Barbara Hackerspace, I was blown away.

The place smelled like an electric fire. Its makeshift tables were lined with soldering irons, strips of LED lights, hobbyist Arduino circuit boards, and piles of Roomba vacuum robots.

The same Steve I'd spoken to on the phone was there, and he greeted me. He had glasses, slicked back hair, and a goatee beard. He was always smiling. And when you asked him a question, instead of responding quickly, he would nod and think for a few seconds first.

Steve was a passionate programmer who had studied math and philosophy at the University of California – Santa Barbara. He was still passionate about those subjects. But his real passion was Python.

Steve turned on the projector and gave an informal "lightning talk." He was demoing an app he'd written that would recognize QR codes in a video and replace them with images.

Someone in the audience pulled up a QR code on their laptop and held it in front of the camera. Steve's app then replaced the QR code with a picture of a pizza.

Somebody in the audience shouted, "Can you make the pizza spin?"

Steve opened up his code in a code editor called Emacs and started making changes to it in real time. He effortlessly tabbed between his code editor, his command line, and the browser the app was running in, "hot loading" updates to the code.

For me, this was sorcery. I couldn't believe Steve had just busted out that app over the course of a few hours. And now he was adding new features on the fly, as the audience requested them.

I thought: "This guy is a genius."

And that evening, after the event ended, he and I stayed after and I told him so.

We ate sandwiches together. And I said to him: "I could code for my entire career and not be as good as you. I would be thrilled if after 10 years I could code even half as well as you."

But Steve pushed back. He said, "I'm nothing special. Don't limit yourself. If you stick with coding, you could easily surpass me."

Now, I didn't for a second believe the words he said to me. But just the fact that he said it gave me butterflies.

Here he was: a developer who believed in me. He saw me – some random teacher – the very definition of a "script kiddie" – and thought I could make it.

Steve and I talked late into the night. He showed me his $200 netbook computer, which even by 2011 standards was woefully underpowered.

"You don't need a powerful computer to build software," Steve told me. "Today's hardware is incredibly powerful. Computers are only slow because the bloated software they run makes them slow. Get an off-the-shelf laptop, wipe the hard drive, install Linux on it, and start coding."

I took note of the model of laptop he had and ordered the exact same one when I got home that night.

After a few days of debugging my new computer with Stack Overflow, I successfully installed Ubuntu. I started learning how to use the Emacs code editor. By the following Saturday, I knew a few commands, and was quick to show them off.

Steve nodded in approval. He said, "Awesome. But what are you building?"

I didn't understand what he meant. "I'm learning how to use Emacs. Check it out. I memorized..."

But Steve looked pensive. "That's cool and all. But you need a project. Always have a project. Then learn what you need to learn en route to finishing that project."

Other than a few scripts I'd written to help the teachers at my school, I had never finished anything. But I started to see what he was saying.

And it started to dawn on me. All this time I had been trapped in tutorial hell, going in circles, finishing nothing.

Steve said, "I want you to build a project using HTML5. And next Saturday, I want you to present it at the hackerspace."

I was mortified at his words. But I stood up straight and said, "Sounds like a plan. I'm on it."

Nobody Can Make You a Developer But You

"I'm trying to free your mind, Neo. But I can only show you the door. You're the one that has to walk through it." – Morpheus in the 1999 film The Matrix

The next morning, I woke up extra early before work and googled something like "HTML5 tutorial." I already knew a lot of this from my previous time in tutorial hell. But instead of skipping ahead, I just slowed my roll and followed along exactly, typing every single command.

Usually once I finished a tutorial I would just go find another tutorial. But instead, I started playing with the tutorial's code. I had a simple idea for a project. I was going to make an HTML5 documentation page. And I was going to code it purely in HTML5.

Let me explain HTML5 real quick. It's just a newer version of HTML, which has existed since the first webpages back in the 1990s.

If a website was a body, HTML would be the bones. Everything else rests on top of those bones. (You can think of JavaScript as the muscles and CSS as the skin. But let's get back to the story.)

I knew that in HTML, you could link to different parts of the same webpage by using id attributes. So I thought: what if I put a table of contents along the left-hand side? Then clicking the different items on the left would scroll down the page on the right to show those items.

Within half an hour, I had coded a rough prototype.

But it was time to report for work at the school. The entire day, all I could think about was my project, and how I should best go about finishing it.

I raced home, opened up my laptop, and spent the entire evening coding.

I copied the official (and creative commons-licensed) HTML documentation directly into my page, "hard coding" it into the HTML.

Then I spent about an hour on the CSS, getting everything to look right, and using absolute positioning to keep the sidebar in place.

I made a point to make use of as many of HTML5's new "semantic" tags as I could.

And boom – project finished.

A wave of accomplishment washed over me. I jogged to a nearby football field and ran laps around the field, celebrating. I did it. I finished a project.

And I decided right then and there: from here on out, everything I do is going to be a project. I'm going to be working toward some finished product.

The next evening I walked up to the podium, plugged in my laptop, and presented my HTML5 webpage. I answered questions from the developers there about HTML5.

Sometimes I'd get something wrong, and someone in the audience would say, "that doesn't sound right – let me check the documentation."

People weren't afraid to correct me. But they were polite and supportive. It didn't even feel like they were correcting me – it felt like they were correcting the public record – lest someone walk away with incorrect information.

I didn't feel any of the anxiety that I might have felt giving a talk at a teacher in-service meeting.

Instead I almost felt like I was part of the audience, learning alongside them.

After all, these tools were new and emerging. We were all trying to understand how to use them together.

After my talk, Steve came up to me and said, "Not bad."

I smiled for an awkwardly long time, not saying anything, just happy with myself.

Then Steve squinted and pursed his lips. He said: "Start your next project tonight."

Lessons from my Coding Journey

We'll check in on younger Quincy's coding journey in each of the following chapters. But now I want to break down some of the lessons here. And I want to answer some of the questions you may have.

Why is Learning to Code so Hard?

Learning any new skill is hard. Whether it's dribbling a soccer ball, changing the oil on a car, or speaking a new language.

Learning to code is hard for a few particular reasons. And some of these are unique to coding.

The first one is that most people don't understand exactly what coding is. Well, I'm going to tell you.

What is coding?

Coding is telling a computer what to do, in a way the computer can understand.

That's it. That's all coding really is.

Now, make no mistake. Communicating with computers is hard. They are "dumb" by human standards. They will do exactly what you tell them to do. But unless you're good at coding, they are probably not going to do what you want them to do.

You may be thinking: what about servers? What about databases? What about networks?

At the end of the day, these are all controlled by layers of software. Code. It's code all the way down. Eventually you reach the physical hardware, which is moving electrons around circuit boards.

For the first few decades of computing, developers wrote code that was "close to the metal" – often operating on the hardware directly, flipping bits from 0 to 1 and back.

But contemporary software development involves so many "layers of abstraction" – programs running on top of programs – that just a few lines of JavaScript code can do some really powerful things.

In the 1940s, a "bug" could be a literal insect crawling around inside a room-sized computer, and getting fried in one of the circuits.

The first recorded computer bug, found in 1947, was a moth that got trapped in the panels of the Harvard Mark II, a room-sized computer at Harvard. (Image: Public Domain)

Today, we're writing code many layers of abstraction above the physical hardware.

That is coding. It's vastly easier than it has ever been in the past. And it is getting easier to do every year.

I am not exaggerating when I say that in a few decades, coding will be so easy and so common that most younger people will know how to do it.

Why is learning to code still so hard after all these years?

There are three big reasons why learning to code is so hard, even today:

The tools are still primitive.
Most people aren't good at handling ambiguity, and learning to code is ambiguous. People get lost.
Most people aren't good at handling constant negative feedback. And learning to code is one brutal error message after another. People get frustrated.

Now I'll discuss each of these difficulties in more detail. And I'll give you some practical strategies for overcoming each of them.

The Tools are Still Primitive

A Possessed Barclay from Star Trek: The Next Generation, programming on the Holodeck.

"Computer. Begin new program. Create as follows. Work station chair. Now create a standard alphanumeric console positioned to the left hand. Now an iconic display console for the right hand. Tie both consoles into the Enterprise main computer core, utilizing neuralscan interface." - Barclay from Star Trek: The Next Generation, Season 4 Episode 19: "The Nth Degree"

This is how people might program in the future. It's an example from my favorite science fiction TV show, Star Trek: The Next Generation.

Every character in Star Trek can code. Doctors, security officers, pilots. Even little Wesley Crusher (played by child actor Wil Wheaton) can get the ship's computer to do his bidding.

Sure – one of the reasons everyone can code is that they live in a post-scarcity 24th-century society, with access to free high quality education.

Another reason is that in the future, coding will be much, much easier. You just tell a computer precisely what to do, and – if you're precise enough – the computer does it.

What if programming was as easy as just saying instructions to a computer in plain English?

Well, we've already made significant progress toward this goal. Think of our grandmothers, running between room-sized mainframe computers with stacks of punchcards.

Working with a punchcard-based computer in the 1950s (Image: NASA)

It used to be that programming even a simple application would require meticulous instructions.

Here's a "Caesar Cipher", the classic computer science homework project.

This version is also known as "ROT-13" because you ROTate each letter by 13 positions. For example, A becomes N (13 letters after A), and B becomes O (13 letters after B).

I'm going to show you two examples of this program.

First, here's the program in x86 Assembly:

format     ELF     executable 3
entry     start

segment    readable writeable
buf    rb    1

segment    readable executable
start:    mov    eax, 3        ; syscall "read"
    mov    ebx, 0        ; stdin
    mov    ecx, buf    ; buffer for read byte
    mov    edx, 1        ; len (read one byte)
    int    80h

    cmp    eax, 0        ; EOF?
    jz    exit

    xor     eax, eax    ; load read char to eax
    mov    al, [buf]
    cmp    eax, "A"    ; see if it is in ascii a-z or A-Z
    jl    print
    cmp    eax, "z"
    jg    print
    cmp    eax, "Z"
    jle    rotup
    cmp    eax, "a"
    jge    rotlow
    jmp    print

rotup:    sub    eax, "A"-13    ; do rot 13 for A-Z
    cdq
    mov    ebx, 26
    div    ebx
    add    edx, "A"
    jmp    rotend

rotlow:    sub    eax, "a"-13    ; do rot 13 for a-z
    cdq
    mov    ebx, 26
    div    ebx
    add    edx, "a"

rotend:    mov    [buf], dl

print:     mov    eax, 4        ; syscall write
    mov    ebx, 1        ; stdout
    mov    ecx, buf    ; *char
    mov    edx, 1        ; string length
    int    80h

    jmp    start

exit:     mov     eax,1        ; syscall exit
    xor     ebx,ebx        ; exit code
    int     80h

This x86 Assembly example comes from the Creative Commons-licensed Rosetta Code project.

And here's the same program, written in Python:

def rot13(text):
    result = []

    for char in text:
        ascii_value = ord(char)

        if 'A' <= char <= 'Z':
            result.append(chr((ascii_value - ord('A') + 13) % 26 + ord('A')))
        elif 'a' <= char <= 'z':
            result.append(chr((ascii_value - ord('a') + 13) % 26 + ord('a')))
        else:
            result.append(char)

    return ''.join(result)

if __name__ == "__main__":
    input_text = input("Enter text to be encoded/decoded with ROT-13: ")
    print("Encoded/Decoded text:", rot13(input_text))

This is quite a bit simpler and easier to read, right?

This Python example comes straight from GPT-4. I prompted it the same way Captain Picard would prompt the ship's computer in Star Trek.

Here's exactly what I said to it: "Computer. New program. Take each letter of the word I say and replace it with the letter that appears 13 positions later in the English alphabet. Then read the result back to me. The word is Banana."

GPT-4 produced this Python code, and then read the result back to me: "Onanan."
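
If you want to double check that result yourself, here is a minimal sketch that takes a different route: it uses the `rot_13` text transform from Python's standard `codecs` module. This is not the code GPT-4 produced, and the helper name `rot13_stdlib` is made up for illustration. It also shows a handy property of ROT-13: since 13 + 13 = 26, running it twice gives you back the original text.

```python
import codecs

def rot13_stdlib(text):
    # "rot_13" is a text-to-text transform that ships with Python
    return codecs.encode(text, "rot_13")

encoded = rot13_stdlib("Banana")
decoded = rot13_stdlib(encoded)  # rotating twice moves each letter 26 places

print(encoded)  # Onanan
print(decoded)  # Banana
```

So the same function both encodes and decodes.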

What we're doing here is called Declarative Programming. We're declaring "computer, you should do this." And the computer is smart enough to understand our instructions and execute them.

Now, the style of coding most developers use today is Imperative Programming. We're telling the computer exactly what to do, step-by-step. Because historically, computers have been pretty dumb. So we've had to help them put one foot in front of the other.
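
To make the contrast a bit more concrete, here is a small sketch of one task written both ways in Python. The task and both function names are made up for illustration. The second version is only "more declarative" in a loose sense: it describes the result we want and leaves the looping to the language. It is still a long way from talking to the ship's computer.

```python
def count_vowels_imperative(word):
    # Imperative: spell out every step, one at a time
    count = 0
    for letter in word:
        if letter.lower() in "aeiou":
            count += 1
    return count

def count_vowels_declarative(word):
    # More declarative: describe the result, let Python handle the looping
    return sum(letter.lower() in "aeiou" for letter in word)

print(count_vowels_imperative("Banana"))   # 3
print(count_vowels_declarative("Banana"))  # 3
```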

The field of software development just isn't mature yet.

But just like early human tools advanced – from stone to bronze to iron – the same is happening with software tools. And much faster.

We're probably still in the programming equivalent of the Bronze Age right now. But we may reach the Iron Age in our lifetime. Generative AI tools like GPT are quickly becoming more powerful and more reliable.

The developer community is still divided on how useful tools like GPT will be for software development.

On one side, you have the "become your own boss" entrepreneur influencers who say things like: "You don't need to learn to code anymore. ChatGPT can write all your code for you. You just need an app idea."

And on the other side of the spectrum, you have "old guard" developers with decades of programming experience – many of whom are skeptical that tools like GPT are really all that useful for producing production-grade code.

As with most things, the real answer is probably somewhere in between.

You don't have to look hard to find YouTube videos of people who start with an app idea, then prompt ChatGPT for the code they need. Some people can even take that code and wire it together into an app that works.

Large Language Models like GPT-4 are impressive, and the speed at which they're improving is even more impressive.

Still, many developers are skeptical about how useful these tools will ultimately become. They question whether we'll be able to get AIs to stop "hallucinating" false information.

This is the fundamental problem of "Interpretability." It could be decades before we truly understand what's going on inside of a black box AI like GPT-4. And until we do, we should double check everything it says, and assume there will be lots of bugs and security flaws in the code that it gives us.

There's a big difference between being able to get a computer to do something for you, and actually understanding how the computer is doing it.

Many people can operate a car. But far fewer can repair a car – let alone design a new car from the ground up.

If you want to be able to develop powerful software systems that solve new problems – and you want those systems to be fast and secure – you're still going to need to learn how to code properly.

And that means feeling your way through a lot of ambiguity.

Learning to Code is an Ambiguous Process

When you're learning to code, you constantly ask yourself: "Am I spending my time wisely? Am I learning the right tools? Do these book authors / course creators even know what they're talking about?"

Ambiguity fogs your every study session. "Did my test case fail because the tutorial is out of date, and there have been breaking changes to the framework I'm using? Or am I just doing it wrong?"

As I mentioned earlier with Tutorial Hell, you also have to cope with "grass is greener on the other side" disease.

This is compounded by the fact that some developers think it's clever to answer questions with "RTFM" which means "Read the Freaking Manual." Not super helpful. Which manual? Which section?

Another problem is: you don't know what you don't know. Often you can't even articulate the question you're trying to ask.

And if you can't even ask the right question, you're going to thrash.

This is extra hard with coding because it's possible no one has attempted to build quite the same app that you're building.

And thus some of the problems you encounter may be unprecedented. There may be no one to turn to.

15% of the queries people type into Google every day have never ever been searched before. That's bad news if you're the person typing one of those.

My theory is that most developers will figure out how to solve a problem and simply move on, without ever documenting it anywhere. So you may be one of dozens of developers who have had to invent their own solution to the exact same problem.

And then, of course, there are the old forum threads and StackOverflow pages.

Comic by XKCD

How Not to Get Lost When Learning to Code

The good news is: both competence and confidence come with practice.

Soon you'll know exactly what to google. You'll get a second sense for how documentation is usually structured, and where to look for what. And you'll know where to ask which questions.

I wish there were a simpler solution to the ambiguity problem. But you just need to accept it. Learning to code is an ambiguous process. And even experienced developers grapple with ambiguity.

After all, coding is the rare profession where you can just infinitely reuse solutions to problems you've previously encountered.

Thus as a developer, you are always doing something you've never done before.

People think software development is about typing code into a computer. But it's really about learning.

You're going to spend a huge portion of your career just thinking really hard. Or blindly inputting commands into a prompt trying to understand how a system works.

And you're going to spend a lot of time in meetings with other people: managers, customers, fellow devs. Learning about the problem that needs to be solved, so you can build a solution to it.

Get comfortable with ambiguity and you will go far.

Learning to Code is One Error Message After Another

A lot of people who are learning to code feel like they hit a wall. Progress does not come as fast as they expect.

One huge reason for this: in programming, the feedback loop is much tighter than in other fields.

In most schools, your teacher will give you assignments, then grade those assignments and give them back to you. Over the course of a semester, you may only have a dozen instances where you get feedback.

"Oh no, I really bombed that exam," you might say to yourself. "I need to study harder for the midterm."

Maybe your teacher will leave notes in red ink on your paper to help you improve your work.

Getting a bad grade on an exam or paper can really ruin your day.

And that's how we generally think about feedback as humans.

If you've spent much time coding, you know that computers are quite fast. They can execute your code within a few milliseconds.

Most of the time your code will crash.

If you're lucky, you'll get an error message.

And if you're really lucky, you'll get a "stack trace" – everything the computer was trying to do when it encountered the error – along with a line number for the piece of code that caused the program to crash.
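
Here is a minimal sketch of what that looks like in Python. The two functions are hypothetical, written only to trigger a crash on purpose. Normally Python prints the stack trace for you when a program crashes. Here I catch the error and use the standard `traceback` module so we can hold the trace as text and print it ourselves.

```python
import traceback

def average(numbers):
    # Crashes when the list is empty, because len(numbers) is 0
    return sum(numbers) / len(numbers)

def report(scores):
    return average(scores)

try:
    report([])
except ZeroDivisionError:
    # format_exc() returns the same text Python would print on a crash
    trace_text = traceback.format_exc()
    print(trace_text)
```

Read the trace from the bottom up. The last line names the error. The lines above it list each function the computer was in the middle of running (`report`, then `average`), along with the line number where each call happened.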

Now this is in-your-face negative feedback from a computer. Not everyone can handle seeing this over and over all day long.

Imagine if every time you handed your teacher your term paper, they handed it back with a big red "F" written on it. And imagine they did this before you could even blink. Over and over.

That's what coding can feel like sometimes. You want to grab the computer and shout at it, "why don't you just understand what I'm trying to do?"

How Not to Get Frustrated

The key, again, is practice.

Over time, you will develop a tolerance for vague error messages and screen-length stack traces.

Coding will never be harder than it is when you're just starting out.

Not only do you not know what you're doing, but you're not used to receiving such impersonal, rapid-fire, negative feedback.

So here are some tips:

Tip #1: Know that you are not uniquely bad at this.

Everyone who learns to code struggles with the frustration of trying to Vulcan Mind Meld with a computer, and get it to understand you. (That's another Star Trek reference.)

Of course, some people started programming when they were just kids. They may act like they've always been good at programming. But they most likely struggled just like we adults do, and over time have simply forgotten the hours of frustration.

Think of the computer as your friend, not your adversary. It's just asking you to clarify your instructions.

Tip #2: Breathe.

Many people's natural reaction when they get an error message is to gnash their teeth. Then go back into their code editor and start blindly changing code, hoping to somehow luck into getting past it.

This does not work. And I'll tell you why.

The universe is complex. Software is complex. You are unlikely to just Forrest Gump your way into anything good.

Forrest Gump doing what he does and getting improbably lucky catching shrimp.

You may have heard of the Infinite Monkey Theorem. It's a thought experiment where you imagine chimpanzees typing on typewriters.

If you had a newsroom full of chimpanzees doing this, how long would it take before one of them typed out the phrase "to be or not to be" by random chance?

Let's say each chimp types one random character per second. It would likely take 1 quintillion years for one of them to type "to be or not to be." That's 10 to the 18th power. A billion billion.
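
If you'd like to check that figure, here is a rough back-of-the-envelope sketch. It rests on assumptions that are mine, not part of the original thought experiment: a single chimp, a typewriter with only 27 keys (26 letters plus a space bar), and every key equally likely. It is an order-of-magnitude estimate, not an exact probability calculation.

```python
PHRASE = "to be or not to be"
KEYS = 27  # assumption: 26 letters plus a space bar
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Every character has to match by chance, so it takes roughly
# KEYS ** len(PHRASE) keystrokes, at one keystroke per second
expected_seconds = KEYS ** len(PHRASE)
expected_years = expected_seconds / SECONDS_PER_YEAR

print(len(PHRASE))              # 18
print(f"{expected_years:.1e}")  # 1.8e+18
```

That lands in the same ballpark as 1 quintillion years. A whole newsroom of chimps would divide the wait by the number of chimps, which barely dents a number that size.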

Even assuming the chimps remain in good health and the typewriters are regularly serviced – the galaxy would be a cold, dark void by the time one of them managed to type "to be or not to be."

Why do I tell you all of this? Because you don't want to be one of those chimps.

In that time, you could almost certainly figure out a way to teach those chimps how to type English words. They could probably manage to type out all of Hamlet – not just its most famous line.

Even if you somehow do get lucky, and get past the bug, what will you have learned?

So instead of thrashing, you want to take some time. Understand the code. Understand what's happening. And then fix the error.

Always take time to understand the failing code. Don't be a quintillionarian chimp. (I think that means someone who is 1 quintillion years old, though according to Google, nobody has ever typed that word before.)

Instead of blindly trying things, hoping to get past the error message, slow down.

Take a deep breath. Stretch. Get up to grab a hot beverage.

Your future self will be grateful that you took this as a teachable moment.

Tip #3: Use Rubber Duck Debugging

Get a rubber ducky and set it next to your computer. Every time you hit an error message, try to explain what you think is happening to your rubber duck.

Of course, this is silly. How could this possibly be helpful?

Except it is.

Rubber Duck Debugging is a great tool for slowing down and talking through the problem at hand.

You don't have to use a rubber duck, of course. You could explain your Python app to your pet cactus. Your SQL query to the cat that keeps jumping onto your keyboard.

The very act of explaining your thinking out loud seems to help you process the situation better.

How do Most People Learn to Code?

Now let's talk about traditional pathways to a first developer job.

Why should you care what everyone else does? Spoiler alert: you don't really need to.

You do you.

This said, you may doubt yourself and the decisions you've made about your learning. You may yearn for the path not taken.

My goal with this section is to calm any anxieties you may have.

The Importance of Computer Science Degrees

University degrees are still the gold standard for preparing for a career in software development. Especially bachelor's degrees in Computer Science.

Before you start saying "But I don't have a computer science degree" – no worries. You don't need a Computer Science degree to become a developer.

But their usefulness is undeniable. And I'll explain why.

First, you may wonder: why should developers study computer science? After all, one of the most prominent developers of all time had this to say about the field:

"Computer science education cannot make anybody an expert programmer any more than studying brushes and pigment can make somebody an expert painter." – Eric Raymond, Developer, Computer Scientist, and Author

Computer Science departments were traditionally part of the math department. Universities back in the 1960s and 1970s didn't know quite where to put this whole computer thing.

At other universities, Computer Science was considered an extension of Electrical Engineering. And until recently, even University of California – Berkeley – one of the greatest public universities in the world – only provided Computer Science degrees as sort of a double-major with Electrical Engineering.

But most universities have now come to understand the importance of Computer Science as a field of study.

As of writing this, Computer Science is the highest paying degree you can get. Higher even than fields focused on money, such as Finance and Economics.

According to Glassdoor, the average US-based Computer Science major makes more money at their first job than any other major. US $70,000. That's a lot of money for someone who just graduated from college.

More than Nursing majors ($59,000), Finance majors ($55,000) and Architecture majors ($50,000).

OK – so getting a Computer Science degree can help you land a high-paying entry-level job. That is probably news to no one. But why is that?

How Employers Think About Bachelor's Degrees

You may have heard some big employers in tech say things like, "we no longer require job candidates to have a bachelor's degree."

Google said this. Apple said this.

And I believe them. That they no longer require bachelor's degrees.

We've had lots of freeCodeCamp alumni get jobs at these companies, some of whom did not have bachelor's degrees.

But those freeCodeCamp alumni who landed those jobs probably had to be extra strong candidates to overcome the fact that they didn't have bachelor's degrees.

You can look at these job openings as having a variety of criteria they judge candidates on:

Work experience
Education
Portfolio and projects
Do they have a recommendation from someone who already works at the company? (We'll discuss building your network in depth in Chapter 2)
Other reputation considerations (we'll discuss building your reputation in Chapter 3)

For these employers who do not require a bachelor's degree, education is just one of several considerations. If you are stronger in other areas, they may opt to interview you – regardless of whether you've ever even set foot inside a university classroom.

Just note that having a bachelor's degree will make it easier for you to get an interview, even at these "degree-optional" employers.

Why do so Many Developer Jobs Require a Computer Science Degree Specifically?

A bachelor's is a bachelor's, I often tell people. Because for most intents and purposes, it is.

Want to enter the US military as an officer, rather than an enlisted service member? You'll need a bachelor's degree, but any major will do.

Want to get a work visa to work abroad? You'll probably need a bachelor's degree, but any major will do.

And for so many job openings that say "bachelor's degree required" – any major will do.

Why is this? Doesn't the subject you study in university matter at all?

Well, here's my theory on this: what you learn in university is less important than whether you finished university.

Employers are trying to select for people who can figure out a way to get through this rite of passage.

It is certainly true that you can be at the bottom of your class, repeat courses you failed, and be on academic probation half the time. But a degree is a degree.

You know what they call the student who finished last in their class at medical school? "Doctor."

And for most employers, the same holds true.

In many cases, HR folks are just checking a box on their job application filtering software. They're filtering out applicants who don't have a degree. In those cases, they may never even look at job applications from people without degrees.

Again, not every employer is like this. But many of them are. Here in the US, and perhaps even more so in other countries.

It sucks, but it's how the labor market works right now. It may change over the next few decades. It may not.

This is why I always encourage people who are in their teens and 20s to seriously consider getting a bachelor's degree.

Not because of any of the things universities market themselves as:

The education itself. (You can take courses from some of the best universities online for free, so this alone does not justify the high cost of tuition.)
The "college experience" of living in a dorm, making new friends, and self discovery. (Most US University students never live on campus so they don't really get this anyway.)
General education courses that help you become a "well rounded individual." (Ever hear of the Freshman 15? This is a joke of course. But a lot of university freshmen do gain weight due to the stress of the experience.)

Again, the real value of getting a bachelor's degree – the real reason Americans pay $100,000 or more for 4 years of university – is that many employers require degrees.

Of course, there are other benefits of having a bachelor's degree, such as the ones I mentioned: expanded military career options, and greater ease getting work visas.

Another is: if you want to become a doctor, dentist, lawyer, or professor, you will first need a bachelor's degree. You can then use that to get into grad school.

OK – this is a lot of background information. So allow me to answer your questions bluntly.

Do You Need a University Degree to Work as a Software Developer?

No. There are plenty of employers who will hire you without a bachelor's degree.

A bachelor's degree will make it much easier to get an interview at a lot of employers. And it may also help you command a higher salary.

What About Associate's Degrees? Are Those Valuable?

In theory, yes. There are some fields in tech where having an associate's degree may be required. And I think it always does increase your chances of getting an interview.

This said, I would not recommend going to university with the specific goal of getting an associate's degree. I would 100% encourage you to stay in school until you get a bachelor's degree, which is vastly more useful.

According to the US Department of Education, over the course of your career, having a bachelor's degree will earn you 31% more than merely having an associate's degree.

And I'm confident that difference is much wider with a bachelor's in Computer Science.

Is it Worth Going to University to Get a Bachelor's Degree Later in Life, if You Don't Already Have One?

Let's say you're in your 30s. Maybe you attended some college or university courses. Maybe you completed the first two years and were able to get an associate's degree.

Does it make sense to go "back to school" in the formal sense?

Yes, it may make sense to do so.

But I don't think it ever makes sense to quit your job to go back to school full time.

The full-time student lifestyle is really designed with "traditional" students in mind. That is, people age 18 to 22 (or a bit older if they served in the military), who have not yet entered the workforce beyond high school / summer jobs.

Traditional universities cost a lot of money to attend, and the assumption is that students will pay through some combination of scholarships, family funds, and student loans.

As a working adult, you'll have less access to these funding sources. And just as importantly, you'll have less time on your hands than a recent high school graduate would.

But that doesn't mean you have to give up on the dream of getting a bachelor's degree.

Instead of attending a traditional university, I recommend that folks over 30 attend one of the online nonprofit universities. Two that have good reputations, and whose fees are quite reasonable, are Western Governors University and University of the People.

You may also find a local community college or state university extension program that offers degrees. Many of these programs are online. And some of them are even self-paced, so that you can complete courses as your work schedule permits.

Do your research. If a school looks promising, I recommend finding one of its alumni on LinkedIn and reaching out to them. Ask them questions about their experience, and whether they think it was worth it.

I recommend not taking on any debt to finance your degree. It is much better to attend a cheaper school. After all, a degree is a degree. As long as it's from an accredited institution, it should be fine for most intents and purposes.

If You Already Have a Bachelor's Degree, Does it Make Sense to Go Back and Earn a Second Bachelor's in Computer Science?

No. Second bachelor's degrees are almost never worth the time and money.

If you have any bachelor's degree – even if it's in a non-STEM field – you have already gotten most of the value you will get out of university.

What About a Master's of Computer Science Degree?

These can be helpful for career advancement. But you should pursue them later, after you're already working as a developer.

Many employers will pay for their employees' continuing education.

One program a lot of my friends in tech have attended is Georgia Tech's Master's in Computer Science degree.

Georgia Tech's Computer Science department is among the best in the US. And this degree program is not only fully online – it's also quite affordable.

But I wouldn't recommend doing it now. First focus on getting a developer job. (We'll cover that in-depth later in this book).

Will Degrees Continue to Matter in the Future?

Yes, I believe that university degrees will continue to matter for decades – and possibly centuries – to come.

University degrees have existed for more than 1,000 years.

Many of the top universities in the US are older than the USA itself is. (Harvard, founded in 1636, is nearly 400 years old.)

The death of the university degree is greatly exaggerated.

It has become popular in some circles to bash universities, and say that degrees don't matter anymore.

But if you look at the statistics, this is clearly not true. They do have an impact on lifetime earnings.

And just as importantly, they can open up careers that are safer, more stable, and ultimately more fulfilling.

Sure, you can make excellent money working as a deckhand offshore, servicing oil rigs.

But you can make similarly excellent money working as a developer in a climate-controlled office, servicing servers and patching codebases.

One of these jobs is dangerous, back-breaking work. The other is a job you could comfortably do for 40 years.

Many of the "thought leaders" out there who are bashing universities have themselves benefitted from a university education.

One reason why I think so many people think degrees are "useless" is: it's hard to untangle the learning from the status boost you get.

Is university just a form of class signaling – a way for the wealthy to continue to pass advantage on to their children? After all, you're 3 times as likely to find a rich kid at Harvard as you are a poor kid.

The fact is: life is fundamentally unfair. But that does not change how the labor market works.

You can choose easy mode, and finish a degree that will give you more options down the road.

Or you can go hard mode, potentially save time and money, and just be more selective about which employers you apply to.

I have plenty of friends who've used both approaches to great success.

What Alternatives are There to a University Degree?

I've worked in adult education for nearly two decades, and I have yet to see a convincing substitute for a university degree.

Sure – there are certification programs and bootcamps.

But these do not carry the same weight with employers. And they are rarely as rigorous.

Side note: when I say "certification programs" I mean a program where you attend a course, then earn a certification at the end. These are of limited value. But exam-based certifications from companies like Amazon and Microsoft are quite valuable. We'll discuss these in more depth later.

What I tell people is: to degree or not to degree – that is the question.

I meet lots of people who are auto mechanics, electricians, or who do some other sort of trade, who don't have a bachelor's. They can clearly learn a skillset, apply it, and hold down a job.

I meet lots of people who are bookkeepers, paralegals, and other "knowledge workers" who don't have a bachelor's. They can clearly learn a skillset, apply it, and hold down a job.

In many cases, these people can just learn to code on their own, using free learning resources and hanging out with likeminded people.

Some of these people have always had the personal goal of going back and finishing their bachelor's. That's a good reason to do it.

But it's not for everyone.

If you want formal education, go for the bachelor's degree. If you don't want formal education, don't do any program. Just self-teach.

The main thing bootcamps and other certification programs are going to give you is structure and a little bit of peer pressure. That's not a bad thing. But is it worth paying thousands of dollars for it?

How to Teach Yourself to Code

Most developers are self-taught. Even developers who earned a bachelor's degree in computer science still often report themselves as "self-taught" on industry surveys like Stack Overflow's annual survey.

Most working developers consider themselves to be "self-taught" (Image: Stack Overflow 2016 Survey)

This is because learning to code is a life-long process. There are constantly new tools to learn, new legacy codebases to map out, and new problems to solve.

So whether you pursue formal education or not, know this: you will need to get good at self-teaching.

What Does it Mean to be a "Self-Taught" Developer?

Not to be pedantic, but when I refer to self-teaching, I mean self-directed learning – learning outside of formal education.

Very few people are truly "self-taught" at anything. For example, Isaac Newton taught himself Calculus because there were no Calculus books. He had to figure it out and invent it as he went along.

Similarly, Ada Lovelace taught herself programming. Because before her there was no programming. She invented it.

Someone might tell you: "You're not really self taught because you learned from books or online courses. So you had teachers." And they are correct, but only in the most narrow sense.

If someone takes issue with you calling yourself self-taught, just say: "By your standards, no one who wasn't raised by wolves can claim to be self-taught at anything."

Point them to this section of this book and tell them: "Quincy anticipated your snobbery." And then move on with your life.

Because come on, life's too short, right?

You're self taught.

What is Self-Directed Learning?

As a self-learner, you are going to curate your own learning resources. You're going to choose what to learn, from where. That is the essence of "Self-Directed Learning."

But how do you know you're learning the right skills, and leveraging the right resources?

Well, that's where community comes in.

There are lots of communities of learners around the world, all helping one another expand their skills.

Community is a hard word to define. Is Tech Twitter a community? What about the freeCodeCamp forum? Or the many Discord groups and subreddits dedicated to specific coding skillsets?

I consider all of these communities. If there are people who regularly hang out there and help one another, I consider it a community.

What about in-person events? The monthly meetup of Ruby developers in Oakland? The New York City Startup community meetup? The Central Texas Linux User Group?

These communities can be online, in-person, or some mix of both.

We'll talk more about communities in the Build Your Network chapter. But the big takeaway is: the new friends you meet in these communities can help you narrow your options for what to learn, and which resources to learn from.

What Programming Language Should I Learn First?

The short answer is: it doesn't really matter. Once you've learned one programming language well, it is much easier to learn your second language.

There are different types of programming languages, but today most development is done using "high-level scripting languages" like JavaScript and Python. These languages trade away the raw efficiency you get from "low-level programming languages" like C. What they get in return: the benefit of being much easier to use.

Today's computers are billions of times faster than they were in the 1970s and 1980s, when people were writing most of their programs in languages like C. That power more than makes up for the relative inefficiency of scripting languages.

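It's worth noting that the most widely used implementations of both languages are themselves written in lower-level languages: CPython, the standard Python interpreter, is written in C, and JavaScript engines like V8 are written in C++. Both languages are getting faster every year – thanks to their large communities of open source code contributors.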

Python is a powerful language for scientific computing (Data Science and Machine Learning).

And JavaScript... well, JavaScript can do everything. It is the ultimate Swiss Army Knife programming language. JavaScript is the duct tape that holds the World Wide Web together.

"Any application that can be written in JavaScript, will eventually be written in JavaScript." – Atwood's Law (Jeff Atwood, founder of Stack Overflow and Discourse)

You could code your entire career in JavaScript and would never need to learn a second language. (This said, you'll want to learn Python later on, and maybe some other languages as well.)

So I recommend starting with JavaScript. Not only is it much easier to use than languages like Java and C++ – it's easier to learn, too. And there are far, far more job openings for people who know JavaScript.

A screenshot from job search engine Indeed. My search for "javascript" for the US yielded 68,838 job listings.

The other skills you'll want to focus on are HTML and CSS. If a webpage were a body, HTML would be the bones, and CSS would be the skin. (JavaScript would be the muscles, making it possible for the website to move around and be interactive.)

You can learn some HTML and CSS in a single afternoon. Like most of the tools I mention here, they are easy to learn, but difficult to master.

You'll also want to learn how to use Linux. Linux powers a vast majority of the world's servers, and you will spend much of your career running commands in the Linux command line.

If you have a Mac, MacOS has a terminal that accepts almost all the same commands as Linux. (MacOS and Linux have a common ancestor in Unix.)

But if you're on a Windows PC, you'll want to install WSL, which stands for Windows Subsystem for Linux. You will then be able to run Linux commands on your PC. And if you're feeling adventurous, you can even dual boot both the Windows and Linux operating systems on the same computer.

If you're going to install Linux on a computer, I recommend starting with Ubuntu. It is the most widely used (and widely documented) Linux distribution. So it should be the most forgiving.

Make no mistake – Linux is quite a bit harder to use than Windows and MacOS. But what you get in return for your efforts is an extremely fast, secure, and highly customizable operating system.

Also, you will never have to pay for an operating system license again. Unless you want to. Red Hat is a billion dollar company even though its software is open source, because companies pay for their help servicing and supporting Linux servers.

You'll also want to learn Git. This Version Control System is how teams of developers coordinate their changes to a codebase.

You may have heard of GitHub. It's a website that makes it easier for developers to collaborate on open source projects. And it further extends some of the features of Git. You'll learn more about GitHub in the How to Build Your Reputation chapter later.

You'll want to learn SQL and how relational databases work. These are the workhorses of the information economy.

You'll also hear a lot about NoSQL databases (Non-relational databases such as graph databases, document databases, and key-value stores.) You can learn more about these later. But focus on SQL first.

Finally, you'll want to learn how web servers work. You'll want to start with Node.js and Express.js.

When you hear the term "full stack development" it refers to tying together the front end (HTML, CSS, JavaScript) with the back end (Linux, SQL databases, and Node + Express).

There are lots of other tools you'll want to learn, like React, NGINX, Docker, and testing libraries. You can pick these up as you go.

But the key skills you should spend 90% of your pre-job learning time on are:

HTML
CSS
JavaScript
Linux
Git
SQL
Node.js
Express.js

If you learn these tools, you can build most major web and mobile apps. And you will be qualified for most entry-level developer jobs. (Of course, many job descriptions will include other tools, but we'll discuss these later in the book.)

So you may be thinking: great. How do I learn these?

Where Do I Learn How to Code?

Funny you should ask. There's a full curriculum designed by experienced software engineers and teachers. It's designed with busy adults in mind. And it's completely free and self-paced.

That's right. I'm talking about the freeCodeCamp core curriculum. It will help you learn:

Front End Development
Back End Development
Engineering Mathematics
and Scientific Computing (with Python for Data Science and Machine Learning)

To date, thousands of people have gone through this core curriculum and gotten a developer job. They didn't need to quit their day job, take out loans, or really risk anything other than some of their nights and weekends.

In practice, freeCodeCamp has become the default path for most people who are learning to code on their own.

If nothing else, the freeCodeCamp core curriculum can be your "home base" for learning, and you can branch out from there. You can learn the core skills that most jobs require, and also dabble in technologies you're interested in.

There are decades worth of books and courses to learn from. Some are available at your public library, or through monthly subscription services. (And you may be able to access some of these subscription services for free through your library as well.)

Also, freeCodeCamp now has nearly 1,000 free full-length courses on everything from AWS certification prep to mobile app development to Kali Linux.

There has never been an easier time to teach yourself programming.

Building Your Skills is a Life-Long Endeavor

We've talked about why self-teaching is probably the best way to go, and how to go about it.

We've talked about the alternatives to self-teaching, such as getting a bachelor's degree in Computer Science, or getting a Master's degree.

And we've talked about which specific tools you should focus on learning first.

Now, let's shift gears and talk about how to build the second leg of your stool: your network.

Chapter 2: How to Build Your Network

"If you want to go fast, go alone. If you want to go far, go together." – African Proverb

"Networking." You may wince at the sound of that word.

Networking may bring to mind awkward job fairs in stuffy suits, desperately pushing your résumé into the hands of anyone who will accept it.

Networking may bring to mind alcohol-drenched watch parties – where you pretend to be interested in a sport you don't even follow.

Networking may bring to mind wishing "happy birthday" to people you barely know on LinkedIn, or liking their status updates hoping they'll notice you.

But networking does not have to be that way.

In this chapter, I'll tell you everything I've learned about meeting people. I'll show you how to earn their trust and be top of their mind when they're looking for help.

Because at the end of the day, that's what it's all about. Helping people solve their problems. Being of use to people.

I'll show you how to build a robust personal network that will support you for decades to come.

Story Time: How did a Teacher in his 30s Build a Network in Tech?

Last time on Story Time: Quincy learned some coding by reading books, watching free online courses, and hanging out with developers at the local Hackerspace. He had just finished building his first project and given his first tech talk...

OK – so I now had some rudimentary coding skills. I could now code my way out of the proverbial paper bag.

What was next? After all, I was a total tech outsider.

Well, even though I was new to tech, I wasn't new to working. I'd put food on the table for nearly a decade by working at schools and teaching English.

As a teacher, I got paid to sling knowledge. And as a developer, I'd get paid to sling code.

I already knew one very important truth about the nature of work: it's who you know.

I knew the power of networks. I knew that the path to opportunity goes right through the gatekeepers.

All that stood between me and a lucrative developer job was a hiring manager who could say: "Yes. This Quincy guy seems like someone worthy of joining our team."

Of course, being a tech outsider, I didn't know the culture.

Academic culture is much more formal.

You wear a suit.

You use fancy academic terminology to demonstrate you're part of the "in group."

You find ways to work into every conversation that you went to X university, or that you TA'd under Dr. Y, or that you got published in The Journal of Z.

Career progressions are different. Conferences are different. Power structures are different.

And I didn't immediately appreciate this fact.

The first few tech events I went to, I wore a suit.

I kept neatly-folded copies of my résumé in my pocket at all times.

I even carried business cards. I had ordered sheets of anodized aluminum, and used a laser cutter to etch in my name, email address, and even a quote from legendary educator John Dewey:

"Anyone who has begun to think places some portion of the world in jeopardy." – John Dewey

It's still my favorite quote to this day.

But talk about heavy-handed.

"Hi, I'm Quincy. Here's my red aluminum business card. Sorry in advance – it might set off the metal detector on your flight home."

I was trying too hard. And it was probably painfully apparent to everyone I talked to.

I went on Meetup.com and RSVP'd for every developer event I could find. Santa Barbara is a small town, but it's near Los Angeles. So I made the drive for events there, too.

I quickly wised up, and traded my suit for jeans and a hoody. And I noticed that no one else gave out business cards. So I stopped carrying them.

I took cues from the devs I met at the hackerspace: Be passionate, but understated. Keep some of your enthusiasm in reserve.

And I read lots of books to better understand developer culture.

Coders at Work is a good book of interviews with programmers, published in 2009.

Hackers: Heroes of the Computer Revolution is a good book from the 1980s.

For a more contemporary cultural resource, check out the TV series Mr. Robot. Its characters are a bit extreme, but they do a good job of capturing the mindset and mannerisms of many developers.

Soon, I was talking less like a teacher and more like a developer. I didn't stick out quite as awkwardly.

Several times a week I attended local tech-related events. My favorite event wasn't even a developer event. It was the Santa Barbara Startup Night. Once every few weeks, they'd have an event where developers would pitch their prototypes. Some of the devs demoing their code were even able to secure funding from angels – rich people who invest in early-stage companies.

The guy who ran the event was named Mike. He must have known every developer and entrepreneur in Santa Barbara.

When I finally got the nerve to introduce myself to Mike, I was star-struck. He was an ultra-marathoner with a resting heart rate in the low 40s. Perfectly cropped hair and beard. To me he was the coolest guy on the planet. Always polished. Always respectful.

Mike was "non-technical". He worked as a product manager. And though he knew a lot about technology and user experience design, he didn't know how to code.

Sometimes devs would write non-technical people off. "He's just a business guy," they'd say. Or: "She's a suit." But I never heard anyone say that about Mike. He had the respect of everyone.

I made a point to watch the way Mike interacted with developers. After all, I wasn't that far removed from "non-technical" myself. I'd only been coding for a few months.

Often my old habits would creep in. During conversations I'd have the temptation to show off what I'd learned or what I'd built.

Many developers are modest about their skills or accomplishments. They might say: "I dabble in Python." And little ol' insecure me would open his big mouth and say something like, "Oh yeah. I've coded so many algorithms in Python. I write Python in my sleep."

And then I'd go home and google that developer's name, and realize they were a core contributor to a major Python library. And I'd kick myself.

I quickly learned not to boast of my accomplishments or my skills. There's a good chance a person you're talking to can code circles around you. But most of them would never volunteer this fact.

There's nothing worse than confidently pulling out your laptop, showing off your code, and then having someone ask you a bunch of questions that you're wholly unprepared to answer.

My first few months of attending events were a humbling experience. But these events energized me to keep pushing forward with my skills.

Soon people around southern California would start to recognize me. They'd say: "I keep running into you at these events. What's your name again?"

One night a dev said, "Let's follow each other on Twitter." I had grudgingly set up a Twitter account a few days earlier, thinking it was a gimmicky website. How much could you really convey with just 140 characters? I had barely tweeted anything. But I did have a Twitter account ready, and she did follow me.

That inspired me to spend more time refining my online presence. I made my LinkedIn less formal and more friendly. I looked at how other devs in the community presented themselves online.

Within a few months, I knew people from so many fields:

experienced developers
non-technical or semi-technical people who worked at tech companies
hiring managers and recruiters
and most importantly, my peers who were also mid-career and trying to break into tech

Why were peers the most important? Surely they would be the least able to help me get a job, right?

Well, let me tell you a secret: let's say a hiring manager brings on a new dev, trains them, and they turn out to be really good at their job. That hiring manager is going to ask: where can I find more people like you?

Your peers are one of the most important pieces of your network. So many of my freelance opportunities and job interview opportunities came from people who started learning to code around the same time as I did.

We came up together. We were brothers and sisters in arms. Those bonds are the tightest.

Anyway, all this networking over the months would ultimately come to fruition one night when I walked into the bar of a fancy downtown hotel for a developer event.

But more on that in the next chapter. Now let's talk more about the art and science of building your network.

Is it Really Who You Know?

You may have heard the expression that success is "less about what you know, and more about who you know."

In practice, it's about both.

Yes – your connections may help you land your dream job. But if you're out of your depth, and lack the skills to succeed, you will not fare well in that role.

But let's assume that you are proactively building your skills. You've followed my advice from Chapter 1. When is the right time to start building your network?

The best time to start building your network is yesterday.

But you don't need a time machine to do this. Because you already have a network. It's probably much smaller than you'd like it to be, but you do know people.

They may be friends from your home town, or the colleagues of your parents. Any person you know from your past – however marginally – may be of help.

So step one is to take full inventory of the people you know. Don't worry – I am not asking you to reach out to anyone yet, or tax your personal relationships.

Think before you move. Formulate a strategy.

First, let's inventory all the people you know.

How to Build a Personal Network Board

You want to start by creating a list of people you know.

You could do this with a spreadsheet, or a Customer Relationship Management tool (CRM) like sales people use. But that's probably overkill for what we're doing here.

I recommend using a Kanban board tool like Trello, which is free.

You're going to create 5 columns: "to evaluate", "to contact", "waiting for reply", "recently in contact", and "don't contact yet".

Then you're going to want to create labels, so you can classify people by how you know them. Here are some label ideas for you: "Childhood friend", "Friend of the family", "Former colleague", "Classmate", "Friends from Tech Events".

Now you can start creating cards. Each card can just be their name, and if you have time you can add a photo to the card.

Here is the Trello board I created to give you an idea of what this Personal Network Board might look like. I used characters from my favorite childhood movie, the 1990 classic Teenage Mutant Ninja Turtles.

My Personal Network Board with my friends from my side job fighting crime.

You can go through your social media accounts – even your old school year books if you have them – and start adding people.

Many of these people are not going to be of any help. But I recommend adding them for the sake of being comprehensive. You never know when you'll remember: "oh – so and so got a job at XYZ corp. I should reach out to them."

This process may take a day or two. But know that this is an investment. You'll be able to use this board for the rest of your career.

You may think "I don't need to do this – I already have a LinkedIn account." That might work OK, but LinkedIn is a blunt instrument. You want to maximize signal and minimize noise here. That's why I'm encouraging you to create this dedicated personal network board.

As you add people to your board, you can label them. Take a moment to research each of these people. What are they up to these days? Do they have a job? Run a company?

You can add notes to each card, as you discover new facts about them. Did they recently run a fundraiser 5K run? Did their grandma recently celebrate her 90th birthday? These facts may seem extraneous. But if the person is sharing them on social media, it means these facts are important to them.

Make an effort to be interested in people. Their daily lives. Their aspirations. By understanding their motivations and goals, you will have deeper insight into how you can help them.

And as I said earlier, the best way to forge alliances is to help people. We'll talk about this at length in a little bit.

For each of the people you add to your Personal Network Board, consider whether they might be worth reaching out to. Then put them into either the "to contact" or the "don't contact yet" column.

You may be wondering: why is the column called "don't contact yet"? Because you never know when it might be helpful to know someone. Never take any friendship or acquaintanceship for granted.
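If you'd rather prototype this board in code before setting it up in Trello, here is a minimal Python sketch of the same structure. The five column names come from the list above. Everything else is hypothetical and made up purely for illustration: the class names Card and NetworkBoard, the method names, and the person "Sam Example". This is not how Trello stores boards.

```python
from dataclasses import dataclass, field
from typing import List

# The five columns described above, in left-to-right order.
COLUMNS = [
    "to evaluate",
    "to contact",
    "waiting for reply",
    "recently in contact",
    "don't contact yet",
]


@dataclass
class Card:
    """One person on the board (illustrative structure, not Trello's)."""
    name: str
    labels: List[str] = field(default_factory=list)  # how you know them
    notes: List[str] = field(default_factory=list)   # facts you discover
    column: str = "to evaluate"  # new cards start here until you sort them


class NetworkBoard:
    def __init__(self):
        self.cards = {}  # maps a person's name to their Card

    def add(self, name, *labels):
        """Create a card in the 'to evaluate' column and return it."""
        card = Card(name=name, labels=list(labels))
        self.cards[name] = card
        return card

    def move(self, name, column):
        """Move an existing card to one of the five known columns."""
        if column not in COLUMNS:
            raise ValueError(f"unknown column: {column!r}")
        self.cards[name].column = column

    def in_column(self, column):
        """List the names of everyone currently in a column."""
        return [c.name for c in self.cards.values() if c.column == column]


board = NetworkBoard()
sam = board.add("Sam Example", "Former colleague")
sam.notes.append("Recently ran a fundraiser 5K")
board.move("Sam Example", "to contact")
print(board.in_column("to contact"))  # ['Sam Example']
```

The point of the sketch is only that a card is a name plus labels, notes, and a column. A Kanban tool gives you the same thing with drag-and-drop, which is why I still recommend it over writing your own.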

Once you've filled up your board, labeled everyone, and sorted them into columns, you're ready to start reaching out.

How to Prepare for Network Outreach

The main thing to keep in mind when reaching out and trying to make an impression: keep yourself simple.

People are busy, and they can only remember so many facts about you. You want to boil down who you are to the fundamentals. And the best way to do this is to write a personal bio.

You want your presence to be consistent across all of your social media accounts.

Here's how I introduce myself:

"I'm Quincy. I'm a teacher at freeCodeCamp. I live in Dallas, Texas. I can help you learn to code."

Go ahead and write yours. See if you can get it down to 100 characters or less. Try to avoid using fancy words or jargon.
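If you want a quick way to check a draft bio against that limit, a few lines of Python will do it. This is only a sketch: the function name bio_fits and the constant MAX_BIO_LENGTH are made up for illustration, and the limit is simply the 100 characters suggested above.

```python
MAX_BIO_LENGTH = 100  # the limit suggested above


def bio_fits(bio, limit=MAX_BIO_LENGTH):
    """Return (length, fits) for a bio, ignoring leading/trailing spaces."""
    length = len(bio.strip())
    return length, length <= limit


bio = ("I'm Quincy. I'm a teacher at freeCodeCamp. "
       "I live in Dallas, Texas. I can help you learn to code.")
length, fits = bio_fits(bio)
print(f"{length} characters, fits: {fits}")
```

Swap in your own bio and keep trimming until it fits.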

It may be hard to distill your identity down to a few words. But this is an important process.

Remember: people are busy. They don't need to know your life story. As you get to know these people better, you can gradually fill in the details of who you are as a person. As they ask questions, they can get to know you better over time.

And on that note, you need a good photo of your smiling face.

If you have the money, just find a local photographer and pay them to take some professional headshots.

You may even have a friend who's into photography, who can take them for free.

I took my headshot myself, using Photobooth, which comes pre-installed on MacOS. My friend spent about 10 minutes fixing some background and shading in Photoshop. He may have made my teeth slightly whiter. Here's what it looks like:

My headshot. I use this same photo everywhere.

Be sure to smile with your eyes, so you don't look robotic. Or better yet, think of something really funny, like I did here. Then the smile will be genuine.

Take a lot of shots from different angles, and just use whichever one looks best on you.

I recommend using a headshot that looks like how you look on any given day. Not a heavily photoshopped photo that tries to maximize your attractiveness. You want people at events to recognize you from your photo. And you don't want to intimidate people with your beauty. You want to put them at ease.

Speaking of putting people at ease: do not wear sunglasses, or try too hard to look cool. You want to look friendly and approachable. A good acid test for this is: look at your photo. If you were lost, and saw this person on the street, would you be brave enough to ask them for directions?

Once you have chosen your headshot photo, use that same photo everywhere. Put it on all of your social media accounts.

Use it on your personal website. Even add the profile photo to your email account.

I recommend using that same photo for years. Every time you change it, you run the risk that some people won't immediately recognize you. Even subtle changes in lighting, angle, or background can throw off people's familiarity.

Be sure to keep a high-definition version of the photo. That way people can use it to promote your talk at their conference, or your guest appearance on their podcast. (Don't worry – in time, you will get there.)

How to Reach Out to People from your Past

Now that you've got your bio and photos sorted out, you're ready to start talking with people.

15 years ago, I would have said you should call people on the phone instead of messaging them. But culture has changed a lot with the introduction of smartphones. Most people will not respond well to a phone call.

Similarly, I don't recommend asking people out to coffee or lunch until much later in the conversation. People are busy, and may view the request as awkward.

You need to get to the point, and do so quickly.

So what is that point you need to get to?

Essentially:

I know you
I like you
and I respect the work you're doing.

That's it.

People like to be known. They like to be liked. They like for the work they do and the lives they live to be noticed.

Most of us get recognition on our birthdays. People from our past might send "happy birthday" text messages, social media posts, or even call us.

But what about the other 364 days of the year? People like to be recognized on those other days, too.

Well, here's a simple way you can recognize people.

Step 1: Research the person. Google them. Read through their most recent social media posts. Read through their LinkedIn. If they post family photos, actually take time to look at them.

Step 2: Think about something you could say that might make their day a bit brighter.

Step 3: Choose a social media platform they've been recently active on. Send them a direct message.

I'm going to share a template. But never use any template verbatim. If the recipient plugs your message into Google, they'll discover it's a template, and all your goodwill will be squandered.

If I were messaging someone I hadn't talked to in a few months or years out of the blue, I would say something like this:

"Hey [name], I hope your [new year / spring / week] is off to a fun start. Congrats on [new job / promotion / new baby / completed project]. It's inspiring to see you out there getting things done."

Something short and to the point like that. Greeting + congratulations + compliment. That is the basic formula.

Don't just say it. Mean it.

Really want this person to feel recognized. Really want to brighten their day. Really want to encourage them to keep progressing toward their goals.

Humans are very good at detecting insincerity. Don't try to over-sell it. Don't give them any reason to think "this person wants something from me."

That's why the most important thing about this is: be brief. Be respectful of people's time. Nobody wants a long letter that they'll feel obligated to respond to at length.

Because – say it with me again – people are busy.

How to Build Even Deeper Connections

Because people are so busy, they're often tempted to see strangers more for what those strangers can do for them:

This person drives the bus that gets me to work.
This person makes my beverage just the way I like it.
This person in HR answers my questions about time off.
This person put together a bangin' acid jazz playlist for me to listen to while I code.
This person sends me helpful emails each week with free coding resources.

To some extent, you are what you do for people.

I know, I know. That might sound overly reductive. Cynical even. And that is 100% not true for the close friends and family in your life.

But for people who barely know you – who just encounter you while going about their day – this is likely how they see you.

You have to give people a reason to care about you. You have to inspire them to learn more about you.

Before you can become somebody's close friend – someone they truly care about, and think about when you're not around – you need to start off as someone who is helpful to them.

And that's what we're going to do here. We're going to build even deeper relationships by offering to help people.

This will be a long process. And you should start it well in advance of your job search. The last thing you want is for someone to think "Oh – you're just reaching out because you need something from me."

On the contrary – you're reaching out because you have something to offer them.

You are, after all, in possession of one of the most powerful skillsets a person can acquire. The ability to bend machines to your will. You are a programmer.

This is what being good at coding feels like.

Or, at least, you're on the road to becoming one.

So you already have a good pretext to reach out to people.

You may have heard the term "cold call". This is where you call someone knowing almost nothing about them, and try to sell them something. This is not easy, and the vast majority of cold calls end with the other party hanging up.

But the more information you know about the other person, the warmer the call gets, and the more likely you are to succeed.

Now, you're not selling anything here. And as I mentioned earlier, you're not calling them either. You're sending them a direct message.

Maybe this is through Twitter, LinkedIn, Discord, Reddit – wherever. But you are reaching out to them with a single paragraph of text.

As I said, the strongest opening move – the approach that's most likely to get a response – is to casually offer help.

If I were doing this, here's a simple template I'd use. Remember not to use this template verbatim. Rewrite it in your own voice, how you would say it to a friend:

"Hey [name], congrats on the [new job / promotion / new baby]. I've been learning some programming, and am building my portfolio. You immediately came to mind as someone who gets a lot of things done. Is there any sort of tool or app that would make your life easier? I may be able to code it up for you, for practice."

This is a strong approach, because it is personalized and doesn't come across as automated. People get so many automated messages these days that they are quick to disregard anything that even resembles an automated message.

This is why I send all my messages manually, and don't rely on automation. It's better to slowly compose messages one-by-one than it is to try to save time with a script or a mail-merge.

The fastest way to get blocked is to message someone with "Hi , how's it going?" where there's clearly a first name missing – evidence that the message is a template.

Sometimes I get a message using my last name instead of my first name. "Hey Larson." What, am I in military school now?

And a lot of people on LinkedIn have started putting an emoji at the beginning of their name. This makes it easy to detect automated messages, because nobody would include that emoji in their direct message.

When a message starts with: "Hi 🍜Sarah, are you looking for a new job?" Then you know it's a bulk message.

Also note that my above template does not say "we went to school together" or something like that. Unless you just met someone a few days ago, you shouldn't specify how you two know one another.

Why? Because the very act of reminding people how you know one another will prompt some people to step back and think: "Gee, I barely know this person."

How to Keep the Conversation Going

Again, your goal is to get a response from them, so you can start a back-and-forth conversation.

These messaging platforms have a casual feel to them. Keep it casual.

Don't send a single, multi-paragraph message. Keep your messages short and snappy. You don't want it to feel like a chore to reply to you.

Once you've got them replying to you, start making notes on your Personal Network Board so you can remember these facts later.

Maybe they do have some app idea or tool idea. Great. Ask them questions about it. See if you can build it for them.

Start by sketching out a simple mockup of the user interface. Use graph paper if you want to look extra sophisticated. Snap a photo of it and send it to them. "Something like this?"

This will establish that you're serious about helping them. And I'd be willing to bet for most people, this would be a new experience.

"You're helping me? You're creating this app for me?" It will be flattering, and they will be likely to remember it. Even if the app itself doesn't go anywhere.

From there, you can just go with the flow of conversation. Maybe it fizzles out. No worries. Let it. You can find a reason to pick the conversation back up a few weeks later.

The great thing about these social media direct messages is the entire message log is there. The next time you message them, they can just scroll up and see "oh – this is that person who offered to build that app for me." There are no more "who are you again?" head tilts that you might get during in-person conversations.

Again, keep everything casual and upbeat. If it feels like the conversation is going slow, that's no problem. Because you're going to have dozens of other conversations going. Other irons in the fire. You're going to be a busy bee building your network.

How to Meet New People and Expand Your Personal Network

We've talked about how to reach out to people you already know. Those connections are still there, even if they've atrophied a bit over the years.

But how do you make brand new connections?

This is no easy task. But I have some tips that will make this process a bit less daunting.

First of all, meeting people for the first time in person is so much more powerful than meeting them online.

When you meet someone in person, your memory has so much more information to latch onto:

How the person looks, their posture, and how they move through the space
The sound of their voice and the way they speak
The lights, sounds, aromas, temperature, and the general feel of the venue
And so many other little details that get baked into your memory

Spending 10 minutes talking with someone in person can build a deeper connection than dozens of messages back and forth, across weeks of correspondence.

This is why I strongly recommend: get out there and meet people at local events.

How to Meet People at Local Events Around Town

Which events? If you live in a densely-populated city, you may have a ton of options at your disposal. You may be able to go to tech events several nights each week, with minimal commuting.

If you live in a small town, you may have to stick with meeting people at local gatherings. Book fairs, ice cream socials, sporting events.

If you go to church, mosque, or temple, get to know people there, too.

And yes, I realize this may sound ridiculous. "That person standing in the bleachers next to me at the soccer game? They're somehow going to help me get a developer job?"

Maybe. Maybe not. But don't write people off.

That person may run a small business.

They may have gone to school with a friend who's a VP of Engineering at a Fortune 500 company.

And maybe – just maybe – they're a software engineer, too. After all, there are millions of us software engineers out there. And we don't all live in Silicon Valley. 😉

When you do meet a new person, you don't want to immediately pull out your phone and say "Can I add you to my LinkedIn professional network?"

Instead, you want to play it cool. Introduce yourself.

Remember their name. Names are integral to building a relationship. If you are bad with names, practice remembering them. You can practice by just trying to remember the name of every character – no matter how minor they are – when you're watching TV shows or movies.

If you forget someone's name, don't guess. Just say "what's your name again" and be sure to remember it the second time.

Shake their hand or fist bump. Talk with them about whatever feels natural. If the conversation peters out, no worries. Let it.

You build relationships over time. It's not about total time spent with someone – it's about the number of times you meet that person over a longer span of time.

There's a good chance you will see the person again in the future. Maybe at that same exact location a few weeks later. And that is when you make your move:

"Hi [name] how's the [thing you talked about the previous time] going?"

Pick the conversation up where it left off. If they seem like someone who would be a helpful addition to your Personal Network Board, ask them "hey what are you doing next [day of week]? Do you want to come with me to [other upcoming local event]?"

Always have your upcoming week of events in mind, so you can invite people to join you.

This is a great way to get people to hang out with you in a safe, public space. And you're providing something of value – giving them awareness of an upcoming event.

If they seem interested, you can say "Awesome. What's the best way for me to message with you, and get you the event details?"

Boom – you now have their email or social media or phone number, and your relationship can unfold from there.

This may sound like a slow burn approach. Why be so cautious?

Again, people are busy. Smart people are defensive of their time, and of their personal information.

There are too many vampires out there who want to take advantage of people – trying to sell them something, scam them, get them into their multi-level marketing scheme, or in some other way proselytize them.

The best way to help other people get past this reflexive defensiveness is to already be on their radar from previous encounters as a reasonable person.

How to Leverage Your Network

We'll talk more about how to leverage your network in Chapter 4. For now, look at your network purely as an investment of time and energy.

I like to think of my network as an orchard. I am planting relationships. Tending to them, and making sure they're healthy.

Who knows when those relationships will grow into trees and bear fruit. The goal is to keep planting trees, and at some point in the future, those trees will help sustain you.

Keep sending out positive energy. Keep offering to help people using your skills, and even your own network. (It is rarely a bad move to make a polite introduction between two people you know.)

Be a kind, thoughtful, helpful person.

Don't ever feel impatient with how slow a job search may be going.

Don't ever let yourself feel slighted or snubbed.

Don't ever let yourself feel jealous of someone else's success.

What goes around comes around. You will one day reap what you sow. And if you're sowing positive energy, you're setting yourself up for one bountiful harvest.

Chapter 3: How to Build Your Reputation

"The way to gain a good reputation is to endeavor to be what you desire to appear." – Socrates

Now that you've started building your skills and your network, you're ready to start building your reputation.

You may be starting from scratch – a total newcomer to tech. Or you may already have some credibility you can bring with you from your other job.

In this chapter, I'll share practical tips for how you can build a sterling reputation among your peers. This will be the key to getting freelance clients, a first job, and advancing in your career.

But first, here's how I built my reputation.

Story Time: How Did a Teacher in His 30s Build a Reputation as a Developer?

Last time on Story Time: Quincy started building his network of developers, entrepreneurs, and hiring managers in tech. He was frequenting hackerspaces and tech events around the city. But he had yet to climb into the arena and test his might...

I was already several months into my coding journey when I finally worked up the courage to go to my first hackathon.

One day I encountered a particularly nasty bug, and I wasn't sure how to fix it. So I did what a lot of people would do in that situation: I procrastinated by browsing the web. And that's when I saw it. Startup Weekend EDU.

Startup Weekend is a 54-hour competition that involves building an app, then pitching it to a panel of judges. These events reward your knowledge of coding, design, and entrepreneurship as well.

This particular event – held in the heart of Silicon Valley – had a panel of educators and education entrepreneurs as its judges. With my background in adult education, this seemed like an ideal first hackathon for me.

I told Steve about the event. And then I said the magic words: "I'll do the driving." Which was good, because Steve didn't have a driver's license.

With Steve onboard, we rounded out our team with a couple of devs from the Santa Barbara Hackerspace.

I spent weeks preparing for the event by researching the judges and the companies they worked for. I researched the sponsors. And of course, I practiced coding like a Shaolin monk.

Finally, after a month of preparation, it was the big weekend. We piled into my 2003 Toyota Corolla with the peeling clear coat, put on some high energy music, and started our 5-hour drive.

On the way up, we discussed what we should build. It would be education-focused, of course. Preferably catering to high school students, since those were the grade levels the judges' companies focused on.

But what should the app do? How was it going to make people's lives easier?

I thought back to my own time in high school. I didn't have much to go on, since I'd dropped out after just one year. (I did manage to study for and pass the GED – Good Enough Degree as we called it – while working at Taco Bell, before eventually going to college. But that's another story.)

But one pain point I did remember from high school, which still rang out after all these years: English papers.

Now I loved writing. But I didn't love writing in MLA format, with its rigid citation rules. I used to dread preparing a Works Cited page. My teacher would always dock me points for not formatting my citations correctly.

After listening to a lot of OK ideas from the other passengers in the car, I piped up. I said: "I have an idea. We should code an app that creates citations for you."

And someone laughed and said: "Out of sight."

And Steve said, "Hey that's a good name. We could call it Out of Cite with a 'C'."

We all laughed and felt clever. Then we started discussing the implementation details.

When we arrived at the venue, there were about 100 other devs there. It was an open-plan office space, with low-rise cubicles flanked by whiteboards.

I heard whispers about one of those developers. "Hey, it's that guy who won the event last year," I heard people say. They gestured in the direction of a cocky-looking dev surrounded by fans. "Maybe he'll let me be on his team."

The event started with pitches. Anyone could go up to the front of the room, grab the mic, and deliver a 60 second pitch for the app they wanted to build.

I was so nervous it felt like an alien was about to burst out of my chest. So naturally, I was first in line. Rip the band-aid off, right?

I was sweating and gesticulating wildly as I raced through my pitch. I said something like this: "Citations suck. I mean, they don't suck. They're necessary. And you need to add them to your papers. But preparing citations sucks. Let's build an app that will fill out your Works Cited page for you. Who's with me?"

The room was quiet. Then people realized I was finished talking, and they gave me an obligatory round of applause. The MC took the mic out of my hand and gave it to the next person, and I pranced back to my seat.

After pitches, it was time to form teams. Our Santa Barbara contingent looked at each other and said "I guess we're a team."

We figured out the wifi password and grabbed the choicest of workspaces: a corner office that had a door you could actually close.

I started scrawling UI mockups on a whiteboard. I said, "We want something that's always a click away. Right in your browser's menu bar."

"Like a browser plugin," Steve said.

"Yeah. Let's build a browser plugin."

I showed them examples of the three formats that essays might require: MLA, APA, and Chicago.

"Could we generate all three of these at once, so they can just copy-paste them?" I asked.

"We can do better than that," Steve said. "We can have a button for each of them that puts the citation directly into their clipboard."

We worked fast, creating a simple MVP (Minimum Viable Product) by the end of Friday night. All it did was grab the current website's metadata and structure it as a citation. But it worked.
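
A minimal sketch of that "structure it as a citation" step might look like the following. It is written in Python for readability rather than as a browser plugin, and the `PageMetadata` fields and the `format_citations` function are hypothetical names used only for illustration. The three outputs are simplified approximations of MLA, APA, and Chicago, not complete implementations of those style guides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PageMetadata:
    """Hypothetical container for metadata pulled from a web page."""
    author_first: str
    author_last: str
    title: str
    site_name: str
    year: int
    url: str


def format_citations(page: PageMetadata) -> Dict[str, str]:
    """Return simplified MLA, APA, and Chicago-style citation strings."""
    # MLA (simplified): Last, First. "Title." Site, Year, URL.
    mla = (
        f'{page.author_last}, {page.author_first}. "{page.title}." '
        f"{page.site_name}, {page.year}, {page.url}."
    )
    # APA (simplified): Last, F. (Year). Title. Site. URL
    apa = (
        f"{page.author_last}, {page.author_first[0]}. ({page.year}). "
        f"{page.title}. {page.site_name}. {page.url}"
    )
    # Chicago (simplified): Last, First. "Title." Site. Year. URL.
    chicago = (
        f'{page.author_last}, {page.author_first}. "{page.title}." '
        f"{page.site_name}. {page.year}. {page.url}."
    )
    return {"MLA": mla, "APA": apa, "Chicago": chicago}
```

Returning all three strings at once mirrors the idea of one button per format: each value could be copied to the clipboard on its own.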

Since it was my first hackathon, I didn't want the stress of staying in a hostel. So I'd splurged to get a hotel room. We had two twin beds, so each night we'd rotate which of us had to sleep on the floor.

Saturday morning, our ambitions grew. I walked to the whiteboard and said to the team: "Citing websites is great and all. But a lot of the things students cite are in books or academic papers. We need to be able to generate citations for those, too."

We found an API that we could use to get citation information based on ISBN (a serial number used for books). And we hacked together a script that could search for academic papers based on their DOI (a serial number used for academic papers), then scrape data from the result page.
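
Before sending an ISBN to any lookup API, it can help to check that the number is well-formed. The sketch below is an illustration, not the script from the hackathon. It validates the standard ISBN-13 check digit, where the digits are weighted 1 and 3 alternately and the total must be divisible by 10. The function name `is_valid_isbn13` is made up for this example.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is a 13-digit ISBN with a valid check digit."""
    # Hyphens and spaces are common separators in printed ISBNs.
    cleaned = isbn.replace("-", "").replace(" ", "")
    if len(cleaned) != 13 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... starting from the first digit.
    total = sum(
        int(digit) * (1 if index % 2 == 0 else 3)
        for index, digit in enumerate(cleaned)
    )
    return total % 10 == 0
```

A check like this catches most typos locally, so the app only makes a network request for numbers that could be real.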

By Saturday night, the code for our browser plugin was really coming together. So I sat down and started preparing the presentation slides. I left a lot of the final coding to my teammates while I rehearsed my pitch over and over again for hours.

Even though it was my turn to sleep in a bed, I could barely get any shut-eye due to the jitters. Here I was, right in the heart of the tech ecosystem. Silicon Valley.

As a teacher, I would routinely give talks in front of my peers – sometimes dozens of them. But this was different.

In a few hours, I'd be presenting to a room full of ambitious developers. And judges. People with Ph.D.s, some of whom had founded their own tech companies. They were going to be evaluating our work. I was terrified I'd somehow blow it.

Unable to sleep, I opened my email. The Startup Weekend staff had sent out an email, which included a PDF of a book. It was an unofficial mash-up of the tech startup classics The Four Steps to the Epiphany and The Lean Startup.

Now, I had already read these books, because they were required reading for anyone who wanted to build software products in the early 2010s. But I had also read dozens of other startup books. And a lot of their insights sort of ran together into a slurry of advice.

It was 4 a.m., and I couldn't sleep. So I just started reading. One thing these books really hit on is building something that people will pay for. The ultimate form of customer validation.

That's when I realized: you know what would really push my presentation over the finish line? Proof of product-market fit. Proof that the app we were building solved a real problem people had. So much so that they'd open up their wallets.

This gave me an idea. I should take our app on the road and sell it to people.

But it was Sunday morning. Where was I going to find potential customers? Well, our hotel just happened to be located near the main campus of Stanford University.

I drove my team to the event venue, waved goodbye and said: "I'll come back when I have cold, hard cash from customers."

My teammates chuckled. I'm not sure if they thought I was serious. They said, "Just don't be late for the pitch."

But I was serious. I had a prototype of the app running on my laptop. I punched Stanford into my GPS and embarked on my mission.

Now, I studied at a really inexpensive state university in Oklahoma. So I felt really out of my depth when I rolled up to one of the premier universities in the world.

Stanford costs $50,000 per year to attend. And I pulled into their parking lot driving a car worth 1/10th of that.

The campus was a ghost town this time of the week. But a palatial ghost town, nonetheless. Bronze statues. Iconic arches everywhere.

I asked myself: where are the most high-achieving, hard-core students this time of day? The ones who don't have time to waste on manually creating their Works Cited pages?

I walked into the main library, right past the security desk and a sign that said "no soliciting."

I strode around the stacks, finding a small handful of people studying. This one kid was studiously taking notes as he read through a thick textbook. Bingo.

I slid into the seat next to him. "Psst. Hey. Do you like citations?"

"What?"

"Citations. You know, like, works cited pages."

"Um..."

"You know, the last page of your paper, where you have to list all the..."

"I know what a works cited page is."

"OK. Well check this out." I pulled my jacket to the side like a drug dealer, and whipped out my $200 netbook. He humored me for a moment while I delivered my awkward sales pitch.

I said: "Here. I've got this browser plugin. I go to any website, click the button, and voilà. It will create a citation for me."

The kid raised his eyebrows. "Can it do MLA?"

I bit back my excitement and said, "MLA, APA, and even Chicago. Watch." I clicked the button and three citations appeared – each with its own copy-to-clipboard button.

The kid nodded, seeming somewhat impressed. So I attempted to close the sale.

"What if I told you that I was about to launch this app with a yearly subscription? But if you sign up now, you'll get unlimited access not for a year, but for a lifetime."

The kid thought for a moment.

I had heard that silence was the salesperson's best friend. So I sat there for an uncomfortably long time in total silence, staring him down.

Finally he said: "Cool, I'm in."

"Awesome. That'll be twenty bucks."

The kid recoiled. "What? That's expensive."

This was of course the era of venture capital-subsidized startups, where Uber and Lyft were losing money on every ride in a race for market share. So the kid's reaction was not totally surprising.

But I thought fast. "Well, how much cash do you have on you?"

He fumbled with his wallet, then said, "five bucks."

I looked at the crumpled bill and shrugged. "Sold."

He smiled, and I sent him an email with instructions for how to install it. Then I said, "One more thing. Let's take a picture together."

I put my phone on selfie mode. He started to smile, and I said, "Here. Hold up the five dollar bill."

I spent another hour pitching people in the library, and managed to get another paying customer as well. Then I raced back to the event venue to finalize our prototype with the team.

That afternoon, I gave what I still think is the best presentation of my life. We live-demoed the working app – which worked perfectly.

We ended the presentation with the photos I'd taken, posing with Stanford students who were now our paying customers. When I held up the cash we earned, the audience burst into applause.

Overall, it was one of the most exhilarating experiences of my life. We came in second place, and won some API credit from one of the companies who sponsored the event.

At the after party, I chipmunked some pizza, so I'd have more time to network with everyone I could. I connected on LinkedIn. I followed on Twitter. I snapped selfies together with people and used the heck out of the event's hashtag.

This was a watershed moment in my coding journey. I had proven to the people in that room that I could help design, code, and even sell an app. And more importantly, I'd proven it to myself.

Riding the Hackathon Circuit

From that moment on, I was hooked on hackathons. That year, I participated in dozens of them. I became a road warrior, railing up and down the coast, attending every competition I could.

It would be much harder from here on out. I didn't have a team anymore. I was on my own.

I'd arrive, meet as many people as I could, then go up and pitch an idea I thought might win over the judges.

Sometimes people joined my team. Sometimes I joined other people's teams.

I didn't merely want to design apps – I wanted to code them, too. And my reach often exceeded my grasp.

There were many hackathons where I would still be trying to fix bugs down to the final minutes before going on stage. Sometimes my apps would crash during live demos.

One hackathon in Las Vegas, I managed to screw up the codebase so badly that we just had to use a slideshow. I sat in the audience with my head in my hands, watching helplessly as my team member demonstrated how our app would hypothetically work – if I could have gotten it to work. We didn't fare well with the judges.

But I kept grinding. Kept arriving in new towns, checking into the hostel, hitting the venue, and eating as much free pizza as I could.

My teams had come in second or third so many times I could barely keep count. But we'd never managed to outright win a hackathon.

Breaking Through

That was until an event in San Diego. I'll never forget the feeling of building something that won over the audience and judges to the extent that our victory felt like a foregone conclusion.

After they announced us as the winner, I remember sneaking out the back door to a parking lot and calling my grandparents. I told them that I'd finally done it. I'd helped build an app and craft a pitch that had won a hackathon.

I don't know how much my grandparents understood about software development, or about hackathons. But they said they were proud of me.

With them gone now, I often think back to this conversation. I cherish their encouragement. Their faith that a 30-something teacher grandson could try like crazy and become a developer.

I kept going to hackathons after that. I kept forming new teams and learning new tools along the way. You never forget the first time you get an API to work. Or when you finally grok how some Git command works. And you never forget the people hustling alongside you, trying to get the app to hold together through the demo.

The TechCrunch Disrupt hackathon. The DeveloperWeek hackathon. The ProgrammableWeb hackathon. The $1 Million Prize Salesforce Hackathon. So many big hackathons and so much learning. This was the crucible where my developer chops were forged.

Not only did I manage to build my skills and my network along the way – I now had a reputation as someone who could actually win a hackathon.

I could ship.

This made me a known quantity.

And that reputation was crucial to getting my first freelance clients, my first developer job, and most importantly – trusting my own instincts as a dev.

Why Your Reputation is So Important

The role of reputation in society goes way, way back to human prehistory. In most tribes and settlements, there was some system to keep track of who owed what to whom.

Before there was cash, there was credit.

This may have been a written ledger. Or it may have been an elder who simply kept all these records in their head.

Beyond raw accounting, there was also a less tangible, but equally important vibe that people carried with them.

"John sure knows how to shoe a horse."

Or "Jane is the best story teller in the land."

Or "Jay's courage in battle saved us against the invaders three winters ago."

You'll note that these examples all involve someone being good at something. Not merely being a good, likable person.

It certainly helps to be a chill, down-to-earth human being. But this isn't The Big Lebowski, and we aren't going to survive on our charm alone.

The Big Lebowski (left). He had no job, he had no skills, he had no energy. But he had chill, out the wazoo.

It's easy for a developer to say: "Oh yeah. I know JavaScript like the back of my hand. I can build you any kind of JavaScript application you need, running on any device you can think of."

Or to say: "I ship code under budget and ahead of time – all the time."

But how do you know they're not exaggerating their claims?

After all, a devious man once said:

"If you can only be good at one thing, be good at lying.

Then you're good at everything."

(The true provenance of this quote is unknown. But I like to imagine it was said by a 1920s con man wearing a top hat and a monocle.)

Anyone can lie. And some people do.

Earlier in my career, I had the unpleasant task of firing a teacher who had lied about earning a master's degree. The years went by and nobody caught it.

Every year, he would lie on his annual paperwork, so that he could get a higher pay raise than the other teachers. And every year, he would get away with it.

But one day a small discrepancy tipped me off. I reviewed his file, called up some university record departments, and discovered that he had never bothered finishing his degree.

When I caught him it was a real Scooby Doo moment. "And I would have gotten away with it, if not for you darn kids."

It sucked to know that this person was teaching at the school for years and getting paid more than many of the other teachers – just because he was willing to lie.

The spoils of lying are always there, glistening. Some people are willing to give in to that temptation.

Employers know this. They know you can't trust just any person who claims to know full-stack JavaScript development. You have to be cautious about who gets a company badge, a company email address, and the keys to your production databases.

This is why employers use behavioral interview questions – to try and catch people who are more capable of dishonesty.

Call me naive, but I believe that most people are inherently good. That most people are willing to play by the rules as long as those rules are reasonably fair.

But if even one person out of ten would be a disaster hire, it means that all of us are subjected to higher scrutiny.

The worst-case scenario is not merely someone who lies to make more money. It's someone who sells company secrets, destroys relationships with customers, or breaks laws in the name of inflating their numbers.

History is rife with employees who unleashed catastrophic damage upon their employers, all for their own personal gain.

Thus, the developer hiring process at most big companies is paranoid as heck. Maybe it should be. But unfortunately, this makes it harder for everyone to get a developer job – even the most honest of candidates.

As developers, we need proof that our skills are as strong as we say they are. We need proof that our work ethic is as steadfast as our employers need it to be.

That's where reputation comes in. It reduces ambiguity. It reduces counter-party risk. It makes it safer for employers to make a job offer, and to sign an employment contract with you.

This means that – if you have a strong enough reputation – you may be able to get into the company through a side door, rather than the front door that other applicants line up for.

Some companies even have in-house recruiters who can fast track your interview process. A strong reputation can also help you command more bargaining power during salary negotiations.

So let's talk about how you can build a strong reputation, and become sought-after by managers.

How to Build Your Reputation as a Developer

There are at least six time-tested ways you can build your reputation as a developer. These are:

Hackathons
Contributing to open source
Creating developer-focused content
Rising in the ranks working at companies that have a "household name"
Building a portfolio of freelance clients
Starting your own open source project, company, or charity

How to Find Hackathons and Other Developer Competitions

Hackathons represent the most immediate way to build your reputation, your network, and your coding skills at the same time.

Most hackathons are free, and open to the public. You just need to have the time and the budget to travel.

If you live in a city with lots of hackathons – like San Francisco, New York, Bengaluru, or Beijing – you may be able to commute to the event, then go home and sleep in your own bed.

Even though I lived in Santa Barbara, which only had hackathons once every few months, I did have an old classmate in San Francisco who let me crash on his couch. This gave me access to many more events.

Hackathons used to be hard core events. People would knock back energy drinks and sleep on floors, all to finish their project by pitch time.

But hackathon organizers are gradually becoming more mindful about the health and sustainability of these events. After all, a lot of participants have kids, or demanding full-time jobs, and can't just all-out code for an entire weekend.

The best way to find upcoming events is to just google "hackathon [your city name]" and browse the various event calendars that come up in the search results. Many of these will be run by universities, local employers, or even education-focused charities.

If you're playing to win, I recommend doing your research ahead of time.

Who are the event sponsors? Usually it will be Business-to-Developer type companies, with APIs, database tools, or various Software-as-a-Service offerings.

These sponsors will probably have a booth at the event where you can talk with their developer advocates. These are people who get paid to teach people how to use the company's tools. Sometimes you'll even meet key employees or founders at these booths, which can be a great networking opportunity, too.

Often the hackathon will offer sponsor-specific prizes. "Best Use of [sponsor's] API." It may be easier to focus your time on incorporating specific sponsor tools into your project, rather than trying to win the grand prize. You can still put these down as wins on your LinkedIn or your résumé. A win is a win.

Sometimes the hackathon is just so high profile – or the prize is so substantial – that it just makes sense to try and win the competition outright.

During my time going to hackathons, I was able to win several months' rent worth of cash prizes, several years' worth of free co-working space, and even a private tour of the United Nations building in New York City.

On the hackathon circuit, I met people whose main source of income was cash prizes from winning hackathons. One dev I knew managed to win nine sponsor prizes at the same hackathon. He managed to integrate all of those sponsor tools into his project – and also win second place overall.

Don't be surprised if some of the people you run into frequently at hackathons go on to found venture-backed companies, or launch prominent open source projects.

The level of ambition you'll see among hackathon regulars is way, way higher than that of the average developer. These are, after all, people who finish a work week, then go straight into a work weekend. These people are not afraid to leap out of the frying pan and into the fire.

How to Contribute to Open Source

Contributing to open source is one of the most immediate ways you can build your reputation. Most employers are going to look at your GitHub profile, which will prominently display your Git commit history.

The GitHub profile of Mrugesh Mohapatra, who does a ton of platform development and DevOps for freeCodeCamp.org. Note how green his activity bar is. 2,208 contributions in the past year alone.

Many open source project maintainers, such as the Linux Foundation, Mozilla (Firefox), and of course freeCodeCamp ourselves, have high standards for code quality.

You can read through open GitHub issues to find known bugs or feature requests. Then you can make the code changes and open a pull request. If the maintainers merge your pull request, this will be a major feather in your cap.

One of the best ways to get a job at a tech company is to become a prolific open source contributor to their repositories.

Open source contribution is a great way to build your reputation because everything you do is right out in public. And you get the social proof of having other developers review and accept your work.

If you're interested in building your reputation through open source, here's how to get started.

Read Hillary Nyakundi's comprehensive guide to getting started with open source.

How to Create Developer-Focused Content

Developers are people. And like other people, they want something to do with their time when they're not working, sleeping, or hanging with friends and family.

For many people – including myself – that means spending time in other people's thoughts. Books. Video essays. Interactive experiences like visual novels.

You can broadly refer to these as "content." I'm not a huge fan of the word, because it makes these works feel disposable. But that's what people call it.

Software development is an incredibly broad field, with so many different topics you could approach. There are developer lifestyle vlogs, coding interview prep tutorials, coding live streams on Twitch, and developer interview podcasts like the freeCodeCamp Podcast.

There are probably entire categories of developer content that we haven't even thought of yet, which will break over the next decade.

If you're interested in film, journalism, or creative writing, developer content may be a good way to build your reputation.

You can pick a specific topic and gradually come to be seen as the expert.

There are developers who specialize in tutorials for specific technology stacks, for example.

My friend Andrew Brown is a former CTO from Toronto who has passed all the major DevOps exams. He creates free courses to prepare you for all the AWS, Azure, and Google Cloud certifications, and also runs an exam prep service.

There are more than 30 million software developers around the world. That's a lot of people who will potentially consume your content, and who will come to know who you are.

How to Rise in the Ranks by Working at Big Companies

You may have seen a developer introduced as an "Ex-Googler" or an "Ex-Netflix engineer."

Some tech companies have such rigorous hiring processes – and such high standards – that even getting a job at the company is a big accomplishment.

There are some practical reasons why employers look at where candidates have previously worked. It reduces the risk of a bad hire.

You can build up your reputation by working your way up the prestige hierarchy. You can ladder from a local employer to a Fortune 500 company, and ultimately to one of the big tech giants.

Of course, working at a giant corporation is not for everyone. I'll talk about this more in Chapter 4. But know that it is one option you have for building up your reputation.

How to Build your Reputation by Building a Portfolio of Freelance Clients

You can build your reputation by working with companies as a freelancer.

Freelance developers usually work on smaller one-person projects. So this may be a better strategy for building your reputation locally.

For example, if you did good work for a locally-based bank, that may be enough to convince a local law firm to contract you as well.

There is something to be said for being a "hometown hero." I know many developers who can effectively compete with online competition just by being physically present in meetings, and knowing people locally.

How to Build a Developer Portfolio of Your Work

Once you've built some projects, you'll want to show them off. And the best way to do this is with short videos.

People are busy. They don't have time to pull down your code and run it on their own computer.

And if you send people to a website, they may not fully grasp what they're looking at, and why it's so special.

That's why I recommend you use a screen capture tool to record 2 minute video demos.

Two minutes should be long enough to show how the project works. And once you've done that, you can explain some of the implementation details, and design decisions you made.

But always, always start with the demo. People want to see something work. They want to see something visual.

Once you've lured people in with your compelling demo of your app running, you can explain all the details you want. Your audience will now have more context, and be more interested.

Two minutes is also a magic length, because you can upload that video to a tweet, and it will auto-play on Twitter as people scroll past it. Auto-play videos are much, much more likely to be watched on Twitter. They remove the friction of having to click a play button, or navigate to another website.

You can put these project demo videos on websites like YouTube, Twitter, your GitHub profile, and of course your own portfolio website.

For capturing this video, I recommend using QuickTime, which comes built in with macOS. And if you're on Windows, you can use the Xbox Game Bar, which comes free with Windows 10.

And if you want a more powerful tool, OBS is free and open source. It's harder to learn, but infinitely customizable.

As far as recording tips: keep your font size as large as possible, and use an external mic. Any mic you can find – even from cheap headphones – will be better than speaking into your laptop's built-in mic.

Invest as much time as you need to in recording and re-recording takes until you nail it.

Being able to demo your projects and present your code is a valuable skill you'll use throughout your career. Time spent practicing pitching is never wasted.

How to Start Your Own Open Source Project, Company, or Charity

Being a founder is the fastest – but also riskiest – way to build a reputation as a developer.

It's riskiest because you're wagering your time, your money, and possibly even your personal relationships – all for an unknown outcome.

If you contribute to open source for long enough, you will build a reputation as a developer.

If you grind the hackathon circuit for long enough, you will build a reputation as a developer.

But you could attempt to start entrepreneurial projects for decades without getting traction. And squander your time, money, and connections along the way.

Entrepreneurship is beyond the scope of this book. But if you're interested in it, I will give you this quick advice:

Most entrepreneurs fail. Some fail due to circumstances outside their control. But a lot fail due to not understanding the nature of the risks they're taking on.

Don't rush into founding a project, company, or charity. Try to work for other organizations who are already doing work in your field of interest.

By working for someone else, you get paid to learn. You get exposure to the work, and the risks surrounding it. And you can build savings for an eventual entrepreneurial venture.

How Not to Destroy Your Reputation

"It takes a lifetime to build a good reputation, but you can lose it in a minute." – Will Rogers, Actor, Cowboy, and one of my heroes growing up in Oklahoma City

Building your reputation is a marathon, not a sprint.

It may take years to build up a reputation strong enough to open the right doors.

But just like in a competitive marathon, a stumble can cost you valuable time. A stumble that results in injury may put you out of the race completely.

Don't Say Dumb Things on the Internet

People used to say dumb things all the time. The words might hang in the air for a few minutes while everyone winced. But the words did eventually dissipate.

Now when people say dumb things, they often do so online. And in indelible ink.

Always assume that the moment you type something into a website and press enter, it's going to be saved to a database. That database is going to be backed up across several data centers around the world.

You can prove the existence of data, but there is no way to prove the absence of data.

You should assume, for all intents and purposes, that the cat is out of the bag. There's no getting the cat back in the bag. Whatever you just said: that's on your permanent record.

You can delete the remark. You can delete your account. You can even try to scrub it from Google search results. But someone has probably already backed it up on the Wayback Machine. And when one of those databases inevitably gets hacked years from now, those data will probably still be in there somewhere, ready for someone to resurface them.

It is a scary time to be a loudmouth. So don't be. Think before you speak.

My advice, which may sound cowardly: get out of the habit of arguing with people online.

Some people abide by the playground rule of "if you don't have something nice to say, don't say anything at all."

I prefer "praise in public, criticize in private."

I will publicly recognize good work someone is doing in the developer community. If I see a project that impresses me, I will say so.

But I generally refrain from tearing people down. Even people who deserve it.

In a fight, everyone looks dirty.

You don't want to look wrathful, tearing apart someone's argument, or dogpiling on someone who just said something dumb.

Sure – caustic wit can win you internet points in the short term. But it can also make people love you a little bit less and fear you a little bit more.

I also try to refrain from complaining. Yes, I could probably get better customer service if I threatened to tweet about a cancelled flight.

But people are busy. Most of them don't want to spend their scarce time scrolling through social media, only to see me groaning about what is, in the grand scheme of things, a mild inconvenience.

So that is my advice on using social media. Try to keep it positive.

If it's a matter that you feel strongly about, I won't stop you from speaking your mind. Just think before you type, and think before you hit send.

Don't Over-promise and Under-deliver

One of the most common ways I see developers torpedo their own reputations is to over-promise and under-deliver. This is not necessarily a fatal error. But it is bad.

Remember when I talked about the Las Vegas hackathon where I utterly failed to finish the project in time for the pitch, and we had to use slides instead of a working app?

Yeah, that was one of the lowest points in my learn-to-code journey. My teammates were polite, but I'm sure they were disappointed in me. After all, I had been overconfident. I had over-promised what I'd be able to achieve in that time frame, and I had under-delivered.

It is much better to be modest in your estimations of your abilities.

Remember the parable of Icarus, who on wax wings flew too close to the sun. If only he'd taken a more measured approach. Ascended a bit slower. Then his wings wouldn't have melted, and he wouldn't have plunged into the sea, leaving a guilt-stricken father.

Landscape with the Fall of Icarus by Pieter Bruegel the Elder, circa 1560. Icarus coulda been a contender. He coulda been somebody. But instead, he's just legs disappearing into the sea. And the farmers and the shepherds can't be bothered to look up from their work to take in his insignificance.

Get Addictions Under Control Before They Damage Your Reputation

If you have an untreated drug, alcohol, or gambling addiction, seek help first. The developer job search can be a long, grueling one. You want to go into it with your full attention.

Even something as seemingly harmless as video game addiction can distract you, and soak up too much of your time. It's worth getting it under control first.

I am not a doctor. And I'm not going to give you a "drugs are bad" speech. But I will say: you may hear of Silicon Valley fads, where developers abuse drugs thinking they can somehow improve their coding or problem solving abilities.

For a while there was a "micro-dosing LSD" trend. There was a pharmaceutical amphetamines trend.

My gut reaction to that is: any edge these may give you is probably unsustainable, and a net negative over a longer time period.

Don't feel peer pressure to take psychoactive drugs. Don't feel peer pressure to drink at happy hours. (I haven't drunk so much as a beer since my daughter was born 8 years ago, and I don't feel like I've missed out on anything at all.)

If you are in recovery from addiction, be mindful that learning to code and getting a developer job will be a stressful process. Pace yourself, so you don't risk a relapse.

You do not want to reach the end of the career transition process – and achieve so much – only to have old habits resurface and undo your hard work.

Try and Separate Your Professional Life From Your Personal Life

You may have heard the expression, "Don't mix business with pleasure."

As a developer, you are going to become a powerful person. You are going to command a certain degree of respect from other people in your city.

Maybe not as much as a doctor or an astronaut. But still. People are going to look up to you.

You're going to talk with people who would love to be in your shoes.

Do not flaunt your wealth.

Do not act as though you're smarter than everybody else.

Do not abuse the power dynamic to get what you want in relationships.

This will make you unlikable to the people around you. And if it's somehow captured and posted online, it may go on to haunt you for the rest of your career.

Never lose sight of how much you have. And how much you have to lose.

Use the Narrator Trick

I'll close this chapter with a little trick I use to pump myself up.

First, remember that you are the hero in your own coding journey. In the theater of your mind, you are the person everyone's watching – the one they are rooting for.

The Narrator Trick is to narrate your actions in your head as you do them.

Quincy strides across the hackerspace, his laptop tucked under his arm. He sets his mug under the hot water dispenser and drops in a fresh tea bag. He pulls back the lever. And as the steaming water fills his mug, he says aloud in his best British accent: "Tea. Earl Grey. Hot."

His energizing beverage in hand, he slides into a booth, squares his laptop on the surface, and catches the glance of a fellow developer. They lock eyes for a second. Quincy bows his head ever-so-slightly, acknowledging the dev's presence. The dev bows back, almost telepathically sharing this sentiment: "I see you friend. I see you showing up. I see you getting things done."

This may sound ridiculous. Why yes, it is ridiculous. But I do it all the time. And it works.

Narrating even the most mundane moments of your life in your head can help energize you. It can crystallize the moment laid out before you, and give you clarity of purpose.

And this works even better when you think of your life in terms of eras ("the Taco Bell years"). Or inflection points ("passing the GED exam").

What does this have to do with building your reputation? Your reputation is essentially the summary of who you are. What you mean to people around you.

By taking yourself more seriously, by thinking about your life as a movie, you're gradually working through who you are. And who you want to one day become.

By narrating your actions, you shine a brighter light on them in your own mind. Why did I just do that? What was I thinking? Was there a better move there?

So many people sabotage their reputations without even realizing it, just because they've settled into bad habits.

For years I thought I had to be "funny" all the time. I would find any opportunity to inject some self-deprecating humor. A lot of people realized what I was doing and found it amusing. But a lot of them didn't understand, and just got the impression I was a jerk.

Why did I do that? I think it went back to grade school, when I was always trying to be the class clown and make people laugh.

But decades later, this reflex to fill silence with laughter was not serving me well.

"When you repeat a mistake, it's not a mistake anymore. It's a decision." – Paulo Coelho

I might have gone on much longer without noticing this bad habit. But with the Narrator Trick, the awkwardness of my behavior was laid bare.

I'm sure I've got lots of other ways of thinking and ways of doing things that are suboptimal. And with the help of the Narrator Trick, I'm hoping to identify them in the future and refine them, before they give people the wrong impression.

Your Reputation Will Become Your Legacy

Think about who you want to be at the end of your story. How you want people to think of your time on Earth. Then work backward from there.

The person you want to be at the end of the movie. That hero you want people to admire. Why not start carrying yourself like that now?

Can you imagine what it would be like to be a successful developer? To have built software systems that people rely upon?

That future you – how would they think? How would they approach situations and solve problems? How would they talk about their accomplishments? Their setbacks?

Merely thinking about your future self can help you clarify your thinking. Your priorities.

I often think of "Old Man Quincy", with his bad back. He has to excuse himself to run to the toilet every 30 minutes.

But Old Man Quincy still tries his best to work with what he has. He moves in spite of sore joints. He ponders in spite of a foggy mind.

Old Man Quincy still wants to get things done. He's proud of what he's accomplished, but he doesn't spend much time looking back. He looks forward at what he's going to do that day, and what goals he's going to accomplish.

I often think about Old Man Quincy, and work backward to where I am today.

What decisions can I make today that will set me up for being someone worthy of admiration tomorrow? Do I have to wait decades to earn that reputation? Or can I borrow some of that respect from the future?

By thinking like my future self might think, can I make moves that earn me a positive reputation in the present?

I believe that you can leverage your future reputation – your legacy – right now. Just think in terms of your future self and what you'll accomplish. And use that as a waypoint to guide you forward.

I hope that these tools – the Narrator Trick and visualizing your future self – help you not only think about the nature of reputation, but also take concrete steps toward improving your own.

Because building a reputation – making a name for yourself – is the surest path to sustainable success as a developer.

Success can mean many things to many people. But most people – from most cultures – would agree: one big aspect of success is putting food on the table for yourself and your family.

And that's what we're going to talk about next.

Chapter 4: How to Get Paid to Code – Freelance Clients and the Job Search

If you've been building your skills, your network, and your reputation, then getting a developer job is not all that complicated.

Note that I said it's not complicated – it's still a lot of work. And it can be a grind.

First, let me tell you how I got my first job.

Story Time: How Did a Teacher in His 30s Get His First Developer Job?

Last time on Story Time: Quincy hit the hackathon circuit hard, even winning a few of the events. He was building his reputation as a developer who was "dangerous" with JavaScript. Not super skilled. Just dangerous...

I had just finished a long day of learning at the Santa Barbara downtown library, sipping tea and building projects.

The best thing about living in California is the weather. We'd joke that when you rented an exorbitantly-priced one-bedroom apartment in the suburbs, you were not paying for the inside – you were paying for the outside.

My goal was to spend as little time as possible in that cramped 100-year-old rat trap, and to spend the rest out walking around town.

It was a beautiful Wednesday evening. I still had two more days to prepare for that weekend's hackathon. And my brain was completely fried from the day of coding. My wife was working late, so I checked my calendar to find something to do.

On the first Monday of each month, I would map out all that month's upcoming tech events around southern California, so I'd always have a tech event I could attend if I had the energy.

Ah – tonight is the Santa Barbara Ruby on Rails meetup, and I had already RSVP'd.

I didn't know a lot about Ruby on Rails, but I had completed a few small projects with it. I was much more of a JavaScript and Python developer.

But I figured, what the heck. I need to keep up my momentum with building my network. And the venue was just a few blocks away.

I walked in and it was just a few devs sitting around a table chatting. It quickly became clear that they all worked together at a local startup, maintaining a large Ruby on Rails codebase. Most of them had been working there for several years.

Now at this point, I'd spent the past year building my skills, my network, and my reputation. So I was able to hold my own during the conversation.

But I also had a feel for the limits of my abilities. So I stayed modest. Understated. The way I'd seen so many other successful developers maneuver a conversation at tech events.

It became clear that one of the developers at the table was the Director of Engineering. He reported directly to the CTO.

And then it became clear that they were looking to hire Ruby on Rails developers.

I was candid about my background and my abilities. "My background is in adult education. Teaching English and running schools. I just started learning to code about a year ago."

But the man was surprisingly unfazed. "Well if you want to come in for an interview, we can see whether you'd be a good fit for the team."

That night I walked home feeling an electricity. It was much more dread than excitement.

I felt nowhere near ready. And I wasn't even looking for a job. I was just living off my savings, learning to code full-time, with health insurance through my wife's job.

I was a compulsive saver. People would give me a hard time about it. I would change my own oil, cut my own hair, and even cook my own rice at home when we ordered takeout – just to save a few bucks.

Over the decade that I'd worked as a teacher, I'd managed to save nearly a quarter of my after-tax earnings. And I would buy old video games on Craigslist, then flip them on eBay. That may sound silly, but it was a substantial source of income for me.

What were we saving all this for? We weren't sure. Maybe to buy a house in California at some point in the future? But it meant that I did not have to hustle to get a job. I knew I was in a privileged position, and I tried to make the most of it by learning more every day.

So in short, I didn't think I was ready for my first developer job. And I was worried that if they hired me, it would be a big mistake. They would realize how inexperienced I was, fire me, and then I'd have to explain that failure during future job interviews.

Of course, I now know I was looking at this opportunity the wrong way. But let me finish the story.

When I scheduled my job interview, they asked me for a résumé. I wasn't sure what to do, so I left all my professional experience there. All the schools I'd worked for over the years. (I left off my time running the drive-thru at Taco Bell.)

Of course, none of my work experience had anything to do with coding. But what was I supposed to do, hand them a blank sheet of paper?

Well, I did have an online portfolio of projects I'd built. And most importantly, I had a list of all the hackathons I'd won or placed at. So I included those.

I spent the final hours before the interview revisiting all the Ruby on Rails tutorials I'd used over the past year, to refresh my memory. And then I put on my hoodie, jeans, and backpack, and walked over to their office.

The office manager was a nice lady who took me back to the developer bullpen and introduced me to their small team of devs. There were maybe a dozen of them, most of them dressed in jeans and hoodies, aged from early 20s to late 40s. Two of them were women.

I took turns navigating the tangle of desks and cables, shaking hands with each of them and introducing myself. This is where all my experience as a classroom teacher memorizing student names came in handy. I was able to remember all their names, so that later in the day when I left I could follow up with each of them: "Great meeting you [name]. I'd be excited to work alongside you."

First I met with the director of engineering. We went into a small office and closed the door.

A whiteboard on the wall was covered in sketches of Unified Modeling Language (UML) diagrams. A rainbow of dry-erase marker laid out the relationships between various servers and services.

I kept glancing at that whiteboard, fearing that he'd send me over to it to solve some coding problems and demonstrate my skills. Maybe the famous FizzBuzz problem? Maybe he'd want me to invert a binary tree?

But he never even mentioned the whiteboard. He just sat there looking intensely at me the whole time.

They were a company of about 50 employees, with lots of venture capital funding, and thousands of paying customers – mostly small businesses. They prided themselves on being pragmatic. At no point did they inquire about what I studied in school, or what kind of work I did in the past. All they really cared about was...

"Look. I know you can code," he said. "You've been coding this whole time, winning hackathons. I checked out some of your portfolio projects. The code quality was OK for someone who's new to coding. So for me, the real question is – can you learn how we do things here? Can you work with the other devs on the team? And most critically: can you get things done?"

I gulped, leaned forward, and mustered all the confidence I could. "Yes," I said. "I believe I can."

And he said, "Good. Good. OK. Go wait in the Pho restaurant downstairs. [The CTO] should be down there in a minute."

So I talked with the CTO over noodles. Mostly listened. I'd learned that people project intelligence onto quiet people. Listening intently not only helps you get smarter – it even makes you look smarter.

And the approach worked. The meeting lasted about an hour. The noodles were tasty. I learned a lot about the company history, and the near-term goals. The CTO said, "OK go back up and talk with [the director of engineering]."

And I did. And he offered me a job.

Now, I want to emphasize. This is not how most people get their first developer job.

You're probably thinking, "Gee, here Quincy is Forrest Gumping his way into a developer job that he wasn't even looking for. If only we could all be so lucky."

And that's certainly what it felt like for me at the time. But in the next section, I'm going to explore the relationship between employers and developers. And how me landing that job was less about my skills as an interviewee, and more about the year of coding, networking, and reputation building that preceded it.

This wasn't a cushy job at a big tech company, with all the compensation, benefits, and company bowling alleys. It was a contractor role that paid about the same as I was making as a teacher.

But it was a developer job. A company was paying me to write code.

I was now a professional developer.

What Employers Want

Flash forward a decade. I have now been on both sides of the table. I've been interviewed by hiring managers as a developer. I've interviewed developers as a hiring manager.

I've spent many hours on calls with developers who are in the middle of the job search. Some of them have applied to hundreds of jobs and gotten only a few "call-backs" for job interviews.

I've also spent many hours on calls with managers and recruiters, trying to better understand how they hire and what they look for.

I think much of the frustration developers feel about the hiring process comes down to a misunderstanding.

Employers value one thing above all else: predictability.

Which of these candidates do you think an employer would prefer?

X is a "rockstar" 10x coder who often has flashes of genius. X also has bursts of incredible productivity. But X is often grumpy with colleagues, and often misses deadlines or meetings.

Y is an OK coder, and has slower but more consistent output. Y gets along fine with colleagues, and rarely misses meetings or deadlines.

Z is similar to Y in output, and able to get along well with colleagues and meet deadlines. But Z has changed jobs 3 times in the past 3 years.

OK, you can probably guess from everything I've said up to this point: Y is the preferred candidate. And that is because employers value predictability above all else.

X is a trap candidate that some first-time managers may make the mistake of hiring. If you are curious why hiring X would be such a bad idea, read "We fired our top talent. Best decision we ever made."

I only added Z to this list to make a point: try not to change jobs too often.

You can increase your income pretty quickly by laddering from employer to employer. You can start applying for new jobs the moment you accept an offer letter. But this will repel many hiring managers.

After all, a rolling stone gathers no moss. You will be in and out of codebases before you have the time to understand how they work.

Consider this: it can take 6 months or longer for a manager to bring a new developer up to speed, to the point where they can be a net positive for the team.

Until that point, the new hire is essentially a drain on company resources, absorbing time and energy from their peers who have to onboard them, help them find their way around a codebase, and fix their bugs.

Most Employers are Risk Averse

Let's say a manager hires the wrong developer. Take a moment to think about how bad that can be for the team.

On average, it takes about 3 months to fill a developer position at a company. Employers have to first:

get the budget to hire a developer approved by their bosses
create the job description
post the job on job sites and communicate with recruiters
sift through résumés – many of which will be low-effort from candidates who are blindly applying to as many jobs as possible
start the interviewing process, which may involve flying the candidates out to the city and lodging them in a hotel
hold rounds of interviews involving lots of team members. For some employers, this is a multi-day affair
select a final candidate, and negotiate an offer...
which many candidates will not accept anyway
sign contracts and onboard the employee
give them access to sensitive internal systems
introduce them to their teammates, and make sure everyone gets along OK
and then provide months of informal training, when the employee needs to understand a service or a part of a legacy codebase
and finally, steep them in the team's way of doing things

In short – a lot of work.

Now imagine that after doing all that, the new employee says "Hey, I just got a higher offer from this other company. Peace out, yo."

Or imagine that the employee is unreliable, and often shows up hours after the workday has started.

Or imagine that the employee struggles with untreated drug, alcohol, or gambling addiction, anger issues – or just turns out to be a passive aggressive person who undermines the team.

Now you have to start this entire process over again, and search for a new candidate for the position.

Hiring is hard.

So you can see why employers are risk averse. Many of them will pass over seemingly qualified candidates until they find someone whom they feel 99% sure about.

Because Employers are So Risk Averse, Job Seekers Suffer

Now if you think hiring is hard, wait until you hear about the job application process. You may already be all too familiar with it. But here goes...

You have to prepare your résumé or CV. Along the way, you will make decisions which you'll constantly second-guess throughout your job search.
You have to look around for job openings online, research the employers, and assess whether they're likely to be a good fit for you.
Most job openings will lead to webforms where you will have to retype your résumé over and over again, hoping the form doesn't crash due to server errors or JavaScript validation errors.
Once you submit these job applications, you have to wait while employers process them. Some employers receive so many applications that they can't manually review them all. (Google alone receives 9,000 applications per day.) Employers will use software to filter through applications. In-house recruiters spend an average of 6 seconds looking at each résumé. Often your application will never even be reviewed by a human.
You will likely never hear anything back from the company. They have little incentive to tell you why they rejected you (they don't want you to file a discrimination lawsuit). If you're lucky, you'll get one of those "We've chosen to pursue other candidates" emails.
And all the time you spend applying for these jobs – potentially hours per week – is mentally exhausting and, of course, unpaid.

Wow. So you can see what a nightmare the hiring process is for employers, and especially for job candidates.

But if you stick with it, you can eventually land offers. And when it rains, it pours.

Here's data from one freeCodeCamp contributor's job search over the span of 12 weeks:

Out of 291 applications, he ultimately received 8 offers.

And as the offers came in, the starting salary got higher and higher. Note, of course, that this is for a job in San Francisco, one of the most expensive cities in the world.

By week 12 his starting salary offers were nearly double what they were in week 2.

This developer's rate of getting interviews is quite strong. And his negotiation ability was also strong. You can read more about his process if you're curious.

But as I've said before, it is much easier to get into a company through the side door.

And that's one of the reasons I wrote this book. I don't want you to keep lining up for the front door at these employers.

If you Build Your Skills, Your Network, and Your Reputation You Can Bypass a Lot of the Job Application Process.

Throughout this book, I've been teaching you techniques to increase your likelihood of "lucking" into a job offer.

"Luck is preparation meeting opportunity. If you hadn't been prepared when the opportunity came along, you wouldn't have been 'lucky.'" – Oprah Winfrey

This is why throughout this book I've encouraged you to develop all three of these areas at once, and to start thinking about them from day one – well in advance of your job search.

My story of not even looking for a job and landing a job may seem silly. But this happens more often than you might think.

The reality is: learning to code is hard.

But knowing how to code is important.

In every industry – in virtually every company in the world – managers are trying to figure out ways to push their processes to the software layer.

That means developers.

You may hear about big layoffs in tech from time to time. Many of these layoffs affect employees who are not developers. But often a lot of developers do lose their jobs.

Why would companies lay off developers, after spending so much time and money recruiting and training them? Aside from a bankruptcy situation, I don't know the answer to that question. I'm not sure that anyone does.

There's growing evidence that layoffs destroy long-term value within a company. But in practice, many CEOs feel pressure from their investors to do layoffs. And when several companies do layoffs at around the same time, other CEOs may follow suit.

Still, even with the layoffs, most economists expect the number of developer jobs and other software-related jobs to continue to grow. For example, the US Bureau of Labor Statistics expects an increase of 15% in developer jobs over the next decade.

The job market may be tight right now, but few people expect this downturn to last.

My hope is that with strong skills, a strong network, and a strong reputation, you'll be able to land a good job despite a challenging job market.

Hopefully one day, it will be easier for employers and skilled employees to find one another – without the long, brutal job application and interviewing process.

What to Expect from the Developer Job Interview Process

Once you start landing job interviews, you'll get a taste of the dreaded developer job interview process and the notorious coding interview.

A typical interview flow might involve:

Taking an online coding assessment of your skills or a "Phone Screen."
And then if you pass that, a second phone- or video call-based technical interview
And then if you pass that, an "onsite" interview where you travel to a company office. These usually involve several interviews with HR, hiring managers, and rank-and-file developers you might work with.

Along the way, you'll face questions that test your knowledge of problem solving, algorithms & data structures, debugging, and other skills.

Your interviewers may let you solve these coding problems on a computer in a code editor. But often you'll have to solve them by hand while standing at a whiteboard.

The key thing to remember is that the person interviewing you is not just looking for a correct answer from you. They're also trying to understand how you think.

They want to know: do you understand fundamentals of programming and computer science? Or are you just regurgitating a bunch of solutions you memorized?

Now, practicing algorithms and data structures will go a long way. But you also need to be able to think out loud, and explain your thought process as you write your solutions.

The best way to practice this is to talk out loud to yourself while you code. Or – if you're feeling adventurous – live stream yourself coding.

There are lots of "live coding" streams on Twitch where people "learn in public" by building projects in front of an audience. As a bonus, if you're willing to put yourself out there like this, it will also help you build your reputation as a developer.

Another thing to remember during whiteboard interviews: your interviewer. They're not just sitting there waiting for you to finish. They're with you the entire time, watching you and evaluating you both consciously and unconsciously.

Try to make the interview process as interactive as possible for your interviewer. Smile and make occasional eye contact. Try to judge their body language. Are they relaxed? Are they nodding along as you explain points?

Your interviewer probably knows what they're looking for in your code. So see if you can tease some hints out of them. By making observations or asking open-ended questions out loud to yourself, you may be able to get your interviewer to step in, and feel involved in the process.

You want your interviewer to like you. You want them to be rooting for you, so that they may dismiss some of the shortcomings in your programming skills, or overlook some of the errors you may make in your code.

You are selling yourself as a job candidate. Make sure your interviewer feels like they're getting a good deal.

And this goes the same for any Behavioral Interviews you may have to clear. These interviews are less about your coding ability than your "culture fit." (I wish I could tell you what this means, but every manager will define it in a slightly different way.)

In these Behavioral Interviews, you'll have to convince your interviewer that you have strong communication skills.

It definitely helps to be fluent in the language you're interviewing in, and to know the right jargon. You can pick a lot of this up from regularly listening to tech podcasts, like the freeCodeCamp Podcast.

One big thing your interviewers are trying to establish: are you a cool-headed person who will play well with others? The best way to show this is to be polite, and refrain from using profanity or drifting too far off from the subject at hand.

You do not want to get into a debate over something unrelated, like a sports rivalry. I also recommend not trying to correct your interviewers, even if they say things that you believe to be silly or false.

If you get bad vibes from the company, you don't have to accept their job offer. Employers pass on candidates all the time. And you as a candidate also have the right to pass on an employer. The interview itself is probably not the best time for conflict.

Should I Negotiate My Salary at My First Developer Job?

Trying to negotiate your salary upward generally does not hurt as long as you do so politely.

I've written at length on how to negotiate your developer job offer salary.

Essentially, negotiating a higher starting salary comes down to how much leverage you have.

Your employer has work to be done. How badly does your employer need you to work for them? What other options do they have?

And you need income to survive. What other options do you have? What is your backup plan?

If you have a job offer from another employer offering to pay you a certain amount, you can use that as leverage in your salary negotiation.

If your best backup plan is to go back to school and get a graduate degree... that's not particularly strong leverage, but it's better than nothing. And you could mention it during the salary negotiation process.

Think back to the lengthy hiring process I described earlier. Employers have to go through at least a dozen steps before they can reach the job offer step with candidates. They are probably already planning for you to negotiate, and won't be surprised by it.

Now, if you're in a situation like I was where a company just offers you a job out of the blue, you may feel awkward trying to negotiate.

[Image: Smithers from The Simpsons]

I will admit – in my story time above, when my manager offered me the job, I did not negotiate.

In retrospect, should I have negotiated my compensation? Probably.

Did I have leverage? Probably not much. My backup plan was to just keep competing in hackathons and keep sipping tea and coding at the public library.

I may have been able to negotiate and get a few more bucks an hour. But in the moment they offered me the job, compensation was the last thing on my mind. I was just ecstatic that I was going to be a professional developer.

By the way, once you've worked as a developer at a company for a year or so, you may want to ask for a raise. I've written at length about how to ask for a raise as a developer. But it comes down to the same thing: leverage.

Should You Use a Recruiter for Your Developer Job Search?

Yes. If you can find a recruiter who will help you land your first developer job, I think you should.

I've written at length about why recruiters are an underrated tool in your toolbox.

Many employers will pay recruiters a finder's fee for sending them high quality job candidates.

Recruiters’ incentives are well-aligned with your own goals as a job seeker:

Since they get paid based on your starting salary, they are inclined to help you negotiate as high a starting salary as possible.
The more candidates they place — and the faster they place them — the more money recruiters make. So they’ll want to help you get a job as fast as possible so they can move on to other job seekers.
Since they only get paid if you succeed as an employee (and stay for at least 90 days), they'll try and make sure you’re competent, and a good fit for the company’s culture.

This said, if a recruiter asks you to pay them money for anything, that is a red flag.

And not all recruiters are created equal. Do your research before working with a recruiter. Even if they're ultimately getting paid by the employer, you are still investing your time in helping them place you. And time is valuable.

Speaking of time, one way you can start getting paid to code sooner – even while you're preparing for the job search – is to get some freelance clients.

How to Get Freelance Clients

I encourage new developers to try and get some freelance clients before they start their job search. There are three good reasons for this:

It's much easier to get a freelance client than it is to get a full time job.
Freelance work is less risky since you can do it without quitting your day job.
You can start getting paid to code sooner, and start building your portfolio of professional work sooner.

Getting freelance clients can be much easier than getting a developer job. Why is this?

Think about small local businesses. It may just be a family that runs a restaurant. Or a shop. Or a plumbing company. Or a law firm.

How many of those businesses could benefit from having an interactive website, back office management systems, and tools to automate their common workflows? Most of them.

Now how many of those companies can afford to have a full-time software developer to build and maintain those systems? Not as many.

That's where freelancers come in. They can do work on a more economical, case-by-case basis. A small business can bring on a freelancer for a single project, or for a shorter period of time.

If you are actively building your network, some of the people you meet may become your clients.

For example, you may meet a local accountant who wants to update their website. And maybe add the ability to schedule a consultation, or accept a credit card payment for a bill. These are common features that small businesses may request, and you may get pretty good at implementing them.

You may also meet the managers of small businesses who need an ERP system, or a CRM system, or an inventory system, or one of countless other tools.

In many cases, there is an open source tool that you can deploy and configure for them. Then you can just teach them how to use that system. And you can bill them a monthly service fee to have you "on call" and ready to fix problems that may arise.

Should I Use a Contract for Freelance Work?

You will want to find a standard contract template, customize it, and get a lawyer to approve it.

It may feel awkward to make the local bakery sign a contract with you just to help update their website or social media presence. But doing so will make the entire transaction feel more professional than a mere handshake agreement.

It's unlikely that a small business will take you to court over a few thousand dollars. But in the event that this happens, you'll be glad you signed a contract.

How Much Should I Charge for Freelance Work?

I would take whatever you make at your day job, figure out your hourly rate, and double it. This may sound like a lot of money, but freelance work is much harder than regular work. You have to learn a lot.
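To make the doubling rule concrete, here is a minimal sketch in Python. It assumes a salaried day job and a 40-hour week over 52 weeks (2,080 paid hours a year). Those figures, the function name, and the example salary are illustrative assumptions, not numbers from this book.

```python
# Rough freelance rate: your day-job hourly rate, doubled.
# Assumption: 40 hours/week * 52 weeks = 2,080 paid hours per year.
HOURS_PER_YEAR = 40 * 52


def freelance_hourly_rate(annual_salary, multiplier=2.0):
    """Return a suggested freelance hourly rate from an annual salary."""
    day_job_hourly = annual_salary / HOURS_PER_YEAR
    return round(day_job_hourly * multiplier, 2)


# Hypothetical example: a $52,000/year day job works out to $25/hour,
# so the doubled freelance rate is $50/hour.
print(freelance_hourly_rate(52_000))  # 50.0
```

If you are paid hourly at your day job, you can skip the division and simply double that hourly figure.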

Alternatively, you could just bill for a project. "I will deploy and configure this system for you for $1,000."

Just be sure to specify a time frame that you are willing to maintain the project. You don't want people calling you 3 years later expecting you to come back and fix a system that nobody has been maintaining.

How Do I Make Sure Freelance Clients Pay Me?

A lot of other freelancers – myself included – use this simple approach: ask for half of your compensation up-front, before you start the work. And when you can demonstrate that you're halfway finished, ask for the other half.

Always try to get all the money before you actually finish the project. That way, the client will not be able to dangle the money over your head and try to get extra work out of you.

If you're already paid in full, the work you do to help your client after the fact will convey: "I'm going above and beyond for you."

Which is a totally different vibe from: "Uh oh – are you even going to pay me for all this work I'm doing?"

Should I Use a Freelance Website like Upwork or Fiverr?

If you are in a rural part of the world and can't find any clients locally, you could try some of these freelance websites. But otherwise I would not focus on them. Here's why:

When you try to land contracts on a freelance website, you are competing with all the freelancers around the world. Many of them will live in cities that have a much lower cost of living than yours. Some of them will not even really care about their reputations like you do, and may be willing to deliver sub-par work.

To some extent, these websites promote a "race to the bottom" phenomenon where the person who offers to do the work the cheapest usually gets the job.

If you instead focus on finding clients through your own local network, you will not have to compete with these freelancers abroad.

And the same goes for people who are looking for help from freelance developers. If you ever want to hire a freelancer, I strongly recommend working with someone you can meet with in-person, who has ties to your community.

Someone who has lived in your city for several years, and attends a lot of the same social gatherings as you – they're going to be much less likely to try to take advantage of you. If both you and your counterparty care about your reputations, you are both invested in a partnership that works.

You can each be a success story in one another's portfolios.

Freelancing is like running a one-person company. And that means a lot of hidden work.

Don't underestimate the amount of "hidden work" involved in running your freelance development practice.

For one, you may want to create your own legal entity.

In the US, the most common approach is to create a Limited Liability Company (LLC) and conduct business as that company – even if you're the only person working there.

This can simplify your taxes. And heaven forbid you make a mistake and get sued by a client, your legal entity can help insulate you from personal liability, so that it's your LLC going into bankruptcy – not you personally.

You may also consider getting liability insurance to further protect against this.

Remember that when you are working freelance, you usually have to pay tax at the end of the year, so be sure to save for this.

To create your LLC, you can of course just find boilerplate paperwork online, and file it yourself. But if you're serious about freelancing, I recommend talking with a small business lawyer and/or accountant to make sure you set everything up correctly.

When Should I Stop Freelancing and Start Looking for a Job?

If you are able to pay your bills freelancing, you may just want to keep doing it. Over time, you may even be able to build up your own software development agency, and hire other developers to help you.

This said, if you are yearning for the stability of a developer job, you may be in luck. Freelance clients may convert into full-time jobs if you stick with them long enough. At some point, it may make economic sense for a client to just offer you a full-time job at a lower hourly rate. You get the stability of a 40-hour work week, and they get your skills full-time.

You may also be able to hang onto a few freelance clients when you get a job. This can be a nice supplement to your income. But keep in mind that, as we'll learn in the next chapter, your first developer job can be an all-consuming responsibility. At least at first.

How wild is that first year of working as a professional developer going to be? Well, let's talk about that.

Chapter 5: How to Succeed in Your First Developer Job

"A ship in port is safe. But that's not what ships are built for." – Grace Hopper, Mathematician, US Navy Rear Admiral, and Computer Science Pioneer

Once you get your first developer job, that's when the real learning begins.

You'll learn how to work productively alongside other developers.

You'll learn how to navigate large legacy codebases.

You'll learn Version Control Systems, Continuous Integration and Continuous Delivery tools (CI/CD), project management tools, and more.

You'll learn how to work under an engineering manager. How to ship ahead of a deadline. And how to work through a great deal of ambiguity on the job.

Most importantly, you'll learn how to manage yourself.

You'll learn how to break through psychological barriers that affect all of us, such as imposter syndrome. You'll learn your limits, and how to push ever so slightly beyond them.

Story Time: How did a Teacher in his 30s Succeed in his First Developer Job?

Last time on Story Time: Quincy landed his first developer job at a local tech startup. He was going to work as one of a dozen developers maintaining a large, sophisticated codebase. And he had no idea what he was doing...

I woke up at 4 a.m. and I couldn't go back to sleep. I tried. But I had this burning in my chest. This anxiety. Panic.

I had worked for a decade in education. First as a tutor. Then as a teacher. And then as a school director.

In a few hours, I would be starting over from the very bottom, working as a developer.

Would any of my past learnings – past success – even matter in this new career?

I did what I always do when I feel anxiety – I went for a run. I bounded down the hills, my headlamp bobbing in the darkness. When I reached the beach, I ran alongside the ocean as the sun crept up over the treetops.

By the time I got home, my wife was already leaving for work. She told me not to worry. She said, "I'll still love you even if you get fired for not knowing what you're doing."

When I reached my new office, nobody was there. As a teacher, I was used to getting to school at 7:30 sharp. But I quickly realized that most software developers don't start work that early.

So I sat crosslegged in the entry hallway, coding along to tutorials on my netbook.

An employee walked up to me with a nervous look on her face. She probably thought I was a squatter. But I reassured her that I did indeed now work at her company, and convinced her to let me in.

It felt surreal walking across the empty open-plan office toward the developer bullpen, with only the light of the exit sign to guide my way.

I set up my netbook on an empty standing desk and finished my coding tutorial.

A little while later, the lights flickered on around me. My boss had arrived. At first he didn't acknowledge my presence. He just sat down at his desk and started firing off bursts of keystrokes onto his mechanical keyboard.

"Larson," he finally said. "You ready for your big first day?"

I wasn't. But I wanted to signal confidence. So I said the words first uttered in Big Trouble in Little China, one of my favorite 80s movies: "I was born ready."

You've probably heard "I was born ready" a million times. But it was first uttered in 1986 by Jack Burton to his friend Wang Chi, when they were getting ready to confront a thousand year old wizard in his death warehouse. I can't believe my parents let me watch this back then, but I'm glad they did.

"Great," my boss said. "Let's get you a machine."

"Oh, I've already got one," I said, tapping my $200 netbook. "This baby is running Linux Mint, and I've already customized my .emacs file to be able to..."

"We're a Mac shop," he said walking to a storage closet. He rustled around for a moment and emerged. "Here. It's a 3 year old model, but it should do. We wiped it to factory default."

I started to say that I was already familiar with my setup, and that I could work much faster with it, but he would have none of it.

"We're all using the same tools. It makes collaborating a lot easier. Convention over configuration, you know."

That was the first time I'd heard the phrase "convention over configuration" but it would come up a lot over the next few days.

I spent the next few hours configuring my new work computer as other developers gradually filed in.

It was nearly 10 a.m. when we started our team "standup meeting." We all stood in a circle by the whiteboard. We took turns reporting what we were working on that day.

Everyone gave quick, precise status updates.

When it was my turn, I started to introduce myself. I was already anxious enough, when in walked none other than Mike, that ultramarathoner guy who ran the Santa Barbara Startup events. He was crunching on some baby carrots, having already run about 30 miles that morning.

After I finished, Mike spoke, welcoming me and saying he'd seen me at some of his events. He then gave a 15 second status update about some feature he was working on.

The entire meeting only took about 10 minutes, and everyone scattered back to their desks.

I eventually got the company's codebase to run on my new laptop. It was a Ruby on Rails app that had grown over 5 years. I ran the rake stats command and saw that it was millions of lines of code. I shuddered. How could I ever comprehend all that?

My neighbor, a gruff, bearded dev said, "Eh, most of that is just packages. The actual codebase you'll be working on is only maybe 100,000 lines. Don't worry. You'll get the hang of it."

I gulped, but thought to myself: "That's less than millions of lines. So that is good."

"Name's Nick by the way," he said, introducing himself. "If you need any help just let me know. I've been stumbling around this codebase for quite a few years now, so I should be able to help you out."

Over the next few days, I peppered Nick with questions about every internal system I encountered.

Eventually Nick started setting his chat status to "code mode" and putting on his noise cancelling headphones. He swiveled his back toward me a bit, with the body language of: "leave me alone so I can get some of my own work done, too."

This was one of my earliest lessons about team dynamics. You don't want to wear out your welcome with too many questions. You need to get better at learning things for yourself.

But this was a massive codebase, and it was largely undocumented, aside from inline comments and a pretty sparse team wiki.

Since it was a closed-source codebase that only the devs around me were working in, I couldn't use Stack Overflow to figure out where particular logic was located. I just had to feel around in the dark.

I started rotating through which neighbor I'd bug about a particular question. But it felt like I was quickly wringing out any enthusiasm they may have had left for me as a teammate.

I over-corrected. I became shy about asking even simple questions. I made a rule for myself that I would try for 2 hours to get unstuck before I would ask for help.

At some point, after thrashing for several hours, I did ask for help. When my manager discovered I'd been stuck all morning, he asked, "Why didn't you ask for help earlier?"

Another struggle was with understanding the codebase itself – the "monolith" and its many microservices.

The codebase had thousands of unit tests and integration tests. Whenever you wrote a new code contribution, you were also supposed to write tests. These tests helped ensure that your code did what it was supposed to – and didn't break any existing functionality.
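The app in this story was written in Ruby on Rails, and its real tests are not shown here. As a generic illustration of what a unit test looks like, here is a minimal sketch using Python's built-in unittest module. The apply_discount function and its three tests are hypothetical.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_applies_discount(self):
        # The code does what it is supposed to do.
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_percent_leaves_price_unchanged(self):
        # Behavior that callers already rely on still works.
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_rejects_invalid_percent(self):
        # Bad input fails loudly instead of returning a wrong price.
        with self.assertRaises(ValueError):
            apply_discount(80.0, 150)


# Load and run the tests directly so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On a team, a suite like this typically runs automatically on every commit. A failing test is what "breaking the build" means in practice.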

I would frequently "break the build" by committing code that I thought was sufficiently tested – only to have my code break some other part of the app I hadn't thought about. This frustrated the entire team, who were unable to merge their own code until the root problem had been fixed.

The build would break several times a week. And I was not the only person who made these sorts of mistakes. But it felt like I was.

There were days where I felt like I was not cut out to be a developer. I'd say to myself: "Who am I kidding? I just wake up one day and decide I'm going to be a developer?"

I kept hearing echoes of all those things my developer friends had said to me a year earlier, when I was first starting my coding journey.

"How are you going to hang with people who grew up coding from an early age?"

"You're going to have to drink an entire ocean of knowledge."

"Why don't you just stick with what you're good at?"

I would take progressively longer breaks to get away from my computer. The office had a kitchen filled with snacks. I would find more excuses to get up to grab a snack. Anything to delay the crushing sense that I had no idea what I was doing.

The first few months were rough. During morning standup meetings, it felt like everyone was moving fast. Closing open bugs and shipping features. It felt like I had nothing to say. I was still working on the same feature as the day before.

Every day when I woke up and got ready for work, I felt dread. "This is going to be the day they fire me."

But then I'd go to work and everyone would be pretty kind, pretty patient. I would ask for help if I was really stuck. I would make some progress, and maybe fix a bug or two.

I was getting faster at navigating the codebase. I was getting faster at reading stack traces when my code errored out. I was shipping features at a faster clip than before.

Whenever my boss called me into his office, I would think to myself: "Oh no, I was right. I'm going to get fired today." But he would just assign me some more bugs to fix, or features to develop. Phew.

It was the most surreal thing – me terrified out of my mind that I'm about to get the axe, and him having no idea anything's wrong.

Of course, I had heard the term "imposter syndrome" before. But I didn't realize that was what I was experiencing. Surely I was just suffering from "sucks at coding" syndrome, right?

One day I was sitting next to Nick, and he was looking pretty frazzled. I offered to grab him a soda from the kitchen.

When I got back, he cracked the can open, took a sip, and leaned back in his chair, gazing at his monitor full of code. "This bug, man. Three weeks trying to fix this one bug. At this point I'm debugging it in my sleep."

"Three weeks trying to fix the same bug?" I asked. I had never heard of such a thing.

"Some bugs are tougher to crack than others. This is one of those really devious ones."

It felt like someone had slapped me across the face with a salmon. I had viewed my job as chunks of work. As though it should take half a day to fix a bug, and if it took longer than that, I was doing something wrong.

But here Nick was – with his computer science degree from University of California and his years of experience working on this same codebase – and he was stumped for three weeks on a single bug.

Maybe I had been too hard on myself. Maybe some of these bugs I'd been fixing were not necessarily "half-day bugs", but were "two- or three-day bugs." Yes, I was inexperienced and slow. But even so, maybe I was holding myself to unrealistic standards.

After all, when we budgeted time for features, sometimes we would have "5-day features" or even "2-week features." We didn't do this for bugs, but they probably varied similarly.

I went home and read more about Imposter Syndrome. And what I read explained away a lot of my anxiety.

Over the coming months, I kept building out features for the codebase. I kept collaborating with my team. It was still hard, brain-busting work. But it was starting to get a little bit easier.

I bonded with my teammates each day at lunch over board games. One week, we had a company-wide chess tournament.

A couple rounds in, I played against the CEO.

The CEO had an unorthodox chess play style. He used a silly opening that few serious chess players would opt for. And I was able to take an early lead in the game.

But over the next few moves, he was able to slowly grind back control over the game. He eventually gained the upper hand and beat me.

When I asked him how he found time to keep his chess skills sharp while running a company, he said, "Oh, I don't. I only play once or twice a year."

Then he paused for a moment, his hand frozen in front of him, as if preparing to launch into a lecture. He said: "My uncle was a competitive chess player. And he just gave me a single piece of advice to follow: every time your opponent moves, slow down and try to understand the game from their perspective – why did they make that move?"

He bowed, then excused himself to run to a meeting.

I've thought a lot about what he said over the years. And I've realized this advice doesn't just apply to chess. You can apply it to any adversarial situation.

If You Keep Having to Do a Task, You Should Automate it

Another lesson I learned about software development: since I was the most junior person on the team, I often got assigned the "grunt work" that nobody else wanted to do. One of these tasks was to be the "build nanny."

Whenever someone broke the build, I would pull down the latest version of our main branch and use git bisect to try and identify the commit that broke it.

I'd open up that commit, run the tests, and figure out what went wrong. Then I'd send a message to the person who broke the build, telling them what they needed to fix.
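For readers who haven't used it, the idea behind git bisect is a binary search over the commit history. Here is a minimal Python sketch of that idea, not the tool itself. The commit IDs and the `is_good` check are hypothetical stand-ins for real commits and a real test run, and the sketch assumes the oldest commit is good, the newest is bad, and the build stays broken once it breaks.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest commit that fails the is_good check.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # the breakage happened after mid
        else:
            hi = mid  # the breakage happened at mid or earlier
    return commits[hi]


# Hypothetical history: the build has been broken since commit "f6".
commits = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
broken_from = commits.index("f6")


def is_good(commit: str) -> bool:
    # Stand-in for "check out this commit and run the tests".
    return commits.index(commit) < broken_from


culprit = first_bad_commit(commits, is_good)
print(culprit)  # f6
```

With eight commits, this finds the culprit after testing three of them. That halving at every step is why the approach is fast even on a long history.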

I got really fast at doing this. In a day full of confusing bug reports and ambiguous feature requests, I looked forward to the build breaking. It would give me a chance to feel useful real quick.

It wasn't long before someone on the team said, "With how often the build breaks, we should just automate this."

I didn't say anything, but I felt defensive. This was a bad idea. How could a script do as good a job at finding the guilty commit as I – a flesh and blood developer – could?

It took a few days. But sure enough, one of my teammates whipped up a script. And I didn't have to be the build nanny anymore.

It felt strange to see a message that the build failed, and then a moment later see a message saying which commit broke the build and who needed to go fix it.

I felt indignant. I didn't say anything, but in my mind I was thinking: "That's supposed to be my work. That script took my job."

But of course, I now look back at my reaction and laugh. I imagine myself, now in my 40s, still dropping everything several times each week so I could be the build nanny.

Because in practice, if a task can be automated – if you can break it down into discrete steps that a computer can reliably do for you – then you should probably automate it.

There's plenty of more interesting work you can do with your time.

This chart from XKCD can help you figure out whether a task is worth the time investment to automate.
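The arithmetic behind that kind of chart is simple enough to sketch in a few lines of Python. This is a rough estimate under stated assumptions, not a reproduction of the chart: the function names are made up for illustration, the five-year horizon is just a default you can change, and the example numbers are hypothetical.

```python
def hours_saved(minutes_saved_per_run: float, runs_per_week: float,
                horizon_years: float = 5.0) -> float:
    """Total hours saved over the horizon if the task is automated."""
    total_runs = runs_per_week * 52 * horizon_years
    return minutes_saved_per_run * total_runs / 60


def worth_automating(hours_to_automate: float, minutes_saved_per_run: float,
                     runs_per_week: float, horizon_years: float = 5.0) -> bool:
    """True if building the automation costs less time than it saves."""
    return hours_to_automate < hours_saved(
        minutes_saved_per_run, runs_per_week, horizon_years
    )


# Hypothetical numbers: a 20-minute chore done 3 times a week,
# and a script that takes about 24 working hours to write.
saved = hours_saved(minutes_saved_per_run=20, runs_per_week=3)
print(saved)  # 260.0
print(worth_automating(24, 20, 3))  # True
```

With these made-up numbers, a few days of scripting pays for itself many times over. The estimate ignores the cost of maintaining the script, so treat it as a rough upper bound on the benefit.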

Lessons from the Village Elders

I learned a lot from other people on the team. I learned product design concepts from Mike. He took me running on the beach, and taught me how to run on my forefoot, where the balls of my feet hit the ground before my heels. This is a bit easier on your joints.

And I learned about agile software engineering concepts from Nick. He helped me pick out some good software development books from the company library. And he even invited me over for a house-warming party, and I got to meet his kids.

After about a year of working for the company, I felt it was time to try to strike out on my own, and build some projects around online learning. I sat down with the CTO to break the news to him that I was leaving.

I said, "I'm grateful that you all hired me, even though I was clearly the weakest developer at the company."

He just let out a laugh and said, "Sure, when you started, you were the worst developer on the team. I'd say you're still the worst developer on the team."

I sat there smiling awkwardly, blinking at him, not sure whether he was just angry I was leaving.

And then he said, "But that's smart. You're smart. Because you always want to be the worst musician in the band. You always want to be surrounded by people who are better than you. That's how you grow."

Two weeks later, I checked in my code changes for the day and handed off my open tickets. I reset my Mac to factory settings and handed it to my manager.

I shook hands with my teammates and headed out the door into the California evening air.

I hit the ground running, lining up freelance contracts to keep the lights on. And I scouted out an apartment in the Bay Area, just across the bridge from the beating heart of tech in South of Market San Francisco.

I was now a professional developer with a year of experience already under my belt.

I was ready to dream new dreams and make new moves.

I was off to the land of startups.

Lessons From my First Year as a Developer

I did a lot of things right during my first year as a professional developer. I give myself a B-.

But if I had the chance to do it all again, there are some things I'd do differently.

Here are some tips. May these maximize your learning and minimize your heartache.

Leave Your Ego at the Door

Many people entering the software development field will start at the very bottom. One title you might have is "Junior Developer."

It can feel a bit awkward to be middle aged and have the word "junior" in your title. But with some patience and some hard work, you can move past it.

One problem I faced every day was – I had 10 years of professional experience. I was not an entry-level employee. Yes, I was new to development, but I was quite experienced at teaching and even managing people. (I'd managed 30 employees at my most recent teaching job.)

And yet – in spite of all my past work experience – I was still an entry-level developer. I was still a novice. A neophyte. A newbie.

As much as I wanted to scream "I used to be the boss – I don't need you to babysit me" – the truth was I did need babysitters.

What if I accidentally broke production? What if I introduced a security vulnerability into the app? What if I wiped the entire database? Or encrypted something important and lost the key?

These sorts of disasters happen all the time.

The reality is that, as a new developer, you are like a bull in a china shop, trying to walk carefully, but smashing everything in your path.

Don't let yourself get impatient with your teammates. Resist the temptation to talk about your advanced degrees, awards your work has won, or that time the mayor gave you the key to the city. (OK, maybe that last one never happened to me.)

Not just because it will make you hard to work with, but because it will distract you from the task at hand.

For the first few months of my developer career, I used my past accomplishments as a sort of pacifier. "Yeah I suck at coding, but I'm phenomenal at teaching English grammar. Did I mention I used to run a school?"

When your fingers are on the keyboard, and your eyes are on the code editor, you have to let that past self go. You can revel in yesterday's accomplishments tonight, after today's work is done.

But for now, you need to accept all the emotions that come with being a beginner again. You need to focus on the task at hand and get the job done.

It's Probably Just the Imposter Syndrome Talking

Almost everyone I know has experienced Imposter Syndrome. That feeling that you do not belong. That feeling that at any moment your teammates are going to see how terrible your code is and expose you as not a "real developer."

To some extent, the feeling does not go away. It's always there in the back of your mind, ready to rear its head when you try to do something new.

"Could you help me get past this error message?" "Um... I'm not sure if I'm the best person to ask."

"Could you pair program with me on implementing this feature?" "Um... I guess if you can't find someone more qualified."

"Could you give a talk at our upcoming conference?" "Um... me?"

I've met senior engineers who still suffer from occasional imposter syndrome, more than a decade into their career.

When you feel inadequate or unprepared, it may just be imposter syndrome.

Sure – if you handed me a scalpel and said, "help me perform heart surgery" I would feel like an imposter. To some extent, feeling out of your depth is totally reasonable if you are indeed out of your depth.

The problem is that if you've been practicing software development, you may be perfectly able to do something but still inexplicably feel anxious about it.

I am not a doctor. But my instinct is that – for most people – imposter syndrome will gradually diminish with time, as you get more practice and build more confidence.

But it can randomly pop up. I'm not afraid to admit that I sometimes feel pangs of imposter syndrome when doing a new task, or one I haven't done in a while.

The key is to just accept it: "It's probably just the imposter syndrome talking."

And to keep going.

Find Your Tribe. But Don't Fall for Tribalism

When you get your first developer job, you'll work alongside other developers. Yippee – you found your tribe.

You'll spend a lot of time with them, and you all may start to feel like a tight unit.

But don't ignore the non-developer people around you.

In my story above, I talked about Mike, the Product Manager who also ran startup events. He was "non-technical". His knowledge of coding was limited at best. But I'd venture to say I learned as much from him as anyone else at the company.

You may work with other people from other departments – designers, product managers, project managers, IT people, QA people, marketers, even finance and accounting folks. You can learn a lot from these people, too.

Yes, you should focus on building strong connective tissue between you and the other devs on the team. But stay curious. Hang out with other people in the lunch room or at company events. You never know who's going to be the next person to help you build your skills, your network, or your reputation.

Don't Get Too Comfortable and Specialize too Early

I often give this advice to folks who are first starting their coding journey: "learn general coding skills (JavaScript, SQL, Linux, and so on) and then specialize on the job."

The idea is, once you understand how the most common tools work, you can then go and learn those tools' less common equivalents.

For example, once you've learned PostgreSQL, you can easily learn MySQL. Once you've learned Node.js, you can easily learn Ruby on Rails or Java Spring Boot.

But some people specialize too early at work. Their boss might ask them to "own" a certain API or feature. And if they do a good job with that, their boss may keep giving them similar projects.

You are only managing yourself, but your boss is managing many people. They may be too busy to develop a nuanced understanding of your abilities and interests. They may come to see you as "the XYZ person" and just give you tasks related to that.

But you know what you're good at, and what you're interested in. You can try and volunteer for projects outside of your comfort zone. If you can get your boss to assign these to you, you'll be able to continue to expand your skills, and potentially work with new teams.

Remember: your boss may be responsible for your performance at your job, but you are responsible for your performance across your career.

Take on projects that both fulfill your obligation to your employer, and also position you well for your long-term career goals.

Epilogue: You Can Do This

If there's one message I want to leave you with here, it is this: you can do this.

You can learn these concepts.

You can learn these tools.

You can become a developer.

Then, the moment someone hands you money for you to help them code something, you will graduate to being a professional developer.

Learning to code and getting a first developer job is a daunting process. But do not be daunted.

If you stick with it, you will eventually succeed. It is just a matter of practice.

Build your projects. Show them to your friends. Build projects for your friends.

Build your network. Help the people you meet along the way. What goes around comes around. You'll get what's coming to you.

It is not too late. Life is long.

You will look back on this moment years from now and be glad you made a move.

Plan for it to take a long time. Plan for uncertainty.

But above all, keep coming back to the keyboard. Keep making it out to events. Keep sharing your wins with friends.

As Lao Tsu, the Old Master, once said:

"A journey of a thousand miles begins with a single step."

By finishing this book, you've already taken a step. Heck, you may have already taken many steps toward your goals.

Momentum is everything. So keep up that forward momentum you've already built up over these past few hours with this book.

Start coding your next project today.

And always remember:

You can do this.

Farm	Yield (tons/ha)	Fertilizer Used (kg/ha)	Rainfall (mm)
A	4.2	150	280
B	5.8	220	420
C	3.9	120	230
D	6.1	250	480
E	4.7	200	340
F	5.3	200	390
