automation - freeCodeCamp.org

How to Build an AI Agent That Runs its Own LLM Experiments with autoresearch

ishaan gupta — Mon, 29 Jun 2026 16:50:22 +0000

A few months ago, Andrej Karpathy released autoresearch. It's an open-source Python tool that lets an AI agent run experiments on one GPU while you sit back and wait for the results.

Lately I've still seen folks on Twitter arguing about whether AI agents can build their “million dollar idea”, or debating OpenClaw. But here's a repo that lets you hand an agent a real GPT training setup and ask it to do the research itself.

Basically it edits the code, trains, reads the loss, makes a decision about the result, and repeats this process. And all this happens while you sleep, or dig into something else. And surprisingly, it does actually work.

On a depth-12 nanochat baseline (more on what "depth" means later), Karpathy left it running for about two days. Over roughly 700 experiments, the agent found about 20 changes that genuinely improved the model, and those changes stacked on top of each other.

In this article, I'll walk through what autoresearch is, why the way it measures success is the whole trick, what each file in the repo actually does, what the agent tends to discover, and a step-by-step guide to running it yourself. By the end you should be able to point an agent at your own GPU and let it run.

Prerequisites
What is autoresearch?
Why This Matters
What Exactly is val_bpb?
What the Agent Actually Finds
Final Thoughts

Prerequisites

This article is a complete walkthrough of this repo. The goal is that by the end, you'll understand what autoresearch is and how you can run it on your own machine.

No prior ML research experience is required, though the deeper sections will mean more to you if you have it. Basic familiarity with GPUs and VRAM, and with cards like the H100, A100, and RTX 4090, is enough. And don't worry: I've included quoted asides below that explain every term I think a beginner needs to understand.

What is autoresearch?

Simply put, autoresearch is just one specific idea executed cleanly. You take a small but real LLM training setup, put it in a single Python file, and let an AI agent edit that file.

The agent runs the file and reads the loss. When you train a language model, "loss" is just a single number that scores how badly the model is predicting the next chunk of text. A high number means it's guessing poorly, and a number close to zero means it's predicting almost perfectly.

Training is the process of nudging the model's millions of internal weights to push that number down. So when I say the agent "reads the loss," I mean it looks at that score to judge whether the change it just made helped or hurt.

Based on that score, the agent decides whether the change helped, and then either keeps the change or reverts it. Then it tries something else.

The flow runs top to bottom like this: A human (you) writes the playbook (a Markdown file called program.md), which spells out the rules. An AI agent reads that playbook and starts an experiment loop.

In each pass of the loop, the agent edits the training code with a new idea, trains for five minutes, reads the resulting score, decides whether to keep or undo the change, and writes the outcome to a results file. Then it loops back and tries the next idea.
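As a rough mental model, the loop described above is greedy hill-climbing: keep a change only if the score gets lower. The sketch below is not code from the repo. `propose` and `evaluate` are hypothetical stand-ins for "the agent edits train.py" and "train for five minutes and read the score", and the toy objective exists only to make the example runnable.

```python
import random

def research_loop(evaluate, propose, baseline, n_experiments, seed=0):
    """Greedy keep-or-revert loop: a candidate is kept only if it scores lower."""
    rng = random.Random(seed)
    best_cfg = baseline
    best_score = evaluate(best_cfg)
    log = [("baseline", best_score, "keep")]
    for i in range(n_experiments):
        candidate = propose(best_cfg, rng)   # stands in for "edit the training code"
        score = evaluate(candidate)          # stands in for "train, then read the score"
        if score < best_score:               # lower is better
            best_cfg, best_score = candidate, score
            log.append((f"exp{i}", score, "keep"))
        else:
            log.append((f"exp{i}", score, "discard"))  # the change is undone
    return best_cfg, best_score, log

# Toy stand-ins (hypothetical): the "config" is one number and the "score"
# is a bowl-shaped function with its minimum at x = 3.0.
def toy_evaluate(x):
    return 1.0 + (x - 3.0) ** 2

def toy_propose(x, rng):
    return x + rng.uniform(-0.5, 0.5)

best_cfg, best_score, log = research_loop(
    toy_evaluate, toy_propose, baseline=0.0, n_experiments=100
)
print(f"best score {best_score:.4f} after {len(log) - 1} experiments")
```

Every kept entry in `log` has a strictly lower score than the one before it, which is the property that lets small wins accumulate over a long run.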

It does this on its own, around twelve times an hour. So a full night of sleep buys you roughly a hundred experiments and, with luck, a noticeably better model by morning.

The repo is laid out so the agent has exactly one knob to turn. It can't install new packages or change how the data is loaded or how the loss is measured. All of that is locked down on purpose. The only file the agent edits is train.py, which contains the model architecture, the optimizer, the batch size, the learning rate, and the structure of the training loop itself.

The reason this design works is the same reason a controlled experiment in any field works. If the data, the metric, and the budget are all fixed, then any change in the result must be coming from the change the agent made. The agent is doing science the way a careful researcher would, only it doesn't get tired and doesn't need lunch.

Why This Matters

It's tempting to read this as just another agent demo. But it's not, and the reason is the metric. That metric is called val_bpb, short for validation bits per byte. It's a specific way of scoring how well the model predicts text it has never seen during training (the "validation" set).

I'll break down exactly how it's calculated in the next section, but the one-line version is that it measures, on average, how many bits of information the model needs to encode each byte of text. Lower is better: a lower val_bpb means the model is surprised less often by real text, which is the whole goal.

The reason Karpathy uses bits per byte rather than the raw training loss is that bits per byte doesn't change just because you changed the vocabulary, so two very different models can still be compared fairly. The "lower is better" part and the "vocabulary-independent" part are two separate properties. The metric happens to have both.

When I say a baseline model from this repo "lands around 1.00 bpb," I mean that if you run the default untouched training script for its 5 minutes, the model it produces scores roughly 1.00 on this metric when measured on the held-out validation text. That's your starting line.

From there, an improvement of 0.005 bpb (so a score of about 0.995) is a small but real win, the kind the agent finds often. An improvement of 0.05 (a score near 0.95) would be enormous, the kind of jump you'd usually only get from a much bigger model or a much longer training run. So the numbers look tiny, but on this scale, thousandths of a bit genuinely matter.

Here's why optimizing this particular number is a big deal. The agent isn't chasing some artificial leaderboard that researchers spent years gaming. It's pushing down the same kind of validation loss curve that every major language model has been trained against since GPT-2 in 2019.

A "loss curve" is just the plot of that score dropping over the course of training, and "the wave of LLMs since GPT-2" is shorthand for the fact that essentially all of the progress, from GPT-2 to today's frontier models, came from people finding ways to make that curve drop faster or lower for the same amount of compute. The agent is working on the exact same problem, just at a small, fast, cheap scale.

And that's what makes the next part surprising. When the agent finds an improvement "here," I mean on the small depth-12 model it's allowed to edit. "Depth" is the number of transformer layers stacked in the model. Depth-12 is a small model, and depth-24 is a bigger one with twice as many layers.

Karpathy took the roughly 20 tweaks the agent discovered on the small depth-12 model and applied them to the bigger depth-24 model. Being stacked cleanly means two things at once: the improvements were additive (turning on all 20 together gave you the sum of their individual gains, rather than cancelling each other out), and they transferred (gains found on the small model still showed up on the big one).

That's the signal that the agent found real insights about training, not lucky quirks that only help at one specific size. Stacked together, they cut Karpathy's "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours, which is about an 11% speedup on code he'd already hand-tuned for a long time.

The other thing that's significant is the budget. Each experiment runs for exactly 5 minutes of wall-clock training time, no more, no less. That gives roughly 12 experiments per hour, or about 100 in a typical 8-hour sleep cycle.
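The arithmetic behind these figures is easy to check. The snippet below uses only numbers quoted in this article, apart from the 8-hour night, which is the same round figure used above.

```python
EXPERIMENT_MINUTES = 5

per_hour = 60 // EXPERIMENT_MINUTES   # experiments per hour, ignoring overhead
per_night = per_hour * 8              # an 8-hour night

# "Time to GPT-2" before and after the stacked changes, in hours
before_h, after_h = 2.02, 1.80
reduction = (before_h - after_h) / before_h   # fraction of time saved

print(per_hour, per_night, f"{reduction:.1%}")
# prints: 12 96 10.9%
```

In practice each run also pays for startup and the final evaluation, and the agent needs time to think between runs, so the real count per hour will sit a little below 12.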

Exploring the Repo

Now if you clone the repo, you get a small handful of files. Most of them are plumbing. Three of them are the heart of the system and the difference between them is who edits what.

Only three files matter, and they differ by who edits them.

train.py is the file the agent edits. It holds the GPT model, the optimizer, and the training loop, and everything in it is fair game.
prepare.py is the fixed foundation that nobody edits during a run: it downloads the data, trains the tokenizer, and defines the metric.
program.md is the file you, the human, edit: it's the playbook of rules the agent follows.

The remaining files (README.md, pyproject.toml, uv.lock, .gitignore, .python-version, the analysis.ipynb notebook, and the progress.png image) are plumbing and documentation that neither you nor the agent needs to touch during a run.

What Exactly is `val_bpb`?

Before going further, it helps to understand what val_bpb is. If you've read other LLM articles, you have probably seen terms like “perplexity” or “cross-entropy loss” thrown around.

Bits per byte is like their cousin. When a language model predicts text, it assigns probabilities to what comes next. If the model is confident and right, it gets a low loss. If it's confident and wrong, it gets a high loss, a large penalty. Add up those penalties across all the text and you get the model's total loss. Lower is better, because a lower total means the model assigned high probability to the words that actually appeared.

Cross-entropy loss is the standard scoring function for training language models. For each token, the model assigns a probability to every possible next token and the loss is the negative logarithm of the probability it gave to the token that actually came next. Predict the right token confidently and the loss is near zero. Assign low probability to the correct token and the loss is large. The model's total loss is the average of this across all tokens.

Cross-entropy loss measures this in nats. A nat is the unit you get when that logarithm is taken in base e (the natural log) instead of base 2. It measures the same quantity of "surprise" on a different scale (one nat is about 1.44 bits). Dividing the loss by the natural log of 2 is what rescales nats into bits, which is the conversion bits per byte performs.

Bits per byte takes that loss and divides it by the number of bytes the text actually contains, then converts to log base 2. The result is a number that tells you, on average, how many bits of information the model needs to encode each byte of text.

A perfect model would need close to zero, while a random model would need around 8 bits per byte (since a byte has 8 bits).

The reason Karpathy chose bpb instead of plain cross-entropy is that bpb is vocabulary-size-independent. If the agent decides to change the tokenizer or the vocabulary, the cross-entropy loss would be completely different even for the same model quality. Bits per byte normalizes that out, so a depth-8 model with vocab 8192 and a depth-12 model with vocab 16384 are directly comparable.
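To make the conversion concrete, here is a minimal sketch of the formula as described above: sum the per-token losses in nats, convert to bits by dividing by ln 2, and divide by the number of bytes. It is not the repo's evaluate_bpb function, and the two "tokenizers" are made-up numbers chosen only to show why per-token loss and bits per byte can disagree.

```python
import math

def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Total loss in nats, converted to bits, divided by total bytes of text."""
    total_nats = sum(token_losses_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

# Sanity check: guessing uniformly over 256 one-byte tokens costs ln(256)
# nats per token, which works out to 8 bits per byte.
uniform = bits_per_byte([math.log(256)] * 10, [1] * 10)

# The same 12-byte text with the same total surprise (9 nats), split two ways
# by two hypothetical tokenizers:
coarse = ([3.0, 3.0, 3.0], [4, 4, 4])   # 3 tokens of 4 bytes each
fine = ([1.5] * 6, [2] * 6)             # 6 tokens of 2 bytes each

per_token_coarse = sum(coarse[0]) / len(coarse[0])   # 3.0 nats per token
per_token_fine = sum(fine[0]) / len(fine[0])         # 1.5 nats per token

print(f"uniform: {uniform:.2f} bits/byte")
print(f"per-token loss: {per_token_coarse} vs {per_token_fine}")
print(f"bpb: {bits_per_byte(*coarse):.4f} vs {bits_per_byte(*fine):.4f}")
```

The per-token losses differ by a factor of two even though the two setups are equally good at predicting the text, while the bits-per-byte values match. That is the sense in which the metric is independent of the vocabulary.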

The function that computes this, evaluate_bpb, lives in prepare.py, which the agent is never allowed to edit. It can only touch train.py. Because the metric's definition sits in a file the agent can't modify, it can't lower its score by quietly changing how the score is calculated. The scoring rule stays identical for every experiment, which is what makes the comparison honest.

The 5 Minute Rule

There's one design choice in autoresearch that deserves its own section, because it's the choice that makes the whole thing work in practice. Every experiment runs for exactly 5 minutes of wall-clock training time regardless of what the agent is doing.

Wall-clock time means real elapsed time: what a clock on the wall measures, not the number of training steps or tokens processed. 5 minutes of wall-clock time is 5 literal minutes, regardless of how much the model does in them.

If you trained for a fixed number of steps instead, the agent could “win” by making the model so small that it ripped through more steps than the baseline. If you trained for a fixed number of tokens, the agent could win by lowering the sequence length.

The agent isn't competing against another agent as we might think of it. Its only objective is to push val_bpb below the previous best score on this exact setup. So "winning" means producing a lower score, and the risk is that it lowers the score through a degenerate shortcut that games whichever budget you chose rather than a real efficiency gain. If you trained until convergence, the agent’s run would take wildly different amounts of time and you would never finish 100 experiments in a night.

A fixed wall clock budget cuts through all of this. The agent is forced to optimize for actual training efficiency on the actual hardware in front of it. If it makes the model slightly bigger but the per-step compute drops because of a smarter attention pattern, that's a real win. If it speeds up the per-step compute but the model now learns less per step, that shows up as a worse val_bpb. The two effects get netted out automatically in the end.

The H100 and A100 are NVIDIA datacenter GPUs and the RTX 4090 is a high-end consumer card. They differ sharply in speed and memory, and that's the whole point: in a fixed 5 minute budget, a faster card processes more data and reaches a lower val_bpb. So a score from one GPU can't be compared head-to-head with a score from another.

There's a tradeoff, though. Because the budget is wall-clock, the val_bpb you get on an H100 isn't directly comparable to the val_bpb you get on a 4090 or an A100. The system is designed to find the best model for your specific compute platform in 5 minutes, not to be a global benchmark.

If you want to compare across hardware, you would need to fix a different budget. For the autonomous research use case, this is exactly right.

Let’s get into each of the files in depth now.

1. `prepare.py`

Nobody touches this file but everything depends on it. It mainly performs three jobs.

The first job is downloading data. The training corpus is ClimbMix-400B, a high-quality web dataset hosted on HuggingFace and shuffled into 6,543 parquet shards. By default, prepare.py downloads only 10 of these (a few gigabytes in total), which is plenty for running thousands of 5-minute experiments.

The very last shard is always downloaded and pinned as the validation set. That pinning matters, since every experiment (no matter what changes) evaluates on the exact same held-out data.

The second job is training a tokenizer. The repo uses rustbpe, a fast Rust implementation of byte-pair encoding, to learn a vocabulary of 8,192 tokens from a sample of the training data. The result is exported as a tiktoken-compatible encoding so it integrates cleanly with PyTorch downstream. There's also a small precomputed lookup table called token_bytes.pt that maps each token id to its UTF-8 byte length. This is what makes the bpb calculation honest.

The third job is providing utilities that train.py imports at runtime. The dataloader is the interesting one. It does what's called best-fit packing: every row in the batch starts with a special BOS (beginning of sequence) token and the loader fills the row by greedily picking documents that fit in the remaining space. Only when no document fits does it crop the shortest available document to fill the gap.

The result is 100% utilization with no padding. This is meaningfully faster than the naïve approach of just truncating long documents and padding short ones. The constants at the top of prepare.py are deliberately simple. Three numbers and a sequence length define the entire experimental contract.
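The best-fit packing idea described above can be sketched in a few lines. This is an illustration of the technique, not the dataloader from prepare.py: the `BOS` id, the `pack_row` helper, and the assumption that every document already begins with `BOS` are all mine.

```python
BOS = 0  # hypothetical id for the beginning-of-sequence token

def pack_row(buffer, row_len):
    """Build one row of up to row_len tokens from buffer, a list of documents.

    Each document is a list of token ids that starts with BOS. Documents that
    are used are removed from buffer. The row is shorter than row_len only if
    the buffer runs out.
    """
    row = []
    while len(row) < row_len and buffer:
        remaining = row_len - len(row)
        fitting = [doc for doc in buffer if len(doc) <= remaining]
        if fitting:
            doc = max(fitting, key=len)      # best fit: longest doc that still fits
            buffer.remove(doc)
            row.extend(doc)
        else:
            doc = min(buffer, key=len)       # nothing fits: crop the shortest doc
            buffer.remove(doc)
            row.extend(doc[:remaining])
    return row

docs = [
    [BOS, 1, 2, 3, 4],
    [BOS, 5, 6],
    [BOS, 7],
    [BOS, 8, 9, 10, 11, 12, 13],
]
row1 = pack_row(docs, row_len=8)
row2 = pack_row(docs, row_len=8)
print(row1)  # [0, 8, 9, 10, 11, 12, 13, 0]
print(row2)  # [0, 1, 2, 3, 4, 0, 5, 6]
```

Both rows are completely full and contain no padding tokens. The only waste is the tail of the one cropped document in the first row.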

If you run autoresearch on different hardware and want to compare results with a friend, the only thing both of you need to share is these constants. That's the whole point of putting them here and nowhere else.

2. `train.py`

This is the file the agent lives in. It breaks naturally into four parts: the model, the optimizer (Muon for the matrix weights, AdamW for the embeddings and scalar parameters), the hyperparameters, and the training loop. We'll walk through each one with the goal of understanding why each piece exists.

The model is a fairly modern GPT written from scratch with no library dependencies beyond PyTorch and a Flash Attention 3 kernel. If you've read other GPT implementations the high-level structure will look familiar: a token embedding, a stack of transformer blocks, a normalization layer, and a linear head that projects back to vocabulary logits.

The interesting parts are in the details. I don’t think explaining the architecture or code is required for this repo, so I’ll just draw out a small architecture diagram for those of you who want to visualize it. Then I'll explain how the training loop is written.

The loop itself is short and almost pleasant to read. The skeleton is:

while True:
    # accumulate gradient over micro-batches to hit TOTAL_BATCH_SIZE
    for micro_step in range(grad_accum_steps):
        with autocast_ctx:
            loss = model(x, y)
        loss = loss / grad_accum_steps
        loss.backward()
        x, y, epoch = next(train_loader)

    # update LR / momentum / weight decay based on time elapsed
    progress = min(total_training_time / TIME_BUDGET, 1.0)
    # ... set group["lr"], group["momentum"], group["weight_decay"] ...

    optimizer.step()
    model.zero_grad(set_to_none=True)

    # log step metrics
    # ...

    if step > 10 and total_training_time >= TIME_BUDGET:
        break

There are a few things worth noticing here. First, the time budget is checked after the first 10 steps. This is so the budget doesn't include the initial PyTorch compilation (which can take 30 seconds or more). Without this, fast experiments would get penalized for spending half their budget on warmup.

Second, the loop has a fast-fail check. If the loss explodes or hits NaN it prints “FAIL” and exits. The agent then sees a crash and logs it. This is a defense against the agent doing something that diverges spectacularly.

Third, after the loop ends, there's a single final call to evaluate_bpb and then a structured summary printed to stdout.
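The first two points can be sketched without any PyTorch at all. The helper below is a toy, not the repo's loop: `run_budgeted`, `step_fn`, and the `max_loss` threshold are hypothetical, and the injectable `clock` is there only so the behavior is easy to test.

```python
import math
import time

def run_budgeted(step_fn, time_budget, warmup_steps=10, max_loss=100.0,
                 clock=time.perf_counter):
    """Call step_fn() until time_budget seconds of post-warm-up time have passed.

    Returns (status, steps, counted_seconds). The first warmup_steps steps are
    run but not counted against the budget.
    """
    counted = 0.0
    step = 0
    while True:
        t0 = clock()
        loss = step_fn()
        dt = clock() - t0
        step += 1
        # Fast-fail: stop immediately on NaN/inf or an exploding loss.
        if not math.isfinite(loss) or loss > max_loss:
            print("FAIL")
            return "fail", step, counted
        if step > warmup_steps:
            counted += dt
            if counted >= time_budget:
                return "ok", step, counted

def fake_step():
    time.sleep(0.001)   # pretend to do one optimizer step
    return 2.5          # pretend loss

status, steps, seconds = run_budgeted(fake_step, time_budget=0.01)
print(status, steps, f"{seconds:.3f}s counted")

print(run_budgeted(lambda: float("nan"), time_budget=0.01))
```

Because the warm-up steps are excluded, a run that spends a long time compiling and a run that starts instantly get the same amount of counted training time.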

That summary is the whole API between the training script and the agent:

---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8

This is what the grep extracts and the agent reads. The whole experimental contract is these nine lines of plain text.
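If you want to read that summary from your own scripts rather than with grep, a small parser is enough. This is a sketch that assumes the layout shown above (a `---` separator followed by `key: value` lines); `parse_summary` is a hypothetical helper, not something the repo provides, and the `step 952` line in the sample is invented to stand in for earlier log output.

```python
def parse_summary(stdout_text):
    """Return the numeric key/value pairs that follow the last '---' line."""
    lines = stdout_text.splitlines()
    start = 0
    for i, line in enumerate(lines):
        if line.strip() == "---":
            start = i + 1          # remember where the final summary begins
    metrics = {}
    for line in lines[start:]:
        key, sep, value = line.partition(":")
        if not sep:
            continue               # not a "key: value" line
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue               # value was not a number
    return metrics

sample = """step 952 | loss 3.1
---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8
"""

metrics = parse_summary(sample)
print(metrics["val_bpb"], metrics["peak_vram_mb"], len(metrics))
```

An empty result is a useful signal in its own right: if no metrics come back, the run most likely crashed before reaching the summary.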

The Hyperparameters

The hyperparameters live in their own clearly-marked section near the bottom of train.py, with a comment that says "edit these directly, no CLI flags needed." They look like this:

# Model architecture
ASPECT_RATIO = 64       # model_dim = depth * ASPECT_RATIO
HEAD_DIM = 128          # target head dimension for attention
WINDOW_PATTERN = "SSSL" # sliding window pattern: L=full, S=half context

# Optimization
TOTAL_BATCH_SIZE = 2**19 # ~524K tokens per optimizer step
EMBEDDING_LR = 0.6
UNEMBEDDING_LR = 0.004
MATRIX_LR = 0.04
SCALAR_LR = 0.5
WEIGHT_DECAY = 0.2
ADAM_BETAS = (0.8, 0.95)
WARMUP_RATIO = 0.0
WARMDOWN_RATIO = 0.5
FINAL_LR_FRAC = 0.0

# Model size
DEPTH = 8
DEVICE_BATCH_SIZE = 128

Everything here is a deliberate single source of truth. The model dimension is computed from depth (depth × 64, rounded to the head dimension). The number of heads is computed from the model dimension. This means that the agent can change one number, DEPTH, and the model rescales itself coherently.

That kind of "one knob to scale the model" parameterization is exactly what makes a search space tractable.
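Here is a minimal sketch of that derivation using the constants shown above. The article only says the dimension is "rounded to the head dimension", so rounding up to the next multiple of `HEAD_DIM` is my assumption, and `derive_shape` is a hypothetical helper rather than a function from train.py.

```python
ASPECT_RATIO = 64
HEAD_DIM = 128

def derive_shape(depth):
    """Derive (model_dim, num_heads) from depth alone."""
    base = depth * ASPECT_RATIO
    # Assumption: round up to the next multiple of HEAD_DIM.
    model_dim = -(-base // HEAD_DIM) * HEAD_DIM
    num_heads = model_dim // HEAD_DIM
    return model_dim, num_heads

for depth in (4, 5, 8, 12):
    print(depth, derive_shape(depth))
# 4 (256, 2)
# 5 (384, 3)
# 8 (512, 4)
# 12 (768, 6)
```

Depth 5 shows why the rounding step exists: 5 × 64 = 320 is not a multiple of 128, so the width is bumped to 384 to keep a whole number of attention heads.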

3. `program.md`

program.md is the shortest of the three files and is arguably the most important. It's the file that we edit and it contains everything the agent needs to know about how to behave during a run.

The structure of program.md mirrors the lifecycle of a research session. It opens with setup: the agent agrees on a run tag, creates a Git branch named autoresearch/, reads the in-scope files, verifies that the data exists, and initializes a results file. It then lays out the experimentation rules: what the agent can and can't modify, that VRAM is a soft constraint, and, crucially, a simplicity criterion that says all else being equal, simpler is better.

A 0.001 bpb improvement that adds 20 lines of hacky code isn't worth keeping. A 0.001 bpb improvement that removes 20 lines is definitely worth keeping.

Then comes the actual loop. The agent is told to run training with uv run train.py > run.log 2>&1 and never to use tee or stream the output because that would flood the agent's context window. It's also told to extract metrics with grep "^val_bpb:\|^peak_vram_mb:" run.log, which gives just the one or two lines that matter.

If the grep produces nothing, that means the run crashed and the agent is told to read the last 50 lines of the log and try to fix the issue (but it should give up after a few attempts and move on). The result of every experiment is logged to results.tsv.

The decision rule is simple: if val_bpb improved (got lower) then the agent advances the branch by keeping its commit. If it didn't improve, the agent runs git reset to undo the commit. If it crashed, the agent logs that and tries something else.
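Written as code, the three-way rule looks like the sketch below. The agent follows this rule from the prose in program.md rather than from a script, so treat `decide` and `results_row` as hypothetical helpers. The column layout of the row is also an assumption, not the actual schema of results.tsv.

```python
def decide(val_bpb, best_so_far):
    """Map one experiment's outcome to an action.

    val_bpb is None when no metric line was found, meaning the run crashed.
    """
    if val_bpb is None:
        return "crash"       # log it and try something else
    if val_bpb < best_so_far:
        return "keep"        # advance the branch by keeping the commit
    return "discard"         # undo the commit with git reset

def results_row(commit, val_bpb, status, description):
    """Format one tab-separated log line (assumed column order)."""
    score = "" if val_bpb is None else f"{val_bpb:.6f}"
    return "\t".join([commit, score, status, description])

best = 0.997900
for commit, score, note in [
    ("a1b2c3d", 0.995100, "widen attention window"),
    ("d4e5f6a", 0.998400, "double weight decay"),
    ("0f9e8d7", None, "larger batch (out of memory)"),
]:
    status = decide(score, best)
    if status == "keep":
        best = score
    print(results_row(commit, score, status, note))
```

A tie counts as "discard" here, which matches the rule that only a lower val_bpb advances the branch.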

The last paragraph of program.md is the one that makes autoresearch what it is. It's titled NEVER STOP. The agent is explicitly told not to ask the human (you) if it should keep going, not to ask for any permissions, and not to pause for confirmation. If the agent runs out of ideas, it should think harder, look at the failures, combine near-misses, and try more radical changes.

The loop runs until we interrupt it. This single instruction is more interesting than any line of Python in the repo. It's the difference between an agent that does a few experiments and asks if you want to continue and an agent that genuinely does autonomous research overnight.

There is no contradiction with the 5 minute budget. 5 minutes governs a single experiment, one training run. The "Never stop" instruction governs the outer loop. The moment one run finishes and the agent logs the result, it launches the next one. It keeps starting fresh 5 minute experiments back-to-back until you interrupt it.

Nothing ever trains for more than five minutes. The agent simply never stops starting new 5 minute trainings.

Now that you understand how it works, let’s start using it.

Setup Guide

I'm assuming you have a single NVIDIA GPU with enough VRAM to run these experiments. Anything with 24GB or more should work with the default settings. Smaller GPUs need some tuning, which I'll cover later on.

Step 1: Install uv, the Python Project Manager the Repo Uses

uv is much faster than pip and handles virtual environments transparently. After you install it, clone the repo and install the dependencies:

curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

This will create a .venv and install PyTorch, Flash Attention, rustbpe, tiktoken, pyarrow, and a few other packages. It pulls PyTorch from the CUDA 12.8 wheel index, so make sure your driver supports that.

Step 2: Run the Data Preparation

This downloads 10 ClimbMix shards plus the validation shard and then trains our tokenizer.

uv run prepare.py

It takes about 2 minutes on a decent connection. If you have limited disk space, you can pass --num-shards 4 for a smaller download. The data and tokenizer get cached in ~/.cache/autoresearch/.

Step 3: Run a Manual Training Experiment

Now, you'll run a single training experiment manually, just to confirm that everything works end-to-end.

uv run train.py

After about 5 minutes of training, plus an evaluation pass at the end, you'll get the summary block with val_bpb printed. That's your baseline.

Step 4: Hand the Repo to an Agent

In practice, this means opening Claude Code or your tool of choice in the repo directory, ideally with permissions disabled or scoped tightly to the repo, and prompting it with something like this:

Have a look at program.md and let's kick off a new experiment.
Let's do the setup first.

The agent will read program.md, walk through the setup steps (creating the autoresearch branch and initializing results.tsv), confirm with you, and then start running. From this point on, you can leave it alone. When you come back, check results.tsv and the Git log on the autoresearch branch.

Tuning autoresearch for Smaller GPUs

The default configuration assumes an H100. If you have a 4090, 3090, or anything with less than 80GB of VRAM, you'll need to dial things down.

Lower the sequence length first: MAX_SEQ_LEN = 2048 in prepare.py is the biggest VRAM lever since attention scales quadratically with it. Try 512 or even 256 on a small GPU and bump DEVICE_BATCH_SIZE in train.py slightly to compensate. The product of these two is the tokens-per-forward-pass.
Lower the depth: DEPTH = 8 in train.py is the master knob for model size. Drop it to 4 on a small GPU and the model dimension automatically scales down with it.
Switch the window pattern: WINDOW_PATTERN = "SSSL" uses banded attention which is fast on H100 but can be slow on consumer GPUs, depending on the kernel implementation. Just "L" (always full attention) is simpler and often faster on smaller cards.
Lower the total batch size: TOTAL_BATCH_SIZE = 2**19 is roughly 524K tokens per optimizer step. On a small GPU, drop it to 2**14 (~16K) to start.
Consider switching the dataset: ClimbMix is a hard, broad web corpus. On a tiny model, the loss curve is noisy and bpb numbers are hard to interpret. Karpathy specifically recommends his own TinyStories-GPT4-Clean dataset for small-scale experimentation. The text is narrower in scope (children’s stories) so a small model can actually learn to generate something coherent in 5 minutes.
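These knobs interact, so it helps to do the arithmetic before editing anything. The sketch below assumes that the number of gradient-accumulation steps is the total batch size divided by the tokens in one forward pass, and that the division must be exact. The small-GPU values (a device batch of 32 at sequence length 512) are only an illustrative starting point, not a recommendation from the repo.

```python
def grad_accum_steps(total_batch_size, device_batch_size, seq_len):
    """Micro-batches needed per optimizer step (assumes exact divisibility)."""
    tokens_per_pass = device_batch_size * seq_len
    if total_batch_size % tokens_per_pass != 0:
        raise ValueError(
            f"TOTAL_BATCH_SIZE={total_batch_size} is not a multiple of "
            f"{tokens_per_pass} tokens per forward pass"
        )
    return total_batch_size // tokens_per_pass

# Defaults quoted in this article: 2**19 tokens, batch 128, sequence 2048
default_steps = grad_accum_steps(2**19, 128, 2048)

# A hypothetical small-GPU setting: 2**14 tokens, batch 32, sequence 512
small_steps = grad_accum_steps(2**14, 32, 512)

# Attention cost per sequence grows with the square of the sequence length
attention_ratio = (2048 / 512) ** 2

print(default_steps, small_steps, attention_ratio)
# prints: 2 1 16.0
```

Cutting the sequence length from 2048 to 512 shrinks the per-sequence attention cost by a factor of 16, which is why it is the first lever to reach for.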

There are already several community forks that have done the consumer-GPU tuning for you, which you can find in the repo's README.md file.

What the Agent Actually Finds

It's one thing to describe how the loop works, and another to see what it produces. Karpathy was open on Twitter about the results of his depth-12 run: the agent found about 20 changes that improved validation loss, all of which transferred to depth-24.

Specific examples from his post-run analysis include adding a learnable scalar to the parameterless QK-norm to sharpen attention, applying regularization to the value embeddings, widening the banded attention window, correcting the AdamW betas for certain parameter groups, tuning weight decay schedules, and adjusting initialization.

None of these would headline a research paper, but all of them showed up as 0.001 to 0.005 bpb improvements that stacked.

So it's not that an AI agent invented a new architecture. It's that the slow patient hill-climbing that real researchers spend months doing can be done by an agent in a couple of days. The result is the same boring detail-tuning that has always been where most of the actual progress in ML comes from.

Final Thoughts

autoresearch doesn't introduce a new model or a new optimizer or a new dataset. It just defines a kind of contract between a human researcher and an AI agent and it shows that the contract can be enough. That contract is something like “here is the fixed part of reality, the metric that judges you, a budget, and within those rules, do whatever you want and tell me what worked.”

There are two open questions I still think about. One is overfitting to the validation set. If you run hundreds of experiments against the same fixed validation shard, eventually the agent will start finding tweaks that look like wins on this shard but don't transfer. Karpathy himself called the results “fragile” in some sessions.

There's no obvious fix here yet beyond rotating validation data which would break comparability.

The other question is what the human’s role becomes. If the agent does the experiments, the human’s contribution shifts to shaping the search space and the rules. That is what program.md is. It's a pretty good preview of what research looks like when the loop is automated.

Well, that’s it for today. See you folks in my next article!

How to Automate PDF Data Extraction Using Python

Manish Shivanandhan — Wed, 03 Jun 2026 16:25:14 +0000

PDFs are still one of the most widely used document formats in business.

Financial reports, invoices, contracts, compliance filings, and operational documents are often shared as PDFs because they preserve formatting across devices and operating systems.

The problem is that PDFs are designed for presentation, not structured data analysis. Extracting information manually from these files is slow, repetitive, and highly prone to human error.

This becomes a major issue for teams that work with large volumes of documents every day.

Finance departments process invoices and statements, analysts review reports, and operations teams manage records that contain valuable structured data trapped inside static files.

Copying rows manually into spreadsheets doesn't scale, especially when organisations handle hundreds or thousands of PDFs each month.

Python has become one of the most effective tools for automating PDF data extraction because of its mature ecosystem of libraries and data processing frameworks.

Developers can build workflows that extract text, identify tables, clean inconsistent formatting, and export structured datasets into Excel or CSV files automatically.

In smaller workflows, some teams may simply choose to convert PDF to Excel with SmallPDF for quick spreadsheet conversions, while larger organizations often build fully automated extraction pipelines using Python for deeper customisation and control.

In this article, we'll explore how to automate PDF data extraction using Python, including how to extract text and tables from PDFs, clean and transform structured data, work with scanned documents using OCR, and export information into spreadsheet formats like Excel.

We'll also look at some of the most useful Python libraries for document automation and discuss the common challenges developers face when building scalable PDF processing workflows.

Understanding PDF Structures

One of the biggest misconceptions about PDFs is that they all behave the same way. In reality, PDFs can vary significantly depending on how they were generated.

Machine-readable PDFs contain embedded text that can be extracted directly using parsing libraries. These files are usually exported from software systems such as accounting tools, reporting platforms, or office applications. Since the text already exists digitally, extraction is relatively reliable.

Scanned PDFs are different. These documents are essentially images stored inside a PDF container. Since there's no actual text layer, extraction tools can't read the content directly. OCR software must first analyze the images and attempt to reconstruct readable text.

Before writing any code, you should always test whether the text inside a PDF can be selected manually. If text highlighting works normally, the file likely contains a machine-readable layer. If not, you'll probably need OCR.
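The same check can be automated once you have many files. The sketch below uses pdfplumber, one of the libraries installed in the next section. `needs_ocr` and its `min_chars` threshold are a heuristic of my own for illustration, so tune it against your documents.

```python
def extract_page_texts(path):
    """Return the extracted text of every page ('' when a page yields none)."""
    import pdfplumber  # imported here so needs_ocr() works without it installed
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]

def needs_ocr(page_texts, min_chars=20):
    """Heuristic: treat a file as scanned if most pages yield almost no text."""
    if not page_texts:
        return True
    nearly_empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return nearly_empty > len(page_texts) / 2

# Usage with a real file (path is a placeholder):
#   if needs_ocr(extract_page_texts("report.pdf")):
#       ...route the file to the OCR pipeline...

print(needs_ocr(["", "  \n"]))                           # True: no text layer
print(needs_ocr(["Invoice 0042 for ACME Corp, due in 30 days", ""]))  # False
```

Checking the majority of pages rather than just the first one avoids misclassifying reports that start with a scanned cover page.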

Setting Up the Python Environment

Python provides several excellent libraries for PDF extraction and document automation. Each library specializes in different aspects of the workflow.

Some tools focus on text extraction, while others are optimized for identifying tables or processing scanned documents. Commonly used libraries include pdfplumber, PyMuPDF, Camelot, tabula-py, and pytesseract.

You can configure the environment using pip:

pip install pdfplumber pandas openpyxl pymupdf camelot-py

If OCR support is required, you can also install some additional packages:

pip install pytesseract pillow

Tesseract itself must also be installed separately on the operating system because pytesseract acts only as a Python wrapper around the OCR engine.

Once the environment is ready, you can begin building extraction workflows tailored to specific document types.

Extracting Text From PDFs

The simplest PDF automation workflow involves extracting plain text from machine-readable documents.

Libraries such as pdfplumber make this process straightforward:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # Pages without a text layer yield None or an empty string
        text = page.extract_text()
        print(text)

This approach works well for reports, contracts, meeting notes, and other text-heavy documents.

But raw text extraction often introduces formatting issues. Multi-column layouts may become scrambled, line breaks can appear unexpectedly, and tabular information may lose alignment completely.

While text extraction is useful for search indexing and keyword analysis, structured business workflows usually require table extraction instead.

Extracting Tables From PDFs

Most business automation projects focus on extracting tables from PDFs into structured spreadsheet formats.

Camelot is one of the most widely used Python libraries for this purpose. It identifies table structures by analyzing page layouts and separating rows and columns automatically.

Here's a simple example:

import camelot

# pages is passed as a string; "1" reads only the first page
tables = camelot.read_pdf("financial_report.pdf", pages="1")

print(tables[0].df)

The extracted table is returned as a Pandas DataFrame, which makes downstream processing significantly easier.

Exporting the extracted data into Excel is straightforward:

import pandas as pd

df = tables[0].df

# Writing .xlsx files relies on the openpyxl package installed earlier
df.to_excel("output.xlsx", index=False)

This type of workflow is extremely valuable for finance and operations teams that regularly process statements, invoices, audit reports, or procurement records.

Real-world PDFs, however, are rarely perfectly structured. Tables may span multiple pages, contain merged cells, or use inconsistent spacing. You'll often need additional transformation logic to clean and standardize the extracted data before it becomes useful for analytics or reporting.
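As an illustration of that transformation logic, here is a minimal sketch that cleans a table represented as a list of rows (for example, what `tables[0].df.values.tolist()` would give you). It assumes the header sits in the first non-empty row. The helper names `clean_cell`, `parse_amount`, and `clean_table` are hypothetical, not part of Camelot or pandas.

```python
import re


def clean_cell(value):
    """Collapse internal whitespace and newlines, then trim the cell."""
    return re.sub(r"\s+", " ", str(value)).strip()


def parse_amount(text):
    """Convert strings like '$1,234.50' or '(200.00)' to float; None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # accounting style: parentheses mean negative
    try:
        number = float(cleaned)
    except ValueError:
        return None
    return -number if negative else number


def clean_table(rows):
    """Treat the first non-empty row as the header and return a list of dicts."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    cleaned = [row for row in cleaned if any(row)]  # drop fully blank rows
    if not cleaned:
        return []
    header, body = cleaned[0], cleaned[1:]
    records = []
    for row in body:
        record = {}
        for name, cell in zip(header, row):
            amount = parse_amount(cell)
            record[name] = cell if amount is None else amount
        records.append(record)
    return records


raw = [
    ["Item ", "Amount\n(USD)"],
    ["", ""],
    ["Consulting", "$1,200.00"],
    ["Refund", "(200.00)"],
]
for record in clean_table(raw):
    print(record)
```

One caveat with this design: every numeric-looking cell is converted, so an invoice number such as "2024" would also become a float. In a real pipeline you would probably restrict `parse_amount` to the columns you know hold amounts.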

Working With OCR for Scanned PDFs

Scanned documents require OCR because there's no machine-readable text available inside the file.

Python devs commonly use Tesseract together with pytesseract for OCR workflows.

A simple example looks like this:

from PIL import Image
import pytesseract

image = Image.open("invoice_scan.png")

# Requires the Tesseract binary to be installed on the system
text = pytesseract.image_to_string(image)
print(text)

OCR accuracy depends heavily on image quality. Low-resolution scans, skewed pages, handwritten content, and poor lighting can reduce recognition performance substantially.

To improve results, you can preprocess images before running OCR. Common preprocessing techniques include grayscale conversion, thresholding, sharpening, and noise reduction.
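To make the grayscale and thresholding steps concrete, the sketch below picks a threshold automatically using Otsu's method. The article doesn't name a specific thresholding technique, so treat this as one common option rather than the recommended one. `otsu_threshold` and `binarize` are hypothetical helper names, and Pillow is imported lazily so the threshold logic runs on its own.

```python
def otsu_threshold(histogram):
    """Pick the grey level that best separates dark text from light background.

    histogram: 256 pixel counts for an 8-bit greyscale image.
    Returns t such that pixels <= t form the dark class.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_weighted = 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_weighted += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, nothing to separate
        dark_mean = dark_weighted / dark_count
        light_mean = (weighted_total - dark_weighted) / light_count
        # Between-class variance (unnormalised): larger means a cleaner split.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(path):
    """Greyscale conversion plus Otsu thresholding with Pillow."""
    from PIL import Image

    grey = Image.open(path).convert("L")
    t = otsu_threshold(grey.histogram())
    # Map every pixel to pure black or pure white.
    return grey.point(lambda p: 255 if p > t else 0)


# Synthetic histogram: dark ink around level 40, paper around level 210.
histogram = [0] * 256
histogram[40] = 100
histogram[210] = 300
print(otsu_threshold(histogram))
```

The image returned by `binarize("invoice_scan.png")` can be passed straight to `pytesseract.image_to_string`. A single global threshold works best on evenly lit scans, while pages with shadows usually need a local (adaptive) method instead.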

Even with preprocessing, OCR should generally be treated as a fallback solution rather than the primary extraction strategy whenever machine-readable PDFs are available.

Building End-to-End Automation Pipelines

Single extraction scripts are useful for experimentation, but enterprise workflows usually require complete automation pipelines.

A production-ready document automation system may include file ingestion, document classification, extraction, transformation, validation, export, and archival stages.

Python works particularly well in these environments because it integrates cleanly with APIs, databases, cloud storage platforms, and workflow orchestration systems.

For example, an accounts payable workflow might automatically monitor an inbox for incoming invoices, extract tabular data from attached PDFs, validate totals, and push the cleaned records into an ERP platform without human intervention.

This type of automation can save organizations hundreds of hours of repetitive administrative work each month while improving consistency and reducing operational errors.

Many advanced systems also combine traditional extraction logic with AI models that automatically classify document types before routing them into specialized extraction pipelines.

Common Challenges in PDF Automation

PDF extraction becomes more difficult as workflows scale.

One major challenge is inconsistency. Documents generated from the same source system may still vary slightly in formatting, page layout, or spacing. Small formatting differences can break rigid extraction logic unexpectedly.

Accuracy validation is another critical issue. Extracted data should never be assumed correct automatically, especially in finance, healthcare, or compliance workflows where errors can create operational or regulatory risks.
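A validation rule can be as simple as checking that extracted line items add up to the extracted total. The following is a minimal sketch under that assumption. `validate_invoice` and its `tolerance` parameter are hypothetical, and amounts are expected as strings so that `Decimal` avoids floating-point rounding.

```python
from decimal import Decimal, InvalidOperation


def validate_invoice(line_amounts, stated_total, tolerance="0.01"):
    """Return a list of problems; an empty list means the record passed."""
    try:
        amounts = [Decimal(amount) for amount in line_amounts]
        total = Decimal(stated_total)
    except InvalidOperation:
        return ["non-numeric amount"]

    problems = []
    if not amounts:
        problems.append("no line items extracted")
    difference = abs(sum(amounts) - total)
    if difference > Decimal(tolerance):
        problems.append(f"line items differ from total by {difference}")
    return problems


print(validate_invoice(["1200.00", "-200.00"], "1000.00"))  # consistent record
print(validate_invoice(["1200.00", "200.00"], "1000.00"))   # sign lost in extraction
print(validate_invoice(["12OO.00"], "1200.00"))             # OCR read zeros as letters
```

Records that return a non-empty list are the ones to route to manual review instead of exporting them automatically.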

Performance can also become a bottleneck when processing large volumes of files. Sequential extraction may be sufficient for small workloads, but larger systems often require parallel processing and queue-based architectures.
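As a rough sketch of the parallel approach, the standard library's `concurrent.futures` can fan a batch of files out to a pool of workers. `process_all` and `fake_extract` are hypothetical names, and the extractor here is a stand-in rather than a real PDF call. A thread pool is used to keep the example simple. For CPU-bound parsing, `ProcessPoolExecutor` exposes the same `submit` interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(paths, extract, max_workers=4):
    """Run extract(path) for every path; collect results and failures separately."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as error:  # one bad file must not stop the batch
                failures[path] = repr(error)
    return results, failures


def fake_extract(path):
    # Stand-in for a real extractor such as a pdfplumber or Camelot call.
    if path.endswith("corrupt.pdf"):
        raise ValueError("unreadable file")
    return f"text from {path}"


results, failures = process_all(["a.pdf", "corrupt.pdf", "b.pdf"], fake_extract)
print(sorted(results), failures)
```

Keeping failures in their own dictionary means a corrupt file is recorded for follow-up while the rest of the batch still completes.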

Scanned PDFs introduce even more uncertainty because OCR engines are inherently probabilistic. Many organizations use human review systems for low-confidence extractions instead of relying entirely on automation.

The most reliable automation systems combine structured extraction logic, validation rules, and selective manual oversight.

Choosing the Right Python Libraries

Different libraries perform better depending on the structure and complexity of the documents being processed.

pdfplumber is excellent for lightweight text extraction and layout analysis. Camelot performs particularly well with clearly defined tables. PyMuPDF offers strong performance and lower-level PDF manipulation capabilities.

For OCR workflows, pytesseract remains one of the most popular open-source solutions because it integrates easily into Python pipelines.

There's rarely a single perfect tool for every document type. Experienced developers typically combine multiple libraries within the same workflow and dynamically choose extraction strategies based on document characteristics.

Testing against real production data is critical because sample documents rarely capture the inconsistencies found in live operational environments.

The Future of PDF Automation

Document automation is evolving rapidly as AI systems become better at understanding unstructured information.

Traditional rule-based extraction workflows still dominate most enterprise systems, but AI-assisted models are increasingly capable of interpreting layouts, identifying fields, and understanding relationships between document elements more accurately than older parsing techniques.

Python remains central to this ecosystem because of its flexibility and extensive machine learning tooling. You can combine PDF extraction libraries with AI frameworks to build systems that continuously improve as they process more documents.

As organizations continue digitizing operations, automated PDF extraction will become increasingly important across finance, legal, healthcare, logistics, and compliance industries.

Teams that invest in document automation early can reduce manual work, improve reporting accuracy, and unlock structured business data that would otherwise remain trapped inside static PDF files.

Hope you enjoyed this article. You can connect with me on LinkedIn.

How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases

Osomudeya Zudonu — Wed, 27 May 2026 15:51:44 +0000

Automation scripts often validate process completion instead of system health.

A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deployment can return clean while someone has manually changed infrastructure in the cloud console. A canary rollout can show zero errors while users wait five seconds for every request.

The problem isn't the tooling. The problem is that the system can look healthy when it really is not.

This handbook walks through five production-style automation scenarios using Bash and Python for:

Detecting abnormal AWS spend before the monthly invoice arrives
Correlating logs across multiple services using trace IDs
Finding infrastructure drift outside Terraform
Validating secret rotation at the application level
Automatically rolling back slow deployments before users complain

By the end of this handbook, you'll be able to build small scripts that help you notice when something is wrong in a system, even when the tools say everything is fine.

The scripts are intentionally small. The important part is the operational thinking behind them: what signal the script measures, what failure mode it can detect, and what assumptions the platform is making underneath.

Each use case includes a runnable demo environment, the complete script, a breakdown of the system behaviour involved, and an intentional failure you can trigger yourself.

If you're new to this workflow, start with use case 1 and work forward. The later sections build on the same pattern: automation is useful when it verifies reality, not just process completion.

Prerequisites

Before you start, set up the following:

Python 3.8 or higher – check with python3 --version
A Python virtual environment – create one before installing anything:

python3 -m venv venv
source venv/bin/activate

# on Windows:
venv\Scripts\activate

This keeps your installed packages isolated from your system Python and prevents permission errors on shared machines.

pip – Python's package installer, included with Python
AWS CLI configured with a working profile – a free-tier AWS account is enough for use cases 1, 3, and 4. Verify it's working with:
```
aws sts get-caller-identity
```
Docker and Docker Compose – needed for use cases 2, 4, and 5
Kind (Kubernetes in Docker) – a way to run Kubernetes locally for use cases 4 and 5. Install with brew install kind on macOS, or follow the Kind quick start guide
kubectl – the command-line tool for talking to a Kubernetes cluster. After installing Kind, run kind create cluster and kubectl is configured automatically
Helm – a package manager for Kubernetes, needed for use case 5. Install with brew install helm or the Helm install guide
Terraform – needed for use case 3. Install with brew install terraform on macOS or follow the Terraform install guide. Check with terraform version.
bc – a calculator utility used by the canary watch scripts for floating-point comparison. Install with brew install bc on macOS or apt install bc on Ubuntu. Run bc --version to confirm it is available before starting use case 5.

Knowledge and Skills

You should be comfortable reading Python and Bash scripts without needing to write them from scratch.
You should have basic Linux terminal comfort – navigating directories, running scripts, reading output, and so on.
You should know what Kubernetes pods and deployments are at a basic level – you don't need deep Kubernetes expertise, as use cases 4 and 5 will introduce the Kubernetes concepts they rely on as they go.
Familiarity with AWS basics, such as what EC2, IAM, and Secrets Manager are, will help with use cases 1, 3, and 4. Use case 2 runs entirely on your local machine and requires no AWS knowledge at all.
For use case 3, knowing what Terraform is and what a state file does will help. You don't need to write any Terraform, but understanding that Terraform tracks what it created is the foundation of the whole use case.

AWS IAM Permissions Required

The scripts in this article make real AWS API calls. Your IAM user or role needs the following minimum permissions. If you see an AccessDenied error, this is the first place to look:

Use Case	Required IAM Permission
1 - Cost Anomaly Detection	`ce:GetCostAndUsage`
3 - Drift Detection	`ec2:DescribeSecurityGroups`
4 - Secrets Rotation	`secretsmanager:GetSecretValue`, `secretsmanager:PutSecretValue`

If you're using a fresh AWS free-tier account with AdministratorAccess attached, these permissions are already included and you can skip this step.

If you're on a restricted IAM user, here's how to add them. In the AWS Console, go to IAM, click Users, then click your username. Under the Permissions tab, click Add permissions, then Create inline policy.

Switch to the JSON tab and paste a policy document granting the permissions in the table above, then save it.
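If you'd rather generate the policy document than type it by hand, this short sketch builds it from the four permissions in the table and prints JSON you can paste into the editor. The `Sid` value and the wildcard `Resource` are assumptions made to keep the example short. For Secrets Manager in particular, you may want to narrow the resource to the ARN of the secret you rotate.

```python
import json

# Minimum permissions for use cases 1, 3, and 4, taken from the table above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DevOpsLabsMinimum",  # assumed label, any alphanumeric Sid works
            "Effect": "Allow",
            "Action": [
                "ce:GetCostAndUsage",
                "ec2:DescribeSecurityGroups",
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
            ],
            "Resource": "*",  # assumption: broad on purpose for a throwaway lab account
        }
    ],
}

print(json.dumps(policy, indent=2))
```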

If your company manages AWS through an organization and you don't have permission to edit your own IAM policies, ask your administrator to add these permissions to your role.

Companion GitHub Repository

All demo projects live at: https://github.com/irvingtalks/devops-scripting-labs

Each use case has its own numbered folder with the complete script, supporting files, a setup.sh to prepare the environment, and a break_it.sh that injects the specific failure each use case is built around.

Clone the repo before starting:

git clone https://github.com/irvingtalks/devops-scripting-labs
cd devops-scripting-labs

Before running any use case, check that you have everything installed:

./preflight.sh

This checks for every tool the lab needs (Python, AWS CLI, Docker, Kind, Helm, Terraform, and bc) and tells you exactly what's missing, with the install command for each one.

Use Case 1 - Cost Anomaly Detection
Use Case 2 - Log Correlation Across Services
Use Case 3 - Infrastructure Drift Detection
Use Case 4 - Secrets Rotation with Zero Downtime
Use Case 5 - Automated Canary Rollback Trigger
What You Can Do Now

Use Case 1 - Cost Anomaly Detection

Environment: AWS Cost Explorer API (read-only, available in all accounts)
Language: Python

The Production Problem

A junior engineer is testing a Kubernetes configuration. They spin up a managed node group in AWS (a set of EC2 virtual machines that the Kubernetes cluster uses to run workloads) and configure the cluster autoscaler, which is the Kubernetes component responsible for adding more machines when the cluster needs more capacity. The test goes well, and on Friday afternoon, they forget to tear the environment down.

Over the weekend, the autoscaler keeps provisioning new nodes because the test workloads are still running and requesting resources. By Monday morning you have a node group that has been quietly growing for two and a half days, and nobody noticed until the invoice landed three weeks later.

The script in this use case exists because your AWS bill isn't just a monthly number. It's a time series, and you can monitor it the same way you monitor application metrics. Check it daily, know your baseline, and you catch this kind of event in hours instead of weeks.

What's Actually Happening at the System Level

What this is not: This isn't a finance dashboard. It's an operational anomaly detector and the signal it monitors is cost. But the thing it's actually detecting is unexpected infrastructure behavior such as resources left running, autoscaler events, and forgotten environments.

AWS Cost Explorer is a service that stores your billing data and exposes it through an API. When you call it, you're running a query against your account's billing records by specifying the time range, the granularity, and how you want results grouped.

One thing to know before you start investigating any flagged cost is that AWS decides which service category to put a charge under, not you. An EBS snapshot copy running across regions might appear under the EC2 line item rather than data transfer, which means a spike in EC2 spend doesn't necessarily mean something went wrong with your EC2 instances. The script flags the spike correctly, but investigating it means asking "what changed in my infrastructure on this date" rather than "what is running in EC2 right now."

The billing label is a starting point, not a diagnosis.

Set Up the Demo Environment

Navigate to 01-cost-anomaly/ in the companion repo. No cluster setup is needed for this use case because the script runs against your AWS account directly, and the only dependency is boto3:

cd 01-cost-anomaly
pip install boto3

Before running against your real account, make sure your AWS credentials are configured. The script uses whatever credentials the AWS CLI is set up with. If you haven't done this yet:

aws configure

This will ask for your AWS Access Key ID, Secret Access Key, default region (use us-east-1 if unsure), and output format (type json). You can find your access keys in the AWS Console under IAM → Users → your username → Security credentials → Create access key.

Your account also needs the ce:GetCostAndUsage permission. If you're on a fresh account with AdministratorAccess, that's already included.

If you have an AWS account with a few weeks of billing history, you can run the script directly against your real data:

python detect_cost_anomaly.py

Two things to know before running against a real account. First, Cost Explorer data has a 24-hour lag. This means spend from today won't appear until tomorrow, so the script automatically excludes the most recent day to avoid incomplete results.

Second, the script uses unblended costs, which is what you actually pay on a single-account setup. Blended costs are a weighted average used in multi-account organisations sharing reserved capacity and will give different numbers.

If you have a new account or prefer not to use real billing data, the script includes a --sample flag that uses built-in data and calls no AWS APIs at all.
Run this first to see what the output looks like before reading the code:

python detect_cost_anomaly.py --sample

The Script

#!/usr/bin/env python3
# detect_cost_anomaly.py — Use Case 1: Cost Anomaly Detection
# Full explanation of every function is in the article.

import statistics
import sys
from datetime import datetime, timedelta

import boto3

def build_sample_data(days=30):
    """Synthetic Cost Explorer rows for the last `days` (ending yesterday).

    The EC2 spike is placed on yesterday (device local date) so sample output
    always matches the same window as live Cost Explorer mode.
    """
    last_day = datetime.today().date() - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    anomaly_day_index = days - 1
    results = []
    for i in range(days):
        day = first_day + timedelta(days=i)
        d = i + 1
        results.append(
            {
                "TimePeriod": {
                    "Start": str(day),
                    "End": str(day + timedelta(days=1)),
                },
                "Groups": [
                    {
                        "Keys": ["Amazon EC2"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(
                                    round(
                                        18.50
                                        if i == anomaly_day_index
                                        else 1.10 + (d % 3) * 0.10,
                                        2,
                                    )
                                )
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon S3"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.04 + (d % 5) * 0.01, 2))
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon RDS"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.85 + (d % 4) * 0.05, 2))
                            }
                        },
                    },
                ],
            }
        )
    return results, str(last_day)


def get_daily_costs(days=30):
    ce = boto3.client("ce", region_name="us-east-1")
    # Cost Explorer treats End as exclusive, so the last day returned
    # is the day before `end`.
    end = datetime.today().date() - timedelta(days=1)
    start = end - timedelta(days=days)
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": str(start), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return response["ResultsByTime"]


def build_service_timeseries(results):
    services = {}
    for day in results:
        date_str = day["TimePeriod"]["Start"]
        for group in day["Groups"]:
            service = group["Keys"][0]
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if service not in services:
                services[service] = []
            services[service].append({"date": date_str, "cost": cost})
    return services


def detect_anomalies(services, baseline_days=7, multiplier=2.0, recent_days=None):
    """Flag days where cost exceeds prior `baseline_days` average + 2σ.

    Uses a rolling baseline (each day vs the previous week). If `recent_days`
    is set, only returns anomalies on or after today - recent_days.
    """
    cutoff = None
    if recent_days is not None:
        cutoff = datetime.today().date() - timedelta(days=recent_days)

    anomalies = []
    for service, daily in services.items():
        if len(daily) < baseline_days + 1:
            continue
        for i in range(baseline_days, len(daily)):
            day = daily[i]
            day_date = datetime.strptime(day["date"], "%Y-%m-%d").date()
            if cutoff is not None and day_date < cutoff:
                continue
            baseline_costs = [d["cost"] for d in daily[i - baseline_days : i]]
            avg = statistics.mean(baseline_costs)
            if avg < 0.01:
                continue
            try:
                std = statistics.stdev(baseline_costs)
            except statistics.StatisticsError:
                continue
            threshold = avg + (multiplier * std)
            if day["cost"] > threshold:
                anomalies.append(
                    {
                        "service": service,
                        "date": day["date"],
                        "actual": round(day["cost"], 4),
                        "baseline_avg": round(avg, 4),
                        "threshold": round(threshold, 4),
                        "pct_above": round(((day["cost"] - avg) / avg) * 100, 1),
                    }
                )
    return sorted(anomalies, key=lambda x: x["date"])


def parse_args(argv):
    use_sample = "--sample" in argv
    recent_days = None
    for arg in argv[1:]:
        if arg.startswith("--recent-days="):
            recent_days = int(arg.split("=", 1)[1])
    return use_sample, recent_days


def run(use_sample=False, recent_days=None):
    if use_sample:
        results, anomaly_date = build_sample_data()
        print("Running against sample data (--sample mode).")
        print(
            f"This data represents 30 days of billing ending yesterday, "
            f"with a realistic EC2 anomaly on {anomaly_date}.\n"
        )
    else:
        print("Fetching 30 days of daily AWS costs by service...")
        print("Note: today is excluded — Cost Explorer has a 24-hour billing lag.\n")
        results = get_daily_costs(days=30)

    if recent_days is not None:
        since = datetime.today().date() - timedelta(days=recent_days)
        print(
            f"Checking for spikes in the last {recent_days} days only "
            f"(on or after {since}), each vs its prior 7-day average.\n"
        )

    services = build_service_timeseries(results)
    anomalies = detect_anomalies(services, recent_days=recent_days)

    if not anomalies:
        print("No anomalies detected.")
        print("\nNote: this script flags statistical outliers against your own baseline.")
        print("A consistently elevated spend level will not trigger — only sudden increases.")
        return

    print(f"{'=' * 60}")
    print(f"ANOMALIES DETECTED: {len(anomalies)}")
    print(f"{'=' * 60}\n")

    for a in anomalies:
        print(f"Service:      {a['service']}")
        print(f"Date:         {a['date']}")
        print(f"Actual cost:  ${a['actual']}")
        print(f"Baseline avg: ${a['baseline_avg']} (prior 7-day average)")
        print(f"Threshold:    ${a['threshold']}")
        print(f"Overage:      {a['pct_above']}% above baseline")
        print()

    print("=" * 60)
    print("A note on AWS cost attribution:")
    print("The service label in Cost Explorer is assigned by AWS, not by the resource")
    print("that caused the cost. An EC2 spike may be caused by EBS snapshot copies,")
    print("cross-region data transfer, or autoscaling events that AWS categorizes under")
    print("EC2 in billing — not a running EC2 instance you can find in the console.")
    print()
    print("Before investigating the flagged service directly, ask:")
    print("What changed in my infrastructure on or before the flagged date?")
    print("Work backward from the operational change, not forward from the billing label.")


if __name__ == "__main__":
    use_sample, recent_days = parse_args(sys.argv)
    run(use_sample=use_sample, recent_days=recent_days)

How the Script Works

get_daily_costs pulls your AWS billing data for the last 30 days.

build_service_timeseries takes the raw data from AWS and reorganises it. AWS groups the data by day first, then by service. This function flips that around so each service has its own list of daily costs, which is what the detection step needs to work with.

detect_anomalies is where the actual check happens. For each service, it compares each day's spend to the 7 days right before it. If a day costs more than that week's average plus two standard deviations, the script flags it. That's all it does.
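To see the rule on concrete numbers, here is a small worked example that applies the same average-plus-two-standard-deviations check as detect_anomalies to a single day. `is_anomaly` is a hypothetical helper written for this illustration and is not part of the script. The baseline values are invented.

```python
import statistics


def is_anomaly(baseline_costs, todays_cost, multiplier=2.0):
    """Apply the detect_anomalies rule to one day; return (flagged, threshold)."""
    avg = statistics.mean(baseline_costs)
    std = statistics.stdev(baseline_costs)  # sample standard deviation
    threshold = avg + multiplier * std
    return todays_cost > threshold, threshold


# Seven prior days of spend, then two candidate values for the next day.
baseline = [1.30, 1.10, 1.20, 1.30, 1.10, 1.20, 1.30]
print(is_anomaly(baseline, 1.35))   # small rise, within normal variation
print(is_anomaly(baseline, 18.50))  # far above the threshold
```

One consequence of this rule is worth knowing. A perfectly flat baseline has a standard deviation of zero, so the threshold equals the average and any increase at all is flagged. Services with very steady spend are therefore the most sensitive.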

--recent-days=7 means "only show me anomalies from the last 7 days." The script still fetches 30 days of data because it needs that history to calculate the comparison, but the results are filtered to the window you care about. This is good for a quick Monday morning check.

--sample runs without touching your AWS account at all. It uses built-in fake billing data with a spike baked into yesterday's date so the detection always fires. Use this first to see what the output looks like before connecting it to real data.

What the Output Looks Like

Running --sample (the spike date will show as yesterday's actual date, not a fixed value):

Running against sample data (--sample mode).
30 days of billing ending yesterday, with an EC2 spike on 2026-05-14.

============================================================
ANOMALIES DETECTED: 1
============================================================

Service:      Amazon EC2
Date:         2026-05-14
Actual cost:  $18.5
Baseline avg: $1.2143 (prior 7-day average)
Threshold:    $1.3939
Overage:      1423.4% above baseline

============================================================
A note on AWS cost attribution:
The service label in Cost Explorer is assigned by AWS, not by the resource
that caused the cost. An EC2 spike may be caused by EBS snapshot copies,
cross-region data transfer, or autoscaling events that AWS categorizes under
EC2 in billing - not a running EC2 instance you can find in the console.

Before investigating the flagged service directly, ask:
What changed in my infrastructure on or before the flagged date?
Work backward from the operational change, not forward from the billing label.

Your numbers will differ slightly from the above because the sample data generates dates from today dynamically. The spike always shows up on yesterday and the surrounding baseline numbers shift depending on the day you run it.

The Decision the Script Can't Make for You

The anomaly is on the EC2 line, and the instinct is to go look at running EC2 instances. But as the output warns, the attribution is AWS's choice, not yours.

Before opening the EC2 console, check your deployment history for that date. What was deployed? Was a new environment created? Did an autoscaler event run? Start from the operational change and follow the thread to the billing data, because starting from the billing label and working backward is slower and frequently misleading.

Break it On Purpose

# See the spike immediately with no AWS account needed
python detect_cost_anomaly.py --sample

# Run against your real account
python detect_cost_anomaly.py

# Only show anomalies from the last 7 days, good for a quick this-week check
python detect_cost_anomaly.py --recent-days=7

# Combine both flags - sample data filtered to the last 7 days
python detect_cost_anomaly.py --sample --recent-days=7

If your real account returns "No anomalies detected" that's not a failure. It means your spend has been consistent. A clean account returns clean output. The script is doing exactly what it should.

When a real event happens on your account, such as an autoscaler left running, a forgotten environment, or an unexpected data transfer, this is what catches it before the invoice does.

Use Case 2 – Log Correlation Across Services

Environment: Fully local – Docker Compose, three Python services
Language: Python

The Production Problem

A user reports that their payment failed. You open your logging tool and search. The auth service logged a successful authentication. The ledger service logged a successful transaction. But the notification service, which should have sent a payment confirmation email, has logged nothing at all.

Two services reported success while one service stayed silent. The payment still failed, and you have three logs and no clear answer about where the chain broke.

What's Actually Happening at the System Level

What this is not: This isn't a guide to installing a log aggregation tool. It's about the data structure that makes log correlation possible in the first place and what happens when that structure breaks on one service's error path.

In a system with a single service, debugging is simple: one service, one log file, one timeline. But when a user request passes through multiple services, you need a way to link all the logs together. That link is called a trace ID.

Think of it like a ticket number at a government office. When you walk in, you get a number, say, A247. Every desk that handles your case writes A247 on your file. If something goes wrong, the manager pulls every record with A247 and sees exactly what happened, in order, across every desk. That is a trace ID. One number, shared across every service that touched the request.

In the demo, when a payment comes in, the auth service creates a unique ID for it. Every log line that auth, ledger, and notification write for that payment includes the same ID. When something breaks, you run correlate.py with that ID and it finds every related log line across all three services and sorts them by time:

python correlate.py pay-abc123

Here's what those logs look like. Notice that every line has the same trace_id:

{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "service": "auth", "event": "user_authenticated", "level": "INFO", "user_id": "u_789", "duration_ms": 12}
{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "service": "ledger", "event": "transaction_recorded", "level": "INFO", "amount": 50.0, "currency": "USD"}
{"timestamp": "2026-05-01T14:23:02.103Z", "trace_id": "pay-abc123", "service": "notification", "event": "email_queued", "level": "INFO", "recipient": "user@example.com"}

Now here's what breaks it. The notification service hits a timeout connecting to the email provider. The developer who wrote the error handler forgot to include the trace ID, so instead of a proper log line, it writes this:

2026-05-01T14:23:02.415Z ERROR Connection timeout to email provider smtp.example.com:587

The error happened and the log line exists. But because it has no trace_id, correlate.py can't find it.

The notification service still appears in the timeline, and you can see email_send_attempt, but email_queued never follows it:

Timeline — 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete

The attempt is there but the failure is not. The developer just forgot one field.

Set Up the Demo Environment

Navigate to 02-log-correlation/ and start the three services:

cd 02-log-correlation
docker compose up -d

This starts the auth, ledger, and notification services. Trigger a payment request to generate some logs:

./trigger_request.sh

The script prints the trace ID it used. Copy the ID and run the correlation script against it now, before we break anything, to see the full working path:

python correlate.py pay-5831e1bf

You should see something like this (your trace ID will be different but the structure is the same):

Loading logs from ./logs/...
Loaded 6 structured log lines.

============================================================
Trace ID: pay-5831e1bf
============================================================

Timeline - 6 events across 3 service(s):

  [2026-05-15T21:42:28.079046+00:00] [AUTH] [INFO] payment_request_received
    service: auth
    user_id: u_789
    amount: 50.0
  [2026-05-15T21:42:28.080718+00:00] [AUTH] [INFO] user_authenticated
    service: auth
    user_id: u_789
    duration_ms: 12
  [2026-05-15T21:42:28.145528+00:00] [LEDGER] [INFO] transaction_recorded
    service: ledger
    user_id: u_789
    amount: 50.0
    currency: USD
  [2026-05-15T21:42:28.210088+00:00] [NOTIFICATION] [INFO] email_send_attempt
    service: notification
    recipient: user@example.com
  [2026-05-15T21:42:28.347893+00:00] [NOTIFICATION] [INFO] email_queued
    service: notification
    recipient: user@example.com
    amount: 50.0
  [2026-05-15T21:42:28.378402+00:00] [AUTH] [INFO] payment_complete
    service: auth
    user_id: u_789
    amount: 50.0

That's the full payment journey through auth, ledger, and notification, in the exact order it happened. Now let's look at how the script works.

The Script

# correlate.py
import json
import os
import sys

SERVICES = ["auth", "ledger", "notification"]
LOG_DIR = "./logs"


def load_logs(log_dir):
    """
    Read each service's log file and parse every line as JSON.
    Lines that fail JSON parsing are printed as warnings.
    They are not silently dropped - a plain-text error line in a service
    that should emit structured logs is itself evidence worth seeing.
    """
    all_lines = []

    for service in SERVICES:
        log_file = os.path.join(log_dir, f"{service}.log")

        if not os.path.exists(log_file):
            print(f"  WARNING: No log file for '{service}' at {log_file}")
            continue

        with open(log_file) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    parsed = json.loads(line)
                    parsed["_source"] = service
                    all_lines.append(parsed)
                except json.JSONDecodeError:
                    # This line exists in the log but cannot be correlated.
                    print(f"  WARNING: {service}.log line {line_num} is not structured JSON:")
                    print(f"           {line[:100]}")
                    print(f"           This line will NOT appear in any trace-based search.")

    return all_lines


def correlate(trace_id, all_lines):
    """
    Find every log line with this trace_id and sort by timestamp.
    The sorted result is the reconstructed timeline of the request.
    """
    matched = [line for line in all_lines if line.get("trace_id") == trace_id]
    matched.sort(key=lambda x: x.get("timestamp", ""))
    return matched


def find_missing_services(matched):
    """
    Check which services produced zero trace-tagged lines for this request.
    A missing service is not just an absence - it is a signal.
    Either the request never reached that service, or an error path swallowed
    the trace ID. Both are worth investigating.
    """
    services_seen = {line["_source"] for line in matched}
    return [s for s in SERVICES if s not in services_seen]


def print_timeline(trace_id, matched, missing):
    print(f"\n{'=' * 60}")
    print(f"Trace ID: {trace_id}")
    print(f"{'=' * 60}")

    if not matched:
        print("\nNo structured log lines found with this trace ID.")
        print("Either the trace ID is wrong, or no service emitted")
        print("a structured log line for this request.")
        return

    services_count = len({line["_source"] for line in matched})
    print(f"\nTimeline - {len(matched)} events across {services_count} service(s):\n")

    for line in matched:
        ts = line.get("timestamp", "unknown")
        service = line.get("_source", "unknown").upper()
        event = line.get("event", "unknown event")
        level = line.get("level", "INFO")
        extras = {k: v for k, v in line.items()
                  if k not in ("timestamp", "trace_id", "event", "level", "_source")}

        print(f"  [{ts}] [{service}] [{level}] {event}")
        for k, v in extras.items():
            print(f"    {k}: {v}")

    if missing:
        print(f"\n{'=' * 60}")
        print("MISSING TELEMETRY")
        print(f"{'=' * 60}")
        print(f"These services produced no trace-tagged events for trace {trace_id}:\n")
        for s in missing:
            print(f"  - {s}")
        print()
        print("This means one of three things:")
        print("  1. The request never reached this service.")
        print("  2. The service received it but an error path swallowed the trace ID,")
        print("     leaving a plain-text log line that trace correlation cannot find.")
        print("  3. This service's log file was not included in this run.")
        print()
        print("Check the raw log file for a plain-text error line around the same timestamp.")
        print("If one exists, that is your root cause - and a structured logging gap to fix.")


def run(trace_id):
    print(f"Loading logs from {LOG_DIR}/...")
    all_lines = load_logs(LOG_DIR)
    print(f"Loaded {len(all_lines)} structured log lines.\n")

    matched = correlate(trace_id, all_lines)
    missing = find_missing_services(matched)
    print_timeline(trace_id, matched, missing)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python correlate.py <trace_id>")
        print("Example: python correlate.py pay-abc123")
        sys.exit(1)
    run(sys.argv[1])

How the Script Works

load_logs reads each service's log file and expects every line to be JSON. If a line isn't JSON, it prints a warning. That usually means an error path wrote a plain-text line without a trace ID, so the line can't be correlated.

correlate finds all logs that match the given trace ID and sorts them by time. This rebuilds the full request flow across services.

find_missing_services checks which services have no logs for that trace ID. This tells you where the request stopped or where the trace ID was lost.

print_timeline displays the full request timeline in order. It also shows which services are missing if something didn't log correctly.
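One caveat about correlate: it sorts timestamps as strings. That works when every service writes the same ISO 8601 format and offset, but it can misorder events if one service logs with a trailing Z or a different UTC offset. Here is a minimal sketch of a more defensive sort, assuming every line carries a timezone-aware timestamp. parse_ts and sort_by_time are hypothetical helpers, not part of correlate.py:

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat only accepts 'Z' from Python 3.11 onwards,
    # so rewrite it as an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def sort_by_time(lines):
    """Sort log lines by the instant they describe, not by their text."""
    return sorted(lines, key=lambda line: parse_ts(line["timestamp"]))


# Illustrative data: the ledger line uses a +02:00 offset on purpose.
lines = [
    {"timestamp": "2026-05-01T14:23:02.103Z", "event": "email_queued"},
    {"timestamp": "2026-05-01T16:23:01.891+02:00", "event": "transaction_recorded"},
    {"timestamp": "2026-05-01T14:23:01.234Z", "event": "user_authenticated"},
]

ordered = [line["event"] for line in sort_by_time(lines)]
print(ordered)
# → ['user_authenticated', 'transaction_recorded', 'email_queued']
```

A plain string sort would put transaction_recorded last here, because "16:23" sorts after "14:23" even though both describe the same hour in UTC.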

One thing worth knowing for when you use this in a real Kubernetes environment: kubectl logs only shows the currently running container. If a pod restarts, you can use this:

kubectl logs <pod-name> --previous

But this only works for the last restart. Older logs are gone unless you use a logging system like Loki or CloudWatch.

What the Output Looks Like After Breaking it

The point of this section is to show you what happens when a service fails silently: the error exists in the logs, but the script can't find it because the developer forgot one field.

break_it.sh forces the notification service to fail when it tries to send an email, and because the error handler was written without a trace ID, the failure gets logged as plain text with no way to tie it back to the original request.

Run it:

./break_it.sh

Then trigger a new request:

./trigger_request.sh

Copy the trace ID it prints, then correlate it:

python correlate.py pay-xxxxxxxx

Here is what you'll see:

Loading logs from ./logs/...
  WARNING: notification.log line 10 is not structured JSON:
           2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email
           provider http://mock-email:80/ after 0.001s - failed to send
           confirmation to user@example.com
           This line will NOT appear in any trace-based search.
Loaded 29 structured log lines.

============================================================
Trace ID: pay-6cf69a8c
============================================================

Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete

Look at this carefully. The notification service is in the timeline, and it logged email_send_attempt. But email_queued is missing, which means the email was never actually sent, and the error that explains why isn't in the timeline at all. It's hiding in the WARNING at the very top, where the script told you it found a line it couldn't parse.

That's the problem: the attempt is visible, but the failure is invisible.

Run cat logs/notification.log and scroll to the bottom:

{"timestamp": "2026-05-15T21:59:00.630313+00:00", "trace_id": "pay-6cf69a8c",
 "service": "notification", "event": "email_send_attempt", ...}
2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email provider
http://mock-email:80/ after 0.001s - failed to send confirmation to user@example.com

Two lines to note: the first has a trace ID, which the script found and showed in the timeline. The second doesn't – the script flagged it as a warning and skipped it. The error happened about 0.051 seconds after the attempt. The log file has both lines. The timeline only has one.

That is what "invisible failure" looks like in production. The payment went through. The confirmation email was never sent. The error is sitting right there in the log file (Connection timeout to email provider ... after 0.001s), but in the correlation output above, the timeline shows email_send_attempt and then jumps straight to payment_complete with nothing in between: no error and no failure event. It looks like everything worked.
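One way to make that gap harder to miss is to check whether every "attempt" event has some follow-up event in the same trace. The sketch below is not part of correlate.py. EXPECTED_FOLLOWUPS and find_unfinished_attempts are hypothetical names, and the event pairs are assumptions based on the event names in the demo output:

```python
# Hypothetical mapping: an "attempt" event and the events that may legitimately follow it.
EXPECTED_FOLLOWUPS = {
    "email_send_attempt": {"email_queued", "email_timeout"},
}


def find_unfinished_attempts(matched, expected=EXPECTED_FOLLOWUPS):
    """Return attempt events that appear in the trace with no known follow-up.

    This only checks presence, not ordering, so it is a coarse signal.
    """
    events_seen = {line.get("event") for line in matched}
    return [
        attempt
        for attempt, followups in expected.items()
        if attempt in events_seen and not (followups & events_seen)
    ]


# The five events from the broken run shown above.
broken_trace = [
    {"event": "payment_request_received"},
    {"event": "user_authenticated"},
    {"event": "transaction_recorded"},
    {"event": "email_send_attempt"},
    {"event": "payment_complete"},
]

print(find_unfinished_attempts(broken_trace))
# → ['email_send_attempt']
```

A non-empty result doesn't tell you why the follow-up is missing. It only tells you to go and read the raw log file for that service.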

The fix is in 02-log-correlation/services/notification/main.py. Here's the broken error handler:

except httpx.TimeoutException:
    emit_plain(f"Connection timeout to email provider {EMAIL_PROVIDER_URL}")
    return {"status": "ok"}

And here's the fixed version. The only change is passing req.trace_id into emit instead of calling emit_plain:

except httpx.TimeoutException:
    emit(req.trace_id, "email_timeout", level="ERROR",
         provider=EMAIL_PROVIDER_URL)
    return {"status": "ok"}

Once that change is made, the timeout error shows up in the timeline like everything else:

  [2026-05-15T21:59:00.681583+00:00] [NOTIFICATION] [ERROR] email_timeout
    provider: http://mock-email:80/

One command, one trace ID, the full picture.
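The demo's emit and emit_plain helpers aren't shown in this article, so here is a minimal sketch of what such helpers might look like, to make the difference concrete. The signatures and the hardcoded service name are assumptions, not the companion repo's actual code:

```python
import json
import sys
from datetime import datetime, timezone


def emit(trace_id, event, level="INFO", stream=sys.stdout, **fields):
    """Write one structured JSON log line that always carries the trace ID."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": "notification",  # assumed: each service would set its own name
        "event": event,
        "level": level,
        **fields,
    }
    line = json.dumps(record)
    print(line, file=stream)
    return line


def emit_plain(message, stream=sys.stdout):
    """Write a plain-text error line. Nothing in it can be matched to a trace."""
    line = f"{datetime.now(timezone.utc).isoformat()} ERROR {message}"
    print(line, file=stream)
    return line


structured = emit("pay-abc123", "email_timeout", level="ERROR",
                  provider="http://mock-email:80/")
plain = emit_plain("Connection timeout to email provider http://mock-email:80/")
```

The structured line parses with json.loads and can be filtered by trace_id. The plain line raises json.JSONDecodeError, which is exactly the path that makes load_logs print its warning.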

The Decision the Script Can't Make For You

The correlation output points to notification as the gap: email_send_attempt is there, but email_queued never follows it. When you check the raw notification.log, you find the plain-text timeout error. So you know the request reached the service, authentication and transaction recording both succeeded, and the email failed.

Whether a notification failure is a payment failure depends entirely on how your system was designed. If notification is a soft dependency, this error shouldn't have surfaced to the user as a payment failure, and something else in your system design is wrong. If it's a hard dependency, the transaction itself should have rolled back. The script found where things broke, but the right response depends on the design.

Break it On Purpose

Run ./break_it.sh – this switches the notification service to a mode where its error handler drops the trace ID
Run ./trigger_request.sh to generate a new payment request and get a new trace ID
Run python correlate.py <trace_id> with the new trace ID – email_queued will be missing from the timeline, and the timeout shows up only as a WARNING
Run cat logs/notification.log – the timeout error is right there, without a trace ID, invisible to the script

Use Case 3 - Infrastructure Drift Detection

Environment: AWS free tier (one security group) + Terraform
Language: Python

The Production Problem

Your Terraform plan shows no changes. Your deployment is behaving differently than it did yesterday, and when you ask around, someone eventually remembers: a colleague made a quick manual change to a security group in the AWS console last week to unblock a staging test. They meant to go back and apply it through Terraform but they forgot.

Your Terraform state file and your actual AWS infrastructure have been quietly disagreeing ever since. Nothing broke loudly and no alert fired. Terraform wouldn't even know unless someone ran terraform plan to check, and in this scenario, nobody did.

This is called infrastructure drift, and it's far more common than most teams want to admit.

What's Actually Happening at the System Level

What this is not: This isn't the same as running terraform plan. A plan shows you what Terraform would change. This script shows you what has already changed in AWS without Terraform knowing.

The script itself doesn't run any Terraform commands. It reads the state file Terraform already produced. In the demo, Terraform creates that file. In a real environment, it already exists from your normal workflow.

Think of Terraform's state file as a receipt. When Terraform creates a security group, it writes down exactly what it created, the rules, the ports, the CIDRs. That receipt is the state file.

The script compares that receipt against what AWS actually has right now. If someone went into the AWS console and added a rule that isn't on the receipt, the script flags it as drift.

The blind spot is this: if someone creates a completely new security group in the console and never uses Terraform at all, there's no receipt for it. The script can't compare something it has never seen. It returns clean, and that group sits in your account undetected.

The demo shows both. First you break a known resource. Then the --invisible scenario creates a new one outside Terraform entirely, and the script returns clean even though your account now has an extra security group.
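Closing that blind spot means starting from AWS instead of from the state file: list every security group in the account and subtract the ones Terraform knows about. The sketch below is not part of detect_drift.py, and find_untracked_groups and list_live_group_ids are hypothetical helpers. It assumes your credentials allow ec2:DescribeSecurityGroups and that one region is enough:

```python
def find_untracked_groups(state_ids, live_ids):
    """Return security group IDs that exist in AWS but not in the state file."""
    return sorted(set(live_ids) - set(state_ids))


def list_live_group_ids():
    """Fetch every security group ID in the configured region."""
    # Imported lazily so the pure comparison above has no AWS dependency.
    import boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_security_groups")
    ids = []
    for page in paginator.paginate():
        ids.extend(sg["GroupId"] for sg in page["SecurityGroups"])
    return ids


# Illustrative IDs only: one tracked group and one created by hand.
tracked = ["sg-0a1b2c3d4e5f6a7b8"]
live = ["sg-0a1b2c3d4e5f6a7b8", "sg-0ffffffffffffffff"]
print(find_untracked_groups(tracked, live))
# → ['sg-0ffffffffffffffff']
```

Expect some noise from a check like this. Every VPC has a default security group, and other tools or other Terraform workspaces may own groups too, so an untracked ID is a prompt to investigate, not proof of a problem.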

Set Up the Demo Environment

Navigate to 03-drift-detection/ in the companion repo:

cd 03-drift-detection
pip install -r requirements.txt

Run setup. This uses real Terraform, not a mock:

./setup.sh

This runs terraform init and terraform apply, which creates a real AWS security group.

It also writes a genuine terraform.tfstate file. Open it in any text editor if you want to see what Terraform actually produces. It's JSON, it's readable, and it's the real thing.

Once setup completes, run the script:

python detect_drift.py terraform.tfstate

You should see something like this, but your actual security group ID will be different:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

The lab is alive and both sides of the contract match. Now let's look at what the script is doing.

The Script (Code Files)

# detect_drift.py
import boto3
import json
import sys


def load_tfstate(path):
    """
    The Terraform state file is plain JSON - open it in any text editor
    and you will see a 'resources' array listing everything Terraform knows about.
    This function reads that file and returns the parsed contents.
    """
    with open(path) as f:
        return json.load(f)


def get_security_groups_from_state(tfstate):
    """
    Walk through the resources array and collect every security group entry.
    Each resource has a 'type', a 'name', and an 'instances' array holding
    the attribute values Terraform recorded when it last ran.
    We extract the resource ID and the ingress (inbound) rules.
    """
    resources = {}
    for resource in tfstate.get("resources", []):
        if resource["type"] == "aws_security_group":
            for instance in resource.get("instances", []):
                sg_id = instance["attributes"]["id"]
                resources[sg_id] = {
                    "ingress": instance["attributes"].get("ingress", [])
                }
    return resources


def get_security_group_from_aws(sg_id):
    """
    Call the AWS EC2 API to fetch the live current state of this security group.
    Under the hood, boto3 constructs an authenticated HTTPS request, signs it with
    your AWS credentials, sends it to the EC2 API endpoint in your configured region,
    and parses the response. The response contains far more data than we need -
    we extract only the inbound rules.
    """
    ec2 = boto3.client("ec2")
    response = ec2.describe_security_groups(GroupIds=[sg_id])
    sg = response["SecurityGroups"][0]
    return {"ingress": sg.get("IpPermissions", [])}


def normalize_state_rules(rules):
    """
    Terraform stores ingress rules in its own format.
    We normalize them into a set of tuples for easy comparison.
    Each tuple is: (from_port, to_port, protocol, cidr_block)
    """
    normalized = set()
    for rule in rules:
        for cidr in rule.get("cidr_blocks", []):
            normalized.add((
                rule.get("from_port", 0),
                rule.get("to_port", 0),
                rule.get("protocol", "-1"),
                cidr
            ))
    return normalized


def normalize_aws_rules(rules):
    """
    AWS returns ingress rules in a different format from Terraform's.
    We normalize them into the same tuple shape so the comparison works.
    """
    normalized = set()
    for rule in rules:
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 0)
        protocol = rule.get("IpProtocol", "-1")
        for ip_range in rule.get("IpRanges", []):
            normalized.add((from_port, to_port, protocol, ip_range["CidrIp"]))
    return normalized


def detect_drift(tfstate_path):
    print(f"Loading Terraform state from: {tfstate_path}")
    tfstate = load_tfstate(tfstate_path)
    state_sgs = get_security_groups_from_state(tfstate)

    if not state_sgs:
        print("No security groups found in state file. Nothing to compare.")
        return

    drift_found = False

    for sg_id, state_data in state_sgs.items():
        print(f"\nChecking: {sg_id}")

        try:
            aws_data = get_security_group_from_aws(sg_id)
        except Exception as e:
            print(f"  ERROR: Could not fetch {sg_id} from AWS - {e}")
            print(f"  Check your IAM permissions: ec2:DescribeSecurityGroups is required.")
            continue

        state_rules = normalize_state_rules(state_data["ingress"])
        aws_rules = normalize_aws_rules(aws_data["ingress"])

        # Rules in AWS that Terraform does not know about (manual additions)
        added_in_aws = aws_rules - state_rules
        # Rules Terraform expects that no longer exist in AWS (manual deletions)
        removed_from_aws = state_rules - aws_rules

        if added_in_aws:
            drift_found = True
            print("  DRIFT - Rules present in AWS but missing from state file:")
            for rule in added_in_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if removed_from_aws:
            drift_found = True
            print("  DRIFT - Rules in state file but removed from AWS:")
            for rule in removed_from_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if not added_in_aws and not removed_from_aws:
            print("  OK - No drift detected.")

    print("\n" + "=" * 60)
    if drift_found:
        print("Drift detected. See above for details.")
    else:
        print("No drift detected in monitored resources.")

    print("\nIMPORTANT: This script only checks resources tracked in your state file.")
    print("Resources created manually in AWS without Terraform are invisible to this check.")
    print("A clean output here does not mean your AWS account is clean - it means")
    print("the resources you are watching match what Terraform last recorded.")


if __name__ == "__main__":
    tfstate_path = sys.argv[1] if len(sys.argv) > 1 else "terraform.tfstate"
    detect_drift(tfstate_path)

How the Script Works

load_tfstate opens terraform.tfstate and reads it. Run cat terraform.tfstate after setup and you'll see that it's just a text file and everything Terraform knows about your infrastructure is stored in there.

get_security_groups_from_state pulls every security group out of that file: the ID AWS assigned it and the inbound rules Terraform last recorded. These are the expected values.

get_security_group_from_aws calls the AWS API and fetches the same security group's current inbound rules. These are the actual values. The script now has two versions of the same thing.

normalize_state_rules and normalize_aws_rules exist because Terraform and AWS store the same rule in slightly different formats. These two functions convert both into the same format so the comparison works.

The comparison is the last step. Rules in AWS but not in the state file were added manually. Rules in the state file but not in AWS were deleted manually. The script prints both.
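To see the normalization and comparison on concrete data, here is a small worked example. The two normalize functions are compact copies of the ones in detect_drift.py. The rule dictionaries are hand-written to resemble what Terraform state and the EC2 API return, trimmed to the fields the script reads, so treat the shapes as illustrative:

```python
def normalize_state_rules(rules):
    """Terraform shape: one rule holds a list of cidr_blocks."""
    return {
        (r.get("from_port", 0), r.get("to_port", 0), r.get("protocol", "-1"), cidr)
        for r in rules
        for cidr in r.get("cidr_blocks", [])
    }


def normalize_aws_rules(rules):
    """EC2 API shape: one rule holds a list of IpRanges dictionaries."""
    return {
        (r.get("FromPort", 0), r.get("ToPort", 0), r.get("IpProtocol", "-1"), ip["CidrIp"])
        for r in rules
        for ip in r.get("IpRanges", [])
    }


# What Terraform recorded: HTTPS only.
state_ingress = [
    {"from_port": 443, "to_port": 443, "protocol": "tcp", "cidr_blocks": ["0.0.0.0/0"]},
]

# What AWS has now: HTTPS plus a manually added SSH rule.
aws_ingress = [
    {"FromPort": 443, "ToPort": 443, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"FromPort": 22, "ToPort": 22, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

state_rules = normalize_state_rules(state_ingress)
aws_rules = normalize_aws_rules(aws_ingress)

added_in_aws = aws_rules - state_rules
removed_from_aws = state_rules - aws_rules

print(added_in_aws)      # → {(22, 22, 'tcp', '0.0.0.0/0')}
print(removed_from_aws)  # → set()
```

Because both sides become sets of plain tuples, the drift check reduces to two set differences. Note that only IPv4 CIDR rules are compared: rules that reference other security groups or IPv6 ranges never enter either set.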

What the Output Looks Like

A clean run with no drift:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

============================================================
No drift detected in monitored resources.

IMPORTANT: This script only checks resources tracked in your state file.
Resources created manually in AWS without Terraform are invisible to this check.
A clean output here does not mean your AWS account is clean - it means
the resources you are watching match what Terraform last recorded.

After injecting drift:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  DRIFT - Rules present in AWS but missing from state file:
    Port 22-22 | Protocol: tcp | CIDR: 0.0.0.0/0

============================================================
Drift detected. See above for details.

The Decision the Script Can't Make For You

The script finds drift, an inbound rule that Terraform doesn't know about. The instinct is to revert it immediately by running terraform apply, but before doing that, ask one question: was this change an emergency hotfix? Someone may have manually opened a port at 2am to restore a broken service while a proper fix was being prepared. And if you revert it automatically, you might undo something that was deliberately placed there to keep a service running.

Drift detection tells you that things are different. It doesn't tell you which version is correct, and investigating that is the work that comes after the script runs.

Break it On Purpose

Run ./break_it.sh. This adds an SSH inbound rule (port 22) directly via the AWS CLI, simulating a manual console change.
Run python detect_drift.py terraform.tfstate. The drift appears in the output.
Run ./break_it.sh --invisible to create a brand new security group that's not in the state file at all, then run the script again. It returns clean even though a new resource exists in your account, making the coverage gap visible.
When you're finished, run ./teardown.sh. This runs terraform destroy to delete the security group and clean up all AWS resources. No charges will remain after this.

Use Case 4 - Secrets Rotation with Zero Downtime

Environment: AWS Secrets Manager + local Kind cluster
Language: Python

The Production Problem

The goal of this use case: Kubernetes says a pod is healthy, but your users are getting database errors. The script catches that gap before the users are affected by running one extra check that Kubernetes never runs.

You rotate your database credentials. The pod restarts. kubectl get pods shows Running. Ten minutes later, users can't log in.

The rotation worked, but the problem is that Kubernetes checked whether the HTTP server was alive, not whether it could authenticate with the database. Those are two different things.

What's Actually Happening

What this is not: This isn't about how to store secrets in Kubernetes. It's about what happens after the secret is rotated.

When a pod is already running, it holds a pool of open database connections that were authenticated before the rotation. Those connections stay alive after the password changes because the database doesn't kick out sessions that are already authenticated. But when the pool needs to open a new connection, it uses the credentials in the pod's environment, which still hold the old password. That new connection fails immediately.

Meanwhile, Kubernetes sees the pod responding to HTTP and marks it Running, so your users are hitting the failures with no indication from the cluster that anything is wrong.
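The behavior is easier to reason about with a toy model. The sketch below simulates it in plain Python with no real database. FakeDatabase and Pool are made-up classes for illustration, not asyncpg's pool or anything from the demo app:

```python
class FakeDatabase:
    """Stands in for PostgreSQL: checks the password only when a session opens."""

    def __init__(self, password):
        self.password = password

    def connect(self, password):
        if password != self.password:
            raise PermissionError("password authentication failed")
        return object()  # an open, already-authenticated session


class Pool:
    """Keeps sessions opened at startup and opens new ones on demand."""

    def __init__(self, db, password, size=1):
        self.db = db
        self.password = password  # read once at startup, like an env var
        self.idle = [db.connect(password) for _ in range(size)]

    def acquire(self):
        if self.idle:
            return self.idle.pop()  # reuse: no new authentication happens
        return self.db.connect(self.password)  # new session: password is checked


db = FakeDatabase(password="old-secret")
pool = Pool(db, password="old-secret", size=1)

db.password = "new-secret"  # the rotation happens while the pod keeps running

first = pool.acquire()  # served from the pool, still works
try:
    pool.acquire()  # pool is empty, so this opens a new session
    second_result = "connected"
except PermissionError as exc:
    second_result = f"failed: {exc}"

print(second_result)
# → failed: password authentication failed
```

The first request succeeds and the second fails, with nothing changing in between except which code path the pool took. That is why the failure can show up minutes after the rotation instead of immediately.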

What the `/healthz/db` Endpoint Does

/healthz returns 200 if the HTTP server is alive. That is all Kubernetes checks.

/healthz/db opens a fresh database connection using the current credentials and runs SELECT 1. If that fails after a rotation, the pod is Running but can't serve database requests. The rotation script calls this endpoint as its final step – the check Kubernetes never runs.

Here's what that looks like in the demo FastAPI application (code files):

# app.py (relevant section)
import os
import asyncpg
from fastapi import FastAPI, HTTPException

app = FastAPI()

DB_HOST = os.environ.get("DB_HOST", "postgres")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "appdb")
DB_USERNAME = os.environ.get("DB_USERNAME", "appuser")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")

@app.get("/healthz")
async def healthz():
    # Always returns 200 if the HTTP server is alive.
    # This is all the Kubernetes readiness probe checks.
    return {"status": "ok"}

@app.get("/healthz/db")
async def healthz_db():
    # Opens a fresh connection using the current environment credentials.
    # If the password was rotated and this pod has not restarted yet,
    # the environment still has the old password - this connection fails.
    # /healthz above would still return 200. Your users would see errors.
    try:
        conn = await asyncpg.connect(
            host=DB_HOST, port=DB_PORT,
            database=DB_NAME, user=DB_USERNAME, password=DB_PASSWORD,
        )
        await conn.execute("SELECT 1")
        await conn.close()
        return {"status": "ok", "db": "authenticated"}

    except asyncpg.InvalidPasswordError:
        raise HTTPException(
            status_code=503,
            detail=(
                f"Authentication failed for '{DB_USERNAME}'. "
                "Password may have been rotated. "
                "Readiness probe does not check this."
            )
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Database error: {str(e)}")

The difference between these two endpoints is the entire lesson of this use case.

Set Up the Demo Environment

Navigate to 04-secrets-rotation/ and run the setup script:

cd 04-secrets-rotation
./setup.sh

This starts a Kind cluster, deploys real PostgreSQL with the appuser account already created, deploys the demo FastAPI app connected to it, and creates an initial secret in AWS Secrets Manager.

Once setup completes, install the dependencies:

pip install boto3 kubernetes

Before running the rotation, confirm everything is running:

kubectl get pods

You should see myapp and postgres pods both in the Running state. If any pod shows Pending or Error, wait 30 seconds and check again. PostgreSQL takes a moment to finish initialising.

You can also verify that the secret was created in AWS. In the console, go to AWS Secrets Manager and look for myapp/db-credentials.

If you prefer the CLI:

aws secretsmanager get-secret-value --secret-id myapp/db-credentials

Once both pods are Running and the secret exists, run the rotation to see the full path:

python rotate_secret.py

If Step 6 shows FAILED on this first clean run, it's almost always a timing issue: the app pod restarted successfully but /healthz/db ran before the new pod finished establishing its first database connection. Wait 20 seconds and run python rotate_secret.py again. If it fails repeatedly, run kubectl logs deployment/myapp to see what the app is reporting.
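If you'd rather not rerun the script by hand, a retry loop around the final check is one option. This is a minimal sketch and not something rotate_secret.py does as shown. wait_until_healthy is a hypothetical helper, and check is any zero-argument callable that returns True on success, such as a lambda wrapping verify_credential:

```python
import time


def wait_until_healthy(check, attempts=5, delay=5.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the new pod time to finish starting
    return None


# Illustrative stand-in for the real check: fails twice, then passes.
results = iter([False, False, True])
succeeded_on = wait_until_healthy(lambda: next(results), attempts=5,
                                  delay=0.0, sleep=lambda _: None)
print(succeeded_on)
# → 3
```

Keep the attempt count small. A check that is still failing after several tries is more likely a real credential mismatch than a timing issue, and that is the case the script exists to surface.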

You should see all six steps complete cleanly, ending with:

Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated

The lab is alive and the full rotation chain works end to end. Now let's look at what the script is doing.

The Script (Code Files)

# rotate_secret.py
import boto3
import base64
import json
import subprocess
import sys
from kubernetes import client, config


def get_current_secret(secret_name):
    """
    Fetch the current credential from AWS Secrets Manager.
    The secret is stored as a JSON string with 'username' and 'password' fields.
    """
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def rotate_in_aws(secret_name, username, new_password):
    """
    Write the new credential to AWS Secrets Manager.
    put_secret_value creates a new version - the previous version is
    not deleted immediately, giving you a short rollback window.
    """
    sm = boto3.client("secretsmanager")
    new_value = json.dumps({"username": username, "password": new_password})
    sm.put_secret_value(SecretId=secret_name, SecretString=new_value)
    print("  [AWS] Secret updated in Secrets Manager.")


def update_kubernetes_secret(namespace, k8s_secret_name, username, new_password):
    """
    Patch the Kubernetes Secret object with the new credential values.
    Kubernetes requires secret data to be base64-encoded - this is encoding,
    not encryption. Anyone with access to the Secret object can decode the values.
    Real encryption at rest requires separate etcd encryption configuration.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()

    secret_data = {
        "username": base64.b64encode(username.encode()).decode(),
        "password": base64.b64encode(new_password.encode()).decode()
    }

    v1.patch_namespaced_secret(
        name=k8s_secret_name,
        namespace=namespace,
        body={"data": secret_data}
    )
    print(f"  [K8s] Kubernetes Secret '{k8s_secret_name}' updated.")


def rolling_restart(namespace, deployment_name):
    """
    Trigger a rolling restart of the deployment.
    Rolling restart means Kubernetes creates one new pod, waits for it to pass
    its readiness probe, then terminates one old pod - and repeats until all
    pods have been replaced. Availability is preserved throughout.
    This is very different from deleting all pods at once.
    """
    result = subprocess.run(
        ["kubectl", "rollout", "restart",
         f"deployment/{deployment_name}", "-n", namespace],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rolling restart failed: {result.stderr}")
    print(f"  [K8s] Rolling restart triggered for '{deployment_name}'.")


def wait_for_rollout(namespace, deployment_name, timeout=120):
    """
    Block until the rolling restart finishes or times out.
    'Finished' means all new pods are Running and their readiness probes passed.
    This does NOT mean the application can authenticate with the new credential.
    That is what verify_credential checks next.
    """
    print(f"  [K8s] Waiting for rollout (timeout: {timeout}s)...")
    result = subprocess.run(
        ["kubectl", "rollout", "status",
         f"deployment/{deployment_name}",
         "-n", namespace,
         f"--timeout={timeout}s"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rollout did not complete: {result.stderr}")
    print("  [K8s] Rollout complete. All pods report Ready.")


def verify_credential(namespace, deployment_name):
    """
    This is the check the readiness probe does not make.
    We exec into the running pod and call /healthz/db - an endpoint that
    makes an actual authenticated query to the database.
    If this passes: the credential is working at the application level.
    If this fails after the readiness probe passed: the contract mismatch is confirmed.
    The pod is Running. The application cannot serve database requests.
    """
    print("  [Verify] Running post-rotation credential check...")

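    # Only consider pods in the Running phase, so a Pending or Failed pod
    # that matches the label is never chosen for the check.
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"app={deployment_name}",
         "--field-selector=status.phase=Running",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True
    )
    pod_name = result.stdout.strip()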

    if not pod_name:
        print("  [Verify] ERROR: No running pod found for this deployment.")
        return False

    verify = subprocess.run(
        ["kubectl", "exec", pod_name, "-n", namespace,
         "--", "curl", "-sf", "http://localhost:8000/healthz/db"],
        capture_output=True, text=True
    )

    if verify.returncode != 0:
        print("  [Verify] FAILED - Pod is Running but database authentication failed.")
        print("           The readiness probe validated HTTP reachability.")
        print("           The application cannot authenticate with the new credential.")
        print("           These are two different contracts. Only one was checked automatically.")
        return False

    print("  [Verify] PASSED - Application confirmed it can authenticate with the new credential.")
    return True


def rotate(secret_name, new_password, namespace, k8s_secret_name, deployment_name):
    print("\n[Step 1/6] Reading current secret from AWS Secrets Manager...")
    current = get_current_secret(secret_name)
    username = current["username"]

    print("[Step 2/6] Updating AWS Secrets Manager...")
    rotate_in_aws(secret_name, username, new_password)

    print("[Step 3/6] Rotating password at the database level (ALTER USER)...")
    rotate_postgres_password(namespace, new_password)

    print("[Step 4/6] Updating Kubernetes Secret object...")
    update_kubernetes_secret(namespace, k8s_secret_name, username, new_password)

    print("[Step 5/6] Triggering rolling restart...")
    rolling_restart(namespace, deployment_name)
    wait_for_rollout(namespace, deployment_name)

    print("[Step 6/6] Verifying the new credential works at the application level...")
    success = verify_credential(namespace, deployment_name)

    print("\n" + "=" * 60)
    if success:
        print("Rotation complete. Credential verified at the application level.")
    else:
        print("Rotation incomplete. Readiness probe passed but credential verification failed.")
        print("Recommended action: force-restart all pods to flush the connection pool,")
        print("or investigate the database session timeout configuration.")
        sys.exit(1)


if __name__ == "__main__":
    import secrets as _secrets
    rotate(
        secret_name="myapp/db-credentials",
        new_password=_secrets.token_urlsafe(16),
        namespace="default",
        k8s_secret_name="db-credentials",
        deployment_name="myapp"
    )

How the Script Works

get_current_secret reads the current credential from AWS Secrets Manager so the script knows the username before it generates a new password.

rotate_in_aws writes the new credential to Secrets Manager. It creates a new version rather than overwriting the old one, so you have a short window to roll back if something goes wrong.

_pg_password_literal and rotate_postgres_password handle the step that most rotation scripts skip, which is actually changing the password inside PostgreSQL. This is done by running ALTER USER appuser PASSWORD '...' directly on the live PostgreSQL pod. Before this step, the database still accepts the old password. After this step, it does not.

update_kubernetes_secret writes the new password into the Kubernetes Secret so that any new pod that starts will get the new credential from the beginning.
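
To see why the docstring in update_kubernetes_secret calls base64 "encoding, not encryption", here is a minimal sketch. The helper names and the sample value are hypothetical and not part of the rotation script:

```python
import base64


def encode_secret_value(value: str) -> str:
    """Encode a string the way a Kubernetes Secret's `data` field expects."""
    return base64.b64encode(value.encode()).decode()


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. No key is involved at any point."""
    return base64.b64decode(encoded).decode()


encoded = encode_secret_value("example-password")  # hypothetical value
print(encoded)
print(decode_secret_value(encoded))
```

Anyone who can read the Secret object can run the second function, which is why access control and encryption at rest still matter.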

rolling_restart and wait_for_rollout restart the application pods one at a time so the deployment stays available throughout. When this step completes, all pods are Running and their readiness probes have passed – but keep in mind that "Running" only means /healthz returned 200, which is exactly the problem this use case is about.
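
As far as I know, kubectl rollout restart works by stamping a timestamp annotation on the Deployment's pod template, and that change is what makes Kubernetes roll the pods. The sketch below only builds that patch body. Sending it (for example with the Python client's patch_namespaced_deployment) is left out, and the helper name is hypothetical:

```python
from datetime import datetime, timezone


def restart_patch(now=None):
    """Build a patch that changes the pod template and so forces a rollout."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "spec": {
            "template": {
                "metadata": {
                    # Annotation key assumed to match the one kubectl writes.
                    "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}
                }
            }
        }
    }


print(restart_patch(datetime(2026, 5, 17, 11, 53, 12, tzinfo=timezone.utc)))
```

Nothing about the credential is in that patch. The restart only guarantees that new pods start and read whatever the Secret holds at that moment.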

verify_credential is the extra step Kubernetes never runs. It reaches inside the new pod and calls /healthz/db, which opens a real database connection with the credentials in the pod's current environment. If this passes, the rotation is genuinely complete. If this fails after the readiness probe passed, you have confirmed the gap: the pod looks healthy but can't serve database requests.
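
The demo app's source isn't shown here, so treat the following as a sketch of how the two endpoints could differ, not as the app's actual code. The connect argument is a hypothetical callable that opens a fresh database connection and raises if authentication fails:

```python
import json


def healthz():
    """Shallow check: passes as long as the HTTP server can answer."""
    return 200, json.dumps({"status": "ok"})


def healthz_db(connect):
    """Deep check: passes only if a fresh, authenticated connection opens."""
    try:
        connect()
    except Exception as exc:
        return 503, json.dumps({"status": "error", "detail": str(exc)})
    return 200, json.dumps({"status": "ok"})


def bad_connect():
    # Stand-in for a driver rejecting the password.
    raise ConnectionError("password authentication failed")


print(healthz())
print(healthz_db(bad_connect))
```

The first function has no way to fail when the credential is wrong, which is exactly the gap verify_credential closes.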

What the Output Looks Like

Successful rotation:

[Step 1/6] Reading current secret from AWS Secrets Manager...
[Step 2/6] Updating AWS Secrets Manager...
  [AWS] Secret updated in Secrets Manager.
[Step 3/6] Rotating password at the database level (ALTER USER)...
  [DB]  Running ALTER USER on PostgreSQL...
  [DB]  Password changed at the database level.
        New connections now require the new password.
        Existing pool connections remain valid until they close.
[Step 4/6] Updating Kubernetes Secret object...
  [K8s] Kubernetes Secret 'db-credentials' updated.
[Step 5/6] Triggering rolling restart...
  [K8s] Rolling restart triggered for 'myapp'.
  [K8s] Waiting for rollout (timeout: 120s)...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] PASSED - Application confirmed it can authenticate with the new credential.

============================================================
Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated

The lab is alive and the full rotation chain works end to end.

Before you break anything, confirm the pod is healthy:

kubectl get pods

You should see myapp in Running state. That is the baseline: everything working as expected. Now let's break it.

Break it On Purpose

Step 1: Desync the DB

./break_it.sh

This runs ALTER USER directly on PostgreSQL and sets a password the application doesn't know. The Kubernetes Secret still holds the old password, so the pod's environment and the database are now out of sync.

Step 2: Check what Kubernetes sees

kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz

You will see {"status":"ok"}. The pod is still showing Ready in kubectl get pods. Kubernetes has no idea anything is wrong – that's the contract gap made visible in your terminal.

Step 3: Check what your users experience

kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz/db

You'll see a 503 error. Fresh database connections are failing. Your users are already seeing this.

Step 4: See the mixed pattern (optional)

./load_test.sh

Some requests succeed because they hit old pool connections that were authenticated before the break. Some fail because they need a fresh connection. The pod looks healthy, but half your traffic is failing.
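
The mixed pattern is easier to reason about with a toy model. This sketch assumes a burst of concurrent requests where each one needs its own connection. The function and its parameters are made up for illustration and don't come from load_test.sh:

```python
def simulate_requests(pooled_connections, request_count, db_accepts_app_password):
    """Return (succeeded, failed) for a burst of concurrent requests.

    The first `pooled_connections` requests reuse sessions that logged in
    before the password changed. The rest must open a new connection.
    """
    succeeded = failed = 0
    for i in range(request_count):
        if i < pooled_connections:
            succeeded += 1      # existing session, no re-authentication
        elif db_accepts_app_password:
            succeeded += 1      # fresh login works
        else:
            failed += 1         # fresh login rejected
    return succeeded, failed


print(simulate_requests(pooled_connections=5, request_count=10,
                        db_accepts_app_password=False))  # (5, 5)
```

The split depends entirely on pool size and traffic, so the failure rate you see in the lab will vary from run to run.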

Step 5: Run the rotation script

python rotate_secret.py

This time, Step 6 catches the failure. Here's what you'll see:

[Step 5/6] Triggering rolling restart...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] FAILED - Pod is Running but database authentication failed.
           The readiness probe validated HTTP reachability.
           The application cannot authenticate with the new credential.
           These are two different contracts. Only one was checked automatically.

============================================================
Rotation incomplete. Readiness probe passed but credential verification failed.

The pod is Running and shows Ready in kubectl get pods. The rotation script says the credential is broken. That's the contract gap visible in your terminal, caught before your users hit it.

The lesson: /healthz tells you the HTTP server is alive. /healthz/db tells you the application can actually connect to the database. Kubernetes only checks the first one unless you add a database probe. The rotation script adds that check at the end of every rotation so you catch the failure before your users do.

The Decision the Script Can't Make For You

The verification failed, the pod is Running, and requests to the database are failing. You have two options:

force-restart all pods at once to flush the connection pool (which is faster but causes a brief capacity reduction), or
wait for old sessions to expire naturally (which avoids downtime but leaves requests failing intermittently until the pool cycles).

The script found the problem, but deciding what to do next belongs to an engineer who knows the system.

Teardown

./teardown.sh

Use Case 5 - Automated Canary Rollback Trigger

Environment: Fully local – Kind, Prometheus via Helm
Language: Bash

What This Use Case Does and Why it Matters

This use case runs a script that watches your new deployment and automatically rolls it back if something goes wrong, before your users flood your support queue.

This matters in production because, when you ship a new version, you don't send all traffic to it immediately. You send a small slice, say 20%, to the new version while 80% still goes to the old one. If the new version is broken, only 20% of users are affected and you can roll back before the damage spreads. But the rollback only works if you're watching the right things.

The takeaway: Two scripts watch the same failing canary. One reports everything is fine. The other fires the rollback. The only difference is what they measure. Your automation is only as good as what it watches.

What to watch for: canary_watch_v1.sh watches errors only and stays silent while the canary is slow. canary_watch_v2.sh watches errors AND response time and fires the rollback. The difference between them is the lesson.

What this is not: This isn't a guide to canary deployments. It's about what your monitoring misses when it only watches one signal.

How it Works

Three things run in the cluster: the stable app (three pods, handles most traffic), the canary app (one pod, handles a small slice), and Prometheus (collects response times and error counts from both every 15 seconds).

The watch script asks Prometheus every 15 seconds: "Is the canary behaving normally?" If the answer is no for three checks in a row, it rolls back the canary automatically.

The question is: what does "behaving normally" mean? That is the entire use case.

Set Up the Demo Environment

Navigate to 05-canary-rollback/ and run:

cd 05-canary-rollback
./setup.sh

Setup takes a few minutes. It installs Prometheus, deploys both versions of the demo app, and starts a load generator pod that sends continuous traffic to both so Prometheus always has data.

When setup finishes, confirm everything is running:

kubectl get pods

You should see output like this:

NAME                                                   READY   STATUS    RESTARTS   AGE
load-generator-68c59698b7-kws2l                        1/1     Running   0          4m54s
myapp-canary-6d6979c66f-g9lgw                          1/1     Running   0          32s
myapp-stable-6bcf994fc4-b4k9l                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-ndhxc                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-z97kx                          1/1     Running   0          4m55s
prometheus-kube-prometheus-operator-59b847d96c-mp72s   1/1     Running   0          5m58s
prometheus-prometheus-kube-prometheus-prometheus-0     2/2     Running   0          5m1s

Three stable pods, one canary pod, one load generator, Prometheus running. The lab is alive.

Wait 60 seconds before running anything else. Prometheus needs time to scrape the first metrics from the pods. If you skip this, the watch scripts return empty data with no explanation.

Three Terminal Windows

You need three separate command prompts running at the same time.

On macOS: open Terminal and press Cmd+T twice. You now have three tabs, each an independent terminal.
On Linux: press Ctrl+Shift+T in most terminal apps, or right-click and choose "Open new tab."

Label them Terminal 1 for the watch script, Terminal 2 for injecting failures, Terminal 3 for watching latency.

The Scripts

Version 1: watches errors only

#!/usr/bin/env bash
# canary_watch_v1.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v1 - error rate only)."
echo "Rollback triggers if error rate exceeds ${ERROR_THRESHOLD} for ${STRIKE_LIMIT} checks."
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'

    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    error_rate=${error_rate:-0}
    above=$(echo "$error_rate > $ERROR_THRESHOLD" | bc -l)

    echo "[$ts] error_rate=${error_rate} | threshold=${ERROR_THRESHOLD} | breach=$([ "$above" = "1" ] && echo YES || echo NO)"

    if [ "$above" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo "  ROLLBACK TRIGGERED"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done

Version 2: watches error rate AND response time

#!/usr/bin/env bash
# canary_watch_v2.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
LATENCY_THRESHOLD="2.0"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v2 - error rate + P99 latency)."
echo "Error threshold: ${ERROR_THRESHOLD} | Latency P99 threshold: ${LATENCY_THRESHOLD}s"
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'
    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    latency_query='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="myapp-canary"}[1m])) by (le))'
    latency=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${latency_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    error_rate=${error_rate:-0}
    latency=${latency:-0}

    error_breach=$(echo "$error_rate > $ERROR_THRESHOLD" | bc -l)
    latency_breach=$(echo "$latency > $LATENCY_THRESHOLD" | bc -l)

    triggered_by=""
    [ "$error_breach" = "1" ] && triggered_by="error_rate(${error_rate})"
    [ "$latency_breach" = "1" ] && triggered_by="${triggered_by:+${triggered_by}, }latency_p99(${latency}s)"

    echo "[$ts] error_rate=${error_rate} | latency_p99=${latency}s | breach=${triggered_by:-none}"

    if [ "$error_breach" = "1" ] || [ "$latency_breach" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT} | Triggered by: ${triggered_by}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo ""
            echo "  ROLLBACK TRIGGERED"
            echo "  Signal: ${triggered_by}"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done

How the Scripts Work

The error rate query asks Prometheus: "What fraction of requests to the canary returned an error in the last minute?" A result of 0.0 means no errors. A result of 0.06 means 6% of requests are failing, above the 5% threshold. You see this in the output as:

error_rate=0.06 | threshold=0.05 | breach=YES

The latency query asks: "Over the last minute, how slow was the slowest 1% of requests to the canary?" A result of 5.234 means 1 in every 100 requests is taking over 5 seconds. You see this as:

latency_p99=5.234s | breach=latency_p99(5.234s)

V1 only runs the first query. V2 runs both. Same canary, same problem, different answers.
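
Both scripts pipe curl output through a Python one-liner. Here is the same idea as a standalone sketch that uses only the standard library. The function names are hypothetical, and treating NaN (which Prometheus returns for a 0/0 ratio) as "no data" is my assumption, not something the scripts do:

```python
import json
import math
import urllib.parse
import urllib.request


def parse_instant_value(payload, default=0.0):
    """Pull the first sample out of a Prometheus instant-query response."""
    result = payload.get("data", {}).get("result", [])
    if not result:
        return default
    value = float(result[0]["value"][1])  # samples arrive as [timestamp, "string"]
    return default if math.isnan(value) else value


def query_prometheus(base_url, promql, timeout=5):
    """Run an instant query against /api/v1/query and return one float."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_value(json.load(resp))


# Hand-written sample response, shaped like the API's vector result.
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {}, "value": [1747310415, "0.06"]}]}}
print(parse_instant_value(sample) > 0.05)  # True
```

Defaulting to 0 when there is no data mirrors the scripts, but it also means a canary that receives no traffic looks healthy. Whether that is acceptable is a call you have to make for your own system.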

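The three-strike rule means a single bad check doesn't trigger a rollback, but three in a row do. The tradeoff is exposure: with checks 15 seconds apart, the third strike lands 30 seconds after the first, and the failure can start up to 15 seconds before the first check sees it. That's roughly 45 seconds of bad traffic before the rollback fires.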

When three strikes hit, the watch script itself runs:

kubectl rollout undo deployment/myapp-canary -n default

That one line is what triggers the rollback. It lives inside canary_watch_v2.sh and runs automatically – you don't have to do anything. The script detects, decides, and acts.
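
The strike counter is the part worth getting right, so here it is on its own as a small sketch. It mirrors the loop in the Bash scripts, but the function name and the sample data are mine:

```python
def first_rollback_check(breaches, strike_limit=3):
    """Return the 0-based index of the check that triggers rollback, or None.

    `breaches` holds one boolean per check, oldest first.
    """
    strikes = 0
    for index, breached in enumerate(breaches):
        strikes = strikes + 1 if breached else 0  # a healthy check resets the count
        if strikes >= strike_limit:
            return index
    return None


checks = [False, True, True, False, True, True, True]
print(first_rollback_check(checks))  # 6
```

Notice that the two early breaches don't count toward the rollback, because the healthy check between them and the later run resets the counter.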

Break it On Purpose

In Terminal 1, start the v1 monitor:

./canary_watch_v1.sh

You will see this repeating every 15 seconds:

Canary monitor running (v1 - error rate only).
Rollback triggers if error rate exceeds 0.05 for 3 checks.

[2026-05-17T11:53:12] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:27] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:42] error_rate=0 | threshold=0.05 | breach=NO

breach=NO means the canary looks healthy. Leave this running and move to Terminal 2.

In Terminal 2, inject latency into the canary:

./break_it.sh

This makes every request to the canary take 5 seconds. Requests still return 200 – no errors, just slowness. You will see:

Injecting latency into the canary deployment...
deployment "myapp-canary" successfully rolled out
Latency injection is active.

The canary pod is Running and passing its readiness probe.
Every request to the canary now takes 5 seconds.
Error rate: 0%   |   P99 latency: ~5s

Now look back at Terminal 1. The v1 monitor keeps printing breach=NO. The canary is taking 5 seconds per request and your monitoring says everything is fine. That's the failure.

In Terminal 3, see what your users are actually experiencing:

./check_latency.sh

TIMESTAMP                   STABLE (ms)   CANARY (ms)   STATUS
---------                   -----------   -----------   ------
2026-05-17T11:55:14         18ms          5008ms        CANARY DEGRADED
2026-05-17T11:55:20         7ms           5008ms        CANARY DEGRADED
2026-05-17T11:55:27         6ms           5008ms        CANARY DEGRADED

Stable is responding in 6–18 milliseconds. Canary is taking over 5 seconds. Users on the canary are waiting 5 seconds for every page load. The v1 monitor in Terminal 1 still says breach=NO.

This is the lesson: the monitoring and the user experience are completely disconnected. The script isn't broken. It's watching the wrong thing.
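
You can reproduce the disconnect with made-up numbers. This sketch computes both signals from raw samples using a simple nearest-rank percentile. Prometheus estimates P99 from histogram buckets instead, so treat it as an illustration of the decision, not of the query:

```python
import math


def error_rate(statuses):
    """Fraction of responses with a 5xx status."""
    return sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)


def p99(latencies):
    """Nearest-rank 99th percentile of raw latencies, in seconds."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


# Hypothetical canary traffic: every request returns 200 but takes 5 seconds.
statuses = [200] * 100
latencies = [5.0] * 100

v1_breach = error_rate(statuses) > 0.05           # what v1 checks
v2_breach = v1_breach or p99(latencies) > 2.0     # what v2 checks
print(v1_breach, v2_breach)  # False True
```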

Now let's see the fix. Press Ctrl+C in Terminal 1 to stop v1. Start v2 in the same terminal:

./canary_watch_v2.sh

In Terminal 2, re-inject the latency:

./break_it.sh

Watch Terminal 1. V2 catches the latency and fires the rollback after three strikes:

Canary monitor running (v2 - error rate + P99 latency).
Error threshold: 0.05 | Latency P99 threshold: 2.0s

[2026-05-15T14:30:00] error_rate=0.0 | latency_p99=0.082s | breach=none
[2026-05-15T14:30:15] error_rate=0.0 | latency_p99=5.234s | breach=latency_p99(5.234s)
  Strike 1/3 | Triggered by: latency_p99(5.234s)
[2026-05-15T14:30:30] error_rate=0.0 | latency_p99=5.891s | breach=latency_p99(5.891s)
  Strike 2/3 | Triggered by: latency_p99(5.891s)
[2026-05-15T14:30:45] error_rate=0.0 | latency_p99=6.102s | breach=latency_p99(6.102s)
  Strike 3/3 | Triggered by: latency_p99(6.102s)

  ROLLBACK TRIGGERED
  Signal: latency_p99(6.102s)

deployment.apps/myapp-canary rolled back

The error rate never moved from 0. V2 rolled back anyway because latency crossed the threshold. That's the difference one extra measurement makes.

After the rollback, confirm the canary is dormant but not deleted:

kubectl rollout history deployment/myapp-canary -n default

REVISION  CHANGE-CAUSE
1         
2

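Two revisions. The rollback scaled the broken revision's ReplicaSet down to zero and brought the previous one back. Nothing was deleted, and you can re-deploy if you decide the rollback was a false alarm. (The revision numbers in your cluster may differ, because kubectl rollout undo records the restored template as a new revision.)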

The Decision the Script Can't Make For You

V2 rolled back based on latency with zero errors. Before re-deploying, ask whether the latency was a real regression in the new code or a temporary spike, such as a database cache warming up on first use. Both produce the same signal. Only you know which is more likely given what changed.

False positive rollbacks slow down deployments and erode confidence in automation. The right thresholds depend on your users and your system. What the script enforces is whatever you configure.

Teardown

./teardown.sh

What You Can Do Now

Each use case in this handbook was a script solving a specific problem the standard tooling wasn't catching. Here's where you land:

You can catch AWS cost spikes before the invoice and you know that the service label is AWS's attribution, not a pointer to what actually caused the cost. Start from what changed operationally, not from the billing label.

You can reconstruct the full timeline of any failed request across multiple services from a single trace ID, and you know that a missing service in that timeline is evidence, not just an absence.

You can detect infrastructure drift by comparing what Terraform believes against what AWS actually contains, and you know that a clean result means the resources Terraform manages are in sync, not that your entire AWS account is clean.

You can validate a secret rotation at the application level, not just at the infrastructure level, and you know the difference between a readiness probe passing and the application actually being able to connect to the database.

You can build a canary rollback trigger that watches the right signals, and you know why watching only error rates can leave a slow, broken deployment running while users wait.

The pattern across all five use cases is the same: the standard tooling reported everything as fine while something was actually broken. The cost script returned clean, the pod showed Running, and the canary showed zero errors – not because the tools were wrong but because they were only checking what was easy to check. These scripts check what the standard tooling skips.

GitHub repo: https://github.com/Osomudeya/devops-scripting-labs

I write about DevOps weekly, covering real systems, interview and CV tips, and real incidents. Join the newsletter.

How to Connect Your AI Coding Agent to a Browser on macOS

אחיה כהן — Tue, 26 May 2026 12:40:33 +0000

AI coding agents like Claude Code, Cursor, and the rest have gotten remarkably good at reading and writing code. But the moment they need to look at something on the web, they hit a wall. They can't see your staging site. They can't read the error in your analytics dashboard. They can't check whether the form they just built actually submits.

The usual fix is to hand the agent a headless browser — Puppeteer or Playwright driving a fresh Chromium instance. That works, sort of. But a headless Chromium starts every session as a stranger: no logins, no cookies, no sessions. It spins up a second browser engine that pushes your CPU and spins up your fan. And a growing number of sites simply block it on sight.

There's another option, and on a Mac it's a good one: let the agent drive the Safari you already use — the one that's already logged into GitHub, your analytics, your staging environment. That's what Safari MCP does. It's an open-source MCP server that exposes Safari to any MCP-capable agent through around 80 tools, with no Chromium, no WebDriver, and no separate browser to babysit.

In this tutorial you'll connect Safari MCP to an AI agent, run your first automation, and then build something a headless browser fundamentally cannot do: an automation that works inside a page you're logged into. By the end you'll understand not just how to wire this up, but when native browser automation is the right call — and when it isn't.

Here's what you'll need:

A Mac (Safari MCP is macOS-only — more on that trade-off later)
Node.js 18 or newer
An MCP-capable AI agent — this tutorial uses Claude Code and Cursor, but any MCP client works

What is MCP, and Why Does Browser Automation Need It?
Why Safari Instead of Chrome or Playwright?
Installing Safari MCP
Your First Automation: Reading a Page
The Payoff: Automating a Logged-in Workflow
Handling the Tricky Parts
Limitations: When Not to Use This
Wrapping Up

What is MCP, and Why Does Browser Automation Need It?

Before wiring anything up, it helps to know what the "MCP" in Safari MCP stands for.

MCP is the Model Context Protocol — an open standard for connecting AI agents to external tools and data. Think of it the way you'd think of a USB port. Before USB, every device needed its own connector. MCP is the equivalent of agreeing on one connector: an agent that speaks MCP can use any tool that speaks MCP, with no custom integration code on either side.

An MCP server exposes a set of tools. An MCP client — your AI agent — discovers those tools and calls them. The server describes each tool (its name, what it does, what arguments it takes) and the agent decides when to call it. When Claude Code decides it needs to read a web page, it doesn't run browser code itself. It calls a tool that some MCP server provides.
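
Under the hood, discovery and invocation are JSON-RPC 2.0 messages. The sketch below builds the two requests a client would send, based on my reading of the MCP specification's tools/list and tools/call methods. The helper names are hypothetical, and a real client also performs an initialization handshake that is omitted here:

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC requests need unique ids


def list_tools_request():
    """Ask the server which tools it offers."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name, arguments=None):
    """Ask the server to run one tool with the given arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


print(json.dumps(list_tools_request()))
print(json.dumps(call_tool_request("safari_navigate",
                                   {"url": "https://example.com"}), indent=2))
```

You rarely write these by hand, since the agent's MCP client does it for you, but seeing the shape makes the later tool calls less mysterious.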

Browser automation is a natural fit for this model. The agent's job is reasoning — "I need to see what's on the staging site, then check the console for errors." The actual mechanics — open a tab, wait for load, read the DOM, capture console output — are well-defined operations that belong behind a stable interface. That interface is exactly what an MCP server provides.

Safari MCP is one such server. It runs as a local process, exposes around 80 browser tools (navigate, click, fill, read, screenshot, extract, and more), and any MCP client can drive it. The agent never touches AppleScript or WebKit internals. It just calls safari_navigate and gets a result.

The "USB port" framing matters for a practical reason: nothing in this tutorial is Claude-specific. Wire Safari MCP into Cursor, Cline, Windsurf, or your own MCP client and the tools are identical.

Why Safari Instead of Chrome or Playwright?

If you've automated a browser before, you've almost certainly used Chrome through Puppeteer, Playwright, or Selenium. So why reach for Safari?

It comes down to three differences that matter once an AI agent, not a test script, is the thing driving the browser.

1. It's your real browser, with your real sessions. A headless Chromium launched by Playwright is a clean room. It has never logged into anything. If you want your agent to read your analytics dashboard, you first have to solve authentication — store credentials somewhere, script the login, handle two-factor prompts, refresh tokens. Safari MCP skips all of that. It drives the Safari instance you use every day, which is already logged into your dashboards, your GitHub, your email. The agent inherits those sessions for free.

2. It doesn't melt your laptop. A headless Chromium is a second, full browser engine running alongside the browser you already have open. On a laptop that's real CPU, real memory, and a fan you can hear. Safari MCP uses the WebKit engine that's already running on every Mac — there's no second engine to start. The project measures this at roughly 60% less CPU for the browsing work, and the automation runs with Safari in the background, so it doesn't steal your screen.

3. Sites don't treat it as a bot. Headless browsers leak. They expose navigator.webdriver, they ship with telltale automation fingerprints, and bot-detection services — Cloudflare's challenge pages, reCAPTCHA, the WAFs in front of a lot of B2B sites — have gotten very good at spotting them. Your real Safari, driven through the operating system, looks like exactly what it is: a person's browser. (To be clear: this is for automating your own accounts and sites — not for evading access controls you don't own.)

The cost of all this is the obvious one: Safari MCP is macOS-only. It's built on WebKit and AppleScript, so there's no Windows or Linux story. If your agent runs on a Linux CI box, this isn't your tool. If it runs on your Mac — which, for a coding agent, it very often does — the trade is a good one. We'll come back to limitations honestly at the end.

Installing Safari MCP

Installation is genuinely one command, but there are two Safari settings to flip first. Let's do it in order.

Step 1 — Enable Safari's developer features

Safari MCP reads and controls pages by running JavaScript inside Safari. Two settings have to be on:

Open Safari → Settings → Advanced and check "Show features for web developers." This reveals the Develop menu.
Open the new Develop menu and check "Allow JavaScript from Apple Events."

That second one is the important one. It's what lets an outside process — the MCP server — ask Safari to run JavaScript on a page. Without it, every tool call fails.

Step 2 — Run the server

npx safari-mcp

That's the whole install. npx fetches the package and runs it; there's nothing to build. The first time an agent calls a tool, macOS will pop up a permission prompt — something like "Terminal wants to control Safari." Click OK. That's the standard Automation permission, and you can review it later under System Settings → Privacy & Security → Automation.

If you'd rather have it installed permanently:

npm install -g safari-mcp

Step 3 — Tell your agent about it

Your AI agent needs to know the server exists. For Claude Code, one command does it:

claude mcp add safari -- npx safari-mcp

For Cursor, create .cursor/mcp.json in your project:

{
  "mcpServers": {
    "safari": {
      "command": "npx",
      "args": ["safari-mcp"]
    }
  }
}

The process is the same for every client — Claude Desktop, Cline, Windsurf, Continue, VS Code. You're telling the agent: "there's an MCP server named safari; start it by running npx safari-mcp."
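
If you already have other MCP servers configured, you'll want to add the safari entry without overwriting them. Here is a small sketch of that merge. It assumes the config file uses the same mcpServers layout as the Cursor example, and the function names are hypothetical:

```python
import copy
import json
from pathlib import Path


def merge_mcp_server(config, name, command, args):
    """Return a copy of `config` with one server entry added or replaced."""
    merged = copy.deepcopy(config)
    merged.setdefault("mcpServers", {})[name] = {"command": command, "args": list(args)}
    return merged


def update_config_file(path, name, command, args):
    """Read a config file if it exists, merge the entry, and write it back."""
    path = Path(path)
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_mcp_server(existing, name, command, args)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged


print(json.dumps(merge_mcp_server({}, "safari", "npx", ["safari-mcp"]), indent=2))
```

Calling update_config_file(".cursor/mcp.json", "safari", "npx", ["safari-mcp"]) from your project root would apply it to the Cursor config.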

Restart your agent (or reload its MCP servers) and it will connect. In Claude Code you can confirm with the /mcp command, which lists connected servers and their tools. You should see safari with around 80 tools available.

That's it. Your agent now has a browser.

Your First Automation: Reading a Page

Let's prove the wiring works with the simplest possible task: have the agent read a web page.

In your agent, just ask in plain language:

"Use the safari tools to open example.com and tell me what the page says."

Behind that request, the agent makes two tool calls. First it navigates:

{ "tool": "safari_navigate", "arguments": { "url": "https://example.com" } }

Then it reads the content:

{ "tool": "safari_read_page", "arguments": {} }

safari_read_page returns the page's title, URL, and text content with the HTML stripped out — exactly the form an LLM wants. The agent gets back something like this:

Example Domain
https://example.com/
This domain is for use in illustrative examples in documents. You may
use this domain in literature without prior coordination or asking for
permission.

And it relays that to you. You just watched your agent browse.

A quick note on how the agent should look at a page, because it changes everything downstream. safari_read_page is great for "what does this say." But when the agent needs to act — click a button, fill a field — text isn't enough. It needs to know what's actually there and how to target it. For that, the better first move is safari_snapshot:

{ "tool": "safari_snapshot", "arguments": {} }

This returns an accessibility-tree view of the page, where every interactive element has a stable ref ID:

[textbox ref=0_8] "Full Name" value=""
[combobox ref=0_10] "Subject"
[button ref=0_15] "Submit"

Those ref IDs are the agent's reliable handles. CSS selectors break when a page re-renders. A snapshot ref stays valid for the life of the page. Keep that in mind — it's the difference between an automation that works once and one that works every time.
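If you ever want to work with a snapshot outside the agent, for example in a script that logs which elements were on a page, the lines are easy to parse. The following is a minimal sketch that assumes the snapshot looks exactly like the three sample lines above. The regular expression and the parse_snapshot and find_ref helpers are hypothetical names made up for this illustration; they are not part of Safari MCP.

```python
import re

# Matches lines shaped like the sample above: [role ref=ID] "label" value="..."
# The format is inferred from the article's example output, not from a spec.
LINE_RE = re.compile(
    r'^\[(?P<role>\w+) ref=(?P<ref>\w+)\]\s+"(?P<label>[^"]*)"'
    r'(?:\s+value="(?P<value>[^"]*)")?$'
)

def parse_snapshot(text):
    """Return one dict per recognised line; unrecognised lines are skipped."""
    elements = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            elements.append(match.groupdict())
    return elements

def find_ref(elements, role, label):
    """Look up the ref ID for an element by role and visible label."""
    for el in elements:
        if el["role"] == role and el["label"] == label:
            return el["ref"]
    return None

sample = '''[textbox ref=0_8] "Full Name" value=""
[combobox ref=0_10] "Subject"
[button ref=0_15] "Submit"'''

elements = parse_snapshot(sample)
print(find_ref(elements, "button", "Submit"))  # prints 0_15
```

Looking elements up by role and label, then acting on the ref, is roughly what the agent does for you on every call.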

The Payoff: Automating a Logged-in Workflow

Reading example.com is a wiring test. Here's the thing a headless browser genuinely cannot do.

Pick a site you're logged into in Safari right now — your analytics, your project board, your CI dashboard. We'll use GitHub, because every developer has an account and the notifications page is a real, mildly annoying chore. The task: have the agent open your GitHub notifications and summarize what actually needs your attention.

Ask the agent:

"Open my GitHub notifications, read them, and group them into 'needs a reply' versus 'just FYI'."

The agent navigates:

{ "tool": "safari_navigate", "arguments": { "url": "https://github.com/notifications" } }

Stop and notice what didn't happen. No login screen. No OAuth dance. No personal access token in an environment variable. Safari is already authenticated as you, so the agent lands directly on your real notifications. A headless Chromium would have hit a login wall here and stopped.

Notification lists load incrementally, so the agent should wait for content before reading. safari_wait_for polls the page until a selector or piece of text appears, or a timeout elapses:

{ "tool": "safari_wait_for", "arguments": { "text": "Inbox", "timeout": 10000 } }

Then it reads. safari_read_page scoped to the notifications region returns the list as clean text:

{ "tool": "safari_read_page", "arguments": { "selector": "main" } }

The agent reasons over that text and hands you the grouped summary. The whole loop — navigate, wait, read, summarize — is a handful of tool calls.

When you need data in a precise shape rather than prose — to feed another step, or to write to a file — the agent can reach for safari_evaluate, which runs custom JavaScript on the page and returns whatever you build:

{
  "tool": "safari_evaluate",
  "arguments": {
    "expression": "JSON.stringify([...document.querySelectorAll('li')].map(li => li.innerText.trim()))"
  }
}

The agent writes that expression itself, against the structure it just saw in the snapshot — you don't hand-author selectors.

You might be thinking: GitHub has an API, why scrape the page? Fair. For GitHub specifically, the API is excellent. But the point generalizes. Most of the dashboards you stare at every day — your billing portal, your error tracker's specific filtered view, a client's analytics, the admin panel of some tool your company pays for — either have no usable API or would cost you an afternoon of OAuth setup to reach. With Safari MCP, "the page I'm already looking at" is the API. The agent reads what you can see, because it's using the browser you're seeing it in.

That's the capability headless automation can't match. Not speed, not features — access.

Handling the Tricky Parts

A first automation always looks easy. Three things tend to bite on the second one.

Tab Safety — The Agent Must not Hijack Your Tabs

This is the scariest failure mode: you're typing in a tab, the agent navigates that tab, and your work is gone. Safari MCP guards against it by stamping each automation tab with an identity marker — it uses window.name, which survives page navigations — and resolving "the agent's tab" through that marker on every call. If it can't positively identify its own tab, it refuses to act and raises a re-anchor error rather than guessing.

The practical rule for you: let the agent open its own tab with safari_new_tab, and it will stay in its lane. Don't point it at "the current tab" and assume.

Waiting for Dynamic Content

Modern pages render after load. If the agent reads too early, it reads an empty shell. Don't have it guess with fixed sleeps — use safari_wait_for, which polls for a selector or text until it appears or the timeout elapses:

{ "tool": "safari_wait_for", "arguments": { "selector": ".results-list", "timeout": 8000 } }

This is the single most common fix for "the automation works when I step through it slowly but fails when it runs."
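The polling idea itself is small enough to sketch in a few lines of Python. This is not Safari MCP's implementation, just an illustration of "check, sleep briefly, check again, give up at the timeout." The wait_for function and its parameters are names chosen for this example, and the clock and sleep arguments exist only so the behavior is easy to test.

```python
import time

def wait_for(condition, timeout_ms=8000, interval_ms=100,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns a truthy value or the timeout elapses."""
    deadline = clock() + timeout_ms / 1000
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_ms} ms")
        sleep(interval_ms / 1000)

# Usage: a stand-in "page" that only becomes ready on the third check.
checks = {"count": 0}

def results_rendered():
    checks["count"] += 1
    return checks["count"] >= 3

wait_for(results_rendered, timeout_ms=1000, interval_ms=10)
print(checks["count"])  # prints 3
```

Compared with a fixed sleep, the poll returns as soon as the content exists and fails loudly when it never shows up.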

Framework Forms

Set a React or Vue input's .value directly and the framework never notices — its internal state stays empty, and your "filled" form submits blank. Safari MCP's safari_fill and safari_fill_form use the native value setters and dispatch the input and change events the framework listens for, so React, Vue, Angular, and Svelte state all stay in sync:

{
  "tool": "safari_fill_form",
  "arguments": {
    "fields": [
      { "selector": "#email", "value": "jane@example.com" },
      { "selector": "#message", "value": "Looks great." }
    ]
  }
}

For framework-heavy pages where CSS selectors are fragile, go back to the snapshot refs from the previous section — pass { "ref": "0_9" } instead of { "selector": "#email" }. Refs survive re-renders; selectors don't.

None of these are exotic. They're just the difference between a demo and an automation you'd actually leave running.

Limitations: When Not to Use This

A tool tutorial that only lists strengths isn't worth much. Here's where Safari MCP is the wrong choice.

It's macOS-only, and that's structural. Safari MCP is built on WebKit and AppleScript. There's no Windows or Linux port coming, because the foundation doesn't exist on those platforms. If your agent runs in Linux CI, use Playwright.

It drives one Safari, on one Mac. This is browser automation for your machine — a coding agent working alongside you. It is not a fleet. If you need 50 parallel browsers scraping in a data center, that's a headless-Chromium-in-containers job, and Safari MCP is the wrong shape for it.

Cross-browser test suites should stay on Playwright. If you're writing end-to-end tests that must pass on Chrome, Firefox, and Safari, use the tool built for that. Safari MCP drives exactly one engine: WebKit.

It shares a browser with you. Because it uses your real Safari, the agent and you are in the same browser. That's the entire point — but it means you should let the agent work in its own tabs and not fight it for the same window.

The honest summary: Safari MCP is built for one specific situation — an AI agent doing real browser work on the Mac you're sitting at, against sites you're already logged into. In that situation it's hard to beat. Outside it, reach for the headless tools. Knowing which situation you're in is the actual skill.

Wrapping Up

You've gone from an AI agent that could only see code to one that can see the web — the real web, behind your real logins.

To recap what you did: you learned what MCP is and why browser automation belongs behind that interface. You saw why a native Safari engine beats a headless Chromium for an agent working on your Mac, and you installed Safari MCP with one command and two settings. You ran a first read, and then you did the thing that actually matters: an automation inside a logged-in page, with no auth code at all. Finally, you saw the edges: tab safety, waiting for dynamic content, framework forms, and the cases where you should pick a different tool.

The bigger idea is worth holding onto. An AI agent is only as capable as the tools you connect to it. Giving it a browser — a real one — turns "write me code" into "go look at the staging site, find the bug, and tell me what's wrong." That's a different kind of collaborator.

Safari MCP is open source under the MIT license, and it exposes around 80 tools beyond the handful you used here — screenshots, network inspection, storage, accessibility audits, multi-tab workflows. The repository and full tool reference are at github.com/achiya-automation/safari-mcp. Point your agent at it and see what it does when it can finally look around.

How to Build a Self-Hosted WhatsApp Bot with n8n and WAHA

אחיה כהן — Mon, 11 May 2026 13:57:06 +0000

WhatsApp is where many of your customers likely already are. For support tickets, order updates, booking reminders, and lead qualification, a WhatsApp channel often converts several times better than email.

But the official WhatsApp Business Cloud API can be slow to onboard, template-restricted for proactive messages, and priced per conversation — which adds up fast at scale.

There's another path: you can run your own WhatsApp HTTP gateway on a small server, connect it to a workflow engine, and keep every message — inbound and outbound — inside infrastructure you control. No monthly conversation fees, no template approvals for routine replies, no third-party middleman holding your customer data.

In this tutorial, you'll build exactly that. By the end, you'll have a WhatsApp bot that:

Receives every incoming message through a webhook
Routes messages through an n8n workflow
Replies automatically based on keywords, AI, or any API call you want
Runs entirely on your own server, using two open-source tools

You'll use WAHA (WhatsApp HTTP API) as the gateway, and n8n as the workflow engine. Both run in Docker, both are free for self-hosting, and together they cover everything from a simple auto-reply to a full CRM integration.

What You'll Learn
Prerequisites
A Note on Which WhatsApp Account to Use
WAHA vs the official WhatsApp Business Cloud API
Part 1: Understanding WAHA
Part 2: Running WAHA with Docker
Part 3: Starting a WhatsApp session
Part 4: Running n8n
Part 5: Creating the Webhook Trigger in n8n
Part 6: Wiring WAHA to n8n
Part 7: Building the first auto-reply
Part 8: A Second Example — Proactive Booking Confirmations
Part 9: Going to Production
Common Pitfalls
Where to Go Next

What You'll Learn

How WAHA works under the hood and when to use it instead of the official Cloud API
How to run WAHA and n8n side by side with Docker Compose
How to scan the QR code and bind a WhatsApp account to your gateway
How to connect WAHA's webhook to an n8n workflow
How to build a keyword-based auto-reply bot
How to send proactive confirmations from a separate workflow
How to harden the setup for production (HTTPS, API keys, rate limits, Queue Mode)

Prerequisites

A Linux server (any VPS works — 2 GB of RAM is enough for a small bot)
Docker and Docker Compose installed
A public hostname with DNS pointing at the server, or an ngrok tunnel for local testing
A WhatsApp account you're willing to dedicate to the bot (more on that below)
Basic familiarity with JSON and HTTP requests

You don't need prior n8n experience. If you can drag a box and wire it to another box, you can build the flow.

A Note on Which WhatsApp Account to Use

WAHA works by running an actual WhatsApp Web session inside a headless Chromium process. It logs in as a real account — the same way you would open web.whatsapp.com in your browser. Meta doesn't officially endorse this approach for commercial use at scale, and heavy volume from a single number can lead to a ban.

For that reason, use a dedicated number for the bot. Don't use your personal WhatsApp. Get a second SIM, eSIM, or a VoIP number that supports WhatsApp activation. Keep outbound volume reasonable, and you'll be fine for most small-business use cases.

If you plan to send thousands of marketing messages per day, switch to the official WhatsApp Business Cloud API — that's what it exists for. This tutorial is aimed at the middle ground: support bots, order updates, booking confirmations, and similar conversational flows where you need real-time control without enterprise pricing.

WAHA vs the official WhatsApp Business Cloud API

Before writing any code, it helps to understand when each option is the right fit.

Dimension	WAHA (self-hosted)	WhatsApp Cloud API (Meta)
Onboarding	Scan a QR code — ready in minutes	Business verification, app review — days to weeks
Cost	Server cost only	Per-conversation pricing
Template approval	Not needed	Required for proactive messages outside the 24-hour window
Session model	One WhatsApp Web session per Core container	Native API, no web session
Risk	Account ban possible at high unsolicited volume	Rate limits but no ban for normal use
Vendor lock-in	None — pure open source	Tied to Meta's API and pricing
Best for	Support bots, small-team workflows, internal tools	High-volume marketing, regulated industries, >100k monthly messages

Neither is strictly better. If you run a support team for a small business, WAHA is often the pragmatic choice. If you're a bank sending millions of transactional messages, you want the Cloud API. Many teams run both — WAHA for conversational support, Cloud API for bulk transactional traffic.

Part 1: Understanding WAHA

WAHA is an open-source project that wraps WhatsApp Web behind a clean REST API. You POST /api/sendText with a chat ID and a message, and WAHA sends it. You configure a webhook URL, and WAHA POSTs to that URL every time a message arrives.

Under the hood, WAHA spawns a Chromium instance, opens WhatsApp Web, and uses an engine (whatsapp-web.js, NOWEB, or GOWS) to automate the session. Your code doesn't see any of that complexity — you just see an HTTP API.

The project ships in two flavors:

WAHA Core — free, MIT licensed, one active session per container, community support.
WAHA Plus — commercial license, multi-session support, priority support, and access to advanced endpoints.

For most developers building a single bot, Core is enough. You can always upgrade later.

Official docs live at waha.devlike.pro. Keep that open in another tab — we'll reference specific endpoints as we go.

Part 2: Running WAHA with Docker

Create a fresh directory for the project:

mkdir whatsapp-bot && cd whatsapp-bot

Create a docker-compose.yml file:

services:
  waha:
    image: devlikeapro/waha:latest
    container_name: waha
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - WAHA_DASHBOARD_ENABLED=true
      - WAHA_DASHBOARD_USERNAME=admin
      - WAHA_DASHBOARD_PASSWORD=change-me-now
      - WHATSAPP_API_KEY=super-secret-key-change-me
      - WHATSAPP_DEFAULT_ENGINE=WEBJS
    volumes:
      - ./waha-sessions:/app/.sessions

A few things to notice:

The dashboard username and password protect the web UI at http://your-server:3000. Always change the defaults before you expose the port publicly.
WHATSAPP_API_KEY is the key every HTTP request to WAHA must include in the X-Api-Key header. Treat it like a database password.
WHATSAPP_DEFAULT_ENGINE=WEBJS uses the mature whatsapp-web.js engine. WAHA also supports NOWEB and GOWS engines with different trade-offs — WEBJS is the safest default for a first deployment.
The volume mount persists the session across restarts. Without it, every container rebuild forces you to scan the QR code again.

Start the container:

docker compose up -d
docker compose logs -f waha

Within about 20 seconds WAHA finishes booting. Visit http://your-server:3000 and log in with the dashboard credentials.

Part 3: Starting a WhatsApp session

WAHA calls each WhatsApp account a "session." You can have one session at a time on WAHA Core.

From the dashboard, click Start New Session and name it default. WAHA displays a QR code.

On your phone:

Open WhatsApp.
Tap the three-dot menu (Android) or Settings (iOS).
Tap Linked Devices → Link a Device.
Point the camera at the QR code on your screen.

Within a few seconds the dashboard shows WORKING status. Your session is live.

You can also do this over the API. Start the session (default is the session name, encoded in the URL path):

curl -X POST http://your-server:3000/api/sessions/default/start \
  -H "X-Api-Key: super-secret-key-change-me"

The call is idempotent — if the session is already running, nothing happens.

Fetch the QR as a PNG:

curl http://your-server:3000/api/default/auth/qr \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Accept: image/png" \
  --output qr.png

Scan and you're in.

Test that the session works by sending a message to yourself:

curl -X POST http://your-server:3000/api/sendText \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "session": "default",
    "chatId": "15555550123@c.us",
    "text": "Hello from WAHA!"
  }'

Replace 15555550123 with your own number (country code plus number, no +, no spaces, no dashes). The @c.us suffix marks it as an individual chat. Groups use @g.us.
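If you call WAHA from your own code rather than curl, it helps to normalize phone numbers in one place. Here is a minimal Python sketch that applies the formatting rule above and builds the same sendText request using only the standard library. The to_chat_id and build_send_text helpers are hypothetical names for this example; the endpoint, header, and body fields are the ones from the curl command.

```python
import json
import re
import urllib.request

def to_chat_id(phone):
    """Strip everything but digits and add the individual-chat suffix."""
    digits = re.sub(r"\D", "", phone)
    if not digits:
        raise ValueError(f"no digits found in {phone!r}")
    return f"{digits}@c.us"

def build_send_text(base_url, api_key, phone, text, session="default"):
    """Build (but do not send) the POST /api/sendText request."""
    body = {"session": session, "chatId": to_chat_id(phone), "text": text}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/sendText",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_send_text("http://your-server:3000", "super-secret-key-change-me",
                      "+1 (555) 555-0123", "Hello from WAHA!")
print(json.loads(req.data)["chatId"])  # prints 15555550123@c.us
# To actually send it: urllib.request.urlopen(req)
```

The request is only built here, not sent, so you can inspect it before pointing it at a live session.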

If the message lands on your phone — congratulations. The gateway works.

Part 4: Running n8n

Add an n8n service to your docker-compose.yml alongside WAHA:

services:
  waha:
    # ... existing config

  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=n8n.example.com
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      - WEBHOOK_URL=https://n8n.example.com/
      - GENERIC_TIMEZONE=UTC
    volumes:
      - ./n8n-data:/home/node/.n8n

Replace n8n.example.com with your real domain. For purely local testing, set:

- N8N_HOST=localhost
- N8N_PROTOCOL=http
- WEBHOOK_URL=http://localhost:5678/

If you want to test webhooks from your laptop without a server, run ngrok http 5678 in another terminal and use the ngrok HTTPS URL as WEBHOOK_URL. n8n uses WEBHOOK_URL to tell external services where to POST — get this wrong and your webhooks will 404.

Start the stack:

docker compose up -d

Visit http://your-server:5678. On the first visit, n8n walks you through creating an owner account (email and password). Every subsequent visit requires that login. For extra safety in production, put n8n behind a reverse proxy with an allow-list or an additional auth layer — we'll set that up later.

Part 5: Creating the Webhook Trigger in n8n

Click Create Workflow. You'll see an empty canvas.

Add a Webhook node and configure it:

HTTP Method: POST
Path: whatsapp (this becomes part of the URL)
Response Mode: Respond Immediately
Response Data: First Entry JSON

Click Listen for Test Event. n8n shows you two URLs: a test URL and a production URL. Copy the production URL. It looks like this:

https://n8n.example.com/webhook/whatsapp

Not webhook-test — that one only fires while the editor is open. You want webhook.

Part 6: Wiring WAHA to n8n

WAHA can POST to a webhook on every WhatsApp event. Tell it where to send those events.

In the WAHA dashboard, open your session and set the webhook URL. Or do it over the API:

curl -X PUT http://your-server:3000/api/sessions/default \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "webhooks": [
        {
          "url": "https://n8n.example.com/webhook/whatsapp",
          "events": ["message", "session.status"]
        }
      ]
    }
  }'

The message event fires on every inbound message. session.status fires when the session connects, disconnects, or reconnects — which is useful for alerting when your bot goes down.

Test it. From another phone, send a WhatsApp message to your bot's number. Head back to the n8n editor. Within a second or two the webhook node lights up with the event data.

The payload looks roughly like this:

{
  "event": "message",
  "session": "default",
  "payload": {
    "id": "false_15555550123@c.us_3EB0...",
    "from": "15555550123@c.us",
    "body": "Hello",
    "timestamp": 1713801234,
    "fromMe": false
  }
}

Everything you need is in payload: who sent it (from), what they said (body), and when (timestamp).
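If you later handle these events in your own service instead of n8n, pulling those three fields out might look like the sketch below. It assumes the event has exactly the shape of the sample above; the InboundMessage class and parse_event function are illustrative names, not part of WAHA.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InboundMessage:
    message_id: str
    sender: str
    text: str
    sent_at: datetime
    from_me: bool

def parse_event(event):
    """Return an InboundMessage for "message" events, or None for anything else."""
    if event.get("event") != "message":
        return None
    payload = event["payload"]
    return InboundMessage(
        message_id=payload["id"],
        sender=payload["from"],
        text=payload.get("body", ""),
        # The sample timestamp is in Unix seconds; convert it to an aware datetime.
        sent_at=datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
        from_me=payload.get("fromMe", False),
    )

event = {
    "event": "message",
    "session": "default",
    "payload": {
        "id": "false_15555550123@c.us_3EB0...",
        "from": "15555550123@c.us",
        "body": "Hello",
        "timestamp": 1713801234,
        "fromMe": False,
    },
}
msg = parse_event(event)
print(msg.sender, msg.text, msg.sent_at.isoformat())
```

Returning None for non-message events keeps session.status notifications from being treated as chat messages.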

Part 7: Building the first auto-reply

A bot that only listens is boring. Let's make it answer.

You'll build a tiny keyword router: if the user sends hi or hello, the bot greets them. If they send price, it sends a pricing message. Anything else gets a fallback.

After the Webhook node, add a Switch node.

Configure the Switch node:

Mode: Expression
Value: {{ $json.payload.body.toLowerCase().trim() }}
Add routing rules:
- Rule 1: equals hi — output 0
- Rule 2: equals hello — output 0
- Rule 3: equals price — output 1
- Fallback output: 2
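If it helps to see the routing as code, here is a small Python sketch of the same logic. It only illustrates what the Switch rules above do; n8n does not run this, and the ROUTES table and route function are names made up for the example.

```python
# Keyword -> Switch output index, mirroring the rules listed above.
ROUTES = {"hi": 0, "hello": 0, "price": 1}
FALLBACK_OUTPUT = 2

def route(body):
    """Normalise the text the same way the Switch expression does, then look it up."""
    key = (body or "").lower().strip()
    return ROUTES.get(key, FALLBACK_OUTPUT)

for text in ["Hi", "  HELLO ", "price", "what are your hours?"]:
    print(repr(text), "->", route(text))
```

Because the match is an exact comparison after lowercasing and trimming, "Hi" and " HELLO " both reach output 0, while "hi there" falls through to the fallback.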

After the Switch, add three HTTP Request nodes, one per output.

Configure each HTTP Request node identically, except for the body text:

Method: POST
URL: http://waha:3000/api/sendText (inside the Docker network, you can reach WAHA by its service name; from outside, use the full public URL)
Send Headers: on
- X-Api-Key: super-secret-key-change-me
- Content-Type: application/json
Send Body: on
- Body Content Type: JSON
- Specify Body: Using JSON

For the greeting node, the JSON body is:

{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Hi! I'm the bot. Send 'price' to see pricing, or anything else for help."
}

For the pricing node:

{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Our plans start at $49/month. Reply 'sales' to talk to a human."
}

For the fallback:

{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "I didn't catch that. Try 'hi' or 'price'."
}

The ={{ ... }} syntax is an n8n expression — at runtime it pulls values from earlier nodes.

Connect the Switch outputs to their matching HTTP Request nodes. Save the workflow. Click Activate in the top-right.

Send hi to your bot from any phone. It should reply within a second.

Congratulations — you have a WhatsApp bot running entirely on your own infrastructure.

Part 8: A Second Example — Proactive Booking Confirmations

Auto-reply is useful. Proactive outbound is where the value really compounds. Here's a second workflow that sends a booking confirmation whenever a new row lands in a database.

Create a second workflow in n8n. Use one of these triggers:

Schedule Trigger — poll a database every minute for new rows
Webhook Trigger — listen for a notification from your booking system
Database Trigger (Postgres, MySQL, Supabase) — react to inserts in real time

For this example, use a Schedule Trigger set to every minute, followed by a Postgres Execute Query node that reads pending confirmations:

SELECT id, customer_phone, service_name, booking_time
FROM bookings
WHERE confirmation_sent = false
LIMIT 20;

After the Postgres node, add an HTTP Request node pointing to the same WAHA sendText endpoint you used earlier. The body:

{
  "session": "default",
  "chatId": "={{ $json.customer_phone }}@c.us",
  "text": "Hi! Your booking for {{ $json.service_name }} on {{ $json.booking_time }} is confirmed. Reply 'change' to reschedule."
}

Finally, add a second Postgres node that marks the booking as sent:

UPDATE bookings
SET confirmation_sent = true, confirmation_sent_at = NOW()
WHERE id = {{ $json.id }};

Activate the workflow. Every minute, n8n pulls pending bookings, sends a WhatsApp confirmation, and marks them done.
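The poll, send, mark loop is easier to reason about when you see it in one place. The sketch below reproduces it in Python with two loud assumptions: it uses SQLite's in-memory database as a stand-in for Postgres so it runs anywhere, and send is a hypothetical callable where a real version would POST to WAHA's sendText endpoint. It also leaves out the confirmation_sent_at column to stay short.

```python
import sqlite3

def send_pending_confirmations(conn, send, limit=20):
    """Send a confirmation for each pending booking, then mark it as sent."""
    rows = conn.execute(
        "SELECT id, customer_phone, service_name, booking_time "
        "FROM bookings WHERE confirmation_sent = 0 LIMIT ?",
        (limit,),
    ).fetchall()
    for booking_id, phone, service, when in rows:
        text = f"Hi! Your booking for {service} on {when} is confirmed."
        send(f"{phone}@c.us", text)
        # Mark the row only after the send call has returned.
        conn.execute(
            "UPDATE bookings SET confirmation_sent = 1 WHERE id = ?",
            (booking_id,),
        )
    conn.commit()
    return len(rows)

# Demo with an in-memory database and a fake sender that records calls.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer_phone TEXT, "
    "service_name TEXT, booking_time TEXT, confirmation_sent INTEGER DEFAULT 0)"
)
conn.execute(
    "INSERT INTO bookings (customer_phone, service_name, booking_time) "
    "VALUES ('15555550123', 'Haircut', '2026-05-12 10:00')"
)
outbox = []

def record(chat_id, text):
    outbox.append((chat_id, text))

first = send_pending_confirmations(conn, record)
second = send_pending_confirmations(conn, record)
print(first, second, outbox[0][0])  # prints 1 0 15555550123@c.us
```

The second run sends nothing because the row is already marked, which is the property that makes a once-a-minute schedule safe to repeat.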

This pattern generalizes. Replace the SQL with a call to Shopify for order confirmations, Stripe for receipt messages, or Calendly for appointment reminders. The WhatsApp layer stays the same — only the source of truth changes.

Part 9: Going to Production

The setup above works, but it's not yet production-ready. Here's what to harden before you point real customers at it.

1. Put Everything Behind HTTPS

Never expose n8n or WAHA directly on plain HTTP. Put a reverse proxy in front. Caddy is the easiest choice because it handles Let's Encrypt automatically.

A minimal Caddyfile:

n8n.example.com {
    reverse_proxy n8n:5678
}

waha.example.com {
    reverse_proxy waha:3000
}

Run Caddy as another service in the same Docker Compose. TLS certificates are issued and renewed automatically.

2. Rotate the API Keys

Don't ship super-secret-key-change-me to production. Generate a real key:

openssl rand -hex 32

Put it in a .env file, reference it as ${WHATSAPP_API_KEY} in docker-compose.yml, and add .env to your .gitignore.
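If openssl isn't available on the machine where you generate secrets, Python's standard library can produce an equivalent key. This is a small sketch, not part of the WAHA tooling; it prints a line you could paste into the .env file.

```python
import secrets

# 32 random bytes rendered as 64 hex characters, like `openssl rand -hex 32`.
api_key = secrets.token_hex(32)
print(f"WHATSAPP_API_KEY={api_key}")
```

The secrets module draws from the operating system's cryptographic random source, so it is suitable for keys in a way that the random module is not.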

3. Rate-limit Outbound Messages

WhatsApp bans accounts that send too many messages too fast. A safe outbound rate for a fresh number is well under 20 messages per minute. For bursty replies, add an n8n Wait node between sends, or queue outgoing messages through a small custom function node that sleeps between requests.
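If you send from your own code instead of using a Wait node, the same idea is a minimum gap between sends. The sketch below is one simple way to do it in Python, under the assumption that a fixed spacing is good enough for your volume. The Throttle class is a hypothetical helper, and the fake clock exists only so the example runs instantly.

```python
import time

class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_gap

# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

throttle = Throttle(max_per_minute=10, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real sender would POST to sendText right after this
print(slept)  # prints [6.0, 6.0]
```

With max_per_minute=10, the first send goes out immediately and each later one waits six seconds, which stays well under the 20-per-minute ceiling mentioned above.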

4. Scale n8n with Queue Mode

By default, n8n runs everything in a single process. That's fine for low volume. For higher throughput, switch to Queue Mode:

Add a Redis container.
Run one n8n main container (the web UI and webhook receiver).
Run one or more n8n-worker containers that pull jobs from the queue.

Queue Mode is documented at docs.n8n.io/hosting/scaling/queue-mode/. Setup adds two environment variables (EXECUTIONS_MODE=queue, QUEUE_BULL_REDIS_HOST=redis) and decouples incoming webhooks from workflow execution. The webhook responds in milliseconds while workers chew through the queue in the background.

5. Monitor the Session

WhatsApp Web sessions drop. The phone loses connection, WhatsApp rotates security tokens, or your server reboots. Catch those drops early.

Subscribe to the session.status webhook event in WAHA. When status becomes FAILED or STOPPED, route it to an n8n workflow that posts to Slack, sends an email, or pages you. The faster you know, the faster you recover.

For overall uptime, point something like Uptime Kuma at GET /api/sessions/default on WAHA. If WAHA reports WORKING, you're fine. Anything else triggers an alert.

6. Back Up the Sessions Volume

The waha-sessions directory contains the logged-in state. If you lose it, you have to scan the QR code again — possibly from a phone that's no longer handy. Back it up nightly. A simple cron job with tar and rclone to S3-compatible storage is plenty.

7. Add a Live-Agent Handoff

Not every conversation should stay with the bot. When a user types human — or when your intent classifier can't answer confidently — hand off to a real agent.

Chatwoot is a solid open-source option: it has a dedicated WhatsApp channel, agent inbox, team assignment, and conversation history. The handoff is an n8n branch that stops processing bot replies and forwards the message stream to Chatwoot's API.

Common Pitfalls

A few issues catch almost everyone on their first production deploy.

Webhooks Timing Out

WAHA gives your webhook a few seconds to respond. If your n8n workflow is slow (calling an LLM, hitting a remote API), the webhook times out and WAHA retries, potentially causing duplicate replies.

Fix: make the webhook return 200 immediately and offload the slow work. In n8n, set the Webhook node's Response Mode to Using Respond to Webhook Node, add a Respond to Webhook node as the first step with a 200 and empty body, then do the heavy lifting after that.

Duplicate Messages

WAHA delivers the same message event more than once in edge cases (phone comes back online, session reconnects). Store the payload.id somewhere — Redis, a database, or n8n's static data store — and drop any ID you've already processed.
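The check itself is a few lines. Here is an in-memory Python sketch of the idea, assuming a single process and that remembering the most recent IDs is enough. SeenIds is a hypothetical helper; in production you would back it with Redis or a database so it survives restarts.

```python
from collections import OrderedDict

class SeenIds:
    """Remember the most recent message IDs, evicting the oldest when full."""

    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self._ids = OrderedDict()

    def first_time(self, message_id):
        """Return True the first time an ID is seen, False on repeats."""
        if message_id in self._ids:
            return False
        self._ids[message_id] = True
        if len(self._ids) > self.max_size:
            self._ids.popitem(last=False)  # drop the oldest entry
        return True

seen = SeenIds()
results = [seen.first_time(i) for i in ["msg-1", "msg-2", "msg-1"]]
print(results)  # prints [True, True, False]
```

Call first_time with payload.id at the top of your handler and return early when it reports a repeat.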

Messages Arriving Out of Order

The webhook is async, and n8n may parallelize executions. If ordering matters — for example, in a multi-step conversation — key a queue by the sender's chatId and process each sender serially.
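One way to get per-sender ordering in your own code is a lock per chatId: messages from different senders still run in parallel, but two messages from the same sender never overlap. The asyncio sketch below illustrates that approach under the assumption of a single process. The handle function and its delay argument are stand-ins for real work; this is not how n8n implements anything.

```python
import asyncio
from collections import defaultdict

async def main():
    locks = defaultdict(asyncio.Lock)  # one lock per chatId
    handled = []

    async def handle(chat_id, text, delay):
        async with locks[chat_id]:
            await asyncio.sleep(delay)  # stand-in for slow work (LLM call, API)
            handled.append((chat_id, text))

    await asyncio.gather(
        handle("111@c.us", "first", 0.02),
        handle("111@c.us", "second", 0.0),
        handle("222@c.us", "other", 0.0),
    )
    return handled

handled = asyncio.run(main())
print([text for chat_id, text in handled if chat_id == "111@c.us"])
# prints ['first', 'second']
```

Even though "second" needs no processing time, it waits for "first" because both belong to the same sender, while the message from the other sender is free to finish earlier.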

Sessions Disconnecting After a Phone Restart

Normal WhatsApp Web behavior. WAHA auto-reconnects, but occasionally the linked-devices list needs a manual refresh. If a session refuses to come back, stop the WAHA container, delete that session's folder under waha-sessions/, start the container again, and rescan the QR.

Your Number Gets Banned

The single biggest cause is rate: a new number blasting hundreds of messages an hour gets flagged fast. Warm up a number slowly — send a normal, human-like volume for the first week. Don't send to strangers unsolicited. Prefer inbound-driven replies over outbound pushes wherever you can.

The Wrong Chat ID Format

WhatsApp individual chats use @c.us and groups use @g.us. Don't include the + or spaces in the number. If WAHA returns a 404 when sending, the chat ID is almost always the problem.

Where to Go Next

You now have the foundation. The same two-service stack supports almost any bot you can imagine — you're only limited by what you can build in an n8n workflow.

Some natural next steps:

Plug in AI replies: Add an OpenAI or Anthropic node after the Webhook, pass the user's message through it with a short system prompt, and send the response back through WAHA. Cap conversation length to prevent runaway token usage.
Integrate a CRM: Look up the caller's chatId in HubSpot, Pipedrive, or your own database before deciding how to reply. Segment responses by customer tier.
Send proactive notifications: Appointment reminders, shipping updates, payment receipts, abandoned-cart nudges. Keep the content transactional and expected — unsolicited marketing blasts are the fastest way to a ban.
Log every conversation: Add a Postgres or Supabase node after the Webhook to persist messages for analytics and customer history. Your future self (and your support team) will thank you.
Add media handling: WAHA exposes sendImage, sendFile, and sendVoice endpoints. Teach the bot to accept photos for support tickets, or send invoices as PDFs directly inside the chat.

The WhatsApp layer stays the same. Everything interesting happens upstream in the workflow.

If you want to see production examples of n8n and WAHA running at scale — or you need a similar automation built for your business — I'm the founder of Achiya Automation, where we ship WhatsApp, n8n, and Chatwoot integrations. You can find more at achiya-automation.com.

Reclaim Your Time – Master Automation with Zapier

Beau Carnes — Tue, 21 Apr 2026 14:57:45 +0000

Do you ever spend a lot of time doing small repetitive tasks, like copying data from an email into a spreadsheet or manually moving files between folders?

We just posted a new course on the freeCodeCamp.org YouTube channel, led by instructor and developer Estafania, that will show you how to use automation to take those tasks off your plate.

Zapier is a no-code platform that allows you to connect and share information between the applications you use every day. The core philosophy is simple: "If this happens, then do that".

The Trigger: This is the "If this happens" part. It's a specific event in one app (like receiving a new lead in a form).
The Action: This is the "do that" part. This is the task Zapier performs automatically in another app (like sending a Slack notification or adding a row to a Google Sheet).

This four-hour course takes you from a complete beginner to an advanced user. You will start by setting up a free account and learning the basic building blocks of a "Zap". As you progress, you will dive into modern, AI-enhanced features.

And for people looking to bridge the gap between AI and development, the course concludes with a deep dive into Model Context Protocol (MCP). You will learn how to set up an MCP server to share information from your apps with AI clients like Visual Studio Code and the Gemini CLI. This allows you to interact with your Google Calendar or GitHub repositories directly through an AI interface.

Watch the full course on the freeCodeCamp.org YouTube channel (4-hour watch).

How to Build a Local SEO Audit Agent with Browser Use and Claude API

Daniel Nwaneri — Mon, 30 Mar 2026 23:37:08 +0000

Every digital marketing agency has someone whose job involves opening a spreadsheet, visiting each client URL, checking the title tag, meta description, and H1, noting broken links, and pasting everything into a report. Then doing it again next week.

That work is deterministic. An agent can do it.

In this tutorial, you'll build a local SEO audit agent from scratch using Python, Browser Use, and the Claude API. The agent visits real pages in a visible browser window, extracts SEO signals using Claude, checks for broken links asynchronously, handles edge cases with a human-in-the-loop pause, and writes a structured report — all resumable if interrupted.

By the end, you'll have a working agent you can run against any list of URLs. It costs less than $0.01 per URL to run.

What You'll Build

A seven-module Python agent that:

Reads a URL list from a CSV file
Visits each URL in a real Chromium browser (not a headless scraper)
Extracts title, meta description, H1s, and canonical tag via Claude API
Checks for broken links asynchronously using httpx
Detects edge cases (404s, login walls, redirects) and pauses for human input
Writes results to report.json incrementally — safe to interrupt and resume
Generates a plain-English report-summary.txt on completion

The full code is on GitHub at dannwaneri/seo-agent.

Prerequisites

Python 3.11 or higher
An Anthropic API key (get one at console.anthropic.com)
Windows, macOS, or Linux
Basic familiarity with Python and the command line

Why Browser Use Instead of a Scraper
Project Structure
Setup
Module 1: State Management
Module 2: Browser Integration
Module 3: Claude Extraction Layer
Module 4: Broken Link Checker
Module 5: Human-in-the-Loop
Module 6: Report Writer
Module 7: The Main Loop
Running the Agent
Scheduling for Agency Use
What the Results Look Like

Why Browser Use Instead of a Scraper

The standard approach to SEO auditing is to fetch page HTML with requests and parse it with BeautifulSoup. That works on static pages. It breaks on JavaScript-rendered content, misses dynamically injected meta tags, and fails entirely on authenticated pages.

Browser Use (84,000+ GitHub stars, MIT license) takes a different approach. It controls a real Chromium browser, reads the DOM after JavaScript executes, and exposes the page through Playwright's accessibility tree. The agent sees what a human would see.

The practical difference: a requests-based scraper might miss a meta description injected by a React component. Browser Use won't.

The other difference worth naming: Browser Use reads pages semantically. A Playwright script breaks when a button's CSS class changes from btn-primary to button-main. Browser Use identifies it's still a "Submit" button and acts accordingly. The extraction logic lives in the Claude prompt, not in brittle CSS selectors.
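
To make the scraper limitation concrete, here is a minimal sketch that uses only the standard library's html.parser as a stand-in for a requests + BeautifulSoup pipeline. The sample markup and the MetaDescriptionParser class are hypothetical, made up for illustration. The point is only that a static parser never runs the script, so the injected tag is not there to find.

```python
from html.parser import HTMLParser

# Hypothetical page: the meta description is only added by JavaScript at runtime.
STATIC_HTML = """
<html><head><title>Pricing</title>
<script>
  const m = document.createElement('meta');
  m.name = 'description';
  m.content = 'Plans and pricing';
  document.head.appendChild(m);
</script>
</head><body><h1>Pricing</h1></body></html>
"""


class MetaDescriptionParser(HTMLParser):
    """Collects the content of <meta name="description"> from static markup."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "description":
            self.description = attributes.get("content")


parser = MetaDescriptionParser()
parser.feed(STATIC_HTML)
static_description = parser.description
print(static_description)  # the static parse finds no description tag
```

A browser that executes the script would see the tag in the DOM, which is the case the rest of this tutorial is built around.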

Project Structure

seo-agent/
├── index.py          # Main audit loop
├── browser.py        # Browser Use / Playwright page driver
├── extractor.py      # Claude API extraction layer
├── linkchecker.py    # Async broken link checker
├── hitl.py           # Human-in-the-loop pause logic
├── reporter.py       # Report writer
├── state.py          # State persistence (resume on interrupt)
├── input.csv         # Your URL list
├── requirements.txt
├── .env.example
└── .gitignore

Setup

Create a project folder and install dependencies:

mkdir seo-agent && cd seo-agent
pip install browser-use anthropic playwright httpx
playwright install chromium

Create input.csv with your URLs:

url
https://example.com
https://example.com/about
https://example.com/contact

Create .env.example:

ANTHROPIC_API_KEY=your-key-here

Set your API key as an environment variable before running:

# macOS/Linux
export ANTHROPIC_API_KEY="sk-ant-..."

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-..."

Create .gitignore:

state.json
report.json
report-summary.txt
.env
__pycache__/
*.pyc

Module 1: State Management

The agent needs to track which URLs it has already audited. If the run is interrupted — power cut, keyboard interrupt, network error — it should resume from where it stopped, not start over.

state.py handles this with a flat JSON file:

import json
import os

STATE_FILE = os.path.join(os.path.dirname(__file__), "state.json")

_DEFAULT_STATE = {"audited": [], "pending": [], "needs_human": []}


def load_state() -> dict:
    if not os.path.exists(STATE_FILE):
        save_state(_DEFAULT_STATE.copy())
    with open(STATE_FILE, encoding="utf-8") as f:
        return json.load(f)


def save_state(state: dict) -> None:
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def is_audited(url: str) -> bool:
    return url in load_state()["audited"]


def mark_audited(url: str) -> None:
    state = load_state()
    if url not in state["audited"]:
        state["audited"].append(url)
    save_state(state)


def add_to_needs_human(url: str) -> None:
    state = load_state()
    if url not in state["needs_human"]:
        state["needs_human"].append(url)
    save_state(state)

The design is intentional: mark_audited() is called immediately after a URL is processed and written to the report. If the agent crashes mid-run, it loses at most one URL's work.
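
One gap worth knowing about: save_state() rewrites state.json in place, so a crash in the middle of the write could leave a truncated file. Below is a minimal sketch of an atomic variant, assuming you are free to change state.py. The name save_state_atomic is hypothetical and not part of the original repo.

```python
import json
import os
import tempfile


def save_state_atomic(state: dict, path: str) -> None:
    """Write state to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
        # os.replace swaps the file in atomically when both paths
        # are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file is created in the same directory as the target on purpose, so the final rename never crosses a filesystem boundary.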

Module 2: Browser Integration

browser.py does the actual page navigation. It uses Playwright directly (which Browser Use installs as a dependency) to open a visible Chromium window, navigate to the URL, capture HTTP status and redirect information, and extract the raw SEO signals from the DOM.

The key design decisions:

Visible browser, not headless. Set headless=False so you can watch the agent work. This matters for the demo and for debugging.

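Status capture via response listener. page.goto() does not raise on 4xx/5xx responses; it raises only on network failures, timeouts, and similar errors. The on("response", ...) handler records the status of the first response as soon as it arrives, so the code is still available if navigation later times out or fails.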

2-second delay between visits. Prevents triggering rate limiting or bot detection on agency client sites.

Here is the core navigation function:

import asyncio
import sys
import time
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

TIMEOUT = 20_000  # 20 seconds


def fetch_page(url: str) -> dict:
    result = {
        "final_url": url,
        "status_code": None,
        "title": None,
        "meta_description": None,
        "h1s": [],
        "canonical": None,
        "raw_links": [],
    }

    first_status = {"code": None}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        def on_response(response):
            if first_status["code"] is None:
                first_status["code"] = response.status

        page.on("response", on_response)

        try:
            page.goto(url, wait_until="domcontentloaded", timeout=TIMEOUT)
            result["status_code"] = first_status["code"] or 200
            result["final_url"] = page.url

            # Extract SEO signals from DOM
            result["title"] = page.title() or None
            result["meta_description"] = page.evaluate(
                "() => { const m = document.querySelector('meta[name=\"description\"]'); "
                "return m ? m.getAttribute('content') : null; }"
            )
            result["h1s"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('h1')).map(h => h.innerText.trim())"
            )
            result["canonical"] = page.evaluate(
                "() => { const c = document.querySelector('link[rel=\"canonical\"]'); "
                "return c ? c.getAttribute('href') : null; }"
            )
            result["raw_links"] = page.evaluate(
                "() => Array.from(document.querySelectorAll('a[href]'))"
                ".map(a => a.href).filter(Boolean).slice(0, 100)"
            )

        except PlaywrightTimeout:
            result["status_code"] = first_status["code"] or 408
        except Exception as exc:
            print(f"[browser] Error: {exc}", file=sys.stderr)
            result["status_code"] = first_status["code"]
        finally:
            browser.close()

    time.sleep(2)
    return result

A few things worth noting:

The raw_links cap at 100 is deliberate. DEV.to profile pages have hundreds of links — you don't need all of them for broken link detection.

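The wait_until="domcontentloaded" setting is faster than networkidle and is usually sufficient for meta tag extraction, because the tags are read from the live DOM rather than from the raw HTML. Tags that a client-side framework injects after DOMContentLoaded can still be missed. If that happens on a site you audit, wait for "load" or "networkidle" instead.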

Module 3: Claude Extraction Layer

extractor.py takes the raw page snapshot from browser.py and calls Claude to produce a structured SEO audit result.

This is where most tutorials go wrong. They either write complex parsing logic in Python (fragile) or ask Claude for a free-form response and try to parse prose (unreliable). The right approach: give Claude a strict JSON schema and tell it to return nothing else.

The prompt engineering that makes this reliable:

import json
import os
import sys
from datetime import datetime, timezone
import anthropic

MODEL = "claude-sonnet-4-20250514"
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))


def _strip_fences(text: str) -> str:
    """Remove accidental markdown code fences from Claude's response."""
    text = text.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop opening fence
        lines = lines[1:] if lines[0].startswith("```") else lines
        # Drop closing fence
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    return text


def extract(snapshot: dict) -> dict:
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise OSError("ANTHROPIC_API_KEY is not set.")

    prompt = f"""You are an SEO auditor. Analyze this page snapshot and return ONLY a JSON object.
No prose. No explanation. No markdown fences. Raw JSON only.

Page data:
- URL: {snapshot.get('final_url')}
- Status code: {snapshot.get('status_code')}
- Title: {snapshot.get('title')}
- Meta description: {snapshot.get('meta_description')}
- H1 tags: {snapshot.get('h1s')}
- Canonical: {snapshot.get('canonical')}

Return this exact schema:
{{
  "url": "string",
  "final_url": "string",
  "status_code": number,
  "title": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "description": {{"value": "string or null", "length": number, "status": "PASS or FAIL"}},
  "h1": {{"count": number, "value": "string or null", "status": "PASS or FAIL"}},
  "canonical": {{"value": "string or null", "status": "PASS or FAIL"}},
  "flags": ["array of strings describing specific issues"],
  "human_review": false,
  "audited_at": "ISO timestamp"
}}

PASS/FAIL rules:
- title: FAIL if null or length > 60 characters
- description: FAIL if null or length > 160 characters  
- h1: FAIL if count is 0 (missing) or count > 1 (multiple)
- canonical: FAIL if null
- flags: list every failing field with a clear description
- audited_at: use current UTC time in ISO 8601 format"""

    response = client.messages.create(
        model=MODEL,
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )

    raw = response.content[0].text
    clean = _strip_fences(raw)

    try:
        return json.loads(clean)
    except json.JSONDecodeError as exc:
        print(f"[extractor] JSON parse error: {exc}", file=sys.stderr)
        return _error_result(snapshot, str(exc))


def _error_result(snapshot: dict, reason: str) -> dict:
    return {
        "url": snapshot.get("final_url", ""),
        "final_url": snapshot.get("final_url", ""),
        "status_code": snapshot.get("status_code"),
        "title": {"value": None, "length": 0, "status": "ERROR"},
        "description": {"value": None, "length": 0, "status": "ERROR"},
        "h1": {"count": 0, "value": None, "status": "ERROR"},
        "canonical": {"value": None, "status": "ERROR"},
        "flags": [f"Extraction error: {reason}"],
        "human_review": True,
        "audited_at": datetime.now(timezone.utc).isoformat(),
    }

Two things make this reliable in production:

First, _strip_fences() handles the case where Claude wraps its response in ```json fences despite being told not to. This happens occasionally with Sonnet and consistently breaks json.loads() if you don't handle it.

Second, the _error_result() fallback means the agent never crashes on a bad Claude response — it logs the error and marks the URL for human review, then continues to the next URL.
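
Because the PASS/FAIL verdicts and character counts come back from the model, you may want a deterministic cross-check on the two length rules. The helper below is a sketch, not part of the original repo: recheck_lengths is a hypothetical name, and it assumes the result dict follows the schema in the prompt above.

```python
TITLE_MAX = 60          # from the PASS/FAIL rules in the prompt
DESCRIPTION_MAX = 160   # from the PASS/FAIL rules in the prompt


def recheck_lengths(result: dict) -> dict:
    """Recompute length and status for title and description in Python."""
    for field, limit in (("title", TITLE_MAX), ("description", DESCRIPTION_MAX)):
        entry = result.get(field) or {}
        value = entry.get("value")
        length = len(value) if value else 0
        entry["length"] = length
        entry["status"] = "PASS" if value and length <= limit else "FAIL"
        result[field] = entry
    return result


# Hypothetical model output where the title length was miscounted
sample = {
    "title": {"value": "x" * 72, "length": 58, "status": "PASS"},
    "description": {"value": "Short summary.", "length": 14, "status": "PASS"},
}
checked = recheck_lengths(sample)
print(checked["title"]["status"], checked["description"]["status"])
```

If you adopt something like this, apply it only to successful extractions, since it would overwrite the "ERROR" status that _error_result() sets.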

Cost: Claude Sonnet 4 is priced at $3 per million input tokens and $15 per million output tokens. A typical page snapshot is around 500 input tokens; the structured JSON response is around 300 output tokens. That works out to roughly $0.006 per URL — about $0.12 for a 20-URL audit.
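
The arithmetic behind that estimate, written out as a small sketch. The prices and token counts are the ones quoted above; cost_per_url is a hypothetical helper name, and your real token counts will vary with page size.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens, as quoted above


def cost_per_url(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one extraction call."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


per_url = cost_per_url(input_tokens=500, output_tokens=300)
print(f"per URL: ${per_url:.4f}")
print(f"20 URLs: ${per_url * 20:.2f}")
```

Output tokens dominate: 300 output tokens cost three times as much as 500 input tokens, so keeping the JSON schema compact matters more than trimming the prompt.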

Module 4: Broken Link Checker

linkchecker.py takes the raw_links list from the browser snapshot and checks same-domain links for broken status using async HEAD requests.

The design choices:

Same-domain only. Checking every external link on a page would take minutes and isn't what agency clients need. Filter to links on the same domain as the page being audited.
HEAD requests, not GET. Faster, lower bandwidth, sufficient for status code detection.
Cap at 50 links. Pages like DEV.to article listings have hundreds of internal links. Checking all of them would dominate the runtime.
Concurrent requests via asyncio. All links are checked in parallel, not sequentially.

import asyncio
import logging
from urllib.parse import urlparse
import httpx

CAP = 50
TIMEOUT = 5.0
logger = logging.getLogger(__name__)


def _same_domain(link: str, final_url: str) -> bool:
    if not link:
        return False
    lower = link.strip().lower()
    if lower.startswith(("#", "mailto:", "javascript:", "tel:", "data:")):
        return False
    try:
        page_host = urlparse(final_url).netloc.lower()
        parsed = urlparse(link)
        return parsed.scheme in ("http", "https") and parsed.netloc.lower() == page_host
    except Exception:
        return False


async def _check_link(client: httpx.AsyncClient, url: str) -> tuple[str, bool]:
    try:
        resp = await client.head(url, follow_redirects=True, timeout=TIMEOUT)
        return url, resp.status_code != 200
    except Exception:
        return url, True  # Timeout or connection error = broken


async def _run_checks(links: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*[_check_link(client, url) for url in links])
    return [url for url, broken in results if broken]


def check_links(raw_links: list[str], final_url: str) -> dict:
    same_domain = [l for l in raw_links if _same_domain(l, final_url)]

    capped = len(same_domain) > CAP
    if capped:
        logger.warning("Page has %d same-domain links — capping at %d.", len(same_domain), CAP)
        same_domain = same_domain[:CAP]

    broken = asyncio.run(_run_checks(same_domain))

    return {
        "broken": broken,
        "count": len(broken),
        "status": "FAIL" if broken else "PASS",
        "capped": capped,
    }
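
asyncio.gather() in _run_checks() starts every request at once, up to the cap of 50. If a client site is sensitive to bursts, one option is to bound concurrency with a semaphore. This is a sketch under assumptions: check_all, MAX_CONCURRENT, and the fake_check stand-in are hypothetical, and in the real module you would pass a coroutine that wraps _check_link().

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT = 10  # hypothetical limit; tune per site


async def check_all(
    urls: list[str],
    check: Callable[[str], Awaitable[bool]],
    limit: int = MAX_CONCURRENT,
) -> list[str]:
    """Run check(url) for every URL, at most `limit` at a time; return broken URLs."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> tuple[str, bool]:
        async with semaphore:
            return url, await check(url)

    results = await asyncio.gather(*(guarded(u) for u in urls))
    return [url for url, is_broken in results if is_broken]


async def fake_check(url: str) -> bool:
    # Stand-in for a real HEAD request: URLs containing "missing" count as broken
    await asyncio.sleep(0)
    return "missing" in url


broken = asyncio.run(check_all(
    ["https://example.com/", "https://example.com/missing-page"],
    fake_check,
))
print(broken)
```

All tasks are still created up front, but only `limit` of them hold the semaphore and talk to the server at any moment.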

Module 5: Human-in-the-Loop

This is the part most automation tutorials skip. What happens when the agent hits a login wall? A page that returns 403? A URL that redirects to a "Subscribe to continue reading" page?

Most scripts either crash or silently skip. Neither is acceptable in an agency context.

hitl.py handles this with two functions: one that detects whether a pause is needed, and one that handles the pause itself.

from state import add_to_needs_human

LOGIN_KEYWORDS = {"login", "sign in", "sign-in", "access denied", "log in", "unauthorized"}
REDIRECT_CODES = {301, 302, 307, 308}


def should_pause(snapshot: dict) -> bool:
    code = snapshot.get("status_code")

    # Navigation failed entirely
    if code is None:
        return True

    # Non-200, non-redirect
    if code != 200 and code not in REDIRECT_CODES:
        return True

    # Login wall detection
    title = (snapshot.get("title") or "").lower()
    h1s = [h.lower() for h in (snapshot.get("h1s") or [])]

    if any(kw in title for kw in LOGIN_KEYWORDS):
        return True
    if any(kw in h1 for kw in LOGIN_KEYWORDS for h1 in h1s):
        return True

    return False


def pause_reason(snapshot: dict) -> str:
    code = snapshot.get("status_code")
    if code is None:
        return "Navigation failed (None status)"
    if code != 200 and code not in REDIRECT_CODES:
        return f"Unexpected status code: {code}"
    return "Possible login wall detected"


def pause_and_prompt(url: str, reason: str) -> str:
    print(f"\n⚠️  HUMAN REVIEW NEEDED")
    print(f"   URL:    {url}")
    print(f"   Reason: {reason}")
    print(f"   Options: [s] skip  [r] retry  [q] quit\n")

    while True:
        choice = input("Your choice: ").strip().lower()
        if choice in ("s", "r", "q"):
            return {"s": "skip", "r": "retry", "q": "quit"}[choice]
        print("   Enter s, r, or q.")

The should_pause() function catches four cases: navigation failure, unexpected HTTP status, login keywords in the title, and login keywords in H1 tags. The login keyword check is what catches "Please sign in to continue" pages that return 200 but are effectively inaccessible.
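
One caveat with plain substring matching is that a keyword can match inside unrelated text: "log in" is a substring of "catalog in". If that causes false pauses for you, a word-boundary regex is one possible refinement. The sketch below reuses LOGIN_KEYWORDS from hitl.py; mentions_login is a hypothetical helper, not part of the original repo.

```python
import re

LOGIN_KEYWORDS = {"login", "sign in", "sign-in", "access denied", "log in", "unauthorized"}

# One pattern that matches any keyword, but only at word boundaries
_LOGIN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(kw) for kw in sorted(LOGIN_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def mentions_login(text: str) -> bool:
    """True if the text contains a login keyword as a whole word or phrase."""
    return bool(_LOGIN_RE.search(text or ""))


real_wall = mentions_login("Please sign in to continue")
false_alarm = mentions_login("Product catalog in PDF")
substring_hit = "log in" in "product catalog in pdf"
print(real_wall, false_alarm, substring_hit)
```

The first and third values are true and the second is false: the regex still catches the real login wall, while the catalog title only trips the substring check.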

In --auto mode (for scheduled runs), the main loop skips the pause_and_prompt() call and automatically handles these cases by logging the URL to needs_human[] in state and continuing.

Module 6: Report Writer

reporter.py writes results incrementally. This is important: results are written after each URL is audited, not batched at the end. If the run is interrupted, you don't lose completed work.

import json
import os
from datetime import datetime, timezone

REPORT_JSON = os.path.join(os.path.dirname(__file__), "report.json")
REPORT_TXT = os.path.join(os.path.dirname(__file__), "report-summary.txt")


def _load_report() -> list:
    if not os.path.exists(REPORT_JSON):
        return []
    with open(REPORT_JSON, encoding="utf-8") as f:
        return json.load(f)


def write_result(result: dict) -> None:
    """Append or update a result in report.json."""
    entries = _load_report()
    url = result.get("url", "")

    # Update existing entry if URL already present (handles retries)
    for i, entry in enumerate(entries):
        if entry.get("url") == url:
            entries[i] = result
            break
    else:
        entries.append(result)

    with open(REPORT_JSON, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2, ensure_ascii=False)


def _is_overall_pass(result: dict) -> bool:
    fields = ["title", "description", "h1", "canonical"]
    for field in fields:
        if result.get(field, {}).get("status") not in ("PASS",):
            return False
    if result.get("broken_links", {}).get("status") == "FAIL":
        return False
    return True


def write_summary() -> None:
    entries = _load_report()
    passed = sum(1 for e in entries if _is_overall_pass(e))

    lines = []
    for entry in entries:
        overall = "PASS" if _is_overall_pass(entry) else "FAIL"
        failed_fields = [
            f for f in ["title", "description", "h1", "canonical", "broken_links"]
            if entry.get(f, {}).get("status") == "FAIL"
        ]
        suffix = f" [{', '.join(failed_fields)}]" if failed_fields else ""
        lines.append(f"{entry.get('url', 'unknown'):<60} | {overall}{suffix}")

    lines.append("")
    lines.append(f"{passed}/{len(entries)} URLs passed")

    with open(REPORT_TXT, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

The deduplication in write_result() handles retries cleanly. If a URL is retried after a human reviews a login wall and authenticates, the new result replaces the old one rather than creating a duplicate entry.
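
The replace-or-append logic relies on Python's for/else, which is easy to misread. Here is the same pattern on an in-memory list, as a sketch; upsert is a hypothetical name used only for this illustration.

```python
def upsert(entries: list[dict], result: dict) -> list[dict]:
    """Replace the entry with the same URL, or append if there is none."""
    for i, entry in enumerate(entries):
        if entry.get("url") == result.get("url"):
            entries[i] = result
            break
    else:
        # The else branch of a for loop runs only when no break happened
        entries.append(result)
    return entries


report = [{"url": "https://example.com", "status_code": 403}]
upsert(report, {"url": "https://example.com", "status_code": 200})        # retry replaces
upsert(report, {"url": "https://example.com/about", "status_code": 200})  # new URL appends
print(len(report), report[0]["status_code"])
```

Note that the match is on the "url" field of the result, so the deduplication only works when a retry produces the same "url" value as the first attempt.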

Module 7: The Main Loop

index.py wires everything together. It reads the URL list, loads state, skips already-audited URLs, and runs the audit loop.

import csv
import os
import sys
import time
import argparse

from state import load_state, is_audited, mark_audited, add_to_needs_human
from browser import fetch_page
from extractor import extract
from linkchecker import check_links
from hitl import should_pause, pause_reason, pause_and_prompt
from reporter import write_result, write_summary

INPUT_CSV = os.path.join(os.path.dirname(__file__), "input.csv")


def read_urls(path: str) -> list[str]:
    with open(path, newline="", encoding="utf-8") as f:
        return [row["url"].strip() for row in csv.DictReader(f) if row.get("url", "").strip()]


def run(auto: bool = False):
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("Error: ANTHROPIC_API_KEY environment variable is not set.")
        sys.exit(1)

    urls = read_urls(INPUT_CSV)
    pending = [u for u in urls if not is_audited(u)]

    print(f"Starting audit: {len(pending)} pending, {len(urls) - len(pending)} already done.\n")

    total = len(urls)

    try:
        for i, url in enumerate(pending, start=1):
            position = urls.index(url) + 1
            print(f"[{position}/{total}] {url}", end=" -> ", flush=True)

            # Browser navigation
            snapshot = fetch_page(url)

            # Human-in-the-loop check
            if should_pause(snapshot):
                reason = pause_reason(snapshot)

                if auto:
                    print(f"AUTO-SKIPPED ({reason})")
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue

                action = pause_and_prompt(url, reason)
                if action == "quit":
                    print("Exiting.")
                    break
                elif action == "skip":
                    add_to_needs_human(url)
                    mark_audited(url)
                    continue
                # "retry" falls through to re-fetch below
                snapshot = fetch_page(url)

            # Claude extraction
            result = extract(snapshot)

            # Broken link check
            links = check_links(snapshot.get("raw_links", []), snapshot.get("final_url", url))
            result["broken_links"] = links

            # Write result immediately
            write_result(result)
            mark_audited(url)

            overall = "PASS" if all(
                result.get(f, {}).get("status") == "PASS"
                for f in ["title", "description", "h1", "canonical"]
            ) and links["status"] == "PASS" else "FAIL"

            print(overall)

    except KeyboardInterrupt:
        print("\n\nInterrupted. Progress saved. Re-run to continue.")
        return

    write_summary()
    print("\nAudit complete. Report saved to report.json and report-summary.txt")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--auto", action="store_true", help="Auto-skip URLs requiring human review")
    args = parser.parse_args()
    run(auto=args.auto)

The KeyboardInterrupt handler gives you a clean exit; the resume mechanism itself is the state file. When you press Ctrl+C, the handler prints a message and returns without writing the summary. Because mark_audited() is called right after write_result() for each URL, the next run skips everything already processed.
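
Note that is_audited() reloads state.json on every call, which is fine for a few dozen URLs. For longer lists, one option is to load the state once and filter against a set. The sketch below assumes that approach; pending_urls is a hypothetical helper, not part of the original repo.

```python
def pending_urls(urls: list[str], audited: list[str]) -> list[str]:
    """Return URLs not yet audited, preserving input order and dropping repeats."""
    done = set(audited)
    seen = set()
    pending = []
    for url in urls:
        if url in done or url in seen:
            continue
        seen.add(url)
        pending.append(url)
    return pending


todo = pending_urls(
    ["https://example.com", "https://example.com/about", "https://example.com"],
    audited=["https://example.com/about"],
)
print(todo)
```

In index.py this would be called as pending_urls(urls, load_state()["audited"]), so the state file is read once per run instead of once per URL.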

Running the Agent

Interactive mode (pauses on edge cases):

python index.py

Auto mode (skips edge cases, adds to needs_human[]):

python index.py --auto

When it runs, you'll see the browser window open for each URL and the terminal print progress:

Starting audit: 7 pending, 0 already done.

[1/7] https://example.com -> PASS
[2/7] https://example.com/about -> FAIL
[3/7] https://example.com/contact -> AUTO-SKIPPED (Unexpected status code: 404)
...
Audit complete. Report saved to report.json and report-summary.txt

To resume after an interruption:

python index.py --auto
# Starting audit: 4 pending, 3 already done.

Scheduling for Agency Use

For recurring weekly audits, create a batch file and schedule it with Windows Task Scheduler.

Create run-audit.bat:

@echo off
set ANTHROPIC_API_KEY=your-key-here
cd /d C:\Users\yourname\Desktop\seo-agent
python index.py --auto

In Windows Task Scheduler:

Create a new Basic Task
Set the trigger to Weekly, Monday at 7:00 AM
Set the action to "Start a program"
Browse to your run-audit.bat file

Check report-summary.txt on Monday morning. URLs in needs_human[] in state.json need manual review — login walls, paywalls, or pages that returned unexpected status codes.

For macOS/Linux, use cron:

# Run every Monday at 7am
0 7 * * 1 cd /path/to/seo-agent && ANTHROPIC_API_KEY=your-key python index.py --auto
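
After a scheduled run, the URLs waiting for manual review sit in the needs_human list in state.json. A few lines are enough to print them; this is a sketch, and urls_needing_review is a hypothetical helper that assumes the state file layout from Module 1.

```python
import json
import os


def urls_needing_review(state_path: str = "state.json") -> list[str]:
    """Return the needs_human list from the state file, or [] if there is no file."""
    if not os.path.exists(state_path):
        return []
    with open(state_path, encoding="utf-8") as f:
        return json.load(f).get("needs_human", [])


for url in urls_needing_review():
    print(f"needs review: {url}")
```

Run it from the seo-agent folder, or pass the full path to state.json.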

What the Results Look Like

I ran this agent against seven of my own published pages across Hashnode, freeCodeCamp, and DEV.to. Every single one failed.

https://hashnode.com/@dannwaneri                    | FAIL [h1]
https://freecodecamp.org/news/claude-code-skill     | FAIL [description]
https://freecodecamp.org/news/stop-letting-ai-guess | FAIL [description]
https://freecodecamp.org/news/rag-system-handbook   | FAIL [title, description]
https://freecodecamp.org/news/author/dannwaneri     | FAIL [description]
https://dev.to/dannwaneri/gatekeeping-panic         | FAIL [title]
https://dev.to/dannwaneri/production-rag-system     | FAIL [title]

0/7 URLs passed

The freeCodeCamp description issues are partly platform-level: freeCodeCamp's template sometimes truncates or omits meta descriptions for article listing pages. The DEV.to title issues are mine. Article titles that work as headlines often exceed 60 characters in the title tag.

A note on the 60-character title rule: this is a display threshold, not a ranking penalty. Google indexes titles of any length. The 60-character guideline reflects approximately how many characters fit in a desktop SERP result before truncation. Titles over 60 characters often still rank; they just get cut off in search results, which can hurt click-through rate. The agent flags display risk, not a ranking violation.

Next Steps

The agent as built handles the core SEO audit workflow. Obvious extensions:

Performance metrics: add a Lighthouse or PageSpeed Insights API call per URL
Structured data validation: check for JSON-LD schema markup and validate it
Email delivery: send report-summary.txt via SMTP after the run completes
Multi-client support: separate input.csv files per client, separate report directories

The full code including all seven modules is at dannwaneri/seo-agent (https://github.com/dannwaneri/seo-agent). Clone it, add your URLs, and run it.

If you found this useful, I write about practical AI agent setups for developers and agencies at DEV.to/@dannwaneri (https://dev.to/dannwaneri). The DEV.to companion piece covers the design decisions behind the agent: why HITL matters, why Browser Use over scrapers, and what the audit results mean for your own published content.
</article> <article> <h1> How to Find Any File on Windows Like a Linux User (using Windows Powershell) </h1> Piotr "NotBlackMagic" Opoka — Wed, 25 Mar 2026 16:00:00 +0000 Sometimes you might struggle to find a file or program when you have no idea where it could be saved or installed. And the Windows user interface may not always give you the results you want. If that's the case for you, you're in the right place. <code>Get-ChildItem</code> (also known as <code>gci</code>, <code>ls</code>, <code>dir</code> ) is a very powerful command. And one of its most iconic uses is to find/search for a file. It's more precise and more reliable than Windows Explorer. It even has better filtering options that show the results that are more relevant to you. In this tutorial, you'll learn how to use <code>gci</code> and how to combine it with other commands so that it becomes an even more powerful tool. Remember to enable copy-pasting in Windows PowerShell, so it's easier for you to follow along. You can see how to enable it <a href="https://notblackmagic.hashnode.dev/enable-copy-pasting-in-windows-powershell-cli-in-3-steps">here</a>. <h3 id="heading-what-well-cover">What we'll cover:</h3> <ol> <li><a href="#heading-1-basic-explanation-of-the-get-childitem-command">Basic explanation of the Get-ChildItem command</a> <ul> <li><a href="#heading-most-used-examples-of-searching-by-gci-command">Most used examples of searching by gci command</a></li> </ul> </li> <li><a href="#heading-2-setup-for-other-more-complex-examples">Setup for other more complex examples</a> </li> <li><a href="#heading-3-when-is-the-path-option-not-needed">When is the -Path option not needed?</a> </li> <li><a href="#heading-4-advanced-searching-combining-getchildren-with-the-whereobject-command">Advanced Searching – Combining Get-ChildItem with the Where-Object Command</a> <ul> <li><a href="#heading-41-how-to-search-through-only-a-particular-directory">4.1. 
How to search through only a particular directory</a> </li> <li><a href="#heading-42-how-to-search-while-excluding-a-particular-directory">4.2. How to search while excluding a particular directory</a> </li> <li><a href="#heading-43-searching-only-1-directory-from-many-with-exactly-the-same-name">4.3 Searching only 1 directory from many with exactly the same name</a> </li> <li><a href="#heading-44-filter-how-deep-how-many-folders-in-you-want-to-search-for-the-file">4.4 Filter how deep (how many folders in) you want to search for the file</a> </li> </ul> </li> <li><a href="#heading-5-how-to-search-through-hidden-files">How to Search Through Hidden Files</a> </li> <li><a href="#heading-6-how-can-you-know-all-the-properties-that-you-can-use-as-a-filter">How can you know all the properties that you can use as a filter?</a> <ul> <li><a href="#heading-how-to-retrieve-only-1-desired-property">How to retrieve only 1 desired property</a></li> </ul> </li> <li><a href="#heading-7-i-dont-know-the-files-name-but-i-know-whats-inside-it-how-do-i-find-the-file-by-its-content">I don't know the file’s name, but I know what's inside it. How do I find the file by its content?</a> </li> <li><a href="#heading-8-i-cant-see-the-full-path-how-do-i-fix-this">I can't see the full path - how do I fix this?</a> </li> <li><a href="#heading-9-hard-to-read-open-the-results-in-the-text-editor-of-your-choice">Hard to read? Open the results in the text editor of your choice</a> </li> <li><a href="#heading-10-summary-the-ultimate-commands-for-searching-and-finding-whatever-you-need">Summary - the ultimate commands for searching and finding whatever you need</a> </li> </ol> <h2 id="heading-1-basic-explanation-of-the-get-childitem-command">1. 
Basic Explanation of the <code>Get-ChildItem</code> Command</h2> Let's take a look at an example search command to understand how it works: <pre><code class="language-powershell">Get-ChildItem -Recurse -Path "C:\path to\your directory\" -Filter "*whatImLookingFor*" </code></pre> <code>Get-ChildItem</code> (aliases: <code>dir</code>, <code>ls</code>, <code>gci</code>) lists the content of a folder or directory just like the Linux <code>ls</code> command does. This command works by searching every single file and directory in the path specified. It shows you everything it found that matches the filter. Only the matches are displayed, but the command still looks through everything else under that path to find them. So you specify the path that is the parent (folder), which means that every folder and file under it is its child. If you know some CSS and JavaScript, think of parents and children the same way those languages do. If you don't use <code>-Recurse</code> or <code>-Depth</code>, then the command looks only in the directory you pointed it at (parent Depth level 0) and lists the children directly inside that directory (children Depth level 0). If you use <code>-Recurse</code>, then <code>gci</code> will search for what you want on ALL LEVELS. But by using <code>-Depth</code>, you can specify how deep you want it to look for a file/folder. To recurse means "to repeat an operation". So, <code>-Recurse</code> means that <code>gci</code> will repeat the search for your file or folder in every child element of the starting directory (say, "Documents"), and every directory inside it, all levels deep. All of these files and folders are children of your "Documents" folder. If you delete the folder, you delete everything inside it too. <code>-Filter</code> filters the output of the command to only show what matches the filter (examples of how to use the filter come later in the article). 
<code>-Path</code> tells the command where to look for files (by using "C:\", for example, you're telling it to start from the root of your drive). If you want to search in a certain directory, it would look like this: <pre><code class="language-powershell">Get-ChildItem -Path "C:\path to\your directory\" </code></pre> OR <pre><code class="language-powershell">Get-ChildItem -Path "~\Documents\path to\your directory\" </code></pre> <code>~\</code> here is a shorthand for "inside current user's folder" or "C:\Users\YourUsername". Next, we can specify whether we'd like to look for a file or a folder, so we have fewer results to look at: <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "*whatImLookingFor*" -File </code></pre> <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "*whatImLookingFor*" -Directory </code></pre> You might be wondering how you can stop the search if it takes too long. When you're using <code>-Recurse</code>, the output that you'll get might become quite overwhelming, especially if your command isn't specific enough (more about that in <a href="#heading-3-when-is-the-path-option-not-needed">step 3</a> and <a href="#heading-4-advanced-searching-combining-get-childitem-with-the-where-object-command">step 4</a>). Luckily, you can stop a running command in PowerShell by pressing Ctrl + C. <h2 id="heading-most-used-examples-of-searching-by-gci-command">Most Used Examples of Searching by <code>gci</code> Command</h2> Here are some handy examples of search commands that you can use: Example #1: search for all executable files on your PC (remember that you can stop this command with Ctrl + C): <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "*.exe" -File </code></pre> REMEMBER: In order to paste commands into PowerShell, you first have to enable pasting. 
<a href="https://notblackmagic.hashnode.dev/enable-copy-pasting-in-windows-powershell-cli-in-3-steps">Here's how</a>. This command will show you a very long list of executable files and their folders (as shown in the image below). These lists might be so long that it's impossible to find anything in them. That's why you'll learn how to use more advanced techniques of filtering in <a href="#heading-4-advanced-searching-combining-getchildren-with-the-whereobject-command">step 4</a> to see fewer unnecessary results that don't fit your criteria. Example #2: search for an executable file that has "notepad" in its name (or search for any program you need, basically): <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "notepad*.exe" -File </code></pre> One of the results will show you the location of the file you want: In our case it's the <code>C:\Windows\System32</code> folder. You can mix it however you want! Thanks to that command, you don't have to remember much about your file and it will still work. <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "n*pad*.*xe" </code></pre> So what if you see some errors while scanning the whole system. Should you worry? It's ok! Sometimes you might get lots of errors. They will most likely occur when a script scours the system folders/files. If you want to get rid of them, add <code>-ErrorAction SilentlyContinue</code>, like you see here: <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "notepad*.exe" -File -ErrorAction SilentlyContinue </code></pre> You can try it now ;) <h2 id="heading-2-setup-for-other-more-complex-examples">2. Setup for Other More Complex Examples</h2> Now, let's look at even more use cases for this command. But first, we'll create a space where I can show you examples. First, create new folder inside your "Documents" folder. Let's call it "Items". Inside it, create two text documents. 
Name one of them "Item 1- Green Bracelet" and the other "Item 2- Blue Bracelet" (yes, make sure you write the first letter of each word in UPPER CASE). Copy these files now. Go one folder back (in File Explorer you can use the Alt + UpArrow shortcut) and create another folder next to "Items" called "More items". Paste the copied files inside the "More items" folder and change their names, so they have only lower case letters ("item 1- green bracelet" and "item 2- blue bracelet"). PRO TIP: You can click once on a file with your mouse and then press the F2 key on your keyboard to rename it. <h3 id="heading-3-when-is-the-path-option-not-needed">3. When is the <code>-Path</code> option not needed?</h3> You don't have to specify the path every time. You can always just move to the desired directory with the <code>cd</code> (change directory) command. This command will move you to your <code>Documents</code> folder: <pre><code class="language-powershell">cd ~\Documents\ </code></pre> Now, you should be able to see PowerShell pointing to your <code>Documents</code> folder in the prompt on the left of the screen. If you don't see this, then you can use double quotes <code>" "</code>, like in this command: <pre><code class="language-powershell">cd "~\Documents\" </code></pre> Make sure that PowerShell is pointing to our desired folder. Now, the searching command looks like this without the <code>-Path</code> option: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" -File </code></pre> Pretty simple, right? As you can see in the image above, we first moved to our desired directory, so later we could perform the search inside it without specifying the <code>-Path</code> option/parameter. 
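If you're not sure where PowerShell is currently pointing, you can also check it from the command line. This is only a minimal sketch, assuming the same <code>Documents</code> folder as above: <code>Set-Location</code> is the full name behind the <code>cd</code> alias, and <code>Get-Location</code> prints the current folder.

<pre><code class="language-powershell"># cd is an alias for Set-Location
Set-Location ~\Documents

# Print the folder PowerShell is currently pointing to
Get-Location

# Search from here; no -Path is needed
Get-ChildItem -Recurse -Filter "*item*" -File
</code></pre>

Once <code>Get-Location</code> shows your <code>Documents</code> folder, the search runs there without the <code>-Path</code> option.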
But the <code>-Path</code> option is very useful, either when you're creating a script or you want to search for something without moving away from the current directory: <pre><code class="language-powershell">Get-ChildItem -Path ~\Documents\ -Recurse -Filter "*item*" -File </code></pre> <pre><code class="language-powershell">Get-ChildItem -Path ~\Documents\ -Recurse -Filter "*item*" -Directory </code></pre> Here's an example. I'm inside the <code>System32</code> folder and I want to know whether the thing I'm looking for is inside the <code>Documents</code> folder without moving in there: And it really is there! From now on, because you already know what the <code>-Path</code> option is being used for, I won't be using it unless it's necessary. <h2 id="heading-4-advanced-searching-combining-get-childitem-with-the-where-object-command">4. Advanced Searching – Combining <code>Get-ChildItem</code> with the <code>Where-Object</code> Command</h2> Sometimes you might have several folders named exactly the same, but they're in different places. You might want to exclude them based on their content, which folder they are in, or based on their<code>-Depth</code> level (see the graphic with the explanation about <code>-Depth</code> level in <a href="#heading-1-basic-explanation-of-the-get-childitem-command">step 1</a>). That's what we're going to cover in the next few points. For this part of the tutorial, make sure you've gone through <a href="#heading-2-setup-for-other-more-complex-examples">step 2</a> (but you can skip step 3 if you want). <h3 id="heading-41-searching-through-only-a-particular-directory">4.1. Searching through only a particular directory</h3> Let's say that we're now looking for the bracelets that we created in step 2. But, we want to see the results from only one folder. For that, we'll use case-sensitive search (<code>-clike</code>) to get only our preferred results. But <code>-clike</code> doesn't work with <code>gci</code> alone. 
We need to apply another filter with the <code>Where-Object { }</code> command: <pre><code class="language-powershell">Get-ChildItem -Path ~\Documents\ -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*Item*" } </code></pre> OR (clearer version, without the <code>-Path</code> option): <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*Item*" } </code></pre> Let's review what's going on here: <ul> <li><code>Get-ChildItem -Recurse -Filter "*item*"</code> searches for all files and folders with "item" in their name </li> <li><code>|</code> – the "pipe" symbol is used to get the output of the previous command (the list of all files and folders filtered by <code>gci</code>) and send it to the next command (<code>Where-Object</code> is applying another filter to what is already filtered by <code>gci</code>). </li> <li><code>Where-Object { }</code> is the command used for filtering the lists of objects. The filter is being specified inside the <code>{ }</code> curly brackets. </li> <li><code>$_</code> refers to all the separate objects. Treat it as "ForEachObjectFromList". And treat the whole sequence after the <code>|</code> as "FindObjectsFromList that have a name with 'Item' ". <code>$_</code> is very often used with <code>Where-Object</code>, but also with some other commands. </li> <li><code>.Name</code> – we choose a Name property to get from every object. </li> <li><code>-clike</code> finds a match that is 100% correct. All letters must be the exact same case as the phrase we specified. <code>c</code> stands for "case sensitive" and it checks every letter to see if it's upper case or lower case. </li> </ul> So, <code>Where-Object { $_.Name -clike "*Item*" }</code> is a filter that takes the <code>Name</code> parameter of every object from the list (created by <code>gci</code>) and checks with <code>-clike</code> if any <code>Name</code> has the word "Item" in it. 
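You can check how these operators behave without touching any files by comparing plain strings. This is only a sketch: the file names are the ones from step 2, with a .txt extension assumed.

<pre><code class="language-powershell"># -like ignores case; -clike requires the exact same capitalization
"Item 1- Green Bracelet.txt" -like  "*item*"   # True
"Item 1- Green Bracelet.txt" -clike "*item*"   # False
"Item 1- Green Bracelet.txt" -clike "*Item*"   # True

# The same test applied to a list, the way Where-Object applies it to gci output
$names = "Item 1- Green Bracelet.txt", "item 1- green bracelet.txt"
$names | Where-Object { $_ -clike "*Item*" }   # Item 1- Green Bracelet.txt
</code></pre>

Inside <code>Where-Object</code>, <code>$_</code> is each string from the list in turn, exactly as it is each file object when the list comes from <code>gci</code>.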
As you can see in the image below, now we'll get only the files with upper case names in our result: IMPORTANT: <code>-like</code> alone means that we're looking for a certain pattern, no matter what case the letters are. The <code>c</code> in <code>-clike</code> means that we look for the thing with exactly the same capitalization of the letters (both upper and lower case, hence the "c"). If you want to see the files without the upper case first letter, you can do that by changing "*Item*" from our current command to "*item*": <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*item*" } </code></pre> Let's try it out! <h3 id="heading-42-how-to-search-while-excluding-a-particular-directory">4.2. How to search while excluding a particular directory</h3> In step 4.1 we learned how to search only for files/folders with specific case-sensitive names in them. After applying only two changes to our previous code, we can exclude certain directories from our search. Here's our starting command once again: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*Item*" } </code></pre> <h4 id="heading-change-1">Change #1</h4> In the example above, <code>-clike</code> shows only files/folders including specific phrase in their names. If we change it to <code>-cnotlike</code>, we'll exclude from the search all files/folders with that specific phrase in their name. Now our code looks like this: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -cnotlike "*Item*" } </code></pre> <h4 id="heading-change-2">Change #2</h4> After the first change, <code>Where-Object { $_.Name -cnotlike "*Item*" }</code> only excludes the names, not full paths. In order to avoid that, we need to exclude an actual path to these files. 
We can do that by changing <code>$_.Name</code> to <code>$_.FullName</code>, which checks for a certain phrase in the whole path to the file and in the file's name. Now, your command should look like this: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.FullName -cnotlike "*Item*" } </code></pre> We excluded the "Items" folder from our search. You should now be able to see the files only from the "More items" directory. Try it out yourself! What if you want to exclude the "More items" directory instead? Just change the phrase inside the filter to something like this: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -cnotlike "*More*" } </code></pre> We also changed the name of the file from "*item*" to "*green*" in our <code>gci</code> search (first line of code). That's why now we'll see only one bracelet in our result list: The <code>gci</code> command has two filters applied. First, it searches for files with phrase "green" in their names. The second filter is the "Where-Object" command, which excludes anything that has the word "More" in its path. In our case, the "More items" folder got excluded. We don't even need the case-sensitive filter in our case. The command will work the same when we exclude just a lowercase word "more". So let's change <code>-cnotlike "*More*"</code> to <code>-notlike "*more*"</code> and see if it's true: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -notlike "*more*" } </code></pre> As you can see, the result is the same! Despite different cases of the letters, we still got the right keyword. So, case-sensitive search isn't always needed – only when you want to be very specific. Sometimes, being too specific might be bad and make your code not work as intended. To see what I mean, let's look at the example below. 
Let's apply case-sensitive search once again, but to our unchanged, lowercase keyword "more" and see if it still works: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -cnotlike "*more*" } </code></pre> Case-sensitive search doesn't filter out anything now, because it's too specific: the folder is called "More items" with an upper case "M", so the lowercase "more" never matches. Files from both the "Items" and "More items" folders pass through the filter now. <h4 id="heading-faq">FAQ:</h4> If the <code>Where-Object</code> command is what actually filters the output for us, shouldn't we drop (delete) the <code>-Filter</code> option from <code>gci</code>? No, we should still use the <code>-Filter</code> option, because it already separates around 99% of the possible files, so the <code>Where-Object</code> command has to work roughly only on 1% of the objects. It makes this part of the command AT LEAST 100 times faster (more often 100,000 times or even faster). You can try using this command in <code>-Path C:\</code> with and without the <code>-Filter</code> option. In my case, using <code>-Filter</code> shortened the time needed for the whole sequence of commands to finish from 16 seconds to 8 seconds (the first 7.99 seconds are used by <code>gci</code> itself, which is why the total time was only cut in half). That's what we call ✨optimization✨ :D <h3 id="heading-43-searching-only-1-directory-from-many-with-exactly-the-same-name">4.3 Searching only 1 directory from many with exactly the same name</h3> We've learned how to search for a phrase anywhere inside the path of a file. But what if we want to search inside exactly the "More items" folder? For that, we'll use the <code>-match</code> filter (which works similarly to the <code>-like</code> filter, but takes a regular expression instead of a wildcard pattern). Our phrase will also use "\\" instead of "\". This is because "\" is the symbol for a folder, but on its own in a regular expression it is an escape character with other meanings, which we don't want. 
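To see what the doubled backslash does, here is a small sketch that tests a pattern against plain strings first. The paths (including the user name "Me") are made up for illustration, and the last command is just one way such a filter could be written for the folders from step 2.

<pre><code class="language-powershell"># In a -match pattern, "\\" stands for one literal backslash
"C:\Users\Me\Documents\More items\item 1- green bracelet.txt" -match "\\More items\\"   # True
"C:\Users\Me\Documents\Items\Item 1- Green Bracelet.txt" -match "\\More items\\"        # False

# Keep only the files whose path contains a folder named exactly "More items"
Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -match "\\More items\\" }
</code></pre>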
This command will look for a match for the "More items" folder in the path of every file from the list. Then, it will show you this file if it matches. What if we want to check for two folders, one next to the other, simultaneously? Very easy! Just connect them with the sign for a folder "\". Here, the command will search inside the "More items" folder only if it's inside the "Documents" folder: As you can see, we didn't use "More items", only "More". You can shorten that filter how you want. It will still be applied to the whole path. See the example below: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -match "s\\Mo*" } </code></pre> Earlier, we used the <code>not</code> statement in <code>-like</code> filter to exclude certain files and directories. The same can be done with <code>-notmatch</code>: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -notmatch "ents\\Ite*" } </code></pre> Be aware that we're now excluding the "Items" folder from the search, not "More items". And, with <code>-cmatch</code> we can apply the same case-sensitive filter as with <code>-clike</code>: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.FullName -cmatch "green*" } </code></pre> I hope you get the gist of it now. <h3 id="heading-44-filter-how-deep-how-many-folders-in-you-want-to-search-for-the-file">4.4 Filter how deep (how many folders in) you want to search for the file</h3> Sometimes you might have a very long path to some of your files. If you don't want to waste time searching every folder on your computer recursively, you can use <code>-Depth</code> option. It specifies how many folders to search inside your folder tree. I already showed you the picture of a folder tree in the beginning of this article, but you should take a look at it here once again. 
So, how does the <code>-Depth</code> parameter work? <code>-Depth 0</code> means that our command will search only the current folder. It will show results of all children of Depth level 0. Those results are: 1 "child file" and 2 "child folders". <code>-Depth 1</code> searches the current folder and its child-folders. It will show the results of all children of Depth level 1. Those results are: 1 "child file", 2 "child folders", 2 "grandchild files" and 1 "grandchild folder". <code>-Depth 2</code> searches the current folder and its child and grandchild folders. It will show results of all children of Depth level 2. Those results are: 1 "child file", 2 "child folders", 2 "grandchild files", 1 "grandchild folder" and 1 "great grandchild file". Let's see the difference between these two commands: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" -Depth 0 </code></pre> <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" -Depth 1 </code></pre> The first command will show you only the files and folders inside our current directory. The second command will also search for them inside every folder found inside the current folder. For the sake of practice, let's combine it with <code>Where-Object</code> to find the green bracelet: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" -Depth 1 | Where-Object { $_.Name -clike "*green*" } </code></pre> I hope that this example showed you how easy it is to use multiple options (<code>-Depth</code>, <code>-Recurse</code>) and filters (<code>-Filter</code>, <code>Where-Object</code>). <h2 id="heading-5-how-to-search-through-hidden-files">5. How to Search Through Hidden Files</h2> Some files are not that easily accessible to the user. You can see some of the hidden files and folders in Windows Explorer (<a href="https://notblackmagic.hashnode.dev/how-to-see-hidden-files-and-folders-in-windows-file-explorer">here's how</a>). 
But sometimes it's easier to find what you need if you see only those hidden files. That's possible with PowerShell. The options we're going to use for that are: <ul> <li><code>-Force</code>: show files otherwise not accessible by the user, such as hidden files. </li> <li><code>-Hidden</code>: show only those hidden files and directories. </li> </ul> This example will search for hidden files in our user's folder: <pre><code class="language-powershell">gci -Path ~\ -Force -Hidden </code></pre> Everything here is usually invisible to the typical user. But not for you now :D The interesting thing is that there are more files hidden from the user than visible ones. If you're brave enough, you can see them yourself (Remember! Ctrl + C stops the command!): <pre><code class="language-powershell">gci -Path ~\ -Force -Hidden -Recurse </code></pre> <h2 id="heading-6-how-can-you-know-all-the-properties-that-you-can-use-as-a-filter">6. How can you know all the properties that you can use as a filter?</h2> Up until now, we've used some common properties, like <code>Name</code> and <code>FullName</code>. But there are many others that you might want to access, like <code>CreationTime</code> (the date the file was created) or <code>LastWriteTime</code> (the date the file was last edited). In this section, I'll first show you how to see all the possible properties. After that, you'll learn how to retrieve only the property you want for scripting purposes. Go through step 2 above if you haven't already, because we're going to use the same files that we created before. Move to the <code>Documents</code> folder in PowerShell. I hope that this script looks familiar to you now. 
It searches for files with "item" in their names and checks if these names contain the word "green" (all lowercase letters): <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*green*" } </code></pre> We know that only one file should appear (if you don't trust me, just see for yourself). So, we're going to see every possible property we can use by appending (adding at the end) this fragment of code: <code>| Select-Object -Property *</code> <code>Select-Object</code> (alias: <code>select</code>) is used for selecting different types of properties. By using an option <code>-Property</code> we tell it to show both values and names of all the properties. For example: Name of property: <code>FullName</code> Value of property: <code>~\Documents\More items\item 1- green bracelet.txt</code> The asterisk <code>*</code> at the end tells this command to show these names and values for every property possible. The final version of this command looks like this: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*item*" | Where-Object { $_.Name -clike "*green*" } | Select-Object -Property * </code></pre> Try finding the <code>FullName</code> property in there :D This command showed us all possible properties that we can use for that 1 file that it found. If there were more files fitting the filter, then every single one of them would have a similar list of properties. But for different types of files you will get different results. <h3 id="heading-how-to-retrieve-only-1-desired-property">How to retrieve only 1 desired property</h3> You've already learned how to check for all possible properties. So, how do we use any of them? 
Just put one of them in place of the asterisk <code>*</code> at the end of the command, like we put <code>CreationTime</code> in here: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.Name -clike "*green*" } | Select-Object -Property CreationTime </code></pre> You can use any other property for the sake of this exercise, like <code>LastWriteTime</code>: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.Name -clike "*green*" } | Select-Object -Property LastWriteTime </code></pre> What if you want to retrieve only the value of the property without its name (because you already know its name and it also messes up your script)? You can retrieve just the value by changing <code>-Property</code> to <code>-ExpandProperty</code>: <pre><code class="language-powershell">Get-ChildItem -Recurse -Filter "*green*" -File | Where-Object { $_.Name -clike "*green*" } | Select-Object -ExpandProperty LastWriteTime </code></pre> See the result: <h2 id="heading-7-i-dont-know-the-files-name-but-i-know-whats-inside-it-how-do-i-find-the-file-by-its-content">7. I don't know the file’s name, but I know what's inside it. How do I find the file by its content?</h2> Sometimes it's easier to find a file by searching for its content. Or perhaps you have lots of similar files and you'd like to check them quickly without opening and closing them. I'll show you some techniques that will let you achieve that in no time. This command will search every file on your system for the specified word or phrase (in our case, the phrase is "match"): <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -File | Select-String -Pattern 'match' -List </code></pre> Here's what's happening: <ul> <li><code>Get-ChildItem -Path C:\ -Recurse -File</code>: as you already know, this part searches for every file on your computer. 
</li> <li><code>|</code> – passes the list of files to the next command. So, the next command will search for a certain phrase only in the files listed by <code>gci</code>. </li> <li><code>Select-String</code> – "String" is a common word in programming used to describe a word/phrase/some text. So, we select the phrase that we want to search for. That phrase is specified by the <code>-Pattern</code> parameter (in our case it's "match"). </li> <li><code>-List</code> tells the command to show only the first found match in every file (great if you want to just see the list of all found files). </li> </ul> Here's an example output of our command: Of course, you have quite a lot of files, and some images may also appear in your search (like .svg files that are basically text files that tell the system how to draw an icon). So, it's always best to specify what type of file you're searching for. Let's look for the phrase "red" inside .svg files: <pre><code class="language-powershell">Get-ChildItem -Filter "*.svg" -Recurse | Select-String -Pattern 'red' -List </code></pre> On the other hand, some text documents will never appear in your search (for example .doc and .docx documents are encoded in such a way that they're impossible to decode without Word). But in regular text files, you can search for phrases with an emphasis on big and small letters with the <code>-CaseSensitive</code> option. Here, we're going to search for the phrase "github" with only lowercase letters: <pre><code class="language-powershell">Get-ChildItem -Filter "*.txt" -Recurse | Select-String -Pattern 'github' -List -CaseSensitive </code></pre> Other options that you'll often use with the <code>Select-String</code> command are: <ul> <li><code>Select-String -AllMatch</code> will show you all matches found in every searched file (instead of only 1 match found per file, like with <code>-List</code>). 
<code>Select-String -Context 3</code> shows the three lines of text before and after the line in which the match is found. <code>Select-String -Raw</code> won't show you the paths, just the content of the files. This is great for automation and scripts. It's often combined with the <code>-Context</code> option.</li> </ul> Let's see some of these options in action: <pre><code class="language-powershell">Get-ChildItem -Filter "*.txt" -Recurse | Select-String -Pattern 'github' -AllMatch -Context 3 </code></pre> Thanks to the <code>-Context</code> parameter, you can see a total of seven lines (three lines before and three lines after the match) in this file, one after another. This makes it easier to differentiate it from all the other matches found by <code>-AllMatch</code> that might be put in a very similar context. If you ever feel like there's too much clutter on your screen, you can combine <code>Select-String</code> with <code>Select-Object</code> to get only the paths of the files with matched phrases. The command below will search every .txt file on your computer for the phrase specified: <pre><code class="language-powershell">Get-ChildItem -Filter "*.txt" -Recurse | Select-String -Pattern 'github' -List </code></pre> Let's add the <code>Select-Object -Property Path</code> filter at the end. Now, the command will only show the paths, so there's less clutter on your screen: <pre><code class="language-powershell">Get-ChildItem -Filter "*.txt" -Recurse | Select-String -Pattern 'github' -List | Select-Object -Property Path </code></pre> Some of the paths are not fully visible. We'll fix that in the next step. <h2 id="heading-8-i-cant-see-the-full-path-how-do-i-fix-this">8. I can't see the full path - how do I fix this?</h2> Let's format the results with the <code>Format-Table -Wrap -AutoSize</code> command. <code>-Autosize</code> allows the result to take the whole available space. 
<code>-Wrap</code> allows wrapping (continuing the text in the next line when it doesn't fit in the space available), which creates more space if it's needed. Here's an example: <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Filter "*.txt" -Recurse | Select-String -Pattern 'github' -List | Select -Property Path | Format-Table -Wrap -AutoSize </code></pre> Now, you can see the whole paths (or any other results you need) even in PowerShell! <h2 id="heading-9-hard-to-read-open-the-results-in-the-text-editor-of-your-choice">9. Hard to read? Open the results in the text editor of your choice</h2> You can send the results of any script/command in two ways: <code>> ~\Documents\command_output.txt</code> AND <code>| Out-File ~\Documents\command_output.txt</code> Both of these will create a file inside your <code>Documents</code> folder, which you can later open in any program of your choice and edit. Just add whichever solution you prefer to the end of your command, like here: <pre><code class="language-powershell">Get-ChildItem -Filter "*.txt" -Recurse | Select-String -Pattern 'match' -List | Select -Property Path | Out-File ~\Documents\command_output.txt </code></pre> In the image below, first you'll see the same command, but without exporting the results to another file. The second command, at the bottom of the image, will export the results to the other file without showing them in PowerShell: You'll see the results from second command after opening the file in any text editor: But, what if you can't see the full path even in your text editor? To address this, you can add <code>| Format-Table -Wrap -AutoSize</code> right before sending the results to the file: <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Filter "*.txt" -Recurse | Select-String -Pattern 'match' -List | Select -Property Path | Format-Table -Wrap -AutoSize | Out-File ~\Documents\command_output.txt </code></pre> And open the file to see the whole path! 
Just remember that you have to copy each line one by one. Where you see the arrows in the screenshot above is a "newline" character, which you have to delete. Only after doing that can you copy the whole path and paste it into Windows Explorer or into some script. <h2 id="heading-10-summary-the-ultimate-commands-for-searching-and-finding-whatever-you-need">10. Summary: the Ultimate Commands for Searching and Finding Whatever You Need</h2> <a href="https://github.com/NotBlackMagician/NBM-cheat-sheets/blob/main/windows_powershell/NBM_cheat_sheet_Get-ChildItem_find_any_file_like_on_linux.txt">Here</a> you can download a free cheat sheet with explanations of the commands and examples in one place. <h3 id="heading-most-used-commands">Most used commands:</h3> <ul> <li>Case-sensitive search:</li> </ul> <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "*whatYouNeed*" | Where-Object { $_.Name -clike "*whatYouNeed*" } | Select-Object { $_.FullName } | Format-Table -Wrap -AutoSize </code></pre> <ul> <li>Alternatively, send the result to a file:</li> </ul> <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse -Filter "*whatYouNeed*" | Where-Object { $_.Name -clike "*whatYouNeed*" } | Select-Object { $_.FullName } | Format-Table -Wrap -AutoSize | Out-File ~\Documents\command_output.txt </code></pre> <ul> <li>Search by file's content:</li> </ul> <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse | Select-String -Pattern 'what you remember' -AllMatch -Context 2 | Format-Table -Wrap -AutoSize </code></pre> <ul> <li>Alternatively, send the result to the file:</li> </ul> <pre><code class="language-powershell">Get-ChildItem -Path C:\ -Recurse | Select-String -Pattern 'what you remember' -CaseSensitive -AllMatch -Context 2 | Format-Table -Wrap -AutoSize | Out-File ~\Documents\command_output.txt </code></pre> These commands should work for anything you want to find. 
I hope you now understand how they work after reading through this tutorial ;) <h2 id="heading-wrapping-up">Wrapping Up</h2> If you want to learn more about these commands, I show you how to work with them in depth in my tutorial <a href="https://notblackmagic.hashnode.dev/learn-windows-powershell-commands-like-a-linux-user">“Learn PowerShell commands like a Linux user”</a>. If what you found here helped you in any way, consider following me on social media to help me reach a wider audience: <a href="https://social.linux.pizza/@SecretDevil">Mastodon</a>, <a href="https://www.linkedin.com/in/piotr-opoka-4320143a5/">LinkedIn</a>. You can also rate me on <a href="https://github.com/NotBlackMagician">GitHub</a> and support me on <a href="https://ko-fi.com/piotropoka">Ko-fi!</a> Thank you for any support you're able to give. Have a great day! </article> <article> <h1> How to Build a Production-Ready Flutter CI/CD Pipeline with GitHub Actions: Quality Gates, Environments, and Store Deployment </h1> Oluwaseyi Fatunmole — Wed, 18 Mar 2026 22:58:15 +0000 Mobile application development has evolved over the years. The processes, structure, and syntax we use have changed, as well as the quality and flexibility of the apps we build. One of the major improvements has been a properly automated CI/CD pipeline that gives us seamless continuous integration and continuous deployment. In this article, I'll break down how you can build a production-ready CI/CD pipeline for your Flutter application using GitHub Actions. Note that there are other ways to do this, like with Codemagic (built specifically for Flutter apps – which I'll cover in a subsequent tutorial), but in this article we'll focus on GitHub Actions instead. 
<h2 id="heading-table-of-contents">Table of Contents</h2> <ol> <li><a href="#heading-the-typical-workflow">The Typical Workflow</a> </li> <li><a href="#heading-prerequisites">Prerequisites</a> </li> <li><a href="#heading-pipeline-architecture">Pipeline Architecture</a> </li> <li><a href="#heading-writing-the-workflows">Writing the Workflows</a> <ul> <li><a href="#heading-the-helper-scripts">The Helper Scripts</a> <ul> <li><a href="#heading-script-1-generateconfigsh">generate_config.sh</a> </li> <li><a href="#heading-script-2-qualitygatesh">quality_gate.sh</a> </li> <li><a href="#heading-script-3-uploadsymbolssh-sentry">upload_symbols.sh</a> </li> </ul> </li> <li><a href="#heading-workflow-1-prchecksyml">PR Quality Gate (pr_checks.yml)</a> </li> <li><a href="#heading-workflow-2-androidyml">Android CI/CD Pipeline (android.yml)</a> </li> <li><a href="#heading-workflow-3-iosyml">iOS CI/CD Pipeline (ios.yml)</a> </li> </ul> </li> <li><a href="#heading-secrets-and-configuration-reference">Secrets and Configuration Reference</a> </li> <li><a href="#heading-end-to-end-flow">End-to-End Flow</a> </li> <li><a href="#heading-conclusion">Conclusion</a> </li> </ol> <h2 id="heading-the-typical-workflow">The Typical Workflow</h2> First, let's define the common approach to deploying production-ready Flutter apps. The development team does their work on local, pushes to the repository for merge or review, and eventually runs <code>flutter build apk</code> or <code>flutter build appbundle</code> to generate the apk file. This then gets shared with the QA team manually, or deployed to Firebase app distribution for testing. If it's a production move, the app bundle is submitted to the Google Play store for review and then deployed. This process is often fully manual with no automated checks, validation, or control over quality, speed, and seamlessness. Manually shipping a Flutter app starts out relatively simply, but can quickly and quietly turn into a liability. 
You run <code>flutter build</code>, switch configs, sign the build, upload it somewhere, and hope you didn’t mix up staging keys with production ones. As teams grow and release updates more and more quickly, these manual steps become real risks. A skipped quality check, a missing keystore, or an incorrect base URL deployed to production can cost hours of debugging or worse – it can affect your users. Automating this process fully involves some high-level configuration and predefined scripting. The pipeline takes control of the deployment process from the moment a developer raises a PR into the common or base branch (for example, the <code>develop</code> branch). It takes care of everything that needs to be done, provided each step has been predefined, properly scripted, and aligned with the team's use case. <h3 id="heading-what-well-do-here">What we'll do here:</h3> In this tutorial, we'll build a production-grade CI/CD pipeline for a Flutter app using GitHub Actions. The pipeline automates the entire lifecycle: pull-request quality checks, environment-specific configuration injection, Android and iOS builds, Firebase App Distribution for testers, Sentry symbol uploads, and final deployment to the Play Store and App Store. By the end, every release – from a developer opening a PR to the final build landing in users' hands – will be fully automated, with no one touching a terminal. 
<h2 id="heading-prerequisites">Prerequisites</h2> Before starting, you should have: <ol> <li>A Flutter app with working Android and iOS builds </li> <li>Basic familiarity with <a href="https://www.freecodecamp.org/news/automate-cicd-with-github-actions-streamline-workflow/">GitHub Actions</a> (workflows and jobs) </li> <li>A Firebase project with App Distribution enabled </li> <li>A Sentry project for error tracking </li> <li>A Google Play Console app already created </li> <li>An Apple Developer account with App Store Connect access </li> <li>Fastlane configured for your iOS project </li> <li>Basic Bash knowledge (I’ll explain the important parts) </li> </ol> <h2 id="heading-pipeline-architecture">Pipeline Architecture</h2> In this guide, we'll build a CI/CD pipeline around a precise set of requirements, because the use case determines how the pipeline is structured. For this tutorial, the use case is: I want to automate the workflow on my development team based on the following criteria: <ol> <li>When a developer on the team raises a PR into the common working branch (<code>develop</code> in most cases), a workflow is triggered to run quality checks on the code. It only allows the merge to happen if all checks (like test coverage, quality checks, and static analysis) pass. </li> <li>Code that's moving from the develop branch to the staging branch goes through another workflow that injects staging configurations/secret keys, does all the necessary checks, and distributes the application for testing on Firebase App Distribution for Android as well as TestFlight for iOS. </li> <li>Code that's moving from the staging to the production branch goes through the production-level workflow, which involves secure signing of the Android build, production configuration injection, running tests to ensure nothing breaks, Sentry symbol uploads for monitoring, and submission to App Store Connect as well as Google Play Console. 
</li> </ol> These are our predefined conditions which help with the construction of our workflows. <h2 id="heading-writing-the-workflows">Writing the Workflows</h2> We'll split this pipeline into three GitHub Actions workflows. We'll also take it a notch higher by creating three helper .sh scripts for a cleaner and more maintainable workflow. In your project root, create two folders: <ol> <li>.github/ </li> <li>scripts/ </li> </ol> The .github/ folder (specifically its workflows/ subfolder) will hold the workflows we'll be creating for each use case, while the scripts/ folder will hold the helper scripts that we can easily call in our CLI or in the workflows directly. After this, we'll create three workflow .yml files: <ol> <li>pr_checks.yml </li> <li>android.yml </li> <li>ios.yml </li> </ol> Also in the scripts folder, let's create three .sh files: <ol> <li>generate_config.sh </li> <li>quality_checks.sh </li> <li>upload_symbols.sh </li> </ol> <pre><code class="language-yaml">.github/
  workflows/
    pr_checks.yml
    android.yml
    ios.yml
scripts/
  generate_config.sh
  quality_checks.sh
  upload_symbols.sh
</code></pre> This workflow architecture ensures that a push to <code>develop</code> automatically produces a tester build. Also, merging to <code>production</code> ships directly to the stores without manual commands or config changes. The scripts live outside the YAML on purpose. This lets you run the same logic locally. <h3 id="heading-the-helper-scripts">The Helper Scripts</h3> The scripts form the backbone of the pipeline. Each one has a single responsibility and is reused across workflows. Instead of cramming logic into YAML, we'll move it into reusable scripts, which keeps the workflows clean. Let's go through each one now. <h3 id="heading-script-1-generateconfigsh">Script #1: <code>generate_config.sh</code></h3> Injecting secrets safely is one of the hardest CI/CD problems in mobile apps. 
The strategy: <ul> <li>Commit a Dart template file with placeholders </li> <li>Replace placeholders at build time using secrets from GitHub Actions </li> <li>Never commit real credentials </li> </ul> <pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail

ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}

TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"

if [ -z "$ENV_NAME" ] || [ -z "$BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then
  echo "Usage: $0 <env-name> <base-url> <encryption-key>"
  exit 2
fi

sed -e "s|<<BASE_URL>>|$BASE_URL|g" \
    -e "s|<<ENCRYPTION_KEY>>|$ENCRYPTION_KEY|g" \
    -e "s|<<ENV_NAME>>|$ENV_NAME|g" \
    "$TEMPLATE" > "$OUT"

echo "Generated config for $ENV_NAME"
</code></pre> This script is responsible for injecting environment-specific configuration into the Flutter app at build time, without ever committing secrets to source control. Let’s walk through it carefully. <h4 id="heading-1-shebang-choosing-the-shell">1. Shebang: Choosing the Shell</h4> <pre><code class="language-yaml">#!/usr/bin/env bash </code></pre> This line tells the system to execute the script using Bash, regardless of where Bash is installed on the machine. Using <code>/usr/bin/env bash</code> instead of <code>/bin/bash</code> makes the script more portable across local machines, GitHub Actions runners, and Docker containers. <h4 id="heading-2-fail-fast-fail-loud">2. Fail Fast, Fail Loud</h4> <pre><code class="language-yaml">set -euo pipefail </code></pre> This is one of the most important lines in the script. It enables three strict Bash modes: <ul> <li><code>-e</code>: Exit immediately if any command fails </li> <li><code>-u</code>: Exit if an undefined variable is used </li> <li><code>-o pipefail</code>: Fail if any command in a pipeline fails, not just the last one </li> </ul> This matters in CI because silent failures are dangerous, partial config generation can break production builds, and CI should stop immediately when something is wrong. 
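To make the effect of <code>pipefail</code> concrete, here is a minimal sketch that runs the same failing pipeline with and without the option and prints the exit status each time. The function names are hypothetical and chosen only for illustration; they are not part of the tutorial's scripts.

```shell
#!/usr/bin/env bash
# Sketch: compare a pipeline's exit status with and without pipefail.
# Function names are illustrative and not part of the tutorial's scripts.

# Without pipefail, a pipeline reports only the status of its LAST command.
status_without_pipefail() {
  (
    set +o pipefail
    rc=0
    false | true || rc=$?   # `false` fails, but `true` is last
    echo "$rc"
  )
}

# With pipefail, a failure anywhere in the pipeline is reported.
status_with_pipefail() {
  (
    set -o pipefail
    rc=0
    false | true || rc=$?   # the failure of `false` now surfaces
    echo "$rc"
  )
}

echo "without pipefail: $(status_without_pipefail)"
echo "with pipefail:    $(status_with_pipefail)"
```

The first call reports 0 and the second reports 1. Combined with <code>-e</code>, that non-zero status is what stops a script before a half-finished pipeline can produce a partial result.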
Together, these flags help ensure that a broken config never makes it into a build. <h4 id="heading-3-reading-input-arguments">3. Reading Input Arguments</h4> <pre><code class="language-yaml">ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}
</code></pre> These lines read positional arguments passed to the script: <ul> <li><code>$1</code>: Environment name (<code>dev</code>, <code>staging</code>, <code>production</code>) </li> <li><code>$2</code>: API base URL </li> <li><code>$3</code>: Encryption or API key </li> </ul> The <code>${1:-}</code> syntax means: “If the argument is missing, default to an empty string instead of crashing.” This works hand-in-hand with <code>set -u</code>: we control the failure explicitly instead of letting Bash abort with an "unbound variable" error. <h4 id="heading-4-defining-input-and-output-files">4. Defining Input and Output Files</h4> <pre><code class="language-yaml">TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"
</code></pre> Here we define two files: <ul> <li>Template file (<code>env_ci.dart</code>) <ul> <li>Contains placeholder values like <code><<BASE_URL>></code> </li> <li>Safe to commit to Git </li> </ul> </li> <li>Generated file (<code>env_ci.g.dart</code>) <ul> <li>Contains real environment values </li> <li>Must be ignored by Git (<code>.gitignore</code>) </li> </ul> </li> </ul> At the heart of this approach are two Dart files with very different responsibilities. They may look similar, but they play completely different roles in the system. <h4 id="heading-envcidart"><code>env_ci.dart</code>:</h4> <pre><code class="language-java">// lib/core/env/env_ci.dart
class EnvConfig {
  static const String baseUrl = '<<BASE_URL>>';
  static const String encryptionKey = '<<ENCRYPTION_KEY>>';
  static const String environment = '<<ENV_NAME>>';
}
</code></pre> This file is safe, static, and version-controlled. It contains placeholders, not real values. 
Some of its key characteristics are: <ul> <li>Contains no real secrets </li> <li>Uses obvious placeholders (<code><<BASE_URL>></code>, etc.) </li> <li>Safe to commit to Git </li> <li>Reviewed like normal source code </li> <li>Serves as the single source of truth for required config fields </li> </ul> Think of this file as a contract: “These are the configuration values the app expects at runtime.” <h4 id="heading-envcigdart"><code>env_ci.g.dart</code>:</h4> This file is created at build time by <code>generate_config.sh</code>. After substitution, it looks like this: <pre><code class="language-java">// lib/core/env/env_ci.g.dart
// GENERATED FILE — DO NOT COMMIT
class EnvConfig {
  static const String baseUrl = 'https://staging.api.example.com';
  static const String encryptionKey = 'sk_live_xxxxx';
  static const String environment = 'staging';
}
</code></pre> Key characteristics: <ul> <li>Contains real environment values </li> <li>Generated dynamically in CI </li> <li>Differs per environment (dev / staging / production) </li> <li>Must never be committed to source control </li> </ul> This file exists only on a developer’s machine (if generated locally) or inside the CI runner during a build. On GitHub-hosted runners, it disappears along with the runner once the job finishes. <h4 id="heading-gitignore"><code>.gitignore</code>:</h4> To guarantee the generated file never leaks, it must be ignored by adding its path, <code>lib/core/env/env_ci.g.dart</code>, to <code>.gitignore</code>. <h4 id="heading-why-this-separation-is-critical">Why This Separation Is Critical</h4> This design solves several hard problems at once. Security: <ul> <li>Secrets live only in GitHub Actions secrets </li> <li>They never appear in the repository </li> <li>They never appear in PRs </li> <li>They never appear in Git history </li> </ul> Environment Isolation: Each environment gets its own generated config: <ul> <li><code>develop</code>: dev API </li> <li><code>staging</code>: staging API </li> <li><code>production</code>: production API </li> </ul> The same codebase behaves differently without branching logic in Dart. 
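The branch-to-environment mapping listed above can be sketched as a small Bash helper. This is only an illustration: the function name <code>env_for_branch</code> is hypothetical, and the sketch assumes the three branch names from the use case and deliberately rejects anything else rather than guessing.

```shell
#!/usr/bin/env bash
# Sketch: map a Git branch name to the environment label passed to
# generate_config.sh. The helper name env_for_branch is hypothetical.
env_for_branch() {
  case "$1" in
    develop)    echo "dev" ;;
    staging)    echo "staging" ;;
    production) echo "production" ;;
    *)
      # Refuse unknown branches instead of silently picking an environment.
      echo "Unknown branch: $1" >&2
      return 1
      ;;
  esac
}

env_for_branch "develop"
```

Failing on unknown branches is a design choice for this sketch: it keeps a feature branch from ever being built against production settings by accident.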
Deterministic Builds: Every build is fully reproducible, fully automated, and explicit about which environment it targets. There are no “it worked locally” scenarios. <h4 id="heading-5-validating-required-arguments">5. Validating Required Arguments</h4> <pre><code class="language-java">if [ -z "$ENV_NAME" ] || [ -z "$BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then echo "Usage: $0 <env-name> <base-url> <encryption-key>" exit 2 fi </code></pre> This block enforces correct usage. <ul> <li><code>-z</code> checks whether a variable is empty </li> <li>If any required argument is missing: <ul> <li>A helpful usage message is printed </li> <li>The script exits with a non-zero status code </li> </ul> </li> <li><code>0</code>: success </li> <li><code>1+</code>: failure </li> <li><code>2</code> conventionally means incorrect usage </li> </ul> In CI, this immediately fails the job and prevents an invalid build. <h4 id="heading-6-injecting-environment-values">6. Injecting Environment Values</h4> <pre><code class="language-java">sed -e "s|<<BASE_URL>>|$BASE_URL|g" \ -e "s|<<ENCRYPTION_KEY>>|$ENCRYPTION_KEY|g" \ -e "s|<<ENV_NAME>>|$ENV_NAME|g" \ "$TEMPLATE" > "$OUT" </code></pre> This is the heart of the script. What’s happening here: <ol> <li><code>sed</code> performs stream editing: it reads text, transforms it, and outputs the result </li> <li>Each <code>-e</code> flag defines a replacement rule: <ul> <li>Replace <code><<BASE_URL>></code> with the actual API URL </li> <li>Replace <code><<ENCRYPTION_KEY>></code> with the real key </li> <li>Replace <code><<ENV_NAME>></code> with the environment label </li> </ul> </li> <li>The transformed output is written to <code>env_ci.g.dart</code> </li> </ol> This entire operation happens at build time: <ul> <li>No secrets are committed </li> <li>No secrets are logged </li> <li>No secrets persist beyond the CI run </li> </ul> <h4 id="heading-7-success-feedback">7. 
Success Feedback</h4> <pre><code class="language-java">echo "Generated config for $ENV_NAME" </code></pre> This line provides a clear success signal in CI logs. It answers three important questions instantly: <ul> <li>Did the script run? </li> <li>Did it finish successfully? </li> <li>Which environment was generated? </li> </ul> In long CI logs, these small confirmations matter. Alright, now let's move on to the second script. <h3 id="heading-script-2-qualitygatesh">Script #2: <code>quality_gate.sh</code></h3> This script defines what “good code” means for your team. It's the quality gate, saved as <code>scripts/quality_checks.sh</code> in the project layout above and called under that name in the workflows. <pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail

echo "Running quality checks"

dart format --output=none --set-exit-if-changed .
flutter analyze
flutter test --no-pub --coverage

if command -v dart_code_metrics >/dev/null 2>&1; then
  dart_code_metrics analyze lib --reporter=console || true
fi

echo "Quality checks passed"
</code></pre> Let's break down this script bit by bit. <h4 id="heading-1-start-amp-end-log-markers">1. Start & End Log Markers</h4> <pre><code class="language-yaml">echo "Running quality checks"
...
echo "Quality checks passed"
</code></pre> These two lines act as visual boundaries in CI logs. In large pipelines (especially when Android and iOS jobs run in parallel), logs can be very noisy. Clear markers: <ul> <li>Help developers quickly find the quality phase </li> <li>Make debugging faster </li> <li>Confirm that the script completed successfully </li> </ul> The final success message only prints if everything above it passed, because <code>set -e</code> would have terminated the script earlier on failure. So this line effectively means: All quality gates passed. Safe to proceed. <h4 id="heading-2-running-the-test-suite">2. Running the Test Suite</h4> <pre><code class="language-yaml">flutter test --no-pub --coverage </code></pre> This line executes your entire Flutter test suite. Let’s break it down carefully. 1. 
<code>flutter test</code> This runs unit tests, widget tests, and any other test under the <code>test/</code> directory. If any test fails, the command exits with a non-zero status code. Because we enabled <code>set -e</code> earlier, that immediately stops the script and fails the CI job. The <code>--no-pub</code> flag skips the implicit <code>flutter pub get</code>, which is safe when dependencies have already been installed (as the workflow does in an earlier step). 2. <code>--coverage</code> This flag generates a coverage report at: <pre><code class="language-yaml">coverage/lcov.info </code></pre> This file can later be uploaded to Codecov, used to enforce minimum coverage thresholds, and tracked over time for quality improvement. Even if you’re not enforcing coverage yet, generating it now future-proofs your pipeline. <h4 id="heading-3-optional-code-metrics">3. Optional Code Metrics</h4> <pre><code class="language-yaml">if command -v dart_code_metrics >/dev/null 2>&1; then
  dart_code_metrics analyze lib --reporter=console || true
fi
</code></pre> This block is intentionally designed to be optional and non-blocking. Step 1 – Check If the Tool Exists: <pre><code class="language-yaml">command -v dart_code_metrics >/dev/null 2>&1 </code></pre> This checks whether <code>dart_code_metrics</code> is installed. <ul> <li>If installed, proceed </li> <li>If not installed, skip silently </li> </ul> The redirection: <ul> <li><code>>/dev/null</code> discards normal output </li> <li><code>2>&1</code> sends error output to the same place, so it is discarded too </li> </ul> This makes the script portable: <ul> <li>Developers without the tool can still run the script </li> <li>CI can enforce it if configured </li> </ul> Step 2 – Run Metrics (Soft Enforcement): <pre><code class="language-yaml">dart_code_metrics analyze lib --reporter=console || true </code></pre> This analyzes the <code>lib/</code> directory and prints results in the console. The important part is: <pre><code class="language-yaml">|| true </code></pre> Because we enabled <code>set -e</code>, any failing command would normally stop the script. 
Adding <code>|| true</code> overrides that behavior: if the metrics tool reports issues, the script continues and CI does not fail. Why design it this way? Because metrics are often gradual improvements, technical debt indicators, or advisory rather than blocking. You can later remove <code>|| true</code> to make metrics mandatory. <h4 id="heading-4-final-success-message">4. Final Success Message</h4> <pre><code class="language-yaml">echo "Quality checks passed" </code></pre> This line only executes if formatting passed, static analysis passed, and tests passed. If you see this in CI logs, it means the branch has successfully cleared the quality gate. It’s your automated approval before deployment steps begin. <h4 id="heading-what-this-script-guarantees">What This Script Guarantees</h4> With this in place, every branch must satisfy: <ul> <li>Clean formatting </li> <li>No analyzer errors </li> <li>Passing tests </li> <li>(Optional) Healthy metrics </li> </ul> That’s how you move from “We try to maintain quality” to “Quality is enforced automatically.” Alright, on to the third script. <h3 id="heading-script-3-uploadsymbolssh-sentry">Script #3: <code>upload_symbols.sh</code> (Sentry)</h3> This script is responsible for uploading obfuscation debug symbols to Sentry so production crashes remain readable. <pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail

RELEASE=${1:-}
[ -z "$RELEASE" ] && exit 2

if ! command -v sentry-cli >/dev/null 2>&1; then
  exit 0
fi

sentry-cli releases new "$RELEASE" || true
sentry-cli upload-dif build/symbols || true
sentry-cli releases finalize "$RELEASE" || true

echo "✅ Symbols uploaded for release $RELEASE"
</code></pre> Let's go through it step by step. <h4 id="heading-1-reading-the-release-identifier">1. Reading the Release Identifier</h4> <pre><code class="language-yaml">RELEASE=${1:-} </code></pre> This reads the first positional argument passed to the script. 
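To see why the <code>${1:-}</code> form matters once <code>set -u</code> is active, the minimal sketch below compares a direct <code>$1</code> lookup with the defaulted form. Both function names are hypothetical and exist only for this demonstration.

```shell
#!/usr/bin/env bash
# Sketch: how "${1:-}" behaves under `set -u`. Function names are illustrative.

# Referencing $1 directly aborts with an "unbound variable" error
# when no argument is given (the error message is silenced here).
read_arg_strict() {
  ( set -u; echo "got: $1" ) 2>/dev/null
}

# "${1:-}" expands to an empty string instead, so the script can
# validate the value itself and choose its own exit code.
read_arg_with_default() {
  ( set -u; value=${1:-}; echo "got: '${value}'" )
}

read_arg_strict || echo "strict lookup failed"
read_arg_with_default
read_arg_with_default "abc123"
```

With the defaulted form, the script stays in control: it can print a usage message and exit with a status of its choosing rather than dying on the first reference to a missing argument.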
When you call the script in CI, it typically looks like: <pre><code class="language-yaml">./scripts/upload_symbols.sh $(git rev-parse --short HEAD) </code></pre> So <code>$1</code> becomes the short Git commit SHA. Using <code>${1:-}</code> ensures: <ul> <li>If no argument is passed, the variable becomes an empty string </li> <li>The script does not crash due to <code>set -u</code> </li> </ul> This release value ties the uploaded symbols, deployed build, and crash reports all to the exact same commit. This linkage is critical for production debugging. <h4 id="heading-2-validating-the-release-argument">2. Validating the Release Argument</h4> <pre><code class="language-yaml">[ -z "$RELEASE" ] && exit 2 </code></pre> This is a compact validation check. <ul> <li><code>-z</code> checks whether the string is empty </li> <li>If it is empty → exit with status code 2 </li> </ul> Conventionally: <ul> <li><code>0</code> = success </li> <li><code>1+</code> = failure </li> <li><code>2</code> = incorrect usage </li> </ul> This prevents symbol uploads from running without a release identifier, which would break traceability in Sentry. <h4 id="heading-3-checking-if-sentry-cli-exists">3. Checking If <code>sentry-cli</code> Exists</h4> <pre><code class="language-yaml">if ! command -v sentry-cli >/dev/null 2>&1; then exit 0 fi </code></pre> This block checks whether the <code>sentry-cli</code> tool is available in the environment. What’s happening: <ul> <li><code>command -v sentry-cli</code> checks if it exists </li> <li><code>>/dev/null 2>&1</code> suppresses all output </li> <li><code>!</code> negates the condition </li> </ul> So this reads as: "If <code>sentry-cli</code> is NOT installed, exit successfully." Why exit with <code>0</code> instead of failing? Because not every environment needs symbol uploads. Also, dev builds may not install Sentry, and you don’t want CI to fail just because Sentry isn’t configured. This makes symbol uploading environment-aware and optional. 
Production environments can install <code>sentry-cli</code>, while dev environments skip it cleanly. <h4 id="heading-4-creating-a-new-release-in-sentry">4. Creating a New Release in Sentry</h4> <pre><code class="language-yaml">sentry-cli releases new "$RELEASE" || true </code></pre> This tells Sentry: “A new release exists with this version identifier.” Even if the release already exists, the script continues because of: <pre><code class="language-yaml">|| true </code></pre> This prevents the build from failing if: <ul> <li>The release was already created </li> <li>The command returns a non-critical error </li> </ul> The goal is resilience, not strict enforcement. <h4 id="heading-5-uploading-debug-information-files-difs">5. Uploading Debug Information Files (DIFs)</h4> <pre><code class="language-yaml">sentry-cli upload-dif build/symbols || true </code></pre> This is the core step. <code>build/symbols</code> is generated when you build Flutter with: <pre><code class="language-yaml">--obfuscate --split-debug-info=build/symbols </code></pre> When you obfuscate Flutter builds: <ul> <li>Method names are renamed </li> <li>Stack traces become unreadable </li> </ul> The symbol files allow Sentry to reverse-map obfuscated stack traces and show readable crash reports. Without this step, production crashes look like: <pre><code class="language-yaml">a.b.c.d (Unknown Source) </code></pre> With this step, you get: <pre><code class="language-yaml">AuthRepository.login() </code></pre> Again, <code>|| true</code> ensures the build doesn’t fail if: <ul> <li>The directory doesn’t exist </li> <li>No symbols were generated </li> <li>Upload encounters a transient issue </li> </ul> Symbol uploads should not block deployment. <h4 id="heading-6-finalizing-the-release">6. Finalizing the Release</h4> <pre><code class="language-yaml">sentry-cli releases finalize "$RELEASE" || true </code></pre> This marks the release as complete in Sentry. 
Finalizing signals: <ul> <li>The release is deployed </li> <li>It can begin aggregating crash reports </li> <li>It’s ready for production monitoring </li> </ul> Like the previous steps, this is soft-failed with <code>|| true</code> to keep CI robust. <h4 id="heading-what-this-script-guarantees">What This Script Guarantees</h4> When everything is configured correctly: <ol> <li>Production build is obfuscated </li> <li>Debug symbols are generated </li> <li>Symbols are uploaded to Sentry </li> <li>Crashes map back to real source code </li> <li>Release version matches commit SHA </li> </ol> That’s production-grade crash observability. Now that we've gone through the three helper scripts, let's dive into the three workflow .yml files we're going to create. <h2 id="heading-workflow-1-prchecksyml">Workflow #1: <code>pr_checks.yml</code></h2> This workflow ensures that a certain standard is met whenever a PR is raised into a common or base branch. All quality checks on the incoming code must pass before any merge into the base branch is allowed. It's basically a gate that verifies the quality of the code that's about to be merged. If your pipeline allows unverified code into your base branch, then your CI becomes decorative, not protective. Let's break down what's actually needed during every PR check. <h3 id="heading-1-dependency-integrity">1. Dependency Integrity</h3> For Flutter apps, where we manage dependencies with the <code>flutter pub get</code> command, it's important to confirm the integrity of all dependencies: that they resolve cleanly and are compatible with each other. 
Every PR should begin with: <pre><code class="language-yaml">flutter pub get </code></pre> This ensures: <ul> <li><code>pubspec.yaml</code> is valid </li> <li>Dependency constraints are consistent </li> <li>Lockfiles are not broken </li> <li>The project is buildable in a clean environment </li> </ul> If this fails, the branch is not deployable. <h3 id="heading-2-static-analysis">2. Static Analysis</h3> This ensures code quality and architecture integrity. Static analysis helps prevent common issues like a forgotten <code>await</code>, dead code, null safety violations, async misuse, and so on. Many production bugs aren't business logic errors – they're structural carelessness. Static analysis helps enforce consistency automatically, so code reviews focus on intent, not linting. <pre><code class="language-yaml">flutter analyze --fatal-infos --fatal-warnings </code></pre> <h3 id="heading-3-formatting">3. Formatting</h3> This command checks that your code is formatted according to the standard Dart style and fails if any file would be changed by the formatter. <pre><code class="language-yaml">dart format --output=none --set-exit-if-changed . </code></pre> <h3 id="heading-4-tests">4. Tests</h3> This runs the unit, widget and business logic tests to ensure quality and avoid regression leaks, silent behavior changes and feature drift. <pre><code class="language-yaml">flutter test --coverage </code></pre> <h3 id="heading-5-test-coverage-enforcement">5. Test Coverage Enforcement</h3> Ideally, running tests is not enough. Your workflow should also enforce a minimum threshold: <pre><code class="language-yaml"># Requires the lcov tool to be available on the runner.
# lcov reports a decimal such as 85.3%, but [ -lt ] only compares integers,
# so the fractional part is dropped before the comparison.
COVERAGE=$(lcov --summary coverage/lcov.info 2>&1 | grep 'lines' | awk '{print $2}' | sed 's/%//' | cut -d. -f1)
if [ "$COVERAGE" -lt 70 ]; then
  echo "Coverage too low"
  exit 1
fi
</code></pre> The command above ensures that a minimum line coverage of 70% is met. With this in place, quality becomes measurable. At a minimum, a quality gate should run the five checks above to protect code quality and integrity. 
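If you'd rather not depend on the <code>lcov</code> binary being present, the same threshold can be approximated by reading the tracefile directly. The sketch below is an assumption-laden illustration, not the article's method: it relies on the standard <code>LF:</code> (lines found) and <code>LH:</code> (lines hit) records in <code>coverage/lcov.info</code>, and the helper names <code>coverage_percent</code> and <code>check_coverage</code> are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compute line coverage directly from an lcov tracefile.
# LF:<n> = instrumented lines in a file, LH:<n> = lines hit at least once.
# coverage_percent and check_coverage are hypothetical helper names.

# Print total line coverage as a whole-number percentage (rounded down).
coverage_percent() {
  awk -F: '
    /^LF:/ { found += $2 }
    /^LH:/ { hit += $2 }
    END {
      if (found > 0) printf "%d\n", int((hit * 100) / found)
      else print 0
    }
  ' "$1"
}

# Return non-zero when coverage in file $1 is below the minimum in $2.
check_coverage() {
  local file=$1 minimum=$2 actual
  actual=$(coverage_percent "$file")
  if [ "$actual" -lt "$minimum" ]; then
    echo "Coverage ${actual}% is below the required ${minimum}%"
    return 1
  fi
  echo "Coverage ${actual}% meets the required ${minimum}%"
}
```

In a workflow step or a local run, this would be called as <code>check_coverage coverage/lcov.info 70</code> after <code>flutter test --coverage</code>. Because the percentage is rounded down, a project at 69.9% still fails a 70% gate.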
Now here is the full pr_checks.yml file: <pre><code class="language-yaml">name: PR Quality Gate

on:
  pull_request:
    branches: develop
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  pr-checks:
    name: Run quality checks on this pull request
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Setup Java
        uses: actions/setup-java@v1
        with:
          java-version: "12.x"

      - name: Setup Flutter
        uses: subosito/flutter-action@v1
        with:
          channel: "stable"

      - name: Install dependencies
        run: flutter pub get

      - name: Run quality checks
        run: ./scripts/quality_checks.sh

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "PR Quality Checks PASSED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: ${{ github.head_ref }} → ${{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "PR Quality Checks FAILED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: ${{ github.head_ref }} → ${{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Please fix the issues before requesting review 🔧"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"
</code></pre> Every time a developer opens (or updates) a pull request targeting the <code>develop</code> branch, this workflow kicks in automatically. Think of it as a bouncer at the door: no code gets through without passing inspection first. <h3 id="heading-what-triggers-it">What Triggers it?</h3> The workflow fires on four activity types: when a PR is <code>opened</code>, updated with new commits (<code>synchronize</code>), <code>reopened</code>, or marked <code>ready_for_review</code>. Note that draft PRs still fire <code>opened</code> and <code>synchronize</code>, so this list alone doesn't exclude drafts. If you want the checks to skip drafts, add a job-level condition such as <code>if: github.event.pull_request.draft == false</code>. 
<h3 id="heading-what-does-it-actually-do">What Does it Actually Do?</h3> It spins up a fresh Ubuntu machine and runs five steps in sequence: <ol> <li>Checkout: pulls down the branch's code </li> <li>Setup Java 12: installs the JDK (likely a dependency for some tooling or build process) </li> <li>Setup Flutter (stable channel): this is a Flutter project, so it grabs the stable Flutter SDK </li> <li>Install dependencies: runs <code>flutter pub get</code> to pull all Dart/Flutter packages </li> <li>Run quality checks: executes the helper shell script we created (<code>./scripts/quality_checks.sh</code>), which runs the linting, formatting, and test checks </li> </ol> <h3 id="heading-the-notification-layer">The Notification Layer</h3> After the checks run, the workflow reports the verdict, and the message depends on the outcome: <ul> <li>If everything passes, it logs a success message with the PR URL, branch info, and the person who opened it </li> <li>If something fails, it logs a failure message and nudges the author to fix issues before requesting a review </li> </ul> Both messages name <code>@foluwaseyi-dev</code> and <code>@olabodegbolu</code>, the two team members responsible for staying in the loop. Because these steps only <code>echo</code> to the job log, the handles appear in the log output rather than triggering a GitHub mention. This workflow enforces a "fix it before you merge it" culture: with the check marked as required in the branch protection rules for <code>develop</code>, broken code can't be merged without the team knowing about it. <h2 id="heading-workflow-2-androidyml">Workflow #2: Android.yml</h2> It's good practice to split your workflows by platform, so that each platform's instructions can be managed separately. That is why the Android workflow lives in its own file. Unlike <code>pr_checks.yml</code>, this workflow presumes that all quality checks have already passed and that the code it builds meets the required standards. 
Based on our predefined use case, let's create a workflow to handle test deployments when merged to develop or staging, and production level activities when merged to production. <pre><code class="language-yaml">name: Android Build & Release on: push: branches: - develop - staging - production jobs: android: runs-on: ubuntu-latest env: FLUTTER_VERSION: 'stable' steps: - name: Checkout code uses: actions/checkout@v3 - name: Setup Java uses: actions/setup-java@v3 with: distribution: 'temurin' java-version: '11' - name: Setup Flutter uses: subosito/flutter-action@v2 with: flutter-version: ${{ env.FLUTTER_VERSION }} - name: Install dependencies run: flutter pub get - name: Determine environment id: env run: | echo "branch=${GITHUB_REF##*/}" >> $GITHUB_OUTPUT if [ "${GITHUB_REF##*/}" = "develop" ]; then echo "ENV=dev" >> $GITHUB_OUTPUT elif [ "${GITHUB_REF##*/}" = "staging" ]; then echo "ENV=staging" >> $GITHUB_OUTPUT else echo "ENV=production" >> $GITHUB_OUTPUT fi # Dev uses hardcoded values no secrets needed - name: Generate config (dev) if: steps.env.outputs.ENV == 'dev' run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key" # Staging and production inject real secrets - name: Generate config (staging/production) if: steps.env.outputs.ENV != 'dev' run: | if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then ./scripts/generate_config.sh staging \ "${{ secrets.STAGING_BASE_URL }}" \ "${{ secrets.STAGING_API_KEY }}" else ./scripts/generate_config.sh production \ "${{ secrets.PROD_BASE_URL }}" \ "${{ secrets.PROD_API_KEY }}" fi # Keystore is only needed for signed builds (staging & production) - name: Restore Keystore if: steps.env.outputs.ENV != 'dev' run: | echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode > android/app/upload-keystore.jks # Production builds are obfuscated + split debug info for Play Store - name: Build artifact run: | if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then flutter build appbundle 
--release \ --obfuscate \ --split-debug-info=build/symbols else flutter build appbundle --release fi # Dev and staging go to Firebase App Distribution for internal testing - name: Upload to Firebase App Distribution if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging' env: FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }} FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }} FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }} run: | firebase appdistribution:distribute \ build/app/outputs/bundle/release/app-release.aab \ --app "$FIREBASE_ANDROID_APP_ID" \ --groups "$FIREBASE_GROUPS" \ --token "$FIREBASE_TOKEN" # Only production goes to the Play Store - name: Upload to Play Store if: steps.env.outputs.ENV == 'production' uses: r0adkll/upload-google-play@v1 with: serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }} packageName: com.your.package releaseFiles: build/app/outputs/bundle/release/app-release.aab track: production - name: Notify Team (Success) if: success() run: | echo "Android Build & Release PASSED" echo "Environment: ${{ steps.env.outputs.ENV }}" echo "Branch: ${{ steps.env.outputs.branch }}" echo "By: @${{ github.actor }}" echo "Commit: ${{ github.sha }}" - name: Notify Team (Failure) if: failure() run: | echo "Android Build & Release FAILED" echo "Environment: ${{ steps.env.outputs.ENV }}" echo "Branch: ${{ steps.env.outputs.branch }}" echo "By: @${{ github.actor }}" echo "Commit: ${{ github.sha }}" echo "Check the logs and fix the issue before retrying" </code></pre> This workflow ensures that whenever code lands on the develop, staging or production branch, this action is triggered on a fresh Ubuntu machine. This is triggered by a simple push to any of the tracked branches, no manual intervention needed. Let's walk through it piece by piece. <h3 id="heading-1-the-setup-phase">1. 
The Setup Phase</h3> Before any Flutter-specific work happens, the workflow lays the foundation: <ol> <li>Checkout: grabs the latest code from the branch that triggered the run (using the more modern <code>actions/checkout@v3</code>). </li> <li>Java 11 via Temurin: the first workflow used the older <code>setup-java@v1</code>. This one uses <code>setup-java@v3</code> with the <code>temurin</code> distribution, the open-source JDK build from Eclipse Adoptium, which is widely used for Android toolchains. </li> <li>Flutter (stable): this pulls the stable Flutter SDK, selected via an environment variable (<code>FLUTTER_VERSION: 'stable'</code>) defined at the job level. Note that <code>stable</code> names a channel rather than an exact version. </li> <li>Install dependencies: runs <code>flutter pub get</code> to pull all packages </li> </ol> <h3 id="heading-2-environment-detection">2. Environment Detection</h3> This is where it gets interesting. The workflow determines which environment it is building for, and that decision drives the steps that follow. The step reads the branch name from <code>GITHUB_REF</code> and maps it to the environment label we already defined in one of our helper scripts. <ul> <li>develop → ENV=dev </li> <li>staging → ENV=staging </li> <li>production → ENV=production </li> </ul> It extracts the branch name from the full ref path (for example, <code>refs/heads/develop</code>) using <code>${GITHUB_REF##*/}</code>, which strips everything up to the last slash. It then writes both the branch name and the resolved <code>ENV</code> value to <code>$GITHUB_OUTPUT</code>, making them available as named outputs (<code>steps.env.outputs.ENV</code>) for every subsequent step. This means the rest of the pipeline can branch its behaviour based on which environment it's building for: different API keys, different signing configs, different targets – whatever the app needs. <h3 id="heading-3-config-injection">3. Config Injection</h3> With the environment resolved, the next step is injecting the right configuration into the app. 
This is where the <code>generate_config.sh</code> script we built earlier gets called directly from the workflow. For the <code>dev</code> environment, hardcoded placeholder values are used. No real secrets are needed, since this build is only meant for internal developer testing: <pre><code class="language-yaml">- name: Generate config (dev) if: steps.env.outputs.ENV == 'dev' run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key" </code></pre> For staging and production, however, real secrets are pulled from GitHub Actions secrets and passed directly into the script: <pre><code class="language-yaml">- name: Generate config (staging/production) if: steps.env.outputs.ENV != 'dev' run: | if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then ./scripts/generate_config.sh staging \ "${{ secrets.STAGING_BASE_URL }}" \ "${{ secrets.STAGING_API_KEY }}" else ./scripts/generate_config.sh production \ "${{ secrets.PROD_BASE_URL }}" \ "${{ secrets.PROD_API_KEY }}" fi </code></pre> Notice that these two steps use an <code>if</code> condition to make them mutually exclusive. Only one will ever run per job. This keeps the pipeline clean: no complicated branching logic inside the script itself, just a clear decision at the workflow level. <h3 id="heading-4-keystore-restoration">4. Keystore Restoration</h3> Android requires signed builds for distribution. The signing keystore file cannot be committed to the repository for obvious security reasons, so it's stored as a Base64-encoded GitHub secret and decoded at build time. <pre><code class="language-yaml">- name: Restore Keystore if: steps.env.outputs.ENV != 'dev' run: | echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode > android/app/upload-keystore.jks </code></pre> This step is skipped entirely for the <code>dev</code> environment because dev builds are unsigned debug artifacts meant purely for internal testing on Firebase App Distribution. 
Only staging and production builds need to be properly signed. To encode your keystore file as a Base64 string for storing in GitHub secrets, you have to run this locally: <pre><code class="language-yaml">base64 -i upload-keystore.jks | pbcopy </code></pre> This copies the encoded string to your clipboard, which you can then paste directly into your GitHub repository secrets. <h3 id="heading-5-building-the-artifact">5. Building the Artifact</h3> With the environment configured and the keystore in place, the workflow builds the app bundle: <pre><code class="language-yaml">- name: Build artifact run: | if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then flutter build appbundle --release \ --obfuscate \ --split-debug-info=build/symbols else flutter build appbundle --release fi </code></pre> There's a deliberate difference between how production and non-production builds are compiled. For production: <ul> <li><code>--obfuscate</code> renames method and class names in the compiled output, making it significantly harder to reverse engineer the app </li> <li><code>--split-debug-info=build/symbols</code> extracts the debug symbols into a separate directory at <code>build/symbols</code> </li> </ul> These symbols are what <code>upload_symbols.sh</code> later ships to Sentry, so obfuscated crash reports remain readable in your monitoring dashboard. For dev and staging, neither flag is used. This keeps build times faster and makes local debugging easier since stack traces remain human-readable. <h3 id="heading-6-distributing-to-firebase-app-distribution">6. 
Distributing to Firebase App Distribution</h3> Once the app bundle is built, dev and staging builds are uploaded to Firebase App Distribution so testers can install them immediately: <pre><code class="language-yaml">- name: Upload to Firebase App Distribution
  if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging'
  env:
    FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }}
    FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }}
    FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }}
  run: |
    firebase appdistribution:distribute \
      build/app/outputs/bundle/release/app-release.aab \
      --app "$FIREBASE_ANDROID_APP_ID" \
      --groups "$FIREBASE_GROUPS" \
      --token "$FIREBASE_TOKEN"
</code></pre> Three secrets power this step: <ul> <li><code>FIREBASE_TOKEN</code>: the authentication token generated from <code>firebase login:ci</code> </li> <li><code>FIREBASE_ANDROID_APP_ID</code>: the app identifier from the Firebase console </li> <li><code>FIREBASE_GROUPS</code>: the tester group(s) that should receive the build notification </li> </ul> Once this step completes, every tester in the specified groups receives an email with a direct download link. No one needs to manually share an APK file over Slack or email. <h3 id="heading-7-deploying-to-the-play-store">7. Deploying to the Play Store</h3> Production builds skip Firebase entirely and go straight to the Google Play Store: <pre><code class="language-yaml">- name: Upload to Play Store
  if: steps.env.outputs.ENV == 'production'
  uses: r0adkll/upload-google-play@v1
  with:
    serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }}
    packageName: com.your.package
    releaseFiles: build/app/outputs/bundle/release/app-release.aab
    track: production
</code></pre> This uses the <code>r0adkll/upload-google-play</code> GitHub Action, which handles the Google Play API interaction under the hood. 
The only requirements are: <ul> <li>A Google Play service account with the correct permissions, stored as a JSON secret </li> <li>The correct package name matching what is registered in your Play Console </li> <li>The <code>track</code> set to <code>production</code> (you can also use <code>internal</code>, <code>alpha</code>, or <code>beta</code> depending on your release strategy) </li> </ul> Replace <code>com.your.package</code> with your actual application ID (the same one defined in your <code>build.gradle</code> file). <h3 id="heading-8-the-notification-layer">8. The Notification Layer</h3> Just like the PR checks workflow, this workflow reports its outcome clearly: <pre><code class="language-yaml">- name: Notify Team (Success)
  if: success()
  run: |
    echo "Android Build & Release PASSED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"

- name: Notify Team (Failure)
  if: failure()
  run: |
    echo "Android Build & Release FAILED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"
    echo "Check the logs and fix the issue before retrying 🔧"
</code></pre> The success notification includes the environment, branch, actor, and commit SHA: everything needed to trace exactly what was deployed and who triggered it. The failure notification includes the same context, with a clear call to action. <h2 id="heading-workflow-3-iosyml">Workflow #3: iOS.yml</h2> iOS CI/CD is more complex than Android by nature. This is because Apple's signing requirements involve certificates, provisioning profiles, and entitlements that all need to be in the right place before Xcode will produce a valid archive. Fastlane handles much of that complexity, and the workflow simply calls into it. 
Here is the full <code>ios.yml</code>: <pre><code class="language-yaml">name: iOS Build & Release on: push: branches: - develop - staging - production jobs: ios: runs-on: macos-latest env: FLUTTER_VERSION: 'stable' steps: - name: Checkout code uses: actions/checkout@v3 - name: Setup Flutter uses: subosito/flutter-action@v2 with: flutter-version: ${{ env.FLUTTER_VERSION }} - name: Install dependencies run: flutter pub get - name: Determine environment id: env run: | echo "branch=${GITHUB_REF##*/}" >> $GITHUB_OUTPUT if [ "${GITHUB_REF##*/}" = "develop" ]; then echo "ENV=dev" >> $GITHUB_OUTPUT elif [ "${GITHUB_REF##*/}" = "staging" ]; then echo "ENV=staging" >> $GITHUB_OUTPUT else echo "ENV=production" >> $GITHUB_OUTPUT fi - name: Generate config (dev) if: steps.env.outputs.ENV == 'dev' run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key" - name: Generate config (staging/production) if: steps.env.outputs.ENV != 'dev' run: | if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then ./scripts/generate_config.sh staging \ "${{ secrets.STAGING_BASE_URL }}" \ "${{ secrets.STAGING_API_KEY }}" else ./scripts/generate_config.sh production \ "${{ secrets.PROD_BASE_URL }}" \ "${{ secrets.PROD_API_KEY }}" fi - name: Install Fastlane run: | cd ios gem install bundler bundle install - name: Import signing certificate if: steps.env.outputs.ENV != 'dev' run: | echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode > ios/cert.p12 security create-keychain -p "" build.keychain security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign security list-keychains -s build.keychain security default-keychain -s build.keychain security unlock-keychain -p "" build.keychain security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain - name: Install provisioning profile if: steps.env.outputs.ENV != 'dev' run: | echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode > 
profile.mobileprovision mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/ - name: Build iOS (dev) if: steps.env.outputs.ENV == 'dev' run: flutter build ios --release --no-codesign - name: Build & distribute to TestFlight (staging) if: steps.env.outputs.ENV == 'staging' env: APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }} APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }} APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }} run: | cd ios bundle exec fastlane beta - name: Build & release to App Store (production) if: steps.env.outputs.ENV == 'production' env: APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }} APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }} APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }} run: | cd ios bundle exec fastlane release - name: Upload Sentry symbols (production only) if: steps.env.outputs.ENV == 'production' env: SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }} SENTRY_ORG: ${{ secrets.SENTRY_ORG }} SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }} run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD) - name: Notify Team (Success) if: success() run: | echo "iOS Build & Release PASSED" echo "Environment: ${{ steps.env.outputs.ENV }}" echo "Branch: ${{ steps.env.outputs.branch }}" echo "By: @${{ github.actor }}" echo "Commit: ${{ github.sha }}" - name: Notify Team (Failure) if: failure() run: | echo "iOS Build & Release FAILED" echo "Environment: ${{ steps.env.outputs.ENV }}" echo "Branch: ${{ steps.env.outputs.branch }}" echo "By: @${{ github.actor }}" echo "Commit: ${{ github.sha }}" echo "Check the logs and fix the issue before retrying 🔧" </code></pre> Let's walk through what is different about this workflow compared to that of android. <h3 id="heading-1-macos-runner">1. 
macOS Runner</h3> <pre><code class="language-yaml">runs-on: macos-latest </code></pre> This is the major difference. iOS builds require Xcode, which only runs on macOS. GitHub Actions provides hosted macOS runners, but they are significantly more expensive in terms of compute minutes than Ubuntu runners. Keep that in mind when deciding how often to build. No Java setup is needed here. Flutter on iOS compiles through Xcode directly, so the toolchain requirements are different. <h3 id="heading-2-installing-fastlane">2. Installing Fastlane</h3> <pre><code class="language-yaml">- name: Install Fastlane
  run: |
    cd ios
    gem install bundler
    bundle install
</code></pre> Fastlane is a Ruby-based automation tool that handles certificate management, building, and uploading to TestFlight and the App Store. This step navigates into the <code>ios/</code> directory and installs Fastlane along with all its dependencies as defined in the project's <code>Gemfile</code>. Your <code>ios/Gemfile</code> should look something like this: <pre><code class="language-ruby">source "https://rubygems.org"

gem "fastlane"
</code></pre> And your <code>ios/fastlane/Fastfile</code> should define at minimum two lanes: one for staging (TestFlight) and one for production (App Store): <pre><code class="language-ruby">default_platform(:ios)

platform :ios do
  lane :beta do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_testflight(skip_waiting_for_build_processing: true)
  end

  lane :release do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_app_store(force: true, skip_screenshots: true, skip_metadata: true)
  end
end
</code></pre> <h3 id="heading-3-certificate-and-provisioning-profile-setup">3. Certificate and Provisioning Profile Setup</h3> This is the step that trips most teams up the first time. 
Apple's code signing requires two things to be present on the machine: <ol> <li>The signing certificate (a <code>.p12</code> file) </li> <li>The provisioning profile </li> </ol> Both are stored as Base64-encoded GitHub secrets and restored at build time. <pre><code class="language-yaml">- name: Import signing certificate
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode > ios/cert.p12
    security create-keychain -p "" build.keychain
    security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign
    security list-keychains -s build.keychain
    security default-keychain -s build.keychain
    security unlock-keychain -p "" build.keychain
    security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain
</code></pre> Breaking this down step by step, the script: <ul> <li>Decodes the Base64 certificate and writes it to <code>ios/cert.p12</code> </li> <li>Creates a temporary keychain called <code>build.keychain</code> with an empty password </li> <li>Imports the certificate into that keychain, granting <code>codesign</code> access </li> <li>Adds the keychain to the search list and sets it as the default so Xcode finds it automatically </li> <li>Unlocks the keychain so it can be used non-interactively </li> <li>Sets the key partition list to allow access without repeated prompts </li> </ul> The provisioning profile step is simpler: <pre><code class="language-yaml">- name: Install provisioning profile
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode > profile.mobileprovision
    mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
    cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/
</code></pre> It decodes the profile and copies it into the exact directory where Xcode expects to find provisioning profiles on any macOS system. 
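The keystore, certificate, and provisioning profile steps all follow the same pattern: take a Base64 secret, decode it, and write it to a file. A minimal sketch of that pattern as a reusable function is shown here. The name <code>restore_secret_file</code> is hypothetical and not part of the article's scripts, and the empty-secret check and restrictive file permissions are additions that the workflow steps above don't perform.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the article's scripts): decode a
# Base64-encoded secret into a file, failing loudly if the secret is
# missing or cannot be decoded.

# restore_secret_file ENCODED_VALUE TARGET_PATH
restore_secret_file() {
  local encoded="$1"
  local target="$2"

  # An unset secret expands to an empty string in GitHub Actions,
  # which would otherwise produce an empty file without any error.
  if [ -z "$encoded" ]; then
    echo "Secret for ${target} is empty" >&2
    return 1
  fi

  # Create the parent directory if it does not exist yet.
  mkdir -p "$(dirname "$target")"

  # Decode into the target; remove the partial file if decoding fails.
  if ! printf '%s' "$encoded" | base64 --decode > "$target"; then
    rm -f "$target"
    echo "Secret for ${target} is not valid Base64" >&2
    return 1
  fi

  # Restrict access to the current user, since this is signing material.
  chmod 600 "$target"
}
```

Inside a workflow step, a call could look like <code>restore_secret_file "$IOS_CERTIFICATE_BASE64" ios/cert.p12</code>, with the secret passed in through the step's <code>env</code> block rather than interpolated into the script text.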
To encode your certificate and profile locally, you can run these: <pre><code class="language-bash">base64 -i Certificates.p12 | pbcopy # for the certificate base64 -i YourApp.mobileprovision | pbcopy # for the provisioning profile </code></pre> <h3 id="heading-4-building-for-each-environment">4. Building for Each Environment</h3> Dev builds skip signing entirely. They're built without code signing just to verify the project compiles correctly on a clean machine: <pre><code class="language-yaml">- name: Build iOS (dev) if: steps.env.outputs.ENV == 'dev' run: flutter build ios --release --no-codesign </code></pre> Staging builds go through Fastlane's <code>beta</code> lane, which builds and uploads to TestFlight. Production builds go through Fastlane's <code>release</code> lane, which submits directly to App Store Connect. Both staging and production steps consume the same three App Store Connect API key secrets: the key ID, the issuer ID, and the key content itself. Fastlane uses these to authenticate with Apple's API without requiring a manual Apple ID login. <h3 id="heading-5-sentry-symbol-upload">5. Sentry Symbol Upload</h3> On production iOS builds, the <code>upload_symbols.sh</code> script runs after the build completes, passing the current short commit SHA as the release identifier: <pre><code class="language-yaml">- name: Upload Sentry symbols (production only) if: steps.env.outputs.ENV == 'production' env: SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }} SENTRY_ORG: ${{ secrets.SENTRY_ORG }} SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }} run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD) </code></pre> This is the same script explained earlier in the helper scripts section. It creates a Sentry release, uploads the debug information files, and finalizes the release. Any production crash from this point forward will map back to real, readable source code in your Sentry dashboard. 
<h2 id="heading-secrets-and-configuration-reference">Secrets and Configuration Reference</h2> For this entire pipeline to work, you need to configure the following secrets in your GitHub repository. Go to Settings → Secrets and variables → Actions → New repository secret to add each one. Shared (used across environments): <table> <thead> <tr> <th>Secret</th> <th>Description</th> </tr> </thead> <tbody><tr> <td><code>FIREBASE_TOKEN</code></td> <td>Generated via <code>firebase login:ci</code> on your local machine</td> </tr> <tr> <td><code>FIREBASE_ANDROID_APP_ID</code></td> <td>Android app ID from your Firebase console</td> </tr> <tr> <td><code>FIREBASE_GROUPS</code></td> <td>Comma-separated tester group names in Firebase</td> </tr> <tr> <td><code>SENTRY_AUTH_TOKEN</code></td> <td>Auth token from your Sentry account settings</td> </tr> <tr> <td><code>SENTRY_ORG</code></td> <td>Your Sentry organization slug</td> </tr> <tr> <td><code>SENTRY_PROJECT</code></td> <td>Your Sentry project slug</td> </tr> </tbody></table> Staging: <table> <thead> <tr> <th>Secret</th> <th>Description</th> </tr> </thead> <tbody><tr> <td><code>STAGING_BASE_URL</code></td> <td>Your staging API base URL</td> </tr> <tr> <td><code>STAGING_API_KEY</code></td> <td>Your staging API or encryption key</td> </tr> </tbody></table> Production: <table> <thead> <tr> <th>Secret</th> <th>Description</th> </tr> </thead> <tbody><tr> <td><code>PROD_BASE_URL</code></td> <td>Your production API base URL</td> </tr> <tr> <td><code>PROD_API_KEY</code></td> <td>Your production API or encryption key</td> </tr> </tbody></table> Android: <table> <thead> <tr> <th>Secret</th> <th>Description</th> </tr> </thead> <tbody><tr> <td><code>ANDROID_KEYSTORE_BASE64</code></td> <td>Base64-encoded <code>.jks</code> keystore file</td> </tr> <tr> <td><code>GOOGLE_PLAY_SERVICE_ACCOUNT_JSON</code></td> <td>Full JSON content of your Play Console service account</td> </tr> </tbody></table> iOS: <table> <thead> <tr> <th>Secret</th> 
<th>Description</th> </tr> </thead> <tbody><tr> <td><code>IOS_CERTIFICATE_BASE64</code></td> <td>Base64-encoded <code>.p12</code> signing certificate</td> </tr> <tr> <td><code>IOS_CERTIFICATE_PASSWORD</code></td> <td>Password protecting the <code>.p12</code> file</td> </tr> <tr> <td><code>IOS_PROVISIONING_PROFILE_BASE64</code></td> <td>Base64-encoded <code>.mobileprovision</code> file</td> </tr> <tr> <td><code>APP_STORE_CONNECT_API_KEY_ID</code></td> <td>Key ID from App Store Connect → Users & Access → Keys</td> </tr> <tr> <td><code>APP_STORE_CONNECT_API_ISSUER_ID</code></td> <td>Issuer ID from the same App Store Connect page</td> </tr> <tr> <td><code>APP_STORE_CONNECT_API_KEY_CONTENT</code></td> <td>The full content of the downloaded <code>.p8</code> key file</td> </tr> </tbody></table> None of these values should ever appear in your codebase. If any secret is accidentally committed, rotate it immediately. <h2 id="heading-end-to-end-flow">End-to-End Flow</h2> With all three workflows in place, here is exactly what happens from the moment a developer opens a pull request to the moment a user receives an update: <h3 id="heading-1-developer-opens-a-pr-into-develop">1. Developer Opens a PR into <code>develop</code></h3> The <code>pr_checks.yml</code> workflow fires. It runs formatting checks, static analysis, and the full test suite. If anything fails, the PR cannot be merged and the team is notified immediately. The developer fixes the issues and pushes again, which triggers a fresh run. <h3 id="heading-2-pr-is-approved-and-merged-into-develop">2. PR is Approved and Merged into <code>develop</code></h3> The <code>android.yml</code> and <code>ios.yml</code> workflows both fire on the push event. They detect the environment as <code>dev</code> and inject placeholder config. The Android build is uploaded to Firebase App Distribution, while the iOS dev build runs without code signing and only verifies that the project compiles. Android testers receive an email and can install the build on their devices within minutes – no one shared a file manually. 
<h3 id="heading-3-develop-is-merged-into-staging">3. <code>develop</code> is Merged into <code>staging</code></h3> Both platform workflows fire again. This time the environment resolves to <code>staging</code>. Real secrets are injected, builds are properly signed, and the artifacts go to Firebase App Distribution (Android) and TestFlight (iOS). QA begins testing the staging build against the staging API. <h3 id="heading-4-staging-is-merged-into-production">4. <code>staging</code> is Merged into <code>production</code></h3> Both workflows fire one final time. Production secrets are injected, builds are obfuscated and signed, debug symbols are uploaded to Sentry, and the final artifacts are submitted to the Google Play Store and App Store Connect. The release goes live on Apple's and Google's review timelines with no further human intervention required. From that first PR to a production submission, not a single command was run manually. <h2 id="heading-conclusion">Conclusion</h2> Building this pipeline is an upfront investment that pays off from the very first release cycle. What used to be a sequence of error-prone manual steps (building locally, signing, uploading, switching configs, and hoping nothing was mixed up) is now a fully automated, auditable, and repeatable process that runs the moment code moves between branches. The architecture we built here does more than just automate builds. The PR quality gate enforces team standards consistently, so code review becomes a conversation about intent rather than a hunt for formatting issues. The environment-aware config injection eliminates an entire class of production incidents where staging keys made it into a live release. The Sentry symbol upload means your team can debug production crashes with full source visibility even from an obfuscated binary. Every piece of this pipeline also runs locally. 
The helper scripts in the <code>scripts/</code> folder are plain Bash, so you can call them from your terminal the same way CI calls them. This eliminates the frustrating cycle of pushing a commit just to test a pipeline change. As your team grows, this foundation scales with you. You can extend <code>pr_checks.yml</code> to enforce code coverage thresholds, add a performance benchmarking job, or introduce a dedicated security scanning step. You can extend the platform workflows to support multiple flavors, multiple Firebase projects, or staged rollouts on the Play Store. The architecture stays the same – you're just adding new steps to an already working system. With this in place, standards are enforced automatically, code quality stays high, and post-development activities no longer depend on manual effort, which gives your team a clear and repeatable process to build on. </article> <article> <h1> How to Build an AI-Powered Research Automation System with n8n, Groq, and Academic APIs </h1> Chidozie Managwu — Mon, 16 Mar 2026 18:17:27 +0000 As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources. For my work on circular economy and battery recycling, I needed a way to query multiple databases at once without the manual fatigue. In this tutorial, you'll build an automated research pipeline using n8n that reduces roughly six hours of manual literature review to a five-minute automated process. This isn’t a “cool demo workflow.” It’s a production-minded pipeline with parallel collection, normalisation, deduplication, structured AI extraction, scoring, and practical error handling. 
<h3 id="heading-table-of-contents">Table of Contents</h3> <ol> <li><a href="#heading-prerequisites">Prerequisites</a> </li> <li><a href="#heading-the-problem-research-takes-too-long">The Problem: Research Takes Too Long</a> </li> <li><a href="#heading-the-tech-stack">The Tech Stack</a> </li> <li><a href="#heading-the-project-structure-how-to-think-about-an-n8n-workflow-like-software">The Project Structure: How to Think About an n8n Workflow Like Software</a> </li> <li><a href="#heading-stage-1-centralised-configuration">Stage 1: Centralised Configuration</a> </li> <li><a href="#heading-stage-2-parallel-api-collection-with-failure-isolation">Stage 2: Parallel API Collection (With Failure Isolation)</a> </li> <li><a href="#heading-stage-3-normalisation-and-deduplication-doi-first-title-fallback">Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)</a> </li> <li><a href="#heading-stage-4-ai-powered-content-extraction-strict-json">Stage 4: AI-Powered Content Extraction (Strict JSON)</a> </li> <li><a href="#heading-stage-5-scoring-and-synthesis">Stage 5: Scoring and Synthesis</a> </li> <li><a href="#heading-beginner-friendly-evals-retrieval-and-extraction-qa">Beginner-Friendly Evals: Retrieval and Extraction QA</a> </li> <li><a href="#heading-key-learnings-and-error-handling">Key Learnings and Error Handling</a> </li> <li><a href="#heading-conclusion">Conclusion</a> </li> </ol> <h2 id="heading-prerequisites">Prerequisites</h2> You don’t need to be a DevOps engineer to follow this, but you should have: <ul> <li>Basic comfort with APIs and JSON (request/response payloads) </li> <li>Familiarity with spreadsheets (Google Sheets basics) </li> <li>Willingness to use a small amount of JavaScript inside n8n Function/Code nodes </li> </ul> You'll also need access to: <ul> <li>An n8n instance (self-hosted or cloud) </li> <li>A Groq API key (or a compatible LLM provider) </li> <li>Optional API keys, depending on the databases you use </li> </ul> The build assumes that: <ul> <li>You’re 
extracting from metadata + abstracts (not downloading full PDFs). </li> <li>You can accept that some sources will occasionally rate-limit or return partial results (and your workflow will be designed to survive this). </li> </ul> <h2 id="heading-the-problem-research-takes-too-long">The Problem: Research Takes Too Long</h2> Manual research is often a bottleneck for innovation. Before building this automation, my workflow involved searching multiple academic databases, scanning abstracts, and manually extracting key findings. This process was not only slow but also prone to human error and inconsistent note-taking. The goal of this automation is to provide a “full-stack research assistant” that handles the heavy lifting of collecting candidate papers, removing duplicates, extracting consistent fields, scoring relevance and quality, and delivering a curated daily or weekly report, so you can spend your time on high-level synthesis rather than repetitive collection. <h2 id="heading-the-tech-stack">The Tech Stack</h2> This workflow leverages a combination of automation tooling, high-speed LLM inference, and academic metadata providers. <table> <thead> <tr> <th>Tool</th> <th>Purpose</th> </tr> </thead> <tbody><tr> <td>n8n</td> <td>The workflow engine that orchestrates all steps</td> </tr> <tr> <td>Groq</td> <td>Runs a fast LLM (for example, Llama 3.3 70B) for structured extraction/synthesis</td> </tr> <tr> <td>Semantic Scholar / OpenAlex</td> <td>Broad academic coverage for metadata, abstracts, citations</td> </tr> <tr> <td>arXiv / PubMed</td> <td>Strong specialised coverage (preprints, life sciences)</td> </tr> <tr> <td>Google Sheets</td> <td>A lightweight “research database” for storage + history</td> </tr> </tbody></table> Notes: coverage varies by provider. Some APIs return abstracts reliably, while others may omit them. Your pipeline should treat missing abstracts as a normal case, not a failure. 
<h2 id="heading-the-project-structure-how-to-think-about-an-n8n-workflow-like-software">The Project Structure: How to Think About an n8n Workflow Like Software</h2> While n8n is a visual tool, it helps to design your workflow as modular stages to avoid the “spaghetti workflow” problem. <pre><code class="language-text">.
├── configuration/   # Keywords, thresholds, limits, date filters
├── collectors/      # Parallel HTTP request nodes (multiple sources)
├── processing/      # Normalization + deduplication code nodes
├── extraction/      # LLM extraction nodes (strict JSON)
├── scoring/         # Relevance + quality scoring + filtering
└── delivery/        # Google Sheets + email/HTML report
</code></pre> Design principle: each stage should produce a clean, predictable output shape that the next stage can rely on. <h2 id="heading-stage-1-centralised-configuration">Stage 1: Centralised Configuration</h2> Instead of hardcoding search parameters (keywords, min year, citation thresholds) across multiple nodes, use one configuration node to define workflow variables. This matters for maintainability (change a value once, not in ten nodes), reusability (repurpose the entire pipeline by swapping one config object), and debuggability (log the config at the start of each run so you can reproduce results). Use a Set node, or a Code node returning JSON like this: <pre><code class="language-json">{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": 2020,
  "max_results_per_source": 10,
  "min_citations": 2,
  "relevance_threshold": 15,
  "batch_size": 10
}
</code></pre> Tip: keep numeric fields as numbers (not strings) to avoid scoring bugs later. <h2 id="heading-stage-2-parallel-api-collection-with-failure-isolation">Stage 2: Parallel API Collection (With Failure Isolation)</h2> Your workflow should query multiple sources simultaneously. In n8n, you can branch from your configuration node into multiple HTTP Request nodes, and then merge results later. 
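To see the "parallel, with failure isolation" idea outside the n8n canvas, here is a minimal plain-JavaScript sketch. It is not how n8n executes branches internally. It only illustrates the same behaviour using the standard <code>Promise.allSettled</code>: every source is queried at once, a failing source contributes nothing instead of aborting the run, and its name is recorded for later investigation. The <code>collectAll</code> function and the collector names are hypothetical.

```javascript
// Hypothetical helper: `collectors` maps a source name to an async function
// that resolves to an array of raw paper records (an assumption for this sketch).
async function collectAll(collectors) {
  const names = Object.keys(collectors);

  // allSettled never rejects, so one failing source cannot abort the others.
  const settled = await Promise.allSettled(
    names.map((name) => collectors[name]())
  );

  const papers = [];
  const failedSources = [];

  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled" && Array.isArray(outcome.value)) {
      // Tag each record with its source so bad metadata can be traced later.
      papers.push(...outcome.value.map((p) => ({ ...p, source: names[i] })));
    } else {
      // A failed or malformed response counts as an empty result, but is logged.
      failedSources.push(names[i]);
    }
  });

  return { papers, failedSources };
}
```

Calling <code>collectAll</code> with one working and one throwing collector should still return the working source's papers, plus a <code>failedSources</code> list naming the one that broke.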
The production mindset here is simple: APIs fail. Rate limits happen. Providers return partial data. The key is to prevent one failing collector from crashing the whole run. To implement this, on each HTTP Request node, enable Continue On Fail (or the equivalent “don’t stop workflow” behaviour). Then, in the normalisation stage, treat missing or failed outputs as empty arrays so downstream stages still run. In practice, it also helps to set explicit timeouts and add a small retry policy (one to two retries) for transient failures. “Good” looks like this: if two out of five sources fail, you still produce a useful report from the remaining three, and you log which sources failed so you can investigate later. <h2 id="heading-stage-3-normalisation-and-deduplication-doi-first-title-fallback">Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)</h2> Each academic API returns different field names and shapes. One might use <code>title</code>, another <code>display_name</code>, another <code>paper_title</code>. Your next stage should normalise all inputs into one schema. <h3 id="heading-target-normalised-schema">Target normalised schema</h3> Here’s a simple baseline schema (expand later as needed): <pre><code class="language-json">{ "title": "string", "abstract": "string|null", "doi": "string|null", "year": 2024, "citations": 12, "url": "string|null", "source": "Semantic Scholar|OpenAlex|arXiv|PubMed" } </code></pre> <h3 id="heading-what-deduping-by-doi-means-and-what-a-doi-is">What deduping by DOI means (and what a DOI is)</h3> A DOI (Digital Object Identifier) is a unique, persistent identifier assigned to many scholarly publications. If a paper has a DOI, that DOI functions like a stable ID: the same paper may appear in multiple databases with slightly different metadata, but the DOI should remain consistent. So, deduping by DOI means: if two records share the same DOI, treat them as the same paper and keep only one. 
When a DOI is missing (which is common for some preprints and some API responses), the fallback is to dedupe using a normalised title key, lowercased, trimmed, punctuation stripped, and whitespace collapsed. It’s not as perfect as DOI-based matching, but it’s a strong pragmatic backup. <h3 id="heading-what-normalise-into-a-unified-object-means-whats-happening-in-the-code">What “normalise into a unified object” means (what’s happening in the code)</h3> “Normalise into a unified object” simply means converting every provider’s raw response into the same predictable shape (the schema above). Once everything looks the same, downstream steps, such as deduplication, scoring, AI extraction, and storage, become straightforward because they don’t need provider-specific logic. In the code below, that’s what the <code>normalized</code> object is: it maps Semantic Scholar’s fields (<code>paper.title</code>, <code>paper.externalIds.DOI</code>, <code>paper.citationCount</code>) into your standard fields (<code>title</code>, <code>doi</code>, <code>citations</code>, etc.). After that, the workflow generates a dedupe key (<code>doi:...</code> if DOI exists, otherwise <code>title:...</code>) and uses a <code>Set</code> to keep only the first occurrence. <h4 id="heading-example-n8n-code-node-normalisation-dedupe-pattern">Example n8n Code Node (Normalisation + Dedupe Pattern)</h4> <pre><code class="language-javascript">const itemsIn = $input.all();
const seen = new Set();
const results = [];

function titleKey(t) {
  return (t || "")
    .toLowerCase()
    .replace(/[\W_]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

for (const item of itemsIn) {
  // Example: Semantic Scholar response shape
  const papers = item.json?.data || [];

  for (const paper of papers) {
    // "Normalize into a unified object":
    // take the provider-specific fields and map them into our standard schema.
    const normalized = {
      title: paper.title || null,
      abstract: paper.abstract || null,
      doi: paper.externalIds?.DOI || null,
      year: paper.year || null,
      citations: paper.citationCount || 0,
      url: paper.url || null,
      source: "Semantic Scholar",
    };

    if (!normalized.title) continue;

    // Dedupe key: DOI is strongest; title is fallback
    const key = normalized.doi
      ? `doi:${normalized.doi.toLowerCase()}`
      : `title:${titleKey(normalized.title)}`;

    if (seen.has(key)) continue;
    seen.add(key);
    results.push(normalized);
  }
}

return results.map(r => ({ json: r }));
</code></pre> Production-minded note: keep a field like <code>source</code> so you can debug where bad metadata is coming from later. <h2 id="heading-stage-4-ai-powered-content-extraction-strict-json">Stage 4: AI-Powered Content Extraction (Strict JSON)</h2> Once you have a deduplicated list of papers, you can send each paper (or a small batch) to Groq for structured extraction. <h3 id="heading-why-structured-output-matters">Why structured output matters</h3> If your LLM returns narrative text instead of JSON, misses fields, or emits malformed JSON, your workflow breaks downstream. In a production workflow, that’s not a rare edge case; it’s something you should expect and design around. That’s why you’ll use strict schema prompting and validate responses downstream. <h3 id="heading-system-prompt-vs-user-prompt-and-how-to-compose-them">System prompt vs user prompt (and how to compose them)</h3> A helpful way to think about prompts in production is: <ul> <li>The system prompt defines the non-negotiable contract: output format, allowed keys, no commentary, and what to do in uncertain cases. This is where you say “return ONLY valid JSON” and “no extra keys.” </li> <li>The user prompt provides the variable data for this specific request: title, year, citations, abstract, and the exact schema you want filled. </li> </ul> Composing them this way keeps your workflow stable. 
The system prompt stays mostly constant (your formatting contract), while the user prompt changes per paper (your payload). It also makes debugging easier: if outputs start failing, you can adjust the system constraints without rewriting every payload template. <h3 id="heading-suggested-extraction-schema">Suggested extraction schema</h3> Extract only what you can support from abstract-level data: <ul> <li><code>research_question</code> </li> <li><code>methodology</code> </li> <li><code>key_findings</code> </li> <li><code>limitations</code> </li> <li><code>notes</code> (for missing abstract / ambiguity) </li> </ul> <h3 id="heading-example-prompt-system-user">Example prompt (system + user)</h3> System: You are a research extraction engine. You must return ONLY valid JSON. No markdown. No extra keys. No commentary. If the abstract is missing or too vague, set fields to null and include a reason in "notes". User: Extract structured fields from this paper. TITLE: {{title}} YEAR: {{year}} CITATIONS: {{citations}} ABSTRACT: {{abstract}} Return JSON with keys: research_question (string|null) methodology (string|null) key_findings (array of strings) limitations (array of strings) notes (string) Model settings: keep temperature low (around 0.2–0.3) and keep responses short and structured. <h3 id="heading-batch-processing-to-avoid-timeouts">Batch processing to avoid timeouts</h3> Instead of sending 50 papers at once, process them in batches (for example, 10). This reduces latency spikes, failure blast radius, and cost surprises. Smaller batches also make it easier to retry only the failing chunk rather than re-running everything. <h2 id="heading-stage-5-scoring-and-synthesis">Stage 5: Scoring and Synthesis</h2> Not every retrieved paper is worth your time. Without scoring, your pipeline becomes a firehose: you’ve automated collection, but you still have to manually decide what to read. Scoring is what turns “a big list of results” into a shortlist you can trust. 
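One way to keep scoring explainable is a small additive function: a bonus per keyword hit in the title, a capped bonus for abstract hits, a capped citations bonus, and a small recency bonus, with the total compared against <code>relevance_threshold</code> from the Stage 1 config. The sketch below is a minimal illustration under those assumptions. The weights, caps, and the <code>scorePaper</code> and <code>countHits</code> helpers are hypothetical values and names to tune for your own field, not numbers taken from this workflow.

```javascript
// Illustrative weights and caps: assumptions to tune, not fixed recommendations.
const WEIGHTS = {
  titleHit: 10,     // per keyword found in the title
  abstractHit: 3,   // per keyword found in the abstract
  abstractCap: 9,   // long abstracts cannot dominate the score
  citationsCap: 5,  // citations are a weak, capped signal
  recencyBonus: 2,  // papers from the last two years
};

// Count how many distinct keywords appear in a piece of text.
function countHits(text, keywords) {
  const haystack = (text || "").toLowerCase();
  return keywords.filter((k) => haystack.includes(k)).length;
}

// Score one normalised paper and keep the breakdown so the result is explainable.
function scorePaper(paper, config, currentYear) {
  const keywords = config.keywords.toLowerCase().split(/\s+/).filter(Boolean);

  const breakdown = {
    title: countHits(paper.title, keywords) * WEIGHTS.titleHit,
    abstract: Math.min(
      countHits(paper.abstract, keywords) * WEIGHTS.abstractHit,
      WEIGHTS.abstractCap
    ),
    citations: Math.min(
      Math.floor((paper.citations || 0) / 10),
      WEIGHTS.citationsCap
    ),
    recency:
      paper.year && currentYear - paper.year <= 2 ? WEIGHTS.recencyBonus : 0,
  };

  const score =
    breakdown.title + breakdown.abstract + breakdown.citations + breakdown.recency;

  return { ...paper, score, breakdown, accepted: score >= config.relevance_threshold };
}
```

Storing the <code>breakdown</code> next to the score is a deliberate choice: when a paper is unexpectedly accepted or rejected, you can see which signal was responsible.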
I recommend computing two signals: <ul> <li>Relevance: Is this actually about your research question? </li> <li>Quality/priority: If it’s relevant, is it worth reading first? </li> </ul> For relevance, keep it simple and explainable. Count keyword hits in the title and abstract (and optionally in extracted <code>key_findings</code>). Title matches should be weighted higher because titles are deliberately compact summaries. Abstract hits are useful too, but cap them so long abstracts don’t dominate the score. For quality/priority, use lightweight metadata you already have. Recency is a strong signal in fast-moving areas, and citations can help, but they should be treated as a weak signal (and capped) so newer high-value papers aren’t unfairly penalised. A solid first scoring model is: add a title bonus, add a capped abstract bonus, add a capped citations bonus, and add a small recency bonus for papers from the last two years. Then filter the results using the <code>relevance_threshold</code> value from the Stage 1 configuration. The advantage of this approach is that it’s easy to debug and tune: you can always explain why a paper passed or failed. Once you’ve filtered down to your “gold” set, synthesis becomes safer and more useful. Write one row per accepted paper to Google Sheets, then generate a daily/weekly HTML summary (for example, top 5 papers with 1–2 key findings each) and include links so you can verify quickly. <h2 id="heading-beginner-friendly-evals-retrieval-and-extraction-qa">Beginner-Friendly Evals: Retrieval and Extraction QA</h2> AI workflows regress silently. A prompt tweak, a model update, or an API schema change can break extraction without throwing an obvious error. Adding lightweight evals is the difference between “it worked last week” and “it’s reliable.” The goal here isn’t to build a full evaluation framework. It’s to add small, cheap checks that catch the most common failure modes: <ul> <li>Are collectors still returning results? 
</li> <li>Are we actually removing duplicates? </li> <li>Is the LLM returning valid JSON with the keys we require? </li> </ul> <h3 id="heading-what-it-looks-like-in-n8n-a-concrete-example">What it looks like in n8n (a concrete example)</h3> A simple implementation is to add an “Assertions” Code node immediately after your extraction step, plus (optionally) another one after normalisation/deduplication. At a high level, the workflow section looks like: <ol> <li>Collectors (parallel HTTP Request nodes) </li> <li>Merge results </li> <li>Normalise + dedupe (Code node) </li> <li>Split in Batches (optional) </li> <li>LLM extraction (Groq/OpenAI-compatible node) </li> <li>Assertions (Code node) </li> <li>If node (pass/fail) </li> <li>Delivery (Sheets + email) </li> </ol> <h3 id="heading-example-assertions-code-node-after-extraction">Example: Assertions code node after extraction</h3> This code node assumes each item is a paper with: <ul> <li><code>title</code>, <code>abstract</code> in the normalised fields, and </li> <li>an <code>extraction</code> field (or whatever you name it) containing the LLM response as an object or JSON string. </li> </ul> Adapt the field name to match your actual node output, but the pattern is the same: parse, validate required keys, compute percentages, then decide whether to fail or warn. <pre><code class="language-javascript">const items = $input.all();

let total = items.length;
let withTitle = 0;
let withAbstract = 0;
let parseOk = 0;
let schemaOk = 0;

const requiredKeys = [
  "research_question",
  "methodology",
  "key_findings",
  "limitations",
  "notes",
];

const failures = [];

for (let i = 0; i < items.length; i++) {
  const p = items[i].json;

  if (p.title && String(p.title).trim().length > 0) withTitle++;
  if (p.abstract && String(p.abstract).trim().length > 0) withAbstract++;

  // Adjust this depending on where you store the model output:
  const raw = p.extraction ?? p.llm ?? p.model_output;

  let obj = null;
  try {
    obj = typeof raw === "string" ? JSON.parse(raw) : raw;
  } catch (e) {
    failures.push({ index: i, title: p.title || null, reason: "JSON parse failed" });
    continue;
  }

  // A missing or non-object extraction is a failure too; without this guard
  // the key check below would throw on null/undefined.
  if (obj === null || typeof obj !== "object") {
    failures.push({ index: i, title: p.title || null, reason: "Extraction missing or not an object" });
    continue;
  }
  parseOk++;

  const hasAllKeys = requiredKeys.every(k => Object.prototype.hasOwnProperty.call(obj, k));
  if (!hasAllKeys) {
    failures.push({ index: i, title: p.title || null, reason: "Missing required keys" });
    continue;
  }

  // Optional: ensure arrays are arrays
  const arraysOk = Array.isArray(obj.key_findings) && Array.isArray(obj.limitations);
  if (!arraysOk) {
    failures.push({ index: i, title: p.title || null, reason: "key_findings/limitations not arrays" });
    continue;
  }

  schemaOk++;
}

const pct = (n) => (total === 0 ? 0 : Math.round((n / total) * 100));

const report = {
  total_papers: total,
  pct_with_title: pct(withTitle),
  pct_with_abstract: pct(withAbstract),
  pct_extraction_json_parse_ok: pct(parseOk),
  pct_extraction_schema_ok: pct(schemaOk),
  failures_sample: failures.slice(0, 5),
};

// Decide pass/fail thresholds
const HARD_FAIL_PARSE_BELOW = 90;
const HARD_FAIL_SCHEMA_BELOW = 85;

const shouldFail =
  report.pct_extraction_json_parse_ok < HARD_FAIL_PARSE_BELOW ||
  report.pct_extraction_schema_ok < HARD_FAIL_SCHEMA_BELOW;

return [
  {
    json: {
      eval_report: report,
      shouldFail,
    },
  },
];
</code></pre> Then add an If node: <ul> <li>If <code>shouldFail</code> is true, then route to an “Alert/Stop” branch (Slack/email/log) and optionally stop the workflow. </li> <li>If false, then continue to the delivery stage. </li> </ul> This is the automation equivalent of unit tests: small, cheap, and extremely effective. It also gives you a concrete paper trail when something changes upstream. <h2 id="heading-key-learnings-and-error-handling">Key Learnings and Error Handling</h2> Building this automation taught me that the best workflows are designed for failure. First, error resilience is not optional. Never let one failing API crash the workflow. 
Use “Continue On Fail” on your HTTP nodes, merge partial results, and log which sources failed in your final report so you can debug without losing an entire run. Second, batching is your friend. Process papers in batches (often 5–15) to reduce timeouts and cost spikes. Keep LLM payloads small and focused on what you actually need (metadata + abstract), and retry transient failures once rather than repeatedly hammering the model or API. Third, structured prompting is what makes AI reliable in automation. A strict JSON schema is the difference between a workflow that runs unattended and one that breaks randomly. Keep temperature low, enforce the schema in the system prompt, and validate everything downstream with simple parse-and-assert checks. <h2 id="heading-conclusion">Conclusion</h2> A good research pipeline doesn’t just retrieve papers – it turns scattered results into a consistent, deduplicated, scored, and review-ready shortlist you can trust. By treating your n8n workflow like software (modular stages, strict contracts between steps, and lightweight eval checks), you can reduce hours of manual literature review to a fast, repeatable process that survives real-world API failures and model quirks. If you build this with good defaults (failure isolation, batching, normalisation, strict JSON extraction, and simple scoring), you end up with something you can run daily or weekly and actually rely on without the manual fatigue. <h3 id="heading-about-me">About Me</h3> I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead the AI Titans Network, a community for developers learning how to ship AI products. My work has been recognised with the Global Tech Hero award and featured on platforms like HackerNoon. </article> <article> <h1> Penetration Testing — Services vs Automated Platforms: What’s Better in 2026? 
</h1> Manish Shivanandhan — Mon, 16 Mar 2026 17:54:26 +0000 In 2026, cybersecurity teams face more threats than ever before. Attack surfaces are broad, technology stacks are complex, and adversaries are quick to exploit weak points. Against this backdrop, companies must decide how best to test their defences. Two main approaches have emerged as leaders: human-led penetration testing services and automated testing platforms. Each has strengths and limitations. Choosing the right one depends on your security goals, risk tolerance, and budget. At its core, <a href="https://www.cloudflare.com/learning/security/glossary/what-is-penetration-testing/">penetration testing</a> is about finding security holes before attackers do. But how you get there matters. Human experts bring creativity and real-world insight, while automated platforms offer scale and speed. This article explores both approaches and compares top providers to help you decide what’s better for your organization in 2026. <h3 id="heading-what-well-cover">What we'll cover:</h3> <ol> <li><a href="#heading-what-are-penetration-testing-services">What Are Penetration Testing Services?</a> </li> <li><a href="#heading-what-are-automated-penetration-testing-platforms">What Are Automated Penetration Testing Platforms?</a> </li> <li><a href="#heading-why-the-debate-matters-in-2026">Why the Debate Matters in 2026</a> <ul> <li><a href="#heading-depth-of-testing-humans-vs-machines">Depth of Testing: Humans vs Machines</a> </li> <li><a href="#heading-speed-and-frequency-of-testing">Speed and Frequency of Testing</a> </li> <li><a href="#heading-cost-considerations">Cost Considerations</a> </li> <li><a href="#heading-integration-with-security-workflows">Integration with Security Workflows</a> </li> </ul> </li> <li><a href="#heading-real-world-context-top-providers-in-2026">Real World Context: Top Providers in 2026</a> </li> <li><a href="#heading-compliance-and-reporting">Compliance and Reporting</a> </li> <li><a 
href="#heading-which-one-should-you-choose-in-2026">Which One Should You Choose in 2026?</a> </li> <li><a href="#heading-final-thoughts">Final Thoughts</a> </li> </ol> <h2 id="heading-what-are-penetration-testing-services">What Are Penetration Testing Services?</h2> Penetration testing services are engagements where cybersecurity professionals actively probe your systems to find vulnerabilities. These experts use a mix of tools, manual techniques, and real-world attack simulations to surface weaknesses that machines might miss. These services may include scheduled tests, one-time assessments, and ongoing engagements. Many providers tailor their approach to the environment being tested, whether that’s a corporate network, web application, cloud infrastructure, or mobile ecosystem. Human testers think like attackers, combining automated scans with logic and adaptability that machines cannot replicate on their own. These engagements are typically measured in reports, debrief sessions, and clear remediation guidance. The human element is the defining factor. A skilled tester doesn’t just find flaws. They understand context, creative exploit paths, and business impact. <h2 id="heading-what-are-automated-penetration-testing-platforms">What Are Automated Penetration Testing Platforms?</h2> Automated penetration testing platforms use software to scan, crawl, and test systems for vulnerabilities. These platforms run scheduled scans or continuous assessments with minimal human intervention. They aim to find flaws early and often, integrating with development pipelines or security operations centers. Automation brings consistency, speed, and the ability to repeat tests frequently. Many modern platforms use machine learning to prioritize findings and reduce noise. Some offer automation rules that trigger scans based on changes in the environment or codebase. In contrast to full manual services, platforms are best suited for ongoing baseline assessments and rapid feedback. 
They are often priced in subscription models and integrate with other tooling like bug tracking systems or <a href="https://www.ibm.com/think/topics/siem">SIEMs</a>. While they can pinpoint known vulnerability patterns efficiently, automated tools are limited in creative attack paths and logic-based exploits. <h2 id="heading-why-the-debate-matters-in-2026">Why the Debate Matters in 2026</h2> In 2026, the cybersecurity landscape is both more advanced and more hazardous. Organizations operate hybrid clouds, microservices architectures, and complex supply chains. Threat actors are using AI to scale attacks. In this environment, the question is not only about finding old vulnerabilities but anticipating novel attack methods. With limited resources, security leaders must choose wisely. Do you invest heavily in services with human experts? Do you adopt automated platforms that test continuously? Maybe a mix is best. To answer these questions, let’s explore how the two approaches compare across key criteria. <h3 id="heading-depth-of-testing-humans-vs-machines">Depth of Testing: Humans vs Machines</h3> Human-led penetration tests shine when deep context and logic are required. Expert testers can chain together multiple issues to compromise a system in ways automated tools don't anticipate. They explore paths, think creatively, and adapt in real time to the environment they encounter. Automated platforms excel at breadth and repetition. They perform wide sweeps of systems quickly and can generate alerts on common vulnerability classes. They're particularly strong in repetitive tasks like scanning hundreds of endpoints or validating compliance controls. But platforms often rely on predefined signatures and patterns. They perform poorly when an exploit requires intuition or lateral thinking. In simple terms, human services dig deep while platforms dig wide. 
<h3 id="heading-speed-and-frequency-of-testing">Speed and Frequency of Testing</h3> Automated platforms have a clear advantage in speed and frequency. They can run multiple scans in parallel, test after every code commit, and provide almost immediate feedback. This makes them ideal for DevOps pipelines and agile environments that change daily. Penetration testing services, by design, occur on a schedule. A quarterly or annual test may be thorough, but it cannot match the cadence that automated tools provide. Manual tests take time to plan, execute, and analyze. In fast-moving environments, this might leave gaps between testing windows. For many organizations, automation fills these gaps, while manual testing provides periodic, deep insight. <h3 id="heading-cost-considerations">Cost Considerations</h3> Cost is always a factor. Automated platforms generally come with lower upfront costs compared to human-led engagements. Subscriptions scale with usage and provide continuous assessment for a predictable price. This makes them appealing to midsize companies or teams with limited budgets. Penetration testing services, especially from reputable consultancies, command higher fees. These reflect labor costs, expertise, and the bespoke nature of the work. However, the value gained is often more than just flaw detection: it’s expert interpretation, custom exploitation paths, and strategic guidance. In cost-benefit terms, automated platforms provide the most value per dollar for baseline security, while services deliver high-value insight that can justify a higher cost. <h3 id="heading-integration-with-security-workflows">Integration with Security Workflows</h3> Automated platforms are built to integrate with broader security tooling. They often connect to continuous integration/continuous delivery (CI/CD) pipelines, vulnerability management platforms, and ticketing systems. 
This integration ensures that issues are communicated to the teams who need them most and tracked to resolution. Penetration testing services can integrate into workflows too, but this usually requires additional coordination. Reports must be ingested into tracking systems and aligned with internal priorities. Some providers offer APIs and extended services that help bridge this gap, but the process typically takes more effort than with automated platforms. Integration matters because security cannot operate in isolation. Automated platforms fit more naturally into modern DevSecOps workflows, while services provide episodic insights that must be planned and bridged into operations. <h2 id="heading-real-world-context-top-providers-in-2026">Real World Context: Top Providers in 2026</h2> To illustrate how these approaches manifest in practice, consider a few leading options. Each provider offers different strengths in manual services or automated tooling. One such provider is <a href="https://xbow.com/pentest">XBOW</a>. XBOW is known for deep manual testing engagements, combining expert human testers with structured methodologies across network, application, and cloud environments. Their work emphasizes real-world attack simulations and strategic risk reporting. Another well-known provider is <a href="https://www.cobalt.io/">Cobalt</a>. Cobalt blends human expertise with platform-based management. Their Pentest as a Service (PtaaS) model connects testers to client environments through a platform that organizes findings, workflows, and communication. Clients can collaborate with testers, track issues in real time, and integrate results with other systems. A different model comes from <a href="https://www.synack.com/">Synack</a>. Synack uses a crowd of vetted testers who work with a secure testing platform. This hybrid model aims to combine the creativity of human testers with the scalability and tracking of automated systems. 
Clients benefit from diverse testing styles and coordinated reporting within a structured platform. Each of these approaches has merit. Some lean more toward pure services, others toward platform-driven collaboration. Your choice should align with your security maturity and goals. <h2 id="heading-compliance-and-reporting">Compliance and Reporting</h2> For regulated industries, compliance matters. Automated platforms often include reporting features that map directly to standards like PCI DSS, HIPAA, or ISO 27001. These reports can be generated on a regular cadence and integrated into audit evidence. Penetration testing services provide compliance support too, but the reports are typically narrative and bespoke. The real value is in expert interpretation of compliance requirements and guidance on remediating complex findings. In essence, automation provides structured, repeatable reporting, while services deliver customized insights that may carry more weight with auditors and internal stakeholders. <h2 id="heading-which-one-should-you-choose-in-2026">Which One Should You Choose in 2026?</h2> There is no one-size-fits-all answer. Many organizations adopt both approaches. Automated platforms serve as the first line of defense by continuously scanning for known issues and tracking progress over time. Human-led services then provide a deeper second layer, uncovering complex issues and offering strategic guidance. If your environment is highly dynamic, with frequent releases and evolving infrastructure, an automated platform is essential. If you operate in a high-risk sector where attackers are likely to craft bespoke exploits, human-led penetration testing services are indispensable. Most mature security programs use both. Automation drives frequency and scale. Human services provide depth and insight. Together, they form a layered testing strategy that maximizes coverage and minimizes blind spots. 
<h2 id="heading-final-thoughts">Final Thoughts</h2> In 2026, cybersecurity testing is more sophisticated and essential than ever. Organizations must balance speed, depth, cost, and context when selecting between penetration testing services and automated platforms. While one is not inherently better than the other in all cases, understanding their differences and complementary strengths will help you build a robust security posture. Automated platforms catch the routine and repetitive, giving continuous visibility into known risks. Human-led services uncover the hidden and unexpected, thinking beyond patterns to simulate real adversaries. For most teams, the future of testing lies in a hybrid approach that leverages both. By aligning your security goals with the right mix of services and tools, you can stay ahead of threats now and in the years to come. Hope you enjoyed this article. Learn more about me by <a href="https://manishmshiva.me">visiting my website</a>. </article> <article> <h1> How to Build an Autonomous AI Agent with n8n and Decapod </h1> Lee Nathan — Wed, 11 Mar 2026 20:18:39 +0000 I tried out Open Claw two weeks ago. I loved the potential, but did not enjoy the tool itself. I, like many others, struggled with the installation process. And working from Linux, the Mac specific orientation added extra pitfalls. It wasn't always clear whether configuration and management should be done in the docs, the CLI, or the interface. I found the UI unintuitive and it left me wondering if it wasn't just a dev placeholder. The color choice in particular was especially harsh. All the red tricked the eye and made white text appear green. It also made everything seem like an error message. I couldn't make heads or tails of the organization and structure. Workspaces, agents, and sessions are all terms I'm familiar with and understand. But the way Open Claw implements them made no sense to me. Open Claw started as a way to connect a chat tool to an AI. 
I did that eight months ago with n8n. It's literally only a few nodes. It was so easy that I didn't think anything of it. In my opinion, Open Claw isn’t actually all that special. There’s no part of it that stands out as unique, except for the approach. It’s the Flappy Bird of the agentic AI world. So I set out to make my own. And within a few hours, I'd whipped up a simple working prototype vibe-coded with Python and connected to Open WebUI (OWUI). But I wanted to see what prompt OWUI was sending the agent, exactly. Now, if I was actually a Python guy, I would have done some console output. But instead, I went for my favorite tool: n8n (a powerful low-code automation system). And that's where things got interesting. <h2 id="heading-about-this-handbook">About This Handbook</h2> This handbook will introduce you to agentic AI creation using a hands-on approach and a starter project I created called Decapod. Decapod is not a self-contained SaaS offering. There is no part of it that is black boxed and unavailable to hack on. Decapod is a collection of <code>docker-compose.yml</code> containers, scripts, AI agent prompts, and n8n workflows that work together to help give you a leg up on your path to building your own agentic AI empire. Concepts and technologies you'll be introduced to and using: <ul> <li>Agentic AI with tools and skills </li> <li>Docker containers with Docker Compose </li> <li>Open WebUI </li> <li>n8n </li> <li>S3 and MinIO </li> <li>Caddy </li> <li>Postgres </li> </ul> For a list of required skills, services, and tools, please check out the "Requirements and Processes" section. 
<h2 id="heading-table-of-contents">Table of Contents</h2> <ul> <li><a href="#heading-decapod-the-diyers-dream-agent">Decapod - The DIYer's Dream Agent</a> </li> <li><a href="#heading-how-decapod-works">How Decapod Works</a> <ul> <li><a href="#heading-core-engine">Core Engine</a> </li> <li><a href="#heading-supakitchen-supabase-on-a-budget">Supakitchen - Supabase on a Budget</a> </li> <li><a href="#heading-open-webui-ai-chat-with-all-the-bells-and-whistles">Open WebUI - AI Chat With All the Bells and Whistles</a> </li> </ul> </li> <li><a href="#heading-requirements-and-processes-tools-i-use-and-recommend">Requirements and Processes - Tools I Use and Recommend</a> <ul> <li><a href="#heading-the-checklist">The Checklist</a></li> </ul> </li> <li><a href="#heading-assembling-the-dream-team-ikea-style">Assembling the Dream Team - Ikea Style</a> <ul> <li><a href="#heading-accessing-your-vps-with-cursor-and-ssh">Accessing Your VPS With Cursor and SSH</a> </li> <li><a href="#heading-installing-and-configuring-the-docker-containers">Installing and Configuring the Docker Containers</a> </li> </ul> </li> <li><a href="#heading-configuration-and-wiring">Configuration and Wiring</a> <ul> <li><a href="#heading-initiate-the-database">Initiate the Database</a> </li> <li><a href="#heading-a-little-minio">A Little MinIO</a> </li> <li><a href="#heading-adding-the-workflows">Adding the Workflows</a> </li> <li><a href="#heading-getting-started-with-n8n">Getting Started With n8n</a> </li> <li><a href="#heading-now-get-owui-to-talk-to-decapod">Now, Get OWUI to Talk to Decapod</a> </li> <li><a href="#heading-there-was-supposed-to-be-an-earth-shattering-kaboom">There Was Supposed to Be an Earth Shattering Kaboom</a> </li> </ul> </li> <li><a href="#heading-the-ever-present-hello-world">The Ever-Present "Hello World"</a> </li> <li><a href="#heading-into-the-future">Into the Future!</a> <ul> <li><a href="#heading-a-work-in-progress">A Work in Progress</a> </li> <li><a 
href="#heading-adding-your-own-skills-limitless-potential">Adding Your Own Skills - Limitless Potential</a> </li> <li><a href="#heading-future-plans">Future Plans</a> </li> </ul> </li> <li><a href="#heading-got-questions-meet-captain-finn">Got Questions? Meet Captain Finn!</a> </li> </ul> <h2 id="heading-decapod-the-diyers-dream-agent">Decapod – The DIYer's Dream Agent</h2> I'll be honest. I'd never even considered the security issues with Open Claw at first. But they're enormous! Let's open a giant hole in our server and give a fledgling alien intelligence root access and all of our API keys. What could possibly go wrong? Decapod isn't a monolithic app. It's a collection of tools and n8n workflows that give you complete control over your agent and its tools. It's a framework to give <a href="https://monday.com/appdeveloper/blog/citizen-developer/">citizen developers</a> a leg up. By switching to n8n, I accidentally solved a ton of issues and made a far superior (in my opinion) project: <ul> <li>Double (or triple if you choose to host in a VPS) sandboxed security. My agent lives inside of n8n inside of a Docker container inside of a VPS. </li> <li>The agent never sees a single API key or even ever needs to know exactly how you're connecting services. Credentials are handled by n8n. </li> <li>Universal access – I prefer OWUI. But literally anything that can connect to a standard OpenAI API endpoint can connect to Decapod. </li> <li>Over 1,000 integrations – What n8n does best is connecting any API to any other API via drag-and-drop nodes. And there are more than <a href="https://community.n8n.io/t/master-list-of-every-n8n-node/155146">1,000 of them</a>. </li> <li>No more sketchy skills – Decapod uses skills, but they have to actually be connected to n8n workflows and nodes to work. </li> </ul> More problems Decapod solves: <ul> <li>Fewer tokens burned – Decapod maintains a clean boundary between what's best handled with code/logic and what's best handled by AI. 
</li> <li>No endless loops and hung jobs – Decapod uses a jobs and tasks system that the AI can manage. So if it sees that a task has failed, it can change tasks or suspend the job. </li> <li>HITL (Human In The Loop) – You can add a HITL sub-workflow before any AI skill so that you decide whether it gets permission to proceed. </li> <li>An MVP you can trust – The core Decapod system is just an MVP. But it's built exclusively on mature, open source, enterprise-ready solutions: n8n, Open WebUI, Docker, Caddy, Postgres, and MinIO. </li> </ul> <h2 id="heading-how-decapod-works">How Decapod Works</h2> Decapod is middleware that acts like an OpenAI API. But it intercepts the API call and does agent work with the real API. The OpenAI API standard is the most widely used in the industry. Almost every tool, such as Open WebUI, Zed, and Obsidian, has a way to connect to the OpenAI standard. So those tools can also connect to Decapod. Decapod itself can connect to any API and pass available models through to other tools. I strongly prefer and recommend OpenRouter. OpenRouter also uses the OpenAI standard, but lets you connect to hundreds of mainstream and indie models under the same pricing system. Decapod is configured to work with OpenRouter out of the box. This is an image of the Decapod agent tool router – one of the key n8n workflows in Decapod. <h3 id="heading-core-engine">Core Engine</h3> Decapod consists of an agent with tools and skills. By tools, I mean the agentic tools that an AI can access to perform tasks as part of the API. And by skills, I'm referring to <a href="https://agentskills.io/home">Anthropic's Agent Skills standard</a>. It's the same skills standard used by Open Claw. The Decapod agent has a limited, immutable set of tools for managing Decapod's state and job queue. One tool is used to call skills. Skills are dynamic and you can add as many as you like mid-flight. Each skill consists of core instructions, followed by JSON specs. 
The agent builds a skill request based on the JSON and calls the use_skill tool to have it executed. Then Decapod calls a sub-workflow with a name that matches the skill and sends it the JSON. One skill = one sub-workflow. JSON specs = sub-workflow's expected input. When Decapod receives a user message, it passes it to the agent. If it's just a message, the agent responds. If it's a call to action, the agent picks a tool and gets to work. Decapod loops through each job in the queue, handling the agent's tool calls and passing it back the results. When the agent is done, it concludes the job and stops sending tool calls. The final message is passed back to the user. <h3 id="heading-supakitchen-supabase-on-a-budget">Supakitchen – Supabase on a Budget</h3> I'm a huge fan of Supabase. It's all the fun of Firebase, except with data normalization. But I'm self-hosting Decapod because paying $20 per month for each of five or more services doesn't sit right with me. As a mad scientist, I like to be able to try different tools without dealing with the freemium hoops. So I'm running Decapod on a Hetzner VPS with 8 gigs of RAM for about $18 per month. Those 8 gigs go really far in the self-hosted world, but Supabase is heavy. What I really wanted was to give my agent file access and a database. I accomplished that with MinIO and Postgres. No real-time data, but my agent is async anyway. And agent authentication is done through n8n. So it's good enough. But you do you! Decapod can work with any S3 compatible file store and any Postgres database. So if you want to use Supabase instead, go for it! <h3 id="heading-open-webui-ai-chat-with-all-the-bells-and-whistles">Open WebUI – AI Chat With All the Bells and Whistles</h3> You can use chat tools, like Discord, Telegram, Slack, and others, to chat with your AI easily enough. But if you want multiple sessions or to use different models, it can be tricky. The easiest tool to set up and work with, by far, is Telegram. 
You get chat, UI elements, and even embedded apps without having to host your own server, like you do with Discord. I once used it to create a HITL lead qualification tool in a few hours. BUT! While Telegram and friends do get the job done, if you want a new session you have to create a new bot for each and every one. If you want to switch models, you need to add /slash commands. If you want context management, you have to handle that server side. That's why I prefer Open WebUI. OWUI gives you everything you expect from all of the best mainstream AI offerings, but with a direct tap to the API. <ul> <li>It works great on browser and mobile as a progressive web app (PWA). </li> <li>You can mod it with Python. </li> <li>It has many ways to manage and supply context, including nested projects/folders and RAG support. </li> <li>You can collaboratively work on notes with AI. </li> </ul> Those are a few of my favorite features, but there are <a href="https://docs.openwebui.com/features/">so many more</a>. Why reinvent the wheel when the absolute best solution already exists? <h2 id="heading-requirements-and-processes-tools-i-use-and-recommend">Requirements and Processes – Tools I Use and Recommend</h2> Welcome to my lab-or-a-tory. We're out there on the fringes of agentic AI now. Doing weird experiments by stitching together pieces and parts. Let me show you how I work and tell you where you can and can't stray from my process. Decapod is a finished MVP and should work right out of the box with minimal headache. But it doesn't have more than a few skills yet. So you'll need to build your own until it takes off. Fortunately, your Decapod agent can help. <h3 id="heading-the-checklist">The Checklist</h3> Skills: <ul> <li>✅ A generalist's mindset, problem-solving skills, and a sense of adventure. <ul> <li>You don't have to be an expert at anything to install Decapod. I'm not, and I built it. </li> <li>But you do have to be comfortable with many different technologies. 
</li> </ul> </li> <li>✅ The command line, Docker, and probably Node. Decapod is self hosted. So you'll need to get your hands a bit dirty. </li> <li>✅ The ability to read and write a little JavaScript. This helps a lot with n8n code nodes to give it more utility. </li> <li>✅ Familiarity with JSON and APIs. Everything in n8n is about passing JSON from node to node. And n8n is nothing if not a universal API connector. </li> </ul> Services: <ul> <li>✅ A domain name with DNS access. <ul> <li>This is critical for n8n to work properly due to CORS and security issues. </li> <li>Also, the OWUI PWA doesn't work when hosted through an IP. It's just a web page at that point. </li> <li>Plus, it's just better for security overall with https support. </li> <li>If cost is an issue, you can get an <a href="https://gen.xyz/">all-digit domain name from gen.xyz</a> for $0.99. Seems legit, but I haven't tried it myself. </li> </ul> </li> <li>✅ A dedicated VPS with SSH access. (SSH access should be standard for any VPS.) <ul> <li>You can technically host this on your own PC if you know it will be running 24/7. But using a VPS will give you peace of mind and avoid complicating your PC. </li> <li>Big-name solutions like AWS and Google Cloud can wind up going off the rails and costing you big bucks if you don't know exactly what you're doing. Better to stick with less enterprise-oriented offerings. I've used the following: <ul> <li><a href="https://www.hetzner.com/">Hetzner</a> – My current personal favorite. Germany based. High quality and affordable pricing with a few American servers. Even more affordable with European servers. </li> <li><a href="https://www.digitalocean.com/">Digital Ocean</a> – US based. Can't go wrong. Decent prices. Many offerings. Almost exclusively American servers. </li> <li><a href="https://webdock.io/en">Webdock</a> – Denmark based. The most affordable of the bunch. </li> </ul> </li> </ul> </li> <li>✅ An OpenRouter account. 
OR provides a universal interface for hundreds of AI models. There's no freemium upsell, like with Hugging Face, but there is a percentage add on when you buy credits/tokens. I feel like it's worth the extra fee to be able to easily swap from Claude to Kimi to GPT to DeepSeek as I please without more keys, more accounts, and more wiring. But this is optional. You can plug Decapod right into Kimi or Gemini and just leave it there if you like. </li> </ul> Tools: <ul> <li>✅ Cursor, or similar. I love Cursor. It matches my hands-on style. If you're freestyling and dreaming something into creation as you build it, AI will always take the wrong path if you take your hands off the wheel. Cursor lets me be in charge and play director while the AI does the heavy lifting and saves me from hours of Googling and digging through 10-year-old questions on Stack Overflow. Especially with the command line stuff. I could not have knocked out Decapod in two weeks without it. But it couldn't have built Decapod at all without me. </li> <li>✅ Another AI bestie to help you dream, plot, and plan. Cursor is great, but very utilitarian. I always have a session open with a running commentary about my work. I'm constantly feeding it context and leaning on it to get a fresh perspective and solve more esoteric issues, like debugging n8n flow problems, for example. I use Claude for absolutely everything. It has the most natural conversational flow, it's good at taking meta instructions regarding its behavior, and it always has an eye on accuracy – very reliable. </li> </ul> <h2 id="heading-assembling-the-dream-team-ikea-style">Assembling the Dream Team – Ikea Style</h2> Here are the pieces and parts you'll find in your Dekkaplonkën Ikea flat pack (the GitHub repo). <ol> <li>Four Docker containers containing five services with docker-compose files. Just heat and serve.</li> </ol> <ul> <li>Infrastructure: Caddy for routing and SSL certificates for https security. 
</li> <li>Infrastructure: Postgres for all your data needs. </li> <li>MinIO: An S3 compatible file storage system. </li> <li>n8n: The ultimate automation tool. </li> <li>Open WebUI: The ultimate AI chat interface. </li> <li>SQL tables <ul> <li>A table for the decapod state. </li> <li>A table for jobs, tasks, and tool chat history. </li> </ul> </li> <li>S3 Files and Folders – Agent Templates <ul> <li>Four starter skills (two actually implemented in n8n). </li> <li>Two instructional files, including the persona and skill definitions. </li> </ul> </li> <li>n8n Workflows (6,889 lines of pure JSON) <ul> <li>API Middleware: The entry and exit point that manages the session and loops. </li> <li>AI Tool Router: Executes your agent's tool requests. </li> <li>Construct Message History: Injects instructions into your agent's chat history. </li> <li>Get Job Queue: A one-off database call that gets active jobs ordered by priority and creation date (First In First Out). </li> <li>Utility Workbench: A place for testing and managing your flows. Currently contains a Skill assembly jig. </li> <li>Worker: Loops over job queues, talking to the agent and calling the tool router with its responses. </li> <li>A write-file skill and a research-recipes skill. </li> <li>A couple more placeholders. (Decapod is an MVP) </li> </ul> </li> <li>Also <ul> <li>A Docker cheatsheet. </li> <li>A script to generate agents from the template. </li> <li>A destructive script to upload local agent files to your S3 account by overwriting existing files. Good for dev. Bad if you let your agent start modding their own instructions. </li> <li>Scripts to start and stop all Docker containers at once. </li> </ul> </li> </ul> <h3 id="heading-accessing-your-vps-with-cursor-and-ssh">Accessing Your VPS With Cursor and SSH</h3> SSH is the standard way to access any server and has been forever. But working through a terminal can be slow and plodding. Fortunately, there's a better way. 
Connect to the server with Cursor, VS Code, Antigravity, or whatever you use. This gives you: <ul> <li>Multiple terminals to access the remote server. </li> <li>The ability to view localhost servers as if they were on your own machine via port forwarding. </li> <li>Drag and drop folder and file management. </li> <li>No more Nano, Vim, or Emacs (unless you want to). </li> <li>And the best part! Cursor can do all the remote file system work for you, including troubleshooting servers and containers, writing scripts for automating common tasks, and helping you hash out actionable plans. </li> <li>(Cursor can also connect to your Decapod!) </li> </ul> Every VPS provider will have their own way of managing SSH access. They usually make adding your SSH keys part of the sign-up process. Generating and managing keys is a pretty well-paved path and I won't go over it. It's a good job for Cursor, if you need help. However! I use Bitwarden for SSH key generation and management. The keys still need to be stored locally for tools on your computer to access. But it's nice to have them in a single secure location. VS Code requires an extra plugin to access a remote server. Cursor comes with it preinstalled. Just click <code>Connect via SSH</code>, set up your connection, and you're good to go. 📝 Side note: I was on the paid plan when I started, I swear. I tend to switch services a lot as new models are released and I discover different tools and options. But I only ever pay for 2 or 3 at a time. I got about halfway through this article when my Cursor subscription expired. But I'm trying the new Gemini 3 models and switched to Antigravity mid-flight rather than re-up Cursor. <h3 id="heading-installing-and-configuring-the-docker-containers">Installing and Configuring the Docker Containers</h3> Finally! After a novella's worth of lead-up, we, at long last, get to the actual installation. That will be shared in the next article – have a good night! Just kidding, please put down the brick. 
Once you've SSHed into a VPS, a Raspberry Pi with Ubuntu, or a Virtual Machine, you're ready to get started. I'm going to assume you know how to install tools like Docker and Node on your system and not go into a lot of detail. Ask your friendly neighborhood AI for help if you get stuck. 💡 Important! If you haven't already, get your domain name and open up the DNS page. You'll want to point an "A" record at your server's IP for each relevant service. Start by cloning the Decapod repo. <pre><code class="language-shell">git clone https://github.com/leetheguy/decapod.git </code></pre> <code>cd decapod</code> and create your Docker network. <pre><code class="language-shell">docker network create web </code></pre> Now we're going to go into each of the four Docker folders, configure them, and fire them up, starting with infrastructure. <code>cd infrastructure</code> <code>cp .env.example .env</code> Alternatively, you can move the files to rename them or just click on the file in the UI and press <code>F2</code> to rename it. Whatever floats your goat 🐐. Now edit the new <code>.env</code> file. You can get the data folder path by clicking on the infrastructure folder and pressing <code>Ctrl/Cmd+Alt+C</code>. The rest is up to you. I used Bitwarden to generate a password here. Next, copy the Caddyfile template into its own file. <code>cp caddy_config/Caddyfile.template caddy_config/Caddyfile</code> And start the Docker container with <code>docker compose up -d</code>. Back out of infrastructure and into <code>minio</code>. Same again with the <code>.env</code> – copy and configure. Make sure the URLs match your domain. Once more for <code>n8n</code> and then again for <code>openwebui</code>. 
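Since the copy step repeats for all four folders, it can be scripted. Here's a minimal sketch, assuming every service folder ships a <code>.env.example</code> the way <code>infrastructure</code> does. <code>prepare_env_files</code> is a hypothetical helper name, not something that exists in the Decapod repo. It only copies the template and never overwrites a <code>.env</code> you've already edited. You still have to fill in the values and start each container yourself.

```shell
#!/usr/bin/env bash
# Sketch only: prepare_env_files is a hypothetical helper, not part of Decapod.
# It copies .env.example to .env in each service folder named in the article,
# and leaves any existing .env untouched.
prepare_env_files() {
  local root="${1:-.}"   # path to the cloned decapod repo (defaults to cwd)
  local dir
  for dir in infrastructure minio n8n openwebui; do
    if [ ! -f "$root/$dir/.env.example" ]; then
      # Assumption check: warn instead of failing if a template is missing.
      echo "skip: $dir has no .env.example" >&2
      continue
    fi
    if [ -f "$root/$dir/.env" ]; then
      echo "keep: $dir/.env already exists"
    else
      cp "$root/$dir/.env.example" "$root/$dir/.env"
      echo "created: $dir/.env"
    fi
  done
}
```

From inside the repo, run <code>prepare_env_files .</code>, then edit each new <code>.env</code> and bring the folders up one at a time with <code>docker compose up -d</code>, infrastructure first.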
OWUI config comes from the <code>infrastructure</code> and <code>minio</code> <code>.env</code> files: <ul> <li>S3_ACCESS_KEY_ID=minio_admin </li> <li>S3_SECRET_ACCESS_KEY=minio_password </li> <li>S3_BUCKET_NAME=decapod </li> <li>MINIO_ROOT_USER=minio_admin </li> <li>MINIO_ROOT_PASSWORD=minio_password </li> <li>POSTGRES_DB=postgres </li> <li>POSTGRES_USER=postgres </li> <li>POSTGRES_PASSWORD=postgres_password </li> </ul> 📝 Note! OWUI may take a moment or two to start. Go grab some water and it should be up by the time you get back. <h2 id="heading-configuration-and-wiring">Configuration and Wiring</h2> Roll up your sleeves! This is where we get up to our elbows in pieces and parts. If everything went to plan, you should now have all five services up and running. You can confirm the containers are live with <code>docker ps</code>. You can check that they're actually properly connected by visiting s3, OWUI, and n8n.your-domain.com. Create accounts for all three and sign in to each. ⚡️ Important! Get your n8n license key! It's free and gives you access to all community features. You'll be severely limited without it. Activate it under Usage and plan in the settings. <h3 id="heading-initiate-the-database">Initiate the Database</h3> Decapod only needs two data tables. You can add them from the command line. But I like pgAdmin. Connect to your Postgres database in the usual way. But you'll need your server's IP for the host name instead of postgres (which you use to connect services inside of the Docker network) since pgAdmin isn't in your Docker network. You'll find your SQL files in <code>components/pgsql_tables</code>. Create a decapod database and add both of the SQL files to it. A default <code>decapod_state</code> table record will be automatically generated when running the SQL. In pgAdmin: <ul> <li>Open the decapod server. </li> <li>Create a decapod database by right-clicking on databases. </li> <li>Select the new database. 
</li> <li>Click the query tool button at the top of the explorer. </li> <li>Copy and paste the decapod_state table into the query and run it with F5. </li> <li>Clear the query, paste in job_queue, run it. </li> </ul> Or ask Cursor or an AI bestie for help if you want to go pure command line. <h3 id="heading-a-little-minio">A Little MinIO</h3> Next up, you'll be adding your agent's instructions and persona files to your private S3 service. Start by visiting your MinIO server and adding a decapod bucket. In <code>components/S3_structure/agents/</code>, you'll find a template for your agents. (I have the intention of making Decapod a multi-agent tool in a future release.) The template is meant to be copied to a new agent of your choice. But if you choose something other than Decapod, you'll need to update the state table. You can do it manually if you wish. Copy the folder to match the new agent's name and update the <code>definitions/skills.yaml</code> file to include all the skills you want your agent to have. The name and description should exactly match what's found at the top of each skill file. Alternatively, I vibe coded a script to make it a little easier. It's in the scripts folder and you'll need to install the <code>inquirer</code> Node module to use it. Run <code>cd scripts</code> and <code>create-agent.mjs</code> to use it. You also need to make sure that the files and folder structure in your MinIO match those in <code>S3_structure</code>. Start by creating a bucket called decapod in your drive. Then upload the files from <code>S3_structure</code> into your bucket. But that's easier said than done because they're on a remote server. And if you used the visual interface, you'd have to download them to your local machine first. So I made another script – <code>upload_S3_structure.sh</code>. That script is strictly meant for dev purposes. It's absolute and destructive. Just a heavy mallet. So if you want to surgically alter your MinIO, do not use it! 
Remember kids: mallets and brain surgery don't mix. Once your agent files are in place, you can let your agents edit them, Open Claw style, or you can edit them yourself. But MinIO doesn't give you much of anything in the way of features for their UI. For a better experience, I'd recommend <a href="https://web.s3drive.app/">S3Drive</a>. When you go to sign up, look for the connect button towards the bottom to connect to your own MinIO endpoint. S3Drive will let you edit your files in place after you've uploaded them. This is good for quick fixes or copying and pasting sections without a complete wipe. <h3 id="heading-adding-the-workflows">Adding the Workflows</h3> You'll find most of what makes Decapod Decapod in the components folder. And the heart of that is in n8n_workflows. You can manually import those workflows one at a time and go over each one to make sure they're safe and sound. Or you can use the n8n CLI inside of the Docker container and save yourself some tedium. These commands move the workflows to the Docker container, import them with the n8n CLI, and then remove them from the tmp directory. <pre><code class="language-shell">docker cp ./components/n8n_workflows n8n:/tmp/workflows docker exec -u node n8n n8n import:workflow --input=/tmp/workflows --separate docker exec -u node n8n n8n import:workflow --input=/tmp/workflows/skills --separate docker exec -u root n8n rm -rf /tmp/workflows </code></pre> Now, you should see the 10 workflows in n8n. I'd recommend drag-and-dropping the main workflows to a dedicated decapod folder and the two skills to decapod/skills, just to keep things tidy. But they reference each other by id, so do what you want. <h3 id="heading-getting-started-with-n8n">Getting Started With n8n</h3> Now would be a good time to start exploring the workflows in your n8n UI Personal tab. If you sort them by name, the main file will be on top. Crack it open and see it's not too intense, and it's self-documented. 
Blue for notes, Green for sub-workflows, and Red for nodes that require your credentials. I'd recommend reading the notes and thoroughly exploring the sub-workflows to help you understand Decapod. It's your tool now! Create credentials as you go. Because we're using a Docker network, creating credentials and connecting your services to each other couldn't be easier. The standard way to connect all of your services is to reference them by <code>name:port</code>. Because the Postgres credential has its own port field, you can just set the host to <code>postgres</code> and the port to 5432. 📝 Note! All credential details, like your container names, ports, and passwords, can be found in your docker-compose and .env files. For MinIO: <ul> <li>Endpoint: <code>http://minio:9000</code> </li> <li>Force Path Style: Enabled! Important for MinIO. </li> </ul> API Connections to OpenRouter: <ul> <li>choose: Authentication -> Predefined Credential Type </li> <li>then: Credential Type -> OpenRouter </li> <li>Now just paste your API key from <a href="https://openrouter.ai/settings/keys">OpenRouter</a>. </li> </ul> n8n – (meta access to your workflow): <ul> <li>In a new tab, go to n8n Settings -> n8n API. </li> <li>Turn off expiration if you like. </li> <li>Copy your key. </li> <li>Paste it in the field. </li> <li>Base URL: <code>http://n8n:5678/api/v1</code> </li> </ul> Once you've created credentials, you can reuse them for every relevant node that uses the same credential. Just select it from the dropdown. 💡 Tip! It may help to remove the red sticky notes as you add credentials. And don't forget the skills! I didn't sticky note them at all. As a final step, make sure your n8n workflows are published in the following order: <ul> <li>construct message history </li> <li>get job queue </li> <li>hitl yes/no </li> <li>tool router </li> <li>worker </li> <li>middleware </li> <li>and the two skills </li> </ul> 💡 Tip! 
Always make sure your n8n workflows are in a published state with a green dot before calling them. Otherwise, you'll be calling an outdated version. <h3 id="heading-now-get-owui-to-talk-to-decapod">Now, Get OWUI to Talk to Decapod</h3> OWUI is built for teams, so you have admin settings and personal settings. You'll want to edit the admin settings by clicking on the profile circle in the lower-left-hand corner, then Admin Panel -> Settings -> Connections. From there: <ul> <li>Ollama API: Disabled (just keeping things tidy). </li> <li>Open the default OpenAI connection by clicking on the gear and delete it too. </li> <li>Direct Connections: Enabled </li> <li>Cache Base Model List: Enabled. Now add your Decapod connector with the plus button. </li> <li>URL: <a href="http://n8n:5678/webhook/v1/decapod">http://n8n:5678/webhook/v1/decapod</a> (Click the cycle icon to confirm your connection.) </li> <li>Auth: none (it's all in the same Docker network, so it's fine for now. You can add a password for production.) </li> <li>Prefix ID: decapod (If you do decide to use OpenAI, Hugging Face, or whatever else, this will help distinguish the model hosts.) </li> </ul> That's it. Save and go to the Models tab. Decapod passes OpenRouter models straight through. So if you see hundreds of models, take a victory lap! That means that Decapod is working, live, accepting requests, and you've even set up your credentials properly (at least for OpenRouter). Now create a new chat session and pick a model. I like Claude Haiku 4.5. Fast, cheap, and good. Pick three. I did all of my Decapod dev with it in the saddle, so I know it works. And 3.5 million tokens towards testing iterations cost me $4, so I know it's reasonable. Alternatively, Kimi K2.5 will likely work and would be even a little bit cheaper. I burned through 4.7 million tokens installing a Docker container in Open Claw with Kimi for about $3. Time to say hello to your little friend! Haiku is fast. 
So if it takes more than a few seconds to respond, something could be borked in your n8n flow. It happened to me as I was writing this article. I had some issues with both Postgres and MinIO. 💡 Tip: If the agent does get hung, it's easier to resend the message than stop and try again. <h3 id="heading-there-was-supposed-to-be-an-earth-shattering-kaboom">There Was Supposed to Be an Earth Shattering Kaboom</h3> So, your agent really wants to talk to you, but all you have is a pulsating dot. It's likely that something got misconfigured in n8n. You can debug n8n by going to the middleware workflow and selecting <code>executions</code> from the top tab bar. If there's an error on the left list, look for a message in the lower right. This was when I had some database config issues and it couldn't find the state table. Some sub-workflows may fail quietly. You can trace flow from the webhook entry point to the error. All successful nodes will light up green. The bad node will be red. Drill down, check executions, and repeat for each sub-workflow. When you find the culprit – the actual bad node in the bad execution – select "copy to editor" in the upper-right-hand corner. That will freeze the workflow to that state. Open the node, fix the credential or whatever, and click <code>Execute Step</code> to see if it's fixed. Remember: after every change, always always always publish your update. Otherwise, n8n won't actually use the latest fixed version of your workflow. Once you've successfully debugged your Decapod, make sure that you clean out the loose unfinished jobs in the job_queue table with pgAdmin or whatever. Otherwise, your agent will try to complete each of them before finishing the next job. <h2 id="heading-the-ever-present-hello-world">The Ever-Present "Hello World"</h2> OK! Now for the moment of truth. You got your agent to say hello back. That was the easy part because it didn't need to do any work or use any tools. 
I set you up with two skills to put it to the test: write-file and research-recipes. The recipes skill connects your bot to a free recipe API (no key needed). It's not just pulling recipes out of training data. Try this prompt: Would you please look up pizza recipes and save them to a file? If all of your credentials are properly configured, you should get what you asked for. Open up MinIO or S3Drive and look in <code>/agents/decapod/documents</code> for the file. <h2 id="heading-into-the-future">Into the Future!</h2> I know that was a lot! (At least it felt like a lot from my end.) I hope it wasn't too painful. And look at the bright side: you just got a crash course on some really powerful technology. And if you made it through, that's a major accomplishment! The hard part is behind you. Now comes the fun. <h3 id="heading-a-work-in-progress">A Work in Progress</h3> I'll be honest. I just wanted to get Decapod out fast to prove how doable a personal agent is while Open Claw is still hot. Anyone can build their own Agentic AI with little or no code. And you don't have to settle for painful UI and poor security. You can have it all. But, as I've said, Decapod is still an MVP. Complete and functional, but feature light. And I was stressing about that a little bit. I wanted multiple agents and more skills for the early adopters. Then it hit me. Duh! You already have everything you need with n8n. You can add an n8n agent node, connect it to a model and an MCP server, and have a sub-agent ready to go in minutes. Then have your agent produce a skill sheet to contact the sub-agent. <h3 id="heading-adding-your-own-skills-limitless-potential">Adding Your Own Skills – Limitless Potential</h3> Let's create a dead simple n8n agent to search the web. Then we'll add that to Decapod as a new skill. In this image I used the prompt: <blockquote> Thank you so much! Next up, I want to give you web search access via a sub-agent. 
So your web search skill wouldn't directly search the web, but would instead call a simple agent to do the search for you. Would you please create a web-search.md skill for your future self to use? The only required field should be prompt. </blockquote> The agent's file folder is sandboxed by default, so the agent's <code>skills/web-search.md</code> is actually in the agent's private <code>documents</code> storage. I moved it to the actual skills folder and updated my agent's skills.yaml file with the new skill. Now I'll create a new n8n skill workflow in <code>decapod/skills/</code>. ⚡️ Important! Your n8n skill workflow name must match the skill name exactly. So, <code>web-search.md</code> would be a workflow called <code>web-search</code>. Decapod uses the name to look for the skill so it can be hot loaded without a secondary router. The n8n screenshot above was pretty much exactly the whole thing. Try rebuilding it yourself. I used chat input to make sure it was working with n8n's chat interface. And I used the <a href="https://www.pulsemcp.com/servers/exa">Exa Web Search MCP</a> as the search tool. I used Haiku as the model, but an even simpler model would have likely been just fine. OpenRouter has a number of free models with tool abilities that would probably do the trick. Once you have the workflow operating properly, replace the chat node with a "When Executed by Another Workflow" node with a <code>parameters</code> object as input. Next, open up the utility/workbench workflow. This tool will help you turn your web-search workflow into a skill. Work through each node in order, testing the node with the "Execute step" button as you go. Doing so will create output data that the next node can use as input data. <ol> <li>get workflow id from name: Set name to "web-search". </li> <li>deliver JSON arguments to skill: Set parameters object to { "prompt": "Can I please get a list of a variety of pizza recipes complete with links to their sources?" 
}; (or whatever matches your skill sheet) </li> <li>call skill based on workflow id: Should be ready to execute. </li> </ol> If your output looks like that, your skill should be ready to go. In this image I used the prompt: Alright! I think you're all set. Try doing a search for dessert pizza recipes. If your agent gives you the following error, make sure that it knows it MUST create a job before it can call the <code>use_skill</code> tool. It should know that from the instructions, but pobody's nerfect. (I'll need to tighten that up.) Hopefully that was also pretty painless and now your mind is exploding with possibilities like mine is. If you're unconcerned with safety or actively want to invoke Skynet, you can even give your agent a skill to create its own n8n skills with the <code>Create a workflow</code> node. But don't do that. <h2 id="heading-future-plans">Future Plans</h2> Here are a few more features I'd like to add: <ul> <li>/slash commands – You shouldn't have to go into n8n or pgAdmin to see what your agent is doing and manage its job queue. </li> <li>Streaming responses – I'd like to see what my agent is doing as it's doing it, but streaming is a bit tricky and was beyond the MVP. </li> <li>Multiple states – With multiple states, you can run multiple agents simultaneously. Or you can have different agents/models for different sessions. For example, you can have a health and fitness session with one agent with its own context window, job queue, and skill set. And you can have another one to help you keep track of your coding education. </li> <li>It's a bug, not a feature – There are many places where the state and model are hard-coded throughout the app. I also started working on features that didn't pan out and left some dangling nodes. I'd like to clean up the app and actually implement those features. </li> </ul> If you've read this far and are totally all in, I'd love to hear feedback and suggestions for more features. 
I'd be fascinated to hear about how Decapod is being used. And I'm also happy to answer any questions. <h2 id="heading-got-questions-meet-captain-finn">Got Questions? Meet Captain Finn!</h2> Decapod is the culmination of a year spent studying and learning all things AI and automation. It's also the result of 20 years in the world of coding and app development. I'm currently starting a community for AI Enthusiasts, Automation Inventors, and Systems Thinkers. It will be led by Captain Finn, a retro-futuristic space captain who got stranded without his crew in our time and space. He used AI, automation, and systems thinking to keep the ship working, give himself someone to talk to, and to wake up to the smell of fresh coffee every morning. And yes, Finn himself is an AI persona, operating from AI-automated systems, like Decapod, that he will be teaching people about. My goal is to create a welcoming environment for my fellow mad scientists, dreamers, and citizen developers to learn and grow with help from the community and Captain Finn Feldspar himself. I plan to release weekly articles, more tutorials like this, and other tips and tricks. Whether you want help with Decapod, learning automation, or just want to geek out about the power and future of AI — Captain Finn's Fleet has a place for you. <a href="https://discord.gg/HJtTpBAjQ5">Join here for free.</a> </article> <article> <h1> How to Build a Résumé Screening System Using Python and Multiprocessing </h1> Abdul Talha — Fri, 06 Feb 2026 16:19:01 +0000 Hiring the right candidate starts with one time-consuming task: screening résumés. If you’ve ever posted a job opening, you know the pain of hundreds of applications in your inbox, leaving you to spend hours reviewing each résumé manually. In this article, you’ll build a résumé screening system using pure Python, focusing on core programming concepts and the power of multiprocessing. 
You’ll create a custom system that automates the evaluation process by transforming unstructured résumé documents into a ranked leaderboard. By the end of this guide, you will: <ul> <li>Parse documents by extracting text from PDF and DOCX résumés using Python </li> <li>Extract information by identifying skills and keywords from résumé content </li> <li>Design a scoring algorithm using weighted logic to rank candidates objectively </li> <li>Build a web interface using Streamlit </li> <li>Deploy the application on Streamlit Cloud for public access </li> </ul> By following this tutorial, you’ll build a tool capable of processing hundreds of résumés in seconds. Here’s the source code: <a target="_blank" href="https://github.com/abdultalha0862/Resume_Parser_Project">GitHub Repository</a> <h2 id="heading-table-of-contents">Table of Contents</h2> <ul> <li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a> </li> <li><a class="post-section-overview" href="#heading-project-overview">Project Overview</a> </li> <li><a class="post-section-overview" href="#heading-how-the-system-works">How the System Works</a> </li> <li><a class="post-section-overview" href="#heading-system-architecture">System Architecture</a> </li> <li><a class="post-section-overview" href="#heading-project-structure">Project Structure</a> </li> <li><a class="post-section-overview" href="#heading-step-1-set-up-the-project">Step 1: Set Up the Project</a> </li> <li><a class="post-section-overview" href="#heading-step-2-build-the-resume-parser">Step 2: Build the Résumé Parser</a> </li> <li><a class="post-section-overview" href="#heading-step-3-build-the-keyword-extractor">Step 3: Build the Keyword Extractor</a> </li> <li><a class="post-section-overview" href="#heading-step-4-implement-the-scoring-engine">Step 4: Implement the Scoring Engine</a> </li> <li><a class="post-section-overview" href="#heading-step-5-build-the-web-interface">Step 5: Build the Web Interface</a> </li> <li><a 
class="post-section-overview" href="#heading-step-6-test-the-system">Step 6: Test the System</a> </li> <li><a class="post-section-overview" href="#heading-step-7-deploy-the-application">Step 7: Deploy the Application</a> </li> <li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a> </li> </ul> <h2 id="heading-prerequisites">Prerequisites</h2> To follow along with this tutorial, you should have: <ul> <li>Basic knowledge of Python (functions, loops, dictionaries) </li> <li>Python 3.8 or higher installed </li> <li>Familiarity with installing packages using <code>pip</code> </li> <li>A code editor such as VS Code, PyCharm, or any editor you prefer </li> </ul> <h2 id="heading-project-overview">Project Overview</h2> In this guide, you’ll develop a system that takes a folder of résumés and a Job Description (JD) as input. The system processes each résumé, extracts relevant information, and calculates a score based on how well the candidate matches the job requirements. <h2 id="heading-how-the-system-works">How the System Works</h2> The project consists of four core components: <ul> <li>Résumé Parser: Reads PDF and DOCX files and extracts text </li> <li>JD Parser: Analyses the job description to identify required skills </li> <li>Keyword Extractor: Matches résumé content against a skills taxonomy </li> <li>Scoring Engine: Ranks candidates using a weighted algorithm </li> </ul> <h3 id="heading-scoring-formula">Scoring Formula</h3> Here’s the scoring formula we’ll use: <pre><code class="lang-plaintext">Total Score = (Required Skills × 50%) + (Preferred Skills × 25%) + (Experience × 15%) + (Keywords × 10%) </code></pre> This approach ensures that essential skills carry more weight than secondary keywords. <h3 id="heading-how-this-approach-helps-reduce-bias">How This Approach Helps Reduce Bias</h3> This system evaluates résumés using predefined criteria instead of subjective judgment. 
Each résumé is scored based on the same set of required skills, preferred skills, experience indicators, and keywords. Because all candidates are evaluated using the same weighted formula, personal factors such as writing style, formatting, or unconscious preferences don’t influence the ranking. The scoring logic focuses only on how closely a résumé matches the job requirements. By normalising the evaluation process, the system promotes more consistent and objective screening, which helps reduce bias during the initial résumé review stage. <h2 id="heading-system-architecture">System Architecture</h2> <pre><code class="lang-plaintext">Input                Processing                               Output
─────                ──────────                               ──────
Résumés ──► Résumé Parser ──► Keyword Extractor ──┐
(PDF/DOCX)                                        │
                                                  ├──► Scoring Engine ──► Ranked Results
Job Description ──► JD Parser ────────────────────┘
(TXT/PDF)
</code></pre> The system follows a simple input–process–output flow. Résumés and the job description are provided as inputs. The Résumé Parser extracts text from each résumé, while the JD Parser identifies required and preferred skills from the job description. The extracted résumé text is then passed to the Keyword Extractor, which matches skills and keywords using a predefined taxonomy. Finally, the Scoring Engine applies a weighted formula to calculate a score for each candidate and outputs a ranked list of résumés. 
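To make the final stage of that flow concrete, here is a minimal sketch of a weighted scorer. The weights come from the scoring formula above, but everything else is an assumption for illustration: the function names, the idea of normalising each component to a 0..1 ratio, and the 0..100 scale are hypothetical and may not match the project's actual <code>scorer.py</code>.

```python
# Weights taken from the article's scoring formula.
WEIGHTS = {
    "required": 0.50,
    "preferred": 0.25,
    "experience": 0.15,
    "keywords": 0.10,
}


def match_ratio(found: set, wanted: set) -> float:
    """Fraction of wanted items that were found (0.0 if nothing is wanted)."""
    if not wanted:
        return 0.0
    return len(found & wanted) / len(wanted)


def total_score(required: float, preferred: float,
                experience: float, keywords: float) -> float:
    """Combine four component ratios (assumed to be in 0..1) into a 0..100 score."""
    parts = {
        "required": required,
        "preferred": preferred,
        "experience": experience,
        "keywords": keywords,
    }
    return 100 * sum(WEIGHTS[name] * value for name, value in parts.items())


def rank(scores: dict) -> list:
    """Sort (candidate, score) pairs from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    jd_required = {"python", "django", "sql"}
    alice_skills = {"python", "django", "sql", "docker"}
    bob_skills = {"python"}
    scores = {
        "alice": total_score(match_ratio(alice_skills, jd_required), 0.5, 1.0, 0.5),
        "bob": total_score(match_ratio(bob_skills, jd_required), 0.0, 0.5, 0.5),
    }
    for name, score in rank(scores):
        print(f"{name}: {score:.2f}/100")
```

Normalising each component before weighting keeps the total on a fixed scale regardless of how many skills a job description lists, which is one reasonable way to arrive at an "out of 100" score.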
<h2 id="heading-project-structure">Project Structure</h2> <pre><code class="lang-plaintext">resume_screening_system/
├── app.py                      # Streamlit web interface
├── main.py                     # Command-line interface
├── parsers/
│   ├── resume_parser.py        # PDF/DOCX text extraction
│   └── jd_parser.py            # Job description parsing
├── extractors/
│   └── keyword_extractor.py    # Skills and experience extraction
├── matcher/
│   └── scorer.py               # Scoring algorithm
├── data/
│   ├── config.json             # Scoring weights
│   └── skills_taxonomy.json    # Skills database
└── requirements.txt            # Dependencies
</code></pre> The project is organised into clear, modular directories. Parsing logic, keyword extraction, and scoring are separated into their own folders, while configuration files and data are kept isolated. This structure keeps the codebase easy to navigate, maintain, and extend. <h2 id="heading-step-1-set-up-the-project">Step 1: Set Up the Project</h2> Create the folder structure and set up a virtual environment: <pre><code class="lang-bash">mkdir resume_screening_system
cd resume_screening_system
mkdir parsers extractors matcher data input output
python -m venv venv
</code></pre> Then go ahead and activate the virtual environment: <pre><code class="lang-bash"># Windows (Git Bash)
source venv/Scripts/activate

# macOS / Linux
source venv/bin/activate
</code></pre> Install the required dependencies like this: <pre><code class="lang-bash">pip install PyPDF2 python-docx streamlit pandas
</code></pre> <h2 id="heading-step-2-build-the-resume-parser">Step 2: Build the Résumé Parser</h2> The résumé parser handles different file formats by using a separate extraction method for each type. For PDF files, the parser opens the document page by page and extracts text from each page using a PDF reader. The extracted text is combined into a single string for further processing. For DOCX files, the parser reads each paragraph in the document and joins the paragraph text into one block. 
This ensures consistent text output regardless of the résumé format. By converting all résumés into plain text, the parser allows components such as keyword extraction and scoring to work efficiently. File: <code>parsers/resume_parser.py</code> <pre><code class="lang-python">from pathlib import Path

import PyPDF2

# Excerpt: these two functions are methods of the parser class.

def _extract_pdf(self, file_path: Path) -> str:
    text = ""
    with open(file_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            # Skip pages where no text could be extracted
            if page_text:
                text += page_text + "\n"
    return text.strip()

def _extract_docx(self, file_path: Path) -> str:
    from docx import Document

    doc = Document(file_path)
    # One line of output per paragraph in the document
    return "\n".join(
        para.text for para in doc.paragraphs
    ).strip()
</code></pre> <h2 id="heading-step-3-build-the-keyword-extractor">Step 3: Build the Keyword Extractor</h2> This project uses a résumé dataset from <a target="_blank" href="https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset">Kaggle</a> to ensure the logic works with real-world professional data. The keyword extractor identifies skills by scanning the résumé text. The résumé text is first converted to lowercase so that matching is case-insensitive. A predefined skills taxonomy stores each skill along with its possible variations. The extractor checks the résumé text against these variations to find matches. Word boundaries are used during matching to avoid partial matches, such as matching “Java” inside “JavaScript”. Matched skills are stored in a set to prevent duplicates. This approach ensures consistent and controlled skill detection across all résumés. 
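As a quick illustration of the word-boundary point, the sketch below runs a whole-word match against a tiny taxonomy. The JSON content, the <code>contains_term</code> helper, and the sample sentence are all hypothetical. Only the nesting (category, then skill, then variations) is inferred from how the extractor iterates over its taxonomy.

```python
import json
import re

# Hypothetical slice of data/skills_taxonomy.json:
# category -> canonical skill name -> lowercase variations.
TAXONOMY_JSON = """
{
  "languages": {
    "java": ["java"],
    "javascript": ["javascript", "js"]
  }
}
"""


def contains_term(term: str, text: str) -> bool:
    """True if term appears in text as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


taxonomy = json.loads(TAXONOMY_JSON)
resume_text = "Five years of JavaScript and some JS tooling."

# Collect every skill that has at least one variation present as a whole word.
matched = {
    skill
    for skills in taxonomy.values()
    for skill, variations in skills.items()
    if any(contains_term(variation, resume_text) for variation in variations)
}
print(matched)
```

Without the <code>\b</code> anchors, the pattern <code>java</code> would also match inside "javascript" and the résumé would be credited with a skill it never mentions.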
File: <code>extractors/keyword_extractor.py</code> <pre><code class="lang-python">import re
from typing import Set

# Excerpt: method of the keyword extractor class. self.skills_taxonomy maps
# category -> skill name -> list of variations (assumed to be lowercase).
def extract_skills(self, text: str) -> Set[str]:
    text_lower = text.lower()
    found_skills = set()

    for category, skills_dict in self.skills_taxonomy.items():
        for skill_name, variations in skills_dict.items():
            for variation in variations:
                # Prevent "Java" matching "JavaScript"
                pattern = r"\b" + re.escape(variation) + r"\b"
                if re.search(pattern, text_lower):
                    found_skills.add(skill_name)
                    break  # one matching variation is enough for this skill

    return found_skills
</code></pre> <h2 id="heading-step-4-implement-the-scoring-engine">Step 4: Implement the Scoring Engine</h2> To produce objective rankings, the system uses a weighted scoring formula. <div class="hn-table"> <table> <thead> <tr> <td>Component</td><td>Weight</td><td>Rationale</td></tr> </thead> <tbody> <tr> <td>Required Skills</td><td>50%</td><td>Essential technical needs</td></tr> <tr> <td>Preferred Skills</td><td>25%</td><td>Competitive differentiators</td></tr> <tr> <td>Experience</td><td>15%</td><td>Professional depth</td></tr> <tr> <td>Keywords</td><td>10%</td><td>Domain familiarity</td></tr> </tbody> </table> </div><pre><code class="lang-plaintext">Total Score = (S_req × 0.50) + (S_pref × 0.25) + (E_exp × 0.15) + (K_key × 0.10) </code></pre> The scoring engine calculates a final score for each résumé using weighted values. It counts how many required skills, preferred skills, experience indicators, and keywords appear in a résumé. Each count is multiplied by its assigned weight, with required skills contributing the most. The weighted values are summed to produce a single score. Résumés are then sorted by this score to generate a ranked list of candidates. <h2 id="heading-step-5-build-the-web-interface">Step 5: Build the Web Interface</h2> Streamlit provides a simple web interface for interacting with the résumé screening system. The text area allows users to input a job description, while the file uploader lets them upload multiple résumé files. 
When the button is clicked, Streamlit triggers the backend logic to parse résumés, extract data, and calculate scores. The results are then displayed in the browser, allowing users to run the screening process without using the command line. File: <code>app.py</code> <pre><code class="lang-python">import streamlit as st

# Job description input
jd_text = st.text_area(
    "Paste the job description here:",
    height=300
)

# Résumé upload (multiple files allowed)
uploaded_files = st.file_uploader(
    "Upload resume files:",
    type=["pdf", "docx", "txt"],
    accept_multiple_files=True
)

if st.button("Screen Resumes", type="primary"):
    # Placeholder: the parsing, extraction, and scoring calls go here
    st.success("Processing resumes...")
</code></pre> Run the application: <pre><code class="lang-bash">streamlit run app.py
</code></pre> The app will be available at <a target="_blank" href="http://localhost:8501/"><code>http://localhost:8501</code></a>. <h2 id="heading-step-6-test-the-system">Step 6: Test the System</h2> <h3 id="heading-sample-job-description-input">Sample Job Description Input</h3> Below is an example of a job description you can use to test the system: <pre><code class="lang-plaintext">We are looking for a Senior Python Developer with strong experience in backend development.

Required Skills:
- Python
- Django
- REST APIs
- SQL

Preferred Skills:
- PostgreSQL
- Docker
- AWS

Experience:
- 3+ years of professional Python development
- Experience building web applications
</code></pre> This input helps the system identify required skills, preferred skills, and experience keywords, which are then used by the scoring engine to rank résumés. 
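The article doesn't show the JD Parser itself, so here is a rough sketch of how a job description in this shape could be split into sections. It assumes each section starts with a line ending in a colon and that items are written as "- " bullets. The function name and the returned dictionary layout are hypothetical and may differ from the project's <code>jd_parser.py</code>.

```python
def parse_job_description(jd_text: str) -> dict:
    """Group '- item' bullet lines under the most recent 'Heading:' line."""
    sections = {}
    current = None
    for raw_line in jd_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A heading such as "Required Skills:" starts a new section.
            current = line[:-1].lower()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            # A bullet belongs to whichever section was opened last.
            sections[current].append(line[2:].strip().lower())
    return sections


SAMPLE_JD = """We are looking for a Senior Python Developer with strong experience in backend development.

Required Skills:
- Python
- Django
- REST APIs
- SQL

Preferred Skills:
- PostgreSQL
- Docker
- AWS

Experience:
- 3+ years of professional Python development
- Experience building web applications
"""

parsed = parse_job_description(SAMPLE_JD)
for section, items in parsed.items():
    print(f"{section}: {items}")
```

Lowercasing the items here keeps them directly comparable with the lowercase skill names produced by the keyword extractor.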
To run the screening from the command line instead, use <code>main.py</code>: <pre><code class="lang-bash">python main.py
</code></pre> <h3 id="heading-sample-output">Sample Output</h3> <pre><code class="lang-plaintext">============================================================
SCREENING RESULTS
============================================================
Rank #1: Alice Johnson | Score: 85.42/100 | Matched: python, django, postgresql
Rank #2: Carol Davis | Score: 72.50/100 | Matched: python, django
</code></pre> <h2 id="heading-step-7-deploy-the-application">Step 7: Deploy the Application</h2> To make the system publicly accessible: <ol> <li>Push the code to GitHub </li> <li>Go to <a target="_blank" href="http://share.streamlit.io/"><code>share.streamlit.io</code></a> </li> <li>Select your <code>app.py</code> file </li> <li>Deploy the application </li> </ol> Your app will be live at: <pre><code class="lang-plaintext">https://your-app-name.streamlit.app
</code></pre> <h2 id="heading-conclusion">Conclusion</h2> In this tutorial, you’ve built a complete résumé screening system from scratch using Python. By combining text processing, structured scoring, and automation, this project demonstrates how manual résumé screening can be transformed into an efficient and objective workflow. This system helps reduce bias, save time, and evaluate candidates more consistently. Happy coding! </article> <article> <h1> Why You Should Stop Managing Kafka Manually – A Guide to Kafka UI and Cruise Control </h1> Ramesh Sinha — Wed, 14 Jan 2026 15:58:54 +0000 Over 80% of Fortune 100 companies use Apache Kafka. That's not surprising, as Kafka has revolutionized how we build real-time data pipelines and streaming applications. If you're working in software engineering today, chances are you've encountered Kafka in some capacity. But here's the thing: while Kafka itself is incredibly powerful, managing Kafka clusters is notoriously challenging. This isn't a flaw in Kafka – it's just the reality of distributed systems. 
The bigger your cluster grows, the more complex operations become. The most painful aspect? Manual cluster management. It's tedious, error-prone, and doesn't scale. What starts as simple topic creation with a few brokers turns into hours of carefully orchestrating partition reassignments across dozens of machines. One typo in a JSON file at 3 AM can take down production. Sound familiar? You're not alone. In this guide, you'll learn how two tools can transform Kafka operations from a manual slog into a manageable process: <ul> <li>Kafka UI – A modern web interface that replaces cryptic CLI commands with visual cluster management </li> <li>Cruise Control – LinkedIn's automation engine that handles cluster balancing and self-healing </li> </ul> We'll start by experiencing the pain of manual management firsthand, then see how these tools solve real-world operational challenges. You'll set up everything locally with <code>Docker</code> and by the end you’ll know exactly how to manage Kafka clusters without the headache. 
<h2 id="heading-what-well-cover">What We’ll Cover:</h2> <ul> <li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a> </li> <li><a class="post-section-overview" href="#heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</a> </li> <li><a class="post-section-overview" href="#heading-starting-the-cluster-amp-verification">Starting the Cluster & Verification</a> </li> <li><a class="post-section-overview" href="#heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</a> </li> <li><a class="post-section-overview" href="#heading-kafka-ui">Kafka UI</a> <ul> <li><a class="post-section-overview" href="#heading-setting-up-kafka-ui">Setting up Kafka UI</a> </li> <li><a class="post-section-overview" href="#heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</a> </li> </ul> </li> <li><a class="post-section-overview" href="#heading-cruise-control">Cruise Control</a> <ul> <li><a class="post-section-overview" href="#heading-how-cruise-control-works">How Cruise Control Works</a> </li> <li><a class="post-section-overview" href="#heading-setting-up-cruise-control">Setting Up Cruise Control</a> </li> <li><a class="post-section-overview" href="#heading-cruise-control-configuration-file">Cruise Control Configuration File</a> </li> <li><a class="post-section-overview" href="#heading-creating-the-imbalance">Creating the Imbalance</a> </li> <li><a class="post-section-overview" href="#heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</a> </li> <li><a class="post-section-overview" href="#heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</a> </li> </ul> </li> <li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a> </li> </ul> <h2 id="heading-the-problem-manual-kafka-management">The Problem: Manual Kafka Management</h2> Let’s dive right in. 
First, I'm going to show you what managing a Kafka cluster looks like without any tools – just you, the command line, and dozens of manual operations. You’ll spin up a small cluster locally, create some topics, and simulate the kind of growth you'd see in a real production environment. By the end of this section, you'll understand exactly why teams spend thousands of engineering hours just keeping Kafka clusters running smoothly. Fair warning: this is going to feel tedious, but that’s ok – that’s the point. <h2 id="heading-prerequisites">Prerequisites</h2> Before we dive in, make sure you have: <ol> <li>Docker Desktop installed and running <ul> <li>Mac and Windows users: <a target="_blank" href="https://www.docker.com/products/docker-desktop/">https://www.docker.com/products/docker-desktop/</a> </li> <li>Linux users can install Docker Engine via their package manager </li> </ul> </li> <li>Basic Kafka knowledge. You should understand: <ul> <li>Topics: Categories for organizing messages </li> <li>Partitions: How topics are divided for parallelism </li> <li>Brokers: The Kafka servers that store data </li> <li>Producers and Consumers: Applications that write to and read from Kafka </li> <li>KRaft: Kafka's built-in Raft-based consensus protocol, which manages cluster metadata and controller election without ZooKeeper </li> </ul> </li> </ol> If these terms are new to you, <a target="_blank" href="https://www.freecodecamp.org/news/apache-kafka-handbook/">here’s a great handbook about them</a>. I’d also recommend reading <a target="_blank" href="https://kafka.apache.org/intro">Kafka's Introduction</a> first. <ol start="3"> <li>System Requirements <ul> <li>At least 8GB of RAM </li> <li>10GB of free disk space </li> </ul> </li> <li>Some basic understanding of containers is good to have: <ul> <li>Docker </li> <li>Images </li> <li>Volumes </li> <li>Networks </li> </ul> </li> </ol> <h2 id="heading-setting-up-our-unmanaged-cluster">Setting Up Our Unmanaged Cluster</h2> Let’s go ahead and build the cluster so that we can see the problems firsthand. 
We’ll use Docker to spin up three Kafka brokers running in <code>KRaft</code> mode (the modern, ZooKeeper-free approach). Start by creating a file called <code>docker-compose-basic.yml</code>: <pre><code class="lang-yaml">version: '3.8'

services:
  kafka-1:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-3-data:/var/lib/kafka/data

volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:
</code></pre> In the above configuration file, we’re creating three Kafka brokers (<code>kafka-1, kafka-2, kafka-3</code>). Each one uses the <code>confluentinc/cp-kafka:7.6.0</code> image and publishes its own host port (<code>9092, 9093, 9094</code>). The environment variables are: <ul> <li>KAFKA_NODE_ID – A unique identifier for each broker (1, 2, 3). No two brokers can have the same ID. </li> <li>KAFKA_PROCESS_ROLES: broker,controller – This tells Kafka to run in <code>KRaft</code> mode (without ZooKeeper). Each broker acts as both a data broker and a controller for cluster coordination. </li> <li>KAFKA_CONTROLLER_QUORUM_VOTERS – The membership list that tells each broker how to find the others. All three brokers must have the identical list: <code>1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093</code>. This is how they discover each other and elect the active controller. </li> <li>CLUSTER_ID – A unique identifier for the entire cluster. All brokers must use the exact same value or they won't recognize each other as part of the same cluster. 
The actual value (<code>MkU3OEVBNTcwNTJENDM2Qk</code>) doesn't matter as long as it is consistent across brokers. Note, however, that CLUSTER_ID must be a valid <code>base64-encoded UUID</code>, as Kafka requires. </li>
<li>KAFKA_LISTENERS – Defines which network interfaces and ports Kafka listens on. We have three listeners: <ul>
<li>PLAINTEXT://0.0.0.0:29092: For inter-broker communication (brokers talking to each other) </li>
<li>CONTROLLER://0.0.0.0:29093: For controller communication in <code>KRaft</code> mode </li>
<li>PLAINTEXT_HOST://0.0.0.0:9092 (9093 and 9094 on the other two brokers): For external connections from your machine </li>
</ul> </li>
<li>KAFKA_ADVERTISED_LISTENERS – Tells clients (producers/consumers) how to connect to this broker. This is what gets returned when a client asks "<code>where should I connect?</code>" The PLAINTEXT_HOST://localhost:9092 part is what allows you to connect from your Mac. </li>
</ul> Note: Listener configuration is critical. Incorrect settings will prevent clients from connecting even when the brokers are running. These settings work for local Docker environments, where Docker's internal DNS resolves the broker names (<code>kafka-1, kafka-2, kafka-3</code>). For production, replace the hostnames with actual IP addresses or FQDNs (Fully Qualified Domain Names). Two more settings are worth explaining: <ul>
<li>KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2 – How many copies of consumer offset data to keep. We use 2 instead of 3 because with only three brokers, this prevents issues during rolling restarts. In production with more brokers, you'd use 3 or more. </li>
<li>The Volumes – <code>kafka-x-data:/var/lib/kafka/data</code> creates persistent storage for each broker’s data. Without volumes, you will lose your topics and messages when you stop or restart your containers. Each broker gets its own volume so that brokers don’t accidentally share data. </li>
</ul> Note: To restart from scratch, you need to delete the volumes using the following command. 
The <code>-v</code> flag removes volumes. Without it, old data persists even after <code>down</code>. <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre> If you're using the legacy <code>docker-compose</code> tool (V1), replace <code>docker compose</code> with <code>docker-compose</code> in all commands throughout this tutorial. <h3 id="heading-ports">Ports</h3> Each broker uses three ports. Their purposes are: <div class="hn-table"> <table> <thead> <tr> <td>Port</td><td>Purpose</td></tr> </thead> <tbody> <tr> <td>9092 (9093 and 9094 on kafka-2 and kafka-3)</td><td>External connections (producers and consumers on your Mac)</td></tr> <tr> <td>29092</td><td>Internal broker-to-broker communication</td></tr> <tr> <td>29093</td><td>Cluster coordination via KRaft</td></tr> </tbody> </table> </div><h2 id="heading-starting-the-cluster-amp-verification">Starting the Cluster & Verification</h2> Now that we have the basic Docker configuration for Kafka, let’s run it and verify the results. Run the following command in the same directory where you saved <code>docker-compose-basic.yml</code>: <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre> The <code>-d</code> flag runs the containers in detached mode (in the background), so you get your terminal back. Next, check whether the containers running the Kafka brokers are up: <pre><code class="lang-bash">docker ps
</code></pre> You should see three Kafka containers (kafka-1, kafka-2, kafka-3) with the status “<code>Up</code>”. Run the following command to verify that all three brokers are registered in the cluster: <pre><code class="lang-bash">docker exec -it kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre> You should see API version information for all three brokers (IDs 1, 2, 3) without any connection errors. 
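If you would rather get a quick count than read through the whole listing, a small helper can tally the brokers that answered. This is only a sketch: it assumes each broker's section starts with a header line of the form <code>kafka-1:29092 (id: 1 rack: null) -> (</code>, and the function name <code>count_brokers</code> is made up for this example.

```shell
# count_brokers: read kafka-broker-api-versions output on stdin and print
# how many distinct broker IDs appear in the "(id: N rack: ...)" header lines.
# Hypothetical helper; the header format is an assumption about the tool's output.
count_brokers() {
  grep -oE '\(id: [0-9]+' | sort -u | wc -l | tr -d ' '
}
```

To use it, pipe the verification command into the function, dropping <code>-t</code> because the output is no longer going to a terminal: <code>docker exec kafka-1 kafka-broker-api-versions --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | count_brokers</code>. With all three brokers up, it should print 3.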
Note that we’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> here (the internal Docker addresses) instead of localhost:9092 because this command runs inside the <code>kafka-1</code> container by virtue of <code>docker exec -it kafka-1</code>, where <code>localhost</code> only refers to that specific container. If any of the above verification steps returns errors or doesn’t show the expected result, you can run the following command to see the logs and debug: <pre><code class="lang-bash">docker logs kafka-1
</code></pre> <h2 id="heading-creating-topics-the-manual-way">Creating Topics: The Manual Way</h2> Now that we have a cluster running, let’s simulate a real production use case where different teams need Kafka topics for their applications – payments, logs, events, metrics, notifications, you name it. Let’s start by creating a topic for logs. The command to do this is: <pre><code class="lang-bash">docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre> You’ll need to specify some command parameters, which are: <ol>
<li>The exact broker addresses <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> (or the IP addresses of your servers in production) </li>
<li>The number of partitions – I have used <code>12</code> in the above command. Creating too few partitions creates bottlenecks, while creating too many adds overhead. </li>
<li>Retention policy – I have used 7 days (that is, 604800000 milliseconds) </li>
<li>Compression type – I have used <code>snappy</code> </li>
</ol> Manually managing these parameters and running the command a handful of times is okay – but what if you have to run this for every team in your enterprise? Each team will have different requirements. 
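Since teams tend to ask for retention in days rather than milliseconds, it helps to see where a number like 604800000 comes from. The sketch below does the conversion with plain shell arithmetic; <code>days_to_ms</code> is a hypothetical helper name for this example, not part of Kafka's tooling.

```shell
# days_to_ms: convert a whole number of days to milliseconds for retention.ms.
# Hypothetical helper for illustration only.
days_to_ms() {
  local days="$1"
  # 24 hours * 60 minutes * 60 seconds * 1000 milliseconds
  echo $(( days * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7   # prints 604800000, the value used in the command above
```

You could then pass something like <code>--config retention.ms=$(days_to_ms 30)</code> for a team that wants 30 days. It removes one source of typos, but it doesn't remove the repetition.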
The grind of copy, paste, adjust becomes painful if you have 100+ topics and multiple clusters (dev, staging, prod). Feel the pain yet? Well, let’s just go on for a minute and we’ll address this issue shortly. For now, if you run the above command, you should see the “Created topic” message. Note: We’re using <code>kafka-1:29092,kafka-2:29092,kafka-3:29092</code> to reach the Kafka brokers because we’re running the command inside the kafka-1 container using <code>docker exec</code>. Let's keep going. We’ll create more topics using the same command by changing the topic name and the number of partitions. Copy, paste, update, and run the above command a couple of times. On my machine, I ran it 3 more times as shown below (you can run it a few more times with different values – it won’t matter, because the concrete values are not important for this tutorial): <pre><code class="lang-bash">docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

docker exec -it kafka-1 kafka-topics \
  --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
</code></pre> After creating the topics, let’s see all the ones you have now by running the following command: <pre><code class="lang-bash">docker exec -it kafka-1 kafka-topics \
  --list \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre> You should see the names of the topics you created. Notice that you just get the list of topics but no 
meaningful information, like: <ul>
<li>How many partitions does each have? </li>
<li>Which brokers are hosting them? </li>
<li>Are they evenly distributed? </li>
<li>What are their configurations? </li>
</ul> <h3 id="heading-partition-information">Partition Information</h3> Let’s try to get information about our partitions. For this tutorial, I have created 4 topics and a total of 40 partitions spread across three brokers. I want to see which broker has the most partitions. In a well-managed cluster, you’d want them roughly evenly distributed. But how can we check that? Maybe the describe command shown below can help. Let’s run it: <pre><code class="lang-bash">docker exec -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092
</code></pre> It will return a wall of text, with one line per partition. So, we have partition information, but: <ul>
<li>No summary or aggregation </li>
<li>No visual representation </li>
<li>It’s difficult to scan and compare </li>
<li>It gets harder to read with every topic you add </li>
</ul> <h3 id="heading-counting-leaders">Counting Leaders</h3> The Leader field in the describe output tells you which broker is the leader for each partition. Leaders handle all read and write requests, so you want them evenly distributed or else some brokers will become overloaded. Let’s try to count how many partitions each broker leads. To do that, run the following command: <pre><code class="lang-bash">docker exec -it kafka-1 kafka-topics \
  --describe \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | grep "Leader: 1" | wc -l
</code></pre> For my topics, the command prints <code>14</code>, which is the number of partitions where <code>broker 1 (Leader: 1)</code> is the leader. You might see a different number depending on how many topics and how many partitions you have created. You can repeat this command to see the count of partitions led by other brokers. 
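A single pass over the same output can also tally every broker at once. The sketch below assumes each partition line of the describe output contains a <code>Leader: N</code> field, as the <code>grep</code> above already relies on; <code>leaders_per_broker</code> is a made-up name for this example.

```shell
# leaders_per_broker: read `kafka-topics --describe` output on stdin and print
# one "broker <id>: <count>" line per leader, sorted by broker ID.
# Hypothetical helper; it only relies on the "Leader: N" field.
leaders_per_broker() {
  awk '{
         # Scan the whitespace-separated fields for "Leader:" and count its value.
         for (i = 1; i < NF; i++)
           if ($i == "Leader:") count[$(i + 1)]++
       }
       END { for (b in count) print "broker " b ": " count[b] }' | sort -k2,2n
}
```

Pipe the describe command into it, without <code>-t</code> since the output is being piped: <code>docker exec kafka-1 kafka-topics --describe --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 | leaders_per_broker</code>. The manual approach still works too, if you prefer to repeat the original command for each broker.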
To do so, just change <code>Leader: 1</code> to <code>Leader: 2</code> or <code>Leader: 3</code>. I get <code>14, 12, 14</code>. That’s somewhat balanced, but you had to run the command multiple times and parse the output using <code>grep</code> and <code>wc</code>, and this is just 3 brokers. What if you had 100+? Also, what if you need the replicas’ information? I could go on and on with the data we need and the commands to get that information. But the point I’m trying to make here is that sooner or later this becomes impossible to manage. Your team is going to need an army, and to be honest there isn’t much value in doing all of this manually. So far, you’ve seen only simple operational commands, but the problems don’t stop there. In a real production environment there are more complex and challenging operations like: <ul>
<li>Consumer lag monitoring: When consumers fall behind, you need to track which partitions are lagging, which consumer instances own them, and whether the lag is growing or shrinking. With CLI tools, you get raw numbers but no trends or context. </li>
<li>Broker failures: When a broker fails, you need to identify under-replicated partitions, trigger leader elections, and create partition reassignment <code>JSON</code> files manually. One mistake in that JSON can cause data loss. </li>
<li>Cluster rebalancing: When you add new brokers, they sit empty until you manually redistribute partitions. Similarly, before removing brokers, you need to move all their partitions off first. These operations require calculating optimal placement and creating complex reassignment plans. </li>
</ul> If you’re still with me, you’re probably thinking that there has to be a better way. Fortunately, there is – actually, there are a couple of complementary ways, and we are going to talk about those next. <h2 id="heading-kafka-ui">Kafka UI</h2> Kafka UI is a modern, open-source web interface for managing Kafka clusters. 
It replaces the <code>command line chaos</code> we just experienced with a clean, visual dashboard. Kafka UI provides the following features: <ul>
<li><code>Visual cluster overview</code>: see all brokers, topics, and partitions at a glance </li>
<li><code>Topic management</code>: create, configure, and delete topics with a GUI </li>
<li><code>Consumer group monitoring</code>: track lags, offsets, and consumer health in real time </li>
<li><code>Message browsing</code>: view actual messages in topics without command line tools </li>
</ul> Without further ado, let’s set up Kafka UI. <h3 id="heading-setting-up-kafka-ui">Setting Up Kafka UI</h3> To set up Kafka UI, let’s modify our existing <code>docker-compose-basic.yml</code> like this: <pre><code class="lang-yaml">version: '3.8'

services:
  kafka-1:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
    volumes:
      - kafka-3-data:/var/lib/kafka/data

  # Adding kafka-UI service start
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "8080:8080"
    environment:
      DYNAMIC_CONFIG_ENABLED: 'true'
      KAFKA_CLUSTERS_0_NAME: freecodecamp-cluster
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
  # Adding kafka-UI service end

volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:
</code></pre> The YAML file is pretty much the same as before, except that we have added a new service called 
<code>kafka-ui</code> (for clarity, I have placed the changes between the start and end comments). The key configurations are: <ul>
<li>Port 8080 – You can access the UI at <a target="_blank" href="http://localhost:8080">http://localhost:8080</a> from your machine. </li>
<li>KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS – This environment variable tells Kafka UI where to find your cluster (using the internal Docker addresses). </li>
<li>KAFKA_CLUSTERS_0_NAME – A friendly name for your cluster in the UI. </li>
</ul> Let’s first shut down the old cluster while keeping the topic data intact. Go ahead and run the following command to do so: <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down
</code></pre> Note that we’re not using <code>-v</code> here, so the volumes (topic data) will remain intact. Wait a couple of seconds and then run the following command to bring up our cluster with Kafka UI: <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre> Now open a browser and visit <a target="_blank" href="http://localhost:8080/">http://localhost:8080/</a>. You’ll see the Kafka UI dashboard. You can click around and see information about the cluster we have created, like: <ul>
<li>Your 3 brokers </li>
<li>The topics you created earlier </li>
<li>Partition counts </li>
</ul> For comparison with the manual commands, let's look at the Brokers tab. You can see the partition leader count for each broker at a glance – remember that we had to run multiple commands to get this information earlier. Beyond this, the UI provides many other useful metrics that would require separate command-line queries. Remember the CLI commands we had to run to create topics? If you go to the <code>Topics</code> tab, you will notice that topic management (<code>creation, deletion, data cleanup</code>, and so on) takes just a few button clicks. Similarly, managing consumers only requires a few button clicks. 
After exploring the Kafka UI, you'll see how much easier it is to monitor your cluster compared to running individual CLI commands. <h3 id="heading-drawbacks-of-kafka-ui">Drawbacks of Kafka UI</h3> That said, Kafka UI does have some limitations: <ul>
<li>No automatic rebalancing: If one or a few brokers hold more partitions than the others, you must reassign them manually. </li>
<li>No self-healing: If a broker fails, you have to create reassignment plans manually. </li>
<li>No performance optimization: The UI can’t recommend intelligent partition placement. </li>
<li>No alerts: The UI doesn’t warn you before problems happen. </li>
</ul> For small clusters (3–10 brokers), Kafka UI and some command execution might be enough. You’ll be able to see problems clearly and fix them when needed. For large clusters, manual operations still don't scale, so we need some kind of complementary tool, and that tool is Cruise Control. <h2 id="heading-cruise-control">Cruise Control</h2> Cruise Control is an automation engine for Kafka clusters. While Kafka UI gives you visibility and manual control, Cruise Control provides intelligent automation and self-healing. You can think of Kafka UI as a dashboard with manual controls and Cruise Control as an autopilot. In other words, they complement each other. Let’s try to create some imbalance in our cluster and fix it manually. This will help you reason through why you need Cruise Control. To keep things simple, let’s start from scratch. We will first delete all the Docker resources we have created so far by running the following command: <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml down -v
</code></pre> Running <code>down -v</code> will delete all the topics and messages we created so far, but don’t worry – we’ll create them again. <h3 id="heading-how-cruise-control-works">How Cruise Control Works</h3> You can think of Cruise Control as a metric-monitoring and action-taking tool. 
Kafka brokers collect internal metrics (CPU, disk, network traffic, partition sizes), and a metrics reporter running inside each broker sends these metrics to a Kafka topic. Cruise Control then reads from that topic and analyzes the data. Based on that analysis, it proposes partition movements. We’ll see this in action shortly. <h3 id="heading-setting-up-cruise-control">Setting Up Cruise Control</h3> As of this writing, I couldn’t find compatible Kafka and Cruise Control images that support <code>KRaft</code> (Kafka Raft, Kafka's built-in consensus protocol), so I decided to create public Kafka and Cruise Control images to help with the tutorial. I don’t recommend using these images in production. For production usage, you should either wait for the community to provide an image or create one of your own. Change the <code>docker-compose-basic.yml</code> file to look like the below: <pre><code class="lang-yaml">version: '3.8'

services:
  kafka-1:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-1
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-1:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      # Cruise Control Metrics Reporter
      KAFKA_METRIC_REPORTERS: 'com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: 'kafka-1:29092,kafka-2:29092,kafka-3:29092'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-1-data:/var/lib/kafka/data

  kafka-2:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-2:29092,PLAINTEXT_HOST://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_METRIC_REPORTERS: com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC: __CruiseControlMetrics
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-2-data:/var/lib/kafka/data

  kafka-3:
    image: justramesh2000/kafka-apache-cc:3.8.1
    container_name: kafka-3
    ports:
      - "9094:9094"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:29093,PLAINTEXT_HOST://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka-3:29092,PLAINTEXT_HOST://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 2
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'
      KAFKA_LOG_DIRS: /var/lib/kafka/data
      KAFKA_METRIC_REPORTERS: com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE: 'false'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC: __CruiseControlMetrics
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE: 'true'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS: '1'
      KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR: '2'
      KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS: '60000'
    volumes:
      - kafka-3-data:/var/lib/kafka/data

  # Adding kafka-UI service start
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "8080:8080"
    environment:
      DYNAMIC_CONFIG_ENABLED: 'true'
      KAFKA_CLUSTERS_0_NAME: freecodecamp-cluster
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
    volumes:
      - ./config:/opt/cruise-control/config
  # Adding kafka-UI service end

  # Adding cruise-control start
  cruise-control:
    image: justramesh2000/cruise-control-kraft:2.5.142
    container_name: cruise-control
    ports:
      - "9090:9090"
    volumes:
      - ./config/cruisecontrol.properties:/opt/cruise-control/config/cruisecontrol.properties
      - ./config/capacityJBOD.json:/opt/cruise-control/config/capacityJBOD.json:ro
      - ./config/log4j.properties:/opt/cruise-control/config/log4j.properties:ro
    depends_on:
      - kafka-1
      - kafka-2
      - kafka-3
  # Adding cruise-control end

volumes:
  kafka-1-data:
  kafka-2-data:
  kafka-3-data:
</code></pre> You should have made the following changes to the file: <ul>
<li>Changed the Kafka image from <code>confluentinc/cp-kafka:7.6.0</code> to <code>justramesh2000/kafka-apache-cc:3.8.1</code>. The new image contains the Cruise Control metrics reporter, which exports metrics data from the Kafka brokers for Cruise Control to use. </li>
<li>Added the following environment variables: <ul>
<li>KAFKA_METRIC_REPORTERS – This variable tells Kafka to load a plugin called the <code>Cruise Control Metrics Reporter</code>. It runs inside each Kafka broker process and hooks into Kafka’s internal metrics system. This takes care of data collection. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_BOOTSTRAP_SERVERS – This tells the <code>Cruise Control Metrics Reporter</code> where to send metrics, meaning which Kafka brokers and which port. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_KUBERNETES_MODE – Setting this to <code>false</code> disables Kubernetes-specific behaviors (Pod name, id instead of Host). We are using Docker, so we don’t need the K8s behaviors. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_TOPIC – Specifies the name of the topic where metrics will be published. The default is <code>__CruiseControlMetrics</code>, but you can customize it using this variable if you want to. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_AUTO_CREATE – Automatically creates the <code>__CruiseControlMetrics</code> topic if it doesn’t exist. Without this setting, the reporter will fail to report until you create the topic manually. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_NUM_PARTITIONS – Defines the number of partitions for the topic <code>__CruiseControlMetrics</code>. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_TOPIC_REPLICATION_FACTOR – Tells Kafka how many copies of metrics data to keep. 
In our case, we’re keeping 2 copies of the data. </li>
<li>KAFKA_CRUISE_CONTROL_METRICS_REPORTER_METRICS_REPORTING_INTERVAL_MS – Tells the reporter how often to send metrics. We’re sending them every minute. </li>
</ul> </li>
<li>Added the cruise-control service using the image <code>justramesh2000/cruise-control-kraft:2.5.142</code>. For clarity, I have kept this change between the <code>start</code> and <code>end</code> comments. </li>
<li>Under cruise-control, we’ve mounted <code>three</code> Cruise Control configuration files. We’ll talk about those files next. </li>
</ul> <h3 id="heading-cruise-control-configuration-file">Cruise Control Configuration File</h3> To run Cruise Control, we need to provide several configuration files. Among the key pieces of information are: <ul>
<li>Where the Kafka cluster is located </li>
<li>The capacity of each broker </li>
</ul> Create a config directory and add the following files: <pre><code class="lang-bash">mkdir config
</code></pre> <h4 id="heading-cruisecontrolproperties">cruisecontrol.properties</h4> This is Cruise Control’s main configuration file. Save the following content as <code>cruisecontrol.properties</code> in the config directory: <pre><code class="lang-abap"># Kafka cluster. Tells how to connect to brokers
bootstrap.servers=kafka-1:29092,kafka-2:29092,kafka-3:29092

# Topic from which metrics are to be read
metric.reporter.topic=__CruiseControlMetrics

# Aggregated partition data
partition.metric.sample.store.topic=__KafkaCruiseControlPartitionMetricSamples

# Aggregated broker data
broker.metric.sample.store.topic=__KafkaCruiseControlModelTrainingSamples

# Enable broker failure detection for KRaft mode (no ZooKeeper)
kafka.broker.failure.detection.enable=true

# Capacity. Tells where the capacity file is
capacity.config.file=config/capacityJBOD.json

# Goals. What to optimize for during cluster balancing.
# These are the rules for CC to abide by during rebalancing
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

# Hard goals.
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal

# Webserver. For WebApi access
webserver.http.port=9090
webserver.http.address=0.0.0.0

# Execution
num.broker.metrics.windows=1
num.partition.metrics.windows=1
</code></pre> I’ve added inline comments to explain much of the above configuration, but I think the <code>Goals</code> need special attention. These are the rules that we as users have set for Cruise Control to abide by. 
By defining goals, we tell Cruise Control to do the following: <ul>
<li><code>RackAwareGoal</code> – Spread replicas across racks (or in our case, brokers) </li>
<li><code>ReplicaCapacityGoal</code> – Don't overload brokers with too many replicas </li>
<li><code>DiskCapacityGoal</code> – Don't fill up the disk </li>
<li><code>NetworkInboundCapacityGoal</code> – Keep incoming network traffic within each broker's capacity </li>
<li><code>NetworkOutboundCapacityGoal</code> – Keep outgoing network traffic within each broker's capacity </li>
<li><code>CpuCapacityGoal</code> – Keep CPU usage within each broker's capacity </li>
<li><code>ReplicaDistributionGoal</code> – Evenly distribute replicas </li>
<li><code>DiskUsageDistributionGoal</code> – Ensure even disk usage across brokers </li>
<li><code>LeaderReplicaDistributionGoal</code> – Evenly distribute leader replicas </li>
<li><code>LeaderBytesInDistributionGoal</code> – Balance data flowing to leaders </li>
</ul> Via the Cruise Control configuration, you can define two types of goals: <code>Default goals</code> and <code>Hard goals</code>. Hard goals must be met. Default goals that aren’t part of the hard goals become soft goals. This means that Cruise Control will make its best effort to satisfy them but won’t reject a proposal if it can’t. Here’s a little summary: <div class="hn-table"> <table> <thead> <tr> <td>Type</td><td>Meaning</td><td>What CC Does</td></tr> </thead> <tbody> <tr> <td>Hard Goals</td><td>Must-haves (capacity limits)</td><td>Never violates – rejects the proposal if it can't satisfy them</td></tr> <tr> <td>Soft Goals</td><td>Nice-to-haves (better balance)</td><td>Tries to satisfy – skips them if they conflict with hard goals</td></tr> <tr> <td>Default Goals</td><td>Hard + Soft together</td><td>Optimizes for all – prioritizes hard over soft</td></tr> </tbody> </table> </div>Cruise Control collects metrics for a defined period (default: 5 minutes) and creates a monitoring window. 
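To get a feel for what monitoring windows cost in waiting time, the arithmetic is roughly the number of required windows multiplied by the window length. The sketch below is only that arithmetic, using the 5-minute window mentioned above; <code>warmup_minutes</code> is a hypothetical helper for this example, not a Cruise Control command, and the real warm-up can be longer if samples are missing.

```shell
# warmup_minutes: rough minutes of metrics needed before proposals are possible,
# given the number of required windows and the window length in milliseconds.
# Hypothetical helper for illustration only.
warmup_minutes() {
  local windows="$1" window_ms="$2"
  echo $(( windows * window_ms / 60000 ))
}

warmup_minutes 1 300000   # prints 5
warmup_minutes 5 300000   # prints 25
```

So a single window keeps the wait short for a tutorial, while several windows trade a longer wait for a steadier picture of the load.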
The following settings control how many windows Cruise Control needs before it’s ready to generate proposals (shortly, we will see what proposals are): <ul> <li><code>num.broker.metrics.windows=1</code>: Wait for 1 monitoring window before generating proposals. Each window in Cruise Control is 5 minutes by default. This means that Cruise Control will be ready after 5 minutes. I’ve set this to 1 for quick testing. The recommendation is to use a larger monitoring window in production to avoid false proposals from temporary spikes. </li> <li><code>num.partition.metrics.windows=1</code>: Wait for 1 window of partition metrics. Same reasoning as above. </li> </ul> <h4 id="heading-capacity">Capacity</h4> This informs Cruise Control about the capacity (CPU, disk, network) of each broker, which then helps it to make decisions. Using the below file, we’re telling Cruise Control the following: <ul> <li>What the broker IDs are </li> <li>What the disk path (<code>/var/lib/kafka/data</code>) and disk capacity are. The value <code>100000000</code> is in MB, which works out to about 100 TB (a 100 GB disk would be <code>100000</code>). This is used by <code>DiskCapacityGoal</code> that we set up in the above <code>cruisecontrol.properties</code> file. </li> <li>What the CPU capacity is: 100% (1 core). Used by <code>CpuCapacityGoal</code>. </li> <li>What the <code>NW_IN</code> network inbound capacity is: 125,000 KB/s = 125 MB/s (megabytes per second) = 1 Gbps (gigabits per second). Used by <code>NetworkInboundCapacityGoal</code>. </li> <li>What the <code>NW_OUT</code> network outbound capacity is (125,000 KB/s). 
Used by <code>NetworkOutboundCapacityGoal</code>. </li> </ul> Save the following content as <code>capacityJBOD.json</code> in the config directory: <pre><code class="lang-json">{
  "brokerCapacities":[
    {
      "brokerId": "1",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    },
    {
      "brokerId": "2",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    },
    {
      "brokerId": "3",
      "capacity": {
        "DISK": {"/var/lib/kafka/data": "100000000"},
        "CPU": "100",
        "NW_IN": "125000",
        "NW_OUT": "125000"
      }
    }
  ]
}
</code></pre> <h4 id="heading-logging">Logging</h4> This is not important for Cruise Control to work properly, but it’ll help you debug if there are issues. Save the following content as <code>log4j.properties</code> in the config directory. When you start Cruise Control, if you see unexpected behavior such as a container exiting, you can use the <code>docker logs</code> command to see what happened. 
<pre><code class="lang-abap"># Root logger - INFO level, output to console
rootLogger.level=INFO
appenders=console

# Console output (for docker logs)
appender.console.type=Console
appender.console.name=STDOUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=[%d] %p %m (%c)%n

# Send root logger to console
rootLogger.appenderRef.console.ref=STDOUT
</code></pre> Now that we have all the configurations in place, let’s run the following command to start Kafka brokers with Kafka UI and Cruise Control: <pre><code class="lang-bash">docker compose -f docker-compose-basic.yml up -d
</code></pre> Using the following command, verify that the three Kafka brokers, Kafka UI, and Cruise Control containers are running: <pre><code class="lang-bash">docker ps
</code></pre> You should see something like this: Now that we have Cruise Control up and running, let’s create some imbalance and see how much better the experience is with Cruise Control than when mitigating the imbalance manually. <h3 id="heading-creating-the-imbalance">Creating the Imbalance</h3> An imbalance is a scenario where some brokers are handling more messages than others – and they may run into high disk usage or high IOPS. To create the imbalance in our cluster, we’ll have to create a few topics and then produce messages unevenly. Now that you have Kafka UI running, you can create topics there, or you can create them using commands. I’m going to use the commands because it’ll be easier for you to reproduce my work (but I recommend the UI for production operations because it prevents typos). If you also decide to use commands, run the following command. Then, using the UI, verify that the topics have been created. Note: You’ll find that the commands are different from previous commands. This is because, previously in our <code>docker-compose-basic.yml</code> file, we were using the <code>confluentinc/cp-kafka:7.6.0</code> image for Kafka. 
But now we’re using the <code>justramesh2000/kafka-apache-cc:3.8.1</code> image, which is based on the <code>apache/kafka:3.8.1</code> image. Different images keep the tools in different places, so the command needs to be adjusted to account for that. <pre><code class="lang-bash">docker exec -it kafka-1 bash -c '
/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --partitions 12 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-views \
  --bootstrap-server kafka-1:29092 \
  --partitions 20 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-analytics \
  --bootstrap-server kafka-1:29092 \
  --partitions 3 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

/opt/kafka/bin/kafka-topics.sh --create \
  --topic freecodecamp-articles \
  --bootstrap-server kafka-1:29092 \
  --partitions 5 \
  --replication-factor 2 \
  --config retention.ms=604800000 \
  --config compression.type=snappy
'
</code></pre> Run the following commands to produce messages unevenly across the topics we created above. 
Heavy load on <code>freecodecamp-logs</code>: <pre><code class="lang-bash">docker exec -it kafka-1 bash -c "
for i in {1..100000}; do
  echo '{\"log_id\":\"'\$i'\",\"level\":\"INFO\",\"message\":\"Log entry '\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-logs --bootstrap-server kafka-1:29092"
</code></pre> Heavy load on <code>freecodecamp-views</code>: <pre><code class="lang-bash">docker exec -it kafka-1 bash -c "
for i in {1..80000}; do
  echo '{\"view_id\":\"'\$i'\",\"page\":\"/article/'\$((i % 100))'\",\"user\":\"user_'\$((i % 1000))'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-views --bootstrap-server kafka-1:29092"
</code></pre> Moderate load on <code>freecodecamp-analytics</code>: <pre><code class="lang-bash">docker exec -it kafka-1 bash -c "
for i in {1..30000}; do
  echo '{\"event\":\"page_view\",\"user\":\"user_'\$i'\"}'
done | /opt/kafka/bin/kafka-console-producer.sh --topic freecodecamp-analytics --bootstrap-server kafka-1:29092"
</code></pre> Now, produce messages with a <code>fixed key</code> to force all of that data into one partition. This is a fast way to create a strong disk imbalance. Run the following command: <pre><code class="lang-bash">docker exec -it kafka-1 bash -c "
for i in {1..300000}; do
  echo 'hotkey:{\"log_id\":'\$i',\"msg\":\"big payload\"}'
done | /opt/kafka/bin/kafka-console-producer.sh \
  --topic freecodecamp-logs \
  --bootstrap-server kafka-1:29092 \
  --property parse.key=true \
  --property key.separator=:"
</code></pre> After running the above commands, come back to the UI, refresh, and you will see a number of messages like this: Now, go to the Brokers tab and see the imbalance in Disk Usage: You should be able to see that Broker-2 has only about 47% of the data that Broker-1 has, and Broker-3 has about 11% more data than Broker-1. Broker-2 is significantly underutilized, while Broker-1 and Broker-3 hold most of the data. 
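The fixed-key command is what drives most of this skew: Kafka picks the partition for a keyed record by hashing the key, so every record with the same key lands in the same partition. Here is a minimal sketch of that idea. Note the assumptions: <code>pick_partition</code> is a hypothetical helper, and it uses CRC32 as a stand-in hash, whereas Kafka's default partitioner uses murmur2, so the actual partition number in your cluster will differ.

```python
import zlib

NUM_PARTITIONS = 12  # freecodecamp-logs was created with 12 partitions


def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition (hypothetical helper).

    Assumption: CRC32 stands in for the murmur2 hash that Kafka's
    default partitioner uses. The principle is the same: equal keys
    always produce the same partition.
    """
    return zlib.crc32(key) % num_partitions


# Every record shares the key "hotkey", as in the producer command above.
hot_partitions = {pick_partition(b"hotkey", NUM_PARTITIONS) for _ in range(1000)}

# For contrast: records with varied keys spread over several partitions.
varied_partitions = {
    pick_partition(f"key-{i}".encode(), NUM_PARTITIONS) for i in range(1000)
}

print("partitions hit by the fixed key:", len(hot_partitions))
print("partitions hit by varied keys:", len(varied_partitions))
```

The fixed key always resolves to a single partition, which is why one partition of <code>freecodecamp-logs</code> (and the two brokers holding its replicas) ends up with most of the data.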
<h3 id="heading-attempting-manual-rebalancing">Attempting Manual Rebalancing</h3> Step 1: First, we need to find out which topic is heavy – meaning which one handles more data. My setup shows the <code>freecodecamp-logs</code> topic with 8 MB of data: Step 2: Let’s see where the heavy partitions are. Click on freecodecamp-logs in Kafka UI and see the partition table. Look at the message count: partition 4 is bigger than the others. The table also gives information about replicas of partitions: partition 4 has replicas on Broker 1 and 3. Broker 2 doesn’t have partition 4 at all. This explains why Broker 2 was underutilized. Step 3: To balance the cluster, we need to move partition 4 around. We can move partition 4 to Broker 2. But before that, let’s do some math to be able to rationalize our decision. Note that the calculation doesn’t have to be precise – we just want a relative sense of data between brokers. Current state: <ul> <li>Broker 1: 4.55 MB </li> <li>Broker 2: 2.29 MB (underutilized) </li> <li>Broker 3: 5.11 MB (over-utilized) </li> </ul> Note that the compressed data size for partition 4 is roughly 2.25 MB (the exact size is not critical). If we move partition 4 from [1,3] to [2,3]: <ul> <li>Broker 1: Loses partition 4, so 4.55 - 2.25 = ~2.30 MB </li> <li>Broker 2: Gains partition 4, so 2.29 + 2.25 = ~4.54 MB </li> <li>Broker 3: Already has partition 4, so = 5.11 MB (no change) </li> </ul> The result is that Broker 1 becomes underutilized. How about if we move partition 4 from [1,3] to [1,2]? <ul> <li>Broker 1: Already has partition 4 = 4.55 MB (no change) </li> <li>Broker 2: Gains partition 4, so 2.29 + 2.25 = ~4.54 MB </li> <li>Broker 3: Loses partition 4, so 5.11 - 2.25 = ~2.86 MB </li> </ul> Hmm, this still creates an imbalance (Broker 3 becomes too light). So basically, manual rebalancing requires complex calculations. Moving a single partition impacts disk usage, leader distribution, and network traffic across multiple brokers. 
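The what-if arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same rough assumptions: the per-broker figures come from the "Current state" list (4.55, 2.29, and 5.11 MB), partition 4 is taken as 2.25 MB, and <code>move_replica</code> and <code>spread</code> are hypothetical helper names. It looks at disk usage only and ignores leaders and network traffic.

```python
def move_replica(usage_mb, partition_mb, src, dst):
    """Return per-broker disk usage after moving one replica from src to dst."""
    after = dict(usage_mb)
    after[src] = round(after[src] - partition_mb, 2)  # source broker loses the data
    after[dst] = round(after[dst] + partition_mb, 2)  # destination broker gains it
    return after


def spread(usage_mb):
    """Gap between the fullest and the emptiest broker, in MB."""
    return round(max(usage_mb.values()) - min(usage_mb.values()), 2)


current = {1: 4.55, 2: 2.29, 3: 5.11}  # MB per broker, from the list above
PARTITION_4_MB = 2.25                  # rough size of partition 4

# Option A: replicas [1,3] -> [2,3], so the copy on Broker 1 moves to Broker 2
option_a = move_replica(current, PARTITION_4_MB, src=1, dst=2)

# Option B: replicas [1,3] -> [1,2], so the copy on Broker 3 moves to Broker 2
option_b = move_replica(current, PARTITION_4_MB, src=3, dst=2)

for label, usage in [("current", current), ("option A", option_a), ("option B", option_b)]:
    print(label, usage, "spread:", spread(usage))
```

With these figures, the first move leaves the spread essentially unchanged (2.82 MB before, 2.81 MB after), while the second narrows it to 1.69 MB but leaves Broker 3 as the light one. Neither single move produces an even cluster.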
One poorly planned move can create a new imbalance elsewhere. But let’s say you somehow landed on a perfect mathematical calculation and you’re ready to make the move to balance. We’ll assume that the perfect plan is to move partition 4 from [1, 3] to [2, 3]. I know it’s not the perfect move, but the point is to see the pain afterwards. Step 4: It’s time to move the partition manually. We need to tell Kafka to move partition 4's replicas from brokers [1,3] to brokers [2,3]. To do that, you need to create a file called <code>reassignment.json</code> on your machine: <pre><code class="lang-json">{
  "version": 1,
  "partitions": [
    {
      "topic": "freecodecamp-logs",
      "partition": 4,
      "replicas": [2, 3],
      "log_dirs": ["any", "any"]
    }
  ]
}
</code></pre> What this means: <ul> <li>"partition": 4 – Target partition </li> <li>"replicas": [2, 3] – New placement: brokers 2 and 3 </li> <li>"log_dirs": ["any", "any"] – Let Kafka choose the disk directory </li> </ul> Save this file somewhere accessible. Then run the following command to copy the JSON to the Kafka cluster: <pre><code class="lang-bash">docker cp reassignment.json kafka-1:/tmp/reassignment.json
</code></pre> This copies your local file into the kafka-1 container's /tmp directory. Run the following command to verify the file is there: <pre><code class="lang-bash">docker exec -it kafka-1 cat /tmp/reassignment.json
</code></pre> You should see your JSON file content. Now run the actual reassignment command: <pre><code class="lang-bash">docker exec -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --execute
</code></pre> You will get a message from Kafka that tells you whether it has accepted the reassignment and started moving the data. 
You can monitor the reassignment using the following command: <pre><code class="lang-bash">docker exec -it kafka-1 /opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server kafka-1:29092,kafka-2:29092,kafka-3:29092 \
  --reassignment-json-file /tmp/reassignment.json \
  --verify
</code></pre> I’m not going to run the manual reassignment because I want to keep the imbalance and show how Cruise Control can help reduce the manual steps. Next, let’s see how Cruise Control handles the same imbalance automatically. <h3 id="heading-rebalancing-using-cruise-control">Rebalancing Using Cruise Control</h3> After creating the topics and messages, I let Cruise Control run for a couple of minutes. During that time, it collected metrics and trained its linear regression model. You can run the following command to verify that Cruise Control is running fine and has data (this is a REST API call using curl): <pre><code class="lang-bash">curl http://localhost:9090/kafkacruisecontrol/state
</code></pre> You will get multiple JSON object outputs as part of the response. Each JSON object holds some information about the state of Cruise Control and the Kafka cluster. Let’s look at each of these one at a time: <pre><code class="lang-json">MonitorState: { state: RUNNING(20.000% trained), NumValidWindows: (1/1) (100.000%), NumValidPartitions: 105/105 (100.000%), flawedPartitions: 0 }
</code></pre> This tells us about the state of monitoring based on data collected by Cruise Control: <ul> <li><code>state: RUNNING(20.000% trained)</code> – Cruise Control is actively collecting metrics from your Kafka cluster. Right now it has trained its model on 20% of the expected monitoring data. </li> <li><code>NumValidWindows: (1/1) (100%)</code> – Cruise Control has collected 1 complete monitoring window out of 1 required (100% ready). Remember, we had set <code>num.broker.metrics.windows=1</code> in the <code>cruisecontrol.properties</code> configuration file. 
</li> <li><code>NumValidPartitions: 105/105 (100%)</code> – Cruise Control analyzed all 105 partitions and has metrics for all. </li> <li><code>flawedPartitions: 0</code> – None of the partitions have problematic or missing metrics. </li> </ul> <pre><code class="lang-json">ExecutorState: {state: NO_TASK_IN_PROGRESS} </code></pre> The above response indicates the execution engine is idle – no partition moves or leadership changes are currently in progress. This makes sense since we haven't asked Cruise Control to do anything yet. <pre><code class="lang-json">AnalyzerState: { isProposalReady: true, readyGoals: [ NetworkInboundCapacityGoal, LeaderBytesInDistributionGoal, DiskCapacityGoal, ReplicaDistributionGoal, RackAwareGoal, NetworkOutboundCapacityGoal, CpuCapacityGoal, DiskUsageDistributionGoal, LeaderReplicaDistributionGoal, ReplicaCapacityGoal ] } </code></pre> AnalyzerState tells whether Cruise Control is ready to show a proposal or not. In this case it’s ready. <ul> <li><code>isProposalReady: true</code> – Cruise Control has calculated a potential rebalancing plan (a proposal) that satisfies the configured goals. </li> <li><code>readyGoals</code> – These are the goals that are considered ready and valid for rebalancing. Examples: <ul> <li><code>DiskCapacityGoal</code>: balance disk usage among brokers </li> <li><code>ReplicaDistributionGoal</code>: balance number of replicas per broker </li> <li><code>RackAwareGoal</code>: maintain replicas across racks for fault tolerance </li> <li><code>LeaderBytesInDistributionGoal</code>: balance network traffic from leaders </li> <li><code>DiskUsageDistributionGoal</code>: ensures partitions are spread to prevent skew </li> </ul> </li> </ul> Note that these are the goals we had set earlier in the <code>cruisecontrol.properties</code> file. 
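Since the ready goals are the same ten from the configuration, it can help to recall which of them are hard and which end up soft. Here is a minimal sketch of the rule stated earlier (default goals that aren't part of the hard goals become soft goals). It uses short goal names instead of the fully qualified class names, and the variable names are only illustrative.

```python
# Goal lists copied from cruisecontrol.properties, shortened to class names.
DEFAULT_GOALS = [
    "RackAwareGoal",
    "ReplicaCapacityGoal",
    "DiskCapacityGoal",
    "NetworkInboundCapacityGoal",
    "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
    "ReplicaDistributionGoal",
    "DiskUsageDistributionGoal",
    "LeaderReplicaDistributionGoal",
    "LeaderBytesInDistributionGoal",
]

HARD_GOALS = [
    "RackAwareGoal",
    "ReplicaCapacityGoal",
    "DiskCapacityGoal",
    "NetworkInboundCapacityGoal",
    "NetworkOutboundCapacityGoal",
    "CpuCapacityGoal",
]

# Soft goals are the default goals that are not listed as hard goals.
hard = set(HARD_GOALS)
soft_goals = [goal for goal in DEFAULT_GOALS if goal not in hard]

print("hard goals:", len(HARD_GOALS))
print("soft goals:", soft_goals)
```

With this configuration, the four distribution goals are the soft ones: Cruise Control tries to satisfy them but won't reject a proposal over them.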
<pre><code class="lang-json">AnomalyDetectorState: { selfHealingEnabled:[], selfHealingDisabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingEnabledRatio:{...}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[], recentMaintenanceEvents:[], metrics:{...}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000 } </code></pre> Anomaly detection shows information about any existing anomaly and healing properties. <ul> <li><code>selfHealingEnabled: []</code> – Automatic self-healing is currently off. Cruise Control will not move partitions automatically in response to anomalies. </li> <li><code>selfHealingDisabled: [...]</code> – Lists the anomaly types that are disabled for automatic self-healing, including broker failures, disk failures, and goal violations. </li> <li><code>recentGoalViolations: []</code> – No goals have been violated recently. </li> <li><code>balancednessScore: 100.000</code> – This is how balanced the cluster is according to Cruise Control’s hard goals. 100% means the cluster is perfectly balanced according to the metrics and hard goals currently active. This metric only cares about Hard Goals (Disk Capacity, CPU capacity) being violated – that’s why it shows 100% even though we know there are some disk usage imbalances in our cluster. </li> </ul> <h4 id="heading-the-proposal">The Proposal</h4> Via AnalyzerState information, Cruise Control told us that it has a proposal for the cluster. Let’s see what it is. We can fetch the proposal using the proposal end point: <pre><code class="lang-bash">curl -s "http://localhost:9090/kafkacruisecontrol/proposals?json=true" </code></pre> The JSON response is quite large. 
Let's focus on the key parts that show our cluster's imbalance and how Cruise Control plans to fix it: <pre><code class="lang-json">{
  "summary": {
    "numReplicaMovements": 13,                 // CC wants to move 13 partition replicas
    "numLeaderMovements": 6,                   // And reassign 6 partition leaders
    "onDemandBalancednessScoreBefore": 84.67,  // Current: 84.67% balanced
    "onDemandBalancednessScoreAfter": 89.76    // After: 89.76% balanced
  },
  "goalSummary": [
    { "goal": "DiskUsageDistributionGoal", "status": "VIOLATED" },
    { "goal": "LeaderBytesInDistributionGoal", "status": "VIOLATED" }
  ]
}
</code></pre> Based on the calculations, Cruise Control thinks: <ol> <li>Moving 13 partition replicas will help. Note that manually we decided to move just 1 partition, that is partition 4. </li> <li>Reassigning 6 partition leaders will help. Manually we didn’t account for any leadership reassignment. </li> <li><code>DiskUsageDistributionGoal</code> has been violated. We know that the disk usage is not distributed perfectly. </li> <li><code>LeaderBytesInDistributionGoal</code> has also been violated. We couldn’t find this out manually. Technically, you could find out, but it would take a decent amount of manual calculation and would still be error-prone. </li> </ol> Note: While we're focusing on disk usage imbalance, Cruise Control optimizes for 10 different goals (disk, CPU, network, leaders, and so on). This holistic approach gives it a better chance of achieving true cluster balance versus balancing manually. <h4 id="heading-executing-the-proposal">Executing the proposal</h4> Let’s run the actual rebalancing using Cruise Control. The command is: <pre><code class="lang-bash">curl -X POST 'http://localhost:9090/kafkacruisecontrol/rebalance?dryrun=false&json=true'
</code></pre> Again, you’ll get a huge JSON response similar to the proposal. 
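Because these responses are large, a small script can pull out just the headline numbers. The sketch below is an assumption-laden illustration: <code>SAMPLE_RESPONSE</code> is the trimmed excerpt shown in the proposal section (with the comments removed so it parses as JSON), not a full Cruise Control response, and <code>summarize</code> is a hypothetical helper. A real response carries many more fields, so check the key names against your own output before relying on it.

```python
import json

# Trimmed excerpt of the proposal response shown earlier (not a full response).
SAMPLE_RESPONSE = """
{
  "summary": {
    "numReplicaMovements": 13,
    "numLeaderMovements": 6,
    "onDemandBalancednessScoreBefore": 84.67,
    "onDemandBalancednessScoreAfter": 89.76
  },
  "goalSummary": [
    {"goal": "DiskUsageDistributionGoal", "status": "VIOLATED"},
    {"goal": "LeaderBytesInDistributionGoal", "status": "VIOLATED"}
  ]
}
"""


def summarize(response_text):
    """Reduce a proposal-style response to a handful of numbers (hypothetical helper)."""
    data = json.loads(response_text)
    summary = data["summary"]
    before = summary["onDemandBalancednessScoreBefore"]
    after = summary["onDemandBalancednessScoreAfter"]
    return {
        "replica_moves": summary["numReplicaMovements"],
        "leader_moves": summary["numLeaderMovements"],
        "score_gain": round(after - before, 2),
        "violated": [g["goal"] for g in data["goalSummary"] if g["status"] == "VIOLATED"],
    }


result = summarize(SAMPLE_RESPONSE)
print(result)
```

To try it against a live cluster, you could save the curl output to a file and pass its contents to <code>summarize</code> instead of the sample string.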
You can track the status using following API call: <pre><code class="lang-bash">curl "http://localhost:9090/kafkacruisecontrol/user_tasks" </code></pre> You will get something like this: Note that the 4th item in the list is our rebalance API call and it’s complete. This was quick for our small Dev cluster, but in large clusters you may see status as <code>InExecution</code>. Let’s look at the UI to see what is the state of Imbalance now that Cruise Control has completed its execution of the proposal. The UI shows the following for me: <h4 id="heading-comparison">Comparison</h4> Before rebalancing: <ul> <li>Broker 1: 4.52 MB, 69 partitions, 35 leaders </li> <li>Broker 2: 2.22 MB, 69 partitions, 35 leaders (underutilized) </li> <li>Broker 3: 5.05 MB, 72 partitions, 35 leaders (overutilized) </li> <li>Disk range: 2.83 MB (5.05 - 2.22) </li> </ul> After rebalancing: <ul> <li>Broker 1: 4.66 MB, 69 partitions, 38 leaders </li> <li>Broker 2: 3.87 MB, 77 partitions, 31 leaders </li> <li>Broker 3: 4.87 MB, 64 partitions, 36 leaders </li> <li>Disk range: 1.00 MB (4.87 - 3.87) </li> </ul> Results: <ul> <li>Disk usage balanced – Range reduced from 2.83 MB to 1.00 MB (64% improvement!) </li> <li>Replicas redistributed – Broker 2 gained 8 replicas, Broker 3 lost 8 replicas </li> <li>Leaders balanced – Changed from 35-35-35 to 38-31-36. Cruise Control prioritized balancing actual network traffic over leader count. </li> </ul> The cluster is now more balanced across all metrics. Congrats! <h2 id="heading-conclusion">Conclusion</h2> We covered a lot in this tutorial, so let’s take a step back and look at what we did. You started by experiencing the reality of manual Kafka management – the endless CLI commands, the tedious calculations, the JSON files, and the potential for costly mistakes. If you felt frustrated during that section, that’s to be expected. That frustration is exactly what thousands of engineering teams deal with every day. 
Then you were presented with two complementary tools: <ol> <li>Kafka UI gave you visibility. No more grepping through command outputs or manually counting partition leaders. Everything you need (broker health, topic configurations, consumer lag) is right there in a clean web interface. For small teams and development environments, this alone is a game-changer. </li> <li>Cruise Control gave you intelligence. It didn't just automate what you'd do manually – it also did a fundamentally better job. While you were focused on moving one partition (partition 4), Cruise Control analyzed all 105 partitions across 10 different optimization goals and proposed a comprehensive rebalancing plan. That's the difference between human effort and automated intelligence. </li> </ol> I want to call out that this tutorial used a simplified setup. For production, you should expect more complex configurations, like: <ul> <li>Kafka and Cruise Control running on separate machines </li> <li>Larger monitoring window for Cruise Control </li> <li>Some self-healing capabilities enabled </li> </ul> If there's one thing you take away from this article, let it be this: you should stop managing your Kafka cluster manually. You've seen there's a better way. Use it. Thanks for reading! </article> <article> <h1> A Beginner’s Guide to Automation with n8n </h1> Manish Shivanandhan — Mon, 03 Nov 2025 15:27:58 +0000 Automation has become one of the most valuable skills for any technical team. It helps eliminate repetitive work, speeds up business operations, and lets you focus on creative or strategic tasks. Whether it’s moving data between apps, triggering actions when something changes, or building smart systems that run on their own, automation can save hours every week. The problem is that most automation platforms make you choose between flexibility and simplicity. Tools like Zapier are easy to use but limited when you need customisation. 
Writing your own scripts in Python or JavaScript gives you full control but takes more time to build and maintain. <a target="_blank" href="https://n8n.io/">n8n</a> changes that balance. It is an open-source workflow automation platform that provides both control and simplicity. n8n lets you automate anything from simple tasks to complex systems using a visual interface. You can drag and connect nodes to create workflows or write code when needed. It’s built for technical teams who want freedom without losing ease of use. In this article, we’ll learn how to build and deploy your own automation workflows using n8n. By the end, you’ll have a working automation server and the knowledge to create smart, self-running workflows for any use case. <h2 id="heading-table-of-contents">Table of Contents</h2> <ul> <li><a class="post-section-overview" href="#heading-what-n8n-does">What n8n Does</a> </li> <li><a class="post-section-overview" href="#heading-n8n-is-open-source">n8n Is Open Source</a> </li> <li><a class="post-section-overview" href="#heading-how-to-get-started-with-n8n">How to Get Started with n8n</a> </li> <li><a class="post-section-overview" href="#heading-building-a-n8n-workflow">Building a</a> <a class="post-section-overview" href="#heading-how-to-get-started-with-n8n">n8n</a> <a class="post-section-overview" href="#heading-building-a-n8n-workflow">Workflow</a> </li> <li><a class="post-section-overview" href="#heading-running-n8n-in-production-using-sevalla">Running</a> <a class="post-section-overview" href="#heading-how-to-get-started-with-n8n">n8n</a> <a class="post-section-overview" href="#heading-running-n8n-in-production-using-sevalla">in Production using Sevalla</a> </li> <li><a class="post-section-overview" href="#heading-where-n8n-becomes-powerful">Where n8n Becomes Powerful</a> </li> <li><a class="post-section-overview" href="#heading-ai-driven-automations">AI-Driven Automations</a> </li> <li><a class="post-section-overview" 
href="#heading-conclusion">Conclusion</a> </li> </ul> <h2 id="heading-what-n8n-does">What n8n Does</h2> n8n connects the apps and systems you already use. Each connection is called a node, and every node performs an action. You can combine multiple nodes into a workflow that runs automatically. For example, you could create a workflow where a new form submission in Typeform triggers a Slack message and stores the data in Google Sheets. You can then add logic to send an email only if certain conditions are met. This approach allows anyone to build automation visually, yet it stays developer-friendly. You can use JavaScript or Python inside the workflow for custom logic, import npm packages, or connect to any API that doesn’t have a prebuilt node yet. The platform supports over four hundred integrations out of the box, from GitHub and AWS to OpenAI and Telegram. This large library of ready-to-use nodes means you can connect most tools you use every day without needing to write any code at all. <h2 id="heading-n8n-is-open-source">n8n is Open Source</h2> The open source nature of n8n is what makes it stand out. Most automation tools like <a target="_blank" href="https://zapier.com/">Zapier</a> are closed systems that hide their inner workings. With n8n, the <a target="_blank" href="https://github.com/n8n-io/n8n">source code</a> is publicly available. You can host it on your own server, modify it, and inspect how everything works. This matters for both privacy and flexibility. When you self-host n8n, your data never leaves your environment. This is especially useful for industries like finance, healthcare, and security where sensitive data must stay private. Teams can build automations without sending information through third-party servers. Being open source also means you are never locked into one vendor. You can add your own nodes, extend the platform, or even contribute back to the community. 
The fair-code license ensures that while the project stays sustainable for the developers who maintain it, it remains accessible to anyone who wants to use or modify it. <h2 id="heading-how-to-get-started-with-n8n">How to Get Started with n8n</h2> Getting started with n8n takes only a few minutes. If you already have Node.js installed, you can launch it right from your terminal using the command: <pre><code class="lang-python">npx n8n </code></pre> This will start n8n locally and open the visual editor at <a target="_blank" href="http://localhost:5678/">http://localhost:5678</a>. You can also <a target="_blank" href="https://docs.n8n.io/hosting/installation/docker/">deploy n8n with Docker</a> using a few simple commands. Docker is often the easiest option if you want a persistent setup where your data and workflows are saved automatically. Once the editor is open, you’ll see an empty canvas where you can drag and drop nodes. For beginners, the best way to learn is by building small workflows. <h2 id="heading-building-a-n8n-workflow">Building a n8n Workflow</h2> Let’s build a simple n8n workflow. Step 1: After logging in, click on “Create Workflow” at the top. This will open a blank workspace. Give your workflow a name such as “RSS to Email”. You’ll be building a simple chain of steps, where one action leads to another. Step 2: Every workflow in n8n starts with a trigger, which decides when the workflow should run. In this example, we’ll use the Schedule Trigger so the workflow runs once a day. Click the plus icon to add a new node and search for “On a Schedule”. Select it and choose the option that says “Every Day”. You can set the exact time you want it to run, for example, every morning at 9am. This means that once your workflow is activated, n8n will automatically start it daily at that time. Step 3: Now that the workflow knows when to run, it needs to know what to do. The next step is to fetch the latest articles from a blog’s RSS feed. 
Click the plus icon again to add another node and search for “RSS Read”. In the URL field, type the link to a blog’s feed such as <a target="_blank" href="https://blog.cloudflare.com/rss/"><code>https://blog.cloudflare.com/rss/</code></a>. Click “Execute Node” to test it. You should now see a list of recent blog posts with their titles, descriptions, and links. This confirms that the feed is working correctly. Step 4: Sometimes you may not want all the items from the RSS feed. For instance, you might only want the top three posts. To do this, you can add a Function node between the RSS and email steps. In that node, enter a short JavaScript snippet like <code>return items.slice(0, 3);</code>. This will trim the list and only keep the first three results. You can also choose to skip this step if you want to send all the posts in the email. Step 5: The next step is to send the RSS feed items to your email inbox. Add another node and search for “Email”. You can use your preferred email service such as Gmail or Outlook, or configure it manually using SMTP settings. For Gmail, choose “Send an email”. For the settings, <a target="_blank" href="https://docs.n8n.io/integrations/builtin/credentials/google/oauth-single-service/#set-up-oauth">get your oauth keys</a> from Google. In the subject field, write something like “Daily Blog Updates”. In the message field, you can include the data from the RSS feed using expressions such as <code>{{ $json["title"] }} - {{ $json["link"] }}</code>. This will automatically replace those variables with the actual titles and links when the workflow runs. You can test the email by clicking “Execute Node” and checking your inbox. Step 6: Once you have added all three nodes, Schedule Trigger, RSS Feed Read, and Email, you need to connect them in that order. The arrows show the flow of data. Click “Execute Workflow” to test everything. If the setup is correct, you should receive an email with the latest blog posts. 
When you’re satisfied with the result, turn on the workflow by clicking the toggle switch in the top right corner. It will now run automatically every day without you having to open n8n again. As you get comfortable, you can start chaining multiple services together, add conditional logic, or include custom code nodes for specific cases. The live execution view helps you see how data moves between nodes in real time. <h2 id="heading-running-n8n-in-production-using-sevalla">Running n8n in Production using Sevalla</h2> When you are ready to move beyond testing, n8n gives you two main options. You can self-host it using your own infrastructure or use their managed cloud version at <a target="_blank" href="https://n8n.io/">n8n.io</a>. Self-hosting gives you full control and is usually preferred by technical teams who want to integrate with private APIs or keep sensitive data in-house. You can choose any cloud provider, like AWS, DigitalOcean, or others to set up N8N. But I will be using Sevalla. <a target="_blank" href="https://sevalla.com/">Sevalla</a> is a PaaS provider designed for developers and dev teams shipping features and updates constantly in the most efficient way. It offers application hosting, database, object storage, and static site hosting for your projects. I am using Sevalla for two reasons: <ul> <li>Every platform will charge you for creating a cloud resource. Sevalla comes with a $50 credit for us to use, so we won’t incur any costs for this example. </li> <li>Sevalla has a <a target="_blank" href="https://docs.sevalla.com/templates/overview">template for n8n</a>, so it simplifies the manual installation and setup for each resource you will need for installation. </li> </ul> <a target="_blank" href="https://app.sevalla.com/login">Log in</a> to Sevalla and click on Templates. You can see n8n as one of the templates. Click on the “N8N” template. You will see the resources needed to provision the application. Click on “Deploy Template”. 
You can see the resources being provisioned. Once they are ready, go to your n8n application and click on the current deployment. Wait for a few minutes. Once the deployment is complete, you will see a green checkmark. Click on “Visit app”. You will get a cloud URL such as <a target="_blank" href="https://n8n-9u6kc.sevalla.app/">https://n8n-9u6kc.sevalla.app/</a>. You now have a production-grade n8n server running in the cloud. You can use this to build your automations in your self-hosted cloud environment.

<h2 id="heading-where-n8n-becomes-powerful">Where n8n Becomes Powerful</h2>

Most users begin with simple automations. But n8n’s true power shows up when you start building complex, multi-step workflows. You can create sequences that involve APIs, data transformation, and logic-based decision making.

For example, a marketing team could build a system that monitors mentions on Twitter, classifies them with an AI model, adds potential leads to a CRM, and sends a Slack alert for high-priority mentions. A developer could build a workflow that triggers deployment pipelines automatically when code is merged into a branch.

Because n8n supports both no-code and full-code modes, you never outgrow it. As your automations become more advanced, you can still use the same platform to handle them.

<h2 id="heading-ai-driven-automations">AI-Driven Automations</h2>

n8n is also built for the era of AI. It comes with native support for connecting large language models and tools like <a target="_blank" href="https://www.langchain.com/">LangChain</a>. This means you can build AI workflows that use your own data and logic.

Imagine setting up a workflow that reads new support tickets, summarizes them with an AI model, and routes them to the right team. Or one that takes blog posts, generates summaries, and posts them automatically to your social channels. You can design these workflows visually while letting the AI handle the heavy lifting.
Because n8n allows you to control how and where AI models are called, it gives teams flexibility without sacrificing data security. You can use your own OpenAI key, run local models, or call third-party APIs in the same environment.

The real value of n8n lies in how it combines flexibility, transparency, and control. It doesn’t hide complexity from you but gives you tools to manage it better. You can start small with visual automation and grow into advanced logic and AI-driven workflows.

Because it’s open source, you never risk losing access to your automations. You can run it anywhere, connect it with anything, and inspect everything that happens under the hood. This level of freedom is rare among modern automation platforms.

For beginners, n8n is an opportunity to understand how automation works without needing to learn full-stack programming. For developers, it’s a scalable system that can power serious production workflows.

<h2 id="heading-conclusion">Conclusion</h2>

Automation is becoming an essential part of every technical process. The challenge is finding a tool that balances simplicity with power. n8n achieves that balance by being open source, extensible, and flexible enough for both no-code users and developers.

n8n is not just another automation app. It is a complete, open, and developer-friendly platform built to make automation accessible to everyone.

Hope you enjoyed this article. Find me on <a target="_blank" href="https://linkedin.com/in/manishmshiva">LinkedIn</a> or <a target="_blank" href="https://manishshivanandhan.com/">visit my website</a>. </article> </main></body></html>

How to Build an AI Agent That Runs its Own LLM Experiments with autoresearch

Table of Contents

Prerequisites

What is autoresearch?

Why This Matters

Exploring the Repo

What Exactly is val_bpb?

The 5 Minute Rule

1. prepare.py

2. train.py

The Hyperparameters

3. program.md

Setup Guide

Step 1: Install uv, the Python Project Manager the Repo Uses

Step 2: Run the Data Preparation

Step 3: Run a Manual Training Experiment

Step 4: Hand the Repo to an Agent

Tuning autoresearch for Smaller GPUs

What the Agent Actually Finds

Final Thoughts

How to Automate PDF Data Extraction Using Python

What We'll Cover:

Understanding PDF Structures

Setting Up the Python Environment

Extracting Text From PDFs

Extracting Tables From PDFs

Working With OCR for Scanned PDFs

Building End-to-End Automation Pipelines

Common Challenges in PDF Automation

Choosing the Right Python Libraries

The Future of PDF Automation

How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases

Prerequisites

Knowledge and Skills

AWS IAM Permissions Required

Companion GitHub Repository

Table of Contents

Use Case 1 - Cost Anomaly Detection

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script

How the Script Works

What the Output Looks Like

The Decision the Script Can't Make for You

Break it On Purpose

Use Case 2 – Log Correlation Across Services

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script

How the Script Works

What the Output Looks Like After Breaking it

The Decision the Script Can't Make For You

Break it On Purpose

Use Case 3 - Infrastructure Drift Detection

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script (Code Files)

How the Script Works

What the Output Looks Like

The Decision the Script Can't Make For You

Break it On Purpose

Use Case 4 - Secrets Rotation with Zero Downtime

The Production Problem

What's Actually Happening

What the /healthz/db Endpoint Does

Set Up the Demo Environment

The Script (Code Files)

How the Script Works

What the Output Looks Like

Break it On Purpose

Step 1: Desync the DB

Step 2: Check what Kubernetes sees

Step 3: Check what your users experience

Step 4: See the mixed pattern (optional)

Step 5: Run the rotation script

The Decision the Script Can't Make For You
