Chidozie Managwu - freeCodeCamp.org

How to Build an AI-Powered Research Automation System with n8n, Groq, and Academic APIs

Chidozie Managwu — Mon, 16 Mar 2026 18:17:27 +0000

As a researcher and developer, I found myself spending hours manually searching academic databases, reading abstracts, and trying to synthesize findings across multiple sources.

For my work on circular economy and battery recycling, I needed a way to query multiple databases at once without the manual fatigue.

In this tutorial, you'll build an automated research pipeline using n8n that reduces roughly six hours of manual literature review to a five-minute automated run.

This isn’t a “cool demo workflow.” It’s a production-minded pipeline with parallel collection, normalisation, deduplication, structured AI extraction, scoring, and practical error handling.

Prerequisites
The Problem: Research Takes Too Long
The Tech Stack
The Project Structure: How to Think About an n8n Workflow Like Software
Stage 1: Centralised Configuration
Stage 2: Parallel API Collection (With Failure Isolation)
Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)
Stage 4: AI-Powered Content Extraction (Strict JSON)
Stage 5: Scoring and Synthesis
Beginner-Friendly Evals: Retrieval and Extraction QA
Key Learnings and Error Handling
Conclusion

Prerequisites

You don’t need to be a DevOps engineer to follow this, but you should have:

Basic comfort with APIs and JSON (request/response payloads)
Familiarity with spreadsheets (Google Sheets basics)
Willingness to use a small amount of JavaScript inside n8n Function/Code nodes

Access to:

An n8n instance (self-hosted or cloud)
A Groq API key (or a compatible LLM provider)
Optional API keys, depending on the databases you use

What you’ll build assumes:

You’re extracting from metadata + abstracts (not downloading full PDFs).
You can accept that some sources will occasionally rate-limit or return partial results (and your workflow will be designed to survive this).

The Problem: Research Takes Too Long

Manual research is often a bottleneck for innovation. Before building this automation, my workflow involved searching multiple academic databases, scanning abstracts, and manually extracting key findings. This process was not only slow but also prone to human error and inconsistent note-taking.

The goal of this automation is to provide a “full-stack research assistant” that handles the heavy lifting of collecting candidate papers, removing duplicates, extracting consistent fields, scoring relevance and quality, and delivering a curated daily or weekly report, so you can spend your time on high-level synthesis rather than repetitive collection.

The Tech Stack

This workflow leverages a combination of automation tooling, high-speed LLM inference, and academic metadata providers.

Tool	Purpose
n8n	The workflow engine that orchestrates all steps
Groq	Runs a fast LLM (for example, Llama 3.3 70B) for structured extraction/synthesis
Semantic Scholar / OpenAlex	Broad academic coverage for metadata, abstracts, citations
arXiv / PubMed	Strong specialised coverage (preprints, life sciences)
Google Sheets	A lightweight “research database” for storage + history

Note: coverage varies by provider. Some APIs return abstracts reliably, while others may omit them. Your pipeline should treat missing abstracts as a normal case, not a failure.

The Project Structure: How to Think About an n8n Workflow Like Software

While n8n is a visual tool, it helps to design your workflow as modular stages to avoid the “spaghetti workflow” problem.

.
├── configuration/         # Keywords, thresholds, limits, date filters
├── collectors/            # Parallel HTTP request nodes (multiple sources)
├── processing/            # Normalization + deduplication code nodes
├── extraction/            # LLM extraction nodes (strict JSON)
├── scoring/               # Relevance + quality scoring + filtering
└── delivery/              # Google Sheets + email/HTML report

Design principle: each stage should produce a clean, predictable output shape that the next stage can rely on.

Stage 1: Centralised Configuration

Instead of hardcoding search parameters (keywords, min year, citation thresholds) across multiple nodes, use one configuration node to define workflow variables.

This matters for maintainability (change a value once, not in ten nodes), reusability (repurpose the entire pipeline by swapping one config object), and debuggability (log the config at the start of each run so you can reproduce results).

Use a Set node, or a Code node returning JSON like this:

{
  "keywords": "circular economy battery recycling remanufacturing",
  "min_year": 2020,
  "max_results_per_source": 10,
  "min_citations": 2,
  "relevance_threshold": 15,
  "batch_size": 10
}

Tip: keep numeric fields as numbers (not strings) to avoid scoring bugs later.
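One way to enforce the tip above is a small type check at the start of each run. The sketch below is written in Python for readability (n8n Code nodes are usually JavaScript, so treat it as logic to port). The helper name validate_config is hypothetical; the field names come from the config object above.

```python
# Hypothetical helper: checks that numeric config fields are real numbers, not strings.
NUMERIC_FIELDS = (
    "min_year",
    "max_results_per_source",
    "min_citations",
    "relevance_threshold",
    "batch_size",
)

def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks fine."""
    problems = []
    keywords = config.get("keywords")
    if not isinstance(keywords, str) or not keywords.strip():
        problems.append("keywords must be a non-empty string")
    for field in NUMERIC_FIELDS:
        value = config.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field} must be a number, got {value!r}")
    return problems

config = {
    "keywords": "circular economy battery recycling remanufacturing",
    "min_year": 2020,
    "max_results_per_source": 10,
    "min_citations": 2,
    "relevance_threshold": 15,
    "batch_size": 10,
}
bad_config = dict(config, min_citations="2")  # a string sneaked in

print(validate_config(config))
print(validate_config(bad_config))
```

Failing fast here is cheaper than discovering a string-versus-number comparison in the scoring stage.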

Stage 2: Parallel API Collection (With Failure Isolation)

Your workflow should query multiple sources simultaneously. In n8n, you can branch from your configuration node into multiple HTTP Request nodes, and then merge results later.

The production mindset here is simple: APIs fail. Rate limits happen. Providers return partial data. The key is to prevent one failing collector from crashing the whole run.

To implement this, on each HTTP Request node, enable Continue On Fail (or the equivalent “don’t stop workflow” behaviour). Then, in the normalisation stage, treat missing or failed outputs as empty arrays so downstream stages still run.

In practice, it also helps to set explicit timeouts and add a small retry policy (one to two retries) for transient failures. “Good” looks like this: if two out of five sources fail, you still produce a useful report from the remaining three, and you log which sources failed so you can investigate later.
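The failure-isolation idea can be sketched outside n8n as well. The Python below is an illustration, not the n8n implementation: collect_all and the stub collectors are hypothetical names, the sources run sequentially rather than in parallel, and the retry count and backoff are assumed values.

```python
import time
from typing import Callable, Dict, List, Tuple

def collect_all(
    collectors: Dict[str, Callable[[], List[dict]]],
    retries: int = 2,
    backoff_sec: float = 0.5,
) -> Tuple[List[dict], List[str]]:
    """Run every collector; a failing source contributes nothing instead of aborting the run."""
    papers: List[dict] = []
    failed_sources: List[str] = []
    for name, fetch in collectors.items():
        for attempt in range(retries + 1):
            try:
                papers.extend(fetch())
                break  # success: stop retrying this source
            except Exception:
                if attempt == retries:
                    failed_sources.append(name)  # record it and move on
                else:
                    time.sleep(backoff_sec * (attempt + 1))
    return papers, failed_sources

# Stub collectors standing in for real HTTP calls
def ok_source() -> List[dict]:
    return [{"title": "Paper A"}]

def broken_source() -> List[dict]:
    raise TimeoutError("simulated timeout")

papers, failed = collect_all(
    {"source_a": ok_source, "source_b": broken_source}, retries=1, backoff_sec=0
)
print(len(papers), failed)
```

The returned list of failed sources is what you would surface in the final report.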

Stage 3: Normalisation and Deduplication (DOI-first, Title fallback)

Each academic API returns different field names and shapes. One might use title, another display_name, another paper_title. Your next stage should normalise all inputs into one schema.

Target normalised schema

Here’s a simple baseline schema (expand later as needed):

{
  "title": "string",
  "abstract": "string|null",
  "doi": "string|null",
  "year": 2024,
  "citations": 12,
  "url": "string|null",
  "source": "Semantic Scholar|OpenAlex|arXiv|PubMed"
}

What deduping by DOI means (and what a DOI is)

A DOI (Digital Object Identifier) is a unique, persistent identifier assigned to many scholarly publications. If a paper has a DOI, that DOI functions like a stable ID: the same paper may appear in multiple databases with slightly different metadata, but the DOI should remain consistent.

So, deduping by DOI means: if two records share the same DOI, treat them as the same paper and keep only one.

When a DOI is missing (which is common for some preprints and some API responses), the fallback is to dedupe using a normalised title key: lowercased, trimmed, punctuation stripped, and whitespace collapsed. It’s less reliable than DOI-based matching, because two different papers can share a title and one paper can appear under slightly different titles, but it’s a strong pragmatic backup.

What “normalise into a unified object” means (what’s happening in the code)

“Normalise into a unified object” simply means converting every provider’s raw response into the same predictable shape (the schema above). Once everything looks the same, downstream steps, such as deduplication, scoring, AI extraction, and storage, become straightforward because they don’t need provider-specific logic.

In the code below, that’s what the normalized object is: it maps Semantic Scholar’s fields (paper.title, paper.externalIds.DOI, paper.citationCount) into your standard fields (title, doi, citations, etc.). After that, the workflow generates a dedupe key (doi:... if DOI exists, otherwise title:...) and uses a Set to keep only the first occurrence.

Example n8n Code Node (Normalisation + Dedupe Pattern)

const itemsIn = $input.all();

const seen = new Set();
const results = [];

function titleKey(t) {
  return (t || "")
    .toLowerCase()
    .replace(/[\W_]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

for (const item of itemsIn) {
  // Example: Semantic Scholar response shape
  const papers = item.json?.data || [];

  for (const paper of papers) {
    // "Normalize into a unified object":
    // take the provider-specific fields and map them into our standard schema.
    const normalized = {
      title: paper.title || null,
      abstract: paper.abstract || null,
      doi: paper.externalIds?.DOI || null,
      year: paper.year || null,
      citations: paper.citationCount || 0,
      url: paper.url || null,
      source: "Semantic Scholar",
    };

    if (!normalized.title) continue;

    // Dedupe key: DOI is strongest; title is fallback
    const key = normalized.doi
      ? `doi:${normalized.doi.toLowerCase()}`
      : `title:${titleKey(normalized.title)}`;

    if (seen.has(key)) continue;
    seen.add(key);

    results.push(normalized);
  }
}

return results.map(r => ({ json: r }));

Production-minded note: keep a field like source so you can debug where bad metadata is coming from later.

Stage 4: AI-Powered Content Extraction (Strict JSON)

Once you have a deduplicated list of papers, you can send each paper (or a small batch) to Groq for structured extraction.

Why structured output matters

If your LLM returns narrative text instead of JSON, misses fields, or emits malformed JSON, your workflow breaks downstream. In a production workflow, that’s not a rare edge case; it’s something you should expect and design around.

That’s why you’ll use strict schema prompting and validate responses downstream.

System prompt vs user prompt (and how to compose them)

A helpful way to think about prompts in production is:

The system prompt defines the non-negotiable contract: output format, allowed keys, no commentary, and what to do in uncertain cases. This is where you say “return ONLY valid JSON” and “no extra keys.”
The user prompt provides the variable data for this specific request: title, year, citations, abstract, and the exact schema you want filled.

Composing them this way keeps your workflow stable. The system prompt stays mostly constant (your formatting contract), while the user prompt changes per paper (your payload). It also makes debugging easier: if outputs start failing, you can adjust the system constraints without rewriting every payload template.

Suggested extraction schema

Extract only what you can support from abstract-level data:

research_question
methodology
key_findings
limitations
notes (for missing abstract / ambiguity)

Example prompt (system + user)

System:

You are a research extraction engine. You must return ONLY valid JSON.
No markdown. No extra keys. No commentary.
If the abstract is missing or too vague, set fields to null and include a reason in "notes".

User:

Extract structured fields from this paper.

TITLE: {{title}}
YEAR: {{year}}
CITATIONS: {{citations}}
ABSTRACT: {{abstract}}

Return JSON with keys:
research_question (string|null)
methodology (string|null)
key_findings (array of strings)
limitations (array of strings)
notes (string)

Model settings: keep temperature low (around 0.2–0.3) and keep responses short and structured.
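The system/user split described above can be sketched as a constant contract plus a per-paper template. This Python version is an illustration under assumptions: it targets an OpenAI-style chat payload, the function names build_messages and build_payload are hypothetical, and "placeholder-model" stands in for whatever model name your provider expects.

```python
SYSTEM_PROMPT = (
    "You are a research extraction engine. You must return ONLY valid JSON.\n"
    "No markdown. No extra keys. No commentary.\n"
    'If the abstract is missing or too vague, set fields to null and include a reason in "notes".'
)

USER_TEMPLATE = (
    "Extract structured fields from this paper.\n\n"
    "TITLE: {title}\nYEAR: {year}\nCITATIONS: {citations}\nABSTRACT: {abstract}\n\n"
    "Return JSON with keys:\n"
    "research_question (string|null)\n"
    "methodology (string|null)\n"
    "key_findings (array of strings)\n"
    "limitations (array of strings)\n"
    "notes (string)"
)

def build_messages(paper: dict) -> list[dict]:
    """The system prompt stays constant; only the user prompt changes per paper."""
    user_prompt = USER_TEMPLATE.format(
        title=paper.get("title") or "",
        year=paper.get("year") or "unknown",
        citations=paper.get("citations") or 0,
        abstract=paper.get("abstract") or "(missing)",
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

def build_payload(paper: dict, model: str, temperature: float = 0.2) -> dict:
    return {"model": model, "messages": build_messages(paper), "temperature": temperature}

payload = build_payload(
    {"title": "Battery recycling review", "year": 2024, "citations": 12, "abstract": None},
    model="placeholder-model",
)
print(payload["messages"][1]["content"])
```

Making a missing abstract explicit in the payload gives the model a clear cue to follow the "set fields to null" rule instead of guessing.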

Batch processing to avoid timeouts

Instead of sending 50 papers at once, process them in batches (for example, 10). This reduces latency spikes, failure blast radius, and cost surprises. Smaller batches also make it easier to retry only the failing chunk rather than re-running everything.
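In n8n you would normally let a batching node handle this, but the underlying logic is small enough to show directly. A minimal Python sketch (the helper name batched is hypothetical):

```python
from typing import Iterator, List

def batched(items: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]

papers = [{"id": i} for i in range(25)]
sizes = [len(batch) for batch in batched(papers, 10)]
print(sizes)  # [10, 10, 5]
```

If the second batch fails, only those ten papers need a retry.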

Stage 5: Scoring and Synthesis

Not every retrieved paper is worth your time. Without scoring, your pipeline becomes a firehose: you’ve automated collection, but you still have to manually decide what to read. Scoring is what turns “a big list of results” into a shortlist you can trust.

I recommend computing two signals:

Relevance: Is this actually about your research question?
Quality/priority: If it’s relevant, is it worth reading first?

For relevance, keep it simple and explainable. Count keyword hits in the title and abstract (and optionally in extracted key_findings). Title matches should be weighted higher because titles are deliberately compact summaries. Abstract hits are useful too, but cap them so long abstracts don’t dominate the score.

For quality/priority, use lightweight metadata you already have. Recency is a strong signal in fast-moving areas, and citations can help, but they should be treated as a weak signal (and capped) so newer high-value papers aren’t unfairly penalised.

A solid first scoring model is: add a title bonus, add a capped abstract bonus, add a capped citations bonus, and add a small recency bonus for papers from the last two years. Then filter using the relevance_threshold value from Stage 1. The advantage of this approach is that it’s easy to debug and tune: you can always explain why a paper passed or failed.
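Here is one way that scoring model could look, written in Python as a sketch to port into a Code node. Every weight and cap is an assumption chosen for illustration, and score_paper is a hypothetical name. The caps are set so that citations plus recency (at most 12) can never reach the threshold of 15 on their own.

```python
from typing import Dict, List

# All weights and caps are assumptions for illustration; tune them for your field.
TITLE_WEIGHT = 10     # per keyword found in the title
ABSTRACT_WEIGHT = 3   # per keyword occurrence in the abstract
ABSTRACT_CAP = 15     # long abstracts cannot dominate
CITATION_CAP = 8      # citations are a weak, capped signal
RECENCY_BONUS = 4     # small bonus for recent papers
RECENT_YEARS = 2

def score_paper(paper: Dict, keywords: List[str], current_year: int) -> Dict:
    title = (paper.get("title") or "").lower()
    abstract = (paper.get("abstract") or "").lower()
    terms = [kw.lower() for kw in keywords]
    year = paper.get("year")

    breakdown = {
        "title": TITLE_WEIGHT * sum(1 for kw in terms if kw in title),
        "abstract": min(ABSTRACT_CAP, ABSTRACT_WEIGHT * sum(abstract.count(kw) for kw in terms)),
        "citations": min(CITATION_CAP, int(paper.get("citations") or 0)),
        "recency": RECENCY_BONUS if year and current_year - year <= RECENT_YEARS else 0,
    }
    # Keeping the breakdown makes every pass/fail decision explainable.
    return {"score": sum(breakdown.values()), "breakdown": breakdown}

keywords = "circular economy battery recycling remanufacturing".split()
on_topic = {
    "title": "Battery recycling in a circular economy",
    "abstract": "We review recycling routes for lithium-ion battery packs.",
    "year": 2024,
    "citations": 12,
}
off_topic = {"title": "Deep learning for image segmentation", "abstract": None, "year": 2015, "citations": 3}

relevance_threshold = 15
scored = [score_paper(p, keywords, current_year=2025) for p in (on_topic, off_topic)]
shortlist = [s for s in scored if s["score"] >= relevance_threshold]
print([s["score"] for s in scored], len(shortlist))
```

Passing current_year in explicitly keeps the function deterministic, which makes it easier to test.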

Once you’ve filtered down to your “gold” set, synthesis becomes safer and more useful. Write one row per accepted paper to Google Sheets, then generate a daily/weekly HTML summary (for example, top 5 papers with 1–2 key findings each) and include links so you can verify quickly.

Beginner-Friendly Evals: Retrieval and Extraction QA

AI workflows regress silently. A prompt tweak, a model update, or an API schema change can break extraction without throwing an obvious error. Adding lightweight evals is the difference between “it worked last week” and “it’s reliable.”

The goal here isn’t to build a full evaluation framework. It’s to add small, cheap checks that catch the most common failure modes:

Are collectors still returning results?
Are we actually removing duplicates?
Is the LLM returning valid JSON with the keys we require?

What it looks like in n8n (a concrete example)

A simple implementation is to add an “Assertions” Code node immediately after your extraction step, plus (optionally) another one after normalisation/deduplication.

At a high level, the workflow section looks like:

Collectors (parallel HTTP Request nodes)
Merge results
Normalise + dedupe (Code node)
Split in Batches (optional)
LLM extraction (Groq/OpenAI-compatible node)
Assertions (Code node)
If node (pass/fail)
Delivery (Sheets + email)

Example: Assertions code node after extraction

This code node assumes each item is a paper with:

title, abstract in the normalised fields, and
an extraction field (or whatever you name it) containing the LLM response as an object or JSON string.

Adapt the field name to match your actual node output, but the pattern is the same: parse, validate required keys, compute percentages, then decide whether to fail or warn.

const items = $input.all();

let total = items.length;
let withTitle = 0;
let withAbstract = 0;

let parseOk = 0;
let schemaOk = 0;

const requiredKeys = [
  "research_question",
  "methodology",
  "key_findings",
  "limitations",
  "notes",
];

const failures = [];

for (let i = 0; i < items.length; i++) {
  const p = items[i].json;

  if (p.title && String(p.title).trim().length > 0) withTitle++;
  if (p.abstract && String(p.abstract).trim().length > 0) withAbstract++;

  // Adjust this depending on where you store the model output:
  const raw = p.extraction ?? p.llm ?? p.model_output;

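  let obj = null;
  try {
    obj = typeof raw === "string" ? JSON.parse(raw) : raw;
  } catch (e) {
    obj = null;
  }
  // A missing output or a non-object value (null, number, array) is also a parse failure;
  // without this check, the key validation below would throw on undefined.
  if (obj === null || typeof obj !== "object" || Array.isArray(obj)) {
    failures.push({ index: i, title: p.title || null, reason: "JSON parse failed" });
    continue;
  }
  parseOk++;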

  const hasAllKeys = requiredKeys.every(k => Object.prototype.hasOwnProperty.call(obj, k));
  if (!hasAllKeys) {
    failures.push({ index: i, title: p.title || null, reason: "Missing required keys" });
    continue;
  }

  // Optional: ensure arrays are arrays
  const arraysOk = Array.isArray(obj.key_findings) && Array.isArray(obj.limitations);
  if (!arraysOk) {
    failures.push({ index: i, title: p.title || null, reason: "key_findings/limitations not arrays" });
    continue;
  }

  schemaOk++;
}

const pct = (n) => (total === 0 ? 0 : Math.round((n / total) * 100));

const report = {
  total_papers: total,
  pct_with_title: pct(withTitle),
  pct_with_abstract: pct(withAbstract),
  pct_extraction_json_parse_ok: pct(parseOk),
  pct_extraction_schema_ok: pct(schemaOk),
  failures_sample: failures.slice(0, 5),
};

// Decide pass/fail thresholds
const HARD_FAIL_PARSE_BELOW = 90;
const HARD_FAIL_SCHEMA_BELOW = 85;

const shouldFail =
  report.pct_extraction_json_parse_ok < HARD_FAIL_PARSE_BELOW ||
  report.pct_extraction_schema_ok < HARD_FAIL_SCHEMA_BELOW;

return [
  {
    json: {
      eval_report: report,
      shouldFail,
    },
  },
];

Then add an If node:

If shouldFail is true, then route to an “Alert/Stop” branch (Slack/email/log) and optionally stop the workflow.
If false, then continue to the delivery stage.

This is the automation equivalent of unit tests: small, cheap, and extremely effective. It also gives you a concrete paper trail when something changes upstream.

Key Learnings and Error Handling

Building this automation taught me that the best workflows are designed for failure.

First, error resilience is not optional. Never let one failing API crash the workflow. Use “Continue On Fail” on your HTTP nodes, merge partial results, and log which sources failed in your final report so you can debug without losing an entire run.

Second, batching is your friend. Process papers in batches (often 5–15) to reduce timeouts and cost spikes. Keep LLM payloads small and focused on what you actually need (metadata + abstract), and retry transient failures once rather than repeatedly hammering the model or API.

Third, structured prompting is what makes AI reliable in automation. A strict JSON schema is the difference between a workflow that runs unattended and one that breaks randomly. Keep temperature low, enforce the schema in the system prompt, and validate everything downstream with simple parse-and-assert checks.

Conclusion

A good research pipeline doesn’t just retrieve papers – it turns scattered results into a consistent, deduplicated, scored, and review-ready shortlist you can trust.

By treating your n8n workflow like software (modular stages, strict contracts between steps, and lightweight eval checks), you can reduce hours of manual literature review to a fast, repeatable process that survives real-world API failures and model quirks.

If you build this with good defaults (failure isolation, batching, normalisation, strict JSON extraction, and simple scoring), you end up with something you can run daily or weekly and actually rely on without the manual fatigue.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead the AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognised with the Global Tech Hero award and featured on platforms like HackerNoon.

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Chidozie Managwu — Mon, 16 Mar 2026 17:43:51 +0000

Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.

They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions

We won’t implement advanced upgrades (such as rerankers, semantic chunking, auth, or background jobs) beyond a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

Project Structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install Dependencies

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment

Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:

“Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP (an exact inner-product index) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product behaves like cosine similarity, giving us a stable score we can use for a retrieval gate.
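To see why normalisation turns the inner product into cosine similarity, here is a small worked example in plain Python (no FAISS, so it runs anywhere). The helper names are hypothetical and the two vectors are toy values.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what L2 normalisation does to each embedding)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0]
b = [4.0, 3.0]

ip_raw = inner_product(a, b)                                    # depends on vector length
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))  # length removed
cosine = cosine_similarity(a, b)

print(ip_raw, round(ip_normalized, 6), round(cosine, 6))
```

The raw inner product is 24, while the normalised inner product and the cosine similarity both come out at 0.96. Because the score is bounded between -1 and 1 after normalisation, a fixed threshold stays meaningful across queries.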

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. By using an overlap, for example, 200 characters, we ensure that the end of one chunk and the beginning of the next share context.
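The arithmetic is easier to see with offsets. This short Python sketch (chunk_spans is a hypothetical helper) lists the start and end positions produced by a size of 1,000 and an overlap of 200 on a 2,500-character document:

```python
def chunk_spans(text_length: int, size: int = 1000, overlap: int = 200):
    """Return (start, end) character offsets for each chunk."""
    step = max(1, size - overlap)  # each chunk starts 800 characters after the previous one
    return [(start, min(start + size, text_length)) for start in range(0, text_length, step)]

spans = chunk_spans(2500)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

Each chunk shares its first 200 characters with the end of the previous one. Note that with this simple stepping, the final span (2400 to 2500) lies entirely inside the one before it, so you may want to drop very short trailing chunks.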

Implementation of `rag.py`

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

Implementation of `llm.py`

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.

The solution is the retrieval gate:

Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (next section).
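A simple way to tune the threshold is to sweep a few candidate values over labelled examples and see how often the gate makes the right call. The sketch below is in Python; the function names are hypothetical and the (top_score, should_answer) pairs are made-up values for illustration only, not measurements.

```python
# Each pair is (top_score, should_answer). These values are invented for illustration.
labelled = [
    (0.62, True), (0.48, True), (0.35, True), (0.28, True),
    (0.31, False), (0.22, False), (0.15, False), (0.08, False),
]

def gate_accuracy(pairs, threshold):
    """Fraction of examples where 'score >= threshold' matches the desired decision."""
    correct = sum(1 for score, should_answer in pairs if (score >= threshold) == should_answer)
    return correct / len(pairs)

def sweep(pairs, thresholds):
    return {t: gate_accuracy(pairs, t) for t in thresholds}

results = sweep(labelled, [0.20, 0.25, 0.30, 0.35, 0.40])
for threshold, accuracy in results.items():
    print(f"threshold={threshold:.2f} accuracy={accuracy:.3f}")
```

Plain accuracy treats a wrong refusal and a wrong answer as equally bad. If answering without grounding is the costlier mistake in your app, you would likely prefer the higher of two similarly accurate thresholds.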

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of `app.py`

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

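# Declared with plain `def`: retrieve() and call_llm() are blocking calls, so FastAPI
# runs this handler in its threadpool instead of stalling the event loop.
@app.post("/answer")
def get_answer(req: QueryRequest):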
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralised Prompt Template: prompts.py

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

Example `prompts.py`

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""
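A prompt contract like this is easier to trust if you can check a raw model reply against it. The following is a minimal sketch of such a check, assuming the reply arrives as a JSON string; `contract_violations` is a hypothetical helper and is not part of the tutorial's `llm.py`.

```python
import json
from typing import List

EXPECTED_KEYS = {"answer", "refusal", "confidence", "sources"}
CONFIDENCE_LEVELS = {"low", "medium", "high"}


def contract_violations(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the reply matches the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(data) != EXPECTED_KEYS:
        # Symmetric difference lists both missing and extra keys
        problems.append(f"keys differ: {sorted(set(data) ^ EXPECTED_KEYS)}")
    if not isinstance(data.get("answer"), str):
        problems.append("answer is not a string")
    if not isinstance(data.get("refusal"), bool):
        problems.append("refusal is not a boolean")
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in CONFIDENCE_LEVELS:
        problems.append("confidence is not low/medium/high")
    sources = data.get("sources")
    if not (isinstance(sources, list) and all(isinstance(s, str) for s in sources)):
        problems.append("sources is not a list of strings")
    return problems


good = '{"answer": "A gate refuses weak retrievals.", "refusal": false, "confidence": "high", "sources": ["guide.md"]}'
bad = '{"answer": "Paris", "refusal": "no", "extra": 1}'

print(contract_violations(good))
print(contract_violations(bad))
```

Running a check like this over raw replies while you iterate on the prompt shows how often the model drifts from the schema.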

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic, which makes testing harder than in traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” that you run repeatedly to detect regressions.

Instead of “does it output exactly this string,” you test:

Does the app refuse when retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create `evals/eval_set.json`

This should contain both positive and negative cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create `evals/run_evals.py`

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
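        try:
            resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
            resp.raise_for_status()
            out = resp.json()
        except (requests.RequestException, ValueError) as exc:
            # Count transport, HTTP, and JSON errors as a failed case instead of aborting the run
            failed += 1
            print(f"FAIL {case['id']} request_error={exc}")
            continue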

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.
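The same labelled cases can also help you pick the retrieval gate threshold. The sketch below assumes you have recorded the `top_score` for each eval question (the `/answer` response includes it under `retrieval`, except for empty input). The scores and the extra case IDs are hand-written for illustration, not measurements from a real index, and `gate_accuracy` and `sweep` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def gate_accuracy(cases: List[Dict], threshold: float) -> float:
    """Fraction of cases where the gate's refuse/answer decision matches the label."""
    correct = sum(
        1 for c in cases if (c["top_score"] < threshold) == c["expect_refusal"]
    )
    return correct / len(cases)


def sweep(cases: List[Dict], thresholds: List[float]) -> List[Tuple[float, float]]:
    """Evaluate each candidate threshold against the labelled cases."""
    return [(t, gate_accuracy(cases, t)) for t in thresholds]


# Illustrative, hand-written scores (not measured results)
cases = [
    {"id": "in_scope_01", "top_score": 0.58, "expect_refusal": False},
    {"id": "in_scope_02", "top_score": 0.41, "expect_refusal": False},
    {"id": "out_of_scope_01", "top_score": 0.22, "expect_refusal": True},
    {"id": "out_of_scope_02", "top_score": 0.33, "expect_refusal": True},
]

for threshold, accuracy in sweep(cases, [0.20, 0.30, 0.40, 0.50]):
    print(f"threshold={threshold:.2f} accuracy={accuracy:.2f}")
```

With only a handful of cases the result is noisy, so treat the best-scoring threshold as a candidate to re-check as the eval set grows.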

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.
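As an example of the metadata filtering upgrade, here is a rough sketch of a post-retrieval filter. It assumes each retrieved result carries a `metadata` dict with `category` and `year` fields, which the tutorial's results do not include; both fields and the `filter_by_metadata` helper are hypothetical.

```python
from typing import Dict, List, Optional


def filter_by_metadata(
    results: List[Dict],
    category: Optional[str] = None,
    min_year: Optional[int] = None,
) -> List[Dict]:
    """Keep only retrieved chunks whose metadata matches the given filters."""
    kept = []
    for r in results:
        meta = r.get("metadata", {})
        if category is not None and meta.get("category") != category:
            continue
        if min_year is not None and meta.get("year", 0) < min_year:
            continue
        kept.append(r)
    return kept


# Hand-written example results with hypothetical metadata
results = [
    {"source": "hr_policy.md", "text": "...", "score": 0.61,
     "metadata": {"category": "hr", "year": 2024}},
    {"source": "old_policy.md", "text": "...", "score": 0.59,
     "metadata": {"category": "hr", "year": 2019}},
    {"source": "eng_guide.md", "text": "...", "score": 0.57,
     "metadata": {"category": "engineering", "year": 2025}},
]

filtered = filter_by_metadata(results, category="hr", min_year=2022)
print([r["source"] for r in filtered])
```

Filtering after retrieval can leave fewer than k results, so you may want to retrieve a larger k first and trim after filtering.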

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.
