<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ observability - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ observability - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 30 May 2026 11:14:29 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/observability/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Trace Multi-Agent AI Swarms with Jaeger v2 ]]>
                </title>
                <description>
                    <![CDATA[ When you run a single AI agent, debugging is straightforward. You read the log, you see what happened. When you run five agents in a swarm, each spawning its own tool calls and producing its own outpu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/multi-agent-ai-swarms-tracing/</link>
                <guid isPermaLink="false">69eaae45904b915438cefb47</guid>
                
                    <category>
                        <![CDATA[ jaeger ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed tracing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ multi-agent systems ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 23:41:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/308710e6-cfe6-4007-887a-c49a5e2e6b9a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When you run a single AI agent, debugging is straightforward. You read the log, you see what happened.</p>
<p>When you run five agents in a swarm, each spawning its own tool calls and producing its own output, "read the log" stops being a strategy.</p>
<p>I built <a href="https://github.com/HatmanStack/claude-forge">Claude Forge</a> as an adversarial multi-agent coding framework on top of Claude Code. A typical run spawns a planner, an implementer, a reviewer, and a fixer. They evaluate each other's work and loop back when quality checks fail.</p>
<p>When something went wrong, I had timestamps and text dumps, but no way to see which agent was responsible, how long each step actually took, or where the tokens went.</p>
<p>Jaeger fixed that. This article covers setting up Jaeger v2 with Docker, wiring it into a multi-agent system through OpenTelemetry, and what I learned along the way.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-distributed-tracing">What Is Distributed Tracing?</a></p>
</li>
<li><p><a href="#heading-why-jaeger-v2">Why Jaeger v2?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-installing-docker-on-debian">Installing Docker on Debian</a></p>
</li>
<li><p><a href="#heading-setting-up-jaeger-v2">Setting Up Jaeger v2</a></p>
</li>
<li><p><a href="#heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</a></p>
</li>
<li><p><a href="#heading-understanding-the-span-model">Understanding the Span Model</a></p>
</li>
<li><p><a href="#heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</a></p>
</li>
<li><p><a href="#heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</a></p>
</li>
<li><p><a href="#heading-lessons-from-the-trenches">Lessons from the Trenches</a></p>
</li>
<li><p><a href="#heading-environment-variable-reference">Environment Variable Reference</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-distributed-tracing">What Is Distributed Tracing?</h2>
<p>Distributed tracing tracks a single operation as it moves through multiple services. A span is one unit of work with a start time, end time, and key-value attributes. Spans nest into parent-child trees. One tree per operation is one trace.</p>
<p>Microservices people already know this pattern: follow an HTTP request from the gateway through auth, the database, and the cache. Same idea works for multi-agent AI. Follow one swarm invocation from the orchestrator through each subagent and its tool calls.</p>
<p>OpenTelemetry (OTel) is the standard. It gives you SDKs for creating spans and shipping them over OTLP. Jaeger receives that data and renders it as a searchable timeline.</p>
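<p>To make the vocabulary concrete, here is a minimal sketch of a span tree in plain Python. The <code>Span</code> class and its field names are hypothetical stand-ins for illustration, not the OpenTelemetry API. The sketch only shows how spans with start and end times nest into one tree, and how that tree forms one trace.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry class)."""
    name: str
    start_ns: int
    end_ns: int
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1_000_000

    def child(self, name: str, start_ns: int, end_ns: int, **attributes: object) -> "Span":
        """Create a span nested under this one and return it."""
        span = Span(name, start_ns, end_ns, dict(attributes))
        self.children.append(span)
        return span

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs depth-first, parents before children."""
        yield depth, self
        for c in self.children:
            yield from c.walk(depth + 1)


# One root span plus all of its descendants is one trace.
root = Span("swarm-run", 0, 65_000_000_000)
planner = root.child("subagent:planner", 0, 12_000_000_000, role="planner")
planner.child("tool:Write", 1_000_000_000, 1_500_000_000, path="Phase-0.md")
root.child("subagent:implementer", 12_000_000_000, 57_000_000_000, role="implementer")

for depth, span in root.walk():
    print("  " * depth + f"{span.name} ({span.duration_ms:.0f} ms)")
```

<p>A real tracer also assigns trace and span IDs and propagates context between processes. Those parts are left out here.</p>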
<h2 id="heading-why-jaeger-v2">Why Jaeger v2?</h2>
<p>Jaeger started at Uber and graduated as a CNCF project in 2019. v1 hit end of life in December 2025. v2 is the current release, built on the OpenTelemetry Collector framework. It ships as a single binary that bundles the collector, query service, and UI, and it speaks OTLP natively on ports 4317 (gRPC) and 4318 (HTTP). For local work, you don't need a separate collector.</p>
<p>One important difference from v1: configuration moved from CLI flags and environment variables to a YAML file. The old <code>-e SPAN_STORAGE_TYPE=badger</code> env vars are silently ignored in v2. The container starts fine but falls back to in-memory storage. I lost two days of traces before noticing. More on the correct setup below.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><strong>Docker</strong> installed and running.</p>
</li>
<li><p><strong>Claude Code</strong> installed.</p>
</li>
<li><p><strong>Python 3.8+</strong> for the tracing hook.</p>
</li>
<li><p><strong>Claude Forge</strong> or another multi-agent system to instrument.</p>
</li>
</ul>
<h2 id="heading-installing-docker-on-debian">Installing Docker on Debian</h2>
<p>Skip this if you already have Docker. macOS and Windows users can use Docker Desktop. On Debian:</p>
<pre><code class="language-bash">sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/debian \
  $(. /etc/os-release &amp;&amp; echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
</code></pre>
<p>Ubuntu users: replace both <code>linux/debian</code> URLs with <code>linux/ubuntu</code>.</p>
<h2 id="heading-setting-up-jaeger-v2">Setting Up Jaeger v2</h2>
<h3 id="heading-basic-run">Basic Run</h3>
<p>For quick testing with no persistence:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0
</code></pre>
<p>Port 16686 is the UI. Port 4317 is OTLP/gRPC ingestion. Port 4318 is OTLP/HTTP. Without a config file, traces live in memory, so restarting or removing the container wipes them.</p>
<h3 id="heading-persistent-storage-with-badger">Persistent Storage with Badger</h3>
<p>v2 reads configuration from a YAML file, not environment variables. Save this as <code>~/.local/share/jaeger/config.yaml</code>:</p>
<pre><code class="language-yaml">service:
  extensions: [jaeger_storage, jaeger_query, healthcheckv2]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
extensions:
  healthcheckv2:
    use_v2: true
    http: { endpoint: 0.0.0.0:13133 }
  jaeger_query:
    storage: { traces: main_store }
  jaeger_storage:
    backends:
      main_store:
        badger:
          directories: { keys: /badger/key, values: /badger/data }
          ephemeral: false
          ttl: { spans: 720h }
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
exporters:
  jaeger_storage_exporter:
    trace_storage: main_store
</code></pre>
<p>The Jaeger container runs as UID 10001. Docker named volumes default to root ownership. Without fixing permissions first, the container crash-loops with <code>mkdir /badger/key: permission denied</code>.</p>
<p>Pre-create the volume and fix ownership:</p>
<pre><code class="language-bash">docker volume create jaeger-data

docker run --rm \
  -v jaeger-data:/badger \
  alpine sh -c "mkdir -p /badger/data /badger/key &amp;&amp; chown -R 10001:10001 /badger"
</code></pre>
<p>Then run Jaeger with the config mounted in:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  --restart unless-stopped \
  -v ~/.local/share/jaeger/config.yaml:/etc/jaeger/config.yaml:ro \
  -v jaeger-data:/badger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0 \
  --config /etc/jaeger/config.yaml
</code></pre>
<p>Open <code>http://localhost:16686</code> and you should see the UI. To verify persistence, record a trace, run <code>docker restart jaeger</code>, and confirm the trace is still there.</p>
<h2 id="heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</h2>
<h3 id="heading-installing-claude-forge">Installing Claude Forge</h3>
<p>Install it through the Claude Code plugin marketplace:</p>
<pre><code class="language-bash">/plugin marketplace add hatmanstack/claude-forge
/plugin install forge@claude-forge
/reload-plugins
</code></pre>
<p>The install opens a TUI to confirm scope and settings. After reload, commands use the <code>forge:</code> prefix (for example, <code>/forge:pipeline</code>).</p>
<p>You can also clone the repo from <a href="https://github.com/HatmanStack/claude-forge">GitHub</a>.</p>
<h3 id="heading-installing-the-tracing-hook">Installing the Tracing Hook</h3>
<p>From your target project directory, run the install script. For plugin installs:</p>
<pre><code class="language-bash">cd your-project
forge-trace                # if you set up the alias from the README
# or, without the alias:
bash "$(find ~/.claude -path '*/forge*' -name install-tracing.sh 2&gt;/dev/null | head -1)"
</code></pre>
<p>For clone installs:</p>
<pre><code class="language-bash">cd your-project
bash /path/to/claude-forge/bin/install-tracing.sh
</code></pre>
<p>The script builds a dedicated venv at <code>~/.local/share/claude-forge/venv</code> (prefers <code>uv</code>, falls back to <code>python3 -m venv</code>), installs the OpenTelemetry packages, copies the hook into place, merges hook entries into <code>.claude/settings.local.json</code>, and self-tests against the OTLP endpoint.</p>
<p>Pass <code>--no-settings</code> to skip the settings merge, or <code>--uninstall</code> to tear everything down.</p>
<h3 id="heading-opting-in">Opting In</h3>
<p>Add to your shell init and restart your terminal:</p>
<pre><code class="language-bash">export CLAUDE_FORGE_TRACING=1
</code></pre>
<p>Restart Claude Code, run the pipeline command (<code>/forge:pipeline</code> on plugin installs), then check <code>http://localhost:16686</code> for the <code>claude-forge</code> service.</p>
<h2 id="heading-understanding-the-span-model">Understanding the Span Model</h2>
<p>Here's what the hierarchy looks like for a typical swarm run:</p>
<pre><code class="language-plaintext">session: "implement login form with OAuth"        &lt;- root span
├── subagent:planner
│   ├── tool:Write  (Phase-0.md)                  &lt;- mutation spans (on by default)
│   ├── tool:Write  (Phase-1.md)
│   └── subagent_result:planner                   &lt;- duration, token counts, output
├── subagent:implementer
│   ├── tool:Edit   (src/auth.ts)
│   ├── tool:Bash   (npm test)
│   ├── tool:Write  (src/oauth.ts)
│   └── subagent_result:implementer
├── subagent:reviewer
│   └── subagent_result:reviewer
└── session_complete                              &lt;- session totals
</code></pre>
<p>The root span's name comes from the first line of your prompt. Find traces by what you asked for, not by a UUID.</p>
<p>Subagents get an anchor span on start and a result span on completion. The result carries duration, token counts, prompt, and output.</p>
<h3 id="heading-three-tiers-of-detail">Three Tiers of Detail</h3>
<p>Not all inner tool calls are equally interesting. Write, Edit, MultiEdit, and Bash are mutational: small in number, high signal. They tell you what actually changed. Read, Glob, Grep, and WebFetch are navigation: lots of them, mostly noise.</p>
<p>Tracing captures mutations by default. That middle ground turned out to be the right one. Before this change, you either saw nothing inside subagents or you saw 200+ spans per run.</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Subagents</th>
<th>Mutations (Write/Edit/Bash)</th>
<th>Other inner tools</th>
</tr>
</thead>
<tbody><tr>
<td>Default</td>
<td>yes</td>
<td>yes</td>
<td>no</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>yes</td>
<td>yes</td>
<td>yes (minus blocklist)</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>yes</td>
<td>no</td>
<td>no (unless <code>CLAUDE_FORGE_TRACE_INNER=1</code>)</td>
</tr>
</tbody></table>
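<p>The table can be read as a small decision function. The sketch below is one interpretation of it, not the hook's actual code: the function name, parameters, and the choice to let the mutations flag alone govern mutation tools are assumptions. The default blocklist is copied from the environment variable reference later in this article.</p>

```python
from typing import FrozenSet

# Tools named in the text as "mutational".
MUTATION_TOOLS = frozenset({"Write", "Edit", "MultiEdit", "Bash"})
# Default blocklist as listed in the environment variable reference.
DEFAULT_BLOCKLIST = frozenset({"Read", "Glob", "Grep", "TodoWrite", "NotebookRead"})


def should_trace_tool(
    tool_name: str,
    trace_mutations: bool = True,   # CLAUDE_FORGE_TRACE_MUTATIONS (on by default)
    trace_inner: bool = False,      # CLAUDE_FORGE_TRACE_INNER (off by default)
    blocklist: FrozenSet[str] = DEFAULT_BLOCKLIST,
) -> bool:
    """Hypothetical helper: decide whether an inner tool call gets a span."""
    if tool_name in MUTATION_TOOLS:
        # Assumption: mutation spans follow the mutations flag only.
        return trace_mutations
    # Everything else is traced only with inner tracing on, minus the blocklist.
    return trace_inner and tool_name not in blocklist


print(should_trace_tool("Write"))                       # default mode
print(should_trace_tool("WebFetch"))                    # default mode
print(should_trace_tool("WebFetch", trace_inner=True))  # inner tracing on
print(should_trace_tool("Read", trace_inner=True))      # blocklisted
```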
<h3 id="heading-span-attributes">Span Attributes</h3>
<p><strong>On</strong> <code>session_complete</code><strong>:</strong> <code>session.tokens.input</code>, <code>session.tokens.output</code>, <code>session.tokens.total</code>, <code>session.tokens.turns</code>, <code>session.duration_ms</code>, <code>user.prompt</code> (first 2KB).</p>
<p><strong>On</strong> <code>subagent_result</code><strong>:</strong> <code>agent.description</code>, <code>agent.prompt</code>, <code>agent.output</code>, <code>agent.duration_ms</code>, <code>agent.is_error</code>, <code>agent.tokens.input</code>, <code>agent.tokens.output</code>.</p>
<p><strong>On</strong> <code>tool:*</code><strong>:</strong> <code>tool.name</code>, <code>tool.input</code>, <code>tool.output</code>, <code>tool.duration_ms</code>, <code>tool.is_error</code>.</p>
<h2 id="heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</h2>
<h3 id="heading-hook-architecture">Hook Architecture</h3>
<p>Claude Code has lifecycle hooks that fire scripts on specific events. Four matter here:</p>
<ol>
<li><p><strong>UserPromptSubmit</strong> (create the root span),</p>
</li>
<li><p><strong>PreToolUse</strong> (start a span),</p>
</li>
<li><p><strong>PostToolUse</strong> (end it with results), and</p>
</li>
<li><p><strong>Stop</strong> (finalize the trace).</p>
</li>
</ol>
<p>Each hook gets a JSON payload on stdin and runs as a subprocess.</p>
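<p>As a rough sketch of that shape, a hook script can parse the payload and dispatch on the event name. The handler functions below are placeholders that return a label instead of creating spans, and the payload keys <code>hook_event_name</code> and <code>tool_name</code> are assumptions about the payload format. Check them against the payloads your Claude Code version actually sends.</p>

```python
import io
import json
import sys
from typing import Callable, Dict, Optional, TextIO


# Placeholder handlers: a real hook would create or end spans here.
def on_prompt(payload: dict) -> str:
    return "create root span"


def on_pre_tool(payload: dict) -> str:
    return f"start span for {payload.get('tool_name', 'unknown')}"


def on_post_tool(payload: dict) -> str:
    return f"end span for {payload.get('tool_name', 'unknown')}"


def on_stop(payload: dict) -> str:
    return "finalize trace"


HANDLERS: Dict[str, Callable[[dict], str]] = {
    "UserPromptSubmit": on_prompt,
    "PreToolUse": on_pre_tool,
    "PostToolUse": on_post_tool,
    "Stop": on_stop,
}


def handle_hook(stream: Optional[TextIO] = None) -> Optional[str]:
    """Read one JSON payload and dispatch it; return None for anything unusable."""
    stream = sys.stdin if stream is None else stream
    try:
        payload = json.load(stream)
    except json.JSONDecodeError:
        return None  # malformed input must never break the agent run
    if not isinstance(payload, dict):
        return None
    handler = HANDLERS.get(payload.get("hook_event_name", ""))
    return handler(payload) if handler else None


# Simulate one invocation by feeding a payload through a file-like object.
sample = io.StringIO(json.dumps({"hook_event_name": "PreToolUse", "tool_name": "Bash"}))
print(handle_hook(sample))
```

<p>Because every event is its own process, nothing in this function can rely on state left behind by an earlier call.</p>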
<h3 id="heading-sending-spans-with-opentelemetry">Sending Spans with OpenTelemetry</h3>
<p>Here's some minimal Python to get a span into Jaeger:</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "my-agent-system"})
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-tracer")

with tracer.start_as_current_span("my-agent-task") as span:
    span.set_attribute("agent.name", "planner")
    span.set_attribute("agent.tokens.input", 1500)
    span.set_attribute("agent.tokens.output", 800)
</code></pre>
<p>Refresh <code>localhost:16686</code>, pick your service, click "Find Traces."</p>
<h3 id="heading-correlating-pre-and-post-events">Correlating Pre and Post Events</h3>
<p>You need to match each PreToolUse to its PostToolUse. Agent-type tool calls didn't include a <code>tool_use_id</code> in the payload, so I hashed the tool name and input instead. Pre and Post carry identical <code>tool_input</code>, so the hashes line up.</p>
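<pre><code class="language-python">import hashlib
import json

def correlation_key(tool_name: str, tool_input: dict) -&gt; str:
    # sort_keys makes the JSON canonical, so Pre and Post hash identically
    content = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    # SHA-1 serves as a short fingerprint here, not as a security measure
    return hashlib.sha1(content.encode()).hexdigest()[:16]
</code></pre>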
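<p>A quick check of that property, with the function repeated so the snippet runs on its own. The sample inputs are made up for illustration.</p>

```python
import hashlib
import json


def correlation_key(tool_name: str, tool_input: dict) -> str:
    content = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    return hashlib.sha1(content.encode()).hexdigest()[:16]


# Same data, different key order, as might happen between two payloads.
pre_input = {"command": "npm test", "timeout": 60}
post_input = {"timeout": 60, "command": "npm test"}

pre_key = correlation_key("Bash", pre_input)
post_key = correlation_key("Bash", post_input)
other_key = correlation_key("Bash", {"command": "npm run lint"})

print(pre_key == post_key)   # keys match because sort_keys normalizes order
print(pre_key == other_key)  # a different input yields a different key
```

<p>One caveat with this approach: two in-flight calls with the same tool name and identical input produce the same key, so they can't be told apart by the hash alone.</p>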
<h3 id="heading-state-across-invocations">State Across Invocations</h3>
<p>Every hook call is a separate process. No shared memory. So I wrote span context to JSON files on Pre and read them back on Post:</p>
<pre><code class="language-plaintext">/tmp/claude-forge-tracing/&lt;session_id&gt;/
├── _root.json              # trace ID, root span context
├── _session_start_ns.json  # timestamp for duration calculation
├── subagent_&lt;hash&gt;.json    # per-subagent span context
└── tool_&lt;hash&gt;.json        # per-tool span context
</code></pre>
<p>File names get sanitized against path traversal. <code>_safe_name()</code> strips everything outside <code>[A-Za-z0-9._-]</code> and falls back to a SHA1 slug.</p>
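<p>Here is a minimal sketch of that pattern, assuming a directory layout like the one above. The names <code>safe_name</code>, <code>save_state</code>, and <code>load_state</code> are illustrative and the real hook may differ in its details. The idea is to strip unsafe characters, fall back to a hash slug when nothing usable remains, and round-trip span context through a JSON file.</p>

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path
from typing import Optional

_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def safe_name(raw: str) -> str:
    """Hypothetical sanitizer: keep [A-Za-z0-9._-], else fall back to a SHA-1 slug."""
    cleaned = _UNSAFE.sub("", raw)
    # Empty names and names made only of dots ("." or "..") are not usable.
    if not cleaned.strip("."):
        return hashlib.sha1(raw.encode()).hexdigest()[:16]
    return cleaned


def save_state(base: Path, session_id: str, key: str, state: dict) -> Path:
    """Write span context during the Pre event."""
    session_dir = base / safe_name(session_id)
    session_dir.mkdir(parents=True, exist_ok=True)
    path = session_dir / f"{safe_name(key)}.json"
    path.write_text(json.dumps(state))
    return path


def load_state(base: Path, session_id: str, key: str) -> Optional[dict]:
    """Read span context back during the Post event; None if missing or corrupt."""
    path = base / safe_name(session_id) / f"{safe_name(key)}.json"
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    save_state(base, "session-123", "tool_ab12", {"span_id": "00f067aa0ba902b7"})
    print(load_state(base, "session-123", "tool_ab12"))
    print(load_state(base, "session-123", "missing"))
```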
<h3 id="heading-flushing-without-blocking">Flushing Without Blocking</h3>
<pre><code class="language-python">try:
    provider.force_flush(timeout_millis=1000)
except Exception:
    pass  # Never block the swarm
</code></pre>
<p>I tried 2000ms first and the swarm felt slow. 100ms lost spans on cold TLS connections. 1000ms worked. If Jaeger is down, the swarm keeps running regardless.</p>
<h2 id="heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</h2>
<p>Open <code>http://localhost:16686</code>. Pick <code>claude-forge</code> from the service dropdown. Click "Find Traces."</p>
<p>The trace search filters by operation name, tags, and time range. Since session spans take their name from your prompt, searching "login form" pulls up the runs where you asked for one.</p>
<p>The timeline view is where I spend most of my time. Every span is a horizontal bar, nested by parent-child relationships. I can see the planner took 12 seconds, the implementer 45, the reviewer 8. Click any bar to see token counts, prompts, outputs, error status.</p>
<p>Trace comparison puts two runs side by side. This is good for figuring out why one run succeeded and another did not.</p>
<h2 id="heading-lessons-from-the-trenches">Lessons from the Trenches</h2>
<p><strong>One trace per swarm, not per subagent:</strong> My first version wiped the root span's state file on every Stop event, so each subagent started a new trace. I changed Stop to mark a timestamp while preserving the root.</p>
<p><strong>Use descriptions, not type names:</strong> Subagents all report their type as <code>general-purpose</code>. The description field is where the actual role lives.</p>
<p><strong>Token attribution needs per-agent transcripts:</strong> Claude Code writes subagent transcripts to <code>~/.claude/projects/&lt;project&gt;/&lt;session&gt;/subagents/agent-*.jsonl</code>. Match them via <code>agent-*.meta.json</code>.</p>
<p><strong>Parse boolean env vars explicitly:</strong> <code>bool("0")</code> in Python is <code>True</code>. Use an allowlist: <code>{"1", "true", "yes", "on"}</code>.</p>
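<p>A small sketch of the allowlist approach. The helper name <code>env_flag</code> and its optional <code>env</code> parameter are illustrative choices, added so the function can be exercised without touching the real environment.</p>

```python
import os
from typing import Mapping, Optional

_TRUTHY = frozenset({"1", "true", "yes", "on"})


def env_flag(name: str, default: bool = False,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the variable holds an allowlisted truthy value."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default  # unset: use the caller's default
    return raw.strip().lower() in _TRUTHY


fake_env = {"CLAUDE_FORGE_TRACING": "1", "CLAUDE_FORGE_TRACE_MUTATIONS": "0"}
print(bool(fake_env["CLAUDE_FORGE_TRACE_MUTATIONS"]))           # the pitfall
print(env_flag("CLAUDE_FORGE_TRACE_MUTATIONS", env=fake_env))   # parsed explicitly
print(env_flag("CLAUDE_FORGE_TRACE_INNER", env=fake_env))       # unset, default off
```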
<h2 id="heading-environment-variable-reference">Environment Variable Reference</h2>
<table>
<thead>
<tr>
<th>Variable</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>CLAUDE_FORGE_TRACING=1</code></td>
<td>Master opt-in. Hook is a no-op without this.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>Disable default mutation spans (Write/Edit/Bash). On by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>Capture all inner tool calls as child spans (off by default).</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST</code></td>
<td>Comma-separated tools to skip when inner tracing is on. Defaults to <code>Read,Glob,Grep,TodoWrite,NotebookRead</code>.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG=1</code></td>
<td>Enable debug logging of raw hook payloads. Off by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG_LOG</code></td>
<td>Override debug log path. Defaults to <code>~/.cache/claude-forge/hook.log</code>.</td>
</tr>
<tr>
<td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td>
<td>OTLP/gRPC endpoint. Defaults to <code>http://localhost:4317</code>.</td>
</tr>
</tbody></table>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Without visibility into the process, you waste tokens and time. Multi-agent swarms cost real money on every run. When an agent fails and retries, or when a reviewer rejects work that was close, you pay for it without seeing why.</p>
<p>Tracing gives you the map. You find out where the failure modes are. You find out which agents burn tokens going nowhere. A 45-second implementer run might have been 10 seconds with a better planner prompt. But you would never know that without seeing the breakdown.</p>
<p>Get observability in early. Jaeger and OpenTelemetry make it cheap to set up. Once you can see where things go wrong you can actually fix them.</p>
<p>Claude Forge tracing is on the <a href="https://github.com/HatmanStack/claude-forge">main branch</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build End-to-End LLM Observability in FastAPI with OpenTelemetry ]]>
                </title>
                <description>
                    <![CDATA[ This article shows how to build end-to-end, code-first LLM observability in a FastAPI application using the OpenTelemetry Python SDK. Instead of relying on vendor-specific agents or opaque SDKs, we wi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-end-to-end-llm-observability-in-fastapi-with-opentelemetry/</link>
                <guid isPermaLink="false">69b4379c6e27dd07d920f14c</guid>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Jessica Patel ]]>
                </dc:creator>
                <pubDate>Fri, 13 Mar 2026 16:13:16 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c69a589a-2dce-46a1-ac49-a0d0e2c23c6e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>This article shows how to build end-to-end, code-first LLM observability in a FastAPI application using the OpenTelemetry Python SDK.</p>
<p>Instead of relying on vendor-specific agents or opaque SDKs, we will manually design traces, spans, and semantic attributes that capture the full lifecycle of an LLM-powered request.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-introduction">Introduction</a></p>
</li>
<li><p><a href="#heading-prerequisites-and-technical-context">Prerequisites and Technical Context</a></p>
</li>
<li><p><a href="#heading-why-llm-observability-is-fundamentally-different">Why LLM Observability Is Fundamentally Different</a></p>
</li>
<li><p><a href="#heading-reference-architecture-a-traceable-rag-request">Reference Architecture: A Traceable RAG Request</a></p>
</li>
<li><p><a href="#heading-reference-architecture-explained">Reference Architecture Explained</a></p>
</li>
<li><p><a href="#heading-why-this-design-is-better-than-simpler-alternatives">Why This Design Is Better Than Simpler Alternatives</a></p>
</li>
<li><p><a href="#heading-llm-models-that-work-best-for-this-architecture">LLM Models That Work Best for This Architecture</a></p>
</li>
<li><p><a href="#heading-opentelemetry-primer-llm-relevant-concepts-only">OpenTelemetry Primer (LLM-Relevant Concepts Only)</a></p>
</li>
<li><p><a href="#heading-designing-llm-aware-spans">Designing LLM-Aware Spans</a></p>
</li>
<li><p><a href="#heading-fastapi-example-end-to-end-llm-spans-complete-and-explained">FastAPI Example: End-to-End LLM Spans (Complete and Explained)</a></p>
</li>
<li><p><a href="#heading-semantic-attributes-best-practices-for-llm-observability">Semantic Attributes: Best Practices for LLM Observability</a></p>
</li>
<li><p><a href="#heading-evaluation-hooks-inside-traces">Evaluation Hooks Inside Traces</a></p>
</li>
<li><p><a href="#heading-exporting-and-visualizing-traces-where-this-fits-with-vendor-tooling">Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)</a></p>
</li>
<li><p><a href="#heading-operational-patterns-and-anti-patterns">Operational Patterns and Anti-Patterns</a></p>
</li>
<li><p><a href="#heading-extending-the-system">Extending the System</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-introduction">Introduction</h2>
<p>Large Language Models (LLMs) are rapidly becoming a core component of modern software systems. Applications that once relied on deterministic APIs are now incorporating LLM-powered features such as conversational assistants, document summarization, intelligent search, and retrieval-augmented generation (RAG).</p>
<p>While these capabilities unlock new user experiences, they also introduce operational complexity that traditional monitoring approaches were never designed to handle.</p>
<p>Unlike conventional software services, LLM systems are probabilistic by nature. The same request may produce slightly different responses depending on factors such as prompt structure, model configuration, retrieval context, and sampling parameters such as temperature or top-p.</p>
<p>In addition, LLM workloads introduce entirely new operational dimensions such as token consumption, prompt construction latency, inference cost, context window limits, and response quality.</p>
<p>These factors mean that a request can appear technically successful from an infrastructure perspective while still producing an incorrect, hallucinated, or low-quality result.</p>
<p>Traditional observability tools typically focus on infrastructure-level signals such as latency, error rate, and throughput. While these metrics remain important, they are insufficient for understanding how an LLM application behaves in production.</p>
<p>Engineers must also understand what prompt was constructed, which documents were retrieved, how many tokens were consumed, which model configuration was used, and how the final response was evaluated. Without this visibility, debugging LLM behavior becomes extremely difficult and operational costs can quickly spiral out of control.</p>
<p>This is where LLM observability becomes essential. Observability for LLM systems extends beyond infrastructure monitoring. It captures the full lifecycle of an AI-driven request, from user input and context retrieval to prompt construction, model inference, post-processing, and quality evaluation.</p>
<p>When implemented correctly, observability allows teams to answer why the model generated a particular response, which retrieval results influenced the output, how much a request cost in terms of tokens, where latency occurred within the request pipeline, and whether the response passed basic quality or safety checks.</p>
<p>This article demonstrates how to implement end-to-end LLM observability in a FastAPI application using OpenTelemetry. Instead of relying on proprietary monitoring agents or opaque vendor SDKs, we take a code-first approach to instrumentation. By explicitly designing traces, spans, and semantic attributes, we gain precise control over how LLM interactions are observed and analyzed.</p>
<p>Throughout the guide, we will walk through a practical architecture for tracing a retrieval-augmented generation (RAG) workflow, where each stage of the request lifecycle is represented as a trace span. We will explore how to design meaningful span boundaries, capture prompt and model metadata safely, record token usage and cost signals, and attach evaluation results directly to traces.</p>
<p>The article also explains how this instrumentation can be exported to any OpenTelemetry-compatible backend such as Jaeger, Grafana Tempo, or LLM-specific platforms like Phoenix.</p>
<p>By the end of this guide, you will understand how to:</p>
<ul>
<li><p>Structure traces so that each user request maps to a single end-to-end LLM interaction</p>
</li>
<li><p>Design span hierarchies that reflect the logical stages of an LLM pipeline</p>
</li>
<li><p>Capture prompt metadata, model configuration, and token usage safely</p>
</li>
<li><p>Attach evaluation and quality signals to traces for deeper analysis</p>
</li>
<li><p>Export observability data to different backends without changing instrumentation</p>
</li>
</ul>
<p>Most importantly, the goal of this article is not simply to demonstrate how to add telemetry to an application. Instead, it aims to show how to think about observability when building LLM-powered systems.</p>
<p>When LLM operations are treated as first-class components within a distributed system, traces become a powerful tool for debugging, optimization, cost management, and continuous improvement of model behavior.</p>
<h2 id="heading-prerequisites-and-technical-context">Prerequisites and Technical Context</h2>
<p>Before following this guide, you should be familiar with the Python programming language, basic web API concepts, and general microservice architecture. Below are some key tools and concepts used in this article.</p>
<h3 id="heading-fastapi-web-framework">FastAPI (Web Framework)</h3>
<p>FastAPI is used as the primary web framework for the application. It is a modern Python framework designed for building high-performance APIs using standard Python type hints. FastAPI simplifies request validation, serialization, and API documentation while remaining lightweight and fast.</p>
<h3 id="heading-large-language-models-llms">Large Language Models (LLMs)</h3>
<p>Large Language Models (LLMs) are the computational core of the example system. An LLM is a model trained on vast amounts of text data to generate or transform language in ways that resemble human communication. In production environments, LLMs are commonly used for tasks such as conversational interfaces, summarization, and question answering.</p>
<h3 id="heading-observability-concept">Observability (Concept)</h3>
<p>Observability is the overarching concept that connects all the technical pieces in this article. At a high level, observability refers to the ability to understand a system's internal behavior by examining the data it produces during execution. Rather than asking whether a system is simply "up" or "down," observability helps answer deeper questions about why a request behaved a certain way, where latency was introduced, or how different components interacted.</p>
<h3 id="heading-opentelemetry-instrumentation-standard">OpenTelemetry (Instrumentation Standard)</h3>
<p>OpenTelemetry is the mechanism used to implement observability within the application. It is an open, vendor-neutral standard for generating telemetry data such as traces, metrics, and logs. By instrumenting key parts of the LLM workflow, we can observe how requests flow through the system, how long each step takes, and what contextual data influenced the final outcome. OpenTelemetry serves as the foundation for collecting this information in a consistent and portable way, independent of any specific monitoring backend.</p>
<h2 id="heading-why-llm-observability-is-fundamentally-different">Why LLM Observability Is Fundamentally Different</h2>
<p>Traditional observability assumes deterministic behavior: the same input produces the same output. LLM systems violate this assumption. The response to the same request can vary because of prompt template changes, retrieval differences, sampling parameters (temperature, top-p), model version upgrades, and context window truncation.</p>
<p>As a result, teams need visibility into what the model saw, how it was configured, what it retrieved, how long it took, and how much it cost, all correlated to a single user request. Logs alone are insufficient, and metrics lack dimensionality. Distributed traces are the backbone of LLM observability.</p>
<h2 id="heading-reference-architecture-a-traceable-rag-request">Reference Architecture: A Traceable RAG Request</h2>
<p>A typical FastAPI-based RAG service follows this flow:</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6979762ba2442d262dacf388/50e7fda4-7407-43d6-8f12-045b8e73c7eb.png" alt="FastAPI Based RAG Service" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Each step is observable, but only if we deliberately instrument it. The goal is one trace per user request, with child spans representing each logical LLM step.</p>
<h2 id="heading-reference-architecture-explained">Reference Architecture Explained</h2>
<h3 id="heading-client-sends-a-request-to-chat">Client Sends a Request to /chat</h3>
<p>The architecture begins when a client sends a request to the <code>/chat</code> endpoint. This request typically contains the user's query along with any session or conversation context required by the application.</p>
<p>Keeping the client interface minimal and well-defined is intentional: it ensures the backend receives a predictable input shape and prevents application-specific logic from leaking into downstream LLM processing.</p>
<p>From an observability perspective, this request marks the start of a single end-to-end trace, allowing every subsequent operation to be correlated back to the original user action.</p>
<h3 id="heading-fastapi-validates-input-and-authenticates-the-user">FastAPI Validates Input and Authenticates the User</h3>
<p>Once the request reaches the service, FastAPI performs schema validation and authentication. Validation guarantees that only well-formed inputs proceed through the pipeline, while authentication ensures that expensive LLM operations are only executed for authorized users.</p>
<p>Placing this step early reduces unnecessary computation and protects the system from abuse. It also improves trace quality by ensuring that all observed requests represent legitimate execution paths rather than malformed or rejected traffic.</p>
<h3 id="heading-retriever-queries-the-vector-database">Retriever Queries the Vector Database</h3>
<p>After validation, the system queries a vector database to retrieve documents relevant to the user's request. This retrieval step is the foundation of retrieval-augmented generation (RAG). By grounding the LLM in external knowledge, the system improves factual accuracy and reduces hallucinations.</p>
<p>Separating retrieval from generation allows teams to tune similarity thresholds, embedding models, and top-k values independently, and it makes it easier to diagnose whether poor responses are caused by bad retrieval or model behavior.</p>
<h3 id="heading-prompt-is-assembled-using-retrieved-documents">Prompt Is Assembled Using Retrieved Documents</h3>
<p>With relevant documents in hand, the system constructs the final prompt that will be sent to the LLM. This step combines the user query, retrieved context, system instructions, and formatting rules into a single structured prompt.</p>
<p>Making prompt assembly an explicit stage enables prompt versioning, experimentation, and observability. It also provides a natural place to detect issues such as context window overflows or excessive prompt size before invoking the model.</p>
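<p>As a minimal sketch of that pre-flight check, the helper below measures a prompt against a token budget before the model is called. The function name, the budget value, and the <code>llm.prompt_tokens_estimated</code>, <code>llm.prompt_overflow_tokens</code>, and <code>llm.prompt_over_budget</code> attribute names are assumptions made for illustration, and tokens are approximated by whitespace splitting rather than a real tokenizer.</p>
<pre><code class="language-python">def prompt_size_attributes(prompt, max_prompt_tokens):
    """Return span-ready attributes describing prompt size against a budget.

    Tokens are approximated by whitespace splitting purely for illustration;
    a real system would use the provider's tokenizer.
    """
    approx_tokens = len(prompt.split())
    # Number of tokens beyond the budget, or 0 when the prompt fits.
    overflow = max(0, approx_tokens - max_prompt_tokens)
    return {
        "llm.prompt_length": len(prompt),
        "llm.prompt_tokens_estimated": approx_tokens,
        "llm.prompt_overflow_tokens": overflow,
        "llm.prompt_over_budget": overflow != 0,
    }


# Hypothetical prompt and budget.
attrs = prompt_size_attributes(
    "Context: a b c Question: what is d?",
    max_prompt_tokens=5,
)
</code></pre>
<p>Each key in the returned dictionary could be passed to <code>set_attribute</code> on the prompt-building span, which makes oversized prompts visible in traces before they turn into truncated context or unexpected cost.</p>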
<h3 id="heading-llm-api-is-invoked">LLM API Is Invoked</h3>
<p>The LLM API call is the most expensive and non-deterministic operation in the pipeline, which is why it occurs only after all preparatory work is complete. At this stage, the model receives a fully constructed prompt and produces a response based on its configuration parameters.</p>
<p>This step is the primary focus of latency, cost, and reliability controls such as retries, timeouts, and circuit breakers. From an observability standpoint, this span becomes the anchor for token usage, cost attribution, and prompt-level debugging.</p>
<h3 id="heading-response-is-post-processed-and-returned">Response Is Post-Processed and Returned</h3>
<p>After the LLM returns a response, the system performs post-processing before sending the result back to the client. This may include formatting, filtering, validation, or enrichment of the output. Post-processing acts as a final safeguard against malformed or low-quality responses and ensures consistency with application requirements. It also provides a clean boundary for attaching evaluation signals, such as response length, relevance scores, or truncation indicators, before the request completes.</p>
<h2 id="heading-why-this-design-is-better-than-simpler-alternatives">Why This Design Is Better Than Simpler Alternatives</h2>
<p>This architecture intentionally avoids coupling responsibilities together. Validation, retrieval, prompt construction, model execution, and response handling are all distinct steps. This separation makes the system easier to test, easier to observe, and easier to evolve. When something fails, engineers can identify <em>where</em> and <em>why</em> rather than treating the LLM as a black box.</p>
<p>Compared to a monolithic "send user input directly to the LLM" approach, this design offers better correctness, lower cost, and higher resilience. It also aligns naturally with distributed tracing, since each block maps cleanly to a trace span with a clear semantic purpose. As the system grows, additional features such as caching, fallback models, or policy enforcement can be added without destabilizing the entire flow.</p>
<p>Most importantly, this architecture treats the LLM as one component in a larger system, not the system itself. That mindset is essential for building reliable production applications.</p>
<h2 id="heading-llm-models-that-work-best-for-this-architecture">LLM Models That Work Best for This Architecture</h2>
<p>This architecture is model-agnostic, but certain model characteristics work particularly well with retrieval-augmented workflows.</p>
<p>Models with strong instruction-following and reasoning capabilities tend to perform best, especially when prompts include structured context from retrieved documents. General-purpose models such as GPT-4-class systems perform well when accuracy and reasoning depth are critical.</p>
<p>For lower-latency or cost-sensitive use cases, smaller instruction-tuned models can be effective when paired with high-quality retrieval. Open-source models such as LLaMA-derived or Mistral-based systems also fit well into this architecture, particularly when deployed behind a private inference endpoint.</p>
<p>The key requirement is not the model itself, but how it is used. Models that can reliably ground their responses in provided context, respect system instructions, and produce stable outputs under varying prompts integrate most cleanly into this design. Because retrieval and prompt construction are explicit stages, models can be swapped or compared without changing the overall system structure.</p>
<h2 id="heading-opentelemetry-primer-llm-relevant-concepts-only">OpenTelemetry Primer (LLM-Relevant Concepts Only)</h2>
<p>OpenTelemetry defines three core types of telemetry data: traces, metrics, and logs. For LLM systems, traces are the most important. To make them useful, you need to understand a few building blocks:</p>
<ul>
<li><p>a <strong>trace</strong> represents a single end-to-end request</p>
</li>
<li><p>a <strong>span</strong> is a timed operation within that trace</p>
</li>
<li><p><strong>attributes</strong> are key–value metadata attached to spans</p>
</li>
<li><p><strong>events</strong> are time-stamped annotations</p>
</li>
<li><p><strong>context propagation</strong> ensures child spans attach to the correct parent.</p>
</li>
</ul>
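<p>FastAPI’s async nature makes correct context propagation essential, because many requests interleave on the same event loop. OpenTelemetry’s Python implementation keeps the active span in a <code>contextvars</code>-based context, so each asyncio task sees its own current span as long as spans are created with <code>start_as_current_span</code> inside the request handler.</p>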
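<p>To see why this matters for concurrent requests, here is a toy model of the mechanism that uses only the standard library. It is not OpenTelemetry's code: <code>toy_span</code>, <code>recorded</code>, and the request IDs are made up for illustration, and the sketch simply assumes that the "current span" is held in a <code>contextvars.ContextVar</code>.</p>
<pre><code class="language-python">import asyncio
import contextvars
from contextlib import contextmanager

# Toy stand-in for the tracer's "current span" slot; not OpenTelemetry code.
_current_span = contextvars.ContextVar("current_span", default=None)
recorded = []  # (span name, parent span name) pairs, in start order


@contextmanager
def toy_span(name):
    """Record the span under whatever span is current, then become current."""
    parent = _current_span.get()
    recorded.append((name, parent))
    token = _current_span.set(name)
    try:
        yield name
    finally:
        # Restore the previous current span when the block exits.
        _current_span.reset(token)


async def handle(request_id):
    with toy_span(f"http.request:{request_id}"):
        with toy_span(f"rag.retrieval:{request_id}"):
            await asyncio.sleep(0.01)  # yields control to the other request
        with toy_span(f"llm.call:{request_id}"):
            await asyncio.sleep(0.01)


async def main():
    # Two overlapping requests, each running in its own asyncio task.
    await asyncio.gather(handle("a"), handle("b"))


asyncio.run(main())
parents = dict(recorded)
</code></pre>
<p>Even though the two requests interleave at every <code>await</code>, <code>parents["llm.call:a"]</code> is <code>"http.request:a"</code> and <code>parents["llm.call:b"]</code> is <code>"http.request:b"</code>, because asyncio gives each task its own copy of the context.</p>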
<p>With those concepts in place, the next step is to wire OpenTelemetry into the app. Start by configuring the OpenTelemetry SDK in FastAPI: define a <code>TracerProvider</code>, attach a <code>Resource</code> (service name and environment), configure an exporter (Jaeger, Tempo, Phoenix, and so on), and enable FastAPI auto-instrumentation.</p>
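<p>The setup described above can be sketched roughly as follows. This is a configuration fragment rather than the article's exact setup: it assumes the <code>opentelemetry-sdk</code> and <code>opentelemetry-instrumentation-fastapi</code> packages are installed, the service name and environment values are placeholders, and the console exporter stands in for whichever backend exporter you actually use.</p>
<pre><code class="language-python">from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes identify the service in the tracing backend.
# Both values below are placeholders.
resource = Resource.create(
    {
        "service.name": "rag-service",
        "deployment.environment": "dev",
    }
)

# The provider owns span creation; the processor batches finished spans
# and hands them to the exporter.
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()

# Auto-instrumentation creates a server span for every incoming request.
FastAPIInstrumentor.instrument_app(app)
</code></pre>
<p>Printing spans to the console is only useful for local checks. Pointing the same provider at Jaeger, Tempo, or Phoenix should only require replacing the exporter, while the spans created in application code stay unchanged.</p>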
<h2 id="heading-designing-llm-aware-spans">Designing LLM-Aware Spans</h2>
<h3 id="heading-span-taxonomy">Span Taxonomy</h3>
<p>A clean span hierarchy is critical. In this guide, a single <code>http.request</code> span (usually auto-generated) acts as the root, and it contains child spans such as <code>rag.retrieval</code>, <code>rag.prompt.build</code>, <code>llm.call</code>, <code>llm.postprocess</code>, and, optionally, <code>llm.eval</code>. Each of these spans represents a logical unit of work rather than an implementation detail.</p>
<h3 id="heading-span-boundaries">Span Boundaries</h3>
<p>Getting span boundaries right is just as important as picking the right span names. Avoid extremes like wrapping the entire LLM workflow in one giant span, creating a separate span for every token, or dumping all data into logs.</p>
<p>Instead, aim for a few coarse-grained spans that each represent a meaningful step in the request, enrich them with well-chosen attributes, and use events to mark important milestones within a span rather than splitting everything into smaller spans.</p>
<h3 id="heading-instrumenting-the-llm-call">Instrumenting the LLM Call</h3>
<p>When instrumenting the LLM call, treat it as the most critical span in the trace. Whether you are calling OpenAI, Anthropic, or another provider, start the span immediately before the API request and end it only after the full response (or stream) is complete.</p>
<p>Within that span, capture retries, timeouts, and errors so it becomes the central place for latency analysis, cost attribution, and prompt debugging.</p>
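<p>One possible shape for this is sketched below: a retry loop with a per-attempt timeout that records every failed attempt on the span. <code>RecordingSpan</code> is a stand-in that mimics the <code>set_attribute</code> and <code>add_event</code> methods of an OpenTelemetry span so the sketch runs without the SDK, and <code>call_with_retries</code>, <code>flaky_llm</code>, and the <code>llm.timeout</code>, <code>llm.attempts</code>, and <code>llm.failed</code> names are hypothetical.</p>
<pre><code class="language-python">import asyncio


class RecordingSpan:
    """Stand-in exposing set_attribute/add_event like an OpenTelemetry span."""

    def __init__(self):
        self.attributes = {}
        self.events = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))


async def call_with_retries(span, make_call, max_attempts=3, timeout_s=0.05):
    """Run make_call() with a per-attempt timeout, recording failures on the span."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Keep the failed attempt visible inside the same span.
            span.add_event("llm.timeout", {"attempt": attempt})
            continue
        span.set_attribute("llm.attempts", attempt)
        return result
    span.set_attribute("llm.attempts", max_attempts)
    span.set_attribute("llm.failed", True)
    raise TimeoutError(f"LLM call timed out after {max_attempts} attempts")


# Hypothetical provider call: the first attempt hangs, the second succeeds.
calls = {"count": 0}


async def flaky_llm():
    calls["count"] += 1
    await asyncio.sleep(0.2 if calls["count"] == 1 else 0)
    return "ok"


span = RecordingSpan()
answer = asyncio.run(call_with_retries(span, flaky_llm))
</code></pre>
<p>Because the retries happen inside one span, its duration reflects what the user actually waited for, while the events show how much of that time was lost to failed attempts. With a real span you would likely also call <code>record_exception</code> and set an error status when all attempts fail.</p>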
<p>For streaming responses, you can emit events for each chunk to track progress, but avoid creating separate child spans unless you truly need fine-grained timing.</p>
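<p>A minimal sketch of the event-per-chunk idea follows. The <code>fake_stream</code> generator, the <code>consume_stream</code> helper, and the <code>llm.stream.chunk</code> and <code>llm.stream.end</code> event names are assumptions for illustration. The helper takes an <code>add_event</code> callable so that it runs here with a plain recorder; in an instrumented service you would pass the span's own <code>add_event</code> method instead.</p>
<pre><code class="language-python">import asyncio


async def fake_stream():
    """Hypothetical provider stream yielding text chunks."""
    for piece in ["Fast", "API ", "rocks"]:
        await asyncio.sleep(0)  # pretend to wait for the next chunk
        yield piece


async def consume_stream(stream, add_event):
    """Join streamed chunks, reporting each one via add_event(name, attributes)."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
        add_event(
            "llm.stream.chunk",
            {"index": len(parts) - 1, "chars": len(chunk)},
        )
    add_event("llm.stream.end", {"chunks": len(parts)})
    return "".join(parts)


events = []


def record_event(name, attributes):
    # Plain recorder used in place of a real span's add_event.
    events.append((name, attributes))


text = asyncio.run(consume_stream(fake_stream(), record_event))
</code></pre>
<p>For long responses, one event per chunk can become noisy, so it may be enough to record only the first chunk and the final count. Either way, the stream stays inside a single <code>llm.call</code> span.</p>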
<h2 id="heading-fastapi-example-end-to-end-llm-spans-complete-and-explained">FastAPI Example: End-to-End LLM Spans (Complete and Explained)</h2>
<pre><code class="language-python">from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.trace import Tracer
from typing import List
import asyncio
import hashlib

# Obtain a tracer instance from OpenTelemetry.
# All spans created with this tracer will be part of the same distributed
# tracing system and exported to the configured backend.
tracer: Tracer = trace.get_tracer(__name__)

# Initialize the FastAPI application.
app = FastAPI()

# Helper functions used by the observable endpoint
async def retrieve_documents(query: str) -&gt; List[str]:
    """
    Simulate document retrieval (e.g., vector search or knowledge base lookup).
    This function represents the retrieval stage in a RAG pipeline.
    In a real system, this might query a vector database or search index.
    """
    await asyncio.sleep(0.05)  # Simulate I/O latency
    return [
        "FastAPI enables high-performance async APIs.",
        "OpenTelemetry provides vendor-neutral observability.",
        "LLM observability requires tracing prompts and tokens.",
    ]


def build_prompt(query: str, documents: List[str]) -&gt; str:
    """
    Construct the final prompt from retrieved documents and the user query.
    Prompt construction is kept separate so it can be observed or modified
    independently if needed (for example, to measure prompt assembly latency).
    """
    context = "\n".join(documents)
    return f"""
Context:
{context}

Question:
{query}
"""


class LLMResponse:
    """
    Minimal abstraction for an LLM response.
    This keeps the example self-contained while still allowing us to attach
    token usage and other metadata for observability.
    """

    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -&gt; int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -&gt; LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)


def summarize_response(response: LLMResponse) -&gt; str:
    """
    Example post-processing step.
    Post-processing is separated into its own phase so any additional latency
    or errors are not incorrectly attributed to the LLM itself.
    """
    return response.text


# Observable FastAPI endpoint
@app.post("/query")
async def rag_query(request: Request, query: str):
    """
    Handle a single RAG-style request with explicit OpenTelemetry spans.
    This endpoint demonstrates how to create one trace per request, with child
    spans for retrieval, LLM invocation, and post-processing.
    """

    # Create a top-level span for the HTTP request.
    # Even if FastAPI auto-instrumentation is enabled, defining this explicitly
    # allows us to attach domain-specific metadata.
    with tracer.start_as_current_span("http.request") as http_span:
        http_span.set_attribute("http.method", "POST")
        http_span.set_attribute("http.route", "/query")

        # Retrieval phase
        # This span isolates the retrieval step so that relevance issues can be
        # debugged independently of LLM behavior.
        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            retrieval_span.set_attribute("rag.top_k", 5)
            retrieval_span.set_attribute("rag.similarity_threshold", 0.8)
            documents = await retrieve_documents(query)

            # Record how many documents were returned.
            # This is a key signal when diagnosing hallucinations
            # or missing context in the final response.
            retrieval_span.set_attribute(
                "rag.documents_returned",
                len(documents),
            )

        # LLM invocation phase
        # This span wraps the actual LLM call and is the primary anchor for
        # latency, cost, and prompt-related analysis.
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("llm.provider", "example")
            llm_span.set_attribute("llm.model", "example-llm")
            llm_span.set_attribute("llm.temperature", 0.7)
            llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

            # Build the final prompt using retrieved context.
            # The raw prompt is intentionally not stored as a span attribute.
            prompt = build_prompt(query, documents)
            
            # Prompt metadata
            prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
            llm_span.set_attribute("llm.prompt_hash", prompt_hash)
            llm_span.set_attribute("llm.prompt_length", len(prompt))

            response = await call_llm(prompt)

            # Hash the response instead of storing raw text.
            # This allows correlation across traces without exposing content.
            response_hash = hashlib.sha256(
                response.text.encode()
            ).hexdigest()
            llm_span.set_attribute("llm.response_hash", response_hash)

            # Record token usage to enable cost attribution
            # and capacity planning.
            llm_span.set_attribute("llm.usage.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("llm.usage.completion_tokens", response.completion_tokens)
            llm_span.set_attribute("llm.usage.total_tokens", response.total_tokens)
            
            # example price per token
            estimated_cost = response.total_tokens * 0.000002
            llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)

        # Post-processing phase
        # Any transformation after the LLM response is captured here,
        # ensuring inference latency is not overstated.
        with tracer.start_as_current_span("llm.postprocess") as post_span:
            summary = summarize_response(response)
            post_span.set_attribute(
                "llm.summary_length",
                len(summary),
            )

    # Return the final response to the client.
    # All spans above belong to the same distributed trace.
    return {"summary": summary}
</code></pre>
<p>With the full example in view, it helps to connect the instrumentation to the observability principles described earlier in this article.</p>
<p>The goal of the example is not simply to show how to create spans, but to demonstrate how a single user request can be represented as a structured trace containing meaningful metadata about each stage of the LLM pipeline.</p>
<p>At a high level, the code follows three key design ideas:</p>
<ol>
<li><p>One trace per user request</p>
</li>
<li><p>One span per logical LLM workflow stage</p>
</li>
<li><p>Semantic attributes attached to spans for debugging, cost tracking, and analysis</p>
</li>
</ol>
<p>Each of these concepts directly corresponds to the observability practices discussed earlier.</p>
<h3 id="heading-top-level-request-span">Top-Level Request Span</h3>
<p>The FastAPI endpoint begins by creating a top-level span called <code>http.request</code>. This span represents the entire lifecycle of the incoming request and serves as the root span for the trace.</p>
<pre><code class="language-python">with tracer.start_as_current_span("http.request") as http_span:
</code></pre>
<p>Although FastAPI can generate HTTP spans automatically through OpenTelemetry auto-instrumentation, explicitly creating this span allows the application to attach domain-specific metadata such as route names or user identifiers.</p>
<p>Attributes such as the HTTP method and route are attached here:</p>
<pre><code class="language-python">http_span.set_attribute("http.method", "POST")
http_span.set_attribute("http.route", "/query")
</code></pre>
<p>This ensures that every trace can be easily filtered by endpoint when analyzing production traffic.</p>
<h3 id="heading-retrieval-span">Retrieval Span</h3>
<p>The next span captures the retrieval phase of the RAG pipeline:</p>
<pre><code class="language-python">with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
</code></pre>
<p>This span isolates the vector search or knowledge retrieval step from the rest of the pipeline. If users report irrelevant answers, engineers can inspect this span to determine whether the issue originates from poor retrieval results rather than model behavior.</p>
<p>Several semantic attributes are attached here:</p>
<ul>
<li><p><code>rag.top_k</code> – number of documents requested</p>
</li>
<li><p><code>rag.similarity_threshold</code> – similarity cutoff used for filtering results</p>
</li>
<li><p><code>rag.documents_returned</code> – number of documents actually retrieved</p>
</li>
</ul>
<p>These attributes correspond to the retrieval tuning parameters (similarity thresholds and top-k values) discussed earlier in this article.</p>
<h3 id="heading-llm-invocation-span">LLM Invocation Span</h3>
<p>The most important span in the trace is the <code>llm.call</code> span, which wraps the actual model invocation.</p>
<pre><code class="language-python">with tracer.start_as_current_span("llm.call") as llm_span:
</code></pre>
<p>This span captures the latency, configuration, and token usage associated with the LLM request. In production systems, it becomes the primary location for analyzing model behavior and cost.</p>
<p>Key attributes recorded in this span include:</p>
<ul>
<li><p><code>llm.provider</code> – the model provider (OpenAI, Anthropic, etc.)</p>
</li>
<li><p><code>llm.model</code> – the specific model version</p>
</li>
<li><p><code>llm.temperature</code> – sampling parameter controlling response randomness</p>
</li>
<li><p><code>llm.prompt_template_id</code> – identifier for the prompt template used</p>
</li>
</ul>
<p>These attributes make it possible to correlate changes in model configuration with downstream quality or cost changes.</p>
<h3 id="heading-prompt-handling-and-privacy">Prompt Handling and Privacy</h3>
<p>Instead of storing the full prompt or response text directly in the trace, the example demonstrates a safer practice: hashing sensitive data.</p>
<pre><code class="language-python">response_hash = hashlib.sha256(response.text.encode()).hexdigest()
</code></pre>
<p>The resulting hash is stored as a span attribute:</p>
<pre><code class="language-python">llm_span.set_attribute("llm.response_hash", response_hash)
</code></pre>
<p>This approach allows engineers to correlate repeated responses across traces without exposing potentially sensitive content in observability systems.</p>
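<p>The property that makes this work is that identical text always produces the same digest, while any change produces a different one. The short sketch below illustrates that; the <code>content_fingerprint</code> helper name is an assumption, and the strings are sample values.</p>
<pre><code class="language-python">import hashlib


def content_fingerprint(text):
    """Return a SHA-256 hex digest that identifies text without storing it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two identical responses and one that differs by a single character.
first = content_fingerprint("FastAPI and OpenTelemetry enable end-to-end LLM observability.")
second = content_fingerprint("FastAPI and OpenTelemetry enable end-to-end LLM observability.")
changed = content_fingerprint("FastAPI and OpenTelemetry enable end-to-end LLM observability!")

same_content = first == second      # identical text, identical digest
different_content = first != changed  # one character changed, new digest
</code></pre>
<p>Keep in mind that a plain hash hides content from casual inspection but does not make it secret: short or predictable strings can be matched by hashing guesses. If that is a concern, a keyed hash (for example with the standard library's <code>hmac</code> module) is one option to consider.</p>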
<h3 id="heading-token-usage-tracking">Token Usage Tracking</h3>
<p>The <code>llm.call</code> span also records token usage:</p>
<pre><code class="language-python">llm_span.set_attribute(
    "llm.usage.total_tokens",
    response.total_tokens
)
</code></pre>
<p>Capturing token usage at the span level is critical for monitoring cost and efficiency, since token consumption directly determines billing for most LLM providers.</p>
<h3 id="heading-post-processing-span">Post-Processing Span</h3>
<p>Finally, the example includes a <code>llm.postprocess</code> span:</p>
<pre><code class="language-python">with tracer.start_as_current_span("llm.postprocess") as post_span:
</code></pre>
<p>This span represents any transformation applied after the model generates its response. Separating post-processing from the LLM call ensures that additional latency — such as formatting, filtering, or validation — is not incorrectly attributed to the model itself.</p>
<p>An attribute such as response length is recorded here:</p>
<pre><code class="language-python">post_span.set_attribute("llm.summary_length", len(summary))
</code></pre>
<p>This can be useful when diagnosing issues such as unexpectedly short or truncated outputs.</p>
<h3 id="heading-how-the-spans-form-a-complete-trace">How the Spans Form a Complete Trace</h3>
<p>When the request finishes, all spans belong to the same distributed trace:</p>
<pre><code class="language-plaintext">http.request
 ├── rag.retrieval
 ├── llm.call
 └── llm.postprocess
</code></pre>
<p>This hierarchy reflects the logical workflow of a retrieval-augmented LLM system. Because each span contains structured metadata, engineers can quickly answer questions such as:</p>
<ul>
<li><p>Was the latency caused by retrieval or model inference?</p>
</li>
<li><p>How many documents influenced the prompt?</p>
</li>
<li><p>Which model configuration produced the response?</p>
</li>
<li><p>How many tokens were consumed?</p>
</li>
<li><p>Was the response post-processed or truncated?</p>
</li>
</ul>
<p>This structured trace design is what transforms observability from simple monitoring into a practical debugging and optimization tool for LLM systems.</p>
<h2 id="heading-semantic-attributes-best-practices-for-llm-observability">Semantic Attributes: Best Practices for LLM Observability</h2>
<p>The goal is not to capture every possible detail, but to record the minimal set of stable, high-signal attributes that enable effective debugging, cost control, and quality analysis in production. Poor attribute design leads to noisy traces, privacy risks, and dashboards that are impossible to reason about.</p>
<h3 id="heading-prompt-response-and-model-metadata">Prompt, Response, and Model Metadata</h3>
<p>Storing raw prompts is often unsafe and expensive, so it is better to record minimal, structured metadata instead. In practice, this means attaching a stable template identifier with <code>llm.prompt_template_id</code>, a hashed version of the final prompt using <code>llm.prompt_hash</code> (to avoid storing raw text), and a size indicator such as <code>llm.prompt_length</code>, which captures the number of tokens or characters.</p>
<p>You should also record key inference parameters on every call: <code>llm.provider</code> (for example, "openai" or "anthropic"), <code>llm.model</code> (for example, "gpt-4.1"), <code>llm.temperature</code> and <code>llm.top_p</code> (sampling parameters), <code>llm.max_tokens</code> (the maximum tokens allowed), and <code>llm.stream</code> (whether streaming was enabled). Whatever you record, stay within your organization’s privacy and compliance requirements.</p>
<pre><code class="language-python"># Fragment from inside the endpoint shown earlier; query and documents
# come from the surrounding request handler.
with tracer.start_as_current_span("llm.call") as llm_span:
    llm_span.set_attribute("llm.provider", "example")
    llm_span.set_attribute("llm.model", "example-llm")
    llm_span.set_attribute("llm.temperature", 0.7)
    llm_span.set_attribute("llm.top_p", 0.9)
    llm_span.set_attribute("llm.max_tokens", 512)
    llm_span.set_attribute("llm.stream", False)
    llm_span.set_attribute("llm.prompt_template_id", "rag_v1")

    # Build the final prompt using retrieved context.
    # The raw prompt is intentionally not stored as a span attribute.
    prompt = build_prompt(query, documents)

    # Prompt metadata
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    llm_span.set_attribute("llm.prompt_hash", prompt_hash)
    llm_span.set_attribute("llm.prompt_length", len(prompt))
</code></pre>
<h3 id="heading-token-usage-and-cost-why-this-matters-in-practice">Token Usage and Cost (Why This Matters in Practice)</h3>
<p>Token usage is one of the most common blind spots in LLM systems. Many teams monitor latency and error rates but discover runaway costs only after invoices spike. Because token consumption varies significantly by prompt structure, retrieved context, and model configuration, it must be captured explicitly at the span level.</p>
<p>The most important practice is to record token usage at the end of the LLM span, once the model has completed inference. This ensures that the values reflect the full request rather than partial or streamed output.</p>
<p>At minimum, capture the attributes <code>llm.usage.prompt_tokens</code>, <code>llm.usage.completion_tokens</code>, and <code>llm.usage.total_tokens</code>.</p>
<pre><code class="language-python">class LLMResponse:
    def __init__(self, text: str, prompt_tokens: int, completion_tokens: int):
        self.text = text
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
    
    @property
    def total_tokens(self) -&gt; int:
        return self.prompt_tokens + self.completion_tokens

async def call_llm(prompt: str) -&gt; LLMResponse:
    """
    Simulate an LLM API call.
    In a real implementation, this would call OpenAI, Anthropic, or another
    provider. The artificial delay represents model latency.
    """
    await asyncio.sleep(0.2)  # Simulate inference time
    response_text = "FastAPI and OpenTelemetry enable end-to-end LLM observability."
    # Token count is approximated here for demonstration purposes.
    prompt_tokens = len(prompt.split())
    completion_tokens = len(response_text.split())
    return LLMResponse(response_text, prompt_tokens, completion_tokens)
</code></pre>
<p>These values allow you to distinguish between requests that are expensive because of large prompts (often caused by excessive retrieval or poor prompt construction) versus those that are expensive because of long model-generated outputs.</p>
<p>Where possible, also attach an estimated cost as <code>llm.cost_estimated_usd</code>:</p>
<pre><code class="language-python">    # example price per token
    estimated_cost = response.total_tokens * 0.000002
    llm_span.set_attribute("llm.cost_estimated_usd", estimated_cost)
</code></pre>
<p>This value is typically derived by multiplying token counts by the model's published pricing. Even if the estimate is approximate, it enables powerful analysis. For example, you can identify which endpoints, prompt templates, or user flows are responsible for the highest cumulative cost, rather than relying on coarse, account-level billing dashboards.</p>
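<p>Many providers price prompt and completion tokens differently, so a slightly more careful estimate keeps the two rates apart. The sketch below shows one way to do that. The price table, the <code>estimate_cost_usd</code> helper, and the rates themselves are hypothetical and would need to be replaced with your provider's published pricing.</p>
<pre><code class="language-python"># Hypothetical USD prices per 1,000 tokens; replace with your provider's rates.
PRICE_PER_1K = {
    "example-llm": {"prompt": 0.001, "completion": 0.002},
}


def estimate_cost_usd(model, prompt_tokens, completion_tokens):
    """Estimate request cost, pricing prompt and completion tokens separately."""
    rates = PRICE_PER_1K[model]
    cost = (
        prompt_tokens * rates["prompt"]
        + completion_tokens * rates["completion"]
    ) / 1000
    # Round to keep the attribute value readable in dashboards.
    return round(cost, 6)


cost = estimate_cost_usd("example-llm", prompt_tokens=1200, completion_tokens=300)
</code></pre>
<p>The result can be stored in <code>llm.cost_estimated_usd</code> next to the token counts. Keeping the price table in one place also makes it easier to update the estimate when pricing changes.</p>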
<p>Once spans carry the right attributes, the next step is to connect them to output quality, not just system health.</p>
<h2 id="heading-evaluation-hooks-inside-traces">Evaluation Hooks Inside Traces</h2>
<p>This section describes an additional pattern you can layer on top of the core instrumentation in this guide. It is optional and not implemented in the sample code, but it shows how to attach quality signals directly to your traces.</p>
<p>Observability is not just about whether the system stayed up. It is also about whether the model produced a useful answer. Evaluation hooks inside traces let you attach lightweight quality signals directly to the same spans you use for latency and cost.</p>
<p>Inline evaluations are the simplest approach. You can run quick checks synchronously and record the results as span attributes, such as <code>llm.eval.passed</code> for a simple boolean check, <code>llm.eval.relevance_score</code> for an optional numerical score, or flags like <code>llm.eval.hallucination_detected</code> and <code>llm.eval.refusal_detected</code>. These attributes travel with the trace, so you can filter and aggregate on them in your observability backend just like any other field.</p>
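<p>As a rough illustration, the helper below computes a few of these attributes with cheap, synchronous checks. The refusal phrases, the word-overlap score used as a relevance proxy, and the <code>inline_eval_attributes</code> name are all assumptions; real checks would be tuned to your application.</p>
<pre><code class="language-python"># Hypothetical phrases that suggest the model declined to answer.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def inline_eval_attributes(response_text, context_docs):
    """Run cheap synchronous checks and return them as llm.eval.* attributes."""
    lowered = response_text.lower()
    refusal = any(marker in lowered for marker in REFUSAL_MARKERS)

    # Crude relevance proxy: share of response words found in the context.
    context_words = set(" ".join(context_docs).lower().split())
    words = lowered.split()
    overlap = sum(1 for word in words if word in context_words)
    relevance = round(overlap / len(words), 2) if words else 0.0

    return {
        "llm.eval.refusal_detected": refusal,
        "llm.eval.relevance_score": relevance,
        # Pass only when there is a non-empty answer that is not a refusal.
        "llm.eval.passed": bool(words) and not refusal,
    }


attrs = inline_eval_attributes(
    "OpenTelemetry provides observability.",
    ["OpenTelemetry provides vendor-neutral observability."],
)
refused = inline_eval_attributes("I can't help with that.", [])
</code></pre>
<p>Checks like these are deliberately simple so they add almost no latency to the request. Their value comes from being recorded on every trace, which lets you spot shifts in refusal rate or grounding over time.</p>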
<p>For higher accuracy, you can introduce model-based evaluation as a separate step. In this pattern, an evaluator LLM runs asynchronously on the original prompt and response, and its work is captured in a child span (for example, <code>llm.eval</code>) that shares the same trace ID as the main <code>llm.call</code> span. You then attach scores such as relevance, faithfulness, or toxicity to that evaluation span.</p>
<p>Because the evaluation span shares the same trace ID, you can correlate quality regressions with changes in prompts or retrieval.</p>
<h2 id="heading-exporting-and-visualizing-traces-where-this-fits-with-vendor-tooling">Exporting and Visualizing Traces (Where This Fits with Vendor Tooling)</h2>
<p>This code-first observability design is vendor-agnostic. Once traces are emitted using OpenTelemetry, they can be exported to different backends without changing instrumentation.</p>
<p>General-purpose tracing systems like Jaeger and Grafana Tempo help engineers debug latency, errors, and request flow across retrieval, prompting, and model calls, answering how the system behaved. LLM-focused platforms such as Arize Phoenix use the same data but add model-specific insights like prompt clustering, token analysis, and quality correlation.</p>
<p>Because instrumentation stays OpenTelemetry-native, you maintain full control over attributes and trace structure while still using vendor dashboards, and you can switch backends as your needs evolve without touching the application code.</p>
<h2 id="heading-operational-patterns-and-anti-patterns">Operational Patterns and Anti-Patterns</h2>
<p>Effective LLM observability requires disciplined practices. High-volume systems should sample traces to limit overhead, and prompts or responses should be hashed by default to reduce storage and privacy risk. Traces must be treated as production data, with proper access control and retention policies.</p>
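<p>Hashing by default can be as simple as the sketch below. The attribute names are hypothetical and are not taken from any semantic convention. The idea is only that the trace carries a stable fingerprint and a length instead of the raw text.</p>
<pre><code class="lang-python">import hashlib

def hash_text(text):
    """Return a SHA-256 hex digest so the raw text never reaches the trace."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def redacted_attributes(prompt, response):
    """Build span attributes that identify a prompt/response pair without storing it."""
    return {
        "llm.prompt.sha256": hash_text(prompt),      # hypothetical attribute name
        "llm.prompt.length": len(prompt),
        "llm.response.sha256": hash_text(response),  # hypothetical attribute name
        "llm.response.length": len(response),
    }

print(redacted_attributes("Summarize my last invoice", "Your last invoice was paid."))
</code></pre>
<p>Identical prompts produce identical hashes, so you can still group and count them. Keep in mind that short or guessable prompts can be recovered from an unsalted hash by brute force, so treat the hash as a way to reduce exposure, not as anonymization.</p>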
<p>Common pitfalls include relying only on vendor SDK traces, logging prompts without trace correlation, or ignoring evaluation signals. These issues fragment visibility and hide quality regressions, especially when observability focuses only on agents instead of full application context.</p>
<h2 id="heading-extending-the-system">Extending the System</h2>
<p>Once traces are reliable, they support advanced capabilities. Metrics like p95 latency can be derived from spans, logs can be linked using trace IDs, and historical traces can power offline evaluation or prompt testing.</p>
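<p>To make the p95 example concrete, here is a small sketch that computes a nearest-rank percentile over span durations. The durations are made-up sample data. In practice your tracing backend usually computes this for you.</p>
<pre><code class="lang-python">import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct percent of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up durations in milliseconds for ten llm.call spans.
span_durations_ms = [120, 95, 180, 2400, 130, 110, 105, 150, 140, 125]
print(percentile(span_durations_ms, 95))  # 2400
</code></pre>
<p>The single slow call dominates the p95 here even though the typical call is fast, which is why tail percentiles are more useful than averages for LLM latency.</p>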
<p>By following OpenTelemetry conventions, the observability stack also stays aligned with emerging LLM semantic standards, keeping the system flexible and future-proof.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>End-to-end LLM observability is not achieved by installing another agent. It is achieved through intentional span design, meaningful semantic attributes, and, where needed, lightweight evaluation hooks.</p>
<p>By treating LLM calls as first-class operations within distributed traces, you gain faster debugging, controlled costs, safer deployments, and measurable quality improvements. The backend, whether Jaeger, Tempo, or Phoenix, is interchangeable. The instrumentation strategy is not.</p>
<p>A well-designed trace is the most valuable artifact in a production LLM system.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build AI Agents That Remember User Preferences (Without Breaking Context) ]]>
                </title>
                <description>
                    <![CDATA[ Why Personalization Breaks Most AI Agents Personalization is one of the most requested features in AI-powered applications. Users expect an agent to remember their preferences, adapt to their style, and improve over time. In practice, personalization... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-ai-agents-that-remember-user-preferences-without-breaking-context/</link>
                <guid isPermaLink="false">698cc32db8fec0245bd9996d</guid>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ System Design ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software architecture ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Developer Tools ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tools ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Wed, 11 Feb 2026 17:58:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770832641633/da49bdca-617e-4272-b5b7-012f3c6c1d61.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <h2 id="heading-why-personalization-breaks-most-ai-agents"><strong>Why Personalization Breaks Most AI Agents</strong></h2>
<p>Personalization is one of the most requested features in AI-powered applications. Users expect an agent to remember their preferences, adapt to their style, and improve over time.</p>
<p>In practice, personalization is also one of the fastest ways to break an otherwise working AI agent.</p>
<p>Many agents start with a simple idea: keep adding more conversation history to the prompt. This approach works for demos, but it quickly fails in real applications. Context windows grow too large. Irrelevant information leaks into decisions. Costs increase. Debugging becomes nearly impossible.</p>
<p>If you want a personalized agent that survives production, you need more than a large language model. You need a way to connect the agent to tools, manage multi-step workflows, and store user preferences safely over time – without turning your system into a tangled mess of prompts and callbacks.</p>
<p>In this tutorial, you’ll learn how to design a personalized AI agent using three core building blocks:</p>
<ul>
<li><p><strong>Agent Development Kit (ADK)</strong> to orchestrate agent reasoning and execution</p>
</li>
<li><p><strong>Model Context Protocol (MCP)</strong> to connect tools with clear boundaries</p>
</li>
<li><p><strong>Long-term memory</strong> to store preferences without polluting context</p>
</li>
</ul>
<p>Rather than focusing on setup commands or vendor-specific walkthroughs, we'll focus on the architectural patterns that make personalized agents reliable, debuggable, and maintainable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578645884/2fd77443-31d5-4db3-98f0-bba685122a6f.png" alt="User preferences influence an AI agent’s personalized response" class="image--center mx-auto" width="1452" height="578" loading="lazy"></p>
<p><em>Figure 1 — Personalization influences agent responses</em></p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#what-personalized-means-in-a-real-ai-agent">What “Personalized” Means in a Real AI Agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-the-agent-architecture-fits-together">How the Agent Architecture Fits Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-design-the-agent-core-with-adk">How to Design the Agent Core with ADK</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-connect-tools-safely-with-mcp">How to Connect Tools Safely with MCP</a></p>
</li>
<li><p><a class="post-section-overview" href="#how-to-add-long-term-memory-without-polluting-context">How to Add Long-Term Memory Without Polluting Context</a></p>
<ul>
<li><a class="post-section-overview" href="#privacy-consent-and-lifecycle-controls-production-checklist">Privacy, Consent, and Lifecycle Controls (Production Checklist)</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#how-the-end-to-end-agent-flow-works">How the End-to-End Agent Flow Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#common-pitfalls-youll-hit-and-how-to-avoid-them">Common Pitfalls You’ll Hit (and How to Avoid Them)</a></p>
</li>
<li><p><a class="post-section-overview" href="#what-you-learned-and-where-to-go-next">What You Learned and Where to Go Next</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To follow along with this tutorial, you should have:</p>
<ul>
<li><p>Basic familiarity with Python</p>
</li>
<li><p>A general understanding of how large language models work</p>
</li>
<li><p>Optional: a Google Cloud account if you want to run an end-to-end demo. Otherwise, you can follow the architecture and code patterns locally with stubs. We’ll avoid deep infrastructure setup and focus on design patterns rather than deployment mechanics.</p>
</li>
</ul>
<p>You don’t need prior experience with ADK or MCP. I’ll introduce each concept as it appears.</p>
<h2 id="heading-what-personalized-means-in-a-real-ai-agent"><strong>What “Personalized” Means in a Real AI Agent</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578714303/4d25a7e4-fcdd-4a1a-a12c-411e41f2021f.png" alt="An AI agent accesses external tools through a protocol boundary/control layer" class="image--center mx-auto" width="1382" height="670" loading="lazy"></p>
<p><em>Figure 2 — Keep preferences out of the prompt: agent ↔ tools across a protocol boundary</em></p>
<p>Before writing any code, it’s important to define what personalization means in an AI agent.</p>
<p>Personalization is not the same as “remembering everything.” In practice, agent state usually falls into three categories:</p>
<ol>
<li><p><strong>Short-term context:</strong> Information needed to complete the current task. This belongs in the prompt.</p>
</li>
<li><p><strong>Session state:</strong> Temporary decisions or selections made during a workflow. This should be structured and scoped to a session.</p>
</li>
<li><p><strong>Long-term memory:</strong> Durable user preferences or facts that should persist across sessions.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577191953/3df5aa02-2eb9-4214-bbef-52f18ddb353a.png" alt="Three panels comparing short-term context, session state, and long-term memory" class="image--center mx-auto" width="946" height="510" loading="lazy"></p>
<p><em>Figure 3 — Three kinds of agent state: context (now), session (today), memory (always)</em></p>
<p>Most problems happen when these categories are mixed together.</p>
<p>If you store long-term preferences directly in the prompt, the agent’s behavior becomes unpredictable. If you store everything permanently, memory grows without bounds. If you don’t scope memory at all, unrelated sessions start influencing each other.</p>
<p>A well-designed, personalized agent treats memory as a first-class system component, not as extra text added to a prompt.</p>
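<p>One way to keep the three categories apart is to give each its own field with its own lifetime. The following is a minimal sketch. The <code>AgentState</code> class and its method are hypothetical names used only for illustration.</p>
<pre><code class="lang-python">from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class AgentState:
    # Short-term context: what goes into the prompt for the current task.
    context: List[str] = field(default_factory=list)
    # Session state: structured choices scoped to one workflow.
    session: Dict[str, Any] = field(default_factory=dict)
    # Long-term memory: durable preferences that persist across sessions.
    memory: Dict[str, Any] = field(default_factory=dict)

    def end_session(self):
        """Drop everything except long-term memory."""
        self.context.clear()
        self.session.clear()

state = AgentState(memory={"language": "Python"})
state.context.append("User asked for a sorting example")
state.session["selected_file"] = "utils.py"
state.end_session()
print(state.memory)  # {'language': 'Python'}
</code></pre>
<p>Because each category is a separate field, nothing moves from the session into long-term memory unless your code does it explicitly.</p>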
<p>In the next section, we'll look at how to structure the agent so these concerns stay separated. </p>
<p>By the end of this tutorial, you’ll understand how to design a personalized AI agent that uses long-term memory safely, connects to tools through clear boundaries, and remains debuggable as it grows.</p>
<h2 id="heading-how-the-agent-architecture-fits-together"><strong>How the Agent Architecture Fits Together</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577351960/9b14cadf-d650-4098-8ce1-9fd706537bb9.png" alt="Reference architecture showing a user, an AI agent core, tools, a memory service, and an orchestration runtime" class="image--center mx-auto" width="1100" height="554" loading="lazy"></p>
<p><em>Figure 4 — Reference architecture: agent core + tools + memory service + orchestration runtime</em></p>
<p>The above diagram shows a high-level, personalized AI agent architecture. In it, an agent core handles reasoning and planning while interacting with a tool interface layer, a long-term memory service, and an orchestration runtime.</p>
<p>Let’s now understand the moving parts of a personalized agent and how they interact.</p>
<p>At a high level, the system has four responsibilities:</p>
<ol>
<li><p><strong>Reasoning</strong> – deciding what to do next</p>
</li>
<li><p><strong>Execution</strong> – calling tools and services</p>
</li>
<li><p><strong>Memory</strong> – storing and retrieving long-term preferences</p>
</li>
<li><p><strong>Boundaries</strong> – controlling what the agent is allowed to do</p>
</li>
</ol>
<p>A common mistake is to blur these responsibilities together, for example by letting the model decide when to write memory or by allowing tools to execute actions without clear constraints.</p>
<p>Instead, you'll design the system so each responsibility has a clear owner. The core components look like this:</p>
<ul>
<li><p><strong>Agent core</strong>: Handles reasoning and planning</p>
</li>
<li><p><strong>Tools</strong>: Perform external actions (read or write)</p>
</li>
<li><p><strong>MCP layer</strong>: Defines how tools are exposed and invoked</p>
</li>
<li><p><strong>Memory services</strong>: Store long-term user data safely</p>
</li>
</ul>
<p>ADK sits at the center, orchestrating how requests flow between these components. The model never directly talks to databases or services. It reasons about actions, and ADK coordinates execution.</p>
<p>This separation makes the system easier to reason about, debug, and extend.</p>
<h2 id="heading-how-to-design-the-agent-core-with-adk"><strong>How to Design the Agent Core with ADK</strong></h2>
<p>Before we dive in, a quick note on what ADK is.</p>
<p><strong>Agent Development Kit (ADK)</strong> is an agent orchestration framework – the glue code between a large language model and your application. Instead of treating the model as a black box that directly “does things”, ADK helps you structure the agent as a system:</p>
<ul>
<li><p>The model focuses on <strong>reasoning</strong> (turning user intent, context, and memory into a structured plan)</p>
</li>
<li><p>Your runtime stays in control of <strong>execution</strong> (deciding which tools can run, how they run, and what gets logged or persisted)</p>
</li>
</ul>
<p>In other words, ADK is what lets you take tool calling and multi-step workflows out of a giant prompt and turn them into a maintainable and testable architecture. In this tutorial, we’ll use ADK to refer to that orchestration layer. The same patterns apply if you use a different agent framework.</p>
<p><strong>Note:</strong> The following code snippets are simplified reference examples intended to illustrate architectural patterns. They’re not production-ready drop-ins.</p>
<p>Once you understand the architecture, you can start designing the agent core. The agent core is responsible for reasoning, not execution.</p>
<p>A helpful mental model is to think of the agent as a planner, not a doer. Its role is to interpret the user’s goal, consider available context and memory, and produce a structured plan that can later be executed in a controlled way.</p>
<p>To make this concrete, the following example shows how an agent can translate user input and memory into an explicit plan. In practice, ADK orchestrates this using a large language model, but the important idea is that the output is structured intent, not side effects.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example for illustration.</span>

<span class="hljs-keyword">from</span> dataclasses <span class="hljs-keyword">import</span> dataclass
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List, Dict, Any

<span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Step</span>:</span>
    tool: str
    args: Dict[str, Any]

<span class="hljs-meta">@dataclass</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Plan</span>:</span>
    goal: str
    steps: List[Step]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">build_plan</span>(<span class="hljs-params">user_text: str, memory: Dict[str, Any]</span>) -&gt; Plan:</span>
    <span class="hljs-comment"># In practice, the LLM produces this structure via ADK orchestration.</span>
    goal = <span class="hljs-string">f"Help user: <span class="hljs-subst">{user_text}</span>"</span>
    steps = []
    <span class="hljs-keyword">if</span> memory.get(<span class="hljs-string">"prefers_short_answers"</span>):
        steps.append(Step(tool=<span class="hljs-string">"set_style"</span>, args={<span class="hljs-string">"verbosity"</span>: <span class="hljs-string">"low"</span>}))
    steps.append(Step(tool=<span class="hljs-string">"search_docs"</span>, args={<span class="hljs-string">"query"</span>: user_text}))
    steps.append(Step(tool=<span class="hljs-string">"summarize"</span>, args={<span class="hljs-string">"max_bullets"</span>: <span class="hljs-number">5</span>}))
    <span class="hljs-keyword">return</span> Plan(goal=goal, steps=steps)
</code></pre>
<p>This example illustrates an important constraint: the agent produces a plan, but it doesn’t execute anything directly.</p>
<p>The agent decides <em>what</em> should happen and <em>in what order</em>, while ADK controls <em>when</em> and <em>how</em> each step runs. This separation lets you inspect, test, and reason about decisions before they result in real-world actions.</p>
<p>When personalization is involved, this distinction becomes critical. Preferences may influence planning, but execution should remain tightly controlled by the runtime.</p>
<p>Because the agent is a planner and not a doer, it should not:</p>
<ul>
<li><p>Perform side effects directly</p>
</li>
<li><p>Write to databases</p>
</li>
<li><p>Call external APIs without supervision</p>
</li>
</ul>
<p>In ADK, this separation is natural. The agent produces intents and tool calls, while the runtime controls how and when those calls are executed.</p>
<p>This design has two major benefits:</p>
<ol>
<li><p><strong>Safety</strong> – you can restrict which tools the agent can access</p>
</li>
<li><p><strong>Debuggability</strong> – you can inspect decisions before execution</p>
</li>
</ol>
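<p>Both benefits come from the plan being plain data. As a sketch, the runtime can check a plan against an allowlist before anything runs. The <code>validate_plan</code> helper and the tool names are assumptions for illustration and are not part of ADK.</p>
<pre><code class="lang-python">from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Step:
    tool: str
    args: Dict[str, Any]

def validate_plan(steps, allowed_tools):
    """Return the steps that name a tool outside the allowlist."""
    return [step for step in steps if step.tool not in allowed_tools]

steps = [Step("search_docs", {"query": "retries"}), Step("delete_account", {})]
rejected = validate_plan(steps, allowed_tools={"search_docs", "summarize"})
print([step.tool for step in rejected])  # ['delete_account']
</code></pre>
<p>If the list of rejected steps is not empty, the runtime can refuse the whole plan and log it, all before any tool has been called.</p>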
<p>When personalization is involved, this becomes even more important. Preferences influence reasoning, but execution should remain tightly controlled.</p>
<h2 id="heading-how-to-connect-tools-safely-with-mcp"><strong>How to Connect Tools Safely with MCP</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578793149/2e3f8282-341a-4f03-9313-df3f8c9c5174.png" alt="Tool call routed through a control layer with request, validation, execution, and response steps." class="image--center mx-auto" width="1362" height="870" loading="lazy"></p>
<p><em>Figure 5 — Tool calls with guardrails: request → validate → execute → respond</em></p>
<p>Tools are how agents interact with the real world. They fetch data, generate artifacts, and sometimes perform actions with side effects.</p>
<p>Without clear boundaries, tool usage quickly becomes a source of fragility. Hardcoded API calls leak into prompts, tools evolve independently, and agents gain more authority than intended.</p>
<p>To avoid these problems, tools should be explicitly registered and invoked through a narrow interface. The following example shows a simple tool registry pattern that mirrors how MCP exposes tools to an agent without tightly coupling it to implementations.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration)</span>

<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Callable, Dict, Any

ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]

TOOLS: Dict[str, ToolFn] = {}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">register_tool</span>(<span class="hljs-params">name: str</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">decorator</span>(<span class="hljs-params">fn: ToolFn</span>):</span>
        TOOLS[name] = fn
        <span class="hljs-keyword">return</span> fn
    <span class="hljs-keyword">return</span> decorator

<span class="hljs-meta">@register_tool("search_docs")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">search_docs</span>(<span class="hljs-params">args: Dict[str, Any]</span>) -&gt; Dict[str, Any]:</span>
    query = args[<span class="hljs-string">"query"</span>]
    <span class="hljs-comment"># Replace with your MCP client call (or local tool implementation).</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"results"</span>: [<span class="hljs-string">f"doc://example?q=<span class="hljs-subst">{query}</span>"</span>]}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">invoke_tool</span>(<span class="hljs-params">name: str, args: Dict[str, Any]</span>) -&gt; Dict[str, Any]:</span>
    <span class="hljs-keyword">if</span> name <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> TOOLS:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Tool not allowed: <span class="hljs-subst">{name}</span>"</span>)
    <span class="hljs-keyword">return</span> TOOLS[name](args)
</code></pre>
<p>The Model Context Protocol (MCP) provides a clean way to formalize this pattern. A useful analogy is the way operating systems handle system calls.</p>
<p>An application does not directly manipulate hardware. Instead, it requests operations through well-defined system calls. The kernel decides whether the operation is allowed and how it executes.</p>
<p>In the same way, the agent knows <em>what</em> capabilities exist, MCP defines <em>how</em> those capabilities are invoked, and the runtime controls <em>when</em> and <em>whether</em> they execute.</p>
<p>This separation prevents several common problems, including hardcoded API details in prompts, unexpected breakage when tools change, and agents performing unrestricted side effects.</p>
<p>When designing tools, it helps to classify them by risk: read tools for safe queries, generate tools for planning or synthesis, and commit tools for irreversible actions. In a personalized agent, commit tools should be rare and tightly guarded.</p>
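<p>The risk classes can be made explicit in code. This is a minimal sketch in which the <code>Risk</code> enum, the tool names, and the <code>requires_approval</code> helper are hypothetical. MCP itself does not define these classes.</p>
<pre><code class="lang-python">from enum import Enum

class Risk(Enum):
    READ = "read"          # safe queries
    GENERATE = "generate"  # planning or synthesis
    COMMIT = "commit"      # irreversible actions

TOOL_RISK = {
    "search_docs": Risk.READ,
    "summarize": Risk.GENERATE,
    "send_email": Risk.COMMIT,
}

def requires_approval(tool_name):
    """Unknown tools default to the strictest class."""
    return TOOL_RISK.get(tool_name, Risk.COMMIT) is Risk.COMMIT

print(requires_approval("search_docs"), requires_approval("send_email"))  # False True
</code></pre>
<p>Defaulting unclassified tools to <code>COMMIT</code> means a newly added tool cannot cause side effects until someone has classified it.</p>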
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770580271505/d5d34514-3b98-4997-85ed-dee55e65d711.png" alt="Observability around tool calls using logs, traces, and timing across decision points" class="image--center mx-auto" width="996" height="606" loading="lazy"></p>
<p><em>Figure 6 — Observability around tool calls: logs, traces, timing, decision points</em></p>
<h2 id="heading-how-to-add-long-term-memory-without-polluting-context"><strong>How to Add Long-Term Memory Without Polluting Context</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770577944241/b2a3de65-c5e2-456e-8a33-e9fd4d2695f0.png" alt="Memory candidates extracted from user input, filtered and validated, then stored asynchronously" class="image--center mx-auto" width="1118" height="478" loading="lazy"></p>
<p><em>Figure 7 — Memory admission pipeline: extract → filter/validate → persist asynchronously</em></p>
<p>Memory is where personalization either succeeds or fails.</p>
<p>It’s tempting to start by storing everything the user says and feeding it back into the prompt. This works briefly, then collapses under its own weight as context grows, costs rise, and behavior becomes unpredictable.</p>
<p>A better approach is to treat memory as structured, curated data with clear admission rules, so you control what the agent remembers and why. Before persisting anything, the system should explicitly decide whether the information is worth remembering. The following function demonstrates a simple memory admission policy.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simplified Reference Only</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Optional, Dict, Any

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">memory_candidate</span>(<span class="hljs-params">user_text: str</span>) -&gt; Optional[Dict[str, Any]]:</span>
    text = user_text.lower()

    <span class="hljs-comment"># Safe: reject sensitive text first (basic example; add PII checks for your use case)</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"password"</span> <span class="hljs-keyword">in</span> text <span class="hljs-keyword">or</span> <span class="hljs-string">"ssn"</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

    <span class="hljs-comment"># Durable: skip instructions scoped to the current session</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"for this session"</span> <span class="hljs-keyword">in</span> text <span class="hljs-keyword">or</span> <span class="hljs-string">"ignore after"</span> <span class="hljs-keyword">in</span> text:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

    <span class="hljs-comment"># Reusable: a stable preference that can shape future responses</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"my preferred language is"</span> <span class="hljs-keyword">in</span> text:
        value = user_text.split()[<span class="hljs-number">-1</span>].strip(<span class="hljs-string">".,!?"</span>)  <span class="hljs-comment"># drop trailing punctuation</span>
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"type"</span>: <span class="hljs-string">"preference"</span>, <span class="hljs-string">"key"</span>: <span class="hljs-string">"language"</span>, <span class="hljs-string">"value"</span>: value}

    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>  <span class="hljs-comment"># default: don’t store</span>
</code></pre>
<p>This policy encodes three questions every memory candidate must answer:</p>
<ul>
<li><p>Is it durable? Will it still matter in the future?</p>
</li>
<li><p>Is it reusable? Will it influence future decisions meaningfully?</p>
</li>
<li><p>Is it safe to persist? Does it avoid sensitive or session-specific data?</p>
</li>
</ul>
<p>Only information that passes all three checks should become long-term memory. In practice, this usually includes stable preferences and long-lived constraints, not temporary instructions or intermediate reasoning.</p>
<h3 id="heading-privacy-consent-and-lifecycle-controls-production-checklist"><strong>Privacy, Consent, and Lifecycle Controls (Production Checklist)</strong></h3>
<p>Even if your admission rules are solid, long-term memory introduces governance requirements:</p>
<ul>
<li><p><strong>User control:</strong> allow users to view, export, and delete stored preferences at any time.</p>
</li>
<li><p><strong>Sensitive data handling:</strong> never store secrets/PII. Run PII detection on every memory candidate (and consider redaction).</p>
</li>
<li><p><strong>Retention + consent:</strong> use explicit consent for persistent memory and apply retention windows (TTL) so memory expires unless it’s still useful.</p>
</li>
<li><p><strong>Security + auditability:</strong> encrypt at rest, restrict access by service identity, and keep an audit log of memory writes/updates.</p>
</li>
</ul>
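<p>Retention windows and user deletion can be sketched with an in-memory store like the one below. The <code>PreferenceStore</code> class is hypothetical and deliberately simple. A production system would back it with a database and add the encryption and audit logging listed above.</p>
<pre><code class="lang-python">import time

class PreferenceStore:
    """In-memory preference store where every entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._items = {}

    def put(self, user_id, key, value, now=None):
        now = time.time() if now is None else now
        self._items[(user_id, key)] = (value, now + self.ttl_seconds)

    def get(self, user_id, key, now=None):
        now = time.time() if now is None else now
        entry = self._items.get((user_id, key))
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # Expired entries are removed on read.
            del self._items[(user_id, key)]
            return None
        return value

    def delete_user(self, user_id):
        """Remove everything stored for one user (user-initiated deletion)."""
        for item_key in [k for k in self._items if k[0] == user_id]:
            del self._items[item_key]

store = PreferenceStore(ttl_seconds=60)
store.put("u1", "language", "Python", now=0)
print(store.get("u1", "language", now=59), store.get("u1", "language", now=60))  # Python None
</code></pre>
<p>Passing <code>now</code> explicitly is only there to make expiry easy to test. Callers would normally omit it and let the store use the current time.</p>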
<p>Memory writes should also be asynchronous. The agent should never block while persisting memory, which keeps interactions responsive and avoids coupling reasoning to storage latency.</p>
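<p>A minimal sketch of a non-blocking write, using a background thread from the standard library, is shown below. The <code>MemoryStore</code> class is hypothetical. It simply provides <code>get</code> and <code>write_async</code> methods in the shape this article's request handler expects.</p>
<pre><code class="lang-python">from concurrent.futures import ThreadPoolExecutor

class MemoryStore:
    """Toy store whose writes run on a background thread."""

    def __init__(self):
        self._data = {}
        self._pool = ThreadPoolExecutor(max_workers=1)

    def get(self, user_id):
        return dict(self._data.get(user_id, {}))

    def _write(self, user_id, candidate):
        self._data.setdefault(user_id, {})[candidate["key"]] = candidate["value"]

    def write_async(self, user_id, candidate):
        # Returns a Future immediately; the caller does not wait for storage.
        return self._pool.submit(self._write, user_id, candidate)

    def close(self):
        self._pool.shutdown(wait=True)

store = MemoryStore()
future = store.write_async("u1", {"type": "preference", "key": "language", "value": "Python"})
future.result()  # Only for this demo; a request handler would not wait here.
print(store.get("u1"))
store.close()
</code></pre>
<p>A single worker thread keeps writes ordered per process. If a write fails, the exception is stored on the returned future, so a real system should attach a callback or log handler and not discard it.</p>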
<h2 id="heading-how-the-end-to-end-agent-flow-works"><strong>How the End-to-End Agent Flow Works</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770578847727/f3cbc4b9-5bc9-4026-ae69-6fd7bc1625fc.png" alt="End-to-end flow showing user input, agent reasoning, tool invocation, and memory updates with feedback loops" class="image--center mx-auto" width="1134" height="308" loading="lazy"></p>
<p><em>Figure 8 — End-to-end request lifecycle: user input → plan → tools → memory updates</em></p>
<p>With the individual components in place, it helps to see how they work together during a single request. The following example walks through the full lifecycle of a personalized interaction, from user input to response, so you can trace exactly how memory and tools interact.</p>
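<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration).</span>
<span class="hljs-comment"># memory_store and render_response are assumed to be provided by your application;</span>
<span class="hljs-comment"># build_plan, invoke_tool, and memory_candidate are the functions defined earlier.</span>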

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_request</span>(<span class="hljs-params">user_id: str, user_text: str</span>) -&gt; str:</span>
    memory = memory_store.get(user_id)  <span class="hljs-comment"># e.g., {"prefers_short_answers": True}</span>
    plan = build_plan(user_text, memory)

    tool_outputs = []
    <span class="hljs-keyword">for</span> step <span class="hljs-keyword">in</span> plan.steps:
        out = invoke_tool(step.tool, step.args)
        tool_outputs.append({step.tool: out})

    response = render_response(goal=plan.goal, tool_outputs=tool_outputs, memory=memory)

    cand = memory_candidate(user_text)
    <span class="hljs-keyword">if</span> cand:
        <span class="hljs-comment"># Never block the user on storage.</span>
        memory_store.write_async(user_id, cand)
    <span class="hljs-keyword">return</span> response
</code></pre>
<p>At a high level, the flow looks like this:</p>
<ol>
<li><p>The user sends a message.</p>
</li>
<li><p>Relevant long-term memory is retrieved.</p>
</li>
<li><p>The agent reasons about the request and produces a plan.</p>
</li>
<li><p>ADK invokes tools through MCP as needed.</p>
</li>
<li><p>Results flow back to the agent.</p>
</li>
<li><p>The agent decides whether new information should be persisted.</p>
</li>
<li><p>Memory is written asynchronously.</p>
</li>
<li><p>The final response is returned to the user.</p>
</li>
</ol>
<p>Notice what does <strong>not</strong> happen: the model does not directly write memory, tools do not execute without coordination, and context does not grow without bounds. This structure keeps personalization controlled and predictable.</p>
<h2 id="heading-common-pitfalls-youll-hit-and-how-to-avoid-them"><strong>Common Pitfalls You’ll Hit (and How to Avoid Them)</strong></h2>
<p>Even with a solid architecture, there are a few failure modes that show up repeatedly in real systems. Many of them stem from allowing agents to perform irreversible actions without explicit checks.</p>
<p>The following example shows a simple guardrail for commit-style tools that require approval before execution.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Reference example (pseudocode for illustration)</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Any, Dict

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">invoke_commit_tool</span>(<span class="hljs-params">name: str, args: Dict[str, Any], approved: bool</span>) -&gt; Dict[str, Any]:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> approved:
        <span class="hljs-comment"># Require explicit confirmation or policy approval before side effects.</span>
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"blocked"</span>, <span class="hljs-string">"reason"</span>: <span class="hljs-string">"commit tools require approval"</span>}

    <span class="hljs-comment"># For example: create_ticket, send_email, submit_order, update_record</span>
    <span class="hljs-comment"># invoke_tool is assumed to be the agent's existing tool dispatcher, defined elsewhere.</span>
    <span class="hljs-keyword">return</span> invoke_tool(name, args)
</code></pre>
<p>This pattern forces a clear decision point before side effects occur. It also creates an audit trail that explains <em>why</em> an action was allowed or blocked.</p>
<p>Other common pitfalls include over-personalization, leaky memory that persists session-specific data, uncontrolled tool growth, and debugging blind spots caused by unclear boundaries. If you see these symptoms, it usually means responsibilities are not clearly separated.</p>
<h2 id="heading-what-you-learned-and-where-to-go-next"><strong>What You Learned and Where to Go Next</strong></h2>
<p>Personalized AI agents are powerful, but they require discipline. The key insight is that personalization is a <strong>systems problem</strong>, not a prompt problem.</p>
<p>By separating reasoning from execution, structuring memory carefully, and using protocols like MCP to enforce boundaries, you can build agents that scale beyond demos and remain maintainable in production.</p>
<p>As you extend this system, resist the urge to add “just one more prompt tweak.” Instead, ask whether the change belongs in memory, tools, or orchestration. That mindset will save you time as your agent grows in complexity.</p>
<p>If you’d like to continue the conversation, you can find me on <a target="_blank" href="https://www.linkedin.com/in/natarajsundar/">LinkedIn</a>.</p>
<p><em>All diagrams in this article were created by the author for educational purposes.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Debug Kubernetes Apps When Logs Fail You – An eBPF Tracing Handbook ]]>
                </title>
                <description>
                    <![CDATA[ Let’s say your Kubernetes pod crashes at 3am and the logs show nothing useful. By the time you SSH into the node, the container is gone, and you're left guessing what happened in those final moments. This is the reality of debugging modern applicatio... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-kubernetes-apps-when-logs-fail-you-an-ebpf-tracing-handbook/</link>
                <guid isPermaLink="false">694190c566a5d5cb99995f9f</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ eBPF ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ inspektor gadget ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Tue, 16 Dec 2025 17:03:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765899860869/3eadf316-8539-4624-afba-1d4190b6c62a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let’s say your Kubernetes pod crashes at 3am and the logs show nothing useful. By the time you SSH into the node, the container is gone, and you're left guessing what happened in those final moments.</p>
<p>This is the reality of debugging modern applications. Traditional monitoring wasn't built for containers that live for seconds, services that shift across nodes, or network paths that change constantly.</p>
<p>eBPF changes this. It lets you see <em>inside</em> the kernel itself, watching every system call, every network packet, and every process execution – without modifying a single line of code.</p>
<p>In this tutorial, you will trace a real Kubernetes application using eBPF-powered tools. You’ll learn fundamentals that apply across the entire modern observability ecosystem, with gadgets from the Inspektor Gadget ecosystem.</p>
<p>By the end, you’ll be able to:</p>
<ul>
<li><p>Trace requests as they move through your Kubernetes pods</p>
</li>
<li><p>Observe behavior at the kernel and syscall level</p>
</li>
<li><p>Debug failures that logs and metrics simply can’t explain</p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p><strong>Knowledge requirements:</strong></p>
<ul>
<li><p>Basic Kubernetes concepts: pods, deployments, services, namespaces</p>
</li>
<li><p>Familiarity with kubectl: <code>get</code>, <code>describe</code>, <code>logs</code>, <code>exec</code></p>
</li>
<li><p>Container basics</p>
</li>
<li><p>Basic Linux concepts: processes, system calls</p>
</li>
</ul>
<p><strong>Technical requirements:</strong></p>
<ul>
<li><p>Kubernetes cluster (local or cloud-based)</p>
</li>
<li><p><code>kubectl</code> installed and configured</p>
</li>
<li><p>Cluster admin permissions</p>
</li>
<li><p>Linux kernel 5.10+ (most managed services have this)</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-understanding-ebpf-observability">Understanding eBPF Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-ebpf-tracing-works-without-getting-lost-in-the-kernel">How eBPF Tracing Works (Without Getting Lost in the Kernel)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-environment">How to Set Up Your Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-trace-your-first-request-hands-on-tutorial">How to Trace Your First Request: Hands-On Tutorial</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-interpret-traces">How to Interpret Traces</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-tracing-insights">Advanced Tracing Insights</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices-and-production-considerations">Best Practices and Production Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps-and-resources">Next Steps and Resources</a></p>
</li>
</ul>
<h2 id="heading-understanding-ebpf-observability">Understanding eBPF Observability</h2>
<p>eBPF (extended Berkeley Packet Filter) is a technology that allows you to run custom programs inside the Linux kernel without changing kernel code or loading kernel modules.</p>
<p>The Linux kernel is the control center of your operating system. Historically, if you wanted to observe low-level activity (like network packets, system calls, or file operations), you had to rely on kernel changes or kernel modules. Both approaches were fragile, difficult to maintain, and carried real stability and security risks.</p>
<p>eBPF shifts how we approach observability. It provides a safe, sandboxed environment where you can run observability programs directly in the kernel with built-in safety checks that prevent crashes or security vulnerabilities.</p>
<h3 id="heading-why-does-this-matter-for-observability">Why does this matter for observability?</h3>
<p>In traditional observability, you instrument your application code. You add logging statements, metrics libraries, and tracing SDKs. This works, but has significant limitations:</p>
<ul>
<li><p><strong>Code changes are required</strong>: You must modify and redeploy applications</p>
</li>
<li><p><strong>It’s language-specific</strong>: Different languages need different libraries</p>
</li>
<li><p><strong>There will likely be blind spots</strong>: You can only see what you explicitly instrument</p>
</li>
<li><p><strong>The overhead</strong>: Heavy instrumentation slows down applications</p>
</li>
<li><p><strong>Container challenges</strong>: By the time you add instrumentation and redeploy, the problem may have disappeared</p>
</li>
</ul>
<p>eBPF takes a different approach. Instead of instrumenting applications, you instrument the kernel. Since every application ultimately makes system calls to the kernel for network I/O, file operations, and process management, you can observe everything from one vantage point.</p>
<h3 id="heading-the-ebpf-advantage-for-kubernetes">The eBPF advantage for Kubernetes</h3>
<p>Kubernetes adds another layer of complexity. Your application might be spread across multiple containers, pods, and nodes. Traditional APM (Application Performance Monitoring) tools struggle here because containers come and go rapidly, network topology changes constantly, service meshes add routing complexity, and you often don't control the application code (for example, third-party services or legacy applications you can't modify).</p>
<p>eBPF doesn't care about any of this. It sees all activity at the kernel level, regardless of what language your app is written in, whether it's containerized, how many times the pod has been rescheduled, or whether you have access to modify the source code. This universal visibility is why the Cloud Native Computing Foundation (CNCF) and major cloud providers are betting heavily on eBPF for the future of observability.</p>
<h2 id="heading-how-ebpf-tracing-works-without-getting-lost-in-the-kernel">How eBPF Tracing Works (Without Getting Lost in the Kernel)</h2>
<p>When your application runs on Kubernetes, there's a clear separation between user space and kernel space. Your code runs in user space, where it's isolated, safe, and has limited access to system resources. To do anything useful – make network calls, read files, allocate memory – your application must ask the kernel for help. The kernel handles these requests via system calls, commonly called syscalls.</p>
<p>eBPF lets us hook into these syscalls without slowing the system down. It’s like having a CCTV camera at every doorway between user space and kernel space, watching who passes through, when, and what they’re carrying.</p>
<h3 id="heading-a-simple-example-http-request-tracing">A Simple Example: HTTP Request Tracing</h3>
<p>Your application initiates an HTTP GET request, which needs to go through the network stack. To establish a connection, your application first makes a <code>socket()</code> system call to create a network socket. Then it calls <code>connect()</code> to establish a connection to the remote server. Once connected, it uses <code>send()</code> to transmit the HTTP request. Network packets are sent across the wire, and eventually your application calls <code>recv()</code> to receive the response.</p>
<p>With eBPF tools like Inspektor Gadget's Traceloop, you can automatically hook into these syscalls. The eBPF program captures request metadata including source and destination IPs, ports, timing information, and payload sizes. You get a complete trace of the request without touching your application code.</p>
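<p>To see the sequence described above from the application side, here is a small self-contained sketch that talks to a throwaway local server, so it runs without a cluster. Each commented call corresponds to one of the syscalls named in the text. The server, the port selection, and the response body are assumptions made for the demo. On some architectures a tracer may record <code>send()</code> and <code>recv()</code> under the names <code>sendto</code> and <code>recvfrom</code>.</p>
<pre><code class="lang-python">import socket
import threading


def serve_once(listener):
    # Accept one connection, read the request, and send a fixed HTTP response.
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")


def fetch(port):
    # socket(): create a TCP socket.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        # connect(): open the connection to the server.
        client.connect(("127.0.0.1", port))
        # send(): transmit the HTTP request.
        client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        # recv(): read until the server closes the connection.
        chunks = []
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)


listener = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick a free port
port = listener.getsockname()[1]
server = threading.Thread(target=serve_once, args=(listener,))
server.start()
response = fetch(port)
server.join()
listener.close()
print(response.split(b"\r\n")[0].decode())
</code></pre>
<p>If you run this script inside a traced container, these are the calls a syscall-level tracer would observe, even though the script itself contains no instrumentation.</p>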
<h3 id="heading-the-ebpf-execution-flow">The eBPF Execution Flow</h3>
<p>Here's what happens under the hood when you run a trace. When you deploy Inspektor Gadget and run a gadget, an eBPF program is loaded into the kernel and attached to the relevant hooks. From then on, that program runs whenever a traced event occurs.</p>
<p>When your application makes a syscall, the eBPF hook triggers and quickly collects relevant data: timestamps, process IDs, container IDs, pod names, request details, and latency information. This data is sent to user space through eBPF maps, which are efficient data structures for kernel-to-userspace communication.</p>
<p>Inspektor Gadget adds Kubernetes context to raw kernel data. Instead of seeing only process IDs, you can see pod names, namespaces, labels, and other metadata. For example, you can tell that a request originated from the frontend pod in the production namespace and targeted the backend service.</p>
<p>The gadget then presents this information in a format that's immediately useful, whether you're using the CLI or integrating with other observability tools.</p>
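<p>The enrichment step can be pictured as a lookup that joins a raw kernel event with container metadata. The sketch below is a simplified illustration, not Inspektor Gadget's implementation. The <code>CONTAINERS</code> table, the container ID, and the field names are all made up for the example.</p>
<pre><code class="lang-python"># Hypothetical lookup table. A real tool would build this from the
# container runtime and the Kubernetes API rather than hard-coding it.
CONTAINERS = {
    "c0ffee": {"pod": "frontend-abc123", "namespace": "production"},
}


def enrich(event, containers):
    # Copy the event so the raw kernel data is left untouched.
    enriched = dict(event)
    meta = containers.get(event["container_id"])
    # Unknown containers keep their raw fields and get empty Kubernetes context.
    enriched["pod"] = meta["pod"] if meta else None
    enriched["namespace"] = meta["namespace"] if meta else None
    return enriched


raw = {"pid": 1234, "container_id": "c0ffee", "syscall": "connect"}
print(enrich(raw, CONTAINERS))
</code></pre>
<p>Keeping the raw event separate from the enriched copy means an event from a container that has already exited can still be reported, just without pod metadata.</p>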
<p>eBPF is fast because:</p>
<ul>
<li><p><strong>JIT compilation</strong>: Programs are turned into native machine code for maximum performance</p>
</li>
<li><p><strong>Event-driven</strong>: Only execute when relevant events occur, not continuously polling</p>
</li>
<li><p><strong>Kernel-resident</strong>: No expensive context switching between kernel and user space</p>
</li>
<li><p><strong>Highly optimized</strong>: Overhead is typically low (often cited as under 5%), though the exact cost depends on how many events you trace and how much work each program does</p>
</li>
</ul>
<h3 id="heading-the-tool-inspektor-gadget-amp-traceloop">The Tool: Inspektor Gadget &amp; Traceloop</h3>
<p>For this tutorial, we're using Traceloop, an eBPF-based tool that traces request flows through applications by observing syscalls, network calls, and I/O operations at the kernel level.</p>
<p>Why are we using Traceloop for this tutorial?</p>
<ul>
<li><p>It’s quick to install and run (one command)</p>
</li>
<li><p>The output maps directly to the application’s behavior</p>
</li>
<li><p>It automatically adds Kubernetes context (pod names, namespaces)</p>
</li>
<li><p>You don’t need to make any application code changes</p>
</li>
</ul>
<p>What you'll learn applies beyond Traceloop. Other eBPF tracing tools (Pixie, Cilium Hubble, Tetragon) build on the same principle: they attach to kernel hooks and collect event data. Once you understand the concepts here, you can pick up any eBPF observability tool more quickly.</p>
<h2 id="heading-how-to-set-up-your-environment">How to Set Up Your Environment</h2>
<p>To get your environment ready for hands-on tracing, we'll verify that your cluster meets the requirements, install Inspektor Gadget, and deploy a sample application to trace.</p>
<h3 id="heading-verify-that-your-cluster-meets-the-requirements">Verify that Your Cluster Meets the Requirements</h3>
<p>Before installing anything, confirm that your Kubernetes cluster is ready for eBPF.</p>
<h4 id="heading-check-your-kubernetes-version">Check your Kubernetes version:</h4>
<pre><code class="lang-bash">kubectl version
</code></pre>
<p>You need Kubernetes 1.19 or later. Most modern clusters exceed this requirement, but it's worth verifying.</p>
<h4 id="heading-verify-kernel-version-on-your-nodes">Verify kernel version on your nodes:</h4>
<pre><code class="lang-bash">kubectl get nodes -o wide
</code></pre>
<p>The <code>KERNEL-VERSION</code> column in that output shows the kernel each node is running. You can also check directly on one of your nodes:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># If using a local cluster like minikube or kind</span>
uname -r

<span class="hljs-comment"># For cloud clusters, you might need to check node details</span>
kubectl debug node/&lt;node-name&gt; -it --image=ubuntu -- bash -c <span class="hljs-string">"uname -r"</span>
</code></pre>
<p>You need Linux kernel 5.10 or later for the best eBPF support. Kernel 4.18+ works but with some limitations. If you're using a managed Kubernetes service (GKE, EKS, AKS), you almost certainly have a compatible kernel.</p>
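<p>If you have many nodes to check, the thresholds above are easy to encode. The following sketch classifies a kernel release string such as the output of <code>uname -r</code>. The function names and the three labels are illustrative choices, and "below-minimum" simply means older than the 4.18 floor mentioned above.</p>
<pre><code class="lang-python">import re
from operator import ge

RECOMMENDED = (5, 10)  # best eBPF support, per the requirement above
MINIMUM = (4, 18)      # works, with some limitations


def parse_kernel(release):
    # "5.15.0-91-generic" becomes (5, 15); everything after major.minor is ignored.
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def ebpf_support(release):
    version = parse_kernel(release)
    # Tuples compare element by element, so (5, 4) sorts below (5, 10).
    if ge(version, RECOMMENDED):
        return "full"
    if ge(version, MINIMUM):
        return "limited"
    return "below-minimum"


print(ebpf_support("5.15.0-91-generic"))  # full
print(ebpf_support("5.4.0-150-generic"))  # limited
</code></pre>
<p>Comparing versions as integer tuples avoids the classic string-comparison mistake where "5.4" appears newer than "5.10".</p>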
<h4 id="heading-confirm-that-you-have-cluster-admin-permissions">Confirm that you have cluster admin permissions:</h4>
<pre><code class="lang-bash">kubectl auth can-i create deployments --all-namespaces
</code></pre>
<p>This should return "yes". Inspektor Gadget needs elevated permissions to load eBPF programs into the kernel.</p>
<h3 id="heading-install-inspektor-gadget">Install Inspektor Gadget</h3>
<p>You can install Inspektor Gadget in several ways. We'll use the kubectl plugin method as it's the most straightforward for learning.</p>
<h4 id="heading-install-the-kubectl-gadget-plugin">Install the kubectl gadget plugin:</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># Download and install kubectl-gadget</span>
kubectl krew install gadget

<span class="hljs-comment"># Verify installation</span>
kubectl gadget version
</code></pre>
<p>If you don't have krew (the kubectl plugin manager), you can install it first:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install krew</span>
(
  <span class="hljs-built_in">set</span> -x; <span class="hljs-built_in">cd</span> <span class="hljs-string">"<span class="hljs-subst">$(mktemp -d)</span>"</span> &amp;&amp;
  OS=<span class="hljs-string">"<span class="hljs-subst">$(uname | tr '[:upper:]' '[:lower:]')</span>"</span> &amp;&amp;
  ARCH=<span class="hljs-string">"<span class="hljs-subst">$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')</span>"</span> &amp;&amp;
  KREW=<span class="hljs-string">"krew-<span class="hljs-variable">${OS}</span>_<span class="hljs-variable">${ARCH}</span>"</span> &amp;&amp;
  curl -fsSLO <span class="hljs-string">"https://github.com/kubernetes-sigs/krew/releases/latest/download/<span class="hljs-variable">${KREW}</span>.tar.gz"</span> &amp;&amp;
  tar zxvf <span class="hljs-string">"<span class="hljs-variable">${KREW}</span>.tar.gz"</span> &amp;&amp;
  ./<span class="hljs-string">"<span class="hljs-variable">${KREW}</span>"</span> install krew
)

<span class="hljs-comment"># Add krew to your PATH</span>
<span class="hljs-built_in">export</span> PATH=<span class="hljs-string">"<span class="hljs-variable">${KREW_ROOT:-<span class="hljs-variable">$HOME</span>/.krew}</span>/bin:<span class="hljs-variable">$PATH</span>"</span>
</code></pre>
<h4 id="heading-deploy-inspektor-gadget-to-your-cluster">Deploy Inspektor Gadget to your cluster:</h4>
<pre><code class="lang-bash">kubectl gadget deploy
</code></pre>
<p>This creates a <code>gadget</code> namespace and deploys the Inspektor Gadget daemon as a DaemonSet, ensuring each node in your cluster can run eBPF programs.</p>
<h4 id="heading-verify-the-deployment">Verify the deployment:</h4>
<pre><code class="lang-bash">kubectl get pods -n gadget
</code></pre>
<p>You should see one <code>gadget-*</code> pod per node, all in the <code>Running</code> state. If a pod is stuck in <code>Pending</code> or <code>CrashLoopBackOff</code>, check that your kernel meets the version requirements.</p>
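<p>The "one healthy pod per node" check can also be automated. The sketch below assumes you have already parsed the output of <code>kubectl get nodes -o json</code> and <code>kubectl get pods -n gadget -o json</code> into dictionaries. The sample data is invented for the example. It checks container readiness as well as the pod phase, because a pod in <code>CrashLoopBackOff</code> can still report the <code>Running</code> phase.</p>
<pre><code class="lang-python">def is_healthy(pod):
    # A pod counts as healthy only if it is Running and every container is ready.
    status = pod["status"]
    containers = status.get("containerStatuses", [])
    return (
        status.get("phase") == "Running"
        and bool(containers)
        and all(c.get("ready") for c in containers)
    )


def nodes_missing_gadget(nodes, pods):
    # Names of nodes that currently host a healthy gadget pod.
    covered = {p["spec"].get("nodeName") for p in pods["items"] if is_healthy(p)}
    return sorted(
        n["metadata"]["name"]
        for n in nodes["items"]
        if n["metadata"]["name"] not in covered
    )


# Invented sample data shaped like kubectl's JSON output.
nodes = {"items": [{"metadata": {"name": "node-a"}}, {"metadata": {"name": "node-b"}}]}
pods = {
    "items": [
        {
            "metadata": {"name": "gadget-x1"},
            "spec": {"nodeName": "node-a"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
        },
        {
            "metadata": {"name": "gadget-x2"},
            "spec": {"nodeName": "node-b"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": False}]},
        },
    ]
}
print(nodes_missing_gadget(nodes, pods))  # ['node-b']
</code></pre>
<p>An empty list means every node can run gadgets. Any node that is listed is a good place to start checking kernel versions.</p>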
<h3 id="heading-deploying-a-sample-application">Deploy a Sample Application</h3>
<p>To learn tracing effectively, we need an application that does something interesting. We'll deploy a simple microservices application with multiple components so you can see traces flowing across service boundaries.</p>
<p>Start by creating a namespace for our demo app:</p>
<pre><code class="lang-bash">kubectl create namespace demo-app
</code></pre>
<p>Then deploy a simple web application with a backend:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">gcr.io/google-samples/microservices-demo/frontend:v0.8.0</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"8080"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PRODUCT_CATALOG_SERVICE_ADDR</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"productcatalog:3550"</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">server</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">gcr.io/google-samples/microservices-demo/productcatalogservice:v0.8.0</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">3550</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"3550"</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">3550</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">3550</span>
</code></pre>
<p>Save the manifest above as <code>demo-app.yaml</code>, then apply it:</p>
<pre><code class="lang-bash">kubectl apply -f demo-app.yaml
</code></pre>
<p>And wait for pods to be ready:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">wait</span> --<span class="hljs-keyword">for</span>=condition=ready pod -l app=frontend -n demo-app --timeout=300s
kubectl <span class="hljs-built_in">wait</span> --<span class="hljs-keyword">for</span>=condition=ready pod -l app=productcatalog -n demo-app --timeout=300s
</code></pre>
<p>Then just verify that everything is running:</p>
<pre><code class="lang-bash">kubectl get pods -n demo-app
</code></pre>
<p>You should see both <code>frontend</code> and <code>productcatalog</code> pods in the <code>Running</code> state.</p>
<p>Now you’ll need to get the frontend URL:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># For local clusters (minikube, kind, Docker Desktop)</span>
kubectl port-forward -n demo-app service/frontend 8080:80

<span class="hljs-comment"># Then access http://localhost:8080 in your browser</span>

<span class="hljs-comment"># For cloud clusters</span>
kubectl get service frontend -n demo-app
<span class="hljs-comment"># Look for the EXTERNAL-IP</span>
</code></pre>
<p>Visit the application in your browser to confirm it's working. You should see a simple e-commerce storefront. This application makes HTTP requests from the frontend to the product catalog service, which is perfect for tracing.</p>
<h2 id="heading-how-to-trace-your-first-request-hands-on-tutorial">How to Trace Your First Request: Hands-On Tutorial</h2>
<p>Now that everything is set up, let's capture our first trace and see eBPF observability in action.</p>
<h3 id="heading-generate-the-traffic-to-trace">Generate the Traffic to Trace</h3>
<p>First, we need some application activity to observe. We will generate a few requests for our demo application.</p>
<p>In one terminal, start the Traceloop gadget:</p>
<pre><code class="lang-bash">kubectl gadget traceloop -n demo-app
</code></pre>
<p>This command starts recording the system calls made by containers in the <code>demo-app</code> namespace. Inspektor Gadget captures these events in the kernel while your application processes each request.</p>
<p>In another terminal, generate some traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># If using port-forward</span>
curl http://localhost:8080

<span class="hljs-comment"># If you have an external IP</span>
curl http://&lt;EXTERNAL-IP&gt;

<span class="hljs-comment"># Generate multiple requests</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..10}; <span class="hljs-keyword">do</span> curl http://localhost:8080; sleep 1; <span class="hljs-keyword">done</span>
</code></pre>
<h3 id="heading-viewing-your-first-trace">Viewing Your First Trace</h3>
<p>Switch back to the terminal running the Traceloop gadget. You should see output appearing as requests flow through your application. The output will look something like this:</p>
<pre><code>NODE         NAMESPACE   POD              CONTAINER    PID    TYPE       COUNT
minikube     demo-app    frontend-abc123  frontend     1234   loop       1
minikube     demo-app    frontend-abc123  frontend     1234   loop       2
</code></pre>
<p>Each line shows a traced execution flow, with the count increasing as the same pattern is observed again.</p>
<p>We can make the output more interesting by filtering:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Stop the previous trace with Ctrl+C, then run:</span>
kubectl gadget traceloop -n demo-app --podname frontend
</code></pre>
<p>This narrows our observation to just the frontend pod, reducing noise and making patterns clearer.</p>
<h4 id="heading-understanding-what-youre-seeing">Understanding what you're seeing:</h4>
<p>Each column shows different information about your application:</p>
<ul>
<li><p><strong>NODE</strong>: Which Kubernetes node the traced event occurred on. In multi-node clusters, this helps you understand workload distribution and identify node-specific issues.</p>
</li>
<li><p><strong>NAMESPACE</strong>: The Kubernetes namespace. We filtered to <code>demo-app</code>, so you'll only see that namespace. In production, filtering by namespace is crucial for focusing on specific applications.</p>
</li>
<li><p><strong>POD</strong>: The specific pod where the event occurred. Each pod gets a unique name (like <code>frontend-abc123</code>), allowing you to distinguish between replicas of the same application.</p>
</li>
<li><p><strong>CONTAINER</strong>: Which container within the pod. Pods can have multiple containers (main application, sidecars, init containers), so this helps you pinpoint exactly where activity is happening.</p>
</li>
<li><p><strong>PID</strong>: The process ID inside the container. This is the actual Linux process that made the syscalls eBPF observed. Multiple PIDs might appear if your application uses multiple processes or threads.</p>
</li>
<li><p><strong>TYPE</strong>: The type of event traced. For Traceloop, this identifies kernel-level patterns detected during request processing.</p>
</li>
<li><p><strong>COUNT</strong>: How many times this pattern has been observed. A rapidly incrementing count indicates high request volume.</p>
</li>
</ul>
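<p>If you capture this output to a file, a few lines of Python can turn it into structured data. The sketch below parses the illustrative table shown earlier and reports the latest count per pod. It assumes whitespace-separated columns with no spaces inside values, which holds for that sample but may not hold for every gadget's output.</p>
<pre><code class="lang-python">SAMPLE = """\
NODE         NAMESPACE   POD              CONTAINER    PID    TYPE       COUNT
minikube     demo-app    frontend-abc123  frontend     1234   loop       1
minikube     demo-app    frontend-abc123  frontend     1234   loop       2
"""


def parse_rows(text):
    # The first non-empty line is the header; the rest are data rows.
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lower().split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def latest_count_by_pod(rows):
    # Later rows overwrite earlier ones, so each pod ends with its most recent count.
    latest = {}
    for row in rows:
        latest[row["pod"]] = int(row["count"])
    return latest


rows = parse_rows(SAMPLE)
print(latest_count_by_pod(rows))  # {'frontend-abc123': 2}
</code></pre>
<p>A pod you expected to see that is absent from the resulting dictionary produced no traced events during the capture window.</p>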
<h4 id="heading-what-this-tells-you-about-your-application">What this tells you about your application:</h4>
<p>Even from this simple output, you can derive insights. If you see events appearing for the <code>frontend</code> pod but not the <code>productcatalog</code> pod, requests may not be reaching the backend, which points to a possible configuration issue. If the <code>COUNT</code> increases rapidly for one pod but not others, you know which replica is receiving traffic, which is useful for debugging load balancing issues.</p>
<p>The real power becomes clear when you correlate these kernel-level observations with what you know about your application. When you made 10 curl requests, you should see corresponding activity in the trace output. This direct relationship between application behavior and kernel observations is the foundation of eBPF observability.</p>
<h2 id="heading-how-to-interpret-traces">How to Interpret Traces</h2>
<p>Understanding raw trace output is valuable, but interpreting what it means for your application's health and performance is where the real skill lies.</p>
<h3 id="heading-trace-anatomy-spans-timing-and-request-flow">Trace Anatomy: Spans, Timing, and Request Flow</h3>
<p>A trace represents a single request's journey through your system. When you curl the frontend, that generates one trace. A span represents a single operation within that trace like "frontend handles request," "frontend calls product catalog," "product catalog queries data," and "frontend returns response." Each span has timing information: when it started, when it ended, and therefore how long it took.</p>
<p>In traditional distributed tracing with OpenTelemetry or Jaeger, you'd explicitly create these spans in your application code. With eBPF, the tool infers spans from syscall patterns. When eBPF sees your frontend process call <code>connect()</code> to the product catalog's IP, followed by <code>send()</code> and <code>recv()</code>, it understands that's a span representing an HTTP request to the backend service.</p>
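<p>The inference described above can be sketched in a few lines. This is a deliberately simplified illustration: a span opens at <code>connect()</code> and is extended by each later <code>send()</code> or <code>recv()</code> from the same process. The event fields (<code>ts_ms</code>, <code>pid</code>, <code>syscall</code>, <code>dest</code>) and the sample values are assumptions, and real tools track individual sockets rather than whole processes.</p>
<pre><code class="lang-python">def infer_spans(events):
    spans = []
    open_spans = {}  # maps a pid to the span currently being built for it
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        pid = event["pid"]
        if event["syscall"] == "connect":
            # A new connect() closes any span still open for this process.
            if pid in open_spans:
                spans.append(open_spans.pop(pid))
            open_spans[pid] = {
                "pid": pid,
                "dest": event["dest"],
                "start_ms": event["ts_ms"],
                "end_ms": event["ts_ms"],
            }
        elif event["syscall"] in ("send", "recv") and pid in open_spans:
            # Data transfer extends the span to the latest observed activity.
            open_spans[pid]["end_ms"] = event["ts_ms"]
    spans.extend(open_spans.values())
    for span in spans:
        span["duration_ms"] = span["end_ms"] - span["start_ms"]
    return spans


# Invented events: the frontend process calling the product catalog.
events = [
    {"ts_ms": 0, "pid": 1234, "syscall": "connect", "dest": "10.0.0.7:3550"},
    {"ts_ms": 2, "pid": 1234, "syscall": "send"},
    {"ts_ms": 9, "pid": 1234, "syscall": "recv"},
]
print(infer_spans(events))
</code></pre>
<p>Here the three syscalls collapse into one span lasting 9 ms to <code>10.0.0.7:3550</code>, which is the kind of record a tracing UI would display as "frontend calls product catalog."</p>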
<p>The request flow is the sequence of spans showing how your request moved through services. In our demo app, it looks like this:</p>
<ol>
<li><p>The user request arrives at the frontend.</p>
</li>
<li><p>The frontend connects to the product catalog.</p>
</li>
<li><p>The product catalog processes the request.</p>
</li>
<li><p>The product catalog returns the data.</p>
</li>
<li><p>The frontend renders the page.</p>
</li>
<li><p>The response is sent to the user.</p>
</li>
</ol>
<h3 id="heading-how-to-follow-requests-across-services">How to Follow Requests Across Services</h3>
<p>Let's trace a request across service boundaries to see this flow in action.</p>
<p>First, we’ll start a more detailed trace:</p>
<pre><code class="lang-bash">kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>The <code>trace_tcp</code> gadget shows network connections, giving us visibility into service-to-service communication.</p>
<p>Next, generate a request:</p>
<pre><code class="lang-bash">curl http://localhost:8080
</code></pre>
<p>In the trace output, look for connection patterns:</p>
<p>You should see the frontend pod establishing a TCP connection to the product catalog service. The trace will show the source (frontend) and destination (product catalog) IPs and ports, along with timing information.</p>
<p>This is how eBPF lets you follow requests: by observing the network syscalls that implement service communication. You don't need a service mesh or instrumentation libraries, because the kernel sees all network activity and eBPF captures it.</p>
<h4 id="heading-understanding-the-flow">Understanding the flow:</h4>
<ol>
<li><p>Your curl command triggers a TCP connection to the frontend pod's IP on port 8080</p>
</li>
<li><p>The frontend processes the request and opens a TCP connection to the product catalog's IP on port 3550</p>
</li>
<li><p>Data flows back and forth (you'll see send/receive events)</p>
</li>
<li><p>Connections close when requests complete</p>
</li>
</ol>
<p>Each step is visible to eBPF because each step requires syscalls that the kernel handles.</p>
<h3 id="heading-how-to-identify-bottlenecks-and-errors">How to Identify Bottlenecks and Errors</h3>
<p>We can also use tracing to identify performance issues.</p>
<p>First, let’s simulate an unavailable backend by scaling the product catalog down to zero replicas:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Take the product catalog offline by scaling the deployment to zero</span>
kubectl scale deployment productcatalog -n demo-app --replicas=0

<span class="hljs-comment"># After generating the test requests, scale back up to restore the service</span>
kubectl scale deployment productcatalog -n demo-app --replicas=1
</code></pre>
<p>While the product catalog is down, generate some requests:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..5}; <span class="hljs-keyword">do</span> curl http://localhost:8080; <span class="hljs-keyword">done</span>
</code></pre>
<p>You should see connection attempts from the frontend to the product catalog, but if the service is unavailable, you'll see different patterns, possibly connection timeouts or connection refused errors, depending on the exact timing.</p>
<p>What bottlenecks look like in traces:</p>
<ul>
<li><p><strong>Long spans</strong>: A span that takes significantly longer than others indicates a bottleneck. In <code>traceloop</code> output, you might see gaps between events or notice certain operations taking longer.</p>
</li>
<li><p><strong>Retries</strong>: Repeated connection attempts to the same destination suggest a failing or slow service.</p>
</li>
<li><p><strong>Error patterns</strong>: Connection failures, timeouts, or unusual syscall sequences indicate problems.</p>
</li>
</ul>
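<p>The first two patterns in the list above can be checked mechanically once you have timing and connection data. The following Python sketch shows one possible approach. The thresholds (three times the median, three or more attempts) and all sample values are assumptions chosen for illustration, not values recommended by any tool.</p>

```python
from collections import Counter
from statistics import median


def find_long_spans(durations_ms: dict[str, float], factor: float = 3.0) -> list[str]:
    """Return span names whose duration exceeds `factor` times the median."""
    typical = median(durations_ms.values())
    return [name for name, d in durations_ms.items() if d > factor * typical]


def find_retries(
    connect_attempts: list[tuple[str, str]], threshold: int = 3
) -> dict[tuple[str, str], int]:
    """Return (source, destination) pairs seen at least `threshold` times."""
    counts = Counter(connect_attempts)
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Hypothetical measurements collected while watching trace output.
durations = {"GET /": 12.0, "GET /cart": 15.0, "GET /products": 480.0, "GET /health": 10.0}
attempts = [("frontend", "productcatalog:3550")] * 4 + [("frontend", "redis:6379")]

long_spans = find_long_spans(durations)
retries = find_retries(attempts)
```

<p>Here <code>long_spans</code> contains only <code>GET /products</code>, and <code>retries</code> flags the four attempts to the product catalog. Comparing against the median rather than the mean keeps one extreme outlier from hiding itself.</p>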
<p>The most valuable skill here is pattern recognition. A typical, healthy request flow has a rhythm: events occur in predictable sequences with consistent timing. When something breaks, the rhythm changes. Requests take longer, errors appear, or expected events don't occur at all.</p>
<h2 id="heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</h2>
<p>Now let's go through three realistic scenarios where eBPF helps:</p>
<h3 id="heading-scenario-1-finding-a-slow-endpoint">Scenario 1: Finding a Slow Endpoint</h3>
<p><strong>The problem:</strong> Users report that the product catalog page sometimes loads very slowly, but metrics show normal average latency.</p>
<p>Let’s use Traceloop to investigate:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Start tracing with timing information</span>
kubectl gadget traceloop -n demo-app --podname frontend
</code></pre>
<p>We’ll generate some mixed traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Some requests to the homepage (fast)</span>
curl http://localhost:8080

<span class="hljs-comment"># Some requests to the product catalog (potentially slow)</span>
curl http://localhost:8080/products
</code></pre>
<p>In the trace output, compare the <code>COUNT</code> increments for different request patterns. If certain patterns show significantly more loop iterations or longer gaps between events, that indicates those requests are doing more work, possibly hitting a slow endpoint.</p>
<h4 id="heading-the-diagnosis">The diagnosis:</h4>
<p>You might notice that requests to <code>/products</code> cause the frontend to make multiple calls to the product catalog service (visible with <code>kubectl gadget trace_tcp</code>), while homepage requests don't. This explains why the product page is slow: it's making synchronous calls to a backend service, and if that service is slow or the network is congested, users feel the delay.</p>
<h4 id="heading-the-fix">The fix:</h4>
<p>You might implement caching, make the backend calls asynchronous, or optimize the product catalog service itself. The key is that eBPF helped you identify which specific code path was slow without adding instrumentation to your application.</p>
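<p>The scenario starts with "normal average latency" alongside occasional slow page loads. A short worked example shows why an average can hide this. The Python sketch below uses made-up latencies (97 fast requests and 3 slow ones) and a nearest-rank percentile helper. Both are assumptions for illustration only.</p>

```python
import math
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 97 fast requests, 3 slow ones.
latencies = [20.0] * 97 + [900.0] * 3

average = mean(latencies)        # looks acceptable on a dashboard
p50 = percentile(latencies, 50)  # the typical request
p99 = percentile(latencies, 99)  # the slow tail that users complain about
```

<p>The average comes out at about 46 ms and the median at 20 ms, while the 99th percentile is 900 ms. This is why tail percentiles, or per-request traces like the ones above, are a better signal than averages for "sometimes slow" complaints.</p>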
<h3 id="heading-scenario-2-tracking-down-failed-requests">Scenario 2: Tracking Down Failed Requests</h3>
<p><strong>The problem:</strong> Your monitoring shows a 5% error rate, but application logs don't show any errors. Where are the failures happening?</p>
<p>Now let’s use eBPF to investigate:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Trace network connections to see connection failures</span>
kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>We’ll simulate intermittent failures:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a failing scenario by temporarily breaking service connectivity</span>
kubectl delete service productcatalog -n demo-app

<span class="hljs-comment"># Generate requests</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..10}; <span class="hljs-keyword">do</span> curl http://localhost:8080; sleep 1; <span class="hljs-keyword">done</span>

<span class="hljs-comment"># Restore the service</span>
kubectl apply -f demo-app.yaml
</code></pre>
<p>In the TCP trace, you'll see connection attempts from the frontend to the product catalog that fail or time out. The trace will show the source, destination, and what happened (connection refused, timeout, and so on).</p>
<h4 id="heading-the-diagnosis-1">The diagnosis:</h4>
<p>The failures are happening at the network level: the frontend can't reach the product catalog. This might be due to network policy issues, service mesh misconfiguration, or DNS problems. Traditional application logs might not capture this because the application never receives a response to log, and the connection fails before the application layer even gets involved.</p>
<h4 id="heading-why-ebpf-finds-this-when-logs-dont">Why eBPF finds this when logs don't:</h4>
<p>Your application logs what it experiences. If a connection fails at the TCP level, your application might just see "connection refused" and retry without detailed logging.</p>
<p>eBPF sees the actual syscalls and network events, giving you visibility into what's happening beneath your application layer.</p>
<h3 id="heading-scenario-3-understanding-service-dependencies">Scenario 3: Understanding Service Dependencies</h3>
<p><strong>The problem:</strong> You're not sure which services depend on each other, and you want to understand the actual runtime dependencies before making changes.</p>
<p>We’ll use eBPF to map dependencies:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Trace all TCP connections to see who talks to whom</span>
kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>And then generate normal traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Make various requests to exercise different code paths</span>
curl http://localhost:8080
curl http://localhost:8080/products
curl http://localhost:8080/cart
</code></pre>
<p>The trace output shows source and destination for every connection. Build a mental (or actual) map of which pods connect to which services.</p>
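<p>If you prefer an actual map over a mental one, a few lines of Python are enough. This sketch assumes you have already reduced the trace output to (source, destination) pairs. The service names below, including <code>redis</code> and <code>checkout</code>, are hypothetical sample data.</p>

```python
from collections import defaultdict


def build_dependency_map(connections: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collapse observed (source, destination) pairs into source -> sorted destinations."""
    deps: dict[str, set[str]] = defaultdict(set)
    for src, dst in connections:
        deps[src].add(dst)
    return {src: sorted(dsts) for src, dsts in deps.items()}


def dependents_of(dep_map: dict[str, list[str]], target: str) -> list[str]:
    """List every source that was seen connecting to `target`."""
    return sorted(src for src, dsts in dep_map.items() if target in dsts)


# Hypothetical pairs extracted from TCP trace output.
observed = [
    ("frontend", "productcatalog"),
    ("frontend", "redis"),
    ("frontend", "productcatalog"),
    ("checkout", "productcatalog"),
]

dep_map = build_dependency_map(observed)
affected = dependents_of(dep_map, "productcatalog")
```

<p>Here <code>affected</code> lists the services that would feel a change to the product catalog, which is the question you usually want answered before a deployment.</p>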
<h4 id="heading-the-discovery">The discovery:</h4>
<p>You'll see that the frontend pod connects to the product catalog service, but you might also discover unexpected dependencies. Perhaps the frontend also makes calls to a Redis cache, an authentication service, or external APIs. These runtime dependencies might not be documented or might differ from what architectural diagrams show.</p>
<h4 id="heading-why-this-matters">Why this matters:</h4>
<p>Before deploying a change to the product catalog service, you now know exactly which services will be affected. Before implementing a network policy, you know which connections to allow. Before decomposing a monolith, you understand the actual communication patterns.</p>
<p>This is observability-driven architecture understanding: letting the system show you how it actually works, not how you think it works.</p>
<h2 id="heading-advanced-tracing-insights">Advanced Tracing Insights</h2>
<p>Once you're comfortable with basic request tracing, Inspektor Gadget offers deeper observability capabilities that reveal even more about your system's behavior.</p>
<h3 id="heading-syscall-level-observation">Syscall-Level Observation</h3>
<p>The <code>traceloop</code> and <code>trace_tcp</code> gadgets give you application-level insights, but sometimes you need to go deeper. The <code>trace_exec</code> gadget shows you every process execution in your containers.</p>
<p>First, let’s monitor process execution:</p>
<pre><code class="lang-bash">kubectl gadget trace_exec -n demo-app
</code></pre>
<p>And generate activity:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Exec into a pod and run commands</span>
kubectl <span class="hljs-built_in">exec</span> -it -n demo-app deployment/frontend -- /bin/sh
ls -la
ps aux
<span class="hljs-built_in">exit</span>
</code></pre>
<p>Every command you run inside the container appears in the trace: <code>/bin/sh</code>, <code>ls</code>, <code>ps</code>, and anything else. This helps you understand what's running in your containers, detect suspicious activity, or debug initialization issues.</p>
<p>In production scenarios, this helps you answer questions like: Is my application spawning unexpected subprocesses? Are there security issues like someone running <code>curl</code> to download malicious scripts? Is my <code>init</code> script actually running the commands I think it is?</p>
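<p>One simple way to answer the "unexpected subprocess" question is to compare observed executions against an allowlist. The Python sketch below assumes you have collected the executed binary paths from the trace. The allowlist and the sample paths are hypothetical.</p>

```python
import os


def unexpected_execs(exec_paths: list[str], allowed_names: set[str]) -> list[str]:
    """Return executed paths whose binary name is not allowlisted, in first-seen order."""
    flagged: list[str] = []
    for path in exec_paths:
        if os.path.basename(path) not in allowed_names and path not in flagged:
            flagged.append(path)
    return flagged


# Hypothetical binaries observed in a container, and the ones we expect to see.
observed = ["/bin/sh", "/bin/ls", "/usr/bin/curl", "/bin/ps", "/usr/bin/curl"]
allowed = {"sh", "ls", "ps"}

suspicious = unexpected_execs(observed, allowed)
```

<p>In this sample, only <code>/usr/bin/curl</code> is flagged. Matching on the binary name alone is a deliberate simplification. A stricter check would compare full paths and arguments.</p>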
<h3 id="heading-network-tracing-insights">Network Tracing Insights</h3>
<p>Beyond TCP connections, you can trace DNS queries, which often reveal surprising things about your application's behavior.</p>
<p>Run <code>trace_dns</code>:</p>
<pre><code class="lang-bash">kubectl gadget trace_dns -n demo-app
</code></pre>
<p>Generate requests:</p>
<pre><code class="lang-bash">curl http://localhost:8080
</code></pre>
<p>You'll see every DNS query your application makes: resolving service names, checking for external APIs, perhaps even unexpected queries that indicate misconfiguration or dependencies you didn't know about.</p>
<p>Common insights from DNS tracing include discovering that your application is using external dependencies you didn't document, finding DNS resolution failures that cause intermittent errors, or identifying excessive DNS queries that could be cached.</p>
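<p>Two of these insights, failed lookups and excessive repeat queries, are easy to surface with a small summary function. The following Python sketch assumes each observed query has been reduced to a (name, response code) pair. The response code strings, domain names, and the repeat threshold are illustrative assumptions, not the gadget's actual output format.</p>

```python
from collections import Counter


def summarize_dns(queries: list[tuple[str, str]], repeat_threshold: int = 3) -> dict:
    """Report names that failed to resolve and names queried `repeat_threshold` or more times."""
    counts = Counter(name for name, _ in queries)
    failed = sorted({name for name, rcode in queries if rcode != "NOERROR"})
    repeated = {name: n for name, n in counts.items() if n >= repeat_threshold}
    return {"failed": failed, "repeated": repeated}


# Hypothetical queries observed while sending a few requests to the frontend.
queries = [("productcatalog.demo-app.svc.cluster.local", "NOERROR")] * 4 + [
    ("api.example.com", "NOERROR"),
    ("productcatalog", "NXDOMAIN"),
]

summary = summarize_dns(queries)
```

<p>Names that show up under <code>repeated</code> are candidates for caching, and names under <code>failed</code> are worth checking against your service definitions.</p>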
<h3 id="heading-combining-ebpf-data-with-logs-and-metrics">Combining eBPF Data with Logs and Metrics</h3>
<p>eBPF observability delivers the best results when combined with traditional observability signals. To combine them effectively:</p>
<ul>
<li><p>Use metrics for high-level health monitoring, alerting on anomalies, tracking trends over time, and dashboard visualization.</p>
</li>
<li><p>Use logs for application-specific context, business logic details, error messages with stack traces, and debugging application code.</p>
</li>
<li><p>Use eBPF traces for understanding request flows, identifying where time is spent, discovering runtime dependencies, and debugging issues that don't appear in logs.</p>
</li>
</ul>
<h4 id="heading-a-practical-workflow">A practical workflow:</h4>
<p>Your metrics alert you that latency increased. You check logs but don't see errors: requests are succeeding, just slowly. You use eBPF tracing to identify that requests are spending extra time in network I/O to a particular backend service. Now you check that service's metrics and logs, and discover it's under heavy load. The eBPF trace gave you the clue that logs and metrics alone couldn't provide.</p>
<p>This approach to observability, using the right tool for each question, is how experienced engineers debug complex systems efficiently.</p>
<h3 id="heading-what-ebpf-can-and-cant-see">What eBPF Can and Can't See</h3>
<p>eBPF excels at:</p>
<ul>
<li><p>Network traffic (requests, responses, latency)</p>
</li>
<li><p>System calls (file I/O, process creation, memory allocation)</p>
</li>
<li><p>Kernel functions (scheduling, locking, resource usage)</p>
</li>
<li><p>Function calls in binaries (with uprobes)</p>
</li>
</ul>
<p>But keep in mind that eBPF has limitations:</p>
<ul>
<li><p>Cannot decrypt encrypted payloads (unless hooking SSL libraries before encryption)</p>
</li>
<li><p>Doesn't automatically understand application logic</p>
</li>
<li><p>Captures low-level events but may need context for high-level semantics</p>
</li>
</ul>
<p>That's why eBPF complements traditional observability rather than replacing it entirely. It gives you infrastructure-level visibility with no code changes and universal coverage. Traditional APM provides application-level context, business metrics, and custom instrumentation. Together, they give you complete observability across your entire stack.</p>
<h2 id="heading-best-practices-and-production-considerations">Best Practices and Production Considerations</h2>
<p>Before using eBPF tracing in production, there are important considerations around performance, security, and operational practices.</p>
<h3 id="heading-performance-impact">Performance Impact</h3>
<p>eBPF's reputation for low overhead is well-deserved, but "low" isn't "zero."</p>
<p>eBPF tracing tools typically add around 2-5% CPU overhead and negligible memory overhead. The exact number depends on event frequency: tracing a service that handles 10,000 requests per second costs more than tracing one that handles 10 per second.</p>
<p>Measuring the impact:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Before enabling tracing, check baseline resource usage</span>
kubectl top pods -n demo-app

<span class="hljs-comment"># Enable tracing</span>
kubectl gadget traceloop -n demo-app

<span class="hljs-comment"># Check resource usage again</span>
kubectl top pods -n demo-app
</code></pre>
<p>You should see a small increase in CPU usage in the pods where tracing is active. This is the cost of the eBPF programs running in the kernel and processing events.</p>
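<p>To turn the two readings into a percentage, you can do the arithmetic by hand or with a small helper. The Python sketch below assumes you copy the CPU values from the <code>kubectl top</code> output yourself. The helper names and the sample readings are hypothetical.</p>

```python
def parse_millicores(cpu: str) -> int:
    """Convert a CPU reading such as '250m' (millicores) or '1' (whole cores) to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(float(cpu) * 1000)


def overhead_percent(before: str, after: str) -> float:
    """Relative CPU increase between two readings, as a percentage of the baseline."""
    base = parse_millicores(before)
    if base == 0:
        raise ValueError("baseline CPU is zero, so relative overhead is undefined")
    return (parse_millicores(after) - base) * 100 / base


# Hypothetical readings for the same pod, before and during tracing.
overhead = overhead_percent("200m", "208m")
```

<p>With these sample readings the overhead is 4%. Because <code>kubectl top</code> values fluctuate with load, take several readings under comparable traffic before drawing conclusions.</p>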
<h4 id="heading-production-best-practices">Production best practices:</h4>
<p>Use targeted tracing rather than tracing everything everywhere. Trace specific namespaces, pods, or individual containers when investigating issues. For high-volume services, reduce overhead by applying filters, aggregation, or sampling where supported by the tracing tool.</p>
<p>Stop tracing when you’re done investigating. Unlike metrics collection, which typically runs continuously, eBPF-based tracing is best used as an on-demand diagnostic tool to capture detailed insights during active debugging.</p>
<h4 id="heading-when-overhead-matters">When overhead matters:</h4>
<p>If you're running latency-sensitive applications (like high-frequency trading systems or real-time communications), even 2-5% overhead might be unacceptable. In these cases, use eBPF tracing in pre-production environments to identify issues, or enable it temporarily in production only when actively debugging.</p>
<h3 id="heading-security-considerations">Security Considerations</h3>
<p>eBPF is powerful, which means it requires elevated privileges. Understanding the security implications is crucial.</p>
<h4 id="heading-what-ebpf-can-access">What eBPF can access:</h4>
<p>eBPF programs can observe all syscalls, network traffic, and process execution in the kernel. This includes potentially sensitive data like connection details, file paths, and process arguments. The kernel's verifier checks every eBPF program before it loads, which prevents it from crashing the kernel, and tracing gadgets only observe events rather than modify them. Even so, they can read information that might be sensitive.</p>
<h4 id="heading-privilege-requirements">Privilege requirements:</h4>
<p>Loading eBPF programs requires the <code>CAP_SYS_ADMIN</code> capability or, on kernels 5.8 and newer, the narrower <code>CAP_BPF</code> capability (typically combined with <code>CAP_PERFMON</code> for tracing). This is a privileged operation, so only trusted users should have this access. The Inspektor Gadget DaemonSet runs with these privileges, so protect access to it accordingly.</p>
<h4 id="heading-best-practices">Best practices:</h4>
<p>Implement RBAC (Role-Based Access Control) to restrict who can run gadgets. Not every developer needs the ability to trace production systems.</p>
<p>Also, be mindful of what data you're collecting. If your traces might contain sensitive information (like authentication tokens in HTTP headers), restrict access to trace data.</p>
<p>Lastly, consider using admission controllers to prevent unauthorized eBPF program loading. Audit eBPF usage in production environments to track who ran which gadgets when.</p>
<h4 id="heading-network-policies">Network policies:</h4>
<p>Inspektor Gadget's DaemonSet needs to communicate with the API server and between its components. Ensure your network policies allow this communication while still maintaining appropriate segmentation.</p>
<h3 id="heading-when-to-use-ebpf-tracing-vs-traditional-apm">When to Use eBPF Tracing vs. Traditional APM</h3>
<p>eBPF tracing and traditional APM tools like New Relic, Datadog, or Dynatrace serve different purposes. Understanding when to use each helps you build an effective observability strategy.</p>
<p>Use eBPF tracing when:</p>
<ul>
<li><p>You can't modify application code (third-party applications, legacy systems, compiled binaries)</p>
</li>
<li><p>You need infrastructure-level visibility (network, syscalls, kernel behavior)</p>
</li>
<li><p>You're debugging issues that span service boundaries but don't show up in application logs</p>
</li>
<li><p>You want zero instrumentation overhead during normal operation (run tracing only when needed)</p>
</li>
<li><p>You need to understand what's actually happening versus what the application reports</p>
</li>
</ul>
<p>Use traditional APM when:</p>
<ul>
<li><p>You need business-context metrics (user IDs, transaction types, business-specific data)</p>
</li>
<li><p>You want automatic instrumentation with minimal setup for supported frameworks</p>
</li>
<li><p>You need long-term storage and analysis of all traces (eBPF tracing is often used for real-time investigation)</p>
</li>
<li><p>You want pre-built dashboards and alerting for common application patterns</p>
</li>
<li><p>You need application code-level visibility (stack traces, variable values, function calls)</p>
</li>
</ul>
<h3 id="heading-the-ideal-approach-use-both">The Ideal Approach: Use Both</h3>
<p>Many teams run traditional APM for continuous monitoring and use eBPF tracing for targeted investigation when APM data isn't sufficient. For example, your APM shows that a service is slow but doesn't explain why. You enable eBPF tracing on that service to understand what's happening at the kernel level (network delays, excessive syscalls, unexpected dependencies) and find the root cause.</p>
<p>This complementary approach gives you both the continuous visibility of APM and the deep diagnostic power of eBPF without the overhead of running both at maximum depth all the time.</p>
<h2 id="heading-next-steps-and-resources">Next Steps and Resources</h2>
<p>If you got this far, thanks for reading! Now that you've learned the fundamentals of eBPF observability and practiced hands-on tracing with Inspektor Gadget, here are some ways to continue your journey.</p>
<h3 id="heading-exploring-other-ebpf-tools">Exploring Other eBPF Tools</h3>
<p>Now that you understand eBPF concepts through traceloop, exploring other tools will be much easier.</p>
<h4 id="heading-try-other-inspektor-gadget-gadgets">Try other Inspektor Gadget gadgets:</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># See all available gadgets</span>
kubectl gadget --<span class="hljs-built_in">help</span>

<span class="hljs-comment"># Some useful ones to explore:</span>
kubectl gadget trace_open -n demo-app     <span class="hljs-comment"># File I/O tracing</span>
kubectl gadget trace_bind -n demo-app     <span class="hljs-comment"># Port binding events</span>
kubectl gadget profile cpu -n demo-app    <span class="hljs-comment"># CPU profiling</span>
kubectl gadget snapshot process -n demo-app  <span class="hljs-comment"># Process listing</span>
</code></pre>
<p>Each gadget teaches you something different about system behavior and gives you another diagnostic tool in your toolkit.</p>
<h3 id="heading-experiment-with-other-ebpf-platforms">Experiment with other eBPF platforms:</h3>
<p>If you're interested in broader observability platforms, try Pixie for its auto-instrumentation and rich UI. Install Cilium with Hubble if you're focused on network observability and want to understand service mesh behavior. Explore Tetragon if security observability interests you, seeing what processes are executing and what files they're accessing.</p>
<p>The concepts transfer directly: all these tools attach eBPF programs to kernel hooks, collect event data, and present it in different ways. Your understanding of syscalls, traces, and kernel-level observation applies universally.</p>
<h3 id="heading-connect-to-the-cncf-observability-ecosystem">Connect to the CNCF Observability Ecosystem</h3>
<p>eBPF observability tools don't exist in isolation. They're part of the broader Cloud Native Computing Foundation ecosystem.</p>
<h4 id="heading-opentelemetry-integration">OpenTelemetry integration:</h4>
<p>Many eBPF tools can export data in OpenTelemetry format, allowing you to combine kernel-level traces with application-level traces in a unified observability backend. This gives you the complete picture: eBPF shows you infrastructure behavior while OpenTelemetry shows you application context.</p>
<h4 id="heading-prometheus-and-grafana">Prometheus and Grafana:</h4>
<p>eBPF-derived metrics can be exposed as Prometheus metrics and visualized in Grafana alongside your application metrics. This unified dashboard approach helps you correlate infrastructure and application behavior.</p>
<h4 id="heading-service-mesh-integration">Service mesh integration:</h4>
<p>If you're using Istio, Linkerd, or other service meshes, eBPF tools like Cilium Hubble can provide deeper visibility into service-to-service communication than the mesh alone provides. The mesh handles traffic management while eBPF gives you kernel-level visibility.</p>
<h4 id="heading-jaeger-and-zipkin">Jaeger and Zipkin:</h4>
<p>For organizations using distributed tracing backends, eBPF traces can be exported to these systems, enriching your trace data with infrastructure-level spans that application instrumentation misses.</p>
<h3 id="heading-community-resources-and-learning-paths">Community Resources and Learning Paths</h3>
<p>The eBPF community is vibrant and welcoming. You can continue learning from the resources below.</p>
<p><strong>Official documentation and blog:</strong></p>
<ul>
<li><p><a target="_blank" href="http://eBPF.io">eBPF.io</a>: The central hub for eBPF documentation, tutorials, and project listings</p>
</li>
<li><p><a target="_blank" href="https://inspektor-gadget.io/docs/latest/">Inspektor Gadget docs</a>: Comprehensive guides for all gadgets and use cases</p>
</li>
<li><p><a target="_blank" href="https://docs.cilium.io/en/stable/index.html">Cilium documentation</a>: Deep dives into eBPF networking</p>
</li>
<li><p><a target="_blank" href="https://www.cncf.io/blog/2025/01/27/what-is-observability-2-0/">CNCF Blog: “What is Observability 2.0?”</a>: A quick overview of how modern observability moves beyond traditional tools by unifying metrics, logs, and traces for real-time insight in cloud-native systems.</p>
</li>
</ul>
<p><strong>Learning resources:</strong></p>
<ul>
<li><p><a target="_blank" href="https://cilium.isovalent.com/hubfs/Learning-eBPF%20-%20Full%20book.pdf">Learning eBPF by Liz Rice</a>: Comprehensive book covering eBPF fundamentals</p>
</li>
<li><p><a target="_blank" href="https://ebpf.io/summit-2025/">eBPF Summit</a>: Annual conference with talks from eBPF creators and users</p>
</li>
<li><p><a target="_blank" href="https://www.cncf.io/online-programs/cncf-on-demand-webinar-how-to-start-building-a-self-service-infrastructure-platform-on-kubernetes/">CNCF webinars</a>: Regular sessions on observability topics</p>
</li>
<li><p><a target="_blank" href="https://www.kubernetes.dev/community/community-groups/">Kubernetes observability SIGs</a>: Community discussions and projects</p>
</li>
</ul>
<p>To make this tutorial easy to follow and experiment with, I have included all Kubernetes manifests, demo applications, and eBPF tracing commands in this <a target="_blank" href="https://github.com/Emidowojo/ebpf-k8s-tracing-tutorial">repository</a>. You can also connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> if you’d like to stay in touch.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Debug Kubernetes Pods with Traceloop: A Complete Beginner's Guide ]]>
                </title>
                <description>
                    <![CDATA[ Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional kubectl commands show you logs and statuses, but they can't tell you exactl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-kubernetes-pods-with-traceloop-a-complete-beginners-guide/</link>
                <guid isPermaLink="false">68b1d0b4c2405fa2535ed0c8</guid>
                
                    <category>
                        <![CDATA[ Traceloop ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ debugging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ inspektor gadget ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Fri, 29 Aug 2025 16:09:24 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756483063551/4179b718-7883-4a89-a9c2-1c678185469a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional <code>kubectl</code> commands show you logs and statuses, but they can't tell you exactly what your application was doing at the system level when things went wrong.</p>
<p>What if you had a flight recorder for your applications, something that captures every system call in real-time, so you can "rewind" and see the exact sequence of events that led to a crash? That's what Traceloop does. It continuously traces system calls in your pods, giving you a detailed replay of what happened before, during, and after issues occur.</p>
<p>In this guide, you’ll learn how to use Traceloop's system call tracing to debug pod issues that would otherwise be nearly impossible to diagnose.</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before we begin, here are some prerequisites – things you’ll need to know and have:</p>
<ul>
<li><p><strong>Basic Kubernetes concepts</strong>: Understanding of pods, deployments, services, and namespaces</p>
</li>
<li><p><strong>kubectl fundamentals</strong>: Comfortable with commands like <code>kubectl get</code>, <code>kubectl describe</code>, <code>kubectl logs</code>, and <code>kubectl exec</code></p>
</li>
<li><p><strong>Container basics</strong>: Understanding how containerized applications work</p>
</li>
<li><p><strong>Basic Linux concepts</strong>: Understanding of processes and system calls (helpful, but we'll explain as we go)</p>
</li>
</ul>
<p><strong>Technical Requirements</strong></p>
<ul>
<li><p><strong>Kubernetes cluster access</strong>: Local (minikube, kind, Docker Desktop) or cloud-based cluster</p>
</li>
<li><p><code>kubectl</code> installed and configured to connect to your cluster</p>
</li>
<li><p>Sufficient permissions (cluster admin or equivalent RBAC) to:</p>
<ul>
<li><p>Install and run eBPF-based tools (Traceloop uses eBPF)</p>
</li>
<li><p>Create/modify pods and deployments</p>
</li>
<li><p>Access pod logs and system-level data</p>
</li>
</ul>
</li>
<li><p><strong>Linux-based Kubernetes nodes</strong>: Most clusters already run on Linux.</p>
</li>
</ul>
<p><strong>System Requirements</strong></p>
<ul>
<li><p><strong>Extended Berkeley Packet Filter (eBPF) support</strong>: Used for tracing and monitoring at the kernel level. Kernel version 5.10+ recommended.</p>
</li>
<li><p><strong>Sufficient cluster resources</strong>: Traceloop runs alongside your applications</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-traceloop">What is Traceloop?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-traceloop-works">How Traceloop Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-traceloop">How to Set Up Traceloop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-your-first-trace-hands-on-tutorial">Your First Trace: Hands-On Tutorial</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-by-step-debugging-walkthrough">Step-by-Step Debugging Walkthrough</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices">Best Practices</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-traceloop">What is Traceloop?</h2>
<p><a target="_blank" href="https://inspektor-gadget.io/docs/main/gadgets/traceloop/">Traceloop</a> is a system call tracing and observability tool that works across containerized environments, from Docker containers running locally to pods in production Kubernetes clusters. But before we discuss what that means, let's talk about why system calls matter for debugging.</p>
<p>Every time your application does anything (like opening a file, making a network request, allocating memory, or crashing), it has to interact with the operating system through system calls. These are the fundamental building blocks of how any program interacts with the world around it.</p>
<p>Here's where traditional debugging falls short: when your container crashes, the logs might tell you "segmentation fault" or "out of memory," but they don't tell you the sequence of events that led there. Did the application try to access a file that didn't exist? Was it making network calls that failed? Did it run out of file descriptors?</p>
<p>Traceloop captures this missing piece. It sits at the kernel level using eBPF technology, recording every system call your application makes in real time. Think of it as installing a dashcam in your application. It's always recording with minimal resources, and when something goes wrong, you have the footage.</p>
<p>Strace is another popular debugging tool, but you typically attach it only after you already know there's a problem. Traceloop, by contrast, can run continuously in the background with minimal overhead. If your container crashes at 3am, you can immediately "rewind the tape" and see exactly what system calls happened leading up to the crash.</p>
<p>This helps debug intermittent issues that happen randomly in production but never when you are watching. Because Traceloop is always recording, you finally have visibility into what your application was doing when these mysterious failures occur.</p>
<h2 id="heading-how-traceloop-works">How Traceloop Works</h2>
<p>Now that you understand what Traceloop does, let's look under the hood at how it captures and processes system calls in your containerized environments.</p>
<h3 id="heading-the-technical-foundation">The Technical Foundation</h3>
<p>Traceloop is built on eBPF, a technology that allows programs to run safely in the Linux kernel without changing kernel code. Think of eBPF as a way to install "hooks" directly into the kernel that can observe everything happening on your system with minimal performance impact.</p>
<p>Unlike traditional monitoring tools that work from userspace, eBPF programs run in kernel space, giving them access to system calls as they happen, without relying on the application logging appropriate error messages. This is why Traceloop can capture events that never make it to application logs, like failed system calls or crashes that happen before the application can write anything.</p>
<h3 id="heading-the-flight-recorder-architecture">The Flight Recorder Architecture</h3>
<p>Traceloop uses eBPF maps as an overwritable ring buffer. Imagine a tape recorder that continuously records over itself. It's always capturing system calls, but it only keeps the most recent data in memory. When something goes wrong, the recording automatically preserves what happened leading up to the incident, just like an airplane's flight recorder after a crash.</p>
<p>This approach solves the production debugging problem: you don't need to predict when issues will happen or attach debuggers after the fact. The recording is always running, waiting for you to need it.</p>
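<p>The overwrite-when-full behavior can be modeled in a few lines of Bash. This is a toy illustration only: the capacity of 3 and the function names <code>record</code> and <code>dump</code> are assumptions made for the example, and Traceloop itself does this work inside the kernel with eBPF maps rather than in a shell array.</p>

```shell
#!/usr/bin/env bash
# Toy flight recorder: a fixed-size buffer that overwrites its oldest entry.
# Illustration only; this is not how Traceloop is implemented.
CAPACITY=3     # hypothetical buffer size
buffer=()      # slots holding recorded events
next=0         # total number of events recorded so far

record() {
  local slot=$(( next % CAPACITY ))   # wrap around once the buffer is full
  buffer[slot]="$1"                   # overwrite whatever was in that slot
  next=$(( next + 1 ))
}

dump() {
  # Print the surviving events from oldest to newest.
  local start=0 i
  if (( next > CAPACITY )); then
    start=$(( next - CAPACITY ))
  fi
  for (( i = start; i < next; i++ )); do
    echo "${buffer[i % CAPACITY]}"
  done
}

for syscall in openat read write close exit_group; do
  record "$syscall"
done
dump
```

<p>Five events go in, but only the last three (<code>write</code>, <code>close</code>, <code>exit_group</code>) come back out, which is the trade-off a flight recorder makes: bounded memory in exchange for losing older history.</p>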
<h3 id="heading-system-call-capture-flow">System Call Capture Flow</h3>
<p>Here's how Traceloop captures and processes system calls across your Kubernetes environment:</p>
<ol>
<li><p><strong>Application pods</strong> generate system calls through normal operation – opening files, making network connections, allocating memory.</p>
</li>
<li><p><strong>eBPF probes (also called hooks)</strong> intercept these system calls at the kernel level before they're processed.</p>
</li>
<li><p><strong>Traceloop recorder</strong> captures the events, buffers them, and adds container context using Inspektor Gadget enrichment (pod name, namespace, container ID).</p>
</li>
<li><p><strong>Output stream</strong> formats the data and makes it available for analysis in real time or after an incident.</p>
</li>
<li><p><strong>Traceloop user</strong> views and analyzes the captured trace to diagnose the root cause of issues.</p>
</li>
</ol>
<p>Below is a visual representation of the flow. The key advantage is that Traceloop sees everything your application does, even actions that fail silently or happen too quickly for traditional logging to catch. This gives you complete visibility into your application's interaction with the operating system.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755043403339/c5047de7-afc4-48aa-a28e-ee3a1dfbe47f.jpeg" alt="Flow diagram showing how Traceloop works. Application Pods generate system calls, which undergo kernel-level interception via eBPF probes. The probes capture events and pass them to the Traceloop Recorder, which buffers and formats the data. The Output Stream then displays the results to the Traceloop User. The process highlights steps from generating syscalls to capturing, recording, formatting, and presenting the results." class="image--center mx-auto" width="2823" height="981" loading="lazy"></p>
<h3 id="heading-container-isolation-and-context">Container Isolation and Context</h3>
<p>One of Traceloop's strengths is understanding containerized environments. It doesn't just capture raw system calls – it adds context about which pod, container, and namespace generated each call. This means you can trace specific applications without getting overwhelmed by system calls from other containers running on the same node.</p>
<p>This container awareness makes Traceloop particularly powerful in Kubernetes environments where you might have dozens of pods running on a single node, but you only care about debugging one specific application.</p>
<h2 id="heading-how-to-set-up-traceloop">How to Set Up Traceloop</h2>
<p>Before we can start tracing system calls, we need to set up Traceloop in your Kubernetes environment. Traceloop is part of the <a target="_blank" href="https://inspektor-gadget.io/">Inspektor Gadget</a> ecosystem, which provides flexibility in how you use it.</p>
<h3 id="heading-installation-overview">Installation Overview</h3>
<p>The cluster-wide setup used in this tutorial:</p>
<ul>
<li><p>Deploys Inspektor Gadget components to all worker nodes</p>
</li>
<li><p>Eliminates the download and initialization overhead on each use, as components are pre-loaded and ready </p>
</li>
<li><p>Eliminates the need to reinstall or reconfigure for each debugging session – just run your traces immediately</p>
</li>
<li><p>Requires cluster admin permissions</p>
</li>
<li><p>Works best for teams doing regular debugging</p>
</li>
</ul>
<h4 id="heading-installation-requirements">Installation Requirements</h4>
<p>First, ensure your cluster meets the requirements:</p>
<ul>
<li><p>Kubernetes cluster with Linux nodes</p>
</li>
<li><p>eBPF support</p>
</li>
<li><p>kubectl installed and configured</p>
</li>
<li><p>Cluster admin permissions</p>
</li>
</ul>
<h4 id="heading-install-kubectl-gadget">Install kubectl gadget</h4>
<p>The recommended way is using krew (kubectl plugin manager):</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install krew if you don't have it (Linux amd64 build shown; pick the archive for your OS and architecture)</span>
curl -fsSLO <span class="hljs-string">"https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-linux_amd64.tar.gz"</span>
tar zxvf krew-linux_amd64.tar.gz
./krew-linux_amd64 install krew
<span class="hljs-built_in">export</span> PATH=<span class="hljs-string">"<span class="hljs-variable">${KREW_ROOT:-<span class="hljs-variable">$HOME</span>/.krew}</span>/bin:<span class="hljs-variable">$PATH</span>"</span>

<span class="hljs-comment"># Install kubectl gadget</span>
kubectl krew install gadget
</code></pre>
<p>Alternatively, you can install directly:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># For Linux (amd64)</span>
curl -sL https://github.com/inspektor-gadget/inspektor-gadget/releases/latest/download/kubectl-gadget-linux-amd64.tar.gz | sudo tar -C /usr/<span class="hljs-built_in">local</span>/bin -xzf - kubectl-gadget

<span class="hljs-comment"># Verify installation</span>
kubectl gadget version
</code></pre>
<h4 id="heading-deploy-inspektor-gadget-to-your-cluster">Deploy Inspektor Gadget to Your Cluster</h4>
<p>Deploy the Inspektor Gadget components to your cluster:</p>
<pre><code class="lang-bash">kubectl gadget deploy
</code></pre>
<p>This installs the necessary DaemonSets and RBAC configurations that allow gadgets like Traceloop to run on your cluster nodes.</p>
<p>Alternatively, you can also deploy using <a target="_blank" href="https://inspektor-gadget.io/docs/v0.43.0/reference/install-kubernetes/#installation-with-the-helm-chart">Helm</a>.</p>
<h4 id="heading-verify-installation">Verify Installation</h4>
<p>Check that the gadget pods are running:</p>
<pre><code class="lang-bash">kubectl get pods -n gadget
</code></pre>
<p>You should see gadget pods running on each node in your cluster.</p>
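<p>To turn that visual check into something scriptable, you could compare the number of nodes with the number of running gadget pods. The sketch below assumes one gadget pod is expected per node; <code>check_gadget_coverage</code> is a hypothetical helper, and the <code>kubectl</code> lines are left as comments because they need access to a real cluster.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the node count with the number of running gadget pods.
check_gadget_coverage() {
  local nodes="$1" running_pods="$2"
  if (( running_pods >= nodes )); then
    echo "OK: ${running_pods} gadget pod(s) running for ${nodes} node(s)"
  else
    echo "WARNING: only ${running_pods} gadget pod(s) running for ${nodes} node(s)"
    return 1
  fi
}

# On a real cluster, the two counts could be gathered like this:
#   nodes=$(kubectl get nodes --no-headers | wc -l)
#   pods=$(kubectl get pods -n gadget --no-headers --field-selector=status.phase=Running | wc -l)
#   check_gadget_coverage "$nodes" "$pods"

# Example with fixed numbers so the script runs without a cluster.
check_gadget_coverage 3 3
```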
<h2 id="heading-your-first-trace-hands-on-tutorial">Your First Trace: Hands-On Tutorial</h2>
<p>Now let's capture our first system call trace. We'll create a simple scenario and watch what happens at the system level.</p>
<h3 id="heading-setting-up-the-test-environment">Setting Up the Test Environment</h3>
<p>First, create a dedicated namespace for our tracing experiments:</p>
<pre><code class="lang-bash">kubectl create ns test-traceloop-ns
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="lang-bash">namespace/test-traceloop-ns created
</code></pre>
<p>Next, create a simple pod that we can interact with:</p>
<pre><code class="lang-bash">kubectl run -n test-traceloop-ns --image busybox test-traceloop-pod --<span class="hljs-built_in">command</span> -- sleep inf
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="lang-bash">pod/test-traceloop-pod created
</code></pre>
<p>This creates a BusyBox container that sleeps indefinitely, giving us a stable target for tracing.</p>
<h3 id="heading-starting-your-first-trace">Starting Your First Trace</h3>
<p>Next, start tracing system calls for our test pod:</p>
<pre><code class="lang-bash">kubectl gadget run traceloop:latest --namespace test-traceloop-ns
</code></pre>
<p>This command starts the flight recorder. You'll see column headers showing what information Traceloop captures:</p>
<pre><code class="lang-bash">K8S.NODE    K8S.NAMESPACE    K8S.PODNAME    K8S.CONTAINERNAME    CPU    PID    COMM    SYSCALL    PARAMETERS    RET
</code></pre>
<p>The trace is now running in this terminal, continuously recording system calls from our pod.</p>
<h3 id="heading-generating-system-calls">Generating System Calls</h3>
<p>With the trace running, let's generate some activity. In a new terminal window, run a command inside your test pod:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -ti -n test-traceloop-ns test-traceloop-pod -- /bin/sh
</code></pre>
<p>Once inside the container, run some basic commands:</p>
<pre><code class="lang-bash">ls /
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Hello World"</span> &gt; /tmp/test.txt
cat /tmp/test.txt
</code></pre>
<h3 id="heading-collecting-the-trace">Collecting the Trace</h3>
<p>Back in your original terminal where Traceloop is running, press <strong>Ctrl+C</strong> to stop the recording and see the captured system calls.</p>
<p>You'll see output similar to this:</p>
<pre><code class="lang-bash">K8S.NODE            K8S.NAMESPACE        K8S.PODNAME          K8S.CONTAINERNAME    CPU  PID    COMM  SYSCALL      PARAMETERS                   RET
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    openat       dfd=-100, filename=<span class="hljs-string">"/lib"</span>    3
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    getdents64   fd=3, dirent=0x...          201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    write        fd=1, buf=<span class="hljs-string">"bin dev etc..."</span>   201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    exit_group   error_code=0                 0
</code></pre>
<h3 id="heading-understanding-your-first-trace">Understanding Your First Trace</h3>
<p>Let's break down what we're seeing:</p>
<ul>
<li><p><strong>K8S.PODNAME</strong>: Which pod generated these system calls</p>
</li>
<li><p><strong>PID</strong>: Process ID of the command that ran</p>
</li>
<li><p><strong>COMM</strong>: The command name (ls, echo, cat)</p>
</li>
<li><p><strong>SYSCALL</strong>: The actual system call made (openat, write, exit_group)</p>
</li>
<li><p><strong>PARAMETERS</strong>: Arguments passed to the system call</p>
</li>
<li><p><strong>RET</strong>: Return value (zero or a positive number, such as a file descriptor or byte count, means success; a negative number is an error code)</p>
</li>
</ul>
<p>This trace shows the <code>ls</code> command opening the <code>/lib</code> directory, reading directory entries, writing the output to stdout, and exiting successfully.</p>
<h3 id="heading-clean-up">Clean Up</h3>
<p>Remove the test resources:</p>
<pre><code class="lang-bash">kubectl delete pod test-traceloop-pod -n test-traceloop-ns
kubectl delete ns test-traceloop-ns
</code></pre>
<p>You can now see exactly what your applications are doing at the kernel level, something that traditional logs and kubectl commands can't show you.</p>
<p>Let's try this with an application that crashes.</p>
<h2 id="heading-step-by-step-debugging-walkthrough">Step-by-Step Debugging Walkthrough</h2>
<p>Now that you know how to capture traces, let's take a look at a real debugging scenario. We'll create an application that crashes and use Traceloop to uncover the root cause, something that would be nearly impossible with traditional kubectl debugging.</p>
<h3 id="heading-the-scenario-a-mysterious-crash">The Scenario: A Mysterious Crash</h3>
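<p>Let's create a Python application that has a subtle bug. It tries to write to a file it doesn't have permission to access, then crashes. This mimics real-world scenarios where applications fail due to permission issues, missing files, or resource constraints. Note that the permission error appears only when the container runs as a non-root user, because a root process can usually write to <code>/etc/passwd</code>.</p>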
<h3 id="heading-setting-up-the-problematic-application">Setting Up the Problematic Application</h3>
<p>First, we’ll create a new namespace for our debugging exercise:</p>
<pre><code class="lang-bash">kubectl create ns debug-traceloop-ns
</code></pre>
<p>Now, let's create a pod with an application that will crash:</p>
<pre><code class="lang-bash">kubectl run -n debug-traceloop-ns crash-app --image=python:3.9-slim --restart=Never -- python3 -c <span class="hljs-string">"
import time
import os
print('App starting...')
time.sleep(5)
print('Trying to write to restricted file...')
try:
    with open('/etc/passwd', 'w') as f:
        f.write('malicious content')
except Exception as e:
    print(f'Error: {e}')
    exit(1)
"</span>
</code></pre>
<p>This creates a pod that will:</p>
<ol>
<li><p>Start successfully</p>
</li>
<li><p>Try to write to <code>/etc/passwd</code> (a restricted system file)</p>
</li>
<li><p>Fail and crash with exit code 1</p>
</li>
</ol>
<h3 id="heading-starting-the-trace-before-the-crash">Starting the Trace Before the Crash</h3>
<p>Here's the key difference from traditional debugging. We start tracing before we know there's a problem. In a real scenario, you'd have Traceloop running continuously.</p>
<pre><code class="lang-bash">kubectl gadget run traceloop:latest --namespace debug-traceloop-ns
</code></pre>
<p>The trace starts recording immediately. You'll see the column headers, and the flight recorder is now capturing every system call.</p>
<h3 id="heading-observing-the-application-behavior">Observing the Application Behavior</h3>
<p>In another terminal, check the pod status:</p>
<pre><code class="lang-bash">kubectl get pods -n debug-traceloop-ns -w
</code></pre>
<p>You'll see the pod go through these states:</p>
<ul>
<li><code>Pending</code> → <code>Running</code> → <code>Error</code> (the pod is not restarted because it was created with <code>--restart=Never</code>)</li>
</ul>
<p>Traditional debugging would show you:</p>
<pre><code class="lang-bash">kubectl logs -n debug-traceloop-ns crash-app
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">App starting...
Trying to write to restricted file...
Error: [Errno 13] Permission denied: <span class="hljs-string">'/etc/passwd'</span>
</code></pre>
<p>But this doesn't tell you exactly what the application tried to do at the system level.</p>
<h3 id="heading-collecting-and-analyzing-the-trace">Collecting and Analyzing the Trace</h3>
<p>Back in your Traceloop terminal, press <strong>Ctrl+C</strong> to stop the recording. You'll see system calls like this:</p>
<pre><code class="lang-bash">K8S.NODE        K8S.NAMESPACE      K8S.PODNAME  COMM    SYSCALL    PARAMETERS                           RET
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename=<span class="hljs-string">"/etc/passwd"</span>    -13
minikube-docker debug-traceloop-ns crash-app    python3 write      fd=3, buf=<span class="hljs-string">"App starting..."</span>         16
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename=<span class="hljs-string">"/etc/passwd"</span>    -13
minikube-docker debug-traceloop-ns crash-app    python3 exit_group error_code=1                        0
</code></pre>
<h3 id="heading-reading-the-system-call-story">Reading the System Call Story</h3>
<p>The trace reveals the exact sequence of events:</p>
<ol>
<li><p><code>openat filename="/etc/passwd" RET=-13</code>: The application tried to open <code>/etc/passwd</code> for writing</p>
<ul>
<li>Return code <code>-13</code> = <code>EACCES</code> (Permission denied)</li>
</ul>
</li>
<li><p><code>write buf="App starting..."</code>: Normal logging output (successful)</p>
</li>
<li><p><code>openat filename="/etc/passwd" RET=-13</code>: Second attempt to open the restricted file (still denied)</p>
</li>
<li><p><code>exit_group error_code=1</code>: Application exits with error code 1</p>
</li>
</ol>
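<p>Negative return values like <code>-13</code> are Linux error numbers with the sign flipped. A small lookup function can make them easier to read while you scan a trace. This is a minimal sketch: <code>errno_name</code> is a hypothetical helper, and the table covers only the codes mentioned in this tutorial.</p>

```shell
#!/usr/bin/env bash
# Translate a RET value from a trace into a Linux errno name.
# Covers only the error codes that appear in this tutorial.
errno_name() {
  case "$1" in
    -2)   echo "ENOENT (No such file or directory)" ;;
    -12)  echo "ENOMEM (Out of memory)" ;;
    -13)  echo "EACCES (Permission denied)" ;;
    -24)  echo "EMFILE (Too many open files)" ;;
    -110) echo "ETIMEDOUT (Connection timed out)" ;;
    -111) echo "ECONNREFUSED (Connection refused)" ;;
    -*)   echo "errno ${1#-} (not in this table)" ;;
    *)    echo "success (return value $1)" ;;
  esac
}

errno_name -13   # the failed openat from the trace
errno_name 3     # a successful openat returning file descriptor 3
```

<p>For codes outside this table, <code>man 3 errno</code> on a Linux machine lists the full set.</p>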
<h3 id="heading-what-traceloop-revealed">What Traceloop Revealed</h3>
<p>Traditional debugging told us "Permission denied" but Traceloop shows us:</p>
<ul>
<li><p><strong>Exactly which file</strong> the application tried to access</p>
</li>
<li><p><strong>When</strong> the permission denial happened in the execution flow</p>
</li>
<li><p><strong>How many times</strong> it tried (twice in this case)</p>
</li>
<li><p><strong>The exact system call</strong> that failed (<code>openat</code>)</p>
</li>
</ul>
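<p>In a longer trace, you probably won't want to read every row by eye. If you have saved the output to a file, a short <code>awk</code> filter can pull out only the failed calls. The sketch below reuses the rows from the trace above; the file name <code>trace.txt</code> and the function <code>failed_syscalls</code> are assumptions for illustration, and the field number depends on which columns your trace contains.</p>

```shell
#!/usr/bin/env bash
# Sample rows copied from the crash-app trace; trace.txt is a hypothetical file name.
cat > trace.txt <<'EOF'
K8S.NODE        K8S.NAMESPACE      K8S.PODNAME  COMM    SYSCALL    PARAMETERS                           RET
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 write      fd=3, buf="App starting..."         16
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 exit_group error_code=1                        0
EOF

# RET is the last column, so a negative last field marks a failed call.
# SYSCALL is field 5 in this particular column layout.
failed_syscalls() {
  awk 'NR > 1 && $NF ~ /^-[0-9]+$/ { print $5, $NF }' "$1"
}

failed_syscalls trace.txt
```

<p>Matching on the last field rather than a fixed column number for <code>RET</code> matters here, because the <code>PARAMETERS</code> column can contain spaces.</p>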
<h3 id="heading-real-world-applications">Real-World Applications</h3>
<p>This same approach works for debugging:</p>
<ul>
<li><p><strong>File not found errors</strong>: See exactly which files your app is looking for</p>
</li>
<li><p><strong>Network connection failures</strong>: Observe failed <code>connect()</code> system calls with specific addresses</p>
</li>
<li><p><strong>Memory issues</strong>: Watch <code>mmap()</code> and <code>brk()</code> calls that fail</p>
</li>
<li><p><strong>Container startup problems</strong>: See which system calls fail during initialization</p>
</li>
</ul>
<h3 id="heading-clean-up-1">Clean Up</h3>
<p>Remove the test resources:</p>
<pre><code class="lang-bash">kubectl delete pod crash-app -n debug-traceloop-ns
kubectl delete ns debug-traceloop-ns
</code></pre>
<h3 id="heading-key-takeaway">Key Takeaway</h3>
<p>Traditional Kubernetes debugging shows you what went wrong after it happened. Traceloop's continuous recording shows you exactly how it went wrong at the system level. This level of detail is invaluable for debugging complex production issues where the logs don't tell the full story.</p>
<h2 id="heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</h2>
<p>Now that you understand the fundamentals, let's explore common production issues and how Traceloop helps diagnose them. These scenarios mirror real problems you'll encounter in Kubernetes environments.</p>
<h3 id="heading-scenario-1-container-startup-failures">Scenario 1: Container Startup Failures</h3>
<p><strong>The problem</strong>: Your pod gets stuck in <code>CrashLoopBackOff</code> with unhelpful logs.</p>
<p>Traditional <code>kubectl</code> commands show limited information:</p>
<pre><code class="lang-bash">kubectl describe pod failing-app
<span class="hljs-comment"># Events: Back-off restarting failed container</span>

kubectl logs failing-app
<span class="hljs-comment"># (Empty or minimal output)</span>
</code></pre>
<p>System calls show the application tried to:</p>
<ol>
<li><p>Access configuration files that don't exist</p>
</li>
<li><p>Connect to services that aren't available</p>
</li>
<li><p>Write to directories without proper permissions</p>
</li>
</ol>
<p>Key system calls to watch:</p>
<ol>
<li><p><code>openat</code> with <code>-2</code> return (file not found)</p>
</li>
<li><p><code>connect</code> with <code>-111</code> return (connection refused)</p>
</li>
<li><p><code>access</code> with <code>-13</code> return (permission denied)</p>
</li>
</ol>
<h3 id="heading-scenario-2-memory-and-resource-issues">Scenario 2: Memory and Resource Issues</h3>
<p><strong>The problem</strong>: Application performance degrades or gets OOMKilled.</p>
<p>What Traceloop shows:</p>
<ol>
<li><p><code>mmap</code> calls failing (memory allocation issues)</p>
</li>
<li><p><code>brk</code> system calls indicating heap growth</p>
</li>
<li><p>File descriptor exhaustion through failed <code>openat</code> calls</p>
</li>
<li><p>Excessive <code>write</code> calls indicating memory pressure</p>
</li>
</ol>
<p><strong>Example pattern</strong>:</p>
<pre><code class="lang-bash">SYSCALL    PARAMETERS           RET
mmap       length=1048576       -12  <span class="hljs-comment"># ENOMEM - out of memory</span>
brk        brk=0x55555557d000   0    <span class="hljs-comment"># Heap expansion</span>
openat     filename=<span class="hljs-string">"/tmp/..."</span>   -24  <span class="hljs-comment"># EMFILE - too many open files</span>
</code></pre>
<h3 id="heading-scenario-3-network-connectivity-problems">Scenario 3: Network Connectivity Problems</h3>
<p><strong>The problem</strong>: Service-to-service communication fails intermittently.</p>
<p>Traditional debugging limitations:</p>
<ol>
<li><p>Application logs show "connection timeout"</p>
</li>
<li><p>Network policies seem correct</p>
</li>
<li><p>DNS resolution appears to work</p>
</li>
</ol>
<p>What Traceloop reveals:</p>
<ol>
<li><p>Exact IP addresses and ports being attempted</p>
</li>
<li><p>DNS resolution patterns through <code>openat</code> on <code>/etc/resolv.conf</code></p>
</li>
<li><p>Failed <code>connect</code> calls with specific error codes</p>
</li>
<li><p>Socket creation and binding issues</p>
</li>
</ol>
<p><strong>Key indicators</strong>:</p>
<pre><code class="lang-bash">SYSCALL    PARAMETERS                    RET
socket     family=AF_INET, <span class="hljs-built_in">type</span>=SOCK     3
connect    fd=3, addr=10.96.0.1:443     -110  <span class="hljs-comment"># ETIMEDOUT</span>
close      fd=3                         0
</code></pre>
<h3 id="heading-scenario-4-configuration-and-secret-issues">Scenario 4: Configuration and Secret Issues</h3>
<p><strong>The problem</strong>: Application can't access mounted secrets or config maps.</p>
<p>What system calls reveal:</p>
<ol>
<li><p>File access patterns for mounted volumes</p>
</li>
<li><p>Permission checks on secret files</p>
</li>
<li><p>Configuration file parsing attempts</p>
</li>
</ol>
<p>Common patterns:</p>
<ol>
<li><p>Multiple <code>openat</code> attempts on different config file paths</p>
</li>
<li><p><code>access</code> calls checking file permissions before opening</p>
</li>
<li><p>Failed reads from mounted secret volumes</p>
</li>
</ol>
<h3 id="heading-scenario-5-performance-bottlenecks">Scenario 5: Performance Bottlenecks</h3>
<p><strong>The problem</strong>: Application response times are slow without obvious cause.</p>
<p>Traceloop analysis:</p>
<ol>
<li><p>Excessive <code>fsync</code> calls (disk I/O bottlenecks)</p>
</li>
<li><p>Many <code>futex</code> calls (lock contention)</p>
</li>
<li><p>Frequent <code>recvfrom</code> timeouts (network issues)</p>
</li>
<li><p>Repeated file system operations</p>
</li>
</ol>
<p><strong>Performance indicators</strong>:</p>
<pre><code class="lang-bash">SYSCALL     FREQUENCY    ISSUE
fsync       High         Disk I/O bottleneck
futex       Excessive    Lock contention
poll        Many         Waiting <span class="hljs-keyword">for</span> I/O
recvfrom    Timeouts     Network delays
</code></pre>
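<p>A frequency table like the one above can be approximated from a saved trace by counting how often each system call appears. The following is a rough sketch: the sample rows are made up for illustration (with parameters elided as <code>...</code>), the file name <code>perf-trace.txt</code> and function <code>syscall_counts</code> are hypothetical, and the <code>SYSCALL</code> column is assumed to be field 5.</p>

```shell
#!/usr/bin/env bash
# Made-up sample rows for illustration; parameters are elided as "...".
cat > perf-trace.txt <<'EOF'
K8S.NODE K8S.NAMESPACE K8S.PODNAME COMM SYSCALL PARAMETERS RET
node-1 demo-ns demo-app app futex ... 0
node-1 demo-ns demo-app app fsync ... 0
node-1 demo-ns demo-app app futex ... 0
node-1 demo-ns demo-app app futex ... 0
node-1 demo-ns demo-app app fsync ... 0
node-1 demo-ns demo-app app poll ... 1
EOF

# Count occurrences of each syscall (field 5), most frequent first.
syscall_counts() {
  awk 'NR > 1 { count[$5]++ } END { for (s in count) print count[s], s }' "$1" |
    sort -k1,1nr -k2,2
}

syscall_counts perf-trace.txt
```

<p>A high count is only a hint about where to look. Whether many <code>futex</code> or <code>fsync</code> calls are actually a problem depends on what the application is supposed to be doing.</p>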
<h2 id="heading-best-practices">Best Practices</h2>
<h3 id="heading-when-to-use-traceloop">When to Use Traceloop</h3>
<p>Traceloop is most useful when you’re dealing with the kinds of problems that are notoriously difficult to pin down. If you’ve ever struggled with debugging intermittent crashes that don’t happen on demand, or run into confusing permission and access issues, this is where it works best.</p>
<p>It also helps uncover performance bottlenecks at the system level and provides visibility into application behavior during tricky startup failures. Another common use case is diagnosing network connectivity problems between pods, where other tools usually can't help.</p>
<p>Of course, not every problem requires system call tracing. For application-level issues, logs and APM tools are more effective. Cluster-level concerns are often better handled with <code>kubectl describe</code> or by looking at events, and if you’re primarily monitoring resources, standard metrics and dashboards show you what's happening.</p>
<h3 id="heading-performance-considerations">Performance Considerations</h3>
<p>Like any tracing tool, Traceloop adds some overhead, but it keeps the overhead low. You can keep it efficient by narrowing the scope of your traces, for example by filtering by namespace with <code>--namespace specific-ns</code> or by targeting specific pods with <code>--podname target-pod</code>. In high-traffic environments, it’s best to run traces for shorter periods, and node-specific tracing can further isolate debugging when you don’t want to instrument the entire cluster.</p>
<p>In most cases, Traceloop uses very little CPU and memory, thanks to its eBPF-based approach. This makes it lighter than traditional tools like strace. The actual cost depends on the volume of system calls being recorded, so it’s a good practice to monitor resource usage in your own environment to confirm it’s operating within acceptable limits.</p>
<h3 id="heading-integration-with-your-workflow">Integration with Your Workflow</h3>
<p>Traceloop works well in dev and production workflows. In development, it’s a powerful way to understand how your application interacts with the system. You can use it to confirm that your app handles edge cases correctly, or to validate permission and resource configurations before promoting workloads into production.</p>
<p>In production environments, you can deploy it in different ways, depending on how much overhead you can accept. Some teams run it continuously on a small subset of nodes, while others use it only when traditional debugging methods don’t provide enough insight. Pairing Traceloop with your existing monitoring and logging stack can give you a much more complete picture of system behavior.</p>
<p>It also helps with teamwork. Sharing trace outputs makes it easier for teams to reason about complex issues together. The data it provides can guide improvements in error handling and logging, and documenting common system call patterns can help onboard new developers more quickly.</p>
<h3 id="heading-security-considerations">Security Considerations</h3>
<p>Because Traceloop records low-level system activity, you need to be mindful of what it captures.</p>
<p><strong>What Traceloop Can See:</strong></p>
<ul>
<li><p>System call parameters (such as filenames and network addresses)</p>
</li>
<li><p>Process information and command arguments</p>
</li>
<li><p>File access patterns and permissions</p>
</li>
</ul>
<p><strong>Privacy Measures:</strong></p>
<ul>
<li><p>Limit trace duration to minimize data collection</p>
</li>
<li><p>Use namespace isolation to avoid capturing unrelated workloads</p>
</li>
<li><p>Apply data retention policies for trace outputs</p>
</li>
<li><p>Watch for sensitive information in file paths or system call parameters</p>
</li>
</ul>
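<p>One way to act on the last point is to mask parameter values before sharing a trace with others. The filter below is a minimal sketch, assuming parameters appear as <code>name="value"</code> pairs like those in the traces earlier in this tutorial. The function name <code>redact_trace</code> and the choice of which parameters to mask (<code>filename</code> and <code>buf</code>) are assumptions for illustration.</p>

```shell
#!/usr/bin/env bash
# Hypothetical filter: mask quoted filename and buffer values in trace output.
# Reads trace lines on stdin and writes the redacted lines to stdout.
redact_trace() {
  sed -E 's/(filename|buf)="[^"]*"/\1="REDACTED"/g'
}

# Example using a row from the crash-app trace.
printf '%s\n' 'crash-app python3 openat dfd=-100, filename="/etc/passwd" -13' | redact_trace
```

<p>The system call name and return code survive, so the trace stays useful for debugging while the paths and buffer contents are hidden. Other parameters, such as network addresses, would need their own patterns.</p>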
<h2 id="heading-conclusion">Conclusion</h2>
<p>Traceloop doesn’t just tell you something went wrong – it shows you how. By recording every system call in real time, it turns mysterious Kubernetes failures into solvable problems. Whether the issue happened seconds ago or in the middle of the night, the tool gives you the ability to rewind, inspect, and respond with confidence.</p>
<h3 id="heading-when-to-use-it">When to Use It</h3>
<p>Keep in mind that Traceloop complements your existing debugging toolkit rather than replacing it. Reach for it when logs don’t tell the whole story, when intermittent problems are hiding in the shadows, when <code>kubectl</code> commands leave you guessing, or when you need to see how your application is really interacting with the system.</p>
<p>Once you’re comfortable with Traceloop, you can add more tools. <a target="_blank" href="https://inspektor-gadget.io/">Inspektor Gadget</a> offers other tools for network, security, and performance debugging that pair well with Traceloop. Integrating it into your incident response workflow, sharing insights across your team, and even considering continuous tracing for critical workloads are good things to try next.</p>
<p>The next time you run into a stubborn Kubernetes pod failure, you won’t be stuck speculating. With Traceloop, you can “rewind the tape” and see exactly what happened. System call tracing may sound complex at first, but in practice, it’s one of the most powerful ways to truly understand how applications behave in containerized environments.</p>
<p><strong>PS:</strong> Have any questions about Traceloop or want to share your debugging challenges? The Inspektor Gadget team and community hang out in the <a target="_blank" href="https://kubernetes.slack.com/archives/CSYL75LF6">#inspektor-gadget</a> channel on Kubernetes Slack. It's a great place to get help from the engineers who built these tools, share experiences, and maybe contribute to making the ecosystem better.</p>
<p>You can also connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> if you’d like to stay in touch. If you made it to the end of this tutorial, thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Debug CI/CD Pipelines: A Handbook on Troubleshooting with Observability Tools ]]>
                </title>
                <description>
                    <![CDATA[ Observability is a game-changer for CI/CD pipelines, and it’s one of the most exciting aspects of DevOps. When I started working with CI/CD systems, I assumed the hardest part would be building the pipeline. But with increasingly complex setups, the ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-cicd-pipelines-handbook/</link>
                <guid isPermaLink="false">6850a9eb7255997ee3d47265</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #prometheus ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ promql ]]>
                    </category>
                
                    <category>
                        <![CDATA[ loki ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Mon, 16 Jun 2025 23:34:03 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748620971355/d4893ec5-8016-491e-9626-15d971f0c885.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Observability is a game-changer for CI/CD pipelines, and it’s one of the most exciting aspects of DevOps. When I started working with CI/CD systems, I assumed the hardest part would be building the pipeline. But with increasingly complex setups, the real challenge is debugging failures, like builds crashing or tests failing only in production.</p>
<p>Observability signals, such as logs, metrics, and traces, provide the visibility you need to pinpoint issues quickly. In this handbook, we’ll explore free and open-source tools you can use to make your CI/CD pipelines more reliable. We’ll use practical steps to troubleshoot like a pro – no enterprise licenses required.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-observability-is-important">Why Observability is Important</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-and-configure-grafana-loki-on-budget-infrastructure">How to Install and Configure Grafana Loki on Budget Infrastructure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-an-elk-stack-alternative-for-pipeline-observability">How to Implement an ELK Stack Alternative for Pipeline Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-unified-logging-strategy-across-pipeline-components">How to Create a Unified Logging Strategy Across Pipeline Components</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-query-and-analyze-logs-for-effective-troubleshooting">How to Query and Analyze Logs for Effective Troubleshooting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-prometheus-metrics-alongside-your-logs">How to Set Up Prometheus Metrics Alongside Your Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-grafana-dashboards-that-combine-metrics-and-logs">How to Create Grafana Dashboards That Combine Metrics and Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-exemplars-to-jump-from-metrics-to-relevant-logs">How to Use Exemplars to Jump from Metrics to Relevant Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-diagnose-and-fix-common-cicd-problems">How to Diagnose and Fix Common CI/CD Problems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-advanced-debugging-techniques">How to Implement Advanced Debugging Techniques</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-conduct-effective-postmortems-using-logs">How to Conduct Effective Postmortems Using Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-optimize-log-storage-and-management">How to Optimize Log Storage and Management</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>There are some things you should know and have to get the most out of this handbook:</p>
<h4 id="heading-technical-knowledge">Technical Knowledge:</h4>
<ul>
<li><p>Basic understanding of <a target="_blank" href="https://www.freecodecamp.org/news/what-is-ci-cd/">CI/CD pipelines</a> (for example, build, test, deploy stages).</p>
</li>
<li><p>Familiarity with <a target="_blank" href="https://www.freecodecamp.org/news/helpful-linux-commands-you-should-know/">Linux/Unix commands</a> (for example, <code>mkdir</code>, <code>grep</code>, <code>curl</code>).</p>
</li>
<li><p>Comfortable with <a target="_blank" href="https://www.freecodecamp.org/news/the-docker-handbook/">Docker basics</a> (for example, <code>docker run</code>, <code>docker-compose up</code>).</p>
</li>
<li><p>Optional: Awareness of <a target="_blank" href="https://www.freecodecamp.org/news/observability-in-cloud-native-applications/">observability concepts</a> (logs, metrics, traces) or YAML configuration.</p>
</li>
</ul>
<h4 id="heading-software-and-tools">Software and Tools:</h4>
<ul>
<li><p><strong>Docker and Docker Compose</strong>: Installed and running (verify with <code>docker --version</code> and <code>docker-compose --version</code>).</p>
</li>
<li><p><strong>CI/CD Platform</strong>: Access to GitHub Actions, Jenkins, or GitLab CI with a sample pipeline that generates logs.</p>
</li>
<li><p><strong>Text Editor</strong>: For editing YAML files (for example, VS Code, Nano).</p>
</li>
<li><p><strong>Web Browser</strong>: To access tool UIs (for example, Grafana on port 3000, Kibana on 5601).</p>
</li>
<li><p>Optional: <code>curl</code> for testing log forwarding, Git for version control.</p>
</li>
</ul>
<h4 id="heading-hardware-and-infrastructure">Hardware and Infrastructure:</h4>
<ul>
<li><p>Machine with:</p>
<ul>
<li><p>OS: Linux, Windows (with WSL2), or macOS.</p>
</li>
<li><p>4GB RAM (8GB recommended), 20GB free disk space.</p>
</li>
<li><p>Stable internet and ability to open ports (for example, 3100 for Loki, 9200 for Elasticsearch).</p>
</li>
</ul>
</li>
<li><p>Optional: Cloud provider access (for example, AWS, GCP) for scalable setups.</p>
</li>
</ul>
<h4 id="heading-access-and-permissions">Access and Permissions:</h4>
<ul>
<li><p>Admin access to install Docker and configure CI/CD tools.</p>
</li>
<li><p>Permissions to modify pipeline configs (for example, <code>.github/workflows</code>, <code>.gitlab-ci.yml</code>).</p>
</li>
<li><p>Optional: Container registry access (for example, Docker Hub) for custom images.</p>
</li>
</ul>
<h2 id="heading-why-observability-is-important"><strong>Why Observability is Important</strong></h2>
<p>Modern CI/CD pipelines are no longer linear scripts – they are now complex, distributed systems involving multiple tools, environments, and infrastructure layers. One job runs on GitHub Actions, another deploys via Jenkins, and a third builds Docker images in a Kubernetes cluster.</p>
<p>So when something breaks, you’re left chasing logs across tools, guessing where the issue originated, and wasting hours trying to reproduce it.</p>
<p>And worse still, traditional debugging tools often stop at the surface, only showing failed jobs without the context of <em>why</em> they failed or <em>where</em> in the system the fault actually lies.</p>
<p>Observability flips the script. Instead of hunting through disconnected logs or rerunning failed builds blindly, observability gives you <strong>insight</strong>, not just data. By combining structured logs, metrics, and traces, you can:</p>
<ul>
<li><p>Reconstruct exactly what happened in a pipeline failure</p>
</li>
<li><p>Trace a failure across CI agents, deployment steps, and containers</p>
</li>
<li><p>Visualize patterns and anomalies before they become outages</p>
</li>
</ul>
<p>More importantly, observability helps you <strong>move from reactive debugging to proactive prevention</strong>.</p>
<p>Here’s what you’ll learn about and accomplish in this guide:</p>
<ul>
<li><p>Set up cost-effective observability using Grafana Loki, lightweight ELK, and OpenTelemetry</p>
</li>
<li><p>Create a unified logging strategy to connect your pipeline</p>
</li>
<li><p>Write precise queries to pinpoint root causes quickly, and correlate logs, metrics, and traces for comprehensive debugging</p>
</li>
<li><p>Troubleshoot CI/CD issues like build failures, flaky tests, and container crashes</p>
</li>
<li><p>Build custom dashboards and automated diagnostic tools</p>
</li>
<li><p>Promote observability through documentation and post-mortems</p>
</li>
</ul>
<p>Whether you're a solo developer or part of a DevOps team, this guide will transform your chaotic CI/CD pipelines into clear, reliable, and observable systems.</p>
<h3 id="heading-how-to-choose-the-right-observability-tool-for-cicd"><strong>How to Choose the Right Observability Tool for CI/CD</strong></h3>
<p>Here’s a quick comparison of Grafana Loki, Lightweight ELK, and Vector for CI/CD observability:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Tool</strong></td><td><strong>Resource Usage</strong></td><td><strong>Setup Complexity</strong></td><td><strong>Best For</strong></td><td><strong>CI/CD Fit</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Grafana Loki</strong></td><td>Low (lightweight)</td><td>Easy (Docker-based)</td><td>Small teams, budget infra</td><td>Simple pipelines, JSON logs, Grafana users</td></tr>
<tr>
<td><strong>Lightweight ELK</strong></td><td>High (Elasticsearch-heavy)</td><td>Moderate (multi-container)</td><td>Teams needing advanced search/visualization</td><td>Complex pipelines, rich querying needs</td></tr>
<tr>
<td><strong>Vector</strong></td><td>Very low</td><td>Easy (single binary)</td><td>Resource-constrained setups</td><td>Minimal setups, log forwarding</td></tr>
</tbody>
</table>
</div><p>How to choose:</p>
<ul>
<li><p><strong>Loki</strong>: Ideal for startups or solo devs with limited resources. Integrates well with Prometheus/Grafana.</p>
</li>
<li><p><strong>ELK</strong>: Best for teams needing Kibana’s advanced visualizations or handling large log volumes.</p>
</li>
<li><p><strong>Vector</strong>: Great for lightweight log forwarding in distributed CI/CD setups.</p>
</li>
</ul>
<p><strong>Grafana Loki</strong> is a log aggregation system like ELK, but it is lighter on resources, which makes it a good fit for CI/CD pipelines with limited infrastructure.</p>
<h2 id="heading-how-to-install-and-configure-grafana-loki-on-budget-infrastructure">How to Install and Configure Grafana Loki on Budget Infrastructure</h2>
<h3 id="heading-option-a-quick-docker-setup-recommended-for-budget-infra">🛠 Option A: Quick Docker Setup (Recommended for Budget Infra)</h3>
<ol>
<li><p><strong>Create a directory for configuration:</strong></p>
<pre><code class="lang-bash"> mkdir -p ~/loki-setup &amp;&amp; <span class="hljs-built_in">cd</span> ~/loki-setup
</code></pre>
</li>
<li><p><strong>Create a</strong> <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Defines a Docker Compose setup for Grafana Loki and Promtail to aggregate and scrape logs efficiently.</span>
 <span class="hljs-attr">version:</span> <span class="hljs-string">"3"</span>

 <span class="hljs-attr">services:</span>
   <span class="hljs-attr">loki:</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/loki:2.9.4</span>  <span class="hljs-comment"># Uses Loki version 2.9.4 for lightweight log aggregation.</span>
     <span class="hljs-attr">ports:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">"3100:3100"</span>  <span class="hljs-comment"># Exposes Loki’s HTTP API port for log ingestion and queries.</span>
     <span class="hljs-attr">command:</span> <span class="hljs-string">-config.file=/etc/loki/loki-config.yaml</span>  <span class="hljs-comment"># Specifies the configuration file for Loki.</span>
     <span class="hljs-attr">volumes:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">./loki-config.yaml:/etc/loki/loki-config.yaml</span>  <span class="hljs-comment"># Mounts the local config file into the container.</span>

   <span class="hljs-attr">promtail:</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/promtail:2.9.4</span>  <span class="hljs-comment"># Uses Promtail version 2.9.4 to scrape and forward logs to Loki.</span>
     <span class="hljs-attr">volumes:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">/var/log:/var/log</span>  <span class="hljs-comment"># Mounts the host’s log directory for Promtail to scrape.</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">./promtail-config.yaml:/etc/promtail/promtail-config.yaml</span>  <span class="hljs-comment"># Mounts the Promtail config file.</span>
     <span class="hljs-attr">command:</span> <span class="hljs-string">-config.file=/etc/promtail/promtail-config.yaml</span>  <span class="hljs-comment"># Specifies the configuration file for Promtail.</span>
</code></pre>
</li>
<li><p><strong>Create a basic</strong> <code>loki-config.yaml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Configures Grafana Loki for lightweight log storage and querying in a CI/CD environment.</span>
 <span class="hljs-attr">auth_enabled:</span> <span class="hljs-literal">false</span>  <span class="hljs-comment"># Disables authentication for simplicity (not recommended for production).</span>

 <span class="hljs-attr">server:</span>
   <span class="hljs-attr">http_listen_port:</span> <span class="hljs-number">3100</span>  <span class="hljs-comment"># Sets the port for Loki’s HTTP API.</span>

 <span class="hljs-attr">ingester:</span>
   <span class="hljs-attr">lifecycler:</span>
     <span class="hljs-attr">ring:</span>
       <span class="hljs-attr">kvstore:</span>
         <span class="hljs-attr">store:</span> <span class="hljs-string">inmemory</span>  <span class="hljs-comment"># Uses in-memory storage for the ring, suitable for small setups.</span>
       <span class="hljs-attr">replication_factor:</span> <span class="hljs-number">1</span>  <span class="hljs-comment"># Sets single replica for minimal resource use.</span>
   <span class="hljs-attr">chunk_idle_period:</span> <span class="hljs-string">3m</span>  <span class="hljs-comment"># Flushes chunks to storage after 3 minutes of inactivity.</span>
   <span class="hljs-attr">max_chunk_age:</span> <span class="hljs-string">1h</span>  <span class="hljs-comment"># Retires chunks after 1 hour to balance storage and query performance.</span>

 <span class="hljs-attr">schema_config:</span>
   <span class="hljs-attr">configs:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span> <span class="hljs-number">2023-01-01</span>  <span class="hljs-comment"># Defines the schema start date.</span>
       <span class="hljs-attr">store:</span> <span class="hljs-string">boltdb-shipper</span>  <span class="hljs-comment"># Uses BoltDB for indexing logs.</span>
       <span class="hljs-attr">object_store:</span> <span class="hljs-string">filesystem</span>  <span class="hljs-comment"># Stores logs on the local filesystem.</span>
       <span class="hljs-attr">schema:</span> <span class="hljs-string">v11</span>  <span class="hljs-comment"># Specifies schema version for log storage.</span>
       <span class="hljs-attr">index:</span>
         <span class="hljs-attr">prefix:</span> <span class="hljs-string">index_</span>  <span class="hljs-comment"># Prefix for index files.</span>
         <span class="hljs-attr">period:</span> <span class="hljs-string">24h</span>  <span class="hljs-comment"># Rotates indexes daily.</span>

 <span class="hljs-attr">storage_config:</span>
   <span class="hljs-attr">boltdb_shipper:</span>
     <span class="hljs-attr">active_index_directory:</span> <span class="hljs-string">/tmp/loki/index</span>  <span class="hljs-comment"># Directory for active index files.</span>
     <span class="hljs-attr">cache_location:</span> <span class="hljs-string">/tmp/loki/boltdb-cache</span>  <span class="hljs-comment"># Cache location for BoltDB.</span>
   <span class="hljs-attr">filesystem:</span>
     <span class="hljs-attr">directory:</span> <span class="hljs-string">/tmp/loki/chunks</span>  <span class="hljs-comment"># Directory for storing log chunks.</span>

 <span class="hljs-attr">limits_config:</span>
   <span class="hljs-attr">enforce_metric_name:</span> <span class="hljs-literal">false</span>  <span class="hljs-comment"># Disables strict metric name enforcement for flexibility.</span>
</code></pre>
</li>
<li><p><strong>Create a basic</strong> <code>promtail-config.yaml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Configures Promtail to scrape system logs and forward them to Loki.</span>
 <span class="hljs-attr">server:</span>
   <span class="hljs-attr">http_listen_port:</span> <span class="hljs-number">9080</span>  <span class="hljs-comment"># Sets Promtail’s HTTP port for metrics and health checks.</span>
   <span class="hljs-attr">grpc_listen_port:</span> <span class="hljs-number">0</span>  <span class="hljs-comment"># Port 0 lets Promtail pick a random free gRPC port, since no fixed one is needed here.</span>

 <span class="hljs-attr">positions:</span>
   <span class="hljs-attr">filename:</span> <span class="hljs-string">/tmp/positions.yaml</span>  <span class="hljs-comment"># Stores the position of scraped logs to resume after restarts.</span>

 <span class="hljs-attr">clients:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">url:</span> <span class="hljs-string">http://loki:3100/loki/api/v1/push</span>  <span class="hljs-comment"># Specifies the Loki endpoint for log ingestion.</span>

 <span class="hljs-attr">scrape_configs:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">system</span>  <span class="hljs-comment"># Defines a scraping job for system logs.</span>
     <span class="hljs-attr">static_configs:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span>
           <span class="hljs-bullet">-</span> <span class="hljs-string">localhost</span>  <span class="hljs-comment"># Targets the local host for log collection.</span>
         <span class="hljs-attr">labels:</span>
           <span class="hljs-attr">job:</span> <span class="hljs-string">varlogs</span>  <span class="hljs-comment"># Labels logs for easy querying in Loki.</span>
           <span class="hljs-attr">__path__:</span> <span class="hljs-string">/var/log/*.log</span>  <span class="hljs-comment"># Scrapes all log files in /var/log directory.</span>
</code></pre>
</li>
<li><p><strong>Run it:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Starts the Loki and Promtail containers in detached mode for background operation.</span>
 docker-compose up -d
</code></pre>
</li>
</ol>
<p>✨ This brings up Loki and Promtail with minimal resources, no authentication, and log scraping from <code>/var/log</code>.</p>
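<p>To confirm that logs are arriving, you can query Loki’s HTTP API for the <code>varlogs</code> job defined above. The Python sketch below only builds the request URL for the <code>query_range</code> endpoint. The helper name <code>build_query_range_url</code> and the <code>localhost:3100</code> address are assumptions for illustration, based on the port mapping in the Compose file.</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_range_url(base_url, logql, limit=20):
    """Return a URL for Loki's query_range endpoint with the given LogQL selector."""
    # urlencode percent-escapes the braces and quotes in the LogQL selector.
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


# Assumes Loki is published on localhost:3100, as in the Compose file above.
url = build_query_range_url("http://localhost:3100", '{job="varlogs"}')
print(url)
```

<p>Opening the printed URL with <code>curl</code> or a browser should return a JSON response. An empty result usually means Promtail has not shipped anything for that label yet.</p>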
<h4 id="heading-troubleshooting-loki-setup-issues">Troubleshooting Loki Setup Issues</h4>
<p>If Loki or Promtail fails to start, one of the following may be the issue:</p>
<ol>
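<li><p><strong>Container crashes</strong>: Check logs with <code>docker-compose logs loki</code> or <code>docker-compose logs promtail</code>. The Compose file above does not set <code>container_name</code>, so plain <code>docker logs loki</code> will not find the container. Look for errors like <em>“out of memory”</em> or <em>“port already in use.”</em></p>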
<ul>
<li>Fix: Increase the memory available to the containers (for example, by raising resource limits in <code>docker-compose.yml</code>) or change the host port mapping (for example, <code>3101:3100</code>).</li>
</ul>
</li>
<li><p><strong>Logs not ingested</strong>: Verify that Promtail is scraping the path you expect (<code>/var/log/*.log</code> in the config above) using <code>docker-compose exec promtail cat /etc/promtail/promtail-config.yaml</code>.</p>
<ul>
<li>Fix: Update <code>__path__</code> in <code>promtail-config.yaml</code> to match your CI/CD log directory.</li>
</ul>
</li>
<li><p><strong>Resource Constraints</strong>: Monitor resource usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Ensure your machine has at least 4GB RAM and 20GB disk space, as specified in the prerequisites.</li>
</ul>
</li>
</ol>
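<p>For the “port already in use” case, a quick check before starting the stack can save a restart cycle. This is a minimal Python sketch, assuming you run it on the Docker host. The helper name <code>port_in_use</code> is hypothetical, and ports 3100 and 3101 are simply the default and the alternative mapping mentioned above.</p>

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # A successful connection means a process is already listening there.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing reachable on that port.
        return False


# 3100 is Loki's default port; 3101 is the alternative host port suggested above.
for port in (3100, 3101):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```

<p>If 3100 shows as in use before you start Loki, another process owns it, and remapping the host side of the port in <code>docker-compose.yml</code> is usually the simplest fix.</p>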
<h3 id="heading-configuration-for-cicd-logging">Configuration for CI/CD Logging</h3>
<p>To adapt for CI/CD logs, you should:</p>
<h4 id="heading-1-configure-your-cicd-tools-to-write-logs-to-disk">1. Configure your CI/CD tools to write logs to disk:</h4>
<p>For example, GitHub Actions with a self-hosted runner can write logs to <code>/var/log/gha/*.log</code>.</p>
<p>Update Promtail:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Configures Promtail to scrape logs from GitHub Actions runners for CI/CD observability.</span>
<span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">github_actions</span>  <span class="hljs-comment"># Defines a scraping job for GitHub Actions logs.</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost'</span>]  <span class="hljs-comment"># Targets the local host where the runner writes logs.</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">job:</span> <span class="hljs-string">gha</span>  <span class="hljs-comment"># Labels logs for identification in Loki queries.</span>
          <span class="hljs-attr">__path__:</span> <span class="hljs-string">/var/log/gha/*.log</span>  <span class="hljs-comment"># Scrapes logs from the specified directory.</span>
</code></pre>
<h4 id="heading-2-use-structured-logging-json">2. Use structured logging (JSON):</h4>
<p>Make sure your CI/CD tools or scripts output logs in a structured format. In the example below, each entry carries a UTC timestamp, a severity level, the CI/CD job that produced it (for example, the deploy stage), and a descriptive message:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T13:00:00Z"</span>,
  <span class="hljs-attr">"level"</span>: <span class="hljs-string">"error"</span>,
  <span class="hljs-attr">"job"</span>: <span class="hljs-string">"deploy"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Image pull failed"</span>
}
</code></pre>
<p>This helps when querying with LogQL.</p>
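<p>If your pipeline steps are scripts you control, emitting this format takes only a few lines. The Python sketch below produces one JSON object per line with the same four fields as the example above. The function name <code>format_log</code> and its <code>now</code> parameter are assumptions for illustration, not part of any CI tool.</p>

```python
import json
from datetime import datetime, timezone


def format_log(level, job, message, now=None):
    """Return one JSON log line with timestamp, level, job, and message fields."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),  # UTC, second precision
        "level": level,
        "job": job,
        "message": message,
    }
    # json.dumps keeps the record on a single line, which suits line-based shippers.
    return json.dumps(record)


# A fixed time is passed here only to make the output reproducible.
line = format_log(
    "error", "deploy", "Image pull failed",
    now=datetime(2025, 5, 10, 13, 0, 0, tzinfo=timezone.utc),
)
print(line)
```

<p>Appending each returned line to a file under the directory Promtail watches is enough for the scrape config above to pick it up.</p>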
<h3 id="heading-how-to-connect-ci-agents-to-loki">How to Connect CI Agents to Loki</h3>
<p>This section explains three different ways to get your CI pipeline logs into Loki for monitoring and analysis:</p>
<h4 id="heading-option-1-local-setup">Option 1 – Local setup:</h4>
<p>Your CI agents write log files to disk, and Promtail (running on the same machine) reads those files and sends them to Loki.</p>
<h4 id="heading-option-2-using-docker-logging-driver-docker-containers">Option 2 – Using Docker logging driver (Docker containers):</h4>
<p>If your CI agents run in Docker containers, you install a special Loki plugin that automatically captures all container output and sends it directly to Loki without needing separate log files.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Installs the Loki Docker logging driver to send container logs directly to Loki.</span>
docker plugin install grafana/loki-docker-driver:latest --<span class="hljs-built_in">alias</span> loki --grant-all-permissions
</code></pre>
<p>Then run your agent container:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs a CI agent container with the Loki logging driver to forward logs.</span>
docker run --log-driver=loki \
  --log-opt loki-url=<span class="hljs-string">"http://&lt;your-loki-host&gt;:3100/loki/api/v1/push"</span> \
  my-ci-agent-image
</code></pre>
<h4 id="heading-option-3-remote-setup">Option 3 – Remote setup:</h4>
<p>If you can't install Promtail locally, you can use a log forwarding tool like <a target="_blank" href="https://fluentbit.io/">Fluent Bit</a> or <a target="_blank" href="https://vector.dev/">Vector</a> to collect logs and push them to Loki over the network.</p>
<p><strong>The goal:</strong> Regardless of which option you choose, you’ll end up with all your CI pipeline logs centralized in Loki, where you can search through them, create dashboards in Grafana, and set up alerts when things go wrong.</p>
<p>Together, these options let you fit log collection to your infrastructure – whether you prefer local agents, Docker plugins, or remote forwarding.</p>
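<p>Whichever option you pick, the logs end up at the same push endpoint that appears in the Promtail config (<code>/loki/api/v1/push</code>). As a rough illustration of what a forwarder sends, the Python sketch below builds a push request body: a list of streams, each with a label set and <code>[timestamp, line]</code> pairs where the timestamp is a string of Unix nanoseconds. The helper name <code>build_push_payload</code> and the <code>gha</code> label value are assumptions for illustration.</p>

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for Loki's /loki/api/v1/push endpoint."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    # Loki expects each entry as [unix-nanosecond timestamp as a string, log line].
    # Adding the index keeps entries in order when they share a base timestamp.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


# A fixed timestamp is used here only to make the output reproducible.
payload = build_push_payload(
    {"job": "gha"},
    ["build started", "build failed"],
    now_ns=1746882000000000000,
)
print(json.dumps(payload))
```

<p>Sending this body as an HTTP POST with a <code>Content-Type: application/json</code> header is what tools like Fluent Bit or Vector do on your behalf, with batching and retries added.</p>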
<h2 id="heading-how-to-implement-an-elk-stack-alternative-for-pipeline-observability">How to Implement an ELK Stack Alternative for Pipeline Observability</h2>
<p>When full ELK (Elasticsearch, Logstash, Kibana) is too heavy for your infrastructure, you can go with lightweight setups that achieve similar observability at a lower cost and resource usage.</p>
<h3 id="heading-how-to-install-lightweight-versions-of-elasticsearch-logstash-and-kibana">How to Install Lightweight Versions of Elasticsearch, Logstash, and Kibana</h3>
<p>Goal: Stand up a minimal yet functional ELK stack for debugging CI/CD pipelines.</p>
<h4 id="heading-1-use-docker-to-spin-up-lightweight-containers">1. Use Docker to spin up lightweight containers</h4>
<p>Create a <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a Docker Compose setup for a lightweight ELK stack to aggregate and visualize CI/CD logs.</span>
<span class="hljs-attr">version:</span> <span class="hljs-string">'3.7'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">elasticsearch:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/elasticsearch/elasticsearch:7.17.0</span>  <span class="hljs-comment"># Uses Elasticsearch 7.17.0.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">elasticsearch</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">discovery.type=single-node</span>  <span class="hljs-comment"># Runs Elasticsearch in single-node mode for simplicity.</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">xpack.security.enabled=false</span>  <span class="hljs-comment"># Disables security features for lightweight setup.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9200:9200"</span>  <span class="hljs-comment"># Exposes Elasticsearch’s HTTP API port.</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">esdata:/usr/share/elasticsearch/data</span>  <span class="hljs-comment"># Persists Elasticsearch data.</span>

  <span class="hljs-attr">logstash:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/logstash/logstash:7.17.0</span>  <span class="hljs-comment"># Uses Logstash 7.17.0.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">logstash</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5044:5044"</span>  <span class="hljs-comment"># Port for receiving logs from Beats.</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9600:9600"</span>  <span class="hljs-comment"># Port for Logstash monitoring.</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./logstash.conf:/usr/share/logstash/pipeline/logstash.conf</span>  <span class="hljs-comment"># Mounts Logstash config file.</span>

  <span class="hljs-attr">kibana:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/kibana/kibana:7.17.0</span>  <span class="hljs-comment"># Uses Kibana 7.17.0 for visualization.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kibana</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">ELASTICSEARCH_HOSTS=http://elasticsearch:9200</span>  <span class="hljs-comment"># Links Kibana to Elasticsearch.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5601:5601"</span>  <span class="hljs-comment"># Exposes Kibana’s web UI port.</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">esdata:</span>  <span class="hljs-comment"># Defines a volume for persisting Elasticsearch data.</span>
</code></pre>
<h4 id="heading-2-minimal-logstash-pipeline-configuration-logstashconf">2. Minimal Logstash pipeline configuration (logstash.conf)</h4>
<pre><code class="lang-javascript"><span class="hljs-comment"># Configures Logstash to process and forward CI/CD logs to Elasticsearch.</span>
input {
  beats {
    port =&gt; <span class="hljs-number">5044</span>  <span class="hljs-comment"># Listens for logs from Filebeat on port 5044.</span>
  }
}

filter {
  json {
    source =&gt; <span class="hljs-string">"message"</span>  <span class="hljs-comment"># Parses JSON-formatted log messages for structured data.</span>
  }
}

output {
  elasticsearch {
    hosts =&gt; [<span class="hljs-string">"http://elasticsearch:9200"</span>]  <span class="hljs-comment"># Sends processed logs to Elasticsearch.</span>
    index =&gt; <span class="hljs-string">"ci-logs-%{+YYYY.MM.dd}"</span>  <span class="hljs-comment"># Stores logs in daily indexes (e.g., ci-logs-2025.05.14).</span>
  }
}
</code></pre>
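<p>To make the pipeline’s effect concrete, here is a simplified Python model of what happens to one event: the JSON in <code>message</code> is parsed into top-level fields, and the daily index name is derived from the event’s timestamp in UTC. This is a sketch for intuition, not Logstash code. The function name <code>to_indexed_event</code> is hypothetical, and the <code>_jsonparsefailure</code> tag mirrors the json filter’s default behavior on unparsable input.</p>

```python
import json
from datetime import datetime, timezone


def to_indexed_event(raw_message, event_time):
    """Model the pipeline: parse JSON from `message`, then choose the daily index."""
    event = {"message": raw_message}
    try:
        parsed = json.loads(raw_message)
        if isinstance(parsed, dict):
            # Parsed keys become top-level fields and overwrite existing ones.
            event.update(parsed)
    except json.JSONDecodeError:
        # Unparsable lines are kept as-is and tagged so they can be found later.
        event["tags"] = ["_jsonparsefailure"]
    # The index pattern ci-logs-%{+YYYY.MM.dd} is filled from the event time in UTC.
    index = "ci-logs-" + event_time.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return index, event


when = datetime(2025, 5, 14, 23, 30, tzinfo=timezone.utc)
index, event = to_indexed_event(
    '{"level": "error", "job": "deploy", "message": "Image pull failed"}', when
)
print(index, event)
```

<p>Because each day gets its own index, old CI logs can be removed by deleting whole indexes instead of individual documents.</p>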
<h4 id="heading-troubleshooting-elk-setup-issues">Troubleshooting ELK Setup Issues</h4>
<p>If Elasticsearch, Logstash, or Kibana fails to start, one of the following might be the issue:</p>
<ol>
<li><p><strong>Container crashes</strong>: Check logs with <code>docker logs elasticsearch</code>, <code>docker logs logstash</code>, or <code>docker logs kibana</code>. Look for errors like <em>“insufficient disk space”</em> or <em>“port conflict”</em> (for example, 9200, 5601).</p>
<ul>
<li>Fix: Free up disk space (ensure at least 20GB available) or change ports in <code>docker-compose.yml</code> (for example, <code>9201:9200</code>).</li>
</ul>
</li>
<li><p><strong>Logs not ingested</strong>: Verify Logstash is receiving data from Filebeat or Vector using <code>docker logs logstash</code>. Check the <code>logstash.conf</code> input port (for example, 5044).</p>
<ul>
<li>Fix: Ensure Filebeat or Vector is configured to send to the correct Logstash endpoint (e.g., <code>localhost:5044</code>) and update if needed.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor resource usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 8GB RAM and 30GB disk space, as Elasticsearch requires more resources than Loki. Adjust memory limits in <code>docker-compose.yml</code> if necessary.</li>
</ul>
</li>
</ol>
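<p>Once the containers stay up, Elasticsearch’s cluster health API (<code>http://localhost:9200/_cluster/health</code> with the port mapping above) is a quick way to see whether the node is usable. The Python sketch below interprets the <code>status</code> field of such a response. The function name <code>describe_cluster_health</code> is hypothetical, and the sample body is a trimmed, made-up response used only for illustration.</p>

```python
import json


def describe_cluster_health(body):
    """Turn an Elasticsearch _cluster/health JSON body into a one-line verdict."""
    health = json.loads(body)
    status = health.get("status", "unknown")
    if status == "green":
        verdict = "all shards assigned"
    elif status == "yellow":
        verdict = "primaries assigned, replicas unassigned (often expected on a single node)"
    elif status == "red":
        verdict = "some primary shards unassigned, check disk space and container logs"
    else:
        verdict = "unexpected response"
    return f"{status}: {verdict}"


# Hypothetical, trimmed response body for a single-node setup.
sample = '{"status": "yellow", "number_of_nodes": 1}'
print(describe_cluster_health(sample))
```

<p>A <code>yellow</code> status on this single-node stack is usually not a problem in itself, while <code>red</code> is worth investigating before you debug Logstash or Filebeat.</p>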
<h3 id="heading-how-to-configure-log-shippers-for-different-cicd-components">How to Configure Log Shippers for Different CI/CD Components</h3>
<p>Goal: Get logs from your pipeline into Logstash or Elasticsearch.</p>
<h4 id="heading-option-1-use-filebeat-lightweight-log-shipper">Option 1: Use Filebeat (lightweight log shipper)</h4>
<p>Install <a target="_blank" href="https://www.elastic.co/beats/filebeat">Filebeat</a> on your CI/CD hosts (GitHub runner, Jenkins node, GitLab runner, and so on).</p>
<p>Filebeat config snippet (filebeat.yml):</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Configures Filebeat to collect CI/CD logs and forward them to Logstash.</span>
<span class="hljs-attr">filebeat.inputs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">log</span>  <span class="hljs-comment"># Specifies log file input.</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>  <span class="hljs-comment"># Enables the input.</span>
    <span class="hljs-attr">paths:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">/var/log/ci/*.log</span>  <span class="hljs-comment"># Scrapes logs from the specified CI log directory.</span>

<span class="hljs-attr">output.logstash:</span>
  <span class="hljs-attr">hosts:</span> [<span class="hljs-string">"localhost:5044"</span>]  <span class="hljs-comment"># Forwards logs to Logstash on port 5044.</span>
</code></pre>
<p>Then run:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs Filebeat with the specified configuration file for log collection.</span>
filebeat -e -c filebeat.yml
</code></pre>
<h4 id="heading-option-2-use-vectordev-as-a-more-resource-efficient-alternative-to-filebeat">Option 2: Use Vector.dev as a more resource-efficient alternative to Filebeat</h4>
<p>Vector configuration (vector.toml):</p>
<pre><code class="lang-toml"><span class="hljs-comment"># Configures Vector to collect, parse, and forward CI/CD logs to Elasticsearch efficiently.</span>
<span class="hljs-section">[sources.ci_logs]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"file"</span>  <span class="hljs-comment"># Specifies file-based log collection.</span>
  <span class="hljs-attr">include</span> = [<span class="hljs-string">"/var/log/ci/*.log"</span>]  <span class="hljs-comment"># Targets CI log files.</span>

<span class="hljs-section">[transforms.json_parser]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"remap"</span>  <span class="hljs-comment"># Uses remap transform to parse logs.</span>
  <span class="hljs-attr">inputs</span> = [<span class="hljs-string">"ci_logs"</span>]  <span class="hljs-comment"># Processes logs from the ci_logs source.</span>
  <span class="hljs-attr">source</span> = <span class="hljs-string">'''
  . = parse_json!(.message)  # Parses JSON log messages into structured data.
  '''</span>

<span class="hljs-section">[sinks.to_elasticsearch]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"elasticsearch"</span>  <span class="hljs-comment"># Sends logs to Elasticsearch.</span>
  <span class="hljs-attr">inputs</span> = [<span class="hljs-string">"json_parser"</span>]  <span class="hljs-comment"># Uses parsed logs from the json_parser transform.</span>
  <span class="hljs-attr">endpoint</span> = <span class="hljs-string">"http://localhost:9200"</span>  <span class="hljs-comment"># Specifies the Elasticsearch endpoint.</span>
  <span class="hljs-attr">index</span> = <span class="hljs-string">"ci-logs-%Y.%m.%d"</span>  <span class="hljs-comment"># Stores logs in daily indexes so they match the ci-logs-* pattern.</span>
</code></pre>
<p>Run:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs Vector with the specified configuration file for log processing.</span>
vector -c vector.toml
</code></pre>
<h3 id="heading-how-to-set-up-index-patterns-and-basic-visualizations">How to Set Up Index Patterns and Basic Visualizations</h3>
<p>Goal: Make CI/CD logs queryable and visual in Kibana.</p>
<h4 id="heading-1-open-kibana-httplocalhost5601httplocalhost5601">1. Open Kibana (<a target="_blank" href="http://localhost:5601/">http://localhost:5601</a>)</h4>
<ul>
<li><p>Go to <strong>Stack Management → Index Patterns</strong> (called <strong>Data Views</strong> in Kibana 8.x)</p>
</li>
<li><p>Create a new pattern: <code>ci-logs-*</code></p>
</li>
<li><p>Choose a time field like <code>@timestamp</code></p>
</li>
</ul>
<h4 id="heading-2-visualizations-for-common-cicd-use-cases">2. Visualizations for Common CI/CD Use Cases</h4>
<ul>
<li><p><strong>Bar charts</strong>: Number of failed vs passed builds per day</p>
</li>
<li><p><strong>Pie chart</strong>: Top error types or most frequent failing test names</p>
</li>
<li><p><strong>Line chart</strong>: Duration of builds over time (if duration is logged)</p>
</li>
</ul>
<h4 id="heading-3-saved-searches-amp-dashboards">3. Saved Searches &amp; Dashboards</h4>
<p>You can save a search like this:</p>
<pre><code class="lang-javascript">message: <span class="hljs-string">"error"</span> AND job_name: <span class="hljs-string">"build"</span>
</code></pre>
<p>You can also combine visualizations into a CI/CD Health Dashboard.</p>
<h2 id="heading-how-to-create-a-unified-logging-strategy-across-pipeline-components">How to Create a Unified Logging Strategy Across Pipeline Components</h2>
<p>Creating a unified logging strategy across your CI/CD pipeline components ensures that logs are consistent, traceable, and easy to correlate. This helps you quickly debug issues, monitor system health, and trace requests across different tools and services. Let’s discuss some key practices for achieving a unified logging strategy:</p>
<h3 id="heading-implementing-consistent-log-formats-across-different-tools">Implementing Consistent Log Formats Across Different Tools</h3>
<p>Consistent log formats matter for three reasons. A standardized format makes logs easier to query, search, and visualize. It makes logs from different services easier to correlate. And it ensures that every entry carries the necessary details, such as timestamp, log level, and request context.</p>
<p>When formatting logs, start with the format itself:</p>
<p><strong>JSON Format</strong> is highly recommended as it’s structured, machine-readable, and compatible with many observability tools (for example, Loki, Elasticsearch, Grafana).</p>
<p>Then include these key fields in every entry:</p>
<ul>
<li><p><code>timestamp</code>: The time the log entry was created (preferably in UTC).</p>
</li>
<li><p><code>log_level</code>: The severity of the entry, such as <code>INFO</code>, <code>ERROR</code>, or <code>DEBUG</code>.</p>
</li>
<li><p><code>service</code>: The service or component generating the log.</p>
</li>
<li><p><code>message</code>: A concise description of the event or error.</p>
</li>
<li><p><code>correlation_id</code>: A unique identifier for requests to trace logs across systems.</p>
</li>
</ul>
<p>Here’s an example of a consistent log in JSON format:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T12:34:56Z"</span>,
  <span class="hljs-attr">"log_level"</span>: <span class="hljs-string">"ERROR"</span>,
  <span class="hljs-attr">"service"</span>: <span class="hljs-string">"ci_cd_pipeline"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Build failed due to missing dependency"</span>,
  <span class="hljs-attr">"correlation_id"</span>: <span class="hljs-string">"1234567890abcdef"</span>
}
</code></pre>
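<p>As a minimal sketch of how a pipeline script could emit entries in this shape, the hypothetical helper below builds one JSON log line from the fields listed above. The function name <code>formatLogEntry</code> and its parameters are illustrative assumptions, not part of any logging library.</p>

```javascript
// Hypothetical helper: builds one JSON log line with the fields listed above.
function formatLogEntry({ level, service, message, correlationId, now = new Date() }) {
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601, always in UTC
    log_level: level.toUpperCase(),
    service,
    message,
    correlation_id: correlationId,
  });
}

const line = formatLogEntry({
  level: 'error',
  service: 'ci_cd_pipeline',
  message: 'Build failed due to missing dependency',
  correlationId: '1234567890abcdef',
  now: new Date('2025-05-10T12:34:56Z'),
});
console.log(line);
```

<p>Because every tool writes the same keys, a single query on <code>log_level</code> or <code>correlation_id</code> works across all of them.</p>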
<h3 id="heading-how-to-set-up-log-forwarding-from-github-actions-jenkins-or-gitlab">How to Set Up Log Forwarding from GitHub Actions, Jenkins, or GitLab</h3>
<p>Log forwarding refers to shipping logs from your CI/CD pipelines to a central spot for easy tracking. It’s helpful because it lets you spot issues fast and debug without digging through scattered files.</p>
<p>For GitHub Actions, you can configure workflows to write logs to a file or send them directly to a log aggregation tool like Loki. In Jenkins, you can use pipeline scripts to forward logs to a log server or file system. Similarly, for GitLab CI, you can add scripts in <code>.gitlab-ci.yml</code> to forward logs to a centralized endpoint.</p>
<p><strong>Using Actions for Outputting Logs:</strong><br>You can store logs in files and then forward them to a logging system (like Loki or Elasticsearch).<br>Here’s an example in a GitHub Actions workflow:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitHub Actions workflow to run tests and forward logs for observability.</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>  <span class="hljs-comment"># Uses an Ubuntu runner.</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">repository</span>  <span class="hljs-comment"># Checks out the repository code.</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">tests</span> <span class="hljs-string">and</span> <span class="hljs-string">log</span> <span class="hljs-string">output</span>  <span class="hljs-comment"># Runs tests and saves output to a log file.</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Starting tests..."
          npm test | tee test.log  # Captures test output to test.log.
          # Uploads the log file via HTTP POST. The URL is a placeholder: Loki's own push API expects JSON at /loki/api/v1/push.
          curl -X POST -F 'file=@test.log' http://your-loki-endpoint</span>
</code></pre>
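<p>If the target is Loki itself, the request body has to follow the JSON shape of Loki’s push API (<code>POST /loki/api/v1/push</code>): a list of streams, each with a label set and <code>[timestamp in nanoseconds, line]</code> pairs. The sketch below builds such a body from plain log lines. The helper name and the labels are assumptions chosen for illustration.</p>

```javascript
// Builds a request body for Loki's HTTP push API (POST /loki/api/v1/push).
// Each value is [timestamp in nanoseconds as a string, log line].
function buildLokiPushPayload(labels, lines, startMs = Date.now()) {
  const values = lines.map((text, i) => {
    // Add i nanoseconds so the lines keep their original order.
    const ns = BigInt(startMs) * 1000000n + BigInt(i);
    return [ns.toString(), text];
  });
  return { streams: [{ stream: labels, values }] };
}

const payload = buildLokiPushPayload(
  { job: 'ci_cd', stage: 'test' }, // hypothetical labels
  ['Starting tests...', 'test failed: login spec'],
  1747225200000
);
console.log(JSON.stringify(payload, null, 2));
```

<p>You could write this JSON to a file in the workflow and send it with <code>curl -H "Content-Type: application/json" --data @payload.json</code> to your Loki push URL.</p>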
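<p><strong>Log Forwarding with Promtail:</strong><br>If you are using Grafana Loki for log aggregation and run self-hosted GitHub Actions runners, set up Promtail on the runner host to scrape the log files your workflows write. GitHub-hosted runners are ephemeral, so on those it is simpler to push logs from the workflow itself.</p>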
<h4 id="heading-jenkins">Jenkins:</h4>
<p>Jenkins logs can be forwarded to external systems (like Elasticsearch or Loki) by using log shippers or plugins.</p>
<p><strong>You can use the Logstash Plugin</strong> to forward Jenkins logs to an ELK stack or other systems:</p>
<ul>
<li><p>Install the Logstash plugin on Jenkins.</p>
</li>
<li><p>Configure the plugin to forward logs to an Elasticsearch server or a logging system of choice.</p>
</li>
<li><p>In Jenkins, add log forwarding configurations:</p>
</li>
</ul>
<pre><code class="lang-javascript">pipeline {
  agent any
  stages {
    stage(<span class="hljs-string">'Build'</span>) {
      steps {
        script {
          <span class="hljs-comment">// Example of forwarding logs to a log server</span>
          sh <span class="hljs-string">'echo "Build successful" | curl -X POST -d @- http://your-log-server'</span>
        }
      }
    }
  }
}
</code></pre>
<p><strong>Forward to Loki:</strong><br>If you run Jenkins in Docker, you can use Grafana’s Loki logging driver for Docker. It is a Docker plugin, so install it on the Docker host first (<code>docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions</code>). The driver then sends the container’s logs directly to Loki:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs a Jenkins container with the Loki logging driver to send logs directly to Loki.</span>
docker run --log-driver=loki --log-opt loki-url=http://loki:3100/loki/api/v1/push jenkins/jenkins:lts
</code></pre>
<h4 id="heading-gitlab">GitLab:</h4>
<p>GitLab CI allows logs to be forwarded to external systems for centralized collection and analysis.</p>
<p><strong>Use GitLab CI/CD to Output Logs</strong>:<br>Example in <code>.gitlab-ci.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitLab CI/CD pipeline to run a build and forward logs to Loki.</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">build</span>
<span class="hljs-attr">build:</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">echo</span> <span class="hljs-string">"Starting the build"</span> <span class="hljs-string">|</span> <span class="hljs-string">tee</span> <span class="hljs-string">build.log</span>  <span class="hljs-comment"># Saves build output to build.log.</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">curl</span> <span class="hljs-string">-X</span> <span class="hljs-string">POST</span> <span class="hljs-string">-d</span> <span class="hljs-string">@build.log</span> <span class="hljs-string">http://your-loki-endpoint</span>  <span class="hljs-comment"># Forwards the log to Loki.</span>
</code></pre>
<p><strong>GitLab Runners</strong>:<br>Configure GitLab runners to forward logs to an external service like Loki or Elasticsearch using <code>log-driver</code> settings or the <code>fluentd</code> log shipper.</p>
<h3 id="heading-how-to-add-correlation-ids-to-trace-requests-through-the-system">How to Add Correlation IDs to Trace Requests Through the System</h3>
<h4 id="heading-why-correlation-ids-are-important">Why Correlation IDs Are Important:</h4>
<p>Correlation IDs allow you to trace a single request as it travels through different services and tools, enabling end-to-end visibility and troubleshooting.</p>
<p>They are critical for debugging distributed systems, especially when different services (for example, CI tool, deployment tool, API service) are involved.</p>
<h4 id="heading-how-to-add-correlation-ids">How to Add Correlation IDs:</h4>
<p>You can use a UUID (Universally Unique Identifier) or a GUID (Globally Unique Identifier) to generate a unique ID for each request.</p>
<p>If you are using microservices or multiple services in the pipeline, just make sure that the same ID is propagated across each service.</p>
<p>Many logging libraries (for example, <code>winston</code> for Node.js, <code>log4j</code> for Java) let you attach contextual fields such as a correlation ID to every log entry, so you only have to set the ID once per run or request.</p>
<p>Here’s an example in Node.js (using <code>winston</code>):</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Sets up Winston for structured logging with correlation IDs in a CI/CD pipeline.</span>
<span class="hljs-keyword">const</span> { randomUUID } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'crypto'</span>);
<span class="hljs-keyword">const</span> { createLogger, transports, format } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'winston'</span>);
<span class="hljs-keyword">const</span> { printf } = format;

<span class="hljs-comment">// Reuses CORRELATION_ID from the environment if set; otherwise generates a UUID.</span>
<span class="hljs-comment">// The ID is created once so that every log line from this run shares it.</span>
<span class="hljs-keyword">const</span> correlationId = process.env.CORRELATION_ID || randomUUID();

<span class="hljs-comment">// Creates a logger with a custom format including the correlation ID.</span>
<span class="hljs-keyword">const</span> logger = createLogger({
  <span class="hljs-attr">format</span>: format.combine(
    format.timestamp(),  <span class="hljs-comment">// Adds the timestamp used in the template below.</span>
    printf(<span class="hljs-function">(<span class="hljs-params">{ level, message, timestamp }</span>) =&gt;</span> {
      <span class="hljs-keyword">return</span> <span class="hljs-string">`<span class="hljs-subst">${timestamp}</span> [<span class="hljs-subst">${level}</span>] <span class="hljs-subst">${message}</span> correlation_id=<span class="hljs-subst">${correlationId}</span>`</span>;
    })
  ),
  <span class="hljs-attr">transports</span>: [
    <span class="hljs-keyword">new</span> transports.Console(),  <span class="hljs-comment">// Outputs logs to the console.</span>
  ],
});

<span class="hljs-comment">// Logs a sample message.</span>
logger.info(<span class="hljs-string">'Pipeline execution started'</span>);
</code></pre>
<h4 id="heading-how-to-propagate-correlation-ids-between-services">How to Propagate Correlation IDs Between Services:</h4>
<p>In CI/CD tools, you can configure your pipeline to inject the correlation ID into logs. For example, in GitHub Actions, you can generate a correlation ID in the <code>env</code> section and propagate it in each job:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitHub Actions workflow that includes a correlation ID for log tracing.</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>  <span class="hljs-comment"># Uses an Ubuntu runner.</span>
    <span class="hljs-attr">env:</span>
      <span class="hljs-attr">CORRELATION_ID:</span> <span class="hljs-string">${{</span> <span class="hljs-string">github.run_id</span> <span class="hljs-string">}}</span>  <span class="hljs-comment"># Uses the GitHub run ID as a correlation ID.</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">repository</span>  <span class="hljs-comment"># Checks out the repository code.</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Log</span> <span class="hljs-string">build</span> <span class="hljs-string">start</span> <span class="hljs-string">with</span> <span class="hljs-string">correlation</span> <span class="hljs-string">ID</span>  <span class="hljs-comment"># Logs the build start with the correlation ID.</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">echo</span> <span class="hljs-string">"Build started with Correlation ID: $CORRELATION_ID"</span>
</code></pre>
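<p>On the receiving side, a script started by the pipeline can pick up that environment variable and stamp it on every line it logs. The following is a minimal sketch under that assumption. <code>resolveCorrelationId</code> and <code>createRunLogger</code> are hypothetical helper names, and the fallback to a random UUID is an illustrative choice for local runs.</p>

```javascript
const { randomUUID } = require('node:crypto');

// Reuses the pipeline's correlation ID when one is set; otherwise creates one.
function resolveCorrelationId(env = process.env) {
  const fromPipeline = (env.CORRELATION_ID || '').trim();
  return fromPipeline !== '' ? fromPipeline : randomUUID();
}

// Returns a log function that stamps every line with the same ID.
function createRunLogger(correlationId, write = console.log) {
  return (level, message) =>
    write(JSON.stringify({ log_level: level, message, correlation_id: correlationId }));
}

// Example: both lines carry the ID that the workflow exported.
const log = createRunLogger(resolveCorrelationId({ CORRELATION_ID: '9876543210' }));
log('INFO', 'Build started');
log('INFO', 'Tests started');
```

<p>Resolving the ID once and closing over it keeps all entries from one run searchable with a single filter.</p>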
<h4 id="heading-include-correlation-ids-in-all-logs">Include Correlation IDs in All Logs:</h4>
<p>You’ll want to make sure that logs from all components in the pipeline (GitHub Actions, Jenkins, GitLab, deployment tools, and so on) include the correlation ID as part of the log message. This allows you to trace the logs of a single request or pipeline run across different services.</p>
<h4 id="heading-visualize-your-log-flow">Visualize Your Log Flow</h4>
<p>You can create a diagram showing how logs move from your CI/CD tool (for example, GitHub Actions) to Promtail/Vector, then to Loki/Elasticsearch, and finally to Grafana/Kibana for visualization. Use tools like <a target="_blank" href="http://Draw.io">Draw.io</a> to map your pipeline’s observability flow.</p>
<h2 id="heading-how-to-query-and-analyze-logs-for-effective-troubleshooting">How to Query and Analyze Logs for Effective Troubleshooting</h2>
<p>In this section, you’ll learn how to use LogQL (Loki's query language) to cut through the noise and find the specific logs that matter. Whether you're hunting down a mysterious build failure or tracking deployment issues across multiple services, these query patterns are a practical starting point.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748224707087/d348accc-0ef8-4ebb-9cb9-49995404b0ec.png" alt="Bar chart showing CI/CD build results from May 20-26, 2025. Blue bars represent successful builds ranging from 39-52 per day, while red bars show failed builds ranging from 1-9 per day. The chart demonstrates consistently high success rates with low failure rates throughout the week, with May 23 showing the highest failure count at 9 builds." class="image--center mx-auto" width="1468" height="866" loading="lazy"></p>
<p>This bar chart illustrates CI/CD build results from May 20 to May 26, 2025. It compares the number of successful builds with the number of failed builds each day. Successful builds stay at roughly 40 to 50 per day, while failed builds stay at about 10 or fewer and peak on May 23. This indicates a generally stable pipeline with occasional issues.</p>
<h3 id="heading-how-to-write-advanced-logql-queries-to-pinpoint-cicd-issues">How to Write Advanced LogQL Queries to Pinpoint CI/CD Issues</h3>
<p>LogQL is Grafana Loki's query language, designed for querying logs with a syntax similar to Prometheus’s PromQL. It enables efficient log searches and is particularly useful in troubleshooting CI/CD issues.</p>
<h4 id="heading-basic-logql-syntax">Basic LogQL Syntax:</h4>
<p><strong>1. Log Streams:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>}
</code></pre>
<p>This query retrieves logs where the <code>job</code> label is <code>ci_cd</code> and the <code>level</code> label is <code>error</code>.</p>
<p><strong>2. Log Filters:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>The <code>|=</code> operator filters logs to include only those that contain the specified string, for example "build failed".</p>
<p><strong>3. Regular Expressions:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>} |~ <span class="hljs-string">"error.*timeout"</span>
</code></pre>
<p>This uses the <code>|~</code> operator to filter logs using a regular expression. In this case, it finds logs that contain an "error" followed by "timeout".</p>
<h4 id="heading-advanced-logql-queries-for-cicd-issues">Advanced LogQL Queries for CI/CD Issues:</h4>
<p><strong>1. Filter Logs for Specific Build Failures:</strong></p>
<p>If your pipeline uses a specific label for build names:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, build=<span class="hljs-string">"build123"</span>} |= <span class="hljs-string">"failure"</span>
</code></pre>
<p>This finds logs whose <code>build</code> label is <code>build123</code> and that contain the word "failure".</p>
<p><strong>2. Using Time Range and Grouping:</strong></p>
<p>LogQL has no clause for the time range of a log query. To find error logs from the last 15 minutes, set the time range to “Last 15 minutes” in Grafana’s time picker (or pass <code>start</code> and <code>end</code> to Loki’s query API) and run:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>To count error logs and group the result by job:</p>
<pre><code class="lang-javascript">sum by (job) (count_over_time({job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>}[<span class="hljs-number">5</span>m]))
</code></pre>
<p>This will return the count of error logs per job, grouped by job name, over the last 5 minutes.</p>
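<p>To make the semantics of that metric query concrete, here is a rough in-memory equivalent: keep only error entries whose timestamp falls inside the 5-minute window, then count them per job. This is only an illustration of what Loki computes, not how Loki implements it, and the entry shape (<code>job</code>, <code>level</code>, <code>timestampMs</code>) is an assumption. The sample uses several job names so that the grouping is visible.</p>

```javascript
// Mimics sum by (job) (count_over_time({level="error"}[5m])) on in-memory data.
// `entries` is a hypothetical array of { job, level, timestampMs } objects.
function countErrorsByJob(entries, nowMs, windowMs = 5 * 60 * 1000) {
  const counts = {};
  for (const entry of entries) {
    const ageMs = nowMs - entry.timestampMs;
    const inWindow = ageMs >= 0 && windowMs > ageMs;
    if (entry.level === 'error' && inWindow) {
      counts[entry.job] = (counts[entry.job] || 0) + 1;
    }
  }
  return counts;
}

const now = Date.parse('2025-05-23T10:00:00Z');
const sample = [
  { job: 'build', level: 'error', timestampMs: now - 60 * 1000 },
  { job: 'build', level: 'error', timestampMs: now - 2 * 60 * 1000 },
  { job: 'deploy', level: 'error', timestampMs: now - 4 * 60 * 1000 },
  { job: 'build', level: 'info', timestampMs: now - 30 * 1000 }, // not an error
  { job: 'deploy', level: 'error', timestampMs: now - 10 * 60 * 1000 }, // too old
];
console.log(countErrorsByJob(sample, now)); // { build: 2, deploy: 1 }
```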
<h3 id="heading-how-to-create-pipeline-specific-queries-for-common-failure-patterns">How to Create Pipeline-Specific Queries for Common Failure Patterns</h3>
<h4 id="heading-common-failure-patterns-in-cicd-pipelines">Common Failure Patterns in CI/CD Pipelines:</h4>
<p><strong>1. Build Failures:</strong></p>
<p>If your CI system logs contain build errors, you can identify them with:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>You can extend this to filter by specific steps or stages, for example, “test failed”, or “compilation error”.</p>
<p><strong>2. Test Failures:</strong></p>
<p>Logs from your test runner (for example, Jest, Mocha, JUnit) can contain specific failure messages:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, stage=<span class="hljs-string">"test"</span>} |= <span class="hljs-string">"test failed"</span>
</code></pre>
<p><strong>3. Dependency Issues:</strong></p>
<p>If your pipeline is failing due to missing or conflicting dependencies, look for <code>npm</code>, <code>maven</code>, or <code>docker</code> related errors:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, image=<span class="hljs-string">"node"</span>} |= <span class="hljs-string">"npm ERR!"</span>
</code></pre>
<p>Or for Maven-related issues:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, image=<span class="hljs-string">"maven"</span>} |= <span class="hljs-string">"[ERROR]"</span>
</code></pre>
<p><strong>4. Resource Constraints (for example, Out of Memory):</strong></p>
<p>If you experience resource constraints, you might see logs like "OutOfMemoryError":</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"OutOfMemoryError"</span>
</code></pre>
<p><strong>Example of combining filters:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span> |~ <span class="hljs-string">"timeout|dependency"</span>
</code></pre>
<p>This chains two filters: it keeps lines that contain "build failed" and that also match "timeout" or "dependency". To limit the results to the last hour, set the time range in Grafana’s time picker.</p>
<h3 id="heading-how-to-set-up-alert-rules-based-on-log-patterns">How to Set Up Alert Rules Based on Log Patterns</h3>
<p>Alerts help detect recurring issues proactively. They notify you when a specific pattern appears in your logs, allowing you to take quick action.</p>
<h4 id="heading-steps-for-setting-up-alerts"><strong>Steps for Setting Up Alerts:</strong></h4>
<p><strong>1. Create a Query for the Alert:</strong></p>
<p>First, define the log pattern you want to monitor. For example, an alert for build failures:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p><strong>2. Create an Alert in Grafana:</strong></p>
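<p>Follow these steps to set up Grafana alerts. They describe creating an alert from a dashboard panel. In recent Grafana versions you can also create the same rule under <strong>Alerting → Alert rules</strong>:</p>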
<ul>
<li><p>Go to your Grafana dashboard.</p>
</li>
<li><p>Choose the panel you want to set the alert on (or create a new panel for this purpose).</p>
</li>
<li><p>In the panel, click the <strong>Alert</strong> tab.</p>
</li>
<li><p>Set the <strong>Query</strong> field to your LogQL query, such as the one above.</p>
</li>
<li><p>Under <strong>Conditions</strong>, define when the alert should trigger, e.g., if the error occurs more than <code>3</code> times within <code>5 minutes</code>.</p>
</li>
</ul>
<p><strong>3. Alert Settings:</strong></p>
<p>Now you’ll want to set up the alert evaluation interval and conditions for triggering the alert (e.g., if the query returns results above a certain threshold).</p>
<p><strong>Here’s an example:</strong> Trigger an alert if the number of errors exceeds 5 within 5 minutes:</p>
<pre><code class="lang-javascript">count_over_time({job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>[<span class="hljs-number">5</span>m]) &gt; <span class="hljs-number">5</span>
</code></pre>
<p><strong>4. Set Alert Notifications:</strong></p>
<p>Choose where the alert should be sent, such as Slack, email, or PagerDuty. Grafana integrates with these systems, so alerts reach the right team members in real time.</p>
<p><strong>Example alert query for test failures:</strong></p>
<pre><code class="lang-javascript">count_over_time({job=<span class="hljs-string">"ci_cd"</span>, stage=<span class="hljs-string">"test"</span>} |= <span class="hljs-string">"test failed"</span>[<span class="hljs-number">5</span>m]) &gt; <span class="hljs-number">3</span>
</code></pre>
<p>This query triggers an alert if more than 3 test failures are logged within the last 5 minutes.</p>
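<p>If you maintain several alerts of this shape, it can help to generate the expressions from a small template instead of editing strings by hand. The sketch below assembles the same kind of LogQL expression from its parts. <code>buildAlertExpr</code> and its parameter names are illustrative assumptions, not part of any Grafana or Loki API.</p>

```javascript
// Builds a LogQL alert expression of the form used above.
// All parameter names are illustrative; nothing here calls Grafana or Loki.
function buildAlertExpr({ selector, contains, window, threshold }) {
  // JSON.stringify wraps the text in double quotes and escapes quotes inside it.
  const filter = JSON.stringify(contains);
  return `count_over_time(${selector} |= ${filter} [${window}]) > ${threshold}`;
}

const expr = buildAlertExpr({
  selector: '{job="ci_cd", stage="test"}',
  contains: 'test failed',
  window: '5m',
  threshold: 3,
});
console.log(expr);
// count_over_time({job="ci_cd", stage="test"} |= "test failed" [5m]) > 3
```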
<h3 id="heading-kibana-query-language-deep-dive-for-cicd-contexts">Kibana Query Language Deep Dive for CI/CD Contexts</h3>
<p>Kibana Query Language (KQL) is a powerful tool for searching and filtering logs within Elasticsearch, and it becomes especially useful for debugging CI/CD pipelines.</p>
<h4 id="heading-basic-query-syntax">Basic Query Syntax:</h4>
<ul>
<li><p><strong>Field:</strong></p>
<pre><code class="lang-javascript">  fieldname:value
</code></pre>
<p>  Example: <code>status: "failure"</code></p>
</li>
<li><p><strong>Wildcard:</strong> Use <code>*</code> to match any number of characters. Leave the value unquoted, because KQL treats a quoted value as a literal phrase:</p>
<pre><code class="lang-javascript">  message: test*
</code></pre>
</li>
<li><p><strong>Range Queries:</strong> To search for logs within a specific time frame, use comparison operators (the bracketed <code>[a TO b]</code> form is Lucene syntax, not KQL):</p>
<pre><code class="lang-javascript">  timestamp &gt;= <span class="hljs-string">"2023-05-01"</span> AND timestamp &lt;= <span class="hljs-string">"2023-05-15"</span>
</code></pre>
</li>
<li><p><strong>Boolean Queries:</strong> Combine queries using <code>AND</code>, <code>OR</code>, and <code>NOT</code>:</p>
<pre><code class="lang-javascript">  status: <span class="hljs-string">"failure"</span> AND build_id: <span class="hljs-string">"12345"</span>
</code></pre>
</li>
</ul>
<h4 id="heading-time-based-queries">Time-Based Queries:</h4>
<p>Since CI/CD logs are often tied to time-sensitive operations (builds, deployments), KQL allows you to filter logs by time using date math. For example, to match logs from the last day:</p>
<pre><code class="lang-javascript">@timestamp &gt;= now<span class="hljs-number">-1</span>d
</code></pre>
<h4 id="heading-nested-queries-for-complex-pipelines">Nested Queries (For Complex Pipelines):</h4>
<p>CI/CD logs can have nested or multi-level structures (for example, logs within containers). You can query these nested fields:</p>
<pre><code class="lang-javascript">pipeline.logs.message: <span class="hljs-string">"build failed"</span>
</code></pre>
<h4 id="heading-aggregations-and-grouping">Aggregations and Grouping:</h4>
<p>KQL only filters documents and has no aggregation syntax of its own. To identify trends or recurring issues, filter with KQL and then group the results in a Kibana visualization, for example:</p>
<pre><code class="lang-javascript">terms aggregation on <span class="hljs-string">"status"</span> field
</code></pre>
<p>This helps identify the most common failure statuses in your pipeline.</p>
<h4 id="heading-field-specific-filtering">Field-Specific Filtering:</h4>
<p>When debugging specific components like a build tool or deployment step, you can filter by those component-specific fields:</p>
<pre><code class="lang-javascript">build_tool: <span class="hljs-string">"Jenkins"</span> AND status: <span class="hljs-string">"failure"</span>
</code></pre>
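<p>When you build such filters from code, for example to generate links or saved searches for several tools, a small helper can join field and value pairs into a KQL string. This is a sketch under the assumption that every value should be matched as an exact quoted phrase. <code>buildKqlQuery</code> is a hypothetical name.</p>

```javascript
// Joins field/value pairs into a KQL filter, quoting each value as a phrase.
function buildKqlQuery(fields) {
  return Object.entries(fields)
    .map(([name, value]) => {
      // Escape backslashes and double quotes so the phrase stays intact.
      const escaped = String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"');
      return `${name}: "${escaped}"`;
    })
    .join(' AND ');
}

console.log(buildKqlQuery({ build_tool: 'Jenkins', status: 'failure' }));
// build_tool: "Jenkins" AND status: "failure"
```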
<h4 id="heading-creating-saved-searches-for-recurring-issues">Creating Saved Searches for Recurring Issues</h4>
<p>Once you’ve built queries that help you identify common issues in your CI/CD pipeline, you can save them in Kibana for future use.</p>
<p><strong>1. Create a Saved Search:</strong></p>
<p>Run your desired query in the Kibana Discover tab. Click on the “Save” button and give it a meaningful name, such as "Failed Builds - Last Week". You can add filters and customize the time range to match your typical issue patterns.</p>
<p><strong>2. Use Filters to Pinpoint Recurring Problems:</strong></p>
<p>Create saved searches that focus on specific recurring issues like:</p>
<ul>
<li><p>Build failures based on a specific tool or version.</p>
</li>
<li><p>Test failures within a particular module or set of tests.</p>
</li>
</ul>
<p>Example search for “flaky tests”:</p>
<pre><code class="lang-javascript">test_status: <span class="hljs-string">"failed"</span> AND error_message: *timeout*
</code></pre>
<p><strong>3. Saving Multiple Variations:</strong></p>
<p>You can save multiple variations of queries based on different error types or CI/CD tools:</p>
<ul>
<li><p><strong>Failed Jobs:</strong> <code>status: "failure"</code></p>
</li>
<li><p><strong>Test Failures in Build:</strong> <code>log_type: "test" AND status: "failure"</code></p>
</li>
<li><p><strong>Resource Constraints:</strong> <code>error_message: *memory*</code> (leave the wildcard pattern unquoted, because a quoted value is matched literally)</p>
</li>
</ul>
<p>These saved searches will allow you to quickly troubleshoot specific issues that occur frequently.</p>
<h4 id="heading-building-visualizations-to-spot-patterns-over-time">Building Visualizations to Spot Patterns Over Time</h4>
<p>Once you have saved searches, Kibana allows you to create visualizations from your data, making it easier to spot trends, anomalies, or patterns over time.</p>
<p><strong>1. Create a Visualization:</strong></p>
<p>Go to the <strong>Visualize</strong> tab in Kibana. Select the appropriate visualization type. Common visualizations for debugging CI/CD pipelines include:</p>
<ul>
<li><p><strong>Line Chart:</strong> Track build failure rates over time.</p>
</li>
<li><p><strong>Bar Chart:</strong> Show the number of failures per CI tool or service.</p>
</li>
<li><p><strong>Pie Chart:</strong> Breakdown of failure reasons (for example, compilation errors, test failures, resource constraints).</p>
</li>
</ul>
<p><strong>2. Track Failure Trends Over Time:</strong></p>
<p>Create a line chart to track build failures over a given period:</p>
<ul>
<li><p><strong>X-Axis:</strong> Time (for example, daily or weekly).</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of build failures.</p>
</li>
<li><p><strong>Aggregation:</strong> Date histogram with <code>@timestamp</code> field.</p>
</li>
</ul>
<p>This will help you visualize how build failures are trending, making it easier to identify recurring issues or spikes in failures.</p>
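<p>To make the date histogram concrete, here is a small sketch that computes the same daily buckets locally. The record shape (an ISO 8601 <code>@timestamp</code> and a <code>status</code> field) and the sample data are assumptions made for illustration. Kibana does this bucketing for you on the server side.</p>

```python
from collections import Counter
from datetime import datetime


def failures_per_day(records):
    """Count failed builds per calendar day, like a daily date histogram."""
    counts = Counter()
    for record in records:
        if record["status"] != "failure":
            continue
        day = datetime.fromisoformat(record["@timestamp"]).date().isoformat()
        counts[day] += 1
    return dict(sorted(counts.items()))


# Hypothetical log records for illustration.
sample = [
    {"@timestamp": "2024-05-01T09:15:00", "status": "failure"},
    {"@timestamp": "2024-05-01T17:40:00", "status": "failure"},
    {"@timestamp": "2024-05-01T18:00:00", "status": "success"},
    {"@timestamp": "2024-05-02T08:05:00", "status": "failure"},
]
daily = failures_per_day(sample)
print(daily)
```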
<p><strong>3. Monitor Failure Types by CI Tool:</strong></p>
<p>Create a bar chart that shows the number of failures broken down by CI tool:</p>
<ul>
<li><p><strong>X-Axis:</strong> CI tool (Jenkins, GitHub Actions, GitLab, and so on).</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of failures.</p>
</li>
<li><p><strong>Aggregation:</strong> Terms aggregation on the <code>ci_tool</code> field.</p>
</li>
</ul>
<p>This visualization helps identify which CI tool is experiencing the most failures and focus troubleshooting efforts there.</p>
<p><strong>4. Visualize Error Messages by Frequency:</strong></p>
<p>You can visualize which error messages appear most frequently, helping you understand what might be causing recurring issues:</p>
<ul>
<li><p><strong>X-Axis:</strong> Error message type.</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of occurrences.</p>
</li>
<li><p><strong>Aggregation:</strong> Terms aggregation on the <code>error_message</code> field.</p>
</li>
</ul>
<p><strong>5. Dashboard for Holistic Monitoring:</strong></p>
<p>Create a dashboard that brings together multiple visualizations. You can have one graph for failure trends, another for failure types (bar chart), and a pie chart showing the percentage of failures caused by different issues. This dashboard gives you a holistic view of your pipeline's health.</p>
<h4 id="heading-advanced-visualization-techniques">Advanced Visualization Techniques:</h4>
<p>There are various advanced techniques you can use to dig further into your data.</p>
<ul>
<li><p><strong>Heatmaps</strong>: Use heatmaps to spot time-based anomalies in build durations or test failures.</p>
</li>
<li><p><strong>Anomaly Detection</strong>: Kibana’s machine learning features include anomaly detection (availability depends on your Elastic subscription), which can be applied to log data to automatically detect patterns that deviate from the norm. This is especially useful for catching rare or unexpected errors in your CI/CD pipeline.</p>
<p>  Example for anomaly detection:</p>
<pre><code class="lang-javascript">  field: duration
  <span class="hljs-attr">aggregation</span>: average
  anomaly detection model: <span class="hljs-string">"baseline"</span>
</code></pre>
</li>
</ul>
<h2 id="heading-how-to-set-up-prometheus-metrics-alongside-your-logs">How to Set Up Prometheus Metrics Alongside Your Logs</h2>
<p>To fully understand your CI/CD pipeline's health and performance, combining metrics and logs is essential. Prometheus is an excellent tool for capturing time-series metrics, and it works seamlessly with Grafana and Loki (or any log aggregation system).</p>
<h3 id="heading-how-to-set-up-prometheus-for-cicd-metrics-collection"><strong>How to Set Up Prometheus for CI/CD Metrics Collection:</strong></h3>
<h4 id="heading-1-install-prometheus">1. Install Prometheus:</h4>
<p>You can install Prometheus using Docker or Kubernetes for easy deployment.</p>
<p>For Docker-based installation:</p>
<pre><code class="lang-bash">docker run -d -p 9090:9090 --name prometheus prom/prometheus
</code></pre>
<h4 id="heading-2-configure-prometheus-to-scrape-metrics"><strong>2. Configure Prometheus to Scrape Metrics:</strong></h4>
<p>Prometheus needs to be configured to scrape metrics from your CI/CD services.</p>
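<p>Edit the <code>prometheus.yml</code> file (when you run the official Docker image, mount your copy at <code>/etc/prometheus/prometheus.yml</code> so the container picks it up):</p>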
<pre><code class="lang-yaml"><span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">'ci_cd_metrics'</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost:8080'</span>, <span class="hljs-string">'localhost:9091'</span>]
</code></pre>
<h4 id="heading-3-instrument-your-cicd-services">3. Instrument Your CI/CD Services:</h4>
<p>To expose metrics, you need to integrate Prometheus client libraries into your CI/CD services.</p>
<p>For example, to expose build metrics from a Jenkins job, use the <a target="_blank" href="https://plugins.jenkins.io/prometheus/">Prometheus plugin for Jenkins</a>. GitHub Actions jobs are short-lived and can’t be scraped directly, so a common approach is to have the job push its metrics to a <a target="_blank" href="https://github.com/prometheus/pushgateway">Prometheus Pushgateway</a>, which Prometheus then scrapes.</p>
<h4 id="heading-4-expose-metrics-endpoint"><strong>4. Expose Metrics Endpoint:</strong></h4>
<p>You’ll want to make sure your services expose a <code>/metrics</code> endpoint that Prometheus can scrape. For example, use Prometheus client libraries in your application to expose this endpoint.</p>
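<p>To show what a <code>/metrics</code> endpoint actually returns, here is a dependency-free sketch that renders two counters in the Prometheus text exposition format and serves them over HTTP. The metric names, values, and port <code>8080</code> are assumptions for illustration. In a real service you would normally let a Prometheus client library do this rather than rendering the format by hand.</p>

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real service would update these as builds finish.
COUNTERS = {
    "ci_cd_builds_total": ("Total number of builds", 42),
    "ci_cd_failed_builds_total": ("Total number of failed builds", 3),
}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Serve /metrics until interrupted (this call blocks)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


metrics_text = render_metrics(COUNTERS)
print(metrics_text)
```

<p>Calling <code>serve()</code> starts the endpoint. You can then check it with <code>curl http://localhost:8080/metrics</code> before pointing Prometheus at it.</p>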
<h4 id="heading-troubleshooting-prometheus-setup-issues">Troubleshooting Prometheus Setup Issues</h4>
<p>If Prometheus fails to start or scrape metrics, here are some things that might be going wrong:</p>
<ol>
<li><p><strong>Container Crashes</strong>: Check logs with <code>docker logs prometheus</code>. Look for errors like “port already in use” (for example, 9090) or configuration parsing issues.</p>
<ul>
<li>Fix: Change the port in <code>docker run</code> (for example, <code>-p 9091:9090</code>) or correct the <code>prometheus.yml</code> file syntax.</li>
</ul>
</li>
<li><p><strong>Metrics Not Scraped</strong>: Open <code>http://localhost:9090/targets</code> in a browser, or query the API with <code>curl http://localhost:9090/api/v1/targets</code>, to see each target’s health and last scrape error. Check <code>prometheus.yml</code> for correct endpoints.</p>
<ul>
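<li>Fix: Update <code>targets</code> in <code>scrape_configs</code> (for example, <code>localhost:8080</code>) to match your CI/CD service’s metrics endpoint. When Prometheus runs in a container, <code>localhost</code> refers to the container itself, so use an address the container can actually reach.</li>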
</ul>
</li>
<li><p><strong>Resource Constraints</strong>: Monitor usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
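<li>Fix: Ensure at least 4GB RAM and 10GB disk space. If usage is still too high, shorten retention with the <code>--storage.tsdb.retention.time</code> startup flag or raise <code>scrape_interval</code> in <code>prometheus.yml</code> so targets are scraped less often.</li>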
</ul>
</li>
</ol>
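<p>The “Metrics Not Scraped” check above can also be scripted. The sketch below reads the response of the Prometheus targets API and lists every active target that is not healthy. The sample payload is hand-written and trimmed for illustration, and <code>fetch_targets</code> assumes a Prometheus server at <code>http://localhost:9090</code>.</p>

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API payload from a running Prometheus server."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (scrape URL, last error) for every active target that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", "unknown"), target.get("lastError", ""))
        for target in active
        if target.get("health") != "up"
    ]


# Hand-written sample payload; real responses contain more fields per target.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9091/metrics", "health": "up", "lastError": ""},
            {"scrapeUrl": "http://localhost:8080/metrics", "health": "down",
             "lastError": "connection refused"},
        ]
    },
}
problems = unhealthy_targets(sample_payload)
print(problems)
```

<p>Against a live server, you would call <code>unhealthy_targets(fetch_targets())</code> instead of using the sample payload.</p>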
<h2 id="heading-how-to-create-grafana-dashboards-that-combine-metrics-and-logs">How to Create Grafana Dashboards That Combine Metrics and Logs</h2>
<p>Once Prometheus is collecting metrics, the next step is to visualize and correlate them in Grafana.</p>
<h3 id="heading-how-to-integrate-prometheus-with-grafana"><strong>How to Integrate Prometheus with Grafana:</strong></h3>
<p>First, you’ll need to install Grafana. You can use Docker or Kubernetes for quick deployment:</p>
<pre><code class="lang-bash">docker run -d -p 3000:3000 --name grafana grafana/grafana
</code></pre>
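<p>Next, configure Grafana to use Prometheus as a data source. To do this, log in to Grafana (<code>localhost:3000</code> by default). Go to <code>Configuration</code> &gt; <code>Data Sources</code> (<code>Connections</code> &gt; <code>Data sources</code> in newer Grafana versions), click <code>Add Data Source</code>, and choose <code>Prometheus</code>. Enter your Prometheus server URL (for example, <code>http://localhost:9090</code>) and click <code>Save &amp; Test</code>. If Grafana and Prometheus both run in containers, <code>localhost</code> inside the Grafana container refers to Grafana itself, so use an address the Grafana container can reach, such as the Prometheus container name on a shared Docker network.</p>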
<p>Now it’s time to build a unified dashboard. To do this, create a new dashboard in Grafana that combines both logs (Loki) and metrics (Prometheus).</p>
<p>Add a panel with Prometheus data queries to visualize pipeline metrics like build success rate, deployment duration, and failure count. Use the <code>Time series</code> visualization type (called <code>Graph</code> in older Grafana versions) for time-series data and <code>Stat</code> for quick summary metrics.</p>
<p>Finally, in the same Grafana dashboard, add panels for logs (from Loki or any other logging system). Use the <code>Logs</code> panel to visualize log data and link them with the relevant Prometheus metrics by using time-based correlations.</p>
<p><strong>Example</strong>: If a spike in CPU usage is detected (Prometheus metric), the logs panel could show related logs, like errors or failed build jobs.</p>
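<p>Time-based correlation boils down to selecting the log entries that fall inside a window around the moment a metric spiked. Here is a minimal sketch of that idea. The log entry shape, the sample data, and the 60-second window are assumptions for illustration, and Grafana applies the same principle when a dashboard’s panels share one time range.</p>

```python
from datetime import datetime, timedelta


def logs_near(logs, spike_time, window_seconds=60):
    """Return log entries whose timestamp is within window_seconds of spike_time."""
    window = timedelta(seconds=window_seconds)
    return [
        entry for entry in logs
        if window >= abs(datetime.fromisoformat(entry["time"]) - spike_time)
    ]


# Hypothetical log entries for illustration.
logs = [
    {"time": "2024-05-01T10:00:05", "message": "build 812 started"},
    {"time": "2024-05-01T10:14:40", "message": "build 812 failed: out of memory"},
    {"time": "2024-05-01T10:15:30", "message": "runner restarted"},
    {"time": "2024-05-01T10:40:00", "message": "build 813 started"},
]
spike = datetime.fromisoformat("2024-05-01T10:15:00")  # when the CPU metric spiked
related = logs_near(logs, spike)
for entry in related:
    print(entry["time"], entry["message"])
```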
<h2 id="heading-how-to-use-exemplars-to-jump-from-metrics-to-relevant-logs">How to Use Exemplars to Jump from Metrics to Relevant Logs</h2>
<p>Exemplars are an advanced feature in Prometheus that allow you to connect metric data with logs and traces. Grafana supports this feature, and it can be incredibly helpful when investigating issues.</p>
<h3 id="heading-how-to-set-up-exemplars-in-prometheus">How to Set Up Exemplars in Prometheus:</h3>
<p><strong>1. Enable Exemplars in Your Application:</strong></p>
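<p>An exemplar is a reference, typically a trace ID, attached to an individual metric sample. It points from the metric to one concrete request or build that contributed to it. To use exemplars, you’ll need to make sure your application is instrumented to send exemplar data alongside your metrics.</p>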
<p>Many libraries support adding exemplars to Prometheus metrics, such as <code>prom-client</code> (Node.js) and <code>prometheus-net</code> (C#).</p>
<p>Here’s an example in Node.js:</p>
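<pre><code class="lang-javascript"><span class="hljs-comment">// Attaches an exemplar (a trace ID) to a Prometheus counter so the sample can be linked to a trace.</span>
<span class="hljs-comment">// Assumes a prom-client release with exemplar support; the trace and span IDs are placeholders.</span>
<span class="hljs-keyword">const</span> promClient = <span class="hljs-built_in">require</span>(<span class="hljs-string">'prom-client'</span>);

<span class="hljs-comment">// Exemplars are only exposed in the OpenMetrics format, so switch the registry to it.</span>
promClient.register.setContentType(promClient.Registry.OPENMETRICS_CONTENT_TYPE);

<span class="hljs-comment">// Creates a counter metric to track failed CI/CD builds.</span>
<span class="hljs-keyword">const</span> counter = <span class="hljs-keyword">new</span> promClient.Counter({
  <span class="hljs-attr">name</span>: <span class="hljs-string">'ci_cd_failed_builds_total'</span>,  <span class="hljs-comment">// Metric name for failed builds.</span>
  <span class="hljs-attr">help</span>: <span class="hljs-string">'Total number of failed builds'</span>,  <span class="hljs-comment">// Description of the metric.</span>
  <span class="hljs-attr">enableExemplars</span>: <span class="hljs-literal">true</span>,  <span class="hljs-comment">// Opt in to exemplars for this metric.</span>
});

<span class="hljs-comment">// Increments the counter and records which trace caused this failure.</span>
counter.inc({
  <span class="hljs-attr">value</span>: <span class="hljs-number">1</span>,
  <span class="hljs-attr">exemplarLabels</span>: { <span class="hljs-attr">traceId</span>: <span class="hljs-string">'4bf92f3577b34da6a3ce929d0e0e4736'</span>, <span class="hljs-attr">spanId</span>: <span class="hljs-string">'00f067aa0ba902b7'</span> },
});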
</code></pre>
<p><strong>2. Enable Exemplars in Prometheus Config:</strong></p>
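<p>Make sure your Prometheus server is configured to store and expose exemplars. Exemplar storage is disabled by default, so start Prometheus with the <code>--enable-feature=exemplar-storage</code> flag. Exemplars are attached to counter and histogram bucket samples, and they are only exposed in the OpenMetrics format, so make sure your client library serves that format.</p>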
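<p>It can help to see what an exemplar looks like on the wire. In the OpenMetrics text format, the exemplar follows the sample value after a <code>#</code>, as a label set plus the exemplar’s own value. The sketch below formats such a line. The function name and trace ID are made up for illustration, and in practice your client library emits this line for you.</p>

```python
def exemplar_line(metric, value, trace_id, exemplar_value=1.0):
    """Format one OpenMetrics sample line that carries a trace ID exemplar."""
    return f'{metric} {value} # {{trace_id="{trace_id}"}} {exemplar_value}'


# Placeholder trace ID for illustration.
line = exemplar_line("ci_cd_failed_builds_total", 3, "4bf92f3577b34da6a3ce929d0e0e4736")
print(line)
```

<p>If you <code>curl</code> your service’s metrics endpoint and never see a <code>#</code> section like this after the sample values, the application is not emitting exemplars, and neither Prometheus nor Grafana will have anything to show.</p>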
<p><strong>3. Visualizing Exemplars in Grafana:</strong></p>
<p>In Grafana, when you query Prometheus for metrics that carry exemplars, Grafana plots each exemplar as a marker on the graph. Hovering over a marker shows its labels, such as the trace ID.</p>
<p>Turn on the <code>Exemplars</code> toggle in the panel’s query options to display them. To jump straight from an exemplar to the matching trace or logs, link the exemplar label to a tracing or logging data source in the Prometheus data source settings.</p>
<p>For example, if your <code>ci_cd_failed_builds_total</code> metric jumps after a pipeline failure, you can click the exemplar next to that jump in Grafana and go directly to the trace or logs for that specific failure.</p>
<h2 id="heading-how-to-diagnose-and-fix-common-cicd-problems">How to Diagnose and Fix Common CI/CD Problems</h2>
<p>CI/CD pipelines often encounter issues like build failures, dependency problems, and flaky tests that can disrupt development workflows. This section provides practical strategies to diagnose and resolve these common problems using log analysis and systematic debugging techniques, helping you restore pipeline stability quickly.</p>
<h3 id="heading-strategy-1-systematically-debug-build-failures"><strong>Strategy 1: Systematically Debug Build Failures</strong></h3>
<p>Build failures are a frequent CI/CD challenge, often stemming from errors in code, tests, or configurations. Systematically debugging these issues involves analyzing logs to pinpoint root causes, using the following approaches.</p>
<h4 id="heading-identifying-patterns-in-compiler-and-test-output">Identifying Patterns in Compiler and Test Output</h4>
<p>When debugging build failures, you need to first examine the logs from the compiler and test outputs. Let’s go over some key strategies.</p>
<h4 id="heading-1-check-for-specific-error-messages">1. Check for Specific Error Messages:</h4>
<p>Common types of error messages include:</p>
<ul>
<li><p><strong>Syntax errors</strong>: Look for lines indicating that there's a mismatch in syntax, such as missing semicolons, undeclared variables, or incorrect function calls.</p>
</li>
<li><p><strong>Linker errors</strong>: These often occur when the required libraries or dependencies are not found. You'll typically see errors like <code>undefined reference</code> or <code>symbol not found</code>.</p>
</li>
<li><p><strong>Build tool errors</strong>: If you are using build systems like Maven, Gradle, or MSBuild, their logs will give specific error codes or missing configurations.</p>
</li>
</ul>
<h4 id="heading-2-look-for-common-error-patterns">2. Look for Common Error Patterns:</h4>
<p>Often, failed builds repeat the same error or pattern across multiple runs. Check logs for recurring terms or errors that point to specific modules or functions. And remember that grouping similar issues can help you identify the root cause faster.</p>
<h4 id="heading-3-use-regular-expressions-for-log-filtering">3. Use Regular Expressions for Log Filtering:</h4>
<p>You can use regular expressions to search for keywords in the logs that match common failure patterns (for example, "error", "failed", "exception", "out of memory"). This will help you filter out unrelated messages and focus on the failures.</p>
<p><strong>As an example:</strong></p>
<ul>
<li><p>If the build fails with an "Out of Memory" error, search for any memory allocation issues or settings that can be increased.</p>
</li>
<li><p>If test failures are related to specific modules, inspect those modules for recent changes or dependency issues.</p>
</li>
</ul>
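<p>The regular-expression filtering described above can be sketched in a few lines. The pattern uses the keywords mentioned earlier, and the sample build log is invented for illustration, so adjust both to the messages your own tools produce.</p>

```python
import re

# Keywords taken from the text above; extend this list for your own build tools.
FAILURE_PATTERN = re.compile(r"\b(error|failed|exception|out of memory)\b", re.IGNORECASE)


def failure_lines(log_text):
    """Return (line number, line) pairs for lines that match a failure keyword."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if FAILURE_PATTERN.search(line)
    ]


# Invented build log for illustration.
build_log = "\n".join([
    "[INFO] Compiling 42 source files",
    "[ERROR] cannot find symbol: class BuildHelper",
    "[INFO] Running tests",
    "FATAL: Out of memory while linking",
    "BUILD FAILED in 12s",
])
hits = failure_lines(build_log)
for number, line in hits:
    print(f"{number}: {line}")
```

<p>Keeping the line numbers makes it easy to jump back to the surrounding context in the full log.</p>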
<h3 id="heading-strategy-2-troubleshooting-dependency-issues-with-log-analysis">Strategy 2: Troubleshooting Dependency Issues with Log Analysis</h3>
<p>Dependency issues are common in build failures, especially in complex CI/CD pipelines with multiple modules or services. To resolve these issues, consider the following:</p>
<p><strong>1. Check for Missing or Outdated Dependencies</strong>:</p>
<p>Start by reviewing the build tool’s output to check for messages related to missing dependencies (for example, <code>dependency not found</code>, <code>version conflict</code>).</p>
<p>Many build tools (like Maven, npm, or .NET) will include specific error messages when a dependency is missing or incompatible.</p>
<p><strong>2. Inspect Dependency Resolution Logs</strong>:</p>
<p>Some build tools provide detailed logs showing how dependencies were resolved (for example, the version of a library that was used). These logs can show you if there’s a version mismatch.</p>
<p>Make sure that your <code>package.json</code> (for JavaScript projects), <code>pom.xml</code> (for Java), or <code>csproj</code> (for C#) files are correctly defined with compatible versions.</p>
<p><strong>3. Verify Network Connectivity</strong>:</p>
<p>CI/CD tools sometimes fail to fetch dependencies due to network issues (for example, proxy settings, repository access). Look for any errors indicating that a repository couldn’t be reached.</p>
<p><strong>4. Log Example:</strong></p>
<p>If a Java project fails with <code>Could not find artifact</code>, a dependency is most likely missing or inaccessible. Check that the repository URL is correct and that the artifact actually exists in your Maven repository.</p>
<p><strong>5. Resolve Version Conflicts</strong>:</p>
<p>Version conflicts occur when different dependencies require incompatible versions of the same library. This is especially true in Java (with Maven/Gradle) and .NET projects. Consider using tools to resolve version conflicts automatically or define compatible versions manually.</p>
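<p>As a simple illustration of spotting version conflicts, the sketch below groups requested versions by library and reports any library that is requested in more than one version. The input format and the sample module names are assumptions for this example. It compares version strings exactly and does not understand version ranges, so treat it as a starting point rather than a replacement for your build tool’s own dependency report.</p>

```python
from collections import defaultdict


def find_version_conflicts(requirements):
    """Return libraries that are requested in more than one version.

    requirements: iterable of (requested_by, library, version) tuples.
    """
    versions = defaultdict(dict)
    for requested_by, library, version in requirements:
        versions[library].setdefault(version, []).append(requested_by)
    return {
        library: dict(by_version)
        for library, by_version in versions.items()
        if len(by_version) > 1
    }


# Hypothetical requirements collected from two modules.
requirements = [
    ("service-a", "jackson-databind", "2.15.2"),
    ("service-b", "jackson-databind", "2.13.0"),
    ("service-a", "guava", "32.1.2-jre"),
    ("service-b", "guava", "32.1.2-jre"),
]
conflicts = find_version_conflicts(requirements)
print(conflicts)
```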
<h3 id="heading-fixing-flaky-tests-based-on-historical-log-data">Fixing Flaky Tests Based on Historical Log Data</h3>
<p><strong>Note:</strong> Issues like container crashes, logs not ingested, or resource constraints here may resemble those in other sections. These are common across CI/CD services and processes, but each section offers unique context to avoid redundancy.</p>
<p>Flaky tests – that is, those that pass sometimes and fail at other times – are common in CI/CD pipelines, and they can be frustrating. Let’s discuss some strategies for how you can tackle them:</p>
<p><strong>1. Analyze Test Logs Over Time</strong>:</p>
<p>Review historical logs to identify patterns in when the test fails. Look for timing issues, resource limits, or external dependencies that could affect test reliability.</p>
<p>For example, if a test intermittently fails after a certain amount of time or only during specific pipeline stages, it could indicate resource exhaustion or race conditions.</p>
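<p>One way to put historical data to work is to compute a failure rate for each test and keep only the tests that both passed and failed. The sketch below does that. The <code>(test_name, passed)</code> input shape and the sample history are assumptions, and you would build the real list from your stored test reports.</p>

```python
from collections import defaultdict


def flaky_tests(history):
    """Return the failure rate of every test that has both passes and failures.

    history: iterable of (test_name, passed) tuples, one per historical run.
    """
    outcomes = defaultdict(list)
    for name, passed in history:
        outcomes[name].append(passed)
    rates = {}
    for name, results in outcomes.items():
        # Tests that always pass or always fail are not flaky.
        if any(results) and not all(results):
            rates[name] = results.count(False) / len(results)
    return rates


# Hypothetical results from past pipeline runs.
history = [
    ("test_login", True), ("test_login", False), ("test_login", True), ("test_login", True),
    ("test_checkout", True), ("test_checkout", True),
    ("test_export", False), ("test_export", False),
]
flaky = flaky_tests(history)
print(flaky)
```

<p>A test that fails every time, like <code>test_export</code> here, is deliberately excluded: it is broken rather than flaky and needs a different fix.</p>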
<p><strong>2. Check Test Dependencies</strong>:</p>
<p>Often, flaky tests are dependent on external services or resources (for example, databases, APIs, file systems). Check if these services are consistently available and properly mocked during test execution.</p>
<p>Logs that mention failed connections to external services or unstable environments can give you insights into potential issues with dependencies.</p>
<p><strong>3. Run Tests with Increased Logging</strong>:</p>
<p>Increase the verbosity of test logs to capture more information about the failures. This can help you detect why tests fail in certain conditions.</p>
<p>For example, adding debug logs inside tests can provide more context on the state of the application when the failure occurs.</p>
<p><strong>4. Time of Day Issues</strong>:</p>
<p>Some flaky tests may fail during peak usage times, especially if they rely on shared resources. Look for patterns that correlate with resource contention (for example, database locks, API rate limits).</p>
<p>Logs showing high CPU or memory usage can indicate that resource constraints are affecting the stability of your tests.</p>
<p><strong>5. Implement Retry Logic for Flaky Tests</strong>:</p>
<p>To mitigate the effects of flaky tests, implement automatic retries for tests that fail intermittently. This can help reduce the noise in your CI/CD pipeline while you investigate the root causes.</p>
<p>For example, if a database connection test fails intermittently, you may want to inspect database logs for signs of timeouts or connection pool exhaustion.</p>
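<p>Retry logic can be as simple as a decorator that re-runs a failing call a limited number of times. The sketch below is one way to write it. The <code>connect_to_database</code> function is a stand-in that fails twice before succeeding, purely to demonstrate the behavior. Many test frameworks also offer retry plugins, which may be a better fit than rolling your own.</p>

```python
import functools
import time


def retry(times=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Re-run the wrapped function up to `times` attempts when it raises `exceptions`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as error:
                    if attempt == times:
                        raise  # out of attempts: surface the real failure
                    print(f"attempt {attempt} failed ({error}); retrying")
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


@retry(times=3, exceptions=(ConnectionError,))
def connect_to_database():
    # Stand-in for a flaky operation: fails on the first two calls.
    calls["count"] += 1
    if calls["count"] in (1, 2):
        raise ConnectionError("connection pool exhausted")
    return "connected"


result = connect_to_database()
print(result, "after", calls["count"], "attempts")
```

<p>Logging every retried attempt matters: without it, retries hide the flakiness you are trying to investigate.</p>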
<h3 id="heading-how-to-resolve-deployment-pipeline-failures">How to Resolve Deployment Pipeline Failures</h3>
<p>Deployment pipeline failures can stem from several sources, and diagnosing them requires a systematic approach using logs and available observability tools. Below, we will outline the common patterns in logs that indicate resource constraints, permission/authentication issues, and configuration drift between environments.</p>
<p><strong>Log Patterns That Indicate Resource Constraints</strong></p>
<p>Resource constraints are a common cause of pipeline failures. These can include CPU limits, memory usage, or disk space running out. Here's how to recognize these patterns:</p>
<h4 id="heading-key-indicators-in-logs">Key Indicators in Logs:</h4>
<ul>
<li><strong>Memory Issues</strong>: Look for messages like <em>"out of memory"</em>, <em>"memory limit exceeded"</em>, or <em>"OOM killed"</em> in your logs. Here’s an example in Kubernetes logs:</li>
</ul>
<pre><code class="lang-javascript">pod has been OOMKilled
</code></pre>
<ul>
<li><strong>CPU Limits</strong>: Watch for logs showing that a process exceeded CPU limits or was throttled. Here’s an example:</li>
</ul>
<pre><code class="lang-javascript">process <span class="hljs-string">'foo'</span> hit CPU limit, throttling at <span class="hljs-number">100</span>%
</code></pre>
<ul>
<li><strong>Disk Space</strong>: Logs may show file write errors or messages about a disk being full. Here’s an example:</li>
</ul>
<pre><code class="lang-javascript">Unable to write to file, disk space is full.
</code></pre>
<p>You can resolve the memory issues by increasing the allocated memory for your containers, VM, or cloud instances.</p>
<p>You can resolve the CPU issues by adjusting CPU limits or scaling your infrastructure to add more resources.</p>
<p>And finally, you can resolve disk space issues by cleaning up unused files or increasing disk capacity on the server/container.</p>
<p><strong>Identify Permission and Authentication Issues</strong></p>
<p>Permission and authentication issues often result in pipeline failures due to a lack of access to necessary resources or services. These issues might occur when you’re trying to access databases, deploy to cloud services, or authenticate third-party APIs.</p>
<p>There are some key indicators in the logs that you can look out for:</p>
<h4 id="heading-1-authentication-failures">1. Authentication Failures:</h4>
<p>Look for messages related to failed logins, incorrect credentials, or invalid tokens.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript">Authentication failed <span class="hljs-keyword">for</span> user <span class="hljs-string">'admin'</span>
</code></pre>
<pre><code class="lang-javascript">Invalid API token provided.
</code></pre>
<h4 id="heading-2-permission-denied">2. Permission Denied:</h4>
<p>Logs may indicate that the CI/CD pipeline lacks the permissions to perform a certain action.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript">Access denied <span class="hljs-keyword">for</span> /path/to/deployment/target
</code></pre>
<pre><code class="lang-javascript">Unauthorized request to cloud service.
</code></pre>
<p><strong>How to resolve these errors</strong>:</p>
<ul>
<li><p><strong>Credentials</strong>: Ensure the credentials (API keys, access tokens, SSH keys) used in the pipeline are up-to-date and correctly configured.</p>
</li>
<li><p><strong>Permissions</strong>: Review and update the role-based access control (RBAC) settings for the service account running the pipeline to ensure it has the necessary permissions.</p>
</li>
<li><p><strong>Secrets Management</strong>: Use tools like Vault, AWS Secrets Manager, or Azure Key Vault to securely manage secrets and credentials.</p>
</li>
</ul>
<p><strong>Troubleshooting Configuration Drift Between Environments</strong></p>
<p>Configuration drift occurs when different environments (like development, staging, production) are not synchronized. This can lead to inconsistent behavior during deployments, and often results in failures in one environment but not in others.</p>
<p>Look out for these key indicators in the logs:</p>
<h4 id="heading-1-mismatch-in-environment-variables">1. Mismatch in Environment Variables:</h4>
<p>If you’re using environment variables, check for discrepancies across different stages. For example:</p>
<pre><code class="lang-javascript">Environment variable DATABASE_URL not found <span class="hljs-keyword">in</span> production
</code></pre>
<h4 id="heading-2-dependency-versions">2. Dependency Versions:</h4>
<p>Mismatched versions of dependencies between environments can cause unexpected issues.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">Error</span>: Dependency <span class="hljs-string">'libxyz'</span> version mismatch between environments
</code></pre>
<h4 id="heading-3-service-configuration">3. Service Configuration:</h4>
<p>Look for configuration-related errors that might not be present in a development environment but occur in production.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">Error</span>: Invalid config <span class="hljs-keyword">in</span> <span class="hljs-string">'production-config.yaml'</span>
</code></pre>
<p><strong>How to resolve these errors</strong>:</p>
<ul>
<li><p><strong>Use Infrastructure as Code (IaC)</strong>: Tools like Terraform, Ansible, or CloudFormation can help ensure that environments are provisioned consistently.</p>
</li>
<li><p><strong>Automated Configuration Management</strong>: Use CI/CD pipeline steps to automate environment setup to avoid manual changes that can cause drift.</p>
</li>
<li><p><strong>Environment Consistency Checks</strong>: Implement checks to compare configurations and dependencies across environments before deployment.</p>
<ul>
<li>Example: You can add a pre-deployment stage to run a script that compares environment variables, configurations, and dependency versions between staging and production.</li>
</ul>
</li>
<li><p><strong>Configuration Management Tools</strong>: Use configuration management tools like Chef, Puppet, or SaltStack to maintain consistent configurations across environments.</p>
</li>
</ul>
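<p>The pre-deployment consistency check mentioned above could look something like this sketch. It compares two sets of settings and reports keys missing on either side, plus values that must be identical but are not. The setting names and the <code>must_match</code> parameter are hypothetical, and values such as database URLs are expected to differ between environments, so only the keys you list are compared by value.</p>

```python
def compare_environments(staging, production, must_match=()):
    """Report missing keys and mismatched values between two environments."""
    staging_keys, production_keys = set(staging), set(production)
    return {
        "missing_in_production": sorted(staging_keys - production_keys),
        "missing_in_staging": sorted(production_keys - staging_keys),
        "mismatched": sorted(
            key for key in must_match
            if key in staging and key in production and staging[key] != production[key]
        ),
    }


# Hypothetical settings for illustration.
staging = {
    "DATABASE_URL": "postgres://staging-db/app",
    "LIBXYZ_VERSION": "2.4.1",
    "FEATURE_FLAGS": "beta",
}
production = {
    "LIBXYZ_VERSION": "2.3.0",
    "FEATURE_FLAGS": "stable",
}
report = compare_environments(staging, production, must_match=("LIBXYZ_VERSION",))
print(report)
```

<p>In a pipeline, the stage would exit with a non-zero status whenever any of the three lists is non-empty, which stops the deployment before the drift reaches production.</p>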
<h3 id="heading-how-to-debug-container-based-deployment-issues">How to Debug Container-Based Deployment Issues</h3>
<p>Debugging container-based deployment issues requires specialized tools and techniques to trace errors in containerized environments. Below are strategies to efficiently collect logs, diagnose failures, and use ephemeral containers for investigation.</p>
<h4 id="heading-collecting-and-analyzing-container-logs-effectively">Collecting and Analyzing Container Logs Effectively</h4>
<p>Container logs are essential for troubleshooting issues, and effective collection and analysis can significantly speed up the debugging process.</p>
<p>Here’s how you can collect container logs:</p>
<p><strong>1. Docker Logs:</strong></p>
<p>You can use Docker’s <code>logs</code> command to view logs of a specific container:</p>
<pre><code class="lang-bash">docker logs &lt;container_name_or_id&gt;
</code></pre>
<p>If your container uses a logging driver (like <code>json-file</code> or <code>fluentd</code>), ensure that logs are being written to an accessible location.</p>
<p><strong>2. Kubernetes Logs:</strong></p>
<p>For Kubernetes-managed containers, use <code>kubectl</code> to access pod logs:</p>
<pre><code class="lang-bash">kubectl logs &lt;pod_name&gt;
</code></pre>
<p>To view logs for all containers in a pod:</p>
<pre><code class="lang-bash">kubectl logs &lt;pod_name&gt; --all-containers=<span class="hljs-literal">true</span>
</code></pre>
<p><strong>3. Log Aggregation:</strong></p>
<p>You can integrate with centralized logging systems (like <strong>Grafana Loki</strong> or the <strong>Elastic Stack</strong>). You can also use Fluentd or Logstash as log shippers for forwarding logs from containers to a logging backend.</p>
<h4 id="heading-analyzing-logs">Analyzing Logs:</h4>
<p><strong>1. Filter and Search Logs:</strong></p>
<p>Use <code>grep</code> to filter logs for specific error messages or patterns:</p>
<pre><code class="lang-bash">docker logs &lt;container_name&gt; | grep <span class="hljs-string">"ERROR"</span>
</code></pre>
<p>In Kubernetes, you can combine <code>kubectl</code> with <code>grep</code> or other tools for advanced filtering.</p>
<p><strong>2. Log Contextualization:</strong></p>
<p>Include metadata in your logs (for example, container ID, environment, timestamps) for easier debugging. Ensure logs are structured in formats like JSON to allow for better querying and filtering.</p>
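<p>Here is a minimal sketch of structured logging with metadata, using only Python’s standard <code>logging</code> module. The field names and the <code>container_id</code> and <code>environment</code> values are assumptions for illustration. In a real container you would read them from environment variables or let your log shipper add them.</p>

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object with fixed metadata fields."""

    def __init__(self, static_fields):
        super().__init__()
        self.static_fields = static_fields

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            **self.static_fields,
        }
        return json.dumps(entry)


# Hypothetical metadata values for illustration.
formatter = JsonFormatter({"container_id": "abc123", "environment": "staging"})

handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger("deploy")
logger.addHandler(handler)
logger.error("image pull failed")

# Format one record directly to inspect the JSON that gets emitted.
record = logging.LogRecord("deploy", logging.ERROR, "deploy.py", 0, "image pull failed", None, None)
parsed = json.loads(formatter.format(record))
print(parsed["level"], parsed["container_id"])
```

<p>Because every line is a JSON object, a query such as <code>environment: "staging" AND level: "ERROR"</code> becomes possible once the logs reach your aggregation backend.</p>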
<h3 id="heading-how-to-diagnose-image-pull-and-networking-failures">How to Diagnose Image Pull and Networking Failures</h3>
<p>Container deployment failures often stem from issues related to image pulling or network connectivity. Here’s how to troubleshoot these problems:</p>
<h4 id="heading-image-pull-failures">Image Pull Failures:</h4>
<p>There are some common issues you might see, such as:</p>
<ul>
<li><p><strong>Authentication failures:</strong> If the container registry requires authentication, ensure your credentials (username/password or tokens) are correct.</p>
</li>
<li><p><strong>Network connectivity:</strong> Check that the host or node pulling the image can reach the registry endpoint. Often, firewalls or DNS issues block the image pull.</p>
</li>
<li><p><strong>Image not found:</strong> Verify the image name and tag are correct. Use <code>docker pull</code> to manually pull the image to see if the issue is specific to the deployment process.</p>
</li>
</ul>
<p>There are various ways to diagnose them:</p>
<p>For <strong>Docker</strong>, use:</p>
<pre><code class="lang-bash">docker pull &lt;image_name&gt;
</code></pre>
<p>This will output the specific error message if the image pull fails.</p>
<p>For <strong>Kubernetes</strong>, check the event logs for the pod:</p>
<pre><code class="lang-bash">kubectl describe pod &lt;pod_name&gt;
</code></pre>
<p>Look under "Events" for <code>Failed</code> entries, such as <code>ErrImagePull</code> or <code>ImagePullBackOff</code>, which explain why the image pull failed (for example, wrong credentials or tag). If the issue is with registry authentication, configure Kubernetes <strong>imagePullSecrets</strong> or Docker's credentials to ensure the correct access.</p>
<h4 id="heading-networking-failures">Networking Failures:</h4>
<p>Some common issues you may encounter are:</p>
<ul>
<li><p><strong>DNS resolution problems:</strong> Containers may fail to resolve hostnames if DNS configurations are incorrect.</p>
</li>
<li><p><strong>Network policies and firewall rules:</strong> Network policies or firewalls may block necessary ports.</p>
</li>
<li><p><strong>Inter-container communication:</strong> If containers need to talk to each other, ensure they’re on the same network or subnet.</p>
</li>
</ul>
<p>Again, there are various ways to diagnose these issues:</p>
<p><strong>For Docker networking:</strong></p>
<p>You can do this to view all Docker networks:</p>
<pre><code class="lang-bash">docker network ls
</code></pre>
<p>You can also inspect the network of your container like this:</p>
<pre><code class="lang-bash">docker network inspect &lt;network_name&gt;
</code></pre>
<p>Check if the container is correctly attached to the network and if necessary ports are exposed.</p>
<p><strong>For Kubernetes Networking:</strong></p>
<p>You can use <code>kubectl</code> to check network policies:</p>
<pre><code class="lang-bash">kubectl get networkpolicies
</code></pre>
<p>You can also check the pod’s network settings like this:</p>
<pre><code class="lang-bash">kubectl describe pod &lt;pod_name&gt; | grep -i <span class="hljs-string">"Network"</span>
</code></pre>
<p><strong>Testing Connectivity Inside Containers:</strong></p>
<p>For Docker, exec into the container and test:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it &lt;container_id&gt; /bin/bash
ping &lt;hostname_or_ip&gt;
curl http://&lt;service_address&gt;:&lt;port&gt;
</code></pre>
<p>In Kubernetes, use <code>kubectl exec</code> to access the pod and test connectivity:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it &lt;pod_name&gt; -- /bin/bash
</code></pre>
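<p>Minimal container images often ship without <code>ping</code> or <code>curl</code>. If the image does include Python, a short TCP check like the sketch below can stand in for them. The function name is illustrative, and the demo connects to a throwaway listener on the local machine so that the example runs anywhere.</p>

```python
import socket


def can_connect(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Demo: open a temporary local listener on a free port and connect to it.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = can_connect("127.0.0.1", demo_port)
listener.close()
print("reachable:", reachable)
```

<p>Inside a pod or container, you would call <code>can_connect</code> with the service address and port you are trying to reach.</p>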
<h3 id="heading-how-to-use-ephemeral-debug-containers-for-investigation">How to Use Ephemeral Debug Containers for Investigation</h3>
<p>Ephemeral debug containers are short-lived containers that help investigate issues in a running environment without altering the main application container.</p>
<h4 id="heading-what-are-ephemeral-debug-containers">What are Ephemeral Debug Containers?</h4>
<p>Ephemeral debug containers allow you to run diagnostic commands (like shell access, <code>ping</code>, or <code>curl</code>) in the same network environment as the failing application container, without modifying the application itself.</p>
<h4 id="heading-how-to-set-up-ephemeral-containers-in-docker">How to Set Up Ephemeral Containers in Docker:</h4>
<p><strong>Use the</strong> <code>docker run</code> <strong>command:</strong></p>
<p>You can create a new container for debugging by running a container with the same network settings as the failing container:</p>
<pre><code class="lang-bash">docker run --rm -it --network container:&lt;container_name_or_id&gt; --entrypoint /bin/bash &lt;debug_image&gt;
</code></pre>
<p>This command runs an interactive shell inside the debug container, which shares the target container’s network namespace. The <code>--rm</code> flag removes the debug container when you exit the shell.</p>
<h4 id="heading-ephemeral-containers-in-kubernetes">Ephemeral Containers in Kubernetes:</h4>
<p>Kubernetes allows you to inject an ephemeral debug container into a running pod. You can add a temporary debug container to your pod using the following command:</p>
<pre><code class="lang-bash">kubectl debug &lt;pod_name&gt; -it --image=&lt;debug_image&gt; --target=&lt;container_name&gt;
</code></pre>
<p>This command will run a new container in the same pod as the target container, allowing you to run diagnostic commands.</p>
<p>Example use cases are investigating file systems, running network diagnostics, checking configuration files, and so on.</p>
<p>These debug containers are meant to be temporary and can be discarded after the issue is resolved.</p>
<h2 id="heading-how-to-implement-advanced-debugging-techniques">How to Implement Advanced Debugging Techniques</h2>
<p>This section covers advanced methods to diagnose complex CI/CD pipeline issues that standard log analysis might miss. We’ll explore distributed tracing to track requests across multiple services and combine traces with logs and metrics for deeper insights.</p>
<p>These techniques are designed to work within budget constraints, ensuring effective debugging for your CI/CD workflows.</p>
<h3 id="heading-choosing-a-tracing-backend-for-cicd"><strong>Choosing a Tracing Backend for CI/CD</strong></h3>
<p>Distributed tracing enables you to monitor a request’s path through various services in your CI/CD pipeline, such as from a build step to a deployment, identifying delays or failures. Choosing a tracing backend means selecting a tool to store and analyze this trace data. Below, we compare Jaeger, Tempo, and hosted solutions for distributed tracing.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Tool</strong></td><td><strong>Resource Usage</strong></td><td><strong>Setup Complexity</strong></td><td><strong>Best For</strong></td><td><strong>CI/CD Fit</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Jaeger</strong></td><td>Low</td><td>Easy (Docker-based)</td><td>Small teams, local setups</td><td>Simple pipelines, quick trace views</td></tr>
<tr>
<td><strong>Tempo</strong></td><td>Low</td><td>Moderate (Grafana integration)</td><td>Grafana users, log/metric correlation</td><td>Complex pipelines, unified observability</td></tr>
<tr>
<td><strong>Hosted (e.g., Lightstep)</strong></td><td>Variable (cloud-based)</td><td>Easy (managed)</td><td>Teams with budget for cloud services</td><td>Scalable, production-grade tracing</td></tr>
</tbody>
</table>
</div><p>When to choose each one:</p>
<ul>
<li><p><strong>Jaeger</strong>: Ideal for quick, local tracing setups with minimal overhead.</p>
</li>
<li><p><strong>Tempo</strong>: Best for teams already using Grafana Loki/Prometheus for unified observability.</p>
</li>
<li><p><strong>Hosted Solutions</strong>: Suited for large-scale pipelines needing managed scalability.</p>
</li>
</ul>
<h3 id="heading-how-to-set-up-distributed-tracing-on-a-budget">How to Set Up Distributed Tracing on a Budget</h3>
<p>Distributed tracing is crucial for debugging and observing complex, multi-step operations across services. It allows you to follow requests as they propagate through different services and components of your pipeline. Implementing this on a budget can still provide valuable insights.</p>
<h4 id="heading-how-to-use-opentelemetry-with-free-backends">How to Use OpenTelemetry with Free Backends</h4>
<p><a target="_blank" href="https://www.freecodecamp.org/news/how-to-use-opentelementry-to-trace-node-js-applications/">OpenTelemetry</a> is an open-source framework that enables you to collect, process, and export telemetry data like traces and metrics. It supports multiple backends, and we’ll focus on using free, budget-friendly backends for trace storage and analysis.</p>
<p><strong>1. Install OpenTelemetry Collector:</strong></p>
<p>OpenTelemetry provides an agent (collector) that collects traces and metrics from your application and sends them to a backend.</p>
<p>To install the OpenTelemetry Collector, download the binary for your OS or use Docker to deploy it:</p>
<pre><code class="lang-bash">docker pull otel/opentelemetry-collector:latest
</code></pre>
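<p>Then run the OpenTelemetry Collector in Docker. The image starts with its bundled default configuration, and the command below publishes the Collector’s default OTLP ports: 4317 for gRPC and 4318 for HTTP. Current releases no longer listen on the legacy port 55680, and port 14250 is left out because the Jaeger container publishes it on the host:</p>
<pre><code class="lang-bash">docker run -d --name opentelemetry-collector -p 4317:4317 -p 4318:4318 otel/opentelemetry-collector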
</code></pre>
<p><strong>2. Configure OpenTelemetry to Export to Free Backends:</strong></p>
<p>There are a few popular free backends you can use for distributed tracing, like Jaeger and Prometheus + Tempo. Let’s see how to use both here.</p>
<p>We’ll start with <strong>Jaeger</strong>, an open-source tracing backend. It’s highly scalable and works well with OpenTelemetry.</p>
<p>You can use the Docker version for easy deployment:</p>
<pre><code class="lang-bash">docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14250:14250 -p 14268:14268 -p 9411:9411 jaegertracing/all-in-one:1.30
</code></pre>
<p>Alternatively, you can use hosted services like <strong>Lightstep</strong>, <strong>AWS X-Ray</strong>, or <strong>Honeycomb</strong> for cloud-native environments.</p>
<p>Now let’s see how to use <strong>Prometheus</strong> + <strong>Tempo</strong> for logs and metrics correlation.</p>
<p>Tempo is a distributed tracing backend built by Grafana that integrates well with other Grafana tools (Loki and Prometheus).</p>
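<p>You can install Tempo using Docker. This command and the Jaeger command above both publish host port 14268, so run only one of the two containers at a time or map one of them to a different host port:</p>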
<pre><code class="lang-bash">docker run -d --name tempo -p 14268:14268 grafana/tempo:latest
</code></pre>
<p><strong>3. Instrument Your Code with OpenTelemetry SDK:</strong></p>
<p>For Python/Node.js/Java/Go applications, you can install the appropriate OpenTelemetry SDK and start tracing.</p>
<p>Here’s a Python example:</p>
<pre><code class="lang-bash">pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation opentelemetry-exporter-otlp-proto-http
</code></pre>
<p>And a Node.js example:</p>
<pre><code class="lang-bash">npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/instrumentation
</code></pre>
<p>And one in Java:</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>io.opentelemetry<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>opentelemetry-api<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>1.0.0<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
</code></pre>
<p>After installation, you can use the OpenTelemetry SDK to instrument the application and start collecting traces for HTTP requests, database queries, and other pipeline interactions.</p>
<p><strong>4. Send Data to the Collector:</strong></p>
<p>You can configure the SDK to send trace data to your OpenTelemetry Collector, which will then forward it to your backend (Jaeger, Tempo, and so on). Here’s an example for Python:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> opentelemetry <span class="hljs-keyword">import</span> trace
<span class="hljs-keyword">from</span> opentelemetry.exporter.otlp.proto.http.trace_exporter <span class="hljs-keyword">import</span> OTLPSpanExporter
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace <span class="hljs-keyword">import</span> TracerProvider
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace.export <span class="hljs-keyword">import</span> BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())
<span class="hljs-comment"># 4318 is the Collector's default OTLP/HTTP port; the HTTP exporter needs the full /v1/traces path</span>
exporter = OTLPSpanExporter(endpoint=<span class="hljs-string">"http://localhost:4318/v1/traces"</span>)
processor = BatchSpanProcessor(exporter)
trace.get_tracer_provider().add_span_processor(processor)
</code></pre>
<p>If traces aren’t appearing, several issues might be occurring:</p>
<ol>
<li><p><strong>Collector fails to start</strong>: Check logs with <code>docker logs opentelemetry-collector</code>. Look for errors like “port conflict” or “invalid config.”</p>
<ul>
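<li>Fix: Map the conflicting port to a free host port in the <code>docker run</code> command (for example, <code>-p 4319:4318</code> for the Collector’s OTLP/HTTP port) or verify the config file.</li>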
</ul>
</li>
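<li><p><strong>No traces in Jaeger</strong>: Ensure the Collector’s exporter points at Jaeger’s gRPC endpoint (<code>localhost:14250</code>). Then confirm the Collector itself is reachable, for example with <code>curl -v http://localhost:4318</code> if you publish its default OTLP/HTTP port.</p>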
<ul>
<li>Fix: Update the exporter endpoint in your SDK configuration.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor usage with <code>docker stats</code>.</p>
<ul>
<li>Fix: Allocate at least 2GB RAM and 10GB disk space for the collector and backend.</li>
</ul>
</li>
</ol>
<h4 id="heading-correlating-traces-with-logs-and-metrics">Correlating Traces with Logs and Metrics</h4>
<p>Combining traces with logs and metrics provides a holistic view of your pipeline’s operations, allowing you to pinpoint the root cause of issues more effectively.</p>
<p>OpenTelemetry and Grafana allow you to link traces, logs, and metrics into a unified view.</p>
<p>Let’s see how you can do this now.</p>
<p><strong>1. Link Logs and Traces Using Correlation IDs:</strong></p>
<p>When generating logs, include trace and span IDs in the log entries. This allows you to correlate logs with specific trace requests.</p>
<p>Here’s an example:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T12:00:00Z"</span>,
  <span class="hljs-attr">"level"</span>: <span class="hljs-string">"error"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Build failure"</span>,
  <span class="hljs-attr">"trace_id"</span>: <span class="hljs-string">"1234567890abcdef"</span>,
  <span class="hljs-attr">"span_id"</span>: <span class="hljs-string">"0987654321abcdef"</span>
}
</code></pre>
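<p>Here’s a minimal sketch of how a script could produce log lines in that shape. The helper names (<code>new_trace_id</code>, <code>new_span_id</code>, <code>format_log</code>) are hypothetical, and the IDs are generated locally only for illustration. In an instrumented application you would read them from the active span instead. The ID lengths follow the W3C Trace Context sizes (16 bytes for a trace ID, 8 bytes for a span ID).</p>
<pre><code class="lang-python">import json
import secrets
from datetime import datetime, timezone


def new_trace_id():
    # 16 random bytes rendered as 32 hex characters
    return secrets.token_hex(16)


def new_span_id():
    # 8 random bytes rendered as 16 hex characters
    return secrets.token_hex(8)


def format_log(level, message, trace_id, span_id):
    """Build one JSON log line that carries the trace and span IDs."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        "span_id": span_id,
    }
    return json.dumps(entry)
</code></pre>
<p>Calling <code>format_log("error", "Build failure", new_trace_id(), new_span_id())</code> returns a single JSON string, which keeps each log entry on one line and easy to parse later.</p>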
<p><strong>2. Integrating Logs (Loki) with Traces (Jaeger/Tempo) in Grafana:</strong></p>
<p>Grafana can integrate traces from Jaeger or Tempo and correlate them with logs from Loki.</p>
<p>To do this:</p>
<ol>
<li><p>Set up Loki and Tempo as data sources in Grafana.</p>
</li>
<li><p>In Grafana’s Explore view, you can search logs and traces side-by-side.</p>
</li>
<li><p>Create dashboards that show metrics, logs, and traces for a complete view of a request flow.</p>
</li>
</ol>
<p><strong>3. Using Prometheus Metrics with Traces:</strong></p>
<p>Prometheus provides metrics that can be correlated with traces. For example, you can use <strong>exemplars</strong> in Prometheus to link specific metric data to trace data.</p>
<p><strong>Example:</strong> If you have a high error rate in your build step, you can correlate this with trace data to identify which requests failed.</p>
<h4 id="heading-creating-trace-visualizations-for-complex-pipeline-operations">Creating Trace Visualizations for Complex Pipeline Operations</h4>
<p>You can visualize traces with Jaeger or Tempo.</p>
<p><strong>To do this in Jaeger:</strong></p>
<p>Once your traces are in Jaeger, you can access the Jaeger UI (<a target="_blank" href="http://localhost:16686"><code>http://localhost:16686</code></a> by default) and use the search functionality to explore traces based on service name, trace ID, or specific operations.</p>
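<p>The Jaeger UI shows each trace as a timeline of spans, so you can see how long every step took and which spans recorded errors. For custom dashboards that chart latency, throughput, and error rates across services, pair the trace data with Grafana.</p>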
<p><strong>To do this in Tempo (Grafana Integration):</strong></p>
<p>Tempo integrates with Grafana, where you can create dashboards that visualize trace data from your pipeline.</p>
<p><strong>Create a Grafana dashboard:</strong></p>
<ol>
<li><p>Add Tempo as a data source in Grafana.</p>
</li>
<li><p>Use the "Trace" panel to query and visualize traces.</p>
</li>
<li><p>Combine trace visualizations with metrics (from Prometheus) and logs (from Loki) to get a unified view of your pipeline.</p>
</li>
</ol>
<p>A typical trace visualization dashboard could show the duration of each step in your pipeline (build, test, deploy) and highlight where delays or errors occur, such as slow database queries or flaky tests.</p>
<p><strong>Troubleshooting Tempo Setup Issues</strong></p>
<p>If Tempo fails to collect or display traces:</p>
<ol>
<li><p><strong>Container fails to start</strong>: Check logs with <code>docker logs tempo</code>. Look for errors like “port already in use” (for example, 14268) or “storage backend unavailable.”</p>
<ul>
<li>Fix: Change ports in the Docker command (for example, <code>-p 14269:14268</code>) or ensure the storage directory (for example, <code>/tmp/tempo</code>) exists and is writable.</li>
</ul>
</li>
<li><p><strong>No traces in Tempo</strong>: Verify the OpenTelemetry Collector is sending traces to Tempo’s endpoint (<code>http://localhost:14268</code>). Test connectivity with <code>curl http://localhost:14268</code>.</p>
<ul>
<li>Fix: Update the collector’s exporter configuration to point to the correct Tempo endpoint, and ensure no firewalls are blocking the connection.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 2GB RAM and 10GB disk space for Tempo, as tracing data can grow quickly with high-volume pipelines.</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748226837500/c9865f8c-f737-49a5-a346-a56f4fac37fd.png" alt="Bar chart showing CI/CD pipeline trace latency for May 2025. Three pipeline stages are displayed: Build stage (blue bar) shows approximately 1,200ms latency, Test stage (yellow bar) shows approximately 800ms latency, and Deploy stage (red bar) shows approximately 1,500ms latency. The Deploy stage has the highest latency, followed by Build, then Test." class="image--center mx-auto" width="1468" height="866" loading="lazy"></p>
<p>This bar chart displays the average latency (in milliseconds) for key stages of a CI/CD pipeline in May 2025. The Build stage averages around 1,200 ms (blue), the Test stage around 800 ms (yellow), and the Deploy stage around 1,500 ms (red), highlighting that deployment is the most time-intensive step.</p>
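<p>The same per-stage comparison can be computed directly from exported span data. The sketch below assumes each span is a dictionary with hypothetical <code>stage</code> and <code>duration_ms</code> fields. The sample values are invented to mirror the shape of the chart and are not real measurements.</p>
<pre><code class="lang-python">from collections import defaultdict


def average_latency_by_stage(spans):
    """Average the duration_ms of spans, grouped by their stage name."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for span in spans:
        totals[span["stage"]] += span["duration_ms"]
        counts[span["stage"]] += 1
    return {stage: totals[stage] / counts[stage] for stage in totals}


def slowest_stage(spans):
    """Return the (stage, average_ms) pair with the highest average latency."""
    averages = average_latency_by_stage(spans)
    return max(averages.items(), key=lambda item: item[1])


# Illustrative span records only; replace with spans exported from your backend
SAMPLE_SPANS = [
    {"stage": "build", "duration_ms": 1100},
    {"stage": "build", "duration_ms": 1300},
    {"stage": "test", "duration_ms": 800},
    {"stage": "deploy", "duration_ms": 1500},
]
</code></pre>
<p>With the sample data, <code>slowest_stage(SAMPLE_SPANS)</code> picks out the deploy stage, which is the kind of answer a latency panel should make obvious at a glance.</p>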
<h2 id="heading-how-to-build-comprehensive-debugging-dashboards">How to Build Comprehensive Debugging Dashboards</h2>
<p>This section explains how to create Grafana dashboards to troubleshoot CI/CD pipeline issues effectively. We’ll focus on setting up visualizations for key metrics, logs, and system resources to identify problems like build failures or resource bottlenecks, using budget-friendly tools to keep your observability stack lean and actionable.</p>
<h3 id="heading-designing-grafana-dashboards-specifically-for-troubleshooting">Designing Grafana Dashboards Specifically for Troubleshooting</h3>
<h4 id="heading-step-1-understand-the-key-metrics-and-logs-to-monitor">Step 1: Understand the Key Metrics and Logs to Monitor</h4>
<p>When designing a Grafana dashboard for debugging, you should focus on metrics and logs that help identify issues in the pipeline. These could include:</p>
<ul>
<li><p><strong>Build failures</strong>: Errors during build processes (compilation, test failures).</p>
</li>
<li><p><strong>Deployment failures</strong>: Issues in deployment, such as failed jobs, resource limitations, or misconfigurations.</p>
</li>
<li><p><strong>Container logs</strong>: Information about container status and logs (if using containers in your pipeline).</p>
</li>
<li><p><strong>System resource usage</strong>: CPU, memory, and disk usage that may lead to performance bottlenecks.</p>
</li>
<li><p><strong>CI/CD-specific metrics</strong>: Number of successful vs. failed pipeline runs, job duration, job queue times.</p>
</li>
</ul>
<h4 id="heading-step-2-set-up-data-sources">Step 2: Set Up Data Sources</h4>
<p>To start building the dashboard, you’ll need to set up your data sources in Grafana. First, connect your Prometheus instance for collecting metrics. To do this, go to <code>Configuration</code> &gt; <code>Data Sources</code> in Grafana. Then just add <code>Prometheus</code> as a data source and enter the URL (for example, <a target="_blank" href="http://localhost:9090"><code>http://localhost:9090</code></a>).</p>
<p>Next, you need to connect your Loki instance for logs. So go ahead and add <code>Loki</code> as a data source by specifying the URL (for example, <a target="_blank" href="http://localhost:3100"><code>http://localhost:3100</code></a>).</p>
<p>Note that if you're using other sources like InfluxDB or Elasticsearch, you’ll need to make sure that they’re properly connected as data sources.</p>
<h4 id="heading-step-3-create-panels-and-visualizations">Step 3: Create Panels and Visualizations</h4>
<p>Now that your data sources are connected, you can start building your dashboard with the following panels:</p>
<ul>
<li><p><strong>Build Status Panel:</strong></p>
<ul>
<li><p>Create a <strong>stat panel</strong> or <strong>gauge panel</strong> to show the success/failure ratio of pipeline runs.</p>
</li>
<li><p>Query Prometheus or Loki for data like build status (success or failure), number of errors, and job durations.</p>
</li>
</ul>
</li>
<li><p><strong>Error Breakdown Panel:</strong></p>
<ul>
<li><p>Use a <strong>pie chart</strong> to visualize the types of errors (for example, build, deployment, or system resource failures).</p>
</li>
<li><p>Query the logs in Loki to break down error types based on the CI tool (for example, Jenkins, GitHub Actions).</p>
</li>
</ul>
</li>
<li><p><strong>Resource Utilization Panel:</strong></p>
<ul>
<li>Use <strong>time series graphs</strong> to monitor CPU, memory, and disk usage over time, especially for resource-heavy builds or deployments.</li>
</ul>
</li>
<li><p><strong>Job Duration Panel:</strong></p>
<ul>
<li>Use <strong>bar charts</strong> or <strong>line graphs</strong> to track the average duration of jobs over time. Set thresholds for warning signs if a job takes longer than expected.</li>
</ul>
</li>
</ul>
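<p>Before wiring up the panels, it can help to check the numbers they should show against a small sample of runs. This is a rough sketch, assuming each run is a dictionary with hypothetical <code>status</code> and <code>duration_s</code> fields. In Grafana itself, these values would come from your Prometheus or Loki queries.</p>
<pre><code class="lang-python">def summarize_runs(runs):
    """Compute the figures a build-status panel would display."""
    total = len(runs)
    if total == 0:
        # No runs yet, so ratios and averages are undefined
        return {"total": 0, "success_ratio": None, "avg_duration_s": None}
    succeeded = sum(1 for run in runs if run["status"] == "success")
    avg_duration = sum(run["duration_s"] for run in runs) / total
    return {
        "total": total,
        "success_ratio": succeeded / total,
        "avg_duration_s": avg_duration,
    }
</code></pre>
<p>If the dashboard later disagrees with a quick calculation like this, the query behind the panel is the first place to look.</p>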
<h4 id="heading-troubleshooting-grafana-dashboard-issues">Troubleshooting Grafana Dashboard Issues</h4>
<p>If Grafana dashboards fail to display data or show errors, you might be having one of these issues:</p>
<ol>
<li><p><strong>Missing data sources</strong>: If metrics, logs, or traces aren’t appearing, verify data source connections in Grafana (for example, Prometheus, Loki, Tempo). Check under Configuration &gt; Data Sources.</p>
<ul>
<li>Fix: Ensure the data source URLs are correct (for example, <code>http://localhost:9090</code> for Prometheus) and test the connection. Re-add the data source if needed.</li>
</ul>
</li>
<li><p><strong>Incorrect Trace IDs</strong>: If trace visualizations (for example, Tempo panels) show no data, confirm that trace IDs in logs match those in Tempo. Use a query like <code>{job="ci_cd"} | json | trace_id="1234567890abcdef"</code> in Loki to cross-check.</p>
<ul>
<li>Fix: Ensure your application logs include trace and span IDs, and verify the OpenTelemetry SDK is correctly instrumented to send traces to Tempo.</li>
</ul>
</li>
<li><p><strong>Resource Constraints</strong>: Monitor Grafana’s resource usage with <code>docker stats</code> if running in a container, or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 4GB RAM and 10GB disk space for Grafana, especially when rendering complex dashboards with multiple data sources.</li>
</ul>
</li>
</ol>
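<p>If you cross-check trace IDs often, you can generate the Loki query instead of typing it by hand. This is a small sketch that builds the same LogQL string used in the trace ID check above. The function name <code>build_logql</code> is hypothetical, and it assumes your logs are JSON with a <code>trace_id</code> field.</p>
<pre><code class="lang-python">def build_logql(job, trace_id):
    """Build a LogQL query that selects JSON log lines for one trace."""

    def quote(value):
        # Escape backslashes and double quotes so the value stays one string literal
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return "{job=" + quote(job) + "} | json | trace_id=" + quote(trace_id)
</code></pre>
<p>For example, <code>build_logql("ci_cd", "1234567890abcdef")</code> produces the query shown in the trace ID check, ready to paste into Grafana’s Explore view.</p>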
<h3 id="heading-how-to-set-up-drill-down-paths-from-high-level-to-detailed-views">How to Set Up Drill-Down Paths from High-Level to Detailed Views</h3>
<h4 id="heading-step-1-create-high-level-overview-panel">Step 1: Create High-Level Overview Panel</h4>
<p>At the top of the dashboard, include a high-level overview panel that summarizes the overall status of the pipeline. This could be:</p>
<ul>
<li><p><strong>Success/Failure Count</strong>: A simple stat panel showing the count of successful vs. failed runs.</p>
</li>
<li><p><strong>Pipeline Health Status</strong>: Display an overall health check of your pipeline using color-coded indicators (green for healthy, red for issues).</p>
</li>
</ul>
<h4 id="heading-step-2-set-up-drill-down-links">Step 2: Set Up Drill-Down Links</h4>
<p>To allow users to drill down from high-level information to detailed views:</p>
<p><strong>1. Link to detailed build information</strong>:</p>
<p>You can create a time series graph that shows build job durations. Add a link to a detailed log view when clicking on a failed job.</p>
<p>For example, when clicking a failed build, you can link to a detailed panel or a separate dashboard that shows the logs and error messages related to that specific run.</p>
<p><strong>2. Link to Logs in Loki</strong>:</p>
<p>You can use <strong>Loki's LogQL</strong> queries to set up a drill-down path. When users click on an error type or a specific job name, it should automatically filter logs for that job or error type.</p>
<p>You can set up drill-down interactions using panel links or data links in Grafana. In the panel settings, under <code>Links</code>, specify the link to another dashboard that shows detailed logs filtered by the job name or failure type.</p>
<h4 id="heading-step-3-implement-time-range-filters">Step 3: Implement Time Range Filters</h4>
<p>To enhance drill-down functionality, you can add a <strong>time range filter</strong> to allow users to adjust the time window for both logs and metrics. This enables them to zoom in on a specific time frame where failures occurred.</p>
<h3 id="heading-how-to-create-shared-dashboards-for-team-troubleshooting">How to Create Shared Dashboards for Team Troubleshooting</h3>
<h4 id="heading-step-1-share-your-dashboard">Step 1: Share Your Dashboard</h4>
<p>Once your dashboard is designed, you can share it with your team for collaborative troubleshooting:</p>
<p>First, you’ll want to make sure that the correct permissions are set up for your team. You can define specific roles in Grafana with access to the dashboard. Go to <code>Dashboard Settings</code> &gt; <code>Permissions</code>, and grant view or edit access to users or teams.</p>
<p>Next, you can directly share a link to the dashboard with your team members. Use the <code>Share</code> option in the top-right corner of the dashboard, which provides a direct URL and also options to embed the dashboard into other tools (for example, Slack, email).</p>
<p>You can also use <strong>template variables</strong> to allow users to filter and adjust the dashboard for different pipeline runs or environments. For example, add a variable for <code>build_id</code>, <code>job_name</code>, or <code>branch_name</code> that allows users to select specific builds or branches for more granular troubleshooting.</p>
<h4 id="heading-step-2-set-up-alerting">Step 2: Set Up Alerting</h4>
<p>To ensure your team is notified of any pipeline failures, you can set up <strong>alerting rules</strong>. There are a few important ones you’ll want to set up.</p>
<p>First, create alerts for critical issues, like when a pipeline fails or exceeds expected resource usage. This could be for things like build time exceeding a threshold or failure of a deployment stage.</p>
<p>Grafana can send alerts via various channels such as Slack, email, or webhook.</p>
<p>You can also integrate your dashboards with tools like Slack or Teams for real-time notifications and collaboration. Set up automated messages for your team when the dashboard indicates an issue.</p>
<h3 id="heading-how-to-create-automated-diagnostic-tools"><strong>How to Create Automated Diagnostic Tools</strong></h3>
<h4 id="heading-building-scripts-that-collect-relevant-logs-during-failures">Building Scripts that Collect Relevant Logs During Failures</h4>
<p>To automate log collection during failures, you need scripts that can capture logs from different CI/CD stages and services as soon as a failure is detected. Here are the steps you can follow to do this:</p>
<p><strong>1. Write Failure Detection Script:</strong></p>
<p>You can leverage the exit status codes of your CI/CD tools to detect failures. For example, in GitLab CI/CD or GitHub Actions, you can check if the last command failed by inspecting <code>$?</code> in Unix-based systems.</p>
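<pre><code class="lang-bash"><span class="hljs-comment"># Example for GitLab CI/CD: run this check immediately after the command,</span>
<span class="hljs-comment"># because $? only holds the exit status of the most recent command</span>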
<span class="hljs-keyword">if</span> [ $? -ne 0 ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failure detected, collecting logs..."</span>
    <span class="hljs-comment"># Custom log collection script call</span>
    ./collect_logs.sh
<span class="hljs-keyword">fi</span>
</code></pre>
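<p>The same exit-status check can be written in Python if your pipeline steps are driven by a script. This is a minimal sketch. The <code>run_step</code> function and its <code>on_failure</code> callback are hypothetical names, and the callback stands in for whatever log collection you use.</p>
<pre><code class="lang-python">import subprocess


def run_step(command, on_failure=None):
    """Run one pipeline command and call on_failure if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0 and on_failure is not None:
        # Pass the exit code and stderr so the handler can decide what to collect
        on_failure(result.returncode, result.stderr)
    return result.returncode
</code></pre>
<p>You could call it as <code>run_step(["npm", "run", "test"], on_failure=my_handler)</code>, where <code>my_handler</code> is your own function that gathers logs.</p>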
<p><strong>2. Log Collection Script (</strong><code>collect_logs.sh</code><strong>):</strong></p>
<p>The script should collect relevant logs, system metrics, and trace information. For instance:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
LOG_DIR=<span class="hljs-string">"/path/to/logs"</span>
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=<span class="hljs-string">"<span class="hljs-variable">${LOG_DIR}</span>/backup/<span class="hljs-variable">${TIMESTAMP}</span>"</span>
mkdir -p <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>

<span class="hljs-comment"># Collect logs from CI/CD agents, containers, or system logs</span>
cp /var/<span class="hljs-built_in">log</span>/ci_cd/*.<span class="hljs-built_in">log</span> <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>/
cp /path/to/docker_logs/*.<span class="hljs-built_in">log</span> <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>/
<span class="hljs-comment"># Collect metrics or traces from monitoring systems if needed</span>
</code></pre>
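<p>If you prefer Python for this step, the sketch below does the same job as the shell script: it copies <code>*.log</code> files into a timestamped backup folder. The function name <code>collect_logs</code> and its arguments are illustrative, and the directories you pass in are up to you.</p>
<pre><code class="lang-python">import shutil
from datetime import datetime
from pathlib import Path


def collect_logs(source_dirs, log_dir):
    """Copy every *.log file from source_dirs into a timestamped backup folder."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = Path(log_dir) / "backup" / stamp
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        source_path = Path(source)
        if not source_path.is_dir():
            # Skip sources that do not exist on this runner
            continue
        for log_file in sorted(source_path.glob("*.log")):
            # Files with the same name from different sources overwrite each other
            target = backup_dir / log_file.name
            shutil.copy2(log_file, target)
            copied.append(target)
    return backup_dir, copied
</code></pre>
<p>The function returns the backup folder and the list of copied files, so a later step can upload them or print where they went.</p>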
<p><strong>3. Use CI/CD Artifacts:</strong></p>
<p>For platforms like GitLab, GitHub Actions, or Jenkins, you can upload logs as artifacts for further investigation. Configure these platforms to save logs in case of a failure.</p>
<p>Here’s an example for GitHub Actions:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">Tests</span>
    <span class="hljs-attr">run:</span> <span class="hljs-string">|
      npm run test
</span>  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">logs</span> <span class="hljs-string">if</span> <span class="hljs-string">test</span> <span class="hljs-string">fails</span>
    <span class="hljs-attr">if:</span> <span class="hljs-string">failure()</span>
    <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-artifact@v4</span>
    <span class="hljs-attr">with:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">test-logs</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">/path/to/test/logs</span>
</code></pre>
<p><strong>4. Centralized Logging:</strong></p>
<p>Instead of manually collecting logs, you can centralize log storage using logging systems like Grafana Loki, ELK stack, or even cloud-based solutions. This will ensure that logs are accessible even if they are overwritten or lost on individual systems.</p>
<h3 id="heading-how-to-implement-automatic-analysis-of-common-error-patterns">How to Implement Automatic Analysis of Common Error Patterns</h3>
<p>Once logs are collected, you can automate the analysis process by defining common error patterns and automatically searching for them in your logs.</p>
<h4 id="heading-step-1-define-error-patterns">Step 1: Define Error Patterns:</h4>
<p>Establish error signatures or patterns that are common in your CI/CD process, such as failed builds due to missing dependencies, permission issues, or network timeouts.</p>
<p>You can use regular expressions (regex) to capture these patterns. For example, define a regex for failed test patterns:</p>
<pre><code class="lang-bash">TEST_FAILURE_REGEX=<span class="hljs-string">".*FAILURE.*"</span>
</code></pre>
<h4 id="heading-step-2-create-log-analysis-script">Step 2: Create Log Analysis Script:</h4>
<p>Next, you can write a script that scans logs for these common patterns. The script could then categorize or flag errors.</p>
<p>Here’s an example using <code>grep</code> to detect failure patterns:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
LOG_DIR=<span class="hljs-string">"/path/to/logs"</span>
<span class="hljs-comment"># Use a non-.log name so grep never reads its own report through the *.log glob</span>
ERROR_LOG=<span class="hljs-string">"<span class="hljs-variable">${LOG_DIR}</span>/error_patterns.txt"</span>
<span class="hljs-comment"># Start from an empty report so matches from earlier runs do not trigger alerts</span>
: &gt; <span class="hljs-string">"<span class="hljs-variable">$ERROR_LOG</span>"</span>

<span class="hljs-comment"># Define error patterns to search for</span>
ERROR_PATTERNS=(<span class="hljs-string">"FAILURE"</span> <span class="hljs-string">"ERROR"</span> <span class="hljs-string">"TIMEOUT"</span>)

<span class="hljs-keyword">for</span> PATTERN <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${ERROR_PATTERNS[@]}</span>"</span>; <span class="hljs-keyword">do</span>
    grep -i -- <span class="hljs-string">"<span class="hljs-variable">$PATTERN</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_DIR</span>"</span>/*.<span class="hljs-built_in">log</span> &gt;&gt; <span class="hljs-string">"<span class="hljs-variable">$ERROR_LOG</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-keyword">if</span> [ -s <span class="hljs-variable">$ERROR_LOG</span> ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error patterns found, review the log file."</span>
<span class="hljs-keyword">fi</span>
</code></pre>
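<p>A Python version of the scan makes it easier to count matches per pattern, which is useful for categorizing errors. This is a minimal sketch using the same three patterns. The name <code>scan_lines</code> is hypothetical, and the patterns are treated as literal text rather than full regexes.</p>
<pre><code class="lang-python">import re
from collections import Counter

ERROR_PATTERNS = ("FAILURE", "ERROR", "TIMEOUT")


def scan_lines(lines, patterns=ERROR_PATTERNS):
    """Count how many lines match each pattern, ignoring case."""
    compiled = {name: re.compile(re.escape(name), re.IGNORECASE) for name in patterns}
    counts = Counter()
    for line in lines:
        for name, regex in compiled.items():
            if regex.search(line):
                # A line that matches several patterns is counted once for each
                counts[name] += 1
    return counts
</code></pre>
<p>You can feed it an open file, for example <code>scan_lines(open("build.log"))</code>, and then alert only when a particular count crosses a limit you choose.</p>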
<h4 id="heading-step-3-automate-alerting">Step 3: Automate Alerting:</h4>
<p>Once an error pattern is detected, you can integrate the log analysis script with your alerting system (for example, sending an email or Slack notification).</p>
<p>Here’s an example of sending a Slack notification:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> [ -s <span class="hljs-variable">$ERROR_LOG</span> ]; <span class="hljs-keyword">then</span>
    curl -X POST -H <span class="hljs-string">'Content-type: application/json'</span> \
         --data <span class="hljs-string">'{"text":"Error detected in CI pipeline. Check error log."}'</span> \
         https://hooks.slack.com/services/YOUR_SLACK_WEBHOOK_URL
<span class="hljs-keyword">fi</span>
</code></pre>
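<p>If your analysis script is written in Python, you can send the same notification without shelling out to <code>curl</code>. The sketch below uses only the standard library. The function names are hypothetical, and the webhook URL is whatever Slack issued for your workspace.</p>
<pre><code class="lang-python">import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(webhook_url, text, timeout=10):
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
</code></pre>
<p>Keeping the request builder separate from the sender means you can check the payload in a test without making a network call.</p>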
<h4 id="heading-step-4-use-observability-tools-for-pattern-recognition">Step 4: Use Observability Tools for Pattern Recognition:</h4>
<p>Leverage observability tools (Grafana Loki, Prometheus) that support log querying and visualization. You can create dashboards that automatically detect anomalies like high failure rates or recurring errors.</p>
<p>Example: Set up a Grafana dashboard with alert rules based on log frequency.</p>
<h3 id="heading-how-to-create-self-healing-pipelines-based-on-known-issues">How to Create Self-Healing Pipelines Based on Known Issues</h3>
<p>Self-healing pipelines can automatically address issues when they are detected by executing pre-defined corrective actions. Let’s walk through how you can set one up.</p>
<h4 id="heading-step-1-define-common-failures-and-solutions">Step 1: Define Common Failures and Solutions:</h4>
<p>Identify recurring issues (for example, dependency issues, build timeouts, flaky tests) that occur in your pipeline. Then, define self-healing actions to mitigate these issues.</p>
<p>Here’s an example of automatically retrying a failed step if it is a known flaky test:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">Tests</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">tests</span>  <span class="hljs-comment"># lets later steps read steps.tests.outcome</span>
        <span class="hljs-attr">continue-on-error:</span> <span class="hljs-literal">true</span>  <span class="hljs-comment"># keep the job running so the retry step can execute</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          npm run test
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retry</span> <span class="hljs-string">Tests</span> <span class="hljs-string">if</span> <span class="hljs-string">Failed</span>
        <span class="hljs-attr">if:</span> <span class="hljs-string">steps.tests.outcome</span> <span class="hljs-string">==</span> <span class="hljs-string">'failure'</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Retrying tests..."
          npm run test</span>
</code></pre>
<h4 id="heading-step-2-automatic-rollbacks">Step 2: Automatic Rollbacks:</h4>
<p>Set up a rollback process for failed deployments. For instance, if a deployment to production fails, the pipeline can automatically revert to the last successful build.</p>
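<p>Example in GitLab CI/CD, where a separate job runs only if a job in an earlier stage fails (<code>rollback.sh</code> is a placeholder for your own rollback script):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">deploy_production:</span>
  <span class="hljs-attr">stage:</span> <span class="hljs-string">deploy</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./deploy.sh</span>
  <span class="hljs-attr">retry:</span> <span class="hljs-number">2</span>  <span class="hljs-comment"># GitLab allows at most 2 retries</span>
<span class="hljs-attr">rollback_production:</span>
  <span class="hljs-attr">stage:</span> <span class="hljs-string">.post</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./rollback.sh</span>
  <span class="hljs-attr">when:</span> <span class="hljs-string">on_failure</span>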
</code></pre>
<h4 id="heading-step-3-build-self-healing-logic-using-retry-mechanisms">Step 3: Build Self-Healing Logic Using Retry Mechanisms:</h4>
<p>Implement retry logic for transient issues (like network glitches) that often cause failures.</p>
<p>Example of retrying a step in GitHub Actions:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retry</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">run:</span> <span class="hljs-string">|
      attempts=0
      max_attempts=3
      until [ $attempts -ge $max_attempts ]
      do
        deploy_script &amp;&amp; break
        attempts=$((attempts+1))
        echo "Attempt $attempts of $max_attempts failed."
        sleep 5
      done
      # Fail the step if every attempt failed
      [ "$attempts" -lt "$max_attempts" ]</span>
</code></pre>
<h4 id="heading-step-4-automate-corrective-actions-for-dependency-issues">Step 4: Automate Corrective Actions for Dependency Issues:</h4>
<p>Set up automatic fixes for dependency-related failures, like clearing caches or re-installing dependencies:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> grep -qi <span class="hljs-string">"dependency not found"</span> error.log; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Dependency issue detected, clearing the cache and reinstalling dependencies..."</span>
    npm cache clean --force
    npm install
<span class="hljs-keyword">fi</span>
</code></pre>
<h4 id="heading-step-5-integrate-with-self-healing-services">Step 5: Integrate with Self-Healing Services:</h4>
<p>For more complex self-healing, you can integrate configuration management tools like Ansible or Puppet, or write custom scripts that auto-patch common configuration issues.</p>
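<p>As a rough sketch of what such a custom script could look like, the function below maps a log line to the name of a corrective action. The patterns and action names are hypothetical placeholders, not part of any tool, so you would replace them with the failures you actually see in your own pipeline.</p>
<pre><code class="lang-bash"># Map a single log line to a corrective action name.
# Patterns and action names are illustrative assumptions.
classify_failure() {
    local line="$1"
    case "$line" in
        *"dependency not found"*) echo "reinstall-dependencies" ;;
        *"ETIMEDOUT"*|*"connection reset"*) echo "retry-step" ;;
        *"No space left on device"*) echo "clean-workspace" ;;
        *) echo "escalate" ;;  # unknown failure: hand over to a human
    esac
}
</code></pre>
<p>You could call it with the last line of a failed step's log, for example <code>classify_failure "$(tail -n 1 error.log)"</code>, and branch on the result. Keeping an explicit fallback such as <code>escalate</code> means unknown failures are never silently "healed".</p>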
<h2 id="heading-how-to-conduct-effective-postmortems-using-logs">How to Conduct Effective Postmortems Using Logs</h2>
<p>Logs are often the single most valuable resource when reconstructing what went wrong in a CI/CD pipeline. Conducting effective postmortems with log data allows teams to extract clear timelines, pinpoint root causes, and define steps to prevent recurrence – all based on concrete evidence.</p>
<h3 id="heading-extract-timeline-and-key-events-from-the-logs">Extract Timeline and Key Events from the Logs</h3>
<p>To work out what happened, and when, from your logs, you can follow a straightforward process.</p>
<h4 id="heading-step-1-centralize-and-structure-logs">Step 1: Centralize and Structure Logs:</h4>
<p>First, make sure that the logs from all pipeline stages (build, test, deploy) are aggregated in a central place like Grafana Loki, ELK, or OpenSearch.</p>
<p>And you’ll want to use a consistent log format (like structured JSON) that includes timestamps, log levels, pipeline stage identifiers, and correlation/request IDs.</p>
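<p>A minimal sketch of such a format, assuming you emit logs from shell steps: the helper below prints one JSON object per line. The field names (<code>stage</code>, <code>correlation_id</code>) are assumptions for illustration, so align them with whatever schema your log backend expects.</p>
<pre><code class="lang-bash"># Emit one structured JSON log line to stdout.
# Field names are assumptions; values are not JSON-escaped here,
# so avoid double quotes and backslashes in them.
log_json() {
    local level="$1" stage="$2" correlation_id="$3" message="$4"
    local timestamp
    timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf '{"timestamp":"%s","level":"%s","stage":"%s","correlation_id":"%s","message":"%s"}\n' \
        "$timestamp" "$level" "$stage" "$correlation_id" "$message"
}
</code></pre>
<p>For example, <code>log_json error test run-4821 "unit tests failed"</code> produces a single line that a log shipper can parse without custom regexes. UTC timestamps in this format also sort correctly as plain text.</p>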
<h4 id="heading-step-2-build-a-chronological-view">Step 2: Build a Chronological View:</h4>
<p>You can use timestamp filters in your log UI (for example, Kibana, Grafana Explore) to isolate logs from the incident timeframe.</p>
<p>Look for key lifecycle events, like:</p>
<ul>
<li><p>Start and completion of pipeline steps</p>
</li>
<li><p>Status changes (for example, "test failed", "deployment started", "build queued")</p>
</li>
<li><p>Error messages and warnings</p>
</li>
<li><p>Retry events or unexpected restarts</p>
</li>
</ul>
<h4 id="heading-step-3-extract-logs-programmatically-optional">Step 3: Extract Logs Programmatically (optional):</h4>
<p>Use queries (LogQL, Elasticsearch DSL) to export relevant logs for analysis or inclusion in a post-mortem document.</p>
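<p>Here is one way this could look with Loki's HTTP API, as a sketch rather than a definitive script. It assumes your logs carry a <code>job</code> label and that a <code>LOKI_URL</code> variable points at your Loki instance. Both are assumptions, as are the function names.</p>
<pre><code class="lang-bash"># Build a LogQL query selecting lines that contain a given text for one job.
# The "job" label is an assumption; use the labels your pipeline sets.
build_logql() {
    local job="$1" needle="$2"
    printf '{job="%s"} |= "%s"\n' "$job" "$needle"
}

# Export matching lines for a time window via Loki's query_range endpoint.
# start and end are nanosecond Unix epoch values.
export_logs() {
    local job="$1" needle="$2" start="$3" end="$4"
    curl -sS -G "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
        --data-urlencode "query=$(build_logql "$job" "$needle")" \
        --data-urlencode "start=$start" \
        --data-urlencode "end=$end" \
        --data-urlencode "limit=1000"
}
</code></pre>
<p>You might run <code>export_logs ci-pipeline error 1718000000000000000 1718003600000000000 | tee incident-logs.json</code> and attach the resulting file to the post-mortem document.</p>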
<h3 id="heading-how-to-identify-root-causes-through-log-analysis">How to Identify Root Causes Through Log Analysis</h3>
<p>To go beyond symptoms and find the real issue, there are various steps you can take.</p>
<p>Start by <strong>looking for the first failure</strong>. You can filter logs by <code>level=error</code> or use log pattern matching to identify the <em>earliest</em> sign of failure. Then trace backward from the failure using correlation IDs or pipeline step identifiers.</p>
<p>Second, make sure you <strong>correlate logs across systems.</strong> Match logs across CI/CD tools (like GitHub Actions → Docker logs → Kubernetes logs). You can use shared correlation IDs or job IDs to group logs from related events.</p>
<p>Next, <strong>pay attention to intermittent signals.</strong> Warnings, retries, or degraded performance preceding the failure may reveal environmental or configuration-related issues.</p>
<p>And finally, <strong>check for external dependencies.</strong> Look for timeout or connection errors involving third-party services, cloud APIs, or internal infrastructure components.</p>
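<p>If you have exported plain-text logs, the first two steps can be sketched with standard Unix tools. This example assumes every line starts with an ISO 8601 UTC timestamp and contains <code>level=...</code> and <code>correlation_id=...</code> fields. Those field names are assumptions about your log format.</p>
<pre><code class="lang-bash"># Print the earliest error line across the given log files.
# ISO 8601 UTC timestamps at the start of each line sort lexicographically.
first_error() {
    grep -h 'level=error' "$@" | LC_ALL=C sort | head -n 1
}

# Print every line from any file that shares a correlation ID, in time order.
# -F treats the pattern as a fixed string, -w avoids matching run-1 inside run-12.
trace_request() {
    local correlation_id="$1"
    shift
    grep -hFw "correlation_id=$correlation_id" "$@" | LC_ALL=C sort
}
</code></pre>
<p>Running <code>first_error logs/*.log</code> gives you the starting point, and <code>trace_request run-4821 logs/*.log</code> then shows everything that happened to that run across systems.</p>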
<h3 id="heading-how-to-create-actionable-follow-ups-to-prevent-recurrence"><strong>How to Create Actionable Follow-Ups to Prevent Recurrence</strong></h3>
<p>There are various things you can do to turn your findings into meaningful process improvements.</p>
<p><strong>1. Document the Findings Clearly:</strong></p>
<p>Create a structured post-mortem doc that includes:</p>
<ul>
<li><p>Timeline of events with log excerpts</p>
</li>
<li><p>Immediate trigger and root cause (based on logs)</p>
</li>
<li><p>Impact summary and affected components</p>
</li>
<li><p>Screenshots or saved log queries for reference</p>
</li>
</ul>
<p><strong>2. Define Preventive Actions:</strong></p>
<p>Examples include:</p>
<ul>
<li><p>Adding missing alerts or log-based monitors</p>
</li>
<li><p>Improving log verbosity or adding missing metadata</p>
</li>
<li><p>Fixing brittle test cases or deployment scripts</p>
</li>
<li><p>Updating infrastructure limits or retry strategies</p>
</li>
</ul>
<p><strong>3. Assign Ownership and Deadlines:</strong></p>
<p>Each action item should have a responsible owner and a due date. If applicable, create automated tests or guardrails to catch similar issues in the future.</p>
<p><strong>4. Update Runbooks and Incident Playbooks:</strong></p>
<p>Add log patterns, example queries, and resolutions to shared documentation. This ensures the next person facing a similar issue can act faster.</p>
<p><strong>Pro Tip:</strong> Automate part of your post-mortem process by tagging logs from failed CI runs, exporting them to a shared location, and pre-generating dashboards or incident reports. This reduces manual effort and increases consistency.</p>
<h2 id="heading-how-to-optimize-log-storage-and-management"><strong>How to Optimize Log Storage and Management</strong></h2>
<p>As your CI/CD system grows, logs can become massive, consuming storage and impacting performance. Optimizing log storage helps you make sure that you're retaining what's valuable while staying efficient.</p>
<h3 id="heading-how-to-implement-log-rotation-and-retention-policies">How to Implement Log Rotation and Retention Policies</h3>
<p>Without rotation and retention, logs will pile up endlessly, leading to disk space exhaustion and poor performance. You can help prevent this with <strong>log rotation</strong>.</p>
<p>Log rotation involves creating new log files after a size or time threshold and archiving or deleting old ones.</p>
<p><strong>Linux logrotate tool</strong> – Configure <code>/etc/logrotate.d/&lt;your-app&gt;</code>:</p>
<pre><code class="lang-plaintext">/var/log/ci_cd/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0640 root adm
}
</code></pre>
<p>This example:</p>
<ul>
<li><p>Rotates daily</p>
</li>
<li><p>Keeps 7 days of logs</p>
</li>
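<li><p>Compresses old logs to save space</p>
</li>
<li><p>Skips missing or empty log files without raising an error (<code>missingok</code>, <code>notifempty</code>)</p>
</li>
<li><p>Creates each new log file with mode <code>0640</code>, owned by user <code>root</code> and group <code>adm</code></p>
</li>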
</ul>
<p><strong>Docker logs rotation</strong> – in <code>daemon.json</code>:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"log-driver"</span>: <span class="hljs-string">"json-file"</span>,
  <span class="hljs-attr">"log-opts"</span>: {
    <span class="hljs-attr">"max-size"</span>: <span class="hljs-string">"50m"</span>,
    <span class="hljs-attr">"max-file"</span>: <span class="hljs-string">"5"</span>
  }
}
</code></pre>
<p>Retention policies ensure that old logs are automatically deleted based on age or storage usage.</p>
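<p>You can set one up in Loki like this. Note that this uses the Table Manager, which newer Loki releases deprecate in favor of compactor-based retention (<code>compactor.retention_enabled</code> together with <code>limits_config.retention_period</code>), so check the documentation for your version:</p>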
<pre><code class="lang-yaml"><span class="hljs-attr">table_manager:</span>
  <span class="hljs-attr">retention_deletes_enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">retention_period:</span> <span class="hljs-string">168h</span>  <span class="hljs-comment"># 7 days</span>
</code></pre>
<p>Or in Elasticsearch, use Index Lifecycle Management (ILM):</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"policy"</span>: {
    <span class="hljs-attr">"phases"</span>: {
      <span class="hljs-attr">"hot"</span>: {
        <span class="hljs-attr">"actions"</span>: {
          <span class="hljs-attr">"rollover"</span>: { <span class="hljs-attr">"max_age"</span>: <span class="hljs-string">"3d"</span>, <span class="hljs-attr">"max_size"</span>: <span class="hljs-string">"1gb"</span> }
        }
      },
      <span class="hljs-attr">"delete"</span>: {
        <span class="hljs-attr">"min_age"</span>: <span class="hljs-string">"7d"</span>,
        <span class="hljs-attr">"actions"</span>: { <span class="hljs-attr">"delete"</span>: {} }
      }
    }
  }
}
</code></pre>
<h3 id="heading-how-to-set-up-log-compaction-for-long-term-storage">How to Set Up Log Compaction for Long-Term Storage</h3>
<p>Compaction reduces redundancy and keeps only critical log info, which is ideal for long-term audits or analytics.</p>
<h4 id="heading-compaction-techniques">Compaction Techniques:</h4>
<p>There are several compaction techniques you can try. Here are three:</p>
<p><strong>1. Loki (boltdb-shipper mode)</strong>:</p>
<ul>
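<li><p>Loki's compactor merges the many small index files that ingesters ship into a single index per day, which reduces storage and speeds up queries.</p>
</li>
<li><p>Configure the index store in <code>loki-config.yaml</code> (boltdb-shipper requires a 24h index period):</p>
<pre><code class="lang-yaml">  <span class="hljs-attr">schema_config:</span>
    <span class="hljs-attr">configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span> <span class="hljs-number">2023-01-01</span>
        <span class="hljs-attr">store:</span> <span class="hljs-string">boltdb-shipper</span>
        <span class="hljs-attr">object_store:</span> <span class="hljs-string">filesystem</span>
        <span class="hljs-attr">schema:</span> <span class="hljs-string">v11</span>
        <span class="hljs-attr">index:</span>
          <span class="hljs-attr">prefix:</span> <span class="hljs-string">index_</span>
          <span class="hljs-attr">period:</span> <span class="hljs-string">24h</span>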
</code></pre>
</li>
<li><p>Use a low-retention, high-compaction strategy for archived logs.</p>
</li>
</ul>
<p><strong>2. Elasticsearch</strong>:</p>
<ul>
<li><p>Use <strong>rollup jobs</strong> to reduce resolution of old data.</p>
</li>
<li><p>Rollups store summarized data instead of raw logs, for example, hourly counts of similar events.</p>
</li>
</ul>
<p><strong>3. Archive to cheaper storage</strong>:</p>
<ul>
<li>Move infrequent-access logs to S3 or Azure Blob Storage using lifecycle rules.</li>
</ul>
<h3 id="heading-how-to-balance-observability-with-resource-constraints">How to Balance Observability with Resource Constraints</h3>
<p>More logs mean more observability, but also more cost and overhead, so you need to strike a balance. These strategies can help:</p>
<ol>
<li><p><strong>Log at appropriate levels</strong>:</p>
<ul>
<li><p>Avoid excessive <code>debug</code> or <code>trace</code> logs in production.</p>
</li>
<li><p>Use <code>info</code> and <code>warn</code> levels judiciously.</p>
</li>
<li><p>Only use <code>error</code> or <code>critical</code> for actionable failures.</p>
</li>
</ul>
</li>
<li><p><strong>Sample logs</strong>:</p>
<ul>
<li><p>If high-volume pipelines generate repetitive logs, enable log sampling to reduce duplicates.</p>
</li>
<li><p>Tools like Vector or Fluent Bit support sampling.</p>
</li>
</ul>
</li>
<li><p><strong>Filter out noise</strong>:</p>
<ul>
<li>Use log filters to exclude non-critical logs before they reach the central system.</li>
</ul>
</li>
<li><p><strong>Separate hot vs. cold logs</strong>:</p>
<ul>
<li><p><strong>Hot logs</strong>: recent, real-time data for active debugging.</p>
</li>
<li><p><strong>Cold logs</strong>: archived for compliance, stored with lower performance/storage priority.</p>
</li>
</ul>
</li>
<li><p><strong>Compress everything</strong>:</p>
<ul>
<li><p>Use gzip/zstd compression for both stored and transmitted logs.</p>
</li>
<li><p>Loki, Elasticsearch, and Vector support compression out of the box.</p>
</li>
</ul>
</li>
</ol>
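<p>To make the filtering and sampling ideas concrete, here is a small sketch using standard Unix tools. It assumes each line carries a <code>level=...</code> field, and the function names are made up for illustration. Dedicated shippers like Vector or Fluent Bit offer their own configuration for this, which is not shown here.</p>
<pre><code class="lang-bash"># Drop debug and trace lines from stdin and pass everything else through.
# Assumes a level=... field on each line; adjust the pattern to your format.
drop_noise() {
    grep -v -E 'level=(debug|trace)'
}

# Keep every Nth line from stdin (simple deterministic sampling).
# With n=1 every line is kept.
sample_every() {
    local n="$1"
    awk -v n="$n" 'NR % n == 1 || n == 1'
}
</code></pre>
<p>For example, <code>cat build.log | drop_noise | sample_every 10</code> forwards roughly a tenth of the remaining lines. In practice you would usually sample only high-volume <code>info</code> lines and keep every warning and error.</p>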
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In this handbook, you have built a full-stack observability layer specifically optimized for CI/CD pipelines without breaking your infrastructure budget. You now have the tools and know-how to:</p>
<ul>
<li><p>Deploy Grafana Loki or a lightweight ELK alternative to capture structured logs from all parts of your pipeline.</p>
</li>
<li><p>Unify and enrich logs across CI/CD tools (for example, GitHub Actions, Jenkins, GitLab) using consistent formats and correlation IDs.</p>
</li>
<li><p>Use powerful log queries (LogQL, Kibana Query Language) to diagnose build failures, flaky tests, and deployment issues with precision.</p>
</li>
<li><p>Correlate logs with metrics and traces to gain deep, contextual visibility into pipeline behavior.</p>
</li>
<li><p>Design reusable debugging dashboards and automation that turn raw logs into insights and action.</p>
</li>
<li><p>Build a culture of shared troubleshooting knowledge through post-mortems, runbooks, and log-driven retrospectives.</p>
</li>
</ul>
<p>To see the full-stack observability layer in action, check out the complete code and configurations in my GitHub repository: <a target="_blank" href="https://github.com/Emidowojo/CICDObservability.git">github.com/Emidowojo/CICDObservability</a>. This repo includes all the setups for Grafana Loki, OpenTelemetry, Prometheus, and more, so you can deploy and explore the entire pipeline observability stack.</p>
<h3 id="heading-next-steps-for-advanced-observability-implementation">Next Steps for Advanced Observability Implementation</h3>
<p>Here’s how you can take your setup even further:</p>
<ol>
<li><p><strong>Fully integrate distributed tracing</strong>: Deploy OpenTelemetry agents across your build and deployment stages. This will help you visualize how code, builds, and deployments flow across systems in real time.</p>
</li>
<li><p><strong>Automate diagnostic scripts and alerts</strong>: Build scripts to auto-collect logs and metrics on failure, and trigger alerts when known patterns recur. This enables faster detection and even self-healing pipelines.</p>
</li>
<li><p><strong>Scale and harden your log infrastructure</strong>: As usage grows, implement log retention, compaction, and storage policies. Explore scalable backends like ClickHouse or object storage (e.g., S3) for long-term archiving.</p>
</li>
<li><p><strong>Train your team on observability best practices</strong>: Share dashboards, create onboarding docs, and schedule log-analysis sessions to build team familiarity with your tools and practices.</p>
</li>
</ol>
<h3 id="heading-resources-for-continued-learning">📚 Resources for Continued Learning</h3>
<p><strong>Official Docs and Tools:</strong></p>
<ul>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/">Grafana Loki Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/latest/clients/promtail/">Promtail Configuration Guide</a></p>
</li>
<li><p><a target="_blank" href="https://opentelemetry.io/docs/">OpenTelemetry</a></p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/latest/logql/">LogQL Syntax</a></p>
</li>
<li><p><a target="_blank" href="https://www.elastic.co/guide/en/kibana/current/kuery-query.html">Kibana Query Language</a></p>
</li>
<li><p><a target="_blank" href="https://vector.dev/docs/">Vector (log forwarding)</a></p>
</li>
</ul>
<p><strong>Communities:</strong></p>
<ul>
<li><p><a target="_blank" href="https://www.reddit.com/r/devops/">r/devops on Reddit</a></p>
</li>
<li><p><a target="_blank" href="https://slack.cncf.io/">CNCF Slack – #observability channel</a></p>
</li>
<li><p><a target="_blank" href="https://stackoverflow.com/questions/tagged/logging">Log Management Best Practices on Stack Overflow</a></p>
</li>
</ul>
<p>By investing in observability early and thoughtfully, you not only reduce the time to detect and resolve issues, you also build a more resilient, predictable, and transparent delivery process for your entire engineering team.</p>
<p>I hope this comes in handy for you someday. If you made it to the end of this handbook, thanks for reading! You can connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> or on X <a target="_blank" href="https://x.com/Emidowojo">@Emidowojo</a> if you’d like to stay in touch.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A Beginner's Guide to Observability in Cloud Native Applications ]]>
                </title>
                <description>
                    <![CDATA[ If you're new to cloud native technologies, you may have heard the term 'observability' before. But what exactly does it mean? Is it simply the ability to observe? And if so, what are we observing and why? I had the same questions when I started lear... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/observability-in-cloud-native-applications/</link>
                <guid isPermaLink="false">67e2d66c64d44185d5a6d406</guid>
                
                    <category>
                        <![CDATA[ otlp resource attributes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloud native applications ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #prometheus ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Otel ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Victoria Nduka ]]>
                </dc:creator>
                <pubDate>Tue, 25 Mar 2025 16:14:36 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742917070693/fa372981-fb20-4230-bd9f-43b7255b8ced.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you're new to cloud native technologies, you may have heard the term 'observability' before. But what exactly does it mean? Is it simply the ability to observe? And if so, what are we observing and why?</p>
<p>I had the same questions when I started learning about cloud-native technologies. In this article, I'll share my understanding of core observability concepts, introduce essential observability tools, and share insights from a related project I’m working on.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-my-introduction-to-cloud-native-technologies">My Introduction to Cloud Native Technologies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-observability">What is Observability?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-types-of-observability-data">Types of Observability Data</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-metrics">1. Metrics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-logs">2. Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-traces">3. Traces</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-observability-tools">Observability Tools</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-prometheus">Prometheus</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-opentelemetry">OpenTelemetry</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-are-otlp-resource-attributes">What are OTLP Resource Attributes?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-importance-of-otlp-resource-attributes">Importance of OTLP Resource Attributes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-my-project-work-fits-into-all-this">How My Project Work Fits into All This</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-additional-resources">Additional Resources</a></p>
</li>
</ul>
<h2 id="heading-my-introduction-to-cloud-native-technologies">My Introduction to Cloud Native Technologies</h2>
<p>I recently got selected as a mentee for the Linux Foundation Mentorship to work on the <a target="_blank" href="https://mentorship.lfx.linuxfoundation.org/project/36e3f336-ce78-4074-b833-012015eb59be">CNCF - Prometheus project</a>. The project is UX-focused, and for the next few months, I'll be working with my mentors to understand how users expect to use OpenTelemetry Protocol (OTLP) Resource Attributes in Prometheus.</p>
<p>That's quite a mouthful, I know. I was overwhelmed at first, and honestly, I’m still figuring it out. This is my third week, and although I still have a lot to learn—given that I had no knowledge of cloud native technologies when I applied for this internship—I've already learned quite a bit.</p>
<p>As I often do, I intend to document what I learn through articles to help reinforce concepts in my memory and serve as a resource for other newcomers who may find themselves grappling with these technical terms in the future. You know what they say: you can't say you've understood something until you're able to explain it to someone else who's also new to the topic.</p>
<h2 id="heading-what-is-observability">What is Observability?</h2>
<p>First, I had to learn what the unfamiliar terms meant—and there were a lot of them flying around. OpenTelemetry. Prometheus. Resource attributes. I’ve come to understand that these terms fall under one umbrella: Observability. Let's start there.</p>
<p>Let’s use a food delivery app to illustrate. When someone orders food, a lot happens behind the scenes:</p>
<ul>
<li><p>The app connects to different services (restaurants, payments, delivery)</p>
</li>
<li><p>Data flows between different systems to process the order, assign a driver, and track delivery</p>
</li>
</ul>
<p>Engineers need to monitor all the processes to ensure everything works smoothly. Are orders taking too long to process? Is the payment system failing? Does the app suddenly crash under load? Which part of the system is causing delays?</p>
<p>To answer these questions, engineers <strong>instrument</strong> their code. This means that they configure it to send back real-time data about the state, performance, and behavior of the application. This practice of understanding what's happening inside a complex system based on the data it generates is known as <strong>Observability</strong>.</p>
<p>You can see the process illustrated in the image below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742335445056/5fe7bb0b-bdf9-4f52-a2c1-7f2977411c6c.png" alt="A flowchart diagram titled &quot;Visual flow of observability data&quot; showing how data moves through a food delivery application system. The flow starts with a User who orders food from a Food App. The Food App connects to three services (Restaurant, Payment, and Delivery). All these components send data to OpenTelemetry, which collects three types of data: Metrics, Logs, and Traces. OpenTelemetry then forwards only the Metrics data to Prometheus, which stores metrics." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>In the above flowchart diagram, you can see how data might move through a food delivery application system. The flow starts with a User who orders food from a Food App. The Food App connects to three services (Restaurant, Payment, and Delivery). All these components send data to OpenTelemetry, which collects three types of data: Metrics, Logs, and Traces. OpenTelemetry then forwards only the Metrics data to Prometheus, which stores metrics.</p>
<h2 id="heading-types-of-observability-data">Types of Observability Data</h2>
<p>There are three key types of data that systems generate for observability:</p>
<h3 id="heading-1-metrics"><strong>1. Metrics</strong></h3>
<p>Metrics are numerical measurements collected over time that represent the state or performance of your system. Examples in a food delivery app would be the number of orders processed per minute, average order processing time in milliseconds, number of active users or delivery drivers, and so on.</p>
<h3 id="heading-2-logs"><strong>2. Logs</strong></h3>
<p>Logs are text-based records of discrete events that occur within your application. Logs for our food delivery app would look something like this:</p>
<pre><code class="lang-http"><span class="hljs-attribute">ERROR</span>: Payment failed for order #12345 - Credit card declined
<span class="hljs-attribute">INFO</span>: Driver #789 assigned to order #12345
</code></pre>
<h3 id="heading-3-traces"><strong>3. Traces</strong></h3>
<p>Traces track the entire lifecycle of a request as it moves through different services in a system. They help engineers see how different components interact and identify bottlenecks in complex, distributed systems.</p>
<p>For example, in our food delivery app, a single order request might go through the following steps:<br><code>User places an order</code> → <code>Request sent to restaurant system</code> → <code>Payment processor verifies payment</code> → <code>Delivery system assigns a driver</code> → <code>User receives confirmation</code>.</p>
<p>Each step in this journey is recorded as part of a trace. This helps engineers pinpoint where delays occur and optimize the system for better performance.</p>
<p>Observability relies on metrics, logs, and traces working together to provide full system visibility. Metrics tell you something is wrong (“Error rate increased by 5%”). Logs tell you why it happened (“Payment failed due to invalid card details”). Traces show exactly where it happened (“Delay in restaurant service response”).</p>
<h2 id="heading-observability-tools"><strong>Observability Tools</strong></h2>
<p>Observability tools give you visibility into what’s going on within your application. There are a lot of them, but for the purpose of this article, we’ll talk about two: Prometheus and OpenTelemetry. </p>
<h3 id="heading-prometheus"><strong>Prometheus</strong></h3>
<p><a target="_blank" href="https://prometheus.io/">Prometheus</a> is an open-source monitoring and alerting toolkit. It does two things:</p>
<ul>
<li><p>Collects data from applications, specifically metrics (remember the data types we talked about earlier)</p>
</li>
<li><p>Stores that data in a time-series database</p>
</li>
</ul>
<p>A time-series database is a database specifically designed to handle measurements or events that occur over time.</p>
<p>Prometheus uses what's called a <strong>pull-based model</strong> to collect metrics from applications. Pull-based means Prometheus actively requests (pulls) data from services at regular intervals. Think of it like refreshing a webpage to get the latest content.</p>
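<p>To give a feel for what Prometheus receives when it pulls, here is a small sketch that prints two metrics in the Prometheus text exposition format. An application would serve text like this over HTTP (conventionally at a <code>/metrics</code> path), and Prometheus would fetch it on every scrape. The metric names and the function are made up for the food delivery example.</p>
<pre><code class="lang-bash"># Print two metrics in the Prometheus text exposition format.
# Metric names and values are invented for the food delivery example.
render_metrics() {
    local orders="$1" active_drivers="$2"
    printf '# HELP orders_processed_total Total number of orders processed.\n'
    printf '# TYPE orders_processed_total counter\n'
    printf 'orders_processed_total %s\n' "$orders"
    printf '# HELP active_drivers Number of drivers currently online.\n'
    printf '# TYPE active_drivers gauge\n'
    printf 'active_drivers %s\n' "$active_drivers"
}
</code></pre>
<p>Calling <code>render_metrics 1542 37</code> prints six lines: a description, a type, and a current value for each metric. Prometheus stores each value together with the time it was scraped, which is how the time series builds up.</p>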
<h3 id="heading-opentelemetry"><strong>OpenTelemetry</strong></h3>
<p><a target="_blank" href="https://opentelemetry.io/">OpenTelemetry (OTel)</a> collects, processes, and exports observability data. Unlike Prometheus, which mainly focuses on metrics, OpenTelemetry provides a standardized way to instrument applications for all three types of observability data: logs, metrics, and traces.</p>
<p>OpenTelemetry is designed to be vendor-agnostic. This means you can instrument your applications once with OpenTelemetry and then send that telemetry data to any supported observability backend, which could be an open-source solution like Jaeger or Prometheus, or commercial platforms like Datadog, New Relic, Dynatrace, or Honeycomb.</p>
<p>So, for example, you can use OpenTelemetry to instrument your application – and then Prometheus can pull metrics from OpenTelemetry while other tools handle logs and traces.</p>
<h2 id="heading-what-are-otlp-resource-attributes"><strong>What are OTLP Resource Attributes?</strong></h2>
<p>When OpenTelemetry collects data from applications, it does more than just gather raw telemetry data. It also provides context about that data. This context comes in the form of <strong>resource attributes</strong>, which describe where the data came from and what it relates to.</p>
<p>The 'resource' is the component (or entity) producing the data, while the 'attributes' are specific details about that resource.</p>
<p>Resource attributes are structured as pairs of information:</p>
<ul>
<li><p>The "key" is the name or identifier of the attribute (like <code>service.name</code> or <code>host.id</code>)</p>
</li>
<li><p>The "value" is the specific information for that attribute (like <code>payment-service</code> or <code>server-123</code>)</p>
</li>
</ul>
<p>Together, these key-value pairs identify and describe the specific component that's generating the observability data.</p>
<p>For example, if a payment processing service is sending metrics about transaction times, the resource attributes might include:</p>
<ul>
<li><p><code>service.name: "payment-service"</code></p>
</li>
<li><p><code>service.version: "1.2.3"</code></p>
</li>
<li><p><code>deployment.environment: "production"</code></p>
</li>
</ul>
<p>These attributes tell you exactly which service, which version, and in which environment the data is coming from, providing context for interpreting the metrics, logs, or traces.</p>
<p>Resource attributes are not arbitrary. OpenTelemetry provides a standardized set of attribute names and formats that everyone should follow, similar to having an agreed-upon language for describing services and their properties.</p>
<p>For example, OpenTelemetry specifies that you should use <code>service.name</code> (not <code>app_name</code> or <code>service_id</code>) to identify your service. They've created these standardized naming conventions (called <a target="_blank" href="https://opentelemetry.io/docs/concepts/semantic-conventions/">semantic conventions</a>) so that:</p>
<ol>
<li><p>All tools in the ecosystem can understand the same attributes</p>
</li>
<li><p>Engineers across different companies use consistent terminology</p>
</li>
<li><p>Observability data can be easily shared between different systems</p>
</li>
</ol>
<p>You can still create your own custom attributes when you need something specific (like <code>payment.provider</code> for a payment service), but using the standard attributes whenever possible means your telemetry data will work better with existing tools and be more easily understood by other engineers.</p>
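<p>As a rough illustration of the split between standard and custom attributes, here is a minimal Python sketch. The <code>KNOWN_SEMCONV_KEYS</code> set is a small hand-picked sample rather than the full semantic conventions list, and <code>split_attributes</code> and the <code>example-provider</code> value are hypothetical names made up for this example.</p>
<pre><code class="lang-python"># A small, hand-picked sample of semantic-convention keys (assumption:
# the real list is much longer and lives in the OpenTelemetry docs).
KNOWN_SEMCONV_KEYS = {
    "service.name",
    "service.version",
    "service.instance.id",
    "deployment.environment",
    "cloud.region",
    "host.id",
}


def split_attributes(attributes):
    """Return (standard, custom) dicts based on the sample key set above."""
    standard = {k: v for k, v in attributes.items() if k in KNOWN_SEMCONV_KEYS}
    custom = {k: v for k, v in attributes.items() if k not in KNOWN_SEMCONV_KEYS}
    return standard, custom


resource = {
    "service.name": "payment-service",
    "service.version": "1.2.3",
    "deployment.environment": "production",
    "payment.provider": "example-provider",  # custom attribute (hypothetical value)
}

standard, custom = split_attributes(resource)
print(sorted(standard))
print(custom)
</code></pre>
<p>A check like this could help a team spot where custom keys are being used in place of a standard one.</p>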
<h2 id="heading-importance-of-otlp-resource-attributes">Importance of OTLP Resource Attributes</h2>
<p>Let’s say engineers want to monitor how long food deliveries take and whether there are delays in specific locations. Without resource attributes, OpenTelemetry might simply collect and report this metric like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">delivery_time_seconds:</span> <span class="hljs-number">1800</span>
</code></pre>
<p>This tells us that a delivery took 1,800 seconds, or 30 minutes, but nothing else. That’s useful, but it lacks context. Where did this happen? Which service handled it? If there was a delay in delivery and engineers wanted to investigate the cause, this alone would not help.</p>
<p>With OpenTelemetry’s resource attributes, the metric becomes more meaningful:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">delivery_time_seconds:</span> <span class="hljs-number">1800</span>
<span class="hljs-attr">resource:</span>
  <span class="hljs-attr">service.name:</span> <span class="hljs-string">"delivery-service"</span>
  <span class="hljs-attr">service.instance.id:</span> <span class="hljs-string">"instance-456"</span>
  <span class="hljs-attr">cloud.region:</span> <span class="hljs-string">"ng-west-2"</span>
  <span class="hljs-attr">deployment.environment:</span> <span class="hljs-string">"production"</span>
  <span class="hljs-attr">customer.city:</span> <span class="hljs-string">"Lagos"</span>
  <span class="hljs-attr">restaurant.id:</span> <span class="hljs-string">"rest-789"</span>
</code></pre>
<p>This tells us:</p>
<ul>
<li><p>The data came from the delivery service.</p>
</li>
<li><p>The instance handling the request is "instance-456".</p>
</li>
<li><p>It’s running in the ng-west-2 cloud region.</p>
</li>
<li><p>The environment is production (not testing or staging), and so on.</p>
</li>
</ul>
<p>Now, engineers can answer more specific questions:</p>
<ul>
<li><p>Are deliveries slower in certain cities? (Filter by <code>customer.city</code>)</p>
</li>
<li><p>Are certain restaurants taking longer to prepare food? (Filter by <code>restaurant.id</code>)</p>
</li>
<li><p>Are delays only happening in a specific cloud region? (Filter by <code>cloud.region</code>)</p>
</li>
<li><p>Are issues only happening in production or also in staging? (Filter by <code>deployment.environment</code>)</p>
</li>
</ul>
<p>When issues arise, resource attributes allow engineers to quickly narrow down the source of problems. Rather than investigating every service, they can filter by specific attributes to focus their efforts.</p>
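<p>Here is a small Python sketch of that kind of filtering. The data points and the <code>average_by</code> helper are hypothetical and only meant to show the idea of grouping a metric by one resource attribute. A real backend would do this with its own query language.</p>
<pre><code class="lang-python">from statistics import mean

# Hypothetical data points: each pairs a measurement with resource attributes.
data_points = [
    {"delivery_time_seconds": 1800, "resource": {"customer.city": "Lagos", "cloud.region": "ng-west-2"}},
    {"delivery_time_seconds": 2400, "resource": {"customer.city": "Lagos", "cloud.region": "ng-west-2"}},
    {"delivery_time_seconds": 900, "resource": {"customer.city": "Abuja", "cloud.region": "ng-west-2"}},
]


def average_by(points, attribute):
    """Group delivery times by one resource attribute and average each group."""
    groups = {}
    for point in points:
        key = point["resource"].get(attribute, "unknown")
        groups.setdefault(key, []).append(point["delivery_time_seconds"])
    return {key: mean(values) for key, values in groups.items()}


averages = average_by(data_points, "customer.city")
print(averages)
</code></pre>
<p>Swapping <code>"customer.city"</code> for <code>"cloud.region"</code> answers a different question without changing how the data was collected.</p>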
<h2 id="heading-how-my-project-work-fits-into-all-this"><strong>How My Project Work Fits into All This</strong></h2>
<p>Many engineers use OpenTelemetry for data collection and then send metrics to Prometheus for storage, querying, and analysis.</p>
<p>But Prometheus does not natively support resource attributes in the same way as OpenTelemetry. Instead, it relies on labels to organize metrics. Since Prometheus traditionally has its own labeling system for metrics, integrating OpenTelemetry's resource attributes creates interesting UX challenges.</p>
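<p>One small example of the mismatch: attribute names like <code>service.name</code> contain dots, which the classic Prometheus label-name rules do not allow. The Python sketch below assumes the common convention of replacing unsupported characters with underscores. <code>to_prometheus_label</code> is a hypothetical helper, not part of either project, and it ignores edge cases such as names that start with a digit.</p>
<pre><code class="lang-python">import re


def to_prometheus_label(attribute_name):
    """Replace every character outside [a-zA-Z0-9_] with an underscore."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", attribute_name)


resource = {"service.name": "delivery-service", "cloud.region": "ng-west-2"}
labels = {to_prometheus_label(name): value for name, value in resource.items()}
print(labels)
</code></pre>
<p>Only the names change here. The values are carried over as they are.</p>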
<p>One key challenge is the <strong>cardinality</strong> explosion. Cardinality refers to the number of unique combinations of label values (or dimensions) that a metric can have. A "cardinality explosion" occurs when you add labels with many possible values. OpenTelemetry often includes many detailed attributes that, if directly converted to Prometheus labels, would create an overwhelming number of time series. This can slow down Prometheus dramatically or even cause it to crash.</p>
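<p>The arithmetic behind this is simple multiplication. The sketch below uses made-up counts of distinct values per label to show how a single extra label can multiply the worst-case number of time series. The numbers and the <code>worst_case_series</code> helper are assumptions for illustration only.</p>
<pre><code class="lang-python">from math import prod

# Hypothetical counts of distinct values per label.
distinct_values = {
    "service_name": 20,
    "deployment_environment": 3,
    "cloud_region": 5,
}


def worst_case_series(label_counts):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_counts.values())


before = worst_case_series(distinct_values)

# Adding one high-cardinality label, such as a per-instance ID, multiplies the total.
after = worst_case_series({**distinct_values, "service_instance_id": 500})

print(before, after)
</code></pre>
<p>In practice not every combination occurs, so this is an upper bound, but it shows why labels with many possible values are the ones to watch.</p>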
<p>The existing solution avoids this by keeping resource attributes off the individual series. Instead, they are exposed as labels on a separate <code>target_info</code> metric. While this prevents the cardinality explosion, it makes querying cumbersome. Users have to write PromQL join operations against <code>target_info</code> to filter or aggregate based on these attributes.</p>
<p>This approach is technically functional but creates a poor user experience. My research aims to understand how users mentally model the transition from OpenTelemetry's rich attribute system to Prometheus's more constrained label system.</p>
<p>The research goals are to:</p>
<ol>
<li><p>Understand how engineers currently use OpenTelemetry resource attributes with Prometheus</p>
</li>
<li><p>Identify pain points in the current integration between these systems</p>
</li>
<li><p>Discover user expectations for how resource attributes should be represented in Prometheus</p>
</li>
</ol>
<p>This work is particularly important as more organizations adopt OpenTelemetry as their instrumentation standard while continuing to use Prometheus for metrics monitoring. Creating a seamless experience between these two popular open-source projects will help improve the overall observability ecosystem.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Observability is essential for building reliable, performant cloud native systems. The tools and concepts we've explored – metrics, logs, traces, Prometheus, and OpenTelemetry – form the foundation of modern observability practices.</p>
<p>As I continue my mentorship program, I'll share more insights about how these technologies work together and try to break them down from the perspective of a first-time learner.</p>
<h2 id="heading-additional-resources">Additional Resources</h2>
<p>Learn more about:</p>
<ol>
<li><p><a target="_blank" href="https://opentelemetry.io/docs/">OpenTelemetry</a></p>
</li>
<li><p><a target="_blank" href="https://prometheus.io/docs/introduction/overview/">Prometheus</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/prometheus/prometheus/issues/15909">My UX research project</a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Automate Alert Provisioning with the SigNoz Terraform Provider ]]>
                </title>
                <description>
                    <![CDATA[ Modern infrastructure requires continuous monitoring and rapid incident response. However, manually configuring and managing alerts is not only labor-intensive but also susceptible to human error. Automating alert provisioning allows you to enforce c... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/automate-alert-provisioning-with-the-signoz-terraform-provider/</link>
                <guid isPermaLink="false">67d87353b13a6fd9fb559ada</guid>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ signoz ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Gursimar Singh ]]>
                </dc:creator>
                <pubDate>Mon, 17 Mar 2025 19:09:07 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742237716002/3e7d07f8-39f7-45ba-aac3-d421f61a8785.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Modern infrastructure requires continuous monitoring and rapid incident response. However, manually configuring and managing alerts is not only labor-intensive but also susceptible to human error.</p>
<p>Automating alert provisioning allows you to enforce consistency, secure sensitive credentials, and integrate monitoring into your deployment pipelines.</p>
<p>This guide dives deep into how you can use the SigNoz Terraform Provider to define and manage alert configurations as code, making your observability setup resilient and adaptable.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-why-automate-alert-provisioning">Why Automate Alert Provisioning?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-are-signoz-and-terraform">What are SigNoz and Terraform?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-overview-of-the-setup">Overview of the Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-steps-to-set-up-the-project">Steps to Set Up the Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices-and-security-considerations">Best Practices and Security Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-integrating-with-cicd-pipelines">Integrating with CI/CD Pipelines</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-customizations-and-troubleshooting">Advanced Customizations and Troubleshooting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-why-automate-alert-provisioning">Why Automate Alert Provisioning?</h2>
<p>It’s a good idea to automate your alert provisioning for various reasons.</p>
<p>First of all, configuring things manually often leads to discrepancies between environments (development, staging, production). Automating alerts ensures that all environments adhere to the same monitoring standards, reducing the likelihood of configuration drift and improving consistency and uniformity.</p>
<p>Also, when alerts are defined as code, every change is tracked in your version control system. This audit trail makes it easier to trace and review changes, collaborate with team members, and roll back configurations if issues arise.</p>
<p>Something else to consider is that as your infrastructure grows, manually managing alerts becomes unsustainable. Automation allows you to quickly and efficiently update your alerting rules across multiple services without the need for repetitive manual intervention.</p>
<p>Automation also improves security. Keeping sensitive information such as API tokens in environment variables or secret management systems, rather than in configuration files, reduces the risk of leaks. Automating the process also minimizes human exposure to critical credentials.</p>
<p>And finally, defining alerts as code enables you to integrate monitoring configurations into your CI/CD pipelines. This leads to continuous testing, validation, and deployment of alert rules alongside application updates.</p>
<p>So as you can see, there are many compelling reasons to go the automation route. Now let’s see how you can do this in practice.</p>
<h2 id="heading-what-are-signoz-and-terraform">What Are SigNoz and Terraform?</h2>
<p>SigNoz is an open-source observability platform designed to collect, analyze, and visualize metrics, logs, and traces from your applications. Its most helpful features include:</p>
<ul>
<li><p>Comprehensive monitoring: provides detailed insights into system performance, error rates, and user behavior.</p>
</li>
<li><p>Real-time analytics: enables proactive issue detection and performance optimization.</p>
</li>
<li><p>Community-driven development: as an open-source solution, it benefits from community contributions, transparency, and customization.</p>
</li>
<li><p>Cost-effectiveness: offers powerful observability capabilities without the hefty licensing fees of proprietary solutions.</p>
</li>
</ul>
<p>Terraform is an Infrastructure as Code (IaC) tool developed by HashiCorp. It allows you to define and provision infrastructure using declarative configuration files. Terraform’s core advantages include:</p>
<ul>
<li><p>Declarative syntax: you specify the desired state of your infrastructure, and Terraform handles the implementation.</p>
</li>
<li><p>Version control: configuration files can be managed in Git repositories, enabling traceability and rollback of changes.</p>
</li>
<li><p>Automation: facilitates automated provisioning and updates, reducing manual effort and errors.</p>
</li>
<li><p>Multi-cloud support: manages resources across different cloud providers with a consistent workflow.</p>
</li>
</ul>
<p>So you might be wondering: why should you use Terraform with SigNoz?</p>
<p>First of all, Terraform ensures that your infrastructure is managed consistently across different environments, reducing the risk of configuration drift. It also simplifies managing multiple alerts and resources, making it easier to scale your observability setup.</p>
<p>Beyond this, automating the provisioning process reduces manual setup efforts and minimizes the potential for human error.</p>
<p>And finally, Terraform configurations can be version-controlled, allowing teams to track changes over time and collaborate more effectively.</p>
<h2 id="heading-overview-of-the-setup">Overview of the Setup</h2>
<p>This setup utilizes the SigNoz Terraform Provider to manage alerts and notification channels within SigNoz Cloud. The configuration includes:</p>
<ul>
<li><p><strong>Provider configuration:</strong> Establishes the connection to SigNoz using the API endpoint and a securely provided API token.</p>
</li>
<li><p><strong>Notification channels:</strong> Defines where alerts are sent (for example, via email) to ensure the right teams are notified.</p>
</li>
<li><p><strong>Alert rules:</strong> Specifies the conditions under which alerts are triggered, including thresholds and evaluation windows.</p>
</li>
<li><p><strong>External variables:</strong> Enhances flexibility by allowing critical values (like CPU thresholds and email addresses) to be managed externally.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before diving into the setup, make sure you have the following:</p>
<ol>
<li><p><strong>SigNoz Cloud account</strong>: If you don't have one, sign up for SigNoz Cloud to host your observability data and configure alerts.</p>
</li>
<li><p><strong>Terraform installed</strong>: Install Terraform on your machine. Terraform is the tool you'll use to manage your infrastructure as code.</p>
</li>
<li><p><strong>SigNoz API token</strong>:</p>
<ul>
<li><p>Log in to your SigNoz Cloud dashboard.</p>
</li>
<li><p>Navigate to Settings &gt; API Tokens.</p>
</li>
<li><p>Click Generate API Token.</p>
</li>
<li><p>Copy the token, as you'll need it to authenticate Terraform with SigNoz.</p>
</li>
</ul>
</li>
<li><p><strong>Basic knowledge of Terraform</strong>: Familiarity with Terraform's syntax and concepts, including writing configuration files and running Terraform commands, is essential.</p>
</li>
<li><p><strong>Text editor</strong>: Use any code editor like Visual Studio Code or Sublime Text to write your Terraform configuration files.</p>
</li>
</ol>
<h2 id="heading-steps-to-set-up-the-project">Steps to Set Up the Project</h2>
<h3 id="heading-1-understand-the-signozalert-resource">1. Understand the <code>signoz_alert</code> Resource</h3>
<p>The <code>signoz_alert</code> resource allows you to create and manage alert rules in SigNoz via Terraform. It supports various alert types, conditions, and configurations. Understanding this resource is crucial as it forms the basis of your alert configuration.</p>
<h3 id="heading-2-set-up-your-terraform-configuration">2. Set Up Your Terraform Configuration</h3>
<p>Create a new directory for your Terraform configuration:</p>
<pre><code class="lang-bash">mkdir signoz-terraform
<span class="hljs-built_in">cd</span> signoz-terraform
</code></pre>
<p>Create a <code>main.tf</code> file with the following content:</p>
<pre><code class="lang-json">terraform {
  required_providers {
    signoz = {
      source  = <span class="hljs-attr">"SigNoz/signoz"</span>
      version = <span class="hljs-attr">"0.1.3"</span> # Use the latest version from the Terraform Registry
    }
  }
}

provider <span class="hljs-string">"signoz"</span> {
  endpoint  = <span class="hljs-attr">"https://api.us.signoz.cloud"</span> # Replace with your SigNoz Cloud API endpoint
  api_token = var.signoz_api_token
}

variable <span class="hljs-string">"signoz_api_token"</span> {
  type      = string
  sensitive = true # Redacts the token in plan and apply output
}
</code></pre>
<p>The <code>provider</code> block configures the SigNoz provider, where <code>endpoint</code> specifies the API endpoint and <code>api_token</code> is passed through a variable for security.</p>
<h3 id="heading-3-define-a-notification-channel-optional">3. Define a Notification Channel (Optional)</h3>
<p>If you plan to send alerts to specific channels, define them using <code>signoz_notification_channel</code>. For example, create a <code>channels.tf</code> file:</p>
<pre><code class="lang-json">resource <span class="hljs-string">"signoz_notification_channel"</span> <span class="hljs-string">"email_channel"</span> {
  name = <span class="hljs-attr">"Email Channel"</span>
  type = <span class="hljs-attr">"email"</span>

  receivers {
    email_config {
      to = [<span class="hljs-attr">"alerts@example.com"</span>]
    }
  }
}
</code></pre>
<p>Defining a notification channel ensures that alerts are sent to the correct recipients, enhancing the utility of your alerting system.</p>
<h3 id="heading-4-create-an-alert-using-the-signozalert-resource">4. Create an Alert Using the <code>signoz_alert</code> Resource</h3>
<p>Create an <code>alerts.tf</code> file to define your alert:</p>
<pre><code class="lang-json">resource <span class="hljs-string">"signoz_alert"</span> <span class="hljs-string">"cpu_high_usage"</span> {
  alert            = <span class="hljs-attr">"High CPU Usage Alert"</span>
  alert_type       = <span class="hljs-attr">"METRIC_BASED_ALERT"</span>
  severity         = <span class="hljs-attr">"critical"</span>
  description      = <span class="hljs-attr">"Alert when CPU usage exceeds 80% over 5 minutes"</span>
  rule_type        = <span class="hljs-attr">"threshold_rule"</span>
  broadcast_to_all = false
  disabled         = false
  eval_window      = <span class="hljs-attr">"5m0s"</span>
  frequency        = <span class="hljs-attr">"1m0s"</span>
  version          = <span class="hljs-attr">"v4"</span>

  condition = jsonencode({
    compositeQuery = {
      builderQueries = {
        A = {
          aggregateOperator = <span class="hljs-attr">"avg"</span>
          dataSource        = <span class="hljs-attr">"metrics"</span>
          metricName        = <span class="hljs-attr">"cpu_usage_user"</span>
          reduceTo          = <span class="hljs-attr">"avg"</span>
          filters           = {
            items = []
            op    = <span class="hljs-attr">"AND"</span>
          }
          groupBy = []
        }
      }
      queryType = <span class="hljs-string">"builder"</span>
      panelType = <span class="hljs-string">"graph"</span>
      unit      = <span class="hljs-string">"%"</span>
    }
    op                = <span class="hljs-string">"&gt;"</span>
    target            = <span class="hljs-number">80</span>
    matchType         = <span class="hljs-string">"EQUALS"</span>
    selectedQueryName = <span class="hljs-string">"A"</span>
    targetUnit        = <span class="hljs-string">"%"</span>
  })

  preferred_channels = [signoz_notification_channel.email_channel.name]

  labels = {
    severity = <span class="hljs-attr">"critical"</span>
    team     = <span class="hljs-attr">"DevOps"</span>
  }
}
</code></pre>
<p>This configuration creates a high CPU usage alert with specific conditions and notifications. The <code>condition</code> parameter is crucial as it defines the alert triggering logic.</p>
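<p>To build some intuition for what a threshold rule does, here is a simplified Python sketch. It averages the samples in one evaluation window and compares the result with the target. This is only a mental model under my own assumptions, not how SigNoz actually evaluates rules: <code>should_fire</code> and the sample values are hypothetical, the comparison function stands in for the <code>op</code> field, and fields such as <code>matchType</code> are ignored.</p>
<pre><code class="lang-python">import operator
from statistics import mean


def should_fire(samples, target, compare=operator.gt):
    """Average the samples in the evaluation window and compare against the target."""
    if not samples:
        return False  # No data in the window: this sketch chooses not to fire.
    return compare(mean(samples), target)


# Hypothetical CPU usage percentages collected during one 5-minute window.
window = [78.0, 83.5, 86.0, 81.0, 84.5]

print(should_fire(window, target=80))
print(should_fire(window, target=90))
</code></pre>
<p>Note that one sample in the window is below 80, but the average is still above it. That is the effect of choosing an average rather than a minimum or maximum.</p>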
<h3 id="heading-5-provide-the-api-token">5. Provide the API Token</h3>
<p>Set the <code>signoz_api_token</code> as an environment variable:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> TF_VAR_signoz_api_token=<span class="hljs-string">"YOUR_SIGNOZ_API_TOKEN"</span>
</code></pre>
<p>Terraform reads environment variables prefixed with <code>TF_VAR_</code> as input variable values, so this supplies <code>signoz_api_token</code> without hardcoding the token in your configuration files.</p>
<h3 id="heading-6-initialize-terraform">6. Initialize Terraform</h3>
<p>Run:</p>
<pre><code class="lang-bash">terraform init
</code></pre>
<p>This command initializes your Terraform working directory and downloads the required provider plugins.</p>
<h3 id="heading-7-review-the-execution-plan">7. Review the Execution Plan</h3>
<p>Generate the execution plan:</p>
<pre><code class="lang-bash">terraform plan
</code></pre>
<p>This step previews the changes Terraform will make, allowing you to verify the configuration before applying it.</p>
<h3 id="heading-8-apply-the-configuration">8. Apply the Configuration</h3>
<p>Apply the changes:</p>
<pre><code class="lang-bash">terraform apply
</code></pre>
<p>Type <code>yes</code> when prompted. This command applies the configuration, creating or updating resources as specified.</p>
<h3 id="heading-9-verify-the-alert-in-signoz-cloud">9. Verify the Alert in SigNoz Cloud</h3>
<p>To do this, follow these steps:</p>
<ul>
<li><p>Log in to your SigNoz Cloud dashboard.</p>
</li>
<li><p>Navigate to Alerts.</p>
</li>
<li><p>Confirm that the "High CPU Usage Alert" is listed.</p>
</li>
<li><p>Click on the alert to view its details and ensure it matches your configuration.</p>
</li>
</ul>
<h3 id="heading-10-modify-the-alert-optional">10. Modify the Alert (Optional)</h3>
<p>To change the CPU usage threshold to 75%, follow these steps:</p>
<ul>
<li><p>Update the target in <code>alerts.tf</code>:</p>
<pre><code class="lang-json">  target = <span class="hljs-number">75</span>
</code></pre>
</li>
<li><p>Apply the changes:</p>
<pre><code class="lang-bash">  terraform apply
</code></pre>
</li>
</ul>
<h3 id="heading-11-destroy-the-resources-optional">11. Destroy the Resources (Optional)</h3>
<p>To remove the alert and notification channel:</p>
<pre><code class="lang-bash">terraform destroy
</code></pre>
<p>Type <code>yes</code> to confirm. This command will delete the resources created by Terraform.</p>
<h2 id="heading-best-practices-and-security-considerations">Best Practices and Security Considerations</h2>
<p>In modern infrastructure automation, robust best practices and security measures are paramount.</p>
<h3 id="heading-use-version-pinning">Use Version Pinning</h3>
<p>To ensure your alert provisioning remains reliable and maintainable, start by controlling the provider version strictly. Avoid open-ended version constraints and instead specify an exact version number. This keeps your infrastructure configuration consistent and predictable.</p>
<p>By pinning your provider version (for example, <code>version = "0.1.3"</code> instead of <code>version = "&gt;= 0.1.3"</code>), you avoid unexpected behavior caused by upstream changes. This practice is critical for long-term stability, especially when your infrastructure scales across multiple environments.</p>
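<p>The difference between the two constraints can be sketched in a few lines of Python. This is a simplified model and not Terraform's own constraint solver: it only handles plain dotted numeric versions, and the list of available versions is hypothetical.</p>
<pre><code class="lang-python">import operator


def parse_version(text):
    """Turn '0.1.3' into (0, 1, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfying(available, compare, pinned):
    """Return the versions for which compare(version, pinned) holds."""
    target = parse_version(pinned)
    return [v for v in available if compare(parse_version(v), target)]


# Hypothetical list of published provider versions.
available = ["0.1.2", "0.1.3", "0.2.0", "1.0.0"]

exact = satisfying(available, operator.eq, "0.1.3")     # exact pin
at_least = satisfying(available, operator.ge, "0.1.3")  # minimum-only constraint

print(exact)
print(at_least)
</code></pre>
<p>With the exact pin there is only one candidate. With the minimum-only constraint, any newer release can be selected the next time the provider is upgraded.</p>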
<h3 id="heading-externalize-credentials"><strong>Externalize Credentials</strong></h3>
<p>Security is non-negotiable. Instead of embedding sensitive details like API tokens in your codebase, leverage environment variables or dedicated secret management tools such as HashiCorp Vault or AWS Secrets Manager.</p>
<p>For instance, storing your SigNoz API token as an environment variable (<code>TF_VAR_signoz_api_token</code>) not only mitigates the risk of credential exposure but also simplifies credential rotation. Also, enforce access control policies around your configuration repositories and CI/CD pipelines to further secure these secrets.</p>
<h3 id="heading-use-version-control"><strong>Use Version Control</strong></h3>
<p>A mature setup also demands rigorous infrastructure version control. Hosting your Terraform configuration in a Git repository with branch protection and code review policies allows you to track changes meticulously, roll back problematic updates, and maintain an audit trail. This traceability is essential when troubleshooting issues or validating compliance during audits.</p>
<p>You should also document your configuration decisions extensively—explain why a particular CPU threshold was chosen or why specific labels (like severity and team) are used. Such documentation becomes invaluable for onboarding new team members or when revisiting configurations months later.</p>
<h2 id="heading-integrating-with-cicd-pipelines">Integrating with CI/CD Pipelines</h2>
<p>Integrating Terraform with your CI/CD pipeline is a cornerstone of a modern, automated deployment strategy. A well-architected pipeline not only validates your infrastructure changes but also ensures that your alerting rules remain in sync with your evolving application environment.</p>
<p>Continuous Integration (CI) involves automatically merging code changes into a shared repository and running automated tests on each commit. In practice, running <code>terraform plan</code> as part of your pull request workflow provides early feedback, catching misconfigurations before they reach production. For instance, a GitHub Actions workflow can automatically check your changes:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">Terraform</span> <span class="hljs-string">CI/CD</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">branches:</span> [<span class="hljs-string">main</span>]
  <span class="hljs-attr">pull_request:</span>
    <span class="hljs-attr">branches:</span> [<span class="hljs-string">main</span>]

<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">terraform:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">Repository</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Terraform</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">hashicorp/setup-terraform@v2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Terraform</span> <span class="hljs-string">Init</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">terraform</span> <span class="hljs-string">init</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Terraform</span> <span class="hljs-string">Plan</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">terraform</span> <span class="hljs-string">plan</span> <span class="hljs-string">-no-color</span>
        <span class="hljs-attr">env:</span>
          <span class="hljs-attr">TF_VAR_signoz_api_token:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.SIGNOZ_API_TOKEN</span> <span class="hljs-string">}}</span>
</code></pre>
<p>This workflow uses GitHub secrets to securely manage your API tokens while validating the configuration changes. Continuous Delivery (CD) takes this further by automating deployments. Once your plan is approved, an automated Terraform apply step (often scheduled during off-peak hours or coordinated with application deployments) ensures smooth, coordinated rollouts.</p>
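<p>One way to support the approval step is to summarize what a plan would change. If you save a plan with <code>terraform plan -out=tfplan</code>, running <code>terraform show -json tfplan</code> prints it as JSON, including a <code>resource_changes</code> list. The Python sketch below counts the planned actions from that structure. The <code>summarize_plan</code> helper and the hand-written sample are assumptions for illustration, and a real plan contains many more fields.</p>
<pre><code class="lang-python">import json
from collections import Counter


def summarize_plan(plan):
    """Count planned actions per type from Terraform's JSON plan representation."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        for action in change["change"]["actions"]:
            counts[action] += 1
    return dict(counts)


# A trimmed, hand-written sample shaped like `terraform show -json` output.
sample = json.loads("""
{
  "resource_changes": [
    {"address": "signoz_alert.cpu_high_usage", "change": {"actions": ["update"]}},
    {"address": "signoz_notification_channel.email_channel", "change": {"actions": ["no-op"]}}
  ]
}
""")

summary = summarize_plan(sample)
print(summary)
</code></pre>
<p>A pipeline could, for example, require manual approval whenever the summary contains a <code>delete</code> action.</p>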
<p>Advanced pipelines can also include automated rollback mechanisms. For example, if a deployment triggers an anomaly, scripts can automatically revert to a previous version using your version control history—minimizing downtime and reinforcing the feedback loop between application performance and infrastructure configuration.</p>
<h2 id="heading-advanced-customizations-and-troubleshooting">Advanced Customizations and Troubleshooting</h2>
<p>As your observability requirements evolve, you may need to implement advanced customizations. One powerful approach is using multi-metric composite alerts. Instead of triggering an alert on a single threshold, you can design rules that combine multiple conditions—for example, firing only when both CPU usage and memory consumption exceed critical levels. This nuanced alerting minimizes false positives and ensures alerts are issued only during genuine performance issues.</p>
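<p>The logic of such a composite rule can be sketched in Python as an "all conditions must hold" check. This is only a conceptual model. The metric names, thresholds, and the <code>composite_should_fire</code> helper are hypothetical, and this is not how you would express the rule in SigNoz itself.</p>
<pre><code class="lang-python">import operator


def composite_should_fire(readings, thresholds):
    """Fire only when every monitored metric exceeds its own threshold."""
    return all(operator.gt(readings[name], limit) for name, limit in thresholds.items())


# Hypothetical thresholds and readings, all in percent.
thresholds = {"cpu_percent": 80, "memory_percent": 90}

cpu_only = {"cpu_percent": 92, "memory_percent": 71}
both_high = {"cpu_percent": 92, "memory_percent": 95}

print(composite_should_fire(cpu_only, thresholds))
print(composite_should_fire(both_high, thresholds))
</code></pre>
<p>A CPU spike on its own does not fire the alert in this model, which is how the composite rule cuts down on false positives.</p>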
<p>Terraform’s modular design is especially useful here. By creating reusable modules that encapsulate your alert configurations, you can parameterize key variables—such as thresholds, evaluation windows, and notification channels—across a microservices architecture. This modularity enforces consistency while simplifying management and scaling.</p>
<p>Troubleshooting advanced configurations starts with reviewing your <code>terraform plan</code> output to ensure every change aligns with expectations. If an alert isn’t triggering as expected, inspect the JSON structure generated by the <code>jsonencode</code> function. Even minor syntax errors can cause significant issues.</p>
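<p>A quick structural check on the encoded condition can catch some of these mistakes early. The Python sketch below assumes the condition has the same top-level keys as the example in this tutorial. The provider may accept or require other keys, so treat <code>REQUIRED_KEYS</code> and <code>check_condition</code> as hypothetical starting points rather than an official schema.</p>
<pre><code class="lang-python">import json

# Keys taken from the condition block in this tutorial (assumption: the
# provider may accept or require others).
REQUIRED_KEYS = {"compositeQuery", "op", "target", "matchType", "selectedQueryName"}


def check_condition(raw_json):
    """Return a list of problems found in an alert condition payload."""
    try:
        condition = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return ["invalid JSON: " + str(error)]
    if not isinstance(condition, dict):
        return ["top-level value must be a JSON object"]

    problems = ["missing key: " + key for key in sorted(REQUIRED_KEYS - set(condition))]

    composite = condition.get("compositeQuery")
    queries = composite.get("builderQueries", {}) if isinstance(composite, dict) else {}
    selected = condition.get("selectedQueryName")
    if selected is not None and selected not in queries:
        problems.append("selectedQueryName " + repr(selected) + " has no matching builder query")
    return problems


good = json.dumps({
    "compositeQuery": {"builderQueries": {"A": {}}},
    "op": "placeholder",  # only key presence is checked here
    "target": 80,
    "matchType": "EQUALS",
    "selectedQueryName": "A",
})
bad = json.dumps({
    "compositeQuery": {"builderQueries": {"A": {}}},
    "target": 80,
    "matchType": "EQUALS",
    "selectedQueryName": "B",
})

print(check_condition(good))
print(check_condition(bad))
</code></pre>
<p>The second payload is missing <code>op</code> and points at a query named <code>B</code> that does not exist, so the check reports both problems.</p>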
<p>When integrating with incident management tools like PagerDuty or Opsgenie, run comprehensive end-to-end tests in a staging environment. For example, deploy a test alert to a dedicated channel to verify that the complete alerting pipeline—from condition detection to incident escalation—is functioning correctly.</p>
<p>In one real-world scenario, a misconfigured composite query in an alert’s JSON payload led to intermittent failures. By enabling detailed provider logs and iteratively validating the JSON output, the issue was rapidly isolated and resolved. Such experiences underscore the importance of rigorous logging, validation, and testing in production-grade setups.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Automating alert provisioning is a transformative approach to managing observability in modern infrastructures.</p>
<p>By defining alerts and notification channels as code, you make your systems more consistent, scalable, and secure. You can set up uniform alert rules across all environments, quickly update and deploy monitoring configs, handle credentials securely, and automate CI/CD workflows that stay in sync with application changes.</p>
<p>I hope you’ve enjoyed this tutorial and have learned something new. I’m always open to suggestions and discussions on <a target="_blank" href="https://www.linkedin.com/in/gursimarsm">LinkedIn</a>. Feel free to send me a direct message.</p>
<p>If you’ve enjoyed my writing and want to keep me motivated, consider leaving stars on <a target="_blank" href="https://github.com/gursimarsm">GitHub</a> and endorsing me for relevant skills on <a target="_blank" href="https://www.linkedin.com/in/gursimarsm">LinkedIn</a>.</p>
<p>Till the next one, happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
