Opaluwa Emidowojo - freeCodeCamp.org

How to Debug Kubernetes Apps When Logs Fail You – An eBPF Tracing Handbook

Opaluwa Emidowojo — Tue, 16 Dec 2025 17:03:01 +0000

Let’s say your Kubernetes pod crashes at 3am and the logs show nothing useful. By the time you SSH into the node, the container is gone, and you're left guessing what happened in those final moments.

This is the reality of debugging modern applications. Traditional monitoring wasn't built for containers that live for seconds, services that shift across nodes, or network paths that change constantly.

eBPF changes this. It lets you see inside the kernel itself, watching every system call, every network packet, and every process execution – without modifying a single line of code.

In this tutorial, you will trace a real Kubernetes application using eBPF-powered gadgets from Inspektor Gadget. The fundamentals you learn here apply across the wider eBPF observability ecosystem.

By the end, you’ll be able to:

Trace requests as they move through your Kubernetes pods
Observe behavior at the kernel and syscall level
Debug failures that logs and metrics simply can’t explain

Prerequisites

Knowledge requirements:

Basic Kubernetes concepts: pods, deployments, services, namespaces
Familiarity with kubectl: get, describe, logs, exec
Container basics
Basic Linux concepts: processes, system calls

Technical requirements:

Kubernetes cluster (local or cloud-based)
kubectl installed and configured
Cluster admin permissions
Linux kernel 5.10+ (most managed services have this)

Table of Contents

Understanding eBPF Observability
How eBPF Tracing Works (Without Getting Lost in the Kernel)
How to Set Up Your Environment
How to Trace Your First Request: Hands-On Tutorial
How to Interpret Traces
Real-World Debugging Scenarios
Advanced Tracing Insights
Best Practices and Production Considerations
Next Steps and Resources

Understanding eBPF Observability

eBPF (extended Berkeley Packet Filter) is a technology that allows you to run custom programs inside the Linux kernel without changing kernel code or loading kernel modules.

The Linux kernel is the control center of your operating system. Historically, if you wanted to observe low-level activity (like network packets, system calls, or file operations), you had to rely on kernel changes or kernel modules. Both approaches were fragile, difficult to maintain, and carried real stability and security risks.

eBPF shifts how we approach observability. It provides a safe, sandboxed environment where you can run observability programs directly in the kernel with built-in safety checks that prevent crashes or security vulnerabilities.

Why does this matter for observability?

In traditional observability, you instrument your application code. You add logging statements, metrics libraries, and tracing SDKs. This works, but has significant limitations:

Code changes are required: You must modify and redeploy applications
It’s language-specific: Different languages need different libraries
There will likely be blind spots: You can only see what you explicitly instrument
The overhead: Heavy instrumentation slows down applications
Container challenges: By the time you add instrumentation and redeploy, the problem may have disappeared

eBPF takes a different approach. Instead of instrumenting applications, you instrument the kernel. Since every application ultimately makes system calls to the kernel for network I/O, file operations, and process management, you can observe everything from one vantage point.

The eBPF advantage for Kubernetes

Kubernetes adds another layer of complexity. Your application might be spread across multiple containers, pods, and nodes. Traditional APM (Application Performance Monitoring) tools struggle here because containers come and go rapidly, network topology changes constantly, service meshes add routing complexity, and you often don't control application code (think third-party services or legacy applications you can't modify).

eBPF doesn't care about any of this. It sees all activity at the kernel level, regardless of what language your app is written in, whether it's containerized, how many times the pod has been rescheduled, or whether you have access to modify the source code. This universal visibility is why the Cloud Native Computing Foundation (CNCF) and major cloud providers are betting heavily on eBPF for the future of observability.

How eBPF Tracing Works (Without Getting Lost in the Kernel)

When your application runs on Kubernetes, there's a clear separation between user space and kernel space. Your code runs in user space, where it's isolated, safe, and has limited access to system resources. To do anything useful – make network calls, read files, allocate memory – your application must ask the kernel for help. The kernel handles these requests via system calls, commonly called syscalls.

eBPF lets us hook into these syscalls with very little overhead. It’s like having a CCTV camera at every doorway between user space and kernel space, watching who passes through, when, and what they’re carrying.

A Simple Example: HTTP Request Tracing

Your application initiates an HTTP GET request, which needs to go through the network stack. To establish a connection, your application first makes a socket() system call to create a network socket. Then it calls connect() to establish a connection to the remote server. Once connected, it uses send() to transmit the HTTP request. Network packets are sent across the wire, and eventually your application calls recv() to receive the response.
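To make that syscall sequence concrete, here is a minimal Python sketch that performs the same steps by hand against a throwaway local server. The handler, the helper names (`fetch`, `demo`), and the loopback address are assumptions made for illustration. The comments mark the syscall each call typically maps to, although the exact syscall (for example `sendto` instead of `send`) can vary by platform and runtime.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local HTTP server so the example needs no network access."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(host: str, port: int) -> bytes:
    """Issue a plain HTTP GET using the raw socket API."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket()
    try:
        sock.connect((host, port))  # connect()
        request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
        sock.sendall(request)  # send() family
        chunks = []
        while True:
            data = sock.recv(4096)  # recv() family
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sock.close()  # close()


def demo() -> bytes:
    """Start the local server on a free port, fetch once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo().decode())
```

An eBPF tracer attached to these syscalls would see each step without the client or the server being modified.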

With eBPF tools like Inspektor Gadget's Traceloop, you can automatically hook into these syscalls. The eBPF program captures request metadata including source and destination IPs, ports, timing information, and payload sizes. You get a complete trace of the request without touching your application code.

The eBPF Execution Flow

Here's what happens under the hood when you deploy Inspektor Gadget and run a gadget. The gadget's eBPF program is loaded into the kernel, checked by the eBPF verifier, and attached to hooks such as syscall tracepoints. From that point on, it runs whenever a traced event occurs.

When your application makes a syscall, the eBPF hook triggers and quickly collects relevant data: timestamps, process IDs, container IDs, pod names, request details, and latency information. This data is sent to user space through eBPF maps, which are efficient data structures for kernel-to-userspace communication.

Inspektor Gadget adds Kubernetes context to raw kernel data. Instead of seeing only process IDs, you can see pod names, namespaces, labels, and other metadata. For example, you can tell that a request originated from the frontend pod in the production namespace and targeted the backend service.
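The enrichment step can be pictured as a lookup from a kernel-visible identifier to Kubernetes metadata. The sketch below is a simplified illustration, not Inspektor Gadget's actual implementation: the `KernelEvent`, `PodInfo`, and `enrich` names, the `container_key` field, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelEvent:
    timestamp_ns: int
    pid: int
    container_key: int  # hypothetical per-container identifier visible in the kernel
    syscall: str


@dataclass(frozen=True)
class PodInfo:
    namespace: str
    pod: str
    container: str


@dataclass(frozen=True)
class EnrichedEvent:
    event: KernelEvent
    pod: Optional[PodInfo]  # None when the process is not in a known container


def enrich(event: KernelEvent, containers: dict[int, PodInfo]) -> EnrichedEvent:
    """Attach Kubernetes metadata to a raw kernel event, if any is known."""
    return EnrichedEvent(event=event, pod=containers.get(event.container_key))


def describe(enriched: EnrichedEvent) -> str:
    """Render an event the way a human would want to read it."""
    e = enriched.event
    if enriched.pod is None:
        return f"pid={e.pid} {e.syscall} (no Kubernetes metadata)"
    p = enriched.pod
    return f"{p.namespace}/{p.pod} ({p.container}) pid={e.pid} {e.syscall}"


if __name__ == "__main__":
    known = {4026532501: PodInfo("production", "frontend-abc123", "frontend")}
    print(describe(enrich(KernelEvent(1, 1234, 4026532501, "connect"), known)))
    print(describe(enrich(KernelEvent(2, 99, 7, "read"), known)))
```

The point of the design is that the kernel side stays cheap (it records only small numeric identifiers), while the more expensive metadata lookup happens in user space.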

The gadget then presents this information in a format that's immediately useful, whether you're using the CLI or integrating with other observability tools.

eBPF is fast because:

JIT compilation: Programs are turned into native machine code for maximum performance
Event-driven: Only execute when relevant events occur, not continuously polling
Kernel-resident: No expensive context switching between kernel and user space
Highly optimized: Typically adds less than 5% overhead even under heavy load

The Tool: Inspektor Gadget & Traceloop

For this tutorial, we're using Traceloop, an eBPF-based tool that traces request flows through applications by observing syscalls, network calls, and I/O operations at the kernel level.

Why are we using Traceloop for this tutorial?

It’s quick to install and run (one command)
The output maps directly to the application’s behavior
It automatically adds Kubernetes context (pod names, namespaces)
You don’t need to make any application code changes

What you'll learn applies beyond Traceloop. Other eBPF tracing tools (Pixie, Cilium Hubble, Tetragon) rely on the same underlying mechanism: they attach to kernel hooks and collect event data. Once you understand the concepts here, you can pick up any eBPF observability tool more quickly.

How to Set Up Your Environment

To get your environment ready for hands-on tracing, we'll verify that your cluster meets the requirements, install Inspektor Gadget, and deploy a sample application to trace.

Verify that Your Cluster Meets the Requirements

Before installing anything, confirm that your Kubernetes cluster is ready for eBPF.

Check your Kubernetes version:

kubectl version

You need Kubernetes 1.19 or later. Most modern clusters exceed this requirement, but it's worth verifying.

Verify kernel version on your nodes:

kubectl get nodes -o wide

Then check the kernel version on one of your nodes:

# If using a local cluster like minikube or kind
uname -r

# For cloud clusters, you might need to check node details
kubectl debug node/<node-name> -it --image=ubuntu -- bash -c "uname -r"

You need Linux kernel 5.10 or later for the best eBPF support. Kernel 4.18+ works but with some limitations. If you're using a managed Kubernetes service (GKE, EKS, AKS), you almost certainly have a compatible kernel.
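If you have many nodes, you may want to check these thresholds in a script rather than by eye. The following is a minimal sketch that classifies a `uname -r` string against the 5.10 and 4.18 thresholds mentioned above. The function names and the three labels (`full`, `limited`, `below_minimum`) are made up for this example.

```python
import re


def parse_kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a `uname -r` string like '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def ebpf_support_level(release: str) -> str:
    """Classify a kernel release against the tutorial's version thresholds."""
    version = parse_kernel_version(release)
    if version >= (5, 10):
        return "full"
    if version >= (4, 18):
        return "limited"
    return "below_minimum"


if __name__ == "__main__":
    for release in ("6.1.0-18-amd64", "5.4.0-150-generic", "3.10.0-1160.el7"):
        print(release, "->", ebpf_support_level(release))
```

Comparing `(major, minor)` tuples avoids the classic mistake of comparing version strings, where "5.9" would sort after "5.10".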

Confirm that you have cluster admin permissions:

kubectl auth can-i create deployments --all-namespaces

This should return "yes". Inspektor Gadget needs elevated permissions to load eBPF programs into the kernel.

Install Inspektor Gadget

You can install Inspektor Gadget in several ways. We'll use the kubectl plugin method as it's the most straightforward for learning.

Install the kubectl gadget plugin:

# Download and install kubectl-gadget
kubectl krew install gadget

# Verify installation
kubectl gadget version

If you don't have krew (the kubectl plugin manager), you can install it first:

# Install krew
(
  set -x; cd "$(mktemp -d)" &&
  OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
  ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
  KREW="krew-${OS}_${ARCH}" &&
  curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
  tar zxvf "${KREW}.tar.gz" &&
  ./"${KREW}" install krew
)

# Add krew to your PATH
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"

Deploy Inspektor Gadget to your cluster:

kubectl gadget deploy

This creates a gadget namespace and deploys the Inspektor Gadget daemon as a DaemonSet, ensuring each node in your cluster can run eBPF programs.

Verify the deployment:

kubectl get pods -n gadget

You should see one gadget-* pod per node, all in the Running state. If a pod is stuck in Pending or CrashLoopBackOff, check that your kernel meets the version requirements.

Deploy a Sample Application

To learn tracing effectively, we need an application that does something interesting. We'll deploy a simple microservices application with multiple components so you can see traces flowing across service boundaries.

Start by creating a namespace for our demo app:

kubectl create namespace demo-app

Then save the following manifest as demo-app.yaml. It defines a simple web application with a backend:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: gcr.io/google-samples/microservices-demo/frontend:v0.8.0
        ports:
        - containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        - name: PRODUCT_CATALOG_SERVICE_ADDR
          value: "productcatalog:3550"
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
  namespace: demo-app
spec:
  type: LoadBalancer
  selector:
    app: frontend
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productcatalog
  namespace: demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: productcatalog
  template:
    metadata:
      labels:
        app: productcatalog
    spec:
      containers:
      - name: server
        image: gcr.io/google-samples/microservices-demo/productcatalogservice:v0.8.0
        ports:
        - containerPort: 3550
        env:
        - name: PORT
          value: "3550"
---
apiVersion: v1
kind: Service
metadata:
  name: productcatalog
  namespace: demo-app
spec:
  selector:
    app: productcatalog
  ports:
  - port: 3550
    targetPort: 3550

Apply the configuration:

kubectl apply -f demo-app.yaml

And wait for pods to be ready:

kubectl wait --for=condition=ready pod -l app=frontend -n demo-app --timeout=300s
kubectl wait --for=condition=ready pod -l app=productcatalog -n demo-app --timeout=300s

Then just verify that everything is running:

kubectl get pods -n demo-app

You should see both frontend and productcatalog pods in the Running state.

Now you’ll need to get the frontend URL:

# For local clusters (minikube, kind, Docker Desktop)
kubectl port-forward -n demo-app service/frontend 8080:80

# Then access http://localhost:8080 in your browser

# For cloud clusters
kubectl get service frontend -n demo-app
# Look for the EXTERNAL-IP

Visit the application in your browser to confirm it's working. You should see a simple e-commerce storefront. This application makes HTTP requests from the frontend to the product catalog service, which is perfect for tracing.

How to Trace Your First Request: Hands-On Tutorial

Now that everything is set up, let's capture our first trace and see eBPF observability in action.

Generate the Traffic to Trace

First, we need some application activity to observe. We will generate a few requests for our demo application.

In one terminal, start the Traceloop gadget:

kubectl gadget run traceloop:latest -n demo-app

This command starts recording the system calls made by containers in the demo-app namespace. Inspektor Gadget monitors the kernel to capture the syscalls and system events that occur while each request is processed.

In another terminal, generate some traffic:

# If using port-forward
curl http://localhost:8080

# If you have an external IP
curl http://<EXTERNAL-IP>

# Generate multiple requests
for i in {1..10}; do curl http://localhost:8080; sleep 1; done

Viewing Your First Trace

Switch back to the terminal running the Traceloop gadget. You should see output appearing as requests flow through your application. The output will look something like this:

NODE         NAMESPACE   POD              CONTAINER    PID    TYPE       COUNT  
minikube     demo-app    frontend-abc123  frontend     1234   loop       1      
minikube     demo-app    frontend-abc123  frontend     1234   loop       2

Each line shows a traced execution flow, with the count increasing as the same pattern is observed again.

We can make the output more interesting by filtering:

# Stop the previous trace with Ctrl+C, then run:
kubectl gadget run traceloop:latest -n demo-app --podname frontend

This narrows our observation to just the frontend pod, reducing noise and making patterns clearer.

Understanding what you're seeing:

Each column shows different information about your application:

NODE: Which Kubernetes node the traced event occurred on. In multi-node clusters, this helps you understand workload distribution and identify node-specific issues.
NAMESPACE: The Kubernetes namespace. We filtered to demo-app, so you'll only see that namespace. In production, filtering by namespace is crucial for focusing on specific applications.
POD: The specific pod where the event occurred. Each pod gets a unique name (like frontend-abc123), allowing you to distinguish between replicas of the same application.
CONTAINER: Which container within the pod. Pods can have multiple containers (main application, sidecars, init containers), so this helps you pinpoint exactly where activity is happening.
PID: The process ID inside the container. This is the actual Linux process that made the syscalls eBPF observed. Multiple PIDs might appear if your application uses multiple processes or threads.
TYPE: The type of event traced. For Traceloop, this identifies kernel-level patterns detected during request processing.
COUNT: How many times this pattern has been observed. A rapidly incrementing count indicates high request volume.
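If you want to post-process this kind of columnar output, a few lines of Python are enough. The sketch below parses the sample table shown earlier into dictionaries keyed by column name. It assumes whitespace-separated columns with no spaces inside values, and the helper names are hypothetical. If your tool offers a structured output mode, prefer that over parsing columns.

```python
SAMPLE = """\
NODE         NAMESPACE   POD              CONTAINER    PID    TYPE       COUNT
minikube     demo-app    frontend-abc123  frontend     1234   loop       1
minikube     demo-app    frontend-abc123  frontend     1234   loop       2
"""


def parse_trace_table(text: str) -> list[dict[str, str]]:
    """Turn a whitespace-aligned table into one dict per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not line up with the header
        rows.append(dict(zip(header, fields)))
    return rows


def latest_count_per_pod(rows: list[dict[str, str]]) -> dict[str, int]:
    """Keep the highest COUNT seen for each pod."""
    latest: dict[str, int] = {}
    for row in rows:
        pod = row["POD"]
        latest[pod] = max(latest.get(pod, 0), int(row["COUNT"]))
    return latest


if __name__ == "__main__":
    print(latest_count_per_pod(parse_trace_table(SAMPLE)))
```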

What this tells you about your application:

Even from this simple output, you can derive insights. If you see events appearing for the frontend pod but not the productcatalog pod, requests may not be reaching the backend, which points to a potential configuration issue. If the COUNT increases rapidly for one pod but not others, you know which replica is receiving traffic, which is useful for debugging load balancing issues.

The real power becomes clear when you correlate these kernel-level observations with what you know about your application. When you made 10 curl requests, you should see corresponding activity in the trace output. This direct relationship between application behavior and kernel observations is the foundation of eBPF observability.

How to Interpret Traces

Understanding raw trace output is valuable, but interpreting what it means for your application's health and performance is where the real skill lies.

Trace Anatomy: Spans, Timing, and Request Flow

A trace represents a single request's journey through your system. When you curl the frontend, that generates one trace. A span represents a single operation within that trace, such as "frontend handles request," "frontend calls product catalog," "product catalog queries data," or "frontend returns response." Each span has timing information: when it started, when it ended, and therefore how long it took.

In traditional distributed tracing with OpenTelemetry or Jaeger, you'd explicitly create these spans in your application code. With eBPF, the tool infers spans from syscall patterns. When eBPF sees your frontend process call connect() to the product catalog's IP, followed by send() and recv(), it understands that's a span representing an HTTP request to the backend service.

The request flow is the sequence of spans showing how your request moved through services. In our demo app, the flow looks like this:

The user request arrives at the frontend
The frontend connects to the product catalog
The product catalog processes the request
The product catalog returns the data
The frontend renders the page
The response is sent to the user
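The span vocabulary above maps onto a very small data model. Here is a sketch with made-up timings (the millisecond values are purely illustrative, and `Span`, `slowest_span`, and `total_duration_ms` are names invented for this example) that shows how a duration falls out of start and end times, and how you might pick out the slowest downstream operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_span(spans: list[Span]) -> Span:
    """Return the span with the longest duration."""
    return max(spans, key=lambda span: span.duration_ms)


def total_duration_ms(spans: list[Span]) -> float:
    """Length of the whole trace: earliest start to latest end."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)


# Illustrative timings only; the first span wraps all the others.
TRACE = [
    Span("frontend handles request", 0.0, 42.0),
    Span("frontend calls product catalog", 5.0, 35.0),
    Span("product catalog queries data", 8.0, 30.0),
    Span("frontend returns response", 36.0, 42.0),
]

if __name__ == "__main__":
    for span in TRACE:
        print(f"{span.name}: {span.duration_ms} ms")
    # Skip the outermost span, which always covers the whole request.
    print("slowest inner span:", slowest_span(TRACE[1:]).name)
```

In this made-up trace, 30 of the 42 milliseconds are spent waiting on the product catalog call, which is the kind of observation that points you at the right service.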

How to Follow Requests Across Services

Let's trace a request across service boundaries to see this flow in action.

First, we’ll start a more detailed trace:

kubectl gadget run trace_tcp:latest -n demo-app

The trace_tcp gadget shows network connections, giving us visibility into service-to-service communication.

Next, generate a request:

curl http://localhost:8080

In the trace output, look for connection patterns:

You should see the frontend pod establishing a TCP connection to the product catalog service. The trace will show the source (frontend) and destination (product catalog) IPs and ports, along with timing information.

This is how eBPF lets you follow requests: by observing the network syscalls that implement service communication. You don't need a service mesh or instrumentation libraries. The kernel sees all network activity, and eBPF captures it.

Understanding the flow:

Your curl command triggers a TCP connection to the frontend pod's IP on port 8080
The frontend processes the request and opens a TCP connection to the product catalog's IP on port 3550
Data flows back and forth (you'll see send/receive events)
Connections close when requests complete

Each step is visible to eBPF because each step requires syscalls that the kernel handles.

How to Identify Bottlenecks and Errors

We can also use tracing to identify performance issues.

First, let’s simulate an unavailable backend by scaling the product catalog down to zero replicas:

# Take the backend offline by scaling the deployment to zero replicas
kubectl scale deployment productcatalog -n demo-app --replicas=0

While the product catalog is down, generate some requests:

for i in {1..5}; do curl http://localhost:8080; done

You should see connection attempts from the frontend to the product catalog that fail. Depending on the exact timing, they show up as connection timeouts or connection refused errors.

Once you have seen the failure pattern, scale the backend back up:

kubectl scale deployment productcatalog -n demo-app --replicas=1

What bottlenecks look like in traces:

Long spans: A span that takes significantly longer than others indicates a bottleneck. In Traceloop output, you might see gaps between events or notice certain operations taking longer.
Retries: Repeated connection attempts to the same destination suggest a failing or slow service.
Error patterns: Connection failures, timeouts, or unusual syscall sequences indicate problems.
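The retry pattern in particular is easy to detect mechanically once you have connection events in hand. The sketch below counts failed connection attempts per source and destination pair and flags pairs that cross a threshold. The `ConnectAttempt` record, its field names, the threshold of three, and the sample IP address are all assumptions for illustration, not the output format of any specific gadget.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectAttempt:
    timestamp_s: float
    source_pod: str
    destination: str  # "ip:port"
    succeeded: bool


def find_retry_suspects(
    attempts: list[ConnectAttempt], min_failures: int = 3
) -> dict[tuple[str, str], int]:
    """Return (source, destination) pairs with at least `min_failures` failed attempts."""
    failures = Counter(
        (a.source_pod, a.destination) for a in attempts if not a.succeeded
    )
    return {pair: count for pair, count in failures.items() if count >= min_failures}


if __name__ == "__main__":
    sample = [
        ConnectAttempt(0.0, "frontend-abc123", "10.96.0.12:3550", False),
        ConnectAttempt(1.0, "frontend-abc123", "10.96.0.12:3550", False),
        ConnectAttempt(2.0, "frontend-abc123", "10.96.0.12:3550", False),
        ConnectAttempt(2.5, "frontend-abc123", "10.96.0.20:6379", True),
    ]
    print(find_retry_suspects(sample))
```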

The best skill to have is pattern recognition. A typical, healthy request flow has a rhythm, and events occur in predictable sequences with consistent timing. When something breaks, the rhythm changes. Requests take longer, errors appear, or expected events don't occur at all.

Real-World Debugging Scenarios

Now let's go through three realistic scenarios where eBPF helps:

Scenario 1: Finding a Slow Endpoint

The problem: Users report that the product catalog page sometimes loads very slowly, but metrics show normal average latency.
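It's worth seeing why "sometimes very slow" and "normal average latency" can both be true. In the sketch below, the latency numbers are invented for illustration: 98 requests take 100 ms and 2 take 1,500 ms. The `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling division on integers
    return ordered[rank - 1]


# Invented sample: 98 fast requests and 2 slow ones.
LATENCIES_MS = [100] * 98 + [1500] * 2

if __name__ == "__main__":
    print("mean  :", statistics.mean(LATENCIES_MS), "ms")
    print("median:", percentile(LATENCIES_MS, 50), "ms")
    print("p99   :", percentile(LATENCIES_MS, 99), "ms")
```

With these numbers the mean works out to (98 × 100 + 2 × 1500) / 100 = 128 ms, which looks healthy, while the 99th percentile is 1,500 ms. Averages hide exactly the requests your users are complaining about, which is why per-request tracing is useful here.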

Let’s use Traceloop to investigate:

# Record syscalls from the frontend pod
kubectl gadget run traceloop:latest -n demo-app --podname frontend

We’ll generate some mixed traffic:

# Some requests to the homepage (fast)
curl http://localhost:8080

# Some requests to the product catalog (potentially slow)
curl http://localhost:8080/products

In the trace output, compare the COUNT increments for different request patterns. If certain patterns show significantly more loop iterations or longer gaps between events, that indicates those requests are doing more work, possibly hitting a slow endpoint.

The diagnosis:

You might notice that requests to /products cause the frontend to make multiple calls to the product catalog service (visible with the trace_tcp gadget), while homepage requests don't. This explains why the product page is slow: it's making synchronous calls to a backend service, and if that service is slow or the network is congested, users feel the delay.

The fix:

You might implement caching, make the backend calls asynchronous, or optimize the product catalog service itself. The key is that eBPF helped you identify which specific code path was slow without adding instrumentation to your application.

Scenario 2: Tracking Down Failed Requests

The problem: Your monitoring shows a 5% error rate, but application logs don't show any errors. Where are the failures happening?

Now let’s use eBPF to investigate:

# Trace network connections to see connection failures
kubectl gadget run trace_tcp:latest -n demo-app

We’ll simulate intermittent failures:

# Create a failing scenario by temporarily breaking service connectivity
kubectl delete service productcatalog -n demo-app

# Generate requests
for i in {1..10}; do curl http://localhost:8080; sleep 1; done

# Restore the service
kubectl apply -f demo-app.yaml

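In the TCP trace, you may see connection attempts from the frontend to the product catalog that fail or time out, along with the source, the destination, and what happened (connection refused, timeout, and so on). Because deleting a Service also removes its DNS record, the failure can instead surface as a failed name lookup, in which case no new TCP connection is attempted at all.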

The diagnosis:

The failures are happening at the network level: the frontend can't reach the product catalog. This might be due to network policy issues, service mesh misconfiguration, or DNS problems. Traditional application logs might not capture this because the connection fails before the application layer gets involved, so the application never receives a response to log.

Why eBPF finds this when logs don't:

Your application logs what it experiences. If a connection fails at the TCP level, your application might just see "connection refused" and retry without detailed logging.

eBPF sees the actual syscalls and network events, giving you visibility into what's happening beneath your application layer.

Scenario 3: Understanding Service Dependencies

The problem: You're not sure which services depend on each other, and you want to understand the actual runtime dependencies before making changes.

We’ll use eBPF to map dependencies:

# Trace all TCP connections to see who talks to whom
kubectl gadget run trace_tcp:latest -n demo-app

And then generate normal traffic:

# Make various requests to exercise different code paths
curl http://localhost:8080
curl http://localhost:8080/products
curl http://localhost:8080/cart

The trace output shows source and destination for every connection. Build a mental (or actual) map of which pods connect to which services.
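If you would rather build an actual map than a mental one, the idea fits in a few lines. This sketch assumes you have already reduced the trace output to (source, destination) name pairs. The service names other than frontend and productcatalog are hypothetical, as are the helper names.

```python
from collections import defaultdict


def build_dependency_map(connections: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each source to the set of destinations it was seen connecting to."""
    deps: dict[str, set[str]] = defaultdict(set)
    for source, destination in connections:
        deps[source].add(destination)
    return dict(deps)


def dependents_of(dep_map: dict[str, set[str]], service: str) -> set[str]:
    """Return every source that connects to `service`."""
    return {source for source, dests in dep_map.items() if service in dests}


if __name__ == "__main__":
    observed = [
        ("frontend", "productcatalog"),
        ("frontend", "redis-cart"),  # hypothetical extra dependency
        ("checkout", "productcatalog"),  # hypothetical second caller
        ("frontend", "productcatalog"),  # duplicates collapse into the set
    ]
    dep_map = build_dependency_map(observed)
    print(dep_map)
    print("affected by a productcatalog change:", dependents_of(dep_map, "productcatalog"))
```

The reverse lookup in `dependents_of` answers the question you usually care about before a deploy: who breaks if this service changes?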

The discovery:

You'll see that the frontend pod connects to the product catalog service, but you might also discover unexpected dependencies. Perhaps the frontend also makes calls to a Redis cache, an authentication service, or external APIs. These runtime dependencies might not be documented or might differ from what architectural diagrams show.

Why this matters:

Before deploying a change to the product catalog service, you now know exactly which services will be affected. Before implementing a network policy, you know which connections to allow. Before decomposing a monolith, you understand the actual communication patterns.

This is observability-driven architecture discovery: letting the system show you how it actually works, not how you think it works.

Advanced Tracing Insights

Once you're comfortable with basic request tracing, Inspektor Gadget offers deeper observability capabilities that reveal even more about your system's behavior.

Syscall-Level Observation

The traceloop and trace_tcp gadgets give you application-level insights, but sometimes you need to go deeper. The trace_exec gadget shows you every process execution in your containers.

First, let’s monitor process execution:

kubectl gadget run trace_exec:latest -n demo-app

And generate activity:

# Exec into a pod and run commands
kubectl exec -it -n demo-app deployment/frontend -- /bin/sh
ls -la
ps aux
exit

Every command you run inside the container appears in the trace: /bin/sh, ls, ps, and anything else. This helps you understand what's running in your containers, detect suspicious activity, or debug initialization issues.

In production scenarios, this helps you answer questions like: Is my application spawning unexpected subprocesses? Are there security issues like someone running curl to download malicious scripts? Is my init script actually running the commands I think it is?
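One simple way to act on exec traces is to compare them against an allowlist of expected commands per container. The sketch below assumes each event has been reduced to a dictionary with `container` and `comm` keys. Those field names, the allowlist contents, and the function name are hypothetical choices for this example.

```python
def unexpected_executions(
    events: list[dict[str, str]], allowed: dict[str, set[str]]
) -> list[dict[str, str]]:
    """Return exec events whose command is not on the container's allowlist."""
    return [
        event
        for event in events
        if event["comm"] not in allowed.get(event["container"], set())
    ]


if __name__ == "__main__":
    allowlist = {"frontend": {"server"}}  # hypothetical expected binaries
    events = [
        {"container": "frontend", "comm": "server"},
        {"container": "frontend", "comm": "sh"},
        {"container": "frontend", "comm": "curl"},
    ]
    for event in unexpected_executions(events, allowlist):
        print("unexpected:", event["container"], event["comm"])
```

Containers with no allowlist entry are treated as allowing nothing, so every exec in them is reported. That is a deliberately strict default, and you may prefer the opposite in noisy environments.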

Network Tracing Insights

Beyond TCP connections, you can trace DNS queries, which often reveal surprising things about your application's behavior.

Run trace_dns:

kubectl gadget run trace_dns:latest -n demo-app

Generate requests:

curl http://localhost:8080

You'll see every DNS query your application makes: resolving service names, checking for external APIs, perhaps even unexpected queries that indicate misconfiguration or dependencies you didn't know about.

Common insights from DNS tracing include discovering that your application is using external dependencies you didn't document, finding DNS resolution failures that cause intermittent errors, or identifying excessive DNS queries that could be cached.
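Two of these insights, excessive repeated queries and resolution failures, can be pulled out of a list of DNS events with a short script. The sketch below assumes each event has been reduced to a (name, response code) pair using the standard DNS response code names such as NoError and NXDomain. The threshold, the sample names, and the `summarize_dns` helper are assumptions for illustration.

```python
from collections import Counter


def summarize_dns(
    queries: list[tuple[str, str]], repeat_threshold: int = 3
) -> tuple[dict[str, int], list[str]]:
    """Return (names queried at least `repeat_threshold` times, names that failed to resolve)."""
    counts = Counter(name for name, _ in queries)
    repeated = {name: n for name, n in counts.items() if n >= repeat_threshold}
    failed = sorted({name for name, rcode in queries if rcode != "NoError"})
    return repeated, failed


if __name__ == "__main__":
    sample = [
        ("productcatalog.demo-app.svc.cluster.local.", "NoError"),
        ("productcatalog.demo-app.svc.cluster.local.", "NoError"),
        ("productcatalog.demo-app.svc.cluster.local.", "NoError"),
        ("productcatalog.demo-app.svc.cluster.local.", "NoError"),
        ("api.example.com.", "NoError"),
        ("cache.demo-app.svc.cluster.local.", "NXDomain"),
    ]
    repeated, failed = summarize_dns(sample)
    print("candidates for caching:", repeated)
    print("failed lookups:", failed)
```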

Combining eBPF Data with Logs and Metrics

eBPF observability delivers the best results when combined with traditional observability signals. To combine them effectively:

Use metrics for high-level health monitoring, alerting on anomalies, tracking trends over time, and dashboard visualization.
Use logs for application-specific context, business logic details, error messages with stack traces, and debugging application code.
Use eBPF traces for understanding request flows, identifying where time is spent, discovering runtime dependencies, and debugging issues that don't appear in logs.

A practical workflow:

Your metrics alert you that latency increased. You check logs but don't see errors: requests are succeeding, just slowly. You use eBPF tracing to identify that requests are spending extra time in network I/O to a particular backend service. Now you check that service's metrics and logs, and discover it's under heavy load. The eBPF trace gave you the clue that logs and metrics alone couldn't provide.

This approach to observability, using the right tool for each question, is how experienced engineers debug complex systems efficiently.

What eBPF Can and Can't See

eBPF excels at:

Network traffic (requests, responses, latency)
System calls (file I/O, process creation, memory allocation)
Kernel functions (scheduling, locking, resource usage)
Function calls in binaries (with uprobes)

But keep in mind that eBPF has limitations:

Cannot decrypt encrypted payloads (unless hooking SSL libraries before encryption)
Doesn't automatically understand application logic
Captures low-level events but may need context for high-level semantics

That's why eBPF complements traditional observability rather than replacing it entirely. It gives you infrastructure-level visibility with no code changes and universal coverage. Traditional APM provides application-level context, business metrics, and custom instrumentation. Together, they give you complete observability across your entire stack.

Best Practices and Production Considerations

Before using eBPF tracing in production, there are important considerations around performance, security, and operational practices.

Performance Impact

eBPF's reputation for low overhead is well-deserved, but "low" isn't "zero."

Most eBPF tracing tools add roughly 2-5% CPU overhead and negligible memory overhead. The exact number depends on event frequency: tracing a service that handles 10,000 requests per second will have more overhead than tracing one that handles 10 per second.
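A back-of-the-envelope model shows why event frequency dominates. The sketch below multiplies request rate, traced events per request, and CPU cost per event. The per-event cost of 1,000 ns and the figure of 5 traced events per request are assumptions picked to make the arithmetic easy, not measurements of any tool.

```python
NS_PER_SECOND = 1_000_000_000


def cpu_fraction(
    requests_per_second: int, events_per_request: int, cost_ns_per_event: int
) -> float:
    """Fraction of one CPU core spent handling traced events."""
    busy_ns = requests_per_second * events_per_request * cost_ns_per_event
    return busy_ns / NS_PER_SECOND


if __name__ == "__main__":
    # Assumed: 5 traced events per request, 1,000 ns of CPU per event.
    for rps in (10, 10_000):
        fraction = cpu_fraction(rps, events_per_request=5, cost_ns_per_event=1_000)
        print(f"{rps:>6} req/s -> {fraction:.4%} of one core")
```

Under these assumptions, 10 requests per second costs 10 × 5 × 1,000 ns = 50,000 ns of CPU per second (0.005% of a core), while 10,000 requests per second costs 50,000,000 ns per second (5% of a core). The tool is the same in both cases; only the event rate differs.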

Measuring the impact:

# Before enabling tracing, check baseline resource usage
kubectl top pods -n demo-app

# Enable tracing (run this in a second terminal, since the command keeps running)
kubectl gadget run traceloop:latest -n demo-app

# Check resource usage again
kubectl top pods -n demo-app

You should see a small increase in CPU usage in the pods where tracing is active. This is the cost of the eBPF programs running in the kernel and processing events.

Production best practices:

Use targeted tracing rather than tracing everything everywhere. Trace specific namespaces, pods, or individual containers when investigating issues. For high-volume services, reduce overhead by applying filters, aggregation, or sampling where supported by the tracing tool.

Stop tracing when you’re done investigating. Unlike metrics collection, which typically runs continuously, eBPF-based tracing is best used as an on-demand diagnostic tool to capture detailed insights during active debugging.

When overhead matters:

If you're running latency-sensitive applications (like high-frequency trading systems or real-time communications), even 2-5% overhead might be unacceptable. In these cases, use eBPF tracing in pre-production environments to identify issues, or enable it temporarily in production only when actively debugging.

Security Considerations

eBPF is powerful, which means it requires elevated privileges. Understanding the security implications is crucial.

What eBPF can access:

eBPF programs can observe all syscalls, network traffic, and process execution in the kernel. This includes potentially sensitive data like connection details, file paths, and process arguments. While eBPF programs run in a sandbox and the verifier keeps them from crashing the kernel, they can still read information that might be sensitive.

Privilege requirements:

Loading eBPF programs requires the CAP_SYS_ADMIN capability or, on kernel 5.8 and later, the narrower CAP_BPF capability (tracing programs typically also need CAP_PERFMON). This is a privileged operation, so only trusted users should have this access. The Inspektor Gadget DaemonSet runs with these privileges, so protect access to it accordingly.

Best practices:

Implement RBAC (Role-Based Access Control) to restrict who can run gadgets. Not every developer needs the ability to trace production systems.

Also, be mindful of what data you're collecting. If your traces might contain sensitive information (like authentication tokens in HTTP headers), restrict access to trace data.

Lastly, consider using admission controllers to block unauthorized privileged workloads, since those are what could load eBPF programs. Audit eBPF usage in production environments to track who ran which gadgets and when.

Network policies:

Inspektor Gadget's DaemonSet needs to communicate with the API server and between its components. Ensure your network policies allow this communication while still maintaining appropriate segmentation.

When to Use eBPF Tracing vs. Traditional APM

eBPF tracing and traditional APM tools like New Relic, Datadog, or Dynatrace serve different purposes. Understanding when to use each helps you build an effective observability strategy.

Use eBPF tracing when:

You can't modify application code (third-party applications, legacy systems, compiled binaries)
You need infrastructure-level visibility (network, syscalls, kernel behavior)
You're debugging issues that span service boundaries but don't show up in application logs
You want zero instrumentation overhead during normal operation (run tracing only when needed)
You need to understand what's actually happening versus what the application reports

Use traditional APM when:

You need business-context metrics (user IDs, transaction types, business-specific data)
You want automatic instrumentation with minimal setup for supported frameworks
You need long-term storage and analysis of all traces (eBPF tracing is often used for real-time investigation)
You want pre-built dashboards and alerting for common application patterns
You need application code-level visibility (stack traces, variable values, function calls)

The Ideal Approach: Use Both

Many teams run traditional APM for continuous monitoring and use eBPF tracing for targeted investigation when APM data isn't sufficient. For example, your APM shows that a service is slow but doesn't explain why. You enable eBPF tracing on that service to see what's happening at the kernel level (network delays, excessive syscalls, unexpected dependencies) and find the root cause.

This complementary approach gives you both the continuous visibility of APM and the deep diagnostic power of eBPF without the overhead of running both at maximum depth all the time.

Next Steps and Resources

If you got this far, thanks for reading! Now that you have learned the fundamentals of eBPF observability and tried hands-on tracing with Inspektor Gadget, you can continue your journey by:

Exploring Other eBPF Tools

Now that you understand eBPF concepts through traceloop, exploring other tools will be much easier.

Try other Inspektor Gadget gadgets:

# See all available gadgets
kubectl gadget --help

# Some useful ones to explore:
kubectl gadget run trace_open:latest -n demo-app        # File I/O tracing
kubectl gadget run trace_bind:latest -n demo-app        # Port binding events
kubectl gadget run profile_cpu:latest -n demo-app       # CPU profiling
kubectl gadget run snapshot_process:latest -n demo-app  # Process listing

Each gadget teaches you something different about system behavior and gives you another diagnostic tool in your toolkit.

Experiment with other eBPF platforms:

If you're interested in broader observability platforms, try Pixie for its auto-instrumentation and rich UI. Install Cilium with Hubble if you're focused on network observability and want to understand service-to-service traffic. Explore Tetragon if security observability interests you and you want to see which processes are executing and which files they access.

The concepts transfer directly: all these tools attach eBPF programs to kernel hooks, collect event data, and present it in different ways. Your understanding of syscalls, traces, and kernel-level observation applies universally.

Connect to the CNCF Observability Ecosystem

eBPF observability tools don't exist in isolation. They're part of the broader Cloud Native Computing Foundation ecosystem.

OpenTelemetry integration:

Many eBPF tools can export data in OpenTelemetry format, allowing you to combine kernel-level traces with application-level traces in a unified observability backend. This gives you the complete picture: eBPF shows you infrastructure behavior while OpenTelemetry shows you application context.

Prometheus and Grafana:

eBPF-derived metrics can be exposed as Prometheus metrics and visualized in Grafana alongside your application metrics. This unified dashboard approach helps you correlate infrastructure and application behavior.

Service mesh integration:

If you're using Istio, Linkerd, or other service meshes, eBPF tools like Cilium Hubble can provide deeper visibility into service-to-service communication than the mesh alone provides. The mesh handles traffic management while eBPF gives you kernel-level visibility.

Jaeger and Zipkin:

For organizations using distributed tracing backends, eBPF traces can be exported to these systems, enriching your trace data with infrastructure-level spans that application instrumentation misses.

Community Resources and Learning Paths

The eBPF community is vibrant and welcoming. You can continue learning from the resources below.

Official documentation and blog:

eBPF.io: The central hub for eBPF documentation, tutorials, and project listings
Inspektor Gadget docs: Comprehensive guides for all gadgets and use cases
Cilium documentation: Deep dives into eBPF networking
CNCF Blog, “What is Observability 2.0?”: A quick overview of how modern observability moves beyond traditional tools by unifying metrics, logs, and traces for real-time insight in cloud-native systems.

Learning resources:

Learning eBPF by Liz Rice: Comprehensive book covering eBPF fundamentals
eBPF Summit: Annual conference with talks from eBPF creators and users
CNCF webinars: Regular sessions on observability topics
Kubernetes observability SIGs: Community discussions and projects

To make this tutorial easy to follow and experiment with, I have included all Kubernetes manifests, demo applications, and eBPF tracing commands in this repository. You can also connect with me on LinkedIn if you’d like to stay in touch.

How to Improve Developer Experience in Microservices Applications with .NET Aspire

Opaluwa Emidowojo — Fri, 24 Oct 2025 14:26:29 +0000

Since the advent of microservices, development teams have gained the flexibility to deploy services independently, without coordinating with the entire engineering organization. Bug fixes can be released in isolation without full regression testing, and multiple teams can ship updates simultaneously, sometimes ten or more deploys a day per team.

But we rarely talk about the downsides of microservices. In medium to large-scale systems, the number of services can grow quickly. Netflix reportedly runs over seven hundred microservices, and Uber manages more than two thousand. That kind of scale introduces a lot of moving parts, testing complexity, and debugging challenges across service boundaries. And all of this can severely impact developer experience (DX).

Recently, I came across a new framework called .NET Aspire, which dramatically simplifies local microservices development. Aspire handles service discovery, configuration management, and observability for distributed applications, giving you a complete view of your system through a built-in dashboard. This results in a much simpler, smoother local development experience compared to manually wiring up multiple services. In this guide, we'll explore how Aspire works and how it can help improve developer experience in microservices-based systems.

Prerequisites

Before we begin, ensure you have the following installed:

.NET 8 SDK or later
Docker Desktop
- Aspire uses Docker to run dependencies like Redis, PostgreSQL, and so on.
- Ensure Docker is running before starting
Visual Studio 2022 (v17.9+) or Visual Studio Code with C# Dev Kit
Basic understanding of:
- C# and .NET development
- Microservices architecture concepts
- REST APIs and service communication

Optional but Recommended:

Familiarity with Docker and containerization
Experience with distributed application development
Knowledge of observability concepts (logging, tracing, metrics)

Prerequisites
Understanding Developer Experience in Microservices
Introducing .NET Aspire
How to Set Up .NET Aspire in Your Project
Why This Matters for Developer Experience
Framework: How to Adopt .NET Aspire Incrementally
How to Use the .NET Aspire Dashboard
Practical Scenarios: Solving Real-World DX Challenges with .NET Aspire
Going Further
Key Takeaways and When to Use .NET Aspire
When (and When Not) to Use .NET Aspire
Conclusion

Understanding Developer Experience in Microservices

When people talk about DX, they often think of it as tooling or ergonomics, things like good documentation, fast build times, and clean APIs. But in distributed systems, DX becomes much broader. It’s about how easily developers can set up, run, and reason about the systems they’re building.

In a monolithic application, starting your development environment might mean running a single command like dotnet run. But in a microservices-based system, you might need to start multiple APIs, databases, background workers, and queues, all with specific configuration dependencies. This extra overhead doesn’t just slow you down. It breaks your focus and adds friction to daily development.

Over time, that friction compounds.

Onboarding new developers becomes slower.
Debugging across service boundaries gets harder.
Teams spend more time managing environments than writing features.

That’s why DX is so important in microservices architectures. It is not just about developer happiness. It’s about velocity, consistency, and confidence. If your local environment isn’t easy to run or reason about, every other process in your development lifecycle suffers.

This is where orchestration frameworks like .NET Aspire start to make a real difference. They handle the complexity of coordinating services, so developers can focus on building and iterating faster, the way modern software development is meant to work.

Introducing .NET Aspire

As microservice systems grow, local development environments often become a patchwork of scripts, Docker Compose files, and manual setup steps. Each developer ends up managing their own version of “how to get things running,” and small differences in configuration can lead to big inconsistencies across teams.

.NET Aspire is an orchestration framework designed to simplify this process. It provides a way to define, configure, and run your distributed applications as a single unit, directly within your .NET solution.

In practical terms, Aspire helps developers by handling three key areas automatically:

Service Orchestration
Aspire can start multiple projects (APIs, workers, databases, and so on) in the correct order. It takes care of service dependencies so that, for example, your API doesn’t try to start before the database it depends on is ready.
Configuration Management
Instead of juggling dozens of appsettings.json files or environment variables, Aspire provides a centralized configuration model. It shares connection strings, ports, and environment settings across services in a consistent way.
Observability and Insights
Aspire includes built-in OpenTelemetry support and a dashboard that gives you real-time visibility into your running services, including their health, logs, and endpoints. This makes debugging and local monitoring much easier.

In many ways, Aspire does for services what Kubernetes does for containers, but with a sharper focus on local development and developer experience. It’s not meant to replace your production orchestration tools. It’s designed to make your everyday development smoother, faster, and less error-prone.

How to Set Up .NET Aspire in Your Project

We'll create a microservices setup and watch Aspire orchestrate it with minimal code. Make sure you're running .NET 8 or later. Aspire requires it.

Create a New Aspire Project

Start by creating a new Aspire app host using the .NET CLI:

dotnet new aspire-apphost -n MyCompany.AppHost

This command creates a new Aspire “host” project, the entry point that orchestrates your other microservices, APIs, and dependencies.

You’ll notice that the generated project contains a Program.cs file that calls DistributedApplication.CreateBuilder. The builder it returns acts as the control center for your distributed system.

Add Your Microservices

You can now reference your existing projects or create new ones directly in the same solution. For example:

dotnet new webapi -n CatalogService
dotnet new webapi -n OrderService
dotnet new worker -n NotificationWorker

Then, add them to your Aspire host by editing Program.cs:

var builder = DistributedApplication.CreateBuilder(args);

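// Projects.* types are generated once the app host references each project
var catalog = builder.AddProject<Projects.CatalogService>("catalog");
var order = builder.AddProject<Projects.OrderService>("order")
                   .WaitFor(catalog); // ensure this starts after CatalogService
var notifications = builder.AddProject<Projects.NotificationWorker>("notifications");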

builder.Build().Run();

In this example:

AddProject registers each service with Aspire.
.WaitFor() enforces startup dependencies (for example, OrderService depends on CatalogService).
Aspire takes care of starting these services in the right order, sharing environment variables, and managing ports automatically.

Run All Services with One Command

Now, from your app host directory, run:

dotnet run

Aspire will:

Start all the registered services.
Allocate available ports.
Inject shared configurations.
Launch a local dashboard showing service health, endpoints, and logs.

You should see output like this:

Starting CatalogService...
Starting OrderService...
Starting NotificationWorker...
AppHost running on http://localhost:18888

And when you open the dashboard in your browser, you’ll see all your services, their statuses, and links to their APIs.

Add a Local Database (Optional)

To show how Aspire handles dependencies, let’s add a PostgreSQL container:

var db = builder.AddPostgres("postgres");
builder.AddProject<Projects.CatalogService>("catalog")
       .WithReference(db); // injects connection string automatically

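Now when you run the app, Aspire will start a PostgreSQL container, generate a connection string, and pass it to CatalogService. No manual setup or .env files required. WithReference only shares the connection details, so if the service must not start until the database is ready, chain .WaitFor(db) as well.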
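To make the injection step less abstract, here is how a receiving process could read such a value. This is a minimal sketch, assuming the connection string arrives as an environment variable named after .NET's ConnectionStrings__<name> convention and uses a semicolon-separated Key=Value format. The variable name, the fallback value, and the helper function are illustrative assumptions, and quoted values containing semicolons are not handled.

```python
import os


def parse_connection_string(raw: str) -> dict:
    """Split a "Key=Value;Key=Value" connection string into a dict."""
    settings = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue  # skip empty or malformed segments
        key, value = item.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Assumed variable name; the fallback is made up so the example runs anywhere.
raw = os.environ.get(
    "ConnectionStrings__postgres",
    "Host=localhost;Port=5432;Username=postgres;Password=example",
)
settings = parse_connection_string(raw)
print(sorted(settings))
```

The point is that the service reads its settings from the environment at startup instead of from a file that each developer maintains by hand.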

Why This Matters for Developer Experience

Before Aspire, getting your local environment running meant opening multiple terminals, waiting around for databases to start, and copying connection strings between projects. With Aspire, it's just one command. Everything starts automatically, configuration is shared across services, and you get observability built in. That's the developer experience win. Less time fighting your setup, more time actually coding.

Framework: How to Adopt .NET Aspire Incrementally

If you’re considering trying Aspire in your own team, you don’t have to migrate everything at once. In fact, the best approach is incremental adoption. Start small and expand gradually.

Here’s a simple framework you can follow:

Step 1: Start Small

Create an Aspire host and connect one or two key services.
This helps your team understand the orchestration flow before scaling up.

dotnet new aspire-apphost -n MyCompany.AppHost

Step 2: Add Dependencies Incrementally

As you grow, include more services and use .WaitFor() to define dependencies and startup order.

var builder = DistributedApplication.CreateBuilder(args);

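var db = builder.AddPostgres("postgres");
var catalog = builder.AddProject<Projects.CatalogService>("catalog")
                     .WithReference(db);
// WaitFor takes the resource itself, not its name.
// Projects.Gateway assumes a referenced project named Gateway.
builder.AddProject<Projects.Gateway>("gateway")
       .WaitFor(catalog);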

builder.Build().Run();

Step 3: Integrate Observability

Leverage Aspire’s built-in OpenTelemetry integration for metrics and traces. You’ll instantly gain better insight into service interactions even without external tools.

Step 4: Share Your Setup

Commit your Aspire host to source control so every developer uses the same setup.
This ensures consistency across environments, reducing the classic “works on my machine” problem.

Note: Aspire doesn’t require a full rewrite. It works great as a starting layer while your team continues evolving your existing orchestration setup.

How to Use the .NET Aspire Dashboard

One of the standout features of .NET Aspire is its built-in dashboard, which gives you real-time visibility into your microservices while they run locally.

When you start your Aspire app host with dotnet run, it automatically spins up a local dashboard (by default at http://localhost:18888). This dashboard provides a centralized view of all your services — APIs, databases, background workers, and any connected dependencies.

Here’s what you’ll find inside:

Service Overview

The dashboard home page lists every service in your distributed application. For each one, you can see:

Name and type (for example, cache, apiservice, webfrontend)
Current state (Running, Starting, Stopped)
Source
Port and endpoint information
Startup time and uptime
Logs and metrics shortcuts

This immediately replaces the need to track multiple terminal windows or scroll through dozens of logs just to confirm everything started correctly.

The dashboard automatically detects unhealthy or failed services and highlights them, so you can identify startup issues early.

Navigating to Endpoints

Each service card includes quick links to its exposed endpoint, providing easy access to relevant tools and interfaces. For example, APIs may include links to Swagger UI or Scalar, databases may link to pgAdmin or similar management tools, and internal services may offer links to custom dashboards.

This setup allows users to test APIs or verify database connections directly from the dashboard without needing to remember specific ports or manually construct URLs.

Real-Time Logs

Clicking into a specific service opens a detailed view showing real-time logs streamed directly from that service.

This is especially helpful when debugging startup issues or service interactions. Instead of running dotnet run in separate terminals, you can view logs for all your services in one place, color-coded and timestamped for clarity.

Observability Built-In (OpenTelemetry)

Aspire includes OpenTelemetry by default, which means that even without additional configuration, you automatically gain access to several powerful observability features. These include distributed traces across service boundaries, metrics for performance monitoring, and correlated logs that help track requests spanning multiple services.

For teams already using tools like Grafana, Jaeger, or SigNoz, Aspire can export this telemetry data to your preferred observability platform with minimal setup.

With tracing enabled, you can follow a request as it travels from your API to your database, through background workers, and back, all from within the dashboard.

Why the Dashboard Improves Developer Experience

Without Aspire, running a local microservices environment typically requires managing multiple terminal windows, tracking ports manually, and searching through log files to diagnose failures.

Aspire consolidates these tasks into a single visual interface where developers can view all services, check dependencies, inspect logs, and monitor system health directly from the browser.

This integrated environment enables faster debugging, maintains developer focus, and simplifies work with complex systems by reducing the overhead of manual coordination.

Practical Scenarios: Solving Real-World DX Challenges with .NET Aspire

So far, we have looked at how Aspire works and what it provides out of the box. But to really understand its impact on developer experience, let’s go through a few real-world pain points that almost every team building with microservices has faced, and how Aspire helps solve them.

Starting Multiple Services in the Right Order

The Problem: In most microservices setups, service startup order matters. For instance, your API Gateway might depend on the User Service and Catalog Service, which both depend on a Database.
If you start these in the wrong order, the gateway fails to connect, and you end up restarting services manually until everything stabilizes.

How Aspire Solves It: Aspire provides a simple way to express dependencies using .WaitFor():

var builder = DistributedApplication.CreateBuilder(args);

var db = builder.AddPostgres("postgres");
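// Projects.* names assume referenced projects called UserService, CatalogService, and Gateway
var user = builder.AddProject<Projects.UserService>("user")
                  .WithReference(db);

var catalog = builder.AddProject<Projects.CatalogService>("catalog")
                     .WithReference(db);

var gateway = builder.AddProject<Projects.Gateway>("gateway")
                     .WaitFor(user)
                     .WaitFor(catalog);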

builder.Build().Run();

Aspire automatically ensures that each service only starts after the services it depends on are fully ready.
No more manual sequencing or “start this one first” instructions in your README.
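Conceptually, dependency-ordered startup is a topological sort over the service graph. The sketch below shows that idea with Python's standard library (3.9 or later). It is an illustration of the concept, not Aspire's actual implementation, and for simplicity it treats every WaitFor and WithReference relationship in the example above as a "starts after" edge.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it must start after.
dependencies = {
    "postgres": set(),
    "user": {"postgres"},
    "catalog": {"postgres"},
    "gateway": {"user", "catalog"},
}


def startup_order(deps: dict) -> list:
    """Return one valid start order; raises CycleError on circular waits."""
    return list(TopologicalSorter(deps).static_order())


order = startup_order(dependencies)
print(order)
```

The database always comes first and the gateway last, while user and catalog can appear in either order because neither depends on the other.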

Port Conflicts and Configuration Drift

The Problem: Developers often encounter the dreaded “Port 5000 is already in use” or spend time editing configuration files to avoid conflicts. Over time, local setups diverge across the team, making onboarding and debugging harder.

How Aspire Solves It: Aspire dynamically manages ports and configuration at runtime. Each service gets a unique port assignment, and Aspire automatically shares connection information across services.

You can still set explicit ports when needed:

builder.AddProject<Projects.Frontend>("frontend") // assumes a referenced project named Frontend
       .WithHttpEndpoint(port: 5173);

This removes guesswork, keeps environments consistent, and ensures new developers can clone the repo and start everything without editing config files.
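The general idea behind dynamic port assignment can be shown by asking the operating system for a free port instead of hard-coding one. This is a minimal sketch of that technique, not a description of how Aspire allocates ports internally, and the service names are hypothetical.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Hypothetical service names; each gets whatever port the OS hands out.
ports = {name: find_free_port() for name in ("catalog", "order", "frontend")}
print(ports)
```

Because the socket is closed before the port is used, another process could claim it in the meantime. An orchestrator avoids that race by owning the allocation and passing the chosen port straight to the service.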

Simplifying New Developer Onboarding

The Problem: For many teams, onboarding means following a long README with dozens of setup steps, manual database migrations, and environment variable configurations. It can take hours, or even days, before a new developer can run the system locally.

How Aspire Solves It: Aspire defines your entire environment in code. That means the setup process becomes as simple as cloning the repository and running one command:

dotnet run

Aspire will start all necessary services, configure dependencies, and bring up the dashboard for visibility. This transforms onboarding from a multi-hour process into something that can be completed in minutes, with far fewer setup issues.

Improving Debugging and Cross-Service Visibility

The Problem: Debugging in microservices often means jumping between logs, tracing requests across multiple services, or reproducing issues that only appear when several services run together.

How Aspire Solves It: With built-in observability and the Aspire dashboard, you can view logs across all services in one place, inspect health checks and metrics, and trace requests using OpenTelemetry. This makes it much easier to identify issues across service boundaries and speeds up debugging, especially during integration testing or local development.

Running Optional or External Services

The Problem: Sometimes you don’t need to run every service locally. For example, you might connect to a shared staging API or external dependency instead of running a local instance.

How Aspire Solves It: Aspire lets you make services optional using conditional checks:

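if (Directory.Exists("../Frontend"))
{
    // Path-based overload, so the app host builds even when the folder is absent.
    // The .csproj file name is an assumption.
    builder.AddProject("frontend", "../Frontend/Frontend.csproj");
}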

This makes your setup flexible: you can run a minimal environment for development or a full environment for integration testing, all using the same configuration.

Why These Scenarios Matter

Each of these examples solves a specific friction point in the developer experience: startup complexity, environment drift, onboarding time, and debugging difficulty.

By automating orchestration and configuration, Aspire frees developers from repetitive setup work and lets them focus on building features instead of managing infrastructure.

Going Further

Once you’re comfortable with Aspire’s basics, you can extend it beyond local orchestration to streamline other parts of your workflow.

Integrate front-end applications
Orchestrate React, Angular, or Node.js apps alongside your .NET services for a unified full-stack setup.
Export telemetry data
Send Aspire’s OpenTelemetry output to platforms like Grafana, Jaeger, or Azure Application Insights for deeper analysis.
Use Aspire in CI/CD pipelines
Bring up full environments for integration or smoke testing during continuous integration runs, all using your existing Aspire configuration.
Explore community examples
Check out the official Aspire samples and templates for advanced orchestration patterns, cloud integration, and observability setups.

Key Takeaways and When to Use .NET Aspire

As we’ve seen throughout this guide, .NET Aspire isn’t just another developer tool. It’s a framework built specifically to improve developer experience in microservices-based applications.

By orchestrating all your services in a consistent, declarative way, Aspire helps teams reduce friction, speed up setup, and make local environments more reliable and observable.

Key Takeaways

Developer Experience (DX) matters as your system grows.
Microservices introduce flexibility and scalability, but they also add complexity: multiple services, ports, dependencies, and startup sequences. Without good orchestration, DX quickly degrades.
Aspire simplifies orchestration for local development.
It automatically handles service startup, dependencies, configuration sharing, and observability, all defined in code, right within your .NET solution.
The Aspire dashboard improves visibility.
You get a centralized, real-time view of your entire system (services, logs, health, and endpoints), eliminating the need for multiple terminals or manual tracking.
Onboarding new developers becomes faster and smoother.
A single dotnet run command can spin up your entire development environment, reducing setup time from hours or days to minutes.
Built-in observability means better debugging and confidence.
With OpenTelemetry integrated out of the box, developers can trace requests, monitor performance, and diagnose issues across services with minimal setup.

When (and When Not) to Use .NET Aspire

Use Aspire when:

Aspire makes sense if you're building .NET microservices and tired of complex local setup. It's especially valuable when your team is dealing with environment drift, slow onboarding, or startup sequences that feel like juggling. If you want one command to spin up your entire system, with observability built in from day one, Aspire is worth trying.

You might not need Aspire when:

Aspire might not be worth it if your current setup already works well. Maybe you're using Kubernetes or Docker Compose locally and everything runs smoothly. Or you're building a monolith or single service that doesn't need orchestration. Or your stack has a lot of non-.NET components that would need custom wiring. If your local development is already simple and stable, don't fix what isn't broken.

In other words:
Aspire shines in the local development and onboarding phase, helping developers build, test, and iterate on distributed systems with minimal friction.
It’s not meant to replace production orchestrators like Kubernetes but to complement them by improving the developer’s day-to-day workflow.

Conclusion

Developer Experience is often overlooked when teams move to microservices, but it directly impacts productivity, quality, and morale. By using .NET Aspire, you can bring order, visibility, and simplicity back to your local development environment.

If you’re looking to streamline your microservices workflow, give Aspire a try. You’ll spend less time fighting your setup and more time building what actually matters: great software.

Ready to get started? Check out the official .NET Aspire documentation or clone one of the sample projects to see it in action.

If you made it to the end of this tutorial, thanks for reading! You can also connect with me on LinkedIn if you’d like to stay in touch.

How to Debug Kubernetes Pods with Traceloop: A Complete Beginner's Guide

Opaluwa Emidowojo — Fri, 29 Aug 2025 16:09:24 +0000

Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional kubectl commands show you logs and statuses, but they can't tell you exactly what your application was doing at the system level when things went wrong.

What if you had a flight recorder for your applications, something that captures every system call in real-time, so you can "rewind" and see the exact sequence of events that led to a crash? That's what Traceloop does. It continuously traces system calls in your pods, giving you a detailed replay of what happened before, during, and after issues occur.

In this guide, you’ll learn how to use Traceloop's system call tracing to debug pod issues that would otherwise be nearly impossible to diagnose.

Prerequisites

Before we begin, here are some prerequisites – things you’ll need to know and have:

Basic Kubernetes concepts: Understanding of pods, deployments, services, and namespaces
kubectl fundamentals: Comfortable with commands like kubectl get, kubectl describe, kubectl logs, and kubectl exec
Container basics: Understanding how containerized applications work
Basic Linux concepts: Understanding of processes and system calls (helpful, but we'll explain as we go)

Technical Requirements

Kubernetes cluster access: Local (minikube, kind, Docker Desktop) or cloud-based cluster
kubectl installed and configured to connect to your cluster
Sufficient permissions (cluster admin or equivalent RBAC) to:
- Install and run eBPF-based tools (Traceloop uses eBPF)
- Create/modify pods and deployments
- Access pod logs and system-level data
Linux-based Kubernetes nodes: Most clusters already run on Linux.

System Requirements

Extended Berkeley Packet Filter (eBPF) support: Used for tracing and monitoring at the kernel level. Kernel version 5.10+ recommended.
Sufficient cluster resources: Traceloop runs alongside your applications

What is Traceloop?
How Traceloop Works
How to Set Up Traceloop
Your First Trace: Hands-On Tutorial
Step-by-Step Debugging Walkthrough
Real-World Debugging Scenarios
Best Practices
Conclusion

What is Traceloop?

Traceloop is a system call tracing and observability tool that works across containerized environments, from Docker containers running locally to pods in production Kubernetes clusters. But before we discuss what that means, let's talk about why system calls matter for debugging.

Every time your application does anything (like opening a file, making a network request, allocating memory, or crashing), it has to interact with the operating system through system calls. These are the fundamental building blocks of how any program interacts with the world around it.

Here's where traditional debugging falls short: when your container crashes, the logs might tell you "segmentation fault" or "out of memory," but they don't tell you the sequence of events that led there. Did the application try to access a file that didn't exist? Was it making network calls that failed? Did it run out of file descriptors?

Traceloop captures this missing piece. It sits at the kernel level using eBPF technology, recording every system call your application makes in real-time. Think of it as installing a dashcam in your application. It's always recording with minimal resources, and when something goes wrong, you have the footage.

Strace is another popular debugging tool, but you have to attach it once you already know there's a problem. Traceloop, by contrast, can run continuously in the background with minimal overhead. If your container crashes at 3am, you can immediately "rewind the tape" and see exactly what system calls happened leading up to the crash.

This helps debug intermittent issues that happen randomly in production but never when you are watching. Because Traceloop is always recording, you finally have visibility into what your application was doing when these mysterious failures occur.

How Traceloop Works

Now that you understand what Traceloop does, let's look under the hood at how it captures and processes system calls in your containerized environments.

The Technical Foundation

Traceloop is built on eBPF, a technology that allows programs to run safely in the Linux kernel without changing kernel code. Think of eBPF as a way to install "hooks" directly into the kernel that can observe everything happening on your system with minimal performance impact.

Unlike traditional monitoring tools that work from userspace, eBPF programs run in kernel space, giving them access to system calls as they happen, without relying on the application logging appropriate error messages. This is why Traceloop can capture events that never make it to application logs, like failed system calls or crashes that happen before the application can write anything.

The Flight Recorder Architecture

Traceloop uses eBPF maps as an overwriteable ring buffer. Imagine a tape recorder that continuously records over itself. It's always capturing system calls, but it only keeps the most recent data in memory. When something goes wrong, the recording automatically preserves what happened leading up to the incident, just like an airplane's flight recorder after a crash.

This approach solves the production debugging problem: you don't need to predict when issues will happen or attach debuggers after the fact. The recording is always running, waiting for you to need it.
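To make the overwriting behavior concrete, here is a minimal userspace sketch of a flight recorder in Python. It is only an analogy: the FlightRecorder class and its capacity are hypothetical, and Traceloop's real buffer lives in eBPF maps inside the kernel, not in a Python deque.

```python
from collections import deque


class FlightRecorder:
    """Keeps only the most recent `capacity` events, overwriting the oldest."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is appended
        # to a full buffer, which is the "tape that records over itself" idea.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)

    def dump(self):
        # Return a snapshot of what is still on the "tape", oldest first.
        return list(self._events)


recorder = FlightRecorder(capacity=3)
for syscall in ["openat", "read", "write", "close", "exit_group"]:
    recorder.record(syscall)

print(recorder.dump())  # ['write', 'close', 'exit_group']
```

Only the last three events survive, which is why the size of the buffer determines how far back you can "rewind" after an incident.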

System Call Capture Flow

Here's how Traceloop captures and processes system calls across your Kubernetes environment:

Application pods generate system calls through normal operation – opening files, making network connections, allocating memory.
eBPF probes (also called hooks) observe these system calls at the kernel level as they enter and exit the kernel, which is how both the arguments and the return value are captured.
Traceloop recorder captures the events, buffers them, and adds container context using Inspektor Gadget enrichment (pod name, namespace, container ID).
Output stream formats the data and makes it available for analysis in real-time or after an incident.
Traceloop user views and analyzes the captured trace to diagnose the root cause of issues.

Below is a visual representation of the flow. The key advantage is that Traceloop sees everything your application does, even actions that fail silently or happen too quickly for traditional logging to catch. This gives you complete visibility into your application's interaction with the operating system.

Container Isolation and Context

One of Traceloop's strengths is understanding containerized environments. It doesn't just capture raw system calls – it adds context about which pod, container, and namespace generated each call. This means you can trace specific applications without getting overwhelmed by system calls from other containers running on the same node.

This container awareness makes Traceloop particularly powerful in Kubernetes environments where you might have dozens of pods running on a single node, but you only care about debugging one specific application.
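The value of that context is easiest to see with a small sketch. The SyscallEvent type and select function below are hypothetical and only illustrate the idea of filtering enriched events by namespace and pod; they are not Inspektor Gadget's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyscallEvent:
    # Hypothetical shape of an enriched event: the raw syscall plus
    # the Kubernetes context that was attached to it.
    namespace: str
    pod: str
    container: str
    syscall: str
    ret: int


def select(events, namespace, pod=None):
    """Keep only events from one namespace, optionally narrowed to one pod."""
    return [
        e for e in events
        if e.namespace == namespace and (pod is None or e.pod == pod)
    ]


events = [
    SyscallEvent("shop", "cart-7d9", "cart", "openat", 3),
    SyscallEvent("shop", "pay-5c1", "pay", "connect", -111),
    SyscallEvent("kube-system", "coredns-abc", "coredns", "recvfrom", 64),
]

for e in select(events, namespace="shop", pod="pay-5c1"):
    print(e.pod, e.syscall, e.ret)  # pay-5c1 connect -111
```

Without the namespace and pod fields, all three events would look like anonymous activity on the same node.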

How to Set Up Traceloop

Before we can start tracing system calls, we need to set up Traceloop in your Kubernetes environment. Traceloop is part of the Inspektor Gadget ecosystem, which provides flexibility in how you use it.

Installation Overview

This setup:

Deploys Inspektor Gadget components to all worker nodes
Eliminates the download and initialization overhead on each use, as components are pre-loaded and ready
Eliminates the need to reinstall or reconfigure for each debugging session – just run your traces immediately
Requires cluster admin permissions
Works best for teams doing regular debugging

Installation Requirements

First, ensure your cluster meets the requirements:

Kubernetes cluster with Linux nodes
eBPF support
kubectl installed and configured
Cluster admin permissions

Install kubectl gadget

The recommended way is using krew (kubectl plugin manager):

# Install krew if you don't have it
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-linux_amd64.tar.gz"
tar zxvf krew-linux_amd64.tar.gz
./krew-linux_amd64 install krew
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"

# Install kubectl gadget
kubectl krew install gadget

Alternatively, you can install directly:

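# For Linux (amd64); release archives include the version in their file name (requires jq)
IG_VERSION=$(curl -s https://api.github.com/repos/inspektor-gadget/inspektor-gadget/releases/latest | jq -r .tag_name)
curl -sL "https://github.com/inspektor-gadget/inspektor-gadget/releases/download/${IG_VERSION}/kubectl-gadget-linux-amd64-${IG_VERSION}.tar.gz" | sudo tar -C /usr/local/bin -xzf - kubectl-gadget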

# Verify installation
kubectl gadget version

Deploy Inspektor Gadget to Your Cluster

Deploy the Inspektor Gadget components to your cluster:

kubectl gadget deploy

This installs the necessary DaemonSets and RBAC configurations that allow gadgets like Traceloop to run on your cluster nodes.

Alternatively, you can also deploy using Helm.

Verify Installation

Check that the gadget pods are running:

kubectl get pods -n gadget

You should see gadget pods running on each node in your cluster.

Your First Trace: Hands-On Tutorial

Now let's capture our first system call trace. We'll create a simple scenario and watch what happens at the system level.

Setting Up the Test Environment

First, create a dedicated namespace for our tracing experiments:

kubectl create ns test-traceloop-ns

Expected output:

namespace/test-traceloop-ns created

Next, create a simple pod that we can interact with:

kubectl run -n test-traceloop-ns --image busybox test-traceloop-pod --command -- sleep inf

Expected output:

pod/test-traceloop-pod created

This creates a BusyBox container that sleeps indefinitely, giving us a stable target for tracing.

Starting Your First Trace

Next, start tracing system calls for our test pod:

kubectl gadget run traceloop:latest --namespace test-traceloop-ns

This command starts the flight recorder. You'll see column headers showing what information Traceloop captures:

K8S.NODE    K8S.NAMESPACE    K8S.PODNAME    K8S.CONTAINERNAME    CPU    PID    COMM    SYSCALL    PARAMETERS    RET

The trace is now running in this terminal, continuously recording system calls from our pod. Leave it running while you work in a second terminal.

Generating System Calls

With the trace running, let's generate some activity. In a new terminal window, run a command inside your test pod:

kubectl exec -ti -n test-traceloop-ns test-traceloop-pod -- /bin/sh

Once inside the container, run some basic commands:

ls /
echo "Hello World" > /tmp/test.txt
cat /tmp/test.txt

Collecting the Trace

Back in your original terminal where Traceloop is running, press Ctrl+C to stop the recording and see the captured system calls.

You'll see output similar to this:

K8S.NODE            K8S.NAMESPACE        K8S.PODNAME          K8S.CONTAINERNAME    CPU  PID    COMM  SYSCALL      PARAMETERS                   RET
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    openat       dfd=-100, filename="/lib"    3
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    getdents64   fd=3, dirent=0x...          201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    write        fd=1, buf="bin dev etc..."   201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    exit_group   error_code=0                 0

Understanding Your First Trace

Let's break down what we're seeing:

K8S.PODNAME: Which pod generated these system calls
PID: Process ID of the command that ran
COMM: The command name (ls, echo, cat)
SYSCALL: The actual system call made (openat, write, exit_group)
PARAMETERS: Arguments passed to the system call
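RET: Return value (zero or a positive number, such as a file descriptor or byte count, means success; a negative number is a negated errno code, for example -13 for EACCES)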

This trace shows the ls command opening the /lib directory, reading directory entries, writing the output to stdout, and exiting successfully.
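If you save this output to a file, a few lines of Python can turn it into records you can filter. The sketch below assumes the column layout shown above, with columns separated by at least two spaces and no double spaces inside PARAMETERS. Real output may be laid out differently, so treat parse_trace as a hypothetical helper rather than a supported parser.

```python
import re

# Columns in the sample output are separated by runs of two or more spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_trace(text):
    """Turn a column-formatted trace into a list of dicts keyed by header name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = COLUMN_GAP.split(lines[0].strip())
    rows = []
    for line in lines[1:]:
        values = COLUMN_GAP.split(line.strip())
        if len(values) != len(header):
            continue  # skip rows that don't line up with the header
        rows.append(dict(zip(header, values)))
    return rows


sample = """\
K8S.NODE         K8S.NAMESPACE      K8S.PODNAME         K8S.CONTAINERNAME   CPU  PID    COMM  SYSCALL     PARAMETERS                  RET
minikube-docker  test-traceloop-ns  test-traceloop-pod  test-traceloop-pod  2    95419  ls    openat      dfd=-100, filename="/lib"   3
minikube-docker  test-traceloop-ns  test-traceloop-pod  test-traceloop-pod  2    95419  ls    exit_group  error_code=0                0
"""

rows = parse_trace(sample)
for row in rows:
    print(row["COMM"], row["SYSCALL"], row["RET"])
```

Each row becomes a dictionary, so later analysis (counting syscalls, finding failures) can work on field names instead of string positions.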

Clean Up

Remove the test resources:

kubectl delete pod test-traceloop-pod -n test-traceloop-ns
kubectl delete ns test-traceloop-ns

You can now see exactly what your applications are doing at the kernel level, something that traditional logs and kubectl commands can't show you.

Let's try this with an application that crashes.

Step-by-Step Debugging Walkthrough

Now that you know how to capture traces, let's take a look at a real debugging scenario. We'll create an application that crashes and use Traceloop to uncover the root cause, something that would be nearly impossible with traditional kubectl debugging.

The Scenario: A Mysterious Crash

Let's create a Python application that has a subtle bug. It tries to write to a file it doesn't have permission to access, then crashes. This mimics real-world scenarios where applications fail due to permission issues, missing files, or resource constraints.

Setting Up the Problematic Application

First, we’ll create a new namespace for our debugging exercise:

kubectl create ns debug-traceloop-ns

Now, let's create a pod with an application that will crash:

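# python:3.9-slim runs as root by default, and root can write to /etc/passwd,
# so run the container as a non-root UID to make the write fail
kubectl run -n debug-traceloop-ns crash-app --image=python:3.9-slim --restart=Never \
  --overrides='{"apiVersion": "v1", "spec": {"securityContext": {"runAsUser": 1000}}}' \
  -- python3 -c "
import time
print('App starting...')
time.sleep(5)
print('Trying to write to restricted file...')
try:
    with open('/etc/passwd', 'w') as f:
        f.write('malicious content')
except Exception as e:
    print(f'Error: {e}')
    exit(1)
"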

This creates a pod that will:

Start successfully
Try to write to /etc/passwd (a restricted system file)
Fail and crash with exit code 1

Starting the Trace Before the Crash

Here's the key difference from traditional debugging. We start tracing before we know there's a problem. In a real scenario, you'd have Traceloop running continuously.

kubectl gadget run traceloop:latest --namespace debug-traceloop-ns

The trace starts recording immediately. You'll see the column headers, and the flight recorder is now capturing every system call.

Observing the Application Behavior

In another terminal, check the pod status:

kubectl get pods -n debug-traceloop-ns -w

You'll see the pod go through these states:

Pending → Running → Error

Because the pod was created with --restart=Never, it stays in Error instead of restarting into CrashLoopBackOff.

Traditional debugging would show you:

kubectl logs -n debug-traceloop-ns crash-app

Output:

App starting...
Trying to write to restricted file...
Error: [Errno 13] Permission denied: '/etc/passwd'

But this doesn't tell you exactly what the application tried to do at the system level.

Collecting and Analyzing the Trace

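Back in your Traceloop terminal, press Ctrl+C to stop the recording. A real trace also contains many calls from the Python interpreter starting up, so the output below is abridged to the most relevant system calls: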

K8S.NODE        K8S.NAMESPACE      K8S.PODNAME  COMM    SYSCALL    PARAMETERS                           RET
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 write      fd=3, buf="App starting..."         16
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 exit_group error_code=1                        0

Reading the System Call Story

The trace reveals the exact sequence of events:

openat filename="/etc/passwd" RET=-13: The application tried to open /etc/passwd for writing
- Return code -13 = EACCES (Permission denied)
write buf="App starting...": Normal logging output (successful)
openat filename="/etc/passwd" RET=-13: Second attempt to open the restricted file (still denied)
exit_group error_code=1: Application exits with error code 1
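You don't have to memorize errno numbers to read the RET column. As a small sketch, Python's standard errno and os modules can translate a negative return value into its symbolic name and message. The describe_ret helper is hypothetical, and the numbers only match the trace when you run it on Linux, where the trace was recorded.

```python
import errno
import os


def describe_ret(ret):
    """Explain a syscall return value as shown in the RET column."""
    if ret >= 0:
        # Zero or positive values (file descriptors, byte counts) mean success.
        return "success"
    code = -ret  # the kernel reports failures as negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(describe_ret(-13))  # on Linux: EACCES: Permission denied
print(describe_ret(-2))   # on Linux: ENOENT: No such file or directory
print(describe_ret(3))    # success
```

The same helper works for the codes used later in this handbook, such as -111, -12, -24, and -110, as long as it runs on a Linux machine.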

What Traceloop Revealed

Traditional debugging told us "Permission denied" but Traceloop shows us:

Exactly which file the application tried to access
When the permission denial happened in the execution flow
How many times it tried (twice in this case)
The exact system call that failed (openat)

Real-World Applications

This same approach works for debugging:

File not found errors: See exactly which files your app is looking for
Network connection failures: Observe failed connect() system calls with specific addresses
Memory issues: Watch mmap() and brk() calls that fail
Container startup problems: See which system calls fail during initialization

Clean Up

Remove the test resources:

kubectl delete pod crash-app -n debug-traceloop-ns
kubectl delete ns debug-traceloop-ns

Key Takeaway

Traditional Kubernetes debugging shows you what went wrong after it happened. Traceloop's continuous recording shows you exactly how it went wrong at the system level. This level of detail is invaluable for debugging complex production issues where the logs don't tell the full story.

Real-World Debugging Scenarios

Now that you understand the fundamentals, let's explore common production issues and how Traceloop helps diagnose them. These scenarios mirror real problems you'll encounter in Kubernetes environments.

Scenario 1: Container Startup Failures

The problem: Your pod gets stuck in CrashLoopBackOff with unhelpful logs.

Traditional kubectl commands show limited information:

kubectl describe pod failing-app
# Events: Back-off restarting failed container

kubectl logs failing-app
# (Empty or minimal output)

System calls show the application tried to:

Access configuration files that don't exist
Connect to services that aren't available
Write to directories without proper permissions

Key system calls to watch:

openat with -2 return (file not found)
connect with -111 return (connection refused)
access with -13 return (permission denied)

Scenario 2: Memory and Resource Issues

The problem: Application performance degrades or gets OOMKilled.

What Traceloop shows:

mmap calls failing (memory allocation issues)
brk system calls indicating heap growth
File descriptor exhaustion through failed openat calls
Excessive write calls indicating memory pressure

Example pattern:

SYSCALL    PARAMETERS           RET
mmap       length=1048576       -12  # ENOMEM - out of memory
brk        brk=0x55555557d000   0    # Heap expansion
openat     filename="/tmp/..."   -24  # EMFILE - too many open files

Scenario 3: Network Connectivity Problems

The problem: Service-to-service communication fails intermittently.

Traditional debugging limitations:

Application logs show "connection timeout"
Network policies seem correct
DNS resolution appears to work

What Traceloop reveals:

Exact IP addresses and ports being attempted
DNS resolution patterns through openat on /etc/resolv.conf
Failed connect calls with specific error codes
Socket creation and binding issues

Key indicators:

SYSCALL    PARAMETERS                    RET
socket     family=AF_INET, type=SOCK     3
connect    fd=3, addr=10.96.0.1:443     -110  # ETIMEDOUT
close      fd=3                         0

Scenario 4: Configuration and Secret Issues

The problem: Application can't access mounted secrets or config maps.

What system calls reveal:

File access patterns for mounted volumes
Permission checks on secret files
Configuration file parsing attempts

Common patterns:

Multiple openat attempts on different config file paths
access calls checking file permissions before opening
Failed reads from mounted secret volumes

Scenario 5: Performance Bottlenecks

The problem: Application response times are slow without obvious cause.

Traceloop analysis:

Excessive fsync calls (disk I/O bottlenecks)
Many futex calls (lock contention)
Frequent recvfrom timeouts (network issues)
Repeated file system operations

Performance indicators:

SYSCALL     FREQUENCY    ISSUE
fsync       High         Disk I/O bottleneck
futex       Excessive    Lock contention
poll        Many         Waiting for I/O
recvfrom    Timeouts     Network delays
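Spotting "high" or "excessive" frequencies by eye is hard in a long trace. As a rough sketch, you can count how often each system call appears once the trace has been parsed into rows. The rows list below is made-up sample data, and syscall_frequency is a hypothetical helper that assumes each row is a dict with a SYSCALL key.

```python
from collections import Counter


def syscall_frequency(rows):
    """Count how many times each system call appears in a parsed trace."""
    return Counter(row["SYSCALL"] for row in rows)


# Made-up sample rows standing in for a parsed trace.
rows = [
    {"SYSCALL": "futex"},
    {"SYSCALL": "futex"},
    {"SYSCALL": "fsync"},
    {"SYSCALL": "futex"},
    {"SYSCALL": "poll"},
]

frequency = syscall_frequency(rows)
print(frequency.most_common(1))  # [('futex', 3)]
```

What counts as "too many" depends on the workload, so compare the counts against a trace taken while the application was healthy.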

Best Practices

When to Use Traceloop

Traceloop is most useful when you’re dealing with the kinds of problems that are notoriously difficult to pin down. If you’ve ever struggled with debugging intermittent crashes that don’t happen on demand, or run into confusing permission and access issues, this is where it works best.

It also helps uncover performance bottlenecks at the system level and provides visibility into application behavior during tricky startup failures. Another common use case is diagnosing network connectivity problems between pods, where other tools usually can't help.

Of course, not every problem requires system call tracing. For application-level issues, logs and APM tools are more effective. Cluster-level concerns are often better handled with kubectl describe or by looking at events, and if you’re primarily monitoring resources, standard metrics and dashboards show you what's happening.

Performance Considerations

Like any tracing tool, Traceloop adds some overhead, but you can keep it low by narrowing the scope of your traces: for example, by filtering by namespace with --namespace specific-ns or by targeting specific pods with --podname target-pod. In high-traffic environments, it’s best to run traces for shorter periods, and node-specific tracing can further isolate debugging when you don’t want to instrument the entire cluster.

In most cases, Traceloop uses very little CPU and memory, thanks to its eBPF-based approach. This makes it lighter than traditional tools like strace. The actual cost depends on the volume of system calls being recorded, so it’s a good practice to monitor resource usage in your own environment to confirm it’s operating within acceptable limits.

Integration with Your Workflow

Traceloop works well in dev and production workflows. In development, it’s a powerful way to understand how your application interacts with the system. You can use it to confirm that your app handles edge cases correctly, or to validate permission and resource configurations before promoting workloads into production.

In production environments, you can deploy it in different ways. Depending on how much overhead you're okay with, some teams run it continuously on a small subset of nodes, while others use it only when traditional debugging methods don’t provide enough insight. Pairing Traceloop with your existing monitoring and logging stack can give you a much more complete picture of system behavior.

It also helps with teamwork. Sharing trace outputs makes it easier for teams to reason about complex issues together. The data it provides can guide improvements in error handling and logging, and documenting common system call patterns can help onboard new developers more quickly.

Security Considerations

Because Traceloop records low-level system activity, you need to be mindful of what it captures.

What Traceloop Can See:

System call parameters (such as filenames and network addresses)
Process information and command arguments
File access patterns and permissions

Privacy Measures:

Limit trace duration to minimize data collection
Use namespace isolation to avoid capturing unrelated workloads
Apply data retention policies for trace outputs
Watch for sensitive information in file paths or system call parameters
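One way to act on the last point is to mask sensitive paths before sharing a trace. The sketch below is an assumption-heavy illustration: the SENSITIVE_PREFIXES list and the redact helper are hypothetical, and they only handle quoted values in the PARAMETERS column as it appears in the examples in this handbook.

```python
import re

# Hypothetical list of path prefixes you don't want to appear in shared traces.
SENSITIVE_PREFIXES = ("/var/run/secrets/", "/etc/ssl/private/")

# Matches any double-quoted value, such as filename="/some/path".
QUOTED = re.compile(r'"([^"]*)"')


def redact(parameters):
    """Replace quoted values that start with a sensitive prefix."""
    def _mask(match):
        value = match.group(1)
        if value.startswith(SENSITIVE_PREFIXES):
            return '"[REDACTED]"'
        return match.group(0)  # leave everything else untouched

    return QUOTED.sub(_mask, parameters)


print(redact('dfd=-100, filename="/var/run/secrets/kubernetes.io/serviceaccount/token"'))
# dfd=-100, filename="[REDACTED]"
print(redact('dfd=-100, filename="/lib"'))
# dfd=-100, filename="/lib"
```

Redaction after the fact complements, but does not replace, limiting what you trace in the first place.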

Conclusion

Traceloop doesn’t just tell you something went wrong – it shows you how. By recording every system call in real time, it turns mysterious Kubernetes failures into solvable problems. Whether the issue happened seconds ago or in the middle of the night, the tool gives you the ability to rewind, inspect, and respond with confidence.

When to Use It

Keep in mind that Traceloop complements your existing debugging toolkit rather than replacing it. Reach for it when logs don’t tell the whole story, when intermittent problems are hiding in the shadows, when kubectl commands leave you guessing, or when you need to see how your application is really interacting with the system.

Once you’re comfortable with Traceloop, you can add more tools. Inspektor Gadget offers other tools for network, security, and performance debugging that pair well with Traceloop. Integrating it into your incident response workflow, sharing insights across your team, and even considering continuous tracing for critical workloads are good things to try next.

The next time you run into a stubborn Kubernetes pod failure, you won’t be stuck speculating. With Traceloop, you can “rewind the tape” and see exactly what happened. System call tracing may sound complex at first, but in practice, it’s one of the most powerful ways to truly understand how applications behave in containerized environments.

PS: Have any questions about Traceloop or want to share your debugging challenges? The Inspektor Gadget team and community hang out in the #inspektor-gadget channel on Kubernetes Slack. It's a great place to get help from the engineers who built these tools, share experiences, and maybe even contribute to making the ecosystem even better.

You can also connect with me on LinkedIn if you’d like to stay in touch. If you made it to the end of this tutorial, thanks for reading!

How to Debug CI/CD Pipelines: A Handbook on Troubleshooting with Observability Tools

Opaluwa Emidowojo — Mon, 16 Jun 2025 23:34:03 +0000

Observability is a game-changer for CI/CD pipelines, and it’s one of the most exciting aspects of DevOps. When I started working with CI/CD systems, I assumed the hardest part would be building the pipeline. But with increasingly complex setups, the real challenge is debugging failures, like builds crashing or tests failing only in production.

Observability tools, such as logs, metrics, and traces, provide the visibility you need to pinpoint issues quickly. In this handbook, we’ll explore free and open-source tools you can use to make your CI/CD pipelines more reliable. We’ll use practical steps to troubleshoot like a pro – no enterprise licenses required.

Prerequisites
Why Observability is Important
How to Install and Configure Grafana Loki on Budget Infrastructure
How to Implement an ELK Stack Alternative for Pipeline Observability
How to Create a Unified Logging Strategy Across Pipeline Components
How to Query and Analyze Logs for Effective Troubleshooting
How to Set Up Prometheus Metrics Alongside Your Logs
How to Create Grafana Dashboards That Combine Metrics and Logs
How to Use Exemplars to Jump from Metrics to Relevant Logs
How to Diagnose and Fix Common CI/CD Problems
How to Implement Advanced Debugging Techniques
How to Conduct Effective Postmortems Using Logs
How to Optimize Log Storage and Management
Conclusion

Prerequisites

There are some things you should know and have to get the most out of this handbook:

Technical Knowledge:

Basic understanding of CI/CD pipelines (for example, build, test, deploy stages).
Familiarity with Linux/Unix commands (for example, mkdir, grep, curl).
Comfortable with Docker basics (for example, docker run, docker-compose up).
Optional: Awareness of observability concepts (logs, metrics, traces) or YAML configuration.

Software and Tools:

Docker and Docker Compose: Installed and running (verify with docker --version and docker-compose --version).
CI/CD Platform: Access to GitHub Actions, Jenkins, or GitLab CI with a sample pipeline that generates logs.
Text Editor: For editing YAML files (for example, VS Code, Nano).
Web Browser: To access tool UIs (for example, Grafana on port 3000, Kibana on 5601).
Optional: curl for testing log forwarding, Git for version control.

Hardware and Infrastructure:

Machine with:
- OS: Linux, Windows (with WSL2), or macOS.
- 4GB RAM (8GB recommended), 20GB free disk space.
- Stable internet and ability to open ports (for example, 3100 for Loki, 9200 for Elasticsearch).
Optional: Cloud provider access (for example, AWS, GCP) for scalable setups.

Access and Permissions:

Admin access to install Docker and configure CI/CD tools.
Permissions to modify pipeline configs (for example, .github/workflows, .gitlab-ci.yml).
Optional: Container registry access (for example, Docker Hub) for custom images.

Why Observability is Important

Modern CI/CD pipelines are no longer linear scripts – they are now complex, distributed systems involving multiple tools, environments, and infrastructure layers. One job runs on GitHub Actions, another deploys via Jenkins, and a third builds Docker images in a Kubernetes cluster.

So when something breaks, you’re left chasing logs across tools, guessing where the issue originated, and wasting hours trying to reproduce it.

And worse still, traditional debugging tools often stop at the surface, only showing failed jobs without the context of why they failed or where in the system the fault actually lies.

Observability flips the script. Instead of hunting through disconnected logs or rerunning failed builds blindly, observability gives you insight, not just data. By combining structured logs, metrics, and traces, you can:

Reconstruct exactly what happened in a pipeline failure
Trace a failure across CI agents, deployment steps, and containers
Visualize patterns and anomalies before they become outages

More importantly, observability helps you move from reactive debugging to proactive prevention.

Here’s what you’ll learn about and accomplish in this guide:

Set up cost-effective observability using Grafana Loki, lightweight ELK, and OpenTelemetry
Create a unified logging strategy to connect your pipeline
Write precise queries to pinpoint root causes quickly, and correlate logs, metrics, and traces for comprehensive debugging
Troubleshoot CI/CD issues like build failures, flaky tests, and container crashes
Build custom dashboards and automated diagnostic tools
Promote observability through documentation and post-mortems

Whether you're a solo developer or part of a DevOps team, this guide will transform your chaotic CI/CD pipelines into clear, reliable, and observable systems.

How to Choose the Right Observability Tool for CI/CD

Here’s a quick comparison of Grafana Loki, Lightweight ELK, and Vector for CI/CD observability:

Tool	Resource Usage	Setup Complexity	Best For	CI/CD Fit
Grafana Loki	Low (lightweight)	Easy (Docker-based)	Small teams, budget infra	Simple pipelines, JSON logs, Grafana users
Lightweight ELK	High (Elasticsearch-heavy)	Moderate (multi-container)	Teams needing advanced search/visualization	Complex pipelines, rich querying needs
Vector	Very low	Easy (single binary)	Resource-constrained setups	Minimal setups, log forwarding

How to choose:

Loki: Ideal for startups or solo devs with limited resources. Integrates well with Prometheus/Grafana.
ELK: Best for teams needing Kibana’s advanced visualizations or handling large log volumes.
Vector: Great for lightweight log forwarding in distributed CI/CD setups.

Grafana Loki is a log aggregation system like ELK, but it's more lightweight, and it’s ideal for CI/CD pipelines with limited infrastructure.

How to Install and Configure Grafana Loki on Budget Infrastructure

🛠 Option A: Quick Docker Setup (Recommended for Budget Infra)

Create a directory for configuration:

 mkdir -p ~/loki-setup && cd ~/loki-setup

Create a docker-compose.yml:

 # Defines a Docker Compose setup for Grafana Loki and Promtail to aggregate and scrape logs efficiently.
 version: "3"

 services:
   loki:
     image: grafana/loki:2.9.4  # Uses Loki version 2.9.4 for lightweight log aggregation.
     ports:
       - "3100:3100"  # Exposes Loki’s HTTP API port for log ingestion and queries.
     command: -config.file=/etc/loki/loki-config.yaml  # Specifies the configuration file for Loki.
     volumes:
       - ./loki-config.yaml:/etc/loki/loki-config.yaml  # Mounts the local config file into the container.

   promtail:
     image: grafana/promtail:2.9.4  # Uses Promtail version 2.9.4 to scrape and forward logs to Loki.
     volumes:
       - /var/log:/var/log  # Mounts the host’s log directory for Promtail to scrape.
       - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml  # Mounts the Promtail config file.
     command: -config.file=/etc/promtail/promtail-config.yaml  # Specifies the configuration file for Promtail.

Create a basic loki-config.yaml:

 # Configures Grafana Loki for lightweight log storage and querying in a CI/CD environment.
 auth_enabled: false  # Disables authentication for simplicity (not recommended for production).

 server:
   http_listen_port: 3100  # Sets the port for Loki’s HTTP API.

 ingester:
   lifecycler:
     ring:
       kvstore:
         store: inmemory  # Uses in-memory storage for the ring, suitable for small setups.
       replication_factor: 1  # Sets single replica for minimal resource use.
   chunk_idle_period: 3m  # Flushes chunks to storage after 3 minutes of inactivity.
   max_chunk_age: 1h  # Retires chunks after 1 hour to balance storage and query performance.

 schema_config:
   configs:
     - from: 2023-01-01  # Defines the schema start date.
       store: boltdb-shipper  # Uses BoltDB for indexing logs.
       object_store: filesystem  # Stores logs on the local filesystem.
       schema: v11  # Specifies schema version for log storage.
       index:
         prefix: index_  # Prefix for index files.
         period: 24h  # Rotates indexes daily.

 storage_config:
   boltdb_shipper:
     active_index_directory: /tmp/loki/index  # Directory for active index files.
     cache_location: /tmp/loki/boltdb-cache  # Cache location for BoltDB.
   filesystem:
     directory: /tmp/loki/chunks  # Directory for storing log chunks.

 limits_config:
   enforce_metric_name: false  # Disables strict metric name enforcement for flexibility.

Create a basic promtail-config.yaml:

 # Configures Promtail to scrape system logs and forward them to Loki.
 server:
   http_listen_port: 9080  # Sets Promtail’s HTTP port for metrics and health checks.
   grpc_listen_port: 0  # 0 makes Promtail pick a random free gRPC port, which avoids port conflicts.

 positions:
   filename: /tmp/positions.yaml  # Stores the position of scraped logs to resume after restarts.

 clients:
   - url: http://loki:3100/loki/api/v1/push  # Specifies the Loki endpoint for log ingestion.

 scrape_configs:
   - job_name: system  # Defines a scraping job for system logs.
     static_configs:
       - targets:
           - localhost  # Targets the local host for log collection.
         labels:
           job: varlogs  # Labels logs for easy querying in Loki.
           __path__: /var/log/*.log  # Scrapes all log files in /var/log directory.

Run it:

 # Starts the Loki and Promtail containers in detached mode for background operation.
 docker-compose up -d

✨ This brings up Loki and Promtail with minimal resources, no authentication, and logs scraping from /var/log.

Troubleshooting Loki Setup Issues

If Loki or Promtail fails to start, one of the following may be the issue:

Container crashes: Check logs with docker-compose logs loki or docker-compose logs promtail, run from the ~/loki-setup directory. The Compose file above doesn't set container_name, so a plain docker logs loki won't find the container. Look for errors like “out of memory” or “port already in use.”
- Fix: Increase memory (for example, docker-compose.yml resource limits) or change ports (for example, 3101:3100).
Logs not ingested: Verify that Promtail is scraping the path where your logs actually land (for example, /var/log/gha/*.log) using docker-compose exec promtail cat /etc/promtail/promtail-config.yaml
- Fix: Update __path__ in promtail-config.yaml to match your CI/CD log directory.
Resource Constraints: Monitor resource usage with docker stats or top on the host.
- Fix: Ensure your machine has at least 4GB RAM and 20GB disk space, as specified in the prerequisites.

Configuration for CI/CD Logging

To adapt for CI/CD logs, you should:

1. Configure your CI/CD tools to write logs to disk:

For example, GitHub Actions with a custom runner can write logs to /var/log/gha/*.log.

Update Promtail:

# Configures Promtail to scrape logs from GitHub Actions runners for CI/CD observability.
scrape_configs:
  - job_name: github_actions  # Defines a scraping job for GitHub Actions logs.
    static_configs:
      - targets: ['localhost']  # Targets the local host where the runner writes logs.
        labels:
          job: gha  # Labels logs for identification in Loki queries.
          __path__: /var/log/gha/*.log  # Scrapes logs from the specified directory.

2. Use structured logging (JSON):

Make sure your CI/CD tools or scripts output logs in a structured format.

Example (shown pretty-printed here; in log files, write each entry on a single line). JSON itself does not allow comments, so the fields are described here: timestamp is the UTC time of the entry, level is the severity, job identifies the CI/CD job or stage, and message describes the event.

{
  "timestamp": "2025-05-10T13:00:00Z",
  "level": "error",
  "job": "deploy",
  "message": "Image pull failed"
}

Structured fields can then be extracted at query time with LogQL's json parser, for example {job="gha"} | json | level="error".

How to Connect CI Agents to Loki

This section explains three different ways to get your CI pipeline logs into Loki for monitoring and analysis:

Option 1 – Local setup:

Your CI agents write log files to disk, and Promtail (running on the same machine) reads those files and sends them to Loki.

Option 2 – Using Docker logging driver (Docker containers):

If your CI agents run in Docker containers, you install a special Loki plugin that automatically captures all container output and sends it directly to Loki without needing separate log files.

# Installs the Loki Docker logging driver to send container logs directly to Loki.
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

Then run your agent container:

# Runs a CI agent container with the Loki logging driver to forward logs.
# Replace <loki-host> with the address of your Loki instance.
docker run --log-driver=loki \
  --log-opt loki-url="http://<loki-host>:3100/loki/api/v1/push" \
  my-ci-agent-image

Option 3 – Remote setup:

If you can't install Promtail locally, you can use a log forwarding tool like Fluent Bit or Vector to collect logs and push them to Loki over the network.

The goal: Whichever option you choose, all your CI pipeline logs end up centralized in Loki, where you can search them, build dashboards in Grafana, and set up alerts when things go wrong. Pick the option that matches your infrastructure: local agents, the Docker plugin, or remote forwarding.

How to Implement an ELK Stack Alternative for Pipeline Observability

When full ELK (Elasticsearch, Logstash, Kibana) is too heavy for your infrastructure, you can go with lightweight setups that achieve similar observability at a lower cost and resource usage.

How to Install Lightweight Versions of Elasticsearch, Logstash, and Kibana

Goal: Stand up a minimal yet functional ELK stack for debugging CI/CD pipelines.

1. Use Docker to spin up lightweight containers

Create a docker-compose.yml:

# Defines a Docker Compose setup for a lightweight ELK stack to aggregate and visualize CI/CD logs.
version: '3.7'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0  # Uses Elasticsearch 7.17.0.
    container_name: elasticsearch
    environment:
      - discovery.type=single-node  # Runs Elasticsearch in single-node mode for simplicity.
      - xpack.security.enabled=false  # Disables security features for lightweight setup.
    ports:
      - "9200:9200"  # Exposes Elasticsearch’s HTTP API port.
    volumes:
      - esdata:/usr/share/elasticsearch/data  # Persists Elasticsearch data.

  logstash:
    image: docker.elastic.co/logstash/logstash:7.17.0  # Uses Logstash 7.17.0.
    container_name: logstash
    ports:
      - "5044:5044"  # Port for receiving logs from Beats.
      - "9600:9600"  # Port for Logstash monitoring.
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf  # Mounts Logstash config file.

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.0  # Uses Kibana 7.17.0 for visualization.
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200  # Links Kibana to Elasticsearch.
    ports:
      - "5601:5601"  # Exposes Kibana’s web UI port.

volumes:
  esdata:  # Defines a volume for persisting Elasticsearch data.

2. Minimal Logstash pipeline configuration (logstash.conf)

# Configures Logstash to process and forward CI/CD logs to Elasticsearch.
# Logstash configuration files use # for comments.
input {
  beats {
    port => 5044  # Listens for logs from Filebeat on port 5044.
  }
}

filter {
  json {
    source => "message"  # Parses JSON-formatted log messages for structured data.
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]  # Sends processed logs to Elasticsearch.
    index => "ci-logs-%{+YYYY.MM.dd}"  # Stores logs in daily indexes (e.g., ci-logs-2025.05.14).
  }
}

Troubleshooting ELK Setup Issues

If Elasticsearch, Logstash, or Kibana fails to start, one of the following might be the issue:

Container crashes: Check logs with docker logs elasticsearch, docker logs logstash, or docker logs kibana. Look for errors like “insufficient disk space” or “port conflict” (for example, 9200, 5601).
- Fix: Free up disk space (ensure at least 20GB available) or change ports in docker-compose.yml (for example, 9201:9200).
Logs not ingested: Verify Logstash is receiving data from Filebeat or Vector using docker logs logstash. Check the logstash.conf input port (for example, 5044).
- Fix: Ensure Filebeat or Vector is configured to send to the correct Logstash endpoint (e.g., localhost:5044) and update if needed.
Resource constraints: Monitor resource usage with docker stats or top on the host.
- Fix: Allocate at least 8GB RAM and 30GB disk space, as Elasticsearch requires more resources than Loki. Adjust memory limits in docker-compose.yml if necessary.

How to Configure Log Shippers for Different CI/CD Components

Goal: Get logs from your pipeline into Logstash or Elasticsearch.

Option 1: Use Filebeat (lightweight log shipper)

Install Filebeat on your CI/CD hosts (GitHub runner, Jenkins node, GitLab runner, and so on).

Filebeat config snippet (filebeat.yml):

# Configures Filebeat to collect CI/CD logs and forward them to Logstash.
filebeat.inputs:
  - type: log  # Specifies log file input.
    enabled: true  # Enables the input.
    paths:
      - /var/log/ci/*.log  # Scrapes logs from the specified CI log directory.

output.logstash:
  hosts: ["localhost:5044"]  # Forwards logs to Logstash on port 5044.

Then run:

# Runs Filebeat with the specified configuration file for log collection.
filebeat -e -c filebeat.yml

Option 2: Use Vector.dev as a more resource-efficient alternative to Filebeat

Vector configuration (vector.toml):

# Configures Vector to collect, parse, and forward CI/CD logs to Elasticsearch efficiently.
[sources.ci_logs]
  type = "file"  # Specifies file-based log collection.
  include = ["/var/log/ci/*.log"]  # Targets CI log files.

[transforms.json_parser]
  type = "remap"  # Uses remap transform to parse logs.
  inputs = ["ci_logs"]  # Processes logs from the ci_logs source.
  source = '''
  . = parse_json!(.message)  # Parses JSON log messages into structured data.
  '''

[sinks.to_elasticsearch]
  type = "elasticsearch"  # Sends logs to Elasticsearch.
  inputs = ["json_parser"]  # Uses parsed logs from the json_parser transform.
  endpoint = "http://localhost:9200"  # Specifies the Elasticsearch endpoint.
  index = "ci-logs"  # Stores logs in the ci-logs index.

Run:

# Runs Vector with the specified configuration file for log processing.
vector -c vector.toml

How to Set Up Index Patterns and Basic Visualizations

Goal: Make CI/CD logs queryable and visual in Kibana.

1. Open Kibana (http://localhost:5601)

Go to Stack Management → Index Patterns
Create a new pattern: ci-logs-*
Choose a time field like @timestamp

2. Visualizations for Common CI/CD Use Cases

Bar charts: Number of failed vs passed builds per day
Pie chart: Top error types or most frequent failing test names
Line chart: Duration of builds over time (if duration is logged)

3. Saved Searches & Dashboards

You can save a search like this:

message: "error" AND job_name: "build"

You can also combine visualizations into a CI/CD Health Dashboard.

How to Create a Unified Logging Strategy Across Pipeline Components

Creating a unified logging strategy across your CI/CD pipeline components ensures that logs are consistent, traceable, and easy to correlate. This helps you quickly debug issues, monitor system health, and trace requests across different tools and services. Let’s discuss some key practices for achieving a unified logging strategy:

Implementing Consistent Log Formats Across Different Tools

Consistent log formats are important for various reasons. First of all, a standardized log format enables easier querying, searching, and visualization. It also helps with correlation of logs from different services. And consistency also ensures that all logs provide necessary details like timestamp, log level, and request context.

There are also some best practices you should follow when formatting logs:

JSON Format is highly recommended as it’s structured, machine-readable, and compatible with many observability tools (for example, Loki, Elasticsearch, Grafana).

There are also some key fields you should include:

timestamp: The time the log entry was created (preferably in UTC).
log_level: Indicate whether the log is an INFO, ERROR, DEBUG, and so on.
service: The service or component generating the log.
message: A concise description of the event or error.
correlation_id: A unique identifier for requests to trace logs across systems.

Here’s an example of a consistent log in JSON format:

{
  "timestamp": "2025-05-10T12:34:56Z",
  "log_level": "ERROR",
  "service": "ci_cd_pipeline",
  "message": "Build failed due to missing dependency",
  "correlation_id": "1234567890abcdef"
}
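
To keep every tool emitting the same shape, it can help to route all log calls through one small helper. The following is a minimal sketch in Node.js using only the standard library; the function name buildLogLine and the list of accepted levels are assumptions made for illustration, not part of any logging library.

```javascript
// Hypothetical helper that builds one JSON log line with the key fields listed above.
const { randomUUID } = require('crypto');

// Accepted levels are an assumption for this sketch; adjust to your own conventions.
const LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR'];

function buildLogLine(service, logLevel, message, correlationId = randomUUID(), now = new Date()) {
  if (!LEVELS.includes(logLevel)) {
    throw new Error(`Unknown log level: ${logLevel}`);
  }
  // JSON.stringify produces a single line, which suits line-oriented log shippers.
  return JSON.stringify({
    timestamp: now.toISOString(), // toISOString always returns UTC
    log_level: logLevel,
    service,
    message,
    correlation_id: correlationId,
  });
}
```

A script would call console.log(buildLogLine('ci_cd_pipeline', 'ERROR', 'Build failed due to missing dependency', runId)) so that each entry lands on its own line. Rejecting unknown levels early keeps typos such as "Eror" out of your indexes.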

How to Set Up Log Forwarding from GitHub Actions, Jenkins, or GitLab

Log forwarding refers to shipping logs from your CI/CD pipelines to a central spot for easy tracking. It’s helpful because it lets you spot issues fast and debug without digging through scattered files.

For GitHub Actions, you can configure workflows to write logs to a file or send them directly to a log aggregation tool like Loki. In Jenkins, you can use pipeline scripts to forward logs to a log server or file system. Similarly, for GitLab CI, you can add scripts in .gitlab-ci.yml to forward logs to a centralized endpoint.

Using Actions for Outputting Logs:
You can store logs in files and then forward them to a logging system (like Loki or Elasticsearch).
Here’s an example in a GitHub Action workflow:

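# Defines a GitHub Actions workflow to run tests and forward logs for observability.
jobs:
  build:
    runs-on: ubuntu-latest  # Uses an Ubuntu runner.
    steps:
      - name: Checkout repository  # Checks out the repository code.
        uses: actions/checkout@v4
      - name: Run tests and log output  # Runs tests and saves output to a log file.
        run: |
          set -o pipefail  # Keeps a failing npm test from being masked by tee.
          echo "Starting tests..."
          npm test 2>&1 | tee test.log  # Captures stdout and stderr in test.log.
      - name: Forward log to Loki  # Runs even when the tests fail.
        if: always()
        run: |
          # Loki's push API expects JSON, so jq wraps the file in a push payload.
          # your-loki-endpoint is a placeholder; the job label is an example.
          jq -Rs --arg ts "$(date +%s%N)" \
            '{streams: [{stream: {job: "gha"}, values: [[$ts, .]]}]}' test.log |
            curl -s -X POST -H "Content-Type: application/json" \
              --data-binary @- http://your-loki-endpoint/loki/api/v1/push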
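
Loki's push endpoint (/loki/api/v1/push) accepts a JSON body made of streams, where each stream has a label set and a list of [timestamp, line] pairs with the timestamp given as a string of Unix epoch nanoseconds. The sketch below shows one way to build that body in Node.js. The function names, the lokiUrl parameter, and the choice to send one entry per log line are assumptions for illustration.

```javascript
// Builds the JSON body for Loki's push endpoint from an array of log lines.
function buildLokiPushBody(labels, lines, nowMs = Date.now()) {
  // Timestamps are strings of Unix epoch nanoseconds. BigInt avoids precision
  // loss, because nanosecond values exceed Number.MAX_SAFE_INTEGER.
  const baseNs = BigInt(nowMs) * 1000000n;
  const values = lines
    .filter((line) => line.length > 0) // skip empty lines
    .map((line, i) => [String(baseNs + BigInt(i)), line]); // +i keeps the original order
  return { streams: [{ stream: labels, values }] };
}

// Hypothetical sender; lokiUrl is a placeholder such as http://localhost:3100.
// Uses the global fetch available in Node 18 and later.
async function pushToLoki(lokiUrl, labels, lines) {
  const response = await fetch(`${lokiUrl}/loki/api/v1/push`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildLokiPushBody(labels, lines)),
  });
  if (!response.ok) {
    throw new Error(`Loki push failed with HTTP ${response.status}`);
  }
}
```

In a CI step you could read test.log, split it on newlines, and call pushToLoki with labels such as { job: 'gha' }. Keep the label set small, since each distinct label combination creates a separate stream in Loki.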

Log Forwarding with Promtail:
If you are using Grafana Loki for log aggregation, set up Promtail to scrape the logs from the GitHub Actions runner.

Jenkins:

Jenkins logs can be forwarded to external systems (like Elasticsearch or Loki) by using log shippers or plugins.

You can use the Logstash Plugin to forward Jenkins logs to an ELK stack or other systems:

Install the Logstash plugin on Jenkins.
Configure the plugin to forward logs to an Elasticsearch server or a logging system of choice.
In Jenkins, add log forwarding configurations:

pipeline {
  agent any
  stages {
    stage('Build') {
      steps {
        script {
          // Example of forwarding logs to a log server
          sh 'echo "Build successful" | curl -X POST -d @- http://your-log-server'
        }
      }
    }
  }
}

Forward to Loki:
If you run Jenkins in Docker, you can use the Loki Docker logging driver (installed as a Docker plugin, as shown earlier) to send the container's output directly to Loki:

# Runs a Jenkins container with the Loki logging driver to send logs directly to Loki.
# The URL must be reachable from the Docker host and include the push path.
docker run --log-driver=loki --log-opt loki-url="http://loki:3100/loki/api/v1/push" jenkins/jenkins:lts

GitLab:

GitLab CI allows logs to be forwarded to external systems for centralized collection and analysis.

Use GitLab CI/CD to Output Logs:
Example in .gitlab-ci.yml:

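# Defines a GitLab CI/CD pipeline to run a build and forward its log to a collector.
stages:
  - build
build:
  stage: build
  script:
    - echo "Starting the build" | tee build.log  # Saves build output to build.log.
    # Posts the raw log to a placeholder endpoint. Loki's own push API expects JSON on
    # /loki/api/v1/push, so point this at a collector with an HTTP input (for example, Vector).
    - curl -X POST --data-binary @build.log http://your-log-endpoint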

GitLab Runners:
Configure GitLab runners to forward logs to an external service like Loki or Elasticsearch using log-driver settings or the fluentd log shipper.

How to Add Correlation IDs to Trace Requests Through the System

Why Correlation IDs Are Important:

Correlation IDs allow you to trace a single request as it travels through different services and tools, enabling end-to-end visibility and troubleshooting.

They are critical for debugging distributed systems, especially when different services (for example, CI tool, deployment tool, API service) are involved.

How to Add Correlation IDs:

You can use a UUID (Universally Unique Identifier) or a GUID (Globally Unique Identifier) to generate a unique ID for each request.

If you are using microservices or multiple services in the pipeline, just make sure that the same ID is propagated across each service.

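Many logging libraries make it easy to attach a correlation ID to every log entry (for example, winston for Node.js through default metadata or child loggers, and Log4j for Java through its ThreadContext). You still need to generate the ID yourself, once per request or pipeline run.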

Here’s an example in Node.js (using winston):

// Sets up Winston for structured logging with correlation IDs in a CI/CD pipeline.
const { createLogger, transports, format } = require('winston');
const { randomUUID } = require('crypto');
const { combine, timestamp, printf } = format;

// Uses one correlation ID per pipeline run: reuses CORRELATION_ID if the CI system sets it,
// otherwise generates a UUID. Generating a new ID on every log call would defeat the purpose.
const correlationId = process.env.CORRELATION_ID || randomUUID();

// Creates a logger that adds a timestamp and the correlation ID to every entry.
const logger = createLogger({
  defaultMeta: { correlation_id: correlationId },
  format: combine(
    timestamp(),  // Populates the timestamp field used by printf below.
    printf(({ level, message, timestamp, correlation_id }) =>
      `${timestamp} [${level}] ${message} correlation_id=${correlation_id}`)
  ),
  transports: [
    new transports.Console(),  // Outputs logs to the console.
  ],
});

// Logs a sample message.
logger.info('Pipeline execution started');

How to Propagate Correlation IDs Between Services:

In CI/CD tools, you can configure your pipeline to inject the correlation ID into logs. For example, in GitHub Actions, you can generate a correlation ID in the env section and propagate it in each job:

# Defines a GitHub Actions workflow that includes a correlation ID for log tracing.
jobs:
  build:
    runs-on: ubuntu-latest  # Uses an Ubuntu runner.
    env:
      CORRELATION_ID: ${{ github.run_id }}  # Uses the GitHub run ID as a correlation ID.
    steps:
      - name: Checkout repository  # Checks out the repository code.
        uses: actions/checkout@v4
      - name: Log build start with correlation ID  # Logs the build start with the correlation ID.
        # The block scalar avoids a YAML parse error caused by the colon-space inside the echo text.
        run: |
          echo "Build started with Correlation ID: $CORRELATION_ID"

Include Correlation IDs in All Logs:

You’ll want to make sure that logs from all components in the pipeline (GitHub Actions, Jenkins, GitLab, deployment tools, and so on) include the correlation ID as part of the log message. This allows you to trace the logs of a single request or pipeline run across different services.
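
One way to keep the ID consistent is to resolve it once at the start of a run and pass it to everything that logs or makes outgoing calls. The sketch below uses only the Node.js standard library. The function names are hypothetical, and the X-Correlation-ID header is a common convention rather than a standard, so treat both as assumptions.

```javascript
const { randomUUID } = require('crypto');

// Reuses an ID handed down by the CI system (for example, CORRELATION_ID set in
// the workflow's env section); otherwise creates a new UUID for this run.
function resolveCorrelationId(env = process.env) {
  return env.CORRELATION_ID || randomUUID();
}

// Returns a logger whose every line carries the same correlation ID.
function createRunLogger(correlationId, write = console.log) {
  const log = (level, message) =>
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      log_level: level,
      message,
      correlation_id: correlationId,
    }));
  return {
    info: (message) => log('INFO', message),
    error: (message) => log('ERROR', message),
  };
}

// Headers to attach to outgoing HTTP calls so downstream services can log the same ID.
function correlationHeaders(correlationId) {
  return { 'X-Correlation-ID': correlationId };
}
```

A deployment script would call resolveCorrelationId() once, create its logger from the result, and add correlationHeaders(id) to each request it sends. Downstream services then read the header and include the value in their own log entries.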

Visualize Your Log Flow

You can create a diagram showing how logs move from your CI/CD tool (for example, GitHub Actions) to Promtail/Vector, then to Loki/Elasticsearch, and finally to Grafana/Kibana for visualization. Use tools like Draw.io to map your pipeline’s observability flow.

How to Query and Analyze Logs for Effective Troubleshooting

In this section, you’ll learn how to use LogQL (Loki's query language) to cut through the noise and find the specific logs that matter. Whether you're hunting down a mysterious build failure or tracking deployment issues across multiple services, these query patterns give you a starting point.

Figure: Bar chart of CI/CD build performance from May 20 to May 26, 2025, comparing successful builds (blue) with failed builds (pink) per day. Successful builds range between 40 and 50 per day, while failed builds peak at 10 on May 23 and stay between 2 and 8 on the other days, indicating a generally stable pipeline with occasional issues.

How to Write Advanced LogQL Queries to Pinpoint CI/CD Issues

LogQL is Grafana Loki's query language, designed for querying logs with a syntax similar to Prometheus’s PromQL. It enables efficient log searches and is particularly useful in troubleshooting CI/CD issues.

Basic LogQL Syntax:

1. Log Streams:

{job="ci_cd", level="error"}

This query retrieves logs where the job label is ci_cd and the level label is error.

2. Log Filters:

{job="ci_cd"} |= "build failed"

The |= operator filters logs to include only those that contain the specified string, for example "build failed".

3. Regular Expressions:

{job="ci_cd"} |~ "error.*timeout"

This uses the |~ operator to filter logs using a regular expression. In this case, it finds logs that contain an "error" followed by "timeout".

Advanced LogQL Queries for CI/CD Issues:

1. Filter Logs for Specific Build Failures:

If your pipeline uses a specific label for build names:

{job="ci_cd", build="build123"} |= "failure"

This finds logs related to the build123 job that contain the word "failure".

2. Using Time Range and Grouping:

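To find error logs in the last 15 minutes, run the filter query with the time range set to "Last 15 minutes" in Grafana's time picker (LogQL log queries take their time range from the request, not from the query text):

{job="ci_cd", level="error"} |= "build failed"

To count matching lines over a 15-minute window inside the query itself, use a range aggregation:

count_over_time({job="ci_cd", level="error"} |= "build failed" [15m])

To count error logs per job:

sum by (job) (count_over_time({job="ci_cd", level="error"}[5m]))

This returns the number of error logs for each value of the job label over the last 5 minutes.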
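
Outside Grafana, the time range of a log query is supplied through the start and end parameters of Loki's HTTP API endpoint /loki/api/v1/query_range, which accept Unix epoch nanoseconds. The following sketch builds such a URL for "the last N minutes". The function name and the lokiBaseUrl parameter are assumptions for illustration.

```javascript
// Builds a query_range URL covering the last `windowMinutes` minutes.
function buildQueryRangeUrl(lokiBaseUrl, logql, windowMinutes, nowMs = Date.now(), limit = 100) {
  // BigInt keeps nanosecond timestamps exact.
  const endNs = BigInt(nowMs) * 1000000n;
  const startNs = endNs - BigInt(windowMinutes) * 60n * 1000000000n;
  // URLSearchParams takes care of escaping braces, quotes, and pipes in the query.
  const params = new URLSearchParams({
    query: logql,
    start: startNs.toString(),
    end: endNs.toString(),
    limit: String(limit),
  });
  return `${lokiBaseUrl}/loki/api/v1/query_range?${params}`;
}
```

For example, buildQueryRangeUrl('http://localhost:3100', '{job="ci_cd", level="error"} |= "build failed"', 15) produces a URL you can fetch from a script or pass to curl to retrieve the last 15 minutes of matching lines.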

How to Create Pipeline-Specific Queries for Common Failure Patterns

Common Failure Patterns in CI/CD Pipelines:

1. Build Failures:

If your CI system logs contain build errors, you can identify them with:

{job="ci_cd", level="error"} |= "build failed"

You can extend this to filter by specific steps or stages, for example, “test failed”, or “compilation error”.

2. Test Failures:

Logs from your test runner (for example, Jest, Mocha, JUnit) can contain specific failure messages:

{job="ci_cd", stage="test"} |= "test failed"

3. Dependency Issues:

If your pipeline is failing due to missing or conflicting dependencies, look for npm, maven, or docker related errors:

{job="ci_cd", image="node"} |= "npm ERR!"

Or for Maven-related issues:

{job="ci_cd", image="maven"} |= "[ERROR]"

4. Resource Constraints (for example, Out of Memory):

If you experience resource constraints, you might see logs like "OutOfMemoryError":

{job="ci_cd", level="error"} |= "OutOfMemoryError"

Example of combining filters:

{job="ci_cd", level="error"} |= "build failed" |~ "timeout|dependency"

This keeps only lines that contain "build failed" and also match "timeout" or "dependency". Set the time range (for example, the last hour) in Grafana's time picker or through the API's start and end parameters.

How to Set Up Alert Rules Based on Log Patterns

Alerts help detect recurring issues proactively. They notify you when a specific pattern appears in your logs, allowing you to take quick action.

Steps for Setting Up Alerts:

1. Create a Query for the Alert:

First, define the log pattern you want to monitor. For example, an alert for build failures:

{job="ci_cd", level="error"} |= "build failed"

2. Create an Alert in Grafana:

Follow these steps to set up Grafana alerts:

Go to your Grafana dashboard.
Choose the panel you want to set the alert on (or create a new panel for this purpose).
In the panel, click the Alert tab.
Set the Query field to your LogQL query, such as the one above.
Under Conditions, define when the alert should trigger, e.g., if the error occurs more than 3 times within 5 minutes.

3. Alert Settings:

Now you’ll want to set up the alert evaluation interval and conditions for triggering the alert (e.g., if the query returns results above a certain threshold).

Here’s an example: Trigger an alert if the number of errors exceeds 5 within 5 minutes:

count_over_time({job="ci_cd", level="error"} |= "build failed"[5m]) > 5

4. Set Alert Notifications:

You can choose where you want the alert to be sent (like to Slack, email, or PagerDuty). And Grafana can be integrated with these systems to send real-time alerts to the right team members.

Example alert query for test failures:

count_over_time({job="ci_cd", stage="test"} |= "test failed"[5m]) > 3

This query triggers an alert if more than 3 test failures are logged within the last 5 minutes.
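
To make the alert condition concrete, the sketch below reproduces the same logic outside Loki: count the entries inside a time window that contain a given string, then compare the count with a threshold. It assumes log entries shaped like the structured JSON shown earlier (with timestamp and message fields), and the function names are hypothetical. It is meant to illustrate what the alert evaluates, not to replace Grafana alerting.

```javascript
// Counts entries whose message contains `needle` and whose timestamp falls
// inside the window of length `windowMs` ending at `nowMs`.
function countOverTime(entries, needle, windowMs, nowMs) {
  return entries.filter((entry) => {
    const t = Date.parse(entry.timestamp);
    return t > nowMs - windowMs && t <= nowMs && entry.message.includes(needle);
  }).length;
}

// Returns true when the count in the window is strictly greater than the threshold.
function shouldAlert(entries, needle, windowMs, threshold, nowMs = Date.now()) {
  return countOverTime(entries, needle, windowMs, nowMs) > threshold;
}
```

Calling shouldAlert(entries, 'test failed', 5 * 60 * 1000, 3) mirrors the query above. The comparison is strict, so exactly 3 failures in the window does not fire.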

Kibana Query Language Deep Dive for CI/CD Contexts

Kibana Query Language (KQL) is a powerful tool for searching and filtering logs within Elasticsearch, and it becomes especially useful for debugging CI/CD pipelines.

Basic Query Syntax:

Field:
```
  fieldname:value
```
Example: status: "failure"
Wildcard: Use * to match any number of characters. Leave the value unquoted, because KQL treats quoted text as a literal phrase:
```
  message: test*
```
Range Queries: To search for logs within a specific time frame, use comparison operators:
```
  @timestamp >= "2023-05-01" and @timestamp <= "2023-05-15"
```

Boolean Queries: Combine queries using AND, OR, and NOT:

  status: "failure" AND build_id: "12345"

Time-Based Queries:

Since CI/CD logs are often tied to time-sensitive operations (builds, deployments), KQL allows you to filter logs by time using date math:

@timestamp >= now-1d

Nested Queries (For Complex Pipelines):

CI/CD logs can have nested or multi-level structures (for example, logs within containers). You can query these nested fields:

pipeline.logs.message: "build failed"

Aggregations and Grouping:

KQL itself only filters documents, so aggregation happens in a Kibana visualization (or through the Elasticsearch API) on top of your query. For example, a terms aggregation on the status field counts documents per status value.

This helps identify the most common failure statuses in your pipeline.
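
If you want the same grouping outside Kibana, Elasticsearch's _search API accepts a terms aggregation in the request body. The sketch below builds such a body in Node.js. The function name is hypothetical, and the status.keyword field assumes that status was indexed with Elasticsearch's default dynamic mapping (a text field with a keyword subfield), which may not match your index.

```javascript
// Builds a _search request body that counts documents per status value
// within a time range.
function buildStatusAggregation(field = 'status.keyword', since = 'now-1d', size = 10) {
  return {
    size: 0, // return only aggregation buckets, not individual documents
    query: { range: { '@timestamp': { gte: since, lte: 'now' } } },
    aggs: { by_status: { terms: { field, size } } },
  };
}
```

Sending JSON.stringify(buildStatusAggregation()) as a POST body to http://localhost:9200/ci-logs-*/_search would return one bucket per status with its document count under aggregations.by_status.buckets.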

Field-Specific Filtering:

When debugging specific components like a build tool or deployment step, you can filter by those component-specific fields:

build_tool: "Jenkins" AND status: "failure"

Creating Saved Searches for Recurring Issues

Once you’ve built queries that help you identify common issues in your CI/CD pipeline, you can save them in Kibana for future use.

1. Create a Saved Search:

Run your desired query in the Kibana Discover tab. Click on the “Save” button and give it a meaningful name, such as "Failed Builds - Last Week". You can add filters and customize the time range to match your typical issue patterns.

2. Use Filters to Pinpoint Recurring Problems:

Create saved searches that focus on specific recurring issues like:

Build failures based on a specific tool or version.
Test failures within a particular module or set of tests.

Example search for “flaky tests”:

test_status: "failed" AND error_message: *timeout*

3. Saving Multiple Variations:

You can save multiple variations of queries based on different error types or CI/CD tools:

Failed Jobs: status: "failure"
Test Failures in Build: log_type: "test" AND status: "failure"
Resource Constraints: error_message: *memory*

These saved searches will allow you to quickly troubleshoot specific issues that occur frequently.

Building Visualizations to Spot Patterns Over Time

Once you have saved searches, Kibana allows you to create visualizations from your data, making it easier to spot trends, anomalies, or patterns over time.

1. Create a Visualization:

Go to the Visualize tab in Kibana. Select the appropriate visualization type. Common visualizations for debugging CI/CD pipelines include:

Line Chart: Track build failure rates over time.
Bar Chart: Show the number of failures per CI tool or service.
Pie Chart: Breakdown of failure reasons (for example, compilation errors, test failures, resource constraints).

2. Track Failure Trends Over Time:

Create a line chart to track build failures over a given period:

X-Axis: Time (for example, daily or weekly).
Y-Axis: Count of build failures.
Aggregation: Date histogram with @timestamp field.

This will help you visualize how build failures are trending, making it easier to identify recurring issues or spikes in failures.

3. Monitor Failure Types by CI Tool:

Create a bar chart that shows the number of failures broken down by CI tool:

X-Axis: CI tool (Jenkins, GitHub Actions, GitLab, and so on).
Y-Axis: Count of failures.
Aggregation: Terms aggregation on the ci_tool field.

This visualization helps identify which CI tool is experiencing the most failures and focus troubleshooting efforts there.

4. Visualize Error Messages by Frequency:

You can visualize which error messages appear most frequently, helping you understand what might be causing recurring issues:

X-Axis: Error message type.
Y-Axis: Count of occurrences.
Aggregation: Terms aggregation on the error_message field.

5. Dashboard for Holistic Monitoring:

Create a dashboard that brings together multiple visualizations. You can have one graph for failure trends, another for failure types (bar chart), and a pie chart showing the percentage of failures caused by different issues. This dashboard gives you a holistic view of your pipeline's health.

Advanced Visualization Techniques:

There are various advanced techniques you can use to dig further into your data.

Heatmaps: Use heatmaps to spot time-based anomalies in build durations or test failures.
Anomaly Detection: Kibana can surface anomalies through Elastic's machine learning features (which may require a paid subscription or trial license). Applied to log data, anomaly detection jobs automatically flag patterns that deviate from the norm, which is especially useful for catching rare or unexpected errors in your CI/CD pipeline.

Example settings for an anomaly detection job (illustrative, not literal syntax):
```
  field: duration
  aggregation: average
  anomaly detection model: "baseline"
```

How to Set Up Prometheus Metrics Alongside Your Logs

To fully understand your CI/CD pipeline's health and performance, combining metrics and logs is essential. Prometheus is an excellent tool for capturing time-series metrics, and it works seamlessly with Grafana and Loki (or any log aggregation system).

How to Set Up Prometheus for CI/CD Metrics Collection:

1. Install Prometheus:

You can install Prometheus using Docker or Kubernetes for easy deployment.

For Docker-based installation:

docker run -d -p 9090:9090 --name prometheus prom/prometheus

2. Configure Prometheus to Scrape Metrics:

Prometheus needs to be configured to scrape metrics from your CI/CD services.

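Edit the prometheus.yml file (when running in Docker, mount it into the container at /etc/prometheus/prometheus.yml). Inside a container, localhost refers to the container itself, so replace the example targets with addresses Prometheus can actually reach: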

scrape_configs:
  - job_name: 'ci_cd_metrics'
    static_configs:
      - targets: ['localhost:8080', 'localhost:9091']

3. Instrument Your CI/CD Services:

To expose metrics, you need to integrate Prometheus client libraries into your CI/CD services.

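For example, to expose build metrics from Jenkins, use the Prometheus plugin for Jenkins. GitHub Actions jobs are short-lived and cannot be scraped directly, so a common approach is to push their metrics to a Prometheus Pushgateway (default port 9091), which Prometheus then scrapes.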

4. Expose Metrics Endpoint:

You’ll want to make sure your services expose a /metrics endpoint that Prometheus can scrape. For example, use Prometheus client libraries in your application to expose this endpoint.
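
To show what Prometheus actually reads from a /metrics endpoint, here is a minimal sketch that serves counters in the Prometheus text exposition format using only Node's built-in http module. In a real service you would normally use a client library such as prom-client instead. The function names and the in-memory Map are assumptions for illustration.

```javascript
const http = require('http');

// In-memory store: metric name -> { help, value }.
const counters = new Map();

function registerCounter(name, help) {
  counters.set(name, { help, value: 0 });
}

function increment(name, by = 1) {
  counters.get(name).value += by;
}

// Renders all counters in the Prometheus text exposition format.
function renderMetrics() {
  let output = '';
  for (const [name, { help, value }] of counters) {
    output += `# HELP ${name} ${help}\n# TYPE ${name} counter\n${name} ${value}\n`;
  }
  return output;
}

// Creates (but does not start) an HTTP server that answers GET /metrics.
function createMetricsServer() {
  return http.createServer((req, res) => {
    if (req.url === '/metrics') {
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' });
      res.end(renderMetrics());
    } else {
      res.writeHead(404);
      res.end();
    }
  });
}
```

Calling createMetricsServer().listen(8080) would make the metrics available at http://localhost:8080/metrics, matching the first target in the scrape configuration above.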

Troubleshooting Prometheus Setup Issues

If Prometheus fails to start or scrape metrics, here are some things that might be going wrong:

Container Crashes: Check logs with docker logs prometheus. Look for errors like “port already in use” (for example, 9090) or configuration parsing issues.
- Fix: Change the port in docker run (for example, -p 9091:9090) or correct the prometheus.yml file syntax.
Metrics Not Scraped: Open http://localhost:9090/targets in a browser, or query the API with curl http://localhost:9090/api/v1/targets, to see whether each target is up. Check prometheus.yml for correct endpoints.
- Fix: Update targets in scrape_configs (for example, localhost:8080) to match your CI/CD service’s metrics endpoint.
Resource Constraints: Monitor usage with docker stats or top on the host.
- Fix: Ensure at least 4GB RAM and 10GB disk space. If usage is still too high, shorten storage retention (the --storage.tsdb.retention.time flag) or scrape less often by raising scrape_interval in prometheus.yml.

How to Create Grafana Dashboards That Combine Metrics and Logs

Once Prometheus is collecting metrics, the next step is to visualize and correlate them in Grafana.

How to Integrate Prometheus with Grafana:

First, you’ll need to install Grafana. You can use Docker or Kubernetes for quick deployment:

docker run -d -p 3000:3000 --name grafana grafana/grafana

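Next, configure Grafana to use Prometheus as a data source. To do this, log in to Grafana (localhost:3000 by default). Go to Configuration > Data Sources > Add Data Source > Choose Prometheus. Enter your Prometheus server URL (for example, http://localhost:9090) and click Save & Test. If Grafana runs in its own container, localhost points at the Grafana container itself, so use an address that container can reach (such as the Prometheus container name on a shared Docker network).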

Now it’s time to build a unified dashboard. To do this, create a new dashboard in Grafana that combines both logs (Loki) and metrics (Prometheus).

Add a panel with Prometheus data queries to visualize pipeline metrics like build success rate, deployment duration, and failure count. Use the Graph visualization type for time-series data and Stat for quick summary metrics.

Finally, in the same Grafana dashboard, add panels for logs (from Loki or any other logging system). Use the Logs panel to visualize log data and link them with the relevant Prometheus metrics by using time-based correlations.

Example: If a spike in CPU usage is detected (Prometheus metric), the logs panel could show related logs, like errors or failed build jobs.

How to Use Exemplars to Jump from Metrics to Relevant Logs

Exemplars are an advanced feature in Prometheus that allow you to connect metric data with logs and traces. Grafana supports this feature, and it can be incredibly helpful when investigating issues.

How to Set Up Exemplars in Prometheus:

1. Enable Exemplars in Your Application:

Exemplars are references to traces (typically a trace ID) attached to individual metric samples. To use them, you’ll need to make sure your application is instrumented to send exemplar data alongside your metrics.

Many libraries support adding exemplars to Prometheus metrics, such as prom-client (Node.js) and prometheus-net (C#).

Here’s an example in Node.js:

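// Demonstrates adding an exemplar to a Prometheus metric for linking to logs or traces.
// Assumes a prom-client version with exemplar support; check the documentation for your version.
const promClient = require('prom-client');

// Exemplars are only emitted in the OpenMetrics exposition format.
promClient.register.setContentType(promClient.Registry.OPENMETRICS_CONTENT_TYPE);

// Creates a counter metric to track failed CI/CD builds.
const counter = new promClient.Counter({
  name: 'ci_cd_failed_builds_total',  // Metric name for failed builds.
  help: 'Total number of failed builds',  // Description of the metric.
  enableExemplars: true,  // Allows exemplar labels on this metric.
});

// Increments the counter and attaches a trace ID as the exemplar (placeholder value).
counter.inc({ labels: {}, value: 1, exemplarLabels: { traceId: 'replace-with-real-trace-id' } });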

2. Enable Exemplars in Prometheus Config:

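Make sure your Prometheus server is configured to store exemplars. Exemplar storage is behind a feature flag, enabled by starting Prometheus with --enable-feature=exemplar-storage. Exemplars are attached to counter and histogram samples and are only exposed in the OpenMetrics format, so make sure your application serves that format.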

3. Visualizing Exemplars in Grafana:

In Grafana, when you query Prometheus for metrics with exemplars, Grafana will show the linked logs or traces when you hover over a metric.

Use the Exemplar option in Grafana panels to quickly access logs from specific metrics.

For example, if your ci_cd_failed_builds_total metric shows a spike, you can click an exemplar point on the graph in Grafana and jump to the trace or logs recorded for that specific failure.
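
It can help to see what an exemplar looks like on the wire. In the OpenMetrics text format, an exemplar follows the sample value after a # marker, as a label set plus the exemplar's own value. The sketch below renders one such line. The function name is hypothetical, and it skips label escaping and timestamps for brevity, so treat it as an illustration of the format rather than a complete encoder.

```javascript
// Renders one counter sample with an exemplar in OpenMetrics text format:
//   <name> <value> # {<exemplar labels>} <exemplar value>
function renderCounterWithExemplar(name, value, exemplarLabels, exemplarValue) {
  const labels = Object.entries(exemplarLabels)
    .map(([key, val]) => `${key}="${val}"`)
    .join(',');
  return `${name} ${value} # {${labels}} ${exemplarValue}`;
}
```

For instance, renderCounterWithExemplar('ci_cd_failed_builds_total', 3, { trace_id: 'abc123' }, 1) yields a line in which trace_id is the value Grafana can use to link the sample to a trace or to logs carrying the same ID.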

How to Diagnose and Fix Common CI/CD Problems

CI/CD pipelines often encounter issues like build failures, dependency problems, and flaky tests that can disrupt development workflows. This section provides practical strategies to diagnose and resolve these common problems using log analysis and systematic debugging techniques, helping you restore pipeline stability quickly.

Strategy 1: Systematically Debug Build Failures

Build failures are a frequent CI/CD challenge, often stemming from errors in code, tests, or configurations. Systematically debugging these issues involves analyzing logs to pinpoint root causes, using the following approaches.

Identifying Patterns in Compiler and Test Output

When debugging build failures, you need to first examine the logs from the compiler and test outputs. Let’s go over some key strategies.

1. Check for Specific Error Messages:

There are a few common types of error messages you might get. They are:

Syntax errors: Look for lines indicating that there's a mismatch in syntax, such as missing semicolons, undeclared variables, or incorrect function calls.
Linker errors: These often occur when the required libraries or dependencies are not found. You'll typically see errors like undefined reference or symbol not found.
Build tool errors: If you are using build systems like Maven, Gradle, or MSBuild, their logs will give specific error codes or missing configurations.

2. Look for Common Error Patterns:

Often, failed builds repeat the same error or pattern across multiple runs. Check logs for recurring terms or errors that point to specific modules or functions. And remember that grouping similar issues can help you identify the root cause faster.

3. Use Regular Expressions for Log Filtering:

You can use regular expressions to search for keywords in the logs that match common failure patterns (for example, "error", "failed", "exception", "out of memory"). This will help you filter out unrelated messages and focus on the failures.

As an example:

If the build fails with an "Out of Memory" error, search for any memory allocation issues or settings that can be increased.
If test failures are related to specific modules, inspect those modules for recent changes or dependency issues.
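The regex filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete log parser: the keyword list and the sample log lines are assumptions made up for the example, so adjust them to match what your own build tool prints.

```python
import re

# Hypothetical failure keywords; extend this list for your own build tooling.
FAILURE_PATTERN = re.compile(
    r"\b(error|failed|exception|out of memory)\b", re.IGNORECASE
)


def filter_failures(log_lines):
    """Return (line_number, line) pairs for lines that contain a failure keyword."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_lines, start=1)
        if FAILURE_PATTERN.search(line)
    ]


# Made-up sample log, used only for illustration.
sample_log = [
    "INFO  Compiling module core",
    "ERROR Cannot resolve symbol 'Foo'",
    "INFO  Running unit tests",
    "FATAL Out of memory: killed process 1234",
    "INFO  Build FAILED in 42s",
]

matches = filter_failures(sample_log)
for number, line in matches:
    print(f"{number}: {line}")
```

Keeping the line numbers alongside the matches makes it easier to jump back to the surrounding context in the full log.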

Strategy 2: Troubleshooting Dependency Issues with Log Analysis

Dependency issues are common in build failures, especially in complex CI/CD pipelines with multiple modules or services. To resolve these issues, consider the following:

1. Check for Missing or Outdated Dependencies:

Start by reviewing the build tool’s output to check for messages related to missing dependencies (for example, dependency not found, version conflict).

Many build tools (like Maven, npm, or .NET) will include specific error messages when a dependency is missing or incompatible.

2. Inspect Dependency Resolution Logs:

Some build tools provide detailed logs showing how dependencies were resolved (for example, the version of a library that was used). These logs can show you if there’s a version mismatch.

Make sure that your package.json (for JavaScript projects), pom.xml (for Java), or .csproj (for C#) files declare compatible versions.

3. Verify Network Connectivity:

CI/CD tools sometimes fail to fetch dependencies due to network issues (for example, proxy settings, repository access). Look for any errors indicating that a repository couldn’t be reached.

4. Log Example:

If a Java project fails with Could not find artifact, a dependency is likely missing or inaccessible. Check the repository URL and confirm that the artifact exists in your Maven repo.

5. Resolve Version Conflicts:

Version conflicts occur when different dependencies require incompatible versions of the same library. This is especially true in Java (with Maven/Gradle) and .NET projects. Consider using tools to resolve version conflicts automatically or define compatible versions manually.
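To make the idea of a version conflict concrete, here is a small Python sketch that groups required versions by library and reports any library requested in more than one version. The service names, library names, and versions are hypothetical; in practice you would parse these triples from your build tool's dependency tree output.

```python
from collections import defaultdict


def find_version_conflicts(requirements):
    """Return libraries that are required in more than one version.

    requirements is an iterable of (dependent, library, version) triples.
    """
    by_library = defaultdict(lambda: defaultdict(list))
    for dependent, library, version in requirements:
        by_library[library][version].append(dependent)
    return {
        library: {version: sorted(deps) for version, deps in versions.items()}
        for library, versions in by_library.items()
        if len(versions) > 1
    }


# Hypothetical requirements, e.g. parsed from a dependency tree.
requirements = [
    ("service-a", "libxyz", "1.2.0"),
    ("service-b", "libxyz", "2.0.1"),
    ("service-a", "logging-core", "3.1.0"),
    ("service-b", "logging-core", "3.1.0"),
]

conflicts = find_version_conflicts(requirements)
print(conflicts)
```

The report shows which dependents pull in which version, which tells you where to pin or upgrade.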

Fixing Flaky Tests Based on Historical Log Data

Note: Issues like container crashes, logs not ingested, or resource constraints here may resemble those in other sections. These are common across CI/CD services and processes, but each section offers unique context to avoid redundancy.

Flaky tests – that is, those that pass sometimes and fail at other times – are common in CI/CD pipelines, and they can be frustrating. Let’s discuss some strategies for how you can tackle them:

1. Analyze Test Logs Over Time:

Review historical logs to identify patterns in when the test fails. Look for timing issues, resource limits, or external dependencies that could affect test reliability.

For example, if a test intermittently fails after a certain amount of time or only during specific pipeline stages, it could indicate resource exhaustion or race conditions.

2. Check Test Dependencies:

Often, flaky tests are dependent on external services or resources (for example, databases, APIs, file systems). Check if these services are consistently available and properly mocked during test execution.

Logs that mention failed connections to external services or unstable environments can give you insights into potential issues with dependencies.

3. Run Tests with Increased Logging:

Increase the verbosity of test logs to capture more information about the failures. This can help you detect why tests fail in certain conditions.

For example, adding debug logs inside tests can provide more context on the state of the application when the failure occurs.

4. Time of Day Issues:

Some flaky tests may fail during peak usage times, especially if they rely on shared resources. Look for patterns that correlate with resource contention (for example, database locks, API rate limits).

Logs showing high CPU or memory usage can indicate that resource constraints are affecting the stability of your tests.

5. Implement Retry Logic for Flaky Tests:

To mitigate the effects of flaky tests, implement automatic retries for tests that fail intermittently. This can help reduce the noise in your CI/CD pipeline while you investigate the root causes.

For example, if a database connection test fails intermittently, you may want to inspect database logs for signs of timeouts or connection pool exhaustion.
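As a rough illustration of retry logic, the Python sketch below reruns a test function a limited number of times and reports how many attempts it needed. The helper name run_with_retries and the simulated flaky test are assumptions for this example; most test frameworks and CI tools offer their own retry options, which you should prefer when they are available.

```python
import time


def run_with_retries(test_func, max_attempts=3, delay_seconds=0.0):
    """Run test_func until it passes or the attempts run out.

    Returns the number of attempts used, so flaky passes stay visible.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            test_func()
            return attempt
        except AssertionError as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


# Hypothetical flaky test: fails twice, then passes.
calls = {"count": 0}


def flaky_database_test():
    calls["count"] += 1
    assert calls["count"] >= 3, "connection pool exhausted"


attempts_used = run_with_retries(flaky_database_test)
print(f"passed after {attempts_used} attempt(s)")
```

Logging every failed attempt matters here: a test that passes only on the third try is still a signal worth investigating, not a success to ignore.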

How to Resolve Deployment Pipeline Failures

Deployment pipeline failures can stem from several sources, and diagnosing them requires a systematic approach using logs and available observability tools. Below, we will outline the common patterns in logs that indicate resource constraints, permission/authentication issues, and configuration drift between environments.

Log Patterns That Indicate Resource Constraints

Resource constraints are a common cause of pipeline failures. These can include CPU limits, memory usage, or disk space running out. Here's how to recognize these patterns:

Key Indicators in Logs:

Memory Issues: Look for messages like "out of memory", "memory limit exceeded", or "OOM killed" in your logs. Here’s an example in Kubernetes logs:

pod has been OOMKilled

CPU Limits: Watch for logs showing that a process exceeded CPU limits or was throttled. Here’s an example:

process 'foo' hit CPU limit, throttling at 100%

Disk Space: Logs may show file write errors or messages about a disk being full. Here’s an example:

Unable to write to file, disk space is full.

You can resolve the memory issues by increasing the allocated memory for your containers, VM, or cloud instances.

You can resolve the CPU issues by adjusting CPU limits or scaling your infrastructure to add more resources.

And finally, you can resolve disk space issues by cleaning up unused files or increasing disk capacity on the server/container.

Identify Permission and Authentication Issues

Permission and authentication issues often result in pipeline failures due to a lack of access to necessary resources or services. These issues might occur when you’re trying to access databases, deploy to cloud services, or authenticate third-party APIs.

There are some key indicators in the logs that you can look out for:

1. Authentication Failures:

Look for messages related to failed logins, incorrect credentials, or invalid tokens.

Here’s an example:

Authentication failed for user 'admin'

Invalid API token provided.

2. Permission Denied:

Logs may indicate that the CI/CD pipeline lacks the permissions to perform a certain action.

Here’s an example:

Access denied for /path/to/deployment/target

Unauthorized request to cloud service.

How to resolve these errors:

Credentials: Ensure the credentials (API keys, access tokens, SSH keys) used in the pipeline are up-to-date and correctly configured.
Permissions: Review and update the role-based access control (RBAC) settings for the service account running the pipeline to ensure it has the necessary permissions.
Secrets Management: Use tools like Vault, AWS Secrets Manager, or Azure Key Vault to securely manage secrets and credentials.

Troubleshooting Configuration Drift Between Environments

Configuration drift occurs when different environments (like development, staging, production) are not synchronized. This can lead to inconsistent behavior during deployments, and often results in failures in one environment but not in others.

Look out for these key indicators in the logs:

1. Mismatch in Environment Variables:

If you’re using environment variables, check for discrepancies across different stages. For example:

Environment variable DATABASE_URL not found in production

2. Dependency Versions:

Mismatched versions of dependencies between environments can cause unexpected issues.

Here’s an example:

Error: Dependency 'libxyz' version mismatch between environments

3. Service Configuration:

Look for configuration-related errors that might not be present in a development environment but occur in production.

Here’s an example:

Error: Invalid config in 'production-config.yaml'

How to resolve these errors:

Use Infrastructure as Code (IaC): Tools like Terraform, Ansible, or CloudFormation can help ensure that environments are provisioned consistently.
Automated Configuration Management: Use CI/CD pipeline steps to automate environment setup to avoid manual changes that can cause drift.
Environment Consistency Checks: Implement checks to compare configurations and dependencies across environments before deployment.
- Example: You can add a pre-deployment stage to run a script that compares environment variables, configurations, and dependency versions between staging and production.
Configuration Management Tools: Use configuration management tools like Chef, Puppet, or SaltStack to maintain consistent configurations across environments.
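The pre-deployment consistency check mentioned above could look something like the following Python sketch. It assumes each environment's settings have already been loaded into a flat dictionary; the keys and values shown are invented for illustration.

```python
def diff_configs(staging, production):
    """Compare two flat key/value configs and return every difference.

    The result maps each drifting key to a (staging, production) pair.
    """
    drift = {}
    for key in sorted(set(staging) | set(production)):
        left = staging.get(key, "<missing>")
        right = production.get(key, "<missing>")
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical settings for two environments.
staging = {
    "DATABASE_URL": "postgres://staging-db/app",
    "LIBXYZ_VERSION": "2.0.1",
    "LOG_LEVEL": "info",
}
production = {
    "LIBXYZ_VERSION": "1.2.0",
    "LOG_LEVEL": "info",
}

drift = diff_configs(staging, production)
for key, (staging_value, production_value) in drift.items():
    print(f"{key}: staging={staging_value} production={production_value}")
```

In a real pipeline stage you would exit with a non-zero status when drift is not empty so the deployment stops, and you would avoid printing secret values such as connection strings.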

How to Debug Container-Based Deployment Issues

Debugging container-based deployment issues requires specialized tools and techniques to trace errors in containerized environments. Below are strategies to efficiently collect logs, diagnose failures, and use ephemeral containers for investigation.

Collecting and Analyzing Container Logs Effectively

Container logs are essential for troubleshooting issues, and effective collection and analysis can significantly speed up the debugging process.

Here’s how you can collect container logs:

1. Docker Logs:

You can use Docker’s logs command to view logs of a specific container:

docker logs <container-name>

If your container uses a logging driver (like json-file or fluentd), ensure that logs are being written to an accessible location.

2. Kubernetes Logs:

For Kubernetes-managed containers, use kubectl to access pod logs:

kubectl logs <pod-name>

To view logs for all containers in a pod:

kubectl logs <pod-name> --all-containers=true

3. Log Aggregation:

You can integrate with centralized logging systems (like Grafana Loki or the Elastic Stack). You can also use Fluentd or Logstash as log shippers to forward logs from containers to a logging backend.

Analyzing Logs:

1. Filter and Search Logs:

Use grep to filter logs for specific error messages or patterns:

docker logs <container-name> | grep "ERROR"

In Kubernetes, you can combine kubectl with grep or other tools for advanced filtering.

2. Log Contextualization:

Include metadata in your logs (for example, container ID, environment, timestamps) for easier debugging. Ensure logs are structured in formats like JSON to allow for better querying and filtering.

How to Diagnose Image Pull and Networking Failures

Container deployment failures often stem from issues related to image pulling or network connectivity. Here’s how to troubleshoot these problems:

Image Pull Failures:

There are some common issues you might see, such as:

Authentication failures: If the container registry requires authentication, ensure your credentials (username/password or tokens) are correct.
Network connectivity: Check if the container can access the registry endpoint. Often, firewalls or DNS issues block the image pull.
Image not found: Verify the image name and tag are correct. Use docker pull to manually pull the image to see if the issue is specific to the deployment process.

There are various ways to diagnose them:

For Docker, use:

docker pull <image>:<tag>

This will output the specific error message if the image pull fails.

For Kubernetes, check the event logs for the pod:

kubectl describe pod <pod-name>

Look for the Failed status under "Events" for information about why the image pull failed (for example, wrong credentials or tag). If the issue is with the registry authentication, configure the Kubernetes imagePullSecrets or Docker's credentials to ensure the correct access.

Networking Failures:

Some common issues you may encounter are:

DNS resolution problems: Containers may fail to resolve hostnames if DNS configurations are incorrect.
Network policies and firewall rules: Network policies or firewalls may block necessary ports.
Inter-container communication: If containers need to talk to each other, ensure they’re on the same network or subnet.

Again, there are various ways to diagnose these issues:

For Docker networking:

You can do this to view all Docker networks:

docker network ls

You can also inspect the network of your container like this:

docker network inspect <network-name>

Check if the container is correctly attached to the network and if necessary ports are exposed.

For Kubernetes Networking:

You can use kubectl to check network policies:

kubectl get networkpolicies

You can also check the pod’s network settings like this:

kubectl describe pod <pod-name> | grep -i "Network"

Testing Connectivity Inside Containers:

For Docker, exec into the container and test:

docker exec -it <container-name> /bin/bash
ping <target-host>
curl http://<target-host>:<port>

In Kubernetes, use kubectl exec to access the pod and test connectivity:

kubectl exec -it <pod-name> -- /bin/bash

How to Use Ephemeral Debug Containers for Investigation

Ephemeral debug containers are short-lived containers that help investigate issues in a running environment without altering the main application container.

What are Ephemeral Debug Containers?

Ephemeral debug containers allow you to run diagnostic commands (like shell access, ping, or curl) in the same network environment as the failing application container, without modifying the application itself.

How to Set Up Ephemeral Containers in Docker:

1. Use the docker run Command:

You can create a new container for debugging by running a container with the same network settings as the failing container:

docker run -it --network container:<target-container> --entrypoint /bin/bash <debug-image>

This command runs an interactive shell inside the debug container using the same network as the target container.

Ephemeral Containers in Kubernetes:

Kubernetes allows you to inject an ephemeral debug container into a running pod. You can add a temporary debug container to your pod using the following command:

kubectl debug <pod-name> -it --image=<debug-image> --target=<container-name>

This command will run a new container in the same pod as the target container, allowing you to run diagnostic commands.

Example use cases are investigating file systems, running network diagnostics, checking configuration files, and so on.

These debug containers are meant to be temporary and can be discarded after the issue is resolved.

How to Implement Advanced Debugging Techniques

This section covers advanced methods to diagnose complex CI/CD pipeline issues that standard log analysis might miss. We’ll explore distributed tracing to track requests across multiple services and combine traces with logs and metrics for deeper insights.

These techniques are designed to work within budget constraints, ensuring effective debugging for your CI/CD workflows.

Choosing a Tracing Backend for CI/CD

Distributed tracing enables you to monitor a request’s path through various services in your CI/CD pipeline, such as from a build step to a deployment, identifying delays or failures. Choosing a tracing backend involves selecting a tool to store and analyze these trace data. Below, we compare Jaeger, Tempo, and hosted solutions for distributed tracing.

Tool	Resource Usage	Setup Complexity	Best For	CI/CD Fit
Jaeger	Low	Easy (Docker-based)	Small teams, local setups	Simple pipelines, quick trace views
Tempo	Low	Moderate (Grafana integration)	Grafana users, log/metric correlation	Complex pipelines, unified observability
Hosted (e.g., Lightstep)	Variable (cloud-based)	Easy (managed)	Teams with budget for cloud services	Scalable, production-grade tracing

When to choose each one:

Jaeger: Ideal for quick, local tracing setups with minimal overhead.
Tempo: Best for teams already using Grafana Loki/Prometheus for unified observability.
Hosted Solutions: Suited for large-scale pipelines needing managed scalability.

How to Set Up Distributed Tracing on a Budget

Distributed tracing is crucial for debugging and observing complex, multi-step operations across services. It allows you to follow requests as they propagate through different services and components of your pipeline. Implementing this on a budget can still provide valuable insights.

How to Use OpenTelemetry with Free Backends

OpenTelemetry is an open-source framework that enables you to collect, process, and export telemetry data like traces and metrics. It supports multiple backends, and we’ll focus on using free, budget-friendly backends for trace storage and analysis.

1. Install OpenTelemetry Collector:

OpenTelemetry provides an agent (collector) that collects traces and metrics from your application and sends them to a backend.

To install the OpenTelemetry Collector, download the binary for your OS or use Docker to deploy it:

docker pull otel/opentelemetry-collector:latest

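Then run the OpenTelemetry Collector in Docker, mounting a configuration file (assumed here to be otel-collector-config.yaml in the current directory). Current Collector releases listen for OTLP on ports 4317 (gRPC) and 4318 (HTTP). The legacy port 55680 is no longer opened by default:

docker run -d --name otel-collector -p 4317:4317 -p 4318:4318 -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" otel/opentelemetry-collector:latest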

2. Configure OpenTelemetry to Export to Free Backends:

There are a few popular free backends you can use for distributed tracing, like Jaeger and Prometheus + Tempo. Let’s see how to use both here.

We’ll start with Jaeger, an open-source tracing backend. It’s highly scalable and works well with OpenTelemetry.

You can use the Docker version for easy deployment:

docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14250:14250 -p 14268:14268 -p 9411:9411 jaegertracing/all-in-one:1.30

Alternatively, you can use hosted services like Lightstep, AWS X-Ray, or Honeycomb for cloud-native environments.

Now let’s see how to use Prometheus + Tempo for logs and metrics correlation.

Tempo is a distributed tracing backend built by Grafana that integrates well with other Grafana tools (Loki and Prometheus).

You can install Tempo using Docker:

docker run -d --name tempo -p 14268:14268 grafana/tempo:latest

3. Instrument Your Code with OpenTelemetry SDK:

For Python/Node.js/Java/Go applications, you can install the appropriate OpenTelemetry SDK and start tracing.

Here’s a Python example:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation opentelemetry-exporter-otlp-proto-http

And a Node.js example:

npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/instrumentation

And one in Java:

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
    <version>1.0.0</version>
</dependency>

After installation, you can use the OpenTelemetry SDK to instrument the application and start collecting traces for HTTP requests, database queries, and other pipeline interactions.

4. Send Data to the Collector:

You can configure the SDK to send trace data to your OpenTelemetry Collector, which will then forward it to your backend (Jaeger, Tempo, and so on). Here’s an example for Python:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a tracer provider for the whole application.
trace.set_tracer_provider(TracerProvider())
# OTLP over HTTP: the Collector listens on port 4318 by default, and traces go to /v1/traces.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
# Batch spans in the background before exporting them.
processor = BatchSpanProcessor(exporter)
trace.get_tracer_provider().add_span_processor(processor)

If traces aren’t appearing, several issues might be occurring:

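Collector fails to start: Check logs with docker logs followed by the collector's container name (for example, docker logs otel-collector). Look for errors like “port conflict” or “invalid config.”
- Fix: Map the collector to a different host port (for example, -p 4319:4318) or verify the config file.
No traces in Jaeger: Ensure the collector's exporter points at Jaeger, and that your SDK can reach the collector's OTLP port (4318 for HTTP and 4317 for gRPC in current releases, 55680 in older ones). Jaeger's port 14250 speaks gRPC, so a plain curl request can only confirm that the port is open.
- Fix: Update the exporter endpoint in your SDK configuration.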
Resource constraints: Monitor usage with docker stats.
- Fix: Allocate at least 2GB RAM and 10GB disk space for the collector and backend.

Correlating Traces with Logs and Metrics

Combining traces with logs and metrics provides a holistic view of your pipeline’s operations, allowing you to pinpoint the root cause of issues more effectively.

OpenTelemetry and Grafana allow you to link traces, logs, and metrics into a unified view.

Let’s see how you can do this now.

1. Link Logs and Traces Using Correlation IDs:

When generating logs, include trace and span IDs in the log entries. This allows you to correlate logs with specific trace requests.

Here’s an example:

{
  "timestamp": "2025-05-10T12:00:00Z",
  "level": "error",
  "message": "Build failure",
  "trace_id": "1234567890abcdef",
  "span_id": "0987654321abcdef"
}
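One way to produce log entries like this is a custom formatter for Python's standard logging module, sketched below. The IDs are generated randomly here as stand-ins; in a real service they would come from your tracing SDK's current span. OpenTelemetry trace IDs are 32 hexadecimal characters and span IDs are 16, which is what the sketch generates.

```python
import io
import json
import logging
import secrets
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Format each record as one JSON object that carries trace context."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Supplied through the `extra` argument; None when no trace is active.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


# Hypothetical IDs; a tracing SDK would normally supply these.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters

# Write to an in-memory stream so the example is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonTraceFormatter())

logger = logging.getLogger("ci_pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("Build failure", extra={"trace_id": trace_id, "span_id": span_id})
log_line = stream.getvalue().strip()
print(log_line)
```

Because every entry is a single JSON object, a log backend such as Loki can parse the trace_id field and use it to link the log line to its trace.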

2. Integrating Logs (Loki) with Traces (Jaeger/Tempo) in Grafana:

Grafana can integrate traces from Jaeger or Tempo and correlate them with logs from Loki.

To do this:

Set up Loki and Tempo in Grafana.
In Grafana’s Explore view, you can search logs and traces side-by-side.
Create dashboards that show metrics, logs, and traces for a complete view of a request flow.

3. Using Prometheus Metrics with Traces:

Prometheus provides metrics that can be correlated with traces. For example, you can use exemplars in Prometheus to link specific metric data to trace data.

Example: If you have a high error rate in your build step, you can correlate this with trace data to identify which requests failed.

Creating Trace Visualizations for Complex Pipeline Operations

You can visualize traces with Jaeger or Tempo.

To do this in Jaeger:

Once your traces are in Jaeger, you can access the Jaeger UI (http://localhost:16686 by default) and use the search functionality to explore traces based on service name, trace ID, or specific operations.

Jaeger's trace view shows each request as a timeline of spans, which makes the latency and errors of each operation across services visible. For custom dashboards, you can add Jaeger as a data source in Grafana.

To do this in Tempo (Grafana Integration):

Tempo integrates with Grafana, where you can create dashboards that visualize trace data from your pipeline.

Create a Grafana dashboard:

Add Tempo as a data source in Grafana.
Use the "Trace" panel to query and visualize traces.
Combine trace visualizations with metrics (from Prometheus) and logs (from Loki) to get a unified view of your pipeline.

A typical trace visualization dashboard could show the duration of each step in your pipeline (build, test, deploy) and highlight where delays or errors occur, such as slow database queries or flaky tests.
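The per-stage view described above boils down to summing span durations by pipeline stage. The Python sketch below shows that calculation on a handful of made-up spans; the field names (stage, start_ms, end_ms) are assumptions for this example and do not match any particular backend's export format.

```python
from collections import defaultdict


def stage_durations_ms(spans):
    """Sum span durations per pipeline stage, in milliseconds."""
    totals = defaultdict(int)
    for span in spans:
        totals[span["stage"]] += span["end_ms"] - span["start_ms"]
    return dict(totals)


# Hypothetical spans from one pipeline run.
spans = [
    {"stage": "build", "name": "compile", "start_ms": 0, "end_ms": 900},
    {"stage": "build", "name": "package", "start_ms": 900, "end_ms": 1200},
    {"stage": "test", "name": "unit-tests", "start_ms": 1200, "end_ms": 2000},
    {"stage": "deploy", "name": "rollout", "start_ms": 2000, "end_ms": 3500},
]

durations = stage_durations_ms(spans)
slowest = max(durations, key=durations.get)
print(durations)
print(f"slowest stage: {slowest}")
```

A dashboard panel does the same aggregation continuously, but running it by hand on one trace is a quick way to confirm which stage deserves attention first.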

Troubleshooting Tempo Setup Issues

If Tempo fails to collect or display traces:

Container fails to start: Check logs with docker logs tempo. Look for errors like “port already in use” (for example, 14268) or “storage backend unavailable.”
- Fix: Change ports in the Docker command (for example, -p 14269:14268) or ensure the storage directory (for example, /tmp/tempo) exists and is writable.
No traces in Tempo: Verify the OpenTelemetry Collector is sending traces to Tempo’s endpoint (http://localhost:14268). Test connectivity with curl http://localhost:14268.
- Fix: Update the collector’s exporter configuration to point to the correct Tempo endpoint, and ensure no firewalls are blocking the connection.
Resource constraints: Monitor usage with docker stats or top on the host.
- Fix: Allocate at least 2GB RAM and 10GB disk space for Tempo, as tracing data can grow quickly with high-volume pipelines.

This bar chart displays the average latency (in milliseconds) for key stages of a CI/CD pipeline in May 2025. The Build stage averages around 1,200 ms (blue), the Test stage around 800 ms (yellow), and the Deploy stage around 1,500 ms (pink), highlighting that deployment is the most time-intensive step.

How to Build Comprehensive Debugging Dashboards

This section explains how to create Grafana dashboards to troubleshoot CI/CD pipeline issues effectively. We’ll focus on setting up visualizations for key metrics, logs, and system resources to identify problems like build failures or resource bottlenecks, using budget-friendly tools to keep your observability stack lean and actionable.

Designing Grafana Dashboards Specifically for Troubleshooting

Step 1: Understand the Key Metrics and Logs to Monitor

When designing a Grafana dashboard for debugging, you should focus on metrics and logs that help identify issues in the pipeline. These could include:

Build failures: Errors during build processes (compilation, test failures).
Deployment failures: Issues in deployment, such as failed jobs, resource limitations, or misconfigurations.
Container logs: Information about container status and logs (if using containers in your pipeline).
System resource usage: CPU, memory, and disk usage that may lead to performance bottlenecks.
CI/CD-specific metrics: Number of successful vs. failed pipeline runs, job duration, job queue times.

Step 2: Set Up Data Sources

To start building the dashboard, you’ll need to set up your data sources in Grafana. First, connect your Prometheus instance for collecting metrics. To do this, go to Configuration > Data Sources in Grafana. Then just add Prometheus as a data source and enter the URL (for example, http://localhost:9090).

Next, you need to connect your Loki instance for logs. So go ahead and add Loki as a data source by specifying the URL (for example, http://localhost:3100).

Note that if you're using other sources like InfluxDB or Elasticsearch, you’ll need to make sure that they’re properly connected as data sources.

Step 3: Create Panels and Visualizations

Now that your data sources are connected, you can start building your dashboard with the following panels:

Build Status Panel:
- Create a stat panel or gauge panel to show the success/failure ratio of pipeline runs.
- Query Prometheus or Loki for data like build status (success or failure), number of errors, and job durations.
Error Breakdown Panel:
- Use a pie chart to visualize the types of errors (for example, build, deployment, or system resource failures).
- Query the logs in Loki to break down error types based on the CI tool (for example, Jenkins, GitHub Actions).
Resource Utilization Panel:
- Use time series graphs to monitor CPU, memory, and disk usage over time, especially for resource-heavy builds or deployments.
Job Duration Panel:
- Use bar charts or line graphs to track the average duration of jobs over time. Set thresholds for warning signs if a job takes longer than expected.

Troubleshooting Grafana Dashboard Issues

If Grafana dashboards fail to display data or show errors, you might be having one of these issues:

Missing data sources: If metrics, logs, or traces aren’t appearing, verify data source connections in Grafana (for example, Prometheus, Loki, Tempo). Check under Configuration > Data Sources.
- Fix: Ensure the data source URLs are correct (for example, http://localhost:9090 for Prometheus) and test the connection. Re-add the data source if needed.
Incorrect Trace IDs: If trace visualizations (for example, Tempo panels) show no data, confirm that trace IDs in logs match those in Tempo. Use a query like {job="ci_cd"} | json | trace_id="1234567890abcdef" in Loki to cross-check.
- Fix: Ensure your application logs include trace and span IDs, and verify the OpenTelemetry SDK is correctly instrumented to send traces to Tempo.
Resource Constraints: Monitor Grafana’s resource usage with docker stats if running in a container, or top on the host.
- Fix: Allocate at least 4GB RAM and 10GB disk space for Grafana, especially when rendering complex dashboards with multiple data sources.

How to Set Up Drill-Down Paths from High-Level to Detailed Views

Step 1: Create High-Level Overview Panel

At the top of the dashboard, include a high-level overview panel that summarizes the overall status of the pipeline. This could be:

Success/Failure Count: A simple stat panel showing the count of successful vs. failed runs.
Pipeline Health Status: Display an overall health check of your pipeline using color-coded indicators (green for healthy, red for issues).

Step 2: Set Up Drill-Down Links

To allow users to drill down from high-level information to detailed views:

1. Link to detailed build information:

You can create a time series graph that shows build job durations. Add a link to a detailed log view when clicking on a failed job.

For example, when clicking a failed build, you can link to a detailed panel or a separate dashboard that shows the logs and error messages related to that specific run.

2. Link to Logs in Loki:

You can use Loki's LogQL queries to set up a drill-down path. When users click on an error type or a specific job name, it should automatically filter logs for that job or error type.

You can set up drill-down interactions using Dashboard Links in Grafana. In the panel settings, under Links, specify the link to another dashboard that shows detailed logs filtered by the job name or failure type.

Step 3: Implement Time Range Filters

To enhance drill-down functionality, you can add a time range filter to allow users to adjust the time window for both logs and metrics. This enables them to zoom in on a specific time frame where failures occurred.

How to Create Shared Dashboards for Team Troubleshooting

Once your dashboard is designed, you can share it with your team for collaborative troubleshooting:

First, you’ll want to make sure that the correct permissions are set up for your team. You can define specific roles in Grafana with access to the dashboard. Go to Dashboard Settings > Permissions, and grant view or edit access to users or teams.

Next, you can directly share a link to the dashboard with your team members. Use the Share option in the top-right corner of the dashboard, which provides a direct URL and also options to embed the dashboard into other tools (for example, Slack, email).

You can also use template variables to allow users to filter and adjust the dashboard for different pipeline runs or environments. For example, add a variable for build_id, job_name, or branch_name that allows users to select specific builds or branches for more granular troubleshooting.

How to Set Up Alerting

To ensure your team is notified of any pipeline failures, you can set up alerting rules. There are a few important ones you’ll want to set up.

First, create alerts for critical issues, like when a pipeline fails or exceeds expected resource usage. This could be for things like build time exceeding a threshold or failure of a deployment stage.

Grafana can send alerts via various channels such as Slack, email, or webhook.

You can also integrate your dashboards with tools like Slack or Teams for real-time notifications and collaboration. Set up automated messages for your team when the dashboard indicates an issue.

How to Create Automated Diagnostic Tools

Building Scripts that Collect Relevant Logs During Failures

To automate log collection during failures, you need scripts that can capture logs from different CI/CD stages and services as soon as a failure is detected. Here are the steps you can follow to do this:

1. Write Failure Detection Script:

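You can leverage the exit status codes of your CI/CD tools to detect failures. For example, in GitLab CI/CD or GitHub Actions, you can check whether the last command failed by inspecting $? in Unix-based shells. Keep in mind that these runners typically stop a script at the first failing command, so the check has to run in the same shell block as the command and the command must not abort the script first (for example, wrap it with if ! or ||).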

# Example for GitLab CI/CD
if [ $? -ne 0 ]; then
    echo "Failure detected, collecting logs..."
    # Custom log collection script call
    ./collect_logs.sh
fi

2. Log Collection Script (collect_logs.sh):

The script should collect relevant logs, system metrics, and trace information. For instance:

#!/bin/bash
LOG_DIR="/path/to/logs"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="${LOG_DIR}/backup/${TIMESTAMP}"
mkdir -p "$BACKUP_DIR"  # Quote paths so spaces do not split them

# Collect logs from CI/CD agents, containers, or system logs
cp /var/log/ci_cd/*.log "$BACKUP_DIR"/
cp /path/to/docker_logs/*.log "$BACKUP_DIR"/
# Collect metrics or traces from monitoring systems if needed

3. Use CI/CD Artifacts:

For platforms like GitLab, GitHub Actions, or Jenkins, you can upload logs as artifacts for further investigation. Configure these platforms to save logs in case of a failure.

Here’s an example for GitHub Actions:

steps:
  - name: Run Tests
    run: |
      npm run test
  - name: Upload logs if test fails
    if: failure()
    uses: actions/upload-artifact@v4
    with:
      name: test-logs
      path: /path/to/test/logs

4. Centralized Logging:

Instead of manually collecting logs, you can centralize log storage using logging systems like Grafana Loki, ELK stack, or even cloud-based solutions. This will ensure that logs are accessible even if they are overwritten or lost on individual systems.

How to Implement Automatic Analysis of Common Error Patterns

Once logs are collected, you can automate the analysis process by defining common error patterns and automatically searching for them in your logs.

Step 1: Define Error Patterns:

Establish error signatures or patterns that are common in your CI/CD process, such as failed builds due to missing dependencies, permission issues, or network timeouts.

You can use regular expressions (regex) to capture these patterns. Here’s an example that defines a regex for failed test output:

TEST_FAILURE_REGEX=".*FAILURE.*"
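
As a rough sketch of how several signatures might be organized together, the JavaScript below maps a few named categories to regular expressions and classifies a single log line. The category names and the patterns themselves are illustrative assumptions, not taken from any particular CI tool, so adjust them to the messages your pipeline actually emits.

```javascript
// Hypothetical error signatures; tune the patterns to your own pipeline's output.
const ERROR_SIGNATURES = [
  { category: "missing-dependency", pattern: /cannot find module|dependency not found/i },
  { category: "permission", pattern: /permission denied|EACCES/i },
  { category: "network-timeout", pattern: /timed? ?out|ETIMEDOUT/i },
  { category: "test-failure", pattern: /FAILURE/ },
];

// Return the first matching category for a log line, or null if nothing matches.
function classifyLine(line) {
  for (const { category, pattern } of ERROR_SIGNATURES) {
    if (pattern.test(line)) return category;
  }
  return null;
}
```

Because the list is checked in order, put the most specific signatures first so a generic pattern does not shadow them.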

Step 2: Create Log Analysis Script:

Next, you can write a script that scans logs for these common patterns. The script could then categorize or flag errors.

Here’s an example using grep to detect failure patterns:

#!/bin/bash
LOG_DIR="/path/to/logs"
# Use a non-.log extension so grep does not scan its own output file
ERROR_LOG="${LOG_DIR}/error_patterns.txt"
: > "$ERROR_LOG"  # Create the file, or empty it if it is left over from a previous run

# Define error patterns to search for
ERROR_PATTERNS=("FAILURE" "ERROR" "TIMEOUT")

for PATTERN in "${ERROR_PATTERNS[@]}"; do
    grep -i -- "$PATTERN" "$LOG_DIR"/*.log >> "$ERROR_LOG"
done

if [ -s "$ERROR_LOG" ]; then
    echo "Error patterns found, review $ERROR_LOG."
fi

Step 3: Automate Alerting:

Once an error pattern is detected, you can integrate the log analysis script with your alerting system (for example, sending an email or Slack notification).

Here’s an example of sending a Slack notification:

if [ -s $ERROR_LOG ]; then
    curl -X POST -H 'Content-type: application/json' \
         --data '{"text":"Error detected in CI pipeline. Check error log."}' \
         https://hooks.slack.com/services/YOUR_SLACK_WEBHOOK_URL
fi

Step 4: Use Observability Tools for Pattern Recognition:

Leverage observability tools such as Grafana Loki for log querying and Prometheus for metrics, and visualize both in Grafana. You can create dashboards and alert rules that flag anomalies like high failure rates or recurring errors.

Example: Set up a Grafana dashboard with alert rules based on log frequency.

How to Create Self-Healing Pipelines Based on Known Issues

Self-healing pipelines can automatically address issues when they are detected by executing pre-defined corrective actions. Let’s walk through how you can set one up.

Step 1: Define Common Failures and Solutions:

Identify recurring issues (for example, dependency issues, build timeouts, flaky tests) that occur in your pipeline. Then, define self-healing actions to mitigate these issues.

Here’s an example of automatically retrying a failed step if it is a known flaky test:

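jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Run Tests
        id: tests                # Lets later steps refer to this step's outcome
        continue-on-error: true  # Keep the job going so the retry can run
        run: |
          npm run test
      - name: Retry Tests if Failed
        if: steps.tests.outcome == 'failure'
        run: |
          echo "Retrying tests..."
          npm run test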

Step 2: Automatic Rollbacks:

Set up a rollback process for failed deployments. For instance, if a deployment to production fails, the pipeline can automatically revert to the last successful build.

Example in GitLab CI/CD:

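deploy_production:
  stage: deploy
  script: ./deploy.sh
  retry: 2                 # GitLab allows at most 2 retries per job

rollback_production:
  stage: .post             # Built-in final stage, so it runs after deploy
  script: ./rollback.sh    # Hypothetical script that redeploys the last successful build
  when: on_failure         # Runs only if a job in an earlier stage failed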

Step 3: Build Self-Healing Logic Using Retry Mechanisms:

Implement retry logic for transient issues (like network glitches) that often cause failures.

Example of retrying a step in GitHub Actions:

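steps:
  - name: Retry Deployment
    run: |
      attempts=0
      max_attempts=3
      until [ "$attempts" -ge "$max_attempts" ]
      do
        deploy_script && break
        attempts=$((attempts+1))
        echo "Attempt $attempts failed. Retrying..."
        sleep 5
      done
      # Fail the step if every attempt failed (otherwise the loop exits with status 0)
      [ "$attempts" -lt "$max_attempts" ] || exit 1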
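
If the step is driven from a Node.js script rather than shell, the same retry-then-fail idea might look like the sketch below. It is a minimal synchronous version: retrySync and the action callback are hypothetical names, and it leaves out the delay between attempts that a real deployment would usually want.

```javascript
// Call `action` up to `maxAttempts` times and return its result on the first success.
// If every attempt throws, rethrow the last error so the CI step fails.
function retrySync(action, maxAttempts) {
  let lastError = new Error("no attempts were made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return action(attempt);
    } catch (err) {
      lastError = err;
      console.log(`Attempt ${attempt} failed: ${err.message}`);
    }
  }
  throw lastError;
}
```

Rethrowing after the final attempt matters: a retry helper that swallows the last error would report a failed deployment as successful.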

Step 4: Automate Corrective Actions for Dependency Issues:

Set up automatic fixes for dependency-related failures, like clearing caches or re-installing dependencies:

if grep -q "dependency not found" error.log; then
    echo "Dependency issue detected, reinstalling dependencies..."
    npm install
fi

Step 5: Integrate with Self-Healing Services:

For more complex self-healing, you can integrate tools like Ansible, Puppet, or even create custom scripts that auto-patch common configuration issues.

How to Conduct Effective Postmortems Using Logs

Logs are often the single most valuable resource when reconstructing what went wrong in a CI/CD pipeline. Conducting effective postmortems with log data allows teams to extract clear timelines, pinpoint root causes, and define steps to prevent recurrence – all based on concrete evidence.

Extract Timeline and Key Events from the Logs

To reconstruct what happened, and when, from your logs, you can follow a straightforward process.

Step 1: Centralize and Structure Logs:

First, make sure that the logs from all pipeline stages (build, test, deploy) are aggregated in a central place like Grafana Loki, ELK, or OpenSearch.

And you’ll want to use a consistent log format (like structured JSON) that includes timestamps, log levels, pipeline stage identifiers, and correlation/request IDs.
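
To make the idea of a consistent format concrete, here is a small sketch that emits one structured JSON log line per pipeline event. The field names (stage, correlationId) and the sample values are assumptions chosen for illustration, not a schema that Loki or Elasticsearch requires.

```javascript
// Build one structured log line (JSON) for a pipeline event.
// Field names (stage, correlationId) are illustrative, not a required schema.
function formatLogEntry({ level, stage, correlationId, message, time = new Date() }) {
  return JSON.stringify({
    timestamp: time.toISOString(), // ISO 8601 in UTC sorts chronologically as text
    level,
    stage,
    correlationId,
    message,
  });
}

console.log(formatLogEntry({
  level: "error",
  stage: "test",
  correlationId: "run-1234",
  message: "test failed",
}));
```

Emitting every stage's logs through one helper like this keeps the fields identical across build, test, and deploy, which is what makes later filtering and correlation possible.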

Step 2: Build a Chronological View:

You can use timestamp filters in your log UI (for example, Kibana, Grafana Explore) to isolate logs from the incident timeframe.

Look for key lifecycle events, like:

Start and completion of pipeline steps
Status changes (for example, "test failed", "deployment started", "build queued")
Error messages and warnings
Retry events or unexpected restarts
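
If you have exported the raw log lines, a chronological view can also be assembled in code. The sketch below assumes each line is a JSON object with an ISO 8601 timestamp field (an assumption about your log format); buildTimeline is a hypothetical helper name.

```javascript
// Parse JSON log lines, keep those inside [from, to], and sort them by timestamp.
// Assumes each line is a JSON object with an ISO 8601 `timestamp` field.
function buildTimeline(lines, from, to) {
  const start = Date.parse(from);
  const end = Date.parse(to);
  return lines
    .map((line) => {
      try {
        return JSON.parse(line);
      } catch {
        return null; // skip lines that are not valid JSON
      }
    })
    .filter((entry) => entry && typeof entry.timestamp === "string")
    .filter((entry) => {
      const t = Date.parse(entry.timestamp);
      return t >= start && t <= end;
    })
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

Lines that fail to parse are dropped rather than aborting the run, since incident logs often contain a mix of structured and unstructured output.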

Step 3: Extract Logs Programmatically (optional):

Use queries (LogQL, Elasticsearch DSL) to export relevant logs for analysis or inclusion in a post-mortem document.

How to Identify Root Causes Through Log Analysis

To go beyond symptoms and find the real issue, there are various steps you can take.

Start by looking for the first failure. You can filter logs by level=error or use log pattern matching to identify the earliest sign of failure. Then trace backward from the failure using correlation IDs or pipeline step identifiers.

Second, make sure you correlate logs across systems. Match logs across CI/CD tools (like GitHub Actions → Docker logs → Kubernetes logs). You can use shared correlation IDs or job IDs to group logs from related events.

Next, pay attention to intermittent signals. Warnings, retries, or degraded performance preceding the failure may reveal environmental or configuration-related issues.

And finally, check for external dependencies. Look for timeout or connection errors involving third-party services, cloud APIs, or internal infrastructure components.
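
The first two steps (find the earliest error, then trace backward across systems by correlation ID) can be sketched as follows. The entries are assumed to be already-parsed log objects with timestamp, level, source, and correlationId fields; those names and both helper functions are hypothetical.

```javascript
// Entries are assumed to be parsed log objects with `timestamp`, `level`,
// `source`, and `correlationId` fields (hypothetical names).

// Earliest entry logged at error level, or undefined if there is none.
function firstFailure(entries) {
  return entries
    .filter((e) => e.level === "error")
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp))[0];
}

// Everything sharing the failure's correlation ID that happened at or before it,
// oldest first, so you can read the lead-up to the failure across systems.
function eventsLeadingTo(entries, failure) {
  const failedAt = Date.parse(failure.timestamp);
  return entries
    .filter((e) => e.correlationId === failure.correlationId)
    .filter((e) => Date.parse(e.timestamp) <= failedAt)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

Warnings and retries that appear in the result just before the failure are the intermittent signals worth examining first.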

How to Create Actionable Follow-Ups to Prevent Recurrence

There are various things you can do to turn your findings into meaningful process improvements.

1. Document the Findings Clearly:

Create a structured post-mortem doc that includes:

Timeline of events with log excerpts
Immediate trigger and root cause (based on logs)
Impact summary and affected components
Screenshots or saved log queries for reference

2. Define Preventive Actions:

Examples include:

Adding missing alerts or log-based monitors
Improving log verbosity or adding missing metadata
Fixing brittle test cases or deployment scripts
Updating infrastructure limits or retry strategies

3. Assign Ownership and Deadlines:

Each action item should have a responsible owner and a due date. If applicable, create automated tests or guardrails to catch similar issues in the future.

4. Update Runbooks and Incident Playbooks:

Add log patterns, example queries, and resolutions to shared documentation. This ensures the next person facing a similar issue can act faster.

Pro Tip: Automate part of your post-mortem process by tagging logs from failed CI runs, exporting them to a shared location, and pre-generating dashboards or incident reports. This reduces manual effort and increases consistency.

How to Optimize Log Storage and Management

As your CI/CD system grows, logs can become massive, consuming storage and impacting performance. Optimizing log storage helps you make sure that you're retaining what's valuable while staying efficient.

How to Implement Log Rotation and Retention Policies

Without rotation and retention, logs will pile up endlessly, leading to disk space exhaustion and poor performance. You can help prevent this with log rotation.

Log rotation involves creating new log files after a size or time threshold and archiving or deleting old ones.

Linux logrotate tool – Configure /etc/logrotate.d/:

/var/log/ci_cd/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0640 root adm
}

This example:

Rotates daily
Keeps 7 days of logs
Compresses old logs to save space

Docker logs rotation – in daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}

Retention policies ensure that old logs are automatically deleted based on age or storage usage.

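You can set one up in Loki like this. This example uses the Table Manager, which Loki has deprecated in favor of compactor-based retention, so check the documentation for the Loki version you run: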

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h  # 7 days

Or in Elasticsearch, use Index Lifecycle Management (ILM):

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "3d", "max_size": "1gb" }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}

How to Set Up Log Compaction for Long-Term Storage

Compaction reduces redundancy and keeps only critical log info, which is ideal for long-term audits or analytics.

Compaction Techniques:

There are several compaction techniques you can try. Here are a few:

1. Loki (boltdb-shipper mode):

Uses compaction to merge log chunks and reduce storage.

Configure in loki-config.yaml:

  schema_config:
    configs:
      - from: 2023-01-01
        store: boltdb-shipper
        object_store: filesystem
        schema: v11

Use a low-retention, high-compaction strategy for archived logs.

2. Elasticsearch:

Use rollup jobs to reduce the resolution of old data.
A rollup stores summarized data instead of raw logs, for example, hourly counts of similar events.

3. Archive to cheaper storage:

Move infrequent-access logs to S3 or Azure Blob Storage using lifecycle rules.

How to Balance Observability with Resource Constraints

More logs = more observability, but also more cost and overhead. This means that you need a balance. There are various strategies that can help you achieve this balance:

Log at appropriate levels:
- Avoid excessive debug or trace logs in production.
- Use info and warn levels judiciously.
- Only use error or critical for actionable failures.
Sample logs:
- If high-volume pipelines generate repetitive logs, enable log sampling to reduce duplicates.
- Tools like Vector or Fluent Bit support sampling.
Filter out noise:
- Use log filters to exclude non-critical logs before they reach the central system.
Separate hot vs. cold logs:
- Hot logs: recent, real-time data for active debugging.
- Cold logs: archived for compliance, stored with lower performance/storage priority.
Compress everything:
- Use gzip/zstd compression for both stored and transmitted logs.
- Loki, Elasticsearch, and Vector support compression out of the box.
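
To illustrate the sampling idea from the list above, here is a deliberately simple sketch that keeps the first occurrence of each message and then only every Nth one. It is a deterministic stand-in written for this example, not how Vector or Fluent Bit implement sampling, and sampleLogs is a hypothetical name.

```javascript
// Keep the first occurrence of each message, then only every `rate`-th occurrence.
// A simplified, deterministic stand-in for the sampling that log shippers offer.
function sampleLogs(messages, rate) {
  const seen = new Map(); // message -> number of times seen so far
  return messages.filter((message) => {
    const count = (seen.get(message) || 0) + 1;
    seen.set(message, count);
    return count === 1 || count % rate === 0;
  });
}
```

Always keeping the first occurrence means a new kind of message is never lost, while repeats are thinned out.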

Conclusion

In this handbook, you have built a full-stack observability layer specifically optimized for CI/CD pipelines without breaking your infrastructure budget. You now have the tools and know-how to:

Deploy Grafana Loki or a lightweight ELK alternative to capture structured logs from all parts of your pipeline.
Unify and enrich logs across CI/CD tools (for example, GitHub Actions, Jenkins, GitLab) using consistent formats and correlation IDs.
Use powerful log queries (LogQL, Kibana Query Language) to diagnose build failures, flaky tests, and deployment issues with precision.
Correlate logs with metrics and traces to gain deep, contextual visibility into pipeline behavior.
Design reusable debugging dashboards and automation that turn raw logs into insights and action.
Build a culture of shared troubleshooting knowledge through post-mortems, runbooks, and log-driven retrospectives.

To see the full-stack observability layer in action, check out the complete code and configurations in my GitHub repository: github.com/Emidowojo/CICDObservability. This repo includes all the setups for Grafana Loki, OpenTelemetry, Prometheus, and more, so you can deploy and explore the entire pipeline observability stack.

Next Steps for Advanced Observability Implementation

Here’s how you can take your setup even further:

Fully integrate distributed tracing: Deploy OpenTelemetry agents across your build and deployment stages. This will help you visualize how code, builds, and deployments flow across systems in real-time.
Automate diagnostic scripts and alerts: Build scripts to auto-collect logs and metrics on failure, and trigger alerts when known patterns reoccur. This enables faster detection and even self-healing pipelines.
Scale and harden your log infrastructure: As usage grows, implement log retention, compaction, and storage policies. Explore scalable backends like ClickHouse or object storage (e.g., S3) for long-term archiving.
Train your team on observability best practices: Share dashboards, create onboarding docs, and schedule log-analysis sessions to build team familiarity with your tools and practices.

By investing in observability early and thoughtfully, you not only reduce the time to detect and resolve issues, you also build a more resilient, predictable, and transparent delivery process for your entire engineering team.

I hope this comes in handy for you someday. If you made it to the end of this handbook, thanks for reading! You can connect with me on LinkedIn or on X @Emidowojo if you’d like to stay in touch.

How to Build a Production-Ready DevOps Pipeline with Free Tools

Opaluwa Emidowojo — Mon, 28 Apr 2025 20:15:34 +0000

A few months ago, I dove into DevOps, expecting it to be an expensive journey requiring costly tools and infrastructure. But I discovered you can build professional-grade pipelines using entirely free resources.

If DevOps feels out of reach because you’re concerned about the cost, don't worry. I’ll guide you step-by-step through creating a production-ready pipeline without spending a dime. Let's get started!

Prerequisites
Introduction
How to Set Up Your Source Control and Project Structure
How to Build Your CI Pipeline with GitHub Actions
How to Optimize Docker Builds for CI
Infrastructure as Code Using Terraform and Free Cloud Providers
How to Set Up Container Orchestration on Minimal Resources
How to Create a Free Deployment Pipeline
How to Build a Comprehensive Monitoring System
How to Implement Security Testing and Scanning
Performance Optimization and Scaling
Putting it All Together
Conclusion

🛠 Prerequisites

Basic Git knowledge: Cloning repos, creating branches, committing code, and creating PRs
Familiarity with command line: For Docker, Terraform, and Kubernetes
Basic understanding of CI/CD: Continuous integration/delivery concepts and pipelines

Accounts needed:

GitHub account
At least one cloud provider: AWS Free Tier (recommended), Oracle Cloud Free Tier, or Google Cloud/Azure with free credits
Terraform Cloud (free tier) for infrastructure state management
Grafana Cloud (free tier) for monitoring
UptimeRobot (free tier) for external availability checks

Tools to Install Locally

Tool	Purpose	Installation Link
Git	Version control	Install Git
Docker	Containerization	Install Docker
Node.js & npm	Sample app & builds	Install Node.js
Terraform	Infrastructure as Code	Install Terraform
kubectl	Kubernetes CLI	Install kubectl
k3d	Lightweight Kubernetes	Install k3d
Trivy	Container security scanning	Install Trivy
OWASP ZAP	Web security scanning	Install ZAP

Optional but Helpful:

VS Code or any good code editor
Postman for testing APIs
Understanding of YAML and Dockerfiles

Introduction

When people hear "DevOps," they often picture complex enterprise systems powered by pricey tools and premium cloud services. But the truth is, you don't actually need a massive budget to build a solid, professional-grade DevOps pipeline. The foundations of good DevOps – automation, consistency, security, and visibility – can be built entirely with free tools.

In this guide, you will learn how to build a production-ready DevOps pipeline using zero-cost resources. We will use a simple CRUD (Create, Read, Update, Delete) app with frontend, backend API, and database as our example project to demonstrate every step of the process.

How to Set Up Your Source Control and Project Structure

1. Create a Well-Structured Repository

A clean repo is the foundation of your pipeline. We will set up:

Separate folders for frontend, backend, and infrastructure
A .github folder to hold workflow configurations
Clear naming conventions and a well-written README.md

🛠 Tip: Use semantic commit messages and consider adopting Conventional Commits for clarity in versioning and changelogs.

2. Set Up Branch Protection Without Paid Features

While GitHub's more advanced rules require Pro, you can still:

Require pull requests before merging
Enable status checks to prevent broken code from landing in main
Enforce linear history for cleaner version control

💡 This makes your project safer and more collaborative, without needing GitHub Enterprise.

3. Implement PR Templates and Automated Checks

Make your reviews smoother:

Add a PULL_REQUEST_TEMPLATE.md to guide contributors
Use GitHub Actions (which we'll set up in the next part) for linting, tests, and formatting checks

✨ These tiny improvements add polish and professionalism.

4. Configure GitHub Issue Templates and Project Boards

Even solo developers benefit from issue tracking:

Add issue templates for bugs and features
Use GitHub Projects to manage work with a Kanban board, all free and native to GitHub

📌 Bonus: This setup lays the groundwork for GitOps practices later on.

5. Advanced Technique: Set Up Custom Validation Scripts as Pre-Commit Hooks

Before code ever hits GitHub, you can catch issues locally with Git hooks. Using a tool like Husky or pre-commit, you can:

Lint code before it's committed
Run tests or formatters automatically
Prevent secrets from being accidentally committed

# Initialize Husky and install the needed dependencies (Husky v8 commands),
# then add a pre-commit hook that runs tests before allowing the commit
npx husky-init && npm install
npx husky add .husky/pre-commit "npm test"
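
For the "prevent secrets from being committed" item, a hook could call a small check like the sketch below on the staged changes. The two patterns are illustrative only and findSecrets is a hypothetical helper; dedicated secret scanners cover far more cases.

```javascript
// Illustrative patterns only; dedicated scanners cover far more cases.
const SECRET_PATTERNS = [
  { name: "AWS access key ID", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: "private key block", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
];

// Return the names of the patterns found in `text` (for example, a staged diff).
function findSecrets(text) {
  return SECRET_PATTERNS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ name }) => name);
}
```

A pre-commit script would pass the staged diff to this function and exit with a non-zero status when the returned list is not empty, which blocks the commit.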

6. Sample CRUD App Setup:

Our CRUD app manages users (create, read, update, delete). Below is the minimal code with comments to explain each part:

Backend (backend/):

backend/package.json (JSON files cannot contain comments, so the fields are described here: start runs the server, test is a placeholder until you add Jest, lint checks code style with ESLint, express is the web framework for the API endpoints, and pg is the PostgreSQL client):

{
  "name": "crud-backend",
  "version": "1.0.0",
  "scripts": {
    "start": "node index.js",
    "test": "echo 'Add tests here'",
    "lint": "eslint ."
  },
  "dependencies": {
    "express": "^4.17.1",
    "pg": "^8.7.3"
  },
  "devDependencies": {
    "eslint": "^8.0.0"
  }
}

// backend/index.js
const express = require('express'); // Import Express for building the API
const { Pool } = require('pg'); // Import PostgreSQL client
const app = express(); // Create an Express app
app.use(express.json()); // Parse JSON request bodies

// Connect to PostgreSQL using DATABASE_URL from environment variables
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Health check endpoint for Kubernetes probes and monitoring
app.get('/healthz', (req, res) => res.json({ status: 'ok' }));

// Get all users from the database
app.get('/users', async (req, res) => {
  try {
    const { rows } = await pool.query('SELECT * FROM users'); // Query the users table
    res.json(rows); // Send users as JSON
  } catch (err) {
    // Express 4 does not catch rejected promises, so handle database errors here
    res.status(500).json({ error: 'Database error' });
  }
});

// Add a new user to the database
app.post('/users', async (req, res) => {
  try {
    const { name } = req.body; // Get name from request body
    // Insert user and return the new record
    const { rows } = await pool.query('INSERT INTO users(name) VALUES($1) RETURNING *', [name]);
    res.json(rows[0]); // Send the new user as JSON
  } catch (err) {
    res.status(500).json({ error: 'Database error' });
  }
});

// Start the server on port 3000
app.listen(3000, () => console.log('Backend running on port 3000'));

Frontend (frontend/):

frontend/package.json (again without comments, because JSON does not allow them: start runs the dev server, build creates the production build, test runs the test suite through react-scripts, which uses Jest, lint checks code style with ESLint, and axios is the HTTP client for API calls):

{
  "name": "crud-frontend",
  "version": "1.0.0",
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "lint": "eslint ."
  },
  "dependencies": {
    "react": "^17.0.2",
    "react-dom": "^17.0.2",
    "react-scripts": "^4.0.3",
    "axios": "^0.24.0"
  },
  "devDependencies": {
    "eslint": "^8.0.0"
  }
}

// frontend/src/App.js
import React, { useState, useEffect } from 'react'; // Import React and hooks
import axios from 'axios'; // Import Axios for API requests

function App() {
  // State for storing users fetched from the backend
  const [users, setUsers] = useState([]);
  // State for the input field to add a new user
  const [name, setName] = useState('');

  // Fetch users when the component mounts
  useEffect(() => {
    axios.get('http://localhost:3000/users').then(res => setUsers(res.data));
  }, []); // Empty array means run once on mount

  // Add a new user via the API
  const addUser = async () => {
    const res = await axios.post('http://localhost:3000/users', { name }); // Post new user
    setUsers([...users, res.data]); // Update users list
    setName(''); // Clear input field
  };

  return (
    <div>
      <h1>Users</h1>
      {/* Input for new user name */}
      <input value={name} onChange={e => setName(e.target.value)} />
      {/* Button to add user */}
      <button onClick={addUser}>Add User</button>
      {/* List all users */}
      <ul>{users.map(user => <li key={user.id}>{user.name}</li>)}</ul>
    </div>
  );
}

export default App; // Export the component

Database Setup:

-- infra/db.sql
-- Create a table to store users
CREATE TABLE users (
  id SERIAL PRIMARY KEY, -- Auto-incrementing ID
  name VARCHAR(100) NOT NULL -- User name, required
);

Project structure:

crud-app/
├── backend/
│   ├── package.json
│   └── index.js
├── frontend/
│   ├── package.json
│   └── src/App.js
├── infra/
│   └── db.sql
├── .github/
│   └── workflows/
└── README.md

This app provides a /users endpoint (GET/POST) and a frontend to list/add users, stored in PostgreSQL. The /healthz endpoint supports monitoring. Save this code in your repo to follow the pipeline steps.
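
The users table requires a non-empty name of at most 100 characters, but the POST handler passes the request body straight to the database. One possible way to reject bad input early is a small check like the one below; validateUserName is a hypothetical helper, not part of the original app.

```javascript
// Check a request body's `name` against the users table definition
// (NOT NULL, VARCHAR(100)). Returns an error message, or null if valid.
function validateUserName(name) {
  if (typeof name !== "string" || name.trim() === "") {
    return "name is required";
  }
  // Array.from counts characters rather than UTF-16 code units
  if (Array.from(name).length > 100) {
    return "name must be at most 100 characters";
  }
  return null;
}
```

In the POST /users handler you could call this before the INSERT and respond with status 400 and the message whenever it returns a non-null value.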

How to Build Your CI Pipeline with GitHub Actions

1. Set Up Your First GitHub Actions Workflow

First, let’s create a basic workflow that automatically builds, tests, and lints your app every time you push code or open a pull request. This ensures your app stays healthy and any issues are caught early.

Create a file at .github/workflows/ci.yml and add the following:

# CI workflow to build, test, and lint the CRUD app on push or pull request
name: CI Pipeline
on:
  push:
    branches: [main] # Trigger on pushes to main branch
  pull_request:
    branches: [main] # Trigger on PRs to main branch
jobs:
  build:
    runs-on: ubuntu-latest # Use GitHub's free Linux runner
    steps:
      - uses: actions/checkout@v3 # Check out the repository code
      - name: Set up Node.js # Install Node.js environment
        uses: actions/setup-node@v3
        with:
          node-version: '18' # Use Node.js 18 for consistency
      - name: Cache dependencies # Cache npm's download cache to speed up installs
        uses: actions/cache@v3
        with:
          path: ~/.npm # Cache npm’s global cache
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} # Key based on OS and package-lock.json
      - run: npm ci # Install dependencies reliably using package-lock.json
      - run: npm test # Run tests defined in package.json
      - run: npm run lint # Run ESLint to ensure code quality

This workflow automatically runs on every push and pull request to the main branch. It installs dependencies, runs tests, and performs code linting, with dependency caching to make builds faster over time.

Common Issues and Fixes:

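“Secret not found”: If a workflow references a secret such as AWS_ACCESS_KEY_ID, ensure it is defined in the repository secrets (Settings → Secrets and variables → Actions).
Tests fail: If tests that need the database fail (for example, test/users.test.js), check their database connectivity.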

Understanding GitHub Actions' Free Tier Limits

Before building more workflows, it is important to know what GitHub offers for free.

If you are working on private repositories, the GitHub Free plan includes 2,000 Actions minutes per month. For public repositories, standard GitHub-hosted runners are free with unlimited minutes.

To avoid hitting limits quickly:

Cache your dependencies to cut down install times.
Only trigger workflows on meaningful branches (like main or release).
Skip unnecessary steps when you can.
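
To get a feel for how far the quota stretches, you can estimate the number of runs it allows. The sketch below assumes Linux runners (no operating system multiplier) and that each job is billed separately and rounded up to a whole minute; the job durations are made-up examples and runsPerMonth is a hypothetical helper.

```javascript
// Estimate how many workflow runs fit in a monthly minutes quota.
// Assumes each job is billed separately and rounded up to a whole minute,
// on Linux runners (no OS multiplier).
function runsPerMonth(jobDurationsSeconds, quotaMinutes = 2000) {
  const minutesPerRun = jobDurationsSeconds
    .map((seconds) => Math.ceil(seconds / 60))
    .reduce((total, minutes) => total + minutes, 0);
  return minutesPerRun === 0 ? Infinity : Math.floor(quotaMinutes / minutesPerRun);
}

// Three jobs of 50 s, 130 s, and 45 s are billed as 1 + 3 + 1 = 5 minutes per run.
console.log(runsPerMonth([50, 130, 45])); // 400
```

Under these assumptions, splitting work into many very short jobs can cost more minutes than one slightly longer job, because every job rounds up on its own.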

2. Creating a Multi-Stage Build Pipeline

As your app grows, it is better to split your CI pipeline into clear stages like install, test, and lint. This structure makes workflows easier to maintain and speeds things up, because some jobs can run in parallel.

Here’s how you can split the work into multiple jobs for better clarity:

jobs:
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci  # Clean install of dependencies

  test:
    needs: install  # This job depends on the install job finishing
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci    # Each job starts on a fresh runner, so install dependencies again
      - run: npm test  # Run test suite

  lint:
    needs: install  # This job also depends on install but runs in parallel with test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci        # Fresh runner: install dependencies before linting
      - run: npm run lint  # Run linting checks

By breaking the pipeline into stages, you can quickly spot which step fails, and your test and lint jobs can run at the same time after dependencies are installed.

3. Implement Matrix Builds for Cross-Environment Testing

When you want your app to work across different Node.js versions or databases, matrix builds are your best bet. They let you test across multiple environments in parallel, without duplicating code.

Here’s how you can set up a matrix strategy, to test across multiple environments simultaneously:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [14.x, 16.x, 18.x]  # Test on multiple Node versions
        database: [postgres, mysql]        # Test against different databases
    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm install
      - run: npm test  # Runs once per combination (3 Node versions × 2 databases = 6 jobs)
        env:
          DB_TYPE: ${{ matrix.database }}  # Hypothetical variable your tests would read to pick the database

Matrix builds save time and help you catch environment-specific bugs early.
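
To see where the six jobs come from, here is a small sketch that expands a matrix definition into every combination of its values. It models the idea only and is not how GitHub Actions is implemented; expandMatrix is a hypothetical name.

```javascript
// Expand a matrix definition into every combination of its values,
// mirroring how a build matrix multiplies out into separate jobs.
function expandMatrix(matrix) {
  return Object.entries(matrix).reduce(
    (combos, [name, values]) =>
      combos.flatMap((combo) => values.map((value) => ({ ...combo, [name]: value }))),
    [{}]
  );
}

const jobs = expandMatrix({
  "node-version": ["14.x", "16.x", "18.x"],
  database: ["postgres", "mysql"],
});
console.log(jobs.length); // 6
```

Each added dimension multiplies the job count, so a third dimension with two values would double this to twelve jobs and consume minutes accordingly.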

4. Optimize Workflow with Dependency Caching

Every second counts in CI. Dependency caching can help save minutes in your workflow by reusing previously installed packages instead of reinstalling them from scratch every time.

Here’s how to set up smart caching to speed up your builds:

- name: Cache node modules
  uses: actions/cache@v3
  with:
    path: |  # Cache both global npm cache and local node_modules
      ~/.npm
      node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}  # Cache key based on OS and dependencies
    restore-keys: |  # Fallback keys if exact match isn't found
      ${{ runner.os }}-node-

This cache setup checks if your dependencies have changed. If not, it restores the cache, making builds significantly faster.
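
The way key and restore-keys interact can be modeled with a short sketch: try the exact key first, then fall back to each restore key as a prefix, preferring the newest match. This is a simplified model for illustration; the caches list of { key, createdAt } records and the resolveCache name are assumptions, not the action's real internals.

```javascript
// Simplified model of cache lookup: try the exact key first, then each
// restore key as a prefix, preferring the most recently created match.
// `caches` is a hypothetical list of { key, createdAt } records.
function resolveCache(caches, key, restoreKeys = []) {
  const exact = caches.find((cache) => cache.key === key);
  if (exact) return { hit: "exact", key: exact.key };

  for (const prefix of restoreKeys) {
    const newest = caches
      .filter((cache) => cache.key.startsWith(prefix))
      .sort((a, b) => b.createdAt - a.createdAt)[0];
    if (newest) return { hit: "partial", key: newest.key };
  }
  return { hit: "miss", key: null };
}
```

A partial hit still helps: npm starts from an older cache and only downloads the packages that changed.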

How to Optimize Docker Builds for CI

When you're building Docker images in CI, build time can quickly become a bottleneck, especially if your images are large. Optimizing your Docker builds makes your pipelines much faster, saves bandwidth, and produces smaller, more efficient images ready for deployment.

In this section, I’ll walk through creating a basic Dockerfile, using multi-stage builds, caching layers, and enabling BuildKit for even faster builds.

1. Create a Baseline Dockerfile

First, start with a simple Dockerfile that installs your app’s dependencies and runs it. This is what you’ll be optimizing later.

# Simple Dockerfile for a Node.js application
# (Dockerfile comments must be on their own lines)
# Use Alpine for a smaller base image
FROM node:18-alpine
# Set working directory
WORKDIR /app
# Copy all files to container
COPY . .
# Install dependencies (clean install)
RUN npm ci
# Start the application
CMD ["npm", "start"]

Using an Alpine-based Node.js image helps keep your image small from the start.

2. Multi-Stage Docker Builds

Next, let's separate the build process from the production image. Multi-stage builds let you compile or build your app in one stage and only copy over the final product to a clean, smaller image. This keeps production images lean:

# Stage 1: Build the application
FROM node:18-alpine AS builder
WORKDIR /app
# Copy package files first for better caching
COPY package*.json ./
# Install all dependencies
RUN npm ci
# Then copy source code
COPY . .
# Build the application
RUN npm run build

# Stage 2: Production image with minimal footprint
FROM node:18-alpine
WORKDIR /app
# Only copy built assets and production dependencies
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
# Install only production dependencies (--omit=dev replaces the deprecated --production flag)
RUN npm ci --omit=dev
# Run the built application
CMD ["node", "dist/server.js"]

This approach keeps your production images lightweight and secure by excluding unnecessary build tools and dev dependencies.

3. Optimizing Layer Caching

For even faster builds, order your Dockerfile instructions to maximize layer caching. Copy and install dependencies before copying your full source code.

This way, Docker reuses the cached npm install step if your dependencies haven't changed, even if you edit your app's code:

First: COPY package*.json ./
Then: RUN npm ci
Finally: COPY . .

4. Enable BuildKit for Faster Builds

Docker BuildKit is a newer build engine that enables features like better caching, parallel build steps, and overall faster builds.

To enable BuildKit during your CI, run:

- name: Build Docker image
  run: |
    # Enable BuildKit for parallel and more efficient builds
    DOCKER_BUILDKIT=1 docker build -t myapp:latest .

Turning on BuildKit can significantly speed up complex Docker builds and is highly recommended for all CI pipelines.

Infrastructure as Code Using Terraform and Free Cloud Providers

Why Infrastructure as Code (IaC) Matters

When you manage infrastructure manually – that is, clicking around cloud dashboards or setting things up by hand – it’s easy to lose track of what you did and how to repeat it.

Infrastructure as Code (IaC) solves this by letting you define your infrastructure with code, version it just like application code, and track every change over time. This makes your setups easy to replicate across environments (development, staging, production), ensures changes are declarative and auditable, and reduces human error.

Whether you are spinning up a single server or scaling a complex system, IaC lays the foundation for professional-grade infrastructure from day one, letting you automate, document, and grow your environment systematically.

How to Provision Infrastructure with Terraform

Initialize a Terraform Project

First, define the providers and versions you need. Here, we’re using Render’s free cloud hosting service:

# Define required providers and versions
terraform {
  required_providers {
    render = {
      source  = "renderinc/render"  # Using Render's free tier
      version = "0.1.0"             # Specify provider version for stability
    }
  }
}

# Configure the Render provider with authentication
provider "render" {
  api_key = var.render_api_key  # Store API key as a variable
}

Then, configure the provider by authenticating with your API key. It is best practice to store secrets like API keys in variables instead of hardcoding them. This setup tells Terraform what platform you’re working with (Render) and how to authenticate to manage resources automatically.

Provision a Web App on Render

Next, define the infrastructure you want – in this case, a web service hosted on Render:

# Define a web service on Render's free tier
resource "render_service" "web_app" {
  name = "ci-demo-app"                                 # Service name
  type = "web_service"                                 # Type of service
  repo = "https://github.com/YOUR-USERNAME/YOUR-REPO"  # Source repo
  env = "docker"                                       # Use Docker environment
  plan = "starter"                                     # Free tier plan
  branch = "main"                                      # Deploy from main branch
  build_command = "docker build -t app ."              # Build command
  start_command = "docker run -p 3000:3000 app"        # Start command
  auto_deploy = true                                   # Auto-deploy on commits
}

This resource block describes exactly how your app should be deployed. Whenever you change this file and reapply, Terraform will update the infrastructure to match.

Provision PostgreSQL for Free

Most applications need a database, but you don't have to pay for one when you're getting started. Platforms like Railway offer free tiers that are perfect for development and small projects.

You can quickly create a free PostgreSQL instance by signing up on the platform and clicking "Create New Project". At the end, you'll get a DATABASE_URL, a connection string that your app will use to talk to the database.

Connect App to DB

In Render (or whatever platform you're using), set an environment variable called DATABASE_URL and paste in the connection string from your PostgreSQL provider. This lets your application securely access the database without hardcoding credentials into your codebase.

Make it Reproducible

Once everything is defined, use Terraform to create and apply an infrastructure plan:

# Create execution plan and save it to a file
terraform plan -out=infra.tfplan
# Apply the saved plan exactly as planned
terraform apply infra.tfplan

Saving the plan to a file (infra.tfplan) ensures you’re applying exactly what you reviewed, so there will be no surprises.

Common Issues and Fixes:

Provider not found: Run terraform init.
API key error: Check render_api_key in Terraform Cloud variables.

How to Set Up Container Orchestration on Minimal Resources

When you're working with limited resources like a laptop, a small server, or a lightweight cloud VM, setting up full Kubernetes can be overwhelming. Instead, you can use K3d, a tool that runs K3s (a lightweight Kubernetes distribution) inside Docker containers. Here's how to set up a minimal, efficient cluster for local development or testing.

1. Install K3d for Local Kubernetes

First, install K3d. It's a super lightweight way to run Kubernetes clusters inside Docker without needing a heavy setup like Minikube.

# Download and install K3d - a lightweight K8s distribution
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

2. Create a Lightweight K3d Cluster

Once K3d is installed, you can spin up a cluster with minimal nodes to save resources.

# Create a minimal K8s cluster: 1 server node and 2 agent (worker) nodes,
# a local volume mounted for persistence, local port 8080 mapped to port 80
# on the cluster load balancer, and the API server exposed on port 6443
k3d cluster create dev-cluster \
  --servers 1 \
  --agents 2 \
  --volume /tmp/k3dvol:/tmp/k3dvol \
  --port 8080:80@loadbalancer \
  --api-port 6443

This setup gives you a tiny but real Kubernetes cluster that is perfect for experimentation.

3. Deploy with Optimized Kubernetes Manifests

Now that your cluster is running, you can deploy your app. It's important to define resource requests and limits carefully so your pods don’t consume too much memory or CPU.

# Resource-optimized deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp  # Name of the deployment
spec:
  replicas: 1   # Single replica to save resources
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: app
          image: myapp:latest
          resources:
            # Set minimal resource requests
            requests:
              memory: "64Mi"   # Request only 64MB memory
              cpu: "50m"       # Request only 5% of a CPU core
            # Set reasonable limits
            limits:
              memory: "128Mi"  # Limit to 128MB memory
              cpu: "100m"      # Limit to 10% of a CPU core

This ensures Kubernetes knows how much to allocate and avoids overloading your lightweight environment.
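
If the units in the manifest are unfamiliar, the following Python sketch converts them to plain numbers. It handles only the common suffixes (it is not a full parser for Kubernetes quantities), and the pods_that_fit helper is a hypothetical estimate that ignores whatever the node reserves for system components.

```python
from decimal import Decimal

# Binary and decimal suffixes commonly used for memory quantities.
MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def cpu_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '50m' or '0.5' to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000  # millicores
    return Decimal(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '64Mi' to bytes."""
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix means plain bytes


def pods_that_fit(node_cpu, node_memory, pod_cpu, pod_memory):
    """Rough count of pods a node can schedule, based on requests only."""
    by_cpu = cpu_cores(node_cpu) // cpu_cores(pod_cpu)
    by_memory = memory_bytes(node_memory) // memory_bytes(pod_memory)
    return int(min(by_cpu, by_memory))


print(cpu_cores("50m"))       # requested CPU in cores
print(memory_bytes("64Mi"))   # requested memory in bytes
print(pods_that_fit("1", "512Mi", "50m", "64Mi"))
```

On a hypothetical node with 1 CPU core and 512Mi of memory, memory runs out long before CPU does, which is typical for small clusters.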

4. Set up GitOps with Flux

To manage deployments automatically from your GitHub repository, you can set up GitOps using Flux.

# Install Flux CLI
brew install fluxcd/tap/flux

# Bootstrap Flux on your cluster connected to your GitHub repository.
# --owner is your GitHub username, --repository stores the Flux manifests,
# --path is the directory in the repo for this cluster's configs,
# and --personal marks the owner as a personal account.
flux bootstrap github \
  --owner=YOUR_GITHUB_USERNAME \
  --repository=YOUR_REPO_NAME \
  --branch=main \
  --path=clusters/dev-cluster \
  --personal

Flux watches your repo and applies updates to your cluster, keeping everything declarative and reproducible.

Common Issues and Fixes:

Pods crash: Run kubectl logs pod-name or increase resources.
Flux sync fails: Check GitHub token permissions.

How to Create a Free Deployment Pipeline

Like I said initially, not every project needs expensive infrastructure. If you're just getting started or building side projects, free tiers from cloud providers can cover a lot of ground.

1. Understanding Free Tier Limitations

Here’s a quick overview of popular cloud free tiers:

Provider	Free Tier Highlights
AWS Free Tier	750 hours/month EC2, 5GB S3, 1M Lambda requests
Oracle Cloud Free Tier	2 always-free compute instances, 30GB storage
Google Cloud Free Tier	1 e2-micro instance, 5GB storage

Knowing these limits helps you stay within budget.

2. Set Up Deployment Workflows

You can automate deployments with GitHub Actions. Here's an example of a deployment workflow to AWS:

# GitHub Action workflow for deploying to AWS
name: AWS Deployment

on:
  push:
    branches:
      - main  # Deploy on push to main branch

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3  # Check out code

      # Set up AWS credentials from GitHub secrets
      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      # Build the Docker image
      - name: Build Docker Image
        run: docker build -t myapp .

      # Push the image to AWS ECR
      - name: Push Docker Image to ECR
        run: |
          # Create repository if it doesn't exist (ignoring errors if it does)
          aws ecr create-repository --repository-name myapp || true

          # Login to ECR (replace YOUR_AWS_ACCOUNT_ID with your 12-digit AWS account ID)
          aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

          # Tag and push the image
          docker tag myapp:latest YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
          docker push YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp:latest

3. Implement Zero-Downtime Deployments

Zero downtime is crucial. Kubernetes makes this easy with rolling updates:

# Kubernetes deployment configured for zero-downtime updates
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crud-app
spec:
  replicas: 3  # Multiple replicas for high availability
  selector:
    matchLabels:
      app: crud-app
  template:
    metadata:
      labels:
        app: crud-app
    spec:
      containers:
      - name: app
        image: YOUR_REGISTRY/crud-app:latest
        ports:
        - containerPort: 80  # Expose container port

By having multiple replicas, you ensure that some pods stay live during updates.
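
The manifest above does not set a rollout strategy, so the Deployment uses the rolling update defaults of 25% for both maxSurge and maxUnavailable. As a rough sketch of what that means in pod counts, the Python function below applies the documented rounding (maxSurge rounds up, maxUnavailable rounds down). The function name is made up for this example.

```python
import math


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (minimum available pods, maximum total pods) during a rollout."""

    def resolve(value, round_up):
        # Percentages are resolved against the desired replica count.
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge


print(rolling_update_bounds(3))
print(rolling_update_bounds(10))
```

With 3 replicas and the defaults, all 3 pods must stay available and at most 4 exist at once, so new pods are added one at a time before old ones are removed.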

4. Create Cross-Cloud Deployment for Redundancy

If you want better reliability, you can deploy across different clouds in parallel:

# Deploy to multiple cloud providers for redundancy
name: Cross-Cloud Deployment

on:
  push:
    branches:
      - main

jobs:
  # Deploy to AWS
  aws-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: AWS Setup & Deploy
        run: |
          # Configure AWS CLI with credentials
          aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          # AWS deployment commands...

  # Deploy to Oracle Cloud in parallel
  oracle-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Oracle Setup & Deploy
        run: |
          # Configure Oracle Cloud CLI
          oci setup config
          # Oracle Cloud deployment commands...

Now if one cloud goes down, the other is still up.

5. Implement Automated Rollbacks with Health Checks

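Set up health checks so Kubernetes can restart unhealthy containers and keep traffic away from pods that aren't ready. A failing readiness probe stalls the rollout while the remaining old pods keep serving. Kubernetes does not revert the Deployment on its own, so you roll back a stalled release with kubectl rollout undo deployment/crud-app. Here is a Deployment with both probes: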

# Deployment with health checks for automated rollbacks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crud-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crud-app
  template:
    metadata:
      labels:
        app: crud-app
    spec:
      containers:
      - name: crud-app
        image: YOUR_REGISTRY/crud-app:latest
        ports:
        - containerPort: 80
        # Check if the container is alive
        livenessProbe:
          httpGet:
            path: /healthz  # Health check endpoint
            port: 80
          initialDelaySeconds: 5  # Wait before first check
          periodSeconds: 10       # Check every 10 seconds
        # Check if the container is ready to receive traffic
        readinessProbe:
          httpGet:
            path: /readiness  # Readiness check endpoint
            port: 80
          initialDelaySeconds: 5  # Wait before first check
          periodSeconds: 10       # Check every 10 seconds

How to Build a Comprehensive Monitoring System

Even with a small deployment, monitoring is key to spotting issues early. So now, I’ll walk through setting up a comprehensive monitoring system for your application.

You'll learn how to integrate Grafana Cloud for visualizing your metrics, use Prometheus for collecting data, and configure custom alerts to monitor your app's performance. I’ll also cover tracking Service Level Objectives (SLOs) and setting up external monitoring with UptimeRobot to make sure that your endpoints are always available.

1. Set Up Grafana Cloud's Free Tier

Create a Grafana Cloud account and connect Prometheus as a data source. They offer generous free usage, which is perfect for small teams.

2. Configure Prometheus for Metrics Collection

Prometheus collects metrics from your app.

# prometheus.yml - Basic Prometheus configuration
global:
  scrape_interval: 15s  # Collect metrics every 15 seconds
scrape_configs:
  - job_name: 'crud-app'  # Job name for the crud-app metrics
    static_configs:
      - targets: ['localhost:8080']  # Where to collect metrics from

This scrapes your app every 15 seconds for metrics.

3. Create Monitoring Dashboards

Grafana visualizes Prometheus data. You can create dashboards using queries like:

# Calculate average CPU usage rate per instance over 1 minute
avg(rate(cpu_usage_seconds_total[1m])) by (instance)

This calculates average CPU usage over the last minute per instance.

4. Write Custom PromQL Queries for Alerts

You can create smart alerts to detect increasing error rates, like the below:

# Calculate error rate as a percentage of total requests
# Alert when error rate exceeds 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / 
sum(rate(http_requests_total[5m])) by (service) > 0.05

This alerts if more than 5% of your traffic results in errors.
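
To make the arithmetic behind that query concrete, here is a small Python sketch that computes the same ratio from request counts. The counts and function names are invented for the example. One difference to keep in mind: this sketch returns 0.0 when there is no traffic, whereas the PromQL division would not produce a usable value in that case.

```python
import re


def error_ratio(requests_by_status):
    """Fraction of requests whose status code matches the pattern 5.."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if re.fullmatch(r"5..", status)
    )
    return errors / total


def should_alert(requests_by_status, threshold=0.05):
    """Mirror the `> 0.05` comparison in the alert expression."""
    return error_ratio(requests_by_status) > threshold


# Hypothetical request counts over a 5-minute window.
healthy = {"200": 940, "404": 20, "500": 30, "503": 10}
degraded = {"200": 900, "500": 100}

print(error_ratio(healthy), should_alert(healthy))
print(error_ratio(degraded), should_alert(degraded))
```

Note that 4xx responses count toward the total but not toward the errors, which matches the status=~"5.." selector in the query.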

5. Implement SLO Tracking on a Budget

You can track Service Level Objectives (SLOs) with Prometheus for free:

# Calculate the fraction of requests completed within 200ms
# Alert when it drops below 99%
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
< 0.99

This fires when fewer than 99% of requests complete within 200ms.
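
Prometheus histogram buckets are cumulative, so the le="0.2" bucket already includes every faster request. The Python sketch below works through the ratio with made-up bucket counts. The function names and the 99% target are assumptions taken from the example above, not part of any library.

```python
INF = float("inf")


def fraction_within(buckets, threshold):
    """Share of requests at or under `threshold` seconds.

    buckets maps each upper bound (the `le` label) to a cumulative count,
    with the +Inf bucket holding the total number of requests.
    """
    total = buckets[INF]
    if total == 0:
        return 1.0  # no traffic, so nothing violated the objective
    return buckets[threshold] / total


def meets_slo(buckets, threshold=0.2, target=0.99):
    """True when at least `target` of requests finished within `threshold`."""
    return fraction_within(buckets, threshold) >= target


# Hypothetical cumulative counts for one 5-minute window.
window = {0.1: 700, 0.2: 985, 0.5: 998, INF: 1000}

print(fraction_within(window, 0.2))
print(meets_slo(window))
```

Here 985 of 1000 requests finished within 200ms, so the window falls short of a 99% objective even though most requests were fast.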

6. Set Up UptimeRobot for External Monitoring

Finally, you can use UptimeRobot to check if your endpoints are reachable externally, and get alerts if anything goes down.

How to Implement Security Testing and Scanning

Security should be integrated into your development pipeline from the start, not added as an afterthought. In this section, I’ll show you how to implement security testing and scanning at various stages of your workflow.

You’ll use GitHub CodeQL for static code analysis, OWASP ZAP for scanning web vulnerabilities, and Trivy for container image scanning. You’ll also learn how to enforce security thresholds directly in your CI pipeline.

1. Enable GitHub Code Scanning with CodeQL

GitHub has built-in code scanning with CodeQL. Here’s how to set it up:

# GitHub workflow for CodeQL security scanning
name: CodeQL

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  analyze:
    name: Analyze code with CodeQL
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      # Initialize the CodeQL scanning tools
      - name: Set up CodeQL
        uses: github/codeql-action/init@v2

      # Run the analysis and generate results
      - name: Analyze code
        uses: github/codeql-action/analyze@v2

This automatically checks your code for security vulnerabilities.

2. Integrate OWASP ZAP into Your CI Pipeline

You can also scan your deployed app with OWASP ZAP like this:

# Automated security scanning with OWASP ZAP
name: ZAP Scan

on:
  push:
    branches:
      - main

jobs:
  zap-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      # Run the ZAP security scan against deployed application
      - name: Run ZAP security scan
        uses: zaproxy/action-full-scan@v0.3.0
        with:
          target: 'https://yourapp.com'  # URL to scan

This checks for common web vulnerabilities.

3. Set Up Trivy for Container Vulnerability Scanning

You can also check your container images for vulnerabilities with Trivy:

# Scan Docker images for vulnerabilities using Trivy
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'crud-app:latest'   # Image to scan
    format: 'table'             # Output format
    exit-code: '1'              # Fail the build if vulnerabilities found
    ignore-unfixed: true        # Skip vulnerabilities without fixes
    severity: 'CRITICAL,HIGH'   # Only alert on critical and high severity

Your builds will fail if serious issues are found, keeping you safe by default.

4. Create Threshold-Based Pipeline Failures

You can configure your pipelines to fail automatically if vulnerabilities exceed a set threshold, enforcing strong security practices without manual effort. Here’s how that should look:

# Fail the pipeline if critical or high vulnerabilities are found
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'crud-app:latest'   # Image to scan
    format: 'json'              # Output as JSON for parsing
    exit-code: '1'              # Fail the build if vulnerabilities found
    severity: 'CRITICAL,HIGH'   # Check for critical and high severity issues
    ignore-unfixed: true        # Skip vulnerabilities without fixes

This forces a no-compromise security posture – that is, if critical or high vulnerabilities are detected, the build stops immediately.

5. Implement Custom Security Checks

Sometimes you need to go beyond automated scanners. Here's a basic example of a custom security check you can add to your pipeline:

#!/bin/bash

# Custom script to check for hard-coded secrets in source code
# Check for hard-coded API keys in source files
if grep -r "API_KEY" ./src; then
  echo "Security issue: Found hard-coded API keys."
  exit 1  # Fail the build
else
  echo "No hard-coded API keys found."
fi

You can extend this script to scan for patterns like private keys, passwords, or other sensitive information, helping catch issues before they ever reach production.
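
As one possible extension, here is a Python sketch that flags only lines assigning a quoted literal to a secret-looking name. Unlike a plain text search, it skips lines that read the value from the environment. The pattern, the name list, and the 8-character minimum are illustrative assumptions and will miss many real cases, so treat it as a starting point rather than a replacement for a dedicated scanner.

```python
import re

# Matches a secret-looking name followed by an assignment of a quoted literal,
# for example:  API_KEY = "abcd1234efgh5678"  or  password: 'some-long-value'
SECRET_PATTERN = re.compile(
    r"""(api[_-]?key|secret|token|password)\w*\s*[:=]\s*["']([^"']{8,})["']""",
    re.IGNORECASE,
)


def find_hardcoded_secrets(text):
    """Return (line_number, matched_name) pairs for suspicious lines."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match:
            findings.append((number, match.group(1)))
    return findings


# Hypothetical file contents used to exercise the scanner.
sample = "\n".join([
    "const apiKey = process.env.API_KEY;",      # read from the environment
    'const API_KEY = "abcd1234efgh5678";',      # hard-coded literal
    "db_password: 'correct-horse-battery'",     # hard-coded literal
])

for line_number, name in find_hardcoded_secrets(sample):
    print(f"line {line_number}: possible hard-coded {name}")
```

In a pipeline, you would read each source file, run the function on its contents, and exit with a non-zero status if any findings come back.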

Performance Optimization and Scaling

Optimizing early saves you pain later. Here’s how to make your pipelines faster, smarter, and more scalable:

1. Measure Pipeline Execution Times

Understanding how long each step takes is the first step to improving it:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Record the start time
      - name: Start timer
        run: echo "Start time: $(date)"

      - uses: actions/checkout@v3
      - run: npm install

      # Record the end time to calculate duration
      - name: End timer
        run: echo "End time: $(date)"

Later, you can automate time tracking for full reports and alerts.

2. Implement Parallelization Strategies

Split your jobs smartly to save time:

jobs:
  # First job to install dependencies
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci

  # Run tests in parallel with linting
  test:
    runs-on: ubuntu-latest
    needs: install  # Depends on install job
    steps:
      - uses: actions/checkout@v3
      # Each job starts on a fresh runner, so install dependencies again
      - run: npm ci
      - run: npm test

  # Run linting in parallel with tests
  lint:
    runs-on: ubuntu-latest
    needs: install  # Also depends on install job
    steps:
      - uses: actions/checkout@v3
      # Each job starts on a fresh runner, so install dependencies again
      - run: npm ci
      - run: npm run lint

Result: Testing and linting run in parallel after installing dependencies, cutting pipeline time significantly.

3. Set Up Distributed Caching

Caching saves your workflow from repeating expensive tasks:

# Cache dependencies to speed up builds
- name: Cache node modules
  uses: actions/cache@v3
  with:
    # Cache the global npm cache and the local node_modules directory
    path: |
      ~/.npm
      node_modules
    # Key based on OS and dependency hash
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    # Fallback keys if an exact match isn't found
    restore-keys: |
      ${{ runner.os }}-node-

Tip: Also cache build artifacts, Docker layers, and Terraform plans when possible.
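
To make the key and fallback behavior easier to picture, here is a simplified Python model of the lookup. It assumes an exact key match wins, and otherwise the most recently saved key that starts with a restore prefix is used. The function names and lockfile contents are invented, and the real action hashes files differently from this single SHA-256 call.

```python
import hashlib


def cache_key(runner_os, lockfile_bytes):
    """Build a key shaped like `<os>-node-<hash of lockfile>`."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{runner_os}-node-{digest}"


def restore(saved_keys, key, restore_prefixes):
    """Return the saved key to restore from, or None on a complete miss.

    saved_keys is ordered from oldest to newest.
    """
    if key in saved_keys:
        return key  # exact hit: dependencies are unchanged
    for prefix in restore_prefixes:
        matches = [k for k in saved_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]  # partial hit: newest cache with this prefix
    return None


# Hypothetical lockfile contents before and after adding a dependency.
old_lock = b'{"lockfileVersion": 3, "packages": {}}'
new_lock = b'{"lockfileVersion": 3, "packages": {"left-pad": {}}}'

saved = [cache_key("Linux", old_lock)]

print(restore(saved, cache_key("Linux", old_lock), ["Linux-node-"]))
print(restore(saved, cache_key("Linux", new_lock), ["Linux-node-"]))
print(restore(saved, cache_key("macOS", new_lock), ["macOS-node-"]))
```

The second lookup is the useful one: the lockfile changed, so there is no exact match, but the older cache still restores most packages and only the new ones need downloading.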

4. Create Performance Benchmarks

Track your build times over time with benchmarks:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Store the start time as an environment variable
      - name: Start timer
        id: start_time
        run: echo "start_time=$(date +%s)" >> $GITHUB_ENV

      - uses: actions/checkout@v3
      - run: npm install

      # Calculate and display the elapsed time
      - name: End timer and calculate elapsed time
        run: |
          end_time=$(date +%s)
          elapsed_time=$((end_time - ${{ env.start_time }}))
          echo "Build time: $elapsed_time seconds"

With benchmarks in place, you can monitor regressions and trigger optimizations automatically.
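
One simple way to spot a regression is to compare the latest build time against the median of recent builds. The Python sketch below does that. The 20% tolerance and the sample durations are arbitrary choices for illustration, and where you store the history (an artifact, a file in the repo, an external service) is up to you.

```python
from statistics import median


def is_regression(history_seconds, latest_seconds, tolerance=0.20):
    """True when the latest build is more than `tolerance` slower than the
    median of previous builds."""
    if not history_seconds:
        return False  # nothing to compare against yet
    baseline = median(history_seconds)
    return latest_seconds > baseline * (1 + tolerance)


# Hypothetical build durations in seconds from earlier runs.
history = [118, 124, 120, 131, 122]

print(median(history))
print(is_regression(history, 150))
print(is_regression(history, 140))
```

Using the median rather than the mean keeps one unusually slow run from shifting the baseline.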

5. How to Plan for Growth Beyond Free Tiers

Understand cloud pricing structures: AWS, Azure, GCP all offer generous free tiers, but know the limits to avoid surprise bills. (I have been there and it wasn’t pretty.)
Consider scaling to more advanced CI/CD tools: Jenkins, CircleCI, GitLab can offer better performance or self-hosted control as you grow.
Automate resource provisioning: Use Infrastructure as Code (IaC) with Terraform, Pulumi, or AWS CDK to dynamically scale your infrastructure when your team or traffic grows.

Complete CI/CD Pipeline Example

Here’s a full example tying everything together:

# Complete end-to-end CI/CD pipeline
name: CI/CD Pipeline

on:
  push:
    branches:
      - main

jobs:
  # Initial setup job
  setup:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

  # Build and test job
  build:
    runs-on: ubuntu-latest
    needs: setup  # Depends on setup job
    steps:
      # Each job runs on a fresh runner, so check out the code again
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '16'
      - name: Install dependencies
        run: npm install
      - name: Run security scan
        run: npx eslint .  # Run ESLint for security rules

  # Deploy to Kubernetes job
  deploy:
    runs-on: ubuntu-latest
    needs: build  # Depends on successful build
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Install K3d
        run: curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
      - name: Setup K3d cluster
        run: k3d cluster create dev-cluster --servers 1 --agents 2 --port 8080:80@loadbalancer
      - name: Apply Kubernetes manifests
        run: kubectl apply -f k8s/  # Apply all K8s manifests in the k8s directory
      - name: Deploy app
        run: kubectl rollout restart deployment/webapp  # Restart deployment for zero-downtime update

  # Infrastructure provisioning job
  terraform:
    runs-on: ubuntu-latest
    needs: deploy  # Run after deployment
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init  # Initialize Terraform
      - name: Terraform Apply
        run: terraform apply -auto-approve  # Apply infrastructure changes automatically

Runbook: Failed Deployment:

Issue: Pods fail due to resource limits (for example, OOMKilled, CrashLoopBackOff).
Fix:

  kubectl top pod
  kubectl edit deployment crud-app
  kubectl apply -f deployment.yaml
  kubectl rollout status deployment/crud-app

Tip: Set realistic resource requests and limits early. It'll save you debugging time later.

Conclusion

By following along with this tutorial, you now know how to build a production-ready DevOps pipeline using free tools:

CI/CD: GitHub Actions for testing, linting, and building.
Infrastructure: Terraform for AWS/Render and PostgreSQL setup.
Orchestration: K3d for local Kubernetes.
Monitoring: Grafana, Prometheus, UptimeRobot.
Security: CodeQL, OWASP ZAP, Trivy for vulnerability scanning.

This pipeline is scalable and secure, and it’s perfect for small projects. As your app grows, you might want to consider paid plans for more resources (for example, AWS larger instances, Grafana unlimited metrics). You can check AWS Free Tier, Terraform Docs, and Grafana Docs for more learning.

PS: I’d love to see what you build. Share your pipeline on freeCodeCamp’s forum or tag me on X @Emidowojo with #DevOpsOnABudget, and tell me about the challenges you faced. You can also connect with me on LinkedIn if you’d like to stay in touch. If you made it to the end of this lengthy article, thanks for reading!
