<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Grafana - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Grafana - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 27 Jun 2026 22:33:18 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/grafana/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Debug CI/CD Pipelines: A Handbook on Troubleshooting with Observability Tools ]]>
                </title>
                <description>
                    <![CDATA[ Observability is a game-changer for CI/CD pipelines, and it’s one of the most exciting aspects of DevOps. When I started working with CI/CD systems, I assumed the hardest part would be building the pipeline. But with increasingly complex setups, the ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-cicd-pipelines-handbook/</link>
                <guid isPermaLink="false">6850a9eb7255997ee3d47265</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #prometheus ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ promql ]]>
                    </category>
                
                    <category>
                        <![CDATA[ loki ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Mon, 16 Jun 2025 23:34:03 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748620971355/d4893ec5-8016-491e-9626-15d971f0c885.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Observability is a game-changer for CI/CD pipelines, and it’s one of the most exciting aspects of DevOps. When I started working with CI/CD systems, I assumed the hardest part would be building the pipeline. But with increasingly complex setups, the real challenge is debugging failures, like builds crashing or tests failing only in production.</p>
<p>Observability tools, such as logs, metrics, and traces, provide the visibility you need to pinpoint issues quickly. In this handbook, we’ll explore free and open-source tools you can use to make your CI/CD pipelines more reliable. We’ll use practical steps to troubleshoot like a pro – no enterprise licenses required.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-observability-is-important">Why Observability is Important</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-and-configure-grafana-loki-on-budget-infrastructure">How to Install and Configure Grafana Loki on Budget Infrastructure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-an-elk-stack-alternative-for-pipeline-observability">How to Implement an ELK Stack Alternative for Pipeline Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-unified-logging-strategy-across-pipeline-components">How to Create a Unified Logging Strategy Across Pipeline Components</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-query-and-analyze-logs-for-effective-troubleshooting">How to Query and Analyze Logs for Effective Troubleshooting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-prometheus-metrics-alongside-your-logs">How to Set Up Prometheus Metrics Alongside Your Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-grafana-dashboards-that-combine-metrics-and-logs">How to Create Grafana Dashboards That Combine Metrics and Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-exemplars-to-jump-from-metrics-to-relevant-logs">How to Use Exemplars to Jump from Metrics to Relevant Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-diagnose-and-fix-common-cicd-problems">How to Diagnose and Fix Common CI/CD Problems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-advanced-debugging-techniques">How to Implement Advanced Debugging Techniques</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-conduct-effective-postmortems-using-logs">How to Conduct Effective Postmortems Using Logs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-optimize-log-storage-and-management">How to Optimize Log Storage and Management</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>There are some things you should know and have to get the most out of this handbook:</p>
<h4 id="heading-technical-knowledge">Technical Knowledge:</h4>
<ul>
<li><p>Basic understanding of <a target="_blank" href="https://www.freecodecamp.org/news/what-is-ci-cd/">CI/CD pipelines</a> (for example, build, test, deploy stages).</p>
</li>
<li><p>Familiarity with <a target="_blank" href="https://www.freecodecamp.org/news/helpful-linux-commands-you-should-know/">Linux/Unix commands</a> (for example, <code>mkdir</code>, <code>grep</code>, <code>curl</code>).</p>
</li>
<li><p>Comfortable with <a target="_blank" href="https://www.freecodecamp.org/news/the-docker-handbook/">Docker basics</a> (for example, <code>docker run</code>, <code>docker-compose up</code>).</p>
</li>
<li><p>Optional: Awareness of <a target="_blank" href="https://www.freecodecamp.org/news/observability-in-cloud-native-applications/">observability concepts</a> (logs, metrics, traces) or YAML configuration.</p>
</li>
</ul>
<h4 id="heading-software-and-tools">Software and Tools:</h4>
<ul>
<li><p><strong>Docker and Docker Compose</strong>: Installed and running (verify with <code>docker --version</code> and <code>docker-compose --version</code>).</p>
</li>
<li><p><strong>CI/CD Platform</strong>: Access to GitHub Actions, Jenkins, or GitLab CI with a sample pipeline that generates logs.</p>
</li>
<li><p><strong>Text Editor</strong>: For editing YAML files (for example, VS Code, Nano).</p>
</li>
<li><p><strong>Web Browser</strong>: To access tool UIs (for example, Grafana on port 3000, Kibana on 5601).</p>
</li>
<li><p>Optional: <code>curl</code> for testing log forwarding, Git for version control.</p>
</li>
</ul>
<h4 id="heading-hardware-and-infrastructure">Hardware and Infrastructure:</h4>
<ul>
<li><p>Machine with:</p>
<ul>
<li><p>OS: Linux, Windows (with WSL2), or macOS.</p>
</li>
<li><p>4GB RAM (8GB recommended), 20GB free disk space.</p>
</li>
<li><p>Stable internet and ability to open ports (for example, 3100 for Loki, 9200 for Elasticsearch).</p>
</li>
</ul>
</li>
<li><p>Optional: Cloud provider access (for example, AWS, GCP) for scalable setups.</p>
</li>
</ul>
<h4 id="heading-access-and-permissions">Access and Permissions:</h4>
<ul>
<li><p>Admin access to install Docker and configure CI/CD tools.</p>
</li>
<li><p>Permissions to modify pipeline configs (for example, <code>.github/workflows</code>, <code>.gitlab-ci.yml</code>).</p>
</li>
<li><p>Optional: Container registry access (for example, Docker Hub) for custom images.</p>
</li>
</ul>
<h2 id="heading-why-observability-is-important"><strong>Why Observability is Important</strong></h2>
<p>Modern CI/CD pipelines are no longer linear scripts – they are now complex, distributed systems involving multiple tools, environments, and infrastructure layers. One job runs on GitHub Actions, another deploys via Jenkins, and a third builds Docker images in a Kubernetes cluster.</p>
<p>So when something breaks, you’re left chasing logs across tools, guessing where the issue originated, and wasting hours trying to reproduce it.</p>
<p>And worse still, traditional debugging tools often stop at the surface, only showing failed jobs without the context of <em>why</em> they failed or <em>where</em> in the system the fault actually lies.</p>
<p>Observability flips the script. Instead of hunting through disconnected logs or rerunning failed builds blindly, observability gives you <strong>insight</strong>, not just data. By combining structured logs, metrics, and traces, you can:</p>
<ul>
<li><p>Reconstruct exactly what happened in a pipeline failure</p>
</li>
<li><p>Trace a failure across CI agents, deployment steps, and containers</p>
</li>
<li><p>Visualize patterns and anomalies before they become outages</p>
</li>
</ul>
<p>More importantly, observability helps you <strong>move from reactive debugging to proactive prevention</strong>.</p>
<p>Here’s what you’ll learn about and accomplish in this guide:</p>
<ul>
<li><p>Set up cost-effective observability using Grafana Loki, lightweight ELK, and OpenTelemetry</p>
</li>
<li><p>Create a unified logging strategy to connect your pipeline</p>
</li>
<li><p>Write precise queries to pinpoint root causes quickly, and correlate logs, metrics, and traces for comprehensive debugging</p>
</li>
<li><p>Troubleshoot CI/CD issues like build failures, flaky tests, and container crashes</p>
</li>
<li><p>Build custom dashboards and automated diagnostic tools</p>
</li>
<li><p>Promote observability through documentation and post-mortems</p>
</li>
</ul>
<p>Whether you're a solo developer or part of a DevOps team, this guide will transform your chaotic CI/CD pipelines into clear, reliable, and observable systems.</p>
<h3 id="heading-how-to-choose-the-right-observability-tool-for-cicd"><strong>How to Choose the Right Observability Tool for CI/CD</strong></h3>
<p>Here’s a quick comparison of Grafana Loki, Lightweight ELK, and Vector for CI/CD observability:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Tool</strong></td><td><strong>Resource Usage</strong></td><td><strong>Setup Complexity</strong></td><td><strong>Best For</strong></td><td><strong>CI/CD Fit</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Grafana Loki</strong></td><td>Low (lightweight)</td><td>Easy (Docker-based)</td><td>Small teams, budget infra</td><td>Simple pipelines, JSON logs, Grafana users</td></tr>
<tr>
<td><strong>Lightweight ELK</strong></td><td>High (Elasticsearch-heavy)</td><td>Moderate (multi-container)</td><td>Teams needing advanced search/visualization</td><td>Complex pipelines, rich querying needs</td></tr>
<tr>
<td><strong>Vector</strong></td><td>Very low</td><td>Easy (single binary)</td><td>Resource-constrained setups</td><td>Minimal setups, log forwarding</td></tr>
</tbody>
</table>
</div><p>How to choose:</p>
<ul>
<li><p><strong>Loki</strong>: Ideal for startups or solo devs with limited resources. Integrates well with Prometheus/Grafana.</p>
</li>
<li><p><strong>ELK</strong>: Best for teams needing Kibana’s advanced visualizations or handling large log volumes.</p>
</li>
<li><p><strong>Vector</strong>: Great for lightweight log forwarding in distributed CI/CD setups.</p>
</li>
</ul>
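<p><strong>Grafana Loki</strong> is a log aggregation system like ELK, but it indexes only a small set of labels for each log stream rather than the full text of every line. That makes it more lightweight and a good fit for CI/CD pipelines with limited infrastructure.</p>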
<h2 id="heading-how-to-install-and-configure-grafana-loki-on-budget-infrastructure">How to Install and Configure Grafana Loki on Budget Infrastructure</h2>
<h3 id="heading-option-a-quick-docker-setup-recommended-for-budget-infra">🛠 Option A: Quick Docker Setup (Recommended for Budget Infra)</h3>
<ol>
<li><p><strong>Create a directory for configuration:</strong></p>
<pre><code class="lang-bash"> mkdir -p ~/loki-setup &amp;&amp; <span class="hljs-built_in">cd</span> ~/loki-setup
</code></pre>
</li>
<li><p><strong>Create a</strong> <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Defines a Docker Compose setup for Grafana Loki and Promtail to aggregate and scrape logs efficiently.</span>
 <span class="hljs-attr">version:</span> <span class="hljs-string">"3"</span>

 <span class="hljs-attr">services:</span>
   <span class="hljs-attr">loki:</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/loki:2.9.4</span>  <span class="hljs-comment"># Uses Loki version 2.9.4 for lightweight log aggregation.</span>
     <span class="hljs-attr">ports:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">"3100:3100"</span>  <span class="hljs-comment"># Exposes Loki’s HTTP API port for log ingestion and queries.</span>
     <span class="hljs-attr">command:</span> <span class="hljs-string">-config.file=/etc/loki/loki-config.yaml</span>  <span class="hljs-comment"># Specifies the configuration file for Loki.</span>
     <span class="hljs-attr">volumes:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">./loki-config.yaml:/etc/loki/loki-config.yaml</span>  <span class="hljs-comment"># Mounts the local config file into the container.</span>

   <span class="hljs-attr">promtail:</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">grafana/promtail:2.9.4</span>  <span class="hljs-comment"># Uses Promtail version 2.9.4 to scrape and forward logs to Loki.</span>
     <span class="hljs-attr">volumes:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">/var/log:/var/log</span>  <span class="hljs-comment"># Mounts the host’s log directory for Promtail to scrape.</span>
       <span class="hljs-bullet">-</span> <span class="hljs-string">./promtail-config.yaml:/etc/promtail/promtail-config.yaml</span>  <span class="hljs-comment"># Mounts the Promtail config file.</span>
     <span class="hljs-attr">command:</span> <span class="hljs-string">-config.file=/etc/promtail/promtail-config.yaml</span>  <span class="hljs-comment"># Specifies the configuration file for Promtail.</span>
</code></pre>
</li>
<li><p><strong>Create a basic</strong> <code>loki-config.yaml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Configures Grafana Loki for lightweight log storage and querying in a CI/CD environment.</span>
 <span class="hljs-attr">auth_enabled:</span> <span class="hljs-literal">false</span>  <span class="hljs-comment"># Disables authentication for simplicity (not recommended for production).</span>

 <span class="hljs-attr">server:</span>
   <span class="hljs-attr">http_listen_port:</span> <span class="hljs-number">3100</span>  <span class="hljs-comment"># Sets the port for Loki’s HTTP API.</span>

 <span class="hljs-attr">ingester:</span>
   <span class="hljs-attr">lifecycler:</span>
     <span class="hljs-attr">ring:</span>
       <span class="hljs-attr">kvstore:</span>
         <span class="hljs-attr">store:</span> <span class="hljs-string">inmemory</span>  <span class="hljs-comment"># Uses in-memory storage for the ring, suitable for small setups.</span>
       <span class="hljs-attr">replication_factor:</span> <span class="hljs-number">1</span>  <span class="hljs-comment"># Sets single replica for minimal resource use.</span>
   <span class="hljs-attr">chunk_idle_period:</span> <span class="hljs-string">3m</span>  <span class="hljs-comment"># Flushes chunks to storage after 3 minutes of inactivity.</span>
   <span class="hljs-attr">max_chunk_age:</span> <span class="hljs-string">1h</span>  <span class="hljs-comment"># Retires chunks after 1 hour to balance storage and query performance.</span>

 <span class="hljs-attr">schema_config:</span>
   <span class="hljs-attr">configs:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span> <span class="hljs-number">2023-01-01</span>  <span class="hljs-comment"># Defines the schema start date.</span>
       <span class="hljs-attr">store:</span> <span class="hljs-string">boltdb-shipper</span>  <span class="hljs-comment"># Uses BoltDB for indexing logs.</span>
       <span class="hljs-attr">object_store:</span> <span class="hljs-string">filesystem</span>  <span class="hljs-comment"># Stores logs on the local filesystem.</span>
       <span class="hljs-attr">schema:</span> <span class="hljs-string">v11</span>  <span class="hljs-comment"># Specifies schema version for log storage.</span>
       <span class="hljs-attr">index:</span>
         <span class="hljs-attr">prefix:</span> <span class="hljs-string">index_</span>  <span class="hljs-comment"># Prefix for index files.</span>
         <span class="hljs-attr">period:</span> <span class="hljs-string">24h</span>  <span class="hljs-comment"># Rotates indexes daily.</span>

 <span class="hljs-attr">storage_config:</span>
   <span class="hljs-attr">boltdb_shipper:</span>
     <span class="hljs-attr">active_index_directory:</span> <span class="hljs-string">/tmp/loki/index</span>  <span class="hljs-comment"># Directory for active index files.</span>
     <span class="hljs-attr">cache_location:</span> <span class="hljs-string">/tmp/loki/boltdb-cache</span>  <span class="hljs-comment"># Cache location for BoltDB.</span>
   <span class="hljs-attr">filesystem:</span>
     <span class="hljs-attr">directory:</span> <span class="hljs-string">/tmp/loki/chunks</span>  <span class="hljs-comment"># Directory for storing log chunks.</span>

 <span class="hljs-attr">limits_config:</span>
   <span class="hljs-attr">enforce_metric_name:</span> <span class="hljs-literal">false</span>  <span class="hljs-comment"># Disables strict metric name enforcement for flexibility.</span>
</code></pre>
</li>
<li><p><strong>Create a basic</strong> <code>promtail-config.yaml</code>:</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># Configures Promtail to scrape system logs and forward them to Loki.</span>
 <span class="hljs-attr">server:</span>
   <span class="hljs-attr">http_listen_port:</span> <span class="hljs-number">9080</span>  <span class="hljs-comment"># Sets Promtail’s HTTP port for metrics and health checks.</span>
   <span class="hljs-attr">grpc_listen_port:</span> <span class="hljs-number">0</span>  <span class="hljs-comment"># 0 makes Promtail bind gRPC to a random free port (it does not disable gRPC).</span>

 <span class="hljs-attr">positions:</span>
   <span class="hljs-attr">filename:</span> <span class="hljs-string">/tmp/positions.yaml</span>  <span class="hljs-comment"># Stores the position of scraped logs to resume after restarts.</span>

 <span class="hljs-attr">clients:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">url:</span> <span class="hljs-string">http://loki:3100/loki/api/v1/push</span>  <span class="hljs-comment"># Specifies the Loki endpoint for log ingestion.</span>

 <span class="hljs-attr">scrape_configs:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">system</span>  <span class="hljs-comment"># Defines a scraping job for system logs.</span>
     <span class="hljs-attr">static_configs:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span>
           <span class="hljs-bullet">-</span> <span class="hljs-string">localhost</span>  <span class="hljs-comment"># Targets the local host for log collection.</span>
         <span class="hljs-attr">labels:</span>
           <span class="hljs-attr">job:</span> <span class="hljs-string">varlogs</span>  <span class="hljs-comment"># Labels logs for easy querying in Loki.</span>
           <span class="hljs-attr">__path__:</span> <span class="hljs-string">/var/log/*.log</span>  <span class="hljs-comment"># Scrapes all log files in /var/log directory.</span>
</code></pre>
</li>
<li><p><strong>Run it:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Starts the Loki and Promtail containers in detached mode for background operation.</span>
 docker-compose up -d
</code></pre>
</li>
</ol>
<p>✨ This brings up Loki and Promtail with minimal resources, no authentication, and log scraping from <code>/var/log</code>.</p>
<h4 id="heading-troubleshooting-loki-setup-issues">Troubleshooting Loki Setup Issues</h4>
<p>If Loki or Promtail fails to start, one of the following may be the issue:</p>
<ol>
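<li><p><strong>Container crashes</strong>: Check logs with <code>docker-compose logs loki</code> or <code>docker-compose logs promtail</code> (the Compose file above does not set <code>container_name</code>, so plain <code>docker logs loki</code> will not find the container). Look for errors like <em>“out of memory”</em> or <em>“port already in use.”</em></p>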
<ul>
<li>Fix: Raise the container's memory limit (for example, through resource limits in <code>docker-compose.yml</code>) or remap the host port (for example, <code>3101:3100</code>).</li>
</ul>
</li>
<li><p><strong>Logs not ingested</strong>: Verify that Promtail is scraping the path your logs are actually written to (<code>/var/log/*.log</code> in the config above) using <code>docker-compose exec promtail cat /etc/promtail/promtail-config.yaml</code>.</p>
<ul>
<li>Fix: Update <code>__path__</code> in <code>promtail-config.yaml</code> to match your CI/CD log directory.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor resource usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Ensure your machine has at least 4GB RAM and 20GB disk space, as specified in the prerequisites.</li>
</ul>
</li>
</ol>
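<p>The first fix above can be sketched as a small change to the <code>loki</code> service in <code>docker-compose.yml</code>. This is a minimal sketch, not a tuned configuration: the <code>1g</code> limit and the <code>3101</code> host port are illustrative assumptions, so adjust them to your machine.</p>
<pre><code class="lang-yaml"># Sketch only: the memory limit and host port below are assumed example values.
services:
  loki:
    image: grafana/loki:2.9.4
    ports:
      - "3101:3100"  # Host port 3101 maps to Loki's port 3100 inside the container.
    deploy:
      resources:
        limits:
          memory: 1g  # Caps the container's memory; pick a value that fits your host.
</code></pre>
<p>Docker Compose v2 applies <code>deploy.resources.limits</code> directly. Older <code>docker-compose</code> 1.x releases generally honor it only when started with the <code>--compatibility</code> flag. If you remap the port, other containers on the same Compose network still reach Loki at <code>http://loki:3100</code>, since only the host-side port changes.</p>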
<h3 id="heading-configuration-for-cicd-logging">Configuration for CI/CD Logging</h3>
<p>To adapt for CI/CD logs, you should:</p>
<h4 id="heading-1-configure-your-cicd-tools-to-write-logs-to-disk">1. Configure your CI/CD tools to write logs to disk:</h4>
<p>For example, a self-hosted GitHub Actions runner can be set up to write job logs to <code>/var/log/gha/*.log</code>.</p>
<p>Update Promtail:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Configures Promtail to scrape logs from GitHub Actions runners for CI/CD observability.</span>
<span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">github_actions</span>  <span class="hljs-comment"># Defines a scraping job for GitHub Actions logs.</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost'</span>]  <span class="hljs-comment"># Targets the local host where the runner writes logs.</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">job:</span> <span class="hljs-string">gha</span>  <span class="hljs-comment"># Labels logs for identification in Loki queries.</span>
          <span class="hljs-attr">__path__:</span> <span class="hljs-string">/var/log/gha/*.log</span>  <span class="hljs-comment"># Scrapes logs from the specified directory.</span>
</code></pre>
<h4 id="heading-2-use-structured-logging-json">2. Use structured logging (JSON):</h4>
<p>Make sure your CI/CD tools or scripts output logs in structured format:</p>
<p>Example (each entry carries a UTC <code>timestamp</code>, a severity <code>level</code>, the CI/CD <code>job</code> that emitted it, and a descriptive <code>message</code>). Note that JSON itself does not allow comments, so keep each log line as plain JSON:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T13:00:00Z"</span>,
  <span class="hljs-attr">"level"</span>: <span class="hljs-string">"error"</span>,
  <span class="hljs-attr">"job"</span>: <span class="hljs-string">"deploy"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Image pull failed"</span>
}
</code></pre>
<p>This helps when querying with LogQL.</p>
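<p>To make those JSON fields usable in queries, Promtail can parse each line with pipeline stages. The following is a minimal sketch that assumes your logs use the field names from the example above (<code>timestamp</code>, <code>level</code>, <code>job</code>). The label name <code>stage</code> is a hypothetical choice made here so the JSON <code>job</code> field does not overwrite the static <code>job: gha</code> label.</p>
<pre><code class="lang-yaml"># Sketch only: assumes one JSON object per log line with timestamp, level, and job fields.
scrape_configs:
  - job_name: github_actions
    static_configs:
      - targets: ['localhost']
        labels:
          job: gha
          __path__: /var/log/gha/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level   # Extracts the "level" field.
            stage: job     # Extracts the "job" field under the name "stage" (hypothetical name).
            ts: timestamp  # Extracts the "timestamp" field for the timestamp stage below.
      - labels:
          level:           # Promotes the extracted level to a Loki label.
          stage:           # Promotes the extracted stage to a Loki label.
      - timestamp:
          source: ts
          format: RFC3339  # Uses the log's own time instead of the scrape time.
</code></pre>
<p>Only promote fields with a small number of distinct values (such as <code>level</code> or a pipeline stage) to labels. High-cardinality values like commit hashes or build IDs are better left in the log line and filtered at query time.</p>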
<h3 id="heading-how-to-connect-ci-agents-to-loki">How to Connect CI Agents to Loki</h3>
<p>This section explains three different ways to get your CI pipeline logs into Loki for monitoring and analysis:</p>
<h4 id="heading-option-1-local-setup">Option 1 – Local setup:</h4>
<p>Your CI agents write log files to disk, and Promtail (running on the same machine) reads those files and sends them to Loki.</p>
<h4 id="heading-option-2-using-docker-logging-driver-docker-containers">Option 2 – Using Docker logging driver (Docker containers):</h4>
<p>If your CI agents run in Docker containers, you install a special Loki plugin that automatically captures all container output and sends it directly to Loki without needing separate log files.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Installs the Loki Docker logging driver to send container logs directly to Loki.</span>
docker plugin install grafana/loki-docker-driver:latest --<span class="hljs-built_in">alias</span> loki --grant-all-permissions
</code></pre>
<p>Then run your agent container:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs a CI agent container with the Loki logging driver to forward logs.</span>
docker run --log-driver=loki \
  --log-opt loki-url=<span class="hljs-string">"http://&lt;your-loki-host&gt;:3100/loki/api/v1/push"</span> \
  my-ci-agent-image
</code></pre>
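<p>If you start your agents with Docker Compose rather than <code>docker run</code>, the same logging driver can be set per service. This is a sketch under the assumption that the plugin was installed with the <code>loki</code> alias as shown above. The service name <code>ci-agent</code> is hypothetical, and the image name is the same placeholder used in the command above.</p>
<pre><code class="lang-yaml"># Sketch only: "ci-agent" is a hypothetical service name.
services:
  ci-agent:
    image: my-ci-agent-image
    logging:
      driver: loki  # Uses the plugin alias set during installation.
      options:
        loki-url: "http://&lt;your-loki-host&gt;:3100/loki/api/v1/push"
</code></pre>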
<h4 id="heading-option-3-remote-setup">Option 3 – Remote setup:</h4>
<p>If you can't install Promtail locally, you can use a log forwarding tool like <a target="_blank" href="https://fluentbit.io/">Fluent Bit</a> or <a target="_blank" href="https://vector.dev/">Vector</a> to collect logs and push them to Loki over the network.</p>
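<p>As a rough illustration of the Vector route, the configuration below tails log files and pushes them to Loki. It is a minimal sketch, not a production config: the component names <code>ci_logs</code> and <code>loki_out</code>, the log path, and the <code>job</code> label value are all assumptions, and the Loki host is left as a placeholder. Check the option names against the Vector version you install.</p>
<pre><code class="lang-yaml"># Sketch only: component names, path, and label values are assumed for illustration.
sources:
  ci_logs:
    type: file
    include:
      - /var/log/gha/*.log  # Assumed location of your CI logs.

sinks:
  loki_out:
    type: loki
    inputs:
      - ci_logs
    endpoint: "http://&lt;your-loki-host&gt;:3100"  # Base URL; Vector appends the push path.
    encoding:
      codec: text  # Sends each original log line unchanged.
    labels:
      job: gha
</code></pre>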
<p><strong>The goal:</strong> Regardless of which option you choose, you’ll end up with all your CI pipeline logs centralized in Loki, where you can search through them, create dashboards in Grafana, and set up alerts when things go wrong.</p>
<p>These options give you the flexibility to integrate log collection around your infrastructure – whether you prefer local agents, Docker plugins, or remote forwarding.</p>
<h2 id="heading-how-to-implement-an-elk-stack-alternative-for-pipeline-observability">How to Implement an ELK Stack Alternative for Pipeline Observability</h2>
<p>When a full ELK stack (Elasticsearch, Logstash, Kibana) is too heavy for your infrastructure, you can go with a lightweight setup that achieves similar observability at lower cost and with less resource usage.</p>
<h3 id="heading-how-to-install-lightweight-versions-of-elasticsearch-logstash-and-kibana">How to Install Lightweight Versions of Elasticsearch, Logstash, and Kibana</h3>
<p>Goal: Stand up a minimal yet functional ELK stack for debugging CI/CD pipelines.</p>
<h4 id="heading-1-use-docker-to-spin-up-lightweight-containers">1. Use Docker to spin up lightweight containers</h4>
<p>Create a <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a Docker Compose setup for a lightweight ELK stack to aggregate and visualize CI/CD logs.</span>
<span class="hljs-attr">version:</span> <span class="hljs-string">'3.7'</span>

<span class="hljs-attr">services:</span>
  <span class="hljs-attr">elasticsearch:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/elasticsearch/elasticsearch:7.17.0</span>  <span class="hljs-comment"># Uses Elasticsearch 7.17.0.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">elasticsearch</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">discovery.type=single-node</span>  <span class="hljs-comment"># Runs Elasticsearch in single-node mode for simplicity.</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">xpack.security.enabled=false</span>  <span class="hljs-comment"># Disables security features for lightweight setup.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9200:9200"</span>  <span class="hljs-comment"># Exposes Elasticsearch’s HTTP API port.</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">esdata:/usr/share/elasticsearch/data</span>  <span class="hljs-comment"># Persists Elasticsearch data.</span>

  <span class="hljs-attr">logstash:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/logstash/logstash:7.17.0</span>  <span class="hljs-comment"># Uses Logstash 7.17.0.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">logstash</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5044:5044"</span>  <span class="hljs-comment"># Port for receiving logs from Beats.</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9600:9600"</span>  <span class="hljs-comment"># Port for Logstash monitoring.</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./logstash.conf:/usr/share/logstash/pipeline/logstash.conf</span>  <span class="hljs-comment"># Mounts Logstash config file.</span>

  <span class="hljs-attr">kibana:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.elastic.co/kibana/kibana:7.17.0</span>  <span class="hljs-comment"># Uses Kibana 7.17.0 for visualization.</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">kibana</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">ELASTICSEARCH_HOSTS=http://elasticsearch:9200</span>  <span class="hljs-comment"># Links Kibana to Elasticsearch.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"5601:5601"</span>  <span class="hljs-comment"># Exposes Kibana’s web UI port.</span>

<span class="hljs-attr">volumes:</span>
  <span class="hljs-attr">esdata:</span>  <span class="hljs-comment"># Defines a volume for persisting Elasticsearch data.</span>
</code></pre>
<h4 id="heading-2-minimal-logstash-pipeline-configuration-logstashconf">2. Minimal Logstash pipeline configuration (logstash.conf)</h4>
<pre><code class="lang-javascript"><span class="hljs-comment"># Configures Logstash to process and forward CI/CD logs to Elasticsearch.</span>
<span class="hljs-comment"># Note: Logstash config files use # for comments, not //.</span>
input {
  beats {
    port =&gt; <span class="hljs-number">5044</span>  <span class="hljs-comment"># Listens for logs from Filebeat on port 5044.</span>
  }
}

filter {
  json {
    source =&gt; <span class="hljs-string">"message"</span>  <span class="hljs-comment"># Parses JSON-formatted log messages for structured data.</span>
  }
}

output {
  elasticsearch {
    hosts =&gt; [<span class="hljs-string">"http://elasticsearch:9200"</span>]  <span class="hljs-comment"># Sends processed logs to Elasticsearch.</span>
    index =&gt; <span class="hljs-string">"ci-logs-%{+YYYY.MM.dd}"</span>  <span class="hljs-comment"># Stores logs in daily indexes (e.g., ci-logs-2025.05.14).</span>
  }
}
</code></pre>
<h4 id="heading-troubleshooting-elk-setup-issues">Troubleshooting ELK Setup Issues</h4>
<p>If Elasticsearch, Logstash, or Kibana fails to start, one of the following might be the issue:</p>
<ol>
<li><p><strong>Container crashes</strong>: Check logs with <code>docker logs elasticsearch</code>, <code>docker logs logstash</code>, or <code>docker logs kibana</code>. Look for errors like <em>“insufficient disk space”</em> or <em>“port conflict”</em> (for example, 9200, 5601).</p>
<ul>
<li>Fix: Free up disk space (ensure at least 20GB available) or change ports in <code>docker-compose.yml</code> (for example, <code>9201:9200</code>).</li>
</ul>
</li>
<li><p><strong>Logs not ingested</strong>: Verify Logstash is receiving data from Filebeat or Vector using <code>docker logs logstash</code>. Check the <code>logstash.conf</code> input port (for example, 5044).</p>
<ul>
<li>Fix: Ensure Filebeat or Vector is configured to send to the correct Logstash endpoint (e.g., <code>localhost:5044</code>) and update if needed.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor resource usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 8GB RAM and 30GB disk space, as Elasticsearch requires more resources than Loki. Adjust memory limits in <code>docker-compose.yml</code> if necessary.</li>
</ul>
</li>
</ol>
<h3 id="heading-how-to-configure-log-shippers-for-different-cicd-components">How to Configure Log Shippers for Different CI/CD Components</h3>
<p>Goal: Get logs from your pipeline into Logstash or Elasticsearch.</p>
<h4 id="heading-option-1-use-filebeat-lightweight-log-shipper">Option 1: Use Filebeat (lightweight log shipper)</h4>
<p>Install <a target="_blank" href="https://www.elastic.co/beats/filebeat">Filebeat</a> on your CI/CD hosts (GitHub runner, Jenkins node, GitLab runner, and so on).</p>
<p>Filebeat config snippet (filebeat.yml):</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Configures Filebeat to collect CI/CD logs and forward them to Logstash.</span>
<span class="hljs-attr">filebeat.inputs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">log</span>  <span class="hljs-comment"># Specifies log file input.</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>  <span class="hljs-comment"># Enables the input.</span>
    <span class="hljs-attr">paths:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">/var/log/ci/*.log</span>  <span class="hljs-comment"># Scrapes logs from the specified CI log directory.</span>

<span class="hljs-attr">output.logstash:</span>
  <span class="hljs-attr">hosts:</span> [<span class="hljs-string">"localhost:5044"</span>]  <span class="hljs-comment"># Forwards logs to Logstash on port 5044.</span>
</code></pre>
<p>Then run:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs Filebeat with the specified configuration file for log collection.</span>
filebeat -e -c filebeat.yml
</code></pre>
<h4 id="heading-option-2-use-vectordev-as-a-more-resource-efficient-alternative-to-filebeat">Option 2: Use Vector.dev as a more resource-efficient alternative to Filebeat</h4>
<p>Vector configuration (vector.toml):</p>
<pre><code class="lang-toml"><span class="hljs-comment"># Configures Vector to collect, parse, and forward CI/CD logs to Elasticsearch efficiently.</span>
<span class="hljs-section">[sources.ci_logs]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"file"</span>  <span class="hljs-comment"># Specifies file-based log collection.</span>
  <span class="hljs-attr">include</span> = [<span class="hljs-string">"/var/log/ci/*.log"</span>]  <span class="hljs-comment"># Targets CI log files.</span>

<span class="hljs-section">[transforms.json_parser]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"remap"</span>  <span class="hljs-comment"># Uses remap transform to parse logs.</span>
  <span class="hljs-attr">inputs</span> = [<span class="hljs-string">"ci_logs"</span>]  <span class="hljs-comment"># Processes logs from the ci_logs source.</span>
  <span class="hljs-attr">source</span> = <span class="hljs-string">'''
  . = parse_json!(.message)  # Parses JSON log messages into structured data.
  '''</span>

<span class="hljs-section">[sinks.to_elasticsearch]</span>
  <span class="hljs-attr">type</span> = <span class="hljs-string">"elasticsearch"</span>  <span class="hljs-comment"># Sends logs to Elasticsearch.</span>
  <span class="hljs-attr">inputs</span> = [<span class="hljs-string">"json_parser"</span>]  <span class="hljs-comment"># Uses parsed logs from the json_parser transform.</span>
  <span class="hljs-attr">endpoint</span> = <span class="hljs-string">"http://localhost:9200"</span>  <span class="hljs-comment"># Specifies the Elasticsearch endpoint.</span>
  <span class="hljs-attr">index</span> = <span class="hljs-string">"ci-logs-%Y.%m.%d"</span>  <span class="hljs-comment"># Stores logs in daily indexes (e.g., ci-logs-2025.05.14) so they match the ci-logs-* pattern.</span>
</code></pre>
<p>Run:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs Vector with the specified configuration file for log processing.</span>
vector -c vector.toml
</code></pre>
<h3 id="heading-how-to-set-up-index-patterns-and-basic-visualizations">How to Set Up Index Patterns and Basic Visualizations</h3>
<p>Goal: Make CI/CD logs queryable and visual in Kibana.</p>
<h4 id="heading-1-open-kibana-httplocalhost5601httplocalhost5601">1. Open Kibana (<a target="_blank" href="http://localhost:5601/">http://localhost:5601</a>)</h4>
<ul>
<li><p>Go to <strong>Stack Management → Index Patterns</strong></p>
</li>
<li><p>Create a new pattern: <code>ci-logs-*</code></p>
</li>
<li><p>Choose a time field like <code>@timestamp</code></p>
</li>
</ul>
<h4 id="heading-2-visualizations-for-common-cicd-use-cases">2. Visualizations for Common CI/CD Use Cases</h4>
<ul>
<li><p><strong>Bar charts</strong>: Number of failed vs passed builds per day</p>
</li>
<li><p><strong>Pie chart</strong>: Top error types or most frequent failing test names</p>
</li>
<li><p><strong>Line chart</strong>: Duration of builds over time (if duration is logged)</p>
</li>
</ul>
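<p>To make the first chart concrete, here is a minimal Node.js sketch of the aggregation behind a "failed vs passed builds per day" bar chart. The <code>status</code> field and its <code>passed</code> and <code>failed</code> values are assumptions for illustration, since the setup above does not define them. Adapt the names to whatever your pipeline actually logs.</p>
<pre><code class="lang-javascript">// Hypothetical sample of parsed CI log documents (field names are assumptions).
const builds = [
  { "@timestamp": "2025-05-20T09:15:00Z", status: "passed" },
  { "@timestamp": "2025-05-20T11:40:00Z", status: "failed" },
  { "@timestamp": "2025-05-21T08:05:00Z", status: "passed" },
];

// Groups builds by calendar day (UTC) and counts each status.
function countBuildsPerDay(docs) {
  const perDay = {};
  for (const doc of docs) {
    const day = doc["@timestamp"].slice(0, 10); // Keeps the YYYY-MM-DD prefix.
    if (!perDay[day]) {
      perDay[day] = { passed: 0, failed: 0 };
    }
    perDay[day][doc.status] = (perDay[day][doc.status] || 0) + 1;
  }
  return perDay;
}

console.log(countBuildsPerDay(builds));
</code></pre>
<p>Kibana performs the same grouping for you when you put a date histogram on the horizontal axis and split the series by the status field.</p>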
<h4 id="heading-3-saved-searches-amp-dashboards">3. Saved Searches &amp; Dashboards</h4>
<p>You can save a search like this:</p>
<pre><code class="lang-javascript">message: <span class="hljs-string">"error"</span> AND job_name: <span class="hljs-string">"build"</span>
</code></pre>
<p>You can also combine visualizations into a CI/CD Health Dashboard.</p>
<h2 id="heading-how-to-create-a-unified-logging-strategy-across-pipeline-components">How to Create a Unified Logging Strategy Across Pipeline Components</h2>
<p>Creating a unified logging strategy across your CI/CD pipeline components ensures that logs are consistent, traceable, and easy to correlate. This helps you quickly debug issues, monitor system health, and trace requests across different tools and services. Let’s discuss some key practices for achieving a unified logging strategy:</p>
<h3 id="heading-implementing-consistent-log-formats-across-different-tools">Implementing Consistent Log Formats Across Different Tools</h3>
<p>Consistent log formats matter for three reasons. A standardized format makes logs easier to query, search, and visualize. It makes it easier to correlate logs from different services. And it ensures that every entry carries the details you need, such as a timestamp, log level, and request context.</p>
<p>There are also some best practices you should follow when formatting logs:</p>
<p><strong>JSON Format</strong> is highly recommended as it’s structured, machine-readable, and compatible with many observability tools (for example, Loki, Elasticsearch, Grafana).</p>
<p>There are also some key fields you should include:</p>
<ul>
<li><p><code>timestamp</code>: The time the log entry was created (preferably in UTC).</p>
</li>
<li><p><code>log_level</code>: The severity of the entry, such as <code>INFO</code>, <code>ERROR</code>, or <code>DEBUG</code>.</p>
</li>
<li><p><code>service</code>: The service or component generating the log.</p>
</li>
<li><p><code>message</code>: A concise description of the event or error.</p>
</li>
<li><p><code>correlation_id</code>: A unique identifier for requests to trace logs across systems.</p>
</li>
</ul>
<p>Here’s an example of a consistent log in JSON format:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T12:34:56Z"</span>,
  <span class="hljs-attr">"log_level"</span>: <span class="hljs-string">"ERROR"</span>,
  <span class="hljs-attr">"service"</span>: <span class="hljs-string">"ci_cd_pipeline"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Build failed due to missing dependency"</span>,
  <span class="hljs-attr">"correlation_id"</span>: <span class="hljs-string">"1234567890abcdef"</span>
}
</code></pre>
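<p>One way to keep every tool on this format is to route all log writes through a small helper. The following Node.js sketch builds a log line with the fields listed above. The function name <code>formatLogEntry</code> and its parameter order are illustrative assumptions, not part of any library.</p>
<pre><code class="lang-javascript">// Builds one JSON log line containing the key fields listed above.
function formatLogEntry(level, service, message, correlationId, now) {
  const entry = {
    timestamp: (now || new Date()).toISOString(), // ISO 8601 in UTC.
    log_level: level.toUpperCase(),
    service: service,
    message: message,
    correlation_id: correlationId,
  };
  return JSON.stringify(entry);
}

// Example usage with a fixed clock so the output is reproducible.
const line = formatLogEntry(
  "error",
  "ci_cd_pipeline",
  "Build failed due to missing dependency",
  "1234567890abcdef",
  new Date("2025-05-10T12:34:56Z")
);
console.log(line);
</code></pre>
<p>Because each entry is a single line of JSON, shippers such as Filebeat or Vector can parse it without custom patterns.</p>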
<h3 id="heading-how-to-set-up-log-forwarding-from-github-actions-jenkins-or-gitlab">How to Set Up Log Forwarding from GitHub Actions, Jenkins, or GitLab</h3>
<p>Log forwarding refers to shipping logs from your CI/CD pipelines to a central spot for easy tracking. It’s helpful because it lets you spot issues fast and debug without digging through scattered files.</p>
<p>For GitHub Actions, you can configure workflows to write logs to a file or send them directly to a log aggregation tool like Loki. In Jenkins, you can use pipeline scripts to forward logs to a log server or file system. Similarly, for GitLab CI, you can add scripts in <code>.gitlab-ci.yml</code> to forward logs to a centralized endpoint.</p>
<p><strong>Using Actions for Outputting Logs:</strong><br>You can store logs in files and then forward them to a logging system (like Loki or Elasticsearch).<br>Here’s an example in a GitHub Action workflow:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitHub Actions workflow to run tests and forward logs for observability.</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>  <span class="hljs-comment"># Uses an Ubuntu runner.</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">repository</span>  <span class="hljs-comment"># Checks out the repository code.</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">tests</span> <span class="hljs-string">and</span> <span class="hljs-string">log</span> <span class="hljs-string">output</span>  <span class="hljs-comment"># Runs tests and saves output to a log file.</span>
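        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Starting tests..."
          npm test | tee test.log  # Captures test output to test.log.
          # Wraps the log in the JSON body that Loki's push API expects, then POSTs it.
          # The job label is an example; replace your-loki-endpoint with your Loki host.
          jq -Rs --arg ts "$(date +%s%N)" '{streams: [{stream: {job: "ci_cd"}, values: [[$ts, .]]}]}' test.log |
            curl -X POST -H "Content-Type: application/json" --data-binary @- http://your-loki-endpoint/loki/api/v1/push</span>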
</code></pre>
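<p>When you send logs to Loki over HTTP yourself, note that its push API (<code>/loki/api/v1/push</code>) expects a JSON body made of labeled streams, where each entry pairs a nanosecond timestamp with a log line. Here is a minimal Node.js sketch of building that body. The helper name <code>buildLokiPayload</code> and the label values are assumptions for illustration.</p>
<pre><code class="lang-javascript">// Builds a request body in the JSON shape accepted by Loki's push API.
// The label names and values passed in are illustrative assumptions.
function buildLokiPayload(labels, lines, startNs) {
  const values = lines.map(function (line, i) {
    // Each value is [timestamp in nanoseconds as a string, log line].
    return [(startNs + BigInt(i)).toString(), line];
  });
  return { streams: [{ stream: labels, values: values }] };
}

const payload = buildLokiPayload(
  { job: "ci_cd", stage: "test" },
  ["Starting tests...", "2 passing, 1 failing"],
  BigInt(Date.now()) * 1000000n
);
console.log(JSON.stringify(payload, null, 2));
</code></pre>
<p>The sketch bumps each timestamp by one nanosecond so that lines keep their original order within the stream.</p>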
<p><strong>Log Forwarding with Promtail:</strong><br>If you are using Grafana Loki for log aggregation, set up Promtail to scrape the logs from the GitHub Actions runner.</p>
<h4 id="heading-jenkins">Jenkins:</h4>
<p>Jenkins logs can be forwarded to external systems (like Elasticsearch or Loki) by using log shippers or plugins.</p>
<p><strong>You can use the Logstash Plugin</strong> to forward Jenkins logs to an ELK stack or other systems:</p>
<ul>
<li><p>Install the Logstash plugin on Jenkins.</p>
</li>
<li><p>Configure the plugin to forward logs to an Elasticsearch server or a logging system of choice.</p>
</li>
<li><p>In Jenkins, add log forwarding configurations:</p>
</li>
</ul>
<pre><code class="lang-javascript">pipeline {
  agent any
  stages {
    stage(<span class="hljs-string">'Build'</span>) {
      steps {
        script {
          <span class="hljs-comment">// Example of forwarding logs to a log server</span>
          sh <span class="hljs-string">'echo "Build successful" | curl -X POST -d @- http://your-log-server'</span>
        }
      }
    }
  }
}
</code></pre>
<p><strong>Forward to Loki:</strong><br>If you run Jenkins in Docker, you can use Docker's <code>loki</code> logging driver. It's a Docker plugin (<code>grafana/loki-docker-driver</code>) that must be installed on the host first. With the driver installed, the container's logs go directly to Loki's push endpoint:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Runs a Jenkins container with the Loki logging driver to send logs directly to Loki.</span>
docker run --log-driver=loki --log-opt loki-url=http://loki:3100/loki/api/v1/push jenkins/jenkins:lts
</code></pre>
<h4 id="heading-gitlab">GitLab:</h4>
<p>GitLab CI allows logs to be forwarded to external systems for centralized collection and analysis.</p>
<p><strong>Use GitLab CI/CD to Output Logs</strong>:<br>Example in <code>.gitlab-ci.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitLab CI/CD pipeline to run a build and forward logs to Loki.</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">build</span>
<span class="hljs-attr">build:</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">echo</span> <span class="hljs-string">"Starting the build"</span> <span class="hljs-string">|</span> <span class="hljs-string">tee</span> <span class="hljs-string">build.log</span>  <span class="hljs-comment"># Saves build output to build.log.</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">curl</span> <span class="hljs-string">-X</span> <span class="hljs-string">POST</span> <span class="hljs-string">--data-binary</span> <span class="hljs-string">@build.log</span> <span class="hljs-string">http://your-loki-endpoint</span>  <span class="hljs-comment"># Posts the log to a placeholder endpoint; Loki's own push API (/loki/api/v1/push) expects a JSON body.</span>
</code></pre>
<p><strong>GitLab Runners</strong>:<br>Configure GitLab runners to forward logs to an external service like Loki or Elasticsearch using <code>log-driver</code> settings or the <code>fluentd</code> log shipper.</p>
<h3 id="heading-how-to-add-correlation-ids-to-trace-requests-through-the-system">How to Add Correlation IDs to Trace Requests Through the System</h3>
<h4 id="heading-why-correlation-ids-are-important">Why Correlation IDs Are Important:</h4>
<p>Correlation IDs allow you to trace a single request as it travels through different services and tools, enabling end-to-end visibility and troubleshooting.</p>
<p>They are critical for debugging distributed systems, especially when different services (for example, CI tool, deployment tool, API service) are involved.</p>
<h4 id="heading-how-to-add-correlation-ids">How to Add Correlation IDs:</h4>
<p>You can use a UUID (Universally Unique Identifier) or a GUID (Globally Unique Identifier) to generate a unique ID for each request.</p>
<p>If you are using microservices or multiple services in the pipeline, just make sure that the same ID is propagated across each service.</p>
<p>Many logging libraries (for example, <code>winston</code> for Node.js, <code>log4j</code> for Java) make it easy to attach a correlation ID to every log entry once you have generated it.</p>
<p>Here’s an example in Node.js (using <code>winston</code>):</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Sets up Winston for structured logging with correlation IDs in a CI/CD pipeline.</span>
<span class="hljs-keyword">const</span> { randomUUID } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'crypto'</span>);
<span class="hljs-keyword">const</span> { createLogger, transports, format } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'winston'</span>);
<span class="hljs-keyword">const</span> { combine, timestamp, printf } = format;

<span class="hljs-comment">// Creates one correlation ID per pipeline run, reusing an upstream ID if one is set.</span>
<span class="hljs-keyword">const</span> correlationId = process.env.CORRELATION_ID || randomUUID();

<span class="hljs-comment">// Creates a logger that adds a timestamp and the shared correlation ID to every entry.</span>
<span class="hljs-keyword">const</span> logger = createLogger({
  <span class="hljs-attr">format</span>: combine(
    timestamp(),  <span class="hljs-comment">// Populates the timestamp field used below.</span>
    printf(<span class="hljs-function">(<span class="hljs-params">{ level, message, timestamp }</span>) =&gt;</span> {
      <span class="hljs-keyword">return</span> <span class="hljs-string">`<span class="hljs-subst">${timestamp}</span> [<span class="hljs-subst">${level}</span>] <span class="hljs-subst">${message}</span> correlation_id=<span class="hljs-subst">${correlationId}</span>`</span>;
    })
  ),
  <span class="hljs-attr">transports</span>: [
    <span class="hljs-keyword">new</span> transports.Console(),  <span class="hljs-comment">// Outputs logs to the console.</span>
  ],
});

<span class="hljs-comment">// Logs a sample message.</span>
logger.info(<span class="hljs-string">'Pipeline execution started'</span>);
</code></pre>
<h4 id="heading-how-to-propagate-correlation-ids-between-services">How to Propagate Correlation IDs Between Services:</h4>
<p>In CI/CD tools, you can configure your pipeline to inject the correlation ID into logs. For example, in GitHub Actions, you can generate a correlation ID in the <code>env</code> section and propagate it in each job:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Defines a GitHub Actions workflow that includes a correlation ID for log tracing.</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>  <span class="hljs-comment"># Uses an Ubuntu runner.</span>
    <span class="hljs-attr">env:</span>
      <span class="hljs-attr">CORRELATION_ID:</span> <span class="hljs-string">${{</span> <span class="hljs-string">github.run_id</span> <span class="hljs-string">}}</span>  <span class="hljs-comment"># Uses the GitHub run ID as a correlation ID.</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">repository</span>  <span class="hljs-comment"># Checks out the repository code.</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Log</span> <span class="hljs-string">build</span> <span class="hljs-string">start</span> <span class="hljs-string">with</span> <span class="hljs-string">correlation</span> <span class="hljs-string">ID</span>  <span class="hljs-comment"># Logs the build start with the correlation ID.</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">echo</span> <span class="hljs-string">"Build started with Correlation ID: $CORRELATION_ID"</span>
</code></pre>
<h4 id="heading-include-correlation-ids-in-all-logs">Include Correlation IDs in All Logs:</h4>
<p>You’ll want to make sure that logs from all components in the pipeline (GitHub Actions, Jenkins, GitLab, deployment tools, and so on) include the correlation ID as part of the log message. This allows you to trace the logs of a single request or pipeline run across different services.</p>
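<p>As a rough illustration of propagation, the Node.js sketch below reuses a correlation ID passed in through the environment and attaches it to outgoing HTTP headers so that downstream services can log the same value. The helper names and the <code>X-Correlation-ID</code> header are assumptions based on common convention, not a requirement of any tool mentioned here.</p>
<pre><code class="lang-javascript">const { randomUUID } = require("crypto");

// Reuses an upstream ID when one is provided, otherwise creates a new one.
function resolveCorrelationId(env) {
  return env.CORRELATION_ID || randomUUID();
}

// Returns a copy of the headers with the correlation ID attached.
// The X-Correlation-ID header name is a common convention, assumed here.
function withCorrelationId(headers, correlationId) {
  return Object.assign({}, headers, { "X-Correlation-ID": correlationId });
}

const correlationId = resolveCorrelationId(process.env);
const headers = withCorrelationId({ "Content-Type": "application/json" }, correlationId);
console.log(headers);
</code></pre>
<p>Generating a new ID only when none is present keeps a single value flowing from the CI job through every later step.</p>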
<h4 id="heading-visualize-your-log-flow">Visualize Your Log Flow</h4>
<p>You can create a diagram showing how logs move from your CI/CD tool (for example, GitHub Actions) to Promtail/Vector, then to Loki/Elasticsearch, and finally to Grafana/Kibana for visualization. Use tools like <a target="_blank" href="http://Draw.io">Draw.io</a> to map your pipeline’s observability flow.</p>
<h2 id="heading-how-to-query-and-analyze-logs-for-effective-troubleshooting">How to Query and Analyze Logs for Effective Troubleshooting</h2>
<p>In this section, you’ll learn how to use LogQL (Loki's query language) to cut through the noise and find the specific logs that matter. Whether you're hunting down a mysterious build failure or tracking deployment issues across multiple services, these query patterns give you a starting point.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748224707087/d348accc-0ef8-4ebb-9cb9-49995404b0ec.png" alt="Bar chart showing CI/CD build results from May 20-26, 2025. Blue bars represent successful builds ranging from 39-52 per day, while red bars show failed builds ranging from 1-9 per day. The chart demonstrates consistently high success rates with low failure rates throughout the week, with May 23 showing the highest failure count at 9 builds." class="image--center mx-auto" width="1468" height="866" loading="lazy"></p>
<p>This bar chart shows CI/CD build results from May 20 to May 26, 2025, comparing successful builds (in blue) with failed builds (in red) for each day. Successful builds range from 39 to 52 per day, while failed builds stay between 1 and 9, peaking at 9 on May 23. This indicates a generally stable pipeline with occasional issues.</p>
<h3 id="heading-how-to-write-advanced-logql-queries-to-pinpoint-cicd-issues">How to Write Advanced LogQL Queries to Pinpoint CI/CD Issues</h3>
<p>LogQL is Grafana Loki's query language, designed for querying logs with a syntax similar to Prometheus’s PromQL. It enables efficient log searches and is particularly useful in troubleshooting CI/CD issues.</p>
<h4 id="heading-basic-logql-syntax">Basic LogQL Syntax:</h4>
<p><strong>1. Log Streams:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>}
</code></pre>
<p>This query retrieves logs where the <code>job</code> label is <code>ci_cd</code> and the <code>level</code> label is <code>error</code>.</p>
<p><strong>2. Log Filters:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>The <code>|=</code> operator filters logs to include only those that contain the specified string, for example "build failed".</p>
<p><strong>3. Regular Expressions:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>} |~ <span class="hljs-string">"error.*timeout"</span>
</code></pre>
<p>This uses the <code>|~</code> operator to filter logs using a regular expression. In this case, it finds logs that contain an "error" followed by "timeout".</p>
<h4 id="heading-advanced-logql-queries-for-cicd-issues">Advanced LogQL Queries for CI/CD Issues:</h4>
<p><strong>1. Filter Logs for Specific Build Failures:</strong></p>
<p>If your pipeline uses a specific label for build names:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, build=<span class="hljs-string">"build123"</span>} |= <span class="hljs-string">"failure"</span>
</code></pre>
<p>This finds logs for the build labeled <code>build123</code> that contain the word "failure".</p>
<p><strong>2. Using Time Range and Grouping:</strong></p>
<p>To find error logs from the last 15 minutes, run a log query and set the time range to <strong>Last 15 minutes</strong> in Grafana's time picker. LogQL log queries have no inline time-range clause. Only metric queries take a range such as <code>[15m]</code>:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>To count error logs per job:</p>
<pre><code class="lang-javascript">sum by (job) (count_over_time({job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>}[<span class="hljs-number">5</span>m]))
</code></pre>
<p>This will return the count of error logs per job, grouped by job name, over the last 5 minutes.</p>
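<p>If you generate these queries from scripts, for example to call Loki's HTTP API, a small builder keeps the syntax consistent. This is a minimal Node.js sketch under the assumption that label values and filter text are plain strings. The helper names are hypothetical, and <code>JSON.stringify</code> is used here only as a convenient way to produce double-quoted, escaped strings.</p>
<pre><code class="lang-javascript">// Builds a LogQL stream selector such as {job="ci_cd", level="error"}.
function buildSelector(labels) {
  const pairs = Object.keys(labels).map(function (name) {
    return name + "=" + JSON.stringify(labels[name]);
  });
  return "{" + pairs.join(", ") + "}";
}

// Appends a line filter (|= for substring, |~ for regex) to a selector.
function withLineFilter(selector, operator, text) {
  return selector + " " + operator + " " + JSON.stringify(text);
}

const selector = buildSelector({ job: "ci_cd", level: "error" });
console.log(withLineFilter(selector, "|=", "build failed"));
console.log("sum by (job) (count_over_time(" + selector + "[5m]))");
</code></pre>
<p>The two printed queries correspond to the log filter and the per-job count shown earlier in this section.</p>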
<h3 id="heading-how-to-create-pipeline-specific-queries-for-common-failure-patterns">How to Create Pipeline-Specific Queries for Common Failure Patterns</h3>
<h4 id="heading-common-failure-patterns-in-cicd-pipelines">Common Failure Patterns in CI/CD Pipelines:</h4>
<p><strong>1. Build Failures:</strong></p>
<p>If your CI system logs contain build errors, you can identify them with:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p>You can extend this to filter by specific steps or stages, for example, “test failed”, or “compilation error”.</p>
<p><strong>2. Test Failures:</strong></p>
<p>Logs from your test runner (for example, Jest, Mocha, JUnit) can contain specific failure messages:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, stage=<span class="hljs-string">"test"</span>} |= <span class="hljs-string">"test failed"</span>
</code></pre>
<p><strong>3. Dependency Issues:</strong></p>
<p>If your pipeline is failing due to missing or conflicting dependencies, look for <code>npm</code>, <code>maven</code>, or <code>docker</code> related errors:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, image=<span class="hljs-string">"node"</span>} |= <span class="hljs-string">"npm ERR!"</span>
</code></pre>
<p>Or for Maven-related issues:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, image=<span class="hljs-string">"maven"</span>} |= <span class="hljs-string">"[ERROR]"</span>
</code></pre>
<p><strong>4. Resource Constraints (for example, Out of Memory):</strong></p>
<p>If you experience resource constraints, you might see logs like "OutOfMemoryError":</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"OutOfMemoryError"</span>
</code></pre>
<p><strong>Example of combining filters:</strong></p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span> |~ <span class="hljs-string">"timeout|dependency"</span>
</code></pre>
<p>This keeps only error logs that contain "build failed" and also match "timeout" or "dependency". To limit the results to the last hour, set the time range in Grafana's time picker.</p>
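<p>The same patterns can be applied outside Loki, for instance when post-processing exported logs. The Node.js sketch below sorts log lines into the failure categories from this section. The category names and the order in which patterns are checked are assumptions made for illustration.</p>
<pre><code class="lang-javascript">// Maps each failure category to a pattern taken from the queries above.
const failurePatterns = [
  { category: "dependency", pattern: /npm ERR!|\[ERROR\]/ },
  { category: "resource", pattern: /OutOfMemoryError/ },
  { category: "test", pattern: /test failed/ },
  { category: "build", pattern: /build failed/ },
];

// Returns the first matching category, or "unknown" when nothing matches.
function classifyLogLine(line) {
  const match = failurePatterns.find(function (entry) {
    return entry.pattern.test(line);
  });
  return match ? match.category : "unknown";
}

console.log(classifyLogLine("npm ERR! code ERESOLVE"));
</code></pre>
<p>Checking the most specific patterns first means a line that mentions both a dependency error and a build failure is attributed to the dependency problem.</p>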
<h3 id="heading-how-to-set-up-alert-rules-based-on-log-patterns">How to Set Up Alert Rules Based on Log Patterns</h3>
<p>Alerts help detect recurring issues proactively. They notify you when a specific pattern appears in your logs, allowing you to take quick action.</p>
<h4 id="heading-steps-for-setting-up-alerts"><strong>Steps for Setting Up Alerts:</strong></h4>
<p><strong>1. Create a Query for the Alert:</strong></p>
<p>First, define the log pattern you want to monitor. For example, an alert for build failures:</p>
<pre><code class="lang-javascript">{job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>
</code></pre>
<p><strong>2. Create an Alert in Grafana:</strong></p>
<p>Follow these steps to set up Grafana alerts:</p>
<ul>
<li><p>Go to your Grafana dashboard.</p>
</li>
<li><p>Choose the panel you want to set the alert on (or create a new panel for this purpose).</p>
</li>
<li><p>In the panel, click the <strong>Alert</strong> tab.</p>
</li>
<li><p>Set the <strong>Query</strong> field to your LogQL query, such as the one above.</p>
</li>
<li><p>Under <strong>Conditions</strong>, define when the alert should trigger, e.g., if the error occurs more than <code>3</code> times within <code>5 minutes</code>.</p>
</li>
</ul>
<p><strong>3. Alert Settings:</strong></p>
<p>Now you’ll want to set up the alert evaluation interval and conditions for triggering the alert (e.g., if the query returns results above a certain threshold).</p>
<p><strong>Here’s an example:</strong> Trigger an alert if the number of errors exceeds 5 within 5 minutes:</p>
<pre><code class="lang-javascript">count_over_time({job=<span class="hljs-string">"ci_cd"</span>, level=<span class="hljs-string">"error"</span>} |= <span class="hljs-string">"build failed"</span>[<span class="hljs-number">5</span>m]) &gt; <span class="hljs-number">5</span>
</code></pre>
<p><strong>4. Set Alert Notifications:</strong></p>
<p>You can choose where alerts are sent, such as Slack, email, or PagerDuty. Grafana integrates with these systems to send real-time alerts to the right team members.</p>
<p><strong>Example alert query for test failures:</strong></p>
<pre><code class="lang-javascript">count_over_time({job=<span class="hljs-string">"ci_cd"</span>, stage=<span class="hljs-string">"test"</span>} |= <span class="hljs-string">"test failed"</span>[<span class="hljs-number">5</span>m]) &gt; <span class="hljs-number">3</span>
</code></pre>
<p>This query triggers an alert if more than 3 test failures are logged within the last 5 minutes.</p>
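<p>To see what the alert condition measures, here is a rough Node.js sketch of counting events inside a time window. It only approximates what <code>count_over_time</code> does and is not how Loki implements it. The function name and the sample timestamps are assumptions.</p>
<pre><code class="lang-javascript">// Counts events whose age falls inside the window that ends at nowMs.
function countInWindow(timestampsMs, nowMs, windowMs) {
  return timestampsMs.filter(function (t) {
    // Bucket 0 holds ages from 0 up to (but not including) windowMs.
    return Math.floor((nowMs - t) / windowMs) === 0;
  }).length;
}

// Hypothetical failure timestamps in milliseconds since the start of a run.
const failures = [0, 310000, 450000, 600000];
const fiveMinutesMs = 5 * 60 * 1000;

// The alert fires when this count exceeds the rule's threshold.
console.log(countInWindow(failures, 600000, fiveMinutesMs));
</code></pre>
<p>Grafana re-evaluates the query at each evaluation interval, so the window slides forward over time.</p>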
<h3 id="heading-kibana-query-language-deep-dive-for-cicd-contexts">Kibana Query Language Deep Dive for CI/CD Contexts</h3>
<p>Kibana Query Language (KQL) is a powerful tool for searching and filtering logs within Elasticsearch, and it becomes especially useful for debugging CI/CD pipelines.</p>
<h4 id="heading-basic-query-syntax">Basic Query Syntax:</h4>
<ul>
<li><p><strong>Field:</strong></p>
<pre><code class="lang-javascript">  fieldname:value
</code></pre>
<p>  Example: <code>status: "failure"</code></p>
</li>
<li><p><strong>Wildcard:</strong> Use <code>*</code> to match any number of characters. Leave the value unquoted, because KQL treats a quoted value as a literal phrase:</p>
<pre><code class="lang-javascript">  message: test*
</code></pre>
</li>
<li><p><strong>Range Queries:</strong> To search for logs within a specific time frame, use comparison operators on a date field:</p>
<pre><code class="lang-javascript">  @timestamp &gt;= <span class="hljs-string">"2023-05-01"</span> AND @timestamp &lt;= <span class="hljs-string">"2023-05-15"</span>
</code></pre>
</li>
<li><p><strong>Boolean Queries:</strong> Combine queries using <code>AND</code>, <code>OR</code>, and <code>NOT</code>:</p>
<pre><code class="lang-javascript">  status: <span class="hljs-string">"failure"</span> AND build_id: <span class="hljs-string">"12345"</span>
</code></pre>
</li>
</ul>
<h4 id="heading-time-based-queries">Time-Based Queries:</h4>
<p>Since CI/CD logs are often tied to time-sensitive operations (builds, deployments), KQL allows you to filter logs by time:</p>
<pre><code class="lang-javascript">@timestamp &gt;= now-1d
</code></pre>
<h4 id="heading-nested-queries-for-complex-pipelines">Nested Queries (For Complex Pipelines):</h4>
<p>CI/CD logs can have nested or multi-level structures (for example, logs within containers). You can query these nested fields:</p>
<pre><code class="lang-javascript">pipeline.logs.message: <span class="hljs-string">"build failed"</span>
</code></pre>
<h4 id="heading-aggregations-and-grouping">Aggregations and Grouping:</h4>
<p>KQL itself only filters documents. To identify trends or recurring issues, aggregate the filtered logs in a Kibana visualization, for example with a terms aggregation on a field:</p>
<pre><code class="lang-javascript">terms aggregation on <span class="hljs-string">"status"</span> field
</code></pre>
<p>This helps identify the most common failure statuses in your pipeline.</p>
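<p>Outside of Kibana's visualization editor, you can request the same grouping from Elasticsearch directly. The Node.js sketch below builds a search body with a terms aggregation that could be sent to the <code>_search</code> endpoint of the <code>ci-logs-*</code> indexes. The <code>status.keyword</code> field and the <code>by_value</code> aggregation name are assumptions about your mapping.</p>
<pre><code class="lang-javascript">// Builds an Elasticsearch search body that counts documents per field value.
// The "status.keyword" field name is an assumption about the index mapping.
function buildTermsAggregation(field, size) {
  return {
    size: 0, // Skips the matching documents and returns only the buckets.
    aggs: {
      by_value: {
        terms: { field: field, size: size },
      },
    },
  };
}

console.log(JSON.stringify(buildTermsAggregation("status.keyword", 5), null, 2));
</code></pre>
<p>Each bucket in the response carries a value and its document count, which is the same data a Kibana pie or bar chart would plot.</p>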
<h4 id="heading-field-specific-filtering">Field-Specific Filtering:</h4>
<p>When debugging specific components like a build tool or deployment step, you can filter by those component-specific fields:</p>
<pre><code class="lang-javascript">build_tool: <span class="hljs-string">"Jenkins"</span> AND status: <span class="hljs-string">"failure"</span>
</code></pre>
<h4 id="heading-creating-saved-searches-for-recurring-issues">Creating Saved Searches for Recurring Issues</h4>
<p>Once you’ve built queries that help you identify common issues in your CI/CD pipeline, you can save them in Kibana for future use.</p>
<p><strong>1. Create a Saved Search:</strong></p>
<p>Run your desired query in the Kibana Discover tab. Click on the “Save” button and give it a meaningful name, such as "Failed Builds - Last Week". You can add filters and customize the time range to match your typical issue patterns.</p>
<p><strong>2. Use Filters to Pinpoint Recurring Problems:</strong></p>
<p>Create saved searches that focus on specific recurring issues like:</p>
<ul>
<li><p>Build failures based on a specific tool or version.</p>
</li>
<li><p>Test failures within a particular module or set of tests.</p>
</li>
</ul>
<p>Example search for “flaky tests”:</p>
<pre><code class="lang-javascript">test_status: <span class="hljs-string">"failed"</span> AND error_message: *timeout*
</code></pre>
<p><strong>3. Saving Multiple Variations:</strong></p>
<p>You can save multiple variations of queries based on different error types or CI/CD tools:</p>
<ul>
<li><p><strong>Failed Jobs:</strong> <code>status: "failure"</code></p>
</li>
<li><p><strong>Test Failures in Build:</strong> <code>log_type: "test" AND status: "failure"</code></p>
</li>
<li><p><strong>Resource Constraints:</strong> <code>error_message: *memory*</code></p>
</li>
</ul>
<p>These saved searches will allow you to quickly troubleshoot specific issues that occur frequently.</p>
<h4 id="heading-building-visualizations-to-spot-patterns-over-time">Building Visualizations to Spot Patterns Over Time</h4>
<p>Once you have saved searches, Kibana allows you to create visualizations from your data, making it easier to spot trends, anomalies, or patterns over time.</p>
<p><strong>1. Create a Visualization:</strong></p>
<p>Go to the <strong>Visualize</strong> tab in Kibana. Select the appropriate visualization type. Common visualizations for debugging CI/CD pipelines include:</p>
<ul>
<li><p><strong>Line Chart:</strong> Track build failure rates over time.</p>
</li>
<li><p><strong>Bar Chart:</strong> Show the number of failures per CI tool or service.</p>
</li>
<li><p><strong>Pie Chart:</strong> Breakdown of failure reasons (for example, compilation errors, test failures, resource constraints).</p>
</li>
</ul>
<p><strong>2. Track Failure Trends Over Time:</strong></p>
<p>Create a line chart to track build failures over a given period:</p>
<ul>
<li><p><strong>X-Axis:</strong> Time (for example, daily or weekly).</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of build failures.</p>
</li>
<li><p><strong>Aggregation:</strong> Date histogram with <code>@timestamp</code> field.</p>
</li>
</ul>
<p>This will help you visualize how build failures are trending, making it easier to identify recurring issues or spikes in failures.</p>
<p><strong>3. Monitor Failure Types by CI Tool:</strong></p>
<p>Create a bar chart that shows the number of failures broken down by CI tool:</p>
<ul>
<li><p><strong>X-Axis:</strong> CI tool (Jenkins, GitHub Actions, GitLab, and so on).</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of failures.</p>
</li>
<li><p><strong>Aggregation:</strong> Terms aggregation on the <code>ci_tool</code> field.</p>
</li>
</ul>
<p>This visualization helps identify which CI tool is experiencing the most failures and focus troubleshooting efforts there.</p>
<p><strong>4. Visualize Error Messages by Frequency:</strong></p>
<p>You can visualize which error messages appear most frequently, helping you understand what might be causing recurring issues:</p>
<ul>
<li><p><strong>X-Axis:</strong> Error message type.</p>
</li>
<li><p><strong>Y-Axis:</strong> Count of occurrences.</p>
</li>
<li><p><strong>Aggregation:</strong> Terms aggregation on the <code>error_message</code> field.</p>
</li>
</ul>
<p><strong>5. Dashboard for Holistic Monitoring:</strong></p>
<p>Create a dashboard that brings together multiple visualizations. You can have one graph for failure trends, another for failure types (bar chart), and a pie chart showing the percentage of failures caused by different issues. This dashboard gives you a holistic view of your pipeline's health.</p>
<h4 id="heading-advanced-visualization-techniques">Advanced Visualization Techniques:</h4>
<p>There are various advanced techniques you can use to dig further into your data.</p>
<ul>
<li><p><strong>Heatmaps</strong>: Use heatmaps to spot time-based anomalies in build durations or test failures.</p>
</li>
<li><p><strong>Anomaly Detection</strong>: Kibana has built-in anomaly detection that can be applied to log data to automatically detect patterns that deviate from the norm. This is especially useful for catching rare or unexpected errors in your CI/CD pipeline.</p>
<p>  Illustrative settings for an anomaly detection job (a sketch of the choices involved, not literal Kibana syntax):</p>
<pre><code class="lang-javascript">  field: duration
  <span class="hljs-attr">aggregation</span>: average
  anomaly detection model: <span class="hljs-string">"baseline"</span>
</code></pre>
</li>
</ul>
<h2 id="heading-how-to-set-up-prometheus-metrics-alongside-your-logs">How to Set Up Prometheus Metrics Alongside Your Logs</h2>
<p>To fully understand your CI/CD pipeline's health and performance, combining metrics and logs is essential. Prometheus is an excellent tool for capturing time-series metrics, and it works seamlessly with Grafana and Loki (or any log aggregation system).</p>
<h3 id="heading-how-to-set-up-prometheus-for-cicd-metrics-collection"><strong>How to Set Up Prometheus for CI/CD Metrics Collection:</strong></h3>
<h4 id="heading-1-install-prometheus">1. Install Prometheus:</h4>
<p>You can install Prometheus using Docker or Kubernetes for easy deployment.</p>
<p>For Docker-based installation:</p>
<pre><code class="lang-bash">docker run -d -p 9090:9090 --name prometheus prom/prometheus
</code></pre>
<h4 id="heading-2-configure-prometheus-to-scrape-metrics"><strong>2. Configure Prometheus to Scrape Metrics:</strong></h4>
<p>Prometheus needs to be configured to scrape metrics from your CI/CD services.</p>
<p>Edit the <code>prometheus.yml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">'ci_cd_metrics'</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost:8080'</span>, <span class="hljs-string">'localhost:9091'</span>]
</code></pre>
<h4 id="heading-3-instrument-your-cicd-services">3. Instrument Your CI/CD Services:</h4>
<p>To expose metrics, you need to integrate Prometheus client libraries into your CI/CD services.</p>
<p>For example, to expose build metrics from a Jenkins job, use the <a target="_blank" href="https://plugins.jenkins.io/prometheus/">Prometheus plugin for Jenkins</a>. GitHub Actions jobs are short-lived, so a common approach there is to push job metrics to a Pushgateway that <a target="_blank" href="https://github.com/prometheus/prometheus">Prometheus</a> then scrapes.</p>
<h4 id="heading-4-expose-metrics-endpoint"><strong>4. Expose Metrics Endpoint:</strong></h4>
<p>You’ll want to make sure your services expose a <code>/metrics</code> endpoint that Prometheus can scrape. For example, use Prometheus client libraries in your application to expose this endpoint.</p>
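<p>To make the shape of that endpoint concrete, here is a minimal sketch that renders a counter in the Prometheus text exposition format and serves it on <code>/metrics</code> using only Node.js built-ins. In practice a client library such as <code>prom-client</code> handles this for you. The metric name, label, and helper names are hypothetical, and label values are not escaped in this simplified version:</p>
<pre><code class="lang-javascript">const http = require('http');

// Renders one counter in the Prometheus text exposition format.
// Simplified: label values are assumed not to need escaping.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const labels = Object.entries(sample.labels)
      .map(function (pair) { return `${pair[0]}="${pair[1]}"`; })
      .join(',');
    lines.push(`${name}{${labels}} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}

// Serves the rendered metrics on /metrics, the path Prometheus scrapes by default.
function createMetricsServer(getSamples) {
  return http.createServer(function (req, res) {
    if (req.url !== '/metrics') {
      res.writeHead(404);
      res.end();
      return;
    }
    res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4' });
    res.end(renderCounter('ci_cd_builds_total', 'Total number of builds', getSamples()));
  });
}

// Usage: createMetricsServer(getSamples).listen(8080);
</code></pre>
<p>Listening on port 8080 would match the first target in the <code>scrape_configs</code> example above.</p>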
<h4 id="heading-troubleshooting-prometheus-setup-issues">Troubleshooting Prometheus Setup Issues</h4>
<p>If Prometheus fails to start or scrape metrics, here are some things that might be going wrong:</p>
<ol>
<li><p><strong>Container Crashes</strong>: Check logs with <code>docker logs prometheus</code>. Look for errors like “port already in use” (for example, 9090) or configuration parsing issues.</p>
<ul>
<li>Fix: Change the port in <code>docker run</code> (for example, <code>-p 9091:9090</code>) or correct the <code>prometheus.yml</code> file syntax.</li>
</ul>
</li>
<li><p><strong>Metrics Not Scraped</strong>: Check target health on the Prometheus targets page (open <code>http://localhost:9090/targets</code> in a browser) or through the HTTP API with <code>curl http://localhost:9090/api/v1/targets</code>. Check <code>prometheus.yml</code> for correct endpoints.</p>
<ul>
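<li>Fix: Update <code>targets</code> in <code>scrape_configs</code> (for example, <code>localhost:8080</code>) to match your CI/CD service’s metrics endpoint. When Prometheus runs in Docker, <code>localhost</code> refers to the Prometheus container itself, so use an address the container can actually reach.</li>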
</ul>
</li>
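<li><p><strong>Resource Constraints</strong>: Monitor usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: As a rough starting point, ensure at least 4GB RAM and 10GB disk space. If resources are still tight, reduce storage retention (the <code>--storage.tsdb.retention.time</code> startup flag) or scrape less often by raising <code>scrape_interval</code> in <code>prometheus.yml</code>.</li>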
</ul>
</li>
</ol>
<h2 id="heading-how-to-create-grafana-dashboards-that-combine-metrics-and-logs">How to Create Grafana Dashboards That Combine Metrics and Logs</h2>
<p>Once Prometheus is collecting metrics, the next step is to visualize and correlate them in Grafana.</p>
<h3 id="heading-how-to-integrate-prometheus-with-grafana"><strong>How to Integrate Prometheus with Grafana:</strong></h3>
<p>First, you’ll need to install Grafana. You can use Docker or Kubernetes for quick deployment:</p>
<pre><code class="lang-bash">docker run -d -p 3000:3000 --name grafana grafana/grafana
</code></pre>
<p>Next, configure Grafana to use Prometheus as a data source. To do this, log in to Grafana (<code>localhost:3000</code> by default). Go to <code>Configuration</code> &gt; <code>Data Sources</code> &gt; <code>Add Data Source</code> &gt; Choose <code>Prometheus</code>. Enter your Prometheus server URL (for example, <code>http://localhost:9090</code>) and click <code>Save &amp; Test</code>.</p>
<p>Now it’s time to build a unified dashboard. To do this, create a new dashboard in Grafana that combines both logs (Loki) and metrics (Prometheus).</p>
<p>Add a panel with Prometheus data queries to visualize pipeline metrics like build success rate, deployment duration, and failure count. Use the <code>Time series</code> visualization type (called <code>Graph</code> in older Grafana versions) for time-series data and <code>Stat</code> for quick summary metrics.</p>
<p>Finally, in the same Grafana dashboard, add panels for logs (from Loki or any other logging system). Use the <code>Logs</code> panel to visualize log data and link them with the relevant Prometheus metrics by using time-based correlations.</p>
<p><strong>Example</strong>: If a spike in CPU usage is detected (Prometheus metric), the logs panel could show related logs, like errors or failed build jobs.</p>
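<p>Time-based correlation comes down to querying both systems for the same window. The sketch below builds matching range queries for the Prometheus and Loki HTTP APIs around the moment of a spike. The base URLs, the PromQL expression, and the <code>{job="ci"}</code> log selector are assumptions for illustration, so substitute your own:</p>
<pre><code class="lang-javascript">// Builds query URLs covering the same time window around a metric spike,
// so the metric and the logs can be compared side by side.
// Base URLs, the PromQL expression, and the LogQL selector are hypothetical.
function buildCorrelatedQueries(spikeUnixSeconds, windowSeconds) {
  const start = spikeUnixSeconds - windowSeconds;
  const end = spikeUnixSeconds + windowSeconds;

  const prom = new URL('http://localhost:9090/api/v1/query_range');
  prom.searchParams.set('query', 'sum(rate(ci_cd_failed_builds_total[5m]))');
  prom.searchParams.set('start', String(start));
  prom.searchParams.set('end', String(end));
  prom.searchParams.set('step', '30');

  const loki = new URL('http://localhost:3100/loki/api/v1/query_range');
  loki.searchParams.set('query', '{job="ci"} |= "error"');
  // Loki accepts nanosecond Unix timestamps, so pad the seconds with nine zeros.
  loki.searchParams.set('start', `${start}000000000`);
  loki.searchParams.set('end', `${end}000000000`);

  return { prometheus: prom.toString(), loki: loki.toString() };
}

console.log(buildCorrelatedQueries(1700000000, 300));
</code></pre>
<p>Grafana does the equivalent automatically when two panels share the dashboard time range, which is why zooming into a spike on the metrics panel also narrows the logs panel.</p>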
<h2 id="heading-how-to-use-exemplars-to-jump-from-metrics-to-relevant-logs">How to Use Exemplars to Jump from Metrics to Relevant Logs</h2>
<p>Exemplars are an advanced feature in Prometheus that allow you to connect metric data with logs and traces. Grafana supports this feature, and it can be incredibly helpful when investigating issues.</p>
<h3 id="heading-how-to-set-up-exemplars-in-prometheus">How to Set Up Exemplars in Prometheus:</h3>
<p><strong>1. Enable Exemplars in Your Application:</strong></p>
<p>Exemplars are references to traces (typically a trace ID) attached to individual metric samples. To use them, you’ll need to make sure your application is instrumented to send exemplar data alongside your metrics.</p>
<p>Many libraries support adding exemplars to Prometheus metrics, such as <code>prom-client</code> (Node.js) and <code>prometheus-net</code> (C#).</p>
<p>Here’s an example in Node.js:</p>
<pre><code class="lang-javascript">// Demonstrates adding an exemplar to a Prometheus metric for linking to logs or traces.
// Assumes a prom-client release with exemplar support.
const promClient = require('prom-client');

// Exemplars are only exposed in the OpenMetrics format, so switch the registry to it first.
promClient.register.setContentType(promClient.Registry.OPENMETRICS_CONTENT_TYPE);

// Creates a counter metric to track failed CI/CD builds.
const counter = new promClient.Counter({
  name: 'ci_cd_failed_builds_total',  // Metric name for failed builds.
  help: 'Total number of failed builds',  // Description of the metric.
  enableExemplars: true,  // Allows exemplar labels on this metric.
});

// Increments the counter and attaches an exemplar.
// The traceId value is a placeholder; use the ID of the trace or build you want to link to.
counter.inc({ labels: {}, value: 1, exemplarLabels: { traceId: 'build-12345-trace' } });
</code></pre>
<p><strong>2. Enable Exemplars in Prometheus Config:</strong></p>
<p>Make sure your Prometheus server is configured to store and expose exemplars by starting it with the <code>--enable-feature=exemplar-storage</code> flag. Exemplars are typically attached to counters and histogram buckets, so make sure those metrics are instrumented accordingly.</p>
<p><strong>3. Visualizing Exemplars in Grafana:</strong></p>
<p>In Grafana, when you query Prometheus for metrics with exemplars, Grafana will show the linked logs or traces when you hover over a metric.</p>
<p>Use the <code>Exemplar</code> option in Grafana panels to quickly access logs from specific metrics.</p>
<p>For example, if you have a <code>ci_cd_failed_builds_total</code> metric and you detect a failure in your pipeline, you can click on the exemplar point for that metric in Grafana and jump straight to the logs or trace for that specific failure.</p>
<h2 id="heading-how-to-diagnose-and-fix-common-cicd-problems">How to Diagnose and Fix Common CI/CD Problems</h2>
<p>CI/CD pipelines often encounter issues like build failures, dependency problems, and flaky tests that can disrupt development workflows. This section provides practical strategies to diagnose and resolve these common problems using log analysis and systematic debugging techniques, helping you restore pipeline stability quickly.</p>
<h3 id="heading-strategy-1-systematically-debug-build-failures"><strong>Strategy 1: Systematically Debug Build Failures</strong></h3>
<p>Build failures are a frequent CI/CD challenge, often stemming from errors in code, tests, or configurations. Systematically debugging these issues involves analyzing logs to pinpoint root causes, using the following approaches.</p>
<h4 id="heading-identifying-patterns-in-compiler-and-test-output">Identifying Patterns in Compiler and Test Output</h4>
<p>When debugging build failures, you need to first examine the logs from the compiler and test outputs. Let’s go over some key strategies.</p>
<h4 id="heading-1-check-for-specific-error-messages">1. Check for Specific Error Messages:</h4>
<p>There are a few common types of error messages you might get. They are:</p>
<ul>
<li><p><strong>Syntax errors</strong>: Look for lines indicating that there's a mismatch in syntax, such as missing semicolons, undeclared variables, or incorrect function calls.</p>
</li>
<li><p><strong>Linker errors</strong>: These often occur when the required libraries or dependencies are not found. You'll typically see errors like <code>undefined reference</code> or <code>symbol not found</code>.</p>
</li>
<li><p><strong>Build tool errors</strong>: If you are using build systems like Maven, Gradle, or MSBuild, their logs will give specific error codes or missing configurations.</p>
</li>
</ul>
<h4 id="heading-2-look-for-common-error-patterns">2. Look for Common Error Patterns:</h4>
<p>Often, failed builds repeat the same error or pattern across multiple runs. Check logs for recurring terms or errors that point to specific modules or functions. And remember that grouping similar issues can help you identify the root cause faster.</p>
<h4 id="heading-3-use-regular-expressions-for-log-filtering">3. Use Regular Expressions for Log Filtering:</h4>
<p>You can use regular expressions to search for keywords in the logs that match common failure patterns (for example, "error", "failed", "exception", "out of memory"). This will help you filter out unrelated messages and focus on the failures.</p>
<p><strong>As an example:</strong></p>
<ul>
<li><p>If the build fails with an "Out of Memory" error, search for any memory allocation issues or settings that can be increased.</p>
</li>
<li><p>If test failures are related to specific modules, inspect those modules for recent changes or dependency issues.</p>
</li>
</ul>
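<p>The filtering and grouping steps above can be sketched in a few lines of Node.js. The pattern list and the sample log lines are made up for illustration, so adjust them to the output of your own build tools:</p>
<pre><code class="lang-javascript">// Patterns that commonly indicate a failure; extend this list for your own tools.
const FAILURE_PATTERN = /error|failed|exception|out of memory/i;

// Keeps only the lines that match a failure pattern.
function filterFailures(logText) {
  return logText.split('\n').filter(function (line) {
    return FAILURE_PATTERN.test(line);
  });
}

// Groups matching lines by message, replacing digits so that lines differing
// only in IDs or timings are counted together.
function countRecurring(lines) {
  const counts = new Map();
  for (const line of lines) {
    const key = line.replace(/\d+/g, 'N').trim();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Hypothetical build output used as sample input.
const sampleLog = [
  'Compiling module auth',
  'ERROR: test login_spec timed out after 30s',
  'Build step 4 finished',
  'ERROR: test login_spec timed out after 45s',
  'java.lang.OutOfMemoryError: Java heap space',
].join('\n');

console.log(countRecurring(filterFailures(sampleLog)));
</code></pre>
<p>Normalizing the digits is a deliberately crude way to group similar messages. It is usually enough to surface the error that keeps coming back across runs.</p>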
<h3 id="heading-strategy-2-troubleshooting-dependency-issues-with-log-analysis">Strategy 2: Troubleshooting Dependency Issues with Log Analysis</h3>
<p>Dependency issues are common in build failures, especially in complex CI/CD pipelines with multiple modules or services. To resolve these issues, consider the following:</p>
<p><strong>1. Check for Missing or Outdated Dependencies</strong>:</p>
<p>Start by reviewing the build tool’s output to check for messages related to missing dependencies (for example, <code>dependency not found</code>, <code>version conflict</code>).</p>
<p>Many build tools (like Maven, npm, or .NET) will include specific error messages when a dependency is missing or incompatible.</p>
<p><strong>2. Inspect Dependency Resolution Logs</strong>:</p>
<p>Some build tools provide detailed logs showing how dependencies were resolved (for example, the version of a library that was used). These logs can show you if there’s a version mismatch.</p>
<p>Make sure that your <code>package.json</code> (for JavaScript projects), <code>pom.xml</code> (for Java), or <code>csproj</code> (for C#) files are correctly defined with compatible versions.</p>
<p><strong>3. Verify Network Connectivity</strong>:</p>
<p>CI/CD tools sometimes fail to fetch dependencies due to network issues (for example, proxy settings, repository access). Look for any errors indicating that a repository couldn’t be reached.</p>
<p><strong>4. Log Example:</strong></p>
<p>If a Java project fails with <code>Could not find artifact</code>, a dependency is likely missing or inaccessible. Check the repository URL and whether the artifact exists in your Maven repository.</p>
<p><strong>5. Resolve Version Conflicts</strong>:</p>
<p>Version conflicts occur when different dependencies require incompatible versions of the same library. This is especially common in Java (with Maven/Gradle) and .NET projects. Consider using tools to resolve version conflicts automatically or define compatible versions manually.</p>
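<p>If your build tool can export its resolved dependencies, spotting conflicts is a matter of grouping by library name. Here is a minimal sketch of that idea. The input shape (one object per resolved dependency with <code>name</code>, <code>version</code>, and <code>requestedBy</code>) is an assumption, not the output format of any particular tool:</p>
<pre><code class="lang-javascript">// Finds libraries that are requested in more than one version.
// Input shape is hypothetical: one entry per resolved dependency.
function findVersionConflicts(dependencies) {
  const versionsByName = new Map();
  for (const dep of dependencies) {
    if (!versionsByName.has(dep.name)) {
      versionsByName.set(dep.name, new Set());
    }
    versionsByName.get(dep.name).add(dep.version);
  }
  const conflicts = {};
  for (const [name, versions] of versionsByName) {
    if (versions.size !== 1) {
      conflicts[name] = Array.from(versions).sort();
    }
  }
  return conflicts;
}

// Made-up dependency list for illustration.
const resolvedDependencies = [
  { name: 'libxyz', version: '1.2.0', requestedBy: 'service-a' },
  { name: 'libxyz', version: '2.0.1', requestedBy: 'service-b' },
  { name: 'logger', version: '3.1.4', requestedBy: 'service-a' },
];

console.log(findVersionConflicts(resolvedDependencies));
</code></pre>
<p>This only reports that more than one version is in play. Deciding whether the versions are actually incompatible still depends on the library’s versioning policy.</p>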
<h3 id="heading-fixing-flaky-tests-based-on-historical-log-data">Fixing Flaky Tests Based on Historical Log Data</h3>
<p><strong>Note:</strong> Issues like container crashes, logs not ingested, or resource constraints here may resemble those in other sections. These are common across CI/CD services and processes, but each section offers unique context to avoid redundancy.</p>
<p>Flaky tests – that is, those that pass sometimes and fail at other times – are common in CI/CD pipelines, and they can be frustrating. Let’s discuss some strategies for how you can tackle them:</p>
<p><strong>1. Analyze Test Logs Over Time</strong>:</p>
<p>Review historical logs to identify patterns in when the test fails. Look for timing issues, resource limits, or external dependencies that could affect test reliability.</p>
<p>For example, if a test intermittently fails after a certain amount of time or only during specific pipeline stages, it could indicate resource exhaustion or race conditions.</p>
<p><strong>2. Check Test Dependencies</strong>:</p>
<p>Often, flaky tests are dependent on external services or resources (for example, databases, APIs, file systems). Check if these services are consistently available and properly mocked during test execution.</p>
<p>Logs that mention failed connections to external services or unstable environments can give you insights into potential issues with dependencies.</p>
<p><strong>3. Run Tests with Increased Logging</strong>:</p>
<p>Increase the verbosity of test logs to capture more information about the failures. This can help you detect why tests fail in certain conditions.</p>
<p>For example, adding debug logs inside tests can provide more context on the state of the application when the failure occurs.</p>
<p><strong>4. Time of Day Issues</strong>:</p>
<p>Some flaky tests may fail during peak usage times, especially if they rely on shared resources. Look for patterns that correlate with resource contention (for example, database locks, API rate limits).</p>
<p>Logs showing high CPU or memory usage can indicate that resource constraints are affecting the stability of your tests.</p>
<p><strong>5. Implement Retry Logic for Flaky Tests</strong>:</p>
<p>To mitigate the effects of flaky tests, implement automatic retries for tests that fail intermittently. This can help reduce the noise in your CI/CD pipeline while you investigate the root causes.</p>
<p>For example, if a database connection test fails intermittently, you may want to inspect database logs for signs of timeouts or connection pool exhaustion.</p>
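<p>Most test runners have their own retry option, but the underlying idea is simple enough to sketch. The wrapper below retries a synchronous test function and marks it as flaky when it only passes after a retry, so the failure is still recorded instead of being hidden. The function names and the simulated test are hypothetical:</p>
<pre><code class="lang-javascript">// Runs a test function up to maxAttempts times and reports how it behaved.
// A test that passes only after a retry is marked flaky so it still gets investigated.
function runWithRetries(testFn, maxAttempts) {
  const errors = [];
  for (let attempt = 1; attempt !== maxAttempts + 1; attempt += 1) {
    try {
      testFn();
      return { passed: true, attempts: attempt, flaky: attempt !== 1, errors: errors };
    } catch (err) {
      errors.push(err.message);
    }
  }
  return { passed: false, attempts: maxAttempts, flaky: false, errors: errors };
}

// Simulated flaky test: fails on the first call, passes on the second.
let simulatedCalls = 0;
function flakyDatabaseTest() {
  simulatedCalls += 1;
  if (simulatedCalls === 1) {
    throw new Error('connection timeout');
  }
}

console.log(runWithRetries(flakyDatabaseTest, 3));
</code></pre>
<p>Logging the collected <code>errors</code> for every flaky result gives you the historical data needed to find the root cause later.</p>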
<h3 id="heading-how-to-resolve-deployment-pipeline-failures">How to Resolve Deployment Pipeline Failures</h3>
<p>Deployment pipeline failures can stem from several sources, and diagnosing them requires a systematic approach using logs and available observability tools. Below, we will outline the common patterns in logs that indicate resource constraints, permission/authentication issues, and configuration drift between environments.</p>
<p><strong>Log Patterns That Indicate Resource Constraints</strong></p>
<p>Resource constraints are a common cause of pipeline failures. These can include CPU limits, memory usage, or disk space running out. Here's how to recognize these patterns:</p>
<h4 id="heading-key-indicators-in-logs">Key Indicators in Logs:</h4>
<ul>
<li><strong>Memory Issues</strong>: Look for messages like <em>"out of memory"</em>, <em>"memory limit exceeded"</em>, or <em>"OOM killed"</em> in your logs. Here’s an example in Kubernetes logs:</li>
</ul>
<pre><code class="lang-javascript">pod has been OOMKilled
</code></pre>
<ul>
<li><strong>CPU Limits</strong>: Watch for logs showing that a process exceeded CPU limits or was throttled. Here’s an example:</li>
</ul>
<pre><code class="lang-javascript">process <span class="hljs-string">'foo'</span> hit CPU limit, throttling at <span class="hljs-number">100</span>%
</code></pre>
<ul>
<li><strong>Disk Space</strong>: Logs may show file write errors or messages about a disk being full. Here’s an example:</li>
</ul>
<pre><code class="lang-javascript">Unable to write to file, disk space is full.
</code></pre>
<p>You can resolve the memory issues by increasing the allocated memory for your containers, VM, or cloud instances.</p>
<p>You can resolve the CPU issues by adjusting CPU limits or scaling your infrastructure to add more resources.</p>
<p>And finally, you can resolve disk space issues by cleaning up unused files or increasing disk capacity on the server/container.</p>
<p><strong>Identify Permission and Authentication Issues</strong></p>
<p>Permission and authentication issues often result in pipeline failures due to a lack of access to necessary resources or services. These issues might occur when you’re trying to access databases, deploy to cloud services, or authenticate third-party APIs.</p>
<p>There are some key indicators in the logs that you can look out for:</p>
<h4 id="heading-1-authentication-failures">1. Authentication Failures:</h4>
<p>Look for messages related to failed logins, incorrect credentials, or invalid tokens.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript">Authentication failed <span class="hljs-keyword">for</span> user <span class="hljs-string">'admin'</span>
</code></pre>
<pre><code class="lang-javascript">Invalid API token provided.
</code></pre>
<h4 id="heading-2-permission-denied">2. Permission Denied:</h4>
<p>Logs may indicate that the CI/CD pipeline lacks the permissions to perform a certain action.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript">Access denied <span class="hljs-keyword">for</span> /path/to/deployment/target
</code></pre>
<pre><code class="lang-javascript">Unauthorized request to cloud service.
</code></pre>
<p><strong>How to resolve these errors</strong>:</p>
<ul>
<li><p><strong>Credentials</strong>: Ensure the credentials (API keys, access tokens, SSH keys) used in the pipeline are up-to-date and correctly configured.</p>
</li>
<li><p><strong>Permissions</strong>: Review and update the role-based access control (RBAC) settings for the service account running the pipeline to ensure it has the necessary permissions.</p>
</li>
<li><p><strong>Secrets Management</strong>: Use tools like Vault, AWS Secrets Manager, or Azure Key Vault to securely manage secrets and credentials.</p>
</li>
</ul>
<p><strong>Troubleshooting Configuration Drift Between Environments</strong></p>
<p>Configuration drift occurs when different environments (like development, staging, production) are not synchronized. This can lead to inconsistent behavior during deployments, and often results in failures in one environment but not in others.</p>
<p>Look out for these key indicators in the logs:</p>
<h4 id="heading-1-mismatch-in-environment-variables">1. Mismatch in Environment Variables:</h4>
<p>If you’re using environment variables, check for discrepancies across different stages. For example:</p>
<pre><code class="lang-javascript">Environment variable DATABASE_URL not found <span class="hljs-keyword">in</span> production
</code></pre>
<h4 id="heading-2-dependency-versions">2. Dependency Versions:</h4>
<p>Mismatched versions of dependencies between environments can cause unexpected issues.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">Error</span>: Dependency <span class="hljs-string">'libxyz'</span> version mismatch between environments
</code></pre>
<h4 id="heading-3-service-configuration">3. Service Configuration:</h4>
<p>Look for configuration-related errors that might not be present in a development environment but occur in production.</p>
<p>Here’s an example:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">Error</span>: Invalid config <span class="hljs-keyword">in</span> <span class="hljs-string">'production-config.yaml'</span>
</code></pre>
<p><strong>How to resolve these errors</strong>:</p>
<ul>
<li><p><strong>Use Infrastructure as Code (IaC)</strong>: Tools like Terraform, Ansible, or CloudFormation can help ensure that environments are provisioned consistently.</p>
</li>
<li><p><strong>Automated Configuration Management</strong>: Use CI/CD pipeline steps to automate environment setup to avoid manual changes that can cause drift.</p>
</li>
<li><p><strong>Environment Consistency Checks</strong>: Implement checks to compare configurations and dependencies across environments before deployment.</p>
<ul>
<li>Example: You can add a pre-deployment stage to run a script that compares environment variables, configurations, and dependency versions between staging and production.</li>
</ul>
</li>
<li><p><strong>Configuration Management Tools</strong>: Use configuration management tools like Chef, Puppet, or SaltStack to maintain consistent configurations across environments.</p>
</li>
</ul>
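<p>The pre-deployment consistency check mentioned above could look something like this sketch. It compares two plain objects of environment variables and reports what is missing or different. How the values are loaded for each environment is left out, and the variable names are examples only:</p>
<pre><code class="lang-javascript">// Compares two sets of environment variables and reports drift.
// Both inputs are plain objects; how you load them is up to your pipeline.
function diffEnvironments(staging, production) {
  const missingInProduction = [];
  const missingInStaging = [];
  const different = [];
  for (const key of Object.keys(staging)) {
    if (!(key in production)) {
      missingInProduction.push(key);
    } else if (staging[key] !== production[key]) {
      different.push(key);
    }
  }
  for (const key of Object.keys(production)) {
    if (!(key in staging)) {
      missingInStaging.push(key);
    }
  }
  return { missingInProduction, missingInStaging, different };
}

// Example values are made up for illustration.
const drift = diffEnvironments(
  { DATABASE_URL: 'postgres://staging-db', LIBXYZ_VERSION: '1.2.0', LOG_LEVEL: 'debug' },
  { LIBXYZ_VERSION: '2.0.1', LOG_LEVEL: 'debug' }
);
console.log(drift);
</code></pre>
<p>Some values, such as connection strings, are expected to differ between environments, so a real check would usually compare only an agreed list of keys and fail the stage when a required key is missing.</p>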
<h3 id="heading-how-to-debug-container-based-deployment-issues">How to Debug Container-Based Deployment Issues</h3>
<p>Debugging container-based deployment issues requires specialized tools and techniques to trace errors in containerized environments. Below are strategies to efficiently collect logs, diagnose failures, and use ephemeral containers for investigation.</p>
<h4 id="heading-collecting-and-analyzing-container-logs-effectively">Collecting and Analyzing Container Logs Effectively</h4>
<p>Container logs are essential for troubleshooting issues, and effective collection and analysis can significantly speed up the debugging process.</p>
<p>Here’s how you can collect container logs:</p>
<p><strong>1. Docker Logs:</strong></p>
<p>You can use Docker’s <code>logs</code> command to view logs of a specific container:</p>
<pre><code class="lang-bash">docker logs &lt;container_name_or_id&gt;
</code></pre>
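<p>The <code>docker logs</code> command reads from the container’s logging driver. With the default <code>json-file</code> driver, logs are stored locally. With a remote driver such as <code>fluentd</code>, ensure that logs are being shipped to a location you can access.</p>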
<p><strong>2. Kubernetes Logs:</strong></p>
<p>For Kubernetes-managed containers, use <code>kubectl</code> to access pod logs:</p>
<pre><code class="lang-bash">kubectl logs &lt;pod_name&gt;
</code></pre>
<p>To view logs for all containers in a pod:</p>
<pre><code class="lang-bash">kubectl logs &lt;pod_name&gt; --all-containers=<span class="hljs-literal">true</span>
</code></pre>
<p><strong>3. Log Aggregation:</strong></p>
<p>You can integrate with centralized logging systems (like <strong>Grafana Loki</strong> or the <strong>Elastic Stack</strong>). You can also use Fluentd or Logstash as log shippers for forwarding logs from containers to a logging backend.</p>
<h4 id="heading-analyzing-logs">Analyzing Logs:</h4>
<p><strong>1. Filter and Search Logs:</strong></p>
<p>Use <code>grep</code> to filter logs for specific error messages or patterns:</p>
<pre><code class="lang-bash">docker logs &lt;container_name&gt; | grep <span class="hljs-string">"ERROR"</span>
</code></pre>
<p>In Kubernetes, you can combine <code>kubectl</code> with <code>grep</code> or other tools for advanced filtering.</p>
<p><strong>2. Log Contextualization:</strong></p>
<p>Include metadata in your logs (for example, container ID, environment, timestamps) for easier debugging. Ensure logs are structured in formats like JSON to allow for better querying and filtering.</p>
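<p>As a small illustration of structured logging, the sketch below emits each entry as one JSON line with metadata attached. The field names and the <code>APP_ENV</code> variable are assumptions. <code>HOSTNAME</code> defaults to the short container ID in Docker and to the pod name in Kubernetes:</p>
<pre><code class="lang-javascript">// Formats one log entry as a single JSON line with debugging metadata.
// Field names are illustrative; APP_ENV is a hypothetical variable.
function formatLogEntry(level, message, context) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level: level,
    message: message,
    container_id: process.env.HOSTNAME || 'unknown',
    environment: process.env.APP_ENV || 'development',
    ...context,
  });
}

console.log(formatLogEntry('error', 'build failed', { build_id: '12345', stage: 'compile' }));
</code></pre>
<p>Because every line is valid JSON, log shippers can index the fields directly, and queries like <code>status: "failure" AND build_id: "12345"</code> work without extra parsing rules.</p>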
<h3 id="heading-how-to-diagnose-image-pull-and-networking-failures">How to Diagnose Image Pull and Networking Failures</h3>
<p>Container deployment failures often stem from issues related to image pulling or network connectivity. Here’s how to troubleshoot these problems:</p>
<h4 id="heading-image-pull-failures">Image Pull Failures:</h4>
<p>There are some common issues you might see, such as:</p>
<ul>
<li><p><strong>Authentication failures:</strong> If the container registry requires authentication, ensure your credentials (username/password or tokens) are correct.</p>
</li>
<li><p><strong>Network connectivity:</strong> Check if the container can access the registry endpoint. Often, firewalls or DNS issues block the image pull.</p>
</li>
<li><p><strong>Image not found:</strong> Verify the image name and tag are correct. Use <code>docker pull</code> to manually pull the image to see if the issue is specific to the deployment process.</p>
</li>
</ul>
<p>There are various ways to diagnose them:</p>
<p>For <strong>Docker</strong>, use:</p>
<pre><code class="lang-bash">docker pull &lt;image_name&gt;
</code></pre>
<p>This will output the specific error message if the image pull fails.</p>
<p>For <strong>Kubernetes</strong>, check the event logs for the pod:</p>
<pre><code class="lang-bash">kubectl describe pod &lt;pod_name&gt;
</code></pre>
<p>Look for events with the reason <code>Failed</code> under "Events" (the pod status is typically <code>ErrImagePull</code> or <code>ImagePullBackOff</code>) for information about why the image pull failed (for example, wrong credentials or tag). If the issue is with the registry authentication, configure the Kubernetes <strong>imagePullSecrets</strong> or Docker's credentials to ensure the correct access.</p>
<h4 id="heading-networking-failures">Networking Failures:</h4>
<p>Some common issues you may encounter are:</p>
<ul>
<li><p><strong>DNS resolution problems:</strong> Containers may fail to resolve hostnames if DNS configurations are incorrect.</p>
</li>
<li><p><strong>Network policies and firewall rules:</strong> Network policies or firewalls may block necessary ports.</p>
</li>
<li><p><strong>Inter-container communication:</strong> If containers need to talk to each other, ensure they’re on the same network or subnet.</p>
</li>
</ul>
<p>Again, there are various ways to diagnose these issues:</p>
<p><strong>For Docker networking:</strong></p>
<p>You can do this to view all Docker networks:</p>
<pre><code class="lang-bash">docker network ls
</code></pre>
<p>You can also inspect the network of your container like this:</p>
<pre><code class="lang-bash">docker network inspect &lt;network_name&gt;
</code></pre>
<p>Check if the container is correctly attached to the network and if necessary ports are exposed.</p>
<p><strong>For Kubernetes Networking:</strong></p>
<p>You can use <code>kubectl</code> to check network policies:</p>
<pre><code class="lang-bash">kubectl get networkpolicies
</code></pre>
<p>You can also check the pod’s network settings like this:</p>
<pre><code class="lang-bash">kubectl describe pod &lt;pod_name&gt; | grep -i <span class="hljs-string">"Network"</span>
</code></pre>
<p><strong>Testing Connectivity Inside Containers:</strong></p>
<p>For Docker, exec into the container and test:</p>
<pre><code class="lang-bash">docker <span class="hljs-built_in">exec</span> -it &lt;container_id&gt; /bin/bash
ping &lt;hostname_or_ip&gt;
curl http://&lt;service_address&gt;:&lt;port&gt;
</code></pre>
<p>In Kubernetes, use <code>kubectl exec</code> to access the pod and test connectivity:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it &lt;pod_name&gt; -- /bin/bash
</code></pre>
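<p>Some minimal images ship without <code>ping</code> or <code>curl</code>. If the container has a Python interpreter, a small TCP check can stand in for them. The sketch below is one possible approach, not part of the original workflow; the function name <code>can_connect</code> and its default timeout are assumptions made for illustration.</p>

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the hostname and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

<p>Calling <code>can_connect("my-service", 8080)</code> (a hypothetical service name and port) from inside the container tells you whether DNS resolution and the TCP path both work, which narrows the problem to the network layer or the application itself.</p>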
<h3 id="heading-how-to-use-ephemeral-debug-containers-for-investigation">How to Use Ephemeral Debug Containers for Investigation</h3>
<p>Ephemeral debug containers are short-lived containers that help investigate issues in a running environment without altering the main application container.</p>
<h4 id="heading-what-are-ephemeral-debug-containers">What are Ephemeral Debug Containers?</h4>
<p>Ephemeral debug containers allow you to run diagnostic commands (like shell access, <code>ping</code>, or <code>curl</code>) in the same network environment as the failing application container, without modifying the application itself.</p>
<h4 id="heading-how-to-set-up-ephemeral-containers-in-docker">How to Set Up Ephemeral Containers in Docker:</h4>
<p><strong>1. Use the <code>docker run</code> Command:</strong></p>
<p>You can create a new container for debugging by running a container with the same network settings as the failing container:</p>
<pre><code class="lang-bash">docker run -it --network container:&lt;container_name_or_id&gt; --entrypoint /bin/bash &lt;debug_image&gt;
</code></pre>
<p>This command runs an interactive shell inside the debug container using the same network as the target container.</p>
<h4 id="heading-ephemeral-containers-in-kubernetes">Ephemeral Containers in Kubernetes:</h4>
<p>Kubernetes allows you to inject an ephemeral debug container into a running pod. You can add a temporary debug container to your pod using the following command:</p>
<pre><code class="lang-bash">kubectl debug &lt;pod_name&gt; -it --image=&lt;debug_image&gt; --target=&lt;container_name&gt;
</code></pre>
<p>This command will run a new container in the same pod as the target container, allowing you to run diagnostic commands.</p>
<p>Example use cases are investigating file systems, running network diagnostics, checking configuration files, and so on.</p>
<p>These debug containers are meant to be temporary and can be discarded after the issue is resolved.</p>
<h2 id="heading-how-to-implement-advanced-debugging-techniques">How to Implement Advanced Debugging Techniques</h2>
<p>This section covers advanced methods to diagnose complex CI/CD pipeline issues that standard log analysis might miss. We’ll explore distributed tracing to track requests across multiple services and combine traces with logs and metrics for deeper insights.</p>
<p>These techniques are designed to work within budget constraints, ensuring effective debugging for your CI/CD workflows.</p>
<h3 id="heading-choosing-a-tracing-backend-for-cicd">Choosing a Tracing Backend for CI/CD</h3>
<p>Distributed tracing lets you follow a request’s path through the services in your CI/CD pipeline, for example from a build step to a deployment, and identify where delays or failures occur. Choosing a tracing backend means selecting a tool to store and analyze this trace data. Below, we compare Jaeger, Tempo, and hosted solutions for distributed tracing.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Tool</strong></td><td><strong>Resource Usage</strong></td><td><strong>Setup Complexity</strong></td><td><strong>Best For</strong></td><td><strong>CI/CD Fit</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Jaeger</strong></td><td>Low</td><td>Easy (Docker-based)</td><td>Small teams, local setups</td><td>Simple pipelines, quick trace views</td></tr>
<tr>
<td><strong>Tempo</strong></td><td>Low</td><td>Moderate (Grafana integration)</td><td>Grafana users, log/metric correlation</td><td>Complex pipelines, unified observability</td></tr>
<tr>
<td><strong>Hosted (e.g., Lightstep)</strong></td><td>Variable (cloud-based)</td><td>Easy (managed)</td><td>Teams with budget for cloud services</td><td>Scalable, production-grade tracing</td></tr>
</tbody>
</table>
</div><p>When to choose each one:</p>
<ul>
<li><p><strong>Jaeger</strong>: Ideal for quick, local tracing setups with minimal overhead.</p>
</li>
<li><p><strong>Tempo</strong>: Best for teams already using Grafana Loki/Prometheus for unified observability.</p>
</li>
<li><p><strong>Hosted Solutions</strong>: Suited for large-scale pipelines needing managed scalability.</p>
</li>
</ul>
<h3 id="heading-how-to-set-up-distributed-tracing-on-a-budget">How to Set Up Distributed Tracing on a Budget</h3>
<p>Distributed tracing is crucial for debugging and observing complex, multi-step operations across services. It allows you to follow requests as they propagate through different services and components of your pipeline. Implementing this on a budget can still provide valuable insights.</p>
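<p>To make the idea of propagation concrete, here is a minimal sketch of the W3C Trace Context <code>traceparent</code> header, which is the format OpenTelemetry uses by default to pass trace identity between services. The helper names <code>new_traceparent</code> and <code>child_traceparent</code> are hypothetical. In a real pipeline the OpenTelemetry SDK generates and forwards this header for you.</p>

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str):
    """Keep the trace ID from the parent header but mint a new span ID."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"not a valid traceparent header: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

<p>A build step could create the header with <code>new_traceparent()</code> and hand it to the deploy step, which derives its own span with <code>child_traceparent()</code>. Because both spans share the same trace ID, the backend can stitch them into a single trace.</p>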
<h4 id="heading-how-to-use-opentelemetry-with-free-backends">How to Use OpenTelemetry with Free Backends</h4>
<p><a target="_blank" href="https://www.freecodecamp.org/news/how-to-use-opentelementry-to-trace-node-js-applications/">OpenTelemetry</a> is an open-source framework that enables you to collect, process, and export telemetry data like traces and metrics. It supports multiple backends, and we’ll focus on using free, budget-friendly backends for trace storage and analysis.</p>
<p><strong>1. Install OpenTelemetry Collector:</strong></p>
<p>OpenTelemetry provides an agent (collector) that collects traces and metrics from your application and sends them to a backend.</p>
<p>To install the OpenTelemetry Collector, download the binary for your OS or use Docker to deploy it:</p>
<pre><code class="lang-bash">docker pull otel/opentelemetry-collector:latest
</code></pre>
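<p>Then run the OpenTelemetry Collector in Docker. The command below uses the image's built-in default configuration and publishes the standard OTLP ports: 4317 for gRPC and 4318 for HTTP. Port 55680 is a legacy OTLP port that current collector releases no longer listen on. To use your own configuration file, mount it into the container with <code>-v</code>:</p>
<pre><code class="lang-bash">docker run -d --name opentelemetry-collector -p 4317:4317 -p 4318:4318 otel/opentelemetry-collector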
</code></pre>
<p><strong>2. Configure OpenTelemetry to Export to Free Backends:</strong></p>
<p>There are a few popular free backends you can use for distributed tracing, like Jaeger and Grafana Tempo (often paired with Prometheus for metrics). Let’s see how to use both here.</p>
<p>We’ll start with <strong>Jaeger</strong>, an open-source tracing backend. It’s highly scalable and works well with OpenTelemetry.</p>
<p>You can use the Docker version for easy deployment:</p>
<pre><code class="lang-bash">docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:1.30
</code></pre>
<p>Alternatively, you can use hosted services like <strong>Lightstep</strong>, <strong>AWS X-Ray</strong>, or <strong>Honeycomb</strong> for cloud-native environments.</p>
<p>Now let’s see how to use <strong>Tempo</strong> alongside <strong>Prometheus</strong> to correlate traces with logs and metrics.</p>
<p>Tempo is a distributed tracing backend built by Grafana that integrates well with other Grafana tools (Loki and Prometheus).</p>
<p>You can install Tempo using Docker:</p>
<pre><code class="lang-bash">docker run -d --name tempo -p 14268:14268 grafana/tempo:latest
</code></pre>
<p><strong>3. Instrument Your Code with OpenTelemetry SDK:</strong></p>
<p>For Python/Node.js/Java/Go applications, you can install the appropriate OpenTelemetry SDK and start tracing.</p>
<p>Here’s a Python example:</p>
<pre><code class="lang-bash">pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation opentelemetry-exporter-otlp-proto-http
</code></pre>
<p>And a Node.js example:</p>
<pre><code class="lang-bash">npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/instrumentation
</code></pre>
<p>And one in Java:</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>io.opentelemetry<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>opentelemetry-api<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">version</span>&gt;</span>1.0.0<span class="hljs-tag">&lt;/<span class="hljs-name">version</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
</code></pre>
<p>After installation, you can use the OpenTelemetry SDK to instrument the application and start collecting traces for HTTP requests, database queries, and other pipeline interactions.</p>
<p><strong>4. Send Data to the Collector:</strong></p>
<p>You can configure the SDK to send trace data to your OpenTelemetry Collector, which will then forward it to your backend (Jaeger, Tempo, and so on). Here’s an example for Python:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> opentelemetry <span class="hljs-keyword">import</span> trace
<span class="hljs-keyword">from</span> opentelemetry.exporter.otlp.proto.http.trace_exporter <span class="hljs-keyword">import</span> OTLPSpanExporter
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace <span class="hljs-keyword">import</span> TracerProvider
<span class="hljs-keyword">from</span> opentelemetry.sdk.trace.export <span class="hljs-keyword">import</span> BatchSpanProcessor

<span class="hljs-comment"># Register a global tracer provider</span>
trace.set_tracer_provider(TracerProvider())
<span class="hljs-comment"># OTLP over HTTP uses port 4318 by default (the collector must publish it); the /v1/traces path is required</span>
exporter = OTLPSpanExporter(endpoint=<span class="hljs-string">"http://localhost:4318/v1/traces"</span>)
<span class="hljs-comment"># Batch spans before export to reduce per-span overhead</span>
processor = BatchSpanProcessor(exporter)
trace.get_tracer_provider().add_span_processor(processor)
</code></pre>
<p>If traces aren’t appearing, several issues might be occurring:</p>
<ol>
<li><p><strong>Collector fails to start</strong>: Check logs with <code>docker logs opentelemetry-collector</code> (the container name used in the <code>docker run</code> command above). Look for errors like “port conflict” or “invalid config.”</p>
<ul>
<li>Fix: Map the conflicting port to a different host port in the <code>-p</code> flag, or verify the config file.</li>
</ul>
</li>
<li><p><strong>No traces in Jaeger</strong>: Ensure the collector is exporting to Jaeger's gRPC collector port (<code>localhost:14250</code>), and confirm with <code>docker ps</code> that the port your SDK exporter targets is actually published by the collector container.</p>
<ul>
<li>Fix: Update the exporter endpoint in your SDK configuration.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor usage with <code>docker stats</code>.</p>
<ul>
<li>Fix: Allocate at least 2GB RAM and 10GB disk space for the collector and backend.</li>
</ul>
</li>
</ol>
<h4 id="heading-correlating-traces-with-logs-and-metrics">Correlating Traces with Logs and Metrics</h4>
<p>Combining traces with logs and metrics provides a holistic view of your pipeline’s operations, allowing you to pinpoint the root cause of issues more effectively.</p>
<p>OpenTelemetry and Grafana allow you to link traces, logs, and metrics into a unified view.</p>
<p>Let’s see how you can do this now.</p>
<p><strong>1. Link Logs and Traces Using Correlation IDs:</strong></p>
<p>When generating logs, include trace and span IDs in the log entries. This allows you to correlate logs with specific trace requests.</p>
<p>Here’s an example:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-05-10T12:00:00Z"</span>,
  <span class="hljs-attr">"level"</span>: <span class="hljs-string">"error"</span>,
  <span class="hljs-attr">"message"</span>: <span class="hljs-string">"Build failure"</span>,
  <span class="hljs-attr">"trace_id"</span>: <span class="hljs-string">"1234567890abcdef"</span>,
  <span class="hljs-attr">"span_id"</span>: <span class="hljs-string">"0987654321abcdef"</span>
}
</code></pre>
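<p>One way to produce log entries in this shape is a custom formatter for Python's standard <code>logging</code> module. The sketch below is an illustration only: the class name <code>TraceJsonFormatter</code>, the logger name <code>ci_pipeline</code>, and passing the IDs through <code>extra</code> are assumptions, and in practice you would read the IDs from the active OpenTelemetry span.</p>

```python
import json
import logging
from datetime import datetime, timezone


class TraceJsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying trace and span IDs."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # IDs arrive through the `extra` argument; None if the caller omitted them
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str = "ci_pipeline"):
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(TraceJsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

<p>You would then log with <code>build_logger().error("Build failure", extra={"trace_id": "...", "span_id": "..."})</code>, and every line shipped to Loki carries the IDs needed to jump to the matching trace.</p>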
<p><strong>2. Integrating Logs (Loki) with Traces (Jaeger/Tempo) in Grafana:</strong></p>
<p>Grafana can integrate traces from Jaeger or Tempo and correlate them with logs from Loki.</p>
<p>To do this:</p>
<ol>
<li><p><strong>Set up Loki and Tempo in Grafana.</strong></p>
</li>
<li><p>In Grafana’s Explore view, you can search logs and traces side-by-side.</p>
</li>
<li><p>Create dashboards that show metrics, logs, and traces for a complete view of a request flow.</p>
</li>
</ol>
<p><strong>3. Using Prometheus Metrics with Traces:</strong></p>
<p>Prometheus provides metrics that can be correlated with traces. For example, you can use <strong>exemplars</strong> in Prometheus to link specific metric data to trace data.</p>
<p><strong>Example:</strong> If you have a high error rate in your build step, you can correlate this with trace data to identify which requests failed.</p>
<h4 id="heading-creating-trace-visualizations-for-complex-pipeline-operations">Creating Trace Visualizations for Complex Pipeline Operations</h4>
<p>You can visualize traces with Jaeger or Tempo.</p>
<p><strong>To do this in Jaeger:</strong></p>
<p>Once your traces are in Jaeger, you can access the Jaeger UI (<a target="_blank" href="http://localhost:16686"><code>http://localhost:16686</code></a> by default) and use the search functionality to explore traces based on service name, trace ID, or specific operations.</p>
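<p>The Jaeger UI shows each trace as a timeline of spans, so you can see how long every operation took and where errors occurred across services. Jaeger does not provide custom dashboards itself; to build them, add Jaeger as a data source in Grafana.</p>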
<p><strong>To do this in Tempo (Grafana Integration):</strong></p>
<p>Tempo integrates with Grafana, where you can create dashboards that visualize trace data from your pipeline.</p>
<p><strong>Create a Grafana dashboard:</strong></p>
<ol>
<li><p>Add Tempo as a data source in Grafana.</p>
</li>
<li><p>Use the "Traces" panel to query and visualize traces.</p>
</li>
<li><p>Combine trace visualizations with metrics (from Prometheus) and logs (from Loki) to get a unified view of your pipeline.</p>
</li>
</ol>
<p>A typical trace visualization dashboard could show the duration of each step in your pipeline (build, test, deploy) and highlight where delays or errors occur, such as slow database queries or flaky tests.</p>
<p><strong>Troubleshooting Tempo Setup Issues</strong></p>
<p>If Tempo fails to collect or display traces:</p>
<ol>
<li><p><strong>Container fails to start</strong>: Check logs with <code>docker logs tempo</code>. Look for errors like “port already in use” (for example, 14268) or “storage backend unavailable.”</p>
<ul>
<li>Fix: Change ports in the Docker command (for example, <code>-p 14269:14268</code>) or ensure the storage directory (for example, <code>/tmp/tempo</code>) exists and is writable.</li>
</ul>
</li>
<li><p><strong>No traces in Tempo</strong>: Verify the OpenTelemetry Collector is sending traces to Tempo’s endpoint (<code>http://localhost:14268</code>). Test connectivity with <code>curl http://localhost:14268</code>.</p>
<ul>
<li>Fix: Update the collector’s exporter configuration to point to the correct Tempo endpoint, and ensure no firewalls are blocking the connection.</li>
</ul>
</li>
<li><p><strong>Resource constraints</strong>: Monitor usage with <code>docker stats</code> or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 2GB RAM and 10GB disk space for Tempo, as tracing data can grow quickly with high-volume pipelines.</li>
</ul>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748226837500/c9865f8c-f737-49a5-a346-a56f4fac37fd.png" alt="Bar chart showing CI/CD pipeline trace latency for May 2025. Three pipeline stages are displayed: Build stage (blue bar) shows approximately 1,200ms latency, Test stage (yellow bar) shows approximately 800ms latency, and Deploy stage (red bar) shows approximately 1,500ms latency. The Deploy stage has the highest latency, followed by Build, then Test." class="image--center mx-auto" width="1468" height="866" loading="lazy"></p>
<p>This bar chart displays the average latency (in milliseconds) for key stages of a CI/CD pipeline in May 2025. The Build stage averages around 1,200 ms (blue), the Test stage around 800 ms (yellow), and the Deploy stage around 1,500 ms (pink), highlighting that deployment is the most time-intensive step.</p>
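<p>The aggregation behind a chart like this is simple enough to sketch. The example below assumes you have exported span durations as <code>(stage, duration_ms)</code> pairs. The function names and the sample values are illustrative only, chosen to match the approximate figures in the chart rather than taken from real measurements.</p>

```python
from collections import defaultdict
from statistics import mean


def average_latency_by_stage(spans):
    """Group (stage, duration_ms) pairs and return the mean duration per stage."""
    grouped = defaultdict(list)
    for stage, duration_ms in spans:
        grouped[stage].append(duration_ms)
    return {stage: mean(values) for stage, values in grouped.items()}


def slowest_stage(averages):
    """Return the name of the stage with the highest average latency."""
    return max(averages, key=averages.get)


# Illustrative span durations only, not real measurements
sample_spans = [
    ("build", 1150), ("build", 1250),
    ("test", 780), ("test", 820),
    ("deploy", 1400), ("deploy", 1600),
]
averages = average_latency_by_stage(sample_spans)
print(averages)
print(slowest_stage(averages))
```

<p>With these sample values, the slowest stage is the deploy step, which is where you would start investigating.</p>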
<h2 id="heading-how-to-build-comprehensive-debugging-dashboards">How to Build Comprehensive Debugging Dashboards</h2>
<p>This section explains how to create Grafana dashboards to troubleshoot CI/CD pipeline issues effectively. We’ll focus on setting up visualizations for key metrics, logs, and system resources to identify problems like build failures or resource bottlenecks, using budget-friendly tools to keep your observability stack lean and actionable.</p>
<h3 id="heading-designing-grafana-dashboards-specifically-for-troubleshooting">Designing Grafana Dashboards Specifically for Troubleshooting</h3>
<h4 id="heading-step-1-understand-the-key-metrics-and-logs-to-monitor">Step 1: Understand the Key Metrics and Logs to Monitor</h4>
<p>When designing a Grafana dashboard for debugging, you should focus on metrics and logs that help identify issues in the pipeline. These could include:</p>
<ul>
<li><p><strong>Build failures</strong>: Errors during build processes (compilation, test failures).</p>
</li>
<li><p><strong>Deployment failures</strong>: Issues in deployment, such as failed jobs, resource limitations, or misconfigurations.</p>
</li>
<li><p><strong>Container logs</strong>: Information about container status and logs (if using containers in your pipeline).</p>
</li>
<li><p><strong>System resource usage</strong>: CPU, memory, and disk usage that may lead to performance bottlenecks.</p>
</li>
<li><p><strong>CI/CD-specific metrics</strong>: Number of successful vs. failed pipeline runs, job duration, job queue times.</p>
</li>
</ul>
<h4 id="heading-step-2-set-up-data-sources">Step 2: Set Up Data Sources</h4>
<p>To start building the dashboard, you’ll need to set up your data sources in Grafana. First, connect your Prometheus instance for collecting metrics. To do this, go to <code>Configuration</code> &gt; <code>Data Sources</code> in Grafana. Then just add <code>Prometheus</code> as a data source and enter the URL (for example, <a target="_blank" href="http://localhost:9090"><code>http://localhost:9090</code></a>).</p>
<p>Next, you need to connect your Loki instance for logs. So go ahead and add <code>Loki</code> as a data source by specifying the URL (for example, <a target="_blank" href="http://localhost:3100"><code>http://localhost:3100</code></a>).</p>
<p>Note that if you're using other sources like InfluxDB or Elasticsearch, you’ll need to make sure that they’re properly connected as data sources.</p>
<h4 id="heading-step-3-create-panels-and-visualizations">Step 3: Create Panels and Visualizations</h4>
<p>Now that your data sources are connected, you can start building your dashboard with the following panels:</p>
<ul>
<li><p><strong>Build Status Panel:</strong></p>
<ul>
<li><p>Create a <strong>stat panel</strong> or <strong>gauge panel</strong> to show the success/failure ratio of pipeline runs.</p>
</li>
<li><p>Query Prometheus or Loki for data like build status (success or failure), number of errors, and job durations.</p>
</li>
</ul>
</li>
<li><p><strong>Error Breakdown Panel:</strong></p>
<ul>
<li><p>Use a <strong>pie chart</strong> to visualize the types of errors (for example, build, deployment, or system resource failures).</p>
</li>
<li><p>Query the logs in Loki to break down error types based on the CI tool (for example, Jenkins, GitHub Actions).</p>
</li>
</ul>
</li>
<li><p><strong>Resource Utilization Panel:</strong></p>
<ul>
<li>Use <strong>time series graphs</strong> to monitor CPU, memory, and disk usage over time, especially for resource-heavy builds or deployments.</li>
</ul>
</li>
<li><p><strong>Job Duration Panel:</strong></p>
<ul>
<li>Use <strong>bar charts</strong> or <strong>line graphs</strong> to track the average duration of jobs over time. Set thresholds for warning signs if a job takes longer than expected.</li>
</ul>
</li>
</ul>
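<p>These panels need pipeline metrics to query. If your CI tool does not export them already, one option is to write them in the Prometheus text exposition format and let Prometheus scrape them, for example through node_exporter's textfile collector. The sketch below assumes a hypothetical metric name, <code>ci_pipeline_runs_total</code>, and simple label values that need no escaping.</p>

```python
from collections import Counter


def render_pipeline_metrics(runs):
    """Render run outcomes in the Prometheus text exposition format.

    `runs` is a list of (job_name, status) pairs, e.g. ("build", "success").
    """
    counts = Counter(runs)
    lines = [
        "# HELP ci_pipeline_runs_total Pipeline runs by job and status.",
        "# TYPE ci_pipeline_runs_total counter",
    ]
    for (job, status), count in sorted(counts.items()):
        lines.append(f'ci_pipeline_runs_total{{job="{job}",status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


def success_ratio(runs):
    """Fraction of runs that succeeded; the value a stat or gauge panel would show."""
    statuses = [status for _job, status in runs]
    if not statuses:
        return 0.0
    return statuses.count("success") / len(statuses)
```

<p>A Build Status panel can then chart the counter by its <code>status</code> label, and <code>success_ratio()</code> shows the same calculation in plain Python so you can sanity-check what the panel displays.</p>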
<h4 id="heading-troubleshooting-grafana-dashboard-issues">Troubleshooting Grafana Dashboard Issues</h4>
<p>If Grafana dashboards fail to display data or show errors, you might be having one of these issues:</p>
<ol>
<li><p><strong>Missing data sources</strong>: If metrics, logs, or traces aren’t appearing, verify data source connections in Grafana (for example, Prometheus, Loki, Tempo). Check under Configuration &gt; Data Sources.</p>
<ul>
<li>Fix: Ensure the data source URLs are correct (for example, <code>http://localhost:9090</code> for Prometheus) and test the connection. Re-add the data source if needed.</li>
</ul>
</li>
<li><p><strong>Incorrect Trace IDs</strong>: If trace visualizations (for example, Tempo panels) show no data, confirm that trace IDs in logs match those in Tempo. Use a query like <code>{job="ci_cd"} | json | trace_id="1234567890abcdef"</code> in Loki to cross-check.</p>
<ul>
<li>Fix: Ensure your application logs include trace and span IDs, and verify the OpenTelemetry SDK is correctly instrumented to send traces to Tempo.</li>
</ul>
</li>
<li><p><strong>Resource Constraints</strong>: Monitor Grafana’s resource usage with <code>docker stats</code> if running in a container, or <code>top</code> on the host.</p>
<ul>
<li>Fix: Allocate at least 4GB RAM and 10GB disk space for Grafana, especially when rendering complex dashboards with multiple data sources.</li>
</ul>
</li>
</ol>
<h3 id="heading-how-to-set-up-drill-down-paths-from-high-level-to-detailed-views">How to Set Up Drill-Down Paths from High-Level to Detailed Views</h3>
<h4 id="heading-step-1-create-high-level-overview-panel">Step 1: Create High-Level Overview Panel</h4>
<p>At the top of the dashboard, include a high-level overview panel that summarizes the overall status of the pipeline. This could be:</p>
<ul>
<li><p><strong>Success/Failure Count</strong>: A simple stat panel showing the count of successful vs. failed runs.</p>
</li>
<li><p><strong>Pipeline Health Status</strong>: Display an overall health check of your pipeline using color-coded indicators (green for healthy, red for issues).</p>
</li>
</ul>
<h4 id="heading-step-2-set-up-drill-down-links">Step 2: Set Up Drill-Down Links</h4>
<p>To allow users to drill down from high-level information to detailed views:</p>
<p><strong>1. Link to detailed build information</strong>:</p>
<p>You can create a time series graph that shows build job durations. Add a link to a detailed log view when clicking on a failed job.</p>
<p>For example, when clicking a failed build, you can link to a detailed panel or a separate dashboard that shows the logs and error messages related to that specific run.</p>
<p><strong>2. Link to Logs in Loki</strong>:</p>
<p>You can use <strong>Loki's LogQL</strong> queries to set up a drill-down path. When users click on an error type or a specific job name, it should automatically filter logs for that job or error type.</p>
<p>You can set up drill-down interactions using Dashboard Links in Grafana. In the panel settings, under <code>Links</code>, specify the link to another dashboard that shows detailed logs filtered by the job name or failure type.</p>
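<p>The query behind such a drill-down link can be assembled from the clicked values. The helper below is a minimal sketch that assumes your logs carry a <code>job</code> label, as in the <code>{job="ci_cd"}</code> example used in this handbook. The function name <code>build_logql</code> is hypothetical.</p>

```python
def build_logql(job_name: str, error_text: str = ""):
    """Build a LogQL query selecting one job's logs, optionally filtered by text."""

    def quote(value: str):
        # Escape backslashes and double quotes so the value stays inside its string
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    query = "{job=" + quote(job_name) + "}"
    if error_text:
        # |= keeps only lines that contain the given text
        query += " |= " + quote(error_text)
    return query
```

<p>For example, <code>build_logql("ci_cd", "TIMEOUT")</code> returns <code>{job="ci_cd"} |= "TIMEOUT"</code>, which you can use as the query of the detailed logs panel that the link opens.</p>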
<h4 id="heading-step-3-implement-time-range-filters">Step 3: Implement Time Range Filters</h4>
<p>To enhance drill-down functionality, you can add a <strong>time range filter</strong> to allow users to adjust the time window for both logs and metrics. This enables them to zoom in on a specific time frame where failures occurred.</p>
<h3 id="heading-how-to-create-shared-dashboards-for-team-troubleshooting">How to Create Shared Dashboards for Team Troubleshooting</h3>
<h4 id="heading-step-1-share-your-dashboard">Step 1: Share Your Dashboard</h4>
<p>Once your dashboard is designed, you can share it with your team for collaborative troubleshooting:</p>
<p>First, you’ll want to make sure that the correct permissions are set up for your team. You can define specific roles in Grafana with access to the dashboard. Go to <code>Dashboard Settings</code> &gt; <code>Permissions</code>, and grant view or edit access to users or teams.</p>
<p>Next, you can directly share a link to the dashboard with your team members. Use the <code>Share</code> option in the top-right corner of the dashboard, which provides a direct URL and also options to embed the dashboard into other tools (for example, Slack, email).</p>
<p>You can also use <strong>template variables</strong> to allow users to filter and adjust the dashboard for different pipeline runs or environments. For example, add a variable for <code>build_id</code>, <code>job_name</code>, or <code>branch_name</code> that allows users to select specific builds or branches for more granular troubleshooting.</p>
<h4 id="heading-step-2-set-up-alerting">Step 2: Set Up Alerting</h4>
<p>To ensure your team is notified of any pipeline failures, you can set up <strong>alerting rules</strong>. There are a few important ones you’ll want to set up.</p>
<p>First, create alerts for critical issues, like when a pipeline fails or exceeds expected resource usage. This could be for things like build time exceeding a threshold or failure of a deployment stage.</p>
<p>Grafana can send alerts via various channels such as Slack, email, or webhook.</p>
<p>You can also integrate your dashboards with tools like Slack or Teams for real-time notifications and collaboration. Set up automated messages for your team when the dashboard indicates an issue.</p>
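<p>Before encoding alert rules in Grafana, it can help to write the conditions down as plain logic. The sketch below is an illustration, not Grafana's alerting API: the <code>StageResult</code> type, the <code>evaluate_alerts</code> function, and the 600-second default threshold are all assumptions.</p>

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    succeeded: bool
    duration_seconds: float


def evaluate_alerts(results, max_duration_seconds: float = 600.0):
    """Return one alert message per failed or over-threshold stage."""
    alerts = []
    for result in results:
        if not result.succeeded:
            alerts.append(f"{result.name}: stage failed")
        elif result.duration_seconds > max_duration_seconds:
            alerts.append(
                f"{result.name}: took {result.duration_seconds:.0f}s, "
                f"threshold is {max_duration_seconds:.0f}s"
            )
    return alerts
```

<p>Each message maps to one alert rule you would define in Grafana: a failure condition per stage, and a duration threshold per stage.</p>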
<h3 id="heading-how-to-create-automated-diagnostic-tools">How to Create Automated Diagnostic Tools</h3>
<h4 id="heading-building-scripts-that-collect-relevant-logs-during-failures">Building Scripts that Collect Relevant Logs During Failures</h4>
<p>To automate log collection during failures, you need scripts that can capture logs from different CI/CD stages and services as soon as a failure is detected. Here are the steps you can follow to do this:</p>
<p><strong>1. Write Failure Detection Script:</strong></p>
<p>You can leverage the exit status codes of your CI/CD tools to detect failures. For example, in GitLab CI/CD or GitHub Actions, you can check if the last command failed by inspecting <code>$?</code> in Unix-based systems.</p>
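<pre><code class="lang-bash"><span class="hljs-comment"># Example for GitLab CI/CD: $? holds the exit status of the command that ran just before the check</span>
npm run test
<span class="hljs-keyword">if</span> [ $? -ne 0 ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failure detected, collecting logs..."</span>
    <span class="hljs-comment"># Custom log collection script call</span>
    ./collect_logs.sh
    <span class="hljs-comment"># Re-signal the failure so the job is still marked as failed</span>
    <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>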
</code></pre>
<p><strong>2. Log Collection Script (collect_logs.sh):</strong></p>
<p>The script should collect relevant logs, system metrics, and trace information. For instance:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
LOG_DIR=<span class="hljs-string">"/path/to/logs"</span>
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=<span class="hljs-string">"<span class="hljs-variable">${LOG_DIR}</span>/backup/<span class="hljs-variable">${TIMESTAMP}</span>"</span>
mkdir -p <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>

<span class="hljs-comment"># Collect logs from CI/CD agents, containers, or system logs</span>
cp /var/<span class="hljs-built_in">log</span>/ci_cd/*.<span class="hljs-built_in">log</span> <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>/
cp /path/to/docker_logs/*.<span class="hljs-built_in">log</span> <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span>/
<span class="hljs-comment"># Collect metrics or traces from monitoring systems if needed</span>
</code></pre>
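<p>If your runners already have Python available, the same collection step can be written so that a missing directory does not abort the run. This is a sketch under assumptions: the function name <code>collect_logs</code> and the choice to prefix each copied file with its source folder name are illustrative, not part of the shell script above.</p>

```python
import shutil
from datetime import datetime
from pathlib import Path


def collect_logs(source_dirs, log_dir):
    """Copy every *.log file from source_dirs into a timestamped backup folder."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = Path(log_dir) / "backup" / timestamp
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in source_dirs:
        source = Path(source)
        if not source.is_dir():
            continue  # a missing source should not abort the whole collection
        for log_file in sorted(source.glob("*.log")):
            # Prefix with the source folder name so same-named files do not collide
            target = backup_dir / f"{source.name}_{log_file.name}"
            shutil.copy2(log_file, target)
            copied.append(target)
    return copied
```

<p>The function returns the list of copied files, which you can print in the job output or pass to the artifact upload step.</p>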
<p><strong>3. Use CI/CD Artifacts:</strong></p>
<p>For platforms like GitLab, GitHub Actions, or Jenkins, you can upload logs as artifacts for further investigation. Configure these platforms to save logs in case of a failure.</p>
<p>Here’s an example for GitHub Actions:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">Tests</span>
    <span class="hljs-attr">run:</span> <span class="hljs-string">|
      npm run test
</span>  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">logs</span> <span class="hljs-string">if</span> <span class="hljs-string">test</span> <span class="hljs-string">fails</span>
    <span class="hljs-attr">if:</span> <span class="hljs-string">failure()</span>
    <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-artifact@v4</span>
    <span class="hljs-attr">with:</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">test-logs</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">/path/to/test/logs</span>
</code></pre>
<p><strong>4. Centralized Logging:</strong></p>
<p>Instead of manually collecting logs, you can centralize log storage using logging systems like Grafana Loki, ELK stack, or even cloud-based solutions. This will ensure that logs are accessible even if they are overwritten or lost on individual systems.</p>
<h3 id="heading-how-to-implement-automatic-analysis-of-common-error-patterns">How to Implement Automatic Analysis of Common Error Patterns</h3>
<p>Once logs are collected, you can automate the analysis process by defining common error patterns and automatically searching for them in your logs.</p>
<h4 id="heading-step-1-define-error-patterns">Step 1: Define Error Patterns:</h4>
<p>Establish error signatures or patterns that are common in your CI/CD process, such as failed builds due to missing dependencies, permission issues, or network timeouts.</p>
<p>You can use regular expressions (regex) to capture these patterns. Here’s an example that defines a regex for failed test patterns:</p>
<pre><code class="lang-bash">TEST_FAILURE_REGEX=<span class="hljs-string">".*FAILURE.*"</span>
</code></pre>
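<p>A single catch-all pattern tells you that something failed, but not what kind of failure it was. The sketch below shows one way to map log lines to the categories mentioned above (missing dependencies, permission issues, network timeouts). The signatures in <code>ERROR_PATTERNS</code> are hypothetical examples, so adjust them to the messages your own tools actually emit.</p>

```python
import re

# Hypothetical signatures; tune them to the messages your own tools emit
ERROR_PATTERNS = {
    "missing_dependency": re.compile(r"module not found|no matching distribution", re.IGNORECASE),
    "permission": re.compile(r"permission denied|access denied", re.IGNORECASE),
    "network_timeout": re.compile(r"timed out|timeout", re.IGNORECASE),
}


def classify_line(line: str):
    """Return the first category whose pattern matches the line, else None."""
    for category, pattern in ERROR_PATTERNS.items():
        if pattern.search(line):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each known category."""
    counts = {}
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts
```

<p>Running <code>summarize()</code> over a collected log file gives a per-category count, which is easier to act on (and to alert on) than a raw list of matching lines.</p>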
<h4 id="heading-step-2-create-log-analysis-script">Step 2: Create Log Analysis Script:</h4>
<p>Next, you can write a script that scans logs for these common patterns. The script could then categorize or flag errors.</p>
<p>Here’s an example using <code>grep</code> to detect failure patterns:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
LOG_DIR=<span class="hljs-string">"/path/to/logs"</span>
<span class="hljs-comment"># Keep the results file out of the *.log glob so grep never scans its own output</span>
ERROR_LOG=<span class="hljs-string">"<span class="hljs-variable">${LOG_DIR}</span>/error_patterns.txt"</span>
<span class="hljs-comment"># Truncate the file so matches from earlier runs are not reported again</span>
: &gt; <span class="hljs-string">"<span class="hljs-variable">$ERROR_LOG</span>"</span>

<span class="hljs-comment"># Define error patterns to search for</span>
ERROR_PATTERNS=(<span class="hljs-string">"FAILURE"</span> <span class="hljs-string">"ERROR"</span> <span class="hljs-string">"TIMEOUT"</span>)

<span class="hljs-keyword">for</span> PATTERN <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${ERROR_PATTERNS[@]}</span>"</span>; <span class="hljs-keyword">do</span>
    grep -i -- <span class="hljs-string">"<span class="hljs-variable">$PATTERN</span>"</span> <span class="hljs-string">"<span class="hljs-variable">$LOG_DIR</span>"</span>/*.<span class="hljs-built_in">log</span> &gt;&gt; <span class="hljs-string">"<span class="hljs-variable">$ERROR_LOG</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-keyword">if</span> [ -s <span class="hljs-string">"<span class="hljs-variable">$ERROR_LOG</span>"</span> ]; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Error patterns found, review the log file."</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<h4 id="heading-step-3-automate-alerting">Step 3: Automate Alerting:</h4>
<p>Once an error pattern is detected, you can integrate the log analysis script with your alerting system (for example, sending an email or Slack notification).</p>
<p>Here’s an example of sending a Slack notification:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> [ -s <span class="hljs-variable">$ERROR_LOG</span> ]; <span class="hljs-keyword">then</span>
    curl -X POST -H <span class="hljs-string">'Content-type: application/json'</span> \
         --data <span class="hljs-string">'{"text":"Error detected in CI pipeline. Check error log."}'</span> \
         https://hooks.slack.com/services/YOUR_SLACK_WEBHOOK_URL
<span class="hljs-keyword">fi</span>
</code></pre>
<h4 id="heading-step-4-use-observability-tools-for-pattern-recognition">Step 4: Use Observability Tools for Pattern Recognition:</h4>
<p>Leverage observability tools that support querying and visualization, such as Grafana Loki for logs and Prometheus for metrics. In Grafana, you can build dashboards and alert rules on top of them that flag anomalies like high failure rates or recurring errors.</p>
<p>Example: Set up a Grafana dashboard with alert rules based on log frequency.</p>
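<p>As a rough sketch of what a log-frequency alert evaluates, the snippet below builds a LogQL metric query that counts matching lines over a time window and sends it to Loki’s instant-query HTTP endpoint. The <code>LOKI_URL</code> default, the <code>{job="ci"}</code> label, and both function names are assumptions made for illustration, so adjust them to your own setup:</p>
<pre><code class="lang-bash">#!/bin/bash
# Assumed Loki address; override LOKI_URL for your environment
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build a LogQL query that counts lines containing a pattern
# within a time window for the given stream selector
build_error_rate_query() {
    local selector="$1" pattern="$2" window="$3"
    printf 'sum(count_over_time(%s |= "%s" [%s]))' "$selector" "$pattern" "$window"
}

# Evaluate a query via Loki's instant-query endpoint and print the JSON reply
query_loki() {
    curl -G -s "${LOKI_URL}/loki/api/v1/query" --data-urlencode "query=$1"
}
</code></pre>
<p>For example, <code>query_loki "$(build_error_rate_query '{job="ci"}' 'ERROR' '5m')"</code> asks Loki how many lines containing <code>ERROR</code> the hypothetical <code>ci</code> job logged in the last five minutes. The same query expression can back a Grafana alert rule with a threshold that suits your pipeline.</p>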
<h3 id="heading-how-to-create-self-healing-pipelines-based-on-known-issues">How to Create Self-Healing Pipelines Based on Known Issues</h3>
<p>Self-healing pipelines can automatically address issues when they are detected by executing pre-defined corrective actions. Let’s walk through how you can set one up.</p>
<h4 id="heading-step-1-define-common-failures-and-solutions">Step 1: Define Common Failures and Solutions:</h4>
<p>Identify recurring issues (for example, dependency issues, build timeouts, flaky tests) that occur in your pipeline. Then, define self-healing actions to mitigate these issues.</p>
<p>Here’s an example of automatically retrying a failed step if it is a known flaky test:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">Tests</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">tests</span>
        <span class="hljs-attr">continue-on-error:</span> <span class="hljs-literal">true</span>  <span class="hljs-comment"># keep the job alive so the retry step can run</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          npm run test
</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retry</span> <span class="hljs-string">Tests</span> <span class="hljs-string">if</span> <span class="hljs-string">Failed</span>
        <span class="hljs-attr">if:</span> <span class="hljs-string">steps.tests.outcome</span> <span class="hljs-string">==</span> <span class="hljs-string">'failure'</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          echo "Retrying tests..."
          npm run test</span>
</code></pre>
<h4 id="heading-step-2-automatic-rollbacks">Step 2: Automatic Rollbacks:</h4>
<p>Set up a rollback process for failed deployments. For instance, if a deployment to production fails, the pipeline can automatically revert to the last successful build.</p>
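<p>Example in GitLab CI/CD, where <code>rollback.sh</code> is a placeholder for your own rollback script. The rollback job sits in the later <code>.post</code> stage, so <code>when: on_failure</code> runs it only if an earlier job, such as the deployment, has failed:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">deploy_production:</span>
  <span class="hljs-attr">stage:</span> <span class="hljs-string">deploy</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./deploy.sh</span>
  <span class="hljs-attr">retry:</span> <span class="hljs-number">2</span>  <span class="hljs-comment"># GitLab allows at most 2 retries</span>
<span class="hljs-attr">rollback_production:</span>
  <span class="hljs-attr">stage:</span> <span class="hljs-string">.post</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./rollback.sh</span>
  <span class="hljs-attr">when:</span> <span class="hljs-string">on_failure</span>
</code></pre>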
<h4 id="heading-step-3-build-self-healing-logic-using-retry-mechanisms">Step 3: Build Self-Healing Logic Using Retry Mechanisms:</h4>
<p>Implement retry logic for transient issues (like network glitches) that often cause failures.</p>
<p>Example of retrying a step in GitHub Actions:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Retry</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">run:</span> <span class="hljs-string">|
      attempts=0
      max_attempts=3
      until [ $attempts -ge $max_attempts ]
      do
        deploy_script &amp;&amp; break
        attempts=$((attempts+1))
        echo "Attempt $attempts failed. Retrying..."
        sleep 5
      done
      [ "$attempts" -lt "$max_attempts" ]  # fail the step if every attempt failed</span>
</code></pre>
<h4 id="heading-step-4-automate-corrective-actions-for-dependency-issues">Step 4: Automate Corrective Actions for Dependency Issues:</h4>
<p>Set up automatic fixes for dependency-related failures, like clearing caches or re-installing dependencies:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> grep -q <span class="hljs-string">"dependency not found"</span> error.log; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Dependency issue detected, reinstalling dependencies..."</span>
    npm install
<span class="hljs-keyword">fi</span>
</code></pre>
<h4 id="heading-step-5-integrate-with-self-healing-services">Step 5: Integrate with Self-Healing Services:</h4>
<p>For more complex self-healing, you can integrate tools like Ansible, Puppet, or even create custom scripts that auto-patch common configuration issues.</p>
<h2 id="heading-how-to-conduct-effective-postmortems-using-logs">How to Conduct Effective Postmortems Using Logs</h2>
<p>Logs are often the single most valuable resource when reconstructing what went wrong in a CI/CD pipeline. Conducting effective postmortems with log data allows teams to extract clear timelines, pinpoint root causes, and define steps to prevent recurrence – all based on concrete evidence.</p>
<h3 id="heading-extract-timeline-and-key-events-from-the-logs">Extract Timeline and Key Events from the Logs</h3>
<p>To accurately understand what happened and when from the info contained in your logs, there’s a straightforward process you can follow.</p>
<h4 id="heading-step-1-centralize-and-structure-logs">Step 1: Centralize and Structure Logs:</h4>
<p>First, make sure that the logs from all pipeline stages (build, test, deploy) are aggregated in a central place like Grafana Loki, ELK, or OpenSearch.</p>
<p>And you’ll want to use a consistent log format (like structured JSON) that includes timestamps, log levels, pipeline stage identifiers, and correlation/request IDs.</p>
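<p>As a minimal sketch of such a format, the Bash helper below prints one JSON object per log line. The field names, the <code>CORRELATION_ID</code> variable, and the function names are assumptions chosen for illustration. The escaping only covers backslashes and double quotes, so a real pipeline would more likely rely on a JSON-aware tool or logging library:</p>
<pre><code class="lang-bash">#!/bin/bash
# Escape backslashes and double quotes so the message stays valid JSON.
# Simplification: control characters such as newlines are not handled.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Emit one JSON log line with timestamp, level, stage, and correlation ID
log_json() {
    local level="$1" stage="$2" message="$3"
    printf '{"timestamp":"%s","level":"%s","stage":"%s","correlation_id":"%s","message":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$stage" \
        "${CORRELATION_ID:-unknown}" "$(json_escape "$message")"
}
</code></pre>
<p>In GitHub Actions, for instance, you could set <code>CORRELATION_ID="$GITHUB_RUN_ID"</code> once and then call <code>log_json error test "unit tests failed"</code> from each stage, so every line from the same run shares one ID.</p>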
<h4 id="heading-step-2-build-a-chronological-view">Step 2: Build a Chronological View:</h4>
<p>You can use timestamp filters in your log UI (for example, Kibana, Grafana Explore) to isolate logs from the incident timeframe.</p>
<p>Look for key lifecycle events, like:</p>
<ul>
<li><p>Start and completion of pipeline steps</p>
</li>
<li><p>Status changes (for example, "test failed", "deployment started", "build queued")</p>
</li>
<li><p>Error messages and warnings</p>
</li>
<li><p>Retry events or unexpected restarts</p>
</li>
</ul>
<h4 id="heading-step-3-extract-logs-programmatically-optional">Step 3: Extract Logs Programmatically (optional):</h4>
<p>Use queries (LogQL, Elasticsearch DSL) to export relevant logs for analysis or inclusion in a post-mortem document.</p>
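<p>If you have exported the logs as plain files, the chronological view from Step 2 can also be sketched with standard tools. The function below assumes that every line starts with an ISO-8601 UTC timestamp (for example <code>2025-06-01T10:05:00Z</code>); that layout and the function name are illustrative assumptions, not something your log backend guarantees:</p>
<pre><code class="lang-bash">#!/bin/bash
# Print log lines whose leading timestamp lies between START and END
# (inclusive), merged from all given files and sorted oldest first.
# ISO-8601 UTC timestamps compare correctly as plain strings.
extract_timeline() {
    local start="$1" end="$2"
    shift 2
    sort -k1,1 -- "$@" | awk -v s="$start" -v e="$end" '$1 &gt;= s &amp;&amp; $1 &lt;= e'
}
</code></pre>
<p>Calling <code>extract_timeline 2025-06-01T10:00:00Z 2025-06-01T11:00:00Z build.log test.log</code> would print only the lines from that hour, merged across both files, ready to paste into a postmortem timeline.</p>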
<h3 id="heading-how-to-identify-root-causes-through-log-analysis">How to Identify Root Causes Through Log Analysis</h3>
<p>To go beyond symptoms and find the real issue, there are various steps you can take.</p>
<p>Start by <strong>looking for the first failure</strong>. You can filter logs by <code>level=error</code> or use log pattern matching to identify the <em>earliest</em> sign of failure. Then trace backward from the failure using correlation IDs or pipeline step identifiers.</p>
<p>Second, make sure you <strong>correlate logs across systems.</strong> Match logs across CI/CD tools (like GitHub Actions → Docker logs → Kubernetes logs). You can use shared correlation IDs or job IDs to group logs from related events.</p>
<p>Next, <strong>pay attention to intermittent signals.</strong> Warnings, retries, or degraded performance preceding the failure may reveal environmental or configuration-related issues.</p>
<p>And finally, <strong>check for external dependencies.</strong> Look for timeout or connection errors involving third-party services, cloud APIs, or internal infrastructure components.</p>
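<p>The first two steps can be sketched in a few lines of Bash. This example assumes logfmt-style lines that begin with an ISO-8601 timestamp and carry <code>level=</code> and <code>correlation_id=</code> fields; those field names and the function name are hypothetical, so map them to whatever your logs actually contain:</p>
<pre><code class="lang-bash">#!/bin/bash
# Find the earliest error across the given log files, then print every line
# that shares its correlation ID, oldest first.
trace_first_failure() {
    local first_error corr_id
    first_error="$(sort -k1,1 -- "$@" | grep -m1 'level=error')"
    [ -n "$first_error" ] || return 1      # no error lines at all
    corr_id="$(printf '%s\n' "$first_error" | grep -o 'correlation_id=[^ ]*')"
    [ -n "$corr_id" ] || return 1          # error line carries no correlation ID
    echo "First failure: $first_error"
    sort -k1,1 -- "$@" | grep -Fw -- "$corr_id"
}
</code></pre>
<p>Sorting before searching matters here: the first error in a single file is not necessarily the first error of the incident once logs from several systems are merged.</p>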
<h3 id="heading-how-to-create-actionable-follow-ups-to-prevent-recurrence"><strong>How to Create Actionable Follow-Ups to Prevent Recurrence</strong></h3>
<p>There are various things you can do to turn your findings into meaningful process improvements.</p>
<p><strong>1. Document the Findings Clearly:</strong></p>
<p>Create a structured post-mortem doc that includes:</p>
<ul>
<li><p>Timeline of events with log excerpts</p>
</li>
<li><p>Immediate trigger and root cause (based on logs)</p>
</li>
<li><p>Impact summary and affected components</p>
</li>
<li><p>Screenshots or saved log queries for reference</p>
</li>
</ul>
<p><strong>2. Define Preventive Actions:</strong></p>
<p>Examples include:</p>
<ul>
<li><p>Adding missing alerts or log-based monitors</p>
</li>
<li><p>Improving log verbosity or adding missing metadata</p>
</li>
<li><p>Fixing brittle test cases or deployment scripts</p>
</li>
<li><p>Updating infrastructure limits or retry strategies</p>
</li>
</ul>
<p><strong>3. Assign Ownership and Deadlines:</strong></p>
<p>Each action item should have a responsible owner and a due date. If applicable, create automated tests or guardrails to catch similar issues in the future.</p>
<p><strong>4. Update Runbooks and Incident Playbooks:</strong></p>
<p>Add log patterns, example queries, and resolutions to shared documentation. This ensures the next person facing a similar issue can act faster.</p>
<p><strong>Pro Tip:</strong> Automate part of your post-mortem process by tagging logs from failed CI runs, exporting them to a shared location, and pre-generating dashboards or incident reports. This reduces manual effort and increases consistency.</p>
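<p>The export step could look roughly like the sketch below, which bundles a failed run’s log directory into one compressed archive. The function name, the archive naming scheme, and the directory layout are assumptions for illustration, and the destination directory should live outside the log directory so the archive does not try to include itself:</p>
<pre><code class="lang-bash">#!/bin/bash
# Bundle every file from a failed run's log directory into one compressed
# archive named after the run, and print the archive path.
archive_failed_run() {
    local run_id="$1" log_dir="$2" dest_dir="$3"
    local archive="${dest_dir}/ci-run-${run_id}.tar.gz"
    mkdir -p "$dest_dir" || return 1
    tar -czf "$archive" -C "$log_dir" . || return 1
    echo "$archive"
}
</code></pre>
<p>In GitHub Actions, a final step guarded by <code>if: failure()</code> could call something like <code>archive_failed_run "$GITHUB_RUN_ID" ./logs /path/to/shared/postmortems</code>, with both paths replaced by your own.</p>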
<h2 id="heading-how-to-optimize-log-storage-and-management"><strong>How to Optimize Log Storage and Management</strong></h2>
<p>As your CI/CD system grows, logs can become massive, consuming storage and impacting performance. Optimizing log storage helps you make sure that you're retaining what's valuable while staying efficient.</p>
<h3 id="heading-how-to-implement-log-rotation-and-retention-policies">How to Implement Log Rotation and Retention Policies</h3>
<p>Without rotation and retention, logs will pile up endlessly, leading to disk space exhaustion and poor performance. You can help prevent this with <strong>log rotation</strong>.</p>
<p>Log rotation involves creating new log files after a size or time threshold and archiving or deleting old ones.</p>
<p><strong>Linux logrotate tool</strong> – Configure <code>/etc/logrotate.d/&lt;your-app&gt;</code>:</p>
<pre><code class="lang-plaintext">/var/log/ci_cd/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0640 root adm
}
</code></pre>
<p>This example:</p>
<ul>
<li><p>Rotates daily</p>
</li>
<li><p>Keeps 7 days of logs</p>
</li>
<li><p>Compresses old logs to save space</p>
</li>
</ul>
<p><strong>Docker logs rotation</strong> – in <code>daemon.json</code>:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"log-driver"</span>: <span class="hljs-string">"json-file"</span>,
  <span class="hljs-attr">"log-opts"</span>: {
    <span class="hljs-attr">"max-size"</span>: <span class="hljs-string">"50m"</span>,
    <span class="hljs-attr">"max-file"</span>: <span class="hljs-string">"5"</span>
  }
}
</code></pre>
<p>Retention policies ensure that old logs are automatically deleted based on age or storage usage.</p>
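<p>Loki’s Table Manager is one way to set this up. Recent Loki releases favor compactor-based retention, so check the documentation for your version:</p>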
<pre><code class="lang-yaml"><span class="hljs-attr">table_manager:</span>
  <span class="hljs-attr">retention_deletes_enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">retention_period:</span> <span class="hljs-string">168h</span>  <span class="hljs-comment"># 7 days</span>
</code></pre>
<p>Or in Elasticsearch, use Index Lifecycle Management (ILM):</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"policy"</span>: {
    <span class="hljs-attr">"phases"</span>: {
      <span class="hljs-attr">"hot"</span>: {
        <span class="hljs-attr">"actions"</span>: {
          <span class="hljs-attr">"rollover"</span>: { <span class="hljs-attr">"max_age"</span>: <span class="hljs-string">"3d"</span>, <span class="hljs-attr">"max_size"</span>: <span class="hljs-string">"1gb"</span> }
        }
      },
      <span class="hljs-attr">"delete"</span>: {
        <span class="hljs-attr">"min_age"</span>: <span class="hljs-string">"7d"</span>,
        <span class="hljs-attr">"actions"</span>: { <span class="hljs-attr">"delete"</span>: {} }
      }
    }
  }
}
</code></pre>
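<p>For plain log files on disk that neither Loki nor Elasticsearch manages, the same age-based idea can be sketched with <code>find</code>. The function name is hypothetical, and <code>-delete</code> is available in GNU and BSD <code>find</code> but is not part of POSIX, so treat this as a sketch for typical Linux CI runners:</p>
<pre><code class="lang-bash">#!/bin/bash
# Delete *.log files under DIR whose last modification is more than
# MAX_AGE_DAYS days old, printing each path as it is removed.
prune_old_logs() {
    local dir="$1" max_age_days="$2"
    find "$dir" -type f -name '*.log' -mtime "+${max_age_days}" -print -delete
}
</code></pre>
<p>Running <code>prune_old_logs /var/log/ci_cd 7</code> from a scheduled job would mirror the 7-day window used in the examples above. It is worth replacing <code>-delete</code> with a dry run that only prints paths until you trust the pattern.</p>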
<h3 id="heading-how-to-set-up-log-compaction-for-long-term-storage">How to Set Up Log Compaction for Long-Term Storage</h3>
<p>Compaction reduces redundancy and keeps only critical log info, which is ideal for long-term audits or analytics.</p>
<h4 id="heading-compaction-techniques">Compaction Techniques:</h4>
<p>There are several compaction techniques you can try. Here are three:</p>
<p><strong>1. Loki (boltdb-shipper mode)</strong>:</p>
<ul>
<li><p>Uses compaction to merge log chunks and reduce storage.</p>
</li>
<li><p>Configure in <code>loki-config.yaml</code>:</p>
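<pre><code class="lang-yaml">  <span class="hljs-attr">schema_config:</span>
    <span class="hljs-attr">configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span> <span class="hljs-number">2023-01-01</span>
        <span class="hljs-attr">store:</span> <span class="hljs-string">boltdb-shipper</span>
        <span class="hljs-attr">object_store:</span> <span class="hljs-string">filesystem</span>
        <span class="hljs-attr">schema:</span> <span class="hljs-string">v11</span>
        <span class="hljs-attr">index:</span>
          <span class="hljs-attr">prefix:</span> <span class="hljs-string">index_</span>
          <span class="hljs-attr">period:</span> <span class="hljs-string">24h</span>  <span class="hljs-comment"># boltdb-shipper requires a 24h index period</span>
</code></pre>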
</li>
<li><p>Use a low-retention, high-compaction strategy for archived logs.</p>
</li>
</ul>
<p><strong>2. Elasticsearch</strong>:</p>
<ul>
<li><p>Use <strong>rollup jobs</strong> to reduce resolution of old data.</p>
</li>
<li><p>Stores summarized logs, for example, hourly counts of similar events.</p>
</li>
</ul>
<p><strong>3. Archive to cheaper storage</strong>:</p>
<ul>
<li>Move infrequent-access logs to S3 or Azure Blob Storage using lifecycle rules.</li>
</ul>
<h3 id="heading-how-to-balance-observability-with-resource-constraints">How to Balance Observability with Resource Constraints</h3>
<p>More logs = more observability, but also more cost and overhead. This means that you need a balance. There are various strategies that can help you achieve this balance:</p>
<ol>
<li><p><strong>Log at appropriate levels</strong>:</p>
<ul>
<li><p>Avoid excessive <code>debug</code> or <code>trace</code> logs in production.</p>
</li>
<li><p>Use <code>info</code> and <code>warn</code> levels judiciously.</p>
</li>
<li><p>Only use <code>error</code> or <code>critical</code> for actionable failures.</p>
</li>
</ul>
</li>
<li><p><strong>Sample logs</strong>:</p>
<ul>
<li><p>If high-volume pipelines generate repetitive logs, enable log sampling to reduce duplicates.</p>
</li>
<li><p>Tools like Vector or Fluent Bit support sampling.</p>
</li>
</ul>
</li>
<li><p><strong>Filter out noise</strong>:</p>
<ul>
<li>Use log filters to exclude non-critical logs before they reach the central system.</li>
</ul>
</li>
<li><p><strong>Separate hot vs. cold logs</strong>:</p>
<ul>
<li><p><strong>Hot logs</strong>: recent, real-time data for active debugging.</p>
</li>
<li><p><strong>Cold logs</strong>: archived for compliance, stored with lower performance/storage priority.</p>
</li>
</ul>
</li>
<li><p><strong>Compress everything</strong>:</p>
<ul>
<li><p>Use gzip/zstd compression for both stored and transmitted logs.</p>
</li>
<li><p>Loki, Elasticsearch, and Vector support compression out of the box.</p>
</li>
</ul>
</li>
</ol>
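<p>To make the filtering and sampling strategies concrete, here is a small sketch that works on a stream of log lines. It assumes logfmt-style lines with a <code>level=</code> field, and the function name is made up for this example. In production you would normally configure the equivalent behavior in your log shipper rather than in a script:</p>
<pre><code class="lang-bash">#!/bin/bash
# Read log lines on stdin. Drop debug and trace lines, keep the first info
# line and then every Nth one after it; all other lines pass through.
# N must be 1 or greater.
filter_and_sample() {
    local every="$1"
    awk -v n="$every" '
        /level=(debug|trace)/ { next }
        /level=info/ { seen++; if ((seen - 1) % n != 0) next }
        { print }
    '
}
</code></pre>
<p>For example, <code>filter_and_sample 10 &lt; pipeline.log</code> would keep every warning and error but only one in ten <code>info</code> lines, which is often enough to see that a stage was alive without storing every heartbeat.</p>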
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>In this handbook, you have built a full-stack observability layer specifically optimized for CI/CD pipelines without breaking your infrastructure budget. You now have the tools and know-how to:</p>
<ul>
<li><p>Deploy Grafana Loki or a lightweight ELK alternative to capture structured logs from all parts of your pipeline.</p>
</li>
<li><p>Unify and enrich logs across CI/CD tools (for example, GitHub Actions, Jenkins, GitLab) using consistent formats and correlation IDs.</p>
</li>
<li><p>Use powerful log queries (LogQL, Kibana Query Language) to diagnose build failures, flaky tests, and deployment issues with precision.</p>
</li>
<li><p>Correlate logs with metrics and traces to gain deep, contextual visibility into pipeline behavior.</p>
</li>
<li><p>Design reusable debugging dashboards and automation that turn raw logs into insights and action.</p>
</li>
<li><p>Build a culture of shared troubleshooting knowledge through post-mortems, runbooks, and log-driven retrospectives.</p>
</li>
</ul>
<p>To see the full-stack observability layer in action, check out the complete code and configurations in my GitHub repository: <a target="_blank" href="https://github.com/Emidowojo/CICDObservability.git">github.com/Emidowojo/CICDObservability</a>. This repo includes all the setups for Grafana Loki, OpenTelemetry, Prometheus, and more, so you can deploy and explore the entire pipeline observability stack.</p>
<h3 id="heading-next-steps-for-advanced-observability-implementation">Next Steps for Advanced Observability Implementation</h3>
<p>Here’s how you can take your setup even further:</p>
<ol>
<li><p><strong>Fully integrate distributed tracing</strong>: Deploy OpenTelemetry agents across your build and deployment stages. This will help you visualize how code, builds, and deployments flow across systems in real time.</p>
</li>
<li><p><strong>Automate diagnostic scripts and alerts</strong>: Build scripts to auto-collect logs and metrics on failure, and trigger alerts when known patterns reoccur. This enables faster detection and even self-healing pipelines.</p>
</li>
<li><p><strong>Scale and harden your log infrastructure</strong>: As usage grows, implement log retention, compaction, and storage policies. Explore scalable backends like ClickHouse or object storage (e.g., S3) for long-term archiving.</p>
</li>
<li><p><strong>Train your team on observability best practices</strong>: Share dashboards, create onboarding docs, and schedule log-analysis sessions to build team familiarity with your tools and practices.</p>
</li>
</ol>
<h3 id="heading-resources-for-continued-learning">📚 Resources for Continued Learning</h3>
<p><strong>Official Docs and Tools:</strong></p>
<ul>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/">Grafana Loki Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/latest/clients/promtail/">Promtail Configuration Guide</a></p>
</li>
<li><p><a target="_blank" href="https://opentelemetry.io/docs/">OpenTelemetry</a></p>
</li>
<li><p><a target="_blank" href="https://grafana.com/docs/loki/latest/logql/">LogQL Syntax</a></p>
</li>
<li><p><a target="_blank" href="https://www.elastic.co/guide/en/kibana/current/kuery-query.html">Kibana Query Language</a></p>
</li>
<li><p><a target="_blank" href="https://vector.dev/docs/">Vector (log forwarding)</a></p>
</li>
</ul>
<p><strong>Communities:</strong></p>
<ul>
<li><p><a target="_blank" href="https://www.reddit.com/r/devops/">r/devops on Reddit</a></p>
</li>
<li><p><a target="_blank" href="https://slack.cncf.io/">CNCF Slack – #observability channel</a></p>
</li>
<li><p><a target="_blank" href="https://stackoverflow.com/questions/tagged/logging">Log Management Best Practices on Stack Overflow</a></p>
</li>
</ul>
<p>By investing in observability early and thoughtfully, you not only reduce the time to detect and resolve issues, you also build a more resilient, predictable, and transparent delivery process for your entire engineering team.</p>
<p>I hope this comes in handy for you someday. If you made it to the end of this handbook, thanks for reading! You can connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> or on X <a target="_blank" href="https://x.com/Emidowojo">@Emidowojo</a> if you’d like to stay in touch.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Public Grafana-based Solar Monitoring Dashboard in Home Assistant ]]>
                </title>
                <description>
                    <![CDATA[ If you have a solar inverter setup, one thing you would agree on with me is that data from your solar inverter setup is really important. Another thing that is also important is having a way to show what your energy generation, consumption, and so on... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-public-grafana-based-solar-monitoring-dashboard-in-home-assistant/</link>
                <guid isPermaLink="false">68010bea439877985a5bf3e2</guid>
                
                    <category>
                        <![CDATA[ iot ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Home Assistant ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ InfluxDB ]]>
                    </category>
                
                    <category>
                        <![CDATA[ solar energy ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Daniel Anomfueme ]]>
                </dc:creator>
                <pubDate>Thu, 17 Apr 2025 14:10:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744899028552/8ad3f3c4-9b25-473d-b539-14dcb2f2b241.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you have a solar inverter setup, you’ll probably agree that the data it produces is really important. Just as important is having a way to show publicly what your energy generation, consumption, and so on look like.</p>
<p>Most solar inverter brands offer some form of remote monitoring platform, from <a target="_blank" href="https://www.victronenergy.com/panel-systems-remote-monitoring/vrm">Victron’s VRM</a> to <a target="_blank" href="https://en.growatt.com/products/growatt-monitoring-platform">Growatt’s ShineServer</a> to <a target="_blank" href="https://www.deyeinverter.com/product/accessory-monitoring-1/smart-pv-management-platform.html">Deye’s Cloud</a>, among others. But I’m a fan of self-hosting and local control of data. A self-hosted dashboard is one of the best ways to visualize and showcase all that beautiful data publicly to fellow tinkerers, solar inverter users, and the general public without relying on the company’s cloud data logger solution.</p>
<p>In this article, we will be using data available in our Home Assistant setup, sending it to <a target="_blank" href="https://www.influxdata.com/products/influxdb/">InfluxDB</a> and making a <a target="_blank" href="https://grafana.com/oss/grafana/">Grafana</a> dashboard out of it. There are a good number of ways to connect your inverter to Home Assistant, depending on the manufacturer. I use a Growatt SPF ES 6000 inverter, and I shared a guide on how to make a local data logger for it that works with Home Assistant <a target="_blank" href="https://hackernoon.com/turn-your-dumb-solar-inverter-into-a-smart-one-with-this-home-assistant-hack">here</a>.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-and-configure-influxdb">How to Install and Configure InfluxDB</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-and-configure-grafana">How to Install and Configure Grafana</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-the-grafana-solar-dashboard">How to Create the Grafana Solar Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-a-new-admin-user-and-delete-the-default-admin-user">How to Create a New Admin User and Delete the Default Admin User</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-enable-remote-access-to-the-solar-dashboard">How to Enable Remote Access to the Solar Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<ul>
<li><p>Home Assistant OS</p>
</li>
<li><p>A domain name</p>
</li>
<li><p>An inverter connected to your Home Assistant instance</p>
</li>
</ul>
<h2 id="heading-how-to-install-and-configure-influxdb">How to Install and Configure InfluxDB</h2>
<p>We’ll start by setting up InfluxDB. InfluxDB is an open-source time series database, which differs from the database that <a target="_blank" href="https://www.home-assistant.io/docs/backend/database/#:~:text=The%20default%20database%20used%20is,other%20databases%20can%20be%20used.">Home Assistant uses by default</a>, SQLite. We will be using InfluxDB v1, as it’s much easier to set up.</p>
<p>Go to your Home Assistant dashboard and go to Settings &gt; Add-ons and click on the Add-On Store.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744463486874/9dda1fca-24a9-4c30-a486-3b723e8535fe.png" alt="A screenshot of Home Assistant Add-ons" class="image--center mx-auto" width="1713" height="1378" loading="lazy"></p>
<p>Inside the Add-on Store, search for “InfluxDB” and click on the add-on. You should see the screen below. Click Install.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744463639772/75f66c35-e7b3-4c20-96ea-9e9154829ac5.png" alt="A screenshot of Home Assistant Add-ons, showing InfluxDB Add-on page" class="image--center mx-auto" width="1703" height="1350" loading="lazy"></p>
<p>Toggle the “Watchdog” on, as this allows the add-on to restart if it crashes. Also, toggle the “show in sidebar” on, which allows you to see the add-on on Home Assistant’s sidebar.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744465515531/0c9e9475-08c2-4bc3-afe5-baa4fdbae164.png" alt="A screenshot of InfluxDB Add-on installed and some configurations turned on" class="image--center mx-auto" width="1698" height="1252" loading="lazy"></p>
<p>Start the add-on and check the logs to be sure it is working. A “Starting NGINX” line in the logs indicates that it is.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744465746577/f3adbd52-14cd-4e78-b2d7-789ad2c22b31.png" alt="A screenshot of InfluxDB Add-on logs" class="image--center mx-auto" width="1721" height="1246" loading="lazy"></p>
<p>Next, go to your Home Assistant sidebar and click on InfluxDB. You need to create a new database to hold your data and also create a new user that has admin privileges to read and write data. Go to the InfluxDB Admin tab.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744466323654/78f21741-e6ca-4094-8fc3-adc563b3dfc1.png" alt="A screenshot of InfluxDB Add-on Admin settings showing database available" class="image--center mx-auto" width="1722" height="1367" loading="lazy"></p>
<p>Click on Create Database – and you can name the database anything you want. I will be naming mine <strong>homeassistant</strong>.</p>
<p>By default, the retention policy for a new database is infinite (data is kept forever), but you can configure it to be any time frame you want. A retention policy defines how long the database keeps data before deleting it. I prefer to stick with infinite retention, as I want to keep as much data as possible and I have enough storage in my Home Assistant hardware for that.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744466625066/c2ae2012-44d8-4acb-91ae-25ad35fb18ff.png" alt="A screenshot of InfluxDB Add-on Admin settings showing the newly created database available" class="image--center mx-auto" width="1725" height="1350" loading="lazy"></p>
<p>Once the database is created, go to the Users tab so you can create the new admin user. Input a username and password for that user and click on Grant Admin, so the permission level can be set to all. I created a new user called <strong>root</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744467230710/6c025cb7-123a-4552-8090-6a9550e64ecf.png" alt="A screenshot of InfluxDB Add-on Admin settings showing users available" class="image--center mx-auto" width="1712" height="1350" loading="lazy"></p>
<p>At this point, what is left on the InfluxDB side is to tell Home Assistant to start sending sensor data to InfluxDB. You can do this by adding the config below to your Home Assistant <strong>configuration.yaml</strong> file. The host is the IP address of your Home Assistant instance, the port is the default port for the InfluxDB add-on, and the remaining values are the ones you chose during setup.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">influxdb:</span>
  <span class="hljs-attr">host:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.8</span><span class="hljs-number">.12</span>
  <span class="hljs-attr">port:</span> <span class="hljs-number">8086</span>
  <span class="hljs-attr">database:</span> <span class="hljs-string">homeassistant</span>
  <span class="hljs-attr">username:</span> <span class="hljs-string">root</span>
  <span class="hljs-attr">password:</span> <span class="hljs-string">password</span>
  <span class="hljs-attr">max_retries:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">default_measurement:</span> <span class="hljs-string">state</span>
</code></pre>
<p>Restart Home Assistant and go to InfluxDB. Click on the Explore tab, and check whether there is a <strong>homeassistant.autogen</strong> entry (the database name you chose, followed by its default retention policy). Click on it, and if you see some values under Measurements &amp; Tags, you are good to go.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744468647521/3082e165-04d9-4b7d-bfd7-8de0662c11a9.png" alt="A screenshot of InfluxDB Add-on Explore tab" class="image--center mx-auto" width="1712" height="1357" loading="lazy"></p>
<h2 id="heading-how-to-install-and-configure-grafana">How to Install and Configure Grafana</h2>
<p>Next on our agenda is to install and configure Grafana. The goal is to have Grafana query InfluxDB and make dashboards based on the queried data.</p>
<p>Go to the Add-on store, search for Grafana, and install it. Remember to toggle on “Watchdog” and “Show in sidebar” as you did for InfluxDB, then start the add-on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744470196074/b3d69925-9ccc-45b9-8078-905866222d15.png" alt="A screenshot of Grafana Add-on page" class="image--center mx-auto" width="1715" height="1386" loading="lazy"></p>
<p>Once it has started, click on Grafana on the sidebar. You will arrive at Grafana’s homepage which is where you can create those dashboards.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744470431232/0e4541a5-344f-4f46-bd85-44f4d92ea53c.png" alt="A screenshot of Grafana Add-on homepage" class="image--center mx-auto" width="1703" height="1346" loading="lazy"></p>
<p>But before you do that, you need to connect InfluxDB to Grafana. Navigate to Grafana’s tab &gt;&gt; Connections. You should see an “Add new connection” page. Search for InfluxDB and choose it. Then click on the add new datasource button.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744470636092/8dcb8f2b-7fcd-49d3-80d8-abce4f83ea07.png" alt="A screenshot of Grafana Add-on connection settings page" class="image--center mx-auto" width="1707" height="1355" loading="lazy"></p>
<p>Under HTTP, edit the URL and use <strong>http://ha_ip_address:8086</strong> – don’t omit the <code>http://</code> or try to use <code>localhost</code> with it. Scroll down to the InfluxDB Details and fill in the data you used while setting up InfluxDB. Then click on Save &amp; Test. If the config is correct, you should see a green tick and text saying “datasource is working…measurements found.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744471708830/5b0d0306-8b74-4c02-9163-cc23a7c3425c.png" alt="A screenshot of Grafana Add-on connection configuration for InfluxDB" class="image--center mx-auto" width="1700" height="1337" loading="lazy"></p>
<h2 id="heading-how-to-create-the-grafana-solar-dashboard">How to Create the Grafana Solar Dashboard</h2>
<p>With that, you should have InfluxDB running and connected to Grafana. Let’s get to building beautiful dashboards out of all the data being generated. This part is subjective, so you can feel free to edit and modify the design to your taste. We will be using this dashboard <a target="_blank" href="https://helio.openculture.org.ng/public-dashboards/cf813bfa739044129e125bdd65db7a65?ref=blog.openculture.org.ng">here</a> as the inspiration for our design.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744548823070/811cb6b1-d3f6-4880-b665-8af999d4c703.png" alt="A screenshot of a dashboard we want to recreate" class="image--center mx-auto" width="1918" height="941" loading="lazy"></p>
<p>So now go to your Grafana in Home Assistant, click on the + icon and create a new dashboard.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744544792278/0bbc6140-1335-4597-a972-80a5ddee1744.png" alt="A screenshot of Grafana homepage" class="image--center mx-auto" width="1721" height="1340" loading="lazy"></p>
<p>In Grafana, a dashboard is the full canvas, and each visualization placed on it is a panel.</p>
<p>Let’s create a new panel. Pick InfluxDB as the data source, and in the <strong>FROM</strong> row, pick W, the measurement that holds the sensors reported in watts. In the <strong>WHERE</strong> row, choose entity_id::tag, which filters the values by Home Assistant sensor entity name. Then pick the entity ID of your sensor – mine is <strong>growatt_pv1_charge_power</strong>. You can change the panel title, switch the visualization to Stat, set the unit to watts, and set the base colour to yellow.</p>
<p>The raw query looks like this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> mean(<span class="hljs-string">"value"</span>) <span class="hljs-keyword">FROM</span> <span class="hljs-string">"W"</span> <span class="hljs-keyword">WHERE</span> (<span class="hljs-string">"entity_id"</span>::tag = <span class="hljs-string">'growatt_pv1_charge_power'</span>) <span class="hljs-keyword">AND</span> $timeFilter <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-built_in">time</span>($__interval) fill(<span class="hljs-literal">null</span>)
</code></pre>
<p>The Grafana edit page looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744550024020/bb50d494-c6bc-45f2-8ff0-17173e7255bc.png" alt="A screenshot of Grafana edit panel view" class="image--center mx-auto" width="1710" height="1258" loading="lazy"></p>
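<p>If a panel stays empty, it can help to run the query outside Grafana. The sketch below sends a similar query straight to InfluxDB. Since <code>$timeFilter</code> and <code>$__interval</code> are Grafana macros, they are replaced here with an explicit one-hour range and five-minute buckets, both of which are arbitrary example values. The host and credentials are the example ones from earlier.</p>
<pre><code class="lang-bash"># Assumption: InfluxDB 1.x, example host and credentials; adjust the entity ID to your own sensor
curl -G "http://192.168.8.12:8086/query" \
  -u root:password \
  --data-urlencode "db=homeassistant" \
  --data-urlencode "q=SELECT mean(\"value\") FROM \"W\" WHERE \"entity_id\" = 'growatt_pv1_charge_power' AND time &gt; now() - 1h GROUP BY time(5m) fill(null)"
</code></pre>
<p>If this returns data but the panel does not, the problem is more likely in the panel or data source settings than in InfluxDB.</p>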
<p>At this point, you should be able to recreate the remaining parts of the dashboard. But I manually did all that, so you don’t have to go through it all yourself if you don’t want to.</p>
<p><a target="_blank" href="https://github.com/LifeofDan-EL/Grafana-Solar-Dashboard">Here</a> is the link to a GitHub repo that has the JSON file of this dashboard. When you go to create a dashboard, you will see an option to import from a JSON file. You can choose to copy and paste or upload the file, whichever works for you.</p>
<p>After importing, you only need to edit each panel through the GUI to use your own entity ID tag in Home Assistant and also the UID of your InfluxDB database.</p>
<p>Here is a picture of my finished result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744559385614/57b58bfe-7fc2-4fa7-a0d2-19485558899a.png" alt="A screenshot of the finished product of the dashboard I built" class="image--center mx-auto" width="3436" height="1357" loading="lazy"></p>
<h2 id="heading-how-to-create-a-new-admin-user-and-delete-the-default-admin-user">How to Create a New Admin User and Delete the Default Admin User</h2>
<p>By default, the Grafana Add-on in Home Assistant uses an auth proxy and creates a default user (<code>admin</code>) with a password (<code>hassio</code>) that's synced with your HA login session. This prevents password or user changes through the UI.</p>
<p>For context, an auth proxy, or authentication proxy, acts as an intermediary between a client and a target resource, handling authentication and authorization on behalf of the client.</p>
<p>As a security step, we need to create a new user for the Grafana Add-on and edit their permission to have admin privileges, then delete the default admin user. This is because you can’t change the default admin user password on the Add-on.</p>
<p>Go to Grafana’s menu &gt; Administration &gt; Users and access &gt; Users. Then create a new user.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744571233206/daf2c674-4501-4774-89d9-76992227b531.png" alt="A screenshot of Grafana users setting page" class="image--center mx-auto" width="1721" height="1332" loading="lazy"></p>
<p>Next, give it admin privileges. Edit Grafana Admin to be yes and make sure the organization role is set to admin, then save.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744749794592/21e85eab-99ca-4c7d-9d18-19e9f31ea1da.png" alt="A screenshot of Grafana user setting" class="image--center mx-auto" width="1715" height="1387" loading="lazy"></p>
<p>Go back to the Add-on Configuration tab. Scroll to the Network setting and add a port to expose the Add-on. I will be using port 3000. Save and restart the Add-on. If you have SSL turned on and it isn’t configured, the add-on won’t start. You can disable it as we will have Cloudflare handle that.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744750541094/326aa725-7b81-4407-82ef-e6589a153b9c.png" alt="A screenshot of Grafana Add-on Configuration tab" class="image--center mx-auto" width="1712" height="1388" loading="lazy"></p>
<p>To confirm that the port has been exposed properly, go to <code>http://ha_ip:3000/</code> and confirm you see this Grafana login screen. Make sure it is http and not https.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744563139292/e17d2691-3b77-4faf-997d-95a0b3141066.png" alt="A screenshot of Grafana homepage accessed from outside Home Assistant url" class="image--center mx-auto" width="1698" height="1382" loading="lazy"></p>
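<p>You can also confirm the exposed port from a terminal on the same network by calling Grafana’s health endpoint. This is a small sketch, where <code>ha_ip</code> is the same placeholder used above.</p>
<pre><code class="lang-bash"># Replace ha_ip with the IP address of your Home Assistant machine
curl -s http://ha_ip:3000/api/health
</code></pre>
<p>A short JSON response that reports the database status indicates that Grafana is reachable on the port you exposed.</p>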
<p>Log in as the new user you created. Then go to your list of users and delete the default admin user.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744750661705/1d98e960-e325-4a5c-9746-64ff1819a573.png" alt="A screenshot of edit to the default admin user" class="image--center mx-auto" width="1718" height="1343" loading="lazy"></p>
<p>After that, go back to the Grafana Add-on Configuration tab. Click on the 3 dots on the Options row and choose Edit in YAML. Then add this line below to your configuration file and save.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">grafana_ingress_user:</span> <span class="hljs-string">usernameofnewuser</span>
</code></pre>
<h2 id="heading-how-to-enable-remote-access-to-the-solar-dashboard">How to Enable Remote Access to the Solar Dashboard</h2>
<p>At this point, we have the solar dashboard all ready and we can access it in Home Assistant while inside our home network. But we don’t want it only that way. We want anyone to be able to visit the link without having access to our home network.</p>
<p>I will be implementing this part with the aid of a Home Assistant Cloudflared Add-on that leverages Cloudflare Tunnel. Here is the <a target="_blank" href="https://github.com/brenner-tobias/addon-cloudflared">Github repository</a> – the installation is simple and stress-free.</p>
<p>After going through the setup and having remote access to your Home Assistant network (remember to have 2FA turned on), go to the Cloudflared Add-on configuration tab and edit the Additional Hosts part.</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">hostname:</span> <span class="hljs-string">subdomain_you_want.your_domain.xyz</span>
  <span class="hljs-attr">service:</span> <span class="hljs-string">http://ha_ip:3000</span>
  <span class="hljs-attr">disableChunkedEncoding:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Save and restart the Add-on and check the logs. You should see it creating a DNS entry for the hostname you added.</p>
<p>As another security step, go to your Grafana Add-on Configuration tab. Add these values to the environment variables.</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GF_AUTH_ANONYMOUS_ENABLED</span>
  <span class="hljs-attr">value:</span> <span class="hljs-string">"true"</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GF_AUTH_ANONYMOUS_ORG_ROLE</span>
  <span class="hljs-attr">value:</span> <span class="hljs-string">"Viewer"</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GF_AUTH_DISABLE_LOGIN_FORM</span>
  <span class="hljs-attr">value:</span> <span class="hljs-string">"true"</span>
</code></pre>
<ul>
<li><p><code>GF_AUTH_ANONYMOUS_ENABLED</code>: Anyone who visits Grafana without logging in will still be allowed in.</p>
</li>
<li><p><code>GF_AUTH_ANONYMOUS_ORG_ROLE</code>: This sets the default permission for anonymous users. In this case, anonymous users will have the viewer role.</p>
</li>
<li><p><code>GF_AUTH_DISABLE_LOGIN_FORM</code>: Disables the login form on the Grafana login page. Make sure you are already logged in on the remote hostname. But you can always edit this on the Add-on Configuration tab if you get locked out.</p>
</li>
</ul>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Finally, go to the remote hostname for your Grafana and you should see the Grafana home page. Then go to your dashboards and click on the solar dashboard you created. Share it and choose the public option. Now you can share that link (the URL on that page and not the actual copied URL from the share button) with anyone, and they can see your beautiful dashboard.</p>
<p>This method serves as an all-in-one way of having everything done through your Home Assistant machine. I hope you had fun tinkering. See you next time.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Monitor Your Kubernetes Clusters with Prometheus and Grafana on AWS ]]>
                </title>
                <description>
                    <![CDATA[ Creating a solid application monitoring and observability strategy is a critical foundational step when deploying infrastructure or software in any environment. Monitoring ensures that your systems are running smoothly, while observability provides i... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kubernetes-cluster-observability-with-prometheus-and-grafana-on-aws/</link>
                <guid isPermaLink="false">6790382dcb2eedbe449b6899</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloudnative ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Eti Ijeoma ]]>
                </dc:creator>
                <pubDate>Wed, 22 Jan 2025 00:13:33 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737504669572/017570c6-7676-44e1-aa19-4257dd7d30e7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Creating a solid application monitoring and observability strategy is a critical foundational step when deploying infrastructure or software in any environment. Monitoring ensures that your systems are running smoothly, while observability provides insights into the internal state of your application through the data generated. Together, they help you detect and address issues proactively rather than reacting after a failure occurs.</p>
<p>In Kubernetes environments, the complexity of managing distributed microservices can be challenging. For instance, an application usually spans multiple pods, nodes, and clusters. Because of Kubernetes’s dynamic nature, where pods are frequently created and terminated, proper monitoring and observability are ideal for capturing its fleeting behavior.</p>
<p>Imagine building a microservices application with several connected services handling critical components such as authentication, payments, and databases without proper monitoring. A sudden traffic spike could affect a single service, cascading to other services, causing the system to crash and resulting in downtime.</p>
<p>Without proper visibility, you may struggle to find the root cause of the issue. You may spend hours manually going through logs – and meanwhile, users are frustrated, and businesses are losing revenue and customer trust.</p>
<p>Before we begin the project, you’ll learn key monitoring and observability concepts, as well as why tools like Prometheus and Grafana are crucial for setting up a robust monitoring stack on your Kubernetes infrastructure.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-understanding-monitoring-and-observability">Understanding Monitoring and Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tools-for-monitoring-and-observability">Tools for Monitoring and Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-prometheus-and-grafana-on-aws-eks-using-helm">How to Deploy Prometheus and Grafana on AWS EKS using Helm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-understanding-monitoring-and-observability">Understanding Monitoring and Observability</h2>
<p>Implementing a proper monitoring and observability approach is important in fast-paced production Kubernetes environments. This helps in situations where downtime can lead to serious business loss and damage to customer trust. It’ll hopefully help you avoid the dreaded 2 am calls that are usually triggered by alert noise so you can focus on adding more innovative features to your software (rather than spending so much energy firefighting).</p>
<p>Monitoring and observability are often referred to as the same thing. But they serve two different purposes, especially for development and engineering teams.</p>
<h3 id="heading-monitoring">Monitoring</h3>
<p>In the software development lifecycle, monitoring is the practice of analyzing data in real time or reviewing data trends to ensure the health and performance of systems, infrastructure, and applications. Monitoring acts as the eyes and ears of your IT operations, collecting insightful data and presenting it in a way that is actionable.</p>
<p>If you have visited the IT department of a well-established organization, you have most likely seen large screens displaying colorful dashboards with charts and real-time statistics. This offers a centralized view of the key metrics, such as the server uptime, application response times, and resource usage.</p>
<h3 id="heading-observability">Observability</h3>
<p>Observability helps to address issues that haven’t been anticipated. These are usually called “unknown unknowns.” Unlike monitoring, which deals with predefined parameters and data, observability goes deeper into the application to give a broader view.</p>
<p>This helps to answer not only what is happening within your system but also why it is happening. It also uses patterns within the system and application operations to detect and resolve issues efficiently.</p>
<p>Observability revolves around the three pillars of data: <strong>metrics</strong>, <strong>logs</strong>, and <strong>traces</strong>.</p>
<h4 id="heading-1-metrics">1. Metrics</h4>
<p>Metrics consist of time-series measurements such as CPU usage and memory consumption. These data points help teams to manage, optimize, and predict system performance and deviations from expected behavior.</p>
<h4 id="heading-2-logs">2. Logs</h4>
<p>Logs serve as a history of what happened within the system. They are a trail for engineers, especially during troubleshooting. Logs are important in diagnosing root causes and discovering malicious activities.</p>
<h4 id="heading-3-traces">3. Traces</h4>
<p>Traces provide insights into application workflows by tracking requests as they move through various components. They are good for highlighting latency issues and potential points of failure.</p>
<h2 id="heading-tools-for-monitoring-and-observability">Tools for Monitoring and Observability</h2>
<p>Now that you understand the theory behind monitoring and observability, you may be wondering what platforms and tools are available to developers to collect data and get insights about their services.</p>
<p>In the world of cloud-native infrastructure and Kubernetes, many users gravitate towards the popular stack of Prometheus and Grafana.</p>
<h3 id="heading-prometheus">Prometheus</h3>
<p>Prometheus is an open-source tool that specializes in collecting metrics as time-series data, so each value is stored with the timestamp at which it was recorded.</p>
<p>The Prometheus ecosystem includes the main Prometheus server, which scrapes and stores time-series data, Alertmanager for handling alerts, the Pushgateway for accepting metrics from short-lived jobs, and exporters for collecting metrics from various services connected to the cluster.</p>
<p>Prometheus suits both machine-centric and application-centric monitoring, especially for microservices in a Kubernetes cluster. It’s designed to be the system you go to if there is a system outage and you need to quickly diagnose problems.</p>
<h3 id="heading-grafana">Grafana</h3>
<p>Grafana is a visualization tool that transforms, queries, visualizes, and sets alerts on raw metrics stored in Prometheus. With Grafana, you can explore metrics and logs wherever they are stored and display the data on live dashboards. This allows teams to monitor system performance, identify trends, and act quickly on anomalies.</p>
<p>Prometheus and Grafana are compatible with containerized applications, especially in Kubernetes environments, and they can also monitor workloads outside Kubernetes for flexibility. Both are open-source tools that give developers control over the implementation. There is no licensing cost, which helps teams that cannot afford expensive commercial solutions.</p>
<p>By combining Prometheus and Grafana, your team gets helpful insights into the system to optimize performance, track errors, and aid troubleshooting processes.</p>
<h2 id="heading-how-to-deploy-prometheus-and-grafana-on-aws-eks-using-helm">How to Deploy Prometheus and Grafana on AWS EKS using Helm</h2>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>For this project, we will use an EC2 instance with the Ubuntu 22.04 operating system. If you are using Windows or a Mac, log into AWS to create your virtual machine.</p>
<p>Here’s what else you’ll need:</p>
<h4 id="heading-1-aws-account-setup-with-access-keys-and-secret-keys">1. AWS account setup with access keys and secret keys</h4>
<ul>
<li><p><a target="_blank" href="https://portal.aws.amazon.com/billing/signup">AWS Sign Up</a></p>
</li>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">AWS Access ID and Secret Keys</a></p>
</li>
</ul>
<h4 id="heading-2-knowledge-of-kubernetes">2. Knowledge of Kubernetes</h4>
<ul>
<li><a target="_blank" href="https://kubernetes.io/docs/home/">Kubernetes Official Documentation</a></li>
</ul>
<h4 id="heading-3-aws-cli-installation-for-the-virtual-server">3. AWS CLI installation for the virtual server</h4>
<ul>
<li><a target="_blank" href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html">AWS CLI Installation Guide</a></li>
</ul>
<h3 id="heading-getting-started">Getting Started</h3>
<p>Let’s start by setting up an EKS cluster on a virtual server and installing the required tools on the server. Then, we’ll deploy our monitoring tools, Prometheus and Grafana, using Helm charts. In the end, we’ll deploy an NGINX web application on Kubernetes and use Grafana to visualize the pod performance and cluster resource usage on the cluster.</p>
<h3 id="heading-step-1-install-aws-cli-eksctl-kubectl-and-helm">Step 1: Install AWS CLI, <code>eksctl</code>, <code>kubectl</code>, and Helm</h3>
<p>AWS CLI is a tool that allows users to interact with AWS services using the command-line interface. It makes the management of cloud resources simpler and enables admins to configure AWS services.</p>
<p>Here, we will install AWS CLI on our server to be able to create Kubernetes resources.</p>
<p>On your server, run the following commands:</p>
<pre><code class="lang-bash">curl <span class="hljs-string">"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"</span> -o <span class="hljs-string">"awscliv2.zip"</span>

sudo apt install unzip

unzip awscliv2.zip

sudo ./aws/install
</code></pre>
<p>Verify the installation by running this:</p>
<pre><code class="lang-bash">aws --version
</code></pre>
<p>After installation, configure the AWS CLI with your credentials using the following command:</p>
<pre><code class="lang-bash">aws configure
</code></pre>
<p>You will be prompted to enter your AWS Access Key ID, Secret Access Key, Default region name, and default output format.</p>
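<p>Before moving on, it is worth checking that the credentials actually work. One way to do that, shown here as a quick sketch, is to ask AWS which identity the CLI is using:</p>
<pre><code class="lang-bash"># Prints the account ID, user ID, and ARN tied to the configured credentials
aws sts get-caller-identity
</code></pre>
<p>If this returns an error instead of your account details, re-run <code>aws configure</code> and check the keys and region you entered.</p>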
<p>Next, we need to install <code>eksctl</code>, a command-line tool that simplifies the creation and management of Kubernetes clusters on AWS. It helps you configure, create, and maintain clusters.</p>
<p>This tool removes the complexities of setting up a production-grade cluster, helping you and your admins focus only on application development and deployment.</p>
<p>To set up <code>eksctl</code> on your machine, download the latest release using the following command:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># for ARM systems, set ARCH to: arm64, armv6 or armv7</span>
ARCH=amd64

PLATFORM=$(uname -s)_<span class="hljs-variable">$ARCH</span>

curl -sLO <span class="hljs-string">"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_<span class="hljs-variable">$PLATFORM</span>.tar.gz"</span>

<span class="hljs-comment"># (Optional) Verify checksum</span>

curl -sL <span class="hljs-string">"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt"</span> | grep <span class="hljs-variable">$PLATFORM</span> | sha256sum --check

tar -xzf eksctl_<span class="hljs-variable">$PLATFORM</span>.tar.gz -C /tmp &amp;&amp; rm eksctl_<span class="hljs-variable">$PLATFORM</span>.tar.gz

sudo mv /tmp/eksctl /usr/<span class="hljs-built_in">local</span>/bin
</code></pre>
<p>Run <code>eksctl version</code> to confirm its successful installation and the version downloaded.</p>
<pre><code class="lang-bash">eksctl version <span class="hljs-comment"># 0.198.0</span>
</code></pre>
<p>Next, we’ll install <code>kubectl</code>, which is a command-line interface for managing and interacting with Kubernetes clusters. It enables users to deploy and manage applications within a Kubernetes environment.</p>
<p>With <code>kubectl</code>, you can perform various crucial operations such as scaling deployments, inspecting cluster status, and managing networking.</p>
<p>To install <code>kubectl</code>, run the following commands:</p>
<pre><code class="lang-bash">curl -LO <span class="hljs-string">"https://dl.k8s.io/release/<span class="hljs-subst">$(curl -L -s https://dl.k8s.io/release/stable.txt)</span>/bin/linux/amd64/kubectl"</span>
chmod +x ./kubectl
sudo mv ./kubectl /usr/<span class="hljs-built_in">local</span>/bin
</code></pre>
<p>Run <code>kubectl version</code> on your command line to confirm it has been installed successfully. The server version only appears once a cluster is reachable:</p>
<pre><code class="lang-bash">kubectl version
<span class="hljs-comment"># Client Version: depends on the release you downloaded</span>
<span class="hljs-comment"># Kustomize Version: v5.4.2</span>
<span class="hljs-comment"># Server Version: v1.30.7-eks-56e63d8</span>
</code></pre>
<p>Finally, we’ll install Helm, which is a Kubernetes package manager that simplifies the deployment and management of applications in Kubernetes. It uses <a target="_blank" href="https://helm.sh/docs/topics/charts/"><strong>charts</strong></a> to package Kubernetes resource definitions as a collection of files, handles templating and versioning, and makes application deployment easier.</p>
<p>Here, we will install the Helm package manager on our virtual machine for our cluster deployments. The first command below downloads the installation script and saves it in the <code>get_helm.sh</code> file.</p>
<p>Next, <code>chmod 700</code> makes the file readable, writable, and executable by its owner only. Finally, the script is executed using the <code>./get_helm.sh</code> command.</p>
<pre><code class="lang-bash">curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3

chmod 700 get_helm.sh

./get_helm.sh
</code></pre>
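<p>As with the other tools, a quick version check confirms that the install worked. The exact version printed depends on when you run the script.</p>
<pre><code class="lang-bash"># Prints the installed Helm client version
helm version --short
</code></pre>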
<h3 id="heading-step-2-create-a-kubernetes-cluster">Step 2: Create a Kubernetes Cluster</h3>
<p>Next, we need to create our Kubernetes cluster in AWS with the <code>eksctl</code> command line. We can do this with the following command:</p>
<pre><code class="lang-bash">eksctl create cluster --name my-prac-cluster-1 --version 1.30 --region us-east-1 --nodegroup-name worker-nodes --node-type t2.medium --nodes 2 --nodes-min 2 --nodes-max 3
</code></pre>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf-np7NdxK-zwQLtYoVE51RUoHYMf5GBHT1tAEjV-eoxk2xCvH13s0tjzPdIb8QVt5amijSBjkpaAh4AuPQd4DJvtnKmMpPQ1_dFoRx1KRNRiCto0U7CpXU-rsd-KH8NhuQoHfRAg?key=U2Fi6zvcj43zRMwXR2oKCzLU" alt="Terminal window in Visual Studio Code displaying a series of commands and outputs related to setting up an EKS cluster. The log includes status updates, such as &quot;creating addon&quot; and &quot;EKS cluster resources have been created.&quot; The environment is Ubuntu, visible through the desktop interface and application icons on the left." width="1366" height="768" loading="lazy"></p>
<p>Let’s break down the command:</p>
<ul>
<li><p><code>--name my-prac-cluster-1</code>: This specifies the name of the EKS cluster that will be created. In this case, the cluster will be named <strong>my-prac-cluster-1</strong>.</p>
</li>
<li><p><code>--version 1.30</code>: This sets the Kubernetes version for the cluster. Here, the version will be 1.30.</p>
</li>
<li><p><code>--region us-east-1</code>: This specifies the AWS region where the cluster will be provisioned. Here, it is set to us-east-1.</p>
</li>
<li><p><code>--nodegroup-name worker-nodes</code>: This defines the name of the node group that will be created. In this case, it’s named <strong>worker-nodes</strong>.</p>
</li>
<li><p><code>--node-type t2.medium</code>: This sets the instance type for the worker nodes in the node group.</p>
</li>
<li><p><code>--nodes 2</code>: This sets the desired number of worker nodes in the node group.</p>
</li>
<li><p><code>--nodes-min 2</code>: This sets the minimum number of worker nodes that should be maintained in the node group to 2.</p>
</li>
<li><p><code>--nodes-max 3</code>: This defines the maximum number of worker nodes allowed in the node group and sets it to 3.</p>
</li>
</ul>
<p>Once the cluster comes up, run the command <code>kubectl get nodes</code> to ensure that the cluster is set up properly.</p>
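<p>If you would rather keep the cluster definition in a file, <code>eksctl</code> also accepts a config file. The sketch below is a rough equivalent of the command above and is meant to be used instead of it, not in addition to it. The file name <code>cluster.yaml</code> is an arbitrary choice, and the sketch assumes a managed node group.</p>
<pre><code class="lang-bash"># Write an eksctl ClusterConfig that mirrors the flags used above
cat &gt; cluster.yaml &lt;&lt;'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-prac-cluster-1
  region: us-east-1
  version: "1.30"
managedNodeGroups:
  - name: worker-nodes
    instanceType: t2.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 3
EOF

# Create the cluster from the file
eksctl create cluster -f cluster.yaml
</code></pre>
<p>Keeping the definition in a file makes it easier to review changes and recreate the same cluster later.</p>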
<h3 id="heading-step-3-install-the-metrics-server">Step 3: Install the Metrics Server</h3>
<p>The Metrics Server is a component that collects resource usage data from the kubelet on each node in the cluster. It covers CPU and memory usage and exposes it through the Kubernetes Metrics API, which tools such as <code>kubectl top</code> and the Horizontal Pod Autoscaler read. It is lightweight and easy to deploy.</p>
<p>Run the following command to install the Metrics Server:</p>
<pre><code class="lang-bash">kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
</code></pre>
<p>To verify the installation, run the following command:</p>
<pre><code class="lang-bash">kubectl get deployment metrics-server -n kube-system
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734913540206/47361106-21b0-4076-91c2-ffb8a925b372.jpeg" alt="A terminal screenshot displaying the command output for `kubectl get deployment metrics-server -n kube-system`. It shows the deployment details of &quot;metrics-server&quot; with readiness, up-to-date and available statuses as 1, and an age of 51 minutes." class="image--center mx-auto" width="1122" height="234" loading="lazy"></p>
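<p>Once the deployment reports as available, you can check that metrics are actually being served. It may take a minute or two after installation before these commands return data.</p>
<pre><code class="lang-bash"># CPU and memory usage per node
kubectl top nodes

# CPU and memory usage per pod, across all namespaces
kubectl top pods --all-namespaces
</code></pre>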
<h3 id="heading-step-4-install-the-iam-oidc-identity-provider-and-amazon-ebs-csi-driver">Step 4: Install the IAM OIDC Identity Provider and Amazon EBS CSI Driver</h3>
<p>The IAM OpenID Connect (OIDC) provider lets Kubernetes service accounts in the cluster assume IAM roles, which gives workloads access to AWS resources. Here, we need EBS volumes to create persistent storage for the Prometheus pods.</p>
<p>Run the following command to create the IAM OIDC provider:</p>
<pre><code class="lang-bash">eksctl utils associate-iam-oidc-provider --cluster my-prac-cluster-1 --approve
</code></pre>
<p>Next, we will create the IAM role for the Amazon EBS CSI driver, which gives the driver permission to manage EBS volumes on behalf of the cluster. Replace <code>my-prac-cluster-1</code> with your cluster name.</p>
<pre><code class="lang-bash">eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-prac-cluster-1 \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve
</code></pre>
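<p>Now, we need to add the AWS EBS CSI driver add-on to the cluster using the following command. Replace <code>&lt;cluster_name&gt;</code> and <code>&lt;AWS_ACCOUNT_ID&gt;</code> with your own values:</p>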
<pre><code class="lang-bash">eksctl create addon --name aws-ebs-csi-driver --cluster &lt;cluster_name&gt; --service-account-role-arn arn:aws:iam::&lt;AWS_ACCOUNT_ID&gt;:role/AmazonEKS_EBS_CSI_DriverRole --force
</code></pre>
<p>Adding the AWS EBS CSI driver enables the cluster to dynamically create and manage EBS volumes. Since our Prometheus installation needs persistent volumes, this add-on is what allows it to persist its data on EBS.</p>
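<p>If you are unsure of your account ID, the AWS CLI can print it. The sketch below assumes the AWS CLI is configured with the same credentials you used for <code>eksctl</code>, and it reuses the cluster name from this tutorial:</p>
<pre><code class="lang-bash"># Print the 12-digit AWS account ID of the current credentials
aws sts get-caller-identity --query Account --output text

# Confirm that the add-on was registered and check its status
eksctl get addon --name aws-ebs-csi-driver --cluster my-prac-cluster-1
</code></pre>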
<h3 id="heading-step-5-install-prometheus-and-grafana">Step 5: Install Prometheus and Grafana</h3>
<p>To install Prometheus and Grafana, we first need to add the Helm stable charts repository to the local Helm client.</p>
<p>Run the command below:</p>
<pre><code class="lang-bash">helm repo add stable https://charts.helm.sh/stable
</code></pre>
<p>Next, we will add the Prometheus Helm repo:</p>
<pre><code class="lang-bash">helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
</code></pre>
<p>We’ll use the Prometheus community version because it is well-maintained by the Prometheus community. It offers faster updates and continuous improvements for different Kubernetes environments.</p>
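<p>After adding a repository, it is usually worth refreshing the local chart index so that Helm sees the most recent chart versions. A small sketch:</p>
<pre><code class="lang-bash"># Refresh the local cache of charts from all configured repositories
helm repo update

# Show the kube-prometheus-stack chart that Helm now knows about
helm search repo prometheus-community/kube-prometheus-stack
</code></pre>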
<p>Next, create the Prometheus namespace:</p>
<pre><code class="lang-bash">kubectl create namespace prometheus
</code></pre>
<p>Install Prometheus and Grafana through the <code>kube-prometheus-stack</code> Helm Chart:</p>
<pre><code class="lang-bash">helm install stable prometheus-community/kube-prometheus-stack -n prometheus
</code></pre>
<p>When that’s done, verify that the Prometheus deployment and service are installed by using the command below:</p>
<pre><code class="lang-bash">kubectl get all -n prometheus
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734913921541/9b851fcc-c390-4d86-9108-adaaddd58d07.jpeg" alt="Terminal output showing Kubernetes resources in the &quot;prometheus&quot; namespace. It lists several pods and services, each with information on readiness, status, restarts, age, type, cluster IP, and ports. All pods are in the &quot;Running&quot; status, with no restarts." class="image--center mx-auto" width="1135" height="597" loading="lazy"></p>
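<p>At this stage, the Prometheus service is of type <code>ClusterIP</code>, so it is only reachable from inside the cluster. To expose it, change the service <code>type</code> from <code>ClusterIP</code> to <code>LoadBalancer</code> by editing the service manifest with the command below:</p>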
<pre><code class="lang-bash">kubectl edit svc stable-kube-prometheus-sta-prometheus -n prometheus
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734914216836/217e2e5e-1d75-4ca4-88eb-56276f25c12f.jpeg" alt=" Screenshot of a Kubernetes service YAML file edited with kubectl for Prometheus in the prometheus namespace." class="image--center mx-auto" width="1172" height="584" loading="lazy"></p>
<p>After the update, a LoadBalancer URL will be generated for you to access your Prometheus Dashboard.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcWtK8G1cJTD8GI-vACe_KexsXIp5SEXYiGTuBXZNQJ93tpo-XrMVb6ekA1ZRDrAODyFn29p3JdQDOHCdMnq2eapX4drdLMJ_8u_B8z1Jl0LqJjIHJwwIbDhgRUU5tlkGhhnBdYKQ?key=U2Fi6zvcj43zRMwXR2oKCzLU" alt=" Prometheus dashboard showing the &quot;Targets&quot; page with active scrape pools, including details such as endpoints, state, labels, last scrape, scrape duration, and errors for Prometheus Alertmanager services." width="1366" height="768" loading="lazy"></p>
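<p>If you prefer not to edit the manifest interactively, the same change can be sketched with <code>kubectl patch</code>. This assumes the Prometheus service name used in this step, which depends on the Helm release being named <code>stable</code>:</p>
<pre><code class="lang-bash"># Switch the Prometheus service type to LoadBalancer without opening an editor
kubectl patch svc stable-kube-prometheus-sta-prometheus -n prometheus \
  -p '{"spec": {"type": "LoadBalancer"}}'

# Print the load balancer hostname once AWS has provisioned it
kubectl get svc stable-kube-prometheus-sta-prometheus -n prometheus \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'; echo
</code></pre>
<p>With this chart's default settings, the Prometheus service listens on port 9090, so the dashboard should be reachable at that port on the printed hostname.</p>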
<p>Next, we’ll move over to Grafana. Update the Grafana service by changing its <code>type</code> from <code>ClusterIP</code> to <code>LoadBalancer</code> to expose it to the public, using the command below:</p>
<pre><code class="lang-bash">kubectl edit svc stable-grafana -n prometheus
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734914435224/1ab76b93-2b35-4074-92cb-c7c3173edaee.jpeg" alt=" Screenshot of a Kubernetes service YAML configuration file edited using kubectl edit svc stable-grafana -n prometheus, displaying the details for the Grafana service in the prometheus namespace" class="image--center mx-auto" width="1144" height="601" loading="lazy"></p>
<p>Once the settings are saved, you can use the <code>LoadBalancer</code> link to access your Grafana Dashboard from the browser. The username is <strong>admin</strong>. To get the login password printed in the terminal, run the following command:</p>
<pre><code class="lang-bash">kubectl get secret --namespace prometheus stable-grafana -o jsonpath=<span class="hljs-string">"{.data.admin-password}"</span> | base64 --decode ; <span class="hljs-built_in">echo</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734915573534/12b64c7f-b74b-4ebb-8a1b-ae6c5357f76a.jpeg" alt="Grafana login screen displaying input fields for email/username and password with a notification indicating a successful login." class="image--center mx-auto" width="1276" height="629" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734915606224/bc06a48b-75e8-4115-816b-462c69975d97.jpeg" alt="Grafana welcome dashboard after a successful authentication" class="image--center mx-auto" width="1243" height="636" loading="lazy"></p>
<p>After successfully logging into the Grafana dashboard, the first step is to create a data source that will provide the metrics for the Grafana visualization.</p>
<p>Go to <strong>Add your first data source</strong> and choose Prometheus as the Data Source.</p>
<p>Insert the Prometheus URL, and click on “<strong>Save and Test</strong>”. It should show success if Grafana queries the Prometheus URL successfully.</p>
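<p>If you are unsure which URL to enter, one option is the in-cluster DNS name of the Prometheus service, since Grafana runs in the same cluster. The sketch below assumes the service name and namespace used earlier in this tutorial and the default Prometheus port of 9090:</p>
<pre><code class="lang-bash"># Confirm the Prometheus service name and port
kubectl get svc stable-kube-prometheus-sta-prometheus -n prometheus

# In-cluster URL that follows the pattern http://SERVICE.NAMESPACE.svc:PORT
echo "http://stable-kube-prometheus-sta-prometheus.prometheus.svc:9090"
</code></pre>
<p>The external LoadBalancer URL of the Prometheus service works as well, as long as it includes the port.</p>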
<p>The next step is to create a dashboard for viewing the metrics of our pods. To do so, click on “<strong>Dashboards</strong>” and then on “<strong>Add Visualization.</strong>”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734915769650/10f7877b-7183-4709-b672-ff6d67ef529d.jpeg" alt="Screenshot of a configuration interface showing options for custom query parameters and HTTP method (set to POST) for Prometheus data source. Confirmation message states, &quot;Successfully queried the Prometheus API,&quot; with options to delete or save &amp; test." class="image--center mx-auto" width="1293" height="627" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734915836669/3fb27ce9-23d9-4c35-abf6-7ce44508720d.jpeg" alt="Grafana Dashboard interface with options to add a visualization, import a panel, or import a dashboard. There is a prominent button for adding a visualization." class="image--center mx-auto" width="1263" height="622" loading="lazy"></p>
<p>You’ll be taken to a page where you can import a dashboard. Enter the dashboard ID “<strong>15760</strong>” and click on <strong>Load</strong>. Then select “<strong>Prometheus-1</strong>” as the data source and import the dashboard to view our pods.</p>
<p>Once the import completes, you will see your newly created dashboard.</p>
<p>Here, we can see cluster-wide data, including CPU and RAM usage, as well as data about the pods in a specified namespace.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734917376090/9e0097f4-07bb-495a-9572-060db373334c.jpeg" alt="Screenshot of a Grafana interface showing a &quot;Select data source&quot; window. Two data sources named &quot;Prometheus&quot; and &quot;prometheus-1&quot; are listed. Options for using mixed data sources, dashboards, and Grafana mock data are on the right." class="image--center mx-auto" width="1266" height="579" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734917392381/c7e01350-d90e-43a1-b172-82899f377d82.jpeg" alt="Screenshot of Grafana's &quot;Import dashboard&quot; page, showing options to upload a JSON file or enter a dashboard ID. The ID &quot;15760&quot; is entered in the input box, and there is a JSON model example displayed below." class="image--center mx-auto" width="1284" height="637" loading="lazy"></p>
<h3 id="heading-step-6-deploying-an-application-on-kubernetes-to-monitor-on-grafana">Step 6: Deploying an Application on Kubernetes to Monitor on Grafana</h3>
<p>Finally, we will deploy an NGINX container in our EKS cluster and monitor it with Grafana. To do so, create a YAML file named <code>deployment.yml</code> that contains both the Deployment and the Service:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">nginx-app</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">nginx-app</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-app</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:latest</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">nginx-app</span>
</code></pre>
<p>To deploy the NGINX application on the Kubernetes cluster, apply the manifest with <code>kubectl</code>, then verify the deployment and its pods:</p>
<pre><code class="lang-bash">kubectl apply -f deployment.yml
kubectl get deployment
kubectl get pods
</code></pre>
<p>Open the load balancer URL to see your application in your browser:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734917018333/987b78c4-cae0-42dc-bd19-0a4f8a6a6b10.jpeg" alt="Browser window displaying the default welcome page for Nginx, indicating successful installation and suggesting further configuration." class="image--center mx-auto" width="1162" height="710" loading="lazy"></p>
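<p>If you do not have the load balancer URL at hand, you can read it from the Service. A minimal sketch, assuming the Service name <code>nginx-app</code> from the manifest and the default namespace:</p>
<pre><code class="lang-bash"># The EXTERNAL-IP column shows the load balancer hostname
kubectl get svc nginx-app

# Request the page headers from the terminal (replace the placeholder with that hostname)
curl -I http://&lt;LOAD_BALANCER_HOSTNAME&gt;
</code></pre>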
<p>Let’s refresh our Grafana dashboard to see our NGINX web application in Grafana.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734917096760/72815d56-04a5-47cf-8b2b-89164ccb7e7a.jpeg" alt="A Kubernetes dashboard showing CPU and memory usage by container. The CPU usage graph is on the left, and the memory usage graph is on the right. Both graphs display data for containers named &quot;nginx-app&quot; over the last 5 minutes." class="image--center mx-auto" width="1278" height="626" loading="lazy"></p>
<h3 id="heading-step-7-deleting-the-cluster">Step 7: Deleting the Cluster</h3>
<p>Now that we have finished exploring the setup, we can delete our Kubernetes cluster to avoid extra costs. Run the following command to do so:</p>
<pre><code class="lang-bash">eksctl delete cluster --name my-prac-cluster-1 --region us-east-1
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734917455255/04b00a79-d963-4e28-bc99-4cb6ea8c65cf.jpeg" alt="A terminal window displaying a series of commands and system messages related to the deletion of an EKS cluster and associated resources. It shows timestamps for each action, status updates, and confirmation that all cluster resources were deleted successfully." class="image--center mx-auto" width="1192" height="657" loading="lazy"></p>
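<p>To double-check that nothing is left running, you can list the clusters that remain in the region. This sketch assumes the same region as the delete command:</p>
<pre><code class="lang-bash"># List remaining EKS clusters; the deleted cluster should no longer appear
eksctl get cluster --region us-east-1
</code></pre>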
<h2 id="heading-conclusion">Conclusion</h2>
<p>This article teaches the theory behind monitoring and observability and highlights the roles of Prometheus and Grafana in these processes.</p>
<p>We went through a hands-on deployment of Prometheus and Grafana on an EKS cluster and a web application to illustrate how they can be effectively monitored using Grafana.</p>
<p>By leveraging these tools, administrators can enjoy real-time visibility into their Kubernetes infrastructure, easily spot performance bottlenecks, and confidently make decisions that enhance application performance and reliability.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up Grafana on EC2 ]]>
                </title>
                <description>
                    <![CDATA[ In today's data-driven world, it's important to monitor and visualize system metrics to make sure everything works consistently and performs well.  Grafana is an open-source analytics and monitoring platform. It has gained widespread recognition amon... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-grafana-on-ec2/</link>
                <guid isPermaLink="false">66ba0c6be272700c6e2ec43d</guid>
                
                    <category>
                        <![CDATA[ analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ monitoring ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Onwubiko Emmanuel ]]>
                </dc:creator>
                <pubDate>Fri, 02 Aug 2024 13:42:27 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/08/pexels-kawserhamid-176342.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today's data-driven world, it's important to monitor and visualize system metrics to make sure everything works consistently and performs well. </p>
<p>Grafana is an open-source analytics and monitoring platform. It has gained widespread recognition among developers and enterprises looking to extract more insights from the data produced by their systems. </p>
<p>Grafana has many powerful visualization features, and when combined with Amazon EC2's scalability and flexibility, it creates a stable environment for efficient monitoring. </p>
<p>This article will walk you through setting up Grafana on Amazon EC2 and creating informative dashboards out of raw data. </p>
<h2 id="heading-for-whom-is-this-intended"><strong>For Whom is this Intended?</strong></h2>
<p>This tutorial is intended for both novices to the cloud and experts in DevOps. The goal of this post is to make the installation process easier so you can use Grafana on AWS to its fullest. Now let's get going.</p>
<h2 id="heading-how-to-configure-your-ec2-instance"><strong>How to Configure Your EC2 Instance</strong></h2>
<p>You need to configure an inbound rule for your EC2 instance that allows access to port 3000, as Grafana listens on this port. But first, you need to launch an EC2 instance. You can follow this guide on how to set up your <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html">AWS EC2</a> instance. It takes less than 5 minutes.</p>
<p>Once you have created your EC2 instance, you'll need to configure the network inbound rules. So head to your instance page and click on the instance. In the panel at the bottom, click on the <strong>security</strong> tab and then on the security group link (it should look like this: “<strong>sg-547*****</strong>”). </p>
<p>Once the security group page opens, go to the inbound rules section and click on ‘<strong>Edit inbound rules</strong>’. Add a new rule, enter <strong>3000</strong> in the port range field, and in the source field select <strong>0.0.0.0/0</strong>. Then save.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721347239653_image.png" alt="Image" width="1024" height="162" loading="lazy">
<em>Inbound rules</em></p>
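<p>The same rule can also be added from the AWS CLI. This is a sketch that assumes the CLI is configured for your account, and the security group ID is a placeholder you need to replace:</p>
<pre><code class="lang-bash"># Allow inbound TCP traffic on port 3000 from any IPv4 address
aws ec2 authorize-security-group-ingress \
  --group-id &lt;SECURITY_GROUP_ID&gt; \
  --protocol tcp \
  --port 3000 \
  --cidr 0.0.0.0/0
</code></pre>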
<h2 id="heading-how-to-create-an-iam-role"><strong>How to Create an IAM Role</strong></h2>
<p>Now you need to create an <strong>IAM (Identity and Access Management)</strong> role. You're creating this role so that you can generate credentials that you'll subsequently use to connect your Grafana service to AWS.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721348061199_IAM+Dashboard.png" alt="Image" width="1912" height="876" loading="lazy">
<em>IAM Dashboard</em></p>
<p>So, in the search field, type "<strong>IAM service</strong>" and click it. Click '<strong>Create role</strong>', and select the AWS service as the trusted entity type.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721348079999_IAM+Role+creation.png" alt="Image" width="1893" height="865" loading="lazy">
<em>IAM Trusted Entity</em></p>
<p>In the use case section, select EC2, then click next.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721348098668_EC2+Use+Case.png" alt="Image" width="1906" height="877" loading="lazy">
<em>IAM role use case</em></p>
<p>On the Add Permissions page, select the <strong>AdministratorAccess</strong> policy, then click next. Enter a role name. In this case, I used <strong>Grafana-Server-Role</strong>.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721348120427_IAM+role+modify+.png" alt="Image" width="1916" height="835" loading="lazy">
<em>Role creation</em></p>
<h2 id="heading-how-to-download-grafana">How to Download Grafana</h2>
<p>Now that you've configured your EC2 inbound rule and also configured the IAM role, let's set up Grafana on your EC2 instance. </p>
<p>First, connect to your instance over SSH (Secure Shell). In this case, I am using EC2 Instance Connect. Then head over to <a target="_blank" href="https://grafana.com/grafana/download">Grafana's download page</a>. Since we'll be installing the version for Amazon Linux in this tutorial, run the following command on your instance's command line:</p>
<pre><code class="lang-bash">sudo yum install -y https://dl.grafana.com/enterprise/release/grafana-enterprise-11.1.0-1.x86_64.rpm
</code></pre>
<p>Now enable the Grafana service so that it starts on boot by running the following command:</p>
<pre><code class="lang-bash">sudo systemctl <span class="hljs-built_in">enable</span> grafana-server.service
</code></pre>
<p>Then start the service:</p>
<pre><code class="lang-bash">sudo systemctl start grafana-server.service
</code></pre>
<p>Check the status of the Grafana service on the EC2 instance by running this command:</p>
<pre><code class="lang-bash">systemctl status grafana-server.service
</code></pre>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721411484886_Grafana+Active+.png" alt="Image" width="1101" height="175" loading="lazy">
<em>Grafana Service Status</em></p>
<p>Now that you've confirmed that the service is active, you'll also need to check that Grafana is listening on <strong>port 3000</strong>, since that is the port your inbound rule allows. </p>
<p>You can do this by typing the following command:</p>
<pre><code class="lang-bash">sudo netstat -tunpl | grep grafana
</code></pre>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721411578753_3000+active.png" alt="Image" width="1107" height="22" loading="lazy">
<em>Port 3000 confirmation</em></p>
<p>Now that you've confirmed that the service runs on port 3000, you can go ahead and set up your Grafana dashboard.</p>
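<p>As an extra check from the instance itself, you can call Grafana's health endpoint before opening the browser. A minimal sketch, assuming <code>curl</code> is installed on the instance:</p>
<pre><code class="lang-bash"># Query the local Grafana health endpoint; a JSON response indicates the server is up
curl -s http://localhost:3000/api/health
</code></pre>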
<p>You can access the Grafana dashboard by entering the public IP address of your EC2 instance followed by port 3000 in your web browser, something like this: <strong>34.239.101.172:3000</strong>.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721411871355_Grafana.png" alt="Image" width="1877" height="931" loading="lazy">
<em>Grafana Login</em></p>
<p>The default username and password for Grafana are both “admin”. After you sign in with the default credentials, you'll be given the option to change your password. You can also skip the password change if you like.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721412805910_Grafana+Password.png" alt="Image" width="1902" height="927" loading="lazy">
<em>Change password on Grafana</em></p>
<p>After this step, go to the home page. The next thing to do is to connect your Grafana dashboard to a data source. In this case, you're going to connect it to the Amazon CloudWatch service.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721413426627_Grafana+home.png" alt="Image" width="1918" height="907" loading="lazy">
<em>Grafana</em></p>
<h2 id="heading-how-to-connect-data-sources-to-the-grafana-dashboard"><strong>How to Connect Data Sources to the Grafana Dashboard</strong></h2>
<p>Click on the connections tab on the side menu and click on data sources. Search for the CloudWatch service.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721413905866_image.png" alt="Image" width="1361" height="646" loading="lazy">
<em>CloudWatch configuration</em></p>
<p>Now you'll be prompted to input your access key ID and secret access key. You will need to create these in the AWS IAM service. </p>
<p>So go back to your IAM management dashboard and go to the Users tab. If you haven’t created an IAM user, you can do so by checking out this <a target="_blank" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html">IAM user creation tutorial</a>. </p>
<p>On the IAM user's page, scroll down to the access keys section and click on <strong>Create access key.</strong></p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721414737862_Access+Key.png" alt="Image" width="1447" height="277" loading="lazy">
<em>Access key</em></p>
<p>Select the Command Line Interface use case.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721414696820_Access+Key+2.png" alt="Image" width="1917" height="876" loading="lazy">
<em>Access key use case</em></p>
<p>Set the description tag. This step is optional. Then click on <strong>Create access key</strong>.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721414844084_Access+Key+3.png" alt="Image" width="1787" height="738" loading="lazy">
<em>Access keys</em></p>
<p>Now copy the Access Key ID and Secret access key and paste them into the CloudWatch data source configuration page in Grafana. Set your default region. In this case, mine is <strong>us-east-1</strong>.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721414955923_image.png" alt="Image" width="558" height="260" loading="lazy">
<em>Additional settings</em></p>
<p>When you’re done, click on the save and test button. Grafana will query CloudWatch, and if the query succeeds, it will save the configuration.</p>
<h2 id="heading-how-to-create-a-dashboard-on-grafana"><strong>How to Create a Dashboard on Grafana</strong></h2>
<p>Now that you have successfully configured your Grafana service, let’s start creating dashboards.</p>
<p>Click on the dashboard tab on the side menu, click on <strong>New</strong>, and select new dashboard. You should see the screen below:</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721416576260_dashboard.png" alt="Image" width="1910" height="928" loading="lazy">
<em>Create a new dashboard</em></p>
<p>Then select <strong>Import dashboard.</strong></p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721417832804_image.png" alt="Image" width="1354" height="643" loading="lazy">
<em>Import a dashboard</em></p>
<p>For this case, you'll be importing a ready-made dashboard from Grafana. Grafana has a lot of dashboards for a lot of use cases and services. But in this case, you'll be importing an EC2 dashboard (<a target="_blank" href="https://grafana.com/grafana/dashboards/11265-amazon-ec2/">Grafana EC2 dashboard</a>). </p>
<p>To import a dashboard, you need its ID, which is always shown on the dashboard's page.</p>
<p>So now copy the ID. In this case, it's <strong>11265</strong>. Then paste it into the import field on the import dashboard page, and click on the load button.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_4B51535633ABB1D019D79F3934180D191EF4BB549B6DD5EF46643EA16E05EAAE_1721418236847_Grafana+Dashboard.png" alt="Image" width="1912" height="927" loading="lazy">
<em>Grafana Dashboard</em></p>
<p>Now you have successfully created a dashboard in Grafana. This dashboard lets you monitor the performance of your EC2 instance. You can monitor metrics such as CPU Utilization, CPU Credit, Disk Ops, Disk Bytes, Network, Network Packets, Status check, and so on.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Thank you for reading! I hope this step-by-step guide has helped you learn how to create and set up efficient dashboards using Grafana. </p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ An Intro to Metrics Driven Development: What Are Metrics and Why Should You Use Them? ]]>
                </title>
                <description>
                    <![CDATA[ By dor sever One of the coolest things I have learned in the last year is how to constantly deliver value into production without causing too much chaos. In this post, I’ll explain the metrics-driven development approach and how it helped me to achie... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/metrics-driven-development/</link>
                <guid isPermaLink="false">66d45e43182810487e0ce151</guid>
                
                    <category>
                        <![CDATA[ Metrics driven development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agile development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MDD ]]>
                    </category>
                
                    <category>
                        <![CDATA[ metrics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #prometheus ]]>
                    </category>
                
                    <category>
                        <![CDATA[ TypeScript ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2020 16:47:37 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9c2b740569d1a4ca3062.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By dor sever</p>
<p>One of the coolest things I have learned in the last year is how to constantly deliver value into production without causing <strong>too</strong> much chaos.</p>
<p>In this post, I’ll explain the metrics-driven development approach and how it helped me to achieve that. By the end of the post, you’ll be able to answer the following questions:</p>
<ul>
<li>What are metrics and why should I use them?</li>
<li>What are the different types of metrics?</li>
<li>What tools could I use to store and display metrics?</li>
<li>What is a real-world example of metrics-driven development?</li>
</ul>
<h2 id="heading-what-are-metrics-and-why-should-i-use-them">What are metrics and why should I use them?</h2>
<p>Metrics give you the ability to collect information on an actively running system without changing its code.</p>
<p>They allow you to gain valuable data on the behavior of your application while it runs so you can make <strong><a target="_blank" href="https://www.techopedia.com/definition/32877/data-driven-decision-making-dddm">data-driven decisions</a></strong> based on real customer feedback and usage in production.</p>
<h2 id="heading-what-are-the-types-of-metrics-available-to-me">What are the types of metrics available to me?</h2>
<p>These are the most common metrics used today:</p>
<ul>
<li>Counter — Represents a monotonically increasing value.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screen-Shot-5780-06-10-at-12.37.42-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>Counters are really useful for measuring rates!</em></p>
<p>In this example, a counter metric is used to calculate the rate of events over time by counting events per second.</p>
<ul>
<li>Gauge — Represents a single value that can go up or down.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screen-Shot-5780-06-10-at-12.42.06-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>Gauges are really useful for measuring CPU usage!</em></p>
<p>In this example, a gauge metric is used to monitor <a target="_blank" href="https://blog.appsignal.com/2018/03/06/understanding-cpu-statistics.html">user CPU</a> usage as a percentage.</p>
<ul>
<li>Histogram — Counts observations (like request durations or sizes) in configurable buckets.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screen-Shot-5780-06-10-at-12.44.12-PM.png" alt="Image" width="600" height="400" loading="lazy">
<em>Histograms are really useful for measuring request duration!</em></p>
<p>In this example, a histogram metric is used to calculate the 75th and 90th percentiles of an HTTP request duration.</p>
<p>The finer details of counters, gauges, and histograms can be quite confusing. You can read more about them <a target="_blank" href="https://prometheus.io/docs/concepts/metric_types/">here</a>.</p>
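<p>To make the three types concrete, here is a minimal in-memory sketch in TypeScript. The class names, method names, and bucket bounds are illustrative assumptions; this is not the API of Prometheus or of any StatsD client.</p>

```typescript
// Illustrative in-memory versions of the three metric types (hypothetical classes).

// Counter: a value that only ever goes up.
class Counter {
    private value = 0;
    increment(by: number = 1): void {
        if (by < 0) throw new Error("a counter can only increase");
        this.value += by;
    }
    get(): number { return this.value; }
}

// Gauge: a single value that can go up or down.
class Gauge {
    private value = 0;
    set(value: number): void { this.value = value; }
    get(): number { return this.value; }
}

// Histogram: counts observations in configurable buckets.
// Each bucket counts the observations less than or equal to its upper bound.
class Histogram {
    private upperBounds: number[];
    private counts: number[];
    private total = 0;
    constructor(upperBounds: number[]) {
        this.upperBounds = [...upperBounds].sort((a, b) => a - b);
        this.counts = this.upperBounds.map(() => 0);
    }
    observe(value: number): void {
        this.total += 1;
        this.upperBounds.forEach((bound, i) => {
            if (value <= bound) this.counts[i] += 1;
        });
    }
    // Cumulative count of observations at or below the given bucket bound.
    countAtOrBelow(bound: number): number {
        const i = this.upperBounds.indexOf(bound);
        if (i === -1) throw new Error("no bucket with bound " + bound);
        return this.counts[i];
    }
    count(): number { return this.total; }
}

// Usage: two requests, a CPU reading that drops, four request durations in ms.
const requests = new Counter();
requests.increment();
requests.increment();

const cpu = new Gauge();
cpu.set(72.5);
cpu.set(40);

const durations = new Histogram([100, 300, 500]);
[80, 120, 450, 700].forEach((ms) => durations.observe(ms));
```

<p>The counter only accumulates, the gauge keeps just the latest value, and the histogram keeps per-bucket counts from which a monitoring system can estimate percentiles.</p>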
<h2 id="heading-what-tools-can-i-use-to-store-and-display-metrics">What tools can I use to store and display metrics?</h2>
<p>Most monitoring systems consist of a few parts:</p>
<ol>
<li>Time-series database — Database software optimized for storing and serving <a target="_blank" href="https://en.wikipedia.org/wiki/Time_series">time-series</a> data. Two examples of this kind of database are <a target="_blank" href="https://graphite.readthedocs.io/en/latest/whisper.html">Whisper</a> and <a target="_blank" href="https://prometheus.io/">Prometheus</a>.</li>
<li>Querying engine (with a querying language) — Two common examples are <a target="_blank" href="https://graphiteapp.org/">Graphite</a> and <a target="_blank" href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>.</li>
<li>Alerting system — The mechanism that allows you to configure alerts based on queries written in the querying language. The system can send these alerts to email, Slack, or PagerDuty. Two common examples are <a target="_blank" href="https://grafana.com/">Grafana</a> and <a target="_blank" href="https://prometheus.io/">Prometheus</a>.</li>
<li>UI — Allows you to view the graphs generated by the incoming data and configure queries and alerts. Two common examples are <a target="_blank" href="https://graphiteapp.org/">Graphite</a> and <a target="_blank" href="https://grafana.com/">Grafana</a>.</li>
</ol>
<p>The setup we use today at <a target="_blank" href="https://medium.com/@bigpanda_engineering">BigPanda Engineering</a> is:</p>
<ul>
<li><a target="_blank" href="https://www.influxdata.com/time-series-platform/telegraf/">Telegraf</a> — used as a StatsD server.</li>
<li><a target="_blank" href="https://prometheus.io/">Prometheus</a> — used as our scraping engine, time-series database, and querying engine.</li>
<li><a target="_blank" href="https://grafana.com/">Grafana</a> — used for alerting and UI.</li>
</ul>
<p>And the constraints we had in mind while choosing this stack were:</p>
<ul>
<li>We want scalable and elastic metrics scraping</li>
<li>We want a performant query engine</li>
<li>We want the ability to query our metrics using custom tags (such as service names, hosts, etc.)</li>
</ul>
<h2 id="heading-a-real-world-example-of-metrics-driven-development-of-a-sentiment-analysis-service">A real-world example of Metrics-driven development of a Sentiment Analysis service</h2>
<p>Let’s develop a new pipeline service that calculates sentiments based on textual inputs and does it in a Metrics Driven Development way!</p>
<p>Let’s say I need to develop this pipeline service:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/1_bj6DWm4987CuedEclpyvVw.png" alt="Image" width="600" height="400" loading="lazy">
<em>Sentiment analysis pipeline architecture</em></p>
<p>And this is my usual development process:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screen-Shot-5780-06-16-at-7.31.52-AM.png" alt="Image" width="600" height="400" loading="lazy">
<em>Usual development process - Test, code and deploy. Oh my!</em></p>
<p>So I write the following implementation:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">let</span> senService: SentimentAnalysisService = <span class="hljs-keyword">new</span> SentimentAnalysisService();
<span class="hljs-keyword">while</span> (<span class="hljs-literal">true</span>) {
    <span class="hljs-keyword">let</span> tweetInformation = kafkaConsumer.consume()
    <span class="hljs-keyword">let</span> deserializedTweet: { msg: <span class="hljs-built_in">string</span> } = deSerialize(tweetInformation)
    <span class="hljs-keyword">let</span> sentimentResult = senService.calculateSentiment(deserializedTweet.msg)
    <span class="hljs-keyword">let</span> serializedSentimentResult = serialize(sentimentResult)
    sentimentStore.store(sentimentResult);
    kafkaProducer.produce(serializedSentimentResult, <span class="hljs-string">'sentiment_topic'</span>, <span class="hljs-number">0</span>);
}
</code></pre>
<p>The full gist can be found <a target="_blank" href="https://gist.github.com/dorsev/387800acee8d1b8e6af29c86101fedb8">here</a>.</p>
<p>And this method works <strong>perfectly</strong> fine.</p>
<p><strong>But what happens when it doesn’t</strong>?</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/but-.gif" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The reality is that while working (in an agile development process) we make mistakes. That’s a fact of life. </p>
<p>I believe that the real challenge with making mistakes is not to avoid them, but rather to optimize how fast we detect and repair them. So, we need to gain the ability to <strong>quickly</strong> discover our mistakes.  </p>
<p>It's time for the MDD-way.</p>
<h2 id="heading-the-metrics-driven-development-mdd-way">The Metrics Driven Development (MDD) way</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/commandments.gif" alt="Image" width="600" height="400" loading="lazy">
<em>Behold! <strong>The Three Commandments of Production!</strong></em></p>
<p>The MDD approach is heavily inspired by the <strong>Three Commandments of Production</strong> (which I learned the hard way).</p>
<p><strong>The Three Commandments of Production are:</strong></p>
<ol>
<li>There are mistakes and bugs in the code you write and deploy.</li>
<li>The data flowing in production is unpredictable and <strong>unique!</strong></li>
<li>Perfect your code from <strong>real customer feedback and usage in production</strong>.</li>
</ol>
<p>Now that we know the <strong>Commandments</strong>, it's time to go over the 4-step plan of the metrics-driven development process.</p>
<h2 id="heading-the-4-step-plan-for-a-successful-mdd">The 4-step plan for a successful MDD</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/MDD---oh-wow.png" alt="Image" width="600" height="400" loading="lazy">
<em>Metrics-driven development? Oh wow!</em></p>
<h3 id="heading-develop-code">Develop code </h3>
<p>I write the code and, whenever possible, wrap it in a feature flag that allows me to gradually open it to users.</p>
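<p>As a rough illustration of what "gradually opening" a feature can look like, here is a minimal sketch of a percentage-based feature flag. The function names, the hashing scheme, and the flag name are all assumptions made for this example; a real system would more likely use a dedicated feature-flag service.</p>

```typescript
// Hypothetical percentage-based feature flag.
// The same (flag, user) pair always maps to the same bucket, so a user
// who has the feature at 10% still has it when the rollout grows to 50%.
function hashToPercent(key: string): number {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
        // Simple deterministic string hash, kept small to avoid precision loss.
        hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
    }
    return hash % 100; // bucket in the range 0..99
}

function isEnabled(flagName: string, userId: string, rolloutPercent: number): boolean {
    return hashToPercent(flagName + ":" + userId) < rolloutPercent;
}

// Usage: route only part of the traffic through the new code path.
const useNewPipeline = isEnabled("sentiment_pipeline_v2", "user-42", 10);
```

<p>Raising <code>rolloutPercent</code> step by step while watching the metrics is what lets a mistake show up on a small share of users first.</p>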
<h3 id="heading-metrics">Metrics</h3>
<p>This consists of two parts:</p>
<p><strong>Add metrics on relevant parts</strong></p>
<p>In this part, I ask myself: what success or failure metrics can I define to make sure my feature works? In this case: does my new pipeline application perform its logic correctly?</p>
<p><strong>Add alerts on top of them so that I’ll be alerted when a bug occurs</strong></p>
<p>In this part, I ask myself: what metric could alert me if I forgot something or did not implement it correctly?</p>
<h3 id="heading-deployment">Deployment</h3>
<p>I deploy the code and immediately monitor it to verify that it’s behaving as I have anticipated.</p>
<h3 id="heading-iterate-this-process-to-perfection">Iterate this process to perfection</h3>
<p>And that's it! Now that we have learned the process, let's tackle an important task inside it.</p>
<h2 id="heading-metrics-to-report-what-should-we-monitor">Metrics to Report — what should we monitor?</h2>
<p>One of the toughest questions for me, when I’m doing MDD, is: “What should I monitor?”</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/MALLTHINGZ.jpeg" alt="Image" width="600" height="400" loading="lazy">
<em>That’s a lovely GIF, but unrealistic in most cases.</em></p>
<p>To answer the question, let’s zoom out and look at the big picture.<br>All the possible information available to monitor can be divided into two parts:</p>
<ol>
<li><strong>Applicative information</strong> — Information that has an applicative context and meaning. An example of this would be: “How many tweets did we classify as positive in the last hour?”</li>
<li><strong>Operational information</strong> — Information that is related to the infrastructure that surrounds our application — Cloud data, CPU and disk utilization, network usage, etc.</li>
</ol>
<p>Now, since we cannot monitor everything, we need to choose what applicative and operational information we want to monitor.</p>
<ul>
<li>The operational part really depends on your ops stack and has built-in solutions for (almost) all your monitoring needs.</li>
<li>The applicative part is more unique to your needs, and I'll try to explain how I think about it later in this post.</li>
</ul>
<p>After we do that, we can ask ourselves the question: what alerts do we want to set up on top of the metrics we just defined?</p>
<p>The diagram (of information, metrics, alerts) can be drawn like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/world-of.png" alt="Image" width="600" height="400" loading="lazy">
<em>The world of information, metrics, and alerts.</em></p>
<h3 id="heading-applicative-metrics">Applicative metrics</h3>
<p>I usually add applicative metrics out of two needs:</p>
<h4 id="heading-to-answer-questions">To answer questions</h4>
<p>A question is something like, “When my service misbehaves, what information would be helpful to know about?”</p>
<p>Some answers to that question can be — latencies of all IO calls, processing rate, throughput, etc…</p>
<p>Most of these metrics are helpful while you are searching for the answer. But once you have found it, chances are you will not look at them again (since you already know the answer).</p>
<p>These questions are usually driven by R&amp;D and are (usually) used to gather information internally.</p>
<h4 id="heading-to-add-alerts">To add Alerts</h4>
<p>This may sound backward, but I usually add applicative metrics in order to define alerts on top of them. Meaning, we define the list of alerts first and then deduce from them which applicative metrics to report.</p>
<p>These alerts are derived from the SLA of the product and are usually treated with mission-critical importance.</p>
<h2 id="heading-common-types-of-alerts">Common types of alerts</h2>
<p>Alerts can be broken down into three types:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/alert-types.png" alt="Image" width="600" height="400" loading="lazy">
<em>Alert types to metrics list</em></p>
<h3 id="heading-sla-alerts">SLA Alerts</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/sla-breach.jpeg" alt="Image" width="600" height="400" loading="lazy">
<em>SLA alerts in reality</em></p>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Service-level_agreement">SLA</a> alerts surround the places in our system where an SLA is specified to meet explicit customer or internal requirements (e.g., availability, throughput, latency). SLA breaches involve paging R&amp;D and waking people up, so try to keep the alerts in this list to a minimum.</p>
<p>Also, we can define <strong>Degradation</strong> Alerts in addition to SLA Alerts.<br>Degradation alerts are defined with lower thresholds than SLA alerts, and are therefore useful in reducing the number of SLA breaches by giving you a proper heads-up before they happen.</p>
<p>An example of an SLA alert would be, “All sentiment requests must finish in under 500ms.”</p>
<p>An example of a Degradation Alert would be, “All sentiment requests must finish in under 400ms.”</p>
<p>These are the alerts I defined:</p>
<ol>
<li>Latency — I expect the 90th percentile of a single request duration not to exceed 300ms.</li>
<li>Success/Failure ratio of requests — I expect the ratio of failures per second to successes per second to remain under 0.01.</li>
<li>Throughput — I expect the number of operations per second (ops) that the application handles to be &gt; 200.</li>
<li>Data Size — I expect the amount of data that we store in a single day not to exceed 2GB.</li>
</ol>
<blockquote>
<p><em>200 ops * 60 bytes (size of a Sentiment Result) * 86400 sec in a day ≈ 1GB &lt; 2GB</em></p>
</blockquote>
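<p>The estimate above can be checked with a few lines of TypeScript. The variable names are illustrative, and the result is expressed in decimal gigabytes (10<sup>9</sup> bytes), which is an assumption about the unit intended.</p>

```typescript
// Back-of-the-envelope daily storage estimate, using the numbers from the text.
const opsPerSecond = 200;     // throughput target
const bytesPerResult = 60;    // size of one Sentiment Result
const secondsPerDay = 86400;

const bytesPerDay = opsPerSecond * bytesPerResult * secondsPerDay;
const gigabytesPerDay = bytesPerDay / 1e9; // decimal gigabytes (assumption)

const dailyLimitGb = 2;
const withinLimit = gigabytesPerDay < dailyLimitGb;

console.log(bytesPerDay, gigabytesPerDay, withinLimit);
```

<p>At exactly 200 ops this comes to about 1.04GB per day, so the 2GB alert threshold leaves room for roughly double the expected throughput.</p>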
<h3 id="heading-baseline-breaching-alerts">Baseline Breaching Alerts</h3>
<p>These alerts usually involve measuring and defining a baseline, and then using alerts to make sure it doesn’t change dramatically over time.</p>
<p>For example, the 99th percentile of the processing latency for an event must stay relatively the same across time unless we have made dramatic changes to the logic.</p>
<p>These are the alerts I defined:</p>
<ol>
<li>Number of Positive, Neutral, or Negative Sentiment tweets — If, for whatever reason, the number of Positive tweets has increased or decreased dramatically, I might have a bug somewhere in my application.</li>
<li>Latency, success ratio of requests, throughput, and data size must not increase or decrease dramatically over time.</li>
</ol>
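<p>A baseline check boils down to comparing a current value against a previously measured one. The sketch below shows the idea; the function name and the 25% tolerance are assumptions chosen for illustration, and in practice this comparison would usually live in the alerting system's query rather than in application code.</p>

```typescript
// Hypothetical baseline check: returns true when the current value has
// drifted from the baseline by more than the allowed fraction.
function breachesBaseline(baseline: number, current: number, tolerance: number): boolean {
    if (baseline === 0) {
        // No meaningful relative change from zero: any non-zero value is a breach.
        return current !== 0;
    }
    const relativeChange = Math.abs(current - baseline) / Math.abs(baseline);
    return relativeChange > tolerance;
}

// Usage: positive tweets per hour, with an assumed baseline of 1000.
const baselinePositivePerHour = 1000;
const smallDrift = breachesBaseline(baselinePositivePerHour, 1100, 0.25);
const sharpDrop = breachesBaseline(baselinePositivePerHour, 400, 0.25);
```

<p>Here a move from 1000 to 1100 stays within tolerance, while a drop to 400 would trigger the alert.</p>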
<h3 id="heading-runtime-properties-alerts">Runtime Properties Alerts</h3>
<p>I’ve given a talk about <a target="_blank" href="https://www.youtube.com/watch?v=Xtuv_aduYjM">Property-Based Tests</a> and their insane strength. As it turns out, collecting metrics allows us to run property-based tests on our system <strong>in production</strong>!</p>
<p>Some properties of our system:</p>
<ol>
<li>Since we consume messages from a Kafka topic, the handled offset must monotonically increase over time.</li>
<li>1 ≥ sentiment score ≥ 0</li>
<li>A tweet should be classified as Negative, Positive, or Neutral.</li>
<li>A tweet classification must be unique.</li>
</ol>
<p>These alerts helped me validate that:</p>
<ol>
<li>We are reading with the same group-id. Changing consumer group IDs by accident during deployment is a common mistake when using Kafka, and it causes a lot of mayhem in production.</li>
<li>The sentiment score is consistently between 0 and 1.</li>
<li>Tweet category length should always be 1.</li>
</ol>
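<p>The properties listed above can be expressed as plain runtime checks. The following is a minimal sketch: the <code>SentimentResult</code> shape, the field names, and the violation names are assumptions made for illustration and are not taken from the linked gists.</p>

```typescript
// Hypothetical shape of one processed message.
interface SentimentResult {
    offset: number;        // Kafka offset of the handled message
    score: number;         // sentiment score, expected within [0, 1]
    categories: string[];  // expected to hold exactly one category
}

const VALID_CATEGORIES = ["Positive", "Negative", "Neutral"];

// Returns the names of all properties that this result violates.
function violatedProperties(previousOffset: number, result: SentimentResult): string[] {
    const violations: string[] = [];
    if (result.offset <= previousOffset) {
        violations.push("offset_not_increasing");
    }
    if (!(result.score >= 0 && result.score <= 1)) {
        violations.push("score_out_of_range");
    }
    if (result.categories.some((c) => VALID_CATEGORIES.indexOf(c) === -1)) {
        violations.push("unknown_category");
    }
    if (result.categories.length !== 1) {
        violations.push("category_not_unique");
    }
    return violations;
}

// Usage: one healthy result and one that breaks several properties.
const healthy = violatedProperties(41, { offset: 42, score: 0.8, categories: ["Positive"] });
const broken = violatedProperties(42, { offset: 42, score: 1.3, categories: ["Positive", "Neutral"] });
```

<p>Each violation name could then be reported as a counter, so that an alert fires as soon as its rate rises above zero.</p>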
<p>In order to define these alerts, you need to submit metrics from your application. Go <a target="_blank" href="https://gist.github.com/dorsev/181e84e091ae545cb7825b782faf9d20">here</a> for the complete metrics list.</p>
<p>Using these metrics, I can create <strong>alerts</strong> that will “page” me whenever one of these properties no longer holds in production.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/processing-latency-alert.png" alt="Image" width="600" height="400" loading="lazy">
<em>Processing latency breached configured SLA! Oh my!</em></p>
<p>Let’s take a look at a possible implementation of all these metrics:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> SDC = <span class="hljs-built_in">require</span>(<span class="hljs-string">"statsd-client"</span>);
<span class="hljs-keyword">let</span> sdc = <span class="hljs-keyword">new</span> SDC({ host: <span class="hljs-string">'localhost'</span> });
<span class="hljs-keyword">let</span> senService: SentimentAnalysisService; <span class="hljs-comment">//...</span>
<span class="hljs-keyword">while</span> (<span class="hljs-literal">true</span>) {
    <span class="hljs-keyword">let</span> tweetInformation = kafkaConsumer.consume()
    sdc.increment(<span class="hljs-string">'incoming_requests_count'</span>)
    <span class="hljs-keyword">let</span> deserializedTweet: { msg: <span class="hljs-built_in">string</span> } = deSerialize(tweetInformation)
    sdc.histogram(<span class="hljs-string">'request_size_chars'</span>, deserializedTweet.msg.length);
    <span class="hljs-keyword">let</span> sentimentResult = senService.calculateSentiment(deserializedTweet.msg)
    <span class="hljs-keyword">if</span> (sentimentResult !== <span class="hljs-literal">undefined</span>) {
        <span class="hljs-keyword">let</span> serializedSentimentResult = serialize(sentimentResult)
        sdc.histogram(<span class="hljs-string">'outgoing_event_size_chars'</span>, serializedSentimentResult.length);
        sentimentStore.store(sentimentResult)
        kafkaProducer.produce(serializedSentimentResult, <span class="hljs-string">'sentiment_topic'</span>, <span class="hljs-number">0</span>);
    }

}
</code></pre>
<p>The full code can be found <a target="_blank" href="https://gist.github.com/dorsev/d7737ed6a866cf98b026d47f4f7faae8">here</a>.</p>
<p><strong>A few thoughts on the code example above:</strong></p>
<ol>
<li>A staggering number of metrics have been added to this codebase.</li>
<li>Metrics add complexity to the codebase, so, like all good things, add them responsibly and in moderation.</li>
<li>Choosing correct metric names is hard. Take your time selecting proper names. <a target="_blank" href="https://prometheus.io/docs/practices/naming/">Here’s</a> an excellent post about this.</li>
<li>You still need to collect these metrics and display them in a monitoring system (like Grafana), plus add alerts on top of them, but that’s a topic for a different post.</li>
</ol>
<h2 id="heading-did-we-reach-the-initial-goal-of-identifying-issues-and-resolving-them-faster">Did we reach the initial goal of identifying issues and resolving them faster?</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/yes-it-was-.gif" alt="Image" width="600" height="400" loading="lazy">
<em>YESSSS, it was!</em></p>
<p>We can now make sure the application latency and throughput do not degrade over time. Also, adding alerts on these metrics allows for much faster issue discovery and resolution.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Metrics-driven development goes hand in hand with CI/CD, DevOps, and agile development processes. If any of these are part of how you work, then you are in the right place.</p>
<p>When done right, metrics make you feel more confident in your deployment in the same way that seeing passing unit-tests in your build makes you feel confident in the code you write.</p>
<p>Adding metrics allows you to deploy code and feel confident that your production environment is stable and that your application is behaving as expected over time. So I encourage you to try it out!</p>
<h4 id="heading-some-references">Some references</h4>
<ol>
<li>Here is a <a target="_blank" href="https://github.com/dorsev/MetricsSentimentAnalysis">link</a> to the code shown in this post, and <a target="_blank" href="https://gist.github.com/dorsev/181e84e091ae545cb7825b782faf9d20">here</a> is the full metrics list described.</li>
<li>If you are eager to try writing some metrics and to connect them to a monitoring system, check out <a target="_blank" href="https://prometheus.io/docs/introduction/first_steps/">Prometheus</a>, <a target="_blank" href="https://grafana.com/docs/grafana/latest/guides/getting_started/">Grafana</a>, and possibly this <a target="_blank" href="https://dev.to/kirklewis/metrics-with-prometheus-statsd-exporter-and-grafana-5145">post</a>.</li>
<li>This guy wrote a delightful <a target="_blank" href="https://sookocheff.com/post/mdd/mdd/">post</a> about metrics-driven development. Go read it.</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to run Grafana with DeviceHive ]]>
                </title>
                <description>
                    <![CDATA[ By Nikolay Khabarov DeviceHive is an IoT platform which has plenty of different components. The Grafana plugin is one of them. This plugin can gather data from a DeviceHive server and display it with different dashboards using the very popular tool —... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-grafana-with-devicehive-b2f57fe998a8/</link>
                <guid isPermaLink="false">66c3543a5f85c1948b3fabb8</guid>
                
                    <category>
                        <![CDATA[ Grafana ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Internet of Things ]]>
                    </category>
                
                    <category>
                        <![CDATA[ iot ]]>
                    </category>
                
                    <category>
                        <![CDATA[ open source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 29 Nov 2017 21:44:43 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*Vam37zski44iqI5J.gif" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Nikolay Khabarov</p>
<p><a target="_blank" href="https://devicehive.com/?utm_source=medium&amp;utm_medium=social&amp;utm_campaign=d-spring-2018">DeviceHive</a> is an IoT platform which has plenty of different components. The Grafana plugin is one of them. This plugin can gather data from a DeviceHive server and display it on dashboards using Grafana, a very popular visualisation tool. This article explains how to create a Grafana dashboard with DeviceHive. As an example, it visualises the voltage on the analog pin of an ESP8266 chip.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/nOECJ-zXVy1bAnBBNhEDO3mo7ZehFbSXHG6r" alt="Image" width="800" height="169" loading="lazy"></p>
<h3 id="heading-data">Data</h3>
<p>To display anything on a dashboard, we need data. On a DeviceHive server, data can be provided via ‘commands’ and ‘notifications’. Commands typically deliver messages that a device should execute, while notifications go the opposite way: devices notify their subscribers about certain events. Both are simple JSON messages.</p>
<p>Both entities can be used to plot graphs or to display static text, gauges, tables, or any other Grafana component. For this article, we will generate notifications using special DeviceHive firmware for the ESP8266 chip. This firmware allows the chip to connect directly to a DeviceHive server using its protocol and has plenty of <a target="_blank" href="https://github.com/devicehive/esp8266-firmware/blob/develop/DeviceHiveESP8266.md">documented commands</a> which can be issued from the server side.</p>
<h3 id="heading-generating-notifications-with-esp8266-firmware">Generating notifications with ESP8266 firmware</h3>
<p>The binaries for the DeviceHive firmware are available <a target="_blank" href="https://github.com/devicehive/esp8266-firmware/releases">here</a>. Download the latest version and flash it to your chip. The release archive contains documentation on how to do that, but if you have a ‘nodemcu’-like board, you just need to connect the board to your computer via a microUSB cable, run the ‘esp-flasher’ utility for your operating system from the release archive, and wait until it flashes the board. Once the board is flashed, you need to configure which Wi-Fi network, DeviceHive server, and credentials the chip should use. There are two ways to do that: using a POSIX-like terminal with the ‘esp-terminal’ utility, or wirelessly as described <a target="_blank" href="https://github.com/devicehive/esp8266-firmware/blob/develop/DeviceHiveESP8266.md#wireless-configuring">here</a>.</p>
<p>There is a <a target="_blank" href="https://playground.devicehive.com/?utm_source=medium&amp;utm_medium=social&amp;utm_campaign=d-spring-2018">free playground service</a> that you can use to try a DeviceHive server at no cost. After your chip is connected to your server or the playground, go to the server admin panel, find your ESP8266 device in the device list, and issue the ‘adc/int’ command with the parameters ‘{“0”: 500}’.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rpg8hn6zzqcOAuNC0aVxEPQmJQM2rSpD79zu" alt="Image" width="800" height="216" loading="lazy"></p>
<p>This command causes the ESP8266 to report the voltage on ADC input #0 (the only one the ESP8266 has) every 500ms. After switching to ‘notifications’, you should see a screen like this:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/fW573XDzKshEDdhmbSvyuh6ok-uV8Vs5rkXn" alt="Image" width="800" height="454" loading="lazy"></p>
<p>That is the voltage on the chip’s input pin. This kind of data is suitable for display in Grafana: notifications contain data (parameters in our case), they arrive continuously, and every DeviceHive notification has a timestamp. With an analogue sensor connected to this pin, you can display its readings in Grafana.</p>
<h3 id="heading-installing-the-devicehive-grafana-plugin-to-grafana">Installing the DeviceHive Grafana plugin to Grafana</h3>
<p>Grafana can be used as a local service or as a hosted service. To install Grafana locally, please refer to the “<a target="_blank" href="http://docs.grafana.org/installation/">Official documentation. Grafana installation</a>”.</p>
<p>You can find how to install plugins in the “<a target="_blank" href="http://docs.grafana.org/plugins/installation/">Official documentation. Plugin installation</a>”.</p>
<p>To install DeviceHive datasource via grafana-cli you can use the following command:</p>
<p><code>$ grafana-cli plugins install devicehive-devicehive-datasource</code></p>
<p>If you want to install the plugin manually, perform the steps below.</p>
<p>Prerequisites (these packages should be installed):</p>
<ul>
<li>Grafana &gt;= 4.6</li>
<li>NodeJs &gt;= 8 (optional)</li>
<li>NPM &gt;= 5 (optional)</li>
<li>Grunt (<code>npm install -g grunt</code>) (optional)</li>
</ul>
<p>You also need permission to copy data to the plugins folder (you can set its location in <code>grafana.ini</code> under <code>Paths-&gt;plugins</code>).</p>
<ol>
<li>Clone this repo to the plugins folder: <code>git clone https://github.com/devicehive/devicehive-grafana-datasource.git</code>;</li>
<li>The next steps are optional (only needed if you want to rebuild the datasource from source):<br>2.1 Go into the folder: <code>cd devicehive-grafana-datasource</code>;<br>2.2 Install all packages: <code>npm install</code>;<br>2.3 Build the plugin: <code>npm run build</code>;</li>
<li>Restart the Grafana server;</li>
<li>Open Grafana in a browser;</li>
<li>Open the side menu by clicking the Grafana icon in the top header;</li>
<li>In the side menu click <code>Data Sources</code>;</li>
<li>Click the <code>+ Add data source</code> in the top header;</li>
<li>Select <code>DeviceHive</code> from the <code>Type</code> dropdown;</li>
<li>Configure the datasource.</li>
</ol>
<p>After installation you will be able to see the DeviceHive datasource plugin in the installed plugins list (look at the picture below).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/AzonaJcVOdW9Hk3FCwUttkuiPJ9C7eWLQB1C" alt="Image" width="800" height="395" loading="lazy"></p>
<h3 id="heading-adding-grafana-datasource">Adding Grafana datasource</h3>
<p>To add DeviceHive datasource, you should perform the following steps:</p>
<ol>
<li>Open the side menu by clicking the Grafana icon in the top header;</li>
<li>In the side menu click <code>Data Sources</code>;</li>
<li>Click the <code>+ Add data source</code> in the top header;</li>
<li>Select <code>DeviceHive</code> from the <code>Type</code> dropdown;</li>
</ol>
<p>Look at the picture below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qc1uOouwh6YxgGqfdsQIZEDP1eYpxl2ecySv" alt="Image" width="800" height="361" loading="lazy"></p>
<p>To configure the DeviceHive datasource you should fill in the following fields:</p>
<p>Server URL (the path to the DeviceHive WebSocket server; for the playground this is ws://playground.devicehive.com/api/websocket)<br>Device ID (the unique identifier of the DeviceHive device)<br>Login/Password or AccessToken (the credentials used for authentication)</p>
<p>You can also specify a RefreshToken to refresh the AccessToken automatically.</p>
<p>In the picture below you can see the configuration workflow:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/5gXrxTalhd08rH2yVadrRIuNobmWhTTVaSnu" alt="Image" width="800" height="649" loading="lazy"></p>
<p>After adding and configuring a DeviceHive datasource, it should exist in the datasource list as in the picture below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Axn57tNj1KkdGbqIsrstGx7Oa66jk7U7x6qL" alt="Image" width="800" height="325" loading="lazy"></p>
<h3 id="heading-create-new-dashboard">Create new dashboard</h3>
<p>To create a new dashboard, click the “New” button in the sidebar panel, as shown in the picture below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/UxyJ0vnK8ijGBQ3Wc2JtnQIg2GX9GqF6l0DP" alt="Image" width="472" height="413" loading="lazy"></p>
<p>In this article we will show examples using the Graph panel, so click the Graph button:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zRKDMidgaJZq8GeifLHTEacB2NlspwNzOxxf" alt="Image" width="660" height="468" loading="lazy"></p>
<p>After that, you will see a line chart on your dashboard:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/5jyxzmU-kWVQnXJ6ffNtKFSDWMA7n57xJCaH" alt="Image" width="800" height="180" loading="lazy"></p>
<h3 id="heading-displaying-notificationscommands-with-grafanas-graph">Displaying notifications/commands with Grafana’s graph</h3>
<p>Notifications and commands are DeviceHive entities:<br>Command: a message dispatched by clients to devices<br>Notification: a message dispatched by devices to clients</p>
<p>By default, a Notification or Command message provides a field named “parameters” in which a user can pass their own data.</p>
<p>At the start of this article we configured the ESP8266 device to send notifications with data that represents the state of analogue pin #0 of the chip. The picture below shows how to configure the Grafana graph panel to display that data on a line chart:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/X5Z5ZHUPRVDhfCKxt4b6hcZ8gYbuxxJJJJC3" alt="Image" width="800" height="336" loading="lazy"></p>
<h3 id="heading-displaying-annotations-on-grafanas-graph">Displaying annotations on Grafana’s graph</h3>
<p>Annotations provide a way to mark points on the graph with rich events. When you hover over an annotation you can get an event description and event tags. The text field can include links to other systems with more detail.<br>You can find more information about annotations by following this <a target="_blank" href="http://docs.grafana.org/reference/annotations/">link</a>.</p>
<p>The picture below shows how to configure annotations powered by a DeviceHive datasource.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Lw2KYjUe7Q24ya19OuQzVxTLReW6u-Lz5EkL" alt="Image" width="800" height="241" loading="lazy"></p>
<h3 id="heading-advanced-graph-tuning">Advanced graph tuning</h3>
<p>After clicking the “Add converter” button you will be able to select a converter.<br>A converter is a simple function that transforms a value in some way.</p>
<p>For now, DeviceHive datasources support the following types of converters:</p>
<ul>
<li>Scale: multiplies the value by a given factor</li>
<li>Offset: adds a given value</li>
<li>Unit converter: converts a value between units of the following measurement types:
<ul>
<li>Temperature (‘c’: Celsius, ‘f’: Fahrenheit, ‘k’: Kelvin)</li>
<li>Length (‘m’: Meter, ‘mi’: Mile, ‘yd’: Yard, ‘ft’: Feet, ‘in’: Inch)</li>
<li>Weight (‘kg’: Kilogram, ‘lb’: Pound, ‘oz’: Ounces)</li>
<li>Volume (‘l’: Liter, ‘gal’: Gallon, ‘pt’: Pint)</li>
</ul>
</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Vtz9svdJWA6c77zySGGgT8DVarEmy1NZQXCe" alt="Image" width="752" height="146" loading="lazy"></p>
<p>An example of this functionality is shown in the picture below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/RxgzyoJ0xAMYgNGz9WDj-RqextPoxDbGgvd0" alt="Image" width="800" height="284" loading="lazy"></p>
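<p>To make the idea of a converter more concrete, here is a minimal sketch of the scale, offset, and unit converters written as plain Python functions. It is not the datasource plugin’s actual code. The function names and the table of length factors are assumptions made for this example, and only the unit codes (‘c’, ‘f’, ‘k’, ‘m’, ‘mi’, ‘yd’, ‘ft’, ‘in’) follow the list above.</p>
<pre><code class="language-python"># Illustrative only: these helpers mimic what each converter type does.
# They are not taken from the DeviceHive datasource plugin.

# Length of one unit expressed in meters (standard definitions).
LENGTH_IN_METERS = {
    "m": 1.0,
    "mi": 1609.344,
    "yd": 0.9144,
    "ft": 0.3048,
    "in": 0.0254,
}

TEMPERATURE_UNITS = {"c", "f", "k"}


def scale(value, factor):
    """Scale converter: multiply the value by a given factor."""
    return value * factor


def offset(value, amount):
    """Offset converter: add a given amount to the value."""
    return value + amount


def convert_temperature(value, src, dst):
    """Unit converter for temperature, going through Celsius."""
    if src not in TEMPERATURE_UNITS or dst not in TEMPERATURE_UNITS:
        raise ValueError("unknown temperature unit")
    if src == "f":
        celsius = (value - 32) * 5 / 9
    elif src == "k":
        celsius = value - 273.15
    else:
        celsius = value
    if dst == "f":
        return celsius * 9 / 5 + 32
    if dst == "k":
        return celsius + 273.15
    return celsius


def convert_length(value, src, dst):
    """Unit converter for length, going through meters."""
    return value * LENGTH_IN_METERS[src] / LENGTH_IN_METERS[dst]


# Converters can be chained: scale a raw reading, then shift it.
print(offset(scale(512, 2), 10))          # 1034
print(convert_temperature(212, "f", "c"))  # 100.0
</code></pre>
<p>Routing every conversion through one base unit (Celsius or meters here) keeps the sketch short, because each new unit needs only one entry instead of a rule for every pair of units.</p>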
<h3 id="heading-conclusion">Conclusion</h3>
<p>Grafana is a great tool for visualising data. It is very flexible and provides many features that let you present data the way you like. Grafana can use data sources from a wide range of software solutions, and DeviceHive is one of them. The sample described in this article is very simple, but the same principles can be used to create more advanced graphs, and we hope you find it helpful. With Grafana and DeviceHive you can build your own IoT visualisation solutions. Moreover, you can modify both projects as you wish, since Grafana and DeviceHive are open source software.</p>
<p><em>Written in collaboration with Igor Trambovetskiy, Senior Developer at <a target="_blank" href="https://devicehive.com/?utm_source=medium&amp;utm_medium=social&amp;utm_campaign=d-spring-2018">DeviceHive</a>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
