<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ GPU - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ GPU - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 05 Jun 2026 20:27:05 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/gpu/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP ]]>
                </title>
                <description>
                    <![CDATA[ Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-gpu-optimized-machine-image-with-hashicorp-packer-on-gcp/</link>
                <guid isPermaLink="false">69e93606d5f8830e7d9fbad6</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ VM Image ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hashicorp packer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 20:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fd393878-fe7c-458a-addf-7cd22d8280ac.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers and DCGM, apply OS‑level GPU tuning, and fight dependency issues. It's the same ritual every single time, and it wastes expensive cloud credits and patience before any actual work begins.</p>
<p>In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a href="#heading-step-1-install-packer">Step 1: Install Packer</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</a></p>
</li>
<li><p><a href="#heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</a></p>
</li>
<li><p><a href="#heading-step-4-define-your-source">Step 4: Define Your Source</a></p>
</li>
<li><p><a href="#heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</a></p>
</li>
<li><p><a href="#heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</a></p>
<ul>
<li><p><a href="#heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</a></p>
</li>
<li><p><a href="#heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</a></p>
</li>
<li><p><a href="#heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</a></p>
</li>
<li><p><a href="#heading-section-4-installing-the-driver">Section 4: Installing the Driver</a></p>
</li>
<li><p><a href="#heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</a></p>
</li>
<li><p><a href="#heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</a></p>
</li>
<li><p><a href="#heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</a></p>
</li>
<li><p><a href="#heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</a></p>
</li>
<li><p><a href="#heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</a></p>
</li>
<li><p><a href="#heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><a href="https://www.packer.io/">HashiCorp Packer</a> &gt;= 1.9</p>
</li>
<li><p><a href="https://github.com/hashicorp/packer-plugin-googlecompute">Google Compute Packer plugin</a> (installed via <code>packer init</code>)</p>
</li>
<li><p>Optionally, the <a href="https://github.com/hashicorp/packer-plugin-amazon">AWS Packer plugin</a>, which can be used for EC2 builds by adding an <code>amazon-ebs</code> source to <code>source.pkr.hcl</code></p>
</li>
<li><p>GCP project with Compute Engine API enabled (or AWS account with EC2 access)</p>
</li>
<li><p>GCP authentication (<code>gcloud auth application-default login</code>) or AWS credentials</p>
</li>
<li><p>Access to an NVIDIA GPU instance type (for example, A100, H100, or L4 on GCP; p4d, p5, or g6 on AWS)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-step-1-install-packer">Step 1: Install Packer</h3>
<p>To get started, install Packer. The steps below are for macOS; for Linux and Windows, follow the <a href="https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli#:~:text=Chocolatey%20on%20Windows-,Linux,-HashiCorp%20officially%20maintains">official installation guides</a>.</p>
<p>You'll install the official Packer formula from the terminal.</p>
<p>First, add the HashiCorp tap, a Homebrew repository of HashiCorp packages.</p>
<pre><code class="language-plaintext">$ brew tap hashicorp/tap
</code></pre>
<p>Now, install Packer with <code>hashicorp/tap/packer</code>.</p>
<pre><code class="language-plaintext">$ brew install hashicorp/tap/packer
</code></pre>
<h3 id="heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</h3>
<p>With Packer installed, create your project directory. To keep concerns separated, each part of the template lives in its own file. Create these files in a <code>packer_demo</code> folder with the command below:</p>
<pre><code class="language-plaintext">mkdir -p packer_demo/script &amp;&amp; touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh
</code></pre>
<p>Your file directory should look like this:</p>
<pre><code class="language-plaintext">packer_demo
├── build.pkr.hcl        # Build pipeline: provisioner ordering
├── source.pkr.hcl       # GCP source definition (googlecompute)
├── variable.pkr.hcl     # Variable definitions with defaults
├── local.pkr.hcl        # Local values
├── plugins.pkr.hcl      # Packer plugin requirements
├── values.pkrvars.hcl   # Variable values (copy and customize)
└── script/
    └── base.sh          # GPU provisioning script
</code></pre>
<h3 id="heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</h3>
<p>In your <code>plugins.pkr.hcl</code> file, define your plugins in the <code>packer {}</code> block. This block contains Packer settings, including the <code>required_plugins</code> block, which lists every plugin (and its version constraint) that the template needs to build your image. If you're on Azure or AWS, you can look up the matching plugin <a href="https://developer.hashicorp.com/packer/integrations">here</a>.</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
  }
}
</code></pre>
<p>Then, initialize your Packer plugin with the command below:</p>
<pre><code class="language-plaintext">packer init .
</code></pre>
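<p>If you later add EC2 builds, the same block can list more than one plugin. The fragment below is a minimal sketch that assumes the HashiCorp-maintained Amazon plugin at a 1.x release. Check the integrations page for the current version constraint, then rerun <code>packer init .</code>:</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
    # Assumption: only needed if you add an amazon-ebs source
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = "~&gt; 1"
    }
  }
}
</code></pre>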
<h3 id="heading-step-4-define-your-source">Step 4: Define Your Source</h3>
<p>With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. This source block contains your <code>project_id</code>, the <code>zone</code> where the build machine will be created, the <code>source_image_family</code> (think of this as your base image, such as Debian or Ubuntu), and your <code>source_image_project_id</code>.</p>
<p>In GCP, each public image family lives in an image project, such as <code>ubuntu-os-cloud</code> for Ubuntu. You'll set <code>machine_type</code> to a GPU machine type because the provisioning script installs and starts GPU software, so the temporary build machine needs a GPU to run those commands.</p>
<pre><code class="language-hcl">source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}
</code></pre>
<p>Setting <code>on_host_maintenance = "TERMINATE"</code> tells Compute Engine to stop the VM instead of live-migrating it during host maintenance. Instances with attached GPUs can't be live-migrated, so GPU machine types need this setting.</p>
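<p>The <code>g2-standard-4</code> machine type used in this article comes with an L4 GPU attached. If you build on an N1 machine type instead, the GPU has to be attached explicitly. The fragment below is a sketch of that variant, assuming a hypothetical <code>gpu-node-n1</code> source and a T4 accelerator. The <code>accelerator_type</code> and <code>accelerator_count</code> options belong to the googlecompute builder, and the accelerator you choose must be available in your zone:</p>
<pre><code class="language-hcl"># Hypothetical variant: N1 machine type with an explicitly attached GPU
source "googlecompute" "gpu-node-n1" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = "n1-standard-8"

  # Assumption: a single T4; swap in the accelerator type you have quota for
  accelerator_type  = "projects/${var.project_id}/zones/${var.zone}/acceleratorTypes/nvidia-tesla-t4"
  accelerator_count = 1

  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"
}
</code></pre>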
<p>You'll define all your variables in the <code>variable.pkr.hcl</code> file and set their values in <code>values.pkrvars.hcl</code>. Remember to add <code>values.pkrvars.hcl</code> to your <code>.gitignore</code>, since it holds project-specific values such as your project ID.</p>
<pre><code class="language-hcl">variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The source image family to use as the base image"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}

variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}
</code></pre>
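<p>Because the driver and CUDA versions end up in package names, a typo in either one only surfaces partway through the build. One option is a custom validation rule on the variable, so that <code>packer validate</code> rejects a malformed value up front. The block below is a sketch that would replace the plain <code>driver_version</code> definition above; the regular expression is an assumption about the version format, not an NVIDIA rule:</p>
<pre><code class="language-hcl">variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"

  validation {
    # Assumption: versions look like 590.48 or 590.48.01
    condition     = can(regex("^[0-9]+\\.[0-9]+(\\.[0-9]+)?$", var.driver_version))
    error_message = "The driver_version value must look like 590.48.01."
  }
}
</code></pre>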
<p>Then set the values in <code>values.pkrvars.hcl</code>:</p>
<pre><code class="language-hcl">image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"
</code></pre>
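<p>The project tree includes a <code>local.pkr.hcl</code> file. One way to use it, sketched below, is to build the timestamp suffix with Packer's HCL functions rather than in the values file. The names <code>build_timestamp</code> and <code>image_name</code> are illustrative. The sketch assumes you set <code>image_name = "base-gpu-image"</code> in <code>values.pkrvars.hcl</code> and reference <code>local.image_name</code> from the source block:</p>
<pre><code class="language-hcl">locals {
  # timestamp() returns an RFC 3339 string; stripping the separators leaves
  # only digits, which are valid in a Compute Engine image name
  build_timestamp = regex_replace(timestamp(), "[- TZ:]", "")
  image_name      = "${var.image_name}-${local.build_timestamp}"
}
</code></pre>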
<h3 id="heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</h3>
<p>Create <code>build.pkr.hcl</code>. The <code>build</code> block creates a temporary instance, runs provisioners, and produces an image.</p>
<p>Provisioners in this template are organized as follows:</p>
<ul>
<li><p><strong>First provisioner</strong> runs system updates and upgrades.</p>
</li>
<li><p><strong>Second provisioner</strong> reboots the instance (<code>expect_disconnect = true</code>).</p>
</li>
<li><p><strong>Third provisioner</strong> waits for the instance to come back (<code>pause_before</code>), then runs <code>script/base.sh</code>. This provisioner sets <code>max_retries</code> to handle transient SSH timeouts and passes <code>DRIVER_VERSION</code> and <code>CUDA_VERSION</code> to the script as environment variables.</p>
</li>
</ul>
<p>Lastly, a <code>shell-local</code> post-processor runs on your local machine and prints a completion message, the build's ID, and the date:</p>
<pre><code class="language-hcl">build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt-get update",
      "sudo apt-get -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}
</code></pre>
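<p>The <code>shell-local</code> step only echoes text to your terminal. If you also want a machine-readable record of the artifact that was built, Packer's <code>manifest</code> post-processor can write one. The fragment below is a minimal sketch to place inside the <code>build</code> block; the <code>manifest.json</code> file name is an arbitrary choice:</p>
<pre><code class="language-hcl">  # Writes the build's artifact details to a local JSON file
  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
</code></pre>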
<h3 id="heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</h3>
<p>Now we'll walk through <code>script/base.sh</code> and break down its main parts.</p>
<h3 id="heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</h3>
<p>Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.</p>
<pre><code class="language-shellscript">log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget
</code></pre>
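<p>Since a failed module build can go unnoticed, one option is to add a final check to <code>build.pkr.hcl</code> so the build stops before an unusable image is created. The fragment below is a sketch of a hypothetical extra provisioner placed after the <code>script/base.sh</code> provisioner. It assumes the build instance has a GPU attached, as the <code>g2-standard-4</code> machine type does:</p>
<pre><code class="language-hcl">  # Hypothetical final check: runs after script/base.sh.
  # dkms status lists the modules DKMS has built, modinfo fails if no nvidia
  # module exists for the running kernel, and nvidia-smi fails if the driver
  # cannot talk to the GPU.
  provisioner "shell" {
    inline = [
      "set -e",
      "sudo dkms status",
      "sudo modinfo nvidia",
      "nvidia-smi"
    ]
  }
</code></pre>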
<h3 id="heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</h3>
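<p>This snippet downloads and installs NVIDIA’s official keyring package for your Linux distribution and architecture, which adds the trusted signing keys the system needs to verify CUDA packages. <code>DISTRO</code> and <code>ARCH</code> are set at the top of the full script (for Ubuntu 24.04 on x86_64, they resolve to <code>ubuntu2404</code> and <code>x86_64</code>).</p>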
<pre><code class="language-shellscript">log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
</code></pre>
<h3 id="heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</h3>
<p>Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.</p>
<p>NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container tooling such as Docker and the NVIDIA Container Toolkit.</p>
<p>A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.</p>
<pre><code class="language-shellscript">log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"
</code></pre>
<h3 id="heading-section-4-installing-the-driver">Section 4: Installing the Driver</h3>
<p>The <code>libnvidia-compute</code> package installs only the compute‑related user‑space libraries (the CUDA driver components), while <code>nvidia-dkms-open</code> installs the <strong>open‑source NVIDIA kernel module</strong>, built locally via DKMS.</p>
<p>Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.</p>
<p>Here, we're using <strong>NVIDIA’s compute‑only driver stack with the open‑source kernel modules</strong>, because it deliberately leaves out display-related components that a headless GPU node doesn't need.</p>
<p>The result is a lightweight, compute-focused install whose kernel module is rebuilt by DKMS when the kernel changes, which fits how Linux distros handle kernel updates.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open
</code></pre>
<h3 id="heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</h3>
<p>This part of the script installs the <strong>CUDA Toolkit</strong> for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.</p>
<p>It adds CUDA binaries to <code>PATH</code>, so commands like <code>nvcc</code>, <code>cuda-gdb</code>, and <code>compute-sanitizer</code> work without specifying full paths. It also adds CUDA libraries to <code>LD_LIBRARY_PATH</code> and the dynamic linker cache, so applications can find CUDA’s shared libraries at runtime.</p>
<pre><code class="language-shellscript">log "Installing CUDA Toolkit ${CUDA_VERSION}..."
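# NVIDIA's apt package names use dashes in the version (cuda-toolkit-13-1), so convert 13.1 to 13-1
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION//./-}"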

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
</code></pre>
<h3 id="heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</h3>
<p>This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
</code></pre>
<h3 id="heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</h3>
<p>This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.</p>
<p>It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.</p>
<p>The script extracts the installed DCGM version and checks that it meets the <strong>minimum required version</strong> (4.3) for NVIDIA driver 590+, exiting with an error if it doesn't. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager when that service is installed, which is needed for NVLink/NVSwitch multi‑GPU topologies such as A100/H100 DGX systems.</p>
<pre><code class="language-shellscript">log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
</code></pre>
<h3 id="heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</h3>
<p>Without persistence mode, the NVIDIA driver tears down its GPU state once the last process using the GPU exits. When a new workload starts, the driver must reinitialize the GPU and set up memory mappings again. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.</p>
<p>Enabling <code>nvidia-persistenced</code> keeps the driver state initialized even when no GPU workloads are running.</p>
<pre><code class="language-shellscript">log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced
</code></pre>
<h3 id="heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</h3>
<p>This block applies a set of <strong>system‑level performance and stability tunings</strong> that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.</p>
<p>Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.</p>
<ul>
<li><p>Swap and memory behavior: Disabling swap and setting <code>vm.swappiness=0</code> prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.</p>
</li>
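<li><p>Hugepages for large memory allocations: Setting <code>vm.nr_hugepages=2048</code> allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations. With the default 2 MiB hugepage size on x86_64, 2048 pages reserve 4 GiB of RAM, so adjust the count to the memory of the instances that will run this image.</p>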
<p>CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.</p>
</li>
<li><p>CPU frequency governor: Installing <code>cpupower</code> and forcing the CPU governor to <code>performance</code> ensures the CPU stays at maximum frequency instead of scaling down.</p>
<p>GPU workloads often become CPU‑bound during data preprocessing, kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.</p>
</li>
<li><p>NUMA and topology tools: Installing <code>numactl</code>, <code>libnuma-dev</code>, and <code>hwloc</code> provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.</p>
</li>
<li><p>Disabling irqbalance: Stopping and disabling <code>irqbalance</code> lets the NVIDIA driver manage interrupt affinity. On GPU servers, irqbalance can move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.</p>
</li>
</ul>
<pre><code class="language-shell">log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
</code></pre>
<p>Here's the full <code>base.sh</code> script:</p>
<pre><code class="language-shell">#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" &gt;&amp;2; exit 1; }

###############################################################
# 0. Validate required environment variables
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] &amp;&amp; error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] &amp;&amp; error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release &amp;&amp; echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
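# NVIDIA's apt package names use dashes in the version (cuda-toolkit-13-1), so convert 13.1 to 13-1
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION//./-}"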

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"
</code></pre>
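<p>The DCGM version gate in the script above compares the major and minor numbers inline. As a minimal sketch, the same comparison can be pulled into a small function that is easy to try out on its own. The helper name <code>version_at_least</code> is hypothetical (it is not part of the original script), and it assumes plain dotted version strings with any epoch prefix already stripped:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper, not part of base.sh.
# Succeeds (exit status 0) when version $1 is at least the MAJOR.MINOR in $2.
version_at_least() {
  local have_major have_minor want_major want_minor
  have_major="$(echo "$1" | cut -d. -f1)"
  have_minor="$(echo "$1" | cut -d. -f2)"
  want_major="$(echo "$2" | cut -d. -f1)"
  want_minor="$(echo "$2" | cut -d. -f2)"

  if [[ "${have_major}" -gt "${want_major}" ]]; then return 0; fi
  if [[ "${have_major}" -lt "${want_major}" ]]; then return 1; fi
  [[ "${have_minor}" -ge "${want_minor}" ]]
}

# Try the gate against a few sample version strings (illustrative values).
for v in 3.3.9 4.2.3 4.3.1 5.0.0; do
  if version_at_least "${v}" 4.3; then
    echo "${v}: meets the 4.3 minimum"
  else
    echo "${v}: below the 4.3 minimum"
  fi
done
</code></pre>
<p>Keeping the comparison in one function means the minimum version lives in a single place if the driver requirement changes later.</p>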
<h2 id="heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</h2>
<p>Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.</p>
<pre><code class="language-shellscript">packer validate -var-file=values.pkrvars.hcl .
</code></pre>
<p>If validation succeeds, you’ll see a short confirmation like <code>The configuration is valid.</code> After that, start the build. Packer creates a temporary VM, runs your provisioners, and produces an image:</p>
<pre><code class="language-plaintext">packer build -var-file=values.pkrvars.hcl .
</code></pre>
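<p>If you run both commands from a script or a CI job, you may want the build to be skipped automatically when validation fails. Here is a minimal sketch, assuming the same <code>values.pkrvars.hcl</code> file. The <code>build_image</code> function and the <code>PACKER_BIN</code> override are hypothetical names introduced only for illustration:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# PACKER_BIN is a hypothetical override so the wrapper can be dry-run with a stub.
PACKER_BIN="${PACKER_BIN:-packer}"

# Hypothetical wrapper: validate first, build only if validation passes.
build_image() {
  local var_file="$1"
  if ! "${PACKER_BIN}" validate -var-file="${var_file}" .; then
    echo "Validation failed; build not started."
    return 1
  fi
  "${PACKER_BIN}" build -var-file="${var_file}" .
}
</code></pre>
<p>Calling <code>build_image values.pkrvars.hcl</code> from the project directory then runs the two steps in order and stops early on a broken config.</p>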
<p>The build typically takes <strong>15–20 minutes</strong>, depending on network speed and package installs. Watch the Packer log for three key checkpoints:</p>
<ul>
<li><p><strong>Instance creation</strong> — confirms the temporary VM was provisioned.</p>
</li>
<li><p><strong>Provisioner output</strong> — shows each script step (updates, reboot, <code>script/base.sh</code>) and any errors.</p>
</li>
<li><p><strong>Image creation</strong> — indicates the build finished and an image artifact was written.</p>
</li>
</ul>
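<p>If you save the build output to a file (for example with <code>packer build -var-file=values.pkrvars.hcl . | tee build.log</code>), a short script can confirm the three checkpoints for you. This is a rough sketch: the function name <code>check_build_log</code> and the file name <code>build.log</code> are assumptions, and the marker strings are taken from the sample log in this article, so they may differ between Packer or plugin versions:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper: checks a saved Packer log for the three checkpoints.
# Returns 0 only when every marker string is present in the file.
check_build_log() {
  local log_file="$1" status=0
  local marker
  for marker in \
    "Instance has been created!" \
    "Provisioning with shell script" \
    "A disk image was created"; do
    if grep -qF "${marker}" "${log_file}"; then
      echo "found  : ${marker}"
    else
      echo "missing: ${marker}"
      status=1
    fi
  done
  return "${status}"
}
</code></pre>
<p>Running <code>check_build_log build.log</code> prints one line per checkpoint, which makes it easy to see how far a failed build got.</p>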
<p>If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.</p>
<pre><code class="language-plaintext">googlecompute.gpu-node: output will be in this color.

==&gt; googlecompute.gpu-node: Checking image does not exist...
==&gt; googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==&gt; googlecompute.gpu-node: no persistent disk to create
==&gt; googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==&gt; googlecompute.gpu-node: Creating instance...
==&gt; googlecompute.gpu-node: Loading zone: us-central1-a
==&gt; googlecompute.gpu-node: Loading machine type: g2-standard-4
==&gt; googlecompute.gpu-node: Requesting instance creation...
==&gt; googlecompute.gpu-node: Waiting for creation operation to complete...
==&gt; googlecompute.gpu-node: Instance has been created!
==&gt; googlecompute.gpu-node: Waiting for the instance to become running...
==&gt; googlecompute.gpu-node: IP: 34.58.58.214
==&gt; googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==&gt; googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==&gt; googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: No containers need to be restarted.
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: User sessions running outdated binaries:
==&gt; googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==&gt; googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==&gt; googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==&gt; googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==&gt; googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==&gt; googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==&gt; googlecompute.gpu-node: [BASE] Updating system packages...
==&gt; googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==&gt; googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==&gt; googlecompute.gpu-node: [BASE] Installing DCGM...
==&gt; googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==&gt; googlecompute.gpu-node: [BASE] Applying system tuning...
==&gt; googlecompute.gpu-node: vm.swappiness=0
==&gt; googlecompute.gpu-node: vm.nr_hugepages=2048
==&gt; googlecompute.gpu-node: Setting cpu: 0
==&gt; googlecompute.gpu-node: Error setting new values. Common errors:
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==&gt; googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==&gt; googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==&gt; googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==&gt; googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: Deleting instance...
==&gt; googlecompute.gpu-node: Instance has been deleted!
==&gt; googlecompute.gpu-node: Creating image...
==&gt; googlecompute.gpu-node: Deleting disk...
==&gt; googlecompute.gpu-node: Disk has been deleted!
==&gt; googlecompute.gpu-node: Running post-processor:  (type shell-local)
==&gt; googlecompute.gpu-node (shell-local): Running local shell script: 
==&gt; googlecompute.gpu-node (shell-local): === Image Build Complete ===
==&gt; googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==&gt; googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==&gt; Wait completed after 17 minutes 55 seconds

==&gt; Builds finished. The artifacts of successful builds are:
--&gt; googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
</code></pre>
<h2 id="heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</h2>
<p>Confirm the image exists in the GCP Console: go to <strong>Compute Engine → Storage → Images</strong> and locate your newly created image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/90f304eb-3fe7-4304-b2ad-d86701dde607.png" alt="Your Image information on GCP" style="display:block;margin:0 auto" width="1686" height="692" loading="lazy">

<p>Create a test VM from the image:</p>
<pre><code class="language-plaintext">gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING
</code></pre>
<p>Once the instance is <code>RUNNING</code>, run <code>nvidia-smi</code> on it to verify that the NVIDIA driver and GPU are visible:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/364df8fc-7584-40df-8ab7-b3fe349d5065.png" alt="Output from the Nvidia-SMI command showing Driver and CUDA Version" style="display:block;margin:0 auto" width="1508" height="630" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/0912c303-3bb0-47fa-aa34-1c91ff26874f.png" alt="Image verifying the persistence mode is enabled" style="display:block;margin:0 auto" width="1508" height="80" loading="lazy">

<p><strong>The</strong> <code>nvidia-smi</code> <strong>output confirms:</strong></p>
<ul>
<li><p>Driver 590.48.01 loaded</p>
</li>
<li><p>CUDA 13.1 available</p>
</li>
<li><p>Persistence Mode is <code>On</code></p>
</li>
<li><p>The L4 GPU is detected with 23 GB VRAM</p>
</li>
<li><p>Zero ECC errors</p>
</li>
<li><p>No running processes (clean idle state)</p>
</li>
</ul>
<p>This is exactly what a healthy base image should look like. Notice <code>Disp.A: Off</code>? That confirms our compute-only driver choice is working — no display adapter is active.</p>
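<p>The same checks can be scripted instead of read off the screen. The sketch below assumes a single-GPU VM and uses the <code>--query-gpu</code> and <code>--format=csv,noheader</code> options of <code>nvidia-smi</code>. The helper name <code>check_gpu_line</code> is hypothetical, and the expected driver version is simply the one used in this build:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper: checks one CSV line of the form
# "driver_version, persistence_mode" against the expected driver version.
check_gpu_line() {
  local line="$1" want_driver="$2"
  local driver persistence
  driver="$(echo "${line}" | cut -d, -f1 | tr -d ' ')"
  persistence="$(echo "${line}" | cut -d, -f2 | tr -d ' ')"

  if [[ "${driver}" != "${want_driver}" ]]; then
    echo "Unexpected driver version: ${driver}"
    return 1
  fi
  if [[ "${persistence}" != "Enabled" ]]; then
    echo "Persistence mode is ${persistence}, expected Enabled"
    return 1
  fi
  echo "Driver ${driver} loaded, persistence mode enabled"
}

# Only query the GPU when nvidia-smi is actually installed on this machine.
if [[ -n "$(command -v nvidia-smi)" ]]; then
  check_gpu_line "$(nvidia-smi --query-gpu=driver_version,persistence_mode --format=csv,noheader)" "590.48.01"
fi
</code></pre>
<p>A check like this could run as a final smoke test on a VM booted from the image, so a broken driver install is caught before the image is used in production.</p>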
<p>Confirm the installed CUDA toolkit by running <code>nvcc --version</code>. The output shows that version 13.1 was installed, as specified.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/cc744624-9408-4348-88d7-61da04b5e1d0.png" alt="Output from the nvcc --version command" style="display:block;margin:0 auto" width="1508" height="202" loading="lazy">

<p>Let's confirm DCGM installation by running <code>dcgmi discovery -l</code>. Successful output indicates DCGM is running and communicating with the driver.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/114996c6-1f28-43d4-a3fa-13aa7ccd2c82.png" alt="Output from the dcgmi discovery -l command showing device information" style="display:block;margin:0 auto" width="1508" height="714" loading="lazy">

<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.</p>
<p>From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.</p>
<p>The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p>NVIDIA Driver Installation Guide (Ubuntu): <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>
</li>
<li><p>NVIDIA CUDA Toolkit Documentation: <a href="https://docs.nvidia.com/cuda/">https://docs.nvidia.com/cuda/</a></p>
</li>
<li><p>NVIDIA Container Toolkit Installation Guide: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html</a></p>
</li>
<li><p>NVIDIA DCGM Documentation: <a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html">https://docs.nvidia.com/datacenter/dcgm/latest/index.html</a></p>
</li>
<li><p>NVIDIA Persistence Daemon: <a href="https://docs.nvidia.com/deploy/driver-persistence/index.html">https://docs.nvidia.com/deploy/driver-persistence/index.html</a></p>
</li>
<li><p>HashiCorp Packer Documentation: <a href="https://developer.hashicorp.com/packer/docs">https://developer.hashicorp.com/packer/docs</a></p>
</li>
<li><p>Packer Google Compute Builder: <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute">https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Evolution of Nvidia Blackwell GPU Memory Architecture ]]>
                </title>
                <description>
                    <![CDATA[ Each GPU generation pushes against the same constraint: memory. Models grow faster than memory capacity, forcing engineers into complex multi-GPU setups, aggressive quantization, or painful trade-offs ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-evolution-of-nvidia-blackwell-gpu-memory-architecture/</link>
                <guid isPermaLink="false">69e7b761e4367278147e0832</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA B200 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GH200 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ memory ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Tue, 21 Apr 2026 17:44:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/d2339663-d031-49df-9bfb-90505af532f8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Each GPU generation pushes against the same constraint: memory. Models grow faster than memory capacity, forcing engineers into complex multi-GPU setups, aggressive quantization, or painful trade-offs.</p>
<p>NVIDIA's Blackwell architecture, which succeeded Hopper in 2024, attacks this problem at the hardware level, rethinking not just how much memory a GPU has, but also how that memory is structured and accessed.</p>
<p>Running Llama 3 70B no longer requires model parallelism or squeezing the model into tight memory limits. Instead, the same hardware footprint can now handle significantly larger parameter counts.</p>
<p>This article breaks down the memory enhancements that make Blackwell the most capable AI accelerator to date.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This article assumes you're comfortable with a few GPU fundamentals. If any of these feel shaky, the linked resources will get you up to speed in 10–15 minutes each.</p>
<ul>
<li><p><strong>GPU anatomy</strong>: what an SM is, and the role of registers, shared memory (L1), L2 cache, and memory controllers. [<a href="https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy">Memory Hierarchy of GPUs</a>]</p>
</li>
<li><p><strong>The three memory metrics</strong>: capacity (how much fits), bandwidth (how fast data moves), and latency (how long a single access takes). These aren't interchangeable, and Blackwell improves all three differently. [<a href="https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth">GPU Memory Bandwidth</a>]</p>
</li>
<li><p><strong>GPU memory types</strong>: HBM, GDDR, and LPDDR5X, and the bandwidth/capacity/power trade-offs between them. [<a href="https://medium.com/@jghaly00/cuda-gpu-memory-types-a07428b3eb16">CUDA GPU Memory Types</a>]</p>
</li>
<li><p><strong>Chip interconnects</strong>: PCIe, NVLink, and the idea of a chip-to-chip (C2C) link. [<a href="https://medium.com/@adi.fu7/the-ai-systems-game-are-chip-to-chip-interconnects-the-future-of-inference-ec3bbda53eb3">The AI Systems Game</a>]</p>
</li>
</ul>
<p>If you're solid on all four, you're ready.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-generational-leap">The Generational Leap</a></p>
</li>
<li><p><a href="#heading-the-gb200-superchip">The GB200 Superchip</a></p>
<ul>
<li><p><a href="#heading-grace-cpu">Grace CPU</a></p>
</li>
<li><p><a href="#heading-lpddr5x-low-power-double-data-rate-5x">LPDDR5X (Low Power Double Data Rate 5x)</a></p>
</li>
<li><p><a href="#heading-blackwell-gpu">Blackwell GPU</a></p>
</li>
<li><p><a href="#heading-high-bandwidth-interface-nv-hbi">High-Bandwidth Interface (NV-HBI)</a></p>
</li>
<li><p><a href="#heading-nvlink-c-2-c-chip-to-chip">NVLINK C-2-C (Chip-to-Chip)</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-memory-hierarchy-and-bandwidth">Memory Hierarchy and Bandwidth</a></p>
<ul>
<li><p><a href="#heading-the-hierarchy-at-a-glance">The Hierarchy at a Glance</a></p>
</li>
<li><p><a href="#heading-registers-and-l1shared-memory">Registers and L1/Shared Memory</a></p>
</li>
<li><p><a href="#heading-l2-cache-compensating-for-smaller-l1">L2 Cache: Compensating for Smaller L1</a></p>
</li>
<li><p><a href="#heading-hbm3e-the-main-memory-pool">HBM3e: The Main Memory Pool</a></p>
</li>
<li><p><a href="#heading-lpddr5x-the-extended-tier">LPDDR5X: The Extended Tier</a></p>
</li>
<li><p><a href="#heading-data-flow-in-practice">Data Flow in Practice</a></p>
</li>
<li><p><a href="#heading-practical-example-running-llama-3-70b">Practical Example: Running Llama 3 70B</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
<ul>
<li><a href="#heading-references">References</a></li>
</ul>
</li>
</ul>
<h2 id="heading-the-generational-leap">The Generational Leap</h2>
<p>Before diving into how Blackwell achieves its performance gains, here's what changed from the previous GPU generation:</p>
<table>
<thead>
<tr>
<th>Spec</th>
<th>Hopper H100</th>
<th>Blackwell B200</th>
<th>Change</th>
</tr>
</thead>
<tbody><tr>
<td>HBM Capacity</td>
<td>80 GB (HBM3)</td>
<td>192 GB (HBM3e)</td>
<td>2.4×</td>
</tr>
<tr>
<td>HBM Bandwidth</td>
<td>3.35 TB/s</td>
<td>8 TB/s</td>
<td>2.4×</td>
</tr>
<tr>
<td>L2 Cache</td>
<td>50 MB</td>
<td>126 MB</td>
<td>2.5×</td>
</tr>
<tr>
<td>L1/Shared per SM</td>
<td>256 KB</td>
<td>128 KB</td>
<td>0.5×</td>
</tr>
<tr>
<td>Die Design</td>
<td>Monolithic</td>
<td>Dual-die (MCM)</td>
<td>—</td>
</tr>
<tr>
<td>CPU Integration</td>
<td>Separate (PCIe)</td>
<td>Unified (NVLink C2C)</td>
<td>—</td>
</tr>
</tbody></table>
<p>The numbers tell a clear story: more memory, more bandwidth, and a larger L2 cache, at the cost of a smaller per-SM L1. The rest of this article explains how these pieces fit together.</p>
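<p>The Change column is simply the Blackwell figure divided by the Hopper figure. As a quick sanity check, here is a small sketch that recomputes it from the values in the table. The <code>ratio</code> helper is a hypothetical name used only for this example:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper: prints new/old rounded to one decimal place.
ratio() {
  awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f", new / old }'
}

# Values copied from the comparison table (B200 first, H100 second).
echo "HBM capacity    : $(ratio 192 80)x"
echo "HBM bandwidth   : $(ratio 8 3.35)x"
echo "L2 cache        : $(ratio 126 50)x"
echo "L1/shared per SM: $(ratio 128 256)x"
</code></pre>
<p>The four lines print 2.4x, 2.4x, 2.5x, and 0.5x, matching the table.</p>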
<h2 id="heading-the-gb200-superchip">The GB200 Superchip</h2>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769275687108/0c179d24-7f3e-4f63-938c-36723069848c.png" alt="NVIDIA Blackwell GB200 Superchip" style="display:block;margin:0 auto" width="626" height="782" loading="lazy">

<p>The Grace Blackwell (GB200) extends the superchip design NVIDIA introduced with the Grace Hopper (GH200), where an ARM-based Grace CPU is paired with GPU chips in a single package to form one unified computing system.</p>
<p>In the Blackwell generation, the GB200 pairs one Grace CPU with two Blackwell GPUs, connected via NVLink Chip-to-Chip (NVLink-C2C), a high-bandwidth interface that lets the CPU and GPUs share memory and operate as a single system.</p>
<h3 id="heading-grace-cpu">Grace CPU</h3>
<p>The Grace CPU is built on Arm Neoverse V2 cores and designed by NVIDIA for bandwidth and efficiency. It handles general-purpose tasks such as pre-processing and tokenization, and feeds data to the GPU through NVLink C-2-C. Its memory also acts as an extended memory tier for the GPU.</p>
<p>The Grace CPU runs at a moderate clock speed but compensates with high memory bandwidth: up to 500 GB/s to its LPDDR5X memory (Low Power Double Data Rate 5X, which we'll discuss in a moment), along with about 100 MB of L3 cache.</p>
<h3 id="heading-lpddr5x-low-power-double-data-rate-5x">LPDDR5X (Low Power Double Data Rate 5x)</h3>
<p>LPDDR5X is a high-speed, low-power memory standard with data rates of up to 10.7 Gbps per pin. Its low power draw makes it well suited to this use case.</p>
<p>It balances performance and power efficiency, delivering up to 500 GB/s while using only about 16 W, roughly one-fifth the power of conventional DDR5 memory.</p>
<h3 id="heading-blackwell-gpu">Blackwell GPU</h3>
<p>The Blackwell GPU improves significantly on the previous Hopper generation, especially in memory. Each Blackwell GPU is a dual-die design, with two GPU dies in a single module.</p>
<p>The two dies are connected by NV-HBI (NVIDIA High-Bandwidth Interface) at 10 TB/s, so they behave as a single GPU at full performance. Each die contains 104 billion transistors, for 208 billion in total. Each die is also paired with 96 GB of HBM3e memory, for 192 GB in total, of which 180 GB is usable (the remaining 12 GB is reserved for error-correcting code (ECC), system firmware, and so on).</p>
<p>Compared with the H100, both capacity and bandwidth grow by roughly 2.4×: 192 GB versus 80 GB of HBM, and 8 TB/s versus 3.35 TB/s.</p>
<p>The L2 cache was also increased to 126 MB. By increasing the L2 cache, Blackwell can store more neural network weights or intermediate results on-chip, avoiding extra trips out to HBM. This ensures the GPU’s compute units are rarely starved for data.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769431323592/8d4d8ba8-dd0c-459e-ad4a-b1a750ffe0d9.png" alt="Blackwell dual-die multichip module (MCM) design" style="display:block;margin:0 auto" width="1011" height="534" loading="lazy">

<h3 id="heading-high-bandwidth-interface-nv-hbi">High-Bandwidth Interface (NV-HBI)</h3>
<p>The NVIDIA High-Bandwidth Interface (NV-HBI) is the die-to-die (D2D) link inside a Blackwell GPU. It offers a 10 TB/s connection, combining the two GPU dies into a single, unified GPU.</p>
<h3 id="heading-nvlink-c-2-c-chip-to-chip">NVLINK C-2-C (Chip-to-Chip)</h3>
<p>The NVLink C-2-C provides a communication speed of up to ~900 GB/s between the Grace CPU and the Blackwell GPUs, eliminating the need to copy memory from the CPU to the GPU memory pool via the PCIe bus.</p>
<p>NVLink C-2-C is considerably faster than a typical PCIe bus: PCIe Gen6 provides only about 128 GB/s per direction. NVLink C-2-C is also cache-coherent, meaning the CPU and GPU share a coherent memory architecture in which the CPU can read and write GPU memory and vice versa.</p>
<p>This unified memory architecture is called Unified CPU-GPU Memory or Extended GPU Memory (EGM) by NVIDIA.</p>
<h2 id="heading-memory-hierarchy-and-bandwidth">Memory Hierarchy and Bandwidth</h2>
<p>Understanding how data flows through Blackwell's memory system is key to optimizing AI workloads. The architecture follows a classic hierarchy principle: smaller, faster memory sits closest to the compute units, with progressively larger but slower memory tiers extending outward.</p>
<h3 id="heading-the-hierarchy-at-a-glance">The Hierarchy at a Glance</h3>
<table>
<thead>
<tr>
<th>Memory Tier</th>
<th>Capacity</th>
<th>Bandwidth</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td>Registers</td>
<td>~256 KB per SM</td>
<td>Immediate</td>
<td>Active computation</td>
</tr>
<tr>
<td>L1/Shared Memory</td>
<td>~128 KB per SM</td>
<td>~40 TB/s aggregate</td>
<td>Data staging, inter-thread sharing</td>
</tr>
<tr>
<td>L2 Cache</td>
<td>~63 MB per die (~126 MB total)</td>
<td>~20 TB/s</td>
<td>Cross-SM data reuse</td>
</tr>
<tr>
<td>HBM3e</td>
<td>192 GB (180 usable)</td>
<td>8 TB/s</td>
<td>Model weights, activations</td>
</tr>
<tr>
<td>LPDDR5X (CPU)</td>
<td>~480 GB</td>
<td>~500 GB/s (900 GB/s via NVLink C2C)</td>
<td>Overflow, large embeddings</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769610442328/8d805672-0cb7-4318-bc3d-48b225073fbd.png" alt="Blackwell Memory map" style="display:block;margin:0 auto" width="976" height="748" loading="lazy">

<h3 id="heading-registers-and-l1shared-memory">Registers and L1/Shared Memory</h3>
<p>A streaming multiprocessor (SM) executes compute instructions on the GPU. At the lowest level, each SM contains a register file and configurable L1/Shared memory, as illustrated in the diagram above. Registers hold the operands for active computations, that is, the data the GPU cores are working on right now.</p>
<p>An SM executes threads in fixed-size groups known as <em>warps</em>, with each warp containing exactly 32 threads that execute the same instructions in lockstep. The L1/Shared memory acts as a staging area, allowing threads within an SM to share data without going to slower memory tiers.</p>
<p>Blackwell's L1/Shared memory is 128 KB per SM by default, a reduction from Hopper's 256 KB. In specific configurations, this can extend to 228 KB per SM. The aggregate bandwidth across all SMs is approximately 40 TB/s.</p>
<p>Why the reduction? NVIDIA shifted capacity to TMEM (Tensor Memory), a dedicated memory for Tensor Core operations, and compensated with a larger L2 cache. General-purpose shared-memory workloads see less per-SM capacity, but the workloads that matter most, matrix multiplications, get dedicated, faster memory.</p>
<h3 id="heading-l2-cache-compensating-for-smaller-l1">L2 Cache: Compensating for Smaller L1</h3>
<p>The L2 cache sits between the SMs and HBM, shared across all compute units on a die. Blackwell provides about 63 MB per die (roughly 126 MB total across the dual-die module). This represents a 2.5× increase over Hopper's 50 MB and compensates for the smaller per-SM L1.</p>
<p>In AI workloads, the same model weights are accessed repeatedly across different input batches. A larger L2 cache means more of these weights can stay on-chip between batches, reducing expensive trips to HBM. For inference serving, where the same model handles thousands of requests, this translates directly to lower latency and higher throughput.</p>
<p>The dual-die design does introduce complexity here. Each die has its own 63 MB L2 partition. Accessing data cached on the other die requires crossing the NV-HBI interconnect, which is fast at 10 TB/s but still slower than local L2 access. NVIDIA's software stack handles this transparently, but performance-conscious engineers should be aware that data placement across dies can affect cache efficiency.</p>
<h3 id="heading-hbm3e-the-main-memory-pool">HBM3e: The Main Memory Pool</h3>
<p>High Bandwidth Memory (HBM3e) serves as the primary storage for model weights, activations, gradients, and input data. Blackwell's HBM3e delivers 8 TB/s of bandwidth per GPU, roughly 2.4× faster than Hopper's 3.35 TB/s HBM3.</p>
<p>The physical implementation uses an 8-Hi stack design: eight DRAM dies stacked vertically, each providing 3 GB, for 24 GB per stack. With eight stacks total (four per die), the B200 GPU provides 192 GB of on-package memory, though 180 GB is usable after accounting for ECC and system overhead.</p>
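<p>The capacity figures follow directly from the stack layout. A small worked example, using only the numbers quoted above (the variable names are illustrative):</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
dies_per_stack=8     # 8-Hi stack: eight DRAM dies stacked vertically
gb_per_dram_die=3    # capacity of each DRAM die
stacks_per_gpu=8     # four stacks per GPU die, two GPU dies
usable_gb=180        # usable figure quoted in this article

gb_per_stack=$(( dies_per_stack * gb_per_dram_die ))
total_gb=$(( gb_per_stack * stacks_per_gpu ))
reserved_gb=$(( total_gb - usable_gb ))

echo "Per stack: ${gb_per_stack} GB"
echo "Per GPU  : ${total_gb} GB"
echo "Reserved : ${reserved_gb} GB (ECC and system overhead)"
</code></pre>
<p>This gives 24 GB per stack, 192 GB per GPU, and 12 GB reserved.</p>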
<p>This bandwidth increase is critical. Tensor Core operations can consume data at enormous rates. If HBM can't feed data fast enough, the compute units stall, leaving expensive silicon idle. Blackwell's 8 TB/s keeps the tensor cores fed even during the largest matrix multiplications.</p>
<h3 id="heading-lpddr5x-the-extended-tier">LPDDR5X: The Extended Tier</h3>
<p>Beyond the GPU's HBM sits the Grace CPU's LPDDR5X memory: approximately 480 GB, accessible at up to 500 GB/s from the CPU and reachable from the GPUs over the ~900 GB/s NVLink C-2-C link.</p>
<p>Accessing LPDDR5X from the GPU has roughly 10× lower bandwidth and higher latency compared to HBM. But it remains far faster than NVMe SSDs or network storage.</p>
<p>LPDDR5X serves as a high-speed overflow tier. Data that doesn't fit in HBM, such as large embedding tables, KV caches for long-context inference, or checkpoint buffers, can reside in CPU memory without catastrophic performance penalties.</p>
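<p>To get a feel for the gap between tiers, it helps to estimate how long it takes to stream the same payload at each bandwidth. The sketch below is a back-of-the-envelope calculation only: it uses the peak figures quoted in this article, ignores latency and protocol overhead, and the <code>transfer_ms</code> helper and the 140 GB payload (the FP16 size of a 70B-parameter model) are assumptions chosen for illustration:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper: milliseconds needed to move SIZE_GB at BANDWIDTH_GB_PER_S.
# Idealized: peak bandwidth, no latency or protocol overhead.
transfer_ms() {
  awk -v size="$1" -v bw="$2" 'BEGIN { printf "%.1f", size / bw * 1000 }'
}

payload_gb=140   # example payload: FP16 weights of a 70B-parameter model

echo "HBM3e at 8000 GB/s      : $(transfer_ms "${payload_gb}" 8000) ms"
echo "NVLink C-2-C at 900 GB/s: $(transfer_ms "${payload_gb}" 900) ms"
echo "PCIe at 64 GB/s         : $(transfer_ms "${payload_gb}" 64) ms"
</code></pre>
<p>Under these assumptions the payload takes about 17.5 ms from HBM, 155.6 ms over NVLink C-2-C, and 2187.5 ms over a 64 GB/s PCIe link, which is why CPU memory works as an overflow tier while PCIe offloading is painful.</p>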
<h3 id="heading-data-flow-in-practice">Data Flow in Practice</h3>
<p>When a Blackwell GPU executes an AI workload, data flows through this hierarchy in stages:</p>
<ol>
<li><p><strong>Model loading</strong>: Weights move from storage → CPU memory → HBM (or stay in LPDDR5X if HBM is full)</p>
</li>
<li><p><strong>Batch processing</strong>: Input data streams into HBM, then into L2 as SMs request it</p>
</li>
<li><p><strong>Computation</strong>: Active data moves from L2 → L1/Shared → Registers as operations execute</p>
</li>
<li><p><strong>Output</strong>: Results flow back down the hierarchy to HBM or CPU memory</p>
</li>
</ol>
<p>Each tier serves as a buffer for the tier above it.</p>
<h2 id="heading-practical-example-running-llama-3-70b">Practical Example: Running Llama 3 70B</h2>
<p>Consider deploying Llama 3 70B for inference. In FP16 precision, the model weights alone require approximately 140 GB of memory (70 billion parameters × 2 bytes each). Note that with GB200 you can go as low as FP4, which shrinks this further.</p>
<p><strong>On a Hopper H100 (80 GB HBM3):</strong> The model doesn't fit. You must either quantize aggressively, use tensor parallelism across multiple GPUs, or offload layers to CPU memory over PCIe (slow at ~64 GB/s).</p>
<p><strong>On a single GB200 Superchip (~360 GB usable HBM3e + ~480 GB LPDDR5X):</strong> The full 140 GB model fits easily within a single GPU's HBM, leaving the second GPU's HBM and all CPU memory available for KV cache, batching, or running multiple model instances. No model parallelism required. No aggressive quantization forced by memory limits. The GB200 Superchip provides roughly <strong>10× the usable memory</strong> of a single H100, fundamentally changing what fits on one unit.</p>
<p>This is the practical impact of Blackwell's memory architecture: models that previously required multi-GPU setups can now run on a single superchip, simplifying deployment and reducing inter-GPU communication overhead.</p>
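<p>The fit-or-not reasoning above comes down to simple arithmetic: parameters times bytes per parameter, compared against usable HBM. Here is a minimal sketch that covers weights only (no KV cache or activations) and treats 1 GB as 10^9 bytes. The helper names <code>weights_gb</code> and <code>fits</code> are hypothetical:</p>
<pre><code class="language-shellscript">#!/usr/bin/env bash
# Hypothetical helper: weight footprint in GB for a model with
# PARAMS_IN_BILLIONS parameters stored at BITS bits per parameter.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

# Hypothetical helper: reports whether a footprint fits in a capacity (both in GB).
fits() {
  if [[ "$1" -le "$2" ]]; then echo "fits"; else echo "does not fit"; fi
}

params_b=70          # Llama 3 70B
h100_gb=80           # H100 HBM3 capacity
b200_usable_gb=180   # usable HBM3e on one B200 GPU

for bits in 16 8 4; do
  need="$(weights_gb "${params_b}" "${bits}")"
  echo "${bits}-bit: ${need} GB | H100: $(fits "${need}" "${h100_gb}") | B200: $(fits "${need}" "${b200_usable_gb}")"
done
</code></pre>
<p>At 16 bits the weights need 140 GB, which fits on one B200 but not on an H100. At 8 and 4 bits they need 70 GB and 35 GB, although real deployments also need headroom for the KV cache and activations.</p>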
<h2 id="heading-conclusion">Conclusion</h2>
<p>Memory has always been the limiting factor in AI hardware. Blackwell changes that equation.</p>
<p>By combining dual-die GPUs, HBM3e with 8 TB/s bandwidth, and unified CPU-GPU memory through NVLink C2C, NVIDIA has delivered a system where a single superchip offers roughly 10× the usable memory of its predecessor. Models that once demanded complex multi-GPU orchestration now fit on one unit.</p>
<p>For AI engineers, this means spending less time working around memory constraints and more time building better models. The architecture isn't just faster; it's fundamentally simpler to work with.</p>
<p>As models continue to grow, Blackwell's memory-first design philosophy points to where GPU architecture is heading: tighter integration, unified memory pools, and specialized hardware for the workloads that matter most.</p>
<h3 id="heading-references">References</h3>
<ol>
<li><p>NVIDIA Blackwell Architecture Technical Brief: <a href="https://resources.nvidia.com/en-us-blackwell-architecture">https://resources.nvidia.com/en-us-blackwell-architecture</a></p>
</li>
<li><p>NVIDIA Blackwell Architecture: A Deep Dive: <a href="https://medium.com/@kvnagesh/nvidia-blackwell-architecture-a-deep-dive-into-the-next-generation-of-ai-computing-79c2b1ce3c1b">https://medium.com/@kvnagesh/nvidia-blackwell-architecture-a-deep-dive-into-the-next-generation-of-ai-computing-79c2b1ce3c1b</a></p>
</li>
<li><p>AI Systems Performance Engineering: <a href="https://learning.oreilly.com/library/view/ai-systems-performance/9798341627772/">https://learning.oreilly.com/library/view/ai-systems-performance/9798341627772/</a></p>
</li>
<li><p>Memory Hierarchy of GPUs: <a href="https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy">https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy</a></p>
</li>
<li><p>GPU Memory Bandwidth and Its Impact on Performance: <a href="https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth">https://www.digitalocean.com/community/tutorials/gpu-memory-bandwidth</a></p>
</li>
<li><p>The AI Systems Game: <a href="https://medium.com/@adi.fu7/the-ai-systems-game-are-chip-to-chip-interconnects-the-future-of-inference-ec3bbda53eb3">https://medium.com/@adi.fu7/the-ai-systems-game-are-chip-to-chip-interconnects-the-future-of-inference-ec3bbda53eb3</a></p>
</li>
<li><p>CUDA GPU Memory Types: <a href="https://medium.com/@jghaly00/cuda-gpu-memory-types-a07428b3eb16">https://medium.com/@jghaly00/cuda-gpu-memory-types-a07428b3eb16</a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up CUDA and WSL2 for Windows 11 (including PyTorch and TensorFlow GPU) ]]>
                </title>
                <description>
                    <![CDATA[ If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support. If you’re new to Machin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-cuda-and-wsl2-for-windows-11-including-pytorch-and-tensorflow-gpu/</link>
                <guid isPermaLink="false">69309b9e8c594b8177306456</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Windows ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cuda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md. Fahim Bin Amin ]]>
                </dc:creator>
                <pubDate>Wed, 03 Dec 2025 20:20:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764786287487/f0c28401-ce77-4873-b238-59fc6b737ce7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support.</p>
<p>If you’re new to Machine Learning and are just getting started, then a free <a target="_blank" href="https://www.kaggle.com/">Kaggle</a> or <a target="_blank" href="https://colab.research.google.com/">Colab</a> might be enough for you. But that won’t be the case when you want to go deeper. You’ll need a GPU, which can get costly if you’re continuously using it on the cloud.</p>
<p>But there’s some good news: you can utilize your computer’s Nvidia GPU (GTX/RTX) quite easily and perform machine learning-related tasks right on your local machine. The cool thing is, it won’t cost you anything other than the electricity it uses!</p>
<p>When you’re running Machine Learning models on your local machines, the most suitable operating system is a Linux-based one, like Ubuntu. But Windows has improved a lot for this purpose. If you’re using the latest Windows 11, you can leverage Windows Subsystem for Linux (WSL) and use your GPU directly for Machine Learning-related workflows.</p>
<p>This process can be quite tricky, though, as can making two popular Machine Learning frameworks, TensorFlow and PyTorch, compatible with your system GPU in Windows 11. That’s why I have written this comprehensive guide to ease your pain.</p>
<p>In it, I’ll help you set up CUDA on Windows Subsystem for Linux 2 (WSL2) so you can leverage your Nvidia GPU for machine learning tasks.</p>
<p>By following these steps, you’ll be able to run ML frameworks like TensorFlow and PyTorch with GPU acceleration on Windows 11.</p>
<p>Keep in mind that this guide assumes you have a compatible Nvidia GPU. Make sure to check <a target="_blank" href="https://developer.nvidia.com/cuda-gpus">Nvidia's official compatibility list</a> before proceeding.</p>
<p>I have also prepared a video walkthrough that you can follow alongside this article.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/qOJ49nkU4rY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>Also, if this tutorial helps you, then don’t forget to add a star to the GitHub repository <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2">CUDA-WSL2-Ubuntu-v2</a>. If you face any issues or have any suggestions/improvements, then please raise an issue in the GitHub repository. Currently, the live website is available at <a target="_blank" href="https://ml-win11-v2.fahimbinamin.com/">ml-win11-v2.fahimbinamin.com</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-terminal">Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-windows-terminal">Configure Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configuration-of-my-computer">Configuration of My Computer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cpu-virtualization">CPU Virtualization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-wsl2">Install WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-and-configure-miniconda">Install and Configure Miniconda</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvidia-driver">Nvidia Driver</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA Dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cuda-toolkit">CUDA Toolkit</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvcc-version">nvcc Version</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cudnn-sdk">cuDNN SDK</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tensorflow-gpu">TensorFlow GPU</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-check-tensorflow-gpu">Check TensorFlow GPU</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-pytorch-gpu">PyTorch GPU</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-gpu">Check PyTorch GPU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following requirements met:</p>
<ul>
<li><p>Windows 11 operating system</p>
</li>
<li><p>Nvidia GPU (GTX/RTX series)</p>
</li>
<li><p>Administrator access to your PC</p>
</li>
<li><p>At least 30 GB of free disk space</p>
</li>
<li><p>Internet connection for downloads</p>
</li>
<li><p>Latest Nvidia drivers installed</p>
</li>
</ul>
<h2 id="heading-windows-terminal">Windows Terminal</h2>
<p>First, you’ll need to ensure that you have Windows Terminal installed properly in your operating system. It is the newest terminal application for users of command-line tools and shells like Command Prompt, PowerShell, and WSL. You can download it from the <a target="_blank" href="https://apps.microsoft.com/detail/9N0DX20HK701?hl=en-us&amp;gl=BD&amp;ocid=pdpshare">Microsoft Store</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094104150/c73ae561-6888-4eea-9419-186c6659a62f.png" alt="Preview of Windows Terminal on Windows 11" class="image--center mx-auto" width="1133" height="641" loading="lazy"></p>
<p>After ensuring that it’s installed properly, you can proceed to the next steps.</p>
<h2 id="heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</h2>
<p>PowerShell 7 is Microsoft's modern, cross-platform command-line shell, and it is a separate install from the Windows PowerShell 5.1 that ships with Windows. It provides aliases for some common Linux commands and comes with built-in command suggestions. You can download it from the <a target="_blank" href="https://github.com/PowerShell/PowerShell/releases/">official GitHub page</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094138179/78315197-f4f2-4df4-b022-37cb9e74cda2.png" alt="Preview of Windows PowerShell on GitHub" class="image--center mx-auto" width="1519" height="904" loading="lazy"></p>
<p>Download the latest x64 installer and install it. After ensuring that it is installed properly, you can proceed to the next steps.</p>
<h2 id="heading-configure-windows-terminal">Configure Windows Terminal</h2>
<p>Now you’ll need to configure your Windows Terminal to use PowerShell as the default shell. It’s optional and you might skip this step. But I recommend doing it for a better experience.</p>
<p>Open Windows Terminal. Click on the down arrow icon in the title bar and select "Settings".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094162440/6ea767c8-da3b-4280-84f8-0eb2b0647a46.png" alt="Preview of Windows PowerShell settings window" class="image--center mx-auto" width="1166" height="660" loading="lazy"></p>
<p>In the Settings tab, under "Startup", find the "Default profile" dropdown menu. Select "PowerShell" from the list.</p>
<p>Now for the "Default terminal application", select "Windows Terminal".</p>
<p>By default, PowerShell prints a startup banner with its version number every time it opens. If you want to hide it, select the "PowerShell" profile from the left sidebar. Click on the "Command Line" field and add the <code>--nologo</code> argument at the end of the command. After this, the line becomes <code>"C:\Program Files\PowerShell\7\pwsh.exe" --nologo</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094185648/3641d5f0-ba34-44b9-8a63-86b53068d02e.png" alt="Preview of Windows PowerShell --nologo setting" class="image--center mx-auto" width="1170" height="654" loading="lazy"></p>
<p>If you don’t use other shells frequently and want to hide them in the dropdown, then you’ll need to select those profiles one by one from the left sidebar. Scroll down to the bottom and find the "Hide profile from dropdown" toggle and enable it. It will hide that specific shell from the dropdown menu.</p>
<p>For example, I am hiding the <strong>Azure Cloud Shell</strong> profile as I don't use it frequently:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094214632/73add1b7-bcdd-4368-86a6-975fa2f72b54.png" alt="Preview of hiding profiles in Windows Terminal" class="image--center mx-auto" width="1151" height="657" loading="lazy"></p>
<p>Now click on the "Save" button at the bottom right corner to apply the changes. Close the Windows Terminal for now.</p>
<h2 id="heading-configuration-of-my-computer">Configuration of My Computer</h2>
<p>I figured it’d be helpful to share my current computer’s configuration so you can have a clear idea of which setup I’m using in this guide. Here are the details:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Component</strong></td><td><strong>Specification</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Processor</strong></td><td>AMD Ryzen 7 7700 8-Core Processor (8 Core 16 Threads)</td></tr>
<tr>
<td><strong>RAM</strong></td><td>64GB DDR5 6000MHz</td></tr>
<tr>
<td><strong>Storage</strong></td><td>1 TB Samsung 980 NVMe SSD, 4 TB HDD, 2 TB SATA SSD</td></tr>
<tr>
<td><strong>GPU</strong></td><td>NVIDIA GeForce RTX 3060 12GB GDDR6</td></tr>
<tr>
<td><strong>Operating System</strong></td><td>Windows 11 Pro Version 25H2</td></tr>
</tbody>
</table>
</div><p>Now that you have an idea about my computer’s configuration, we can proceed to the next steps.</p>
<h2 id="heading-cpu-virtualization">CPU Virtualization</h2>
<p>As we are going to use WSL2, we’ll need to make sure that the CPU virtualization is enabled. To check whether virtualization is enabled or not from Windows, simply open the Windows Task Manager. Go to the Performance tab and select CPU from the left sidebar. In the bottom right corner, you will see the Virtualization status. If it shows "Enabled", then you are good to go. If it shows "Disabled", then you need to enable it from the BIOS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094252181/29efa40c-ec0a-4d99-adb7-50596348a1aa.png" alt="Preview of Virtualization enabled status in Windows Task Manager" class="image--center mx-auto" width="824" height="760" loading="lazy"></p>
<p>⚠️ You have to ensure that CPU Virtualization is enabled in your BIOS settings. Different manufacturers have different ways to access the BIOS. Usually, you can access the BIOS by pressing the Delete or F2 key during the boot process. Once in BIOS, look for settings related to "Virtualization Technology" or "Intel VT-x"/"AMD-V" and make sure it is enabled. Save the changes and exit the BIOS.</p>
<h2 id="heading-install-wsl2">Install WSL2</h2>
<p>Open the Windows Terminal or Windows PowerShell as an administrator. Run the following command to install WSL2 along with the latest Ubuntu LTS distribution:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span>
</code></pre>
<p>It will install Windows Subsystem for Linux 2 (WSL2). After the installation is complete, you will be prompted to restart your computer. Do so to finalize the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094306994/41db30c0-ecb9-4436-a425-8a059b199c42.png" alt="Preview of WSL installation in Windows PowerShell" class="image--center mx-auto" width="1295" height="656" loading="lazy"></p>
<p>⚠️ If you encounter any issues during installation, refer to the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/wsl/troubleshooting">official Microsoft documentation</a> for troubleshooting WSL installation problems.</p>
<h2 id="heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</h2>
<p>Open Windows Terminal or PowerShell again with administrator privileges. To see which Linux distributions are available to install via WSL, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-list</span> -<span class="hljs-literal">-online</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094455888/8f1f2382-41cc-410f-a7b9-a47d3bb634b6.png" alt="Preview of available WSL distributions in Windows PowerShell" class="image--center mx-auto" width="1291" height="660" loading="lazy"></p>
<p>For installing any specific distribution, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> &lt;DistroName&gt;
</code></pre>
<p>We are going to install the latest LTS Ubuntu distribution. As of now, the latest LTS version is Ubuntu 24.04. But I prefer to install the <code>Ubuntu</code> distribution directly, as that name always points to the latest LTS version. So, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> Ubuntu
</code></pre>
<p>You need to give it a default user account name. For me, I am going with <code>fahim</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094505280/9beb24de-54da-4e0c-993d-b15f985867e3.png" alt="Preview of Ubuntu installation in Windows PowerShell" class="image--center mx-auto" width="1666" height="858" loading="lazy"></p>
<p>It also comes with a nice GUI management tool for WSL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094530944/89073fb9-881f-48bd-b5ef-a0b08f74e4c5.png" alt="Preview of WSL GUI management tool" class="image--center mx-auto" width="1114" height="724" loading="lazy"></p>
<p>From the settings window, you can configure many things, including limits on the number of processor cores, the amount of RAM, and the disk space that WSL may use.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094551095/66aea1e1-e204-4115-80e0-b3dea2d7a2ac.png" alt="Preview of WSL GUI settings window (Memory &amp; Processor)" class="image--center mx-auto" width="1919" height="1024" loading="lazy"></p>
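<p>Once Ubuntu is installed, a quick check from inside the Ubuntu terminal can confirm that the shell is really running on a WSL kernel. The sketch below is only an illustration: the helper name <code>is_wsl_kernel</code> is made up for this example, and it relies on the assumption that WSL kernel release strings contain the word "microsoft" (for example <code>5.15.167.4-microsoft-standard-WSL2</code>).</p>
<pre><code class="lang-bash"># Hypothetical helper: succeed if a kernel release string looks like a WSL kernel.
is_wsl_kernel() {
  case "$1" in
    *[Mm]icrosoft*) return 0 ;;
    *) return 1 ;;
  esac
}

# uname -r prints the release of the kernel the shell is running on.
release="$(uname -r)"
if is_wsl_kernel "$release"; then
  echo "Running inside WSL: $release"
else
  echo "This does not look like a WSL kernel: $release"
fi
</code></pre>
<p>From PowerShell, <code>wsl.exe --list --verbose</code> also lists each installed distribution together with the WSL version it uses.</p>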
<h2 id="heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</h2>
<p>Open your Ubuntu terminal from Windows Terminal. First, we need to update and upgrade the existing packages to their latest versions.</p>
<p>To update the Ubuntu system, simply use the following command:</p>
<pre><code class="lang-bash">sudo apt update -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094594281/be41e056-7e55-4139-b84b-6b7921a2d435.png" alt="Preview of apt update command in Ubuntu terminal" class="image--center mx-auto" width="1649" height="888" loading="lazy"></p>
<p>To upgrade all the packages at once, simply use the following command:</p>
<pre><code class="lang-bash">sudo apt upgrade -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094627958/b1c17b1c-5290-470b-aafe-5b89bb03bd01.png" alt="Preview of apt upgrade command in Ubuntu terminal" class="image--center mx-auto" width="1659" height="934" loading="lazy"></p>
<p>⚠️ Make sure that you have a stable internet connection during the update and upgrade process to avoid any interruptions.</p>
<h2 id="heading-install-and-configure-miniconda">Install and Configure Miniconda</h2>
<p>In Machine Learning, we need to manage multiple environments with different package versions. Conda is a popular package and environment management system that makes it easy to create and manage isolated environments for different projects. We will install Miniconda, a minimal installer for Conda, to manage our Python environments. But if you prefer Anaconda, you can install it instead.</p>
<p>Go to the official Miniconda website. Currently, the Miniconda installer is hosted in the Anaconda documentation <a target="_blank" href="https://www.anaconda.com/docs/getting-started/miniconda/install">here</a>. If the official website changes, you can search for "Miniconda installer" on Google to find the latest version. You can also create an issue in the <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2/issues">official GitHub repository of this project</a> to notify me about it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094667031/7ee2c854-88b6-49ce-8c04-41bf0a052c90.png" alt="Preview of Miniconda official website" class="image--center mx-auto" width="1895" height="935" loading="lazy"></p>
<p>As we are installing it inside WSL, we have to select the macOS/Linux Installation. Then select Linux Terminal Installer and choose Linux x86 for downloading the installer.</p>
<pre><code class="lang-bash">wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>It will download the installer to your current directory. If you ran the command from your home directory, use the following command to install it:</p>
<pre><code class="lang-bash">bash ~/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>⚠️ Make sure that you are in the correct directory where the installer is downloaded. If you downloaded it to a different location, adjust the path accordingly. Also, replace bash with zsh or sh if you are using a different shell.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094706995/3a317eb9-0340-4a84-8826-45324c93dd2f.png" alt="Preview of Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1794" height="922" loading="lazy"></p>
<p>Make sure to choose the initialization option properly. I prefer to keep the conda env active whenever I open a new shell. Therefore, I chose "Yes".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094727839/f3fc8902-0c37-432c-a912-a92810e89fd1.png" alt="Preview of Miniconda initialization option during installation" class="image--center mx-auto" width="1656" height="924" loading="lazy"></p>
<p>Make sure that the installation succeeds without any errors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094754454/53dfd998-62c9-4c2a-a71e-0d33e123e027.png" alt="Preview of successful Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1652" height="914" loading="lazy"></p>
<p>For the changes to take effect, close and reopen the current shell, or reload the configuration in place with the command below.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<p>⚠️ If you’re using a different shell like zsh or fish, make sure to source the appropriate configuration file (e.g., ~/.zshrc for zsh).</p>
<h2 id="heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</h2>
<p>I prefer to use Jupyter Notebook for running my machine learning experiments. It provides an interactive environment for coding and data analysis. We’ll install Jupyter Notebook and ipykernel in every conda environment we use, starting with the <strong>base</strong> environment. Having ipykernel in an environment also makes that environment available as a kernel inside Jupyter Notebook.</p>
<p>First, make sure that you are in the base conda environment. You will see (base) on the left side of the terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094812122/66ad5de8-7553-42da-b920-78d20c3bdc9a.png" alt="Preview of conda base environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="1027" loading="lazy"></p>
<p>Now install both Jupyter and ipykernel with the following command:</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Make sure that you accept the terms of service of Conda.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094839808/90fe3dcf-053d-4bc7-a031-22f81eb706ca.png" alt="Preview of Jupyter and Ipykernel installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="927" loading="lazy"></p>
<p>Now, I will create a separate conda environment that will hold both TensorFlow GPU and PyTorch GPU. You can also install them directly in the base environment or in any other environment you prefer. I am not specifying a Python version while creating the environment. It will automatically install the latest stable version of Python.</p>
<pre><code class="lang-bash">conda create --name ml -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094865498/ac9ef1f1-4494-4221-8376-5e257c4f9243.png" alt="Preview of creating a new conda environment named 'ml' in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="925" loading="lazy"></p>
<p>To activate any specific conda environment, you have to use the following command:</p>
<pre><code class="lang-bash">conda activate &lt;conda-env-name&gt;
</code></pre>
<p>For example, if I want to activate my newly created <strong>ml</strong> environment, I will use this command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>If you’re not sure which conda environments exist on your system, you can list all of them by running the following command:</p>
<pre><code class="lang-bash">conda env list
</code></pre>
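<p>If you want to script that check, for example before installing packages, the listing can be filtered for a single name. This is a minimal sketch: <code>env_in_listing</code> is a hypothetical helper, and it assumes that each environment appears on its own line of <code>conda env list</code> with the name in the first column.</p>
<pre><code class="lang-bash"># Hypothetical helper: read a "conda env list" listing on stdin and
# succeed only if the first column of some line equals the given name.
env_in_listing() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit found ? 0 : 1 }'
}

# Only query conda when it is actually available in this shell.
if [ -n "$(command -v conda || true)" ]; then
  if conda env list | env_in_listing ml; then
    echo "ml environment exists"
  else
    echo "ml environment not found"
  fi
else
  echo "conda is not available in this shell"
fi
</code></pre>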
<h2 id="heading-nvidia-driver">Nvidia Driver</h2>
<p>Ensure that you have the latest Nvidia drivers installed on Windows. WSL2 uses the Windows driver, so no separate driver installation is needed in Ubuntu. You can download the latest drivers from the <a target="_blank" href="https://www.nvidia.com/Download/index.aspx">official Nvidia website</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094915617/cd9b0bfc-77a1-45f1-9dab-4349c8f489ef.png" alt="Preview of Nvidia driver download page" class="image--center mx-auto" width="1750" height="916" loading="lazy"></p>
<p>After installing or updating the GPU driver, restart your computer so the changes take effect. You can use either the GeForce Game Ready Driver or the NVIDIA Studio Driver, but I recommend the Studio Driver for better stability with creative and ML applications.</p>
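<p>Because WSL2 reuses the Windows driver, the <code>nvidia-smi</code> tool should be callable from the Ubuntu terminal once the driver is in place. The following sketch wraps that check. <code>gpu_summary</code> is a hypothetical helper name, and the queried fields are just one reasonable selection.</p>
<pre><code class="lang-bash"># Hypothetical helper: print name, driver version and memory of each GPU,
# or explain what is wrong when the tool cannot be found.
gpu_summary() {
  local smi="${1:-nvidia-smi}"
  if [ -z "$(command -v "$smi" || true)" ]; then
    echo "$smi not found: check the Windows driver install"
    return 1
  fi
  "$smi" --query-gpu=name,driver_version,memory.total --format=csv,noheader
}

# Do not abort the shell session if the check fails.
gpu_summary || true
</code></pre>
<p>If the tool is missing, reinstalling the Windows driver and restarting WSL with <code>wsl.exe --shutdown</code> from PowerShell is usually the first thing to try.</p>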
<h2 id="heading-install-cuda-dependencies">Install CUDA Dependencies</h2>
<p>You might face some issues if you do not have the CUDA dependencies installed properly. I recommend that you install the required dependencies before proceeding further:</p>
<pre><code class="lang-bash">sudo apt install gcc g++ build-essential
</code></pre>
<p>If you came to this section because the CUDA installation failed, run the installer again now that the dependencies are in place.</p>
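<p>To see at a glance whether the build tools are available, you can test each one on the <code>PATH</code>. This is a small sketch and <code>missing_tools</code> is a hypothetical helper. It checks <code>make</code> as well, since <code>build-essential</code> pulls it in alongside <code>gcc</code> and <code>g++</code>.</p>
<pre><code class="lang-bash"># Hypothetical helper: print every given tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    if [ -z "$(command -v "$tool" || true)" ]; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools gcc g++ make)"
if [ -z "$missing" ]; then
  echo "All build tools are available"
else
  # Left unquoted on purpose so the names are printed on one line.
  echo "Still missing:" $missing
fi
</code></pre>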
<h2 id="heading-cuda-toolkit">CUDA Toolkit</h2>
<p>TensorFlow GPU is very picky about the CUDA version. So we need to install a specific version of CUDA Toolkit that is compatible with the TensorFlow version we are going to install.</p>
<p>To understand exactly which CUDA version is compatible with which TensorFlow version, you can check the official TensorFlow GPU support matrix <a target="_blank" href="https://www.tensorflow.org/install/pip">here</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095089103/87a44961-9426-4d20-95ac-cde06961b41a.png" alt="Preview of TensorFlow GPU support in official docs" class="image--center mx-auto" width="1879" height="931" loading="lazy"></p>
<p>At the time I’m writing this article, the TensorFlow GPU documentation says that we should have CUDA Toolkit 12.3. So I will ensure that I install exactly that version. You can simply click on that version link in the official docs and it will redirect you to the official Nvidia CUDA Toolkit download page. But if the link gets updated in the future, you can always search for "Nvidia CUDA Toolkit" on Google to find the latest version.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095106589/19689d63-5ebd-4783-8da4-e3dedd277efb.png" alt="Preview of Nvidia CUDA Toolkit official website" class="image--center mx-auto" width="1620" height="925" loading="lazy"></p>
<p>As TensorFlow GPU asks for version 12.3 exactly, I will select version 12.3.0.</p>
<p>On the CUDA Toolkit download page, choose Linux as the operating system, x86_64 as the architecture, WSL-Ubuntu as the distribution, 2.0 as the version, and runfile (local) as the installer type.</p>
<p>⚠️ As we are using Ubuntu in our WSL2, you can also choose Ubuntu as your operating system. But I prefer to choose WSL-Ubuntu for better compatibility.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095151533/b6996611-d4ce-4e07-9c73-30bdc93dbf19.png" alt="Preview of CUDA Toolkit 12.3 download page for WSL-Ubuntu" class="image--center mx-auto" width="1311" height="898" loading="lazy"></p>
<p>After selecting those, it will give you the download commands. You have to apply them sequentially. Make sure that you <strong>uncheck "Kernel Objects" while installing CUDA</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095169368/c2f81594-536f-4788-b765-1aab3b040fa7.png" alt="Preview of CUDA Toolkit 12.3 download commands for WSL-Ubuntu" class="image--center mx-auto" width="1895" height="1001" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the CUDA Toolkit properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
<h2 id="heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</h2>
<p>After installing CUDA Toolkit, we need to add the CUDA binaries to our shell profile for easy access. This will allow us to run CUDA commands from any directory in the terminal.</p>
<p>Note that, depending on the shell you are using (bash, zsh, and so on), you need to add the CUDA path to the appropriate configuration file. Make sure to replace <strong>.bashrc</strong> with <strong>.zshrc</strong> or other configuration files if you are using a different shell.</p>
<p>To add the CUDA binary path, run the command below:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export PATH=/usr/local/cuda-12.3/bin:$PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p>Use the actual path where CUDA was installed. Your terminal shows it at the end of the CUDA installation:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095215437/15768563-c956-472e-9633-95b3dd1cb7a3.png" alt="Preview of CUDA installation path in WSL Ubuntu terminal" class="image--center mx-auto" width="1912" height="1011" loading="lazy"></p>
<p>Now, add the CUDA library directory to the library path (<code>LD_LIBRARY_PATH</code>). Again, use the exact path where you installed CUDA, as listed in your terminal.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095242744/3c708db4-d267-4043-aa11-d04d890904f9.png" alt="Preview of CUDA library path in WSL Ubuntu terminal" class="image--center mx-auto" width="1284" height="693" loading="lazy"></p>
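<p>Running the two <code>echo</code> commands more than once appends duplicate lines to the profile. A small guard can avoid that. This is a sketch under assumptions: <code>append_once</code> is a hypothetical helper, not part of CUDA or bash, and the example calls are left commented out so that nothing is written until you adapt the path.</p>
<pre><code class="lang-bash"># Hypothetical helper: append a line to a file only if the exact line
# is not already there.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -q quiet, -x match the whole line, -F treat the pattern as plain text
  if grep -qxF -- "$line" "$file"; then
    echo "already present in $file"
  else
    # tee -a appends to the file and echoes the added line
    printf '%s\n' "$line" | tee -a "$file"
  fi
}

# Example calls (uncomment and adjust the CUDA path to your install):
# append_once 'export PATH=/usr/local/cuda-12.3/bin:$PATH' ~/.bashrc
# append_once 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' ~/.bashrc
</code></pre>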
<p>After adding those paths, you need to source the shell profile for the changes to take effect. You can do that by running the following command:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
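<p>To confirm that the new entries took effect in the current shell, you can look for the CUDA directories in both variables. The sketch assumes the <code>/usr/local/cuda-12.3</code> prefix used above, and <code>list_contains</code> is a hypothetical helper.</p>
<pre><code class="lang-bash"># Hypothetical helper: succeed if a colon-separated list contains an entry.
list_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

cuda_home="/usr/local/cuda-12.3"  # adjust if you installed elsewhere

if list_contains "$PATH" "$cuda_home/bin"; then
  echo "PATH includes $cuda_home/bin"
else
  echo "PATH is missing $cuda_home/bin"
fi

if list_contains "${LD_LIBRARY_PATH:-}" "$cuda_home/lib64"; then
  echo "LD_LIBRARY_PATH includes $cuda_home/lib64"
else
  echo "LD_LIBRARY_PATH is missing $cuda_home/lib64"
fi
</code></pre>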
<h2 id="heading-nvcc-version">nvcc Version</h2>
<p>NVCC stands for Nvidia CUDA Compiler. It is a compiler driver for the CUDA platform that allows developers to write parallel programs to run on Nvidia GPUs. As we have already installed the CUDA Toolkit, we need to check that the compiler is available in the shell as well.</p>
<p>Verify that CUDA is properly installed by checking the compiler version:</p>
<pre><code class="lang-bash">nvcc --version
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095277858/2d1ded0a-01ac-4f78-9f6c-ac499d623207.png" alt="Preview of nvcc version check in WSL Ubuntu terminal" class="image--center mx-auto" width="1839" height="946" loading="lazy"></p>
<p>If the output shows the correct CUDA version, then you have successfully installed CUDA Toolkit in your WSL2 Ubuntu environment.</p>
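<p>If you would rather check the version from a script than by eye, here is a minimal Python sketch. It assumes the <code>nvcc --version</code> output contains text of the form <code>release 12.3</code>. The function name <code>parse_nvcc_release</code> is hypothetical:</p>

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'major.minor' release from nvcc --version output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc is not on PATH; re-check the export lines in your shell profile.")
    else:
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        print("CUDA release:", parse_nvcc_release(result.stdout))
```

<p>The printed release should match the version in the directory name you added to your PATH.</p>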
<h2 id="heading-cudnn-sdk">cuDNN SDK</h2>
<p>The cuDNN (CUDA Deep Neural Network) SDK is a <a target="_blank" href="https://developer.nvidia.com/cudnn">GPU accelerated library of primitives for deep neural networks</a>, developed by Nvidia. It provides highly optimized building blocks for common deep learning operations, significantly speeding up the training and inference processes of AI models on Nvidia GPUs.</p>
<p>Note: Even though TensorFlow GPU suggests a specific cuDNN version, it’s often compatible with multiple versions. Because of this, I recommend downloading the latest cuDNN version that is compatible with your installed CUDA version. You can find the cuDNN download page <a target="_blank" href="https://developer.nvidia.com/cudnn-downloads">here</a>.</p>
<p>Select the Operating System as Linux, Architecture as x86_64, Distribution as Ubuntu, Version as 24.04, Installer Type as deb (local), Configuration as FULL. After selecting those, it will give you the download commands. You have to apply them sequentially.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095312370/1fca5959-f492-4160-8027-deec0674863b.png" alt="Preview of cuDNN download commands for Ubuntu 24.04" class="image--center mx-auto" width="1543" height="938" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the cuDNN SDK properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
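<p>To see whether the dynamic linker can find cuDNN after installation, something like the following Python sketch may help. It relies on the standard library's <code>ctypes.util.find_library</code> and assumes the library is registered under the name <code>cudnn</code>. A missing result does not necessarily mean the install failed, only that this particular lookup did not find it:</p>

```python
from ctypes.util import find_library


def locate(name):
    """Return the library file name the system resolves for `name`, or None."""
    return find_library(name)


if __name__ == "__main__":
    # "cudnn" is an assumed library name (libcudnn.so.*).
    found = locate("cudnn")
    print("cuDNN library:", found if found else "not found by find_library")
```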
<h2 id="heading-tensorflow-gpu">TensorFlow GPU</h2>
<p>Now, we are going to install TensorFlow GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created <strong>ml</strong> environment. To activate it, I’ll use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>⚠️ Make sure that you have activated the correct conda environment before installing TensorFlow GPU. You will see the environment name in the terminal prompt.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095398777/0c7d8813-eb6c-4e2e-bad9-1fc7d344d7a2.png" alt="Preview of activating 'ml' conda environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1227" height="692" loading="lazy"></p>
<p>I will install ipykernel and jupyter in this new environment.</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Now, to install TensorFlow GPU, I will simply use the following command:</p>
<pre><code class="lang-bash">pip install <span class="hljs-string">"tensorflow[and-cuda]"</span>
</code></pre>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<h3 id="heading-check-tensorflow-gpu">Check TensorFlow GPU</h3>
<p>After installing TensorFlow GPU, we need to verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal:</p>
<pre><code class="lang-bash">python3 -c <span class="hljs-string">"import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"</span>
</code></pre>
<p>If the output shows a list of available GPU devices, then TensorFlow GPU is successfully installed and working properly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095453933/ccda58fc-9ae9-4185-9c78-6196c98d8b7c.png" alt="Preview of TensorFlow GPU check in WSL Ubuntu terminal" width="1903" height="1029" loading="lazy"></p>
<h2 id="heading-pytorch-gpu">PyTorch GPU</h2>
<p>Now, we’re going to install PyTorch GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created ml environment. To activate it, I will use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>Installing PyTorch GPU is very straightforward. You can use the official PyTorch installation command generator <a target="_blank" href="https://pytorch.org/get-started/locally/">here</a>.</p>
<p>Make sure to select the latest Stable release as the PyTorch Build, Linux as Your OS, Pip as the Package, and Python as the Language. For the Compute Platform, select the CUDA version that matches your installed CUDA Toolkit. For me, that is CUDA 12.3. If you cannot find the exact version, choose the closest one. As CUDA 12.3 is not available for me now, I am choosing CUDA 12.6.</p>
<p>After selecting those, it will give you the installation command. You have to apply it in your WSL Ubuntu terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095511862/6f631369-c8db-4681-9d1c-669ad88df69d.png" alt="Preview of PyTorch installation command generator" class="image--center mx-auto" width="1618" height="911" loading="lazy"></p>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095532246/56232263-36ea-4043-9881-df162965c514.png" alt="Preview of PyTorch GPU installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1280" height="689" loading="lazy"></p>
<h3 id="heading-check-pytorch-gpu">Check PyTorch GPU</h3>
<p>After installing PyTorch GPU, verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal:</p>
<pre><code class="lang-bash">python3 - &lt;&lt; <span class="hljs-string">'EOF'</span>
import torch
<span class="hljs-built_in">print</span>(torch.cuda.is_available())
<span class="hljs-built_in">print</span>(torch.cuda.device_count())
<span class="hljs-built_in">print</span>(torch.cuda.current_device())
<span class="hljs-built_in">print</span>(torch.cuda.device(0))
<span class="hljs-built_in">print</span>(torch.cuda.get_device_name(0))
EOF
</code></pre>
<p>The output should look similar to the screenshot, showing:</p>
<ul>
<li><p><strong>True</strong>: GPU is available for PyTorch</p>
</li>
<li><p><strong>1</strong>: Number of detected CUDA devices</p>
</li>
<li><p><strong>0</strong>: Index of the current active CUDA device</p>
</li>
<li><p>A device object representation</p>
</li>
<li><p><strong>NVIDIA GeForce RTX 3060</strong> (or your GPU name)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095584921/69269152-7ea6-404b-b1ca-8534b51f2491.png" alt="Preview of PyTorch GPU check in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="937" loading="lazy"></p>
<h3 id="heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</h3>
<p>Now that the environment is fully configured, we will verify GPU support directly inside Jupyter Notebook. This ensures both PyTorch and TensorFlow can successfully detect and use your GPU.</p>
<h4 id="heading-1-test-pytorch-gpu">1. Test PyTorch GPU</h4>
<p>Create a new Jupyter Notebook and run the following commands one by one:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(<span class="hljs-number">0</span>))
print(torch.cuda.get_device_name(<span class="hljs-number">0</span>))
</code></pre>
<p>If everything is configured correctly, you will see your GPU (for example <strong>NVIDIA GeForce RTX 3060</strong>) detected properly:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095624229/f94c97a0-2e44-45ad-a2a8-52f40c922482.png" alt="Preview of PyTorch GPU check inside Jupyter Notebook" class="image--center mx-auto" width="1861" height="743" loading="lazy"></p>
<h4 id="heading-2-test-tensorflow-gpu">2. Test TensorFlow GPU</h4>
<p>Next, run the following code to check whether TensorFlow detects your GPU:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

print(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>))
</code></pre>
<p>You can also check the number of GPUs detected:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Num GPUs Available:"</span>, len(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>)))
</code></pre>
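<p>Finally, run TensorFlow’s built-in GPU checks. Warnings are normal here: <code>tf.test.is_gpu_available()</code> is deprecated in TensorFlow 2.x in favor of <code>tf.config.list_physical_devices('GPU')</code>, so expect a deprecation notice:</p>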
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

<span class="hljs-keyword">assert</span> tf.test.is_gpu_available()
<span class="hljs-keyword">assert</span> tf.test.is_built_with_cuda()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095666216/f9017979-b5c9-4b86-9f60-d9aaa2fe8ac1.png" alt="TensorFlow GPU initialization and CUDA validation output" class="image--center mx-auto" width="1638" height="935" loading="lazy"></p>
<p>If TensorFlow logs show your GPU model (such as <strong>RTX 3060</strong>), then TensorFlow GPU is successfully installed and fully working inside Jupyter Notebook.</p>
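<p>If you want a single notebook cell that checks both frameworks at once, the following is one possible sketch. The function name <code>gpu_report</code> is illustrative, and the code assumes nothing beyond the calls already used above (<code>torch.cuda.is_available()</code>, <code>torch.cuda.get_device_name(0)</code>, and <code>tf.config.list_physical_devices('GPU')</code>). It skips any framework that is not installed instead of raising an error:</p>

```python
import importlib.util


def gpu_report():
    """Return a dict mapping framework name to a short GPU status string."""
    report = {}

    if importlib.util.find_spec("torch") is None:
        report["torch"] = "not installed"
    else:
        import torch
        if torch.cuda.is_available():
            report["torch"] = torch.cuda.get_device_name(0)
        else:
            report["torch"] = "installed, no GPU visible"

    if importlib.util.find_spec("tensorflow") is None:
        report["tensorflow"] = "not installed"
    else:
        import tensorflow as tf
        gpus = tf.config.list_physical_devices("GPU")
        report["tensorflow"] = gpus[0].name if gpus else "installed, no GPU visible"

    return report


if __name__ == "__main__":
    for framework, status in gpu_report().items():
        print(framework + ":", status)
```

<p>Seeing a device name for both entries suggests that the Jupyter kernel is running in the same conda environment where you installed the two libraries.</p>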
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thank you so much for reading all the way through. I hope you have been able to configure your Windows 11 computer properly for running almost any kind of machine learning experiment.</p>
<p>To get more content like this, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/fahimfba/">LinkedIn</a> and <a target="_blank" href="https://x.com/Fahim_FBA">X</a>. You can also check <a target="_blank" href="https://www.fahimbinamin.com/">my website</a> and follow me on <a target="_blank" href="https://github.com/FahimFBA">GitHub</a> if you are into open source and development.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ From Pixels to Predictions: How GPUs Started Powering Modern AI ]]>
                </title>
                <description>
                    <![CDATA[ When people think of artificial intelligence, they imagine complex models, data centers, and cloud servers. What most don’t realize is that the real engine behind this AI revolution started in a place few expected: inside the humble gaming PC. The sa... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/from-pixels-to-predictions-how-gpus-started-powering-modern-ai/</link>
                <guid isPermaLink="false">69164b0a612455db63b30bdc</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Games ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hardware ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Thu, 13 Nov 2025 21:18:02 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763068628238/531621b6-1931-422f-b8ff-455c1ef58dab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When people think of artificial intelligence, they imagine complex models, data centers, and cloud servers.</p>
<p>What most don’t realize is that the real engine behind this AI revolution started in a place few expected: inside the humble gaming PC.</p>
<p>The same graphics cards once built to render smooth 3D visuals are now powering chatbots, image generators, and self-driving systems. The journey from pixels to predictions is one of the most fascinating stories in modern computing.</p>
<h2 id="heading-the-cpu-era-and-its-limits"><strong>The CPU Era and Its Limits</strong></h2>
<p>In the early days of machine learning, researchers depended on CPUs to crunch data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762950299566/a8c9ea6a-f420-4f9e-b87c-5b584be5166a.png" alt="CPU architecture" class="image--center mx-auto" width="1000" height="500" loading="lazy"></p>
<p>CPUs were versatile and great for handling a wide range of tasks, but they had one big limitation: they worked on problems in sequence.</p>
<p>That means they could process only a few operations at a time. For small models, this was fine. But as neural networks grew in complexity, training them on CPUs became painfully slow.</p>
<p>Imagine trying to teach a computer to recognize images. A neural network might have millions of parameters, and every single one needs to be adjusted again and again during training.</p>
<p>On CPUs, that could take days or even weeks. Researchers quickly realized that if AI was going to advance, it needed a completely different kind of hardware.</p>
<h2 id="heading-how-gpus-entered-the-picture"><strong>How GPUs Entered the Picture</strong></h2>
<p><a target="_blank" href="https://aws.amazon.com/what-is/gpu/">Graphics processing units</a>, or GPUs, were originally built to render the fast-moving images in video games. They were designed for parallelism, performing thousands of small calculations at the same time.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/Axd50ew4pco" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>While a CPU might have a handful of cores, a GPU has thousands. This architecture made GPUs ideal for the kind of math used in machine learning, where the same operation needs to be applied to huge amounts of data simultaneously.</p>
<p>In a way, the GPU was built for games but destined for AI. What started as a chip to make lighting effects smoother and explosions look more realistic soon found a second life powering neural networks.</p>
<p>Around the early 2010s, researchers began experimenting with running <a target="_blank" href="https://www.freecodecamp.org/news/deep-learning-fundamentals-handbook-start-a-career-in-ai/">deep learning algorithms</a> on GPUs, and the results were stunning. Training times dropped from weeks to days, and accuracy improved.</p>
<p>It was a quiet revolution happening in research labs around the world.</p>
<h2 id="heading-the-role-of-gaming-pcs-in-early-ai-research"><strong>The Role of Gaming PCs in Early AI Research</strong></h2>
<p>Here’s where the story gets even more interesting: many of the early breakthroughs in AI didn’t come from massive data centers or expensive supercomputers. They came from researchers using consumer-grade GPUs, often sitting inside regular gaming PCs.</p>
<p>These machines, built for entertainment, turned out to be powerful enough for deep learning experiments.</p>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/CUDA">NVIDIA’s CUDA</a> platform made this possible by allowing developers to program GPUs for tasks beyond graphics. Suddenly, a gaming GPU could handle complex scientific computations.</p>
<p>Researchers used their own rigs, sometimes the same computers they played games on at night, to train neural networks that recognized speech, images, and text. The gaming PC became a testbed for the future of artificial intelligence.</p>
<h2 id="heading-the-turning-point-alexnet-and-the-deep-learning-boom"><strong>The Turning Point: AlexNet and the Deep Learning Boom</strong></h2>
<p>In 2012, a neural network called <a target="_blank" href="https://www.pinecone.io/learn/series/image-search/imagenet/">AlexNet</a> stunned the world by winning the ImageNet competition, a major benchmark in computer vision.</p>
<p>What made AlexNet special wasn’t just its architecture but the hardware behind it. It ran on two NVIDIA GTX 580 GPUs, hardware you could buy for your <a target="_blank" href="https://www.eneba.com/hub/gaming-gear/best-gaming-pc-under-1000/">low-cost gaming PC</a>. That win marked a turning point. It proved that GPUs weren’t just for rendering graphics – they were the key to advancing AI.</p>
<p>After that, the AI world changed fast. Every major research lab and tech company started building GPU clusters. NVIDIA, sensing the opportunity, leaned into AI hardware development.</p>
<p>The same company that once catered mainly to gamers now powered Google, OpenAI, and Tesla. What started as a tool for better visuals had become the backbone of machine intelligence.</p>
<h2 id="heading-why-gpus-are-so-good-at-ai"><strong>Why GPUs Are So Good at AI</strong></h2>
<p>GPUs excel at matrix math, the kind of computation that neural networks rely on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763135976807/adcb3ce6-40e0-4bb2-b138-7f9208b4a6b4.jpeg" alt="matrix math" class="image--center mx-auto" width="1000" height="271" loading="lazy"></p>
<p>When you train a model, you’re constantly multiplying and adding matrices of numbers. GPUs do this faster because they handle thousands of operations in parallel. They’re also designed with high memory bandwidth, meaning they can move large amounts of data in and out quickly.</p>
<p>This architecture fits perfectly with deep learning workloads. Whether it’s image recognition or language translation, GPUs can process huge batches of data at once.</p>
<p>CPUs, by contrast, get bottlenecked by sequential processing. The difference in performance is like comparing a single craftsman building a house to a team of thousands working at once.</p>
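<p>To make the parallelism concrete, here is a small pure-Python sketch of matrix multiplication (the function name <code>matmul</code> is illustrative). It runs sequentially, the way a single CPU core would, but notice that every output cell is a dot product that depends on no other output cell. That independence is what lets a GPU hand each cell to a different thread:</p>

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    cols = len(b[0])
    # Each (i, j) cell reads one row of a and one column of b and nothing else,
    # so all cells could be computed at the same time on parallel hardware.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

<p>A 2×2 product has only four independent cells, but a single layer of a neural network can involve millions of them, which is where thousands of GPU cores pay off.</p>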
<h2 id="heading-the-ai-hardware-race"><strong>The AI Hardware Race</strong></h2>
<p>As AI took off, the demand for GPUs exploded. What began in gaming PCs scaled up into massive data centers filled with thousands of cards.</p>
<p>Companies like NVIDIA developed new lines of GPUs specifically for AI, such as the Tesla and A100 series. Other players joined the race too, like AMD with its <a target="_blank" href="https://www.amd.com/en/products/software/rocm.html">ROCm platform</a>, and Google with its custom TPUs (Tensor Processing Units).</p>
<p>Yet, even today, the line between gaming and AI hardware remains blurred. The same RTX GPUs designed for gamers are still used by many AI researchers and small startups.</p>
<p>A powerful gaming PC equipped with a modern GPU can run local AI models, generate images, or even fine-tune small language models. The hardware that made virtual worlds come alive now brings intelligence to our real one.</p>
<h2 id="heading-the-future-of-gpus-and-ai"><strong>The Future of GPUs and AI</strong></h2>
<p>As AI models grow larger, new challenges are emerging. GPUs are evolving to handle trillion-parameter models, but they’re also getting smarter about energy use and efficiency.</p>
<p>Technologies like chiplet design, optical interconnects, and AI-specific cores are pushing performance further while keeping costs down.</p>
<p>Meanwhile, local AI is making a comeback. With advancements in GPU efficiency, many users are experimenting with running models on their own machines.</p>
<p>A well-equipped gaming PC can now do what once required access to a cloud GPU cluster. This shift could democratize AI development, letting anyone with the right hardware explore the field from home.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The GPU’s journey from gaming to AI is one of the most unexpected transformations in tech history. What started as a chip to render virtual landscapes evolved into the heart of artificial intelligence. From early experiments on gaming PCs to the data centers powering today’s largest models, GPUs have bridged the worlds of creativity, computation, and cognition.</p>
<p>As we look ahead, it’s clear that the same technology that once made games more realistic is now making machines more intelligent. The story of the GPU reminds us that innovation often comes from unexpected places, and sometimes, the future of AI begins in the glow of a gaming screen.</p>
<p><em>Hope you enjoyed this article. Find me on</em> <a target="_blank" href="https://linkedin.com/in/manishmshiva"><em>LinkedIn</em></a> <em>or</em> <a target="_blank" href="https://manishshivanandhan.com/"><em>visit my website</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
