<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ data-engineering - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ data-engineering - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 06 Jun 2026 11:16:57 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/data-engineering/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Preprocess Medical Images for Machine Learning – A Guide Using Chest X-Rays ]]>
                </title>
                <description>
                    <![CDATA[ Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different o ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-preprocess-medical-images-for-machine-learning/</link>
                <guid isPermaLink="false">6a21b25709761aac249473c9</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ healthcare ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Medical Imaging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Preprocessing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Lakshmi Mahabaleshwara ]]>
                </dc:creator>
                <pubDate>Thu, 04 Jun 2026 17:13:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/eab58d7c-f63a-41ae-a01e-52a65b0be17c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Working with healthcare data introduces preprocessing challenges that go beyond those you might encounter with structured data. Some familiar techniques still apply, while others look very different once your data becomes medical images.</p>
<p>In this article, you’ll learn how to prepare a real-world medical imaging dataset for machine learning, from initial data validation to a complete preprocessing pipeline.</p>
<p>We’ll use the Chest X-Ray Pneumonia dataset as our running example, but the lessons apply broadly to healthcare imaging data, including ultrasound, MRI, CT, and dermatology images.</p>
<h2 id="heading-what-youll-learn-in-this-article">What You'll Learn in This Article</h2>
<p>By the end of this article, you'll know how to:</p>
<ul>
<li><p>Approach healthcare data preprocessing differently from preprocessing structured data, and recognize where standard techniques fall short</p>
</li>
<li><p>Validate a medical imaging dataset before training to catch corrupted files, mislabels, and data leakage between train and test</p>
</li>
<li><p>Apply six core preprocessing techniques for medical images</p>
</li>
<li><p>Build a complete preprocessing pipeline for chest X-rays using Python and OpenCV</p>
</li>
</ul>
<h2 id="heading-what-well-cover"><strong>What We'll Cover:</strong></h2>
<ul>
<li><p><a href="#heading-why-preprocessing-data-matters-more-in-healthcare">Why Preprocessing Data Matters More in Healthcare</a></p>
</li>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-before-preprocessing-validate-the-dataset">Before Preprocessing: Validate the Dataset</a></p>
</li>
<li><p><a href="#heading-the-six-pillars-of-healthcare-imaging-preprocessing">The Six Pillars of Healthcare Imaging Preprocessing</a></p>
</li>
<li><p><a href="#heading-pillar-1-scaling-making-the-numbers-play-fair">Pillar 1: Scaling — Making the Numbers Play Fair</a></p>
</li>
<li><p><a href="#heading-pillar-2-normalization-centering-the-data">Pillar 2: Normalization — Centering the Data</a></p>
</li>
<li><p><a href="#heading-pillar-3-guiding-the-models-attention">Pillar 3: Guiding the Model's Attention</a></p>
</li>
<li><p><a href="#heading-pillar-4-handling-missing-data">Pillar 4: Handling Missing Data</a></p>
</li>
<li><p><a href="#heading-pillar-5-resizing-amp-resampling-fitting-everything-in-the-same-frame">Pillar 5: Resizing &amp; Resampling — Fitting Everything in the Same Frame</a></p>
</li>
<li><p><a href="#heading-pillar-6-denoising-amp-artifact-handling-cleaning-the-window">Pillar 6: Denoising &amp; Artifact Handling — Cleaning the Window</a></p>
</li>
<li><p><a href="#heading-putting-it-all-together-a-complete-pipeline">Putting it All Together: A Complete Pipeline</a></p>
</li>
<li><p><a href="#heading-try-it-yourself">Try it Yourself</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-preprocessing-data-matters-more-in-healthcare">Why Preprocessing Data Matters More in Healthcare</h2>
<p>Imagine handing a toddler a jigsaw puzzle with missing pieces, warped edges, and pieces from three different puzzles mixed together. The toddler can't solve it, but that isn't really the toddler's fault.</p>
<p>The same thing happens when raw, messy data gets fed into a machine learning model. In healthcare the cost is higher: a bad prediction on a clinical image can mean a missed diagnosis.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/55671e0b-95ea-4f99-b507-a8742e8981d9.png" alt="Illustration showing a healthcare data preprocessing workflow. Mixed medical images with different sizes, missing labels, noisy scans, and corrupted files enter a preprocessing pipeline and emerge as clean, standardized, model-ready images ready for machine learning." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Healthcare data tends to be messier than what most ML practitioners are used to:</p>
<ul>
<li><p>Images come from different machines, hospitals, and acquisition protocols</p>
</li>
<li><p>Labels are inconsistent, sometimes missing, sometimes wrong</p>
</li>
<li><p>Patient data is incomplete</p>
</li>
<li><p>Image sizes, contrast levels, and orientations vary across sources</p>
</li>
</ul>
<p>Poor preprocessing often leads to models that perform well on benchmark datasets but struggle to generalize to data collected from different hospitals or imaging devices.</p>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>This guide uses the <strong>Chest X-Ray Pneumonia dataset</strong> by Paul Mooney on Kaggle. It's a strong choice for learning preprocessing because:</p>
<ul>
<li><p>It contains around 5,800 pediatric chest X-rays</p>
</li>
<li><p>It has two clear classes — Normal and Pneumonia</p>
</li>
<li><p>It's already organized into train, validation, and test folders</p>
</li>
<li><p>The images are recognizable without specialized medical training</p>
</li>
<li><p>It exhibits almost every preprocessing challenge worth learning</p>
</li>
</ul>
<p>The dataset is available at <a href="https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia">Kaggle: Chest X-Ray Pneumonia</a>.</p>
<h3 id="heading-folder-structure">Folder Structure</h3>
<p>After downloading, the dataset is organized like this:</p>
<pre><code class="language-plaintext">chest_xray/
├── train/
│   ├── NORMAL/
│   └── PNEUMONIA/
├── val/
│   ├── NORMAL/
│   └── PNEUMONIA/
└── test/
    ├── NORMAL/
    └── PNEUMONIA/
</code></pre>
<p>Side-by-side comparison — Normal vs Pneumonia chest X-ray:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/b92e1e14-ac24-4314-afce-bc2c3ce3ea32.png" alt="Side-by-side chest X-ray images showing a normal lung scan on the left and a pneumonia scan on the right. The pneumonia image contains visible cloudy opacities compared with the clearer lung fields in the normal image." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>A quick first look at one of the images:</p>
<pre><code class="language-python">import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import cv2

DATA_DIR = "chest_xray"
TRAIN_DIR = os.path.join(DATA_DIR, "train")

# Peek at a sample image
sample_path = os.path.join(TRAIN_DIR, "NORMAL", os.listdir(os.path.join(TRAIN_DIR, "NORMAL"))[0])
sample_image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

print(f"Image shape: {sample_image.shape}")
print(f"Pixel range: {sample_image.min()} to {sample_image.max()}")
print(f"Data type: {sample_image.dtype}")
</code></pre>
<p>Running this on a handful of images reveals a few useful things right away: most images are large (often around 1500×2000 pixels), pixel values fall in the 0–255 range, and image sizes vary across the dataset. Each of these observations will inform a preprocessing step.</p>
<h2 id="heading-before-preprocessing-validate-the-dataset">Before Preprocessing: Validate the Dataset</h2>
<p>Before applying any transformations, it's worth checking that the data itself is intact. This step alone catches issues that would otherwise cause training to fail silently or produce misleading results.</p>
<p>A simple validation function:</p>
<pre><code class="language-python">def validate_dataset(data_dir):
    """Scan a dataset folder and flag common data quality issues."""
    corrupted = []
    too_small = []
    nearly_black = []
    total = 0
    
    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for fname in os.listdir(class_path):
            fpath = os.path.join(class_path, fname)
            total += 1
            try:
                img = cv2.imread(fpath, cv2.IMREAD_GRAYSCALE)
                if img is None:
                    corrupted.append(fpath)
                    continue
                if img.shape[0] &lt; 100 or img.shape[1] &lt; 100:
                    too_small.append(fpath)
                if img.mean() &lt; 5:
                    nearly_black.append(fpath)
            except Exception:
                corrupted.append(fpath)
    
    print(f"Total files scanned: {total}")
    print(f"Corrupted: {len(corrupted)}")
    print(f"Too small: {len(too_small)}")
    print(f"Nearly black: {len(nearly_black)}")
    return corrupted, too_small, nearly_black

validate_dataset(TRAIN_DIR)
</code></pre>
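<p>Common issues to check for at this stage. The function above covers the first three; duplicates and mislabeled images need separate checks, such as comparing file hashes across splits and spot-checking labels:</p>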
<ul>
<li><p><strong>Corrupted files</strong> — files that won't open at all</p>
</li>
<li><p><strong>Empty or nearly-black images</strong> — failed acquisitions or saved-as-blank files</p>
</li>
<li><p><strong>Wrong dimensions</strong> — thumbnails or partial downloads mixed in</p>
</li>
<li><p><strong>Duplicate images</strong> — the same scan appearing in both train and test (this causes data leakage)</p>
</li>
<li><p><strong>Mislabeled images</strong> — a normal X-ray placed in the pneumonia folder</p>
</li>
</ul>
<p><strong>⚠️ This step is critical.</strong> One corrupted file can crash a training loop hours into a run. One duplicate between train and test can inflate accuracy scores without anyone noticing.</p>
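<p>The validation function above doesn't look for duplicates, so here is a minimal sketch of one way to check for them, assuming that leaked copies are byte-identical files. The helper names <code>file_digest</code> and <code>find_cross_split_duplicates</code> are hypothetical, and a re-encoded or resized copy of the same scan would not be caught by this approach.</p>

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cross_split_duplicates(split_dirs):
    """Map each content hash to its files, keeping hashes seen in 2+ splits.

    split_dirs is a dict such as {"train": "chest_xray/train", ...}.
    """
    seen = defaultdict(list)  # digest -> [(split_name, file_path), ...]
    for split_name, split_dir in split_dirs.items():
        for root, _dirs, files in os.walk(split_dir):
            for fname in sorted(files):
                fpath = os.path.join(root, fname)
                seen[file_digest(fpath)].append((split_name, fpath))

    duplicates = {}
    for digest, entries in seen.items():
        splits = {split_name for split_name, _path in entries}
        if len(splits) != 1:  # same bytes present in more than one split
            duplicates[digest] = entries
    return duplicates
```

<p>Any entry this returns is a candidate for removal from one of the splits before training starts.</p>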
<h2 id="heading-the-six-pillars-of-healthcare-imaging-preprocessing"><strong>The Six Pillars of Healthcare Imaging Preprocessing</strong></h2>
<p>Preprocessing for medical images can be organized around six core concerns. Two of them carry over directly from preprocessing structured data. Two need to be adapted because the mechanics change when the input is an image. And two are entirely new: they only exist once the data becomes pictures of human bodies.</p>
<h2 id="heading-pillar-1-scaling-making-the-numbers-play-fair">Pillar 1: Scaling — Making the Numbers Play Fair</h2>
<p>Imagine two children comparing their collections. One has 3 seashells. The other has 3,000 stickers. Asking who has more makes the answer seem obvious, but the <em>scales</em> are completely different. Comparing them meaningfully means putting both collections on the same measuring system.</p>
<p>In medical images, pixels usually range from 0 to 255 in 8-bit images, or 0 to 65,535 in some 16-bit medical DICOM images. Neural networks tend to train faster and more reliably when input values are small numbers close to zero.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/1d864b0d-992c-4637-8f43-7ca86c6fd93c.png" alt="Histogram comparison showing chest X-ray pixel values before and after scaling. The left histogram displays values in the 0–255 range, while the right histogram shows the same distribution scaled to the 0–1 range used for machine learning." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>The fix:</strong> Divide every pixel by its maximum possible value, bringing everything into the 0-to-1 range.</p>
<pre><code class="language-python">image = cv2.imread(sample_path, cv2.IMREAD_GRAYSCALE)

# Scale to [0, 1]
image_scaled = image.astype(np.float32) / 255.0

print(f"Before scaling: {image.min()} to {image.max()}")
print(f"After scaling:  {image_scaled.min():.3f} to {image_scaled.max():.3f}")
</code></pre>
<p><strong>Takeaway:</strong> Pixel scaling follows the same principle as scaling any numerical feature. The values simply happen to be arranged as an image rather than a column.</p>
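<p>The snippet above hard-codes 255, which only fits 8-bit images. As a sketch of how the same idea could extend to the 16-bit case mentioned earlier, the hypothetical helper below picks the divisor from the array's dtype. It assumes the image uses the full range of its dtype. Some 16-bit files store fewer significant bits, in which case the bit depth recorded in the file's metadata would be the better divisor.</p>

```python
import numpy as np


def scale_to_unit_range(image):
    """Scale an unsigned-integer image to float32 values in [0, 1].

    The divisor is the largest value the dtype can hold (255 for uint8,
    65535 for uint16), not the largest value found in this image.
    """
    if image.dtype.kind != "u":
        raise TypeError(f"expected an unsigned integer image, got {image.dtype}")
    max_value = np.iinfo(image.dtype).max
    return image.astype(np.float32) / np.float32(max_value)
```

<p>Dividing by the dtype maximum rather than the per-image maximum keeps the mapping identical for every image, so a dim scan stays dim after scaling.</p>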
<h2 id="heading-pillar-2-normalization-centering-the-data">Pillar 2: Normalization — Centering the Data</h2>
<p>Imagine a teacher asks a class to rate a movie from 1 to 10. One child always gives 9s and 10s. Another spreads ratings evenly from 1 to 10. Comparing their opinions fairly requires adjusting each child's score relative to their own average.</p>
<p>In medical imaging, even after scaling to 0–1, the overall brightness of images can vary. Some X-rays are taken with stronger exposure than others. Normalization shifts and rescales pixel values, using statistics computed per dataset, per image, or per channel, so that they are centered around zero with a standard deviation of about one.</p>
<p><strong>The fix:</strong> Subtract the mean, divide by the standard deviation.</p>
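<pre><code class="language-python"># Compute mean and std from the TRAINING set only, never from validation or test
def compute_train_stats(train_dir, sample_limit=1000):
    """Compute pixel mean and std over a sample of the training set."""
    class_dirs = [os.path.join(train_dir, d) for d in sorted(os.listdir(train_dir))
                  if os.path.isdir(os.path.join(train_dir, d))]
    # Split the sample across classes so it isn't drawn from the first folder only
    per_class = max(1, sample_limit // max(1, len(class_dirs)))
    total, total_sq, n_pixels = 0.0, 0.0, 0
    for class_path in class_dirs:
        for fname in sorted(os.listdir(class_path))[:per_class]:
            img = cv2.imread(os.path.join(class_path, fname), cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue
            pixels = img.astype(np.float64) / 255.0
            # Running sums avoid holding every full-size image in memory at once
            total += pixels.sum()
            total_sq += np.square(pixels).sum()
            n_pixels += pixels.size
    mean = total / n_pixels
    std = np.sqrt(total_sq / n_pixels - mean ** 2)
    return mean, std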

train_mean, train_std = compute_train_stats(TRAIN_DIR)
image_normalized = (image_scaled - train_mean) / train_std
</code></pre>
<p><strong>⚠️</strong> Avoid this common mistake: Statistics for normalization should be computed from the training set only, never from validation or test. Including those in the calculation leaks information from the evaluation data into the model. The same statistics should then be applied to validation, test, and any new data at inference time.</p>
<p><strong>Takeaway:</strong> Centering and scaling pixels with the training set's statistics is the imaging equivalent of standardizing a feature column. Every image now passes through the same transform, so values sit on a common scale across scans. Removing brightness differences between individual scans would additionally require per-image normalization.</p>
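<p>To make the difference between dataset-level and per-image statistics concrete, here is a small sketch with two hypothetical helpers. It is an illustration under simple assumptions (a single-channel image already scaled to 0–1), and which variant suits a given task is a judgment call, since per-image normalization also discards absolute brightness that may carry signal.</p>

```python
import numpy as np


def normalize_per_image(image_scaled, eps=1e-6):
    """Z-score one image using its own mean and standard deviation."""
    image_scaled = np.asarray(image_scaled, dtype=np.float32)
    mean = image_scaled.mean()
    std = image_scaled.std()
    return (image_scaled - mean) / (std + eps)  # eps guards against flat images


def normalize_with_train_stats(image_scaled, train_mean, train_std):
    """Z-score one image using statistics computed on the training set."""
    image_scaled = np.asarray(image_scaled, dtype=np.float32)
    return (image_scaled - train_mean) / train_std


# Two synthetic "exposures" of the same pattern: one dim, one brighter
pattern = np.linspace(0.0, 1.0, 16).reshape(4, 4)
dim = 0.5 * pattern
bright = 0.5 * pattern + 0.4

# Per-image statistics remove the exposure offset entirely
same_after_per_image = np.allclose(
    normalize_per_image(dim), normalize_per_image(bright), atol=1e-4
)
print("Per-image results match:", same_after_per_image)
```

<p>With training-set statistics, the brighter exposure stays brighter after normalization, just expressed in units of the training set's standard deviation.</p>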
<h2 id="heading-pillar-3-guiding-the-models-attention">Pillar 3: Guiding the Model's Attention</h2>
<p>Imagine a child walking into a crowded pet store. Instead of describing every animal in sight, a parent points to the features that matter: <em>“Look at the soft fur, the fluffy tail, and the nice small size.”</em> The child learns where to focus their attention.</p>
<p>Medical image preprocessing does something similar. It highlights the regions and features most relevant to the diagnostic task.</p>
<ul>
<li><p><strong>Region-of-interest (ROI) cropping</strong> — focus on the lung field and discard the patient's arms, machine borders, and any imprinted text</p>
</li>
<li><p><strong>Contrast enhancement</strong> — use techniques like CLAHE (Contrast Limited Adaptive Histogram Equalization) to make subtle lung textures more visible</p>
</li>
<li><p><strong>Channel selection</strong> — for images stored as RGB but containing grayscale information, convert to single-channel input to remove redundant channels</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/54cb1319-e794-472e-9ca4-22a063fd5092.png" alt="Three-panel illustration showing a chest X-ray before and after feature enhancement. The first panel shows the original image, the second highlights the lung region of interest, and the third shows the image after CLAHE contrast enhancement with lung textures appearing more visible." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>CLAHE applied to an X-ray:</p>
<pre><code class="language-python"># CLAHE enhances local contrast — useful for X-rays
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
image_enhanced = clahe.apply(image)

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(image_enhanced, cmap='gray')
axes[1].set_title('After CLAHE')
plt.show()
</code></pre>
<p><strong>Takeaway:</strong> The goal of teaching the model what to look at hasn't changed. With structured data, the answer is in new columns. With images, the answer is in cropping, enhancement, and emphasizing the regions that carry diagnostic signal.</p>
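<p>ROI cropping is listed above without code, so here is a deliberately crude sketch of the idea. It assumes the unwanted region is a dark border around the scan and simply keeps the bounding box of pixels above a threshold. The function name, the threshold, and the margin are all hypothetical, and this is a stand-in for proper lung-field segmentation, not a replacement for it.</p>

```python
import numpy as np


def crop_to_content(image, threshold=10, margin=0):
    """Crop away dark borders by keeping the bounding box of bright pixels.

    A crude stand-in for real lung-field segmentation: it only removes
    rows and columns whose pixels are all at or below `threshold`.
    """
    mask = np.greater(image, threshold)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        return image  # nothing above threshold; leave the image untouched
    top = max(int(rows[0]) - margin, 0)
    bottom = min(int(rows[-1]) + margin + 1, image.shape[0])
    left = max(int(cols[0]) - margin, 0)
    right = min(int(cols[-1]) + margin + 1, image.shape[1])
    return image[top:bottom, left:right]
```

<p>Cropping before resizing means more of the fixed input resolution is spent on anatomy rather than on empty border.</p>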
<h2 id="heading-pillar-4-handling-missing-data">Pillar 4: Handling Missing Data</h2>
<p>Imagine reading a storybook with a few damaged pages. You don’t throw away the entire book. Instead, you decide whether to skip the page, infer what might be missing, or mark it for review.</p>
<p>In medical imaging, missing data can mean corrupted files, missing labels, or incomplete studies rather than empty spreadsheet cells.</p>
<p>The same three strategies — drop, impute, flag — still apply, just with different mechanics:</p>
<pre><code class="language-python"># Strategy 1: Drop — remove unreadable or empty images
def is_valid_image(path):
    try:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            return False
        if img.mean() &lt; 5:           # nearly black
            return False
        if img.shape[0] &lt; 50 or img.shape[1] &lt; 50:  # too small
            return False
        return True
    except Exception:
        return False

# Strategy 2: Impute: rare for images, but possible (e.g., inpainting to fill in
#   missing patches). Generally avoided for diagnostic data.

# Strategy 3: Flag — track which patients are missing which modalities,
#   and let the model condition on availability. Common in multi-modal healthcare ML.
</code></pre>
<p><strong>Takeaway:</strong> "Missing" in imaging data is rarely just a NaN. It can be a broken file, an unlabeled scan, an absent modality, or a black corner inside an image. The same three strategies still apply.</p>
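<p>The flag strategy is only described in comments above. As a rough sketch of what it could look like for this single-modality dataset, the hypothetical <code>build_manifest</code> helper below records every file along with flags for a missing label or an empty file, instead of silently dropping it. The field names and the <code>KNOWN_LABELS</code> set are assumptions made for illustration.</p>

```python
import os

KNOWN_LABELS = {"NORMAL", "PNEUMONIA"}  # folder names treated as valid labels


def build_manifest(data_dir):
    """List every file with flags instead of silently dropping problem cases."""
    manifest = []
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for fname in sorted(os.listdir(folder_path)):
            fpath = os.path.join(folder_path, fname)
            manifest.append({
                "path": fpath,
                "label": folder if folder in KNOWN_LABELS else None,
                "has_label": folder in KNOWN_LABELS,
                "is_empty_file": os.path.getsize(fpath) == 0,
            })
    return manifest
```

<p>Keeping flagged rows in a manifest makes it possible to review or relabel them later, and to report how much data each rule excluded.</p>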
<h2 id="heading-pillar-5-resizing-amp-resampling-fitting-everything-in-the-same-frame">Pillar 5: Resizing &amp; Resampling — Fitting Everything in the Same Frame</h2>
<p>Imagine displaying children’s drawings on a classroom wall. If every drawing is a different size, they won’t fit neatly into the display. You resize them while preserving their proportions.</p>
<p>Medical images must often be resized to a common input size, but anatomical structures should retain their original shape.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/d36b6f8c-4be0-41b7-ab7c-5ca30c01b3e0.png" alt="Comparison of two chest X-ray resizing approaches. One image is stretched into a square shape, distorting the lungs, while the second preserves the original aspect ratio by adding padding around the image. The aspect-ratio-preserving approach is highlighted as the preferred method." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>The fix:</strong> Resize all images to a common shape. For medical data, <em>how</em> the resizing is done matters.</p>
<pre><code class="language-python">TARGET_SIZE = (224, 224)

# Simple resize (may distort aspect ratio)
image_resized = cv2.resize(image, TARGET_SIZE)

# Better: preserve aspect ratio with padding
def resize_with_padding(image, target_size):
    h, w = image.shape[:2]
    target_h, target_w = target_size
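    scale = min(target_h / h, target_w / w)
    # round() and the min/max guards avoid losing a pixel to float error or hitting zero
    new_h = min(target_h, max(1, round(h * scale)))
    new_w = min(target_w, max(1, round(w * scale)))
    # INTER_AREA is generally the better interpolation when shrinking an image
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_AREA)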
    
    pad_h = target_h - new_h
    pad_w = target_w - new_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                 cv2.BORDER_CONSTANT, value=0)
    return padded

image_clean_resize = resize_with_padding(image, TARGET_SIZE)
</code></pre>
<p><strong>⚠️ Why aspect ratio matters in healthcare:</strong> Squishing a chest X-ray horizontally makes the lungs look unnatural. Models trained on distorted anatomy often perform worse on real scans. Preserving aspect ratio is generally the safer choice.</p>
<p><strong>Takeaway:</strong> Models need a consistent input size, but the geometry of the anatomy needs to be preserved. Resize, but resize carefully.</p>
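<p>For a feel of the numbers involved, the sketch below works through the scale and padding arithmetic for one image without touching any pixels. It mirrors the logic of <code>resize_with_padding</code>; the <code>padding_plan</code> name is hypothetical, and the 1500-by-2000 input is assumed to mean 1500 rows and 2000 columns.</p>

```python
def padding_plan(height, width, target_size=(224, 224)):
    """Return the resized shape and the padding needed to reach target_size."""
    target_h, target_w = target_size
    scale = min(target_h / height, target_w / width)
    # Rounding keeps a product such as 223.9999 from being truncated to 223
    new_h = min(target_h, max(1, round(height * scale)))
    new_w = min(target_w, max(1, round(width * scale)))
    pad_h, pad_w = target_h - new_h, target_w - new_w
    return {
        "resized": (new_h, new_w),
        "top": pad_h // 2, "bottom": pad_h - pad_h // 2,
        "left": pad_w // 2, "right": pad_w - pad_w // 2,
    }


plan = padding_plan(1500, 2000)
print(plan)
```

<p>Here the wider dimension sets the scale (224 / 2000 = 0.112), the image shrinks to 168 rows by 224 columns, and the remaining 56 rows are split evenly as 28 rows of padding above and below.</p>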
<h2 id="heading-pillar-6-denoising-amp-artifact-handling-cleaning-the-window">Pillar 6: Denoising &amp; Artifact Handling — Cleaning the Window</h2>
<p>Imagine looking through a window with dust and smudges on the glass. Cleaning the window makes the view clearer, but scrubbing too aggressively could scratch the glass.</p>
<p>Similarly, medical images often contain noise and acquisition artifacts that should be reduced carefully without removing clinically important details.</p>
<p>For chest X-rays, the most common issues are mild noise and burned-in text or markers. A gentle median or bilateral filter helps with the first, while cropping or masking helps with the second.</p>
<pre><code class="language-python"># Gentle denoising — careful not to blur away clinical detail
image_denoised = cv2.medianBlur(image, ksize=3)

# Bilateral filter: smooths flat regions while keeping edges sharp
image_bilateral = cv2.bilateralFilter(image, d=5, sigmaColor=50, sigmaSpace=50)
</code></pre>
<p><strong>⚠️ A note of caution:</strong> Aggressive denoising can erase the features a model needs to detect a disease. For diagnostic ML, gentle filtering is generally preferred. A useful rule of thumb: if a radiologist can no longer make out a finding in the cleaned image that was visible in the original, the filtering has gone too far.</p>
<p><strong>Takeaway:</strong> Imaging data carries noise that structured data doesn't have. The window can be cleaned, but never so aggressively that the view is wiped away with the smudges.</p>
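<p>One rough way to put a number on how much a filter changed an image is peak signal-to-noise ratio (PSNR) between the original and the filtered version. The sketch below is a generic implementation, not something taken from the pipeline in this article, and it assumes 8-bit images by default. A higher value means the filter changed less. It says nothing about whether clinically relevant detail survived, so it complements expert review rather than replacing it.</p>

```python
import numpy as np


def psnr(original, filtered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    original = np.asarray(original, dtype=np.float64)
    filtered = np.asarray(filtered, dtype=np.float64)
    mse = np.mean((original - filtered) ** 2)
    if mse == 0:
        return float("inf")  # images are identical
    return float(10.0 * np.log10((max_value ** 2) / mse))
```

<p>Comparing the PSNR of a few candidate filter settings on the same scan is a quick way to see which ones are the gentlest.</p>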
<h2 id="heading-putting-it-all-together-a-complete-pipeline">Putting it All Together: A Complete Pipeline</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/c532949b-000c-403e-acb9-f9dec689182e.png" alt="Workflow showing a chest X-ray progressing through a healthcare imaging preprocessing pipeline. The image moves through validation, resizing, denoising, contrast enhancement, scaling, and normalization before becoming a model-ready machine learning input." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Here's how the six pillars combine into a single preprocessing function for chest X-ray images:</p>
<pre><code class="language-python">def preprocess_xray(image_path, target_size=(224, 224),
                    train_mean=0.482, train_std=0.236):
    """
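    Full preprocessing pipeline for chest X-ray images.
    Applies all six pillars in the order the steps depend on each other
    (validate, resize, denoise, enhance, scale, normalize), not by pillar number.
    Pass the train_mean and train_std computed on your own training set.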
    """
    # Pillar 4: Validate first — skip corrupted files
    if not is_valid_image(image_path):
        return None
    
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    
    # Pillar 5: Resize with aspect ratio preserved
    image = resize_with_padding(image, target_size)
    
    # Pillar 6: Gentle denoising
    image = cv2.medianBlur(image, 3)
    
    # Pillar 3: Enhance contrast to highlight lung texture
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image = clahe.apply(image)
    
    # Pillar 1: Scale to [0, 1]
    image = image.astype(np.float32) / 255.0
    
    # Pillar 2: Normalize using training set statistics
    image = (image - train_mean) / train_std
    
    return image
</code></pre>
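<p>A per-image function still has to be applied to a whole folder. The sketch below shows one way that could look, with the preprocessing function passed in as a parameter so the loop stays independent of OpenCV. The helper name <code>preprocess_directory</code> and the label mapping (0 for NORMAL, 1 for PNEUMONIA) are assumptions for illustration.</p>

```python
import os

import numpy as np


def preprocess_directory(data_dir, preprocess_fn, class_names=("NORMAL", "PNEUMONIA")):
    """Run preprocess_fn on every file under data_dir/<class_name>/.

    preprocess_fn takes a file path and returns a 2-D array, or None to skip.
    Returns (images, labels, skipped_paths).
    """
    images, labels, skipped = [], [], []
    for label, class_name in enumerate(class_names):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue
        for fname in sorted(os.listdir(class_path)):
            fpath = os.path.join(class_path, fname)
            result = preprocess_fn(fpath)
            if result is None:
                skipped.append(fpath)  # keep a record instead of failing silently
                continue
            images.append(result)
            labels.append(label)
    if not images:
        empty_x = np.empty((0,), dtype=np.float32)
        empty_y = np.empty((0,), dtype=np.int64)
        return empty_x, empty_y, skipped
    return np.stack(images).astype(np.float32), np.asarray(labels, dtype=np.int64), skipped
```

<p>With the functions defined earlier, a call such as <code>preprocess_directory(TRAIN_DIR, preprocess_xray)</code> would return a stacked array of model-ready images, their integer labels, and the list of files the validation step rejected.</p>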
<h2 id="heading-try-it-yourself">Try it Yourself</h2>
<p>Every code snippet in this article is bundled into a runnable Kaggle notebook: <a href="https://www.kaggle.com/code/lakshmimahabaleshwar/chest-xray-preprocessing-kaggle">Chest X-Ray Preprocessing — Kaggle Notebook</a>. Fork it, attach the dataset, and run all the cells to see each preprocessing pillar in action on real chest X-rays.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Here's a summary of what we've discussed in this article:</p>
<table>
<thead>
<tr>
<th><strong>Pillar</strong></th>
<th><strong>Purpose</strong></th>
<th><strong>Example</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Scaling</td>
<td>Standardize pixel ranges</td>
<td>0-255 → 0-1</td>
</tr>
<tr>
<td>Normalization</td>
<td>Center brightness distributions</td>
<td>z-score normalization</td>
</tr>
<tr>
<td>Attention Guidance</td>
<td>Highlight diagnostic regions</td>
<td>CLAHE</td>
</tr>
<tr>
<td>Missing Data Handling</td>
<td>Remove unusable scans</td>
<td>Corrupted files</td>
</tr>
<tr>
<td>Resizing</td>
<td>Consistent input size</td>
<td>224×224</td>
</tr>
<tr>
<td>Denoising</td>
<td>Reduce acquisition noise</td>
<td>Median filter</td>
</tr>
</tbody></table>
<p>Preprocessing for structured data is about making numbers play fair so a model can see them clearly.</p>
<p>Preprocessing for healthcare imaging is about respecting the messy reality of how medical data is captured, stored, and labeled. Some standard techniques carry over directly. Some need to be adapted. And a few preprocessing concerns only emerge once the data becomes pictures of human bodies.</p>
<p>Stepping back, whether it's a child learning to organize their toy box, or a model learning to spot pneumonia in a chest X-ray, the quality of learning depends on the quality of data preparation. Get the data right.</p>
<p>If this was useful, you can find a related conceptual primer on preprocessing more broadly here: <a href="https://lakshmimahabaleshwara.substack.com/p/data-preprocessing-for-machine-learning">Data Preprocessing for Machine Learning</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Open Source Data Lake for Batch Ingestion ]]>
                </title>
                <description>
                    <![CDATA[ Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-open-source-data-lake-for-batch-ingestion/</link>
                <guid isPermaLink="false">69e0f1a7b67a275a9d3c9122</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ apache-airflow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ingestion ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Puneet Singh ]]>
                </dc:creator>
                <pubDate>Thu, 16 Apr 2026 14:26:47 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ef685075-beac-4bf4-b435-6e942e5e1ac1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.</p>
<p>But the trade-off isn't merely renting outside infrastructure. It also includes lock-in to proprietary abstractions, and an operational and security surface area built on top of vendor capabilities.</p>
<p>In this article, you'll set up a batch ingestion layer on an open-source data lake stack where you own every component.</p>
<p>The focus is deliberately narrow. We'll get the ingestion layer up and running end-to-end, on foundations that allow future extension (analytics, governance, and stream processing) without locking you into any single tool for those layers. We'll also review documented integration failures along the way: misconfigured catalogs, partition values written as NULL, and Python version mismatches.</p>
<p>By the end, you'll have:</p>
<ul>
<li><p>A working single-node data lake running on Docker Compose, built on RustFS (object storage), Apache Iceberg (table format), and Project Nessie (catalog).</p>
</li>
<li><p>A batch pipeline orchestrated with Apache Airflow, executing PySpark jobs that write versioned, partitioned Iceberg tables.</p>
</li>
<li><p>A real-world ingestion pattern: an external web scraper decoupled from Airflow via Redis, writing raw data to object storage with a lightweight signal table.</p>
</li>
<li><p>A view of what this stack is and isn't, and what you'd add to take it toward production.</p>
</li>
</ul>
<p>A word on scope: this covers the E in <a href="https://www.getdbt.com/blog/extract-load-transform">ELT</a>: getting data in. Transformation (dbt, Spark SQL) and analytics (Trino, Superset) are a natural next layer, but are outside the scope of this article. What you build here is the foundation they'd sit on.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-ingestion-problem">The Ingestion Problem</a></p>
</li>
<li><p><a href="#heading-stack">Stack</a></p>
</li>
<li><p><a href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a href="#heading-quick-start">Quick Start</a></p>
</li>
<li><p><a href="#heading-running-the-pipelines">Running the Pipelines</a></p>
</li>
<li><p><a href="#heading-setup">Setup</a></p>
<ul>
<li><p><a href="#heading-rustfs">RustFS</a></p>
</li>
<li><p><a href="#heading-nessie">Nessie</a></p>
</li>
<li><p><a href="#heading-spark">Spark</a></p>
</li>
<li><p><a href="#heading-apache-airflow">Apache Airflow</a></p>
</li>
<li><p><a href="#heading-scrapredis">Scrapredis</a></p>
</li>
<li><p><a href="#heading-scrapworker">Scrapworker</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-path-forward">Path Forward</a></p>
<ul>
<li><p><a href="#heading-extending-capabilities">Extending Capabilities</a></p>
</li>
<li><p><a href="#heading-adding-layers">Adding Layers</a></p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-the-ingestion-problem">The Ingestion Problem</h2>
<p>The structure of a stack/solution is easier to understand with a use case. A high-level goal is to ingest financial data from external market APIs for trend analysis. You'll focus specifically on setting up ingestion of such data into the warehouse for further analytics.</p>
<p>The data is ingested via a web crawler that respects a specific rate limit per endpoint. In batch processing, time-based partitioning lets downstream pipelines process one time slice at a time. It also makes data retention cleaner, because old partitions can be dropped as a unit.</p>
<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. This keeps rate limiting and crawl lifecycle outside the orchestration layer, with each component failing and recovering independently.</p>
<p>During ingestion, the priority is landing data reliably, because crawl jobs aren't idempotent: repeating a crawl later may return different data.</p>
<h2 id="heading-stack">Stack</h2>
<ul>
<li><p><a href="https://rustfs.com/"><strong>RustFS</strong></a><strong>:</strong> An S3-compatible object store written in Rust</p>
</li>
<li><p><a href="https://projectnessie.org/"><strong>Project Nessie</strong></a><strong>:</strong> Transactional catalog for Apache Iceberg tables</p>
</li>
<li><p><a href="https://spark.apache.org/"><strong>Apache Spark</strong></a><strong>:</strong> Distributed compute engine</p>
</li>
<li><p><a href="https://airflow.apache.org/"><strong>Apache Airflow</strong></a><strong>:</strong> Job scheduling and orchestration</p>
</li>
<li><p><a href="https://jupyter.org/"><strong>Jupyter Notebook</strong></a> <em>(optional)</em>: Ad-hoc Spark queries against Iceberg tables, not covered in this article</p>
</li>
<li><p><strong>Scrapredis:</strong> Job queue for the web crawler</p>
</li>
<li><p><strong>Scrapworker:</strong> Web crawler and ingestion worker</p>
</li>
</ul>
<p>This setup was tested on a 4-core x86/AMD CPU, 16GB RAM, 60GB disk GCP VM running Debian GNU/Linux 11 (Bullseye). Docker with Compose v2 is required. The setup should work on any comparable Linux environment with similar or better specs.</p>
<h2 id="heading-system-overview">System Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/429a1e8a-bc39-44dc-8e0b-2cd9152370f5.png" alt="Data Platform Architecture" style="display:block;margin:0 auto" width="3202" height="2385" loading="lazy">

<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. Airflow pushes a job specification to the queue containing the endpoint, query params, and target path. The crawler picks it up, executes the crawl, and writes raw results directly to object storage.</p>
<p>This separation keeps rate limiting and crawl lifecycle concerns outside the orchestration layer, and isolates failure modes.</p>
<p>A crawl failure is harder to recover from, since crawl jobs lack idempotency. Pipeline failures after the crawl stage are independently retryable without re-triggering a crawl.</p>
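<p>To make the job specification concrete, here is a minimal sketch of what Airflow might push to the queue. The class name <code>CrawlJob</code>, its field names, and the example <code>s3_path</code> layout are assumptions for illustration, not the repository's actual payload format.</p>
<pre><code class="language-python">import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class CrawlJob:
    """Hypothetical job specification; field names are assumptions."""
    run_id: str
    endpoint: str
    params: dict = field(default_factory=dict)
    s3_path: str = ""

    def to_json(self):
        # Sort keys so the same job always serializes to the same string.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw):
        return cls(**json.loads(raw))


job = CrawlJob(
    run_id="manual__2026-03-28T23:15",
    endpoint="https://api.binance.com/api/v3/trades",
    params={"symbol": "BTCUSDT", "limit": 10},
    s3_path="s3://warehouse/raw/trades/ds=2026-03-28/hr=23/min=15/",
)

# The worker side would decode the same string it popped from the queue.
restored = CrawlJob.from_json(job.to_json())
</code></pre>
<p>Keeping the target path inside the payload means the crawler needs no knowledge of the warehouse layout. It writes wherever the orchestrator tells it to.</p>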
<h2 id="heading-quick-start">Quick Start</h2>
<p>First, initialize the project:</p>
<pre><code class="language-bash"># Clone the repository and enter it
git clone https://github.com/ps-mir/data-platform
cd data-platform

# Create the shared Docker network
docker network create data-platform

# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh &amp;&amp; ./init.sh
</code></pre>
<p>Start services in this order (shutdown in reverse):</p>
<ol>
<li><strong>RustFS</strong></li>
</ol>
<pre><code class="language-bash">cd rustfs &amp;&amp; docker compose up -d
</code></pre>
<ol start="2">
<li><strong>Nessie</strong></li>
</ol>
<pre><code class="language-bash">cd nessie &amp;&amp; docker compose up -d
</code></pre>
<ol start="3">
<li><strong>Spark</strong> (requires a build on first run)</li>
</ol>
<pre><code class="language-bash">cd spark &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<ol start="4">
<li><strong>Scrapredis</strong></li>
</ol>
<pre><code class="language-bash">cd scrapredis &amp;&amp; docker compose up -d
</code></pre>
<ol start="5">
<li><strong>Airflow</strong> (requires a build on first run)</li>
</ol>
<pre><code class="language-bash">cd airflow-docker &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<p>Create the Nessie namespaces once after Nessie is up:</p>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<p>Scrapworker runs on the host directly (it's not dockerized). It requires Python &gt;=3.14:</p>
<pre><code class="language-bash">cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker
</code></pre>
<p>Scrapworker must be running before activating <code>scraper_pipeline_v1</code> in Airflow. Without it, the pipeline will push jobs to the queue with no worker to pick them up and hang indefinitely in <code>wait_for_completion</code>.</p>
<p>Trino is also present in the setup, but its integration with Nessie hasn't been tested yet.</p>
<h2 id="heading-running-the-pipelines">Running the Pipelines</h2>
<p>With the stack running, the next step is to activate the pipelines in Airflow. The four pipelines build on each other in complexity. Working through them in order is the fastest way to confirm that each layer of the stack is wired correctly before moving to the next.</p>
<p>All four pipelines are loaded but paused by default. Unpause each one in the Airflow UI before triggering.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/38f95d52-c092-4a00-b660-1233077b781b.png" alt="All Airflow Pipelines" style="display:block;margin:0 auto" width="2678" height="1234" loading="lazy">

<p>Let's go over each pipeline:</p>
<h3 id="heading-sparkstaticdatav1skeleton-hello-dag">spark_static_data_v1_skeleton: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step1_hello_dag.py">Hello DAG</a></h3>
<p>This is a minimal DAG with no Spark, just a Python task that prints a message. If it goes green, Airflow's scheduler and worker are healthy. The task log should contain a line like <code>[2026-04-09 22:00:01] INFO - Task operator:&lt;Task(_PythonDecoratedOperator): say_hello&gt;</code>.</p>
<h3 id="heading-sparkstaticdatav2submit-spark-submit">spark_static_data_v2_submit: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step2_spark_submit.py">Spark Submit</a></h3>
<p>This submits a PySpark job via <code>SparkSubmitOperator</code> that writes a static dataset to an Iceberg table. No partitioning is applied, so every run overwrites the previous content.</p>
<p>In the Nessie catalog, the table appears as:</p>
<pre><code class="language-plaintext">Type: ICEBERG_TABLE
Metadata Location: s3://warehouse/default/static_data_e7e43123-95a7-44d2-b6d5-67c9c7aa4321/metadata/00000-08a5a2db-6f12-4f21-b2a9-de3d9123fbd3.metadata.json
</code></pre>
<h3 id="heading-sparkpartitioneddatav1-spark-partitioned">spark_partitioned_data_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step3_spark_partitioned.py">Spark Partitioned</a></h3>
<p>This extends step 2 with time-based partitioning. Partition values are derived from the scheduled slot time, so every run writes to its own <code>(ds, hr, min)</code> partition without touching previous ones.</p>
<p>Example file path in RustFS: <code>warehouse/default/static_data_partitioned_b172c66f-722b-44f3-bbee-069355753ff6/data/ds=2026-03-28/hr=23/min=15/00000-4-7a196a47-2ac0-4023-af68-ca10487fccb2-0-00001.parquet</code></p>
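<p>As a rough illustration of how a scheduled slot time maps to <code>(ds, hr, min)</code>, the sketch below derives the three values from a timestamp. The function names are hypothetical, and the zero-padded string format and UTC normalization are assumptions rather than the repository's exact logic.</p>
<pre><code class="language-python">from datetime import datetime, timezone


def partition_values(slot_start):
    """Map a scheduled slot start to (ds, hr, min) partition values."""
    utc = slot_start.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%d"), utc.strftime("%H"), utc.strftime("%M")


def partition_path(slot_start):
    """Render the values in the key=value layout seen in the file path."""
    ds, hr, minute = partition_values(slot_start)
    return f"ds={ds}/hr={hr}/min={minute}"


slot = datetime(2026, 3, 28, 23, 15, tzinfo=timezone.utc)
print(partition_path(slot))  # ds=2026-03-28/hr=23/min=15
</code></pre>
<p>Because the values depend only on the slot time and not on the wall clock, a rerun or backfill of the same slot resolves to the same partition.</p>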
<h3 id="heading-scraperpipelinev1-scraper-pipeline">scraper_pipeline_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/scraper_pipeline.py">Scraper Pipeline</a></h3>
<p>This is the full ingestion flow. Airflow pushes a job to Scrapredis, Scrapworker calls the Binance API and writes raw results to RustFS, then Airflow publishes a signal row to the Nessie catalog.</p>
<p>Every run fetches: <code>https://api.binance.com/api/v3/trades?symbol=BTCUSDT&amp;limit=10</code></p>
<h2 id="heading-setup">Setup</h2>
<p>This is a single-node development setup using Docker Compose. It's built on a well-structured base config that can be extended to production with targeted changes.</p>
<ul>
<li><p>A production deployment would require HA configuration, persistent volume management, and security hardening for each component.</p>
</li>
<li><p>Images are pinned to specific versions to avoid silent breakage between pulls.</p>
</li>
<li><p>All containers share a common external Docker network named <code>data-platform</code>, which allows services to communicate using container names as hostnames.</p>
</li>
<li><p>An <code>init.sh</code> script creates the required local dirs inside the data folder and also creates the Docker network.</p>
</li>
</ul>
<h3 id="heading-rustfs">RustFS</h3>
<p>RustFS is the object storage layer in this stack. Nessie's REST catalog mode has a hard dependency on an S3-compatible endpoint. Running it against a local filesystem fails the Nessie healthcheck at startup and causes catalog initialization to error out. The REST catalog is the recommended mode for new setups because it enables credential vending and multi-engine coordination.</p>
<p>MinIO was the natural choice for self-hosted S3-compatible storage, but it shifted to a more restrictive license. RustFS is the open-source alternative, written in Rust and backed by local disk.</p>
<p>At write time, Spark pushes Parquet files directly to RustFS via S3FileIO. Nessie then commits the table metadata that references those files, so a write becomes visible to readers in full or not at all. This is <a href="https://iceberg.apache.org/">Apache Iceberg</a>'s core guarantee: a table changes only through an atomic metadata commit.</p>
<p>For production or cloud deployments, managed object storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage are the natural next step. Self-hosted alternatives at scale include <a href="https://github.com/seaweedfs/seaweedfs">SeaweedFS</a>, <a href="https://docs.ceph.com/en/latest/radosgw/">Ceph/RGW</a>, and <a href="https://garagehq.deuxfleurs.fr/">Garage</a>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Bucket creation:</strong> A <code>rustfs-init</code> sidecar using <code>amazon/aws-cli</code> runs after RustFS passes its healthcheck and creates the <code>s3://warehouse</code> bucket automatically. You don't create the bucket manually.</p>
</li>
<li><p><strong>Permissions:</strong> RustFS runs as uid=10001 inside the container. The host directories (<code>data/rustfs/data</code> and <code>data/rustfs/applogs</code>) must be owned by that uid before the container starts, or it will fail silently. <code>init.sh</code> handles this with <code>sudo chown -R 10001:10001</code>.</p>
</li>
<li><p><strong>Image pinning:</strong> The compose file pins to <code>rustfs/rustfs:1.0.0-alpha.85-glibc</code>. Before upgrading, verify the uid hasn't changed: <code>docker run --rm --entrypoint id rustfs/rustfs:&lt;new-tag&gt;</code>. If it has, re-run <code>init.sh</code> or re-chown manually.</p>
</li>
<li><p><strong>Spark writes:</strong> Spark writes data files directly to RustFS via S3FileIO. Nessie only manages catalog metadata and doesn't proxy data. The two interact at commit time, not at write time.</p>
</li>
</ul>
<h3 id="heading-nessie">Nessie</h3>
<p>The catalog tracks the list of tables in the warehouse, along with their data files and schema. Without it, Spark and other engines have no shared source of truth for what's in the warehouse.</p>
<p><a href="https://hive.apache.org/docs/latest/admin/adminmanual-metastore-administration/">Hive Metastore</a> offers a Thrift-based API and has been the catalog standard for years. It provides transaction semantics on metadata updates through its backing database, but those transactions stop at the catalog layer. Data files underneath aren't part of the same commit, and there's no cross-table history beyond what the database retains.</p>
<p>Apache Iceberg closes the data and metadata gap with atomic table commits. Nessie builds on that and goes further: it treats the catalog like a Git repository. Every table write is a commit. You can branch, tag, and roll back across multiple tables atomically.</p>
<p>Spark reads and writes table metadata through Nessie's Iceberg REST endpoint. Catalog state is persisted to Postgres, so it survives container restarts.</p>
<h4 id="heading-namespace-bootstrap">Namespace bootstrap</h4>
<p>Unlike Hive Metastore, Nessie doesn't auto-create namespaces. Attempting to write a table to a namespace that doesn't exist fails after data has already been written to RustFS, leaving orphaned files with no catalog entry. Namespaces are structural metadata and belong in a one-time bootstrap step, not in a pipeline.</p>
<p>Nessie manages the Iceberg catalog metadata under <code>s3://warehouse/</code>. Iceberg table data lands under paths derived from the namespace, for example, <code>s3://warehouse/default/</code> for the <code>default</code> namespace.</p>
<h4 id="heading-s3-credential-configuration-issue">S3 Credential Configuration Issue</h4>
<p>Nessie's S3 credential fields don't accept plain strings (likely for security reasons). They require a secret URI in the form <code>urn:nessie-secret:quarkus:&lt;name&gt;</code> even for local credentials.</p>
<p>Additionally, the SCREAMING_SNAKE_CASE environment variable convention is ambiguous for Quarkus property names containing hyphens. The property is silently ignored, and the default (which fails) is used instead. The working approach is dot-notation keys passed directly in the compose environment block, which Quarkus reads without conversion:</p>
<pre><code class="language-properties">nessie.catalog.service.s3.default-options.access-key: "urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key"
nessie.catalog.secrets.access-key.name: rustfsadmin
nessie.catalog.secrets.access-key.secret: rustfsadmin
</code></pre>
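<p>The ambiguity is easier to see with a small sketch. The function below applies the environment variable mapping rule as described in the MicroProfile Config specification that Quarkus follows (replace every character that isn't alphanumeric or an underscore with <code>_</code>, then uppercase). The function name and the second, dotted property are made up for the comparison.</p>
<pre><code class="language-python">import re


def to_env_name(prop):
    """Map a property name to its environment variable form."""
    return re.sub(r"[^A-Za-z0-9_]", "_", prop).upper()


hyphenated = "nessie.catalog.service.s3.default-options.access-key"
dotted = "nessie.catalog.service.s3.default.options.access.key"  # hypothetical

print(to_env_name(hyphenated))
# Both spellings collapse to the same variable name, so the reverse
# mapping from variable to property cannot be recovered reliably.
print(to_env_name(hyphenated) == to_env_name(dotted))
</code></pre>
<p>Since dots and hyphens both become underscores, the original property name is lost in translation. Passing the dot-notation key directly sidesteps the conversion.</p>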
<h4 id="heading-nessie-health-check">Nessie health check</h4>
<p>Once the RustFS settings are corrected, Nessie's health check URL (<a href="http://localhost:9090/q/health">http://localhost:9090/q/health</a>) should return the following response:</p>
<pre><code class="language-json">{
    "status": "UP",
    "checks": [
        {
            "name": "MongoDB connection health check",
            "status": "UP"
        },
        {
            "name": "Warehouses Object Stores",
            "status": "UP",
            "data": {
                "warehouse.warehouse.status": "UP"
            }
        },
        {
            "name": "Database connections health check",
            "status": "UP",
            "data": {
                "&lt;default&gt;": "UP"
            }
        }
    ]
}
</code></pre>
<p>The MongoDB connection health check appears in the response even though this stack doesn't use MongoDB. It's a Quarkus built-in probe registered automatically regardless of store type. With JDBC configured, MongoDB is never connected and the UP report is just a placeholder response.</p>
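<p>If you want to gate later startup steps on this response, a small parser is enough. This is a sketch that works on the JSON shape shown above. The function names are hypothetical, and fetching the URL is left to the caller.</p>
<pre><code class="language-python">import json


def failing_checks(health):
    """Return the names of checks whose status is not UP."""
    return [c["name"] for c in health.get("checks", []) if c.get("status") != "UP"]


def is_healthy(health):
    """Healthy means the overall status is UP and no single check disagrees."""
    return health.get("status") == "UP" and not failing_checks(health)


sample = json.loads("""
{"status": "UP", "checks": [
  {"name": "Warehouses Object Stores", "status": "UP"},
  {"name": "Database connections health check", "status": "UP"}
]}
""")
print(is_healthy(sample))
</code></pre>
<p>Reporting the failing check names, rather than a single boolean, makes it quicker to tell an object store problem from a database problem.</p>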
<h4 id="heading-catalog-endpoint-vs-management">Catalog endpoint vs Management</h4>
<p>Nessie exposes two separate APIs. The Iceberg REST catalog is at <code>/iceberg</code>. This is what Spark and Trino connect to. The Nessie management API is at <code>/api/v2</code>, which is for branch operations, commit history, and table inspection. They aren't interchangeable.</p>
<pre><code class="language-properties"># Iceberg REST API
http://localhost:19120/iceberg/v1/main/namespaces
http://localhost:19120/iceberg/v1/config

# Nessie management API
http://localhost:19120/api/v2/config
</code></pre>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><code>path-style-access: true</code> is required for any non-AWS S3 endpoint. <code>region</code> is a dummy value required by the AWS SDK internally.</p>
</li>
<li><p>Nessie's internal port 9000 is remapped to 9090 on the host to avoid a conflict with RustFS, which occupies 9000 and 9001.</p>
</li>
</ul>
<h4 id="heading-forward-path">Forward path</h4>
<p>Nessie is a stateless REST service, so reads can be scaled by placing multiple instances behind a load balancer, with no coordination between nodes. Durability comes entirely from the backend store.</p>
<h3 id="heading-spark">Spark</h3>
<p>As a distributed compute engine, Apache Spark is a reliable and stable choice for long-running jobs. In the current setup, it executes PySpark jobs submitted by Airflow, reads and writes Iceberg tables via the Nessie REST catalog, and writes data files directly to RustFS using S3FileIO. Spark runs in standalone mode with a single master and worker, configured via <code>spark-defaults.conf</code>.</p>
<p>Two JARs are required and must be placed in <code>data/spark/jars/</code> before starting:</p>
<ul>
<li><p><code>iceberg-spark-runtime-3.5_2.12</code>: Iceberg integration for Spark: SparkCatalog, DataFrameWriterV2, SQL extensions, and all table format logic.</p>
</li>
<li><p><code>iceberg-aws-bundle</code>: AWS SDK v2 and Iceberg's S3FileIO, the storage transport layer for writing data files to RustFS. The Spark base image ships only Hadoop AWS (SDK v1). This bundle provides the SDK v2 classes that S3FileIO requires.</p>
</li>
</ul>
<p>Spark uses a custom Dockerfile to install Python 3.12. Build the image before first use:</p>
<pre><code class="language-bash">cd spark
docker compose build
docker compose up -d
</code></pre>
<p>The PySpark jobs are covered in the Airflow section, where we walk through each DAG and its corresponding Spark script as part of the pipeline.</p>
<p>Before submitting any Spark job that writes an Iceberg table, the target namespace must exist in Nessie. Nessie doesn't auto-create namespaces, unlike Hive Metastore. Attempting to write to a missing namespace fails after data has already been written to RustFS, leaving orphaned files with no catalog entry.</p>
<p>Create the <code>default</code> namespace once before running any pipeline:</p>
<pre><code class="language-bash"># Nessie should be up and running at this point
curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

# Expected response:
# {
#   "namespace" : [ "default" ],
#   "properties" : { }
# }
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">curl http://localhost:19120/iceberg/v1/main/namespaces
</code></pre>
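<p>The same bootstrap can be scripted so that it only creates what's missing. The sketch below uses only the standard library and builds the requests without sending them. The helper names are hypothetical, and the listing shape (<code>{"namespaces": [["default"]]}</code>) is assumed from the Iceberg REST specification, so check it against your Nessie version.</p>
<pre><code class="language-python">import json
import urllib.request

BASE = "http://localhost:19120/iceberg/v1/main/namespaces"


def missing_namespaces(list_response, wanted):
    """Compare a namespace listing against the namespaces we want."""
    existing = {tuple(ns) for ns in list_response.get("namespaces", [])}
    return [name for name in wanted if (name,) not in existing]


def create_request(name, base=BASE):
    """Build (but do not send) the POST that creates one namespace."""
    body = json.dumps({"namespace": [name]}).encode("utf-8")
    return urllib.request.Request(
        base, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


listing = {"namespaces": [["default"]]}  # stand-in for the GET response
todo = missing_namespaces(listing, ["default", "scraper"])
requests_to_send = [create_request(name) for name in todo]
# Sending is left to the caller, for example urllib.request.urlopen(req).
</code></pre>
<p>Separating the "what is missing" decision from the HTTP call keeps the bootstrap safe to re-run.</p>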
<h4 id="heading-catalog-mismatch-tables-missing-across-query-engines">Catalog Mismatch: Tables Missing Across Query Engines</h4>
<p>If tables written by Spark aren't visible in Trino, the likely cause is a catalog mismatch. Spark configured with <code>NessieCatalog</code> and Trino using the Iceberg REST catalog maintain separate metadata views — they don't share table state. Both engines must point at the same catalog endpoint: <code>http://nessie:19120/iceberg</code>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Worker memory:</strong> The worker is configured with <code>SPARK_WORKER_MEMORY: 8g</code>. Spark's default of 1g is enough to register the worker but not enough to run a job without queuing. Tune this based on available host memory.</p>
</li>
<li><p><strong>Remote signing:</strong> Set <code>remote-signing-enabled: false</code>. Nessie's REST catalog supports credential vending via IAM/STS, but since that integration isn't present here, remote signing is disabled explicitly to avoid request failures.</p>
</li>
<li><p><strong>Config changes need full restart:</strong> Docker file-level bind mounts cache the inode at container start. Editing <code>spark-defaults.conf</code> won't take effect until Spark and the Airflow worker are restarted. In client mode, the Airflow worker is the Spark driver (the process that reads the config on job submission) and must be restarted too.</p>
</li>
<li><p><strong>Jupyter Notebook:</strong> A Jupyter instance with PySpark is included in the stack for ad-hoc queries against Iceberg tables. It connects to the same Spark cluster and Nessie catalog, so any table written by a pipeline is immediately queryable.</p>
</li>
</ul>
<p>⚠️ <strong>Warning:</strong> The Spark worker and Airflow worker (the driver) must run the same Python minor version. PySpark enforces this at runtime and fails immediately if they diverge. The Spark image in this stack uses a custom Dockerfile to install Python 3.12, matching Airflow's base image. If you upgrade either, verify that the versions stay aligned.</p>
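<p>A quick way to catch drift before a job fails is to compare the version strings reported by each container, for example from <code>python --version</code>. The sketch below shows only the comparison. The function names are hypothetical, and collecting the strings from the containers is left out.</p>
<pre><code class="language-python">import sys


def minor_version(version):
    """Reduce a version string such as '3.12.4' to (3, 12)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def assert_same_minor(driver_version, worker_version):
    """Raise if the driver and worker differ in major or minor version."""
    if minor_version(driver_version) != minor_version(worker_version):
        raise RuntimeError(
            f"Python mismatch: driver {driver_version}, worker {worker_version}"
        )


info = sys.version_info
local = f"{info.major}.{info.minor}.{info.micro}"
assert_same_minor(local, local)  # identical versions always pass
</code></pre>
<p>Patch-level differences are ignored on purpose, since only the minor version needs to match.</p>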
<h3 id="heading-apache-airflow">Apache Airflow</h3>
<p>Airflow makes it easier to author, schedule and monitor workflows. In this case, it handles the ingestion for batch processing, but it can be extended to use cases like stream processing.</p>
<p>The Airflow deployment here most closely resembles the separate DAG processor architecture described in the <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html">official docs</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/a438e02b-0b16-44c7-bcae-92c954a942cc.png" alt="DAG Processor Airflow Architecture" style="display:block;margin:0 auto" width="2308" height="1455" loading="lazy">

<p>Key aspects:</p>
<ul>
<li><p>The DAG Processor continuously parses DAG files and serializes them to the Metadata DB.</p>
</li>
<li><p>The Scheduler reads from there, detects when a DAG run is due, creates task instances, and pushes them to the CeleryExecutor (via Redis queue).</p>
</li>
<li><p>The Celery worker picks up a task and executes it. In the case of a <code>SparkSubmitOperator</code>, the worker process becomes the Spark driver, submitting the job to the Spark cluster.</p>
</li>
<li><p>Executors run on the Spark worker, write Parquet files directly to RustFS, and commit the table metadata to Nessie. Airflow records the task outcome back in the Metadata DB.</p>
</li>
</ul>
<p>Airflow uses a custom Dockerfile to install Java 17 and additional providers. Build the image before first use:</p>
<pre><code class="language-bash">cd airflow-docker
docker compose build
docker compose up -d
</code></pre>
<h4 id="heading-pipelines">Pipelines</h4>
<p>Pipelines must be placed in the <code>airflow-docker/dags</code> folder so that the DAG processor can pick them up and load each DAG into the metadata DB. Four example pipelines are provided, in increasing order of complexity.</p>
<ol>
<li><p><code>step1_hello_dag.py</code>: single-task DAG with no dependencies, just a Python function that prints a message.</p>
</li>
<li><p><code>step2_spark_submit.py</code>: submits a PySpark job via SparkSubmitOperator. The job writes a static dataset to an Iceberg table via the Nessie catalog.</p>
</li>
<li><p><code>step3_spark_partitioned.py</code>: extends step 2 with time-based partitioning. The scheduled slot time is passed to the PySpark script.</p>
<ul>
<li>Time-based partition values are derived from <code>data_interval_start</code> for idempotency (Backfill, Reruns).</li>
</ul>
</li>
<li><p><code>scraper_pipeline.py</code>: a real-world ingestion pipeline. Coordinates with the external task executor <code>scrapworker</code> via the Redis queue <code>scrapredis</code>.</p>
<ul>
<li>Both <code>scrapredis</code> and <code>scrapworker</code> must be up and running for this pipeline to work.</li>
</ul>
</li>
</ol>
<h4 id="heading-deploy-mode-and-driver-config">Deploy Mode and Driver Config</h4>
<p>The initial <code>SparkSubmitOperator</code> configuration used <code>deploy_mode="cluster"</code>, which runs the driver on the Spark cluster rather than the submitting machine. This fails immediately on Spark standalone clusters with a hard error:</p>
<pre><code class="language-plaintext">Cluster deploy mode is currently not supported for python applications on standalone clusters.
</code></pre>
<p>Cluster mode for Python is only available on YARN and Kubernetes. The fix is <code>deploy_mode="client"</code>, but this shifts the problem: in client mode, the driver runs on the Airflow worker container, which means the worker needs everything the Spark containers have.</p>
<p>Overall, three changes are required in the Airflow worker:</p>
<ul>
<li><p>The Iceberg JARs from <code>data/spark/jars/</code>, mounted at <code>/opt/spark/user-jars/</code></p>
</li>
<li><p><code>spark-defaults.conf</code> with catalog, extension, and JAR config</p>
</li>
<li><p><code>SPARK_CONF_DIR=/opt/spark/conf</code>: without this, pip-installed PySpark's <code>spark-submit</code> silently ignores the mounted conf file and runs with no catalog config</p>
</li>
</ul>
<p>The fix was adding all three to <code>x-airflow-common</code> in <code>airflow-docker/docker-compose.yaml</code> so every Airflow service inherits them:</p>
<pre><code class="language-yaml">environment:
  SPARK_CONF_DIR: /opt/spark/conf

volumes:
  - ../data/spark/jars:/opt/spark/user-jars:ro
  - ../spark/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
</code></pre>
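<p>Since a missing <code>SPARK_CONF_DIR</code> fails silently, a small preflight check inside the Airflow worker can surface the problem early. This is a hypothetical helper, not part of the repository. It assumes only the paths from the compose snippet above.</p>
<pre><code class="language-python">from pathlib import Path


def preflight(env, jar_dir):
    """Return a list of problems; an empty list means the driver looks ready."""
    problems = []
    conf_dir = env.get("SPARK_CONF_DIR")
    if not conf_dir:
        problems.append("SPARK_CONF_DIR is not set")
    elif not (Path(conf_dir) / "spark-defaults.conf").is_file():
        problems.append(f"no spark-defaults.conf in {conf_dir}")
    # Guard against a missing mount before looking for JAR files.
    jar_path = Path(jar_dir)
    jars = list(jar_path.glob("*.jar")) if jar_path.is_dir() else []
    if not jars:
        problems.append(f"no JARs found in {jar_dir}")
    return problems


# With an empty environment and a missing directory, both checks fail.
for problem in preflight({}, "/definitely/missing"):
    print("preflight:", problem)
</code></pre>
<p>Inside the worker you would call it as <code>preflight(os.environ, "/opt/spark/user-jars")</code> and fail the task if the list isn't empty.</p>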
<h4 id="heading-partition-values-written-as-null">Partition Values Written as NULL</h4>
<p>When the third pipeline (Spark Partitioned) ran for the first time, the data landed correctly in RustFS, but querying the Iceberg partitions metadata showed:</p>
<pre><code class="language-plaintext">+------------------+----------+
|         partition|file_count|
+------------------+----------+
|{NULL, NULL, NULL}|         2|
+------------------+----------+
</code></pre>
<p>The original script used Spark's DataSource V1 API:</p>
<pre><code class="language-python">df.write.format("iceberg").mode("overwrite").saveAsTable(table)
</code></pre>
<p>Calling <code>format("iceberg")</code> on the V1 writer loads an isolated table reference and bypasses Iceberg's catalog write path. As a result, Iceberg committed the data files to storage but wrote NULL partition values into the manifest metadata.</p>
<p>The fix is to switch to the DataFrameWriterV2 API, which Iceberg supports natively:</p>
<pre><code class="language-python">df.writeTo(table).overwritePartitions()
</code></pre>
<p>This routes through Iceberg's native write path, evaluates partition transforms from the real column values (ds, hr, min), and registers them correctly in the manifest. <code>overwritePartitions()</code> overwrites only the partitions present in the DataFrame. A rerun with the same scheduled time produces the same values and atomically replaces that partition, leaving all others untouched.</p>
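<p>The replace-only-what-you-touch behavior described above can be modeled in plain Python. This is a toy model of the semantics, not Iceberg or Spark code. The table is a dictionary keyed by <code>(ds, hr, min)</code>, and the function name is hypothetical.</p>
<pre><code class="language-python">def overwrite_partitions(table, incoming):
    """Replace only the partitions present in `incoming`; keep the rest."""
    result = dict(table)
    # First clear every partition the incoming rows touch...
    for row in incoming:
        result[(row["ds"], row["hr"], row["min"])] = []
    # ...then fill those partitions with the new rows.
    for row in incoming:
        result[(row["ds"], row["hr"], row["min"])].append(row)
    return result


table = {
    ("2026-03-28", "23", "00"): [{"ds": "2026-03-28", "hr": "23", "min": "00", "v": 1}],
}
run = [{"ds": "2026-03-28", "hr": "23", "min": "15", "v": 2}]

once = overwrite_partitions(table, run)
twice = overwrite_partitions(once, run)  # rerun of the same scheduled slot
print(once == twice)
</code></pre>
<p>Running the same slot twice leaves the table unchanged, which is the idempotency property that makes backfills and reruns safe.</p>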
<p>⚠️ Existing NULL-partition manifest entries aren't retroactively corrected by subsequent V2 writes. For a brand-new table containing only bad data, DROP TABLE and rewrite is the simplest recovery.</p>
<h3 id="heading-scrapredis">Scrapredis</h3>
<p>Scrapredis is a dedicated Redis instance that sits between Airflow and Scrapworker as a job queue. It's separate from Airflow's internal Redis, which exists solely for CeleryExecutor task dispatch. The separation means the crawler's job queue can be managed, scaled, or replaced without touching Airflow's internals.</p>
<p>The pattern generalises beyond scraping. Any external process that needs its own lifecycle, resource profile, or rate limiting can be wired the same way: Airflow pushes a job, the external worker pops it, and Airflow polls for the result.</p>
<p>The scraper pipeline follows this round-trip:</p>
<ol>
<li>Airflow pushes the job payload to the queue:</li>
</ol>
<pre><code class="language-python">QUEUE_KEY = "scrapworker:jobs"
client.lpush(QUEUE_KEY, json.dumps(payload))
</code></pre>
<ol start="2">
<li>Scrapworker blocks on the queue and pops the next job:</li>
</ol>
<pre><code class="language-python">while True:
    _, payload = client.blpop(redis_cfg["queue_key"])
</code></pre>
<ol start="3">
<li>Once the crawl finishes, Scrapworker writes the outcome and <code>s3_path</code> back to Redis:</li>
</ol>
<pre><code class="language-python">client.set(status_key, json.dumps({"status": "finished", "worker_id": worker_id, "s3_path": job["s3_path"]}), ex=TERMINAL_TTL)
</code></pre>
<ol start="4">
<li>The <code>wait_for_completion</code> task polls for that status key. On success, <code>publish_nessie_signal</code> picks up the <code>s3_path</code> and writes the signal row to Nessie.</li>
</ol>
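<p>The round-trip above can be exercised without a Redis server. The sketch below uses an in-memory stand-in that mimics only the commands involved (<code>lpush</code>, a non-blocking <code>lpop</code> in place of <code>blpop</code>, <code>set</code>, and <code>get</code>). The <code>FakeRedis</code> class, the status key layout, and the bounded polling loop are assumptions for illustration, not the repository's code.</p>
<pre><code class="language-python">import json
from collections import deque


class FakeRedis:
    """In-memory stand-in for a few Redis commands (not a real client)."""

    def __init__(self):
        self.lists, self.keys = {}, {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def lpop(self, key):
        items = self.lists.get(key)
        return items.popleft() if items else None

    def set(self, key, value, ex=None):
        self.keys[key] = value  # the TTL is ignored in this stand-in

    def get(self, key):
        return self.keys.get(key)


def wait_for_completion(client, status_key, max_polls, sleep=lambda: None):
    """Poll a status key a bounded number of times instead of forever."""
    for _ in range(max_polls):
        raw = client.get(status_key)
        if raw is not None:
            return json.loads(raw)
        sleep()
    raise TimeoutError(f"no status at {status_key} after {max_polls} polls")


client = FakeRedis()
client.lpush("scrapworker:jobs", json.dumps({"run_id": "run-1"}))
client.lpush("scrapworker:jobs", json.dumps({"run_id": "run-2"}))

job = json.loads(client.lpop("scrapworker:jobs"))   # returns run-2
status_key = "scrapworker:status:" + job["run_id"]  # hypothetical key layout
client.set(status_key, json.dumps(
    {"status": "finished", "s3_path": "s3://warehouse/raw/example.json"}))
result = wait_for_completion(client, status_key, max_polls=3)
</code></pre>
<p>Two things are worth noting. Pushing and popping from the same end of a Redis list returns the newest job first, so if strict first-in, first-out order matters, pair <code>lpush</code> with <code>brpop</code>. And bounding the number of polls turns an indefinite hang into a task failure that Airflow can report.</p>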
<h3 id="heading-scrapworker">Scrapworker</h3>
<p>Scrapworker is a Python app that uses the Scrapy framework to crawl every page of a request. It's decoupled from Airflow because rate limits are specific to each URL and client. For simplicity, consider it a type of external worker that receives and executes requests from Airflow.</p>
<p>It's responsible for downloading and writing content to object storage (RustFS). The Nessie catalog update is decoupled and kept in a separate Airflow pipeline task.</p>
<h4 id="heading-fixed-signal-table">Fixed Signal Table</h4>
<p>Scrapworker writes raw JSON to RustFS rather than writing scraped data directly as Iceberg columns. The pipeline then publishes a single lightweight signal row to a Nessie-managed Iceberg table.</p>
<p>The signal schema is fixed and minimal (<code>run_id</code>, <code>endpoint</code>, <code>s3_path</code>, <code>ds</code>, <code>hr</code>, <code>min</code>, <code>published_at</code>). It never changes, regardless of what's being scraped.</p>
<p>Mirroring the scraped payload as Iceberg columns would force Scrapworker to own schema evolution across different endpoints. This isn't an ideal place for schema ownership. Instead, schema ownership sits downstream:</p>
<pre><code class="language-plaintext">Scrapworker  →  raw files in RustFS  +  signal row in Iceberg (from Pipeline)
Airflow job  →  reads raw via s3_path, applies schema, writes structured Iceberg table
</code></pre>
<p>The downstream job knows the domain, knows the schema, and is the right place to handle type casting, nulls, and partition layout. Scrapworker stays generic and thin — the same code handles any endpoint without modification.</p>
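<p>For reference, the fixed signal schema can be written down as a small record type. The field names come from the schema listed above. The string column types, the <code>build_signal</code> helper, and its arguments are assumptions for illustration.</p>
<pre><code class="language-python">from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class SignalRow:
    run_id: str
    endpoint: str
    s3_path: str
    ds: str
    hr: str
    min: str
    published_at: str


def build_signal(run_id, endpoint, status, slot_start, now=None):
    """Combine the worker's status payload with the scheduled slot."""
    if status.get("status") != "finished":
        raise ValueError(f"crawl not finished: {status.get('status')}")
    now = now or datetime.now(timezone.utc)
    return SignalRow(
        run_id=run_id,
        endpoint=endpoint,
        s3_path=status["s3_path"],
        ds=slot_start.strftime("%Y-%m-%d"),
        hr=slot_start.strftime("%H"),
        min=slot_start.strftime("%M"),
        published_at=now.isoformat(),
    )


row = build_signal(
    "run-1",
    "https://api.binance.com/api/v3/trades",
    {"status": "finished", "s3_path": "s3://warehouse/raw/example.json"},
    datetime(2026, 3, 28, 23, 15, tzinfo=timezone.utc),
)
print(asdict(row))
</code></pre>
<p>Nothing in the row describes the scraped payload itself, which is why the same schema works for every endpoint.</p>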
<h4 id="heading-why-signal-publish-is-a-separate-airflow-task">Why Signal Publish is a Separate Airflow Task</h4>
<p>Scrapworker writes to RustFS and sets <code>status: finished</code> in Redis with the <code>s3_path</code>. A separate Airflow task reads that status and publishes the signal row to Nessie. The two writes are intentionally decoupled.</p>
<p>If Scrapworker published to Nessie directly after writing to RustFS, the two writes would share a failure mode. A Nessie failure after a successful RustFS write would leave data stranded with no signal and no clean recovery path. The only option would be a re-crawl, which isn't idempotent.</p>
<p>With the decoupled approach, each failure is isolated. A Nessie failure triggers an Airflow retry of the signal publish task only, with no re-scrape and no duplicate crawl. RustFS and Nessie failures are independently recoverable.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p>Raw scraped files are written directly to <code>s3://warehouse/raw/</code>, entirely outside Nessie's management. Nothing in the Iceberg layer touches this path.</p>
</li>
<li><p>The scrapworker signal table lives in a dedicated <code>scraper</code> namespace. Create it once before scrapworker runs for the first time.</p>
</li>
</ul>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<h2 id="heading-path-forward">Path Forward</h2>
<p>The stack we've built here is a working ingestion layer. It lands data reliably, tracks it in a versioned catalog, and gives you a foundation to build on. Two directions are worth considering from here.</p>
<h3 id="heading-extending-capabilities">Extending Capabilities</h3>
<p>These are improvements to what's already in the stack, making it more robust without adding new components.</p>
<p><strong>Ingestion reliability:</strong> Scrapworker currently handles failures by setting <code>status: failed</code> in Redis, which requires Airflow to re-trigger the full pipeline. Adding client-side rate limiting and per-endpoint retry logic with backoff would make crawl jobs more self-healing, so that a failed page fetch can retry independently without surfacing to Airflow at all.</p>
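<p>As a rough illustration of the retry half of that idea, here is a sketch of exponential backoff around a single page fetch. The helper <code>fetch_with_backoff</code>, its parameters, and the <code>flaky_fetch</code> stand-in are assumptions for this example, not existing scrapworker code, and rate limiting is left out:</p>
<pre><code class="language-python">import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                # Out of retries: let the failure surface to the caller
                raise
            # Delay doubles on each attempt (0.5s, 1s, 2s, ...) plus a little jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


# A stand-in fetcher that fails twice, then succeeds on the third call
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("simulated transient failure")
    return {"url": url, "status": 200}


# sleep is replaced with a no-op so the example runs instantly
result = fetch_with_backoff(flaky_fetch, "https://example.com/page/1", sleep=lambda seconds: None)
print(result["status"], calls["count"])
</code></pre>
<p>A real crawler would likely catch only the exceptions its HTTP client raises for transient errors, rather than every <code>Exception</code>.</p>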
<p><strong>Config validation:</strong> A misconfigured endpoint schema in <code>config.yaml</code> fails silently at runtime, often deep into a crawl. A <code>validate_config()</code> call at startup would catch missing required fields like <code>offset_param</code> or <code>response_map</code> before any job runs. This becomes more important as more endpoints are added.</p>
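<p>One possible shape for such a check is sketched below. The config layout (an <code>endpoints</code> mapping with a <code>url</code> key) is an assumption made for illustration; only <code>offset_param</code> and <code>response_map</code> come from the discussion above:</p>
<pre><code class="language-python">REQUIRED_ENDPOINT_FIELDS = ("url", "offset_param", "response_map")


def validate_config(config):
    """Raise ValueError listing every endpoint that is missing a required field."""
    errors = []
    for name, endpoint in config.get("endpoints", {}).items():
        missing = [field for field in REQUIRED_ENDPOINT_FIELDS if field not in endpoint]
        if missing:
            errors.append(f"{name}: missing {', '.join(missing)}")
    if errors:
        raise ValueError("Invalid config: " + "; ".join(errors))


# The parsed config.yaml would normally come from a YAML loader; a dict stands in here
config = {
    "endpoints": {
        "products": {
            "url": "https://example.com/api/products",
            "offset_param": "page",
            "response_map": {"items": "data"},
        },
        "reviews": {"url": "https://example.com/api/reviews"},
    }
}

try:
    validate_config(config)
except ValueError as exc:
    print(exc)  # Invalid config: reviews: missing offset_param, response_map
</code></pre>
<p>Collecting all errors before raising means a single startup failure reports every broken endpoint, instead of one per restart.</p>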
<p><strong>Observability:</strong> Airflow alerting and SLA monitoring give early warning when pipelines miss their schedule or tasks take longer than expected. The signal table is useful here too. A lightweight monitor that checks for expected signal rows within a time window is a simple SLA check that works without external tooling.</p>
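<p>The signal-table check could look something like the sketch below. It assumes the signal rows have already been read into a list of dicts with <code>endpoint</code> and <code>published_at</code> fields. The function name <code>missing_signals</code> and the sample rows are made up for the example:</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone


def missing_signals(signal_rows, expected_endpoints, now, window=timedelta(hours=1)):
    """Return endpoints with no signal row published inside the trailing window."""
    cutoff = now - window
    seen = {
        row["endpoint"]
        for row in signal_rows
        if datetime.fromisoformat(row["published_at"]) &gt;= cutoff
    }
    return sorted(set(expected_endpoints) - seen)


now = datetime(2026, 6, 6, 12, 0, tzinfo=timezone.utc)
rows = [
    {"endpoint": "products", "published_at": "2026-06-06T11:40:00+00:00"},
    {"endpoint": "reviews", "published_at": "2026-06-06T09:15:00+00:00"},
]

# "reviews" last published almost three hours ago, so it is reported as missing
print(missing_signals(rows, ["products", "reviews"], now))  # ['reviews']
</code></pre>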
<h3 id="heading-adding-layers">Adding Layers</h3>
<p>These are new capabilities that build on the ingestion foundation.</p>
<p><strong>Transform layer:</strong> The raw Iceberg tables written by the ingestion layer are the input for a transform step. dbt or Spark SQL can read from raw, apply schema, clean types, and write structured tables to a separate namespace. This is the T in ELT and the natural next step once ingestion is stable.</p>
<p><strong>Analytics:</strong> Trino is already in the stack and partially integrated. Connecting it fully to Nessie enables SQL queries across all Iceberg tables. Adding Superset on top gives a visualisation layer without requiring any changes to the ingestion pipeline.</p>
<p><strong>Broader source onboarding:</strong> The current stack handles one ingestion pattern: a scheduled Airflow pipeline triggering an external HTTP crawler. The same foundation supports pull-based sources like databases using CDC, and push-based sources like event streams via Kafka. The Iceberg tables and Nessie catalog serve as the landing zone regardless of how data arrives.</p>
<p><strong>Governance:</strong> Iceberg and Nessie provide the foundations, covering snapshots, schema evolution, commit history, and time travel. The governance layer on top requires deliberate additions: access control, data quality checks, lineage tracking, and schema enforcement. None of these require replacing what's here, as they sit on top of it.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Efficient Data Processing in Python: Batch vs Streaming Pipelines Explained ]]>
                </title>
                <description>
                    <![CDATA[ Every data pipeline makes a fundamental choice before any code is written: does it process data in chunks on a schedule, or does it process data continuously as it arrives? This choice — batch versus  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/efficient-data-processing-in-python-batch-vs-streaming-pipelines/</link>
                <guid isPermaLink="false">69dcf4dbf57346bc1e06d19b</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bala Priya C ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:51:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0cd359d4-9628-4b17-8dc4-a3a2a83172c8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every data pipeline makes a fundamental choice before any code is written: does it process data in chunks on a schedule, or does it process data continuously as it arrives?</p>
<p>This choice — batch versus streaming — shapes the architecture of everything downstream. The tools you use, the guarantees you can make about data freshness, the complexity of your error handling, and the infrastructure you need to run it all follow directly from this decision.</p>
<p>Getting it wrong is expensive. Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that didn't require it.</p>
<p>Teams that build batch pipelines when their use case demands real-time processing discover the gap at the worst possible moment — when a stakeholder asks why the dashboard is six hours out of date.</p>
<p>In this article, you'll learn what batch and streaming pipelines actually are, how they differ in terms of architecture and tradeoffs, and how to implement both patterns in Python. By the end, you'll have a clear framework for choosing the right approach for any data engineering problem you solve.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along comfortably, make sure you have:</p>
<ul>
<li><p>Practice writing Python functions and working with modules</p>
</li>
<li><p>Familiarity with pandas DataFrames and basic data manipulation</p>
</li>
<li><p>A general understanding of what ETL pipelines do — extract, transform, load</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-a-batch-pipeline">What Is a Batch Pipeline?</a></p>
<ul>
<li><p><a href="#heading-implementing-a-batch-pipeline-in-python">Implementing a Batch Pipeline in Python</a></p>
</li>
<li><p><a href="#heading-when-batch-works-well">When Batch Works Well</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-is-a-streaming-pipeline">What Is a Streaming Pipeline?</a></p>
<ul>
<li><p><a href="#heading-implementing-a-streaming-pipeline-in-python">Implementing a Streaming Pipeline in Python</a></p>
</li>
<li><p><a href="#heading-when-streaming-works-well">When Streaming Works Well</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-the-key-differences-at-a-glance">The Key Differences at a Glance</a></p>
</li>
<li><p><a href="#heading-choosing-between-batch-and-streaming">Choosing Between Batch and Streaming</a></p>
</li>
<li><p><a href="#heading-the-hybrid-pattern-lambda-and-kappa-architectures">The Hybrid Pattern: Lambda and Kappa Architectures</a></p>
</li>
</ul>
<h2 id="heading-what-is-a-batch-pipeline">What Is a Batch Pipeline?</h2>
<p>A batch pipeline processes a bounded, finite collection of records together — a file, a database snapshot, a day's worth of transactions. It runs on a schedule, say, hourly, nightly, weekly, reads all the data for that period, transforms it, and writes the result somewhere. Then it stops and waits until the next run.</p>
<p>The mental model is simple: <strong>collect, then process</strong>. Nothing happens between runs.</p>
<p>In a retail ETL context, a typical batch pipeline might look like this:</p>
<ol>
<li><p>At midnight, extract all orders placed in the last 24 hours from the transactional database</p>
</li>
<li><p>Join with the product catalogue and customer dimension tables</p>
</li>
<li><p>Compute daily revenue aggregates by region and product category</p>
</li>
<li><p>Load the results into the data warehouse for reporting</p>
</li>
</ol>
<p>The pipeline runs, finishes, and produces a complete, consistent snapshot of yesterday's business. By the time analysts arrive in the morning, the warehouse is up to date.</p>
<h3 id="heading-implementing-a-batch-pipeline-in-python">Implementing a Batch Pipeline in Python</h3>
<p>A batch pipeline in its simplest form is a Python script with three clearly separated stages: extract, transform, load.</p>
<pre><code class="language-python">import pandas as pd

def extract(filepath: str) -&gt; pd.DataFrame:
    """Load raw orders from a daily export file."""
    df = pd.read_csv(filepath, parse_dates=["order_timestamp"])
    return df

def transform(df: pd.DataFrame) -&gt; pd.DataFrame:
    """Clean and aggregate orders into daily revenue by region."""
    # Filter to completed orders only
    df = df[df["status"] == "completed"].copy()

    # Extract date from timestamp for grouping
    df["order_date"] = df["order_timestamp"].dt.date

    # Aggregate: total revenue and order count per region per day
    summary = (
        df.groupby(["order_date", "region"])
        .agg(
            total_revenue=("order_value_gbp", "sum"),
            order_count=("order_id", "count"),
            avg_order_value=("order_value_gbp", "mean"),
        )
        .reset_index()
    )
    return summary

def load(df: pd.DataFrame, output_path: str) -&gt; None:
    """Write the aggregated result to the warehouse (here, a CSV)."""
    df.to_csv(output_path, index=False)
    print(f"Loaded {len(df)} rows to {output_path}")

# Run the pipeline
raw = extract("orders_2024_06_01.csv")
aggregated = transform(raw)
load(aggregated, "warehouse/daily_revenue_2024_06_01.csv")
</code></pre>
<p>Let's walk through what this code is doing:</p>
<ul>
<li><p><code>extract</code> reads a CSV file representing a daily order export. The <code>parse_dates</code> argument tells pandas to interpret the <code>order_timestamp</code> column as a datetime object rather than a plain string — this matters for the date extraction step in transform.</p>
</li>
<li><p><code>transform</code> does two things: it filters out any orders that didn't complete (returns, cancellations), and then groups the remaining orders by date and region to produce revenue aggregates. The <code>.agg()</code> call computes three metrics per group in a single pass.</p>
</li>
<li><p><code>load</code> writes the result to a destination — in production this would be a database insert or a cloud storage upload, but the pattern is the same regardless.</p>
</li>
</ul>
<p>The three functions are deliberately kept separate. This separation — extract, transform, load — makes each stage independently testable, replaceable, and debuggable. If the transform logic changes, you don't need to modify the extract or load code.</p>
<h3 id="heading-when-batch-works-well">When Batch Works Well</h3>
<p>Batch pipelines are the right choice when:</p>
<ul>
<li><p><strong>Data freshness requirements are measured in hours, not seconds.</strong> A daily sales report doesn't need to be updated every minute. A weekly marketing attribution model certainly doesn't.</p>
</li>
<li><p><strong>You're processing large historical datasets.</strong> Backfilling two years of transaction history into a new data warehouse is inherently a batch job — the data exists, it's bounded, and you want to process it as efficiently as possible in one run.</p>
</li>
<li><p><strong>Consistency matters more than latency.</strong> Batch pipelines produce complete, point-in-time snapshots. Every row in the output was computed from the same input state. This consistency is valuable for financial reporting, regulatory compliance, and any downstream process that requires a stable, reproducible dataset.</p>
</li>
</ul>
<h2 id="heading-what-is-a-streaming-pipeline">What Is a Streaming Pipeline?</h2>
<p>A streaming pipeline processes data continuously, record by record or in small micro-batches, as it arrives. There is no "end" to the dataset — the pipeline runs indefinitely, consuming events from a source like a message queue, a Kafka topic, or a webhook, and processing each one as it comes in.</p>
<p>The mental model is: <strong>process as you collect</strong>. The pipeline is always running.</p>
<p>In the same retail ETL context, a streaming pipeline might handle order events as they're placed:</p>
<ol>
<li><p>An order is placed on the website and an event is published to a message queue</p>
</li>
<li><p>The streaming pipeline consumes the event within milliseconds</p>
</li>
<li><p>It validates, enriches, and routes the event to downstream systems</p>
</li>
<li><p>The fraud detection service, the inventory system, and the real-time dashboard all receive updated information immediately</p>
</li>
</ol>
<p>The difference from batch is fundamental: the data isn't sitting in a file waiting to be processed. It's flowing, and the pipeline has to keep up.</p>
<h3 id="heading-implementing-a-streaming-pipeline-in-python">Implementing a Streaming Pipeline in Python</h3>
<p>Python's generator functions are the natural building block for streaming pipelines. A generator produces values one at a time and pauses between yields — which maps directly onto the idea of processing records as they arrive without loading everything into memory.</p>
<pre><code class="language-python">import json
import time
from typing import Generator, Dict

def event_source(filepath: str) -&gt; Generator[Dict, None, None]:
    """
    Simulate a stream of order events from a file.
    In production, this would consume from Kafka or a message queue.
    """
    with open(filepath, "r") as f:
        for line in f:
            event = json.loads(line.strip())
            yield event
            time.sleep(0.01)  # simulate arrival delay between events

def validate(event: Dict) -&gt; bool:
    """Check that the event has the required fields and valid values."""
    required_fields = ["order_id", "customer_id", "order_value_gbp", "region"]
    if not all(field in event for field in required_fields):
        return False
    if event["order_value_gbp"] &lt;= 0:
        return False
    return True

def enrich(event: Dict) -&gt; Dict:
    """Add derived fields to the event before routing downstream."""
    event["processed_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    event["value_tier"] = (
        "high"   if event["order_value_gbp"] &gt;= 500
        else "mid"    if event["order_value_gbp"] &gt;= 100
        else "low"
    )
    return event

def run_streaming_pipeline(source_file: str) -&gt; None:
    """Process each event as it arrives from the source."""
    processed = 0
    skipped = 0

    for raw_event in event_source(source_file):
        if not validate(raw_event):
            skipped += 1
            continue

        enriched_event = enrich(raw_event)

        # In production: publish to downstream topic or write to sink
        print(f"[{enriched_event['processed_at']}] "
              f"Order {enriched_event['order_id']} | "
              f"£{enriched_event['order_value_gbp']:.2f} | "
              f"tier={enriched_event['value_tier']}")
        processed += 1

    print(f"\nDone. Processed: {processed} | Skipped: {skipped}")

run_streaming_pipeline("order_events.jsonl")
</code></pre>
<p>Here's what's happening:</p>
<ul>
<li><p><code>event_source</code> is a generator function — note the <code>yield</code> keyword instead of <code>return</code>. Each call to <code>yield event</code> pauses the function and hands one event to the caller. The pipeline processes that event before the generator resumes and fetches the next one. This means only one event is in memory at a time, regardless of how large the stream is. The <code>time.sleep(0.01)</code> simulates the real-world delay between events arriving from a message queue.</p>
</li>
<li><p><code>validate</code> checks each event for required fields and valid values before doing anything else with it. In a streaming context, bad events are super common — network issues, upstream bugs, and schema changes all produce malformed records. Validating early and skipping invalid events is far safer than letting them propagate into downstream systems.</p>
</li>
<li><p><code>enrich</code> adds derived fields to the event. Here, those are a processing timestamp and a value tier classification. In production, this step might also join against a lookup table, call an external API, or apply a model prediction.</p>
</li>
<li><p><code>run_streaming_pipeline</code> ties it together. The <code>for</code> loop over <code>event_source</code> consumes events one at a time, processes each through the <code>validate → enrich → route</code> stages, and keeps a running count of processed and skipped events.</p>
</li>
</ul>
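<p>The pipeline above handles one record at a time. The micro-batch variant mentioned earlier can be sketched with the same generator approach. The helper <code>micro_batches</code> is a hypothetical name, and a generator expression stands in for <code>event_source</code>:</p>
<pre><code class="language-python">from itertools import islice


def micro_batches(events, batch_size):
    """Group an event iterator into lists of up to batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Any iterable works as a source; this generator expression yields seven events
stream = ({"order_id": i, "order_value_gbp": 10.0 * i} for i in range(1, 8))

for batch in micro_batches(stream, batch_size=3):
    total = sum(event["order_value_gbp"] for event in batch)
    print(f"{len(batch)} events, total value {total:.2f}")
</code></pre>
<p>Memory use is bounded by the batch size rather than the stream length, and each batch can be written to a sink in one call instead of one call per event.</p>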
<h3 id="heading-when-streaming-works-well">When Streaming Works Well</h3>
<p>Streaming pipelines are the right choice when:</p>
<ul>
<li><p><strong>Data freshness is measured in seconds or milliseconds.</strong> Fraud detection, real-time inventory updates, live dashboards, and alerting systems all require data to be processed immediately — a batch job running every hour would make them useless.</p>
</li>
<li><p><strong>The data volume is too large to accumulate.</strong> High-frequency IoT sensor data, clickstream events, and financial tick data can generate millions of records per hour. Accumulating all of that before processing is often impractical – you'd need enormous storage and the processing job would take too long to be useful.</p>
</li>
<li><p><strong>You need to react, not just report.</strong> Streaming pipelines can trigger downstream actions — send a notification, block a transaction, update a recommendation — in response to individual events. Batch pipelines can only report on what already happened.</p>
</li>
</ul>
<h2 id="heading-the-key-differences-at-a-glance">The Key Differences at a Glance</h2>
<p>Here is an overview of the differences between batch and stream processing we've discussed thus far:</p>
<table>
<thead>
<tr>
<th><strong>DIMENSION</strong></th>
<th><strong>BATCH</strong></th>
<th><strong>STREAMING</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data model</strong></td>
<td>Bounded, finite dataset</td>
<td>Unbounded, continuous flow</td>
</tr>
<tr>
<td><strong>Processing trigger</strong></td>
<td>Schedule (time or event)</td>
<td>Arrival of each record</td>
</tr>
<tr>
<td><strong>Latency</strong></td>
<td>Minutes to hours</td>
<td>Milliseconds to seconds</td>
</tr>
<tr>
<td><strong>Throughput</strong></td>
<td>High (optimized for bulk processing)</td>
<td>Lower (per-record overhead)</td>
</tr>
<tr>
<td><strong>Complexity</strong></td>
<td>Lower</td>
<td>Higher</td>
</tr>
<tr>
<td><strong>State management</strong></td>
<td>Stateless per run</td>
<td>Often stateful across events</td>
</tr>
<tr>
<td><strong>Error handling</strong></td>
<td>Retry the whole job</td>
<td>Per-event dead-letter queues</td>
</tr>
<tr>
<td><strong>Consistency</strong></td>
<td>Strong (point-in-time snapshot)</td>
<td>Eventually consistent</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Reporting, ML training, backfills</td>
<td>Alerting, real-time features, event routing</td>
</tr>
</tbody></table>
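<p>Two rows of the table, state management and error handling, are easier to see in code. The sketch below keeps a running total across events and sets failed events aside instead of stopping. A plain list stands in for a real dead-letter queue, and the function name and sample events are illustrative only:</p>
<pre><code class="language-python">from collections import defaultdict


def process_stream(events):
    """Keep running revenue per region; send bad events to a dead-letter list."""
    revenue_by_region = defaultdict(float)  # state carried across events
    dead_letters = []                       # failed events kept for inspection or replay

    for event in events:
        try:
            # Read and convert first, so a bad event never touches the state
            region = event["region"]
            value = float(event["order_value_gbp"])
            revenue_by_region[region] += value
        except (KeyError, TypeError, ValueError) as exc:
            # One bad event is set aside; the rest of the stream keeps flowing
            dead_letters.append({"event": event, "error": repr(exc)})

    return dict(revenue_by_region), dead_letters


events = [
    {"order_id": 1, "region": "north", "order_value_gbp": 120.0},
    {"order_id": 2, "order_value_gbp": 40.0},                      # missing region
    {"order_id": 3, "region": "north", "order_value_gbp": 30.0},
    {"order_id": 4, "region": "south", "order_value_gbp": "n/a"},  # bad value
]

totals, dead = process_stream(events)
print(totals)     # {'north': 150.0}
print(len(dead))  # 2
</code></pre>
<p>A batch job hitting the same bad rows would typically fail and be rerun as a whole, which is the contrast the table draws.</p>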
<h2 id="heading-choosing-between-batch-and-streaming">Choosing Between Batch and Streaming</h2>
<p>Okay, all of this info is great. But <em>how</em> do you choose between batch and stream processing? The decision comes down to three questions:</p>
<p><strong>How fresh does the data need to be?</strong> If stakeholders can tolerate results that are hours old, batch is simpler and more cost-effective. If they need results within seconds, streaming is unavoidable.</p>
<p><strong>How complex is your processing logic?</strong> Batch jobs can join across large datasets, run expensive aggregations, and apply complex business logic without worrying about latency. Streaming pipelines must process each event quickly, which constrains how much work you can do per record.</p>
<p><strong>What's your operational capacity?</strong> Streaming infrastructure — Kafka clusters, Flink or Spark Streaming jobs, dead-letter queues, exactly-once delivery guarantees — is significantly more complex to operate than a scheduled Python script. If your team is small or your use case doesn't demand real-time results, that complexity is cost without benefit.</p>
<p>Start with batch. It's simpler to build, simpler to test, simpler to debug, and simpler to maintain. Move to streaming when a specific, concrete requirement — not a hypothetical future one — makes batch insufficient. Most data problems are batch problems, and the ones that genuinely require streaming are usually obvious when you run into them.</p>
<p>And as you might have guessed, some data processing systems need both, which is why hybrid approaches exist.</p>
<h2 id="heading-the-hybrid-pattern-lambda-and-kappa-architectures">The Hybrid Pattern: Lambda and Kappa Architectures</h2>
<p>In practice, many production data systems use both patterns together. The two most common hybrid architectures are Lambda and Kappa.</p>
<p><a href="https://www.databricks.com/glossary/lambda-architecture"><strong>Lambda architecture</strong></a> runs a batch layer and a streaming layer in parallel. The batch layer processes complete historical data and produces accurate, consistent results on a delay. The streaming layer processes live data and produces approximate results immediately. Downstream consumers merge both outputs — using the streaming result for freshness and the batch result for correctness.</p>
<p>The tradeoff is operational complexity: you're maintaining two separate processing codebases that must produce semantically equivalent results.</p>
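<p>To make the merge step concrete, here is a small sketch of a serving function that combines the two layers. It assumes the batch view holds totals up to the last completed run and the speed view holds only what arrived since. That is one common way to merge, not the only one, and the names and figures are illustrative:</p>
<pre><code class="language-python">def merge_views(batch_view, speed_view):
    """Combine batch totals with streaming increments seen since the last batch run."""
    merged = dict(batch_view)
    for key, value in speed_view.items():
        merged[key] = merged.get(key, 0.0) + value
    return merged


# Batch layer: accurate totals up to the last completed run
batch_view = {"north": 1200.0, "south": 800.0}
# Speed layer: revenue from events that arrived after that run
speed_view = {"north": 50.0, "east": 20.0}

print(merge_views(batch_view, speed_view))
# {'north': 1250.0, 'south': 800.0, 'east': 20.0}
</code></pre>
<p>When the next batch run completes, its output replaces the batch view and the speed view is reset, so any approximation in the streaming path is eventually corrected.</p>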
<p><a href="https://hazelcast.com/glossary/kappa-architecture/"><strong>Kappa architecture</strong></a> simplifies this by using only a streaming layer, but with the ability to replay historical data through the same pipeline when you need batch-style reprocessing. This works well when your streaming platform, such as <a href="https://kafka.apache.org/documentation/">Apache Kafka</a> paired with <a href="https://flink.apache.org/">Apache Flink</a>, supports log retention and replay. You get one codebase, one set of logic, and the ability to reprocess history when your pipeline changes.</p>
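<p>The replay idea can be sketched in a few lines. An append-only Python list stands in for a retained log such as a Kafka topic, and <code>process</code> and <code>consume</code> are hypothetical names. The point is that live consumption and reprocessing run the same function and differ only in the starting offset:</p>
<pre><code class="language-python">def process(event, totals):
    """The single processing function used for both live and replayed events."""
    region = event["region"]
    totals[region] = totals.get(region, 0.0) + event["order_value_gbp"]


def consume(log, start_offset, totals):
    """Apply process() to every event in the log from start_offset onward."""
    for event in log[start_offset:]:
        process(event, totals)
    return len(log)  # next offset to read from


# An append-only list stands in for a retained log
log = [
    {"region": "north", "order_value_gbp": 100.0},
    {"region": "south", "order_value_gbp": 40.0},
]

live_totals = {}
offset = consume(log, 0, live_totals)

log.append({"region": "north", "order_value_gbp": 25.0})
offset = consume(log, offset, live_totals)  # live: only the new event is processed

replayed_totals = {}
consume(log, 0, replayed_totals)            # replay: the whole log from offset 0

print(live_totals == replayed_totals)  # True
</code></pre>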
<p>Neither architecture is universally better. Lambda is more common in organizations that adopted batch processing first and added streaming incrementally. Kappa is more common in systems designed with streaming as the primary pattern.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Batch and streaming are tools with different tradeoffs, each suited to a different class of problems. Batch pipelines excel at consistency, simplicity, and bulk throughput. Streaming pipelines excel at latency, reactivity, and continuous processing.</p>
<p>Understanding both patterns at the architectural level — before reaching for specific frameworks like Apache Spark, Kafka, or Flink — gives you the judgment to choose the right one and explain that choice clearly. The frameworks implement these patterns, while the judgment about which pattern fits your problem is yours to make first.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans ]]>
                </title>
                <description>
                    <![CDATA[ In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformatio... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-optimize-pyspark-jobs-handbook/</link>
                <guid isPermaLink="false">69851d7be613661950e00d8f</guid>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ spark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PySpark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS Glue ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sameer Shukla ]]>
                </dc:creator>
                <pubDate>Thu, 05 Feb 2026 22:45:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770331493095/d569e168-d3ba-40e0-a500-7f682bbef693.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformations and actual computation lies an invisible translation layer – the logical plan – that determines whether your job runs in minutes or hours.</p>
<p>Most engineers never look at this layer, which is why they spend days tuning configurations that don't address the real problem: inefficient transformations that generate bloated plans.</p>
<p>This handbook teaches you to read, interpret, and control those plans, transforming you from someone who writes PySpark code into someone who architects efficient data pipelines with precision and confidence.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-background-information">Background Information</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-1-the-spark-mindset-why-plans-matter">Chapter 1: The Spark Mindset: Why Plans Matter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-2-understanding-the-spark-execution-flow">Chapter 2: Understanding the Spark Execution Flow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-3-reading-and-debugging-plans-like-a-pro">Chapter 3: Reading and Debugging Plans Like a Pro</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-4-writing-efficient-transformations">Chapter 4: Writing Efficient Transformations</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-scenario-1-rename-in-one-pass-withcolumnrenamed-vs-todf">Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-2-reusing-expressions">Scenario 2: Reusing expressions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-3-batch-column-ops">Scenario 3: Batch column ops</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-4-early-filter-vs-late-filter">Scenario 4: Early Filter vs Late Filter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-5-column-pruning">Scenario 5: Column Pruning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-6-filter-pushdown-vs-full-scan">Scenario 6: Filter pushdown vs full scan</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-7-de-duplicate-right">Scenario 7: De-duplicate right</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-8-count-smarter">Scenario 8: Count Smarter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-9-window-wisely">Scenario 9: Window wisely</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-10-incremental-aggregations-with-cache-and-persist">Scenario 10: Incremental Aggregations with Cache and Persist</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-11-reduce-shuffles">Scenario 11: Reduce shuffles</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-12-know-your-shuffle-triggers">Scenario 12: Know Your Shuffle Triggers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-13-tune-parallelism-shuffle-partitions-amp-aqe">Scenario 13: Tune Parallelism: Shuffle Partitions &amp; AQE</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-14-handle-skew-smartly">Scenario 14: Handle Skew Smartly</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-15-sort-efficiently-orderby-vs-sortwithinpartitions">Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-background-information">Background Information</h2>
<h3 id="heading-what-this-handbook-is-really-about">What This Handbook is Really About</h3>
<p>This is not a tutorial about Spark internals, cluster tuning, or PySpark syntax or APIs.</p>
<p>This is a handbook about writing PySpark code that generates efficient logical plans.</p>
<p>Because when your code produces clean, optimized plans, Spark pushes filters correctly, shuffles reduce instead of multiply, projections stay shallow, and the DAG (<a target="_blank" href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed Acyclic Graph</a>) becomes predictable, lean, and fast.</p>
<p>When your code produces messy plans, Spark shuffles more than necessary, and projects pile up into deep, expensive stacks. Filters arrive late instead of early, joins explode into wide, slow operations, and the DAG becomes tangled and expensive.</p>
<p>The difference between a fast job and a slow job is not “faster hardware.” It’s the structure of the plan Spark generates from your code. This handbook teaches you to shape that plan deliberately through scenarios.</p>
<h3 id="heading-who-this-handbook-is-for">Who This Handbook Is For</h3>
<p>This handbook is written for:</p>
<ul>
<li><p><strong>Data engineers</strong> building production ETL pipelines who want to move beyond trial-and-error tuning and understand <em>why</em> jobs perform the way they do</p>
</li>
<li><p><strong>Analytics engineers</strong> working with large datasets in Databricks, EMR, or Glue who need to optimize Spark jobs but don't have time for thousand-page reference manuals</p>
</li>
<li><p><strong>Data scientists</strong> transitioning from pandas to PySpark who find themselves writing code that technically runs but takes forever</p>
</li>
<li><p><strong>Anyone</strong> who has stared at the Spark UI, seen mysterious "Exchange" nodes in the DAG, and wondered, <em>"Why is this shuffling so much data?"</em></p>
</li>
</ul>
<p>You should already be comfortable writing basic PySpark code: creating DataFrames, applying transformations, running aggregations. This handbook won't teach you Spark syntax. Instead, it teaches you how to write transformations that work <em>with</em> the optimizer, not against it.</p>
<h3 id="heading-how-this-handbook-is-structured">How This Handbook Is Structured</h3>
<p>We’ll start with foundations, then move on to real-world scenarios.</p>
<p>Chapters 1-3 build your mental model. You'll learn what logical plans are, how they connect to physical execution, and how to read the plan output that Spark shows you. These chapters are short and focused – just enough theory to make the practical scenarios meaningful.</p>
<p>Chapter 4 is the heart of the handbook. It contains 15 real-world scenarios, organized by category. Each scenario shows you a common performance problem, explains what's happening in the logical plan, and demonstrates the better approach. You'll see before-and-after code, plan comparisons, and clear explanations of why one approach outperforms another.</p>
<h3 id="heading-what-youll-learn">What You'll Learn</h3>
<p>By the end of this handbook, you'll be able to:</p>
<ul>
<li><p>Read and interpret Spark's logical, optimized, and physical plans</p>
</li>
<li><p>Identify expensive operations before running your code</p>
</li>
<li><p>Restructure transformations to minimize shuffles</p>
</li>
<li><p>Choose the right join strategies for your data</p>
</li>
<li><p>Avoid common pitfalls that cause memory issues and slow performance</p>
</li>
<li><p>Debug production issues by examining execution plans</p>
</li>
</ul>
<p>More importantly, you'll develop a Spark mindset, an intuition for how your code translates to cluster operations. You'll stop writing code that "should work" and start writing code that you <em>know</em> will work efficiently.</p>
<h3 id="heading-technical-prerequisites">Technical Prerequisites</h3>
<p>I assume that you’re familiar with the following concepts before proceeding:</p>
<ol>
<li><p>Python fundamentals</p>
</li>
<li><p>PySpark basics</p>
<ul>
<li><p>Creating DataFrames and reading data from files</p>
</li>
<li><p>Basic DataFrame operations: select, filter, withColumn, groupBy, join</p>
</li>
<li><p>Writing DataFrames back to storage</p>
</li>
</ul>
</li>
<li><p>Basic Spark concepts</p>
<ul>
<li><p>Basic understanding of Spark applications, jobs, stages, and tasks</p>
</li>
<li><p>Basic understanding of the difference between transformations and actions</p>
</li>
<li><p>Basic understanding of partitions and shuffles</p>
</li>
</ul>
</li>
<li><p>AWS Glue (Good to have)</p>
</li>
</ol>
<h2 id="heading-chapter-1-the-spark-mindset-why-plans-matter">Chapter 1: The Spark Mindset: Why Plans Matter</h2>
<p>This chapter isn’t about Spark theory or internals. It’s about understanding Spark Plans, and seeing Spark the way the engine sees your code. Once you understand how Spark builds and optimizes a logical plan, optimization stops being trial and error and becomes intentional engineering.</p>
<p>Behind every simple transformation, Spark quietly redraws its internal blueprint. Every transformation you write, from <em>withColumn</em> to <em>join</em>, changes that plan. When the plan is efficient, Spark flies, but when it’s messy, Spark crawls.</p>
<h3 id="heading-the-invisible-layer-behind-every-transformation">The Invisible Layer Behind Every Transformation</h3>
<p>When you write PySpark code, it feels like you’re chaining operations step by step. In reality, Spark isn’t executing those lines. It’s quietly building a blueprint, a logical plan describing <em>what</em> to do, not <em>how</em>.</p>
<p>Once this plan is built, the Catalyst Optimizer analyzes it, rearranges operations, eliminates redundancies, and produces an optimized plan. Catalyst is Spark’s query optimization engine.</p>
<p>Every DataFrame or SQL operation we write, such as select, filter, join, or groupBy, is first converted into a logical plan. Catalyst then analyzes and transforms this plan using a set of rule-based optimizations, such as predicate pushdown, column pruning, constant folding, and join reordering. The result is an optimized logical plan, which Spark then converts into a physical plan: the operations your cluster actually runs. This invisible planning layer often decides the job’s performance more than any configuration setting.</p>
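<p>To make "rule-based optimization" concrete, here is a minimal pure-Python sketch of one rewrite rule: moving a Filter below a Project when the filter only reads columns that exist underneath. The <code>Node</code> class and the function names are hypothetical and far simpler than Catalyst's real classes. The sketch only illustrates the idea of rewriting a plan tree.</p>

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One operator in a toy logical plan (not Spark's real classes)."""
    op: str                      # "Scan", "Project", or "Filter"
    columns: Tuple[str, ...]     # columns produced (Scan/Project) or read (Filter)
    child: Optional["Node"] = None


def output_columns(node):
    """Columns available above a node; a Filter passes its child's columns through."""
    if node.op == "Filter":
        return output_columns(node.child)
    return set(node.columns)


def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) when it is safe.

    "Safe" here means the filter only reads columns that already exist
    below the projection, a simplification of the checks Catalyst performs.
    """
    if plan.child is None:
        return plan
    child = push_filter_below_project(plan.child)
    if plan.op == "Filter" and child.op == "Project" and child.child is not None:
        if set(plan.columns).issubset(output_columns(child.child)):
            pushed = Node("Filter", plan.columns, child.child)
            return Node("Project", child.columns, push_filter_below_project(pushed))
    return Node(plan.op, plan.columns, child)


def describe(plan):
    """List operator names from the top of the tree down to the data source."""
    names = []
    while plan is not None:
        names.append(plan.op)
        plan = plan.child
    return names


scan = Node("Scan", ("firstname", "age", "country"))
written = Node("Filter", ("age",), Node("Project", ("firstname", "age", "country"), scan))
optimized = push_filter_below_project(written)

print(describe(written))    # ['Filter', 'Project', 'Scan']
print(describe(optimized))  # ['Project', 'Filter', 'Scan']
```

<p>The real optimizer applies many such rules repeatedly until the plan stops changing. The point is that each rule is a small, local tree rewrite.</p>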
<h3 id="heading-from-logical-to-optimized-to-physical-plans">From Logical to Optimized to Physical Plans</h3>
<p>When you run <code>df.explain(True)</code>, Spark actually shows you four stages of reasoning:</p>
<h4 id="heading-1-logical-plan">1. Logical Plan</h4>
<p>The logical plan is the first stage where the initial translation of the code results in a tree structure that shows what operations need to happen, without worrying about how to execute them efficiently. It’s a blueprint of the query’s logic before any optimization or physical planning occurs.</p>
<p>This:</p>
<pre><code class="lang-python">df.filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">25</span>) \
  .select(<span class="hljs-string">'firstname'</span>, <span class="hljs-string">'country'</span>) \
  .groupby(<span class="hljs-string">'country'</span>) \
  .count() \
  .explain(<span class="hljs-literal">True</span>)
</code></pre>
<p>results in the following logical plan:</p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Aggregate ['</span>country], [<span class="hljs-string">'country, '</span>count(<span class="hljs-number">1</span>) AS count<span class="hljs-comment">#108]</span>
+- Project [firstname<span class="hljs-comment">#95, country#97]</span>
   +- Filter (age<span class="hljs-comment">#96L &gt; cast(25 as bigint))</span>
      +- LogicalRDD [firstname<span class="hljs-comment">#95, age#96L, country#97], false</span>
</code></pre>
<h4 id="heading-2-analyzed-logical-plan">2. Analyzed Logical Plan</h4>
<p>The analyzed logical plan is the second stage in Spark’s query optimization. In this stage, Spark validates the query by checking that tables and columns actually exist in the Catalog and resolving all references. It converts the unresolved logical plan into a resolved one with correct data types and column bindings before optimization.</p>
<h4 id="heading-3-optimized-logical-plan">3. Optimized Logical Plan</h4>
<p>The optimized logical plan is where Spark's Catalyst optimizer improves the logical plan by applying smart rules like filtering data early, removing unnecessary columns, and combining operations to reduce computation. It's the smarter, more efficient version of your original plan that will execute faster and use fewer resources.</p>
<p>Let’s understand using a simple code example:</p>
<pre><code class="lang-python">df.select(<span class="hljs-string">'firstname'</span>, <span class="hljs-string">'country'</span>) \
  .groupby(<span class="hljs-string">'country'</span>) \
  .count() \
  .filter(col(<span class="hljs-string">'country'</span>) == <span class="hljs-string">'USA'</span>) \
  .explain(<span class="hljs-literal">True</span>)
</code></pre>
<p>Here’s the parsed logical plan:</p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Filter '</span>`=`(<span class="hljs-string">'country, USA)
+- Aggregate [country#97], [country#97, count(1) AS count#122L]
   +- Project [firstname#95, country#97]
      +- LogicalRDD [firstname#95, age#96L, country#97], false</span>
</code></pre>
<p>What this means:</p>
<ul>
<li><p>Spark first projects firstname and country</p>
</li>
<li><p>Then aggregates by country</p>
</li>
<li><p>Then applies the filter country = 'USA' <strong>after</strong> aggregation</p>
</li>
</ul>
<p>(because that’s how you wrote it).</p>
<p>Here’s the optimized logical plan:</p>
<pre><code class="lang-python">== Optimized Logical Plan ==
Aggregate [country<span class="hljs-comment">#97], [country#97, count(1) AS count#122L]</span>
+- Project [country<span class="hljs-comment">#97]</span>
   +- Filter (isnotnull(country<span class="hljs-comment">#97) AND (country#97 = USA))</span>
      +- LogicalRDD [firstname<span class="hljs-comment">#95, age#96L, country#97], false</span>
</code></pre>
<p>Key improvements Catalyst applied:</p>
<ul>
<li><p>Filter pushdown: The filter country = 'USA' is pushed below the aggregation, so Spark only groups U.S. rows.</p>
</li>
<li><p>Column pruning: “firstname” is automatically removed because it’s never used in the final output.</p>
</li>
<li><p>Cleaner projection: Intermediate columns are dropped early, reducing I/O and in-memory footprint.</p>
</li>
</ul>
<h4 id="heading-4-physical-plan">4. Physical Plan</h4>
<p>The physical plan is Spark's final execution blueprint that shows exactly how the query will run: which specific algorithms to use, how to distribute work across machines, and the order of low-level operations. It's the concrete, executable version of the optimized logical plan, translated into actual Spark operations like “ShuffleExchange”, “HashAggregate”, and “FileScan” that will run on your cluster.</p>
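<p>On the way from the logical plan to the physical plan, Spark may, for example:</p>
<ul>
<li><p>Fold constants (lit(2) * lit(3) → lit(6))</p>
</li>
<li><p>Push filters closer to the data source</p>
</li>
<li><p>Replace a regular join with a broadcast join when one side is small enough to broadcast (by default, below spark.sql.autoBroadcastJoinThreshold, which is 10 MB)</p>
</li>
</ul>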
<p>Once the physical plan is finalized, Spark’s scheduler converts it into a DAG of stages and tasks that run across the cluster. Understanding that lineage, from your code → plan → DAG, is what separates fast jobs from slow ones.</p>
<h3 id="heading-how-to-read-a-logical-plan">How to Read a Logical Plan</h3>
<p>A logical plan prints as a tree: the bottom is your data source, and each higher node represents a transformation.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Node</strong></td><td><strong>Meaning</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Relation / LogicalRDD</td><td>Data source, the initial DataFrame</td></tr>
<tr>
<td>Project</td><td>Column selection and transformation (select, withColumn)</td></tr>
<tr>
<td>Filter</td><td>Row filtering based on conditions (where, filter)</td></tr>
<tr>
<td>Join</td><td>Combining two DataFrames on matching keys (join)</td></tr>
<tr>
<td>Aggregate</td><td>GroupBy and aggregation operations (groupBy, agg)</td></tr>
<tr>
<td>Exchange</td><td>Shuffle operation (data redistribution across partitions); appears in the physical plan</td></tr>
<tr>
<td>Sort</td><td>Ordering data (orderBy, sort)</td></tr>
</tbody>
</table>
</div><p>Each node represents a transformation. Execution flows from the bottom up. Let's understand with a basic example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> *
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> *

spark = SparkSession.builder.appName(<span class="hljs-string">"Practice"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

df = spark.createDataFrame(employees_data,  
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])
</code></pre>
<h4 id="heading-version-a-withcolumn-filter">Version A: withColumn → filter</h4>
<p>In this version, we add a derived column with withColumn and then apply a filter to the dataset. This ordering is logically correct and produces the expected result. It shows what happens to the logical plan when Spark is asked to compute a new column before any rows are eliminated.</p>
<pre><code class="lang-python">df_filtered = df \
    .withColumn(<span class="hljs-string">'bonus'</span>, col(<span class="hljs-string">'salary'</span>) * <span class="hljs-number">82</span>) \
    .filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">35</span>)

<span class="hljs-comment"># explain() prints the plans and returns None, so call it separately</span>
df_filtered.explain(<span class="hljs-literal">True</span>)
</code></pre>
<h4 id="heading-parsed-logical-plan-simplified">Parsed Logical Plan (Simplified)</h4>
<pre><code class="lang-python">Filter (age &gt; <span class="hljs-number">35</span>)
└─ Project [*, (salary * <span class="hljs-number">82</span>) AS bonus]
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>So what’s going on here? Execution flows from the bottom up.</p>
<ul>
<li><p>Spark first reads the LogicalRDD.</p>
</li>
<li><p>Then applies the Project node, keeping all columns and adding bonus.</p>
</li>
<li><p>Finally, the Filter removes rows where age ≤ 35.</p>
</li>
</ul>
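<p>Read literally, this plan computes the bonus for every employee, even those who are later filtered out. It's harmless here, but on millions of rows it means more computation and more data carried through every later step. For a simple, deterministic expression like this one, Catalyst can usually push the filter below the projection in the optimized plan. With UDFs or more complex logic it often can't, so the order you write still matters.</p>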
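<p>Reading bottom-up becomes mechanical with a little practice. As a rough illustration, the hypothetical helper below takes plan text and lists the operators in execution order. It assumes a linear plan (one child per node, so no joins) and the tree decorations shown in this chapter.</p>

```python
import re


def operators_bottom_up(plan_text):
    """Return operator names in execution order (data source first).

    Only valid for linear plans, where reversing the printed order
    gives the order in which the operators run.
    """
    ops = []
    for line in plan_text.splitlines():
        # Strip tree decoration such as "+- ", ":- ", "└─ " and leading quotes.
        cleaned = re.sub(r"^[\s:+\-|└─']*", "", line)
        match = re.match(r"[A-Za-z]+", cleaned)
        if match:
            ops.append(match.group(0))
    return list(reversed(ops))


plan = """\
Filter (age > 35)
+- Project [*, (salary * 82) AS bonus]
   +- LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
"""

print(operators_bottom_up(plan))  # ['LogicalRDD', 'Project', 'Filter']
```

<p>Header lines such as "== Parsed Logical Plan ==" are skipped because they don't start with a letter once the decoration is stripped.</p>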
<h4 id="heading-version-b-filter-project">Version B: Filter → Project</h4>
<p>In this version, we apply the filter before introducing the derived column. The idea is to show how pushing row-reducing operations earlier allows Catalyst to produce a leaner logical plan. Compared to Version A, this example demonstrates that the same logic, written in a different order, can significantly reduce the amount of work Spark needs to perform.</p>
<pre><code class="lang-python">df_filtered = df \
    .filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">35</span>) \
    .withColumn(<span class="hljs-string">'bonus'</span>, col(<span class="hljs-string">'salary'</span>) * <span class="hljs-number">82</span>)

<span class="hljs-comment"># explain() prints the plans and returns None, so call it separately</span>
df_filtered.explain(<span class="hljs-literal">True</span>)
</code></pre>
<h4 id="heading-parsed-logical-plan-simplified-1">Parsed Logical Plan (Simplified)</h4>
<pre><code class="lang-python">Project [*, (salary * <span class="hljs-number">82</span>) AS bonus]
└─ Filter (age &gt; <span class="hljs-number">35</span>)
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>So what’s going on here?</p>
<ul>
<li><p>Spark starts from the LogicalRDD.</p>
</li>
<li><p>It immediately applies the Filter, reducing the dataset to only employees with age &gt; 35.</p>
</li>
<li><p>Then the Project node adds the derived column bonus for this smaller subset.</p>
</li>
</ul>
<p>Now the Filter sits below the Project in the plan, cutting data movement and minimizing computation. Spark prunes data first, then derives new columns. This order reduces both the volume of data processed and the amount transferred, leading to a lighter and faster plan.</p>
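<p>The difference in work between the two orderings can be counted with a small pure-Python model. This is only a sketch of the plans taken literally, with plain lists standing in for DataFrames and a hypothetical <code>bonus</code> function that records how often it is called. Spark's optimizer may reorder simple cases like this on its own.</p>

```python
# 40 toy rows with ages 20..59 (made-up data, not the employees DataFrame)
rows = [{"age": age, "salary": 1000 * age} for age in range(20, 60)]


def bonus(salary, counter):
    """Compute the derived column and count how many times it was needed."""
    counter["calls"] += 1
    return salary * 82


# Version A: derive the column first, filter second
a_calls = {"calls": 0}
version_a = [dict(r, bonus=bonus(r["salary"], a_calls)) for r in rows]
version_a = [r for r in version_a if r["age"] > 35]

# Version B: filter first, derive the column only for surviving rows
b_calls = {"calls": 0}
version_b = [r for r in rows if r["age"] > 35]
version_b = [dict(r, bonus=bonus(r["salary"], b_calls)) for r in version_b]

print(a_calls["calls"], b_calls["calls"])  # 40 24
print(version_a == version_b)              # True
```

<p>Both versions return the same rows, but Version B evaluates the derived column only for the rows that survive the filter.</p>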
<h3 id="heading-why-you-should-look-at-the-plan-every-time-by-running-dfexplaintrue">Why You Should Look at the Plan Every Time by Running <code>df.explain(True)</code></h3>
<p>This is the quickest way to spot performance issues <em>before</em> they hit production. It shows:</p>
<ul>
<li><p>Whether filters sit in the right place.</p>
</li>
<li><p>How many Project nodes exist (each adds overhead).</p>
</li>
<li><p>Where Exchange nodes appear (these are shuffle boundaries).</p>
</li>
<li><p>If Catalyst pushed filters or rewrote joins as expected.</p>
</li>
</ul>
<p>A quick <code>explain()</code> takes seconds, while debugging a bad shuffle in production takes hours. Run <code>explain()</code> whenever you add or reorder transformations. The plan never lies.</p>
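<p>One way to turn this habit into a quick check is to count shuffle boundaries in the physical plan text. The helper below is a hypothetical sketch that works on a plain string. The sample plan is hand-written and abbreviated for illustration, not captured from a real run.</p>

```python
def count_shuffles(physical_plan_text):
    """Count shuffle Exchange operators in physical plan text.

    BroadcastExchange lines are not counted: a broadcast copies a small
    table to every executor instead of repartitioning the large one.
    """
    count = 0
    for line in physical_plan_text.splitlines():
        # Drop tree decoration and whole-stage-codegen prefixes like "*(2) ".
        operator = line.lstrip(" :+-|*()0123456789")
        if operator.startswith("Exchange"):
            count += 1
    return count


# Hand-written, abbreviated sample in the style of explain() output
sample_plan = """\
== Physical Plan ==
*(2) HashAggregate(keys=[department], functions=[avg(salary)])
+- Exchange hashpartitioning(department, 200)
   +- *(1) HashAggregate(keys=[department], functions=[partial_avg(salary)])
      +- *(1) BroadcastHashJoin [department], [department], Inner, BuildRight
         :- *(1) Scan ExistingRDD[department,salary]
         +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
            +- *(1) Scan ExistingRDD[department]
"""

print(count_shuffles(sample_plan))  # 1
```

<p>Because <code>explain()</code> prints its output rather than returning it, you would need to capture the text first, for example by redirecting standard output. Treat that detail as an assumption to verify against your PySpark version.</p>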
<h4 id="heading-what-spark-does-under-the-hood">What Spark Does Under the Hood</h4>
<p>Catalyst can sometimes reorder simple filters automatically, but once you use UDFs, nested logic, or joins, it often can’t. That’s why the best habit is to write transformations in a way that already makes sense to the optimizer. Filter early, avoid redundant projections, and keep plans as shallow as possible.</p>
<p>Optimizing Spark isn’t about tuning cluster configs – it’s about writing code that yields efficient plans. If your plan shows late filters, too many projections, or multiple Exchange nodes, it’s already explaining why your job will run slowly.</p>
<h2 id="heading-chapter-2-understanding-the-spark-execution-flow">Chapter 2: Understanding the Spark Execution Flow</h2>
<p>In Chapter 1, you learned how Spark interprets your transformations into logical plans – blueprints of what the job intends to do.</p>
<p>But Spark doesn't stop there. It must translate those plans into distributed actions across a cluster of executors, coordinate data movement, and handle any failures that may occur.</p>
<p>This chapter reveals what happens when that plan leaves the driver: how Spark breaks your job into stages, tasks, and a directed acyclic graph (DAG) that actually runs.</p>
<p>By the end, you’ll understand why some operations shuffle terabytes while others fly, and how to predict it before execution begins.</p>
<h3 id="heading-from-plans-to-stages-to-tasks">From Plans to Stages to Tasks</h3>
<p>A Spark job evolves through three conceptual layers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Layer</strong></td><td><strong>What It Represents</strong></td><td><strong>Example View</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Plan</td><td>The optimized logical + physical representation of your query</td><td>Read → Filter → Join → Aggregate</td></tr>
<tr>
<td>Stage</td><td>A contiguous set of operations that can run without shuffling data</td><td>“Map Stage” or “Reduce Stage”</td></tr>
<tr>
<td>Task</td><td>The smallest unit of work, one per partition per stage</td><td>“Process Partition 7 of Stage 3”</td></tr>
</tbody>
</table>
</div><h4 id="heading-the-execution-trigger-actions-vs-transformations">The Execution Trigger: Actions vs Transformations</h4>
<p>Here's the critical distinction that determines when execution actually begins:</p>
<pre><code class="lang-python">df1 = spark.read.parquet(<span class="hljs-string">"data.parquet"</span>)
df2 = df1.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">25</span>)
df3 = df2.groupBy(<span class="hljs-string">"city"</span>).count()
</code></pre>
<p>Nothing executes yet! Spark just builds up the logical plan, adding each transformation as a node in the plan tree. No data is read, no filters run, no shuffles happen.</p>
<h4 id="heading-actions-trigger-execution">Actions Trigger Execution</h4>
<p>Spark transformations are lazy. When a sequence of DataFrame operations is defined, a logical plan is created, but no computation takes place. It’s only when Spark encounters an action, an operation that needs a result to be returned to the driver or written out, that execution takes place.</p>
<p>For example:</p>
<pre><code class="lang-python">result = df3.collect()
</code></pre>
<p>At this stage, Spark finalizes the logical plan, applies optimizations, creates a physical plan, and executes the job. Until Spark is asked to <strong>act</strong>, through an action such as collect(), count(), or a write, it’s just describing what it needs to do – but it’s not actually doing it.</p>
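<p>The split between recording a plan and running it can be sketched in a few lines of plain Python. The <code>LazyFrame</code> class below is a hypothetical toy, not a Spark API. Its transformations only append steps to a list, and <code>collect()</code> is the action that finally touches the rows.</p>

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)   # the "plan": steps recorded but not yet applied

    def filter(self, predicate):
        # Transformation: return a new object with one more recorded step.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # Transformation: also just recorded, nothing is computed here.
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def plan(self):
        return [name for name, _ in self._steps]

    def collect(self):
        # Action: only now is any row read or transformed.
        rows = self._rows
        for name, arg in self._steps:
            if name == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


people = [{"name": "Ann", "age": 31}, {"name": "Bo", "age": 22}]
lazy = LazyFrame(people).filter(lambda r: r["age"] > 25).select("name")

print(lazy.plan())     # ['filter', 'select']
print(lazy.collect())  # [{'name': 'Ann'}]
```

<p>Spark goes one step further than this toy: because nothing runs until the action, it is free to optimize the recorded steps before executing them.</p>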
<h4 id="heading-the-complete-execution-flow">The Complete Execution Flow</h4>
<p>Execution begins when an action such as collect() is called. The driver submits the job through the SparkContext to the DAG Scheduler, which analyzes the physical plan to find the shuffle boundaries created by wide operations such as <em>groupBy</em> or <em>orderBy</em>.</p>
<p>The plan is then divided into stages, each containing only narrow operations. Each stage is sent to the Task Scheduler as a TaskSet, with one task per partition.</p>
<p>The Task Scheduler assigns tasks to executor cores, taking data locality into account. A stage starts only after the stages it depends on have completed, and the results of the final stage are returned to the driver or written to storage.</p>
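<p>The way stages fall out of shuffle boundaries can be modeled with a short function. This is a simplified sketch for a linear chain of operations. The operation names and the <code>WIDE_OPS</code> set are illustrative assumptions, and real Spark has exceptions such as broadcast joins that do not shuffle.</p>

```python
# Operations assumed to need shuffled input in this toy model
WIDE_OPS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def split_into_stages(operations):
    """Group a linear chain of operations into stages.

    A wide operation needs shuffled input, so it starts a new stage;
    narrow operations stay in the current stage.
    """
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


chain = ["read", "filter", "groupBy", "select", "orderBy"]
print(split_into_stages(chain))
# [['read', 'filter'], ['groupBy', 'select'], ['orderBy']]
```

<p>In this model, a chain with two wide operations produces three stages: the number of shuffles plus one.</p>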
<h4 id="heading-what-triggers-a-shuffle">What Triggers a Shuffle</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769457412199/308bc894-66a9-4c01-aae1-9ae42e64d32c.png" alt="Comparison of Spark shuffle behavior before and after groupBy" class="image--center mx-auto" width="1920" height="992" loading="lazy"></p>
<p>A shuffle occurs when Spark needs to redistribute data across partitions, typically because the operation requires grouping, joining, or repartitioning data in a way that can’t be done locally within existing partitions.</p>
<p>Common shuffle triggers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Why it Shuffles</strong></td></tr>
</thead>
<tbody>
<tr>
<td>groupBy(), reduceByKey()</td><td>Data with the same key must co-locate for aggregation</td></tr>
<tr>
<td>join()</td><td>Matching keys may reside in different partitions</td></tr>
<tr>
<td>orderBy() / sort()</td><td>Requires global ordering across all partitions</td></tr>
<tr>
<td>distinct()</td><td>Needs comparison of all values across partitions</td></tr>
<tr>
<td>repartition(n)</td><td>Explicit redistribution to a new number of partitions</td></tr>
</tbody>
</table>
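</div><pre><code class="lang-python"><span class="hljs-comment"># sum here is pyspark.sql.functions.sum, not Python's built-in sum</span>
df.groupBy(<span class="hljs-string">"user_id"</span>) \
  .agg(sum(<span class="hljs-string">"amount"</span>))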
</code></pre>
<p>In Stage 1 (Map), each task performs a partial aggregation on its partition and writes a shuffle file to disk. During the shuffle, each executor retrieves these files across the network such that all records with the same hash(user_id) % numPartitions are colocated.</p>
<p>In Stage 2 (Reduce), each task performs a final aggregation on its partitioned data and writes back to disk. Because Spark has tracked this process as a DAG, a failed task can re-read only the affected shuffle files instead of re-computing the entire DAG.</p>
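<p>The two stages can be imitated in plain Python to see where the data moves. This is a toy sketch: lists stand in for partitions, the function names are hypothetical, and integer user ids are routed with a simple modulo where Spark would hash the key first.</p>

```python
from collections import defaultdict

NUM_PARTITIONS = 3


def partial_sum(partition):
    """Aggregate (user_id, amount) pairs within a single partition."""
    totals = defaultdict(int)
    for user_id, amount in partition:
        totals[user_id] += amount
    return dict(totals)


def shuffle(map_outputs, num_partitions):
    """Route each partial total to the reducer chosen by user_id % num_partitions."""
    reducers = [[] for _ in range(num_partitions)]
    for output in map_outputs:
        for user_id, subtotal in output.items():
            reducers[user_id % num_partitions].append((user_id, subtotal))
    return reducers


input_partitions = [
    [(1, 10), (2, 5), (1, 7)],
    [(2, 1), (3, 4)],
    [(1, 3), (3, 6), (4, 2)],
]

# Stage 1 (map): partial aggregation inside each partition
map_outputs = [partial_sum(p) for p in input_partitions]

# Shuffle: all partial totals for the same user land on the same reducer
reduce_inputs = shuffle(map_outputs, NUM_PARTITIONS)

# Stage 2 (reduce): final aggregation per reducer partition
final = [partial_sum(p) for p in reduce_inputs]
print(final)  # [{3: 10}, {1: 20, 4: 2}, {2: 6}]
```

<p>Note that the map-side partial aggregation shrinks the data before it moves: user 1's two rows in the first partition cross the "network" as a single subtotal.</p>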
<p>As a rough rule of thumb, a healthy job with logic this simple has around 2-6 stages. Seeing 20+ stages for it usually means unnecessary shuffles or bad partitioning.</p>
<h4 id="heading-why-shuffles-create-stage-boundaries">Why Shuffles Create Stage Boundaries</h4>
<p>Shuffles force data to move across the network between executors. Spark cannot continue processing until:</p>
<ul>
<li><p>All tasks in the current stage write their shuffle output to disk</p>
</li>
<li><p>The shuffle data is available for the next stage to read over the network</p>
</li>
</ul>
<p>This dependency creates a natural boundary – so a new stage begins after every shuffle. The DAG Scheduler uses these boundaries to determine where stages must wait for previous stages to complete.</p>
<h4 id="heading-common-performance-bottlenecks">Common Performance Bottlenecks</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Bottleneck Type</strong></td><td><strong>Symptom</strong></td><td><strong>Solution</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Data skew</td><td>Few tasks run much longer</td><td>Use salting, split hot keys, or AQE skew join</td></tr>
<tr>
<td>Small files</td><td>Too many tasks, high overhead</td><td>Coalesce or repartition after read</td></tr>
<tr>
<td>Large shuffle</td><td>High network I/O, spill to disk</td><td>Filter early, broadcast small tables, reduce cardinality</td></tr>
<tr>
<td>Unnecessary stages</td><td>Extra Exchange nodes in plan</td><td>Combine operations, remove redundant repartitions</td></tr>
<tr>
<td>Inefficient file formats</td><td>Slow reads, no predicate pushdown</td><td>Use Parquet or ORC with partitioning</td></tr>
<tr>
<td>Complex data types</td><td>Serialization overhead, large objects</td><td>Use simple types, cache in serialized form</td></tr>
</tbody>
</table>
</div><p>Let’s ground this with a small but realistic pattern using the same employees DataFrame. <strong>Goal:</strong> average salary per department and country, only for employees older than 30.</p>
<p>Naïve approach:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, when, avg

df_dept_country = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result = (
    df.withColumn(
        <span class="hljs-string">"age_group"</span>,
        when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>, <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>)
    )
    <span class="hljs-comment"># Join on both shared columns so "country" is not duplicated (and ambiguous) afterwards</span>
    .join(df_dept_country, [<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>], <span class="hljs-string">"inner"</span>)
    <span class="hljs-comment"># The age filter arrives late, after the derived column and the join</span>
    .filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">30</span>)
    .groupBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
    .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>This looks harmless, but:</p>
<ul>
<li><p>The join with df_dept_country introduces a wide dependency → shuffle #1.</p>
</li>
<li><p>The groupBy("department", "country") introduces another wide dependency → shuffle #2.</p>
</li>
</ul>
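<p>So we have two shuffles for what should be a simple aggregation. If you run explain() on df_result, each Exchange node in the physical plan marks a shuffle and a stage boundary. The distinct() used to build df_dept_country adds an Exchange of its own, so the exact count you see can vary with Spark version and settings.</p>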
<h4 id="heading-optimized-approach">Optimized Approach</h4>
<p>We can do better by filtering early, broadcasting the small dimension (df_dept_country), and keeping only one global shuffle for aggregation.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> avg, broadcast, col

df_dept_country = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result_optimized = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">30</span>)
        <span class="hljs-comment"># Join on both shared columns so "country" is not duplicated (and ambiguous) afterwards</span>
        .join(broadcast(df_dept_country), [<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>], <span class="hljs-string">"inner"</span>)
        .groupBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
        .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>What changed:</p>
<ul>
<li><p>filter(col("age") &gt; 30) is narrow and runs before any shuffle.</p>
</li>
<li><p>broadcast(df_dept_country) avoids a shuffle for the join.</p>
</li>
<li><p>Only the groupBy("department", "country") causes a single shuffle.</p>
</li>
</ul>
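<p>Now explain() shows a BroadcastExchange where the join's shuffle Exchange used to be, leaving the groupBy as the only shuffle in the main pipeline.</p>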
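<p>Why a broadcast join needs no shuffle is easiest to see in a toy model. In the sketch below, the small table is turned into a lookup dictionary that every partition of the large table probes locally. The function name, the sample rows, and the <code>floor</code> column are all hypothetical.</p>

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each large partition locally against a copy of the small side."""
    # Build the lookup once; in Spark, a copy of it would be sent to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:      # rows of the large side never change partition
        joined = []
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


employees = [
    [{"name": "John", "department": "Engineering"}],
    [{"name": "Alice", "department": "Sales"}, {"name": "David", "department": "HR"}],
]
departments = [
    {"department": "Engineering", "floor": 3},
    {"department": "Sales", "floor": 1},
]

result = broadcast_hash_join(employees, departments, "department")
print([len(p) for p in result])  # [1, 1]
```

<p>The large side keeps its original partitioning throughout, which is exactly the data movement a shuffle join would have required. The trade-off is that the small side must fit in memory on every executor.</p>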
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Version</strong></td><td><strong>Shuffles</strong></td><td><strong>Stages</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Naïve</td><td>2</td><td>~4 (2 map + 2 reduce)</td><td>Join shuffle + groupBy shuffle = double overhead</td></tr>
<tr>
<td>Optimized</td><td>1</td><td>~2 (1 map + 1 reduce)</td><td>Broadcast join avoids shuffle. Only groupBy shuffles</td></tr>
</tbody>
</table>
</div><h2 id="heading-chapter-3-reading-and-debugging-plans-like-a-pro">Chapter 3: Reading and Debugging Plans Like a Pro</h2>
<p>As explained in Chapter 1, Spark executes transformations based on three levels: the logical plan, the optimized logical plan (Catalyst), and the physical plan. This chapter will expand on this explanation and concentrate on the impact of the logical plan on shuffle and execution performance.</p>
<p>By now, you understand how Spark builds and <em>executes</em> plans. But reading those plans and instantly spotting inefficiencies is the real superpower of a performance-focused data engineer.</p>
<p>Spark’s explain() output isn’t random jargon. It’s a precise log of Spark’s thought process. Once you learn to read it, every optimization becomes obvious.</p>
<h3 id="heading-three-layers-in-spark"><strong>Three Layers in Spark</strong></h3>
<p>As we talked about above, df.explain(True) prints four sections. For day-to-day reading, it helps to group them into three key views, treating the analyzed and optimized logical plans together. Let’s review them now:</p>
<ol>
<li><p>Parsed Logical Plan: The raw intent Spark inferred from your code. It may include unresolved column names or expressions.</p>
</li>
<li><p>Analyzed / Optimized Logical Plan: The plan after Spark resolves column references and applies Catalyst optimizations: constant folding, predicate pushdown, column pruning, and plan rearrangements.</p>
</li>
<li><p>Physical Plan: What your executors actually run: joins, shuffles, exchanges, scans, and code-generated operators.</p>
</li>
</ol>
<p>Each stage narrows the gap between what you <em>asked</em> Spark to do and what Spark decides to do.</p>
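<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> avg, col

<span class="hljs-comment"># Parentheses let the chained calls span multiple lines.</span>
df_avg = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">30</span>)
      .groupBy(<span class="hljs-string">"department"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)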

df_avg.explain(<span class="hljs-literal">True</span>)
</code></pre>
<p><strong>1. Parsed Logical Plan</strong></p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Aggregate ['</span>department], [<span class="hljs-string">'department, '</span>avg(<span class="hljs-string">'salary) AS avg_salary#8]
+- Filter (age#5L &gt; cast(30 as bigint))
   +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false</span>
</code></pre>
<p>How to read this:</p>
<ul>
<li><p>Bottom → data source (LogicalRDD).</p>
</li>
<li><p>Middle → Filter: Spark hasn’t yet optimized column references.</p>
</li>
<li><p>Top → Aggregate: high-level grouping intent.</p>
</li>
</ul>
<p>At this stage, the plan may include unresolved symbols (like 'department or 'avg('salary)), meaning Spark hasn’t yet validated column existence or data types.</p>
<p><strong>2. Optimized Logical Plan</strong></p>
<pre><code class="lang-python">
== Optimized Logical Plan ==
Aggregate [department<span class="hljs-comment">#3], [department#3, avg(salary#4L) AS avg_salary#8]</span>
+- Project [department<span class="hljs-comment">#3, salary#4L]</span>
   +- Filter (isnotnull(age<span class="hljs-comment">#5L) AND (age#5L &gt; 30))</span>
      +- LogicalRDD [id<span class="hljs-comment">#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false</span>
</code></pre>
<p>Here, Catalyst has done its job:</p>
<ul>
<li><p>Column IDs (#3, #4L) are resolved.</p>
</li>
<li><p>Unused columns are pruned – no need to carry them forward.</p>
</li>
<li><p>The plan now accurately reflects Spark’s optimized logical intent.</p>
</li>
</ul>
<p>If you ever wonder whether Spark pruned columns or pushed filters, this is the section to check.</p>
<p><strong>3. Physical Plan</strong></p>
<pre><code class="lang-python">== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department<span class="hljs-comment">#3], functions=[avg(salary#4L)], output=[department#3, avg_salary#8])</span>
   +- Exchange hashpartitioning(department<span class="hljs-comment">#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]</span>
      +- HashAggregate(keys=[department<span class="hljs-comment">#3], functions=[partial_avg(salary#4L)], output=[department#3, sum#20, count#21L])</span>
         +- Project [department<span class="hljs-comment">#3, salary#4L]</span>
            +- Filter (isnotnull(age<span class="hljs-comment">#5L) AND (age#5L &gt; 30))</span>
               +- Scan ExistingRDD[id<span class="hljs-comment">#0L,firstname#1,lastname#2,department#3,salary#4L,age#5L,hire_date#6,country#7]</span>
</code></pre>
<p><strong>Breakdown</strong></p>
<ul>
<li><p>Scan ExistingRDD → Spark reading from the in-memory DataFrame.</p>
</li>
<li><p>Filter → narrow transformation, no shuffle.</p>
</li>
<li><p>HashAggregate → partial aggregation per partition.</p>
</li>
<li><p>Exchange → wide dependency: data is shuffled by department.</p>
</li>
<li><p>Top HashAggregate → final aggregation after shuffle.</p>
</li>
</ul>
<p>This structure – partial agg → shuffle → final agg – is Spark’s default two-phase aggregation pattern.</p>
<h4 id="heading-recognizing-common-nodes">Recognizing Common Nodes</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Node / Operator</strong></td><td><strong>Meaning</strong></td><td><strong>Optimization Hint</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Project</td><td>Column selection or computed fields</td><td>Combine multiple withColumn() into one select()</td></tr>
<tr>
<td>Filter</td><td>Predicate on rows</td><td>Push filters as low as possible in the plan</td></tr>
<tr>
<td>Join</td><td>Combine two DataFrames</td><td>Broadcast smaller side if &lt; 10 MB</td></tr>
<tr>
<td>Aggregate</td><td>GroupBy, sum, avg, count</td><td>Filter before aggregating to reduce cardinality</td></tr>
<tr>
<td>Exchange</td><td>Shuffle / data redistribution</td><td>Minimize by filtering early, using broadcast join</td></tr>
<tr>
<td>Sort</td><td>OrderBy, sort</td><td>Avoid global sorts; use within partitions if possible</td></tr>
<tr>
<td>Window</td><td>Windowed analytics (row_number, rank)</td><td>Partition on selective keys to reduce shuffle</td></tr>
</tbody>
</table>
</div><p>Repeated invocations of withColumn stack multiple Project nodes, which increases the plan depth. Instead, combine these invocations using select.</p>
<p>Multiple Exchange nodes imply repeated data shuffles. You can eliminate these by broadcasting the data or filtering.</p>
<p>Multiple scans of the same table within a single operation imply that some caching of strategic intermediates is lacking.</p>
<p>And frequent SortMergeJoin operations imply that Spark is unnecessarily sorting and shuffling the data. You can eliminate these by broadcasting the smaller dataframe or bucketing.</p>
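<p>As a rough illustration of the broadcast fix, the sketch below joins two small, hypothetical tables with and without a broadcast() hint and captures each plan as text. It assumes a local Spark session and turns off automatic broadcasting so that the hint is the only difference between the two plans. The table contents and the plan_of helper are made up for this example.</p>
<pre><code class="lang-python">import contextlib
import io

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[1]").appName("BroadcastSketch").getOrCreate()

# Disable automatic broadcasting so only the explicit hint can trigger it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Hypothetical data: an employee table and a tiny department lookup table.
employees = spark.createDataFrame(
    [(1, "Engineering"), (2, "Sales"), (3, "HR")], ["id", "department"])
departments = spark.createDataFrame(
    [("Engineering", "ENG"), ("Sales", "SAL"), ("HR", "HRS")], ["department", "dept_code"])


def plan_of(df):
    """Capture the text that df.explain() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


plain_plan = plan_of(employees.join(departments, "department"))
hinted_plan = plan_of(employees.join(broadcast(departments), "department"))

# Report which join strategy appears in each captured plan.
print("plain uses SortMergeJoin:", "SortMergeJoin" in plain_plan)
print("hinted uses BroadcastHashJoin:", "BroadcastHashJoin" in hinted_plan)

spark.stop()
</code></pre>
<p>Comparing the two captured plans side by side is a quick way to confirm that a hint actually changed the join strategy rather than assuming it did.</p>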
<h4 id="heading-debugging-strategy-read-plans-from-top-to-bottom">Debugging Strategy: Read Plans from Top to Bottom</h4>
<p>Remember: Spark <em>executes</em> plans from bottom up (from data source to final result). But when you're debugging, you read from the top down (from the output schema back to the root cause). This reversal is intentional: you start with what's wrong at the output level, then trace backward through the plan to find where the inefficiency was introduced.</p>
<p>When debugging a slow job:</p>
<ul>
<li><p>Start at the top: Identify output schema and major operators (HashAggregate, Join, and so on).</p>
</li>
<li><p>Scroll for Exchanges: Count them. Each one marks a stage boundary. Ask “Why do I need this shuffle?”</p>
</li>
<li><p>Trace backward: See if filters or projections appear below or above joins.</p>
</li>
<li><p>Look for duplication: Same scan twice? Missing cache? Re-derived columns?</p>
</li>
<li><p>Check join strategy: If it’s SortMergeJoin but one table is small, force a broadcast().</p>
</li>
<li><p>Re-run explain after optimization: You should literally see the extra nodes disappear.</p>
</li>
</ul>
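<p>The "count the Exchanges" step can be sketched as a small text helper. The example below is a minimal sketch, not part of Spark: count_operators is a hypothetical function, and the plan text is an abbreviated copy of the physical plan shown earlier in this chapter. With a real DataFrame, one option is to capture what explain() prints using contextlib.redirect_stdout and pass that string in.</p>
<pre><code class="lang-python">import re
from collections import Counter

# Abbreviated copy of the physical plan shown earlier in this chapter.
PLAN_TEXT = """\
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)])
   +- Exchange hashpartitioning(department#3, 200)
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)])
         +- Project [department#3, salary#4L]
            +- Filter isnotnull(age#5L)
               +- Scan ExistingRDD[id#0L,firstname#1]
"""


def count_operators(plan_text):
    """Count the leading operator name on each line of a plan."""
    counts = Counter()
    for line in plan_text.splitlines():
        # Skip tree-drawing prefixes such as "+- " and codegen markers like "*(1) ".
        match = re.match(r"[\s:+\-|]*(?:\*\(\d+\)\s*)?([A-Za-z]+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_operators(PLAN_TEXT)
print("Exchange nodes:", counts["Exchange"])
print("All operators:", dict(counts))
</code></pre>
<p>Running the same count before and after a change gives a quick numeric check that a shuffle really disappeared from the plan.</p>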
<h4 id="heading-catalyst-optimizer-in-action">Catalyst Optimizer in Action</h4>
<p>Catalyst applies dozens of rules automatically. Knowing a few helps you interpret what changed:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Optimization Rule</strong></td><td><strong>Example Transformation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Predicate Pushdown</td><td>Moves filters below joins/scans</td></tr>
<tr>
<td>Constant Folding</td><td>Evaluates constant expressions at planning time (for example, 1 + 2 becomes 3)</td></tr>
<tr>
<td>Column Pruning</td><td>Drops unused columns early</td></tr>
<tr>
<td>Combine Filters</td><td>Merges consecutive filters into one</td></tr>
<tr>
<td>Simplify Casts</td><td>Removes redundant type casts</td></tr>
<tr>
<td>Reorder Joins / Join Reordering</td><td>Changes join order for cheaper plan</td></tr>
</tbody>
</table>
</div><p>Putting it all together, every plan tells a story:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769458525411/64fa30a4-b16e-4aed-8c04-d12b476d9ae6.png" alt="Spark Plans and Stages" class="image--center mx-auto" width="1920" height="404" loading="lazy"></p>
<p>As you progress through the practical scenarios in Chapter 4, read every plan before and after. Your goal isn't memorization – it's intuition.</p>
<h2 id="heading-chapter-4-writing-efficient-transformations">Chapter 4: Writing Efficient Transformations</h2>
<p>Every Spark job tells a story, not in code, but in plans. By now, you've seen how Spark interprets transformations (Chapter 1), how it executes them through stages and tasks (Chapter 2), and how to read plans like a detective (Chapter 3). Now comes the part where you apply that knowledge: writing transformations that yield efficient logical plans.</p>
<p>This chapter is the heart of the handbook. It's where we move from understanding Spark's mind to writing code that speaks its language fluently.</p>
<h3 id="heading-why-transformations-matter">Why Transformations Matter</h3>
<p>In PySpark, most performance issues don’t start in clusters or configurations. They start in transformations: the way we chain, filter, rename, or join data. Every transformation reshapes the logical plan, influencing how Spark optimizes, when it shuffles, and whether the final DAG is streamlined or tangled.</p>
<p>A good transformation sequence:</p>
<ul>
<li><p>Keeps plans shallow, not nested.</p>
</li>
<li><p>Applies filters early, not after computation.</p>
</li>
<li><p>Reduces data movement, not just data size.</p>
</li>
<li><p>Lets Catalyst and AQE optimize freely, without user-induced constraints.</p>
</li>
</ul>
<p>A bad one can double runtime, and you won't see it in your code, only in your plan.</p>
<h3 id="heading-the-goal-of-this-chapter">The Goal of this Chapter</h3>
<p>We’ll explore a series of real-world optimization scenarios, drawn from production ETL and analytical pipelines, each showing how a small change in code can completely reshape the logical plan and execution behavior.</p>
<p>Each scenario is practical and short, following a consistent structure. By the end of this chapter, you’ll be able to <em>see</em> optimization opportunities the moment you write code, because you’ll know exactly how they alter the logical plan beneath.</p>
<h3 id="heading-before-you-dive-in">Before You Dive In:</h3>
<p>Open a Spark shell or notebook. Load your familiar employees DataFrame. Run every example, and compare the explain("formatted") output before and after the fix. Because in this chapter, performance isn’t about more theory, it’s about seeing the difference in the plan and feeling the difference in execution time.</p>
<h3 id="heading-scenario-1-rename-in-one-pass-withcolumnrenamed-vs-todf">Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()</h3>
<p>If you’ve worked with PySpark DataFrames, you’ve probably had to rename columns, either by calling withColumnRenamed() repeatedly or by using toDF() in one shot.</p>
<p>At first glance, both approaches produce identical results: the columns have the new names you wanted. But beneath the surface, Spark treats them very differently – and that difference shows up directly in your logical plan.</p>
<pre><code class="lang-python">df_renamed = (df.withColumnRenamed(<span class="hljs-string">"id"</span>, <span class="hljs-string">"emp_id"</span>)
    .withColumnRenamed(<span class="hljs-string">"firstname"</span>, <span class="hljs-string">"first_name"</span>)
    .withColumnRenamed(<span class="hljs-string">"lastname"</span>, <span class="hljs-string">"last_name"</span>)
    .withColumnRenamed(<span class="hljs-string">"department"</span>, <span class="hljs-string">"dept"</span>)
    .withColumnRenamed(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"base_salary"</span>)
    .withColumnRenamed(<span class="hljs-string">"age"</span>, <span class="hljs-string">"age_years"</span>)
    .withColumnRenamed(<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"hired_on"</span>)
    .withColumnRenamed(<span class="hljs-string">"country"</span>, <span class="hljs-string">"country_code"</span>)
)
</code></pre>
<p>This is simple and readable. But Spark builds the plan step by step, adding one Project node for every rename. Each Project node copies all existing columns, plus the newly renamed one. In large schemas (hundreds of columns), this silently bloats the plan.</p>
<h4 id="heading-logical-plan-impact">Logical Plan Impact:</h4>
<pre><code class="lang-python">Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hire_date, country]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age, hire_date, country]

└─ Project [emp_id, first_name, last_name, dept, salary, age, hire_date, country]

└─ Project [emp_id, first_name, last_name, department, salary, age, hire_date, country]

└─ Project [emp_id, first_name, lastname, department, salary, age, hire_date, country]

└─ Project [emp_id, firstname, lastname, department, salary, age, hire_date, country]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Each rename adds a new Project layer, deepening the plan that Catalyst has to resolve before it can optimize. You can see this by running <em>df_renamed.explain(True)</em>.</p>
<h4 id="heading-the-better-approach-rename-once-with-todf">The Better Approach: Rename Once with toDF()</h4>
<p>Instead of chaining multiple renames, rename all columns in a single pass:</p>
<pre><code class="lang-python">new_cols = [<span class="hljs-string">"emp_id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"dept"</span>,
            <span class="hljs-string">"base_salary"</span>, <span class="hljs-string">"age_years"</span>, <span class="hljs-string">"hired_on"</span>, <span class="hljs-string">"country_code"</span>]

df_renamed = df.toDF(*new_cols)
</code></pre>
<h4 id="heading-logical-plan-impact-1">Logical Plan Impact:</h4>
<pre><code class="lang-python">Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now there’s just one Project node, which means one projection over the source data. This gives us a flatter, more efficient plan.</p>
<h4 id="heading-under-the-hood-what-spark-actually-does">Under the Hood: What Spark Actually Does</h4>
<p>Every time you call withColumnRenamed(), Spark rewrites the entire projection list. Catalyst treats the rename as a full column re-selection from the previous node, not as a light-weight alias update. When you chain several renames, Catalyst duplicates internal column metadata for each intermediate step.</p>
<p>By contrast, toDF() rebases the schema in a single step. Catalyst interprets it as a single schema rebinding, so no redundant metadata trees are created.</p>
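<p>Because toDF() expects a name for every column in order, typing the full list by hand gets error-prone on wide tables. One way around that, sketched below, is to build the list from a rename mapping. The renamed_columns helper is hypothetical (not a PySpark API), and the column names are taken from the chained example above.</p>
<pre><code class="lang-python">def renamed_columns(columns, mapping):
    """Return the column names in their original order, renamed where mapped."""
    unknown = set(mapping) - set(columns)
    if unknown:
        # Fail loudly on typos instead of silently skipping a rename.
        raise KeyError(f"columns not found: {sorted(unknown)}")
    return [mapping.get(name, name) for name in columns]


# Stand-in for df.columns so the sketch runs without a Spark session.
columns = ["id", "firstname", "lastname", "department",
           "salary", "age", "hire_date", "country"]

mapping = {
    "id": "emp_id",
    "firstname": "first_name",
    "lastname": "last_name",
    "department": "dept",
    "salary": "base_salary",
    "age": "age_years",
    "hire_date": "hired_on",
    "country": "country_code",
}

new_cols = renamed_columns(columns, mapping)
print(new_cols)

# With a real DataFrame, the same list feeds a single toDF() call:
# df_renamed = df.toDF(*renamed_columns(df.columns, mapping))
</code></pre>
<p>Columns that are missing from the mapping keep their original names, so the same helper works when only a few columns need renaming.</p>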
<h4 id="heading-real-world-timing-glue-job-benchmark">Real-World Timing: Glue Job Benchmark</h4>
<p>To see if chained withColumnRenamed calls add real overhead, here's a simple timing test performed on a Glue job using a DataFrame with 1M rows. First using withColumnRenamed:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

multiplied_data = [(i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],  <span class="hljs-comment"># department</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],  <span class="hljs-comment"># salary</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],  <span class="hljs-comment"># age</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],  <span class="hljs-comment"># hire_date</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>])  <span class="hljs-comment"># country</span>
                   <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)]

df = spark.createDataFrame(multiplied_data,
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df1 = (df
       .withColumnRenamed(<span class="hljs-string">"firstname"</span>, <span class="hljs-string">"first_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"lastname"</span>, <span class="hljs-string">"last_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"department"</span>, <span class="hljs-string">"dept_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"annual_salary"</span>)
       .withColumnRenamed(<span class="hljs-string">"age"</span>, <span class="hljs-string">"emp_age"</span>)
       .withColumnRenamed(<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"hired_on"</span>)
       .withColumnRenamed(<span class="hljs-string">"country"</span>, <span class="hljs-string">"work_country"</span>))

print(<span class="hljs-string">"withColumnRenamed Count:"</span>, df1.count())
print(<span class="hljs-string">"withColumnRenamed time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)
</code></pre>
<p>Using toDF:</p>
<pre><code class="lang-python">start = time.time()
df2 = df.toDF(<span class="hljs-string">"id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"dept_name"</span>, <span class="hljs-string">"annual_salary"</span>, <span class="hljs-string">"emp_age"</span>, <span class="hljs-string">"hired_on"</span>, <span class="hljs-string">"work_country"</span>)
print(<span class="hljs-string">"toDF Count:"</span>, df2.count())
print(<span class="hljs-string">"toDF time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Number of Project Nodes</strong></td><td><strong>Glue Execution Time (1M rows)</strong></td><td><strong>Plan Complexity</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Chained withColumnRenamed()</td><td>8 nodes</td><td>~12 seconds</td><td>Deep, nested</td></tr>
<tr>
<td>Single toDF()</td><td>1 node</td><td>~8 seconds</td><td>Flat, simple</td></tr>
</tbody>
</table>
</div><p>The difference matters more at larger sizes or in complex pipelines: on managed runtimes such as AWS Glue, where planning overhead is noticeable, or when tens of millions of rows are involved. Each additional Project adds column resolution, metadata work, and plan depth. Because each chained rename is analyzed as its own projection, renaming all columns in one go with toDF() yields a flatter plan: one rename, one projection, and faster execution.</p>
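<p>Single-run wall-clock timings can be noisy, since the first action in a session also pays for start-up work. If you want to reproduce a comparison like this one, a small helper that discards a warm-up run and reports the median of several runs may give steadier numbers. The sketch below is one possible approach, with time_action as a hypothetical helper and a plain Python workload standing in for a DataFrame action.</p>
<pre><code class="lang-python">import statistics
import time


def time_action(action, warmup=1, repeats=3):
    """Run action several times and return the median wall-clock seconds."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without a cluster.
elapsed = time_action(lambda: sum(range(100_000)))
print("median seconds:", round(elapsed, 4))

# With the DataFrames from the benchmark, the calls might look like:
# time_action(lambda: df1.count())
# time_action(lambda: df2.count())
</code></pre>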
<h3 id="heading-scenario-2-reusing-expressions">Scenario 2: Reusing Expressions</h3>
<p>Sometimes Spark jobs run slower, not because of shuffles or joins, but because the same computation is performed repeatedly within the logical plan. Every time you repeat an expression, say, col("salary") * 0.1 in multiple places, Spark treats it as a <em>new</em> derived column, expanding the logical plan and forcing redundant work.</p>
<h4 id="heading-the-problem-repeated-expressions">The Problem: Repeated Expressions</h4>
<p>Let’s say we’re calculating bonus and total compensation for employees:</p>
<pre><code class="lang-python">df_expr = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>)
      .withColumn(<span class="hljs-string">"total_comp"</span>, col(<span class="hljs-string">"salary"</span>) + (col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>))
)
</code></pre>
<p>At first glance, it’s simple enough. But Spark’s optimizer doesn’t automatically know that the (col("salary") * 0.10) in the second column is identical to the one computed in the first. Both get evaluated separately in the logical plan.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * <span class="hljs-number">0.10</span>) AS bonus,

(salary + (salary * <span class="hljs-number">0.10</span>)) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>While this looks compact, Spark must compute (salary * 0.10) twice: once for bonus, and again inside total_comp. For a large dataset (say 100M rows), that’s two full column evaluations. The waste compounds when your expression is complex. Imagine parsing JSON, applying UDFs, or running date arithmetic multiple times.</p>
<h4 id="heading-the-better-approach-compute-once-reuse-everywhere">The Better Approach: Compute Once, Reuse Everywhere</h4>
<p>Compute the expression once, store it as a column, and reference it later:</p>
<pre><code class="lang-python">df_expr = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>)
      .withColumn(<span class="hljs-string">"total_comp"</span>, col(<span class="hljs-string">"salary"</span>) + col(<span class="hljs-string">"bonus"</span>))
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * <span class="hljs-number">0.10</span>) AS bonus,

(salary + bonus) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now Spark calculates (salary * 0.10) once, stores it in the bonus column, and reuses that column when computing total_comp. This single change cuts CPU cost and memory usage.</p>
<h4 id="heading-under-the-hood-why-repetition-hurts">Under the Hood: Why Repetition Hurts</h4>
<p>Spark’s Catalyst optimizer doesn’t automatically factor out repeated expressions across different columns. Each withColumn() creates a new Project node with its own expression tree. If multiple nodes reuse the same arithmetic or function, Catalyst re-evaluates them independently.</p>
<p>On small DataFrames, this cost is invisible. On wide, computation-heavy jobs (think feature engineering pipelines), it can add hundreds of milliseconds per task.</p>
<p>Each redundant expression increases:</p>
<ul>
<li><p>Catalyst’s internal expression resolution time</p>
</li>
<li><p>The size of generated Java code in WholeStageCodegen</p>
</li>
<li><p>CPU cycles per row, since Spark cannot share intermediate results between columns in the same node</p>
</li>
</ul>
<h4 id="heading-real-world-benchmark-aws-glue">Real-World Benchmark: AWS Glue</h4>
<p>We tested this pattern on AWS Glue (Spark 3.3) with 10 million rows and a simulated expensive computation, using a dataset similar to the one from Scenario 1.</p>
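<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, exp, log, sqrt

df = spark.createDataFrame(multiplied_data,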
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

expr = sqrt(exp(log(col(<span class="hljs-string">"salary"</span>) + <span class="hljs-number">1</span>)))

start = time.time()

df_repeated = (
    df.withColumn(<span class="hljs-string">"metric_a"</span>, expr)
      .withColumn(<span class="hljs-string">"metric_b"</span>, expr * <span class="hljs-number">2</span>)
      .withColumn(<span class="hljs-string">"metric_c"</span>, expr / <span class="hljs-number">10</span>)
)

df_repeated.count()
time_repeated = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()

df_reused = (
    df.withColumn(<span class="hljs-string">"metric"</span>, expr)
      .withColumn(<span class="hljs-string">"metric_a"</span>, col(<span class="hljs-string">"metric"</span>))
      .withColumn(<span class="hljs-string">"metric_b"</span>, col(<span class="hljs-string">"metric"</span>) * <span class="hljs-number">2</span>)
      .withColumn(<span class="hljs-string">"metric_c"</span>, col(<span class="hljs-string">"metric"</span>) / <span class="hljs-number">10</span>)
)

df_reused.count()

print(<span class="hljs-string">"Repeated expr time:"</span>, time_repeated, <span class="hljs-string">"seconds"</span>)
print(<span class="hljs-string">"Reused expr time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Project Nodes</strong></td><td><strong>Execution Time (10M rows)</strong></td><td><strong>Expression Evaluations</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Repeated expression</td><td>Multiple (nested)</td><td>~18 seconds</td><td>3x per row</td></tr>
<tr>
<td>Compute once, reuse</td><td>Single</td><td>~11 seconds</td><td>1x per row</td></tr>
</tbody>
</table>
</div><p>The performance gap widens further with genuinely expensive expressions (like regex extraction, JSON parsing, or UDFs).</p>
<h4 id="heading-physical-plan-implication">Physical Plan Implication</h4>
<p>In the physical plan, repeated expressions expand into multiple Java blocks within the same WholeStageCodegen node:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Project [sqrt(exp(log(salary + <span class="hljs-number">1</span>))) AS metric_a,

(sqrt(exp(log(salary + <span class="hljs-number">1</span>))) * <span class="hljs-number">2</span>) AS metric_b,

(sqrt(exp(log(salary + <span class="hljs-number">1</span>))) / <span class="hljs-number">10</span>) AS metric_c, ...]
</code></pre>
<p>Spark literally embeds three copies of the same logic.</p>
<p>Each is JIT-compiled separately, leading to:</p>
<ul>
<li><p>Larger generated Java classes</p>
</li>
<li><p>Higher CPU utilization</p>
</li>
<li><p>Longer code-generation time before tasks even start</p>
</li>
</ul>
<p>When reusing a column, Spark generates one expression and references it by name, dramatically shrinking the codegen footprint. If you have complex transformations (nested when, UDFs, regex extractions, and so on), compute them once and reuse them with col("alias"). For even heavier expressions that appear across multiple pipelines, consider persisting the intermediate DataFrame:</p>
<pre><code class="lang-python">df_features = df.withColumn(<span class="hljs-string">"complex_feature"</span>, complex_logic)

df_features.cache()
</code></pre>
<p>That cache can save multiple recomputations across downstream steps.</p>
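<p>A fuller version of that pattern might look like the sketch below. It assumes a local Spark session and a tiny made-up DataFrame, and it uses a simple sqrt expression as a stand-in for complex_logic. The point is the lifecycle: cache, run the actions that share the intermediate, then release the memory with unpersist().</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sqrt

spark = SparkSession.builder.master("local[1]").appName("CacheSketch").getOrCreate()

# Hypothetical data; in practice this would be your real source DataFrame.
df = spark.createDataFrame([(1, 80000), (2, 85000), (3, 60000)], ["id", "salary"])

# Stand-in for an expensive expression (replace with your own logic).
complex_logic = sqrt(col("salary") + 1)

df_features = df.withColumn("complex_feature", complex_logic).cache()

# cache() is lazy: the first action computes the rows and stores them.
row_count = df_features.count()

# Later actions on df_features can read the cached rows instead of recomputing.
top = df_features.orderBy(col("complex_feature").desc()).first()
print(row_count, top["id"])

# Release the cached data once the downstream steps are done.
df_features.unpersist()
spark.stop()
</code></pre>
<p>Caching only pays off when the intermediate is reused by more than one action. For a single pass over the data, it adds storage overhead without saving any work.</p>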
<h3 id="heading-scenario-3-batch-column-ops">Scenario 3: Batch Column Ops</h3>
<p>Most PySpark pipelines don’t die because of one big, obvious mistake. They slow down from a thousand tiny cuts: one extra withColumn() here, another there, until the logical plan turns into a tall stack of projections.</p>
<p>On its own, withColumn() is fine. The problem is how we use it:</p>
<ul>
<li><p>10–30 chained calls in a row</p>
</li>
<li><p>Re-deriving similar expressions</p>
</li>
<li><p>Spreading logic across many tiny steps</p>
</li>
</ul>
<p>This scenario shows how batching column operations into a single select() produces a flatter, cleaner logical plan that scales better and is easier to reason about.</p>
<h4 id="heading-the-problem-chaining-withcolumn-forever">The Problem: Chaining withColumn() Forever</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper

df_transformed = (
    df.withColumn(<span class="hljs-string">"full_name"</span>, concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)))
      .withColumn(<span class="hljs-string">"is_senior"</span>, when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, lit(<span class="hljs-number">1</span>)).otherwise(lit(<span class="hljs-number">0</span>)))
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>)
      .withColumn(<span class="hljs-string">"experience_band"</span>,
                  when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
                  .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
                  .otherwise(<span class="hljs-string">"senior"</span>))
      <span class="hljs-comment"># Column has no .upper() method; use the upper() function instead.</span>
      .withColumn(<span class="hljs-string">"country_upper"</span>, upper(col(<span class="hljs-string">"country"</span>)))
)
</code></pre>
<p>It reads nicely, it runs, and everyone moves on. But under the hood, Spark builds this as multiple Project nodes, one per withColumn() call.</p>
<p><strong>Simplified Logical Plan (Chained, Conceptual):</strong></p>
<pre><code class="lang-python">Project [..., country_upper]

└─ Project [..., experience_band]

   └─ Project [..., salary_k]

      └─ Project [..., is_senior]

         └─ Project [..., full_name]

            └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Each layer re-selects all existing columns, adds one more derived column, and deepens the plan.</p>
<h4 id="heading-the-better-approach-batch-with-select">The Better Approach: Batch with select()</h4>
<p>Instead of incrementally patching the schema, build it once.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper

df_transformed = df.select(
    col(<span class="hljs-string">"id"</span>),
    col(<span class="hljs-string">"firstname"</span>),
    col(<span class="hljs-string">"lastname"</span>),
    col(<span class="hljs-string">"department"</span>),
    col(<span class="hljs-string">"salary"</span>),
    col(<span class="hljs-string">"age"</span>),
    col(<span class="hljs-string">"hire_date"</span>),
    col(<span class="hljs-string">"country"</span>),
    concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)).alias(<span class="hljs-string">"full_name"</span>),
    when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, lit(<span class="hljs-number">1</span>)).otherwise(lit(<span class="hljs-number">0</span>)).alias(<span class="hljs-string">"is_senior"</span>),
    (col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>).alias(<span class="hljs-string">"salary_k"</span>),
    when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>).alias(<span class="hljs-string">"experience_band"</span>),
    upper(col(<span class="hljs-string">"country"</span>)).alias(<span class="hljs-string">"country_upper"</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan (Batched):</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department, salary, age, hire_date, country,

         full_name, is_senior, salary_k, experience_band, country_upper]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>One Project. All derived columns <em>are</em> defined together. Flatter DAG. Cleaner plan.</p>
<h4 id="heading-under-the-hood-why-this-matters">Under the Hood: Why This Matters</h4>
<p>Each withColumn() is syntactic sugar for: “Take the previous plan, and create a new Project on top of it.” So 10 withColumn() calls = 10 projections wrapped on top of each other.</p>
<p>Catalyst can sometimes collapse adjacent Project nodes, but:</p>
<ul>
<li><p>Not always (especially when aliases shadow each other).</p>
</li>
<li><p>Not when expressions become complex or interdependent.</p>
</li>
<li><p>Not when UDFs or analysis barriers appear.</p>
</li>
</ul>
<p>Batching with select():</p>
<ul>
<li><p>Gives Catalyst a single, complete view of all expressions.</p>
</li>
<li><p>Enables more aggressive optimizations (constant folding, expression reuse, pruning).</p>
</li>
<li><p>Keeps expression trees shallower and codegen output smaller.</p>
</li>
</ul>
<p>Think of it as the difference between editing a sentence 10 times in a row and writing the final sentence once, cleanly.</p>
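<p>If listing every existing column inside select() is impractical, one alternative is DataFrame.withColumns(), available in PySpark 3.3 and later, which takes a dict of new columns and adds them in a single call. The sketch below is a minimal illustration on a tiny in-memory DataFrame. The sample rows, the is_engineer column, and the derived dict name are assumptions made for the example, not part of the benchmark.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, lit, upper, when

spark = SparkSession.builder.master("local[1]").appName("with_columns_sketch").getOrCreate()

# Tiny illustrative dataset (hypothetical rows).
df = spark.createDataFrame(
    [(1, "Ada", "Lovelace", "Engineering", 95000, "uk"),
     (2, "Alan", "Turing", "Research", 70000, "uk")],
    ["id", "firstname", "lastname", "department", "salary", "country"],
)

# All derived columns are declared in one place and added in one call.
# Expressions here can only reference columns that already exist in df.
derived = {
    "full_name": concat_ws(" ", col("firstname"), col("lastname")),
    "is_engineer": when(col("department") == "Engineering", lit(1)).otherwise(lit(0)),
    "salary_k": col("salary") / 1000.0,
    "country_upper": upper(col("country")),
}

df_batched = df.withColumns(derived)
rows = {row["id"]: row.asDict() for row in df_batched.collect()}
print(rows[1])

spark.stop()
</code></pre>
<p>Keeping the expressions in a dict also makes it easy to review every derived column in one place. If one derived column depends on another, it still needs a second step.</p>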
<h4 id="heading-real-world-example-using-the-employees-df-at-scale">Real-World Example: Using the Employees DF at Scale</h4>
<p>Chained version (many withColumn()):</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper
<span class="hljs-keyword">import</span> time

start = time.time()
df_chain = (
    df.withColumn(<span class="hljs-string">"full_name"</span>, concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)))
      .withColumn(<span class="hljs-string">"is_senior"</span>, when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>))
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>)
      .withColumn(<span class="hljs-string">"high_earner"</span>, when(col(<span class="hljs-string">"salary"</span>) &gt;= <span class="hljs-number">90000</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>))
      .withColumn(<span class="hljs-string">"experience_band"</span>,
                  when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
                  .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
                  .otherwise(<span class="hljs-string">"senior"</span>))
      .withColumn(<span class="hljs-string">"country_upper"</span>, upper(col(<span class="hljs-string">"country"</span>)))
)

df_chain.count()
time_chain = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<p>Batched version (single select()):</p>
<pre><code class="lang-python">start = time.time()
df_batch = df.select(
    <span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>,
    concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)).alias(<span class="hljs-string">"full_name"</span>),
    when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>).alias(<span class="hljs-string">"is_senior"</span>),
    (col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>).alias(<span class="hljs-string">"salary_k"</span>),
    when(col(<span class="hljs-string">"salary"</span>) &gt;= <span class="hljs-number">90000</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>).alias(<span class="hljs-string">"high_earner"</span>),
    when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>).alias(<span class="hljs-string">"experience_band"</span>),
    upper(col(<span class="hljs-string">"country"</span>)).alias(<span class="hljs-string">"country_upper"</span>)
)

df_batch.count()
time_batch = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Logical Shape</strong></td><td><strong>Glue Execution Time (1M rows)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Chained withColumn()</td><td>6 nested Projects</td><td>~14 seconds</td><td>Deep plan, more Catalyst work</td></tr>
<tr>
<td>Single select()</td><td>1 Project</td><td>~9 seconds</td><td>Flat planning, cleaner DAG</td></tr>
</tbody>
</table>
</div><p>The distinction is most evident when there are more derived columns, more complex expressions (UDFs, window functions), or when executing on managed runtimes such as AWS Glue.</p>
<p>In the chained cases, there are more Project nodes, code generation is fragmented, and expression evaluation is less amenable to global optimization.</p>
<p>In the batched cases, Spark generates a single Project node, more work is consolidated into a single WholeStageCodegen pipeline, code generation is reduced, the JVM is less stressed, and the plan is flatter and more amenable to optimization. This is not only cleaner, but it’s also faster, more reliable, and friendlier to Spark’s optimizer.</p>
<h3 id="heading-scenario-4-early-filter-vs-late-filter">Scenario 4: Early Filter vs Late Filter</h3>
<p>Many pipelines apply transformations first, adding columns, joining datasets, or calculating derived metrics, before filtering records. That order looks harmless in code but can double or triple the workload at execution.</p>
<h4 id="heading-problem-late-filtering">Problem: Late Filtering</h4>
<pre><code class="lang-python">df_late = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
      .filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)
)
</code></pre>
<p>This means Spark first computes all columns for every employee, then discards most rows.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Filter (age &gt; <span class="hljs-number">35</span>)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country,

            (salary * <span class="hljs-number">0.1</span>) AS bonus,

            (salary / <span class="hljs-number">1000</span>) AS salary_k]

   └─ LogicalRDD [...]
</code></pre>
<p>Catalyst can sometimes reorder this automatically, but when it can't (due to UDFs or complex logic), you're doing unnecessary work on data that's thrown away.</p>
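<p>One way to check whether Catalyst has moved a late filter below the projection is to print the plans and look at where the Filter node sits in the optimized section. The sketch below assumes a small in-memory DataFrame and a hypothetical helper, plan_text(), that captures what explain() prints. It filters on department rather than age purely to keep the example short.</p>
<pre><code class="lang-python">import io
from contextlib import redirect_stdout

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").appName("filter_order_sketch").getOrCreate()

# Tiny illustrative dataset (hypothetical rows).
df = spark.createDataFrame(
    [(1, "Engineering", 90000), (2, "Sales", 60000), (3, "Engineering", 120000)],
    ["id", "department", "salary"],
)


def plan_text(frame):
    """Hypothetical helper: return the text that explain() prints for a DataFrame."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        frame.explain(mode="extended")
    return buffer.getvalue()


df_late = (
    df.withColumn("bonus", col("salary") * 0.1)
      .filter(col("department") == "Engineering")
)
df_early = (
    df.filter(col("department") == "Engineering")
      .withColumn("bonus", col("salary") * 0.1)
)

late_plan = plan_text(df_late)
early_plan = plan_text(df_early)
late_count = df_late.count()
early_count = df_early.count()

# Compare the "Optimized Logical Plan" sections by eye: if the Filter sits
# below the Project in both, Catalyst reordered the late filter for you.
print(late_plan)

spark.stop()
</code></pre>
<p>If the optimized plans differ, for instance because a UDF sits between the projection and the filter, that is the signal to move the filter earlier by hand.</p>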
<h4 id="heading-better-approach-early-filtering">Better Approach: Early Filtering</h4>
<pre><code class="lang-python">df_early = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)
      .withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department, salary, age, hire_date, country,

         (salary * <span class="hljs-number">0.1</span>) AS bonus,

         (salary / <span class="hljs-number">1000</span>) AS salary_k]

└─ Filter (age &gt; <span class="hljs-number">35</span>)

   └─ LogicalRDD [...]
</code></pre>
<p>Now Spark prunes the dataset first, then applies transformations. The result: smaller intermediate data, less codegen, shorter logical plan, shorter DAG, and smaller shuffle footprint.</p>
<h4 id="heading-real-world-benchmark-aws-glue-1">Real-World Benchmark: AWS Glue</h4>
<p>Late Filtering:</p>
<pre><code class="lang-python">df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start_late = time.time()

df_late = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
      .filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)   
)

df_late.count()
time_late = round(time.time() - start_late, <span class="hljs-number">2</span>)
</code></pre>
<p>Early Filtering:</p>
<pre><code class="lang-python">start_early = time.time()

df_early = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)    
      .withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
)

df_early.count()
time_early = round(time.time() - start_early, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Late Filter Time:"</span>, time_late, <span class="hljs-string">"seconds"</span>)
print(<span class="hljs-string">"Early Filter Time:"</span>, time_early, <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Rows Processed Before Filter</strong></td><td><strong>Execution Time (approx)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Late filter</td><td>1,000,000 (all rows)</td><td>~14 seconds</td><td>Computes bonus and salary_k for all rows, then filters</td></tr>
<tr>
<td>Early filter</td><td>300,000 (filtered subset)</td><td>~9 seconds</td><td>Filters first, computes only for age &gt; 35</td></tr>
</tbody>
</table>
</div><p>The early filter approach processes significantly less data before the projection, leading to faster execution and less memory pressure.</p>
<p>Always filter as early as possible, before joins, aggregations, expensive transformations (such as UDFs or window functions), and even during file reads via Parquet/ORC pushdown, since filtering at the source touches fewer partitions and leads to faster jobs.</p>
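<p>To make the advice about filtering before joins concrete, here is a minimal sketch with two small in-memory DataFrames. The employees and departments rows, and the cost_center column, are assumptions invented for the example.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").appName("filter_before_join_sketch").getOrCreate()

# Hypothetical sample data.
employees = spark.createDataFrame(
    [(1, "Engineering", "Canada"), (2, "Sales", "Canada"), (3, "Engineering", "India")],
    ["id", "department", "country"],
)
departments = spark.createDataFrame(
    [("Engineering", "CC-100"), ("Sales", "CC-200")],
    ["department", "cost_center"],
)

# Reduce the left side first, so the join only has to match the rows we keep.
canada_only = employees.filter(col("country") == "Canada")
joined = canada_only.join(departments, on="department", how="inner")

result = sorted((row["id"], row["cost_center"]) for row in joined.collect())
print(result)

spark.stop()
</code></pre>
<p>On real data the benefit comes from shuffling and matching fewer rows. Here the filter simply drops the one employee outside Canada before the join runs.</p>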
<h3 id="heading-scenario-5-column-pruning">Scenario 5: Column Pruning</h3>
<p>When working with Spark DataFrames, convenience often wins over correctness, and nothing feels more convenient than select("*"). It’s quick, flexible, and perfect for exploration.</p>
<p>But in production pipelines, that little star silently costs CPU, memory, network bandwidth, and runtime efficiency. Every time you write select("*"), Spark expands it into <em>every</em> column from your schema, even if you’re using just one or two later.</p>
<p>Those extra attributes flow through every stage of the plan, from filters and joins to aggregations and shuffles. The result: inflated logical plans, bigger shuffle files, and slower queries.</p>
<h4 id="heading-the-problem-the-lazy-star">The Problem: “The Lazy Star”</h4>
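<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> avg, col

df_star = (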
    df.select(<span class="hljs-string">"*"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>At first glance, this seems harmless. But the query needs only three columns (department for the filter, country and salary for the aggregation), while Spark carries all eight (id, firstname, lastname, department, salary, age, hire_date, country) through every transformation.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every node in this tree carries all columns. Catalyst can’t prune them because you explicitly asked for "*". The excess attributes are serialized, shuffled, and deserialized across the cluster, even though they serve no purpose in the final result.</p>
<h4 id="heading-the-fix-select-only-what-you-need">The Fix: Select Only What You Need</h4>
<p>Be deliberate with your projections. Select the minimal schema required for the task.</p>
<pre><code class="lang-python">df_pruned = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [department, salary, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now Spark reads and processes only the three required columns: department, salary, and country. The plan is narrower, the DAG simpler, and execution faster.</p>
<h4 id="heading-real-world-benchmark-aws-glue-2">Real-World Benchmark: AWS Glue</h4>
<p>Wide Projection:</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df_star = (
    df.select(<span class="hljs-string">"*"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)

df_star.count()
time_star = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<p>Pruned Projection:</p>
<pre><code class="lang-python">start = time.time()

df_pruned = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)

df_pruned.count()
time_pruned = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">f"select('*') time: <span class="hljs-subst">{time_star}</span>s"</span>)
print(<span class="hljs-string">f"pruned columns time: <span class="hljs-subst">{time_pruned}</span>s"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Columns Processed</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>select("*")</td><td>8</td><td>~26.54 s</td><td>Spark carries all columns through the plan.</td></tr>
<tr>
<td>Pruned projection</td><td>3</td><td>~2.21 s</td><td>Only needed columns processed → faster and lighter.</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-how-catalyst-handles-columns">Under the Hood: How Catalyst Handles Columns</h4>
<p>When you call select("*"), Catalyst resolves <em>every attribute</em> into the logical plan. Each subsequent transformation inherits that full attribute list, increasing plan depth and overhead.</p>
<p>Catalyst includes a rule called ColumnPruning, which removes unused attributes, but it only works when Spark <em>can see</em> which columns are necessary. If you use "*" or dynamically reference df.columns, Catalyst loses visibility.</p>
<p><strong>Works:</strong></p>
<pre><code class="lang-python">df \
    .select(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>) \
    .groupBy(<span class="hljs-string">"country"</span>) \
    .agg(avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p><strong>Doesn’t Work:</strong></p>
<pre><code class="lang-python">cols = df.columns

df.select(cols) \
  .groupBy(<span class="hljs-string">"country"</span>) \
  .agg(avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p>In the second case, Catalyst can’t prune anything because cols might include everything.</p>
<h4 id="heading-physical-plan-differences">Physical Plan Differences</h4>
<p>Wide Projection (select("*")):</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(<span class="hljs-number">1</span>) Filter (department = Engineering)

      +- *(<span class="hljs-number">1</span>) Scan parquet ...
</code></pre>
<p>Pruned Projection:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(<span class="hljs-number">1</span>) Project [department, salary, country]

   +- *(<span class="hljs-number">1</span>) Filter (department = Engineering)

      +- *(<span class="hljs-number">1</span>) Scan parquet [department, salary, country]
</code></pre>
<p>Notice the last line: Spark physically scans only the three referenced columns from Parquet. That’s genuine I/O reduction, not just logical simplification. Using select("*") increases shuffle file sizes, memory usage during serialization, Catalyst planning time, and I/O and network traffic. The fix requires nothing more than naming the columns you need.</p>
<p>In managed environments like AWS Glue or Databricks, this simple practice can greatly reduce ETL time, particularly for Parquet or Delta files, because explicit projection lets column pruning take effect. It’s one of the easiest and highest-impact Spark optimization techniques, and it starts with typing fewer asterisks.</p>
<h3 id="heading-scenario-6-filter-pushdown-vs-full-scan">Scenario 6: Filter Pushdown vs Full Scan</h3>
<p>When a Spark job feels slow right from the start, even before joins or aggregations, the culprit is often hidden at the data-read layer. Spark spends seconds (or minutes) scanning every record, even though most rows are useless for the query.</p>
<p>That’s where filter pushdown comes in. It tells Spark to <em>push your filter logic down to the file reader</em> so that Parquet / ORC / Delta formats return only the relevant rows from disk. Done right, this optimization can reduce scan size significantly. Done wrong, Spark performs a full scan, reading everything before filtering in memory.</p>
<h4 id="heading-the-problem-late-filters-and-full-scans">The Problem: Late Filters and Full Scans</h4>
<pre><code class="lang-python">employees_df = spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)

df_full = (
    employees_df
        .select(<span class="hljs-string">"*"</span>)  <span class="hljs-comment"># reads all columns</span>
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)
)
</code></pre>
<p>Looks fine, right? But Spark can’t push this filter to the Parquet reader because it’s applied <em>after</em> the select("*") projection step. Catalyst sees the filter as operating on a projected DataFrame, not the raw scan, so the pushdown boundary is lost.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Filter (country = Canada)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

   └─ Scan parquet employee_data [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every record from every Parquet file is read into memory before the filter executes. In large tables, this means scanning terabytes when you only need megabytes.</p>
<h4 id="heading-the-fix-filter-early-and-project-light">The Fix: Filter Early and Project Light</h4>
<p>Move filters as close as possible to the data source and limit columns before Spark reads them:</p>
<pre><code class="lang-python">df_pushdown = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, department, salary, country]

└─ Scan parquet employee_data [id, firstname, department, salary, country]

   PushedFilters: [country = Canada]
</code></pre>
<p>Notice the difference: PushedFilters appears in the plan. That means the Parquet reader handles the predicate, returning only matching blocks and rows.</p>
<h4 id="heading-under-the-hood-what-actually-happens">Under the Hood: What Actually Happens</h4>
<p>When Spark performs filter pushdown, it leverages the Parquet metadata (min/max statistics and row-group indexes) stored in file footers.</p>
<ul>
<li><p>Spark inspects file-level metadata for the predicate column (country).</p>
</li>
<li><p>It skips any row group whose min/max statistics show it cannot contain a match (for example, a group whose country range excludes Canada).</p>
</li>
<li><p>It reads only the necessary row groups and columns from disk.</p>
</li>
<li><p>Those records enter the DAG directly – no in-memory filtering required.</p>
</li>
</ul>
<p>This pruning happens at the scan itself, before rows reach the rest of the plan, reducing both I/O and network transfer.</p>
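<p>The row-group skipping described in the list above can be illustrated without Spark at all. The following pure-Python sketch models each row group as a record carrying min/max statistics for country. The RowGroup class, the layout, and the sample values are hypothetical, and this is not how the Parquet reader is implemented internally.</p>
<pre><code class="lang-python">from dataclasses import dataclass
from operator import le


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its footer statistics."""
    name: str
    min_country: str
    max_country: str
    rows: list


def may_contain(group, value):
    # The value can only be present if it falls inside [min, max].
    return le(group.min_country, value) and le(value, group.max_country)


def scan(groups, value):
    """Read only row groups whose statistics allow a match, then filter rows."""
    read = [g for g in groups if may_contain(g, value)]
    matches = [row for g in read for row in g.rows if row == value]
    return [g.name for g in read], matches


groups = [
    RowGroup("rg0", "Argentina", "Brazil", ["Argentina", "Brazil"]),
    RowGroup("rg1", "Brazil", "Denmark", ["Brazil", "Canada", "Denmark"]),
    RowGroup("rg2", "India", "Japan", ["India", "Japan"]),
]

groups_read, matches = scan(groups, "Canada")
print(groups_read, matches)  # only rg1 is read
</code></pre>
<p>Statistics can only rule a row group out, never confirm a match, so the rows inside rg1 are still checked one by one. Skipping also works best when the data is sorted or clustered on the predicate column, because otherwise most groups span a wide min/max range.</p>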
<h4 id="heading-real-world-benchmark-aws-glue-3">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col

spark = SparkSession.builder.appName(<span class="hljs-string">"FilterPushdownBenchmark"</span>).getOrCreate()

start = time.time()
df_full = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"*"</span>)                         <span class="hljs-comment"># all columns</span>
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)  
)
df_full.count()
time_full = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df_pushdown = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)  
)
df_pushdown.count()
time_push = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Full Scan Time:"</span>, time_full, <span class="hljs-string">"sec"</span>)
print(<span class="hljs-string">"Filter Pushdown Time:"</span>, time_push, <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1 M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Full Scan</td><td>14.2 s</td><td>All files scanned and filtered in memory.</td></tr>
<tr>
<td>Filter Pushdown</td><td>3.8 s</td><td>Only relevant row groups and columns read.</td></tr>
</tbody>
</table>
</div><p><strong>Physical Plan Comparison</strong></p>
<p>Full Scan:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Filter (country = Canada)

+- *(<span class="hljs-number">1</span>) ColumnarToRow

   +- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

      Batched: true, DataFilters: [], PushedFilters: []
</code></pre>
<p>Pushdown:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) ColumnarToRow

+- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, department, salary, country]

   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]
</code></pre>
<p>The difference is clear: PushedFilters confirms that Spark applied predicate pushdown, skipping unnecessary row groups at the scan stage.</p>
<h4 id="heading-reflection-why-pushdown-matters">Reflection: Why Pushdown Matters</h4>
<p>Pushdown isn’t a micro-optimization. It’s often the single biggest performance lever in Spark ETL. In data lakes with hundreds of files, full scans waste hours and inflate AWS S3 I/O costs. By filtering and projecting early, Spark prunes both rows and columns before execution even begins.</p>
<p>Apply filters as early as possible in the read pipeline, combine filter pushdown with column pruning, verify PushedFilters in explain("formatted"), avoid UDFs and select("*") at read time, and let pushdown turn “read everything and discard most” into “read only what you need.”</p>
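<p>Checking PushedFilters can be scripted rather than done by eye. The sketch below writes a tiny Parquet dataset to a temporary local directory (a stand-in for the S3 path), reads it back with a filter, and pulls the PushedFilters line out of the explain("formatted") output. The sample rows and the local path are assumptions for the example.</p>
<pre><code class="lang-python">import io
import os
import tempfile
from contextlib import redirect_stdout

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").appName("pushdown_check_sketch").getOrCreate()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local stand-in for s3://data/employee_data/.
    path = os.path.join(tmp, "employee_data")
    spark.createDataFrame(
        [(1, "Ada", "Canada"), (2, "Alan", "India"), (3, "Grace", "Canada")],
        ["id", "firstname", "country"],
    ).write.parquet(path)

    df_pushdown = (
        spark.read.parquet(path)
            .select("id", "country")
            .filter(col("country") == "Canada")
    )

    # Capture the text that explain() prints.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df_pushdown.explain("formatted")
    plan = buffer.getvalue()

    # Keep only the scan-node lines that report pushed predicates.
    pushed_lines = [line.strip() for line in plan.splitlines() if "PushedFilters" in line]
    matching_rows = df_pushdown.count()

print(pushed_lines)
spark.stop()
</code></pre>
<p>An empty list after PushedFilters would suggest the predicate is being evaluated in memory instead of at the reader.</p>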
<h3 id="heading-scenario-7-de-duplicate-right">Scenario 7: De-duplicate Right</h3>
<h4 id="heading-the-problem-all-row-deduplication-and-why-it-hurts">The Problem: “All-Row Deduplication” and Why It Hurts</h4>
<p>When we use this:</p>
<pre><code class="lang-python">df.dropDuplicates()
</code></pre>
<p>Spark removes identical rows across all columns. It sounds simple, but this operation forces Spark to treat every column as part of the deduplication key.</p>
<p>Internally, it means:</p>
<ul>
<li><p>Every attribute is serialized and hashed.</p>
</li>
<li><p>Every unique combination of all columns is shuffled across the cluster to ensure global uniqueness.</p>
</li>
<li><p>Even small changes in a non-essential field (like hire_date) cause new keys and destroy aggregation locality.</p>
</li>
</ul>
<p>In wide tables, this is one of the heaviest shuffle operations Spark can perform.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [id, firstname, lastname, department, salary, age, hire_date, country], [first(id) AS id, ...]

└─ Exchange hashpartitioning(id, firstname, lastname, department, salary, age, hire_date, country, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Notice the Exchange: that’s a full shuffle across all columns. Spark must send every record to the partition responsible for its unique combination of all fields. This is slow, memory-intensive, and scales poorly as columns grow.</p>
<h4 id="heading-the-better-approach-key-based-deduplication">The Better Approach: Key-Based Deduplication</h4>
<p>In most real datasets, duplicates are determined by a primary or business key, not all attributes. For example, if id uniquely identifies an employee, we only need to keep one record per id.</p>
<pre><code class="lang-python">df.dropDuplicates([<span class="hljs-string">"id"</span>])
</code></pre>
<p>Now Spark deduplicates based only on the id column.</p>
<pre><code class="lang-python">Aggregate [id], [first(id) AS id, first(firstname) AS firstname, ...]

└─ Exchange hashpartitioning(id, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The shuffle key is dramatically narrower. Instead of hashing across all columns, Spark redistributes data only by id. The result is a smaller key to hash and compare, smaller shuffle files, and a faster reduce stage.</p>
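<p>The two modes also differ in what counts as a duplicate, which is easy to see in a small plain-Python model. This is only a sketch, not Spark code: the sample rows are hypothetical, and the model keeps the first row seen per key, whereas Spark does not guarantee which row survives dropDuplicates(["id"]).</p>

```python
# Plain-Python model of the two deduplication modes (not Spark code).
# Rows are (id, firstname, hire_date); the second row is a hypothetical
# re-ingested copy of employee 1 with a different hire_date.
rows = [
    (1, "John", "2020-01-15"),
    (1, "John", "2020-01-16"),
    (2, "Jane", "2019-03-20"),
    (2, "Jane", "2019-03-20"),
]


def dedup_full_row(records):
    """Keep one copy of each fully identical row, like dropDuplicates()."""
    return list(dict.fromkeys(records))


def dedup_by_key(records, key_index=0):
    """Keep the first row seen per key, like dropDuplicates(["id"])."""
    seen = {}
    for record in records:
        seen.setdefault(record[key_index], record)
    return list(seen.values())


full = dedup_full_row(rows)
keyed = dedup_by_key(rows)
print(len(full), len(keyed))
```

<p>Full-row deduplication keeps both versions of employee 1 because hire_date differs, while key-based deduplication collapses them into one record.</p>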
<h4 id="heading-real-world-benchmark-aws-glue-4">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> exp, log, sqrt, col, concat_ws, when, upper, avg
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

multiplied_data = [(i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],   <span class="hljs-comment"># department</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],   <span class="hljs-comment"># salary</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],   <span class="hljs-comment"># age</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],   <span class="hljs-comment"># hire_date</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>]    <span class="hljs-comment"># country</span>
                    )
                   <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)]

df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start = time.time()
dedup_full = df.dropDuplicates()
dedup_full.count()
time_full = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
dedup_key = df.dropDuplicates([<span class="hljs-string">"id"</span>])
dedup_key.count()
time_key = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">f"Full-row dedup time: <span class="hljs-subst">{time_full}</span>s"</span>)
print(<span class="hljs-string">f"Key-based dedup time: <span class="hljs-subst">{time_key}</span>s"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Full-Row Dedup</td><td>27.6 s</td><td>Shuffle across all attributes, large hash table</td></tr>
<tr>
<td>Key-Based Dedup (["id"])</td><td>2.06 s</td><td>About 13× faster, minimal shuffle width</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-what-catalyst-does">Under the Hood: What Catalyst Does</h4>
<p>When you specify a key list, Catalyst rewrites dropDuplicates(keys) into a partial + final aggregate plan, just like a groupBy:</p>
<p>HashAggregate(keys=[id], functions=[first(...)])</p>
<p>This allows Spark to:</p>
<ul>
<li><p>Perform map-side partial aggregation on each partition (before shuffle).</p>
</li>
<li><p>Exchange the partially aggregated rows, partitioned only by the grouping key (id).</p>
</li>
<li><p>Perform a final aggregation on the reduced data.</p>
</li>
</ul>
<p>The all-column version gains little from that optimization because every column participates in uniqueness, so Spark must hash and redistribute <em>complete</em> rows.</p>
<h4 id="heading-best-practices-for-deduplication">Best Practices for Deduplication</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Practice</strong></td><td><strong>Why It Matters</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Always deduplicate by key columns</td><td>Reduces shuffle width and data movement</td></tr>
<tr>
<td>Use deterministic keys (id, email, ssn)</td><td>Ensures predictable grouping</td></tr>
<tr>
<td>Avoid dropDuplicates() without arguments</td><td>Forces global shuffle across all attributes</td></tr>
<tr>
<td>Combine with column pruning</td><td>Keep only necessary fields before deduplication</td></tr>
<tr>
<td>For “latest record” logic, use window functions</td><td>Allows targeted deduplication (row_number() with order)</td></tr>
<tr>
<td>Cache intermediate datasets if reused</td><td>Avoids recomputation of expensive dedup stages</td></tr>
</tbody>
</table>
</div><h4 id="heading-combining-deduplication-amp-aggregation">Combining Deduplication &amp; Aggregation</h4>
<p>You can merge deduplication with aggregation for even better results:</p>
<pre><code class="lang-python">df_dedup_agg = (
    df.dropDuplicates([<span class="hljs-string">"id"</span>])
        .groupBy(<span class="hljs-string">"department"</span>)
        .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
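<p>Deduplication and aggregation use different keys (id and department), so Spark still needs two shuffles. However, the partial department aggregation is pipelined into the same stage as the final deduplication step, so the second shuffle carries only a few pre-aggregated rows per partition. The plan will look roughly like this:</p>
<pre><code class="lang-python">HashAggregate(keys=[department], functions=[avg(salary)])

└─ Exchange hashpartitioning(department, <span class="hljs-number">200</span>)

   └─ HashAggregate(keys=[department], functions=[partial_avg(salary)])

      └─ HashAggregate(keys=[id], functions=[first(...), first(department)])

         └─ Exchange hashpartitioning(id, <span class="hljs-number">200</span>)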
</code></pre>
<p>Prefer dropDuplicates(["key_col"]) over dropDuplicates() to deduplicate by business or surrogate keys rather than the entire schema. Combine deduplication with projection to reduce I/O, and remember that one narrow shuffle is always better than a wide shuffle. Deduplication isn’t just cleanup – it’s an optimization strategy. Choose your keys wisely, and Spark will reward you with faster jobs and lighter DAGs.</p>
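<p>The “latest record” practice from the table above can be modeled in plain Python. This is a minimal sketch, not Spark code, and the updated_at field is a hypothetical column that is not part of the employee dataset. In PySpark, the same idea is usually written with row_number() over Window.partitionBy("id").orderBy(col("updated_at").desc()), followed by a filter that keeps row number 1.</p>

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical change records: (id, salary, updated_at as an ISO date string).
records = [
    (1, 80000, "2024-01-10"),
    (1, 82000, "2024-03-05"),
    (2, 85000, "2024-02-01"),
    (1, 81000, "2024-02-20"),
]


def latest_per_key(rows):
    """Model of keeping row_number() == 1 per id, ordered by updated_at descending."""
    # Newest first overall, then a stable sort by id keeps newest-first order per id.
    ordered = sorted(rows, key=itemgetter(2), reverse=True)
    ordered.sort(key=itemgetter(0))
    # The first row of each group is the one row_number() would label 1.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


latest = latest_per_key(records)
print(latest)
```

<p>Unlike dropDuplicates(["id"]), this approach makes the surviving row explicit: it is always the most recent one for each key.</p>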
<h3 id="heading-scenario-8-count-smarter">Scenario 8: Count Smarter</h3>
<p>In production, one of the most common performance pitfalls is the simplest line of code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> df.count() &gt; <span class="hljs-number">0</span>:
</code></pre>
<p>At first glance, this seems harmless. You just want to know whether the DataFrame has any data before writing, joining, or aggregating. But in Spark, count() is not a metadata lookup; it’s a full cluster-wide job.</p>
<p><strong>What Really Happens with count()</strong><br>When you call df.count(), Spark executes a complete action:</p>
<ul>
<li><p>It scans every partition.</p>
</li>
<li><p>Deserializes every row.</p>
</li>
<li><p>Counts records locally on each executor.</p>
</li>
<li><p>Reduces the counts to the driver.</p>
</li>
</ul>
<p>That means your “empty check” runs a full distributed computation, even when the dataset has billions of rows or lives in S3.</p>
<pre><code class="lang-python">df.count()
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[], functions=[count(<span class="hljs-number">1</span>)])

+- *(<span class="hljs-number">1</span>) ColumnarToRow

   +- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every record is read, aggregated, and returned just to produce a single integer.</p>
<p>Now imagine this runs in the middle of your Glue job, before a write, before a filter, or inside a loop. You’ve just added a full-table scan to your DAG for no reason.</p>
<h4 id="heading-the-smarter-way-limit1-or-head1">The Smarter Way: limit(1) or head(1)</h4>
<p>If all you need to know is whether data exists, you don’t need to count every record. You just need to know if there’s <em>at least one</em>.</p>
<p>Two efficient alternatives:</p>
<pre><code class="lang-python">df.head(<span class="hljs-number">1</span>)
<span class="hljs-comment">#or</span>
df.limit(<span class="hljs-number">1</span>).collect()
</code></pre>
<p>Both execute a lazy scan that stops as soon as one record is found.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">CollectLimit <span class="hljs-number">1</span>

└─ *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<ul>
<li><p>No global aggregation.</p>
</li>
<li><p>No shuffle.</p>
</li>
<li><p>No full scan.</p>
</li>
</ul>
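<p>The difference between the two checks can be illustrated with a small plain-Python model. This is only a sketch under simplifying assumptions: scan is a hypothetical stand-in for a lazy table scan, and real Spark reads data in partitions rather than row by row.</p>

```python
def scan(total_rows, stats):
    """Model of a lazy table scan: yields rows and records how many were read."""
    for i in range(total_rows):
        stats["rows_read"] += 1
        yield i


# count()-style check: consumes every row just to compare the total with zero.
count_stats = {"rows_read": 0}
has_rows_by_count = sum(1 for _ in scan(1_000_000, count_stats)) > 0

# head(1)-style check: stops as soon as the first row arrives.
head_stats = {"rows_read": 0}
has_rows_by_head = next(scan(1_000_000, head_stats), None) is not None

print(count_stats["rows_read"], head_stats["rows_read"])
```

<p>Both checks return the same answer, but the first reads a million rows and the second reads one. In PySpark, head(1) returns a list, so the existence check can be written as a simple truthiness test on that list.</p>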
<h4 id="heading-real-world-benchmark-aws-glue-5">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> exp, log, sqrt, col, concat_ws, when, upper, avg
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># Initialize Spark session</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

<span class="hljs-comment"># Base dataset (10 sample employees)</span>
employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

<span class="hljs-comment"># Create 1 million rows</span>
multiplied_data = [
    (i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)
]

<span class="hljs-comment"># Create DataFrame</span>
df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start = time.time()
df.count()
count_time = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df.limit(<span class="hljs-number">1</span>).collect()
limit_time = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df.head(<span class="hljs-number">1</span>)
head_time = round(time.time() - start, <span class="hljs-number">2</span>)

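print(<span class="hljs-string">f"count() time: <span class="hljs-subst">{count_time}</span>s"</span>)
print(<span class="hljs-string">f"limit(1) time: <span class="hljs-subst">{limit_time}</span>s"</span>)
print(<span class="hljs-string">f"head(1) time: <span class="hljs-subst">{head_time}</span>s"</span>)

spark.stop()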
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Method</strong></td><td><strong>Plan Type</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>count()</td><td>HashAggregate + Exchange</td><td>26.33 s</td><td>Full scan + aggregation</td></tr>
<tr>
<td>limit(1)</td><td>CollectLimit</td><td>0.62 s</td><td>Stops after first record</td></tr>
<tr>
<td>head(1)</td><td>CollectLimit</td><td>0.42 s</td><td>Fastest, single partition</td></tr>
</tbody>
</table>
</div><p>For the same logical check, that is roughly a 40× to 60× difference.</p>
<p>So why does this difference exist? Spark’s execution model treats every action as a trigger for computation. count() is an aggregation action that requires global communication, while limit(1).collect() and head(1) short-circuit the job after fetching the first record. Spark plans a CollectLimit node instead of a HashAggregate, and it typically starts by scanning a single partition, reading more only if that partition returns no rows.</p>
<p><strong>Plan comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Action</strong></td><td><strong>Simplified Plan</strong></td><td><strong>Type</strong></td><td><strong>Behavior</strong></td></tr>
</thead>
<tbody>
<tr>
<td>count()</td><td>HashAggregate → Exchange → FileScan</td><td>Global</td><td>Full scan, wide dependency</td></tr>
<tr>
<td>limit(1)</td><td>CollectLimit → FileScan</td><td>Local</td><td>Early stop, narrow dependency</td></tr>
<tr>
<td>head(1)</td><td>CollectLimit → FileScan</td><td>Local</td><td>Early stop, single task</td></tr>
</tbody>
</table>
</div><p>Avoid using count() to check emptiness since it triggers a full scan. Use limit(1) or head(1) for lightweight existence checks. Reserve count() for cases where the total is actually required, because Spark will process all the data unless explicitly told to stop.</p>
<p>Other alternatives:</p>
<div class="hn-table">
<table>
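<thead>
<tr>
<td><strong>Method</strong></td><td><strong>Behavior</strong></td></tr>
</thead>
<tbody>
<tr>
<td><code>df.take(1)</code></td><td>Similar to head(1); returns a list of Rows</td></tr>
<tr>
<td><code>df.first()</code></td><td>Returns the first Row, or None if the DataFrame is empty</td></tr>
<tr>
<td><code>df.isEmpty()</code></td><td>Returns True if the DataFrame has no rows (PySpark 3.3+)</td></tr>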
<tr>
<td><code>df.rdd.isEmpty()</code></td><td>RDD-level check</td></tr>
</tbody>
</table>
</div><h3 id="heading-scenario-9-window-wisely">Scenario 9: Window Wisely</h3>
<p>Window functions (rank(), dense_rank(), lag(), avg() with over(), and so on) are essential in analytics. They let you calculate running totals, rankings, or time-based metrics.</p>
<p>But in Spark, they’re not cheap, because they rely on shuffles and ordering.</p>
<p>Each window operation:</p>
<ul>
<li><p>Requires all rows for the same partition key to be co-located on the same node.</p>
</li>
<li><p>Requires sorting those rows by the orderBy() clause within each partition.</p>
</li>
</ul>
<p>If you omit partitionBy() (or use it with too broad a key), Spark treats the entire dataset as one partition, triggering a massive shuffle and global sort.</p>
<h4 id="heading-global-window-the-wrong-way">Global Window: The Wrong Way</h4>
<p>Let’s compute employee rankings by salary without partitioning:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.window <span class="hljs-keyword">import</span> Window
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> rank, col

window_spec = Window.orderBy(col(<span class="hljs-string">"salary"</span>).desc())

df_ranked = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_spec))
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Window [rank() windowspecdefinition(orderBy=[salary DESC]) AS salary_rank]

└─ Sort [salary DESC], false

   └─ Exchange SinglePartition

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Spark must move the entire dataset into a single partition and sort it there. One task ends up processing every row, and all data must move through the network.</p>
<h4 id="heading-partition-by-a-selective-key-the-better-way">Partition by a Selective Key: The Better Way</h4>
<p>Most analytics don’t need a global ranking. You likely want rankings within a department or group, not across the entire company.</p>
<pre><code class="lang-python">window_spec = Window.partitionBy(<span class="hljs-string">"department"</span>).orderBy(col(<span class="hljs-string">"salary"</span>).desc())

df_ranked = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_spec))
</code></pre>
<p>Now Spark builds separate windows per department. Each partition’s data stays local, dramatically reducing shuffle size.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Window [rank() windowspecdefinition(partitionBy=[department], orderBy=[salary DESC]) AS salary_rank]

└─ Sort [department ASC, salary DESC], false

   └─ Exchange hashpartitioning(department, <span class="hljs-number">200</span>)

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The Exchange now partitions data only by department. The shuffle boundary is narrower: fewer bytes transferred, fewer sort comparisons, and a smaller spill risk.</p>
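<p>What a partitioned window computes can be sketched in plain Python. This is a simplified model, not Spark code, and the sample rows are hypothetical. It mirrors the two steps in the plan: group rows by the partition key, then sort and rank only within each group.</p>

```python
from collections import defaultdict

# Hypothetical (name, department, salary) rows.
employees = [
    ("John", "Engineering", 80000),
    ("Eve", "Engineering", 95000),
    ("Bob", "Engineering", 95000),
    ("Alice", "Sales", 60000),
    ("Frank", "Sales", 70000),
]


def rank_within_department(rows):
    """Model of rank() partitioned by department and ordered by salary descending."""
    # "Shuffle": co-locate rows that share a partition key.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)

    ranked = {}
    for department, members in partitions.items():
        # "Sort": order only within the partition, never across partitions.
        members.sort(key=lambda r: r[2], reverse=True)
        previous_salary, current_rank = None, 0
        for position, (name, _, salary) in enumerate(members, start=1):
            # rank() gives ties the same rank and skips the ranks that follow.
            if salary != previous_salary:
                current_rank = position
                previous_salary = salary
            ranked[name] = (department, current_rank)
    return ranked


ranks = rank_within_department(employees)
print(ranks)
```

<p>Each department is sorted independently, so no comparison ever crosses a partition boundary. That independence is what lets Spark spread the work across tasks.</p>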
<h4 id="heading-real-world-benchmark-aws-glue-6">Real-World Benchmark: AWS Glue</h4>
<p>We can run both window functions on the same 1 million row dataset:</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
 <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
window_global = Window.orderBy(col(<span class="hljs-string">"salary"</span>).desc())
df_global = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_global))
df_global.count()
global_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f'global_time:<span class="hljs-subst">{global_time}</span>'</span>)

start = time.time()
window_local = Window.partitionBy(<span class="hljs-string">"department"</span>).orderBy(col(<span class="hljs-string">"salary"</span>).desc())
df_local = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_local))
df_local.count()
local_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f'local_time:<span class="hljs-subst">{local_time}</span>'</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Stage Count</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Global Window (no partition)</td><td>5</td><td>30.21 s</td><td>Full dataset shuffle + global sort</td></tr>
<tr>
<td>Partitioned Window (by department)</td><td>3</td><td>1.74 s</td><td>Localized sort, fewer shuffle files</td></tr>
</tbody>
</table>
</div><p>Partitioning the window significantly reduces both shuffle volume and runtime (about 17× faster in this run). The gap tends to widen as data scales.</p>
<h4 id="heading-under-the-hood-what-spark-actually-does-1">Under the Hood: What Spark Actually Does</h4>
<p>Each Window transformation adds a physical plan node like:</p>
<p>WindowExec [rank() windowspecdefinition(...)], frame=RangeFrame</p>
<p>This node is non-pipelined – it materializes input partitions before computing window metrics. The Catalyst optimizer generally can’t push filters or projections inside WindowExec, which means:</p>
<ul>
<li><p>If you rank before filtering, Spark computes ranks for all rows.</p>
</li>
<li><p>If you order globally, Spark must sort everything before starting.</p>
</li>
</ul>
<p>That’s why window placement in your code matters almost as much as partition keys.</p>
<h4 id="heading-common-anti-patterns">Common Anti-Patterns:</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Anti-Pattern</strong></td><td><strong>Why It Hurts</strong></td><td><strong>Fix</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Missing partitionBy()</td><td>Global sort across dataset</td><td>Partition by key columns</td></tr>
<tr>
<td>Overly broad partition key</td><td>Creates too many small partitions</td><td>Use selective, not unique keys</td></tr>
<tr>
<td>Wide, unbounded window frame</td><td>Retains all rows in memory per key</td><td>Use bounded ranges (for example, rowsBetween(-3, 0))</td></tr>
<tr>
<td>Filtering after window</td><td>Computes unnecessary metrics</td><td>Filter first, then window</td></tr>
<tr>
<td>Multiple chained windows</td><td>Each triggers new sort</td><td>Combine window metrics in one spec</td></tr>
</tbody>
</table>
</div><p>Partition on selective keys to reduce shuffle volume, and avoid global windows that force full sorts and shuffles. Prefer bounded frames to keep state in memory and limit disk spill, and filter early while combining metrics to minimize unnecessary data flowing through WindowExec. Windows are powerful, but unbounded ones can silently crush performance. In Spark, partitioning isn’t optional. It’s the line between analytics and overhead.</p>
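<p>The benefit of a bounded frame can be sketched in plain Python. This is a minimal model, not Spark code: it computes a trailing average over a frame like rowsBetween(-3, 0) for one key, using hypothetical values, and holds at most four values in memory at any time.</p>

```python
from collections import deque


def trailing_average(values, preceding=3):
    """Model of avg(x) over a frame like rowsBetween(-preceding, 0)."""
    # The deque holds at most preceding + 1 values, so memory stays bounded
    # no matter how many rows the partition contains.
    frame = deque(maxlen=preceding + 1)
    averages = []
    for value in values:
        frame.append(value)
        averages.append(sum(frame) / len(frame))
    return averages


salaries = [10, 20, 30, 40, 50, 60]  # hypothetical ordered values for one key
print(trailing_average(salaries))
```

<p>An unbounded frame would instead need every earlier row of the key to remain available, which is where memory pressure and spill come from.</p>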
<h3 id="heading-scenario-10-incremental-aggregations-with-cache-and-persist">Scenario 10: Incremental Aggregations with Cache and Persist</h3>
<p>When multiple actions depend on the same expensive base computation, don’t recompute it every time. Materialize it once with cache() or persist(), then reuse it. Teams commonly get this wrong in one of two ways:</p>
<ul>
<li><p>They never cache, so Spark recomputes long lineages (filters, joins, window ops) for every action.</p>
</li>
<li><p>They cache everything, blowing executor memory and making things worse.</p>
</li>
</ul>
<p>This scenario shows how to do it intelligently.</p>
<h4 id="heading-the-problem-recomputing-the-same-work-for-every-metric">The Problem: Recomputing the Same Work for Every Metric</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, avg, max <span class="hljs-keyword">as</span> max_, count

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">70000</span>)
)

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"cnt"</span>))
</code></pre>
<p>Looks totally fine at a glance. But remember: Spark is lazy. Every time you trigger an action:</p>
<pre><code class="lang-python">avg_salary.show()
max_salary.show()
cnt_salary.show()
</code></pre>
<p>Spark walks back to the same base definition and re-runs all filters and shuffles for each metric – unless you persist.</p>
<p>So instead of one filtered dataset reused three times, you effectively get:</p>
<ul>
<li><p>3 jobs</p>
</li>
<li><p>3 scans / filter chains</p>
</li>
<li><p>3 groupBy shuffles</p>
</li>
</ul>
<p>for the same input slice.</p>
<p><strong>Simplified Logical Plan Shape (Without Cache):</strong></p>
<pre><code class="lang-python">HashAggregate [department], [avg/max/count]

└─ Exchange hashpartitioning(department)

   └─ Filter (department = <span class="hljs-string">'Engineering'</span> AND country = <span class="hljs-string">'USA'</span> AND salary &gt; <span class="hljs-number">70000</span>)

      └─ Scan ...
</code></pre>
<p>And Spark builds this three times. Even though the filter logic is identical, each action triggers a new job with:</p>
<ul>
<li><p>new stages,</p>
</li>
<li><p>new shuffles, and</p>
</li>
<li><p>new scans.</p>
</li>
</ul>
<p>On large datasets (hundreds of GBs), this is brutal.</p>
<h4 id="heading-the-better-approach-cache-the-shared-base">The Better Approach: Cache the Shared Base</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> StorageLevel

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">70000</span>)
)

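<span class="hljs-comment"># Keep the filtered rows in memory, spilling to disk if they don't fit</span>
base = base.persist(StorageLevel.MEMORY_AND_DISK)

<span class="hljs-comment"># persist() is lazy: run one action to materialize the cache</span>
base.count()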

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"cnt"</span>))

avg_salary.show()
max_salary.show()
cnt_salary.show()

base.unpersist()
</code></pre>
<p>Now, the filters and initial scan run once, the results are cached, and all subsequent aggregates read from cached data instead of recomputing upstream logic.</p>
<p><strong>Logical Plan Shape (With Cache):</strong></p>
<p>Before materialization (base.count()), the plan still shows the lineage. Afterward, subsequent actions operate off the cached node.</p>
<pre><code class="lang-python">InMemoryRelation [department, salary, country, ...]

   └─ * Cached <span class="hljs-keyword">from</span>:

      Filter (department = <span class="hljs-string">'Engineering'</span> AND country = <span class="hljs-string">'USA'</span> AND salary &gt; <span class="hljs-number">70000</span>)

      └─ Scan parquet employees_large ...
</code></pre>
<p>Then:</p>
<pre><code class="lang-python">HashAggregate [department], [avg/max/count]

└─ InMemoryRelation [...]
</code></pre>
<p>One heavy pipeline, many cheap reads. The DAG becomes flatter:</p>
<ul>
<li><p>Expensive scan &amp; filter: once.</p>
</li>
<li><p>Aggregations (each with its own small groupBy shuffle over the already-filtered rows): N times from memory/disk.</p>
</li>
</ul>
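<p>To make the recomputation cost concrete, here is a minimal pure-Python sketch of lazy evaluation with and without a cache. It is a toy model, not Spark: <code>LazyDataset</code>, <code>expensive_filter</code>, <code>run_three_metrics</code>, and the sample salaries are hypothetical names and values invented for illustration.</p>

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated DataFrame (illustration only, not Spark)."""

    def __init__(self, compute):
        self._compute = compute        # rebuilds the rows from scratch
        self._cache_requested = False  # set by cache(), which is lazy like persist()
        self._cached_rows = None       # filled by the first action after cache()
        self.evaluations = 0           # how many times the lineage was replayed

    def cache(self):
        self._cache_requested = True
        return self

    def collect(self):
        if self._cached_rows is not None:
            return self._cached_rows   # cheap read, no recomputation
        self.evaluations += 1
        rows = list(self._compute())
        if self._cache_requested:
            self._cached_rows = rows
        return rows


def expensive_filter():
    # Hypothetical stand-in for "scan + filter" over a large table
    salaries = [72000, 91000, 68000, 105000, 88000]
    return [s for s in salaries if s > 70000]


def run_three_metrics(dataset):
    """Run three separate actions; each one asks the dataset for its rows."""
    actions = [lambda rows: sum(rows) / len(rows), max, len]
    return [action(dataset.collect()) for action in actions]


uncached = LazyDataset(expensive_filter)
uncached_metrics = run_three_metrics(uncached)

cached = LazyDataset(expensive_filter).cache()
cached_metrics = run_three_metrics(cached)

print("lineage replays without cache:", uncached.evaluations)
print("lineage replays with cache:", cached.evaluations)
```

<p>Both runs return the same metrics. The only difference is how many times the upstream work is replayed, which is the effect caching has on the scan and filter steps in Spark.</p>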
<h4 id="heading-real-world-benchmark-aws-glue-7">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">85000</span>)
)


start = time.time()

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"emp_count"</span>))

print(<span class="hljs-string">"---- Without Cache ----"</span>)
avg_salary.show()
max_salary.show()
cnt.show()

no_cache_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"Total time without cache: <span class="hljs-subst">{no_cache_time}</span> seconds"</span>)


<span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> StorageLevel

base_cached = base.persist(StorageLevel.MEMORY_AND_DISK)
base_cached.count()  <span class="hljs-comment"># materialize cache (this cost is outside the timed section below)</span>

start = time.time()

avg_salary_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"emp_count"</span>))

print(<span class="hljs-string">"---- With Cache ----"</span>)
avg_salary_c.show()
max_salary_c.show()
cnt_c.show()

cache_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"Total time with cache: <span class="hljs-subst">{cache_time}</span> seconds"</span>)

<span class="hljs-comment"># Cleanup</span>
base_cached.unpersist()

print(<span class="hljs-string">"\n==== Summary ===="</span>)
print(<span class="hljs-string">f"Without cache: <span class="hljs-subst">{no_cache_time}</span>s | With cache: <span class="hljs-subst">{cache_time}</span>s"</span>)
print(<span class="hljs-string">"================="</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1M rows)</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Without Cache</td><td>30.75 s</td></tr>
<tr>
<td>With Cache</td><td>3.34 s</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-why-this-works"><strong>Under the Hood: Why This Works</strong></h4>
<p>Using cache() or persist() in Spark inserts an InMemoryRelation / InMemoryTableScanExec node so that expensive intermediate results are stored in executor memory (or memory+disk). This allows future jobs to reuse cached blocks instead of re-scanning sources or re-computing shuffles. This shortens downstream logical plans, reduces repeated shuffles, and lowers load on systems like S3, HDFS, or JDBC.</p>
<p>Without caching, every action replays the full lineage and Spark recomputes the data unless another operator or AQE optimization has already materialized part of it. But caching should not become “cache everything”. Rather, you should avoid caching very large DataFrames used only once, wide raw inputs instead of filtered/aggregated subsets, or long-lived caches that are never unpersisted.</p>
<p>A good rule of thumb is to cache only when the DataFrame is expensive to recompute (joins, filters, windows, UDFs), is used at least twice, and is reasonably sized after filtering so it can fit in memory or work with MEMORY_AND_DISK. Otherwise, allow Spark to recompute.</p>
<p>Conceptually, caching converts a tall, repetitive DAG such as repeated “HashAggregate → Exchange → Filter → Scan” sequences into a hub-and-spoke design where one heavy cached hub feeds multiple lightweight downstream aggregates.</p>
<p>When multiple actions depend on the same expensive computation, cache or persist the shared base to flatten the DAG, eliminate repeated scans and upstream recomputation, and improve end-to-end performance. Be intentional about it: cache only when reuse is real and the data size is safe, and always call <code>unpersist()</code> when done.</p>
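<p>One way to make the "always unpersist" habit hard to forget is a small context manager. This is a minimal sketch, not part of PySpark: <code>cached</code> is a hypothetical helper name, and it only assumes the object passed in exposes <code>persist()</code>, <code>count()</code>, and <code>unpersist()</code>, as PySpark DataFrames do.</p>

```python
from contextlib import contextmanager


@contextmanager
def cached(df, storage_level=None):
    """Persist df, materialize it with one action, and always unpersist on exit.

    Hypothetical helper: works with any object that has persist(), count(),
    and unpersist() methods.
    """
    if storage_level is None:
        persisted = df.persist()
    else:
        persisted = df.persist(storage_level)
    try:
        persisted.count()   # persist() is lazy, so trigger one action to fill the cache
        yield persisted
    finally:
        persisted.unpersist()  # runs even if the body raises
```

<p>With a real DataFrame, the aggregations would go inside the block, for example <code>with cached(base, StorageLevel.MEMORY_AND_DISK) as b:</code> followed by the three <code>groupBy</code> calls on <code>b</code>. The cache is released as soon as the block exits.</p>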
<p>Don’t make Spark re-solve the same puzzle three times. Let it solve it once, remember the answer, and move on.</p>
<h3 id="heading-scenario-11-reduce-shuffles">Scenario 11: Reduce Shuffles</h3>
<p>Shuffles are Spark’s invisible tax collectors. Every time your data crosses executors, you pay in CPU, disk I/O, and network bandwidth.</p>
<p>Two of the most common yet misunderstood transformations that trigger or avoid shuffles are coalesce() and repartition(). Both change partition counts, but they do it in fundamentally different ways.</p>
<h4 id="heading-the-problem"><strong>The Problem</strong></h4>
<p>It’s tempting to write <code>df_result = df.repartition(10)</code> and think, “I’m just changing partitions, so Spark won’t move data unnecessarily.” That assumption is wrong. <code>repartition()</code> always performs a full shuffle, even when:</p>
<ul>
<li><p>You are reducing partitions (from 200 → 10), or</p>
</li>
<li><p>You are increasing partitions (from 10 → 200).</p>
</li>
</ul>
<p>In both cases, Spark redistributes every row across the cluster according to a new partitioning scheme: round-robin when you pass only a partition count, hash partitioning when you pass columns. So even if your data is already partitioned optimally, repartition() will still reshuffle it, adding a stage boundary.</p>
<p><strong>Logical Plan:</strong></p>
<pre><code class="lang-python">Exchange RoundRobinPartitioning(10)

└─ LogicalRDD [...]
</code></pre>
<p>That Exchange node signals a wide dependency: Spark spills intermediate data to disk, transfers it over the network, and reloads it before the next stage. In short: repartition() = "new shuffle, no matter what."</p>
<h4 id="heading-the-better-approach-coalesce">The Better Approach: coalesce()</h4>
<p>If your goal is to reduce the number of partitions (for example, before writing results to S3 or Snowflake), use coalesce() instead.</p>
<p><code>df_result = df.coalesce(10)</code></p>
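<p>coalesce() merges existing partitions without a shuffle. It uses a narrow dependency, meaning each output partition is built from one or more whole input partitions, grouped <em>on the same node</em> where possible. Two caveats: coalesce() can only reduce the partition count, and because it merges partitions as they are, it does not fix uneven partition sizes.</p>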
<pre><code class="lang-python">Coalesce

└─ LogicalRDD [...]
</code></pre>
<ul>
<li><p>No Exchange.</p>
</li>
<li><p>No network shuffle.</p>
</li>
<li><p>Just local merges – fast and cheap.</p>
</li>
</ul>
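<p>The difference between the two operations can be sketched in plain Python by modelling partitions as lists of rows. This is a toy model under stated assumptions, not Spark's actual algorithm: the <code>coalesce</code> and <code>repartition</code> functions below are hypothetical stand-ins, and real Spark also takes data locality into account when grouping partitions.</p>

```python
def coalesce(partitions, n):
    """Merge whole input partitions into at most n outputs (no row-level redistribution)."""
    n = max(1, min(n, len(partitions)))  # coalesce cannot increase the partition count
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition lands in exactly one output partition
        out[i * n // len(partitions)].extend(part)
    return out


def repartition(partitions, n):
    """Deal every row round-robin across n outputs, as a full shuffle would."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four uneven input partitions
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

coalesced = coalesce(parts, 2)
repartitioned = repartition(parts, 2)

print("coalesce sizes:   ", [len(p) for p in coalesced])
print("repartition sizes:", [len(p) for p in repartitioned])
```

<p>In this model, coalescing keeps every input partition intact but inherits the uneven sizes, while repartitioning touches every row and produces balanced outputs. That is the trade-off: cheap merging versus an expensive but even redistribution.</p>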
<h4 id="heading-real-world-benchmark-aws-glue-8">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df_repart = df.repartition(<span class="hljs-number">10</span>)
df_repart.count()
print(<span class="hljs-string">"Repartition time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

start = time.time()
df_coalesced = df.coalesce(<span class="hljs-number">10</span>)
df_coalesced.count()
print(<span class="hljs-string">"Coalesce time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Plan Node</strong></td><td><strong>Shuffle Triggered</strong></td><td><strong>Glue Runtime</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>repartition(10)</td><td>Exchange</td><td>Yes</td><td>18.2 s</td><td>Full cluster reshuffle</td></tr>
<tr>
<td>coalesce(10)</td><td>Coalesce</td><td>No</td><td>1.99 s</td><td>Local partition merge only</td></tr>
</tbody>
</table>
</div><p>Even though both ended with 10 partitions, repartition() took roughly 9× longer (18.2 s vs 1.99 s), all because of the unnecessary shuffle.</p>
<h4 id="heading-why-this-matters">Why This Matters</h4>
<p>Each Exchange node in your physical plan creates a new stage in your DAG, meaning:</p>
<ul>
<li><p>Extra disk I/O</p>
</li>
<li><p>Extra serialization</p>
</li>
<li><p>Extra network transfer</p>
</li>
</ul>
<p>That’s why avoiding just one shuffle in a Glue ETL pipeline can save seconds to minutes per run, especially on wide datasets.</p>
<p><strong>When to use which:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Goal</strong></td><td><strong>Transformation</strong></td><td><strong>Reasoning</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Increase parallelism for heavy groupBy or join</td><td>repartition()</td><td>Distributes data evenly across executors</td></tr>
<tr>
<td>Reduce file count before writing</td><td>coalesce()</td><td>Avoids shuffle, merges partitions locally</td></tr>
<tr>
<td>Rebalance skewed data before a join</td><td>repartition("key")</td><td>Enables better key distribution</td></tr>
<tr>
<td>Optimize output after aggregation</td><td>coalesce()</td><td>Prevents too many small output files</td></tr>
</tbody>
</table>
</div><h4 id="heading-aqe-and-auto-coalescing">AQE and Auto Coalescing</h4>
<p>You can enable Adaptive Query Execution (AQE) in AWS Glue 3.0+ to let Spark merge small shuffle partitions automatically:</p>
<p><code>spark.conf.set("spark.sql.adaptive.enabled", "true")</code></p>
<p><code>spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")</code></p>
<p>With AQE, Spark dynamically combines small partitions <em>after</em> shuffle to balance performance and I/O.</p>
<p>repartition() always triggers a shuffle, while coalesce() avoids one and is ideal for merging partitions before writes. Always inspect your plan for Exchange nodes to identify shuffle points. In the AWS Glue benchmark above, avoiding a single shuffle gave a ~9× runtime improvement at the 1M-row scale. Finally, use AQE to enable dynamic partition coalescing in larger workflows.</p>
<h3 id="heading-scenario-12-know-your-shuffle-triggers">Scenario 12: Know Your Shuffle Triggers</h3>
<p>Much of Spark's performance cost comes from invisible data movement. Every shuffle boundary adds a new stage, a new write–read cycle, and sometimes minutes of extra execution time.</p>
<p>In Spark, any operation that requires rearranging data between partitions introduces a wide dependency, represented in the physical plan as an Exchange node.</p>
<p>Common shuffle triggers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Why It Shuffles</strong></td><td><strong>Plan Node</strong></td></tr>
</thead>
<tbody>
<tr>
<td>join()</td><td>Records with the same key must be co-located for matching</td><td>Exchange (on join keys)</td></tr>
<tr>
<td>groupBy() / agg()</td><td>Rows with the same key must be gathered into the same partition for aggregation</td><td>Exchange</td></tr>
<tr>
<td>distinct()</td><td>Spark must compare all values across partitions</td><td>Exchange</td></tr>
<tr>
<td>orderBy()</td><td>Requires global ordering of data</td><td>Exchange</td></tr>
<tr>
<td>repartition()</td><td>Explicit reshuffle for partition balancing</td><td>Exchange</td></tr>
</tbody>
</table>
</div><p>Each Exchange means a shuffle stage: Spark writes partition data to disk, transfers it over the network, and reads it back into memory on the next stage. That’s your hidden performance cliff.</p>
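<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_  <span class="hljs-comment"># avoid shadowing Python's built-in sum</span>

df_result = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))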
      .join(df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
            .distinct(), <span class="hljs-string">"department"</span>)
      .orderBy(<span class="hljs-string">"total_salary"</span>, ascending=<span class="hljs-literal">False</span>)
)

df_result.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p><strong>Logical Plan Simplified:</strong></p>
<pre><code class="lang-python">Sort [total_salary DESC]

└─ Exchange (<span class="hljs-keyword">global</span> sort)

   └─ SortMergeJoin [department]

      ├─ Exchange (groupBy shuffle)

      │   └─ HashAggregate (sum salary)

      └─ Exchange (distinct shuffle)

          └─ Aggregate (department, country)
</code></pre>
<p>We can see three Exchange nodes: one for the aggregation, one for the distinct, and one for the global sort. That’s three separate shuffles, each moving data across the cluster.</p>
<h4 id="heading-better-approach">Better Approach</h4>
<p>Whenever possible, combine wide transformations into a single stage before an action. For instance, you can compute aggregates and join results in one consistent shuffle domain:</p>
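<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(
    sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>)
)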

country_df = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result = (
    agg_df.join(country_df, <span class="hljs-string">"department"</span>)
          .sortWithinPartitions(<span class="hljs-string">"total_salary"</span>, ascending=<span class="hljs-literal">False</span>)
)
</code></pre>
<p><strong>Logical Plan Simplified:</strong></p>
<pre><code class="lang-python">SortWithinPartitions [total_salary DESC]

└─ SortMergeJoin [department]

   ├─ Exchange (shared shuffle <span class="hljs-keyword">for</span> join)

   └─ Exchange (shared shuffle <span class="hljs-keyword">for</span> distinct)
</code></pre>
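<p>Now the Exchange for the global sort is gone. The only shuffles left are the ones that feed the join, and the final sort runs as a narrow transformation inside each partition. Keep in mind that sortWithinPartitions() does not produce a globally ordered result, so use it only when per-partition ordering is enough.</p>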
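<p>The semantic difference between a per-partition sort and a global sort is easy to miss, so here is a minimal pure-Python sketch of it. Partitions are modelled as lists, and <code>sort_within_partitions</code> and <code>order_by</code> are hypothetical stand-ins for the Spark methods of similar names, not their real implementations.</p>

```python
def sort_within_partitions(partitions, reverse=False):
    """Sort each partition on its own; no rows move between partitions."""
    return [sorted(part, reverse=reverse) for part in partitions]


def order_by(partitions, reverse=False):
    """Global sort: rows are redistributed so partition i comes entirely before partition i+1."""
    rows = sorted((row for part in partitions for row in part), reverse=reverse)
    n = max(len(partitions), 1)
    size = -(-len(rows) // n)  # ceiling division: rows per output partition
    return [rows[i * size:(i + 1) * size] for i in range(n)]


def flatten(partitions):
    return [row for part in partitions for row in part]


parts = [[5, 1, 9], [7, 3, 8]]

local = sort_within_partitions(parts, reverse=True)
global_ = order_by(parts, reverse=True)

print("within partitions:", local, "flattened:", flatten(local))
print("global order:     ", global_, "flattened:", flatten(global_))
```

<p>Reading the locally sorted partitions back to back does not give a sorted sequence, while the global version does. The global version is also the one that has to move rows between partitions, which is where the extra Exchange comes from.</p>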
<h4 id="heading-real-world-benchmark-aws-glue-1m">Real-World Benchmark: AWS Glue (1M Rows)</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]).repartition(<span class="hljs-number">20</span>)

<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

start = time.time()

dept_salary = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
)

dept_country = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
      .distinct()
)

naive_result = (
    dept_salary.join(dept_country, <span class="hljs-string">"department"</span>, <span class="hljs-string">"inner"</span>)
               .orderBy(col(<span class="hljs-string">"total_salary"</span>).desc())
)

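naive_count = naive_result.count()
naive_time = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Naive result count:"</span>, naive_count)
print(<span class="hljs-string">"Naive pipeline time:"</span>, naive_time, <span class="hljs-string">"sec"</span>)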


start = time.time()

dept_country_once = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
      .distinct()
)

optimized = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
      .join(dept_country_once, <span class="hljs-string">"department"</span>, <span class="hljs-string">"inner"</span>)
      .sortWithinPartitions(col(<span class="hljs-string">"total_salary"</span>).desc())
      <span class="hljs-comment"># local ordering, avoids extra global shuffle</span>
)

opt_count = optimized.count()
opt_time = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Optimized result count:"</span>, opt_count)
print(<span class="hljs-string">"Optimized pipeline time:"</span>, opt_time, <span class="hljs-string">"sec"</span>)

print(<span class="hljs-string">"\nOptimized plan:"</span>)
optimized.explain(<span class="hljs-string">"formatted"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Pipeline</strong></td><td><strong># of Shuffles</strong></td><td><strong>Glue Runtime</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Naive: groupBy + distinct + orderBy</td><td>3</td><td>28.99 s</td><td>Multiple wide stages</td></tr>
<tr>
<td>Optimized: combined agg + join + sortWithinPartitions</td><td>1</td><td>3.52 s</td><td>Single wide stage</td></tr>
</tbody>
</table>
</div><p>By merging compatible stages and using sortWithinPartitions() instead of a global orderBy(), the job ran roughly 8× faster on the same dataset (28.99 s vs 3.52 s), with fewer Exchange nodes and shorter lineage. To find shuffles in your own jobs, run <code>df.explain()</code> and search for Exchange; each one signals a shuffle. You can also check Spark UI → SQL tab → Exchange Read/Write Size to see exactly how much data moved.</p>
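<p>Searching a plan for Exchange nodes is easy to automate. The sketch below is a small text-based check, assuming you have the plan as a string (for example, copied from the output of <code>explain()</code> in its default mode). <code>count_exchanges</code> is a hypothetical helper, and the sample text is the simplified naive plan from this scenario rather than real Spark output.</p>

```python
import re


def count_exchanges(plan_text):
    """Count shuffle Exchange operators in a plan dump.

    The word-boundary match skips operator names such as BroadcastExchange
    or ReusedExchange, which do not represent a new shuffle.
    """
    return len(re.findall(r"\bExchange\b", plan_text))


naive_plan = """
Sort [total_salary DESC]
└─ Exchange (global sort)
   └─ SortMergeJoin [department]
      ├─ Exchange (groupBy shuffle)
      │   └─ HashAggregate (sum salary)
      └─ Exchange (distinct shuffle)
          └─ Aggregate (department, country)
"""

print("Exchange nodes:", count_exchanges(naive_plan))
```

<p>A check like this can act as a cheap regression guard: if a refactor raises the count for a pipeline, a new shuffle has probably crept in. Note that the formatted explain mode lists each operator twice (once in the tree and once in the details), so this simple count would need adjusting there.</p>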
<p>Every Exchange represents a shuffle, adding serialization, network I/O, and stage overhead, so avoid chaining wide operations back-to-back by combining them under a consistent partition key. Prefer sortWithinPartitions() over global orderBy() when ordering is local, monitor plan depth to catch consecutive wide dependencies, and note that in AWS Glue eliminating even one shuffle in a 1M-row job can significantly reduce runtime.</p>
<h3 id="heading-scenario-13-tune-parallelism-shuffle-partitions-amp-aqe">Scenario 13: Tune Parallelism: Shuffle Partitions &amp; AQE</h3>
<p>Most Spark jobs are either over-parallelized (thousands of tiny tasks doing almost nothing, flooding the driver and filesystem) or under-parallelized (a handful of huge tasks doing all the work, causing slow stages and skew-like behavior). Both waste resources. We can control this behavior using spark.sql.shuffle.partitions and Adaptive Query Execution (AQE).</p>
<p>In many environments, <code>spark.conf.get("spark.sql.shuffle.partitions")</code> returns the default of 200, meaning that every shuffle (groupBy, join, distinct, and so on) produces ~200 shuffle partitions, regardless of data size. Whether this default is reasonable depends entirely on the workload:</p>
<ul>
<li><p>If you’re processing 2 GB, 200 partitions might be great.</p>
</li>
<li><p>If you’re processing 5 MB, 200 partitions is comedy – 200 tiny tasks, overhead &gt; work.</p>
</li>
<li><p>If you’re processing 2 TB, 200 partitions might be too few – tasks become huge and slow.</p>
</li>
</ul>
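<p>The sizing intuition above can be turned into a rough calculation. This is a back-of-the-envelope sketch, not a Spark API: <code>suggest_shuffle_partitions</code> is a hypothetical helper, and the 128 MiB target partition size and the 20,000 ceiling are assumptions chosen for illustration. Pick values that suit your cluster.</p>

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB
TIB = 1024 * GIB


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * MIB,  # assumed target size
                               max_partitions=20000):             # assumed ceiling
    """Estimate a shuffle partition count so each partition is near the target size."""
    wanted = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(1, min(wanted, max_partitions))


for label, size in [("5 MiB", 5 * MIB), ("2 GiB", 2 * GIB), ("2 TiB", 2 * TIB)]:
    print(label, "->", suggest_shuffle_partitions(size), "partitions")
```

<p>The resulting number could be passed to <code>spark.conf.set("spark.sql.shuffle.partitions", ...)</code> before a job runs. It is only a starting point, because the size of the data actually shuffled is often very different from the size of the input.</p>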
<h4 id="heading-example-a-the-default-plan-too-many-tiny-tasks">Example A: The Default Plan (Too Many Tiny Tasks)</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

spark = SparkSession.builder.appName(<span class="hljs-string">"ParallelismExample"</span>).getOrCreate()

spark.conf.get(<span class="hljs-string">"spark.sql.shuffle.partitions"</span>)  <span class="hljs-comment"># '200'</span>

data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">75000</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">72000</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">65000</span>),
]

df = spark.createDataFrame(data, [<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>])

agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p>Even though there are only 3 departments, Spark (with AQE disabled) will still create 200 shuffle partitions, meaning 200 tasks for 3 groups of data.</p>
<p><strong>Effect:</strong> Each task has almost nothing to do. Spark spends more time planning and scheduling than actually computing.</p>
<h4 id="heading-example-b-tuned-plan-balanced-parallelism">Example B: Tuned Plan (Balanced Parallelism)</h4>
<pre><code class="lang-python">spark.conf.set(<span class="hljs-string">"spark.sql.shuffle.partitions"</span>, <span class="hljs-string">"8"</span>)
agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p>Now Spark uses only <strong>8 shuffle partitions</strong>: still parallelized, but not wasteful. Even in this small example, you can see the difference in the plan: one configuration change, and a much leaner physical plan.</p>
<h4 id="heading-the-real-problem-static-tuning-doesnt-scale">The Real Problem: Static Tuning Doesn’t Scale</h4>
<p>In production, job sizes vary:</p>
<ul>
<li><p>Today: 10 GB</p>
</li>
<li><p>Tomorrow: 500 GB</p>
</li>
<li><p>Next week: 200 MB (sampling run)</p>
</li>
</ul>
<p>Manually changing shuffle partitions for each run is neither practical nor reliable. That’s where Adaptive Query Execution (AQE) steps in.</p>
<h4 id="heading-adaptive-query-execution-aqe-smarter-dynamic-parallelism">Adaptive Query Execution (AQE): Smarter, Dynamic Parallelism</h4>
<p>AQE doesn’t guess. It measures actual shuffle statistics at runtime and rewrites the plan <em>while the job is running.</em></p>
<pre><code class="lang-python">spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.enabled"</span>, <span class="hljs-string">"true"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.coalescePartitions.enabled"</span>, <span class="hljs-string">"true"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.coalescePartitions.minPartitionSize"</span>, <span class="hljs-string">"64m"</span>)
<span class="hljs-comment"># Target size AQE aims for when coalescing shuffle partitions</span>
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.advisoryPartitionSizeInBytes"</span>, <span class="hljs-string">"256m"</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Configuration</strong></td><td><strong>Shuffle Partitions</strong></td><td><strong>Task Distribution</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Default</td><td>200</td><td>200 tasks / 3 groups</td><td>Too granular, mostly idle</td></tr>
<tr>
<td>Tuned</td><td>8</td><td>8 tasks / 3 groups</td><td>Balanced execution</td></tr>
</tbody>
</table>
</div><p>AQE merges tiny shuffle partitions, or splits huge ones, based on <strong>real-time data metrics</strong>, not pre-set assumptions.</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
     <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.count()

print(<span class="hljs-string">f'Num Partitions df: <span class="hljs-subst">{df.rdd.getNumPartitions()}</span>'</span>)
print(<span class="hljs-string">f'Num Partitions aggdf: <span class="hljs-subst">{agg_df.rdd.getNumPartitions()}</span>'</span>)
print(<span class="hljs-string">"Execution time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Stage</strong></td><td><strong>Without AQE</strong></td><td><strong>With AQE</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Stage 3 (Aggregation)</td><td>200 shuffle partitions, each reading KBs</td><td>8–12 coalesced partitions</td></tr>
<tr>
<td>Stage 4 (Join Output)</td><td>200 shuffle files</td><td>Merged into balanced partitions</td></tr>
<tr>
<td><strong>Result</strong></td><td>Many small tasks, high overhead</td><td>Fewer, balanced tasks, faster runtime</td></tr>
</tbody>
</table>
</div><h4 id="heading-understanding-the-plan"><strong>Understanding the Plan</strong></h4>
<p>Before AQE (static):</p>
<p><code>Exchange hashpartitioning(department, 200)</code></p>
<p>With AQE: AdaptiveSparkPlan (coalesced)</p>
<p><code>HashAggregate(keys=[department], functions=[sum(salary)])</code></p>
<p><code>Exchange hashpartitioning(department, 200)</code>  <em># runtime coalesced to 12</em></p>
<p>The logical plan remains the same, but the physical execution plan is rewritten during runtime. Spark intelligently reduces or merges shuffle partitions based on data volume.</p>
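<p>The idea behind runtime coalescing can be sketched in a few lines of plain Python: walk the shuffle partitions in order and merge neighbours until the next one would push the group past a target size. This is a simplified model of the behaviour described above, not Spark's implementation. <code>coalesce_shuffle_partitions</code> is a hypothetical name, and the sizes are made-up numbers in KB.</p>

```python
def coalesce_shuffle_partitions(sizes, target_size):
    """Greedily group adjacent partitions so each group stays within target_size.

    Returns a list of groups, each a list of original partition indices.
    A single partition larger than the target still gets its own group.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_size:
            groups.append(current)          # close the group before it overflows
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Seven small shuffle partitions (sizes in KB), target of 64 KB per task
small_example = coalesce_shuffle_partitions([10, 20, 30, 40, 50, 5, 5], target_size=64)
print(small_example)

# 200 tiny partitions of 4 KB each collapse into far fewer tasks
many_tiny = coalesce_shuffle_partitions([4] * 200, target_size=64)
print(len(many_tiny), "tasks instead of 200")
```

<p>Only adjacent partitions are merged, so the plan still shows the original partition count on the Exchange. The reduction happens when the downstream stage decides how many tasks to launch.</p>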
<p>Spark’s default 200 shuffle partitions often misfit real workloads. Static tuning may work for predictable pipelines, but fails with variable data. AQE, on the other hand, uses shuffle statistics to dynamically coalesce partitions at runtime. Use it with sensible ceilings (for example, 400 partitions), and always verify in the Spark UI to catch over-partitioning (many tasks reading KBs) or under-partitioning (few tasks reading GBs).</p>
<h3 id="heading-scenario-14-handle-skew-smartly">Scenario 14: Handle Skew Smartly</h3>
<p>In an ideal Spark world, all partitions contain roughly equal amounts of data. But real datasets are rarely that kind. If one key (say "USA", "2024", or "customer_123") holds millions of rows while others have only a few, Spark ends up with one or two massive partitions. Those partitions take disproportionately longer to process, leaving other executors idle. That’s data skew: the silent killer of parallelism.</p>
<p>You’ll often spot it in Spark UI:</p>
<ul>
<li><p>198 tasks finish quickly.</p>
</li>
<li><p>2 tasks take 10× longer.</p>
</li>
<li><p>Stage stays stuck at 98% for minutes.</p>
</li>
</ul>
<h4 id="heading-example-a-the-skew-problem">Example A: The Skew Problem</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession, functions <span class="hljs-keyword">as</span> F

spark = SparkSession.builder.appName(<span class="hljs-string">"DataSkewDemo"</span>).getOrCreate()

<span class="hljs-comment"># Create skewed dataset</span>
df = spark.range(<span class="hljs-number">0</span>, <span class="hljs-number">10000</span>).toDF(<span class="hljs-string">"id"</span>) \
    .withColumn(<span class="hljs-string">"department"</span>,
        F.when(F.col(<span class="hljs-string">"id"</span>) &lt; <span class="hljs-number">8000</span>, <span class="hljs-string">"Engineering"</span>)  <span class="hljs-comment"># 80% of data</span>
         .when(F.col(<span class="hljs-string">"id"</span>) &lt; <span class="hljs-number">9000</span>, <span class="hljs-string">"Sales"</span>)
         .otherwise(<span class="hljs-string">"HR"</span>)) \
    .withColumn(<span class="hljs-string">"salary"</span>, (F.rand() * <span class="hljs-number">100000</span>).cast(<span class="hljs-string">"int"</span>))

df.groupBy(<span class="hljs-string">"department"</span>).count().show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464257950/6963171b-92de-4721-9bb3-6951c68a2775.png" alt="6963171b-92de-4721-9bb3-6951c68a2775" class="image--center mx-auto" width="594" height="446" loading="lazy"></p>
<p>Spark will hash “Engineering” into just one reducer partition, making it much heavier than the others. That single task becomes the bottleneck: the rest of the shuffle has completed, but the stage waits for that one lagging task.</p>
<h4 id="heading-example-b-the-solution-salting-hot-keys">Example B: The Solution: Salting Hot Keys</h4>
<p>To handle skew, we split the hot key (Engineering) into multiple pseudo-keys using a random salt. This redistributes that large partition across multiple reducers.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> rand, concat, lit, floor

salt_buckets = <span class="hljs-number">10</span>

df_salted = (
    df.withColumn(
        <span class="hljs-string">"department_salted"</span>,
        F.when(F.col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>,
            F.concat(F.col(<span class="hljs-string">"department"</span>), lit(<span class="hljs-string">"_"</span>),
                     (F.floor(rand() * salt_buckets))))
         .otherwise(F.col(<span class="hljs-string">"department"</span>))
    )
)

df_salted.groupBy(<span class="hljs-string">"department_salted"</span>).agg(F.avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464242395/c4ec0bc6-67bf-488c-b619-7130ceef878e.png" alt="c4ec0bc6-67bf-488c-b619-7130ceef878e" class="image--center mx-auto" width="536" height="468" loading="lazy"></p>
<p>Now “Engineering” isn’t one hot key – it’s <strong>10 smaller keys</strong> like Engineering_0, Engineering_1, ..., Engineering_9. Each salted key is hashed independently, so the rows spread across several reducer partitions and can be processed in parallel.</p>
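<p>To see the effect of salting without a cluster, here is a small pure-Python sketch that mimics the 80/10/10 key distribution from the example and compares group sizes before and after salting. The <code>salt_key</code> helper and the fixed seed are assumptions made for illustration; this is not how Spark implements anything internally:</p>

```python
import random
from collections import Counter


def salt_key(key, hot_keys, buckets, rng):
    """Append a random bucket suffix to hot keys; leave other keys unchanged."""
    if key in hot_keys:
        return f"{key}_{rng.randrange(buckets)}"
    return key


rng = random.Random(42)  # fixed seed so the sketch is repeatable
keys = ["Engineering"] * 8000 + ["Sales"] * 1000 + ["HR"] * 1000

before = Counter(keys)
after = Counter(salt_key(k, {"Engineering"}, 10, rng) for k in keys)

print("largest group before salting:", max(before.values()))
print("largest group after salting: ", max(after.values()))
print("distinct keys after salting: ", len(after))
```

<p>The largest group shrinks because the 8,000 Engineering rows are spread over ten salted keys, while Sales and HR are left untouched.</p>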
<h4 id="heading-example-c-post-aggregation-desalting">Example C: Post-Aggregation Desalting</h4>
<p>After aggregating, recombine salted keys to get the original department names:</p>
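<pre><code class="lang-python"><span class="hljs-comment"># Combine per-bucket sums and counts: averaging the per-bucket averages</span>
<span class="hljs-comment"># would be biased whenever the salt buckets differ in size.</span>
df_final = (
    df_salted.groupBy(<span class="hljs-string">"department_salted"</span>)
        .agg(F.sum(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"sum_salary"</span>),
             F.count(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"n_rows"</span>))
        .withColumn(<span class="hljs-string">"department"</span>, F.split(F.col(<span class="hljs-string">"department_salted"</span>), <span class="hljs-string">"_"</span>)
            .getItem(<span class="hljs-number">0</span>))
        .groupBy(<span class="hljs-string">"department"</span>)
        .agg((F.sum(<span class="hljs-string">"sum_salary"</span>) / F.sum(<span class="hljs-string">"n_rows"</span>)).alias(<span class="hljs-string">"final_avg_salary"</span>))
)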
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464321049/6349c2c3-a0e3-4f9e-be3e-c59639004128.png" alt="6349c2c3-a0e3-4f9e-be3e-c59639004128" class="image--center mx-auto" width="540" height="242" loading="lazy"></p>
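<p>One caveat when desalting averages: an average of per-bucket averages is only exact when every salt bucket holds the same number of rows. With random salts the buckets are usually close in size, so the error tends to be small, but it is not zero. The toy numbers below are made up purely to exaggerate the effect:</p>

```python
# Two salt buckets of unequal size for the same department (made-up salaries).
buckets = {
    "Engineering_0": [100, 100, 100],  # 3 rows
    "Engineering_1": [400],            # 1 row
}

# Averaging the per-bucket averages weights each bucket equally.
bucket_avgs = [sum(v) / len(v) for v in buckets.values()]
avg_of_avgs = sum(bucket_avgs) / len(bucket_avgs)

# Combining per-bucket sums and counts weights each row equally.
total = sum(sum(v) for v in buckets.values())
count = sum(len(v) for v in buckets.values())
true_avg = total / count

print(avg_of_avgs)  # 250.0
print(true_avg)     # 175.0
```

<p>This is why desalting an average is safest when the second aggregation combines per-bucket sums and counts. Sums, counts, minimums, and maximums can be recombined directly.</p>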
<h4 id="heading-when-to-use-salting">When to Use Salting</h4>
<p>Use salting when:</p>
<ul>
<li><p>You observe stage skew (one or few long tasks).</p>
</li>
<li><p>Shuffle read sizes vary drastically between tasks.</p>
</li>
<li><p>The skew originates from a few dominant key values.</p>
</li>
</ul>
<p>Avoid it when:</p>
<ul>
<li><p>The dataset is small (&lt; 1 GB).</p>
</li>
<li><p>You already use partitioning or bucketing keys with uniform distribution.</p>
</li>
</ul>
<p><strong>Alternative approaches:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Technique</strong></td><td><strong>Use Case</strong></td><td><strong>Pros</strong></td><td><strong>Cons</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Salting (manual)</td><td>Skewed joins/aggregations</td><td>Full control</td><td>Requires extra logic to merge</td></tr>
<tr>
<td>Skew join hints (for example, /*+ SKEW('table') */ on Databricks)</td><td>Joins on platforms that support skew hints</td><td>No extra columns needed</td><td>Works only on joins; not part of open-source Spark's hint set</td></tr>
<tr>
<td>Broadcast smaller side</td><td>One table ≪ other</td><td>Avoids shuffle on big side</td><td>Limited by broadcast size</td></tr>
<tr>
<td>AQE skew optimization</td><td>Spark 3.0+</td><td>Automatic handling</td><td>Needs AQE enabled</td></tr>
</tbody>
</table>
</div><h4 id="heading-glue-specific-tip">Glue-Specific Tip</h4>
<p>AWS Glue 3.0+ includes Spark 3.x, meaning you can also enable AQE’s built-in skew optimization:</p>
<p><code>spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")</code></p>
<p><code>spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "128m")</code></p>
<p>With AQE enabled (<code>spark.sql.adaptive.enabled</code>), Spark automatically detects oversized shuffle partitions in joins and splits them, which works much like auto-salting hot keys at runtime.</p>
<p>In short: data skew causes uneven shuffle sizes across tasks and can be detected in the Spark UI or via shuffle read/write metrics. Mitigate heavy-key skew with manual salting (recombined later), or rely on AQE's skew join optimization for milder cases. Always validate improvements in the Spark UI SQL tab by checking “Shuffle Read Size.”</p>
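<p>The rule AQE uses to flag a skewed partition can be approximated in a few lines. According to the Spark configuration docs, a partition counts as skewed when it is larger than <code>skewedPartitionFactor</code> times the median partition size and also larger than <code>skewedPartitionThresholdInBytes</code>. The sketch below is a simplified model of that rule, not Spark's actual code. The factor of 5 is Spark's documented default, the 128 MB threshold mirrors the setting above, and the partition sizes are invented:</p>

```python
from statistics import median

MB = 1024 * 1024


def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=128 * MB):
    """Return indexes of partitions that exceed both the relative and absolute limits."""
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


# Hypothetical shuffle partition sizes: three small ones and one hot partition.
sizes = [20 * MB, 25 * MB, 30 * MB, 900 * MB]
print(skewed_partitions(sizes))  # [3]
```

<p>A partition must pass both checks, so a uniformly large shuffle (high median) is not treated as skewed. That case calls for more partitions, not for skew handling.</p>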
<h3 id="heading-scenario-15-sort-efficiently-orderby-vs-sortwithinpartitions">Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)</h3>
<p>Most Spark jobs need sorted data at some point – for window functions, for writing ordered files, or for downstream processing. The instinct is to reach for orderBy(). But that instinct costs you a full shuffle every single time.</p>
<h4 id="heading-the-problem-global-sort-when-you-dont-need-it">The Problem: Global Sort When You Don't Need It</h4>
<p>Let's say you want to write employee data partitioned by department, sorted by salary within each department:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col

<span class="hljs-comment"># Naive approach: global sort</span>
df_sorted = df.orderBy(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())

df_sorted.write.partitionBy(<span class="hljs-string">"department"</span>).parquet(<span class="hljs-string">"s3://output/employees/"</span>)
</code></pre>
<p>This looks reasonable. You're sorting by department and salary, then writing partitioned files. Clean and simple. But here's what Spark actually does:</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Sort [department ASC, salary DESC], true

└─ Exchange rangepartitioning(department ASC, salary DESC, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>That Exchange <code>rangepartitioning</code> is a full shuffle. So Spark:</p>
<ul>
<li><p>Samples the data to determine range boundaries</p>
</li>
<li><p>Redistributes every row across 200 partitions based on sort keys</p>
</li>
<li><p>Sorts each partition locally</p>
</li>
<li><p>Produces globally ordered output</p>
</li>
</ul>
<p>You just shuffled 1 million rows across the cluster to achieve global ordering – even though you're immediately partitioning by department on write, which destroys that global order anyway.</p>
<h4 id="heading-why-this-hurts">Why This Hurts</h4>
<p>Range partitioning for global sort is one of the most expensive shuffles Spark performs:</p>
<ul>
<li><p>Sampling overhead: Spark must scan data twice (once to sample, once to process)</p>
</li>
<li><p>Network transfer: Every row moves to a new executor based on range boundaries</p>
</li>
<li><p>Disk I/O: Shuffle files written and read from disk</p>
</li>
<li><p>Wasted work: Global ordering across departments is meaningless when you partition by department</p>
</li>
</ul>
<p>For 1M rows, this adds 8-12 seconds of pure shuffle overhead.</p>
<h4 id="heading-the-better-approach-sort-locally-within-partitions">The Better Approach: Sort Locally Within Partitions</h4>
<p>If you only need ordering <em>within</em> each department (or within each output partition), use sortWithinPartitions():</p>
<pre><code class="lang-python"><span class="hljs-comment"># Optimized approach: local sort only</span>
df_sorted = df.sortWithinPartitions(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_sorted.write.partitionBy(<span class="hljs-string">"department"</span>).parquet(<span class="hljs-string">"s3://output/employees/"</span>)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Sort [department ASC, salary DESC], false

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<ul>
<li><p>No Exchange.</p>
</li>
<li><p>No shuffle.</p>
</li>
<li><p>Just local sorting within existing partitions.</p>
</li>
</ul>
<p>Spark sorts each partition in-place, without moving data across the network. The false flag in the Sort node indicates this is a local sort, not a global one.</p>
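<p>The difference between the two kinds of ordering can be illustrated with plain Python lists standing in for partitions. This is only a conceptual model with made-up rows. Real Spark partitions live on different executors, which is exactly why the global version needs a shuffle:</p>

```python
# Rows as (department, salary); two "partitions" as they might sit on executors.
partitions = [
    [("Sales", 50), ("Engineering", 90), ("HR", 40)],
    [("Engineering", 70), ("HR", 60), ("Sales", 80)],
]


def sort_key(row):
    department, salary = row
    return (department, -salary)  # department ASC, salary DESC


# sortWithinPartitions-style: each partition is sorted on its own, no rows move.
local = [sorted(p, key=sort_key) for p in partitions]

# orderBy-style: all rows are ordered relative to each other across partitions.
global_order = sorted((row for p in partitions for row in p), key=sort_key)

print(local)
print(global_order)
```

<p>In the local result, every partition is ordered internally, but Engineering rows still appear in both partitions. Only the global result places all rows in one total order.</p>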
<h4 id="heading-real-world-benchmark-aws-glue-9">Real-World Benchmark: AWS Glue</h4>
<p>Let's measure the difference on 1 million employee records. First, we'll start with a global sort using orderBy():</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time

print(<span class="hljs-string">"\n--- Testing orderBy() (global sort) ---"</span>)

start = time.time()

df_global = df.orderBy(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_global.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(<span class="hljs-string">"/tmp/global_sort_output"</span>)

global_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"orderBy() time: <span class="hljs-subst">{global_time}</span>s"</span>)
</code></pre>
<p>Next, the local sort with sortWithinPartitions():</p>
<pre><code class="lang-python">print(<span class="hljs-string">"\n--- Testing sortWithinPartitions() (local sort) ---"</span>)

start = time.time()

df_local = df.sortWithinPartitions(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_local.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(<span class="hljs-string">"/tmp/local_sort_output"</span>)

local_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"sortWithinPartitions() time: <span class="hljs-subst">{local_time}</span>s"</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Plan Type</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>orderBy()</td><td>Exchange rangepartitioning</td><td>10.34 s</td><td>Full shuffle for global sort</td></tr>
<tr>
<td>sortWithinPartitions()</td><td>Local Sort (no Exchange)</td><td>2.18 s</td><td>In-place sorting, no network transfer</td></tr>
</tbody>
</table>
</div><p><strong>Physical Plan Differences:</strong></p>
<p><strong>orderBy() Physical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">2</span>) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, <span class="hljs-number">0</span>

+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, <span class="hljs-number">200</span>)

   +- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

      +- *(<span class="hljs-number">1</span>) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The Exchange rangepartitioning node marks the shuffle boundary. Spark must:</p>
<ul>
<li><p>Sample data to determine range splits</p>
</li>
<li><p>Redistribute all rows across executors</p>
</li>
<li><p>Sort within each range partition</p>
</li>
</ul>
<p><strong>sortWithinPartitions() Physical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, <span class="hljs-number">0</span>

+- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(<span class="hljs-number">1</span>) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>No Exchange. The false flag in Sort indicates local sorting only. Each partition is sorted independently, in parallel, without any data movement.</p>
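<p>Because the presence of an Exchange node is the signal to look for, a plan check can be automated with a simple text scan. The helper below is a hypothetical utility (not a Spark API) that works on plan text such as the output of <code>explain()</code>. The two plan strings are shortened versions of the plans shown above:</p>

```python
def shuffle_operators(plan_text):
    """Return the plan lines that mention an Exchange (shuffle) operator."""
    return [line.strip() for line in plan_text.splitlines() if "Exchange" in line]


order_by_plan = """\
*(2) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, 200)
   +- *(1) Scan ExistingRDD[id, department, salary]
"""

local_sort_plan = """\
*(1) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, 0
+- *(1) Scan ExistingRDD[id, department, salary]
"""

print(len(shuffle_operators(order_by_plan)))    # 1
print(len(shuffle_operators(local_sort_plan)))  # 0
```

<p>Note that this simple substring match also catches related operators such as BroadcastExchange, so treat a hit as a prompt to read the plan rather than as proof of an expensive shuffle.</p>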
<p><strong>When to Use Which:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Use Case</strong></td><td><strong>Method</strong></td><td><strong>Why</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Writing partitioned files (Parquet, Delta)</td><td>sortWithinPartitions()</td><td>Partition-level order is sufficient; global order wasted</td></tr>
<tr>
<td>Window functions with ROWS BETWEEN</td><td>sortWithinPartitions()</td><td>Only need order within each window partition</td></tr>
<tr>
<td>Top-N per group (rank, dense_rank)</td><td>sortWithinPartitions()</td><td>Ranking is local to each partition key</td></tr>
<tr>
<td>Final output must be globally ordered</td><td>orderBy()</td><td>Need total order across all partitions</td></tr>
<tr>
<td>Downstream system requires strict ordering</td><td>orderBy()</td><td>For example, time-series data for sequential processing</td></tr>
<tr>
<td>Sorting before coalesce() for fewer output files</td><td>sortWithinPartitions()</td><td>Maintains order within merged partitions</td></tr>
</tbody>
</table>
</div><h4 id="heading-common-anti-pattern">Common Anti-Pattern</h4>
<pre><code class="lang-python">df.orderBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
  .write.partitionBy(<span class="hljs-string">"department"</span>) \
  .parquet(<span class="hljs-string">"output/"</span>)
</code></pre>
<p><strong>Problem:</strong> You're globally sorting by department, then immediately partitioning by department. The global order is destroyed during partitioning.</p>
<p>Here’s the fix:</p>
<pre><code class="lang-python">df.sortWithinPartitions(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
  .write.partitionBy(<span class="hljs-string">"department"</span>) \
  .parquet(<span class="hljs-string">"output/"</span>)
</code></pre>
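<p>Or, if you also want each department's rows co-located before the write (which typically means fewer files per department directory), repartition by the partition column and then sort locally. Note that <code>DataFrameWriter.sortBy()</code> only works together with <code>bucketBy()</code> and <code>saveAsTable()</code>, so it can't be chained onto a plain <code>.parquet()</code> write:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Hash-repartition by department (no sampling pass, unlike a range shuffle), then sort locally</span>
df.repartition(<span class="hljs-string">"department"</span>) \
    .sortWithinPartitions(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
    .write.partitionBy(<span class="hljs-string">"department"</span>) \
    .parquet(<span class="hljs-string">"output/"</span>)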
</code></pre>
<p>orderBy() triggers an expensive full shuffle using range partitioning, while sortWithinPartitions() sorts data locally without a shuffle and was roughly 4–5× faster in the benchmark above. Use sortWithinPartitions() when writing partitioned files, computing window functions with partitionBy(), or when order is needed only within groups. Reserve orderBy() strictly for true global ordering. In most production ETL, the best sort is the one that doesn’t shuffle.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You likely began this handbook wondering why your Spark application was slow. The answer turns out to be simple to state but easy to overlook: your problem was never your Spark application, your configuration, or your version of Spark. It was your plan all along.</p>
<p>You now understand that Spark runs plans, not code, that transformation order affects logical plans, that shuffles generate stages and are key to runtime performance, and that examining your physical plans allows you to directly link your application performance issues back to your problematic line of code.</p>
<p>And you’ve seen this pattern repeat across many scenarios: problem, plan, solution, improved plan, and so forth, until optimization feels less like a dark art and more like a certainty.</p>
<p>This is the Spark optimization mindset: read plans before you write code, and challenge every single Exchange. Engineers who write high-performance Spark jobs minimize shuffles, filter early, project narrowly, deal with skew carefully, and validate everything via explain() and the Spark UI. Once you learn to read the plan, Spark performance becomes mechanical.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Loading with Python and AI ]]>
                </title>
                <description>
                    <![CDATA[ Modern data pipelines are the backbone of data engineering, enabling organizations to collect, process, and leverage massive volumes of information efficiently. But building and maintaining these pipelines isn't always straightforward. From API rate ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-loading-with-python-and-ai/</link>
                <guid isPermaLink="false">68016ae0a7699ff474c9f1d0</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 17 Apr 2025 20:56:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744923345695/c75fb9d7-4552-439a-9550-9c2d63be940d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Modern data pipelines are the backbone of data engineering, enabling organizations to collect, process, and leverage massive volumes of information efficiently. But building and maintaining these pipelines isn't always straightforward. From API rate limits and changing data schemas to ensuring consistent loading and transformation, engineers face many challenges that can disrupt operations. Mastering data ingestion, the process of collecting and importing data for immediate use or storage, is important for building resilient, scalable systems that can evolve with business needs.</p>
<p>We just published a course on the freeCodeCamp.org YouTube channel that will teach you all about mastering data ingestion for data engineering using Python. Created by Alexey Grigorev and Adrian Brudaru and supported by a grant from <a target="_blank" href="https://dlthub.com/">dlthub.com</a>, this comprehensive course dives deep into the core challenges of building robust data pipelines and provides practical, real-world solutions. Whether you're an aspiring data engineer or a developer looking to level up, this course equips you with senior-level strategies to design pipelines that gracefully handle schema evolution, API limitations, and more.</p>
<p>In Alexey's section of the course, you'll start with the foundations: understanding what data ingestion really means and how to approach it through streaming, batching, and working with REST APIs. You'll learn to normalize incoming data, load it into tools like DuckDB, and implement dynamic schema management to future-proof your pipelines.</p>
<p>Adrian then teaches how to use <a target="_blank" href="https://github.com/dlt-hub/dlthub">DLT</a> (Data Load Tool), an open-source Python library for data loading, to simplify and scale your pipeline implementations. You'll go hands-on with configuring secrets, managing data contracts, handling incremental loading, tuning performance, and deploying your pipelines using tools like GitHub Actions, Crontab, Dagster, and Airflow. There’s even an exciting section on creating data pipelines using LLMs, where you’ll learn to craft effective prompts and integrate generative AI into your workflows.</p>
<p>Here is the full list of sections in this course:</p>
<p><strong>Alexey's part</strong></p>
<ul>
<li><p>Introduction</p>
</li>
<li><p>What is data ingestion</p>
</li>
<li><p>Extracting data: Data Streaming &amp; Batching</p>
</li>
<li><p>Extracting data: Working with RestAPI</p>
</li>
<li><p>Normalizing data</p>
</li>
<li><p>Loading data into DuckDB</p>
</li>
<li><p>Dynamic schema management</p>
</li>
<li><p>What is next?</p>
</li>
</ul>
<p><strong>Adrian's part</strong></p>
<ul>
<li><p>Introduction</p>
</li>
<li><p>Overview</p>
</li>
<li><p>Extracting data with dlt: dlt RestAPI Client</p>
</li>
<li><p>dlt Resources</p>
</li>
<li><p>How to configure secrets</p>
</li>
<li><p>Normalizing data with dlt</p>
</li>
<li><p>Data Contracts</p>
</li>
<li><p>Alerting schema changes</p>
</li>
<li><p>Loading data with dlt</p>
</li>
<li><p>Write dispositions</p>
</li>
<li><p>Incremental loading</p>
</li>
<li><p>Loading data from SQL database to SQL database</p>
</li>
<li><p>Backfilling</p>
</li>
<li><p>SCD2</p>
</li>
<li><p>Performance tuning</p>
</li>
<li><p>Loading data to Data Lakes &amp; Lakehouses &amp; Catalogs</p>
</li>
<li><p>Loading data to Warehouses/MPPs, Staging</p>
</li>
<li><p>Deployment &amp; orchestration</p>
</li>
<li><p>Deployment with Git Actions</p>
</li>
<li><p>Deployment with Crontab</p>
</li>
<li><p>Deployment with Dagster</p>
</li>
<li><p>Deployment with Airflow</p>
</li>
<li><p>Create pipelines with LLMs: Understanding the challenge</p>
</li>
<li><p>Create pipelines with LLMs: Creating prompts and LLM friendly documentation</p>
</li>
<li><p>Create pipelines with LLMs: Demo</p>
</li>
</ul>
<p>Check out the full course for free on the <a target="_blank" href="https://youtu.be/T23Bs75F7ZQ">freeCodeCamp.org YouTube channel</a>.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/T23Bs75F7ZQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ From Accountant to Data Engineer with Alyson La [Podcast #168] ]]>
                </title>
                <description>
                    <![CDATA[ On this week's episode of the podcast, freeCodeCamp founder Quincy Larson interviews Alyson La. She taught herself how to code while working as an accountant at GitHub and was able to transition to a data scientist there, then ultimately a software e... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/from-accountant-to-data-engineer-with-alyson-la-podcast-168/</link>
                <guid isPermaLink="false">67f9c04b3e25f02ff9a26b80</guid>
                
                    <category>
                        <![CDATA[ Career ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data engineer ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Quincy Larson ]]>
                </dc:creator>
                <pubDate>Sat, 12 Apr 2025 01:22:19 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744420903260/fae4b593-d653-41eb-b70b-031591aa2f35.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>On this week's episode of the podcast, freeCodeCamp founder Quincy Larson interviews Alyson La. She taught herself how to code while working as an accountant at GitHub and was able to transition to a data scientist there, then ultimately a software engineer.</p>
<p>After one of her kids got diagnosed with autism, she left her career for 3 years to be a full-time mom. She then re-entered the workforce and now teaches other moms how to do the same through a charity called Tech-Moms. She recently won a teacher of the year award and was a top 5 finalist in a data visualization competition.</p>
<p>We talk about:</p>
<ul>
<li><p>How Alyson taught herself programming while working as an accountant</p>
</li>
<li><p>How she transitioned to data analyst and ultimately data engineer</p>
</li>
<li><p>Tips for preparing for a break from work to take care of your family or address burnout</p>
</li>
<li><p>How to re-enter the workforce with gusto</p>
</li>
</ul>
<p>Support for this podcast comes from the 11,384 kind folks who support freeCodeCamp through a monthly donation. You can join these chill human beings and help our charity's mission by going to <a target="_blank" href="http://donate.freecodecamp.org">donate.freecodecamp.org</a>.</p>
<p>You can watch the podcast on YouTube:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/oYMUKagK0n8" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>You can listen to the podcast in Apple Podcasts, Spotify, or your favorite podcast app. Be sure to follow the freeCodeCamp Podcast there so you'll get new episodes each Friday.</p>
<p>Links we talk about during our conversation:</p>
<ul>
<li><p>Alyson's new analytics consultancy: <a target="_blank" href="https://alysonla.com/">https://alysonla.com/</a></p>
</li>
<li><p>The charity Alyson teaches at: <a target="_blank" href="https://www.tech-moms.org/">https://www.tech-moms.org/</a></p>
</li>
<li><p>Tech-Moms' data class: <a target="_blank" href="https://github.com/Tech-Moms/data-analytics-course">https://github.com/Tech-Moms/data-analytics-course</a></p>
</li>
<li><p>The petition site Alyson mentioned: <a target="_blank" href="https://playground-petition-portal-9cfaeecf.vercel.app/">https://playground-petition-portal-9cfaeecf.vercel.app/</a></p>
</li>
<li><p>Alyson's Drake fan page: <a target="_blank" href="https://alysonla.github.io/drizzydrakefanpage/">https://alysonla.github.io/drizzydrakefanpage/</a></p>
</li>
<li><p>Alyson's matching game: <a target="_blank" href="https://alysonla.github.io/hubber-memory-game/">https://alysonla.github.io/hubber-memory-game/</a></p>
</li>
<li><p>Alyson's Substack: <a target="_blank" href="https://alysonsaiplayground.substack.com/">https://alysonsaiplayground.substack.com/</a></p>
</li>
<li><p>Alyson's data visualization app that was a finalist in the recent competition: <a target="_blank" href="https://pixar-scroll-tale.lovable.app/">https://pixar-scroll-tale.lovable.app/</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Databricks Delta Lake with SQL – Full Handbook ]]>
                </title>
                <description>
                    <![CDATA[ Welcome to the Databricks Delta Lake with SQL Handbook! Databricks is a unified analytics platform that brings together data engineering, data science, and business analytics into a collaborative workspace. Delta Lake, a powerful storage layer built ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/databricks-sql-handbook/</link>
                <guid isPermaLink="false">66d45d98680e33282da25e0a</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Atharva Shah ]]>
                </dc:creator>
                <pubDate>Tue, 05 Sep 2023 13:57:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/09/Databricks-Delta-Lake-with-SQL-Handbook-Cover.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Welcome to the Databricks Delta Lake with SQL Handbook! Databricks is a unified analytics platform that brings together data engineering, data science, and business analytics into a collaborative workspace.</p>
<p>Delta Lake, a powerful storage layer built on top of Databricks, provides enhanced reliability, performance, and data quality for big data workloads.</p>
<p>This is a hands-on training guide where you will get a chance to dive into the world of Databricks and learn how to effectively use Delta Lake for managing and analyzing data. It'll provide you with the essential SQL skills to efficiently interact with Delta tables and perform advanced data analytics.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This handbook is designed for beginner-level SQL users who have some experience with cloud platforms and clusters. Although no prior experience with Databricks is required, it is recommended that you have a basic understanding of the following concepts:</p>
<ul>
<li><p><strong>Databases:</strong> Familiarity with the basic structure and functionality of databases will be helpful.</p>
</li>
<li><p><strong>SQL Queries:</strong> Knowledge of SQL syntax and the ability to write basic queries is essential.</p>
</li>
<li><p><strong>Jupyter Notebooks:</strong> Understanding how Jupyter notebooks work and being comfortable with running code cells is recommended.</p>
</li>
</ul>
<p>While this handbook assumes a certain level of familiarity with databases, SQL, and Jupyter notebooks, it will guide you step-by-step through each process, ensuring that you understand and follow along with the material.</p>
<p>As such, no installation is necessary, as all the work will be done on Databricks Delta Notebooks running in the cluster. Everything has already been provisioned, eliminating the need for any setup or configuration.</p>
<p>By the end of this handbook, you will have gained a solid foundation in using SQL with Databricks, enabling you to leverage its powerful capabilities for data analysis and manipulation.</p>
<p>Let's get started!</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<p>Here are the sections of this tutorial:</p>
<ol>
<li><a class="post-section-overview" href="#heading-introduction-to-databricks">Introduction to Databricks</a></li>
</ol>
<ul>
<li><p>What is Databricks?</p>
</li>
<li><p>Key features and benefits</p>
</li>
<li><p>Getting started with Databricks Workspace</p>
</li>
<li><p>Notebook basics and interactive analytics</p>
</li>
</ul>
<ol start="2">
<li><a class="post-section-overview" href="#heading-introduction-to-delta">Introduction to Delta</a></li>
</ol>
<ul>
<li><p>Understanding Delta Lake</p>
</li>
<li><p>Advantages of using Delta</p>
</li>
<li><p>Use cases of Delta in real-world scenarios</p>
</li>
<li><p>Supported languages and platforms for Delta</p>
</li>
</ul>
<ol start="3">
<li><a class="post-section-overview" href="#heading-how-to-create-and-manage-tables">How to Create and Manage Tables</a></li>
</ol>
<ul>
<li><p>Creating tables from various data sources</p>
</li>
<li><p>SQL Data Definition Language (DDL) commands</p>
</li>
<li><p>SQL Data Manipulation Language (DML) commands</p>
</li>
<li><p>Creating tables from a Databricks dataset</p>
</li>
<li><p>Saving the loaded CSV file to Delta using Python</p>
</li>
</ul>
<ol start="4">
<li><a class="post-section-overview" href="#heading-delta-sql-command-support">Delta SQL Command Support</a></li>
</ol>
<ul>
<li><p>Delta SQL commands for data management</p>
</li>
<li><p>Performing UPSERT (UPDATE and INSERT) operations</p>
</li>
</ul>
<ol start="5">
<li><a class="post-section-overview" href="#heading-advanced-sql-queries">Advanced SQL Queries</a></li>
</ol>
<ul>
<li><p>Handling data visualization in Delta</p>
</li>
<li><p>Advanced aggregate queries in Delta</p>
</li>
<li><p>Counting diamonds by clarity using SQL</p>
</li>
<li><p>Adding table constraints for data integrity</p>
</li>
</ul>
<ol start="6">
<li><a class="post-section-overview" href="#heading-how-to-work-with-dataframes">How to Work with DataFrames</a></li>
</ol>
<ul>
<li><p>Creating a DataFrame from a Databricks dataset</p>
</li>
<li><p>Data manipulation and displaying results using DataFrames</p>
</li>
</ul>
<ol start="7">
<li><a class="post-section-overview" href="#heading-version-control-and-time-travel-in-delta">Version Control and Time Travel in Delta</a></li>
</ol>
<ul>
<li><p>Understanding version control and time travel in Delta</p>
</li>
<li><p>Restoring data to a specific version</p>
</li>
<li><p>Utilizing autogenerated fields for metadata tracking</p>
</li>
</ul>
<ol start="8">
<li><a class="post-section-overview" href="#heading-delta-table-cloning">Delta Table Cloning</a></li>
</ol>
<ul>
<li><p>Deep and shallow copying of Delta tables</p>
</li>
<li><p>Efficiently cloning Delta tables for data exploration and analysis</p>
</li>
</ul>
<ol start="9">
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ol>
<h2 id="heading-introduction-to-databricks">Introduction to Databricks</h2>
<p>Databricks is a unified analytics platform that combines data engineering, data science, and machine learning into a single collaborative environment. Leveraging Apache Spark, it processes and analyzes vast amounts of data efficiently.</p>
<p>Databricks offers benefits like seamless scalability, real-time collaboration, and simplified workflows, making it a favored choice for data-driven enterprises.</p>
<p>Its versatility suits various use cases: from ETL processes and data preparation to advanced analytics and AI model development. Databricks aids in uncovering insights from structured and unstructured data, empowering businesses to make informed decisions swiftly.</p>
<p>You can see its application in finance for fraud detection, healthcare for predictive analytics, e-commerce for recommendation engines, and so on. Basically, Databricks accelerates data-driven innovation, transforming raw information into actionable intelligence.</p>
<p>To follow along with this tutorial, you should first create a <a target="_blank" href="https://www.databricks.com/try-databricks">Community Edition account</a> so you can create your clusters.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-209.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Create a Databricks Community Edition Account</em></p>
<p>Once you've created your account, head over to the <a target="_blank" href="https://community.cloud.databricks.com/login.html">Community Edition login page</a>. Once you have signed in, you'll be greeted with a screen very similar to the one shown below.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-212.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Databricks User Dashboard with options to create workspaces, notebooks, and import data</em></p>
<p>From the sidebar on the left, you can create your workspaces, and upload datasets and files that you wish to process.</p>
<p>To follow along, click on the link highlighted in the image above (the one that says "create a notebook"). It will launch a new notebook on the Databricks platform where we'll be writing all the code.</p>
<p>You can also access all your notebooks from the left sidebar or from the "Recents" tab on the home screen once you log in.</p>
<p>You can find all the code, instructions, and steps used in this handbook with explanations in one of the public notebooks I have created <a target="_blank" href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4547302627522723/2783251604801531/8769490171999815/latest.html">here</a>.</p>
<p>After creating a new notebook, you need to create a cluster to run your commands and process the data. Clusters in Databricks are groups of computing resources that drive efficient data processing. They execute work in parallel, speeding up tasks like ETL and analysis.</p>
<p>Clusters offer tailored resource allocation, ensuring optimal performance and scalability. Supporting multiple users and tasks concurrently, clusters encourage collaboration. Leveraging Apache Spark, they enable advanced analytics and machine learning.</p>
<p>Clusters also supply the compute that carries out Delta Lake's ACID transactions, while the transactional guarantees themselves come from Delta Lake's transaction log. Overall, clusters empower seamless, high-performance data handling, essential for tasks ranging from data preparation to sophisticated analytics and AI model training.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-213.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Provision a cluster by creating a new resource to run commands in the notebook</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-214.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Proceed with the standard configuration</em></p>
<p>Now that we have the notebook and cluster set up, we can start with the code. But before we do that, here are a few key terms to know. These are more about the platform and less about SQL syntax, which will be covered below.</p>
<h3 id="heading-data-ingestion">Data Ingestion</h3>
<p>Data ingestion in Delta involves loading data from external sources, often through third-party tools such as Fivetran. Delta stores table data as Parquet files, a columnar storage format. To load data into Delta, we can use Spark (for example, PySpark in Python) and specify the storage location. Data can also be loaded with SQL using the <code>COPY INTO</code> command, after which it can be queried with standard SQL syntax.</p>
<h3 id="heading-dashboards">Dashboards</h3>
<p>Visualizations created in SQL notebooks can be added to custom dashboards for BI/Analytics. These dashboards are lightweight and update whenever the underlying data is refreshed. This enables users to create insightful and interactive dashboards for data analysis and reporting. You don't need to create your dashboards from scratch, as popular dashboard templates are available.</p>
<h3 id="heading-policies">Policies</h3>
<p>Databricks provides data governance through the Unity Catalog, ensuring that users only have access to databases and tables they are permitted to view or edit. This granular control over data access enhances security and data privacy within the system.</p>
<h3 id="heading-history">History</h3>
<p>Moderators or superusers can access the history of each query run against all databases, along with timestamps and query execution times. This feature helps in understanding query patterns and optimizing database performance based on usage insights.</p>
<h3 id="heading-optimization">Optimization</h3>
<p>To improve query performance, Delta offers various optimization techniques, such as data skipping, Z-order clustering, and Bloom filter indexing, on top of the massively parallel processing (MPP) that Spark provides. Knowledge of normalization and schema design also contributes to writing efficient SQL queries.</p>
<h3 id="heading-alerts">Alerts</h3>
<p>Delta allows users to set alerts based on comparison operators applied to query results. For example, when a sales count query returns a value below a threshold, an alert can be triggered via Slack, ticketing tools, or emails. Customizable alerts ensure timely notifications for critical data events.</p>
<h3 id="heading-persona-based-design">Persona-Based Design</h3>
<p>The Databricks Platform is designed to cater to different personas, including Data Science/Analytics and BI/MLOps specialists. Users get segregated interfaces tailored to their roles. However, the Unity Catalog can aggregate all these views, providing a cohesive experience.</p>
<h3 id="heading-sql-workspace">SQL Workspace</h3>
<p>The SQL Workspace in Databricks provides an interface similar to MySQL Workbench or pgAdmin. Users can run SQL queries on datasets without needing to load the data repeatedly, as is done in notebooks. This efficient querying enhances the SQL-based data analysis experience.</p>
<h3 id="heading-integration-with-other-bi-tools">Integration with other BI Tools</h3>
<p>Databricks integrates well with Tableau and Power BI. You can import your data points and visualizations seamlessly and get consistent and synced results in the BI tools of your choice. With the click of a button, live queries are generated against the Databricks datasets.</p>
<h2 id="heading-introduction-to-delta">Introduction to Delta</h2>
<p>Delta Lake is an open storage format used to save your data in your Lakehouse. Delta provides an abstraction layer on top of files. It's the storage foundation of your Lakehouse.</p>
<h3 id="heading-why-delta-lake">Why Delta Lake?</h3>
<p><img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Running an ingestion pipeline on Cloud Storage can be very challenging. Data teams typically face the following challenges:</p>
<ul>
<li><p>Hard to append data (Adding newly arrived data leads to incorrect reads).</p>
</li>
<li><p>Modification of existing data is difficult (GDPR/CCPA requires making fine-grained changes to the existing data lake).</p>
</li>
<li><p>Jobs failing mid-way (Half of the data appears in the data lake, the rest may be missing).</p>
</li>
<li><p>Data quality issues (It’s a constant headache to ensure that all the data is correct and high quality).</p>
</li>
<li><p>Real-time operations (Mixing streaming and batch leads to inconsistency).</p>
</li>
<li><p>Costly to keep historical versions of the data (Regulated environments require reproducibility, auditing, and governance).</p>
</li>
<li><p>Difficult to handle large metadata (For large data lakes, the metadata itself becomes difficult to manage).</p>
</li>
<li><p>“Too many files” problems (Data lakes are not great at handling millions of small files).</p>
</li>
<li><p>Hard to get great performance (Partitioning the data for performance is error-prone and difficult to change).</p>
</li>
</ul>
<p>These challenges have a real impact on team efficiency and productivity, as teams spend unnecessary time fixing low-level technical issues instead of focusing on high-level business implementation.</p>
<p>Because Delta Lake solves all the low-level technical challenges of saving petabytes of data in your lakehouse, it lets you focus on implementing a simple data pipeline while providing blazing-fast query answers for your BI and analytics reports.</p>
<p>In addition, Delta Lake is a fully open source project under the Linux Foundation and is adopted by most of the data players. You know you own your data and won't have vendor lock-in.</p>
<h3 id="heading-features-and-capabilities">Features and Capabilities</h3>
<p>You can think about Delta as a file format that your engine can leverage to bring the following capabilities out of the box:</p>
<ul>
<li><p>ACID transactions</p>
</li>
<li><p>Support for DELETE/UPDATE/MERGE</p>
</li>
<li><p>Unify batch &amp; streaming</p>
</li>
<li><p>Time Travel</p>
</li>
<li><p>Zero-copy clones</p>
</li>
<li><p>Generated partitions</p>
</li>
<li><p>CDF - Change Data Feed (Databricks Runtime)</p>
</li>
<li><p>Blazing-fast queries</p>
</li>
</ul>
<p>This hands-on quickstart guide is going to focus on:</p>
<ul>
<li><p>Loading Databases and Tabular Data from a variety of sources</p>
</li>
<li><p>Writing DDL, DML, and DTL queries on these datasets</p>
</li>
<li><p>Visualizing Datasets to get conclusive results</p>
</li>
<li><p>Time travel and Restoring database</p>
</li>
<li><p>Performance Optimization</p>
</li>
</ul>
<h2 id="heading-how-to-create-and-manage-tables">How to Create and Manage Tables</h2>
<p>Okay, time to code! If you still have the notebook that we created earlier along with the clusters open, you can start by following along with the code below. Don't worry, explanations for every step will follow.</p>
<p>Select the dropdown next to the notebook title and ensure SQL is selected, since this handbook is all about Delta Lake with SQL.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-215.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Select Notebook language to be SQL</em></p>
<h3 id="heading-how-to-create-tables-from-a-databricks-dataset">How to Create Tables from a Databricks Dataset</h3>
<p>Databricks notebooks are very much like Jupyter Notebooks. You have to insert your code into cells and run them one by one or together. All the output is shown cell by cell, progressively.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-216.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Databricks notebook interface</em></p>
<p>Here's the code from the image above:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">EXISTS</span> diamonds; 
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> diamonds 
<span class="hljs-keyword">USING</span> csv 
OPTIONS (<span class="hljs-keyword">path</span> <span class="hljs-string">"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"</span>, header <span class="hljs-string">"true"</span>)
</code></pre>
<p>In the code above, the two SQL statements (<code>DROP TABLE IF EXISTS</code> and <code>CREATE TABLE</code>) are used to create a table named <code>diamonds</code> in a database. The table is based on data from a CSV file located at the specified path.</p>
<p>If a table with the same name already exists, the <code>DROP TABLE IF EXISTS diamonds</code> statement ensures it is deleted before creating a new one. The table will have the same schema as the CSV file, with the first row assumed to be the header containing column names ("header 'true'").</p>
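<p>To see what the <code>header "true"</code> option means in practice, here is a minimal Python sketch that uses the standard library's <code>csv</code> module instead of Spark. The small sample and the <code>load_csv_as_table</code> helper are hypothetical stand-ins for the real dataset. The sketch assumes the first header cell of the file is empty, which would explain the <code>_c0</code> column name used later in this handbook, and it shows that without schema inference every value is read as a string.</p>

```python
import csv
import io

# Hypothetical sample that mimics the shape of diamonds.csv (not the real file).
SAMPLE_CSV = '''"","carat","cut","price"
"1",0.23,"Ideal",326
"2",0.21,"Premium",326
'''


def load_csv_as_table(text):
    """Treat the first row as the header and return (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    # Assumption: an empty header cell gets a positional name such as _c0.
    columns = [name or f"_c{i}" for i, name in enumerate(rows[0])]
    return columns, rows[1:]


columns, data = load_csv_as_table(SAMPLE_CSV)
print(columns)
print(data[0])
```

<p>Every value in <code>data</code> is still text at this point, which is why a schema (or schema inference) matters when you later want numeric comparisons and aggregates.</p>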
<p>Here's a command that returns all the records from the <code>diamonds</code> table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> diamonds
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-183.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>The above query returns all the records from the diamonds table</em></p>
<p>Here's another command:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">describe</span> diamonds;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-184.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Table metadata returned by the</em> <code>describe</code> command</p>
<p>In SQL, the <code>DESCRIBE</code> statement is used to retrieve metadata information about a table's structure. The specific syntax for the <code>DESCRIBE</code> statement can vary depending on the database system being used.</p>
<p>However, its primary purpose is to provide details about the columns in a table, such as their names, data types, constraints, and other properties.</p>
<h3 id="heading-saving-the-loaded-csv-file-to-delta-using-python">Saving the loaded CSV file to Delta using Python</h3>
<p>The best part about using the Databricks platform is that it allows you to write Python, SQL, Scala, and R interchangeably in the same notebook.</p>
<p>You can switch languages in any cell by using Databricks notebook <strong>magic commands</strong> such as <code>%python</code> and <code>%sql</code>. You can find a full list of magic commands at the end of this handbook.</p>
<pre><code class="lang-python">%python

diamonds = spark.read.csv(<span class="hljs-string">"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"</span>, header=<span class="hljs-string">"true"</span>, inferSchema=<span class="hljs-string">"true"</span>)

diamonds.write.format(<span class="hljs-string">"delta"</span>).mode(<span class="hljs-string">"overwrite"</span>).save(<span class="hljs-string">"/delta/diamonds"</span>)
</code></pre>
<p>Data is read from a CSV file located at <strong>/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv</strong> into a Spark DataFrame named <code>diamonds</code>. The first row of the CSV file is treated as the header, and Spark infers the schema for the DataFrame based on the data.</p>
<p>The DataFrame <code>diamonds</code> is then written in the Delta Lake format. If data already exists at the specified location (<strong>/delta/diamonds</strong>), it will be overwritten. If it does not exist, new Delta files will be created there.</p>
<pre><code class="lang-sql">
<span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">EXISTS</span> diamonds;

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> diamonds <span class="hljs-keyword">USING</span> DELTA LOCATION <span class="hljs-string">'/delta/diamonds/'</span>
</code></pre>
<p>The SQL statements above drop any existing table named <code>diamonds</code> and create a new Delta Lake table named <code>diamonds</code> using the data stored in the Delta Lake format at the <strong>/delta/diamonds/</strong> location.</p>
<p>You can run a SELECT statement to ensure that the table appears as expected:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> diamonds
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-185.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>The same diamonds table result set once restored from Delta Lake</em></p>
<h2 id="heading-delta-sql-command-support"><strong>Delta SQL Command Support</strong></h2>
<p>In the world of databases, there are two fundamental types of commands: Data Manipulation Language (DML) and Data Definition Language (DDL). These commands play a crucial role in managing and organizing data within a database. In this section, we will explore what DML and DDL commands are, how they differ, and how they are used.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-186.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Databricks Notebooks support all the SQL commands including DDL and DML commands highlighted here</em></p>
<h3 id="heading-data-manipulation-language-dml">Data Manipulation Language (DML)</h3>
<p>DML is used to manipulate or modify data stored in a database. These commands allow users to insert, retrieve, update, and delete data from database tables. Let's take a closer look at some commonly used DML commands:</p>
<p><strong>SELECT</strong>: The <code>SELECT</code> command is used to retrieve data from one or more tables in a database. It allows you to specify the columns and rows you want to extract by using conditions and filters. For example, <code>SELECT * FROM Customers</code> retrieves all the records from the <code>Customers</code> table.</p>
<p><strong>INSERT</strong>: The <code>INSERT</code> command adds new data into a table. It allows you to specify the value for each column or select values from another table. For example, <code>INSERT INTO Customers (Name, Email) VALUES ('John Doe', 'john@example.com')</code> adds a new customer record to the <code>Customers</code> table.</p>
<p><strong>UPDATE</strong>: The <code>UPDATE</code> command is used to modify existing data in a table. It allows you to change the values of specific columns based on certain conditions. For example, <code>UPDATE Customers SET Email = 'new@example.com' WHERE ID = 1</code> updates the email address of the customer with an ID of 1.</p>
<p><strong>DELETE</strong>: The <code>DELETE</code> command is used to remove data from a table. It allows you to delete specific rows based on certain conditions. For example, <code>DELETE FROM Customers WHERE ID = 1</code> deletes the customer record with an ID of 1 from the <code>Customers</code> table.</p>
<h3 id="heading-data-definition-language-ddl-commands">Data Definition Language (DDL) Commands</h3>
<p>DDL commands are used to define the structure and organization of a database. These commands allow users to create, modify, and delete database objects such as tables, indexes, and constraints.</p>
<p>Let's explore some commonly used DDL commands:</p>
<p><strong>CREATE</strong>: Creates a new database object, such as a table or an index. It allows you to define the columns, data types, and constraints for the object. For example, <code>CREATE TABLE Customers (ID INT, Name VARCHAR(50), Email VARCHAR(100))</code> creates a new table named <code>Customers</code> with three columns.</p>
<p><strong>ALTER</strong>: Modifies the structure of an existing database object. It allows you to add, modify, or delete columns, constraints, or indexes. For example, <code>ALTER TABLE Customers ADD COLUMN Phone VARCHAR(20)</code> adds a new column named <code>Phone</code> to the <code>Customers</code> table.</p>
<p><strong>DROP</strong>: Deletes an existing database object. It permanently removes the object and its associated data from the database. For example, <code>DROP TABLE Customers</code> deletes the <code>Customers</code> table from the database.</p>
<p><strong>TRUNCATE</strong>: The <code>TRUNCATE</code> command is used to remove all the data from a table, while keeping the table structure intact. It is faster than the <code>DELETE</code> command when you want to remove all records from a table. For example, <code>TRUNCATE TABLE Customers</code> removes all records from the <code>Customers</code> table.</p>
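<p>If you'd like to try the <code>Customers</code> examples above outside of Databricks, the following sketch runs them against an in-memory SQLite database using Python's built-in <code>sqlite3</code> module. SQLite is only a stand-in here: it is not Delta Lake, it has no <code>TRUNCATE</code> command, and the inserted row is made up for illustration.</p>

```python
import sqlite3

# In-memory SQLite database, used purely as a stand-in for a real cluster.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the table, then change its structure.
cur.execute("CREATE TABLE Customers (ID INT, Name VARCHAR(50), Email VARCHAR(100))")
cur.execute("ALTER TABLE Customers ADD COLUMN Phone VARCHAR(20)")

# DML: insert a row, modify it, and read it back.
cur.execute(
    "INSERT INTO Customers (ID, Name, Email) VALUES (1, 'John Doe', 'john@example.com')"
)
cur.execute("UPDATE Customers SET Email = 'new@example.com' WHERE ID = 1")
after_update = cur.execute("SELECT ID, Name, Email, Phone FROM Customers").fetchall()

# DML: delete the row; the follow-up SELECT returns no rows.
cur.execute("DELETE FROM Customers WHERE ID = 1")
after_delete = cur.execute("SELECT * FROM Customers").fetchall()

# DDL: remove the table and its data entirely.
cur.execute("DROP TABLE Customers")
conn.close()

print(after_update)
print(after_delete)
```

<p>Note that the <code>Phone</code> column added by <code>ALTER TABLE</code> is empty for the existing row, and that the <code>SELECT</code> after the <code>DELETE</code> returns an empty list rather than an error.</p>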
<p>Delta Lake supports standard DML including <code>UPDATE</code>, <code>DELETE</code> and <code>MERGE INTO</code>, providing developers with more control to manage their big datasets.</p>
<p>Here's an example that uses the <code>INSERT</code>, <code>UPDATE</code>, and <code>SELECT</code> commands:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> diamonds(_c0, carat, cut,    color,    clarity,    <span class="hljs-keyword">depth</span>,    <span class="hljs-keyword">table</span>,    price,    x,    y,    z) <span class="hljs-keyword">values</span> (<span class="hljs-number">53941</span>, <span class="hljs-number">0.22</span>,    <span class="hljs-string">'Premium'</span>, <span class="hljs-string">'I'</span>,    <span class="hljs-string">'SI2'</span>,    <span class="hljs-string">'60.3'</span>,    <span class="hljs-string">'62.1'</span>,    <span class="hljs-string">'334'</span>,    <span class="hljs-string">'3.79'</span>,    <span class="hljs-string">'3.75'</span>,    <span class="hljs-string">'2.27'</span>);

<span class="hljs-keyword">UPDATE</span> diamonds <span class="hljs-keyword">SET</span> carat = <span class="hljs-number">0.20</span> <span class="hljs-keyword">WHERE</span> _c0 = <span class="hljs-number">53941</span>;

<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> diamonds <span class="hljs-keyword">where</span> _c0=<span class="hljs-number">53941</span>;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-187.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Fetching a unique record from the table</em></p>
<p>In the example above, an initial row is inserted into the <code>diamonds</code> table with specific values for each column.</p>
<p>Then the carat value for the row with <code>_c0</code> equal to 53941 is updated to 0.20.</p>
<p>The final <code>SELECT</code> statement retrieves the row with <code>_c0</code> equal to 53941, showing its current state after the <code>INSERT</code> and <code>UPDATE</code> operations. This confirms that both the insertion and the update were successful.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> diamonds <span class="hljs-keyword">where</span> _c0=<span class="hljs-number">53941</span>;

<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> diamonds <span class="hljs-keyword">where</span> _c0=<span class="hljs-number">53941</span>;
</code></pre>
<p>The above <code>DELETE</code> command paired with the <code>WHERE</code> clause removes the row from the table, and the subsequent <code>SELECT</code> query validates this by returning an empty result set.</p>
<h3 id="heading-upsert-operation">UPSERT Operation</h3>
<p>The "upsert" operation updates a record if it already exists, and inserts the record if it doesn't exist.</p>
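<p>Before moving to the SQL version, the idea can be sketched in plain Python. This is only an illustration of the update-or-insert logic, not how Delta Lake implements it: the <code>upsert</code> helper and the shortened rows are hypothetical, and rows are matched on the <code>_c0</code> key.</p>

```python
def upsert(target, source, key="_c0"):
    """Update target rows whose key matches a source row; insert the rest.

    Returns a tuple (updated_count, inserted_count).
    """
    # Map each key value to its position in the target list.
    index = {row[key]: position for position, row in enumerate(target)}
    updated, inserted = 0, 0
    for row in source:
        if row[key] in index:
            # Key already present: overwrite the existing row.
            target[index[row[key]]] = dict(row)
            updated += 1
        else:
            # Key not present: append a new row and remember its position.
            target.append(dict(row))
            index[row[key]] = len(target) - 1
            inserted += 1
    return updated, inserted


# Hypothetical, shortened rows standing in for the real tables.
diamonds = [
    {"_c0": 1, "carat": 0.23, "cut": "Ideal"},
    {"_c0": 2, "carat": 0.21, "cut": "Premium"},
    {"_c0": 3, "carat": 0.23, "cut": "Good"},
]
diamond_mini = [
    {"_c0": 1, "carat": 0.22, "cut": "Premium"},
    {"_c0": 2, "carat": 0.22, "cut": "Premium"},
    {"_c0": 90000, "carat": 0.22, "cut": "Premium"},
]

counts = upsert(diamonds, diamond_mini)
print(counts)  # (2, 1): two rows updated, one row inserted
```

<p>Rows 1 and 2 are overwritten with the source values, row 3 is left untouched, and the row with key 90000 is appended.</p>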
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span>  diamond__mini(_c0 <span class="hljs-built_in">int</span>, carat <span class="hljs-keyword">double</span>, cut <span class="hljs-keyword">string</span>,    color <span class="hljs-keyword">string</span>,    clarity <span class="hljs-keyword">string</span>,    <span class="hljs-keyword">depth</span> <span class="hljs-keyword">double</span>, <span class="hljs-keyword">table</span> <span class="hljs-keyword">double</span>,    price <span class="hljs-built_in">int</span>,    x <span class="hljs-keyword">double</span>,    y <span class="hljs-keyword">double</span>,    z <span class="hljs-keyword">double</span>);

<span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> diamond__mini;

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> diamond__mini(_c0, carat, cut,    color,    clarity,    <span class="hljs-keyword">depth</span>,    <span class="hljs-keyword">table</span>,    price,    x,    y,    z) <span class="hljs-keyword">values</span> (<span class="hljs-number">1</span>, <span class="hljs-number">0.22</span>,    <span class="hljs-string">'Premium'</span>, <span class="hljs-string">'I'</span>,    <span class="hljs-string">'SI2'</span>,    <span class="hljs-string">'60.3'</span>,    <span class="hljs-string">'62.1'</span>,    <span class="hljs-string">'334'</span>,    <span class="hljs-string">'3.79'</span>,    <span class="hljs-string">'3.75'</span>,    <span class="hljs-string">'2.27'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> diamond__mini(_c0, carat, cut,    color,    clarity,    <span class="hljs-keyword">depth</span>,    <span class="hljs-keyword">table</span>,    price,    x,    y,    z) <span class="hljs-keyword">values</span> (<span class="hljs-number">2</span>, <span class="hljs-number">0.22</span>,    <span class="hljs-string">'Premium'</span>, <span class="hljs-string">'I'</span>,    <span class="hljs-string">'SI2'</span>,    <span class="hljs-string">'60.3'</span>,    <span class="hljs-string">'62.1'</span>,    <span class="hljs-string">'334'</span>,    <span class="hljs-string">'3.79'</span>,    <span class="hljs-string">'3.75'</span>,    <span class="hljs-string">'2.27'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> diamond__mini(_c0, carat, cut,    color,    clarity,    <span class="hljs-keyword">depth</span>,    <span class="hljs-keyword">table</span>,    price,    x,    y,    z) <span class="hljs-keyword">values</span> (<span class="hljs-number">90000</span>, <span class="hljs-number">0.22</span>,    <span class="hljs-string">'Premium'</span>, <span class="hljs-string">'I'</span>,    <span class="hljs-string">'SI2'</span>,    <span class="hljs-string">'60.3'</span>,    <span class="hljs-string">'62.1'</span>,    <span class="hljs-string">'334'</span>,    <span class="hljs-string">'3.79'</span>,    <span class="hljs-string">'3.75'</span>,    <span class="hljs-string">'2.27'</span>);

<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> diamond__mini;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-188.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Creating a small table</em> <code>diamond__mini</code> to demonstrate the UPSERT operation</p>
<p>In this scenario, we have created a table named <code>diamond__mini</code> to test upsert (that is, update or insert) operations into the <code>diamonds</code> table.</p>
<p><code>diamond__mini</code> is a small table containing only 3 records. Two of these rows (with <code>_c0</code> values 1 and 2) share their keys with rows that already exist in the <code>diamonds</code> table, and one row (with <code>_c0</code> value 90000) has no counterpart there.</p>
<p>The code first creates the <code>diamond__mini</code> table with a schema that matches the <code>diamonds</code> table.</p>
<p>It then clears the <code>diamond__mini</code> table by deleting any existing records, ensuring that we have a clean slate for the upsert test.</p>
<p>Next, it runs three <code>INSERT</code> statements against the <code>diamond__mini</code> table, adding three records with different <code>_c0</code> values, including one with <code>_c0 = 90000</code>.</p>
<p>Lastly, we select all records from the <code>diamond__mini</code> table to verify that the three rows are in place before running the upsert.</p>
<p>Since the <code>_c0</code> values 1 and 2 already exist in the <code>diamonds</code> table, the corresponding rows in <code>diamond__mini</code> will be considered as updates for the existing rows.</p>
<p>On the other hand, the row with <code>_c0 = 90000</code> is new and does not exist in the <code>diamonds</code> table, so it will be treated as an insert.</p>
<p>The <code>describe</code> command shows the metadata of the new table:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">describe</span> diamond__mini
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-189.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Fetching metadata of the newly created table</em></p>
<p>Here's another example that uses the upsert operation:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-192.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Upsert operation on the diamonds and diamond__mini tables</em></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- perform UPSERT operation based on matching column and row criteria from diamond__mini to diamonds table. If a match is found, record will update otherwise it will be inserted.</span>

<span class="hljs-keyword">MERGE</span> <span class="hljs-keyword">INTO</span> diamonds <span class="hljs-keyword">as</span> d <span class="hljs-keyword">USING</span> diamond__mini <span class="hljs-keyword">as</span> m
  <span class="hljs-keyword">ON</span> d._c0 = m._c0
  <span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">MATCHED</span> <span class="hljs-keyword">THEN</span> 
    <span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">SET</span> *
  <span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">MATCHED</span> 
    <span class="hljs-keyword">THEN</span> <span class="hljs-keyword">INSERT</span> * ;

<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> diamonds <span class="hljs-keyword">where</span> _c0 <span class="hljs-keyword">in</span> (<span class="hljs-number">1</span> ,<span class="hljs-number">2</span>, <span class="hljs-number">90000</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-193.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>UPSERT operation successful. Values for records with</em> <code>_c0</code> = [1,2] were updated and 90,000 was inserted</p>
<p>In this example, a <code>MERGE</code> operation is performed between two tables: <code>diamonds</code> (target table) and <code>diamond__mini</code> (source table). The <code>MERGE</code> statement compares the records in both tables based on the common <code>_c0</code> column.</p>
<p>Here's a concise explanation:</p>
<ol>
<li><p>The <code>MERGE</code> statement matches records with the same <code>_c0</code> value in both tables (<code>diamonds</code> and <code>diamond__mini</code>).</p>
</li>
<li><p>When a match is found (based on <code>_c0</code>), it performs an <code>UPDATE</code> on the target table (<code>diamonds</code>) using the values from the source table (<code>diamond__mini</code>). This is done for all columns using <code>UPDATE SET *</code>.</p>
</li>
<li><p>If no match is found for a record from the source table (<code>diamond__mini</code>), it performs an <code>INSERT</code> into the target table (<code>diamonds</code>) using the values from the source table for all columns (using <code>INSERT *</code>).</p>
</li>
<li><p>After the <code>MERGE</code> operation, a <code>SELECT</code> statement retrieves the records from the target table (<code>diamonds</code>) with <code>_c0</code> values 1, 2, and 90000 to observe the changes made during the merge.</p>
</li>
</ol>
<p>The <code>MERGE</code> statement is used to synchronize data between the <code>diamonds</code> and <code>diamond__mini</code> tables based on their common <code>_c0</code> column, updating existing records and inserting new ones.</p>
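<p>To make the matched and not-matched branches concrete, here is a minimal plain-Python sketch of the same upsert logic. It is only an illustration of the semantics, not how Delta implements <code>MERGE</code>. The function name <code>merge_upsert</code> and the row values are made up for this example, and each table is modeled as a dictionary keyed by <code>_c0</code>.</p>

```python
# Toy model of MERGE ... WHEN MATCHED UPDATE SET * / WHEN NOT MATCHED INSERT *.
# Tables are dicts keyed by _c0; all row values below are invented.
def merge_upsert(target, source):
    """Return a new table where source rows overwrite matches and add new keys."""
    merged = dict(target)
    for key, row in source.items():
        # Matched key: the whole row is replaced (UPDATE SET *).
        # Unmatched key: the row is added (INSERT *).
        merged[key] = dict(row)
    return merged


diamonds = {
    1: {"carat": 0.23, "price": 326},
    2: {"carat": 0.21, "price": 326},
    3: {"carat": 0.23, "price": 327},
}
diamond_mini = {
    1: {"carat": 0.23, "price": 999},
    2: {"carat": 0.21, "price": 999},
    90000: {"carat": 0.50, "price": 999},
}

result = merge_upsert(diamonds, diamond_mini)
updated_keys = sorted(k for k in diamond_mini if k in diamonds)
inserted_keys = sorted(k for k in diamond_mini if k not in diamonds)
print("updated:", updated_keys, "inserted:", inserted_keys)
```

<p>Rows 1 and 2 take the source values, row 90000 is added, and row 3 is left untouched because the source has no matching key.</p>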
<h2 id="heading-advanced-sql-queries">Advanced SQL Queries</h2>
<h3 id="heading-data-visualization-in-delta">Data Visualization in Delta</h3>
<p>On the Databricks Delta platform, you can use SQL queries to visualize data and gain insights without writing complex programs. Here are some ways to visualize data using SQL queries in Databricks Delta:</p>
<ol>
<li><p><strong>Basic SELECT Queries:</strong> These retrieve data from your Delta tables. By selecting specific columns or applying filters with <code>WHERE</code> clauses, you can quickly get an overview of the data's characteristics.</p>
</li>
<li><p><strong>Aggregate Functions:</strong> SQL provides a variety of aggregate functions like <code>COUNT</code>, <code>SUM</code>, <code>AVG</code>, <code>MIN</code>, and <code>MAX</code>. By using these functions, you can summarize and visualize data at a higher level. You perform operations such as counting the number of records, calculating the average values, or finding the maximum and minimum values.</p>
</li>
<li><p><strong>Grouping and Aggregating Data:</strong> The <code>GROUP BY</code> clause in SQL allows you to group data based on specific columns, and then apply aggregate functions to each group. This enables generation of meaningful insights by analyzing data on a category-wise basis.</p>
</li>
<li><p><strong>Window Functions:</strong> SQL window functions, like <code>ROW_NUMBER</code>, <code>RANK</code>, and <code>DENSE_RANK</code>, are valuable for partitioning data and calculating rankings or running totals. These functions enable analyzing data in a more granular way and help discover patterns.</p>
</li>
<li><p><strong>Joining Tables:</strong> SQL <code>JOIN</code> operations combine data from multiple Delta tables. Joins make it possible to merge related data, perform cross-table analysis, and build more advanced visualizations.</p>
</li>
<li><p><strong>Subqueries and CTEs:</strong> SQL subqueries and Common Table Expressions (CTEs) allow you to break down complex problems into manageable parts. These techniques can simplify analysis and make SQL queries more organized and maintainable.</p>
</li>
<li><p><strong>Window Aggregates:</strong> SQL window aggregates, such as <code>SUM</code> and <code>AVG</code> used with the <code>OVER</code> clause, enable you to perform calculations on specific windows or ranges of data. This is useful for analyzing trends over time or within specific subsets of your data.</p>
</li>
<li><p><strong>CASE Statements:</strong> CASE statements in SQL help you create conditional expressions, allowing you to categorize or group data based on certain conditions. This can aid in creating custom labels or grouping data into different categories for visualization purposes.</p>
</li>
</ol>
<p>The platform's powerful SQL capabilities empower data analysts and developers to extract meaningful insights from their Delta Lake data, all without the need for additional programming languages or tools.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- aggregate query to get average price based on diamond colors</span>
<span class="hljs-keyword">SELECT</span> color, <span class="hljs-keyword">avg</span>(price) <span class="hljs-keyword">AS</span> avg_price <span class="hljs-keyword">FROM</span> diamonds <span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> color <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> color
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-194.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Tabular View for the Query</em></p>
<p>The SQL query above retrieves the average price of diamonds for each color.</p>
<p>Let's break down the code:</p>
<p><code>SELECT color, avg(price) AS avg_price</code> specifies the columns that will be selected in the result set. It selects the <code>color</code> column and calculates the average price using the <code>avg()</code> function. The calculated average is aliased as <code>avg_price</code> for easier reference in the result set.</p>
<p>The <code>FROM diamonds</code> clause specifies the table from which data will be retrieved. In this case, the table is named <code>diamonds</code>.</p>
<p><code>GROUP BY color</code> groups the data by the <code>color</code> column. The result set will contain one row for each unique color, and the average price will be calculated for each group separately.</p>
<p><code>ORDER BY color</code> arranges the result set in ascending order based on the <code>color</code> column. The output will be sorted alphabetically by color.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-195.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Visualized Results for the Query</em></p>
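<p>If it helps to see the grouping step spelled out, the sketch below mimics the same query in plain Python. The sample rows and the helper name <code>average_price_by_color</code> are assumptions made for illustration and are not taken from the real dataset.</p>

```python
from collections import defaultdict

# Hypothetical sample rows as (color, price) pairs.
rows = [("E", 326), ("E", 328), ("I", 334), ("J", 335), ("J", 337)]


def average_price_by_color(rows):
    """Mimic SELECT color, avg(price) ... GROUP BY color ORDER BY color."""
    groups = defaultdict(list)
    for color, price in rows:
        groups[color].append(price)  # GROUP BY color
    # avg(price) per group, with groups sorted by color (ORDER BY color)
    return [(color, sum(prices) / len(prices))
            for color, prices in sorted(groups.items())]


avg_by_color = average_price_by_color(rows)
print(avg_by_color)
```

<p>Each color ends up as exactly one output row, which is why the result of a <code>GROUP BY</code> maps so naturally onto a bar chart.</p>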
<h3 id="heading-count-of-diamonds-by-clarity">Count of Diamonds by Clarity</h3>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> clarity, <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">count</span>
<span class="hljs-keyword">FROM</span> diamonds
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> clarity
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">count</span> <span class="hljs-keyword">DESC</span>;
</code></pre>
<p>The SQL query above calculates the count of diamonds for each clarity level and presents the results in descending order. It selects the <code>clarity</code> column and uses the <code>COUNT()</code> function to count the number of rows for each clarity value.</p>
<p>The result set is grouped by clarity and sorted in descending order based on the count of diamonds.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-196.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Pie Chart visualization based on the above query</em></p>
<h3 id="heading-average-price-by-depth-range">Average Price by Depth Range</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- This SQL query calculates the average price of diamonds grouped into depth ranges (60-62 and 62-64), and 'Other' for all other depth values, from the 'diamonds' table. The results are ordered in descending order based on the average price.</span>

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">CASE</span> 
         <span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">depth</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-number">60</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">62</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'60-62'</span>
         <span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">depth</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-number">62</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">64</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'62-64'</span>
         <span class="hljs-keyword">ELSE</span> <span class="hljs-string">'Other'</span>
       <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> depth_range,
       <span class="hljs-keyword">AVG</span>(<span class="hljs-keyword">CAST</span>(price <span class="hljs-keyword">AS</span> <span class="hljs-keyword">DOUBLE</span>)) <span class="hljs-keyword">AS</span> avg_price
<span class="hljs-keyword">FROM</span> diamonds
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> depth_range
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> avg_price <span class="hljs-keyword">DESC</span>;
</code></pre>
<p>Here, we calculate the average price of diamonds grouped into depth ranges. The query uses a <code>CASE</code> expression to sort the diamonds into three depth ranges: '60-62' for depths between 60 and 62, '62-64' for depths between 62 and 64, and 'Other' for all other depth values. Because <code>BETWEEN</code> is inclusive and the branches are evaluated in order, a depth of exactly 62 falls into '60-62'.</p>
<p>The <code>AVG()</code> function is then used to calculate the average price for each depth range. The result set is grouped by the <code>depth_range</code> column and ordered in descending order based on the average price.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-197.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Average price based on the grouped depth range, achieved using CASE syntax</em></p>
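<p>The bucketing rule is easier to check with a few boundary values. The following sketch mirrors the <code>CASE</code> expression in plain Python. The function name <code>depth_range</code> is chosen only for this example.</p>

```python
def depth_range(depth):
    """Mirror the CASE expression: branches are tested in order, bounds are inclusive."""
    if 60 <= depth <= 62:   # WHEN depth BETWEEN 60 AND 62
        return "60-62"
    if 62 <= depth <= 64:   # WHEN depth BETWEEN 62 AND 64
        return "62-64"
    return "Other"          # ELSE 'Other'


for value in (59.9, 60, 62, 62.5, 64, 64.1):
    print(value, depth_range(value))
```

<p>A depth of 62 satisfies both conditions, but only the first matching branch applies, so it is labeled '60-62'.</p>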
<h3 id="heading-price-distribution-by-table">Price Distribution by Table</h3>
<pre><code class="lang-sql"><span class="hljs-comment">--  Calculate the median, first quartile (q1), and third quartile (q3) prices for each unique 'table' in the 'diamonds' table based on the 'price' column. The results are grouped by 'table' and provide valuable statistical insights into the price distribution within each category.</span>

<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">table</span>, 
       <span class="hljs-keyword">PERCENTILE_CONT</span>(<span class="hljs-number">0.5</span>) <span class="hljs-keyword">WITHIN</span> <span class="hljs-keyword">GROUP</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">CAST</span>(price <span class="hljs-keyword">AS</span> <span class="hljs-keyword">DOUBLE</span>)) <span class="hljs-keyword">AS</span> median_price,
       <span class="hljs-keyword">PERCENTILE_CONT</span>(<span class="hljs-number">0.25</span>) <span class="hljs-keyword">WITHIN</span> <span class="hljs-keyword">GROUP</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">CAST</span>(price <span class="hljs-keyword">AS</span> <span class="hljs-keyword">DOUBLE</span>)) <span class="hljs-keyword">AS</span> q1_price,
       <span class="hljs-keyword">PERCENTILE_CONT</span>(<span class="hljs-number">0.75</span>) <span class="hljs-keyword">WITHIN</span> <span class="hljs-keyword">GROUP</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">CAST</span>(price <span class="hljs-keyword">AS</span> <span class="hljs-keyword">DOUBLE</span>)) <span class="hljs-keyword">AS</span> q3_price
<span class="hljs-keyword">FROM</span> diamonds
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-keyword">table</span>;
</code></pre>
<p>This SQL query calculates the median, first quartile (q1), and third quartile (q3) prices for each unique <code>table</code> value in the <code>diamonds</code> table. It uses the <code>PERCENTILE_CONT()</code> function to calculate these statistical measures.</p>
<p>The function is applied to the <code>price</code> column, which is cast as a double for accurate calculations. The result set is grouped by the <code>table</code> column, providing insights into the price distribution within each <code>table</code> category.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-198.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Median, Q1, and Q3 prices for each table value</em></p>
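<p>As a rough guide to what a continuous percentile computes, the sketch below applies linear interpolation between the two nearest sorted values, which is how <code>PERCENTILE_CONT</code> is commonly defined. The helper name <code>percentile_cont</code> and the price list are assumptions for illustration, and the function expects a non-empty list.</p>

```python
import math


def percentile_cont(values, fraction):
    """Interpolated percentile of a non-empty list, with fraction between 0 and 1."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional index
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower                 # how far we are toward the upper value
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


prices = [40.0, 10.0, 30.0, 20.0]  # made-up prices for a single table value
q1 = percentile_cont(prices, 0.25)
median = percentile_cont(prices, 0.5)
q3 = percentile_cont(prices, 0.75)
print(q1, median, q3)
```

<p>With four prices, the median lands halfway between the two middle values, so the result does not have to be a value that actually appears in the data.</p>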
<h3 id="heading-price-factor-by-x-y-and-z">Price Factor by X, Y and Z</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Calculate the average price of diamonds grouped by their x, y, and z values from the 'diamonds' table. The results are ordered in descending order based on the average price, providing valuable insights into the average price of diamonds with different x, y, and z dimensions.</span>

<span class="hljs-keyword">SELECT</span> x, y, z, <span class="hljs-keyword">AVG</span>(<span class="hljs-keyword">CAST</span>(price <span class="hljs-keyword">AS</span> <span class="hljs-keyword">DOUBLE</span>)) <span class="hljs-keyword">AS</span> avg_price
<span class="hljs-keyword">FROM</span> diamonds
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> x, y, z
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> avg_price <span class="hljs-keyword">DESC</span>;
</code></pre>
<p>This query will calculate the average price of diamonds grouped by their x, y, and z values from the <code>diamonds</code> table. It selects the columns <code>x</code>, <code>y</code>, <code>z</code>, and uses the <code>AVG()</code> function to calculate the average price for each combination of x, y, and z values.</p>
<p>The result set is then ordered in descending order based on the average price, providing insights into the average price of diamonds with different dimensions.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-199.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Visualization showing average price of diamonds grouped by their x, y, and z values from the 'diamonds' table</em></p>
<h3 id="heading-add-constraints">Add Constraints</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- This SQL code snippet alters the 'diamonds' table by dropping the existing constraint 'id_not_null' if it exists. Then, it adds a new constraint named 'id_not_null' to ensure that the column '_c0' must not contain null values, enforcing data integrity in the table.</span>

<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> diamonds <span class="hljs-keyword">DROP</span> <span class="hljs-keyword">CONSTRAINT</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">EXISTS</span> id_not_null;
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> diamonds <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">CONSTRAINT</span> id_not_null <span class="hljs-keyword">CHECK</span> (_c0 <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">null</span>);
</code></pre>
<pre><code class="lang-sql"><span class="hljs-comment">-- This command will fail because it inserts a row with a null _c0:</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> diamonds(_c0, carat, cut,    color,    clarity,    <span class="hljs-keyword">depth</span>,    <span class="hljs-keyword">table</span>,    price,    x,    y,    z) <span class="hljs-keyword">values</span> (<span class="hljs-literal">null</span>, <span class="hljs-number">0.22</span>,    <span class="hljs-string">'Premium'</span>, <span class="hljs-string">'I'</span>,    <span class="hljs-string">'SI2'</span>,    <span class="hljs-string">'60.3'</span>,    <span class="hljs-string">'62.1'</span>,    <span class="hljs-string">'334'</span>,    <span class="hljs-string">'3.79'</span>,    <span class="hljs-string">'3.75'</span>,    <span class="hljs-string">'2.27'</span>);
</code></pre>
<p>Note that this statement doesn't produce any output rows. It fails because the row violates the <code>id_not_null</code> constraint, and an error is thrown whenever a write does not satisfy a constraint. In this case, this exact error is shown:</p>
<pre><code class="lang-sql">Error in SQL statement: DeltaInvariantViolationException: <span class="hljs-keyword">CHECK</span> <span class="hljs-keyword">constraint</span> id_not_null (_c0 <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>) violated <span class="hljs-keyword">by</span> <span class="hljs-keyword">row</span> <span class="hljs-keyword">with</span> <span class="hljs-keyword">values</span>:
 - _c0 : <span class="hljs-literal">null</span>
</code></pre>
<p>This SQL code snippet demonstrates the alteration of the <code>diamonds</code> table to enforce data integrity.</p>
<p>The first line of code, <code>ALTER TABLE diamonds DROP CONSTRAINT IF EXISTS id_not_null;</code>, checks if a constraint named <code>id_not_null</code> exists in the <code>diamonds</code> table and drops it if it does. This step ensures that any existing constraint with the same name is removed before adding a new one.</p>
<p>The second line of code, <code>ALTER TABLE diamonds ADD CONSTRAINT id_not_null CHECK (_c0 is not null);</code>, adds a new constraint named <code>id_not_null</code> to the <code>diamonds</code> table. This constraint specifies that the column <code>_c0</code> must not contain null values. It ensures that whenever data is inserted or updated in this table, the '_c0' column cannot have a null value, maintaining data integrity.</p>
<p>However, the subsequent command, <code>INSERT INTO diamonds(_c0, carat, cut, color, clarity, depth, table, price, x, y, z) VALUES (null, 0.22, 'Premium', 'I', 'SI2', '60.3', '62.1', '334', '3.79', '3.75', '2.27');</code>, attempts to insert a row into the <code>diamonds</code> table with a null value in the <code>_c0</code> column.</p>
<p>Since the newly added constraint prohibits null values in this column, the <code>INSERT</code> operation will fail, preserving the data integrity specified by the constraint.</p>
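<p>You can try the same idea locally without a Databricks workspace. The sketch below uses Python's built-in <code>sqlite3</code> module as a stand-in, so it shows SQLite's behavior rather than Delta's. The two-column table definition is a simplified assumption, and the error type and message differ from the Delta exception shown above.</p>

```python
import sqlite3

# In-memory SQLite database used only as a local analogy for a CHECK constraint.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diamonds (_c0 INTEGER, carat REAL, "
    "CONSTRAINT id_not_null CHECK (_c0 IS NOT NULL))"
)
conn.execute("INSERT INTO diamonds VALUES (1, 0.23)")  # satisfies the check

try:
    conn.execute("INSERT INTO diamonds VALUES (NULL, 0.22)")  # violates the check
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("insert rejected:", error)

row_count = conn.execute("SELECT COUNT(*) FROM diamonds").fetchone()[0]
print("rows in table:", row_count)
```

<p>The failed insert leaves the table unchanged, which is the point of declaring the constraint on the table instead of checking values in application code.</p>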
<h2 id="heading-how-to-work-with-dataframes">How to Work with Dataframes</h2>
<p>The best part is that you are not restricted to SQL. Below, the same analysis is done by first loading the dataset into a <code>diamonds</code> DataFrame with Python and then using PySpark library functions to run complex queries.</p>
<pre><code class="lang-python">%python
diamonds = spark.read.csv(<span class="hljs-string">"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"</span>, header=<span class="hljs-string">"true"</span>, inferSchema=<span class="hljs-string">"true"</span>)
</code></pre>
<p>In the Databricks Delta Lake platform, the <code>spark</code> object represents the SparkSession, which is the entry point for interacting with Spark functionality. It provides a programming interface to work with structured and semi-structured data.</p>
<p>The <code>spark.read.csv()</code> function is used to read a CSV file into a DataFrame. In this case, it reads the <strong>diamonds.csv</strong> file from the specified path. The arguments passed to the function include:</p>
<ul>
<li><p><code>"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"</code>: This is the path to the CSV file. You can replace this with the actual path where your file is located.</p>
</li>
<li><p><code>header="true"</code>: This specifies that the first row of the CSV file contains the column names.</p>
</li>
<li><p><code>inferSchema="true"</code>: This instructs Spark to automatically infer the data types of the columns in the DataFrame.</p>
</li>
</ul>
<p>Once the CSV file is read, it is stored in the <code>diamonds</code> variable as a DataFrame. The DataFrame represents a distributed collection of data organized into named columns. It provides various functions and methods to manipulate and analyze the data.</p>
<p>By reading the CSV file into a DataFrame on the Databricks Delta Lake platform, you can leverage the rich querying and processing capabilities of Spark to perform data analysis, transformations, and other operations on the diamonds data.</p>
<h3 id="heading-manipulate-the-data-and-displays-the-results"><strong>Manipulate the Data and Display the Results</strong></h3>
<p>The below example showcases that on the Databricks Delta Lake platform, you are not limited to using only SQL queries. You can also leverage Python and its rich ecosystem of libraries, such as PySpark, to perform complex data manipulations and analyses.</p>
<p>By using Python, you have access to a wide range of functions and methods provided by PySpark's DataFrame API. This allows you to perform various transformations, aggregations, calculations, and sorting operations on your data.</p>
<p>Whether you choose to use SQL or Python, the Databricks Delta Lake platform provides a flexible environment for data processing and analysis, enabling you to unlock valuable insights from your data.</p>
<pre><code class="lang-python">%python
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> avg

display(diamonds.select(<span class="hljs-string">"color"</span>,<span class="hljs-string">"price"</span>).groupBy(<span class="hljs-string">"color"</span>).agg(avg(<span class="hljs-string">"price"</span>)).sort(<span class="hljs-string">"color"</span>))
</code></pre>
<p>Firstly, the <code>from pyspark.sql.functions import avg</code> statement imports the <code>avg</code> function from the <code>pyspark.sql.functions</code> module. This function is used to calculate the average value of a column.</p>
<p>Next, the <code>diamonds.select("color", "price").groupBy("color").agg(avg("price")).sort("color")</code> expression performs the following operations:</p>
<p><code>diamonds.select("color", "price")</code> selects only the <code>color</code> and <code>price</code> columns from the <code>diamonds</code> DataFrame.</p>
<p><code>groupBy("color")</code> groups the data based on the <code>color</code> column.</p>
<p><code>agg(avg("price"))</code> calculates the average price for each group (color). The <code>avg("price")</code> argument specifies that we want to calculate the average of the "price" column.</p>
<p><code>sort("color")</code> sorts the resulting DataFrame in ascending order based on the <code>color</code> column.</p>
<p>Finally, the <code>display()</code> function is used to visualize the resulting DataFrame in a tabular format.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-200.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-version-control-and-time-travel-in-delta"><strong>Version Control and Time Travel in Delta</strong></h2>
<p>Databricks Delta’s time travel capabilities simplify building data pipelines. They come in handy when auditing data changes, reproducing experiments and reports, or rolling back transactions. They are also useful for disaster recovery, letting you undo changes and shift back to any specific version of a table.</p>
<p>As you write into a Delta table or directory, every operation is automatically versioned. You can query a table by referring to a timestamp or a version number.</p>
<p>The command below returns a list of all the versions and timestamps in a table called <code>diamonds</code>:</p>
<pre><code class="lang-sql">DESCRIBE HISTORY diamonds;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-201.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>DESCRIBE HISTORY</em> <code>table_name</code> returns a list of all the versions of the table along with their timestamps and operations. It also shows which user ran each operation.</p>
<h3 id="heading-restore-setup">Restore Setup</h3>
<p>Delta provides built-in support for backup and restore strategies to handle issues like data corruption or accidental data loss. In our scenario, we'll intentionally delete some rows from the main table to simulate such situations.</p>
<p>We'll then use Delta's restore capability to revert the table to a point in time before the delete operation. By doing so, we can verify if the deletion was successful or if the data was restored correctly to its previous state. This feature ensures data safety and provides an easy way to recover from undesirable changes or failures.</p>
<p>Here's the code:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Delete 10 records from the main table</span>
<span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> diamonds <span class="hljs-keyword">where</span> <span class="hljs-string">`_c0`</span> <span class="hljs-keyword">in</span> (<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">7</span>,<span class="hljs-number">8</span>,<span class="hljs-number">9</span>,<span class="hljs-number">10</span>);
<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">from</span> diamonds;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-202.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Row count after deleting 10 records from the main table</em></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">COUNT</span>(*) <span class="hljs-keyword">FROM</span> diamonds <span class="hljs-keyword">VERSION</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">OF</span> <span class="hljs-number">19</span>;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-203.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Row count by referencing a previous version of the table</em></p>
<h2 id="heading-restoring-from-a-version-number"><strong>Restoring From A Version Number</strong></h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-204.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Illustration of how a Version Restore works in Databricks Notebooks</em></p>
<p>The code below restores the <code>diamonds</code> table to version 19 using Delta's table history. After the restore, a <code>SELECT</code> statement retrieves all rows from the <code>diamonds</code> table, which now match the table's contents at version 19.</p>
<p>Unlike a <code>VERSION AS OF</code> query, which only reads an older snapshot, <code>RESTORE</code> changes the current state of the table, so the rows deleted earlier are available again.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- restore the state of diamonds table to that of version 19 (refer the database images in the previous cell)</span>

<span class="hljs-keyword">RESTORE</span> <span class="hljs-keyword">TABLE</span> diamonds <span class="hljs-keyword">TO</span> <span class="hljs-keyword">VERSION</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">OF</span> <span class="hljs-number">19</span>;
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> diamonds;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-205.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>SELECT query running against a restored version of the database</em></p>
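<p>The difference between reading an old version and restoring it can be modeled with a small toy class. This is a conceptual sketch only: the class <code>VersionedTable</code>, its method names, and the 20 row ids are invented for illustration and do not reflect how Delta stores its transaction log.</p>

```python
class VersionedTable:
    """Toy table that keeps one snapshot per write, loosely imitating a version history."""

    def __init__(self, rows):
        self._versions = [frozenset(rows)]  # version 0

    def current(self):
        return self._versions[-1]

    def version_as_of(self, version):
        """Read an old snapshot without changing the table (time travel)."""
        return self._versions[version]

    def delete(self, keys):
        """Commit a new version without the given keys."""
        self._versions.append(self.current() - frozenset(keys))

    def restore(self, version):
        """Make an old snapshot current again by committing it as a new version."""
        self._versions.append(self._versions[version])

    def history(self):
        return list(range(len(self._versions)))


table = VersionedTable(range(1, 21))  # 20 hypothetical row ids, version 0
table.delete(range(1, 11))            # version 1: ten rows removed
count_after_delete = len(table.current())
count_at_version_0 = len(table.version_as_of(0))
table.restore(0)                      # version 2: same rows as version 0
count_after_restore = len(table.current())
print(count_after_delete, count_at_version_0, count_after_restore, table.history())
```

<p>In this model a restore is itself recorded as a new version, so the delete stays visible in the history even after it has been undone.</p>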
<h3 id="heading-autogenerated-fields">Autogenerated Fields</h3>
<p>Let us see how to use auto-increment in Delta with SQL. The code below demonstrates the creation of a table called <code>test__autogen</code> with an "autogenerated" field named <code>id</code>. The <code>id</code> column is defined as <code>BIGINT GENERATED ALWAYS AS IDENTITY</code>, meaning its values will be automatically generated by the database engine during the insertion process.</p>
<p>The <code>id</code> column works as an auto-incrementing identifier: each new record receives a unique value without any manual input, which saves you from generating identifiers yourself. This is why identity columns are commonly used as surrogate keys. Keep in mind that the generated values are unique but are not guaranteed to be consecutive, so gaps can appear.</p>
<pre><code class="lang-sql">%sql 
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> test__autogen (
  <span class="hljs-keyword">id</span> <span class="hljs-built_in">BIGINT</span> <span class="hljs-keyword">GENERATED</span> <span class="hljs-keyword">ALWAYS</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">IDENTITY</span> ( <span class="hljs-keyword">START</span> <span class="hljs-keyword">WITH</span> <span class="hljs-number">10000</span> <span class="hljs-keyword">INCREMENT</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span> ), 
  <span class="hljs-keyword">name</span> <span class="hljs-keyword">STRING</span>, 
  surname <span class="hljs-keyword">STRING</span>, 
  email <span class="hljs-keyword">STRING</span>, 
  city <span class="hljs-keyword">STRING</span>) ;

<span class="hljs-comment">-- Note that we don't insert data for the id. The engine will handle that for us:</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> test__autogen (<span class="hljs-keyword">name</span>, surname, email, city) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Atharva'</span>, <span class="hljs-string">'Shah'</span>, <span class="hljs-string">'highnessatharva@gmail.com'</span>, <span class="hljs-string">'Pune, IN'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> test__autogen (<span class="hljs-keyword">name</span>, surname, email, city) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'James'</span>, <span class="hljs-string">'Dean'</span>, <span class="hljs-string">'james@proton.mail'</span>, <span class="hljs-string">'Tokyo, JP'</span>);

<span class="hljs-comment">-- The ID is automatically generated!</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">from</span> test__autogen;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-206.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Records with an autogenerated</em> <code>id</code></p>
<h2 id="heading-delta-table-cloning"><strong>Delta Table Cloning</strong></h2>
<p>Cloning Delta tables allows you to create a replica of an existing Delta table at a specific version. This feature is particularly valuable when you need to transfer data from a production environment to a staging environment or when archiving a specific version for regulatory purposes.</p>
<p>There are two types of clones available:</p>
<ol>
<li><p><strong>Deep Clone:</strong> This type of clone copies both the source table data and metadata to the clone target. In other words, it replicates the entire table, making it independent of the source.</p>
</li>
<li><p><strong>Shallow Clone:</strong> A shallow clone only replicates the table metadata without copying the actual data files to the clone target. As a result, these clones are more cost-effective to create. However, it's crucial to note that shallow clones act as pointers to the main table. If a <code>VACUUM</code> operation is performed on the original table, it may delete the underlying files and potentially impact the shallow clone.</p>
</li>
</ol>
<p>It's important to remember that any modifications made to either deep or shallow clones only affect the clones themselves and not the source table.</p>
<p>Cloning Delta tables is a powerful feature that simplifies data replication and version archiving, enhancing data management capabilities within your Delta Lake environment.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-207.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Difference between a Shallow Clone and a Deep Clone</em></p>
<p>The code below shows how to clone a table using shallow and deep clones:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Shallow clone (zero copy)</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> diamonds__shallow__clone
  SHALLOW <span class="hljs-keyword">CLONE</span> diamonds
  <span class="hljs-keyword">VERSION</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">OF</span> <span class="hljs-number">19</span>;

<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> diamonds__shallow__clone;

<span class="hljs-comment">-- Deep clone (copy data)</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> diamonds__deep__clone
  DEEP <span class="hljs-keyword">CLONE</span> diamonds;

<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> diamonds__deep__clone;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/08/image-208.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Selecting records from the deep cloned table</em></p>
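<p>To make the difference between the two clone types concrete, here is a toy Python model of what happens when the source's files are removed. It is only a sketch: the <code>storage</code> dictionary, the file names, and the helper functions are all made up for illustration and are not Delta Lake APIs.</p>
<pre><code class="lang-python">import copy

# Toy model: a "table" is metadata plus a list of data-file names that live in
# a shared storage dict. None of these names are Delta Lake APIs.
storage = {"part-0.parquet": [1, 2], "part-1.parquet": [3, 4]}
source = {"schema": ["id"], "files": ["part-0.parquet", "part-1.parquet"]}


def shallow_clone(table):
    # Copy metadata only; the clone still points at the source's files.
    return {"schema": list(table["schema"]), "files": list(table["files"])}


def deep_clone(table, prefix):
    # Copy metadata and duplicate every data file under a new name.
    new_files = []
    for name in table["files"]:
        new_name = prefix + "-" + name
        storage[new_name] = copy.deepcopy(storage[name])
        new_files.append(new_name)
    return {"schema": list(table["schema"]), "files": new_files}


def read(table):
    # A read fails if any referenced file has been removed from storage.
    return [row for name in table["files"] for row in storage[name]]


shallow = shallow_clone(source)
deep = deep_clone(source, "deep")

# Simulate the source being rewritten and then vacuumed: an old file is deleted.
source["files"] = ["part-1.parquet"]
del storage["part-0.parquet"]

print(read(deep))  # the deep clone still has its own copies
try:
    read(shallow)
except KeyError as missing:
    print("shallow clone lost", missing)
</code></pre>
<p>The deep clone keeps working because it owns its files, while the shallow clone breaks as soon as a file it points to disappears.</p>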
<h2 id="heading-delta-magic-commands">Delta Magic Commands</h2>
<p>Databricks notebooks support magic commands: shortcuts that start with <code>%</code> and let you switch languages, run other notebooks, or issue shell and file system commands from a cell. They aren't specific to Delta, but they're handy when you manage Delta tables from a notebook.</p>
<p>Some of the commands below are IPython magics rather than Databricks-specific ones, so their availability can vary with your runtime version.</p>
<ol>
<li><p><code>%run</code>: runs a Python file or a notebook.</p>
</li>
<li><p><code>%sh</code>: executes shell commands on the cluster's driver node.</p>
</li>
<li><p><code>%fs</code>: allows you to interact with the Databricks file system.</p>
</li>
<li><p><code>%sql</code>: allows you to run SQL queries.</p>
</li>
<li><p><code>%scala</code>: switches the notebook context to Scala.</p>
</li>
<li><p><code>%python</code>: switches the notebook context to Python.</p>
</li>
<li><p><code>%md</code>: allows you to write markdown text.</p>
</li>
<li><p><code>%r</code>: switches the notebook context to R.</p>
</li>
<li><p><code>%lsmagic</code>: lists all the available magic commands.</p>
</li>
<li><p><code>%jobs</code>: lists all the running jobs.</p>
</li>
<li><p><code>%config</code>: allows you to set configuration options for the notebook.</p>
</li>
<li><p><code>%reload</code>: reloads the contents of a module.</p>
</li>
<li><p><code>%pip</code>: allows you to install Python packages.</p>
</li>
<li><p><code>%load</code>: loads the contents of a file into a cell.</p>
</li>
<li><p><code>%matplotlib</code>: sets up the matplotlib backend.</p>
</li>
<li><p><code>%who</code>: lists all the variables in the current scope.</p>
</li>
<li><p><code>%env</code>: allows you to set environment variables.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This in-depth handbook explored the power of Databricks, a platform that unifies analytics and data science in a single workspace. We went through Databricks Workspace, interactive analytics, and Delta Lake, emphasizing its data manipulation and analysis capabilities.</p>
<p>Delta Lake adds reliability to your data lake and supports SQL commands as well as more sophisticated queries. DataFrames let you shape and display data, version history and time travel let you inspect earlier states of a table, and table cloning lets you experiment on a copy without touching the source.</p>
<p>Your pursuit of data excellence doesn't end here. Let's stay connected: explore more insights on my <a target="_blank" href="https://atharvashah.netlify.app/">blog</a>, consider supporting me with a <a target="_blank" href="https://www.buymeacoffee.com/atharvashah">cup of coffee</a>, and join the conversation on <a target="_blank" href="https://twitter.com/cultist_dev">Twitter</a> and <a target="_blank" href="https://www.linkedin.com/in/atharva-shah-5873a2111/">LinkedIn</a>. Keep the momentum going by checking out a few of my other posts.</p>
<h2 id="heading-references">References</h2>
<ol>
<li><p><a target="_blank" href="https://docs.databricks.com/delta/index.html">Databricks Official Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://databricks.com/labs">Databricks Labs - Delta Lake Tutorials</a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Orchestrate an ETL Data Pipeline with Apache Airflow ]]>
                </title>
                <description>
                    <![CDATA[ By Aviator Ifeanyichukwu Data Orchestration involves using different tools and technologies together to extract, transform, and load (ETL) data from multiple sources into a central repository.  Data orchestration typically involves a combination of t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/orchestrate-an-etl-data-pipeline-with-apache-airflow/</link>
                <guid isPermaLink="false">66d45dd7052ad259f07e4a7d</guid>
                
                    <category>
                        <![CDATA[ apache ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 01 Mar 2023 22:42:42 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/etl_image.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aviator Ifeanyichukwu</p>
<p>Data Orchestration involves using different tools and technologies together to extract, transform, and load (ETL) data from multiple sources into a central repository. </p>
<p>Data orchestration typically involves a combination of technologies such as data integration tools and data warehouses.</p>
<p>Apache Airflow is a tool for data orchestration.</p>
<p>With Airflow, data teams can schedule, monitor, and manage the entire data workflow. Airflow makes it easier for organizations to manage their data, automate their workflows, and gain valuable insights from their data.</p>
<p>In this guide, you will be writing an ETL data pipeline. It will download data from Twitter, transform the data into a CSV file, and load the data into a Postgres database, which will serve as a data warehouse.  </p>
<p>External users or applications will be able to connect to the database to build visualizations and make policy decisions.</p>
<h3 id="heading-what-you-will-learn">What you will learn</h3>
<ol>
<li>How to extract data from Twitter</li>
<li>How to write a DAG script</li>
<li>How to load data into a database</li>
<li>How to use Airflow Operators</li>
</ol>
<h3 id="heading-what-you-need">What you need</h3>
<p>To follow along with this tutorial, you'll need the following:</p>
<ul>
<li>Apache Airflow installed on your machine</li>
<li>Airflow development environment up and running</li>
<li>An understanding of the building blocks of Apache Airflow (Tasks, Operators, etc.)</li>
<li>An IDE of your choice. Mine is VS Code.</li>
</ul>
<p>Sounds interesting, yeah? Let’s begin.</p>
<h2 id="heading-how-to-get-the-data-from-twitter">How to Get the Data from Twitter</h2>
<p>Twitter is a social media platform where users gather to share information and discuss trending world events/topics. Tons of data is generated daily through this platform. This will be your data source.</p>
<p>To get data from Twitter, you'd normally connect to its API, and numerous libraries make that easy. For this guide, we'll use snscrape instead, a Python library that scrapes publicly available tweets without requiring API credentials. You will also need Pandas, a Python library for data exploration and transformation.</p>
<h3 id="heading-installation">Installation</h3>
<p>Make sure your Airflow virtual environment is currently active.</p>
<pre><code class="lang-python">pip install snscrape pandas
</code></pre>
<p>Inside the Airflow dags folder, create two files: extract.py and transform.py.</p>
<p>extract.py:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> transform <span class="hljs-keyword">import</span> transform_data


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_data</span>():</span>
  <span class="hljs-comment"># create a list to append tweet data to</span>
  tweets_list = []

  <span class="hljs-comment"># scrape tweets and append them to the list, stopping once i passes 1000</span>
  <span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'Chatham House since:2023-01-14'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">1000</span>:
      <span class="hljs-keyword">break</span>
    tweets_list.append([tweet.date, tweet.user.username, tweet.rawContent,
                        tweet.sourceLabel, tweet.user.location])

  <span class="hljs-comment"># convert tweets into a dataframe</span>
  tweets_df = pd.DataFrame(tweets_list, columns=[<span class="hljs-string">'datetime'</span>, <span class="hljs-string">'username'</span>, <span class="hljs-string">'text'</span>, <span class="hljs-string">'source'</span>, <span class="hljs-string">'location'</span>])

  <span class="hljs-comment"># hand the dataframe to the transform step, which also loads it into Postgres</span>
  transform_data(tweets_df)
</code></pre>
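<p>If you want a hard cap on how many items you pull from a lazy generator, <code>itertools.islice</code> is an alternative to the enumerate-and-break pattern (which, as written in extract.py, lets 1,001 tweets through before it breaks). The sketch below uses a made-up <code>fake_scraper()</code> generator as a stand-in for the real scraper, so it runs without network access.</p>
<pre><code class="lang-python">from itertools import islice


def fake_scraper():
    # Stand-in for a lazy scraper generator; yields made-up items forever.
    n = 0
    while True:
        yield {"id": n, "text": "item " + str(n)}
        n += 1


def take(items, limit):
    # Pull at most `limit` items without exhausting the generator.
    return list(islice(items, limit))


rows = take(fake_scraper(), 1000)
print(len(rows), rows[0], rows[-1])
</code></pre>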
<p>transform.py:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> airflow.providers.postgres.hooks.postgres <span class="hljs-keyword">import</span> PostgresHook

<span class="hljs-comment"># Load clean data into postgres database</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_data_upload</span>(<span class="hljs-params">data</span>):</span>
  print(data.head())

  <span class="hljs-comment"># bulk_load() expects the path to a tab-delimited file, so insert the rows directly instead.</span>
  <span class="hljs-comment"># target_fields leaves out id, which Postgres generates itself.</span>
  rows = data.values.tolist()
  target_fields = [<span class="hljs-string">'datetime'</span>, <span class="hljs-string">'username'</span>, <span class="hljs-string">'text'</span>, <span class="hljs-string">'source'</span>, <span class="hljs-string">'location'</span>]

  postgres_sql_upload = PostgresHook(postgres_conn_id=<span class="hljs-string">"postgres_connection"</span>)
  postgres_sql_upload.insert_rows(<span class="hljs-string">'twitter_etl_table'</span>, rows, target_fields=target_fields)

  <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

<span class="hljs-comment">## perform data cleaning and transformation</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">transform_data</span>(<span class="hljs-params">tweets_df</span>):</span>
  <span class="hljs-comment"># info() prints its summary itself, so there is no need to wrap it in print()</span>
  tweets_df.info()
  <span class="hljs-comment">### Transformation happens here</span>

  <span class="hljs-comment"># load transformed data into database</span>
  task_data_upload(tweets_df)
</code></pre>
<h3 id="heading-the-database">The Database</h3>
<p>By default, Airflow keeps its own metadata in a SQLite database. To store the tweet data, you'll use PostgreSQL.</p>
<p>You should have PostgreSQL installed and running on your machine.</p>
<h3 id="heading-install-the-libraries">Install the libraries</h3>
<pre><code class="lang-python">pip install psycopg2
</code></pre>
<p>If this fails, try installing the binary version like this:</p>
<pre><code class="lang-python">pip install psycopg2-binary
</code></pre>
<p>Install the provider package for the Postgres database like this:</p>
<pre><code class="lang-python">pip install apache-airflow-providers-postgres
</code></pre>
<h2 id="heading-how-to-set-up-the-dag-script">How to Set Up the DAG Script</h2>
<p>Create a file named etl_pipeline.py inside the dags folder.</p>
<p>Start by importing the different airflow operators like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.empty <span class="hljs-keyword">import</span> EmptyOperator
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-keyword">with</span> DAG(
  <span class="hljs-string">'etl_twitter_pipeline'</span>,
  description=<span class="hljs-string">"A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow"</span>,
  start_date=datetime(year=<span class="hljs-number">2023</span>, month=<span class="hljs-number">2</span>, day=<span class="hljs-number">5</span>),
  schedule_interval=timedelta(minutes=<span class="hljs-number">2</span>)
) <span class="hljs-keyword">as</span> dag:

  start_pipeline = EmptyOperator(
    task_id=<span class="hljs-string">'start_pipeline'</span>,
  )

start_pipeline
</code></pre>
<p>With a dag_id named 'etl_twitter_pipeline', this dag is scheduled to run every two minutes, as defined by the schedule interval.</p>
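<p>To see what that schedule means in practice, you can compute where the first few two-minute intervals begin using plain Python. This is only a sketch of the arithmetic, not Airflow's scheduler: Airflow triggers a run once the interval it covers has ended, so the first run fires about two minutes after the start date.</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

start_date = datetime(year=2023, month=2, day=5)
interval = timedelta(minutes=2)

# Each run covers one interval; list where the first few intervals begin.
interval_starts = [start_date + n * interval for n in range(4)]
for moment in interval_starts:
    print(moment.isoformat())
</code></pre>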
<h3 id="heading-how-to-view-the-web-ui">How to View the Web UI</h3>
<p>Start the scheduler with this command:</p>
<pre><code class="lang-python">airflow scheduler
</code></pre>
<p>Then start the web server with this command: </p>
<pre><code class="lang-python">airflow webserver
</code></pre>
<p>Open the browser on localhost:8080 to view the UI.</p>
<p>Search for a dag named ‘etl_twitter_pipeline’, and click on the toggle icon on the left to start the dag.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/airflow_ui_1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow UI showing created dags</em></p>
<h2 id="heading-how-to-set-up-a-postgres-database-connection">How to Set Up a Postgres Database Connection</h2>
<p>You should already have apache-airflow-providers-postgres and psycopg2 or psycopg2-binary installed in your virtual environment.</p>
<p>From the UI, navigate to <em>Admin</em> -&gt; <em>Connections</em>. Click on the plus sign at the top left corner of your screen to add a new connection and specify the connection parameters. Set the connection ID to <code>postgres_connection</code>, since that is the ID the code in this guide refers to. Click on <em>Test</em> to verify the connection to the database server. Once completed, scroll to the bottom of the screen and click on <em>Save</em>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/postgres_connect-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>PostgreSQL database connection</em></p>
<p>Inside the Airflow directory created in the virtual environment, open the airflow.cfg file in your text editor, locate the variable named sql_alchemy_conn, and set the PostgreSQL connection string:</p>
<pre><code class="lang-python">sql_alchemy_conn = postgresql+psycopg2://postgres:<span class="hljs-number">1234</span>@localhost:<span class="hljs-number">5432</span>/test
</code></pre>
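<p>If the connection string looks cryptic, you can pull it apart with Python's standard library to see which part is which. The snippet below is just an illustration using the same example string. Swap in your own user, password, host, port, and database name.</p>
<pre><code class="lang-python">from urllib.parse import urlsplit

conn = "postgresql+psycopg2://postgres:1234@localhost:5432/test"
parts = urlsplit(conn)

# The scheme combines the SQLAlchemy dialect and the database driver.
dialect, driver = parts.scheme.split("+")
print("dialect: ", dialect)
print("driver:  ", driver)
print("user:    ", parts.username)
print("password:", parts.password)
print("host:    ", parts.hostname)
print("port:    ", parts.port)
print("database:", parts.path.lstrip("/"))
</code></pre>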
<p>The Airflow executor is currently set to SequentialExecutor. Change this to LocalExecutor:</p>
<pre><code class="lang-python">executor = LocalExecutor
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/executor.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow DAG Executor</em></p>
<p>The Airflow UI is currently cluttered with example DAGs. In the airflow.cfg config file, find the load_examples variable, and set it to False.</p>
<pre><code class="lang-python">load_examples = <span class="hljs-literal">False</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/load_eg.png" alt="Image" width="600" height="400" loading="lazy">
<em>Disable example dags</em></p>
<p>Restart the webserver, reload the web UI, and you should now have a clean UI:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/clean_dag.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow UI</em></p>
<h2 id="heading-how-to-use-the-postgres-operator">How to Use the Postgres Operator</h2>
<p>Start by importing the different Airflow operators. You'll also need to import the extract_data function from extract.py, which in turn calls the code in transform.py.</p>
<p>etl_pipeline.py</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.python <span class="hljs-keyword">import</span> PythonOperator
<span class="hljs-keyword">from</span> airflow.operators.empty <span class="hljs-keyword">import</span> EmptyOperator
<span class="hljs-keyword">from</span> airflow.providers.postgres.operators.postgres <span class="hljs-keyword">import</span> PostgresOperator

<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-keyword">from</span> extract <span class="hljs-keyword">import</span> extract_data



<span class="hljs-keyword">with</span> DAG(
  <span class="hljs-string">'etl_twitter_pipeline'</span>,
  description=<span class="hljs-string">"A simple twitter ETL pipeline using Python,PostgreSQL and Apache Airflow"</span>,
  start_date=datetime(year=<span class="hljs-number">2023</span>, month=<span class="hljs-number">2</span>, day=<span class="hljs-number">5</span>),
  schedule_interval=timedelta(minutes=<span class="hljs-number">5</span>)
) <span class="hljs-keyword">as</span> dag:

  start_pipeline = EmptyOperator(
    task_id=<span class="hljs-string">'start_pipeline'</span>,
  )

  create_table = PostgresOperator(
    task_id=<span class="hljs-string">'create_table'</span>,
    postgres_conn_id=<span class="hljs-string">'postgres_connection'</span>,
    sql=<span class="hljs-string">'sql/create_table.sql'</span>
  )


  etl = PythonOperator(
    task_id = <span class="hljs-string">'extract_data'</span>,
    python_callable = extract_data
  )


  clean_table = PostgresOperator(
    task_id=<span class="hljs-string">'clean_sql_table'</span>,
    postgres_conn_id=<span class="hljs-string">'postgres_connection'</span>,
    sql=[<span class="hljs-string">"""TRUNCATE TABLE twitter_etl_table"""</span>]
  )

  end_pipeline = EmptyOperator(
    task_id=<span class="hljs-string">'end_pipeline'</span>,
  )
</code></pre>
<p>sql/create_table.sql</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">EXISTS</span> twitter_etl_table(
      <span class="hljs-keyword">id</span> <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
      datetime <span class="hljs-built_in">DATE</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
      username <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
      <span class="hljs-built_in">text</span> <span class="hljs-built_in">TEXT</span>,
      <span class="hljs-keyword">source</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>),
      location <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">200</span>)
    );
</code></pre>
<p>The create_table task connects to Postgres and creates the table if it doesn't already exist.</p>
<p>The etl task calls the extract_data() function, which is where our ETL data processing takes place.</p>
<p>The clean_table task uses the PostgresOperator to truncate the table, clearing out its previous contents before new rows are inserted.</p>
<p>The end_pipeline task marks the end of the pipeline.</p>
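<p>Clearing the table before each load means a rerun replaces the rows instead of piling up duplicates. Here is a minimal sketch of that idea using Python's built-in sqlite3 module, so it runs without a Postgres server. The <code>demo_table</code> name and <code>run_pipeline()</code> helper are made up for illustration, and since SQLite has no <code>TRUNCATE</code>, a <code>DELETE</code> stands in for it.</p>
<pre><code class="lang-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS demo_table (id INTEGER PRIMARY KEY, text TEXT)")


def run_pipeline(rows):
    # Clear the previous load first (SQLite has no TRUNCATE, so DELETE stands in).
    conn.execute("DELETE FROM demo_table")
    conn.executemany("INSERT INTO demo_table (text) VALUES (?)", [(r,) for r in rows])
    conn.commit()


run_pipeline(["first", "second"])
run_pipeline(["first", "second"])  # a rerun replaces the rows instead of duplicating them

count = conn.execute("SELECT COUNT(*) FROM demo_table").fetchone()[0]
print(count)
</code></pre>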
<h3 id="heading-how-to-create-dependencies-between-tasks">How to Create Dependencies Between Tasks</h3>
<p>The last step is to create dependencies between the tasks, so that Airflow knows the order in which to run them.</p>
<pre><code class="lang-python">start_pipeline &gt;&gt; create_table &gt;&gt; clean_table &gt;&gt; etl &gt;&gt; end_pipeline
</code></pre>
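<p>If you're curious how a chain of <code>&gt;&gt;</code> operators can describe an execution order, here is a toy model in plain Python. The <code>Task</code> class below is a made-up stand-in, not Airflow's operator class: it only records which task comes before which, and the standard library's <code>graphlib</code> (Python 3.9+) then works out a valid run order.</p>
<pre><code class="lang-python">from functools import reduce
from graphlib import TopologicalSorter
from operator import rshift


class Task:
    # Toy stand-in for an Airflow operator; not the real class.
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # Record this task as upstream of `other`, then return `other`
        # so that a chain of right-shift operators keeps going.
        other.upstream.add(self.task_id)
        return other


names = ["start_pipeline", "create_table", "clean_table", "etl", "end_pipeline"]
tasks = [Task(name) for name in names]

# reduce(rshift, tasks) applies the right-shift operator pairwise, left to right,
# which mirrors the chained expression in the DAG file.
reduce(rshift, tasks)

# Map each task to the set of tasks that must finish before it.
graph = {task.task_id: task.upstream for task in tasks}
order = list(TopologicalSorter(graph).static_order())
print(order)
</code></pre>
<p>Because each task has exactly one upstream task here, there is only one valid order, and it matches the chain you wrote.</p>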
<h2 id="heading-how-to-test-the-workflow">How to Test the Workflow</h2>
<p>To start, click on the 'etl_twitter_pipeline' dag. Click on the graph view option, and you can now see the flow of your ETL pipeline and the dependencies between tasks.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Screenshot-2023-02-27-at-17-04-55-etl_twitter_pipeline---Graph---Airflow.png" alt="Image" width="600" height="400" loading="lazy">
<em>Airflow running data pipeline</em></p>
<p>And there you have it – your ETL data pipeline in Airflow. I hope you found it useful and yours is working properly.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Apache Airflow is an easy-to-use orchestration tool for scheduling and monitoring data pipelines. With your knowledge of Python, you can write DAG scripts that define those pipelines.</p>
<p>In this guide, you learned how to set up an ETL pipeline using Airflow and also how to schedule and monitor the pipeline.</p>
<p>You have also seen how to use some Airflow operators, such as PythonOperator, PostgresOperator, and EmptyOperator.</p>
<p>I hope you learned something from this guide.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Difference Between Data Science and Data Engineering ]]>
                </title>
                <description>
                    <![CDATA[ By Edem Gold I recently became very interested in Data Science and Data Engineering, especially how they compare and complement each other.  I initially assumed Data Engineering was a subset of Data Science. But after extensive research I found out j... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-science-vs-data-engineering/</link>
                <guid isPermaLink="false">66d84fc7f20d0925f8515af3</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 21 Feb 2023 22:54:32 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/pexels-markus-spiske-177598--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Edem Gold</p>
<p>I recently became very interested in Data Science and Data Engineering, especially how they compare and complement each other. </p>
<p>I initially assumed Data Engineering was a subset of Data Science. But after extensive research I found out just how much the two fields differ.</p>
<p>In this article, I'll discuss the differences between Data Science and Data Engineering and the main tasks of each field.</p>
<blockquote>
<p>"Data is the new oil. It’s valuable, but if unrefined it cannot really be used." – Clive Humby</p>
</blockquote>
<h2 id="heading-what-do-we-mean-by-data">What Do We Mean by Data?</h2>
<p>To fully understand the relationship between Data Science and Data Engineering, you have to understand the one thing that links them both: data.</p>
<p>Data is a word that has become commonplace in today's society, with so many reports of <a target="_blank" href="https://www.statista.com/statistics/1307426/number-of-data-breaches-worldwide">data leaks</a>, <a target="_blank" href="https://www.security.org/resources/data-tech-companies-have/">the inappropriate collection of data by big tech companies</a>, and so on.</p>
<p>Data refers to information that is collected and stored in a format that can be processed by a computer. It can be in various forms such as numbers, text, images, and videos, and it can be collected, stored, and analyzed to extract insights and inform decisions.</p>
<p><strong>Now why do so many companies want data and what's so special about it?</strong></p>
<p>Data is important to companies because it allows them to make informed decisions about their operations and strategies. By analyzing data, companies can gain insights into the behaviour of their users. Then, they can use those insights to make their products more efficient, desirable, and useful.</p>
<p>Data scientists and engineers are the people responsible for collecting the data, making it useful, analysing it, and extracting insights and trends from it. They also pass on the information they've mined to management to support informed decision making. Now let's see how they differ.</p>
<h2 id="heading-what-is-data-science">What is Data Science?</h2>
<p>Data Science was named <em>The Sexiest Job of the 21st Century</em> by the <a target="_blank" href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">Harvard Business Review</a>, and its claim to the title is arguably legitimate.</p>
<p>Data Science is the process of using scientific methods, algorithms, and systems to analyse and extract value from data.</p>
<p>In other words, the data scientist is the individual responsible for gaining insights from data and making abstract mathematical models from the data in order to enable prediction.</p>
<p>Now let's look at the data engineer.</p>
<h2 id="heading-what-is-data-engineering">What is Data Engineering?</h2>
<p>Data Engineering is the process of designing, constructing, and maintaining the pipelines and infrastructure that collect, store, process and analyze data.</p>
<p>The Data Engineer is the individual who's responsible for ensuring that the data required by Data Scientists is available, accurate, and in the correct format.</p>
<p>Data is infuriatingly complex and disordered when it is collected. In order for Data Scientists to efficiently gain insights from it, the data needs to be pre-processed. </p>
<p>Then, once insights have been gained, Data Scientists formulate an abstract mathematical model from the data, commonly known as a <a target="_blank" href="https://learn.microsoft.com/en-us/windows/ai/windows-ml/what-is-a-machine-learning-model">Machine Learning Model</a>. This abstraction needs to be post-processed in order to be deployed and integrated into the product. </p>
<p>All these tasks are performed by data engineers.</p>
<h2 id="heading-the-relationship-between-data-scientists-and-the-data-engineers-explained-with-an-analogy">The Relationship Between Data Scientists and the Data Engineers – Explained with an Analogy</h2>
<p>Imagine you placed a bet with a friend on the outcome of a football game. But you wanted to cut out the luck factor that is always so present in uninformed guesses. This way you can be extremely sure that your team wins the game and you win the bet.</p>
<p>A data engineer would collect the data on the two teams involved in the bet. They'd consider data points such as <em>number of games won, possession rate per game, and results of previous clashes between the two teams</em>. Then they'd create an ETL pipeline where the data would be collected, cleaned, and stored for the data scientist.</p>
<p>The Data Scientist would then perform something called <em>Predictive Analysis</em> using Machine Learning. This means that the data scientist would feed the data prepared by the data engineer into an algorithm that would then generate a mathematical abstraction called a <em>Machine Learning model</em>. </p>
<p>Then the Machine Learning model would predict the team expected to win the game. And just like that, your guess becomes less of a guess and more of a data-informed decision.</p>
<h2 id="heading-summary">Summary</h2>
<p>As you can hopefully see from this description of Data Scientists and Engineers, a Data Scientist is like a star football player, and a Data Engineer is like the player's very talented coach, who keeps them fit and provides them with tactics to win a game.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Pass Microsoft’s Azure Data Engineer Certification – DP-203 Exam Guide ]]>
                </title>
                <description>
                    <![CDATA[ By Ryan Dawson Data Engineering jobs are in high demand. And getting a certification in the subject is a good way to learn and to deepen and prove your skills.  Each cloud provider offers a certification specialized to their Data Engineering services... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-pass-microsoft-azure-data-engineer-certification-dp-203/</link>
                <guid isPermaLink="false">66d460c947a8245f78752ab7</guid>
                
                    <category>
                        <![CDATA[ Azure ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Certification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Microsoft ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 19 Oct 2021 16:02:51 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/10/ux-g6bea5bfef_1920.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ryan Dawson</p>
<p>Data Engineering jobs are in <a target="_blank" href="https://insights.dice.com/2019/06/04/data-engineer-remains-top-demand-job/">high demand</a>. And getting a certification in the subject is a good way to learn and to deepen and prove your skills. </p>
<p>Each cloud provider offers a certification specialized to their Data Engineering services. They carry weight but are not easy to pass.</p>
<p>The Azure DP-203 is a pretty tough certification to pass. I spent about 20 hours on an online course, 5 hours on practice exams, about 5 hours reading, and another few hours figuring out what tools to use, how to sign up, and so on.</p>
<p>My aim was to get to a pass quickly and I think it can be done more quickly than I did it. Here are some tips on getting your Microsoft Azure Data Engineer certification as quickly as possible.</p>
<h1 id="heading-pre-requisites">Pre-requisites</h1>
<p>I have a background in data engineering but I don’t think it’s required. If you don’t have this background then I’d suggest a <a target="_blank" href="https://www.xplenty.com/blog/etl-data-warehousing-explained-etl-tool-basics/">short primer</a> on key Data Engineering concepts (OLTP vs OLAP, data warehouses, and data lakes). </p>
<p>I was also very familiar with cloud computing before the course but not especially familiar with Azure’s core services. If you want a primer on cloud computing then <a target="_blank" href="https://docs.microsoft.com/en-us/learn/modules/intro-to-azure-fundamentals/">Microsoft has one</a>.</p>
<h1 id="heading-how-to-study-for-the-dp-203-exam">How to Study for the DP-203 Exam</h1>
<h3 id="heading-online-courses">Online Courses</h3>
<p>I used a <a target="_blank" href="https://www.udemy.com/course/data-engineering-on-microsoft-azure/">Udemy course</a> by Alan Rodrigues to work through the content. I liked that it showed the tools in action, illustrated the concepts, and contained exam tips and practice tests. </p>
<p>There is <a target="_blank" href="https://docs.microsoft.com/en-us/learn/certifications/exams/dp-203">official content from Microsoft</a> and that looks good too. The official content is a mixture of text and video with more weight towards text, whereas courses like Udemy (and there are plenty of them) are mostly video. </p>
<p>I’d suggest just picking whatever you like the look of and trying not to overthink the decision.</p>
<h3 id="heading-books">Books</h3>
<p>I also bought a book to read on my Kindle but I didn’t get much value from that and would say a course is enough. </p>
<p>I’d suggest checking when any material was last updated before choosing it, as the course content does change over time and books can get outdated quickly. I do find it helpful to have reading material to complement a course. I think I just didn't pick the right book for me this time.</p>
<h3 id="heading-getting-hands-on">Getting Hands-on</h3>
<p>Microsoft, on their website, points to an option of instructor-led training but I didn’t do that. I also didn’t do any practical labs in the course that I worked through. </p>
<p><a target="_blank" href="https://dhyanintech.medium.com/how-to-prepare-for-the-azure-data-engineer-associate-certification-4cf122f1937f">Others do recommend</a> getting hands-on. I suspect it depends how much background you have and whether you feel confident that you’ve grasped the material just from watching instructional content. </p>
<p>Some people take notes while watching videos to help organise thoughts and make sure they’re understanding. I didn’t take notes either, though I have on previous courses.</p>
<h1 id="heading-how-to-practice-for-the-exam">How to Practice for the Exam</h1>
<p>My top tip is to use the <a target="_blank" href="https://uk.mindhub.com/dp-203-data-engineering-on-microsoft-azure-microsoft-official-practice-test/p/MU-DP-203?utm_source=microsoft&amp;utm_medium=certpage&amp;utm_campaign=msofficialpractice">official practice exam tool</a>. It is worth the cost.</p>
<p>I used practice tests on Udemy and they were helpful. But with those you don’t get feedback until you’ve completed the test. </p>
<p>When learning I prefer to find out what I’ve got wrong straight away. If I don't find out the answer until the end of a test then I have to remember the context of the question, which involves more effort. I also want to ensure that what lodges in my brain is the right answer and not the wrong one. For that it's better to be corrected quickly.</p>
<h3 id="heading-features-of-the-official-practice-test">Features of the Official Practice Test</h3>
<p>The official practice test lets you choose test lengths for however long you’ve got to practice at that moment. You can configure whether it tells you answers straight away or at the end.</p>
<p><img src="https://lh5.googleusercontent.com/gvuQ1De2BRHBvvXQbJkn6RRXM7EoSA34WPZpxQmO_6VLx-F-toNyLWgjifowqJRyWKEJ2m5A_aaUJ5q7OUu-NWdSeft1xpugLzYi3IIM9XjOgTEbGgvsUUKXmEBnAP4hKm5b_eAP=s0" alt="Image" width="1986" height="1470" loading="lazy"></p>
<p>The format of the questions matches the exam, which you don’t get with all of the practice exam sites. The exam has some multi-part linked questions and a lot of sites aren’t able to do that.</p>
<h3 id="heading-getting-started-with-the-official-practice-test">Getting Started With the Official Practice Test</h3>
<p>The flow to buy and log in to the practice test is a bit confusing. You click through from the Microsoft website and buy it from a third-party vendor (<a target="_blank" href="https://uk.mindhub.com/dp-203-data-engineering-on-microsoft-azure-microsoft-official-practice-test/p/MU-DP-203?utm_source=microsoft&amp;utm_medium=certpage&amp;utm_campaign=msofficialpractice">in the UK, where I am, this is mindhub</a>). But you don’t need a login there, and if you do create one you can’t use it to log in to the test. </p>
<p>Instead you log in with your Microsoft or GitHub identity to <a target="_blank" href="https://marketplace.measureup.com/login">marketplace.measureup.com</a> and register a key from the purchase.</p>
<h3 id="heading-practice-test-books">Practice Test Books</h3>
<p>I also used a book of practice questions and answers. That way I could do some exam prep away from my laptop. </p>
<p>There are several of these books out there and I suspect they’re pretty similar. They’re not as good on quality as the practice test. I found errors like an answer section following a different question than the one it belonged to (which may be a Kindle thing). The explanations also aren’t as detailed, but they’re still helpful. </p>
<p>The one I used had a format of question and options on one page and then answers on the next. I would recommend using a book like this.</p>
<h3 id="heading-topics-i-targeted">Topics I Targeted</h3>
<p>There are certain topics that it’s worth getting clear about because it seems to me like they come up a lot and the questions are mostly of a similar format. The ones that stood out for me are:</p>
<ul>
<li>Storage tiers. If you know hot, cool and archive you’ll get some marks.</li>
<li>Star schema.</li>
<li>Slowly changing dimension types. Wikipedia explains this well.</li>
<li>Distributions for Synapse dedicated tables. Expect to get asked a question with a big Fact table and small Dimension tables. The dimension tables will want replicated distribution and the big fact table will be hash-distributed with hashing on some foreign key type column used in joins.</li>
<li>Difference between Synapse (warehouse with some added processing features), Stream Analytics (real-time processing), Data Lake (large-scale unstructured storage), Data Factory (ETL) and Databricks (managed Spark plus notebooks, ML and Delta Lake).</li>
<li>When to use Parquet, Avro, JSON and CSV formats. (The answer is almost always that Parquet is best for querying data at scale but they also like to make sure you know Avro is good for timestamped data.)</li>
<li>The syntax for writing to a file or stream using Spark.</li>
</ul>
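<p>As a rough illustration of the distribution question, here is a minimal sketch of a small star schema in a Synapse dedicated SQL pool. The table and column names (<code>DimStore</code>, <code>FactSales</code>, <code>StoreKey</code>) are made up for this example, and the index options are just one reasonable choice rather than something taken from the exam material.</p>
<pre><code class="lang-sql">-- Small dimension table: a full copy is kept on every compute node,
-- so joins against it do not need to move data around.
CREATE TABLE dbo.DimStore (
    StoreKey  INT           NOT NULL,
    StoreName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, HEAP);

-- Large fact table: rows are spread across distributions by hashing
-- the foreign-key column that is used in joins (hypothetical StoreKey).
CREATE TABLE dbo.FactSales (
    SaleKey    BIGINT         NOT NULL,
    StoreKey   INT            NOT NULL,
    SaleAmount DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX);
</code></pre>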
<p>The topics in this list were ones I felt I could bank on: I could see the patterns in the questions and they were fairly predictable. There were other topics I was confident about after practicing, but for those there are lots of different ways the questions can be approached. It’s reassuring to have some questions you can do from memory without having to think a lot.</p>
<h3 id="heading-learning-to-analyse-questions">Learning to Analyse Questions</h3>
<p>Learning how to read and analyse the questions is a skill. Often the questions contain some incidental detail around a scenario (for example, the fact that it’s a grocery company) and some clues that point to an answer (for example, the size of their sales transaction table). </p>
<p>I don’t always recognise where a question is going from a first read, so then I find it useful to quickly scan the answers. This is dangerous, though, as then you might find something you recognise and match it to a clue and jump on that. But the questions can require picking up on multiple clues. </p>
<p>You want to eliminate the wrong answers if you can (which is tricky if you try to go fast). Sometimes there are also multiple right answers to select and you’ll miss marks if you choose just one when there are two correct selections (the questions do tell you when you should select multiple but it’s easy to miss if you rush or get tired). </p>
<p>This is why I recommend the official practice exam tool – it’s great for learning the format of questions and the skill of how to read them.</p>
<h1 id="heading-when-to-tackle-the-real-thing">When to Tackle the Real Thing</h1>
<h3 id="heading-can-i-learn-it-all">Can I Learn It All?</h3>
<p>The scope of the exam is so broad that it’s hard to really know everything. You’ll even get occasional wildcard questions that bring in Azure services that don’t relate to data. </p>
<p>When that does happen and you don’t know about those services, you can typically guess your way through it so long as you’re clear enough about the data services.</p>
<p>Cosmos DB is a new part of the Azure data stack. It’s not explicitly part of the syllabus right now but the exam scope is broad and it does creep in. It’s worth knowing a little about it but don’t get sucked into learning all its ins and outs.</p>
<p>If you try to really learn everything that could come up then you’ll have an awful lot to learn. </p>
<p>Databricks is huge in itself. You don’t need to know any of the Databricks machine learning stuff. You basically just need to know about setting up clusters, working with files in Azure storage using Spark, authentication, and the differences between Databricks and other Azure services that happen to feature flavours of Spark (Synapse and HDInsight).</p>
<h3 id="heading-so-when-am-i-ready">So When Am I Ready?</h3>
<p>If you’re doing well consistently on practice exams, you’ve developed a knack for guessing and can make the guesses pretty quickly, then you’re ready to give it a go.</p>
<p>But it’s also about your comfort level – don’t let me (or anyone) rush you into doing it before you feel ready.</p>
<h1 id="heading-how-to-tackle-the-exam">How to Tackle the Exam</h1>
<h3 id="heading-structure-and-timing">Structure and Timing</h3>
<p>You don’t get much time per question. For me it was 65 questions in 100 minutes. The total exam time is more than 100 minutes but 100 minutes is the time you get for answering questions. </p>
<p>Some questions are worth more than one mark (one for each selection in the answer), so the number of questions you get can vary (I believe the number of marks available stays the same). </p>
<p>Basically expect to answer roughly one question every 90 seconds or so. But don’t worry about time while first practicing. You’ll get faster with practice.</p>
<p>I actually didn’t finish all the questions. I thought I had finished after about 60 questions and started reviewing answers. Then I realised I had to click through to a separate case study section. This happens right at the end, and although there is guidance I didn’t find it super clear. I didn’t miss out on many marks from it though as it’s not a big section.</p>
<h3 id="heading-online-proctored-vs-test-centre">Online Proctored vs Test Centre</h3>
<p>Traditionally these kinds of certification tests were taken in test centres under supervision. Now the online proctored versions are popular. This was my first online proctored exam.</p>
<p>When I first looked at the guidance on online proctored exams I thought it sounded complicated. I knew that at a test centre there would be someone in the room the whole time and they'd be watching in case anyone had snuck in notes. I kept wondering how you could replicate this in an online proctored exam.</p>
<p>There are some instructions for the online proctored test about keeping your room clear and showing that you don't have any notes. These confused me, and for a while I thought I would need to buy an extra webcam to show that my room was clear and that I didn't have any notes at any point. </p>
<p>But it’s actually much simpler. I only needed to take photos of the room (using a link that gets sent) at the beginning and stay on my main webcam (a built-in one was enough). So it's just a check of the room at the beginning and then you just stay on webcam.</p>
<h2 id="heading-good-luck">Good Luck!</h2>
<p>I hope this advice is helpful to you and I wish you good luck! Feel free to ask me questions <a target="_blank" href="https://twitter.com/ryandawsongb">via Twitter</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn these quick tricks in PostgreSQL ]]>
                </title>
                <description>
                    <![CDATA[ By Peter Gleeson PostgreSQL is one of the most popular open source SQL dialects. One of its main advantages is the ability to extend its functionality with some inbuilt tools. Here, let's look at a few PostgreSQL tricks that you can start using to ta... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/postgresql-tricks/</link>
                <guid isPermaLink="false">66d460ae264384a65d5a95cd</guid>
                
                    <category>
                        <![CDATA[ analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ backend ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ postgres ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 18 Nov 2019 13:33:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9f3e740569d1a4ca418d.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Peter Gleeson</p>
<p><a target="_blank" href="https://www.postgresql.org/about/">PostgreSQL</a> is one of the most popular open source SQL databases. One of its main advantages is the ability to extend its functionality with some inbuilt tools.</p>
<p>Here, let's look at a few PostgreSQL tricks that you can start using to take your SQL skills to the next level. </p>
<p>You'll find out how to:</p>
<ul>
<li>Quickly copy files into a database</li>
<li>Summarise data in crosstab format</li>
<li>Take advantage of arrays and JSON data in SQL</li>
<li>Work with geometric data</li>
<li>Run statistical analyses directly on your database</li>
<li>Use recursion to solve problems</li>
</ul>
<h3 id="heading-copy-data-from-a-file">Copy data from a file</h3>
<p>An easy way to quickly import data from an external file is to use the COPY command. Simply create the table you want to use, then pass in the filepath of your dataset to the COPY command.</p>
<p>The example below creates a table called revenue and fills it from a <a target="_blank" href="https://gist.github.com/pg0408/43664635ee89558ba4961a84b833342b">randomly generated CSV file</a>.</p>
<p>You can include extra parameters, to indicate the filetype (here, the file is a CSV) and whether to read the first row as column headers.</p>
<p>You can learn more <a target="_blank" href="https://www.postgresql.org/docs/12/sql-copy.html">here</a>.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> revenue (
  <span class="hljs-keyword">store</span> <span class="hljs-built_in">VARCHAR</span>,
  <span class="hljs-keyword">year</span> <span class="hljs-built_in">INT</span>,
  revenue <span class="hljs-built_in">INT</span>,
  PRIMARY <span class="hljs-keyword">KEY</span> (<span class="hljs-keyword">store</span>, <span class="hljs-keyword">year</span>)
);

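<span class="hljs-comment">-- COPY reads the file on the database server and "~" is not expanded, so give the full path (placeholder below).</span>
COPY revenue FROM '/path/to/Projects/datasets/revenue.csv' <span class="hljs-keyword">WITH</span> (FORMAT csv, HEADER);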
</code></pre>
<h3 id="heading-summarise-data-using-the-crosstab-function">Summarise data using the crosstab function</h3>
<p>If you fancy yourself as a spreadsheet pro, you will probably be familiar with <a target="_blank" href="https://support.office.com/en-us/article/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576">creating pivot tables</a> from dumps of data. You can do the same in PostgreSQL with the crosstab function.</p>
<p>The crosstab function can take data in the form on the left, and summarise it in the form on the right (which is much easier to read). The example here will follow on with the revenue data from before.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/11/Screenshot-2019-11-17-at-16.54.40.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>First, enable the <a target="_blank" href="https://www.postgresql.org/docs/12/tablefunc.html">tablefunc extension</a> with the command below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> EXTENSION tablefunc;
</code></pre>
<p>Next, write a query using the crosstab function:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> CROSSTAB(
  <span class="hljs-string">'SELECT
          *
    FROM revenue
    ORDER BY 1,2'</span>
  ) 
<span class="hljs-keyword">AS</span> summary(
    <span class="hljs-keyword">store</span> <span class="hljs-built_in">VARCHAR</span>, 
    <span class="hljs-string">"2016"</span> <span class="hljs-built_in">INT</span>, 
    <span class="hljs-string">"2017"</span> <span class="hljs-built_in">INT</span>, 
    <span class="hljs-string">"2018"</span> <span class="hljs-built_in">INT</span>
    );
</code></pre>
<p>There are two things to consider when using this function.</p>
<ul>
<li>First, pass in a query selecting data from your underlying table. You may simply select the table as it is (as shown here). However, you might want to filter, join or aggregate if required. Be sure to order the data correctly.</li>
<li>Then, define the output (in the example, the output is called 'summary', but you can call it any name). List the column headers you want to use and the data type they will contain.</li>
</ul>
<p>The output will be as shown below:</p>
<pre><code>  store  |  <span class="hljs-number">2016</span>   |  <span class="hljs-number">2017</span>   |  <span class="hljs-number">2018</span>   
---------+---------+---------+---------
 Alpha   | <span class="hljs-number">1637000</span> | <span class="hljs-number">2190000</span> | <span class="hljs-number">3287000</span>
 Bravo   | <span class="hljs-number">2205000</span> |  <span class="hljs-number">982000</span> | <span class="hljs-number">3399000</span>
 Charlie | <span class="hljs-number">1549000</span> | <span class="hljs-number">1117000</span> | <span class="hljs-number">1399000</span>
 Delta   |  <span class="hljs-number">664000</span> | <span class="hljs-number">2065000</span> | <span class="hljs-number">2931000</span>
 Echo    | <span class="hljs-number">1795000</span> | <span class="hljs-number">2706000</span> | <span class="hljs-number">1047000</span>
(<span class="hljs-number">5</span> rows)
</code></pre><h3 id="heading-work-with-arrays-and-json">Work with arrays and JSON</h3>
<p>PostgreSQL supports multi-dimensional array data types. These are comparable to similar data types in many other languages, including Python and JavaScript.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/11/Screenshot-2019-11-17-at-23.02.00.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You might want to use them in situations where it helps to work with more dynamic, less-structured data. </p>
<p>For example, imagine a table describing published articles and subject tags. An article could have no tags, or it could have many. Trying to store this data in a structured table format would be unnecessarily complicated.</p>
<p>You can define arrays using a data type, followed by square brackets. You can optionally specify their dimensions (however, this is not enforced).</p>
<p>For example, to create a 1-D array of any number of text elements, you would use <code>text[]</code>. To create a three-by-three two dimensional array of integer elements, you would use <code>int[3][3]</code>.</p>
<p>Take a look at the example below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> articles (
  title <span class="hljs-built_in">VARCHAR</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  tags <span class="hljs-built_in">TEXT</span>[]
);
</code></pre>
<p>To insert arrays as records, use the syntax <code>'{"first","second","third"}'</code>. </p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> articles (title, tags)
  <span class="hljs-keyword">VALUES</span> 
  (<span class="hljs-string">'Lorem ipsum'</span>, <span class="hljs-string">'{"random"}'</span>),
  (<span class="hljs-string">'Placeholder here'</span>, <span class="hljs-string">'{"motivation","random"}'</span>),
  (<span class="hljs-string">'Postgresql tricks'</span>, <span class="hljs-string">'{"data","self-reference"}'</span>);
</code></pre>
<p>There are a lot of <a target="_blank" href="https://www.postgresql.org/docs/12/functions-array.html">things you can do with arrays</a> in PostgreSQL.</p>
<p>For a start, you can check if an array contains a given element. This is useful for filtering. You can use the "contains" operator <code>@&gt;</code> to do this. The query below finds all the articles which have the tag "random".</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
  *
<span class="hljs-keyword">FROM</span> articles
<span class="hljs-keyword">WHERE</span> tags @&gt; <span class="hljs-string">'{"random"}'</span>;
</code></pre>
<p>You can also concatenate (join together) arrays using the <code>||</code> operator, or check for overlapping elements with the <code>&amp;&amp;</code> operator.</p>
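<p>As a small sketch of those two operators, the query below runs against the <code>articles</code> table defined earlier. The extra tag values (<code>featured</code>, and the pair being checked for overlap) are made up for illustration.</p>
<pre><code class="lang-sql">SELECT
  title,
  -- concatenation: append one more tag to each article's array
  tags || ARRAY['featured'] AS tags_with_extra,
  -- overlap: true if the article shares at least one tag with the given array
  tags &amp;&amp; ARRAY['data', 'motivation'] AS has_overlap
FROM articles;
</code></pre>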
<p>You can search arrays by index (unlike many languages, PostgreSQL arrays start counting from one, instead of zero).</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    tags[<span class="hljs-number">1</span>]
<span class="hljs-keyword">FROM</span> articles;
</code></pre>
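<p>Two related functions are often useful alongside indexing. The sketch below, again using the <code>articles</code> table, counts the tags on each article with <code>ARRAY_LENGTH</code> and expands the arrays into one row per tag with <code>UNNEST</code>. The column aliases are only illustrative.</p>
<pre><code class="lang-sql">-- First tag and number of tags for each article
SELECT
    title,
    tags[1] AS first_tag,
    ARRAY_LENGTH(tags, 1) AS tag_count
FROM articles;

-- One output row per (article, tag) pair
SELECT
    title,
    UNNEST(tags) AS tag
FROM articles;
</code></pre>
<p>Asking for an index past the end of an array, such as <code>tags[5]</code> here, returns NULL rather than raising an error.</p>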
<p>As well as arrays, PostgreSQL also lets you use <a target="_blank" href="https://www.w3schools.com/whatis/whatis_json.asp">JSON</a> as a data type. Again, this provides the advantages of working with unstructured data. You can also access elements by their key name.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> sessions (
    session_id <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    session_info <span class="hljs-keyword">JSON</span>
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> sessions (session_info)
<span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'{"app_version": 1.0, "device_type": "Android"}'</span>),
(<span class="hljs-string">'{"app_version": 1.2, "device_type": "iOS"}'</span>),
(<span class="hljs-string">'{"app_version": 1.4, "device_type": "iOS", "mode":"default"}'</span>);
</code></pre>
<p>Again, there are many <a target="_blank" href="https://www.postgresql.org/docs/12/datatype-json.html">things you can do with JSON</a> data in PostgreSQL. You can use the <code>-&gt;</code> and <code>-&gt;&gt;</code> operators to "unpackage" the JSON objects to use in queries.</p>
<p>For example, this query finds the values of the <code>device_type</code> key:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
  session_info -&gt; <span class="hljs-string">'device_type'</span> <span class="hljs-keyword">AS</span> devices
<span class="hljs-keyword">FROM</span> sessions;
</code></pre>
<p>And this query counts how many sessions were on app version 1.0 or earlier:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">COUNT</span>(*)
<span class="hljs-keyword">FROM</span> sessions
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">CAST</span>(session_info -&gt;&gt; <span class="hljs-string">'app_version'</span> <span class="hljs-keyword">AS</span> <span class="hljs-built_in">decimal</span>) &lt;= <span class="hljs-number">1.0</span>;
</code></pre>
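<p>The two operators used above differ in what they return: <code>-&gt;</code> gives back a JSON value, while <code>-&gt;&gt;</code> gives back plain text, which is why the second query can cast the result to a number. A minimal side-by-side sketch on the same <code>sessions</code> table (the column aliases are illustrative):</p>
<pre><code class="lang-sql">SELECT
  session_info -&gt; 'device_type'  AS device_as_json,  -- JSON value, displayed with its quotes
  session_info -&gt;&gt; 'device_type' AS device_as_text,  -- plain text value
  session_info -&gt;&gt; 'mode'        AS mode             -- NULL when the key is missing
FROM sessions;
</code></pre>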
<h3 id="heading-run-statistical-analyses">Run statistical analyses</h3>
<p>Often, people see SQL as good for storing data and running simple queries, but not for running more in-depth analyses. For that, the thinking goes, you should use another tool such as Python, R, or your favourite spreadsheet software.</p>
<p>However, PostgreSQL brings with it enough statistical capabilities to get you started.</p>
<p>For instance, it can calculate summary statistics, correlation, regression and random sampling. The table below contains some simple data to play around with.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> stats (
  sample_id <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  x <span class="hljs-built_in">INT</span>,
  y <span class="hljs-built_in">INT</span>
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> stats (x,y)
  <span class="hljs-keyword">VALUES</span> 
  (<span class="hljs-number">1</span>,<span class="hljs-number">2</span>), (<span class="hljs-number">3</span>,<span class="hljs-number">4</span>), (<span class="hljs-number">6</span>,<span class="hljs-number">5</span>), (<span class="hljs-number">7</span>,<span class="hljs-number">8</span>), (<span class="hljs-number">9</span>,<span class="hljs-number">10</span>);
</code></pre>
<p>You can find the mean, variance and standard deviation using the functions below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">AVG</span>(x),
    <span class="hljs-keyword">VARIANCE</span>(x),
    <span class="hljs-keyword">STDDEV</span>(x)
<span class="hljs-keyword">FROM</span> stats;
</code></pre>
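<p>One detail worth knowing is that <code>VARIANCE</code> and <code>STDDEV</code> are aliases for the sample versions of these statistics. If you want the population versions instead, PostgreSQL has separate functions for them. A short sketch on the same <code>stats</code> table, with illustrative column aliases:</p>
<pre><code class="lang-sql">SELECT
    VAR_SAMP(x)    AS sample_variance,      -- same result as VARIANCE(x)
    VAR_POP(x)     AS population_variance,
    STDDEV_SAMP(x) AS sample_stddev,        -- same result as STDDEV(x)
    STDDEV_POP(x)  AS population_stddev
FROM stats;
</code></pre>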
<p>You can also find the median (or any other percentile value) using the percentile_cont function:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- median</span>
<span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">PERCENTILE_CONT</span>(<span class="hljs-number">0.5</span>)
<span class="hljs-keyword">WITHIN</span> <span class="hljs-keyword">GROUP</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> x) 
<span class="hljs-keyword">FROM</span> stats;

<span class="hljs-comment">-- 90th percentile</span>
<span class="hljs-keyword">SELECT</span>
  <span class="hljs-keyword">PERCENTILE_CONT</span>(<span class="hljs-number">0.9</span>)
<span class="hljs-keyword">WITHIN</span> <span class="hljs-keyword">GROUP</span> (<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> x) 
<span class="hljs-keyword">FROM</span> stats;
</code></pre>
<p>Another trick lets you calculate the correlation coefficients between different columns. Simply use the corr function.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    <span class="hljs-keyword">CORR</span>(x,y)
<span class="hljs-keyword">FROM</span> stats;
</code></pre>
<p>PostgreSQL lets you run <a target="_blank" href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a> (sometimes called the most basic form of machine learning) via a set of inbuilt functions.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
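    <span class="hljs-comment">-- arguments are (Y, X): the dependent variable comes first</span>
    <span class="hljs-keyword">REGR_INTERCEPT</span>(y,x),
    <span class="hljs-keyword">REGR_SLOPE</span>(y,x),
    <span class="hljs-keyword">REGR_R2</span>(y,x)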
<span class="hljs-keyword">FROM</span> stats;
</code></pre>
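<p>Once you have a slope and an intercept, you can use them to make a prediction in the same query. The sketch below estimates <code>y</code> for a new value of <code>x</code>. The value 10 and the column alias are arbitrary choices for illustration.</p>
<pre><code class="lang-sql">-- Both functions take the dependent variable first: (Y, X).
-- prediction = slope * x + intercept, evaluated here at x = 10
SELECT
    REGR_SLOPE(y, x) * 10 + REGR_INTERCEPT(y, x) AS predicted_y_at_x_10
FROM stats;
</code></pre>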
<p>You can even run <a target="_blank" href="https://www.freecodecamp.org/news/solve-the-unsolvable-with-monte-carlo-methods-294de03c80cd/">Monte Carlo simulations</a> with single queries. The query below uses the generate_series and random number functions to estimate the value of π by randomly sampling one million points in a unit square and counting how many land inside the circle inscribed in it. That circle covers π/4 of the square, so multiplying the fraction of points inside it by four gives an estimate of π. </p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    <span class="hljs-keyword">CAST</span>(
        <span class="hljs-keyword">COUNT</span>(*) * <span class="hljs-number">4</span> <span class="hljs-keyword">AS</span> <span class="hljs-built_in">FLOAT</span>
        ) / <span class="hljs-number">1000000</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">pi</span> 
<span class="hljs-keyword">FROM</span> GENERATE_SERIES(<span class="hljs-number">1</span>,<span class="hljs-number">1000000</span>)
<span class="hljs-keyword">WHERE</span> CIRCLE(POINT(<span class="hljs-number">0.5</span>,<span class="hljs-number">0.5</span>),<span class="hljs-number">0.5</span>) @&gt; POINT(RANDOM(), RANDOM());
</code></pre>
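<p>The same idea can be written without the geometric types, which may make the arithmetic easier to follow. This variant is a sketch rather than a replacement for the query above: it checks whether each random point lies within distance 1 of the origin, which is a quarter circle that also covers π/4 of the unit square. The subquery alias <code>points</code> is just a name chosen for this example.</p>
<pre><code class="lang-sql">SELECT
    -- fraction of points inside the quarter circle, multiplied by four
    4.0 * COUNT(*) FILTER (WHERE x * x + y * y &lt;= 1) / COUNT(*) AS pi
FROM (
    SELECT RANDOM() AS x, RANDOM() AS y
    FROM GENERATE_SERIES(1, 1000000)
) AS points;
</code></pre>
<p>Because the points are random, the result changes slightly on every run.</p>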
<h3 id="heading-work-with-shape-data">Work with shape data</h3>
<p>Another unusual data type available in PostgreSQL is <a target="_blank" href="https://www.postgresql.org/docs/12/datatype-geometric.html">geometric data</a>.</p>
<p>That's right, you can work with points, lines, polygons and circles within SQL.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2019/11/Screenshot-2019-11-18-at-00.06.33.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Points are the basic building block for all geometric data types in PostgreSQL. They are represented as (x, y) coordinates.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    POINT(<span class="hljs-number">0</span>,<span class="hljs-number">0</span>) <span class="hljs-keyword">AS</span> <span class="hljs-string">"origin"</span>,
    POINT(<span class="hljs-number">1</span>,<span class="hljs-number">1</span>) <span class="hljs-keyword">AS</span> <span class="hljs-string">"point"</span>;
</code></pre>
<p>You can also define lines. These can be either infinite lines (specified by giving any two points on the line) or line segments (specified by giving the 'start' and 'end' points of the segment).</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    LINE <span class="hljs-string">'((0,0),(1,1))'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"line"</span>,
    LSEG <span class="hljs-string">'((2,2),(3,3))'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"line_segment"</span>;
</code></pre>
<p>Polygons are defined by a longer series of points.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    POLYGON <span class="hljs-string">'((0,0),(1,1),(0,2))'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"triangle"</span>,
    POLYGON <span class="hljs-string">'((0,0),(0,1),(1,1),(1,0))'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"square"</span>,
    POLYGON <span class="hljs-string">'((0,0),(0,1),(2,1),(2,0))'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"rectangle"</span>;
</code></pre>
<p>Circles are defined by a central point and a radius.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    CIRCLE <span class="hljs-string">'((0,0),1)'</span> <span class="hljs-keyword">as</span> <span class="hljs-string">"small_circle"</span>,
    CIRCLE <span class="hljs-string">'((0,0),5)'</span> <span class="hljs-keyword">as</span> <span class="hljs-string">"big_circle"</span>;
</code></pre>
<p>There are <a target="_blank" href="https://www.postgresql.org/docs/12/functions-geometry.html">many functions and operators</a> that can be applied to geometric data in PostgreSQL.</p>
<p>You can:</p>
<ul>
<li>Check if two lines are parallel with the <code>?||</code> operator:</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    LINE <span class="hljs-string">'((0,0),(1,1))'</span> ?|| LINE <span class="hljs-string">'((2,3),(3,4))'</span>;
</code></pre>
<ul>
<li>Find the distance between two objects with the <code>&lt;-&gt;</code> operator:</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    POINT(<span class="hljs-number">0</span>,<span class="hljs-number">0</span>) &lt;-&gt; POINT(<span class="hljs-number">1</span>,<span class="hljs-number">1</span>);
</code></pre>
<ul>
<li>Check if two shapes overlap at any point with the <code>&amp;&amp;</code> operator:</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    CIRCLE <span class="hljs-string">'((0,0),1)'</span> &amp;&amp;  CIRCLE <span class="hljs-string">'((1,1),1)'</span>;
</code></pre>
<ul>
<li>Translate (shift position) a shape using the <code>+</code> operator (available for points, boxes, paths and circles):</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    PATH <span class="hljs-string">'((0,0),(1,2),(1,1))'</span> + POINT(<span class="hljs-number">0</span>,<span class="hljs-number">3</span>);
</code></pre>
<p>And lots more besides - check out <a target="_blank" href="https://www.postgresql.org/docs/12/functions-geometry.html">the documentation</a> for more detail!</p>
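<p>If you are curious what these operators compute, the following Python sketch mirrors the four examples above (parallel check, distance, overlap, translation). The helper names are hypothetical and this is plain coordinate arithmetic under simple assumptions, not PostgreSQL's actual implementation.</p>

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def lines_parallel(a1, a2, b1, b2, eps=1e-9):
    """True if the line through a1, a2 is parallel to the line through b1, b2."""
    ax, ay = a2[0] - a1[0], a2[1] - a1[1]
    bx, by = b2[0] - b1[0], b2[1] - b1[1]
    # Parallel direction vectors have a zero cross product.
    return abs(ax * by - ay * bx) < eps


def circles_overlap(c1, r1, c2, r2):
    """True if two circles share at least one point."""
    return distance(c1, c2) <= r1 + r2


def translate(points, offset):
    """Shift every point of a shape by the given (dx, dy) offset."""
    return [(x + offset[0], y + offset[1]) for x, y in points]


if __name__ == "__main__":
    print(lines_parallel((0, 0), (1, 1), (2, 3), (3, 4)))
    print(distance((0, 0), (1, 1)))
    print(circles_overlap((0, 0), 1, (1, 1), 1))
    print(translate([(0, 0), (1, 2), (1, 1)], (0, 3)))
```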
<h3 id="heading-use-recursive-queries">Use recursive queries</h3>
<p><a target="_blank" href="https://www.freecodecamp.org/news/how-recursion-works-explained-with-flowcharts-and-a-video-de61f40cb7f9/">Recursion</a> is a programming technique that can be used to solve problems using a function which calls itself. Did you know that you can write recursive queries in PostgreSQL?</p>
<p>There are three parts required to do this:</p>
<ul>
<li>First, you define a starting expression.</li>
<li>Then, define a recursive expression that will be evaluated repeatedly.</li>
<li>Finally, define a "termination criterion" - a condition which tells the query to stop recursing, and return an output.</li>
</ul>
<p>The query below returns the first hundred numbers in <a target="_blank" href="https://www.mathsisfun.com/numbers/fibonacci-sequence.html">the Fibonacci sequence</a>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> <span class="hljs-keyword">RECURSIVE</span> fibonacci(n,x,y) <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span>
        <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> n,
        <span class="hljs-number">0</span> :: <span class="hljs-built_in">NUMERIC</span> <span class="hljs-keyword">AS</span> x,
        <span class="hljs-number">1</span> :: <span class="hljs-built_in">NUMERIC</span> <span class="hljs-keyword">AS</span> y
    <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
    <span class="hljs-keyword">SELECT</span>
        n + <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> n,
        y <span class="hljs-keyword">AS</span> x,
        x + y <span class="hljs-keyword">AS</span> y
    <span class="hljs-keyword">FROM</span> fibonacci
    <span class="hljs-keyword">WHERE</span> n &lt; <span class="hljs-number">100</span>
)
<span class="hljs-keyword">SELECT</span>
    x
<span class="hljs-keyword">FROM</span> fibonacci;
</code></pre>
<p>Let's break this down.</p>
<p>First, it uses the WITH clause to define a (recursive) <a target="_blank" href="https://www.postgresql.org/docs/12/queries-with.html#QUERIES-WITH-SELECT">Common Table Expression</a> called <code>fibonacci</code>. Then, it defines an initial expression:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> <span class="hljs-keyword">RECURSIVE</span> fibonacci(n,x,y) <span class="hljs-keyword">AS</span> (
    <span class="hljs-keyword">SELECT</span>
        <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> n,
        <span class="hljs-number">0</span> :: <span class="hljs-built_in">NUMERIC</span> <span class="hljs-keyword">AS</span> x,
        <span class="hljs-number">1</span> :: <span class="hljs-built_in">NUMERIC</span> <span class="hljs-keyword">AS</span> y...
</code></pre>
<p>Next, it defines the recursive expression that queries <code>fibonacci</code>:</p>
<pre><code class="lang-sql">    ...UNION ALL
    <span class="hljs-keyword">SELECT</span>
        n + <span class="hljs-number">1</span> <span class="hljs-keyword">AS</span> n,
        y <span class="hljs-keyword">AS</span> x,
        x + y <span class="hljs-keyword">AS</span> y
    <span class="hljs-keyword">FROM</span> fibonacci...
</code></pre>
<p>Finally, it uses a WHERE clause to define the termination criterion, and then selects column x to give the output sequence:</p>
<pre><code class="lang-sql">    ...WHERE n &lt; 100
)
<span class="hljs-keyword">SELECT</span>
    x
<span class="hljs-keyword">FROM</span> fibonacci;
</code></pre>
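<p>One way to picture how the database evaluates this query is as a loop over a "working table": the starting row is produced once, and the recursive expression is then re-run against the rows from the previous step until the WHERE filter leaves nothing. The Python sketch below imitates that loop for the Fibonacci query. The function name <code>fibonacci_cte</code> is made up for illustration, and the loop is a simplified model rather than PostgreSQL's real executor.</p>

```python
from typing import List, Tuple

Row = Tuple[int, int, int]  # (n, x, y), like the columns of the CTE


def fibonacci_cte(limit: int = 100) -> List[int]:
    """Imitate the recursive query: return column x for n = 1..limit."""
    # Starting expression: SELECT 1 AS n, 0 AS x, 1 AS y
    working: List[Row] = [(1, 0, 1)]
    result: List[Row] = list(working)
    # Recursive expression: applied to the previous step's rows only.
    # The loop stops once the WHERE filter (n < limit) yields no rows.
    while working:
        working = [(n + 1, y, x + y) for (n, x, y) in working if n < limit]
        result.extend(working)  # UNION ALL keeps every row
    # Final SELECT x FROM fibonacci
    return [x for (_, x, _) in result]


if __name__ == "__main__":
    print(fibonacci_cte(10))
```

<p>Python integers do not overflow, which plays the same role here as the cast to <code>NUMERIC</code> in the SQL version.</p>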
<p>Perhaps you can think of another example of recursion that could be implemented in PostgreSQL?</p>
<h3 id="heading-final-remarks">Final remarks</h3>
<p>So, there you have it - a quick run-through of some great features that you may not have known PostgreSQL provides. There are no doubt more features worth covering that didn't make it into this list.</p>
<p>PostgreSQL's flavor of SQL is a rich and powerful programming language in its own right. So, next time you are stuck figuring out how to solve a data-related problem, take a look and see if PostgreSQL has you covered. You might be surprised how often it does!</p>
<p>Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to import Google BigQuery tables to AWS Athena ]]>
                </title>
                <description>
                    <![CDATA[ By Aftab Ansari As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is somethi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539/</link>
                <guid isPermaLink="false">66c352c5c2631756f9f063d7</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data migration ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 11 Mar 2019 18:55:49 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*518Z4MAe36ZqeLX2LTzeDA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aftab Ansari</p>
<p>As a data engineer, it is quite likely that you are using one of the leading big data cloud platforms such as AWS, Microsoft Azure, or Google Cloud for your data processing. Also, migrating data from one platform to another is something you might have already faced or will face at some point.</p>
<p>In this post, I will show how I imported Google BigQuery tables to AWS Athena. If you only need a list of tools to be used with some very high-level guidance, you can quickly look at this <a target="_blank" href="https://amazon-aws-big-data-demystified.ninja/2018/05/27/how-to-export-data-from-google-big-query-into-aws-s3-emr-hive/">post that shows how to import a single BigQuery table into Hive metastore</a>. In this article, I will show one way of importing a full BigQuery project (multiple tables) into both Hive and Athena metastore.</p>
<p>There are a few limitations on exporting data from BigQuery: for example, when you export data from partitioned tables, you cannot export individual partitions. Please check the <a target="_blank" href="https://cloud.google.com/bigquery/docs/exporting-data">limitations</a> before starting the process.</p>
<p>In order to successfully import Google BigQuery tables to Athena, I performed the steps shown below. I used AVRO format when dumping data and the schemas from Google BigQuery and loading them into AWS Athena.</p>
<p><a class="post-section-overview" href="#264f">Step 1. Dump BigQuery data to Google Cloud Storage</a></p>
<p><a class="post-section-overview" href="#3af9">Step 2. Transfer data from Google Cloud Storage to AWS S3</a></p>
<p><a class="post-section-overview" href="#c089">Step 3. Extract AVRO schema from AVRO files stored in S3</a></p>
<p><a class="post-section-overview" href="#cc2d">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</a></p>
<p><a class="post-section-overview" href="#fbf3">Step 5. Extract Hive table definition from Hive tables</a></p>
<p><a class="post-section-overview" href="#c6f6">Step 6. Use the output of Step 3 and 5 to create Athena tables</a></p>
<p>So why create Hive tables at all when the end goal is to have the data in Athena? Because:</p>
<ul>
<li>Athena does not support using <code>avro.schema.url</code> to specify table schema.</li>
<li>Athena requires you to explicitly specify field names and their data types in the CREATE statement.</li>
<li>Athena also requires the AVRO schema in JSON format under <code>avro.schema.literal</code>.</li>
<li>You can check this AWS <a target="_blank" href="https://docs.aws.amazon.com/athena/latest/ug/avro.html">doc</a> for more details.</li>
</ul>
<p>So, Hive tables can be created directly by pointing to AVRO schema files stored on S3. But to have the same in Athena, columns and schema are required in the CREATE TABLE statement.</p>
<p>One way to overcome this is to first extract the schema from the AVRO data and supply it as <code>avro.schema.literal</code>. Second, to get the field names and data types required for the CREATE statement, create Hive tables based on the AVRO schemas stored in S3 and use <code>SHOW CREATE TABLE</code> to dump the Hive table definitions, which contain the field names and data types. Finally, create the Athena tables by combining the extracted AVRO schema and the Hive table definition. I discuss each step in detail in the following sections.</p>
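<p>To give a rough idea of where this is heading, the Python sketch below assembles an Athena <code>CREATE EXTERNAL TABLE</code> statement from a list of columns and an AVRO schema. It is only an illustration: the function name <code>athena_create_table</code> and the sample values are assumptions, the statement layout follows the Avro SerDe syntax described in the AWS doc linked above, and schemas containing single quotes would need extra escaping.</p>

```python
import json


def athena_create_table(table, columns, avro_schema, location):
    """Build an Athena DDL string from columns plus an AVRO schema dict."""
    # Field names and data types must be listed explicitly for Athena.
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    # The AVRO schema is passed inline as JSON via avro.schema.literal.
    literal = json.dumps(avro_schema)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        f"WITH SERDEPROPERTIES ('avro.schema.literal'='{literal}')\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Hypothetical example values, shortened for readability.
    schema = {
        "type": "record",
        "name": "Root",
        "fields": [{"name": "uid", "type": ["null", "string"]}],
    }
    print(athena_create_table(
        "backend.sessions_phase2",
        [("uid", "string")],
        schema,
        "s3://my-bucket/bq_data/backend/sessions_phase2/",
    ))
```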
<p>For the demonstration, I have the following BigQuery tables that I would like to import to Athena.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/klOvEMVXS8X9k5YaxGkacXgjhdsxMGwmnupj" alt="Image" width="800" height="232" loading="lazy"></p>
<p>So, let’s get started!</p>
<h3 id="heading-step-1-dump-bigquery-data-to-google-cloud-storage">Step 1. Dump BigQuery data to Google Cloud Storage</h3>
<p>It is possible to dump BigQuery data to Google Cloud Storage with the help of the Google Cloud UI. However, this can become a tedious task if you have to dump several tables manually.</p>
<p>To tackle this problem, I used Google Cloud Shell. In Cloud Shell, you can combine regular shell scripting with BigQuery commands and dump multiple tables relatively fast. You can activate Cloud Shell as shown in the picture below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/o3Vj-2DY5je5jS1wZeE84Dh6OuO37CrR106-" alt="Image" width="800" height="143" loading="lazy"></p>
<p>From Cloud Shell, the following operation provides the BigQuery <code>extract</code> commands to dump each table of the “backend” dataset to Google Cloud Storage.</p>
<pre><code>bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY &lt;dataset&gt;.@ gs:<span class="hljs-comment">//&lt;bucket&gt;@</span>
</code></pre><p>In my case it prints:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ bq ls backend | cut -d <span class="hljs-string">' '</span> -f3 | tail -n+<span class="hljs-number">3</span> | xargs -I@ echo bq --location=US extract --destination_format AVRO --compression SNAPPY backend.@ gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/@/@-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_daily_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_detailed_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-*.avro</span>
</code></pre><pre><code>bq --location=US extract --destination_format AVRO --compression SNAPPY backend.sessions_phase2 gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-*.avro</span>
</code></pre><p>Please note that <code>--compression SNAPPY</code> is important: large, uncompressed files can cause the <code>gsutil</code> command (used to transfer the data to AWS S3) to get stuck. The wildcard (<strong>*</strong>) makes <code>bq extract</code> split bigger tables (&gt;1GB) into multiple output files. Running those commands in Cloud Shell copies the data to the following Google Cloud Storage directory:</p>
<pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/table_name/table_name-*.avro</span>
</code></pre><p>Let’s do <code>ls</code> to see the dumped AVRO file.</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil ls gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2</span>
</code></pre><pre><code>gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>I can also browse the bucket from the UI and find the data, as shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/bdhhLC9Dyuv59VpLeMhLhHNm0Ul-sAe7L8f8" alt="Image" width="800" height="289" loading="lazy"></p>
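<p>If you prefer to generate the <code>bq extract</code> commands from a script rather than a shell pipeline, something like the Python sketch below could work. It only builds the command strings shown earlier in this step and does not run them. The function name <code>extract_commands</code> and its parameters are assumptions for illustration.</p>

```python
def extract_commands(dataset, tables, bucket_prefix, location="US"):
    """Build one `bq extract` command string per table (nothing is executed)."""
    commands = []
    for table in tables:
        # One folder per table; the wildcard lets bq split large tables.
        destination = f"{bucket_prefix}/{dataset}/{table}/{table}-*.avro"
        commands.append(
            f"bq --location={location} extract --destination_format AVRO "
            f"--compression SNAPPY {dataset}.{table} {destination}"
        )
    return commands


if __name__ == "__main__":
    tables = ["sessions_daily_phase2", "sessions_detailed_phase2", "sessions_phase2"]
    prefix = "gs://plr_data_transfer_temp/bigquery_data"
    for command in extract_commands("backend", tables, prefix):
        print(command)
```

<p>The table list is passed in explicitly here, whereas the shell version reads it from <code>bq ls</code>.</p>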
<h3 id="heading-step-2-transfer-data-from-google-cloud-storage-to-aws-s3">Step 2. Transfer data from Google Cloud Storage to AWS S3</h3>
<p>Transferring data from Google Storage to AWS S3 is straightforward. First, set up your S3 credentials. On Cloud Shell, create or edit <code>.boto</code> file ( <code>vi ~/.boto</code>) and add these:</p>
<pre><code>[Credentials]
aws_access_key_id = &lt;your aws access key ID&gt;
aws_secret_access_key = &lt;your aws secret access key&gt;

[s3]
host = s3.us-east-1.amazonaws.com
use-sigv4 = True
</code></pre><p>Please note: <strong>s3.us-east-1.amazonaws.com</strong> has to correspond with the region where the bucket is.</p>
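<p>If you set this up often, the same <code>.boto</code> content can be rendered from a script. The Python sketch below uses the standard <code>configparser</code> module to produce the two sections shown above. The function name <code>render_boto_config</code> is hypothetical, and writing the result to <code>~/.boto</code> is left to you.</p>

```python
import configparser
import io


def render_boto_config(access_key_id, secret_access_key, s3_host):
    """Return the text of a .boto file with [Credentials] and [s3] sections."""
    # interpolation=None keeps any '%' characters in the values literal.
    config = configparser.ConfigParser(interpolation=None)
    config["Credentials"] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    # The host must match the region where the S3 bucket lives.
    config["s3"] = {"host": s3_host, "use-sigv4": "True"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values only; never hard-code real credentials.
    print(render_boto_config("YOUR_KEY_ID", "YOUR_SECRET", "s3.us-east-1.amazonaws.com"))
```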
<p>After setting up the credentials, execute <code>gsutil</code> to transfer data from Google Storage to AWS S3. For example:</p>
<pre><code>gsutil rsync -r gs:<span class="hljs-comment">//your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schema</span>
</code></pre><p>Add the <strong><em>-n</em></strong> flag to the command above to display the operations that would be performed using the specified command without actually running them.</p>
<p>In this case, to transfer the data to S3, I used the following:</p>
<pre><code>aftab_ansari@cloudshell:~ (project-ark-archive)$ gsutil rsync -r gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend s3://my-bucket/bq_data/backend</span>
</code></pre><pre><code>Building synchronization state…
Starting synchronization…
Copying gs:<span class="hljs-comment">//plr_data_transfer_temp/bigquery_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro [Content-Type=application/octet-stream]...
Copying gs://plr_data_transfer_temp/bigquery_data/backend/sessions_phase2/sessions_phase2-000000000000.avro [Content-Type=application/octet-stream]...
| [3 files][987.8 KiB/987.8 KiB]
Operation completed over 3 objects/987.8 KiB.</span>
</code></pre><p>Let’s check if the data got transferred to S3. I verified that from my local machine:</p>
<pre><code>aws s3 ls --recursive  s3:<span class="hljs-comment">//my-bucket/bq_data/backend --profile smoke | awk '{print $4}'</span>
</code></pre><pre><code>bq_data/backend/sessions_daily_phase2/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avro
bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avro
bq_data/backend/sessions_phase2/sessions_phase2<span class="hljs-number">-000000000000.</span>avro
</code></pre><h3 id="heading-step-3-extract-avro-schema-from-avro-files-stored-in-s3">Step 3. Extract AVRO schema from AVRO files stored in S3</h3>
<p>To extract the schema from AVRO data, you can use the Apache <code>avro-tools-&lt;version&gt;.jar</code> with the <code>getschema</code> parameter. The benefit of using this tool is that it returns the schema in a form you can use directly in the <code>WITH SERDEPROPERTIES</code> statement when creating Athena tables.</p>
<p>You may have noticed that I got only one <code>.avro</code> file per table when dumping the BigQuery tables. This was because of the small data volume; otherwise, I would have gotten several files per table. Whether a table has one file or many, it is enough to run avro-tools against any single file per table to extract that table’s schema.</p>
<p>I downloaded avro-tools, whose latest version at the time was <code>avro-tools-1.8.2.jar</code>. I first copied all <code>.avro</code> files from S3 to local disk:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp s3:<span class="hljs-comment">//my-bucket/bq_data/backend/ bq_data/backend/ --recursive</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro to bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro to bq_data/backend/sessions_phase2/sessions_phase2-000000000000.avro</span>
</code></pre><pre><code>download: s3:<span class="hljs-comment">//my-bucket/bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro to bq_data/backend/sessions_daily_phase2/sessions_daily_phase2-000000000000.avro</span>
</code></pre><p>The avro-tools command should look like <code>java -jar avro-tools-1.8.2.jar getschema your_data.avro &gt; schema_file.avsc</code>. This can become tedious if you have several AVRO files (in reality, I’ve done this for a project with many more tables). Again, I used a shell script to generate the commands. I created <code>extract_schema_avro.sh</code> with the following content:</p>
<pre><code>schema_avro=(bq_data/backend<span class="hljs-comment">/*)
for i in ${!schema_avro[*]}; do
  input_file=$(find ${schema_avro[$i]} -type f)
  output_file=$(ls -l ${schema_avro[$i]} | tail -n+2 \
    | awk -v srch="avro" -v repl="avsc" '{ sub(srch,repl,$9);
    print $9 }')
  commands=$(
    echo "java -jar avro-tools-1.8.2.jar getschema " \
      $input_file" &gt; bq_data/schemas/backend/avro/"$output_file
  )
  echo $commands
done</span>
</code></pre><p>Running <code>extract_schema_avro.sh</code> provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh extract_schema_avro.sh
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_daily_phase2/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_detailed_phase2/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>java -jar avro-tools<span class="hljs-number">-1.8</span><span class="hljs-number">.2</span>.jar getschema bq_data/backend/sessions_phase2/sessions_phase2<span class="hljs-number">-000000000000.</span>avro &gt; bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><p>Executing the above commands writes the extracted schemas to <code>bq_data/schemas/backend/avro/</code>:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/schemas/backend/avro<span class="hljs-comment">/* | awk '{print $9}'</span>
</code></pre><pre><code>bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc
bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><p>Let’s also check what’s inside an <code>.avsc</code> file.</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc
</code></pre><pre><code>{
  <span class="hljs-string">"type"</span> : <span class="hljs-string">"record"</span>,
  <span class="hljs-string">"name"</span> : <span class="hljs-string">"Root"</span>,
  <span class="hljs-string">"fields"</span> : [ {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"uid"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"string"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"platform"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"string"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"version"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"string"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"country"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"string"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"sessions"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"long"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"active_days"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"long"</span> ]
  }, {
    <span class="hljs-string">"name"</span> : <span class="hljs-string">"session_time_minutes"</span>,
    <span class="hljs-string">"type"</span> : [ <span class="hljs-string">"null"</span>, <span class="hljs-string">"double"</span> ]
  } ]
}
</code></pre><p>As you can see, the schema is in the form that can be directly used in Athena <code>WITH SERDEPROPERTIES</code>. But before Athena, I used the AVRO schemas to create Hive tables. If you want to avoid Hive table creation, you can read the <code>.avsc</code> files to extract field names and data types, but then you have to map the data types yourself from AVRO format to Athena table creation DDL.</p>
<p>The complexity of the mapping task depends on how complex the data types are in your tables. For simplicity (and to cover most simple to complex data types), I let Hive do the mapping for me. So I created the tables first in Hive metastore. Then I used <code>SHOW CREATE TABLE</code> to get the field names and data types part of the DDL.</p>
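<p>For tables that only use simple types, doing the mapping yourself is not much code. The Python sketch below reads an <code>.avsc</code> schema and returns column names with Hive/Athena types. It is a simplified illustration, not what I used: the names <code>AVRO_TO_HIVE</code> and <code>columns_from_avsc</code> are made up, only primitive types and nullable unions such as <code>["null", "string"]</code> are handled, and anything more complex (records, arrays, logical types) raises an error.</p>

```python
import json

# Primitive AVRO types and the Hive/Athena column types they map to.
AVRO_TO_HIVE = {
    "string": "string",
    "long": "bigint",
    "int": "int",
    "double": "double",
    "float": "float",
    "boolean": "boolean",
    "bytes": "binary",
}


def columns_from_avsc(avsc_text):
    """Return [(column_name, hive_type), ...] for a flat AVRO record schema."""
    schema = json.loads(avsc_text)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # BigQuery exports nullable columns as unions like ["null", "string"].
        if isinstance(avro_type, list):
            non_null = [t for t in avro_type if t != "null"]
            if len(non_null) != 1:
                raise ValueError(f"unsupported union for {field['name']}")
            avro_type = non_null[0]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_HIVE:
            raise ValueError(f"unsupported type for {field['name']}: {avro_type!r}")
        columns.append((field["name"], AVRO_TO_HIVE[avro_type]))
    return columns


if __name__ == "__main__":
    sample = '{"type": "record", "name": "Root", "fields": [{"name": "uid", "type": ["null", "string"]}]}'
    print(columns_from_avsc(sample))
```

<p>Letting Hive do the mapping avoids having to extend a table like this for every nested or logical type you run into.</p>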
<h3 id="heading-step-4-create-hive-tables-on-top-of-avro-data-use-schema-from-step-3">Step 4. Create Hive tables on top of AVRO data, use schema from Step 3</h3>
<p>As discussed earlier, Hive allows creating tables by using <code>avro.schema.url</code>. So once you have the schema (<code>.avsc</code> file) extracted from the AVRO data, you can create tables as follows:</p>
<pre><code>CREATE EXTERNAL TABLE table_name
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://your-aws-bucket/your-target-path/avro_data'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://your-aws-bucket/your-target-path/your-avro-schema'</span>);
</code></pre><p>First, upload the extracted schemas to S3 so that <code>avro.schema.url</code> can refer to their S3 locations:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ aws s3 cp bq_data/schemas s3:<span class="hljs-comment">//my-bucket/bq_data/schemas --recursive</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_daily_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc</span>
</code></pre><pre><code>upload: bq_data/schemas/backend/avro/sessions_detailed_phase2<span class="hljs-number">-000000000000.</span>avsc to s3:<span class="hljs-comment">//my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc</span>
</code></pre><p>With both the AVRO data and the schemas in S3, the DDL for each Hive table can be created using the template shown at the beginning of this section. I used another shell script, <code>create_tables_hive.sh</code> (shown below), to cover any number of tables:</p>
<pre><code>schema_avro=$(ls -l bq_data/backend | tail -n+<span class="hljs-number">2</span> | awk <span class="hljs-string">'{print $9}'</span>)
<span class="hljs-keyword">for</span> table_name <span class="hljs-keyword">in</span> $schema_avro; <span class="hljs-keyword">do</span>
  file_name=$(ls -l bq_data/backend/$table_name | tail -n+<span class="hljs-number">2</span> | awk \
    -v srch=<span class="hljs-string">"avro"</span> -v repl=<span class="hljs-string">"avsc"</span> <span class="hljs-string">'{ sub(srch,repl,$9); print $9 }'</span>)
  table_definition=$(
    echo <span class="hljs-string">"CREATE EXTERNAL TABLE IF NOT EXISTS backend."</span>$table_name<span class="hljs-string">"\\nSTORED AS AVRO"</span><span class="hljs-string">"\\nLOCATION 's3://my-bucket/bq_data/backend/"</span>$table_name<span class="hljs-string">"'"</span><span class="hljs-string">"\\nTBLPROPERTIES('avro.schema.url'='s3://my-bucket/bq_data/schemas/backend/avro/"</span>$file_name<span class="hljs-string">"');"</span>
  )
  printf <span class="hljs-string">"\n$table_definition\n"</span>
done
</code></pre><p>Running the script provides the following:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ sh create_tables_hive.sh
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_daily_phase2'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc'</span>);
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_detailed_phase2'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc'</span>);
</code></pre><pre><code>CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_phase2'</span>
TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc'</span>);
</code></pre><p>I ran the above statements in the Hive console to actually create the Hive tables:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ hiveLogging initialized using configuration <span class="hljs-keyword">in</span> file:<span class="hljs-regexp">/etc/</span>hive/conf.dist/hive-log4j2.properties Async: <span class="hljs-literal">false</span>
</code></pre><pre><code>hive&gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_daily_phase2&gt; STORED AS AVRO&gt; LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_daily_phase2'</span> TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_daily_phase2-000000000000.avsc'</span>);OKTime taken: <span class="hljs-number">4.24</span> seconds
</code></pre><pre><code>hive&gt;&gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_detailed_phase2 STORED AS AVRO&gt; LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_detailed_phase2'</span>&gt; TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc'</span>);OKTime taken: <span class="hljs-number">0.563</span> seconds
</code></pre><pre><code>hive&gt;&gt; CREATE EXTERNAL TABLE IF NOT EXISTS backend.sessions_phase2&gt; STORED AS AVRO&gt; LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_phase2'</span> TBLPROPERTIES (<span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_phase2-000000000000.avsc'</span>);OKTime taken: <span class="hljs-number">0.386</span> seconds
</code></pre><p>So I have created the Hive tables successfully. To verify that the tables work, I ran this simple query:</p>
<pre><code>hive&gt; select count(*) <span class="hljs-keyword">from</span> backend.sessions_detailed_phase2;Query ID = hadoop_20190214152548_2316cb5b<span class="hljs-number">-29</span>f1<span class="hljs-number">-4416</span><span class="hljs-number">-922</span>e-a6ff02ec1775Total jobs = <span class="hljs-number">1</span>Launching Job <span class="hljs-number">1</span> out <span class="hljs-keyword">of</span> <span class="hljs-number">1</span>Status: Running (Executing on YARN cluster <span class="hljs-keyword">with</span> App id application_1550010493995_0220)----------------------------------------------------------------------------------------------VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED----------------------------------------------------------------------------------------------<span class="hljs-built_in">Map</span> <span class="hljs-number">1</span> .......... container     SUCCEEDED      <span class="hljs-number">1</span>          <span class="hljs-number">1</span>        <span class="hljs-number">0</span>        <span class="hljs-number">0</span>       <span class="hljs-number">0</span>       <span class="hljs-number">0</span>Reducer <span class="hljs-number">2</span> ...... container     SUCCEEDED      <span class="hljs-number">1</span>          <span class="hljs-number">1</span>        <span class="hljs-number">0</span>        <span class="hljs-number">0</span>       <span class="hljs-number">0</span>       <span class="hljs-number">0</span>----------------------------------------------------------------------------------------------VERTICES: <span class="hljs-number">02</span>/<span class="hljs-number">02</span>  [==========================&gt;&gt;] <span class="hljs-number">100</span>%  ELAPSED TIME: <span class="hljs-number">8.17</span> s----------------------------------------------------------------------------------------------OK6130
</code></pre><p>So it works!</p>
<h3 id="heading-step-5-extract-hive-table-definition-from-hive-tables">Step 5. Extract Hive table definition from Hive tables</h3>
<p>As discussed earlier, Athena requires you to explicitly specify field names and their data types in the <code>CREATE</code> statement. In Step 3, I extracted the AVRO schema, which can be used in the <code>WITH SERDEPROPERTIES</code> clause of the Athena table DDL, but I also have to specify all the field names and their (Hive) data types. Now that I have the tables in the Hive metastore, I can easily get those by running <code>SHOW CREATE TABLE</code>. First, prepare the Hive DDL queries for all tables:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ ls -l bq_data/backend | tail -n+<span class="hljs-number">2</span> | awk <span class="hljs-string">'{print "hive -e '</span>\<span class="hljs-string">''</span>SHOW CREATE TABLE backend.<span class="hljs-string">"$9"</span><span class="hljs-string">'\''</span> &gt; bq_data/schemas/backend/hql/backend.<span class="hljs-string">"$9"</span>.hql;<span class="hljs-string">" }'</span>
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_daily_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_daily_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_detailed_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql;
</code></pre><pre><code>hive -e <span class="hljs-string">'SHOW CREATE TABLE backend.sessions_phase2'</span> &gt; bq_data/schemas/backend/hql/backend.sessions_phase2.hql;
</code></pre><p>Executing the above commands writes the Hive table definitions to <code>bq_data/schemas/backend/hql/</code>. Let’s see what’s inside:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat bq_data/schemas/backend/hql/backend.sessions_detailed_phase2.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE <span class="hljs-string">`backend.sessions_detailed_phase2`</span>(
  <span class="hljs-string">`uid`</span> string COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`platform`</span> string COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`version`</span> string COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`country`</span> string COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`sessions`</span> bigint COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`active_days`</span> bigint COMMENT <span class="hljs-string">''</span>,
  <span class="hljs-string">`session_time_minutes`</span> double COMMENT <span class="hljs-string">''</span>)
ROW FORMAT SERDE
  <span class="hljs-string">'org.apache.hadoop.hive.serde2.avro.AvroSerDe'</span>
STORED AS INPUTFORMAT
  <span class="hljs-string">'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'</span>
OUTPUTFORMAT
  <span class="hljs-string">'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'</span>
LOCATION
  <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_detailed_phase2'</span>
TBLPROPERTIES (
  <span class="hljs-string">'avro.schema.url'</span>=<span class="hljs-string">'s3://my-bucket/bq_data/schemas/backend/avro/sessions_detailed_phase2-000000000000.avsc'</span>,
  <span class="hljs-string">'transient_lastDdlTime'</span>=<span class="hljs-string">'1550157659'</span>)
</code></pre><p>By now all the building blocks needed for creating AVRO tables in Athena are there:</p>
<ul>
<li>Field names and data types can be obtained from the Hive table DDL (to be used in columns section of <code>CREATE</code> statement)</li>
<li>AVRO schema (JSON) can be obtained from the extracted <code>.avsc</code> files (to be supplied in <code>WITH SERDEPROPERTIES</code>).</li>
</ul>
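<p>To make the relationship between these two building blocks concrete, here is a minimal Python sketch that derives the column list directly from an Avro schema and assembles an Athena-style DDL statement. It is not the method used in this article (which relies on Hive to produce the column types). The <code>AVRO_TO_HIVE</code> mapping and the function names are assumptions made for illustration, and the sketch only handles primitive types and nullable unions such as <code>["null", "long"]</code>.</p>
<pre><code>import json

# Assumed mapping from Avro primitive types to Hive/Athena column types.
AVRO_TO_HIVE = {
    "string": "string",
    "long": "bigint",
    "int": "int",
    "double": "double",
    "float": "float",
    "boolean": "boolean",
    "bytes": "binary",
}


def hive_type(avro_type):
    """Resolve a primitive Avro type or a nullable union like ["null", "long"]."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"unsupported union: {avro_type}")
        avro_type = non_null[0]
    if avro_type not in AVRO_TO_HIVE:
        raise ValueError(f"unsupported Avro type: {avro_type}")
    return AVRO_TO_HIVE[avro_type]


def athena_ddl(table, location, avsc_text):
    """Build a CREATE EXTERNAL TABLE statement from the text of an .avsc file."""
    schema = json.loads(avsc_text)
    columns = ",\n".join(
        f"  `{field['name']}` {hive_type(field['type'])}"
        for field in schema["fields"]
    )
    # The schema literal is embedded in single quotes, so this sketch assumes
    # the schema itself contains no single-quote characters.
    literal = json.dumps(schema)
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n{columns}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        f"WITH SERDEPROPERTIES ('avro.schema.literal'='{literal}')\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}';"
    )
</code></pre>
<p>Calling <code>athena_ddl("backend.sessions_phase2", "s3://my-bucket/bq_data/backend/sessions_phase2", avsc_text)</code> with the contents of the matching <code>.avsc</code> file would produce a statement of the same shape as the ones built in the next step. For nested or logical Avro types, going through Hive as described here remains the safer route.</p>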
<h3 id="heading-step-6-use-the-output-of-steps-3-and-5-to-create-athena-tables">Step 6. Use the output of Steps 3 and 5 to Create Athena tables</h3>
<p>If you are still with me, you have done a great job coming this far. I am now going to perform the final step: creating the Athena tables. I used the following script to combine the <code>.avsc</code> and <code>.hql</code> files into Athena table definitions:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> tmpAftab]$ cat create_tables_athena.sh
</code></pre><pre><code># directory where extracted avro schemas are stored
schema_avro=(bq_data/schemas/backend/avro/*)
# directory where extracted HQL schemas are stored
schema_hive=(bq_data/schemas/backend/hql/*)

for i in ${!schema_avro[*]}; do
  schema=$(awk -F '{print $0}' '/CREATE/{flag=1}/STORED/{flag=0}\
   flag' ${schema_hive[$i]})
  location=$(awk -F '{print $0}' '/LOCATION/{flag=1; next}\
  /TBLPROPERTIES/{flag=0} flag' ${schema_hive[$i]})
  properties=$(cat ${schema_avro[$i]})
  table=$(echo $schema '\n' \
    "WITH SERDEPROPERTIES ('avro.schema.literal'='\n"$properties \
    "\n""')STORED AS AVRO \n" \
    "LOCATION" $location";\n\n")
  printf "\n$table\n"
done \
  &gt; bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql
</code></pre><p>Running the above script copies Athena table definitions to <code>bq_data/schemas/backend/all_athena_tables/all_athena_tables.hql</code>. In my case it contains:</p>
<pre><code>[hadoop@ip<span class="hljs-number">-10</span><span class="hljs-number">-0</span><span class="hljs-number">-10</span><span class="hljs-number">-205</span> all_athena_tables]$ cat all_athena_tables.hql
</code></pre><pre><code>CREATE EXTERNAL TABLE <span class="hljs-string">`backend.sessions_daily_phase2`</span>( <span class="hljs-string">`uid`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`activity_date`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`sessions`</span> bigint COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`session_time_minutes`</span> double COMMENT <span class="hljs-string">''</span>)
ROW FORMAT SERDE <span class="hljs-string">'org.apache.hadoop.hive.serde2.avro.AvroSerDe'</span>
WITH SERDEPROPERTIES (<span class="hljs-string">'avro.schema.literal'</span>=<span class="hljs-string">'{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "activity_date", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }'</span>)
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_daily_phase2'</span>;
</code></pre><pre><code>CREATE EXTERNAL TABLE <span class="hljs-string">`backend.sessions_detailed_phase2`</span>( <span class="hljs-string">`uid`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`platform`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`version`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`country`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`sessions`</span> bigint COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`active_days`</span> bigint COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`session_time_minutes`</span> double COMMENT <span class="hljs-string">''</span>)
ROW FORMAT SERDE <span class="hljs-string">'org.apache.hadoop.hive.serde2.avro.AvroSerDe'</span>
WITH SERDEPROPERTIES (<span class="hljs-string">'avro.schema.literal'</span>=<span class="hljs-string">'{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "platform", "type" : [ "null", "string" ] }, { "name" : "version", "type" : [ "null", "string" ] }, { "name" : "country", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] } '</span>)
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_detailed_phase2'</span>;
</code></pre><pre><code>CREATE EXTERNAL TABLE <span class="hljs-string">`backend.sessions_phase2`</span>( <span class="hljs-string">`uid`</span> string COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`sessions`</span> bigint COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`active_days`</span> bigint COMMENT <span class="hljs-string">''</span>, <span class="hljs-string">`session_time_minutes`</span> double COMMENT <span class="hljs-string">''</span>)
ROW FORMAT SERDE <span class="hljs-string">'org.apache.hadoop.hive.serde2.avro.AvroSerDe'</span>
WITH SERDEPROPERTIES (<span class="hljs-string">'avro.schema.literal'</span>=<span class="hljs-string">'{ "type" : "record", "name" : "Root", "fields" : [ { "name" : "uid", "type" : [ "null", "string" ] }, { "name" : "sessions", "type" : [ "null", "long" ] }, { "name" : "active_days", "type" : [ "null", "long" ] }, { "name" : "session_time_minutes", "type" : [ "null", "double" ] } ] }'</span>)
STORED AS AVRO
LOCATION <span class="hljs-string">'s3://my-bucket/bq_data/backend/sessions_phase2'</span>;
</code></pre><p>And finally, I ran the above statements in Athena to create the tables:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9TNErBinL98R9qH9pasv6a99jQWPNANLXPKP" alt="Image" width="800" height="528" loading="lazy"></p>
<p>There you have it.</p>
<p>I feel that the process is a bit lengthy, but it has worked well for me. The other approach would be to use the AWS Glue wizard to crawl the data and infer the schema. If you have used the AWS Glue wizard, please share your experience in the comment section below.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Use these open-source tools for Data Warehousing ]]>
                </title>
                <description>
                    <![CDATA[ These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this? For this post, I chose some open-source technologies and used them together to build a full data architecture f... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/open-source-data-warehousing-druid-apache-airflow-superset-f26d149c9b7/</link>
                <guid isPermaLink="false">66d46149768263422736e8c4</guid>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ open source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Simon Späti ]]>
                </dc:creator>
                <pubDate>Thu, 29 Nov 2018 06:00:53 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*vp7sdOKpaw8JiXnP.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>These days, everyone talks about open-source software. However, this is still not common in the Data Warehousing (DWH) field. Why is this?</p>
<p>For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.</p>
<p>I went with <a target="_blank" href="http://www.druid.io/">Apache Druid</a> for data storage, <a target="_blank" href="https://superset.incubator.apache.org/">Apache Superset</a> for querying, and <a target="_blank" href="https://airflow.apache.org/">Apache Airflow</a> as a task orchestrator.</p>
<h3 id="heading-druid-the-data-store">Druid — the data store</h3>
<p>Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/X5ERR00kMGXwmbNaZ3a-6OuTgGhV6JosZGqX" alt="Image" width="726" height="647" loading="lazy"></p>
<h4 id="heading-why-use-druid">Why use Druid?</h4>
<p>Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost effectiveness.</p>
<p>With the <a target="_blank" href="http://www.sspaeti.com/blog/olap-whats-coming-next#Comparison_modern_OLAP_Technologies">comparison of modern OLAP Technologies</a> in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, <a target="_blank" href="https://azure.microsoft.com/en-us/blog/azure-hdinsight-brings-next-generation-hadoop-3-0-and-enterprise-security-to-the-cloud/">Microsoft announced they will add Druid</a> to their Azure HDInsight 4.0.</p>
<h4 id="heading-why-not-druid">Why not Druid?</h4>
<p>Carter Shanklin wrote <a target="_blank" href="https://de.hortonworks.com/blog/apache-hive-druid-part-1-3/">a detailed post about Druid’s limitations</a> at Hortonworks.com. The main issues are its limited support for SQL joins and for advanced SQL capabilities.</p>
<h3 id="heading-the-architecture-of-druid">The Architecture of Druid</h3>
<p>Druid is scalable due to its cluster architecture. You have three different node types — the Middle-Manager-Node, the Historical Node and the Broker.</p>
<p>The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries to run, you can add more Brokers. Or, if a lot of data needs to be batch-ingested, you would add middle managers and so on.</p>
<p>A simple architecture is shown below. You can read more about Druid’s design <a target="_blank" href="http://druid.io/docs/latest/design/">here</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/xeRSbDmEa6pmf5ZOOxBF-hsunHv9i27cfHfU" alt="Image" width="726" height="408" loading="lazy"></p>
<h3 id="heading-apache-superset-the-ui">Apache Superset — the UI</h3>
<p>The easiest way to query against Druid is through a lightweight, open-source tool called <a target="_blank" href="https://superset.incubator.apache.org/">Apache Superset</a>.</p>
<p>It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and <a target="_blank" href="https://superset.incubator.apache.org/gallery.html">many more</a>.</p>
<p>Druid provides a REST API and, in the newest version, also a SQL Query API. This makes it easy to use with any tool, whether it is standard SQL, an existing BI tool or a custom application.</p>
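<p>As a rough sketch of what using the SQL Query API from a custom application can look like, the Python snippet below builds an HTTP request for Druid’s SQL endpoint (<code>/druid/v2/sql</code>). The Broker address <code>http://localhost:8082</code> and the <code>sessions</code> datasource are placeholders chosen for illustration; adjust them to your own cluster. The request is only constructed here, not sent.</p>
<pre><code>import json
import urllib.request


def build_druid_sql_request(base_url, sql):
    """Build (but do not send) a POST request for Druid's SQL endpoint."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Broker address and datasource name.
request = build_druid_sql_request(
    "http://localhost:8082",
    "SELECT country, COUNT(*) AS events FROM sessions GROUP BY country",
)
print(request.get_method(), request.full_url)

# To run the query against a live cluster, you could send the request:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
</code></pre>
<p>Because the API is plain HTTP with a JSON body, the same request can be issued from any language or BI tool that can make a POST call.</p>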
<h3 id="heading-apache-airflow-the-orchestrator">Apache Airflow — the Orchestrator</h3>
<p>As mentioned in <a target="_blank" href="https://www.sspaeti.com/blog/olap-whats-coming-next/#Orchestrators">Orchestrators — Scheduling and monitor workflows</a>, choosing an orchestrator is one of the most critical decisions.</p>
<p>In the past, ETL tools like Microsoft SQL Server Integration Services (SSIS) and others were widely used. They were where your data transformation, cleaning and normalisation took place.</p>
<p>In more modern architectures, these tools aren’t enough anymore.</p>
<p>Moreover, code and data transformation logic are much more valuable to other data-savvy people in the company.</p>
<p>I highly recommend you read a blog post from <a target="_blank" href="https://medium.com/@maximebeauchemin">Maxime Beauchemin</a> about <a target="_blank" href="https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a">Functional Data Engineering — a modern paradigm for batch data processing</a>. This goes much deeper into how modern data pipelines should be.</p>
<p>Also consider reading <a target="_blank" href="https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b">The Downfall of the Data Engineer</a>, where Max explains the breaking up of the “data silo” and much more.</p>
<h4 id="heading-why-use-airflow">Why use Airflow?</h4>
<p><a target="_blank" href="https://airflow.apache.org/">Apache Airflow</a> is a very popular tool for task orchestration. Airflow is written in Python. Workflows are defined as Directed Acyclic Graphs (<a target="_blank" href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">DAGs</a>) of tasks, and these are also written in Python.</p>
<p>Instead of encapsulating your critical transformation logic somewhere in a tool, you place it where it belongs: inside the orchestrator.</p>
<p>Another advantage is using plain Python. There is no need to encapsulate other dependencies or requirements, like fetching from an FTP, copying data from A to B, writing a batch-file. You do that and everything else in the same place.</p>
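<p>To illustrate the idea of a workflow as a DAG written in plain Python, here is a small sketch that uses only the standard library (<code>graphlib</code>, available from Python 3.9). It is deliberately not Airflow’s API; the task names are invented, and the point is only to show how dependencies between tasks determine a valid execution order.</p>
<pre><code>from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# The task names are made up for illustration.
pipeline = {
    "fetch_from_ftp": set(),
    "copy_to_staging": {"fetch_from_ftp"},
    "transform": {"copy_to_staging"},
    "validate": {"copy_to_staging"},
    "load_to_druid": {"transform", "validate"},
}


def run_order(dag):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
</code></pre>
<p>An orchestrator such as Airflow adds scheduling, retries, logging and distribution on top of this basic ordering, but the dependency graph is the core of the workflow definition.</p>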
<h4 id="heading-features-of-airflow">Features of Airflow</h4>
<p>Airflow gives you a fully functional overview of all current tasks in one place.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/RoR-Nl7GwR7qSU1rWxUX8RYAywKVwfqfgMPO" alt="Image" width="726" height="479" loading="lazy"></p>
<p>Other relevant features of Airflow: you write workflows as if you were writing programs, and external jobs like Databricks, Spark, etc. are no problem.</p>
<p>Job testing goes through Airflow itself. That includes passing parameters to other jobs downstream or verifying what is running on Airflow and seeing the actual code. The log files and other metadata are accessible through the web GUI.</p>
<p>Rerunning only parts of a workflow, together with their dependent tasks, is a crucial feature that comes out of the box when you create your workflows with Airflow. The jobs/tasks run in a context: the scheduler passes in the necessary details, and the work gets distributed across your cluster at the task level, not at the DAG level.</p>
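<p>The partial rerun can be pictured as a graph traversal: given one task, collect that task and everything downstream of it. The following Python sketch shows the idea with a hand-written dependency map. It is an illustration of the concept, not how Airflow implements it, and the task names are hypothetical.</p>
<pre><code>from collections import deque

# Task name mapped to the set of tasks it depends on (hypothetical names).
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "export_report": {"aggregate"},
    "archive_raw": {"extract"},
}


def tasks_to_rerun(dag, start):
    """Return the start task plus every task that directly or indirectly depends on it."""
    children = {}
    for task, upstream in dag.items():
        for parent in upstream:
            children.setdefault(parent, set()).add(task)

    selected = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, set()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


print(sorted(tasks_to_rerun(dependencies, "clean")))
</code></pre>
<p>In this example, rerunning <code>clean</code> also selects <code>aggregate</code> and <code>export_report</code>, while <code>extract</code> and <code>archive_raw</code> are left alone.</p>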
<p>For many more features, visit the <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/great.html">full list</a>.</p>
<h4 id="heading-etl-with-apache-airflow">ETL with Apache Airflow</h4>
<p>If you want to start using Apache Airflow as your new ETL tool, begin with <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/">ETL best practices with Airflow</a>. It has simple <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/etlexample.html">ETL</a> examples with plain SQL, with <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/hiveexample.html">HIVE</a>, with <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/datavault.html">Data Vault</a>, <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/datavault2.html">Data Vault 2</a>, and <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/datavault-bigdata.html">Data Vault with Big Data processes</a>. It gives you an excellent overview of what’s possible and how you would approach it.</p>
<p>At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure. You can pull the container from <a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/etlexample.html#run-airflow-from-docker">here</a>.</p>
<p>For the GitHub repo, follow the link to <a target="_blank" href="https://github.com/gtoonstra/etl-with-airflow">etl-with-airflow</a>.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>If you’re searching for open-source data architecture, you cannot ignore Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy to use dashboard tool like Apache Superset.</p>
<p>My experience so far is that Druid is bloody fast and a perfect fit for <a target="_blank" href="https://medium.com/@sspaeti/olap-whats-coming-next-be01c1567b87">OLAP cube replacements</a> in the traditional sense, but it still needs an easier way to install clusters, ingest data, view logs, and so on. If you need that, have a look at <a target="_blank" href="https://imply.io/">Imply</a>, which was created by the founders of Druid. It provides all the services around Druid that you need. Unfortunately, though, it’s not open-source.</p>
<p>Apache Airflow and its features as an orchestrator are something that has not been seen much yet in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and newer technologies.</p>
<p>And Apache Superset is an easy and fast way to get up and running and show data from Druid. There are better tools like Tableau, but they are not free. That’s why Superset fits well in the ecosystem if you’re already using the above open-source technologies. But as an enterprise company, you might want to spend some money in that category because that is what the users see at the end of the day.</p>
<p>Related Links:</p>
<ul>
<li><p><a target="_blank" href="https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a">Understanding Apache Airflow’s key concepts</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c">How Druid enables analytics at Airbnb</a></p>
</li>
<li><p><a target="_blank" href="https://techcrunch.com/2018/05/01/google-launches-cloud-composer-a-new-workflow-automation-tool-for-developers/">Google launches Cloud Composer, a new workflow automation tool for developers</a></p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/composer/">A fully managed workflow orchestration service built on Apache Airflow</a></p>
</li>
<li><p><a target="_blank" href="https://databricks.com/blog/2016/12/08/integrating-apache-airflow-databricks-building-etl-pipelines-apache-spark.html">Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark</a></p>
</li>
<li><p><a target="_blank" href="https://gtoonstra.github.io/etl-with-airflow/">ETL with Apache Airflow</a></p>
</li>
<li><p><a target="_blank" href="https://hackernoon.com/data-engineering-the-future-of-data-warehousing-81bc953a9b00">What is Data Engineering and the future of Data Warehousing</a></p>
</li>
<li><p><a target="_blank" href="https://imply.io/">Imply — Managed Druid platform (closed-source)</a></p>
</li>
<li><p><a target="_blank" href="https://de.hortonworks.com/blog/apache-hive-druid-part-1-3/">Ultra-fast OLAP Analytics with Apache Hive and Druid</a></p>
</li>
</ul>
<p><em>Originally published at</em> <a target="_blank" href="https://www.sspaeti.com/blog/open-source-data-warehousing-druid-airflow-superset/"><em>www.sspaeti.com</em></a> <em>on November 29, 2018.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How you can access your “dark data” with Amazon Redshift Spectrum ]]>
                </title>
                <description>
                    <![CDATA[ By Lars Kamp Amazon’s Simple Storage Service (S3) has been around since 2006. Enterprises have been pumping their data into this data lake at a furious rate. Within 10 years of its birth, S3 stored over 2 trillion objects, each up to 5 terabytes in s... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/amazon-redshift-spectrum-diving-into-the-data-lake-7532e7e11716/</link>
                <guid isPermaLink="false">66c343c8f9d371e3aae26818</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 05 Jan 2018 21:49:56 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*n-cTG_rKS4cY8bTd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Lars Kamp</p>
<p>Amazon’s Simple Storage Service (<a target="_blank" href="https://aws.amazon.com/s3/">S3</a>) has been around since 2006. Enterprises have been pumping their data into this data lake at a furious rate. Within 10 years of its birth, S3 stored over <a target="_blank" href="https://www.statista.com/statistics/222309/total-number-of-objects-stored-in-amazons-s3/">2 trillion objects</a>, each up to 5 terabytes in size. These companies know their data is valuable and worth preserving. But much of this data lies inert, in “cold” data lakes, unavailable for analysis, as so-called “dark data.”</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/-gp4hTOE-I4Ma3hQRbOYtqnQpZnmXjXby2as" alt="Image" width="638" height="359" loading="lazy">
<em>The Dark Data Problem. Source: Amazon AWS.</em></p>
<h3 id="heading-analyzing-dark-data">Analyzing “Dark Data”</h3>
<p>So what lies below the surface of data lakes? The first thing for organizations to do is to find out what dark data they have accumulated. Then they need to analyze it in search of valuable insights. That means analysts need solutions that allow them to access petabytes of dark data.</p>
<p>With <a target="_blank" href="https://aws.amazon.com/redshift/spectrum/">Amazon Redshift Spectrum</a>, you can query data in Amazon S3 without first loading it into Amazon Redshift. For nomenclature purposes, I’ll use “Redshift” for “Amazon Redshift,” and “Spectrum” for “Amazon Redshift Spectrum.”</p>
<p>There are three major existing ways to access and analyze data in S3.</p>
<ul>
<li><a target="_blank" href="https://aws.amazon.com/emr/">Amazon Elastic MapReduce</a> (EMR). EMR uses <a target="_blank">Hadoop</a>-style queries to access and process large data sets in S3.</li>
<li><a target="_blank" href="https://aws.amazon.com/athena/">Amazon Athena.</a> Athena offers a console to query S3 data with standard SQL and no infrastructure to manage. Athena also has an <a target="_blank" href="https://docs.aws.amazon.com/athena/latest/APIReference/Welcome.html">API</a>.</li>
<li><a target="_blank" href="https://aws.amazon.com/redshift/">Amazon Redshift</a>. You can load data from S3 into an Amazon Redshift cluster for analysis.</li>
</ul>
<p>So why not use these existing options? For example, companies already use Amazon Redshift to analyze their “hot” data. So why not load that cold data from S3 into Redshift and call it a day?</p>
<p><strong>There are two main reasons:</strong></p>
<ul>
<li><strong>Effort</strong>. Loading data into Amazon Redshift involves extract, transform, and load (ETL) steps. Those steps are necessary to convert and structure data for analysis. Amazon estimates that figuring out the right ETL consumes 70% of an analytics project.</li>
<li><strong>Cost</strong>. You may not even know what data to extract until you have analyzed it a bit. Uploading lots of cold S3 data for analysis requires growing your clusters. That translates to paying more, as Redshift pricing is based on the size of your cluster. Meanwhile, you continue to pay S3 storage charges for retaining your cold data.</li>
</ul>
<p>Redshift Spectrum offers the best of both worlds. With Spectrum, you can:</p>
<ul>
<li>Continue using your analytics applications, with the same queries you’ve written for Redshift.</li>
<li>Leave cold data as-is in S3, and query it via Amazon Redshift, without ETL processing. That includes joining data from your data lake with data in Redshift, using a single query.</li>
<li>Decouple processing from storage. Because there’s no need to increase cluster size, you can save on Redshift storage.</li>
<li>Pay only when you run queries against S3 data. Spectrum queries cost $5 per terabyte of data processed.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Abq3pAAbYCjQUnb9t8cZunLqckNJNzMeKiFL" alt="Image" width="800" height="500" loading="lazy">
<em>Data Stack with Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena, AWS Glue and S3.</em></p>
<p>Spectrum is the “glue” or “bridge” layer that provides Redshift an interface to S3 data. Redshift becomes the access layer for your business applications. Spectrum is the query processing layer for data accessed from S3. The above picture illustrates the relationship between these services.</p>
<h4 id="heading-a-closer-look-at-redshift-spectrum"><strong>A closer look at Redshift Spectrum</strong></h4>
<p>From a deployment perspective, Spectrum is “under the hood.” It’s a group of managed nodes in your <a target="_blank" href="https://aws.amazon.com/vpc/">VPC</a>, available to any of your Redshift clusters that are Spectrum-enabled. Redshift pushes compute-intensive tasks down to the Spectrum layer, which is independent of your Amazon Redshift cluster.</p>
<p>There are three key concepts to understand how to run queries with Redshift Spectrum:</p>
<ol>
<li>External data catalog</li>
<li>External schemas</li>
<li>External tables</li>
</ol>
<p>The <strong>external data catalog</strong> contains the schema definitions for the data you wish to access in S3. It’s a central metadata repository for your data assets.</p>
<p>The <strong>external schema</strong> contains your tables. External tables allow you to query data in S3 using the same SELECT syntax as with other Amazon Redshift tables. External tables are read-only, that is, you can’t write to an external table.</p>
<p>You can keep writing your usual Redshift queries. The main change with Spectrum is that the queries now also contain a reference to data stored in S3.</p>
<h3 id="heading-joining-internal-and-external-tables">Joining internal and external tables</h3>
<p>The Redshift query engine treats internal and external tables the same way. You can do the typical operations like queries and joins on either type of table or a combination of both. Query an external table and join its data with that from an internal one.</p>
<p>As an example, let’s say you are using Redshift to analyze data about your e-commerce site visitors: what pages they visit, how long they stay, what they buy (or not), and so on. You keep a year’s worth of data in your Redshift clusters and move older data to S3.</p>
<p>Then you notice an odd seasonal variation. You want to see if this was also true for past years, or if it was an aberration for this year. Luckily you have saved historic clickstream data in S3, going back many years. You can now access that historic data via an external table with Spectrum, and run the same queries you’re running in Amazon Redshift. Or you can create new insights by joining other past data with this year’s data.</p>
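<p>As a rough sketch of how this scenario could look, the Python snippet below holds three SQL statements as strings: one registering an external schema, one defining an external table over Parquet files in S3, and one combining it with an internal table. The schema, table, column, and bucket names, as well as the IAM role ARN, are all hypothetical placeholders. The statements follow the general shape of Redshift’s <code>CREATE EXTERNAL SCHEMA</code> and <code>CREATE EXTERNAL TABLE</code> commands and would need adapting to your own catalog.</p>
<pre><code class="lang-python"># All names below (schema, tables, columns, bucket, role ARN) are hypothetical.

# Register an external schema backed by a data catalog database.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'clickstream_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# Define an external table over Parquet files that stay in S3.
CREATE_EXTERNAL_TABLE = """
CREATE EXTERNAL TABLE spectrum.clicks_history (
    visitor_id BIGINT,
    page_url   VARCHAR(2048),
    visited_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://example-bucket/clickstream/history/';
"""

# Combine this year's data (internal table) with older data (external table).
SEASONAL_QUERY = """
SELECT EXTRACT(month FROM visited_at) AS month, COUNT(*) AS visits
FROM (
    SELECT visited_at FROM clicks_current
    UNION ALL
    SELECT visited_at FROM spectrum.clicks_history
) AS all_clicks
GROUP BY 1
ORDER BY 1;
"""

for statement in (CREATE_EXTERNAL_SCHEMA, CREATE_EXTERNAL_TABLE, SEASONAL_QUERY):
    print(statement.strip())
</code></pre>
<p>The last statement reads like any other Redshift query. The only hint that part of the data lives in S3 is the <code>spectrum.</code> schema prefix on the historic table.</p>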
<p>Redshift parses, compiles, and distributes an SQL query to the nodes in a cluster the normal way. The part of the query that references an external data source gets sent to Spectrum. Spectrum processes the relevant data in S3, and sends the result back to Redshift. Redshift collects the partial results from its nodes and Spectrum, concatenates and joins them (and so on), and returns the complete result.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qoYMU2VsrmezRos4wFLAa9yszoz0-ZSv8FKL" alt="Image" width="800" height="450" loading="lazy"></p>
<h3 id="heading-summary">Summary</h3>
<p>Here are a few points to keep in mind when working with Spectrum:</p>
<ul>
<li>Your business applications remain unchanged and don’t know how or where a query is running. The only change for the business analyst is when defining access to external tables.</li>
<li>External data remains in S3 — there is no ETL to load it into your Redshift cluster. That decouples your storage layer in S3 from your processing layer with Redshift and Spectrum.</li>
<li>You don’t need to increase the size of your Redshift cluster to process data in S3. You only pay for the S3 data your queries actually access.</li>
<li>Redshift does all the hard work of minimizing the number of Spectrum nodes needed to access the S3 data. It also makes processing between Redshift and Spectrum efficient.</li>
</ul>
<p>You should also do the homework to ensure that processing of data in S3 is economical and efficient. You can save on costs and get better performance if you partition the data, compress it, or convert it to columnar formats such as Apache Parquet.</p>
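<p>To get a feel for why this homework pays off, here is a back-of-the-envelope sketch in Python. It uses the $5 per terabyte price quoted above. The data set size and the reduction factors are made-up assumptions for illustration, not measurements, and a terabyte is assumed to be 1024<sup>4</sup> bytes.</p>
<pre><code class="lang-python">PRICE_PER_TB_USD = 5.0    # per-terabyte price quoted in this article
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes

def spectrum_query_cost(bytes_scanned):
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

# Hypothetical data set: 4 TB of raw text files.
raw_bytes = 4 * BYTES_PER_TB

# Assumed, illustrative reductions (not measurements):
# the query touches 1 of 12 monthly partitions, and a compressed
# columnar format lets it read a quarter of the remaining bytes.
partitioned_bytes = raw_bytes / 12
columnar_bytes = partitioned_bytes / 4

scenarios = [
    ("full scan", raw_bytes),
    ("one partition", partitioned_bytes),
    ("one partition, columnar", columnar_bytes),
]
for label, scanned in scenarios:
    print(f"{label}: ${spectrum_query_cost(scanned):.2f}")
</code></pre>
<p>Under these assumptions, the same question costs $20.00 as a full scan, and well under a dollar once the data is partitioned and stored in a columnar format.</p>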
<p>In summary, Spectrum adds one more tool to your Redshift-based data warehouse investment. You can now use its power to probe and analyze your data lake on an as-needed basis for a very low per-query price.</p>
<p>I’m the cofounder of intermix.io. If you want to check it out, you can do so <a target="_blank" href="https://www.intermix.io/slow-queries-fix/?utm_source=medium&amp;utm_campaign=redshift_spectrum">here</a>.</p>
<p><em>Originally published at <a target="_blank" href="https://www.intermix.io/amazon-redshift-spectrum-diving-data-lake/?utm_source=medium&amp;utm_campaign=Redshift_spectrum_medium">www.intermix.io</a>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ SQLAlchemy makes ETL magically easy ]]>
                </title>
                <description>
                    <![CDATA[ By Peter Gleeson One of the key aspects of any data science workflow is the sourcing, cleaning, and storing of raw data in a form that can be used upstream. This process is commonly referred to as “Extract-Transform-Load,” or ETL for short. It is imp... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/sqlalchemy-makes-etl-magically-easy-ab2bd0df928/</link>
                <guid isPermaLink="false">66d460b737bd2215d1e245b6</guid>
                
                    <category>
                        <![CDATA[ analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ backend ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Backend Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ETL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ sqlalchemy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 29 Dec 2017 08:37:13 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*G7XlxVd4okqhBrn6_WhMaQ.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Peter Gleeson</p>
<p>One of the key aspects of any data science workflow is the sourcing, cleaning, and storing of raw data in a form that can be used downstream. This process is commonly referred to as “Extract-Transform-Load,” or ETL for short.</p>
<p>It is important to design efficient, robust, and reliable ETL processes, or “data pipelines.” An inefficient pipeline will make working with data slow and unproductive. A non-robust pipeline will break easily, leaving gaps.</p>
<p>Worse still, an unreliable data pipeline will silently contaminate your database with false data that may not become apparent until damage has been done.</p>
<p>Although critically important, ETL development can be a slow and cumbersome process at times. Luckily, there are open source solutions that make life much easier.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IWyl3vAwg96qzltFcC8xa57pHLZkoASbhVhB" alt="Image" width="800" height="618" loading="lazy"></p>
<h4 id="heading-what-is-sqlalchemy">What is SQLAlchemy?</h4>
<p>One such solution is a Python module called SQLAlchemy. It allows data engineers and developers to define schemas, write queries, and manipulate SQL databases entirely through Python.</p>
<p>SQLAlchemy’s Object Relational Mapper (ORM) and Expression Language functionalities iron out some of the idiosyncrasies apparent between different implementations of SQL by allowing you to associate Python classes and constructs with data tables and expressions.</p>
<p>Here, we’ll run through some highlights of SQLAlchemy to discover what it can do and how it can make ETL development a smoother process.</p>
<h4 id="heading-setting-up">Setting up</h4>
<p>You can install SQLAlchemy using the pip package installer.</p>
<pre><code>$ pip install sqlalchemy
</code></pre><p>As for SQL itself, there are many different versions available, including MySQL, Postgres, Oracle, and Microsoft SQL Server. For this article, we’ll be using SQLite.</p>
<p>SQLite is an open-source implementation of SQL that usually comes pre-installed with Linux and Mac OS X. It is also available for Windows. If you don’t have it on your system already, you can follow <a target="_blank" href="https://www.tutorialspoint.com/sqlite/sqlite_installation.htm">these instructions</a> to get up and running.</p>
<p>In a new directory, use the terminal to create a new database:</p>
<pre><code>$ mkdir sqlalchemy-demo &amp;&amp; cd sqlalchemy-demo
$ touch demo.db
</code></pre><h4 id="heading-defining-a-schema">Defining a schema</h4>
<p>A <strong>database schema</strong> defines the structure of a database system, in terms of tables, columns, fields, and the relationships between them. Schemas can be defined in raw SQL, or through the use of SQLAlchemy’s ORM feature.</p>
<p>Below is an example showing how to define a schema of two tables for an imaginary blogging platform. One is a table of users, and the other is a table of posts uploaded.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Explicit imports; declarative_base lives in sqlalchemy.orm as of SQLAlchemy 1.4</span>
<span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> Column, DateTime, Integer, String, create_engine, func, select
<span class="hljs-keyword">from</span> sqlalchemy.orm <span class="hljs-keyword">import</span> declarative_base, sessionmaker

engine = create_engine(<span class="hljs-string">'sqlite:///demo.db'</span>)
Base = declarative_base()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Users</span>(<span class="hljs-params">Base</span>):</span>
    __tablename__ = <span class="hljs-string">"users"</span>
    UserId = Column(Integer, primary_key=<span class="hljs-literal">True</span>)
    Title = Column(String)
    FirstName = Column(String)
    LastName = Column(String)
    Email = Column(String)
    Username = Column(String)
    DOB = Column(DateTime)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Uploads</span>(<span class="hljs-params">Base</span>):</span>
    __tablename__ = <span class="hljs-string">"uploads"</span>
    UploadId = Column(Integer, primary_key=<span class="hljs-literal">True</span>)
    UserId = Column(Integer)
    Title = Column(String)
    Body = Column(String)
    Timestamp = Column(DateTime)

Users.__table__.create(bind=engine, checkfirst=<span class="hljs-literal">True</span>)
Uploads.__table__.create(bind=engine, checkfirst=<span class="hljs-literal">True</span>)
</code></pre>
<p>First, import everything you need from SQLAlchemy. Then, use <code>create_engine(connection_string)</code> to connect to your database. The exact connection string will depend on the database backend you are working with. This example uses a relative path to the SQLite database created earlier.</p>
<p>Next, start defining your table classes. The first one in the example is <code>Users</code>. Each column in this table is defined as a class variable using SQLAlchemy’s <code>Column(type)</code>, where <code>type</code> is a data type (such as <code>Integer</code>, <code>String</code>, <code>DateTime</code> and so on). Use <code>primary_key=True</code> to denote columns which will be used as primary keys.</p>
<p>The next table defined here is <code>Uploads</code>. It’s very much the same idea — each column is defined as before.</p>
<p>The final two lines actually create the tables. The <code>checkfirst=True</code> parameter ensures that new tables are only created if they do not currently exist in the database.</p>
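<p>If you have many tables, creating them one by one gets repetitive. As a minimal sketch, assuming SQLAlchemy 1.4 or later, <code>Base.metadata.create_all()</code> creates every table registered on <code>Base</code> in one call, and <code>inspect()</code> lets you confirm what exists. The <code>Tags</code> table here is hypothetical, and an in-memory SQLite database is used so nothing touches <code>demo.db</code>.</p>
<pre><code class="lang-python">from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

# In-memory SQLite database, so nothing is written to disk.
engine = create_engine('sqlite://')
Base = declarative_base()

class Tags(Base):  # hypothetical table, used only for this illustration
    __tablename__ = "tags"
    TagId = Column(Integer, primary_key=True)
    Name = Column(String)

# create_all() creates every table registered on Base.metadata,
# skipping tables that already exist (checkfirst defaults to True).
Base.metadata.create_all(engine)
Base.metadata.create_all(engine)  # safe to call again

table_names = inspect(engine).get_table_names()
print(table_names)
</code></pre>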
<h4 id="heading-extract">Extract</h4>
<p>Once the schema has been defined, the next task is to <strong>extract</strong> the raw data from its source. The exact details can vary wildly from case to case, depending on how the raw data is provided. Maybe your app calls an in-house or third-party API, or perhaps you need to read data logged in a CSV file.</p>
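<p>For the CSV case, a minimal sketch using only Python’s standard library might look like the following. The file contents and column names are made up, and an in-memory string stands in for a real file on disk.</p>
<pre><code class="lang-python">import csv
import io

# Stand-in for a CSV file on disk; the column names are hypothetical.
raw_csv = io.StringIO(
    "user_id,title,body\n"
    "1,First post,Hello world\n"
    "2,Second post,More text\n"
)

# DictReader yields one dictionary per row, keyed by the header line.
rows = list(csv.DictReader(raw_csv))

for row in rows:
    print(row['user_id'], row['title'])
</code></pre>
<p>Note that <code>csv</code> reads every value as a string, so numeric columns such as <code>user_id</code> still need converting during the transform step.</p>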
<p>The example below uses two APIs to simulate data for the fictional blogging platform described above. The <code>Users</code> table will be populated with profiles randomly generated at <a target="_blank" href="https://randomuser.me/">randomuser.me</a>, and the <code>Uploads</code> table will contain lorem ipsum-inspired data courtesy of <a target="_blank" href="http://jsonplaceholder.typicode.com/">JSONPlaceholder</a>.</p>
<p>The third-party <code>requests</code> library (installable with <code>pip install requests</code>) can be used to call these APIs, as shown below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests

url = <span class="hljs-string">'https://randomuser.me/api/?results=10'</span>
users_json = requests.get(url).json()
url2 = <span class="hljs-string">'https://jsonplaceholder.typicode.com/posts/'</span>
uploads_json = requests.get(url2).json()
</code></pre>
<p>The data is currently held in two objects (<code>users_json</code> and <code>uploads_json</code>) in JSON format. The next step will be to transform and load this data into the tables defined earlier.</p>
<h4 id="heading-transform">Transform</h4>
<p>Before the data can be loaded into the database, it is important to ensure that it is in the correct format. The JSON objects created in the code above are nested, and contain more data than is required for the tables defined.</p>
<p>An important intermediary step is to <strong>transform</strong> the data from its current nested JSON format to a flat format that can be safely written to the database without error.</p>
<p>For the example running through this article, the data are relatively simple, and won’t need much transformation. The code below creates two lists, <code>users</code> and <code>uploads</code>, which will be used in the final step:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta
<span class="hljs-keyword">from</span> random <span class="hljs-keyword">import</span> randint

users, uploads = [], []

<span class="hljs-keyword">for</span> i, result <span class="hljs-keyword">in</span> enumerate(users_json[<span class="hljs-string">'results'</span>]):
    row = {}
    row[<span class="hljs-string">'UserId'</span>] = i
    row[<span class="hljs-string">'Title'</span>] = result[<span class="hljs-string">'name'</span>][<span class="hljs-string">'title'</span>]
    row[<span class="hljs-string">'FirstName'</span>] = result[<span class="hljs-string">'name'</span>][<span class="hljs-string">'first'</span>]
    row[<span class="hljs-string">'LastName'</span>] = result[<span class="hljs-string">'name'</span>][<span class="hljs-string">'last'</span>]
    row[<span class="hljs-string">'Email'</span>] = result[<span class="hljs-string">'email'</span>]
    row[<span class="hljs-string">'Username'</span>] = result[<span class="hljs-string">'login'</span>][<span class="hljs-string">'username'</span>]
    dob = datetime.strptime(result[<span class="hljs-string">'dob'</span>],<span class="hljs-string">'%Y-%m-%d %H:%M:%S'</span>)    
    row[<span class="hljs-string">'DOB'</span>] = dob.date()

    users.append(row)

<span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> uploads_json:
    row = {}
    row[<span class="hljs-string">'UploadId'</span>] = result[<span class="hljs-string">'id'</span>]
    row[<span class="hljs-string">'UserId'</span>] = result[<span class="hljs-string">'userId'</span>]
    row[<span class="hljs-string">'Title'</span>] = result[<span class="hljs-string">'title'</span>]
    row[<span class="hljs-string">'Body'</span>] = result[<span class="hljs-string">'body'</span>]
    delta = timedelta(seconds=randint(<span class="hljs-number">1</span>,<span class="hljs-number">86400</span>))
    row[<span class="hljs-string">'Timestamp'</span>] = datetime.now() - delta

    uploads.append(row)
</code></pre>
<p>The main step here is to iterate through the JSON objects created before. For each result, create a new Python dictionary object with keys corresponding to each column defined for the relevant table in the schema. This ensures that the data is no longer nested, and keeps only the data needed for the tables.</p>
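<p>The same flattening idea can be written more generically. The sketch below is one possible approach, not part of the original pipeline: it maps each target column to the path of keys that leads to its value. The sample record and the <code>flatten()</code> helper are hypothetical.</p>
<pre><code class="lang-python"># Hypothetical record, shaped like one nested API result.
nested = {
    'name': {'title': 'ms', 'first': 'ada', 'last': 'lovelace'},
    'email': 'ada@example.com',
    'login': {'username': 'ada1815', 'password': 'not-needed'},
}

# Map each target column to the path of keys that leads to its value.
COLUMN_PATHS = {
    'Title': ('name', 'title'),
    'FirstName': ('name', 'first'),
    'LastName': ('name', 'last'),
    'Email': ('email',),
    'Username': ('login', 'username'),
}

def flatten(record, column_paths):
    """Return a flat dict holding only the columns named in column_paths."""
    row = {}
    for column, path in column_paths.items():
        value = record
        for key in path:
            value = value[key]  # raises KeyError if the source shape changes
        row[column] = value
    return row

flat_row = flatten(nested, COLUMN_PATHS)
print(flat_row)
</code></pre>
<p>Failing loudly with a <code>KeyError</code> is deliberate here: if the source changes shape, you find out before bad rows reach the database.</p>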
<p>The other step is to use Python’s <code>datetime</code> module to parse date strings into <code>datetime</code> objects that can be written to the <code>DateTime</code> columns in the database. For the sake of this example, random upload timestamps are generated by subtracting a random <code>timedelta</code> (a class from the same <code>datetime</code> module) from the current time.</p>
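<p>In isolation, the two date operations look like this. The input string and the reference time are arbitrary example values, and the format string is an assumption that has to match whatever your source actually returns.</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

# Parse a timestamp string into a datetime object.
# The input format here is an assumption; match it to your source data.
dob = datetime.strptime('1990-04-23 08:15:00', '%Y-%m-%d %H:%M:%S')

# timedelta represents a duration; subtracting it shifts a datetime back.
reference = datetime(2017, 12, 29, 12, 0, 0)
earlier = reference - timedelta(hours=6, minutes=30)

print(dob.year, earlier.isoformat())
</code></pre>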
<p>Each created dictionary is appended to a list, which will be used in the final step of the pipeline.</p>
<h4 id="heading-load">Load</h4>
<p>Finally, the data is in a form that can be <strong>loaded</strong> into the database. SQLAlchemy makes this step straightforward through its Session API.</p>
<p>The Session API acts a bit like a middleman, or “holding zone,” for Python objects you have either loaded from or associated with the database. These objects can be manipulated within the session before being committed to the database.</p>
<p>The code below creates a new session object, adds rows to it, then commits them to the database:</p>
<pre><code class="lang-python">Session = sessionmaker(bind=engine)
session = Session()

<span class="hljs-keyword">for</span> user <span class="hljs-keyword">in</span> users:
    row = Users(**user)
    session.add(row)

<span class="hljs-keyword">for</span> upload <span class="hljs-keyword">in</span> uploads:
    row = Uploads(**upload)
    session.add(row)

session.commit()
</code></pre>
<p>The <code>sessionmaker</code> factory is used to generate newly-configured <code>Session</code> classes. <code>Session</code> is an everyday Python class that is instantiated on the second line as <code>session</code>.</p>
<p>Next up are two loops which iterate through the <code>users</code> and <code>uploads</code> lists created earlier. The elements of these lists are dictionary objects whose keys correspond to the columns given in the <code>Users</code> and <code>Uploads</code> classes defined previously.</p>
<p>Each object is used to instantiate a new instance of the relevant class (using Python’s handy <code>some_function(**some_dict)</code> trick). This object is added to the current session with <code>session.add()</code>.</p>
<p>Finally, when the session contains the rows to be added, <code>session.commit()</code> is used to commit the transaction to the database.</p>
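<p>One thing to watch for: running a load like this twice with the same primary keys will fail, because the rows already exist. A possible way around that, sketched below under the assumption of SQLAlchemy 1.4 or later, is to use <code>session.merge()</code>, which inserts new rows and updates existing ones matched on primary key. The <code>Authors</code> table and the <code>load()</code> helper are hypothetical, and the database is in-memory.</p>
<pre><code class="lang-python">from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

engine = create_engine('sqlite://')  # in-memory database for the demo
Base = declarative_base()

class Authors(Base):  # hypothetical table, for illustration only
    __tablename__ = "authors"
    AuthorId = Column(Integer, primary_key=True)
    Name = Column(String)

Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def load(rows):
    """Insert new rows and update existing ones, matched on primary key."""
    for row in rows:
        session.merge(Authors(**row))
    session.commit()

load([{'AuthorId': 1, 'Name': 'Ada'}, {'AuthorId': 2, 'Name': 'Grace'}])
# A second run with an overlapping key updates row 1 instead of failing.
load([{'AuthorId': 1, 'Name': 'Ada L.'}])

names = [a.Name for a in session.query(Authors).order_by(Authors.AuthorId)]
print(names)
</code></pre>
<p>The trade-off is that <code>merge()</code> looks up each row before writing it, so it is slower than a plain <code>add()</code> for large batches.</p>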
<h4 id="heading-aggregating">Aggregating</h4>
<p>Another cool feature of SQLAlchemy is the ability to use its Expression Language system to write and execute backend-agnostic SQL queries.</p>
<p>What are the advantages of writing backend-agnostic queries? For a start, they make any future migration projects a whole lot easier. Different versions of SQL have somewhat incompatible syntaxes, but SQLAlchemy’s Expression Language acts as a lingua franca between them.</p>
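<p>To make that concrete, here is a small sketch that compiles one expression for two different backends without connecting to either. It concatenates two string columns, an operation that PostgreSQL and MySQL spell differently. The column names are borrowed from the <code>Users</code> table above; no database driver is needed just to render the SQL.</p>
<pre><code class="lang-python">from sqlalchemy import String, column
from sqlalchemy.dialects import mysql, postgresql

# A backend-neutral expression: concatenate two string columns.
# The column names are taken from the Users table in this article.
full_name = column('FirstName', String) + column('LastName', String)

# Compile the same expression for two different backends.
postgres_sql = str(full_name.compile(dialect=postgresql.dialect()))
mysql_sql = str(full_name.compile(dialect=mysql.dialect()))

print(postgres_sql)
print(mysql_sql)
</code></pre>
<p>The PostgreSQL version uses the <code>||</code> operator, while the MySQL version calls <code>concat()</code>. The Python expression stays the same in both cases.</p>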
<p>Also, being able to query and interact with your database in a seamlessly Pythonic way is a real advantage to developers who’d prefer to work entirely in the language they know best. However, SQLAlchemy will also let you work in plain SQL, for cases when it is simpler to use a pre-written query.</p>
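<p>For the plain SQL route, a minimal sketch looks like this. It wraps a SQL string in <code>text()</code> and passes values as bound parameters rather than formatting them into the string. The query itself is a toy example run against an in-memory database.</p>
<pre><code class="lang-python">from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')  # in-memory database for the demo

with engine.connect() as connection:
    # text() wraps a plain SQL string; :low and :high are bound parameters.
    query = text("SELECT :low + :high AS total")
    total = connection.execute(query, {'low': 2, 'high': 3}).scalar()

print(total)
</code></pre>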
<p>Here, we will extend the fictional blogging platform example to illustrate how this works. Once the basic Users and Uploads tables have been created and populated, a next step might be to create an <strong>aggregated</strong> table — for instance, showing how many articles each user has posted, and the time they were last active.</p>
<p>First, define a class for the aggregated table:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">UploadCounts</span>(<span class="hljs-params">Base</span>):</span>
    __tablename__ = <span class="hljs-string">"upload_counts"</span>
    UserId = Column(Integer, primary_key=<span class="hljs-literal">True</span>)
    LastActive = Column(DateTime)
    PostCount = Column(Integer)

UploadCounts.__table__.create(bind=engine, checkfirst=<span class="hljs-literal">True</span>)
</code></pre>
<p>This table will have three columns. For each <code>UserId</code>, it will store the timestamp of when they were last active, and a count of how many posts they have uploaded.</p>
<p>In plain SQL, this table would be populated using a query along the lines of:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> upload_counts
<span class="hljs-keyword">SELECT</span>
  UserId,
  <span class="hljs-keyword">MAX</span>(<span class="hljs-built_in">Timestamp</span>) <span class="hljs-keyword">AS</span> LastActive,
  <span class="hljs-keyword">COUNT</span>(UploadId) <span class="hljs-keyword">AS</span> PostCount
<span class="hljs-keyword">FROM</span>
  uploads
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> <span class="hljs-number">1</span>;
</code></pre>
<p>In SQLAlchemy, this would be written as:</p>
<pre><code class="lang-python">connection = engine.connect()

query = (
    select(
        Uploads.UserId,
        func.max(Uploads.Timestamp).label(<span class="hljs-string">'LastActive'</span>),
        func.count(Uploads.UploadId).label(<span class="hljs-string">'PostCount'</span>),
    )
    .group_by(Uploads.UserId)
)

results = connection.execute(query)

<span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> results:
    <span class="hljs-comment"># ._mapping gives a dict-like view of the row (SQLAlchemy 1.4+)</span>
    row = UploadCounts(**result._mapping)
    session.add(row)

session.commit()
</code></pre>
<p>The first line creates a <code>Connection</code> object using the <code>engine</code> object’s <code>connect()</code> method. Next, a query is defined using the <code>select()</code> function.</p>
<p>This query is the same as the plain SQL version given above. It selects the <code>UserId</code> column from the <code>uploads</code> table. It also applies <code>func.max()</code> to the <code>Timestamp</code> column, which identifies the most recent timestamp. This is labelled <code>LastActive</code> using the <code>label()</code> method.</p>
<p>Likewise, the query applies <code>func.count()</code> to the <code>UploadId</code> column to count the number of uploads for each user. This is labelled <code>PostCount</code>.</p>
<p>Finally, the query uses <code>group_by()</code> to group results by <code>UserId</code>.</p>
<p>To use the results of the query, a for loop iterates over the row objects returned by <code>connection.execute(query)</code>. Each row is used to instantiate an instance of the <code>UploadCounts</code> table class. As before, each row is added to the <code>session</code> object, and finally the session is committed to the database.</p>
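<p>If you want to sanity-check the aggregation logic on its own, one option is to run the plain SQL form against a throwaway database using Python’s built-in <code>sqlite3</code> module. The sketch below uses a cut-down <code>uploads</code> table and made-up sample rows.</p>
<pre><code class="lang-python">import sqlite3

# Throwaway in-memory database with a cut-down uploads table.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE uploads (UploadId INTEGER, UserId INTEGER, Timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO uploads VALUES (?, ?, ?)",
    [  # made-up sample rows
        (1, 1, '2017-12-01 09:00:00'),
        (2, 1, '2017-12-05 18:30:00'),
        (3, 2, '2017-12-03 12:00:00'),
    ],
)

# Same aggregation as above: latest timestamp and post count per user.
counts = conn.execute(
    "SELECT UserId, MAX(Timestamp), COUNT(UploadId) "
    "FROM uploads GROUP BY UserId ORDER BY UserId"
).fetchall()
print(counts)
conn.close()
</code></pre>
<p>User 1 has two uploads and user 2 has one, so the result should contain one row per user with the matching count and the latest timestamp.</p>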
<h4 id="heading-checking-out">Checking out</h4>
<p>Once you have run this script, you may want to convince yourself that the data have been written correctly into the <code>demo.db</code> database created earlier.</p>
<p>After quitting Python, open the database in SQLite:</p>
<pre><code>$ sqlite3 demo.db
</code></pre><p>Now, you should be able to run the following queries:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> <span class="hljs-keyword">users</span>;

<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> uploads;

<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> upload_counts;
</code></pre>
<p>And the contents of each table will be printed to the console! By scheduling the Python script to run at regular intervals, you can keep the database up to date, provided that repeat runs handle rows whose primary keys already exist.</p>
<p>You could now use these tables to write queries for further analysis, or to build dashboards for visualisation purposes.</p>
<h4 id="heading-reading-further">Reading further</h4>
<p>If you’ve made it this far, then hopefully you’ll have learned a thing or two about how SQLAlchemy can make ETL development in Python much more straightforward!</p>
<p>It is not possible for a single article to do full justice to all the features of SQLAlchemy. However, one of the project’s key advantages is the depth and detail of its documentation. You can dive into it <a target="_blank" href="http://docs.sqlalchemy.org/en/latest/">here</a>.</p>
<p>Otherwise, check out <a target="_blank" href="https://github.com/crazyguitar/pysheeet/blob/master/docs/notes/python-sqlalchemy.rst">this cheatsheet</a> if you want to get started quickly.</p>
<p>The full code for this article can be found in <a target="_blank" href="https://gist.github.com/anonymous/a2fc91fdb87dbfaee365f6707e94c442">this gist</a>.</p>
<p>Thanks for reading! If you have any questions or comments, please leave a response below.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
