<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ data analysis - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ data analysis - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 19:48:22 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/data-analysis/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans ]]>
                </title>
                <description>
                    <![CDATA[ In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformatio... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-optimize-pyspark-jobs-handbook/</link>
                <guid isPermaLink="false">69851d7be613661950e00d8f</guid>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ spark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PySpark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS Glue ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sameer Shukla ]]>
                </dc:creator>
                <pubDate>Thu, 05 Feb 2026 22:45:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770331493095/d569e168-d3ba-40e0-a500-7f682bbef693.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformations and actual computation lies an invisible translation layer – the logical plan – that determines whether your job runs in minutes or hours.</p>
<p>Most engineers never look at this layer, which is why they spend days tuning configurations that don't address the real problem: inefficient transformations that generate bloated plans.</p>
<p>This handbook teaches you to read, interpret, and control those plans, transforming you from someone who writes PySpark code into someone who architects efficient data pipelines with precision and confidence.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-background-information">Background Information</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-1-the-spark-mindset-why-plans-matter">Chapter 1: The Spark Mindset: Why Plans Matter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-2-understanding-the-spark-execution-flow">Chapter 2: Understanding the Spark Execution Flow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-3-reading-and-debugging-plans-like-a-pro">Chapter 3: Reading and Debugging Plans Like a Pro</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-chapter-4-writing-efficient-transformations">Chapter 4: Writing Efficient Transformations</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-scenario-1-rename-in-one-pass-withcolumnrenamed-vs-todf">Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-2-reusing-expressions">Scenario 2: Reusing expressions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-3-batch-column-ops">Scenario 3: Batch column ops</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-4-early-filter-vs-late-filter">Scenario 4: Early Filter vs Late Filter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-5-column-pruning">Scenario 5: Column Pruning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-6-filter-pushdown-vs-full-scan">Scenario 6: Filter pushdown vs full scan</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-7-de-duplicate-right">Scenario 7: De-duplicate right</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-8-count-smarter">Scenario 8: Count Smarter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-9-window-wisely">Scenario 9: Window wisely</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-10-incremental-aggregations-with-cache-and-persist">Scenario 10: Incremental Aggregations with Cache and Persist</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-11-reduce-shuffles">Scenario 11: Reduce shuffles</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-12-know-your-shuffle-triggers">Scenario 12: Know Your Shuffle Triggers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-13-tune-parallelism-shuffle-partitions-amp-aqe">Scenario 13: Tune Parallelism: Shuffle Partitions &amp; AQE</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-14-handle-skew-smartly">Scenario 14: Handle Skew Smartly</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-scenario-15-sort-efficiently-orderby-vs-sortwithinpartitions">Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-background-information">Background Information</h2>
<h3 id="heading-what-this-handbook-is-really-about">What This Handbook is Really About</h3>
<p>This is not a tutorial about Spark internals, cluster tuning, or PySpark syntax or APIs.</p>
<p>This is a handbook about writing PySpark code that generates efficient logical plans.</p>
<p>Because when your code produces clean, optimized plans, Spark pushes filters correctly, shuffles reduce instead of multiply, projections stay shallow, and the DAG (<a target="_blank" href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed Acyclic Graph</a>) becomes predictable, lean, and fast.</p>
<p>When your code produces messy plans, Spark shuffles more than necessary, and projections pile up into deep, expensive stacks. Filters arrive late instead of early, joins explode into wide, slow operations, and the DAG becomes tangled and expensive.</p>
<p>The difference between a fast job and a slow job is not “faster hardware.” It’s the structure of the plan Spark generates from your code. This handbook teaches you to shape that plan deliberately through scenarios.</p>
<h3 id="heading-who-this-handbook-is-for">Who This Handbook Is For</h3>
<p>This handbook is written for:</p>
<ul>
<li><p><strong>Data engineers</strong> building production ETL pipelines who want to move beyond trial-and-error tuning and understand <em>why</em> jobs perform the way they do</p>
</li>
<li><p><strong>Analytics engineers</strong> working with large datasets in Databricks, EMR, or Glue who need to optimize Spark jobs but don't have time for thousand-page reference manuals</p>
</li>
<li><p><strong>Data scientists</strong> transitioning from pandas to PySpark who find themselves writing code that technically runs but takes forever</p>
</li>
<li><p><strong>Anyone</strong> who has stared at the Spark UI, seen mysterious "Exchange" nodes in the DAG, and wondered, <em>"Why is this shuffling so much data?"</em></p>
</li>
</ul>
<p>You should already be comfortable writing basic PySpark code: creating DataFrames, applying transformations, and running aggregations. This handbook won't teach you Spark syntax. Instead, it teaches you how to write transformations that work <em>with</em> the optimizer, not against it.</p>
<h3 id="heading-how-this-handbook-is-structured">How This Handbook Is Structured</h3>
<p>We’ll start with foundations, then move on to real-world scenarios.</p>
<p>Chapters 1-3 build your mental model. You'll learn what logical plans are, how they connect to physical execution, and how to read the plan output that Spark shows you. These chapters are short and focused – just enough theory to make the practical scenarios meaningful.</p>
<p>Chapter 4 is the heart of the handbook. It contains 15 real-world scenarios, organized by category. Each scenario shows you a common performance problem, explains what's happening in the logical plan, and demonstrates the better approach. You'll see before-and-after code, plan comparisons, and clear explanations of why one approach outperforms another.</p>
<h3 id="heading-what-youll-learn">What You'll Learn</h3>
<p>By the end of this handbook, you'll be able to:</p>
<ul>
<li><p>Read and interpret Spark's logical, optimized, and physical plans</p>
</li>
<li><p>Identify expensive operations before running your code</p>
</li>
<li><p>Restructure transformations to minimize shuffles</p>
</li>
<li><p>Choose the right join strategies for your data</p>
</li>
<li><p>Avoid common pitfalls that cause memory issues and slow performance</p>
</li>
<li><p>Debug production issues by examining execution plans</p>
</li>
</ul>
<p>More importantly, you'll develop a Spark mindset, an intuition for how your code translates to cluster operations. You'll stop writing code that "should work" and start writing code that you <em>know</em> will work efficiently.</p>
<h3 id="heading-technical-prerequisites">Technical Prerequisites</h3>
<p>I assume that you’re familiar with the following concepts before proceeding:</p>
<ol>
<li><p>Python fundamentals</p>
</li>
<li><p>PySpark basics</p>
<ul>
<li><p>Creating DataFrames and reading data from files</p>
</li>
<li><p>Basic DataFrame operations: select, filter, withColumn, groupBy, join</p>
</li>
<li><p>Writing DataFrames back to storage</p>
</li>
</ul>
</li>
<li><p>Basic Spark concepts</p>
<ul>
<li><p>Basic understanding of Spark applications, jobs, stages, and tasks</p>
</li>
<li><p>Basic understanding of the difference between transformations and actions</p>
</li>
<li><p>Understanding of partitions and shuffles</p>
</li>
</ul>
</li>
<li><p>AWS Glue (Good to have)</p>
</li>
</ol>
<h2 id="heading-chapter-1-the-spark-mindset-why-plans-matter">Chapter 1: The Spark Mindset: Why Plans Matter</h2>
<p>This chapter isn’t about Spark theory or internals. It’s about understanding Spark Plans, and seeing Spark the way the engine sees your code. Once you understand how Spark builds and optimizes a logical plan, optimization stops being trial and error and becomes intentional engineering.</p>
<p>Behind every simple transformation, Spark quietly redraws its internal blueprint. Every transformation you write from "<em>withColumn</em>" to join changes that plan. When the plan is efficient, Spark flies, but when it’s messy, Spark crawls.</p>
<h3 id="heading-the-invisible-layer-behind-every-transformation">The Invisible Layer Behind Every Transformation</h3>
<p>When you write PySpark code, it feels like you’re chaining operations step by step. In reality, Spark isn’t executing those lines. It’s quietly building a blueprint, a logical plan describing <em>what</em> to do, not <em>how</em>.</p>
<p>Once this plan is built, the Catalyst Optimizer analyzes it, rearranges operations, eliminates redundancies, and produces an optimized plan. Catalyst is Spark’s query optimization engine.</p>
<p>Every DataFrame or SQL operation you write, such as select, filter, join, or groupBy, is first converted into a logical plan. Catalyst then analyzes and transforms this plan using a set of rule-based optimizations, such as predicate pushdown, column pruning, constant folding, and join reordering. The result is an optimized logical plan, which Spark then converts into a physical plan: the version your cluster actually runs. This invisible planning layer decides the job’s performance more than any configuration setting.</p>
<h3 id="heading-from-logical-to-optimized-to-physical-plans">From Logical to Optimized to Physical Plans</h3>
<p>When you run <code>df.explain(True)</code>, Spark actually shows you four stages of reasoning:</p>
<h4 id="heading-1-logical-plan">1. Logical Plan</h4>
<p>The logical plan is the first stage. Spark translates your code into a tree that shows which operations need to happen, without worrying about how to execute them efficiently. It’s a blueprint of the query’s logic before any optimization or physical planning occurs.</p>
<p>This:</p>
<pre><code class="lang-python">df.filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">25</span>) \
  .select(<span class="hljs-string">'firstname'</span>, <span class="hljs-string">'country'</span>) \
  .groupby(<span class="hljs-string">'country'</span>) \
  .count() \
  .explain(<span class="hljs-literal">True</span>)
</code></pre>
<p>results in the following logical plan:</p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Aggregate ['</span>country], [<span class="hljs-string">'country, '</span>count(<span class="hljs-number">1</span>) AS count<span class="hljs-comment">#108]</span>
+- Project [firstname<span class="hljs-comment">#95, country#97]</span>
   +- Filter (age<span class="hljs-comment">#96L &gt; cast(25 as bigint))</span>
      +- LogicalRDD [firstname<span class="hljs-comment">#95, age#96L, country#97], false</span>
</code></pre>
<h4 id="heading-2-analyzed-logical-plan">2. Analyzed Logical Plan</h4>
<p>The analyzed logical plan is the second stage in Spark’s query optimization. In this stage, Spark validates the query by checking that tables and columns actually exist in the Catalog and by resolving all references. It converts the unresolved logical plan into a resolved one with correct data types and column bindings before optimization.</p>
<h4 id="heading-3-optimized-logical-plan">3. Optimized Logical Plan</h4>
<p>The optimized logical plan is where Spark's Catalyst optimizer improves the logical plan by applying smart rules like filtering data early, removing unnecessary columns, and combining operations to reduce computation. It's the smarter, more efficient version of your original plan that will execute faster and use fewer resources.</p>
<p>Let’s understand using a simple code example:</p>
<pre><code class="lang-python">df.select(<span class="hljs-string">'firstname'</span>, <span class="hljs-string">'country'</span>) \
  .groupby(<span class="hljs-string">'country'</span>) \
  .count() \
  .filter(col(<span class="hljs-string">'country'</span>) == <span class="hljs-string">'USA'</span>) \
  .explain(<span class="hljs-literal">True</span>)
</code></pre>
<p>Here’s the parsed logical plan:</p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Filter '</span>`=`(<span class="hljs-string">'country, USA)
+- Aggregate [country#97], [country#97, count(1) AS count#122L]
   +- Project [firstname#95, country#97]
      +- LogicalRDD [firstname#95, age#96L, country#97], false</span>
</code></pre>
<p>What this means:</p>
<ul>
<li><p>Spark first projects firstname and country</p>
</li>
<li><p>Then aggregates by country</p>
</li>
<li><p>Then applies the filter country = 'USA' <strong>after</strong> aggregation</p>
</li>
</ul>
<p>(because that’s how you wrote it).</p>
<p>Here’s the optimized logical plan:</p>
<pre><code class="lang-python">== Optimized Logical Plan ==
Aggregate [country<span class="hljs-comment">#97], [country#97, count(1) AS count#122L]</span>
+- Project [country<span class="hljs-comment">#97]</span>
   +- Filter (isnotnull(country<span class="hljs-comment">#97) AND (country#97 = USA))</span>
      +- LogicalRDD [firstname<span class="hljs-comment">#95, age#96L, country#97], false</span>
</code></pre>
<p>Key improvements Catalyst applied:</p>
<ul>
<li><p>Filter pushdown: The filter country = 'USA' is pushed below the aggregation, so Spark only groups U.S. rows.</p>
</li>
<li><p>Column pruning: “firstname” is automatically removed because it’s never used in the final output.</p>
</li>
<li><p>Cleaner projection: Intermediate columns are dropped early, reducing I/O and in-memory footprint.</p>
</li>
</ul>
<h4 id="heading-4-physical-plan">4. Physical Plan</h4>
<p>The physical plan is Spark's final execution blueprint that shows exactly how the query will run: which specific algorithms to use, how to distribute work across machines, and the order of low-level operations. It's the concrete, executable version of the optimized logical plan, translated into actual Spark operations like “ShuffleExchange”, “HashAggregate”, and “FileScan” that will run on your cluster.</p>
<p>Catalyst may, for example:</p>
<ul>
<li><p>Fold constants (lit(2) * lit(3) → 6)</p>
</li>
<li><p>Push filters closer to the data source</p>
</li>
<li><p>Replace a regular join with a broadcast join when data fits in memory</p>
</li>
</ul>
<p>Once the physical plan is finalized, Spark’s scheduler converts it into a DAG of stages and tasks that run across the cluster. Understanding that lineage, from your code → plan → DAG, is what separates fast jobs from slow ones.</p>
<h3 id="heading-how-to-read-a-logical-plan">How to Read a Logical Plan</h3>
<p>A logical plan prints as a tree: the bottom is your data source, and each higher node represents a transformation.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Node</strong></td><td><strong>Meaning</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Relation / LogicalRDD</td><td>Data source, the initial DataFrame</td></tr>
<tr>
<td>Project</td><td>Column selection and transformation (select, withColumn)</td></tr>
<tr>
<td>Filter</td><td>Row filtering based on conditions (where, filter)</td></tr>
<tr>
<td>Join</td><td>Combining two DataFrames on a condition (join); a union appears as a separate Union node</td></tr>
<tr>
<td>Aggregate</td><td>GroupBy and aggregation operations (groupBy, agg)</td></tr>
<tr>
<td>Exchange</td><td>Shuffle operation (data redistribution across partitions); appears in the physical plan</td></tr>
<tr>
<td>Sort</td><td>Ordering data (orderBy, sort)</td></tr>
</tbody>
</table>
</div><p>Each node represents a transformation. Execution flows from the bottom up. Let's understand with a basic example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> *
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> *

spark = SparkSession.builder.appName(<span class="hljs-string">"Practice"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

df = spark.createDataFrame(employees_data,  
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])
</code></pre>
<h4 id="heading-version-a-withcolumn-filter">Version A: withColumn → filter</h4>
<p>In this version, we add a derived column with <code>withColumn</code> and then apply a filter to the dataset. This ordering is logically correct and produces the expected result. It also shows how introducing derived columns early affects the logical plan: Spark is asked to compute a new column before any rows are eliminated.</p>
<pre><code class="lang-python">df_filtered = df \
    .withColumn(<span class="hljs-string">'bonus'</span>, col(<span class="hljs-string">'salary'</span>) * <span class="hljs-number">82</span>) \
    .filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">35</span>)

<span class="hljs-comment"># explain() prints the plans and returns None, so call it on its own line</span>
df_filtered.explain(<span class="hljs-literal">True</span>)
</code></pre>
<h4 id="heading-parsed-logical-plan-simplified">Parsed Logical Plan (Simplified)</h4>
<pre><code class="lang-python">Filter (age &gt; <span class="hljs-number">35</span>)
└─ Project [*, (salary * <span class="hljs-number">82</span>) AS bonus]
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>So what’s going on here? Execution flows from the bottom up.</p>
<ul>
<li><p>Spark first reads the LogicalRDD.</p>
</li>
<li><p>Then applies the Project node, keeping all columns and adding bonus.</p>
</li>
<li><p>Finally, the Filter removes rows where age ≤ 35.</p>
</li>
</ul>
<p>This means Spark computes the bonus for every employee, even those who are later filtered out. It's harmless here, but costly on millions of rows: more computation, more I/O, and more shuffle volume.</p>
<h4 id="heading-version-b-filter-project">Version B: Filter → Project</h4>
<p>In this version, we apply the filter before introducing the derived column. The idea is to show how pushing row-reducing operations earlier allows Catalyst to produce a leaner logical plan. Compared to Version A, this example demonstrates that the same logic, written in a different order, can significantly reduce the amount of work Spark needs to perform.</p>
<pre><code class="lang-python">df_filtered = df \
    .filter(col(<span class="hljs-string">'age'</span>) &gt; <span class="hljs-number">35</span>) \
    .withColumn(<span class="hljs-string">'bonus'</span>, col(<span class="hljs-string">'salary'</span>) * <span class="hljs-number">82</span>)

<span class="hljs-comment"># explain() prints the plans and returns None, so call it on its own line</span>
df_filtered.explain(<span class="hljs-literal">True</span>)
</code></pre>
<h4 id="heading-parsed-logical-plan-simplified-1">Parsed Logical Plan (Simplified)</h4>
<pre><code class="lang-python">Project [*, (salary * <span class="hljs-number">82</span>) AS bonus]
└─ Filter (age &gt; <span class="hljs-number">35</span>)
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>So what’s going on here?</p>
<ul>
<li><p>Spark starts from the LogicalRDD.</p>
</li>
<li><p>It immediately applies the Filter, reducing the dataset to only employees with age &gt; 35.</p>
</li>
<li><p>Then the Project node adds the derived column bonus for this smaller subset.</p>
</li>
</ul>
<p>Now the Filter sits below the Project in the plan, cutting data movement and minimizing computation. Spark prunes data first, then derives new columns. This order reduces both the volume of data processed and the amount transferred, leading to a lighter and faster plan.</p>
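<p>To make the difference between Version A and Version B concrete, here is a toy model in plain Python (not Spark) that counts how many times the bonus expression is evaluated when the steps run in the order written. The ages and salaries come from the sample employees above, and the function names are made up for this illustration. Treat it as a sketch of the as-written plans only, since Catalyst can reorder simple cases like this one.</p>

```python
# (age, salary) pairs taken from the sample employees DataFrame.
employees = [
    (28, 80000), (32, 85000), (25, 60000), (35, 90000), (29, 65000),
    (27, 55000), (40, 95000), (33, 70000), (26, 58000), (31, 88000),
]


def derive_then_filter(rows):
    """Version A: compute the bonus for every row, then filter."""
    evaluations = 0
    with_bonus = []
    for age, salary in rows:
        with_bonus.append((age, salary, salary * 82))
        evaluations += 1
    kept = [row for row in with_bonus if row[0] > 35]
    return kept, evaluations


def filter_then_derive(rows):
    """Version B: filter first, then compute the bonus for surviving rows."""
    evaluations = 0
    kept = []
    for age, salary in rows:
        if age > 35:
            kept.append((age, salary, salary * 82))
            evaluations += 1
    return kept, evaluations


result_a, evals_a = derive_then_filter(employees)
result_b, evals_b = filter_then_derive(employees)

print("Version A bonus evaluations:", evals_a)
print("Version B bonus evaluations:", evals_b)
print("Same result:", result_a == result_b)
```

<p>Both versions return the same rows, but Version B evaluates the expression only for the rows that survive the filter. On ten rows the gap is trivial. On millions of rows with expensive expressions, it is the difference the plan is trying to show you.</p>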
<h3 id="heading-why-you-should-look-at-the-plan-every-time-by-running-dfexplaintrue">Why You Should Look at the Plan Every Time by Running <code>df.explain(True)</code></h3>
<p>This is the quickest way to spot performance issues <em>before</em> they hit production. It shows:</p>
<ul>
<li><p>Whether filters sit in the right place.</p>
</li>
<li><p>How many Project nodes exist (each adds overhead).</p>
</li>
<li><p>Where Exchange nodes appear (these are shuffle boundaries).</p>
</li>
<li><p>If Catalyst pushed filters or rewrote joins as expected.</p>
</li>
</ul>
<p>A quick <code>explain()</code> takes seconds, while debugging a bad shuffle in production takes hours. Run <code>explain()</code> whenever you add or reorder transformations. The plan never lies.</p>
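<p>If you want to turn that habit into a quick check, one possible approach is to count the operators in the plan text. The helper below is a minimal sketch in plain Python: <code>summarize_plan</code> and its regular expression are hypothetical names written for this example, not part of PySpark, and the sample text is the optimized plan from Chapter 1. It assumes you have already captured the <code>explain()</code> output as a string.</p>

```python
import re
from collections import Counter

# Matches the operator name at the start of a plan line, skipping tree
# drawing characters such as "+-", ":", "*(1)" and a leading quote.
NODE_PATTERN = re.compile(r"^[\s:|+\-*()\d']*([A-Z][A-Za-z]+)")


def summarize_plan(plan_text):
    """Count how often each operator name appears in a plan string."""
    counts = Counter()
    for line in plan_text.splitlines():
        if line.startswith("=="):  # section headers such as "== Physical Plan =="
            continue
        match = NODE_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Optimized plan text copied from the Chapter 1 example.
sample_plan = """== Optimized Logical Plan ==
Aggregate [country#97], [country#97, count(1) AS count#122L]
+- Project [country#97]
   +- Filter (isnotnull(country#97) AND (country#97 = USA))
      +- LogicalRDD [firstname#95, age#96L, country#97], false
"""

summary = summarize_plan(sample_plan)
print(dict(summary))
print("Exchange nodes:", summary["Exchange"])
```

<p>A rising count of Project or Exchange nodes between two versions of a pipeline is a cheap signal that a change deserves a closer look at the full plan.</p>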
<h4 id="heading-what-spark-does-under-the-hood">What Spark Does Under the Hood</h4>
<p>Catalyst can sometimes reorder simple filters automatically, but once you use UDFs, nested logic, or joins, it often can’t. That’s why the best habit is to write transformations in a way that already makes sense to the optimizer. Filter early, avoid redundant projections, and keep plans as shallow as possible.</p>
<p>Optimizing Spark isn’t about tuning cluster configs – it’s about writing code that yields efficient plans. If your plan shows late filters, too many projections, or multiple Exchange nodes, it’s already explaining why your job will run slowly.</p>
<h2 id="heading-chapter-2-understanding-the-spark-execution-flow">Chapter 2: Understanding the Spark Execution Flow</h2>
<p>In Chapter 1, you learned how Spark interprets your transformations into logical plans – blueprints of what the job intends to do.</p>
<p>But Spark doesn't stop there. It must translate those plans into distributed actions across a cluster of executors, coordinate data movement, and handle any failures that may occur.</p>
<p>This chapter reveals what happens when that plan leaves the driver: how Spark breaks your job into stages, tasks, and a directed acyclic graph (DAG) that actually runs.</p>
<p>By the end, you’ll understand why some operations shuffle terabytes while others fly, and how to predict it before execution begins.</p>
<h3 id="heading-from-plans-to-stages-to-tasks">From Plans to Stages to Tasks</h3>
<p>A Spark job evolves through three conceptual layers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Layer</strong></td><td><strong>What It Represents</strong></td><td><strong>Example View</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Plan</td><td>The optimized logical + physical representation of your query</td><td>Read → Filter → Join → Aggregate</td></tr>
<tr>
<td>Stage</td><td>A contiguous set of operations that can run without shuffling data</td><td>“Map Stage” or “Reduce Stage”</td></tr>
<tr>
<td>Task</td><td>The smallest unit of work, one per partition per stage</td><td>“Process Partition 7 of Stage 3”</td></tr>
</tbody>
</table>
</div><h4 id="heading-the-execution-trigger-actions-vs-transformations">The Execution Trigger: Actions vs Transformations</h4>
<p>Here's the critical distinction that determines when execution actually begins:</p>
<pre><code class="lang-python">df1 = spark.read.parquet(<span class="hljs-string">"data.parquet"</span>)
df2 = df1.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">25</span>)
df3 = df2.groupBy(<span class="hljs-string">"city"</span>).count()
</code></pre>
<p>Nothing executes yet! Spark just builds up the logical plan, adding each transformation as a node in the plan tree. No data is read, no filters run, no shuffles happen.</p>
<h4 id="heading-actions-trigger-execution">Actions Trigger Execution</h4>
<p>Spark transformations are lazy. When a sequence of DataFrame operations is defined, a logical plan is created, but no computation takes place. It’s only when Spark encounters an action, an operation that needs a result to be returned to the driver or written out, that execution takes place.</p>
<p>For example:</p>
<pre><code class="lang-python">result = df3.collect()
</code></pre>
<p>At this point, Spark finalizes the logical plan, applies optimizations, creates a physical plan, and executes the job. Until Spark is asked to <strong>act</strong> through an operation such as collect(), count(), or a write, it’s just describing what it needs to do – but it’s not actually doing it.</p>
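<p>The lazy behavior described here can be mimicked with a few lines of plain Python. The <code>LazyFrame</code> class below is a hypothetical toy, not how Spark is implemented: it only records each transformation as a step and runs nothing until <code>collect()</code> is called.</p>

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: return a new frame with one more recorded step.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *fields):
        # Transformation: also just recorded, nothing is computed yet.
        return LazyFrame(self._rows, self._steps + (("select", fields),))

    def plan(self):
        return [name for name, _ in self._steps]

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        rows = self._rows
        for name, arg in self._steps:
            if name == "filter":
                rows = [row for row in rows if arg(row)]
            elif name == "select":
                rows = [{field: row[field] for field in arg} for row in rows]
        return list(rows)


calls = []  # records every time the predicate actually runs


def older_than_25(row):
    calls.append(row["name"])
    return row["age"] > 25


people = [
    {"name": "John", "age": 28, "city": "Austin"},
    {"name": "Alice", "age": 25, "city": "London"},
]

lazy = LazyFrame(people).filter(older_than_25).select("name", "city")
calls_before_action = len(calls)  # the predicate has not run yet

result = lazy.collect()
print("Plan:", lazy.plan())
print("Predicate calls before collect():", calls_before_action)
print("Predicate calls after collect():", len(calls))
print("Result:", result)
```

<p>The real engine does far more between the recorded plan and execution, including the optimization steps from Chapter 1, but the shape is the same: transformations describe work, and actions trigger it.</p>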
<h4 id="heading-the-complete-execution-flow">The Complete Execution Flow</h4>
<p>Execution starts when an action such as collect() is called. The driver, through the SparkContext, hands the job to the DAG Scheduler, which analyzes the physical plan to find the shuffle boundaries created by wide operations such as <em>groupBy</em> or <em>orderBy</em>.</p>
<p>The plan is then divided into stages, each containing only narrow operations. Each stage is submitted to the Task Scheduler as a TaskSet with one task per partition.</p>
<p>The Task Scheduler assigns tasks to executor cores based on data locality, and the executors run them. A stage starts only after every stage it depends on has finished. When the final stage completes, its results are returned to the driver or written to storage.</p>
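<p>To make the stage-splitting step concrete, here is a toy model in plain Python. It is only a sketch: WIDE_OPS and split_into_stages are made-up names for illustration, not Spark APIs, and it ignores the fact that the map side of a wide operation (partial aggregation, shuffle write) actually runs in the earlier stage.</p>
<pre><code class="lang-python"># Toy model of how a linear plan is cut into stages at shuffle boundaries.
# WIDE_OPS and split_into_stages are illustrative names, not Spark APIs.
WIDE_OPS = {"groupBy", "orderBy", "join", "distinct", "repartition"}


def split_into_stages(operations):
    """Group operations into stages; each wide operation starts a new stage."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: a new stage begins here
        stages[-1].append(op)
    return stages


plan = ["read", "filter", "groupBy", "select", "orderBy"]
stages = split_into_stages(plan)
for number, stage in enumerate(stages):
    print(f"Stage {number}: {stage}")
</code></pre>
<p>Two wide operations in the list produce three stages, which is the same "shuffles + 1" relationship you can expect to see for a simple linear job in the Spark UI.</p>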
<h4 id="heading-what-triggers-a-shuffle">What Triggers a Shuffle</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769457412199/308bc894-66a9-4c01-aae1-9ae42e64d32c.png" alt="Comparison of Spark shuffle behavior before and after groupBy" class="image--center mx-auto" width="1920" height="992" loading="lazy"></p>
<p>A shuffle occurs when Spark needs to redistribute data across partitions, typically because the operation requires grouping, joining, or repartitioning data in a way that can’t be done locally within existing partitions.</p>
<p>Common shuffle triggers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Why it Shuffles</strong></td></tr>
</thead>
<tbody>
<tr>
<td>groupBy(), reduceByKey()</td><td>Data with the same key must co-locate for aggregation</td></tr>
<tr>
<td>join()</td><td>Matching keys may reside in different partitions</td></tr>
<tr>
<td>orderBy() / sort()</td><td>Requires global ordering across all partitions</td></tr>
<tr>
<td>distinct()</td><td>Needs comparison of all values across partitions</td></tr>
<tr>
<td>repartition(n)</td><td>Explicit redistribution to a new number of partitions</td></tr>
</tbody>
</table>
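</div><pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> functions <span class="hljs-keyword">as</span> F

df.groupBy(<span class="hljs-string">"user_id"</span>) \
  .agg(F.sum(<span class="hljs-string">"amount"</span>))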
</code></pre>
<p>In Stage 1 (Map), each task performs a partial aggregation on its partition and writes a shuffle file to disk. During the shuffle, each executor retrieves these files across the network such that all records with the same hash(user_id) % numPartitions are colocated.</p>
<p>In Stage 2 (Reduce), each task performs the final aggregation on the shuffled data for its partition and produces the result. Because Spark tracks this process as a DAG, a failed task can re-read only the affected shuffle files instead of re-computing the entire DAG.</p>
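<p>The two stages can be simulated in plain Python to see why the shuffle is needed. This is a minimal sketch under stated assumptions: the sample data, the function names, and the partition count of 4 are invented for illustration, and zlib.crc32 is only a deterministic stand-in for Spark's own hash function.</p>
<pre><code class="lang-python">import zlib
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 4  # illustrative value; Spark's default is 200


def partition_for(key, num_partitions):
    # crc32 is a deterministic stand-in for Spark's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_stage(partition):
    """Stage 1: partial sum per key inside one input partition."""
    partial = defaultdict(int)
    for user_id, amount in partition:
        partial[user_id] += amount
    return dict(partial)


def shuffle(partials, num_partitions):
    """Route every (key, partial sum) pair to the partition its key hashes to."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for user_id, subtotal in partial.items():
            buckets[partition_for(user_id, num_partitions)][user_id].append(subtotal)
    return buckets


def reduce_stage(bucket):
    """Stage 2: merge the partial sums that arrived for each key."""
    return {user_id: sum(subtotals) for user_id, subtotals in bucket.items()}


# Hypothetical input: three partitions of (user_id, amount) rows.
input_partitions = [
    [("u1", 10), ("u2", 5), ("u1", 7)],
    [("u2", 3), ("u3", 8)],
    [("u1", 1), ("u3", 2)],
]

partials = [map_stage(p) for p in input_partitions]
buckets = shuffle(partials, NUM_SHUFFLE_PARTITIONS)
totals = {}
for bucket in buckets:
    totals.update(reduce_stage(bucket))

print(totals)
</code></pre>
<p>Because every key lands in exactly one bucket after the shuffle, each reduce task can finish its keys without talking to any other task. The partial sums in the map stage are also what keep the shuffled data small.</p>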
<p>As a rough rule of thumb, a simple job like this should run in a handful of stages (around 2-6). Seeing 20+ stages for such simple logic usually means unnecessary shuffles or bad partitioning.</p>
<h4 id="heading-why-shuffles-create-stage-boundaries">Why Shuffles Create Stage Boundaries</h4>
<p>Shuffles force data to move across the network between executors. Spark cannot continue processing until:</p>
<ul>
<li><p>All tasks in the current stage write their shuffle output to disk</p>
</li>
<li><p>The shuffle data is available for the next stage to read over the network</p>
</li>
</ul>
<p>This dependency creates a natural boundary – so a new stage begins after every shuffle. The DAG Scheduler uses these boundaries to determine where stages must wait for previous stages to complete.</p>
<h4 id="heading-common-performance-bottlenecks">Common Performance Bottlenecks</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Bottleneck Type</strong></td><td><strong>Symptom</strong></td><td><strong>Solution</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Data skew</td><td>Few tasks run much longer</td><td>Use salting, split hot keys, or AQE skew join</td></tr>
<tr>
<td>Small files</td><td>Too many tasks, high overhead</td><td>Coalesce or repartition after read</td></tr>
<tr>
<td>Large shuffle</td><td>High network I/O, spill to disk</td><td>Filter early, broadcast small tables, reduce cardinality</td></tr>
<tr>
<td>Unnecessary stages</td><td>Extra Exchange nodes in plan</td><td>Combine operations, remove redundant repartitions</td></tr>
<tr>
<td>Inefficient file formats</td><td>Slow reads, no predicate pushdown</td><td>Use Parquet or ORC with partitioning</td></tr>
<tr>
<td>Complex data types</td><td>Serialization overhead, large objects</td><td>Use simple types, cache in serialized form</td></tr>
</tbody>
</table>
</div><p>Let’s ground this with a small but realistic pattern using the same employees DataFrame. <strong>Goal:</strong> average salary per department and country, only for employees older than 30.</p>
<p>Naïve approach:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, when, avg

df_dept_country = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result = (
    df.withColumn(
        <span class="hljs-string">"age_group"</span>,
        when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>, <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>)
    )
    .join(df_dept_country, [<span class="hljs-string">"department"</span>], <span class="hljs-string">"inner"</span>)
    .groupBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
    .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>This looks harmless, but:</p>
<ul>
<li><p>The join on "department" introduces a wide dependency → shuffle #1.</p>
</li>
<li><p>The groupBy("department", "country") introduces another wide dependency → shuffle #2.</p>
</li>
</ul>
<p>So we have two shuffles for what should be a simple aggregation. If you run df_result.explain(), you’ll see two Exchange nodes, each marking a shuffle and stage boundary.</p>
<h4 id="heading-optimized-approach">Optimized Approach</h4>
<p>We can do better by filtering early, broadcasting the small dimension (df_dept_country), and keeping only one global shuffle for aggregation.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> avg, broadcast, col

df_dept_country = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result_optimized = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">30</span>)
        .join(broadcast(df_dept_country), [<span class="hljs-string">"department"</span>], <span class="hljs-string">"inner"</span>)
        .groupBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
        .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>What changed:</p>
<ul>
<li><p>filter(col("age") &gt; 30) is narrow and runs before any shuffle.</p>
</li>
<li><p>broadcast(df_dept_country) avoids a shuffle for the join.</p>
</li>
<li><p>Only the groupBy("department", "country") causes a single shuffle.</p>
</li>
</ul>
<p>Now explain shows just one Exchange.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Version</strong></td><td><strong>Shuffles</strong></td><td><strong>Stages</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Naïve</td><td>2</td><td>~4 (2 map + 2 reduce)</td><td>Join shuffle + groupBy shuffle = double overhead</td></tr>
<tr>
<td>Optimized</td><td>1</td><td>~2 (1 map + 1 reduce)</td><td>Broadcast join avoids shuffle. Only groupBy shuffles</td></tr>
</tbody>
</table>
</div><h2 id="heading-chapter-3-reading-and-debugging-plans-like-a-pro">Chapter 3: Reading and Debugging Plans Like a Pro</h2>
<p>As explained in Chapter 1, Spark executes transformations based on three levels: the logical plan, the optimized logical plan (Catalyst), and the physical plan. This chapter will expand on this explanation and concentrate on the impact of the logical plan on shuffle and execution performance.</p>
<p>By now, you understand how Spark builds and <em>executes</em> plans. But reading those plans and instantly spotting inefficiencies is the real superpower of a performance-focused data engineer.</p>
<p>Spark’s explain() output isn’t random jargon. It’s a precise log of Spark’s thought process. Once you learn to read it, every optimization becomes obvious.</p>
<h3 id="heading-three-layers-in-spark"><strong>Three Layers in Spark</strong></h3>
<p>As we talked about above, every Spark plan has three key views, printed when you call df.explain(True). Let’s review them now:</p>
<ol>
<li><p>Parsed Logical Plan: The raw intent Spark inferred from your code. It may include unresolved column names or expressions.</p>
</li>
<li><p>Analyzed / Optimized Logical Plan: The plan after Spark resolves column references and applies Catalyst optimizations such as constant folding, predicate pushdown, column pruning, and plan rearrangements.</p>
</li>
<li><p>Physical Plan: What your executors actually run: joins, shuffles, exchanges, scans, and code-generated operators.</p>
</li>
</ol>
<p>Each layer narrows the gap between what you <em>asked</em> Spark to do and what Spark actually does.</p>
<pre><code class="lang-python">df_avg = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">30</span>)
    .groupBy(<span class="hljs-string">"department"</span>)
    .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)

df_avg.explain(<span class="hljs-literal">True</span>)
</code></pre>
<p><strong>1. Parsed Logical Plan</strong></p>
<pre><code class="lang-python">== Parsed Logical Plan ==
<span class="hljs-string">'Aggregate ['</span>department], [<span class="hljs-string">'department, '</span>avg(<span class="hljs-string">'salary) AS avg_salary#8]
+- Filter (age#5L &gt; cast(30 as bigint))
   +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false</span>
</code></pre>
<p>How to read this</p>
<ul>
<li><p>Bottom → data source (LogicalRDD).</p>
</li>
<li><p>Middle → Filter: Spark hasn’t yet optimized column references.</p>
</li>
<li><p>Top → Aggregate: high-level grouping intent.</p>
</li>
</ul>
<p>At this stage, the plan may include unresolved symbols (like 'department or 'avg('salary)), meaning Spark hasn’t yet validated column existence or data types.</p>
<p><strong>2. Optimized Logical Plan</strong></p>
<pre><code class="lang-python">
== Optimized Logical Plan ==
Aggregate [department<span class="hljs-comment">#3], [department#3, avg(salary#4L) AS avg_salary#8]</span>
+- Project [department<span class="hljs-comment">#3, salary#4L]</span>
   +- Filter (isnotnull(age<span class="hljs-comment">#5L) AND (age#5L &gt; 30))</span>
      +- LogicalRDD [id<span class="hljs-comment">#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false</span>
</code></pre>
<p>Here, Catalyst has done its job:</p>
<ul>
<li><p>Column IDs (#3, #4L) are resolved.</p>
</li>
<li><p>Unused columns are pruned – no need to carry them forward.</p>
</li>
<li><p>The plan now accurately reflects Spark’s optimized logical intent.</p>
</li>
</ul>
<p>If you ever wonder whether Spark pruned columns or pushed filters, this is the section to check.</p>
<p><strong>3. Physical Plan</strong></p>
<pre><code class="lang-python">== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department<span class="hljs-comment">#3], functions=[avg(salary#4L)], output=[department#3, avg_salary#8])</span>
   +- Exchange hashpartitioning(department<span class="hljs-comment">#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]</span>
      +- HashAggregate(keys=[department<span class="hljs-comment">#3], functions=[partial_avg(salary#4L)], output=[department#3, sum#20, count#21L])</span>
         +- Project [department<span class="hljs-comment">#3, salary#4L]</span>
            +- Filter (isnotnull(age<span class="hljs-comment">#5L) AND (age#5L &gt; 30))</span>
               +- Scan ExistingRDD[id<span class="hljs-comment">#0L,firstname#1,lastname#2,department#3,salary#4L,age#5L,hire_date#6,country#7]</span>
</code></pre>
<p><strong>Breakdown</strong></p>
<ul>
<li><p>Scan ExistingRDD → Spark reading from the in-memory DataFrame.</p>
</li>
<li><p>Filter → narrow transformation, no shuffle.</p>
</li>
<li><p>HashAggregate → partial aggregation per partition.</p>
</li>
<li><p>Exchange → wide dependency: data is shuffled by department.</p>
</li>
<li><p>Top HashAggregate → final aggregation after shuffle.</p>
</li>
</ul>
<p>This structure – partial agg → shuffle → final agg – is Spark’s default two-phase aggregation pattern.</p>
<h4 id="heading-recognizing-common-nodes">Recognizing Common Nodes</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Node / Operator</strong></td><td><strong>Meaning</strong></td><td><strong>Optimization Hint</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Project</td><td>Column selection or computed fields</td><td>Combine multiple withColumn() into one select()</td></tr>
<tr>
<td>Filter</td><td>Predicate on rows</td><td>Push filters as low as possible in the plan</td></tr>
<tr>
<td>Join</td><td>Combine two DataFrames</td><td>Broadcast smaller side if &lt; 10 MB</td></tr>
<tr>
<td>Aggregate</td><td>GroupBy, sum, avg, count</td><td>Filter before aggregating to reduce cardinality</td></tr>
<tr>
<td>Exchange</td><td>Shuffle / data redistribution</td><td>Minimize by filtering early, using broadcast join</td></tr>
<tr>
<td>Sort</td><td>OrderBy, sort</td><td>Avoid global sorts; use within partitions if possible</td></tr>
<tr>
<td>Window</td><td>Windowed analytics (row_number, rank)</td><td>Partition on selective keys to reduce shuffle</td></tr>
</tbody>
</table>
</div><p>Repeated invocations of withColumn stack multiple Project nodes, which increases the plan depth. Instead, combine these invocations using select.</p>
<p>Multiple Exchange nodes imply repeated data shuffles. You can eliminate these by broadcasting the data or filtering.</p>
<p>Multiple scans of the same table within a single job suggest that a reused intermediate result should be cached.</p>
<p>And frequent SortMergeJoin operations imply that Spark is unnecessarily sorting and shuffling the data. You can eliminate these by broadcasting the smaller dataframe or bucketing.</p>
<h4 id="heading-debugging-strategy-read-plans-from-top-to-bottom">Debugging Strategy: Read Plans from Top to Bottom</h4>
<p>Remember: Spark <em>executes</em> plans from bottom up (from data source to final result). But when you're debugging, you read from the top down (from the output schema back to the root cause). This reversal is intentional: you start with what's wrong at the output level, then trace backward through the plan to find where the inefficiency was introduced.</p>
<p>When debugging a slow job:</p>
<ul>
<li><p>Start at the top: Identify output schema and major operators (HashAggregate, Join, and so on).</p>
</li>
<li><p>Scroll for Exchanges: Count them. Each = stage boundary. Ask “Why do I need this shuffle?”</p>
</li>
<li><p>Trace backward: See if filters or projections appear below or above joins.</p>
</li>
<li><p>Look for duplication: Same scan twice? Missing cache? Re-derived columns?</p>
</li>
<li><p>Check join strategy: If it’s SortMergeJoin but one table is small, force a broadcast().</p>
</li>
<li><p>Re-run explain after optimization: You should literally see the extra nodes disappear.</p>
</li>
</ul>
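<p>Counting Exchanges by eye gets tedious on long plans. The helper below is a small sketch that counts them in plan text you have already captured from explain(). The function name count_shuffles is hypothetical, and the sample plan is an abridged copy of the physical plan shown earlier in this chapter.</p>
<pre><code class="lang-python">def count_shuffles(plan_text):
    """Count shuffle Exchange operators in the text of a physical plan."""
    shuffles = 0
    for line in plan_text.splitlines():
        operator = line.lstrip(" +-:|")  # drop the tree-drawing prefix
        # BroadcastExchange and ReusedExchange start with other words,
        # so they are not counted as shuffles here.
        if operator.startswith("Exchange"):
            shuffles += 1
    return shuffles


# Abridged copy of the physical plan from the example above.
physical_plan = """
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)])
   +- Exchange hashpartitioning(department#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)])
         +- Project [department#3, salary#4L]
            +- Filter isnotnull(age#5L)
               +- Scan ExistingRDD[id#0L,department#3,salary#4L,age#5L]
"""

shuffles = count_shuffles(physical_plan)
print(f"{shuffles} shuffle(s), so roughly {shuffles + 1} stage(s)")
</code></pre>
<p>Run it on the plan text before and after a change. If the count doesn't drop, the optimization didn't remove the shuffle you were targeting.</p>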
<h4 id="heading-catalyst-optimizer-in-action">Catalyst Optimizer in Action</h4>
<p>Catalyst applies dozens of rules automatically. Knowing a few helps you interpret what changed:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Optimization Rule</strong></td><td><strong>Example Transformation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Predicate Pushdown</td><td>Moves filters below joins/scans</td></tr>
<tr>
<td>Constant Folding</td><td>Evaluates constant expressions once at planning time, for example 60 * 60 becomes 3600</td></tr>
<tr>
<td>Column Pruning</td><td>Drops unused columns early</td></tr>
<tr>
<td>Combine Filters</td><td>Merges consecutive filters into one</td></tr>
<tr>
<td>Simplify Casts</td><td>Removes redundant type casts</td></tr>
<tr>
<td>Join Reordering</td><td>Changes join order for a cheaper plan</td></tr>
</tbody>
</table>
</div><p>Putting it all together, every plan tells a story:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769458525411/64fa30a4-b16e-4aed-8c04-d12b476d9ae6.png" alt="Spark Plans and Stages" class="image--center mx-auto" width="1920" height="404" loading="lazy"></p>
<p>As you progress through the practical scenarios in Chapter 4, read every plan before and after. Your goal isn't memorization – it's intuition.</p>
<h2 id="heading-chapter-4-writing-efficient-transformations">Chapter 4: Writing Efficient Transformations</h2>
<p>Every Spark job tells a story, not in code, but in plans. By now, you've seen how Spark interprets transformations (Chapter 1), how it executes them through stages and tasks (Chapter 2), and how to read plans like a detective (Chapter 3). Now comes the part where you apply that knowledge: writing transformations that yield efficient logical plans.</p>
<p>This chapter is the heart of the handbook. It's where we move from understanding Spark's mind to writing code that speaks its language fluently.</p>
<h3 id="heading-why-transformations-matter">Why Transformations Matter</h3>
<p>In PySpark, most performance issues don’t start in clusters or configurations. They start in transformations: the way we chain, filter, rename, or join data. Every transformation reshapes the logical plan, influencing how Spark optimizes, when it shuffles, and whether the final DAG is streamlined or tangled.</p>
<p>A good transformation sequence:</p>
<ul>
<li><p>Keeps plans shallow, not nested.</p>
</li>
<li><p>Applies filters early, not after computation.</p>
</li>
<li><p>Reduces data movement, not just data size.</p>
</li>
<li><p>Lets Catalyst and AQE optimize freely, without user-induced constraints.</p>
</li>
</ul>
<p>A bad one can double runtime, and you won't see it in your code, only in your plan.</p>
<h3 id="heading-the-goal-of-this-chapter">The Goal of this Chapter</h3>
<p>We’ll explore a series of real-world optimization scenarios, drawn from production ETL and analytical pipelines, each showing how a small change in code can completely reshape the logical plan and execution behavior.</p>
<p>Each scenario is practical and short, following a consistent structure. By the end of this chapter, you’ll be able to <em>see</em> optimization opportunities the moment you write code, because you’ll know exactly how they alter the logical plan beneath.</p>
<h3 id="heading-before-you-dive-in">Before You Dive In:</h3>
<p>Open a Spark shell or notebook. Load your familiar employees DataFrame. Run every example, and compare the explain("formatted") output before and after the fix. Because in this chapter, performance isn’t about more theory, it’s about seeing the difference in the plan and feeling the difference in execution time.</p>
<h3 id="heading-scenario-1-rename-in-one-pass-withcolumnrenamed-vs-todf">Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()</h3>
<p>If you’ve worked with PySpark DataFrames, you’ve probably had to rename columns, either by calling withColumnRenamed() repeatedly or by using toDF() in one shot.</p>
<p>At first glance, both approaches produce identical results: the columns have the new names you wanted. But beneath the surface, Spark treats them very differently – and that difference shows up directly in your logical plan.</p>
<pre><code class="lang-python">df_renamed = (df.withColumnRenamed(<span class="hljs-string">"id"</span>, <span class="hljs-string">"emp_id"</span>)
    .withColumnRenamed(<span class="hljs-string">"firstname"</span>, <span class="hljs-string">"first_name"</span>)
    .withColumnRenamed(<span class="hljs-string">"lastname"</span>, <span class="hljs-string">"last_name"</span>)
    .withColumnRenamed(<span class="hljs-string">"department"</span>, <span class="hljs-string">"dept"</span>)
    .withColumnRenamed(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"base_salary"</span>)
    .withColumnRenamed(<span class="hljs-string">"age"</span>, <span class="hljs-string">"age_years"</span>)
    .withColumnRenamed(<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"hired_on"</span>)
    .withColumnRenamed(<span class="hljs-string">"country"</span>, <span class="hljs-string">"country_code"</span>)
)
</code></pre>
<p>This is simple and readable. But Spark builds the plan step by step, adding one Project node for every rename. Each Project node copies all existing columns, plus the newly renamed one. In large schemas (hundreds of columns), this silently bloats the plan.</p>
<h4 id="heading-logical-plan-impact">Logical Plan Impact:</h4>
<pre><code class="lang-python">Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hire_date, country]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age, hire_date, country]

└─ Project [emp_id, first_name, last_name, dept, salary, age, hire_date, country]

└─ Project [emp_id, first_name, last_name, department, salary, age, hire_date, country]

└─ Project [emp_id, first_name, lastname, department, salary, age, hire_date, country]

└─ Project [emp_id, firstname, lastname, department, salary, age, hire_date, country]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Each rename adds a new Project layer, deepening the plan. Spark has to resolve each intermediate projection before applying the next one. You can see this by running <em>df_renamed.explain(True)</em>.</p>
<h4 id="heading-the-better-approach-rename-once-with-todf">The Better Approach: Rename Once with toDF()</h4>
<p>Instead of chaining multiple renames, rename all columns in a single pass:</p>
<pre><code class="lang-python">new_cols = [<span class="hljs-string">"id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"department"</span>,
            <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hired_on"</span>, <span class="hljs-string">"country"</span>]

df_renamed = df.toDF(*new_cols)
</code></pre>
<h4 id="heading-logical-plan-impact-1">Logical Plan Impact:</h4>
<pre><code class="lang-python">Project [id, first_name, last_name, department, salary, age, hired_on, country]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now there’s just one Project node, which means one projection over the source data. This gives us a flatter, more efficient plan.</p>
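<p>One practical catch is that toDF() expects a name for every column, in order, even when you only want to rename a few. A small helper can build that list from a mapping. This is a sketch: renamed_columns is a hypothetical helper, not part of PySpark, and the column list below stands in for df.columns.</p>
<pre><code class="lang-python">def renamed_columns(existing, mapping):
    """Return the full column list with mapped names swapped in, order preserved."""
    unknown = set(mapping) - set(existing)
    if unknown:
        raise KeyError(f"columns not in DataFrame: {sorted(unknown)}")
    return [mapping.get(name, name) for name in existing]


# Stand-in for df.columns on the employees DataFrame.
columns = ["id", "firstname", "lastname", "department",
           "salary", "age", "hire_date", "country"]
renames = {"firstname": "first_name", "lastname": "last_name", "hire_date": "hired_on"}

new_cols = renamed_columns(columns, renames)
print(new_cols)
# With a real DataFrame: df_renamed = df.toDF(*renamed_columns(df.columns, renames))
</code></pre>
<p>Failing loudly on an unknown column is a deliberate choice here. A chain of withColumnRenamed() calls silently ignores names that don't exist, which can hide typos.</p>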
<h4 id="heading-under-the-hood-what-spark-actually-does">Under the Hood: What Spark Actually Does</h4>
<p>Every time you call withColumnRenamed(), Spark rewrites the entire projection list. Catalyst treats the rename as a full column re-selection from the previous node, not as a light-weight alias update. When you chain several renames, Catalyst duplicates internal column metadata for each intermediate step.</p>
<p>By contrast, toDF() rebases the schema in a single action. Catalyst interprets it as a single schema rebinding, so no redundant metadata trees are created.</p>
<h4 id="heading-real-world-timing-glue-job-benchmark">Real-World Timing: Glue Job Benchmark</h4>
<p>To see if chained withColumnRenamed calls add real overhead, here's a simple timing test performed on a Glue job using a DataFrame with 1M rows. First using withColumnRenamed:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

multiplied_data = [(i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],  <span class="hljs-comment"># department</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],  <span class="hljs-comment"># salary</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],  <span class="hljs-comment"># age</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],  <span class="hljs-comment"># hire_date</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>])  <span class="hljs-comment"># country</span>
                   <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)]

df = spark.createDataFrame(multiplied_data,
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df1 = (df
       .withColumnRenamed(<span class="hljs-string">"firstname"</span>, <span class="hljs-string">"first_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"lastname"</span>, <span class="hljs-string">"last_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"department"</span>, <span class="hljs-string">"dept_name"</span>)
       .withColumnRenamed(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"annual_salary"</span>)
       .withColumnRenamed(<span class="hljs-string">"age"</span>, <span class="hljs-string">"emp_age"</span>)
       .withColumnRenamed(<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"hired_on"</span>)
       .withColumnRenamed(<span class="hljs-string">"country"</span>, <span class="hljs-string">"work_country"</span>))

print(<span class="hljs-string">"withColumnRenamed Count:"</span>, df1.count())
print(<span class="hljs-string">"withColumnRenamed time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)
</code></pre>
<p>Using toDF:</p>
<pre><code class="lang-python">start = time.time()
df2 = df.toDF(<span class="hljs-string">"id"</span>, <span class="hljs-string">"first_name"</span>, <span class="hljs-string">"last_name"</span>, <span class="hljs-string">"dept_name"</span>, <span class="hljs-string">"annual_salary"</span>, <span class="hljs-string">"emp_age"</span>, <span class="hljs-string">"hired_on"</span>, <span class="hljs-string">"work_country"</span>)
print(<span class="hljs-string">"toDF Count:"</span>, df2.count())
print(<span class="hljs-string">"toDF time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Number of Project Nodes</strong></td><td><strong>Glue Execution Time (1M rows)</strong></td><td><strong>Plan Complexity</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Chained withColumnRenamed()</td><td>8 nodes</td><td>~12 seconds</td><td>Deep, nested</td></tr>
<tr>
<td>Single toDF()</td><td>1 node</td><td>~8 seconds</td><td>Flat, simple</td></tr>
</tbody>
</table>
</div><p>The difference becomes important at larger sizes or in complex pipelines. On managed runtimes such as AWS Glue, where planning overhead is noticeable, or when tens of millions of rows are involved, each additional Project adds column resolution, metadata work, and DAG height. Renaming all columns in one go with toDF() gives Spark a flatter plan to start from: one rename, one projection, and, in this benchmark, faster execution.</p>
<h3 id="heading-scenario-2-reusing-expressions">Scenario 2: Reusing Expressions</h3>
<p>Sometimes Spark jobs run slower, not because of shuffles or joins, but because the same computation is performed repeatedly within the logical plan. Every time you repeat an expression, say, col("salary") * 0.1 in multiple places, Spark treats it as a <em>new</em> derived column, expanding the logical plan and forcing redundant work.</p>
<h4 id="heading-the-problem-repeated-expressions">The Problem: Repeated Expressions</h4>
<p>Let’s say we’re calculating bonus and total compensation for employees:</p>
<pre><code class="lang-python">df_expr = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>)
      .withColumn(<span class="hljs-string">"total_comp"</span>, col(<span class="hljs-string">"salary"</span>) + (col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>))
)
</code></pre>
<p>At first glance, it’s simple enough. But Spark’s optimizer doesn’t automatically know that the (col("salary") * 0.10) in the second column is identical to the one computed in the first. Both get evaluated separately in the logical plan.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department,
         salary, age, hire_date, country,
         (salary * <span class="hljs-number">0.10</span>) AS bonus,
         (salary + (salary * <span class="hljs-number">0.10</span>)) AS total_comp]
└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>While this looks compact, Spark must compute (salary * 0.10) twice: once for bonus and again inside total_comp. For a large dataset (say, 100M rows), that’s two full column evaluations. The waste compounds when the expression is complex: imagine parsing JSON, applying UDFs, or running date arithmetic multiple times.</p>
<h4 id="heading-the-better-approach-compute-once-reuse-everywhere">The Better Approach: Compute Once, Reuse Everywhere</h4>
<p>Compute the expression once, store it as a column, and reference it later:</p>
<pre><code class="lang-python">df_expr = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.10</span>)
      .withColumn(<span class="hljs-string">"total_comp"</span>, col(<span class="hljs-string">"salary"</span>) + col(<span class="hljs-string">"bonus"</span>))
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department,
         salary, age, hire_date, country,
         (salary * <span class="hljs-number">0.10</span>) AS bonus,
         (salary + bonus) AS total_comp]
└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now Spark calculates (salary * 0.10) once, stores it in the bonus column, and reuses that column when computing total_comp. This single change cuts CPU cost and memory usage.</p>
<h4 id="heading-under-the-hood-why-repetition-hurts">Under the Hood: Why Repetition Hurts</h4>
<p>Spark’s Catalyst optimizer doesn’t automatically factor out repeated expressions across different columns. Each withColumn() creates a new Project node with its own expression tree. If multiple nodes reuse the same arithmetic or function, Catalyst re-evaluates them independently.</p>
<p>On small DataFrames, this cost is invisible. On wide, computation-heavy jobs (think feature engineering pipelines), it can add hundreds of milliseconds per task.</p>
<p>Each redundant expression increases:</p>
<ul>
<li><p>Catalyst’s internal expression resolution time</p>
</li>
<li><p>The size of generated Java code in WholeStageCodegen</p>
</li>
<li><p>CPU cycles per row, since Spark cannot share intermediate results between columns in the same node</p>
</li>
</ul>
<h4 id="heading-real-world-benchmark-aws-glue">Real-World Benchmark: AWS Glue</h4>
<p>We tested this pattern on AWS Glue (Spark 3.3) with 10 million rows and a simulated expensive computation, using a dataset similar to the one from Scenario 1.</p>
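<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, exp, log, sqrt

df = spark.createDataFrame(multiplied_data,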
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

expr = sqrt(exp(log(col(<span class="hljs-string">"salary"</span>) + <span class="hljs-number">1</span>)))

start = time.time()

df_repeated = (
    df.withColumn(<span class="hljs-string">"metric_a"</span>, expr)
      .withColumn(<span class="hljs-string">"metric_b"</span>, expr * <span class="hljs-number">2</span>)
      .withColumn(<span class="hljs-string">"metric_c"</span>, expr / <span class="hljs-number">10</span>)
)

df_repeated.count()
time_repeated = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()

df_reused = (
    df.withColumn(<span class="hljs-string">"metric"</span>, expr)
      .withColumn(<span class="hljs-string">"metric_a"</span>, col(<span class="hljs-string">"metric"</span>))
      .withColumn(<span class="hljs-string">"metric_b"</span>, col(<span class="hljs-string">"metric"</span>) * <span class="hljs-number">2</span>)
      .withColumn(<span class="hljs-string">"metric_c"</span>, col(<span class="hljs-string">"metric"</span>) / <span class="hljs-number">10</span>)
)

df_reused.count()

print(<span class="hljs-string">"Repeated expr time:"</span>, time_repeated, <span class="hljs-string">"seconds"</span>)
print(<span class="hljs-string">"Reused expr time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Project Nodes</strong></td><td><strong>Execution Time (10M rows)</strong></td><td><strong>Expression Evaluations</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Repeated expression</td><td>Multiple (nested)</td><td>~18 seconds</td><td>3x per row</td></tr>
<tr>
<td>Compute once, reuse</td><td>Single</td><td>~11 seconds</td><td>1x per row</td></tr>
</tbody>
</table>
</div><p>The performance gap widens further with genuinely expensive expressions (like regex extraction, JSON parsing, or UDFs).</p>
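<p>Single wall-clock timings like the ones above can be sensitive to JVM warm-up and to which variant happens to run first. A minimal sketch of a steadier comparison follows, assuming each variant can be wrapped in a zero-argument callable. The helper name median_runtime and its runs parameter are illustrative, and the workload here is a plain-Python stand-in so the sketch runs without a Spark session:</p>

```python
import statistics
import time


def median_runtime(action, runs=3):
    """Run `action` several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # with Spark this could be: lambda: df_reused.count()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload (an assumption for illustration, not a Spark job).
calls = []
elapsed = median_runtime(lambda: calls.append(sum(range(10_000))), runs=5)
print(f"median: {elapsed:.6f}s over {len(calls)} runs")
```

<p>Taking the median of a few runs per variant makes it less likely that a one-off warm-up cost is mistaken for a difference between the two plans.</p>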
<h4 id="heading-physical-plan-implication">Physical Plan Implication</h4>
<p>In the physical plan, repeated expressions expand into multiple Java blocks within the same WholeStageCodegen node:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Project [sqrt(exp(log(salary + <span class="hljs-number">1</span>))) AS metric_a,
              (sqrt(exp(log(salary + <span class="hljs-number">1</span>))) * <span class="hljs-number">2</span>) AS metric_b,
              (sqrt(exp(log(salary + <span class="hljs-number">1</span>))) / <span class="hljs-number">10</span>) AS metric_c, ...]
</code></pre>
<p>Spark literally embeds three copies of the same logic.</p>
<p>Each is JIT-compiled separately, leading to:</p>
<ul>
<li><p>Larger generated Java classes</p>
</li>
<li><p>Higher CPU utilization</p>
</li>
<li><p>Longer code-generation time before tasks even start</p>
</li>
</ul>
<p>When reusing a column, Spark generates one expression and references it by name, dramatically shrinking the codegen footprint. If you have complex transformations (nested when, UDFs, regex extractions, and so on), compute them once and reuse them with col("alias").</p>
<p>For even heavier expressions that appear across multiple pipelines, consider persisting the intermediate DataFrame:</p>
<pre><code class="lang-python">df_features = df.withColumn(<span class="hljs-string">"complex_feature"</span>, complex_logic)

df_features.cache()
</code></pre>
<p>That cache can save multiple recomputations across downstream steps.</p>
<h3 id="heading-scenario-3-batch-column-ops">Scenario 3: Batch Column Ops</h3>
<p>Most PySpark pipelines don’t die because of one big, obvious mistake. They slow down from a thousand tiny cuts: one extra withColumn() here, another there, until the logical plan turns into a tall stack of projections.</p>
<p>On its own, withColumn() is fine. The problem is how we use it:</p>
<ul>
<li><p>10–30 chained calls in a row</p>
</li>
<li><p>Re-deriving similar expressions</p>
</li>
<li><p>Spreading logic across many tiny steps</p>
</li>
</ul>
<p>This scenario shows how batching column operations into a single select() produces a flatter, cleaner logical plan that scales better and is easier to reason about.</p>
<h4 id="heading-the-problem-chaining-withcolumn-forever">The Problem: Chaining withColumn() Forever</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper

df_transformed = (
    df.withColumn(<span class="hljs-string">"full_name"</span>, concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)))
      .withColumn(<span class="hljs-string">"is_senior"</span>, when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, lit(<span class="hljs-number">1</span>)).otherwise(lit(<span class="hljs-number">0</span>)))
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>)
      .withColumn(<span class="hljs-string">"experience_band"</span>,
                  when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
                  .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
                  .otherwise(<span class="hljs-string">"senior"</span>))
      .withColumn(<span class="hljs-string">"country_upper"</span>, upper(col(<span class="hljs-string">"country"</span>)))
)
</code></pre>
<p>It reads nicely, it runs, and everyone moves on. But under the hood, Spark builds this as multiple Project nodes, one per withColumn() call.</p>
<p><strong>Simplified Logical Plan (Chained, Conceptual):</strong></p>
<pre><code class="lang-python">Project [..., country_upper]
└─ Project [..., experience_band]
   └─ Project [..., salary_k]
      └─ Project [..., is_senior]
         └─ Project [..., full_name]
            └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Each layer re-selects all existing columns, adds one more derived column, and deepens the plan.</p>
<h4 id="heading-the-better-approach-batch-with-select">The Better Approach: Batch with select()</h4>
<p>Instead of incrementally patching the schema, build it once.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper

df_transformed = df.select(
    col(<span class="hljs-string">"id"</span>),
    col(<span class="hljs-string">"firstname"</span>),
    col(<span class="hljs-string">"lastname"</span>),
    col(<span class="hljs-string">"department"</span>),
    col(<span class="hljs-string">"salary"</span>),
    col(<span class="hljs-string">"age"</span>),
    col(<span class="hljs-string">"hire_date"</span>),
    col(<span class="hljs-string">"country"</span>),
    concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)).alias(<span class="hljs-string">"full_name"</span>),
    when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, lit(<span class="hljs-number">1</span>)).otherwise(lit(<span class="hljs-number">0</span>)).alias(<span class="hljs-string">"is_senior"</span>),
    (col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>).alias(<span class="hljs-string">"salary_k"</span>),
    when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>).alias(<span class="hljs-string">"experience_band"</span>),
    upper(col(<span class="hljs-string">"country"</span>)).alias(<span class="hljs-string">"country_upper"</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan (Batched):</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department, salary, age, hire_date, country,
         full_name, is_senior, salary_k, experience_band, country_upper]
└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>One Project. All derived columns <em>are</em> defined together. Flatter DAG. Cleaner plan.</p>
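<p>A quick way to compare the depth of two plans is to count the Project operators in their text. The sketch below does this for the two simplified plans shown above. The helper name count_project_nodes is hypothetical, and real explain() output carries more detail than these conceptual plans:</p>

```python
import re


def count_project_nodes(plan_text):
    """Count Project operators in a textual query plan."""
    # Word boundaries keep the match to the standalone word "Project".
    return len(re.findall(r"\bProject\b", plan_text))


chained_plan = """
Project [..., country_upper]
└─ Project [..., experience_band]
   └─ Project [..., salary_k]
      └─ Project [..., is_senior]
         └─ Project [..., full_name]
            └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
"""

batched_plan = """
Project [id, firstname, lastname, department, salary, age, hire_date, country,
         full_name, is_senior, salary_k, experience_band, country_upper]
└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
"""

print(count_project_nodes(chained_plan))  # 5
print(count_project_nodes(batched_plan))  # 1
```

<p>On a real DataFrame, explain() prints the plan instead of returning it, so the text would first need to be captured, for example by redirecting standard output. The optimized plan may also show fewer Project nodes than the conceptual one whenever Catalyst manages to collapse adjacent projections.</p>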
<h4 id="heading-under-the-hood-why-this-matters">Under the Hood: Why This Matters</h4>
<p>Each withColumn() is syntactic sugar for: “Take the previous plan, and create a new Project on top of it.” So 10 withColumn() calls = 10 projections wrapped on top of each other.</p>
<p>Catalyst can sometimes collapse adjacent Project nodes, but:</p>
<ul>
<li><p>Not always (especially when aliases shadow each other).</p>
</li>
<li><p>Not when expressions become complex or interdependent.</p>
</li>
<li><p>Not when UDFs or analysis barriers appear.</p>
</li>
</ul>
<p>Batching with select():</p>
<ul>
<li><p>Gives Catalyst a single, complete view of all expressions.</p>
</li>
<li><p>Enables more aggressive optimizations (constant folding, expression reuse, pruning).</p>
</li>
<li><p>Keeps expression trees shallower and codegen output smaller.</p>
</li>
</ul>
<p>Think of it as the difference between editing a sentence 10 times in a row and writing the final sentence once, cleanly.</p>
<h4 id="heading-real-world-example-using-the-employees-df-at-scale">Real-World Example: Using the Employees DF at Scale</h4>
<p>Chained version (many withColumn()):</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, concat_ws, when, lit, upper
<span class="hljs-keyword">import</span> time

start = time.time()
df_chain = (
    df.withColumn(<span class="hljs-string">"full_name"</span>, concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)))
      .withColumn(<span class="hljs-string">"is_senior"</span>, when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>))
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>)
      .withColumn(<span class="hljs-string">"high_earner"</span>, when(col(<span class="hljs-string">"salary"</span>) &gt;= <span class="hljs-number">90000</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>))
      .withColumn(<span class="hljs-string">"experience_band"</span>,
                  when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
                  .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
                  .otherwise(<span class="hljs-string">"senior"</span>))
      .withColumn(<span class="hljs-string">"country_upper"</span>, upper(col(<span class="hljs-string">"country"</span>)))
)

df_chain.count()
time_chain = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<p>Batched version (single select()):</p>
<pre><code class="lang-python">start = time.time()
df_batch = df.select(
    <span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>,
    concat_ws(<span class="hljs-string">" "</span>, col(<span class="hljs-string">"firstname"</span>), col(<span class="hljs-string">"lastname"</span>)).alias(<span class="hljs-string">"full_name"</span>),
    when(col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">35</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>).alias(<span class="hljs-string">"is_senior"</span>),
    (col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000.0</span>).alias(<span class="hljs-string">"salary_k"</span>),
    when(col(<span class="hljs-string">"salary"</span>) &gt;= <span class="hljs-number">90000</span>, <span class="hljs-number">1</span>).otherwise(<span class="hljs-number">0</span>).alias(<span class="hljs-string">"high_earner"</span>),
    when(col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">30</span>, <span class="hljs-string">"junior"</span>)
        .when((col(<span class="hljs-string">"age"</span>) &gt;= <span class="hljs-number">30</span>) &amp; (col(<span class="hljs-string">"age"</span>) &lt; <span class="hljs-number">40</span>), <span class="hljs-string">"mid"</span>)
        .otherwise(<span class="hljs-string">"senior"</span>).alias(<span class="hljs-string">"experience_band"</span>),
    upper(col(<span class="hljs-string">"country"</span>)).alias(<span class="hljs-string">"country_upper"</span>)
)

df_batch.count()
time_batch = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Logical Shape</strong></td><td><strong>Glue Execution Time (1M rows)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Chained withColumn()</td><td>6 nested Projects</td><td>~14 seconds</td><td>Deep plan, more Catalyst work</td></tr>
<tr>
<td>Single select()</td><td>1 Project</td><td>~9 seconds</td><td>Flat planning, cleaner DAG</td></tr>
</tbody>
</table>
</div><p>The distinction is most evident when there are more derived columns, more complex expressions (UDFs, window functions), or when executing on managed runtimes such as AWS Glue.</p>
<p>In the chained cases, there are more Project nodes, code generation is fragmented, and expression evaluation is less amenable to global optimization.</p>
<p>In the batched cases, Spark generates a single Project node, more work is consolidated into a single WholeStageCodegen pipeline, code generation is reduced, the JVM is less stressed, and the plan is flatter and more amenable to optimization. This is not only cleaner, but it’s also faster, more reliable, and friendlier to Spark’s optimizer.</p>
<h3 id="heading-scenario-4-early-filter-vs-late-filter">Scenario 4: Early Filter vs Late Filter</h3>
<p>Many pipelines apply transformations first, adding columns, joining datasets, or calculating derived metrics, before filtering records. That order looks harmless in code but can double or triple the workload at execution.</p>
<h4 id="heading-problem-late-filtering">Problem: Late Filtering</h4>
<pre><code class="lang-python">df_late = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
      .filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)
)
</code></pre>
<p>This means Spark first computes all columns for every employee, then discards most rows.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Filter (age &gt; <span class="hljs-number">35</span>)
└─ Project [id, firstname, lastname, department, salary, age, hire_date, country,
            (salary * <span class="hljs-number">0.1</span>) AS bonus,
            (salary / <span class="hljs-number">1000</span>) AS salary_k]
   └─ LogicalRDD [...]
</code></pre>
<p>Catalyst can sometimes reorder this automatically, but when it can't (due to UDFs or complex logic), you're doing unnecessary work on data that's thrown away.</p>
<h4 id="heading-better-approach-early-filtering">Better Approach: Early Filtering</h4>
<pre><code class="lang-python">df_early = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)
      .withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, lastname, department, salary, age, hire_date, country,
         (salary * <span class="hljs-number">0.1</span>) AS bonus,
         (salary / <span class="hljs-number">1000</span>) AS salary_k]
└─ Filter (age &gt; <span class="hljs-number">35</span>)
   └─ LogicalRDD [...]
</code></pre>
<p>Now Spark prunes the dataset first, then applies the transformations. The plan contains the same operators, but the projection runs on far fewer rows. The result: smaller intermediate data, less per-row computation, and a smaller shuffle footprint in any downstream stage.</p>
<h4 id="heading-real-world-benchmark-aws-glue-1">Real-World Benchmark: AWS Glue</h4>
<p>Late Filtering:</p>
<pre><code class="lang-python">df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start_late = time.time()

df_late = (
    df.withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
      .filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)   
)

df_late.count()
time_late = round(time.time() - start_late, <span class="hljs-number">2</span>)
</code></pre>
<p>Early Filtering:</p>
<pre><code class="lang-python">start_early = time.time()

df_early = (
    df.filter(col(<span class="hljs-string">"age"</span>) &gt; <span class="hljs-number">35</span>)    
      .withColumn(<span class="hljs-string">"bonus"</span>, col(<span class="hljs-string">"salary"</span>) * <span class="hljs-number">0.1</span>)
      .withColumn(<span class="hljs-string">"salary_k"</span>, col(<span class="hljs-string">"salary"</span>) / <span class="hljs-number">1000</span>)
)

df_early.count()
time_early = round(time.time() - start_early, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Late Filter Time:"</span>, time_late, <span class="hljs-string">"seconds"</span>)
print(<span class="hljs-string">"Early Filter Time:"</span>, time_early, <span class="hljs-string">"seconds"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Rows Reaching the Projection</strong></td><td><strong>Execution Time (approx)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Late filter</td><td>1,000,000 (all rows)</td><td>~14 seconds</td><td>Computes bonus and salary_k for all rows, then filters</td></tr>
<tr>
<td>Early filter</td><td>300,000 (filtered subset)</td><td>~9 seconds</td><td>Filters first, computes only for age &gt; 35</td></tr>
</tbody>
</table>
</div><p>The early filter approach processes significantly less data before the projection, leading to faster execution and less memory pressure.</p>
<p>Always filter as early as possible, before joins, aggregations, expensive transformations (such as UDFs or window functions), and even during file reads via Parquet/ORC pushdown, since filtering at the source touches fewer partitions and leads to faster jobs.</p>
<h3 id="heading-scenario-5-column-pruning">Scenario 5: Column Pruning</h3>
<p>When working with Spark DataFrames, convenience often wins over efficiency, and nothing feels more convenient than select("*"). It’s quick, flexible, and perfect for exploration.</p>
<p>But in production pipelines, that little star silently costs CPU, memory, network bandwidth, and runtime efficiency. Every time you write select("*"), Spark expands it into <em>every</em> column from your schema, even if you’re using just one or two later.</p>
<p>Those extra attributes flow through every stage of the plan, from filters and joins to aggregations and shuffles. The result: inflated logical plans, bigger shuffle files, and slower queries.</p>
<h4 id="heading-the-problem-the-lazy-star">The Problem: “The Lazy Star”</h4>
<pre><code class="lang-python">df_star = (
    df.select(<span class="hljs-string">"*"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>At first glance, this seems harmless. But the query needs only three columns (department for the filter, country and salary for the aggregation), while Spark carries all eight (id, firstname, lastname, department, salary, age, hire_date, country) through every transformation.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [country], [avg(salary) AS avg_salary]
└─ Filter (department = Engineering)
   └─ Project [id, firstname, lastname, department, salary, age, hire_date, country]
      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every node in this tree carries all columns. Catalyst can’t prune them because you explicitly asked for "*". The excess attributes are serialized, shuffled, and deserialized across the cluster, even though they serve no purpose in the final result.</p>
<h4 id="heading-the-fix-select-only-what-you-need">The Fix: Select Only What You Need</h4>
<p>Be deliberate with your projections. Select the minimal schema required for the task.</p>
<pre><code class="lang-python">df_pruned = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [country], [avg(salary) AS avg_salary]
└─ Filter (department = Engineering)
   └─ Project [department, salary, country]
      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Now Spark reads and processes only the three required columns: department, salary, and country. The plan is narrower, the DAG simpler, and execution faster.</p>
<h4 id="heading-real-world-benchmark-aws-glue-2">Real-World Benchmark: AWS Glue</h4>
<p>Wide Projection:</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
                           [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df_star = (
    df.select(<span class="hljs-string">"*"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)

df_star.count()
time_star = round(time.time() - start, <span class="hljs-number">2</span>)
</code></pre>
<p>Pruned Projection:</p>
<pre><code class="lang-python">start = time.time()

df_pruned = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
      .filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .groupBy(<span class="hljs-string">"country"</span>)
      .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)

df_pruned.count()
time_pruned = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">f"select('*') time: <span class="hljs-subst">{time_star}</span>s"</span>)
print(<span class="hljs-string">f"pruned columns time: <span class="hljs-subst">{time_pruned}</span>s"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Columns Processed</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>select("*")</td><td>8</td><td>~26.54 s</td><td>Spark carries all columns through the plan.</td></tr>
<tr>
<td>Pruned projection</td><td>3</td><td>~2.21 s</td><td>Only needed columns processed → faster and lighter.</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-how-catalyst-handles-columns">Under the Hood: How Catalyst Handles Columns</h4>
<p>When you call select("*"), Catalyst resolves <em>every attribute</em> into the logical plan. Each subsequent transformation inherits that full attribute list, increasing plan depth and overhead.</p>
<p>Catalyst includes a rule called ColumnPruning, which removes unused attributes, but it only works when Spark <em>can see</em> which columns are necessary. If you use "*" or dynamically reference df.columns, Catalyst loses visibility.</p>
<p><strong>Works:</strong></p>
<pre><code class="lang-python">df \
    .select(<span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>) \
    .groupBy(<span class="hljs-string">"country"</span>) \
    .agg(avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p><strong>Doesn’t Work:</strong></p>
<pre><code class="lang-python">cols = df.columns

df.select(cols) \
  .groupBy(<span class="hljs-string">"country"</span>) \
  .agg(avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p>In the second case, Catalyst can’t prune anything because cols might include everything.</p>
<h4 id="heading-physical-plan-differences">Physical Plan Differences</h4>
<p>Wide Projection (select("*")):</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(<span class="hljs-number">1</span>) Filter (department = Engineering)

      +- *(<span class="hljs-number">1</span>) Scan parquet ...
</code></pre>
<p>Pruned Projection:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(<span class="hljs-number">1</span>) Project [department, salary, country]

   +- *(<span class="hljs-number">1</span>) Filter (department = Engineering)

      +- *(<span class="hljs-number">1</span>) Scan parquet [department, salary, country]
</code></pre>
<p>Notice the last line: Spark physically scans only the three referenced columns from Parquet. That’s genuine I/O reduction, not just logical simplification. Using select("*") increases shuffle file sizes, memory usage during serialization, Catalyst planning time, and I/O and network traffic. The fix is simply to name the columns you need.</p>
<p>In managed environments like AWS Glue or Databricks, this simple practice can greatly reduce ETL time, particularly for Parquet or Delta files, because explicit projection lets column pruning take effect at the scan. It’s one of the easiest and highest-impact Spark optimization techniques, and it starts with typing fewer asterisks.</p>
<h3 id="heading-scenario-6-filter-pushdown-vs-full-scan">Scenario 6: Filter Pushdown vs Full Scan</h3>
<p>When a Spark job feels slow right from the start, even before joins or aggregations, the culprit is often hidden at the data-read layer. Spark spends seconds (or minutes) scanning every record, even though most rows are useless for the query.</p>
<p>That’s where filter pushdown comes in. It tells Spark to <em>push your filter logic down to the file reader</em> so that Parquet / ORC / Delta formats return only the relevant rows from disk. Done right, this optimization can reduce scan size significantly. Done wrong, Spark performs a full scan, reading everything before filtering in memory.</p>
<h4 id="heading-the-problem-late-filters-and-full-scans">The Problem: Late Filters and Full Scans</h4>
<pre><code class="lang-python">employees_df = spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)

df_full = (
    employees_df
        .select(<span class="hljs-string">"*"</span>)  <span class="hljs-comment"># reads all columns</span>
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)
)
</code></pre>
<p>Looks fine, right? But Spark can’t push this filter to the Parquet reader because it’s applied <em>after</em> the select("*") projection step. Catalyst sees the filter as operating on a projected DataFrame, not the raw scan, so the pushdown boundary is lost.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Filter (country = Canada)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

   └─ Scan parquet employee_data [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every record from every Parquet file is read into memory before the filter executes. In large tables, this means scanning terabytes when you only need megabytes.</p>
<h4 id="heading-the-fix-filter-early-and-project-light">The Fix: Filter Early and Project Light</h4>
<p>Move filters as close as possible to the data source and limit columns before Spark reads them:</p>
<pre><code class="lang-python">df_pushdown = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)
)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Project [id, firstname, department, salary, country]

└─ Scan parquet employee_data [id, firstname, department, salary, country]

      PushedFilters: [country = Canada]
</code></pre>
<p>Notice the difference: PushedFilters appears in the plan. That means the Parquet reader handles the predicate, returning only matching blocks and rows.</p>
<h4 id="heading-under-the-hood-what-actually-happens">Under the Hood: What Actually Happens</h4>
<p>When Spark performs filter pushdown, it leverages the Parquet metadata (min/max statistics and row-group indexes) stored in file footers.</p>
<ul>
<li><p>Spark inspects file-level metadata for the predicate column (country).</p>
</li>
<li><p>It skips any row group whose values don’t match (country ≠ Canada).</p>
</li>
<li><p>It reads only the necessary row groups and columns from disk.</p>
</li>
<li><p>Those records enter the DAG directly – no in-memory filtering required.</p>
</li>
</ul>
<p>This skipping happens at the scan itself, before any rows reach the rest of the plan, which reduces both I/O and network transfer.</p>
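<p>To make the row-group skipping described above concrete, here is a minimal plain-Python model of the idea. It is not Spark's or Parquet's actual reader: RowGroup, can_skip, and scan_with_pushdown are hypothetical names, and the min/max statistics are hand-written for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group: min/max stats plus rows."""
    min_country: str
    max_country: str
    rows: list


def can_skip(group: RowGroup, value: str) -> bool:
    # An equality predicate cannot match if the value lies outside [min, max].
    return value < group.min_country or value > group.max_country


def scan_with_pushdown(groups, value):
    read_groups = 0
    matches = []
    for group in groups:
        if can_skip(group, value):
            continue  # decided from statistics alone; the rows are never read
        read_groups += 1
        # Statistics are coarse, so rows in a surviving group are still checked.
        matches.extend(r for r in group.rows if r["country"] == value)
    return matches, read_groups


groups = [
    RowGroup("Brazil", "Canada", [{"id": 1, "country": "Brazil"},
                                  {"id": 2, "country": "Canada"}]),
    RowGroup("France", "India", [{"id": 3, "country": "France"},
                                 {"id": 4, "country": "India"}]),
    RowGroup("UK", "USA", [{"id": 5, "country": "UK"},
                           {"id": 6, "country": "USA"}]),
]

matches, read_groups = scan_with_pushdown(groups, "Canada")
print(matches)      # [{'id': 2, 'country': 'Canada'}]
print(read_groups)  # 1 of 3 row groups was actually read
```

<p>In this toy model, sorting or clustering data by the filter column is what keeps the min/max ranges narrow. When every row group spans the full range of values, the statistics cannot rule anything out.</p>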
<h4 id="heading-real-world-benchmark-aws-glue-3">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col

spark = SparkSession.builder.appName(<span class="hljs-string">"FilterPushdownBenchmark"</span>).getOrCreate()

start = time.time()
df_full = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"*"</span>)                         <span class="hljs-comment"># all columns</span>
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)  
)
df_full.count()
time_full = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df_pushdown = (
    spark.read.parquet(<span class="hljs-string">"s3://data/employee_data/"</span>)
        .select(<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"country"</span>)
        .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"Canada"</span>)  
)
df_pushdown.count()
time_push = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Full Scan Time:"</span>, time_full, <span class="hljs-string">"sec"</span>)
print(<span class="hljs-string">"Filter Pushdown Time:"</span>, time_push, <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Full Scan</td><td>14.2 s</td><td>All files scanned and filtered in memory.</td></tr>
<tr>
<td>Filter Pushdown</td><td>3.8 s</td><td>Only relevant row groups and columns read.</td></tr>
</tbody>
</table>
</div><p><strong>Physical Plan Comparison</strong></p>
<p>Full Scan:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Filter (country = Canada)

+- *(<span class="hljs-number">1</span>) ColumnarToRow

   +- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

      Batched: true, DataFilters: [], PushedFilters: []
</code></pre>
<p>Pushdown:</p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) ColumnarToRow

+- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, department, salary, country]

   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]
</code></pre>
<p>The difference is clear: PushedFilters confirms that Spark applied predicate pushdown, skipping unnecessary row groups at the scan stage.</p>
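<p>If you want to automate that check, one option is to inspect the plan text for a non-empty PushedFilters list. The sketch below is a small plain-Python helper under stated assumptions: pushed_filters is a hypothetical name, the plan strings are copied from the examples above, and real explain() output may format filters differently depending on the Spark version.</p>

```python
import re


def pushed_filters(plan_text: str) -> str:
    """Return the text inside 'PushedFilters: [...]', or '' if absent or empty."""
    match = re.search(r"PushedFilters:\s*\[(.*?)\]", plan_text)
    return match.group(1).strip() if match else ""


# Plan text copied from the two physical plans shown above.
full_scan_plan = """
*(1) Filter (country = Canada)
+- *(1) ColumnarToRow
   +- *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]
      Batched: true, DataFilters: [], PushedFilters: []
"""

pushdown_plan = """
*(1) ColumnarToRow
+- *(1) FileScan parquet [id, firstname, department, salary, country]
   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]
"""

print(repr(pushed_filters(full_scan_plan)))  # ''
print(repr(pushed_filters(pushdown_plan)))   # 'country = Canada'
```

<p>In a real job, the plan text would come from capturing whatever explain() prints. That step is left out here so the sketch stays independent of any cluster.</p>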
<h4 id="heading-reflection-why-pushdown-matters">Reflection: Why Pushdown Matters</h4>
<p>Pushdown isn’t a micro-optimization. It’s often the single biggest performance lever in Spark ETL. In data lakes with hundreds of files, full scans waste hours and inflate AWS S3 I/O costs. By filtering and projecting early, you let Spark prune both rows and columns at the scan, before the rest of the job runs.</p>
<p>Apply filters as early as possible in the read pipeline, combine filter pushdown with column pruning, verify PushedFilters in explain("formatted"), avoid UDFs and select("*") at read time, and let pushdown turn “read everything and discard most” into “read only what you need.”</p>
<h3 id="heading-scenario-7-de-duplicate-right">Scenario 7: De-duplicate Right</h3>
<h4 id="heading-the-problem-all-row-deduplication-and-why-it-hurts">The Problem: “All-Row Deduplication” and Why It Hurts</h4>
<p>When we use this:</p>
<pre><code class="lang-python">df.dropDuplicates()
</code></pre>
<p>Spark removes identical rows across all columns. It sounds simple, but this operation forces Spark to treat every column as part of the deduplication key.</p>
<p>Internally, it means:</p>
<ul>
<li><p>Every attribute is serialized and hashed.</p>
</li>
<li><p>Every unique combination of all columns is shuffled across the cluster to ensure global uniqueness.</p>
</li>
<li><p>Even small changes in a non-essential field (like hire_date) cause new keys and destroy aggregation locality.</p>
</li>
</ul>
<p>In wide tables, this is one of the heaviest shuffle operations Spark can perform.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Aggregate [id, firstname, lastname, department, salary, age, hire_date, country], [first(id) AS id, ...]

└─ Exchange hashpartitioning(id, firstname, lastname, department, salary, age, hire_date, country, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Notice the Exchange: that’s a full shuffle across all columns. Spark must send every record to the partition responsible for its unique combination of all fields. This is slow, memory-intensive, and scales poorly as columns grow.</p>
<h4 id="heading-the-better-approach-key-based-deduplication">The Better Approach: Key-Based Deduplication</h4>
<p>In most real datasets, duplicates are determined by a primary or business key, not all attributes. For example, if id uniquely identifies an employee, we only need to keep one record per id.</p>
<pre><code class="lang-python">df.dropDuplicates([<span class="hljs-string">"id"</span>])
</code></pre>
<p>Now Spark deduplicates based only on the id column.</p>
<pre><code class="lang-python">Aggregate [id], [first(id) AS id, first(firstname) AS firstname, ...]

└─ Exchange hashpartitioning(id, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The shuffle is dramatically narrower. Instead of hashing across all columns, Spark redistributes data only by id. The result is fewer bytes, smaller shuffle files, and a faster reduce stage.</p>
<h4 id="heading-real-world-benchmark-aws-glue-4">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

multiplied_data = [(i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],   <span class="hljs-comment"># department</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],   <span class="hljs-comment"># salary</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],   <span class="hljs-comment"># age</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],   <span class="hljs-comment"># hire_date</span>
                    employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>]    <span class="hljs-comment"># country</span>
                    )
                   <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)]

df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start = time.time()
dedup_full = df.dropDuplicates()
dedup_full.count()
time_full = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
dedup_key = df.dropDuplicates([<span class="hljs-string">"id"</span>])
dedup_key.count()
time_key = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">f"Full-row dedup time: <span class="hljs-subst">{time_full}</span>s"</span>)
print(<span class="hljs-string">f"Key-based dedup time: <span class="hljs-subst">{time_key}</span>s"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Full-Row Dedup</td><td>27.6 s</td><td>Shuffle across all attributes, large hash table</td></tr>
<tr>
<td>Key-Based Dedup (["id"])</td><td>2.06 s</td><td>~13× faster, minimal shuffle width</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-what-catalyst-does">Under the Hood: What Catalyst Does</h4>
<p>When you specify a key list, Catalyst rewrites dropDuplicates(keys) into a partial + final aggregate plan, just like a groupBy:</p>
<p>HashAggregate(keys=[id], functions=[first(...)])</p>
<p>This allows Spark to:</p>
<ul>
<li><p>Perform map-side partial aggregation on each partition (before shuffle).</p>
</li>
<li><p>Exchange only the grouping key (id).</p>
</li>
<li><p>Perform a final aggregation on the reduced data.</p>
</li>
</ul>
<p>The all-column version can’t do that optimization: because every column participates in uniqueness, Spark must ensure <em>complete</em> data redistribution.</p>
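<p>The partial and final aggregation steps listed above can be modeled in a few lines of plain Python. This is only a conceptual sketch, not Catalyst's implementation: partial_dedup and final_dedup are hypothetical names, each inner list stands in for one partition, and the hash-partitioned exchange between the two steps is reduced to simply passing the lists along.</p>

```python
def partial_dedup(partition, key):
    """Map side: keep the first row seen per key within one partition."""
    seen = {}
    for row in partition:
        seen.setdefault(row[key], row)
    return list(seen.values())


def final_dedup(partials, key):
    """Reduce side: merge per-partition survivors, keeping one row per key."""
    merged = {}
    for rows in partials:
        for row in rows:
            merged.setdefault(row[key], row)
    return list(merged.values())


# Two toy partitions holding six rows in total, with duplicates on "id".
partitions = [
    [{"id": 1, "name": "a"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}, {"id": 3, "name": "c"}],
]

partials = [partial_dedup(p, "id") for p in partitions]
rows_shuffled = sum(len(p) for p in partials)
result = final_dedup(partials, "id")

print(rows_shuffled)                    # 4 rows cross the "shuffle" instead of 6
print(sorted(r["id"] for r in result))  # [1, 2, 3]
```

<p>The more duplicates each partition holds for the same key, the more the map-side step shrinks the data before it moves across the network.</p>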
<h4 id="heading-best-practices-for-deduplication">Best Practices for Deduplication</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Practice</strong></td><td><strong>Why It Matters</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Always deduplicate by key columns</td><td>Reduces shuffle width and data movement</td></tr>
<tr>
<td>Use deterministic keys (id, email, ssn)</td><td>Ensures predictable grouping</td></tr>
<tr>
<td>Avoid dropDuplicates() without arguments</td><td>Forces global shuffle across all attributes</td></tr>
<tr>
<td>Combine with column pruning</td><td>Keep only necessary fields before deduplication</td></tr>
<tr>
<td>For “latest record” logic, use window functions</td><td>Allows targeted deduplication (row_number() with order)</td></tr>
<tr>
<td>Cache intermediate datasets if reused</td><td>Avoids recomputation of expensive dedup stages</td></tr>
</tbody>
</table>
</div><h4 id="heading-combining-deduplication-amp-aggregation">Combining Deduplication &amp; Aggregation</h4>
<p>You can merge deduplication with aggregation for even better results:</p>
<pre><code class="lang-python">df_dedup_agg = (
    df.dropDuplicates([<span class="hljs-string">"id"</span>])
        .groupBy(<span class="hljs-string">"department"</span>)
        .agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
)
</code></pre>
<p>Spark now reuses the same shuffle partitioning for both operations, one shuffle instead of two. The plan will show:</p>
<pre><code class="lang-python">HashAggregate(keys=[department], functions=[avg(salary)])

└─ HashAggregate(keys=[id], functions=[first(...), first(department)])

   └─ Exchange hashpartitioning(id, <span class="hljs-number">200</span>)
</code></pre>
<p>Prefer dropDuplicates(["key_col"]) over dropDuplicates() to deduplicate by business or surrogate keys rather than the entire schema. Combine deduplication with projection to reduce I/O, and remember that one narrow shuffle is always better than a wide shuffle. Deduplication isn’t just cleanup – it’s an optimization strategy. Choose your keys wisely, and Spark will reward you with faster jobs and lighter DAGs.</p>
<h3 id="heading-scenario-8-count-smarter">Scenario 8: Count Smarter</h3>
<p>In production, one of the most common performance pitfalls is the simplest line of code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> df.count() &gt; <span class="hljs-number">0</span>:
</code></pre>
<p>At first glance, this seems harmless. You just want to know whether the DataFrame has any data before writing, joining, or aggregating. But in Spark, count() is not a metadata lookup. It’s a full cluster-wide job.</p>
<p><strong>What Really Happens with count()</strong><br>When you call df.count(), Spark executes a complete action:</p>
<ul>
<li><p>It scans every partition.</p>
</li>
<li><p>Deserializes every row.</p>
</li>
<li><p>Counts records locally on each executor.</p>
</li>
<li><p>Reduces the counts to the driver.</p>
</li>
</ul>
<p>That means your “empty check” runs a full distributed computation, even when the dataset has billions of rows or lives in S3.</p>
<pre><code class="lang-python">df.count()
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) HashAggregate(keys=[], functions=[count(<span class="hljs-number">1</span>)])

+- *(<span class="hljs-number">1</span>) ColumnarToRow

   +- *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Every record is read, aggregated, and returned just to produce a single integer.</p>
<p>Now imagine this runs in the middle of your Glue job, before a write, before a filter, or inside a loop. You’ve just added a full-table scan to your DAG for no reason.</p>
<h4 id="heading-the-smarter-way-limit1-or-head1">The Smarter Way: limit(1) or head(1)</h4>
<p>If all you need to know is whether data exists, you don’t need to count every record. You just need to know if there’s <em>at least one</em>.</p>
<p>Two efficient alternatives:</p>
<pre><code class="lang-python">df.head(<span class="hljs-number">1</span>)
<span class="hljs-comment">#or</span>
df.limit(<span class="hljs-number">1</span>).collect()
</code></pre>
<p>Both execute a lazy scan that stops as soon as one record is found.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">CollectLimit <span class="hljs-number">1</span>

└─ *(<span class="hljs-number">1</span>) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<ul>
<li><p>No global aggregation.</p>
</li>
<li><p>No shuffle.</p>
</li>
<li><p>No full scan.</p>
</li>
</ul>
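<p>One way to package this existence check is a small helper. The sketch below assumes only that the object exposes take(n) returning a list, as a PySpark DataFrame does. has_rows and FakeDataFrame are hypothetical names, and the fake class exists purely so the example runs without a cluster.</p>

```python
def has_rows(df) -> bool:
    """Return True if the DataFrame-like object yields at least one row."""
    # take(1) returns a list with at most one row, so no full count is needed.
    return len(df.take(1)) > 0


class FakeDataFrame:
    """Hypothetical stand-in exposing only take(), for running without Spark."""

    def __init__(self, rows):
        self._rows = rows

    def take(self, n):
        return self._rows[:n]


print(has_rows(FakeDataFrame([("John", "USA")])))  # True
print(has_rows(FakeDataFrame([])))                 # False
```

<p>With a real DataFrame, the call site would read <code>if has_rows(df):</code> in place of <code>if df.count() &gt; 0:</code>.</p>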
<h4 id="heading-real-world-benchmark-aws-glue-5">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># Initialize Spark session</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"MillionRowsRenameTest"</span>).getOrCreate()

<span class="hljs-comment"># Base dataset (10 sample employees)</span>
employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">80000</span>, <span class="hljs-number">28</span>, <span class="hljs-string">"2020-01-15"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Smith"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>, <span class="hljs-number">32</span>, <span class="hljs-string">"2019-03-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Johnson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">60000</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"2021-06-10"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Brown"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"2018-07-01"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Charlie"</span>, <span class="hljs-string">"Wilson"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">65000</span>, <span class="hljs-number">29</span>, <span class="hljs-string">"2020-11-05"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Lee"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">55000</span>, <span class="hljs-number">27</span>, <span class="hljs-string">"2021-01-20"</span>, <span class="hljs-string">"USA"</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Davis"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">95000</span>, <span class="hljs-number">40</span>, <span class="hljs-string">"2017-04-12"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">"Frank"</span>, <span class="hljs-string">"Miller"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">70000</span>, <span class="hljs-number">33</span>, <span class="hljs-string">"2019-09-25"</span>, <span class="hljs-string">"UK"</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Taylor"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">58000</span>, <span class="hljs-number">26</span>, <span class="hljs-string">"2021-08-15"</span>, <span class="hljs-string">"Canada"</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">"Henry"</span>, <span class="hljs-string">"Anderson"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">88000</span>, <span class="hljs-number">31</span>, <span class="hljs-string">"2020-02-28"</span>, <span class="hljs-string">"USA"</span>)
]

<span class="hljs-comment"># Create 1 million rows</span>
multiplied_data = [
    (i, <span class="hljs-string">f"firstname_<span class="hljs-subst">{i}</span>"</span>, <span class="hljs-string">f"lastname_<span class="hljs-subst">{i}</span>"</span>,
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">3</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">4</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">5</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">6</span>],
     employees_data[i % <span class="hljs-number">10</span>][<span class="hljs-number">7</span>])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>_000_001)
]

<span class="hljs-comment"># Create DataFrame</span>
df = spark.createDataFrame(
    multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]
)

start = time.time()
df.count()
count_time = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df.limit(<span class="hljs-number">1</span>).collect()
limit_time = round(time.time() - start, <span class="hljs-number">2</span>)

start = time.time()
df.head(<span class="hljs-number">1</span>)
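head_time = round(time.time() - start, <span class="hljs-number">2</span>)

<span class="hljs-comment"># Report the three timings so the runs can be compared</span>
print(<span class="hljs-string">f"count() time: <span class="hljs-subst">{count_time}</span>s"</span>)
print(<span class="hljs-string">f"limit(1) time: <span class="hljs-subst">{limit_time}</span>s"</span>)
print(<span class="hljs-string">f"head(1) time: <span class="hljs-subst">{head_time}</span>s"</span>)

spark.stop()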
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Method</strong></td><td><strong>Plan Type</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td>count()</td><td>HashAggregate + Exchange</td><td>26.33 s</td><td>Full scan + aggregation</td></tr>
<tr>
<td>limit(1)</td><td>CollectLimit</td><td>0.62 s</td><td>Stops after first record</td></tr>
<tr>
<td>head(1)</td><td>CollectLimit</td><td>0.42 s</td><td>Fastest, single partition</td></tr>
</tbody>
</table>
</div><p>The difference is significant for the same logical check.</p>
<p>So why does this difference exist? Spark’s execution model treats every action as a trigger for computation. count() is an aggregation action that requires global communication, whereas limit(1) and head(1) short-circuit the job after fetching the first record. Spark plans a CollectLimit node instead of a HashAggregate, and the scheduler can stop as soon as one task has returned a row.</p>
<p><strong>Plan comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Action</strong></td><td><strong>Simplified Plan</strong></td><td><strong>Type</strong></td><td><strong>Behavior</strong></td></tr>
</thead>
<tbody>
<tr>
<td>count()</td><td>HashAggregate → Exchange → FileScan</td><td>Global</td><td>Full scan, wide dependency</td></tr>
<tr>
<td>limit(1)</td><td>CollectLimit → FileScan</td><td>Local</td><td>Early stop, narrow dependency</td></tr>
<tr>
<td>head(1)</td><td>CollectLimit → FileScan</td><td>Local</td><td>Early stop, single task</td></tr>
</tbody>
</table>
</div><p>Avoid using count() to check emptiness since it triggers a full scan. Use limit(1) or head(1) for lightweight existence checks, and reserve count() for cases where the total is actually required, because Spark will process all the data unless explicitly told to stop. Other alternatives:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Method</strong></td><td><strong>Notes</strong></td></tr>
</thead>
<tbody>
<tr>
<td><code>df.take(1)</code></td><td>Similar to head(1); returns a list of rows</td></tr>
<tr>
<td><code>df.first()</code></td><td>Returns first Row or None</td></tr>
<tr>
<td><code>df.isEmpty()</code></td><td>Returns true if DataFrame has no rows</td></tr>
<tr>
<td><code>df.rdd.isEmpty()</code></td><td>RDD-level check</td></tr>
</tbody>
</table>
</div><h3 id="heading-scenario-9-window-wisely">Scenario 9: Window Wisely</h3>
<p>Window functions (rank(), dense_rank(), lag(), avg() with over(), and so on) are essential in analytics. They let you calculate running totals, rankings, or time-based metrics.</p>
<p>But in Spark, they’re not cheap, because they rely on shuffles and ordering.</p>
<p>Each window operation:</p>
<ul>
<li><p>Requires all rows for the same partition key to be co-located on the same node.</p>
</li>
<li><p>Requires sorting those rows by the orderBy() clause within each partition.</p>
</li>
</ul>
<p>If you omit partitionBy() (or use it with too broad a key), Spark treats the entire dataset as one partition, triggering a massive shuffle and global sort.</p>
<h4 id="heading-global-window-the-wrong-way">Global Window: The Wrong Way</h4>
<p>Let’s compute employee rankings by salary without partitioning:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.window <span class="hljs-keyword">import</span> Window
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> rank, col

window_spec = Window.orderBy(col(<span class="hljs-string">"salary"</span>).desc())

df_ranked = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_spec))
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Window [rank() windowspecdefinition(orderBy=[salary DESC]) AS salary_rank]

└─ Sort [salary DESC], true

   └─ Exchange rangepartitioning(salary DESC, <span class="hljs-number">200</span>)

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>Spark must shuffle and sort the entire dataset globally: a full sort across all rows, with all data moving through the network. Because there is no partition key, the window cannot be computed independently per group.</p>
<h4 id="heading-partition-by-a-selective-key-the-better-way">Partition by a Selective Key: The Better Way</h4>
<p>Most analytics don’t need a global ranking. You likely want rankings within a department or group, not across the entire company.</p>
<pre><code class="lang-python">window_spec = Window.partitionBy(<span class="hljs-string">"department"</span>).orderBy(col(<span class="hljs-string">"salary"</span>).desc())

df_ranked = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_spec))
</code></pre>
<p>Now Spark builds separate windows per department. Each partition’s data stays local, dramatically reducing shuffle size.</p>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Window [rank() windowspecdefinition(partitionBy=[department], orderBy=[salary DESC]) AS salary_rank]

└─ Sort [department ASC, salary DESC], false

   └─ Exchange hashpartitioning(department, <span class="hljs-number">200</span>)

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The Exchange now partitions data only by department. The shuffle boundary is narrower, with fewer bytes transferred, fewer sort comparisons, and a smaller spill risk.</p>
<h4 id="heading-real-world-benchmark-aws-glue-6">Real-World Benchmark: AWS Glue</h4>
<p>We can run both window functions on the same 1-million-row dataset:</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
 <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
window_global = Window.orderBy(col(<span class="hljs-string">"salary"</span>).desc())
df_global = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_global))
df_global.count()
global_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f'global_time:<span class="hljs-subst">{global_time}</span>'</span>)

start = time.time()
window_local = Window.partitionBy(<span class="hljs-string">"department"</span>).orderBy(col(<span class="hljs-string">"salary"</span>).desc())
df_local = df.withColumn(<span class="hljs-string">"salary_rank"</span>, rank().over(window_local))
df_local.count()
local_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f'local_time:<span class="hljs-subst">{local_time}</span>'</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Stage Count</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Global Window (no partition)</td><td>5</td><td>30.21 s</td><td>Full dataset shuffle + global sort</td></tr>
<tr>
<td>Partitioned Window (by department)</td><td>3</td><td>1.74 s</td><td>Localized sort, fewer shuffle files</td></tr>
</tbody>
</table>
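</div><p>Partitioning the window cut the runtime from 30.21 s to 1.74 s (roughly 17×) and reduced the stage count from 5 to 3. Expect the gap to widen as data grows, because the global window has to sort every row in one ordering while the partitioned window sorts each department independently.</p>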
<h4 id="heading-under-the-hood-what-spark-actually-does-1">Under the Hood: What Spark Actually Does</h4>
<p>Each Window transformation adds a physical plan node like:</p>
<p><code>WindowExec [rank() windowspecdefinition(...)], frame=RangeFrame</code></p>
<p>This node is non-pipelined: it materializes input partitions before computing window metrics. Catalyst can push a filter below the window only when the predicate references nothing but the partitionBy() columns. Any other filter stays above WindowExec, which means:</p>
<ul>
<li><p>If you rank before filtering, Spark computes ranks for all rows.</p>
</li>
<li><p>If you order globally, Spark must sort everything before starting.</p>
</li>
</ul>
<p>That’s why window placement in your code matters almost as much as partition keys.</p>
<h4 id="heading-common-anti-patterns">Common Anti-Patterns:</h4>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Anti-Pattern</strong></td><td><strong>Why It Hurts</strong></td><td><strong>Fix</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Missing partitionBy()</td><td>Global sort across dataset</td><td>Partition by key columns</td></tr>
<tr>
<td>Near-unique partition key</td><td>Creates too many small partitions</td><td>Use selective, not unique keys</td></tr>
<tr>
<td>Wide, unbounded window frame</td><td>Retains all rows in memory per key</td><td>Use bounded ranges (for example, rowsBetween(-3, 0))</td></tr>
<tr>
<td>Filtering after window</td><td>Computes unnecessary metrics</td><td>Filter first, then window</td></tr>
<tr>
<td>Multiple chained windows</td><td>Each triggers new sort</td><td>Combine window metrics in one spec</td></tr>
</tbody>
</table>
</div><p>Partition on selective keys to reduce shuffle volume, and avoid global windows that force full sorts and shuffles. Prefer bounded frames to keep state in memory and limit disk spill, and filter early while combining metrics to minimize unnecessary data flowing through WindowExec. Windows are powerful, but unbounded ones can silently crush performance. In Spark, partitioning isn’t optional. It’s the line between analytics and overhead.</p>
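<p>To see why a bounded frame keeps per-key state small, here is a plain-Python model of <code>rowsBetween(-3, 0)</code>. It is a toy sketch, not Spark’s implementation: <code>bounded_frame_avg</code> is a hypothetical name, and the input is assumed to be already sorted the way <code>orderBy()</code> would sort it.</p>

```python
from collections import deque


def bounded_frame_avg(values, preceding=3):
    """Average over the current row and up to `preceding` earlier rows,
    mirroring rowsBetween(-preceding, 0) on already-sorted input."""
    # The frame never holds more than preceding + 1 rows.
    frame = deque(maxlen=preceding + 1)
    averages = []
    for value in values:
        frame.append(value)  # the oldest row drops out automatically
        averages.append(sum(frame) / len(frame))
    return averages


salaries = [100, 200, 300, 400, 500]
print(bounded_frame_avg(salaries))
# [100.0, 150.0, 200.0, 250.0, 350.0]
```

<p>However long the input gets, the frame holds at most four rows. An unbounded frame would have to keep every row for the key.</p>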
<h3 id="heading-scenario-10-incremental-aggregations-with-cache-and-persist">Scenario 10: Incremental Aggregations with Cache and Persist</h3>
<p>When multiple actions depend on the same expensive base computation, don’t recompute it every time. Materialize it once with cache() or persist(), then reuse it. Most Spark teams get this wrong in two ways:</p>
<ul>
<li><p>They never cache, so Spark recomputes long lineages (filters, joins, window ops) for every action.</p>
</li>
<li><p>They cache everything, blowing executor memory and making things worse.</p>
</li>
</ul>
<p>This scenario shows how to do it intelligently.</p>
<h4 id="heading-the-problem-recomputing-the-same-work-for-every-metric">The Problem: Recomputing the Same Work for Every Metric</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, avg, max <span class="hljs-keyword">as</span> max_, count

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">70000</span>)
)

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"cnt"</span>))
</code></pre>
<p>Looks totally fine at a glance. But remember: Spark is lazy. Every time you trigger an action:</p>
<pre><code class="lang-python">avg_salary.show()
max_salary.show()
cnt_salary.show()
</code></pre>
<p>Spark walks back to the same base definition and re-runs the scan and all filters for each metric, unless you persist.</p>
<p>So instead of 1 filtered dataset reused 3 times, you effectively get:</p>
<ul>
<li><p>3 jobs</p>
</li>
<li><p>3 scans / filter chains</p>
</li>
<li><p>3 groupBy shuffles</p>
</li>
</ul>
<p>for the same input slice.</p>
<p><strong>Simplified Logical Plan Shape (Without Cache):</strong></p>
<pre><code class="lang-python">HashAggregate [department], [avg/max/count]

└─ Exchange hashpartitioning(department)

   └─ Filter (department = <span class="hljs-string">'Engineering'</span> AND country = <span class="hljs-string">'USA'</span> AND salary &gt; <span class="hljs-number">70000</span>)

      └─ Scan ...
</code></pre>
<p>And Spark builds this three times. Even though the filter logic is identical, each action triggers a new job with:</p>
<ul>
<li><p>new stages,</p>
</li>
<li><p>new shuffles, and</p>
</li>
<li><p>new scans.</p>
</li>
</ul>
<p>On large datasets (hundreds of GBs), this is brutal.</p>
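<p>The replay behavior can be modeled in a few lines of plain Python. This is a toy stand-in, not how Spark is implemented: <code>LazyPipeline</code>, <code>engineering_salaries</code>, and the sample rows are all hypothetical, and <code>rows()</code> plays the role of an action.</p>

```python
class LazyPipeline:
    """Toy stand-in for a lazy DataFrame (not Spark's implementation)."""

    def __init__(self, compute):
        self._compute = compute   # function that replays the full lineage
        self._persisted = False
        self._cache = None
        self.runs = 0             # how many times the lineage was replayed

    def persist(self):
        self._persisted = True    # like Spark, nothing is computed yet
        return self

    def rows(self):
        """Act like an action: replay the lineage unless a cached copy exists."""
        if self._cache is not None:
            return self._cache
        self.runs += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result


def engineering_salaries():
    # Hypothetical sample data standing in for the scan + filter chain.
    data = [("Engineering", 90000), ("Engineering", 85000), ("Sales", 75000)]
    return [salary for dept, salary in data if dept == "Engineering"]


def three_metrics(pipeline):
    # Three separate "actions", like the three show() calls.
    rows = pipeline.rows()
    average = sum(rows) / len(rows)
    highest = max(pipeline.rows())
    total = len(pipeline.rows())
    return average, highest, total


uncached = LazyPipeline(engineering_salaries)
cached = LazyPipeline(engineering_salaries).persist()

print(three_metrics(uncached), uncached.runs)  # lineage replayed 3 times
print(three_metrics(cached), cached.runs)      # lineage replayed once
```

<p>Both pipelines return the same metrics. The only difference is how many times the upstream work runs.</p>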
<h4 id="heading-the-better-approach-cache-the-shared-base">The Better Approach: Cache the Shared Base</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> StorageLevel

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">70000</span>)
)

base = base.persist(StorageLevel.MEMORY_AND_DISK)

base.count()

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"cnt"</span>))

avg_salary.show()
max_salary.show()
cnt_salary.show()

base.unpersist()
</code></pre>
<p>Now, the filters and initial scan run once, the results are cached, and all subsequent aggregates read from cached data instead of recomputing upstream logic.</p>
<p><strong>Logical Plan Shape (With Cache):</strong></p>
<p>Before materialization (base.count()), the plan still shows the lineage. Afterward, subsequent actions operate off the cached node.</p>
<pre><code class="lang-python">InMemoryRelation [department, salary, country, ...]

   └─ * Cached <span class="hljs-keyword">from</span>:

      Filter (department = <span class="hljs-string">'Engineering'</span> AND country = <span class="hljs-string">'USA'</span> AND salary &gt; <span class="hljs-number">70000</span>)

      └─ Scan parquet employees_large ...
</code></pre>
<p>Then:</p>
<pre><code class="lang-python">HashAggregate [department], [avg/max/count]

└─ InMemoryRelation [...]
</code></pre>
<p>One heavy pipeline, many cheap reads. The DAG becomes flatter:</p>
<ul>
<li><p>Expensive scan &amp; filter: once.</p>
</li>
<li><p>Cheap aggregations: N times from memory/disk, each with only its own small groupBy shuffle.</p>
</li>
</ul>
<h4 id="heading-real-world-benchmark-aws-glue-7">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
<span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

base = (
    df.filter(col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>)
      .filter(col(<span class="hljs-string">"country"</span>) == <span class="hljs-string">"USA"</span>)
      .filter(col(<span class="hljs-string">"salary"</span>) &gt; <span class="hljs-number">85000</span>)
)


start = time.time()

avg_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary = base.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt = base.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"emp_count"</span>))

print(<span class="hljs-string">"---- Without Cache ----"</span>)
avg_salary.show()
max_salary.show()
cnt.show()

no_cache_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"Total time without cache: <span class="hljs-subst">{no_cache_time}</span> seconds"</span>)


<span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> StorageLevel

base_cached = base.persist(StorageLevel.MEMORY_AND_DISK)
base_cached.count()  <span class="hljs-comment"># materialize cache</span>

start = time.time()

avg_salary_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(avg(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"avg_salary"</span>))
max_salary_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(max_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"max_salary"</span>))
cnt_c = base_cached.groupBy(<span class="hljs-string">"department"</span>).agg(count(<span class="hljs-string">"*"</span>).alias(<span class="hljs-string">"emp_count"</span>))

print(<span class="hljs-string">"---- With Cache ----"</span>)
avg_salary_c.show()
max_salary_c.show()
cnt_c.show()

cache_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"Total time with cache: <span class="hljs-subst">{cache_time}</span> seconds"</span>)

<span class="hljs-comment"># Cleanup</span>
base_cached.unpersist()

print(<span class="hljs-string">"\n==== Summary ===="</span>)
print(<span class="hljs-string">f"Without cache: <span class="hljs-subst">{no_cache_time}</span>s | With cache: <span class="hljs-subst">{cache_time}</span>s"</span>)
print(<span class="hljs-string">"================="</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Execution Time (1M rows)</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Without Cache</td><td>30.75 s</td></tr>
<tr>
<td>With Cache</td><td>3.34 s</td></tr>
</tbody>
</table>
</div><h4 id="heading-under-the-hood-why-this-works"><strong>Under the Hood: Why This Works</strong></h4>
<p>Using cache() or persist() in Spark inserts an InMemoryRelation / InMemoryTableScanExec node so that expensive intermediate results are stored in executor memory (or memory+disk). This allows future jobs to reuse cached blocks instead of re-scanning sources or re-computing shuffles. This shortens downstream logical plans, reduces repeated shuffles, and lowers load on systems like S3, HDFS, or JDBC.</p>
<p>Without caching, every action replays the full lineage and Spark recomputes the data unless another operator or AQE optimization has already materialized part of it. But caching should not become “cache everything”. Avoid caching very large DataFrames that are used only once, wide raw inputs (cache the filtered or aggregated subset instead), and long-lived caches that are never unpersisted.</p>
<p>A good rule of thumb is to cache only when the DataFrame is expensive to recompute (joins, filters, windows, UDFs), is used at least twice, and is reasonably sized after filtering so it can fit in memory or work with MEMORY_AND_DISK. Otherwise, allow Spark to recompute.</p>
<p>Conceptually, caching converts a tall, repetitive DAG such as repeated “HashAggregate → Exchange → Filter → Scan” sequences into a hub-and-spoke design where one heavy cached hub feeds multiple lightweight downstream aggregates.</p>
<p>When multiple actions depend on the same expensive computation, cache or persist the shared base to flatten the DAG, eliminate repeated scans and shuffles, and improve end-to-end performance. Be intentional about it: cache only when reuse is real and the data size is safe, and always call <code>unpersist()</code> when done.</p>
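<p>One way to make the unpersist step hard to forget is a context manager. The sketch below is a suggestion, not something the benchmark used: <code>persisted</code> is a hypothetical helper, and it calls only <code>persist()</code>, <code>count()</code>, and <code>unpersist()</code>, which DataFrames provide.</p>

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, storage_level=None, materialize=True):
    """Persist df for the duration of a with-block, then release it.

    storage_level is passed to persist() when given; otherwise the
    default level of persist() is used.
    """
    cached = df.persist() if storage_level is None else df.persist(storage_level)
    try:
        if materialize:
            cached.count()   # force the cache to fill before reuse
        yield cached
    finally:
        cached.unpersist()   # runs even if the block raises


# Hypothetical usage with the base DataFrame from this scenario:
#     with persisted(base) as cached_base:
#         cached_base.groupBy("department").count().show()
```

<p>Because the release happens in <code>finally</code>, the cache is freed even when one of the downstream aggregations fails.</p>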
<p>Don’t make Spark re-solve the same puzzle three times. Let it solve it once, remember the answer, and move on.</p>
<h3 id="heading-scenario-11-reduce-shuffles">Scenario 11: Reduce Shuffles</h3>
<p>Shuffles are Spark’s invisible tax collectors. Every time your data crosses executors, you pay in CPU, disk I/O, and network bandwidth.</p>
<p>Two of the most common yet misunderstood transformations that trigger or avoid shuffles are coalesce() and repartition(). Both change partition counts, but they do it in fundamentally different ways.</p>
<h4 id="heading-the-problem"><strong>The Problem</strong></h4>
<p>It’s tempting to write <code>df_result = df.repartition(10)</code> and think, “I’m just changing the partition count, so Spark won’t move data unnecessarily.” But that assumption is wrong. <code>repartition()</code> always performs a full shuffle, even when:</p>
<ul>
<li><p>You are reducing partitions (from 200 → 10), or</p>
</li>
<li><p>You are increasing partitions (from 10 → 200).</p>
</li>
</ul>
<p>In both cases, Spark redistributes every row across the cluster: round-robin when you pass only a partition count, or by hash when you pass columns. So even if your data is already partitioned optimally, repartition() will still reshuffle it, adding a stage boundary.</p>
<p><strong>Logical Plan:</strong></p>
<pre><code class="lang-python">Exchange RoundRobinPartitioning(<span class="hljs-number">10</span>)

└─ LogicalRDD [...]
</code></pre>
<p>That Exchange node signals a wide dependency: Spark writes intermediate shuffle data to disk, transfers it over the network, and reloads it before the next stage. In short: repartition() = "new shuffle, no matter what."</p>
<h4 id="heading-the-better-approach-coalesce">The Better Approach: coalesce()</h4>
<p>If your goal is to reduce the number of partitions (for example, before writing results to S3 or Snowflake), use coalesce() instead.</p>
<p><code>df_result = df.coalesce(10)</code></p>
<p>coalesce() merges existing partitions without the costly reshuffle step. It uses a narrow dependency, meaning each output partition is built from one or more whole existing partitions, preferably ones <em>already on the same node</em>.</p>
<pre><code class="lang-python">Coalesce

└─ LogicalRDD [...]
</code></pre>
<ul>
<li><p>No Exchange.</p>
</li>
<li><p>No network shuffle.</p>
</li>
<li><p>Just local merges – fast and cheap.</p>
</li>
</ul>
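<p>The difference between the two operations can be pictured with a toy model in plain Python. This is a simplified sketch, not Spark’s actual algorithm: <code>toy_coalesce</code> and <code>toy_repartition</code> are hypothetical names, partitions are plain lists, and the round-robin redistribution stands in for a full shuffle.</p>

```python
def toy_coalesce(partitions, n):
    """Merge whole neighbouring partitions into n groups (assumes n is at least 1).

    Rows are never redistributed individually, so uneven input stays uneven.
    """
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every single row round-robin across n new partitions."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print(toy_coalesce(skewed, 2))     # [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10]]
print(toy_repartition(skewed, 2))  # [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
```

<p>The model also shows the trade-off: the coalesced output keeps the original skew (7 rows versus 3), while the repartitioned output is balanced at the cost of moving every row.</p>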
<h4 id="heading-real-world-benchmark-aws-glue-8">Real-World Benchmark: AWS Glue</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
df_repart = df.repartition(<span class="hljs-number">10</span>)
df_repart.count()
print(<span class="hljs-string">"Repartition time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

start = time.time()
df_coalesced = df.coalesce(<span class="hljs-number">10</span>)
df_coalesced.count()
print(<span class="hljs-string">"Coalesce time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Plan Node</strong></td><td><strong>Shuffle Triggered</strong></td><td><strong>Glue Runtime</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>repartition(10)</td><td>Exchange</td><td>Yes</td><td>18.2 s</td><td>Full cluster reshuffle</td></tr>
<tr>
<td>coalesce(10)</td><td>Coalesce</td><td>No</td><td>1.99 s</td><td>Local partition merge only</td></tr>
</tbody>
</table>
</div><p>Even though both ended with 10 partitions, repartition() took about 9× longer (18.2 s versus 1.99 s), all because of the unnecessary shuffle.</p>
<h4 id="heading-why-this-matters">Why This Matters</h4>
<p>Each Exchange node in your logical plan creates a new stage in your DAG, meaning:</p>
<ul>
<li><p>Extra disk I/O</p>
</li>
<li><p>Extra serialization</p>
</li>
<li><p>Extra network transfer</p>
</li>
</ul>
<p>That’s why avoiding just one shuffle in a Glue ETL pipeline can save seconds to minutes per run, especially on wide datasets.</p>
<p><strong>When to use which:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Goal</strong></td><td><strong>Transformation</strong></td><td><strong>Reasoning</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Increase parallelism for heavy groupBy or join</td><td>repartition()</td><td>Distributes data evenly across executors</td></tr>
<tr>
<td>Reduce file count before writing</td><td>coalesce()</td><td>Avoids shuffle, merges partitions locally</td></tr>
<tr>
<td>Rebalance skewed data before a join</td><td>repartition("key")</td><td>Enables better key distribution</td></tr>
<tr>
<td>Optimize output after aggregation</td><td>coalesce()</td><td>Prevents too many small output files</td></tr>
</tbody>
</table>
</div><h4 id="heading-aqe-and-auto-coalescing">AQE and Auto Coalescing</h4>
<p>You can enable Adaptive Query Execution (AQE) in AWS Glue 3.0+ to let Spark merge small shuffle partitions automatically:</p>
<p><code>spark.conf.set("spark.sql.adaptive.enabled", "true")</code></p>
<p><code>spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")</code></p>
<p>With AQE, Spark dynamically combines small partitions <em>after</em> shuffle to balance performance and I/O.</p>
<p>repartition() always triggers a shuffle, while coalesce() avoids shuffles and is ideal for merging partitions before writes. You should always inspect Exchange nodes to identify shuffle points. In the AWS Glue benchmark above, avoiding a single shuffle cut runtime by roughly 9× (18.2 s to 1.99 s) at the 1M-row scale. Finally, use AQE to enable dynamic partition coalescing in larger workflows.</p>
<h3 id="heading-scenario-12-know-your-shuffle-triggers">Scenario 12: Know Your Shuffle Triggers</h3>
<p>Much of Spark's performance comes from invisible data movement. Every shuffle boundary adds a new stage, a new write–read cycle, and sometimes minutes of extra execution time.</p>
<p>In Spark, any operation that requires rearranging data between partitions introduces a wide dependency, represented in the logical plan as an Exchange node.</p>
<p>Common shuffle triggers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Operation</strong></td><td><strong>Why It Shuffles</strong></td><td><strong>Plan Node</strong></td></tr>
</thead>
<tbody>
<tr>
<td>join()</td><td>Records with the same key must be co-located for matching</td><td>Exchange (on join keys)</td></tr>
<tr>
<td>groupBy() / agg()</td><td>Keys must gather to a single partition for aggregation</td><td>Exchange</td></tr>
<tr>
<td>distinct()</td><td>Spark must compare all values across partitions</td><td>Exchange</td></tr>
<tr>
<td>orderBy()</td><td>Requires global ordering of data</td><td>Exchange</td></tr>
<tr>
<td>repartition()</td><td>Explicit reshuffle for partition balancing</td><td>Exchange</td></tr>
</tbody>
</table>
</div><p>Each Exchange means a shuffle stage: Spark writes partition data to disk, transfers it over the network, and reads it back into memory on the next stage. That’s your hidden performance cliff.</p>
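<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

df_result = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))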
      .join(df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
            .distinct(), <span class="hljs-string">"department"</span>)
      .orderBy(<span class="hljs-string">"total_salary"</span>, ascending=<span class="hljs-literal">False</span>)
)

df_result.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p><strong>Logical Plan Simplified:</strong></p>
<pre><code class="lang-python">Sort [total_salary DESC]

└─ Exchange (<span class="hljs-keyword">global</span> sort)

   └─ SortMergeJoin [department]

      ├─ Exchange (groupBy shuffle)

      │   └─ HashAggregate (sum salary)

      └─ Exchange (distinct shuffle)

          └─ Aggregate (department, country)
</code></pre>
<p>We can see three Exchange nodes, one for the aggregation, one for the distinct join, and one for the global sort. That’s three separate shuffles, three full dataset transfers.</p>
<h4 id="heading-better-approach">Better Approach</h4>
<p>Whenever possible, combine wide transformations into a single stage before an action. For instance, you can compute aggregates and join results in one consistent shuffle domain:</p>
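<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))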

country_df = df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>).distinct()

df_result = (
    agg_df.join(country_df, <span class="hljs-string">"department"</span>)
          .sortWithinPartitions(<span class="hljs-string">"total_salary"</span>, ascending=<span class="hljs-literal">False</span>)
)
</code></pre>
<p><strong>Logical Plan Simplified:</strong></p>
<pre><code class="lang-python">SortWithinPartitions [total_salary DESC]

└─ SortMergeJoin [department]

   ├─ Exchange (shared shuffle <span class="hljs-keyword">for</span> join)

   └─ Exchange (shared shuffle <span class="hljs-keyword">for</span> distinct)
</code></pre>
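<p>Now the global sort’s Exchange is gone. The two remaining Exchange nodes both partition by department to feed the join, and the final sort runs inside each partition as a narrow transformation. Keep in mind that sortWithinPartitions() orders rows only within each partition, so it replaces orderBy() only when a total order across the whole result is not required.</p>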
<h4 id="heading-real-world-benchmark-aws-glue-1m">Real-World Benchmark: AWS Glue (1M)</h4>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
[<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>, <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>]).repartition(<span class="hljs-number">20</span>)

<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

start = time.time()

dept_salary = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
)

dept_country = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
      .distinct()
)

naive_result = (
    dept_salary.join(dept_country, <span class="hljs-string">"department"</span>, <span class="hljs-string">"inner"</span>)
               .orderBy(col(<span class="hljs-string">"total_salary"</span>).desc())
)

naive_count = naive_result.count()
naive_time = round(time.time() - start, <span class="hljs-number">2</span>)


start = time.time()

dept_country_once = (
    df.select(<span class="hljs-string">"department"</span>, <span class="hljs-string">"country"</span>)
      .distinct()
)

optimized = (
    df.groupBy(<span class="hljs-string">"department"</span>)
      .agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
      .join(dept_country_once, <span class="hljs-string">"department"</span>, <span class="hljs-string">"inner"</span>)
      .sortWithinPartitions(col(<span class="hljs-string">"total_salary"</span>).desc())
      <span class="hljs-comment"># local ordering, avoids extra global shuffle</span>
)

opt_count = optimized.count()
opt_time = round(time.time() - start, <span class="hljs-number">2</span>)

print(<span class="hljs-string">"Optimized result count:"</span>, opt_count)
print(<span class="hljs-string">"Optimized pipeline time:"</span>, opt_time, <span class="hljs-string">"sec"</span>)

print(<span class="hljs-string">"\nOptimized plan:"</span>)
optimized.explain(<span class="hljs-string">"formatted"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Pipeline</strong></td><td><strong># of Shuffles</strong></td><td><strong>Glue Runtime (sec)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Naive: groupBy + distinct + orderBy</td><td>3</td><td>28.99 s</td><td>Multiple wide stages</td></tr>
<tr>
<td>Optimized: combined agg + join + sortWithinPartitions</td><td>1</td><td>3.52 s</td><td>Single wide stage</td></tr>
</tbody>
</table>
</div><p>By merging compatible stages and using sortWithinPartitions() instead of a global orderBy(), the job ran significantly faster on the same dataset, with fewer Exchange nodes and a shorter lineage. To find shuffles in your own jobs, run df.explain() and search for Exchange: each one signals a full shuffle. You can also check Spark UI → SQL tab → Exchange Read/Write Size to see exactly how much data moved.</p>
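<p>Searching a plan for Exchange can be automated. The helpers below are a rough sketch with hypothetical names (<code>explain_text</code>, <code>count_shuffle_exchanges</code>). They assume that <code>explain()</code> prints to Python’s standard output, as PySpark’s does, and that the default plan layout lists each operator once. The sample plan is hand-written for illustration, not captured from the benchmark.</p>

```python
import contextlib
import io
import re

# Matches "Exchange" as a standalone word, so BroadcastExchange
# and ReusedExchange are not counted as shuffles.
SHUFFLE_EXCHANGE = re.compile(r"\bExchange\b")


def explain_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_shuffle_exchanges(plan_text):
    """Count plan lines that contain a standalone Exchange operator."""
    return sum(1 for line in plan_text.splitlines() if SHUFFLE_EXCHANGE.search(line))


# Hand-written sample plan (an assumption, for illustration only).
sample_plan = """Sort [total_salary DESC]
+- Exchange rangepartitioning(total_salary DESC, 200)
   +- SortMergeJoin [department]
      :- Exchange hashpartitioning(department, 200)
      +- Exchange hashpartitioning(department, 200)
         +- BroadcastExchange HashedRelationBroadcastMode
"""

print(count_shuffle_exchanges(sample_plan))  # 3
```

<p>With a real DataFrame, <code>count_shuffle_exchanges(explain_text(df_result))</code> gives a quick number to track while refactoring a pipeline.</p>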
<p>Every Exchange represents a shuffle, adding serialization, network I/O, and stage overhead, so avoid chaining wide operations back-to-back by combining them under a consistent partition key. Prefer sortWithinPartitions() over global orderBy() when ordering is local, monitor plan depth to catch consecutive wide dependencies, and note that in AWS Glue eliminating even one shuffle in a 1M-row job can significantly reduce runtime.</p>
<h3 id="heading-scenario-13-tune-parallelism-shuffle-partitions-amp-aqe">Scenario 13: Tune Parallelism: Shuffle Partitions &amp; AQE</h3>
<p>Most Spark jobs are either over-parallelized (thousands of tiny tasks doing almost nothing, flooding the driver and filesystem) or under-parallelized (a handful of huge tasks doing all the work, causing slow stages and skew-like behavior). Both waste resources. We can control this behavior using spark.sql.shuffle.partitions and Adaptive Query Execution (AQE).</p>
<p>In many environments, <code>spark.conf.get("spark.sql.shuffle.partitions")</code> returns the default of 200, meaning that every shuffle (groupBy, join, distinct, and so on) produces roughly 200 shuffle partitions regardless of data size. Whether this default is reasonable depends entirely on the workload:</p>
<ul>
<li><p>If you’re processing 2 GB, 200 partitions might be great.</p>
</li>
<li><p>If you’re processing 5 MB, 200 partitions is comedy – 200 tiny tasks, overhead &gt; work.</p>
</li>
<li><p>If you’re processing 2 TB, 200 partitions might be too few – tasks become huge and slow.</p>
</li>
</ul>
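<p>One way to reason about these three cases is to work backwards from a target partition size. The sketch below is a hypothetical helper, not a Spark API: the name <code>suggest_shuffle_partitions</code> and the 128 MB target are assumptions chosen for illustration, and a real job would also take the number of available cores into account.</p>

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024**2,
                               min_partitions: int = 1) -> int:
    """Hypothetical rule of thumb: aim for roughly target_partition_bytes per partition."""
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


MB, GB, TB = 1024**2, 1024**3, 1024**4

for label, size in [("5 MB", 5 * MB), ("2 GB", 2 * GB), ("2 TB", 2 * TB)]:
    print(label, "->", suggest_shuffle_partitions(size))
# 5 MB -> 1
# 2 GB -> 16
# 2 TB -> 16384
```

<p>Passing the cluster's core count as <code>min_partitions</code> keeps small inputs from collapsing to a single task while still avoiding hundreds of near-empty ones.</p>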
<h4 id="heading-example-a-the-default-plan-too-many-tiny-tasks">Example A: The Default Plan (Too Many Tiny Tasks)</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> sum <span class="hljs-keyword">as</span> sum_

spark = SparkSession.builder.appName(<span class="hljs-string">"ParallelismExample"</span>).getOrCreate()

spark.conf.get(<span class="hljs-string">"spark.sql.shuffle.partitions"</span>)  <span class="hljs-comment"># '200'</span>

data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">90000</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Engineering"</span>, <span class="hljs-number">85000</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Bob"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">75000</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">"Eve"</span>, <span class="hljs-string">"Sales"</span>, <span class="hljs-number">72000</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">"Grace"</span>, <span class="hljs-string">"HR"</span>, <span class="hljs-number">65000</span>),
]

df = spark.createDataFrame(data, [<span class="hljs-string">"id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>])

agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p>Even though there are only 3 departments, Spark will still create 200 shuffle partitions – meaning 200 tasks for 3 groups of data.</p>
<p><strong>Effect:</strong> Each task has almost nothing to do. Spark spends more time planning and scheduling than actually computing.</p>
<h4 id="heading-example-b-tuned-plan-balanced-parallelism">Example B: Tuned Plan (Balanced Parallelism)</h4>
<pre><code class="lang-python">spark.conf.set(<span class="hljs-string">"spark.sql.shuffle.partitions"</span>, <span class="hljs-string">"8"</span>)
agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.explain(<span class="hljs-string">"formatted"</span>)
</code></pre>
<p>Now Spark uses only <strong>8 shuffle partitions</strong>: still parallelized, but not wasteful. Even in this small example, the difference shows up in the plan: one configuration change produces a much leaner physical plan.</p>
<h4 id="heading-the-real-problem-static-tuning-doesnt-scale">The Real Problem: Static Tuning Doesn’t Scale</h4>
<p>In production, job sizes vary:</p>
<ul>
<li><p>Today: 10 GB</p>
</li>
<li><p>Tomorrow: 500 GB</p>
</li>
<li><p>Next week: 200 MB (sampling run)</p>
</li>
</ul>
<p>Manually changing shuffle partitions for each run is neither practical nor reliable. That’s where Adaptive Query Execution (AQE) steps in.</p>
<h4 id="heading-adaptive-query-execution-aqe-smarter-dynamic-parallelism">Adaptive Query Execution (AQE): Smarter, Dynamic Parallelism</h4>
<p>AQE doesn’t guess. It measures actual shuffle statistics at runtime and rewrites the plan <em>while the job is running.</em></p>
<pre><code class="lang-python">spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.enabled"</span>, <span class="hljs-string">"true"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.coalescePartitions.enabled"</span>, <span class="hljs-string">"true"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.coalescePartitions.minPartitionSize"</span>, <span class="hljs-string">"64m"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.adaptive.advisoryPartitionSizeInBytes"</span>, <span class="hljs-string">"256m"</span>)  <span class="hljs-comment"># target size after coalescing</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Configuration</strong></td><td><strong>Shuffle Partitions</strong></td><td><strong>Task Distribution</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Default</td><td>200</td><td>200 tasks / 3 groups</td><td>Too granular, mostly idle</td></tr>
<tr>
<td>Tuned</td><td>8</td><td>8 tasks / 3 groups</td><td>Balanced execution</td></tr>
</tbody>
</table>
</div><p>AQE merges tiny shuffle partitions, or splits huge ones, based on <strong>real-time data metrics</strong>, not pre-set assumptions.</p>
<pre><code class="lang-python">df = spark.createDataFrame(multiplied_data,
    [<span class="hljs-string">"id"</span>, <span class="hljs-string">"firstname"</span>, <span class="hljs-string">"lastname"</span>, <span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>, <span class="hljs-string">"age"</span>,
     <span class="hljs-string">"hire_date"</span>, <span class="hljs-string">"country"</span>])

start = time.time()
agg_df = df.groupBy(<span class="hljs-string">"department"</span>).agg(sum_(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"total_salary"</span>))
agg_df.count()

print(<span class="hljs-string">f'Num Partitions df: <span class="hljs-subst">{df.rdd.getNumPartitions()}</span>'</span>)
print(<span class="hljs-string">f'Num Partitions aggdf: <span class="hljs-subst">{agg_df.rdd.getNumPartitions()}</span>'</span>)
print(<span class="hljs-string">"Execution time:"</span>, round(time.time() - start, <span class="hljs-number">2</span>), <span class="hljs-string">"sec"</span>)

spark.stop()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Stage</strong></td><td><strong>Without AQE</strong></td><td><strong>With AQE</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Stage 3 (Aggregation)</td><td>200 shuffle partitions, each reading KBs</td><td>8–12 coalesced partitions</td></tr>
<tr>
<td>Stage 4 (Join Output)</td><td>200 shuffle files</td><td>Merged into balanced partitions</td></tr>
<tr>
<td><strong>Result</strong></td><td>Many small tasks, high overhead</td><td>Fewer, balanced tasks, faster runtime</td></tr>
</tbody>
</table>
</div><h4 id="heading-understanding-the-plan"><strong>Understanding the Plan</strong></h4>
<p>Before AQE (static):</p>
<p><code>Exchange hashpartitioning(department, 200)</code></p>
<p>With AQE: AdaptiveSparkPlan (coalesced)</p>
<p><code>HashAggregate(keys=[department], functions=[sum(salary)])</code></p>
<p><code>Exchange hashpartitioning(department, 200)</code>  <em># runtime coalesced to 12</em></p>
<p>The logical plan remains the same, but the physical execution plan is rewritten during runtime. Spark intelligently reduces or merges shuffle partitions based on data volume.</p>
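<p>To build intuition for what coalescing does, here is a simplified pure-Python simulation. It is not Spark's actual implementation: the function name <code>coalesce_partitions</code>, the greedy merge rule, and the sizes are all assumptions made for illustration.</p>

```python
def coalesce_partitions(sizes: list[int], advisory_bytes: int) -> list[list[int]]:
    """Greedily merge adjacent partitions until adding one more would exceed advisory_bytes.

    Returns groups of original partition indices, one group per coalesced partition.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


KB = 1024
tiny_partitions = [4 * KB] * 200  # 200 shuffle partitions of 4 KB each
groups = coalesce_partitions(tiny_partitions, advisory_bytes=64 * KB)

print(len(groups))  # 13
```

<p>The 200 planned partitions still appear in the Exchange node, but only 13 tasks would read them in this toy setup, which mirrors how the plan above shows 200 while the runtime uses far fewer.</p>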
<p>Spark’s default 200 shuffle partitions often misfit real workloads. Static tuning may work for predictable pipelines, but fails with variable data. AQE, on the other hand, uses shuffle statistics to coalesce partitions dynamically at runtime. Use it with a sensible ceiling (for example, 400 partitions), and always verify in the Spark UI to catch over-partitioning (many tasks reading KBs) or under-partitioning (few tasks reading GBs).</p>
<h3 id="heading-scenario-14-handle-skew-smartly">Scenario 14: Handle Skew Smartly</h3>
<p>In an ideal Spark world, all partitions contain roughly equal amounts of data. But real datasets are rarely that kind. If one key (say "USA", "2024", or "customer_123") holds millions of rows while others have only a few, Spark ends up with one or two massive partitions. Those partitions take disproportionately longer to process, leaving other executors idle. That’s data skew: the silent killer of parallelism.</p>
<p>You’ll often spot it in Spark UI:</p>
<ul>
<li><p>198 tasks finish quickly.</p>
</li>
<li><p>2 tasks take 10× longer.</p>
</li>
<li><p>Stage stays stuck at 98% for minutes.</p>
</li>
</ul>
<h4 id="heading-example-a-the-skew-problem">Example A: The Skew Problem</h4>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession, functions <span class="hljs-keyword">as</span> F

spark = SparkSession.builder.appName(<span class="hljs-string">"DataSkewDemo"</span>).getOrCreate()

<span class="hljs-comment"># Create skewed dataset</span>
df = spark.range(<span class="hljs-number">0</span>, <span class="hljs-number">10000</span>).toDF(<span class="hljs-string">"id"</span>) \
    .withColumn(<span class="hljs-string">"department"</span>,
        F.when(F.col(<span class="hljs-string">"id"</span>) &lt; <span class="hljs-number">8000</span>, <span class="hljs-string">"Engineering"</span>)  <span class="hljs-comment"># 80% of data</span>
         .when(F.col(<span class="hljs-string">"id"</span>) &lt; <span class="hljs-number">9000</span>, <span class="hljs-string">"Sales"</span>)
         .otherwise(<span class="hljs-string">"HR"</span>)) \
    .withColumn(<span class="hljs-string">"salary"</span>, (F.rand() * <span class="hljs-number">100000</span>).cast(<span class="hljs-string">"int"</span>))

df.groupBy(<span class="hljs-string">"department"</span>).count().show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464257950/6963171b-92de-4721-9bb3-6951c68a2775.png" alt="6963171b-92de-4721-9bb3-6951c68a2775" class="image--center mx-auto" width="594" height="446" loading="lazy"></p>
<p>Spark will hash “Engineering” into just one reducer partition, making it much heavier than the others. That single task becomes the bottleneck: every other task has finished, but the stage cannot complete until that one lagging task does.</p>
<h4 id="heading-example-b-the-solution-salting-hot-keys">Example B: The Solution: Salting Hot Keys</h4>
<p>To handle skew, we split the hot key (Engineering) into multiple pseudo-keys using a random salt. This redistributes that large partition across multiple reducers.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> rand, concat, lit, floor

salt_buckets = <span class="hljs-number">10</span>

df_salted = (
    df.withColumn(
        <span class="hljs-string">"department_salted"</span>,
        F.when(F.col(<span class="hljs-string">"department"</span>) == <span class="hljs-string">"Engineering"</span>,
            F.concat(F.col(<span class="hljs-string">"department"</span>), lit(<span class="hljs-string">"_"</span>),
                     (F.floor(rand() * salt_buckets))))
         .otherwise(F.col(<span class="hljs-string">"department"</span>))
    )
)

df_salted.groupBy(<span class="hljs-string">"department_salted"</span>).agg(F.avg(<span class="hljs-string">"salary"</span>))
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464242395/c4ec0bc6-67bf-488c-b619-7130ceef878e.png" alt="c4ec0bc6-67bf-488c-b619-7130ceef878e" class="image--center mx-auto" width="536" height="468" loading="lazy"></p>
<p>Now “Engineering” isn’t one hot key – it’s <strong>10 smaller keys</strong> like Engineering_0, Engineering_1, ..., Engineering_9. These keys hash independently, so they can spread across several reducer partitions and be processed in parallel.</p>
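<p>The effect of salting on partition sizes can be sketched without a cluster. The simulation below makes several assumptions for illustration: it uses CRC32 as a deterministic stand-in for Spark's hash partitioner (Spark itself uses Murmur3), a row-index salt instead of <code>rand()</code>, and 8 partitions.</p>

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
SALT_BUCKETS = 10    # same bucket count as the salting example


def partition_for(key: str) -> int:
    # Deterministic stand-in for a hash partitioner (assumption: CRC32, not Murmur3).
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Same 80/10/10 split as the skewed dataset.
rows = ["Engineering"] * 8000 + ["Sales"] * 1000 + ["HR"] * 1000

# Rows per partition when the raw department is the shuffle key.
plain = Counter(partition_for(dept) for dept in rows)

# Rows per partition when only the hot key gets a salt suffix.
salted = Counter(
    partition_for(f"{dept}_{i % SALT_BUCKETS}" if dept == "Engineering" else dept)
    for i, dept in enumerate(rows)
)

print("Largest partition without salting:", max(plain.values()))
print("Largest partition with salting:   ", max(salted.values()))
```

<p>Without salting, one partition always holds at least the 8,000 Engineering rows. With salting, those rows are divided into 10 keys of 800 rows each, so the largest partition is usually much smaller, although some salted keys can still collide on the same partition.</p>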
<h4 id="heading-example-c-post-aggregation-desalting">Example C: Post-Aggregation Desalting</h4>
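<p>After aggregating, recombine the salted keys to get the original department names. Carry partial sums and counts through the first aggregation rather than averaging the per-bucket averages: the salt buckets are not guaranteed to be the same size, so an average of averages would be slightly off.</p>
<pre><code class="lang-python">df_final = (
    df_salted.groupBy(<span class="hljs-string">"department_salted"</span>)
        <span class="hljs-comment"># Stage 1: partial sum and count per salted key</span>
        .agg(F.sum(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"salary_sum"</span>),
             F.count(<span class="hljs-string">"salary"</span>).alias(<span class="hljs-string">"salary_count"</span>))
        .withColumn(<span class="hljs-string">"department"</span>, F.split(F.col(<span class="hljs-string">"department_salted"</span>), <span class="hljs-string">"_"</span>)
            .getItem(<span class="hljs-number">0</span>))
        <span class="hljs-comment"># Stage 2: merge the partials back into one row per department</span>
        .groupBy(<span class="hljs-string">"department"</span>)
        .agg((F.sum(<span class="hljs-string">"salary_sum"</span>) / F.sum(<span class="hljs-string">"salary_count"</span>)).alias(<span class="hljs-string">"final_avg_salary"</span>))
)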
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769464321049/6349c2c3-a0e3-4f9e-be3e-c59639004128.png" alt="6349c2c3-a0e3-4f9e-be3e-c59639004128" class="image--center mx-auto" width="540" height="242" loading="lazy"></p>
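<p>When recombining salted buckets, it helps to remember that an average of bucket averages only matches the true average if every bucket has the same number of rows. The tiny example below uses made-up numbers and hypothetical bucket names to show the difference between the two ways of merging.</p>

```python
# Two salted buckets of unequal size for the same department (made-up numbers).
buckets = {
    "Engineering_0": [100, 100, 100, 100],  # 4 rows, bucket average 100
    "Engineering_1": [200],                 # 1 row, bucket average 200
}

# Averaging the bucket averages weights both buckets equally.
bucket_avgs = [sum(values) / len(values) for values in buckets.values()]
avg_of_avgs = sum(bucket_avgs) / len(bucket_avgs)

# Carrying partial sums and counts reproduces the true average.
total = sum(sum(values) for values in buckets.values())
count = sum(len(values) for values in buckets.values())
true_avg = total / count

print(avg_of_avgs)  # 150.0
print(true_avg)     # 120.0
```

<p>With a random salt over thousands of rows, the buckets tend to be close in size, so the gap is small, but merging sums and counts is exact regardless of how the rows are distributed.</p>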
<h4 id="heading-when-to-use-salting">When to Use Salting</h4>
<p>Use salting when:</p>
<ul>
<li><p>You observe stage skew (one or few long tasks).</p>
</li>
<li><p>Shuffle read sizes vary drastically between tasks.</p>
</li>
<li><p>The skew originates from a few dominant key values.</p>
</li>
</ul>
<p>Avoid it when:</p>
<ul>
<li><p>The dataset is small (&lt; 1 GB).</p>
</li>
<li><p>You already use partitioning or bucketing keys with uniform distribution.</p>
</li>
</ul>
<p><strong>Alternative approaches:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Technique</strong></td><td><strong>Use Case</strong></td><td><strong>Pros</strong></td><td><strong>Cons</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Salting (manual)</td><td>Skewed joins/aggregations</td><td>Full control</td><td>Requires extra logic to merge</td></tr>
<tr>
<td>Skew hint (/*+ SKEW('table') */)</td><td>Joins on Databricks Runtime (not part of open-source Apache Spark)</td><td>No extra columns needed</td><td>Works only on joins</td></tr>
<tr>
<td>Broadcast smaller side</td><td>One table ≪ other</td><td>Avoids shuffle on big side</td><td>Limited by broadcast size</td></tr>
<tr>
<td>AQE skew optimization</td><td>Spark 3.0+</td><td>Automatic handling</td><td>Needs AQE enabled</td></tr>
</tbody>
</table>
</div><h4 id="heading-glue-specific-tip">Glue-Specific Tip</h4>
<p>AWS Glue 3.0+ includes Spark 3.x, meaning you can also enable AQE’s built-in skew optimization:</p>
<p><code>spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")</code></p>
<p><code>spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "128m")</code></p>
<p>With these settings, Spark automatically detects oversized shuffle partitions in joins and splits them, effectively auto-salting hot keys at runtime. Data skew causes uneven shuffle sizes across tasks and can be detected in the Spark UI or via shuffle read/write metrics. Mitigate heavy-key skew with manual salting (recombined later), or rely on AQE skew join optimization for milder skew in joins, and always validate improvements in the Spark UI SQL tab by checking “Shuffle Read Size.”</p>
<h3 id="heading-scenario-15-sort-efficiently-orderby-vs-sortwithinpartitions">Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)</h3>
<p>Most Spark jobs need sorted data at some point – for window functions, for writing ordered files, or for downstream processing. The instinct is to reach for orderBy(). But that instinct costs you a full shuffle every single time.</p>
<h4 id="heading-the-problem-global-sort-when-you-dont-need-it">The Problem: Global Sort When You Don't Need It</h4>
<p>Let's say you want to write employee data partitioned by department, sorted by salary within each department:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col

<span class="hljs-comment"># Naive approach: global sort</span>
df_sorted = df.orderBy(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())

df_sorted.write.partitionBy(<span class="hljs-string">"department"</span>).parquet(<span class="hljs-string">"s3://output/employees/"</span>)
</code></pre>
<p>This looks reasonable. You're sorting by department and salary, then writing partitioned files. Clean and simple. But here's what Spark actually does:</p>
<p><strong>Simplified Plan:</strong></p>
<pre><code class="lang-python">Sort [department ASC, salary DESC], true

└─ Exchange rangepartitioning(department ASC, salary DESC, <span class="hljs-number">200</span>)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>That Exchange <code>rangepartitioning</code> is a full shuffle. So Spark:</p>
<ul>
<li><p>Samples the data to determine range boundaries</p>
</li>
<li><p>Redistributes every row across 200 partitions based on sort keys</p>
</li>
<li><p>Sorts each partition locally</p>
</li>
<li><p>Produces globally ordered output</p>
</li>
</ul>
<p>You just shuffled 1 million rows across the cluster to achieve global ordering – even though you're immediately partitioning by department on write, which destroys that global order anyway.</p>
<h4 id="heading-why-this-hurts">Why This Hurts</h4>
<p>Range partitioning for global sort is one of the most expensive shuffles Spark performs:</p>
<ul>
<li><p>Sampling overhead: Spark must scan data twice (once to sample, once to process)</p>
</li>
<li><p>Network transfer: Every row moves to a new executor based on range boundaries</p>
</li>
<li><p>Disk I/O: Shuffle files written and read from disk</p>
</li>
<li><p>Wasted work: Global ordering across departments is meaningless when you partition by department</p>
</li>
</ul>
<p>In the 1M-row benchmark below, the global sort adds roughly 8 seconds of shuffle overhead (10.34 s versus 2.18 s).</p>
<h4 id="heading-the-better-approach-sort-locally-within-partitions">The Better Approach: Sort Locally Within Partitions</h4>
<p>If you only need ordering <em>within</em> each department (or within each output partition), use sortWithinPartitions():</p>
<pre><code class="lang-python"><span class="hljs-comment"># Optimized approach: local sort only</span>
df_sorted = df.sortWithinPartitions(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_sorted.write.partitionBy(<span class="hljs-string">"department"</span>).parquet(<span class="hljs-string">"s3://output/employees/"</span>)
</code></pre>
<p><strong>Simplified Logical Plan:</strong></p>
<pre><code class="lang-python">Sort [department ASC, salary DESC], false

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<ul>
<li><p>No Exchange.</p>
</li>
<li><p>No shuffle.</p>
</li>
<li><p>Just local sorting within existing partitions.</p>
</li>
</ul>
<p>Spark sorts each partition in-place, without moving data across the network. The false flag in the Sort node indicates this is a local sort, not a global one.</p>
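<p>The difference between the two sorts can be sketched with plain Python lists standing in for partitions. The values and range boundaries below are made up for illustration; in Spark, the boundaries come from sampling the data.</p>

```python
# Three partitions as they might exist before sorting (made-up salaries).
partitions = [[70, 30, 90], [20, 80], [60, 10, 50]]

# sortWithinPartitions(): each partition is sorted on its own; no rows move.
local = [sorted(p) for p in partitions]

# orderBy(): rows are first redistributed into value ranges (the shuffle), then sorted.
boundaries = [40, 70]  # hypothetical range boundaries chosen from a sample
ranges = [[], [], []]
for p in partitions:
    for value in p:
        target = sum(value >= b for b in boundaries)  # index of the range for this row
        ranges[target].append(value)
global_sorted = [sorted(r) for r in ranges]

print(local)          # [[30, 70, 90], [20, 80], [10, 50, 60]]
print(global_sorted)  # [[10, 20, 30], [50, 60], [70, 80, 90]]
```

<p>In the local result every row stays in its original partition, while the global result required almost every row to change partitions before sorting. That movement is the cost the Exchange node represents.</p>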
<h4 id="heading-real-world-benchmark-aws-glue-9">Real-World Benchmark: AWS Glue</h4>
<p>Let's measure the difference on 1 million employee records. First, the global sort with orderBy():</p>
<pre><code class="lang-python">print(<span class="hljs-string">"\n--- Testing orderBy() (global sort) ---"</span>)

start = time.time()

df_global = df.orderBy(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_global.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(<span class="hljs-string">"/tmp/global_sort_output"</span>)

global_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"orderBy() time: <span class="hljs-subst">{global_time}</span>s"</span>)
</code></pre>
<p>Local Sort:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"\n--- Testing sortWithinPartitions() (local sort) ---"</span>)

start = time.time()

df_local = df.sortWithinPartitions(col(<span class="hljs-string">"department"</span>), col(<span class="hljs-string">"salary"</span>).desc())
df_local.write.mode(<span class="hljs-string">"overwrite"</span>).parquet(<span class="hljs-string">"/tmp/local_sort_output"</span>)

local_time = round(time.time() - start, <span class="hljs-number">2</span>)
print(<span class="hljs-string">f"sortWithinPartitions() time: <span class="hljs-subst">{local_time}</span>s"</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Approach</strong></td><td><strong>Plan Type</strong></td><td><strong>Execution Time (1M rows)</strong></td><td><strong>Observation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>orderBy()</td><td>Exchange rangepartitioning</td><td>10.34 s</td><td>Full shuffle for global sort</td></tr>
<tr>
<td>sortWithinPartitions()</td><td>Local Sort (no Exchange)</td><td>2.18 s</td><td>In-place sorting, no network transfer</td></tr>
</tbody>
</table>
</div><p><strong>Physical Plan Differences:</strong></p>
<p><strong>orderBy() Physical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">2</span>) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, <span class="hljs-number">0</span>

+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, <span class="hljs-number">200</span>)

   +- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

      +- *(<span class="hljs-number">1</span>) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>The Exchange rangepartitioning node marks the shuffle boundary. Spark must:</p>
<ul>
<li><p>Sample data to determine range splits</p>
</li>
<li><p>Redistribute all rows across executors</p>
</li>
<li><p>Sort within each range partition</p>
</li>
</ul>
<p><strong>sortWithinPartitions() Physical Plan:</strong></p>
<pre><code class="lang-python">*(<span class="hljs-number">1</span>) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, <span class="hljs-number">0</span>

+- *(<span class="hljs-number">1</span>) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(<span class="hljs-number">1</span>) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]
</code></pre>
<p>No Exchange. The false flag in Sort indicates local sorting only. Each partition is sorted independently, in parallel, without any data movement.</p>
<p><strong>When to Use Which:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Use Case</strong></td><td><strong>Method</strong></td><td><strong>Why</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Writing partitioned files (Parquet, Delta)</td><td>sortWithinPartitions()</td><td>Partition-level order is sufficient; global order wasted</td></tr>
<tr>
<td>Window functions with ROWS BETWEEN</td><td>sortWithinPartitions()</td><td>Only need order within each window partition</td></tr>
<tr>
<td>Top-N per group (rank, dense_rank)</td><td>sortWithinPartitions()</td><td>Ranking is local to each partition key</td></tr>
<tr>
<td>Final output must be globally ordered</td><td>orderBy()</td><td>Need total order across all partitions</td></tr>
<tr>
<td>Downstream system requires strict ordering</td><td>orderBy()</td><td>For example, time-series data for sequential processing</td></tr>
<tr>
<td>Sorting before coalesce() for fewer output files</td><td>sortWithinPartitions()</td><td>Maintains order within merged partitions</td></tr>
</tbody>
</table>
</div><h4 id="heading-common-anti-pattern">Common Anti-Pattern</h4>
<pre><code class="lang-python">df.orderBy(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
  .write.partitionBy(<span class="hljs-string">"department"</span>) \
  .parquet(<span class="hljs-string">"output/"</span>)
</code></pre>
<p><strong>Problem:</strong> You're globally sorting by department, then immediately partitioning by department. The global order is destroyed during partitioning.</p>
<p>Here’s the fix:</p>
<pre><code class="lang-python">df.sortWithinPartitions(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
  .write.partitionBy(<span class="hljs-string">"department"</span>) \
  .parquet(<span class="hljs-string">"output/"</span>)
</code></pre>
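<p>If each department's rows should also end up together before the write, repartition by department first and then sort locally. This adds one hash shuffle, but it is typically cheaper than the range-partitioned global sort and avoids scattering every department across many small files:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Co-locate each department, then sort locally</span>
df.repartition(<span class="hljs-string">"department"</span>) \
    .sortWithinPartitions(<span class="hljs-string">"department"</span>, <span class="hljs-string">"salary"</span>) \
    .write.partitionBy(<span class="hljs-string">"department"</span>) \
    .parquet(<span class="hljs-string">"output/"</span>)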
</code></pre>
<p>orderBy() triggers an expensive full shuffle using range partitioning, while sortWithinPartitions() sorts data locally without a shuffle and was 4–5× faster in the benchmark above. Use sortWithinPartitions() when writing partitioned files, computing window functions with partitionBy(), or when order is needed only within groups. Reserve orderBy() strictly for true global ordering, because in most production ETL, the best sort is the one that doesn’t shuffle.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You began this handbook likely wondering why your Spark application was slow, and now you see that the answer was both clear and not so clear: your problem was never your Spark application, your configuration, or your version of Spark. It was your plan all along.</p>
<p>You now understand that Spark runs plans, not code, that transformation order affects logical plans, that shuffles generate stages and are key to runtime performance, and that examining your physical plans allows you to directly link your application performance issues back to your problematic line of code.</p>
<p>And you’ve seen this pattern repeat across many scenarios: problem, plan, solution, improved plan, and so forth, until optimization feels less like a dark art and more like a certainty.</p>
<p>This is the Spark optimization mindset: read plans before you write code, and challenge every single Exchange. Engineers who write high-performance Spark jobs minimize shuffles, filter early, project narrowly, deal with skew carefully, and validate everything via explain() and the Spark UI. Once you learn to read the plan, Spark performance becomes mechanical.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Boxplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-boxplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">69693680d6f0e208b327d21c</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 18:48:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768418231372/f36e1cca-eed9-4620-bd7c-19788d8beafe.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.</p>
<p>By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-boxplots">How to Use Boxplots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-perform-exploratory-data-analysis">How to Perform Exploratory Data Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before you begin, you should be comfortable with the following:</p>
<ul>
<li><p>Basic R syntax (variables, functions, data frames).</p>
</li>
<li><p>Installing and loading R packages.</p>
</li>
<li><p>Understanding what rows and columns represent in a dataset.</p>
</li>
<li><p>Very basic statistics (mean, median, distributions).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Start by installing and loading the packages you will need.</p>
<pre><code class="lang-r">install.packages(c(<span class="hljs-string">"tidyverse"</span>, <span class="hljs-string">"ggplot2"</span>))
<span class="hljs-keyword">library</span>(tidyverse)
<span class="hljs-keyword">library</span>(ggplot2)
</code></pre>
<p><code>tidyverse</code> provides tools for data manipulation and visualization. <code>ggplot2</code> is the visualization engine you will use for boxplots. Loading the libraries makes their functions available for use.</p>
<h2 id="heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</h2>
<p>First, download the <a target="_blank" href="https://www.kaggle.com/datasets/saadharoon27/hr-analytics-dataset">HR Analytics dataset by Saad Haroon from Kaggle</a>.</p>
<p>Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load the file into R using that path.</p>
<p>You can view a sample of the dataset by running the <code>head</code> function. To view the structure of the dataset, you can run the <code>str</code> function.</p>
<pre><code class="lang-r">hr &lt;- read.csv(<span class="hljs-string">"C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv"</span>)
head(hr)
str(hr)
</code></pre>
<p>The <code>read.csv</code> function imports the dataset into R. The <code>head</code> function shows the first six rows so you can preview the data. The <code>str</code> function reveals data types, helping you spot categorical versus numeric variables early.</p>
<p>Understanding your data structure early prevents errors later when plotting or modeling. Once you run the <code>head</code> function, you should see output like this in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768489839861/f304305e-b889-4e25-8315-ff24c5201681.png" alt="first-six-rows-of-a-hr-dataset-shown-in-the-r-console" class="image--center mx-auto" width="1753" height="347" loading="lazy"></p>
<p>From the <code>head</code> output, you can see the following:</p>
<h3 id="heading-structure">Structure</h3>
<ul>
<li><p>Each row represents <strong>one employee</strong>.</p>
</li>
<li><p>Each column represents a <strong>feature/variable</strong> about the employee.</p>
</li>
</ul>
<h3 id="heading-key-columns-amp-meaning">Key Columns &amp; Meaning</h3>
<ul>
<li><p><code>EmpID</code> → Employee identifier</p>
</li>
<li><p><code>Age</code> → Age in years</p>
</li>
<li><p><code>AgeGroup</code> → Age category (for example, <code>18-25</code>)</p>
</li>
<li><p><code>Attrition</code> → Whether the employee left or not (<code>Yes/No</code>)</p>
</li>
<li><p><code>BusinessTravel</code> → Travel frequency (<code>Travel_Rarely</code>, <code>Travel_Frequently</code>, <code>Non-Travel</code>)</p>
</li>
<li><p><code>Department</code> → Employee department</p>
</li>
<li><p><code>DistanceFromHome</code> → Distance from home to office (km)</p>
</li>
<li><p><code>Education</code> / <code>EducationField</code> → Level and field of education</p>
</li>
<li><p><code>EmployeeCount</code> → Usually 1 per employee (redundant)</p>
</li>
<li><p><code>Gender</code> → Male / Female</p>
</li>
<li><p><code>JobRole</code> / <code>JobSatisfaction</code> → Job title and satisfaction level</p>
</li>
<li><p><code>MonthlyIncome</code> / <code>SalarySlab</code> → Salary amount and category</p>
</li>
<li><p><code>YearsAtCompany</code> / <code>YearsInCurrentRole</code> → Experience metrics</p>
</li>
<li><p><code>OverTime</code> → Works overtime (<code>Yes/No</code>)</p>
</li>
<li><p>Other features: <code>PerformanceRating</code>, <code>TrainingTimesLastYear</code>, <code>WorkLifeBalance</code>, <code>StockOptionLevel</code>, and so on.</p>
</li>
</ul>
<h3 id="heading-data-types"><strong>Data Types</strong></h3>
<ul>
<li><p><strong>Numeric</strong> → <code>Age</code>, <code>DistanceFromHome</code>, <code>MonthlyIncome</code>, <code>YearsAtCompany</code></p>
</li>
<li><p><strong>Categorical / Character</strong> → <code>Attrition</code>, <code>Gender</code>, <code>Department</code>, <code>JobRole</code></p>
</li>
</ul>
<h3 id="heading-observations"><strong>Observations</strong></h3>
<ul>
<li><p>The dataset is tabular, like a spreadsheet.</p>
</li>
<li><p>There are multiple categorical columns.</p>
</li>
<li><p>There are multiple numeric columns.</p>
</li>
<li><p>Some columns seem redundant or constant: they hold the same value in every row and so provide no useful information (for example, <code>EmployeeCount</code>).</p>
</li>
</ul>
<p>From the <code>str</code> function, you can gather that:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768488901453/80d8cae9-d569-4749-8028-0a6e9cc128c4.png" alt="r-output-showing-structure-of-hr-dataset" class="image--center mx-auto" width="1046" height="612" loading="lazy"></p>
<p>The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.</p>
<p>Each column has a name, data type, and example values. For instance, <code>Age</code> and <code>DistanceFromHome</code> are numeric (<code>int</code>), with values like 28 or 12. <code>EmpID</code> and <code>Department</code> are character strings (<code>chr</code>), with examples like Research &amp; Development or Sales. Other features include <code>JobRole</code> (Analyst, Manager) and <code>Attrition</code> (Yes/No).</p>
<p>The dataset contains mixed data types. Some columns are numeric, such as <code>MonthlyIncome</code> or <code>YearsAtCompany</code>. Some are character or categorical, like <code>Gender</code> (Male/Female) and <code>BusinessTravel</code> (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, <code>EmployeeCount</code> has the same value of 1 for all rows and does not provide useful information.</p>
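<p>Constant columns like this can also be found programmatically rather than by eye. The snippet below is a minimal sketch on a small made-up data frame (the <code>toy</code> object and its values are assumptions for illustration, not the real HR data): a column is flagged when it holds exactly one distinct value.</p>
<pre><code class="lang-r"># Toy data frame standing in for the HR data (values are made up).
toy &lt;- data.frame(
  EmpID = c("E1", "E2", "E3"),
  EmployeeCount = c(1, 1, 1),
  MonthlyIncome = c(2500, 4100, 3900)
)

# A column is constant when it holds exactly one distinct value.
is_constant &lt;- sapply(toy, function(col) length(unique(col)) == 1)

# Keep only the names of the constant columns.
constant_cols &lt;- names(toy)[is_constant]
constant_cols
</code></pre>
<p>Running the same check on <code>hr</code> instead of <code>toy</code> should list the columns that are candidates for removal.</p>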
<h2 id="heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</h2>
<p>Before visualization, you must clean your data. To find out what needs cleaning, start by inspecting it.</p>
<p>Run the <code>summary</code> function to view summary statistics for each column. Then combine the <code>is.na</code> and <code>colSums</code> functions to count the missing values in each column.</p>
<pre><code class="lang-r">summary(hr)
colSums(is.na(hr))
</code></pre>
<p>The <code>summary</code> function gives quick statistics that help you spot suspicious values. The <code>is.na</code> function marks missing values, and <code>colSums</code> counts them per column. Boxplots make extreme values stand out, so knowing what you are working with is critical.</p>
<p>After running the <code>summary</code> function, the following will appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490404469/ef3bd30d-c3c9-4cf0-9c91-80a0e56f52f5.png" alt="r-summary-output-of-hr-dataset-showing-statistical-distributions" class="image--center mx-auto" width="1778" height="495" loading="lazy"></p>
<p>This shows the basic statistics of each column. After running <code>colSums(is.na(hr))</code>, the following will also appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490678134/00a12c24-224e-4c8f-80ee-bc7bbd4d8ca6.png" alt="r-output-showing-missing-value-counts-per-column-in-hr-dataset" class="image--center mx-auto" width="1832" height="198" loading="lazy"></p>
<p>From this output, you can see that only <code>YearsWithCurrManager</code> has missing values. Its count is <code>57</code>, meaning that <strong>57 employees</strong> don’t have a value for this column.</p>
<p>You can drop this whole column along with the redundant columns (<code>EmployeeCount</code>, <code>Over18</code>, and <code>StandardHours</code>). You can do this with the code below.</p>
<pre><code class="lang-r">hr &lt;- hr %&gt;% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))
</code></pre>
<p>To verify that the columns are gone, use this code:</p>
<pre><code class="lang-r">colnames(hr)
</code></pre>
<p>Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of <strong>categories</strong> (for example, ‘Yes’ and ‘No’ for <code>Attrition</code>), not free-form text.</p>
<pre><code class="lang-r">hr$Attrition &lt;- as.factor(hr$Attrition)
hr$JobRole &lt;- as.factor(hr$JobRole)
hr$Department &lt;- as.factor(hr$Department)
</code></pre>
<p>This also ensures ggplot2 treats them correctly when grouping.</p>
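<p>To see what a factor conversion actually changes, it can help to try it on a tiny vector first. The sketch below uses a made-up character vector (the <code>attrition</code> values are assumptions for illustration) and inspects the resulting categories.</p>
<pre><code class="lang-r"># Toy character vector (values are made up for illustration).
attrition &lt;- c("No", "Yes", "No", "No")

# Convert the text values to a factor with a fixed set of categories.
attrition_factor &lt;- as.factor(attrition)

levels(attrition_factor)   # the distinct categories
table(attrition_factor)    # how many values fall in each category
</code></pre>
<p>The same two calls, for example <code>levels(hr$JobRole)</code>, are a quick way to confirm that a column in the HR data was converted as expected.</p>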
<h2 id="heading-how-to-use-boxplots">How to Use Boxplots</h2>
<p>A boxplot displays key features of a dataset. The median is shown by the line in the middle of the box. The box itself spans the interquartile range (IQR), from the first to the third quartile. The whiskers show the spread of the remaining data: by default, ggplot2 extends them to the most extreme values within 1.5 times the IQR of the box. Values beyond the whiskers appear as individual points and are treated as outliers.</p>
<p>Boxplots are most useful when you want to compare distributions across groups, such as income by job role or age by attrition status.</p>
<p>Let’s start with a simple boxplot of monthly income.</p>
<pre><code class="lang-r">ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"blue"</span>) +
  labs(
    title = <span class="hljs-string">"Distribution of Monthly Income"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The <code>aes</code> function tells ggplot which variable to map to which axis, here <code>MonthlyIncome</code> to the <code>y</code> axis. <code>geom_boxplot</code> draws the boxplot. The <code>labs</code> function sets the plot’s text labels, here the title and the <code>y</code>-axis label.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766410411798/200b1c22-3b73-49f0-ba30-9b83d28f3055.png" alt="A-vertical-boxplot-showing-the-distribution-of-employee-monthly-income." class="image--center mx-auto" width="473" height="523" loading="lazy"></p>
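<p>To see where the box, the whiskers, and the outlier points come from, you can compute the same quantities by hand. This is a minimal sketch on a made-up vector (the values in <code>x</code> are assumptions for illustration), using the common rule that points more than 1.5 times the IQR beyond the quartiles count as outliers.</p>
<pre><code class="lang-r"># Toy values (made up); 50 is deliberately far from the rest.
x &lt;- c(10, 12, 11, 13, 12, 50)

# Quartiles: the edges of the box and the median line.
q &lt;- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range: the height of the box.
iqr &lt;- q[["75%"]] - q[["25%"]]

# Fences: values outside these limits are drawn as individual points.
lower_fence &lt;- q[["25%"]] - 1.5 * iqr
upper_fence &lt;- q[["75%"]] + 1.5 * iqr

outliers &lt;- x[x &lt; lower_fence | x &gt; upper_fence]
outliers
</code></pre>
<p>Swapping <code>x</code> for <code>hr$MonthlyIncome</code> should identify the same high earners that the boxplot draws as separate points.</p>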
<h2 id="heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</h2>
<p>Now let’s compare <code>MonthlyIncome</code> across the values of <code>JobRole</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"lightblue"</span>) +
  theme(axis.text.x = element_text(angle = <span class="hljs-number">45</span>, hjust = <span class="hljs-number">1</span>)) +
  labs(
    title = <span class="hljs-string">"Monthly Income by Job Role"</span>,
    x = <span class="hljs-string">"Job Role"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>Mapping <code>JobRole</code> to the <code>x</code> aesthetic draws one box per job role. The <code>theme</code> call rotates the axis labels by 45 degrees to improve readability. This visualization quickly reveals income differences across roles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766508710023/c12ca136-38bf-492e-af90-24d7021b54a4.png" alt="Multiple-boxplots-comparing-monthly-income-distributions-across-different-job-roles." class="image--center mx-auto" width="852" height="522" loading="lazy"></p>
<h2 id="heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis (EDA)</h2>
<p>Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.</p>
<p>As an example, we can look at <code>YearsAtCompany</code> by <code>Department</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = <span class="hljs-string">"darkblue"</span>) +
  labs(
    title = <span class="hljs-string">"Years at Company by Department"</span>,
    y = <span class="hljs-string">"Years at Company"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766512679598/5e5da8cd-8fe7-4fae-bbe9-362af901b330.png" alt="Boxplots-showing-employee-tenure-across-departments." class="image--center mx-auto" width="842" height="518" loading="lazy"></p>
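<p>A plot is easier to read alongside the numbers behind it. The sketch below computes the median tenure per group, which is the value each box’s middle line represents. It uses a small made-up data frame (the <code>toy</code> object and its values are assumptions for illustration).</p>
<pre><code class="lang-r"># Toy data (made up) with the same column names as the HR dataset.
toy &lt;- data.frame(
  Department = c("Sales", "Sales", "HR", "HR", "HR"),
  YearsAtCompany = c(2, 6, 1, 3, 8)
)

# Median tenure per department: the middle line of each box.
medians &lt;- tapply(toy$YearsAtCompany, toy$Department, median)
medians
</code></pre>
<p>Replacing <code>toy</code> with <code>hr</code> should give the medians shown in the plot above.</p>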
<h2 id="heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</h2>
<p>To see how linear regression works, model <code>MonthlyIncome</code> using <code>YearsAtCompany</code> with the commands below.</p>
<p>The first command fits the model while the second displays its summary.</p>
<pre><code class="lang-r">hr_lm &lt;- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)
</code></pre>
<p>Linear regression estimates how income changes with tenure. This works when the outcome variable is numeric, as <code>MonthlyIncome</code> is here.</p>
<p>After running the code, your console should show you this output:</p>
<pre><code class="lang-r">Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -<span class="hljs-number">9506</span>  -<span class="hljs-number">2488</span>  -<span class="hljs-number">1186</span>   <span class="hljs-number">1403</span>  <span class="hljs-number">15483</span> 

Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     <span class="hljs-number">3734.47</span>     <span class="hljs-number">159.41</span>   <span class="hljs-number">23.43</span>   &lt;<span class="hljs-number">2e-16</span> ***
YearsAtCompany   <span class="hljs-number">395.25</span>      <span class="hljs-number">17.14</span>   <span class="hljs-number">23.07</span>   &lt;<span class="hljs-number">2e-16</span> ***
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">4032</span> on <span class="hljs-number">1478</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.2647</span>,    Adjusted R-squared:  <span class="hljs-number">0.2642</span> 
<span class="hljs-literal">F</span>-statistic:   <span class="hljs-number">532</span> on <span class="hljs-number">1</span> and <span class="hljs-number">1478</span> DF,  p-value: &lt; <span class="hljs-number">2.2e-16</span>
</code></pre>
<p>Let’s interpret this model.</p>
<p>If an employee has 0 years at the company, their predicted monthly income is $3734.47. This comes from the intercept.</p>
<p>For each additional year an employee spends at the company, their predicted monthly income increases by $395.25.</p>
<p>Both coefficients have p-values &lt; <code>2e-16</code>. This means they are highly statistically significant. It is strong evidence that the years an employee spends at a company are associated with their income, although the model alone does not prove that tenure causes higher pay.</p>
<p>The model’s R-squared is <code>0.2647</code>. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.</p>
<p>The model’s F-statistic is <code>532</code>, with a p-value &lt; <code>2.2e-16</code>. This means the model is statistically significant overall.</p>
<p>In general, the longer an employee stays at a company, the more they earn: roughly $395 more in monthly income for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.</p>
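<p>The fitted line can be turned into a concrete prediction. The sketch below plugs the rounded coefficients from the output above into the regression equation for a hypothetical employee with 5 years at the company (the 5-year figure is an assumption chosen for illustration).</p>
<pre><code class="lang-r"># Coefficients copied from the summary output above (rounded).
intercept &lt;- 3734.47
slope &lt;- 395.25

# Hypothetical tenure, chosen only for illustration.
years &lt;- 5

# Regression equation: income = intercept + slope * years.
predicted_income &lt;- intercept + slope * years
predicted_income
</code></pre>
<p>With the fitted model in memory, <code>predict(hr_lm, newdata = data.frame(YearsAtCompany = 5))</code> performs the same calculation using the unrounded coefficients, so the result may differ slightly.</p>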
<h2 id="heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</h2>
<p>Next, you’ll predict attrition with logistic regression. The first command fits the model while the second displays its summary.</p>
<pre><code class="lang-r">hr_glm &lt;- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)


summary(hr_glm)
</code></pre>
<p>Your console should show this as an output when you run both commands.</p>
<pre><code class="lang-r">Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)    -<span class="hljs-number">8.094e-01</span>  <span class="hljs-number">1.375e-01</span>  -<span class="hljs-number">5.886</span> <span class="hljs-number">3.96e-09</span> ***
MonthlyIncome  -<span class="hljs-number">9.449e-05</span>  <span class="hljs-number">2.302e-05</span>  -<span class="hljs-number">4.104</span> <span class="hljs-number">4.05e-05</span> ***
YearsAtCompany -<span class="hljs-number">5.047e-02</span>  <span class="hljs-number">1.792e-02</span>  -<span class="hljs-number">2.817</span>  <span class="hljs-number">0.00485</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">1305.4</span>  on <span class="hljs-number">1479</span>  degrees of freedom
Residual deviance: <span class="hljs-number">1252.5</span>  on <span class="hljs-number">1477</span>  degrees of freedom
AIC: <span class="hljs-number">1258.5</span>

Number of Fisher Scoring iterations: <span class="hljs-number">5</span>
</code></pre>
<p>Logistic regression is used for binary outcomes, that is, yes or no. It estimates probability.</p>
<p>Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (<code>Attrition</code>) based on their <code>MonthlyIncome</code> and <code>YearsAtCompany</code>.</p>
<p>The intercept is <code>-0.809</code>. This is the baseline log-odds of leaving when income and years at the company are both zero.</p>
<p><code>MonthlyIncome</code> has a coefficient of <code>-0.0000945</code>. The negative sign means that as income increases, the log-odds of leaving decrease slightly for each additional unit of income.</p>
<p><code>YearsAtCompany</code> has a coefficient of <code>-0.0505</code>. This shows that the longer employees stay, the less likely they are to leave: each additional year lowers the log-odds of attrition by about 0.05.</p>
<p>All coefficients are statistically significant, so both <code>MonthlyIncome</code> and <code>YearsAtCompany</code> are associated with the likelihood of staying.</p>
<p>The model’s residual deviance is <code>1252.5</code>, lower than the null deviance of <code>1305.4</code>. This means the model explains some of the variation in attrition.</p>
<p>The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.</p>
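<p>Log-odds are hard to read directly, so it can help to convert them into a probability and an odds ratio. The sketch below uses the rounded coefficients from the output above for a hypothetical employee earning 5000 per month with 5 years at the company (both figures are assumptions chosen for illustration).</p>
<pre><code class="lang-r"># Coefficients copied from the summary output above (rounded).
b0 &lt;- -0.8094
b_income &lt;- -9.449e-05
b_years &lt;- -0.05047

# Hypothetical employee: income of 5000 and 5 years of tenure.
log_odds &lt;- b0 + b_income * 5000 + b_years * 5

# plogis() is the inverse logit: it maps log-odds to a probability.
probability &lt;- plogis(log_odds)

# Odds ratio for a 1000-unit increase in monthly income.
odds_ratio_per_1000 &lt;- exp(b_income * 1000)

probability
odds_ratio_per_1000
</code></pre>
<p>Under these assumptions, the predicted probability of leaving is roughly 0.18, and each additional 1000 in monthly income multiplies the odds of leaving by about 0.91. With the fitted model in memory, <code>predict(hr_glm, newdata = ..., type = "response")</code> returns probabilities directly.</p>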
<h2 id="heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</h2>
<p>Boxplots help you to:</p>
<ul>
<li><p><strong>Detect outliers:</strong> Boxplots highlight extreme values that interfere with model results.</p>
</li>
<li><p><strong>Compare groups:</strong> Boxplots allow quick comparison of distributions across different categories.</p>
</li>
<li><p><strong>Form hypotheses:</strong> Visual patterns assist in identifying relationships worth testing in a model.</p>
</li>
<li><p><strong>Validate modeling assumptions:</strong> Boxplots help check distribution shape and variance before modeling.</p>
</li>
</ul>
<p>Modeling without visualization often leads to misinterpretation or false confidence.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to load and clean data, how to read boxplots, and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Scatterplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, a... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-scatterplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">695ba922d307c8d32fc522ea</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Mon, 05 Jan 2026 12:05:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767614352690/8b993426-f193-4ff3-b5ec-dd6dda11028e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-types-in-r">How to Use Data Types in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-import-data-in-r">How to Import Data in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-visualize-data-with-ggplot2">How to Visualize Data with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-statistical-models-in-r">How to Build Statistical Models in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we get started, you should have the following:</p>
<ul>
<li><p>R installed (version 4.0 or higher).</p>
</li>
<li><p>RStudio installed (recommended for beginners).</p>
</li>
<li><p>Basic familiarity with programming concepts such as variables and functions.</p>
</li>
<li><p>A basic understanding of statistics (mean, correlation, regression).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Before you start working with data, load the required libraries:</p>
<pre><code class="lang-r">library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files
</code></pre>
<p>These commands load the required libraries into R. <code>tidyverse</code> is a collection of packages used for data manipulation and visualization, including <code>ggplot2</code>. <code>readxl</code> allows you to import Excel files directly into R without converting them to CSV format first.</p>
<h2 id="heading-how-to-use-data-types-in-r">How to Use Data Types in R</h2>
<p>Knowing data types helps you avoid errors and choose the right analysis methods.</p>
<h3 id="heading-common-data-types">Common Data Types</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Data type</td><td>Example</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>Numeric</td><td><code>x &lt;- 5.7</code></td><td>Measurements, prices</td></tr>
<tr>
<td>Integer</td><td><code>y &lt;- 10L</code></td><td>Counts</td></tr>
<tr>
<td>Character</td><td><code>"House prices"</code></td><td>Text labels</td></tr>
<tr>
<td>Logical</td><td><code>TRUE</code></td><td>Conditions</td></tr>
<tr>
<td>Complex</td><td><code>2 + 3i</code></td><td>Advanced math</td></tr>
</tbody>
</table>
</div><h3 id="heading-numeric-data-types-in-r">Numeric Data Types in R</h3>
<pre><code class="lang-r">price &lt;- <span class="hljs-number">199.99</span>
tax &lt;- <span class="hljs-number">16.5</span>
total_cost &lt;- price + tax
total_cost
</code></pre>
<p>Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.</p>
<h3 id="heading-integer-data-types-in-r">Integer Data Types in R</h3>
<pre><code class="lang-r">students &lt;- <span class="hljs-number">30L</span>
classes &lt;- <span class="hljs-number">4L</span>
total_students &lt;- students * classes
total_students
</code></pre>
<p>Integers are whole numbers and are commonly used for counting. The <code>L</code> tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.</p>
<h3 id="heading-character-data-types-in-r">Character Data Types in R</h3>
<pre><code class="lang-r">course_name &lt;- <span class="hljs-string">"Data Science"</span>
university &lt;- <span class="hljs-string">"Harvard University"</span>
paste(course_name, <span class="hljs-string">"at"</span>, university)
</code></pre>
<p>Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the <code>paste()</code> function. This data type cannot be used in mathematical operations.</p>
<h3 id="heading-logical-data-types-in-r">Logical Data Types in R</h3>
<pre><code class="lang-r">score &lt;- <span class="hljs-number">75</span>
passed &lt;- score &gt;= <span class="hljs-number">50</span>
passed
</code></pre>
<p>Logical data represents Boolean values: <code>TRUE</code> or <code>FALSE</code>. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns <code>TRUE</code> because the score meets the requirement. Logical values are essential in decision-making and control flow.</p>
<h3 id="heading-complex-data-types-in-r">Complex Data Types in R</h3>
<p>Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.</p>
<pre><code class="lang-r">z &lt;- <span class="hljs-number">2</span> + <span class="hljs-number">3i</span>
Mod(z)
</code></pre>
<p>This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.</p>
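<p>When you are unsure which type a value has, R can tell you. The sketch below reuses values from the examples above and checks each one with the <code>class()</code> function.</p>
<pre><code class="lang-r">price &lt;- 199.99
students &lt;- 30L
course_name &lt;- "Data Science"
passed &lt;- TRUE
z &lt;- 2 + 3i

class(price)        # "numeric"
class(students)     # "integer"
class(course_name)  # "character"
class(passed)       # "logical"
class(z)            # "complex"
</code></pre>
<p>Checking types this way before a calculation is a quick way to catch numbers that were accidentally stored as text.</p>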
<h2 id="heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</h2>
<p>R stores data in different structures depending on your goals. This is important because choosing the right structure makes operations easier, and R’s functions behave differently depending on the structure they receive. Moreover, structures help R understand whether your data are numbers, categories, or text.</p>
<h3 id="heading-common-data-structures-in-r">Common Data Structures in R</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Structure</td><td>Best for</td></tr>
</thead>
<tbody>
<tr>
<td>Vector</td><td>Single column of data</td></tr>
<tr>
<td>Matrix</td><td>Numeric tables</td></tr>
<tr>
<td>Data Frame</td><td>Spreadsheet-like data</td></tr>
<tr>
<td>List</td><td>Mixed objects</td></tr>
</tbody>
</table>
</div><pre><code class="lang-r">vec &lt;- c(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>)
mat &lt;- matrix(<span class="hljs-number">1</span>:<span class="hljs-number">9</span>, nrow = <span class="hljs-number">3</span>)
df &lt;- data.frame(Name = c(<span class="hljs-string">"Car"</span>, <span class="hljs-string">"Bike"</span>), Number = c(<span class="hljs-number">110</span>, <span class="hljs-number">95</span>))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

str(lst) <span class="hljs-comment">##shows the structure of the list</span>
</code></pre>
<p>Let’s understand the code above:</p>
<ul>
<li><p><code>vec</code> is a vector that stores a single type of data.</p>
</li>
<li><p><code>mat</code> is a matrix that organizes numeric values into rows and columns.</p>
</li>
<li><p><code>df</code> is a data frame that works like a spreadsheet, allowing different data types in each column.</p>
</li>
<li><p><code>lst</code> is a list that stores multiple objects of different types.</p>
</li>
<li><p>The <code>str()</code> function shows how these objects are nested within the list.</p>
</li>
</ul>
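<p>Once data is stored in a structure, you need a way to get values back out. The sketch below rebuilds the same objects and shows one common way to access elements of each: <code>$</code> for named list items and data frame columns, and square brackets for positions.</p>
<pre><code class="lang-r">vec &lt;- c(1, 2, 3, 4)
mat &lt;- matrix(1:9, nrow = 3)
df &lt;- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

lst$numbers[2]     # second element of the vector
lst$matrix[2, 3]   # row 2, column 3 of the matrix
lst$info$Name      # the Name column of the data frame
</code></pre>
<p>Note that <code>matrix()</code> fills values column by column, so row 2, column 3 of <code>mat</code> holds 8.</p>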
<h2 id="heading-how-to-import-data-in-r"><strong>How to Import Data in R</strong></h2>
<p>Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.</p>
<p><strong>For Windows:</strong> Replace single backslashes (<code>\</code>) with either double backslashes (<code>\\</code>) or single forward slashes (<code>/</code>). For example:</p>
<pre><code class="lang-r"># Windows: use double backslashes...
data &lt;- read.csv("C:\\Users\\file\\Documents\\data.csv")
# ...or single forward slashes
data &lt;- read.csv("C:/Users/file/Documents/data.csv")
</code></pre>
<p><strong>For macOS/Linux:</strong> Single forward slashes work fine:</p>
<pre><code class="lang-r"><span class="hljs-comment"># macOS/Linux</span>
data &lt;- read.csv(<span class="hljs-string">"/Users/file/Documents/data.csv"</span>)
</code></pre>
<h3 id="heading-how-to-read-a-csv-and-excel-file"><strong>How to Read a CSV and Excel File</strong></h3>
<pre><code class="lang-r"><span class="hljs-comment"># Import a CSV file (forward slashes work on every platform)</span>
data &lt;- read.csv(<span class="hljs-string">"C:/Users/file/Documents/data.csv"</span>)
<span class="hljs-comment"># On Windows, double backslashes also work:</span>
<span class="hljs-comment"># data &lt;- read.csv("C:\\Users\\file\\Documents\\data.csv")</span>

head(data)
</code></pre>
<p>You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (<code>/</code>) or double backslashes (<code>\\</code>). The imported data is stored as a data frame named <code>data</code>, and <code>head()</code> previews its first rows.</p>
<pre><code class="lang-r">data_excel &lt;- read_excel(<span class="hljs-string">"C:/Users/file/Documents/HR Data Set.xlsx"</span>)
head(data_excel)
</code></pre>
<p>You can import an Excel file into R using the <code>read_excel()</code> function from the <code>readxl</code> package. The <code>head()</code> function is then used to preview the first few rows of the dataset.</p>
<p>Use the following commands to understand your data:</p>
<pre><code class="lang-r">str(data)
summary(data)

str(data_excel)
summary(data_excel)
</code></pre>
<p><code>str()</code> shows the structure of the dataset, including column names and data types. <code>summary()</code> provides descriptive statistics such as minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.</p>
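<p>If you don’t have a file at hand, you can still try the whole import-and-inspect workflow. The sketch below writes a small made-up data frame to a temporary CSV file and reads it back (the <code>sample_data</code> values and the temporary path are assumptions for illustration).</p>
<pre><code class="lang-r"># Write a small made-up data frame to a temporary CSV file.
sample_data &lt;- data.frame(id = 1:3, score = c(75, 48, 90))
csv_path &lt;- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Read it back, then inspect the data frame (not the file name).
imported &lt;- read.csv(csv_path)
head(imported)
str(imported)
summary(imported)
</code></pre>
<p>Note that the inspection functions take the data frame object, here <code>imported</code>, rather than the name of the file it came from.</p>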
<h2 id="heading-how-to-visualize-data-with-ggplot2"><strong>How to Visualize Data with ggplot2</strong></h2>
<p>Visualization helps you spot patterns before you build models.</p>
<h3 id="heading-scatter-plot-example"><strong>Scatter Plot Example</strong></h3>
<p>We’ll use the built-in <code>mtcars</code> dataset in R. First, load the dataset and the <code>ggplot2</code> library, then build the plot:</p>
<pre><code class="lang-r">data(mtcars)
<span class="hljs-keyword">library</span>(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = <span class="hljs-number">3</span>, color = <span class="hljs-string">"blue"</span>) +
  geom_smooth(method = <span class="hljs-string">"lm"</span>, color = <span class="hljs-string">"red"</span>, se = <span class="hljs-literal">FALSE</span>) +
  labs(
    title = <span class="hljs-string">"Fuel Efficiency by Weight and Cylinders"</span>,
    x = <span class="hljs-string">"Weight (1000 lbs)"</span>,
    y = <span class="hljs-string">"Miles per Gallon"</span>
  ) +
  theme_minimal()
</code></pre>
<p>Let us break down the code to grasp it fully:</p>
<ul>
<li><p><code>data(mtcars)</code> loads the built-in <code>mtcars</code> dataset, which contains information about car specifications.</p>
</li>
<li><p><code>library(ggplot2)</code> enables data visualization.</p>
</li>
<li><p><code>aes()</code> was used to insert your dataset columns, which defines the <code>x</code> and <code>y</code> values.</p>
</li>
<li><p><code>aes()</code> was used to design the plot outside. For example, set point <code>size</code> and <code>color</code>.</p>
</li>
<li><p><code>geom_smooth()</code> wass used to add a trend line with. Here, we use <code>method="lm"</code> to fit a linear regression line. The <code>se=TRUE/FALSE</code> option controls the shading for confidence intervals. Use <code>TRUE</code> if you want the shading and <code>FALSE</code> if you don’t.</p>
</li>
<li><p><code>labs()</code> labels the plot by setting the <code>title</code>, <code>x</code>-axis, and <code>y</code>-axis labels.</p>
</li>
<li><p>Finally, we set the plot theme using <code>theme_minimal()</code>.</p>
</li>
</ul>
<p>Running this code will produce a scatterplot showing fuel efficiency by weight and cylinders. The plot should look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765914755069/8921e803-7fa6-4705-802c-23ff8918bee5.png" alt="Scatterplot of mpg against vehicle weight with regression line" class="image--center mx-auto" width="912" height="527" loading="lazy"></p>
<h2 id="heading-how-to-build-statistical-models-in-r"><strong>How to Build Statistical Models in R</strong></h2>
<h3 id="heading-linear-regression"><strong>Linear Regression</strong></h3>
<p>You can use linear regression to predict continuous (numerical) outcomes. For example, to predict a car’s miles per gallon (<code>mpg</code>) based on weight (<code>wt</code>) and horsepower (<code>hp</code>), you can fit this model:</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)
</code></pre>
<p>But what does it mean?</p>
<ul>
<li><p><code>lm()</code> stands for linear model.</p>
</li>
<li><p>The response variable is <code>mpg</code>. This is the outcome you want to predict.</p>
</li>
<li><p>Predictor variables are <code>wt</code> and <code>hp</code>. These explain changes in the response.</p>
</li>
</ul>
<p>Once you run the model, the output should look like this in your console:</p>
<pre><code class="lang-r">Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-<span class="hljs-number">3.941</span> -<span class="hljs-number">1.600</span> -<span class="hljs-number">0.182</span>  <span class="hljs-number">1.050</span>  <span class="hljs-number">5.854</span> 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) <span class="hljs-number">37.22727</span>    <span class="hljs-number">1.59879</span>  <span class="hljs-number">23.285</span>  &lt; <span class="hljs-number">2e-16</span> ***
wt          -<span class="hljs-number">3.87783</span>    <span class="hljs-number">0.63273</span>  -<span class="hljs-number">6.129</span> <span class="hljs-number">1.12e-06</span> ***
hp          -<span class="hljs-number">0.03177</span>    <span class="hljs-number">0.00903</span>  -<span class="hljs-number">3.519</span>  <span class="hljs-number">0.00145</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">2.593</span> on <span class="hljs-number">29</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.8268</span>,    Adjusted R-squared:  <span class="hljs-number">0.8148</span> 
<span class="hljs-literal">F</span>-statistic: <span class="hljs-number">69.21</span> on <span class="hljs-number">2</span> and <span class="hljs-number">29</span> DF,  p-value: <span class="hljs-number">9.109e-12</span>
</code></pre>
<p>Here’s an interpretation of the linear regression model:</p>
<ul>
<li><p>You created a model on miles per gallon (<code>mpg</code>) based on weight (<code>wt</code>) and horsepower (<code>hp</code>).</p>
</li>
<li><p>The intercept <code>37.227</code> is the predicted <code>mpg</code> when <code>wt=0</code> and <code>hp=0</code>. No real car has zero weight or horsepower, so treat it as the model’s baseline rather than a realistic prediction.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the <code>mpg</code> decreases by <code>3.878</code>, holding horsepower constant. The <code>p-value</code> is below 0.001, so the effect is statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the <code>mpg</code> decreases by <code>0.032</code>, holding weight constant. The <code>p-value</code> is <code>0.00145</code>, which is <strong>less than 0.01</strong>, so horsepower is also a statistically significant predictor of <code>mpg</code>, although its per-unit effect is much smaller than that of vehicle weight.</p>
</li>
</ul>
<h3 id="heading-does-the-model-fit-the-data-and-why">Does the Model Fit the Data, and Why?</h3>
<p>The R-squared value of <code>0.8268</code> shows that about 83% of the variation in <code>mpg</code> is explained by weight and horsepower.</p>
<p><strong>Summary of the interpretation</strong>: Heavier cars and cars with more horsepower have lower fuel efficiency. These two variables explain most of the variation in <code>mpg</code> in the dataset.</p>
<h3 id="heading-logistic-regression"><strong>Logistic Regression</strong></h3>
<p>You can use logistic regression for binary outcomes, like yes/no questions. For example, you can predict whether a vehicle is automatic or manual based on its weight and horsepower.</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)
</code></pre>
<p>Let’s understand the code:</p>
<ul>
<li><p><code>glm()</code> stands for generalized linear model.</p>
</li>
<li><p>The <code>family=binomial</code> option tells R to run logistic regression.</p>
</li>
<li><p>The response variable <code>am</code> indicates transmission type: 0 = automatic, 1 = manual.</p>
</li>
<li><p>Predictor variables remain <code>wt</code> and <code>hp</code>.</p>
</li>
</ul>
<p>Once you run the model, the output should look like this in your console:</p>
<pre><code class="lang-r">Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)   
(Intercept) <span class="hljs-number">18.86630</span>    <span class="hljs-number">7.44356</span>   <span class="hljs-number">2.535</span>  <span class="hljs-number">0.01126</span> * 
wt          -<span class="hljs-number">8.08348</span>    <span class="hljs-number">3.06868</span>  -<span class="hljs-number">2.634</span>  <span class="hljs-number">0.00843</span> **
hp           <span class="hljs-number">0.03626</span>    <span class="hljs-number">0.01773</span>   <span class="hljs-number">2.044</span>  <span class="hljs-number">0.04091</span> * 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">43.230</span>  on <span class="hljs-number">31</span>  degrees of freedom
Residual deviance: <span class="hljs-number">10.059</span>  on <span class="hljs-number">29</span>  degrees of freedom
AIC: <span class="hljs-number">16.059</span>

Number of Fisher Scoring iterations: <span class="hljs-number">8</span>
</code></pre>
<p>Here’s an interpretation of the logistic regression model:</p>
<ul>
<li><p>The intercept <code>18.866</code> represents the log-odds of a car being manual when <code>wt=0</code> and <code>hp=0</code>. In other words, it is the baseline log-odds of the outcome when all predictors in the model are zero.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the log-odds of the car being manual decrease by <code>8.083</code>, holding horsepower constant. The <code>p-value</code> of <code>0.008</code> indicates that this effect is statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the log-odds of the car being manual increase by <code>0.036</code>, holding weight constant. The <code>p-value</code> of <code>0.041</code> indicates that this effect is also statistically significant.</p>
</li>
</ul>
<p><strong>Summary of the interpretation</strong>: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, <code>wt</code> and <code>hp</code> explain a large portion of transmission type variation.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.</p>
<p>This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.</p>
<p>Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.</p>
<p>You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.</p>
<p>With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Embedded Analytics Makes Your App More Valuable ]]>
                </title>
                <description>
                    <![CDATA[ Most business apps capture data. They track orders, tickets, leads, expenses, tasks, or deliveries. But when someone needs insights, they often leave the app, export a file or open a BI tool to get answers. This extra step slows down decisions and cr... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-embedded-analytics-makes-your-app-more-valuable/</link>
                <guid isPermaLink="false">693704c4f37606a62f1727d3</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ business ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Mon, 08 Dec 2025 17:03:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765213278642/9bbd88ba-c803-45d5-bace-97cf7ccca83e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most business apps capture data. They track orders, tickets, leads, expenses, tasks, or deliveries.</p>
<p>But when someone needs insights, they often leave the app, export a file or open a BI tool to get answers. This extra step slows down decisions and creates friction.</p>
<p><a target="_blank" href="https://www.thoughtspot.com/data-trends/embedded-analytics">Embedded analytics</a> removes that friction. It means placing reports, dashboards, charts, KPIs and even AI-powered insights directly inside your existing app.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764849629682/1031082f-219c-4303-9b07-c0fb16ed806b.png" alt="Embedded analytics benefits" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Instead of switching to another tool, users get answers in the exact moment they are doing their work.</p>
<p>Companies like Tableau, Pyramid, and Sigma have helped popularise this idea by allowing their analytics engines to sit inside other products. But the real value comes not from the vendors but from how deeply analytics becomes part of the workflow.</p>
<p>When embedded analytics is done well, your app becomes more valuable because it helps users think and act in the same place.</p>
<p>In this article, we will learn how embedding analytics directly inside a product increases its usefulness. We will also see how it improves decision-making and creates new revenue opportunities for the product. </p>
<h2 id="heading-what-well-cover">What We’ll Cover</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-embedded-analytics-matters">Why Embedded Analytics Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-embedded-analytics-looks-like-inside-an-app">What Embedded Analytics Looks Like Inside an App</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-embedded-analytics-makes-your-app-more-valuable">How Embedded Analytics Makes Your App More Valuable</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-practical-ways-to-start-using-embedded-analytics">Practical Ways to Start Using Embedded Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-design-principles-for-effective-embedded-analytics">Design Principles for Effective Embedded Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-embedded-analytics-matters">Why Embedded Analytics Matters</h2>
<p>In many business workflows, insight is a step behind action.</p>
<p>A support manager who wants to understand why backlogs are rising must check a separate reporting tool. </p>
<p>A sales leader who wants to see pipeline health needs to open a BI dashboard. </p>
<p>A supply chain manager who wants to diagnose delays must export data to Excel.</p>
<p>These breaks in context may seem small, but they pile up. Users lose time. Decisions slow down. Only power users become comfortable with analytics.</p>
<p>Embedded analytics changes this pattern. By placing insights directly where work happens, you remove the hidden cost of switching tools.</p>
<p>A support manager can see backlog trends next to the ticket queue. A sales rep can see win rates while updating deals. A logistics coordinator can see average delay times next to shipment details.</p>
<p>Your app becomes more useful because it no longer just stores data. It helps make sense of it.</p>
<h2 id="heading-what-embedded-analytics-looks-like-inside-an-app">What Embedded Analytics Looks Like Inside an App</h2>
<p>There are many ways embedded analytics can appear in a product.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764849694771/099687c7-b47e-41c3-9033-713b64633267.png" alt="In-App Analytics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>At the simplest level, it can be a dashboard embedded through an iframe or a JavaScript snippet. This still gives users a unified experience without opening another product.</p>
<p>More advanced setups weave analytics into the core interface. A CRM might show prediction scores on each lead instead of only having a separate “Reports” tab. </p>
<p>An operations platform powered by <a target="_blank" href="https://www.tableau.com/products/embedded-analytics">Tableau</a> might show throughput and error trends beside the workflow screen. A finance app might reveal margin drivers while approving invoices.</p>
<p>The experience should feel native to the product. Fonts match. Colours match. Navigation stays consistent. Users should not feel like they are opening a separate tool. They should feel like the analytics belong exactly where they appear.</p>
<h2 id="heading-how-embedded-analytics-makes-your-app-more-valuable">How Embedded Analytics Makes Your App More Valuable</h2>
<p>Embedded analytics deepens product usefulness by changing how users interact with data.</p>
<p>It moves insight to the front of decisions. Instead of digging for answers elsewhere, users see context exactly when needed.</p>
<p>A procurement manager adjusting an order quantity sees supplier reliability and historical pricing right there. They can make smarter decisions without leaving the screen.</p>
<p>This unlocks new value stories. Customers pay because they get decision-making power built into the product itself. Companies like <a target="_blank" href="https://www.pyramidanalytics.com/">Pyramid Analytics</a> are often used to deliver enterprise-grade insights inside portals and internal tools, letting companies sell analytics as an added feature.</p>
<p>It also reduces dependency on analysts. Modern embedded analytics platforms enable search-based exploration and drag-and-drop analysis. Business teams no longer need to wait for a data team to create every custom view.</p>
<p>And it strengthens <a target="_blank" href="https://www.wallstreetprep.com/knowledge/product-stickiness/">product stickiness</a>. When your app becomes a central hub for both workflows and decisions, users rely on it more. Competing products without analytics feel incomplete.</p>
<h2 id="heading-practical-ways-to-start-using-embedded-analytics">Practical Ways to Start Using Embedded Analytics</h2>
<p>One of the simplest ways to implement embedded analytics is to place a live BI dashboard directly inside your application.</p>
<p>Modern tools such as Tableau allow dashboards to be published with secure embed URLs. These dashboards can then appear as part of your interface instead of forcing users to open a separate reporting system.</p>
<p>Imagine you are building a recruiting platform. Your customers track candidates, interviews, and hiring cycles, but they still leave your product whenever they want an overview.</p>
<p>By embedding analytics, you can surface a pipeline health view directly inside the product’s home screen. Hiring managers would see average time-to-hire, conversion rates, and offer acceptance trends without ever exporting data.</p>
<p>The implementation is surprisingly straightforward. First, you create and publish a dashboard in your BI tool, so it becomes accessible via a URL such as:</p>
<pre><code class="lang-python">https://analytics.yourapp.com/views/hiring_overview
</code></pre>
<p>Next, you embed that dashboard inside your product UI using a simple iframe. A page in your web app could include the following:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"dashboard-container"</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">iframe</span>
    <span class="hljs-attr">src</span>=<span class="hljs-string">"https://analytics.yourapp.com/views/hiring_overview"</span>
    <span class="hljs-attr">style</span>=<span class="hljs-string">"width:100%; height:500px; border:none;"</span>
  &gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">iframe</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
</code></pre>
<p>The iframe source points to your analytics dashboard, and its sizing and border settings help the embedded view sit naturally within your page layout. Keep in mind that content inside an iframe does not inherit the host page’s CSS, so fonts and colors need to be matched in the dashboard itself.</p>
<p>What matters most is the experience for the user. Instead of jumping between systems, hiring teams now see insights the moment they open the app.</p>
<p>Recruiters review candidate lists while seeing hiring trends directly above them. Managers check pipeline health during weekly planning sessions without exporting spreadsheets. Executives understand bottlenecks simply by logging in, rather than waiting for emailed reports. The insight lives where the work happens, which is exactly what makes embedded analytics valuable.</p>
<p>This small implementation illustrates how embedding a ready-made dashboard can increase usefulness without changing data architecture. By letting users access answers in context, your product shifts from a system that records information to one that helps interpret and act on it.</p>
<h2 id="heading-design-principles-for-effective-embedded-analytics">Design Principles for Effective Embedded Analytics</h2>
<p>Great embedded analytics is not about building fancy charts. It is about making the app easier to understand and easier to act on.</p>
<p>Begin with clear questions. Each chart should answer something specific. Instead of a generic graph called “Revenue by Region”, use a title such as “Which region is growing fastest this quarter?” Clear questions guide the user’s attention.</p>
<p>Show only what matters. Many analytics tools allow complex dashboards, but in a business app, less is more. Three focused metrics are more useful than fifteen distracting charts.</p>
<p>Support deeper exploration. While the first view should be simple, users who need detail should be able to drill down into more granular data, then into tables, then into raw records. This avoids overwhelming beginners while keeping power users happy.</p>
<p>Prioritize performance. Embedded analytics runs inside your product, so slow dashboards feel like a slow app. Pre-aggregate heavy metrics and use caching wherever possible. Leading platforms make speed a core priority because it directly affects user experience.</p>
<p>Match the product’s design. White-label options from companies like <a target="_blank" href="https://www.gooddata.com/">GoodData</a> help make embedded dashboards feel native. Consistent colors and typography matter more than many teams expect.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Embedded analytics is not a cosmetic add-on. It’s a strategic way to lift product value. When you plan your roadmap, tie analytics ideas to measurable business outcomes.</p>
<p>Analytics can reduce churn by making users more successful. It can increase the adoption of core workflows by helping people understand what is happening. It can become a revenue driver through premium analytic tiers.</p>
<p>The market also shows how important analytics has become. Companies promote decision intelligence as a core capability for enterprise apps. Many large enterprises use embedded analytics to serve both internal teams and external customers with faster insights.</p>
<p>If your product still pushes users toward Excel exports or sends them to a separate BI portal, you are leaving value behind. When analytics becomes part of the main interface, your product shifts from being a system of record to a system of insight.</p>
<p>That is when the usefulness deepens, user loyalty grows, and your app becomes a place where better decisions happen every day.</p>
<p><em>Hope you enjoyed this article. Find me on</em> <a target="_blank" href="https://linkedin.com/in/manishmshiva"><em>LinkedIn</em></a> <em>or</em> <a target="_blank" href="https://manishshivanandhan.com/"><em>visit my website</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Common Pitfalls to Avoid When Analyzing and Modeling Data ]]>
                </title>
                <description>
                    <![CDATA[ Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/common-pitfalls-to-avoid-when-analyzing-and-modeling-data/</link>
                <guid isPermaLink="false">68ee54b2edcf5de25dd4bb13</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Tue, 14 Oct 2025 13:48:34 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760449475934/80950373-2a61-4b75-bd8f-b0dfd08f6e21.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column, an unclear definition, or a data leak that slips by unnoticed can all lead to results that do not hold up when it matters most.</p>
<p>Reliable analysis depends on how data is handled throughout the process. From collection and preparation to modeling and interpretation, each step carries its own risks. Many of the most persistent problems come not from technical gaps, but from missing checks or assumptions that go unspoken.</p>
<p>This guide highlights some of the most common pitfalls in data analysis and shows where they tend to appear. Along the way, it covers:</p>
<ul>
<li><p>Biased or unclear inputs that cause trouble early on</p>
</li>
<li><p>Validation mistakes that distort model performance</p>
</li>
<li><p>Misinterpretation of results that leads to the wrong conclusions</p>
</li>
<li><p>Workflow gaps that slow teams down or create confusion</p>
</li>
<li><p>Practical steps you can take to catch and correct these issues</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-data-collection-pitfalls">Data Collection Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-preparation-pitfalls">Data Preparation Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modeling-and-validation-pitfalls">Modeling and Validation Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-interpretation-and-communication-pitfalls">Interpretation and Communication Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-organizational-and-workflow-pitfalls">Organizational and Workflow Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-data-collection-pitfalls"><strong>Data Collection Pitfalls</strong></h2>
<p>A lot of data issues begin before any modeling takes place. The way data is collected helps shape what your analysis can reveal. Once the inputs are biased or inconsistent, even solid techniques may lead to unreliable results.</p>
<p>One common issue is bias in data sources. When a large portion of the data comes from digital channels like websites or apps, it creates an imbalance. For instance, if a model is trained only on web traffic, it could miss users who engage through offline means, like in-person visits or phone support. This results in blind spots that limit how well the model performs once deployed.</p>
<p>Inconsistent definitions across systems also pose a major challenge. A simple label like “customer” could represent various things: an active user in one database, a prospect in another, or a past buyer elsewhere. Without shared definitions, teams end up using the same term to mean very different things, which leads to confusion and misaligned metrics.</p>
<p>A third issue is the lack of metadata or data provenance. Without clear records of where the data came from or how it has changed over time, it becomes harder to trace issues, explain outputs, or reproduce results.</p>
<p><strong>The way out:</strong></p>
<ul>
<li><p>Combine data from multiple sources to build a more complete and representative picture</p>
</li>
<li><p>Use stratified sampling to reduce bias where possible</p>
</li>
<li><p>Set up regular audits to catch data drift or gaps early</p>
</li>
<li><p>Maintain a shared data dictionary and align terms across teams</p>
</li>
<li><p>Track data lineage with tools like dbt, Apache Atlas, or OpenMetadata</p>
</li>
</ul>
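<p>As a rough illustration of the stratified sampling point above, here is a minimal Python sketch that uses only the standard library. The <code>stratified_sample</code> helper, the <code>channel</code> field, and the user records are hypothetical names made up for this example:</p>
<pre><code class="lang-python">import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every group so small groups are not lost."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Keep at least one record per group, even for very small groups.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 90 web users and 10 phone-support users.
users = [{"id": i, "channel": "web"} for i in range(90)]
users += [{"id": 90 + i, "channel": "phone"} for i in range(10)]

sample = stratified_sample(users, key="channel", fraction=0.2)
counts = defaultdict(int)
for user in sample:
    counts[user["channel"]] += 1
print(dict(counts))  # {'web': 18, 'phone': 2}
</code></pre>
<p>Sampling within each channel keeps the smaller phone-support group in the sample at its true share, whereas a plain random sample of 20 users could easily contain none of them.</p>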
<p>Getting data collection right sets a strong foundation for analysis and helps prevent issues down the line.</p>
<h2 id="heading-data-preparation-pitfalls"><strong>Data Preparation Pitfalls</strong></h2>
<p>Once the data has been collected, the next step involves cleaning and shaping it for use. This is another delicate stage where data analysts often run into issues. Some choices that seem helpful at first can create problems later, especially when they aren’t documented or tested properly.</p>
<p><strong>Silent Data Leakage</strong></p>
<p>Data leakage occurs when a model learns from information that it would not have access to at prediction time. For example, say you’re building a model in January to predict whether a customer will make a purchase in February. If your dataset includes transactions from February, and you use them to calculate a feature like “days since last purchase”, then your model is learning from data it wouldn’t realistically have at prediction time.</p>
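<p>One way to guard against this kind of leak is to compute every feature relative to an explicit cutoff date. The sketch below is a minimal illustration in Python, not a full pipeline. The <code>days_since_last_purchase</code> function, the dates, and the 1 February cutoff are all assumptions made for the example:</p>
<pre><code class="lang-python">import bisect
from datetime import date

def days_since_last_purchase(purchase_dates, cutoff):
    """Days between the cutoff and the latest purchase strictly before it.

    Returns None when the customer has no purchase before the cutoff.
    """
    ordered = sorted(purchase_dates)
    # bisect_left counts the purchases dated strictly before the cutoff.
    n_known = bisect.bisect_left(ordered, cutoff)
    if n_known == 0:
        return None
    return (cutoff - ordered[n_known - 1]).days

# Hypothetical purchase history. Predictions are made on 1 February,
# so the 10 February purchase is not yet known at prediction time.
purchases = [date(2025, 1, 5), date(2025, 1, 20), date(2025, 2, 10)]
cutoff = date(2025, 2, 1)

print(days_since_last_purchase(purchases, cutoff))  # 12
</code></pre>
<p>The February purchase is ignored, so the feature reflects only what would have been known when the prediction was made.</p>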
<p><strong>Improper Handling of Missing Values</strong></p>
<p>Many analysts treat missing values as gaps to be filled. In certain cases, though, the fact that data is missing can be just as meaningful as the value itself. In a customer churn dataset, some users might have blank entries for recent activities because they have already stopped engaging with the product. Filling those gaps with averages or zeros without context could make the model treat them the same as users who simply haven’t generated enough data yet, which can be misleading.</p>
<p><strong>Over-aggressive Outlier Removal</strong></p>
<p>It’s tempting to remove extreme values to simplify modeling, but outliers often represent rare but important events. In fraud detection, for instance, the anomalies are the very signals the models need to learn from. Discarding them automatically based on z-scores or quantiles may improve short-term accuracy while weakening long-term reliability.</p>
<p><strong>The way out:</strong></p>
<ul>
<li><p>To avoid data leakage, create training and test splits before engineering features. Make use of chronological splits when modeling time-based behavior, and regularly audit feature logic.</p>
</li>
<li><p>For missing values, go through the missingness patterns first. Use indicator variables where necessary, and treat the missingness as a signal, rather than just a defect.</p>
</li>
<li><p>With outliers, analyze their sources before removing them. If they turn out to be legitimate, try using robust models that can handle skewed data, or flag them for downstream use instead of deleting them.</p>
</li>
</ul>
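<p>To make the indicator-variable idea above concrete, here is a small Python sketch. The <code>add_missing_indicator</code> helper and the <code>logins_last_30d</code> column are hypothetical, and filling with the median is just one possible choice. In a real project, the fill value should be computed from the training split only:</p>
<pre><code class="lang-python">from statistics import median

def add_missing_indicator(rows, column):
    """Add a 0/1 flag for missingness, then fill gaps with the observed median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    result = []
    for row in rows:
        is_missing = row[column] is None
        new_row = dict(row)
        # The flag preserves the fact that the value was absent.
        new_row[column + "_missing"] = int(is_missing)
        new_row[column] = fill_value if is_missing else row[column]
        result.append(new_row)
    return result

# Hypothetical churn data: None means no recent activity was recorded.
customers = [
    {"id": 1, "logins_last_30d": 12},
    {"id": 2, "logins_last_30d": None},
    {"id": 3, "logins_last_30d": 4},
]
flagged = add_missing_indicator(customers, "logins_last_30d")
for row in flagged:
    print(row)
</code></pre>
<p>Customer 2 receives the median value but keeps a <code>logins_last_30d_missing</code> flag of 1, so a model can still distinguish “no data recorded” from an ordinary activity level.</p>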
<p>Getting this stage right protects your models from brittle and unstable behavior.</p>
<h2 id="heading-modeling-and-validation-pitfalls"><strong>Modeling and Validation Pitfalls</strong></h2>
<p>A common view in this field is that models are only as reliable as the assumptions built into them. Mistakes at this phase often surface late, sometimes after the models have been deployed, making them harder to catch and more expensive to fix.</p>
<p><strong>Overfitting Through Hyperparameter Tuning</strong></p>
<p>Trying to make a model fit the training data perfectly can lead to patterns that don’t hold up in practice. When you test hundreds of hyperparameter combinations without proper checks, the model often ends up learning noise rather than signal, which results in excellent scores during cross-validation but weak performance in production. For instance, a churn model might perform very well during development, but start to miss the mark once it is deployed to a new region where customer behavior differs slightly.</p>
<p><strong>Validation Leakage</strong></p>
<p>Leakage can occur when the validation process accidentally gives the model access to target-related information. One common case is target encoding, where features like average purchase per customer group are calculated on the full dataset rather than only on the training set. This can lead to inflated validation scores and a false sense of confidence.</p>
<p><strong>Ignoring Data Drift and Concept Drift</strong></p>
<p>Data changes over time, and so do the basic relationships that models rely on. A model trained on behavior from eight months ago may not reflect current realities. Imagine a fraud detection model built before a major policy shift or product change: it is very likely to miss the new fraud patterns that arise afterwards.</p>
<p><strong>The Way Out</strong></p>
<ul>
<li><p>Use nested cross-validation (a technique that separates hyperparameter tuning from final evaluation by using two loops of cross-validation) to avoid overfitting during model selection. Then compare the results against simple baselines to keep complexity in check.</p>
</li>
<li><p>Treat feature engineering as part of the pipeline and apply it within each training fold to avoid leakage. For time-sensitive data, validate progressively to reflect real-world use.</p>
</li>
<li><p>Check for drift using techniques like the Kolmogorov-Smirnov test or the Population Stability Index, and link alerts to retraining processes so models can evolve with data.</p>
</li>
</ul>
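<p>To make the drift check more concrete, here is a minimal sketch of the Population Stability Index in plain Python. The function name <code>psi</code>, the bin edges, and the sample values are assumptions made for this illustration. The index sums <code>(actual - expected) * ln(actual / expected)</code> over the share of observations that fall into each bin.</p>
<pre><code class="lang-python">import bisect
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # bisect_right gives the index of the bin the value falls into.
            counts[bisect.bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]   # training-time values
shifted = [v + 15 for v in baseline]                  # simulated drift
edges = [15, 25]                                      # hypothetical bin edges

print(psi(baseline, baseline, edges))   # identical samples give 0.0
print(psi(baseline, shifted, edges))    # a large value signals drift
</code></pre>
<p>A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the cost of retraining.</p>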
<p>These steps go a long way in keeping your models solid in production and ready for whatever the data throws at them.</p>
<h2 id="heading-interpretation-and-communication-pitfalls"><strong>Interpretation and Communication Pitfalls</strong></h2>
<p>Clear, responsible communication is just as important as accurate modeling. But it is very easy to slip into habits that make results look more certain, more compelling, and more reliable than they really are. These missteps can lead teams to act on insights that don’t hold up.</p>
<p><strong>Overconfidence in Statistical Significance</strong></p>
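<p>Testing lots of variables without making adjustments can make weak signals look important. Imagine you run a dozen A/B tests and pick the one with a p-value below 0.05. If the tests are independent and none of the changes has a real effect, the chance that at least one still falls below 0.05 is 1 - 0.95<sup>12</sup>, or about 46%. Without correcting for multiple comparisons, there’s a good chance that result is just noise.</p>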
<p><strong>Ignoring Practical Significance</strong></p>
<p>A result can be statistically significant but still meaningless when viewed in context. For example, a 0.1% lift in clickthrough rate may be technically real but not worth the cost of rolling out a change across the product.</p>
<p><strong>Model Explainability Missteps</strong></p>
<p>When explanation tools are used without context, they can confuse rather than clarify. Showing a ranked list of SHAP values might look impressive, but if the stakeholders don’t understand what the features mean or how they interact, the takeaway is lost.</p>
<p><strong>The Way Out</strong></p>
<ul>
<li><p>Be cautious with statistical significance. If you’re running several tests, apply corrections for multiple comparisons (Bonferroni or Benjamini-Hochberg methods, for instance) and avoid reporting only the findings that look significant while ignoring those that don’t.</p>
</li>
<li><p>Look beyond what is statistically true and ask whether it is practically useful. A small, significant change might not be worth acting on at the end of the day.</p>
</li>
<li><p>When using explainability tools like SHAP or LIME, don’t assume the outputs speak for themselves. Add plain-language summaries, relevant examples, and business context to make them actionable. It is better to explain less with clarity than more with confusion.</p>
</li>
</ul>
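<p>The two corrections named above can be sketched in a few lines of plain Python. The function names and p-values below are hypothetical. Bonferroni compares every p-value against <code>alpha / m</code>, while Benjamini-Hochberg sorts the p-values and rejects all hypotheses up to the largest rank <code>k</code> whose p-value is at most <code>(k / m) * alpha</code>.</p>
<pre><code class="lang-python">def bonferroni(p_values, alpha=0.05):
    """Reject only where p is at most alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p &lt;= cutoff for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] &lt;= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

p_values = [0.001, 0.009, 0.012, 0.041, 0.20, 0.74]   # hypothetical results
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
</code></pre>
<p>On these made-up numbers, Bonferroni keeps only the first result while Benjamini-Hochberg keeps the first three, which reflects the usual trade-off: Bonferroni is stricter, Benjamini-Hochberg tolerates a controlled share of false discoveries.</p>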
<p>These habits make your results easier to trust, interpret, and apply, which is ultimately the point of the work.</p>
<h2 id="heading-organizational-and-workflow-pitfalls"><strong>Organizational and Workflow Pitfalls</strong></h2>
<p>Analytics is most effective when it is collaborative and responsive. Gaps in team structure or feedback processes can slow progress and limit the value of your work.</p>
<p>Teams working in isolation are a frequent issue. When analysts, engineers, and business stakeholders do not share tools or goals, efforts get duplicated and insights become fragmented. For example, one team might define active users based on weekly logins, while another uses monthly engagements, resulting in mismatched reports.</p>
<p>Lack of feedback from deployed models is another pitfall. If no one tracks what happens after predictions are made, teams miss the opportunity to refine and improve their processes. Imagine a loan approval model is deployed with no follow-up on repayment behavior: it becomes difficult to tell whether the model is supporting sound lending decisions or increasing default risk.</p>
<p><strong>The way out</strong></p>
<ul>
<li><p>Encourage collaboration by forming cross-functional teams and coordinating around shared planning cycles. Align on definitions early and rely on centralized dashboards to ensure that everyone is working from the same source of truth.</p>
</li>
<li><p>Create feedback loops and make them a standard part of your workflow. Track real-world outcomes, and schedule regular post-deployment reviews to understand what is working and what is not.</p>
</li>
<li><p>Include end users alongside data teams and treat their input as essential to improving the system.</p>
</li>
</ul>
<p>Taking these actions helps analytics stay practical, consistent, and responsive to real needs.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Each stage of the data workflow benefits from clarity, structure, and shared understanding. The table below summarizes a key pitfall from each stage, together with the way out, to help teams build more reliable models and deliver results that hold up in real-world settings.</p>
<table><tbody><tr><td><p><strong>Category</strong></p></td><td><p><strong>Pitfall</strong></p></td><td><p><strong>Consequences</strong></p></td><td><p><strong>Recommended Approach</strong></p></td></tr><tr><td><p><strong>Data collection</strong></p></td><td><p>Unreliable sources</p></td><td><p>Skewed insights</p></td><td><p>Validate source quality and apply consistent standards</p></td></tr><tr><td><p><strong>Data preparation</strong></p></td><td><p>Silent data leakage</p></td><td><p>Inflated model performance without real-world value</p></td><td><p>Use proper data splits and audit derived features</p></td></tr><tr><td><p><strong>Modeling &amp; validation</strong></p></td><td><p>Overfitting through hyperparameter tuning</p></td><td><p>Strong validation results that don’t translate to reality</p></td><td><p>Use nested cross-validation (a structure where tuning happens inside training folds) and keep simple baselines for comparison</p></td></tr><tr><td><p><strong>Interpretation &amp; communication</strong></p></td><td><p>Overconfidence in statistical significance</p></td><td><p>Misleading conclusions from small or selective effects</p></td><td><p>Adjust for multiple comparisons and report confidence intervals alongside p-values</p></td></tr><tr><td><p><strong>Organizational &amp; workflow</strong></p></td><td><p>Fragmented teams</p></td><td><p>Redundant work and inconsistent metrics</p></td><td><p>Encourage collaboration with shared planning, dashboards, and definitions</p></td></tr></tbody></table>

<p>Strong analytic practice is built over time. Keeping these pitfalls in view helps teams stay consistent, improve delivery, and create results that stay useful across projects and contexts.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Forecast Time Series Data with Python Darts ]]>
                </title>
                <description>
                    <![CDATA[ When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time. There are various libraries for time series forecasting in Python, and Darts is one... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-forecast-time-series-data-with-python-darts/</link>
                <guid isPermaLink="false">68e40c4dd441014d7e52dc0d</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 18:37:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759775700643/6f7d18b3-2060-4708-b56e-3450acf58546.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time.</p>
<p>There are various libraries for time series forecasting in Python, and <a target="_blank" href="https://unit8co.github.io/darts/">Darts</a> is one of them. Unlike other forecasting libraries, Darts is a high-level forecasting library with algorithms to handle various time series data, regardless of the kind of trend they portray.</p>
<p>This tutorial will walk you through how you can forecast time series data using Python Darts. This will help you draw meaningful insights whenever you come across time series data such as stock prices, weather measurements, and so on.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-python-darts">What is Python Darts?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-dependencies">How to Set Up Dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-dataset">Understanding the Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-prepare-the-data-for-darts">How to Prepare the Data for Darts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-forecasting-model">How to Build a Forecasting Model</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-classical-model">Classical Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-machine-learning-models">Machine Learning Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-forecast-with-deep-learning-models">How to Forecast with Deep Learning models</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-model-evaluation">Model Evaluation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-backtesting">BackTesting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hyper-parameter-tuning">Hyper Parameter Tuning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-cases">Real-World Use Cases</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices">Best Practices</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-python-darts">What is Python Darts?</h2>
<p>Python Darts is an open-source library for time series analysis and forecasting. It offers models ranging from statistical time series models like ARIMA and SARIMA to machine learning and deep learning models like Prophet and LSTM.</p>
<p>It also provides tools for imputing missing values in time series data, and can handle problems ranging from univariate and multivariate to hierarchical time series.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we proceed, you will need to have the following:</p>
<ul>
<li><p>Python 3.9+ installed.</p>
</li>
<li><p>Jupyter Notebook, Google Colab, or Positron to run your code.</p>
</li>
<li><p>Download the <a target="_blank" href="https://www.kaggle.com/datasets/kalilurrahman/netflix-stock-data-live-and-latest">Netflix stock data</a>.</p>
</li>
<li><p>Have the following libraries installed:</p>
<ul>
<li><p><code>darts</code> for time series analysis</p>
</li>
<li><p><code>pandas</code> for data wrangling</p>
</li>
<li><p><code>matplotlib</code> for data visualization</p>
</li>
<li><p><code>lightgbm</code> for the gradient boosting model used later in this tutorial.</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-how-to-set-up-dependencies">How to Set Up Dependencies</h2>
<p>Load the following libraries.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> darts
<span class="hljs-keyword">from</span> darts <span class="hljs-keyword">import</span> TimeSeries
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> ARIMA
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> RegressionModel
<span class="hljs-keyword">from</span> lightgbm <span class="hljs-keyword">import</span> LGBMRegressor
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> RNNModel
<span class="hljs-keyword">from</span> darts.metrics <span class="hljs-keyword">import</span> mape
<span class="hljs-keyword">import</span> itertools
</code></pre>
<h2 id="heading-understanding-the-dataset">Understanding the Dataset</h2>
<p>The Netflix stock data contains historical daily prices of Netflix stock from the year 2002 to date.</p>
<p>Load the data and have a preview of it.</p>
<pre><code class="lang-python">netflix = pd.read_csv(<span class="hljs-string">"/kaggle/input/netflix-stock-data-live-and-latest/Netflix_stock_history.csv"</span>)
netflix[<span class="hljs-string">'Date'</span>] = pd.to_datetime(netflix[<span class="hljs-string">'Date'</span>], utc=<span class="hljs-literal">True</span>).dt.tz_convert(<span class="hljs-literal">None</span>)
netflix.head()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757927775470/2d4b542c-3869-40c5-844c-a733b5cc4bea.png" alt="Image showing the first 5 rows of the Netflix stock data" class="image--center mx-auto" width="1059" height="484" loading="lazy"></p>
<p>To forecast time series data, we need a <code>Date</code> column, which we already have, and the variable of interest. We have several variables, but for this tutorial, we will focus on the <code>Close</code> variable of Netflix stocks.</p>
<p>Let’s visualize the data to see how Netflix’s closing price has performed over the years.</p>
<pre><code class="lang-python">netflix.plot(x=<span class="hljs-string">'Date'</span>, y=<span class="hljs-string">'Close'</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757928810807/75a1fa13-4f2e-4bdd-a539-5eaf2663843a.png" alt="Image showing a line chart of Netflix stock data from 2000 to date" class="image--center mx-auto" width="1036" height="517" loading="lazy"></p>
<p>From the chart above, you can see that Netflix stock showed exponential growth in recent years. This means that the data is non-stationary: its statistical properties, such as the mean and variance, change over time.</p>
<p>There are a lot of random fluctuations in the data, which might make it difficult to forecast. Such data usually requires advanced models to handle the various fluctuations or noise present in the data.</p>
<h2 id="heading-how-to-prepare-the-data-for-darts"><strong>How to Prepare the Data for Darts</strong></h2>
<p>Before preparing the data for Darts, you need to take note of a few things.</p>
<p>First of all, if you look at our data preview earlier on, you will notice that it is recorded daily, but only on trading days. Darts works with regularly spaced time steps, so we also need to fill in the missing dates.</p>
<p>Copy and paste this code into your notebook.</p>
<pre><code class="lang-python">start = netflix[<span class="hljs-string">'Date'</span>].min()
end = netflix[<span class="hljs-string">'Date'</span>].max()

netflix = (
    netflix.set_index(<span class="hljs-string">'Date'</span>)
           .reindex(pd.date_range(start=start, end=end, freq=<span class="hljs-string">'D'</span>))
           .ffill()
           .reset_index()
           .rename(columns={<span class="hljs-string">'index'</span>: <span class="hljs-string">'Date'</span>})
)
netflix.head()
</code></pre>
<p>The code above ensures the <code>netflix</code> dataset has a continuous daily time series by filling in missing dates.</p>
<p>First, it finds the earliest <code>start</code> and latest <code>end</code> dates in the data, then creates a full daily date range between them.</p>
<p>By setting the <code>Date</code> column as the index and using <code>.reindex()</code> method, it inserts rows for any missing dates, which initially contain <code>NaN</code>.</p>
<p>The <code>.ffill()</code> method (forward fill) replaces these gaps by carrying forward the last known value, which is a common choice for stock data on days when markets are closed, such as weekends and holidays.</p>
<p>Finally, the index is reset, and the column is renamed back to <code>Date</code>, producing a clean, continuous dataset ready for time series analysis.</p>
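<p>To see the effect of <code>.reindex()</code> and <code>.ffill()</code> in isolation, here is a small sketch on a made-up two-row frame (the <code>toy</code> data and its prices are hypothetical, not taken from the Netflix dataset). It covers a Friday and the following Monday, so the weekend is missing.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical closing prices for a Friday and the following Monday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-08"]),
    "Close": [100.0, 103.0],
})

daily = pd.date_range(start=toy["Date"].min(), end=toy["Date"].max(), freq="D")
reindexed = toy.set_index("Date").reindex(daily)   # weekend rows appear as NaN
filled = reindexed.ffill()                         # weekend rows copy Friday's close

print(reindexed["Close"].tolist())
print(filled["Close"].tolist())
</code></pre>
<p>The first list contains two <code>NaN</code> entries for Saturday and Sunday, while the second repeats Friday’s value until Monday’s price takes over.</p>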
<p>Next, we need to convert the data to a Darts <code>TimeSeries</code> object to make it usable by the Darts library.</p>
<pre><code class="lang-python">series = TimeSeries.from_dataframe(
    netflix,
    time_col=<span class="hljs-string">'Date'</span>,
    value_cols=<span class="hljs-string">'Close'</span>,
)
</code></pre>
<p>The code above converts the <code>netflix</code> DataFrame into a Darts <code>TimeSeries</code> object, which is optimized for time series modeling and forecasting.</p>
<p>It takes the <code>Date</code> column (<code>time_col='Date'</code>) as the timeline and the <code>Close</code> column (<code>value_cols='Close'</code>) as the target values to forecast.</p>
<p>The resulting <code>series</code> object is now structured for use with Darts’ advanced forecasting models like ARIMA, Prophet, RNNs, and other time series algorithms.</p>
<p>Just like you would with any other machine learning model, you need to split your data into a training set and a validation set.</p>
<pre><code class="lang-python">train, val = series.split_before(<span class="hljs-number">0.8</span>)
</code></pre>
<h2 id="heading-how-to-build-a-forecasting-model"><strong>How to Build a Forecasting Model</strong></h2>
<p>When building a forecasting model, you have the privilege of trying various models and picking the best-performing one.</p>
<p>The Darts library has various algorithms for time series analysis, from popular statistical algorithms like the Auto Regressive Integrated Moving Average (ARIMA) and Moving Average (MA) models, to machine learning and deep learning algorithms like Prophet and Long Short Term Memory (LSTM).</p>
<p>Note, I will only demonstrate how these algorithms work - it’s not necessary that we get accurate model metrics. But with further feature engineering, hyperparameter tuning, and cross-validation, you can get good results on your own.</p>
<h3 id="heading-classical-model">Classical Model</h3>
<p>The classical approach uses statistical time series models such as ARIMA. ARIMA is made up of the following components:</p>
<ul>
<li><p><strong>AR (AutoRegressive):</strong> Predict the next value by looking at previous ones.</p>
</li>
<li><p><strong>I (Integrated):</strong> Remove trends by focusing on changes instead of raw values.</p>
</li>
<li><p><strong>MA (Moving Average):</strong> Learn from the errors of past predictions to improve accuracy.</p>
</li>
</ul>
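<p>The “Integrated” part is easier to picture with numbers. The sketch below is plain Python with made-up prices, and <code>difference</code> is a hypothetical helper rather than a Darts function. It shows how differencing turns a trending series into a series of changes, which is what the <code>d</code> order of an ARIMA model controls.</p>
<pre><code class="lang-python">def difference(values, order=1):
    """Apply first differencing `order` times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

prices = [10, 12, 15, 19, 24]    # hypothetical upward-trending series
print(difference(prices))        # day-to-day changes
print(difference(prices, 2))     # changes of the changes
</code></pre>
<p>Each round of differencing shortens the series by one value and removes one level of trend: here the second difference is constant even though the raw prices keep climbing.</p>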
<p>Run the code below in your notebook to fit an ARIMA model.</p>
<pre><code class="lang-python">arima_model = ARIMA()
arima_model.fit(train)
arima_forecast = arima_model.predict(len(val))
</code></pre>
<p>To visualize the forecast by the model, call the <code>.plot()</code> method on the <code>arima_forecast</code> object.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
arima_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758028284156/a40f2341-cfc6-4a9f-8297-e0511c2bb254.png" alt="Image showing the ARIMA model forecast of netflix stock " class="image--center mx-auto" width="820" height="646" loading="lazy"></p>
<p>You can improve the model by adding some additional parameters to the <code>ARIMA()</code> class. You can read more about that in the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.arima.html">Darts documentation</a>.</p>
<h3 id="heading-machine-learning-models"><strong>Machine Learning Models</strong></h3>
<p>Classical models like ARIMA are linear, so they struggle with non-linear patterns. Machine learning models can fill this gap. We’ll use the LightGBM model as an example.</p>
<p>LightGBM is a gradient boosting framework that builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the previous ones.</p>
<p>Although it was not designed to handle time series, with some feature engineering such as lags, rolling statistics, and seasonal indicators, you can make it learn patterns from time series data.</p>
<p>Run this code on your notebook to fit a LightGBM model on the Netflix data.</p>
<pre><code class="lang-python">lgbm = LGBMRegressor()
lgbm_model = RegressionModel(lags=<span class="hljs-number">12</span>, model=lgbm)
lgbm_model.fit(train)
lgbm_forecast = lgbm_model.predict(len(val))
</code></pre>
<p>In the code above, the <code>lags</code> argument is set to <code>12</code>, which means the model uses the closing prices of the previous 12 days as input features to predict the next value.</p>
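<p>Conceptually, lag features turn a series into an ordinary supervised learning table. The sketch below is plain Python with made-up prices and a hypothetical helper, <code>make_lagged_rows</code>. It illustrates the idea with three lags and is not how Darts implements it internally.</p>
<pre><code class="lang-python">def make_lagged_rows(values, lags):
    """Pair each target value with the `lags` values that precede it."""
    rows = []
    for t in range(lags, len(values)):
        features = values[t - lags:t]   # the previous `lags` observations
        target = values[t]              # the value the model learns to predict
        rows.append((features, target))
    return rows

closes = [10, 11, 13, 12, 15, 16]   # hypothetical closing prices
for features, target in make_lagged_rows(closes, lags=3):
    print(features, target)
</code></pre>
<p>Each printed row is one training example: three past prices as inputs and the following price as the label. With 12 lags the rows simply get wider.</p>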
<p>Let’s have a view of the forecast by running the following code.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
lgbm_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758029933172/54f34a69-4f6b-4b44-85ab-d0b45931d701.png" alt="Image showing the LightGBM model forecast of netflix stock " class="image--center mx-auto" width="813" height="631" loading="lazy"></p>
<p>You can read more about tuning the LightGBM model from the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.lgbm.html">Darts documentation</a> to improve the above model.</p>
<h3 id="heading-how-to-forecast-with-deep-learning-models"><strong>How to Forecast with Deep Learning models</strong></h3>
<p>You can go for deep learning models designed for time series, such as LSTM, a kind of Recurrent Neural Network (RNN) designed to capture long-term dependencies in sequential data.</p>
<p>Run the following code to build the LSTM model.</p>
<pre><code class="lang-python">lstm_model = RNNModel(model=<span class="hljs-string">'LSTM'</span>, input_chunk_length=<span class="hljs-number">12</span>, output_chunk_length=<span class="hljs-number">6</span>, n_epochs=<span class="hljs-number">100</span>)
lstm_model.fit(train)
lstm_forecast = lstm_model.predict(len(val))
</code></pre>
<p>Now let’s visualize the forecast and see what we have.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
lstm_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758116174578/2ff80218-2254-452d-8d4c-2f85c61612de.png" alt="Image showing the LSTM model forecast of Netflix stock " class="image--center mx-auto" width="682" height="526" loading="lazy"></p>
<p>You can look up the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.rnn_model.html">Darts documentation</a> to improve the model and check out other deep learning models also.</p>
<h2 id="heading-model-evaluation"><strong>Model Evaluation</strong></h2>
<p>Now that you have three models, you need to select the best one among them using the Mean Absolute Percentage Error (MAPE).</p>
<p>It expresses the average absolute error as a percentage of the actual values, and the closer your value is to 0, the better your model.</p>
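<p>As a reference for what the metric computes, here is a minimal plain-Python version. The function name <code>mape_percent</code> and the numbers are made up for illustration; in the tutorial itself you use the <code>mape</code> function from <code>darts.metrics</code>.</p>
<pre><code class="lang-python">def mape_percent(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

actual = [100.0, 200.0, 400.0]      # hypothetical true values
forecast = [110.0, 180.0, 400.0]    # hypothetical predictions
print(mape_percent(actual, forecast))
</code></pre>
<p>The first two forecasts are each off by 10% and the third is exact, so the average is about 6.67%. Note that the metric is undefined whenever an actual value is zero.</p>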
<p>Run the following to print the MAPE of each respective model.</p>
<pre><code class="lang-python">arima_error = mape(val, arima_forecast)
print(<span class="hljs-string">"MAPE:"</span>, arima_error)
lgbm_error = mape(val, lgbm_forecast)
print(<span class="hljs-string">"MAPE:"</span>, lgbm_error)
lstm_error = mape(val, lstm_forecast)
print(<span class="hljs-string">"MAPE:"</span>, lstm_error)
</code></pre>
<pre><code class="lang-bash">&gt; MAPE: 38.33262525601514
&gt; MAPE: 39.00241495209449
&gt; MAPE: 38.82910057097827
</code></pre>
<p>The model with the lowest MAPE is the ARIMA model at approximately 38.33%, which makes it our best-performing model, although the three scores are close.</p>
<h2 id="heading-backtesting">BackTesting</h2>
<p>Darts has a feature called backtesting that allows you to evaluate your models based on historical data, using a rolling forecast.</p>
<p>Backtesting is like a time machine for forecasting. It simulates how your model would have performed in the past by repeatedly training it on historical data up to a certain point, making a prediction for the next step, then moving forward, and repeating the process.</p>
<p>This rolling evaluation simulates how the model would behave in real-world conditions, where future data is unknown, helping you measure its consistency and reliability over time, instead of just testing it once on a single validation set.</p>
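<p>The rolling procedure can be sketched in plain Python. Everything here is hypothetical: the numbers are made up, <code>naive_forecast</code> is a stand-in model that repeats the last observed value, and <code>rolling_backtest</code> is only a simplified picture of the idea, not the Darts implementation.</p>
<pre><code class="lang-python">def naive_forecast(history, horizon):
    """Stand-in model: repeat the last observed value."""
    return [history[-1]] * horizon

def rolling_backtest(values, start, horizon=1, stride=1):
    """Forecast from expanding windows and return the MAPE in percent."""
    errors = []
    for origin in range(start, len(values) - horizon + 1, stride):
        history = values[:origin]                        # data known at the origin
        predicted = naive_forecast(history, horizon)[-1]  # last point of the forecast
        actual = values[origin + horizon - 1]
        errors.append(abs((actual - predicted) / actual))
    return 100 * sum(errors) / len(errors)

series_values = [100, 102, 101, 105, 110, 108, 112, 115]
print(rolling_backtest(series_values, start=5))             # three one-step forecasts
print(rolling_backtest(series_values, start=5, horizon=3))  # a single three-step forecast
</code></pre>
<p>With a one-step horizon, the loop produces three forecasts from three different origins. When the horizon is as long as the remaining data, only one window fits, so the score rests on a single forecast.</p>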
<p>Since the ARIMA model is currently our best-performing model, run the code below to implement backtesting.</p>
<pre><code class="lang-python">
<span class="hljs-comment"># Perform backtesting on the training + validation series</span>
backtest_series = train.concatenate(val)

<span class="hljs-comment"># Backtest</span>
backtest_forecast = arima_model.historical_forecasts(
    series=backtest_series,
    start=<span class="hljs-number">0.8</span>,          <span class="hljs-comment"># fraction of the series to start forecasting from</span>
    forecast_horizon=len(val),
    stride=<span class="hljs-number">1</span>,           <span class="hljs-comment"># step size of rolling forecast</span>
    retrain=<span class="hljs-literal">True</span>,       <span class="hljs-comment"># retrain the model at each step</span>
    verbose=<span class="hljs-literal">True</span>
)

<span class="hljs-comment"># Compute metrics</span>
error = mape(backtest_series[-len(val):], backtest_forecast)
print(<span class="hljs-string">f"MAPE: <span class="hljs-subst">{error:<span class="hljs-number">.2</span>f}</span>%"</span>)
</code></pre>
<pre><code class="lang-bash">&gt; historical forecasts: 100%|██████████| 1/1 [00:02&lt;00:00,  2.69s/it]MAPE: 47.27%
</code></pre>
<p>In the code above,</p>
<ul>
<li><p>The <code>start</code> argument defines where backtesting begins, which in this case is 80% of the way through the series, so the forecasts cover the last 20% of the data.</p>
</li>
<li><p>The <code>forecast_horizon</code> is how many steps ahead to forecast at each point.</p>
</li>
<li><p>The <code>stride</code> is the number of time steps between two consecutive forecasts.</p>
</li>
<li><p>The <code>retrain=True</code> refits the model at each step for realistic evaluation.</p>
</li>
</ul>
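<p>You can see that the MAPE after backtesting is higher than the one from the single train/validation split. Keep in mind that because <code>forecast_horizon</code> is set to <code>len(val)</code>, only one forecast window fits after the start point (the progress bar shows a single iteration), so this score comes from one long-horizon forecast. A shorter <code>forecast_horizon</code> produces many rolling forecasts and a more representative estimate, at the cost of a longer run time.</p>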
<p>On your own, you can try to replicate backtesting for the other models.</p>
<h2 id="heading-hyper-parameter-tuning">Hyper Parameter Tuning</h2>
<p>The ARIMA model has three hyperparameters:</p>
<ul>
<li><p><code>p</code> which is the AR order</p>
</li>
<li><p><code>d</code> which is the differencing order</p>
</li>
<li><p><code>q</code> which is the MA order</p>
</li>
</ul>
<p>You can use either grid or random search to tune your ARIMA model in Darts.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define possible values</span>
p_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">4</span>)
d_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">3</span>)
q_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">4</span>)

best_mape = float(<span class="hljs-string">'inf'</span>)
best_params = <span class="hljs-literal">None</span>

<span class="hljs-keyword">for</span> p, d, q <span class="hljs-keyword">in</span> itertools.product(p_values, d_values, q_values):
    <span class="hljs-keyword">try</span>:
        arima_model = ARIMA(p=p, d=d, q=q)
        arima_model.fit(train)
        arima_forecast = arima_model.predict(len(val))
        arima_error = mape(val, arima_forecast)
        <span class="hljs-keyword">if</span> arima_error &lt; best_mape:
            best_mape = arima_error
            best_params = (p, d, q)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Some combinations may fail</span>
        <span class="hljs-keyword">continue</span>

print(<span class="hljs-string">f"Best ARIMA params: p=<span class="hljs-subst">{best_params[<span class="hljs-number">0</span>]}</span>, d=<span class="hljs-subst">{best_params[<span class="hljs-number">1</span>]}</span>, q=<span class="hljs-subst">{best_params[<span class="hljs-number">2</span>]}</span> with MAPE=<span class="hljs-subst">{best_mape:<span class="hljs-number">.2</span>f}</span>%"</span>)
</code></pre>
<pre><code class="lang-bash">&gt; Best ARIMA params: p=2, d=0, q=3 with MAPE=35.95%
</code></pre>
<p>In the above code, you define a range of possible values for the <code>p</code>, <code>d</code>, and <code>q</code> components, iterate over each combination of those values, and keep the combination with the lowest MAPE.</p>
<p>Note that each model has its own hyperparameters to tune, and you will need to check <a target="_blank" href="https://unit8co.github.io/darts/userguide/hyperparameter_optimization.html">the Darts documentation</a> for the hyperparameters of other models.</p>
<h2 id="heading-real-world-use-cases"><strong>Real-World Use Cases</strong></h2>
<p>Forecasting time series data has a lot of real-world applications, some of which are:</p>
<ul>
<li><p><strong>Stock price prediction:</strong> Like the dataset used in this tutorial, forecasting is used in finance for stock price prediction, allowing investors to manage risk.</p>
</li>
<li><p><strong>Demand forecasting for inventory:</strong> As a store owner, you can forecast product demands based on past sales of a product. This lets you know products that are in high demand.</p>
</li>
<li><p><strong>Energy consumption prediction:</strong> Governments, industries, and consumers can plan and manage energy production, distribution, and consumption efficiently, based on data from past usage. This helps to avoid blackouts and wastage, enabling them to prepare ahead.</p>
</li>
</ul>
<h2 id="heading-best-practices">Best Practices</h2>
<ul>
<li><p><strong>Always visualize residuals:</strong> Residuals are the difference between forecasted values and actual values. Visualize them to detect outliers, unusual events, and patterns the model failed to capture.</p>
</li>
<li><p><strong>Perform proper backtesting:</strong> Backtesting evaluates a model on historical data as if you were forecasting at past points in time, which gives a more realistic picture of how it behaves as conditions change. Backtesting all your candidate models helps you choose one that forecasts well.</p>
</li>
<li><p><strong>Avoid data leakage:</strong> To avoid biased evaluation, never train your models on validation data. Split time series chronologically, and use cross-validation where necessary.</p>
</li>
<li><p><strong>Use domain knowledge for feature engineering:</strong> Ensure you understand the data you are working with. This comes in handy in feature engineering, when you want to come up with new features to help your forecasting model, especially in multivariate time series forecasting.</p>
</li>
</ul>
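<p>The first and third practices above can be sketched in a few lines of plain Python. This is only an illustration: the series values are made up, the "model" is a naive stand-in that repeats the last training value, and residuals are taken as actual minus forecast (the opposite sign convention is also common).</p>

```python
# Hypothetical monthly values, oldest first
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]

# Chronological split: the validation set is strictly later than the training set,
# so no future information leaks into training
split = int(len(series) * 0.8)
train, valid = series[:split], series[split:]

# Stand-in "model": a naive forecast that repeats the last training value
forecast = [train[-1]] * len(valid)

# Residuals, here defined as actual minus forecast
residuals = [a - f for a, f in zip(valid, forecast)]
print(residuals)
```

<p>Plotting these residuals over time, or as a histogram, makes large misses and systematic over- or under-forecasting easier to spot.</p>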
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>This tutorial is more like an overview, especially if you are new to time series, but you can build a lot just from what you have learned.</p>
<p>You now have an idea of what time series and forecasting are, and how you can use the Darts Python library to work with them.</p>
<p>You also learned about various models for forecasting time series data, and how you can apply techniques such as backtesting and hyperparameter tuning to achieve better results.</p>
<p>Another interesting feature of Darts is its ability to handle <a target="_blank" href="https://unit8co.github.io/darts/userguide/timeseries.html#hierarchical-time-series">hierarchical time series</a>, where data is organized at several levels of aggregation.</p>
<p>Darts is one of the most powerful time series libraries in Python and has a lot of models to handle various cases. You can proceed to explore models such as <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.transformer_model.html">Transformers</a> and also <a target="_blank" href="https://unit8co.github.io/darts/examples/01-multi-time-series-and-covariates.html">multi-series forecasting</a>, which are used for special use cases.</p>
<p>If you are interested in more data science and statistics articles, don’t forget to check out <a target="_blank" href="https://learndata.xyz/blog">my blog</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Rise of AI Analytics and What It Means for Industries ]]>
                </title>
                <description>
                    <![CDATA[ Businesses today are flooded with data. From online purchases to hospital records, every action generates information.  But data alone is not useful. What matters is how companies use it to make decisions.  This is where AI analytics comes in. It com... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-rise-of-ai-analytics-and-what-it-means-for-industries/</link>
                <guid isPermaLink="false">686d3f17ba9172cf1f2ba98b</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Tue, 08 Jul 2025 15:53:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751989824781/f5bd446f-cc03-4aca-9391-614dd4e8edb6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Businesses today are flooded with data. From online purchases to hospital records, every action generates information. </p>
<p>But data alone is not useful. What matters is how companies use it to make decisions. </p>
<p>This is where AI analytics comes in. It combines artificial intelligence with data analysis to find patterns, make predictions, and suggest actions.</p>
<p>In this article, you will learn what AI analytics is, why it’s growing so fast, and how it’s changing different industries. You will also learn about some of the open-source tools leading this change.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-ai-analytics">What is AI Analytics?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-is-ai-analytics-growing-so-fast">Why is AI Analytics Growing So Fast?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-areas-where-ai-analytics-shine">Areas Where AI Analytics Shine</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ai-analytics-in-retail">AI Analytics in Retail</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ai-analytics-in-healthcare">AI Analytics in Healthcare</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ai-analytics-in-finance">AI Analytics in Finance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ai-analytics-in-manufacturing">AI Analytics in Manufacturing</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-core-benefits-of-ai-analytics">Core Benefits of AI Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-of-ai-analytics">Challenges of AI Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-role-of-humans-in-ai-analytics">The Role of Humans in AI Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-popular-open-source-ai-analytics-tools">Popular Open-Source AI Analytics Tools</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-future-of-ai-analytics">The Future of AI Analytics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-ai-analytics"><strong>What is AI Analytics?</strong></h2>
<p><a target="_blank" href="https://www.ibm.com/think/topics/ai-analytics">AI analytics</a> uses artificial intelligence to process and analyse data. </p>
<p>Traditional data analytics focused on what happened in the past. AI analytics goes further. It can tell you why something happened, what will likely happen next, and what you should do about it.</p>
<p>For example, if sales drop in a store, traditional reports only show the numbers. </p>
<p>AI analytics looks at customer behaviour, market trends, and past data to explain why sales dropped and suggest ways to increase them again.</p>
<h2 id="heading-why-is-ai-analytics-growing-so-fast"><strong>Why is AI Analytics Growing So Fast?</strong></h2>
<p>The primary reason is the explosion of data. </p>
<p>Companies now collect massive amounts of data from websites, apps, sensors, and machines. Traditional tools can’t handle this scale of information, but AI models are built for it.</p>
<p>Another reason is cheaper computing power. In the past, running AI models required expensive hardware. Today, with cloud computing and open-source software like <a target="_blank" href="https://www.freecodecamp.org/news/pytorch-vs-tensorflow-for-deep-learning-projects/">TensorFlow and PyTorch</a>, AI analytics is within reach of far more companies.</p>
<p>A third reason is better algorithms. AI models have become smarter and easier to use. Libraries like Scikit-learn and H2O.ai offer ready-made models that save time and effort for data scientists.</p>
<h2 id="heading-areas-where-ai-analytics-shine">Areas Where AI Analytics Shine</h2>
<h3 id="heading-ai-analytics-in-retail"><strong>AI Analytics in Retail</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751528082497/aadc1d43-4b23-4424-9553-4972a4836540.jpeg" alt="AI in retail" class="image--center mx-auto" width="1792" height="1024" loading="lazy"></p>
<p><a target="_blank" href="https://www.sap.com/resources/ai-in-retail">Retail companies use AI analytics</a> to understand customers better and improve their shopping experience. One common use is personalised recommendations. Online stores use AI models to suggest products based on your browsing and purchase history. Libraries like LightFM help build these recommendation systems.</p>
<p>AI analytics also helps retailers manage inventory. By predicting what products will sell in the coming weeks, stores can stock up accordingly and reduce waste. Some retailers even use AI to design store layouts that increase sales by studying how customers move inside stores.</p>
<h3 id="heading-ai-analytics-in-healthcare"><strong>AI Analytics in Healthcare</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751528111195/623bb050-1b8c-409a-884f-44a26f9c0ccd.png" alt="AI in Healthcare" class="image--center mx-auto" width="1000" height="750" loading="lazy"></p>
<p>Thanks to AI, <a target="_blank" href="https://hexaware.com/blogs/data-analytics-in-healthcare/">data analytics in the health industry</a> has seen huge growth. Hospitals now use AI analytics to predict which patients are at risk of readmission. This helps doctors take preventive action before problems get worse.</p>
<p>AI also improves diagnosis accuracy. For example, deep learning models can analyse X-rays and MRI scans to detect diseases like cancer at an early stage. Hospitals use open-source tools like TensorFlow to build these image recognition models.</p>
<p>Another area is staff management. AI analytics helps hospitals allocate nurses and doctors based on predicted patient inflow, making operations more efficient.</p>
<h3 id="heading-ai-analytics-in-finance"><strong>AI Analytics in Finance</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751528160043/e30d300a-cf87-43f3-812e-76e04ac18079.png" alt="AI in Finance" class="image--center mx-auto" width="1081" height="601" loading="lazy"></p>
<p>Banks and financial firms rely heavily on AI analytics. </p>
<p>One important use is fraud detection. <a target="_blank" href="https://www.ibm.com/think/topics/artificial-intelligence-finance">AI models analyse millions of transactions</a> in real time to spot unusual patterns, stopping fraud before it happens. Open-source tools like H2O.ai help build these models efficiently.</p>
<p>Another use is credit scoring. Traditional credit scores only looked at a few factors. AI analytics can process more data points, creating fairer and more accurate credit scores for loan approvals.</p>
<p>Investment firms use AI analytics to predict stock market trends. Tools like Prophet by Facebook allow analysts to forecast future prices based on past data, improving investment strategies.</p>
<h3 id="heading-ai-analytics-in-manufacturing"><strong>AI Analytics in Manufacturing</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751528196144/c08e9c33-a3ac-4c38-bb23-296a57a2cf67.png" alt="AI in Manufacturing" class="image--center mx-auto" width="1086" height="548" loading="lazy"></p>
<p>Factories use <a target="_blank" href="https://www.ibm.com/think/topics/ai-in-manufacturing">AI analytics to improve operations</a> and reduce costs. One major use is predictive maintenance. Machines often fail without warning, causing production delays. AI analytics predicts when machines are likely to break down by analysing sensor data, allowing timely maintenance.</p>
<p>Factories also use AI to optimise production schedules. AI models analyse past production data, raw material availability, and market demand to plan manufacturing activities efficiently. This reduces costs and increases output.</p>
<h2 id="heading-core-benefits-of-ai-analytics"><strong>Core Benefits of AI Analytics</strong></h2>
<p>AI analytics helps companies make faster and better decisions. It processes data in minutes and suggests the best course of action. This saves time and resources.</p>
<p>Using AI analytics also leads to cost savings. Automation reduces the need for manual analysis and lowers the chance of human error.</p>
<p>Finally, AI analytics gives companies a competitive advantage. Businesses that use AI can respond to market changes quickly, stay ahead of competitors, and offer better services to customers.</p>
<h2 id="heading-challenges-of-ai-analytics"><strong>Challenges of AI Analytics</strong></h2>
<p>Despite its many benefits, AI analytics has some challenges. </p>
<p>One is data privacy. Industries like healthcare and finance deal with sensitive data that must be protected while using AI models.</p>
<p>To mitigate this, teams can implement strong data governance policies, use data anonymisation techniques, and ensure compliance with regulations like <a target="_blank" href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">GDPR</a> and <a target="_blank" href="https://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act">HIPAA</a>.</p>
<p>Another challenge is the lack of skilled professionals. Building AI models requires knowledge of data science and programming, which many companies still lack today. Businesses can address this by investing in training for existing staff, hiring specialised talent, or using user-friendly AutoML tools that reduce the need for advanced coding skills.</p>
<p>Bias in AI models is also a concern. If the data used to train the model is biased, the AI predictions will also be biased. This can lead to unfair decisions, especially in areas like credit scoring or hiring. To reduce bias, teams should audit the data regularly and involve diverse stakeholders when designing and validating models.</p>
<h2 id="heading-the-role-of-humans-in-ai-analytics">The Role of Humans in AI Analytics</h2>
<p>While AI analytics can process huge amounts of data and suggest actions, humans remain essential in the entire process. Data scientists and analysts design the AI models, decide which data to use, and define what questions the AI should answer.</p>
<p>After AI produces results, data scientists analyse its outputs to check for accuracy and relevance. For example, an AI model might suggest increasing inventory for a product, but a human analyst will assess whether other factors like seasonality or upcoming trends have been considered properly.</p>
<p>Monitoring AI models is another crucial role for humans. Over time, models can become outdated if the data they were trained on no longer reflects current realities, a problem known as <a target="_blank" href="https://domino.ai/data-science-dictionary/model-drift">model drift</a>. Data scientists regularly retrain and test models to maintain their accuracy.</p>
<p>Finally, we have to ensure that AI outputs are ethical and unbiased. We have to check for unfair recommendations or decisions, especially in sensitive areas like healthcare or finance, and adjust models to reduce any bias found.</p>
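<p>As a rough illustration of the model drift monitoring described above, the sketch below compares a model's recent error with its error at deployment time and flags the model for retraining when the gap grows too large. The function names, the sample numbers, and the 1.5x tolerance are all assumptions made for this example, not an established standard.</p>

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_retraining(baseline_mae, recent_mae, tolerance=1.5):
    """Flag drift when the recent error exceeds the baseline by the given factor."""
    return recent_mae > tolerance * baseline_mae


# Made-up numbers: error measured at deployment vs. error on recent data
baseline = mean_abs_error([10, 12, 14, 16], [11, 12, 13, 17])
recent = mean_abs_error([20, 25, 30, 35], [17, 21, 33, 31])
print(needs_retraining(baseline, recent))
```

<p>In practice, a human analyst would still review a flagged model before retraining it, since a jump in error can also come from a data pipeline issue rather than a real change in behaviour.</p>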
<h2 id="heading-popular-open-source-ai-analytics-tools"><strong>Popular Open-Source AI Analytics Tools</strong></h2>
<p>Several open-source tools are making AI analytics accessible to everyone. </p>
<ul>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/tensorflow-basics/">TensorFlow</a> is a deep learning framework by Google used for building complex AI models in healthcare, finance, and retail.</p>
</li>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/learn-pytorch-in-five-projects/">PyTorch</a> is another popular tool, preferred by researchers for its flexibility in building neural networks.</p>
</li>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/machine-learning-with-python-and-scikit-learn/">Scikit-learn</a> is widely used for traditional machine learning tasks such as classification and regression.</p>
</li>
<li><p><a target="_blank" href="https://h2o.ai/">H2O.ai</a> offers automated machine learning features, making it easier for businesses without large data science teams to build models.</p>
</li>
<li><p><a target="_blank" href="https://www.knime.com/">KNIME</a> provides a visual workflow that integrates AI models with business data systems, while <a target="_blank" href="https://spark.apache.org/mllib/">Apache Spark MLlib</a> is useful for analysing large datasets quickly. </p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/RapidMiner">RapidMiner</a> is also popular for building and deploying data science models in production environments.</p>
</li>
</ul>
<h2 id="heading-the-future-of-ai-analytics"><strong>The Future of AI Analytics</strong></h2>
<p>AI analytics is only going to grow stronger. </p>
<p>In the future, companies will use AI for real-time decision-making and industries will be able to act instantly based on live data streams.</p>
<p>Explainable AI will also become important. Businesses will demand AI models that clearly explain their predictions, building trust in automated decisions.</p>
<p>As AI tools become easier to use, even small businesses will adopt AI analytics to compete with larger firms. For example, a small clinic may use AI to predict patient no-shows and send reminders, improving efficiency and revenue.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>AI analytics is changing how industries work. In the healthcare sector, data analytics is helping hospitals save lives through better predictions. Retailers are using AI to personalise shopping experiences. Banks are using it to stop fraud and improve lending decisions. Factories are becoming more efficient with predictive maintenance.</p>
<p>Businesses that start using AI analytics today will lead their industries tomorrow. The time to adopt AI analytics is now, to make better decisions, reduce costs, and stay ahead in this fast-changing world.</p>
<p>Hope you enjoyed this article. You can <a target="_blank" href="https://www.linkedin.com/in/manishmshiva/">find me on LinkedIn</a> if you want to connect. If you are interested in taking up data analytics as a career, <a target="_blank" href="https://grow.google/intl/en_in/data-analytics-course">Google has a free course</a>. See you soon with a new article.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Extract  YouTube Analytics Data and Analyze in Python ]]>
                </title>
                <description>
                    <![CDATA[ If you’re a YouTube content creator, you’ll make data-driven decisions when posting content. This helps you target the right audience when creating your videos. YouTube Studio provides YouTube Analytics, where you can get comprehensive data about you... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/extract-youtube-analytics-data-and-analyze-in-python/</link>
                <guid isPermaLink="false">67e425c92a171465d4fb4cce</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Wed, 26 Mar 2025 16:05:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743005089726/39e2323d-8f7b-4bf4-94cb-288aeb9cea4f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re a YouTube content creator, you’ll want to make data-driven decisions when posting content. This helps you target the right audience when creating your videos.</p>
<p>YouTube Studio provides YouTube Analytics, where you can get comprehensive data about your channel. But there is a caveat: most of the statistics provided by YouTube Analytics are descriptive and not predictive. Information like future views, subscriber counts, and the factors influencing watch time or earnings is unavailable, so you’ll need to calculate these metrics yourself.</p>
<p>In this article, you’ll learn how to export data from YouTube Analytics to Python so you can analyze it further or create visualizations. You can even build your own custom dashboard using various Python libraries like <a target="_blank" href="https://streamlit.io/">Streamlit</a>, <a target="_blank" href="https://shiny.posit.co/py/">Shiny</a>, or <a target="_blank" href="https://dash.plotly.com/">Dash</a>.</p>
<h3 id="heading-heres-what-we">Here’s what we’ll cover</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-identify-the-problem-statement">Step 1: Identify the Problem Statement</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-extract-the-data">Step 2: Extract the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-analyze-the-data-in-python">Step 3: Analyze the Data in Python</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-correlation-analysis">Correlation Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-audience-retention-analysis">Audience Retention Analysis</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>Active YouTube and YouTube Studio Account</p>
</li>
<li><p>Jupyter Notebook, Google Colab, Kaggle, or any other environment that supports Python</p>
</li>
<li><p><a target="_blank" href="https://pandas.pydata.org/">Pandas</a> library installed</p>
</li>
<li><p><a target="_blank" href="https://seaborn.pydata.org/">Seaborn</a> library installed</p>
</li>
<li><p><a target="_blank" href="https://matplotlib.org/">Matplotlib</a> library installed</p>
</li>
</ul>
<h2 id="heading-step-1-identify-the-problem-statement">Step 1: Identify the Problem Statement</h2>
<p>Before proceeding, we need to know what we’re looking for – because YouTube Analytics has many metrics, and this can get overwhelming. My channel doesn’t have a ton of subscribers, but I have quite a few videos and views. So we’ll use my data as an example.</p>
<p>Note that the analysis I’ll conduct in this tutorial is specific to my channel, and results can vary from channel to channel. You’ll be able to use the techniques here to answer the same or similar questions using your data, but your results will be different from mine.</p>
<p>Here are the questions I would like to answer:</p>
<ol>
<li><strong>Correlation Analysis</strong></li>
</ol>
<ul>
<li><p><strong>Views and watch time</strong> – Are longer watch times associated with higher views?</p>
</li>
<li><p><strong>Views and subscribers</strong> – Do more views translate to more subscribers?</p>
</li>
<li><p><strong>Impressions and Click-Through Rate (CTR%)</strong> – Do more impressions lead to a higher click-through rate?</p>
</li>
<li><p><strong>Watch time and average view duration</strong> – Do videos that are watched for longer on average accumulate more watch time?</p>
</li>
</ul>
<ol start="2">
<li><strong>Audience Retention Analysis</strong></li>
</ol>
<ul>
<li><p><strong>Average view duration vs. Video length</strong> – Are longer videos watched in full?</p>
</li>
<li><p><strong>Drop-off points</strong> – Which duration range has the best retention?</p>
</li>
<li><p><strong>Retention Rate (%)</strong> – What share of each video’s length is watched on average (average view duration divided by duration)?</p>
</li>
</ul>
<h2 id="heading-step-2-extract-the-data">Step 2: Extract the Data</h2>
<p>Sign in to your YouTube Studio account, go to the Analytics tab, and click Advanced mode.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548010236/1392de34-a280-4117-9a3d-feda80392f62.png" alt="Image showing YouTube Analytics Dashboard and the Advanced Mode" class="image--center mx-auto" width="1920" height="927" loading="lazy"></p>
<p>This will open a dashboard showing comprehensive descriptive analytics of your YouTube channel. This can get overwhelming, as there are a lot of metrics and filters with various types of data. This is why I emphasized the importance of knowing your problem and identifying your questions before diving in.</p>
<p>You can select the range of data you are interested in using the date dropdown (1 in the image below) and the Compare to button (2) to compare data from different date ranges.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548329162/3b8be0ea-769a-4723-b427-f911b3cfec83.png" alt="Image showing the date dropdown and the Compare to button" class="image--center mx-auto" width="1914" height="904" loading="lazy"></p>
<p>The column headers you see in the dashboard are the filters. Each contains different metrics, and some metrics appear under more than one filter. You can play around with the tabs and dropdowns to understand them better.</p>
<p>This is just a foundation for understanding your YouTube channel performance. If you have a long-running channel with a large number of subscribers and views, trust me – you can get a lot of insights from your data.</p>
<p>For this tutorial, I will select my entire lifetime data (1) and click the download button at the top right-hand corner (2).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548442210/8fbddcac-98cb-4e52-9355-5383e6afc172.png" alt="Image showing the lifetime option under the date dropdown" class="image--center mx-auto" width="1900" height="915" loading="lazy"></p>
<p>This will display two options: whether to open the data in Google Sheets in a new tab or download the CSV file.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548490620/c8829a2b-228b-45fd-8789-45dfb397f2da.png" alt="Image showing the download options to open the data in a google sheets new tab or download the csv" class="image--center mx-auto" width="718" height="474" loading="lazy"></p>
<p>Since we want to use the data in Python, select the option to download the CSV file. After downloading the file, extract the files from the zip folder, and inside the extracted folder, you will see three CSV files: <code>Chart data.csv</code>, <code>Table data.csv</code>, and <code>Totals.csv</code>.</p>
<p>For this tutorial, we are interested in the <code>Table data.csv</code> file. Open it in Excel to do some manual data cleaning before importing the data into Python.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548741025/ace69aaf-bb0e-40de-aa1e-e716bb4182aa.png" alt="Image showing the Table data in Excel" class="image--center mx-auto" width="1891" height="604" loading="lazy"></p>
<p>The data lists every video on my YouTube channel, forty in total (yours might have more or fewer). Remove the first row, which is the <code>Total</code> row, and save the changes.</p>
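<p>If you would rather skip the manual step, the same cleanup can be sketched in pandas. This example assumes the summary row is labelled <code>Total</code> in the <code>Content</code> column, and it uses a tiny made-up CSV in place of the real export so that it runs on its own.</p>

```python
import io

import pandas as pd

# Tiny stand-in for "Table data.csv"; the values are made up for illustration
sample_csv = io.StringIO(
    "Content,Video title,Views\n"
    "Total,,150\n"
    "abc123,First video,100\n"
    "def456,Second video,50\n"
)
df = pd.read_csv(sample_csv)

# Keep every row except the summary row (assumed to be labelled "Total")
df = df[df["Content"] != "Total"].reset_index(drop=True)
print(len(df))
```

<p>With the real export, you would pass the path to <code>Table data.csv</code> to <code>pd.read_csv</code> instead of the in-memory sample.</p>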
<p>Here are the columns in the dataset:</p>
<ul>
<li><p><code>Content</code>: The video id</p>
</li>
<li><p><code>Video title</code>: The video title</p>
</li>
<li><p><code>Video publish time</code>: The day the video was published</p>
</li>
<li><p><code>Duration</code>: The video duration in seconds</p>
</li>
<li><p><code>Views</code>: The number of views per video</p>
</li>
<li><p><code>Watch time</code>: The estimated amount of video watch time by your audience in hours</p>
</li>
<li><p><code>Subscribers</code>: Change in total subscribers found by subtracting subscribers lost from subscribers gained for the selected date and region.</p>
</li>
<li><p><code>Average view duration</code>: Estimated average minutes watched per video.</p>
</li>
<li><p><code>Impressions</code>: Number of times your videos were shown to viewers.</p>
</li>
<li><p><code>Impressions click-through rate (%)</code>: Percentage of impressions that led to a view, that is, how often viewers clicked your video after seeing an impression.</p>
</li>
</ul>
<h2 id="heading-step-3-analyze-the-data-in-python">Step 3: Analyze the Data in Python</h2>
<p>Go to your Jupyter Notebook and import the Pandas, Seaborn, and Matplotlib libraries.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>Next, import the <code>Table data.csv</code> file.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Load data</span>
df = pd.read_csv(<span class="hljs-string">"/content/Table data.csv"</span>)
</code></pre>
<h3 id="heading-correlation-analysis">Correlation Analysis</h3>
<p>Following our problem statement, we’ll plot a <a target="_blank" href="https://www.quanthub.com/how-to-read-a-correlation-heatmap/">correlation heatmap</a> of the following variables: <code>Views</code>, <code>Watch time (hours)</code>, <code>Subscribers</code>, <code>Average view duration</code>, <code>Impressions</code>, and <code>Impressions click-through rate (%)</code> to see the strength and direction of the relationships between them.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert "Average view duration" (formatted as H:M:S) to seconds</span>
df[<span class="hljs-string">'Average view duration'</span>] = pd.to_timedelta(df[<span class="hljs-string">'Average view duration'</span>]).dt.total_seconds()

<span class="hljs-comment"># Select relevant columns for correlation analysis</span>
correlation_data = df[[<span class="hljs-string">'Views'</span>, <span class="hljs-string">'Watch time (hours)'</span>, <span class="hljs-string">'Subscribers'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Impressions'</span>, <span class="hljs-string">'Impressions click-through rate (%)'</span>]]

<span class="hljs-comment"># Compute correlation matrix</span>
corr_matrix = correlation_data.corr()

<span class="hljs-comment"># Visualization using a heatmap</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
sns.heatmap(corr_matrix, annot=<span class="hljs-literal">True</span>, cmap=<span class="hljs-string">'coolwarm'</span>, fmt=<span class="hljs-string">".2f"</span>, linewidths=<span class="hljs-number">0.5</span>)
plt.title(<span class="hljs-string">"YouTube Analytics Correlation Heatmap"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742632975699/427811d8-09ca-4a8d-8fdc-98cdaf5b7033.png" alt="Correlation heatmap showing the relationship between the selected variables" class="image--center mx-auto" width="1174" height="913" loading="lazy"></p>
<p>The correlation coefficient ranges from -1 to 1. Values below 0 indicate a negative relationship, values above 0 indicate a positive relationship, and values near 0 indicate little or no linear relationship. The closer the value is to -1, the stronger the negative relationship, and the closer it is to 1, the stronger the positive relationship.</p>
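<p>To see where these numbers come from, here is a minimal sketch of the Pearson correlation coefficient, which is what <code>DataFrame.corr()</code> computes by default. The <code>pearson</code> helper and the sample values are made up for illustration and are not taken from my channel’s data.</p>

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Co-movement of the two variables around their means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Spread of each variable around its own mean
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


views = [100, 200, 300, 400]
watch_hours = [5, 9, 16, 20]   # rises along with views
ctr = [8.0, 6.0, 4.0, 2.0]     # falls as views rise

print(round(pearson(views, watch_hours), 2))  # close to 1: strong positive
print(round(pearson(views, ctr), 2))          # close to -1: strong negative
```

<p>The coefficient only measures linear association, so a value near 0 does not rule out a non-linear relationship between two metrics.</p>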
<p>Based on the plot above, here are the key insights:</p>
<ul>
<li><p><strong>Views and watch time</strong>: There's a strong correlation (0.94) between views and watch time, suggesting that as videos get more views, they also accumulate more watch hours, proportionally.</p>
</li>
<li><p><strong>Views and impressions</strong>: There's a strong correlation (0.89) between views and impressions, indicating that videos that are shown more frequently in recommendations and search results tend to get more views.</p>
</li>
<li><p><strong>Average view duration</strong>: This metric has very weak correlations with almost all other metrics, most notably with views (0.06), subscribers (0.01), and impressions (0.03).</p>
</li>
<li><p><strong>Subscribers and metrics</strong>: Subscribers have a moderate to strong correlation with views (0.75) and impressions (0.79) and a weaker correlation with click-through rate (0.54).</p>
</li>
<li><p><strong>Click-through rate</strong>: This metric has moderate correlations with views (0.69) and watch time (0.66) but a weaker correlation with subscribers (0.54).</p>
</li>
</ul>
<p>The most significant insight is that average view duration appears to operate independently from other metrics. This suggests that on my YouTube channel, a video's ability to retain viewers throughout its length isn't necessarily connected to how many people watch it, how often it's recommended, or how many subscribers the channel has.</p>
<p>This implies that the strategies I would implement to increase my views, subscribers, and impressions might differ from those needed to improve average view duration, an important factor in YouTube's recommendation algorithm. This means I need to look at other YouTube metrics that have a relationship with average view duration, which is a topic for another article.</p>
<h3 id="heading-audience-retention-analysis">Audience Retention Analysis</h3>
<p>To analyze audience retention, we need to create a new variable <code>Retention Rate (%)</code>, which is calculated by dividing a video’s <code>Average view duration</code> by the <code>Duration</code> and expressing it as a percentage.</p>
<pre><code class="lang-python">
<span class="hljs-comment"># Calculate retention rate as (Average View Duration / Total Video Duration) * 100</span>
df[<span class="hljs-string">'Retention Rate (%)'</span>] = (df[<span class="hljs-string">'Average view duration'</span>] / df[<span class="hljs-string">'Duration'</span>]) * <span class="hljs-number">100</span>
</code></pre>
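<p>As a quick sanity check of the formula, here is a minimal sketch on two made-up videos. The <code>sample</code> DataFrame, its titles, and its numbers are assumptions for illustration only. I also assume that <code>Duration</code> is already in seconds and that <code>Average view duration</code> arrives as an <code>HH:MM:SS</code> string:</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical two-video sample; the titles and numbers are made up
sample = pd.DataFrame({
    "Video title": ["Short tutorial", "Long walkthrough"],
    "Duration": [480, 600],                             # total length in seconds
    "Average view duration": ["00:02:00", "00:01:30"],  # HH:MM:SS strings
})

# Convert the HH:MM:SS strings to seconds so both columns share the same unit
sample["Average view duration"] = pd.to_timedelta(sample["Average view duration"]).dt.total_seconds()

# Retention rate = share of the video the average viewer watched, as a percentage
sample["Retention Rate (%)"] = (sample["Average view duration"] / sample["Duration"]) * 100

print(sample[["Video title", "Retention Rate (%)"]])
</code></pre>
<p>In this sample, the first video retains 25% of its length (120 of 480 seconds) and the second retains 15% (90 of 600 seconds).</p>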
<p>Next, sort the videos in descending order of <code>Retention Rate (%)</code> and display the 10 videos with the highest retention rate.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort videos by retention rate</span>
df_sorted = df.sort_values(by=<span class="hljs-string">'Retention Rate (%)'</span>, ascending=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Display top 10 videos with highest retention</span>
df_sorted[[<span class="hljs-string">'Video title'</span>, <span class="hljs-string">'Duration'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Retention Rate (%)'</span>]].head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634265073/fc5bac65-18f3-467a-a8da-85f95ae00488.png" alt="Image showing top ten videos by retention rate" class="image--center mx-auto" width="1194" height="550" loading="lazy"></p>
<p>From the table above, you will notice that most of the videos in the top 10 are no longer than 503 seconds, which is approximately 8 minutes. This implies that my audience is interested in short to mid-length videos.</p>
<p>Most videos with a high retention rate have a duration of less than 4 minutes, with retention rates ranging from 27% to 40%. With this insight, I can aim to keep my next uploads within 5 to 8 minutes.</p>
<p>Let’s take a look at the 10 videos with the lowest retention rate:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort videos by retention rate</span>
df_sorted = df.sort_values(by=<span class="hljs-string">'Retention Rate (%)'</span>, ascending=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Display bottom 10 videos with lowest retention</span>
df_sorted[[<span class="hljs-string">'Video title'</span>, <span class="hljs-string">'Duration'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Retention Rate (%)'</span>]].tail(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634531458/28b1d8e8-38d9-480e-8259-a30f659386a3.png" alt="Image showing bottom ten videos by retention rate" class="image--center mx-auto" width="1168" height="538" loading="lazy"></p>
<p>From the above information, you will notice that long videos on my channel, spanning approximately 22 to 58 minutes, have a low retention rate. This further supports the claim above that my audience is more interested in shorter videos.</p>
<p>We can also draw a scatter plot of <code>Retention Rate (%)</code> against <code>Duration</code> to summarize the tables above.</p>
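<p>As an alternative sketch, pandas also provides <code>nlargest</code> and <code>nsmallest</code>, which return the top or bottom rows for a column directly. The <code>videos</code> DataFrame below is made-up sample data, not my channel's export:</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical sample data; titles and values are made up for illustration
videos = pd.DataFrame({
    "Video title": ["A", "B", "C", "D", "E"],
    "Retention Rate (%)": [38.0, 12.5, 27.0, 9.0, 31.0],
})

# Two videos with the highest retention rate, highest first
top = videos.nlargest(2, "Retention Rate (%)")

# Two videos with the lowest retention rate, lowest first
bottom = videos.nsmallest(2, "Retention Rate (%)")

print(top)
print(bottom)
</code></pre>
<p>Note that <code>nsmallest</code> lists the lowest value first, whereas <code>tail(10)</code> on a descending sort lists it last.</p>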
<pre><code class="lang-python"><span class="hljs-comment"># Set style for plots</span>
sns.set_style(<span class="hljs-string">"whitegrid"</span>)

<span class="hljs-comment"># Plot Retention Rate vs. Video Duration</span>
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))

sns.scatterplot(data=df, x=<span class="hljs-string">'Duration'</span>, y=<span class="hljs-string">'Retention Rate (%)'</span>, hue=<span class="hljs-string">'Views'</span>, size=<span class="hljs-string">'Views'</span>, sizes=(<span class="hljs-number">20</span>, <span class="hljs-number">200</span>), palette=<span class="hljs-string">'coolwarm'</span>)
plt.title(<span class="hljs-string">"Audience Retention vs. Video Duration"</span>)
plt.xlabel(<span class="hljs-string">"Video Duration (seconds)"</span>)
plt.ylabel(<span class="hljs-string">"Retention Rate (%)"</span>)
plt.legend(title=<span class="hljs-string">"Views"</span>, loc=<span class="hljs-string">"upper right"</span>)

plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634776775/e024b61c-d86f-45d6-b8fb-13ff87e101e9.png" alt="Scatter plot showing audience retention against video duration" class="image--center mx-auto" width="1486" height="820" loading="lazy"></p>
<p>The <a target="_blank" href="https://byjus.com/commerce/scatter-diagram/">scatter plot</a> above shows the relationship between audience retention rate (y-axis, measured as a percentage) and video duration (x-axis, measured in seconds) for various videos. Here are the key observations:</p>
<ul>
<li><p>There's a clear negative correlation between video duration and retention rate – as videos get longer, the retention rate generally decreases.</p>
</li>
<li><p>The highest retention rates (35-40%) are found in shorter videos, mostly under 500 seconds (around 8 minutes).</p>
</li>
<li><p>Videos over 1500 seconds (25 minutes) consistently show retention rates below 15%.</p>
</li>
<li><p>The size and color of the dots represent the number of views, with larger, redder dots indicating more views (up to 1000) and smaller, blue dots representing fewer views (around 200).</p>
</li>
<li><p>Interestingly, some mid-length videos (around 500 seconds) have both higher view counts (indicated by larger red dots) and decent retention rates of about 25%.</p>
</li>
<li><p>The longest video in the dataset (at around 3500 seconds or 58 minutes) has a retention rate of about 14% and relatively few views.</p>
</li>
</ul>
<p>This plot further confirms the claim that shorter videos tend to better maintain audience attention on my channel, though some mid-length videos can still perform well in terms of both retention and view count.</p>
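<p>The visual impression can also be backed with numbers. The sketch below uses made-up sample data (the <code>videos</code> DataFrame and the band edges of 500 and 1500 seconds are assumptions chosen to mirror the thresholds discussed above). It computes the correlation between duration and retention, then averages retention within duration bands:</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical sample data; durations are in seconds, values are made up
videos = pd.DataFrame({
    "Duration": [180, 240, 450, 520, 900, 1600, 2400, 3500],
    "Retention Rate (%)": [40.0, 36.0, 30.0, 25.0, 20.0, 14.0, 12.0, 14.0],
})

# A negative coefficient means retention tends to fall as duration grows
duration_corr = videos["Duration"].corr(videos["Retention Rate (%)"])
print(round(duration_corr, 2))

# Group videos into duration bands (the edges, in seconds, are assumptions)
bands = pd.cut(
    videos["Duration"],
    bins=[0, 500, 1500, float("inf")],
    labels=["up to 500 s", "500 to 1500 s", "over 1500 s"],
)

# Average retention rate within each band
band_means = videos.groupby(bands, observed=True)["Retention Rate (%)"].mean()
print(band_means)
</code></pre>
<p>Running the same two steps on the real <code>df</code> would show whether the pattern in the scatter plot holds once the videos are averaged by length.</p>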
<h2 id="heading-conclusion">Conclusion</h2>
<p>What we’ve learned from my data is just the tip of the iceberg. YouTube has many metrics, and because my channel is not monetized and has few subscribers and videos, I don’t have data on monetization, demographics, and other metrics.</p>
<p>But after reading this article, I hope you can think of plenty of other insights you could draw from these metrics. You can even forecast your views, subscriber counts, and revenue for the coming days or months. You can also perform a multivariate time series analysis to see how these factors affect your primary variable of interest.</p>
<p>If you find this article interesting, don’t forget to check out my <a target="_blank" href="https://learndata.xyz/blog">blog</a> for other interesting articles, follow me on <a target="_blank" href="https://medium.com/@adejumo999">Medium</a>, connect on <a target="_blank" href="https://www.linkedin.com/in/adejumoridwan/">LinkedIn</a>, and subscribe to my <a target="_blank" href="http://www.youtube.com/@learndata_xyz">YouTube channel</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Microsoft Excel: 14 Time-Saving Keyboard Shortcuts ]]>
                </title>
                <description>
                    <![CDATA[ Microsoft Excel is the quintessential spreadsheet software used everywhere from universities to small businesses to enterprises. It’s a lifesaver for countless financial professionals, data analysts, and teachers. But it’s also one of those programs ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/microsoft-excel-keyboard-shortcuts/</link>
                <guid isPermaLink="false">670ead175b03d2b334e6560f</guid>
                
                    <category>
                        <![CDATA[ excel ]]>
                    </category>
                
                    <category>
                        <![CDATA[ spreadsheets ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Eamonn Cottrell ]]>
                </dc:creator>
                <pubDate>Tue, 15 Oct 2024 17:57:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728930420116/82fe0d9f-89fe-4332-82dc-43b44d4ee2dc.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Microsoft Excel is the quintessential spreadsheet software used everywhere from universities to small businesses to enterprises.</p>
<p>It’s a lifesaver for countless financial professionals, data analysts, and teachers. But it’s also one of those programs that virtually everyone in any role can benefit from learning.</p>
<p>A handful of shortcuts can go a long way in increasing your productivity (and enjoyment) while using Excel.</p>
<p>In this article, I’ll detail some of the many shortcuts that I have found helpful throughout my studies and career.</p>
<h3 id="heading-excel-shortcuts-well-cover">Excel Shortcuts We’ll Cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-to-execute-shortcuts-in-excel">How to Execute Shortcuts in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-create-a-table">Shortcut to Create a Table</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-create-a-pivot-table">Shortcut to Create a Pivot Table</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-autofit-column-sizes">Shortcut to AutoFit Column Sizes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-open-format-cells">Shortcut to Open Format Cells</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-center-contents-of-cell">Shortcut to Center Contents of Cell</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-fill-color">Shortcut to Fill Color</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-fill-contents-down-or-right">Shortcut to Fill Contents Down (or Right)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-show-or-hide-gridlines">Shortcut to Show or Hide Gridlines</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-show-all-formulas">Shortcut to Show all Formulas</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcuts-for-navigation-in-excel">Shortcuts for Navigation in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-open-the-autofilter-menu-in-excel">Shortcut to Open the AutoFilter Menu in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-create-a-slicer-in-excel">Shortcut to Create a Slicer in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-create-checkboxes-in-excel">Shortcut to Create Checkboxes in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-shortcut-to-create-charts-in-excel">Shortcut to Create Charts in Excel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-more-shortcuts">More Shortcuts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-got-sheet">Got Sheet</a></p>
</li>
</ul>
<p>Here’s a video walkthrough of everything we’ll cover in this article:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/RfFhh0n4bMM" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<h2 id="heading-how-to-execute-shortcuts-in-excel">How to Execute Shortcuts in Excel</h2>
<p>In Excel, the more you can learn to do without touching your mouse, the better. Often, by keeping your hands on your keyboard, you can save a lot of time in each project.</p>
<p>As such, virtually every command imaginable is available as a keyboard shortcut.</p>
<p>The basic ones like <code>CTRL + S</code> for save and <code>CTRL + C</code> for copy are present. But Excel goes a step further…</p>
<p>The real power comes from the <code>alt</code> shortcuts. By pressing sequences of keys that typically start with <code>alt</code>, almost all of the actions in the Ribbon become available.</p>
<p>By simply pressing the <code>alt</code> key, all of the shortcut sequences available from the current view become highlighted in yellow on the ribbon:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728924144231/6b083a86-6db6-41e3-b960-4a06a5d3284f.png" alt="alt shortcuts in Excel" class="image--center mx-auto" width="1059" height="280" loading="lazy"></p>
<p>Incidentally, some of the shortcuts’ sequences are separated by commas, while at other times two letters appear next to each other. Their execution is the same. Simply press them in sequence.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728925276816/8867e861-ecd7-43d9-989d-dab2fef62d8a.png" alt="one vs two letter shortcut sequences" class="image--center mx-auto" width="169" height="89" loading="lazy"></p>
<p>Below, you’ll find some of my favorites, but keep in mind that you can easily view all the available shortcuts at any time by pressing <code>alt</code> and then continuing to press the appropriate keys for the corresponding tabs and actions.</p>
<h2 id="heading-shortcut-to-create-a-table">Shortcut to Create a Table</h2>
<h3 id="heading-keyboard-shortcut-ctrl-t">Keyboard Shortcut: <code>ctrl + t</code></h3>
<p>Tables in Excel are often the preferred format to begin manipulating and visualizing data. As long as the active cell is inside your data range, pressing <code>ctrl + t</code> will pop up a dialog and automatically select the data range to be used for the table.</p>
<p>It is adjustable if Excel gets a column or row wrong, and you can also toggle table headers on or off from this initial box.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728924526482/aa5ffdee-2327-4232-8a6d-c74e271f4bce.png" alt="create table in Excel" class="image--center mx-auto" width="342" height="204" loading="lazy"></p>
<h2 id="heading-shortcut-to-create-a-pivot-table">Shortcut to Create a Pivot Table</h2>
<h3 id="heading-keyboard-shortcut-alt-n-v-t">Keyboard shortcut: <code>alt + n, v, t</code></h3>
<p>Want to appear a lot smarter than you are? Learn the basics of pivot tables in ten minutes. Your coworkers will remain impressed for weeks.</p>
<p>To create a pivot table, simply click somewhere in your data range and press <code>ALT + n, v, t</code>.</p>
<p>Excel is smart enough to recognize the data range you likely want included even if it isn’t already formatted as a table.</p>
<p>A dialog box will pop up for you to confirm the data range and location for the pivot table.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728924350920/21f22ae2-238c-4e73-9690-cbc538c46458.png" alt="Create pivot table dialog box in Excel" class="image--center mx-auto" width="635" height="357" loading="lazy"></p>
<h2 id="heading-shortcut-to-autofit-column-sizes">Shortcut to AutoFit Column Sizes</h2>
<h3 id="heading-keyboard-shortcut-alt-h-o-i">Keyboard Shortcut: <code>alt + h, o, i</code></h3>
<p>Ever get tired of resizing your columns so that the text in the cells doesn’t clip or spill? You can always double click the column boundary headings. This resizes the column to fit the widest cell’s contents.</p>
<p>The keyboard shortcut is much more efficient and allows you to autofit multiple columns in one fell swoop. Simply click and drag the column boundaries to select any number of columns. Execute the <code>alt + h, o, i</code> shortcut and all the columns in your active range will autofit their width.</p>
<p>You can also autofit cells individually by selecting one or more active cells and performing the same shortcut.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728924664667/9611fe54-9d2f-442c-9246-a3ab85b1e81d.png" alt="autofit column width in Excel" class="image--center mx-auto" width="411" height="427" loading="lazy"></p>
<h2 id="heading-shortcut-to-open-format-cells">Shortcut to Open Format Cells</h2>
<h3 id="heading-keyboard-shortcut-ctrl-1-or-alt-h-fm">Keyboard Shortcut: <code>CTRL + 1</code> or <code>alt + h, fm</code></h3>
<p>Unless you are content with mediocrity, you’ll be formatting the cells in your spreadsheet at some point to create more readable, user-friendly content.</p>
<p>The simplest shortcut to open up the Format Cells window is <code>CTRL + 1</code>, although if you want to flex your dexterity, <code>alt + h, fm</code> will get you there as well.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728925061866/0d773119-097f-479e-bcc5-5bb8ba311a5d.png" alt="Cell formatting options" class="image--center mx-auto" width="783" height="774" loading="lazy"></p>
<h2 id="heading-shortcut-to-center-contents-of-cell">Shortcut to Center Contents of Cell</h2>
<h3 id="heading-keyboard-shortcut-horizontal-alt-h-a-c">Keyboard Shortcut: (horizontal) <code>alt + h, a, c</code>, (vertical) <code>alt + h, a, m</code></h3>
<p>We have it easy compared to web developers. There seems a never-ending supply of articles and videos reminding developers how to center <code>divs</code>.</p>
<p>In Excel, we need only remember two quick shortcuts.</p>
<ul>
<li><p>For horizontal centering: <code>alt + h, a, c</code></p>
</li>
<li><p>For vertical centering: <code>alt + h, a, m</code></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728925842812/56676126-976e-4f1d-bb8a-2bf7072f9342.png" alt="centering text in a cell in Excel" class="image--center mx-auto" width="453" height="372" loading="lazy"></p>
<p>There are other shortcuts for left, right, top and bottom alignment, but most of the time when we change the original alignment, it’s to center it.</p>
<h2 id="heading-shortcut-to-fill-color">Shortcut to Fill Color</h2>
<h3 id="heading-keyboard-shortcut-alt-h-h">Keyboard Shortcut: <code>alt + h, h</code></h3>
<p>For a quick highlight, it takes precious seconds to mouse up to the fill color icon. Pressing <code>alt + h, h</code> quickly toggles the color selection open.</p>
<p>Once it’s open, you can leave your mouse to the side and arrow down to your favorite color.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728926023010/6cebdbc6-e27e-4784-8cc8-d46dbf8c4985.png" alt="fill color menu in Excel" class="image--center mx-auto" width="385" height="603" loading="lazy"></p>
<h2 id="heading-shortcut-to-fill-contents-down-or-right">Shortcut to Fill Contents Down (or Right)</h2>
<h3 id="heading-keyboard-shortcut-down-ctrl-d-right-ctrl-r">Keyboard Shortcut: (down) <code>CTRL + D</code>, (right) <code>CTRL + R</code></h3>
<p>One of the most powerful features of Excel is the ability to drag formulas and functions down or across many cells, effectively reproducing a single calculation many times on different pieces of data.</p>
<p>By typing a formula in cell A8 in the image below, we can then highlight A8 and drag down to our heart’s content. Then, by pressing <code>CTRL + D</code>, the formula will be copied down into every highlighted cell.</p>
<p>By default, it will also preserve the relative reference of the cell. In other words, the next cell will contain the formula <code>A7 + A8</code> and then <code>A8 + A9</code>, and so on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728927617121/35dd07d3-90ac-4c31-9318-1e588bff007c.png" alt="Dragging down" class="image--center mx-auto" width="1116" height="1349" loading="lazy"></p>
<h2 id="heading-shortcut-to-show-or-hide-gridlines">Shortcut to Show or Hide Gridlines</h2>
<h3 id="heading-keyboard-shortcut-alt-w-vg">Keyboard Shortcut: <code>alt + w, vg</code></h3>
<p>The mark of a real data analyst is not the quality of their reports, but the precision of their workbook. Gridlines have got to go. If you need lines, you can add borders.</p>
<p>Toggle off the gridlines with <code>alt + w, vg</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728926496034/ab39e08c-1ac2-4d4d-986e-1c79597a322e.png" alt="Shortcut to toggle off gridlines" class="image--center mx-auto" width="332" height="158" loading="lazy"></p>
<p>And if you need those borders immediately, highlight your data range and press <code>alt + h, b</code> to open up the border menu. If you simply need all borders, <code>alt + h, b, a</code> will do the trick.</p>
<h2 id="heading-shortcut-to-show-all-formulas">Shortcut to Show all Formulas</h2>
<h3 id="heading-keyboard-shortcut-ctrl">Keyboard Shortcut: <code>CTRL + ~</code></h3>
<p>You’ll likely never lose control of a workbook.</p>
<p>But in the event you open a less rigorous colleague’s workbook and need to see what functions they’ve Frankensteined together, press <code>CTRL + ~</code> to display the formulas in the spreadsheet instead of their results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728927250678/868f9daa-04d7-447c-b494-15fea77a4e1c.png" alt="display functions in Excel workbook" class="image--center mx-auto" width="807" height="709" loading="lazy"></p>
<h2 id="heading-shortcuts-for-navigation-in-excel">Shortcuts for Navigation in Excel</h2>
<h3 id="heading-keyboard-shortcuts-ctrl-arrow-keys-and-others">Keyboard Shortcuts: <code>CTRL + arrow keys</code> (and others)</h3>
<p>Navigating the grid can be very fast with the keyboard. Pressing <code>CTRL +</code> the <code>arrows</code>, the <code>home</code> and the <code>end</code> buttons will warp you all over the active sheet.</p>
<p>Using the <code>arrows</code> and <code>CTRL</code>, you go to the last nonblank cell in the row or column.</p>
<p>Using <code>CTRL + home or end</code>, you go to the beginning of the worksheet and to its last used cell, respectively. (When inside a table, <code>home</code> and <code>end</code> take you to the beginning and end of the table only.)</p>
<h2 id="heading-shortcut-to-open-the-autofilter-menu-in-excel">Shortcut to Open the AutoFilter Menu in Excel</h2>
<h3 id="heading-keyboard-shortcut-alt-down-arrow">Keyboard Shortcut: <code>alt + down arrow</code></h3>
<p>Another superpower in Excel is the ease with which we can filter and sort large pieces of data. To access the AutoFilter Menu quickly, press <code>alt + down arrow</code> while in the header for the column you’d like to filter.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728928353490/dcca3af8-2858-4ca0-abd6-d7e197826008.png" alt="auto-filter menu" class="image--center mx-auto" width="444" height="918" loading="lazy"></p>
<h2 id="heading-shortcut-to-create-a-slicer-in-excel">Shortcut to Create a Slicer in Excel</h2>
<h3 id="heading-keyboard-shortcut-alt-n-sf">Keyboard Shortcut: <code>alt + n, sf</code></h3>
<p>For an even more user-friendly method of filtering, you can insert a slicer directly onto the spreadsheet by pressing <code>alt + n, sf</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728928617015/599d1bb7-4873-4f3d-aee2-6d478dcc034c.png" alt="Menu for inserting slicers" class="image--center mx-auto" width="536" height="556" loading="lazy"></p>
<h2 id="heading-shortcut-to-create-checkboxes-in-excel">Shortcut to Create Checkboxes in Excel</h2>
<h3 id="heading-keyboard-shortcut-alt-n-cb">Keyboard Shortcut: <code>alt + n, cb</code></h3>
<p>If you’re anything like me, you’ll find a way to use checkboxes in almost every spreadsheet you create. They’re extremely useful for toggling selections on and off in a workbook, and as of June 2024, Microsoft has made them available in production Excel.</p>
<p>Press <code>alt + n, cb</code> to insert a checkbox in a cell.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728931963136/cc54e602-2d2c-4536-b07f-408f8404977d.png" alt="Inserting a checkbox in Excel" class="image--center mx-auto" width="506" height="467" loading="lazy"></p>
<h2 id="heading-shortcut-to-create-charts-in-excel">Shortcut to Create Charts in Excel</h2>
<h3 id="heading-keyboard-shortcut-alt-n-r">Keyboard Shortcut: <code>alt + n, r</code></h3>
<p>There are a ton of chart types in Excel. To quickly open up the recommended charts dialog box, we can press <code>alt + n, r</code>.</p>
<p>Or, if we know we want a specific type of chart, there are multiple options to shortcut straight to them, like <code>alt + n, C1</code>, <code>alt + n, N1</code>, <code>alt + n, SA</code> and so on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729001331573/b9e240ba-6f42-4817-9d8c-e27241f66b18.png" alt="Chart types" class="image--center mx-auto" width="911" height="439" loading="lazy"></p>
<h2 id="heading-more-shortcuts">More Shortcuts</h2>
<p>There are a zillion shortcuts available in Excel. If you want to check out the full list, you can find a current version maintained by <a target="_blank" href="https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-excel-1798d9d5-842a-42b8-9c99-9b7213f0040f">Microsoft here</a>.</p>
<h2 id="heading-got-sheet">Got Sheet</h2>
<p>Come join my free newsletter, <a target="_blank" href="https://www.gotsheet.xyz/subscribe">Got Sheet</a>. I show people how to get good at spreadsheets every week.</p>
<p>You can find me over on <a target="_blank" href="https://www.youtube.com/@eamonncottrell">YouTube</a> as well.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Analysis with Python – How I Analyzed My Empire State Building Run-Up Performance ]]>
                </title>
                <description>
                    <![CDATA[ A tower running race is a race that you run up the stairs of a building. These happen around the world. I got the chance to participate in the Empire State Run Up in NYC, 2023 edition. The Empire State Building Run-Up (ESBRU)—the world’s first and m... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/empire-state-building-run-up-analysis-with-python/</link>
                <guid isPermaLink="false">66d85138ec0a9800d5b8e6e6</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Jose Vicente Nunez ]]>
                </dc:creator>
                <pubDate>Wed, 08 May 2024 16:56:28 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/05/empire_state_runup-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A <a target="_blank" href="https://en.wikipedia.org/wiki/Tower_running">tower running race</a> is a race in which you run up the stairs of a building. These happen around the world. I got the chance to participate in the Empire State Run Up in NYC, 2023 edition.</p>
<blockquote>
<p>The Empire State Building Run-Up (ESBRU)—the world’s first and most famous tower race—challenges runners from near and far to race up its famed 86 flights—1,576 stairs.</p>
<p>While visitors can reach the building’s Observatory via elevator in under one minute, the fastest runners have covered the 86 floors by foot in about 10 minutes.</p>
<p>Leaders in the sport of professional tower-running converge at the Empire State Building in what some consider the ultimate test of endurance.</p>
</blockquote>
<p>I got lucky and managed to participate in this race. A few days after finishing it, I realized that I wanted to know more about my performance and what I could have done better.</p>
<p>So naturally, I went to the race organizer's website and started looking at the numbers. It was slow and tedious, and it brought up more issues:</p>
<ol>
<li><p>Getting the data for offline analysis is difficult. You can see your results and others for comparison, but I found that the tools didn't offer an option to download the raw data, and they were clumsy to use.</p>
</li>
<li><p>Most tools out there to analyze race results are paid or do not apply to this type of race. Knowing what to expect reduces your anxiety, allows you to train better, and keeps your expectations in check.</p>
</li>
</ol>
<p>By now you've probably guessed that you can solve the data retrieval issues and post-race analysis using low-cost Open Source tools. This also allows you to apply different techniques to learn about the race and, depending on the quality of the data, even make performance predictions.</p>
<p>This is a very personal piece for me. I will share my race results and give you my biased opinion about the race. 😁</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-how-i-ended-up-running-to-the-top-of-the-empire-state-building">How I Ended Up Running to the Top of the Empire State Building</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-you-need-to-follow-this-tutorial">What You Need to Follow this Tutorial</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-get-the-data-using-web-scraping">How to Get the Data using Web Scraping</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-up-the-data">How to Clean Up the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-analyze-the-data">How to Analyze the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-visualize-the-results">How to Visualize the results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-run-the-applications">How to Run the Applications</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-else-can-we-learn">What Else Can We Learn?</a></p>
</li>
</ol>
<h2 id="heading-how-i-ended-up-running-to-the-top-of-the-empire-state-building">How I Ended Up Running to the Top of the Empire State Building</h2>
<p>Many of us have run a regular race at some point in our lives – there are many distances like <em>5K</em>, <em>10K</em>, <em>Half</em> <em>Marathon</em>, and <em>Full</em> <em>Marathon</em>. But there is no way to compare how you will perform while running the stairs all the way to the top of one of the most famous buildings in the world.</p>
<p>If you have ever been at the base of the skyscrapers in New York City and have looked up, you get the idea. Picture yourself running up the stairs, all the way to the top, without stopping.</p>
<p>Getting accepted is tough, because unlike a race like the <a target="_blank" href="https://en.wikipedia.org/wiki/New_York_City_Marathon">New York Marathon</a>, the Empire State Building can only accommodate around 500 runners (or should I say <em>climbers</em>?).</p>
<p>Add to that the fact that the demand to participate is high, and you can see that your chances of getting in through the lottery are pretty slim (I read somewhere that there are only 50 lottery positions for more than 5,000 applicants).</p>
<p>You can imagine my surprise when I got an email saying that I was selected to participate after trying for 4 years in a row.</p>
<p>I panicked. Have you ever been at the base of the Empire State and looked up? Some days when it's cloudy you can't even see the top of the building.</p>
<p>I wasn't unprepared. But I had to adjust my training routine to be ready for this challenge with a small window of two months, and no experience doing a tower run.</p>
<p>The day of the race came and this is how it went for me:</p>
<ul>
<li><p>It was tough. I knew I had to pace myself, otherwise the race would have ended for me on the 20th floor as opposed to the 86th. You have to focus on a "keep going" mentality, regardless of how tired you feel. And then it is over, just like that.</p>
</li>
<li><p>You don't sprint, you climb 2 steps at a time at a steady pace, and you use the handrails to take weight off your legs.</p>
</li>
<li><p>No need to carb load or hydrate too much. If you do well, you will be done in around 30 minutes.</p>
</li>
<li><p>Nobody is pushing anyone. At least for non-elite racers like me, I was alone for most of the race.</p>
</li>
<li><p>I got passed, and I passed a lot of people who forgot the 'pace yourself' rule. If you sprint, you will be toast before floor 25, for sure.</p>
</li>
</ul>
<p>I had a blast and got great satisfaction from having this race ticked off my bucket list, the same way I felt after running the <a target="_blank" href="https://results.nyrr.org/event/40/finishers#search=Jose%2520Nunez%2520Zuleta">NYC Marathon</a>.</p>
<p>It was time now to do a post-race analysis using several of my favorite Open Source tools, which I'll explain in the next section.</p>
<h2 id="heading-what-you-need-to-follow-this-tutorial">What You Need to Follow this Tutorial</h2>
<p>Like the race, most of the challenges in writing this application were mental. You only need to break the main problem down into smaller pieces and then tackle one piece at a time:</p>
<ol>
<li><p>Get the data by scraping the website (very few sites allow you to export race results as a CSV).</p>
</li>
<li><p>Clean up the data, normalize it, and make it ready for automatic processing.</p>
</li>
<li><p>Ask questions. Then translate those questions into code and tests, ideally using statistics to get reliable answers.</p>
</li>
<li><p>Present the results. A UI (text or graphical) works wonders thanks to its low resource consumption, but charts speak volumes too.</p>
</li>
</ol>
<p>You should have some experience in a programming language to get the most out of this article. My code is written in Python (you will need version 3.8+) and runs on Linux (I used the <a target="_blank" href="https://fedoraproject.org/">Fedora 37 distribution</a>).</p>
<p>In a nutshell, I want to show that it is possible to do all the above with Open Source technologies. Then you can reuse this knowledge for other projects, not just for tower race analyses. 😅</p>
<p>I strongly recommend that you <a target="_blank" href="https://github.com/josevnz/tutorials/tree/main/docs/EmpireStateRunUp">get the source code</a> (It is <a target="_blank" href="https://github.com/josevnz/tutorials/tree/main?tab=Apache-2.0-1-ov-file#readme">Open Source</a>!). Get your hands dirty, break the scripts, and have fun. You will need Git to clone the repository:</p>
<pre><code class="lang-shell">git clone https://github.com/josevnz/tutorials.git
cd tutorials/docs/EmpireStateRunUp/
python -m venv ~/virtualenv/EmpireStateRunUp
. ~/virtualenv/EmpireStateRunUp/bin/activate
pip install --upgrade pip
pip install --upgrade build
pip install --upgrade wheel
pip install --editable .
</code></pre>
<p>Or if you just want to run the code while reading this tutorial (using my latest version from <a target="_blank" href="https://pypi.org/project/EmpireStateRunUp/">PyPI</a>):</p>
<pre><code class="lang-shell">python -m venv ~/virtualenv/EmpireStateRunUp
. ~/virtualenv/EmpireStateRunUp/bin/activate 
pip install --upgrade EmpireStateRunUp
</code></pre>
<p>We can now move to the next stage: getting the data.</p>
<h2 id="heading-how-to-get-the-data-using-web-scraping">How to Get the Data using Web Scraping</h2>
<p>The race results site doesn't have an export feature, and I never heard back from their support team to see if there was an alternate way to get the race data. So the only alternative left was to do some web scraping.</p>
<p>The website is pretty basic and only allows scrolling through each record, so scraping was the way to get the results into a format I could use later for data analysis.</p>
<h3 id="heading-the-rules-of-web-scraping">The rules of web scraping</h3>
<p>There are 3 very simple rules:</p>
<ol>
<li><p>Rule #1: <strong>Don't do it</strong>. Data flow changes, and your scraper will break the minute you are done getting the data. It will require time and effort. <em>Lots of it</em>.</p>
</li>
<li><p>Rule #2: <strong>Re-read rule number 1</strong>. If you can't get the data in any other format, then go to rule #3.</p>
</li>
<li><p>Rule #3: <strong>Choose a good framework to automate what you can</strong> and prepare to do heavy data cleanup (also known as "give me patience for the stuff I can't control, like poorly done HTML and CSS").</p>
</li>
</ol>
<p>I decided to use <a target="_blank" href="https://www.selenium.dev/documentation/webdriver/">Selenium Web Driver</a> as it calls a real browser, like Firefox, to navigate the website. Selenium allows you to automate browser actions while you get the same rendered HTML you see when you navigate the site.</p>
<p>Selenium <em>is a complex tool</em> and will require you to spend some time experimenting with what works and what does not. Below is a simple script I wrote to get all the runners' names and race detail links in one run:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> sleep

<span class="hljs-keyword">from</span> selenium <span class="hljs-keyword">import</span> webdriver
<span class="hljs-keyword">from</span> selenium.webdriver.common.by <span class="hljs-keyword">import</span> By
<span class="hljs-keyword">from</span> selenium.webdriver.firefox.options <span class="hljs-keyword">import</span> Options
<span class="hljs-keyword">from</span> selenium.webdriver.firefox.webdriver <span class="hljs-keyword">import</span> WebDriver
<span class="hljs-keyword">from</span> selenium.webdriver.support.wait <span class="hljs-keyword">import</span> WebDriverWait
<span class="hljs-keyword">from</span> selenium.webdriver.support <span class="hljs-keyword">import</span> expected_conditions
<span class="hljs-comment"># AthLinks is nice enough to post the race results and their interface is very human-friendly. Not so machine parsing friendly.</span>
RESULTS = <span class="hljs-string">"https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Results"</span>
LINKS = {}


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">print_links</span>(<span class="hljs-params">web_driver: WebDriver, page: int</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">for</span> a <span class="hljs-keyword">in</span> web_driver.find_elements(By.TAG_NAME, <span class="hljs-string">"a"</span>):
        href = a.get_attribute(<span class="hljs-string">'href'</span>)
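        <span class="hljs-comment"># Some anchors have no href attribute (get_attribute returns None), so guard before searching</span>
        <span class="hljs-keyword">if</span> href <span class="hljs-keyword">and</span> re.search(<span class="hljs-string">'Bib'</span>, href):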
            name = a.text.strip().title()
            print(<span class="hljs-string">f"Page=<span class="hljs-subst">{page}</span>, <span class="hljs-subst">{name}</span>=<span class="hljs-subst">{href.strip()}</span>"</span>)
            LINKS[name] = href.strip()


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">click</span>(<span class="hljs-params">level: int</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    button = WebDriverWait(driver, <span class="hljs-number">20</span>).until(
        expected_conditions.element_to_be_clickable((By.CSS_SELECTOR, <span class="hljs-string">f"div:nth-child(<span class="hljs-subst">{level}</span>) &gt; button"</span>)))
    driver.execute_script(<span class="hljs-string">"arguments[0].click();"</span>, button)
    sleep(<span class="hljs-number">2.5</span>)


options = Options()
options.add_argument(<span class="hljs-string">"--headless"</span>)
driver = webdriver.Firefox(options=options)
driver.get(RESULTS)
sleep(<span class="hljs-number">2.5</span>)
print_links(driver, <span class="hljs-number">1</span>)
click(<span class="hljs-number">6</span>)
print_links(driver, <span class="hljs-number">2</span>)
click(<span class="hljs-number">7</span>)
print_links(driver, <span class="hljs-number">3</span>)
click(<span class="hljs-number">7</span>)
print_links(driver, <span class="hljs-number">4</span>)
click(<span class="hljs-number">9</span>)
print_links(driver, <span class="hljs-number">5</span>)
click(<span class="hljs-number">9</span>)
print_links(driver, <span class="hljs-number">6</span>)
click(<span class="hljs-number">7</span>)
print_links(driver, <span class="hljs-number">7</span>)
click(<span class="hljs-number">7</span>)
print_links(driver, <span class="hljs-number">8</span>)
print(len(LINKS))
</code></pre>
<p>The code above is hardly reusable, but it gets the job done by doing the following:</p>
<ol>
<li><p>Gets the main web-page with the <code>driver.get(...)</code> method</p>
</li>
<li><p>Sleeps a little to give the HTML a chance to render, then gets the <code>&lt;a href</code> tags</p>
</li>
<li><p>Then finds and clicks the <code>&gt;</code> (next page) button</p>
</li>
<li><p>Does these steps a total of 8 times, as this is how many pages of results are available (each page has 50 runners)</p>
</li>
</ol>
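<p>The script above keeps any link whose address contains <code>Bib</code>. If you also want the bib number itself, a small helper can pull it out of the link. This is only a sketch: the <code>bib_from_href</code> name and the pattern are my own illustration, and it assumes the detail links end in <code>/Bib/&lt;number&gt;</code>.</p>

```python
import re
from typing import Optional

# Assumption: race detail links end in ".../Bib/<number>".
BIB_PATTERN = re.compile(r"/Bib/(\d+)/?$")


def bib_from_href(href: Optional[str]) -> Optional[int]:
    """Return the bib number embedded in a race detail link, or None if there is none."""
    if not href:
        # Anchors without an href come back as None from Selenium
        return None
    match = BIB_PATTERN.search(href.strip())
    return int(match.group(1)) if match else None


detail = "https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Bib/19"
listing = "https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Results"
print(bib_from_href(detail))
print(bib_from_href(listing))
```

<p>Returning <code>None</code> instead of raising keeps the calling loop simple: links that are not race details are just skipped.</p>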
<p>To get the full race results, I wrote scraper.py. It deals with navigating multiple pages and extracting the data. Here is a demonstration:</p>
<pre><code class="lang-shell">(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ esru_scraper /home/josevnz/temp/raw_data.csv
2023-12-30 14:05:00,987 Saving results to /home/josevnz/temp/raw_data.csv
2023-12-30 14:05:53,091 Got 377 racer results
2023-12-30 14:05:53,091 Processing BIB: 19, will fetch: https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Bib/19
2023-12-30 14:06:02,207 Wrote: name=Wai Ching Soh, position=1, {'name': 'Wai Ching Soh', 'url': 'https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Bib/19', 'overall position': '1', 'gender': 'M', 'age': 29, 'city': 'Kuala Lumpur', 'state': '-', 'country': 'MYS', 'bib': 19, '20th floor position': '1', '20th floor gender position': '1', '20th floor division position': '1', '20th floor pace': '42:30', '20th floor time': '1:42', '65th floor position': '1', '65th floor gender position': '1', '65th floor division position': '1', '65th floor pace': '54:03', '65th floor time': '7:34', 'gender position': '1', 'division position': '1', 'pace': '53:00', 'time': '10:36', 'level': 'Full Course'}
...
</code></pre>
<p>It does just minimal manipulation of the data from the web page. The purpose of this code is just to get the data as quickly as possible before the formatting changes.</p>
<p>Data cannot be used yet as-is – it needs cleaning up. And that's the next step in this article.</p>
<h2 id="heading-how-to-clean-up-the-data">How to Clean Up the Data</h2>
<p><a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/test/raw_data.csv">Getting the data</a> is just the first battle of many more to come. <a target="_blank" href="https://en.wikibooks.org/wiki/Statistics/Data_Analysis/Data_Cleaning">You will notice inconsistencies in the data</a> and missing values. To get reliable numeric results, you need to make assumptions.</p>
<p>Luckily for me, the dataset is very small (375+ records, one for each runner) so I was able to come up with a few rules to tidy up the <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/results-first-level-2023.csv">data file</a> I was going to use during my analysis.</p>
<p>I also supplemented my data with another data set that has the <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/country_codes.csv">3-letter country codes</a> as well as other details, for a nicer presentation.</p>
<p>The <code>raw_csv_read(raw_file: Path) -&gt; Iterable[Dict[str, Any]]</code> function does the heavy work of fixing inconsistencies in the data before it is saved in CSV format.</p>
<p>There are no hard rules here, as the cleanup depends heavily on the data set. For example, to figure out which wave each runner was assigned to, I had to make some assumptions based on what I saw the day of the race.</p>
<p>Let me show you what I mean with some code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> enum <span class="hljs-keyword">import</span> Enum
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Dict

<span class="hljs-string">"""
Runners started on waves, but for basic analysis, we will assume all runners were able to run
at the same time.
"""</span>
BASE_RACE_DATETIME = datetime.datetime(
    year=<span class="hljs-number">2023</span>,
    month=<span class="hljs-number">9</span>,
    day=<span class="hljs-number">4</span>,
    hour=<span class="hljs-number">20</span>,
    minute=<span class="hljs-number">0</span>,
    second=<span class="hljs-number">0</span>,
    microsecond=<span class="hljs-number">0</span>
)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Waves</span>(<span class="hljs-params">Enum</span>):</span>
    <span class="hljs-string">"""
    22 Elite male
    17 Elite female
    There are some holes, so either some runners did not show up or there was spare capacity.
    https://runsignup.com/Race/EmpireStateBuildingRunUp/Page-4
    https://runsignup.com/Race/EmpireStateBuildingRunUp/Page-5
    I guessed who went into which category, based on the BIB numbers I saw that day
    """</span>
    ELITE_MEN = [<span class="hljs-string">"Elite Men"</span>, [<span class="hljs-number">1</span>, <span class="hljs-number">25</span>], BASE_RACE_DATETIME]
    ELITE_WOMEN = [<span class="hljs-string">"Elite Women"</span>, [<span class="hljs-number">26</span>, <span class="hljs-number">49</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">2</span>)]
    PURPLE = [<span class="hljs-string">"Specialty"</span>, [<span class="hljs-number">100</span>, <span class="hljs-number">199</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">10</span>)]
    GREEN = [<span class="hljs-string">"Sponsors"</span>, [<span class="hljs-number">200</span>, <span class="hljs-number">299</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">20</span>)]
    <span class="hljs-string">"""
    The date people applied for the lottery determined the colors. Let's assume that
    General Lottery Open: 7/17 9AM- 7/28 11:59PM
    General Lottery Draw Date: 8/1
    """</span>
    ORANGE = [<span class="hljs-string">"Tenants"</span>, [<span class="hljs-number">300</span>, <span class="hljs-number">399</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">30</span>)]
    GREY = [<span class="hljs-string">"General 1"</span>, [<span class="hljs-number">400</span>, <span class="hljs-number">499</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">40</span>)]
    GOLD = [<span class="hljs-string">"General 2"</span>, [<span class="hljs-number">500</span>, <span class="hljs-number">599</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">50</span>)]
    BLACK = [<span class="hljs-string">"General 3"</span>, [<span class="hljs-number">600</span>, <span class="hljs-number">699</span>], BASE_RACE_DATETIME + datetime.timedelta(minutes=<span class="hljs-number">60</span>)]

<span class="hljs-string">"""
Interested only in people who completed the 86 floors. So is it either a full course or dnf
"""</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Level</span>(<span class="hljs-params">Enum</span>):</span>
    FULL = <span class="hljs-string">"Full Course"</span>
    DNF = <span class="hljs-string">"DNF"</span>

<span class="hljs-comment"># Fields are sorted by interest</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RaceFields</span>(<span class="hljs-params">Enum</span>):</span>
    BIB = <span class="hljs-string">"bib"</span>
    NAME = <span class="hljs-string">"name"</span>
    OVERALL_POSITION = <span class="hljs-string">"overall position"</span>
    TIME = <span class="hljs-string">"time"</span>
    GENDER = <span class="hljs-string">"gender"</span>
    GENDER_POSITION = <span class="hljs-string">"gender position"</span>
    AGE = <span class="hljs-string">"age"</span>
    DIVISION_POSITION = <span class="hljs-string">"division position"</span>
    COUNTRY = <span class="hljs-string">"country"</span>
    STATE = <span class="hljs-string">"state"</span>
    CITY = <span class="hljs-string">"city"</span>
    PACE = <span class="hljs-string">"pace"</span>
    TWENTY_FLOOR_POSITION = <span class="hljs-string">"20th floor position"</span>
    TWENTY_FLOOR_GENDER_POSITION = <span class="hljs-string">"20th floor gender position"</span>
    TWENTY_FLOOR_DIVISION_POSITION = <span class="hljs-string">"20th floor division position"</span>
    TWENTY_FLOOR_PACE = <span class="hljs-string">'20th floor pace'</span>
    TWENTY_FLOOR_TIME = <span class="hljs-string">'20th floor time'</span>
    SIXTY_FLOOR_POSITION = <span class="hljs-string">"65th floor position"</span>
    SIXTY_FIVE_FLOOR_GENDER_POSITION = <span class="hljs-string">"65th floor gender position"</span>
    SIXTY_FIVE_FLOOR_DIVISION_POSITION = <span class="hljs-string">"65th floor division position"</span>
    SIXTY_FIVE_FLOOR_PACE = <span class="hljs-string">'65th floor pace'</span>
    SIXTY_FIVE_FLOOR_TIME = <span class="hljs-string">'65th floor time'</span>
    WAVE = <span class="hljs-string">"wave"</span>
    LEVEL = <span class="hljs-string">"level"</span>
    URL = <span class="hljs-string">"url"</span>

FIELD_NAMES = [x.value <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> RaceFields <span class="hljs-keyword">if</span> x != RaceFields.URL]
FIELD_NAMES_FOR_SCRAPING = [x.value <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> RaceFields]
FIELD_NAMES_AND_POS: Dict[RaceFields, int] = {}
pos = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> RaceFields:
    FIELD_NAMES_AND_POS[field] = pos
    pos += <span class="hljs-number">1</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_wave_from_bib</span>(<span class="hljs-params">bib: int</span>) -&gt; Waves:</span>
    <span class="hljs-keyword">for</span> wave <span class="hljs-keyword">in</span> Waves:
        (lower, upper) = wave.value[<span class="hljs-number">1</span>]
        <span class="hljs-keyword">if</span> lower &lt;= bib &lt;= upper:
            <span class="hljs-keyword">return</span> wave
    <span class="hljs-keyword">return</span> Waves.BLACK

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_description_for_wave</span>(<span class="hljs-params">wave: Waves</span>) -&gt; str:</span>
    <span class="hljs-keyword">return</span> wave.value[<span class="hljs-number">0</span>]
</code></pre>
<p>I used <a target="_blank" href="https://docs.python.org/3/library/enum.html">enums</a> to make it clear what type of data I was working on, especially for the names of the fields. Consistency is key.</p>
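<p>To see how the bib ranges drive the wave lookup, here is a trimmed-down sketch of the same idea. The <code>Wave</code> enum and <code>wave_for_bib</code> function are simplified stand-ins I wrote for illustration (only four waves, no start times), not the code from the repository.</p>

```python
from enum import Enum


class Wave(Enum):
    # (description, (lowest bib, highest bib)); a subset of the ranges guessed above
    ELITE_MEN = ("Elite Men", (1, 25))
    ELITE_WOMEN = ("Elite Women", (26, 49))
    GREY = ("General 1", (400, 499))
    BLACK = ("General 3", (600, 699))


def wave_for_bib(bib: int) -> Wave:
    """Return the first wave whose bib range contains the bib number."""
    for wave in Wave:
        lower, upper = wave.value[1]
        if lower <= bib <= upper:
            return wave
    # Same fallback as the original code: unmatched bibs go to the last wave
    return Wave.BLACK


for bib in (19, 450, 75):
    wave = wave_for_bib(bib)
    print(bib, wave.name, wave.value[0])
```

<p>Note the fallback: a bib that matches no range, like 75 here, silently lands in the last wave. That is convenient for a first pass, but it is worth remembering when a wave looks more crowded than expected.</p>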
<p>As for cleaning the data, there were some obvious fixes I had to apply:</p>
<ol>
<li><p>Reformat time values such as pace and race time so they can be parsed later</p>
</li>
<li><p>Capitalize some values to make them easier to read</p>
</li>
<li><p>Early string to integer conversion for values like age, position, and so on. If that fails, assign 'not a number'.</p>
</li>
</ol>
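<p>The first and third fixes are easy to try in isolation. The sketch below uses two helper names of my own (<code>normalize_time</code> and <code>to_int_or_nan</code>), and it assumes times arrive as <code>M:SS</code> or <code>H:MM:SS</code> strings like the ones in the scraper output.</p>

```python
import math
from typing import Union


def normalize_time(raw: str) -> str:
    """Pad a time such as '1:42' to the 'HH:MM:SS' shape, e.g. '00:01:42'."""
    parts = raw.strip().split(":")
    # Left-pad single-digit parts: '7' -> '07'
    parts = [f"0{part}" if len(part) == 1 else part for part in parts]
    # Times without an hour component get an explicit '00' hour
    if len(parts) == 2:
        parts.insert(0, "00")
    return ":".join(parts)


def to_int_or_nan(raw: str) -> Union[int, float]:
    """Convert early to int; fall back to 'not a number' when the value is not numeric."""
    try:
        return int(raw.strip())
    except ValueError:
        return math.nan


print(normalize_time("1:42"))
print(normalize_time("10:36"))
print(to_int_or_nan("29"), to_int_or_nan("-"))
```

<p>Padding every value to the same shape is what later allows a single parsing call to handle all the time columns.</p>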
<p>By no means are we done massaging the data. A simple function takes care of this stage inside the <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/data.py">data</a> module:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Imports and Enum declarations omitted, as they were shown earlier. </span>
<span class="hljs-comment"># Check the source code for 'data.py' for more details</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">raw_csv_read</span>(<span class="hljs-params">raw_file: Path</span>) -&gt; Iterable[Dict[str, Any]]:</span>
    <span class="hljs-keyword">with</span> open(raw_file, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> raw_csv_file:
        reader = csv.DictReader(raw_csv_file)
        row: Dict[str, Any]
        <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> reader:
            <span class="hljs-comment"># Start a fresh record for every row so callers never share one mutable dict</span>
            record = {}
            <span class="hljs-keyword">try</span>:
                csv_field: str
                <span class="hljs-keyword">for</span> csv_field <span class="hljs-keyword">in</span> FIELD_NAMES_FOR_SCRAPING:
                    column_val = row[csv_field].strip()
                    <span class="hljs-keyword">if</span> csv_field == RaceFields.BIB.value:
                        bib = int(column_val)
                        record[csv_field] = bib
                    <span class="hljs-keyword">elif</span> csv_field <span class="hljs-keyword">in</span> [ RaceFields.GENDER_POSITION.value, RaceFields.DIVISION_POSITION.value, RaceFields.OVERALL_POSITION.value,  RaceFields.TWENTY_FLOOR_POSITION.value,
                        RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value, RaceFields.TWENTY_FLOOR_GENDER_POSITION.value, RaceFields.SIXTY_FLOOR_POSITION.value, RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value,
                        RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value, RaceFields.AGE.value ]:
                        <span class="hljs-keyword">try</span>:
                            record[csv_field] = int(column_val)
                        <span class="hljs-keyword">except</span> ValueError:
                            record[csv_field] = math.nan
                    <span class="hljs-keyword">elif</span> csv_field == RaceFields.WAVE.value:
                        record[csv_field] = get_description_for_wave(get_wave_from_bib(bib)).upper()
                    <span class="hljs-keyword">elif</span> csv_field <span class="hljs-keyword">in</span> [RaceFields.GENDER.value, RaceFields.COUNTRY.value]:
                        record[csv_field] = column_val.upper()
                    <span class="hljs-keyword">elif</span> csv_field <span class="hljs-keyword">in</span> [RaceFields.CITY.value, RaceFields.STATE.value,

                    ]:
                        record[csv_field] = column_val.capitalize()
                    <span class="hljs-keyword">elif</span> csv_field <span class="hljs-keyword">in</span> [RaceFields.SIXTY_FIVE_FLOOR_PACE.value, RaceFields.SIXTY_FIVE_FLOOR_TIME.value, RaceFields.TWENTY_FLOOR_PACE.value,
                        RaceFields.TWENTY_FLOOR_TIME.value, RaceFields.PACE.value, RaceFields.TIME.value ]:
                        parts = column_val.strip().split(<span class="hljs-string">':'</span>)
                        <span class="hljs-keyword">for</span> idx <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, len(parts)):
                            <span class="hljs-keyword">if</span> len(parts[idx]) == <span class="hljs-number">1</span>:
                                parts[idx] = <span class="hljs-string">f"0<span class="hljs-subst">{parts[idx]}</span>"</span>
                        <span class="hljs-keyword">if</span> len(parts) == <span class="hljs-number">2</span>:
                            parts.insert(<span class="hljs-number">0</span>, <span class="hljs-string">"00"</span>)
                        record[csv_field] = <span class="hljs-string">":"</span>.join(parts)
                    <span class="hljs-keyword">else</span>:
                        record[csv_field] = column_val
                    <span class="hljs-comment"># Blank out placeholder values for every field, not only the last one</span>
                    <span class="hljs-keyword">if</span> record[csv_field] <span class="hljs-keyword">in</span> [<span class="hljs-string">'-'</span>, <span class="hljs-string">'--'</span>]:
                        record[csv_field] = <span class="hljs-string">""</span>
                <span class="hljs-keyword">yield</span> record
            <span class="hljs-keyword">except</span> IndexError:
                <span class="hljs-keyword">raise</span>
</code></pre>
<p>The <code>esru_csv_cleaner</code> script is the sum of the first stage cleanup effort, which takes the raw captured data and writes a CSV file with some important corrections:</p>
<pre><code class="lang-shell">esru_csv_cleaner --rawfile /home/josevnz/temp/raw_data.csv /home/josevnz/tutorials/docs/EmpireStateRunUp/empirestaterunup/results-full-level-2023.csv
</code></pre>
<p>Now with the data ready, we can proceed to load the data and ask some questions about the race.</p>
<h2 id="heading-how-to-analyze-the-data">How to Analyze the Data</h2>
<p>Once the data is clean (or as clean as we can get it), it's time to move into running some numbers. Before writing more code, I took a piece of paper and asked myself a few questions about the race:</p>
<ul>
<li><p>Are there any interesting buckets/clusters for age, race time, wave, and country participation?</p>
</li>
<li><p>A histogram for Age and Country would be nice to see</p>
</li>
<li><p>Describe the data! (median, percentiles, and so on)</p>
</li>
<li><p>Find outliers. <a target="_blank" href="https://www.investopedia.com/terms/z/zscore.asp">Is there a way to apply Z-scores</a> here?</p>
</li>
</ul>
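<p>As a reminder of what the last question means: a Z-score says how many standard deviations a value sits from the mean, so values with a large absolute Z-score are outlier candidates. Here is a minimal sketch using only the standard library. The finish times are made-up numbers for illustration, not race data, and the cutoff of 2 is an arbitrary choice.</p>

```python
from statistics import mean, pstdev

# Made-up finish times in minutes, for illustration only (NOT real race results)
times = [10.6, 18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 60.0]

mu = mean(times)
sigma = pstdev(times)  # population standard deviation

# Z-score: distance from the mean, measured in standard deviations
z_scores = [(t - mu) / sigma for t in times]

# Flag anything more than 2 standard deviations away (assumed cutoff)
outliers = [t for t, z in zip(times, z_scores) if abs(z) > 2]
print(outliers)
```

<p>With a data set this small, a single extreme value also inflates the standard deviation, so treat the cutoff as a starting point rather than a rule.</p>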
<p>I decided to use <a target="_blank" href="https://pandas.pydata.org/">Python Pandas</a> for this task. This Open Source framework has an arsenal of tools to manipulate the data and to calculate statistics. It also has good tools to perform additional cleanup if needed.</p>
<p>So how does Pandas work?</p>
<h3 id="heading-crash-course-on-pandas">Crash Course on Pandas</h3>
<p>I strongly recommend that you check out <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html">10 minutes to pandas</a> if you are not familiar with the tool. For my DataFrame, I made the BIB the index, as it is unique for each runner and has no special value for aggregation functions.</p>
<p>It's important to note that also at this stage I needed to normalize the data, which I'll explain shortly:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Imports and Enum declarations omitted, as they were shown earlier. </span>
<span class="hljs-comment"># Check the source code for 'data.py' for more details</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_data</span>(<span class="hljs-params">data_file: Path = None, remove_dnf: bool = True</span>) -&gt; DataFrame:</span>
    <span class="hljs-string">"""
    * The code removes by default the DNF runners to avoid distortion on the results.
    * Replace unknown/ nan values with the median, to make analysis easier and avoid distortions
    """</span>
    <span class="hljs-keyword">if</span> data_file:
        def_file = data_file
    <span class="hljs-keyword">else</span>:
        def_file = RACE_RESULTS_FULL_LEVEL
    df = pandas.read_csv(
        def_file
    )
    <span class="hljs-keyword">for</span> time_field <span class="hljs-keyword">in</span> [
        RaceFields.PACE.value,
        RaceFields.TIME.value,
        RaceFields.TWENTY_FLOOR_PACE.value,
        RaceFields.TWENTY_FLOOR_TIME.value,
        RaceFields.SIXTY_FIVE_FLOOR_PACE.value,
        RaceFields.SIXTY_FIVE_FLOOR_TIME.value
    ]:
        <span class="hljs-keyword">try</span>:
            df[time_field] = pandas.to_timedelta(df[time_field])
        <span class="hljs-keyword">except</span> ValueError <span class="hljs-keyword">as</span> ve:
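            <span class="hljs-comment"># Chain the original exception so the traceback keeps the root cause</span>
            <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f'<span class="hljs-subst">{time_field}</span>=<span class="hljs-subst">{df[time_field]}</span>'</span>) <span class="hljs-keyword">from</span> ve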
    df[<span class="hljs-string">'finishtimestamp'</span>] = BASE_RACE_DATETIME + df[RaceFields.TIME.value]
    <span class="hljs-keyword">if</span> remove_dnf:
        df.drop(df[df.level == <span class="hljs-string">'DNF'</span>].index, inplace=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Normalize Age</span>
    <span class="hljs-comment"># Assign the filled column back: newer pandas versions warn about fillna(inplace=True) on a column selection</span>
    median_age = df[RaceFields.AGE.value].median()
    df[RaceFields.AGE.value] = df[RaceFields.AGE.value].fillna(median_age)
    df[RaceFields.AGE.value] = df[RaceFields.AGE.value].astype(int)

    <span class="hljs-comment"># Normalize state and city</span>
    df.replace({RaceFields.STATE.value: {<span class="hljs-string">'-'</span>: <span class="hljs-string">''</span>}}, inplace=<span class="hljs-literal">True</span>)
    df[RaceFields.STATE.value] = df[RaceFields.STATE.value].fillna(<span class="hljs-string">''</span>)
    df[RaceFields.CITY.value] = df[RaceFields.CITY.value].fillna(<span class="hljs-string">''</span>)

    <span class="hljs-comment"># Normalize overall position, 3 levels</span>
    median_pos = df[RaceFields.OVERALL_POSITION.value].median()
    df[RaceFields.OVERALL_POSITION.value] = df[RaceFields.OVERALL_POSITION.value].fillna(median_pos)
    df[RaceFields.OVERALL_POSITION.value] = df[RaceFields.OVERALL_POSITION.value].astype(int)
    median_pos = df[RaceFields.TWENTY_FLOOR_POSITION.value].median()
    df[RaceFields.TWENTY_FLOOR_POSITION.value] = df[RaceFields.TWENTY_FLOOR_POSITION.value].fillna(median_pos)
    df[RaceFields.TWENTY_FLOOR_POSITION.value] = df[RaceFields.TWENTY_FLOOR_POSITION.value].astype(int)
    median_pos = df[RaceFields.SIXTY_FLOOR_POSITION.value].median()
    df[RaceFields.SIXTY_FLOOR_POSITION.value] = df[RaceFields.SIXTY_FLOOR_POSITION.value].fillna(median_pos)
    df[RaceFields.SIXTY_FLOOR_POSITION.value] = df[RaceFields.SIXTY_FLOOR_POSITION.value].astype(int)

    <span class="hljs-comment"># Normalize gender position, 3 levels</span>
    median_gender_pos = df[RaceFields.GENDER_POSITION.value].median()
    df[RaceFields.GENDER_POSITION.value] = df[RaceFields.GENDER_POSITION.value].fillna(median_gender_pos)
    df[RaceFields.GENDER_POSITION.value] = df[RaceFields.GENDER_POSITION.value].astype(int)
    median_gender_pos = df[RaceFields.TWENTY_FLOOR_GENDER_POSITION.value].median()
    df[RaceFields.TWENTY_FLOOR_GENDER_POSITION.value] = df[RaceFields.TWENTY_FLOOR_GENDER_POSITION.value].fillna(median_gender_pos)
    df[RaceFields.TWENTY_FLOOR_GENDER_POSITION.value] = df[RaceFields.TWENTY_FLOOR_GENDER_POSITION.value].astype(int)
    median_gender_pos = df[RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value].median()
    df[RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value] = df[RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value].fillna(median_gender_pos)
    df[RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value] = df[
        RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value].astype(int)

    <span class="hljs-comment"># Normalize age/ division position, 3 levels</span>
    median_div_pos = df[RaceFields.DIVISION_POSITION.value].median()
    df[RaceFields.DIVISION_POSITION.value] = df[RaceFields.DIVISION_POSITION.value].fillna(median_div_pos)
    df[RaceFields.DIVISION_POSITION.value] = df[RaceFields.DIVISION_POSITION.value].astype(int)
    median_div_pos = df[RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value].median()
    df[RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value].fillna(median_div_pos, inplace=<span class="hljs-literal">True</span>)
    df[RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value] = df[RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value].astype(int)
    median_div_pos = df[RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value].median()
    df[RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value].fillna(median_div_pos, inplace=<span class="hljs-literal">True</span>)
    df[RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value] = df[
        RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value].astype(int)

    <span class="hljs-comment"># Normalize 65th floor pace and time</span>
    sixty_five_floor_pace_median = df[RaceFields.SIXTY_FIVE_FLOOR_PACE.value].median()
    sixty_five_floor_time_median = df[RaceFields.SIXTY_FIVE_FLOOR_TIME.value].median()
    df[RaceFields.SIXTY_FIVE_FLOOR_PACE.value].fillna(sixty_five_floor_pace_median, inplace=<span class="hljs-literal">True</span>)
    df[RaceFields.SIXTY_FIVE_FLOOR_TIME.value].fillna(sixty_five_floor_time_median, inplace=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Normalize BIB and make it the index</span>
    df[RaceFields.BIB.value] = df[RaceFields.BIB.value].astype(int)
    df.set_index(RaceFields.BIB.value, inplace=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># URL was useful during scraping, not needed for analysis</span>
    df.drop([RaceFields.URL.value], axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-literal">True</span>)

    <span class="hljs-keyword">return</span> df
</code></pre>
<p>I do a few things here before handing the converted CSV back to the user as a DataFrame:</p>
<ul>
<li><p>Replaced "Not a Number" (NaN) values with the column median so that missing entries don't distort the aggregation results.</p>
</li>
<li><p>Dropped rows for runners that did not reach floor 86. There are too few of them to matter, and removing them makes the analysis easier.</p>
</li>
<li><p>Converted some string columns into native data types like integers and timestamps.</p>
</li>
<li><p>A few entries did not have the gender defined, which affected other fields like 'gender_position'. To avoid distortions, these were filled with the median.</p>
</li>
</ul>
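<p>The fill-then-cast steps repeat for many columns in the function above. Here is a minimal sketch of how they could be folded into one helper. The name <code>fill_median_as_int</code> and the sample data are made up for illustration and are not part of the project. The sketch assigns the result back to the column instead of calling <code>fillna(..., inplace=True)</code> on a column selection, a pattern that newer pandas releases may warn about:</p>
<pre><code class="lang-python">import pandas as pd


def fill_median_as_int(df, columns):
    """Return a copy of df where NaN in each listed column is replaced by that column's median, then cast to int."""
    result = df.copy()
    for column in columns:
        median = result[column].median()
        # Assign back instead of using inplace=True on a column selection
        result[column] = result[column].fillna(median).astype(int)
    return result


# Hypothetical data, not taken from the race results
sample = pd.DataFrame({
    "age": [29.0, None, 45.0],
    "overall position": [1.0, 3.0, None],
})
cleaned = fill_median_as_int(sample, ["age", "overall position"])
print(cleaned)
</code></pre>
<p>Note that <code>astype(int)</code> truncates, so a median like 1.5 would end up as 1. That is usually acceptable for a filler value, but it is worth knowing.</p>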
<p>In the end, this is what loading my <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html">DataFrame</a> looked like:</p>
<pre><code class="lang-shell">(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ python3
Python 3.11.6 (main, Oct  3 2023, 00:00:00) [GCC 12.3.1 20230508 (Red Hat 12.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
</code></pre>
<p>And the resulting <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"><strong>DataFrame</strong></a> instance:</p>
<pre><code class="lang-shell">&gt;&gt;&gt; # Using custom load_data function that returns a Panda DataFrame
&gt;&gt;&gt; from empirestaterunup.data import load_data
&gt;&gt;&gt; load_data('empirestaterunup/results-full-level-2023.csv')
                    name  overall position            time gender  gender position  age  ...  65th floor division position 65th floor pace 65th floor time       wave        level     finishtimestamp
bib                                                                                      ...                                                                                                          
19         Wai Ching Soh                 1 0 days 00:10:36      M                1   29  ...                             1 0 days 00:54:03 0 days 00:07:34  ELITE MEN  Full Course 2023-09-04 20:10:36
22        Ryoji Watanabe                 2 0 days 00:10:52      M                2   40  ...                             1 0 days 00:54:31 0 days 00:07:38  ELITE MEN  Full Course 2023-09-04 20:10:52
16            Fabio Ruga                 3 0 days 00:11:14      M                3   42  ...                             2 0 days 00:57:09 0 days 00:08:00  ELITE MEN  Full Course 2023-09-04 20:11:14
11        Emanuele Manzi                 4 0 days 00:11:28      M                4   45  ...                             3 0 days 00:59:17 0 days 00:08:18  ELITE MEN  Full Course 2023-09-04 20:11:28
249             Alex Cyr                 5 0 days 00:11:52      M                5   28  ...                             2 0 days 01:01:19 0 days 00:08:35   SPONSORS  Full Course 2023-09-04 20:11:52
..                   ...               ...             ...    ...              ...  ...  ...                           ...             ...             ...        ...          ...                 ...
555     Caroline Edwards               372 0 days 00:55:17      F              143   47  ...                            39 0 days 04:57:23 0 days 00:41:38  GENERAL 2  Full Course 2023-09-04 20:55:17
557        Sarah Preston               373 0 days 00:55:22      F              144   34  ...                            41 0 days 04:58:20 0 days 00:41:46  GENERAL 2  Full Course 2023-09-04 20:55:22
544  Christopher Winkler               374 0 days 01:00:10      M              228   40  ...                            18 0 days 01:49:53 0 days 00:15:23  GENERAL 2  Full Course 2023-09-04 21:00:10
545          Jay Winkler               375 0 days 01:05:19      U               93   33  ...                            18 0 days 05:28:56 0 days 00:46:03  GENERAL 2  Full Course 2023-09-04 21:05:19
646           Dana Zajko               376 0 days 01:06:48      F              145   38  ...                            42 0 days 05:15:14 0 days 00:44:08  GENERAL 3  Full Course 2023-09-04 21:06:48

[375 rows x 24 columns]
</code></pre>
<p>Once the data was loaded, I was able to start asking questions. For example, to detect the outliers I used a <a target="_blank" href="https://en.wikipedia.org/wiki/Standard_score">Z-score</a>.</p>
<p>All the analysis logic <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/analyze.py">was kept together in a single module called 'analyze'</a>, separate from presentation, data loading, or reports, to promote reuse.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pandas <span class="hljs-keyword">import</span> DataFrame
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_zscore</span>(<span class="hljs-params">df: DataFrame, column: str</span>):</span>
    filtered = df[column]
    <span class="hljs-keyword">return</span> filtered.sub(filtered.mean()).div(filtered.std(ddof=<span class="hljs-number">0</span>))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_outliers</span>(<span class="hljs-params">df: DataFrame, column: str, std_threshold: int = <span class="hljs-number">3</span></span>) -&gt; DataFrame:</span>
    <span class="hljs-string">"""
    Use the z-score, anything further away than 3 standard deviations is considered an outlier.
    """</span>
    filtered_df = df[column]
    z_scores = get_zscore(df=df, column=column)
    is_over = np.abs(z_scores) &gt; std_threshold
    <span class="hljs-keyword">return</span> filtered_df[is_over]
</code></pre>
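<p>To see what the z-score rule does with concrete numbers, here is a dependency-free sketch that uses only the standard library. The ages are made up, and <code>z_scores</code> and <code>outliers</code> are illustrative names rather than the project's functions. <code>statistics.pstdev</code> is the population standard deviation, which corresponds to <code>ddof=0</code> in the pandas version:</p>
<pre><code class="lang-python">from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, measured in population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def outliers(values, threshold=3):
    """Keep only the values whose absolute z-score is above the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical ages: nineteen runners aged 30 and one aged 90
ages = [30] * 19 + [90]
found = outliers(ages)
print(found)
</code></pre>
<p>One property worth keeping in mind: with the population standard deviation, no value in a sample of n points can have an absolute z-score above the square root of n - 1 (Samuelson's inequality). A threshold of 3 can therefore only flag something when there are at least 11 values.</p>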
<p>Also, it is very simple to get common statistics just by calling <code>describe</code> on our data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pandas <span class="hljs-keyword">import</span> DataFrame
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_5_number</span>(<span class="hljs-params">criteria: str, data: DataFrame</span>) -&gt; DataFrame:</span>
    <span class="hljs-keyword">return</span> data[criteria].describe()
</code></pre>
<p>For example, let me show you summary metrics for different aspects of the race:</p>
<pre><code class="lang-shell">&gt;&gt;&gt; from empirestaterunup.data import load_data
&gt;&gt;&gt; df = load_data('empirestaterunup/results-full-level-2023.csv')
&gt;&gt;&gt; from empirestaterunup.analyze import get_5_number
&gt;&gt;&gt; from empirestaterunup.analyze import SUMMARY_METRICS
&gt;&gt;&gt; print(SUMMARY_METRICS)
('age', 'time', 'pace')
&gt;&gt;&gt; for key in SUMMARY_METRICS:
...     ndf = get_5_number(criteria=key, data=df)
...     print(ndf)
... 
count    375.000000
mean      41.309333
std       11.735968
min       11.000000
25%       33.000000
50%       40.000000
75%       49.000000
max       78.000000
Name: age, dtype: float64
count                          375
mean     0 days 00:23:03.461333333
std      0 days 00:08:06.313479117
min                0 days 00:10:36
25%                0 days 00:18:09
50%                0 days 00:21:20
75%         0 days 00:25:13.500000
max                0 days 01:06:48
Name: time, dtype: object
count                          375
mean     0 days 01:55:17.306666666
std      0 days 00:40:31.567395588
min                0 days 00:53:00
25%                0 days 01:30:45
50%                0 days 01:46:40
75%         0 days 02:06:07.500000
max                0 days 05:34:00
Name: pace, dtype: object
</code></pre>
<p>Making sure that web scraping, data loading, and analytics all work well is a must. Testing is an integral part of writing code, so I kept adding unit tests as the code grew.</p>
<p>Let's check how to test our code (feel free to skip the next section if you are familiar with unit testing).</p>
<h3 id="heading-testing-testing-and-after-thatmore-testing">Testing, testing, and after that...more testing</h3>
<p>I assume you are familiar with writing small, self-contained pieces of code to test your code. These are called unit tests.</p>
<blockquote>
<p>The unittest unit testing framework was originally inspired by JUnit and has a similar flavor as major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework. (From the <a target="_blank" href="https://docs.python.org/3/library/unittest.html">Python docs</a>)</p>
</blockquote>
<p>I tried to have a simple <a target="_blank" href="https://docs.python.org/3/library/unittest.html">unit test</a> for every method I wrote in the code. This saved me lots of headaches down the road. As I refactored the code and found better ways to get the same results, the tests confirmed that the numbers were still correct.</p>
<p>A unit test in this context is a class that extends <code>unittest.TestCase</code>. Each method that starts with <code>test_</code> is a test that must pass several assertions.</p>
<p>For example, to make sure the analytics worked as expected, I wrote a test module called <code>test_analyze</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Not all test cases are shown, please check the full code of 'test/test_analyze.py'</span>
<span class="hljs-keyword">import</span> unittest
<span class="hljs-keyword">from</span> pandas <span class="hljs-keyword">import</span> DataFrame
<span class="hljs-keyword">from</span> empirestaterunup.analyze <span class="hljs-keyword">import</span> get_country_counts
<span class="hljs-keyword">from</span> empirestaterunup.data <span class="hljs-keyword">import</span> load_data

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AnalyzeTestCase</span>(<span class="hljs-params">unittest.TestCase</span>):</span>
    df: DataFrame

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">setUpClass</span>(<span class="hljs-params">cls</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        cls.df = load_data()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_get_country_counts</span>(<span class="hljs-params">self</span>):</span>
        country_counts, min_countries, max_countries = get_country_counts(df=AnalyzeTestCase.df)
        self.assertIsNotNone(country_counts)
        self.assertEqual(<span class="hljs-number">2</span>, country_counts[<span class="hljs-string">'JPN'</span>])
        self.assertIsNotNone(min_countries)
        self.assertEqual(<span class="hljs-number">3</span>, min_countries.shape[<span class="hljs-number">0</span>])
        self.assertIsNotNone(max_countries)
        self.assertEqual(<span class="hljs-number">14</span>, max_countries.shape[<span class="hljs-number">0</span>])


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    unittest.main()
</code></pre>
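<p>The same pattern works for small functions that don't need the race data at all. The following is a minimal, self-contained sketch: <code>z_scores</code> is a simplified stand-in written for this example (not the project's <code>get_zscore</code>), and the suite is run programmatically so that the result can be inspected:</p>
<pre><code class="lang-python">import unittest
from statistics import mean, pstdev


def z_scores(values):
    """Simplified stand-in for the function under test."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


class ZScoreTestCase(unittest.TestCase):

    def test_mean_value_has_zero_score(self):
        scores = z_scores([10, 20, 30])
        self.assertAlmostEqual(0.0, scores[1])

    def test_scores_are_symmetric(self):
        scores = z_scores([10, 20, 30])
        self.assertAlmostEqual(scores[0], -scores[2])


# Load and run the test case explicitly instead of calling unittest.main()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ZScoreTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
</code></pre>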
<p>So far we have the data and have made sure <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/test/test_data.py">it meets the expectations</a>. I wrote <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/test/test_analyze.py">separate tests</a> for the analytics code and also for the scraper.</p>
<p>Testing the user interface requires a different approach, as it needs to simulate clicks and wait for screen changes. Sometimes failures are easy to spot (like crashes), but sometimes issues are much more subtle (did we get the right data displayed?).</p>
<p>We'll revisit this kind of testing after we first look at how to visualize the results.</p>
<h2 id="heading-how-to-visualize-the-results">How to Visualize the Results</h2>
<p>I wanted to use the terminal as much as possible to visualize my findings, and to keep requirements to a minimum. I decided to use the <a target="_blank" href="https://textual.textualize.io/">Textual</a> framework to accomplish that.</p>
<p>This framework is very complete and allows you to build text applications that are responsive and beautiful to look at.</p>
<p>They are also easy to write, so before we go deeper into the resulting applications, let's pause to learn about Textual.</p>
<h3 id="heading-text-user-interfaces-tui-with-textual">Text User Interfaces (TUI) with Textual</h3>
<p>The <a target="_blank" href="https://textual.textualize.io/">Textual project</a> has a nice tutorial that <a target="_blank" href="https://textual.textualize.io/tutorial/">you can read</a> to get up to speed.</p>
<p>Let's see some code. One of the applications is called <code>esru_outlier</code>. Its TUI code lives in the <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/apps.py">apps</a> module, and it shows several tables with the outliers we found before using the z-score.</p>
<p><code>OutlierApp</code> (extends <code>App</code>) collects the basic information in a table for each outlier group and then calls the <code>RunnerDetailScreen</code> to display details about a runner.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esrm_outlier_first_screen.png" alt="Screen shot of the OutlierApp table that shows outliers on the race results" width="600" height="400" loading="lazy"></p>
<p><em>Outliers first screen (by Age, Running Time, and Pace)</em></p>
<p>Next is code with explanations that shows how to build this screen:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Only the code of the application shown here</span>
<span class="hljs-comment"># This application shows 3 tables: SUMMARY_METRICS = (RaceFields.AGE.value, RaceFields.TIME.value, RaceFields.PACE.value)</span>
<span class="hljs-comment"># Every application in Textual extends the App class</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OutlierApp</span>(<span class="hljs-params">App</span>):</span>
    DF: DataFrame = <span class="hljs-literal">None</span>
    BINDINGS = [ (<span class="hljs-string">"q"</span>, <span class="hljs-string">"quit_app"</span>, <span class="hljs-string">"Quit"</span>), ]  <span class="hljs-comment"># Bind 'q' to the 'quit_app' action (method `action_quit_app`), which in turn exits the app</span>
    CSS_PATH = <span class="hljs-string">"outliers.tcss"</span>  <span class="hljs-comment"># Styling can be done externally, similar to using CSS</span>
    ENABLE_COMMAND_PALETTE = <span class="hljs-literal">False</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">action_quit_app</span>(<span class="hljs-params">self</span>):</span>
        self.exit(<span class="hljs-number">0</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compose</span>(<span class="hljs-params">self</span>) -&gt; ComposeResult:</span>
        <span class="hljs-string">"""
        Here we 'Yield' Widgets/ components that will be rendered in order on the TUI
        How do the components get their layout on the screen? They use a cascading style sheet (CSS): outliers.tcss and
        some explicit layout containers like the class `Vertical` that can contain other Widgets
        Here we have a header, tables, and a footer 
        """</span>
        <span class="hljs-keyword">yield</span> Header(show_clock=<span class="hljs-literal">True</span>)
        <span class="hljs-keyword">for</span> column_name <span class="hljs-keyword">in</span> SUMMARY_METRICS:
            table = DataTable(id=<span class="hljs-string">f'<span class="hljs-subst">{column_name}</span>_outlier'</span>)
            table.cursor_type = <span class="hljs-string">'row'</span>
            table.zebra_stripes = <span class="hljs-literal">True</span>
            table.tooltip = <span class="hljs-string">"Get runner details"</span>
            <span class="hljs-keyword">if</span> column_name == RaceFields.AGE.value:
                label = Label(<span class="hljs-string">f"<span class="hljs-subst">{column_name}</span> (older) outliers:"</span>.title())
            <span class="hljs-keyword">else</span>:
                label = Label(<span class="hljs-string">f"<span class="hljs-subst">{column_name}</span> (slower) outliers:"</span>.title())
            <span class="hljs-keyword">yield</span> Vertical(
                label,
                table
            )
        <span class="hljs-keyword">yield</span> Footer()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_mount</span>(<span class="hljs-params">self</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""
        Here we populate each table with data from the DataFrame. Each table has outliers of different types.
        All can be obtained with the `get_outliers` method.
        """</span>
        <span class="hljs-keyword">for</span> column <span class="hljs-keyword">in</span> SUMMARY_METRICS:
            table = self.get_widget_by_id(<span class="hljs-string">f'<span class="hljs-subst">{column}</span>_outlier'</span>, expect_type=DataTable)
            columns = [x.title() <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> [<span class="hljs-string">'bib'</span>, column]]
            table.add_columns(*columns)
            table.add_rows(*[get_outliers(df=OutlierApp.DF, column=column).to_dict().items()])

<span class="hljs-meta">    @on(DataTable.HeaderSelected)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_header_clicked</span>(<span class="hljs-params">self, event: DataTable.HeaderSelected</span>):</span>
        <span class="hljs-string">"""
        When the user selects a column header it generates a 'HeaderSelected' event.
        The annotation on this method tells Textual that we will handle this event here
        We can extract the table, the selected column, and then sort the table contents.
        """</span>
        table = event.data_table
        table.sort(event.column_key)

<span class="hljs-meta">    @on(DataTable.RowSelected)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_row_clicked</span>(<span class="hljs-params">self, event: DataTable.RowSelected</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""
        Similarly, when the user selects a row it generates a RowSelected event
        What we do on the 'on_row_clicked' method is capture the event, get the row contents, and construct
        a new modal screen (RunnerDetailScreen) which we push on top of the regular screen.
        There we show the runner details differently. 
        """</span>
        table = event.data_table
        row = table.get_row(event.row_key)
        runner_detail = RunnerDetailScreen(df=OutlierApp.DF, row=row)
        self.push_screen(runner_detail)
</code></pre>
<p>The class <code>RunnerDetailScreen</code> (extends <code>ModalScreen</code>) handles showing the racer details using formatted Markdown, which shows up when you click on the table that was rendered before:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esrm_outlier_runner_detail.png" alt="Screen shot of the OutlierApp runner details that shows outliers on the race results" width="600" height="400" loading="lazy"></p>
<p><em>Rendered Markdown with details about the selected runner</em></p>
<p>And here's the code that allows that with explanations:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Omitted imports and helper methods, only showing TUI-related code. See the 'apps.py' file for full code</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RunnerDetailScreen</span>(<span class="hljs-params">ModalScreen</span>):</span>
    ENABLE_COMMAND_PALETTE = <span class="hljs-literal">False</span>  <span class="hljs-comment"># Disable the search bar, it is active by default and is not needed here</span>
    CSS_PATH = <span class="hljs-string">"runner_details.tcss"</span>  <span class="hljs-comment"># Handle the styles using external CSS</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">
            self,
            name: str | None = None,
            ident: str | None = None,
            classes: str | None = None,
            row: List[Any] | None = None,
            df: DataFrame = None,
            country_df: DataFrame = None
    </span>):</span>
        <span class="hljs-string">"""
        Override the constructor and load useful data like country ISO codes
        We get the Pandas DataFrame with the details that will be shown to the user
        """</span>
        super().__init__(name, ident, classes)
        self.row = row
        self.df = df
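        <span class="hljs-keyword">if</span> country_df <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:  <span class="hljs-comment"># 'not country_df' would raise ValueError for a DataFrame</span>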
            self.country_df = load_country_details()
        <span class="hljs-keyword">else</span>:
            self.country_df = country_df

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compose</span>(<span class="hljs-params">self</span>) -&gt; ComposeResult:</span>
        <span class="hljs-string">"""
        In compose we prepare the markdown, and we let the MarkdownViewer handle details like 
        a nice automatic table of contents.
        Notice that we call `self.log.info('xxx')`. We use that for debugging when this application
        is called using 'textual'.
        """</span>
        bib_idx = FIELD_NAMES_AND_POS[RaceFields.BIB]
        bibs = [self.row[bib_idx]]
        columns, details = df_to_list_of_tuples(self.df, bibs)
        self.log.info(<span class="hljs-string">f"Columns: <span class="hljs-subst">{columns}</span>"</span>)
        self.log.info(<span class="hljs-string">f"Details: <span class="hljs-subst">{details}</span>"</span>)
        row_markdown = <span class="hljs-string">""</span>
        position_markdown = {}
        split_markdown = {}
        <span class="hljs-keyword">for</span> legend <span class="hljs-keyword">in</span> [<span class="hljs-string">'full'</span>, <span class="hljs-string">'20th'</span>, <span class="hljs-string">'65th'</span>]:
            position_markdown[legend] = <span class="hljs-string">''</span>
            split_markdown[legend] = <span class="hljs-string">''</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, len(columns)):
            column = columns[i]
            detail = details[<span class="hljs-number">0</span>][i]
            <span class="hljs-keyword">if</span> re.search(<span class="hljs-string">'pace|time'</span>, column):
                <span class="hljs-keyword">if</span> re.search(<span class="hljs-string">'20th'</span>, column):
                    split_markdown[<span class="hljs-string">'20th'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
                <span class="hljs-keyword">elif</span> re.search(<span class="hljs-string">'65th'</span>, column):
                    split_markdown[<span class="hljs-string">'65th'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
                <span class="hljs-keyword">else</span>:
                    split_markdown[<span class="hljs-string">'full'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
            <span class="hljs-keyword">elif</span> re.search(<span class="hljs-string">'position'</span>, column):
                <span class="hljs-keyword">if</span> re.search(<span class="hljs-string">'20th'</span>, column):
                    position_markdown[<span class="hljs-string">'20th'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
                <span class="hljs-keyword">elif</span> re.search(<span class="hljs-string">'65th'</span>, column):
                    position_markdown[<span class="hljs-string">'65th'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
                <span class="hljs-keyword">else</span>:
                    position_markdown[<span class="hljs-string">'full'</span>] += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
            <span class="hljs-keyword">elif</span> re.search(<span class="hljs-string">'url|bib'</span>, column):
                <span class="hljs-keyword">pass</span>  <span class="hljs-comment"># Skip uninteresting columns</span>
            <span class="hljs-keyword">else</span>:
                row_markdown += <span class="hljs-string">f"\n* **<span class="hljs-subst">{column.title()}</span>:** <span class="hljs-subst">{detail}</span>"</span>
        <span class="hljs-keyword">yield</span> MarkdownViewer(<span class="hljs-string">f"""# Full Course Race details     
## Runner BIO (BIB: <span class="hljs-subst">{bibs[<span class="hljs-number">0</span>]}</span>)
<span class="hljs-subst">{row_markdown}</span>
## Positions
### 20th floor        
<span class="hljs-subst">{position_markdown[<span class="hljs-string">'20th'</span>]}</span>
### 65th floor        
<span class="hljs-subst">{position_markdown[<span class="hljs-string">'65th'</span>]}</span>
### Full course        
<span class="hljs-subst">{position_markdown[<span class="hljs-string">'full'</span>]}</span>                
## Race time split   
### 20th floor        
<span class="hljs-subst">{split_markdown[<span class="hljs-string">'20th'</span>]}</span>
### 65th floor        
<span class="hljs-subst">{split_markdown[<span class="hljs-string">'65th'</span>]}</span>
### Full course        
<span class="hljs-subst">{split_markdown[<span class="hljs-string">'full'</span>]}</span>         
        """</span>)
        <span class="hljs-comment"># This button is used to close this screen and send the user to the previous screen</span>
        btn = Button(<span class="hljs-string">"Close"</span>, variant=<span class="hljs-string">"primary"</span>, id=<span class="hljs-string">"close"</span>)
        btn.tooltip = <span class="hljs-string">"Back to main screen"</span>
        <span class="hljs-keyword">yield</span> btn

<span class="hljs-meta">    @on(Button.Pressed, "#close")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_button_pressed</span>(<span class="hljs-params">self, _</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""
        Simple logic: pop this screen so that the previous one becomes visible again
        """</span>
        self.app.pop_screen()
</code></pre>
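<p>The nested <code>if</code>/<code>elif</code> chain in <code>compose</code> sorts each column into a section (bio, positions, or time splits) and a floor (20th, 65th, or full course). As a sketch only, the same decision could be expressed as a small pure function, which also makes it easy to unit test without starting the TUI. The names <code>classify</code> and <code>group_details</code> and the sample values are hypothetical and are not part of <code>apps.py</code>:</p>
<pre><code class="lang-python">import re
from collections import defaultdict


def classify(column):
    """Return (section, floor) for a column name, or None if the column is skipped.

    The checks run in the same order as the if/elif chain in compose().
    """
    if re.search('pace|time', column):
        section = 'split'
    elif re.search('position', column):
        section = 'position'
    elif re.search('url|bib', column):
        return None  # Skip uninteresting columns
    else:
        return ('bio', 'full')
    for floor in ('20th', '65th'):
        if floor in column:
            return (section, floor)
    return (section, 'full')


def group_details(columns, details):
    """Build the Markdown bullet lines, grouped by (section, floor)."""
    groups = defaultdict(list)
    for column, detail in zip(columns, details):
        key = classify(column)
        if key is not None:
            groups[key].append(f"* **{column.title()}:** {detail}")
    return groups


# Hypothetical runner, not taken from the race results
columns = ['name', 'bib', 'overall position', '20th floor pace', '65th floor time', 'time']
details = ['Jane Doe', 7, 12, '0:50:00', '0:09:00', '0:15:00']
groups = group_details(columns, details)
for key, lines in groups.items():
    print(key, lines)
</code></pre>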
<p>The <code>RunnerDetailScreen</code> class is reusable. There are other classes (like <code>BrowserApp</code> in this tutorial) that also send data when a user clicks on a table row, and those details get displayed using this modal screen.</p>
<p>We can customize the appearance with style sheets that look a lot like a web application's <a target="_blank" href="https://en.wikipedia.org/wiki/CSS">CSS</a> (but are not exactly the same). For example, to add style to a button, here's the code:</p>
<pre><code class="lang-text">Button {
    dock: bottom;
    width: 100%;
    height: auto;
}
</code></pre>
<p>As you can see, Textual is a pretty powerful framework. It reminds me a lot of <a target="_blank" href="https://en.wikipedia.org/wiki/Swing_(Java)">Java Swing</a>, but without the extra complexity.</p>
<p>But is it just information in tabular format? I also wanted different graph types that could explain things like age clusters and gender distribution. For that, I wrote a few classes in the 'apps' module with the help of Matplotlib.</p>
<h3 id="heading-plots-with-matplotlib">Plots with Matplotlib</h3>
<p>I wanted to use some charts to display the data, and I made them with <a target="_blank" href="https://matplotlib.org/">matplotlib</a>. The code to generate an age box plot, which shows how old the participating runners were, is very straightforward.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_age_box_plot.png" alt="Box plot showing age distribution among racers" width="600" height="400" loading="lazy"></p>
<p><em>Age box plot in Matplotlib that shows that most of the runners were in the 40-50 year old range.</em></p>
<p>The plotting code lives in the <code>Plotter</code> class. As an example, here's the method that draws the gender pie chart:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Not all code is shown here (helper methods, imports)</span>
<span class="hljs-comment"># Please check the apps.py module to see all missing code</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Plotter</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_gender</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""
        In this method, we get our data frame filtering by gender and get counts
        Then we create a pie plot
        """</span>
        series = self.df[RaceFields.GENDER.value].value_counts()
        fig, ax = plt.subplots(layout=<span class="hljs-string">'constrained'</span>)
        wedges, texts, auto_texts = ax.pie(
            series.values,
            labels=series.keys(),
            autopct=<span class="hljs-string">"%%%.2f"</span>,
            shadow=<span class="hljs-literal">True</span>,
            startangle=<span class="hljs-number">90</span>,
            explode=(<span class="hljs-number">0.1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>)
        )
        ax.set_title(<span class="hljs-string">"Gender participation"</span>)
        ax.set_xlabel(<span class="hljs-string">'Gender distribution'</span>)

        <span class="hljs-comment"># Legend with the fastest runners by gender</span>
        fastest = find_fastest(self.df, FastestFilters.Gender)
        fastest_legend = [<span class="hljs-string">f"<span class="hljs-subst">{fastest[gender][<span class="hljs-string">'name'</span>]}</span> - <span class="hljs-subst">{beautify_race_times(fastest[gender][<span class="hljs-string">'time'</span>])}</span>"</span> <span class="hljs-keyword">for</span> gender <span class="hljs-keyword">in</span>
                          series.keys()]
        ax.legend(wedges, fastest_legend,
                  title=<span class="hljs-string">"Fastest by gender"</span>,
                  loc=<span class="hljs-string">"center left"</span>,
                  bbox_to_anchor=(<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">1</span>))
</code></pre>
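<p>The <code>autopct</code> argument in the code above is an old-style format string that Matplotlib applies to each wedge's percentage. As a small sketch (the share value is made up for illustration), this is what that string produces compared to the more common form with the sign after the number:</p>

```python
# Old-style "%" formatting, which Matplotlib applies when autopct is a string.
share = 61.5  # hypothetical wedge percentage, not taken from the race data

prefix_style = "%%%.2f" % share  # the format used in plot_gender above
suffix_style = "%.2f%%" % share  # the more conventional form

print(prefix_style)  # %61.50
print(suffix_style)  # 61.50%
```

<p>In both strings, <code>%%</code> is an escaped percent sign and <code>%.2f</code> is the number with two decimals. Only the position of the sign changes.</p>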
<p>Interesting – most of the runners were between 40-50 years old.</p>
<p>Now let's turn to testing the TUI.</p>
<h3 id="heading-testing-the-user-interfaces">Testing the User Interfaces</h3>
<p>When I started working on this small project, I knew that there was going to be a lot of testing. What I wasn't sure about was how I would be able to test the TUI.</p>
<p>I found at least two approaches that work well with Textual: watching the message flow between components, and writing unit tests with a twist.</p>
<h4 id="heading-following-the-message-flow-with-textual">Following the message flow with Textual</h4>
<p>Textual supports an interesting development mode that allows you to change CSS and see the changes on your application without a restart. Also, you can see how the TUI events propagate, which is invaluable for debugging.</p>
<p>In one terminal, start the console:</p>
<pre><code class="lang-shell">(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ . ~/virtualenv/EmpireStateRunUp/bin/activate
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ textual console
▌Textual Development Console v0.46.0                                                                                                                                             
▌Run a Textual app with textual run --dev my_app.py to connect.                                                                                                                  
▌Press Ctrl+C to quit.
</code></pre>
<p>Then in another terminal, start your application but using development mode:</p>
<pre><code class="lang-shell">(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ textual run --dev --command esru_browser
</code></pre>
<p>If you check back on your console terminal, you will see any messages you sent with <code>App.log</code> along with the events:</p>
<pre><code class="lang-shell">─────────────────────────────────────────────────────────────────────────── Client '127.0.0.1' connected ───────────────────────────────────────────────────────────────────────────
[18:28:17] SYSTEM                                                                                                                                                        app.py:2188
Connected to devtools ( ws://127.0.0.1:8081 )
[18:28:17] SYSTEM                                                                                                                                                        app.py:2192
---
[18:28:17] SYSTEM                                                                                                                                                        app.py:2194
driver=&lt;class 'textual.drivers.linux_driver.LinuxDriver'&gt;
[18:28:17] SYSTEM                                                                                                                                                        app.py:2195
loop=&lt;_UnixSelectorEventLoop running=True closed=False debug=False&gt;
[18:28:17] SYSTEM                                                                                                                                                        app.py:2196
features=frozenset({'debug', 'devtools'})
[18:28:17] SYSTEM                                                                                                                                                        app.py:2228
STARTED FileMonitor({PosixPath('/home/josevnz/EmpireStateCleanup/docs/EmpireStateRunUp/empirestaterunup/browser.tcss')})
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Load() &gt;&gt;&gt; BrowserApp(title='Race Runners', classes={'-dark-mode'}) method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; DataTable(id='runners') method=&lt;ScrollView.on_mount&gt;
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; DataTable(id='runners') method=&lt;Widget.on_mount&gt;
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; Footer() method=&lt;Footer.on_mount&gt;
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; Footer() method=&lt;Widget.on_mount&gt;
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; ToastRack(id='textual-toastrack') method=&lt;Widget.on_mount&gt;
...
RowHighlighted(cursor_row=0, row_key=&lt;textual.widgets._data_table.RowKey object at 0x7fc8d98800d0&gt;) &gt;&gt;&gt; BrowserApp(title='Race Runners', classes={'-dark-mode'}) method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() &gt;&gt;&gt; ScrollBarCorner() method=&lt;Widget.on_mount&gt;
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Resize(size=Size(width=2, height=1), virtual_size=Size(width=178, height=47), container_size=Size(width=178, height=47)) &gt;&gt;&gt; ScrollBarCorner() method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Show() &gt;&gt;&gt; ScrollBarCorner() method=None
</code></pre>
<h4 id="heading-using-unittest-and-pilot">Using unittest and Pilot</h4>
<p>The framework has the <a target="_blank" href="https://textual.textualize.io/api/pilot/">Pilot class</a> that you can use to make automated calls to Textual Widgets and wait for events. This means you can simulate user interaction with the application to validate that it behaves as expected. This is more powerful than the regular unit tests as you can also cover UI interactions with expected results:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> unittest
<span class="hljs-keyword">from</span> textual.widgets <span class="hljs-keyword">import</span> DataTable, MarkdownViewer
<span class="hljs-keyword">from</span> empirestaterunup.apps <span class="hljs-keyword">import</span> BrowserApp


<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AppTestCase</span>(<span class="hljs-params">unittest.IsolatedAsyncioTestCase</span>):</span>
    <span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_browser_app</span>(<span class="hljs-params">self</span>):</span>
        app = BrowserApp()
        self.assertIsNotNone(app)
        <span class="hljs-keyword">async</span> <span class="hljs-keyword">with</span> app.run_test() <span class="hljs-keyword">as</span> pilot:

            <span class="hljs-string">"""
            Test the command palette
            """</span>
            <span class="hljs-keyword">await</span> pilot.press(<span class="hljs-string">"ctrl+\\"</span>)
            <span class="hljs-keyword">for</span> char <span class="hljs-keyword">in</span> <span class="hljs-string">"jose"</span>:
                <span class="hljs-keyword">await</span> pilot.press(char)
            <span class="hljs-keyword">await</span> pilot.press(<span class="hljs-string">"enter"</span>)
            <span class="hljs-comment"># This returns the runner screen. Check that it has some contents</span>
            markdown_viewer = app.screen.query(MarkdownViewer).first()
            self.assertTrue(markdown_viewer.document)
            <span class="hljs-keyword">await</span> pilot.click(<span class="hljs-string">"#close"</span>)  <span class="hljs-comment"># Close the new screen, pop the original one</span>
            <span class="hljs-comment"># Go back to the main screen, now select a runner but using the table</span>
            table = app.screen.query(DataTable).first()
            coordinate = table.cursor_coordinate
            self.assertTrue(table.is_valid_coordinate(coordinate))
            <span class="hljs-keyword">await</span> pilot.press(<span class="hljs-string">"enter"</span>)
            <span class="hljs-keyword">await</span> pilot.pause()
            markdown_viewer = app.screen.query(MarkdownViewer).first()
            self.assertTrue(markdown_viewer)
            <span class="hljs-comment"># After validating the markdown one more time, close the app</span>
            <span class="hljs-comment"># Quit the app by pressing q</span>
            <span class="hljs-keyword">await</span> pilot.press(<span class="hljs-string">"q"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    unittest.main()
</code></pre>
<p>This is invaluable, and something that often requires an external toolset to validate (for example, in Java you have the class <a target="_blank" href="https://docs.oracle.com/javase/8/docs/api/java/awt/Robot.html">Robot</a>).</p>
<h2 id="heading-how-to-run-the-applications">How to Run the Applications</h2>
<p>Finally, it's time to get familiar with the mini applications (you can see an animated <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/EmpireStateRunUp.svg">demonstration of the TUI applications here</a>).</p>
<h3 id="heading-browsing-through-the-data">Browsing Through the Data</h3>
<p>The <code>esru_browser</code> is a simple browser that lets you navigate through the raw race data.</p>
<pre><code class="lang-shell">esru_browser
</code></pre>
<p>The application shows all the race details for every Runner in a table that allows sorting by column.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_browser.png" alt="Raw runners data in a table" width="600" height="400" loading="lazy"></p>
<p><em>The esru_browser window shows all runners' results. Here you can sort, search for runners, and click to get more details</em></p>
<p>And the command palette allows searching for runners by name (it's basically a search bar with fuzzy logic):</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/race_runners_2023-12-31T18_35_53_558956.svg" alt="race_runners_2023-12-31T18_35_53_558956.svg, searching for runners by name" width="600" height="400" loading="lazy"></p>
<p><em>Matches show up on the palette as you type</em></p>
<h3 id="heading-summary-reports">Summary Reports</h3>
<p>To get insights about racer behavior, you need some summary reports (as opposed to drilling down into each racer's details).</p>
<p>This application provides details about the following:</p>
<ul>
<li><p>Count, standard deviation, mean, min, max, 45%, 50%, and 75% for age, time, and pace</p>
</li>
<li><p>Group and count distribution for Age, Wave, and Gender</p>
</li>
</ul>
<pre><code class="lang-shell">esru_numbers
</code></pre>
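<p>To make the first bullet concrete, here is a minimal sketch of how describe-style numbers could be computed with Python's standard library. The ages are made up, the <code>summarize</code> helper is hypothetical, and the choice of quartiles is an assumption. The application itself may compute these values differently.</p>

```python
import statistics

# Hypothetical runner ages; the real application reads them from the race data.
ages = [11, 25, 33, 38, 40, 40, 41, 45, 52, 78]


def summarize(values):
    """Return describe-style summary statistics for a list of numbers."""
    # Assumption: quartiles (25%, 50%, 75%) computed with the inclusive method.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

<p>The same helper could be called once per metric (age, time, pace) to build a table like the one the application shows.</p>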
<p>Some interesting facts about the race:</p>
<ul>
<li><p>The average age was 41 years old, and 40 years old was the largest age group.</p>
</li>
<li><p>The majority of people belonged to the 'BLACK WAVE'.</p>
</li>
<li><p>The majority of the people finished the race in 20 to 30 minutes.</p>
</li>
<li><p>The youngest runner was 11 years old, and the oldest was 78.</p>
</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_numbers.svg" alt="Statistics of interest, like average age, wave they belong, finishing time" width="600" height="400" loading="lazy"></p>
<p><em>esru_numbers gives a bird's eye view of all the racers, categorized by buckets</em></p>
<h3 id="heading-finding-outliers">Finding Outliers</h3>
<p>This application uses the <em>Z-score</em> to find the outliers for several metrics for this race:</p>
<pre><code class="lang-shell">esru_outlier
</code></pre>
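<p>As a rough illustration of the technique, a Z-score measures how many standard deviations a value sits from the mean, and values beyond a chosen threshold are flagged. The sketch below uses made-up finish times, a hypothetical <code>find_outliers</code> helper, the population standard deviation, and a threshold of 2. All of these are assumptions, not the application's actual settings.</p>

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return (value, z_score) pairs whose absolute Z-score exceeds threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing can be an outlier
    scored = [(value, (value - mean) / std) for value in values]
    return [(value, z) for value, z in scored if abs(z) > threshold]


# Hypothetical finish times in minutes, with one very slow runner.
times = [22, 24, 25, 25, 26, 27, 28, 29, 30, 95]
outliers = find_outliers(times, threshold=2.0)

for value, z in outliers:
    print(f"{value} minutes, z-score {z:.2f}")
```

<p>With a small sample like this one, a lower threshold is needed because a single extreme value also inflates the standard deviation.</p>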
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_outlier-1.svg" alt="Table with outliers details" width="600" height="400" loading="lazy"></p>
<p><em>The esru_outlier main screen shows you racers that did not follow regular patterns</em></p>
<p>Because these results drill down to the BIB number, you can click on a row and get more details about a runner:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_outlier-2.svg" alt="Outlier racer details, including BIB" width="600" height="400" loading="lazy"></p>
<p><em>And you can get details for each outlier. Yes, code is reusable and is the same to show details for any runner</em></p>
<p>Textual has excellent support for rendering Markdown as well as programming languages. Take a look at the code to see for yourself.</p>
<h3 id="heading-a-few-plot-graphics-for-you">A Few Plot Graphics For You</h3>
<p>The <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/apps.py">esru_plot</a> application offers a few plot graphics to help you visualize the data. Inside, the class <code>Plotter</code> does all the heavy lifting.</p>
<h4 id="heading-age-plots">Age plots</h4>
<p>The program can generate two flavors of the same data. The first is a box diagram:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru_age_box_plot-1.png" alt="Age plot, Pie chart" width="600" height="400" loading="lazy"></p>
<p><em>The age box diagram we saw before</em></p>
<p>The second is a regular histogram:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/age_histogram.png" alt="Age histogram" width="600" height="400" loading="lazy"></p>
<p><em>Age histogram shows the same as the box diagram but the buckets are more visible. Same data, many ways to explain the racer demographics.</em></p>
<p>You can see from both graphics that the age group with the most participants is the 40-45-year-old bracket and the outliers are in the 10-20 and 70-80 year old groups.</p>
<h4 id="heading-participants-per-country-plot">Participants per country plot</h4>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/participants_per_country.png" alt="Histogram" width="600" height="400" loading="lazy"></p>
<p><em>This plot shows all the countries with the number of participants, with the best runner from each.</em></p>
<p>No surprises here: the overwhelming majority of racers come from the United States, followed by Mexico. Interestingly, the winner of the 2023 race is from Malaysia, a country with only 2 runners participating.</p>
<h4 id="heading-gender-distribution">Gender distribution</h4>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/gender_distribution.png" alt="Gender pie" width="600" height="400" loading="lazy"></p>
<p><em>The gender distribution pie showing the best racer for each category</em></p>
<p>The majority of the runners identified themselves as male, followed by female.</p>
<h2 id="heading-what-else-can-we-learn">What Else Can We Learn?</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/esru2023_nyc-1.JPG" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>NYC was well represented at the event. Yeah, I'm talking about the NYC police department running in full gear, not me on the left ;-)</em></p>
<p>Participating in this race was a great experience. The best part was that it fueled my curiosity and led me to write this code to get more interesting facts about the race.</p>
<p>There is plenty more to learn about the tools you just saw in this tutorial:</p>
<ul>
<li><p>There are a lot of public race datasets, and you can use them to apply what you learned here. Just take a look at <a target="_blank" href="https://github.com/davidjaimes/nyc-marathon">this dataset of the New York City Marathon, period 1970-2018</a>. What <a target="_blank" href="https://github.com/meiguan/nyc2018marathonfinishers">other questions</a> can you ask about the data?</p>
</li>
<li><p>You saw just the tip of what you can do with Textual. I encourage you to explore the <a target="_blank" href="https://github.com/josevnz/tutorials/blob/main/docs/EmpireStateRunUp/empirestaterunup/apps.py">apps.py</a> module. Take a look at the <a target="_blank" href="https://github.com/Textualize/textual/tree/main/examples">example applications</a> as well.</p>
</li>
<li><p><a target="_blank" href="https://www.selenium.dev/documentation/webdriver/">Selenium WebDriver</a> is not just a tool for web scraping but also for automated testing of web applications. It doesn't get better than having your browser perform automated testing for you. It is a big framework, so be prepared to spend time reading and running your tests. I strongly suggest you look <a target="_blank" href="https://github.com/SeleniumHQ/seleniumhq.github.io/tree/trunk/examples/python">at the examples</a>. Trial and error will give you better results.</p>
</li>
<li><p>Apply for the <a target="_blank" href="https://www.esbnyc.com/empire-state-building-run">Empire State Run Up</a> lottery or run through a charity, if you like this kind of race. Who said <a target="_blank" href="https://en.wikipedia.org/wiki/King_Kong">King Kong</a> is the only one who could make it to the top?</p>
</li>
<li><p>Sadly, I'm not in a position to offer you any training advice. Every person is different. I do recommend that you check with your doctor before you participate in a race like this, and get some professional advice from a running coach.</p>
</li>
<li><p>But most important of all, believe you can do this (the race and writing some tools to process the race data) and have fun while doing it. This is a pre-requisite for any project.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is Microsoft Fabric? How to Build a Customer Segmentation Project ]]>
                </title>
                <description>
                    <![CDATA[ Microsoft Fabric is a data analytics tool that can help you streamline all your data needs and workflows, from data integration to analytics and engineering. In this guide, I'll explain what Microsoft Fabric is in more detail, how it works, and walk ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-microsoft-fabric/</link>
                <guid isPermaLink="false">66ba5bc6fa3ca700fcc9f230</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Microsoft ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Benny Ifeanyi Iheagwara ]]>
                </dc:creator>
                <pubDate>Tue, 05 Mar 2024 01:00:06 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/03/Green-Orange-and-Brown-Collage-Math-Quiz-Presentation-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Microsoft Fabric is a data analytics tool that can help you streamline all your data needs and workflows, from data integration to analytics and engineering.</p>
<p>In this guide, I'll explain what Microsoft Fabric is in more detail, how it works, and walk you through building a project with it. If you already have an understanding of the platform, you can skip to the <a class="post-section-overview" href="#heading-how-to-get-started-with-microsoft-fabric-an-end-to-end-project-example-1">Microsoft Fabric project.</a></p>
<p>Here's what you'll learn about in this guide:</p>
<ul>
<li><a class="post-section-overview" href="#heading-what-is-microsoft-fabric">What is Microsoft Fabric?</a></li>
<li><a class="post-section-overview" href="#heading-why-you-should-learn-about-microsoft-fabric">Why you should learn about Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#heading-microsoft-fabric-architecture">Microsoft Fabric architecture and components</a></li>
<li><a class="post-section-overview" href="#heading-how-to-get-started-with-microsoft-fabric-an-end-to-end-project-example-1">How to get started by building a simple project</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-workspace-in-microsoft-fabric">How to create a workspace in Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-lakehouse-in-microsoft-fabric">How to create a Lakehouse in Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#how-to-use-kaggle-data-in-microsoft-fabric">How to use Kaggle API data in Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-the-data-wrangler-in-microsoft-fabric">How to use the Data Wrangler in Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#heading-how-to-perform-customer-segmentation-in-microsoft-fabric">How to perform customer segmentation in Microsoft Fabric</a></li>
<li><a class="post-section-overview" href="#heading-how-to-visualize-lakehouse-data-in-power-bi">How to visualize your lakehouse data in Power BI</a></li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you will need to have a Power BI license. You can get one for free to practice with using the <a target="_blank" href="https://learn.microsoft.com/en-us/office/developer-program/microsoft-365-developer-program">Microsoft 365 Developer Program</a>.</p>
<p>It would also be helpful if you have some knowledge of Microsoft Power BI and Python.</p>
<h2 id="heading-what-is-microsoft-fabric">What is Microsoft Fabric?</h2>
<p>Microsoft Fabric is an all-in-one analytics software-as-a-service (SaaS) platform for managing all your data analytics needs and workflows. Microsoft built this end-to-end platform to handle everything data-related, from your data storage and migration to your real-time data analytics, data science projects, and data engineering workflow.</p>
<p>But how does it work?</p>
<p>This tool brings together various new and preexisting data tools and technologies—Power BI, OneLake, Azure Data Factory, Data Activator, Power Query, Apache Spark, Synapse Data Warehouse, Synapse Data Engineering, Synapse Data Science, Synapse Real-Time Analytics, Azure Machine Learning, and various connectors.</p>
<h2 id="heading-why-you-should-learn-about-microsoft-fabric">Why You Should Learn About Microsoft Fabric</h2>
<p>The best part of Microsoft Fabric is its simplicity in terms of functionality. Using various technologies together, you can do everything all in one place and focus more on what you can do with it and less on licensing, supporting systems, dependencies, and how to integrate with all these different platforms.</p>
<p>Another benefit of the platform is how it handles your data: it lets you maintain a single reliable source of information. With Microsoft Fabric’s OneLake, you get a single, unified data store.</p>
<p>Microsoft Fabric also has Azure’s OpenAI service integrated into its layer. This way, you can use AI (Copilot) to help you discover insights quickly.</p>
<p>Lastly, since it is an all-in-one platform, there is a cost-saving edge since there is no need to subscribe to multiple vendors.</p>
<h2 id="heading-microsoft-fabric-architecture">Microsoft Fabric Architecture</h2>
<p>Think of Microsoft Fabric as your data estate.</p>
<p>Just like every piece of real estate, Microsoft Fabric has various components in its architecture.</p>
<p>Let’s start by looking at the terminology you'll encounter and need to understand when using Microsoft Fabric's architecture:</p>
<h3 id="heading-experiences-and-workloads">Experiences and Workloads</h3>
<p>These refer to the various capabilities of the platform. Every experience on the platform is tailored with a specific user in mind. </p>
<p>Below are some examples of the various experiences/workloads available. You'll notice that each of them is built for a specific purpose, task, and user.</p>
<ul>
<li><strong>Data factory</strong>: This application gives users over 150 connectors to Lakehouses, warehouses, cloud, and on-premise data sources and orchestrates data pipelines for data transformation. A Lakehouse here refers to a data platform for storing structured and unstructured data. You can also copy your on-prem data to the cloud and load it into OneLake through the Data Factory.</li>
<li><strong>Synapse data engineering</strong> is part of the data engineering experience on the platform. It has some cool features like Lakehouses, built data pipelines, and a Spark engine.</li>
<li><strong>Synapse data warehouse</strong> provides you with a unified and serverless SQL engine. Like your “traditional” data warehouse, you have the full capabilities of your transactional T-SQL features.</li>
<li><strong>Synapse real-time analytics</strong> allows you to stream data from Internet of Things (IoT) devices, telemetry, and logs. You can also use the workload here to analyze semi-structured data using its Kusto Query Language (KQL) capabilities, just like Azure Data Explorer.</li>
<li><strong>Synapse data science</strong> allows you to build, collaborate, train, and deploy fully scalable end-to-end machine learning (ML) and AI models. You can also carry out your ML experiments in your notebooks and log your models using the Fabric Auto Logging feature. A must-mention tool in this experience is the Data Wrangler, a Fabric graphical user interface for data transformation. With this tool, you can clean your data simply by clicking buttons while the tool automatically generates the Python code for you. It is similar to Power Query.</li>
<li><strong>Business Intelligence with Power BI</strong> helps you quickly turn your business data into insightful analytic reports and dashboards.</li>
<li><strong>Data Activator</strong> allows you to take care of your data observability and monitor workloads in a no-code/low-code way. It tells you when specific data points hit a threshold or match a pattern. You can also automate particular actions and kick off Power Automate flows when specific conditions occur.</li>
<li><strong>Copilot in Fabric</strong> provides you with an Azure OpenAI Service. This means you can build reports, describe how you want to ingest your data, summarize, explore, and transform your data using the natural language capability of Azure OpenAI.</li>
</ul>
<h3 id="heading-workspaces">Workspaces</h3>
<p>Workspaces are similar to Power BI’s workspace. Here, you can share and collaborate with others and create reports, Warehouses, Lakehouses, dashboards, and notebooks.</p>
<h3 id="heading-capacity-unit-cu">Capacity Unit (CU)</h3>
<p>A Capacity Unit (CU) measures the compute power available to your Fabric resources. The more CUs a capacity has, the more work it can perform.</p>
<p>Now we'll look at the various components of Microsoft Fabric's architecture.</p>
<h3 id="heading-onelake">OneLake</h3>
<p>OneLake is the central data repository for Microsoft Fabric that stores the data in Delta Lake format. Think of it as OneDrive for your data. This repository allows you to explore and find data assets in your organization.</p>
<p>One exciting thing is Shortcuts, which allows you to share or point to data in other locations in OneLake without moving or duplicating the data. This removes any case of data redundancy.</p>
<h3 id="heading-lakehouses-vs-warehouses">Lakehouses vs Warehouses</h3>
<p>While both "houses" hold data, some differences exist between Lakehouses and Warehouses in Microsoft Fabric.</p>
<p>For starters, a Lakehouse can store any data type, whether structured or unstructured. It is, however, stored in the <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/get-started/delta-lake-interoperability">Delta format</a> by default. The Delta format is a storage layer that offers ACID (Atomicity, Consistency, Isolation, Durability) transactions. A Warehouse, on the other hand, is more suited for structured data.</p>
<p>Lakehouses also support Notebooks. So you can work with various languages from PySpark to SQL and R. Warehouses, on the other hand, only use SQL. </p>
<p>Keep in mind, though, that Fabric provides you with two types of Warehouses: SQL Endpoint and Synapse Data Warehouse.</p>
<ul>
<li>SQL Endpoint is auto-generated when a Lakehouse is created. This means you get a SQL-based experience and can query Lakehouse data using T-SQL.</li>
<li>Synapse Data Warehouse is more of your traditional SQL engine. So you can use it to create and query data out of OneLake.</li>
</ul>
<h2 id="heading-how-to-get-started-with-microsoft-fabric-an-end-to-end-project-example">How to Get Started With Microsoft Fabric – An End-to-End Project Example</h2>
<p>To get a glimpse of how the Fabric platform works, we will build a little project.</p>
<p>We'll create a Lakehouse to store a mall dataset from Kaggle using the Kaggle API. We will also transform our data using Data Wrangler. Then, we will perform customer segmentation on our data based on the customer's annual income and spending score using the KMeans clustering algorithm. This will allow us to group the customers into various categories like low income earners that don't spend, average income earning customers, and high income customers who do not spend much.</p>
<p>Let's get started.</p>
<h3 id="heading-how-to-enable-fabric">How to Enable Fabric</h3>
<p>The first thing we need to do is to log into Microsoft Power BI. Here, we will activate Microsoft Fabric's capabilities for our workspace. </p>
<p>To do this, follow these steps:</p>
<p>First, navigate to the capacity settings in the <a target="_blank" href="https://app.powerbi.com/home?experience=power-bi">admin portal</a>. The admin portal is where administrators control and manage the various Power BI features.</p>
<p><img src="https://lh7-us.googleusercontent.com/M5O2_Xb5h76ydZyy_VteTWpz2i3Nc_FiQoyZUXA_js69sWZidtAfzKMZ2-mJBgam4GqD0FXfft4fVFkBu_sw1rUCMIypcZHgWh49FgXO5xk-Q0dduYL3_7FGb5wLKrHoBPrL6-GU9nN3bdFrpsQT5wQ" alt="Image" width="344" height="777" loading="lazy">
<em>Admin Portal of Microsoft fabric</em></p>
<p>Then, under the <strong>Tenant setting</strong> tab, look for the <strong>Microsoft fabric</strong> section.</p>
<p>In that section, switch the <strong>Users can create fabric items</strong> toggle on. Once you've done that, select <strong>Apply</strong>.</p>
<p><img src="https://lh7-us.googleusercontent.com/yLMF0s789eNL7RW94Ax0Ssm-i9g1_wyOC7fgyPbql2DjNOgrrFVIMIKBrZMKs5aZA-br3MBgOrHu7g26moAG2kLI8JUE6WdJiRmC0wUK8Ak4h2TbDzt-t54LeOkBCqz2cTzpFrBT7q5MnvdgidTdGvo" alt="Image" width="1169" height="759" loading="lazy"></p>
<p>Now your environment will be set up and the various services should appear at the bottom left of your screen.</p>
<p><img src="https://lh7-us.googleusercontent.com/PKdkrIktTXMGw2O04yYa8-lkAiaUq6dZ_C4OCX3q6y3qlOl2jWr8hblLUwiFoWMDWyUPtF_aPAkfYKhXvaCOTjiU3ZlZAjrU3BJuAYx2QJfdKMRkQWalSVK7aRE0cqXepKM_oRUjvlSmYqCtL7tz1CE" alt="Image" width="442" height="424" loading="lazy">
<em>Now you can see all the services like Power BI, Data Factory, and so on.</em></p>
<h3 id="heading-ia">How to Create a Workspace in Microsoft Fabric</h3>
<p>We'll use a <a target="_blank" href="https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python">mall customer segmentation dataset from Kaggle</a> for this demo. This data, as mentioned in Kaggle, was created for the purpose of learning customer segmentation concepts.</p>
<p>Let's talk a little bit about the dataset. Imagine you have a supermarket mall and each customer has a membership card. You also have a data catalog of each customer with basic information like their customer ID, age, gender, annual income and spending score. </p>
<p>Now we want to segment these customers into various groups so we can improve customer loyalty, understand the customers better, and target our marketing more effectively.</p>
<p>To achieve this, we will use the spending score assigned to each customer to define their purchasing power.</p>
<p>To get started, you'll need to create a new workspace. You can do that by following these steps:</p>
<ol>
<li>Head to your <a target="_blank" href="https://app.powerbi.com/home?experience=power-bi&amp;clientSideAuth=0">Microsoft Fabric home page</a>.</li>
<li>Select <strong>workspaces</strong> and click on <strong>New Workspace</strong>.</li>
<li>Give your workspace a name – I'm calling mine FabricMall.</li>
<li>Click on <strong>Advanced</strong> to view the dropdown options and select <strong>Trial</strong> if you are making use of your Fabric trial.</li>
<li>Click <strong>Apply</strong>.</li>
</ol>
<p><img src="https://lh7-us.googleusercontent.com/KvydyWSwyknsCNEHahc8aNME1z4nxVsLYUlMmAf73ru4O1XoYz5YnrBAHml_uYJPajix6svZ_S5VlJn7Nv4GNvfxXNyHChZXF9ZFjOCDNs-QY0cVlZT3abtkukhjEs2Ik9HFq7NTg47_gHrrbquuppI" alt="Image" width="1600" height="719" loading="lazy">
<em>How to create a workspace in Microsoft fabric</em></p>
<p>The next thing you want to do is to create a Lakehouse for your data.</p>
<h3 id="heading-how-to-create-a-lakehouse-in-microsoft-fabric">How to Create a Lakehouse in Microsoft Fabric</h3>
<p>To create a Lakehouse, first click on <strong>New</strong> within your workspace. This will display a list of various tasks you can do within your workspace.</p>
<p>Then select <strong>More options</strong> and select <strong>Lakehouse</strong>. </p>
<p><img src="https://lh7-us.googleusercontent.com/_zF0EcAg_tSGHvdpZt41huS5OR346NZ7AGTlWioXKIKuT5D5s7h_SIjLH-Yia13tpTGeobE3VsxE5zS4vOoya5S4qdqHRGJJcnAZSNnNn2s_C_F2J2tjIYDoK1BP_omkv3HaEGvSfd6v-XiiBlKv-qQ" alt="Image" width="673" height="861" loading="lazy">
<em>Selecting Lakehouse under "More options"</em></p>
<p>Then give it a name, like <strong>FabricMallLake</strong>, and click on <strong>Open notebook</strong>.</p>
<p>Click on <strong>New notebook</strong> and <strong>Open</strong>. You can rename your notebook at the top left corner of your notebook. The notebook is similar to the Jupyter notebook experience.</p>
<p><img src="https://lh7-us.googleusercontent.com/hquyOMggUOEdoyLE53_a1dJBmvguAZegZ2atVLxiA8p3wpXHgLvZOZA3uj2SzMDnDXxhAV5D0rJE2gwv2yGw1_u2AotOEAgcP0Sqh5YtKiX4WBdENgGc5fb30MEou1RA0ejSSEnyucYvhdqej5UXEXs" alt="Image" width="1600" height="691" loading="lazy">
<em>Notebooks in Fabric</em></p>
<h3 id="heading-how-to-use-kaggle-api-data-in-microsoft-fabric">How to Use Kaggle API Data in Microsoft Fabric</h3>
<p>Notebooks allow us to write, visualize, and execute code. Within the Notebook, we will use Python to perform a customer segmentation on our data in Microsoft Fabric.</p>
<p>First, install the Kaggle package using the command below:</p>
<pre><code class="lang-python">!pip install Kaggle
</code></pre>
<p>Next, you'll need to import the <code>os</code> module, set your credentials, and connect to the Kaggle API.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
os.chdir(<span class="hljs-string">'/lakehouse/default/Files'</span>)
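os.environ[<span class="hljs-string">'KAGGLE_USERNAME'</span>] = <span class="hljs-string">'your_kaggle_username'</span>  <span class="hljs-comment"># placeholder: use your own username</span>
os.environ[<span class="hljs-string">'KAGGLE_KEY'</span>] = <span class="hljs-string">'your_kaggle_api_key'</span>  <span class="hljs-comment"># placeholder: never publish a real key</span>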
<span class="hljs-keyword">from</span> kaggle.api.kaggle_api_extended <span class="hljs-keyword">import</span> KaggleApi
api = KaggleApi()
api.authenticate()
api.dataset_download_file(<span class="hljs-string">'vjchoudhary7/customer-segmentation-tutorial-in-python'</span>, <span class="hljs-string">'Mall_Customers.csv'</span>)
</code></pre>
<p>In the code above, <code>os.chdir('/lakehouse/default/Files')</code> represents our File API path. Also remember to replace the <a target="_blank" href="https://www.kaggle.com/settings">username and API Key</a> with your own.  </p>
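<p>Because an API key pasted into a notebook is easy to leak, it can help to confirm the credentials are set before you authenticate. Here is a minimal sketch of such a check. The helper name <code>missing_kaggle_credentials</code> is hypothetical and not part of the Kaggle package. Only the two environment variable names come from the code above.</p>

```python
import os

# The two variables the Kaggle client reads in the code above
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required Kaggle variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
print(missing_kaggle_credentials({"KAGGLE_USERNAME": "your_kaggle_username"}))
```

<p>If the returned list is not empty, set the missing values before calling <code>api.authenticate()</code>.</p>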
<p>Now import Pandas. This will allow you to read your file.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
df = pd.read_csv(<span class="hljs-string">"/lakehouse/default/"</span> + <span class="hljs-string">"Files/Mall_Customers.csv"</span>)
df.head()
</code></pre>
<p>But before we start segmenting our customers, let's clean up our data using Data Wrangler.</p>
<h3 id="heading-how-to-use-the-data-wrangler-in-microsoft-fabric">How to Use the Data Wrangler in Microsoft Fabric</h3>
<p>One of the most exciting things about this notebook is that you can perform data cleaning tasks without writing code using the Data Wrangler.</p>
<p>To do that, click on <strong>Data</strong> on the ribbon and select <strong>Transform DataFrame in Data Wrangler</strong>. </p>
<p>We will perform the following transformations:</p>
<ul>
<li>We will convert the gender column to lowercase.</li>
<li>We will also rename the columns with special characters like the dollar sign, brackets, and a dash. This is because I noticed Fabric finds it hard to handle these characters at the moment.</li>
</ul>
<p>To do these transformations, follow these steps:</p>
<p>Under the <strong>Operation</strong> tab, select <strong>Convert text to lowercase</strong>.</p>
<p>Pick the column – Gender in this example – and select <strong>Apply</strong>. This will convert your Gender column to lowercase and automatically generate the code.</p>
<p><img src="https://lh7-us.googleusercontent.com/-QkNWJszDVHAMtm282FTLr-_NekndORMvaR45tqhxDIg7rMW7Rr2FfMTEOW2kb_ZlnmNxQ50MfWB4hma-lbMcNr6Du1BmFd-f7ehG-4-sSJbdhf7WmV0CrvCZGnE92w8qddCCyHaaxM6HAE_yvhYgDM" alt="Image" width="1600" height="755" loading="lazy">
<em>Data wrangler: Formatting text</em></p>
<p>Similarly, under the <strong>schema</strong> tab, select rename columns.</p>
<p>Rename <strong>Annual Income (k$)</strong> to <strong>AnnualIncome</strong>, and <strong>Spending Score (1-100)</strong> to <strong>SpendingScore</strong>.</p>
<p>Once you’re done with the transformation, click <strong>Add code to notebook</strong>.</p>
<p><img src="https://lh7-us.googleusercontent.com/vtvL7X_ll8Nh2mpc7bW01cqy-XvMeiy7whyrJtQdbc0QTz3VQ-qYV3-uywa4QVI2DpfvLPXudHy-a4bTFAOt0Fp2d0ac6lUVp7L0zT38m6ImNQrFTtKp8WtFPZaVjEjCNMrtSph7fhAZSw7o_DQvWe0" alt="Image" width="1501" height="807" loading="lazy">
<em>Data wrangler: Rename column</em></p>
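<p>The cells that follow rely on a DataFrame called <code>df_clean</code>, which comes from the code Data Wrangler adds to the notebook. That generated code isn't reproduced here, so the following is only a sketch of roughly equivalent pandas, assuming the two transformations described above. The function name <code>clean_data</code> and the sample rows are made up for illustration.</p>

```python
import pandas as pd


def clean_data(df):
    """Lowercase the Gender column and rename the two columns with special characters."""
    df = df.copy()  # leave the original DataFrame untouched
    df["Gender"] = df["Gender"].str.lower()
    df = df.rename(columns={
        "Annual Income (k$)": "AnnualIncome",
        "Spending Score (1-100)": "SpendingScore",
    })
    return df


# Two made-up rows that only mimic the dataset's column names
sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "Gender": ["Male", "Female"],
    "Age": [30, 42],
    "Annual Income (k$)": [40, 75],
    "Spending Score (1-100)": [55, 20],
})
df_clean = clean_data(sample)
print(list(df_clean.columns))
```

<p>In the real notebook you would pass the <code>df</code> you loaded from the CSV file rather than the sample rows.</p>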
<p>Back in the notebook, we can visualize our data using the code below:</p>
<pre><code class="lang-python">sparkdf = spark.createDataFrame(df_clean)
display(sparkdf)
</code></pre>
<p>Within the chart element created, select <strong>Customize chart</strong>. Pick the columns you want and select <strong>Apply</strong>.</p>
<p><img src="https://lh7-us.googleusercontent.com/WZoVr74bKT59da-YBwDishooHH1rqufkWA_jN-zr2eDK237rrKTXZybjZ-U5iWU7qnPOFyPnHKA0SkjIuC_ADk_X3Uh35sSAFMz254_FVKcc4IQGxBPQwNsP3Z_d-0uPHJxWxqJpoHdoJP_KOjQw6jo" alt="Image" width="1579" height="700" loading="lazy">
<em>Charts in Data Wrangler</em></p>
<p>Once that's done, we can save the data in the Lakehouse using this code below:</p>
<pre><code class="lang-python">sparkdf.write.format(<span class="hljs-string">"delta"</span>).mode(<span class="hljs-string">"overwrite"</span>).saveAsTable(<span class="hljs-string">"malldatadf"</span>)
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/boKGK5-xUaWccqNy76XjSXDd0Fdkrg2JOyqYiDTq51JOog-a_KMWsfLHTskC5iySI8nBuHjiWsDhj1ZVwLG5TxHbRciWTjBJIisKsvQJLsqEq4-UnFVfHBL1ngWMYMdZ5nheYw9pqwmApxaoL8WIMRE" alt="Image" width="600" height="400" loading="lazy">
<em>Saving data in Lakehouse</em></p>
<h3 id="heading-how-to-perform-customer-segmentation-in-microsoft-fabric">How to Perform Customer Segmentation in Microsoft Fabric</h3>
<p>For our customer segmentation, we will use the KMeans clustering algorithm to segment the customers based on their annual income and spending score. </p>
<p>K-means clustering is an unsupervised machine learning algorithm. It groups similar data points in your data based on underlying observations, similarities, and input vectors. </p>
<p>We will do this by importing our libraries, applying our K-means by training the K-Means clustering model, and visualizing the clusters of customers based on their annual income and spending score. </p>
<p>We will also include and show the centroids of each cluster, providing insights into the distribution of customers in the dataset. </p>
<p>The centroids here refer to the center points of the clusters found by our algorithm. Each centroid is calculated as the average of all the data points in its cluster. When we visualize the clusters, the centroids will be represented with a distinct symbol or color.</p>
<p>Run this code to achieve this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn.cluster <span class="hljs-keyword">import</span> KMeans
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MinMaxScaler
X = df_clean[[<span class="hljs-string">'AnnualIncome'</span>, <span class="hljs-string">'SpendingScore'</span>]]
<span class="hljs-comment"># Feature normalization</span>
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=<span class="hljs-number">5</span>, init=<span class="hljs-string">'k-means++'</span>, random_state=<span class="hljs-number">42</span>)
kmeans.fit(X_scaled)
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">8</span>))
<span class="hljs-keyword">for</span> cluster_label <span class="hljs-keyword">in</span> range(<span class="hljs-number">5</span>):  <span class="hljs-comment"># Loop through each cluster label</span>
    cluster_points = X[kmeans.labels_ == cluster_label]
    centroid = cluster_points.mean(axis=<span class="hljs-number">0</span>)  <span class="hljs-comment"># Calculate the centroid as the mean position of the data points</span>
    plt.scatter(cluster_points[<span class="hljs-string">'AnnualIncome'</span>], cluster_points[<span class="hljs-string">'SpendingScore'</span>],
                s=<span class="hljs-number">50</span>, label=<span class="hljs-string">f'Cluster <span class="hljs-subst">{cluster_label + <span class="hljs-number">1</span>}</span>'</span>)  <span class="hljs-comment"># Plot points for the current cluster</span>
    plt.scatter(centroid[<span class="hljs-string">'AnnualIncome'</span>], centroid[<span class="hljs-string">'SpendingScore'</span>], s=<span class="hljs-number">300</span>, c=<span class="hljs-string">'black'</span>, marker=<span class="hljs-string">'*'</span>, label=<span class="hljs-string">f'Centroid <span class="hljs-subst">{cluster_label + <span class="hljs-number">1</span>}</span>'</span>)  <span class="hljs-comment"># Plot the centroid</span>
plt.title(<span class="hljs-string">'Clusters of Customers'</span>)
plt.xlabel(<span class="hljs-string">'Annual Income (k$)'</span>)
plt.ylabel(<span class="hljs-string">'Spending Score (1-100)'</span>)
plt.legend()
plt.show()
</code></pre>
<p>Here's the output:</p>
<p><img src="https://lh7-us.googleusercontent.com/lsIdbv7j_QbsmChgxFgs-X0QQEguqGZS_Hsvrj1kB55hIUsuTt5kGP5denL28jszo_HCjTe9NB-NbYfS2rsXJgw1LnHH6c7Z7E0cJe1vdW5pe3s9o4F2AebF2l6MB3M_XHtEYIzuzGSmFGaPFYbfj4w" alt="Image" width="600" height="400" loading="lazy">
<em>Performing Customer Segmentation in Microsoft Fabric</em></p>
<p>The result of our analysis shows that our customers can be grouped into 5 clusters:</p>
<ul>
<li>Cluster 1 (Purple) are low income earners with a low spending score.</li>
<li>Cluster 2 (Blue) are low income earners with a high spending score.</li>
<li>Cluster 3 (Red) are average income earning customers with significant spending scores.</li>
<li>Cluster 4 (Orange) are high income customers who do not spend much at the mall. They’re probably not satisfied with the services rendered.</li>
<li>Cluster 5 (Green) are high income customers with a high spending score.</li>
</ul>
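<p>To make the centroid idea concrete, here is a small pure-Python sketch of the two steps K-means alternates between: averaging the points in a cluster to get its centroid, and assigning a point to its nearest centroid. The function names and the data points are made up for illustration and are not taken from the mall dataset.</p>

```python
def centroid(points):
    """Return the mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    distances = [
        (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
        for c in centroids
    ]
    return distances.index(min(distances))


# Made-up (AnnualIncome, SpendingScore) pairs for two hypothetical groups
low_spenders = [(20, 10), (30, 20)]
high_spenders = [(80, 90), (90, 80)]
centroids = [centroid(low_spenders), centroid(high_spenders)]
print(centroids)
print(nearest_centroid((70, 75), centroids))
```

<p>A new customer at (70, 75) lands in the second group because its centroid is much closer. scikit-learn's <code>KMeans</code> repeats these two steps until the assignments stop changing.</p>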
<p>We can also save our prediction as a new dataset using this code:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Create a new DataFrame to store the clustering results</span>
cluster_df = pd.DataFrame(data=X, columns=[<span class="hljs-string">'AnnualIncome'</span>, <span class="hljs-string">'SpendingScore'</span>])
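cluster_df[<span class="hljs-string">'Cluster'</span>] = kmeans.labels_  <span class="hljs-comment"># One cluster label per customer, as assigned by the fitted model</span>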
sparkclusterdf = spark.createDataFrame(cluster_df)
sparkclusterdf.write.format(<span class="hljs-string">"delta"</span>).mode(<span class="hljs-string">"overwrite"</span>).saveAsTable(<span class="hljs-string">"clusterdatadf"</span>)
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/vMJYBX_nbjwPdODAlVKfWp-KWvqRD6BW-pPg4XAZ8UVgSMkaI4-tDRQZqlA38Eg5iVpvP-f_cUI9vXL6dxmUYJl-kJ_t46lQfsXytQGGAW1iHSGad8x7KwEqxDBeP2effQ-LME1PX5qE3-7NBUoa9Yg" alt="Image" width="1600" height="647" loading="lazy">
<em>Customer segmentation prediction</em></p>
<p>Want to take a look at the notebook? You can download it from <a target="_blank" href="https://github.com/Bennykillua/Project/tree/main/CustomerSegmentationMicrosoftFabric">my GitHub</a>.</p>
<h3 id="heading-how-to-visualize-lakehouse-data-in-power-bi">How to Visualize Lakehouse Data in Power BI</h3>
<p>Now we can decide to visualize our data on a dashboard within Fabric.</p>
<p>Head back to the FabricMall workspace and select the <strong>semantic model type</strong> of the FabricMallLake Lakehouse.</p>
<p><img src="https://lh7-us.googleusercontent.com/YO0SWvhNJEdz2o3a85rhOf8CHorcX50o_Fu3sqJWdGP-P8kO8t1CD194a7JB9Tx3LxyFjMvjE0ek9CrRBSMKyXGy2vrx0hPQ9BZofrlI9BRw3o4nqDCegmZ1GCyi2pDMk4mfKuCvFycUW6f0kwjYnxQ" alt="Image" width="1301" height="640" loading="lazy">
<em>semantic model type of the FabricMallLake LakeHouse</em></p>
<p>Then select <strong>Manage default semantic model</strong>.</p>
<p><img src="https://lh7-us.googleusercontent.com/j5k-aWOHKXMKrkfygcD7HBIUDONorZcnpbH0j2uNbiL1rLZ8sdhOIscIKnTLZXwFBGEDNp30v3oYi0vPsG-t_SawMcVcp1kd7PSI81iM-ZOm1IGn72KFs5hDPmFbJ_UAF4Cr2wiEphaM93EWgiVfXug" alt="Image" width="1600" height="650" loading="lazy">
<em>Manage default semantic model In Microsoft Fabric</em></p>
<p>Pick your dataset, click <strong>Confirm</strong>, and then select <strong>New Report</strong>. </p>
<p>Let's visualize the average age in our data. To do this, click on the card visual and drag the age into this card. This will automatically create a visual showing the average age in your dataset. </p>
<p><img src="https://lh7-us.googleusercontent.com/eh28PLD0HCw2m2fWIbVhIrL78TLRP0hqF5aSDbEcE6_hzFaZaWA9c_AX5_u_w6yG49ovcvBVWY_Og4nQYqDnUCeIEe73o6LAgyrH0pLv0Gy1eMxxmhrV2KbmIDPuQhgPsimL_Drnxkq6wlE-OrG0CFA" alt="Image" width="1600" height="671" loading="lazy">
<em>Power BI service in Microsoft Fabric</em></p>
<p>Just like in <a target="_blank" href="https://www.freecodecamp.org/news/teach-yourself-data-analytics-in-30-days/#:~:text=Enterprise%20strength%20tools%20like%20Tableau%20Splunk%2C%20or%20Microsoft's%20Power%20BI&amp;text=You%20can%20download%20Jupiter%20to%20your%20PC%20or%20a%20private%20server%20and%20access%20the">Power BI Desktop</a>, you can create your measure, build your report, and publish your dashboard. You can learn more about how to create visuals in Power BI using this free <a target="_blank" href="https://www.youtube.com/watch?v=PSNXoAs2FtQ">freeCodeCamp YouTube data analysis video</a>.</p>
<p>Alternatively, you can open Power BI Desktop and connect to your Lakehouses from the OneLake data hub.</p>
<p><img src="https://lh7-us.googleusercontent.com/Na-xm9ThvGM6rkljbdDHD_ZUzekJ88mzCRQSoKOW7bCNfgmB_dkusJjoOrBfyIam-Smnvm_2p08G-25MVx_IsJpvUxnCYZab4NlKCCystqkn7kdPN56QLxvJ0ikCLmca4w4Y828dk8lUE2tqakpDWr4" alt="Image" width="600" height="400" loading="lazy">
<em>Connect to your Lakehouse in Power BI</em></p>
<h2 id="heading-where-can-i-learn-more-about-microsoft-fabric">Where Can I Learn More about Microsoft Fabric?</h2>
<p>Though Microsoft Fabric is a pretty new data platform, I hope you can see how it can simplify the way you and your team consume, analyze, and get insights from your data.</p>
<p>To learn more, you can start with the <a target="_blank" href="https://www.microsoft.com/en-us/microsoft-fabric/getting-started">official Fabric documentation</a> or any helpful YouTube tutorial like <a target="_blank" href="https://www.youtube.com/playlist?list=PLUeJI2NOafNvaNor3qUHw1gyFuz_K1Rtt">Francis’s Fabric course</a>. I would also advise you to browse freeCodeCamp's Fabric tag if you want a compilation of resources.</p>
<p>Lastly, if you’re new to data analysis, start your journey today with <a target="_blank" href="https://www.youtube.com/watch?v=PSNXoAs2FtQ">freeCodeCamp’s Data Analyst Bootcamp for Beginners on YouTube</a>. It covers everything from SQL, Tableau, Power BI, and Python to Excel, Pandas, and building real-life projects.</p>
<p>If you enjoyed reading this article and/or have any questions and want to connect, you can find me on <a target="_blank" href="https://www.linkedin.com/in/ifeanyi-iheagwara/">LinkedIn</a>, <a target="_blank" href="https://twitter.com/Bennykillua">Twitter</a> and do check out my articles on <a target="_blank" href="https://www.freecodecamp.org/news/author/benny/">freeCodeCamp</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Essential SQL Concepts for Data Analysts – Explained with Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ By Joel Hereth In the vast and ever-growing realm of data analytics, Structured Query Language (SQL) serves as a fundamental building block.  While SQL's roots lie in database management, it has expanded its reach, becoming the go-to tool for data ex... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/sql-concepts-for-data-analysts/</link>
                <guid isPermaLink="false">66d45f6336c45a88f96b7cdf</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 27 Feb 2024 00:45:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/02/Group-2.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Joel Hereth</p>
<p>In the vast and ever-growing realm of data analytics, <a target="_blank" href="https://www.freecodecamp.org/news/what-is-sql-database-definition-for-beginners/">Structured Query Language</a> (SQL) serves as a fundamental building block. </p>
<p>While SQL's roots lie in database management, it has expanded its reach, becoming the go-to tool for data extraction, manipulation, and analysis. </p>
<p>Whether you're just starting your journey as a <a target="_blank" href="https://www.freecodecamp.org/news/data-analytics-roadmap/">data analyst</a> or looking to bolster your proficiency in its tools, understanding essential SQL concepts is non-negotiable. </p>
<p>This guide will take you through the critical aspects of SQL that are important for your success in the data analytics field. </p>
<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ol>
<li><a class="post-section-overview" href="#heading-the-role-of-sql-in-data-analytics">The Role of SQL in Data Analytics</a></li>
<li><a class="post-section-overview" href="#heading-key-sql-concepts-to-learn">Key SQL Concepts to Learn</a><br>– <a class="post-section-overview" href="#heading-basic-commands">Basic Commands</a><br>– <a class="post-section-overview" href="#heading-the-case-statement">The <code>CASE</code> Statement</a><br><a class="post-section-overview" href="#heading-subqueries-and-common-table-expressions-ctes">– Subqueries and Common Table Expressions (CTEs)</a><br>– <a class="post-section-overview" href="#heading-joins-and-unions">Joins and Unions</a><br>– <a class="post-section-overview" href="#heading-string-and-date-formatting">String and Date Formatting</a><br>– <a class="post-section-overview" href="#heading-window-functions">Window Functions</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ol>
<h2 id="heading-the-role-of-sql-in-data-analytics">The Role of SQL in Data Analytics</h2>
<p>Before getting into the nitty-gritty, it's important to understand the pivotal role of SQL in data analytics. </p>
<p>SQL is the lingua franca of the database world, serving as a translator between human and machine. This makes it a must-learn for anyone diving into the data domain. </p>
<p>To appreciate SQL's significance, you need only to look at the tasks it allows you to perform. From transforming raw data into insightful reports to creating data-driven applications and executing complex data operations, SQL is the powerhouse that enables analysts and professionals to extract hidden gems from the vast seas of databases. </p>
<h2 id="heading-key-sql-concepts-to-learn">Key SQL Concepts to Learn</h2>
<h3 id="heading-basic-commands">Basic Commands</h3>
<p><a target="_blank" href="https://www.freecodecamp.org/news/sql-select-statement-and-query-examples/">SQL commands</a> can be categorized into entities that manage the structure of the database schema (DDL - Data Definition Language), control the content of the database tables (DML - Data Manipulation Language), and access and work on the data within the database (DQL - Data Query Language). You'll want to start here to lay a solid foundation. </p>
<h4 id="heading-dml-commands">DML Commands:</h4>
<ul>
<li><code>SELECT</code>: retrieves data from one or more tables.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> product_name, price
<span class="hljs-keyword">FROM</span> products
<span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">category</span> = <span class="hljs-string">'Electronics'</span>;
</code></pre>
<ul>
<li><code>INSERT</code>: inserts new rows into a table.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> customers (<span class="hljs-keyword">name</span>, email)
<span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'John Doe'</span>, <span class="hljs-string">'john@example.com'</span>);
</code></pre>
<ul>
<li><code>UPDATE</code>: modifies existing data within a table.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">UPDATE</span> inventory
<span class="hljs-keyword">SET</span> quantity = <span class="hljs-number">50</span>
<span class="hljs-keyword">WHERE</span> product_id = <span class="hljs-number">101</span>;
</code></pre>
<ul>
<li><code>DELETE</code>: removes existing rows from a table.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> orders
<span class="hljs-keyword">WHERE</span> order_id = <span class="hljs-number">12345</span>;
</code></pre>
<h4 id="heading-ddl-commands">DDL Commands:</h4>
<ul>
<li><code>CREATE TABLE</code>: creates a new table within the database.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> employees (
    employee_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>),
    department <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>),
    salary <span class="hljs-built_in">DECIMAL</span>(<span class="hljs-number">10</span>, <span class="hljs-number">2</span>)
);
</code></pre>
<ul>
<li><code>ALTER TABLE</code>: modifies an existing table within the database.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> employees
<span class="hljs-keyword">ADD</span> hire_date <span class="hljs-built_in">DATE</span>;
</code></pre>
<ul>
<li><code>DROP TABLE</code>: removes an entire table from the database.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> customers;
</code></pre>
<h4 id="heading-dql-commands">DQL Commands:</h4>
<ul>
<li><code>SELECT</code>: also part of DML but often associated with DQL as it is used to query data only.</li>
</ul>
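<p>If you want to try these commands without setting up a database server, one option is Python's built-in <code>sqlite3</code> module. The sketch below runs one command from each category against an in-memory database. The <code>products</code> rows are made up, and SQLite's types differ slightly from the examples above.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define the table structure
cur.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT, price REAL)"
)

# DML: add, change, and remove rows
cur.executemany(
    "INSERT INTO products (product_id, product_name, category, price) VALUES (?, ?, ?, ?)",
    [
        (1, "Laptop", "Electronics", 900.0),
        (2, "Desk", "Furniture", 150.0),
        (3, "Phone", "Electronics", 500.0),
    ],
)
cur.execute("UPDATE products SET price = 450.0 WHERE product_id = 3")
cur.execute("DELETE FROM products WHERE product_id = 2")

# DQL: read the data back
rows = cur.execute(
    "SELECT product_name, price FROM products "
    "WHERE category = 'Electronics' ORDER BY product_id"
).fetchall()
print(rows)
conn.close()
```

<p>The final <code>SELECT</code> returns only the two Electronics rows, with the updated price for the phone.</p>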
<h3 id="heading-the-case-statement">The <code>CASE</code> Statement</h3>
<p>The <code>CASE</code> statement takes scalars, predicates, function calls, and even SQL queries as input and returns an expression value. It’s an extremely versatile tool that can be used to transform data, perform if-then-else logic, categorize information, and more.</p>
<h4 id="heading-basic-syntax-of-case">Basic Syntax of <code>CASE</code>:</h4>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> column_name,
  <span class="hljs-keyword">CASE</span>
    <span class="hljs-keyword">WHEN</span> condition1 <span class="hljs-keyword">THEN</span> result1
    <span class="hljs-keyword">WHEN</span> condition2 <span class="hljs-keyword">THEN</span> result2
    <span class="hljs-keyword">ELSE</span> result3
  <span class="hljs-keyword">END</span>
<span class="hljs-keyword">FROM</span> table_name;
</code></pre>
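<p>As a quick illustration of this syntax, the sketch below uses Python's built-in <code>sqlite3</code> module to sort products into price tiers. The table contents, the thresholds, and the <code>price_tier</code> alias are all made up for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE products (product_name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Cable", 9.0), ("Headphones", 80.0), ("Laptop", 900.0)],
)

# Conditions are checked top to bottom; the first one that matches wins
query = """
SELECT product_name,
  CASE
    WHEN price >= 500 THEN 'premium'
    WHEN price >= 50 THEN 'standard'
    ELSE 'budget'
  END AS price_tier
FROM products
ORDER BY price;
"""
tiers = conn.execute(query).fetchall()
print(tiers)
conn.close()
```

<p>Because the conditions are evaluated in order, the laptop is labeled premium even though it also satisfies the second condition.</p>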
<p>Understanding how and when to use <code>CASE</code> statements is a critical SQL skill to master as a <a target="_blank" href="https://bigtechinterviews.com/33-must-know-data-analyst-sql-interview-questions-and-answers/">data analyst</a> dealing with complex datasets. To showcase the <code>CASE</code> statement in practice, we have the <code>actions</code> table with the <code>user_id</code>, <code>action</code>, and <code>date</code> fields. </p>
<pre><code>CREATE TABLE actions (
  <span class="hljs-string">"user_id"</span> INTEGER,
  <span class="hljs-string">"action"</span> VARCHAR(<span class="hljs-number">50</span>),
  <span class="hljs-string">"date"</span> DATE
);

INSERT INTO actions (
     <span class="hljs-string">"user_id"</span>,
      <span class="hljs-string">"action"</span>,
      <span class="hljs-string">"date"</span>
)
VALUES
    (<span class="hljs-number">1</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-3</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">'edit'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-2</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-1</span>),
    (<span class="hljs-number">4</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-1</span>),
    (<span class="hljs-number">5</span>, <span class="hljs-string">'edit'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-5</span>),
    (<span class="hljs-number">6</span>, <span class="hljs-string">'cancel'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-2</span>),
    (<span class="hljs-number">7</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-2</span>),
    (<span class="hljs-number">8</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-1</span>),
    (<span class="hljs-number">9</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-1</span>),
    (<span class="hljs-number">10</span>, <span class="hljs-string">'cancel'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-3</span>),
    (<span class="hljs-number">11</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-2</span>),
    (<span class="hljs-number">12</span>, <span class="hljs-string">'post'</span>, <span class="hljs-attr">current_timestamp</span>::DATE<span class="hljs-number">-2</span>);
</code></pre><p>Your manager is about to go into a meeting with the event director and asks you to write a query that shows the all-time post rate, rounded to two decimal places. Given the structure of the <code>actions</code> table, we'll need a <code>CASE</code> statement. </p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> <span class="hljs-keyword">round</span>(<span class="hljs-number">1.0</span>*
<span class="hljs-keyword">sum</span>(<span class="hljs-keyword">case</span> <span class="hljs-keyword">when</span> <span class="hljs-keyword">action</span>=<span class="hljs-string">'post'</span> <span class="hljs-keyword">then</span> <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0</span> <span class="hljs-keyword">end</span>)
/
<span class="hljs-keyword">count</span>(<span class="hljs-number">1</span>)
,<span class="hljs-number">2</span>) post_rate
<span class="hljs-keyword">from</span> actions;
</code></pre>
<p>First, the <code>CASE</code> expression assigns 1 to every row whose <code>action</code> is <code>'post'</code> and 0 to every other row. <code>SUM()</code> then adds these values up, which gives the number of posts. We divide that sum by the total number of records, <code>COUNT(1)</code>, which counts every row and not just posts. </p>
<p>This computation yields our <code>post rate</code>. Multiplying the numerator by 1.0 forces decimal division; without it, dividing one integer by another would truncate the result to 0. Finally, we round the result to two decimal places.</p>
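<p>To check the arithmetic, here is a minimal sketch that runs the same query through Python's built-in <code>sqlite3</code> module. The choice of SQLite is an assumption made for convenience (the schema above uses PostgreSQL syntax), and the <code>date</code> column is left out because the rate doesn't depend on it.</p>

```python
import sqlite3

# In-memory SQLite database (an assumption; the article's schema targets PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (user_id INTEGER, action TEXT)")

# Same user_id/action pairs as the INSERT statement above, in order.
actions = ["post", "edit", "post", "post", "edit", "cancel",
           "post", "post", "post", "cancel", "post", "post"]
conn.executemany(
    "INSERT INTO actions (user_id, action) VALUES (?, ?)",
    list(enumerate(actions, start=1)),
)

posts, total, post_rate = conn.execute(
    """
    SELECT
        SUM(CASE WHEN action = 'post' THEN 1 ELSE 0 END) AS posts,
        COUNT(1) AS total,
        ROUND(1.0 * SUM(CASE WHEN action = 'post' THEN 1 ELSE 0 END) / COUNT(1), 2) AS post_rate
    FROM actions
    """
).fetchone()
conn.close()

print(posts, total, post_rate)  # 8 12 0.67
```

<p>Eight of the twelve rows are posts, so the rate is 8 / 12, which rounds to 0.67.</p>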
<h3 id="heading-subqueries-and-common-table-expressions-ctes">Subqueries and Common Table Expressions (CTEs)</h3>
<p>Subqueries, or inner queries, allow you to use queries within another SQL statement. Common Table Expressions (CTEs) are named temporary result sets that you can reference within a <code>SELECT</code>, <code>INSERT</code>, <code>UPDATE</code>, or <code>DELETE</code> statement.</p>
<h4 id="heading-subqueries">Subqueries:</h4>
<ul>
<li>Scalar Subquery: a subquery that returns a single value.</li>
<li>Column Subquery: a subquery that returns a single column containing one or more rows.</li>
<li>Table Subquery: a subquery that looks like a table (used with any operator expecting a table).</li>
</ul>
<h4 id="heading-ctes">CTEs:</h4>
<ul>
<li>Provide a more readable and maintainable alternative to a derived table or subquery.</li>
<li>Can reference themselves, which is useful for recursive queries.</li>
</ul>
<p>To demonstrate the use cases, we're going to practice with both the traditional <code>subquery</code> and <code>CTE</code> using the following SQL schema: </p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> all_numbers (
  <span class="hljs-string">"phone_number"</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">25</span>)
  );


<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> confirmed_numbers (
  <span class="hljs-string">"phone_number"</span> <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">25</span>)
  );


<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> all_numbers
(<span class="hljs-string">"phone_number"</span>)
<span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'706-766-8523'</span>),
(<span class="hljs-string">'555-239-6874'</span>),
(<span class="hljs-string">'407-234-5041'</span>),
(<span class="hljs-string">'(123)351-6123'</span>),
(<span class="hljs-string">'251-874-3478'</span>);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> confirmed_numbers
 (<span class="hljs-string">"phone_number"</span>)
 <span class="hljs-keyword">VALUES</span>
(<span class="hljs-string">'555-239-6874'</span>),
(<span class="hljs-string">'407-234-5041'</span>),
(<span class="hljs-string">'(123)351-6123'</span>);
</code></pre>
<p>For example, let's say you're a <a target="_blank" href="https://bigtechinterviews.com/33-must-know-data-analyst-sql-interview-questions-and-answers/">data analyst</a> at DoorDash and you've been asked to retrieve all the phone numbers that are in the <code>all_numbers</code> table but are not present in the <code>confirmed_numbers</code> table. You can solve this by using a traditional <code>subquery</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> phone_number
<span class="hljs-keyword">FROM</span> all_numbers
<span class="hljs-keyword">WHERE</span> phone_number <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">IN</span> (
  <span class="hljs-keyword">SELECT</span> phone_number
  <span class="hljs-keyword">FROM</span> confirmed_numbers
);
</code></pre>
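<p>Alternatively, you can express the same logic with a <code>CTE</code>. A CTE is not inherently faster than a subquery (how it performs depends on the database engine and its query planner), but naming the intermediate result makes long queries easier to read and maintain. </p>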
<pre><code class="lang-sql"><span class="hljs-keyword">WITH</span> excluded_numbers <span class="hljs-keyword">AS</span> (
  <span class="hljs-keyword">SELECT</span> phone_number
  <span class="hljs-keyword">FROM</span> confirmed_numbers
)

<span class="hljs-keyword">SELECT</span> phone_number
<span class="hljs-keyword">FROM</span> all_numbers
<span class="hljs-keyword">WHERE</span> phone_number <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">IN</span> (
  <span class="hljs-keyword">SELECT</span> phone_number
  <span class="hljs-keyword">FROM</span> excluded_numbers
);
</code></pre>
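<p>The subquery and CTE versions should return the same rows. The sketch below checks this with Python's built-in <code>sqlite3</code> module (SQLite is an assumption chosen so the example is self-contained; an <code>ORDER BY</code> is added so the output is deterministic). It also shows a caveat that is not part of the original example: if the excluded list ever contains a <code>NULL</code>, <code>NOT IN</code> matches nothing.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE all_numbers (phone_number TEXT);
    CREATE TABLE confirmed_numbers (phone_number TEXT);
    INSERT INTO all_numbers VALUES
        ('706-766-8523'), ('555-239-6874'), ('407-234-5041'),
        ('(123)351-6123'), ('251-874-3478');
    INSERT INTO confirmed_numbers VALUES
        ('555-239-6874'), ('407-234-5041'), ('(123)351-6123');
    """
)

subquery_sql = """
    SELECT phone_number FROM all_numbers
    WHERE phone_number NOT IN (SELECT phone_number FROM confirmed_numbers)
    ORDER BY phone_number
"""
cte_sql = """
    WITH excluded_numbers AS (SELECT phone_number FROM confirmed_numbers)
    SELECT phone_number FROM all_numbers
    WHERE phone_number NOT IN (SELECT phone_number FROM excluded_numbers)
    ORDER BY phone_number
"""
subquery_rows = [r[0] for r in conn.execute(subquery_sql)]
cte_rows = [r[0] for r in conn.execute(cte_sql)]

# Caveat: a NULL in the excluded list makes NOT IN unknown for every candidate row.
conn.execute("INSERT INTO confirmed_numbers VALUES (NULL)")
rows_with_null = [r[0] for r in conn.execute(subquery_sql)]
conn.close()

print(subquery_rows)              # ['251-874-3478', '706-766-8523']
print(cte_rows == subquery_rows)  # True
print(rows_with_null)             # []
```

<p>If the column can contain <code>NULL</code> values, filtering them out inside the subquery or using <code>NOT EXISTS</code> avoids the empty result.</p>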
<h3 id="heading-joins-and-unions">Joins and Unions</h3>
<p>Joins help you combine data from multiple tables based on a related column between them, while unions allow you to combine the result sets of two or more <code>SELECT</code> statements. Both are critical for harnessing the full power of your SQL queries.</p>
<p><img src="https://lh7-us.googleusercontent.com/bDaLcZkcJJoDGwH7o85fn1nYNO7ZvjPrHlkn6ShA4lNRhpWW3Zdp2QpW8vn-LNbn5ZlblFMW7N8OVN5am2PTXi3pLReyV3-pXpXvghF_m2iJVw2Wu4-WBXE-em_kNlxFBrpgXvLwWEHC_EWgRAvgtac" alt="Image" width="1600" height="1287" loading="lazy">
<em>Table illustrating the different types of SQL joins (Left, Full, Right, and Inner)</em></p>
<h4 id="heading-types-of-joins">Types of Joins:</h4>
<ul>
<li><code>INNER JOIN</code>: returns rows when there is a match in both tables.</li>
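<li><code>LEFT JOIN</code>: returns all rows from the left table and the matched rows from the right table. Where there is no match, the right table's columns are <code>NULL</code>.</li>
<li><code>RIGHT JOIN</code>: returns all rows from the right table and the matched rows from the left table. Where there is no match, the left table's columns are <code>NULL</code>.</li>
<li><code>FULL JOIN</code>: returns all rows from both tables, matched where possible and filled with <code>NULL</code> on whichever side has no match.</li>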
</ul>
<p>To illustrate the various <code>JOIN</code> types in SQL, consider a scenario where we want to compile the relationship between sales figures and their corresponding sales representatives across different regions. </p>
<p>For this purpose, we have two tables: <code>sales_data</code> and <code>representatives</code>. They are linked by the <code>rep_id</code> field, which serves as a foreign key in the <code>sales_data</code> table and a primary key in the <code>representatives</code> table. Here's what that looks like:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> sales_data (
    sale_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    rep_id <span class="hljs-built_in">INT</span>,
    region <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>),
    sales <span class="hljs-built_in">DECIMAL</span>(<span class="hljs-number">10</span>, <span class="hljs-number">2</span>)
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> sales_data (sale_id, rep_id, region, sales) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-number">101</span>, <span class="hljs-string">'East'</span>, <span class="hljs-number">1000.00</span>),
(<span class="hljs-number">2</span>, <span class="hljs-number">102</span>, <span class="hljs-string">'East'</span>, <span class="hljs-number">1500.50</span>),
(<span class="hljs-number">3</span>, <span class="hljs-number">103</span>, <span class="hljs-string">'West'</span>, <span class="hljs-number">2000.00</span>),
(<span class="hljs-number">4</span>, <span class="hljs-number">104</span>, <span class="hljs-string">'West'</span>, <span class="hljs-number">2500.75</span>),
(<span class="hljs-number">5</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-string">'West'</span>, <span class="hljs-number">3000.00</span>);  

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> representatives (
    rep_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    sales_rep <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">100</span>),
    region <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>)
);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> representatives (rep_id, sales_rep, region) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">101</span>, <span class="hljs-string">'John Doe'</span>, <span class="hljs-string">'East'</span>),
(<span class="hljs-number">102</span>, <span class="hljs-string">'Jane Smith'</span>, <span class="hljs-string">'East'</span>),
(<span class="hljs-number">105</span>, <span class="hljs-string">'Jim Beam'</span>, <span class="hljs-string">'North'</span>),
(<span class="hljs-number">106</span>, <span class="hljs-string">'Jill Jackson'</span>, <span class="hljs-string">'North'</span>),
(<span class="hljs-number">107</span>, <span class="hljs-string">'Jack Johnson'</span>, <span class="hljs-string">'South'</span>);
</code></pre>
<p>For our example, suppose we want to match sales to representatives in the East region. We would use an <code>INNER JOIN</code> to fetch only the rows with matching <code>rep_id</code> in both tables:</p>
<pre><code>SELECT s.sales, r.sales_rep
FROM sales_data s
INNER JOIN representatives r
ON s.rep_id = r.rep_id
WHERE s.region = <span class="hljs-string">'East'</span>;
</code></pre><p>If we want to see all sales data in the West region, including sales without a corresponding sales representative, a <code>LEFT JOIN</code> comes in handy:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> s.sales, r.sales_rep
<span class="hljs-keyword">FROM</span> sales_data s
<span class="hljs-keyword">LEFT</span> <span class="hljs-keyword">JOIN</span> representatives r
<span class="hljs-keyword">ON</span> s.rep_id = r.rep_id
<span class="hljs-keyword">WHERE</span> s.region = <span class="hljs-string">'West'</span>;
</code></pre>
<p>If our interest instead is in all representatives in the North region, even those without associated sales data, we would use a <code>RIGHT JOIN</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> s.sales, r.sales_rep
<span class="hljs-keyword">FROM</span> sales_data s
<span class="hljs-keyword">RIGHT</span> <span class="hljs-keyword">JOIN</span> representatives r
<span class="hljs-keyword">ON</span> s.rep_id = r.rep_id
<span class="hljs-keyword">WHERE</span> r.region = <span class="hljs-string">'North'</span>;
</code></pre>
<p>Lastly, to see every sale and every representative across all regions, paired up where the <code>rep_id</code> matches and padded with <code>NULL</code> where it doesn't, we use a <code>FULL JOIN</code>:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> s.sales, r.sales_rep
<span class="hljs-keyword">FROM</span> sales_data s
<span class="hljs-keyword">FULL</span> <span class="hljs-keyword">JOIN</span> representatives r
<span class="hljs-keyword">ON</span> s.rep_id = r.rep_id;
</code></pre>
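<p>To see what these joins return on the sample data, here is a minimal sketch using Python's built-in <code>sqlite3</code> module. SQLite is an assumption made for convenience, and because older SQLite versions lack <code>RIGHT JOIN</code> and <code>FULL JOIN</code>, the North-region query is written as the equivalent <code>LEFT JOIN</code> with the tables swapped. The <code>run_join</code> helper is hypothetical and exists only for this illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_data (sale_id INT PRIMARY KEY, rep_id INT, region TEXT, sales REAL);
    INSERT INTO sales_data VALUES
        (1, 101, 'East', 1000.00), (2, 102, 'East', 1500.50),
        (3, 103, 'West', 2000.00), (4, 104, 'West', 2500.75),
        (5, NULL, 'West', 3000.00);
    CREATE TABLE representatives (rep_id INT PRIMARY KEY, sales_rep TEXT, region TEXT);
    INSERT INTO representatives VALUES
        (101, 'John Doe', 'East'), (102, 'Jane Smith', 'East'),
        (105, 'Jim Beam', 'North'), (106, 'Jill Jackson', 'North'),
        (107, 'Jack Johnson', 'South');
    """
)

def run_join(join_type, region):
    # join_type only ever receives the fixed strings used below, never user input.
    sql = f"""
        SELECT s.sales, r.sales_rep
        FROM sales_data s
        {join_type} JOIN representatives r ON s.rep_id = r.rep_id
        WHERE s.region = ?
        ORDER BY s.sale_id
    """
    return conn.execute(sql, (region,)).fetchall()

inner_east = run_join("INNER", "East")
left_west = run_join("LEFT", "West")

# Same result as the RIGHT JOIN example: every North representative, sales or not.
north_reps = conn.execute(
    """
    SELECT s.sales, r.sales_rep
    FROM representatives r
    LEFT JOIN sales_data s ON s.rep_id = r.rep_id
    WHERE r.region = 'North'
    ORDER BY r.rep_id
    """
).fetchall()
conn.close()

print(inner_east)  # [(1000.0, 'John Doe'), (1500.5, 'Jane Smith')]
print(left_west)   # [(2000.0, None), (2500.75, None), (3000.0, None)]
print(north_reps)  # [(None, 'Jim Beam'), (None, 'Jill Jackson')]
```

<p>The <code>None</code> values are how Python shows the <code>NULL</code>s that an outer join fills in for the side without a match.</p>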
<h4 id="heading-union-and-union-all">Union and Union All:</h4>
<ul>
<li><code>UNION</code>: returns the distinct rows that appear in either of the two result sets.</li>
<li><code>UNION ALL</code>: returns all the rows including duplicates.</li>
</ul>
<p>Continuing with the same SQL schema above containing the <code>sales_data</code> table and <code>representatives</code> table, let's review scenarios where we'd want to use a <code>UNION</code> and a <code>UNION ALL</code>.</p>
<p>The <code>sales_data</code> table stores only <code>rep_id</code> values, not names, and both sides of a <code>UNION</code> must return columns of compatible types. So using a <code>UNION</code>, let's construct a SQL query that retrieves the distinct IDs of all sales representatives that appear in either the <code>sales_data</code> or the <code>representatives</code> table.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> rep_id <span class="hljs-keyword">AS</span> representative_id <span class="hljs-keyword">FROM</span> representatives
<span class="hljs-keyword">UNION</span>
<span class="hljs-keyword">SELECT</span> rep_id <span class="hljs-keyword">AS</span> representative_id <span class="hljs-keyword">FROM</span> sales_data;
</code></pre>
<p>Now, let's use a <code>UNION ALL</code> operation to retrieve the representative IDs from both the <code>sales_data</code> and <code>representatives</code> tables, this time keeping duplicates. An ID that appears in both tables is returned once per occurrence.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> rep_id <span class="hljs-keyword">AS</span> representative_id <span class="hljs-keyword">FROM</span> representatives
<span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>
<span class="hljs-keyword">SELECT</span> rep_id <span class="hljs-keyword">AS</span> representative_id <span class="hljs-keyword">FROM</span> sales_data;
</code></pre>
<p><img src="https://lh7-us.googleusercontent.com/XVSysMGd6SLoUcwd71sSfcdARXElf1GMjr-QTwdP4n4wjjQXYMT5VNUe3rhkW6a4elW9KcMuj6qrKnFV0J6SD4f-qUara6wOvnx0By4qrRjYtLZ2M1WFZdFpE8OV1lDxKJL654jEHdVh72NxTedhm2o" alt="Image" width="1600" height="1037" loading="lazy">
<em>Table illustrating the different types of SQL UNIONs (UNION vs UNION ALL)</em></p>
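<p>The difference between the two operators is easiest to see by counting rows. The sketch below is an illustration under stated assumptions: it uses Python's built-in <code>sqlite3</code> module, keeps only the <code>rep_id</code> column of each table, and combines the IDs from both tables, since <code>rep_id</code> is the one column the two tables share with the same type.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Only the rep_id column of each sample table is reproduced here.
    CREATE TABLE sales_data (rep_id INT);
    CREATE TABLE representatives (rep_id INT);
    INSERT INTO sales_data VALUES (101), (102), (103), (104), (NULL);
    INSERT INTO representatives VALUES (101), (102), (105), (106), (107);
    """
)

union_ids = [r[0] for r in conn.execute(
    "SELECT rep_id FROM representatives UNION SELECT rep_id FROM sales_data"
)]
union_all_ids = [r[0] for r in conn.execute(
    "SELECT rep_id FROM representatives UNION ALL SELECT rep_id FROM sales_data"
)]
conn.close()

print(len(union_ids), len(union_all_ids))  # 8 10
```

<p><code>UNION</code> collapses the repeated 101 and 102 into one row each (the single <code>NULL</code> counts as one more distinct value), while <code>UNION ALL</code> simply stacks all ten rows.</p>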
<h3 id="heading-string-and-date-formatting">String and Date Formatting</h3>
<p>The manipulation of string and date values is common in data analysis. Understanding how to format these types properly is crucial for meaningful analysis.</p>
<h4 id="heading-string-functions">String Functions:</h4>
<ul>
<li><code>CONCAT</code>: merges two or more strings into one.</li>
<li><code>SUBSTRING</code>: returns a part of a string.</li>
<li><code>LENGTH</code> or <code>LEN</code>: returns the length of a string.</li>
</ul>
<h4 id="heading-date-functions">Date Functions:</h4>
<ul>
<li><code>DATEADD</code>: adds an interval to a date.</li>
<li><code>DATEDIFF</code>: returns the difference between two dates in a chosen unit, such as days.</li>
<li><code>DATENAME</code> or <code>TO_CHAR</code>: returns part of a date, like the day, month, or year, as text.</li>
</ul>
<p>To demonstrate the usage of string and date functions in SQL, let's delve into a scenario involving orders and deliveries. </p>
<p>We have two tables: <code>orders</code> and <code>deliveries</code>. Here's a breakdown of each table and its columns:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders (
    order_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    customer_id <span class="hljs-built_in">INT</span>,
    order_date <span class="hljs-built_in">DATE</span>,
    total_amount <span class="hljs-built_in">DECIMAL</span>(<span class="hljs-number">10</span>, <span class="hljs-number">2</span>)
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> orders (order_id, customer_id, order_date, total_amount) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-number">201</span>, <span class="hljs-string">'2024-02-20'</span>, <span class="hljs-number">500.00</span>),
(<span class="hljs-number">2</span>, <span class="hljs-number">202</span>, <span class="hljs-string">'2024-02-21'</span>, <span class="hljs-number">750.25</span>),
(<span class="hljs-number">3</span>, <span class="hljs-number">203</span>, <span class="hljs-string">'2024-02-21'</span>, <span class="hljs-number">1000.00</span>),
(<span class="hljs-number">4</span>, <span class="hljs-number">204</span>, <span class="hljs-string">'2024-02-22'</span>, <span class="hljs-number">1200.75</span>),
(<span class="hljs-number">5</span>, <span class="hljs-number">205</span>, <span class="hljs-string">'2024-02-22'</span>, <span class="hljs-number">1500.00</span>);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> deliveries (
    delivery_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    order_id <span class="hljs-built_in">INT</span>,
    delivery_date <span class="hljs-built_in">DATE</span>,
    delivery_status <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>)
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> deliveries (delivery_id, order_id, delivery_date, delivery_status) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-string">'2024-02-21'</span>, <span class="hljs-string">'Delivered'</span>),
(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-string">'2024-02-22'</span>, <span class="hljs-string">'In transit'</span>),
(<span class="hljs-number">3</span>, <span class="hljs-number">3</span>, <span class="hljs-string">'2024-02-22'</span>, <span class="hljs-string">'Delivered'</span>),
(<span class="hljs-number">4</span>, <span class="hljs-number">4</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-string">'Pending'</span>),
(<span class="hljs-number">5</span>, <span class="hljs-number">5</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-string">'Pending'</span>);
</code></pre>
<p>Say you've been tasked with optimizing order tracking systems. To streamline this process, you need to create unique order identifiers by merging <code>customer IDs</code> and <code>order IDs</code>. Leveraging the <code>CONCAT</code> function in SQL, you merge these identifiers, ensuring efficient order management and analysis.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">CONCAT</span>(customer_id, <span class="hljs-string">'-'</span>, order_id) <span class="hljs-keyword">AS</span> order_identifier
<span class="hljs-keyword">FROM</span> orders;
</code></pre>
<p>Your next task is to categorize delivery statuses accurately, which is essential for operational efficiency. But delivery status messages often contain irrelevant details. </p>
<p>To simplify this process, you use the <code>SUBSTRING</code> function in SQL to extract the initial characters of the delivery status. This enables swift categorization and analysis of delivery progress.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">SUBSTRING</span>(delivery_status, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>) <span class="hljs-keyword">AS</span> status_summary
<span class="hljs-keyword">FROM</span> deliveries;
</code></pre>
<p>Now imagine you need to ensure the consistency of delivery status messages. It's crucial to validate that delivery status updates adhere to defined length constraints. </p>
<p>By employing the <code>LENGTH/LEN</code> function in SQL, you calculate the length of each delivery status message. This facilitates robust validation mechanisms, ensuring uniformity and integrity in your data.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> delivery_id, <span class="hljs-keyword">LENGTH</span>(delivery_status) <span class="hljs-keyword">AS</span> status_length
<span class="hljs-keyword">FROM</span> deliveries;
</code></pre>
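<p>Here is a small sketch of the three string operations run together on the sample rows, using Python's built-in <code>sqlite3</code> module. SQLite is an assumption, and its spelling differs slightly from the queries above: concatenation uses the <code>||</code> operator in place of <code>CONCAT</code>, and <code>SUBSTR</code> stands in for <code>SUBSTRING</code>. Only the columns needed for the example are recreated.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT);
    INSERT INTO orders VALUES (1, 201), (2, 202), (3, 203), (4, 204), (5, 205);
    CREATE TABLE deliveries (delivery_id INT PRIMARY KEY, order_id INT, delivery_status TEXT);
    INSERT INTO deliveries VALUES
        (1, 1, 'Delivered'), (2, 2, 'In transit'), (3, 3, 'Delivered'),
        (4, 4, 'Pending'), (5, 5, 'Pending');
    """
)

string_rows = conn.execute(
    """
    SELECT
        o.customer_id || '-' || o.order_id AS order_identifier,  -- CONCAT equivalent
        SUBSTR(d.delivery_status, 1, 3)    AS status_summary,    -- first three characters
        LENGTH(d.delivery_status)          AS status_length
    FROM orders o
    INNER JOIN deliveries d ON o.order_id = d.order_id
    ORDER BY o.order_id
    """
).fetchall()
conn.close()

for row in string_rows:
    print(row)
# ('201-1', 'Del', 9)
# ('202-2', 'In ', 10)
# ('203-3', 'Del', 9)
# ('204-4', 'Pen', 7)
# ('205-5', 'Pen', 7)
```

<p>Note that the three-character summary of <code>'In transit'</code> includes the trailing space, which is worth keeping in mind if you group by the shortened value.</p>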
<h4 id="heading-date-functions-1">Date Functions</h4>
<p>When querying the <code>orders</code> and <code>deliveries</code> tables in the SQL schema provided, the <code>DATEADD</code> function is particularly useful in scenarios where you need to calculate future dates or deadlines based on existing ones. </p>
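<p>For example, you might use <code>DATEADD</code> to find the expected delivery date by adding a certain number of days to <code>order_date</code> to ensure delivery within a predefined time frame. Note that <code>DATEADD</code> is available in SQL Server and some other engines, but not everywhere; in PostgreSQL you would write <code>order_date + INTERVAL '3 days'</code> instead. </p>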
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> order_id, customer_id, <span class="hljs-keyword">DATEADD</span>(<span class="hljs-keyword">day</span>, <span class="hljs-number">3</span>, order_date) <span class="hljs-keyword">AS</span> expected_delivery_date
<span class="hljs-keyword">FROM</span> orders;
</code></pre>
<p>The <code>DATEDIFF</code> function can also be useful in calculating differences between dates. For instance, if you need to find the average time it takes for an order to be delivered, you could subtract the <code>order_date</code> from the <code>delivery_date</code> and then calculate the average using <code>AVG</code>.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">AVG</span>(<span class="hljs-keyword">DATEDIFF</span>(<span class="hljs-keyword">day</span>,order_date,delivery_date)) <span class="hljs-keyword">AS</span> average_delivery_time
<span class="hljs-keyword">FROM</span> orders o <span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span> deliveries d <span class="hljs-keyword">ON</span> o.order_id = d.order_id
<span class="hljs-keyword">WHERE</span> delivery_status = <span class="hljs-string">'Delivered'</span>;
</code></pre>
<p>The <code>TO_CHAR</code> function converts dates to text in a specific format. For instance, if you need to display the delivery date as <code>Month DD, YYYY</code> instead of the default format, you could use <code>TO_CHAR</code> in your query. Because <code>order_id</code> exists in both tables, the column references are qualified with the table aliases to avoid an ambiguous-column error.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> o.order_id, o.customer_id, TO_CHAR(d.delivery_date,<span class="hljs-string">'Month DD, YYYY'</span>) <span class="hljs-keyword">AS</span> formatted_delivery_date
<span class="hljs-keyword">FROM</span> orders o <span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span> deliveries d <span class="hljs-keyword">ON</span> o.order_id = d.order_id;
</code></pre>
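<p>Date functions vary more between databases than almost any other part of SQL, so it helps to see the same three ideas in another dialect. The sketch below is an assumption-laden illustration using Python's built-in <code>sqlite3</code> module: SQLite has no <code>DATEADD</code>, <code>DATEDIFF</code>, or <code>TO_CHAR</code>, so it uses <code>DATE()</code> with a modifier, a difference of <code>JULIANDAY()</code> values, and <code>STRFTIME()</code> in their place. The <code>DD/MM/YYYY</code> output format is chosen only for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INT PRIMARY KEY, order_date TEXT);
    INSERT INTO orders VALUES
        (1, '2024-02-20'), (2, '2024-02-21'), (3, '2024-02-21'),
        (4, '2024-02-22'), (5, '2024-02-22');
    CREATE TABLE deliveries (
        delivery_id INT PRIMARY KEY, order_id INT,
        delivery_date TEXT, delivery_status TEXT
    );
    INSERT INTO deliveries VALUES
        (1, 1, '2024-02-21', 'Delivered'), (2, 2, '2024-02-22', 'In transit'),
        (3, 3, '2024-02-22', 'Delivered'), (4, 4, NULL, 'Pending'),
        (5, 5, NULL, 'Pending');
    """
)

# DATEADD equivalent: add three days to each order date.
expected_dates = conn.execute(
    "SELECT order_id, DATE(order_date, '+3 days') FROM orders ORDER BY order_id"
).fetchall()

# DATEDIFF equivalent: average number of days between order and delivery.
avg_days = conn.execute(
    """
    SELECT AVG(JULIANDAY(d.delivery_date) - JULIANDAY(o.order_date))
    FROM orders o
    INNER JOIN deliveries d ON o.order_id = d.order_id
    WHERE d.delivery_status = 'Delivered'
    """
).fetchone()[0]

# TO_CHAR equivalent: format the delivery date of order 1 as DD/MM/YYYY.
formatted = conn.execute(
    "SELECT STRFTIME('%d/%m/%Y', delivery_date) FROM deliveries WHERE order_id = 1"
).fetchone()[0]
conn.close()

print(expected_dates[0])  # (1, '2024-02-23')
print(avg_days)           # 1.0
print(formatted)          # 21/02/2024
```

<p>Both delivered orders arrived one day after they were placed, so the average delivery time is one day.</p>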
<h3 id="heading-window-functions">Window Functions</h3>
<p><a target="_blank" href="https://www.freecodecamp.org/news/window-functions-in-sql/">Window functions</a> are a powerful feature that allow you to perform calculations across a set of table rows related to the current row, known as the window, without the need for a self-join. This includes the capability to perform running totals, moving averages, and more.</p>
<h4 id="heading-common-window-functions">Common Window Functions:</h4>
<ul>
<li><code>ROW_NUMBER()</code>: assigns a unique, sequential number to each row within its window partition.</li>
<li><code>RANK()</code>: assigns a rank to each row, giving tied rows the same rank and leaving gaps in the sequence after ties.</li>
<li><code>DENSE_RANK()</code>: similar to <code>RANK()</code>, but the ranks are consecutive, with no gaps after ties.</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> product_data (
    product_id <span class="hljs-built_in">INT</span> PRIMARY <span class="hljs-keyword">KEY</span>,
    total_inventory <span class="hljs-built_in">INT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
    total_sales <span class="hljs-built_in">INT</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>,
    region <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">50</span>) <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>
);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> product_data (product_id, total_inventory, total_sales, region) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-number">100</span>, <span class="hljs-number">500</span>, <span class="hljs-string">'North America'</span>),
(<span class="hljs-number">2</span>, <span class="hljs-number">150</span>, <span class="hljs-number">750</span>, <span class="hljs-string">'Europe'</span>),
(<span class="hljs-number">3</span>, <span class="hljs-number">200</span>, <span class="hljs-number">1000</span>, <span class="hljs-string">'Asia'</span>),
(<span class="hljs-number">4</span>, <span class="hljs-number">120</span>, <span class="hljs-number">1200</span>, <span class="hljs-string">'North America'</span>),
(<span class="hljs-number">5</span>, <span class="hljs-number">180</span>, <span class="hljs-number">1500</span>, <span class="hljs-string">'Europe'</span>);
</code></pre>
<p>For example, your sales director Slacks you and asks you to calculate a running total of sales alongside each product's inventory. You can do this using <code>SUM()</code> as a window function:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    product_id,
    total_inventory,
    <span class="hljs-keyword">SUM</span>(total_sales) <span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_id) <span class="hljs-keyword">AS</span> running_total_sales
<span class="hljs-keyword">FROM</span> product_data;
</code></pre>
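<p>To see the running total build up row by row, here is a minimal sketch using Python's built-in <code>sqlite3</code> module. It assumes the bundled SQLite is version 3.25 or later, which is when window functions were added, and it leaves out the <code>region</code> column because this query doesn't use it.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_data (
        product_id INT PRIMARY KEY,
        total_inventory INT NOT NULL,
        total_sales INT NOT NULL
    );
    INSERT INTO product_data VALUES
        (1, 100, 500), (2, 150, 750), (3, 200, 1000), (4, 120, 1200), (5, 180, 1500);
    """
)

running_totals = conn.execute(
    """
    SELECT
        product_id,
        total_sales,
        SUM(total_sales) OVER (ORDER BY product_id) AS running_total_sales
    FROM product_data
    ORDER BY product_id
    """
).fetchall()
conn.close()

for row in running_totals:
    print(row)
# (1, 500, 500)
# (2, 750, 1250)
# (3, 1000, 2250)
# (4, 1200, 3450)
# (5, 1500, 4950)
```

<p>Each row's running total is its own <code>total_sales</code> plus the totals of all rows that come before it in <code>product_id</code> order.</p>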
<p>Now, let's dig deeper into the problem. Say the dataset is too large for Excel and you want to number the products within each <code>region</code>. You can do this by applying <code>ROW_NUMBER()</code> with <code>PARTITION BY region</code>. </p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    region,
    product_id,
    ROW_NUMBER() <span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> region <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> product_id) <span class="hljs-keyword">AS</span> region_product_rank
<span class="hljs-keyword">FROM</span> product_data;
</code></pre>
<p>Alternatively, you could swap <code>ROW_NUMBER()</code> for <code>DENSE_RANK()</code> or <code>RANK()</code> depending on the use case. </p>
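<p>The three functions only differ when there are ties, and the sample data has none. The sketch below therefore uses a small hypothetical <code>scores</code> table (not part of the schema above) with two tied rows, run through Python's built-in <code>sqlite3</code> module, assuming SQLite 3.25 or later for window function support.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table used only to produce a tie.
    CREATE TABLE scores (name TEXT, score INT);
    INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80);
    """
)

ranking = conn.execute(
    """
    SELECT
        name,
        ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,   -- name breaks the tie
        RANK()       OVER (ORDER BY score DESC)       AS rnk,
        DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY name
    """
).fetchall()
conn.close()

for row in ranking:
    print(row)
# ('a', 1, 1, 1)
# ('b', 2, 1, 1)
# ('c', 3, 3, 2)
```

<p>The tied rows share rank 1 under both ranking functions. <code>RANK()</code> then skips to 3 for the next row, while <code>DENSE_RANK()</code> continues with 2, and <code>ROW_NUMBER()</code> never repeats a value.</p>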
<h2 id="heading-conclusion">Conclusion</h2>
<p>As a data analyst, your proficiency in SQL will evolve as you handle more complex data scenarios and questions. </p>
<p>These essential SQL concepts serve as a good starting point – but continuous learning and applying these concepts in practical scenarios are what will truly solidify your understanding and expertise. </p>
<p>Keep exploring new features, tools, and resources such as <a target="_blank" href="https://www.freecodecamp.org/news/tag/sql/">freeCodeCamp</a> or <a target="_blank" href="https://www.freecodecamp.org/news/p/9c7695e4-dcd8-4fa9-a653-7a719f738f13/bigtechinterviews.com">Big Tech Interviews</a>, and you'll find SQL to be an ever-rewarding, ever-deepening skill to have in your data toolkit.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Pandas for Data Cleaning and Preprocessing ]]>
                </title>
                <description>
                    <![CDATA[ Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-cleaning-and-preprocessing-with-pandasbdvhj/</link>
                <guid isPermaLink="false">66d4608c733861e3a22a734d</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oluwadamisi Samuel ]]>
                </dc:creator>
                <pubDate>Tue, 30 Jan 2024 14:55:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/Cream-Neutral-Minimalist-New-Business-Pitch-Deck-Presentation--1-.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."</p>
<p>This statement rings true. Data cleaning and preprocessing encompass a series of steps that ensure the data used for data science, machine learning, and analysis projects is complete, accurate, unbiased, and reliable.</p>
<p>The quality of your dataset plays a pivotal role in the success of your analysis or model. As the saying goes, “garbage in, garbage out”: the quality and reliability of your model and analysis depend heavily on the quality of your data.</p>
<p>Raw data collected from various sources is often messy and contains errors, inconsistencies, missing values, and outliers. Data cleaning and preprocessing aim to identify and rectify these issues to ensure accurate, reliable, and meaningful results during model building and data analysis, since wrong conclusions can be costly.</p>
<p>This is where Pandas comes into play. It is a widely used tool in the data world for both data cleaning and preprocessing. In this article, we'll delve into the essential concepts of data cleaning and preprocessing using the powerful Python library, Pandas.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-introduction">Introduction</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-data-cleaning">What is Data Cleaning?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-data-processing">What is Data Processing?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-import-the-necessary-libraries">How to Import the Necessary Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-the-dataset">How to Load the Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-missing-values">How to Handle Missing Values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-remove-duplicate-records">How to Remove Duplicate Records</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-types-and-conversion">Data Types and Conversion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-encode-categorical-variables">How to Encode Categorical Variables</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-outliers">How to Handle Outliers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A basic understanding of Python.</p>
</li>
<li><p>Basic understanding of data cleaning.</p>
</li>
</ul>
<h2 id="heading-introduction">Introduction</h2>
<p>Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use functions needed to work with structured data seamlessly.</p>
<p>Pandas also integrates well with other popular Python libraries, such as NumPy for numerical computing and Matplotlib for data visualization. This makes it a powerful asset for data-driven tasks.</p>
<p>Pandas excels in handling missing data, reshaping datasets, merging and joining multiple datasets, and performing complex operations on data, making it exceptionally useful for data cleaning and manipulation.</p>
<p>At its core, Pandas introduces two key data structures: <code>Series</code> and <code>DataFrame</code>. A <code>Series</code> is a one-dimensional array-like object that can hold any data type, while a <code>DataFrame</code> is a two-dimensional table with labeled axes (rows and columns). These structures allow users to manipulate, clean, and analyze datasets efficiently.</p>
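<p>To make the two structures concrete, here is a minimal sketch. The car brands and prices are made-up values used only for illustration:</p>

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical prices)
prices = pd.Series([250, 310, 180], name="price")

# A DataFrame: a table whose columns are Series sharing one index
cars = pd.DataFrame({
    "brand": ["Toyota", "Ford", "Kia"],
    "price": [250, 310, 180],
})

print(prices.shape)          # one dimension
print(cars.shape)            # rows and columns
print(type(cars["price"]))   # selecting one column gives a Series
```

<p>Selecting a single column from a <code>DataFrame</code> returns a <code>Series</code>, which is why most column-level cleaning steps later in this article work on a <code>Series</code>.</p>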
<h2 id="heading-what-is-data-cleaning">What is Data Cleaning?</h2>
<p>Before we embark on our data adventure with Pandas, let's take a moment to explain the term "data cleaning." Think of it as a digital detox for your dataset, where we tidy up and prioritize accuracy above all else.</p>
<p>Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.</p>
<p>Why bother with data cleaning? Well, imagine trying to analyze sales trends when some entries are missing, or working with a dataset that has duplicate records throwing off your calculations. Not ideal, right?</p>
<p>In this digital detox, we use tools like Pandas to get rid of inconsistencies, straighten out errors, and let the true clarity of your data shine through.</p>
<h2 id="heading-what-is-data-processing">What is Data Preprocessing?</h2>
<p>You may be wondering, "Do data cleaning and data preprocessing mean the same thing?" The answer is no – they do not.</p>
<p>Picture this: you stumble upon an ancient treasure chest buried in the digital sands of your dataset. Data cleaning is like carefully unearthing that chest, dusting off the cobwebs, and ensuring that what's inside is authentic and reliable.</p>
<p>As for data preprocessing, you can think of it as taking that discovered treasure and preparing its contents for public display. It goes beyond cleaning; it's about transforming and optimizing the data for specific analyses or tasks.</p>
<p>Data cleaning is the initial phase of refining your dataset. It makes the data readable and usable with techniques like removing duplicates, handling missing values, and converting data types. Data preprocessing takes this refined data further with more advanced techniques, such as feature engineering, encoding categorical variables, and handling outliers, to get better results.</p>
<p>The goal is to turn your dataset into a refined masterpiece, ready for analysis or modeling.</p>
<h2 id="heading-how-to-import-the-necessary-libraries">How to Import the Necessary Libraries</h2>
<p>Before we embark on data cleaning and preprocessing, let's import the <code>Pandas</code> library.</p>
<p>To save time and typing, we often import Pandas as <code>pd</code>. This lets us use the shorter <code>pd.read_csv()</code> instead of <code>pandas.read_csv()</code> for reading CSV files, making our code more concise and readable.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
</code></pre>
<h2 id="heading-how-to-load-the-dataset">How to Load the Dataset</h2>
<p>Start by loading your dataset into a Pandas DataFrame.</p>
<p>In this example, we'll use a hypothetical dataset named <strong>your_dataset.csv</strong>. We will load the dataset into a variable called <code>df</code>.</p>
<pre><code class="lang-py"><span class="hljs-comment">#Replace 'your_dataset.csv' with the actual dataset name or file path</span>
df = pd.read_csv(<span class="hljs-string">'your_dataset.csv'</span>)
</code></pre>
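<p>If you don't have a file on disk yet, you can still experiment. The sketch below builds a tiny, made-up CSV in memory and loads it the same way. The column names and values are assumptions for illustration only:</p>

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """name,age,city
Ada,36,London
Grace,,New York
Ada,36,London
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df)
```

<p>The empty <code>age</code> field is read as a missing value (<code>NaN</code>), and the first and last rows are identical, so this small table is handy for trying out the cleaning steps that follow.</p>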
<h2 id="heading-exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</h2>
<p>EDA helps you understand the structure and characteristics of your dataset. Pandas provides several methods for this, which you call on the DataFrame variable itself.</p>
<p>For example:</p>
<ul>
<li><p><code>df.head()</code> returns the first 5 rows of the dataset. You can pass a different number of rows in the parentheses.</p>
</li>
<li><p><code>df.describe()</code> gives summary statistics such as the count, mean, standard deviation, and percentiles of the numerical columns of the Series or DataFrame.</p>
</li>
<li><p><code>df.info()</code> gives the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.</p>
</li>
</ul>
<p>Here's a code example below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Display the first few rows of the dataset</span>
print(df.head())

<span class="hljs-comment">#Summary statistics</span>
print(df.describe())

<span class="hljs-comment">#Information about the dataset</span>
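df.info()  <span class="hljs-comment">#info() prints its report directly and returns None, so print() is not needed</span>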
</code></pre>
<h2 id="heading-how-to-handle-missing-values">How to Handle Missing Values</h2>
<p>For newcomers to this field, missing values can be a significant source of stress, as they come in different formats and can adversely impact your analysis or model.</p>
<p>Many machine learning models cannot be trained on data that has missing or "NaN" values, and missing values can also skew your results during analysis. But do not fret: Pandas provides methods to handle this problem.</p>
<p>One way to do this is by removing the rows with missing values altogether. Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Check for missing values</span>
print(df.isnull().sum())

<span class="hljs-comment">#Drop rows with missing values and place the result in a new variable "df_cleaned"</span>
df_cleaned = df.dropna()

<span class="hljs-comment">#Fill missing values with the mean for numerical data and place the result in a new variable called df_filled</span>
df_filled = df.fillna(df.mean(numeric_only=<span class="hljs-literal">True</span>))
</code></pre>
<p>But if a large share of rows have missing values, dropping them discards too much data and this method becomes inadequate.</p>
<p>For numerical data, you can compute the mean of the column and impute it into the rows that have missing values. Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Replace missing values in numerical columns with the mean of each column</span>
df.fillna(df.mean(numeric_only=<span class="hljs-literal">True</span>), inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment">#If you want to replace missing values in a specific column, you can do it this way:</span>
<span class="hljs-comment">#Replace 'column_name' with the actual column name</span>
<span class="hljs-comment">#Assign the result back: fillna(inplace=True) on a selected column may not update df in recent Pandas versions</span>
df[<span class="hljs-string">'column_name'</span>] = df[<span class="hljs-string">'column_name'</span>].fillna(df[<span class="hljs-string">'column_name'</span>].mean())

<span class="hljs-comment">#Now the numerical columns in df contain no missing values; NaNs have been replaced with the column mean</span>
</code></pre>
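<p>The two approaches above can be compared side by side on a small table. This is only a sketch: the data is made up, and filling a text column with its most frequent value (the mode) is one common choice that this article does not prescribe.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one text column, each with a gap
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Lagos", "Lagos", None],
})

# Count missing values per column
missing_counts = df.isnull().sum()

# Option 1: drop every row that has at least one missing value
df_dropped = df.dropna()

# Option 2: fill numeric gaps with the column mean, text gaps with the mode
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])

print(missing_counts)
print(len(df_dropped), "row(s) left after dropna")
print(df_filled)
```

<p>Here, dropping rows throws away two of the three rows, while filling keeps all of them. Which trade-off is acceptable depends on how much data you have and how the values went missing.</p>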
<h2 id="heading-how-to-remove-duplicate-records">How to Remove Duplicate Records</h2>
<p>Duplicate records can distort your analysis by giving some observations more weight than they deserve, so the results no longer reflect the real trends and underlying patterns.</p>
<p>Pandas makes it easy to identify duplicate rows and to remove them, returning the result as a new DataFrame that you can store in a new variable.</p>
<p>Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Identify duplicates</span>
print(df.duplicated().sum())

<span class="hljs-comment">#Remove duplicates</span>
df_no_duplicates = df.drop_duplicates()
</code></pre>
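<p>By default, a row counts as a duplicate only when every column matches an earlier row. The sketch below, using a made-up orders table, also shows the <code>subset</code> and <code>keep</code> parameters of <code>drop_duplicates()</code>, which let you decide which columns define a duplicate and which copy survives:</p>

```python
import pandas as pd

# Hypothetical orders table; the last row repeats the first exactly
df = pd.DataFrame({
    "order_id": [1, 2, 3, 1],
    "customer": ["Ann", "Ben", "Ann", "Ann"],
})

# duplicated() marks each repeat of an earlier row as True
duplicate_count = df.duplicated().sum()
print(duplicate_count)

# Drop fully identical rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()

# Treat rows as duplicates when only 'customer' matches, keeping the last one
df_one_per_customer = df.drop_duplicates(subset=["customer"], keep="last")

print(df_no_duplicates)
print(df_one_per_customer)
```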
<h2 id="heading-data-types-and-conversion">Data Types and Conversion</h2>
<p>Data type conversion in Pandas is a crucial aspect of data preprocessing, allowing you to ensure that your data is in the appropriate format for analysis or modeling.</p>
<p>Data from various sources is usually messy, and some values may be stored with the wrong data type. For example, numerical values may arrive as 'float' or 'string' when you expect 'integer'. Mixing up these types leads to errors and wrong results.</p>
<p>You can convert a column of type <code>int</code> to <code>float</code> with the following code:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Convert 'Column1' to float</span>
df[<span class="hljs-string">'Column1'</span>] = df[<span class="hljs-string">'Column1'</span>].astype(float)

<span class="hljs-comment">#Display updated data types</span>
print(df.dtypes)
</code></pre>
<p>You can use <code>df.dtypes</code> to print column data types.</p>
<h2 id="heading-how-to-encode-categorical-variables">How to Encode Categorical Variables</h2>
<p>Categorical (non-numerical) values in your dataset can be just as important as numerical ones for building a good machine learning model.</p>
<p>These could be car brand names in a cars dataset for predicting car prices. But most machine learning algorithms cannot process this data type directly, so it must be converted to numerical data before it can be used.</p>
<p>Pandas provides the <code>get_dummies</code> function, which converts categorical values into a binary numerical format: each category gets its own column holding a 1/0 (or True/False) flag. The algorithm treats these values as indicators of category membership rather than as quantities with an order. In other words, a 1 in a brand's column is not interpreted as "greater than" a 0. It just marks which brand the row belongs to. Code snippet is shown below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#To convert categorical data from the column "Car_Brand" to numerical data</span>
df_encode = pd.get_dummies(df, columns=[<span class="hljs-string">'Car_Brand'</span>])

<span class="hljs-comment">#Each brand now has its own binary indicator column</span>
</code></pre>
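<p>To see what the encoded table looks like, here is a small sketch with made-up brands and prices. Passing <code>dtype=int</code> is an optional choice that produces 0/1 columns instead of True/False:</p>

```python
import pandas as pd

# Hypothetical cars data with one categorical column
df = pd.DataFrame({
    "Car_Brand": ["Toyota", "Ford", "Toyota"],
    "Price": [250, 310, 180],
})

# One new column per brand; dtype=int gives 0/1 instead of True/False
df_encode = pd.get_dummies(df, columns=["Car_Brand"], dtype=int)

print(df_encode.columns.tolist())
print(df_encode)
```

<p>The original <code>Car_Brand</code> column is replaced by one indicator column per brand, named with the original column name as a prefix.</p>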
<h2 id="heading-how-to-handle-outliers">How to Handle Outliers</h2>
<p>Outliers are data points that differ significantly from the majority of the data. They can distort statistical measures and adversely affect the performance of machine learning models.</p>
<p>They may be caused by human error or poorly handled missing values, or they could be accurate values that simply do not follow the pattern of the rest of the data.</p>
<p>There are several ways to identify and remove outliers:</p>
<ul>
<li><p>Remove NaN values.</p>
</li>
<li><p>Visualize the data before and after removal.</p>
</li>
<li><p>Z-score method (for normally distributed data).</p>
</li>
<li><p>IQR (interquartile range) method, which is more robust for data that is not normally distributed.</p>
</li>
</ul>
<p>The IQR is useful for identifying outliers in a dataset. According to the IQR method, values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.</p>
<p>This rule rests on the observation that, for normally distributed data, nearly all values (about 99.3%) fall within this range, so points outside it are unusual.</p>
<p>Here's a code snippet for the IQR method:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Using the quartiles and the IQR, compute the outlier bounds and keep only the rows that fall inside them</span>
Q1 = df[<span class="hljs-string">"column_name"</span>].quantile(<span class="hljs-number">0.25</span>)
Q3 = df[<span class="hljs-string">"column_name"</span>].quantile(<span class="hljs-number">0.75</span>)
IQR = Q3 - Q1
lower_bound = Q1 - <span class="hljs-number">1.5</span> * IQR
upper_bound = Q3 + <span class="hljs-number">1.5</span> * IQR
df = df[df[<span class="hljs-string">"column_name"</span>].between(lower_bound, upper_bound)]
</code></pre>
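<p>The Z-score method from the list above can be sketched in a similar way. The data here is made up, and the cutoff of 3 standard deviations is a common convention rather than something this article specifies:</p>

```python
import pandas as pd

# Hypothetical measurements: twenty typical values and one extreme value
df = pd.DataFrame({"value": [10, 11, 9, 10, 12] * 4 + [100]})

# Z-score: how many standard deviations each value lies from the mean
mean = df["value"].mean()
std = df["value"].std()
df["z_score"] = (df["value"] - mean) / std

# Keep rows whose z-score lies within 3 standard deviations of the mean
df_no_outliers = df[df["z_score"].between(-3, 3)]

print(len(df), "rows before,", len(df_no_outliers), "rows after")
```

<p>Because the mean and standard deviation are themselves pulled by extreme values, the Z-score method works best on roughly normal data with enough rows. On small or skewed datasets, the IQR method is usually the safer choice.</p>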
<h2 id="heading-conclusion">Conclusion</h2>
<p>Data cleaning and preprocessing are integral components of any data analysis, data science, or machine learning project. Pandas, with its versatile functions, facilitates these processes efficiently.</p>
<p>By following the concepts outlined in this article, you can ensure that your data is well-prepared for analysis and modeling, ultimately leading to more accurate and reliable results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn Data Analysis and Visualization with Python Using Astronomical Data ]]>
                </title>
                <description>
                    <![CDATA[ Are you fascinated by both Python and the night sky?   We just posted a course on the freeCodeCamp.org YouTube channel that will teach you how to use Python to analyze and visualize astronomical Data. This course was created by Spartificial, whose mi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-data-analysis-and-visualization-with-python-using-astrongomical-data/</link>
                <guid isPermaLink="false">66b20421a2135cc2539a21ad</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Fri, 19 Jan 2024 19:48:58 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/dataanalysis.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Are you fascinated by both Python and the night sky?  </p>
<p>We just posted a course on the freeCodeCamp.org YouTube channel that will teach you how to use Python to analyze and visualize astronomical data. This course was created by Spartificial, whose mission is to provide engineering education in the most engaging way.</p>
<p>This course is a journey through the universe of data analysis and visualization, tailored specifically for astronomical data. The course covers everything from the basics of Python programming to advanced image processing techniques.</p>
<p>Here is what you will learn in each module of the course.</p>
<h4 id="heading-module-1-starting-with-python">Module 1: Starting with Python</h4>
<p>Begin your adventure with Python, starting from the very basics. You'll get acquainted with Python programming using Google Colab. Learn about variables, data types, control flow, f-strings, user inputs, and functions. This module lays a strong foundation for handling astronomical data efficiently.</p>
<h4 id="heading-module-2-tabular-data-visualization">Module 2: Tabular Data Visualization</h4>
<p>Dive into the world of tabular data with tools like Pandas, Matplotlib, and Seaborn. This module teaches you to import libraries, analyze star color data, detect outliers, and create compelling visualizations like line plots and Hertzsprung-Russell diagrams. It's all about making sense of complex astronomical datasets.</p>
<h4 id="heading-module-3-image-data-visualization">Module 3: Image Data Visualization</h4>
<p>Explore the fascinating realm of astronomical image data. Learn about FITS files and use Python to bring galaxies like M31 to life on your screen. You'll delve into image processing techniques such as MinMax and ZScaleInterval scaling, enhancing your ability to interpret celestial images.</p>
<h4 id="heading-module-4-image-processing-apply-filters-and-extracting-features">Module 4: Image Processing | Apply Filters and Extracting Features</h4>
<p>This module takes you deeper into the world of image processing. Learn about convolution operations, Gaussian kernels, and feature enhancement. You'll discover techniques for identifying and extracting features from astronomical images, skills that are crucial for research and analysis.</p>
<p>This course offers hands-on learning. It emphasizes a practical approach, filled with examples and real-world datasets. You will get step-by-step guidance, which ensures a solid understanding of each concept regardless of your previous experience level.</p>
<p>Whether you're an astronomy enthusiast, a seasoned researcher, or a curious programmer, this course offers an opportunity to enhance your skills and dive into the world of astronomical data analysis.</p>
<p>Watch the full course <a target="_blank" href="https://youtu.be/H9KefzbryEw">on the freeCodeCamp.org YouTube channel</a> (7-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/H9KefzbryEw" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Prepare for Data Analyst Job Interviews ]]>
                </title>
                <description>
                    <![CDATA[ By Jess Wilk In today’s digital world, every business and organization collects and uses data to build better products, target the right customers, improve efficiency, and even forecast future demand.  They say that data is the new oil – and now is t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/prepare-for-data-analyst-job-interview/</link>
                <guid isPermaLink="false">66d45f6b51f567b42d9f8465</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Job Interview ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 12 Jan 2024 17:31:48 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/08/pexels-alex-green-5699475--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jess Wilk</p>
<p>In today’s digital world, every business and organization collects and uses data to build better products, target the right customers, improve efficiency, and even forecast future demand. </p>
<p>They say that data is the new oil – and now is the perfect time to enter the data analytics job market. </p>
<p>According to <a target="_blank" href="https://www.payscale.com/research/US/Job=Junior_Data_Analyst/Salary">PayScale</a>, the average salary for entry-level roles in analytics is around $55,492 per year. The average salary for a <a target="_blank" href="https://www.payscale.com/research/US/Job=Senior_Data_Analyst/Salary">skilled analyst</a> is about $88,928 per year. Even if you are a beginner to programming in Python, you can learn the essential skills for data analysis quickly if you are consistent.</p>
<p>In this article, I’ll go over what Data Analytics skills you'll need to know, and how to prepare for and ace interviews to land a Data Analyst position with Python.</p>
<h2 id="heading-what-does-a-data-analyst-do">What Does a Data Analyst Do?</h2>
<p>As a data analyst, your primary responsibility is transforming raw data into meaningful insights. </p>
<p>Usually, the job description involves cleaning and organizing data to make sure that the quality of data is good. You'll also perform statistical analysis, interpret trends in complex datasets, build models, and create visualizations to communicate findings effectively. This information will help teams make business decisions and get valuable insights for the company's managers and key stakeholders.</p>
<p>Market research analysts collect and evaluate consumer and competitor data. A business analyst for Walmart could analyze purchase trends and identify seasonal patterns during events like Black Friday, Christmas, and New Year. This data could help the company anticipate higher demand and restock in time.</p>
<p>A data analyst at IKEA might analyze customer preferences in different rural and urban regions to better strategize which products to sell. </p>
<p>Data plays a role in every stage of a company, from market sizing and customer acquisition to advertising, customer journey, final conversion rate, and data-driven decisions. </p>
<p>Since I started working in data science, I have always felt like a little detective uncovering patterns and hidden knowledge. Are you now excited to learn how to become a data analyst? Let’s start with actionable insights.</p>
<h2 id="heading-essential-technical-skills-to-develop">Essential Technical Skills to Develop</h2>
<p>The first step while preparing for any role is identifying and learning the right skills. Here are the essential and in-demand skills you should learn to become a data analyst:</p>
<h3 id="heading-python-programming">Python Programming</h3>
<p>One of the most crucial skills for a data analyst is proficiency in the Python programming language. Python is widely used in organizations to perform various tasks such as handling datasets, cleaning and manipulating them, and carrying out statistical analysis. </p>
<p>The popularity of Python stems from its flexibility, its user-friendliness, and its large ecosystem of open-source packages and libraries. I am confident that Python will continue to be an indispensable tool for data analysts in 2024.</p>
<p>If you’re new to Python, you can check out the <a target="_blank" href="https://hyperskill.org/tracks/6">Introduction to Python</a> course on Hyperskill with hands-on projects, where I contribute as an expert. You don't need any degree to start learning.</p>
<p>But Python is vast – where should you start?</p>
<p>Start by learning basic syntax and data structures like lists, dictionaries, classes, and so on. </p>
<p>Once you are comfortable with the basics, get familiar with the essential libraries like Pandas (to read and manipulate data frames), NumPy (for numerical computing), and Matplotlib and Seaborn (for data visualization and creating plots).</p>
<ul>
<li>Here's a helpful course that teaches you <a target="_blank" href="https://www.freecodecamp.org/news/learn-pandas-for-data-science/">Pandas and Python for Data Analysis</a>.</li>
<li>This course teaches you how to use <a target="_blank" href="https://www.freecodecamp.org/news/introduction-to-data-vizualization-using-matplotlib/">Matplotlib for data visualization</a>.</li>
<li>Here's an in-depth guide to using <a target="_blank" href="https://www.freecodecamp.org/news/the-ultimate-guide-to-the-numpy-scientific-computing-library-for-python/">NumPy for scientific computing in Python</a>.</li>
</ul>
<p><img src="https://lh7-us.googleusercontent.com/9eK_ePxpe3FAkliq4MSpt2CqKjE3Wo-bKxsZTIqHx2lc6gKzm6MEXF-xGYq4xjzF73zjV8FvYyu4y-z8MAHE-7REXFkXOr6_Pr8vmTXXb9YzHyFfM6eBDCJYHRueiALL21qa8WBRjNu8xHV51vQ31Yw" alt="Image" width="600" height="400" loading="lazy">
<em>Python logo</em></p>
<h3 id="heading-sql">SQL</h3>
<p>SQL (Structured Query Language) helps you interact with large relational databases. You should learn how to create and update SQL tables, perform filtering and aggregation, and extract insights. <a target="_blank" href="https://www.freecodecamp.org/news/learn-mysql-beginners-course/">MySQL is a commonly used database system</a> to learn it with.</p>
<p>You can check out the <a target="_blank" href="https://hyperskill.org/tracks/31?category=8">SQL course</a> for beginners on Hyperskill. And if you want a text-based overview, <a target="_blank" href="https://www.freecodecamp.org/news/a-beginners-guide-to-sql/">here's a full handbook</a> that teaches you all the SQL basics you'll need to know.</p>
<h3 id="heading-data-visualization-tools-and-software">Data Visualization Tools and Software</h3>
<p>Analyzing the data is the process, but presenting your insights is the final destination. You should master visualization tools like <a target="_blank" href="https://www.freecodecamp.org/news/tableau-for-data-science-and-data-visualization-crash-course/">Tableau</a> or <a target="_blank" href="https://www.freecodecamp.org/news/python-in-powerbi/">Power BI</a> to create dashboards and reports.</p>
<p>As a data analyst, you may have to present your findings to non-technical teams in a way they can easily interpret. There are also more advanced methods, like interactive dashboards and geographic mapping for spatial data, that help teams make informed decisions.</p>
<h3 id="heading-statistics">Statistics</h3>
<p>Probability and statistics cover a wide range of essential concepts for anyone working with data. You should know the basic types of distributions, such as normal, Poisson, and skewed distributions, and how to handle each.</p>
<p>Many metrics, like mean, median, and standard deviation, can help analyze numerical variables and identify anomalies or outliers. P-values and hypothesis testing are also critical.</p>
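<p>As a small illustration of how these metrics react to an outlier, here is a sketch using Python's built-in <code>statistics</code> module. The daily order counts are made-up numbers:</p>

```python
import statistics

# Hypothetical daily order counts with one unusually large day
orders = [12, 15, 14, 13, 16, 15, 90]

mean = statistics.mean(orders)
median = statistics.median(orders)
stdev = statistics.stdev(orders)  # sample standard deviation

# The single large value pulls the mean up far more than the median
print(mean, median, round(stdev, 1))
```

<p>A large gap between the mean and the median is often a first hint that a variable is skewed or contains outliers worth investigating.</p>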
<p>Here's a tutorial on the <a target="_blank" href="https://www.freecodecamp.org/news/top-statistics-concepts-to-know-before-getting-into-data-science/">top Stats concepts to know before getting into Data Science</a> if you want to check your skills.</p>
<h3 id="heading-excel">Excel</h3>
<p>Even though most of us are familiar with <a target="_blank" href="https://www.freecodecamp.org/news/learn-microsoft-excel/">Excel basics</a>, you should learn functions like VLOOKUP, HLOOKUP, INDEX, MATCH, and IF statements for data manipulation. </p>
<p>Understanding <a target="_blank" href="https://www.freecodecamp.org/news/how-to-create-a-pivot-table-in-excel/">how to use PivotTables</a> for summarizing and analyzing large datasets and enabling dynamic data exploration is crucial.</p>
<p>If you want to learn more about how you can use Excel for data analysis, <a target="_blank" href="https://www.freecodecamp.org/news/data-analysis-with-python-for-excel-users-course/">here's a course on that</a>.</p>
<h2 id="heading-develop-your-portfolio">Develop Your Portfolio</h2>
<p>The data analytics industry is highly profitable but also fiercely competitive. Simply working through courses and acquiring skills is not enough to stand out.</p>
<p>To become a successful data analyst, you must <a target="_blank" href="https://www.freecodecamp.org/news/level-up-developer-portfolio/">build a portfolio of projects</a> demonstrating your abilities. </p>
<p>Once you're familiar with the relevant technology, identify a problem that requires analysis and locate a publicly available dataset. Analyze the dataset using various methods and extract any meaningful insights. If you don't have a degree, focus on making your portfolio the best you can.</p>
<p>Kaggle is a best friend to any data analyst beginner. Numerous datasets are available in all fields, from movie reviews and tweets to medical X-rays. Open notebooks allow you to see what expert data scientists have worked on with the same dataset. This is a great way to get guidance on approach and inspiration for ideas to try out.</p>
<p>For example, take the popular <a target="_blank" href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews">Kaggle dataset of IMDB Movie reviews</a>. What can you do with it? I’ll share a few ideas to help you get started. </p>
<p>You can begin at a basic level by calculating statistics to summarize critical metrics such as average rating, distribution of ratings, and the most reviewed genres. </p>
<p>Then you could use natural language processing (NLP) techniques to perform sentiment analysis on the movie reviews. </p>
<p>Next, create visualizations to present findings effectively. For instance, plot sentiment scores over time, visualize the distribution of reviews across genres, or create a word cloud highlighting frequently used words in positive and negative reviews.</p>
<p>Tailoring your projects to align with your interests and the specific requirements of potential employers will make your portfolio stand out in a sea of applicants. </p>
<p>For example, if you want to work in healthcare, do a project that adds value to the field. Remember, it's not just about the code; it's about telling a compelling story with the data.</p>
<p>Finally, you'll want to scrape and analyze real-time data. Build a tool that tracks social media sentiment about a brand or analyzes website traffic patterns.</p>
<h2 id="heading-how-to-build-a-good-cv">How to Build a Good CV</h2>
<p>The first stage of any job application is shortlisting based on your CV (or résumé). To increase your odds, it's crucial to create a concise and technically sound CV/résumé.</p>
<p>Your CV must be based on your educational background, coursework information, achievements, prior internships or work experience, and extracurriculars. </p>
<p>Let me share a few tips on creating a compelling CV or résumé:</p>
<ul>
<li><strong>Custom CVs</strong>: When creating your résumé, customize it to the job you are applying for. Emphasize the skills and projects that are most relevant to the specific role. If appropriate, you can also include any extracurricular activities demonstrating your ability to manage a team. But you must provide only accurate information – this should go without saying, but embellishing your résumé beyond your actual experience is unacceptable.</li>
<li><strong>Quantify your achievements:</strong> Instead of mentioning that you <em>conducted data analysis</em>, mention specific projects, tools used, and your impact. For example, you could say that you increased website conversion rate by 15% through A/B testing. Remember to add any Python libraries, frameworks, and tools you used.</li>
<li><strong>Keep it concise and visually appealing:</strong> Recruiters review hundreds of résumés and may not have time to read every line in the first round. So make a résumé that conveys the highlights of your skills and experience at a glance. Use bullet points, clear headings, and formatting when needed to highlight certain aspects.</li>
</ul>
<h2 id="heading-tips-to-ace-the-technical-interview">Tips to Ace the Technical Interview</h2>
<p>The final stage is the technical interview. Below, I have gathered some tips that will help you understand what your preparation might involve, along with examples of questions you might encounter. Remember that each case is unique and you should use these as general guidelines.</p>
<p>First, make sure you <strong>practice coding a lot</strong>. You can use platforms like HackerRank or LeetCode. Remember that clear and efficient code is vital for passing an interview. For example, you might be asked to describe the correct syntax for the <code>reshape()</code> function in NumPy.</p>
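<p>For the <code>reshape()</code> question, a short answer could look like the sketch below. The array contents are arbitrary example values:</p>

```python
import numpy as np

# Six values in a flat, one-dimensional array
flat = np.arange(6)

# reshape takes the new shape; the total number of elements must match
grid = flat.reshape(2, 3)

# -1 asks NumPy to infer that dimension from the others
columns = flat.reshape(-1, 2)

print(grid.shape, columns.shape)
print(grid)
```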
<p>Next, make sure you are comfortable <strong>working with SQL</strong>. You'll need to know how to handle complex queries, joins, subqueries, and data manipulation in SQL. A question like "How do you subset or filter data in SQL?" or "What is a Subquery in SQL?" could come up.</p>
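<p>To practice questions like these without installing a database server, you can run SQL from Python's built-in <code>sqlite3</code> module. The sketch below uses a made-up <code>sales</code> table to show filtering with <code>WHERE</code> and a simple subquery:</p>

```python
import sqlite3

# In-memory database with a hypothetical 'sales' table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("South", 250.0), ("North", 400.0), ("East", 50.0)],
)

# Filtering: WHERE keeps only the rows that match a condition
north_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE region = 'North' ORDER BY amount"
).fetchall()

# Subquery: the inner SELECT computes the average, the outer one uses it
above_average = conn.execute(
    "SELECT region, amount FROM sales "
    "WHERE amount > (SELECT AVG(amount) FROM sales) ORDER BY amount"
).fetchall()

print(north_rows)
print(above_average)
conn.close()
```

<p>SQLite's syntax differs from other database systems in some details, but basic filtering and subqueries like these carry over.</p>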
<p>You should also be prepared to discuss and demonstrate your <strong>skills in data visualization</strong>. You should be able to explain your choices in visualization for different types of data. For instance, "How is joining different from blending in Tableau?" or "What is the difference between Treemaps and Heatmaps in Tableau?"</p>
<p>You'll also want to have a <strong>good understanding of statistics</strong>. Be prepared to discuss statistical concepts like mean, median, mode, standard deviation, correlation, and regression analysis. </p>
<p>You might be asked to interpret data or explain the significance of statistical findings in a business context, such as "Explain the term Normal Distribution" or "How do you treat outliers in a dataset?"</p>
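<p>For the outlier question, one common answer (among several) is the interquartile range rule. The sketch below assumes pandas is installed and uses an invented series of values. Whether you then remove, cap, or keep the flagged points depends on the business context.</p>

```python
import pandas as pd

# Made-up measurements with one suspiciously large value
values = pd.Series([12, 14, 15, 15, 16, 18, 95])

# Interquartile range: the spread of the middle 50% of the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR outside the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())  # [95]
```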
<p>Next, make sure you have a solid foundation in <strong>data cleaning and preprocessing</strong>. Be ready to talk about your experience cleaning and preparing data, including handling missing values, detecting outliers, and normalizing data. </p>
<p>Knowing tools like pandas in Python can be particularly beneficial. An example question could be, "How can you add a column to a pandas DataFrame?"</p>
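<p>Here is a minimal sketch of two ways you might answer the "add a column" question. The column names and data are hypothetical, and pandas is assumed to be installed.</p>

```python
import pandas as pd

# Small, made-up sales table
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [3, 1, 2]})

# Option 1: assign to a new column name (modifies df in place)
df["revenue"] = df["price"] * df["quantity"]

# Option 2: assign() returns a new DataFrame and leaves df unchanged
with_flag = df.assign(large_order=df["revenue"] >= 30)

print(df["revenue"].tolist())             # [30.0, 20.0, 60.0]
print(with_flag["large_order"].tolist())  # [True, False, True]
```

<p>Being able to explain the difference, in-place assignment versus a returned copy, tends to matter more than memorizing the syntax.</p>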
<p>Be comfortable with <strong>data-driven decision making</strong>. You might be asked to explain how you have used data to inform decisions in past roles. The goal is to demonstrate that you can draw conclusions from collected data and apply them to the company's business decisions.</p>
<p>You should also be able to <strong>showcase your past work</strong>. If possible, bring examples of your past work or projects, such as a portfolio or detailed case studies. </p>
<p>Be ready to discuss the challenges faced, how you approached them, and the outcomes. Questions like "Have you ever run an analysis on the wrong set of data? How did you figure out your error?" can be expected.</p>
<p>Also, don't neglect <strong>behavioral skills</strong>. Be prepared for behavioral questions that explore your problem-solving skills, teamwork, and ability to handle deadlines and pressure. Reflect on your past experiences and be ready to share stories that highlight these skills.</p>
<p>And finally, brush up on your <strong>industry knowledge</strong>. If the company operates in a specific industry (like finance, healthcare, retail, and so on), having some background knowledge or experience in that industry can be advantageous. Tailor your preparation to understand the unique data challenges and opportunities in that sector.</p>
<p>Remember, each company may have a different focus in their technical interviews, so try to get as much information as possible about the interview format beforehand. This way, you can tailor your preparation to meet their specific expectations.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Becoming a data analyst is a marathon, not a sprint.</p>
<p>If you are interested in a career as a data analyst, Python is an excellent language to learn. It is a versatile tool that allows you to manipulate, analyze, and visualize data effectively. By mastering in-demand skills such as Python, SQL, data visualization tools, statistics, and Excel, you can set yourself up for success in the data analytics job market.</p>
<p>Also, building a portfolio of projects showcasing your abilities is crucial to stand out as an entry-level data analyst. The data analytics industry is rapidly growing, and there is a high demand for qualified professionals. </p>
<p>So, start learning and experimenting with data today to land your dream job as a data analyst working with Python.</p>
<p>Embrace the learning, celebrate the small wins, and don't be afraid to ask for help. Good luck with your goals and data analyst career path!</p>
<p>Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out an <a target="_blank" href="https://hyperskill.org/tracks/6"><strong>Introduction to Python</strong></a> course on the platform.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
