data analysis - freeCodeCamp.org

How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans

Sameer Shukla — Thu, 05 Feb 2026 22:45:15 +0000

In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformations and actual computation lies an invisible translation layer – the logical plan – that determines whether your job runs in minutes or hours.

Most engineers never look at this layer, which is why they spend days tuning configurations that don't address the real problem: inefficient transformations that generate bloated plans.

This handbook teaches you to read, interpret, and control those plans, transforming you from someone who writes PySpark code into someone who architects efficient data pipelines with precision and confidence.

Background Information
Chapter 1: The Spark Mindset: Why Plans Matter
Chapter 2: Understanding the Spark Execution Flow
Chapter 3: Reading and Debugging Plans Like a Pro
Chapter 4: Writing Efficient Transformations
Conclusion

Background Information

What This Handbook is Really About

This is not a tutorial about Spark internals, cluster tuning, or PySpark syntax or APIs.

This is a handbook about writing PySpark code that generates efficient logical plans.

Because when your code produces clean, optimized plans, Spark pushes filters correctly, shuffles reduce instead of multiply, projections stay shallow, and the DAG (Directed Acyclic Graph) becomes predictable, lean, and fast.

When your code produces messy plans, Spark shuffles more than necessary and projections pile up into deep, expensive stacks. Filters arrive late instead of early, joins explode into wide, slow operations, and the DAG becomes tangled and expensive.

The difference between a fast job and a slow job is not “faster hardware.” It’s the structure of the plan Spark generates from your code. This handbook teaches you to shape that plan deliberately through scenarios.

Who This Handbook Is For

This handbook is written for:

Data engineers building production ETL pipelines who want to move beyond trial-and-error tuning and understand why jobs perform the way they do
Analytics engineers working with large datasets in Databricks, EMR, or Glue who need to optimize Spark jobs but don't have time for thousand-page reference manuals
Data scientists transitioning from pandas to PySpark who find themselves writing code that technically runs but takes forever
Anyone who has stared at the Spark UI, seen mysterious "Exchange" nodes in the DAG, and wondered, "Why is this shuffling so much data?"

You should already be comfortable writing basic PySpark code: creating DataFrames, applying transformations, running aggregations. This handbook won't teach you Spark syntax. Instead, it teaches you how to write transformations that work with the optimizer, not against it.

How This Handbook Is Structured

We’ll start with foundations, then move on to real-world scenarios.

Chapters 1-3 build your mental model. You'll learn what logical plans are, how they connect to physical execution, and how to read the plan output that Spark shows you. These chapters are short and focused – just enough theory to make the practical scenarios meaningful.

Chapter 4 is the heart of the handbook. It contains 15 real-world scenarios, organized by category. Each scenario shows you a common performance problem, explains what's happening in the logical plan, and demonstrates the better approach. You'll see before-and-after code, plan comparisons, and clear explanations of why one approach outperforms another.

What You'll Learn

By the end of this handbook, you'll be able to:

Read and interpret Spark's logical, optimized, and physical plans
Identify expensive operations before running your code
Restructure transformations to minimize shuffles
Choose the right join strategies for your data
Avoid common pitfalls that cause memory issues and slow performance
Debug production issues by examining execution plans

More importantly, you'll develop a Spark mindset, an intuition for how your code translates to cluster operations. You'll stop writing code that "should work" and start writing code that you know will work efficiently.

Technical Prerequisites

I assume that you’re familiar with the following concepts before proceeding:

Python fundamentals
PySpark basics
- Creating DataFrames and reading data from files
- Basic DataFrame operations: select, filter, withColumn, groupBy, join
- Writing DataFrames back to storage
Basic Spark concepts
- Basic understanding of Spark applications, jobs, stages, and tasks
- Basic understanding of the difference between transformations and actions
- Understanding of partitions and shuffles
AWS Glue (Good to have)

Chapter 1: The Spark Mindset: Why Plans Matter

This chapter isn’t about Spark theory or internals. It’s about understanding Spark Plans, and seeing Spark the way the engine sees your code. Once you understand how Spark builds and optimizes a logical plan, optimization stops being trial and error and becomes intentional engineering.

Behind every simple transformation, Spark quietly redraws its internal blueprint. Every transformation you write from "withColumn" to join changes that plan. When the plan is efficient, Spark flies, but when it’s messy, Spark crawls.

The Invisible Layer Behind Every Transformation

When you write PySpark code, it feels like you’re chaining operations step by step. In reality, Spark isn’t executing those lines. It’s quietly building a blueprint, a logical plan describing what to do, not how.

Once this plan is built, the Catalyst Optimizer analyzes it, rearranges operations, eliminates redundancies, and produces an optimized plan. Catalyst is Spark’s query optimization engine.

Every DataFrame or SQL operation we write, such as select, filter, join, or groupBy, is first converted into a logical plan. Catalyst then analyzes and transforms this plan using a set of optimization rules, such as predicate pushdown, column pruning, constant folding, and join reordering. The result is an optimized logical plan, which Spark then translates into a physical plan: what your cluster actually runs. This invisible planning layer decides the job’s performance more than any configuration setting.
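To make the idea of a rule-based rewrite concrete, here is a toy sketch in plain Python. It is not Catalyst's implementation: the Scan, Project, and Filter classes and the push_filter_below_project function are hypothetical names invented for this illustration. It models a single rule, where a Filter sitting on top of a Project moves below it when the column it reads is already available from the Project's child.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical, simplified plan nodes (these are not Spark classes).
@dataclass(frozen=True)
class Scan:
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Filter:
    column: str      # the column the predicate reads
    predicate: str   # human-readable predicate text
    child: "Plan"


Plan = Union[Scan, Project, Filter]


def output_columns(plan: Plan) -> Tuple[str, ...]:
    """Columns that a node produces."""
    if isinstance(plan, Filter):
        return output_columns(plan.child)
    return plan.columns


def push_filter_below_project(plan: Plan) -> Plan:
    """One rewrite rule: Filter(Project(x)) becomes Project(Filter(x)) when safe."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        if plan.column in output_columns(project.child):
            pushed = Filter(plan.column, plan.predicate, project.child)
            return Project(project.columns, pushed)
    return plan  # rule does not apply, plan is unchanged


source = Scan(("firstname", "age", "country"))
written = Filter("country", "country = 'USA'",
                 Project(("firstname", "country"), source))
optimized = push_filter_below_project(written)
print(optimized)
```

A real optimizer applies many such rules repeatedly until the plan stops changing. The sketch only shows the shape of one rule: match a pattern in the tree, check that the rewrite is safe, and return an equivalent tree.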

From Logical to Optimized to Physical Plans

When you run df.explain(True), Spark actually shows you four stages of reasoning:

1. Logical Plan

The parsed logical plan is the first stage: Spark's initial translation of your code into a tree that shows what operations need to happen, without worrying about how to execute them efficiently. It’s a blueprint of the query’s logic before any optimization or physical planning occurs.

This:

df.filter(col('age') > 25) \
  .select('firstname', 'country') \
  .groupby('country') \
  .count() \
  .explain(True)

results in the following logical plan:

== Parsed Logical Plan ==
'Aggregate ['country], ['country, 'count(1) AS count#108]
+- Project [firstname#95, country#97]
   +- Filter (age#96L > cast(25 as bigint))
      +- LogicalRDD [firstname#95, age#96L, country#97], false

2. Analyzed Logical Plan

The analyzed logical plan is the second stage in Spark’s query optimization. In this stage, Spark validates the query by checking that tables and columns actually exist in the Catalog and resolving all references. It converts the unresolved logical plan into a resolved one with correct data types and column bindings before optimization.

3. Optimized Logical Plan

The optimized logical plan is where Spark's Catalyst optimizer improves the logical plan by applying smart rules like filtering data early, removing unnecessary columns, and combining operations to reduce computation. It's the smarter, more efficient version of your original plan that will execute faster and use fewer resources.

Let’s understand using a simple code example:

df.select('firstname', 'country') \
  .groupby('country') \
  .count() \
  .filter(col('country') == 'USA') \
  .explain(True)

Here’s the parsed logical plan:

== Parsed Logical Plan ==
'Filter '`=`('country, USA)
+- Aggregate [country#97], [country#97, count(1) AS count#122L]
   +- Project [firstname#95, country#97]
      +- LogicalRDD [firstname#95, age#96L, country#97], false

What this means:

Spark first projects firstname and country
Then aggregates by country
Then applies the filter country = 'USA' after aggregation (because that’s how you wrote it)

Here’s the optimized logical plan:

== Optimized Logical Plan ==
Aggregate [country#97], [country#97, count(1) AS count#122L]
+- Project [country#97]
   +- Filter (isnotnull(country#97) AND (country#97 = USA))
      +- LogicalRDD [firstname#95, age#96L, country#97], false

Key improvements Catalyst applied:

Filter pushdown: The filter country = 'USA' is pushed below the aggregation, so Spark only groups U.S. rows.
Column pruning: “firstname” is automatically removed because it’s never used in the final output.
Cleaner projection: Intermediate columns are dropped early, reducing I/O and in-memory footprint.

4. Physical Plan

The physical plan is Spark's final execution blueprint that shows exactly how the query will run: which specific algorithms to use, how to distribute work across machines, and the order of low-level operations. It's the concrete, executable version of the optimized logical plan, translated into actual Spark operations like “ShuffleExchange”, “HashAggregate”, and “FileScan” that will run on your cluster.

Catalyst may, for example:

Fold constants (lit(1) + lit(2) → 3)
Push filters closer to the data source
Replace a regular join with a broadcast join when one side is small enough to fit in each executor's memory
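Constant folding in the textbook sense means evaluating, at planning time, any sub-expression whose inputs are all literals. The sketch below is a toy version in plain Python, assuming a made-up tuple encoding for expressions. It is not how Catalyst represents expressions.

```python
# Hypothetical expression encoding for this sketch:
#   ("lit", value), ("col", name), or (operator, left, right)
# with operator being "+" or "*".

def fold_constants(expr):
    """Replace literal-only sub-expressions with their computed value."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if op == "+" else left[1] * right[1]
        return ("lit", value)
    return (op, left, right)  # a column is involved, so keep the operation


# salary * (1 + 2): the inner addition can be computed before any row is read.
expr = ("*", ("col", "salary"), ("+", ("lit", 1), ("lit", 2)))
print(fold_constants(expr))  # ('*', ('col', 'salary'), ('lit', 3))
```

The multiplication survives because one operand depends on row data. Only the literal-only part is computed once at planning time instead of once per row.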

Once the physical plan is finalized, Spark’s scheduler converts it into a DAG of stages and tasks that run across the cluster. Understanding that lineage, from your code → plan → DAG, is what separates fast jobs from slow ones.

How to Read a Logical Plan

A logical plan prints as a tree: the bottom is your data source, and each higher node represents a transformation.

Node	Meaning
Relation / LogicalRDD	Data source, the initial DataFrame
Project	Column selection and transformation (select, withColumn)
Filter	Row filtering based on conditions (where, filter)
Join	Combining two DataFrames on a key (join)
Aggregate	GroupBy and aggregation operations (groupBy, agg)
Exchange	Shuffle operation (data redistribution across partitions); appears in the physical plan
Sort	Ordering data (orderBy, sort)

Each node represents a transformation. Execution flows from the bottom up. Let's understand with a basic example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName("Practice").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

df = spark.createDataFrame(employees_data,  
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

Version A: withColumn → filter

In this version, we add a derived column with withColumn and then apply a filter to the dataset. This ordering is logically correct and produces the expected result. It shows how introducing derived columns early affects the logical plan: Spark is asked to compute a new column before any rows are eliminated.

# explain() prints the plan and returns None, so keep the DataFrame in its own variable
df_filtered = (
    df.withColumn('bonus', col('salary') * 82)
      .filter(col('age') > 35)
)
df_filtered.explain(True)

Parsed Logical Plan (Simplified)

Filter (age > 35)
└─ Project [*, (salary * 82) AS bonus]
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

So what’s going on here? Execution flows from the bottom up.

Spark first reads the LogicalRDD.
Then applies the Project node, keeping all columns and adding bonus.
Finally, the Filter removes rows where age ≤ 35.

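Taken literally, this plan computes the bonus for every employee, even those who are later filtered out. It's harmless here, but on millions of rows it means more computation and more data carried through the rest of the plan. For a simple, deterministic filter like this one, Catalyst can usually push the Filter below the Project in the optimized plan, but you shouldn't count on that once UDFs or more complex logic are involved.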

Version B: Filter → Project

In this version, we apply the filter before introducing the derived column. The idea is to show how pushing row-reducing operations earlier allows Catalyst to produce a leaner logical plan. Compared to Version A, this example demonstrates that the same logic, written in a different order, can significantly reduce the amount of work Spark needs to perform.

# explain() prints the plan and returns None, so keep the DataFrame in its own variable
df_filtered = (
    df.filter(col('age') > 35)
      .withColumn('bonus', col('salary') * 82)
)
df_filtered.explain(True)

Parsed Logical Plan (Simplified)

Project [*, (salary * 82) AS bonus]
└─ Filter (age > 35)
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

So what’s going on here?

Spark starts from the LogicalRDD.
It immediately applies the Filter, reducing the dataset to only employees with age > 35.
Then the Project node adds the derived column bonus for this smaller subset.

Now the Filter sits below the Project in the plan, cutting data movement and minimizing computation. Spark prunes data first, then derives new columns. This order reduces both the volume of data processed and the amount transferred, leading to a lighter and faster plan.
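To put a number on the difference between the two orderings, here is a small plain-Python model that counts how many bonus computations each order performs on the ten sample employees. It mimics the parsed plans taken literally. It does not run Spark, and on a real cluster Catalyst may reorder a simple filter like this for you.

```python
# (firstname, salary, age) taken from the employees sample data.
employees = [
    ("John", 80000, 28), ("Jane", 85000, 32), ("Alice", 60000, 25),
    ("Bob", 90000, 35), ("Charlie", 65000, 29), ("David", 55000, 27),
    ("Eve", 95000, 40), ("Frank", 70000, 33), ("Grace", 58000, 26),
    ("Henry", 88000, 31),
]


def run(rows, filter_first):
    """Return (result rows, number of bonus computations performed)."""
    computed = 0
    result = []
    for name, salary, age in rows:
        if filter_first and not age > 35:
            continue                 # row dropped before any bonus work
        bonus = salary * 82          # the derived column
        computed += 1
        if age > 35:                 # in Version A the filter runs here
            result.append((name, bonus))
    return result, computed


result_a, work_a = run(employees, filter_first=False)  # Version A order
result_b, work_b = run(employees, filter_first=True)   # Version B order
print(result_a == result_b)  # True
print(work_a, work_b)        # 10 1
```

Both orders return the same single row, but the filter-first order does one multiplication instead of ten. The gap grows with the share of rows the filter removes.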

Why You Should Run df.explain(True) Every Time

This is the quickest way to spot performance issues before they hit production. It shows:

Whether filters sit in the right place.
How many Project nodes exist (each adds overhead).
Where Exchange nodes appear (these are shuffle boundaries).
If Catalyst pushed filters or rewrote joins as expected.

A quick explain() takes seconds, while debugging a bad shuffle in production takes hours. Run explain() whenever you add or reorder transformations. The plan never lies.

What Spark Does Under the Hood

Catalyst can sometimes reorder simple filters automatically, but once you use UDFs, nested logic, or joins, it often can’t. That’s why the best habit is to write transformations in a way that already makes sense to the optimizer. Filter early, avoid redundant projections, and keep plans as shallow as possible.

Optimizing Spark isn’t about tuning cluster configs – it’s about writing code that yields efficient plans. If your plan shows late filters, too many projections, or multiple Exchange nodes, it’s already explaining why your job will run slow.

Chapter 2: Understanding the Spark Execution Flow

In Chapter 1, you learned how Spark interprets your transformations into logical plans – blueprints of what the job intends to do.

But Spark doesn't stop there. It must translate those plans into distributed actions across a cluster of executors, coordinate data movement, and handle any failures that may occur.

This chapter reveals what happens when that plan leaves the driver: how Spark breaks your job into stages, tasks, and a directed acyclic graph (DAG) that actually runs.

By the end, you’ll understand why some operations shuffle terabytes while others fly, and how to predict it before execution begins.

From Plans to Stages to Tasks

A Spark job evolves through three conceptual layers:

Layer	What It Represents	Example View
Plan	The optimized logical + physical representation of your query	Read → Filter → Join → Aggregate
Stage	A contiguous set of operations that can run without shuffling data	“Map Stage” or “Reduce Stage”
Task	The smallest unit of work, one per partition per stage	“Process Partition 7 of Stage 3”

The Execution Trigger: Actions vs Transformations

Here's the critical distinction that determines when execution actually begins:

df1 = spark.read.parquet("data.parquet")
df2 = df1.filter(col("age") > 25)
df3 = df2.groupBy("city").count()

Nothing executes yet! Spark just builds up the logical plan, adding each transformation as a node in the plan tree. No data is read, no filters run, no shuffles happen.

Actions Trigger Execution

Spark transformations are lazy. When a sequence of DataFrame operations is defined, a logical plan is created, but no computation takes place. It’s only when Spark encounters an action, an operation that needs a result to be returned to the driver or written out, that execution takes place.

For example:

result = df3.collect()

At this point, Spark finalizes the logical plan, applies optimizations, creates a physical plan, and executes the job. Until an action such as collect(), count(), or a write is called, Spark is only describing what it needs to do; it isn't actually doing it.
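The lazy behavior can be imitated in a few lines of plain Python. The LazyFrame class below is a hypothetical stand-in, not a Spark API, and the two sample rows are made up. Transformations only record a step, and the data is touched only when collect() runs.

```python
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: records steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: returns a new plan and touches no data.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        # Transformation: also just recorded.
        return LazyFrame(self._rows, self._steps + (("select", names),))

    def plan(self):
        return [kind for kind, _ in self._steps]

    def collect(self):
        # Action: only now is the recorded plan executed.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows


calls = []


def older_than_25(row):
    calls.append(row["name"])  # lets us observe when the predicate really runs
    return row["age"] > 25


people = [{"name": "Ann", "age": 31, "city": "Oslo"},
          {"name": "Raj", "age": 22, "city": "Pune"}]

lazy = LazyFrame(people).filter(older_than_25).select("name", "city")
print(lazy.plan(), calls)  # ['filter', 'select'] []
print(lazy.collect())      # [{'name': 'Ann', 'city': 'Oslo'}]
```

The empty calls list after building the chain is the point: the predicate has not run on any row until collect() is called.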

The Complete Execution Flow

Execution starts when an action such as collect() is called. The driver submits the job through the SparkContext to the DAG Scheduler, which analyzes the physical plan to find the shuffle boundaries created by wide operations such as groupBy or orderBy.

The plan is then divided into stages, each containing only narrow operations. Each stage is sent to the Task Scheduler as a TaskSet with one task per partition.

Tasks are assigned to executor cores based on data locality and then run. A stage starts only after the stages it depends on have completed. The results of the final stage are returned to the driver or written to storage.
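The stage-splitting idea can be sketched with a linear list of operation names. This is a deliberate simplification: real plans are trees, and Spark often runs part of an aggregation before the shuffle. The WIDE_OPERATIONS set and the partition counts are assumptions chosen for the example.

```python
# Assumption for this sketch: these operation names count as shuffle-inducing.
WIDE_OPERATIONS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def split_into_stages(operations):
    """Group a linear list of operations into stages, cutting before each wide one."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


def task_count(stages, partitions_per_stage):
    """One task per partition per stage."""
    assert len(stages) == len(partitions_per_stage)
    return sum(partitions_per_stage)


ops = ["read", "filter", "groupBy", "select", "orderBy"]
stages = split_into_stages(ops)
print(stages)  # [['read', 'filter'], ['groupBy', 'select'], ['orderBy']]

# Assumed partition counts: 8 input partitions, then 200 shuffle partitions.
print(task_count(stages, [8, 200, 200]))  # 408
```

Counting the wide operations in your own code this way gives a rough estimate of how many stages to expect before you open the Spark UI.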

What Triggers a Shuffle

A shuffle occurs when Spark needs to redistribute data across partitions, typically because the operation requires grouping, joining, or repartitioning data in a way that can’t be done locally within existing partitions.

Common shuffle triggers:

Operation	Why it Shuffles
groupBy(), reduceByKey()	Data with the same key must co-locate for aggregation
join()	Matching keys may reside in different partitions
orderBy() / sort()	Requires global ordering across all partitions
distinct()	Needs comparison of all values across partitions
repartition(n)	Explicit redistribution to a new number of partitions

df.groupBy("user_id") \
  .agg(sum("amount"))

In Stage 1 (Map), each task performs a partial aggregation on its partition and writes a shuffle file to disk. During the shuffle, each executor retrieves these files across the network such that all records with the same hash(user_id) % numPartitions are colocated.

In Stage 2 (Reduce), each task performs a final aggregation on its partitioned data and writes back to disk. Because Spark has tracked this process as a DAG, a failed task can re-read only the affected shuffle files instead of re-computing the entire DAG.
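The map, shuffle, and reduce steps described above can be traced by hand with a small plain-Python model. The partitioner here is a stand-in (integer key modulo partition count), not Spark's actual hash function, and the input rows are made up.

```python
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 3  # assumption for the sketch


def target_partition(user_id):
    # Stand-in partitioner: integer key modulo the partition count.
    return user_id % NUM_SHUFFLE_PARTITIONS


def map_stage(partition):
    """Stage 1: partial sum per key inside one input partition."""
    partial = defaultdict(int)
    for user_id, amount in partition:
        partial[user_id] += amount
    return dict(partial)


def shuffle(partials):
    """Route every (key, partial sum) to the partition that owns the key."""
    buckets = [[] for _ in range(NUM_SHUFFLE_PARTITIONS)]
    for partial in partials:
        for user_id, subtotal in partial.items():
            buckets[target_partition(user_id)].append((user_id, subtotal))
    return buckets


def reduce_stage(bucket):
    """Stage 2: final sum per key; all partials for a key are now co-located."""
    totals = defaultdict(int)
    for user_id, subtotal in bucket:
        totals[user_id] += subtotal
    return dict(totals)


# Made-up input: two input partitions of (user_id, amount) rows.
input_partitions = [
    [(1, 10), (2, 5), (1, 7)],
    [(2, 3), (4, 20), (1, 1)],
]
partials = [map_stage(p) for p in input_partitions]
final = [reduce_stage(b) for b in shuffle(partials)]
print(final)  # [{}, {1: 18, 4: 20}, {2: 8}]
```

Note that six input rows became five partial records before the shuffle. On real data with many rows per key, that partial aggregation is what keeps shuffle volume down.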

As a rough rule of thumb, a healthy job for logic this simple has only a handful of stages (roughly 2-6). Seeing 20+ stages usually means unnecessary shuffles or bad partitioning.

Why Shuffles Create Stage Boundaries

Shuffles force data to move across the network between executors. Spark cannot continue processing until:

All tasks in the current stage write their shuffle output to disk
The shuffle data is available for the next stage to read over the network

This dependency creates a natural boundary – so a new stage begins after every shuffle. The DAG Scheduler uses these boundaries to determine where stages must wait for previous stages to complete.

Common Performance Bottlenecks

Bottleneck Type	Symptom	Solution
Data skew	Few tasks run much longer	Use salting, split hot keys, or AQE skew join
Small files	Too many tasks, high overhead	Coalesce or repartition after read
Large shuffle	High network I/O, spill to disk	Filter early, broadcast small tables, reduce cardinality
Unnecessary stages	Extra Exchange nodes in plan	Combine operations, remove redundant repartitions
Inefficient file formats	Slow reads, no predicate pushdown	Use Parquet or ORC with partitioning
Complex data types	Serialization overhead, large objects	Use simple types, cache in serialized form
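Salting, the first remedy in the table, is easiest to understand by looking at how rows land in partitions. The sketch below is plain Python with a made-up partitioner and made-up skewed data. It only shows the distribution effect, not a full Spark job.

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4  # assumption: how many sub-keys each key is split into


def partition_for(key_id, salt=0):
    # Stand-in partitioner for the sketch, not Spark's real hash function.
    return (key_id * 31 + salt) % NUM_PARTITIONS


# Made-up skewed data: key 7 is "hot" (1,000 rows); keys 1-3 have 10 rows each.
rows = [7] * 1000 + [1] * 10 + [2] * 10 + [3] * 10

# Without salting, every row of the hot key lands in the same partition.
unsalted = Counter(partition_for(key) for key in rows)

# With salting, a rotating salt spreads each key over several partitions.
salted = Counter(
    partition_for(key, salt=i % NUM_SALTS) for i, key in enumerate(rows)
)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

In a real aggregation, salting needs a second step: aggregate on (key, salt) first, then aggregate those partial results on the key alone to get the final answer.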

Let’s ground this with a small but realistic pattern using the same employees DataFrame. Goal: average salary per department and country, only for employees older than 30.

Naïve approach:

from pyspark.sql.functions import col, when, avg

df_dept_country = df.select("department", "country").distinct()

df_result = (
    df.withColumn(
        "age_group",
        when(col("age") < 30, "junior")
        .when(col("age") < 40, "mid")
        .otherwise("senior")
    )
    # Join on both shared columns so "country" is not duplicated (and ambiguous) after the join
    .join(df_dept_country, ["department", "country"], "inner")
    # The age filter arrives late, after the derived column and the join
    .filter(col("age") > 30)
    .groupBy("department", "country")
    .agg(avg("salary").alias("avg_salary"))
)

This looks harmless, but:

The join introduces a wide dependency → shuffle #1.
The groupBy("department", "country") introduces another wide dependency → shuffle #2.

So we have two shuffles for what should be a simple aggregation. If you run df_result.explain(), you’ll see the corresponding Exchange nodes, each marking a shuffle and stage boundary.

Optimized Approach

We can do better by filtering early, broadcasting the small dimension (df_dept_country), and keeping only one global shuffle for aggregation.

from pyspark.sql.functions import broadcast

df_dept_country = df.select("department", "country").distinct()

df_result_optimized = (
    df.filter(col("age") > 30)
      # Join on both shared columns so "country" is not duplicated after the join
      .join(broadcast(df_dept_country), ["department", "country"], "inner")
      .groupBy("department", "country")
      .agg(avg("salary").alias("avg_salary"))
)

What changed:

filter(col("age") > 30) is narrow and runs before any shuffle.
broadcast(df_dept_country) avoids a shuffle for the join.
Only the groupBy("department", "country") causes a single shuffle.

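Now explain shows a BroadcastHashJoin instead of a shuffle join, and the only Exchange on the large side is the one for the aggregation.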

Version	Shuffles	Stages	Notes
Naïve	2	~4 (2 map + 2 reduce)	Join shuffle + groupBy shuffle = double overhead
Optimized	1	~2 (1 map + 1 reduce)	Broadcast join avoids shuffle. Only groupBy shuffles
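Why does broadcasting avoid a shuffle? A plain-Python sketch of a broadcast hash join makes it visible: the small side becomes a lookup table that every task receives, so rows of the large side are matched where they already sit. The function name and the "floor" column are made up for illustration; the names and departments come from the employees sample.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Join each partition of the large side against a full copy of the small side."""
    # "Broadcast": build one lookup table from the small side.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:      # each iteration models one task
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined_partitions.append(out)       # rows never leave their partition
    return joined_partitions


# Small dimension side ("floor" is an invented column).
departments = [{"department": "HR", "floor": 2},
               {"department": "Sales", "floor": 5}]

# Large side, already split into two partitions.
employee_partitions = [
    [{"name": "David", "department": "HR"},
     {"name": "Alice", "department": "Sales"}],
    [{"name": "Grace", "department": "HR"},
     {"name": "Eve", "department": "Engineering"}],
]

result = broadcast_hash_join(employee_partitions, departments, "department")
print([len(p) for p in result])  # [2, 1]
```

The trade-off is memory: every task holds the whole lookup table, which is why this strategy only suits a genuinely small side.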

Chapter 3: Reading and Debugging Plans Like a Pro

As explained in Chapter 1, Spark takes your transformations through a parsed logical plan, an analyzed and optimized logical plan (Catalyst), and a physical plan. This chapter expands on that explanation and concentrates on how the logical plan affects shuffles and execution performance.

By now, you understand how Spark builds and executes plans. But reading those plans and instantly spotting inefficiencies is the real superpower of a performance-focused data engineer.

Spark’s explain() output isn’t random jargon. It’s a precise log of Spark’s thought process. Once you learn to read it, every optimization becomes obvious.

Three Layers in Spark

As we discussed above, df.explain(True) prints four sections. For debugging, it helps to group them into three key views:

Parsed Logical Plan: The raw intent Spark inferred from your code. It may include unresolved column names or expressions.
Analyzed / Optimized Logical Plan: The plan after Spark resolves columns and types (analyzed) and then applies Catalyst optimizations (optimized): constant folding, predicate pushdown, column pruning, and plan rearrangements.
Physical Plan: What your executors actually run: joins, shuffles, exchanges, scans, and code-generated operators.

Each stage narrows the gap between what you asked Spark to do and what Spark decides to do.

df_avg = (
    df.filter(col("age") > 30)
      .groupBy("department")
      .agg(avg("salary").alias("avg_salary"))
)

df_avg.explain(True)

1. Parsed Logical Plan

== Parsed Logical Plan ==
'Aggregate ['department], ['department, 'avg('salary) AS avg_salary#8]
+- Filter (age#5L > cast(30 as bigint))
   +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false

How to read this

Bottom → data source (LogicalRDD).
Middle → Filter: Spark hasn’t yet optimized column references.
Top → Aggregate: high-level grouping intent.

At this stage, the plan may include unresolved symbols (like 'department or 'avg('salary)), meaning Spark hasn’t yet validated column existence or data types.

2. Optimized Logical Plan


== Optimized Logical Plan ==
Aggregate [department#3], [department#3, avg(salary#4L) AS avg_salary#8]
+- Project [department#3, salary#4L]
   +- Filter (isnotnull(age#5L) AND (age#5L > 30))
      +- LogicalRDD [id#0L, firstname#1, lastname#2, department#3, salary#4L, age#5L, hire_date#6, country#7], false

Here, Catalyst has done its job:

Column IDs (#3, #4L) are resolved.
Unused columns are pruned – no need to carry them forward.
The plan now accurately reflects Spark’s optimized logical intent.

If you ever wonder whether Spark pruned columns or pushed filters, this is the section to check.

3. Physical Plan

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)], output=[department#3, avg_salary#8])
   +- Exchange hashpartitioning(department#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)], output=[department#3, sum#20, count#21L])
         +- Project [department#3, salary#4L]
            +- Filter (isnotnull(age#5L) AND (age#5L > 30))
               +- Scan ExistingRDD[id#0L,firstname#1,lastname#2,department#3,salary#4L,age#5L,hire_date#6,country#7]

Breakdown

Scan ExistingRDD → Spark reading from the in-memory DataFrame.
Filter → narrow transformation, no shuffle.
HashAggregate → partial aggregation per partition.
Exchange → wide dependency: data is shuffled by department.
Top HashAggregate → final aggregation after shuffle.

This structure – partial agg → shuffle → final agg – is Spark’s default two-phase aggregation pattern.
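The physical plan's partial_avg step outputs a sum and a count rather than an average, and a short plain-Python check shows why. The salaries are the Engineering employees older than 30 from the sample data; the split into two uneven partitions is an assumption for the example.

```python
def partial_avg(values):
    """Map side: keep a (sum, count) buffer instead of an average."""
    return (sum(values), len(values))


def merge(buffers):
    """Reduce side: combine the buffers, then divide once."""
    total = sum(s for s, _ in buffers)
    count = sum(c for _, c in buffers)
    return total / count


# Assumed partitioning: three rows in one partition, one row in the other.
partition_1 = [85000, 90000, 95000]
partition_2 = [88000]

correct = merge([partial_avg(partition_1), partial_avg(partition_2)])

# Averaging the per-partition averages gives the wrong answer on uneven partitions.
naive = (sum(partition_1) / len(partition_1)
         + sum(partition_2) / len(partition_2)) / 2

print(correct)  # 89500.0
print(naive)    # 89000.0
```

Carrying (sum, count) through the shuffle keeps the result exact no matter how rows are distributed, and the buffer stays tiny: one pair per key per partition.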

Recognizing Common Nodes

Node / Operator	Meaning	Optimization Hint
Project	Column selection or computed fields	Combine multiple withColumn() into one select()
Filter	Predicate on rows	Push filters as low as possible in the plan
Join	Combine two DataFrames	Broadcast smaller side if < 10 MB
Aggregate	GroupBy, sum, avg, count	Filter before aggregating to reduce cardinality
Exchange	Shuffle / data redistribution	Minimize by filtering early, using broadcast join
Sort	OrderBy, sort	Avoid global sorts; use within partitions if possible
Window	Windowed analytics (row_number, rank)	Partition on selective keys to reduce shuffle

Repeated invocations of withColumn stack multiple Project nodes, which increases the plan depth. Instead, combine these invocations using select.

Multiple Exchange nodes imply repeated data shuffles. You can eliminate these by broadcasting the data or filtering.

Multiple scans of the same table within a single job suggest that a reused intermediate result is not being cached.

And frequent SortMergeJoin operations imply that Spark is unnecessarily sorting and shuffling the data. You can eliminate these by broadcasting the smaller dataframe or bucketing.

Debugging Strategy: Read Plans from Top to Bottom

Remember: Spark executes plans from bottom up (from data source to final result). But when you're debugging, you read from the top down (from the output schema back to the root cause). This reversal is intentional: you start with what's wrong at the output level, then trace backward through the plan to find where the inefficiency was introduced.

When debugging a slow job:

Start at the top: Identify output schema and major operators (HashAggregate, Join, and so on).
Scroll for Exchanges: Count them. Each = stage boundary. Ask “Why do I need this shuffle?”
Trace backward: See if filters or projections appear below or above joins.
Look for duplication: Same scan twice? Missing cache? Re-derived columns?
Check join strategy: If it’s SortMergeJoin but one table is small, force a broadcast().
Re-run explain after optimization: You should literally see the extra nodes disappear.
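Counting operators by eye gets tedious on large plans. The helper below is a minimal sketch that tallies operator names from physical-plan text. operator_counts is a hypothetical helper, and it assumes the tree layout shown in this chapter, with "+- " and ":- " prefixes and an optional "*(n)" codegen marker.

```python
import re
from collections import Counter

# Tree-drawing prefix plus an optional whole-stage-codegen marker such as "*(1) ".
_PREFIX = re.compile(r"^[\s:+\-|]*(\*\(\d+\)\s*)?")


def operator_counts(plan_text):
    """Count operators by name, one per line of a physical plan tree."""
    counts = Counter()
    for line in plan_text.splitlines():
        body = _PREFIX.sub("", line, count=1)
        match = re.match(r"[A-Za-z]+", body)
        if match:
            counts[match.group(0)] += 1
    return counts


# The physical plan from earlier in this chapter.
plan = """== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[department#3], functions=[avg(salary#4L)], output=[department#3, avg_salary#8])
   +- Exchange hashpartitioning(department#3, 200), ENSURE_REQUIREMENTS, [plan_id=19]
      +- HashAggregate(keys=[department#3], functions=[partial_avg(salary#4L)], output=[department#3, sum#20, count#21L])
         +- Project [department#3, salary#4L]
            +- Filter (isnotnull(age#5L) AND (age#5L > 30))
               +- Scan ExistingRDD[id#0L,firstname#1,lastname#2,department#3,salary#4L,age#5L,hire_date#6,country#7]
"""

counts = operator_counts(plan)
print("Exchange nodes:", counts["Exchange"])            # Exchange nodes: 1
print("SortMergeJoin nodes:", counts["SortMergeJoin"])  # SortMergeJoin nodes: 0
```

Running the same count before and after a change gives a quick, objective check that an Exchange or SortMergeJoin really disappeared. How you obtain the plan text from a live DataFrame depends on your Spark version; capturing the printed output of explain() is one option.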

Catalyst Optimizer in Action

Catalyst applies dozens of rules automatically. Knowing a few helps you interpret what changed:

Optimization Rule	Example Transformation
Predicate Pushdown	Moves filters below joins/scans
Constant Folding	Replaces 1 + 2 with 3
Column Pruning	Drops unused columns early
Combine Filters	Merges consecutive filters into one
Simplify Casts	Removes redundant type casts
Join Reordering	Changes join order for a cheaper plan

Putting It All Together: Every Plan Tells a Story

As you progress through the practical scenarios in Chapter 4, read every plan before and after. Your goal isn't memorization – it's intuition.

Chapter 4: Writing Efficient Transformations

Every Spark job tells a story, not in code, but in plans. By now, you've seen how Spark interprets transformations (Chapter 1), how it executes them through stages and tasks (Chapter 2), and how to read plans like a detective (Chapter 3). Now comes the part where you apply that knowledge: writing transformations that yield efficient logical plans.

This chapter is the heart of the handbook. It's where we move from understanding Spark's mind to writing code that speaks its language fluently.

Why Transformations Matter

In PySpark, most performance issues don’t start in clusters or configurations. They start in transformations: the way we chain, filter, rename, or join data. Every transformation reshapes the logical plan, influencing how Spark optimizes, when it shuffles, and whether the final DAG is streamlined or tangled.

A good transformation sequence:

Keeps plans shallow, not nested.
Applies filters early, not after computation.
Reduces data movement, not just data size.
Lets Catalyst and AQE optimize freely, without user-induced constraints.

A bad one can double runtime, and you won't see it in your code, only in your plan.

The Goal of this Chapter

We’ll explore a series of real-world optimization scenarios, drawn from production ETL and analytical pipelines, each showing how a small change in code can completely reshape the logical plan and execution behavior.

Each scenario is practical and short, following a consistent structure. By the end of this chapter, you’ll be able to see optimization opportunities the moment you write code, because you’ll know exactly how they alter the logical plan beneath.

Before You Dive In:

Open a Spark shell or notebook. Load your familiar employees DataFrame. Run every example, and compare the explain("formatted") output before and after the fix. Because in this chapter, performance isn’t about more theory, it’s about seeing the difference in the plan and feeling the difference in execution time.

Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()

If you’ve worked with PySpark DataFrames, you’ve probably had to rename columns, either by calling withColumnRenamed() repeatedly or by using toDF() in one shot.

At first glance, both approaches produce identical results: the columns have the new names you wanted. But beneath the surface, Spark treats them very differently – and that difference shows up directly in your logical plan.

df_renamed = (df.withColumnRenamed("id", "emp_id")
    .withColumnRenamed("firstname", "first_name")
    .withColumnRenamed("lastname", "last_name")
    .withColumnRenamed("department", "dept")
    .withColumnRenamed("salary", "base_salary")
    .withColumnRenamed("age", "age_years")
    .withColumnRenamed("hire_date", "hired_on")
    .withColumnRenamed("country", "country_code")
)

This is simple and readable. But Spark builds the plan step by step, adding one Project node for every rename. Each Project node copies all existing columns, plus the newly renamed one. In large schemas (hundreds of columns), this silently bloats the plan.

Logical Plan Impact:

Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]

└─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country]

   └─ Project [emp_id, first_name, last_name, dept, base_salary, age_years, hire_date, country]

      └─ Project [emp_id, first_name, last_name, dept, base_salary, age, hire_date, country]

         └─ Project [emp_id, first_name, last_name, dept, salary, age, hire_date, country]

            └─ Project [emp_id, first_name, last_name, department, salary, age, hire_date, country]

               └─ Project [emp_id, first_name, lastname, department, salary, age, hire_date, country]

                  └─ Project [emp_id, firstname, lastname, department, salary, age, hire_date, country]

                     └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Each rename adds a new Project layer to the analyzed plan, deepening the tree that Catalyst has to resolve and optimize before anything executes. You can see every layer by running df_renamed.explain(True) and reading the Analyzed Logical Plan section.

The Better Approach: Rename Once with toDF()

Instead of chaining multiple renames, rename all columns in a single pass:

# toDF() is positional: list every column, in schema order, with its final name
new_cols = ["emp_id", "first_name", "last_name", "dept",
            "base_salary", "age_years", "hired_on", "country_code"]

df_renamed = df.toDF(*new_cols)

Logical Plan Impact:

Project [emp_id, first_name, last_name, dept, base_salary, age_years, hired_on, country_code]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now there’s just one Project node, which means one projection over the source data. This gives us a flatter, more efficient plan.
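Because toDF() is positional, a hand-written list is easy to misalign with the schema. A minimal sketch of one way around that is to derive the list from an old-to-new mapping. The helper names renamed_columns and rename_all are hypothetical, and the sketch only assumes that the DataFrame exposes .columns and .toDF(), as PySpark DataFrames do.

```python
def renamed_columns(columns, mapping):
    """Return columns in their original order, renamed where mapping has an entry."""
    unknown = set(mapping) - set(columns)
    if unknown:
        # Fail loudly on typos instead of silently skipping the rename
        raise KeyError(f"columns not found: {sorted(unknown)}")
    return [mapping.get(name, name) for name in columns]


def rename_all(df, mapping):
    """Rename every mapped column with a single toDF() call."""
    return df.toDF(*renamed_columns(df.columns, mapping))


# Hypothetical usage with the employees DataFrame:
#   df_renamed = rename_all(df, {"id": "emp_id", "firstname": "first_name"})
```

Raising on an unknown name is a design choice for this sketch: withColumnRenamed() is a no-op when the column does not exist, which can hide typos. If your runtime is on Spark 3.4 or later, DataFrame.withColumnsRenamed() accepts a similar mapping directly.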

Under the Hood: What Spark Actually Does

Every time you call withColumnRenamed(), Spark rewrites the entire projection list. Catalyst treats the rename as a full column re-selection from the previous node, not as a light-weight alias update. When you chain several renames, Catalyst duplicates internal column metadata for each intermediate step.

By contrast, toDF() rebases the schema in a single step. Catalyst interprets it as a single schema rebinding, so no redundant metadata trees are created.

Real-World Timing: Glue Job Benchmark

To see if chained withColumnRenamed calls add real overhead, here's a simple timing test performed on a Glue job using a DataFrame with 1M rows. First using withColumnRenamed:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionRowsRenameTest").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

multiplied_data = [(i, f"firstname_{i}", f"lastname_{i}",
                    employees_data[i % 10][3],  # department
                    employees_data[i % 10][4],  # salary
                    employees_data[i % 10][5],  # age
                    employees_data[i % 10][6],  # hire_date
                    employees_data[i % 10][7])  # country
                   for i in range(1, 1_000_001)]

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df1 = (df
       .withColumnRenamed("firstname", "first_name")
       .withColumnRenamed("lastname", "last_name")
       .withColumnRenamed("department", "dept_name")
       .withColumnRenamed("salary", "annual_salary")
       .withColumnRenamed("age", "emp_age")
       .withColumnRenamed("hire_date", "hired_on")
       .withColumnRenamed("country", "work_country"))

print("withColumnRenamed Count:", df1.count())
print("withColumnRenamed time:", round(time.time() - start, 2), "seconds")

Using toDF:

start = time.time()
df2 = df.toDF("id", "first_name", "last_name", "dept_name", "annual_salary", "emp_age", "hired_on", "work_country")
print("toDF Count:", df2.count())
print("toDF time:", round(time.time() - start, 2), "seconds")

spark.stop()

Approach	Number of Project Nodes	Glue Execution Time (1M rows)	Plan Complexity
Chained withColumnRenamed()	7 nodes	~12 seconds	Deep, nested
Single toDF()	1 node	~8 seconds	Flat, simple

The difference matters most in complex pipelines and on managed runtimes such as AWS Glue, where planning overhead is a visible share of total runtime. Each additional Project adds column resolution, metadata work, and plan depth. Catalyst’s CollapseProject rule can usually merge a chain of renames during optimization, but the analyzer still has to build and resolve every layer first, and that cost grows with the number of columns and renames. Renaming all columns in one go with toDF() starts from a flatter logical plan: one rename, one projection.
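If you reproduce these timings yourself, keep in mind that whichever variant runs first also pays one-off costs such as session start-up. A minimal sketch of a fairer harness follows. The helper name timed and its warmup and repeats parameters are assumptions for illustration, not part of the benchmark above.

```python
import time


def timed(action, warmup=1, repeats=3):
    """Run action() several times and return the fastest wall-clock time in seconds."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not measured
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage with the DataFrames from the benchmark:
#   t_chain = timed(lambda: df1.count())
#   t_todf = timed(lambda: df2.count())
```

Taking the best of several runs reduces noise from other work on the cluster. It does not remove it, so treat small differences with caution.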

Scenario 2: Reusing Expressions

Sometimes Spark jobs run slower, not because of shuffles or joins, but because the same computation is performed repeatedly within the logical plan. Every time you repeat an expression, say, col("salary") * 0.1 in multiple places, Spark treats it as a new derived column, expanding the logical plan and forcing redundant work.

The Problem: Repeated Expressions

Let’s say we’re calculating bonus and total compensation for employees:

from pyspark.sql.functions import col

df_expr = (
    df.withColumn("bonus", col("salary") * 0.10)
      .withColumn("total_comp", col("salary") + (col("salary") * 0.10))
)

At first glance, it’s simple enough. But Spark’s optimizer doesn’t automatically know that the (col("salary") * 0.10) in the second column is identical to the one computed in the first. Both get evaluated separately in the logical plan.

Simplified Logical Plan:

Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * 0.10) AS bonus,

(salary + (salary * 0.10)) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

While this looks compact, Spark must compute (salary * 0.10) twice: once for bonus, and again inside total_comp. For a large dataset (say 100M rows), that’s two full column evaluations. The waste compounds when the expression is complex. Imagine parsing JSON, applying UDFs, or running date arithmetic multiple times.

The Better Approach: Compute Once, Reuse Everywhere

Compute the expression once, store it as a column, and reference it later:

df_expr = (
    df.withColumn("bonus", col("salary") * 0.10)
      .withColumn("total_comp", col("salary") + col("bonus"))
)

Simplified Logical Plan:

Project [id, firstname, lastname, department,

salary, age, hire_date, country,

(salary * 0.10) AS bonus,

(salary + bonus) AS total_comp]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now Spark calculates (salary * 0.10) once, stores it in the bonus column, and reuses that column when computing total_comp. This single change cuts CPU cost and memory usage.
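The arithmetic behind "compute once, reuse everywhere" can be shown with a plain-Python analogy. This is not Spark code, and whether Spark really evaluates an expression twice depends on what the optimizer does with your plan. CountingBonus is a hypothetical stand-in for an expensive expression that records how often it runs.

```python
class CountingBonus:
    """Stand-in for an expensive expression; counts how often it is evaluated."""

    def __init__(self, rate):
        self.rate = rate
        self.calls = 0

    def __call__(self, salary):
        self.calls += 1
        return salary * self.rate


def repeated(salaries, bonus):
    """Evaluate the bonus expression separately for each output column."""
    return [(bonus(s), s + bonus(s)) for s in salaries]


def reused(salaries, bonus):
    """Evaluate the bonus once per row and reuse the result."""
    rows = []
    for s in salaries:
        b = bonus(s)
        rows.append((b, s + b))
    return rows
```

Both functions return the same rows. The repeated version calls the expression twice per row, the reused version once, and that per-row gap grows with the cost of the expression.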

Under the Hood: Why Repetition Hurts

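Spark’s Catalyst optimizer doesn’t reliably factor out repeated expressions across different columns. Each withColumn() creates a new Project node with its own expression tree. Spark does have a subexpression-elimination step at execution time (controlled by spark.sql.subexpressionElimination.enabled), but it doesn’t cover every case. Non-deterministic expressions, for example, are never merged. Where it doesn’t apply, repeated arithmetic or function calls are re-evaluated independently.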

On small DataFrames, this cost is invisible. On wide, computation-heavy jobs (think feature engineering pipelines), it can add hundreds of milliseconds per task.

Each redundant expression increases:

Catalyst’s internal expression resolution time
The size of generated Java code in WholeStageCodegen
CPU cycles per row, since Spark cannot share intermediate results between columns in the same node

Real-World Benchmark: AWS Glue

We tested this pattern on AWS Glue (Spark 3.3) with 10 million rows and a simulated expensive computation, on a dataset similar to the one used in Scenario 1.

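import time
from pyspark.sql.functions import col, exp, log, sqrt

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

# A deliberately expensive expression, shared by both variants below
expr = sqrt(exp(log(col("salary") + 1)))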

start = time.time()

df_repeated = (
    df.withColumn("metric_a", expr)
      .withColumn("metric_b", expr * 2)
      .withColumn("metric_c", expr / 10)
)

df_repeated.count()
time_repeated = round(time.time() - start, 2)

start = time.time()

df_reused = (
    df.withColumn("metric", expr)
      .withColumn("metric_a", col("metric"))
      .withColumn("metric_b", col("metric") * 2)
      .withColumn("metric_c", col("metric") / 10)
)

df_reused.count()

print("Repeated expr time:", time_repeated, "seconds")
print("Reused expr time:", round(time.time() - start, 2), "seconds")

spark.stop()

Approach	Project Nodes	Execution Time (10M rows)	Expression Evaluations
Repeated expression	Multiple (nested)	~18 seconds	3x per row
Compute once, reuse	Single	~11 seconds	1x per row

The performance gap widens further with genuinely expensive expressions (like regex extraction, JSON parsing, or UDFs).

Physical Plan Implication

In the physical plan, repeated expressions expand into multiple Java blocks within the same WholeStageCodegen node:

*(1) Project [sqrt(exp(log(salary + 1))) AS metric_a,

(sqrt(exp(log(salary + 1))) * 2) AS metric_b,

(sqrt(exp(log(salary + 1))) / 10) AS metric_c, ...]

Spark literally embeds three copies of the same logic.

Each is JIT-compiled separately, leading to:

Larger generated Java classes
Higher CPU utilization
Longer code-generation time before tasks even start

When reusing a column, Spark generates one expression and references it by name, dramatically shrinking the codegen footprint. If you have complex transformations (nested when, UDFs, regex extractions, and so on), compute them once and reuse them with col("alias").

For even heavier expressions that appear across multiple pipelines, consider persisting the intermediate DataFrame:

df_features = df.withColumn("complex_feature", complex_logic)

df_features.cache()

That cache can save multiple recomputations across downstream steps.
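A cached DataFrame keeps occupying executor memory until it is released, so it helps to scope the cache to the steps that need it. The sketch below wraps the two calls in a context manager. The name cached is hypothetical, and the only assumption is that the object exposes cache() and unpersist(), as PySpark DataFrames do.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()


# Hypothetical usage:
#   with cached(df_features) as features:
#       features.groupBy("department").count().show()
#       features.filter("age > 35").count()
```

Note that cache() is lazy: the data is only materialized by the first action inside the block, and later actions in the same block reuse it.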

Scenario 3: Batch Column Ops

Most PySpark pipelines don’t die because of one big, obvious mistake. They slow down from a thousand tiny cuts: one extra withColumn() here, another there, until the logical plan turns into a tall stack of projections.

On its own, withColumn() is fine. The problem is how we use it:

10–30 chained calls in a row
Re-deriving similar expressions
Spreading logic across many tiny steps

This scenario shows how batching column operations into a single select() produces a flatter, cleaner logical plan that scales better and is easier to reason about.

The Problem: Chaining withColumn() Forever

from pyspark.sql.functions import col, concat_ws, when, lit, upper

df_transformed = (
    df.withColumn("full_name", concat_ws(" ", col("firstname"), col("lastname")))
      .withColumn("is_senior", when(col("age") >= 35, lit(1)).otherwise(lit(0)))
      .withColumn("salary_k", col("salary") / 1000.0)
      .withColumn("experience_band",
                  when(col("age") < 30, "junior")
                  .when((col("age") >= 30) & (col("age") < 40), "mid")
                  .otherwise("senior"))
      .withColumn("country_upper", upper(col("country")))  # Column has no .upper() method
)

It reads nicely, it runs, and everyone moves on. But under the hood, Spark builds this as multiple Project nodes, one per withColumn() call.

Simplified Logical Plan (Chained), Conceptually:

Project [..., country_upper]

└─ Project [..., experience_band]

   └─ Project [..., salary_k]

      └─ Project [..., is_senior]

         └─ Project [..., full_name]

            └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Each layer re-selects all existing columns, adds one more derived column, and deepens the plan.

The Better Approach: Batch with select()

Instead of incrementally patching the schema, build it once.

df_transformed = df.select(
    col("id"),
    col("firstname"),
    col("lastname"),
    col("department"),
    col("salary"),
    col("age"),
    col("hire_date"),
    col("country"),
    concat_ws(" ", col("firstname"), col("lastname")).alias("full_name"),
    when(col("age") >= 35, lit(1)).otherwise(lit(0)).alias("is_senior"),
    (col("salary") / 1000.0).alias("salary_k"),
    when(col("age") < 30, "junior")
        .when((col("age") >= 30) & (col("age") < 40), "mid")
        .otherwise("senior").alias("experience_band"),
    upper(col("country")).alias("country_upper")  # upper() comes from pyspark.sql.functions
)

Simplified Logical Plan (Batched):

Project [id, firstname, lastname, department, salary, age, hire_date, country,

         full_name, is_senior, salary_k, experience_band, country_upper]

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

One Project. All derived columns are defined together. Flatter DAG. Cleaner plan.
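When the derived columns are assembled programmatically, the same batching idea can be wrapped in a small helper. This is a sketch under stated assumptions: batched_select is a hypothetical name, and the expressions only need an .alias() method, which PySpark Column objects provide.

```python
def batched_select(df, base_columns, derived):
    """Select base columns plus all derived expressions in one projection.

    derived maps output names to Column-like expressions exposing .alias().
    """
    clash = set(base_columns) & set(derived)
    if clash:
        # select() would emit two columns with the same name
        raise ValueError(f"derived names shadow base columns: {sorted(clash)}")
    projections = list(base_columns)
    projections += [expr.alias(name) for name, expr in derived.items()]
    return df.select(*projections)


# Hypothetical usage with PySpark:
#   from pyspark.sql.functions import col, upper
#   df_out = batched_select(
#       df,
#       ["id", "salary", "country"],
#       {"salary_k": col("salary") / 1000.0,
#        "country_upper": upper(col("country"))},
#   )
```

The name-clash check exists because withColumn() replaces a column of the same name, while select() does not. On Spark 3.3 or later, DataFrame.withColumns() takes a similar name-to-expression mapping.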

Under the Hood: Why This Matters

Each withColumn() is syntactic sugar for: “Take the previous plan, and create a new Project on top of it.” So 10 withColumn() calls = 10 projections wrapped on top of each other.

Catalyst can sometimes collapse adjacent Project nodes, but:

Not always (especially when aliases shadow each other).
Not when expressions become complex or interdependent.
Not when UDFs or analysis barriers appear.

Batching with select():

Gives Catalyst a single, complete view of all expressions.
Enables more aggressive optimizations (constant folding, expression reuse, pruning).
Keeps expression trees shallower and codegen output smaller.

Think of it as the difference between editing a sentence 10 times in a row and writing the final sentence once, cleanly.

Real-World Example: Using the Employees DF at Scale:

Chained version (many withColumn()):

from pyspark.sql.functions import col, concat_ws, when, lit, upper
import time

start = time.time()
df_chain = (
    df.withColumn("full_name", concat_ws(" ", col("firstname"), col("lastname")))
      .withColumn("is_senior", when(col("age") >= 35, 1).otherwise(0))
      .withColumn("salary_k", col("salary") / 1000.0)
      .withColumn("high_earner", when(col("salary") >= 90000, 1).otherwise(0))
      .withColumn("experience_band",
                  when(col("age") < 30, "junior")
                  .when((col("age") >= 30) & (col("age") < 40), "mid")
                  .otherwise("senior"))
      .withColumn("country_upper", upper(col("country")))
)

df_chain.count()
time_chain = round(time.time() - start, 2)

Batched version (single select()):

start = time.time()
df_batch = df.select(
    "id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country",
    concat_ws(" ", col("firstname"), col("lastname")).alias("full_name"),
    when(col("age") >= 35, 1).otherwise(0).alias("is_senior"),
    (col("salary") / 1000.0).alias("salary_k"),
    when(col("salary") >= 90000, 1).otherwise(0).alias("high_earner"),
    when(col("age") < 30, "junior")
        .when((col("age") >= 30) & (col("age") < 40), "mid")
        .otherwise("senior").alias("experience_band"),
    upper(col("country")).alias("country_upper")
)

df_batch.count()
time_batch = round(time.time() - start, 2)

Approach	Logical Shape	Glue Execution Time (1M rows)	Notes
Chained withColumn()	6 nested Projects	~14 seconds	Deep plan, more Catalyst work
Single select()	1 Project	~9 seconds	Flat planning, cleaner DAG

The distinction is most evident when there are more derived columns, more complex expressions (UDFs, window functions), or when executing on managed runtimes such as AWS Glue.

In the chained cases, there are more Project nodes, code generation is fragmented, and expression evaluation is less amenable to global optimization.

In the batched cases, Spark generates a single Project node, more work is consolidated into a single WholeStageCodegen pipeline, code generation is reduced, the JVM is less stressed, and the plan is flatter and more amenable to optimization. This is not only cleaner, but it’s also faster, more reliable, and friendlier to Spark’s optimizer.

Scenario 4: Early Filter vs Late Filter

Many pipelines apply transformations first, adding columns, joining datasets, or calculating derived metrics, before filtering records. That order looks harmless in code but can double or triple the workload at execution.

Problem: Late Filtering

df_late = (
    df.withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
      .filter(col("age") > 35)
)

This means Spark first computes all columns for every employee, then discards most rows.

Simplified Logical Plan:

Filter (age > 35)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country,

            (salary * 0.1) AS bonus,

            (salary / 1000) AS salary_k]

   └─ LogicalRDD [...]

Catalyst can sometimes reorder this automatically, but when it can't (due to UDFs or complex logic), you're doing unnecessary work on data that's thrown away.

Better Approach: Early Filtering

df_early = (
    df.filter(col("age") > 35)
      .withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
)

Simplified Logical Plan:

Project [id, firstname, lastname, department, salary, age, hire_date, country,

         (salary * 0.1) AS bonus,

         (salary / 1000) AS salary_k]

└─ Filter (age > 35)

   └─ LogicalRDD [...]

Now Spark prunes the dataset first, then applies transformations. The result: smaller intermediate data, less per-row computation, and a smaller footprint for any shuffle that follows.
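The saving is simple arithmetic: derived columns cost one evaluation per row that reaches the projection. The plain-Python sketch below counts those evaluations for the ten sample ages in employees_data. It models the order in which the code is written, not what Catalyst may reorder, and derived_evaluations is a hypothetical helper.

```python
def derived_evaluations(ages, min_age, derived_columns, filter_first):
    """Count derived-column evaluations for a filter of the form age > min_age."""
    if filter_first:
        rows = [age for age in ages if age > min_age]
    else:
        rows = list(ages)  # every row is projected before the filter runs
    return len(rows) * derived_columns


# Ages of the ten sample employees used throughout this chapter
sample_ages = [28, 32, 25, 35, 29, 27, 40, 33, 26, 31]

late = derived_evaluations(sample_ages, 35, derived_columns=2, filter_first=False)
early = derived_evaluations(sample_ages, 35, derived_columns=2, filter_first=True)
print(late, early)
```

With two derived columns (bonus and salary_k), the late filter evaluates them for all ten rows, while the early filter evaluates them only for rows that pass age > 35.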

Real-World Benchmark: AWS Glue

Late Filtering:

df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start_late = time.time()

df_late = (
    df.withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
      .filter(col("age") > 35)   
)

df_late.count()
time_late = round(time.time() - start_late, 2)

Early Filtering:

start_early = time.time()

df_early = (
    df.filter(col("age") > 35)    
      .withColumn("bonus", col("salary") * 0.1)
      .withColumn("salary_k", col("salary") / 1000)
)

df_early.count()
time_early = round(time.time() - start_early, 2)

print("Late Filter Time:", time_late, "seconds")
print("Early Filter Time:", time_early, "seconds")

spark.stop()

Approach	Rows Reaching the Projection	Execution Time (approx)	Notes
Late filter	1,000,000 (all rows)	~14 seconds	Computes bonus and salary_k for all rows, then filters
Early filter	100,000 (filtered subset)	~9 seconds	Filters first, computes only for age > 35

The early filter approach processes significantly less data before the projection, leading to faster execution and less memory pressure.

Always filter as early as possible: before joins, aggregations, and expensive transformations such as UDFs or window functions. Where the format allows it, filter during the file read itself via Parquet/ORC pushdown, since filtering at the source touches fewer partitions and leads to faster jobs.

Scenario 5: Column Pruning

When working with Spark DataFrames, convenience often wins over correctness, and nothing feels more convenient than select("*"). It’s quick, flexible, and perfect for exploration.

But in production pipelines, that little star silently costs CPU, memory, network bandwidth, and runtime efficiency. Every time you write select("*"), Spark expands it into every column from your schema, even if you’re using just one or two later.

Those extra attributes flow through every stage of the plan, from filters and joins to aggregations and shuffles. The result: inflated logical plans, bigger shuffle files, and slower queries.

The Problem: “The Lazy Star”

from pyspark.sql.functions import avg, col

df_star = (
    df.select("*")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

At first glance, this seems harmless. But only three columns are needed (department for the filter, country and salary for the aggregation), while Spark carries all eight (id, firstname, lastname, department, salary, age, hire_date, country) through every transformation.

Simplified Logical Plan:

Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Every node in this tree carries all columns. Catalyst’s ColumnPruning rule can trim some of them during optimization, but nothing in your code guarantees it. Any attribute that survives is serialized, shuffled, and deserialized across the cluster, even though it serves no purpose in the final result.

The Fix: Select Only What You Need

Be deliberate with your projections. Select the minimal schema required for the task.

df_pruned = (
    df.select("department", "salary", "country")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

Simplified Logical Plan:

Aggregate [country], [avg(salary) AS avg_salary]

└─ Filter (department = Engineering)

   └─ Project [department, salary, country]

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Now Spark reads and processes only the three required columns: department, salary, and country. The plan is narrower, the DAG simpler, and execution faster.
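In a longer pipeline, the minimal schema is the union of the columns each step touches. A small sketch of deriving it from the filter, grouping, and aggregation inputs follows. The helper name required_columns is hypothetical.

```python
def required_columns(*groups):
    """Merge column-name lists into one ordered, de-duplicated projection list."""
    seen = {}
    for group in groups:
        for name in group:
            seen.setdefault(name, None)  # dicts preserve first-seen order
    return list(seen)


filter_cols = ["department"]
group_cols = ["country"]
agg_cols = ["salary"]

needed = required_columns(filter_cols, group_cols, agg_cols)
print(needed)

# Hypothetical usage with the employees DataFrame:
#   df_pruned = df.select(*needed)
```

Keeping the three lists next to the transformations that use them makes it obvious when a column can be dropped from the projection.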

Real-World Benchmark: AWS Glue

Wide Projection:

df = spark.createDataFrame(multiplied_data,
                           ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df_star = (
    df.select("*")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

df_star.count()
time_star = round(time.time() - start, 2)

Pruned Projection:

start = time.time()

df_pruned = (
    df.select("department", "salary", "country")
      .filter(col("department") == "Engineering")
      .groupBy("country")
      .agg(avg("salary").alias("avg_salary"))
)

df_pruned.count()
time_pruned = round(time.time() - start, 2)

print(f"select('*') time: {time_star}s")
print(f"pruned columns time: {time_pruned}s")

spark.stop()

Approach	Columns Processed	Execution Time (1M rows)	Observation
select("*")	8	~26.54 s	Spark carries all columns through the plan.
Pruned projection	3	~2.21 s	Only needed columns processed → faster and lighter.

Under the Hood: How Catalyst Handles Columns

When you call select("*"), Catalyst resolves every attribute into the logical plan. Each subsequent transformation inherits that full attribute list, increasing plan depth and overhead.

Catalyst includes a rule called ColumnPruning, which removes unused attributes, but it can only do so where it can prove a column is unused. If you use "*" or df.columns, every attribute enters the plan and you are relying on the optimizer to take it back out.

Works:

df \
    .select("salary", "country") \
    .groupBy("country") \
    .agg(avg("salary"))

Doesn’t Work:

cols = df.columns

df.select(cols) \
  .groupBy("country") \
  .agg(avg("salary"))

In the second case, cols is simply the full column list, so every attribute enters the plan and any pruning is left entirely to the optimizer.

Physical Plan Differences

Wide Projection (select("*")):

*(1) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]

   +- *(1) Filter (department = Engineering)

      +- *(1) Scan parquet ...

Pruned Projection:

*(1) HashAggregate(keys=[country], functions=[avg(salary)])

+- *(1) Project [department, salary, country]

   +- *(1) Filter (department = Engineering)

      +- *(1) Scan parquet [department, salary, country]

Notice the last line: Spark physically scans only the three referenced columns from Parquet. That’s genuine I/O reduction, not just logical simplification.

Using select("*") increases shuffle file sizes, memory usage during serialization, Catalyst planning time, and I/O and network traffic. The fix requires nothing more than naming the columns you need.

In managed environments like AWS Glue or Databricks, this simple practice can greatly reduce ETL time, particularly for Parquet or Delta files, because an explicit projection lets the reader skip unneeded columns. It’s one of the easiest and highest-impact Spark optimization techniques, and it starts with typing fewer asterisks.

Scenario 6: Filter Pushdown vs Full Scan

When a Spark job feels slow right from the start, even before joins or aggregations, the culprit is often hidden at the data-read layer. Spark spends seconds (or minutes) scanning every record, even though most rows are useless for the query.

That’s where filter pushdown comes in. It tells Spark to push your filter logic down to the file reader so that Parquet / ORC / Delta formats return only the relevant rows from disk. Done right, this optimization can reduce scan size significantly. Done wrong, Spark performs a full scan, reading everything before filtering in memory.

The Problem: Late Filters and Full Scans

employees_df = spark.read.parquet("s3://data/employee_data/")

df_full = (
    employees_df
        .select("*")  # reads all columns
        .filter(col("country") == "Canada")
)

Looks fine, right? The risk is that the filter is applied after the select("*") projection step rather than directly on the scan. In a chain this short, Catalyst can usually move the predicate back down. But once a UDF, a cache, or a non-deterministic expression sits between the read and the filter, the pushdown boundary is lost. The plan below shows that worst case.

Simplified Logical Plan:

Filter (country = Canada)

└─ Project [id, firstname, lastname, department, salary, age, hire_date, country]

   └─ Scan parquet employee_data [id, firstname, lastname, department, salary, age, hire_date, country]

Every record from every Parquet file is read into memory before the filter executes. In large tables, this means scanning terabytes when you only need megabytes.

The Fix: Filter Early and Project Light

Move filters as close as possible to the data source and limit columns before Spark reads them:

df_pushdown = (
    spark.read.parquet("s3://data/employee_data/")
        .select("id", "firstname", "department", "salary", "country")
        .filter(col("country") == "Canada")
)

Simplified Logical Plan:

Project [id, firstname, department, salary, country]

└─ Scan parquet employee_data [id, firstname, department, salary, country]

PushedFilters: [country = Canada]

Notice the difference: PushedFilters appears in the plan. That means the Parquet reader handles the predicate, returning only matching blocks and rows.
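Checking for PushedFilters by eye gets tedious across many jobs, so the check can be scripted against the plan text. A minimal sketch follows, assuming the plan has already been captured as a string. The helper name pushed_filters is hypothetical, and the exact rendering of each filter varies by Spark version.

```python
import re

# Captures the text between the brackets after "PushedFilters:"
_PUSHED = re.compile(r"PushedFilters: \[([^\]]*)\]")


def pushed_filters(plan):
    """Return the non-empty PushedFilters entries found in plan text, one per scan."""
    return [entry.strip() for entry in _PUSHED.findall(plan) if entry.strip()]


full_scan = "Batched: true, DataFilters: [], PushedFilters: []"
pushdown = "Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]"

print(pushed_filters(full_scan))
print(pushed_filters(pushdown))
```

An empty result means no scan in the plan reported a pushed predicate. The pattern stops at the first closing bracket, so a filter that itself contains brackets is returned truncated, which is still enough for a presence check.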

Under the Hood: What Actually Happens

When Spark performs filter pushdown, it leverages the Parquet metadata (min/max statistics and row-group indexes) stored in file footers.

Spark inspects file-level metadata for the predicate column (country).
It skips any row group whose min/max range cannot contain the value (Canada).
It reads only the remaining row groups, and only the requested columns, from disk.
Those records enter the DAG with most non-matching data already gone, so the remaining filter runs over far less data.

This pruning happens inside the scan itself, before any downstream operator sees the data, reducing both I/O and network transfer.
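The min/max check can be illustrated with a toy model in plain Python. This is not how the Parquet reader is implemented. RowGroup, may_contain, and scan_equal are hypothetical names, and each row group is reduced to a pair of footer statistics plus its values.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy model of a row group: footer statistics plus the stored values."""
    min_value: str
    max_value: str
    values: list


def may_contain(group, value):
    """Footer check: could this row group possibly hold the value?"""
    return group.min_value <= value <= group.max_value


def scan_equal(groups, value):
    """Read only row groups that pass the footer check, then filter their rows."""
    read = [g for g in groups if may_contain(g, value)]
    matches = [v for g in read for v in g.values if v == value]
    return len(read), matches


groups = [
    RowGroup("Canada", "Canada", ["Canada", "Canada"]),
    RowGroup("UK", "USA", ["UK", "USA", "UK"]),
    RowGroup("Canada", "USA", ["Canada", "UK", "USA"]),
]

groups_read, rows = scan_equal(groups, "Canada")
print(groups_read, rows)
```

The second row group is skipped without being read. The third passes the footer check even though most of its rows do not match, which is why statistics-based skipping works best when the data is sorted or clustered on the filter column.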

Real-World Benchmark: AWS Glue

import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FilterPushdownBenchmark").getOrCreate()

start = time.time()
df_full = (
    spark.read.parquet("s3://data/employee_data/")
        .select("*")                         # all columns
        .filter(col("country") == "Canada")  
)
df_full.count()
time_full = round(time.time() - start, 2)

start = time.time()
df_pushdown = (
    spark.read.parquet("s3://data/employee_data/")
        .select("id", "firstname", "department", "salary", "country")
        .filter(col("country") == "Canada")  
)
df_pushdown.count()
time_push = round(time.time() - start, 2)

print("Full Scan Time:", time_full, "sec")
print("Filter Pushdown Time:", time_push, "sec")

spark.stop()

Approach	Execution Time (1M rows)	Observation
Full Scan	14.2 s	All files scanned and filtered in memory.
Filter Pushdown	3.8 s	Only relevant row groups and columns read.

Physical Plan Comparison

Full Scan:

*(1) Filter (country = Canada)

+- *(1) ColumnarToRow

   +- *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

      Batched: true, DataFilters: [], PushedFilters: []

Pushdown:

*(1) ColumnarToRow

+- *(1) FileScan parquet [id, firstname, department, salary, country]

   Batched: true, DataFilters: [isnotnull(country)], PushedFilters: [country = Canada]

The difference is clear: PushedFilters confirms that Spark applied predicate pushdown, skipping unnecessary row groups at the scan stage.

Reflection: Why Pushdown Matters

Pushdown isn’t a micro-optimization. It’s often the single biggest performance lever in Spark ETL. In data lakes with hundreds of files, full scans waste hours and inflate S3 I/O costs. By filtering and projecting early, Spark prunes both rows and columns at the scan, before any downstream work begins.

Apply filters as early as possible in the read pipeline, and combine filter pushdown with column pruning. Verify PushedFilters in explain("formatted"), and avoid UDFs and select("*") at read time. Pushdown turns “read everything and discard most” into “read only what you need.”

Scenario 7: De-duplicate Right

The Problem: “All-Row Deduplication” and Why It Hurts

When we use this:

df.dropDuplicates()

Spark removes identical rows across all columns. It sounds simple, but this operation forces Spark to treat every column as part of the deduplication key.

Internally, it means:

Every attribute is serialized and hashed.
Every unique combination of all columns is shuffled across the cluster to ensure global uniqueness.
Even small changes in a non-essential field (like hire_date) cause new keys and destroy aggregation locality.

In wide tables, this is one of the heaviest shuffle operations Spark can perform.

Simplified Logical Plan:

Aggregate [id, firstname, lastname, department, salary, age, hire_date, country], [first(id) AS id, ...]

└─ Exchange hashpartitioning(id, firstname, lastname, department, salary, age, hire_date, country, 200)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Notice the Exchange: that’s a full shuffle across all columns. Spark must send every record to the partition responsible for its unique combination of all fields. This is slow, memory-intensive, and scales poorly as columns grow.

The Better Approach: Key-Based Deduplication

In most real datasets, duplicates are determined by a primary or business key, not all attributes. For example, if id uniquely identifies an employee, we only need to keep one record per id.

df.dropDuplicates(["id"])

Now Spark deduplicates based only on the id column.

Aggregate [id], [first(id) AS id, first(firstname) AS firstname, ...]

└─ Exchange hashpartitioning(id, 200)

   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

The shuffle is dramatically narrower. Instead of hashing across all columns, Spark redistributes data only by id. The result is fewer bytes, smaller shuffle files, and a faster reduce stage.
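The two calls are not interchangeable, because they answer different questions. The plain-Python analogy below (not Spark code) shows the difference on a few rows. dedupe_full and dedupe_by_key are hypothetical helpers that keep the first row seen, whereas Spark makes no promise about which row survives dropDuplicates(["id"]).

```python
def dedupe_full(rows):
    """Keep the first occurrence of each identical row (all columns form the key)."""
    return list(dict.fromkeys(rows))


def dedupe_by_key(rows, key_index=0):
    """Keep the first row seen for each value of the key column."""
    kept = {}
    for row in rows:
        kept.setdefault(row[key_index], row)
    return list(kept.values())


rows = [
    (1, "John", "2020-01-15"),
    (1, "John", "2020-01-16"),  # same id, different hire_date
    (2, "Jane", "2019-03-20"),
    (2, "Jane", "2019-03-20"),  # exact duplicate
]

print(len(dedupe_full(rows)), len(dedupe_by_key(rows)))
```

Full-row deduplication keeps both versions of employee 1 because they differ in one field, while key-based deduplication keeps one row per id. If it matters which row wins, decide that explicitly (for example, by ordering on a timestamp) rather than relying on arrival order.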

Real-World Benchmark: AWS Glue

import time
from pyspark.sql.functions import exp, log, sqrt, col, concat_ws, when, upper, avg
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MillionRowsRenameTest").getOrCreate()

employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

multiplied_data = [(i, f"firstname_{i}", f"lastname_{i}",
                    employees_data[i % 10][3],   # department
                    employees_data[i % 10][4],   # salary
                    employees_data[i % 10][5],   # age
                    employees_data[i % 10][6],   # hire_date
                    employees_data[i % 10][7]    # country
                    )
                   for i in range(1, 1_000_001)]

df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start = time.time()
dedup_full = df.dropDuplicates()
dedup_full.count()
time_full = round(time.time() - start, 2)

start = time.time()
dedup_key = df.dropDuplicates(["id"])
dedup_key.count()
time_key = round(time.time() - start, 2)

print(f"Full-row dedup time: {time_full}s")
print(f"Key-based dedup time: {time_key}s")

spark.stop()

Approach	Execution Time (1M rows)	Observation
Full-Row Dedup	27.6 s	Shuffle across all attributes, large hash table
Key-Based Dedup (["id"])	2.06 s	~13× faster, narrower shuffle key
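
Wall-clock comparisons like this one are sensitive to measurement order, because the first action in a fresh session tends to absorb one-time costs such as JVM warm-up and executor start-up. A minimal sketch of a fairer harness follows. Here time_action is a hypothetical helper (not part of PySpark) that runs a warm-up call before timing and reports the best of several repeats:

```python
import time


def time_action(action, warmups=1, repeats=3):
    """Return the best wall-clock time (in seconds) of `action` over `repeats` runs.

    `action` is any zero-argument callable that triggers a Spark action,
    for example: lambda: df.dropDuplicates(["id"]).count()
    """
    for _ in range(warmups):
        action()  # absorb one-time costs; the result is discarded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)
```

With the DataFrame from the benchmark above, calling time_action(lambda: df.dropDuplicates().count()) and time_action(lambda: df.dropDuplicates(["id"]).count()) would give both variants the same warm start.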

Under the Hood: What Catalyst Does

When you specify a key list, Catalyst rewrites dropDuplicates(keys) into a partial + final aggregate plan, just like a groupBy:

HashAggregate(keys=[id], functions=[first(...)])

This allows Spark to:

Perform map-side partial aggregation on each partition (before shuffle).
Shuffle rows partitioned only by the grouping key (id).
Perform a final aggregation on the reduced data.

The all-column version benefits far less: every column participates in uniqueness, so Spark must hash and compare entire rows on both sides of the shuffle.

Best Practices for Deduplication

Practice	Why It Matters
Always deduplicate by key columns	Reduces shuffle width and data movement
Use deterministic keys (id, email, ssn)	Ensures predictable grouping
Avoid dropDuplicates() without arguments	Forces global shuffle across all attributes
Combine with column pruning	Keep only necessary fields before deduplication
For “latest record” logic, use window functions	Allows targeted deduplication (row_number() with order)
Cache intermediate datasets if reused	Avoids recomputation of expensive dedup stages

Combining Deduplication & Aggregation

You can merge deduplication with aggregation for even better results:

df_dedup_agg = (
    df.dropDuplicates(["id"])
        .groupBy("department")
        .agg(avg("salary").alias("avg_salary"))
)

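Spark still needs two shuffles here: one on id for the deduplication and one on department for the aggregation. The second is tiny, though, because partial aggregation sends only one row per department from each partition. The plan will show:

HashAggregate(keys=[department], functions=[avg(salary)])

└─ Exchange hashpartitioning(department, 200)

   └─ HashAggregate(keys=[department], functions=[partial_avg(salary)])

      └─ HashAggregate(keys=[id], functions=[first(...), first(department)])

         └─ Exchange hashpartitioning(id, 200)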

Prefer dropDuplicates(["key_col"]) over dropDuplicates() to deduplicate by business or surrogate keys rather than the entire schema. Combine deduplication with projection to reduce I/O, and remember that one narrow shuffle is always better than a wide shuffle. Deduplication isn’t just cleanup – it’s an optimization strategy. Choose your keys wisely, and Spark will reward you with faster jobs and lighter DAGs.

Scenario 8: Count Smarter

In production, one of the most common performance pitfalls is the simplest line of code:

if df.count() > 0:

At first glance, this seems harmless. You just want to know whether the DataFrame has any data before writing, joining, or aggregating. But in Spark, count() is not a metadata lookup. It’s a full cluster-wide job.

What Really Happens with count()

When you call df.count(), Spark executes a complete action:

It scans every partition.
Deserializes every row.
Counts records locally on each executor.
Reduces the counts to the driver.

That means your “empty check” runs a full distributed computation, even when the dataset has billions of rows or lives in S3.

df.count()

Simplified Logical Plan:

*(1) HashAggregate(keys=[], functions=[count(1)])

+- *(1) ColumnarToRow

   +- *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

Every record is read, aggregated, and returned just to produce a single integer.

Now imagine this runs in the middle of your Glue job, before a write, before a filter, or inside a loop. You’ve just added a full-table scan to your DAG for no reason.

The Smarter Way: limit(1) or head(1)

If all you need to know is whether data exists, you don’t need to count every record. You just need to know if there’s at least one.

Two efficient alternatives:

df.head(1)
# or
df.limit(1).collect()

Both execute a lazy scan that stops as soon as one record is found.

Simplified Logical Plan:

CollectLimit 1

└─ *(1) FileScan parquet [id, firstname, lastname, department, salary, age, hire_date, country]

No global aggregation.
No shuffle.
No full scan.

Real-World Benchmark: AWS Glue

import time
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("EmptinessCheckBenchmark").getOrCreate()

# Base dataset (10 sample employees)
employees_data = [
    (1, "John", "Doe", "Engineering", 80000, 28, "2020-01-15", "USA"),
    (2, "Jane", "Smith", "Engineering", 85000, 32, "2019-03-20", "USA"),
    (3, "Alice", "Johnson", "Sales", 60000, 25, "2021-06-10", "UK"),
    (4, "Bob", "Brown", "Engineering", 90000, 35, "2018-07-01", "USA"),
    (5, "Charlie", "Wilson", "Sales", 65000, 29, "2020-11-05", "UK"),
    (6, "David", "Lee", "HR", 55000, 27, "2021-01-20", "USA"),
    (7, "Eve", "Davis", "Engineering", 95000, 40, "2017-04-12", "Canada"),
    (8, "Frank", "Miller", "Sales", 70000, 33, "2019-09-25", "UK"),
    (9, "Grace", "Taylor", "HR", 58000, 26, "2021-08-15", "Canada"),
    (10, "Henry", "Anderson", "Engineering", 88000, 31, "2020-02-28", "USA")
]

# Create 1 million rows
multiplied_data = [
    (i, f"firstname_{i}", f"lastname_{i}",
     employees_data[i % 10][3],
     employees_data[i % 10][4],
     employees_data[i % 10][5],
     employees_data[i % 10][6],
     employees_data[i % 10][7])
    for i in range(1, 1_000_001)
]

# Create DataFrame
df = spark.createDataFrame(
    multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]
)

start = time.time()
df.count()
count_time = round(time.time() - start, 2)

start = time.time()
df.limit(1).collect()
limit_time = round(time.time() - start, 2)

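start = time.time()
df.head(1)
head_time = round(time.time() - start, 2)

# Report the three timings so the runs can be compared
print(f"count(): {count_time}s | limit(1): {limit_time}s | head(1): {head_time}s")

spark.stop()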

Method	Plan Type	Execution Time (1M rows)	Notes
count()	HashAggregate + Exchange	26.33 s	Full scan + aggregation
limit(1)	CollectLimit	0.62 s	Stops after first record
head(1)	CollectLimit	0.42 s	Fastest, single partition

The difference is significant for the same logical check.

So why does this difference exist? Spark’s execution model treats every action as a trigger for computation. count() is an aggregation action that has to visit every partition and combine the partial counts, while limit(1) and head(1) only need a single row. For them, Spark plans a CollectLimit node instead of a HashAggregate, scans one partition first, and stops as soon as it has collected enough rows.

Plan comparison:

Action	Simplified Plan	Type	Behavior
count()	HashAggregate → Exchange → FileScan	Global	Full scan, wide dependency
limit(1)	CollectLimit → FileScan	Local	Early stop, narrow dependency
head(1)	CollectLimit → FileScan	Local	Early stop, single task

Avoid using count() to check emptiness since it triggers a full scan. Use limit(1) or head(1) for lightweight existence checks, and reserve count() for cases where the total is actually required, because Spark will process all the data unless it is given a reason to stop.

Other alternatives:

Method	Behavior
df.take(1)	Same as head(1); returns a list with at most one Row
df.first()	Returns the first Row, or None if the DataFrame is empty
df.isEmpty()	Returns True if the DataFrame has no rows (PySpark 3.3+)
df.rdd.isEmpty()	RDD-level check
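
The head(1) check can be wrapped in a small helper so that call sites never reach for count(). This is only a sketch: has_rows is a hypothetical name, and the function assumes nothing beyond a head(n) method that returns a list of at most n rows, which is how PySpark DataFrames behave.

```python
def has_rows(df):
    """Return True if `df` contains at least one row.

    Relies only on DataFrame.head(n), which returns a list of at most n Rows,
    so Spark can stop after the first row instead of counting everything.
    """
    return len(df.head(1)) > 0
```

A guard such as "if has_rows(df):" then replaces "if df.count() > 0:" without changing the surrounding logic.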

Scenario 9: Window Wisely

Window functions (rank(), dense_rank(), lag(), avg() with over(), and so on) are essential in analytics. They let you calculate running totals, rankings, or time-based metrics.

But in Spark, they’re not cheap, because they rely on shuffles and ordering.

Each window operation:

Requires all rows for the same partition key to be co-located on the same node.
Requires sorting those rows by the orderBy() clause within each partition.

If you omit partitionBy(), Spark treats the entire dataset as one partition: every row is shuffled to a single task and sorted there. A key with only a handful of very large groups causes a milder version of the same bottleneck.

Global Window: The Wrong Way

Let’s compute employee rankings by salary without partitioning:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window_spec = Window.orderBy(col("salary").desc())

df_ranked = df.withColumn("salary_rank", rank().over(window_spec))

Simplified Logical Plan:

Window [rank() windowspecdefinition(orderBy=[salary DESC]) AS salary_rank]

└─ Sort [salary DESC], false

   └─ Exchange SinglePartition

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

Spark moves the entire dataset into a single partition and sorts it there. All data must cross the network to reach one task, which then does the whole sort and ranking while the rest of the cluster sits idle.

Partition by a Selective Key: The Better Way

Most analytics don’t need a global ranking. You likely want rankings within a department or group, not across the entire company.

window_spec = Window.partitionBy("department").orderBy(col("salary").desc())

df_ranked = df.withColumn("salary_rank", rank().over(window_spec))

Now Spark builds separate windows per department. Rows are shuffled by department and each group is sorted independently, so the work is spread across tasks instead of landing on one.

Simplified Logical Plan:

Window [rank() windowspecdefinition(partitionBy=[department], orderBy=[salary DESC]) AS salary_rank]

└─ Sort [department ASC, salary DESC], false

   └─ Exchange hashpartitioning(department, 200)

      └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

The Exchange now hash-partitions data by department. Each partition is sorted on its own and in parallel, which means fewer sort comparisons per task and a smaller risk of spilling to disk.

Real-World Benchmark: AWS Glue

We can run both window functions on the same 1-million-row dataset:

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age",
 "hire_date", "country"])

start = time.time()
window_global = Window.orderBy(col("salary").desc())
df_global = df.withColumn("salary_rank", rank().over(window_global))
df_global.count()
global_time = round(time.time() - start, 2)
print(f'global_time:{global_time}')

start = time.time()
window_local = Window.partitionBy("department").orderBy(col("salary").desc())
df_local = df.withColumn("salary_rank", rank().over(window_local))
df_local.count()
local_time = round(time.time() - start, 2)
print(f'local_time:{local_time}')

spark.stop()

Approach	Stage Count	Execution Time (1M rows)	Observation
Global Window (no partition)	5	30.21 s	Full dataset shuffle + global sort
Partitioned Window (by department)	3	1.74 s	Localized sort, fewer shuffle files

Partitioning the window significantly reduces both the data handled by any single task and the overall runtime. The gap widens as data grows, because a global window funnels every row through one partition.

Under the Hood: What Spark Actually Does

Each Window transformation adds a physical plan node like:

WindowExec [rank() windowspecdefinition(...)], frame=RangeFrame

This node is non-pipelined: it materializes input partitions before computing window metrics. The Catalyst optimizer generally can’t push filters or projections through WindowExec, which means:

If you rank before filtering, Spark computes ranks for all rows.
If you order globally, Spark must sort everything before starting.

That’s why window placement in your code matters almost as much as partition keys.

Common Anti-Patterns:

Anti-Pattern	Why It Hurts	Fix
Missing partitionBy()	Global sort across dataset	Partition by key columns
Overly broad partition key	Creates too many small partitions	Use selective, not unique keys
Wide, unbounded window frame	Retains all rows in memory per key	Use bounded ranges (for example, rowsBetween(-3, 0))
Filtering after window	Computes unnecessary metrics	Filter first, then window
Multiple chained windows	Each triggers new sort	Combine window metrics in one spec

Partition on selective keys to reduce shuffle volume, and avoid global windows that force full sorts and shuffles. Prefer bounded frames to keep state in memory and limit disk spill, and filter early while combining metrics to minimize unnecessary data flowing through WindowExec. Windows are powerful, but unbounded ones can silently crush performance. In Spark, partitioning isn’t optional. It’s the line between analytics and overhead.

Scenario 10: Incremental Aggregations with Cache and Persist

When multiple actions depend on the same expensive base computation, don’t recompute it every time. Materialize it once with cache() or persist(), then reuse it. Most Spark teams get this wrong in two ways:

They never cache, so Spark recomputes long lineages (filters, joins, window ops) for every action.
They cache everything, blowing executor memory and making things worse.

This scenario shows how to do it intelligently.

The Problem: Recomputing the Same Work for Every Metric

from pyspark.sql.functions import col, avg, max as max_, count

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 70000)
)

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_salary = base.groupBy("department").agg(count("*").alias("cnt"))

Looks totally fine at a glance. But remember: Spark is lazy.
Every time you trigger an action:

avg_salary.show()
max_salary.show()
cnt_salary.show()

Spark walks back to the same base definition and re-runs all filters and shuffles for each metric – unless you persist.

So instead of 1 filtered + shuffled dataset reused 3 times, you effectively get:

3 jobs
3 scans / filter chains
3 groupBy shuffles

for the same input slice.

Simplified Logical Plan Shape (Without Cache):

HashAggregate [department], [avg/max/count]

└─ Exchange hashpartitioning(department)

   └─ Filter (department = 'Engineering' AND country = 'USA' AND salary > 70000)

      └─ Scan ...

And Spark builds this three times. Even though the filter logic is identical, each action triggers a new job with:

new stages,
new shuffles, and
new scans.

On large datasets (hundreds of GBs), this is brutal.

The Better Approach: Cache the Shared Base

from pyspark import StorageLevel

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 70000)
)

base = base.persist(StorageLevel.MEMORY_AND_DISK)

base.count()

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_salary = base.groupBy("department").agg(count("*").alias("cnt"))

avg_salary.show()
max_salary.show()
cnt_salary.show()

base.unpersist()

Now, the filters and initial scan run once, the results are cached, and all subsequent aggregates read from cached data instead of recomputing upstream logic.

Logical Plan Shape (With Cache):

Before materialization (base.count()), the plan still shows the lineage. Afterward, subsequent actions operate off the cached node.

InMemoryRelation [department, salary, country, ...]

   └─ * Cached from:

      Filter (department = 'Engineering' AND country = 'USA' AND salary > 70000)

      └─ Scan parquet employees_large ...

Then:

HashAggregate [department], [avg/max/count]

└─ InMemoryRelation [...]

One heavy pipeline, many cheap reads. The DAG becomes flatter:

Expensive scan & filter: once.
Cheap aggregations: N times from memory/disk.

Real-World Benchmark: AWS Glue

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age",
"hire_date", "country"])

base = (
    df.filter(col("department") == "Engineering")
      .filter(col("country") == "USA")
      .filter(col("salary") > 85000)
)


start = time.time()

avg_salary = base.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary = base.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt = base.groupBy("department").agg(count("*").alias("emp_count"))

print("---- Without Cache ----")
avg_salary.show()
max_salary.show()
cnt.show()

no_cache_time = round(time.time() - start, 2)
print(f"Total time without cache: {no_cache_time} seconds")


from pyspark import StorageLevel  # needed for persist() below

base_cached = base.persist(StorageLevel.MEMORY_AND_DISK)
base_cached.count()  # materialize cache

start = time.time()

avg_salary_c = base_cached.groupBy("department").agg(avg("salary").alias("avg_salary"))
max_salary_c = base_cached.groupBy("department").agg(max_("salary").alias("max_salary"))
cnt_c = base_cached.groupBy("department").agg(count("*").alias("emp_count"))

print("---- With Cache ----")
avg_salary_c.show()
max_salary_c.show()
cnt_c.show()

cache_time = round(time.time() - start, 2)
print(f"Total time with cache: {cache_time} seconds")

# Cleanup
base_cached.unpersist()

print("\n==== Summary ====")
print(f"Without cache: {no_cache_time}s | With cache: {cache_time}s")
print("=================")

spark.stop()

Approach	Execution Time (1M rows)
Without Cache	30.75 s
With Cache	3.34 s

Under the Hood: Why This Works

Using cache() or persist() in Spark inserts an InMemoryRelation / InMemoryTableScanExec node so that expensive intermediate results are stored in executor memory (or memory+disk). This allows future jobs to reuse cached blocks instead of re-scanning sources or re-computing shuffles. This shortens downstream logical plans, reduces repeated shuffles, and lowers load on systems like S3, HDFS, or JDBC.

Without caching, every action replays the full lineage and Spark recomputes the data unless another operator or AQE optimization has already materialized part of it. But caching should not become “cache everything”. Rather, you should avoid caching very large DataFrames used only once, wide raw inputs instead of filtered/aggregated subsets, or long-lived caches that are never unpersisted.

A good rule of thumb is to cache only when the DataFrame is expensive to recompute (joins, filters, windows, UDFs), is used at least twice, and is reasonably sized after filtering so it can fit in memory or work with MEMORY_AND_DISK. Otherwise, allow Spark to recompute.

Conceptually, caching converts a tall, repetitive DAG such as repeated “HashAggregate → Exchange → Filter → Scan” sequences into a hub-and-spoke design where one heavy cached hub feeds multiple lightweight downstream aggregates.

When multiple actions depend on the same expensive computation, cache or persist the shared base to flatten the DAG, eliminate repeated scans, and improve end-to-end performance. Be intentional about it: cache only when reuse is real and the data size is safe, and always call unpersist() when done.
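
One way to make the "always unpersist" rule hard to forget is a context manager. The sketch below is an illustration, not a PySpark feature: cached is a hypothetical helper, and it assumes only that the object exposes persist(), count(), and unpersist() the way a PySpark DataFrame does.

```python
from contextlib import contextmanager


@contextmanager
def cached(df, storage_level=None):
    """Persist `df` for the duration of a `with` block, then always unpersist.

    If `storage_level` is None, persist() falls back to its own default level.
    """
    persisted = df.persist(storage_level) if storage_level is not None else df.persist()
    try:
        persisted.count()  # materialize the cache once, up front
        yield persisted
    finally:
        persisted.unpersist()  # runs even if the block raises


# Usage sketch (assumes `base`, `avg`, and `max_` from the example above):
# with cached(base) as hot:
#     hot.groupBy("department").agg(avg("salary")).show()
#     hot.groupBy("department").agg(max_("salary")).show()
```

Because the release happens in a finally clause, the cached blocks are freed even when one of the downstream aggregations fails.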

Don’t make Spark re-solve the same puzzle three times. Let it solve it once, remember the answer, and move on.

Scenario 11: Reduce Shuffles

Shuffles are Spark’s invisible tax collectors. Every time your data crosses executors, you pay in CPU, disk I/O, and network bandwidth.

Two of the most common yet misunderstood transformations that trigger or avoid shuffles are coalesce() and repartition(). Both change partition counts, but they do it in fundamentally different ways.

The Problem

It’s tempting to write df_result = df.repartition(10) and think, “I’m just changing the partition count, so Spark won’t move data unnecessarily.” That assumption is wrong. repartition() always performs a full shuffle, even when:

You are reducing partitions (from 200 → 10), or
You are increasing partitions (from 10 → 200).

In both cases, Spark redistributes every row across the cluster according to a new partitioning scheme (round-robin when you pass only a number, hash-based when you pass columns). So even if your data is already partitioned optimally, repartition() will still reshuffle it, adding a stage boundary.

Logical Plan:

Exchange RoundRobinPartitioning(10)

└─ LogicalRDD [...]

That Exchange node signals a wide dependency: Spark spills intermediate data to disk, transfers it over the network, and reloads it before the next stage. In short: repartition() = "new shuffle, no matter what."

The Better Approach: coalesce()

If your goal is to reduce the number of partitions (for example, before writing results to S3 or Snowflake), use coalesce() instead.

df_result = df.coalesce(10)

coalesce() merges existing partitions without a shuffle. It uses a narrow dependency: each output partition is built from one or more existing partitions, preferably ones already on the same node. The trade-off is that the merged partitions can be uneven, and the stage that feeds coalesce(10) runs with only 10 tasks.

Coalesce

└─ LogicalRDD [...]

No Exchange.
No network shuffle.
Just local merges – fast and cheap.

Real-World Benchmark: AWS Glue

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"])

start = time.time()
df_repart = df.repartition(10)
df_repart.count()
print("Repartition time:", round(time.time() - start, 2), "sec")

start = time.time()
df_coalesced = df.coalesce(10)
df_coalesced.count()
print("Coalesce time:", round(time.time() - start, 2), "sec")

spark.stop()

Operation	Plan Node	Shuffle Triggered	Glue Runtime	Observation
repartition(10)	Exchange	Yes	18.2 s	Full cluster reshuffle
coalesce(10)	Coalesce	No	1.99 s	Local partition merge only

Even though both ended with 10 partitions, repartition() took significantly longer, all because of the unnecessary shuffle.

Why This Matters

Each Exchange node in your logical plan creates a new stage in your DAG, meaning:

Extra disk I/O
Extra serialization
Extra network transfer

That’s why avoiding just one shuffle in a Glue ETL pipeline can save seconds to minutes per run, especially on wide datasets.

When to use which:

Goal	Transformation	Reasoning
Increase parallelism for heavy groupBy or join	repartition()	Distributes data evenly across executors
Reduce file count before writing	coalesce()	Avoids shuffle, merges partitions locally
Rebalance skewed data before a join	repartition("key")	Enables better key distribution
Optimize output after aggregation	coalesce()	Prevents too many small output files
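
The table above can be turned into a small decision helper. This is a hedged sketch rather than an established API: resize_partitions is a hypothetical name, and it assumes the classic PySpark DataFrame interface in which df.rdd.getNumPartitions(), df.coalesce(n), and df.repartition(n) are available.

```python
def resize_partitions(df, target):
    """Return `df` with `target` partitions, shuffling only when unavoidable.

    Shrinking uses coalesce() (narrow dependency, no Exchange); growing uses
    repartition() (full shuffle), because coalesce() cannot add partitions.
    """
    if target < 1:
        raise ValueError("target must be a positive integer")
    current = df.rdd.getNumPartitions()
    if target < current:
        return df.coalesce(target)
    if target > current:
        return df.repartition(target)
    return df  # already at the requested count
```

The helper deliberately ignores skew: when the goal is to rebalance uneven partitions rather than just change their number, an explicit repartition() is still the right call.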

AQE and Auto Coalescing

You can enable Adaptive Query Execution (AQE) in AWS Glue 3.0+ to let Spark merge small shuffle partitions automatically:

spark.conf.set("spark.sql.adaptive.enabled", "true")

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

With AQE, Spark dynamically combines small partitions after shuffle to balance performance and I/O.

repartition() always triggers a shuffle, while coalesce() avoids shuffles and is ideal for local merges before writes. You should always inspect Exchange nodes to identify shuffle points. Note that in the AWS Glue benchmark above, avoiding a single shuffle yielded a ~9× runtime improvement at the 1M-row scale. Finally, use AQE to enable dynamic partition coalescing in larger workflows.

Scenario 12: Know Your Shuffle Triggers

Much of Spark's performance comes from invisible data movement. Every shuffle boundary adds a new stage, a new write–read cycle, and sometimes minutes of extra execution time.

In Spark, any operation that requires rearranging data between partitions introduces a wide dependency, which shows up in the physical plan (the output of explain()) as an Exchange node.

Common shuffle triggers:

Operation	Why It Shuffles	Plan Node
join()	Records with the same key must be co-located for matching	Exchange (on join keys)
groupBy() / agg()	Keys must gather to a single partition for aggregation	Exchange
distinct()	Spark must compare all values across partitions	Exchange
orderBy()	Requires global ordering of data	Exchange
repartition()	Explicit reshuffle for partition balancing	Exchange

Each Exchange means a shuffle stage: Spark writes partition data to disk, transfers it over the network, and reads it back into memory on the next stage. That’s your hidden performance cliff.

from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

df_result = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
      .join(df.select("department", "country").distinct(), "department")
      .orderBy("total_salary", ascending=False)
)

df_result.explain("formatted")

Logical Plan Simplified:

Sort [total_salary DESC]

└─ Exchange (global sort)

   └─ SortMergeJoin [department]

      ├─ Exchange (groupBy shuffle)

      │   └─ HashAggregate (sum salary)

      └─ Exchange (distinct shuffle)

          └─ Aggregate (department, country)

We can see three Exchange nodes: one for the aggregation, one for the distinct that feeds the join, and one for the global sort. That’s three separate shuffles, each moving data across the cluster.

Better Approach

Whenever possible, combine wide transformations into a single stage before an action. For instance, you can compute aggregates and join results in one consistent shuffle domain:

from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

agg_df = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
)

country_df = df.select("department", "country").distinct()

df_result = (
    agg_df.join(country_df, "department")
          .sortWithinPartitions("total_salary", ascending=False)
)

Logical Plan Simplified:

SortWithinPartitions [total_salary DESC]

└─ SortMergeJoin [department]

   ├─ Exchange (shared shuffle for join)

   └─ Exchange (shared shuffle for distinct)

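The shuffle for the global sort is gone: sortWithinPartitions() orders rows inside each existing partition, so only the shuffles needed for the aggregation, the distinct, and the join remain. Keep in mind that the output is sorted per partition, not globally, so use this only when a total order isn’t required.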

Real-World Benchmark: AWS Glue (1M)

df = spark.createDataFrame(multiplied_data,
["id", "firstname", "lastname", "department", "salary", "age", "hire_date", "country"]).repartition(20)

from pyspark.sql.functions import sum as sum_

start = time.time()

dept_salary = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
)

dept_country = (
    df.select("department", "country")
      .distinct()
)

naive_result = (
    dept_salary.join(dept_country, "department", "inner")
               .orderBy(col("total_salary").desc())
)

naive_count = naive_result.count()
naive_time = round(time.time() - start, 2)

print("Naive result count:", naive_count)
print("Naive pipeline time:", naive_time, "sec")


start = time.time()

dept_country_once = (
    df.select("department", "country")
      .distinct()
)

optimized = (
    df.groupBy("department")
      .agg(sum_("salary").alias("total_salary"))
      .join(dept_country_once, "department", "inner")
      .sortWithinPartitions(col("total_salary").desc())
      # local ordering, avoids extra global shuffle
)

opt_count = optimized.count()
opt_time = round(time.time() - start, 2)

print("Optimized result count:", opt_count)
print("Optimized pipeline time:", opt_time, "sec")

print("\nOptimized plan:")
optimized.explain("formatted")

spark.stop()

Pipeline	# of Shuffles	Glue Runtime (sec)	Observation
Naive: groupBy + distinct + orderBy	3	28.99 s	Multiple wide stages
Optimized: combined agg + join + sortWithinPartitions	1	3.52 s	Single wide stage

By merging compatible stages and using sortWithinPartitions() instead of a global orderBy(), the job ran significantly faster on the same dataset, with fewer Exchange nodes and a shorter lineage. To find shuffles in your own jobs, run df.explain() and search for Exchange: each one signals a shuffle. You can also check Spark UI → SQL tab → Exchange Read/Write Size to see exactly how much data moved.
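
Searching a plan for Exchange is easy to automate. The following sketch assumes that explain() prints the plan through Python's standard output, as PySpark's DataFrame.explain does. Both explain_text and count_exchanges are hypothetical helper names, and the count is meant for the default explain mode, where each operator appears on one line.

```python
import contextlib
import io
import re

# Matches shuffle Exchange nodes, but not BroadcastExchange or ReusedExchange,
# because those names have no word boundary before "Exchange".
_EXCHANGE = re.compile(r"\bExchange\b")


def explain_text(df):
    """Capture what df.explain() prints and return it as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_exchanges(plan_text):
    """Count the plan lines that contain a shuffle Exchange node."""
    return sum(1 for line in plan_text.splitlines() if _EXCHANGE.search(line))
```

Calling count_exchanges(explain_text(df_result)) before and after a refactor gives a quick signal of whether a change actually removed a shuffle.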

Every Exchange represents a shuffle, adding serialization, network I/O, and stage overhead, so avoid chaining wide operations back-to-back by combining them under a consistent partition key. Prefer sortWithinPartitions() over global orderBy() when ordering is local, monitor plan depth to catch consecutive wide dependencies, and note that in AWS Glue eliminating even one shuffle in a 1M-row job can significantly reduce runtime.

Scenario 13: Tune Parallelism: Shuffle Partitions & AQE

Most Spark jobs are either over-parallelized (thousands of tiny tasks doing almost nothing, flooding the driver and filesystem) or under-parallelized (a handful of huge tasks doing all the work, causing slow stages and skew-like behavior). Both waste resources. We can control this behavior using spark.sql.shuffle.partitions and Adaptive Query Execution (AQE).

In many environments, spark.sql.shuffle.partitions defaults to 200 (you can check with spark.conf.get("spark.sql.shuffle.partitions")). That means every shuffle (groupBy, join, distinct, and so on) produces roughly 200 shuffle partitions, regardless of data size. Whether this default is reasonable depends entirely on the workload:

If you’re processing 2 GB, 200 partitions might be great.
If you’re processing 5 MB, 200 partitions is comedy – 200 tiny tasks, overhead > work.
If you’re processing 2 TB, 200 partitions might be too few – tasks become huge and slow.
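
The reasoning behind these three cases is simple arithmetic: divide the amount of shuffled data by a per-partition size you are comfortable with. The sketch below makes that explicit. Note the assumptions: suggest_shuffle_partitions is a hypothetical helper, and the 128 MiB default target is an illustrative rule of thumb, not a Spark default and not a value prescribed by this handbook. A smaller target yields more, smaller tasks.

```python
import math

MiB = 1024 ** 2


def suggest_shuffle_partitions(shuffle_bytes, target_bytes=128 * MiB,
                               floor=1, ceiling=None):
    """Estimate a shuffle partition count from the shuffled data size.

    target_bytes is an ASSUMED per-partition size, chosen for illustration.
    floor and ceiling clamp the result to a sensible range.
    """
    if shuffle_bytes < 0 or target_bytes <= 0:
        raise ValueError("shuffle_bytes must be >= 0 and target_bytes > 0")
    partitions = max(floor, math.ceil(shuffle_bytes / target_bytes))
    return partitions if ceiling is None else min(partitions, ceiling)
```

The result could be applied with spark.conf.set("spark.sql.shuffle.partitions", str(n)), although the shuffle size itself has to be estimated, for example from the Spark UI of a previous run.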

Example A: The Default Plan (Too Many Tiny Tasks)

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("ParallelismExample").getOrCreate()

spark.conf.get("spark.sql.shuffle.partitions")  # '200'

data = [
    (1, "John", "Engineering", 90000),
    (2, "Alice", "Engineering", 85000),
    (3, "Bob", "Sales", 75000),
    (4, "Eve", "Sales", 72000),
    (5, "Grace", "HR", 65000),
]

df = spark.createDataFrame(data, ["id", "name", "department", "salary"])

agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.explain("formatted")

Even though there are only 3 departments, Spark will still create 200 shuffle partitions – meaning 200 tasks for 3 groups of data.

Effect: Each task has almost nothing to do. Spark spends more time planning and scheduling than actually computing.

Example B: Tuned Plan (Balanced Parallelism)

spark.conf.set("spark.sql.shuffle.partitions", "8")
agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.explain("formatted")

Now Spark creates only 8 shuffle partitions: still parallel, but not wasteful. Even in this small example, you can see the difference: one configuration change, and a much leaner physical plan.

The Real Problem: Static Tuning Doesn’t Scale

In production, job sizes vary:

Today: 10 GB
Tomorrow: 500 GB
Next week: 200 MB (sampling run)

Manually changing shuffle partitions for each run is neither practical nor reliable. That’s where Adaptive Query Execution (AQE) steps in.

Adaptive Query Execution (AQE): Smarter, Dynamic Parallelism

AQE doesn’t guess. It measures actual shuffle statistics at runtime and rewrites the plan while the job is running.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "64m")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "256m")  # target partition size after coalescing

Configuration	Shuffle Partitions	Task Distribution	Observation
Default	200	200 tasks / 3 groups	Too granular, mostly idle
Tuned	8	8 tasks / 3 groups	Balanced execution

AQE merges tiny shuffle partitions, or splits huge ones, based on real-time data metrics, not pre-set assumptions.

df = spark.createDataFrame(multiplied_data,
    ["id", "firstname", "lastname", "department", "salary", "age",
     "hire_date", "country"])

start = time.time()
agg_df = df.groupBy("department").agg(sum_("salary").alias("total_salary"))
agg_df.count()

print(f'Num Partitions df: {df.rdd.getNumPartitions()}')
print(f'Num Partitions aggdf: {agg_df.rdd.getNumPartitions()}')
print("Execution time:", round(time.time() - start, 2), "sec")

spark.stop()

Stage	Without AQE	With AQE
Stage 3 (Aggregation)	200 shuffle partitions, each reading KBs	8–12 coalesced partitions
Stage 4 (Join Output)	200 shuffle files	Merged into balanced partitions
Result	Many small tasks, high overhead	Fewer, balanced tasks, faster runtime

Understanding the Plan

Before AQE (static):

Exchange hashpartitioning(department, 200)

With AQE: AdaptiveSparkPlan (coalesced)

HashAggregate(keys=[department], functions=[sum(salary)])

Exchange hashpartitioning(department, 200) # runtime coalesced to 12

The logical plan remains the same, but the physical execution plan is rewritten during runtime. Spark intelligently reduces or merges shuffle partitions based on data volume.
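The idea behind coalescing can be modeled in a few lines of plain Python. This is a simplified sketch, not Spark's actual algorithm: it greedily merges adjacent shuffle partitions until adding the next one would exceed an advisory size. The partition sizes are made up for illustration.

```python
MIB = 1024 ** 2

def coalesce_partitions(sizes, advisory_bytes):
    """Greedily merge adjacent partitions up to an advisory size.

    Returns a list of groups, each a list of original partition indexes.
    """
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        # Close the current group if this partition would push it over the limit
        if current and current_size + size > advisory_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical shuffle output: 197 tiny partitions plus three larger ones
sizes = [2 * MIB] * 197 + [40 * MIB, 60 * MIB, 30 * MIB]
groups = coalesce_partitions(sizes, advisory_bytes=64 * MIB)
print(len(sizes), "partitions coalesced into", len(groups))
```

The static plan still says 200 partitions, but far fewer tasks run once the real sizes are known, which is the behavior the annotated Exchange node above describes.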

Spark’s default 200 shuffle partitions often misfit real workloads. Static tuning may work for predictable pipelines, but fails with variable data. AQE, on the other hand, uses shuffle statistics to dynamically coalesce partitions at runtime. Use it with sensible ceilings (for example, 400 partitions), and always verify in the Spark UI to catch over-partitioning (many tasks reading KBs) or under-partitioning (few tasks reading GBs).

Scenario 14: Handle Skew Smartly

In an ideal Spark world, all partitions contain roughly equal amounts of data. But real datasets are rarely that kind. If one key (say "USA", "2024", or "customer_123") holds millions of rows while others have only a few, Spark ends up with one or two massive partitions. Those partitions take disproportionately longer to process, leaving other executors idle. That’s data skew: the silent killer of parallelism.

You’ll often spot it in Spark UI:

198 tasks finish quickly.
2 tasks take 10× longer.
Stage stays stuck at 98% for minutes.

Example A: The Skew Problem

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("DataSkewDemo").getOrCreate()

# Create skewed dataset
df = spark.range(0, 10000).toDF("id") \
    .withColumn("department",
        F.when(F.col("id") < 8000, "Engineering")  # 80% of data
         .when(F.col("id") < 9000, "Sales")
         .otherwise("HR")) \
    .withColumn("salary", (F.rand() * 100000).cast("int"))

df.groupBy("department").count().show()

Spark will hash “Engineering” into just one reducer partition, making it far heavier than the others. That single task becomes a bottleneck: the rest of the stage has finished, but the stage cannot complete until that one lagging task does.

Example B: The Solution: Salting Hot Keys

To handle skew, we split the hot key (Engineering) into multiple pseudo-keys using a random salt. This redistributes that large partition across multiple reducers.

from pyspark.sql import functions as F

salt_buckets = 10

df_salted = (
    df.withColumn(
        "department_salted",
        # Only the hot key gets a random suffix; other keys stay unchanged
        F.when(F.col("department") == "Engineering",
            F.concat(F.col("department"), F.lit("_"),
                     F.floor(F.rand() * salt_buckets).cast("string")))
         .otherwise(F.col("department"))
    )
)

df_salted.groupBy("department_salted").agg(F.avg("salary"))

Now “Engineering” isn’t one hot key – it’s 10 smaller keys like Engineering_0, Engineering_1, ..., Engineering_9. Each one goes to a separate reducer partition, enabling parallel processing.
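You can check the effect of salting without a cluster. The sketch below rebuilds the same 10,000-row key distribution in plain Python. Two things are assumptions made for reproducibility: zlib.crc32 stands in for Spark's hash partitioner, and a round-robin salt (id % 10) replaces rand(). The partition count of 8 is also arbitrary.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 8   # arbitrary for this demo
SALT_BUCKETS = 10

def department_for(row_id):
    # Same skew as Example A: 80% Engineering, 10% Sales, 10% HR
    if row_id < 8000:
        return "Engineering"
    if row_id < 9000:
        return "Sales"
    return "HR"

def partition_of(key):
    # Stand-in for Spark's hash partitioner; crc32 keeps the demo deterministic
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def salted_key(row_id):
    dept = department_for(row_id)
    if dept == "Engineering":
        # Round-robin salt instead of rand(), purely so results are reproducible
        return f"{dept}_{row_id % SALT_BUCKETS}"
    return dept

rows_per_key_plain = Counter(department_for(i) for i in range(10000))
rows_per_key_salted = Counter(salted_key(i) for i in range(10000))
rows_per_partition_plain = Counter(partition_of(department_for(i)) for i in range(10000))
rows_per_partition_salted = Counter(partition_of(salted_key(i)) for i in range(10000))

print("heaviest key without salt:", max(rows_per_key_plain.values()))
print("heaviest key with salt:   ", max(rows_per_key_salted.values()))
print("rows per partition without salt:", dict(rows_per_partition_plain))
print("rows per partition with salt:   ", dict(rows_per_partition_salted))
```

Without salt, every Engineering row shares one key, so one partition must hold at least 8,000 rows. With salt, no single key carries more than 1,000 rows, which gives the partitioner room to spread the load.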

Example C: Post-Aggregation Desalting

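After aggregating, recombine the salted keys to get the original department names. Carry sums and counts through the first stage rather than averages: an average of per-bucket averages is only exact when every bucket holds the same number of rows.

df_final = (
    df_salted.groupBy("department_salted")
        # Stage 1: partial sums and counts per salted key
        .agg(F.sum("salary").alias("sum_salary"),
             F.count("salary").alias("row_count"))
        # Strip the salt suffix to recover the original key
        .withColumn("department", F.split(F.col("department_salted"), "_")
            .getItem(0))
        # Stage 2: recombine the partials into an exact average
        .groupBy("department")
        .agg((F.sum("sum_salary") / F.sum("row_count")).alias("final_avg_salary"))
)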
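One detail deserves care when you desalt an average: random salts rarely produce equal-sized buckets, and an average of bucket averages then drifts from the true value. Sums and counts, by contrast, recombine exactly. A tiny illustration with made-up salaries:

```python
# Two salted buckets of unequal size (hypothetical data)
buckets = {
    "Engineering_0": [100_000, 120_000, 110_000],  # 3 rows
    "Engineering_1": [50_000],                     # 1 row
}

# Naive desalting: average the per-bucket averages
bucket_avgs = [sum(rows) / len(rows) for rows in buckets.values()]
avg_of_avgs = sum(bucket_avgs) / len(bucket_avgs)

# Exact desalting: recombine partial sums and counts
total = sum(sum(rows) for rows in buckets.values())
count = sum(len(rows) for rows in buckets.values())
exact_avg = total / count

print("average of averages:", avg_of_avgs)
print("sum / count:        ", exact_avg)
```

The small bucket gets the same weight as the large one in the naive version, which is why the two numbers disagree.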

When to Use Salting

Use salting when:

You observe stage skew (one or few long tasks).
Shuffle read sizes vary drastically between tasks.
The skew originates from a few dominant key values.

Avoid it when:

The dataset is small (< 1 GB).
You already use partitioning or bucketing keys with uniform distribution.

Alternative approaches:

Technique	Use Case	Pros	Cons
Salting (manual)	Skewed joins/aggregations	Full control	Requires extra logic to merge
Skew join hints (/*+ SKEWJOIN */)	Supported joins in Spark 3+	No extra columns needed	Works only on joins
Broadcast smaller side	One table ≪ other	Avoids shuffle on big side	Limited by broadcast size
AQE skew optimization	Spark 3.0+	Automatic handling	Needs AQE enabled

Glue-Specific Tip

AWS Glue 3.0+ includes Spark 3.x, meaning you can also enable AQE’s built-in skew optimization:

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "128m")

Spark will automatically detect oversized shuffle partitions in joins and split them, effectively auto-salting hot keys at runtime. Note that this optimization targets joins; skewed aggregations still need manual salting.

Data skew causes uneven shuffle sizes across tasks and can be detected in the Spark UI or via shuffle read/write metrics. Mitigate heavy-key skew with manual salting (recombined later) or rely on AQE skew join optimization for mild cases, and always validate improvements in the Spark UI SQL tab by checking “Shuffle Read Size.”

Scenario 15: Sort Efficiently (orderBy vs sortWithinPartitions)

Most Spark jobs need sorted data at some point – for window functions, for writing ordered files, or for downstream processing. The instinct is to reach for orderBy(). But that instinct costs you a full shuffle every single time.

The Problem: Global Sort When You Don't Need It

Let's say you want to write employee data partitioned by department, sorted by salary within each department:

from pyspark.sql.functions import col

# Naive approach: global sort
df_sorted = df.orderBy(col("department"), col("salary").desc())

df_sorted.write.partitionBy("department").parquet("s3://output/employees/")

This looks reasonable. You're sorting by department and salary, then writing partitioned files. Clean and simple. But here's what Spark actually does:

Simplified Physical Plan:

Sort [department ASC, salary DESC], true
└─ Exchange rangepartitioning(department ASC, salary DESC, 200)
   └─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

That Exchange rangepartitioning is a full shuffle. So Spark:

Samples the data to determine range boundaries
Redistributes every row across 200 partitions based on sort keys
Sorts each partition locally
Produces globally ordered output

You just shuffled 1 million rows across the cluster to achieve global ordering, even though the write immediately splits the output by department, so ordering across departments buys you nothing.
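The difference between the two kinds of sort is easy to model with plain lists. In this sketch, each inner list stands for one Spark partition and the integers stand for salary values. It is a toy model of the four steps above, not Spark's implementation, and the function names are hypothetical.

```python
import bisect
import random

def local_sort(partitions):
    """Sort each partition in place: no row changes partition."""
    return [sorted(p) for p in partitions]

def global_sort(partitions, num_out, sample_size=20, seed=0):
    """Range-partitioned sort: sample, pick boundaries, redistribute, sort."""
    rows = [r for p in partitions for r in p]
    # 1. Sample the data to determine range boundaries
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    step = len(sample) / num_out
    bounds = [sample[int(step * i)] for i in range(1, num_out)]
    # 2. Redistribute every row to the partition that owns its range (the shuffle)
    out = [[] for _ in range(num_out)]
    for r in rows:
        out[bisect.bisect_left(bounds, r)].append(r)
    # 3. Sort each partition locally; 4. partitions are now globally ordered
    return [sorted(p) for p in out]

partitions = [[5, 1, 9], [8, 2, 7], [3, 6, 4]]
locally_sorted = local_sort(partitions)
globally_sorted = global_sort(partitions, num_out=3)

print("local: ", locally_sorted)
print("global:", globally_sorted)
```

The local version never moves a row between partitions. The global version has to touch every row twice (once to place it, once to sort it), and in a real cluster that placement step is network traffic.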

Why This Hurts

Range partitioning for global sort is one of the most expensive shuffles Spark performs:

Sampling overhead: Spark must scan data twice (once to sample, once to process)
Network transfer: Every row moves to a new executor based on range boundaries
Disk I/O: Shuffle files written and read from disk
Wasted work: Global ordering across departments is meaningless when you partition by department

In the 1M-row benchmark below, this added roughly 8 seconds of shuffle overhead.

The Better Approach: Sort Locally Within Partitions

If you only need ordering within each department (or within each output partition), use sortWithinPartitions():

# Optimized approach: local sort only
df_sorted = df.sortWithinPartitions(col("department"), col("salary").desc())
df_sorted.write.partitionBy("department").parquet("s3://output/employees/")

Simplified Logical Plan:

Sort [department ASC, salary DESC], false

└─ LogicalRDD [id, firstname, lastname, department, salary, age, hire_date, country]

No Exchange.
No shuffle.
Just local sorting within existing partitions.

Spark sorts each partition in-place, without moving data across the network. The false flag in the Sort node indicates this is a local sort, not a global one.

Real-World Benchmark: AWS Glue

Let's measure the difference on 1 million employee records. First, we'll start with a global sort using orderBy():

print("\n--- Testing orderBy() (global sort) ---")

start = time.time()

df_global = df.orderBy(col("department"), col("salary").desc())
df_global.write.mode("overwrite").parquet("/tmp/global_sort_output")

global_time = round(time.time() - start, 2)
print(f"orderBy() time: {global_time}s")

Local Sort:

print("\n--- Testing sortWithinPartitions() (local sort) ---")

start = time.time()

df_local = df.sortWithinPartitions(col("department"), col("salary").desc())
df_local.write.mode("overwrite").parquet("/tmp/local_sort_output")

local_time = round(time.time() - start, 2)
print(f"sortWithinPartitions() time: {local_time}s")

Approach	Plan Type	Execution Time (1M rows)	Observation
orderBy()	Exchange rangepartitioning	10.34 s	Full shuffle for global sort
sortWithinPartitions()	Local Sort (no Exchange)	2.18 s	In-place sorting, no network transfer

Physical Plan Differences:

orderBy() Physical Plan:

*(2) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, 200)
   +- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]
      +- *(1) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]

The Exchange rangepartitioning node marks the shuffle boundary. Spark must:

Sample data to determine range splits
Redistribute all rows across executors
Sort within each range partition

sortWithinPartitions() Physical Plan:

*(1) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, 0
+- *(1) Project [id, firstname, lastname, department, salary, age, hire_date, country]
   +- *(1) Scan ExistingRDD[id, firstname, lastname, department, salary, age, hire_date, country]

No Exchange. The false flag in Sort indicates local sorting only. Each partition is sorted independently, in parallel, without any data movement.
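If you review plans often, a small text check can flag shuffle boundaries for you. The helper below is a hypothetical sketch (find_exchanges is not a Spark API): it scans plan text you have copied from explain() output and reports the partitioning scheme of every Exchange node. The two plan strings are abbreviated from the plans above.

```python
import re

# Matches "Exchange <partitioning>" but not "BroadcastExchange"
EXCHANGE = re.compile(r"\bExchange\b\s+(\w+)")

def find_exchanges(plan_text):
    """Return the partitioning scheme of each Exchange node in a plan string."""
    return [m.group(1) for m in EXCHANGE.finditer(plan_text)]

order_by_plan = """
*(2) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(department ASC NULLS FIRST, salary DESC NULLS LAST, 200)
   +- *(1) Scan ExistingRDD[id, department, salary]
"""

sort_within_plan = """
*(1) Sort [department ASC NULLS FIRST, salary DESC NULLS LAST], false, 0
+- *(1) Scan ExistingRDD[id, department, salary]
"""

print("orderBy():              ", find_exchanges(order_by_plan))
print("sortWithinPartitions(): ", find_exchanges(sort_within_plan))
```

An empty list means the plan contains no shuffle. Any entry is a shuffle boundary worth questioning before the job reaches production.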

When to Use Which:

Use Case	Method	Why
Writing partitioned files (Parquet, Delta)	sortWithinPartitions()	Partition-level order is sufficient; global order wasted
Window functions with ROWS BETWEEN	sortWithinPartitions()	Only need order within each window partition
Top-N per group (rank, dense_rank)	sortWithinPartitions()	Ranking is local to each partition key
Final output must be globally ordered	orderBy()	Need total order across all partitions
Downstream system requires strict ordering	orderBy()	For example, time-series data for sequential processing
Sorting before coalesce() for fewer output files	sortWithinPartitions()	Maintains order within merged partitions

Common Anti-Pattern

df.orderBy("department", "salary") \
  .write.partitionBy("department") \
  .parquet("output/")

Problem: You're globally sorting by department, then immediately partitioning by department. Each department lands in its own directory, so the cross-department order you paid a full shuffle for is never used.

Here’s the fix:

df.sortWithinPartitions("department", "salary") \
  .write.partitionBy("department") \
  .parquet("output/")

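Or, if you also want all rows for a department handled by a single task, repartition by the partition column first. This swaps the range shuffle for a hash shuffle and avoids the sampling pass. (DataFrameWriter.sortBy() is not an option here: it only works together with bucketBy() and saveAsTable().)

# Co-locate each department, then sort locally by salary
df.repartition("department") \
    .sortWithinPartitions("department", "salary") \
    .write.partitionBy("department") \
    .parquet("output/")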

orderBy() triggers an expensive full shuffle using range partitioning, while sortWithinPartitions() sorts data locally without a shuffle; in the benchmark above it was 4–5× faster. Use sortWithinPartitions() when writing partitioned files, computing window functions with partitionBy(), or when order is needed only within groups. Reserve orderBy() strictly for true global ordering, because in most production ETL, the best sort is the one that doesn’t shuffle.

Conclusion

You began this handbook likely wondering why your Spark application was slow, and now you see that the answer was both clear and not so clear: your problem was never your Spark application, your configuration, or your version of Spark. It was your plan all along.

You now understand that Spark runs plans, not code, that transformation order affects logical plans, that shuffles generate stages and are key to runtime performance, and that examining your physical plans allows you to directly link your application performance issues back to your problematic line of code.

And you’ve seen this pattern repeat across many scenarios: problem, plan, solution, improved plan, and so forth, until optimization feels less like a dark art and more like a certainty.

This is the Spark optimization mindset: read plans before you write code, and challenge every single Exchange. Engineers who write high-performance Spark jobs minimize shuffles, filter early, project narrowly, deal with skew carefully, and validate everything via explain() and the Spark UI. Once you learn to read the plan, Spark performance becomes mechanical.

How to Create Boxplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Thu, 15 Jan 2026 18:48:32 +0000

In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.

By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.

Prerequisites
How to Set Up Your R Environment
How to Load and Inspect the Data
How to Clean and Prepare the Data
How to Use Boxplots
How to Create Boxplots with ggplot2
How to Perform Exploratory Data Analysis
How to Build Linear Regression Models
How to Build Logistic Regression Models
Why Visualization Comes Before Modeling
Conclusion

Prerequisites

Before you begin, you should be comfortable with the following:

Basic R syntax (variables, functions, data frames).
Installing and loading R packages.
Understanding what rows and columns represent in a dataset.
Very basic statistics (mean, median, distributions).

How to Set Up Your R Environment

Start by installing and loading the packages you will need.

install.packages(c("tidyverse", "ggplot2"))
library(tidyverse)
library(ggplot2)

tidyverse provides tools for data manipulation and visualization. ggplot2 is the visualization engine you will use for boxplots. Loading the libraries makes their functions available for use.

How to Load and Inspect the Data

First, download the HR Analytics dataset by Saad Haroon from Kaggle.

Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load the file into R using that path.

You can view a sample of the dataset by running the head function. To view the structure of the dataset, you can run the str function.

hr <- read.csv("C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv")
head(hr)
str(hr)

The read.csv function imports the dataset into R. The head function shows the first six rows so you can preview the data. The str function reveals data types, helping you spot categorical versus numeric variables early.

Remember that understanding your data structure early prevents errors later when plotting or modeling. Once you run the head function, you should see the following in your console:

From the head function, you can see:

Structure

Each row represents one employee.
Each column represents a feature/variable about the employee.

Key Columns & Meaning

EmpID → Employee identifier
Age → Age in years
AgeGroup → Age category (for example, 18-25)
Attrition → Whether the employee left or not (Yes/No)
BusinessTravel → Travel frequency (Travel_Rarely, Travel_Frequently, Non-Travel)
Department → Employee department
DistanceFromHome → Distance from home to office (km)
Education / EducationField → Level and field of education
EmployeeCount → Usually 1 per employee (redundant)
Gender → Male / Female
JobRole / JobSatisfaction → Job title and satisfaction level
MonthlyIncome / SalarySlab → Salary amount and category
YearsAtCompany / YearsInCurrentRole → Experience metrics
OverTime → Works overtime (Yes/No)
Other features: PerformanceRating, TrainingTimesLastYear, WorkLifeBalance, StockOptionLevel, and so on.

Data Types

Numeric → Age, DistanceFromHome, MonthlyIncome, YearsAtCompany
Categorical / Character → Attrition, Gender, Department, JobRole

Observations

The dataset is tabular, like a spreadsheet.
There are multiple categorical columns.
There are multiple numeric columns.
Some columns seem redundant or constant: they hold the same value in every row and so provide no useful information (for example, EmployeeCount).

From the str function, you can gather that:

The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.

Each column has a name, data type, and example values. For instance, Age and DistanceFromHome are numeric (int), with values like 28 or 12. EmpID and Department are character strings (chr), with examples like Research & Development or Sales. Other features include JobRole (Analyst, Manager) and Attrition (Yes/No).

The dataset contains mixed data types. Some columns are numeric, such as MonthlyIncome or YearsAtCompany. Some are character or categorical, like Gender (Male/Female) and BusinessTravel (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, EmployeeCount has the same value of 1 for all rows and does not provide useful information.

How to Clean and Prepare the Data

Before visualization, you must clean your data. In order to find out what you need to clean you can investigate the data.

Run the summary function to view the statistics of the dataset. You also need to run the is.na function to identify missing values to be removed.

summary(hr)
colSums(is.na(hr))

The summary function gives quick statistics and flags suspicious values. The is.na function checks for missing data. Boxplots are sensitive to extreme values, so knowing what you are working with is critical.

After running the summary function, the following will appear in your console:

This shows the basic statistics of each column. After running the is.na function, the following will also appear in your console:

From this output, you can see that only YearsWithCurrManager has missing values: 57 employees don’t have a value for this column.

You can drop this whole column along with the other redundant columns we saw earlier on. You can do this with the code below.

hr <- hr %>% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))

To verify if the columns are gone, use this code:

colnames(hr)

Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for Attrition), not free text.

hr$Attrition <- as.factor(hr$Attrition)
hr$JobRole <- as.factor(hr$JobRole)
hr$Department <- as.factor(hr$Department)

This also ensures ggplot2 treats them correctly when grouping.

How to Use Boxplots

A boxplot displays key features of a dataset. The median is shown by the line in the middle of the box. The interquartile range is represented by the box itself while the whiskers show the spread of the data. Outliers appear as individual points.
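The numbers behind a boxplot are simple arithmetic and do not depend on R. As a rough sketch in Python, using made-up values and the 1.5 × IQR whisker convention that geom_boxplot() uses by default (the boxplot_stats helper is hypothetical):

```python
import statistics

def boxplot_stats(values):
    """Compute the quantities a boxplot draws, using the 1.5 * IQR rule."""
    # method="inclusive" interpolates the same way as R's default quantile()
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme points inside the fences
        "whisker_low": min(inside), "whisker_high": max(inside),
        # Anything beyond the fences is drawn as an individual point
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

Here the value 50 sits far above the upper fence, so it would appear as a lone point above the top whisker.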

Boxplots are mostly useful when you want to compare distributions across groups, such as income by job role or age by attrition status.

Let’s start with a simple boxplot of monthly income.

ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Distribution of Monthly Income",
    y = "Monthly Income")

The aes function tells ggplot which variable to plot. geom_boxplot draws the boxplot. The labs function sets the plot's labels: the title and the axis labels.

How to Create Boxplots with ggplot2

Now let's compare income across job roles.

ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income")

The x aesthetic lists all the job roles. The labels are rotated to improve readability. This visualization quickly reveals income differences across roles.

How to Perform Exploratory Data Analysis (EDA)

Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.

We can use the example of Years at company by department.

ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = "darkblue") +
  labs(
    title = "Years at Company by Department",
    y = "Years at Company")

How to Build Linear Regression Models

To see how linear regression works, you can model MonthlyIncome using YearsAtCompany with the commands below.

The first line creates the model while the second displays it.

hr_lm<- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)

Linear regression estimates how income changes with tenure. This works when the variables are numeric.

After running the code, your console should show you this output:

Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -9506  -2488  -1186   1403  15483 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3734.47     159.41   23.43   <2e-16 ***
YearsAtCompany   395.25      17.14   23.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4032 on 1478 degrees of freedom
Multiple R-squared:  0.2647,    Adjusted R-squared:  0.2642 
F-statistic:   532 on 1 and 1478 DF,  p-value: < 2.2e-16

Let’s interpret this model.

If an employee has 0 years at the company, their predicted monthly income is $3734.47. This comes from the intercept.

For each additional year an employee spends at the company, their monthly income is predicted to increase by $395.25.

Both coefficients have p-values < 2e-16, which means they are highly significant. This is strong evidence that the years an employee spends at a company are associated with their income, although a regression like this cannot show on its own that tenure causes the increase.

The model’s R-squared is 0.2647. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.

The model’s F-statistic is 532, with a p-value < 2.2e-16. This means the model is statistically significant overall.

In general, the longer an employee stays at a company, the more they earn: roughly $395 more in monthly income for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.
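To make the fitted line concrete, you can plug tenure values into it by hand. The arithmetic is the same in any language; this short Python sketch uses the intercept and slope from the summary output above (the function name is hypothetical).

```python
# Coefficients copied from the lm() summary above
INTERCEPT = 3734.47
SLOPE_PER_YEAR = 395.25

def predicted_monthly_income(years_at_company):
    """Point prediction from the fitted line: intercept + slope * years."""
    return INTERCEPT + SLOPE_PER_YEAR * years_at_company

for years in (0, 5, 10):
    print(years, "years ->", round(predicted_monthly_income(years), 2))
```

Keep the residual standard error of 4032 in mind when reading these numbers: individual incomes scatter widely around the line, so each prediction is an average, not a promise.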

How to Build Logistic Regression Models

You can now learn how to predict attrition. The first command generates the model while the second displays it.

hr_glm<- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)


summary(hr_glm)

Your console should show this as an output when you run both commands.

Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.094e-01  1.375e-01  -5.886 3.96e-09 ***
MonthlyIncome  -9.449e-05  2.302e-05  -4.104 4.05e-05 ***
YearsAtCompany -5.047e-02  1.792e-02  -2.817  0.00485 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1305.4  on 1479  degrees of freedom
Residual deviance: 1252.5  on 1477  degrees of freedom
AIC: 1258.5

Number of Fisher Scoring iterations: 5

Logistic regression is used for binary outcomes, that is, yes or no. It estimates probability.

Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (Attrition) based on their Monthly Income and Years at Company.

The intercept is -0.809. This is the baseline log-odds of leaving when their income and years at the company are zero.

The employees’ Monthly Income has a coefficient of -0.0000945. This means that as their income increases, their chance of leaving decreases slightly. An increase in income makes them less likely to quit.

The employees’ Years at Company have a coefficient of -0.0505. This shows that the longer they stay, the less likely they are to leave. Each additional year lowers the log-odds of attrition by about 0.05, holding income constant.

All coefficients are statistically significant. Monthly Income and Years at Company are both associated with a lower likelihood of leaving.
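Log-odds are hard to read directly, so it helps to convert them. Exponentiating a coefficient gives an odds ratio, and the logistic function turns log-odds into a probability. This Python sketch uses the coefficients from the summary output above; the two example employees are made up.

```python
import math

# Coefficients copied from the glm() summary above
COEF = {
    "intercept": -0.8094,
    "MonthlyIncome": -9.449e-05,
    "YearsAtCompany": -0.05047,
}

def attrition_probability(monthly_income, years_at_company):
    """Logistic function applied to the model's linear predictor."""
    log_odds = (COEF["intercept"]
                + COEF["MonthlyIncome"] * monthly_income
                + COEF["YearsAtCompany"] * years_at_company)
    return 1 / (1 + math.exp(-log_odds))

# Odds ratios: multiplicative change in the odds of leaving
odds_ratio_per_year = math.exp(COEF["YearsAtCompany"])
odds_ratio_per_1000_income = math.exp(1000 * COEF["MonthlyIncome"])

print("odds ratio per extra year:      ", round(odds_ratio_per_year, 3))
print("odds ratio per extra $1000/month:", round(odds_ratio_per_1000_income, 3))
print("P(leave | $3000, 1 year):  ", round(attrition_probability(3000, 1), 3))
print("P(leave | $8000, 10 years):", round(attrition_probability(8000, 10), 3))
```

An odds ratio just below 1 means each extra year trims the odds of leaving by a few percent, which matches the "slight" effect described above.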

The model’s residual deviance is 1252.5, lower than the null deviance of 1305.4. This means the model explains some of the variation in attrition.

The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.

Why Visualization Comes Before Modeling

Boxplots help you to:

Detect outliers: Boxplots highlight extreme values that interfere with model results.
Compare groups: Boxplots allow quick comparison of distributions across different categories.
Form hypotheses: Visual patterns assist in identifying relationships worth testing in a model.
Validate modeling assumptions: Boxplots help check distribution shape and variance before modeling.

Modeling without visualization often leads to misinterpretation or false confidence.

Conclusion

In this tutorial, you learned how to load and clean data, how to read boxplots, and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.

How to Create Scatterplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Mon, 05 Jan 2026 12:05:54 +0000

You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.

Prerequisites
How to Set Up Your R Environment
How to Use Data Types in R
How to Use Data Structures in R
How to Import Data in R
How to Visualize Data with ggplot2
How to Build Statistical Models in R
Conclusion

Prerequisites

Before we get started, you should have the following:

R installed (version 4.0 or higher).
RStudio installed (recommended for beginners).
Basic familiarity with programming concepts such as variables and functions.
A basic understanding of statistics (mean, correlation, regression).

How to Set Up Your R Environment

Before you start working with data, load the required libraries:

library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files

These lines load the required libraries into R. tidyverse is a collection of packages used for data manipulation and visualization, including ggplot2. readxl allows you to import Excel files directly into R without converting them to CSV format first.

How to Use Data Types in R

Knowing data types helps you avoid errors and choose the right analysis methods.

Common Data Types

Data type	Example	Use case
Numeric	`x <- 5.7`	Measurements, prices
Integer	`y <- 10L`	Counts
Character	`"House prices"`	Text labels
Logical	`TRUE`	Conditions
Complex	`2 + 3i`	Advanced math

Numeric Data Types in R

price <- 199.99
tax <- 16.5
total_cost <- price + tax
total_cost

Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.

Integer Data Types in R

students <- 30L
classes <- 4L
total_students <- students * classes
total_students

Integers are whole numbers and are commonly used for counting. The L tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.

Character Data Types in R

course_name <- "Data Science"
university <- "Harvard University"
paste(course_name, "at", university)

Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the paste() function. This data type cannot be used in mathematical operations.

Logical Data Types in R

score <- 75
passed <- score >= 50
passed

Logical data represents Boolean values: TRUE or FALSE. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns TRUE because the score meets the requirement. Logical values are essential in decision-making and control flow.

Complex Data Types in R

Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.

z <- 2 + 3i
Mod(z)

This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.

How to Use Data Structures in R

R stores data in different structures depending on your goals. This is important because choosing the right structure makes operations easier, and R's functions behave differently depending on the structure. Moreover, structures help R understand whether your data are numbers, categories, or text.

Common Data Structures in R

Structure	Best for
Vector	Single column of data
Matrix	Numeric tables
Data Frame	Spreadsheet-like data
List	Mixed objects

vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

str(lst) ##shows the structure of the list

Let's understand the code above:

vec is a vector that stores a single type of data.
mat is a matrix that organizes numeric values into rows and columns.
df is a data frame that works like a spreadsheet, allowing different data types in each column.
lst is a list that stores multiple objects of different types.
The str() function shows how these objects are nested within the list.

How to Import Data in R

Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.

For Windows: Replace single backslashes (\) with either double backslashes (\\) or single forward slashes (/). For example:


Windows
data <- read.csv("C:\\Users\\file\\Documents\\data.csv")
# or
data <- read.csv("C:/Users/file/Documents/data.csv")

For macOS/Linux: Single forward slashes work fine:

macOS/Linux
data <- read.csv("/Users/file/Documents/data.csv")

How to Read a CSV and Excel File

# Import CSV file
data <- read.csv("C:/Users/file/Documents/data.csv")
# or, on Windows: data <- read.csv("C:\\Users\\file\\Documents\\data.csv")

head(data)

You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (/) or double backslashes (\\). The imported data is stored as a data frame named data.

data_excel <- read_excel("C:/Users/file/Documents/HR Data Set.xlsx")
head(data_excel)

You can import an Excel file into R using the read_excel() function from the readxl package. The head() function is then used to preview the first few rows of the dataset.

Use the following commands to understand your data:

str(data)
summary(data)

str(data_excel)
summary(data_excel)

str() shows the structure of the dataset, including column names and data types. summary() provides descriptive statistics such as minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.

How to Visualize Data with ggplot2

Visualization helps you spot patterns before you build models.

Scatter Plot Example

We’ll use the built-in mtcars dataset in R. First, load the library to make it available for use:

data(mtcars)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

Let us break down the code to grasp it fully:

data(mtcars) loads the built-in mtcars dataset, which contains information about car specifications.
library(ggplot2) enables data visualization.
aes() maps dataset columns to the plot: wt to the x-axis, mpg to the y-axis, and cyl (as a factor) to color.
geom_point() draws the points and styles them, here setting the point size and color. Because color="blue" is set directly inside geom_point(), it overrides the cyl color mapping for the points.
geom_smooth() adds a trend line. Here, we use method="lm" to fit a linear regression line. The se=TRUE/FALSE option controls the shading for confidence intervals. Use TRUE if you want the shading and FALSE if you don’t.
labs() labels the plot by setting the title and the x-axis and y-axis labels.
Finally, we set the plot theme using theme_minimal().

Running this code will produce a scatterplot showing fuel efficiency by weight and cylinders.

How to Build Statistical Models in R

Linear Regression

You can use linear regression for continuous outcomes, basically to predict numerical values. For example, to predict a car’s miles per gallon (mpg) based on weight (wt) and horsepower (hp), you can use this formula:

lm_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)

But what does it mean?

lm() stands for linear model.
The response variable is mpg. This is the outcome you want to predict.
Predictor variables are wt and hp. These explain changes in the response.

Once you run the model, it should look like this in your console:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Here’s an interpretation of the linear regression model:

You created a model on miles per gallon (mpg) based on weight (wt) and horsepower (hp).
The intercept 37.227 is the predicted mpg when wt=0 and hp=0. In other words, it is the baseline value of the outcome when all predictors in the model are zero.
With every additional unit of weight (1000 lbs), the mpg decreases by about 3.878, holding horsepower constant. The p-value is below 0.001, so this effect is statistically significant.
With every additional unit of horsepower, the mpg decreases by about 0.032, holding weight constant. The p-value is 0.00145, which is less than 0.01, indicating that horsepower is a statistically significant predictor of mpg, although its effect per unit is smaller than that of vehicle weight.
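
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below uses Python only to check the arithmetic. The coefficients are copied from the summary output above, while the example car (3000 lbs, 110 horsepower) and the helper name predict_mpg are illustrative assumptions.

```python
# Coefficients copied from the lm() summary output above.
INTERCEPT = 37.22727
COEF_WT = -3.87783   # change in mpg per additional 1000 lbs
COEF_HP = -0.03177   # change in mpg per additional horsepower


def predict_mpg(wt: float, hp: float) -> float:
    """Evaluate the fitted line mpg = b0 + b1*wt + b2*hp (hypothetical helper)."""
    return INTERCEPT + COEF_WT * wt + COEF_HP * hp


# Hypothetical car: 3000 lbs (wt = 3) and 110 horsepower.
baseline = predict_mpg(wt=3.0, hp=110)
# Same horsepower, but 1000 lbs heavier.
heavier = predict_mpg(wt=4.0, hp=110)

print(f"Predicted mpg: {baseline:.2f}")
print(f"Effect of +1000 lbs: {heavier - baseline:.2f} mpg")
```

The prediction comes out at about 22.1 mpg, and adding 1000 lbs while keeping horsepower fixed lowers it by exactly the wt coefficient. In R, predict(lm_model, newdata = data.frame(wt = 3, hp = 110)) performs the same calculation on the fitted model.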

Does the Model Fit the Data, and Why?

The R-squared value shows that 83% of the variation in mpg is explained by weight and horsepower.

Summary of the interpretation: Cars that are heavier and with more horsepower have lower fuel efficiency. These two variables explain most of the variation in mpg in the dataset.

Logistic Regression

You can use logistic regression for binary outcomes, like yes/no questions. For example, predicting whether a vehicle is automatic or manual based on weight and horsepower.

glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)

Let's understand the code:

glm() stands for generalized linear model.
The family=binomial option tells R to run logistic regression.
The response variable am indicates transmission type: 0 = automatic, 1 = manual.
Predictor variables remain wt and hp.

Once you run the model, it should look like this in your console:

Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
hp           0.03626    0.01773   2.044  0.04091 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Here’s an interpretation of the logistic regression model:

The intercept 18.866 represents the log-odds of a car being manual when wt=0 and hp=0. In other words, it is the baseline log-odds of the outcome when all predictors in the model are zero.
With every additional unit of weight (1000 lbs), the log-odds of the car being manual decrease by 8.083, holding horsepower constant. This variable strongly affects the probability of the car being manual, and its p-value of 0.008 is statistically significant.
With every additional unit of horsepower, the log-odds of the car being manual increase by 0.036, holding weight constant. This variable also affects the probability of being manual, and its p-value of 0.041 is statistically significant.
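
Log-odds are hard to read directly, so it helps to convert them into probabilities with the logistic function p = 1 / (1 + exp(-log_odds)). The sketch below uses Python only to check the arithmetic. The coefficients are copied from the output above, while the two example cars and the helper name prob_manual are illustrative assumptions.

```python
import math

# Coefficients copied from the glm() summary output above.
INTERCEPT = 18.86630
COEF_WT = -8.08348
COEF_HP = 0.03626


def prob_manual(wt: float, hp: float) -> float:
    """Convert the model's log-odds into a probability (hypothetical helper)."""
    log_odds = INTERCEPT + COEF_WT * wt + COEF_HP * hp
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical cars with 110 horsepower at two different weights.
light = prob_manual(wt=2.0, hp=110)
heavy = prob_manual(wt=3.0, hp=110)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratio_hp = math.exp(COEF_HP)

print(f"P(manual | wt=2.0, hp=110) = {light:.3f}")
print(f"P(manual | wt=3.0, hp=110) = {heavy:.3f}")
print(f"Odds ratio per extra horsepower = {odds_ratio_hp:.3f}")
```

Under this model the 2000 lb car is almost certainly manual, while the 3000 lb car has roughly a 20% chance of being manual. Each extra horsepower multiplies the odds of a manual transmission by about 1.04. In R, predict(glm_model, newdata = ..., type = "response") returns the same probabilities.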

Summary of the interpretation: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, wt and hp explain a large portion of transmission type variation.

Conclusion

In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.

This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.

Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.

You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.

With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.

How Embedded Analytics Makes Your App More Valuable

Manish Shivanandhan — Mon, 08 Dec 2025 17:03:00 +0000

Most business apps capture data. They track orders, tickets, leads, expenses, tasks, or deliveries.

But when someone needs insights, they often leave the app, export a file or open a BI tool to get answers. This extra step slows down decisions and creates friction.

Embedded analytics removes that friction. It means placing reports, dashboards, charts, KPIs and even AI-powered insights directly inside your existing app.

Instead of switching to another tool, users get answers in the exact moment they are doing their work.

Companies like Tableau, Pyramid, and Sigma have helped popularise this idea by allowing their analytics engines to sit inside other products. But the real value comes not from the vendors but from how deeply analytics becomes part of the workflow.

When embedded analytics is done well, your app becomes more valuable because it helps users think and act in the same place.

In this article, we will learn how embedding analytics directly inside a product increases its usefulness. We will also see how it improves decision-making and creates new revenue opportunities for the product.

What We’ll Cover

Why Embedded Analytics Matters
What Embedded Analytics Looks Like Inside an App
How Embedded Analytics Makes Your App More Valuable
Practical Ways to Start Using Embedded Analytics
Design Principles for Effective Embedded Analytics
Conclusion

Why Embedded Analytics Matters

In any business workflow, insight is always a step behind action.

A support manager who wants to understand why backlogs are rising must check a separate reporting tool.

A sales leader who wants to see pipeline health needs to open a BI dashboard.

A supply chain manager who wants to diagnose delays must export data to Excel.

These breaks in context may seem small, but they pile up. Users lose time. Decisions slow down. Only power users become comfortable with analytics.

Embedded analytics changes this pattern. By placing insights directly where work happens, you remove the hidden cost of switching tools.

A support manager can see backlog trends next to the ticket queue. A sales rep can see win rates while updating deals. A logistics coordinator can see average delay times next to shipment details.

Your app becomes more useful because it no longer just stores data. It helps make sense of it.

What Embedded Analytics Looks Like Inside an App

There are many ways embedded analytics can appear in a product.

At the simplest level, it can be a dashboard embedded through an iframe or a JavaScript snippet. This still gives users a unified experience without opening another product.

More advanced setups weave analytics into the core interface. A CRM might show prediction scores on each lead instead of only having a separate “Reports” tab.

An operations platform powered by Tableau might show throughput and error trends beside the workflow screen. A finance app might reveal margin drivers while approving invoices.

The experience should feel native to the product. Fonts match. Colours match. Navigation stays consistent. Users should not feel like they are opening a separate tool. They should feel like the analytics belong exactly where they appear.

How Embedded Analytics Makes Your App More Valuable

Embedded analytics deepens product usefulness by changing how users interact with data.

It moves insight to the front of decisions. Instead of digging for answers elsewhere, users see context exactly when needed.

A procurement manager adjusting an order quantity sees supplier reliability and historical pricing right there. They can make smarter decisions without leaving the screen.

This unlocks new value stories. Customers pay because they get decision-making power built into the product itself. Platforms like Pyramid Analytics are often used to deliver enterprise-grade insights inside portals and internal tools, letting companies sell analytics as an added feature.

It also reduces dependency on analysts. Modern embedded analytics platforms enable search-based exploration and drag-and-drop analysis. Business teams no longer need to wait for a data team to create every custom view.

And it strengthens product stickiness. When your app becomes a central hub for both workflows and decisions, users rely on it more. Competing products without analytics feel incomplete.

Practical Ways to Start Using Embedded Analytics

One of the simplest ways to implement embedded analytics is to place a live BI dashboard directly inside your application.

Modern tools such as Tableau allow dashboards to be published with secure embed URLs. These dashboards can then appear as part of your interface instead of forcing users to open a separate reporting system.

Imagine you are building a recruiting platform. Your customers track candidates, interviews, and hiring cycles, but they still leave your product whenever they want an overview.

By embedding analytics, you can surface a pipeline health view directly inside the product’s home screen. Hiring managers would see average time-to-hire, conversion rates, and offer acceptance trends without ever exporting data.

The implementation is surprisingly straightforward. First, you create and publish a dashboard in your BI tool, so it becomes accessible via a URL such as:

https://analytics.yourapp.com/views/hiring_overview

Next, you embed that dashboard inside your product UI using a simple iframe. A page in your web app could include the following:

<div class="dashboard-container">
  <iframe
    src="https://analytics.yourapp.com/views/hiring_overview"
    style="width:100%; height:500px; border:none;"
  ></iframe>
</div>

The iframe source points to your analytics dashboard, and its sizing and border settings help the embedded view look like part of your application rather than an external tool. Keep in mind that content inside an iframe does not inherit your page’s CSS, so the dashboard’s fonts and colours need to be matched in the BI tool’s own theme settings.

What matters most is the experience for the user. Instead of jumping between systems, hiring teams now see insights the moment they open the app.

Recruiters review candidate lists while seeing hiring trends directly above them. Managers check pipeline health during weekly planning sessions without exporting spreadsheets. Executives understand bottlenecks simply by logging in, rather than waiting for emailed reports. The insight lives where the work happens, which is exactly what makes embedded analytics valuable.

This small implementation illustrates how embedding a readymade dashboard can increase usefulness without changing data architecture. By letting users access answers in context, your product shifts from a system that records information to one that helps interpret and act on it.

Design Principles for Effective Embedded Analytics

Great embedded analytics is not about building fancy charts. It is about making the app easier to understand and easier to act on.

Begin with clear questions. Each chart should answer something specific. Instead of a generic graph called Revenue by Region, use a title such as “Which region is growing fastest this quarter?” Clear questions guide the user’s attention.

Show only what matters. Many analytics tools allow complex dashboards, but in a business app, less is more. Three focused metrics are more useful than fifteen distracting charts.

Support deeper exploration. While the first view should be simple, users who need detail should be able to drill down into more granular data, then into tables, then into raw records. This avoids overwhelming beginners while keeping power users happy.

Prioritize performance. Embedded analytics runs inside your product, so slow dashboards feel like a slow app. Pre-aggregate heavy metrics and use caching wherever possible. Leading platforms make speed a core priority because it directly affects user experience.

Match the product’s design. White-label options from companies like GoodData help make embedded dashboards feel native. Consistent colors and typography matter more than many teams expect.

Conclusion

Embedded analytics is not a cosmetic add-on. It’s a strategic way to lift product value. When you plan your roadmap, tie analytics ideas to measurable business outcomes.

Analytics can reduce churn by making users more successful. It can increase the adoption of core workflows by helping people understand what is happening. It can become a revenue driver through premium analytic tiers.

The market also shows how important analytics has become. Companies promote decision intelligence as a core capability for enterprise apps. Many large enterprises use embedded analytics to serve both internal teams and external customers with faster insights.

If your product still pushes users toward Excel exports or sends them to a separate BI portal, you are leaving value behind. When analytics becomes part of the main interface, your product shifts from being a system of record to a system of insight.

That is when the usefulness deepens, user loyalty grows, and your app becomes a place where better decisions happen every day.

Hope you enjoyed this article. Find me on LinkedIn or visit my website.

Common Pitfalls to Avoid When Analyzing and Modeling Data

Oyedele Tioluwani — Tue, 14 Oct 2025 13:48:34 +0000

Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column, an unclear definition, or a data leak that slips by unnoticed can all lead to results that do not hold up when it matters most.

Reliable analysis depends on how data is handled throughout the process. From collection and preparation to modeling and interpretation, each step carries its own risks. Many of the most persistent problems come not from technical gaps, but from missing checks or assumptions that go unspoken.

This guide highlights some of the most common pitfalls in data analysis and shows where they tend to appear. Along the way, it covers:

Biased or unclear inputs that cause trouble early on
Validation mistakes that distort model performance
Misinterpretation of results that leads to the wrong conclusions
Workflow gaps that slow teams down or create confusion
Practical steps you can take to catch and correct these issues

Data Collection Pitfalls
Data Preparation Pitfalls
Modeling and Validation Pitfalls
Interpretation and Communication Pitfalls
Organizational and Workflow Pitfalls
Conclusion

Data Collection Pitfalls

A lot of data issues begin before any modeling takes place. The way data is collected helps shape what your analysis can reveal. Once the inputs are biased or inconsistent, even solid techniques may lead to unreliable results.

One common issue is bias in data sources. When a large portion of the data comes from digital channels like websites or apps, it creates an imbalance. For instance, if a model is trained only on web traffic, it could miss users who engage through offline means, like in-person visits or phone support. This results in blind spots that limit how well the model performs once deployed.

Inconsistent definitions across systems also pose a major challenge. A simple label like “customer” could represent various things: an active user in one database, a prospect in another, or a past buyer elsewhere. Without shared definitions, teams can end up using the same terms to mean very different things, which leads to confusion and misaligned metrics.

A third issue is the lack of metadata or data provenance. Without clear records of where the data came from or how it has changed over time, it becomes harder to trace issues, explain outputs, or reproduce results.

The way out:

Combine data from multiple sources to build a more complete and representative picture
Use stratified sampling to reduce bias where possible
Set up regular audits to catch data drift or gaps early
Maintain a shared data dictionary and align terms across teams
Track data lineage with tools like dbt, Apache Atlas, or OpenMetadata

Getting data collection right sets a strong foundation for analysis and helps prevent issues down the line.

Data Preparation Pitfalls

Once the data has been collected, the next step involves cleaning and shaping it for use. This is another delicate stage where data analysts often run into issues. Some choices that seem helpful at first can create problems later, especially when they aren’t documented or tested properly.

Silent Data Leakage

Data leakage occurs when a model learns from information that it would not have access to at prediction time. Let’s say, for example, you’re building a model in January to predict whether a customer will make a purchase in February. If your dataset includes transactions from February, and you use them to calculate a feature like “days since last purchase”, then your model is learning from data it wouldn’t realistically have at prediction time.

Improper Handling of Missing Values

Many analysts treat missing values as gaps to be filled. In certain cases, though, the fact that data is missing can be just as meaningful as the value itself. In a customer churn dataset, some users might have blank entries for recent activities because they have already stopped engaging with the product. Filling those gaps with averages or zeros without context could make the model treat them the same as users who simply haven’t generated enough data yet, which can be misleading.

Over-aggressive Outlier Removal

It’s tempting to remove extreme values to simplify modeling, but outliers often represent rare but important events. In fraud detection, for instance, the anomalies are the very signals the models need to learn from. Discarding them automatically based on z-scores or quantiles may improve short-term accuracy while weakening long-term reliability.

The way out

To avoid data leakage, create training and test splits before engineering features. Make use of chronological splits when modeling time-based behavior, and regularly audit feature logic.
For missing values, go through the missingness patterns first. Use indicator variables where necessary, and treat the missingness as a signal, rather than just a defect.
With outliers, analyze their sources before removing them. If they turn out to be genuine observations, try using robust models that can handle skewed data, or flag them for downstream use instead of deleting them.
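
The January/February example above can be sketched as follows. This is a minimal illustration in plain Python, not a full pipeline: the transaction records, the cutoff date, and the function name days_since_last_purchase are all hypothetical. The feature is computed only from rows dated before the prediction cutoff, and a missing history is kept as an explicit indicator instead of being filled in.

```python
from datetime import date
from typing import Optional

# Hypothetical transaction log: (customer_id, purchase_date).
transactions = [
    ("a", date(2025, 1, 5)),
    ("a", date(2025, 2, 10)),   # happens after the cutoff: must not be used
    ("b", date(2024, 12, 20)),
    ("c", date(2025, 2, 3)),    # customer c has no history before the cutoff
]

CUTOFF = date(2025, 2, 1)  # predictions are made with data up to the end of January


def days_since_last_purchase(customer_id: str, cutoff: date) -> Optional[int]:
    """Use only purchases strictly before the cutoff; None means no history."""
    past = [d for cid, d in transactions if cid == customer_id and d < cutoff]
    if not past:
        return None
    return (cutoff - max(past)).days


features = {}
for cid in ("a", "b", "c"):
    recency = days_since_last_purchase(cid, CUTOFF)
    features[cid] = {
        "recency_days": recency,
        # Indicator variable: missingness is kept as a signal, not imputed away.
        "recency_missing": recency is None,
    }

for cid, row in features.items():
    print(cid, row)
```

Customer a’s February purchase is ignored, so the recency feature reflects only what would have been known in January. Customer c gets a missing-value flag that a model can learn from.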

Getting this stage right protects your models from brittle and unstable behavior.

Modeling and Validation Pitfalls

A common thought in this field is that models are only as reliable as the assumptions built into them. Mistakes at this phase are often reflected late, sometimes after the models have been deployed, making them harder to catch and more expensive to fix.

Overfitting Through Hyperparameter Tuning

Trying to make a model perfect on the training data can lead to patterns that don’t hold up in practice. When you test hundreds of hyperparameter combinations without proper checks, the model often ends up learning noise rather than signal, resulting in excellent scores during cross-validation but weak performance in production. For instance, a churn model might show excellent performance during development, but once it is deployed to a new region with slightly different customer behavior, it starts to miss the mark.

Validation Leakage

Leakage can occur when the validation process accidentally gives the model access to target-related information. One common case is target encoding, where features like average purchase per customer group are calculated on the full dataset rather than only on the training set. This can lead to inflated validation scores and a false sense of confidence.
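
One way to avoid this kind of leakage is out-of-fold target encoding, where each row’s encoded value is computed from the other folds only. The sketch below is a simplified illustration in plain Python. The rows, the round-robin fold assignment, and the helper name group_means are assumptions made for the example, not part of any particular library.

```python
from collections import defaultdict

# Hypothetical rows: (customer_group, purchased), where purchased is the 0/1 target.
rows = [
    ("gold", 1), ("gold", 1), ("gold", 0), ("silver", 0),
    ("silver", 1), ("silver", 0), ("gold", 1), ("silver", 0),
]
N_FOLDS = 4


def group_means(train_rows):
    """Mean target per group, computed from the given rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, target in train_rows:
        sums[group] += target
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in counts}


encoded = [None] * len(rows)
for fold in range(N_FOLDS):
    # Rows whose index falls in this fold are held out; the rest are training rows.
    held_out = [i for i in range(len(rows)) if i % N_FOLDS == fold]
    train = [rows[i] for i in range(len(rows)) if i % N_FOLDS != fold]
    means = group_means(train)
    fallback = sum(t for _, t in train) / len(train)  # used for unseen groups
    for i in held_out:
        encoded[i] = means.get(rows[i][0], fallback)

for (group, target), value in zip(rows, encoded):
    print(f"{group:6s} target={target} encoded={value:.2f}")
```

The first row belongs to the gold group. Encoding it with the full-data mean (3/4) would include its own target, whereas the out-of-fold value (2/3) is computed without it.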

Ignoring Data Drift and Concept Drift

Data changes over time, and so do the basic relationships that models rely on. A model trained on behavior from eight months ago may not reflect current realities. Imagine a fraud detection model built before a major policy shift or change of product; the possibility that the model may fail to catch new fraud patterns that arise afterwards is extremely high.

The Way Out

Use nested cross-validation (a technique that separates hyperparameter tuning from final evaluation by using two loops of cross-validation) to avoid overfitting during model selection. Then compare the results against simple baselines to keep complexity in check.
Treat feature engineering as part of the pipeline and apply it within each training fold to avoid leakage. For time-sensitive data, validate progressively to reflect real-world use.
Check for drift using techniques like the Kolmogorov-Smirnov test or the Population Stability Index, and link alerts to retraining processes so models can evolve with data.
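
As an illustration of the drift check mentioned above, here is a minimal sketch of the Population Stability Index in plain Python. It assumes equal-width bins taken from the reference sample and a small epsilon for empty bins. Production implementations often use quantile bins instead, and the synthetic data below is purely hypothetical.

```python
import math
import random


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


random.seed(0)
reference = [random.gauss(50, 10) for _ in range(5000)]   # training-time feature
same_dist = [random.gauss(50, 10) for _ in range(5000)]   # no drift
shifted = [random.gauss(60, 10) for _ in range(5000)]     # the mean has drifted

psi_stable = psi(reference, same_dist)
psi_drift = psi(reference, shifted)
print(f"PSI without drift: {psi_stable:.3f}")
print(f"PSI with drift:    {psi_drift:.3f}")
```

A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift. Thresholds like these are conventions, so they are best tuned to the feature and the cost of retraining.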

These steps go a long way in keeping your models solid in production and ready for whatever the data throws at them.

Interpretation and Communication Pitfalls

Clear, responsible communication is just as important as accurate modeling. But it is very easy to slip into habits that make results look more certain, more compelling, or more reliable than they really are. These missteps can lead teams to act on insights that don’t hold up.

Overconfidence in Statistical Significance

Testing lots of variables without making adjustments can make weak signals look important. Imagine you run a dozen A/B tests and pick the one with a p-value below 0.05. Without correcting for multiple comparisons, there’s a good chance that result is just noise.

Ignoring Practical Significance

A result can be statistically significant but still meaningless when viewed in context. For example, a 0.1% lift in clickthrough rate may be technically real but not worth the cost of rolling out a change across the product.

Model Explainability Missteps

When explanation tools are used without context, they can confuse rather than clarify. Showing a ranked list of SHAP values might look impressive, but if the stakeholders don’t understand what the features mean or how they interact, the takeaway is lost.

The Way Out

Be cautious with statistical significance. If you’re running several tests, apply corrections for multiple comparisons (Bonferroni or Benjamini-Hochberg methods, for instance) and avoid selectively reporting only the findings that look significant and ignoring those that don’t.
Look beyond what is statistically true and ask whether it is practically useful. A small, significant change might not be worth acting on at the end of the day.
When using explainability tools like SHAP or LIME, don’t assume the outputs speak for themselves. Add plain-language summaries, relevant examples, and business contexts to make them actionable. It is better to explain less with clarity than more with confusion.
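
The multiple-comparisons point can be made concrete with a short calculation. The sketch below, in plain Python, computes the chance of at least one false positive across a dozen independent tests and then applies the Bonferroni and Benjamini-Hochberg corrections. The list of p-values is hypothetical and the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate with the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Chance of at least one false positive across 12 independent tests at alpha = 0.05.
n_tests = 12
familywise_error = 1 - (1 - 0.05) ** n_tests

# Hypothetical p-values from a dozen A/B tests.
p_values = [0.001, 0.004, 0.012, 0.03, 0.04, 0.20,
            0.35, 0.41, 0.55, 0.62, 0.78, 0.91]

print(f"Familywise error rate: {familywise_error:.2f}")
print("Uncorrected:", sum(p < 0.05 for p in p_values))
print("Bonferroni: ", sum(bonferroni(p_values)))
print("BH:         ", sum(benjamini_hochberg(p_values)))
```

With twelve independent tests and no real effects, the chance of at least one p-value below 0.05 is about 46%. On these example p-values, five results look significant before correction, two survive Bonferroni, and three survive Benjamini-Hochberg.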

These habits make your results easier to trust, interpret, and apply, which is ultimately the point of the work.

Organizational and Workflow Pitfalls

Analytics is most effective when it is collaborative and responsive. Gaps in team structure or feedback processes can slow progress and limit the value of your work.

Teams working in isolation are a frequent issue. When analysts, engineers, and business stakeholders do not share tools or goals, efforts get duplicated and insights become fragmented. For example, one team might define active users based on weekly logins, while another uses monthly engagements, resulting in mismatched reports.

Lack of feedback from deployed models is another pitfall. If no one tracks what happens after predictions are made, teams miss the opportunity to refine and improve their processes. For example, if a loan approval model is deployed but there’s no follow-up on repayment behavior, it becomes difficult to tell whether the model is supporting sound lending decisions or increasing default risk.

The way out

Encourage collaboration by forming cross-functional teams and coordinating around shared planning cycles. Align on definitions early and rely on centralized dashboards to ensure that everyone is working from the same source of truth.
Create feedback loops and make them a standard part of your workflow. Track real-world outcomes, and schedule regular post-deployment reviews to understand what is working and what is not.
Include end users alongside data teams and treat their input as essential to improving the system.

Taking these actions helps analytics stay practical, consistent, and responsive to real needs.

Conclusion

Each stage of the data workflow benefits from clarity, structure, and shared understanding. The table below shows all the mentioned pitfalls, together with the way out to help teams build more reliable models and deliver results that hold up in real-world settings.

Category	Pitfall	Consequences	Recommended Approach
Data collection	Unreliable sources	Skewed insights	Validate source quality and apply consistent standards
Data preparation	Silent data leakage	Inflated model performance without real-world value	Use proper data splits and audit derived features
Modeling & validation	Overfitting through hyperparameter tuning	Strong validation results that don’t translate to reality	Use nested cross-validation (a structure where tuning happens inside training folds) and keep simple baselines for comparison
Interpretation & communication	Overconfidence in statistical significance	Misleading conclusions from small or selective effects	Adjust for multiple comparisons and report confidence intervals alongside p-values
Organizational & workflow	Fragmented teams	Redundant work and inconsistent metrics	Encourage collaboration with shared planning, dashboards, and definitions

Strong analytic practice is built over time. Keeping these pitfalls in view helps teams stay consistent, improve delivery, and create results that stay useful across projects and contexts.

How to Forecast Time Series Data with Python Darts

Adejumo Ridwan Suleiman — Mon, 06 Oct 2025 18:37:01 +0000

When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time.

There are various libraries for time series forecasting in Python, and Darts is one of them. Unlike other forecasting libraries, Darts is a high-level forecasting library with algorithms to handle various time series data, regardless of the kind of trend they portray.

This tutorial will walk you through how you can forecast time series data using Python Darts. This will help you make meaningful insights whenever you come across time series data such as stock prices, weather measurements, and so on.

What is Python Darts?

Python Darts is an open-source library for time series analysis and forecasting. It has various models ranging from statistical time series models like ARIMA and SARIMA to machine learning and deep learning models like Prophet and LSTM.

It has various algorithms for imputing missing values in time series data, and can handle time series problems ranging from univariate and multivariate to hierarchical time series.

Prerequisites

Before we proceed, you will need to have the following:

Python 3.9+ installed.
Jupyter Notebook, Google Colab, or Positron to run your code.
Download the Netflix stock data.
Have the following libraries installed:
- darts for time series analysis
- pandas for data wrangling
- matplotlib for data visualization.

How to Set Up Dependencies

Load the following libraries.

import matplotlib.pyplot as plt
import pandas as pd
import darts
from darts import TimeSeries
from darts.models import ARIMA
from darts.models import RegressionModel
from lightgbm import LGBMRegressor
from darts.models import RNNModel
from darts.metrics import mape
import itertools

Understanding the Dataset

The Netflix stock data contains historical daily prices of Netflix stock from the year 2002 till date.

Load the data and have a preview of it.

netflix = pd.read_csv("/kaggle/input/netflix-stock-data-live-and-latest/Netflix_stock_history.csv")
netflix['Date'] = pd.to_datetime(netflix['Date'], utc=True).dt.tz_convert(None)
netflix.head()

To forecast a time series data, we need a Date column, which we already have, and then the variable of interest. We have several variables, but for this tutorial, we will focus on the Close variable of Netflix stocks.

Let’s visualize the data to see how Netflix closing price performed over the years.

netflix.plot(x='Date', y='Close', figsize=(10,5))
plt.show()

From the chart above, you can see that Netflix stock showed exponential growth in recent years. This means that the data is non-stationary: its statistical properties, such as the mean and variance, change over time instead of staying constant.

There are a lot of random fluctuations in the data, which might make it difficult to forecast. Such data usually requires advanced models to handle the various fluctuations or noise present in the data.
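
One informal way to see non-stationarity is to compare summary statistics across different parts of a series. The sketch below does this on synthetic data, not on the Netflix prices. A random walk with upward drift stands in for a price series, and the helper name half_means is illustrative.

```python
import random
import statistics

random.seed(42)

# Synthetic stand-in for a price series: a random walk with upward drift.
steps = [random.gauss(0.5, 1.0) for _ in range(1000)]
prices = []
level = 100.0
for step in steps:
    level += step
    prices.append(level)


def half_means(values):
    """Mean of the first half and of the second half of a series."""
    mid = len(values) // 2
    return statistics.mean(values[:mid]), statistics.mean(values[mid:])


price_first, price_second = half_means(prices)

# Day-to-day changes (first differences) of the same series.
changes = [b - a for a, b in zip(prices, prices[1:])]
change_first, change_second = half_means(changes)

print(f"Price mean:  first half {price_first:.1f}, second half {price_second:.1f}")
print(f"Change mean: first half {change_first:.2f}, second half {change_second:.2f}")
```

The average price level differs sharply between the two halves, which is the signature of a non-stationary series, while the average day-to-day change stays roughly the same. A formal check would use a unit-root test such as the augmented Dickey-Fuller test.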

How to Prepare the Data for Darts

Before preparing the data for Darts, you need to take note of a few things.

First of all, if you look at the data preview from earlier, you will notice that it is recorded daily. But stock prices are only recorded on trading days, so we need to fill in the missing dates.

Copy and paste this code into your notebook.

start = netflix['Date'].min()
end = netflix['Date'].max()

netflix = (
    netflix.set_index('Date')
           .reindex(pd.date_range(start=start, end=end, freq='D'))
           .ffill()
           .reset_index()
           .rename(columns={'index': 'Date'})
)
netflix.head()

The code above ensures the netflix dataset has a continuous daily time series by filling in missing dates.

First, it finds the earliest start and latest end dates in the data, then creates a full daily date range between them.

By setting the Date column as the index and using the .reindex() method, it inserts rows for any missing dates, which initially contain NaN.

The .ffill() method (forward fill) replaces these gaps by carrying forward the last known value, which is common for stock data when markets are closed, such as weekends.

Finally, the index is reset, and the column is renamed back to Date, producing a clean, continuous dataset ready for time series analysis.
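
To see the reindex-and-forward-fill step in isolation, here is a minimal sketch on a made-up three-row frame. The dates and prices are invented for illustration and are not taken from the Netflix file.

```python
import pandas as pd

# Hypothetical toy data: a Friday, then Monday and Tuesday (weekend missing)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-09"]),
    "Close": [100.0, 103.0, 101.0],
})

full_range = pd.date_range(toy["Date"].min(), toy["Date"].max(), freq="D")

filled = (
    toy.set_index("Date")
       .reindex(full_range)   # adds rows for Jan 6 and Jan 7, filled with NaN
       .ffill()               # carries Friday's close into the weekend
       .reset_index()
       .rename(columns={"index": "Date"})
)
print(filled)
```

The two weekend rows end up with Friday's closing price of 100.0, which is the same behavior the Netflix pipeline above relies on.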

Next, we need to convert the data to a Darts TimeSeries object to make it usable by the Darts library.

series = TimeSeries.from_dataframe(
    netflix,
    time_col='Date',
    value_cols='Close',
)

The code above converts the netflix DataFrame into a Darts TimeSeries object, which is optimized for time series modeling and forecasting.

It takes the Date column (time_col='Date') as the timeline and the Close column (value_cols='Close') as the target values to forecast.

The resulting series object is now structured for use with Darts’ advanced forecasting models like ARIMA, Prophet, RNNs, and other time series algorithms.

Just like you would with any other machine learning model, you need to split your data into a training set and a validation set.

train, val = series.split_before(0.8)

How to Build a Forecasting Model

When building a forecasting model, you have the privilege of trying various models and picking the best-performing one.

The Darts library has various algorithms for time series analysis, from popular statistical algorithms like the Auto Regressive Integrated Moving Average (ARIMA) and Moving Average (MA) models, to machine learning and deep learning algorithms like Prophet and Long Short Term Memory (LSTM).

Note, I will only demonstrate how these algorithms work - it’s not necessary that we get accurate model metrics. But with further feature engineering, hyperparameter tuning, and cross-validation, you can get good results on your own.

Classical Model

The classical approach uses statistical time series models such as ARIMA. ARIMA is made up of the following components:

AR (AutoRegressive): Predict future values by looking at previous ones.
I (Integrated): Remove trends by focusing on changes instead of raw values.
MA (Moving Average): Learn from the errors of past predictions to improve accuracy.
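
The "Integrated" component is the easiest one to see with numbers. As a small illustration on an invented series, first-order differencing (d=1) turns a steadily trending series into one that stays at a constant level:

```python
# A made-up series with a steady upward trend (illustrative values only)
prices = [10, 12, 14, 16, 18, 20]

# First-order differencing: subtract each value from the one that follows it
diffs = [curr - prev for prev, curr in zip(prices, prices[1:])]

print(diffs)  # [2, 2, 2, 2, 2]
```

The raw values keep climbing, but the differences do not, which is why differencing helps with trending data like stock prices.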

Run the code below in your notebook to fit an ARIMA model.

arima_model = ARIMA()
arima_model.fit(train)
arima_forecast = arima_model.predict(len(val))

To visualize the forecast by the model, call the .plot() method on the forecast object.

series.plot(label='actual')
arima_forecast.plot(label='forecast')
plt.legend()

You can improve the model by adding some additional parameters to the ARIMA() class. You can read more about that in the Darts documentation.

Machine Learning Models

Classical models like ARIMA are linear, so they struggle to capture non-linear patterns. Machine learning models fill this gap. We’ll use the LightGBM model as an example.

LightGBM is a gradient boosting framework built on decision trees. It builds trees sequentially, with each new tree correcting the errors of the previous ones.

Although it was not designed to handle time series, with some feature engineering such as lags, rolling statistics, and seasonal indicators, you can make it learn patterns from time series data.

Run this code on your notebook to fit a LightGBM model on the Netflix data.

lgbm = LGBMRegressor()
lgbm_model = RegressionModel(lags=12, model=lgbm)
lgbm_model.fit(train)
lgbm_forecast = lgbm_model.predict(len(val))

In the code above, the lags argument is set to 12, which means the model uses the closing prices from the 12 days before a given day as input features to predict that day’s price.
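
To make the idea of lags concrete, here is a conceptual sketch of the kind of table a lag-based regression model learns from. It uses pandas on invented prices and only 3 lags to keep the output small. It is not Darts' internal implementation.

```python
import pandas as pd

# Hypothetical closing prices for seven consecutive days
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0], name="Close")

n_lags = 3  # the tutorial uses 12; 3 keeps the table readable
features = pd.DataFrame(
    {f"lag_{k}": close.shift(k) for k in range(1, n_lags + 1)}
)
features["target"] = close

# The first n_lags rows have no complete history, so they are dropped
features = features.dropna().reset_index(drop=True)
print(features)
```

Each row pairs a day's price (target) with the prices from the previous three days, so a regressor such as LightGBM can treat forecasting as an ordinary supervised learning problem.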

Let’s have a view of the forecast by running the following code.

series.plot(label='actual')
lgbm_forecast.plot(label='forecast')
plt.legend()

You can read more about tuning the LightGBM model from the Darts documentation to improve the above model.

How to Forecast with Deep Learning models

You can go for deep learning models designed for time series, such as LSTM, a kind of Recurrent Neural Network (RNN) designed to capture long-term dependencies in sequential data.

Run the following code to build the LSTM model.

lstm_model = RNNModel(model='LSTM', input_chunk_length=12, output_chunk_length=6, n_epochs=100)
lstm_model.fit(train)
lstm_forecast = lstm_model.predict(len(val))

Now let’s visualize the forecast and see what we have.

series.plot(label='actual')
lstm_forecast.plot(label='forecast')
plt.legend()

You can look up the Darts documentation to improve the model and check out other deep learning models also.

Model Evaluation

Now that you have three models, you need to select the best one among them using the Mean Absolute Percentage Error (MAPE).

It expresses the average absolute error as a percentage of the actual values, and the closer your value is to 0, the better your model.
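
As a worked example of that definition, here is a minimal plain-Python version of MAPE on invented numbers. The function name mape_percent is hypothetical and is not the Darts function; it assumes none of the actual values are zero.

```python
def mape_percent(actual, forecast):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

# Invented numbers for illustration
actual = [100, 200, 400]
forecast = [110, 180, 400]

score = mape_percent(actual, forecast)
print(f"MAPE: {score:.2f}%")  # MAPE: 6.67%
```

The first two forecasts are each off by 10% and the third is exact, so the average error is 20% divided by 3, or about 6.67%.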

Run the following to print the MAPE of each respective model.

arima_error = mape(val, arima_forecast)
print("MAPE:", arima_error)
lgbm_error = mape(val, lgbm_forecast)
print("MAPE:", lgbm_error)
lstm_error = mape(val, lstm_forecast)
print("MAPE:", lstm_error)

> MAPE: 38.33262525601514
> MAPE: 39.00241495209449
> MAPE: 38.82910057097827

The model with the lowest MAPE is the ARIMA model with approximately 38.33, which means it’s our best-performing model.

Backtesting

Darts has a feature called backtesting that allows you to evaluate your models based on historical data, using a rolling forecast.

Backtesting is like a time machine for forecasting. It simulates how your model would have performed in the past by repeatedly training it on historical data up to a certain point, making a prediction for the next step, then moving forward, and repeating the process.

This rolling evaluation simulates how the model would behave in real-world conditions, where future data is unknown, helping you measure its consistency and reliability over time, instead of just testing it once on a single validation set.
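
The rolling procedure described above can be sketched in plain Python. This is a simplified illustration, not how Darts implements it: the naive_forecast function is a hypothetical stand-in model that predicts the last observed value, and the series is invented.

```python
def naive_forecast(history):
    """Hypothetical stand-in model: predict that the next value equals the last one."""
    return history[-1]

def rolling_backtest(values, start, stride=1):
    """Forecast one step ahead from each origin, using only the data before it."""
    predictions, actuals = [], []
    for t in range(start, len(values), stride):
        history = values[:t]             # everything the model is allowed to see
        predictions.append(naive_forecast(history))
        actuals.append(values[t])        # the value that was hidden from it
    return predictions, actuals

# Invented series; backtesting starts at 60% of its length
values = [10, 12, 11, 13, 14, 16, 15, 17, 18, 20]
start = int(len(values) * 0.6)

preds, actuals = rolling_backtest(values, start)
print(preds)    # [16, 15, 17, 18]
print(actuals)  # [15, 17, 18, 20]
```

At every step the model sees only the past, which is what makes the evaluation resemble real forecasting conditions.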

Since the ARIMA model is currently our best-performing model, run the code below to implement backtesting.


# Perform backtesting on the training + validation series
backtest_series = train.concatenate(val)

# Backtest
backtest_forecast = arima_model.historical_forecasts(
    series=backtest_series,
    start=0.8,          # fraction of the series to start forecasting from
    forecast_horizon=len(val),
    stride=1,           # step size of rolling forecast
    retrain=True,       # retrain the model at each step
    verbose=True
)

# Compute metrics
error = mape(backtest_series[-len(val):], backtest_forecast)
print(f"MAPE: {error:.2f}%")

> historical forecasts: 100%|██████████| 1/1 [00:02<00:00,  2.69s/it]MAPE: 47.27%

In the code above,

The start argument defines where to start backtesting, which in this case is the last 20% of the series.
The forecast_horizon is how many steps ahead to forecast at each point.
The stride is how frequently to retrain/forecast.
The retrain=True refits the model at each step for realistic evaluation.

You can see that the MAPE, after backtesting, is higher because backtesting is more realistic, and it is more difficult to achieve a lower MAPE.

On your own, you can try to replicate backtesting for the other models.

Hyperparameter Tuning

The ARIMA model has three hyperparameters:

p which is the AR order
d which is the differencing order
q which is the MA order

You can use either grid or random search to tune your ARIMA model in Darts.

# Define possible values
p_values = range(0, 4)
d_values = range(0, 3)
q_values = range(0, 4)

best_mape = float('inf')
best_params = None

for p, d, q in itertools.product(p_values, d_values, q_values):
    try:
        arima_model = ARIMA(p=p, d=d, q=q)
        arima_model.fit(train)
        arima_forecast = arima_model.predict(len(val))
        arima_error = mape(val, arima_forecast)
        if arima_error < best_mape:
            best_mape = arima_error
            best_params = (p, d, q)
    except Exception as e:
        # Some combinations may fail
        continue

print(f"Best ARIMA params: p={best_params[0]}, d={best_params[1]}, q={best_params[2]} with MAPE={best_mape:.2f}%")

> Best ARIMA params: p=2, d=0, q=3 with MAPE=35.95%

In the above code, you define a range of possible values for the p, d, and q components, iterate over each combination of those values, and choose the model with the best MAPE among them.
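
The loop above is a grid search, so it fits one model for every combination. A random search would fit only a sample of them. Here is a minimal sketch of how the two candidate lists differ, using only the standard library. The sample size of 10 and the seed are arbitrary choices for illustration.

```python
import itertools
import random

p_values = range(0, 4)
d_values = range(0, 3)
q_values = range(0, 4)

# Grid search tries every combination: 4 * 3 * 4 = 48 fits
grid = list(itertools.product(p_values, d_values, q_values))
print(len(grid))  # 48

# Random search tries only a sample of the grid
rng = random.Random(42)            # fixed seed so the sample is reproducible
candidates = rng.sample(grid, 10)
print(candidates)
```

You could then run the same fit-and-score loop over candidates instead of the full grid, trading some thoroughness for speed.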

Note that each model has its specific parameter you would have to tune, and you will need to check the Darts documentation for the hyperparameters of other models.

Real-World Use Cases

Forecasting time series data has a lot of real-world applications, some of which are:

Stock price prediction: Like the dataset used in this tutorial, forecasting is used in finance for stock price prediction, allowing investors to manage risk.
Demand forecasting for inventory: As a store owner, you can forecast product demands based on past sales of a product. This lets you know products that are in high demand.
Energy consumption prediction: Governments, industries, and consumers can plan and manage energy production, distribution, and consumption efficiently, based on data from past usage. This helps to avoid blackouts and wastage, enabling them to prepare ahead.

Best Practices

Always visualize residuals: Residuals are the difference between forecasted values and actual values. You must visualize them to detect outliers and unusual events.
Perform proper backtesting: Backtesting lets you see a more realistic model, subjected to various changes that can occur in real life. When you backtest all your models, you end up getting a model that performs well when forecasting.
Avoid data leakage: Do not train your models on the validation set, and when you use cross-validation, make sure each fold trains only on data that comes before the period it is validated on.
Use domain knowledge for feature engineering: Ensure you understand the data you are working with. This comes in handy in feature engineering, when you want to come up with new features to help your forecasting model, especially in multivariate time series forecasting.
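
For the first practice in the list above, residuals can be computed and screened with a few lines of Python. This is a rough sketch on invented values. The cutoff of 2 standard deviations is a common rule of thumb, not a fixed rule.

```python
from statistics import mean, stdev

# Invented actual and forecast values for illustration
actual = [100, 102, 101, 105, 140, 104, 103]
forecast = [101, 101, 102, 104, 105, 105, 102]

# Residual = actual value minus forecasted value
residuals = [a - f for a, f in zip(actual, forecast)]

# Flag residuals more than 2 standard deviations from the mean residual
mu, sigma = mean(residuals), stdev(residuals)
outliers = [i for i, r in enumerate(residuals) if abs(r - mu) > 2 * sigma]

print(residuals)  # [-1, 1, -1, 1, 35, -1, 1]
print(outliers)   # [4]
```

The fifth point stands out because the forecast missed a sudden jump, which is exactly the kind of unusual event a residual plot would reveal.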

Conclusion

This tutorial is more like an overview, especially if you are new to time series, but you can build a lot just from what you have learned.

You already have an idea of what time series and forecasting are, and how you can use the Darts Python library to achieve that.

You also learned of various models for forecasting time series data, and how you can apply techniques such as backtesting and hyperparameter tuning to achieve better results.

Another interesting thing with Darts is its ability to handle hierarchical time series. Here, data is structured at aggregated levels.

Darts is one of the most powerful time series libraries in Python and has a lot of models to handle various cases. You can proceed to explore models such as Transformers and also multi-series forecasting, which are used for special use cases.

If you are interested in more data science and statistics articles, don’t forget to check out my blog.

The Rise of AI Analytics and What It Means for Industries

Manish Shivanandhan — Tue, 08 Jul 2025 15:53:59 +0000

Businesses today are flooded with data. From online purchases to hospital records, every action generates information.

But data alone is not useful. What matters is how companies use it to make decisions.

This is where AI analytics comes in. It combines artificial intelligence with data analysis to find patterns, make predictions, and suggest actions.

In this article, you will learn what AI analytics is, why it’s growing so fast, and how it’s changing different industries. You will also learn about some of the open-source tools leading this change.

What is AI Analytics?
Why is AI Analytics Growing So Fast?
Areas Where AI Analytics Shine
Core Benefits of AI Analytics
Challenges of AI Analytics
The Role of Humans in AI Analytics
Popular Open-Source AI Analytics Tools
The Future of AI Analytics
Conclusion

What is AI Analytics?

AI analytics uses artificial intelligence to process and analyse data.

Traditional data analytics focused on what happened in the past. AI analytics goes further. It can tell you why something happened, what will likely happen next, and what you should do about it.

For example, if sales drop in a store, traditional reports only show the numbers.

AI analytics looks at customer behaviour, market trends, and past data to explain why sales dropped and suggest ways to increase them again.

Why is AI Analytics Growing So Fast?

The primary reason is the explosion of data.

Companies now collect massive amounts of data from websites, apps, sensors, and machines. Traditional tools can’t handle this scale of information, but AI models are built for it.

Another reason is cheaper computing power. In the past, running AI models required expensive hardware. Today, with cloud computing and open-source software like TensorFlow and PyTorch, any company can use AI analytics.

A third reason is better algorithms. AI models have become smarter and easier to use. Libraries like Scikit-learn and H2O.ai offer ready-made models that save time and effort for data scientists.

Areas Where AI Analytics Shine

AI Analytics in Retail

Retail companies use AI analytics to understand customers better and improve their shopping experience. One common use is personalised recommendations. Online stores use AI models to suggest products based on your browsing and purchase history. Libraries like LightFM help build these recommendation systems.

AI analytics also helps retailers manage inventory. By predicting what products will sell in the coming weeks, stores can stock up accordingly and reduce waste. Some retailers even use AI to design store layouts that increase sales by studying how customers move inside stores.

AI Analytics in Healthcare

Thanks to AI, data analytics in the health industry has seen huge growth. Hospitals now use AI analytics to predict which patients are at risk of readmission. This helps doctors take preventive action before problems get worse.

AI also improves diagnosis accuracy. For example, deep learning models can analyse X-rays and MRI scans to detect diseases like cancer at an early stage. Hospitals use open-source tools like TensorFlow to build these image recognition models.

Another area is staff management. AI analytics helps hospitals allocate nurses and doctors based on predicted patient inflow, making operations more efficient.

AI Analytics in Finance

Banks and financial firms rely heavily on AI analytics.

One important use is fraud detection. AI models analyse millions of transactions in real time to spot unusual patterns, stopping fraud before it happens. Open-source tools like H2O.ai help build these models efficiently.

Another use is credit scoring. Traditional credit scores only looked at a few factors. AI analytics can process more data points, creating fairer and more accurate credit scores for loan approvals.

Investment firms use AI analytics to predict stock market trends. Tools like Prophet by Facebook allow analysts to forecast future prices based on past data, improving investment strategies.

AI Analytics in Manufacturing

Factories use AI analytics to improve operations and reduce costs. One major use is predictive maintenance. Machines often fail without warning, causing production delays. AI analytics predicts when machines are likely to break down by analysing sensor data, allowing timely maintenance.

Factories also use AI to optimise production schedules. AI models analyse past production data, raw material availability, and market demand to plan manufacturing activities efficiently. This reduces costs and increases output.

Core Benefits of AI Analytics

AI analytics helps companies make faster and better decisions. It processes data in minutes and suggests the best course of action. This saves time and resources.

Using AI analytics also leads to cost savings. Automation reduces the need for manual analysis and lowers the chance of human error.

Finally, AI analytics gives companies a competitive advantage. Businesses that use AI can respond to market changes quickly, stay ahead of competitors, and offer better services to customers.

Challenges of AI Analytics

Despite its many benefits, AI analytics has some challenges.

One is data privacy. Industries like healthcare and finance deal with sensitive data that must be protected while using AI models.

To mitigate this, teams can implement strong data governance policies, use data anonymisation techniques, and ensure compliance with regulations like GDPR and HIPAA.

Another challenge is the lack of skilled professionals. Building AI models requires knowledge of data science and programming, which many companies still lack today. Businesses can address this by investing in training for existing staff, hiring specialised talent, or using user-friendly AutoML tools that reduce the need for advanced coding skills.

Bias in AI models is also a concern. If the data used to train the model is biased, the AI predictions will also be biased. This can lead to unfair decisions, especially in areas like credit scoring or hiring. To reduce bias, teams should audit the data regularly and involve diverse stakeholders when designing and validating models.

The Role of Humans in AI Analytics

While AI analytics can process huge amounts of data and suggest actions, humans remain essential in the entire process. Data scientists and analysts design the AI models, decide which data to use, and define what questions the AI should answer.

After AI produces results, data scientists analyse its outputs to check for accuracy and relevance. For example, an AI model might suggest increasing inventory for a product, but a human analyst will assess whether other factors like seasonality or upcoming trends have been considered properly.

Monitoring AI models is another crucial role for humans. Over time, models can become outdated if the data they were trained on no longer reflects current realities, a problem known as model drift. Data scientists regularly retrain and test models to maintain their accuracy.

Finally, we have to ensure that AI outputs are ethical and unbiased. We have to check for unfair recommendations or decisions, especially in sensitive areas like healthcare or finance, and adjust models to reduce any bias found.

Popular Open-Source AI Analytics Tools

Several open-source tools are making AI analytics accessible to everyone.

TensorFlow is a deep learning framework by Google used for building complex AI models in healthcare, finance, and retail.
PyTorch is another popular tool, preferred by researchers for its flexibility in building neural networks.
Scikit-learn is widely used for traditional machine learning tasks such as classification and regression.
H2O.ai offers automated machine learning features, making it easier for businesses without large data science teams to build models.
KNIME provides a visual workflow that integrates AI models with business data systems, while Apache Spark MLlib is useful for analysing large datasets quickly.
RapidMiner is also popular for building and deploying data science models in production environments.

The Future of AI Analytics

AI analytics is only going to grow stronger.

In the future, companies will use AI for real-time decision-making and industries will be able to act instantly based on live data streams.

Explainable AI will also become important. Businesses will demand AI models that clearly explain their predictions, building trust in automated decisions.

As AI tools become easier to use, even small businesses will adopt AI analytics to compete with larger firms. For example, a small clinic may use AI to predict patient no-shows and send reminders, improving efficiency and revenue.

Conclusion

AI analytics is changing how industries work. In the healthcare sector, data analytics is helping hospitals save lives through better predictions. Retailers are using AI to personalise shopping experiences. Banks are using it to stop fraud and improve lending decisions. Factories are becoming more efficient with predictive maintenance.

Businesses that start using AI analytics today will lead their industries tomorrow. The time to adopt AI analytics is now, to make better decisions, reduce costs, and stay ahead in this fast-changing world.

Hope you enjoyed this article. You can find me on LinkedIn if you want to connect. If you are interested in taking up data analytics as a career, Google has a free course. See you soon with a new article.

How to Extract YouTube Analytics Data and Analyze in Python

Adejumo Ridwan Suleiman — Wed, 26 Mar 2025 16:05:29 +0000

If you’re a YouTube content creator, you’ll want to make data-driven decisions when posting content. This helps you target the right audience when creating your videos.

YouTube Studio provides YouTube Analytics, where you can get comprehensive data about your channel. But there is a caveat: most of the statistics provided by YouTube Analytics are descriptive and not predictive. Information like future views, future subscriber counts, and the factors influencing watch time or earnings is unavailable, so you’ll need to calculate these metrics yourself.

In this article, you’ll learn how to export data from YouTube Analytics to Python so you can analyze it further or create visualizations. You can even build your own custom dashboard using various Python libraries like Streamlit, Shiny, or Dash.

Here’s what we’ll cover:

Prerequisites
Step 1: Identify the Problem Statement
Step 2: Extract the Data
Step 3: Analyze the Data in Python
- Correlation Analysis
- Audience Retention Analysis
Conclusion

Prerequisites

Active YouTube and YouTube Studio Account
Jupyter Notebook, Google Colab, Kaggle, or any other environment that supports Python
Pandas library installed
Seaborn library installed
Matplotlib library installed

Step 1: Identify the Problem Statement

Before proceeding, we need to know what we’re looking for – because YouTube Analytics has many metrics, and this can get overwhelming. My channel doesn’t have a ton of subscribers, but I have quite a few videos and views. So we’ll use my data as an example.

Just note that this analysis I’ll conduct in this tutorial is specific to my channel and can vary from channel to channel. You’ll be able to use the techniques here to answer the same/similar questions using your data, but your results will be different from mine.

Here are the questions I would like to find an answer for:

Correlation Analysis

Views and watch time – Are longer watch times associated with higher views?
Views and subscribers – Do more views translate to more subscribers?
Impressions and Click-Through Rate (CTR%) – Do more impressions lead to better engagement?
Watch time and average view duration – Are longer videos watched more?

Audience Retention Analysis

Average view duration vs. Video length – Are longer videos watched in full?
Drop-off points – Which duration range has the best retention?
Retention Rate (%) – Watch time divided by duration?

Step 2: Extract the Data

This will open a dashboard showing comprehensive descriptive analytics of your YouTube channel. This can get overwhelming, as there are a lot of metrics and filters with various types of data. This is why I emphasized the importance of knowing your problem and identifying your questions before diving in.

You can select the range of data you are interested in using the date dropdown (1 in the image below) and the Compare to button (2) to compare data from different date ranges.

The column headers you see in the dashboard are the filters. Each contains different metrics, and you can find some metrics in one or more filters. You can play around with the tabs and dropdowns to understand them better.

This is just a foundation for understanding your YouTube channel performance. If you have a long-running channel with a large number of subscribers and views, trust me – you can get a lot of insights from your data.

For this tutorial, I will select my entire lifetime data (1) and click the download button at the top right-hand corner (2).

This will display two options: whether to open the data in Google Sheets in a new tab or download the CSV file.

Since we want to use the data in Python, select the option to download the CSV file. After downloading the file, extract the files from the zip folder, and inside the extracted folder, you will see three CSV files: Chart data.csv, Table data.csv, and Totals.csv.

For this tutorial, we are interested in the Table data.csv. Click the data to open and view it in Excel to do some manual data cleaning before importing the data in Python.

The data is a list of all the videos on my YouTube channel, which is forty (yours might have more or fewer). Remove the first row, which is the Total row, and save the changes.

Here are the columns in the dataset:

Content: The video id
Video title: The video title
Video publish time: The day the video was published
Duration: The video duration in seconds
Views: The number of views per video
Watch time: The estimated amount of video watch time by your audience in hours
Subscribers: Change in total subscribers found by subtracting subscribers lost from subscribers gained for the selected date and region.
Average view duration: Estimated average minutes watched per video.
Impressions: Number of times your videos were shown to viewers.
Impressions click-through rate (%): Percentage of impressions that led to a viewer clicking on your video.

Step 3: Analyze the Data in Python

Go to your Jupyter Notebook and import the Pandas, Seaborn, and Matplotlib libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Next, import the Table data.csv file.

# Load data
df = pd.read_csv("/content/Table data.csv")

Correlation Analysis

For our problem statement, we are going to plot a correlation heatmap of the following variables: Views, Watch time (hours), Subscribers, Average view duration, Impressions, and Impressions click-through rate (%). This shows the strength and direction of the relationship between each pair.

# Convert "Average view duration" (formatted as H:M:S) to seconds
df['Average view duration'] = pd.to_timedelta(df['Average view duration']).dt.total_seconds()

# Select relevant columns for correlation analysis
correlation_data = df[['Views', 'Watch time (hours)', 'Subscribers', 'Average view duration', 'Impressions', 'Impressions click-through rate (%)']]

# Compute correlation matrix
corr_matrix = correlation_data.corr()

# Visualization using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("YouTube Analytics Correlation Heatmap")
plt.show()

The correlation coefficient ranges from -1 to 1. Values below 0 indicate a negative relationship, and values above 0 indicate a positive one. The closer the value is to -1 or 1, the stronger the relationship, while values near 0 indicate little or no linear relationship.
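
To make the scale concrete, here is a small plain-Python sketch of the Pearson correlation coefficient, which is the default method used by the pandas .corr() call above. The numbers and the made_up_metric name are invented for illustration and are not taken from the channel data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

views = [100, 200, 300, 400]
watch_hours = [2, 4, 6, 8]       # rises in step with views
made_up_metric = [8, 6, 4, 2]    # falls as views rise

print(pearson_r(views, watch_hours))     # perfectly positive relationship
print(pearson_r(views, made_up_metric))  # perfectly negative relationship
```

Real metrics rarely line up this neatly, which is why the heatmap values fall somewhere between the two extremes.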

Based on the plot above, here are the key insights:

Views and watch time: There's a strong correlation (0.94) between views and watch time, suggesting that as videos get more views, they also accumulate more watch hours, proportionally.
Views and impressions: There's a strong correlation (0.89) between views and impressions, indicating that videos that are shown more frequently in recommendations and search results tend to get more views.
Average view duration: This metric has very weak correlations with almost all other metrics. This is particularly notable for views (0.06), subscribers (0.01), and impressions (0.03).
Subscribers and metrics: Subscribers have a moderate to strong correlation with views (0.75) and impressions (0.79) and a weaker correlation with click-through rate (0.54).
Click-through rate: Has moderate correlations with views (0.69) and watch time (0.66) but a weaker correlation with subscribers (0.54).

The most significant insight is that average view duration appears to operate independently from other metrics. This suggests that on my YouTube channel, a video's ability to retain viewers throughout its length isn't necessarily connected to how many people watch it, how often it's recommended, or how many subscribers the channel has.

This implies that the strategies I would implement to increase my views, subscribers, and impressions might differ from those needed to improve average view duration, an important factor in YouTube's recommendation algorithm. This means I need to look at other YouTube metrics that have a relationship with average view duration, which is a topic for another article.

Audience Retention Analysis

To analyze audience retention, we need to create a new variable Retention Rate (%), which is calculated by dividing a video’s Average view duration by the Duration and expressing it as a percentage.


# Calculate retention rate as (Average View Duration / Total Video Duration) * 100
df['Retention Rate (%)'] = (df['Average view duration'] / df['Duration']) * 100
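
As a worked example of this calculation, here is a minimal sketch on two hypothetical videos. The titles and numbers are invented. It repeats the H:M:S conversion from earlier so that it can run on its own.

```python
import pandas as pd

# Hypothetical videos; durations are in seconds, as in the export
sample = pd.DataFrame({
    "Video title": ["Short clip", "Long tutorial"],
    "Duration": [300, 3000],
    "Average view duration": ["0:02:00", "0:05:00"],
})

# Convert H:M:S strings to seconds
sample["Average view duration"] = (
    pd.to_timedelta(sample["Average view duration"]).dt.total_seconds()
)

# Retention rate = average seconds watched / total seconds * 100
sample["Retention Rate (%)"] = (
    sample["Average view duration"] / sample["Duration"] * 100
)
print(sample)
```

The long tutorial is watched for more minutes in absolute terms (5 versus 2), yet its retention rate is only 10% compared with 40% for the short clip.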

Next, sort the videos in descending order of Retention Rate (%) and display the 10 videos with the highest retention rate.

# Sort videos by retention rate
df_sorted = df.sort_values(by='Retention Rate (%)', ascending=False)

# Display top 10 videos with highest retention
df_sorted[['Video title', 'Duration', 'Average view duration', 'Retention Rate (%)']].head(10)

From the table above, you will notice that most of the videos in the top 10 are no longer than 503 seconds, which is approximately 8 minutes. This implies that my audience is interested in short to mid-length videos.

Most videos with a high retention rate have a duration of less than 4 minutes, with retention rates ranging from 27% to 40%. With this insight, I can aim to keep my next uploads within 5 to 8 minutes.

Let’s take a look at the bottom 10 videos with a low retention rate:

# Sort videos by retention rate
df_sorted = df.sort_values(by='Retention Rate (%)', ascending=False)

# Display bottom 10 videos with lowest retention
df_sorted[['Video title', 'Duration', 'Average view duration', 'Retention Rate (%)']].tail(10)

From the above information, you will notice that long videos in my channel spanning approximately 22 - 58 minutes have a low retention rate. This further supports the claim above that my audience is more interested in shorter videos.

We can also draw a scatter plot of Retention Rate (%) against Duration to summarize the above tables.

# Set style for plots
sns.set_style("whitegrid")

# Plot Retention Rate vs. Video Duration
plt.figure(figsize=(12, 6))

sns.scatterplot(data=df, x='Duration', y='Retention Rate (%)', hue='Views', size='Views', sizes=(20, 200), palette='coolwarm')
plt.title("Audience Retention vs. Video Duration")
plt.xlabel("Video Duration (seconds)")
plt.ylabel("Retention Rate (%)")
plt.legend(title="Views", loc="upper right")

plt.show()

The scatter plot above shows the relationship between audience retention rate (y-axis, measured as a percentage) and video duration (x-axis, measured in seconds) for various videos. Here are the key observations:

There's a clear negative correlation between video duration and retention rate – as videos get longer, the retention rate generally decreases.
The highest retention rates (35-40%) are found in shorter videos, mostly under 500 seconds (around 8 minutes).
Videos over 1500 seconds (25 minutes) consistently show retention rates below 15%.
The size and color of the dots represent the number of views, with larger, redder dots indicating more views (up to 1000) and smaller, blue dots representing fewer views (around 200).
Interestingly, some mid-length videos (around 500 seconds) have both higher view counts (indicated by larger red dots) and decent retention rates of about 25%.
The longest video in the dataset (at around 3500 seconds or 58 minutes) has a retention rate of about 14% and relatively few views.

This plot further confirms the claim that shorter videos tend to better maintain audience attention on my channel, though some mid-length videos can still perform well in terms of both retention and view count.
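
If you want a number to go with the visual impression, a correlation coefficient is one option. This is only a sketch on made-up values (the DataFrame below is a placeholder, not my channel data), assuming the same two column names used in the plot:

```python
import pandas as pd

# Placeholder data for illustration only.
df = pd.DataFrame({
    "Duration": [180, 420, 900, 2100, 3480],
    "Retention Rate (%)": [38.0, 31.5, 22.0, 12.5, 14.0],
})

# Pearson correlation: -1 is a perfect negative linear relationship, 0 is none.
pearson = df["Duration"].corr(df["Retention Rate (%)"])

# Spearman correlation is the Pearson correlation of the ranks, so it is
# less sensitive to a few very long videos stretching the x-axis.
spearman = df["Duration"].rank().corr(df["Retention Rate (%)"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A value close to -1 would support the "longer videos, lower retention" reading, but with a small channel the estimate can move a lot when a single video is added or removed.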

Conclusion

What we’ve learned from my data is just the tip of the iceberg. YouTube has many metrics, and because my channel is not monetized and has few subscribers and videos, I don’t have data on monetization, demographics, and other metrics.

But after reading this article, I hope that you can think of endless information you want to get based on these metrics. You can even forecast your views, subscriber counts, and revenue for the next days or months. You can also perform a multivariate time series analysis to see how these factors affect your primary variable of interest.

If you find this article interesting, don’t forget to check out my blog for other interesting articles, follow me on Medium, connect on LinkedIn, and subscribe to my YouTube channel.

Microsoft Excel: 14 Time-Saving Keyboard Shortcuts

Eamonn Cottrell — Tue, 15 Oct 2024 17:57:43 +0000

Microsoft Excel is the quintessential spreadsheet software used everywhere from universities to small businesses to enterprises.

It’s a lifesaver for countless financial professionals, data analysts, and teachers. But it’s also one of those programs that virtually everyone in any role can benefit from learning.

A handful of shortcuts can go a long way in increasing your productivity (and enjoyment) while using Excel.

In this article, I’ll detail some of the many shortcuts that I have found helpful throughout my studies and career.

Excel Shortcuts We’ll Cover:

How to Execute Shortcuts in Excel
Shortcut to Create a Table
Shortcut to Create a Pivot Table
Shortcut to AutoFit Column Sizes
Shortcut to Open Format Cells
Shortcut to Center Contents of Cell
Shortcut to Fill Color
Shortcut to Fill Contents Down (or Right)
Shortcut to Show or Hide Gridlines
Shortcut to Show all Formulas
Shortcuts for Navigation in Excel
Shortcut to Open the AutoFilter Menu in Excel
Shortcut to Create a Slicer in Excel
Shortcut to Create Checkboxes in Excel
Shortcut to Create Charts in Excel
Other Notes
More Shortcuts
Got Sheet

Here’s a video walkthrough of everything we’ll cover in this article:

How to Execute Shortcuts in Excel

In Excel, the more you can learn to do without touching your mouse, the better. Often, by keeping your hands on your keyboard, you can save a lot of time in each project.

As such, virtually every command imaginable is available as a keyboard shortcut.

The basic ones like CTRL + S for save and CTRL + C for copy are present. But Excel goes a step further…

The real power comes from the alt shortcuts. By pressing sequences of keys that typically start with alt, almost all of the actions in the Ribbon become available.

By simply pressing the alt key, all of the shortcut sequences available from the current view become highlighted in yellow on the ribbon:

Incidentally, some of the shortcuts’ sequences are separated by commas, while at other times two letters appear next to each other. Their execution is the same. Simply press them in sequence.

Below, you’ll find some of my favorites, but keep in mind that you can easily view all the available shortcuts at any time by pressing alt and then continuing to press the appropriate keys for the corresponding tabs and actions.

Shortcut to Create a Table

Keyboard Shortcut: `ctrl + t`

Tables in Excel are often the preferred format to begin manipulating and visualizing data. As long as the active cell is inside your data range, pressing ctrl + t will pop up a dialog and automatically select the data range to be used for the table.

It is adjustable if Excel gets a column or row wrong, and you can also toggle table headers on or off from this initial box.

Shortcut to Create a Pivot Table

Keyboard Shortcut: `alt + n, v, t`

Want to appear a lot smarter than you are? Learn the basics of pivot tables in ten minutes. Your coworkers will remain impressed for weeks.

To create a pivot table, simply click somewhere in your data range and press alt + n, v, t.

Excel is smart enough to recognize the data range you likely want included even if it isn’t already formatted as a table.

A dialog box will pop up for you to confirm the data range and location for the pivot table.

Shortcut to AutoFit Column Sizes

Keyboard Shortcut: `alt + h, o, i`

Ever get tired of resizing your columns so that the text in the cells doesn’t clip or spill? You can always double click the column boundary headings. This resizes the column to fit the widest cell’s contents.

The keyboard shortcut is much more efficient and allows you to autofit multiple columns in one fell swoop. Simply click and drag across the column headers to select any number of columns. Execute the alt + h, o, i shortcut and all the columns in your active range will autofit their width.

You can also autofit cells individually by selecting one or more active cells and performing the same shortcut.

Shortcut to Open Format Cells

Keyboard Shortcut: `CTRL + 1` or `alt + h, fm`

Unless you are content with mediocrity, you’ll be formatting the cells in your spreadsheet at some point to create more readable, user-friendly content.

The simplest shortcut to open up the Format Cells window is CTRL + 1, although if you want to flex your dexterity, alt + h, fm will get you there as well.

Shortcut to Center Contents of Cell

Keyboard Shortcut: (horizontal) `alt + h, a, c`, (vertical) `alt + h, a, m`

We have it easy compared to web developers. There seems to be a never-ending supply of articles and videos reminding developers how to center divs.

In Excel, we need only remember two quick shortcuts.

For horizontal centering: alt + h, a, c
For vertical centering: alt + h, a, m

There are other shortcuts for left, right, top and bottom alignment, but most of the time when we change the original alignment, it’s to center it.

Shortcut to Fill Color

Keyboard Shortcut: `alt + h, h`

For a quick highlight, it takes precious seconds to mouse up to the fill color icon. Pressing alt + h, h quickly toggles the color selection open.

Once it’s open, you can leave your mouse to the side and arrow down to your favorite color.

Shortcut to Fill Contents Down (or Right)

Keyboard Shortcut: (down) `CTRL + D`, (right) `CTRL + R`

One of the most powerful features of Excel is the ability to drag formulas and functions down or across many cells, effectively reproducing a single calculation many times on different pieces of data.

By typing a formula in cell A8 in the image below, we can then highlight A8 and drag down to our heart’s content. Then, by pressing CTRL + D, the formula will be copied down into every highlighted cell.

By default, it will also preserve the relative reference of the cell. In other words, the next cell will contain the formula A7 + A8 and then A8 + A9, and so on.

Shortcut to Show or Hide Gridlines

Keyboard Shortcut: `alt + w, vg`

The mark of a real data analyst is not the quality of their reports, but the precision of their workbook. Gridlines have got to go. If you need lines, you can add borders.

Toggle off the gridlines with alt + w, vg.

And if you need those borders immediately, highlight your data range and press alt + h, b to open up the border menu. If you simply need all borders, alt + h, b, a will do the trick.

Shortcut to Show all Formulas

Keyboard Shortcut: `CTRL + ~`

You’ll likely never lose control of a workbook.

But in the event you access one of your less rigorous colleagues' workbooks and need to see what functions they’ve Frankensteined together, press CTRL + ~ to display all the formulas in the spreadsheet instead of their results.

Shortcuts for Navigation in Excel

Keyboard Shortcuts: `CTRL + arrow keys` (and others)

Navigating the grid can be very fast with the keyboard. Pressing CTRL + the arrows, the home and the end buttons will warp you all over the active sheet.

Using the arrows and CTRL, you go to the last nonblank cell in the row or column.

Using CTRL + home or end, you go to the beginning and the end of the worksheet, respectively. (When inside a table, home and end take you to the beginning and end of the table only.)

Shortcut to Open the AutoFilter Menu in Excel

Keyboard Shortcut: `alt + down arrow`

Another superpower in Excel is the ease with which we can filter and sort large pieces of data. To access the AutoFilter Menu quickly, press alt + down arrow while in the header for the column you’d like to filter.

Shortcut to Create a Slicer in Excel

Keyboard Shortcut: `alt + n, sf`

For an even more user-friendly method of filtering, you can insert a slicer directly onto the spreadsheet by pressing alt + n, sf.

Shortcut to Create Checkboxes in Excel

Keyboard Shortcut: `alt + n, cb`

If you’re anything like me, you’ll find a way to use checkboxes in almost every spreadsheet you create. They’re extremely useful for toggling selections on and off in a workbook, and as of June 2024, Microsoft has made them available in the production version of Excel.

Press alt + n, cb to insert a checkbox in a cell.

Shortcut to Create Charts in Excel

Keyboard Shortcut: `alt + n, r`

There are a ton of chart types in Excel. To quickly open up the recommended charts dialog box, we can press alt + n, r.

Or, if we know we want a specific type of chart, there are multiple options to shortcut straight to them, like alt + n, C1, alt + n, N1, alt + n, SA and so on.

More Shortcuts

There are a zillion shortcuts available in Excel. If you want to check out the full list, you can find a current version maintained by Microsoft here.

Got Sheet

Come join my free newsletter, Got Sheet. I show people how to get good at spreadsheets every week.

You can find me over on YouTube as well.

Data Analysis with Python – How I Analyzed My Empire State Building Run-Up Performance

Jose Vicente Nunez — Wed, 08 May 2024 16:56:28 +0000

A tower running race is a race in which you run up the stairs of a building. These happen around the world. I got the chance to participate in the 2023 edition of the Empire State Building Run-Up in NYC.

The Empire State Building Run-Up (ESBRU)—the world’s first and most famous tower race—challenges runners from near and far to race up its famed 86 flights—1,576 stairs.

While visitors can reach the building’s Observatory via elevator in under one minute, the fastest runners have covered the 86 floors by foot in about 10 minutes.

Leaders in the sport of professional tower-running converge at the Empire State Building in what some consider the ultimate test of endurance.

I got lucky and managed to participate in this race. A few days after finishing the race, I realized that I wanted to know more about my performance, and what I could have done better.

So naturally I went to the race organizer website and started looking at the numbers. And it was slow and tedious, plus it brought up more issues:

Getting the data for offline analysis is difficult. You can see your results and others for comparison, but I found that the tools didn't offer an option to download the raw data, and they were clumsy to use.
Most tools out there to analyze race results are paid or do not apply to this type of race. Knowing what to expect reduces your anxiety, allows you to train better, and keeps your expectations in check.

By now you've probably guessed that you can solve the data retrieval issues and post-race analysis using low-cost Open Source tools. This also allows you to apply different techniques to learn about the race and, depending on the quality of the data, even make performance predictions.

This is a very personal piece for me. I will share my race results and give you my biased opinion about the race. 😁

How I Ended Up Running to the Top of the Empire State Building
What You Need to Follow this Tutorial
How to Get the Data using Web Scraping
How to Clean Up the Data
How to Analyze the Data
How to Visualize the results
How to Run the Applications
What Else Can We Learn?

How I Ended Up Running to the Top of the Empire State Building

Many of us have run a regular race at some point in our lives – there are many distances like 5K, 10K, Half Marathon, and Full Marathon. But there is no way to compare how you will perform while running the stairs all the way to the top of one of the most famous buildings in the world.

If you have ever been at the base of the skyscrapers in New York City and have looked up, you get the idea. Picture yourself running up the stairs, all the way to the top, without stopping.

Getting accepted is tough, because unlike a race like the New York Marathon, the Empire State Building can only accommodate around 500 runners (or should I say climbers?).

Add to that the fact that the demand to participate is high, and you can see that your chances of getting in through the lottery are pretty slim (I read somewhere that there are only 50 lottery positions for more than 5,000 applicants).

You can imagine my surprise when I got an email saying that I was selected to participate after trying for 4 years in a row.

I panicked. Have you ever been at the base of the Empire State and looked up? Some days when it's cloudy you can't even see the top of the building.

I wasn't unprepared. But I had to adjust my training routine to be ready for this challenge with a small window of two months, and no experience doing a tower run.

The day of the race came and this is how it went for me:

It was tough. I knew I had to pace myself, otherwise the race would have ended for me on the 20th floor as opposed to the 86th. You have to focus on a "keep going" mentality, regardless of how tired you feel. And then it is over, just like that.
You don't sprint, you climb 2 steps at a time at a steady pace, and you use the handrails to take weight off your legs.
No need to carb load or hydrate too much. If you do well, you will be done in around 30 minutes.
Nobody is pushing anyone. At least for non-elite racers like me, I was alone for most of the race.
I got passed, and I passed a lot of people who forgot the 'pace yourself' rule. If you sprint, you will be toast before floor 25, for sure.

I had a blast and got great satisfaction from having this race ticked off my bucket list, the same way I felt after running the NYC Marathon.

It was time now to do a post-race analysis using several of my favorite Open Source tools, which I'll explain in the next section.

What You Need to Follow this Tutorial

Like the race, most of the challenges in writing this application were mental. You only need to break the main problem down into smaller pieces and then tackle one piece at a time:

Get the data by scraping the website (very few sites allow you to export race results as a CSV).
Clean up the data, normalize it, and make it ready for automatic processing.
Ask questions. Then translate those questions into code and tests, ideally using statistics to get reliable answers.
Present the results. A UI (text or graphical) will do wonders due to its low resource consumption, but charts speak volumes too.

You should have some experience in a programming language to get the most out of this article. My code is written in Python (you will need version 3.8+) and runs on Linux (I used the Fedora 37 distribution).

In a nutshell, I want to show that it is possible to do all the above with Open Source technologies. Then you can reuse this knowledge for other projects, not just for tower race analyses. 😅

I strongly recommend that you get the source code (It is Open Source!). Get your hands dirty, break the scripts, and have fun. You will need Git to clone the repository:

git clone https://github.com/josevnz/tutorials.git
cd tutorials/docs/EmpireStateRunUp/
python -m venv ~/virtualenv/EmpireStateRunUp
. ~/virtualenv/EmpireStateRunUp/bin/activate
pip install --upgrade pip
pip install --upgrade build
pip install --upgrade wheel
pip install --editable .

Or if you just want to run the code while reading this tutorial (using my latest version from PyPI):

python -m venv ~/virtualenv/EmpireStateRunUp
. ~/virtualenv/EmpireStateRunUp/bin/activate 
pip install --upgrade EmpireStateRunUp

We can now move to the next stage: getting the data.

How to Get the Data using Web Scraping

The race results site doesn't have an export feature, and I never heard back from their support team to see if there was an alternate way to get the race data. So the only alternative left was to do some web scraping.

The website is pretty basic and only allows scrolling through each record, so I decided to do web scraping to get the results into a format I could use later for data analysis.

The rules of web scraping

There are 3 very simple rules:

Rule #1: Don't do it. Data flow changes, and your scraper will break the minute you are done getting the data. It will require time and effort. Lots of it.
Rule #2: Re-read rule number 1. If you can't get the data in any other format, then go to rule #3.
Rule #3: Choose a good framework to automate what you can and prepare to do heavy data cleanup (also known as "give me patience for the stuff I can't control, like poorly done HTML and CSS").

I decided to use Selenium Web Driver as it calls a real browser, like Firefox, to navigate the website. Selenium allows you to automate browser actions while you get the same rendered HTML you see when you navigate the site.

Selenium is a complex tool and will require you to spend some time experimenting with what works and what does not. Below is a simple script I wrote to get all the runners' names and race detail links in one run:

import re
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.webdriver import WebDriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions
# AthLinks is nice enough to post the race results and their interface is very human-friendly. Not so machine parsing friendly.
RESULTS = "https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Results"
LINKS = {}


def print_links(web_driver: WebDriver, page: int) -> None:
    for a in web_driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute('href')
        # Anchors without an href attribute return None, which re.search() cannot handle
        if href and re.search('Bib', href):
            name = a.text.strip().title()
            print(f"Page={page}, {name}={href.strip()}")
            LINKS[name] = href.strip()


def click(level: int) -> None:
    button = WebDriverWait(driver, 20).until(
        expected_conditions.element_to_be_clickable((By.CSS_SELECTOR, f"div:nth-child({level}) > button")))
    driver.execute_script("arguments[0].click();", button)
    sleep(2.5)


options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get(RESULTS)
sleep(2.5)
print_links(driver, 1)
click(6)
print_links(driver, 2)
click(7)
print_links(driver, 3)
click(7)
print_links(driver, 4)
click(9)
print_links(driver, 5)
click(9)
print_links(driver, 6)
click(7)
print_links(driver, 7)
click(7)
print_links(driver, 8)
print(len(LINKS))

The code above is hardly reusable, but it gets the job done by doing the following:

Gets the main web page with the driver.get(...) method
Sleeps a little to give the HTML a chance to render, then gets the a (anchor) tags
Finds and clicks the > (next page) button
Repeats these steps for a total of 8 pages, as this is how many pages of results are available (each page has 50 runners)
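
If I wanted to make the script a bit more reusable, the repeated calls could be folded into a single loop. The following is only a sketch: collect_pages, read_page, and go_next are hypothetical names (stand-ins for logic like print_links and click above), and the fake callbacks exist only so the example runs without a browser.

```python
from typing import Callable, Dict, Iterable

# The nth-child positions of the "next page" button, in the order the
# script above clicks them.
BUTTON_LEVELS = [6, 7, 7, 9, 9, 7, 7]


def collect_pages(
    read_page: Callable[[int], Dict[str, str]],
    go_next: Callable[[int], None],
    button_levels: Iterable[int],
) -> Dict[str, str]:
    """Read page 1, then click 'next' and read again once per button level."""
    links: Dict[str, str] = {}
    page = 1
    links.update(read_page(page))
    for level in button_levels:
        go_next(level)
        page += 1
        links.update(read_page(page))
    return links


# Fake callbacks (hypothetical) so the sketch runs without Selenium.
clicked = []


def fake_read_page(page: int) -> Dict[str, str]:
    return {f"Runner {page}": f"https://example.com/Bib/{page}"}


def fake_go_next(level: int) -> None:
    clicked.append(level)


all_links = collect_pages(fake_read_page, fake_go_next, BUTTON_LEVELS)
print(len(all_links), clicked)
```

In a real run, the two callbacks would wrap the Selenium driver, and the list of button positions would still need to be checked by hand against the live page.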


To get the full race results, I wrote the scraper.py code. It handles navigating the multiple pages and extracting the data. Here is a demonstration:
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ esru_scraper /home/josevnz/temp/raw_data.csv
2023-12-30 14:05:00,987 Saving results to /home/josevnz/temp/raw_data.csv
2023-12-30 14:05:53,091 Got 377 racer results
2023-12-30 14:05:53,091 Processing BIB: 19, will fetch: https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Bib/19
2023-12-30 14:06:02,207 Wrote: name=Wai Ching Soh, position=1, {'name': 'Wai Ching Soh', 'url': 'https://www.athlinks.com/event/382111/results/Event/1062909/Course/2407855/Bib/19', 'overall position': '1', 'gender': 'M', 'age': 29, 'city': 'Kuala Lumpur', 'state': '-', 'country': 'MYS', 'bib': 19, '20th floor position': '1', '20th floor gender position': '1', '20th floor division position': '1', '20th floor pace': '42:30', '20th floor time': '1:42', '65th floor position': '1', '65th floor gender position': '1', '65th floor division position': '1', '65th floor pace': '54:03', '65th floor time': '7:34', 'gender position': '1', 'division position': '1', 'pace': '53:00', 'time': '10:36', 'level': 'Full Course'}
...

It does only minimal manipulation of the data from the web page. The purpose of this code is just to get the data as quickly as possible, before the formatting changes.

The data cannot be used as-is yet: it needs cleaning up. And that's the next step in this article.

How to Clean Up the Data

Getting the data is just the first battle of many more to come. You will notice inconsistencies in the data and missing values. To get trustworthy numeric results, you need to make assumptions.

Luckily for me, the dataset is very small (375+ records, one for each runner), so I was able to come up with a few rules to tidy up the data file I was going to use during my analysis.

I also supplemented my data with another data set that has the 3-digit country codes as well as other details, for a nicer presentation.

The data_normalizer.raw_read(raw_file: Path) -> Iterable[Dict[str, Any]] method does the heavy work of fixing inconsistencies in the data before saving it in CSV format.

There are no hard rules here, as the cleanup depends heavily on the data set. For example, to figure out which wave each runner was assigned to, I had to make some assumptions based on what I saw the day of the race.

Let me show you what I mean with some code:
import datetime
from enum import Enum
from typing import Dict

"""
Runners started on waves, but for basic analysis, we will assume all runners were able to run
at the same time.
"""
BASE_RACE_DATETIME = datetime.datetime(
    year=2023,
    month=9,
    day=4,
    hour=20,
    minute=0,
    second=0,
    microsecond=0
)

class Waves(Enum):
    """
    22 Elite male
    17 Elite female
    There are some holes, so either some runners did not show up or there was spare capacity.
    https://runsignup.com/Race/EmpireStateBuildingRunUp/Page-4
    https://runsignup.com/Race/EmpireStateBuildingRunUp/Page-5
    I guessed who went into which category, based on the BIB numbers I saw that day
    """
    ELITE_MEN = ["Elite Men", [1, 25], BASE_RACE_DATETIME]
    ELITE_WOMEN = ["Elite Women", [26, 49], BASE_RACE_DATETIME + datetime.timedelta(minutes=2)]
    PURPLE = ["Specialty", [100, 199], BASE_RACE_DATETIME + datetime.timedelta(minutes=10)]
    GREEN = ["Sponsors", [200, 299], BASE_RACE_DATETIME + datetime.timedelta(minutes=20)]
    """
    The date people applied for the lottery determined the colors. Let's assume that
    General Lottery Open: 7/17 9AM- 7/28 11:59PM
    General Lottery Draw Date: 8/1
    """
    ORANGE = ["Tenants", [300, 399], BASE_RACE_DATETIME + datetime.timedelta(minutes=30)]
    GREY = ["General 1", [400, 499], BASE_RACE_DATETIME + datetime.timedelta(minutes=40)]
    GOLD = ["General 2", [500, 599], BASE_RACE_DATETIME + datetime.timedelta(minutes=50)]
    BLACK = ["General 3", [600, 699], BASE_RACE_DATETIME + datetime.timedelta(minutes=60)]

"""
Interested only in people who completed the 86 floors. So is it either a full course or dnf
"""
class Level(Enum):
    FULL = "Full Course"
    DNF = "DNF"

# Fields are sorted by interest
class RaceFields(Enum):
    BIB = "bib"
    NAME = "name"
    OVERALL_POSITION = "overall position"
    TIME = "time"
    GENDER = "gender"
    GENDER_POSITION = "gender position"
    AGE = "age"
    DIVISION_POSITION = "division position"
    COUNTRY = "country"
    STATE = "state"
    CITY = "city"
    PACE = "pace"
    TWENTY_FLOOR_POSITION = "20th floor position"
    TWENTY_FLOOR_GENDER_POSITION = "20th floor gender position"
    TWENTY_FLOOR_DIVISION_POSITION = "20th floor division position"
    TWENTY_FLOOR_PACE = '20th floor pace'
    TWENTY_FLOOR_TIME = '20th floor time'
    SIXTY_FLOOR_POSITION = "65th floor position"
    SIXTY_FIVE_FLOOR_GENDER_POSITION = "65th floor gender position"
    SIXTY_FIVE_FLOOR_DIVISION_POSITION = "65th floor division position"
    SIXTY_FIVE_FLOOR_PACE = '65th floor pace'
    SIXTY_FIVE_FLOOR_TIME = '65th floor time'
    WAVE = "wave"
    LEVEL = "level"
    URL = "url"

FIELD_NAMES = [x.value for x in RaceFields if x != RaceFields.URL]
FIELD_NAMES_FOR_SCRAPING = [x.value for x in RaceFields]
FIELD_NAMES_AND_POS: Dict[RaceFields, int] = {}
pos = 0
for field in RaceFields:
    FIELD_NAMES_AND_POS[field] = pos
    pos += 1

def get_wave_from_bib(bib: int) -> Waves:
    for wave in Waves:
        (lower, upper) = wave.value[1]
        if lower <= bib <= upper:
            return wave
    return Waves.BLACK

def get_description_for_wave(wave: Waves) -> str:
    return wave.value[0]

I used enums to make it clear what type of data I was working with, especially for the names of the fields. Consistency is key.

As for cleaning the data, there were some obvious fixes I had to apply:

Format the times (pace, race time, and so on) so they can be parsed later
Capitalize some values to make them easier to read
Convert strings to integers early for values like age, position, and so on. If that fails, assign 'not a number'.

We are by no means done massaging the data. A simple function takes care of this stage inside the data module:
# Omitted imports and Enum declarations as they were shown early on. 
# Check the source code for 'data.py' for more details
def raw_csv_read(raw_file: Path) -> Iterable[Dict[str, Any]]:
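    with open(raw_file, 'r', newline='') as raw_csv_file:
        reader = csv.DictReader(raw_csv_file)
        row: Dict[str, Any]
        for row in reader:
            # Start a fresh dict for every row so callers that keep the yielded records do not all share one object
            record: Dict[str, Any] = {}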
            try:
                csv_field: str
                for csv_field in FIELD_NAMES_FOR_SCRAPING:
                    column_val = row[csv_field].strip()
                    if csv_field == RaceFields.BIB.value:
                        bib = int(column_val)
                        record[csv_field] = bib
                    elif csv_field in [ RaceFields.GENDER_POSITION.value, RaceFields.DIVISION_POSITION.value, RaceFields.OVERALL_POSITION.value,  RaceFields.TWENTY_FLOOR_POSITION.value,
                        RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value, RaceFields.TWENTY_FLOOR_GENDER_POSITION.value, RaceFields.SIXTY_FLOOR_POSITION.value, RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value,
                        RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value, RaceFields.AGE.value ]:
                        try:
                            record[csv_field] = int(column_val)
                        except ValueError:
                            record[csv_field] = math.nan
                    elif csv_field == RaceFields.WAVE.value:
                        record[csv_field] = get_description_for_wave(get_wave_from_bib(bib)).upper()
                    elif csv_field in [RaceFields.GENDER.value, RaceFields.COUNTRY.value]:
                        record[csv_field] = column_val.upper()
                    elif csv_field in [RaceFields.CITY.value, RaceFields.STATE.value]:
                        record[csv_field] = column_val.capitalize()
                    elif csv_field in [RaceFields.SIXTY_FIVE_FLOOR_PACE.value, RaceFields.SIXTY_FIVE_FLOOR_TIME.value, RaceFields.TWENTY_FLOOR_PACE.value,
                        RaceFields.TWENTY_FLOOR_TIME.value, RaceFields.PACE.value, RaceFields.TIME.value ]:
                        parts = column_val.strip().split(':')
                        for idx in range(0, len(parts)):
                            if len(parts[idx]) == 1:
                                parts[idx] = f"0{parts[idx]}"
                        if len(parts) == 2:
                            parts.insert(0, "00")
                        record[csv_field] = ":".join(parts)
                    else:
                        record[csv_field] = column_val
                    # Blank out placeholder dashes for every field, not only the last one processed
                    if record[csv_field] in ['-', '--']:
                        record[csv_field] = ""
                yield record
            except IndexError:
                raise
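To make two of those branches easier to follow, here they are isolated as small helpers. This is a sketch for illustration only: pad_time and to_int_or_nan are hypothetical names that do not exist in my package, although they mirror the padding and conversion logic shown above.

```python
import math
from typing import Union


def pad_time(raw: str) -> str:
    """Turn 'M:SS' or 'H:M:S' into zero-padded 'HH:MM:SS'."""
    parts = raw.strip().split(":")
    # Pad single-digit parts, e.g. '7' becomes '07'
    parts = [f"0{part}" if len(part) == 1 else part for part in parts]
    # Times without an hour component get '00' prepended
    if len(parts) == 2:
        parts.insert(0, "00")
    return ":".join(parts)


def to_int_or_nan(raw: str) -> Union[int, float]:
    """Convert to int early; fall back to NaN when the value is not numeric."""
    try:
        return int(raw.strip())
    except ValueError:
        return math.nan


print(pad_time("7:34"))
print(pad_time("1:02:03"))
print(to_int_or_nan("29"), to_int_or_nan("--"))
```

Padding to HH:MM:SS matters later because the time columns are parsed as time deltas, and a consistent format keeps that step predictable.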

The esru_csv_cleaner script is the result of the first-stage cleanup effort: it takes the raw captured data and writes a CSV file with some important corrections:
esru_csv_cleaner --rawfile /home/josevnz/temp/raw_data.csv /home/josevnz/tutorials/docs/EmpireStateRunUp/empirestaterunup/results-full-level-2023.csv

Now with the data ready, we can load it and ask some questions about the race.

How to Analyze the Data

Once the data is clean (or as clean as we can get it), it's time to move into running some numbers. Before writing more code, I took a piece of paper and asked myself a few questions about the race:

Are there any interesting buckets/clusters for age, race time, wave, and country participation?
A histogram for Age and Country would be nice to see
Describe the data! (median, percentiles, and so on)
Find outliers. Is there a way to apply Z-scores here?


I decided to use Python Pandas for this task. This Open Source framework has an arsenal of tools to manipulate the data and to calculate statistics. It also has good tools to perform additional cleanup if needed.
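To give an idea of what "describe the data" and "apply Z-scores" can look like, here is a minimal sketch. The finish times are made-up placeholder values (not the race results), and the cutoff of 2 standard deviations is just a common convention that I am assuming here:

```python
import pandas as pd

# Made-up finish times in minutes; placeholder values, not actual race results.
times = pd.Series(
    [10.6, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 55.0],
    name="minutes",
)

# Count, mean, standard deviation, min, quartiles (including the median), and max
print(times.describe())

# Z-score: how many standard deviations each value sits from the mean
z_scores = (times - times.mean()) / times.std()

# Flag values more than 2 standard deviations away (assumed threshold)
outliers = times[z_scores.abs() > 2]
print(outliers)
```

Z-scores assume the data is roughly bell-shaped, so on a skewed distribution like race times they are a starting point rather than a verdict.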
So how does Pandas work?

Crash Course on Pandas

I strongly recommend that you check out 10 minutes to pandas if you are not familiar with the tool. For my DataFrame, I made the BIB the index: it is unique for each runner, and it has no special value for aggregation functions.
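Here is a small sketch of what indexing by BIB buys you. The three rows below are invented placeholders, and the column names are assumed to match the cleaned CSV:

```python
import pandas as pd

# Invented rows for illustration; these are not real runners.
df = pd.DataFrame({
    "bib": [101, 202, 303],
    "name": ["Runner A", "Runner B", "Runner C"],
    "time": ["00:10:36", "00:25:10", "00:31:02"],
})

# Parse the padded HH:MM:SS strings into time deltas so they can be compared and aggregated
df["time"] = pd.to_timedelta(df["time"])

# Use the BIB as the index: it identifies a runner but should never be summed or averaged
df = df.set_index("bib")

# Look up one runner directly by BIB number
print(df.loc[101, "time"])

# Aggregations now only touch meaningful columns
print(df["time"].median())
```

Keeping the BIB out of the regular columns also means a call like describe() will not report a meaningless "average bib number".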
It's important to note that I also needed to normalize the data at this stage, which I'll explain shortly:
# Omitted imports and Enum declarations as they were shown early on. 
# Check the source code for 'data.py' for more details
def load_data(data_file: Path = None, remove_dnf: bool = True) -> DataFrame:
    """
    * The code removes by default the DNF runners to avoid distortion on the results.
    * Replace unknown/ nan values with the median, to make analysis easier and avoid distortions
    """
    if data_file:
        def_file = data_file
    else:
        def_file = RACE_RESULTS_FULL_LEVEL
    df = pandas.read_csv(
        def_file
    )
    for time_field in [
        RaceFields.PACE.value,
        RaceFields.TIME.value,
        RaceFields.TWENTY_FLOOR_PACE.value,
        RaceFields.TWENTY_FLOOR_TIME.value,
        RaceFields.SIXTY_FIVE_FLOOR_PACE.value,
        RaceFields.SIXTY_FIVE_FLOOR_TIME.value
    ]:
        try:
            df[time_field] = pandas.to_timedelta(df[time_field])
        except ValueError as ve:
            # Chain the original exception so the root cause stays visible in the traceback
            raise ValueError(f'{time_field}={df[time_field]}') from ve
    df['finishtimestamp'] = BASE_RACE_DATETIME + df[RaceFields.TIME.value]
    if remove_dnf:
        df.drop(df[df.level == 'DNF'].index, inplace=True)

    # Normalize Age. Assign the result back instead of calling fillna(inplace=True)
    # on a column selection, which newer pandas versions may not apply to 'df'.
    age_field = RaceFields.AGE.value
    df[age_field] = df[age_field].fillna(df[age_field].median()).astype(int)

    # Normalize state and city
    df.replace({RaceFields.STATE.value: {'-': ''}}, inplace=True)
    for text_field in [RaceFields.STATE.value, RaceFields.CITY.value]:
        df[text_field] = df[text_field].fillna('')

    # Normalize overall, gender, and age/ division positions, 3 levels each
    for position_field in [
        RaceFields.OVERALL_POSITION.value,
        RaceFields.TWENTY_FLOOR_POSITION.value,
        RaceFields.SIXTY_FLOOR_POSITION.value,
        RaceFields.GENDER_POSITION.value,
        RaceFields.TWENTY_FLOOR_GENDER_POSITION.value,
        RaceFields.SIXTY_FIVE_FLOOR_GENDER_POSITION.value,
        RaceFields.DIVISION_POSITION.value,
        RaceFields.TWENTY_FLOOR_DIVISION_POSITION.value,
        RaceFields.SIXTY_FIVE_FLOOR_DIVISION_POSITION.value
    ]:
        median_pos = df[position_field].median()
        df[position_field] = df[position_field].fillna(median_pos).astype(int)

    # Normalize 65th floor pace and time
    for split_field in [
        RaceFields.SIXTY_FIVE_FLOOR_PACE.value,
        RaceFields.SIXTY_FIVE_FLOOR_TIME.value
    ]:
        df[split_field] = df[split_field].fillna(df[split_field].median())

    # Normalize BIB and make it the index
    df[RaceFields.BIB.value] = df[RaceFields.BIB.value].astype(int)
    df.set_index(RaceFields.BIB.value, inplace=True)

    # URL was useful during scraping, not needed for analysis
    df.drop([RaceFields.URL.value], axis=1, inplace=True)

    return df

I do a few things here before handing the converted CSV back to the user as a DataFrame:

Replaced "Not a Number" (nan) values with the median to avoid affecting the aggregation results. This makes analysis easier.

Dropped rows for runners that did not reach floor 86. This makes the analysis easier, and there are too few of them.

Converted some string columns into native data types like integers and timestamps.

Filled in fields affected by a missing gender. A few entries did not have the gender defined, which affected other fields like 'gender_position'. To avoid distortions, these were filled with the median.
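To make the cleanup steps above concrete, here is a minimal sketch on a tiny, made-up DataFrame. The column names and the missing age are assumptions for illustration only, not the real race file. It fills a missing value with the median, converts a string column to a timedelta, and makes the BIB the index. Note that it assigns the result of fillna back to the column rather than using inplace=True on a column selection.

```python
import pandas as pd

# Made-up sample: the second runner has no age recorded (illustrative only)
raw = pd.DataFrame({
    "bib": [19, 22, 16, 11],
    "age": [29.0, None, 42.0, 45.0],
    "time": ["00:10:36", "00:10:52", "00:11:14", "00:11:28"],
})

clean = raw.copy()

# Replace the missing age with the median of the known ages, then cast to int
median_age = clean["age"].median()
clean["age"] = clean["age"].fillna(median_age).astype(int)

# Convert the time strings into native timedelta values
clean["time"] = pd.to_timedelta(clean["time"])

# The BIB is unique per runner, so it makes a good index
clean = clean.set_index("bib")

print(clean)
```

After this, aggregation calls such as clean["age"].mean() or clean["time"].max() work without tripping over missing values or string types.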


In the end, this is what my DataFrame loading looked like:
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ python3
Python 3.11.6 (main, Oct  3 2023, 00:00:00) [GCC 12.3.1 20230508 (Red Hat 12.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

And the resulting DataFrame instance:
>>> # Using the custom load_data function that returns a Pandas DataFrame
>>> from empirestaterunup.data import load_data
>>> load_data('empirestaterunup/results-full-level-2023.csv')
                    name  overall position            time gender  gender position  age  ...  65th floor division position 65th floor pace 65th floor time       wave        level     finishtimestamp
bib                                                                                      ...                                                                                                          
19         Wai Ching Soh                 1 0 days 00:10:36      M                1   29  ...                             1 0 days 00:54:03 0 days 00:07:34  ELITE MEN  Full Course 2023-09-04 20:10:36
22        Ryoji Watanabe                 2 0 days 00:10:52      M                2   40  ...                             1 0 days 00:54:31 0 days 00:07:38  ELITE MEN  Full Course 2023-09-04 20:10:52
16            Fabio Ruga                 3 0 days 00:11:14      M                3   42  ...                             2 0 days 00:57:09 0 days 00:08:00  ELITE MEN  Full Course 2023-09-04 20:11:14
11        Emanuele Manzi                 4 0 days 00:11:28      M                4   45  ...                             3 0 days 00:59:17 0 days 00:08:18  ELITE MEN  Full Course 2023-09-04 20:11:28
249             Alex Cyr                 5 0 days 00:11:52      M                5   28  ...                             2 0 days 01:01:19 0 days 00:08:35   SPONSORS  Full Course 2023-09-04 20:11:52
..                   ...               ...             ...    ...              ...  ...  ...                           ...             ...             ...        ...          ...                 ...
555     Caroline Edwards               372 0 days 00:55:17      F              143   47  ...                            39 0 days 04:57:23 0 days 00:41:38  GENERAL 2  Full Course 2023-09-04 20:55:17
557        Sarah Preston               373 0 days 00:55:22      F              144   34  ...                            41 0 days 04:58:20 0 days 00:41:46  GENERAL 2  Full Course 2023-09-04 20:55:22
544  Christopher Winkler               374 0 days 01:00:10      M              228   40  ...                            18 0 days 01:49:53 0 days 00:15:23  GENERAL 2  Full Course 2023-09-04 21:00:10
545          Jay Winkler               375 0 days 01:05:19      U               93   33  ...                            18 0 days 05:28:56 0 days 00:46:03  GENERAL 2  Full Course 2023-09-04 21:05:19
646           Dana Zajko               376 0 days 01:06:48      F              145   38  ...                            42 0 days 05:15:14 0 days 00:44:08  GENERAL 3  Full Course 2023-09-04 21:06:48

[375 rows x 24 columns]

Once the data was loaded, I was able to start asking questions. For example, to detect the outliers I used a Z-score.
All the analysis logic was kept together in a single module called 'analyze', separate from presentation, data loading, or reports, to promote reuse.
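from pandas import DataFrame, Series
import numpy as np
def get_zscore(df: DataFrame, column: str) -> Series:
    filtered = df[column]
    # Population standard deviation (ddof=0), so z = (x - mean) / std
    return filtered.sub(filtered.mean()).div(filtered.std(ddof=0))

# Selecting a single column yields a Series, so that is what gets returned
def get_outliers(df: DataFrame, column: str, std_threshold: int = 3) -> Series: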
    """
    Use the z-score, anything further away than 3 standard deviations is considered an outlier.
    """
    filtered_df = df[column]
    z_scores = get_zscore(df=df, column=column)
    is_over = np.abs(z_scores) > std_threshold
    return filtered_df[is_over]
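
As a worked example of the z-score rule, the sketch below applies the same formula to a made-up list of 20 ages (not the race data) where one value is far from the rest. The numbers and the bib labels are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Made-up ages for 20 runners; bib 20 is deliberately far from the others
ages = pd.Series(
    [38, 40, 42, 39, 41, 40, 37, 43, 40, 41,
     39, 42, 38, 40, 41, 39, 40, 42, 40, 95],
    index=pd.Index(range(1, 21), name="bib"),
    name="age",
)

# z = (x - mean) / population standard deviation
z_scores = (ages - ages.mean()) / ages.std(ddof=0)

# Keep only the values more than 3 standard deviations from the mean
outliers = ages[np.abs(z_scores) > 3]

print(z_scores.round(2).tail(3))
print(outliers)
```

Only the runner with bib 20 is flagged. Keep in mind that with very small samples a single extreme value also inflates the standard deviation, so the threshold of 3 is a convention rather than a guarantee.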

Also, it is very simple to get common statistics just by calling describe on our data:
from pandas import DataFrame, Series
def get_5_number(criteria: str, data: DataFrame) -> Series:
    # describe() on a single column returns a Series of summary statistics
    return data[criteria].describe()

For example, let me show you summary metrics for different aspects of the race:
>>> from empirestaterunup.data import load_data
>>> df = load_data('empirestaterunup/results-full-level-2023.csv')
>>> from empirestaterunup.analyze import get_5_number
>>> from empirestaterunup.analyze import SUMMARY_METRICS
>>> print(SUMMARY_METRICS)
('age', 'time', 'pace')
>>> for key in SUMMARY_METRICS:
...     ndf = get_5_number(criteria=key, data=df)
...     print(ndf)
... 
count    375.000000
mean      41.309333
std       11.735968
min       11.000000
25%       33.000000
50%       40.000000
75%       49.000000
max       78.000000
Name: age, dtype: float64
count                          375
mean     0 days 00:23:03.461333333
std      0 days 00:08:06.313479117
min                0 days 00:10:36
25%                0 days 00:18:09
50%                0 days 00:21:20
75%         0 days 00:25:13.500000
max                0 days 01:06:48
Name: time, dtype: object
count                          375
mean     0 days 01:55:17.306666666
std      0 days 00:40:31.567395588
min                0 days 00:53:00
25%                0 days 01:30:45
50%                0 days 01:46:40
75%         0 days 02:06:07.500000
max                0 days 05:34:00
Name: pace, dtype: object

Making sure data web scraping, data loading, and analytics work well is a must. Testing is an integral part of writing code, so I kept adding more of it and went back to writing unit tests.
Let's check how to test our code (feel free to skip the next section if you are familiar with unit testing).
Testing, testing, and after that...more testing
I assume you are familiar with writing small, self-contained pieces of code to test your code. These are called unit tests.

The unittest unit testing framework was originally inspired by JUnit and has a similar flavor as major unit testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework. (From the Python docs)

I tried to have a simple unit test for every method I wrote in the code. This saved me lots of headaches down the road: as I refactored the code and found better ways to get the same results, the tests confirmed that the numbers were still correct.
A unit test in this context is a class that extends unittest.TestCase. Each method that starts with test_ is a test that must pass several assertions.
For example, to make sure the analytics worked as expected, I wrote a test module called test_analyze:
# Not all test cases are shown, please check the full code of 'test/test_analyze.py'
import unittest
from pandas import DataFrame
from empirestaterunup.analyze import get_country_counts
from empirestaterunup.data import load_data

class AnalyzeTestCase(unittest.TestCase):
    df: DataFrame

    @classmethod
    def setUpClass(cls) -> None:
        cls.df = load_data()

    def test_get_country_counts(self):
        country_counts, min_countries, max_countries = get_country_counts(df=AnalyzeTestCase.df)
        self.assertIsNotNone(country_counts)
        self.assertEqual(2, country_counts['JPN'])
        self.assertIsNotNone(min_countries)
        self.assertEqual(3, min_countries.shape[0])
        self.assertIsNotNone(max_countries)
        self.assertEqual(14, max_countries.shape[0])


if __name__ == '__main__':
    unittest.main()

So far we got the data and made sure it meets expectations. I wrote separate tests for the analytics code and also for the scraper.
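The test above depends on the real race file. As a complementary sketch, analytics code can also be tested against a small synthetic DataFrame, so the expected answer is known in advance. The function zscore_outliers below is a hypothetical stand-in written for this example (it mirrors the z-score logic shown earlier) and is not part of the project's 'analyze' module.

```python
import unittest

import numpy as np
import pandas as pd


def zscore_outliers(df: pd.DataFrame, column: str, std_threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: values further than std_threshold deviations from the mean."""
    values = df[column]
    z_scores = (values - values.mean()) / values.std(ddof=0)
    return values[np.abs(z_scores) > std_threshold]


class OutlierTestCase(unittest.TestCase):
    df: pd.DataFrame

    @classmethod
    def setUpClass(cls) -> None:
        # 19 runners aged 40 and one aged 95, indexed by a made-up bib number
        ages = [40] * 19 + [95]
        cls.df = pd.DataFrame({"age": ages}, index=pd.Index(range(1, 21), name="bib"))

    def test_finds_single_outlier(self):
        outliers = zscore_outliers(self.df, "age")
        self.assertEqual([20], list(outliers.index))
        self.assertEqual(95, outliers.loc[20])

    def test_no_outliers_with_high_threshold(self):
        outliers = zscore_outliers(self.df, "age", std_threshold=10)
        self.assertTrue(outliers.empty)
```

If you save this as a module, you can run it with python -m unittest, the same way as the project's own test modules.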
Testing the user interface requires a different approach, as it needs to simulate clicks and wait for screen changes. Sometimes failures are easy to spot (like crashes), but sometimes issues are much more subtle (did we get the right data displayed?).
We'll revisit this particular testing modality after we first introduce how to visualize the results.
How to Visualize the Results
I wanted to use the terminal as much as possible to visualize my findings, and to keep requirements to a minimum. I decided to use the Textual framework to accomplish that.
This framework is very complete and allows you to build text applications that are responsive and beautiful to look at.
They are also easy to write, so before we go deeper into the resulting applications, let's pause to learn about Textual.
Text User Interfaces (TUI) with Textual
The Textual project has a nice tutorial that you can read to get up to speed.
Let's see some code. One of the applications is called esru_outlier. Its TUI code lives in the apps module, and it shows several tables with the outliers we found earlier using the z-score.
OutlierApp (extends App) collects all the basic information on a table for each outlier group and then calls the RunnerDetailScreen to display details about a runner.

Outliers first screen (by Age, Running Time, and Pace)
Next is code with explanations that shows how to build this screen:
# Only the code of the application shown here
# This application shows 3 tables: SUMMARY_METRICS = (RaceFields.AGE.value, RaceFields.TIME.value, RaceFields.PACE.value)
# Every application in Textual extends the App class
class OutlierApp(App):
    DF: DataFrame = None
    BINDINGS = [ ("q", "quit_app", "Quit"), ]  # Bind 'q' to the 'quit_app' action (method `action_quit_app`), which in turn exits the app
    CSS_PATH = "outliers.tcss"  # Styling can be done externally, similar to using CSS
    ENABLE_COMMAND_PALETTE = False

    def action_quit_app(self):
        self.exit(0)

    def compose(self) -> ComposeResult:
        """
        Here we 'Yield' Widgets/ components that will be rendered in order on the TUI
        How do the components get their layout on the screen? They use a cascading style sheet (CSS): outliers.tcss and
        some explicit layout containers like the class `Vertical` that can contain other Widgets
        Here we have a header, tables, and a footer 
        """
        yield Header(show_clock=True)
        for column_name in SUMMARY_METRICS:
            table = DataTable(id=f'{column_name}_outlier')
            table.cursor_type = 'row'
            table.zebra_stripes = True
            table.tooltip = "Get runner details"
            if column_name == RaceFields.AGE.value:
                label = Label(f"{column_name} (older) outliers:".title())
            else:
                label = Label(f"{column_name} (slower) outliers:".title())
            yield Vertical(
                label,
                table
            )
        yield Footer()

    def on_mount(self) -> None:
        """
        Here we populate each table with data from the DataFrame. Each table has outliers of different types.
        All can be obtained with the `get_outliers` method.
        """
        for column in SUMMARY_METRICS:
            table = self.get_widget_by_id(f'{column}_outlier', expect_type=DataTable)
            columns = [x.title() for x in ['bib', column]]
            table.add_columns(*columns)
            table.add_rows(*[get_outliers(df=OutlierApp.DF, column=column).to_dict().items()])

    @on(DataTable.HeaderSelected)
    def on_header_clicked(self, event: DataTable.HeaderSelected):
        """
        When the user selects a column header it generates a 'HeaderSelected' event.
        The annotation on this method tells Textual that we will handle this event here
        We can extract the table, the selected column, and then sort the table contents.
        """
        table = event.data_table
        table.sort(event.column_key)

    @on(DataTable.RowSelected)
    def on_row_clicked(self, event: DataTable.RowSelected) -> None:
        """
        Similarly, when the user selects a row it generates a RowSelected event
        What we do on the 'on_row_clicked' method is capture the event, get the row contents, and construct
        a new modal screen (RunnerDetailScreen) which we push on top of the regular screen.
        There we show the runner details differently. 
        """
        table = event.data_table
        row = table.get_row(event.row_key)
        runner_detail = RunnerDetailScreen(df=OutlierApp.DF, row=row)
        self.push_screen(runner_detail)

The class RunnerDetailScreen (extends ModalScreen) handles showing the racer details using formatted Markdown, which shows up when you click on the table that was rendered before:

Rendered Markdown with details about the selected runner
And here's the code that allows that with explanations:
# Omitted imports and helper methods, only showing TUI-related code. See the 'apps.py' file for full code
class RunnerDetailScreen(ModalScreen):
    ENABLE_COMMAND_PALETTE = False  # Disable the search bar, it is active by default and is not needed here
    CSS_PATH = "runner_details.tcss"  # Handle the styles using external CSS

    def __init__(
            self,
            name: str | None = None,
            ident: str | None = None,
            classes: str | None = None,
            row: List[Any] | None = None,
            df: DataFrame = None,
            country_df: DataFrame = None
    ):
        """
        Override the constructor and load useful data like country ISO codes
        We get the Pandas DataFrame with the details that will be shown to the user
        """
        super().__init__(name, ident, classes)
        self.row = row
        self.df = df
        if country_df is None:  # 'not country_df' raises ValueError when a DataFrame is passed
            self.country_df = load_country_details()
        else:
            self.country_df = country_df

    def compose(self) -> ComposeResult:
        """
        In compose we prepare the markdown, and we let the MarkdownViewer handle details like 
        a nice automatic table of contents.
        Notice that we call `self.log.info('xxx')`. We use that for debugging when this application
        is called using 'textual'.
        """
        bib_idx = FIELD_NAMES_AND_POS[RaceFields.BIB]
        bibs = [self.row[bib_idx]]
        columns, details = df_to_list_of_tuples(self.df, bibs)
        self.log.info(f"Columns: {columns}")
        self.log.info(f"Details: {details}")
        row_markdown = ""
        position_markdown = {}
        split_markdown = {}
        for legend in ['full', '20th', '65th']:
            position_markdown[legend] = ''
            split_markdown[legend] = ''
        for i in range(0, len(columns)):
            column = columns[i]
            detail = details[0][i]
            if re.search('pace|time', column):
                if re.search('20th', column):
                    split_markdown['20th'] += f"\n* **{column.title()}:** {detail}"
                elif re.search('65th', column):
                    split_markdown['65th'] += f"\n* **{column.title()}:** {detail}"
                else:
                    split_markdown['full'] += f"\n* **{column.title()}:** {detail}"
            elif re.search('position', column):
                if re.search('20th', column):
                    position_markdown['20th'] += f"\n* **{column.title()}:** {detail}"
                elif re.search('65th', column):
                    position_markdown['65th'] += f"\n* **{column.title()}:** {detail}"
                else:
                    position_markdown['full'] += f"\n* **{column.title()}:** {detail}"
            elif re.search('url|bib', column):
                pass  # Skip uninteresting columns
            else:
                row_markdown += f"\n* **{column.title()}:** {detail}"
        yield MarkdownViewer(f"""# Full Course Race details     
## Runner BIO (BIB: {bibs[0]})
{row_markdown}
## Positions
### 20th floor        
{position_markdown['20th']}
### 65th floor        
{position_markdown['65th']}
### Full course        
{position_markdown['full']}                
## Race time split   
### 20th floor        
{split_markdown['20th']}
### 65th floor        
{split_markdown['65th']}
### Full course        
{split_markdown['full']}         
        """)
        # This button is used to close this screen and send the user to the previous screen
        btn = Button("Close", variant="primary", id="close")
        btn.tooltip = "Back to main screen"
        yield btn

    @on(Button.Pressed, "#close")
    def on_button_pressed(self, _) -> None:
        """
        Simple logic, pop the previous screen and make this one disappear
        """
        self.app.pop_screen()

This class is reusable. There are other classes (like BrowserApp in this tutorial) that also send data when a user clicks on a table row, and those details get displayed using this modal screen.
We can customize the appearance using Textual's CSS, which looks a lot like a web application's CSS but is not exactly the same. For example, here's the code to style a button:
Button {
    dock: bottom;
    width: 100%;
    height: auto;
}

As you can see, Textual is a pretty powerful framework. It reminds me a lot of Java Swing, but without the extra complexity.
But is it all just information in tabular format? I also wanted different graph types that could explain behavior like age clusters and gender distribution. For that, I wrote a few classes in the 'apps' module with the help of Matplotlib.
Plots with Matplotlib
I wanted to use some charts to display the data, and I made them with matplotlib. The code to generate an age box plot, which shows how old the participating runners were, is very straightforward.

Age box plot in Matplotlib that shows that most of the runners were in the 40-50 year old range.
And here's an excerpt of the Plotter class that produces the plots. The method shown builds the gender pie chart:
# Not all code is shown here (helper methods, imports)
# Please check the apps.py module to see all missing code
class Plotter:
    def plot_gender(self):
        """
        In this method, we get our data frame filtering by gender and get counts
        Then we create a pie plot
        """
        series = self.df[RaceFields.GENDER.value].value_counts()
        fig, ax = plt.subplots(layout='constrained')
        wedges, texts, auto_texts = ax.pie(
            series.values,
            labels=series.keys(),
            autopct="%%%.2f",
            shadow=True,
            startangle=90,
            explode=(0.1, 0, 0)
        )
        ax.set_title("Gender participation")  # set_title is a method; assigning to it would not set the title
        ax.set_xlabel('Gender distribution')

        # Legend with the fastest runners by gender
        fastest = find_fastest(self.df, FastestFilters.Gender)
        fastest_legend = [f"{fastest[gender]['name']} - {beautify_race_times(fastest[gender]['time'])}" for gender in
                          series.keys()]
        ax.legend(wedges, fastest_legend,
                  title="Fastest by gender",
                  loc="center left",
                  bbox_to_anchor=(1, 0, 0.5, 1))

Interesting: most of the runners were between 40 and 50 years old.
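For reference, here is a minimal sketch of how an age box plot can be built with Matplotlib. It is not the project's Plotter code: the ages are made up, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # Assumption: render off-screen, no GUI needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages, for illustration only
ages = pd.Series([11, 29, 33, 38, 40, 40, 42, 45, 49, 55, 78], name="age")

fig, ax = plt.subplots(layout="constrained")

# boxplot returns a dict of the artists it drew (boxes, medians, whiskers, ...)
parts = ax.boxplot(ages.to_numpy())
ax.set_title("Age distribution")
ax.set_ylabel("Age (years)")

# The median line sits at the middle value of the sorted ages
median_drawn = parts["medians"][0].get_ydata()[0]
print(f"Median age drawn on the plot: {median_drawn}")
```

From here you can call fig.savefig with a file name, or plt.show() with an interactive backend, to see the result.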
Now let's go back to testing TUI.
Testing the User Interfaces
When I started working on this small project, I knew that there was going to be a lot of testing. What I wasn't sure about was how I would be able to test the TUI.
I figured at least two approaches would be useful with Textual: one is watching the message flow between components, and the other is using unit tests with a twist:
Following the message flow with Textual
Textual supports an interesting development mode that allows you to change CSS and see the changes on your application without a restart. Also, you can see how the TUI events propagate, which is invaluable for debugging.
In one terminal, start the console:
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ . ~/virtualenv/EmpireStateRunUp/bin/activate
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ textual console
▌Textual Development Console v0.46.0                                                                                                                                             
▌Run a Textual app with textual run --dev my_app.py to connect.                                                                                                                  
▌Press Ctrl+C to quit.

Then in another terminal, start your application but using development mode:
(EmpireStateRunUp) [josevnz@dmaf5 EmpireStateRunUp]$ textual run --dev --command esru_browser

If you check back on your console terminal, you will see any messages you sent with App.log along with the events:
─────────────────────────────────────────────────────────────────────────── Client '127.0.0.1' connected ───────────────────────────────────────────────────────────────────────────
[18:28:17] SYSTEM                                                                                                                                                        app.py:2188
Connected to devtools ( ws://127.0.0.1:8081 )
[18:28:17] SYSTEM                                                                                                                                                        app.py:2192
---
[18:28:17] SYSTEM                                                                                                                                                        app.py:2194
driver=
[18:28:17] SYSTEM                                                                                                                                                        app.py:2195
loop=<_UnixSelectorEventLoop running=True closed=False debug=False>
[18:28:17] SYSTEM                                                                                                                                                        app.py:2196
features=frozenset({'debug', 'devtools'})
[18:28:17] SYSTEM                                                                                                                                                        app.py:2228
STARTED FileMonitor({PosixPath('/home/josevnz/EmpireStateCleanup/docs/EmpireStateRunUp/empirestaterunup/browser.tcss')})
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Load() >>> BrowserApp(title='Race Runners', classes={'-dark-mode'}) method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> DataTable(id='runners') method=
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> DataTable(id='runners') method=
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> Footer() method=
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> Footer() method=
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> ToastRack(id='textual-toastrack') method=
...
RowHighlighted(cursor_row=0, row_key=) >>> BrowserApp(title='Race Runners', classes={'-dark-mode'}) method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:697
Mount() >>> ScrollBarCorner() method=
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Resize(size=Size(width=2, height=1), virtual_size=Size(width=178, height=47), container_size=Size(width=178, height=47)) >>> ScrollBarCorner() method=None
[18:28:17] EVENT                                                                                                                                                 message_pump.py:706
Show() >>> ScrollBarCorner() method=None

Using unittest and Pilot
The framework has the Pilot class that you can use to make automated calls to Textual Widgets and wait for events. This means you can simulate user interaction with the application to validate that it behaves as expected. This is more powerful than the regular unit tests as you can also cover UI interactions with expected results:
import unittest
from textual.widgets import DataTable, MarkdownViewer
from empirestaterunup.apps import BrowserApp


class AppTestCase(unittest.IsolatedAsyncioTestCase):
    async def test_browser_app(self):
        app = BrowserApp()
        self.assertIsNotNone(app)
        async with app.run_test() as pilot:

            """
            Test the command palette
            """
            await pilot.press("ctrl+\\")
            for char in "jose":  # Press one key per character; "jose".split() would yield the whole word as a single key
                await pilot.press(char)
            await pilot.press("enter")
            # This returns the runner screen. Check that it has some contents
            markdown_viewer = app.screen.query(MarkdownViewer).first()
            self.assertTrue(markdown_viewer.document)
            await pilot.click("#close")  # Close the new screen, pop the original one
            # Go back to the main screen, now select a runner but using the table
            table = app.screen.query(DataTable).first()
            coordinate = table.cursor_coordinate
            self.assertTrue(table.is_valid_coordinate(coordinate))
            await pilot.press("enter")
            await pilot.pause()
            markdown_viewer = app.screen.query(MarkdownViewer).first()
            self.assertTrue(markdown_viewer)
            # After validating the markdown one more time, close the app
            # Quit the app by pressing q
            await pilot.press("q")

if __name__ == '__main__':
    unittest.main()

This is invaluable, and it is something that often requires an external toolset to validate (for example, in Java you have the Robot class).
How to Run the Applications
Finally, it's time to get familiar with mini applications (you can see an animated demonstration of the TUI applications here).
Browsing Through the Data
The esru_browser is a simple browser that lets you navigate through the raw race data.
esru_browser

The application shows all the race details for every Runner in a table that allows sorting by column.

The esru_browser window shows all runners' results. Here you can sort, search for runners, and click to get more details
And the command palette allows searching for runners by name (it's basically a search bar with fuzzy logic):

Matches show up on the palette as you type
Summary Reports
To get insights about racer behavior, you need some summary reports (as opposed to drilling down into each racer's details).
This application provides details about the following:

Count, standard deviation, mean, min, max, 25%, 50%, and 75% for age, time, and pace

Group and count distribution for Age, Wave, and Gender
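
As a rough idea of how a grouped count by age can be produced, here is a sketch that uses pandas.cut to place ages into 10-year buckets and then counts each bucket. The ages, bucket edges, and labels are assumptions for illustration. The application may group the data differently.

```python
import pandas as pd

# Made-up ages, for illustration only
ages = pd.Series([11, 29, 33, 40, 40, 42, 45, 49, 78], name="age")

# Bucket edges 10, 20, ..., 80 and readable labels like "40-49"
edges = list(range(10, 90, 10))
labels = [f"{low}-{low + 9}" for low in edges[:-1]]

# right=False makes each bucket include its lower edge: [40, 50) and so on
buckets = pd.cut(ages, bins=edges, labels=labels, right=False)

# Count runners per bucket, keeping the buckets in age order
counts = buckets.value_counts(sort=False)
print(counts)
```

For columns that are already categorical, such as wave or gender, calling value_counts() directly on the column is enough.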


esru_numbers

Some interesting facts about the race:

The average age was 41 years old, and 40 years old was the largest age group.

The majority of people belonged to the 'BLACK WAVE'.

The majority of the people finished the race in 20 to 30 minutes.

The youngest runner was 11 years old, and the oldest was 78.



esru_numbers gives a bird's eye view of all the racers, categorized by buckets
Finding Outliers
This application uses the Z-score to find the outliers for several metrics for this race:
esru_outlier


The esru_outlier main screen shows you racers that did not follow regular patterns
Because these results drill down to the BIB number, you can click on a row and get more details about a runner:

And you can get details for each outlier. Yes, code is reusable and is the same to show details for any runner
Textual has excellent support for rendering Markdown as well as programming languages. Take a look at the code to see for yourself.
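The Z-score idea behind the outlier screen can be sketched in a few lines. This is an illustration under assumptions, not the application's actual implementation: the function name zscore_outliers, the sample finish times, and the threshold of 2.0 are all hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for values whose |z-score| exceeds threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [
        (i, v, (v - mean) / std)
        for i, v in enumerate(values)
        if abs((v - mean) / std) > threshold
    ]


# Hypothetical finish times in minutes; the last one is far from the rest.
times = [22, 24, 25, 23, 26, 24, 25, 23, 24, 60]
outliers = zscore_outliers(times, threshold=2.0)
print(outliers)
```

A Z-score measures how many standard deviations a value sits from the mean, so the threshold controls how unusual a result must be before it is flagged. With small samples a single extreme value inflates the standard deviation, which is why this sketch uses a lower threshold than the common default of 3.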
A Few Plot Graphics For You
The esru_plot application offers a few plot graphics to help you visualize the data. Inside, the class Plotter does all the heavy lifting.
Age plots
The program can generate two flavors of the same data. One is a box diagram:

The age box diagram we saw before
The second is a regular histogram:

Age histogram shows the same as the box diagram but the buckets are more visible. Same data, many ways to explain the racer demographics.
You can see from both graphics that the age group with the most participants is the 40-45-year-old bracket and the outliers are in the 10-20 and 70-80 year old groups.
Participants per country plot

This plot shows all the countries with the number of participants, with the best runner from each.
No surprises here: the overwhelming majority of racers come from the United States, followed by Mexico. Interestingly, the winner of the 2023 race is from Malaysia, with only 2 runners participating.
Gender distribution

The gender distribution pie showing the best racer for each category
The majority of the runners identified themselves as Males, followed by Females.
What Else Can We Learn?

NYC was well represented at the event. Yeah, I'm talking about the NYC police department running in full gear, not me on the left ;-)
Participating in this race was a great experience. The best part was that it fueled my curiosity and led me to write this code to get more interesting facts about the race.
There is plenty more to learn about the tools you just saw in this tutorial:

There are a lot of public race datasets, and you can use them to apply what you learned here. Just take a look at this dataset of the New York City Marathon, period 1970-2018. What other questions can you ask about the data?

You saw just the tip of what you can do with Textual. I encourage you to explore the apps.py module. Take a look at the example applications as well.

Selenium WebDriver is not just a tool for web scraping but also for automated testing of web applications. It doesn't get better than having your browser perform automated testing for you. It is a big framework, so be prepared to spend time reading and running your tests. I strongly suggest you look at the examples. Trial and error will give you better results.

Apply for the Empire State Run-Up lottery or run through a charity, if you like this kind of race. Who said King Kong is the only one who could make it to the top?

Sadly, I'm not in a position to offer you any training advice. Every person is different. I do recommend that you check with your doctor before you participate in a race like this, and get some professional advice from a running coach.

But most important of all, believe you can do this (the race and writing some tools to process the race data) and have fun while doing it. This is a pre-requisite for any project.



 What is Microsoft Fabric? How to Build a Customer Segmentation Project 
Benny Ifeanyi Iheagwara — Tue, 05 Mar 2024 01:00:06 +0000
 Microsoft Fabric is a data analytics tool that can help you streamline all your data needs and workflows, from data integration to analytics and engineering.
In this guide, I'll explain what Microsoft Fabric is in more detail, how it works, and walk you through building a project with it. If you already have an understanding of the platform, you can skip to the Microsoft Fabric project.
Here's what you'll learn about in this guide:

What is Microsoft Fabric?
Why you should learn about Microsoft Fabric
Microsoft Fabric architecture and components
How to get started by building a simple project
How to create a workspace in Microsoft Fabric
How to create a Lakehouse in Microsoft Fabric
How to use Kaggle API data in Microsoft Fabric
How to use the Data Wrangler in Microsoft Fabric
How to perform customer segmentation in Microsoft Fabric
How to visualize your lakehouse data in Power BI

Prerequisites
To follow along, you will need to have a Power BI license. You can get one for free to practice with using the Microsoft 365 Developer Program.
It would also be helpful if you have knowledge of Microsoft Power BI and Python.
What is Microsoft Fabric?
Microsoft Fabric is an all-in-one analytics software-as-a-service (SaaS) platform for managing all your data analytics needs and workflows. Microsoft built this end-to-end platform to handle all data-related tasks, from data storage and migration to real-time data analytics, data science projects, and data engineering workflows.
But how does it work?
This tool brings together various new and preexisting data tools and technologies—Power BI, OneLake, Azure Data Factory, Data Activator, Power Query, Apache Spark, Synapse Data Warehouse, Synapse Data Engineering, Synapse Data Science, Synapse Real-Time Analytics, Azure Machine Learning, and various connectors.
Why You Should Learn About Microsoft Fabric
The best part of Microsoft Fabric is its simplicity in terms of functionality. Using various technologies together, you can do everything all in one place and focus more on what you can do with it and less on licensing, supporting systems, dependencies, and how to integrate with all these different platforms.
Another benefit of the platform is how it handles your data. It lets you maintain a single, reliable source of information: with Microsoft Fabric’s OneLake, you get a single, unified data store.
Microsoft Fabric also has Azure’s OpenAI service integrated into its layer. This way, you can use AI (Copilot) to help you discover insights quickly.
Lastly, since it is an all-in-one platform, there is a cost-saving edge since there is no need to subscribe to multiple vendors.
Microsoft Fabric Architecture
Think of Microsoft Fabric as your data estate.
Just like every piece of real estate, Microsoft Fabric has various components in its architecture.
Let’s start by looking at the terminology you'll encounter and need to understand when using Microsoft Fabric's architecture:
Experiences and Workloads:
These refer to the various capabilities of the platform. Every experience on the platform is tailored with a specific user in mind. 
Below are some examples of the various experiences/workloads available. You'll notice that each of them is built for a specific purpose, task, and user.

Data Factory gives users over 150 connectors to Lakehouses, warehouses, cloud, and on-premises data sources, and orchestrates data pipelines for data transformation. A Lakehouse here refers to a data platform for storing structured and unstructured data. You can also copy your on-premises data to the cloud and load it into OneLake through Data Factory.
Synapse data engineering is part of the data engineering experience on the platform. It has some cool features like Lakehouses, built data pipelines, and a Spark engine.
Synapse data warehouse provides you with a unified and serverless SQL engine. Like your “traditional” data warehouse, you have the full capabilities of your transactional T-SQL features.
Synapse real-time analytics allows you to stream data from Internet of Things (IoT) devices, telemetry, and logs. You can also use the workload here to analyze semi-structured data using its Kusto Query Language (KQL) capabilities, just like Azure Data Explorer.
Synapse data science allows you to build, collaborate on, train, and deploy fully scalable end-to-end machine learning (ML) and AI models. You can also carry out your ML experiments in your notebooks and log your models using the Fabric Auto Logging feature. A must-mention tool in this experience is the Data Wrangler, a Fabric graphical user interface for data transformation. With this tool, you can clean your data simply by clicking buttons while the tool automatically generates the Python code for you. It is similar to Power Query.
Business Intelligence with Power BI helps you quickly turn your business data into insightful analytic reports and dashboards.
Data Activator allows you to take care of your data observability and monitor workloads in a no-code/low-code way. It tells you when specific data points hit a threshold or match a pattern. You can also automate particular actions and kick off Power Automate flows when specific conditions occur.
Copilot in Fabric provides you with an Azure OpenAI Service. This means you can build reports, describe how you want to ingest your data, summarize, explore, and transform your data using the natural language capability of Azure OpenAI.

Workspaces
Workspaces are similar to Power BI’s workspace. Here, you can share and collaborate with others and create reports, Warehouses, Lakehouses, dashboards, and notebooks.
Capacity Unit (CU)
A CU is a unit that measures the compute power available to your resources, that is, their ability to perform work or produce an output.
Now we'll look at the various components of Microsoft Fabric's architecture.
OneLake
OneLake is the central data repository for Microsoft Fabric that stores the data in Delta Lake format. Think of it as OneDrive for your data. This repository allows you to explore and find data assets in your organization.
One exciting feature is Shortcuts, which allow you to share or point to data in other locations in OneLake without moving or duplicating the data. This avoids data redundancy.
Lakehouses vs Warehouses
While both "houses" hold data, some differences exist between Lakehouses and Warehouses in Microsoft Fabric.
For starters, a Lakehouse can store any data type, whether structured or unstructured. It is, however, stored in the Delta format by default. The Delta format is a storage layer that offers ACID (Atomicity, Consistency, Isolation, Durability) transactions. A Warehouse, on the other hand, is more suited for structured data.
Lakehouses also support Notebooks. So you can work with various languages from PySpark to SQL and R. Warehouses, on the other hand, only use SQL. 
Keep in mind, though, that Fabric provides you with two types of Warehouses: SQL Endpoint and Synapse Data Warehouse.

SQL Endpoint is auto-generated when a Lakehouse is created. This means you get a SQL-based experience and can query Lakehouse data using T-SQL.
Synapse Data Warehouse is more of your traditional SQL engine. So you can use it to create and query data out of OneLake.

How to Get Started With Microsoft Fabric – An End-to-End Project Example
To get a glimpse of how the Fabric platform works, we will build a little project.
We'll create a Lakehouse to store a mall dataset from Kaggle using the Kaggle API. We will also transform our data using Data Wrangler. Then, we will perform customer segmentation on our data based on the customer's annual income and spending score using the KMeans clustering algorithm. This will allow us to group the customers into various categories like low income earners that don't spend, average income earning customers, and high income customers who do not spend much.
Let's get started.
How to Enable Fabric
The first thing we need to do is to log into Microsoft Power BI. Here, we will activate Microsoft Fabric's capabilities for our workspace. 
To do this, follow these steps:
First, navigate to the capacity settings in the admin portal. The admin portal is where administrators control and manage the various Power BI features.

Admin Portal of Microsoft fabric
Then, under the Tenant settings tab, look for the Microsoft Fabric section.
In that section, turn on the Users can create Fabric items toggle. Once you've done that, select Apply.

Now your environment will be set up and the various services should appear at the bottom left of your screen.

Now you can see all the services like Power BI, Data Factory, and so on.
 
How to Create a Workspace in Microsoft Fabric
We'll use a mall customer segmentation dataset from Kaggle for this demo. This data, as mentioned in Kaggle, was created for the purpose of learning customer segmentation concepts.
Let's talk a little bit about the dataset. Imagine you have a supermarket mall and each customer has a membership card. You also have a data catalog of each customer with basic information like their customer ID, age, gender, annual income and spending score. 
Now we want to segment these customers into various groups so we can improve customer loyalty, understand the customers better, and more effectively target our marketing strategy.
To achieve this, we will use the spending score assigned to each customer to define their purchasing power.
To get started, you'll need to create a new workspace. You can do that by following these steps:

Head to your Microsoft Fabric home page.
Select workspaces and click on New Workspace.
Give your workspace a name – I'm calling mine FabricMall.
Click on Advanced to view the dropdown options and select Trial if you are making use of your Fabric trial.
Click Apply.


How to create a workspace in Microsoft fabric
The next thing you want to do is to create a Lakehouse for your data.
How to Create a Lakehouse in Microsoft Fabric
To create a Lakehouse, first click on New within your workspace. This will display a list of various tasks you can do within your workspace.
Then select More options and select Lakehouse. 

Selecting Lakehouse under "More options"
Then give it a name, like FabricMallLake, and click on Open notebook.
Click on New notebook and Open. You can rename your notebook at the top left corner of your notebook. The notebook is similar to the Jupyter notebook experience.

Notebooks in Fabric
How to Use Kaggle API Data in Microsoft Fabric
Notebooks allow us to write, visualize, and execute code. Within the Notebook, we will use Python to perform a customer segmentation on our data in Microsoft Fabric.
First, install the Kaggle package using the command below:
!pip install kaggle

Next, you'll need to import the os module and connect to the Kaggle API.
import os
os.chdir('/lakehouse/default/Files')
os.environ['KAGGLE_USERNAME'] = 'your_kaggle_username'  # placeholder: use your own username
os.environ['KAGGLE_KEY'] = 'your_kaggle_api_key'  # placeholder: never share a real API key
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
api.dataset_download_file('vjchoudhary7/customer-segmentation-tutorial-in-python', 'Mall_Customers.csv')

In the code above, os.chdir('/lakehouse/default/Files') represents our File API path. Also remember to replace the username and API Key with your own.  
Now import Pandas. This will allow you to read your file.
import pandas as pd
df = pd.read_csv("/lakehouse/default/" + "Files/Mall_Customers.csv")
df.head()

But before we start segmenting our customers, let's transform our data using the Data Wrangler.
How to Use the Data Wrangler in Microsoft Fabric
One of the most exciting things about this notebook is that you can perform data cleaning tasks without writing code using the Data Wrangler.
To do that, click on Data on the ribbon and select Transform DataFrame in Data Wrangler. 
We will perform the following transformations:

We will convert the gender column to lowercase.
We will also rename the columns with special characters like the dollar sign, brackets, and a dash. This is because I noticed Fabric finds it hard to handle these characters at the moment.

To do these transformations, follow these steps:
Under the Operation tab, select Convert text to lowercase.
Pick the column – Gender in this example – and select Apply. This will convert your Gender column to lowercase and automatically generate the code.

Data wrangler: Formatting text
Similarly, under the schema tab, select rename columns.
Rename Annual Income (k$) to AnnualIncome, and Spending Score (1-100) to SpendingScore.
Once you’re done with the transformation, click Add code to notebook.

Data wrangler: Rename column
Back in the notebook, we can visualize our data using the code below:
sparkdf = spark.createDataFrame(df_clean)
display(sparkdf)

Within the chart element created, select Customize chart. Pick the columns you want and select Apply.

Charts in Data Wrangler
Once that's done, we can save the data in the Lakehouse using this code below:
sparkdf.write.format("delta").mode("overwrite").saveAsTable("malldatadf")


Saving data in Lakehouse
How to Perform Customer Segmentation in Microsoft Fabric
For our customer segmentation, we will use the KMeans clustering algorithm to segment the customers based on their annual income and spending score. 
K-means clustering is an unsupervised machine learning algorithm. It groups similar data points in your data based on underlying observations, similarities, and input vectors. 
We will do this by importing our libraries, training the K-means clustering model, and visualizing the clusters of customers based on their annual income and spending score.
We will also include and show the centroids of each cluster, providing insights into the distribution of customers in the dataset.
The centroids here refer to the center points of the clusters found by our algorithm. Each one is calculated as the average of all the data points in that cluster. When we visualize the clusters, each centroid will be represented with a distinct symbol or color.
Run this code to achieve this:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

X = df_clean[['AnnualIncome', 'SpendingScore']]

# Feature normalization: scale both columns to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Train K-means with 5 clusters on the scaled features
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
kmeans.fit(X_scaled)

plt.figure(figsize=(10, 8))
for cluster_label in range(5):  # Loop through each cluster label
    cluster_points = X[kmeans.labels_ == cluster_label]
    # Calculate the centroid as the mean position of the cluster's data points (original units)
    centroid = cluster_points.mean(axis=0)
    plt.scatter(cluster_points['AnnualIncome'], cluster_points['SpendingScore'],
                s=50, label=f'Cluster {cluster_label + 1}')  # Plot points for the current cluster
    plt.scatter(centroid['AnnualIncome'], centroid['SpendingScore'], s=300, c='black',
                marker='*', label=f'Centroid {cluster_label + 1}')  # Plot the centroid
plt.title('Clusters of Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Here's the output:

Performing Customer Segmentation in Microsoft Fabric
The result of our analysis shows that our customers can be grouped into 5 clusters:

Cluster 1 (Purple) are low income earners with a low spending score.
Cluster 2 (Blue) are low income earners with a high spending score.
Cluster 3 (Red) are average income earning customers with significant spending scores.
Cluster 4 (Orange) are high income customers who do not spend much at the mall. They’re probably not satisfied with the services rendered.
Cluster 5 (Green) are high income customers with a high spending score.
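To see why a centroid is just the average of its cluster, here is a dependency-free sketch of a single K-means round: assign each point to its nearest centroid, then move each centroid to the mean of its points. The helper names, the four sample points, and the starting centroids are hypothetical, and scikit-learn's KMeans repeats these steps (with smarter initialization) until the assignments stop changing.

```python
def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid (squared Euclidean distance)."""
    labels = []
    for x, y in points:
        distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if not members:
            raise ValueError(f"cluster {cluster} has no points")
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


# Hypothetical (annual income, spending score) pairs
points = [(15, 10), (18, 14), (80, 85), (85, 90)]
centroids = [(0, 0), (100, 100)]  # arbitrary starting guesses

labels = assign_clusters(points, centroids)
centroids = update_centroids(points, labels, k=2)
print(labels, centroids)
```

After one round, the two low-income, low-spending points share one centroid and the two high-income, high-spending points share the other.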

We can also save our prediction as a new dataset using this code:
# Create a new DataFrame to store the clustering results
cluster_df = pd.DataFrame(data=X, columns=['AnnualIncome', 'SpendingScore'])
# Use the label K-means assigned to each row (1-based, to match the plot legend)
cluster_df['Cluster'] = kmeans.labels_ + 1
sparkclusterdf = spark.createDataFrame(cluster_df)
sparkclusterdf.write.format("delta").mode("overwrite").saveAsTable("clusterdatadf")


Customer segmentation prediction
Want to take a look at the notebook? You can download it from my GitHub.
How to Visualize Lakehouse Data in Power BI
Now we can decide to visualize our data on a dashboard within Fabric.
Head back to the FabricMall workspace and select the semantic model type of the FabricMallLake Lakehouse.

semantic model type of the FabricMallLake LakeHouse
Then select Manage default semantic model.

Manage default semantic model In Microsoft Fabric
Pick your dataset, click Confirm, and then select New Report. 
Let's visualize the average age in our data. To do this, click on the card visual and drag the age into this card. This will automatically create a visual showing the average age in your dataset. 

Power BI service in Microsoft Fabric
Just like in Power BI Desktop, you can create your measure, build your report, and publish your dashboard. You can learn more about how to create visuals in Power BI using this free freeCodeCamp YouTube data analysis video.
Alternatively, you can open Power BI Desktop and connect to your Lakehouses from the OneLake data hub.

Connect to your Lakehouse in Power BI
Where Can I Learn More about Microsoft Fabric?
Though Microsoft Fabric is a pretty new data platform, I hope you can tell that this tool will help you ease the way you and your team consume, analyze, and get insight from your data.
To learn more, you can start with the official Fabric documentation or any helpful YouTube tutorial like Francis’s Fabric course. I would also advise you to start with freeCodeCamp's Fabric publication tags if you want a compilation of resources.
Lastly, if you’re new to data analysis, start your journey today with freeCodeCamp’s Data Analyst Bootcamp for Beginners on YouTube. It covers everything from SQL, Tableau, Power BI, and Python to Excel, Pandas, and building real-life projects.
If you enjoyed reading this article, have any questions, or want to connect, you can find me on LinkedIn and Twitter. Do check out my other articles on freeCodeCamp, too.
 


 Essential SQL Concepts for Data Analysts – Explained with Code Examples 
freeCodeCamp — Tue, 27 Feb 2024 00:45:00 +0000
 By Joel Hereth
In the vast and ever-growing realm of data analytics, Structured Query Language (SQL) serves as a fundamental building block. 
While SQL's roots lie in database management, it has expanded its reach, becoming the go-to tool for data extraction, manipulation, and analysis. 
Whether you're just starting your journey as a data analyst or looking to bolster your proficiency in its tools, understanding essential SQL concepts is non-negotiable. 
This guide will take you through the critical aspects of SQL that are important for your success in the data analytics field. 
Table of Contents:

The Role of SQL in Data Analytics
Key SQL Concepts to Learn
– Basic Commands
– The CASE Statement
– Subqueries and Common Table Expressions (CTEs)
– Joins and Unions
– String and Date Formatting
– Window Functions
Conclusion

The Role of SQL in Data Analytics
Before getting into the nitty-gritty, it's important to understand the pivotal role of SQL in data analytics. 
SQL is the lingua franca of the database world, serving as a translator between human and machine. This makes it a must-learn for anyone diving into the data domain. 
To appreciate SQL's significance, you need only to look at the tasks it allows you to perform. From transforming raw data into insightful reports to creating data-driven applications and executing complex data operations, SQL is the powerhouse that enables analysts and professionals to extract hidden gems from the vast seas of databases. 
Key SQL Concepts to Learn
Basic Commands
SQL commands fall into three groups: those that manage the structure of the database schema (DDL - Data Definition Language), those that change the content of database tables (DML - Data Manipulation Language), and those that query the data within the database (DQL - Data Query Language). You'll want to start here to lay a solid foundation.
DML Commands:

SELECT: retrieves data from one or more tables.

SELECT product_name, price
FROM products
WHERE category = 'Electronics';


INSERT: inserts new rows into a table.

INSERT INTO customers (name, email)
VALUES ('John Doe', 'john@example.com');


UPDATE: modifies existing data within a table.

UPDATE inventory
SET quantity = 50
WHERE product_id = 101;


DELETE: removes existing rows from a table.

DELETE FROM orders
WHERE order_id = 12345;

DDL Commands:

CREATE TABLE: creates a new table within the database.

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(50),
    department VARCHAR(50),
    salary DECIMAL(10, 2)
);


ALTER TABLE: modifies an existing table within the database.

ALTER TABLE employees
ADD hire_date DATE;


DROP TABLE: removes an entire table from the database.

DROP TABLE customers;

DQL Commands:

SELECT: also part of DML but often associated with DQL as it is used to query data only.

The CASE Statement
The CASE statement takes scalars, predicates, function calls, and even SQL queries as input and returns an expression value. It’s an extremely versatile tool that can be used to transform data, perform if-then-else logic, categorize information, and more.
Basic Syntax of CASE:
SELECT column_name,
  CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ELSE result3
  END
FROM table_name;

Understanding how and when to use CASE statements is a critical SQL skill to master as a data analyst dealing with complex datasets. To showcase the different CASE statements, we have the actions table with the user_id, action, and date fields. 
CREATE TABLE actions (
  "user_id" INTEGER,
  "action" VARCHAR(50),
  "date" DATE
);

INSERT INTO actions (
     "user_id",
      "action",
      "date"
)
VALUES
    (1, 'post', current_timestamp::DATE-3),
    (2, 'edit', current_timestamp::DATE-2),
    (3, 'post', current_timestamp::DATE-1),
    (4, 'post', current_timestamp::DATE-1),
    (5, 'edit', current_timestamp::DATE-5),
    (6, 'cancel', current_timestamp::DATE-2),
    (7, 'post', current_timestamp::DATE-2),
    (8, 'post', current_timestamp::DATE-1),
    (9, 'post', current_timestamp::DATE-1),
    (10, 'cancel', current_timestamp::DATE-3),
    (11, 'post', current_timestamp::DATE-2),
    (12, 'post', current_timestamp::DATE-2);
Your manager is about to go into a meeting with the event director and asks you to write a query that shows the all-time post rate, rounded to two decimal places. Given the structure of the actions table, we'll need a CASE statement.
select round(1.0*
sum(case when action='post' then 1 else 0 end)
/
count(1)
,2) post_rate
from actions;

Initially, we employ a CASE statement to assign a value of 1 to posts, and 0 otherwise. Afterward, we aggregate these results using SUM(). Then, we divide this sum by the total count of records, represented by COUNT(1), which includes all records, not exclusively posts. 
This computation yields our post rate. To ensure decimal precision, we multiply the numerator by 1.0, which forces decimal rather than integer division. Finally, we round the entire result to two decimal places as needed.
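To check the arithmetic, here is a sketch that runs the same query against an in-memory SQLite database through Python's sqlite3 module. Using SQLite and dropping the date column are assumptions made to keep the example self-contained (the article's INSERT uses PostgreSQL-style date syntax). The twelve actions are the same as in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (user_id INTEGER, action TEXT)")

# Same 12 actions as the sample data above (dates omitted for brevity)
actions = ["post", "edit", "post", "post", "edit", "cancel",
           "post", "post", "post", "cancel", "post", "post"]
conn.executemany(
    "INSERT INTO actions (user_id, action) VALUES (?, ?)",
    list(enumerate(actions, start=1)),
)

# CASE turns each row into 1 (post) or 0 (anything else); SUM counts the posts
post_rate = conn.execute(
    """
    SELECT ROUND(1.0 * SUM(CASE WHEN action = 'post' THEN 1 ELSE 0 END) / COUNT(1), 2)
    FROM actions
    """
).fetchone()[0]

print(post_rate)  # 8 posts out of 12 actions
conn.close()
```

Eight of the twelve rows are posts, so the rate is 8 / 12, which rounds to 0.67.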
Subqueries and Common Table Expressions (CTEs)
Subqueries, or inner queries, allow you to use queries within another SQL statement. Common Table Expressions (CTEs) are named temporary result sets that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement.
Subqueries:

Scalar Subquery: a subquery that returns a single value.
Column Subquery: a subquery that returns a single column with one or more rows.
Table Subquery: a subquery that returns one or more rows and columns, so it can be used wherever a table is expected.

CTEs:

Provide a more readable and maintainable alternative to a derived table or subquery.
Can reference themselves, which is useful for recursive queries.

To demonstrate the use cases, we're going to practice with both the traditional subquery and CTE using the following SQL schema: 
CREATE TABLE all_numbers (
  "phone_number" VARCHAR(25)
  );


CREATE TABLE confirmed_numbers (
  "phone_number" VARCHAR(25)
  );


INSERT INTO all_numbers
("phone_number")
VALUES
('706-766-8523'),
('555-239-6874'),
('407-234-5041'),
('(123)351-6123'),
('251-874-3478');

INSERT INTO confirmed_numbers
 ("phone_number")
 VALUES
('555-239-6874'),
('407-234-5041'),
('(123)351-6123');

For example, let's say you're a data analyst at DoorDash and you've been asked to retrieve all the phone numbers that are in the all_numbers table but are not present in the confirmed_numbers table. You can solve this by using a traditional subquery:
SELECT phone_number
FROM all_numbers
WHERE phone_number NOT IN (
  SELECT phone_number
  FROM confirmed_numbers
);

Alternatively, you can write the same logic with a CTE, which names the inner query and is often easier to read and maintain as queries grow. Performance is usually comparable to the subquery version and depends on the database engine.
WITH excluded_numbers AS (
  SELECT phone_number
  FROM confirmed_numbers
)

SELECT phone_number
FROM all_numbers
WHERE phone_number NOT IN (
  SELECT phone_number
  FROM excluded_numbers
);
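As a quick check that the subquery and the CTE return the same rows, here is a sketch that runs both against an in-memory SQLite database via Python's sqlite3 module. SQLite is an assumption made for portability, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE all_numbers (phone_number VARCHAR(25));
    CREATE TABLE confirmed_numbers (phone_number VARCHAR(25));
    INSERT INTO all_numbers VALUES
      ('706-766-8523'), ('555-239-6874'), ('407-234-5041'),
      ('(123)351-6123'), ('251-874-3478');
    INSERT INTO confirmed_numbers VALUES
      ('555-239-6874'), ('407-234-5041'), ('(123)351-6123');
    """
)

subquery_sql = """
    SELECT phone_number FROM all_numbers
    WHERE phone_number NOT IN (SELECT phone_number FROM confirmed_numbers)
    ORDER BY phone_number
"""
cte_sql = """
    WITH excluded_numbers AS (SELECT phone_number FROM confirmed_numbers)
    SELECT phone_number FROM all_numbers
    WHERE phone_number NOT IN (SELECT phone_number FROM excluded_numbers)
    ORDER BY phone_number
"""

subquery_rows = [row[0] for row in conn.execute(subquery_sql)]
cte_rows = [row[0] for row in conn.execute(cte_sql)]
print(subquery_rows)
print(cte_rows)
conn.close()
```

Both queries list the two unconfirmed numbers, so choosing between them is mostly a matter of readability.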

Joins and Unions
Joins help you combine data from multiple tables based on a related column between them, while unions allow you to combine the result sets of two or more SELECT statements. Both are critical for harnessing the full power of your SQL queries.

Table illustrating the different types of SQL joins (Left, Full, Right, and Inner)
Types of Joins:

INNER JOIN: returns rows when there is a match in both tables.
LEFT JOIN: returns all rows from the left table and the matched rows from the right table.
RIGHT JOIN: returns all rows from the right table and the matched rows from the left table.
FULL JOIN: returns all rows from both tables, matching them where possible and filling in NULLs where there is no match.

To illustrate the various JOIN types in SQL, consider a scenario where we want to compile the relationship between sales figures and their corresponding sales representatives across different regions. 
For this purpose, we have two tables: sales_data and representatives. They are linked by the rep_id field, which serves as a foreign key in the sales_data table and a primary key in the representatives table. Here's what that looks like:
CREATE TABLE sales_data (
    sale_id INT PRIMARY KEY,
    rep_id INT,
    region VARCHAR(50),
    sales DECIMAL(10, 2)
);

INSERT INTO sales_data (sale_id, rep_id, region, sales) VALUES
(1, 101, 'East', 1000.00),
(2, 102, 'East', 1500.50),
(3, 103, 'West', 2000.00),
(4, 104, 'West', 2500.75),
(5, NULL, 'West', 3000.00);  

CREATE TABLE representatives (
    rep_id INT PRIMARY KEY,
    sales_rep VARCHAR(100),
    region VARCHAR(50)
);
INSERT INTO representatives (rep_id, sales_rep, region) VALUES
(101, 'John Doe', 'East'),
(102, 'Jane Smith', 'East'),
(105, 'Jim Beam', 'North'),
(106, 'Jill Jackson', 'North'),
(107, 'Jack Johnson', 'South');

For our example, suppose we want to match sales to representatives in the East region. We would use an INNER JOIN to fetch only the rows with matching rep_id in both tables:
SELECT s.sales, r.sales_rep
FROM sales_data s
INNER JOIN representatives r
ON s.rep_id = r.rep_id
WHERE s.region = 'East';
In the case of wanting to see all sales data in the West region, including those without a corresponding sales representative, a LEFT JOIN comes in handy:
SELECT s.sales, r.sales_rep
FROM sales_data s
LEFT JOIN representatives r
ON s.rep_id = r.rep_id
WHERE s.region = 'West';

If our interest instead is in all representatives in the North region, even those without associated sales data, we would use a RIGHT JOIN:
SELECT s.sales, r.sales_rep
FROM sales_data s
RIGHT JOIN representatives r
ON s.rep_id = r.rep_id
WHERE r.region = 'North';

Lastly, to see every sale and every representative across all regions, whether or not a matching rep_id exists, we use a FULL JOIN:
SELECT s.sales, r.sales_rep
FROM sales_data s
FULL JOIN representatives r
ON s.rep_id = r.rep_id;
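To see what the joins return on the sample data, here is a sketch that loads the same two tables into an in-memory SQLite database through Python's sqlite3 module and runs the INNER JOIN and LEFT JOIN queries. SQLite is an assumption made for portability. RIGHT and FULL joins are left out because older SQLite versions do not support them, and the ORDER BY is added only for predictable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_data (
        sale_id INT PRIMARY KEY, rep_id INT, region VARCHAR(50), sales DECIMAL(10, 2)
    );
    CREATE TABLE representatives (
        rep_id INT PRIMARY KEY, sales_rep VARCHAR(100), region VARCHAR(50)
    );
    INSERT INTO sales_data VALUES
      (1, 101, 'East', 1000.00), (2, 102, 'East', 1500.50),
      (3, 103, 'West', 2000.00), (4, 104, 'West', 2500.75),
      (5, NULL, 'West', 3000.00);
    INSERT INTO representatives VALUES
      (101, 'John Doe', 'East'), (102, 'Jane Smith', 'East'),
      (105, 'Jim Beam', 'North'), (106, 'Jill Jackson', 'North'),
      (107, 'Jack Johnson', 'South');
    """
)

# INNER JOIN: only sales whose rep_id exists in representatives
inner_rows = conn.execute(
    """
    SELECT s.sales, r.sales_rep
    FROM sales_data s
    INNER JOIN representatives r ON s.rep_id = r.rep_id
    WHERE s.region = 'East'
    ORDER BY s.sale_id
    """
).fetchall()

# LEFT JOIN: every West sale, with NULL (None in Python) where no rep matches
left_rows = conn.execute(
    """
    SELECT s.sales, r.sales_rep
    FROM sales_data s
    LEFT JOIN representatives r ON s.rep_id = r.rep_id
    WHERE s.region = 'West'
    ORDER BY s.sale_id
    """
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The East query pairs each sale with its representative, while all three West sales come back with no representative, because rep_id values 103 and 104 are missing from representatives and the last sale has a NULL rep_id.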

Union and Union All:

UNION: returns the distinct rows that appear in either of the two result sets.
UNION ALL: returns all the rows including duplicates.

Continuing with the same SQL schema above containing the sales_data table and representatives table, let's review scenarios where we'd want to use a UNION and a UNION ALL.
Using a UNION, let's construct a SQL query to retrieve the distinct IDs of all sales representatives that appear in either the sales_data or representatives table. Both SELECT statements must return columns of compatible types, so we select rep_id from each table (sales_data has no name column):
SELECT rep_id AS representative_id FROM representatives
UNION
SELECT rep_id AS representative_id FROM sales_data;

Now, let's explore how to use a UNION ALL operation to retrieve the IDs of all sales representatives from both the sales_data and representatives tables, including duplicates. An ID that appears in both tables is listed once for each table.
SELECT rep_id AS representative_id FROM representatives
UNION ALL
SELECT DISTINCT rep_id AS representative_id FROM sales_data;
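
The difference is easy to check with Python's built-in sqlite3 module. This is only a sketch: it selects rep_id in both queries so the column types match, and the sales_data rows are hypothetical values made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE representatives (rep_id INT PRIMARY KEY);
    INSERT INTO representatives VALUES (101), (102), (105), (106), (107);
    -- Hypothetical sales rows, for illustration only
    CREATE TABLE sales_data (rep_id INT);
    INSERT INTO sales_data VALUES (101), (101), (102), (103);
""")

query = "SELECT rep_id FROM representatives {op} SELECT rep_id FROM sales_data"
union_rows = conn.execute(query.format(op="UNION")).fetchall()
union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()

print(len(union_rows))      # distinct rep_ids across both tables
print(len(union_all_rows))  # every row from both tables, duplicates kept
conn.close()
```

UNION has to de-duplicate the combined result, so UNION ALL is usually the cheaper choice when duplicates are acceptable or impossible.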


Table illustrating the different types of SQL UNIONs (UNION vs UNION ALL)
String and Date Formatting
The manipulation of string and date values is common in data analysis. Understanding how to format these types properly is crucial for meaningful analysis.
String Functions:

CONCAT: merges two or more strings into one.
SUBSTRING: returns a part of a string.
LENGTH or LEN: returns the length of a string.

Date Functions:

DATEADD: adds an interval to a date.
DATEDIFF: returns the time between two dates.
DATENAME or TO_CHAR: returns part of a date like day, month, or year.

To demonstrate the usage of string and date functions in SQL, let's delve into a scenario involving orders and deliveries. 
We have two tables: orders and deliveries. Here's a breakdown of each table and its columns:
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10, 2)
);

INSERT INTO orders (order_id, customer_id, order_date, total_amount) VALUES
(1, 201, '2024-02-20', 500.00),
(2, 202, '2024-02-21', 750.25),
(3, 203, '2024-02-21', 1000.00),
(4, 204, '2024-02-22', 1200.75),
(5, 205, '2024-02-22', 1500.00);

CREATE TABLE deliveries (
    delivery_id INT PRIMARY KEY,
    order_id INT,
    delivery_date DATE,
    delivery_status VARCHAR(50)
);

INSERT INTO deliveries (delivery_id, order_id, delivery_date, delivery_status) VALUES
(1, 1, '2024-02-21', 'Delivered'),
(2, 2, '2024-02-22', 'In transit'),
(3, 3, '2024-02-22', 'Delivered'),
(4, 4, NULL, 'Pending'),
(5, 5, NULL, 'Pending');

Say you've been tasked with optimizing order tracking systems. To streamline this process, you need to create unique order identifiers by merging customer IDs and order IDs. Leveraging the CONCAT function in SQL, you merge these identifiers, ensuring efficient order management and analysis.
SELECT CONCAT(customer_id, '-', order_id) AS order_identifier
FROM orders;

Your next task is to categorize delivery statuses accurately, which is essential for operational efficiency. But delivery status messages often contain irrelevant details. 
To simplify this process, you use the SUBSTRING function in SQL to extract the initial characters of the delivery status. This enables swift categorization and analysis of delivery progress.
SELECT SUBSTRING(delivery_status, 1, 3) AS status_summary
FROM deliveries;

Now imagine you need to ensure the consistency of delivery status messages. It's crucial to validate that delivery status updates adhere to defined length constraints. 
By employing the LENGTH/LEN function in SQL, you calculate the length of each delivery status message. This facilitates robust validation mechanisms, ensuring uniformity and integrity in your data.
SELECT delivery_id, LENGTH(delivery_status) AS status_length
FROM deliveries;

Date Functions
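When querying the orders and deliveries tables in the SQL schema provided, the DATEADD function is particularly useful in scenarios where you need to calculate future dates or deadlines based on existing ones. Note that date functions differ between databases: DATEADD and the three-argument DATEDIFF shown here follow SQL Server syntax, so check your database's documentation for the equivalent.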
For example, you might use DATEADD to find the expected delivery date by adding a certain number of days to order_date to ensure delivery within a predefined time frame. 
SELECT order_id, customer_id, DATEADD(day, 3, order_date) AS expected_delivery_date
FROM orders;

The DATEDIFF function can also be useful in calculating differences between dates. For instance, if you need to find the average time it takes for an order to be delivered, you could subtract the order_date from the delivery_date and then calculate the average using AVG.
SELECT AVG(DATEDIFF(day,order_date,delivery_date)) AS average_delivery_time
FROM orders o INNER JOIN deliveries d ON o.order_id = d.order_id
WHERE delivery_status = 'Delivered';

The TO_CHAR function (available in databases such as PostgreSQL and Oracle) can be useful in converting dates to a specific format. For instance, if you need to display the delivery date as Month DD, YYYY instead of the default format, you could use TO_CHAR in your query. Because order_id exists in both tables, it has to be qualified with a table alias.
SELECT o.order_id, o.customer_id, TO_CHAR(d.delivery_date,'Month DD, YYYY') AS formatted_delivery_date
FROM orders o INNER JOIN deliveries d ON o.order_id = d.order_id;
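
If you end up doing the same date work in Python, the three date operations above have close pandas counterparts. The sketch below rebuilds the relevant columns of the orders and deliveries tables as DataFrames; the pandas calls are an illustrative equivalent, not something the SQL schema requires.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_date": pd.to_datetime(
        ["2024-02-20", "2024-02-21", "2024-02-21", "2024-02-22", "2024-02-22"]
    ),
})
deliveries = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "delivery_date": pd.to_datetime(
        ["2024-02-21", "2024-02-22", "2024-02-22", None, None]
    ),
    "delivery_status": ["Delivered", "In transit", "Delivered", "Pending", "Pending"],
})

# DATEADD equivalent: add three days to each order date
orders["expected_delivery_date"] = orders["order_date"] + pd.Timedelta(days=3)

# DATEDIFF + AVG equivalent: mean delivery time in days for delivered orders
merged = orders.merge(deliveries, on="order_id")
delivered = merged[merged["delivery_status"] == "Delivered"]
average_delivery_time = (
    delivered["delivery_date"] - delivered["order_date"]
).dt.days.mean()

# TO_CHAR equivalent: format dates as "Month DD, YYYY" (missing dates stay missing)
merged["formatted_delivery_date"] = merged["delivery_date"].dt.strftime("%B %d, %Y")

print(average_delivery_time)
```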

Window Functions
Window functions are a powerful feature that lets you perform calculations across a set of table rows related to the current row, known as the window, without the need for a self-join. This makes it possible to compute running totals, moving averages, and more.
Common Window Functions:

ROW_NUMBER(): assigns a unique, sequential number to each row within its window partition, in the order given by ORDER BY.
RANK(): assigns a rank to each row, giving tied rows the same rank and leaving gaps after ties (1, 2, 2, 4).
DENSE_RANK(): similar to RANK(), but the ranks stay consecutive with no gaps after ties (1, 2, 2, 3).
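
The three functions only differ when there are ties. As a minimal sketch, the pandas rank method can mimic each of them; the scores below are hypothetical values chosen so that two rows tie.

```python
import pandas as pd

# Hypothetical scores with a tie, listed from highest to lowest
scores = pd.Series([100, 90, 90, 80])

# method="first" behaves like ROW_NUMBER(), "min" like RANK(), "dense" like DENSE_RANK()
row_number = scores.rank(method="first", ascending=False).astype(int).tolist()
rank = scores.rank(method="min", ascending=False).astype(int).tolist()
dense_rank = scores.rank(method="dense", ascending=False).astype(int).tolist()

print(row_number)  # [1, 2, 3, 4]
print(rank)        # [1, 2, 2, 4]
print(dense_rank)  # [1, 2, 2, 3]
```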

CREATE TABLE product_data (
    product_id INT PRIMARY KEY,
    total_inventory INT NOT NULL,
    total_sales INT NOT NULL,
    region VARCHAR(50) NOT NULL
);
INSERT INTO product_data (product_id, total_inventory, total_sales, region) VALUES
(1, 100, 500, 'North America'),
(2, 150, 750, 'Europe'),
(3, 200, 1000, 'Asia'),
(4, 120, 1200, 'North America'),
(5, 180, 1500, 'Europe');

For example, your sales director Slacks you and asks you to calculate a running total of sales across the product inventory. You can do this using SUM as a window function:
SELECT 
    product_id,
    total_inventory,
    SUM(total_sales) OVER(ORDER BY product_id) AS running_total_sales
FROM product_data;

Now, let's dive deeper into the problem. Say it's a large dataset, Excel won't cut it for this task, and you want to number the products within each region. You can do this by applying ROW_NUMBER() with PARTITION BY:
SELECT 
    region,
    product_id,
    ROW_NUMBER() OVER(PARTITION BY region ORDER BY product_id) AS region_product_rank
FROM product_data;

Alternatively, you could swap the ROW_NUMBER() for DENSE_RANK() or RANK() depending on the use case.
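
For comparison, the running total and the per-region numbering can be sketched in pandas as well. The DataFrame below copies the relevant columns of product_data; cumsum and groupby(...).cumcount() are the pandas counterparts assumed here, not a translation any database performs.

```python
import pandas as pd

product_data = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5],
    "total_sales": [500, 750, 1000, 1200, 1500],
    "region": ["North America", "Europe", "Asia", "North America", "Europe"],
}).sort_values("product_id")

# SUM(total_sales) OVER (ORDER BY product_id)
product_data["running_total_sales"] = product_data["total_sales"].cumsum()

# ROW_NUMBER() OVER (PARTITION BY region ORDER BY product_id)
product_data["region_product_rank"] = product_data.groupby("region").cumcount() + 1

print(product_data)
```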
Conclusion
As a data analyst, your proficiency in SQL will evolve as you handle more complex data scenarios and questions. 
These essential SQL concepts serve as a good starting point – but continuous learning and applying these concepts in practical scenarios are what will truly solidify your understanding and expertise. 
Keep exploring new features, tools, and resources such as freeCodeCamp or Big Tech Interviews, and you'll find SQL to be an ever-rewarding, ever-deepening skill to have in your data toolkit.
 


 How to Use Pandas for Data Cleaning and Preprocessing 
Oluwadamisi Samuel — Tue, 30 Jan 2024 14:55:00 +0000
 Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
This estimate rings true, because data preparation encompasses a series of steps that ensure the data used for data science, machine learning, and analysis projects is complete, accurate, unbiased, and reliable.
The quality of your dataset plays a pivotal role in the success of your analysis or model. As the saying goes, “garbage in, garbage out”: the quality and reliability of your model and analysis depend heavily on the quality of your data.
Raw data collected from various sources is often messy and contains errors, inconsistencies, missing values, and outliers. Data cleaning and preprocessing aim to identify and rectify these issues to ensure accurate, reliable, and meaningful results during model building and data analysis, as wrong conclusions could be costly.
This is where Pandas comes into play. It is a widely used tool in the data world for both data cleaning and preprocessing. In this article, we'll delve into the essential concepts of data cleaning and preprocessing using the powerful Python library, Pandas.
Table of Contents

Prerequisites

Introduction

What is Data Cleaning?

What is Data Preprocessing?

How to Import the Necessary Libraries

How to Load the Dataset

Exploratory Data Analysis (EDA)

How to Handle Missing Values

How to Remove Duplicate Records

Data Types and Conversion

How to Encode Categorical Variables

How to Handle Outliers

Conclusion


Prerequisites

A basic understanding of Python.

Basic understanding of data cleaning.


Introduction
Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use functions needed to work with structured data seamlessly.
Pandas also integrates well with other popular Python libraries, such as NumPy for numerical computing and Matplotlib for data visualization. This makes it a powerful asset for data-driven tasks.
Pandas excels in handling missing data, reshaping datasets, merging and joining multiple datasets, and performing complex operations on data, making it exceptionally useful for data cleaning and manipulation.
At its core, Pandas introduces two key data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table with labeled axes (rows and columns). These structures allow users to manipulate, clean, and analyze datasets efficiently.
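
As a quick illustration of the two structures, here is a minimal sketch. The car brands and prices are hypothetical values used only for this example.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical prices)
prices = pd.Series([250, 300, 275], name="price")

# A DataFrame: a two-dimensional table; each column is a Series
cars = pd.DataFrame({
    "brand": ["Toyota", "Ford", "Honda"],
    "price": [250, 300, 275],
})

print(prices.ndim, cars.ndim)        # number of dimensions of each structure
print(type(cars["price"]).__name__)  # selecting one column returns a Series
```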
What is Data Cleaning?
Before we embark on our data adventure with Pandas, let's take a moment to explain the term "data cleaning." Think of it as the digital detox for your dataset, where we tidy up and prioritize accuracy above all else.
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.
Why bother with data cleaning? Well, imagine trying to analyze sales trends when some entries are missing, or working with a dataset that has duplicate records throwing off your calculations. Not ideal, right?
In this digital detox, we use tools like Pandas to get rid of inconsistencies, straighten out errors, and let the true clarity of your data shine through.
What is Data Preprocessing?
You may be wondering, "Do data cleaning and data preprocessing mean the same thing?" The answer is no – they do not.
Picture this: you stumble upon an ancient treasure chest buried in the digital sands of your dataset. Data cleaning is like carefully unearthing that chest, dusting off the cobwebs, and ensuring that what's inside is authentic and reliable.
As for data preprocessing, you can think of it as taking that discovered treasure and preparing its contents for public display. It goes beyond cleaning; it's about transforming and optimizing the data for specific analyses or tasks.
Data cleaning is the initial phase of refining your dataset: it makes the data readable and usable with techniques like removing duplicates, handling missing values, and converting data types. Data preprocessing takes this refined data further with more advanced techniques such as feature engineering, encoding categorical variables, and handling outliers to achieve better results.
The goal is to turn your dataset into a refined masterpiece, ready for analysis or modeling.
How to Import the Necessary Libraries
Before we embark on data cleaning and preprocessing, let's import the Pandas library.
To save time and typing, we often import Pandas as pd. This lets us use the shorter pd.read_csv() instead of pandas.read_csv() for reading CSV files, making our code more concise and readable.
import pandas as pd

How to Load the Dataset
Start by loading your dataset into a Pandas DataFrame.
In this example, we'll use a hypothetical dataset named your_dataset.csv. We will load the dataset into a variable called df.
#Replace 'your_dataset.csv' with the actual dataset name or file path
df = pd.read_csv('your_dataset.csv')

Exploratory Data Analysis (EDA)
EDA helps you understand the structure and characteristics of your dataset. Pandas provides several methods that give quick insights into a dataset. You call them on the DataFrame variable, followed by a dot and the method name.
For example:

df.head() returns the first 5 rows of the dataset. You can specify the number of rows to be displayed in the parentheses.

df.describe() gives some statistical data like percentile, mean and standard deviation of the numerical values of the Series or DataFrame.

df.info() gives the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).


Here's a code example below:
#Display the first few rows of the dataset
print(df.head())

#Summary statistics
print(df.describe())

#Information about the dataset (info() prints its report itself and returns None)
df.info()

How to Handle Missing Values
When you're new to this field, missing values can be a significant source of stress, as they come in different formats and can adversely impact your analysis or model.
Many machine learning models cannot be trained on data that has missing or NaN values, and missing values can also skew the results of your analysis. But do not fret, Pandas provides methods to handle this problem.
One way to do this is by removing the rows with missing values altogether. Code snippet below:
#Check for missing values
print(df.isnull().sum())

#Drop rows with missing values and place the result in a new variable "df_cleaned"
df_cleaned = df.dropna()

#Fill missing values with the column mean for numerical data and place the result in a new variable called df_filled
#numeric_only=True skips text columns, which have no mean
df_filled = df.fillna(df.mean(numeric_only=True))

But if a large number of rows have missing values, dropping them throws away too much data, so this method will be inadequate.
For numerical data, you can instead compute the mean of each column and fill it into the rows that have missing values. Code snippet below:
#Replace missing values with the mean of each numerical column
df.fillna(df.mean(numeric_only=True), inplace=True)

#If you want to replace missing values in a specific column, you can do it this way:
#Replace 'column_name' with the actual column name
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

#Now, the numerical columns of df contain no missing values, and NaNs have been replaced with the column mean
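
To see both strategies side by side, here is a small self-contained sketch. The DataFrame and its column names (age, salary, city) are hypothetical and exist only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in numeric and text columns
df_demo = pd.DataFrame({
    "age": [25, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 80000],
    "city": ["Lagos", None, "Abuja", "Lagos"],
})

missing_counts = df_demo.isnull().sum()  # missing values per column
rows_dropped = df_demo.dropna()          # keeps only fully populated rows

# Fill numeric columns with their mean; text columns are left untouched
filled = df_demo.fillna(df_demo.mean(numeric_only=True))

print(missing_counts)
print(len(df_demo), len(rows_dropped))
```

Dropping rows here removes half of the dataset, while mean imputation keeps every row but leaves the missing city value for a separate decision.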

How to Remove Duplicate Records
Duplicate records can distort your analysis by over-weighting some observations, so the results no longer accurately reflect the underlying trends and patterns.
Pandas makes it easy to identify duplicate rows and to remove them, storing the de-duplicated result in a new variable.
Code snippet below:
#Identify duplicates
print(df.duplicated().sum())

#Remove duplicates
df_no_duplicates = df.drop_duplicates()
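
Here is a minimal sketch of how these calls behave on a tiny, hypothetical table in which one order was recorded twice. The subset and keep arguments are optional extras that go beyond the snippet above.

```python
import pandas as pd

# Hypothetical orders where order 2 was recorded twice
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [500.0, 750.25, 750.25, 1000.0],
})

duplicate_mask = orders_demo.duplicated()     # True for repeats of an earlier row
deduplicated = orders_demo.drop_duplicates()  # keeps the first occurrence by default

# Compare only the key column and keep the most recent occurrence instead
by_key = orders_demo.drop_duplicates(subset=["order_id"], keep="last")

print(duplicate_mask.tolist())
print(len(deduplicated))
```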

Data Types and Conversion
Data type conversion in Pandas is a crucial aspect of data preprocessing, allowing you to ensure that your data is in the appropriate format for analysis or modeling.
Data from various sources is usually messy, and some values may be stored with the wrong data type. For example, numerical values may arrive as 'float' or 'string' instead of 'integer', and mixing up these formats leads to errors and wrong results.
You can convert a column of type int to float with the following code:
#Convert 'Column1' to float
df['Column1'] = df['Column1'].astype(float)

#Display updated data types
print(df.dtypes)

You can use df.dtypes to print column data types.
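
When numbers arrive as text, astype can fail on entries that are not valid numbers. One option, sketched here on a hypothetical price column, is pd.to_numeric with errors="coerce", which is an addition to the approach shown above.

```python
import pandas as pd

# Hypothetical column where numbers arrived as text, with one bad entry
raw = pd.DataFrame({"price": ["100", "250", "n/a", "75"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

print(raw.dtypes)
print(raw["price"].isnull().sum())  # number of entries that could not be parsed
```

The coerced entries then show up as missing values, so they can be handled with the same techniques used for any other NaN.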
How to Encode Categorical Variables
Categorical (non-numerical) values in your dataset can be just as important for building a good machine learning model as numerical ones.
These could be car brand names in a cars dataset for predicting car prices. But most machine learning algorithms cannot process this data type directly, so it must be converted to numerical data before it can be used.
Pandas provides the get_dummies function, which converts categorical values into a numerical (binary) format. Each category gets its own column that marks whether a row belongs to that category. This means the algorithm does not interpret the numbers as ordered (1 is not "greater than" 0 in any meaningful sense); they simply flag which brand a row has. Code snippet is shown below:
#To convert categorical data from the column "Car_Brand" to numerical data
df_encode = pd.get_dummies(df, columns=['Car_Brand'])

#The categorical data is converted to binary indicator columns, one per category
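
To make the result concrete, here is a minimal sketch on a hypothetical cars table. The brand names and prices are made up, and dtype=int is an optional argument added here so the output shows 0/1 values.

```python
import pandas as pd

# Hypothetical cars dataset
cars = pd.DataFrame({
    "Car_Brand": ["Toyota", "Ford", "Toyota", "Honda"],
    "Price": [250, 300, 260, 275],
})

# dtype=int gives 0/1 columns; recent pandas versions default to True/False
encoded = pd.get_dummies(cars, columns=["Car_Brand"], dtype=int)

print(encoded.columns.tolist())
print(encoded["Car_Brand_Toyota"].tolist())
```

Each brand becomes its own indicator column, and the original Car_Brand column is dropped from the result.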

How to Handle Outliers
Outliers are data points that differ significantly from the majority of the data. They can distort statistical measures and adversely affect the performance of machine learning models.
They may be caused by human error or missing (NaN) values, or they could be accurate data that does not follow the pattern of the rest of the data.
There are several methods to identify and remove outliers:

Remove NaN values.

Visualize the data before and after removal.

Z-score method (for normally distributed data).

IQR (interquartile range) method, which is more robust when the data is not normally distributed.


The IQR is useful for identifying outliers in a dataset. According to the IQR method, values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.
This rule rests on the observation that the vast majority of values (about 99.3% for normally distributed data) fall within this range.
Here's a code snippet for the IQR method:
#Using the quartiles and the IQR, identify outliers and keep only the rows that fall within the bounds
Q1 = df["column_name"].quantile(0.25)
Q3 = df["column_name"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[df["column_name"].between(lower_bound, upper_bound)]
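
The Z-score method from the list above can be sketched in a similar way. The measurements below are hypothetical, and the cutoff of 3 standard deviations is a common convention assumed here, not a fixed rule.

```python
import pandas as pd

# Hypothetical measurements: 19 typical values and one extreme value
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 11, 100]
df_demo = pd.DataFrame({"column_name": values})

# Z-score: how many standard deviations each value lies from the mean
col = df_demo["column_name"]
z_scores = (col - col.mean()) / col.std()

# Keep rows within 3 standard deviations of the mean (assumed threshold)
df_no_outliers = df_demo[z_scores.abs() <= 3]

print(len(df_demo), len(df_no_outliers))
```

Because the mean and standard deviation are themselves pulled by extreme values, this method works best on larger, roughly normal samples.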

Conclusion
Data cleaning and preprocessing are integral components of any data analysis, science or machine learning project. Pandas, with its versatile functions, facilitates these processes efficiently.
By following the concepts outlined in this article, you can ensure that your data is well-prepared for analysis and modeling, ultimately leading to more accurate and reliable results.
 


 Learn Data Analysis and Visualization with Python Using Astronomical Data 
Beau Carnes — Fri, 19 Jan 2024 19:48:58 +0000
 Are you fascinated by both Python and the night sky?  
We just posted a course on the freeCodeCamp.org YouTube channel that will teach you how to use Python to analyze and visualize astronomical data. This course was created by Spartificial, whose mission is to provide engineering education in the most engaging way.
This course is a journey through the universe of data analysis and visualization, tailored specifically for astronomical data. The course covers everything from the basics of Python programming to advanced image processing techniques.
Here is what you will learn in each module of the course.
Module 1: Starting with Python
Begin your adventure with Python, starting from the very basics. You'll get acquainted with Python programming using Google Colab. Learn about variables, data types, control flow, f-strings, user inputs, and functions. This module lays a strong foundation for handling astronomical data efficiently.
Module 2: Tabular Data Visualization
Dive into the world of tabular data with tools like Pandas, Matplotlib, and Seaborn. This module teaches you to import libraries, analyze star color data, detect outliers, and create compelling visualizations like line plots and Hertzsprung-Russell diagrams. It's all about making sense of complex astronomical datasets.
Module 3: Image Data Visualization
Explore the fascinating realm of astronomical image data. Learn about FITS files and use Python to bring galaxies like M31 to life on your screen. You'll delve into image processing techniques such as MinMax and ZScaleInterval scaling, enhancing your ability to interpret celestial images.
Module 4: Image Processing | Apply Filters and Extracting Features
This module takes you deeper into the world of image processing. Learn about convolution operations, Gaussian kernels, and feature enhancement. You'll discover techniques for identifying and extracting features from astronomical images, skills that are crucial for research and analysis.
This course offers hands-on learning. It emphasizes a practical approach, filled with examples and real-world datasets, and gives you step-by-step guidance. This ensures a solid understanding of each concept, regardless of your previous experience level.
Whether you're an astronomy enthusiast, a seasoned researcher, or a curious programmer, this course offers an opportunity to enhance your skills and dive into the world of astronomical data analysis.
Watch the full course on the freeCodeCamp.org YouTube channel (7-hour watch).

        
 


 How to Prepare for Data Analyst Job Interviews 
freeCodeCamp — Fri, 12 Jan 2024 17:31:48 +0000
 By Jess Wilk
In today’s digital world, every business and organization collects and uses data to build better products, target the right customers, improve efficiency, and even forecast future demand. 
They say that data is the new oil – and now is the perfect time to enter the data analytics job market. 
According to PayScale, the average salary for entry-level roles in analytics is around $55,492 per year. The average salary for a skilled analyst is about $88,928 per year. Even if you are a beginner to programming in Python, you can learn the essential skills for data analysis quickly if you are consistent.
In this article, I’ll go over what Data Analytics skills you'll need to know, and how to prepare for and ace interviews to land a Data Analyst position with Python.
What Does a Data Analyst Do?
As a data analyst, your primary responsibility is transforming raw data into meaningful insights. 
Usually, the job description involves cleaning and organizing data to make sure that the quality of data is good. You'll also perform statistical analysis, interpret trends in complex datasets, build models, and create visualizations to communicate findings effectively. This information will help teams make business decisions and get valuable insights for the company's managers and key stakeholders.
Market research analysts collect and evaluate consumer and competitor data. A business analyst for Walmart could analyze purchase trends and identify seasonal patterns during events like Black Friday, Christmas, and New Year. This data could help the company anticipate higher demand and restock accordingly.
A data analyst at IKEA might analyze customer preferences in different rural and urban regions to better strategize which products to sell. 
Data plays a role in every stage of a company, from market sizing and customer acquisition to advertising, customer journey, final conversion rate, and data-driven decisions. 
Since I started working in data science, I have always felt like a little detective uncovering patterns and hidden knowledge. Are you now excited to learn how to become a data analyst? Let’s start with actionable insights.
Essential Technical Skills to Develop
The first step while preparing for any role is identifying and learning the right skills. Here are the essential and in-demand skills you should learn to become a data analyst:
Python Programming
One of the most crucial skills for a data analyst is proficiency in the Python programming language. Python is widely used in organizations to perform various tasks such as handling datasets, cleaning and manipulating them, and carrying out statistical analysis. 
The popularity of Python stems from its ability to support a plethora of open-source packages and libraries and its flexibility and user-friendliness. I am confident that Python will continue to be an indispensable tool for data analysts in 2024. 
If you’re new to Python, you can check out the Introduction to Python course on Hyperskill with hands-on projects, where I contribute as an expert. You don't need any degree to start learning.
But Python is vast – where should you start?
Start by learning basic syntax and data structures like lists, dictionaries, classes, and so on. 
Once you are comfortable with the basics, get familiar with the essential libraries: Pandas (to read and manipulate data frames), NumPy (for numerical and statistical computation), and Matplotlib and Seaborn (for data visualization and creating plots).

Here's a helpful course that teaches you Pandas and Python for Data Analysis.
This course teaches you how to use Matplotlib for data visualization.
Here's an in-depth guide to using NumPy for scientific computing in Python.


SQL
SQL (Structured Query Language) helps you interact with large relational databases. You should learn how to create and update SQL tables, perform filtering and aggregation, and extract insights. MySQL is a commonly used SQL database system.
You can check out the SQL course for beginners on Hyperskill. And if you want a text-based overview, here's a full handbook that teaches you all the SQL basics you'll need to know.
Data Visualization Tools and Software
Analysing the data is the process, but presenting your insights is the final destination. You must master visualization analytics tools like Tableau software or Power BI to create dashboards and reports. 
As a data analyst, you may have to present your findings to non-technical teams in a way they can easily interpret. There are also many advanced methods, like interactive dashboards and geographic mapping, for visualizing spatial data to help make informed decisions.
Statistics
Probability and Statistics cover a wide range of essential concepts for anyone working with data. You should know the basic types of distributions, such as Normal, Poisson, and skewed, and how to handle each.
Many metrics, like mean, median, and standard deviation, can help analyze numerical variables and identify anomalies or outliers. P-value and Hypothesis testing are also critical.
Here's a tutorial on the top Stats concepts to know before getting into Data Science if you want to check your skills.
Excel
Even though most of us are familiar with Excel basics, you should learn functions like VLOOKUP, HLOOKUP, INDEX, MATCH, and IF statements for data manipulation. 
Understanding how to use PivotTables for summarizing and analyzing large datasets and enabling dynamic data exploration is crucial.
If you want to learn more about how you can use Excel for data analysis, here's a course on that.
Develop Your Portfolio
The data analytics industry is highly profitable but also fiercely competitive. Simply working through courses and acquiring skills is not enough to stand out.
To become a successful data analyst, you must build a portfolio of projects demonstrating your abilities. 
Once you're familiar with the relevant technology, identify a problem that requires analysis and locate a publicly available dataset. Analyze the dataset using various methods and extract any meaningful insights. If you don't have a degree, focus on making your portfolio the best you can.
Kaggle is a best friend to any data analyst beginner. Numerous datasets are available in all fields, from movie reviews and tweets to medical X-rays. Open notebooks allow you to see what expert data scientists have worked on with the same dataset. This is a great way to get guidance on approach and inspiration for ideas to try out.
For example, take the popular Kaggle dataset of IMDB Movie reviews. What can you do with it? I’ll share a few ideas to help you get started. 
You can begin at a basic level by calculating statistics to summarize critical metrics such as average rating, distribution of ratings, and the most reviewed genres. 
Then you could use natural language processing (NLP) techniques to perform sentiment analysis on the movie reviews. 
Next, create visualizations to present findings effectively. For instance, plot sentiment scores over time, visualize the distribution of reviews across genres, or create a word cloud highlighting frequently used words in positive and negative reviews.
Tailoring your projects to align with your interests and the specific requirements of potential employers will make your portfolio stand out in a sea of applicants. 
For example, if you want to work in healthcare, do a project that adds value to the field. Remember, it's not just about the code; it's about telling a compelling story with the data.
Finally, you'll want to scrape and analyze real-time data. Build a tool that tracks social media sentiment about a brand or analyzes website traffic patterns.
How to Build a Good CV
The first stage of any job application is shortlisting based on your CV (or résumé). Creating a concise and technically sound CV/résumé to increase your odds is crucial. 
Your CV must be based on your educational background, coursework information, achievements, prior internships or work experience, and extracurriculars. 
Let me share a few tips on creating a compelling CV or résumé:

Custom CVs: When creating your résumé, customize it to the job you are applying for. Emphasize the skills and projects that are most relevant to the specific role. If appropriate, you can also include any extracurricular activities demonstrating your ability to manage a team. But you must provide only accurate information – this should go without saying, but embellishing your résumé beyond your actual experience is unacceptable.
Quantify your achievements: Instead of mentioning that you conducted data analysis, mention specific projects, tools used, and your impact. For example, you could say that you increased website conversion rate by 15% through A/B testing. Remember to add any Python libraries, frameworks, and tools you used.
Keep it concise and visually appealing: Recruiters review hundreds of résumés and may not read every line in the first round. So make a résumé that conveys your skills and experience highlights at a glance. Use bullet points, clear headings, and formatting when needed to highlight certain aspects.

Tips to Ace the Technical Interview
The final stage is the technical interview. Below, I have gathered some tips that will help you understand what your preparation might involve, along with examples of questions you might encounter. Remember that each case is unique and you should use these as general guidelines.
First, make sure you practice coding a lot. You can use platforms like HackerRank or LeetCode. Remember that clear and efficient code is vital for passing an interview. For example, you might be asked to describe the correct syntax for the reshape() function in NumPy.
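
As a refresher for a question like that one, here is a minimal sketch of reshape() on a small array. The array itself is just an illustrative example.

```python
import numpy as np

arr = np.arange(6)  # six values: 0 through 5

# reshape returns an array with the same data arranged in a new shape
matrix = arr.reshape(2, 3)     # 2 rows, 3 columns
inferred = arr.reshape(-1, 2)  # -1 lets NumPy infer that dimension (3 rows here)

print(matrix.shape, inferred.shape)
```

The total number of elements must stay the same, so reshaping six values into a shape like (4, 2) raises a ValueError.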
Next, make sure you are comfortable working with SQL. You'll need to know how to handle complex queries, joins, subqueries, and data manipulation in SQL. A question like "How do you subset or filter data in SQL?" or "What is a Subquery in SQL?" could come up.
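
Both sample questions can be answered with one short query. The sketch below runs it through Python's built-in sqlite3 module; the orders table and its values are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table, for illustration only
    CREATE TABLE orders (order_id INT, total_amount REAL);
    INSERT INTO orders VALUES (1, 500.0), (2, 750.0), (3, 1000.0), (4, 1350.0);
""")

# WHERE filters rows; the inner SELECT is a subquery that computes the average
above_average = conn.execute("""
    SELECT order_id
    FROM orders
    WHERE total_amount > (SELECT AVG(total_amount) FROM orders)
    ORDER BY order_id
""").fetchall()

print(above_average)
conn.close()
```

The subquery runs first and produces a single value, which the outer query then uses as its filter threshold.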
You should also be prepared to discuss and demonstrate your skills in data visualization. You should be able to explain your choices in visualization for different types of data. For instance, "How is joining different from blending in Tableau?" or "What is the difference between Treemaps and Heatmaps in Tableau?"
You'll also want to have a good understanding of statistics. Be prepared to discuss statistical concepts like mean, median, mode, standard deviation, correlation, and regression analysis. 
You might be asked to interpret data or explain the significance of statistical findings in a business context, such as "Explain the term Normal Distribution" or "How do you treat outliers in a dataset?"
Next, make sure you have a solid foundation in data cleaning and preprocessing. Be ready to talk about your experience with cleaning and preparing data, including handling missing values, outlier detection, and normalization.
Knowing tools like Pandas in Python can be particularly beneficial. An example question could be, "How can you add a column to a Pandas Data Frame?"
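
One possible answer to that example question, sketched on a hypothetical two-column DataFrame, shows the two most common ways to add a column.

```python
import pandas as pd

# Hypothetical data
df_demo = pd.DataFrame({"price": [100, 250], "quantity": [2, 3]})

# Direct assignment adds (or overwrites) a column in place
df_demo["total"] = df_demo["price"] * df_demo["quantity"]

# assign() returns a new DataFrame and leaves the original unchanged
with_tax = df_demo.assign(total_with_tax=df_demo["total"] * 1.1)

print(df_demo.columns.tolist())
print(with_tax.columns.tolist())
```

In an interview, it helps to mention the trade-off: direct assignment modifies the existing DataFrame, while assign() is convenient for chaining operations without side effects.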
Be comfortable with data-driven decision making. You might be asked to explain how you have used data to inform decision-making in past experiences in order to demonstrate your ability to draw conclusions from collected data and use it for the company's business decisions.
You should also be able to showcase your past work. If possible, bring examples of your past work or projects, such as a portfolio or detailed case studies. 
Be ready to discuss the challenges faced, how you approached them, and the outcomes. Questions like "Have you ever run an analysis on the wrong set of data? How did you figure out your error?" can be expected.
Also, don't neglect behavioral skills. Be prepared for behavioral questions that explore your problem-solving skills, teamwork, and ability to handle deadlines and pressure. Reflect on your past experiences and be ready to share stories that highlight these skills.
And finally, brush up on your industry knowledge. If the company operates in a specific industry (like finance, healthcare, retail, and so on), having some background knowledge or experience in that industry can be advantageous. Tailor your preparation to understand the unique data challenges and opportunities in that sector.
Remember, each company may have a different focus in their technical interviews, so try to get as much information as possible about the interview format beforehand. This way, you can tailor your preparation to meet their specific expectations.
Conclusion
Becoming a data analyst is a marathon, not a sprint.
If you are interested in a career as a data analyst, Python is an excellent language to learn. It is a versatile tool that allows you to manipulate, analyze, and visualize data effectively. By mastering in-demand skills such as Python, SQL, data visualization tools, statistics, and Excel, you can set yourself up for success in the data analytics job market.
Also, building a portfolio of projects showcasing your abilities is crucial to stand out as an entry-level data analyst. The data analytics industry is rapidly growing, and there is a high demand for qualified professionals. 
So, start learning and experimenting with data today to land your dream job as a data analyst working with Python.
Embrace the learning, celebrate the small wins, and don't be afraid to ask for help. Good luck with your goals and data analyst career path!
Thank you for reading! I'm Jess, and I'm an expert at Hyperskill. You can check out an Introduction to Python course on the platform.

How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans

Table of Contents

Background Information

What This Handbook is Really About

Who This Handbook Is For

How This Handbook Is Structured

What You'll Learn

Technical Prerequisites

Chapter 1: The Spark Mindset: Why Plans Matter

The Invisible Layer Behind Every Transformation

From Logical to Optimized to Physical Plans

1. Logical Plan

2. Analyzed Logical Plan

3. Optimized Logical Plan

4. Physical Plan

How to Read a Logical Plan

Version A: withColumn → filter

Parsed Logical Plan (Simplified)

Version B: Filter → Project

Parsed Logical Plan (Simplified)

Why You Should Look at the Plan Every Time by running df.explain(True)

What Spark Does Under the Hood

Chapter 2: Understanding the Spark Execution Flow

From Plans to Stages to Tasks

The Execution Trigger: Actions vs Transformations

Actions Trigger Execution

The Complete Execution Flow

What Triggers a Shuffle

Why Shuffles Create Stage Boundaries

Common Performance Bottlenecks

Optimized Approach

Chapter 3: Reading and Debugging Plans Like a Pro

Three Layers in Spark

Recognizing Common Nodes

Debugging Strategy: Read Plans from Top to Bottom

Catalyst Optimizer in Action

Chapter 4: Writing Efficient Transformations

Why Transformations Matter

The Goal of this Chapter

Before You Dive In:

Scenario 1: Rename in One Pass: withColumnRenamed() vs toDF()

Logical Plan Impact:

The Better Approach: Rename Once with toDF()

Logical Plan Impact:

Under the Hood: What Spark Actually Does

Real-World Timing: Glue Job Benchmark

Scenario 2: Reusing Expressions

The Problem: Repeated Expressions

The Better Approach: Compute Once, Reuse Everywhere

Under the Hood: Why Repetition Hurts

Real-World Benchmark: AWS Glue

Physical Plan Implication

Scenario 3: Batch Column Ops

The Problem: Chaining withColumn() Forever

The Better Approach: Batch with select()

Under the Hood: Why This Matters

Real-World Example: Using the Employees DF at Scale:

Scenario 4: Early Filter vs Late Filter

Problem: Late Filtering

Better Approach: Early Filtering

Real-World Benchmark: AWS Glue

Scenario 5: Column Pruning

The Problem: “The Lazy Star”

The Fix: Select Only What You Need

Real-World Benchmark: AWS Glue

Under the Hood: How Catalyst Handles Columns

Physical Plan Differences

Scenario 6: Filter Pushdown vs Full Scan

The Problem: Late Filters and Full Scans

The Fix: Filter Early and Project Light

Under the Hood: What Actually Happens

Real-World Benchmark: AWS Glue

Reflection: Why Pushdown Matters

Scenario 7: De-duplicate Right

The Problem: “All-Row Deduplication” and Why It Hurts

The Better Approach: Key-Based Deduplication

Real-World Benchmark: AWS Glue

Under the Hood: What Catalyst Does

Best Practices for Deduplication

Why You Should Look at the Plan Every Time by running `df.explain(True)`