statistics - freeCodeCamp.org

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Rakshath Naik — Tue, 05 May 2026 16:59:17 +0000

In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.

Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.

Done.

Except something looks odd.

When we take a closer look, we see that most customers are buying items worth $8 - $15. So where's $20 coming from?

In this case, the problem isn't the data. It's the average. This is a classic textbook trap: the mean works perfectly on tidy textbook examples, but real-world data doesn't behave nicely.

Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.

In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?

Prerequisites
The Dataset
Mean: The Sensitive Giant
Median: The Robust Middle
Beyond Averages: Understanding Spread with Quartiles
Applying IQR to Our Dataset
Final Comparison and Insights
Conclusion
Connect with me

Prerequisites

To follow along here, you'll need:

Basic Python knowledge: Understanding of variables and functions.

The Pandas library: Familiarity with loading data and basic DataFrame operations.

A development environment: Access to a tool like Jupyter Notebook, VS Code, or Google Colab.

A Dataset: For this analysis, I used the Online Retail Dataset, which is available for download here.

The Dataset

We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.

Source: UCI Machine Learning Repository
Collected by: UK-based online retail company (2010–2011)
Size: 541,909 transactions
Features: 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)
Ownership: Public dataset hosted by UCI
License: Open for research and educational use

Mean: The Sensitive Giant

In statistics and data analysis, the terms "average" and "arithmetic mean" are often used interchangeably. We aim to find the mean total price in our dataset. Mean in the context of the Online Retail Dataset is given as:

$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$

In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, including unusually high amounts and negative amounts from returns, directly influences the final average.

import pandas as pd

# Load the dataset (reading .xlsx files requires the openpyxl package)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")

The results are as follows:

Average Order Value (Mean): 20.40

At first glance, this looks reasonable: every transaction contributes equally. But that's exactly where the problem lies. A few extremely high or low transactions can pull the mean away from the range where most customers actually spend.

Take a look at the graph for the mean below.

The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)

The graph shows a right-skewed distribution in which the calculated mean of 20.40 falls into the textbook trap. The tallest bar clearly shows that the majority of transactions lie in the $8 - $15 range, but the red line is dragged to the right by the long tail of high-value bulk orders from a few customers.

In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.

In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.

Median: The Robust Middle

When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.

Median is defined as the middle value after sorting the data.

In our dataset, we sort all the transactions and pick the middle one.

The formula for calculating the median is:

$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} & \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} & \text{if } n \text{ is even} \end{cases}$$

Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.
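To see this robustness on a small scale, here is a minimal sketch using Python's built-in statistics module. The order values are made up for illustration and are not taken from the Online Retail Dataset.

```python
from statistics import mean, median

# Hypothetical order values (not from the real dataset)
orders = [8, 9, 10, 11, 12, 13, 15]
print(f"Mean: {mean(orders):.2f}, Median: {median(orders)}")

# Add a single bulk order of $300 and recompute both measures
orders_with_bulk = orders + [300]
mean_with_bulk = mean(orders_with_bulk)
median_with_bulk = median(orders_with_bulk)
print(f"Mean: {mean_with_bulk:.2f}, Median: {median_with_bulk}")
```

One extra order moves the mean from about 11 to about 47, while the median only moves from 11 to 11.5.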

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")

The results are as follows:

Typical Order Value (Median): 11.10

Now you'll notice that the result lies in the $8 - $15 range, where most of the transactions lie.

The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)

In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.

In the above figure the median graph accurately highlights the range where most of the customers lie.

Beyond Averages: Understanding Spread with Quartiles

So far, we've studied the median, but knowing the center is not enough.

To truly understand customer spending, we need to understand how the data is spread, and this is where quartiles come into play.

Quartiles divide the dataset into the following parts:

Q1 (25th percentile): 25% of transactions are below this.
Q2 (50th percentile): Median
Q3 (75th percentile): 75% of transactions are below this.

The distance between Q3 and Q1 is called the Interquartile Range (IQR):

$$IQR = Q_3 - Q_1$$

The IQR: Detecting Outliers

The IQR measures the spread of the middle 50%.

If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.

Outlier Rule:

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

A Simple Example to Understand IQR

Consider the following transaction values:

$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$

Step 1: Find the Median (Q2):

The middle value is:

$$Q_2 = 12$$

Step 2: Find Q1 (Lower Quartile):

The lower half is [5, 8, 10]. The median of the lower half is:

$$Q_1 = 8$$

Step 3: Find Q3 (Upper Quartile):

The upper half is [15, 18, 20]. The median of the upper half is:

$$Q_3 = 18$$

Step 4: Calculate IQR:

$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$

Step 5: Find Outlier Bounds:

$$\begin{aligned} \text{Lower Bound} &= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$

Any value below -7 or above 33 is an outlier (but in this demo problem, no outliers exist).
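The five steps above can be checked with a short sketch that uses Python's built-in statistics.quantiles function. Its default "exclusive" method reproduces the hand calculation for this example. The "inclusive" method interpolates linearly, which is also what pandas' quantile() does by default, so it gives slightly different quartiles for such a small list.

```python
from statistics import quantiles

values = [5, 8, 10, 12, 15, 18, 20]

# "exclusive" (the default) matches the hand calculation above
q1, q2, q3 = quantiles(values, n=4, method="exclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]
print(q1, q2, q3, iqr, lower_bound, upper_bound, outliers)

# "inclusive" uses linear interpolation between neighboring values
q1_inc, _, q3_inc = quantiles(values, n=4, method="inclusive")
print(q1_inc, q3_inc)
```

On a dataset with hundreds of thousands of rows, the difference between the two quartile conventions is usually negligible, but it explains why a library result may not match a by-hand result on a tiny example.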

Applying IQR to Our Dataset

In our retail dataset, instead of neat values, we have bulk values and even negative returns.

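# 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] < lower_bound) | (df['TotalPrice'] > upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")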

When we calculate IQR for our dataset, we get:

Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180

The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)

As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.

Revisiting the Mean After Removing Outliers

Using the IQR method, we've removed extreme transactions that fell outside the typical spending range.

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] >= lower_bound) & (df['TotalPrice'] <= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")

After recomputing, we get:

Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63

Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much better mean of 11.63 as opposed to the right-stretched mean of 20.40 we got with outliers.
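The same effect can be reproduced on a tiny list, which makes it easier to follow by hand. This is only a sketch with made-up order values (one bulk order and one return), using the statistics module instead of pandas. The "inclusive" method is chosen because it interpolates the same way pandas' quantile() does by default.

```python
from statistics import mean, median, quantiles

# Hypothetical order values: typical orders, one bulk order, one return
orders = [8, 9, 10, 11, 12, 13, 15, 300, -50]

# Quartiles and IQR bounds
q1, _, q3 = quantiles(orders, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values inside the bounds
kept = [v for v in orders if lower_bound <= v <= upper_bound]

print(f"Mean before: {mean(orders):.2f}")
print(f"Mean after:  {mean(kept):.2f}")
print(f"Median:      {median(orders)}")
```

After the two extreme values are dropped, the mean lands close to the median, just as it does on the full dataset.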

Final Comparison and Insights

Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions that actually occurred. The mean was pulled upward by a few high-value transactions and was distorted by the outliers.

The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.

After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.

Conclusion

The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.

This highlights a key lesson: The mean isn't wrong, but it must be used with an understanding of the data.

Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.

Connect with me

If you want to dive deeper, you can visit: Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis.

What are Markov Chains? Explained With Python Code Examples

Tiago Capelo Monteiro — Mon, 08 Jul 2024 12:53:27 +0000

There are various mathematical tools that can be used to predict the near future based on a current state. One of the most widely used is the Markov chain.

Markov chains allow you to model the uncertainty of future events under certain conditions. For this reason, they are widely used in science, engineering, economics, and many other areas.

However, there are many types of Markov chains, and each has its own applications.

This guide introduces what Markov chains are and covers different types of Markov chains, including Discrete-Time, Continuous-Time, and Reversible chains, along with a code example of Hidden Markov Models (HMMs).

We will see:

Analogy
Markov Chain Explained in plain English
Applications of Markov Chains
Types of Markov Chains
Hidden Markov Chains Code Example

Analogy

Imagine that you want to predict the weather tomorrow, and it only depends on the weather today. The weather can be either sunny or rainy.

Here are the probabilities:

If it's sunny today, there's an 80% chance that it will be sunny again tomorrow, and a 20% chance that it will be rainy.
If it's rainy today, there's a 50% chance that it will be sunny tomorrow, and a 50% chance that it will be rainy.

In this scenario, we can predict future states of the weather based on current states using probabilities.

This idea of predicting the future based solely on the probabilities of the present is called a Markov chain.

Here, the states are either sunny or rainy and the probabilities describe the chances of the weather changing based on the current state.
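The weather analogy can be written as a small transition matrix. The sketch below uses NumPy and the probabilities given above; the variable names and the assumption that today is sunny are only for illustration.

```python
import numpy as np

# Rows: today's weather, columns: tomorrow's weather (order: sunny, rainy)
P = np.array([[0.8, 0.2],
              [0.5, 0.5]])

today = np.array([1.0, 0.0])  # assume it is sunny today

# Multiplying by P moves the probabilities one day forward
tomorrow = today @ P
day_after = tomorrow @ P

print("Tomorrow:  ", tomorrow)
print("Day after: ", day_after)
```

Each multiplication by the matrix advances the forecast by one day, and only the current distribution is needed to compute the next one.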

Markov Chain Explained in Plain English

A Markov chain describes random processes where systems move between states, and a new state only depends on the current state, not on how it got there.

Mathematically, Markov chains are called stochastic models because they model (simulate) real life events that are random by nature (stochastic).

Markov chains are very easy to implement and efficient at modeling complex systems.

Another key advantage is their "memoryless" property. This makes them faster to run on computers, and powerful for studying random processes and making predictions based on current conditions.

Applications of Markov Chains

At some level, almost all real-life events are stochastic. In other words, they involve randomness and uncertainty.

This is exactly why Markov chains are so widely used. They can predict the behavior of systems based on current conditions.

In finance, they are used to model changes in credit ratings and to forecast market regimes.

In genetics, they help us understand how proteins change over time, which is important when studying genetic variations.

In robotics, they assist with decision-making by predicting the robot's next move based on its current observation.

These real-life examples show how effectively Markov chains can be used to solve problems in different fields.

Types of Markov Chains

There are many types of Markov chains. In this section, we'll only discuss the most important variants of Markov chains.

Discrete-Time Markov Chains (DTMCs)

In DTMCs, the system changes state at specific time steps. They are called discrete because the state transitions occur at distinct, separate time intervals.

They are used in queuing theory (study of the behavior of waiting lines), genetics, and economics because they are simple to analyze.
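As a rough sketch of a discrete-time chain, the code below simulates a sequence of states one step at a time. It reuses the two-state weather probabilities from the analogy (0 = sunny, 1 = rainy); the function name simulate_dtmc and the seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition matrix: row i holds the probabilities of leaving state i
P = np.array([[0.8, 0.2],
              [0.5, 0.5]])

def simulate_dtmc(P, start_state, n_steps, rng):
    """Simulate n_steps transitions of a discrete-time Markov chain."""
    states = [start_state]
    for _ in range(n_steps):
        current = states[-1]
        # The next state depends only on the current state
        next_state = rng.choice(len(P), p=P[current])
        states.append(int(next_state))
    return states

path = simulate_dtmc(P, start_state=0, n_steps=10, rng=rng)
print(path)
```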

Continuous-Time Markov Chains (CTMCs)

CTMCs differ from DTMCs in that state transitions can occur at any continuous time point, not at fixed intervals.

This makes them stochastic models where state changes happen continuously. This is important in chemical reactions and reliability engineering.
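A common way to simulate a continuous-time chain is to draw an exponentially distributed waiting time before each jump. The sketch below models a hypothetical machine that alternates between "working" and "broken"; the rates, state names, and function name are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rates (per hour) at which the machine leaves each state
leave_rate = {"working": 0.1, "broken": 1.0}
other = {"working": "broken", "broken": "working"}

def simulate_ctmc(start_state, t_end, rng):
    """Return a list of (time, state) pairs up to time t_end."""
    t, state = 0.0, start_state
    history = [(t, state)]
    while True:
        # Waiting time in the current state is exponentially distributed
        t += rng.exponential(scale=1.0 / leave_rate[state])
        if t >= t_end:
            break
        state = other[state]
        history.append((t, state))
    return history

history = simulate_ctmc("working", t_end=50.0, rng=rng)
for time, state in history[:5]:
    print(f"{time:6.2f}  {state}")
```

Unlike the discrete-time case, the jump times here are real numbers rather than fixed steps.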

Reversible Markov Chains

Reversible Markov chains are special. The process of state change is the same whether the direction is forwards or backwards, like rewinding a video and playing it again.

This property makes it easier to know when a system is stable and study how a system behaves over time. They are widely used in statistical physics and economics.
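Reversibility is usually checked with the detailed balance condition: in the long run, the probability flow from state i to state j equals the flow from j to i. The sketch below tests this for the two-state weather matrix from the analogy, using NumPy's eigenvalue solver to find the long-run (stationary) distribution. The variable names are assumptions for this example.

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.5, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigenvalues - 1.0))
pi = np.real(eigenvectors[:, idx])
pi = pi / pi.sum()  # normalize so the probabilities sum to 1

# Detailed balance: pi[i] * P[i, j] == pi[j] * P[j, i] for all i, j
flow = pi[:, None] * P
is_reversible = np.allclose(flow, flow.T)

print("Stationary distribution:", pi)
print("Reversible:", is_reversible)
```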

Doubly Stochastic Markov Chains

Doubly stochastic Markov chains are defined by a transition probability matrix. In the matrix, the sum of the probabilities in each row and each column equals 1.

This means each row and each column represent a valid probability distribution. In other words, each row and column represent a list of chances for different outcomes.

This property is crucial in quantum computing and statistical mechanics.

Thanks to Doubly stochastic Markov chains, systems change in a way that preserves probabilities and symmetry, making the modeling and analysis of quantum computing systems far more accurate.
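The row and column condition is easy to verify numerically. The matrix below is a made-up example, chosen only because its rows and columns each sum to 1. The sketch also shows one known consequence: a uniform distribution over the states stays uniform after a step.

```python
import numpy as np

# Hypothetical doubly stochastic transition matrix
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

rows_ok = np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1
cols_ok = np.allclose(P.sum(axis=0), 1.0)  # each column sums to 1

# Starting from a uniform distribution, one step leaves it unchanged
uniform = np.full(3, 1 / 3)
stays_uniform = np.allclose(uniform @ P, uniform)

print(rows_ok, cols_ok, stays_uniform)
```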

Hidden Markov Chains Code Example

Before we jump into code examples, let's first understand what Hidden Markov Chains are.

Hidden Markov Chains: Modeling Unseen States

The main idea behind hidden Markov chains is to model systems that have hidden states (states whose values we cannot observe directly), which can only be inferred through observable events.

In other words, hidden Markov chains allow us to predict the behavior of a system by:

Considering the likelihood of moving from one state to another.
Knowing the probability of observing a certain event from each state.

We can understand this by observing how the states change from an indirect point of view.

We may not know the states' original values.

But by knowing the way they change, we can predict what their values will be in the future.

This way, hidden Markov chains are flexible in modeling sequences, capturing both the transitions between hidden states and the observable outcomes.

Because of this, hidden Markov models are used in fields such as engineering, financial modeling, speech recognition, bioinformatics, and many more.
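To make the two ingredients concrete (transition probabilities between hidden states and the probability of each observation given a state), here is a minimal NumPy sketch of the forward algorithm, which computes how likely a sequence of observations is. All numbers are hypothetical: the hidden states are sunny and rainy, and the observation is whether someone carries an umbrella (0 = no, 1 = yes).

```python
import numpy as np
from itertools import product

# Hypothetical parameters for a two-state hidden Markov model
start = np.array([0.6, 0.4])        # initial hidden-state probabilities
trans = np.array([[0.8, 0.2],       # hidden-state transition matrix
                  [0.5, 0.5]])
emit = np.array([[0.9, 0.1],        # P(observation | sunny)
                 [0.3, 0.7]])       # P(observation | rainy)

def sequence_likelihood(observations):
    """Forward algorithm: probability of seeing the observation sequence."""
    alpha = start * emit[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]
    return float(alpha.sum())

likelihood = sequence_likelihood([0, 0, 1])
print(f"P(no umbrella, no umbrella, umbrella) = {likelihood:.4f}")

# Sanity check: the probabilities of all length-3 sequences sum to 1
total = sum(sequence_likelihood(list(seq)) for seq in product([0, 1], repeat=3))
print(f"Sum over all sequences = {total:.4f}")
```

The hidden states never appear in the input; only the observations do, which is the sense in which the model is "hidden".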

Code Example

In this code example, we will see a simple example with synthetic data.

Here is the full code:

import numpy as np
from hmmlearn import hmm

# Set random seed for reproducibility
np.random.seed(42)

# Define the HMM parameters
n_components = 2  # Number of states
n_features = 1    # Number of observation features

# Create a Gaussian HMM
model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

# Define initial state probabilities and transition matrix (rows must sum to 1)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

# Define means and covariances for each state
model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

# Generate synthetic observation data
X, Z = model.sample(100)  # 100 samples

# Create a new HMM instance
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

# Fit the model to the data
new_model.fit(X)

# Print the learned parameters
print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

# Predict the hidden states for the observed data
hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

Let's see the code block by block!

Import libraries and set random seed

import numpy as np
from hmmlearn import hmm

np.random.seed(42)

In this block of code, we imported two Python libraries:

NumPy: For numerical operations.
hmmlearn: For hidden Markov model implementation.

Next, we set a random seed with the NumPy library.

What is a Random Seed?

A random seed is a value used to start a pseudorandom number generator.

With a fixed random seed, we ensure that the sequence of pseudorandom numbers generated is always the same.

This allows us to duplicate experiments and verify results.

The specific value of the seed does not matter as long as it remains consistent.
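A quick way to confirm this behavior is to reset the seed and draw the same numbers twice. This is a small sketch; the choice of three random numbers is arbitrary.

```python
import numpy as np

# First run: set the seed, then draw three random numbers
np.random.seed(42)
first_run = np.random.rand(3)

# Second run: reset the same seed and draw again
np.random.seed(42)
second_run = np.random.rand(3)

same = np.array_equal(first_run, second_run)
print(same)  # True
```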

Define the HMM parameters and create a Gaussian HMM

n_components = 2  # Number of states
n_features = 1    # Number of observation features

model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

In this code block, we created an HMM with two hidden states and a single observed variable.

covariance_type "diag" means that each state's covariance matrix (which describes how the observed features vary together) is diagonal. In other words, the observed features are assumed to be uncorrelated with each other within each state.

With a single observed feature, as in this example, each state simply has its own variance.

However, there is still one thing we haven't explained about the way we defined the hidden Markov chain.

What Does "Gaussian" Mean?

This is a very big topic in statistics, but in a few words, a Markov chain can only be created when we specify the transition probabilities (the chances of moving from one state to another) and an initial probability distribution. A hidden Markov model also needs a distribution that describes the observations produced by each hidden state.

A Gaussian HMM assumes that the observations produced by each hidden state follow a Gaussian distribution, also called a normal distribution.

Normal distribution

A normal distribution is like a bell-shaped curve that describes how things are often spread out in nature.

The normal distribution is crucial because it describes many natural occurrences like human heights, measurement errors, how likely a disease might spread and many more.

And while many natural events aren't exactly normally distributed, the central limit theorem explains why quantities that are sums or averages of many independent effects can often be approximated by a normal distribution.

This way, many hidden Markov models (HMMs) are defined with a normal distribution, which represents many phenomena in nature and society.

In the hmmlearn library, there is also the possibility of creating Markov chains based on Poisson distributions.

In simple words, Poisson distributions model probabilities that describe the occurrence of events over a fixed interval of time or space. This is widely used in telecommunications.

HMMs based on a Poisson distribution would predict events that often happen to be random and independent over a specified interval.

Define transition matrix, means and covariances for each state

model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.4, 0.6]])

model.means_ = np.array([[0.0], [3.0]])
model.covars_ = np.array([[0.5], [0.5]])

model.startprob_ = np.array([0.6, 0.4]):

This line sets the initial state probabilities for a Hidden Markov Model (HMM). It indicates that there is a 60% probability of starting in state 0 and a 40% probability of starting in state 1.

model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]]):

This line sets the state transition probability matrix for the HMM. The matrix specifies the probabilities of moving from one state to another:
From state 0, there is a 70% chance of staying in state 0 and a 30% chance of transitioning to state 1.
From state 1, there is a 40% chance of transitioning to state 0 and a 60% chance of staying in state 1.

model.means_ = np.array([[0.0], [3.0]]):

This line sets the mean values for the observation distributions in each state. It indicates that the observations are normally distributed with a mean of 0.0 in state 0 and a mean of 3.0 in state 1.

model.covars_ = np.array([[0.5], [0.5]]):

This line sets the covariance values for the observation distributions in each state. It specifies that the variance (covariance in this 1-dimensional case) of the observations is 0.5 for both state 0 and state 1.

Create data, new HMM instance and fit the model with the data

X, Z = model.sample(100)  # 100 samples

new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)

new_model.fit(X)

print("Transition matrix:")
print(new_model.transmat_)
print("Means:")
print(new_model.means_)
print("Covariances:")
print(new_model.covars_)

In this code, we generated 100 samples, created a new model that runs at most 100 training iterations (n_iter=100), and printed the learned state transition matrix, means, and covariances.

In other words, we generated 100 samples from the original model, fit a new Hidden Markov Model (HMM) to these samples, and then printed the learned parameters of this new model.

X means the observed data samples generated by the original model.
Z means the hidden state sequences corresponding to the observed data samples generated by the original model.

The transition matrix prints out:

[[0.8100804  0.1899196 ]
 [0.49398918 0.50601082]]

This means that the model tends to stay in state 0 and has nearly equal chances of switching or staying when in state 1.

The means print out:

[[0.01577373]
 [3.06245496]]

This means that the average observed value is approximately 0.016 in state 0 and 3.062 in state 1.

The covariances print out:

[[[0.41987084]]
 [[0.53146802]]]

This means that the variance of the observed values is about 0.420 in state 0 and 0.531 in state 1.

This way, we may never know exactly the values of the states, but we know:

How they tend to change with each other
Their average observed value
How they vary

Predict the hidden states for the observed data

hidden_states = new_model.predict(X)

print("Hidden states:")
print(hidden_states)

In this code, based on the observed data samples X, we predicted the most likely sequence of hidden states under the fitted model.

The hidden states print out:

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1
 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0]

Which means that the hidden states switch between state 0 and state 1, showing how the system changes states over time.
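One way to summarize such a sequence is to count how often each transition occurs and turn the counts into empirical transition probabilities. The sketch below does this with NumPy on a short, made-up state sequence rather than the full output above.

```python
import numpy as np

# Hypothetical sequence of predicted hidden states
states = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

# counts[i, j] = number of times state i is followed by state j
counts = np.zeros((2, 2), dtype=int)
np.add.at(counts, (states[:-1], states[1:]), 1)

# Normalize each row to get empirical transition probabilities
empirical_transmat = counts / counts.sum(axis=1, keepdims=True)

print(counts)
print(empirical_transmat)
```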

Conclusion: The Future of Markov Chains

Markov chains are widely used in STEM fields due to their ability to predict the future based on the present.

Markov chains have been increasingly integrated with artificial intelligence, improving automation and predictive analytics of systems.

Additionally, the development of more computationally efficient Markov chains is a big priority, making them more accessible for real-time processing and large-scale simulations.

In summary, Markov chains are a very important tool in science due to their ability to predict the future.

With AI and more computational efficiency, Markov chains can be applied in many other fields and solve many problems.

Learn Statistics for Data Science, Machine Learning, and AI – Full Handbook

Tatev Aslanyan — Fri, 12 Apr 2024 23:08:39 +0000

Karl Pearson was a British mathematician who once said "Statistics is the grammar of science". This holds true especially for Computer and Information Sciences, Physical Science, and Biological Science.

When you are getting started with your journey in Data Science, Data Analytics, Machine Learning, or AI (including Generative AI), having statistical knowledge will help you better leverage data insights and actually understand all the algorithms beyond their implementation approach.

I can't overstate the importance of statistics in data science and Artificial Intelligence. Statistics provides tools and methods to find structure and give deeper data insights. Both Statistics and Mathematics love facts and hate guesses. Knowing the fundamentals of these two important subjects will allow you to think critically, and be creative when using the data to solve business problems and make data-driven decisions.

Key statistical concepts for your data science or data analysis journey with Python Code

In this handbook, I will cover the following Statistics topics for data science, machine learning, and artificial intelligence (including GenAI):

Random variables
Mean, Variance, Standard Deviation
Covariance and Correlation
Probability distribution functions (PDFs)
Bayes' Theorem
Linear Regression and Ordinary Least Squares (OLS)
Gauss-Markov Theorem
Parameter properties (Bias, Consistency, Efficiency)
Confidence intervals
Hypothesis testing
Statistical significance
Type I & Type II Error
Statistical tests (Student's t-test, F-test, 2-Sample T-Test, 2-Sample Z-Test, Chi-Square Test)
p-value and its limitations
Inferential Statistics
Central Limit Theorem & Law of Large Numbers
Dimensionality reduction techniques (PCA, FA)
Interview Prep - Top 7 Statistics Questions with Answers
About The Author
How Can You Dive Deeper?

If you have no prior statistical knowledge and you want to identify and learn the essential statistical concepts from scratch and prepare for your job interviews, then this handbook is for you. It will also be a good read for anyone who wants to refresh their statistical knowledge.

Prerequisites

Before you start reading this handbook about key concepts in Statistics for Data Science, Machine Learning, and Artificial Intelligence, there are a few prerequisites that will help you make the most out of it.

This list is designed to ensure you are well-prepared and can fully grasp the statistical concepts discussed:

Basic Mathematical Skills: Comfort with high school level mathematics, including algebra and basic calculus, is essential. These skills are crucial for understanding statistical formulas and methods.
Logical Thinking: Ability to think logically and methodically to solve problems will aid in understanding statistical reasoning and applying these concepts to data-driven scenarios.
Computer Literacy: Basic knowledge of using computers and the internet is necessary since many examples and exercises might require the use of statistical software or coding.
Basic knowledge of Python, such as the creation of variables and working with some basic data structures and coding is also required (if you are not familiar with these concepts, check out my Python for Data Science 2024 -Full Course for Beginners here).
Curiosity and Willingness to Learn: A keen interest in learning and exploring data is perhaps the most important prerequisite. The field of data science is constantly evolving, and a proactive approach to learning will be incredibly beneficial.

This handbook assumes no prior knowledge of statistics, making it accessible to beginners. Still, familiarity with the above concepts will greatly enhance your understanding and ability to apply statistical methods effectively in various domains.

If you want to learn Mathematics, Statistics, Machine Learning or AI check out our YouTube Channel and LunarTech.ai for free resources.

Random Variables

Random variables form the cornerstone of many statistical concepts. It might be hard to digest the formal mathematical definition of a random variable, but simply put, it's a way to map the outcomes of random processes, such as flipping a coin or rolling a die, to numbers.

For instance, we can define the random process of flipping a coin by random variable X which takes a value 1 if the outcome is heads and 0 if the outcome is tails.

$$X = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails} \end{cases}$$

In this example, we have a random process of flipping a coin where this experiment can produce two possible outcomes: {0,1}. This set of all possible outcomes is called the sample space of the experiment. Each repetition of the random process is called a trial, and any outcome (or set of outcomes) we are interested in is referred to as an event.

In this example, flipping a coin and getting a tail as an outcome is an event. The chance or the likelihood of this event occurring with a particular outcome is called the probability of that event.

A probability of an event is the likelihood that a random variable takes a specific value of x which can be described by P(x). In the example of flipping a coin, the likelihood of getting heads or tails is the same, that is 0.5 or 50%. So we have the following setting:

$$\begin{aligned} \Pr(X = 1) &= 0.5 \\ \Pr(X = 0) &= 0.5 \end{aligned}$$

where the probability of an event, in this example, can only take values in the range [0,1].
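A short simulation can make this concrete. The sketch below uses Python's built-in random module to flip a fair coin many times, with 1 standing for heads and 0 for tails as defined above. The number of flips and the seed are arbitrary choices for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_flips = 10_000

# Each flip is a realization of X: 1 for heads, 0 for tails
flips = [rng.randint(0, 1) for _ in range(n_flips)]

p_heads = sum(flips) / n_flips
p_tails = 1 - p_heads

print(f"Estimated P(X = 1): {p_heads:.3f}")
print(f"Estimated P(X = 0): {p_tails:.3f}")
```

With this many flips, both estimates should land close to the theoretical value of 0.5.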

Mean, Variance, Standard Deviation

To understand the concepts of mean, variance, and many other statistical topics, it is important to learn the concepts of population and sample.

The population is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse. On the other hand, a sample is a subset of observations from the population that ideally is a true representation of the population.

Image Source: LunarTech

Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials.

To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.

For this purpose, we can use statistical sampling techniques such as Random Sampling, Systematic Sampling, Clustered Sampling, Weighted Sampling, and Stratified Sampling.
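As a rough illustration of two of these techniques, the sketch below draws a simple random sample and a systematic sample from a made-up population of 100 numbered individuals, using only Python's built-in random module. The population, sample size, and variable names are assumptions for this example.

```python
import random

rng = random.Random(0)

# Hypothetical population: individuals numbered 1 to 100
population = list(range(1, 101))

# Random sampling: every individual has the same chance of being picked
simple_random = rng.sample(population, k=10)

# Systematic sampling: pick a random start, then take every 10th individual
step = len(population) // 10
start = rng.randrange(step)
systematic = population[start::step]

print("Random sample:    ", sorted(simple_random))
print("Systematic sample:", systematic)
```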

Mean

The mean, also known as the average, is a central value of a finite set of numbers. Let’s assume a random variable X in the data has the following values:

$$X_1, X_2, X_3, \ldots, X_N$$

where N is the number of observations or data points in the sample. Then the sample mean, denoted by μ and very often used to approximate the population mean, can be expressed as follows:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

The mean is also referred to as the expectation, which is often denoted by E() or by the random variable with a bar on top. For example, the expectations of random variables X and Y, that is E(X) and E(Y), respectively, can be expressed as follows:

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$

Now that we have a solid understanding of the mean as a statistical measure, let's see how we can apply this knowledge practically using Python. Python is a versatile programming language that, with the help of libraries like NumPy, makes it easy to perform complex mathematical operations—including calculating the mean.

In the following code snippet, we demonstrate how to compute the mean of a set of numbers using NumPy. We will start by showing the calculation for a simple array of numbers. Then, we'll address a common scenario encountered in data science: calculating the mean of a dataset that includes undefined or missing values, represented as NaN (Not a Number). NumPy provides a function specifically designed to handle such cases, allowing us to compute the mean while ignoring these NaN values.

Here is how you can perform these operations in Python:

import numpy as np
import math
x = np.array([1,3,5,6])
mean_x = np.mean(x)

# in case the data contains NaN values
x_nan = np.array([1,3,5,6, math.nan])
mean_x_nan = np.nanmean(x_nan)
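To connect the code back to the formula, the following sketch computes the mean by hand (sum divided by N) and compares it with NumPy's result. It also shows what happens if you call np.mean on data that contains a NaN.

```python
import numpy as np

x = np.array([1, 3, 5, 6])

# Mean written out as in the formula: sum of the values divided by N
manual_mean = x.sum() / len(x)

print(manual_mean)        # 3.75
print(np.mean(x))         # 3.75

# With a missing value, np.mean returns nan while np.nanmean skips it
x_nan = np.array([1, 3, 5, 6, np.nan])
print(np.mean(x_nan))     # nan
print(np.nanmean(x_nan))  # 3.75
```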

Variance

The variance measures how far the data points are spread out from the average value. It's equal to the average of the squared differences between the data values and the mean.

We can express the population variance as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

To compute the population variance using Python, we use the var function from the NumPy library. By default, this function calculates the population variance by setting the ddof (Delta Degrees of Freedom) parameter to 0. However, when dealing with samples and not the entire population, you would typically set ddof to 1 to get the sample variance.

The code snippet below shows how to calculate the variance for a set of data. It also shows how to calculate the variance when there are NaN values in the data. NaN values represent missing or undefined data. When calculating the variance, these NaN values must be handled correctly; otherwise, the result is itself NaN, which is uninformative.

x = np.array([1,3,5,6])
variance_x = np.var(x)

# ddof = 1 gives the sample variance: the sum of squared deviations is divided by N - 1 instead of N
x_nan = np.array([1,3,5,6, math.nan])
variance_x_nan = np.nanvar(x_nan, ddof = 1)

For deriving expectations and variances of different popular probability distribution functions, check out this GitHub repo.

Standard Deviation

The standard deviation is simply the square root of the variance and measures the extent to which data varies from its mean. The standard deviation, denoted by sigma, can be expressed as follows:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Standard deviation is often preferred over the variance because it has the same units as the data points, which means you can interpret it more easily.

Here is how you can calculate the standard deviation in Python, taking into account the potential presence of NaN values:

x = np.array([1,3,5,6])
std_x = np.std(x)

x_nan = np.array([1,3,5,6, math.nan])
std_x_nan = np.nanstd(x_nan, ddof = 1)
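As a quick check on what ddof does, this sketch computes the variance and standard deviation by hand and compares them with NumPy. It reuses the same four-element array as above.

```python
import numpy as np

x = np.array([1, 3, 5, 6])
N = len(x)
mu = x.mean()

squared_deviations = (x - mu) ** 2

population_variance = squared_deviations.sum() / N     # divide by N (ddof = 0)
sample_variance = squared_deviations.sum() / (N - 1)   # divide by N - 1 (ddof = 1)

print(population_variance, np.var(x))           # both 3.6875
print(sample_variance, np.var(x, ddof=1))       # same value twice
print(np.sqrt(population_variance), np.std(x))  # standard deviation = sqrt(variance)
```

The sample variance is always slightly larger than the population variance because it divides by N - 1 rather than N.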

Covariance

The covariance is a measure of the joint variability of two random variables and describes the relationship between these two variables. It is defined as the expected value of the product of the two random variables’ deviations from their means.

The covariance between two random variables X and Z can be described by the following expression, where E(X) and E(Z) represent the means of X and Z, respectively.

$$\text{Cov}(X, Z) = E\left[(X - E(X))(Z - E(Z))\right]$$

Covariance can take negative or positive values as well as a value of 0. A positive value of covariance indicates that two random variables tend to vary in the same direction, whereas a negative value suggests that these variables vary in opposite directions. Finally, the value 0 means that they don’t vary together.

To explore the concept of covariance practically, we will use Python with the NumPy library, which provides powerful numerical operations. The np.cov function can be used to calculate the covariance matrix for two or more datasets. In the matrix, the diagonal elements represent the variance of each dataset, and the off-diagonal elements represent the covariance between each pair of datasets.

Let's look at an example of calculating the covariance between two sets of data:

x = np.array([1,3,5,6])
y = np.array([-2,-4,-5,-6])

#this will return the covariance matrix of x,y containing x_variance, y_variance on diagonal elements and covariance of x,y
cov_xy = np.cov(x,y)
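The sketch below computes the covariance by hand and compares it with the matrix returned by np.cov. One detail worth knowing: np.cov divides by N - 1 by default (sample covariance), so the manual version here does the same.

```python
import numpy as np

x = np.array([1, 3, 5, 6])
y = np.array([-2, -4, -5, -6])

# Sample covariance: sum of the products of deviations, divided by N - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

cov_matrix = np.cov(x, y)  # 2x2 matrix; np.cov uses N - 1 by default

print(cov_manual)                            # -3.75
print(cov_matrix[0, 1])                      # off-diagonal element: covariance of x and y
print(cov_matrix[0, 0], np.var(x, ddof=1))   # diagonal element: sample variance of x
```

The negative sign tells us that y tends to decrease as x increases.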

Correlation

The correlation is also a measure of a relationship. It measures both the strength and the direction of the linear relationship between two variables.

If a correlation is detected, then it means that there is a relationship or a pattern between the values of two target variables. Correlation between two random variables X and Z is equal to the covariance between these two variables divided by the product of the standard deviations of these variables. This can be described by the following expression:

$$\rho_{X,Z} = \frac{\text{Cov}(X, Z)}{\sigma_X \sigma_Z}$$

Correlation coefficients’ values range between -1 and 1. Keep in mind that the correlation of a variable with itself is always 1, that is Cor(X, X) = 1.

Another thing to keep in mind when interpreting correlation is to not confuse it with causation, given that a correlation is not necessarily a causation. Even if there is a correlation between two variables, you cannot conclude that one variable causes a change in the other. This relationship could be coincidental, or a third factor might be causing both variables to change.

x = np.array([1,3,5,6])
y = np.array([-2,-4,-5,-6])

corr = np.corrcoef(x,y)
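To see how the formula relates to np.corrcoef, here is a small sketch that builds the correlation from the covariance and the two standard deviations. It uses the sample versions (ddof = 1) throughout so that the numerator and denominator are consistent.

```python
import numpy as np

x = np.array([1, 3, 5, 6])
y = np.array([-2, -4, -5, -6])

# Correlation = covariance divided by the product of the standard deviations
cov_xy = np.cov(x, y)[0, 1]
rho = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

corr_matrix = np.corrcoef(x, y)

print(rho)                # manual result
print(corr_matrix[0, 1])  # off-diagonal element: correlation of x and y
print(corr_matrix[0, 0])  # diagonal element: correlation of x with itself
```

The result is close to -1, which indicates a strong negative linear relationship between these two arrays.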

Probability Distribution Functions

A function that describes all the possible values, the sample space, and the corresponding probabilities that a random variable can take within a given range, bounded between the minimum and maximum possible values, is called a probability distribution function (pdf) or probability density.

Every pdf needs to satisfy the following two criteria:

$$0 \leq \Pr(X) \leq 1 \\ \sum p(X) = 1$$

where the first criterion states that all probabilities should be numbers in the range of [0,1] and the second criterion states that the sum of all possible probabilities should be equal to 1.

Probability functions are usually classified into two categories: discrete and continuous.

A discrete distribution function describes a random process with a countable sample space, as in the example of tossing a coin, which has only two possible outcomes. A continuous distribution function describes a random process with a continuous sample space.

Examples of discrete distribution functions are Bernoulli, Binomial, Poisson, Discrete Uniform. Examples of continuous distribution functions are Normal, Continuous Uniform, Cauchy.

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each with the boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p).

Let's assume a random variable X follows a Binomial distribution. Then the probability of observing k successes in n independent trials can be expressed by the following probability mass function:

$$\Pr(X = k) = \binom{n}{k} p^k q^{n-k}$$

The binomial distribution is useful when analyzing the results of repeated independent experiments, especially if you're interested in the probability of meeting a particular threshold given a specific error rate.
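As an illustration, the formula can be evaluated directly for every possible k and compared with SciPy's implementation. The values n = 8 and p = 0.16 are example parameters picked for this sketch.

```python
from math import comb

import numpy as np
from scipy.stats import binom

n, p = 8, 0.16
q = 1 - p

# Evaluate Pr(X = k) = C(n, k) * p^k * q^(n - k) for k = 0, 1, ..., n
pmf_manual = np.array([comb(n, k) * p**k * q**(n - k) for k in range(n + 1)])

# The same probabilities from SciPy
pmf_scipy = binom.pmf(np.arange(n + 1), n, p)

for k in range(n + 1):
    print(f"k = {k}: manual = {pmf_manual[k]:.5f}, scipy = {pmf_scipy[k]:.5f}")

print(f"Sum of all probabilities: {pmf_manual.sum():.5f}")
```

Each probability lies between 0 and 1, and together they sum to 1, as any valid probability function requires.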

Binomial Distribution Mean and Variance

The mean of a binomial distribution, denoted as E(X)=np, tells you the average number of successes you can expect if you conduct n independent trials of a binary experiment.

A binary experiment is one where there are only two outcomes: success (with probability p) or failure (with probability q = 1 − p).

$$E(X) = np \\ \text{Var}(X) = npq$$

For example, if you were to flip a coin 100 times and you define a success as the coin landing on heads (let's say the probability of heads is 0.5), the binomial distribution would tell you how likely it is to get any number of heads in those 100 flips. The mean E(X) would be 100×0.5=50, indicating that on average, you’d expect to get 50 heads.

The variance Var(X)=npq measures the spread of the distribution, indicating how much the number of successes is likely to deviate from the mean.

Continuing with the coin flip example, the variance would be 100×0.5×0.5=25, which informs you about the variability of the outcomes. A smaller variance would mean the results are more tightly clustered around the mean, whereas a larger variance indicates they’re more spread out.
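You can check these two numbers with a quick simulation. The sketch below repeats the 100-flip experiment many times (100,000 repetitions, an arbitrary choice) and compares the simulated mean and variance with np and npq.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n, p = 100, 0.5
q = 1 - p

# Each entry is the number of heads in one experiment of 100 flips
heads = rng.binomial(n=n, p=p, size=100_000)

print(f"Theoretical mean np:      {n * p}")
print(f"Simulated mean:           {heads.mean():.3f}")
print(f"Theoretical variance npq: {n * p * q}")
print(f"Simulated variance:       {heads.var():.3f}")
```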

These concepts are crucial in many fields. For instance:

Quality Control: Manufacturers might use the binomial distribution to predict the number of defective items in a batch, helping them understand the quality and consistency of their production process.
Healthcare: In medicine, it could be used to calculate the probability of a certain number of patients responding to a treatment, based on past success rates.
Finance: In finance, binomial models are used to evaluate the risk of portfolio or investment strategies by predicting the number of times an asset will reach a certain price point.
Polling and Survey Analysis: When predicting election results or customer preferences, pollsters might use the binomial distribution to estimate how many people will favor a candidate or a product, given the probability drawn from a sample.

Understanding the mean and variance of the binomial distribution is fundamental to interpreting the results and making informed decisions based on the likelihood of different outcomes.

The figure below visualizes an example of Binomial distribution where the number of independent trials is equal to 8 and the probability of success in each trial is equal to 16%.

Binomial distribution - showing number of success and probability. Image Source: LunarTech

The Python code below creates a histogram to visualize the distribution of outcomes from 1000 experiments, each consisting of 8 trials with a success probability of 0.16. It uses NumPy to generate the binomial distribution data and Matplotlib to plot the histogram, showing the probability of the number of successes in those trials.

# Random Generation of 1000 independent Binomial samples
import numpy as np
import matplotlib.pyplot as plt


n = 8
p = 0.16
N = 1000
X = np.random.binomial(n,p,N)
# Histogram of Binomial distribution

counts, bins, ignored = plt.hist(X, 20, density = True, rwidth = 0.7, color = 'purple')
plt.title("Binomial distribution with p = 0.16 n = 8")
plt.xlabel("Number of successes")
plt.ylabel("Probability")
plt.show()

Poisson Distribution

The Poisson distribution is the discrete probability distribution of the number of events occurring in a specified time period, given the average number of times the event occurs over that time period.

Let's assume a random variable X follows a Poisson distribution. Then the probability of observing k events over a time period can be expressed by the following probability function:

$$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where e is Euler’s number and λ (lambda), the arrival rate parameter, is the expected value of X. The Poisson distribution is widely used to model countable events occurring within a given time interval.

Poisson Distribution Mean and Variance

The Poisson distribution is particularly useful for modeling the number of times an event occurs within a specified time frame. The mean E(X) and variance Var(X) of a Poisson distribution are both equal to λ, which is the average rate at which events occur (also known as the rate parameter). This makes the Poisson distribution unique, as it is characterized by this single parameter.

Because the mean and variance are equal, the spread of the distribution grows with the rate: the standard deviation is √λ, so the variability relative to the mean shrinks as λ increases. The Poisson distribution is used in various fields such as business, engineering, and science for tasks like:

Predicting the number of customer arrivals at a store within an hour.
Estimating the number of emails you'd receive in a day.
Understanding the number of defects in a batch of materials.

So, the Poisson distribution helps in making probabilistic forecasts about the occurrence of rare or random events over intervals of time or space.

$$E(X) = \lambda \\ \text{Var}(X) = \lambda$$

For example, Poisson distribution can be used to model the number of customers arriving in the shop between 7 and 10 pm, or the number of patients arriving in an emergency room between 11 and 12 pm.
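A short simulation makes the "mean equals variance" property visible. This is a sketch with an assumed rate of λ = 7 and 100,000 simulated intervals.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

lambda_ = 7
counts = rng.poisson(lam=lambda_, size=100_000)

sample_mean = counts.mean()
sample_variance = counts.var()

# Standard deviation relative to the mean, which should be close to 1 / sqrt(lambda)
relative_spread = counts.std() / sample_mean

print(f"Simulated mean:     {sample_mean:.3f}")
print(f"Simulated variance: {sample_variance:.3f}")
print(f"Relative spread:    {relative_spread:.3f} (1/sqrt(lambda) = {1 / np.sqrt(lambda_):.3f})")
```

Both the simulated mean and the simulated variance should be close to 7.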

The figure below visualizes an example of Poisson distribution where we count the number of Web visitors arriving at the website and the arrival rate, lambda, is assumed to be equal to 7 visitors per unit of time.

Randomly generating from Poisson Distribution with lambda = 7. Image Source: LunarTech

In practical data analysis, it is often helpful to simulate the distribution of events. Below is a Python code snippet that demonstrates how to generate a series of data points that follow a Poisson distribution using NumPy. We then create a histogram using Matplotlib to visualize the distribution of the number of visitors (as an example) we might expect to see, based on our average rate λ = 7.

This histogram helps in understanding the distribution's shape and variability. The most likely number of visitors is around the mean λ, but the distribution shows the probability of seeing fewer or greater numbers as well.

# Random Generation of 1000 independent Poisson samples
import numpy as np
lambda_ = 7
N = 1000
X = np.random.poisson(lambda_,N)

# Histogram of Poisson distribution
import matplotlib.pyplot as plt
counts, bins, ignored = plt.hist(X, 50, density = True, color = 'purple')
plt.title("Randomly generating from Poisson Distribution with lambda = 7")
plt.xlabel("Number of visitors")
plt.ylabel("Probability")
plt.show()

Normal Distribution

The Normal probability distribution is a continuous probability distribution for a real-valued random variable. The Normal distribution, also called the Gaussian distribution, is arguably one of the most popular distribution functions and is commonly used in social and natural sciences for modeling purposes. For example, it is used to model people’s height or test scores.

Let's assume a random variable X follows a Normal distribution. Then its probability density function can be expressed as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}$$

where the parameter μ (mu) is the mean of the distribution also referred to as the location parameter, parameter σ (sigma) is the standard deviation of the distribution also referred to as the scale parameter. The number π (pi) is a mathematical constant approximately equal to 3.14.

Normal Distribution Mean and Variance

$$E(X) = \mu \\ \text{Var}(X) = \sigma^2$$

The figure below visualizes an example of Normal distribution with a mean of 0 (μ = 0) and standard deviation of 1 (σ = 1), which is referred to as the Standard Normal distribution and is symmetric.

Randomly generating 1000 obs from Normal Distribution (mu = 0, sigma = 1). Image Source: LunarTech

The visualization of the standard normal distribution is crucial because this distribution underpins many statistical methods and probability theory. When data is normally distributed with a mean ( μ ) of 0 and standard deviation (σ) of 1, it is referred to as the standard normal distribution. It's symmetric around the mean, with the shape of the curve often called the "bell curve" due to its bell-like shape.

The standard normal distribution is fundamental for the following reasons:

Central Limit Theorem: This theorem states that, under certain conditions, the sum of a large number of independent random variables will be approximately normally distributed. It allows for the use of normal probability theory for sample means and sums, even when the original data is not normally distributed.
Z-Scores: Values from any normal distribution can be transformed into the standard normal distribution using Z-scores, which indicate how many standard deviations an element is from the mean. This allows for the comparison of scores from different normal distributions.
Statistical Inference and AB Testing: Many statistical tests, such as t-tests and ANOVAs, assume that the data follows a normal distribution, or they rely on the central limit theorem. Understanding the standard normal distribution helps in the interpretation of these tests' results.
Confidence Intervals and Hypothesis Testing: The properties of the standard normal distribution are used to construct confidence intervals and to perform hypothesis testing.

All topics which we will cover below!

So, being able to visualize and understand the standard normal distribution is key to applying many statistical techniques accurately.

The Python code below uses NumPy to generate 1000 random samples from a normal distribution with a mean (μ) of 0 and a standard deviation (σ) of 1, which are standard parameters for the standard normal distribution. These generated samples are stored in the variable X.

To visualize the distribution of these samples, the code employs Matplotlib to create a histogram. The plt.hist function is used to plot the histogram of the samples with 30 bins, and the density parameter is set to True to normalize the histogram so that the area under it sums to 1. This effectively turns the histogram into a probability density plot.

Additionally, the SciPy library is used to overlay the probability density function (PDF) of the theoretical normal distribution on the histogram. The norm.pdf function generates the y-values for the PDF given an array of x-values. This theoretical curve is plotted in yellow over the histogram to show how closely the random samples fit the expected distribution.

The resulting graph displays the histogram of the generated samples in purple, with the theoretical normal distribution overlaid in yellow. The x-axis represents the range of values that the samples can take, while the y-axis represents the probability density. This visualization is a powerful tool for comparing the empirical distribution of the data with the theoretical model, allowing us to see whether our samples follow the expected pattern of a normal distribution.

# Random Generation of 1000 independent Normal samples
import numpy as np
mu = 0
sigma = 1
N = 1000
X = np.random.normal(mu,sigma,N)

# Population distribution
from scipy.stats import norm
x_values = np.arange(-5,5,0.01)
y_values = norm.pdf(x_values)
#Sample histogram with Population distribution
import matplotlib.pyplot as plt
counts, bins, ignored = plt.hist(X, 30, density = True,color = 'purple',label = 'Sampling Distribution')
plt.plot(x_values,y_values, color = 'y',linewidth = 2.5,label = 'Population Distribution')
plt.title("Randomly generating 1000 obs from Normal distribution mu = 0 sigma = 1")
plt.ylabel("Probability")
plt.legend()
plt.show()
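The Z-score transformation mentioned earlier can be sketched in a few lines. The test scores below are hypothetical values generated for this example (mean 70, standard deviation 10).

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical test scores drawn from a normal distribution
scores = rng.normal(loc=70, scale=10, size=1_000)

# Z-score: how many standard deviations each value lies from the mean
z_scores = (scores - scores.mean()) / scores.std()

print(f"Mean of the original scores: {scores.mean():.2f}")
print(f"Std of the original scores:  {scores.std():.2f}")
print(f"Mean of the z-scores:        {abs(z_scores.mean()):.2f}")
print(f"Std of the z-scores:         {z_scores.std():.2f}")
```

After the transformation, the values have a mean of 0 and a standard deviation of 1, so scores from different normal distributions can be compared on the same scale.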

Bayes' Theorem

Bayes' Theorem (often called Bayes' Law) is arguably the most powerful rule of probability and statistics. It was named after the famous English statistician and philosopher Thomas Bayes.

English mathematician and philosopher Thomas Bayes

Bayes' theorem is a powerful probability law that brings the concept of subjectivity into the world of Statistics and Mathematics where everything is about facts. It describes the probability of an event, based on the prior information of conditions that might be related to that event.

For instance, if the risk of getting Coronavirus or Covid-19 is known to increase with age, then Bayes' Theorem allows the risk to an individual of a known age to be determined more accurately. It does this by conditioning it on the age rather than simply assuming that this individual is common to the population as a whole.

The concept of conditional probability, which plays a central role in Bayes' theorem, is a measure of the probability of an event happening, given that another event has already occurred.

Bayes' theorem can be described by the following expression where the X and Y stand for events X and Y, respectively:

$$\Pr(X | Y) = \frac{\Pr(Y | X) \Pr(X)}{\Pr(Y)}$$

Pr (X|Y): the probability of event X occurring given that event or condition Y has occurred or is true
Pr (Y|X): the probability of event Y occurring given that event or condition X has occurred or is true
Pr (X) & Pr (Y): the probabilities of observing events X and Y, respectively

In the case of the earlier example, the probability of getting Coronavirus (event X) conditional on being at a certain age is Pr (X|Y). This is equal to the probability of being at a certain age given that the person got Coronavirus, Pr (Y|X), multiplied by the probability of getting Coronavirus, Pr (X), divided by the probability of being at a certain age, Pr (Y).
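Here is a small worked example of the formula. All three input probabilities are hypothetical numbers chosen only to illustrate the calculation; they are not real epidemiological figures.

```python
# Hypothetical inputs (assumptions for illustration only)
p_x = 0.05          # Pr(X): probability of getting the disease in the whole population
p_y = 0.10          # Pr(Y): probability of belonging to the age group of interest
p_y_given_x = 0.30  # Pr(Y|X): share of infected people who belong to that age group

# Bayes' theorem: Pr(X|Y) = Pr(Y|X) * Pr(X) / Pr(Y)
p_x_given_y = p_y_given_x * p_x / p_y

print(f"Pr(X)   = {p_x:.2f}")
print(f"Pr(X|Y) = {p_x_given_y:.2f}")  # Pr(X|Y) = 0.15
```

With these made-up numbers, knowing the person's age group raises the estimated risk from 5% to 15%.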

Linear Regression

Earlier, we introduced the concept of causation between variables, which happens when a variable has a direct impact on another variable.

When the relationship between two variables is linear, Linear Regression is a statistical method that can help model the impact of a unit change in one variable (the independent variable) on the values of another variable (the dependent variable).

Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables.

When the Linear Regression model is based on a single independent variable, then the model is called Simple Linear Regression. When the model is based on multiple independent variables, it’s referred to as Multiple Linear Regression.

Simple Linear Regression can be described by the following expression:

$$Y_i = \beta_0 + \beta_1X_i + u_i$$

where Y is the dependent variable, X is the independent variable which is part of the data, β0 is the intercept which is unknown and constant, β1 is the slope coefficient or a parameter corresponding to the variable X which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.

The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired ( X, Y ) data.

One example of the Linear Regression application is modeling the impact of flipper length on penguins’ body mass, which is visualized below:

Image Source: LunarTech

The R code snippet below creates a scatter plot with a linear regression line using the ggplot2 package in R, which is a powerful and widely-used library for creating graphics and visualizations. The code uses a dataset named penguins from the palmerpenguins package, which contains data about penguin species, including measurements like flipper length and body mass.

# R code for the graph
install.packages("ggplot2")
install.packages("palmerpenguins")
library(palmerpenguins)
library(ggplot2)
data(penguins)
View(penguins)
ggplot(data = penguins, aes(x = flipper_length_mm,y = body_mass_g))+
  geom_smooth(method = "lm", se = FALSE, color = 'purple')+
  geom_point()+
  labs(x="Flipper Length (mm)",y="Body Mass (g)")

Multiple Linear Regression with three independent variables can be described by the following expression:

$$Y_i = \beta_0 + \beta_1X_{1,i} + \beta_2X_{2,i} + \beta_3X_{3,i} + u_i$$

Ordinary Least Squares

Ordinary least squares (OLS) is a method for estimating the unknown parameters, such as β0 and β1, in a linear regression model. The method is based on the principle of least squares: it minimizes the sum of the squares of the differences between the observed values of the dependent variable and the values predicted by the linear function of the independent variable (often referred to as fitted values).

This difference between the real and predicted values of dependent variable Y is referred to as residual. So OLS minimizes the sum of squared residuals.

This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1 which are also known as coefficient estimates:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$$

Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1X_i$$
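The closed-form estimates above translate almost line by line into NumPy. The sketch below uses synthetic data with assumed true values β0 = 2 and β1 = 3, and cross-checks the result against np.polyfit, which fits the same least-squares line.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Synthetic data: Y = 2 + 3X + u (true parameters are assumptions for this example)
N = 200
X = rng.uniform(0, 10, size=N)
u = rng.normal(0, 1, size=N)
Y = 2 + 3 * X + u

# OLS estimates from the closed-form expressions
X_bar, Y_bar = X.mean(), Y.mean()
beta_1_hat = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar) ** 2)
beta_0_hat = Y_bar - beta_1_hat * X_bar

# Fitted values
Y_hat = beta_0_hat + beta_1_hat * X

# Cross-check: np.polyfit with deg=1 returns [slope, intercept]
slope, intercept = np.polyfit(X, Y, deg=1)

print(f"beta_0_hat = {beta_0_hat:.3f} (polyfit: {intercept:.3f})")
print(f"beta_1_hat = {beta_1_hat:.3f} (polyfit: {slope:.3f})")
```

The estimates should land near 2 and 3, but not exactly on them, because of the random error term u.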

Standard Error

The residuals or the estimated error terms can be determined as follows:

$$\hat{u}_i = Y_i - \hat{Y}_i$$

It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data. OLS produces an estimate of the error term for each observation (the residual), not the actual error term. So, the true error variance is still unknown.

Also, these estimates are subject to sampling uncertainty. This means that we will never be able to determine the exact estimate, the true value, of these parameters from sample data in an empirical application. But we can estimate it by calculating the sample residual variance by using the residuals as follows:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} \hat{u}_i^2}{N - 2}$$

This estimate for the variance of sample residuals helps us estimate the variance of the estimated parameters, which is often expressed as follows:

$$\text{Var}(\hat{\beta})$$

The square root of this variance term is called the standard error of the estimate. This is a key component in assessing the accuracy of the parameter estimates. It is used to calculate test statistics and confidence intervals.

The standard error can be expressed as follows:

$$SE(\hat{\beta}) = \sqrt{\text{Var}(\hat{\beta})}$$
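For Simple Linear Regression, the standard errors can be computed directly from the residuals. The sketch below uses the standard textbook expressions for the variance of the intercept and slope (these specific formulas are not written out in this article, so treat them as an assumption) and checks them against the matrix form of the estimator's variance. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(seed=11)

# Synthetic data with assumed true parameters beta_0 = 1.0 and beta_1 = 0.5
N = 100
X = rng.uniform(0, 10, size=N)
Y = 1.0 + 0.5 * X + rng.normal(0, 2, size=N)

# OLS estimates
X_bar = X.mean()
Sxx = np.sum((X - X_bar) ** 2)
beta_1_hat = np.sum((X - X_bar) * (Y - Y.mean())) / Sxx
beta_0_hat = Y.mean() - beta_1_hat * X_bar

# Residuals and sample residual variance (divide by N - 2)
residuals = Y - (beta_0_hat + beta_1_hat * X)
sigma_squared_hat = np.sum(residuals ** 2) / (N - 2)

# Standard errors of the slope and intercept
se_beta_1 = np.sqrt(sigma_squared_hat / Sxx)
se_beta_0 = np.sqrt(sigma_squared_hat * np.sum(X ** 2) / (N * Sxx))

# Cross-check with the matrix form: Var(beta_hat) = sigma^2 * (X'X)^-1
design = np.column_stack([np.ones(N), X])
var_beta_hat = sigma_squared_hat * np.linalg.inv(design.T @ design)
se_matrix = np.sqrt(np.diag(var_beta_hat))

print(f"SE(beta_0_hat) = {se_beta_0:.4f} (matrix form: {se_matrix[0]:.4f})")
print(f"SE(beta_1_hat) = {se_beta_1:.4f} (matrix form: {se_matrix[1]:.4f})")
```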

OLS Assumptions

The OLS estimation method makes the following assumptions which need to be satisfied to get reliable prediction results:

The Linearity assumption states that the model is linear in parameters.
The Random Sample assumption states that all observations in the sample are randomly selected.
The Exogeneity assumption states that independent variables are uncorrelated with the error terms.
The Homoskedasticity assumption states that the variance of all error terms is constant.
The No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.

The Python code snippet below performs Ordinary Least Squares (OLS) regression, which is a method used in statistics to estimate the relationship between independent variables and a dependent variable. This process involves calculating the best-fit line through the data points that minimizes the sum of the squared differences between the observed values and the values predicted by the model.

The code defines a function runOLS(Y, X) that takes in a dependent variable Y and an independent variable X and performs the following steps:

Estimates the OLS coefficients (beta_hat) using the linear algebra solution to the least squares problem.
Makes predictions (Y_hat) using the estimated coefficients and calculates the residuals.
Computes the residual sum of squares (RSS), total sum of squares (TSS), mean squared error (MSE), root mean squared error (RMSE), and R-squared value, which are common metrics used to assess the fit of the model.
Calculates the standard error of the coefficient estimates, t-statistics, p-values, and confidence intervals for the estimated coefficients.

These calculations are standard in regression analysis and are used to interpret and understand the strength and significance of the relationship between the variables. The result of this function includes the estimated coefficients and various statistics that help evaluate the model's performance.

import numpy as np
from scipy.stats import t

def runOLS(Y,X):
    # Y: dependent variable, reshaped below to an (N, 1) column vector
    # X: (N, k) design matrix, including a column of ones for the intercept
    Y = np.asarray(Y, dtype=float).reshape(-1, 1)
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    df = N - k  # degrees of freedom; equals N-2 for Simple Linear Regression

    # OLS estimation Y = Xb + e --> beta_hat = (X'X)^-1(X'Y)
    beta_hat = np.dot(np.linalg.inv(np.dot(np.transpose(X), X)), np.dot(np.transpose(X), Y))

    # OLS prediction
    Y_hat = np.dot(X,beta_hat)
    residuals = Y-Y_hat
    RSS = np.sum(np.square(residuals))
    sigma_squared_hat = RSS/df
    TSS = np.sum(np.square(Y-Y.mean()))
    MSE = sigma_squared_hat
    RMSE = np.sqrt(MSE)
    R_squared = (TSS-RSS)/TSS

    # Standard error of estimates: square root of estimate's variance
    var_beta_hat = np.linalg.inv(np.dot(np.transpose(X),X))*sigma_squared_hat

    SE = []
    t_stats = []
    p_values = []
    CI_s = []

    for i in range(len(beta_hat)):
        # standard errors
        SE_i = np.sqrt(var_beta_hat[i,i])
        SE.append(np.round(SE_i,3))

        # t-statistics
        t_stat = np.round(beta_hat[i,0]/SE_i,3)
        t_stats.append(t_stat)

        # two-sided p-value of the t-statistic: P(|T| >= |t_stat|)
        p_value = t.sf(np.abs(t_stat),df) * 2
        p_values.append(np.round(p_value,3))

        # Confidence intervals = beta_hat -+ margin_of_error
        t_critical = t.ppf(q =1-0.05/2, df = df)
        margin_of_error = t_critical*SE_i
        CI = [np.round(beta_hat[i,0]-margin_of_error,3), np.round(beta_hat[i,0]+margin_of_error,3)]
        CI_s.append(CI)

    # return after the loop so that every coefficient is processed
    return(beta_hat, SE, t_stats, p_values,CI_s,
           MSE, RMSE, R_squared)

Parameter Properties

Under the assumption that the OLS criteria/assumptions we just discussed are satisfied, the OLS estimators of coefficients β0 and β1 are BLUE and Consistent. So what does this mean?

Gauss-Markov Theorem

This theorem highlights the properties of OLS estimates where the term BLUE stands for Best Linear Unbiased Estimator. Let's explore what this means in more detail.

Bias

The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated. It can be expressed as follows:

$$\text{Bias}(\beta, \hat{\beta}) = E(\hat{\beta}) - \beta$$

When we state that the estimator is unbiased, we mean that the bias is equal to zero. This implies that the expected value of the estimator is equal to the true parameter value, that is:

$$E(\hat{\beta}) = \beta$$

Unbiasedness does not guarantee that the estimate obtained from any particular sample is equal or close to β. What it means is that, if we repeatedly draw random samples from the population and compute the estimate each time, then the average of these estimates would be equal or very close to β.
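The repeated-sampling idea can be sketched with a small simulation. The true parameters (β0 = 2, β1 = 3), the sample size of 30, and the 2,000 repetitions are all assumptions made for this illustration, and ols_slope is a helper defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
true_beta_0, true_beta_1 = 2.0, 3.0

def ols_slope(X, Y):
    # Closed-form OLS estimate of the slope coefficient
    return np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Draw 2,000 random samples of 30 observations and estimate the slope each time
slopes = []
for _ in range(2_000):
    X = rng.uniform(0, 10, size=30)
    Y = true_beta_0 + true_beta_1 * X + rng.normal(0, 5, size=30)
    slopes.append(ols_slope(X, Y))
slopes = np.array(slopes)

print(f"Smallest estimate:    {slopes.min():.3f}")
print(f"Largest estimate:     {slopes.max():.3f}")
print(f"Average of estimates: {slopes.mean():.3f}")
```

Individual estimates vary quite a bit from sample to sample, but their average sits very close to the true slope of 3.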

Efficiency

The term Best in the Gauss-Markov theorem relates to the variance of the estimator and is referred to as efficiency. A parameter can have multiple estimators but the one with the lowest variance is called efficient.

Consistency

The term consistency goes hand in hand with the terms sample size and convergence. If the estimator converges to the true parameter as the sample size becomes very large, then this estimator is said to be consistent, that is:

$$N \to \infty \text{ then } \hat{\beta} \to \beta$$
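As a rough illustration of this convergence, the sketch below measures the average absolute error of the OLS slope estimate at increasing sample sizes. The data-generating process, the helper `ols_slope`, and the chosen sample sizes are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true = 2.0, 0.5    # hypothetical "true" parameters

def ols_slope(n_obs):
    """Draw one synthetic sample of size n_obs and return the OLS slope."""
    x = rng.uniform(0, 10, size=n_obs)
    y = beta0_true + beta1_true * x + rng.normal(0, 1, size=n_obs)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

sample_sizes = [10, 100, 1_000, 10_000]
abs_errors = {}
for n in sample_sizes:
    # average absolute estimation error over 200 repeated samples of size n
    errors = [abs(ols_slope(n) - beta1_true) for _ in range(200)]
    abs_errors[n] = float(np.mean(errors))
    print(f"N = {n:>6}: mean absolute error = {abs_errors[n]:.4f}")
```

The error should shrink steadily as N grows, which is the behavior that consistency describes.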

All these properties hold for OLS estimates as summarized in the Gauss-Markov theorem. In other words, OLS estimates have the smallest variance, they are unbiased, linear in parameters, and are consistent. These properties can be mathematically proven by using the OLS assumptions made earlier.

Confidence Intervals

The Confidence Interval is the range that contains the true population parameter with a certain pre-specified probability. This is referred to as the confidence level of the experiment, and it's obtained by using the sample results and the margin of error.

Margin of Error

The margin of error is the difference between the sample result and what the result would have been if you had used the entire population.

Confidence Level

The Confidence Level describes the level of certainty in the experimental results. For example, a 95% confidence level means that if you were to repeat the same experiment 100 times, you would expect the confidence interval from roughly 95 of those 100 trials to contain the true population parameter.

Note that the confidence level is defined before the start of the experiment because it will affect how big the margin of error will be at the end of the experiment.
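One way to make the confidence level concrete is to simulate many repeated experiments and count how often the interval contains the true value. The sketch below does this for the mean of a Normal population; the population values, sample size, and number of experiments are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
true_mean, true_sd = 10.0, 3.0       # hypothetical population values
n_obs, n_experiments = 40, 10_000
confidence_level = 0.95

# Critical value fixed before the "experiments" are run
t_critical = t.ppf(1 - (1 - confidence_level) / 2, df=n_obs - 1)

samples = rng.normal(true_mean, true_sd, size=(n_experiments, n_obs))
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)
lower = means - t_critical * std_errors
upper = means + t_critical * std_errors

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Share of intervals that contain the true mean:", coverage)
```

With these settings the share should come out close to 0.95, and choosing a higher confidence level would widen every interval.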

Confidence Interval for OLS Estimates

As I mentioned earlier, the OLS estimates of the Simple Linear Regression, the estimates for intercept β0 and slope coefficient β1, are subject to sampling uncertainty. But we can construct Confidence Intervals (CIs) for these parameters which will contain the true value of these parameters in 95% of all samples.

That is, 95% confidence interval for β can be interpreted as follows:

The confidence interval is the set of values for which a hypothesis test cannot be rejected at the 5% significance level.
The confidence interval has a 95% chance to contain the true value of β.

95% confidence interval of OLS estimates can be constructed as follows:

$$CI_{0.95}^{\beta} = \left[\hat{\beta}_i - 1.96 \, SE(\hat{\beta}_i), \; \hat{\beta}_i + 1.96 \, SE(\hat{\beta}_i)\right]$$

This is based on the parameter estimate, the standard error of that estimate, and the value 1.96, which is the critical value corresponding to the 5% rejection rule. Multiplying 1.96 by the standard error gives the margin of error.

This value is determined using the Normal Distribution table, which we'll discuss later on in this handbook.

Meanwhile, the following figure illustrates the idea of 95% CI:

Image Source: LunarTech

Note that the confidence interval depends on the sample size as well, given that it is calculated using the standard error which is based on sample size.
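The interval formula above can be sketched in a few lines. The data here is synthetic: the variable ranges and the "true" coefficients are assumptions loosely inspired by the penguin example, not values from the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs = 200
x = rng.uniform(170, 230, size=n_obs)                 # synthetic "flipper length"
y = -5000 + 50 * x + rng.normal(0, 400, size=n_obs)   # synthetic "body mass"

# OLS estimates for simple linear regression
x_centered = x - x.mean()
beta1_hat = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Standard error of the slope from the residual variance (N - 2 degrees of freedom)
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n_obs - 2)
se_beta1 = np.sqrt(sigma2_hat / np.sum(x_centered ** 2))

# 95% confidence interval using the Normal critical value 1.96
ci_lower = beta1_hat - 1.96 * se_beta1
ci_upper = beta1_hat + 1.96 * se_beta1
print(f"slope estimate: {beta1_hat:.2f}, standard error: {se_beta1:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

Because the standard error shrinks as the sample grows, rerunning the sketch with a larger `n_obs` should produce a narrower interval.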

Statistical Hypothesis Testing

Testing a hypothesis in Statistics is a way to test the results of an experiment or survey to determine how meaningful the results are.

Basically, you're testing whether the obtained results are valid by figuring out the odds that the results have occurred by chance. If it is the latter, then the results are not reliable and neither is the experiment. Hypothesis Testing is part of Statistical Inference.

Null and Alternative Hypothesis

Firstly, you need to determine the thesis you wish to test. Then you need to formulate the Null Hypothesis and the Alternative Hypothesis. The test can have two possible outcomes. Based on the statistical results, you can either reject the Null Hypothesis or fail to reject it.

As a rule of thumb, statisticians tend to put the version or formulation of the hypothesis that needs to be rejected under the Null Hypothesis, whereas the acceptable and desired version is stated under the Alternative Hypothesis.

Statistical Significance

Let’s look at the earlier mentioned example where we used the Linear Regression model to investigate whether a penguin's Flipper Length, the independent variable, has an impact on Body Mass, the dependent variable.

We can formulate this model with the following statistical expression:

$$Y_{\text{BodyMass}} = \beta_0 + \beta_1X_{\text{FlipperLength}} + u_i$$

Then, once the OLS estimates of the coefficients are estimated, we can formulate the following Null and Alternative Hypothesis to test whether the Flipper Length has a statistically significant impact on the Body Mass:

where H0 and H1 represent Null Hypothesis and Alternative Hypothesis, respectively.

Rejecting the Null Hypothesis would mean that a one-unit increase in Flipper Length has a direct impact on the Body Mass (given that the parameter estimate of β1 is describing this impact of the independent variable, Flipper Length, on the dependent variable, Body Mass). We can reformulate this hypothesis as follows:

$$\begin{cases} H_0: \hat{\beta}_1 = 0 \\ H_1: \hat{\beta}_1 \neq 0 \end{cases}$$

where H0 states that the parameter estimate of β1 is equal to 0, that is, the effect of Flipper Length on Body Mass is statistically insignificant, whereas H1 states that the parameter estimate of β1 is not equal to 0, suggesting that the effect of Flipper Length on Body Mass is statistically significant.

Type I and Type II Errors

When performing Statistical Hypothesis Testing, you need to consider two conceptual types of errors: Type I error and Type II error.

Type I errors occur when the Null is incorrectly rejected, and Type II errors occur when the Null Hypothesis is incorrectly not rejected. A confusion matrix can help you clearly visualize the severity of these two types of errors.
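Both error rates can be estimated by simulation. The sketch below is a minimal illustration under stated assumptions: it tests H0 "the mean is 0" with a two-sided Z-test, assumes a known standard deviation of 1, and uses a hypothetical true mean of 0.3 for the case where the Null is false.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, n_obs, n_trials = 0.05, 50, 20_000
z_critical = norm.ppf(1 - alpha / 2)

def rejection_rate(true_mean):
    """Share of simulated trials in which H0: mean = 0 is rejected."""
    samples = rng.normal(true_mean, 1.0, size=(n_trials, n_obs))
    z_stats = samples.mean(axis=1) / (1.0 / np.sqrt(n_obs))
    return np.mean(np.abs(z_stats) > z_critical)

# Type I error: H0 is true (mean really is 0) but gets rejected
type_1_rate = rejection_rate(true_mean=0.0)
# Type II error: H0 is false (mean is 0.3) but is not rejected
type_2_rate = 1 - rejection_rate(true_mean=0.3)

print("Estimated Type I error rate:", type_1_rate)
print("Estimated Type II error rate:", type_2_rate)
```

The Type I rate should sit near the chosen significance level, while the Type II rate depends on the effect size and the sample size.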


Statistical Tests

Once you've stated the Null and the Alternative Hypotheses and defined the test assumptions, the next step is to determine which statistical test is appropriate and to calculate the test statistic.

You can decide whether or not to reject the Null by comparing the test statistic with the critical value. This comparison shows whether or not the observed test statistic is more extreme than the defined critical value.

It can have two possible results:

The test statistic is more extreme than the critical value → the null hypothesis can be rejected
The test statistic is not as extreme as the critical value → the null hypothesis cannot be rejected

The critical value is based on a pre-specified significance level α (usually chosen to be equal to 5%) and the type of probability distribution the test statistic follows.

The critical value divides the area under this probability distribution curve into the rejection region(s) and non-rejection region. There are numerous statistical tests used to test various hypotheses. Examples of Statistical tests are Student’s t-test, F-test, Chi-squared test, Durbin-Hausman-Wu Endogeneity test, White Heteroskedasticity test. In this handbook, we will look at two of these statistical tests: the Student's t-test and the F-test.
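Critical values can be looked up programmatically instead of in a printed table. The sketch below uses SciPy's quantile functions; the degrees of freedom and the helper `reject_null` are hypothetical choices for illustration.

```python
from scipy.stats import norm, t, f

alpha = 0.05  # pre-specified significance level

# Two-sided tests split alpha across both tails
z_two_sided = norm.ppf(1 - alpha / 2)
t_two_sided = t.ppf(1 - alpha / 2, df=20)          # df=20 is an arbitrary example

# Right-tailed tests put all of alpha in the upper tail
t_right_tailed = t.ppf(1 - alpha, df=20)
f_right_tailed = f.ppf(1 - alpha, dfn=3, dfd=100)  # arbitrary example df

def reject_null(test_stat, critical_value, two_sided=True):
    """Return True when the statistic is more extreme than the critical value."""
    if two_sided:
        return abs(test_stat) > critical_value
    return test_stat > critical_value

print(round(z_two_sided, 3), round(t_two_sided, 3), round(t_right_tailed, 3))
print(reject_null(2.5, t_two_sided), reject_null(1.0, t_two_sided))
```

Note how the t critical value is larger than the Normal one for the same alpha, reflecting the heavier tails of the t distribution at small sample sizes.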

Student’s t-test

One of the simplest and most popular statistical tests is the Student’s t-test. You can use it to test various hypotheses, especially when dealing with a hypothesis where the main area of interest is to find evidence for the statistically significant effect of a single variable.

The test statistic of the t-test follows the Student’s t distribution and can be determined as follows:

$$T_{\text{stat}} = \frac{\hat{\beta}_i - h_0}{SE(\hat{\beta}_i)}$$

where h0 in the numerator is the value against which the parameter estimate is being tested. So, the t-test statistic is equal to the parameter estimate minus the hypothesized value, divided by the standard error of the coefficient estimate.

Let's use this for our earlier hypothesis, where we wanted to test whether Flipper Length has a statistically significant impact on Body Mass or not. This test can be performed using a t-test. In that case, h0 is equal to 0, since the slope coefficient estimate is tested against a value of 0.
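A minimal sketch of this slope test follows. It uses synthetic data rather than the real penguin measurements (the ranges and coefficients are assumptions), and relies on `scipy.stats.linregress` for the slope estimate and its standard error.

```python
import numpy as np
from scipy.stats import t, linregress

rng = np.random.default_rng(11)
n_obs = 100
flipper_length = rng.uniform(170, 230, size=n_obs)                         # synthetic
body_mass = -5000 + 50 * flipper_length + rng.normal(0, 400, size=n_obs)   # synthetic

fit = linregress(flipper_length, body_mass)

h0 = 0.0                                    # hypothesized value of the slope
t_stat = (fit.slope - h0) / fit.stderr      # estimate minus h0, over its SE
p_value = 2 * t.sf(np.abs(t_stat), df=n_obs - 2)   # two-sided p-value
t_critical = t.ppf(1 - 0.05 / 2, df=n_obs - 2)

print("t-statistic:", t_stat)
print("critical value:", t_critical)
print("p-value:", p_value)
```

When the absolute t-statistic exceeds the critical value, the Null of a zero slope is rejected at the 5% level.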

Two-sided vs one-sided t-test

There are two versions of the t-test: a two-sided t-test and a one-sided t-test. Whether you need the former or the latter version of the test depends entirely on the hypothesis that you want to test.

You can use the two-sided or two-tailed t-test when the hypothesis tests an equal versus not-equal relationship under the Null and Alternative Hypotheses. It would be similar to the following example:

$$\begin{cases} H_0: \hat{\beta}_1 = h_0 \\ H_1: \hat{\beta}_1 \neq h_0 \end{cases}$$

The two-sided t-test has two rejection regions as visualized in the figure below:

Image Source: [Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin](https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Hypothesis-Tests/Introduction-to-Hypothesis-Testing/Critical-Value-and-the-p-Value-Approach/index.html)

On the other hand, you can use the one-sided or one-tailed t-test when the hypothesis is testing positive/negative versus negative/positive relationships under the Null and Alternative Hypotheses. It looks like this:

Left-tailed vs right-tailed

The one-sided t-test has a single rejection region. Depending on the direction of the hypothesis, the rejection region is either on the left-hand side or the right-hand side, as visualized in the figure below:

Image Source: [Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin](https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Hypothesis-Tests/Introduction-to-Hypothesis-Testing/Critical-Value-and-the-p-Value-Approach/index.html)

F-test

The F-test is another very popular statistical test, often used to test the joint statistical significance of multiple variables. This is the case when you want to test whether multiple independent variables have a statistically significant impact on a dependent variable.

Following is an example of a statistical hypothesis that you can test using the F-test:

$$\begin{cases} H_0: \hat{\beta}_1 = \hat{\beta}_2 = \hat{\beta}_3 = 0 \\ H_1: \hat{\beta}_j \neq 0 \text{ for at least one } j \in \{1, 2, 3\} \end{cases}$$
where the Null states that the three variables corresponding to these coefficients are jointly statistically insignificant, and the Alternative states that at least one of these coefficients is different from zero, meaning the three variables are jointly statistically significant.

The test statistic of the F-test follows the F distribution and can be determined as follows:

$$F_{\text{stat}} = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}}) / q}{SSR_{\text{unrestricted}} / (N - k_{\text{unrestricted}} - 1)}$$
where:

SSR_restricted is the sum of squared residuals of the restricted model, which is the same model without the variables stated as insignificant under the Null

SSR_unrestricted is the sum of squared residuals of the unrestricted model, which is the model that includes all variables

q is the number of variables that are being jointly tested for insignificance under the Null

N is the sample size

k is the total number of variables in the unrestricted model.

SSR values are provided next to the parameter estimates after running the OLS regression, and the same holds for the F-statistics as well.
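To see how the pieces of the F-statistic fit together, here is a minimal sketch that computes it by hand. The data is synthetic, the coefficients are assumptions, and the helper `ssr` is a hypothetical name introduced for this example; it fits OLS with `numpy.linalg.lstsq`.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(5)
n_obs = 200
X = rng.normal(size=(n_obs, 3))                      # three synthetic regressors
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(size=n_obs)

def ssr(design, target):
    """Sum of squared residuals of an OLS fit of target on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

intercept = np.ones((n_obs, 1))
ssr_unrestricted = ssr(np.hstack([intercept, X]), y)  # all three variables
ssr_restricted = ssr(intercept, y)                    # variables under H0 removed

q = 3   # number of coefficients tested jointly
k = 3   # number of variables in the unrestricted model
f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n_obs - k - 1))
p_value = f.sf(f_stat, q, n_obs - k - 1)              # right-tail probability

print("F-statistic:", f_stat)
print("p-value:", p_value)
```

The restricted model can never fit better than the unrestricted one, so the numerator is non-negative and only large values of the statistic count as evidence against the Null.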

The following is an example of MLR model output where the SSR and F-statistics values are marked.

Image Source: [Stock and Watson](https://www.uio.no/studier/emner/sv/oekonomi/ECON4150/v18/lecture7_ols_multiple_regressors_hypothesis_tests.pdf)

The F-test has a single rejection region as visualized below:

Image Source: [U of Michigan](https://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/)

2-sample T-test

If you want to test whether there is a statistically significant difference between the control and experimental groups’ metrics that are in the form of averages (for example, average purchase amount), the metric follows a Student's t-distribution, and the sample size is smaller than 30, you can use a 2-sample T-test to test the following hypothesis:

$$\begin{cases} H_0: \mu_{\text{con}} = \mu_{\text{exp}} \\ H_1: \mu_{\text{con}} \neq \mu_{\text{exp}} \end{cases}$$
$$\begin{cases} H_0: \mu_{\text{con}} - \mu_{\text{exp}} = 0 \\ H_1: \mu_{\text{con}} - \mu_{\text{exp}} \neq 0 \end{cases}$$
where the sampling distribution of means of Control group follows Student-t distribution with degrees of freedom N_con-1. Also, the sampling distribution of means of the Experimental group also follows the Student-t distribution with degrees of freedom N_exp-1.

Note that the N_con and N_exp are the number of users in the Control and Experimental groups, respectively.

$$\hat{\mu}_{\text{con}} \sim t(N_{\text{con}} - 1)$$
$$\hat{\mu}_{\text{exp}} \sim t(N_{\text{exp}} - 1)$$
Then you can calculate an estimate for the pooled variance of the two samples as follows:

$$\hat{S}^2_{\text{pooled}} = \frac{(N_{\text{con}} - 1) * \sigma^2_{\text{con}} + (N_{\text{exp}} - 1) * \sigma^2_{\text{exp}}}{N_{\text{con}} + N_{\text{exp}} - 2} * \left(\frac{1}{N_{\text{con}}} + \frac{1}{N_{\text{exp}}}\right)$$
where σ²_con and σ²_exp are the sample variances of the Control and Experimental groups, respectively. Then the Standard Error is equal to the square root of the estimate of the pooled variance, and can be defined as:

$$SE = \sqrt{\hat{S}^2_{\text{pooled}}}$$
Consequently, the test statistic of the 2-sample T-test with the hypothesis stated earlier can be calculated as follows:

$$T = \frac{\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}}{\sqrt{\hat{S}^2_{\text{pooled}}}}$$
In order to test the statistical significance of the observed difference between sample means, we need to calculate the p-value of our test statistic.

The p-value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the null hypothesis is true. In other words, it tells you how likely such a result would be if it were due to random chance alone.

Then the p-value of the test statistic can be calculated as follows:

$$p_{\text{value}} = \Pr[t \leq -|T| \text{ or } t \geq |T|]$$
$$= 2 * \Pr[t \geq |T|]$$
The interpretation of a p-value is dependent on the chosen significance level, alpha, which you choose before running the test during the power analysis.

If the calculated p-value is smaller than or equal to alpha (for example, 0.05 for a 5% significance level), we can reject the null hypothesis and state that there is a statistically significant difference between the primary metrics of the Control and Experimental groups.

Finally, to determine how accurate the obtained results are and also to comment about the practical significance of the obtained results, you can compute the Confidence Interval of your test by using the following formula:

$$CI = \left[ (\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}) - t_{1-\frac{\alpha}{2}} * SE(\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}), (\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}) + t_{1-\frac{\alpha}{2}} * SE(\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}) \right]$$
where t_(1-alpha/2) is the critical value of the test corresponding to the two-sided t-test with alpha significance level. It can be found using the t-table.

The Python code below performs a two-sample t-test, which is used in statistics to determine if two sets of data are significantly different from each other. This particular snippet simulates two groups (control and experimental) with data following a t-distribution, calculates the mean and variance for each group, and then performs the following steps:

It calculates the pooled variance, which is a weighted average of the variances of the two groups.

It computes the standard error of the difference between the two means.

It calculates the t-statistic, which is the difference between the two sample means divided by the standard error. This statistic measures how much the groups differ in units of standard error.

It determines the critical t-value from the t-distribution for the given significance level and degrees of freedom, which is used to decide whether the t-statistic is large enough to indicate a statistically significant difference between the groups.

It calculates the p-value, which indicates the probability of observing such a difference between means if the null hypothesis (that there is no difference) is true.

It computes the margin of error and constructs the confidence interval around the difference in means.

Finally, the code prints out the t-statistic, critical t-value, p-value, and confidence interval. These results can be used to infer whether the observed differences in means are statistically significant or likely due to random variation.

    import numpy as np
    from scipy.stats import t

    N_con = 20
    df_con = N_con - 1  # degrees of freedom of Control
    N_exp = 20
    df_exp = N_exp - 1  # degrees of freedom of Experimental

    # Significance level
    alpha = 0.05

    # data of control group with t-distribution
    X_con = np.random.standard_t(df_con, N_con)
    # data of experimental group with t-distribution
    X_exp = np.random.standard_t(df_exp, N_exp)

    # mean of control
    mu_con = np.mean(X_con)
    # mean of experimental
    mu_exp = np.mean(X_exp)

    # sample variance of control (ddof=1 matches the N-1 weights used below)
    sigma_sqr_con = np.var(X_con, ddof=1)
    # sample variance of experimental
    sigma_sqr_exp = np.var(X_exp, ddof=1)

    # pooled variance, scaled by (1/N_con + 1/N_exp)
    pooled_variance_t_test = ((N_con-1)*sigma_sqr_con + (N_exp-1)*sigma_sqr_exp)/(N_con + N_exp-2)*(1/N_con + 1/N_exp)

    # Standard Error
    SE = np.sqrt(pooled_variance_t_test)

    # Test Statistic
    T = (mu_con-mu_exp)/SE

    # Critical value for two sided 2 sample t-test
    t_crit = t.ppf(1-alpha/2, N_con + N_exp - 2)

    # P-value of the two sided T-test using the symmetry of the t-distribution;
    # abs(T) keeps the result valid when T is negative
    p_value = t.sf(np.abs(T), N_con + N_exp - 2)*2

    # Margin of Error
    margin_error = t_crit * SE
    # Confidence Interval
    CI = [(mu_con-mu_exp) - margin_error, (mu_con-mu_exp) + margin_error]

    print("T-score: ", T)
    print("T-critical: ", t_crit)
    print("P_value: ", p_value)
    print("Confidence Interval of 2 sample T-test: ", np.round(CI,2))
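As a sanity check, the manual calculation can be compared against SciPy's built-in `scipy.stats.ttest_ind` with `equal_var=True`, which performs the same pooled-variance test. This sketch assumes sample variances computed with `ddof=1` and uses its own seeded synthetic data, so the numbers will differ from the snippet above.

```python
import numpy as np
from scipy.stats import t, ttest_ind

rng = np.random.default_rng(2)
X_con = rng.standard_t(df=19, size=20)   # synthetic control data
X_exp = rng.standard_t(df=19, size=20)   # synthetic experimental data
N_con, N_exp = len(X_con), len(X_exp)

# Manual pooled-variance calculation, using sample variances (ddof=1)
pooled = ((N_con - 1) * np.var(X_con, ddof=1)
          + (N_exp - 1) * np.var(X_exp, ddof=1)) / (N_con + N_exp - 2)
SE = np.sqrt(pooled * (1 / N_con + 1 / N_exp))
T = (X_con.mean() - X_exp.mean()) / SE
p_value = 2 * t.sf(np.abs(T), df=N_con + N_exp - 2)

# Library version of the same test
result = ttest_ind(X_con, X_exp, equal_var=True)

print("manual:", T, p_value)
print("scipy: ", result.statistic, result.pvalue)
```

If the two lines of output agree, the manual formula is implemented consistently with the library's pooled-variance test.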

2-sample Z-test

There are various situations when you may want to use a 2-sample z-test:

if you want to test whether there is a statistically significant difference between the control and experimental groups’ metrics that are in the form of averages (for example, average purchase amount) or proportions (for example, Click Through Rate)

if the metric follows Normal distribution

when the sample size is larger than 30, such that you can use the Central Limit Theorem (CLT) to state that the sampling distributions of the Control and Experimental groups are asymptotically Normal

Here we will make a distinction between two cases: where the primary metric is in the form of proportions (like Click Through Rate) and where the primary metric is in the form of averages (like average purchase amount).

Case 1: Z-test for comparing proportions (2-sided)

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of proportions (like CTR) and if the click event occurs independently, you can use a 2-sample Z-test to test the following hypothesis:

$$\begin{cases} H_0: p_{\text{con}} = p_{\text{exp}} \\ H_1: p_{\text{con}} \neq p_{\text{exp}} \end{cases}$$
$$\begin{cases} H_0: p_{\text{con}} - p_{\text{exp}} = 0 \\ H_1: p_{\text{con}} - p_{\text{exp}} \neq 0 \end{cases}$$
where each click event can be described by a random variable that can take two possible values: 1 (success) and 0 (failure). It also follows a Bernoulli distribution (click: success and no click: failure) where p_con and p_exp are the probabilities of clicking (probability of success) of Control and Experimental groups, respectively.

So, after collecting the interaction data of the Control and Experimental users, you can calculate the estimates of these two probabilities as follows:

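$$\hat{p}_{\text{con}} = \frac{X_{\text{con}}}{N_{\text{con}}} = \frac{\#\text{clicks}_{\text{con}}}{\#\text{impressions}_{\text{con}}}$$
$$\hat{p}_{\text{exp}} = \frac{X_{\text{exp}}}{N_{\text{exp}}} = \frac{\#\text{clicks}_{\text{exp}}}{\#\text{impressions}_{\text{exp}}}$$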
Since we are testing for the difference in these probabilities, we need to obtain an estimate for the pooled probability of success and an estimate for pooled variance, which can be done as follows:

$$\hat{p}_{\text{pooled}} = \frac{X_{\text{con}} + X_{\text{exp}}}{N_{\text{con}} + N_{\text{exp}}} = \frac{\#\text{clicks}_{\text{con}} + \#\text{clicks}_{\text{exp}}}{\#\text{impressions}_{\text{con}} + \#\text{impressions}_{\text{exp}}}$$
$$\hat{S}^2_{\text{pooled}} = \hat{p}_{\text{pooled}}(1 - \hat{p}_{\text{pooled}}) * \left(\frac{1}{N_{\text{con}}} + \frac{1}{N_{\text{exp}}}\right)$$
Then the Standard Error is equal to the square root of the estimate of the pooled variance. It can be defined as:

$$SE = \sqrt{\hat{S}^2_{\text{pooled}}}$$
And so, the test statistic of the 2-sample Z-test for the difference in proportions can be calculated as follows:

$$Z = \frac{(\hat{p}_{\text{con}} - \hat{p}_{\text{exp}})}{SE}$$
Then the p-value of this test statistic can be calculated as follows, where z denotes a standard Normal random variable:

$$p_{\text{value}} = \Pr[z \leq -|Z| \text{ or } z \geq |Z|]$$
$$= 2 * \Pr[z \geq |Z|]$$
Finally, you can compute the Confidence Interval of the test as follows:

$$CI = \left[ (\hat{p}_{\text{con}} - \hat{p}_{\text{exp}}) - z_{1-\frac{\alpha}{2}} * SE, (\hat{p}_{\text{con}} - \hat{p}_{\text{exp}}) + z_{1-\frac{\alpha}{2}} * SE \right]$$
where z_(1-alpha/2) is the critical value of the test corresponding to the two-sided Z-test with alpha significance level. You can find it using the Z-table.

The rejection region of this two-sided 2-sample Z-test can be visualized by the following graph:

Image Source: The Author

The Python code below performs a two-sample Z-test for proportions. This type of test is used to determine whether there is a significant difference between the proportions of two groups. Here’s a brief explanation of the steps the code performs:

Calculates the sample proportions for both the control and experimental groups.

Computes the pooled sample proportion, which is an estimate of the proportion assuming the null hypothesis (that there is no difference between the group proportions) is true.

Calculates the pooled sample variance based on the pooled proportion and the sizes of the two samples.

Derives the standard error of the difference in sample proportions.

Calculates the Z-test statistic, which measures the number of standard errors between the sample proportion difference and the null hypothesis.

Finds the critical Z-value from the standard normal distribution for the given significance level.

Computes the p-value to assess the evidence against the null hypothesis.

Calculates the margin of error and the confidence interval for the difference in proportions.

Outputs the test statistic, critical value, p-value, and confidence interval, and prints a message stating that the null hypothesis is rejected when the absolute value of the test statistic is at least as large as the critical value.

The latter part of the code uses Matplotlib to create a visualization of the standard normal distribution and the rejection regions for the two-sided Z-test. This visual aid helps to understand where the test statistic falls in relation to the distribution and the critical values.

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    X_con = 1242   # clicks control
    N_con = 9886   # impressions control
    X_exp = 974    # clicks experimental
    N_exp = 10072  # impressions experimental

    # Significance Level
    alpha = 0.05

    p_con_hat = X_con / N_con
    p_exp_hat = X_exp / N_exp

    p_pooled_hat = (X_con + X_exp)/(N_con + N_exp)
    pooled_variance = p_pooled_hat*(1-p_pooled_hat) * (1/N_con + 1/N_exp)

    # Standard Error
    SE = np.sqrt(pooled_variance)

    # test statistic
    Test_stat = (p_con_hat - p_exp_hat)/SE
    # critical value using the standard normal distribution
    Z_crit = norm.ppf(1-alpha/2)

    # Margin of error
    m = SE * Z_crit
    # two-sided test: by symmetry of the Normal distribution we double the
    # upper-tail probability of |Test_stat|
    p_value = norm.sf(np.abs(Test_stat))*2

    # Confidence Interval
    CI = [(p_con_hat-p_exp_hat) - m, (p_con_hat-p_exp_hat) + m]

    if np.abs(Test_stat) >= Z_crit:
        print("reject the null")

    print("Test Statistics stat: ", Test_stat)
    print("Z-critical: ", Z_crit)
    print("P_value: ", p_value)
    print("Confidence Interval of 2 sample Z-test for proportions: ", np.round(CI,2))

    # Plot the standard Normal density and shade both rejection regions
    z = np.arange(-3,3, 0.1)
    plt.plot(z, norm.pdf(z), label = 'Standard Normal Distribution',color = 'purple',linewidth = 2.5)
    plt.fill_between(z[z>Z_crit], norm.pdf(z[z>Z_crit]), label = 'Right Rejection Region',color ='y' )
    plt.fill_between(z[z<(-1)*Z_crit], norm.pdf(z[z<(-1)*Z_crit]), label = 'Left Rejection Region',color ='y' )
    plt.title("Two Sample Z-test rejection region")
    plt.legend()
    plt.show()

Case 2: Z-test for Comparing Means (2-sided)

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ metrics that are in the form of averages (like average purchase amount) you can use a 2-sample Z-test to test the following hypothesis:

$$\begin{cases} H_0: \mu_{\text{con}} = \mu_{\text{exp}} \\ H_1: \mu_{\text{con}} \neq \mu_{\text{exp}} \end{cases}$$
$$\begin{cases} H_0: \mu_{\text{con}} - \mu_{\text{exp}} = 0 \\ H_1: \mu_{\text{con}} - \mu_{\text{exp}} \neq 0 \end{cases}$$
where the sampling distribution of means of the Control group follows a Normal distribution with mean mu_con and variance σ²_con/N_con. Moreover, the sampling distribution of means of the Experimental group also follows a Normal distribution with mean mu_exp and variance σ²_exp/N_exp.

$$\hat{\mu}_{\text{con}} \sim N\left(\mu_{\text{con}}, \frac{\sigma^2_{\text{con}}}{N_{\text{con}}}\right)$$
$$\hat{\mu}_{\text{exp}} \sim N\left(\mu_{\text{exp}}, \frac{\sigma^2_{\text{exp}}}{N_{\text{exp}}}\right)$$
Then the difference in the means of the control and experimental groups also follows a Normal distribution with mean mu_con-mu_exp and variance σ²_con/N_con + σ²_exp/N_exp.

$$\hat{\mu}_{\text{con}}-\hat{\mu}_{\text{exp}} \sim N\left(\mu_{\text{con}}-\mu_{\text{exp}}, \frac{\sigma^2_{\text{con}}}{N_{\text{con}}}+\frac{\sigma^2_{\text{exp}}}{N_{\text{exp}}}\right)$$
Consequently, the test statistic of the 2-sample Z-test for the difference in means can be calculated as follows:

$$T = \frac{\hat{\mu}_{\text{con}}-\hat{\mu}_{\text{exp}}}{\sqrt{\frac{\sigma^2_{\text{con}}}{N_{\text{con}}} + \frac{\sigma^2_{\text{exp}}}{N_{\text{exp}}}}} \sim N(0,1)$$
The Standard Error is equal to the square root of the variance of the difference in means and can be defined as:

$$SE = \sqrt{\frac{\sigma^2_{\text{con}}}{N_{\text{con}}} + \frac{\sigma^2_{\text{exp}}}{N_{\text{exp}}}}$$
Then the p-value of this test statistic can be calculated as follows:

$$p_{\text{value}} = \Pr[Z \leq -|T| \text{ or } Z \geq |T|]$$
$$= 2 * \Pr[Z \geq |T|]$$
Finally, you can compute the Confidence Interval of the test as follows:

$$CI = \left[ (\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}) - z_{1-\frac{\alpha}{2}} * SE, (\hat{\mu}_{\text{con}} - \hat{\mu}_{\text{exp}}) + z_{1-\frac{\alpha}{2}} * SE \right]$$
The Python code below conducts a two-sample Z-test, which is typically used to determine if there is a significant difference between the means of two independent groups. It performs the following steps:

It generates two arrays of random integers to represent data for a control group (X_A) and an experimental group (X_B).

It calculates the sample means (mu_con, mu_exp) and variances (variance_con, variance_exp) for both groups.

The standard error of the difference in means is computed from the two group variances and sample sizes. It is used in the denominator of the test statistic formula for the Z-test.

The Z-test statistic (T) is calculated by taking the difference between the two sample means and dividing it by the standard error of the difference.

The p-value is calculated to test the hypothesis of whether the means of the two groups are statistically different from each other.

The critical Z-value (Z_crit) is determined from the standard normal distribution, which defines the cutoff points for significance.

A margin of error is computed, and a confidence interval for the difference in means is constructed.

The test statistic, critical value, p-value, and confidence interval are printed to the console.

Lastly, the code uses Matplotlib to plot the standard normal distribution and highlight the rejection regions for the Z-test. This visualization can help in understanding the result of the Z-test in terms of where the test statistic lies relative to the distribution and the critical values for a two-sided test.

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    N_con = 60
    N_exp = 60

    # Significance Level
    alpha = 0.05

    X_A = np.random.randint(100, size = N_con)
    X_B = np.random.randint(100, size = N_exp)

    # Calculating means of control and experimental groups
    mu_con = np.mean(X_A)
    mu_exp = np.mean(X_B)

    variance_con = np.var(X_A)
    variance_exp = np.var(X_B)

    # Standard Error of the difference in means
    SE = np.sqrt(variance_con/N_con + variance_exp/N_exp)

    # Test statistic
    T = (mu_con-mu_exp)/SE

    # two-sided test: by symmetry of the Normal distribution we double the
    # upper-tail probability of |T|
    p_value = norm.sf(np.abs(T))*2

    # Z-critical value
    Z_crit = norm.ppf(1-alpha/2)

    # Margin of error
    m = Z_crit*SE

    # Confidence Interval
    CI = [(mu_con - mu_exp) - m, (mu_con - mu_exp) + m]

    print("Test Statistics stat: ", T)
    print("Z-critical: ", Z_crit)
    print("P_value: ", p_value)
    print("Confidence Interval of 2 sample Z-test for means: ", np.round(CI,2))

    # Plot the standard Normal density and shade both rejection regions
    z = np.arange(-3,3, 0.1)
    plt.plot(z, norm.pdf(z), label = 'Standard Normal Distribution',color = 'purple',linewidth = 2.5)
    plt.fill_between(z[z>Z_crit], norm.pdf(z[z>Z_crit]), label = 'Right Rejection Region',color ='y' )
    plt.fill_between(z[z<(-1)*Z_crit], norm.pdf(z[z<(-1)*Z_crit]), label = 'Left Rejection Region',color ='y' )
    plt.title("Two Sample Z-test rejection region")
    plt.legend()
    plt.show()

Chi-Squared test

If you want to test whether there is a statistically significant difference between the Control and Experimental groups’ performance metrics (for example their conversions) and you don’t really want to know the nature of this relationship (which one is better) you can use a Chi-Squared test to test the following hypothesis:

$$\begin{cases} H_0: CR_{\text{con}} = CR_{\text{exp}} \\ H_1: CR_{\text{con}} \neq CR_{\text{exp}} \end{cases}$$
$$\begin{cases} H_0: CR_{\text{con}} - CR_{\text{exp}} = 0 \\ H_1: CR_{\text{con}} - CR_{\text{exp}} \neq 0 \end{cases}$$
Note that the metric should be in the form of a binary variable (for example, conversion or no conversion/click or no click). The data can then be represented in the form of the following table, where O and T correspond to observed and theoretical values, respectively.

Table showing the data from Chi-Squared test

Then the test statistic of the Chi-squared test can be expressed as follows:

$$T = \sum_{i} \frac{(Observed_i - Expected_i)^2}{Expected_i}$$
where Observed corresponds to the observed data, Expected corresponds to the theoretical value, and i can take the values 0 (no conversion) and 1 (conversion). It’s important to see that each of these terms has a separate denominator. The formula for the test statistic when you have two groups only can be represented as follows:

$$T = \frac{(Observed_{con,1} - T_{con,1})^2}{T_{con,1}} + \frac{(Observed_{con,0} - T_{con,0})^2}{T_{con,0}} + \frac{(Observed_{exp,1} - T_{exp,1})^2}{T_{exp,1}} + \frac{(Observed_{exp,0} - T_{exp,0})^2}{T_{exp,0}}$$
The expected value is simply equal to the number of times each version of the product is viewed multiplied by the probability of it leading to conversion (or to a click in case of CTR).
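As a rough sketch of how the expected counts and the test statistic fit together, the snippet below builds a 2 × 2 table and computes both by hand. The counts in `observed` are hypothetical and chosen only for illustration. The cross-check uses SciPy's `chi2_contingency` with `correction=False` so that it matches the uncorrected formula.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical counts (assumption, not from the article):
# rows = [control, experimental], columns = [converted, not converted]
observed = np.array([[90, 910],
                     [120, 880]])

# Views per group and pooled probability of each outcome across both groups
views = observed.sum(axis=1, keepdims=True)                           # shape (2, 1)
pooled_rates = observed.sum(axis=0, keepdims=True) / observed.sum()   # shape (1, 2)

# Expected count = views of each version * pooled probability of the outcome
expected = views * pooled_rates

# Chi-squared statistic: sum of (Observed - Expected)^2 / Expected over all cells
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, df=dof)

# Cross-check against SciPy (no continuity correction)
scipy_stat, scipy_p, scipy_dof, scipy_expected = chi2_contingency(observed, correction=False)

print("Expected counts:\n", expected)
print("Chi-squared statistic:", round(stat, 4), "p-value:", round(p_value, 4))
```

Under the Null, both groups share the same pooled conversion rate, which is why the two rows of `expected` are identical here when both groups have the same number of views.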

Note that, since the Chi-2 test is not a parametric test, its Standard Error and Confidence Interval can’t be calculated in a standard way as we did in the parametric Z-test or T-test.

The rejection region of this Chi-squared test can be visualized by the following graph:

Image Source: The Author

The Python code below conducts a Chi-squared test, a statistical hypothesis test used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

In the code snippet, the test compares the observed and expected counts of the two groups:

It calculates the Chi-squared test statistic by summing the squared difference between observed (O) and expected (T) frequencies, divided by the expected frequencies for each category. This is known as the squared relative distance and is used as the test statistic for the Chi-squared test.

It then calculates the p-value for this test statistic using the degrees of freedom, which here is 1. For a table with r rows and c columns, the degrees of freedom are (r − 1)(c − 1), so two groups with two outcomes each give 1.

The Matplotlib library is used to plot the probability density function (pdf) of the Chi-squared distribution with one degree of freedom. It also highlights the rejection region for the test, which corresponds to the critical value of the Chi-squared distribution that the test statistic must exceed for the difference to be considered statistically significant.

The visualization helps to understand the Chi-squared test by showing where the test statistic lies in relation to the Chi-squared distribution and its critical value. If the test statistic is within the rejection region, the null hypothesis of no difference in frequencies can be rejected.

import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

O = np.array([86, 83, 5810, 3920])
T = np.array([105, 65, 5781, 3841])

# Squared relative distance
def calculate_D(O, T):
    D_sum = 0
    for i in range(len(O)):
        D_sum += (O[i] - T[i])**2 / T[i]
    return D_sum

D = calculate_D(O, T)
p_value = chi2.sf(D, df=1)

# Critical value at the 5% significance level
alpha = 0.05
chi2_crit = chi2.ppf(1 - alpha, df=1)

# Step 1: pick an x-axis range; start above 0 because the df = 1 pdf is unbounded at 0
d = np.arange(0.1, 5, 0.1)
# Step 2: draw the pdf of chi-2 with df = 1 over the x-axis range we just created
plt.plot(d, chi2.pdf(d, df=1), color="purple")
# Step 3: fill in the rejection region (values beyond the critical value)
plt.fill_between(d[d > chi2_crit], chi2.pdf(d[d > chi2_crit], df=1), color="y")
# Step 4: add the title
plt.title("Two Sample Chi-2 Test rejection region")
# Step 5: show the graph
plt.show()

P-Values

Another quick way to determine whether to reject or to support the Null Hypothesis is by using p-values. The p-value is the probability, assuming the Null Hypothesis is true, of observing a result at least as extreme as the test statistic. The smaller the p-value, the stronger the evidence against the Null Hypothesis, suggesting that it can be rejected.

The interpretation of a p-value is dependent on the chosen significance level. Most often, 1%, 5%, or 10% significance levels are used to interpret the p-value. So, instead of using the t-test and the F-test, p-values of these test statistics can be used to test the same hypotheses.

The following figure shows a sample output of an OLS regression with two independent variables. In this table, two p-values are underlined: the p-value of the t-test, which tests the statistical significance of the class_size variable’s parameter estimate, and the p-value of the F-test, which tests the joint statistical significance of the parameter estimates of the class_size and el_pct variables.

Image Source: [Stock and Watson](https://www.uio.no/studier/emner/sv/oekonomi/ECON4150/v18/lecture7_ols_multiple_regressors_hypothesis_tests.pdf)

The p-value corresponding to the class_size variable is 0.011. When we compare this value to the significance levels 1% or 0.01 , 5% or 0.05, 10% or 0.1, then we can make the following conclusions:

0.011 > 0.01 → Null of the t-test can’t be rejected at 1% significance level

0.011 < 0.05 → Null of the t-test can be rejected at 5% significance level

0.011 < 0.10 → Null of the t-test can be rejected at 10% significance level

So, this p-value suggests that the coefficient of the class_size variable is statistically significant at 5% and 10% significance levels. The p-value corresponding to the F-test is 0.0000. And since 0 is smaller than all three cutoff values (0.01, 0.05, 0.10), we can conclude that the Null of the F-test can be rejected in all three cases.

This suggests that the coefficients of class_size and el_pct variables are jointly statistically significant at 1%, 5%, and 10% significance levels.
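The comparison of a p-value against several significance levels can be sketched in a few lines of Python. The value 0.011 is the class_size p-value from the example above; the variable names are illustrative only.

```python
p_value = 0.011  # p-value of the class_size coefficient from the example above
significance_levels = (0.01, 0.05, 0.10)

# Reject the Null only when the p-value falls below the chosen significance level
decisions = {alpha: p_value < alpha for alpha in significance_levels}

for alpha, reject in decisions.items():
    verdict = "reject the Null" if reject else "fail to reject the Null"
    print(f"alpha = {alpha:.2f}: {verdict}")
```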

Limitation of p-values

Using p-values has many benefits, but it also has limitations. One of the main ones is that the p-value depends on both the magnitude of association and the sample size. If the magnitude of the effect is small and practically unimportant, the p-value might still show a significant impact because the sample size is large. The opposite can occur as well: an effect can be large, but fail to meet the p<0.01, 0.05, or 0.10 criteria if the sample size is small.
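To make the sample-size dependence concrete, here is a minimal sketch that holds the observed difference fixed and only changes the group size. The difference of 0.5, the standard deviation of 10, and the helper `two_sided_p_value` are all assumptions made for illustration, and the test is a two-sample Z-test with equal group sizes and a known common standard deviation.

```python
import numpy as np
from scipy.stats import norm

def two_sided_p_value(diff, sigma, n):
    """Two-sided Z-test p-value for a difference in means between two groups of size n each."""
    standard_error = sigma * np.sqrt(2 / n)
    z = diff / standard_error
    return 2 * norm.sf(abs(z))

# Same small effect (hypothetical numbers), very different sample sizes
p_small_n = two_sided_p_value(diff=0.5, sigma=10, n=100)
p_large_n = two_sided_p_value(diff=0.5, sigma=10, n=100_000)

print("n = 100 per group:    ", p_small_n)
print("n = 100000 per group: ", p_large_n)
```

The effect is identical in both calls, yet only the large sample produces a p-value below the usual cutoffs, which is why a p-value should be read together with the effect size.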

Inferential Statistics

Inferential statistics uses sample data to make reasonable judgments about the population from which the sample data originated. We use it to investigate the relationships between variables within a sample and make predictions about how these variables will relate to a larger population.

Both the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) have a significant role in Inferential statistics because they show that the experimental results hold regardless of the shape of the original population distribution when the sample is large enough.

The more data is gathered, the more accurate the statistical inferences become – hence, the more accurate parameter estimates are generated.

Law of Large Numbers (LLN)

Suppose X1, X2, . . . , Xn are all independent random variables with the same underlying distribution (also called independent identically-distributed or i.i.d), where all X’s have the same mean μ and standard deviation σ. As the sample size grows, the average of all X’s converges to the mean μ: the probability that the average differs from μ by more than any fixed small amount goes to 0.

The Law of Large Numbers can be summarized as follows:
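One standard way to write this, using $\bar{X}_n$ for the sample average (notation assumed here, not taken from the original figure), is:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\;p\;} \mu \quad \text{as } n \to \infty$$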

Central Limit Theorem (CLT)

Suppose X1, X2, . . . , Xn are all independent random variables with the same underlying distribution (also called independent identically-distributed or i.i.d), where all X’s have the same mean μ and standard deviation σ. As the sample size n grows, the distribution of the average of all X’s approaches a Normal distribution with mean μ and variance σ-squared divided by n.

The Central Limit Theorem can be summarized as follows:
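In the usual textbook notation, with $\bar{X}_n$ denoting the sample average (notation assumed here), the statement reads:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} N(0, 1) \quad \text{as } n \to \infty$$

so for large n the sample average is approximately distributed as $N(\mu, \sigma^2 / n)$.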

Stated differently, when you have a population with mean μ and standard deviation σ and you take sufficiently large random samples from that population with replacement, then the distribution of the sample means will be approximately normally distributed.
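Both results can be checked with a small simulation. The sketch below is only an illustration: the exponential population (mean 1, standard deviation 1), the seed, and the sample sizes are arbitrary choices, picked because the exponential distribution is clearly not Normal.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu, sigma = 1.0, 1.0  # an exponential distribution with scale 1 has mean 1 and sd 1

# Law of Large Numbers: the running average of the draws approaches mu
draws = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

# Central Limit Theorem: means of many samples of size n cluster around mu
# with a standard deviation close to sigma / sqrt(n)
n, repetitions = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(repetitions, n)).mean(axis=1)
clt_sd = sigma / np.sqrt(n)

print("Running mean after 100 draws:    ", running_mean[99])
print("Running mean after 100000 draws: ", running_mean[-1])
print("Mean of sample means:", sample_means.mean())
print("SD of sample means:  ", sample_means.std(ddof=1), "vs sigma/sqrt(n) =", clt_sd)
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the individual draws are strongly skewed.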

Dimensionality Reduction Techniques

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space such that this low-dimensional representation of the data still contains the meaningful properties of the original data as much as possible.

With the increase in popularity of Big Data, the demand for these dimensionality reduction techniques, which reduce the amount of unnecessary data and features, increased as well. Examples of popular dimensionality reduction techniques are Principal Component Analysis, Factor Analysis, Canonical Correlation, Random Forest.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that is very often used to reduce the dimensionality of large data sets. It does this by transforming a large set of variables into a smaller set that still contains most of the information or the variation in the original large dataset.

Let’s assume we have data X with p variables X1, X2, …., Xp, whose covariance matrix has eigenvectors e1, …, ep and eigenvalues λ1,…, λp. Each eigenvalue shows the variance explained by the corresponding principal component out of the total variance.

The idea behind PCA is to create new (uncorrelated) variables, called Principal Components, that are linear combinations of the existing variables. The i-th principal component can be expressed as follows:

$$Y_i = e_{i1}X_1 + e_{i2}X_2 + e_{i3}X_3 + ... + e_{ip}X_p$$
Then using the Elbow Rule or Kaiser Rule, you can determine the number of principal components that optimally summarize the data without losing too much information.

It is also important to look at the proportion of total variation (PRTV) that is explained by each principal component to decide whether it is beneficial to include or to exclude it. PRTV for the i-th principal component can be calculated using eigenvalues as follows:

$$PRTV_i = \frac{{\lambda_i}}{{\sum_{k=1}^{p} \lambda_k}}$$
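A minimal NumPy sketch of these two formulas follows. The data are randomly generated and purely hypothetical, and the variables are centered before projecting, which is a common convention assumed here rather than something stated above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 200 observations of p = 4 variables, two of them strongly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 2)),
])

# Eigen-decomposition of the sample covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# PRTV_i = lambda_i / sum of all lambdas
prtv = eigenvalues / eigenvalues.sum()

# Column i of Y is the i-th principal component: Y_i = e_i1*X_1 + ... + e_ip*X_p
Y = X_centered @ eigenvectors

print("PRTV per component:", np.round(prtv, 3))
print("Cumulative PRTV:   ", np.round(np.cumsum(prtv), 3))
```

The variance of each column of `Y` equals the corresponding eigenvalue, which is why the eigenvalues can be read as "variance explained".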
Elbow Rule

The elbow rule or the elbow method is a heuristic approach that we can use to determine the number of optimal principal components from the PCA results.

The idea behind this method is to plot the explained variation as a function of the number of components and pick the elbow of the curve as the number of optimal principal components.

Following is an example of such a scatter plot where the PRTV (Y-axis) is plotted on the number of principal components (X-axis). The elbow corresponds to the X-axis value 2, which suggests that the number of optimal principal components is 2.

Image Source: [Multivariate Statistics Github](https://raw.githubusercontent.com/TatevKaren/Multivariate-Statistics/main/Elbow_rule_%25varc_explained.png)

Factor Analysis (FA)

Factor analysis or FA is another statistical method for dimensionality reduction. It is one of the most commonly used inter-dependency techniques. We can use it when the relevant set of variables shows a systematic inter-dependence and our objective is to find out the latent factors that create a commonality.

Let’s assume we have a data X with p variables X1, X2, …., Xp. The FA model can be expressed as follows:

$$X-\mu = AF + u$$
where:

X is a [p x N] matrix of p variables and N observations

µ is the [p x N] population mean matrix

A is the [p x k] common factor loadings matrix

F is the [k x N] matrix of common factors

and u is the [p x N] matrix of specific factors.

So, to put it differently, a factor model is a series of multiple regressions, predicting each of the variables Xi from the values of the unobservable common factors:

$$X_1 = \mu_1 + a_{11}f_1 + a_{12}f_2 + ... + a_{1k}f_k + u_1\\ X_2 = \mu_2 + a_{21}f_1 + a_{22}f_2 + ... + a_{2k}f_k + u_2\\ \vdots\\ X_p = \mu_p + a_{p1}f_1 + a_{p2}f_2 + ... + a_{pk}f_k + u_p$$
Each variable has k of its own common factors, and these are related to the observations via the factor loading matrix for a single observation as follows:

In factor analysis, the factors are calculated to maximize between-group variance while minimizing in-group variance. They are factors because they group the underlying variables. Unlike the PCA, in FA the data needs to be normalized, given that FA assumes that the dataset follows Normal Distribution.
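To see how the pieces of the model X − µ = AF + u fit together dimensionally, the sketch below simulates data from it with NumPy. Everything here is an assumption for illustration: the sizes p, k, and N, the random loadings, standard Normal factors, and independent specific factors with variance 0.25. It generates data from the model and does not estimate the factors.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p, k, N = 5, 2, 50_000  # hypothetical: 5 variables, 2 common factors, 50,000 observations

mu = rng.normal(size=(p, 1))          # mean of each variable (broadcast across the N columns)
A = rng.normal(size=(p, k))           # [p x k] common factor loadings
F = rng.normal(size=(k, N))           # [k x N] common factors
u = 0.5 * rng.normal(size=(p, N))     # [p x N] specific factors with variance 0.25

# Factor model: X - mu = A F + u
X = mu + A @ F + u

# Under these assumptions the covariance of X is approximately A A^T + 0.25 * I
sample_cov = np.cov(X)
implied_cov = A @ A.T + 0.25 * np.eye(p)

print("X shape:", X.shape)
print("Largest gap between sample and implied covariance:",
      np.abs(sample_cov - implied_cov).max())
```

The shared term `A @ A.T` is what makes the variables move together, while the diagonal term captures the variation unique to each variable.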

Interview Prep – Top 7 Statistics Questions with Answers

Are you preparing for interviews in statistics, data analysis, or data science? It's crucial to know key statistical concepts and their applications.

Below I've included seven important statistics questions with answers, covering basic statistical tests, probability theory, and the use of statistics in decision-making, like A/B testing.

Question 1: What is the difference between a t-test and Z-test?

The question "What is the difference between a t-test and Z-test?" is a common question in data science interviews because it tests the candidate's understanding of basic statistical concepts used in comparing group means.

This knowledge is crucial because choosing the right test affects the validity of conclusions drawn from data, which is a daily task in a data scientist's role when it comes to interpreting experiments, analyzing survey results, or evaluating models.

Answer:

Both t-tests and Z-tests are statistical methods used to determine if there are significant differences between the means of two groups. But they have key differences:

Assumptions: You can use a t-test when the sample sizes are small and the population standard deviation is unknown. It assumes the data are approximately normally distributed, although this matters less when the sample size is sufficiently large because, by the Central Limit Theorem, the sample mean is then approximately normal. The Z-test assumes that the population standard deviation is known and that the sample mean is normally distributed, either because the population is normal or because the sample is large.

Sample Size: T-tests are typically used for sample sizes smaller than 30, whereas Z-tests are used for larger sample sizes (greater than or equal to 30) when the population standard deviation is known.

Test Statistic: The t-test uses the t-distribution to calculate the test statistic, taking into account the sample standard deviation. The Z-test uses the standard normal distribution, utilizing the known population standard deviation.

P-Value: The p-value in a t-test is determined based on the t-distribution, which accounts for the variability in smaller samples. The Z-test uses the standard normal distribution to calculate the p-value, suitable for larger samples or known population parameters.
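A quick way to see the practical difference is to evaluate the same test statistic under both distributions. In this sketch the statistic value 2.1 and the 9 degrees of freedom are hypothetical numbers chosen for illustration.

```python
from scipy.stats import norm, t

test_statistic = 2.1     # hypothetical value of the test statistic
degrees_of_freedom = 9   # e.g. a one-sample test with n = 10 observations

# Two-sided p-values under the standard Normal and under the t-distribution
p_value_z = 2 * norm.sf(abs(test_statistic))
p_value_t = 2 * t.sf(abs(test_statistic), df=degrees_of_freedom)

print("Z-test p-value:", round(p_value_z, 4))
print("t-test p-value:", round(p_value_t, 4))
```

Because the t-distribution has heavier tails for small samples, the t-based p-value is larger, so the same statistic can be significant at 5% under the Z-test but not under the t-test.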

Question 2: What is a p-value?

The question "What is a p-value?" requires an understanding of a fundamental concept in hypothesis testing that we discussed in this blog in detail with examples. It's not just a number – it's a bridge between the data you collect and the conclusions you draw for data-driven decision making.

P-values quantify the evidence against a null hypothesis—how likely it is to observe the collected data if the null hypothesis were true.

For data scientists, p-values are part of everyday language in statistical analysis, model validation, and experimental design. They have to interpret p-values correctly to make informed decisions and often need to explain their implications to stakeholders who might not have deep statistical knowledge.

Thus, understanding p-values helps data scientists to convey the level of certainty or doubt in their findings and to justify subsequent actions or recommendations.

So here you need to show your understanding of what p-value measures and connect it to statistical significance and hypothesis testing.

Answer:

The p-value measures the probability of observing a test statistic at least as extreme as the one observed, under the assumption that the null hypothesis is true. It helps in deciding whether the observed data significantly deviate from what would be expected under the null hypothesis.

If the p-value is lower than a predetermined threshold (alpha level, usually set at 0.05), the null hypothesis is rejected, indicating that the observed result is statistically significant.

Question 3: What are limitations of p-values?

P-values are a staple of inferential statistics, providing a metric for evaluating evidence against a null hypothesis. In this question, you need to name a couple of their limitations.

Answer

Dependence on Sample Size: The p-value is sensitive to the sample size. Large samples might yield significant p-values even for trivial effects, while small samples may not detect significant effects even if they exist.

Not a Measure of Effect Size or Importance: A small p-value does not necessarily mean the effect is practically significant – it simply indicates it's unlikely to have occurred by chance.

Misinterpretation: P-values can be misinterpreted as the probability that the null hypothesis is true, which is incorrect. They only measure the evidence against the null hypothesis.

Question 4: What is a Confidence Level?

A confidence level represents the frequency with which an estimated confidence interval would contain the true population parameter if the same process were repeated multiple times.

For example, a 95% confidence level means that if the study were repeated 100 times, approximately 95 of the confidence intervals calculated from those studies would be expected to contain the true population parameter.
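This repeated-sampling interpretation can be checked with a simulation. The sketch below assumes a Normal population with a known true mean of 50 and standard deviation of 10 (hypothetical values), draws many samples, builds a t-based 95% interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(seed=0)
true_mean, true_sd = 50.0, 10.0             # hypothetical population parameters
n, repetitions, confidence = 30, 5000, 0.95

# Draw `repetitions` independent samples of size n
samples = rng.normal(true_mean, true_sd, size=(repetitions, n))
means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

# t-based confidence interval for each sample
t_crit = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
lower = means - t_crit * standard_errors
upper = means + t_crit * standard_errors

# Share of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Share of intervals containing the true mean:", coverage)
```

The share lands close to 0.95, which is what the confidence level promises about the procedure rather than about any single interval.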

Question 5: What is the Probability of Picking 5 Red and 5 Blue Balls Without Replacement?

What is the probability of picking exactly 5 red balls and 5 blue balls in 10 picks without replacement from a set of 100 balls, where there are 70 red balls and 30 blue balls? The answer uses combinatorics and the hypergeometric distribution.

In this question, you're dealing with a classic probability problem that involves combinatorial principles and the concept of probability without replacement. The context is a finite set of balls, each draw affecting the subsequent ones because the composition of the set changes with each draw.

To approach this problem, you need to consider:

The total number of balls: If the question doesn't specify this, you need to ask or make a reasonable assumption based on the context.

Initial proportion of balls: Know the initial count of red and blue balls in the set.

Sequential probability: Remember that each time you draw a ball, you don't put it back, so the probability of drawing a ball of a certain color changes with each draw.

Combinations: Calculate the number of ways to choose 5 red balls from the total red balls and 5 blue balls from the total blue balls, then divide by the number of ways to choose any 10 balls from the total.

Thinking through these points will guide you in formulating the solution based on the hypergeometric distribution, which describes the probability of a given number of successes in draws without replacement from a finite population.

This question tests your ability to apply probability theory to a dynamic scenario, a skill that's invaluable in data-driven decision-making and statistical modeling.

Answer:

To find the probability of picking exactly 5 red balls and 5 blue balls in 10 picks without replacement, we calculate the probability of picking 5 red balls out of 70 and 5 blue balls out of 30, and then divide by the total ways to pick 10 balls out of 100:

Let's calculate this probability:
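Following the description above, the probability is the number of ways to choose 5 of the 70 red balls times the number of ways to choose 5 of the 30 blue balls, divided by the number of ways to choose any 10 of the 100 balls:

$$P = \frac{\binom{70}{5}\binom{30}{5}}{\binom{100}{10}}$$

A short sketch of the calculation with Python's standard library (the variable names are illustrative):

```python
import math

red_total, blue_total = 70, 30
red_picked, blue_picked = 5, 5
total = red_total + blue_total
picks = red_picked + blue_picked

# Favourable outcomes: choose 5 of the 70 red balls and 5 of the 30 blue balls
favourable = math.comb(red_total, red_picked) * math.comb(blue_total, blue_picked)

# All outcomes: choose any 10 of the 100 balls
possible = math.comb(total, picks)

probability = favourable / possible
print("P(5 red and 5 blue):", round(probability, 4))
```

The result is about 0.0996, so just under 10%.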

Question 6: Explain Bayes' Theorem and its importance in calculating posterior probabilities.

Provide an example of how it might be used in genetic testing to determine the likelihood of an individual carrying a certain gene.

Bayes' Theorem is a cornerstone of probability theory that enables the updating of initial beliefs (prior probabilities) with new evidence to obtain updated beliefs (posterior probabilities). This question tests the candidate's ability to explain the concept and the mathematical framework for incorporating new evidence into existing predictions or models.

Answer:

Bayes' Theorem is a fundamental theorem in probability theory and statistics that describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It's crucial for calculating posterior probabilities, which are the probabilities of hypotheses given observed evidence.

P(A∣B) is the posterior probability: the probability of hypothesis A given the evidence B.

P(B∣A) is the likelihood: the probability of observing evidence B given that hypothesis A is true.

P(A) is the prior probability: the initial probability of hypothesis A, before observing evidence B.

P(B) is the marginal probability: the total probability of observing evidence B under all possible hypotheses.
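The four quantities above are tied together by Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

For the genetic testing example, here is a minimal sketch with made-up numbers: A is "the individual carries the gene" and B is "the test comes back positive". The prevalence, sensitivity, and false positive rate below are hypothetical and not taken from any real test.

```python
# Hypothetical inputs (assumptions for illustration only)
prior = 0.01                 # P(A): share of the population carrying the gene
sensitivity = 0.99           # P(B | A): positive test given the gene is present
false_positive_rate = 0.05   # P(B | not A): positive test given the gene is absent

# Marginal probability of a positive test, summed over both hypotheses
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Posterior probability of carrying the gene given a positive test
posterior = sensitivity * prior / evidence

print("P(carrier | positive test):", round(posterior, 4))
```

With these inputs the posterior is only about 17%, even though the test is 99% sensitive, because the gene is rare and false positives outnumber true positives.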

Question 7: Describe how you would statistically determine if the results of an A/B test are significant - walk me through AB Testing process.

In this question, the interviewer is assessing your comprehensive knowledge of the A/B testing framework. They are looking for evidence that you can navigate the full spectrum of A/B testing procedures, which is essential for data scientists and AI professionals tasked with optimizing features, making data-informed decisions, and testing software products.

The interviewer wants to confirm that you understand each step in the process, beginning with formulating statistical hypotheses derived from business objectives. They are interested in your ability to conduct a power analysis and discuss its components, including determining effect size, significance level, and power, all critical in calculating the minimum sample size needed to detect a true effect and prevent p-hacking.

The discussion on randomization, data collection, and monitoring checks whether you grasp how to maintain the integrity of the test conditions. You should also be prepared to explain the selection of appropriate statistical tests, calculation of test statistics, p-values, and interpretation of results for both statistical and practical significance.

Ultimately, the interviewer is testing whether you can act as a data advocate: someone who can meticulously run A/B tests, interpret the results, and communicate findings and recommendations effectively to stakeholders, thereby driving data-driven decision-making within the organization.

To Learn AB Testing check my AB Testing Crash Course on YouTube.

Answer:

In an A/B test, my first step is to establish clear business and statistical hypotheses. For example, if we’re testing a new webpage layout, the business hypothesis might be that the new layout increases user engagement. Statistically, this translates to expecting a higher mean engagement score for the new layout compared to the old.

Next, I’d conduct a power analysis. This involves deciding on an effect size that's practically significant for our business context—say, a 10% increase in engagement. I'd choose a significance level, commonly 0.05, and aim for a power of 80%, reducing the likelihood of Type II errors.

The power analysis, which takes into account the effect size, significance level, and power, helps determine the minimum sample size needed. This is crucial for ensuring that our test is adequately powered to detect the effect we care about and for avoiding p-hacking by committing to a sample size upfront.

With our sample size determined, I’d ensure proper randomization in assigning users to the control and test groups, to eliminate selection bias. During the test, I’d closely monitor data collection for any anomalies or necessary adjustments.

Upon completion of the data collection, I’d choose an appropriate statistical test based on the data distribution and variance homogeneity—typically a t-test if the sample size is small or a normal distribution can’t be assumed, or a Z-test for larger samples with a known variance.

Calculating the test statistic and the corresponding p-value allows us to test the null hypothesis. If the p-value is less than our chosen alpha level, we reject the null hypothesis, suggesting that the new layout has a statistically significant impact on engagement.

In addition to statistical significance, I’d evaluate the practical significance by looking at the confidence interval for the effect size and considering the business impact.

Finally, I’d document the entire process and results, then communicate them to stakeholders in clear, non-technical language. This includes not just the statistical significance, but also how the results translate to business outcomes. As a data advocate, my goal is to support data-driven decisions that align with our business objectives and user experience strategy.
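The power analysis step in the answer above can be sketched as follows. This uses the standard Normal-approximation formula for comparing two means with equal group sizes, n = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² per group. The standard deviation and minimum detectable effect are hypothetical values, and other formulas apply for proportions or unequal groups.

```python
import math
from scipy.stats import norm

alpha = 0.05                  # significance level
power = 0.80                  # 1 - probability of a Type II error
sigma = 1.0                   # assumed standard deviation of the engagement metric (hypothetical)
min_detectable_effect = 0.2   # smallest difference in means worth detecting (hypothetical)

z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
z_beta = norm.ppf(power)            # quantile corresponding to the desired power

# Minimum sample size per group, rounded up
n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / min_detectable_effect ** 2)

print("Minimum sample size per group:", n_per_group)
```

Committing to this number before the test starts is what guards against stopping early once a convenient p-value appears.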

For getting more interview questions from Stats to Deep Learning - with over 400 Q&A as well as personalized interview preparation check out our Free Resource Hub and our Data Science Bootcamp with Free Trial.

Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!

About the Author

I am Tatev Aslanyan, Senior Machine Learning and AI Researcher, and Co-Founder of LunarTech where we are making Data Science and AI accessible to everyone. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.

With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelors & Masters, along with over 5 years of hands-on experience in the Data Science Industry, in Machine Learning and AI, I've gathered this high-level summary of ML topics to share with you.

How Can You Dive Deeper?

After studying this guide, if you're keen to dive even deeper and structured learning is your style, consider joining us at LunarTech, we offer individual courses and Bootcamp in Data Science, Machine Learning and AI.

We provide a comprehensive program that offers an in-depth understanding of the theory, hands-on practical implementation, extensive practice material, and tailored interview preparation to set you up for success at your own pace.

You can check out our Ultimate Data Science Bootcamp and join a free trial to try the content first hand. This has earned the recognition of being one of the Best Data Science Bootcamps of 2023, and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur and more. This is your chance to be a part of a community that thrives on innovation and knowledge. Here is the Welcome message!

Connect with Me

LunarTech Newsletter

Follow me on LinkedIn and on YouTube

Check LunarTech.ai for FREE Resources

Subscribe to my The Data Science and AI Newsletter

https://substack.com/@lunartech

If you want to learn more about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job, you can download this free Data Science and AI Career Handbook.

What Is Hypothesis Testing? Types and Python Code Example

Mene-Ejegi Ogbemi — Fri, 22 Sep 2023 00:41:23 +0000

Curiosity has always been a part of human nature. Since the beginning of time, this has been one of the most important tools for birthing civilizations. Still, our curiosity grows — it tests and expands our limits. Humanity has explored the plains of land, water, and air. We've built underwater habitats where we could live for weeks. Our civilization has explored various planets. We've explored land to an unlimited degree.

These things were possible because humans asked questions and searched until they found answers. However, for us to get these answers, a proven method must be used and followed through to validate our results. Historically, philosophers assumed the earth was flat and you would fall off when you reached the edge. While philosophers like Aristotle argued that the earth was spherical based on the formation of the stars, they could not prove it at the time.

This is because they didn't have adequate resources to explore space or mathematically prove Earth's shape. It was a Greek mathematician named Eratosthenes who calculated the earth's circumference with incredible precision. He used scientific methods to show that the Earth was not flat. Since then, other methods have been used to prove the Earth's spherical shape.

When there are questions or statements that are yet to be tested and confirmed based on some scientific method, they are called hypotheses. Basically, we have two types of hypotheses: null and alternate.

A null hypothesis is one's default belief or argument about a subject matter. In the case of the earth's shape, the null hypothesis was that the earth was flat.

An alternate hypothesis is a belief or argument a person might try to establish. Aristotle and Eratosthenes argued that the earth was spherical.

Other examples of a random alternate hypothesis include:

The weather may have an impact on a person's mood.

More people wear suits on Mondays compared to other days of the week.

Children are more likely to be brilliant if both parents are in academia, and so on.

What is Hypothesis Testing?

Hypothesis testing is the act of testing whether a hypothesis or inference is true. When an alternate hypothesis is introduced, we test it against the null hypothesis to know which is correct. Let's use a plant experiment by a 12-year-old student to see how this works.

The hypothesis is that a plant will grow taller when given a certain type of fertilizer. The student takes two samples of the same plant, fertilizes one, and leaves the other unfertilized. He measures the plants' height every few days and records the results in a table.

After a week or two, he compares the final height of both plants to see which grew taller. If the plant given fertilizer grew taller, the hypothesis is supported. If not, the hypothesis is not supported. This simple experiment shows how to form a hypothesis, test it experimentally, and analyze the results.

In hypothesis testing, there are two types of error: Type I and Type II.

When we reject the null hypothesis in a case where it is correct, we've committed a Type I error. Type II errors occur when we fail to reject the null hypothesis when it is incorrect.

In our plant experiment above, if the student finds out that both plants' heights are the same at the end of the test period yet opines that fertilizer helps with plant growth, he has committed a Type I error.

However, if the fertilized plant comes out taller and the student records that both plants are the same or that the one without fertilizer grew taller, he has committed a Type II error because he has failed to reject the null hypothesis.

What are the Steps in Hypothesis Testing?

The following steps explain how we can test a hypothesis:

Step #1 - Define the Null and Alternative Hypotheses

Before making any test, we must first define what we are testing and what the default assumption is about the subject. In this article, we'll be testing if the average weight of 10-year-old children is more than 32kg.

Our null hypothesis is that 10-year-old children weigh 32 kg on average. Our alternate hypothesis is that the average weight is more than 32 kg. H0 denotes the null hypothesis, while H1 denotes the alternate hypothesis.

H0: μ = 32

H1: μ > 32

Step #2 - Choose a Significance Level

The significance level is the threshold we use to decide whether a test result is statistically significant. It gives credibility to our hypothesis test by ensuring we are not relying on luck but have enough evidence to support our claims. We usually set our significance level before conducting our tests. The quantity we compare against this threshold is known as the p-value.

A lower p-value means that there is stronger evidence against the null hypothesis, and therefore, a greater degree of significance. A significance level of 0.05 is widely accepted in most fields of science. P-values do not denote the probability that a hypothesis is true; they tell us how likely a result at least as extreme as ours would be if only chance were at work. For our test, our significance level will be 0.05.

Step #3 - Collect Data and Calculate a Test Statistic

You can obtain your data from online data stores or conduct your research directly. Data can be scraped or researched online. The methodology might depend on the research you are trying to conduct.

We can calculate our test statistic using any of the appropriate hypothesis tests. This can be a T-test, Z-test, Chi-squared test, and so on. There are several hypothesis tests, each suiting different purposes and research questions. In this article, we'll use the T-test to test our hypothesis, but I'll explain the Z-test and the Chi-squared test too.

The T-test is used to compare the means of two sets of data when we don't know the population standard deviation. It's a parametric test, meaning it makes assumptions about the distribution of the data. These assumptions include that the data is normally distributed and that the variances of the two groups are equal. In simpler, more practical terms, imagine that we have test scores in a class for males and females, but we don't know how different or similar these scores are. We can use a t-test to see if there's a real difference.
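The class example can be sketched with SciPy's independent two-sample t-test. The scores below are invented purely for illustration, and equal_var=True is chosen to mirror the equal-variance assumption described above.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores, invented purely for illustration.
male_scores = np.array([72, 65, 80, 58, 77, 69, 74, 61])
female_scores = np.array([78, 70, 85, 66, 81, 73, 79, 68])

# Independent two-sample t-test. equal_var=True is the classic Student's
# t-test, which assumes both groups share the same variance.
t_stat, p_value = stats.ttest_ind(male_scores, female_scores, equal_var=True)

print("Test statistic:", t_stat)
print("p-value:", p_value)
```

A negative test statistic here only means the first group's mean is lower than the second's. Whether the difference is significant depends on the p-value.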

The Z-test is used to compare two sets of data when the population standard deviation is known. It is also a parametric test, but it makes fewer assumptions about the distribution of data. The z-test assumes that the data is normally distributed, but it does not assume that the variances of the two groups are equal. Returning to our class test example: if we already know how spread out the scores are in both groups, we can use the z-test instead of the t-test to see if there's a difference in the average scores.
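A two-sample z-test can be computed by hand once the population standard deviations are known. The sketch below reuses invented class scores and assumes both population standard deviations equal 8. That number is a made-up assumption for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores, invented purely for illustration.
male_scores = np.array([72, 65, 80, 58, 77, 69, 74, 61])
female_scores = np.array([78, 70, 85, 66, 81, 73, 79, 68])

# Population standard deviations, assumed to be known in advance.
sigma_male = 8.0
sigma_female = 8.0

# Standard error of the difference between the two sample means.
standard_error = np.sqrt(
    sigma_male**2 / len(male_scores) + sigma_female**2 / len(female_scores)
)

# z = (difference in sample means) / (standard error of that difference)
z_stat = (male_scores.mean() - female_scores.mean()) / standard_error

# Two-sided p-value from the standard normal distribution.
p_value = 2 * stats.norm.sf(abs(z_stat))

print("z statistic:", z_stat)
print("p-value:", p_value)
```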

The Chi-squared test is used to compare two or more categorical variables. The chi-squared test is a non-parametric test, meaning it does not make any assumptions about the distribution of data. It can be used to test a variety of hypotheses, including whether two or more groups have equal proportions.
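As a small sketch of the chi-squared test of independence, the table below counts how many members of two groups preferred each of three products. All counts are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows are two groups, columns are how many
# members of each group preferred product A, B, or C.
observed = np.array([
    [30, 20, 10],
    [20, 25, 15],
])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the counts expected if group and preference were independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared statistic:", chi2_stat)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
```

For a table with 2 rows and 3 columns, the degrees of freedom are (2 - 1) x (3 - 1) = 2.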

Step #4 - Decide on the Null Hypothesis Based on the Test Statistic and Significance Level

After conducting our test and calculating the test statistic and its p-value, we compare the p-value to the predetermined significance level. If the p-value is lower than the significance level, we reject the null hypothesis, indicating that there is sufficient evidence to support our alternative hypothesis.

Conversely, if the p-value is not lower than the significance level, we fail to reject the null hypothesis, signifying that we do not have enough statistical evidence to conclude in favor of the alternative hypothesis.
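The decision rule can be written in two equivalent ways: compare the p-value with the significance level, or compare the test statistic with a critical value. The sketch below assumes an upper-tailed t-test with a hypothetical sample size of 25 and a hypothetical test statistic of 2.10.

```python
from scipy import stats

alpha = 0.05      # significance level chosen before the test
n = 25            # hypothetical sample size
t_stat = 2.10     # hypothetical test statistic from an upper-tailed t-test

df = n - 1                                # degrees of freedom
t_critical = stats.t.ppf(1 - alpha, df)   # cut-off for an upper-tailed test
p_value = stats.t.sf(t_stat, df)          # P(T >= t_stat) under the null

# Both rules always lead to the same decision.
reject_by_critical_value = t_stat > t_critical
reject_by_p_value = p_value < alpha

print("Critical value:", t_critical)
print("p-value:", p_value)
print("Reject the null hypothesis:", reject_by_p_value)
```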

Step #5 - Interpret the Results

Depending on the decision made in the previous step, we can interpret the result in the context of our study and the practical implications. For our case study, we can interpret whether we have significant evidence to support our claim that the average weight of 10 year old children is more than 32kg or not.

For our test, we are generating random dummy data for the weight of the children. We'll use a t-test to evaluate whether our hypothesis is correct or not.

import numpy as np
import scipy.stats as stats

# Create a dummy dataset of 10 year old children's weight
data = np.random.randint(20, 40, 100)

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis
H1 = "The average weight of 10 year old children is more than 32kg."

# Calculate the test statistic (ttest_1samp is two-sided by default)
t_stat, p_value = stats.ttest_1samp(data, 32)

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

For a better understanding, let's look at what each block of code does.

import numpy as np
import scipy.stats as stats

The first block is the import statement, where we import numpy and scipy.stats. Numpy is a Python library used for scientific computing. It has a large library of functions for working with arrays. Scipy is a library for scientific and mathematical computing. It has a stats module for performing statistical functions, and that's what we'll be using for our t-test.

# Create a dummy dataset of 10 year old children's weight
data = np.random.randint(20, 40, 100)

The weights of the children were generated at random since we aren't working with an actual dataset. The random module within the Numpy library provides a function for generating random numbers, which is randint.

The randint function takes three arguments. The first (20) is the lower bound of the random numbers to be generated. The second (40) is the upper bound, which is exclusive, so the largest value that can be generated is 39. The third (100) specifies the number of random integers to generate. That is, we are generating random weight values for 100 children. In real circumstances, these weight samples would have been obtained by taking the weight of the required number of children needed for the test.

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis
H1 = "The average weight of 10 year old children is more than 32kg."

Using the code above, we declared our null and alternate hypotheses stating the average weight of a 10-year-old in both cases.

# Calculate the test statistic
t_stat, p_value = stats.ttest_1samp(data, 32)

t_stat and p_value are the variables in which we'll store the results of the function. stats.ttest_1samp is the function that calculates our test. It takes in two arguments: the first is the data variable that stores the array of weights for the children, and the second (32) is the value against which we'll test the mean of our array of weights (or of our dataset, in cases where we are using a real-world dataset).

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

The code above prints the values of both t_stat and p_value.

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

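Lastly, we evaluated our p_value against our significance level, which is 0.05. If our p_value is less than 0.05, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis. Below is the output of this program. Our null hypothesis was rejected. Note that ttest_1samp runs a two-sided test by default and the test statistic here is negative, so this sample's mean is significantly below 32kg rather than above it. The rejection therefore does not support the claim that the average weight is more than 32kg.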

Test statistic: -5.114430435590074
p-value: 1.541000376540265e-06
Reject the null hypothesis.
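The alternate hypothesis in this walkthrough is one-sided (more than 32kg), while ttest_1samp tests in both directions unless told otherwise. As a rough sketch of a test that matches the stated hypothesis, the variant below passes alternative="greater", a parameter available in SciPy 1.6 and later. The seeded generator is an addition for repeatability and is not part of the original program.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatability

# Same setup as the walkthrough: 100 integer weights from 20 to 39
# (the upper bound of 40 is exclusive).
data = rng.integers(20, 40, size=100)

# alternative="greater" tests H1: mean > 32 instead of H1: mean != 32.
t_stat, p_value = stats.ttest_1samp(data, popmean=32, alternative="greater")

print("Sample mean:", data.mean())
print("Test statistic:", t_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

Because the dummy weights are spread evenly between 20 and 39, their mean sits well below 32, so the one-sided test should fail to reject the null hypothesis for this kind of data.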

Conclusion

In this article, we discussed the importance of hypothesis testing. We highlighted how science has advanced human knowledge and civilization through formulating and testing hypotheses.

We discussed Type I and Type II errors in hypothesis testing and how they underscore the importance of careful consideration and analysis in scientific inquiry. They reinforce the idea that conclusions should be drawn based on thorough statistical analysis rather than assumptions or biases.

We also generated a sample dataset using the relevant Python libraries and used the needed functions to calculate and test our alternate hypothesis.

Thank you for reading! Please follow me on LinkedIn where I also post more data related content.

What is R Squared? R2 Value Meaning and Definition

Ihechikara Abba — Tue, 28 Mar 2023 14:51:21 +0000

Regression analysis is a statistical method used to study the relationship between a dependent variable and one or more independent variables.

One of the most commonly used measures for evaluating a linear regression model is R-Squared.

In this article, you'll get to know what R-Squared is and the meaning of its value(s). You'll also see some of the fields where it is used.

What is R Squared?

R-Squared (R²) is a statistical measure used to determine the proportion of variance in a dependent variable that can be predicted or explained by an independent variable.

In other words, R-Squared shows how well a regression model (built on the independent variable) predicts the observed values of the dependent variable.

R-Squared is also commonly known as the coefficient of determination. It is a goodness-of-fit measure for linear regression analysis.

What Does an R Squared Value Mean?

An R-Squared value shows how well the model predicts the outcome of the dependent variable. R-Squared values range from 0 to 1.

An R-Squared value of 0 means that the model explains 0% of the variance in the dependent variable.

A value of 1 indicates that the model explains 100% of the variance, a value of 0.5 indicates that the model explains 50%, and so on.

The formula below is mostly used to find the value of R-Squared:

R² = 1 - RSS/TSS

where,

R² = coefficient of determination

RSS = sum of squares of residuals

TSS = total sum of squares
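The formula can be applied directly with NumPy. The sketch below fits a straight line to a handful of invented data points and then computes RSS, TSS, and R-Squared from the definitions above. The x and y values are hypothetical.

```python
import numpy as np

# Hypothetical data, invented for illustration.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

rss = np.sum((y - y_pred) ** 2)      # sum of squares of residuals
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss

print("R-squared:", r_squared)
```

For a simple linear regression with one independent variable, this value equals the square of the correlation coefficient between x and y, which is a handy way to check the calculation.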

Where Is R Squared Used?

R-Squared is used by different fields. It can be used for the following:

Risk analysis in finance.

Marketing campaigns.

Scientific research.

Economics.

Sports analysis.

Summary

In this article, we talked about R-Squared. It is a statistical measure of how well a regression model explains the observed data.

We started by looking at what R-Squared means. We then talked about the meaning of its value and how to calculate it.

Lastly, we talked about the different fields where R-Squared can be used.

Thank you for reading!

What is the Difference Between an Independent Variable and a Dependent Variable?

Kolade Chris — Thu, 15 Dec 2022 17:32:36 +0000

The meaning of the word "variable" depends on the field where it's being used.

In programming, a variable is a particular piece of data that holds a value. Depending on the configuration, that value can change or remain fixed.

For instance, in JavaScript, you can implement a variable to change over time with the "var" and "let" keywords. But if you want, you can implement a variable so it doesn't change with the "const" keyword.

In this article, the kind of variable we’ll be looking at is not the one in programming but the one you'll deal with in research. Precisely, we’ll look at the differences between the two main types of variables in research – dependent and independent variables.

But before we look at the differences between dependent and independent variables, we need to understand what a variable is first.

What We'll Cover

What is a Variable in Research?

What are Dependent and Independent Variables?

What are the Differences between Dependent and Independent Variables?

How to Identify a Dependent Variable from an Independent Variable

Conclusion

More Readings

What is a Variable in Research?

If you’re conducting research, you’ll be measuring a lot of values. So, in research, a variable is anything you’re trying to measure. It could be age, temperature, length, height, mass, weight, or any other thing that can have a value.

In addition, you’ll be measuring those variables in different units – centimeters (cm), meters (m), grams (g), kilograms (kg), and many more.

These units can’t be neglected, but as far as variables are concerned, whether dependent or independent, the values are what really matter.

What are Dependent and Independent Variables?

Whether a variable is dependent or independent comes down to whether its outcome is determined by another variable or not.

A dependent variable is a variable whose changes and outcome depend on another variable. On almost all occasions, the variable it depends on is an independent variable.

Dependent variables are also called the response or outcome variables because they represent the outcome of the values you're measuring. That is, what you record after manipulating the independent variables.

An independent variable is a variable whose outcome or changes do not depend on another variable. It is the exact opposite of the dependent variable, at least according to what the name implies.

Independent variables are also called predictor variables because you can use them to predict the outcome of a dependent variable. That is, when you manipulate independent variables, they can give you the outcome of a dependent variable.

What are the Differences between Dependent and Independent Variables?

Basis | Dependent Variable | Independent Variable

Type | It is the "response" variable | It is the "predictor" variable

Outcome | Outcome depends on another variable (usually the independent variable) | Outcome does not depend on another variable

Changes | This variable changes in response to the independent variable. Consider the dependent variable as a variable you declare with the "let" or "var" keyword in JavaScript. You can later change it. | This variable does not change in response to other variables in the study. Consider the independent variable as the variable you declare with the "const" keyword in JavaScript. It is fixed unless you, the researcher, explicitly change the value.

Manipulation | Dependent variables cannot be manipulated directly because their value depends on the independent variable. | Independent variables can be manipulated to determine the outcome of a dependent variable.

Position on a Graph | Dependent variables are placed on the y-axis (vertical axis) on a graph | Independent variables are placed on the x-axis (horizontal axis) on a graph

How to Identify a Dependent Variable from an Independent Variable

We've taken a look at what variables are, what dependent and independent variables are, and the exact differences between dependent and independent variables.

But how exactly would you differentiate a dependent variable from an independent variable? We are going to look at two experiments or examples:

how Vitamin A helps mothers produce milk

the level of light nocturnal insects (insects active at night) are attracted to

For the first experiment, the sources of vitamin A the mother takes in from foods like fish oil and green vegetables are the independent variable. That's because the researcher can change [decrease or increase] the amount given to the mothers. How the body of the mother responds in producing more milk is the dependent variable.

For the second experiment, the level of light is the independent variable because the researcher can change it. How nocturnal insects react to that level of light is the dependent variable.

Conclusion

In this article, you learned about the differences between dependent and independent variables. We looked at:

what a variable is in research

the two main types of variables (dependent and independent variables)

what dependent and independent variables are

and how to differentiate a dependent variable from an independent variable.

I hope this article gives you an understanding of what research variables are and how to differentiate a dependent variable from an independent variable.

Thank you for reading.

Further Reading

Types of Variables in Research and Statistics

Independent Variables v Dependent Variables

What is Stratified Random Sampling? Definition and Python Example

Ibrahim Ogunbiyi — Tue, 15 Nov 2022 16:33:52 +0000

When we wish to conduct an experiment on a population – for example, the entire population of a country – it is not always practical or realistic to include every subject (citizen) in the experiment.

Instead, we rely on a sample, which is a subset of the population, and then draw conclusions about the population based on the sample's results.

Now, drawing a sample from a population is known as sampling, and the manner in which the sample is drawn is essential to the result.

There are a lot of sampling techniques out there, but in this tutorial we will look at one of them called stratified random sampling and how it works. Without further ado, let's get started.

What is Stratified Random Sampling?

Before we go into the details of stratified random sampling, let's break the term down into bits so we can grasp it better. Let's start with stratified.

In the context of sampling, stratified means splitting the population into smaller groups or strata based on a characteristic. To put it another way, you divide a population into groups based on their features.

Random sampling entails randomly selecting subjects (entities) from a population. Each subject has an equal probability of being chosen from the population to form a sample (subpopulation) of the overall population.

Therefore, stratified random sampling is a sampling approach in which the population is separated into groups or strata depending on a particular characteristic. Then subjects from each stratum (the singular of strata) are randomly sampled.

You divide the population into groups based on a characteristic and then choose a subject or entity at random from each group.

Types of Stratified Random Sampling

Stratified sampling is divided into two categories, which are:

Proportionate stratified random sampling.

Disproportionate stratified random sampling.

Proportionate stratified random sampling is a type of sampling in which the size of the random sample obtained from each stratum is proportionate to the size of the entire stratum's population.

In other words, the proportion of the entire stratum equals the proportion of the sample stratum. Consider the following example:

import pandas as pd

students = {
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf", "Nurain", "Dayo", "David", "Olu", "Tobi"],
    "ID": ['001', '002', '003', '004', '005', '006', '007', '008', '009', '010'],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
    "Category": [1, 2, 2, 1, 3, 3, 1, 2, 3, 3]
}

df = pd.DataFrame(students)

The above dataframe contains students' names, IDs, grades, and categories. Assume we wish to stratify students based on their grade characteristics and sample 60% of students from each group. That means we will have three strata in the above dataframe, because we have three different grades.

We can sample it by typing the following:

df_sample = df.groupby("Grade", group_keys=False).apply(lambda x:x.sample(frac=0.6))

What we did above was group the dataframe into strata using the groupby() method, passing in the Grade feature. For each group (stratum), we randomly sampled 0.6 (60%) of its observations.

Now if we look at the grade proportions for df_sample and df, we will see that they are approximately the same. With groups this small, the sample size per group is rounded to a whole number of rows, so the match is close rather than exact.
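One way to compare the proportions is with value_counts(normalize=True). The sketch below rebuilds only the Name and Grade columns and uses DataFrameGroupBy.sample (available in pandas 1.1 and later) as an alternative to the apply-based call. The random_state value is an arbitrary seed added for repeatability.

```python
import pandas as pd

students = {
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf",
             "Nurain", "Dayo", "David", "Olu", "Tobi"],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
}
df = pd.DataFrame(students)

# Draw 60% of the rows within each grade (proportionate stratified sampling).
df_sample = df.groupby("Grade").sample(frac=0.6, random_state=1)

# Share of each grade in the full dataframe and in the sample.
population_share = df["Grade"].value_counts(normalize=True).sort_index()
sample_share = df_sample["Grade"].value_counts(normalize=True).sort_index()

print(population_share)
print(sample_share)
```

The three grades have 5, 3, and 2 students, so sampling 60% of each gives 3, 2, and 1 rows. The shares are close to the original 50%, 30%, and 20% without matching them exactly.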

Disproportionate stratified random sampling, on the other hand, involves randomly selecting subjects from each stratum without regard for proportion. In other words, sampling is done based on a specified number. Let's look at an example.

df.groupby('Grade', group_keys=False).apply(lambda x: x.sample(n=2))

In this code, you can see that we only specified the actual number of samples we want to achieve.

Most of the time, you'll use proportionate stratified sampling. Disproportionate stratified sampling requires more expert knowledge.

Applications of Stratified Random Sampling

1. Sampling Based on Shared Characteristic:

When one or more subjects in an experiment share characteristics, it suggests they are members of the same group (one subject can only be in a particular group).

For example, suppose 50 students take a test, and the grade range for the examination is merely A-E. So we can have students who are in the same grade group, for example, students who received an A (and it is impossible for a student to have two grades). As a result, they share the same characteristic or feature, which is grade.

So when you want to sample subjects based on shared characteristics, you should use stratified random sampling. This ensures that a member of a specific group will be included.

This is because stratified random sampling differs from simple random sampling, which is also a sampling technique. Simple random sampling randomly samples the population without regard to any characteristic (that is, each subject of the population has an equal chance of being picked).

As a result, simple random sampling cannot guarantee that a certain member of a particular group will be included in the sample.

Let's have a look at an example to see what we're talking about. Let's say we want to sample out 60% of students using both stratified and simple random sampling.

We can see the result for stratified random sampling below:

df.groupby('Grade', group_keys=False).apply(lambda x: x.sample(frac=0.6))

And this is the result of simple random sampling:

df.sample(frac= 0.6)

We can see that students with C grades are not included in the sample. This is because in simple random sampling, every observation has an equal chance of being chosen because we are not sampling based on characteristics. This means that there is a chance that an observation will not be chosen.

In stratified random sampling, on the other hand, we consider all the groups we want to sample and then randomly sample from each group.
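To get a feel for how often simple random sampling leaves out a whole group, the sketch below works out the chance that a 60% sample of the ten students contains no C grades, and then checks it by repeating the sample. The number of trials and the seeds are arbitrary choices for illustration.

```python
from math import comb

import pandas as pd

grades = ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A']
df = pd.DataFrame({"Grade": grades})

# Exact probability that a simple random sample of 6 of the 10 students
# contains neither of the 2 students with grade C:
# choose all 6 from the 8 non-C students, out of all ways to choose 6 of 10.
p_no_c = comb(8, 6) / comb(10, 6)

# Empirical check: repeat the simple random sample with different seeds
# and count how often grade C is missing from the sample.
trials = 1000
missing = sum(
    "C" not in set(df.sample(frac=0.6, random_state=seed)["Grade"])
    for seed in range(trials)
)

print(f"Exact: {p_no_c:.3f}, simulated: {missing / trials:.3f}")
```

The exact value is 28/210, or roughly 13%. A stratified sample at the same fraction always includes a C student, because it draws from each grade separately.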

2. Imbalanced Dataset:

An imbalanced dataset, in a machine learning classification problem, is one in which the class labels in the target variable are not present in equal proportions. In other words, one class has a higher count than the other, resulting in an imbalance.

In machine learning, stratified sampling is also used to obtain the same sample proportion for a train and test set if there is an imbalance in the dataset.

For example, a chronic disease dataset has an imbalance label as shown below. You can click here to download the dataset.

df = pd.read_csv("kidney_disease.csv")
df.head()

If we check the proportions of the label feature, which is classification, we can see that it is imbalanced.

Now let's say we want to split the train and test set using simple random sampling. We are not guaranteed to get the same proportions in the train and test sets as in the population.

from sklearn.model_selection import train_test_split

X = df.drop(columns=["classification"])
y = df["classification"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

We can see that the label proportion for both y_train and y_test is not the same as the population proportion. To achieve the same proportion we can make use of the stratify parameter in train_test_split as shown below:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["classification"])
y = df["classification"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

The above code shows that the dataset was stratified on the label. So with that we will achieve the same proportion as the population proportion.
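The effect of the stratify parameter can be checked on a small synthetic example. The labels below are hypothetical (300 "positive" and 100 "negative") and stand in for the real dataset, which is not loaded here. The feature column is a placeholder and random_state is an arbitrary seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 300 of one class, 100 of the other.
y = np.array(["positive"] * 300 + ["negative"] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

def positive_share(labels):
    """Return the fraction of labels that belong to the 'positive' class."""
    return np.mean(labels == "positive")

print("Population:", positive_share(y))
print("Train:", positive_share(y_train))
print("Test:", positive_share(y_test))
```

With stratify=y, the share of each class in the train and test sets should track the share in the full set of labels.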

Conclusion

In this tutorial, we looked at stratified sampling and how you can use it in statistics and machine learning. We also looked at the types of stratified sampling.

Thank you for your time.

Population vs Sample – Statistics Example

Dionysia Lemonaki — Tue, 12 Jul 2022 17:07:34 +0000

When working with data sets and conducting a statistical analysis, you need to ensure that the data set you are using is relevant, valid, and correct.

The appropriate data will help you make sure that you have a correct outcome and come to an effective conclusion and solution that solves the problem at hand.

This is why it's necessary to know the difference between population data sets and sample data sets and whether the data you are dealing with is part of a population data set or a sample data set.

In this brief guide, you will learn the differences between these two popular statistical terms.

Let's get started!

What Is A Population in Statistics? Population Definition

A population is a collection that consists of all possible data values and items within the field of study.

A population refers to the whole number of items or the entire group of people that are of interest in the statistical study.

Essentially, it makes up the entire pool of the study.

An example of a population set is all the people living in a country, such as everyone living in the U.S. – that is, the entire population of the U.S.

Another example of working with a population set could be analyzing all the students in a university – this is the whole number of students studying at the University.

The quantity that describes the outcome of measuring the whole population is called a parameter. A parameter is a number that refers to the entire population.

Which Method Should You Use to Collect Data from a Population?

You may want to choose to collect data from a population when you need to work with a large amount of data.

A way of collecting data from an entire population is by conducting a census.

Let's take the U.S. census as an example. It's a procedure that takes place once every ten years.

It counts every person living in the U.S. and conducts a survey that collects data from all individuals and every member that makes up the population.

Is Population Data Accurate?

Collecting data from a population is not the most efficient way of collecting data.

Populations are often hard to define and observe, which will inevitably introduce a bias in the study and probably skew the results and lead to unreliable conclusions.

There are a few reasons why this is the case:

The pool of study is often too large.

There may be geographical constraints.

There may be time constraints.

There may be resource constraints.

There may be accessibility constraints.

It is likely that there will be missing data values.

Instead, you may choose to collect data from a population when the population size is relatively small. You can also gather information on the items/people that make up the population when it is easily accessible, or when you can measure the items or contact every member of the population.

What Is a Sample in Statistics? Sample Definition

A sample is a subset and a small portion of the population – a small part of all the possible data values that are part of the specified field of study.

The size of the sample data set will always be smaller than that of the population.

Working with sample data is helpful when the population is too large or cannot be measured reliably.

For example, the population could be unknown in size, or even not measurable or infinite in size.

This is the preferred method of collecting data when the data you need is too hard to gather. It's a way to get information about the population without actually needing to access every person or item in that population.

The number that refers to the result of measuring from within a sample data set is called a statistic. A statistic describes a sample of a population.
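The difference between a parameter and a statistic can be illustrated with simulated data. The population below (100,000 heights in centimeters) is generated for illustration only, and the sample size of 1,000 and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability

# Hypothetical population: simulated heights (cm) of 100,000 people.
population = rng.normal(loc=170, scale=10, size=100_000)

# Simple random sample of 1,000 people, drawn without replacement.
sample = rng.choice(population, size=1_000, replace=False)

parameter = population.mean()   # describes the whole population
statistic = sample.mean()       # describes only the sample

print("Population mean (parameter):", parameter)
print("Sample mean (statistic):", statistic)
```

The sample mean will rarely equal the population mean exactly, but for a random sample of this size it should land close to it.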

What Are the Defining Characteristics of a Good Sample?

A sample should accurately represent the whole population.

One of the other most important characteristics of sample data is that it should be random and chosen without bias.

Insights and data should be collected randomly, meaning every item or member of a population has equal chances and the same probability of being selected.

Those two criteria reduce bias and ensure the results are valid.

How Is Data Collected from a Sample?

The process of collecting data from a small subset of the population is known as sampling.

Sampling is helpful when it is difficult to collect all the necessary data from the population.

Sampling represents the entire population as it generalizes and reflects the individuals that are part of it.

Gathering all the necessary information and contacting the members of interest is easier, less time-consuming, and less costly.

A way to collect data from a sample is to conduct a poll, which is what happens during an election period.

Polls are a helpful tool for gauging voters' preferences and support of the parties taking part in the election.

It's impossible to gather all registered voters in the country and ask who they prefer to win the election since they might be in the millions.

Instead, it is better to gather several thousand responses from different sections of the population, such as from various cities and regions and from unrelated spots within those cities and regions.

This selection needs to be random, and people need to be chosen by chance. This ideally means that everyone should equally have the same chance of being picked for the poll.

What Is Sampling Bias and How to Avoid It

As mentioned earlier, a sample should accurately represent and reflect the entire population from which it has been taken.

For the sample to be representative, it should be gathered randomly. If not, the result of the analysis will most likely be prone to bias or what is otherwise known as sampling bias.

Sampling bias occurs when the methods used to collect the sample encourage systemic prejudice.

The methods are either in favor of or against an individual or group, which will inevitably skew the outcome of the analysis. Members of the specific population are not selected correctly, meaning some have a higher or lower chance of being selected than others.

Essentially, the sample is collected in a way that unfairly favors only certain members of the population over others.

For example, a survey that questions students at the University’s cafe regarding their University experience excludes various groups of students.

It excludes:

Students who are distance learning and studying from home.

Students who may be studying part-time and working at the time the survey took place.

Students on an exchange program in a different country.

Students in a class following a lecture.

Firstly, this method is not random. Secondly, it is prone to sampling bias as it is limiting and favors only the section of students that were able to be present in the cafe during morning hours and therefore is not representative.

These students may have specific characteristics and probably do not reflect the overall population of students in the whole University.

Let's take another example.

Say that a poll is conducted during an election period to find out which candidate is the most favorable to the public.

If the members polled are only white-collar workers, the results will be inaccurate since they don't accurately describe the entire population.

The population also includes blue-collar workers and people who might work more than one minimum wage job to make ends meet. The preferences for the candidate will likely differ from group to group.

In this case, the bias is heavy since the poll is not diverse – it reflects only one section of the population.

A way to lower the risk of sampling bias is through stratified random sampling.

Stratified random sampling involves accurately defining the population of interest, the characteristics it needs to have, and how you want it divided.

It also involves choosing your sample size and then dividing the population into precise, homogeneous smaller sub-groups that match the relevant criteria you set, while ensuring the population and sample match.

Stratified random sampling leads to a more representative sample.

Wrapping up

And there you have it! You now have a high-level understanding of the differences between two widely used statistical terms - population and sample.

To learn more about Statistics, check out this free 8-hour course from freeCodeCamp.

Thank you for reading!

How to Detect Outliers in Machine Learning – 4 Methods for Outlier Detection

Bala Priya C — Tue, 05 Jul 2022 22:02:13 +0000

Have you ever trained a machine learning model on a real-world dataset? If yes, you’ll have likely come across outliers.

Outliers are those data points that are significantly different from the rest of the dataset. They are often abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or erroneous observations.

To ensure that the trained model generalizes well to the valid range of test inputs, it’s important to detect and remove outliers.

In this guide, we’ll explore some statistical techniques that are widely used for outlier detection and removal.

Why Should You Detect Outliers?

In the machine learning pipeline, data cleaning and preprocessing is an important step as it helps you better understand the data. During this step, you deal with missing values, detect outliers, and more.

As outliers are very different values—abnormally low or abnormally high—their presence can often skew the results of statistical analyses on the dataset. This could lead to less effective and less useful models.

But dealing with outliers often requires domain expertise, and none of the outlier detection techniques should be applied without understanding the data distribution and the use case.

For example, in a dataset of house prices, if you find a few houses priced at around $1.5 million—much higher than the median house price, they’re likely outliers. However, if the dataset contains a significantly large number of houses priced at $1 million and above—they may be indicative of an increasing trend in house prices. So it would be incorrect to label them all as outliers. In this case, you need some knowledge of the real estate domain.

The goal of outlier detection is to remove the points—which are truly outliers—so you can build a model that performs well on unseen test data. We’ll go over a few techniques that’ll help us detect outliers in data.

How to Detect Outliers Using Standard Deviation

When the data, or certain features in the dataset, follow a normal distribution, you can use the standard deviation of the data, or the equivalent z-score to detect outliers.

In statistics, standard deviation measures the spread of data around the mean, and in essence, it captures how far away from the mean the data points are.

For data that is normally distributed, around 68.2% of the data will lie within one standard deviation from the mean. Close to 95.4% and 99.7% of the data lie within two and three standard deviations from the mean, respectively.

Let’s denote the standard deviation of the distribution by σ, and the mean by μ.

One approach to outlier detection is to set the lower limit to three standard deviations below the mean (μ - 3σ), and the upper limit to three standard deviations above the mean (μ + 3σ). Any data point that falls outside this range is detected as an outlier.

As 99.7% of the data typically lies within three standard deviations, the number of outliers will be close to 0.3% of the size of the dataset.

Code for Outlier Detection Using Standard Deviation

Now, let's create a normally-distributed dataset of student scores, and perform outlier detection on it.

As a first step, we’ll import the necessary modules.

import numpy as np
import pandas as pd
import seaborn as sns

Next, let’s define the function generate_scores() that returns a normally-distributed dataset of student scores containing 200 records. We’ll make a call to the function, and store the returned array in the variable scores_data.

def generate_scores(mean=60, std_dev=12, num_samples=200):
    np.random.seed(27)
    scores = np.random.normal(loc=mean, scale=std_dev, size=num_samples)
    scores = np.round(scores, decimals=0)
    return scores

scores_data = generate_scores()

You can use Seaborn’s displot() function to visualize the data distribution. In this case, the dataset follows a normal distribution, as seen in the figure below.

sns.set_theme()
sns.displot(data=scores_data).set(title="Distribution of Scores", xlabel="Scores")

Figure 1: Normal Distribution of Scores

Next, let's load the data into a Pandas dataframe for further analysis.

df_scores = pd.DataFrame(scores_data,columns=['score'])

To obtain the mean and standard deviation of the data in the dataframe df_scores, you can use the .mean() and the .std() methods, respectively.

df_scores.mean()
# Output
score    61.005
dtype: float64

df_scores.std()
# Output
score    11.854434
dtype: float64

As discussed earlier, set the lower limit (lower_limit) to be three standard deviations below the mean, and the upper limit (upper_limit) to be three standard deviations above the mean.

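# Work on the 'score' column so that the limits are scalars.
# ddof=0 (population standard deviation) is consistent with the limits printed below.
lower_limit = df_scores['score'].mean() - 3*df_scores['score'].std(ddof=0)
upper_limit = df_scores['score'].mean() + 3*df_scores['score'].std(ddof=0)
print(lower_limit)
print(upper_limit)
# Output
25.530716709142666
96.47928329085734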

Now that you’ve defined the lower and upper limits, you may filter the dataframe df_scores to only retain the data points in the interval [lower_limit, upper_limit], as shown below.

df_scores_filtered = df_scores[(df_scores['score'] > lower_limit) & (df_scores['score'] < upper_limit)]
print(df_scores_filtered)
# Output
     score
0     75.0
1     56.0
2     67.0
3     65.0
4     63.0
..     ...
194   42.0
195   76.0
196   67.0
197   74.0
199   53.0

[198 rows x 1 columns]

From the output above, you can see that two records have been removed, and df_scores_filtered contains 198 records.
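If you want to reuse this rule on other columns, the steps can be wrapped in a small helper. The sketch below is only an illustration: the function name filter_by_std and the toy scores are assumptions, and it uses pandas' default sample standard deviation (ddof=1), so its limits can differ slightly from limits computed with the population formula.

```python
import pandas as pd

def filter_by_std(values, k=3.0):
    """Keep values strictly within k sample standard deviations of the mean."""
    mean = values.mean()   # scalar, because `values` is a single Series
    std = values.std()     # pandas default is the sample std (ddof=1)
    lower, upper = mean - k * std, mean + k * std
    return values[(values > lower) & (values < upper)]

# Hypothetical scores: 105 values between 50 and 70, plus one extreme value.
scores = pd.Series(list(range(50, 71)) * 5 + [150], dtype=float)

kept = filter_by_std(scores)
print(len(scores), len(kept))  # prints: 106 105
```

Passing a single column (a Series) rather than the whole dataframe keeps the limits as plain numbers, which is what the comparison operators expect.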

How to Detect Outliers Using the Z-Score

Now let's explore the concept of the z-score. For a normal distribution with mean μ and standard deviation σ, the z-score for a value x in the dataset is given by:

z = (x - μ)/σ

From the above equation, we have the following:

When x = μ, the value of z-score is 0.

When x = μ ± 1σ, μ ± 2σ, or μ ± 3σ, the z-score is ± 1, ± 2, or ± 3, respectively.

Notice how this technique is equivalent to the scores based on standard deviation we had earlier. Under this transformation, all data points that lie below the lower limit, μ - 3*σ, now map to points that are less than -3 on the z-score scale.

Similarly, all points that lie above the upper limit, μ + 3*σ map to a value above 3 on the z-score scale. So [lower_limit, upper_limit] becomes [-3, 3].
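To see this equivalence in action, here is a small sketch on made-up numbers (the array x is an assumption for illustration only). It builds one mask from z-scores and another from the μ ± 3σ limits, and checks that they select the same points.

```python
import numpy as np

# Hypothetical data: 50 ordinary values plus one extreme value.
x = np.array([50, 52, 55, 58, 60, 61, 63, 65, 68, 70] * 5 + [150], dtype=float)

mu = x.mean()
sigma = x.std(ddof=1)  # sample standard deviation

# Mask 1: points whose z-score lies strictly between -3 and 3.
z = (x - mu) / sigma
inside_by_z = np.abs(z) < 3

# Mask 2: points strictly between the lower and upper limits.
inside_by_limits = (x > mu - 3 * sigma) & (x < mu + 3 * sigma)

print(np.array_equal(inside_by_z, inside_by_limits))
print(x[~inside_by_z])  # the points flagged as outliers
```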

Let’s use this technique on our dataset of scores.

Code for Outlier Detection Using Z-Score

Let's compute z-scores for all points in the dataset, and add z_score as a column to the dataframe df_scores.

df_scores['z_score'] = (df_scores['score'] - df_scores['score'].mean())/df_scores['score'].std()
df_scores.head()
# Output
   score   z_score
0   75.0  1.180571
1   56.0 -0.422205
2   67.0  0.505718
3   65.0  0.337005
4   63.0  0.168291

You can filter the dataframe df_scores to retain points whose z-scores are in the range [-3, 3], as shown below. The filtered dataframe contains 198 records, as expected.

df_scores_filtered = df_scores[(df_scores['z_score'] > -3) & (df_scores['z_score'] < 3)]
print(df_scores_filtered)
# Output
     score   z_score
0     75.0  1.180571
1     56.0 -0.422205
2     67.0  0.505718
3     65.0  0.337005
4     63.0  0.168291
..     ...       ...
194   42.0 -1.603198
195   76.0  1.264928
196   67.0  0.505718
197   74.0  1.096214
199   53.0 -0.675275

[198 rows x 2 columns]

The methods involving standard deviation and z-scores can be used only when the data set, or the feature that you are examining, follows a normal distribution.

Next, we’ll discuss two outlier detection techniques that can be used independently of the data distribution.

How to Detect Outliers Using the Interquartile Range (IQR)

In statistics, interquartile range or IQR is a quantity that measures the difference between the first and the third quartiles in a given dataset.

The first quartile is also called the one-fourth quartile, or the 25% quartile.

If q25 is the first quartile, it means 25% of the points in the dataset have values less than q25.

The third quartile is also called the three-fourth, or the 75% quartile.

If q75 is the three-fourth quartile, 75% of the points have values less than q75.

Using the above notations, IQR = q75 - q25.

Code for Outlier Detection Using Interquartile Range (IQR)

You can use the box plot, or the box and whisker plot, to explore the dataset and visualize the presence of outliers. The points that lie beyond the whiskers are detected as outliers.

You can generate box plots in Seaborn using the boxplot function.

sns.boxplot(data=scores_data).set(title="Box Plot of Scores")

Figure 2: Box Plot of Scores

Now, call the describe method on the dataframe df_scores.

df_scores.describe()
# Output
            score
count  200.000000
mean    61.005000
std     11.854434
min     20.000000
25%     54.000000
50%     62.000000
75%     67.000000
max     98.000000

We use the 25% and 75% quartile values from the above result to compute IQR, and subsequently set the lower and upper limits to filter df_scores.

IQR = 67 - 54
lower_limit = 54 - 1.5*IQR
upper_limit = 67 + 1.5*IQR
print(upper_limit)
print(lower_limit)
# Output
86.5
34.5

As a next step, filter the dataframe df_scores to retain records that lie in the permissible range.

df_scores_filtered = df_scores[(df_scores['score'] > lower_limit) & (df_scores['score'] < upper_limit)]
print(df_scores_filtered)
# Output
     score
0     75.0
1     56.0
2     67.0
3     65.0
4     63.0
..     ...
194   42.0
195   76.0
196   67.0
197   74.0
199   53.0

[192 rows x 1 columns]

As seen in the output, this method labels eight points as outliers, and the filtered dataframe is 192 records long.

You don't always have to call the describe method to identify the quartiles. You may instead use the percentile() function in NumPy. It takes in two arguments, a: an array or a dataframe, and q: a list of percentiles to compute.

The code cell below shows how you can calculate the first and the third quartiles using the percentile function.

# Pass only the 'score' column so that other columns do not affect the quartiles.
q25, q75 = np.percentile(a=df_scores['score'], q=[25, 75])
IQR = q75 - q25
print(IQR)
# Output
13.0
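Putting the pieces of this section together, the IQR rule can be sketched as one helper that returns both limits. The function name iqr_limits and the toy data are assumptions for illustration. The multiplier k defaults to the 1.5 used above.

```python
import numpy as np

def iqr_limits(values, k=1.5):
    """Return (lower_limit, upper_limit) based on the interquartile range."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Hypothetical data: the numbers 1 to 9 plus one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lower_limit, upper_limit = iqr_limits(data)
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(lower_limit, upper_limit)
print(outliers)
```

Note that np.percentile interpolates between data points by default, so quartiles computed by hand with a different convention may differ slightly.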

How to Detect Outliers Using Percentile

In the previous section, we explored the concept of interquartile range, and its application to outlier detection. You can think of percentile as an extension to the interquartile range.

As discussed earlier, the interquartile range method drops all points outside the range [q25 - 1.5*IQR, q75 + 1.5*IQR] as outliers. But removing outliers this way may not be the best choice when your observations have a wide distribution, because you may discard more points as outliers than you actually should.

Depending on the domain, you may want to widen the range of permissible values to estimate the outliers better. Next, let’s revisit the scores dataset, and use percentile to detect outliers.

Code for Outlier Detection Using Percentile

Let’s define a custom range that accommodates all data points that lie anywhere between 0.5 and 99.5 percentile of the dataset. To do this, set q = [0.5, 99.5] in the percentile function, as shown below.

# Pass only the 'score' column so that other columns do not affect the limits.
lower_limit, upper_limit = np.percentile(a=df_scores['score'], q=[0.5, 99.5])
print(upper_limit)
print(lower_limit)
# Output
91.035
28.955

Next, you may filter the dataframe using the lower and upper limits obtained from the previous step.

df_scores_filtered = df_scores[(df_scores['score'] > lower_limit) & (df_scores['score'] < upper_limit)]
print(df_scores_filtered)
# Output
     score
0     75.0
1     56.0
2     67.0
3     65.0
4     63.0
..     ...
194   42.0
195   76.0
196   67.0
197   74.0
199   53.0

[198 rows x 1 columns]

From the code cell above, you can see that there are two outliers, and the filtered dataframe has 198 data records.
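Since the right percentile cut-offs depend on the domain, it can help to make them parameters. The following is a minimal sketch under that assumption. The helper name percentile_filter and the evenly spaced toy data are made up for illustration.

```python
import numpy as np

def percentile_filter(values, lower_pct=0.5, upper_pct=99.5):
    """Keep values strictly between the two chosen percentiles."""
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    mask = (values > lower) & (values < upper)
    return values[mask], (lower, upper)

# Hypothetical data: the integers 0 to 200.
data = np.arange(0, 201, dtype=float)

kept, (lower_limit, upper_limit) = percentile_filter(data)
print(lower_limit, upper_limit)
print(len(data), len(kept))
```

Widening the range (for example, lower_pct=0.1 and upper_pct=99.9) labels fewer points as outliers, while narrowing it labels more.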

Conclusion

In this guide, we covered what outliers are, and why we need to detect them. We then went over the most common techniques for outlier detection.

Here’s a summary:

If the data, or feature of interest is normally distributed, you may use standard deviation and z-score to label points that are farther than three standard deviations away from the mean as outliers.

If the data is not normally distributed, you can use the interquartile range or percentile methods to detect outliers.

In addition, we discussed best practices in outlier detection. When a large fraction of the data is labeled as outliers, those points are probably not outliers at all and instead reflect a wider data distribution.

In applying all of the above techniques, it's also important to be aware of the current trend to identify how certain values are evolving, and check for permissible lower and upper limits using domain knowledge.

Statistics for Beginners – Top Stats Concepts to Know Before Getting into Data Science

Ibrahim Ogunbiyi — Fri, 10 Jun 2022 16:33:29 +0000

You've probably heard that statistics is the gateway to data science and that the data science map starts with stats.

Perhaps you've also heard from others that you have to learn statistics before learning data science. But then you ponder, "Since I'm not from a technical background like science, technology, engineering, or math (STEM), do I need to learn everything in statistics before getting into data science?" And those same people will tell you "Yes! You have to learn statistics."

Well, here's my answer: you don't need to learn all of statistics before beginning data science (though you do need to learn some fundamentals).

You can also learn as you go instead of wasting time learning statistics first before data science (that is, as you advance in your knowledge of data science, you can always learn more statistics concepts).

That being said, it is helpful to know statistics basics before jumping into data science. You can indeed say that stats is the gateway to data science because it will help you to have some intuition about your data and how to work with it.

In this article, we'll look at the top statistical concepts you need to know before diving into data science. I'll make it as simple as possible even if you don't come from a technical background. I can tell you're excited and ready to dive into the realm of data science. Let's get started.

What is Statistics?

According to economist and sampling technique pioneer Arthur Lyon Bowley, Statistics is:

"numerical statements of facts in any department of inquiry placed in relation to each other."

That basically means that statistics helps us comprehend our data and also helps us convey the results in that data to others.

Statistical methods (that is, the techniques employed in dealing with data in statistics) are classified into two types:

Descriptive Statistics

Inferential Statistics

Descriptive Statistics is a discipline of statistics that assists us in summarizing data through numerical values or graphical visualization.

Descriptive statistics helps us identify and understand some key properties in our data. It includes concepts such as central tendency, dispersion, boxplots, histograms, and so on, which we'll discuss later in the article.

Inferential Statistics, on the other hand, is a branch of statistics that helps us make decisions or predictions based on the data that we have gathered.

Inferential statistics is a significantly more advanced topic because it requires a deep understanding of descriptive statistics. It includes concepts such as hypothesis, probability, and so forth.

Top Statistical Concepts to Know Before Learning Data Science

Since you're now familiar with the definition of statistics, let's have a look at some of the concepts you'll need to know in statistics that'll help guide you when you dive into the realm of statistics.

Among the most fundamental concepts are:

What is a Subject?

This is the specific thing we wish to observe. It could be a person, an animal, or something else. It is also known as observation.

What is a Population?

Population refers to the entire set of subjects in which we are interested (that is, that we want to observe). Assume you wish to count the number of females in a specific country.

What is a Sample?

In reality, observing a population is hardly an ideal situation (because it can be very expensive to perform, and also time-consuming).

Consider the following scenario: you wish to observe every female in the world. This type of observation can be costly to carry out. However, in statistics, we have something called a sample, which is a portion/subset of the population that you want to study. We can now make a decision (inferential statistic) about the full population using the sample.

What are Parameters?

This is a property/summary of a population. Consider the following scenario: you are observing the entire country and you discover that 90% of the inhabitants are males while 10% are females. The numerical values, 90%, and 10% are a numerical summary (that is, descriptive statistics) of the entire population. As a result, the summary is known as the population parameter.

What is a Statistic?

On the other hand, a statistic (singular, not to be confused with statistics, the field) is a property of a sample. As stated in the preceding example, instead of working with the full population, we work with samples, so a numerical summary computed from a sample is referred to as a statistic of the sample.
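The difference between a parameter and a statistic can be sketched with a simulated population. Everything here is an assumption for illustration: the population of 10,000 heights, the mean of 165 cm, and the sample size of 100.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical population: 10,000 heights in centimeters.
population = [rng.gauss(165, 7) for _ in range(10_000)]
parameter = mean(population)          # property of the whole population

# A sample is a subset of the population.
sample = rng.sample(population, 100)
statistic = mean(sample)              # property of the sample only

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two numbers are usually close but not identical, which is why the sample mean is treated as an estimate of the population mean.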

Hopefully you now have a decent understanding of what population, sample, statistic, and parameters are. Let's take a look at another concept with which we are all too familiar: "Data".

Data, as the term implies, represents factual information. That is, it conveys a message to us. It can, however, be divided into two categories:

Quantitative data.

Qualitative data.

What is Quantitative Data?

This is also known as numerical data. These data are a sort of data in which numerical values can be counted or measured. Quantitative data can be further classified into two types:

Quantitative discrete data: These are numerical data that can be counted but cannot be measured. Counting the number of shoes in a shoe store is a common example.

Quantitative continuous data: This is a type of numerical data that is based on measurement. For example, measuring the weight of a glass cylinder is continuous, not discrete.

What is Qualitative Data?

These are sorts of data that represent categories or groups of data. They are also known as categorical data. They are usually written in text. They can be characteristics, names, or anything else.

A common example is a person's name, dog breeds, and so on. However, there are some data that appear to be numerical data but are encoded as categorical data.

For example, suppose you wanted to group a certain group of people based on their age and discovered that the lowest and highest ages are 10 and 60, respectively. You then divided the ages into 5 categories (10-20, 21-30, 31-40, 41-50, 51-60) and assigned numerical values to each of those categories where 1 represents 10-20, 2 represents 21-30, and so on.

In this situation, the numerical values will be handled as categorical data rather than quantitative data. As your data science career progresses, you will learn how to work with categorical data.
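The age-group encoding described above can be sketched as follows. The helper name age_group and the list of ages are assumptions for illustration, and whole-number ages are assumed.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of groups 1 to 4 (10-20, 21-30, 31-40, 41-50); group 5 is 51-60.
UPPER_EDGES = [20, 30, 40, 50]

def age_group(age):
    """Map a whole-number age between 10 and 60 to a category code from 1 to 5."""
    if not 10 <= age <= 60:
        raise ValueError("age outside the 10-60 range used in the example")
    return bisect_left(UPPER_EDGES, age) + 1

ages = [10, 20, 21, 35, 47, 51, 60]   # hypothetical ages
codes = [age_group(a) for a in ages]

print(codes)            # [1, 1, 2, 3, 4, 5, 5]
print(Counter(codes))   # counting the labels is meaningful; averaging them is not
```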

Now you know the categories of data. Quantitative and qualitative data can be treated in statistics using these levels of measurement. Data in statistics can be classified into 4 levels of measurement which are:

Nominal scale data

Ordinal Scale data

Interval Scale data

Ratio Scale data

Qualitative data can be measured using:

Nominal scale data: These are the type of categorical data that do not have an ordered sense. That is, they cannot be ordered.

Each piece of data represents a single unit. An example of such categorical data includes color. It is not very ideal to rank blue over yellow. When working with nominal data, each data point must be handled as a separate unit.

Ordinal Scale data: Ordinal Scale data consists of ordered categorical data. When data is ranked, there is a sense of order in it. A survey response such as excellent, good, satisfactory, and unsatisfactory is an example of this. It makes sense to rank excellent above good.

Quantitative data can be measured using:

Interval Scale Data: These are numerical data with ordering and can be measured (for example find the difference between the data). The readings on a temperature scale are an example of interval data.

For example, you can measure the difference between 4 and 10 degrees Celsius, and 10 degrees is higher than 4 degrees. However, there are two exceptions for interval scale data:

It does not have a starting point (that is, it does not begin from zero and you can have a temperature value below zero)

You can't take meaningful ratios: for example, it makes no sense to claim that 80 degrees Celsius is 4 times as hot as 20 degrees Celsius.

Ratio Scale data: These are numerical data that have the features of interval scale data (that is they may be ordered and measured), but also solve the exception of interval scale data (they have a starting point, and also you can find the ratio between them).

Grade scores of 20, 68, 90, or 80 are an example. We can order them, measure the differences between them, and find the ratio between the values. It makes sense to say that a score of 80 is 4 times a score of 20.

Now that we've covered the fundamentals of data, let's look at how the first category of statistics (descriptive statistics) can be applied to data.

As previously stated, descriptive statistics involves summarizing data either numerically or graphically. Let's take a look at some of the most typical numerical and graphical summaries you'll encounter when dealing with data on a regular basis.

Mean vs Median vs Mode – What is the Difference?

Mean, Median, and Mode explained through illustration. Mode is the high point, Median is the half way point, and Mean is the average.

What is a Mean?

When we have a set of numerical data like this (4, 5, 6, 7, 10), each value in the set of data is referred to as a data point. We might want to find the data's average value.

So mean is essentially the average of a set of data and is calculated as the sum of all the data points divided by the total number of data points.

In our data set above, the sum is 32 and the total number of data points is 5. So the average, that is the mean, is 32 / 5 = 6.4.

The mean is only used on numerical data. Finding the average of categorical data is not meaningful.

What is a Median?

Also, given a group of values, we may want to discover the value in the center. The median is used to compute the value in the middle. Median also is used on numerical data only.

What is a Mode?

This is the value with the highest frequency (that is a value that has the highest number of occurrences). The mode can be used for numerical or categorical data.
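Here is a short sketch of all three measures using Python's built-in statistics module. The numerical data is the set used above, while the list of colors is a made-up example of categorical data.

```python
from statistics import mean, median, mode

data = [4, 5, 6, 7, 10]
print(mean(data))    # 6.4 (sum of 32 divided by 5 data points)
print(median(data))  # 6 (the middle value of the sorted data)

# Mode also works on categorical data (hypothetical example).
colors = ["blue", "red", "blue", "green"]
print(mode(colors))  # blue (the most frequent value)
```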

What is an Outlier?

Outliers are data points that differ from other data points and, when present, can lead us to incorrect conclusions. Here's a typical example of how outliers are harmful.

Consider the following scenario: you have a machine that counts how many customers enter your supermarket every day, and the readings are thus for a given week (20, 23, 26, 27, 302). We can see that the number 302 is an outlier because it deviates significantly from the other data points.

Outliers could have resulted from a sudden change, machine faults, or other circumstances. However, when they are present, they can lead us to make incorrect decisions. For example, if you want to find the average number of customers who visit your supermarket, the value 302 may lead you to an incorrect result. The mean of the preceding values is 79.6, far above the 20 to 27 customers seen on the other days.
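A quick sketch with the supermarket readings shows how strongly a single outlier pulls the mean, while the median barely moves. It uses only Python's built-in statistics module.

```python
from statistics import mean, median

visits = [20, 23, 26, 27, 302]

print(mean(visits))         # 79.6, pulled up by the single reading of 302
print(median(visits))       # 26, the middle value, unaffected by the outlier
print(mean(visits[:-1]))    # 24, the mean once the outlier is left out
```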

What is a Standard Deviation?

A standard deviation is a summary value that indicates how far data points typically deviate from the mean. It is used to determine the spread of our data.

The closer the standard deviation is to zero, the closer our data points are to one another.

The standard deviation is an extremely valuable summary that can tell us whether we have outliers in our dataset. Here's how it works:

A chart of a Normal Distribution, with the number of standard deviations listed on the x axis.

In the above chart, we see a Normal Distribution. 34.1% + 34.1% = 68.2% of all observations are within one standard deviation, or 1σ (pronounced one Sigma).

A further 13.6% + 13.6% = 27.2% of observations lie between one and two standard deviations from the mean, so about 95.4% of all observations fall within two standard deviations, or 2σ. And so on.

And yes, if you've heard of Six Sigma, that is a concept in engineering where six standard deviations' worth of possibilities are accounted for in the quality assurance process. That means you are accounting for all but the most extreme outliers: 99.99966% of all possibilities, to be exact.
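The percentages for one, two, and three standard deviations can be checked with a few lines of Python. This sketch assumes a perfectly normal distribution and uses the standard identity that the share of values within k standard deviations of the mean equals erf(k / √2). The function name share_within is made up for this example.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.2%}")
```

The result for one standard deviation is about 68.27%. The chart's 68.2% comes from adding the two rounded halves of 34.1%.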

Now that we've grasped some numerical summaries, let's take a look at some common graphical summaries.

What is a Bar Chart?

A bar chart is a type of data visualization used for categorical data. You use it to graphically show the frequency of categorical data (that is the number of times a categorical data point occurs). Here's an example:

What is a Histogram?

A histogram is similar to a bar chart in that the height of each bar shows a frequency, but it is used for numerical data and groups the data points into bins or ranges.

It is a very efficient visualization tool because it helps you visualize the distribution of your numerical data. You can read more here to learn more about histograms.

What is a Boxplot?

Another excellent visualization that helps you visualize the distribution of your data is the boxplot.

A boxplot, for example, allows you to visually observe if there are any outliers in your data collection. It includes terms such as minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. A Boxplot looks as follows:

Image by Ibrahim Ogunbiyi

So let’s go over what we have in the above diagram:

Minimum: The minimum value does not imply the smallest value in our dataset. It is calculated using this formula ( Q1 -1.5*IQR) where:

Q1 – implies The 25th percentile

IQR – implies the Interquartile range (which is the difference between the 75th percentile and the 25th percentile).

This lower limit helps us detect data points that lie far below the other observed values.

For instance, assuming our data points are spread like these [345, 402, 295, 386, 10]. We can see that the value 10 is also an outlier because it is a lower value that is far below other observations.

The 25th percentile is a value that tells us that 25% of our data points are below that value and 75% of our data points are above that value. The 25th percentile is also known as the first quartile.

The 50th percentile is a value that indicates that 50% of our data points are below that value and the remaining 50% are above that value. It is also known as the second quartile.

The 75th percentile is a value that tells us that 75 percent of our data points are below that value and the remaining 25 percent are above it. It is also known as the third quartile.

Maximum: Also like the minimum, the maximum does not imply the highest value in the dataset. It is calculated using the formula (Q3 + 1.5*IQR) where:

Q3 – implies the 75th percentile

IQR implies Interquartile Range (which is the difference between the 75th percentile and the 25th percentile).

This upper limit helps us detect data points that lie far above the other observed values.

For instance, assuming our data points are spread like these [645, 40, 25, 38, 42]. We can see that the value 645 is also an outlier because it is a higher value that is far above other observations.
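The boxplot limits described above can be sketched in a few lines. This example applies them to the earlier low-outlier data [345, 402, 295, 386, 10]. The helper name whisker_limits is an assumption, and the quartiles come from statistics.quantiles with method="inclusive". Other quartile conventions may give slightly different limits.

```python
from statistics import quantiles

def whisker_limits(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR), the boxplot 'minimum' and 'maximum'."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [345, 402, 295, 386, 10]
low, high = whisker_limits(data)

# Any point outside the limits is flagged as an outlier.
flagged = [v for v in data if v < low or v > high]

print(low, high)   # 158.5 522.5
print(flagged)     # [10]
```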

We've seen some graphical summaries of what we'll be dealing with on a daily basis. Let's look at the final topic we will discuss in this article:

What is the Association Between Quantitative Variables?

Variables are any values (alphabetical or numerical, but typically alphabetical) that represent a collection of observations. It is sometimes referred to as a column in a table.

Two variables are said to be associated if a specific value of one variable is most likely to occur with a specific value of another variable.

To study the association between two quantitative variables (often referred to as correlation), we calculate it using the Karl Pearson formula, and the result is between -1 and +1.

If the correlation value approaches 1, it indicates that the two variables are positively correlated (that is, as one variable increases the other variable increases as well). If the value approaches -1, it indicates that the variables are negatively correlated (that is, as one variable increases, the other variable decreases). Finally, if the correlation value is 0, there is no linear correlation between the variables.

You can read more here to know more about correlation and Karl Pearson formula
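As a rough sketch, the Pearson correlation coefficient can be computed directly from its definition: the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The function name pearson_r and the study-hours data are made up for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator terms).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]            # hypothetical hours studied
scores = [2, 4, 6, 8, 10]          # rises with hours
errors = [10, 8, 6, 4, 2]          # falls as hours rise

r_pos = pearson_r(hours, scores)
r_neg = pearson_r(hours, errors)
print(r_pos, r_neg)
```

Here the first pair is perfectly positively correlated and the second perfectly negatively correlated, so the results sit at the two ends of the -1 to +1 range.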

What is a Scatter Plot?

We can represent the correlation between quantitative variables in a graphical summary by using a plot called a scatter plot.

A scatter plot looks like this:

Scatter (XY) Plots (mathsisfun.com)

To learn about scatter plots you can read more here.

Conclusion and Learning More

In this tutorial, we've explored some fundamental statistics concepts that will help you work more efficiently with your data.

But the learning does not stop here – there are a few fundamental topics that you must be familiar with. Because this is only the beginning, you can delve deeper by consulting online resources or textbooks.

Thank you very much for reading, and please share the article so that beginners who want to go into data science can learn as well.

What is an Outlier? Definition and How to Find Outliers in Statistics

Dionysia Lemonaki — Tue, 24 Aug 2021 20:32:36 +0000

Outliers are an important part of a dataset. They can hold useful information about your data.

Outliers can give helpful insights into the data you're studying, and they can have an effect on statistical results. This can potentially help you discover inconsistencies and detect any errors in your statistical processes.

So, knowing how to find outliers in a dataset will help you better understand your data.

There are a few different ways to find outliers in statistics.

This article will explain how to detect numeric outliers by calculating the interquartile range.

I give an example of a very simple dataset and how to calculate the interquartile range, so you can follow along if you want.

Let's get started!

What is an Outlier in Statistics? A Definition

In simple terms, an outlier is a data point that is extremely high or extremely low relative to the other values in the dataset or graph you're working with.

Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.

Below, on the far left of the graph, there is an outlier.

The value in the month of January is significantly less than in the other months.

How to Identify an Outlier in a Dataset

Alright, how do you go about finding outliers?

An outlier has to satisfy either of the following two conditions:

outlier < Q1 - 1.5(IQR)
outlier > Q3 + 1.5(IQR)
The rule for a low outlier is that a data point in a dataset has to be less than Q1 - 1.5xIQR.

This means that a data point needs to fall more than 1.5 times the Interquartile range below the first quartile to be considered a low outlier.

The rule for a high outlier is that if any data point in a dataset is more than Q3 + 1.5xIQR, it's a high outlier.

More specifically, the data point needs to fall more than 1.5 times the Interquartile range above the third quartile to be considered a high outlier.

As you can see, there are certain individual values you need to calculate first in a dataset, such as the IQR. But to find the IQR, you need to find the so-called first and third quartiles, which are Q1 and Q3 respectively.

So, let's see what each of those does and break down how to find their values in both an odd and an even dataset.

How to Find the Upper and Lower Quartiles in an Odd Dataset

To get started, let's say that you have this dataset:

25,14,6,5,5,30,11,11,13,4,2
The first step is to sort the values in ascending numerical order, from smallest to largest number.

2,4,5,5,6,11,11,13,14,25,30
The lowest value (MIN) is 2 and the highest (MAX) is 30.

How to calculate Q2 in an odd dataset

The next step is to find the median or quartile 2 (Q2).

This particular set of data has an odd number of values, with a total of 11 scores all together.

To find the median in a dataset means that you're finding the middle value – the single middle number in the set.

In odd datasets, there is only one middle number.

Since there are 11 values in total, an easy way to do this is to split the set in two equal parts with each side containing 5 values.

The median value will have 5 values on one side and 5 values on the other.

(2,4,5,5,6), 11 ,(11,13,14,25,30)

The median is 11 as it is the number that separates the first half from the second half.

An alternative way to double check if you're right is to do this:

(total_number_of_scores + 1) / 2.

This is (11 + 1) /2 = 6, which means you want the number in the 6th place of this set of data – which is 11.

So Q2 = 11.

How to calculate Q1 in an odd dataset

Next, to find the lower quartile, Q1, we need to find the median of the first half of the dataset, which is on the left hand side.

As a reminder, the initial dataset is:

(2,4,5,5,6), 11 ,(11,13,14,25,30)

The first half of the dataset, or the lower half, does not include the median:

2,4,5,5,6
This time, there is again an odd set of scores – specifically there are 5 values.

You want to split this half in two again, with two values on each side. That leaves a single number in the middle of the 5 values.

Pick the middle value that stands out:

(2,4),5,(5,6)

In this case it's Q1 = 5.

To double check, you can also do (total_number_of_values + 1) / 2, similar to the previous example:

(5 + 1) /2 = 3.

This means you want the number in the 3rd place, which is 5.

How to calculate Q3 in an odd dataset

To find the upper quartile, Q3, the process is the same as for Q1 above. But in this case you take the second half on the right hand side of the dataset, above the median and without the median itself included:

(2,4,5,5,6), 11 ,(11,13,14,25,30)

11,13,14,25,30
You split this half of the odd set of numbers into another half to find the median and subsequently the value of Q3.

You again want the number in the 3rd place like you did for the first half.

(11,13),14,(25,30)

So Q3 = 14.

How to calculate IQR in an odd dataset

Now, the next step is to calculate the IQR which stands for Interquartile Range.

This is the difference/distance between the lower quartile (Q1) and the upper quartile (Q3) you calculated above.

As a reminder, the formula to do so is the following:

IQR = Q3 - Q1
To find the IQR of the dataset from above:

IQR = 14 - 5
IQR = 9
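
The quartile steps above can be sketched as a small Python function. The helper name `quartiles` is hypothetical, and it deliberately follows the median-of-halves method used in this article (the median is left out of both halves when the count is odd). Library functions such as NumPy's `percentile` interpolate between values by default, so they can return slightly different quartiles for the same data.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # For an odd count, skip the median itself; for an even count, keep everything.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

q1, q2, q3 = quartiles([25, 14, 6, 5, 5, 30, 11, 11, 13, 4, 2])
iqr = q3 - q1

print(q1, q2, q3)  # 5 11 14
print(iqr)         # 9
```
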
How to find an outlier in an odd dataset

To recap so far, the dataset is the one below:

2,4,5,5,6,11,11,13,14,25,30
and so far, you have calculated the five-number summary:

MIN = 2
Q1 = 5
MED = 11
Q3 = 14
MAX = 30
Finally, let's find out if there are any outliers in the dataset.

As a reminder, an outlier must fit the following criteria:

outlier < Q1 - 1.5(IQR)
Or

outlier > Q3 + 1.5(IQR)
To see if there is a lowest value outlier, you need to calculate the first part and see if there is a number in the set that satisfies the condition.

outlier < Q1 - 1.5(IQR)
outlier < 5 - 1.5(9)
outlier < 5 - 13.5
outlier < -8.5
There are no lower outliers, since there isn't a number less than -8.5 in the dataset.

Next, to see if there are any higher outliers:

outlier > Q3 + 1.5(IQR)
outlier > 14 + 1.5(9)
outlier > 14 + 13.5
outlier > 27.5
And there is a number in the dataset that is more than 27.5:

2,4,5,5,6,11,11,13,14,25,30

In this case, 30 is the outlier in the existing dataset.
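
The outlier check above can be sketched in a few lines of Python. This is only an illustration: the quartile values are copied from the worked example, and the names `lower_fence` and `upper_fence` are made up for readability.

```python
data = [2, 4, 5, 5, 6, 11, 11, 13, 14, 25, 30]
q1, q3 = 5, 14  # quartiles found in the steps above
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything strictly below the lower fence or above the upper fence counts as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)  # -8.5 27.5
print(outliers)                  # [30]
```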

How to Find the Upper and Lower Quartiles in an Even Dataset

What happens when you have a dataset that consists of an even set of data?

There isn't just one stand-out median (Q2), nor is there a stand-out lower quartile (Q1) or stand-out upper quartile (Q3).

So the process of calculating quartiles and then finding an outlier is a bit different.

How to calculate Q2 in an even dataset

Say that you have this dataset with 8 numbers:

10,15,20,26,28,30,35,40

This time, the numbers are already sorted from lowest to highest value.

To find the median number in an even dataset, you need to find the value that would be in between the two numbers that are in the middle. You add them together and divide them by 2, like so:

10,15,20,26,28,30,35,40

26 + 28 = 54
54 / 2 = 27
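
A minimal Python sketch of the same calculation, assuming the list is already sorted as it is here. The variable names are made up for the example, and `statistics.median` is included only as a cross-check.

```python
import statistics

data = [10, 15, 20, 26, 28, 30, 35, 40]  # already sorted

# With an even count, average the two values on either side of the midpoint.
mid = len(data) // 2
q2 = (data[mid - 1] + data[mid]) / 2

print(q2)                       # 27.0
print(statistics.median(data))  # 27.0
```
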
How to calculate Q1 in an even dataset

To calculate the upper and lower quartiles in an even dataset, you keep all the numbers in the dataset (whereas in the odd dataset you left out the median).

This time, the dataset is cut in half.

10,15,20,26 | 28,30,35,40

To find Q1, you split the first half of the dataset into another half which leaves you with a remaining even set:

10,15 | 20,26

To find the median of this half, you add the two numbers in the middle and divide the result by two:

Q1 = (15 + 20) / 2
Q1 = 35 / 2
Q1 = 17.5
How to calculate Q3 in an even dataset

To find Q3, you need to focus on the second half of the dataset and split that half into another half:

28,30,35,40 -> 28,30 | 35,40

The two numbers in the middle are 30 and 35.

You add them and divide them by two, and the result is:

Q3 = (30 + 35) / 2
Q3 = 65 / 2
Q3 = 32.5
How to calculate the IQR in an even dataset

The formula for calculating IQR is exactly the same as the one we used to calculate it for the odd dataset.

IQR = Q3 - Q1
IQR = 32.5 - 17.5
IQR = 15
How to find an outlier in an even dataset

As a recap, so far the five number summary is the following:

MIN = 10
Q1 = 17.5
MED = 27
Q3 = 32.5
MAX = 40
To calculate any outliers in the dataset:

outlier < Q1 - 1.5(IQR)
Or

outlier > Q3 + 1.5(IQR)
To find any lower outliers, you calculate Q1 - 1.5(IQR) and see if there are any values less than the result.

outlier < 17.5 - 1.5(15)
outlier < 17.5 - 22.5
outlier < -5
There aren't any values in the dataset that are less than -5.

Finally, to find any higher outliers, you calculate Q3 + 1.5(IQR) and see if there are any values in the dataset that are higher than the result.

outlier > 32.5 + 1.5(15)
outlier > 32.5 + 22.5
outlier > 55
There aren't any values higher than 55 so this dataset doesn't have any outliers.
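
The whole even-dataset walkthrough can be sketched in Python as follows. This is an illustrative sketch using the median-of-halves method from this article; the variable names are made up, and the list is assumed to be sorted already.

```python
import statistics

data = [10, 15, 20, 26, 28, 30, 35, 40]  # already sorted, even count

# With an even count, both halves keep all of their values.
half = len(data) // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[half:])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 17.5 32.5 15.0
print(lower_fence, upper_fence)  # -5.0 55.0
print(outliers)                  # []
```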

Conclusion

In this article you learned how to find the interquartile range in a dataset and in that way calculate any outliers.

If you are interested in learning more about Statistics and the basics of Data Science, check out this free 8-hour University course on freeCodeCamp's YouTube channel.

Thank you for reading and happy learning.

Skewness and Kurtosis – Positively Skewed and Negatively Skewed Distributions in Statistics Explained

freeCodeCamp — Wed, 16 Jun 2021 20:56:49 +0000

By Rishit Dagli

In this article, I'll explain two important concepts in statistics: skewness and kurtosis. And don't worry – you won't need to know very much math to understand these concepts and learn how to apply them.

What are Density Curves?

Let's first talk a bit about density curves, as skewness and kurtosis are based on them. They're simply a way for us to represent a distribution. Let's see what I mean through an example.

Say that you need to record the heights of a lot of people. So your distribution has let's say 20 categories representing the range of the output (58-59 in, 59-60 in ... 78-79). You can plot a histogram representing these categories and the number of people whose height falls in each category.

Histogram of height vs population

Well, you might do this for thousands of people, so you are not interested in the exact number – rather the percentage or probability of these categories.

I also explicitly mentioned that you have a rather large distribution since percentages are often useless for smaller distributions.

If you use percentages with smaller numbers I often refer to it as lying with statistics – it's a statement that is technically correct but creates the wrong impression in our minds.

Let me give you an example: a student is extremely excited and tells everyone in his class that he made a 100% improvement in his marks! But what he doesn't say is that his marks went from a 2/30 to 4/30 😂.

I hope you now clearly see the problem of using percentages with smaller numbers.

Coming back to density curves, when you are working with a large distribution you want to have more granular categories. So you split each category that was 1 inch wide into 2 categories, each \(\frac{1}{2}\) inch wide. Maybe you want to get even more granular and start using \(\frac{1}{4}\) inch wide categories. Can you guess where I am going with this?

At a point, we get an infinite number of such categories with an infinitely small length. This allows us to create a curve from this histogram which we had earlier divided into discrete categories. See our density curve below drawn from the histogram.

Probability density curve for our distribution

Why go through the effort?

Great question! As you may have guessed, I like to explain myself with examples, so let's look at another density curve to make it a bit easier for us to understand. Feel free to skip the curve equation at this stage if you have not worked with distributions before.

You can also follow along and create the graphs and visualizations in this article yourself through this Geogebra project (it runs in the browser).

$$ f(x) = \frac{1}{0.4 \sqrt{2 \pi} } \cdot e^{-\frac{1}{2} (\frac{x - 1.6}{0.4})^2} $$

So now what if I ask you "What percent of my distribution is in the category 1 - 1.6?" Well, you just calculate the area under the curve between 1 and 1.6, like this:

$$ \int_{1}^{1.6} f(x) \,dx $$

It would also be relatively easy for you to answer similar questions from the density curve like: "What percent of the distribution is under 1.2?" or "What percent of the distribution is above 1.2?"
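
If you would rather compute these areas than read them off a graph, here is a minimal sketch. It assumes the curve above is a normal density with mean 1.6 and standard deviation 0.4, which is what the equation describes, and it uses `statistics.NormalDist` from the Python standard library (Python 3.8+). The variable names are made up for the example.

```python
from statistics import NormalDist

curve = NormalDist(mu=1.6, sigma=0.4)

# The area under the curve between two points is the difference of the CDF values.
between_1_and_1_6 = curve.cdf(1.6) - curve.cdf(1.0)
below_1_2 = curve.cdf(1.2)
above_1_2 = 1 - below_1_2

print(round(between_1_and_1_6, 4))  # about 0.4332
print(round(below_1_2, 4))          # about 0.1587
print(round(above_1_2, 4))          # about 0.8413
```

So roughly 43% of the distribution falls in the 1 to 1.6 category.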

You can now probably see why the effort of making a density curve is worth it and how it allows you to make inferences easily 🚀.

Skewed Distributions

Let's now talk a bit about skewed distributions – that is, those that are not as pleasant and symmetric as the curves we saw earlier. We'll talk about this more intuitively using the ideas of mean and median.

Looking at this density curve, try figuring out where the median of this distribution would be. Perhaps it was easy for you to figure out: the curve is symmetric about \(x = 1.6\), so you might have concluded that the median is 1.6.

Another way to go about this would be to say that the median is the value where the area under the curve to the left of it and the area under the curve to the right of it are equal.

We're talking about this idea since it allows us to also calculate the median for non-symmetric density curves.

As an example here, I show two very common skewed distributions and how the idea of equal areas we just discussed helps us find their medians. If we tried eyeballing our median, this is what we'd get since we want the areas on either side to be equal.

Eyeballing the median for skewed curves

You can also calculate the mean through these density curves. Maybe you've tried calculating the mean yourself already, but notice that if you use the general formula to calculate the mean:

$$ mean = \frac{\sum a_n}{n} $$
you might notice a flaw in it: we take into account the \(x\) values, but we also have probabilities associated with these values. And it just makes sense to factor those in too.

So we modify the way we calculate the mean by using weighted averages. We will now also have a term \(w_n\) representing the weight associated with each value, and we divide by the total weight rather than by the number of values:

$$ mean = \frac{\sum{a_n \cdot w_n}}{\sum{w_n}} $$
So, we will be using the idea we just discussed to calculate the mean from our density curve.
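
To make the weighted average concrete, here is a small sketch. The values and probabilities are made up purely for illustration (they do not come from the article's data), and the weighted sum is divided by the total weight, which is the usual definition of a weighted average.

```python
# Hypothetical discrete distribution: each value has an associated probability (weight).
values = [1, 2, 3]
weights = [0.2, 0.5, 0.3]

plain_mean = sum(values) / len(values)
weighted_mean = sum(a * w for a, w in zip(values, weights)) / sum(weights)

print(plain_mean)               # 2.0
print(round(weighted_mean, 2))  # 2.1
```

The weighted mean is pulled towards 2 and 3 because those values carry more probability than 1.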

You can also more intuitively understand this as the point on the x-axis where you could place a fulcrum and balance the curve if it was a solid object. This idea should help you better understand finding the mean from our density curve.

But another really interesting way to look at this would be as the x-coordinate of the point about which the rotational inertia of the curve would be at its smallest.

You might have already figured out how we can locate the mean for symmetric curves: our median and mean lie at the same point, the point of symmetry.

We will be using the idea we just discussed, placing a fulcrum on the x-axis and balancing the curve, to eyeball out the mean for skewed graphs like the ones we saw earlier while calculating the median.

We will soon discuss the idea of skewness in greater detail. But at this stage, generally speaking, you can identify the direction where your curve is skewed. If the median is to the right of the mean, then it is negatively skewed. And if the mean is to the right of median, then it is positively skewed.

Later in this article, for simplicity's sake we'll also refer to the narrow part of these curves as a "tail".

What are Moments?

Before we talk more about skewness and kurtosis let's explore the idea of moments a bit. Later we'll use this concept to develop an idea for measuring skewness and kurtosis in our distribution.

We'll use a small dataset, [1, 2, 3, 3, 3, 6]. These numbers mean that you have points that are 1 unit away from the origin, 2 units away from the origin, and so on.

So, we care a lot about the distances from the origin in our dataset. We can represent the average distance from the origin in our data by writing:

$$ \frac{\sum a_n -0}{n} = \frac{\sum a_n}{n} $$
This is what we call our first moment. Calculating this for our sample dataset we get 3 but if we change our dataset and make all elements equal to 3,

$$ [1, 2, 3, 3, 3, 6] \rightarrow [3, 3, 3, 3, 3, 3] $$
you'll see that our first moment remains the same. Can we devise something to differentiate our two datasets that have equal first moments? (PS: It's the second moment.)

We will calculate the average of the squared distances rather than the average of the distances:

$$ \frac{\sum (a_n)^2}{n} $$
Our second moment for our original dataset is 11.33 and for our new dataset is 9. Notice that the magnitude of the second moment is larger for our original dataset than the new one. Also, we have a higher value for the second moment in the original dataset because it is spread out and has a greater average squared distance.
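
You can reproduce these numbers with a short sketch. The helper name `raw_moment` is made up for this example; it simply averages the k-th power of each distance from the origin.

```python
def raw_moment(values, k):
    """Average of the k-th powers of the values (distances from the origin)."""
    return sum(x ** k for x in values) / len(values)

original = [1, 2, 3, 3, 3, 6]
flat = [3, 3, 3, 3, 3, 3]

print(raw_moment(original, 1), raw_moment(flat, 1))  # 3.0 3.0
print(round(raw_moment(original, 2), 2))             # 11.33
print(raw_moment(flat, 2))                           # 9.0
```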

Essentially we are saying that we have a couple of values in our original dataset larger than the mean value, which, when squared, increases our second moment by a lot.

Here's an interesting way of thinking about moments – assume our distribution is mass, and then the first moment would be the center of the mass, and the second moment would be the rotational inertia.

You can also see that our second moment is highly dependent on our first moment. But we are interested in knowing the information the second moment can give us independently.

To do so we calculate the squared distances from the mean or the first moment rather than from the origin.

$$ \frac{\sum (a_n- \mu_{1}^{'})^2 }{n} $$
Did you notice that we also intuitively derived a formula for variance? Going forward you will see how we use the ideas we just talked about to measure skewness and kurtosis.
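
As a sanity check, here is a minimal sketch showing that the second moment about the mean matches the population variance. The helper name `central_moment` is hypothetical, and `statistics.pvariance` from the standard library is used only for comparison.

```python
import statistics

def central_moment(values, k):
    """Average of the k-th powers of the distances from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

original = [1, 2, 3, 3, 3, 6]
flat = [3, 3, 3, 3, 3, 3]

print(round(central_moment(original, 2), 2))      # 2.33
print(round(statistics.pvariance(original), 2))   # 2.33
print(central_moment(flat, 2))                    # 0.0
```

Unlike the raw second moment, this version is 0 for the dataset where every element equals 3, so it only reflects the spread.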

Intro to Skewness and Kurtosis

Let's see how we can use the idea of moments we talked about earlier to figure out how we can measure skewness (which you already have some idea about) and kurtosis.

What is Skewness?

Let's take the idea of moments we talked about just now and try to calculate the third moment. As you might have guessed, we can calculate the cubes of our distances. But as we discussed above, we are more interested in seeing the additional information the third moment provides.

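So, instead of using the raw third moment, we measure the distances from the mean and divide by the cube of the standard deviation, which removes the effect of the first and second moments. Later, we will also refer to this as the adjustment to the moment. So our adjusted moment will look like this: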

$$ skewness = \frac{\sum (a_n - \mu)^3 }{n \cdot \sigma ^3} $$
This adjusted moment is what we call skewness. It helps us measure the asymmetry in the data.

Perfectly symmetrical data would have a skewness value of 0. A negative skewness value implies that a distribution has its tail on the left side, while a positive skewness value implies that the tail is on the right side.

Positive skew and negative skew
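
Here is a minimal sketch of the population skewness formula applied to the small dataset from the moments section. The function name `skewness` and the two extra datasets (`mirrored` and `symmetric`) are made up for illustration.

```python
def skewness(values):
    """Population skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    sigma = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum((x - mean) ** 3 for x in values) / (n * sigma ** 3)

original = [1, 2, 3, 3, 3, 6]    # long tail on the right
mirrored = [0, 3, 3, 3, 4, 5]    # the same data reflected about 3: tail on the left
symmetric = [1, 2, 3, 3, 4, 5]   # hypothetical symmetric dataset

print(round(skewness(original), 2))   # about 0.84
print(round(skewness(mirrored), 2))   # about -0.84
print(skewness(symmetric))            # 0.0
```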

At this stage, it might seem like calculating skewness would be pretty tough to do, since the formulas use the population mean \(\mu\) and the population standard deviation \(\sigma\), which we wouldn't have access to when working with a sample.

Instead, you only have the sample mean and the sample standard deviation, so we will soon see how you can use these.

What is Kurtosis?

As you might have guessed, this time we will calculate our fourth moment, using the fourth power of our distances. And as we discussed earlier, we are interested in the additional information this provides, so we apply the same kind of adjustment: we measure distances from the mean and divide by the fourth power of the standard deviation.

This is what we call kurtosis or a measure of whether our data has a lot of outliers or very few outliers. This will look like:

$$ kurtosis = \frac{\sum (a_n - \mu)^4 }{n \cdot \sigma ^4} $$
A better term for what's going on here is to figure out if the distribution is heavy-tailed or light-tailed. We can compare this to a normal distribution.

If you do a simple substitution, you'll see that the kurtosis of a normal distribution is 3. And since we are interested in comparing kurtosis to the normal distribution, we often use excess kurtosis, which simply subtracts 3 from the above equation.

Positive and negative kurtosis (Adapted from Analytics Vidhya)

This is us essentially trying to force the kurtosis of our normal distribution to be 0 for easier comparison. So, if our distribution has positive kurtosis, it indicates a heavy-tailed distribution while negative kurtosis indicates a light-tailed distribution. Graphically, this would look something like the image above.
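
The kurtosis formula and the "subtract 3" step can be sketched as follows. The function name `kurtosis` and the datasets `peaked` and `two_point` are hypothetical, chosen only to show a heavy-tailed and a light-tailed case next to the small dataset from the moments section.

```python
def kurtosis(values):
    """Population kurtosis: fourth central moment divided by sigma to the fourth."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    fourth = sum((x - mean) ** 4 for x in values) / n
    return fourth / variance ** 2  # sigma^4 is the variance squared

original = [1, 2, 3, 3, 3, 6]
peaked = [3, 3, 3, 3, 3, 3, 3, 3, 0, 6]  # mostly identical values plus two extremes
two_point = [1, 1, 5, 5]                 # no tails at all

for name, data in [("original", original), ("peaked", peaked), ("two_point", two_point)]:
    k = kurtosis(data)
    print(name, round(k, 2), "excess:", round(k - 3, 2))
```

For these inputs the excess kurtosis comes out at about 0 for `original` (which happens to match the normal benchmark), about 2 for `peaked`, and about -2 for `two_point`.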

Sampling Adjustment

So, a problem with the equations we just built is that they contain two terms, the distribution mean \(\mu\) and the distribution standard deviation \(\sigma\). But we are taking a sample of observations, so we do not have the parameters for the whole distribution. We'd only have the sample mean and the sample standard deviation.

To keep this article focused, we will not be talking in detail about sampling adjustment terms since degrees of freedom is not in the scope of this article.

The idea is to use our sample mean \(\bar{x}\) and our sample standard deviation \(s\) to estimate these values for our distribution. We will also have to adjust for the degrees of freedom in these equations.

Don't worry if you don't understand this concept completely at this point. We can move on anyway. This leads to us modifying the equations we talked about earlier like so:

$$ skewness = \frac{\sum (a_n - \bar{x})^3 }{s^3} \cdot \frac{n}{(n-1)(n-2)} $$
$$ kurtosis = \frac{\sum (a_n - \bar{x})^4 }{s^4} \cdot \frac{n(n+1)}{(n-1)(n-2)(n-3)} - \frac{3(n-1)^2}{(n-2)(n-3)} $$
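
The two adjusted equations can be transcribed into Python directly. This is a sketch under a few assumptions: the function names are made up, `statistics.stdev` is used for the sample standard deviation \(s\), and the input is the small dataset from the moments section treated as a sample. Note that the kurtosis equation already subtracts the normal benchmark, so it returns excess kurtosis, and that it needs at least 4 observations.

```python
import statistics

def sample_skewness(values):
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    total = sum((x - mean) ** 3 for x in values)
    return total / s ** 3 * n / ((n - 1) * (n - 2))

def sample_excess_kurtosis(values):
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    total = sum((x - mean) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return total / s ** 4 * scale - correction

sample = [1, 2, 3, 3, 3, 6]
print(round(sample_skewness(sample), 4))         # about 1.1525
print(round(sample_excess_kurtosis(sample), 4))  # 2.5
```
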
How to Implement this in Python

Finally, let's finish up by seeing how you can measure skewness and kurtosis in Python with an example. In case you want to try out the code, you can follow along with this Colab Notebook where we measure the skewness and kurtosis of a dataset.

It is pretty straightforward to implement this in Python with SciPy, which has pre-built functions to measure the skewness and kurtosis of a distribution.

The below code block shows how to measure skewness and kurtosis for the Boston housing dataset, but you could also use it for your own distributions.

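from scipy.stats import kurtosis, skew

# `data` is assumed to be a pandas DataFrame that already holds the Boston housing data
skew(data["MEDV"].dropna())      # skewness of the MEDV column
kurtosis(data["MEDV"].dropna())  # excess kurtosis (SciPy's default Fisher definition)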

Thank you for reading!

Thank you for sticking with me until the end. I hope you have learned a lot from this article.

I am excited to see if this article helped you better understand these two very important ideas. If you have any feedback or suggestions for me please feel free to reach out to me on Twitter.

Data Science Learning Roadmap

Harshit Tyagi — Tue, 12 Jan 2021 00:24:30 +0000

Although nothing really changes but the date, a new year fills everyone with the hope of starting things afresh. If you add in a bit of planning, some well-envisioned goals, and a learning roadmap, you'll have a great recipe for a year full of growth.

This post intends to strengthen your plan by providing you with a learning framework, resources, and project ideas to help you build a solid portfolio of work showcasing expertise in data science.

Just a note: I've prepared this roadmap based on my personal experience in data science. This is not the be-all and end-all learning plan. You can adapt this roadmap to better suit any specific domain or field of study that interests you. Also, this was created with Python in mind as I personally prefer it.

What is a learning roadmap?

A learning roadmap is an extension of a curriculum. It charts out a multi-level skills map with details about what skills you want to hone, how you will measure the outcome at each level, and techniques to further master each skill.

My roadmap assigns weights to each level based on the complexity and commonality of its application in the real-world. I have also added an estimated time for a beginner to complete each level with exercises and projects.

Here is a pyramid that depicts the high-level skills in order of their complexity and application in the industry.

Data science tasks in the order of complexity

This will mark the base of our framework. We’ll now have to deep dive into each of these strata to complete our framework with more specific, measurable details.

Specificity comes from examining the critical topics in each layer and the resources needed to master those topics.

We’d be able to measure the knowledge gained by applying the learned topics to a number of real-world projects. I’ve added a few project ideas, portals, and platforms that you can use to measure your proficiency.

Important NOTE: Take it one day at a time, one video/blog/chapter a day. It is a wide spectrum to cover. Don’t overwhelm yourself!

Let’s deep dive into each of these strata, starting from the bottom.

1. How to Learn About Programming or Software Engineering

(Estimated time: 2-3 months)

First, make sure you have sound programming skills. Every data science job description will ask for programming expertise in at least one language.

Specific programming topics to know include:

Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries.

SQL scripting: Querying databases using joins, aggregations, and subqueries

Comfort using the Terminal, version control in Git, and using GitHub

Resources to learn Python:

learnpython.org [free]— a free resource for beginners. It covers all the basic programming topics from scratch. You get an interactive shell to practice those topics side-by-side.

Kaggle [free]— a free and interactive guide to learning python. It is a short tutorial covering all the important topics for data science.

Python certifications on freeCodeCamp [free] – freeCodeCamp offers several certifications based on Python, such as scientific computing, data analysis, and machine learning.

Python Course by freecodecamp on YouTube [free] — This is a 5-hour course that you can follow to practice the basic concepts.

Intermediate python [free]— Another free course by Patrick featured on freecodecamp.org.

Coursera Python for Everybody Specialization [fee] — this is a specialization encompassing beginner-level concepts, python data structures, data collection from the web, and using databases with python.

Resources for learning Git and GitHub

Guide for Git and GitHub [free]: complete these tutorials and labs to develop a firm grip over version control. It will help you further in contributing to open-source projects.

Here's a Git and GitHub crash course on the freeCodeCamp YouTube channel

Resources for learning SQL

Here's a course on SQL and Databases on the freeCodeCamp YouTube channel

Intro to SQL and Advanced SQL on Kaggle.

freeCodeCamp now has a free interactive SQL course.

Measure your expertise by solving a lot of problems and building at least 2 projects:

Solve a lot of problems here: HackerRank (beginner-friendly) and LeetCode (solve easy or medium-level questions)

Data Extraction from a website/API endpoints: try to write Python scripts for extracting data from webpages that allow scraping, like soundcloud.com. Store the extracted data in a CSV file or a SQL database.

Games like rock-paper-scissors, spin a yarn, hangman, dice rolling simulator, tic-tac-toe, and so on.

Simple web apps like a YouTube video downloader, website blocker, music player, plagiarism checker, and so on.

Deploy these projects on GitHub pages or simply host the code on GitHub so that you learn to use Git.

2. How to Learn About Data Collection and Wrangling (Cleaning)

(Estimated time: 2 months)

A significant part of data science work is centered around finding apt data that can help you solve your problem. You can collect data from different legitimate sources — scraping (if the website allows), APIs, Databases, and publicly available repositories.

Once they have data in hand, analysts will often find themselves cleaning dataframes, working with multi-dimensional arrays, using descriptive/scientific computations, and manipulating dataframes to aggregate data.

Data are rarely clean and formatted for use in the “real world”. Pandas and NumPy are the two libraries that are at your disposal to go from dirty data to ready-to-analyze data.

As you start feeling comfortable writing Python programs, feel free to start taking lessons on using libraries like pandas and numpy.

Resources to learn about data collection and cleaning:

freeCodeCamp course on learning Numpy, Pandas, matplotlib, and seaborn [free].

Practical tutorial on data manipulation with NumPy and Pandas in Python from HackerEarth.

Kaggle pandas tutorial [free] — A short and concise hands-on tutorial that will walk you through commonly used data manipulation skills.

Data Cleaning course by Kaggle.

Coursera course on Introduction to Data Science in Python — This is the first course in the Applied Data Science with Python Specialization.

Data collection project Ideas:

Collect data from a website/API (open for public consumption) of your choice, and transform the data to store it from different sources into an aggregated file or table (DB). Example APIs include TMDB, quandl, Twitter API, and so on.

Pick any publicly available dataset and define a set of questions that you’d want to pursue after looking at the dataset and the domain. Wrangle the data to find out answers to those questions using Pandas and NumPy.

3. How to Learn About Exploratory Data Analysis, Business Acumen, and Storytelling

(Estimated time: 2–3 months)

The next stratum to master is data analysis and storytelling. Drawing insights from the data and then communicating the same to management in simple terms and visualizations is the core responsibility of a Data Analyst.

The storytelling part requires you to be proficient with data visualization along with excellent communication skills.

Specific exploratory data analysis and storytelling topics to learn include:

Exploratory data analysis — defining questions, handling missing values, outliers, formatting, filtering, univariate and multivariate analysis.

Data visualization — plotting data using libraries like matplotlib, seaborn, and plotly. Know how to choose the right chart to communicate the findings from the data.

Developing dashboards — a good percent of analysts only use Excel or a specialized tool like Power BI and Tableau to build dashboards that summarise/aggregate data to help management make decisions.

Business acumen: Work on asking the right questions to answer, ones that actually target the business metrics. Practice writing clear and concise reports, blogs, and presentations.

Resources to learn more about data analysis:

Learn data analysis with Python in this free course on the freeCodeCamp YouTube channel.

Data Analysis with Python — by IBM on Coursera. The course covers wrangling, exploratory analysis, and simple model development using python.

Data Visualization — by Kaggle. Another interactive course that lets you practice all the commonly used plots.

Build product sense and business acumen with these books: Measure what matters, Decode and conquer, Cracking the PM interview.

Data analysis project ideas

Exploratory analysis on movies dataset to find the formula to create profitable movies (use it as inspiration), use datasets from healthcare, finance, WHO, past census, Ecommerce, and so on.

Build dashboards (jupyter notebooks, excel, tableau) using the resources provided above.

4. How to Learn About Data Engineering

(Estimated time: 4–5 months)

Data engineering underpins the R&D teams by making clean data accessible to research engineers and scientists at big data-driven firms. It is a field in itself and you may decide to skip this part if you want to focus on just the statistical algorithm side of the problems.

Responsibilities of a data engineer comprise building an efficient data architecture, streamlining data processing, and maintaining large-scale data systems.

Engineers use Shell (CLI), SQL, and Python/Scala to create ETL pipelines, automate file system tasks, and optimize the database operations to make them high-performance.

Another crucial skill is implementing these data architectures which demand proficiency in cloud service providers like AWS, Google Cloud Platform, Microsoft Azure, and others.

Resources to learn Data Engineering:

Data Engineering Nanodegree by Udacity — as far as a compiled list of resources is concerned, I have not come across a better-structured course on data engineering that covers all the major concepts from scratch.

Data Engineering, Big Data, and Machine Learning on GCP Specialization — You can complete this specialization offered by Google on Coursera that walks you through all the major APIs and services offered by GCP to build a complete data solution.

Data Engineering project ideas/certifications to prepare for:

AWS Certified Machine Learning (300 USD) — A proctored exam offered by AWS, adds some weight to your profile (doesn’t guarantee anything, though), requires a decent understanding of AWS services and ML.

Professional Data Engineer: certification offered by GCP. This is also a proctored exam and assesses your ability to design data processing systems, deploy machine learning models in a production environment, and ensure solution quality and automation.

5. How to Learn About Applied Statistics and Mathematics

(Estimated time: 4–5 months)

Statistical methods are a central part of data science. Almost all data science interviews predominantly focus on descriptive and inferential statistics.

People often start coding machine learning algorithms without a clear understanding of underlying statistical and mathematical methods that explain the working of those algorithms. This, of course, isn't the best way to go about it.

Topics you should focus on in Applied Statistics and math:

Descriptive Statistics — to be able to summarise the data is powerful, but not always. Learn about estimates of location (mean, median, mode, weighted statistics, trimmed statistics), and variability to describe the data.

Inferential statistics — designing hypothesis tests, A/B tests, defining business metrics, analyzing the collected data and experiment results using confidence interval, p-value, and alpha values.

Linear Algebra, Single and multi-variate calculus to understand loss functions, gradient, and optimizers in machine learning.

Resources to learn about Statistics and math:

Learn college-level statistics in this free 8-hour course on the freeCodeCamp YouTube channel

[Book] Practical statistics for data science (highly recommend) — A thorough guide on all the important statistical methods along with clean and concise applications/examples.

[Book] Naked Statistics — a non-technical but detailed guide to understanding the impact of statistics on our routine events, sports, recommendation systems, and many more instances.

An 8-hour University-level Statistics course — a foundation course to help you start thinking statistically.

Intro to Descriptive Statistics— offered by Udacity. Consists of video lectures explaining widely used measures of location and variability(standard deviation, variance, median absolute deviation).

Inferential Statistics, Udacity: the course consists of video lectures that educate you on drawing conclusions from data that might not be immediately obvious. It focuses on developing hypotheses and using common tests such as t-tests, ANOVA, and regression.

And here's a guide to statistics for data science to help you get started down the right path.

Statistics project ideas:

Solve the exercises provided in the courses above and then try to go through a number of public datasets where you can apply these statistical concepts. Ask questions like “Is there sufficient evidence to conclude that the mean age of mothers giving birth in Boston is over 25 years of age at the 0.05 level of significance?”

Try to design and run small experiments with your peers/groups/classes by asking them to interact with an app or answer a question. Run statistical methods on the collected data once you have a good amount of data after a period of time. This might be very hard to pull off but should be very interesting.

Analyze stock prices and cryptocurrencies, and design hypotheses around the average return or any other metric. Determine whether you can reject the null hypothesis, or fail to do so, using critical values.
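As a rough illustration of the kind of question in the first project idea, here is a minimal sketch of a one-sided, one-sample t-test using scipy.stats.ttest_1samp. The ages are made-up numbers used only for illustration, not real data; the threshold of 25 and the 0.05 significance level come from the example question above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of mothers' ages (made-up values, for illustration only)
ages = np.array([26, 28, 31, 24, 29, 27, 30, 25, 32, 28])

alpha = 0.05      # significance level from the example question
null_mean = 25    # H0: mean age <= 25, H1: mean age > 25

# One-sided test; the `alternative` argument requires SciPy 1.6 or newer
result = stats.ttest_1samp(ages, popmean=null_mean, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue

# Reject H0 only when the p-value falls below the chosen significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

With a real dataset you would replace the `ages` array with the observed values and check the test's assumptions (roughly Gaussian data, independent observations) before trusting the result.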

6. How to Learn About Machine Learning and AI

(Estimated time: 4–5 months)

After grilling yourself and going through all the major aforementioned concepts, you should now be ready to get started with the fancy ML algorithms.

There are three major types of learning:

Supervised Learning — includes regression and classification problems. Study simple linear regression, multiple regression, polynomial regression, naive Bayes, logistic regression, KNNs, tree models, ensemble models. Learn about evaluation metrics.

Unsupervised Learning — Clustering and dimensionality reduction are the two widely used applications of unsupervised learning. Dive deep into PCA, K-means clustering, hierarchical clustering, and gaussian mixtures.

Reinforcement learning (can skip*) — helps you build self-rewarding systems. Learn to optimize rewards, using the TF-Agents library, creating Deep Q-networks, and so on.

The majority of the ML projects need you to master a number of tasks that I’ve explained in this blog.

Resources to learn about Machine Learning:

Here's a free full course on Machine learning in Python with ScikitLearn on the freeCodeCamp YouTube channel.

[book] Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition — one of my all-time favorite books on machine learning. Doesn’t only cover the theoretical mathematical derivations, but also showcases the implementation of algorithms through examples. You should solve the exercises given at the end of each chapter.

Machine Learning Course by Andrew Ng — the go-to course for anyone trying to learn machine learning. Hands down!

Introduction to Machine Learning — Interactive course by Kaggle.

Intro to Game AI and Reinforcement Learning — another interactive course on Kaggle on reinforcement learning.

Deep Learning Specialization by deeplearning.ai

For those of you who are interested in diving further into deep learning, you can start off by completing this specialization offered by deeplearning.ai and the Hands-On book. This is not as important from a data science perspective unless you are planning to solve a computer vision or NLP problem.

Deep learning deserves a dedicated roadmap of its own. I’ll create that with all the fundamental concepts soon.

Track your learning progress

I’ve also created a learning tracker for you on Notion. You can customize it to your needs and use it to track your progress and keep easy access to all the resources and your projects.

Find the learning tracker here.

Also, here's the video version of this blog:

Data Science with Harshit

This is just a high-level overview of the wide spectrum of data science. You might want to deep dive into each of these topics and create a low-level concept-based plan for each of the categories.

If this tutorial was helpful, you should check out my data science and machine learning courses on Wiplane Academy. They are comprehensive yet compact and help you build a solid foundation of work to showcase.

How to Explain Data Using Gaussian Distribution and Summary Statistics with Python

Harshit Tyagi — Fri, 27 Nov 2020 02:28:02 +0000

Once you understand the taxonomy of data, you should learn to apply a few essential foundational concepts that help describe the data using a set of statistical methods.

Before we dive into data and its distribution, we should understand the difference between two very important keywords - sample and population.

A sample is a snapshot of data from a larger dataset. This larger dataset, which is all of the data that could possibly be collected, is called the population.

In statistics, the population is a broad, defined, and often theoretical set of all possible observations that are generated from an experiment or from a domain.

Observations in a sample dataset often fit a certain kind of distribution that is commonly called the normal distribution, and formally called the Gaussian distribution. This is the most studied distribution, and there is an entire sub-field of statistics dedicated to Gaussian data.

What we’ll cover

In this post, we’ll focus on understanding:

more about the Gaussian distribution and how it can be used to describe the data and observations from a machine learning model.

estimates of location — the central tendency of a distribution.

estimates of variability — the dispersion of data from the mean in the distribution.

the code snippets for generating normally distributed data and calculating estimates using various Python packages like numpy, scipy, matplotlib, and so on.

And with that, let's get started.

What is the normal or Gaussian distribution?

When we plot a dataset, for example as a histogram, the shape of that plot is what we call its distribution. The most commonly observed shape for continuous values is the bell curve, which is also called the Gaussian or normal distribution.

It is named after the German mathematician Carl Friedrich Gauss. Some common examples of data that approximately follow a Gaussian distribution are:

Body temperature

People’s Heights

Car mileage

IQ scores

Let’s try to generate the ideal normal distribution and plot it using Python.

How to plot Gaussian distribution in Python

We have libraries like NumPy, SciPy, and Matplotlib to help us plot an ideal normal curve.

import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt

## generate the data and plot it for an ideal normal curve

## x-axis for the plot
x_data = np.arange(-5, 5, 0.001)

## y-axis as the gaussian (mean 0, standard deviation 1)
y_data = stats.norm.pdf(x_data, 0, 1)

## plot data
plt.plot(x_data, y_data)
plt.show()

Output:

The points on the x-axis are the observations and the y-axis is the likelihood of each observation.

We generated regularly spaced observations in the range (-5, 5) using np.arange(). Then we ran it through the norm.pdf() function with a mean of 0.0 and a standard deviation of 1 which returned the likelihood of that observation.

Observations around 0 are the most common and the ones around -5.0 and 5.0 are rare. The technical term for the pdf() function is the probability density function.

How to test for Gaussian Distribution

It is important to note that not all data fits the Gaussian distribution, and we have to discover the distribution either by reviewing histogram plots of the data or by implementing some statistical tests.

Some examples of observations that do not fit a Gaussian distribution and instead may fit an exponential (hockey-stick shape) include:

People’s incomes

Population of countries

Sales of cars.

Until now, we have only talked about the ideal bell-shaped curve of the distribution. But what if we had to work with random data and figure out its distribution?

This is how we'll proceed:

Create some random data for this example using numpy’s randn() function.

Plot the data using a histogram and analyze the returned graph for the expected shape.

In reality, the data is rarely perfectly Gaussian, but it will have a Gaussian-like distribution. If the sample size is large enough, we treat it as Gaussian.

Note that you may have to change the plotting configuration (scale, number of bins, and so on) to look for the desired pattern.

Let's take a look at some code:

## setting the seed for the random generation
np.random.seed(1)

## generating univariate data
data = 10 * np.random.randn(1000) + 100

## plotting the data
plt.hist(data)
plt.show()

Output:

Here’s the output of the code above with the histogram plot of the data:

The plot looks more like a simple set of blocks. But we can change the scale, which in this case is the arbitrary number of bins in the histogram.

Let’s specify the number of bins and plot it again:

plt.hist(data, bins=100)
plt.show()

We can now see that the curve looks closer to a Gaussian bell-shaped curve.

Notice, though, that we have a few observations that fall far from the center and can be seen as noise.

This points to another important takeaway when working with a sample dataset: you should always expect some noise or outliers.
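The histogram check above is visual. The statistical tests mentioned at the start of this section can be sketched as follows, here using the Shapiro-Wilk test from scipy.stats as one possible choice (the article does not name a specific test). Both samples are generated data, and the exponential sample is an assumption added purely for contrast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Gaussian-like sample with mean 100 and standard deviation 10
gaussian_sample = 10 * rng.standard_normal(1000) + 100

# Clearly non-Gaussian (right-skewed) sample, added for contrast
skewed_sample = rng.exponential(scale=10, size=1000)

p_values = {}
for name, sample in [("gaussian", gaussian_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.shapiro(sample)
    p_values[name] = p_value
    # A small p-value is evidence against the data being Gaussian
    print(f"{name}: W = {statistic:.4f}, p = {p_value:.4g}")
```

A p-value below your chosen threshold (often 0.05) suggests the sample is unlikely to come from a Gaussian distribution. A large p-value does not prove normality, it only fails to reject it.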

Estimates of Location

A fundamental step in exploring a dataset is getting a summarized value for each feature or variable. This is commonly an estimate of where most of the data is located, or in other words, the central tendency.

At first, summarizing the data might sound like a piece of cake – just take the mean of the data. In reality, although the mean is very easy to compute and use, it may not always be the best measure for the central value.

To solve this problem, statisticians have developed alternative estimates to the mean.

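We are going to use the Boston dataset from the sklearn package. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so you'll need an older version or another copy of the data to reproduce these numbers.)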

Note that I’ve dropped a few columns, and this is what the dataframe looks like now:

Let’s look over the commonly used estimates of location with the help of an actual sample dataset, rather than Greek symbols:

Mean

The sum of all values divided by the number of values, also known as the average

Here's how to calculate the mean of the Age variable:

df['Age'].mean() ## output: 68.57490118577076

Weighted mean

The sum of all values times a weight divided by the sum of the weights. This is also known as the weighted average.

Here are two main motivations for using a weighted mean:

Some observations are intrinsically more variable (high standard deviation) than others, and highly variable observations are given a lower weight.

The collected data does not equally represent the different groups that we are interested in measuring.
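To make the weighted mean definition concrete, here is a minimal sketch with NumPy. The values and weights are made up for illustration; the manual formula follows the definition above, and np.average with the weights argument is NumPy's built-in equivalent.

```python
import numpy as np

# Hypothetical measurements and weights (illustrative values only)
values = np.array([10.0, 20.0, 30.0])
weights = np.array([1.0, 1.0, 2.0])

# Definition: sum(value * weight) / sum(weights)
manual = (values * weights).sum() / weights.sum()

# NumPy's built-in weighted average
builtin = np.average(values, weights=weights)

print(manual, builtin)  # both are 22.5
```

The unweighted mean of these values is 20.0, so the extra weight on the last observation pulls the estimate upward.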

Median

The value that separates one half of the data from the other, thus dividing it into a higher and lower half. This is also called the 50th percentile.

Here's how to calculate the median of the Age variable:

df['Age'].median() ## output: 77.5

Percentile

The value such that P percent of the data lies below, also known as quantile.

The describe method makes it easy to find the percentile:

df.describe()

This gives summary statistics of all the numerical variables. Note that the metrics are different for categorical variables.

Weighted median

The value such that one half of the sum of the weights lies above and below the sorted data.
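One way to read this definition is: sort the data, accumulate the weights, and pick the first value at which the running total reaches half of the total weight. The helper below, weighted_median, is a hypothetical function written for this sketch (it is not part of NumPy or pandas), and it does not interpolate between values when the halfway point falls exactly between two observations.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort values and carry their weights along
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order])

    # First position where the running weight is at least half of the total
    half = 0.5 * cumulative[-1]
    idx = np.searchsorted(cumulative, half)
    return sorted_values[idx]

# Toy data: the last observation carries most of the weight
heavy = weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 6])
equal = weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
print(heavy, equal)  # 5.0 3.0
```

With equal weights and an odd number of values, the result matches the ordinary median.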

Trimmed mean

The average of all values after dropping a fixed number of extreme values.

A trimmed mean eliminates the influence of extreme values. For example, while judging an event, we can calculate the final score using the trimmed mean of all the scores so that no judge can manipulate the result.

This is also known as the truncated mean.

For this, we are going to use the stats module from the scipy library:

## trim = 0.1 drops 10% from each end
stats.trim_mean(df['Age'], 0.1) ## output: 71.19605911330049

Outlier

An outlier, or extreme value, is a data value that is very different from most of the data. The median is referred to as a robust estimate of location because it is not influenced by outliers (extreme cases), whereas the mean is sensitive to them.
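A small sketch can show how differently the estimates of location react to a single extreme value. The scores below are made up, loosely following the judging example from the trimmed mean section, with one deliberately extreme score.

```python
import numpy as np
from scipy import stats

# Hypothetical judges' scores; the last one is an extreme value
scores = np.array([7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 9.0, 9.5, 10.0, 50.0])

mean_score = scores.mean()
median_score = np.median(scores)

# Cutting 0.1 from each end drops one score per end for ten values
trimmed_score = stats.trim_mean(scores, 0.1)

print(mean_score)     # 12.65, pulled up by the single score of 50
print(median_score)   # 8.75
print(trimmed_score)  # 8.6875
```

The mean lands above nine of the ten scores, while the median and the trimmed mean stay close to where most of the data sits.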

Estimates of Variability

Besides location, we have another method of summarizing a feature. Variability, also referred to as dispersion, tells us how spread-out or clustered the data is.

Let's calculate the variability measures for the same dataframe using libraries like pandas, numpy, and scipy.

Deviations

The difference between the observed values and the estimate of location. Deviations are sometimes called errors or residuals.

Variance

The sum of squared deviations from the mean divided by n - 1, where n is the number of data values. This is sometimes loosely called the mean-squared-error.

df['Age'].var()

Standard deviation

The square root of the variance.

df['Age'].std() ## output: 28.148861406903617
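To connect the variance and standard deviation definitions to the pandas calls, here is a minimal sketch on a small made-up series (not the Boston data). It computes both by hand with the n - 1 divisor and compares them with Series.var() and Series.std(), which use the same divisor by default.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
ages = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(ages)
deviations = ages - ages.mean()

# Sample variance: sum of squared deviations divided by n - 1
variance_manual = (deviations ** 2).sum() / (n - 1)

# Standard deviation: square root of the variance
std_manual = np.sqrt(variance_manual)

print(variance_manual, ages.var())  # both equal 32 / 7
print(std_manual, ages.std())

# NumPy divides by n by default (ddof=0), so it gives a smaller number
population_variance = np.var(ages.to_numpy())
print(population_variance)  # 4.0
```

The difference between dividing by n and n - 1 matters mostly for small samples; for large datasets the two results are nearly identical.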

Mean absolute deviation

The mean of the absolute values of the deviations from the mean. This is also referred to as the l1-norm or Manhattan norm.

I’ve covered this in more detail along with a mathematical explanation here: Calculating Vector P-Norms — Linear Algebra for Data Science -IV

Median absolute deviation from the median

The median of the absolute values of the deviations from the median.

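Note that pandas' Series.mad() (deprecated in pandas 1.5 and removed in 2.0) actually computes the mean absolute deviation described in the previous section, not the median absolute deviation:

df['Age'].mad() ## output: 24.610885188020433

For the median absolute deviation itself, one option is the stats module from scipy:

stats.median_abs_deviation(df['Age'])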
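The mean absolute deviation and the median absolute deviation are easy to mix up, so here is a minimal sketch that computes each one directly from its definition with NumPy. The values are made up, with one outlier included to show how differently the two measures behave.

```python
import numpy as np

# Toy values with one outlier (illustration only)
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Mean absolute deviation: mean of |x - mean|
mean_abs_dev = np.mean(np.abs(values - values.mean()))

# Median absolute deviation: median of |x - median|
median_abs_dev = np.median(np.abs(values - np.median(values)))

print(mean_abs_dev)    # 31.2
print(median_abs_dev)  # 1.0
```

The median-based version barely notices the outlier, which is why it is considered a robust estimate of variability.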

Range

The difference between the largest and the smallest value in a data set.

We can calculate the range of a variable using the min and max from the summary statistics of the dataframe:

df['Age'].max() - df['Age'].min() ## output: 97.1

Order statistics

Order statistics, or ranks, are metrics based on the data values sorted from smallest to biggest.

Percentile

The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more. This is sometimes called quantile.

Interquartile range

Interquartile range, or IQR, is the difference between the 75th percentile and the 25th percentile.

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1 ## Output: 49.04999999999999
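To see why the IQR is often preferred over the range, here is a small sketch on made-up values using np.percentile with its default linear interpolation. Adding a single extreme value changes the range dramatically but moves the IQR only slightly.

```python
import numpy as np

# Toy values for illustration only
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
value_range = values.max() - values.min()
print(q1, q3, iqr, value_range)  # 3.0 7.0 4.0 8

# Same data with one extreme value appended
with_outlier = np.append(values, 1000)
q1_out, q3_out = np.percentile(with_outlier, [25, 75])
iqr_out = q3_out - q1_out
range_out = with_outlier.max() - with_outlier.min()
print(iqr_out, range_out)  # 4.5 999
```

Different libraries and interpolation methods can give slightly different quartiles on small samples, so small discrepancies between tools are expected.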

Now that you have a clear understanding of Gaussian distribution and common estimates of location and variability, you can summarize and interpret the data easily using these statistical methods.

Data Science with Harshit

With this channel, I am planning to roll out a couple of series covering the entire data science space. Here is why you should be subscribing to the channel:

This series would cover all the required/demanded quality tutorials on each of the topics and subtopics like Python fundamentals for Data Science.

Explained Mathematics and derivations of why we do what we do in ML and Deep Learning.

Podcasts with Data Scientists and Engineers at Google, Microsoft, Amazon, and CEOs of big data-driven companies.

Projects and instructions to implement the topics learned so far. Learn about new certifications, Bootcamps, and resources to crack those certifications like this TensorFlow Developer Certificate Exam by Google.

Statistics for Data Science — a Complete Guide for Aspiring ML Practitioners

Harshit Tyagi — Wed, 04 Nov 2020 19:07:11 +0000

In this hyper-connected world, data are being generated and consumed at an unprecedented pace.

As much as we enjoy this superconductivity of data, it invites abuse as well. Data professionals need to be trained to use statistical methods not only to interpret numbers but to uncover such abuse and protect us from being misled.

Not many data scientists are formally trained in statistics. There are also very few good books and courses that teach these statistical methods from a data science perspective.

Through this post, I intend to shed some light on the following:

What is Statistics?

Statistics in relation with machine learning.

Why you should master statistics

What curriculum you should follow to master these topics

How to study statistics to become a practitioner rather than a test-taker

Practical tips and learning resources

What is Statistics?

Statistics is a set of mathematical methods and tools that enable us to answer important questions about data. It is divided into two categories:

Descriptive Statistics - this offers methods to summarise data by transforming raw observations into meaningful information that is easy to interpret and share.

Inferential Statistics - this offers methods to study experiments done on small samples of data and chalk out the inferences to the entire population (entire domain).

Now, statistics and machine learning are two closely related areas of study. Statistics is an important prerequisite for applied machine learning, as it helps us select, evaluate and interpret predictive models.

Statistics and Machine Learning

The core of machine learning is centered around statistics. You can’t solve real-world problems with machine learning if you don’t have a good grip of statistical fundamentals.

There are certainly some factors that make learning statistics hard. I'm talking about mathematical equations, Greek notation, and meticulously defined concepts that make it difficult to develop an interest in the subject.

We can address these issues with simple and clear explanations, appropriately paced tutorials, and hands-on labs to solve problems with applied statistical methods.

From exploratory data analysis to designing hypothesis testing experiments, statistics play an integral role in solving problems across all major industries and domains.

Anyone who wishes to develop a deep understanding of machine learning should learn how statistical methods form the foundation for regression algorithms and classification algorithms, how statistics allow us to learn from data, and how it helps us extract meaning from unlabeled data.

Why should you master statistics?

Every organisation is striving to become data-driven. This is why we are witnessing such an increase in demand for data scientists and analysts.

Now, to solve problems, answer questions, and map out a strategy, we need to make sense of the data. Luckily, statistics offers a collection of tools to produce those insights.

From Data to Knowledge

In isolation, raw observations are just data. We use descriptive statistics to transform these observations into insights that make sense.

Then we can use inferential statistics to study small samples of data and extrapolate our findings to the entire population.

Statistics helps answer questions like...

What features are the most important?

How should we design the experiment to develop our product strategy?

What performance metrics should we measure?

What is the most common and expected outcome?

How do we differentiate between noise and valid data?

All these are common and important questions that data teams have to answer on a daily basis.

The answers help us make decisions effectively. Statistical methods not only help us set up predictive modeling projects but also to interpret the results.

Statistics and Machine Learning Projects

Almost every machine learning project consists of the following tasks. And statistics play a central role in all of them in some shape or form. Here’s how:

Defining a Problem Statement

The most crucial part of predictive modeling is the actual definition of the problem that gives us the real objective to pursue.

This helps us decide the type of problem we're dealing with (that is, regression or classification). And it also helps us decide the structure and types of the inputs, outputs and metrics with regards to the objective.

But problem framing is not always straightforward. If you're new to Machine Learning, it may require significant exploration of the observations in the domain. Two main concepts to master here are exploratory data analysis (EDA) and data mining.

Initial Data Exploration

Data exploration involves gaining a deep understanding of both the distributions of variables and the relationships between variables in your data.

In part, domain expertise helps you gain this mastery over a specific type of variable. Nevertheless, both experts and newcomers to the field benefit from actually handling real observations from the domain.

Important related concepts in statistics boil down to learning descriptive statistics and data visualization.

Data Cleaning

Often, the data points you've collected from an experiment or a data repository are not pristine. The data may have been subjected to processes or manipulations that damaged its integrity. This further affects the downstream processes or models that use the data.

Common examples include missing values, data corruption, data errors (from a bad sensor), and unformatted data (observations with different scales).

If you want to master cleaning methods, you need to learn about outlier detection and missing value imputation.

Data Preparation and setting up transformation pipelines

If data contains errors and inconsistencies, you often can't use it directly for modeling.

First, the data might need to go through a set of transformations to change its shape or structure and make it more suitable for the problem you've defined or the learning algorithms you're using.

Then you can develop a pipeline of such transformations that you apply to the data to produce consistent and compatible input for the model.

You should master concepts like data sampling and feature selection methods, data transforms, scaling, and encoding.

Model Selection & Evaluation

A key step in solving a predictive problem is selecting and evaluating the learning method. Estimation statistics help you score model predictions on unseen data.

Experimental design is a subfield of statistics that drives the selection and evaluation process of a model. It demands a good understanding of statistical hypothesis tests and estimation statistics.

Fine-tuning the model

Almost every machine learning algorithm has a suite of hyperparameters that allow you to customise the learning method for your chosen problem framing.

This hyperparameter tuning is often empirical in nature, rather than analytical. It requires large suites of experiments in order to evaluate the effect of different hyperparameter settings on the performance of the model.

Statistics Curriculum for Practitioners

A good statistics curriculum for practitioners should not just cover the plethora of methods and tools I just discussed. It should also cover and explore the most commonly faced issues in the industry.

The following is a list of widely used skills you'll need to know to ace data science and ML interviews and get a job in the field.

General Statistics Skills

How to define statistically answerable questions for effective decision making.

Calculating and interpreting common statistics and how to use standard data visualization techniques to communicate findings.

Understanding of how mathematical statistics is applied to the field, concepts such as the central limit theorem and the law of large numbers.

Making inferences from estimates of location and variability (ANOVA).

How to identify the relationship between target variables and independent variables.

How to design statistical hypothesis testing experiments, A/B testing, and so on.

How to calculate and interpret performance metrics like p-value, alpha, Type I and Type II errors, and so on.

Important Statistics Concepts

Getting Started— Understanding types of data (rectangular and non-rectangular), estimate of location, estimate of variability, data distributions, binary and categorical data, correlation, relationship between different types of variables.

Distribution of Statistic — random numbers, the law of large numbers, Central Limit Theorem, standard error, and so on.

Data sampling and Distributions — random sampling, sampling bias, selection bias, sampling distribution, bootstrapping, confidence interval, normal distribution, t-distribution, binomial distribution, chi-square distribution, F-distribution, Poisson and exponential distribution.

Statistical Experiments and Significance Testing— A/B testing, conducting hypothesis tests (Null/Alternate), resampling, statistical significance, confidence interval, p-value, alpha, t-tests, degree of freedom, ANOVA, critical values, covariance and correlation, effect size, statistical power.

Nonparametric Statistical Methods — rank data, normality tests, normalization of data, rank correlation, rank significance tests, independence test

Practical Learning Tips

Most universities have designed their statistics curricula to test students' cramming power. They check whether students can solve equations, define terminology, identify plots, and derive equations, rather than whether they can apply these methods to solve real-world problems.

Aspiring practitioners, however, should follow a step-by-step process of learning and implementing statistical methods on different problems using executable Python code.

Let's look at the two main approaches to studying statistics a bit more in depth:

Top-down approach

Let's say you are asked to design an experiment to test the efficiency of two versions of a product feature. This feature is supposed to increase the user engagement on an online portal.

With a top-down approach, you'll first learn more about the problem. Then once the objective is clear, you can learn to apply the appropriate statistical methods.

This keeps you engaged and offers a better practical learning experience.

Bottom-up approach

This approach is how most universities and online courses teach statistics. It focuses on learning the theoretical concepts with mathematical notation, the history of that concept, and how to implement it.

For people like me who tend to lose interest in theoretical learning, this is not the right way to learn applied statistics. It makes it too meta, which renders the subject dry and depressing without any direct link to problem solving.

As you can probably tell, I recommend a top-down approach to studying statistics.

So now let's look at some specific resources I recommend to get you started down the right path.

Learning Resources

Book on Practical Statistics – This will teach you statistics from a Data Science standpoint. You should read at least the first 3 chapters of this book.

Statistics and Probability | Khan Academy – This course will prepare you well for all the statistics and probability related questions during the interview. A free course with a good compilation of video lectures and practice problems.

Naked Statistics – For people who dread mathematics and prefer to understand practical examples, this is an amazing book that explains how statistics is applied in real-life scenarios.

Statistical Methods for Machine Learning – This book serves as a crash course in statistical methods for machine learning practitioners. It is ideal for those with a developer background.

Next up…

I will be creating a series of tutorials on each of the above-mentioned topics following a code-first approach so that we can understand and visualize the meaning and application of these concepts.

If I’ve missed any of the details or if you want me to cover any other aspect of statistics, respond to this story and I’ll add it to the curriculum.

Basis	Dependent Variable	Independent Variable
Type	It is the "response" variable	It is the "cause" (explanatory) variable
Outcome	Outcome depends on another variable (usually the independent variable)	Outcome does not depend on another variable
Changes	This variable changes over time. Consider the dependent variable as a variable you declare with the "let" or "var" keyword in JavaScript. You can later change it.	This variable never changes. Consider the independent variable as the variable you declare with the "const" keyword in JavaScript. It is fixed unless you explicitly change the value.
Manipulation	Dependent variables cannot be manipulated because their value depends on the independent variable.	Independent variables can be manipulated to determine the outcome of a dependent variable.
Position on a Graph	Dependent variables are placed on the y-axis (vertical axis) on a graph	Independent variables are placed on the x-axis (horizontal axis) on a graph

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Table Of Contents

Prerequisites

The Dataset

Mean: The Sensitive Giant

Median: The Robust Middle

Beyond Averages: Understanding Spread with Quartiles

The IQR: Detecting Outliers

A Simple Example to Understand IQR

Step 1: Find the Median (Q2):

Step 2: Find Q1 (Lower Quartile):

Step 3: Find Q3 (Upper Quartile):

Step 4: Calculate IQR:

Step 5: Find Outlier Bounds:

Applying IQR to Our Dataset

Revisiting the Mean After Removing Outliers

Final Comparison and Insights

Conclusion

Connect with me

What are Markov Chains? Explained With Python Code Examples

Analogy

Markov Chain Explained in Plain English

Applications of Markov Chains

Types of Markov Chains

Discrete-Time Markov Chains (DTMCs)

Continuous-Time Markov Chains (CTMCs)

Reversible Markov Chains

Doubly Stochastic Markov Chains

Hidden Markov Chains Code Example

Hidden Markov Chains: Modeling Unseen States

Code Example

Import libraries and set random seed

What is a Random Seed?

Define the HMM parameters and create a Gaussian HMM

What Does "Gaussian" Mean?

Define transition matrix , means and covariances for each state

Create data, new HMM instance and fit the model with the data

Predict the hidden states for the observed data

Conclusion: The Future of Markov Chains

Learn Statistics for Data Science, Machine Learning, and AI – Full Handbook

Key statistical concepts for your data science or data analysis journey with Python Code

Prerequisites

Random Variables

Mean, Variance, Standard Deviation

Mean

Variance

Standard Deviation

Covariance

Correlation

Probability Distribution Functions

Binomial Distribution

Binomial Distribution Mean and Variance

Poisson Distribution

Poisson Distribution Mean and Variance

Normal Distribution

Normal Distribution Mean and Variance

Bayes' Theorem

Linear Regression

Ordinary Least Squares

Standard Error

OLS Assumptions

Parameter Properties

Gauss-Markov Theorem

Bias

Efficiency

Consistency

Confidence Intervals

Margin of Error

Confidence Level

Confidence Interval for OLS Estimates

Statistical Hypothesis Testing

Null and Alternative Hypothesis

Statistical Significance

Type I and Type II Errors

Statistical Tests

Student’s t-test

Two-sided vs one-sided t-test

2-sample Z-test

Case 1: Z-test for comparing proportions (2-sided)