You've probably heard that statistics is the gateway to data science and that the data science map starts with stats.
Perhaps you've also heard from others that you have to learn statistics before learning data science. But then you ponder, "Since I'm not from a technical background like science, technology, engineering, or math (STEM), do I need to learn everything in statistics before getting into data science?" And those same people will tell you "Yes! You have to learn statistics."
Well, here's my answer: you don't need to learn all of statistics before beginning data science (though you do need to learn some fundamentals).
You can also learn as you go instead of wasting time learning statistics first before data science (that is, as you advance in your knowledge of data science, you can always learn more statistics concepts).
That being said, it is helpful to know statistics basics before jumping into data science. You can indeed say that stats is the gateway to data science because it will help you to have some intuition about your data and how to work with it.
In this article, we'll look at the top statistical concepts you need to know before diving into data science. I'll make it as simple as possible even if you don't come from a technical background. I can tell you're excited and ready to dive into the realm of data science. Let's get started.
What is Statistics?
According to economist and sampling technique pioneer Arthur Lyon Bowley, Statistics is:
"numerical statements of facts in any department of inquiry placed in relation to each other."
That basically means that statistics helps us comprehend our data and also helps us convey the results in that data to others.
Statistical methods (that is, the techniques employed in dealing with data in statistics) are classified into two types:
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics is a branch of statistics that helps us summarize data through numerical values or graphical visualizations.
Descriptive statistics helps us identify and understand some key properties in our data. It includes concepts such as central tendency, dispersion, boxplots, histograms, and so on, which we'll discuss later in the article.
Inferential Statistics, on the other hand, is a branch of statistics that helps us make decisions or predictions based on the data that we have gathered.
Inferential statistics is a significantly more advanced topic because it requires a solid understanding of descriptive statistics. It includes concepts such as hypothesis testing, probability, and so forth.
Top Statistical Concepts to Know Before Learning Data Science
Now that you're familiar with the definition of statistics, let's look at some of the concepts you'll need to know before diving into the realm of data science.
Among the most fundamental concepts are:
What is a Subject?
This is the specific thing we wish to observe. It could be a person, an animal, or something else. It is also known as observation.
What is a Population?
Population refers to the entire set of subjects in which we are interested (that is, that we want to observe). Assume you wish to count the number of females in a specific country: in that case, the population is every female in that country.
What is a Sample?
In reality, observing an entire population is rarely practical (it can be very expensive and time-consuming).
Consider the following scenario: you wish to observe every female in the world. This type of observation can be costly to carry out. However, in statistics, we have something called a sample, which is a portion/subset of the population that you want to study. We can then make a decision (inferential statistics) about the full population using the sample.
What are Parameters?
This is a property/summary of a population. Consider the following scenario: you are observing the entire country and you discover that 90% of the inhabitants are males while 10% are females. The numerical values 90% and 10% are a numerical summary (that is, descriptive statistics) of the entire population. As a result, this summary is known as a population parameter.
What is a Statistic?
On the other hand, a statistic (not to be confused with the field of statistics) is a property of a sample. As in the preceding example, instead of working with the full population we work with samples, so the numerical summary computed from a sample is referred to as a statistic.
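To make the distinction concrete, here is a small Python sketch (with made-up data) that computes the same summary once over the whole population (a parameter) and once over a sample (a statistic):

```python
import random

# Hypothetical population: 1 = male, 0 = female (illustrative data only).
random.seed(42)
population = [1] * 900 + [0] * 100  # 90% males, 10% females

# Parameter: a summary of the WHOLE population.
parameter = sum(population) / len(population)   # proportion of males, exactly 0.9

# Statistic: the same summary computed on a sample drawn from the population.
sample = random.sample(population, 50)
statistic = sum(sample) / len(sample)

print(f"Population parameter: {parameter:.2f}")
print(f"Sample statistic:     {statistic:.2f}")  # close to, but usually not exactly, 0.90
```

Rerun this with different seeds and you'll see the statistic wobble around the parameter, which is exactly why inferential statistics is needed to reason from samples back to populations.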
Hopefully you now have a decent understanding of what population, sample, statistic, and parameters are. Let's take a look at another concept with which we are all too familiar: "Data".
Data, as the term implies, represents factual information. That is, it conveys a message to us. It can, however, be divided into two categories:
- Quantitative data
- Qualitative data
What is Quantitative Data?
This is also known as numerical data: data whose values can be counted or measured. Quantitative data can be further classified into two types:
Quantitative discrete data: These are numerical data that can be counted but cannot be measured. Counting the number of shoes in a shoe store is a common example.
Quantitative continuous data: This is a type of numerical data that is based on measurement. For example, measuring the weight of a glass cylinder is continuous, not discrete.
What is Qualitative Data?
These are data that represent categories or groups. They are also known as categorical data. They are usually written as text and can be characteristics, names, or anything else.
Common examples are a person's name, dog breeds, and so on. However, some data appear to be numerical but are encoded as categorical data.
For example, suppose you wanted to group a certain group of people based on their age and discovered that the lowest and highest ages are 10 and 60, respectively. You then divided the ages into 5 categories (10-20, 21-30, 31-40, 41-50, 51-60) and assigned numerical values to each of those categories where 1 represents 10-20, 2 represents 21-30, and so on.
In this situation, the numerical values will be handled as categorical data rather than quantitative data. As your data science career progresses, you will learn how to work with categorical data.
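As a sketch of this kind of encoding (the band boundaries below are just the ones from the example), you could map ages to category codes like this:

```python
from bisect import bisect

# Lower edges of bands 2..5: 10-20 -> 1, 21-30 -> 2, 31-40 -> 3, 41-50 -> 4, 51-60 -> 5.
boundaries = [21, 31, 41, 51]

def age_category(age: int) -> int:
    """Encode an age into its band number (treated as categorical, not numeric)."""
    return bisect(boundaries, age) + 1

print(age_category(15))  # 1  (band 10-20)
print(age_category(25))  # 2  (band 21-30)
print(age_category(60))  # 5  (band 51-60)
```

The key point is that the resulting codes 1 through 5 are labels: averaging them or subtracting them would be meaningless, so they are handled as categorical data.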
Now you know the categories of data. Both quantitative and qualitative data can be further classified by their level of measurement. Data in statistics falls into 4 levels of measurement:
- Nominal scale data
- Ordinal scale data
- Interval scale data
- Ratio scale data
Qualitative data can be measured using:
Nominal scale data: These are categorical data with no sense of order; that is, they cannot be ranked.
Each piece of data represents a single unit. Color is a classic example: it makes no sense to rank blue above yellow. When working with nominal data, each data point must be handled as a separate unit.
Ordinal scale data: Ordinal scale data consists of ordered categorical data. When data is ranked, there is a sense of order in it. A survey response such as excellent, good, satisfactory, or unsatisfactory is an example of this: it makes sense to rank excellent above good.
Quantitative data can be measured using:
Interval scale data: These are numerical data that have an order and can be measured (for example, you can find the difference between two values). The readings on a temperature scale are an example of interval data.
For example, you can measure the difference between 4 and 10 degrees Celsius, and 10 degrees is higher than 4 degrees. However, interval scale data has two limitations:
- It has no true zero point (zero degrees Celsius does not mean "no temperature", and values below zero are possible)
- You can't take meaningful ratios: for example, it makes no sense to say that 20 degrees Celsius is four times as hot as 5 degrees Celsius.
Ratio scale data: These are numerical data that have the features of interval scale data (they can be ordered and measured) but without its limitations: they have a true zero point, and you can find meaningful ratios between values.
A grade score of 20, 68, 90, or 80 is an example. We can order it, measure it, and find the ratio between the values. It makes sense to say the score of 80 is 4 times better than the score of 20.
Now that we've covered the fundamentals of data, let's look at how the first category of statistics (descriptive statistics) can be applied to data.
As previously stated, descriptive statistics involves summarizing data either numerically or graphically. Let's take a look at some of the most common numerical and graphical summaries you'll encounter when dealing with data on a regular basis.
Mean vs Median vs Mode – What is the Difference?
What is a Mean?
When we have a set of numerical data like this (4, 5, 6, 7, 10), each value in the set of data is referred to as a data point. We might want to find the data's average value.
So mean is essentially the average of a set of data and is calculated as the sum of all the data points divided by the total number of data points.
In our data set above, the sum is 32 and the total number of data points is 5, so the mean is 32 / 5 = 6.4.
The mean is only used on numerical data; it makes no sense to find the average of categorical data.
What is a Median?
Given a group of values, we may also want to find the value in the center. The median is the middle value of the sorted data (if there is an even number of data points, it is the average of the two middle values). Like the mean, the median is used on numerical data only.
What is a Mode?
This is the value with the highest frequency (that is a value that has the highest number of occurrences). The mode can be used for numerical or categorical data.
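Using Python's built-in statistics module, the three summaries above can be computed directly (the numbers are the data set from this section; the list of colors is made up to show the mode on categorical data):

```python
from statistics import mean, median, mode

data = [4, 5, 6, 7, 10]
print(mean(data))    # 6.4 -> sum 32 divided by 5 data points
print(median(data))  # 6   -> the middle value of the sorted data

# Mode works on categorical data too: the most frequent value wins.
colors = ["red", "blue", "red", "green", "red"]
print(mode(colors))  # red
```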
What is an Outlier?
Outliers are data points that differ from other data points and, when present, can lead us to incorrect conclusions. Here's a typical example of how outliers are harmful.
Consider the following scenario: you have a machine that counts how many customers enter your supermarket every day, and the readings are thus for a given week (20, 23, 26, 27, 302). We can see that the number 302 is an outlier because it deviates significantly from the other data points.
Outliers could have resulted from a sudden change, machine faults, or other circumstances. When they are present, they can lead us to make incorrect decisions. For example, if you want to find the average number of customers who visit your supermarket, the value 302 may lead you to an incorrect result: the mean of the readings above is 79.6, far higher than any typical day.
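You can verify this with a quick sketch using the supermarket readings above; note how the median resists the outlier while the mean does not:

```python
from statistics import mean, median

customers = [20, 23, 26, 27, 302]  # 302 is the outlier

print(mean(customers))    # 79.6 -- dragged far upward by the outlier
print(median(customers))  # 26   -- barely affected, a more robust summary
```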
What is a Standard Deviation?
A Standard Deviation is a summary value that indicates how far our data point deviates from the mean. It is used to determine the spread of our data.
The closer the standard deviation is to zero, the closer our data points are to one another.
The standard deviation is an extremely valuable summary that can tell us whether we have outliers in our dataset. Here's how it works:
In the above chart, we see a Normal Distribution. 34.1% + 34.1% = 68.2% of all observations are within one standard deviation, or 1σ (pronounced one Sigma).
A further 13.6% + 13.6% = 27.2% of observations fall between one and two standard deviations, so 68.2% + 27.2% = 95.4% of all observations lie within 2σ. And so on.
And yes, if you've heard of Six Sigma, that is a concept in engineering where six standard deviations' worth of possibilities are accounted for in the quality assurance process, meaning you account for all but the most extreme outliers: 99.99966% of all possibilities, to be exact.
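Here is a small sketch (with a made-up data set) that computes the population standard deviation and checks how many points fall within one sigma of the mean; with only 8 points, don't expect the 68.2% figure exactly:

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)       # 5
sigma = pstdev(data)  # 2.0 -- population standard deviation

# Count how many points lie within one standard deviation of the mean.
within_1_sigma = [x for x in data if mu - sigma <= x <= mu + sigma]
print(sigma)               # 2.0
print(len(within_1_sigma))  # 6 of the 8 points (75%) fall within 1 sigma here
```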
Now that we've grasped some numerical summaries, let's take a look at some common graphical summaries.
What is a Bar Chart?
A bar chart is a type of data visualization used for categorical data. You use it to graphically show the frequency of categorical data (that is the number of times a categorical data point occurs). Here's an example:
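A bar chart simply draws one bar per category at that category's frequency. As a quick sketch (using hypothetical survey data and only the Python standard library, rather than a plotting library like matplotlib), here is the counting step behind such a chart:

```python
from collections import Counter

# Hypothetical categorical data: favorite fruit from a small survey.
fruits = ["apple", "banana", "apple", "cherry", "banana", "apple"]
counts = Counter(fruits)

# A bar chart draws one bar per category at its frequency; here we
# sketch the bars with '#' characters instead.
for fruit, freq in counts.most_common():
    print(f"{fruit:<8} {'#' * freq}")
```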
What is a Histogram?
A histogram is similar to a bar chart, but it is used for numerical data: it groups the data points into bins (ranges) and shows the frequency of each bin as the height of its bar.
It is a very efficient visualization tool because it helps you visualize the distribution of your numerical data. You can read more here to learn more about histograms.
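To see the grouping idea in action, here is a small sketch (with made-up ages) that bins numerical data the way a histogram does, using bins of width 10:

```python
# Hypothetical numerical data: ages of people in a small survey.
ages = [12, 15, 18, 22, 25, 27, 31, 34, 45, 47, 48, 49]

bins = {}
for age in ages:
    low = (age // 10) * 10          # bin lower edge: 10, 20, 30, 40, ...
    bins[low] = bins.get(low, 0) + 1

# Print each bin's frequency as a sketch of the histogram bars.
for low in sorted(bins):
    print(f"{low}-{low + 9}: {'#' * bins[low]}")
```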
What is a Boxplot?
Another excellent visualization that helps you visualize the distribution of your data is the boxplot.
A boxplot, for example, allows you to visually check whether there are any outliers in your data set. It involves terms such as minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. A boxplot looks as follows:
So let’s go over what we have in the above diagram:
Minimum: The minimum here is not necessarily the smallest value in our dataset. It is calculated using the formula (Q1 - 1.5*IQR), where:
- Q1 – the 25th percentile
- IQR – the interquartile range (the difference between the 75th percentile and the 25th percentile)
The minimum helps us detect data points that fall far below the other observed values.
For instance, suppose our data points are [345, 402, 295, 386, 10]. The value 10 is an outlier because it lies far below the other observations.
The 25th percentile is a value that tells us that 25% of our data points are below that value and 75% of our data points are above that value. The 25th percentile is also known as the first quartile.
The 50th percentile is a value that indicates that 50% of our data points are below that value and the remaining 50% are above that value. It is also known as the second quartile.
The 75th percentile is a value that tells us that 75 percent of our data points are below that value and the remaining 25 percent are above it. It is also known as the third quartile.
Maximum: Like the minimum, the maximum is not necessarily the highest value in the dataset. It is calculated using the formula (Q3 + 1.5*IQR), where:
- Q3 – the 75th percentile
- IQR – the interquartile range (the difference between the 75th percentile and the 25th percentile)
The maximum helps us detect data points that fall far above the other observed values.
For instance, suppose our data points are [645, 40, 25, 38, 42]. The value 645 is an outlier because it lies far above the other observations.
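Putting the formulas above together, here is a sketch (reusing the supermarket readings from earlier as hypothetical data) that computes both fences and flags outliers. Note that `statistics.quantiles` supports several estimation methods; the inclusive method is used here so the quartiles fall on actual data points:

```python
from statistics import quantiles

data = [20, 23, 26, 27, 302]

q1, q2, q3 = quantiles(data, n=4, method="inclusive")  # 23, 26, 27
iqr = q3 - q1                                          # 4

lower_fence = q1 - 1.5 * iqr   # the boxplot "minimum": 17.0
upper_fence = q3 + 1.5 * iqr   # the boxplot "maximum": 33.0

# Anything outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [302]
```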
We've seen some graphical summaries of what we'll be dealing with on a daily basis. Let's look at the final topic we will discuss in this article:
What is the Association Between Quantitative Variables?
A variable is any measurable characteristic (numerical or categorical) recorded across a collection of observations. In a table, it usually corresponds to a column.
Two variables are said to be associated if a specific value of one variable is most likely to occur with a specific value of another variable.
To study the association between two quantitative variables (often referred to as correlation), we calculate it using the Karl Pearson formula, and the result is between -1 and +1.
If the correlation value is close to +1, it indicates that the two variables are positively correlated (that is, as one variable increases, the other increases as well). If the value is close to -1, it indicates that the variables are negatively correlated (that is, as one variable increases, the other decreases). Finally, if the correlation value is 0, there is no linear correlation between the variables.
You can read more here to learn more about correlation and the Karl Pearson formula.
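As a sketch, here is the Pearson formula implemented directly in Python (the variable names and data below are made up for illustration):

```python
from math import sqrt

def pearson(x, y):
    """Karl Pearson correlation coefficient; the result always lies in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours_studied = [1, 2, 3, 4, 5]
exam_scores   = [52, 58, 65, 71, 79]   # rises steadily with hours studied

print(pearson(hours_studied, exam_scores))  # close to +1: strong positive correlation
```

Flip one of the lists so it decreases as the other increases, and the result moves toward -1 instead.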
What is a Scatter Plot?
We can represent the correlation between quantitative variables in a graphical summary by using a plot called a scatter plot.
A scatter plot looks like this:
To learn about scatter plots you can read more here.
Conclusion and Learning More
In this tutorial, we've explored some fundamental statistics concepts that will help you work more efficiently with your data.
But the learning doesn't stop here: these are only some of the fundamentals you need to become familiar with. Since this is only the beginning, you can delve deeper by consulting online resources or textbooks.
Thank you very much for reading, and please share the article so that beginners who want to go into data science can learn as well.