Ever since the launch of OpenAI's ChatGPT, everyone wants to learn about AI and ML.
Not only that – everyone wants to build and release an AI product on their own to mark their position in the global competition.
And if you already own a SaaS product, you may want to integrate AI and ML features into it to retain your customers and be competitive in the global market.
In this tutorial, you'll learn how to build a Linear Regression Model. This is one of the first things you'll learn how to do when studying Machine Learning, so it'll help you take your first step into this competitive market.
Table of Contents
- What is Linear Regression?
- Evaluation Metrics
- Linear Regression Example – Car Price Prediction Model
- How to Build the Model
Prerequisites
- Medium-level experience using an IDE (preferably VS Code)
- Basic understanding of Jupyter Notebook (.ipynb) files
- Good understanding of the Python programming language
- Basic knowledge of the Pandas (to handle dataframes), NumPy, Scikit-learn, and Matplotlib libraries
- Some knowledge of statistics is helpful for analyzing the data
What is Linear Regression?
Linear Regression is a Supervised Learning method, where the predicted output will be continuous in nature. For example, things like price prediction, marks prediction, and so on.
Linear Regression is a fundamental statistical and machine learning technique used for modeling the relationship between a dependent variable (also known as the target or response variable) and one or more independent variables (predictors or features).
It aims to establish a linear equation that best represents the association between these variables, allowing us to make predictions and draw insights from the data.
The primary goal of linear regression is to find the "best-fit" line (or hyperplane in higher dimensions) that minimizes the difference between the predicted values and the actual observed values.
This best-fit line is defined by a linear equation of the form:
Y = b0 + b1X1 + b2X2 +...+ bnXn
In this equation:
- Y represents the dependent variable we want to predict.
- X1,X2,...,Xn are the independent variables or features.
- b0 is the intercept (the value of Y when all X values are zero).
- b1,b2,...,bn are the coefficients that determine the relationship between each independent variable and the dependent variable.
Linear regression assumes that there is a linear relationship between the predictors and the target variable.
The goal of the model is to estimate the coefficients (b0,b1,...,bn) that minimize the sum of the squared differences between the predicted values and the actual values in the training data. This process is often referred to as "fitting the model."
Evaluation Metrics
The evaluation metrics for a Linear Regression model are:
- Coefficient of Determination or R-Squared (R2)
- Root Mean Squared Error (RMSE)
Let's see what each of these are.
Coefficient of Determination (R2)
R-Squared describes the amount of variation in the target that is captured by the developed model. It always ranges between 0 and 1. The higher the value of R-squared, the better the model fits the data.
Root Mean Squared Error
RMSE measures the average magnitude of the errors or residuals between the predicted values generated by a model and the actual observed values in a dataset. It always ranges between 0 and positive infinity. Lower RMSE values indicate better predictive performance.
Linear Regression Example – Car Price Prediction Model
In this example, we'll try to predict car prices by building a Linear Regression model. I found this problem and its dataset on Kaggle, along with an existing submission, which was perfect. In fact, I built my solution by adapting part of that submission.
Let's dive into the problem.
We're given a dataset of used cars, which contains the name of the car, the year, selling price, present price, number of kilometers driven, type of fuel, type of seller, transmission, and the number of previous owners. Our goal is to predict the selling price of the car.
Let's explore the solution.
Import the required packages:
You'll need various packages to work through this problem. Here's how you can import them all:
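The article's original import cell isn't shown, but a typical set for this walkthrough looks like the following (the aliases are the conventional ones, and the exact list is an assumption based on the libraries used later):

```python
# Core data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line when working in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
```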
Import the dataset
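The dataset comes as a CSV file from Kaggle. The snippet below writes a few stand-in rows to disk so it runs anywhere; in practice you'd point `read_csv` at the downloaded file (the filename `car data.csv` is an assumption):

```python
import pandas as pd

# A few illustrative rows in the same shape as the Kaggle file (stand-in data)
sample = """Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
"""
with open("car data.csv", "w") as f:
    f.write(sample)

# In practice, point this at the actual Kaggle download
df = pd.read_csv("car data.csv")
print(df.head())
```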
Pre-process the dataset
The code below shows the columns, their datatypes, and the number of rows. Our dataset has 9 columns and 301 rows.
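A sketch of that inspection step, run here against a few stand-in rows shaped like the real file (which has 301 rows):

```python
import pandas as pd

# Stand-in rows shaped like the Kaggle file
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
    "Present_Price": [5.59, 9.54, 9.85],
    "Kms_Driven": [27000, 43000, 6900],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Manual"],
    "Owner": [0, 0, 0],
})

df.info()        # column names, dtypes, and non-null counts
print(df.shape)  # (rows, columns)
```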
The "Car_Name" column holds the car's name. This field should be dropped from our dataset, because only the features of the car matter, not its name.
The code below returns the number of unique car names in our dataset.
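That check is a one-liner with `nunique()` (stand-in rows here, so the count differs from the full dataset's 98):

```python
import pandas as pd

df = pd.DataFrame({"Car_Name": ["ritz", "sx4", "ciaz", "ritz"]})  # stand-in rows

# Count distinct car names; the full dataset returns 98
print(df["Car_Name"].nunique())
```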
We have 98 unique car names in our dataset.
Clearly, it does not add any meaning to our dataset, since there are so many categories. Let's drop that column.
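Dropping the column might look like this (sketched on stand-in data):

```python
import pandas as pd

df = pd.DataFrame({  # stand-in rows
    "Car_Name": ["ritz", "sx4"],
    "Selling_Price": [3.35, 4.75],
})

# Car_Name adds no predictive value, so remove it
df = df.drop("Car_Name", axis=1)
print(df.columns.tolist())
```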
The dataset has the column named "Year". Ideally, we need the age of car over the year it was bought / sold. So, let's convert that to "Age" and remove the "Year" column.
"Age" is calculated by finding the difference between the maximum year available in our dataset and the year of that particular car. This is because, our calculations will be specific to that particular time period and to this dataset.
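A sketch of that conversion:

```python
import pandas as pd

df = pd.DataFrame({"Year": [2014, 2013, 2018]})  # stand-in rows

# Age relative to the newest car in this dataset
df["Age"] = df["Year"].max() - df["Year"]
df = df.drop("Year", axis=1)
print(df["Age"].tolist())  # → [4, 5, 0]
```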
How to find outliers
An outlier is a data point that differs significantly from other observations. They can cause the performance of the model to drop.
Categorical columns will have the datatype "object". Let's group the numerical and categorical column names in a NumPy array: the first 5 elements will be the numerical columns and the remaining 3 the categorical columns.
We can plot the data in the columns using the seaborn library. Categorical columns will contain multiple bars, whereas numerical columns will contain single bars.
Let's try to find the outliers in our dataset using the InterQuartile Range (IQR) rule.
This is based on the concept of quartiles, which divide a dataset into four equal parts. The IQR (InterQuartile Range) rule specifically focuses on the range of values within the middle 50% of the data and uses this range to identify potential outliers.
For each unique value in the categorical columns, we find the quartile bounds of our target column (Selling Price) within that category, and flag the samples that fall outside the range derived from its 25th and 75th percentiles.
The outliers in numerical columns, on the other hand, can be filtered by the 25th and 75th percentiles of the same column; there's no need to filter with respect to the target column.
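One way to sketch that IQR-based filtering (the stand-in rows here flag a single outlier; the full 301-row dataset yields the 38 mentioned below):

```python
import pandas as pd

df = pd.DataFrame({  # stand-in rows with two deliberately extreme values
    "Selling_Price": [3.35, 4.75, 7.25, 35.0, 4.6, 5.2],
    "Kms_Driven": [27000, 43000, 6900, 500000, 42450, 30000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol"],
})

def iqr_outlier_mask(s):
    """True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

outliers = set()
# Numerical column: flag by its own spread
outliers |= set(df.index[iqr_outlier_mask(df["Kms_Driven"])])
# Categorical column: flag by the spread of the target within each category
for _, group in df.groupby("Fuel_Type"):
    outliers |= set(group.index[iqr_outlier_mask(group["Selling_Price"])])

print(sorted(outliers))  # the 500,000 km car is flagged here
```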
By running the above code, we found that there are 38 outliers in our dataset.
But keep in mind that it's not always the right decision to remove the outliers. They can be legitimate observations and it’s important to investigate the nature of the outlier before deciding whether to drop it or not.
We can delete outliers in two cases:
- Outlier is due to incorrectly entered or measured data
- Outlier creates a significant skew in the data
Let's dig even deeper and find the definite outliers.
To do that, let's assume that if the selling price is more than 33 Lakhs, or if the car has been driven more than 400,000 kilometers, it's an outlier. We'll mark these points in green, save all their indices in the removing_indices variable, and plot them as scatterplots with the seaborn library, comparing each column against our target column.
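A sketch of that marking step (the thresholds come from the text; the stand-in rows, colors, and plot layout are assumptions):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({  # stand-in rows
    "Selling_Price": [3.35, 4.75, 35.0, 4.6],
    "Present_Price": [5.59, 9.54, 36.2, 6.87],
    "Kms_Driven": [27000, 43000, 78000, 500000],
})

# Flag cars above either threshold as outliers
mask = (df["Selling_Price"] > 33) | (df["Kms_Driven"] > 400000)
removing_indices = df.index[mask].tolist()

for col in ["Present_Price", "Kms_Driven"]:
    sns.scatterplot(x=df[col], y=df["Selling_Price"], color="blue")
    # Overlay the flagged points in green
    sns.scatterplot(x=df.loc[mask, col], y=df.loc[mask, "Selling_Price"], color="green")
    plt.close()

print(removing_indices)  # → [2, 3]
```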
Let's see the definite outliers:
We found 2 of them, which we'll have to remove. But before that, we have to check whether there's any null data in our dataset.
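The null check is a quick one (stand-in rows):

```python
import pandas as pd

df = pd.DataFrame({  # stand-in rows with no missing values
    "Selling_Price": [3.35, 4.75],
    "Kms_Driven": [27000, 43000],
})

print(df.isnull().sum())        # null count per column
print(df.isnull().sum().sum())  # → 0 when nothing is missing
```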
There are no null values in our dataset.
Let's remove the identified outliers and reset the index of the dataframe.
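A sketch of the removal and index reset:

```python
import pandas as pd

df = pd.DataFrame({"Selling_Price": [3.35, 35.0, 4.6]})  # stand-in rows
removing_indices = [1]  # indices flagged earlier

# Drop the outlier rows, then renumber the index from 0
df = df.drop(removing_indices).reset_index(drop=True)
print(df.index.tolist())  # → [0, 1]
```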
Analyze the Dataset
Let's analyze the data to see how much each field/category is correlated with the selling price of the car. We need to do some analysis on our dataset to be able to come to some conclusions about it.
To do that, we have to identify the numerical and categorical fields in our dataset, as the way to plot this differs for each type.
If you're not familiar what bivariate analysis is, here's a basic definition:
Bivariate analysis is one of the simplest forms of quantitative analysis. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them. Bivariate analysis can be helpful in testing simple hypotheses of association.
Let's compare the Selling Price with the other columns using Bivariate Analysis and try to derive some conclusions out of that data.
Selling Price vs Numerical Features Bivariate Analysis
Let's compare the numerical features with the Selling Price using Bivariate analysis. The Numerical columns will be plotted in a scatter graph.
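A sketch of those scatter plots (stand-in rows; the subplot layout is an assumption):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({  # stand-in rows
    "Selling_Price": [3.35, 4.75, 7.25],
    "Present_Price": [5.59, 9.54, 9.85],
    "Kms_Driven": [27000, 43000, 6900],
    "Age": [4, 5, 1],
    "Owner": [0, 0, 1],
})

# One scatter plot per numerical feature against the target
num_features = ["Present_Price", "Kms_Driven", "Age", "Owner"]
fig, axes = plt.subplots(1, len(num_features), figsize=(16, 4))
for ax, col in zip(axes, num_features):
    sns.scatterplot(x=df[col], y=df["Selling_Price"], ax=ax)
plt.close(fig)
```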
Selling Price vs Categorical Features Bivariate Analysis
Let's compare the categorical features with the Selling Price using Bivariate analysis. The Categorical columns will be plotted in a stripplot graph. This gives the comparison among multiple values in a category.
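A sketch of the stripplots (stand-in rows; the layout is an assumption):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({  # stand-in rows
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
    "Seller_Type": ["Dealer", "Dealer", "Individual", "Individual"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# One strip of points per category value, plotted against the target
cat_features = ["Fuel_Type", "Seller_Type", "Transmission"]
fig, axes = plt.subplots(1, len(cat_features), figsize=(12, 4))
for ax, col in zip(axes, cat_features):
    sns.stripplot(x=df[col], y=df["Selling_Price"], ax=ax)
plt.close(fig)
```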
Here are the conclusions we can make from our data analysis:
- As Present Price increases, Selling Price increases as well. They're directly proportional.
- Selling Price is inversely proportional to Kilometers Driven.
- Selling Price is inversely proportional to the car's age.
- As the number of previous car owners increases, its Selling Price decreases. So Selling Price is inversely proportional to Owner.
- Diesel Cars > CNG Cars > Petrol Cars in terms of Selling Price.
- The Selling Price of cars sold by individuals is lower than the price of cars sold by dealers.
- Automatic cars are more expensive than manual cars.
Categorical Variables Encoding
We can't use the Categorical fields as they are. They have to be converted to numbers because machines can only understand numbers.
As an example, let's take the Fuel_Type column. One-hot encoding splits it into separate columns, such as Fuel_Type_Petrol and Fuel_Type_Diesel.
Let's assume a car runs on Petrol. For this car, the Fuel_Type_Petrol column will be set to 1 (True) and the Fuel_Type_Diesel column to 0 (False). Computers can work with 1 and 0 rather than "Petrol" and "Diesel".
To do that, we'll perform one-hot encoding for the categorical columns. Pandas offers the get_dummies method to encode the columns.
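A minimal sketch with `get_dummies` (the `drop_first=True` flag is an assumption; it keeps one fewer column per category to avoid redundant dummies):

```python
import pandas as pd

df = pd.DataFrame({  # stand-in rows
    "Selling_Price": [3.35, 4.75, 7.25],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# Replace each categorical column with 0/1 indicator columns
df = pd.get_dummies(df, drop_first=True)
print(df.columns.tolist())
```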
Correlation Matrix
A correlation matrix summarizes the strength and direction of the linear relationships between pairs of variables in a dataset. It's a crucial tool in statistics and data analysis, used to examine patterns of association between variables and understand how they may be related.
The correlation is directly proportional if the values are positive, and inversely proportional if the values are negative.
Here's the code to find the correlation matrix with relation to Selling Price.
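A sketch of that computation (stand-in numeric rows):

```python
import pandas as pd

df = pd.DataFrame({  # stand-in numeric data, as it looks after encoding
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Present_Price": [5.59, 9.54, 9.85, 4.15],
    "Age": [4, 5, 1, 7],
})

# Pairwise correlations; pick out the column for the target variable
corr = df.corr()
print(corr["Selling_Price"].sort_values(ascending=False))
```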
From the above matrix, we can infer that the target variable "Selling Price" is highly correlated with Present Price, Seller Type, and Fuel Type.
How to Build the Model
We have come to the final stage. Let's train and test our model.
Let's remove "Selling_Price" from the input and set it as the output, that is, the value to be predicted.
Let's split our dataset by taking 70% of data for training and 30% of data for testing.
Let's make a backup of our test data. We need this for the final comparison.
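The three steps above might be sketched like this (stand-in data; the random_state value is an assumption made so the split is reproducible):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({  # stand-in encoded data
    "Present_Price": [5.59, 9.54, 9.85, 4.15, 6.87, 9.83, 8.12, 8.61, 8.89, 8.92],
    "Age": [4, 5, 1, 7, 4, 0, 3, 3, 6, 3],
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.6, 9.25, 6.75, 6.5, 8.75, 7.45],
})

X = df.drop("Selling_Price", axis=1)  # input features
y = df["Selling_Price"]               # target to be predicted

# 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_test_backup = X_test.copy()  # untouched copy for the final comparison
print(len(X_train), len(X_test))  # → 7 3
```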
Normalize the dataset
StandardScaler is a preprocessing technique commonly used in machine learning and data analysis to standardize or normalize the features (variables) of a dataset. Its primary purpose is to transform the data such that each feature has a mean (average) of 0 and a standard deviation of 1.
Let's normalize our dataset using StandardScaler. It's very important that the StandardScaler transformation is fitted only on the training set; fitting it on the full dataset would lead to data leakage.
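A sketch of that, fitting the scaler on the training data only (stand-in feature arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[5.59, 4], [9.54, 5], [9.85, 1], [4.15, 7]])  # stand-in features
X_test = np.array([[6.87, 4]])

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from training data only...
X_test = scaler.transform(X_test)        # ...then reuse those statistics on test data

print(X_train.mean(axis=0).round(6))  # each feature now has mean ~0
```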
Train the model
Let's find the intercept and co-efficient for each column in our training dataset.
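A minimal sketch (stand-in training data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in training data: two features per car
X_train = np.array([[5.59, 4], [9.54, 5], [9.85, 1], [4.15, 7], [6.87, 4]])
y_train = np.array([3.35, 4.75, 7.25, 2.85, 4.6])

model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept:", model.intercept_)  # b0
print("Coefficients:", model.coef_)    # b1..bn, one per feature column
```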
How to evaluate the model
Scikit-learn provides a metrics module which helps us measure our model's performance. We can use it to compute metrics including Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and R2-Score.
Now it's time to evaluate the model:
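A sketch of the evaluation (the y_test and y_pred values here are stand-ins, not the article's results):

```python
import numpy as np
from sklearn import metrics

y_test = np.array([3.35, 4.75, 7.25, 2.85])  # stand-in actual values
y_pred = np.array([3.10, 5.00, 6.80, 3.20])  # stand-in model predictions

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                  # RMSE is the square root of MSE
r2 = metrics.r2_score(y_test, y_pred)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```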
Evaluate the model using K-fold Cross-Validation
In k-fold cross-validation, the dataset is divided into k roughly equal-sized subsets or "folds." The model is trained and evaluated k times, each time using a different fold as the validation set and the remaining folds as the training set.
The results (for example accuracy, error) of these k runs are then averaged to obtain a more robust estimate of the model's performance.
The advantage is that each data point is used for both training and validation, reducing the risk of bias in the evaluation.
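A sketch of k-fold cross-validation with scikit-learn (synthetic near-linear data; the cv=5 and r2 scoring choices are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic near-linear data so the example is self-contained
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 2))
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.1, 30)

# 5-fold CV: train on 4 folds, validate on the held-out fold, 5 times over
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```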
Let's create a dataframe with the actual and predicted values.
Let's compare the actual and predicted target values for the test data with the help of a bar plot.
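The two steps above can be sketched as follows (stand-in values; blue/orange are matplotlib's default bar colors):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

y_test = [3.35, 4.75, 7.25, 2.85]  # stand-in actual prices
y_pred = [3.10, 5.00, 6.80, 3.20]  # stand-in predicted prices

comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(comparison)

# Side-by-side bars: first column (actual) in blue, second (predicted) in orange
comparison.plot(kind="bar", figsize=(10, 5))
plt.close("all")
```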
In the above graph, the blue bars indicate the actual price and the orange bars indicate the predicted price of the cars. You can see that some predicted values are negative, but in most cases, our model has predicted the price pretty well.
This isn't a perfect model. But if you want to fine-tune it to make better predictions, let me know via email. I'll write a tutorial about that if I receive enough requests.
In this tutorial, you learned about Linear Regression with a practical example. I hope it helps you make progress in your ML journey. You can access the dataset and the code from this GitHub repo.
I hope you enjoyed reading this article. If you wish to learn more about artificial intelligence, machine learning, and deep learning, subscribe to my articles by visiting my site, which has a consolidated list of all my blogs.