#Regression - freeCodeCamp.org

Linear vs Logistic Regression: How to Choose the Right Regression Model for Your Data

Oluwadamisi Samuel — Tue, 28 May 2024 13:02:08 +0000

Regression models identify trends in a dataset and predict outcomes based on the trends they have analyzed and identified.

Linear and Logistic Regression are two types of regression models that are similar but carry out their functions in distinct ways. They're also two fundamental techniques in machine learning that predict outcomes by analyzing previously provided data.

Both Linear and Logistic Regression are supervised learning models that appear intertwined – so distinguishing between them can be confusing, as they share the same notion of predicting outcomes based on the input variables.

But here's the main difference: Linear Regression focuses on predicting continuous values, while Logistic Regression is designed specifically for binary classification (Yes or No). So although they have similar-sounding names, there are key differences in their applications, equations, and objectives.

In this article, you'll learn about the similarities and differences between Linear and Logistic Regression, explore key characteristics of each, and learn how to choose between them.

How Linear and Logistic Regression Make Predictions
– Linear Regression
– Logistic Regression
What are the Similarities between Linear and Logistic Regression?
What are the Differences between Linear and Logistic Regression?
When to Use Linear vs Logistic Regression for Your Data Projects
What Are Other Types of Regression Models?
Conclusion

How Linear and Logistic Regression Make Predictions

Linear Regression

Linear regression is the simplest form of regression, assuming a linear (straight line) relationship between the input and the output variable. In simple terms, it harnesses the power of a straight line.

The equation for simple linear regression can be expressed as y = mx + b, where:

y is the dependent variable
x is the independent variable
m is the slope
and b is the intercept.

Linear regression graph (Source)

In a house price dataset, the independent variables are the columns used to predict the price of the house, such as “Area”, “Bedrooms”, “Age”, and “Location”. The dependent variable is the “Price” column, which is the value to be predicted.
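As a minimal sketch of the equation above, the following Python snippet applies y = mx + b to a single feature. The slope, intercept, and area values are made-up numbers chosen for illustration, not taken from a real housing dataset.

```python
# Hypothetical parameters, chosen only for illustration.
SLOPE = 150.0        # m: price increase per unit of area
INTERCEPT = 20000.0  # b: predicted price when the area is 0


def predict_price(area):
    """Apply y = m*x + b to one input value."""
    return SLOPE * area + INTERCEPT


for area in (50, 80, 120):
    print(area, predict_price(area))
```

In practice the slope and intercept are not chosen by hand but estimated from the data, typically by least squares.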

Logistic Regression

Logistic Regression is a supervised machine learning technique used for classification. It sorts outcomes into two groups by assuming a linear relationship between the features and the log-odds of the outcome, and then calculating the probability that the outcome falls into one group or the other.

The mathematical equation calculates an output based on the relationship and the output is then transformed using a sigmoid function to constrain it between 0 and 1. Here it is:

$$y = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n}}$$

Where:

y is the predicted probability of success for the categorical outcome
e is Euler's number, the base of the natural logarithm; the exponential function e^x is the inverse of ln(x)
β0 is the intercept, the value of the linear term when all independent input variables equal 0
β1 is the regression coefficient of the first independent variable (X1), which measures the impact of that variable on the outcome
βn is the regression coefficient of the last independent variable (Xn), when there are multiple input variables

Logistic Regression Graph (https://images.app.goo.gl/vfYBcVSrdvR2Mkki9)

This is commonly employed in areas like Spam detection and for medical diagnosis. It is used to interpret the likelihood of an observation falling into a specific class.
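The logistic equation can be sketched in a few lines of Python. The intercept, coefficients, and feature values below are hypothetical, and the 0.5 cut-off is a common default rather than a requirement.

```python
import math


def predict_probability(features, intercept, coefficients):
    """Logistic regression: squash a linear combination into (0, 1)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)


# Hypothetical spam model with two input features.
p = predict_probability([3, 1], intercept=-2.0, coefficients=[0.5, 1.0])
label = "spam" if p >= 0.5 else "not spam"
print(round(p, 3), label)
```

The function returns a probability; turning it into a Yes or No answer is a separate step that depends on the threshold you pick.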

What are the Similarities between Linear and Logistic Regression?

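Linear Relationship: Both models are built on a linear combination of the input features. Linear regression uses it directly as the output, while logistic regression assumes it is linearly related to the log-odds of the outcome.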
Supervised Learning: Both are supervised machine learning algorithms, meaning they require labeled training data.
Limitations: Both algorithms have similar limitations including:

Non-linear relationships between input and output variables will lead to inaccurate results.
Unclean data and missing values will lead to poor model performance. You can read more on data cleaning here.
Both models are prone to overfitting, which can be reduced with feature selection.

What are the Differences between Linear and Logistic Regression?

Output Type: Linear regression predicts continuous output (for example, the price of a house) on a straight line graph, while logistic regression predicts probabilities for binary classification (like if a patient has cancer or not) on an S-shaped curve.
Equation and Activation Function: Linear regression uses a simple linear equation while logistic regression uses the logistic function (sigmoid) to transform the output into probabilities.
Loss Function: Linear regression minimizes the sum of squared differences, while logistic regression minimizes the logistic loss (log loss).
Type of Supervised Learning : Linear regression is a regression model. Logistic regression is a classification model.
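To make the loss functions concrete, here is a small Python sketch of both. It averages the losses over the samples, which gives the same minimizer as summing them. The sample values are invented for illustration.

```python
import math


def mean_squared_error(actual, predicted):
    """Loss minimized by linear regression (averaged over the samples)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def log_loss(labels, probabilities):
    """Loss minimized by logistic regression; labels are 0 or 1."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probabilities)
    )
    return -total / len(labels)


print(mean_squared_error([1, 2, 3], [1, 2, 5]))
print(log_loss([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(log_loss([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

Log loss punishes confident wrong predictions much more heavily than hesitant ones, which is why it suits probability outputs.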

When to Use Linear vs Logistic Regression for Your Data Projects

You can use Linear Regression to solve problems where the relationship between variables can be reasonably approximated by a straight line. This means it's well-suited for understanding gradual changes or trends, rather than abrupt jumps or complex relationships. Some examples of these use-cases are:

House Price prediction
Identifying Relationships
Market Trends and Analysis
Business risk assessment
Scientific Research
Price Estimation
Understanding Impact

On the other hand, Logistic Regression is a powerful tool for understanding binary events and making predictions based on the features given. It excels in calculating the probability of an outcome being "Yes" or "No". This applies to a wide range of scenarios like:

Fraud Detection
Spam filter
Applications in Medicine
Customer Churn
Probability Estimation

What Are Other Types of Regression Models?

Linear and Logistic regression are not the only regression models available. There are other models you can use where linear and logistic regression fail:

Ridge regression is a regularization technique used to reduce the complexity of a model by introducing a small amount of bias. It makes the model less susceptible to overfitting.
Lasso regression is a regularization technique which also reduces the complexity of a model. It avoids overfitting by shrinking coefficients toward zero, and it can set some of them to exactly zero. This makes it particularly useful when feature selection is crucial.
Polynomial regression captures non-linear relationships using a curved line. It directly addresses a limitation of linear and logistic regression by modeling a non-linear relationship between variables.
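The shrinking effect of regularization can be illustrated with a one-feature ridge fit, where the slope has a simple closed form. This is a simplified sketch under the assumption of a single feature and an unpenalized intercept; the data and the penalty strength `alpha` are hypothetical.

```python
def ridge_slope(xs, ys, alpha):
    """Slope of a one-feature ridge fit; alpha=0 is ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty is added to the denominator, pulling the slope toward 0.
    return sxy / (sxx + alpha)


xs = [1, 2, 3, 4]  # made-up data lying exactly on y = 2x
ys = [2, 4, 6, 8]

for alpha in (0, 5, 50):
    print(alpha, ridge_slope(xs, ys, alpha))
```

A larger penalty gives a flatter line: the model trades a little bias for less sensitivity to the training data.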

Conclusion

Linear and logistic regression share the fundamental concept of a linear relationship between input variables and output variables. But their applications, mathematical equations, and use cases differ significantly.

Understanding these differences is crucial when choosing the appropriate model for a given problem.

This article has shed light on their inner workings and use cases, thereby equipping you to make the right and informed choice. Make sure you explore further to increase your knowledge and skills, and take the time to learn more complex machine learning models that will best fit your data problems.

If you found this helpful, you can connect with me on LinkedIn, my personal blog and on X (formerly Twitter).

Hours (X)	Topics Solved (Y)
1	1.5
1.2	2
1.5	3
2	1.8
2.3	2.7
2.5	4.7
2.7	7.1
3	10
3.1	6
3.2	5
3.6	8.9

Hours (X)	Topics Solved (Y)	(x - x̄)	(y - ȳ)	(x - x̄)*(y - ȳ)	(x - x̄)²
1	1.5	-1.37	-3.29	4.51	1.88
1.2	2	-1.17	-2.79	3.26	1.37
1.5	3	-0.87	-1.79	1.56	0.76
2	1.8	-0.37	-2.99	1.11	0.14
2.3	2.7	-0.07	-2.09	0.15	0.00
2.5	4.7	0.13	-0.09	-0.01	0.02
2.7	7.1	0.33	2.31	0.76	0.11
3	10	0.63	5.21	3.28	0.40
3.1	6	0.73	1.21	0.88	0.53
3.2	5	0.83	0.21	0.17	0.69
3.6	8.9	1.23	4.11	5.06	1.51

Calculating "a"

All that is left is a, for which the formula is ȳ = a + b x̄. We've already obtained all those other values, so we can substitute them and we get:

4.79 = a + 2.8*2.37
4.79 = a + 6.64
a = -6.64+4.79
a = -1.85
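The hand calculation can be double-checked with a short Python sketch that uses the same Hours and Topics Solved values from the table. The function name `least_squares` is only illustrative.

```python
hours = [1, 1.2, 1.5, 2, 2.3, 2.5, 2.7, 3, 3.1, 3.2, 3.6]
topics = [1.5, 2, 3, 1.8, 2.7, 4.7, 7.1, 10, 6, 5, 8.9]


def least_squares(xs, ys):
    """Return (a, b) for the line Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the (x - mean_x) * (y - mean_y) column from the table.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of the (x - mean_x) squared column from the table.
    denominator = sum((x - mean_x) ** 2 for x in xs)
    b = numerator / denominator
    a = mean_y - b * mean_x
    return a, b


a, b = least_squares(hours, topics)
print(f"Y = {a:.2f} + {b:.2f} * X")
```

Rounded to two decimals, this gives the same a and b as the manual steps.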

The result

Our final formula becomes:

Y = -1.85 + 2.8*X

Now we replace the X in our formula with each value that we have:

Hours (X)	-1.85 + 2.8 * X
1	0.95
1.2	1.51
1.5	2.35
2	3.75
2.3	4.59
2.5	5.15
2.7	5.71
3	6.55
3.1	6.83
3.2	7.11
3.6	8.23

Which is a graph that looks something like this:

We now have a line that represents how many topics we expect to be solved for each hour of study

If we want to predict how many topics we expect a student to solve with 8 hours of study, we replace it in our formula:

Y = -1.85 + 2.8*8
Y = 20.55
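The same substitution can be written as a tiny Python function. It hard-codes the rounded a and b from the worked example, so it is only a sketch of the prediction step.

```python
A = -1.85  # intercept from the worked example
B = 2.8    # slope from the worked example


def predict_topics(hours):
    """Expected number of topics solved after the given hours of study."""
    return A + B * hours


for h in (1, 3.6, 8):
    print(h, round(predict_topics(h), 2))
```

Values inside the range of the data (1 to 3.6 hours) are interpolations. The 8 hour case is an extrapolation and should be trusted less.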

And in a graph we can see:

The further into the future we predict, the less accuracy we should expect

Limitations

Always bear in mind the limitations of a method. This will hopefully help you avoid incorrect results.

And this method, like any other, has its limitations. Here are a couple:

It doesn't take into account the complexity of the topics solved. A topic covered at the start of the "Responsive Web Design Certification" will most likely take less time to learn and solve than doing one of the final projects. So if the data we have is from different starting points of a course, the predictions won't be accurate.
It's impossible for someone to study 240 hours continuously or to solve more topics than those available. Regardless, the method allows us to predict those values. At that point the method no longer gives accurate results, since it is predicting an impossibility.

Example JavaScript Project

Doing this by hand is not necessary. We can create our project where we input the X and Y values, it draws a graph with those points, and applies the linear regression formula.

The project folder will have the following contents:

src/
  |-public // folder with the content that we will feed to the browser
    |-index.html
    |-style.css
    |-least-squares.js
  package.json
  server.js // our Node.js server

And package.json:

{
  "name": "least-squares-regression",
  "version": "1.0.0",
  "description": "Visualize linear least squares",
  "main": "server.js",
  "scripts": {
    "start": "node server.js",
    "server-debug": "nodemon --inspect server.js"
  },
  "author": "daspinola",
  "license": "MIT",
  "devDependencies": {
    "nodemon": "2.0.4"
  },
  "dependencies": {
    "express": "4.17.1"
  }
}

Once we have the package.json and we run npm install we will have Express and nodemon available. You can switch them out for others as you prefer, but I use these out of convenience.

In server.js:

const express = require('express')
const path = require('path')

const app = express()

app.use(express.static(path.join(__dirname, 'public')))

app.get('/', function(req, res) {
  res.sendFile(path.join(__dirname, 'public/index.html'))
})

const PORT = 5000

app.listen(PORT, function () {
  console.log(`Listening on port ${PORT}!`)
})

This tiny server is made so we can access our page when we write in the browser localhost:5000. Before we run it let's create the remaining files:

public/index.html

<html>
  <head>
    <title>Least Squares Regression</title>
    <script src="https://cdn.jsdelivr.net/npm/chart.js@2.9.3/dist/Chart.min.js"></script>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div class="container">
      <div class="left-half">
        <div>
          <input type="number" class="input-x" placeholder="X">
          <input type="number" class="input-y" placeholder="Y">

          <button class="btn-update-graph">Add</button>
        </div>
        <div>
          <span class="span-formula"></span>
        </div>
        <div>
          <table class="table-pairs">
            <thead>
              <th>
                X
              </th>
              <th>
                Y
              </th>
            </thead>
            <tbody></tbody>
          </table>
        </div>
      </div>
      <div class="right-half">
        <canvas id="myChart"></canvas>
      </div>
    </div>
    <script src="least-squares.js"></script>
  </body>
</html>

We create our elements:

Two inputs for our pairs, one for X and one for Y
A button to add those values to a table
A span to show the current formula as values are added
A table to show the pairs we've been adding
And a canvas for our chart

We also import the Chart.js library with a CDN and add our CSS and JavaScript files.

public/style.css

.container {
  display: grid; 
}

.left-half {
  grid-column: 1;
}

.right-half {
  grid-column: 2;
}

We add some rules so we have our inputs and table to the left and our graph to the right. This takes advantage of CSS grid.

public/least-squares.js

document.addEventListener('DOMContentLoaded', init, false);

function init() {
  const currentData = {
    pairs: [],
    slope: 0,
    coeficient: 0,
    line: [],
  };

  const chart = initChart();
}

function initChart() {
  const ctx = document.getElementById('myChart').getContext('2d');

  return new Chart(ctx, {
    type: 'scatter',
    data: {
      datasets: [{
        label: 'Scatter Dataset',
        backgroundColor: 'rgb(125,67,120)',
        data: [],
      }, {
        label: 'Line Dataset',
        fill: false,
        data: [],
        type: 'line',
      }],
    },
    options: {
      scales: {
        xAxes: [{
          type: 'linear',
          position: 'bottom',
          display: true,
          scaleLabel: {
            display: true,
            labelString: '(X)',
          },
        }],
        yAxes: [{
          type: 'linear',
          position: 'left',
          display: true,
          scaleLabel: {
            display: true,
            labelString: '(Y)',
          },
        }],
      },
    },
  });
}

And finally, we initialize our graph. At the start, it should be empty since we haven't added any data to it just yet.

Now if we run npm run server-debug and open our browser on localhost:5000 we should see something like this:

Our inputs to the left with an add button, our table with just the headers X and Y, and to the right an empty graph

Adding functionality

The next step is to make the "Add" button do something. In our case we want to achieve:

Add the X and Y values to the table
Update the formula when we add more than one pair (we need at least 2 pairs to create a line)
Update the graph with the points and the line
Clean the inputs, just so it's easier to keep introducing data

Add the values to the table

public/least-squares.js

document.addEventListener('DOMContentLoaded', init, false);

function init() {
  const currentData = {
    pairs: [],
    slope: 0,
    coeficient: 0,
    line: [],
  };
  const btnUpdateGraph = document.querySelector('.btn-update-graph');
  const tablePairs = document.querySelector('.table-pairs');
  const spanFormula = document.querySelector('.span-formula');

  const inputX = document.querySelector('.input-x');
  const inputY = document.querySelector('.input-y');

  const chart = initChart();

  btnUpdateGraph.addEventListener('click', () => {
    const x = parseFloat(inputX.value);
    const y = parseFloat(inputY.value);

    updateTable(x, y);
  });

  function updateTable(x, y) {
    const tr = document.createElement('tr');
    const tdX = document.createElement('td');
    const tdY = document.createElement('td');

    tdX.innerHTML = x;
    tdY.innerHTML = y;

    tr.appendChild(tdX);
    tr.appendChild(tdY);

    tablePairs.querySelector('tbody').appendChild(tr);
  }
}

// ... rest of the code as it was

We get all of the elements we will use shortly and add an event on the "Add" button. That event will grab the current values and update our table visually.

We need to parse the amount since we get a string. It will be important for the next step when we have to apply the formula.

When we press add we should see the pairs on the table

Make the calculations

All the math we were talking about earlier (getting the average of X and Y, calculating b, and calculating a) should now be turned into code. We will also display the a and b values so we see them changing as we add values.

public/least-squares.js

// ... rest of the code as it was

btnUpdateGraph.addEventListener('click', () => {
  const x = parseFloat(inputX.value);
  const y = parseFloat(inputY.value);

  updateTable(x, y);
  updateFormula(x, y);
});

function updateFormula(x, y) {
  currentData.pairs.push({ x, y });
  const pairsAmount = currentData.pairs.length;

  const sum = currentData.pairs.reduce((acc, pair) => ({
    x: acc.x + pair.x,
    y: acc.y + pair.y,
  }), { x: 0, y: 0 });

  const average = {
    x: sum.x / pairsAmount,
    y: sum.y / pairsAmount,
  };

  const slopeDividend = currentData.pairs
    .reduce((acc, pair) => parseFloat(acc + ((pair.x - average.x) * (pair.y - average.y))), 0);
  const slopeDivisor = currentData.pairs
    .reduce((acc, pair) => parseFloat(acc + (pair.x - average.x) ** 2), 0);

  const slope = slopeDivisor !== 0
    ? parseFloat((slopeDividend / slopeDivisor).toFixed(2))
    : 0;

  const coeficient = parseFloat(
    (-(slope * average.x) + average.y).toFixed(2),
  );

  currentData.line = currentData.pairs
    .map((pair) => ({
      x: pair.x,
      y: parseFloat((coeficient + (slope * pair.x)).toFixed(2)),
    }));

  spanFormula.innerHTML = `Formula: Y = ${coeficient} + ${slope} * X`;
}

// ... rest of the code as it was

There isn't much to be said about the code here since it's all the theory that we've been through earlier. We loop through the values to get sums, averages, and all the other values we need to obtain the coefficient (a) and the slope (b).

The span so we can display the formula and see it change as we add values

We have the pairs and line in the currentData variable, so we use them in the next step to update our chart.

Update the graph and clean inputs

public/least-squares.js

// ... rest of the code as it was

btnUpdateGraph.addEventListener('click', () => {
  const x = parseFloat(inputX.value);
  const y = parseFloat(inputY.value);

  updateTable(x, y);
  updateFormula(x, y);

  updateChart();

  clearInputs();
});

function updateChart() {
  chart.data.datasets[0].data = currentData.pairs;
  chart.data.datasets[1].data = currentData.line;

  chart.update();
}

function clearInputs() {
  inputX.value = '';
  inputY.value = '';
}

// ... rest of the code as it was

Updating the chart and cleaning the inputs of X and Y is very straightforward. We have two datasets, the first one (position zero) is for our pairs, so we show the dot on the graph. The second one (position one) is for our regression line.

We have to grab our instance of the chart and call update so we see the new values being taken into account.

At least three values are needed so we can take any kind of information out of the graph

Adding some style

We can change our layout a bit so it's more manageable. Nothing major, it just serves as a reminder that we can update the UI at any point.

public/style.css

.container {
  display: grid; 
}

.left-half {
  grid-column: 1;
}

.right-half {
  grid-column: 2;
}

.pairs-style input[type="number"],
.pairs-style button {
  margin: 5px 0px;
}

.table-pairs {
  border-collapse: collapse;
  width: 100%;
}

.table-pairs td {
  text-align: center;
}

.table-pairs,
.table-pairs th,
.table-pairs td {
  margin: 10px 0px;
  border: 1px solid black;
}

public/index.html

<html>
  <head>
    <title>Least Squares Regression</title>
    <script src="https://cdn.jsdelivr.net/npm/chart.js@2.9.3/dist/Chart.min.js"></script>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div class="container">
      <div class="left-half">
        <div class="pairs-style">
          <div>
            <input type="number" class="input-x" placeholder="X">
          </div>
          <div>
            <input type="number" class="input-y" placeholder="Y">
          </div>
          <button class="btn-update-graph">Add</button>
        </div>
        <div>
          <span class="span-formula">Formula: Y = a + b * X</span>
        </div>
        <div>
          <table class="table-pairs">
            <thead>
              <th>
                X
              </th>
              <th>
                Y
              </th>
            </thead>
            <tbody></tbody>
          </table>
        </div>
      </div>
      <div class="right-half">
        <canvas id="myChart"></canvas>
      </div>
    </div>
    <script src="least-squares.js"></script>
  </body>
</html>

Not a big change, but at least the elements are a bit better aligned

Proof of Concept

We add the same values as earlier in the theory and obtain the same graph and formula! :D

Final remarks

For brevity's sake, I cut out a lot that can be taken as an exercise to vastly improve the project. For example:

Add checks for empty values and the like
Make it so we can remove data that we wrongly inserted
Add an input for X or Y and apply the current data formula to "predict the future", similar to the last example of the theory

Regardless, predicting the future is a fun concept even if, in reality, the most we can hope to predict is an approximation based on past data points.

It's a powerful formula and if you build any project using it I would love to see it.

I hope this article was helpful to serve as an introduction to this concept. The code used in the article can be found in my GitHub here.

See you in the next one, in the meantime, go code something!

How I Used Regression Analysis to Analyze Life Expectancy with Scikit-Learn and Statsmodels

freeCodeCamp — Thu, 19 Mar 2020 17:25:29 +0000

By Black Raven

In this article, I will use some data related to life expectancy to evaluate the following models: Linear, Ridge, LASSO, and Polynomial Regression. So let's jump right in.

I was exploring the dengue trend in Singapore where there has been a recent spike in dengue cases – especially in the Dengue Red Zone where I am living. However, the raw data was not available on the NEA website.

I was wondering, has dengue affected the life expectancy of people in any country in particular? Do people in rich nations live longer? What are the factors affecting life expectancy of a country?

So I explored life expectancy and looked for data on the following aspects (features):

Birth Rate
Cancer Rate
Dengue Cases
Environmental Performance Index (EPI)
Gross Domestic Product (GDP)
Health Expenditure
Heart Disease Rate
Population
Area
Population Density
Stroke Rate

The target is Life Expectancy, measured in number of years.

The assumptions are:

These are country level averages
There is no distinction between male and female

The Python code is available on my GitHub.

Data Science Process

I have used the following data science process in my analysis:

data collection, data cleaning, Exploratory Data Analysis
feature selection, feature engineering
model selection, model tuning and hyperparameter tuning
model optimization based on selected performance metric

Tools used for this analysis include:

Python libraries, particularly Numpy and Pandas for manipulating data structures
Matplotlib and Seaborn for visualization
Scikit-Learn and Statsmodels for regression analysis

Exploratory Data Analysis

First I check for multi-collinearity between features.

sns.set(rc={'figure.figsize':(10,7)})
sns.heatmap(df.corr(), cmap="seismic", annot=True, vmin=-1, vmax=1)

There seems to be some strong collinearity, denoted by boxes in dark red and dark blue as you can see in the image below.

For example, countries that spend more on healthcare have a higher EPI score. When health expenditures are higher, the stroke rate is also lower. And a larger area goes along with a higher population.
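Each cell of the heatmap is a Pearson correlation coefficient, which is what `df.corr()` computes by default. The following sketch shows that calculation for a single pair of columns in plain Python. The two small lists are invented numbers, not values from the life expectancy dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: health expenditure and stroke rate.
health_expenditure = [2.0, 4.5, 6.0, 8.5, 11.0]
stroke_rate = [120, 95, 80, 60, 35]

print(round(pearson(health_expenditure, stroke_rate), 3))
```

Values close to -1 or 1 correspond to the dark blue and dark red boxes, and are the pairs to watch for multi-collinearity.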

How about the correlation between features and target?
To live a long life, you should have a low stroke rate, high health expenditure, take good care of the environment, and have fewer babies (according to the correlation chart).

Let’s look at the initial pair plot.

sns.pairplot(df, height=1.5, aspect=1.5)

There seems to be a need to remove outliers in many features, for example, Dengue Cases, GDP, Population, Area, and Population Density.

Each outlier is replaced by the next highest value in the column. After removing the outliers, the plots are still skewed to the right (points are very concentrated on the left side). So this suggests that some transformation might be needed.

Another way to handle outliers is to use the LOG function, which compresses very large values and so helps to spread the concentrated data to the right.
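The effect of a log transformation on a right-skewed feature can be sketched as follows. The values are hypothetical, and the ratio of the maximum to the median is used here only as a rough, informal indicator of skew.

```python
import math
import statistics

# Hypothetical right-skewed feature with one very large value.
values = [1, 2, 3, 5, 8, 1400]
logged = [math.log(v) for v in values]


def max_to_median(data):
    """How far the largest value sits above the median."""
    return max(data) / statistics.median(data)


print(round(max_to_median(values), 1))  # raw scale
print(round(max_to_median(logged), 1))  # log scale
```

On the log scale the largest value is no longer hundreds of times the median, so a single extreme point has much less pull on the fit. Note that the logarithm is only defined for positive values.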

Feature Selection

To look for significant features, I dropped one feature at a time to see its impact on the simple regression model. Looking at the R² Score, these 3 features (Birth Rate, EPI, Stroke Rate) are chosen, because the model will be adversely affected without them.

Next, I removed outliers and reviewed the p-values in Statsmodels. I gained one more significant feature (Population Density). When the p-value of a feature is less than 0.05, it is considered a good feature, as I have chosen 5% as the significance level.

After that, I applied LOG functions to all features, and gained 4 more significant features (GDP, Heart Disease Rate, Population, and Area).

I have also done other transformations (Reciprocal, Power 2, Square Root) but there is no more improvement.

Features can also be selected using the LassoCV class in Scikit-Learn.

Finally I looked at the pair plot again with all significant features. The scatter plots are now nicely spread out with some clear trends.

Model Selection

I am now ready to fit the following models on the train data set:

Linear Regression (a straight line which approximates the relationship between the independent variables and the dependent target variable)
Ridge Regression (this reduces model complexity while keeping all coefficients in the model, known as L2 penalty)
LASSO Regression (Least Absolute Shrinkage and Selection Operator reduces model complexity by shrinking some model coefficients to exactly zero, known as L1 penalty)
Degree 2 Polynomial Regression (a curved line to approximate the relationship between the independent variables and the dependent target variable)

I have also validated their performance on the validation data set. The simple linear regression model seems to have the potential to be the best performing model.

This is confirmed by Cross Validation using KFold (with 5 splits).

Finally, I checked the residuals against the model assumptions. The residuals should be normally distributed with equal variance around a mean of zero. The normal quantile-quantile (Q-Q) plot also looks acceptably normal.

Since I only have 250 rows (data limited by the number of countries in the world), I used the entire data set to simulate the test data set (note: this is done for academic purpose, not practical as it will lead to data leakage). I used KFold Cross Validation with 10 splits to evaluate the model performance.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state = 1)
lm = LinearRegression()
lm.fit(X_train, y_train)
# cross_val_score fits a fresh copy of lm on each fold of X and y
cvs_lm = cross_val_score(lm, X, y, cv=kf, scoring='r2')
print(cvs_lm)

There is quite a bit of variation in the R² values from 0.49 to 0.82, but the average result is around 0.69, which is quite satisfactory.
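To show why the R² values differ from fold to fold, here is a plain Python sketch of k-fold cross validation for a one-feature linear model. It is a simplified stand-in for what `KFold` and `cross_val_score` do, not Scikit-Learn's implementation, and the data is synthetic.

```python
import random


def fit_line(xs, ys):
    """Least squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_score(ys, preds):
    """1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def k_fold_r2(xs, ys, k, seed=1):
    """Shuffle the rows, split them into k folds, score each held-out fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in held_out]
        scores.append(r2_score([ys[i] for i in held_out], preds))
    return scores


# Synthetic data: y = 2x + 1 with a small alternating wobble.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

scores = k_fold_r2(xs, ys, k=5)
print([round(s, 3) for s in scores])
print(round(sum(scores) / len(scores), 3))
```

Each fold is scored on rows the model did not see during fitting, so the spread of the scores hints at how stable the model is across different subsets of the data.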

How do we interpret the model?

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('df3.csv')
X = df[ ['Birth Rate', 'EPI', 'GDP', 'Heart Disease Rate', 'Population', 'Area', 'Pop Density', 'Stroke Rate'] ].astype(float)
X = np.log(X)  # log-transform every feature
y = df[ "Life Expectancy" ].astype(float)
X = sm.add_constant(X)  # add the intercept column

model = sm.OLS(y, X)
results = model.fit()
results.summary()

If you're unaffected by the features, your life expectancy is 62 years. If your country has low birth rate, add 5 more years to your life. If the EPI (Environment Performance Index) is high, add 8 more years to your life. If you live in a rich country, add half a year to your life. Finally for every unit (or rather LOG unit) decrease in stroke rate, 5 more years could be added to your life.
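Because the features were log-transformed, a coefficient describes the effect of a change in the natural log of a feature, not in the feature itself. The sketch below assumes a coefficient of about -5 on the log of stroke rate, which is an approximation read from the interpretation above rather than the exact value in the summary table. The stroke rate values are hypothetical.

```python
import math

# Assumed coefficient on log(Stroke Rate), approximated from the text.
STROKE_COEF = -5.0


def change_in_prediction(coef, old_value, new_value):
    """Change in the predicted target when a log-transformed feature moves."""
    return coef * (math.log(new_value) - math.log(old_value))


# Halving the stroke rate.
print(round(change_in_prediction(STROKE_COEF, 100, 50), 2))
# Dropping by one full log unit, i.e. dividing the stroke rate by e.
print(round(change_in_prediction(STROKE_COEF, 100, 100 / math.e), 2))
```

One log unit is a large change: it means dividing the stroke rate by roughly 2.72. Halving it moves the prediction by only about 69% of the coefficient.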

Next Steps

I could possibly collect more data by expanding the scope to cities instead of countries, and exploring other features (factors) affecting life expectancy. Also, I could split the data into male and female categories for such life expectancy regression analysis.

To conclude, here are some interesting insights:

1. Japan has the highest life expectancy (83.7 years). Central African Republic (49.5 years) and many countries in the African continent are at the bottom of scale. Singapore is ranked #5 (82.7 years).

2. Take good care of the environment. This has the largest coefficient (impact) on a country’s life expectancy.

The Python code for the above analysis is available on my GitHub – do feel free to refer to it.

https://github.com/JNYH/Project-Luther

Video presentation: https://youtu.be/gC2m_lvouu8

Thank you for reading.

How to read a Regression Table

freeCodeCamp — Sun, 31 Mar 2019 20:25:40 +0000

By Sharad Vijalapuram

What is regression?

Regression is one of the most important and commonly used data analysis processes. Simply put, it is a statistical method that explains the strength of the relationship between a dependent variable and one or more independent variable(s).

A dependent variable could be a variable or a field you are trying to predict or understand. An independent variable could be the fields or data points that you think might have an impact on the dependent variable.

In doing so, it answers a few important questions:

What variables matter?
To what extent do these variables matter?
How confident are we about these variables?

Let’s take an example…

To better explain the numbers in the regression table, I thought it would be useful to use a sample dataset and walk through the numbers and their importance.

I’m using a small dataset that contains GRE (a test that students take to be considered for admittance in Grad schools in the US) scores of 500 students and their chance of admittance into a university.

Because chance of admittance depends on GRE score, chance of admittance is the dependent variable and GRE score is the independent variable.

Scatterplot of GRE scores and chance of admittance

Regression line

Drawing a straight line that best describes the relationship between the GRE scores of students and their chances of admittance gives us the linear regression line. This is known as the trend line in various BI tools. The basic idea behind drawing this line is to minimize the vertical distance between each data point and the point on the regression line at the same x-coordinate.

Scatterplot with a regression line.

The regression line makes it easier for us to represent the relationship. It is based on a mathematical equation that associates the x-coefficient and y-intercept.

Y-intercept is the point at which the line intersects the y-axis at x = 0. It is also the value the model would take or predict when x is 0.

A coefficient gives the impact or weight of a variable in the model. In other words, it is the amount of change in the dependent variable for a unit change in the independent variable.

Calculating the regression line equation

In order to find out the model’s y-intercept, we extend the regression line far enough until it intersects the y-axis at x = 0. This is our y-intercept and it is around -2.5. The number might not really make sense for the data set we are working on but the intention is to only show the calculation of y-intercept.

Calculating the y-intercept

The coefficient for this model will just be the slope of the regression line and can be calculated by getting the change in the admittance over the change in GRE scores.

Calculating the slope

In the example above, the coefficient would just be

m = (y2-y1) / (x2-x1)

And in this case, it would be close to 0.01.

The formula y = m*x + b gives us the equation of our regression line. Substituting the values for the y-intercept and slope we got from extending the regression line, we can formulate the equation:

y = 0.01x - 2.48

-2.48 is a more accurate y-intercept value I got from the regression table as shown later in this post.

This equation lets us predict a student's chance of admittance when their GRE score is known.
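As a quick illustration, the equation can be turned into a small prediction function. The slope and intercept are the rounded values quoted above. The function name predict_chance and the three example scores are chosen here for illustration only.

```python
SLOPE = 0.01        # coefficient of the GRE score, as quoted in the article
INTERCEPT = -2.48   # y-intercept, as quoted in the article

def predict_chance(gre_score):
    """Predicted chance of admittance from y = 0.01x - 2.48."""
    return SLOPE * gre_score + INTERCEPT

# Example GRE scores picked for illustration.
for score in (300, 320, 340):
    print(score, round(predict_chance(score), 2))
```

Because the slope is rounded to 0.01, these predictions are only approximate.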

Now that we have the basics, let’s jump onto reading and interpreting a regression table.

Reading a regression table

The regression table can be roughly divided into three components:

Analysis of Variance (ANOVA): provides the analysis of the variance in the model, as the name suggests.
Regression statistics: provide numerical information on the variation and how well the model explains the variation for the given data/observations.
Residual output: provides, for each data point, the value predicted by the model and the difference between the actual observed value of the dependent variable and that predicted value.

Analysis of Variance (ANOVA)

ANOVA table

Degrees of freedom (df)

Regression df is the number of independent variables in our regression model. Since we only consider GRE scores in this example, it is 1.

Residual df is the total number of observations (rows) in the dataset minus the number of parameters being estimated. In this example, both the GRE score coefficient and the constant are estimated.

Residual df = 500 - 2 = 498

Total df — is the sum of the regression and residual degrees of freedom, which equals the size of the dataset minus 1.

Sum of Squares (SS)

Regression line with the mean of the dataset in red.

Regression SS is the total variation in the dependent variable that is explained by the regression model. It is the sum, over all data points, of the squared difference between the predicted value and the mean of the dependent variable.

∑ (ŷ - ȳ)²

From the ANOVA table, the regression SS is 6.5 and the total SS is 9.9, which means the regression model explains about 6.5/9.9 (around 65%) of all the variability in the dataset.

Residual SS — is the total variation in the dependent variable that is left unexplained by the regression model. It is also called the Error Sum of Squares and is the sum of the square of the difference between the actual and predicted values of all the data points.

∑ (y - ŷ)²

From the ANOVA table, the residual SS is about 3.4. In general, the smaller the error, the better the regression model explains the variation in the data set and so we would usually want to minimize this error.

Total SS is the sum of the regression SS and the residual SS. It measures how much the chance of admittance varies around its mean when the GRE scores are NOT taken into account.
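The three sums of squares can be computed directly from their definitions. The sketch below uses a small made-up dataset (not the article's 500 rows) and assumes a least-squares fit, so that the regression SS and residual SS add up to the total SS.

```python
from statistics import mean

# Hypothetical data for illustration only.
gre = [300, 310, 320, 330, 340]
chance = [0.50, 0.62, 0.70, 0.83, 0.90]

# Least-squares slope and intercept.
x_bar, y_bar = mean(gre), mean(chance)
sxx = sum((x - x_bar) ** 2 for x in gre)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(gre, chance)) / sxx
intercept = y_bar - slope * x_bar
predicted = [slope * x + intercept for x in gre]

# Explained variation: predicted value vs. mean.
regression_ss = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
# Unexplained variation: actual value vs. predicted value.
residual_ss = sum((y - y_hat) ** 2 for y, y_hat in zip(chance, predicted))
# Total variation: actual value vs. mean.
total_ss = sum((y - y_bar) ** 2 for y in chance)

print(f"regression SS = {regression_ss:.5f}")
print(f"residual SS   = {residual_ss:.5f}")
print(f"total SS      = {total_ss:.5f}")
```

Up to floating-point rounding, the first two printed values sum to the third.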

Mean Squares (MS) are the sums of squares divided by their degrees of freedom, for both regression and residuals.

Regression MS = ∑ (ŷ - ȳ)²/Reg. df

Residual MS = ∑ (y - ŷ)²/Res. df

F — is used to test the hypothesis that the slope of the independent variable is zero. Mathematically, it can also be calculated as

F = Regression MS / Residual MS

To judge significance, the F-statistic is compared to an F distribution with the regression df as the numerator degrees of freedom and the residual df as the denominator degrees of freedom.

Significance F — is nothing but the p-value for the null hypothesis that the coefficient of the independent variable is zero and as with any p-value, a low p-value indicates that a significant relationship exists between dependent and independent variables.
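The degrees of freedom, mean squares, and F-statistic can be reproduced with plain arithmetic. The sketch below uses the rounded sums of squares quoted in this article (6.5 and 3.4), so the resulting F is only an approximation of the value in the ANOVA table. Significance F is not computed here because it needs an F distribution, which the Python standard library does not provide.

```python
n_obs = 500          # rows in the dataset, from the article
n_predictors = 1     # only the GRE score

# Degrees of freedom.
regression_df = n_predictors
residual_df = n_obs - n_predictors - 1   # the extra 1 is for the constant
total_df = regression_df + residual_df

# Rounded sums of squares as quoted in the article.
regression_ss = 6.5
residual_ss = 3.4

# Mean squares: sum of squares divided by its degrees of freedom.
regression_ms = regression_ss / regression_df
residual_ms = residual_ss / residual_df

# F-statistic: explained variation per df over unexplained variation per df.
f_stat = regression_ms / residual_ms

print(f"df: regression={regression_df}, residual={residual_df}, total={total_df}")
print(f"F = {f_stat:.1f}")
```

An F this large relative to 1 is consistent with the very small Significance F described above.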

Standard Error is the estimated standard deviation of the sampling distribution of a coefficient. It is the amount by which the coefficient estimate would vary from sample to sample. A coefficient much greater than its standard error suggests that the true coefficient is unlikely to be 0.

t-Stat — is the t-statistic or t-value of the test and its value is equal to the coefficient divided by the standard error.

t-Stat = Coefficients/Standard Error

Again, the larger the coefficient relative to its standard error, the larger the t-Stat and the stronger the evidence that the coefficient is different from 0.

The p-value is determined by comparing the t-statistic with the t distribution. We usually only consider the p-value of the independent variable. It is the probability of obtaining a t-statistic at least as extreme as the one observed if the slope of the regression line were actually zero.

A p-value below 0.05 means that the hypothesis of a zero slope can be rejected at the 5% significance level (often described as 95% confidence), and hence there is a significant linear relationship between the dependent and independent variables.

A p-value greater than 0.05 indicates that the slope of the regression line may be zero and that there is not sufficient evidence at the 95% confidence level that a significant linear relationship exists between the dependent and independent variables.

Since the p-value of the independent variable GRE score is very close to 0, we can be extremely confident that there is a significant linear relationship between GRE scores and the chance of admittance.

Lower and Upper 95% are the bounds of the 95% confidence interval for each coefficient. Since we mostly use a sample of data to estimate the regression line and its coefficients, they are approximations of the true coefficients and, in turn, of the true regression line.

Since the 95% confidence interval for the GRE score coefficient runs from 0.009 to 0.01, it does not contain zero, and so we can be 95% confident that there is a significant linear relationship between GRE scores and the chance of admittance.

Please note that a confidence level of 95% is widely used, but a level other than 95% is possible and can be set up during regression analysis.
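To make the coefficient columns concrete, here is a minimal sketch that computes the standard error of the slope, its t-Stat, and a 95% confidence interval. It uses a small made-up dataset rather than the article's data, the standard simple-regression formula for the slope's standard error, and a critical t value hard-coded from a t table for 3 degrees of freedom. The p-value is left out because it needs a t distribution, which is not in the Python standard library.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only (not the article's 500-row dataset).
gre = [300, 310, 320, 330, 340]
chance = [0.50, 0.62, 0.70, 0.83, 0.90]

# Least-squares slope and intercept.
x_bar, y_bar = mean(gre), mean(chance)
sxx = sum((x - x_bar) ** 2 for x in gre)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(gre, chance)) / sxx
intercept = y_bar - slope * x_bar

# Residual mean square: residual SS divided by residual df (n - 2).
residual_ss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(gre, chance))
residual_ms = residual_ss / (len(gre) - 2)

# Standard error of the slope and its t-statistic.
slope_se = sqrt(residual_ms / sxx)
t_stat = slope / slope_se

# Two-sided 95% critical t value for 3 degrees of freedom, from a t table.
T_CRIT_95_DF3 = 3.182
lower_95 = slope - T_CRIT_95_DF3 * slope_se
upper_95 = slope + T_CRIT_95_DF3 * slope_se

print(f"slope = {slope:.4f}, standard error = {slope_se:.6f}, t-Stat = {t_stat:.2f}")
print(f"95% interval: {lower_95:.4f} to {upper_95:.4f}")
```

For this made-up data the interval lies entirely above zero, which mirrors the reasoning applied to the GRE score coefficient above.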

Regression Statistics

Regression Statistics table

R² (R Square) represents the power of a model. It shows how much of the variation in the dependent variable is explained by the independent variable, and it always lies between 0 and 1. As R² increases, more of the variation in the data is explained by the model and the better the model gets at prediction. A low R² would indicate that the model doesn’t fit the data well and that the independent variable doesn’t explain the variation in the dependent variable well.

R² = Regression Sum of Squares/Total Sum of Squares

However, R² cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots, which are discussed later in this article.

R² also does not indicate whether a regression model is adequate. You can have a low R² value for a good model, or a high R² value for a model that does not fit the data.

R², in this case, is 65%, which implies that the GRE scores can explain 65% of the variation in the chance of admittance.

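Adjusted R² is R² adjusted for the number of independent variables in the model: Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of independent variables. It is used when comparing regression models with different independent variables, and it comes in handy when deciding on the right independent variables in multiple regression models.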

Multiple R — is the positive square root of R²

Standard Error here is different from the standard error of the coefficients. It is the estimated standard deviation of the error of the regression equation and is a good measure of the accuracy of the regression line. It is the square root of the residual mean square.

Std. Error = √(Res.MS)
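The regression statistics can be recomputed from the ANOVA numbers. This sketch uses the rounded sums of squares quoted earlier in this article (6.5 and 3.4) and the standard adjusted R² formula, so the results are approximations of the values in the regression statistics table.

```python
from math import sqrt

n_obs = 500          # rows in the dataset, from the article
n_predictors = 1     # only the GRE score

# Rounded sums of squares as quoted in the article.
regression_ss = 6.5
residual_ss = 3.4
total_ss = regression_ss + residual_ss

# Share of the total variation explained by the model.
r_squared = regression_ss / total_ss

# Adjusted R² penalizes R² for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Multiple R is the positive square root of R².
multiple_r = sqrt(r_squared)

# Standard error of the regression: square root of the residual mean square.
standard_error = sqrt(residual_ss / (n_obs - n_predictors - 1))

print(f"R² = {r_squared:.3f}, adjusted R² = {adjusted_r_squared:.3f}")
print(f"Multiple R = {multiple_r:.3f}, standard error = {standard_error:.4f}")
```

With a single predictor and 500 observations, the adjustment is tiny, so adjusted R² is only slightly below R².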

Residual Output

Residuals are the differences between the actual values and the values predicted by the regression model. The residual output lists, for each data point, the value of the dependent variable predicted by the model and the corresponding residual.

As the name suggests, a residual plot is a scatter plot of the residuals against the independent variable, which in this case is the GRE score of each student.

A residual plot is important for detecting things like heteroscedasticity, non-linearity, and outliers. The process of detecting them is not discussed in this article, but the fact that the residual plot for our example shows randomly scattered data supports the assumption that the relationship between the variables in this model is linear.

Residual Plot
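A residual output table is straightforward to build once the line is fitted. The sketch below uses a small made-up dataset and a least-squares fit, and prints the predicted value and residual for each point. These (GRE score, residual) pairs are what a residual plot would show.

```python
from statistics import mean

# Hypothetical data for illustration only.
gre = [300, 310, 320, 330, 340]
chance = [0.50, 0.62, 0.70, 0.83, 0.90]

# Least-squares slope and intercept.
x_bar, y_bar = mean(gre), mean(chance)
sxx = sum((x - x_bar) ** 2 for x in gre)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(gre, chance)) / sxx
intercept = y_bar - slope * x_bar

# One row per data point: observed value, predicted value, residual.
rows = []
for x, y in zip(gre, chance):
    y_hat = slope * x + intercept
    rows.append((x, y, y_hat, y - y_hat))

print("GRE  actual  predicted  residual")
for x, y, y_hat, residual in rows:
    print(f"{x}  {y:.2f}    {y_hat:.3f}      {residual:+.3f}")

# With an intercept in the model, least-squares residuals sum to (almost) zero.
residual_sum = sum(residual for *_, residual in rows)
```

Residuals that switch sign without a visible pattern, as they do here, are what "scattered randomly" looks like in a residual plot.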

Intent

The intent of this article is not to build a working regression model but to walk through all the regression variables and their importance, using a sample data set in a regression table.

Although this article provides an explanation with a single variable linear regression as an example, please be aware that some of these variables could have more importance in the cases of multi-variable or other situations.
