<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Tiffany Mojo Omondi - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Tiffany Mojo Omondi - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:23:53 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/tiffanymojowrites/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Create Boxplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-boxplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">69693680d6f0e208b327d21c</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 18:48:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768418231372/f36e1cca-eed9-4620-bd7c-19788d8beafe.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.</p>
<p>By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-boxplots">How to Use Boxplots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before you begin, you should be comfortable with the following:</p>
<ul>
<li><p>Basic R syntax (variables, functions, data frames).</p>
</li>
<li><p>Installing and loading R packages.</p>
</li>
<li><p>Understanding what rows and columns represent in a dataset.</p>
</li>
<li><p>Very basic statistics (mean, median, distributions).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Start by installing and loading the packages you will need.</p>
<pre><code class="lang-r">install.packages(c(<span class="hljs-string">"tidyverse"</span>, <span class="hljs-string">"ggplot2"</span>))
<span class="hljs-keyword">library</span>(tidyverse)
<span class="hljs-keyword">library</span>(ggplot2)
</code></pre>
<p><code>tidyverse</code> provides tools for data manipulation and visualization. <code>ggplot2</code>, which is also part of the <code>tidyverse</code>, is the visualization engine you will use for boxplots. Loading the libraries makes their functions available for use.</p>
<h2 id="heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</h2>
<p>First, download the <a target="_blank" href="https://www.kaggle.com/datasets/saadharoon27/hr-analytics-dataset">HR Analytics dataset by Saad Haroon from Kaggle</a>.</p>
<p>Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load it into R by passing that file path to <code>read.csv</code>.</p>
<p>You can view a sample of the dataset by running the <code>head</code> function. To view the structure of the dataset, you can run the <code>str</code> function.</p>
<pre><code class="lang-r">hr &lt;- read.csv(<span class="hljs-string">"C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv"</span>)
head(hr)
str(hr)
</code></pre>
<p>The <code>read.csv</code> function imports the dataset into R. The <code>head</code> function shows the first six rows so you can preview the data. The <code>str</code> function reveals data types, helping you spot categorical versus numeric variables early.</p>
<p>Remember that understanding your data structure early prevents errors later when plotting or modeling.</p>
<p>After you run the <code>head</code> function, your console shows the first six rows, from which you can gather the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768489839861/f304305e-b889-4e25-8315-ff24c5201681.png" alt="first-six-rows-of-a-hr-dataset-shown-in-the-r-console" class="image--center mx-auto" width="1753" height="347" loading="lazy"></p>
<h3 id="heading-structure">Structure</h3>
<ul>
<li><p>Each row represents <strong>one employee</strong>.</p>
</li>
<li><p>Each column represents a <strong>feature/variable</strong> about the employee.</p>
</li>
</ul>
<h3 id="heading-key-columns-amp-meaning">Key Columns &amp; Meaning</h3>
<ul>
<li><p><code>EmpID</code> → Employee identifier</p>
</li>
<li><p><code>Age</code> → Age in years</p>
</li>
<li><p><code>AgeGroup</code> → Age category (for example, <code>18-25</code>)</p>
</li>
<li><p><code>Attrition</code> → Whether the employee left or not (<code>Yes/No</code>)</p>
</li>
<li><p><code>BusinessTravel</code> → Travel frequency (<code>Travel_Rarely</code>, <code>Travel_Frequently</code>, <code>Non-Travel</code>)</p>
</li>
<li><p><code>Department</code> → Employee department</p>
</li>
<li><p><code>DistanceFromHome</code> → Distance from home to office (km)</p>
</li>
<li><p><code>Education</code> / <code>EducationField</code> → Level and field of education</p>
</li>
<li><p><code>EmployeeCount</code> → Usually 1 per employee (redundant)</p>
</li>
<li><p><code>Gender</code> → Male / Female</p>
</li>
<li><p><code>JobRole</code> / <code>JobSatisfaction</code> → Job title and satisfaction level</p>
</li>
<li><p><code>MonthlyIncome</code> / <code>SalarySlab</code> → Salary amount and category</p>
</li>
<li><p><code>YearsAtCompany</code> / <code>YearsInCurrentRole</code> → Experience metrics</p>
</li>
<li><p><code>OverTime</code> → Works overtime (<code>Yes/No</code>)</p>
</li>
<li><p>Other features: <code>PerformanceRating</code>, <code>TrainingTimesLastYear</code>, <code>WorkLifeBalance</code>, <code>StockOptionLevel</code>, and so on.</p>
</li>
</ul>
<h3 id="heading-data-types"><strong>Data Types</strong></h3>
<ul>
<li><p><strong>Numeric</strong> → <code>Age</code>, <code>DistanceFromHome</code>, <code>MonthlyIncome</code>, <code>YearsAtCompany</code></p>
</li>
<li><p><strong>Categorical / Character</strong> → <code>Attrition</code>, <code>Gender</code>, <code>Department</code>, <code>JobRole</code></p>
</li>
</ul>
<h3 id="heading-observations"><strong>Observations</strong></h3>
<ul>
<li><p>The dataset is tabular, like a spreadsheet.</p>
</li>
<li><p>There are multiple categorical columns.</p>
</li>
<li><p>There are multiple numeric columns.</p>
</li>
<li><p>Some columns seem redundant or constant: they hold the same value in every row and so provide no useful information (for example, <code>EmployeeCount</code>).</p>
</li>
</ul>
<p>From the <code>str</code> function, you can gather that:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768488901453/80d8cae9-d569-4749-8028-0a6e9cc128c4.png" alt="r-output-showing-structure-of-hr-dataset" class="image--center mx-auto" width="1046" height="612" loading="lazy"></p>
<p>The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.</p>
<p>Each column has a name, data type, and example values. For instance, <code>Age</code> and <code>DistanceFromHome</code> are numeric (<code>int</code>), with values like 28 or 12. <code>EmpID</code> and <code>Department</code> are character strings (<code>chr</code>), with examples like Research &amp; Development or Sales. Other features include <code>JobRole</code> (Analyst, Manager) and <code>Attrition</code> (Yes/No).</p>
<p>The dataset contains mixed data types. Some columns are numeric, such as <code>MonthlyIncome</code> or <code>YearsAtCompany</code>. Some are character or categorical, like <code>Gender</code> (Male/Female) and <code>BusinessTravel</code> (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, <code>EmployeeCount</code> has the same value of 1 for all rows and does not provide useful information.</p>
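<p>The check for constant columns can also be done in code rather than by eye. The following is a minimal sketch on a small made-up data frame (<code>toy</code> and its values are illustrative assumptions, not rows from the Kaggle file). It flags every column that holds a single distinct value.</p>
<pre><code class="lang-r"># Toy data frame standing in for the HR data (values are made up)
toy &lt;- data.frame(
  Age = c(28, 35, 41),
  Department = c("Sales", "HR", "Sales"),
  EmployeeCount = c(1, 1, 1)
)

# TRUE for columns where every row has the same value
is_constant &lt;- sapply(toy, function(col) length(unique(col)) == 1)

# Names of the columns that carry no information
constant_cols &lt;- names(toy)[is_constant]
constant_cols
</code></pre>
<p>On the real data, you could run the same <code>sapply</code> call on <code>hr</code> to list candidate columns to drop.</p>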
<h2 id="heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</h2>
<p>Before visualization, you must clean your data. To find out what needs cleaning, start by investigating the data.</p>
<p>Run the <code>summary</code> function to view summary statistics for each column. Then combine <code>is.na</code> with <code>colSums</code> to count the missing values in each column.</p>
<pre><code class="lang-r">summary(hr)
colSums(is.na(hr))
</code></pre>
<p>The <code>summary</code> function gives quick statistics that help you spot suspicious values. The <code>is.na</code> function marks missing entries, and <code>colSums</code> adds them up per column. Boxplots are sensitive to extreme values, so knowing what you are working with is critical.</p>
<p>After running the <code>summary</code> function, the following will appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490404469/ef3bd30d-c3c9-4cf0-9c91-80a0e56f52f5.png" alt="r-summary-output-of-hr-dataset-showing-statistical-distributions" class="image--center mx-auto" width="1778" height="495" loading="lazy"></p>
<p>This shows the basic statistics of each column. After running <code>colSums(is.na(hr))</code>, the following will also appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490678134/00a12c24-224e-4c8f-80ee-bc7bbd4d8ca6.png" alt="r-output-showing-missing-value-counts-per-column-in-hr-dataset" class="image--center mx-auto" width="1832" height="198" loading="lazy"></p>
<p>From this output, you can see that only <code>YearsWithCurrManager</code> has missing values: <code>57</code> of them, meaning that <strong>57 employees</strong> don’t have a value for this column.</p>
<p>You can drop this whole column along with the redundant columns, such as <code>EmployeeCount</code>, that you saw earlier. You can do this with the code below.</p>
<pre><code class="lang-r">hr &lt;- hr %&gt;% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))
</code></pre>
<p>To verify that the columns are gone, use this code:</p>
<pre><code class="lang-r">colnames(hr)
</code></pre>
<p>Now you need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for <code>Attrition</code>), not free text.</p>
<pre><code class="lang-r">hr$Attrition &lt;- as.factor(hr$Attrition)
hr$JobRole &lt;- as.factor(hr$JobRole)
hr$Department &lt;- as.factor(hr$Department)
</code></pre>
<p>This also ensures ggplot2 treats them correctly when grouping.</p>
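<p>To see what the conversion changes, here is a small sketch using a made-up vector (<code>attrition_raw</code> is a hypothetical name, not a column from the dataset). It shows how a character vector becomes a factor with a fixed set of levels.</p>
<pre><code class="lang-r"># Made-up character vector of attrition answers
attrition_raw &lt;- c("Yes", "No", "No", "Yes", "No")

# Convert the text to a factor
attrition_fct &lt;- as.factor(attrition_raw)

class(attrition_raw)   # "character"
class(attrition_fct)   # "factor"
levels(attrition_fct)  # "No" "Yes" (sorted alphabetically by default)
table(attrition_fct)   # counts per category
</code></pre>
<p>The order of the levels matters later: when a two-level factor is the response in <code>glm</code> with <code>family = binomial</code>, R treats the first level as failure and models the probability of the other level.</p>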
<h2 id="heading-how-to-use-boxplots">How to Use Boxplots</h2>
<p>A boxplot displays key features of a distribution. The median is shown by the line in the middle of the box. The box itself spans the interquartile range (IQR), from the first quartile to the third, while the whiskers show the spread of the data outside the box. Points that fall beyond the whiskers appear individually as outliers.</p>
<p>Boxplots are most useful when you want to compare distributions across groups, such as income by job role or age by attrition status.</p>
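<p>It can help to compute the pieces of a boxplot by hand once. By default, <code>geom_boxplot</code> draws each whisker out to the most extreme value within 1.5 times the IQR of the box and plots anything beyond that as an outlier. The sketch below applies that rule to a made-up vector (<code>x</code> is illustrative, not data from the HR file).</p>
<pre><code class="lang-r"># Made-up values with one obviously extreme point
x &lt;- c(2, 3, 4, 5, 6, 7, 8, 30)

q1  &lt;- unname(quantile(x, 0.25))  # first quartile (bottom of the box)
med &lt;- median(x)                  # line inside the box
q3  &lt;- unname(quantile(x, 0.75))  # third quartile (top of the box)
iqr &lt;- q3 - q1                    # height of the box

# Fences: values outside these limits are drawn as outlier points
lower_fence &lt;- q1 - 1.5 * iqr
upper_fence &lt;- q3 + 1.5 * iqr

outliers &lt;- x[x &lt; lower_fence | x &gt; upper_fence]
outliers  # 30
</code></pre>
<p>Here the box runs from 3.75 to 7.25, so the fences sit at -1.5 and 12.5, and only the value 30 falls outside them.</p>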
<p>Let’s start with a simple boxplot of monthly income.</p>
<pre><code class="lang-r">ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"blue"</span>) +
  labs(
    title = <span class="hljs-string">"Distribution of Monthly Income"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The <code>aes</code> function tells ggplot2 which variable to map to which axis. <code>geom_boxplot</code> draws the boxplot. The <code>labs</code> function sets the text labels of the plot, such as the title and the <code>x</code> and <code>y</code> axis labels.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766410411798/200b1c22-3b73-49f0-ba30-9b83d28f3055.png" alt="A-vertical-boxplot-showing-the-distribution-of-employee-monthly-income." class="image--center mx-auto" width="473" height="523" loading="lazy"></p>
<h2 id="heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</h2>
<p>Now let’s compare <code>MonthlyIncome</code> across the values of <code>JobRole</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"lightblue"</span>) +
  theme(axis.text.x = element_text(angle = <span class="hljs-number">45</span>, hjust = <span class="hljs-number">1</span>)) +
  labs(
    title = <span class="hljs-string">"Monthly Income by Job Role"</span>,
    x = <span class="hljs-string">"Job Role"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The <code>x</code> aesthetic places each job role on the x axis, so you get one box per role. The labels are rotated 45 degrees to improve readability. This visualization quickly reveals income differences across roles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766508710023/c12ca136-38bf-492e-af90-24d7021b54a4.png" alt="Multiple-boxplots-comparing-monthly-income-distributions-across-different-job-roles." class="image--center mx-auto" width="852" height="522" loading="lazy"></p>
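<p>One optional refinement, which is not part of the code above, is to sort the boxes by their median so that the ranking of roles is easier to read. The sketch below uses base R’s <code>reorder</code> function on a small made-up data frame (<code>toy</code> and its values are assumptions for illustration).</p>
<pre><code class="lang-r">library(ggplot2)

# Made-up incomes for three roles
toy &lt;- data.frame(
  JobRole = rep(c("Manager", "Analyst", "Clerk"), each = 4),
  MonthlyIncome = c(9000, 9500, 10000, 12000,
                    5000, 5200, 5600, 6100,
                    2500, 2700, 3000, 3100)
)

# Turn JobRole into a factor whose levels are sorted by median income
toy$JobRole &lt;- reorder(toy$JobRole, toy$MonthlyIncome, FUN = median)
levels(toy$JobRole)  # "Clerk" "Analyst" "Manager"

# Boxes now appear from lowest to highest median
p &lt;- ggplot(toy, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Monthly Income by Job Role (toy data)",
    x = "Job Role",
    y = "Monthly Income")
p
</code></pre>
<p>The same <code>reorder</code> call could be applied to <code>hr$JobRole</code> before plotting the real data.</p>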
<h2 id="heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis (EDA)</h2>
<p>Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.</p>
<p>For example, you can look at <code>YearsAtCompany</code> by <code>Department</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = <span class="hljs-string">"darkblue"</span>) +
  labs(
    title = <span class="hljs-string">"Years at Company by Department"</span>,
    y = <span class="hljs-string">"Years at Company"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766512679598/5e5da8cd-8fe7-4fae-bbe9-362af901b330.png" alt="Boxplots-showing-employee-tenure-across-departments." class="image--center mx-auto" width="842" height="518" loading="lazy"></p>
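<p>A numeric summary is a useful companion to a grouped boxplot, because it gives you the exact medians the boxes are drawn from. As a rough sketch, the code below uses base R’s <code>aggregate</code> on a made-up data frame (<code>toy</code> and its values are illustrative assumptions).</p>
<pre><code class="lang-r"># Made-up tenure values for two departments
toy &lt;- data.frame(
  Department = rep(c("HR", "Sales"), each = 3),
  YearsAtCompany = c(2, 5, 9, 1, 3, 4)
)

# Median tenure per department
medians &lt;- aggregate(YearsAtCompany ~ Department, data = toy, FUN = median)
medians
</code></pre>
<p>With the toy values, the median is 5 years for HR and 3 years for Sales. Swapping <code>toy</code> for <code>hr</code> would give the medians behind the plot above.</p>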
<h2 id="heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</h2>
<p>To see how linear regression works, model <code>MonthlyIncome</code> using <code>YearsAtCompany</code> with the commands below.</p>
<p>The first command creates the model while the second displays its summary.</p>
<pre><code class="lang-r">hr_lm &lt;- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)
</code></pre>
<p>Linear regression estimates how income changes with tenure. It requires a numeric response variable, and here the predictor is numeric too.</p>
<p>After running the code, your console should show you this output:</p>
<pre><code class="lang-r">Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -<span class="hljs-number">9506</span>  -<span class="hljs-number">2488</span>  -<span class="hljs-number">1186</span>   <span class="hljs-number">1403</span>  <span class="hljs-number">15483</span> 

Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     <span class="hljs-number">3734.47</span>     <span class="hljs-number">159.41</span>   <span class="hljs-number">23.43</span>   &lt;<span class="hljs-number">2e-16</span> ***
YearsAtCompany   <span class="hljs-number">395.25</span>      <span class="hljs-number">17.14</span>   <span class="hljs-number">23.07</span>   &lt;<span class="hljs-number">2e-16</span> ***
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">4032</span> on <span class="hljs-number">1478</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.2647</span>,    Adjusted R-squared:  <span class="hljs-number">0.2642</span> 
<span class="hljs-literal">F</span>-statistic:   <span class="hljs-number">532</span> on <span class="hljs-number">1</span> and <span class="hljs-number">1478</span> DF,  p-value: &lt; <span class="hljs-number">2.2e-16</span>
</code></pre>
<p>Let’s interpret this model.</p>
<p>If an employee has 0 years at the company, their predicted monthly income is $3734.47. This comes from the intercept.</p>
<p>For each additional year an employee spends at the company, their monthly income is predicted to increase by $395.25.</p>
<p>Both coefficients have p-values &lt; <code>2e-16</code>. This means they are highly statistically significant: there is strong evidence that the years an employee spends at a company are associated with their income.</p>
<p>The model’s R-squared is <code>0.2647</code>. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.</p>
<p>The model’s F-statistic is <code>532</code>, with a p-value &lt; <code>2.2e-16</code>. This means the model is statistically significant overall.</p>
<p>In general, the longer an employee stays at a company, the more they earn: roughly $395 more in monthly income for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.</p>
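<p>To make the interpretation concrete, you can plug the reported coefficients into the regression equation yourself. This is a minimal sketch that copies the intercept and slope from the output above (the variable names and the chosen tenures are assumptions for illustration).</p>
<pre><code class="lang-r"># Coefficients copied from the summary output above
intercept &lt;- 3734.47
slope     &lt;- 395.25

# Tenures (in years) to predict for
years &lt;- c(0, 5, 10)

# Predicted monthly income = intercept + slope * years
predicted_income &lt;- intercept + slope * years
predicted_income  # 3734.47 5710.72 7686.97
</code></pre>
<p>With the fitted model in memory, <code>predict(hr_lm, newdata = data.frame(YearsAtCompany = years))</code> should give the same values up to rounding of the coefficients.</p>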
<h2 id="heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</h2>
<p>Next, you’ll model attrition. The first command fits the model while the second displays its summary.</p>
<pre><code class="lang-r">hr_glm &lt;- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)
</code></pre>
<p>Your console should show this as an output when you run both commands.</p>
<pre><code class="lang-r">Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)    -<span class="hljs-number">8.094e-01</span>  <span class="hljs-number">1.375e-01</span>  -<span class="hljs-number">5.886</span> <span class="hljs-number">3.96e-09</span> ***
MonthlyIncome  -<span class="hljs-number">9.449e-05</span>  <span class="hljs-number">2.302e-05</span>  -<span class="hljs-number">4.104</span> <span class="hljs-number">4.05e-05</span> ***
YearsAtCompany -<span class="hljs-number">5.047e-02</span>  <span class="hljs-number">1.792e-02</span>  -<span class="hljs-number">2.817</span>  <span class="hljs-number">0.00485</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">1305.4</span>  on <span class="hljs-number">1479</span>  degrees of freedom
Residual deviance: <span class="hljs-number">1252.5</span>  on <span class="hljs-number">1477</span>  degrees of freedom
AIC: <span class="hljs-number">1258.5</span>

Number of Fisher Scoring iterations: <span class="hljs-number">5</span>
</code></pre>
<p>Logistic regression is used for binary outcomes, that is, yes or no. It estimates the probability of the outcome.</p>
<p>Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (Attrition) based on their <code>MonthlyIncome</code> and <code>YearsAtCompany</code>.</p>
<p>The intercept is <code>-0.809</code>. This is the baseline log-odds of leaving when income and years at the company are both zero.</p>
<p><code>MonthlyIncome</code> has a coefficient of <code>-0.0000945</code>. This means that the log-odds of leaving decrease slightly for each additional unit of income. Higher earners are less likely to quit.</p>
<p><code>YearsAtCompany</code> has a coefficient of <code>-0.0505</code>. This shows that the longer employees stay, the less likely they are to leave. Each additional year lowers the log-odds of attrition by about 0.05.</p>
<p>All coefficients are statistically significant. <code>MonthlyIncome</code> and <code>YearsAtCompany</code> are both clearly associated with the likelihood of leaving.</p>
<p>The model’s residual deviance is <code>1252.5</code>, lower than the null deviance of <code>1305.4</code>. This means the model explains some of the variation in attrition.</p>
<p>The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.</p>
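<p>Log-odds are hard to read directly, so it can help to convert them into odds ratios and probabilities. The sketch below reuses the coefficients reported in the output above. The example employee (an income of 5000 and 5 years at the company) is a made-up assumption.</p>
<pre><code class="lang-r"># Coefficients copied from the summary output above
b0       &lt;- -0.8094
b_income &lt;- -9.449e-05
b_years  &lt;- -0.05047

# Odds ratios: how the odds of leaving are multiplied
exp(b_years)          # per extra year at the company, about 0.95
exp(b_income * 1000)  # per extra 1000 in monthly income, about 0.91

# Predicted probability of leaving for one hypothetical employee
log_odds &lt;- b0 + b_income * 5000 + b_years * 5
p_leave  &lt;- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
p_leave                       # about 0.18
</code></pre>
<p>With the fitted model available, <code>predict(hr_glm, newdata = ..., type = "response")</code> returns probabilities on the same scale.</p>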
<h2 id="heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</h2>
<p>Boxplots help you to:</p>
<ul>
<li><p><strong>Detect outliers:</strong> Boxplots highlight extreme values that can distort model results.</p>
</li>
<li><p><strong>Compare groups:</strong> Boxplots allow quick comparison of distributions across different categories.</p>
</li>
<li><p><strong>Form hypotheses:</strong> Visual patterns assist in identifying relationships worth testing in a model.</p>
</li>
<li><p><strong>Validate modeling assumptions:</strong> Boxplots help check distribution shape and variance before modeling.</p>
</li>
</ul>
<p>Modeling without visualization often leads to misinterpretation or false confidence.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to load and clean data, how boxplots work, and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Scatterplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, a... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-scatterplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">695ba922d307c8d32fc522ea</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Mon, 05 Jan 2026 12:05:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767614352690/8b993426-f193-4ff3-b5ec-dd6dda11028e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-types-in-r">How to Use Data Types in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-import-data-in-r">How to Import Data in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-visualize-data-with-ggplot2">How to Visualize Data with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-statistical-models-in-r">How to Build Statistical Models in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we get started, you should have the following:</p>
<ul>
<li><p>R installed (version 4.0 or higher).</p>
</li>
<li><p>RStudio installed (recommended for beginners).</p>
</li>
<li><p>Basic familiarity with programming concepts such as variables and functions.</p>
</li>
<li><p>A basic understanding of statistics (mean, correlation, regression).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Before you start working with data, load the required libraries:</p>
<pre><code class="lang-plaintext">library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files
</code></pre>
<p>These lines load the required libraries into R. <code>tidyverse</code> is a collection of packages used for data manipulation and visualization, including <code>ggplot2</code>. <code>readxl</code> allows you to import Excel files directly into R without converting them to CSV format first.</p>
<h2 id="heading-how-to-use-data-types-in-r">How to Use Data Types in R</h2>
<p>Knowing data types helps you avoid errors and choose the right analysis methods.</p>
<h3 id="heading-common-data-types">Common Data Types</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Data type</td><td>Example</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>Numeric</td><td><code>x &lt;- 5.7</code></td><td>Measurements, prices</td></tr>
<tr>
<td>Integer</td><td><code>y &lt;- 10L</code></td><td>Counts</td></tr>
<tr>
<td>Character</td><td><code>"House prices"</code></td><td>Text labels</td></tr>
<tr>
<td>Logical</td><td><code>TRUE</code></td><td>Conditions</td></tr>
<tr>
<td>Complex</td><td><code>2 + 3i</code></td><td>Advanced math</td></tr>
</tbody>
</table>
</div><h3 id="heading-numeric-data-types-in-r">Numeric Data Types in R</h3>
<pre><code class="lang-r">price &lt;- <span class="hljs-number">199.99</span>
tax &lt;- <span class="hljs-number">16.5</span>
total_cost &lt;- price + tax
total_cost
</code></pre>
<p>Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.</p>
<h3 id="heading-integer-data-types-in-r">Integer Data Types in R</h3>
<pre><code class="lang-r">students &lt;- <span class="hljs-number">30L</span>
classes &lt;- <span class="hljs-number">4L</span>
total_students &lt;- students * classes
total_students
</code></pre>
<p>Integers are whole numbers and are commonly used for counting. The <code>L</code> tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.</p>
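<p>If you are unsure whether a value is stored as an integer or as a general numeric (double), <code>typeof</code> tells you. The short sketch below reuses the variables from the example above to show how the type can change during arithmetic.</p>
<pre><code class="lang-r">students &lt;- 30L
classes  &lt;- 4L

typeof(students * classes)  # "integer": integer times integer stays integer
typeof(students / classes)  # "double": division always returns a double
typeof(30)                  # "double": without the L suffix, numbers are doubles
is.integer(students)        # TRUE
</code></pre>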
<h3 id="heading-character-data-types-in-r">Character Data Types in R</h3>
<pre><code class="lang-r">course_name &lt;- <span class="hljs-string">"Data Science"</span>
university &lt;- <span class="hljs-string">"Harvard University"</span>
paste(course_name, <span class="hljs-string">"at"</span>, university)
</code></pre>
<p>Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the <code>paste()</code> function. This data type cannot be used in mathematical operations.</p>
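<p>Numbers that arrive as text, for example from a badly formatted file, need converting before you can calculate with them. Here is a minimal sketch with a made-up value (<code>quantity_text</code> is a hypothetical name).</p>
<pre><code class="lang-r"># A number stored as text
quantity_text &lt;- "12"
class(quantity_text)  # "character"

# quantity_text + 1 would stop with a "non-numeric argument" error,
# so convert the text to a number first
quantity &lt;- as.numeric(quantity_text)
quantity + 1          # 13
</code></pre>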
<h3 id="heading-logical-data-types-in-r">Logical Data Types in R</h3>
<pre><code class="lang-r">score &lt;- <span class="hljs-number">75</span>
passed &lt;- score &gt;= <span class="hljs-number">50</span>
passed
</code></pre>
<p>Logical data represents Boolean values: <code>TRUE</code> or <code>FALSE</code>. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns <code>TRUE</code> because the score meets the requirement. Logical values are essential in decision-making and control flow.</p>
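<p>The filtering use case mentioned above looks like this in practice. This is a small sketch with made-up scores, showing how a logical vector selects elements and how it can be counted.</p>
<pre><code class="lang-r"># Made-up exam scores
scores &lt;- c(45, 75, 62, 38)

# One TRUE or FALSE per score
passed &lt;- scores &gt;= 50
passed          # FALSE TRUE TRUE FALSE

# Keep only the scores where passed is TRUE
scores[passed]  # 75 62

# TRUE counts as 1, so sum() counts the passes
sum(passed)     # 2
</code></pre>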
<h3 id="heading-complex-data-types-in-r">Complex Data Types in R</h3>
<p>Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.</p>
<pre><code class="lang-r">z &lt;- <span class="hljs-number">2</span> + <span class="hljs-number">3i</span>
Mod(z)
</code></pre>
<p>This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.</p>
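<p>If you want to see where that magnitude comes from, you can extract the real and imaginary parts with base R's <code>Re()</code> and <code>Im()</code> and apply the Pythagorean formula yourself. A short sketch:</p>
<pre><code class="lang-r">z &lt;- 2 + 3i

Re(z)                      # real part: 2
Im(z)                      # imaginary part: 3
sqrt(Re(z)^2 + Im(z)^2)    # same value as Mod(z), that is sqrt(13)
</code></pre>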
<h2 id="heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</h2>
<p>R stores data in different structures depending on your goals. Choosing the right one matters because R's functions behave differently depending on the structure they receive, and a good fit makes operations easier. Structures also determine whether the values they hold must share one type or can mix numbers, categories, and text.</p>
<h3 id="heading-common-data-structures-in-r">Common Data Structures in R</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Structure</td><td>Best for</td></tr>
</thead>
<tbody>
<tr>
<td>Vector</td><td>Single column of data</td></tr>
<tr>
<td>Matrix</td><td>Numeric tables</td></tr>
<tr>
<td>Data Frame</td><td>Spreadsheet-like data</td></tr>
<tr>
<td>List</td><td>Mixed objects</td></tr>
</tbody>
</table>
</div><pre><code class="lang-r">vec &lt;- c(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>)
mat &lt;- matrix(<span class="hljs-number">1</span>:<span class="hljs-number">9</span>, nrow = <span class="hljs-number">3</span>)
df &lt;- data.frame(Name = c(<span class="hljs-string">"Car"</span>, <span class="hljs-string">"Bike"</span>), Number = c(<span class="hljs-number">110</span>, <span class="hljs-number">95</span>))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

str(lst) <span class="hljs-comment">##shows the structure of the list</span>
</code></pre>
<p>Let's understand the code above:</p>
<ul>
<li><p><code>vec</code> is a vector that stores a single type of data.</p>
</li>
<li><p><code>mat</code> is a matrix that organizes numeric values into rows and columns.</p>
</li>
<li><p><code>df</code> is a data frame that works like a spreadsheet, allowing different data types in each column.</p>
</li>
<li><p><code>lst</code> is a list that stores multiple objects of different types.</p>
</li>
<li><p>The <code>str()</code> function shows how these objects are nested within the list.</p>
</li>
</ul>
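<p>Each structure also has its own way of pulling values back out. The sketch below recreates the same four objects and shows one access pattern for each:</p>
<pre><code class="lang-r">vec &lt;- c(1, 2, 3, 4)
mat &lt;- matrix(1:9, nrow = 3)   # filled column by column
df &lt;- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

vec[2]          # second element of the vector: 2
mat[2, 3]       # row 2, column 3 of the matrix: 8
df$Number       # the Number column: 110 95
lst$info$Name   # Name column of the data frame stored inside the list
</code></pre>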
<h2 id="heading-how-to-import-data-in-r"><strong>How to Import Data in R</strong></h2>
<p>Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.</p>
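<p><strong>For Windows:</strong> Replace each single backslash (<code>\</code>) in the copied path with either a double backslash (<code>\\</code>) or a single forward slash (<code>/</code>). For example:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Windows: double backslashes</span>
data &lt;- read.csv(<span class="hljs-string">"C:\\Users\\file\\Documents\\data.csv"</span>)
<span class="hljs-comment"># Windows: single forward slashes also work</span>
data &lt;- read.csv(<span class="hljs-string">"C:/Users/file/Documents/data.csv"</span>)
</code></pre>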
<p><strong>For macOS/Linux:</strong> Single forward slashes work fine:</p>
<pre><code class="lang-r"><span class="hljs-comment"># macOS/Linux</span>
data &lt;- read.csv(<span class="hljs-string">"/Users/file/Documents/data.csv"</span>)
</code></pre>
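<p>If you would rather not type the separators by hand, base R's <code>file.path()</code> joins path components with forward slashes on every platform. A minimal sketch, assuming the same example folder used in the Windows paths above:</p>
<pre><code class="lang-r">csv_path &lt;- file.path("C:", "Users", "file", "Documents", "data.csv")
csv_path   # "C:/Users/file/Documents/data.csv"

# data &lt;- read.csv(csv_path)   # uncomment once the file exists at that location
</code></pre>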
<h3 id="heading-how-to-read-a-csv-and-excel-file"><strong>How to Read a CSV and Excel File</strong></h3>
<pre><code class="lang-r"><span class="hljs-comment"># Import a CSV file (forward slashes work on every platform)</span>
data &lt;- read.csv(<span class="hljs-string">"C:/Users/file/Documents/data.csv"</span>)
<span class="hljs-comment"># Equivalent on Windows, using double backslashes:</span>
<span class="hljs-comment"># data &lt;- read.csv("C:\\Users\\file\\Documents\\data.csv")</span>

head(data)  <span class="hljs-comment"># preview the first six rows</span>
</code></pre>
<p>You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (<code>/</code>) or double backslashes (<code>\\</code>). The imported data is stored as a data frame named <code>data</code>, and <code>head()</code> previews its first few rows.</p>
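<pre><code class="lang-r"><span class="hljs-keyword">library</span>(readxl)  <span class="hljs-comment"># provides read_excel()</span>
data_excel &lt;- read_excel(<span class="hljs-string">"C:/Users/file/Documents/HR Data Set.xlsx"</span>)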
head(data_excel)
</code></pre>
<p>You can import an Excel file into R using the <code>read_excel()</code> function from the <code>readxl</code> package, which must be installed and loaded first. The <code>head()</code> function is then used to preview the first few rows of the dataset.</p>
<p>Use the following commands to understand your data:</p>
<pre><code class="lang-r">str(data)
summary(data)

str(data_excel)
summary(data_excel)
</code></pre>
<p><code>str()</code> shows the structure of the dataset, including column names and data types. <code>summary()</code> provides descriptive statistics such as the minimum, maximum, mean, median, and quartiles for each numeric variable. Together, these functions help you understand the dataset before analysis.</p>
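<p>If you don't have a file to hand yet, you can try the same workflow on a throwaway CSV. The sketch below writes a tiny invented dataset to a temporary file and reads it back, so the names <code>demo</code>, <code>name</code>, and <code>salary</code> are purely illustrative:</p>
<pre><code class="lang-r"># Write a small made-up dataset to a temporary CSV file
tmp &lt;- tempfile(fileext = ".csv")
write.csv(
  data.frame(name = c("Ann", "Ben", "Cy"), salary = c(52000, 61000, 58000)),
  tmp,
  row.names = FALSE
)

# Read it back and inspect it
demo &lt;- read.csv(tmp)
head(demo)      # first rows
str(demo)       # column names and types
summary(demo)   # descriptive statistics
</code></pre>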
<h2 id="heading-how-to-visualize-data-with-ggplot2"><strong>How to Visualize Data with ggplot2</strong></h2>
<p>Visualization helps you spot patterns before you build models.</p>
<h3 id="heading-scatter-plot-example"><strong>Scatter Plot Example</strong></h3>
<p>We’ll use the built-in <code>mtcars</code> dataset in R. First, load the library to make it available for use:</p>
<pre><code class="lang-r">data(mtcars)
<span class="hljs-keyword">library</span>(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = <span class="hljs-number">3</span>, color = <span class="hljs-string">"blue"</span>) +
  geom_smooth(method = <span class="hljs-string">"lm"</span>, color = <span class="hljs-string">"red"</span>, se = <span class="hljs-literal">FALSE</span>) +
  labs(
    title = <span class="hljs-string">"Fuel Efficiency by Weight and Cylinders"</span>,
    x = <span class="hljs-string">"Weight (1000 lbs)"</span>,
    y = <span class="hljs-string">"Miles per Gallon"</span>
  ) +
  theme_minimal()
</code></pre>
<p>Let us break down the code to grasp it fully:</p>
<ul>
<li><p><code>data(mtcars)</code> loads the built-in <code>mtcars</code> dataset, which contains information about car specifications.</p>
</li>
<li><p><code>library(ggplot2)</code> loads the ggplot2 package, which provides the plotting functions.</p>
</li>
<li><p><code>aes()</code> maps columns of your dataset to parts of the plot. Here it defines the <code>x</code> and <code>y</code> values.</p>
</li>
<li><p><code>geom_point()</code> draws the points. Arguments set outside <code>aes()</code>, such as <code>size</code> and <code>color</code>, style every point the same way. Because <code>color="blue"</code> is fixed here, it overrides the <code>color = factor(cyl)</code> mapping and all points are drawn in blue.</p>
</li>
<li><p><code>geom_smooth()</code> adds a trend line. Here, we use <code>method="lm"</code> to fit a linear regression line. The <code>se=TRUE/FALSE</code> option controls the shading for confidence intervals. Use <code>TRUE</code> if you want the shading and <code>FALSE</code> if you don’t.</p>
</li>
<li><p><code>labs()</code> labels the plot by setting the <code>title</code> and the <code>x</code>-axis and <code>y</code>-axis labels.</p>
</li>
<li><p>Finally, we set the plot theme using <code>theme_minimal()</code>.</p>
</li>
</ul>
<p>Running this code will produce a scatterplot showing fuel efficiency by weight and cylinders. The plot should look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765914755069/8921e803-7fa6-4705-802c-23ff8918bee5.png" alt="Scatterplot of mpg against vehicle weight with regression line" class="image--center mx-auto" width="912" height="527" loading="lazy"></p>
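<p>In the code above, the fixed <code>color="blue"</code> means the cylinder count never shows up in the plot. If you do want the points colored by cylinders, one possible variation (a sketch, not the plot shown in the image) is to move the color mapping into <code>geom_point()</code> and drop the fixed color. It assumes ggplot2 is installed. The object name <code>p</code> and the legend title are illustrative choices:</p>
<pre><code class="lang-r">library(ggplot2)

p &lt;- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # map color to cylinders for the points only, so a legend is drawn
  geom_point(aes(color = factor(cyl)), size = 3) +
  # one overall regression line, because color is not mapped globally
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

p   # printing the object draws the plot
</code></pre>
<p>Keeping the color mapping inside <code>geom_point()</code> rather than in the top-level <code>aes()</code> means <code>geom_smooth()</code> fits a single line to all cars instead of one line per cylinder group.</p>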
<h2 id="heading-how-to-build-statistical-models-in-r"><strong>How to Build Statistical Models in R</strong></h2>
<h3 id="heading-linear-regression"><strong>Linear Regression</strong></h3>
<p>You can use linear regression when the outcome is continuous, that is, when you want to predict a numerical value. For example, to predict a car’s miles per gallon (<code>mpg</code>) based on weight (<code>wt</code>) and horsepower (<code>hp</code>), you can fit this model:</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)
</code></pre>
<p>But what does it mean?</p>
<ul>
<li><p><code>lm()</code> stands for linear model.</p>
</li>
<li><p>The response variable is <code>mpg</code>. This is the outcome you want to predict.</p>
</li>
<li><p>Predictor variables are <code>wt</code> and <code>hp</code>. These explain changes in the response.</p>
</li>
</ul>
<p>Once you run the model, it should look like this in your console:</p>
<pre><code class="lang-r">Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-<span class="hljs-number">3.941</span> -<span class="hljs-number">1.600</span> -<span class="hljs-number">0.182</span>  <span class="hljs-number">1.050</span>  <span class="hljs-number">5.854</span> 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) <span class="hljs-number">37.22727</span>    <span class="hljs-number">1.59879</span>  <span class="hljs-number">23.285</span>  &lt; <span class="hljs-number">2e-16</span> ***
wt          -<span class="hljs-number">3.87783</span>    <span class="hljs-number">0.63273</span>  -<span class="hljs-number">6.129</span> <span class="hljs-number">1.12e-06</span> ***
hp          -<span class="hljs-number">0.03177</span>    <span class="hljs-number">0.00903</span>  -<span class="hljs-number">3.519</span>  <span class="hljs-number">0.00145</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">2.593</span> on <span class="hljs-number">29</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.8268</span>,    Adjusted R-squared:  <span class="hljs-number">0.8148</span> 
<span class="hljs-literal">F</span>-statistic: <span class="hljs-number">69.21</span> on <span class="hljs-number">2</span> and <span class="hljs-number">29</span> DF,  p-value: <span class="hljs-number">9.109e-12</span>
</code></pre>
<p>Here’s an interpretation of the linear regression model:</p>
<ul>
<li><p>You modeled miles per gallon (<code>mpg</code>) as a function of weight (<code>wt</code>) and horsepower (<code>hp</code>).</p>
</li>
<li><p>The intercept <code>37.227</code> is the predicted <code>mpg</code> when <code>wt=0</code> and <code>hp=0</code>. In other words, the intercept is the baseline value of the outcome when every predictor in the model is zero. No real car has zero weight, so treat it as a mathematical anchor rather than a realistic prediction.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the <code>mpg</code> decreases by about <code>3.878</code>, holding horsepower constant. The <code>p-value</code> is &lt;0.001, so this effect is statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the <code>mpg</code> decreases by about <code>0.032</code>, holding weight constant. The <code>p-value</code> is <code>0.00145</code>, which is <strong>less than 0.01</strong>, indicating that horsepower is a statistically significant predictor of <code>mpg</code>, although its per-unit effect is smaller than that of vehicle weight.</p>
</li>
</ul>
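<p>To see the coefficients at work, you can ask the model for a prediction. The sketch below refits the same model and predicts <code>mpg</code> for a hypothetical car weighing 3,000 lbs with 110 horsepower. The data frame <code>new_car</code> is invented for illustration:</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)

coef(lm_model)   # intercept and slopes as a named vector

# Hypothetical car: wt is in units of 1000 lbs
new_car &lt;- data.frame(wt = 3, hp = 110)
predict(lm_model, newdata = new_car)   # roughly 22.1 mpg

# The same prediction computed by hand from the coefficients
b &lt;- coef(lm_model)
b["(Intercept)"] + b["wt"] * 3 + b["hp"] * 110
</code></pre>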
<h3 id="heading-does-the-model-fit-the-data-and-why">Does the Model Fit the Data, and Why?</h3>
<p>The R-squared value (<code>0.8268</code>) shows that about 83% of the variation in <code>mpg</code> is explained by weight and horsepower.</p>
<p><strong>Summary of the interpretation</strong>: Heavier cars and cars with more horsepower tend to have lower fuel efficiency. These two variables explain most of the variation in <code>mpg</code> in the dataset.</p>
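<p>The goodness-of-fit figures don't have to be read off the printed output. As a short sketch, the object returned by <code>summary()</code> stores them as named elements (<code>fit_summary</code> is an illustrative name):</p>
<pre><code class="lang-r">fit_summary &lt;- summary(lm(mpg ~ wt + hp, data = mtcars))

fit_summary$r.squared       # multiple R-squared
fit_summary$adj.r.squared   # adjusted R-squared
fit_summary$sigma           # residual standard error
</code></pre>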
<h3 id="heading-logistic-regression"><strong>Logistic Regression</strong></h3>
<p>You can use logistic regression for binary outcomes, like yes/no questions. For example, you can predict whether a vehicle is automatic or manual based on its weight and horsepower.</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)
</code></pre>
<p>Let's understand the code:</p>
<ul>
<li><p><code>glm()</code> stands for generalized linear model.</p>
</li>
<li><p>The <code>family=binomial</code> option tells R to run logistic regression.</p>
</li>
<li><p>The response variable <code>am</code> indicates transmission type: 0 = automatic, 1 = manual.</p>
</li>
<li><p>Predictor variables remain <code>wt</code> and <code>hp</code>.</p>
</li>
</ul>
<p>Once you run the model, it should look like this in your console:</p>
<pre><code class="lang-r">Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)   
(Intercept) <span class="hljs-number">18.86630</span>    <span class="hljs-number">7.44356</span>   <span class="hljs-number">2.535</span>  <span class="hljs-number">0.01126</span> * 
wt          -<span class="hljs-number">8.08348</span>    <span class="hljs-number">3.06868</span>  -<span class="hljs-number">2.634</span>  <span class="hljs-number">0.00843</span> **
hp           <span class="hljs-number">0.03626</span>    <span class="hljs-number">0.01773</span>   <span class="hljs-number">2.044</span>  <span class="hljs-number">0.04091</span> * 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">43.230</span>  on <span class="hljs-number">31</span>  degrees of freedom
Residual deviance: <span class="hljs-number">10.059</span>  on <span class="hljs-number">29</span>  degrees of freedom
AIC: <span class="hljs-number">16.059</span>

Number of Fisher Scoring iterations: <span class="hljs-number">8</span>
</code></pre>
<p>Here’s an interpretation of the logistic regression model:</p>
<ul>
<li><p>The intercept <code>18.866</code> represents the log-odds of a car being manual when <code>wt=0</code> and <code>hp=0</code>. In other words, it is the baseline log-odds of the outcome when every predictor in the model is zero.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the log-odds of the car being manual decrease by <code>8.083</code>, holding horsepower constant. The <code>p-value</code> is <code>0.008</code>, so this effect is statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the log-odds of the car being manual increase by <code>0.036</code>, holding weight constant. The <code>p-value</code> is <code>0.041</code>, which is statistically significant at the 0.05 level.</p>
</li>
</ul>
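<p>Log-odds are hard to read directly, so it can help to convert them. The sketch below refits the same model, exponentiates the coefficients to get odds ratios, and asks for a predicted probability for a hypothetical car (<code>new_car</code> is invented for illustration):</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)

# exp() turns log-odds coefficients into odds ratios:
# below 1 means the odds of manual go down, above 1 means they go up
odds_ratios &lt;- exp(coef(glm_model))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical car
new_car &lt;- data.frame(wt = 3, hp = 110)
predict(glm_model, newdata = new_car, type = "response")
</code></pre>
<p>Using <code>type = "response"</code> returns a probability between 0 and 1 instead of the default log-odds scale.</p>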
<p><strong>Summary of the interpretation</strong>: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, <code>wt</code> and <code>hp</code> account for a large share of the variation in transmission type: the deviance drops from <code>43.230</code> for the null model to <code>10.059</code> for the fitted model.</p>
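<p>One rough way to put a number on that is a deviance-based pseudo R-squared, which compares the residual deviance with the null deviance. This is a swapped-in summary measure, not something <code>summary()</code> prints, so treat the sketch below as an assumption-laden illustration rather than a definitive fit statistic:</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Share of the null deviance removed by adding wt and hp
pseudo_r2 &lt;- 1 - glm_model$deviance / glm_model$null.deviance
pseudo_r2   # roughly 0.77
</code></pre>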
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.</p>
<p>This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.</p>
<p>Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.</p>
<p>You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.</p>
<p>With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
