<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ R Programming - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ R Programming - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 28 May 2026 16:46:50 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/r-programming/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Create Boxplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-boxplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">69693680d6f0e208b327d21c</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 18:48:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768418231372/f36e1cca-eed9-4620-bd7c-19788d8beafe.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.</p>
<p>By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-boxplots">How to Use Boxplots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before you begin, you should be comfortable with the following:</p>
<ul>
<li><p>Basic R syntax (variables, functions, data frames).</p>
</li>
<li><p>Installing and loading R packages.</p>
</li>
<li><p>Understanding what rows and columns represent in a dataset.</p>
</li>
<li><p>Very basic statistics (mean, median, distributions).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Start by installing and loading the packages you will need.</p>
<pre><code class="lang-r">install.packages(c(<span class="hljs-string">"tidyverse"</span>, <span class="hljs-string">"ggplot2"</span>))
<span class="hljs-keyword">library</span>(tidyverse)
<span class="hljs-keyword">library</span>(ggplot2)
</code></pre>
<p><code>tidyverse</code> provides tools for data manipulation and visualization. <code>ggplot2</code> is the visualization engine you will use for boxplots. It is already part of <code>tidyverse</code>, so installing and loading it separately is optional. Loading the libraries makes their functions available for use.</p>
<h2 id="heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</h2>
<p>First, download the <a target="_blank" href="https://www.kaggle.com/datasets/saadharoon27/hr-analytics-dataset">HR Analytics dataset by Saad Haroon from Kaggle</a>.</p>
<p>Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load the file into R using its path.</p>
<p>You can view a sample of the dataset by running the <code>head</code> function. To view the structure of the dataset, you can run the <code>str</code> function.</p>
<pre><code class="lang-r">hr &lt;- read.csv(<span class="hljs-string">"C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv"</span>)
head(hr)
str(hr)
</code></pre>
<p>The <code>read.csv</code> function imports the dataset into R. The <code>head</code> function shows the first six rows so you can preview the data. The <code>str</code> function reveals data types, helping you spot categorical versus numeric variables early.</p>
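<p>If you want to try these three functions before downloading anything, the following is a minimal sketch that writes a tiny, made-up CSV to a temporary file and reads it back. The values in <code>toy</code> are invented for illustration and are not taken from the Kaggle dataset.</p>
<pre><code class="lang-r"># Hypothetical toy data standing in for the real file
toy &lt;- data.frame(
  EmpID = c("E001", "E002", "E003"),
  Age = c(28L, 35L, 41L),
  Attrition = c("No", "Yes", "No")
)

# Write the toy data to a temporary CSV, then read it back
toy_path &lt;- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)
toy_hr &lt;- read.csv(toy_path)

head(toy_hr)  # preview the first rows
str(toy_hr)   # inspect the column names and types
</code></pre>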
<p>Remember that understanding your data structure early prevents errors later when plotting or modeling. Once you run the <code>head</code> function, you should see the following in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768489839861/f304305e-b889-4e25-8315-ff24c5201681.png" alt="first-six-rows-of-a-hr-dataset-shown-in-the-r-console" class="image--center mx-auto" width="1753" height="347" loading="lazy"></p>
<p>From the <code>head</code> output, you can see the following.</p>
<h3 id="heading-structure">Structure</h3>
<ul>
<li><p>Each row represents <strong>one employee</strong>.</p>
</li>
<li><p>Each column represents a <strong>feature/variable</strong> about the employee.</p>
</li>
</ul>
<h3 id="heading-key-columns-amp-meaning">Key Columns &amp; Meaning</h3>
<ul>
<li><p><code>EmpID</code> → Employee identifier</p>
</li>
<li><p><code>Age</code> → Age in years</p>
</li>
<li><p><code>AgeGroup</code> → Age category (for example, <code>18-25</code>)</p>
</li>
<li><p><code>Attrition</code> → Whether the employee left or not (<code>Yes/No</code>)</p>
</li>
<li><p><code>BusinessTravel</code> → Travel frequency (<code>Travel_Rarely</code>, <code>Travel_Frequently</code>, <code>Non-Travel</code>)</p>
</li>
<li><p><code>Department</code> → Employee department</p>
</li>
<li><p><code>DistanceFromHome</code> → Distance from home to office (km)</p>
</li>
<li><p><code>Education</code> / <code>EducationField</code> → Level and field of education</p>
</li>
<li><p><code>EmployeeCount</code> → Usually 1 per employee (redundant)</p>
</li>
<li><p><code>Gender</code> → Male / Female</p>
</li>
<li><p><code>JobRole</code> / <code>JobSatisfaction</code> → Job title and satisfaction level</p>
</li>
<li><p><code>MonthlyIncome</code> / <code>SalarySlab</code> → Salary amount and category</p>
</li>
<li><p><code>YearsAtCompany</code> / <code>YearsInCurrentRole</code> → Experience metrics</p>
</li>
<li><p><code>OverTime</code> → Works overtime (<code>Yes/No</code>)</p>
</li>
<li><p>Other features: <code>PerformanceRating</code>, <code>TrainingTimesLastYear</code>, <code>WorkLifeBalance</code>, <code>StockOptionLevel</code>, and so on.</p>
</li>
</ul>
<h3 id="heading-data-types"><strong>Data Types</strong></h3>
<ul>
<li><p><strong>Numeric</strong> → <code>Age</code>, <code>DistanceFromHome</code>, <code>MonthlyIncome</code>, <code>YearsAtCompany</code></p>
</li>
<li><p><strong>Categorical / Character</strong> → <code>Attrition</code>, <code>Gender</code>, <code>Department</code>, <code>JobRole</code></p>
</li>
</ul>
<h3 id="heading-observations"><strong>Observations</strong></h3>
<ul>
<li><p>The dataset is tabular, like a spreadsheet.</p>
</li>
<li><p>There are multiple categorical columns.</p>
</li>
<li><p>There are multiple numeric columns.</p>
</li>
<li><p>Some columns seem redundant or constant. For example, <code>EmployeeCount</code> holds the same value in every row, so it provides no useful information.</p>
</li>
</ul>
<p>From the <code>str</code> function, you can gather that:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768488901453/80d8cae9-d569-4749-8028-0a6e9cc128c4.png" alt="r-output-showing-structure-of-hr-dataset" class="image--center mx-auto" width="1046" height="612" loading="lazy"></p>
<p>The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.</p>
<p>Each column has a name, data type, and example values. For instance, <code>Age</code> and <code>DistanceFromHome</code> are numeric (<code>int</code>), with values like 28 or 12. <code>EmpID</code> and <code>Department</code> are character strings (<code>chr</code>), with examples like Research &amp; Development or Sales. Other features include <code>JobRole</code> (Analyst, Manager) and <code>Attrition</code> (Yes/No).</p>
<p>The dataset contains mixed data types. Some columns are numeric, such as <code>MonthlyIncome</code> or <code>YearsAtCompany</code>. Some are character or categorical, like <code>Gender</code> (Male/Female) and <code>BusinessTravel</code> (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, <code>EmployeeCount</code> has the same value of 1 for all rows and does not provide useful information.</p>
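<p>Rather than spotting constant columns by eye, you can check for them programmatically. The following is a small sketch on a made-up data frame (the <code>toy</code> values are hypothetical): it lists each column’s class and flags any column with only one distinct value.</p>
<pre><code class="lang-r"># Hypothetical toy data with one constant column
toy &lt;- data.frame(
  Age = c(28, 35, 41),
  Department = c("Sales", "HR", "Sales"),
  EmployeeCount = c(1, 1, 1)
)

# Class of every column, similar to what str() reports
col_types &lt;- sapply(toy, class)
col_types

# A column is constant if it has exactly one distinct value
is_constant &lt;- sapply(toy, function(col) length(unique(col)) == 1)
constant_cols &lt;- names(toy)[is_constant]
constant_cols
</code></pre>
<p>The same two <code>sapply</code> calls should work on <code>hr</code> once it is loaded.</p>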
<h2 id="heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</h2>
<p>Before visualization, you must clean your data. To find out what needs cleaning, start by investigating the data.</p>
<p>Run the <code>summary</code> function to view the statistics of the dataset. Then combine <code>is.na</code> with <code>colSums</code> to count the missing values in each column.</p>
<pre><code class="lang-r">summary(hr)
colSums(is.na(hr))
</code></pre>
<p>The <code>summary</code> function gives quick statistics and flags suspicious values. The <code>is.na</code> function marks each missing value, and <code>colSums</code> adds those marks up per column. Boxplots are sensitive to extreme values, so knowing what you are working with is critical.</p>
<p>After running the <code>summary</code> function, the following will appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490404469/ef3bd30d-c3c9-4cf0-9c91-80a0e56f52f5.png" alt="r-summary-output-of-hr-dataset-showing-statistical-distributions" class="image--center mx-auto" width="1778" height="495" loading="lazy"></p>
<p>This shows the basic statistics of each column. After running <code>colSums(is.na(hr))</code>, the following will also appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490678134/00a12c24-224e-4c8f-80ee-bc7bbd4d8ca6.png" alt="r-output-showing-missing-value-counts-per-column-in-hr-dataset" class="image--center mx-auto" width="1832" height="198" loading="lazy"></p>
<p>From this output, you can see that only <code>YearsWithCurrManager</code> has a non-zero count, <code>57</code>, meaning that <strong>57 employees</strong> don’t have a value for this column.</p>
<p>You can drop this whole column along with redundant columns such as <code>EmployeeCount</code>. You can do this with the code below.</p>
<pre><code class="lang-r">hr &lt;- hr %&gt;% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))
</code></pre>
<p>To verify that the columns are gone, use this code:</p>
<pre><code class="lang-r">colnames(hr)
</code></pre>
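<p>Dropping the column is one option. An alternative, sketched here on made-up data, is to keep the column and remove only the rows that have missing values, using base R’s <code>complete.cases</code>. The <code>toy</code> data frame is hypothetical.</p>
<pre><code class="lang-r"># Hypothetical toy data with two missing values
toy &lt;- data.frame(
  EmpID = c("E001", "E002", "E003", "E004"),
  YearsWithCurrManager = c(2, NA, 5, NA)
)

# Count missing values per column
colSums(is.na(toy))

# Keep only the rows with no missing values in any column
toy_complete &lt;- toy[complete.cases(toy), ]
nrow(toy_complete)
</code></pre>
<p>Which option is better depends on the analysis. Dropping rows shrinks the sample, while dropping the column loses a variable.</p>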
<p>Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for <code>Attrition</code>) rather than free text.</p>
<pre><code class="lang-r">hr$Attrition &lt;- as.factor(hr$Attrition)
hr$JobRole &lt;- as.factor(hr$JobRole)
hr$Department &lt;- as.factor(hr$Department)
</code></pre>
<p>This also ensures ggplot2 treats them correctly when grouping.</p>
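<p>To see what the conversion does, here is a minimal sketch on a made-up character vector. <code>levels</code> lists the categories R has detected and <code>table</code> counts how often each one occurs.</p>
<pre><code class="lang-r"># Hypothetical attrition values
attrition &lt;- c("No", "Yes", "No", "No", "Yes")
attrition_factor &lt;- as.factor(attrition)

levels(attrition_factor)  # distinct categories, sorted alphabetically
table(attrition_factor)   # number of observations per category
</code></pre>
<p>The level order matters later: when a factor is the response in <code>glm</code> with <code>family = binomial</code>, the first level is treated as the reference outcome, so the model estimates the probability of the other level.</p>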
<h2 id="heading-how-to-use-boxplots">How to Use Boxplots</h2>
<p>A boxplot displays key features of a dataset’s distribution. The median is shown by the line in the middle of the box. The box itself spans the interquartile range (IQR), from the first quartile to the third. The whiskers extend to the most extreme values that lie within 1.5 times the IQR of the box, which is ggplot2’s default. Points beyond the whiskers are drawn individually as outliers.</p>
<p>Boxplots are most useful when you want to compare distributions across groups, such as income by job role or age by attrition status.</p>
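<p>The numbers behind a boxplot can be computed by hand. The following sketch uses a small made-up vector and the common 1.5 × IQR rule (also the default <code>coef = 1.5</code> in <code>geom_boxplot</code>) to find the quartiles and flag outliers.</p>
<pre><code class="lang-r"># Hypothetical values with one extreme point
x &lt;- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# Quartiles: the box runs from q1 to q3, with the median inside
q &lt;- quantile(x, c(0.25, 0.5, 0.75))
q1 &lt;- unname(q["25%"])
q3 &lt;- unname(q["75%"])
iqr &lt;- IQR(x)

# Values beyond these fences are drawn as individual points
lower_fence &lt;- q1 - 1.5 * iqr
upper_fence &lt;- q3 + 1.5 * iqr
outliers &lt;- x[x &lt; lower_fence | x &gt; upper_fence]
outliers
</code></pre>
<p>Here the quartiles are 4 and 8, the IQR is 4, and the fences fall at -2 and 14, so only the value 30 is flagged.</p>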
<p>Let’s start with a simple boxplot of monthly income.</p>
<pre><code class="lang-r">ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"blue"</span>) +
  labs(
    title = <span class="hljs-string">"Distribution of Monthly Income"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The <code>aes</code> function tells ggplot which variable to plot. <code>geom_boxplot</code> draws the boxplot. The <code>labs</code> function sets the plot’s labels: the title and the <code>x</code> and <code>y</code> axis labels.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766410411798/200b1c22-3b73-49f0-ba30-9b83d28f3055.png" alt="A-vertical-boxplot-showing-the-distribution-of-employee-monthly-income." class="image--center mx-auto" width="473" height="523" loading="lazy"></p>
<h2 id="heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</h2>
<p>Now let’s compare <code>MonthlyIncome</code> across <code>JobRole</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"lightblue"</span>) +
  theme(axis.text.x = element_text(angle = <span class="hljs-number">45</span>, hjust = <span class="hljs-number">1</span>)) +
  labs(
    title = <span class="hljs-string">"Monthly Income by Job Role"</span>,
    x = <span class="hljs-string">"Job Role"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The x aesthetic lists all the job roles. The labels are rotated to improve readability. This visualization quickly reveals income differences across roles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766508710023/c12ca136-38bf-492e-af90-24d7021b54a4.png" alt="Multiple-boxplots-comparing-monthly-income-distributions-across-different-job-roles." class="image--center mx-auto" width="852" height="522" loading="lazy"></p>
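<p>Group comparisons are often easier to read when the boxes are sorted. One possible approach, sketched here on made-up data, is to reorder the factor levels by median income with base R’s <code>reorder</code> before plotting. The <code>toy</code> values are hypothetical, and the plotting step only runs if ggplot2 is installed.</p>
<pre><code class="lang-r"># Hypothetical toy data: three roles, two employees each
toy &lt;- data.frame(
  JobRole = c("Manager", "Manager", "Analyst", "Analyst", "Director", "Director"),
  MonthlyIncome = c(5000, 7000, 1000, 3000, 9000, 11000)
)

# Order the role levels from lowest to highest median income
toy$JobRole &lt;- reorder(factor(toy$JobRole), toy$MonthlyIncome, FUN = median)
levels(toy$JobRole)

# Draw the sorted boxplot when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p &lt;- ggplot(toy, aes(x = JobRole, y = MonthlyIncome)) +
    geom_boxplot(fill = "lightblue")
  print(p)
}
</code></pre>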
<h2 id="heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis (EDA)</h2>
<p>Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.</p>
<p>As an example, we can look at <code>YearsAtCompany</code> by <code>Department</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = <span class="hljs-string">"darkblue"</span>) +
  labs(
    title = <span class="hljs-string">"Years at Company by Department"</span>,
    y = <span class="hljs-string">"Years at Company"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766512679598/5e5da8cd-8fe7-4fae-bbe9-362af901b330.png" alt="Boxplots-showing-employee-tenure-across-departments." class="image--center mx-auto" width="842" height="518" loading="lazy"></p>
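<p>A plot is easier to discuss when you also have the numbers behind it. As a sketch on made-up data, base R’s <code>aggregate</code> can compute the median tenure per department, which corresponds to the middle line of each box. The <code>toy</code> values are hypothetical.</p>
<pre><code class="lang-r"># Hypothetical toy data: two departments, three employees each
toy &lt;- data.frame(
  Department = c("Sales", "Sales", "Sales", "HR", "HR", "HR"),
  YearsAtCompany = c(1, 4, 10, 2, 3, 7)
)

# Median years at company for each department
dept_medians &lt;- aggregate(YearsAtCompany ~ Department, data = toy, FUN = median)
dept_medians
</code></pre>
<p>Replacing <code>toy</code> with <code>hr</code> should give the per-department medians for the real dataset.</p>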
<h2 id="heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</h2>
<p>To see how linear regression works, model <code>MonthlyIncome</code> using <code>YearsAtCompany</code> with the commands below.</p>
<p>The first command creates the model, while the second displays its summary.</p>
<pre><code class="lang-r">hr_lm &lt;- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)
</code></pre>
<p>Linear regression estimates how income changes with tenure. Here both the response and the predictor are numeric, so the slope is read as the change in income per additional year.</p>
<p>After running the code, your console should show you this output:</p>
<pre><code class="lang-r">Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -<span class="hljs-number">9506</span>  -<span class="hljs-number">2488</span>  -<span class="hljs-number">1186</span>   <span class="hljs-number">1403</span>  <span class="hljs-number">15483</span> 

Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     <span class="hljs-number">3734.47</span>     <span class="hljs-number">159.41</span>   <span class="hljs-number">23.43</span>   &lt;<span class="hljs-number">2e-16</span> ***
YearsAtCompany   <span class="hljs-number">395.25</span>      <span class="hljs-number">17.14</span>   <span class="hljs-number">23.07</span>   &lt;<span class="hljs-number">2e-16</span> ***
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">4032</span> on <span class="hljs-number">1478</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.2647</span>,    Adjusted R-squared:  <span class="hljs-number">0.2642</span> 
<span class="hljs-literal">F</span>-statistic:   <span class="hljs-number">532</span> on <span class="hljs-number">1</span> and <span class="hljs-number">1478</span> DF,  p-value: &lt; <span class="hljs-number">2.2e-16</span>
</code></pre>
<p>Let’s interpret this model.</p>
<p>The intercept is the predicted monthly income for an employee with 0 years at the company: $3734.47.</p>
<p>For each additional year an employee spends at the company, their monthly income is predicted to increase by $395.25.</p>
<p>Both coefficients have p-values &lt; <code>2e-16</code>, so they are highly significant. This is strong evidence that years at the company is associated with income, though a regression on observational data does not by itself show that tenure causes higher pay.</p>
<p>The model’s R-squared is <code>0.2647</code>. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.</p>
<p>The model’s F-statistic is <code>532</code>, with a p-value &lt; <code>2.2e-16</code>. This means the model is statistically significant overall.</p>
<p>In general, the longer an employee stays at a company, the more they earn: roughly $395 more in monthly income for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.</p>
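<p>To make the coefficients concrete, this short sketch plugs the intercept and slope from the summary output into the regression equation for a few tenures. The chosen values of <code>years</code> are arbitrary examples.</p>
<pre><code class="lang-r"># Coefficients copied from the summary output
intercept &lt;- 3734.47
slope &lt;- 395.25

# Predicted monthly income after 0, 5, and 10 years
years &lt;- c(0, 5, 10)
predicted_income &lt;- intercept + slope * years
predicted_income
</code></pre>
<p>This gives 3734.47, 5710.72, and 7686.97. With the fitted model in memory, <code>predict(hr_lm, newdata = data.frame(YearsAtCompany = years))</code> should return the same values up to rounding of the coefficients.</p>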
<h2 id="heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</h2>
<p>You can now learn how to predict attrition. The first command generates the model while the second displays it.</p>
<pre><code class="lang-r">hr_glm &lt;- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)
</code></pre>
<p>Your console should show this as an output when you run both commands.</p>
<pre><code class="lang-r">Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)    -<span class="hljs-number">8.094e-01</span>  <span class="hljs-number">1.375e-01</span>  -<span class="hljs-number">5.886</span> <span class="hljs-number">3.96e-09</span> ***
MonthlyIncome  -<span class="hljs-number">9.449e-05</span>  <span class="hljs-number">2.302e-05</span>  -<span class="hljs-number">4.104</span> <span class="hljs-number">4.05e-05</span> ***
YearsAtCompany -<span class="hljs-number">5.047e-02</span>  <span class="hljs-number">1.792e-02</span>  -<span class="hljs-number">2.817</span>  <span class="hljs-number">0.00485</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">1305.4</span>  on <span class="hljs-number">1479</span>  degrees of freedom
Residual deviance: <span class="hljs-number">1252.5</span>  on <span class="hljs-number">1477</span>  degrees of freedom
AIC: <span class="hljs-number">1258.5</span>

Number of Fisher Scoring iterations: <span class="hljs-number">5</span>
</code></pre>
<p>Logistic regression is used for binary outcomes, that is, yes or no. It estimates probability.</p>
<p>Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (<code>Attrition</code>) based on <code>MonthlyIncome</code> and <code>YearsAtCompany</code>.</p>
<p>The intercept is <code>-0.809</code>. This is the baseline log-odds of leaving when both income and years at the company are zero.</p>
<p><code>MonthlyIncome</code> has a coefficient of <code>-0.0000945</code>. Each additional unit of monthly income lowers the log-odds of leaving by that amount, holding years at the company constant. The per-unit effect is tiny, but it adds up over income differences of thousands.</p>
<p><code>YearsAtCompany</code> has a coefficient of <code>-0.0505</code>. Each additional year multiplies the odds of leaving by about exp(-0.0505) ≈ 0.95, roughly a 5% reduction in the odds, holding income constant.</p>
<p>All coefficients are statistically significant at the 0.01 level. Both <code>MonthlyIncome</code> and <code>YearsAtCompany</code> are associated with a lower likelihood of leaving.</p>
<p>The model’s residual deviance is <code>1252.5</code>, lower than the null deviance of <code>1305.4</code>. This means the model explains some of the variation in attrition.</p>
<p>The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.</p>
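<p>Log-odds are hard to read directly, so it can help to convert them. The sketch below uses the coefficients from the summary output to compute odds ratios and a predicted probability. The employee profile (an income of 5000 and 5 years at the company) is a made-up example.</p>
<pre><code class="lang-r"># Coefficients copied from the summary output
b0 &lt;- -0.8094
b_income &lt;- -9.449e-05
b_years &lt;- -0.05047

# Odds ratios: multiplicative change in the odds of leaving
odds_ratio_years &lt;- exp(b_years)                 # per extra year
odds_ratio_income_1000 &lt;- exp(b_income * 1000)   # per extra 1000 in income

# Predicted probability of leaving for a hypothetical employee
log_odds &lt;- b0 + b_income * 5000 + b_years * 5
prob_leave &lt;- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
prob_leave
</code></pre>
<p>The odds ratios come out near 0.95 per year and 0.91 per 1000 in income, and the predicted probability for this profile is roughly 0.18. With the fitted model, <code>predict(hr_glm, newdata = ..., type = "response")</code> returns probabilities on the same scale.</p>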
<h2 id="heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</h2>
<p>Boxplots help you to:</p>
<ul>
<li><p><strong>Detect outliers:</strong> Boxplots highlight extreme values that can distort model results.</p>
</li>
<li><p><strong>Compare groups:</strong> Boxplots allow quick comparison of distributions across different categories.</p>
</li>
<li><p><strong>Form hypotheses:</strong> Visual patterns assist in identifying relationships worth testing in a model.</p>
</li>
<li><p><strong>Validate modeling assumptions:</strong> Boxplots help check distribution shape and variance before modeling.</p>
</li>
</ul>
<p>Modeling without visualization often leads to misinterpretation or false confidence.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to load and clean data, and how boxplots work and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Scatterplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, a... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-scatterplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">695ba922d307c8d32fc522ea</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Mon, 05 Jan 2026 12:05:54 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767614352690/8b993426-f193-4ff3-b5ec-dd6dda11028e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-types-in-r">How to Use Data Types in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-import-data-in-r">How to Import Data in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-visualize-data-with-ggplot2">How to Visualize Data with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-statistical-models-in-r">How to Build Statistical Models in R</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we get started, you should have the following:</p>
<ul>
<li><p>R installed (version 4.0 or higher).</p>
</li>
<li><p>RStudio installed (recommended for beginners).</p>
</li>
<li><p>Basic familiarity with programming concepts such as variables and functions.</p>
</li>
<li><p>A basic understanding of statistics (mean, correlation, regression).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Before you start working with data, load the required libraries:</p>
<pre><code class="lang-r">library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files
</code></pre>
<p>These commands load the required libraries into R. <code>tidyverse</code> is a collection of packages used for data manipulation and visualization, including <code>ggplot2</code>. <code>readxl</code> allows you to import Excel files directly into R without converting them to CSV format first.</p>
<h2 id="heading-how-to-use-data-types-in-r">How to Use Data Types in R</h2>
<p>Knowing data types helps you avoid errors and choose the right analysis methods.</p>
<h3 id="heading-common-data-types">Common Data Types</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Data type</td><td>Example</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>Numeric</td><td><code>x &lt;- 5.7</code></td><td>Measurements, prices</td></tr>
<tr>
<td>Integer</td><td><code>y &lt;- 10L</code></td><td>Counts</td></tr>
<tr>
<td>Character</td><td><code>"House prices"</code></td><td>Text labels</td></tr>
<tr>
<td>Logical</td><td><code>TRUE</code></td><td>Conditions</td></tr>
<tr>
<td>Complex</td><td><code>2 + 3i</code></td><td>Advanced math</td></tr>
</tbody>
</table>
</div><h3 id="heading-numeric-data-types-in-r">Numeric Data Types in R</h3>
<pre><code class="lang-r">price &lt;- <span class="hljs-number">199.99</span>
tax &lt;- <span class="hljs-number">16.5</span>
total_cost &lt;- price + tax
total_cost
</code></pre>
<p>Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.</p>
<h3 id="heading-integer-data-types-in-r">Integer Data Types in R</h3>
<pre><code class="lang-r">students &lt;- <span class="hljs-number">30L</span>
classes &lt;- <span class="hljs-number">4L</span>
total_students &lt;- students * classes
total_students
</code></pre>
<p>Integers are whole numbers and are commonly used for counting. The <code>L</code> tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.</p>
<h3 id="heading-character-data-types-in-r">Character Data Types in R</h3>
<pre><code class="lang-r">course_name &lt;- <span class="hljs-string">"Data Science"</span>
university &lt;- <span class="hljs-string">"Harvard University"</span>
paste(course_name, <span class="hljs-string">"at"</span>, university)
</code></pre>
<p>Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the <code>paste()</code> function. This data type cannot be used in mathematical operations.</p>
<h3 id="heading-logical-data-types-in-r">Logical Data Types in R</h3>
<pre><code class="lang-r">score &lt;- <span class="hljs-number">75</span>
passed &lt;- score &gt;= <span class="hljs-number">50</span>
passed
</code></pre>
<p>Logical data represents Boolean values: <code>TRUE</code> or <code>FALSE</code>. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns <code>TRUE</code> because the score meets the requirement. Logical values are essential in decision-making and control flow.</p>
<h3 id="heading-complex-data-types-in-r">Complex Data Types in R</h3>
<p>Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.</p>
<pre><code class="lang-r">z &lt;- <span class="hljs-number">2</span> + <span class="hljs-number">3i</span>
Mod(z)
</code></pre>
<p>This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.</p>
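<p>If you want to see where the magnitude comes from, the following sketch pulls the real and imaginary parts apart with <code>Re()</code> and <code>Im()</code> and recomputes <code>Mod()</code> by hand.</p>
<pre><code class="lang-r">z &lt;- 2 + 3i

real_part &lt;- Re(z)        # real part
imaginary_part &lt;- Im(z)   # imaginary part

# The magnitude is the square root of the sum of the squared parts
by_hand &lt;- sqrt(real_part^2 + imaginary_part^2)

# all.equal() compares with a small numeric tolerance
all.equal(Mod(z), by_hand)

# The complex conjugate flips the sign of the imaginary part
Conj(z)
</code></pre>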
<h2 id="heading-how-to-use-data-structures-in-r">How to Use Data Structures in R</h2>
<p>R stores data in different structures depending on your goals. Choosing the right structure matters because it makes operations easier and because R functions behave differently depending on the structure they receive. Structures also help R understand whether your data are numbers, categories, or text.</p>
<h3 id="heading-common-data-structures-in-r">Common Data Structures in R</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Structure</td><td>Best for</td></tr>
</thead>
<tbody>
<tr>
<td>Vector</td><td>Single column of data</td></tr>
<tr>
<td>Matrix</td><td>Numeric tables</td></tr>
<tr>
<td>Data Frame</td><td>Spreadsheet-like data</td></tr>
<tr>
<td>List</td><td>Mixed objects</td></tr>
</tbody>
</table>
</div><pre><code class="lang-r">vec &lt;- c(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>)
mat &lt;- matrix(<span class="hljs-number">1</span>:<span class="hljs-number">9</span>, nrow = <span class="hljs-number">3</span>)
df &lt;- data.frame(Name = c(<span class="hljs-string">"Car"</span>, <span class="hljs-string">"Bike"</span>), Number = c(<span class="hljs-number">110</span>, <span class="hljs-number">95</span>))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

str(lst) <span class="hljs-comment">##shows the structure of the list</span>
</code></pre>
<p>Let's understand the code above:</p>
<ul>
<li><p><code>vec</code> is a vector that stores a single type of data.</p>
</li>
<li><p><code>mat</code> is a matrix that organizes numeric values into rows and columns.</p>
</li>
<li><p><code>df</code> is a data frame that works like a spreadsheet, allowing different data types in each column.</p>
</li>
<li><p><code>lst</code> is a list that stores multiple objects of different types.</p>
</li>
<li><p>The <code>str()</code> function shows how these objects are nested within the list.</p>
</li>
</ul>
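<p>Once the structures exist, you will usually want to pull values back out of them. The sketch below rebuilds the same four objects and shows one common way to index each of them. The result variable names are illustrative.</p>
<pre><code class="lang-r">vec &lt;- c(1, 2, 3, 4)
mat &lt;- matrix(1:9, nrow = 3)
df &lt;- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst &lt;- list(numbers = vec, matrix = mat, info = df)

# Vectors use a single position
second_value &lt;- vec[2]

# Matrices use [row, column]; matrix() fills values column by column
cell &lt;- mat[2, 3]

# Data frame columns can be selected by name with $
name_column &lt;- df$Name

# List elements can be chained: list element, then column, then position
first_number &lt;- lst$info$Number[1]

second_value
cell
name_column
first_number
</code></pre>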
<h2 id="heading-how-to-import-data-in-r"><strong>How to Import Data in R</strong></h2>
<p>Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.</p>
<p><strong>For Windows:</strong> Replace each single backslash (<code>\</code>) in the copied path with either a double backslash (<code>\\</code>) or a single forward slash (<code>/</code>). For example:</p>
<pre><code class="lang-r"># Windows: use double backslashes...
data &lt;- read.csv("C:\\Users\\file\\Documents\\data.csv")

# ...or single forward slashes
data &lt;- read.csv("C:/Users/file/Documents/data.csv")
</code></pre>
<p><strong>For macOS/Linux:</strong> Single forward slashes work fine:</p>
<pre><code class="lang-r"><span class="hljs-comment"># macOS/Linux</span>
data &lt;- read.csv(<span class="hljs-string">"/Users/file/Documents/data.csv"</span>)
</code></pre>
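<p>If editing slashes by hand feels error-prone, there are two alternatives worth knowing. The sketch below builds the same illustrative path with <code>file.path()</code>, and also with a raw string, which assumes R 4.0.0 or later. Neither line reads a file, so you can run it safely even though the path does not exist on your machine.</p>
<pre><code class="lang-r"># file.path() joins the pieces with forward slashes on every platform
csv_path &lt;- file.path("C:", "Users", "file", "Documents", "data.csv")
csv_path

# Raw strings (R 4.0.0+) keep backslashes exactly as pasted
raw_path &lt;- r"(C:\Users\file\Documents\data.csv)"

# The raw string is the same value as the double-backslash version
raw_path == "C:\\Users\\file\\Documents\\data.csv"
</code></pre>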
<h3 id="heading-how-to-read-a-csv-and-excel-file"><strong>How to Read a CSV and Excel File</strong></h3>
<pre><code class="lang-r"><span class="hljs-comment">#Import CSV file </span>
data &lt;- read.csv(<span class="hljs-string">"C:/Users/file/Documents/data.csv"</span>)
<span class="hljs-comment"># On Windows, double backslashes also work:</span>
<span class="hljs-comment"># data &lt;- read.csv("C:\\Users\\file\\Documents\\data.csv")</span>

head(data)
</code></pre>
<p>You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (<code>/</code>) or double backslashes (<code>\\</code>). The imported data is stored as a data frame named <code>data</code>.</p>
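<pre><code class="lang-r"><span class="hljs-comment"># read_excel() comes from the readxl package</span>
<span class="hljs-keyword">library</span>(readxl)

data_excel &lt;- read_excel(<span class="hljs-string">"C:/Users/file/Documents/HR Data Set.xlsx"</span>)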
head(data_excel)
</code></pre>
<p>You can import an Excel file into R using the <code>read_excel()</code> function from the <code>readxl</code> package. The <code>head()</code> function is then used to preview the first few rows of the dataset.</p>
<p>Use the following commands to understand your data:</p>
<pre><code class="lang-r">str(data)
summary(data)

str(data_excel)
summary(data_excel)
</code></pre>
<p><code>str()</code> shows the structure of the dataset, including column names and data types. <code>summary()</code> provides descriptive statistics such as minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.</p>
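<p>If you don't have a CSV file to hand, you can still try the whole import workflow. The sketch below writes a small made-up data frame to a temporary file and reads it back, so the file name and contents are purely illustrative.</p>
<pre><code class="lang-r"># A small made-up dataset
demo &lt;- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))

# Write it to a temporary CSV file
tmp &lt;- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Import it the same way you would import your own file
imported &lt;- read.csv(tmp)

head(imported)      # preview the first rows
str(imported)       # column names and data types
summary(imported)   # descriptive statistics per column

# Remove the temporary file when you are done
unlink(tmp)
</code></pre>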
<h2 id="heading-how-to-visualize-data-with-ggplot2"><strong>How to Visualize Data with ggplot2</strong></h2>
<p>Visualization helps you spot patterns before you build models.</p>
<h3 id="heading-scatter-plot-example"><strong>Scatter Plot Example</strong></h3>
<p>We’ll use the built-in <code>mtcars</code> dataset in R. First, load the library to make it available for use:</p>
<pre><code class="lang-r">data(mtcars)
<span class="hljs-keyword">library</span>(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = <span class="hljs-number">3</span>, color = <span class="hljs-string">"blue"</span>) +
  geom_smooth(method = <span class="hljs-string">"lm"</span>, color = <span class="hljs-string">"red"</span>, se = <span class="hljs-literal">FALSE</span>) +
  labs(
    title = <span class="hljs-string">"Fuel Efficiency by Weight and Cylinders"</span>,
    x = <span class="hljs-string">"Weight (1000 lbs)"</span>,
    y = <span class="hljs-string">"Miles per Gallon"</span>
  ) +
  theme_minimal()
</code></pre>
<p>Let us break down the code to grasp it fully:</p>
<ul>
<li><p><code>data(mtcars)</code> loads the built-in <code>mtcars</code> dataset, which contains information about car specifications.</p>
</li>
<li><p><code>library(ggplot2)</code> enables data visualization.</p>
</li>
<li><p><code>aes()</code> maps dataset columns to plot aesthetics. Here it defines the <code>x</code> and <code>y</code> values.</p>
</li>
<li><p><code>geom_point()</code> draws the points. Arguments set outside <code>aes()</code>, such as <code>size</code> and <code>color</code>, style every point the same way. Because <code>color="blue"</code> is fixed here, it overrides the <code>color = factor(cyl)</code> mapping for the points.</p>
</li>
<li><p><code>geom_smooth()</code> adds a trend line. Here, we use <code>method="lm"</code> to fit a linear regression line. The <code>se</code> option controls the shaded confidence interval: use <code>TRUE</code> if you want the shading and <code>FALSE</code> if you don’t.</p>
</li>
<li><p><code>labs()</code> labels the plot by setting the <code>title</code> and the <code>x</code>-axis and <code>y</code>-axis labels.</p>
</li>
<li><p>Finally, we set the plot theme using <code>theme_minimal()</code>.</p>
</li>
</ul>
<p>Running this code will produce a scatterplot showing fuel efficiency by weight and cylinders. The plot should look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765914755069/8921e803-7fa6-4705-802c-23ff8918bee5.png" alt="Scatterplot of mpg against vehicle weight with regression line" class="image--center mx-auto" width="912" height="527" loading="lazy"></p>
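<p>If you would rather see the cylinder groups in the plot, one option is to leave the color mapping inside <code>aes()</code> and not set a fixed color on the points. The following is a minimal variation on the plot above, not a replacement for it. The legend title passed to <code>labs()</code> is my own choice.</p>
<pre><code class="lang-r">library(ggplot2)

# Color is mapped to the number of cylinders, so each group gets its own color
p &lt;- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

# Printing the object draws the plot
p
</code></pre>
<p>Storing the plot in a variable such as <code>p</code> also lets you add layers later or save it with <code>ggsave()</code>.</p>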
<h2 id="heading-how-to-build-statistical-models-in-r"><strong>How to Build Statistical Models in R</strong></h2>
<h3 id="heading-linear-regression"><strong>Linear Regression</strong></h3>
<p>You can use linear regression for continuous outcomes, that is, to predict numerical values. For example, to predict a car’s miles per gallon (<code>mpg</code>) based on weight (<code>wt</code>) and horsepower (<code>hp</code>), you can fit this model:</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)
</code></pre>
<p>But what does it mean?</p>
<ul>
<li><p><code>lm()</code> stands for linear model.</p>
</li>
<li><p>The response variable is <code>mpg</code>. This is the outcome you want to predict.</p>
</li>
<li><p>Predictor variables are <code>wt</code> and <code>hp</code>. These explain changes in the response.</p>
</li>
</ul>
<p>Once you run the model, it should look like this in your console:</p>
<pre><code class="lang-r">Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-<span class="hljs-number">3.941</span> -<span class="hljs-number">1.600</span> -<span class="hljs-number">0.182</span>  <span class="hljs-number">1.050</span>  <span class="hljs-number">5.854</span> 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) <span class="hljs-number">37.22727</span>    <span class="hljs-number">1.59879</span>  <span class="hljs-number">23.285</span>  &lt; <span class="hljs-number">2e-16</span> ***
wt          -<span class="hljs-number">3.87783</span>    <span class="hljs-number">0.63273</span>  -<span class="hljs-number">6.129</span> <span class="hljs-number">1.12e-06</span> ***
hp          -<span class="hljs-number">0.03177</span>    <span class="hljs-number">0.00903</span>  -<span class="hljs-number">3.519</span>  <span class="hljs-number">0.00145</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">2.593</span> on <span class="hljs-number">29</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.8268</span>,    Adjusted R-squared:  <span class="hljs-number">0.8148</span> 
<span class="hljs-literal">F</span>-statistic: <span class="hljs-number">69.21</span> on <span class="hljs-number">2</span> and <span class="hljs-number">29</span> DF,  p-value: <span class="hljs-number">9.109e-12</span>
</code></pre>
<p>Here’s an interpretation of the linear regression model:</p>
<ul>
<li><p>You created a model on miles per gallon (<code>mpg</code>) based on weight (<code>wt</code>) and horsepower (<code>hp</code>).</p>
</li>
<li><p>The intercept <code>37.227</code> is the predicted <code>mpg</code> when <code>wt=0</code> and <code>hp=0</code>. The intercept is the baseline value of the outcome when every predictor in the model is zero. No real car has zero weight or horsepower, so treat it as an anchor for the fitted line rather than a realistic prediction.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the <code>mpg</code> decreases by about <code>3.878</code>, holding horsepower constant. The <code>p-value</code> is &lt;0.001, so this effect is strong and statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the <code>mpg</code> decreases by about <code>0.032</code>, holding weight constant. The <code>p-value</code> of <code>0.00145</code> is <strong>less than 0.01</strong>, indicating that horsepower is a statistically significant predictor of <code>mpg</code>, although its per-unit effect is smaller than that of vehicle weight.</p>
</li>
</ul>
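<p>To make the coefficients concrete, you can ask the model for a prediction and then rebuild that prediction by hand. The sketch below uses a hypothetical car weighing 3,000 lbs (<code>wt = 3</code>) with 110 horsepower. These values are chosen only for illustration.</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical car: 3,000 lbs and 110 horsepower
new_car &lt;- data.frame(wt = 3, hp = 110)

# Let the model predict mpg for that car
predicted &lt;- predict(lm_model, newdata = new_car)
predicted

# Rebuild the same number from the coefficients:
# intercept + (wt coefficient * 3) + (hp coefficient * 110)
b &lt;- coef(lm_model)
by_hand &lt;- b["(Intercept)"] + b["wt"] * 3 + b["hp"] * 110

# Both routes should agree
all.equal(unname(predicted), unname(by_hand))
</code></pre>
<p>With the rounded coefficients from the output above, the hand calculation is roughly 37.2 minus 11.6 minus 3.5, or about 22 miles per gallon.</p>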
<h3 id="heading-does-the-model-fit-the-data-and-why">Does the Model Fit the Data, and Why?</h3>
<p>The R-squared value of <code>0.8268</code> shows that about 83% of the variation in <code>mpg</code> is explained by weight and horsepower.</p>
<p><strong>Summary of the interpretation</strong>: Heavier cars and cars with more horsepower tend to have lower fuel efficiency. These two variables explain most of the variation in <code>mpg</code> in the dataset.</p>
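<p>The R-squared figure can also be read directly from the model summary instead of the printed output. The sketch below extracts it and then recomputes it from the residuals, as one way to see what "variation explained" means. The variable names are illustrative.</p>
<pre><code class="lang-r">lm_model &lt;- lm(mpg ~ wt + hp, data = mtcars)
fit &lt;- summary(lm_model)

# Values reported in the printed summary
fit$r.squared
fit$adj.r.squared

# R-squared by hand: 1 - (unexplained variation / total variation)
rss &lt;- sum(residuals(lm_model)^2)                # residual sum of squares
tss &lt;- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
r_squared_by_hand &lt;- 1 - rss / tss

all.equal(fit$r.squared, r_squared_by_hand)
</code></pre>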
<h3 id="heading-logistic-regression"><strong>Logistic Regression</strong></h3>
<p>You can use logistic regression for binary outcomes, like yes/no questions. For example, predicting whether a vehicle is automatic or manual based on weight and horsepower.</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)
</code></pre>
<p>Let's understand the code:</p>
<ul>
<li><p><code>glm()</code> stands for generalized linear model.</p>
</li>
<li><p>The <code>family=binomial</code> option tells R to run logistic regression.</p>
</li>
<li><p>The response variable <code>am</code> indicates transmission type: 0 = automatic, 1 = manual.</p>
</li>
<li><p>Predictor variables remain <code>wt</code> and <code>hp</code>.</p>
</li>
</ul>
<p>Once you run the model, it should look like this in your console:</p>
<pre><code class="lang-r">Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)   
(Intercept) <span class="hljs-number">18.86630</span>    <span class="hljs-number">7.44356</span>   <span class="hljs-number">2.535</span>  <span class="hljs-number">0.01126</span> * 
wt          -<span class="hljs-number">8.08348</span>    <span class="hljs-number">3.06868</span>  -<span class="hljs-number">2.634</span>  <span class="hljs-number">0.00843</span> **
hp           <span class="hljs-number">0.03626</span>    <span class="hljs-number">0.01773</span>   <span class="hljs-number">2.044</span>  <span class="hljs-number">0.04091</span> * 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">43.230</span>  on <span class="hljs-number">31</span>  degrees of freedom
Residual deviance: <span class="hljs-number">10.059</span>  on <span class="hljs-number">29</span>  degrees of freedom
AIC: <span class="hljs-number">16.059</span>

Number of Fisher Scoring iterations: <span class="hljs-number">8</span>
</code></pre>
<p>Here’s an interpretation of the logistic regression model:</p>
<ul>
<li><p>The intercept <code>18.866</code> represents the log-odds of a car being manual when <code>wt=0</code> and <code>hp=0</code>. In other words, when all other variables are <code>0</code>, the baseline log-odds of the outcome is <code>18.866</code>. The intercept is always the baseline value of the outcome when all other variables in the model are zero.</p>
</li>
<li><p>With every additional unit of weight (1000 lbs), the log odds of the car being manual decrease by <code>8.083</code>. This variable strongly affects the probability of the car being manual, as seen with the <code>p-value</code> being <code>0.008</code>, which is statistically significant.</p>
</li>
<li><p>With every additional unit of horsepower, the log odds of the car being manual increase by <code>0.036</code>. This variable also affects the probability of being manual, as seen with the <code>p-value</code> being <code>0.041</code>, which is statistically significant.</p>
</li>
</ul>
<p><strong>Summary of the interpretation</strong>: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, <code>wt</code> and <code>hp</code> explain a large portion of transmission type variation.</p>
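<p>Log-odds are hard to reason about directly, so it often helps to convert them. The sketch below exponentiates the coefficients to get odds ratios and asks the model for a predicted probability. The car with <code>wt = 3</code> and <code>hp = 110</code> is hypothetical and used only for illustration.</p>
<pre><code class="lang-r">glm_model &lt;- glm(am ~ wt + hp, data = mtcars, family = binomial)

# exp() turns log-odds coefficients into odds ratios
# below 1: the odds of manual go down; above 1: the odds go up
odds_ratios &lt;- exp(coef(glm_model))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical car
new_car &lt;- data.frame(wt = 3, hp = 110)
p_manual &lt;- predict(glm_model, newdata = new_car, type = "response")
p_manual
</code></pre>
<p>Setting <code>type = "response"</code> returns a probability between 0 and 1. Without it, <code>predict()</code> returns the value on the log-odds scale.</p>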
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.</p>
<p>This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.</p>
<p>Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.</p>
<p>You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.</p>
<p>With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping With RSelenium (Chrome Driver) and Rvest ]]>
                </title>
                <description>
                    <![CDATA[ Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use. In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target websi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/web-scraping-with-rselenium-chrome-driver-and-rvest/</link>
                <guid isPermaLink="false">67d8272af45871e3e821d5fa</guid>
                
                    <category>
                        <![CDATA[ Rselenium ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RVest ]]>
                    </category>
                
                    <category>
                        <![CDATA[ selenium ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ webscraping  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ chromedriver ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Elabonga Atuo ]]>
                </dc:creator>
                <pubDate>Mon, 17 Mar 2025 13:44:10 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742219025681/47c07711-cfa5-482f-a72b-d127bc5b63bc.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use.</p>
<p>In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target website displays different books across multiple pages, requiring navigation between them. You'll learn how to use RVest for data extraction and RSelenium to automate button clicks.</p>
<p>There are a couple of housekeeping rules when it comes to harvesting data on the internet:</p>
<ul>
<li><p><strong>Inspect the robots.txt file</strong>: Check the robots.txt file of a website to understand what data you are allowed to extract. You can find this file by appending “/robots.txt” to the website's home URL.</p>
</li>
<li><p><strong>Review terms and conditions</strong>: Before scraping, read the website's terms and conditions to understand the legal expectations regarding data extraction.</p>
</li>
<li><p><strong>Limit requests</strong>: Avoid overloading the server with requests by implementing rate limiting. The <a target="_blank" href="https://dmi3kno.github.io/polite/">polite</a> library in R can help manage request rates effectively.</p>
</li>
</ul>
<p>Let’s dive in!</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-project-overview">Project Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-and-inspect-a-webpage">How to Understand and Inspect a Webpage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-extract-data-using-rvest">How to Extract Data Using RVest</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-mimic-human-behaviour-using-rselenium">How to Mimic Human Behaviour Using RSelenium</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-combine-rselenium-amp-rvest-and-save-to-csv">How to Combine RSelenium &amp; RVest and Save to CSV</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bringing-it-all-together">Bringing it All Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-project-overview">Project Overview</h2>
<p>Here’s what we’re going to be building:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739891904874/e10f91f5-f5ba-4a9d-82d7-bd297b409b1b.gif" alt="e10f91f5-f5ba-4a9d-82d7-bd297b409b1b" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This approach to web scraping allows you to see the browser in action as it navigates and extracts data from the website. Unlike headless browsing, where everything runs in the background without a visible interface, this method provides a graphical UI, making it easier to monitor and debug the process.</p>
<p>To practice your data mining skills, you will be scraping data from a website built specifically for that: <a target="_blank" href="https://books.toscrape.com/">Books To Scrape</a>. You are going to be using a driver to drive a browser which will then open your target website. It’ll navigate from the first page, mimicking human behaviour (clicking the next button) while collecting data about the books, right to the last page.</p>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-prerequisites"><strong>Prerequisites:</strong></h3>
<p>To follow along with this tutorial, you will need:</p>
<ul>
<li><p>R programming knowledge</p>
</li>
<li><p>HTML knowledge</p>
</li>
<li><p>R Studio installed</p>
</li>
</ul>
<p>Note that I’m building this tutorial on a Windows machine.</p>
<h3 id="heading-setup-and-install-chrome-driver">Setup and Install Chrome Driver</h3>
<p>First, you’ll want to check to make sure you have Java installed on your computer by running this terminal command:</p>
<pre><code class="lang-bash">java -version
</code></pre>
<p>If it’s not present, download and install Java <a target="_blank" href="https://www.java.com/en/download/">here</a>.</p>
<p>Next, install the Chrome browser if you don’t already have it. Once it’s installed, check for your browser version in the settings section.</p>
<p>Then you can download the browser driver that corresponds to your browser version <a target="_blank" href="https://developer.chrome.com/docs/chromedriver/downloads/version-selection">here</a>. Check where other browser drivers are stored on your device by running this in the RStudio console:</p>
<pre><code class="lang-r"><span class="hljs-comment"># install and load wdman and binman packages</span>
install.packages(<span class="hljs-string">"wdman"</span>)
<span class="hljs-keyword">library</span>(wdman)

install.packages(<span class="hljs-string">"binman"</span>)
<span class="hljs-keyword">library</span>(binman)

<span class="hljs-comment"># check drivers already installed</span>
binman::list_versions(appname = <span class="hljs-string">"chromedriver"</span>)

<span class="hljs-comment"># check browser driver locations</span>
wdman::selenium(retcommand = <span class="hljs-literal">TRUE</span>, check = <span class="hljs-literal">FALSE</span>)
</code></pre>
<p>Extract the driver “.exe” and store it at the specified folder location. This is usually the following location:</p>
<pre><code class="lang-bash"><span class="hljs-string">"C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\chromedriver.exe"</span>
</code></pre>
<p>Now, add the drivers to your system path by specifying the folder path excluding the application. Confirm installation by running the following terminal command.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Chromedriver SYSTEMS PATH: "C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\"</span>
<span class="hljs-comment"># check chromedriver installation</span>
chromedriver -version
</code></pre>
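<p>You can run a similar check from the R console. The sketch below uses base R's <code>Sys.which()</code> to look for an executable named <code>chromedriver</code> on the system path. The messages are my own wording.</p>
<pre><code class="lang-r"># Sys.which() returns the full path of an executable, or "" if it is not found
driver_path &lt;- Sys.which("chromedriver")

if (nzchar(driver_path)) {
  message("chromedriver found at: ", driver_path)
} else {
  message("chromedriver is not on the system path yet")
}
</code></pre>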
<h2 id="heading-how-to-understand-and-inspect-a-webpage">How to Understand and Inspect a Webpage</h2>
<p>A webpage is a visual representation of an HTML document that is available on the internet and accessed through a web browser. The components of a webpage, called elements, are structured hierarchically in an HTML DOM (Document Object Model) tree. Each element can be located using specific paths called selectors or locators, which you can read more about <a target="_blank" href="https://testrigor.com/blog/css-selector-vs-xpath-your-pocket-cheat-sheet/">here</a>.</p>
<p>Developer Tools are a set of tools available in your browser. They’re helpful for inspecting and analyzing a webpage’s structure. The “Inspect” feature helps you examine the structure and styling of a specific element. You can access this feature by selecting the element you would like to inspect, right-clicking on it, and clicking “Inspect”.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739974770342/59c960b1-2c88-4c1d-a23d-d9e9fee91dc5.gif" alt="Inspecting an element" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-extract-data-using-rvest">How to Extract Data Using RVest</h2>
<p>RVest is an R package that contains a set of functions for extracting data from HTML and XML web pages.</p>
<p>We are interested in extracting the following information about books from every page on the website’s catalogue:</p>
<ul>
<li><p>Book Title</p>
</li>
<li><p>Book Rating</p>
</li>
<li><p>Book Price</p>
</li>
<li><p>Individual Book Link</p>
</li>
<li><p>Cover Image Link</p>
</li>
</ul>
<p>Let’s go through the steps for using RVest to extract this data.</p>
<h3 id="heading-step-1-load-the-webpage"><strong>Step 1: Load the webpage</strong></h3>
<p>To load the first page of your target website and parse the HTML document using the RVest package in R, follow these steps:</p>
<ol>
<li><p><strong>Install and load the RVest package</strong>: If you haven't already installed the RVest package, you can do so by running the following command in R:</p>
<pre><code class="lang-r"> install.packages(<span class="hljs-string">"rvest"</span>)
</code></pre>
<p> Then, load the package:</p>
<pre><code class="lang-r"> <span class="hljs-keyword">library</span>(rvest)
</code></pre>
</li>
<li><p><strong>Load the webpage and parse the HTML</strong>: Use the <code>read_html()</code> function from the RVest package to fetch and parse the HTML content of the webpage. Here's an example of how to do this:</p>
<pre><code class="lang-r"> <span class="hljs-comment"># Specify the URL of the target website</span>
 url &lt;- <span class="hljs-string">"https://books.toscrape.com/"</span>

 <span class="hljs-comment"># Fetch and parse the HTML content</span>
 webpage &lt;- read_html(url)
</code></pre>
</li>
</ol>
<p>This code will download the HTML content of the specified webpage and convert it into an XML document, making it easier to structure and organize the data for further processing or storage.</p>
<h3 id="heading-step-2-identify-the-target-elements"><strong>Step 2: Identify the target elements</strong></h3>
<p>The target elements are the HTML elements that contain the specific data you intend to extract.</p>
<p>A quick inspection of the webpage using developer tools shows that each book’s information is contained in an <code>article</code> tag and forms part of an ordered list. It’s important to specify the <code>&lt;ol&gt;</code> tag in the path, as there are other lists in the tree.</p>
<p>The pipe <code>%&gt;%</code> operator facilitates chaining operations, making it easier to extract elements step by step. <code>html_element()</code> returns the first matching element while <code>html_elements()</code> returns all the elements that match the defined path.</p>
<pre><code class="lang-r"><span class="hljs-comment"># define the path from which other details will be extracted</span>
book &lt;- webpage %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

<span class="hljs-comment"># extracting details using css locators.</span>
<span class="hljs-comment"># title</span>
title &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"title"</span>)

<span class="hljs-comment"># rating</span>
rating &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"p"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"class"</span>)

<span class="hljs-comment"># price</span>
price &lt;- book %&gt;% 
  html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
  html_text2()

<span class="hljs-comment">#link to book page</span>
book_link &lt;- book %&gt;% 
  html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"href"</span>)

<span class="hljs-comment"># cover page image link</span>
cover_page_link &lt;- book %&gt;% 
  html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
  html_attr(<span class="hljs-string">"src"</span>)

<span class="hljs-comment"># inspect right format by selecting the first element of each detail</span>
title[[<span class="hljs-number">1</span>]]
rating[[<span class="hljs-number">1</span>]]
price[[<span class="hljs-number">1</span>]]
book_link[[<span class="hljs-number">1</span>]]
cover_page_link[[<span class="hljs-number">1</span>]]
</code></pre>
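<p>To see the difference between <code>html_element()</code> and <code>html_elements()</code> without making a network request, you can try them on a tiny page built with rvest's <code>minimal_html()</code>. The markup below is an illustrative stand-in that I made up, not a copy of the real site.</p>
<pre><code class="lang-r">library(rvest)

# A tiny stand-in for the catalogue page (illustrative markup only)
page &lt;- minimal_html('
  &lt;ol&gt;
    &lt;li&gt;&lt;article&gt;&lt;h3&gt;&lt;a title="Book A" href="a.html"&gt;Book A&lt;/a&gt;&lt;/h3&gt;&lt;/article&gt;&lt;/li&gt;
    &lt;li&gt;&lt;article&gt;&lt;h3&gt;&lt;a title="Book B" href="b.html"&gt;Book B&lt;/a&gt;&lt;/h3&gt;&lt;/article&gt;&lt;/li&gt;
  &lt;/ol&gt;
')

# html_elements() returns every match
items &lt;- page %&gt;% html_elements("ol li")
length(items)

# html_element() on the page returns only the first match
first_title &lt;- page %&gt;% html_element("h3 a") %&gt;% html_attr("title")
first_title

# html_element() on a set of nodes returns one result per node
all_titles &lt;- items %&gt;% html_element("h3 a") %&gt;% html_attr("title")
all_titles
</code></pre>
<p>The last pattern is the one used above: select all the book containers first, then pull one value out of each, so every vector has one entry per book.</p>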
<h3 id="heading-step-3-clean-the-rating-data"><strong>Step 3: Clean the “rating” data</strong></h3>
<p>To clean the "star-rating" data, you can use the <code>stringr</code> package in R to remove the unnecessary text and trim any whitespace. Here's how you can do it:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(stringr)

<span class="hljs-comment"># Example of extracted rating data</span>
rating_data &lt;- <span class="hljs-string">"star-rating Three"</span>

<span class="hljs-comment"># Remove "star-rating " and trim whitespace</span>
cleaned_rating &lt;- str_trim(str_replace(rating_data, <span class="hljs-string">"star-rating "</span>, <span class="hljs-string">""</span>))

<span class="hljs-comment"># Output the cleaned rating</span>
cleaned_rating
</code></pre>
<p>This code will output "Three", effectively removing the "star-rating" prefix and any leading or trailing whitespace.</p>
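<p>The same cleaning step works on a whole vector of ratings, and the resulting words can be turned into numbers if you plan to sort or average them. The sketch below is one possible approach using a named lookup vector. The sample values and the <code>rating_lookup</code> name are illustrative.</p>
<pre><code class="lang-r">library(stringr)

# Illustrative class attributes, as they might come back from html_attr("class")
ratings_raw &lt;- c("star-rating Three", "star-rating One", "star-rating Five")

# stringr functions are vectorized, so every element is cleaned at once
rating_words &lt;- str_trim(str_remove(ratings_raw, "star-rating"))
rating_words

# A named vector works as a simple word-to-number lookup
rating_lookup &lt;- c(One = 1L, Two = 2L, Three = 3L, Four = 4L, Five = 5L)
rating_numbers &lt;- unname(rating_lookup[rating_words])
rating_numbers
</code></pre>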
<h2 id="heading-how-to-mimic-human-behaviour-using-rselenium">How to Mimic Human Behaviour Using RSelenium</h2>
<h3 id="heading-how-selenium-works"><strong>How Selenium Works</strong></h3>
<p>Selenium is a tool that allows you to simulate user actions on a website, usually for testing purposes. RSelenium is an R library that allows you to access this functionality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739961235501/f358a1e1-6a2f-45dd-a0b0-12925811cab1.png" alt="Diagram illustrating Selenium's architecture. It shows a client with a Selenium script communicating with a server's browser driver using JSON Wire Protocol over HTTP. The server then sends a HTTP request to a browser" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We need a script, a browser, and browser driver to mimic user behaviour. The code you write that contains the instructions detailing the actions you would like to automate is the script. The browser driver acts as a bridge between your script and the browser and performs your desired actions by translating the script into actions.</p>
<p>The script, when run, is the client which requests and receives info from the browser driver’s server.</p>
<p>When you run a script, the script is converted to JSON format data which is then transferred to the browser driver via the JSON Wire Protocol. A protocol is simply a set of rules that define how data should be managed and handled during transfer across devices.</p>
<p>The driver receives and validates the received data. If successful, it communicates the actions defined in the script to the browser. If it’s unsuccessful, an error is sent to the client.</p>
<p>On browser initialization, the driver performs the actions step by step. This carries on to completion or until an error is encountered (missing elements, server errors, and so on). The bidirectional communication between the driver and browser is via HTTP. Finally, the results are sent back to the client and the browser is shut down.</p>
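<p>To make the JSON step concrete, here is a minimal sketch of the kind of request body a client sends when it asks the driver to open a page. It assumes the jsonlite package is installed (it is not used elsewhere in this tutorial), and the session ID is a made-up placeholder. RSelenium builds and sends these requests for you, so this is only an illustration of what travels over the wire.</p>
<pre><code class="lang-r">library(jsonlite)

# Hypothetical session ID; a real one is issued by the driver when a session starts
session_id &lt;- "abc123"

# In the JSON Wire Protocol, navigation is a POST to /session/:sessionId/url
endpoint &lt;- paste0("/session/", session_id, "/url")

# The request body is a JSON object carrying the target URL
payload &lt;- toJSON(list(url = "https://books.toscrape.com/"), auto_unbox = TRUE)

cat("POST", endpoint, "\n")
cat(as.character(payload), "\n")
</code></pre>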
<h3 id="heading-automating-page-navigation-and-data-collection-with-rselenium">Automating Page Navigation and Data Collection with RSelenium</h3>
<pre><code class="lang-r"><span class="hljs-comment"># install and load RSelenium</span>
install.packages(<span class="hljs-string">"RSelenium"</span>)
<span class="hljs-keyword">library</span>(RSelenium)

<span class="hljs-comment"># initialize and run the chrome driver</span>
rD &lt;- rsDriver(browser = <span class="hljs-string">"chrome"</span>, port = <span class="hljs-number">4567L</span>)

<span class="hljs-comment"># extract and assign the client</span>
remDr &lt;- rD[[<span class="hljs-string">"client"</span>]]
</code></pre>
<p>Running <code>rsDriver()</code> starts a Selenium server that launches ChromeDriver. Extract and assign the <code>rD[["client"]]</code> to a variable. This variable allows you to control and interact with the browser.</p>
<p>Sometimes, starting the driver may fail due to permission restrictions, missing dependencies, or an incorrect setup. If that happens, you can manually launch ChromeDriver with <code>chrome()</code> from the wdman package (load it with <code>library(wdman)</code>) by adding the following block of code right after loading the libraries at the top of the script. Make sure the port number matches the one passed to <code>rsDriver()</code>.</p>
<pre><code class="lang-r">cDrv &lt;- chrome(verbose = <span class="hljs-literal">FALSE</span>, check = <span class="hljs-literal">FALSE</span>, port = <span class="hljs-number">4567L</span>)
cDrv$process
</code></pre>
<p>Now, navigate to the target webpage:</p>
<pre><code class="lang-r"><span class="hljs-comment"># navigate to the target site</span>
remDr$navigate(<span class="hljs-string">"https://books.toscrape.com/"</span>)

<span class="hljs-comment">#maximize Chrome Window Size</span>
remDr$maxWindowSize()
</code></pre>
<p>And scroll to the bottom of the page:</p>
<pre><code class="lang-r"><span class="hljs-comment"># scroll to the bottom of the page</span>
webElem &lt;- remDr$findElement(<span class="hljs-string">"css"</span>, <span class="hljs-string">"body"</span>)
webElem$sendKeysToElement(list(key = <span class="hljs-string">"end"</span>))
</code></pre>
<p>The above code locates the body element and simulates pressing the End key, which scrolls to the bottom of the page.</p>
<p>Now, click Next to navigate to the next page:</p>
<pre><code class="lang-r"><span class="hljs-comment"># locate next button and click next</span>
nextPage &lt;-  remDr$findElement(using = <span class="hljs-string">"css selector"</span>,
                               value = <span class="hljs-string">".next &gt; a"</span>)
nextPage$clickElement()
</code></pre>
<p>Find the element that contains the link to the next page and click on it to redirect you.</p>
<p>Now we’re going to write a while loop that navigates through all the pages, up to page 50, and then closes the browser once it’s done.</p>
<p>A while loop executes a piece of code as long as a specific condition is met. Once the condition is not met, the loop exits.</p>
<pre><code class="lang-r"><span class="hljs-keyword">while</span>(condition is <span class="hljs-literal">TRUE</span>){
    <span class="hljs-comment">#DO SOMETHING</span>
}
</code></pre>
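<p>As a concrete, runnable version of that template, here is a small sketch that needs no browser. The page counter and the five-page limit are illustrative values only.</p>
<pre><code class="lang-r"># Illustrative counter: pretend the site has 5 pages
page &lt;- 1
last_page &lt;- 5

# Keep moving to the next page while we are not on the last one
while (page &lt; last_page) {
  page &lt;- page + 1
}

print(page)
</code></pre>
<p>The condition is checked before every pass, so the loop stops as soon as <code>page</code> reaches <code>last_page</code>.</p>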
<p>Write a loop that ensures the next page button is clicked as long as the element containing the link to the next page is visible in the HTML DOM.</p>
<p>First, locate the next button element. Its presence in the open webpage makes sure that the loop runs.</p>
<p>The last page does not have a next button, so the loop will exit when it reaches that page (and Selenium will throw an error due to the missing element).</p>
<pre><code class="lang-r">nextPage &lt;- remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
</code></pre>
<p>Wrap the nextPage element search in a <code>tryCatch()</code> block. This prevents the script from crashing if the 'Next' button is missing. If an error occurs, <code>tryCatch()</code> returns <code>NULL</code>, signaling that there are no more pages to navigate.</p>
<p>An <code>if</code> block then checks for a <code>NULL</code> value. If encountered, a message is displayed to inform the client that no 'Next' button was found, and the <code>break</code> statement exits the loop.</p>
<p>Finally, close the browser once the driver navigates to the last page (page 50 in the catalogue) to free up system resources using <code>remDr$close()</code>.</p>
<pre><code class="lang-r">
<span class="hljs-keyword">while</span> (<span class="hljs-literal">TRUE</span>) {  
  <span class="hljs-comment"># Try to find and click "Next" button</span>
  nextPage &lt;- <span class="hljs-keyword">tryCatch</span>({
    remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
  }, error = <span class="hljs-keyword">function</span>(e) {
    <span class="hljs-keyword">return</span>(<span class="hljs-literal">NULL</span>)  <span class="hljs-comment"># No more pages</span>
  })

  <span class="hljs-keyword">if</span> (is.null(nextPage)) {
    message(<span class="hljs-string">"No 'Next' button found. Exiting loop."</span>)
    <span class="hljs-keyword">break</span>
  }

  nextPage$clickElement()
  Sys.sleep(<span class="hljs-number">3</span>)  <span class="hljs-comment"># Allow next page to load</span>

}
print(<span class="hljs-string">"finished scraping"</span>)
remDr$close()
</code></pre>
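<p>If you want to see the <code>tryCatch()</code> and <code>break</code> pattern in isolation, the sketch below replaces the browser with a hypothetical helper, <code>find_next_button()</code>, that raises an error on the last page, much as the element search does when the 'Next' link is missing. The helper and its page numbers are assumptions made for illustration and are not part of RSelenium.</p>
<pre><code class="lang-r"># Hypothetical stand-in for the element search: errors when there is no "Next" link
find_next_button &lt;- function(page, last_page = 50) {
  if (page &gt;= last_page) stop("no such element")
  page + 1  # pretend the "button" is simply the next page number
}

page &lt;- 1
while (TRUE) {
  # Turn the error into NULL so the script keeps running
  next_page &lt;- tryCatch(find_next_button(page), error = function(e) NULL)

  # NULL means there is nothing left to click
  if (is.null(next_page)) break

  page &lt;- next_page
}

print(page)
</code></pre>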
<h2 id="heading-how-to-combine-rselenium-amp-rvest-and-save-to-csv">How to Combine RSelenium &amp; RVest and Save to CSV</h2>
<p>Now that we’ve extracted data from specific HTML elements using RVest and automated user actions using RSelenium, let’s combine the two to scrape data from all the pages in the website.</p>
<h3 id="heading-create-a-scrape-books-function"><strong>Create a scrape books function</strong></h3>
<p>You will be saving the scraped books information in a CSV file. First, create an empty dataframe to hold the scraped data:</p>
<pre><code class="lang-r"><span class="hljs-comment"># install and load dplyr for dataframe manipulation</span>
install.packages(<span class="hljs-string">"dplyr"</span>)
<span class="hljs-keyword">library</span>(dplyr)

<span class="hljs-comment"># create a dataframe to hold book information</span>
Books &lt;-  data.frame()
</code></pre>
<h3 id="heading-retrieve-and-parse-the-webpage">Retrieve and parse the webpage</h3>
<p>For Rvest to work with RSelenium, you first have to retrieve the HTML source of the page currently loaded in the Selenium-controlled browser. Use <code>remDr$getPageSource()[[1]]</code> to extract the HTML content.</p>
<pre><code class="lang-r">page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]
</code></pre>
<p>Convert the HTML content to XML using <code>read_html()</code> like this:</p>
<pre><code class="lang-r"> <span class="hljs-comment"># define the path from which other details will be extracted</span>
    book &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)
</code></pre>
<p>Extract each book’s details using CSS selectors with <code>rvest</code> functions. The scraped objects returned are XML objects and lists. They need to be formatted to character strings, preventing unexpected data type issues when working with the data. Do this by piping <code>as.character()</code> at the very end of each extracted detail.</p>
<pre><code class="lang-r">    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character()
</code></pre>
<p>Wrap the block of code used to extract details from HTML elements in a function and return a dataframe whose column values are the book details. This makes the code reusable and modular.</p>
<pre><code class="lang-r">
scrape_books &lt;- <span class="hljs-keyword">function</span>() {
    page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]

    <span class="hljs-comment"># define the path from which other details will be extracted</span>
    book &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

    <span class="hljs-comment"># extracting details using css locators.</span>
    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># rating</span>
    rating &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"p"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"class"</span>) %&gt;% 
      as.character() 

    cleaned_rating &lt;- str_trim(gsub(<span class="hljs-string">"star-rating"</span>, <span class="hljs-string">""</span>, rating))

    <span class="hljs-comment"># price</span>
    price &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
      html_text2() %&gt;% 
      as.character() 

    <span class="hljs-comment">#link to book page</span>
    book_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"href"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># image link</span>
    cover_page_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"src"</span>) %&gt;% 
      as.character() 

    <span class="hljs-keyword">return</span>(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = <span class="hljs-literal">FALSE</span>))
}
</code></pre>
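<p>To try the selectors without launching a browser, you can run them against a small HTML string. The markup below is hand-written to imitate the structure of two listings. It is not copied from the live site, and the titles and prices are invented.</p>
<pre><code class="lang-r">library(rvest)

# Hand-written HTML imitating two book listings (illustrative values only)
html &lt;- '
&lt;ol&gt;
  &lt;li&gt;&lt;article&gt;
    &lt;p class="star-rating Three"&gt;&lt;/p&gt;
    &lt;h3&gt;&lt;a href="book-one.html" title="Book One"&gt;Book O...&lt;/a&gt;&lt;/h3&gt;
    &lt;div class="product_price"&gt;&lt;p&gt;10.00&lt;/p&gt;&lt;/div&gt;
  &lt;/article&gt;&lt;/li&gt;
  &lt;li&gt;&lt;article&gt;
    &lt;p class="star-rating One"&gt;&lt;/p&gt;
    &lt;h3&gt;&lt;a href="book-two.html" title="Book Two"&gt;Book T...&lt;/a&gt;&lt;/h3&gt;
    &lt;div class="product_price"&gt;&lt;p&gt;20.00&lt;/p&gt;&lt;/div&gt;
  &lt;/article&gt;&lt;/li&gt;
&lt;/ol&gt;'

# One node per book; each extractor below returns one value per node
book &lt;- read_html(html) %&gt;% html_element("ol") %&gt;% html_elements("li") %&gt;% html_element("article")

title &lt;- book %&gt;% html_element("h3 a") %&gt;% html_attr("title")
rating &lt;- book %&gt;% html_element("p") %&gt;% html_attr("class")
cleaned_rating &lt;- trimws(gsub("star-rating", "", rating))
price &lt;- book %&gt;% html_element(".product_price p") %&gt;% html_text2()

print(data.frame(title, cleaned_rating, price, stringsAsFactors = FALSE))
</code></pre>
<p>Because the extractors work on the whole node set at once, the resulting dataframe has one row per book with no explicit loop.</p>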
<h3 id="heading-write-to-csv"><strong>Write to CSV</strong></h3>
<p>Save the dataframe to a CSV file named “books.csv”:</p>
<pre><code class="lang-r">write.csv(Books, file = <span class="hljs-string">"./books.csv"</span>, fileEncoding = <span class="hljs-string">"UTF-8"</span>)
</code></pre>
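<p>By default, <code>write.csv()</code> also writes the row names as an extra, unnamed first column. If you would rather keep only the scraped columns, one option is <code>row.names = FALSE</code>. The sketch below uses a made-up two-row dataframe and a temporary file to show the round trip.</p>
<pre><code class="lang-r"># Made-up data standing in for the scraped results
demo &lt;- data.frame(title = c("Book One", "Book Two"),
                   price = c("10.00", "20.00"),
                   stringsAsFactors = FALSE)

# Write to a temporary file without the row-name column
tmp &lt;- tempfile(fileext = ".csv")
write.csv(demo, file = tmp, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back, keeping every column as text
restored &lt;- read.csv(tmp, colClasses = "character")
print(names(restored))
</code></pre>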
<h2 id="heading-bringing-it-all-together">Bringing it All Together</h2>
<p>Let’s review what we’ve done so far: First, the script to scrape book data begins by loading the browser, maximizing the window size, and navigating to the Books To Scrape Page.</p>
<p>Then we created an empty dataframe to hold the scraped data. We scraped the data from the first page, saved it to the dataframe, and located the ‘Next’ button in order to navigate to the next page, from which we scraped and stored more data.</p>
<p>The process of scraping, adding to the dataframe, and clicking the next page button continues until the ‘Next’ button is no longer available in the HTML DOM.</p>
<p>Once the last page has been reached, the code exits the loop and saves the data to CSV. Finally, it closes the driver to free up system resources.</p>
<pre><code class="lang-r"><span class="hljs-comment"># load libraries</span>
<span class="hljs-keyword">library</span>(wdman)
<span class="hljs-keyword">library</span>(binman)
<span class="hljs-keyword">library</span>(rvest)
<span class="hljs-keyword">library</span>(stringr)
<span class="hljs-keyword">library</span>(RSelenium)
<span class="hljs-keyword">library</span>(dplyr)


cDrv &lt;- chrome(verbose = <span class="hljs-literal">FALSE</span>, check = <span class="hljs-literal">FALSE</span>, port = <span class="hljs-number">4450L</span>)
cDrv$process

rD &lt;- rsDriver(browser = <span class="hljs-string">"chrome"</span>, port = <span class="hljs-number">4450L</span>)
remDr &lt;- rD[[<span class="hljs-string">"client"</span>]]


remDr$navigate(<span class="hljs-string">"https://books.toscrape.com/"</span>)
remDr$maxWindowSize()

<span class="hljs-comment"># scroll to the bottom of the first page</span>
webElem &lt;- remDr$findElement(<span class="hljs-string">"css"</span>, <span class="hljs-string">"body"</span>)
webElem$sendKeysToElement(list(key = <span class="hljs-string">"end"</span>))

<span class="hljs-comment"># create an empty dataframe to hold the scraped data</span>
Books &lt;-  data.frame()

scrape_books &lt;- <span class="hljs-keyword">function</span>() {
    page &lt;- remDr$getPageSource()[[<span class="hljs-number">1</span>]]

    <span class="hljs-comment"># define the path from which other details will be extracted</span>
    book &lt;- read_html(page)  %&gt;% html_element(<span class="hljs-string">"ol"</span>)  %&gt;% html_elements(<span class="hljs-string">"li"</span>) %&gt;% html_element(<span class="hljs-string">"article"</span>)

    <span class="hljs-comment"># extracting details using css locators.</span>
    <span class="hljs-comment"># title</span>
    title &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"title"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># rating</span>
    rating &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"p"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"class"</span>) %&gt;% 
      as.character() 

    cleaned_rating &lt;- str_trim(gsub(<span class="hljs-string">"star-rating"</span>, <span class="hljs-string">""</span>, rating))

    <span class="hljs-comment"># price</span>
    price &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".product_price p"</span>) %&gt;% 
      html_text2() %&gt;% 
      as.character() 

    <span class="hljs-comment">#link to book page</span>
    book_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">"h3 a"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"href"</span>) %&gt;% 
      as.character() 

    <span class="hljs-comment"># image link</span>
    cover_page_link &lt;- book %&gt;% 
      html_element(<span class="hljs-string">".image_container a img"</span>) %&gt;% 
      html_attr(<span class="hljs-string">"src"</span>) %&gt;% 
      as.character() 

    <span class="hljs-keyword">return</span>(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = <span class="hljs-literal">FALSE</span>))
}

<span class="hljs-comment"># the loop below scrapes every page, starting with the first,</span>
<span class="hljs-comment"># so no page is skipped or scraped twice</span>

<span class="hljs-keyword">while</span> (<span class="hljs-literal">TRUE</span>) {
  <span class="hljs-comment"># scrape current page</span>
  Books &lt;- rbind(Books, scrape_books())

  <span class="hljs-comment"># find and click "next" button</span>
  nextPage &lt;- <span class="hljs-keyword">tryCatch</span>({
    remDr$findElement(using = <span class="hljs-string">"css selector"</span>, value = <span class="hljs-string">".next &gt; a"</span>)
  }, error = <span class="hljs-keyword">function</span>(e) {
    <span class="hljs-keyword">return</span>(<span class="hljs-literal">NULL</span>)  <span class="hljs-comment"># No more pages</span>
  })

  <span class="hljs-comment"># exit loop if "next" button is missing</span>
  <span class="hljs-keyword">if</span> (is.null(nextPage)) {
    message(<span class="hljs-string">"No 'Next' button found. Exiting loop."</span>)
    <span class="hljs-keyword">break</span>
  }

  nextPage$clickElement()
  <span class="hljs-comment"># Allow next page to load</span>
  Sys.sleep(<span class="hljs-number">3</span>)  

}

write.csv(Books, file = <span class="hljs-string">"./books.csv"</span>, fileEncoding = <span class="hljs-string">"UTF-8"</span>)
print(<span class="hljs-string">"finished scraping"</span>)
remDr$close()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740129915080/2ee1344b-58a8-477b-a568-719ba4336c95.png" alt="2ee1344b-58a8-477b-a568-719ba4336c95" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to effectively combine RSelenium and RVest to scrape data from a website. By leveraging RSelenium, you can automate user interactions and navigate through web pages, while RVest allows you to extract specific data from HTML elements.</p>
<p>This approach provides a powerful and flexible method for web scraping, enabling you to handle dynamic content and mimic human behavior. By following the steps outlined here, you can successfully scrape data from multiple pages and save it to a CSV file for further analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Model an Epidemic with R ]]>
                </title>
                <description>
                    <![CDATA[ By Peter Gleeson Epidemiology has never been more topical. It is the scientific study of how health and disease affects populations, including infectious diseases such as COVID-19. Key to understanding the spread of such diseases is the practice of e... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-model-an-epidemic-with-r/</link>
                <guid isPermaLink="false">66d460a4ffe6b1f641b5fa63</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 30 Mar 2021 14:46:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/10/PIXNIO-39014-1200x877.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Peter Gleeson</p>
<p>Epidemiology has never been more topical. It is the scientific study of how health and disease affect populations, including infectious diseases such as COVID-19.</p>
<p>Key to understanding the spread of such diseases is the practice of epidemic modeling. This involves building quantitative models to describe and forecast the spread of disease.</p>
<p>The classical approach to epidemic modeling is to use a type of mathematical model known as a "compartmental model".</p>
<p>The approach is as follows:</p>
<ol>
<li>Assign each individual in the population to one of several compartments, based on their infection status.</li>
<li>Then, define the rates at which individuals move between compartments as their status updates.</li>
<li>Use this model to define differential equations that can predict the course of the epidemic.</li>
</ol>
<p>The SI model is the most basic form of compartmental model. It has two compartments: "susceptible" and "infectious".</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-30-at-01.21.00.png" alt="Two compartments, one labelled S, the other I. An arrow flows from S into I." width="600" height="400" loading="lazy"></p>
<p>The SIR model adds an extra compartment called "recovered". This model is often used as a baseline in epidemiology. It is a simplistic model that nevertheless characterises the progression of an epidemic reasonably well.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-30-at-01.23.52.png" alt="Three compartments, labelled S, I and R. Arrows flow from S to I and from I to R." width="600" height="400" loading="lazy"></p>
<p>An extension to the SIR model (and the one we will consider in more detail in this article) is the SEIR model. This adds one more compartment – "exposed".</p>
<h2 id="heading-what-is-the-seir-model">What is the SEIR model?</h2>
<p>The basic SEIR model has four compartments:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-30-at-01.29.52.png" alt="Four compartments. S flows into E, E flows into I, I flows into R. The three arrows are labelled beta, sigma and gamma respectively." width="600" height="400" loading="lazy"></p>
<ul>
<li>"Susceptible" – individuals who have not been exposed to the virus</li>
<li>"Exposed" – individuals exposed to the virus, but not yet infectious</li>
<li>"Infectious" – exposed individuals who go on to become infectious</li>
<li>"Recovered" – infectious individuals who recover and become immune to the virus</li>
</ul>
<p>The population size N is taken as the sum of the individuals in the four compartments.</p>
<p>The flow of individuals between compartments is characterised by a number of parameters.</p>
<p><strong>β - "beta"</strong></p>
<p>β is the transmission coefficient. Think of this as the average number of infectious contacts an infectious individual in the population makes each time period. A high value of β means the virus has more opportunity to spread.</p>
<p><strong>σ - "sigma"</strong></p>
<p>σ is the rate at which exposed individuals become infectious. Think of it as the reciprocal of the average time it takes to become infectious. That is, if an individual becomes infectious after 4 days on average, σ will be 1/4 (or 0.25).</p>
<p><strong>γ - "gamma"</strong></p>
<p>γ is the rate at which infectious individuals recover. As before, think of it as the reciprocal of the average time it takes to recover. That is, if it takes 10 days on average to recover, γ will be 1/10 (or 0.1).</p>
<p><strong>μ - "mu"</strong></p>
<p>μ is an optional parameter to describe the mortality rate of infectious individuals. The higher μ is, the more deadly the virus.</p>
<p>From these parameters, you can construct a set of differential equations. These describe the rate at which each compartment changes size.</p>
<p>Let's start with the "susceptible" compartment, S.</p>
<h3 id="heading-equation-1-susceptible">Equation (1) - Susceptible</h3>
<p>The first thing to see from the model is that there is no way S can increase over time. There are no flows back into the compartment. Equation (1) must be negative, as S can only ever decrease.</p>
<p>In what ways can an individual leave compartment S?</p>
<p>Well, they can become infected by an infectious individual in the population.</p>
<p>At any stage, the proportion of infectious individuals in the population = I/N.</p>
<p>And the proportion of susceptible individuals will be S/N.</p>
<p>Under the assumption of perfect mixing (that is, individuals are equally likely to come into contact with any other in the population), the probability of any given contact being between an infectious and susceptible individual is (I / N) * (S / N).</p>
<p>This is multiplied by the number of contacts in the population. This is found by multiplying the transmission coefficient β, by the population size N.</p>
<p>Combining that all together and simplifying gives equation (1):</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-29-at-21.42.45.png" alt="delta S equals minus beta times S times I all over N" width="600" height="400" loading="lazy"></p>
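<p>If the simplification is not obvious, you can check it numerically. The sketch below uses illustrative values for N, S, I and β (they are not taken from the model run later in this article) and confirms that multiplying the contact probability by the number of contacts gives the same result as equation (1).</p>
<pre><code class="lang-r"># Illustrative values only
N &lt;- 1000
S &lt;- 900
I &lt;- 100
beta &lt;- 0.5

# Probability that a given contact pairs an infectious and a susceptible individual
p_contact &lt;- (I / N) * (S / N)

# Number of contacts in the population per time period
contacts &lt;- beta * N

# Individuals leaving S per time period
new_exposed &lt;- p_contact * contacts

# Equation (1): the same quantity after simplifying, with a minus sign
dS &lt;- -(beta * S * I) / N

print(c(new_exposed = new_exposed, dS = dS))
</code></pre>
<p>One factor of N cancels, which is why only a single N remains in the denominator.</p>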
<h3 id="heading-equation-2-exposed">Equation (2) - Exposed</h3>
<p>Next, let's consider the "exposed" compartment, E. Individuals can flow into and out of this compartment.</p>
<p>The flow into E will be matched by the flow out of S. So the first part of the next equation will simply be the opposite of the previous term.</p>
<p>Individuals can leave E by moving into the infectious compartment. This happens at a rate determined by two variables – the rate σ and the current number of individuals in E.</p>
<p>So overall equation (2) is:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/10/Screenshot-2020-10-04-at-21.12.56-1.png" alt="delta E equals beta times S times I all over N, subtract sigma times E" width="600" height="400" loading="lazy"></p>
<h3 id="heading-equation-3-infectious">Equation (3) - Infectious</h3>
<p>The next compartment to consider is the "infectious" compartment, I.</p>
<p>There is one way into this compartment, which is from the "exposed" compartment.</p>
<p>There are two ways an individual can leave the "infectious" compartment.</p>
<p>Some will move to "recovered". This happens at a rate γ.</p>
<p>Others will not survive the infection. They can be modeled using the mortality rate μ.</p>
<p>So equation (3) looks like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-29-at-21.53.27.png" alt="delta I equals sigma times E subtract gamma times I subtract mu times I" width="600" height="400" loading="lazy"></p>
<h3 id="heading-equation-4-recovered">Equation (4) - Recovered</h3>
<p>Now let's look at the "recovered" compartment, R.</p>
<p>This time, individuals can flow into the compartment (determined by the rate γ).</p>
<p>And no individuals can flow out of the compartment (although in some models, it is assumed possible to move back into the "susceptible" compartment).</p>
<p>So the overall equation (4) looks like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-29-at-22.00.50.png" alt="delta R equals gamma times I" width="600" height="400" loading="lazy"></p>
<h3 id="heading-equation-5-mortality-optional">Equation (5) - Mortality (optional)</h3>
<p>Using similar reasoning, you could also construct equation (5) for the change in mortality. You might consider this a fifth compartment in the model.</p>
<p>If you set μ to zero, you can exclude this aspect of the model.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/03/Screenshot-2020-03-29-at-22.00.13.png" alt="delta M equals mu times I" width="600" height="400" loading="lazy"></p>
<p>So now you have the full set of differential equations (1-5).</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/10/Screenshot-2020-10-04-at-21.15.12.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>An important number in any epidemic model is known as the basic reproduction number, or R₀. This is defined as:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/10/Screenshot-2020-10-03-at-21.02.11.png" alt="R zero equals beta over gamma" width="600" height="400" loading="lazy"></p>
<p>This number estimates the number of people who will be infected by the average infectious individual.</p>
<p>Therefore, it is a crucial number:</p>
<ul>
<li>If R₀ is above 1, then an outbreak of the virus is likely to become an epidemic </li>
<li>If R₀ is below 1, then an outbreak is likely to be contained</li>
</ul>
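<p>As a quick worked example, here is R₀ computed from the definition above, using the values of β and γ that appear in the simulation later in this article.</p>
<pre><code class="lang-r">beta &lt;- 0.5   # transmission coefficient
gamma &lt;- 0.2  # recovery rate (average recovery time of 5 days)

# Basic reproduction number as defined above
R0 &lt;- beta / gamma

print(R0)
print(R0 &gt; 1)  # TRUE suggests the outbreak is likely to grow
</code></pre>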
<h3 id="heading-how-to-solve-these-equations">How to solve these equations</h3>
<p>In order to use the model to predict the course of the epidemic, it is necessary to solve the system of equations.</p>
<p>This can be done using the <a target="_blank" href="https://www.r-project.org/">R programming language</a>.</p>
<p>In particular, you can use a package called <a target="_blank" href="https://www.rdocumentation.org/packages/deSolve/versions/1.27.1">deSolve</a> to solve the differential equations with respect to a time variable.</p>
<p>In R, paste the following code:</p>
<pre><code class="lang-r"><span class="hljs-keyword">require</span>(deSolve)

SEIR &lt;- <span class="hljs-keyword">function</span>(time, current_state, params){

  with(as.list(c(current_state, params)),{
    N &lt;- S+E+I+R
    dS &lt;- -(beta*S*I)/N
    dE &lt;- (beta*S*I)/N - sigma*E
    dI &lt;- sigma*E - gamma*I - mu*I
    dR &lt;- gamma*I
    dM &lt;- mu*I

    <span class="hljs-keyword">return</span>(list(c(dS, dE, dI, dR, dM)))
  })
}
</code></pre>
<p>This code imports the deSolve package.</p>
<p>It then defines a function called <code>SEIR</code>. It takes three arguments:</p>
<ul>
<li>The current time step.</li>
<li>A list of the current states of the system (that is, the estimates for each of S, E, I, R and M at the current time step).</li>
<li>A list of parameters used in the equations (recall these are β, σ, γ and μ).</li>
</ul>
<p>Inside the function body, you define the system of differential equations as described above. These are evaluated for the given time step and are returned as a list. The order in which they are returned must match the order in which you provide the current states.</p>
<p>Now take a look at the code below:</p>
<pre><code class="lang-r">params &lt;- c(beta=<span class="hljs-number">0.5</span>, sigma=<span class="hljs-number">0.25</span>, gamma=<span class="hljs-number">0.2</span>, mu=<span class="hljs-number">0.001</span>)

initial_state &lt;- c(S=<span class="hljs-number">999999</span>, E=<span class="hljs-number">1</span>, I=<span class="hljs-number">0</span>, R=<span class="hljs-number">0</span>, M=<span class="hljs-number">0</span>)

times &lt;- <span class="hljs-number">0</span>:<span class="hljs-number">365</span>
</code></pre>
<p>This initialises the parameters and initial state (starting conditions) for the model.</p>
<p>It also generates a vector of times from zero to 365 days.</p>
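<p>Before solving anything, it can help to evaluate the equations once by hand at the starting conditions. The sketch below repeats the same parameters and initial state and computes the five derivatives directly, without deSolve. Since every individual who leaves one compartment enters another, the derivatives should sum to zero.</p>
<pre><code class="lang-r">params &lt;- c(beta = 0.5, sigma = 0.25, gamma = 0.2, mu = 0.001)
state &lt;- c(S = 999999, E = 1, I = 0, R = 0, M = 0)

# Evaluate equations (1) to (5) once at the initial state
derivs &lt;- with(as.list(c(state, params)), {
  N &lt;- S + E + I + R
  c(dS = -(beta * S * I) / N,
    dE = (beta * S * I) / N - sigma * E,
    dI = sigma * E - gamma * I - mu * I,
    dR = gamma * I,
    dM = mu * I)
})

print(derivs)
print(sum(derivs))
</code></pre>
<p>With nobody infectious yet, the only movement at time zero is the single exposed individual progressing towards the infectious compartment at rate σ.</p>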
<p>Now, create the model:</p>
<pre><code class="lang-r">model &lt;- ode(initial_state, times, SEIR, params)
</code></pre>
<p>This uses deSolve's <code>ode()</code> function to solve the equations with respect to time. </p>
<p>See <a target="_blank" href="https://www.rdocumentation.org/packages/deSolve/versions/1.27.1/topics/ode">here</a> for the documentation.</p>
<p>The arguments required are:</p>
<ul>
<li>The initial state for each of the compartments</li>
<li>The vector of times (this example solves for up to 365 days)</li>
<li>The <code>SEIR()</code> function, which defines the system of equations</li>
<li>A vector of parameters to pass to the <code>SEIR()</code> function</li>
</ul>
<p>Running:</p>
<pre><code class="lang-r">summary(model)
</code></pre>
<p>...will give the summary statistics of the model.</p>
<pre><code>               S            E            I         R         M
Min.    <span class="hljs-number">108263.6</span> <span class="hljs-number">3.616607e-07</span> <span class="hljs-number">0.000000e+00</span>      <span class="hljs-number">0.00</span>    <span class="hljs-number">0.0000</span>
<span class="hljs-number">1</span>st Qu. <span class="hljs-number">108263.7</span> <span class="hljs-number">5.957435e-03</span> <span class="hljs-number">1.414971e-02</span>  <span class="hljs-number">63894.43</span>  <span class="hljs-number">319.4721</span>
Median  <span class="hljs-number">108395.7</span> <span class="hljs-number">8.470071e+00</span> <span class="hljs-number">1.273726e+01</span> <span class="hljs-number">886814.36</span> <span class="hljs-number">4434.0718</span>
Mean    <span class="hljs-number">362798.6</span> <span class="hljs-number">9.745754e+03</span> <span class="hljs-number">1.212158e+04</span> <span class="hljs-number">612272.74</span> <span class="hljs-number">3061.3637</span>
<span class="hljs-number">3</span>rd Qu. <span class="hljs-number">852375.5</span> <span class="hljs-number">1.734331e+03</span> <span class="hljs-number">2.533956e+03</span> <span class="hljs-number">887299.83</span> <span class="hljs-number">4436.4991</span>
Max.    <span class="hljs-number">999999.0</span> <span class="hljs-number">1.092967e+05</span> <span class="hljs-number">1.265161e+05</span> <span class="hljs-number">887299.86</span> <span class="hljs-number">4436.4993</span>
N          <span class="hljs-number">366.0</span> <span class="hljs-number">3.660000e+02</span> <span class="hljs-number">3.660000e+02</span>    <span class="hljs-number">366.00</span>  <span class="hljs-number">366.0000</span>
sd      <span class="hljs-number">381257.2</span> <span class="hljs-number">2.475783e+04</span> <span class="hljs-number">2.969234e+04</span> <span class="hljs-number">387333.47</span> <span class="hljs-number">1936.6673</span>
</code></pre><p>Already, you will find some interesting insights.</p>
<ul>
<li>Out of a million individuals, 108,264 did not become infected.</li>
<li>At the peak of the epidemic, 126,516 individuals were infectious simultaneously.</li>
<li>887,300 individuals recovered by the end of the model.</li>
<li>A total of 4436 individuals died during the epidemic.</li>
</ul>
<p>You can also visualise the evolution of the pandemic using the <code>matplot()</code> function.</p>
<p>Alternatively, you could use another plotting library such as <a target="_blank" href="https://ggplot2.tidyverse.org/index.html">ggplot2</a> to produce better quality graphics.</p>
<pre><code class="lang-r">matplot(model, type=<span class="hljs-string">"l"</span>, lty=<span class="hljs-number">1</span>, main=<span class="hljs-string">"SEIR model"</span>, xlab=<span class="hljs-string">"Time"</span>)

legend &lt;- colnames(model)[<span class="hljs-number">2</span>:<span class="hljs-number">6</span>]

legend(<span class="hljs-string">"right"</span>, legend=legend, col=<span class="hljs-number">2</span>:<span class="hljs-number">6</span>, lty = <span class="hljs-number">1</span>)
</code></pre>
<p>The plot is shown below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/10/seir_model.png" alt="Chart showing curves which represent how the size of each compartment changes over time. S declines in a S-shaped curve, R and M increase in S-shape curves. I and E peak after day 100 before declining to zero" width="600" height="400" loading="lazy"></p>
<p>You can also coerce the model output to a dataframe type. Then, you can analyse the model further.</p>
<pre><code class="lang-r">infections &lt;- as.data.frame(model)$I

peak &lt;- max(infections)

match(peak, infections)
</code></pre>
<p>The code above reveals that the number of infections peaked on day 112.</p>
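<p>Keep in mind that <code>match()</code> returns a row position rather than a time. Because <code>times</code> starts at zero, row 1 corresponds to day 0, so it can be safer to look the day up in the <code>time</code> column. A small sketch, using a made-up data frame <code>toy</code> in place of the real model output:</p>
<pre><code class="lang-r"># Made-up stand-in for as.data.frame(model); the values are illustrative only
toy &lt;- data.frame(time = 0:5, I = c(0, 3, 8, 12, 7, 2))

peak_row &lt;- which.max(toy$I)     # position of the largest value
peak_day &lt;- toy$time[peak_row]   # the time recorded on that row

peak_row  # 4
peak_day  # 3
</code></pre>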
<p>Using other libraries, such as dplyr, would let you carry out analysis as advanced as you'd like.</p>
<h2 id="heading-how-to-model-intervention-methods">How to model intervention methods</h2>
<p>The SEIR model is an interesting example of how an epidemic develops without any changes in the population's behaviour.</p>
<p>You can build more sophisticated models by taking the SEIR model as a starting point and adding extra features.</p>
<p>This lets you model changes in behaviour (either voluntary or as a result of government intervention).</p>
<p>Many (but not all) countries around the world entered some form of "lockdown" during the coronavirus pandemic of 2020.</p>
<p>Ultimately, the intention of locking down is to alter the course of the epidemic by reducing the transmission coefficient, β.</p>
<p>The code below defines a model which changes the value of β between the start and end of a period of lockdown.</p>
<p><strong>All the numbers used are purely illustrative</strong>. You could make an entire research career (several times over) trying to figure out the most realistic values.</p>
<pre><code class="lang-r">SEIR_lockdown &lt;- <span class="hljs-keyword">function</span>(time, current_state, params){

    with(as.list(c(current_state, params)),{

      beta = ifelse(
        (time &lt;= start_lockdown || time &gt;= end_lockdown),
        <span class="hljs-number">0.5</span>, <span class="hljs-number">0.1</span>
        )

      N &lt;- S+E+I+R
      dS &lt;- -(beta*S*I)/N
      dE &lt;- (beta*S*I)/N - sigma*E
      dI &lt;- sigma*E - gamma*I - mu*I
      dR &lt;- gamma*I
      dM &lt;- mu*I

      <span class="hljs-keyword">return</span>(list(c(dS, dE, dI, dR, dM)))
    })
  }
</code></pre>
<p>The only change is the extra <code>ifelse()</code> statement to adjust the value of β to 0.1 during lockdown.</p>
<p>You need to pass two new parameters to the model. These are the start and end times of the lockdown period.</p>
<p>Here, the lockdown begins on day 90, and ends on day 150.</p>
<pre><code class="lang-r">params &lt;- c(
  sigma=<span class="hljs-number">0.25</span>,
  gamma=<span class="hljs-number">0.2</span>,
  mu=<span class="hljs-number">0.001</span>,
  start_lockdown=<span class="hljs-number">90</span>,
  end_lockdown=<span class="hljs-number">150</span>
)

initial_state &lt;- c(S=<span class="hljs-number">999999</span>, E=<span class="hljs-number">1</span>, I=<span class="hljs-number">0</span>, R=<span class="hljs-number">0</span>, M=<span class="hljs-number">0</span>)

times &lt;- <span class="hljs-number">0</span>:<span class="hljs-number">365</span>

model &lt;- ode(initial_state, times, SEIR_lockdown, params)
</code></pre>
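<p>To see what the <code>ifelse()</code> condition does at the boundaries, you can evaluate it on its own. The helper <code>beta_at()</code> below is a hypothetical name, not part of the model. It repeats the same condition with the lockdown running from day 90 to day 150, but uses the vectorised <code>|</code> operator so that several days can be checked at once.</p>
<pre><code class="lang-r"># Hypothetical helper that mirrors the condition inside SEIR_lockdown()
beta_at &lt;- function(time, start_lockdown = 90, end_lockdown = 150) {
  ifelse(time &lt;= start_lockdown | time &gt;= end_lockdown, 0.5, 0.1)
}

days &lt;- c(0, 90, 91, 120, 149, 150, 200)
data.frame(day = days, beta = beta_at(days))
</code></pre>
<p>Days 90 and 150 themselves still use β = 0.5, because both comparisons are inclusive. Only the days strictly between them use 0.1.</p>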
<p>Now you can view the summary and graphs associated with this model.</p>
<pre><code class="lang-r">summary(model)
</code></pre>
<p>This will reveal:</p>
<pre><code>               S            E           I         R         M
Min.    <span class="hljs-number">156885.7</span> <span class="hljs-number">7.699207e-01</span>     <span class="hljs-number">0.00000</span>      <span class="hljs-number">0.00</span>    <span class="hljs-number">0.0000</span>
<span class="hljs-number">1</span>st Qu. <span class="hljs-number">160478.2</span> <span class="hljs-number">6.929205e+01</span>    <span class="hljs-number">97.71405</span>  <span class="hljs-number">63668.75</span>  <span class="hljs-number">318.3438</span>
Median  <span class="hljs-number">789214.4</span> <span class="hljs-number">1.246389e+03</span>  <span class="hljs-number">1735.66330</span> <span class="hljs-number">194379.16</span>  <span class="hljs-number">971.8958</span>
Mean    <span class="hljs-number">589558.9</span> <span class="hljs-number">9.216918e+03</span> <span class="hljs-number">11460.62036</span> <span class="hljs-number">387824.44</span> <span class="hljs-number">1939.1222</span>
<span class="hljs-number">3</span>rd Qu. <span class="hljs-number">867639.6</span> <span class="hljs-number">1.030043e+04</span> <span class="hljs-number">13780.17591</span> <span class="hljs-number">829898.56</span> <span class="hljs-number">4149.4928</span>
Max.    <span class="hljs-number">999999.0</span> <span class="hljs-number">6.083432e+04</span> <span class="hljs-number">72443.97892</span> <span class="hljs-number">838916.89</span> <span class="hljs-number">4194.5845</span>
N          <span class="hljs-number">366.0</span> <span class="hljs-number">3.660000e+02</span>   <span class="hljs-number">366.00000</span>    <span class="hljs-number">366.00</span>  <span class="hljs-number">366.0000</span>
sd      <span class="hljs-number">350719.3</span> <span class="hljs-number">1.570278e+04</span> <span class="hljs-number">18893.31145</span> <span class="hljs-number">346542.57</span> <span class="hljs-number">1732.7128</span>
</code></pre><p>You can see:</p>
<ul>
<li>Out of a million individuals, 156,886 did not become infected.</li>
<li>At the peak of the epidemic, 72,444 individuals were infectious simultaneously.</li>
<li>838,917 individuals recovered by the end of the model.</li>
<li>A total of 4195 individuals died during the epidemic.</li>
</ul>
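<p>Putting the two sets of headline figures side by side makes the effect of the intervention easier to read. The sketch below copies the rounded values from the two summaries above. The vector names are illustrative.</p>
<pre><code class="lang-r"># Rounded headline figures taken from the two model summaries
baseline &lt;- c(never_infected = 108264, peak_infectious = 126516,
              recovered = 887300, deaths = 4436)
lockdown &lt;- c(never_infected = 156886, peak_infectious = 72444,
              recovered = 838917, deaths = 4195)

# Absolute change under the lockdown scenario
lockdown - baseline

# Relative change in the peak number of infectious individuals, in percent
round(100 * (lockdown["peak_infectious"] / baseline["peak_infectious"] - 1), 1)
</code></pre>
<p>With these illustrative numbers, the lockdown scenario lowers the peak by roughly 43% and results in 241 fewer deaths.</p>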
<p>Plotting the model using <code>matplot()</code> reveals a strong "second wave" effect (as was seen across many countries in Europe towards the end of 2020).</p>
<pre><code class="lang-r">  matplot(
    model, 
    type=<span class="hljs-string">"l"</span>,
    lty=<span class="hljs-number">1</span>, 
    main=<span class="hljs-string">"SEIR model (with intervention)"</span>, 
    xlab=<span class="hljs-string">"Time"</span>
    )

legend &lt;- colnames(model)[<span class="hljs-number">2</span>:<span class="hljs-number">6</span>]

legend(<span class="hljs-string">"right"</span>, legend=legend, col=<span class="hljs-number">2</span>:<span class="hljs-number">6</span>, lty = <span class="hljs-number">1</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/10/seir_intervention.png" alt="Chart showing curves which represent how the size of each compartment changes over time. S declines rapidly, before levelling out during the lockdown, and then declines rapidly again. R and M increase rapidly before levelling out, then increase rapidly again. I and E show a small peak before day 100, then decline, before peaking again after day 200" width="600" height="400" loading="lazy"></p>
<p>Finally, you can coerce the model to a dataframe and carry out more detailed analysis from there.</p>
<pre><code class="lang-r">infections &lt;- as.data.frame(model)$I

peak &lt;- max(infections)

match(peak, infections)
</code></pre>
<p>In this scenario, the number of infections peaked on day 223.</p>
<p>In other scenarios, you could model the effect of vaccination. Or, you could build in seasonal differences in the transmission rate.</p>
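<p>As a rough sketch of the seasonal idea, β could be made a smooth function of time instead of a constant. The function name <code>beta_seasonal()</code>, the cosine form and all of the default values below are assumptions chosen for illustration, and are not fitted to any data.</p>
<pre><code class="lang-r"># Hypothetical seasonal transmission coefficient:
# oscillates around beta0, highest at time 0 and again one period later
beta_seasonal &lt;- function(time, beta0 = 0.5, amplitude = 0.2, period = 365) {
  beta0 * (1 + amplitude * cos(2 * pi * time / period))
}

beta_seasonal(c(0, 91.25, 182.5))  # peak, midpoint and trough of the cycle
</code></pre>
<p>Inside the model function, you could then compute <code>beta &lt;- beta_seasonal(time)</code> in place of the <code>ifelse()</code> call.</p>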
<h2 id="heading-limitations-of-compartmental-models">Limitations of compartmental models</h2>
<p>As with all modeling, an epidemic model is only as good as the data and assumptions that go into it.</p>
<p>And some of the assumptions behind the SEIR model as described are unrealistic.</p>
<p>For example:</p>
<ul>
<li>In large populations, mixing is non-uniform. Individuals are much more likely to interact with individuals in their locality. More advanced compartmental models will account for this.</li>
<li>The model assumes the population is isolated. In reality, mixing between populations allows a virus to be introduced and reintroduced multiple times.</li>
<li>Individuals are usually not born with immunity. More sophisticated models will factor in the birth rate when considering longer periods of time.</li>
<li>The basic SEIR model does not account for age structures in the population. Often, a virus will spread faster among younger populations in densely populated cities. But it might prove more deadly to older populations outside those cities. More complex models will take these differences into consideration.</li>
<li>The SEIR model considers only averages for each of its parameters. In reality, there will be a lot of variation. Some individuals remain infectious for a long time. A small number of individuals might make a very large number of contacts. Therefore, the model is suitable for describing the epidemic at a high level, over a long period of time. But it is not suitable for predicting details on a smaller scale.</li>
</ul>
<p>Despite its limitations, the SEIR model is a solid starting point for understanding the dynamics of an epidemic.</p>
<p>More generally, the approach of using differential equations to represent flows between compartments to model complex processes is very powerful.</p>
<p>And the availability of software packages for languages such as R and Python makes it easier than ever to get started exploring these techniques.</p>
<p>You can dig into the code used for the examples <a target="_blank" href="https://github.com/pg0408/seir">here</a>.</p>
<p>Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Choose the Best Programming Language for your Data Science Project ]]>
                </title>
                <description>
                    <![CDATA[ The battle between programming languages has always been a hot topic in the tech world. And given how fast technology is advancing, we have a new programming language or framework every few months. This makes it ever harder for developers, analysts, ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-choose-the-best-programming-language-for-your-data-science-project/</link>
                <guid isPermaLink="false">66d45f4747a8245f78752a6e</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Harshit Tyagi ]]>
                </dc:creator>
                <pubDate>Wed, 01 Jul 2020 20:54:18 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/06/python_r-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The battle between programming languages has always been a hot topic in the tech world. And given how fast technology is advancing, we have a new programming language or framework every few months.</p>
<p>This makes it ever harder for developers, analysts, and researchers to choose the best language that will get their tasks done efficiently while incurring the lowest cost.</p>
<p>But I think that we tend to look at the wrong reasons for choosing a language. There are a bunch of factors that lead to the choice of a certain language. And with Data Science projects flooding the market, the question is NOT “which is the best language” but "which one suits your project requirements and environment (work setting)?"</p>
<p>So, with this post, I will present you with the right set of questions you should be asking in order to decide which is the best programming language for your data science project.</p>
<h2 id="heading-most-commonly-used-programming-languages-for-data-science">Most commonly used programming languages for Data Science</h2>
<p><strong>Python and R</strong> are the most widely used languages for statistical analysis or machine learning-centric projects. But there are others, such as Java, Scala, or MATLAB.</p>
<p>Both Python and R are state-of-the-art open-source programming languages with great community support. And we keep learning about new libraries and tools that allow us to achieve greater levels of performance and complexity.</p>
<h3 id="heading-python">Python</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/07/python.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Python is well-known for its readable, easy-to-learn syntax. With a general-purpose (jack of all trades) language like Python, you can build complete scientific ecosystems without worrying much about compatibility or interfacing issues.</p>
<p>Python code has low maintenance costs and is arguably more robust. From data wrangling to feature selection, web scraping, and deployment of our machine learning models, Python can get almost everything done with integration support from all the major ML and deep learning APIs like Theano, TensorFlow, and PyTorch.</p>
<h3 id="heading-r">R</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/07/R.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>R was developed by academicians and statisticians over two decades ago. Today, R enables many statisticians, analysts, and developers to carry out their analysis effectively. There are over 12,000 packages available on CRAN (an open-source repository).</p>
<p>Since it was developed with statisticians in mind, R is often the first choice for core scientific and statistical analysis. There is an R package for almost every kind of analysis.</p>
<p>Also, data analysis has been made very easy with tools like <a target="_blank" href="https://rstudio.com/">RStudio</a> that allow you to communicate your results with concise and elegant reports.</p>
<h2 id="heading-4-questions-to-help-you-choose-the-best-suited-language-for-your-project">4 Questions to help you choose the BEST suited language for your project</h2>
<p>So, how do you make the right choice for your work at hand?</p>
<p>Try answering these 4 questions:</p>
<h3 id="heading-1-which-languageframework-is-preferred-in-your-organisationindustry">1. Which language/framework is preferred in your organisation/industry?</h3>
<p>Look at the industry you are working in and the most commonly used language by your peers and competitors. It might be easier if you speak the same language.</p>
<p>Here is <a target="_blank" href="https://stackoverflow.blog/2017/10/10/impressive-growth-r/">an analysis</a> carried out by <a target="_blank" href="https://stackoverflow.blog/author/david-robinson/">David Robinson</a>, a data scientist. It’s a reflection of the popularity of R in each industry, and you can see that R is heavily used in Academia and Healthcare.</p>
<p>So, if you’re someone who wants to go into research, academia, or bioinformatics, you might consider R over Python.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/07/st2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Source: <a target="_blank" href="https://stackoverflow.blog/2017/10/10/impressive-growth-r/">https://stackoverflow.blog/2017/10/10/impressive-growth-r/</a></em></p>
<p>The other side of this coin involves software industries, application-driven organizations, and product-based companies. You might have to use the tech stack of your organization’s infrastructure or the language that your colleagues/teams are using.</p>
<p>And most of these organizations and industries, academia included, base their infrastructure on Python:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/07/st1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Source: <a target="_blank" href="https://stackoverflow.blog/2017/09/14/python-growing-quickly/">https://stackoverflow.blog/2017/09/14/python-growing-quickly/</a></em></p>
<p><strong>As an aspiring data scientist,</strong> therefore, you should focus on learning the language and tech that have the most applications and that can increase your chances of getting a job.</p>
<h3 id="heading-2-what-is-the-scope-of-your-project">2. What is the scope of your project?</h3>
<p>This is an important question, because before you pick up a language, you must have an agenda for your project.</p>
<p>For example, what if you simply want to solve a statistical problem using a dataset, perform some multivariate analyses, and prepare a report or a dashboard explaining the insights? In this case R might be a better choice. It has some really powerful visualization and communication libraries.</p>
<p>On the other hand, what if your aim is to first carry out exploratory analysis, develop a deep learning model, and then deploy the model within a web application? Then Python’s web frameworks and support from all the major cloud providers make it a clear winner.</p>
<h3 id="heading-3-how-experienced-are-you-in-the-field-of-data-science">3. How experienced are you in the field of data science?</h3>
<p>For a beginner in data science who has limited familiarity with <a target="_blank" href="https://towardsdatascience.com/practical-reasons-to-learn-mathematics-for-data-science-1f6caec161ea?source=---------6------------------">statistics and mathematical concepts,</a> <strong>Python</strong> might be a better choice because it lets you code the fragments of an algorithm with ease.</p>
<p>With libraries like <a target="_blank" href="https://towardsdatascience.com/numpy-essentials-for-data-science-25dc39fae39?source=---------7------------------">NumPy</a>, you can manipulate matrices and code algorithms yourself. As a novice, it is always better to learn to build things from scratch rather than hopping onto using machine learning libraries.</p>
<p>But if you already know the fundamentals of machine learning algorithms, you can pick up either of the languages and get started with them.</p>
<h3 id="heading-4-how-much-time-do-you-have-on-hand-and-whats-the-cost-of-learning">4. How much time do you have on hand, and what's the cost of learning?</h3>
<p>The amount of time you can invest makes another case for your choice. Depending on your experience with programming and the delivery time of your project, you might choose one language over another to get started in the field.</p>
<p>If there is a high-priority project and you don’t know either of the languages, R might be an easier option for you to get started as you need limited/no experience with programming. You can write statistical models with a few lines of code using existing libraries.</p>
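<p>For instance, fitting a simple linear regression in R takes a single call. This sketch uses the <code>mtcars</code> dataset that ships with R. The choice of variables is arbitrary and only meant to illustrate the point.</p>
<pre><code class="lang-r"># Fit fuel efficiency (mpg) as a linear function of weight (wt)
fit &lt;- lm(mpg ~ wt, data = mtcars)

coef(fit)     # intercept and slope
summary(fit)  # standard errors, R-squared and p-values
</code></pre>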
<p>Python (often the programmer’s choice) is a great option to start off with if you have some bandwidth to explore the libraries and learn about methods of exploring datasets. (In the case of R, this can be done quickly within RStudio.)</p>
<p>Another important factor is that there are more Python mentors than R mentors. If you need help with your Python or R project, you can look for a <a target="_blank" href="https://www.codementor.io/?partner=harshittyagi">Coding Mentor here</a>. Using this link will also get you $10 credit on sign up, to be used for the first mentor meeting.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In a nutshell, the gap between the capabilities of R and Python is getting narrower. Most jobs can be done by both languages. And both have rich ecosystems to support you.</p>
<p>Choosing a language for your project will then depend on:</p>
<ul>
<li><p>Your prior experience with Data Science <a target="_blank" href="https://towardsdatascience.com/practical-reasons-to-learn-mathematics-for-data-science-1f6caec161ea?source=---------6------------------">(stats and math)</a> and programming.</p>
</li>
<li><p>The domain of the project at hand and the extent of statistical or scientific processing required.</p>
</li>
<li><p>The future scope of your project.</p>
</li>
<li><p>The language/framework that is most widely supported in your teams, organisation, and industry.</p>
</li>
</ul>
<p>You can check out the video version of this blog here:</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/beX64BUmKpQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<h3 id="heading-data-science-with-harshit">Data Science with Harshit</h3>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/-pVOoKrBtL8" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>With this channel, I am planning to roll out a couple of <a target="_blank" href="https://towardsdatascience.com/hitchhikers-guide-to-learning-data-science-2cc3d963b1a2?source=---------8------------------">series covering the entire data science space</a>. Here is why you should be subscribing to the <a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">channel</a>:</p>
<ul>
<li><p>The series will cover quality tutorials on each of the required topics and subtopics, such as <a target="_blank" href="https://towardsdatascience.com/python-fundamentals-for-data-science-6c7f9901e1c8?source=---------5------------------">Python fundamentals for Data Science</a>.</p>
</li>
<li><p>Explanations of the mathematics and derivations behind what we do in ML and Deep Learning.</p>
</li>
<li><p>Podcasts with Data Scientists and Engineers at Google, Microsoft, Amazon, etc, and CEOs of big data-driven companies.</p>
</li>
<li><p><a target="_blank" href="https://towardsdatascience.com/building-covid-19-analysis-dashboard-using-python-and-voila-ee091f65dcbb?source=---------2------------------">Projects and instructions</a> to implement the topics learned so far.</p>
</li>
</ul>
<p>If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ R Programming Language Explained ]]>
                </title>
                <description>
                    <![CDATA[ R is an open source programming language and software environment for statistical computing and graphics. It is one of the primary languages used by data scientists and statisticians alike. It is supported by the R Foundation for Statistical Computin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/r-programming-language-explained/</link>
                <guid isPermaLink="false">66c35d3d39357f9446976603</guid>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ toothbrush ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sat, 01 Feb 2020 00:00:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9c9d09740569d1a4ca358a.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>R is an open source programming language and software environment for statistical computing and graphics. It is one of the primary languages used by data scientists and statisticians alike. It is supported by the R Foundation for Statistical Computing and a large community of open source developers. Since R uses a command line interface, there can be a steep learning curve for people who are used to GUI-focused programs such as SPSS and SAS, so extensions to R such as RStudio can be highly beneficial. Since R is open source and freely available, it can be especially attractive to academics whose access to statistical programs is regulated through their association with various colleges or universities.</p>
<h2 id="heading-installation"><strong>Installation</strong></h2>
<p>The first thing you need to get started with R is to download it from its <a target="_blank" href="https://www.r-project.org/">official site</a> according to your operating system.</p>
<h2 id="heading-popular-r-tools-and-packages"><strong>Popular R Tools and Packages</strong></h2>
<ul>
<li><a target="_blank" href="https://www.rstudio.com/products/rstudio/">RStudio</a> is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.</li>
<li><a target="_blank" href="https://cran.r-project.org/">The Comprehensive R Archive Network (CRAN)</a> is a leading source for R tools and resources.</li>
<li><a target="_blank" href="https://www.tidyverse.org/">Tidyverse</a> is an opinionated collection of R packages designed for data science, including ggplot2, dplyr, readr, tidyr, purrr, and tibble.</li>
<li><a target="_blank" href="https://github.com/Rdatatable/data.table/wiki">data.table</a> is an extension of base <code>data.frame</code> focused on improved performance and terse, flexible syntax.</li>
<li><a target="_blank" href="https://shiny.rstudio.com/">Shiny</a> is a framework for building dashboard-style web apps in R.</li>
</ul>
<h2 id="heading-data-types-in-r">Data Types in R</h2>
<h3 id="heading-vector">Vector</h3>
<p>It is a sequence of data elements of the same basic type. For example:</p>
<pre><code class="lang-text">&gt; o &lt;- c(1,2,5.3,6,-2,4)                                  # Numeric vector
&gt; p &lt;- c("one","two","three","four","five","six")         # Character vector
&gt; q &lt;- c(TRUE,TRUE,FALSE,TRUE,FALSE,TRUE)                # Logical vector
&gt; o;p;q
[1]  1.0  2.0  5.3  6.0 -2.0  4.0
[1] "one"   "two"   "three" "four"  "five"  "six"
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
</code></pre>
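<p>Because every element must share one type, R silently converts mixed input to the most general type. A short sketch of that coercion:</p>
<pre><code class="lang-r">mixed &lt;- c(1, "two", TRUE)
class(mixed)   # "character"
mixed          # "1"    "two"  "TRUE"

class(c(1, TRUE))  # "numeric"; TRUE is converted to 1
</code></pre>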
<h3 id="heading-matrix">Matrix</h3>
<p>It is a two-dimensional rectangular data set. As with a vector, all the components of a matrix must be of the same basic type. For example:</p>
<pre><code class="lang-text">&gt; m = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
&gt; m
     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a"
</code></pre>
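<p>Once a matrix exists, you can query its shape and pick out elements by row and column. A brief sketch that rebuilds the matrix <code>m</code> from above:</p>
<pre><code class="lang-r">m &lt;- matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)

dim(m)     # 2 3
m[2, 3]    # "a"  (row 2, column 3)
m[, 1]     # "a" "c"  (the whole first column)
t(m)       # transpose: a 3 x 2 matrix
</code></pre>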
<h3 id="heading-data-frame">Data Frame</h3>
<p>It is more general than a matrix, in that different columns can have different basic data types. For example:</p>
<pre><code class="lang-text">&gt; d &lt;- c(1,2,3,4)
&gt; e &lt;- c("red", "white", "red", NA)
&gt; f &lt;- c(TRUE,TRUE,TRUE,FALSE)
&gt; mydata &lt;- data.frame(d,e,f)
&gt; names(mydata) &lt;- c("ID","Color","Passed")
&gt; mydata
  ID Color Passed
1  1   red   TRUE
2  2 white   TRUE
3  3   red   TRUE
4  4  &lt;NA&gt;  FALSE
</code></pre>
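<p>To confirm that each column keeps its own type, and to select rows or columns, something like the following works on the <code>mydata</code> frame built above (repeated here so the sketch runs on its own):</p>
<pre><code class="lang-r">mydata &lt;- data.frame(
  ID = c(1, 2, 3, 4),
  Color = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE)
)

str(mydata)                    # one line per column, with its type
mydata$Color                   # a single column, as a vector
mydata[mydata$Passed, ]        # only the rows where Passed is TRUE
nrow(mydata[mydata$Passed, ])  # 3
</code></pre>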
<h3 id="heading-lists">Lists</h3>
<p>A list is an R object that can contain elements of many different types, such as vectors, functions, and even other lists. For example:</p>
<pre><code class="lang-text">&gt; list1 &lt;- list(c(2,5,3),21.3,sin)
&gt; list1
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)  .Primitive("sin")
</code></pre>
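<p>Elements are extracted with double square brackets, while single brackets return a shorter list. A small sketch follows. The element names in <code>list2</code> are made up for illustration:</p>
<pre><code class="lang-r">list1 &lt;- list(c(2, 5, 3), 21.3, sin)

list1[[1]]        # the numeric vector 2 5 3
list1[[3]](0)     # calls the stored function: sin(0) is 0
class(list1[2])   # "list"; single brackets keep the list wrapper

# Naming the elements lets you use $ instead of positions
list2 &lt;- list(values = c(2, 5, 3), scale = 21.3)
list2$scale       # 21.3
</code></pre>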
<h2 id="heading-functions-in-r">Functions in R</h2>
<p>A function allows you to define a reusable block of code that can be executed many times within your program.</p>
<p>Functions can be named and called repeatedly or can be run anonymously in place (similar to lambda functions in Python).</p>
<p>Developing a full understanding of R functions requires an understanding of environments. Environments are simply a way to manage objects. For example, you can reuse a variable name inside a function without affecting a variable of the same name in the wider runtime. Additionally, if a function refers to a variable that is not defined within the function, R will look for that variable in the enclosing, higher-level environment.</p>
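<p>Both behaviours can be seen in a few lines. The variable and function names here are arbitrary:</p>
<pre><code class="lang-r">x &lt;- 10          # defined in the global environment
y &lt;- 5

shadow &lt;- function() {
  x &lt;- 1         # a separate x that exists only inside the function
  x + y          # y is not defined here, so R looks it up outside
}

shadow()  # 6
x         # still 10; the global x was not modified
</code></pre>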
<h3 id="heading-syntax"><strong>Syntax</strong></h3>
<p>In R, a function definition has the following features:</p>
<ol>
<li>The keyword <code>function</code></li>
<li>A function name</li>
<li>Input parameters (optional)</li>
<li>Some block of code to execute</li>
<li>A return statement (optional)</li>
</ol>
<pre><code class="lang-text"># a function with no parameters
sayHello = function(){
  "Hello!"
}

sayHello()  # calls the function, 'Hello!' is printed to the console

# a function with a parameter
helloWithName = function(name){
  paste0("Hello, ", name, "!")
}

helloWithName("Ada")  # calls the function, 'Hello, Ada!' is printed to the console

# a function with multiple parameters and an implicit return value
multiply = function(val1, val2){
  val1 * val2
}

multiply(3, 5)  # prints 15 to the console
</code></pre>
<p>Functions are blocks of code that can be reused simply by calling the function. This enables simple, elegant code reuse without explicitly re-writing sections of code. It makes code more readable, makes debugging easier, and limits typing errors.</p>
<p>Functions in R are created using the <code>function</code> keyword, along with a function name and function parameters inside parentheses.</p>
<p>The <code>return()</code> function can be used inside a function to return a value, and is typically used to force early termination of a function with a returned value. Without an explicit <code>return()</code>, the function returns the value of the last expression it evaluates.</p>
<pre><code class="lang-text"># return a value explicitly, or implicitly as the last evaluated expression
sum = function(a, b){
  c = a + b
  return(c)
}

sum = function(a, b){
  a + b
}


result = sum(1, 2)
# result = 3
</code></pre>
<p>You can also define default values for the parameters, which R will use when an argument is not supplied in the function call.</p>
<pre><code class="lang-text">sum = function(a, b = 3){
  a + b
}

result = sum(a = 1)
# result = 4
</code></pre>
<p>You can also pass the parameters in the order you want, using the name of the parameter.</p>
<pre><code class="lang-text">result = sum(b=2, a=2)
# result = 4
</code></pre>
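<p>R can also accept additional, optional parameters with <code>...</code></p>
<pre><code class="lang-text">sum = function(a, b, ...){
  # '...' cannot be used directly in arithmetic, so pass it on to another function
  base::sum(a, b, ...)
}

sum(1, 2, 3) # returns 6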
</code></pre>
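<p>Inside a function, the extra arguments captured by <code>...</code> can also be collected with <code>list(...)</code>. The sketch below uses a hypothetical helper, <code>count_extras</code>, that simply reports how many additional arguments were passed:</p>
<pre><code class="lang-r">count_extras &lt;- function(a, b, ...) {
  extras &lt;- list(...)    # gather the additional arguments into a list
  length(extras)         # number of extra arguments supplied
}

count_extras(1, 2)         # 0
count_extras(1, 2, 3, 4)   # 2
</code></pre>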
<p>Functions can also be run anonymously. These are very useful in combination with the ‘apply’ family of functions.</p>
<pre><code class="lang-text"># loop through 1, 2, 3 - add 1 to each
sapply(1:3,
       function(i){
         i + 1
         })
</code></pre>
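<p>To make the comparison with a named function concrete, here is a small sketch (the name <code>add_one</code> is illustrative only) showing that both forms give the same result:</p>
<pre><code class="lang-r">add_one &lt;- function(i) i + 1

named &lt;- sapply(1:3, add_one)                 # 2 3 4
anonymous &lt;- sapply(1:3, function(i) i + 1)   # 2 3 4

identical(named, anonymous)   # TRUE
</code></pre>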
<h3 id="heading-notes"><strong>Notes</strong></h3>
<p>If a function definition includes arguments without default values, values for those arguments must be supplied when the function is called.</p>
<pre><code class="lang-text">sum = function(a, b = 3){
a + b
}

sum(b = 2) # Error in sum(b = 2) : argument "a" is missing, with no default
</code></pre>
<p>Variables defined within a function only exist within the scope of that function. If a function uses a variable that it does not define itself, R looks for it in the enclosing environment.</p>
<pre><code class="lang-text">double = function(a){
a * 2
}

double(x)  # Error in double(x) : object 'x' not found


double = function(){
a * 2
}

a = 3
double() # 6
</code></pre>
<h3 id="heading-in-built-functions-in-r">In-built functions in R</h3>
<ul>
<li>R comes with many functions that you can use to do sophisticated tasks like random sampling.</li>
<li>For example, you can round a number with <code>round()</code>, or calculate its factorial with <code>factorial()</code>.</li>
</ul>
<pre><code class="lang-r">&gt; round(<span class="hljs-number">4.147</span>)
[<span class="hljs-number">1</span>] <span class="hljs-number">4</span>
&gt; factorial(<span class="hljs-number">3</span>)
[<span class="hljs-number">1</span>] <span class="hljs-number">6</span>
&gt; round(mean(<span class="hljs-number">1</span>:<span class="hljs-number">6</span>))
[<span class="hljs-number">1</span>] <span class="hljs-number">4</span>
</code></pre>
<ul>
<li>The data that you pass into the function is called the function’s argument.</li>
<li>You can simulate a roll of a die with R’s <code>sample()</code> function. The <code>sample()</code> function takes two arguments: a vector named <code>x</code> and a number named <code>size</code>. For example:</li>
</ul>
<pre><code class="lang-r">&gt; sample(x = <span class="hljs-number">1</span>:<span class="hljs-number">4</span>, size = <span class="hljs-number">2</span>)
[] <span class="hljs-number">4</span> <span class="hljs-number">2</span>
&gt; sample(x = die, size = <span class="hljs-number">1</span>)
[] <span class="hljs-number">3</span>
&gt;dice &lt;- sample(die, size = <span class="hljs-number">2</span>, replace = <span class="hljs-literal">TRUE</span>)
&gt;dice
[<span class="hljs-number">1</span>] <span class="hljs-number">2</span> <span class="hljs-number">4</span>
&gt;sum(dice)
[<span class="hljs-number">1</span>] <span class="hljs-number">6</span>
</code></pre>
<ul>
<li>If you’re not sure which names to use with a function, you can look up the function’s arguments with <code>args()</code>.</li>
</ul>
<pre><code class="lang-r">&gt; args(round)
<span class="hljs-keyword">function</span> (x, digits = <span class="hljs-number">0</span>)
<span class="hljs-literal">NULL</span>
</code></pre>
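<p>Once you know an argument’s name, you can set it explicitly. As a short sketch, the calls below use the <code>digits</code> argument of <code>round()</code> and list the argument names of <code>sample()</code> with <code>formals()</code>:</p>
<pre><code class="lang-r">rounded_default &lt;- round(3.14159)               # digits defaults to 0, giving 3
rounded_two &lt;- round(3.14159, digits = 2)       # 3.14

sample_args &lt;- names(formals(sample))           # "x" "size" "replace" "prob"
sample_args
</code></pre>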
<h2 id="heading-objects-in-r"><strong>Objects in R</strong></h2>
<p>R lets you save data by storing it inside an R object.</p>
<h3 id="heading-whats-an-object">What’s an object?</h3>
<p>An object is just a name that you can use to call up stored data. For example, you can save data into an object like <code>a</code> or <code>b</code>.</p>
<pre><code class="lang-r">&gt; a &lt;- <span class="hljs-number">5</span>
&gt; a
[<span class="hljs-number">1</span>] <span class="hljs-number">5</span>
</code></pre>
<h3 id="heading-how-to-create-an-object-in-r">How to create an Object in R?</h3>
<ol>
<li>To create an R object, choose a name and then use the less-than symbol, <code>&lt;</code>, followed by a minus sign, <code>-</code>, to save data into it. This combination looks like an arrow, <code>&lt;-</code>. R will make an object, give it the name you chose, and store in it whatever follows the arrow.</li>
<li>When you ask R what’s in an object, it tells you on the next line. For example:</li>
</ol>
<pre><code class="lang-r">&gt; die &lt;- <span class="hljs-number">1</span>:<span class="hljs-number">6</span>
&gt; die
[<span class="hljs-number">1</span>] <span class="hljs-number">1</span> <span class="hljs-number">2</span> <span class="hljs-number">3</span> <span class="hljs-number">4</span> <span class="hljs-number">5</span> <span class="hljs-number">6</span>
</code></pre>
<ol start="3">
<li>You can name an object in R almost anything you want, but there are a few rules. First, a name cannot start with a number. Second, a name cannot use some special symbols, like <code>^</code>, <code>!</code>, <code>$</code>, <code>@</code>, <code>+</code>, <code>-</code>, <code>/</code>, or <code>*</code>.</li>
<li>R is case-sensitive, so <code>name</code> and <code>Name</code> refer to different objects.</li>
<li>You can see which object names you have already used with the function <code>ls()</code>.</li>
</ol>
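<p>The naming rules can be tried out directly. In this small sketch, the object names and values are made up for illustration, and base R’s <code>make.names()</code> is used to show how R would repair names that break the rules:</p>
<pre><code class="lang-r">name &lt;- "lower"
Name &lt;- "upper"
identical(name, Name)            # FALSE: two distinct objects

c("name", "Name") %in% ls()      # both names are now in use

make.names("2fast")              # "X2fast": names cannot start with a number
make.names("my-var")             # "my.var": '-' is not allowed in a name
</code></pre>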
<h2 id="heading-more-information">More Information:</h2>
<ul>
<li><a target="_blank" href="https://www.freecodecamp.org/news/r-programming-course/">Learn R programming language basics in just 2 hours with this free course on statistical programming</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/an-introduction-to-web-scraping-using-r-40284110c848/">An introduction to web scraping using R</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/aggregates-in-r-one-of-the-most-powerful-tool-you-can-ask-for-4dd14eafff1f/">An introduction to aggregates in R: a powerful tool for playing with data</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn R programming language basics in just 2 hours with this free course on statistical programming ]]>
                </title>
                <description>
                    <![CDATA[ Learn the R programming language in this course from Barton Poulson of datalab.cc. This is a hands-on overview of the statistical programming language R, one of the most important tools in data science. The course covers: Installing R RStudio Packag... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/r-programming-course/</link>
                <guid isPermaLink="false">66b2063a39b555ffda8bfeac</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 06 Jun 2019 17:45:51 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/09/r.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Learn the R programming language in this course from Barton Poulson of <a target="_blank" href="https://datalab.cc/">datalab.cc</a>. This is a hands-on overview of the statistical programming language R, one of the most important tools in data science.</p>
<p>The course covers:</p>
<ul>
<li>Installing R</li>
<li>RStudio</li>
<li>Packages</li>
<li>plot()</li>
<li>Bar Charts</li>
<li>Histograms</li>
<li>Scatterplots</li>
<li>Overlaying Plots</li>
<li>summary()</li>
<li>describe()</li>
<li>Selecting Cases</li>
<li>Data Formats</li>
<li>Factors</li>
<li>Entering Data</li>
<li>Importing Data</li>
<li>Hierarchical Clustering</li>
<li>Principal Components</li>
<li>Regression</li>
<li>Next Steps</li>
</ul>
<p>You can watch the full video course on the <a target="_blank" href="https://www.youtube.com/watch?v=_V8eKsto3Ug">freeCodeCamp.org YouTube channel</a> (2 hour watch).</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to build a Hacker News Frontpage scraper with just 7 lines of R code ]]>
                </title>
                <description>
                    <![CDATA[ By AMR Web scraping used to be a difficult task requiring expertise in XML Tree parsing and HTTP Requests. But with new-age scraping libraries like beautifulsoup (for Python) and rvest (for R), web scraping has become a toy for any beginner to play w... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-hacker-news-frontpage-scraper-with-just-7-lines-of-r-code-221af6acb98/</link>
                <guid isPermaLink="false">66c34f789972b7c5c7624ea6</guid>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Web Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 06 Feb 2018 20:54:52 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*P7CAV7kEQ4aBBzozOCYGsA.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By AMR</p>
<p>Web scraping used to be a difficult task requiring expertise in XML Tree parsing and HTTP Requests. But with new-age scraping libraries like beautifulsoup (for Python) and rvest (for R), web scraping has become a toy for any beginner to play with.</p>
<p>This post aims to explain how simple it is to use R, a very nice programming language, to perform Data Analysis and Data Visualization. The task ahead is very simple. Build a web scraper that scrapes the content of one of the most popular pages on the Internet (at least among Coders): <a target="_blank" href="https://news.ycombinator.com/">Hacker News Front Page</a>.</p>
<h3 id="heading-package-installation-and-loading">Package Installation and Loading</h3>
<p>The R package that we are going to use is <code>rvest</code>. It can be installed from <a target="_blank" href="https://cran.r-project.org/web/packages/rvest/index.html">CRAN</a> with <code>install.packages("rvest")</code> and then loaded into R as shown below:</p>
<pre><code>library(rvest)
</code></pre><p>The <code>read_html()</code> function of <code>rvest</code> reads the HTML content of the URL passed as its argument.</p>
<pre><code>content &lt;- read_html(<span class="hljs-string">'https://news.ycombinator.com/'</span>)
</code></pre><p>For <code>read_html()</code> to work reliably, make sure you are not behind an organization’s firewall. If you are, configure RStudio with a proxy to get through the firewall, otherwise you might face a <code>connection timed out</code> error.</p>
<p>Below is the screenshot of HN front page layout (with key elements highlighted):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4rSb9LXlF3ZcRyLEGeKrT2yX8vsQJkACkaXi" alt="Image" width="800" height="432" loading="lazy"></p>
<p>Now, with the HTML content of the Hacker News front page loaded into the R object <em>content</em>, let us extract the data that we need — starting with the Title.</p>
<p>One aspect is particularly important for making any web scraping assignment successful: identifying the right CSS selector, or XPath, for the HTML elements whose values are to be scraped. The easiest way to find the right value is to use the Inspect tool in the Developer Tools of any browser.</p>
<p>Here’s a screenshot of the CSS selector value, highlighted by the Chrome Inspect tool when hovering over the title of a link on the Hacker News front page.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IEJOiDb3aUyj90KuhWzWyAO4eoQ3Z1jUcpNM" alt="Image" width="800" height="449" loading="lazy"></p>
<pre><code>title &lt;- content %&gt;% html_nodes(<span class="hljs-string">'a.storylink'</span>) %&gt;% html_text()
title
 [<span class="hljs-number">1</span>] <span class="hljs-string">"Magic Leap One"</span>
 [<span class="hljs-number">2</span>] <span class="hljs-string">"Show HN: Terminal – native micro-GUIs for shell scripts and command line apps"</span>
 [<span class="hljs-number">3</span>] <span class="hljs-string">"Tokio internals: Understanding Rust's async I/O framework"</span>
 [<span class="hljs-number">4</span>] <span class="hljs-string">"Funding Yourself as a Free Software Developer"</span>
 [<span class="hljs-number">5</span>] <span class="hljs-string">"US Federal Ban on Making Lethal Viruses Is Lifted"</span>
 [<span class="hljs-number">6</span>] <span class="hljs-string">"Pass-Thru Income Deduction"</span>
 [<span class="hljs-number">7</span>] <span class="hljs-string">"Orson Welles' first attempt at movie-making"</span>
 [<span class="hljs-number">8</span>] <span class="hljs-string">"D’s Newfangled Name Mangling"</span>
 [<span class="hljs-number">9</span>] <span class="hljs-string">"Apple Plans Combined iPhone, iPad, and Mac Apps to Create One User Experience"</span>
[<span class="hljs-number">10</span>] <span class="hljs-string">"LiteDB – A .NET NoSQL Document Store in a Single Data File"</span>
[<span class="hljs-number">11</span>] <span class="hljs-string">"Taking a break from Adblock Plus development"</span>
[<span class="hljs-number">12</span>] <span class="hljs-string">"SpaceX’s Falcon Heavy rocket sets up at Cape Canaveral ahead of launch"</span>
[<span class="hljs-number">13</span>] <span class="hljs-string">"This is not a new year’s resolution"</span>
[<span class="hljs-number">14</span>] <span class="hljs-string">"Artists and writers whose works enter the public domain in 2018"</span>
[<span class="hljs-number">15</span>] <span class="hljs-string">"Open Beta of Texpad 1.8, macOS LaTeX editor with integrated real-time typesetting"</span>
[<span class="hljs-number">16</span>] <span class="hljs-string">"The triumph and near-tragedy of the first Moon landing"</span>
[<span class="hljs-number">17</span>] <span class="hljs-string">"Retrotechnology – PC desktop screenshots from 1983-2005"</span>
[<span class="hljs-number">18</span>] <span class="hljs-string">"Google Maps' Moat"</span>
[<span class="hljs-number">19</span>] <span class="hljs-string">"Regex Parser in C Using Continuation Passing"</span>
[<span class="hljs-number">20</span>] <span class="hljs-string">"AT&amp;T giving $1000 bonus to all its employees because of tax reform"</span>
[<span class="hljs-number">21</span>] <span class="hljs-string">"How a PR Agency Stole Our Kickstarter Money"</span>
[<span class="hljs-number">22</span>] <span class="hljs-string">"Google Hangouts now on Firefox without plugins via WebRTC"</span>
[<span class="hljs-number">23</span>] <span class="hljs-string">"Ubuntu 17.10 corrupting BIOS of many Lenovo laptop models"</span>
[<span class="hljs-number">24</span>] <span class="hljs-string">"I Know What You Download on BitTorrent"</span>
[<span class="hljs-number">25</span>] <span class="hljs-string">"Carrie Fisher’s Private Philosophy Coach"</span>
[<span class="hljs-number">26</span>] <span class="hljs-string">"Show HN: Library of API collections for Postman"</span>
[<span class="hljs-number">27</span>] <span class="hljs-string">"Uber is officially a cab firm, says European court"</span>
[<span class="hljs-number">28</span>] <span class="hljs-string">"The end of the Iceweasel Age (2016)"</span>
[<span class="hljs-number">29</span>] <span class="hljs-string">"Google will turn on native ad-blocking in Chrome on February 15"</span>
[<span class="hljs-number">30</span>] <span class="hljs-string">"Bitcoin Cash deals frozen as insider trading is probed"</span>
</code></pre><p>The rvest package supports the pipe operator, <code>%&gt;%</code>. The R object containing the content of the HTML page (read with <code>read_html()</code>) can therefore be piped into <code>html_nodes()</code>, which takes a CSS selector or XPath as its argument and extracts the matching nodes of the XML tree. The text of those nodes can then be extracted with the <code>html_text()</code> function.</p>
<p>The beauty of rvest is that it abstracts the entire XML parsing operation under the hood of functions like <code>html_nodes()</code> and <code>html_text()</code>, making it easier for us to achieve our scraping goal with minimal code.</p>
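<p>To see what these two functions do in isolation, it can help to try them on a tiny page that needs no network access. The sketch below is only an illustration: the HTML snippet is hand-written and hypothetical (it is not the real Hacker News markup), and it relies on <code>read_html()</code> accepting a literal HTML string.</p>
<pre><code class="lang-r">library(rvest)

# A hand-written stand-in for a scraped page (hypothetical markup)
page &lt;- read_html('&lt;html&gt;&lt;body&gt;
  &lt;a class="storylink" href="https://example.com/a"&gt;First story&lt;/a&gt;
  &lt;a class="storylink" href="https://example.com/b"&gt;Second story&lt;/a&gt;
  &lt;a class="other" href="https://example.com/c"&gt;Not a story&lt;/a&gt;
&lt;/body&gt;&lt;/html&gt;')

# Select only the links with class "storylink", then pull out their text
stories &lt;- page %&gt;% html_nodes("a.storylink") %&gt;% html_text()
stories

# The same selection can also yield an attribute instead of the text
links &lt;- page %&gt;% html_nodes("a.storylink") %&gt;% html_attr("href")
links
</code></pre>
<p>Only the two elements matching the selector are returned; the third link is skipped because its class differs.</p>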
<p>Like with Title, the CSS selector value of other required elements of the web page can be identified with the Chrome Inspect tool. They can also be passed as an argument to html_nodes() function and respective values can be extracted and stored in R objects.</p>
<pre><code>link_domain &lt;- content %&gt;% html_nodes(<span class="hljs-string">'span.sitestr'</span>) %&gt;% html_text()
score &lt;- content %&gt;% html_nodes(<span class="hljs-string">'span.score'</span>) %&gt;% html_text()
age &lt;- content %&gt;% html_nodes(<span class="hljs-string">'span.age'</span>) %&gt;% html_text()
</code></pre><p>All the essential pieces of information were extracted from the page. Now an R data frame can be made with the extracted elements to put the extracted data into a structured format.</p>
<pre><code>df &lt;- data.frame(title = title, link_domain = link_domain, score = score, age = age)
</code></pre><p>Below is the screenshot of the final dataframe in RStudio viewer:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/hEnsMstgI7-hwhOHGx9LrexwgAEoOb8q5bmK" alt="Image" width="800" height="515" loading="lazy"></p>
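<p>The scraped <code>score</code> values are character strings. Assuming they look like <code>"123 points"</code> (an assumption about the page’s text, not something the code above checks), a minimal sketch for turning them into numbers might look like this. The sample values are made up:</p>
<pre><code class="lang-r"># Made-up sample values standing in for the scraped score column
score &lt;- c("123 points", "1 point", "45 points")

# Drop the trailing " point" / " points" and convert what is left to integers
score_num &lt;- as.integer(sub(" points?$", "", score))
score_num   # 123 1 45
</code></pre>
<p>With numeric scores, the data frame can then be sorted or filtered by score like any other numeric column.</p>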
<p>Thus, in just 7 lines of code, we have successfully built a Hacker News Frontpage Scraper in R.</p>
<p>R is a wonderful language to perform Data Analysis and Data Visualization. The code used here is available <a target="_blank" href="https://github.com/amrrs/HN_scraper_in_R">on my GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Which languages should you learn for data science? ]]>
                </title>
                <description>
                    <![CDATA[ By Peter Gleeson Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real-world programming ability. There are many potential programming languages that the aspiring data scientist might consider ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/which-languages-should-you-learn-for-data-science-e806ba55a81f/</link>
                <guid isPermaLink="false">66d460bec7632f8bfbf1e487</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Java ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Julialang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matlab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Scala ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 31 Aug 2017 16:07:30 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*gSxUa9oNaBk1QJf6eqQYeg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Peter Gleeson</p>
<p>Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real-world programming ability. There are many potential programming languages that the aspiring data scientist might consider specializing in.</p>
<p>While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:</p>
<p><strong>Specificity</strong></p>
<p>When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!</p>
<p><strong>Generality</strong></p>
<p>A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages are going to help.</p>
<p><strong>Productivity</strong></p>
<p>In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.</p>
<p><strong>Performance</strong></p>
<p>In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise statically typed languages are considerably more fail-proof than dynamically typed. The obvious trade-off is against productivity.</p>
<p>To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below falls somewhere on these spectra.</p>
<p>With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and the personal experience of myself, friends and colleagues, but it is by no means definitive! In approximate order of popularity, here goes:</p>
<h3 id="heading-r">R</h3>
<h4 id="heading-what-you-need-to-know">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/bx3wt1sCBXSEUkiii81wH31gLcU0e3XiA6S7" alt="Image" width="256" height="256" loading="lazy"></p>
<p>Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the <a target="_blank" href="https://www.r-project.org/foundation/">R Foundation for Statistical Computing</a>.</p>
<h4 id="heading-license">License</h4>
<p>Free!</p>
<h4 id="heading-pros">Pros</h4>
<ul>
<li>Excellent range of high-quality, domain specific and <a target="_blank" href="https://cran.r-project.org/">open source packages</a>. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.</li>
<li>The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.</li>
<li>Data visualization is a key strength with the use of libraries such as <a target="_blank" href="http://ggplot2.org/">ggplot2</a>.</li>
</ul>
<h4 id="heading-cons">Cons</h4>
<ul>
<li>Performance. There’s no two ways about it, <a target="_blank" href="http://adv-r.had.co.nz/Performance.html">R is not a quick language</a>.</li>
<li>Domain specificity. R is fantastic for statistics and data science purposes. But less so for general purpose programming.</li>
<li>Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.</li>
</ul>
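<p>Two of the quirks mentioned above are easy to see in a few lines. This is just an illustrative sketch with made-up variable names:</p>
<pre><code class="lang-r">v &lt;- c(10, 20, 30)
v[1]        # 10: the first element lives at index 1, not 0

# Three assignment forms that all create a variable
a &lt;- 1
b = 2
3 -&gt; d
c(a, b, d)  # 1 2 3
</code></pre>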
<h4 id="heading-verdict-brilliant-at-what-its-designed-for">Verdict — “brilliant at what it’s designed for”</h4>
<p>R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.</p>
<h3 id="heading-python">Python</h3>
<h4 id="heading-what-you-need-to-know-1">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/U0XPlJp-xNFQypL6euOVZKDgms1Rfk4Hiojy" alt="Image" width="250" height="250" loading="lazy"></p>
<p>Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently <a target="_blank" href="https://www.python.org/downloads/release/python-362/">3.6</a> and <a target="_blank" href="https://www.python.org/download/releases/2.7/">2.7</a>.</p>
<h4 id="heading-license-1">License</h4>
<p>Free!</p>
<h4 id="heading-pros-1">Pros</h4>
<ul>
<li>Python is a very popular, mainstream general purpose programming language. It has an <a target="_blank" href="https://pypi.python.org/pypi">extensive range of purpose-built modules</a> and community support. Many online services provide a Python API.</li>
<li>Python is an easy language to learn. The low barrier to entry makes it an ideal first language for those new to programming.</li>
<li>Packages such as <a target="_blank" href="http://pandas.pydata.org/">pandas</a>, <a target="_blank" href="http://scikit-learn.org/stable/">scikit-learn</a> and <a target="_blank" href="https://www.tensorflow.org/">Tensorflow</a> make Python a solid option for advanced machine learning applications.</li>
</ul>
<h4 id="heading-cons-1">Cons</h4>
<ul>
<li>Type safety: Python is a dynamically typed language, which means you must show due care. Type errors (such as passing a String as an argument to a method which expects an Integer) are to be expected from time-to-time.</li>
<li>For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. For general purpose languages, there are faster and safer alternatives to Python.</li>
</ul>
<h4 id="heading-verdict-excellent-all-rounder">Verdict — “excellent all-rounder”</h4>
<p>Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the <a target="_blank" href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL process</a> (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s Tensorflow make Python a very exciting language to work in for machine learning.</p>
<h3 id="heading-sql">SQL</h3>
<h4 id="heading-what-you-need-to-know-2">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1Dbg8u7RSmjx8l-Xv7DMDJpesKUYKEVASvP6" alt="Image" width="190" height="221" loading="lazy"></p>
<p><a target="_blank" href="https://www.w3schools.com/sql/default.asp">SQL</a> (‘Structured Query Language’) defines, manages and queries <a target="_blank" href="https://en.wikipedia.org/wiki/Relational_database">relational databases</a>. The language appeared by 1974 and has since undergone many implementations, but the core principles remain the same.</p>
<h4 id="heading-license-2">License</h4>
<p>Varies — some implementations are free, others proprietary</p>
<h4 id="heading-pros-2">Pros</h4>
<ul>
<li>Very efficient at querying, updating and manipulating relational databases.</li>
<li>Declarative syntax makes SQL an often very readable language. There’s no ambiguity about what <code>SELECT name FROM users WHERE age &gt; 18</code> is supposed to do!</li>
<li>SQL is widely used across a range of applications, making it a very useful language to be familiar with. Modules such as <a target="_blank" href="https://www.sqlalchemy.org/">SQLAlchemy</a> make integrating SQL with other languages straightforward.</li>
</ul>
<h4 id="heading-cons-2">Cons</h4>
<ul>
<li>SQL’s analytical capabilities are rather limited: beyond aggregating, summing, counting and averaging data, there is little you can do.</li>
<li>For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.</li>
<li>There are many different implementations of SQL such as <a target="_blank" href="https://www.postgresql.org/">PostgreSQL</a>, <a target="_blank" href="https://www.sqlite.org/">SQLite</a> and <a target="_blank" href="https://mariadb.org/">MariaDB</a>. They are all different enough to make inter-operability something of a headache.</li>
</ul>
<h4 id="heading-verdict-timeless-and-efficient">Verdict — “timeless and efficient”</h4>
<p>SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.</p>
<h3 id="heading-java">Java</h3>
<h4 id="heading-what-you-need-to-know-3">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/E2x8C0ZeF7QXqbkewzZdLlXojOkcMP16sayQ" alt="Image" width="256" height="256" loading="lazy"></p>
<p>Java is an extremely popular, general purpose language which runs on the Java Virtual Machine (JVM), an abstract computing system that enables seamless portability between platforms. It is currently supported by <a target="_blank" href="https://www.oracle.com/java/index.html">Oracle Corporation</a>.</p>
<h4 id="heading-license-3">License</h4>
<p>Version 8 — Free! Legacy versions, proprietary.</p>
<h4 id="heading-pros-3">Pros</h4>
<ul>
<li>Ubiquity. Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.</li>
<li>Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.</li>
<li>Java is a high-performance, general purpose, compiled language. This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.</li>
</ul>
<h4 id="heading-cons-3">Cons</h4>
<ul>
<li>For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.</li>
<li>Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.</li>
</ul>
<h4 id="heading-verdict-a-serious-contender-for-data-science">Verdict — “a serious contender for data science”</h4>
<p>There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and type safety are real advantages.</p>
<p>However, you’ll be without the range of stats-specific packages available to other languages. That said, it is definitely one to consider, especially if you already know R or Python.</p>
<h3 id="heading-scala">Scala</h3>
<h4 id="heading-what-you-need-to-know-4">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ttyRkvz1Ye6LkeZdGzMZmesaG2BcvGZhFcmV" alt="Image" width="250" height="250" loading="lazy"></p>
<p>Developed by Martin Odersky and released in 2004, <a target="_blank" href="https://www.scala-lang.org/">Scala</a> is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework <a target="_blank" href="https://spark.apache.org/">Apache Spark</a> is written in Scala.</p>
<h4 id="heading-license-4">License</h4>
<p>Free!</p>
<h4 id="heading-pros-4">Pros</h4>
<ul>
<li>Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.</li>
<li>Multi-paradigmatic: Scala programmers can have the best of both worlds, with both object-oriented and functional programming paradigms available to them.</li>
<li>Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.</li>
</ul>
<h4 id="heading-cons-4">Cons</h4>
<ul>
<li>Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download <a target="_blank" href="http://www.scala-sbt.org/">sbt</a> and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.</li>
<li>The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.</li>
</ul>
<h4 id="heading-verdict-perfect-for-suitably-big-data">Verdict — “perfect, for suitably big data”</h4>
<p>When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too. </p>
<p>Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.</p>
<h3 id="heading-julia">Julia</h3>
<h4 id="heading-what-you-need-to-know-5">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Ok4VqC5ra015oGgqPcuGvJ9cWBtBu5f0Zt-G" alt="Image" width="370" height="208" loading="lazy"></p>
<p>Released just over 5 years ago, <a target="_blank" href="https://julialang.org/">Julia</a> has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by <a target="_blank" href="https://juliacomputing.com/case-studies/">several major organizations</a> including many in the finance industry.</p>
<h4 id="heading-license-5">License</h4>
<p>Free!</p>
<h4 id="heading-pros-5">Pros</h4>
<ul>
<li>Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.</li>
<li>Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.</li>
<li>Readability. Many users of the language cite this as a key advantage.</li>
</ul>
<h4 id="heading-cons-5">Cons</h4>
<ul>
<li>Maturity. As a new language, some Julia users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.</li>
<li>Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).</li>
</ul>
<h4 id="heading-verdict-one-for-the-future">Verdict — “one for the future”</h4>
<p>The main issue with Julia is one that it cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives, Python and R.</p>
<p>But, if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.</p>
<h3 id="heading-matlab">MATLAB</h3>
<h4 id="heading-what-you-need-to-know-6">What you need to know</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DI1Fj8dKXe484TVK6JSHSKeuYPPfE49rIwYI" alt="Image" width="225" height="225" loading="lazy"></p>
<p><a target="_blank" href="https://in.mathworks.com/products/matlab.html">MATLAB</a> is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.</p>
<h4 id="heading-license-6">License</h4>
<p>Proprietary — pricing varies depending on your use case</p>
<h4 id="heading-pros-6">Pros</h4>
<ul>
<li>Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.</li>
<li>Data Visualization. MATLAB has some great inbuilt plotting capabilities.</li>
<li>MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.</li>
</ul>
<h4 id="heading-cons-6">Cons</h4>
<ul>
<li>Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as <a target="_blank" href="https://www.gnu.org/software/octave/">Octave</a>. This is something you should give real consideration to.</li>
<li>MATLAB isn’t an obvious choice for general-purpose programming.</li>
</ul>
<h4 id="heading-verdict-best-for-mathematically-intensive-applications">Verdict — “best for mathematically intensive applications”</h4>
<p>MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science. </p>
<p>The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality. Indeed, MATLAB was specifically designed for this.</p>
<h3 id="heading-other-languages">Other Languages</h3>
<p>There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!</p>
<h4 id="heading-c">C++</h4>
<p><a target="_blank" href="https://isocpp.org/">C++</a> is not a common choice for data science, although it has lightning fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.</p>
<p>As <a target="_blank" href="https://www.quora.com/Why-dont-data-scientists-use-C-C%2B%2B/answer/Kevin-Lin?srid=hhtiJ">one Quora user puts it</a>:</p>
<blockquote>
<p><em>“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”</em></p>
</blockquote>
<p>The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low-level.</p>
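<p>The trade-off in that quote is easy to put numbers on. The sketch below takes the figures from the quote (30 minutes to write and 10 seconds to run, versus 10 minutes to write and 1 minute to run) and finds how many runs it takes before the faster program pays back its extra development time. The function names are made up for illustration.</p>

```python
def total_seconds(write_minutes, run_seconds, runs):
    """Time spent writing the program plus running it `runs` times."""
    return write_minutes * 60 + run_seconds * runs


def fast_program(runs):
    # Slow to write, quick to run (the 30 minute / 10 second option).
    return total_seconds(write_minutes=30, run_seconds=10, runs=runs)


def quick_script(runs):
    # Quick to write, slow to run (the 10 minute / 1 minute option).
    return total_seconds(write_minutes=10, run_seconds=60, runs=runs)


# Smallest number of runs at which the fast program is no longer behind.
break_even = next(
    n for n in range(1, 1000) if quick_script(n) >= fast_program(n)
)
print(fast_program(1), quick_script(1))  # 1810 660
print(break_even)  # 24
```

<p>For a one-off analysis the quick script wins by almost 20 minutes. Only from around the 24th run onwards does the extra effort start to pay off, which is roughly where production workloads begin.</p>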
<p><strong>Verdict — “not for day-to-day work, but if performance is critical…”</strong></p>
<h4 id="heading-javascript">JavaScript</h4>
<p>With the rise of <a target="_blank" href="https://nodejs.org/en/">Node.js</a> in recent years, <a target="_blank" href="https://en.wikipedia.org/wiki/JavaScript">JavaScript</a> has increasingly become a serious server-side language. However, its use in data science and machine learning domains has been limited to date (although check out <a target="_blank" href="https://github.com/harthur/brain">brain.js</a> and <a target="_blank" href="http://caza.la/synaptic/#/">synaptic.js</a>!). It suffers from the following disadvantages:</p>
<ul>
<li>Late to the game (Node.js is only 8 years old!), meaning…</li>
<li>Few relevant data science libraries and modules are available. This means no real mainstream interest or momentum.</li>
<li>Performance-wise, Node.js is quick. But JavaScript as a language is <a target="_blank" href="https://hackernoon.com/the-javascript-phenomenon-is-a-mass-psychosis-57adebb09359">not without its critics</a>.</li>
</ul>
<p>Node’s strengths are in asynchronous I/O, its widespread use and the existence of <a target="_blank" href="https://www.slant.co/topics/101/~best-languages-that-compile-to-javascript">languages which compile to JavaScript</a>. So it’s conceivable that a useful framework for data science and realtime ETL processing could come together. </p>
<p>The key question is whether this would offer anything different to what already exists.</p>
<p><strong>Verdict — “there is much to do before JavaScript can be taken as a serious data science language”</strong></p>
<h4 id="heading-perl">Perl</h4>
<p><a target="_blank" href="https://www.perl.org/">Perl</a> is known as a ‘Swiss-army knife of programming languages’, due to its versatility as a general-purpose scripting language. It shares a lot in common with Python, being a dynamically typed scripting language. But, it has not seen anything like the popularity Python has in the field of data science.</p>
<p>This is a little surprising, given its use in quantitative fields such as <a target="_blank" href="https://en.wikipedia.org/wiki/BioPerl">bioinformatics</a>. Perl has several key disadvantages when it comes to data science. It isn’t stand-out fast, and its syntax is <a target="_blank" href="https://en.wikipedia.org/wiki/Write-only_language">famously unfriendly</a>. There hasn’t been the same drive towards developing data science specific libraries. And in any field, momentum is key.</p>
<p><strong>Verdict — “a useful general purpose scripting language, yet it offers no real advantages for your data science CV”</strong></p>
<h4 id="heading-ruby">Ruby</h4>
<p><a target="_blank" href="https://www.ruby-lang.org/en/">Ruby</a> is another general purpose, dynamically typed interpreted language. Yet it also hasn’t seen the same adoption for data science as has Python.</p>
<p>This might seem surprising, but is likely a result of Python’s dominance in academia, and a positive feedback effect. The more people use Python, the more modules and frameworks are developed, and the more people will turn to Python.</p>
<p>The <a target="_blank" href="http://sciruby.com/">SciRuby project</a> exists to bring scientific computing functionality, such as matrix algebra, to Ruby. But for the time being, Python still leads the way.</p>
<p><strong>Verdict — “not an obvious choice yet for data science, but won’t harm the CV”</strong></p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Well, there you have it — a quickfire guide to which languages to consider for data science. The key here is to understand your usage requirements in terms of generality vs specificity, as well as your personal preferred development style of performance vs productivity.</p>
<p>I use R, Python and SQL on a regular basis, as my current role largely focuses on developing existing data pipelines and ETL processes. These languages give the right balance of generality and productivity to do the job, with the option of using R’s more advanced statistics packages when needed.</p>
<p>However — you may already have some experience with Java. Or you may want to use Scala for big data. Or, perhaps you’re keen to get involved with the Julia project.</p>
<p>Maybe you learned MATLAB at university, or want to give SciRuby a chance? Perhaps you have an altogether different suggestion. If so, please leave a reply below — I look forward to hearing from you!</p>
<p>Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
