R Programming - freeCodeCamp.org

How to Create Boxplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Thu, 15 Jan 2026 18:48:32 +0000

In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.

By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.

Prerequisites
How to Set Up Your R Environment
How to Load and Inspect the Data
How to Clean and Prepare the Data
How to Use Boxplots
How to Create Boxplots with ggplot2
How to Perform Exploratory Data Analysis
How to Build Linear Regression Models
How to Build Logistic Regression Models
Why Visualization Comes Before Modeling
Conclusion

Prerequisites

Before you begin, you should be comfortable with the following:

Basic R syntax (variables, functions, data frames).
Installing and loading R packages.
Understanding what rows and columns represent in a dataset.
Very basic statistics (mean, median, distributions).

How to Set Up Your R Environment

Start by installing and loading the packages you will need.

install.packages(c("tidyverse", "ggplot2"))
library(tidyverse)
library(ggplot2)

tidyverse provides tools for data manipulation and visualization, and it already includes ggplot2, the visualization engine you will use for boxplots. Loading the libraries makes their functions available for use.

How to Load and Inspect the Data

First, download the HR Analytics dataset by Saad Haroon from Kaggle.

Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load the file into R using that path.

You can view a sample of the dataset by running the head function. To view the structure of the dataset, you can run the str function.

hr <- read.csv("C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv")
head(hr)
str(hr)

The read.csv function imports the dataset into R. The head function shows the first six rows so you can preview the data. The str function reveals data types, helping you spot categorical versus numeric variables early.

Remember that understanding your data structure early prevents errors later when plotting or modeling. Once you run the head function, the first six rows of the dataset appear in your console.

From the head function, you can see:

Structure

Each row represents one employee.
Each column represents a feature/variable about the employee.

Key Columns & Meaning

EmpID → Employee identifier
Age → Age in years
AgeGroup → Age category (for example, 18-25)
Attrition → Whether the employee left or not (Yes/No)
BusinessTravel → Travel frequency (Travel_Rarely, Travel_Frequently, Non-Travel)
Department → Employee department
DistanceFromHome → Distance from home to office (km)
Education / EducationField → Level and field of education
EmployeeCount → Usually 1 per employee (redundant)
Gender → Male / Female
JobRole / JobSatisfaction → Job title and satisfaction level
MonthlyIncome / SalarySlab → Salary amount and category
YearsAtCompany / YearsInCurrentRole → Experience metrics
OverTime → Works overtime (Yes/No)
Other features: PerformanceRating, TrainingTimesLastYear, WorkLifeBalance, StockOptionLevel, and so on.

Data Types

Numeric → Age, DistanceFromHome, MonthlyIncome, YearsAtCompany
Categorical / Character → Attrition, Gender, Department, JobRole

Observations

The dataset is tabular, like a spreadsheet.
There are multiple categorical columns.
There are multiple numeric columns.
Some columns seem redundant or constant. They don’t provide useful information because every row has the same value (for example, EmployeeCount).

From the str function, you can gather that:

The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.

Each column has a name, data type, and example values. For instance, Age and DistanceFromHome are numeric (int), with values like 28 or 12. EmpID and Department are character strings (chr), with examples like Research & Development or Sales. Other features include JobRole (Analyst, Manager) and Attrition (Yes/No).

The dataset contains mixed data types. Some columns are numeric, such as MonthlyIncome or YearsAtCompany. Some are character or categorical, like Gender (Male/Female) and BusinessTravel (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, EmployeeCount has the same value of 1 for all rows and does not provide useful information.

How to Clean and Prepare the Data

Before visualizing, you need to clean your data. To find out what needs cleaning, start by inspecting it.

Run the summary function to view summary statistics for each column. Then combine the is.na and colSums functions to count the missing values in each column.

summary(hr)
colSums(is.na(hr))

The summary function gives quick statistics and helps you spot suspicious values. The is.na function flags missing entries, and colSums adds them up for each column. Boxplots draw attention to extreme values, so knowing what you are working with is critical.

The summary output shows the basic statistics of each column, and the colSums(is.na(hr)) output shows the number of missing values per column.

From this output, you can see that only YearsWithCurrManager has missing values: 57 employees don’t have a value for this column.
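To make the missing-value count concrete, here is a minimal sketch on a small made-up data frame. The `toy` data frame and its values are assumptions for illustration only and are not taken from the HR dataset.

```r
# Toy data frame (hypothetical values, not from the HR dataset)
toy <- data.frame(
  EmpID = c("E1", "E2", "E3", "E4"),
  Age = c(25, 31, NA, 42),
  YearsWithCurrManager = c(2, NA, NA, 7)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
missing_counts <- colSums(is.na(toy))
missing_counts

# Keep only the columns that have no missing values
complete_cols <- toy[, missing_counts == 0, drop = FALSE]
names(complete_cols)
```

Dropping a column is only one option. If a column matters for your analysis, you may prefer to keep it and handle the missing rows instead.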

You can drop this whole column along with the other redundant columns we saw earlier on. You can do this with the code below.

hr <- hr %>% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))

To verify if the columns are gone, use this code:

colnames(hr)

Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for Attrition), not free text.

hr$Attrition <- as.factor(hr$Attrition)
hr$JobRole <- as.factor(hr$JobRole)
hr$Department <- as.factor(hr$Department)

This also ensures ggplot2 treats them correctly when grouping.
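As a quick illustration of what the conversion does, the sketch below turns a small character vector into a factor and inspects it. The `attrition` vector is a made-up stand-in for a column such as Attrition.

```r
# Hypothetical character vector standing in for a column such as Attrition
attrition <- c("Yes", "No", "No", "Yes", "No")

attrition_factor <- as.factor(attrition)

# The distinct categories, sorted alphabetically by default
levels(attrition_factor)

# How many values fall into each category
category_counts <- table(attrition_factor)
category_counts
```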

How to Use Boxplots

A boxplot displays key features of a dataset. The median is shown by the line in the middle of the box. The interquartile range is represented by the box itself while the whiskers show the spread of the data. Outliers appear as individual points.

Boxplots are most useful when you want to compare distributions across groups, such as income by job role or age by attrition status.
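To see where the parts of a boxplot come from, you can compute them by hand. The sketch below uses a made-up `income` vector and the common 1.5 × IQR rule for flagging outliers, which matches the default whisker length in geom_boxplot (its coef argument defaults to 1.5). The data values are assumptions for illustration.

```r
# Hypothetical monthly incomes; the last value is deliberately extreme
income <- c(2000, 2500, 3000, 3500, 4000, 4500, 5000, 20000)

q1  <- quantile(income, 0.25, names = FALSE)  # bottom edge of the box
med <- median(income)                         # line inside the box
q3  <- quantile(income, 0.75, names = FALSE)  # top edge of the box
iqr <- q3 - q1                                # height of the box

# Values more than 1.5 * IQR beyond the box are drawn as individual points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- income[income < lower_fence | income > upper_fence]
outliers
```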

Let’s start with a simple boxplot of monthly income.

ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Distribution of Monthly Income",
    y = "Monthly Income")

The aes function tells ggplot which variable to plot. geom_boxplot draws the boxplot. The labs function sets the plot's labels, here the title and the y-axis label.

How to Create Boxplots with ggplot2

Now let’s compare income across job roles.

ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income")

The x aesthetic lists all the job roles. The labels are rotated to improve readability. This visualization quickly reveals income differences across roles.
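Grouped boxplots are often easier to read when the groups are sorted by their median. One way to do this, sketched here on made-up data, is base R's reorder(), which sorts factor levels by a summary of another variable. The `toy` data frame is hypothetical.

```r
# Hypothetical roles and incomes, for illustration only
toy <- data.frame(
  JobRole = factor(c("Manager", "Analyst", "Manager", "Analyst", "Clerk", "Clerk")),
  MonthlyIncome = c(9000, 5000, 11000, 6000, 2500, 3000)
)

# Sort the factor levels by the median income within each role
toy$JobRole <- reorder(toy$JobRole, toy$MonthlyIncome, FUN = median)
levels(toy$JobRole)
```

Because ggplot2 draws discrete axes in factor-level order, a plot with aes(x = JobRole) would then show the roles from lowest to highest median income.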

How to Perform Exploratory Data Analysis (EDA)

Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.

We can use the example of Years at company by department.

ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = "darkblue") +
  labs(
    title = "Years at Company by Department",
    y = "Years at Company")
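A boxplot answers questions visually, and it can help to back it up with the numbers behind it. The sketch below computes the median per group with tapply() on a made-up data frame. The departments and values are assumptions, not results from the HR dataset.

```r
# Hypothetical tenure data, for illustration only
toy <- data.frame(
  Department = c("Sales", "Sales", "HR", "HR", "R&D", "R&D"),
  YearsAtCompany = c(2, 8, 1, 3, 5, 15)
)

# Median tenure within each department (the line inside each box)
median_years <- tapply(toy$YearsAtCompany, toy$Department, median)
median_years
```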

How to Build Linear Regression Models

To see how linear regression works in practice, model MonthlyIncome using YearsAtCompany with the commands below.

The first command creates the model while the second displays its summary.

hr_lm <- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)

Linear regression estimates how income changes with tenure. This works here because both variables are numeric.

After running the code, your console should show you this output:

Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -9506  -2488  -1186   1403  15483 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3734.47     159.41   23.43   <2e-16 ***
YearsAtCompany   395.25      17.14   23.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4032 on 1478 degrees of freedom
Multiple R-squared:  0.2647,    Adjusted R-squared:  0.2642 
F-statistic:   532 on 1 and 1478 DF,  p-value: < 2.2e-16

Let’s interpret this model.

If an employee has 0 years at the company, their predicted monthly income is $3734.47. This comes from the intercept.

For each additional year an employee spends at the company, their monthly income is predicted to increase by $395.25.

Both coefficients have p-values < 2e-16, which means they are highly significant. This is strong evidence that the years an employee spends at a company are associated with their income.

The model’s R-squared is 0.2647. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is fairly low, so other factors like role, department, or education likely affect income too.

The model’s F-statistic is 532, with a p-value < 2.2e-16. This means the model is statistically significant overall.

In general, the longer an employee stays at a company, the more they earn: roughly $395 more per month for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.
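The fitted line can be turned into predictions with simple arithmetic. The sketch below plugs the two coefficients from the output above into the regression equation. The tenures in `years` are hypothetical values chosen for illustration.

```r
# Coefficients copied from the summary(hr_lm) output above
intercept <- 3734.47
slope     <- 395.25

# Hypothetical tenures (in years) to predict for
years <- c(0, 5, 10)

# Predicted income = intercept + slope * years
predicted_income <- intercept + slope * years
predicted_income
```

With the fitted model object available, predict(hr_lm, newdata = data.frame(YearsAtCompany = years)) should give the same values up to rounding of the coefficients.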

How to Build Logistic Regression Models

You can now predict attrition. The first command fits the model while the second displays its summary.

hr_glm <- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)

Your console should show this as an output when you run both commands.

Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.094e-01  1.375e-01  -5.886 3.96e-09 ***
MonthlyIncome  -9.449e-05  2.302e-05  -4.104 4.05e-05 ***
YearsAtCompany -5.047e-02  1.792e-02  -2.817  0.00485 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1305.4  on 1479  degrees of freedom
Residual deviance: 1252.5  on 1477  degrees of freedom
AIC: 1258.5

Number of Fisher Scoring iterations: 5

Logistic regression is used for binary outcomes, that is, yes or no. It estimates the probability of the outcome.

Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (Attrition) based on their Monthly Income and Years at Company.

The intercept is -0.809. This is the baseline log-odds of leaving when income and years at the company are both zero.

Monthly Income has a coefficient of -0.0000945. This means that as income increases, the log-odds of leaving decrease slightly for each additional unit of income.

Years at Company has a coefficient of -0.0505. This shows that the longer employees stay, the less likely they are to leave: each additional year lowers the log-odds of attrition by about 0.05.

All coefficients are statistically significant. Monthly Income and Years at Company are both associated with a lower likelihood of leaving.

The model’s residual deviance is 1252.5, lower than the null deviance of 1305.4. This means the model explains some of the variation in attrition.

The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.
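Log-odds are hard to read directly, so it often helps to convert them. The sketch below exponentiates the coefficients from the output above to get odds ratios, then uses plogis() to turn a log-odds value into a probability. The employee profile (income of 5000 and 5 years of tenure) is a made-up example.

```r
# Coefficients copied from the summary(hr_glm) output above
b_intercept <- -0.8094
b_income    <- -9.449e-05
b_years     <- -0.05047

# Odds ratios: multiplicative change in the odds of leaving
odds_ratio_years       <- exp(b_years)            # per additional year
odds_ratio_income_1000 <- exp(b_income * 1000)    # per additional 1000 of income

# Predicted probability of leaving for a hypothetical employee
log_odds   <- b_intercept + b_income * 5000 + b_years * 5
prob_leave <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
prob_leave
```

An odds ratio below 1 means the odds of leaving shrink as the predictor grows, which matches the negative coefficients.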

Why Visualization Comes Before Modeling

Boxplots help you to:

Detect outliers: Boxplots highlight extreme values that can distort model results.
Compare groups: Boxplots allow quick comparison of distributions across different categories.
Form hypotheses: Visual patterns assist in identifying relationships worth testing in a model.
Validate modeling assumptions: Boxplots help check distribution shape and variance before modeling.

Modeling without visualization often leads to misinterpretation or false confidence.

Conclusion

In this tutorial, you learned how to load and clean data and why boxplots matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.

How to Create Scatterplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Mon, 05 Jan 2026 12:05:54 +0000

You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.

Prerequisites
How to Set Up Your R Environment
How to Use Data Types in R
How to Use Data Structures in R
How to Import Data in R
How to Visualize Data with ggplot2
How to Build Statistical Models in R
Conclusion

Prerequisites

Before we get started, you should have the following:

R installed (version 4.0 or higher).
RStudio installed (recommended for beginners).
Basic familiarity with programming concepts such as variables and functions.
A basic understanding of statistics (mean, correlation, regression).

How to Set Up Your R Environment

Before you start working with data, load the required libraries:

library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files

These commands load the required libraries into R. tidyverse is a collection of packages used for data manipulation and visualization, including ggplot2. readxl allows you to import Excel files directly into R without converting them to CSV format first.

How to Use Data Types in R

Knowing data types helps you avoid errors and choose the right analysis methods.

Common Data Types

Data type	Example	Use case
Numeric	`x <- 5.7`	Measurements, prices
Integer	`y <- 10L`	Counts
Character	`"House prices"`	Text labels
Logical	`TRUE`	Conditions
Complex	`2 + 3i`	Advanced math

Numeric Data Types in R

price <- 199.99
tax <- 16.5
total_cost <- price + tax
total_cost

Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.

Integer Data Types in R

students <- 30L
classes <- 4L
total_students <- students * classes
total_students

Integers are whole numbers and are commonly used for counting. The L tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.

Character Data Types in R

course_name <- "Data Science"
university <- "Harvard University"
paste(course_name, "at", university)

Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the paste() function. This data type cannot be used in mathematical operations.

Logical Data Types in R

score <- 75
passed <- score >= 50
passed

Logical data represents Boolean values: TRUE or FALSE. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns TRUE because the score meets the requirement. Logical values are essential in decision-making and control flow.

Complex Data Types in R

Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.

z <- 2 + 3i
Mod(z)

This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.
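If you are unsure which type a value has, class() will tell you. The sketch below checks one value of each type from the sections above and shows a simple conversion. The `values` list is just a convenient container made up for this example.

```r
# One example value per data type, reusing values from the sections above
values <- list(
  price    = 199.99,
  students = 30L,
  course   = "Data Science",
  passed   = TRUE,
  z        = 2 + 3i
)

# class() reports the data type of each value
types <- sapply(values, class)
types

# Text that looks like a number must be converted before doing arithmetic
tax_text <- "16.5"
tax_plus_one <- as.numeric(tax_text) + 1
tax_plus_one
```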

How to Use Data Structures in R

R stores data in different structures depending on your goals. This is important because choosing the right structure makes operations easier, and R’s functions behave differently depending on the structure. Moreover, structures help R understand whether your data are numbers, categories, or text.

Common Data Structures in R

Structure	Best for
Vector	Single column of data
Matrix	Numeric tables
Data Frame	Spreadsheet-like data
List	Mixed objects

vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

str(lst) ##shows the structure of the list

Let’s understand the code above:

vec is a vector that stores a single type of data.
mat is a matrix that organizes numeric values into rows and columns.
df is a data frame that works like a spreadsheet, allowing different data types in each column.
lst is a list that stores multiple objects of different types.
The str() function shows how these objects are nested within the list.
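Once the objects exist, you will usually want to pull values back out of them. The sketch below recreates the same four objects and shows a few common ways to access their contents. The variable names on the left (such as `second_number`) are made up for this example.

```r
vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df  <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

second_number <- lst$numbers[2]       # vector element by position
cell_2_3      <- lst$matrix[2, 3]     # matrix cell: row 2, column 3
vehicle_names <- lst[["info"]]$Name   # data frame column inside the list
over_100      <- df[df$Number > 100, ]  # rows of df that meet a condition

second_number
cell_2_3
vehicle_names
over_100
```

Note that matrix() fills values column by column, so the third column of `mat` holds 7, 8, and 9.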

How to Import Data in R

Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.

For Windows: Replace single backslashes (\) with either double backslashes (\\) or single forward slashes (/). For example:

Windows
data <- read.csv("C:\\Users\\file\\Documents\\data.csv")
# or, equivalently:
data <- read.csv("C:/Users/file/Documents/data.csv")

For macOS/Linux: Single forward slashes work fine:

macOS/Linux
data <- read.csv("/Users/file/Documents/data.csv")

How to Read a CSV and Excel File

# Import a CSV file (Windows paths shown; either form works)
data <- read.csv("C:/Users/file/Documents/data.csv")
# data <- read.csv("C:\\Users\\file\\Documents\\data.csv")

head(data)

You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (/) or double backslashes (\\). The imported data is stored as a data frame named data.

data_excel <- read_excel("C:/Users/file/Documents/HR Data Set.xlsx")
head(data_excel)

You can import an Excel file into R using the read_excel() function from the readxl package. The head() function is then used to preview the first few rows of the dataset.

Use the following commands to understand your data:

str(data)
summary(data)

str(data_excel)
summary(data_excel)

str() shows the structure of the dataset, including column names and data types. summary() provides descriptive statistics such as minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.
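If you don't have a file at hand yet, you can try the import workflow with a temporary CSV file. The sketch below writes a small made-up data frame to R's temporary directory and reads it back. The file name `sample_data.csv` and the data are assumptions for illustration.

```r
# Small made-up data frame to write out
sample_data <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))

# file.path() builds a path with the right separators on any operating system
csv_path <- file.path(tempdir(), "sample_data.csv")
write.csv(sample_data, csv_path, row.names = FALSE)

# Read the file back in and inspect it
imported <- read.csv(csv_path)
head(imported)
str(imported)
summary(imported)
```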

How to Visualize Data with ggplot2

Visualization helps you spot patterns before you build models.

Scatter Plot Example

We’ll use the built-in mtcars dataset in R. First, load the dataset and the ggplot2 library:

data(mtcars)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

Let us break down the code to grasp it fully:

data(mtcars) loads the built-in mtcars dataset, which contains information about car specifications.
library(ggplot2) enables data visualization.
aes() maps your dataset columns to the plot, defining the x and y values and the color grouping.
geom_point() draws the points and styles them outside aes(), for example by setting the point size and color.
geom_smooth() adds a trend line. Here, we use method="lm" to fit a linear regression line. The se=TRUE/FALSE option controls the shading for confidence intervals. Use TRUE if you want the shading and FALSE if you don’t.
labs() labels the plot by setting the title, x-axis, and y-axis labels.
Finally, we set the plot theme using theme_minimal().

Running this code will produce a scatterplot showing fuel efficiency by weight and cylinders.

How to Build Statistical Models in R

Linear Regression

You can use linear regression for continuous outcomes, basically to predict numerical values. For example, to predict a car’s miles per gallon (mpg) based on weight (wt) and horsepower (hp), you can use this formula:

lm_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)

But what does it mean?

lm() stands for linear model.
The response variable is mpg. This is the outcome you want to predict.
Predictor variables are wt and hp. These explain changes in the response.

Once you run the model, it should look like this in your console:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Here’s an interpretation of the linear regression model:

You created a model on miles per gallon (mpg) based on weight (wt) and horsepower (hp).
The intercept 37.227 is the predicted mpg when wt=0 and hp=0. In other words, when all other variables are 0, the base mpg is 37.227. The intercept is always the baseline value of the outcome when all other variables in the model are zero.
With every additional unit of weight (1000 lbs), the mpg decreases by 3.878, holding horsepower constant. This variable affects the mpg greatly, as seen with the p-value. The p-value is <0.001, hence strong and statistically significant.
With every additional unit of horsepower, the mpg decreases by 0.032, holding weight constant. This variable affects the mpg, as seen with the p-value being 0.00145, which is less than 0.01, indicating that horsepower is a statistically significant predictor of mpg, although its effect is smaller compared to vehicle weight.

Does the Model Fit the Data, and Why?

The R-squared value shows that 83% of the variation in mpg is explained by weight and horsepower.

Summary of the interpretation: Cars that are heavier and with more horsepower have lower fuel efficiency. These two variables explain most of the variation in mpg in the dataset.
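To turn these coefficients into a prediction, you can pass new data to predict(). The sketch below refits the same model and predicts mpg for one car. The car's weight (3, meaning 3000 lbs) and horsepower (110) are hypothetical values chosen for illustration.

```r
lm_model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical car: 3000 lbs and 110 horsepower
new_car <- data.frame(wt = 3, hp = 110)
predicted_mpg <- predict(lm_model, newdata = new_car)
predicted_mpg

# The same prediction by hand: intercept + b_wt * wt + b_hp * hp
manual_mpg <- sum(coef(lm_model) * c(1, 3, 110))
manual_mpg
```

Both approaches give roughly 22 mpg for this car, which shows that predict() is simply applying the regression equation.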

Logistic Regression

You can use logistic regression for binary outcomes, like yes/no questions. For example, predicting whether a vehicle is automatic or manual based on weight and horsepower.

glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)

Let’s understand the code:

glm() stands for generalized linear model.
The family=binomial option tells R to run logistic regression.
The response variable am indicates transmission type: 0 = automatic, 1 = manual.
Predictor variables remain wt and hp.

Once you run the model, it should look like this in your console:

Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
hp           0.03626    0.01773   2.044  0.04091 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Here’s an interpretation of the logistic regression model:

The intercept 18.866 represents the log-odds of a car being manual when wt=0 and hp=0. In other words, when all other variables are 0, the baseline log-odds of the outcome is 18.866. The intercept is always the baseline value of the outcome when all other variables in the model are zero.
With every additional unit of weight (1000 lbs), the log odds of the car being manual decrease by 8.083. This variable strongly affects the probability of the car being manual, as seen with the p-value being 0.008, which is statistically significant.
With every additional unit of horsepower, the log odds of the car being manual increase by 0.036. This variable also affects the probability of being manual, as seen with the p-value being 0.041, which is statistically significant.

Summary of the interpretation: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, wt and hp explain a large portion of transmission type variation.
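Because log-odds are hard to picture, it can help to convert the results into odds ratios and probabilities. The sketch below refits the same model, exponentiates the coefficients, and predicts the probability of a manual transmission for one car. The car's values (wt = 3, hp = 110) are hypothetical.

```r
glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Odds ratios: values below 1 lower the odds of manual, values above 1 raise them
odds_ratios <- exp(coef(glm_model))
odds_ratios

# Predicted probability of a manual transmission for a hypothetical car
new_car <- data.frame(wt = 3, hp = 110)
prob_manual <- predict(glm_model, newdata = new_car, type = "response")
prob_manual
```

Setting type = "response" asks predict() for a probability between 0 and 1 instead of the default log-odds.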

Conclusion

In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.

This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.

Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.

You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.

With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.

Web Scraping With RSelenium (Chrome Driver) and Rvest

Elabonga Atuo — Mon, 17 Mar 2025 13:44:10 +0000

Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use.

In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target website displays different books across multiple pages, requiring navigation between them. You'll learn how to use RVest for data extraction and RSelenium to automate button clicks.

There are a couple of housekeeping rules when it comes to harvesting data on the internet:

Inspect the robots.txt file: Check the robots.txt file of a website to understand what data you are allowed to extract. You can find this file by appending “/robots.txt” to the website's home URL.
Review terms and conditions: Before scraping, read the website's terms and conditions to understand the legal expectations regarding data extraction.
Limit requests: Avoid overloading the server with requests by implementing rate limiting. The polite library in R can help manage request rates effectively.

Let’s dive in!

Project Overview
Project Setup
How to Understand and Inspect a Webpage
How to Extract Data Using RVest
How to Mimic Human Behaviour Using RSelenium
How to Combine RSelenium & RVest and Save to CSV
Bringing it All Together
Conclusion

Project Overview

Here’s what we’re going to be building:

This approach to web scraping allows you to see the browser in action as it navigates and extracts data from the website. Unlike headless browsing, where everything runs in the background without a visible interface, this method provides a graphical UI, making it easier to monitor and debug the process.

To practice your data mining skills, you will be scraping data from a website built specifically for that: Books To Scrape. You are going to be using a driver to drive a browser which will then open your target website. It’ll navigate from the first page, mimicking human behaviour (clicking the next button) while collecting data about the books, right to the last page.

Project Setup

Prerequisites:

To follow along with this tutorial, you will need:

R programming knowledge
HTML knowledge
RStudio installed

Note that I’m building this tutorial on a Windows machine.

Setup and Install Chrome Driver

First, you’ll want to check to make sure you have Java installed on your computer by running this terminal command:

java -version

If it’s not present, download and install Java here.

Next, install the Chrome browser if you don’t already have it. Once it’s installed, check for your browser version in the settings section.

Then you can download the Browser Driver that corresponds to your Browser Version here. Check where other browser drivers are stored on your device by running this in the RStudio console:

# install and load wdman and binman packages
install.packages("wdman")
library(wdman)

install.packages("binman")
library(binman)

# check drivers already installed
binman::list_versions(appname = "chromedriver")

# check browser driver locations
wdman::selenium(retcommand = TRUE, check = FALSE)

Extract the driver “.exe” and store it at the specified folder location. This is usually the following location:

"C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\chromedriver.exe"

Now, add the drivers to your system path by specifying the folder path excluding the application. Confirm installation by running the following terminal command.

# Chromedriver SYSTEMS PATH: "C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\"
# check chromedriver installation
chromedriver -version

How to Understand and Inspect a Webpage

A webpage is a visual representation of an HTML document that is available on the internet and accessed through a web browser. The components of a webpage, called elements, are structured hierarchically in an HTML DOM (Document Object Model) tree. Each element can be located using specific paths called selectors or locators, which you can read more about here.

Developer Tools are a set of tools available in your browser. They’re helpful for inspecting and analyzing a webpage’s structure. The feature “Inspect“ helps examine the structure and styling of a specific element. You can access this feature by selecting the element you would like to inspect, right clicking on it, and clicking “Inspect”.

How to Extract Data Using RVest

RVest is an R package that contains a set of functions that enable you to extract data from HTML and XML web pages.

We are interested in extracting the following information about books from every page on the website’s catalogue:

Book Title
Book Rating
Book Price
Individual Book Link
Cover Image Link

Let’s go through the steps for using RVest to extract this data.

Step 1: Load the webpage

To load the first page of your target website and parse the HTML document using the RVest package in R, follow these steps:

Install and load the RVest package: If you haven't already installed the RVest package, you can do so by running the following command in R:
```
 install.packages("rvest")
```
Then, load the package:
```
 library(rvest)
```
Load the webpage and parse the HTML: Use the read_html() function from the RVest package to fetch and parse the HTML content of the webpage. Here's an example of how to do this:
```
 # Specify the URL of the target website
 url <- "https://books.toscrape.com/"

 # Fetch and parse the HTML content
 webpage <- read_html(url)
```

This code will download the HTML content of the specified webpage and convert it into an XML document, making it easier to structure and organize the data for further processing or storage.

Step 2: Identify the target elements

The target elements are the HTML elements that contain the specific data you intend to extract.

A quick inspection of the webpage using developer tools shows that each book’s information is contained in an article tag and forms part of an ordered list. It’s important to specify the

The pipe %>% operator facilitates chaining operations, making it easier to extract elements step by step. html_element() returns the first matching element while html_elements() returns all the elements that match the defined path.

# define the path from which other details will be extracted
# (webpage is the parsed document created with read_html() in Step 1)
book <- webpage %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

# extracting details using css locators.
# title
title <- book %>% 
  html_element("h3 a") %>% 
  html_attr("title")

# rating
rating <- book %>% 
  html_element("p") %>% 
  html_attr("class")

# price
price <- book %>% 
  html_element(".product_price p") %>% 
  html_text2()

#link to book page
book_link <- book %>% 
  html_element("h3 a") %>% 
  html_attr("href")

# cover page image link
cover_page_link <- book %>% 
  html_element(".image_container a img") %>% 
  html_attr("src")

# inspect right format by selecting the first element of each detail
title[[1]]
rating[[1]]
price[[1]]
book_link[[1]]
cover_page_link[[1]]
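
The difference between html_element() and html_elements() is easier to see on a small inline document. The sketch below parses a hand-written HTML string, so it runs without a network connection. The markup is an assumption made for illustration: it only imitates the structure of the catalogue page and is not copied from it.

```r
library(rvest)

# A tiny, made-up HTML fragment that imitates the catalogue structure
html <- '
<ol>
  <li><article><p class="star-rating Three"></p><h3><a href="a.html" title="Book A">Book A</a></h3></article></li>
  <li><article><p class="star-rating Five"></p><h3><a href="b.html" title="Book B">Book B</a></h3></article></li>
</ol>'

doc <- read_html(html)

# html_elements() returns every match: one node per <li>
items <- doc %>% html_elements("li")
length(items)

# html_element() applied to a node set returns the first match inside each node,
# so the result has the same length as the node set
titles <- items %>%
  html_element("h3 a") %>%
  html_attr("title")
titles
```

Because html_element() keeps one result per parent node, the extracted vectors (titles, ratings, prices) stay aligned book by book.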

Step 3: Clean the “rating” data

To clean the "star-rating" data, you can use the stringr package in R to remove the unnecessary text and trim any whitespace. Here's how you can do it:

library(stringr)

# Example of extracted rating data
rating_data <- "star-rating Three"

# Remove "star-rating " and trim whitespace
cleaned_rating <- str_trim(str_replace(rating_data, "star-rating ", ""))

# Output the cleaned rating
cleaned_rating

This code will output "Three", effectively removing the "star-rating" prefix and any leading or trailing whitespace.
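
If you want the rating as a number rather than a word, one option is a named lookup vector. This is a minimal sketch in base R. It assumes the site only uses the words One to Five in the class attribute, and the name rating_lookup is hypothetical.

```r
# Example class attributes as they might come back from html_attr("class")
rating <- c("star-rating Three", "star-rating One", "star-rating Five")

# Strip the prefix and surrounding whitespace (base R equivalents of the stringr calls)
rating_word <- trimws(sub("star-rating", "", rating, fixed = TRUE))

# Hypothetical lookup table: assumes the ratings are always One..Five
rating_lookup <- c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)

# Index the lookup by name, then drop the names to get a plain numeric vector
rating_number <- unname(rating_lookup[rating_word])
rating_number
```

Any word that is missing from the lookup would come back as NA, which makes unexpected values easy to spot.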

How to Mimic Human Behaviour Using RSelenium

How Selenium Works

Selenium is a tool that allows you to simulate user actions on a website, usually for testing purposes. RSelenium is an R library that allows you to access this functionality.

We need a script, a browser, and a browser driver to mimic user behaviour. The code you write that contains the instructions detailing the actions you would like to automate is the script. The browser driver acts as a bridge between your script and the browser and performs your desired actions by translating the script into actions.

The script, when run, is the client which requests and receives info from the browser driver’s server.

When you run a script, the script is converted to JSON format data which is then transferred to the browser driver via the JSON Wire Protocol. A protocol is simply a set of rules that define how data should be managed and handled during transfer across devices.

The driver receives and validates the received data. If successful, it communicates the actions defined in the script to the browser. If it’s unsuccessful, an error is sent to the client.

On browser initialization, the driver performs the actions step by step. This carries on to completion or until an error is encountered (missing elements, server errors, and so on). The bidirectional communication between the driver and browser is via HTTP. Finally, the results are sent back to the client and the browser is shut down.

# install and load RSelenium
install.packages("RSelenium")
library(RSelenium)

# initialize and run the chrome driver
rD <- rsDriver(browser = "chrome", port = 4567L)

# extract and assign the client
remDr <- rD[["client"]]

Running rsDriver() starts a Selenium server that launches ChromeDriver. Extract and assign the rD[["client"]] to a variable. This variable allows you to control and interact with the browser.

Sometimes, starting the driver may fail due to reasons such as permission restrictions, missing dependencies, or incorrect setup. If that happens, you can manually launch ChromeDriver by adding the following block of code right after loading the libraries at the top of the script. It is important to ensure the port numbers match.

# chrome() is provided by the wdman package
library(wdman)
cDrv <- chrome(verbose = FALSE, check = FALSE, port = 4567L)
cDrv$process

Now, navigate to the target webpage:

# navigate to the target site
remDr$navigate("https://books.toscrape.com/")

#maximize Chrome Window Size
remDr$maxWindowSize()

And scroll to the bottom of the page:

# scroll to the bottom of the page
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

The above code locates the body element and simulates pressing the End key, which scrolls to the bottom of the page.

Now, click Next to navigate to the next page:

# locate next button and click next
nextPage <-  remDr$findElement(using = "css selector",
                               value = ".next > a")
nextPage$clickElement()

This code finds the element that contains the link to the next page and clicks it, which takes the browser to that page.

Now we’re going to write a while loop that navigates through all the pages, up to page 50, and then closes the browser once it’s done.

A while loop executes a piece of code as long as a specific condition is met. Once the condition is not met, the loop exits.

while(condition is TRUE){
    #DO SOMETHING
}
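
As a concrete version of the template above, here is a small loop you can run as-is. The variable names and the value of last_page are illustrative only and have nothing to do with the real catalogue.

```r
page <- 1
last_page <- 5  # illustrative value, not the real catalogue size

# keep going while the condition is TRUE
while (page < last_page) {
  page <- page + 1  # stand-in for clicking "Next"
}

# the loop exits as soon as page reaches last_page
page
```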

Write a loop that ensures the next page button is clicked as long as the element containing the link to the next page is visible in the HTML DOM.

First, locate the next button element. Its presence in the open webpage makes sure that the loop runs.

The last page does not have a next button, so the loop will exit when it reaches that page (and Selenium will throw an error due to the missing element).

nextPage <- remDr$findElement(using = "css selector", value = ".next > a")

Wrap the nextPage element search in a tryCatch() block. This prevents the script from crashing if the 'Next' button is missing. If an error occurs, tryCatch() returns NULL, signaling that there are no more pages to navigate.

An if block then checks for a NULL value. If encountered, a message is displayed to inform the client that no 'Next' button was found, and the break statement exits the loop.

Finally, close the browser once the driver navigates to the last page (page 50 in the catalogue) to free up system resources using remDr$close().


while (TRUE) {  
  # Try to find and click "Next" button
  nextPage <- tryCatch({
    remDr$findElement(using = "css selector", value = ".next > a")
  }, error = function(e) {
    return(NULL)  # No more pages
  })

  if (is.null(nextPage)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  nextPage$clickElement()
  Sys.sleep(3)  # Allow next page to load

}
print("finished scraping")
remDr$close()
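
The tryCatch() and break pattern can be tried out without a browser. In the sketch below, find_next() is a hypothetical stand-in for remDr$findElement(): it raises an error once the last page is reached, which is how a missing element behaves in the loop above. The page count is made up.

```r
last_page <- 3      # illustrative; the real catalogue has more pages
current_page <- 1

# Hypothetical stand-in for remDr$findElement():
# raises an error when there is no "Next" link on the page
find_next <- function(page) {
  if (page >= last_page) stop("no such element")
  page + 1
}

visited <- c()
while (TRUE) {
  visited <- c(visited, current_page)   # stand-in for scraping the page

  # convert the error into NULL instead of letting it stop the script
  next_page <- tryCatch(find_next(current_page), error = function(e) NULL)

  if (is.null(next_page)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  current_page <- next_page             # stand-in for clickElement()
}

visited
```

Every page, including the last one, is visited exactly once because the page is processed before the check for the next link.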

How to Combine RSelenium & RVest and Save to CSV

Now that we’ve extracted data from specific HTML elements using RVest and automated user actions using RSelenium, let’s combine the two to scrape data from all the pages in the website.

Create a scrape books function

You will be saving the scraped books information in a CSV file. First, create an empty dataframe to hold the scraped data:

# install and load dplyr for dataframe manipulation
install.packages("dplyr")
library(dplyr)

# create a dataframe to hold book information
Books <-  data.frame()

Retrieve and parse the webpage

For RVest to work with RSelenium, you have to retrieve the HTML source of the page currently loaded in the Selenium-controlled browser. Use remDr$getPageSource()[[1]] to extract the HTML content.

page <- remDr$getPageSource()[[1]]

Convert the HTML content to XML using read_html() like this:

    # define the path from which other details will be extracted
    book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

Extract each book’s details using CSS selectors with rvest functions. The scraped objects returned are XML objects and lists. They need to be formatted to character strings, preventing unexpected data type issues when working with the data. Do this by piping as.character() at the very end of each extracted detail.

    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character()

Wrap the block of code used to extract details from HTML elements in a function and return a dataframe whose column values are the book details. This makes the code reusable and modular.


scrape_books <- function() {
    page <- remDr$getPageSource()[[1]]

    # define the path from which other details will be extracted
    book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

    # extracting details using css locators.
    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character() 

    # rating
    rating <- book %>% 
      html_element("p") %>% 
      html_attr("class") %>% 
      as.character() 

    cleaned_rating <- str_trim(gsub("star-rating", "", rating))

    # price
    price <- book %>% 
      html_element(".product_price p") %>% 
      html_text2() %>% 
      as.character() 

    #link to book page
    book_link <- book %>% 
      html_element("h3 a") %>% 
      html_attr("href") %>% 
      as.character() 

    # image link
    cover_page_link <- book %>% 
      html_element(".image_container a img") %>% 
      html_attr("src") %>% 
      as.character() 

    return(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = FALSE))
}

Write to CSV

Save the dataframe to a CSV file named “books.csv”:

write.csv(Books, file = "./books.csv", fileEncoding = "UTF-8")
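
To check what actually lands in the file, you can write a small dataframe to a temporary file and read it back. This is a sketch with made-up rows. The row.names = FALSE argument is an optional addition that is not used in the call above: it stops write.csv() from adding an extra column of row numbers.

```r
# A small stand-in for the scraped data (values are made up)
Books <- data.frame(
  title = c("Book A", "Book B"),
  price = c("51.77", "53.74"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids an extra unnamed index column in the file
write.csv(Books, file = out_file, row.names = FALSE, fileEncoding = "UTF-8")

# read the file back to confirm the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
dim(check)
names(check)
```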

Bringing it All Together

Let’s review what we’ve done so far: First, the script to scrape book data begins by loading the browser, maximizing the window size, and navigating to the Books To Scrape Page.

Then we created an empty dataframe to hold the scraped data. We scraped the data from the first page, saved it to the dataframe, and located the ‘Next’ button in order to navigate to the next page, from which we again scraped and stored data.

The process of scraping, adding to the dataframe, and clicking the next page button continues until the ‘Next’ button is no longer available in the HTML DOM.

Once the last page has been reached, the code exits the loop and saves the data to CSV. Finally, it closes the driver to free up system resources.

# load libraries
library(wdman)
library(binman)
library(rvest)
library(stringr)
library(RSelenium)
library(dplyr)


cDrv <- chrome(verbose = FALSE, check = FALSE, port = 4450L)
cDrv$process

rD <- rsDriver(browser = "chrome", port = 4450L)
remDr <- rD[["client"]]


remDr$navigate("https://books.toscrape.com/")
remDr$maxWindowSize()

# scroll to the bottom of the first page
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# create an empty dataframe to hold the scraped book details
Books <- data.frame()

scrape_books <- function() {
    page <- remDr$getPageSource()[[1]]

    # define the path from which other details will be extracted
    book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

    # extracting details using css locators.
    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character() 

    # rating
    rating <- book %>% 
      html_element("p") %>% 
      html_attr("class") %>% 
      as.character() 

    cleaned_rating <- str_trim(gsub("star-rating", "", rating))

    # price
    price <- book %>% 
      html_element(".product_price p") %>% 
      html_text2() %>% 
      as.character() 

    # link to book page
    book_link <- book %>% 
      html_element("h3 a") %>% 
      html_attr("href") %>% 
      as.character() 

    # image link
    cover_page_link <- book %>% 
      html_element(".image_container a img") %>% 
      html_attr("src") %>% 
      as.character() 

    return(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = FALSE))
}

# scrape every page in turn, starting with the first one;
# the loop below scrapes the current page before clicking "Next"

while (TRUE) {
  # scrape current page
  Books <- rbind(Books, scrape_books())

  # find and click "next" button
  nextPage <- tryCatch({
    remDr$findElement(using = "css selector", value = ".next > a")
  }, error = function(e) {
    return(NULL)  # No more pages
  })

  # exit loop if "next" button is missing
  if (is.null(nextPage)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  nextPage$clickElement()
  # Allow next page to load
  Sys.sleep(3)  

}

write.csv(Books, file = "./books.csv", fileEncoding = "UTF-8")
print("finished scraping")
remDr$close()

Conclusion

In this tutorial, you learned how to effectively combine RSelenium and RVest to scrape data from a website. By leveraging RSelenium, you can automate user interactions and navigate through web pages, while RVest allows you to extract specific data from HTML elements.

This approach provides a powerful and flexible method for web scraping, enabling you to handle dynamic content and mimic human behavior. By following the steps outlined here, you can successfully scrape data from multiple pages and save it to a CSV file for further analysis.

How to Model an Epidemic with R

freeCodeCamp — Tue, 30 Mar 2021 14:46:38 +0000

By Peter Gleeson

Epidemiology has never been more topical. It is the scientific study of how health and disease affect populations, including infectious diseases such as COVID-19.

Key to understanding the spread of such diseases is the practice of epidemic modeling. This involves building quantitative models to describe and forecast the spread of disease.

The classical approach to epidemic modeling is to use a type of mathematical model known as a "compartmental model".

The approach is as follows:

Assign each individual in the population to one of several compartments, based on their infection status.
Then, define the rates at which individuals move between compartments as their status updates.
Use this model to define differential equations that can predict the course of the epidemic.

The SI model is the most basic form of compartmental model. It has two compartments: "susceptible" and "infectious".

The SIR model adds an extra compartment called "recovered". This model is often used as a baseline in epidemiology. It is a simplistic model that nevertheless characterises the progression of an epidemic reasonably well.

An extension to the SIR model (and the one we will consider in more detail in this article) is the SEIR model. This adds one more compartment – "exposed".

What is the SEIR model?

The basic SEIR model has four compartments:

"Susceptible" – individuals who have not been exposed to the virus
"Exposed" – individuals exposed to the virus, but not yet infectious
"Infectious" – exposed individuals who go on to become infectious
"Recovered" – infectious individuals who recover and become immune to the virus

The population size N is taken as the sum of the individuals in the four compartments.

The flow of individuals between compartments is characterised by a number of parameters.

β - "beta"

β is the transmission coefficient. Think of this as the average number of infectious contacts an infectious individual in the population makes each time period. A high value of β means the virus has more opportunity to spread.

σ - "sigma"

σ is the rate at which exposed individuals become infectious. Think of it as the reciprocal of the average time it takes to become infectious. That is, if an individual becomes infectious after 4 days on average, σ will be 1/4 (or 0.25).

γ - "gamma"

γ is the rate at which infectious individuals recover. As before, think of it as the reciprocal of the average time it takes to recover. That is, if it takes 10 days on average to recover, γ will be 1/10 (or 0.1).

μ - "mu"

μ is an optional parameter to describe the mortality rate of infectious individuals. The higher μ is, the more deadly the virus.
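
The reciprocal relationship between average durations and rates can be written out directly. The sketch below reuses the 4-day and 10-day figures from the examples above. The variable names incubation_days and infectious_days are hypothetical.

```r
incubation_days <- 4    # average time from exposure to becoming infectious
infectious_days <- 10   # average time taken to recover

# each rate is the reciprocal of the corresponding average duration
sigma <- 1 / incubation_days
gamma <- 1 / infectious_days

c(sigma = sigma, gamma = gamma)
```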

From these parameters, you can construct a set of differential equations. These describe the rate at which each compartment changes size.

Let's start with the "susceptible" compartment, S.

Equation (1) - Susceptible

The first thing to see from the model is that there is no way S can increase over time. There are no flows back into the compartment. Equation (1) must be negative, as S can only ever decrease.

In what ways can an individual leave compartment S?

Well, they can become infected by an infectious individual in the population.

At any stage, the proportion of infectious individuals in the population = I/N.

And the proportion of susceptible individuals will be S/N.

Under the assumption of perfect mixing (that is, individuals are equally likely to come into contact with any other in the population), the probability of any given contact being between an infectious and susceptible individual is (I / N) * (S / N).

This is multiplied by the number of contacts in the population. This is found by multiplying the transmission coefficient β, by the population size N.
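
As a quick numerical check of this reasoning, the sketch below plugs in some made-up values (a population of 1,000 with 10 infectious individuals) and confirms that multiplying the contact probability by the number of contacts gives the same result as the simplified expression β·S·I/N.

```r
beta <- 0.5   # transmission coefficient
N <- 1000     # population size (illustrative)
S <- 990      # susceptible individuals
I <- 10       # infectious individuals

# probability that a given contact is between an infectious
# and a susceptible individual, assuming perfect mixing
p_contact <- (I / N) * (S / N)

# number of contacts in the population per time period
contacts <- beta * N

# individuals leaving S per time period
new_exposures <- contacts * p_contact
new_exposures

# the simplified form gives the same value
all.equal(new_exposures, beta * S * I / N)
```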

Combining that all together and simplifying gives equation (1):

$$\frac{dS}{dt} = -\beta N \cdot \frac{I}{N} \cdot \frac{S}{N} = -\frac{\beta S I}{N}$$

Equation (2) - Exposed

Next, let's consider the "exposed" compartment, E. Individuals can flow into and out of this compartment.

The flow into E will be matched by the flow out of S. So the first part of the next equation will simply be the opposite of the previous term.

Individuals can leave E by moving into the infectious compartment. This happens at a rate determined by two variables – the rate σ and the current number of individuals in E.

So overall equation (2) is:

$$\frac{dE}{dt} = \frac{\beta S I}{N} - \sigma E$$

Equation (3) - Infectious

The next compartment to consider is the "infectious" compartment, I.

There is one way into this compartment, which is from the "exposed" compartment.

There are two ways an individual can leave the "infectious" compartment.

Some will move to "recovered". This happens at a rate γ.

Others will not survive the infection. They can be modeled using the mortality rate μ.

So equation (3) looks like:

$$\frac{dI}{dt} = \sigma E - \gamma I - \mu I$$

Equation (4) - Recovered

Now let's look at the "recovered" compartment, R.

This time, individuals can flow into the compartment (determined by the rate γ).

And no individuals can flow out of the compartment (although in some models, it is assumed possible to move back into the "susceptible" compartment).

So the overall equation (4) looks like this:

$$\frac{dR}{dt} = \gamma I$$

Equation (5) - Mortality (optional)

Using similar reasoning, you could also construct equation (5) for the change in mortality:

$$\frac{dM}{dt} = \mu I$$

You might consider this a fifth compartment in the model.

If you set μ to zero, you can exclude this aspect of the model.

So now you have the full set of differential equations (1-5).

An important number in any epidemic model is known as the basic reproduction number, or R₀. This is defined as:

This number estimates the number of people who will be infected by the average infectious individual.

Therefore, it is a crucial number:

If R₀ is above 1, then an outbreak of the virus is likely to become an epidemic
If R₀ is below 1, then an outbreak is likely to be contained

How to solve these equations

In order to use the model to predict the course of the epidemic, it is necessary to solve the system of equations.

This can be done using the R programming language.

In particular, you can use a package called deSolve to solve the differential equations with respect to a time variable.

In R, paste the following code:

require(deSolve)

SEIR <- function(time, current_state, params){

  with(as.list(c(current_state, params)),{
    N <- S+E+I+R
    dS <- -(beta*S*I)/N
    dE <- (beta*S*I)/N - sigma*E
    dI <- sigma*E - gamma*I - mu*I
    dR <- gamma*I
    dM <- mu*I

    return(list(c(dS, dE, dI, dR, dM)))
  })
}

This code imports the deSolve package.

It then defines a function called SEIR. It takes three arguments:

The current time step.
A list of the current states of the system (that is, the estimates for each of S, E, I and R at the current time step).
A list of parameters used in the equations (recall these are β, σ, γ and μ).

Inside the function body, you define the system of differential equations as described above. These are evaluated for the given time step and are returned as a list. The order in which they are returned must match the order in which you provide the current states.
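
To see what the function returns, you can call it by hand for a single state before handing it to a solver. The sketch below repeats the function so that it runs on its own and does not need deSolve. The state values are made up for illustration, and the rates are given names only to make the printed output easier to read.

```r
SEIR <- function(time, current_state, params) {
  with(as.list(c(current_state, params)), {
    N <- S + E + I + R
    dS <- -(beta * S * I) / N
    dE <- (beta * S * I) / N - sigma * E
    dI <- sigma * E - gamma * I - mu * I
    dR <- gamma * I
    dM <- mu * I

    # same order as the state vector: S, E, I, R, M
    list(c(dS = dS, dE = dE, dI = dI, dR = dR, dM = dM))
  })
}

params <- c(beta = 0.5, sigma = 0.25, gamma = 0.2, mu = 0.001)
state <- c(S = 990, E = 5, I = 5, R = 0, M = 0)   # illustrative values

# the function returns a list whose first element holds the rates of change
rates <- SEIR(0, state, params)[[1]]
rates

# every flow out of one compartment is a flow into another,
# so the rates sum to zero (up to floating point error)
sum(rates)
```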

Now take a look at the code below:

params <- c(beta=0.5, sigma=0.25, gamma=0.2, mu=0.001)

initial_state <- c(S=999999, E=1, I=0, R=0, M=0)

times <- 0:365

This initialises the parameters and initial state (starting conditions) for the model.

It also generates a vector of times from zero to 365 days.

Now, create the model:

model <- ode(initial_state, times, SEIR, params)

This uses deSolve's ode() function to solve the equations with respect to time.

See here for the documentation.

The arguments required are:

The initial state for each of the compartments
The vector of times (this example solves for up to 365 days)
The SEIR() function, which defines the system of equations
A vector of parameters to pass to the SEIR() function

Running:

summary(model)

...will give the summary statistics of the model.

               S            E            I         R         M
Min.    108263.6 3.616607e-07 0.000000e+00      0.00    0.0000
1st Qu. 108263.7 5.957435e-03 1.414971e-02  63894.43  319.4721
Median  108395.7 8.470071e+00 1.273726e+01 886814.36 4434.0718
Mean    362798.6 9.745754e+03 1.212158e+04 612272.74 3061.3637
3rd Qu. 852375.5 1.734331e+03 2.533956e+03 887299.83 4436.4991
Max.    999999.0 1.092967e+05 1.265161e+05 887299.86 4436.4993
N          366.0 3.660000e+02 3.660000e+02    366.00  366.0000
sd      381257.2 2.475783e+04 2.969234e+04 387333.47 1936.6673

Already, you will find some interesting insights.

Out of a million individuals, 108,264 did not become infected.
At the peak of the epidemic, 126,516 individuals were infectious simultaneously.
887,300 individuals recovered by the end of the model.
A total of 4436 individuals died during the epidemic.

You can also visualise the evolution of the pandemic using the matplot() function.

Alternatively, you could use another plotting library such as ggplot2 to produce better quality graphics.

matplot(model, type="l", lty=1, main="SEIR model", xlab="Time")

legend <- colnames(model)[2:6]

legend("right", legend=legend, col=2:6, lty = 1)

The plot is shown below:

You can also coerce the model output to a dataframe type. Then, you can analyse the model further.

infections <- as.data.frame(model)$I

peak <- max(infections)

match(peak, infections)

The code above reveals that the number of infections peaked on day 112.

Using other libraries, such as dplyr, would let you carry out analysis as advanced as you'd like.
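
One detail worth handling explicitly is that the times vector starts at zero, so a row index and a day number differ by one. The sketch below is one way to read the peak day from the time column instead of the row position. It repeats the model definition so that it runs on its own, and the names out, peak_row, peak_day and peak_size are hypothetical.

```r
library(deSolve)

SEIR <- function(time, current_state, params) {
  with(as.list(c(current_state, params)), {
    N <- S + E + I + R
    dS <- -(beta * S * I) / N
    dE <- (beta * S * I) / N - sigma * E
    dI <- sigma * E - gamma * I - mu * I
    dR <- gamma * I
    dM <- mu * I
    list(c(dS, dE, dI, dR, dM))
  })
}

params <- c(beta = 0.5, sigma = 0.25, gamma = 0.2, mu = 0.001)
initial_state <- c(S = 999999, E = 1, I = 0, R = 0, M = 0)
times <- 0:365

# solve the system and convert the result to a dataframe
out <- as.data.frame(ode(y = initial_state, times = times, func = SEIR, parms = params))

# which.max() gives the row position of the peak;
# the time column gives the corresponding day
peak_row <- which.max(out$I)
peak_day <- out$time[peak_row]
peak_size <- out$I[peak_row]

c(row = peak_row, day = peak_day, infectious = peak_size)
```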

How to model intervention methods

The SEIR model is an interesting example of how an epidemic develops without any changes in the population's behaviour.

You can build more sophisticated models by taking the SEIR model as a starting point and adding extra features.

This lets you model changes in behaviour (either voluntary or as a result of government intervention).

Many (but not all) countries around the world entered some form of "lockdown" during the coronavirus pandemic of 2020.

Ultimately, the intention of locking down is to alter the course of the epidemic by reducing the transmission coefficient, β.

The code below defines a model which changes the value of β between the start and end of a period of lockdown.

All the numbers used are purely illustrative. You could make an entire research career (several times over) trying to figure out the most realistic values.

SEIR_lockdown <- function(time, current_state, params){

  with(as.list(c(current_state, params)),{

    # beta is 0.5 outside the lockdown period and 0.1 during it
    beta <- ifelse(
      (time <= start_lockdown || time >= end_lockdown),
      0.5, 0.1
    )

    N <- S+E+I+R
    dS <- -(beta*S*I)/N
    dE <- (beta*S*I)/N - sigma*E
    dI <- sigma*E - gamma*I - mu*I
    dR <- gamma*I
    dM <- mu*I

    return(list(c(dS, dE, dI, dR, dM)))
  })
}

The only change is the extra ifelse() statement to adjust the value of β to 0.1 during lockdown.

You need to pass two new parameters to the model. These are the start and end times of the lockdown period.

Here, the lockdown begins on day 90, and ends on day 150.

params <- c(
    sigma=0.25,
    gamma=0.2,
    mu=0.001,
    start_lockdown=90,
    end_lockdown=150
    )

  initial_state <- c(S=999999, E=1, I=0, R=0, M=0)

  times <- 0:365

  model <- ode(initial_state, times, SEIR_lockdown, params)
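
To check how the switch behaves around the lockdown boundaries, you can pull the condition out into a small helper and evaluate it for a few days. This is a sketch: beta_at() is a hypothetical function that is not part of the model above, and it uses the vectorised | operator so that several days can be tested at once.

```r
# Hypothetical helper mirroring the condition inside SEIR_lockdown()
beta_at <- function(time, start_lockdown = 90, end_lockdown = 150) {
  ifelse(time <= start_lockdown | time >= end_lockdown, 0.5, 0.1)
}

days <- c(0, 90, 91, 149, 150, 200)
betas <- beta_at(days)

# days 90 and 150 themselves still use the higher value,
# because the condition uses <= and >=
data.frame(day = days, beta = betas)
```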

Now you can view the summary and graphs associated with this model.

summary(model)

This will reveal:

               S            E           I         R         M
Min.    156885.7 7.699207e-01     0.00000      0.00    0.0000
1st Qu. 160478.2 6.929205e+01    97.71405  63668.75  318.3438
Median  789214.4 1.246389e+03  1735.66330 194379.16  971.8958
Mean    589558.9 9.216918e+03 11460.62036 387824.44 1939.1222
3rd Qu. 867639.6 1.030043e+04 13780.17591 829898.56 4149.4928
Max.    999999.0 6.083432e+04 72443.97892 838916.89 4194.5845
N          366.0 3.660000e+02   366.00000    366.00  366.0000
sd      350719.3 1.570278e+04 18893.31145 346542.57 1732.7128

You can see:

Out of a million individuals, 156,886 did not become infected.
At the peak of the epidemic, 72,444 individuals were infectious simultaneously.
838,917 individuals recovered by the end of the model.
A total of 4195 individuals died during the epidemic.

Plotting the model using matplot() reveals a strong "second wave" effect (as was seen across many countries in Europe towards the end of 2020).

  matplot(
    model, 
    type="l",
    lty=1, 
    main="SEIR model (with intervention)", 
    xlab="Time"
    )

legend <- colnames(model)[2:6]

legend("right", legend=legend, col=2:6, lty = 1)

Finally, you can coerce the model to a dataframe and carry out more detailed analysis from there.

infections <- as.data.frame(model)$I

peak <- max(infections)

match(peak, infections)

In this scenario, the number of infections peaked on day 223.

In other scenarios, you could model the effect of vaccination. Or, you could build in seasonal differences in the transmission rate.
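
As an illustration of the seasonal idea, one common choice is to let the transmission coefficient follow a cosine wave over the year. The sketch below is an assumption rather than part of the model above: the function name beta_seasonal, the 20% amplitude and the 365-day period are all made up for the example.

```r
# Hypothetical seasonal transmission coefficient:
# oscillates around beta0 with a one-year period
beta_seasonal <- function(time, beta0 = 0.5, amplitude = 0.2, period = 365) {
  beta0 * (1 + amplitude * cos(2 * pi * time / period))
}

# highest at day 0, lowest half a year later
beta_start <- beta_seasonal(0)
beta_mid <- beta_seasonal(365 / 2)

c(start = beta_start, mid_year = beta_mid)
```

Inside a model function like SEIR_lockdown(), a call such as this would take the place of the ifelse() statement that sets beta.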

Limitations of compartmental models

As with all modeling, an epidemic model is only as good as the data and assumptions that go into it.

And some of the assumptions behind the SEIR model as described are unrealistic.

For example:

In large populations, mixing is non-uniform. Individuals are much more likely to interact with individuals in their locality. More advanced compartmental models will account for this.
The model assumes the population is isolated. In reality, mixing between populations allows a virus to be introduced and reintroduced multiple times.
Individuals are usually not born with immunity. More sophisticated models will factor in the birth rate when considering longer periods of time.
The basic SEIR model does not account for age structures in the population. Often, a virus will spread faster among younger, densely populated cities. But it might prove more deadly to older populations outside those cities. More complex models will take these differences into consideration.
The SEIR model considers only averages for each of its parameters. In reality, there will be a lot of variation. Some individuals remain infectious for a long time. A small number of individuals might make a very large number of contacts. Therefore, the model is suitable for describing the epidemic at a high level, over a long period of time. But it is not suitable for predicting details on a smaller scale.

Despite its limitations, the SEIR model is a solid starting point for understanding the dynamics of an epidemic.

More generally, the approach of using differential equations to represent flows between compartments to model complex processes is very powerful.

And the availability of software packages for languages such as R and Python makes it easier than ever to get started exploring these techniques.

You can dig into the code used for the examples here.

Thanks for reading!

How to Choose the Best Programming Language for your Data Science Project

Harshit Tyagi — Wed, 01 Jul 2020 20:54:18 +0000

The battle between programming languages has always been a hot topic in the tech world. And given how fast technology is advancing, we have a new programming language or framework every few months.

This makes it ever harder for developers, analysts, and researchers to choose the best language that will get their tasks done efficiently while incurring the lowest cost.

But I think that we tend to look at the wrong reasons for choosing a language. There are a bunch of factors that lead to the choice of a certain language. And with Data Science projects flooding the market, the question is NOT “which is the best language” but "which one suits your project requirements and environment (work setting)?"

So, with this post, I will present you with the right set of questions you should be asking in order to decide which is the best programming language for your data science project.

Most commonly used programming languages for Data Science

Python and R are the most widely used languages for statistical analysis or machine learning-centric projects. But there are others - like Java, Scala, or Matlab.

Both Python and R are state-of-the-art open-source programming languages with great community support. And we keep learning about new libraries and tools that allow us to achieve greater levels of performance and complexity.

Python

Python is well-known for its easy to learn and readable syntax. With a general-purpose (jack of all trades) language like Python, you can build complete scientific ecosystems without worrying much about the compatibility or interfacing issues.

Python code has low maintenance costs and is arguably more robust. From data wrangling to feature selection, web scraping, and deployment of our machine learning models, Python can get almost everything done with integration support from all the major ML and deep learning APIs like Theano, TensorFlow, and PyTorch.

R

R was developed by academicians and statisticians over two decades ago. R today enables many statisticians, analysts, and developers to carry out their analysis effectively. We have over 12000 packages available in CRAN (an open-source repository).

Since it was developed keeping statisticians in mind, R is often the first choice for all the core-scientific and statistical analysis. There is a package in R for almost every kind of analysis there is.

Also, data analysis has been made very easy with tools like RStudio that allow you to communicate your results with concise and elegant reports.

4 Questions to help you choose the BEST suited language for your project

So, how do you make the right choice for your work at hand?

Try answering these 4 questions:

1. Which language/framework is preferred in your organisation/industry?

Look at the industry you are working in and the most commonly used language by your peers and competitors. It might be easier if you speak the same language.

Here is an analysis carried out by David Robinson, a data scientist. It’s a reflection of the popularity of R in each industry, and you can see that R is heavily used in Academia and Healthcare.

So, if you’re someone who wants to go into research, academia, or bioinformatics, you might consider R over Python.

Source: https://stackoverflow.blog/2017/10/10/impressive-growth-r/

The other side of this coin involves software industries, application-driven organizations, and product-based companies. You might have to use the tech stack of your organization’s infrastructure or the language that your colleagues/teams are using.

And most of these organizations/industries have their infrastructure based on Python, including academia as well:

Source: https://stackoverflow.blog/2017/09/14/python-growing-quickly/

As an aspiring data scientist, therefore, you should focus on learning the language and tech that have the most applications and that can increase your chances of getting a job.

2. What is the scope of your project?

This is an important question, because before you pick up a language, you must have an agenda for your project.

For example, what if you want to simply solve a statistical problem through a dataset, perform some multi-variate analyses, and prepare a report or a dashboard explaining the insights? In this case R might be a better choice. It has some really powerful visualization and communication libraries.

On the other hand, what if your aim is to first carry out exploratory analysis, develop a deep learning model, and then deploy the model within a web application? Then Python’s web frameworks and support from all the major cloud providers make it a clear winner.

3. How experienced are you in the field of data science?

For a beginner in data science who has limited familiarity with statistics and mathematical concepts, Python might be a better choice because it lets you code the fragments of an algorithm with ease.

With libraries like NumPy, you can manipulate matrices and code algorithms yourself. As a novice, it is always better to learn to build things from scratch rather than hopping onto using machine learning libraries.

But if you already know the fundamentals of machine learning algorithms, you can pick up either of the languages and get started with them.

4. How much time do you have on hand, and what's the cost of learning?

The amount of time you can invest makes another case for your choice. Depending on your experience with programming and the delivery time of your project, you might choose one language over another to get started in the field.

If there is a high-priority project and you don’t know either of the languages, R might be an easier option for you to get started as you need limited/no experience with programming. You can write statistical models with a few lines of code using existing libraries.
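As a rough illustration of the "few lines of code" point, here is a minimal sketch that fits a linear model using only base R. The built-in mtcars dataset and the choice of variables are assumptions made for this example, not part of the discussion above.

```r
# Fit a simple linear regression of fuel efficiency (mpg) on weight (wt)
# using the mtcars dataset that ships with R (chosen only for illustration).
fit <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope
coef(fit)

# Full summary: coefficients, standard errors, R-squared
summary(fit)
```

The model itself is one line; the rest is inspection of the result.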

Python (often the programmer’s choice) is a great option to start off with if you have some bandwidth to explore the libraries and learn about methods of exploring datasets. (In the case of R, this can be done quickly within RStudio.)

Another important factor is that there are more Python mentors than R mentors. If you need help with your Python or R project, you can look for a Coding Mentor here; using this link will also get you a $10 credit on sign-up to use for your first mentor meeting.

Conclusion

In a nutshell, the gap between the capabilities of R and Python is getting narrower. Most jobs can be done by both languages. And both have rich ecosystems to support you.

Choosing a language for your project will then depend on:

Your prior experience with Data Science (stats and math) and programming.
The domain of the project at hand and the extent of statistical or scientific processing required.
The future scope of your project.
The language/framework that is most widely supported in your teams, organisation, and industry.

You can check out the video version of this blog here,

Data Science with Harshit

With this channel, I am planning to roll out a couple of series covering the entire data science space. Here is why you should be subscribing to the channel:

The series would cover all the required/demanded quality tutorials on each of the topics and subtopics like Python fundamentals for Data Science.
Explained Mathematics and derivations of why we do what we do in ML and Deep Learning.
Podcasts with Data Scientists and Engineers at Google, Microsoft, Amazon, etc, and CEOs of big data-driven companies.
Projects and instructions to implement the topics learned so far.

If this tutorial was helpful, you should check out my data science and machine learning courses on Wiplane Academy. They are comprehensive yet compact and help you build a solid foundation of work to showcase.

R Programming Language Explained

freeCodeCamp — Sat, 01 Feb 2020 00:00:00 +0000

R is an open source programming language and software environment for statistical computing and graphics. It is one of the primary languages used by data scientists and statisticians alike. It is supported by the R Foundation for Statistical Computing and a large community of open source developers. Since R uses a command line interface, there can be a steep learning curve for people who are used to GUI-focused programs such as SPSS and SAS, so extensions to R such as RStudio can be highly beneficial. Since R is open source and freely available, it is also attractive to academics whose access to statistical programs is otherwise regulated through their association with a college or university.

Installation

The first thing you need to get started with R is to download it from its official site according to your operating system.

Popular R Tools and Packages

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
The Comprehensive R Archive Network (CRAN) is a leading source for R tools and resources.
Tidyverse is an opinionated collection of R packages designed for data science, such as ggplot2, dplyr, readr, tidyr, purrr, and tibble.
data.table is an implementation of base data.frame focused on improved performance and terse, flexible syntax.
Shiny is a framework for building dashboard-style web apps in R.

Data Types in R

Vector

It is a sequence of data elements of the same basic type. For example:

> o <- c(1,2,5.3,6,-2,4)                                  # Numeric vector
> p <- c("one","two","three","four","five","six")         # Character vector
> q <- c(TRUE,TRUE,FALSE,TRUE,FALSE,TRUE)                # Logical vector
> o;p;q
[1]  1.0  2.0  5.3  6.0 -2.0  4.0
[1] "one"   "two"   "three" "four"  "five"  "six"
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
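The "same basic type" rule has a practical consequence: if you mix types inside c(), R silently converts every element to one common type. A minimal sketch (the variable names are made up for illustration):

```r
# Mixing a number, a string and a logical: everything becomes character
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
nums <- c(1.5, TRUE, FALSE)
class(nums)         # "numeric"
nums                # 1.5 1.0 0.0
```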

Matrix

It is a two-dimensional rectangular data set. The components in a matrix also must be of the same basic type like vector. For example:

> m = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
> m
     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a"
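Elements of a matrix are addressed by row and column. The sketch below rebuilds the same matrix so that it runs on its own, then pulls out single cells, rows and columns:

```r
# Same matrix as above, filled row by row
m <- matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)

dim(m)      # 2 3  (rows, columns)
m[2, 1]     # "c"  (row 2, column 1)
m[1, ]      # "a" "a" "b"  (entire first row)
m[, 3]      # "b" "a"      (entire third column)
```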

Data Frame

It is more general than a matrix, in that different columns can have different basic data types. For example:

> d <- c(1,2,3,4)
> e <- c("red", "white", "red", NA)
> f <- c(TRUE,TRUE,TRUE,FALSE)
> mydata <- data.frame(d,e,f)
> names(mydata) <- c("ID","Color","Passed")
> mydata
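Once a data frame exists, columns are usually accessed with $ and rows are filtered with a logical condition. A small self-contained sketch using the same values as above:

```r
# Rebuild the example data frame with named columns
mydata <- data.frame(
  ID     = c(1, 2, 3, 4),
  Color  = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE)
)

nrow(mydata)                    # 4
mydata$Passed                   # TRUE TRUE TRUE FALSE
mydata[mydata$Passed, "ID"]     # 1 2 3  (IDs of rows where Passed is TRUE)
is.na(mydata$Color)             # FALSE FALSE FALSE TRUE
```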

Lists

It is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it. For example:

> list1 <- list(c(2,5,3),21.3,sin)
> list1
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)  .Primitive("sin")
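To get an element back out of a list, use double square brackets; single brackets return a smaller list instead. A brief sketch with the same list:

```r
list1 <- list(c(2, 5, 3), 21.3, sin)

list1[[1]]        # the numeric vector 2 5 3
list1[[1]][2]     # 5, the second value of that vector
list1[[3]](0)     # 0, calls the stored sin function on 0
class(list1[1])   # "list": single brackets keep the list wrapper
```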

Functions in R

A function allows you to define a reusable block of code that can be executed many times within your program.

Functions can be named and called repeatedly or can be run anonymously in place (similar to lambda functions in Python).

Developing a full understanding of R functions requires an understanding of environments. Environments are simply a way to manage objects. For example, you can reuse a variable name inside a function without affecting a variable of the same name that already exists outside it. Additionally, if a function uses a variable that is not defined within the function, R will look for that variable in the enclosing (higher-level) environment.
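The first half of that idea, reusing a name inside a function without touching the outer variable, can be sketched as follows (the function name shadow is hypothetical):

```r
x <- 10

shadow <- function() {
  x <- 1     # a local x that lives only inside this call
  x
}

shadow()   # 1
x          # 10, the outer x is unchanged
```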

Syntax

In R, a function definition has the following features:

The keyword function
a function name
input parameters (optional)
some block of code to execute
a return statement (optional)

# a function with no parameters
sayHello = function(){
  "Hello!"
}

sayHello()  # calls the function, 'Hello!' is printed to the console

# a function with a parameter
helloWithName = function(name){
  paste0("Hello, ", name, "!")
}

helloWithName("Ada")  # calls the function, 'Hello, Ada!' is printed to the console

# a function with multiple parameters; the last expression is the return value
multiply = function(val1, val2){
  val1 * val2
}

multiply(3, 5)  # prints 15 to the console

Functions are blocks of code that can be reused simply by calling the function. This enables simple, elegant code reuse without explicitly re-writing sections of code. It makes code more readable, makes debugging easier, and limits typing errors.

Functions in R are created using the function keyword, along with a function name and function parameters inside parentheses.

The return() function can be used by the function to return a value, and is typically used to force early termination of a function with a returned value. Alternatively, the function will return the value of the last expression it evaluates.

# return a value explicitly, or implicitly as the last evaluated expression
sum = function(a, b){
  c = a + b
  return(c)
}

sum = function(a, b){
  a + b
}


result = sum(1, 2)
# result = 3
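The "early termination" use of return() is easiest to see with a guard clause. In this sketch the function name safe_divide and its behaviour for a zero divisor are assumptions chosen for illustration:

```r
safe_divide <- function(a, b) {
  if (b == 0) {
    return(NA)   # exit early; the division below is never reached
  }
  a / b          # implicit return: value of the last expression
}

safe_divide(10, 2)  # 5
safe_divide(1, 0)   # NA
```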

You can also define default values for the parameters, which R will use when a variable is not specified during function call.

sum = function(a, b = 3){
  a + b
}

result = sum(a = 1)
# result = 4

You can also pass the parameters in the order you want, using the name of the parameter.

result = sum(b=2, a=2)
# result = 4

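R can also accept additional, optional parameters with ‘...’. The dots cannot be used directly in arithmetic, so pass them on to another function:

sum = function(a, b, ...){
  # base::sum() is used because this example has reassigned the name sum
  a + b + base::sum(...)
}

sum(1, 2, 3) #returns 6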

Functions can also be run anonymously. These are very useful in combination with the ‘apply’ family of functions.

# loop through 1, 2, 3 - add 1 to each
sapply(1:3,
       function(i){
         i + 1
         })

Notes

If a function definition includes arguments without default values, values for those arguments must be supplied when the function is called.

sum = function(a, b = 3){
  a + b
}

sum(b = 2) # Error in sum(b = 2) : argument "a" is missing, with no default

Variables defined within a function exist only within the scope of that function. If a function uses a variable that is not defined inside it, R looks for that variable in the enclosing environment.

double = function(a){
  a * 2
}

# x has not been defined anywhere, so the lookup fails
double(x)  # Error in double(x) : object 'x' not found


double = function(){
  a * 2
}

a = 3
double() # 6

In-built functions in R

R comes with many functions that you can use to do sophisticated tasks like random sampling.
For example, you can round a number with round(), or calculate its factorial with factorial().

> round(4.147)
[1] 4
> factorial(3)
[1] 6
> round(mean(1:6))
[1] 4

The data that you pass into the function is called the function’s argument.
You can simulate a roll of the die with R’s sample() function. The sample() function takes two arguments: a vector named x and a number named size. For example:

> die <- 1:6
> sample(x = 1:4, size = 2)
[1] 4 2
> sample(x = die, size = 1)
[1] 3
> dice <- sample(die, size = 2, replace = TRUE)
> dice
[1] 2 4
> sum(dice)
[1] 6
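Because sample() is random, you will usually see different numbers from the ones printed above. If you need repeatable results, one common approach is to fix the random seed with set.seed() first. A minimal sketch (the seed value 123 is arbitrary):

```r
die <- 1:6

set.seed(123)                                   # fix the random number generator state
first <- sample(die, size = 2, replace = TRUE)

set.seed(123)                                   # reset to the same state
second <- sample(die, size = 2, replace = TRUE)

identical(first, second)   # TRUE: same seed, same rolls
```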

If you’re not sure which names to use with a function, you can look up the function’s arguments with args. (The exact signature printed can differ slightly between R versions.)

> args(round)
function (x, digits = 0) 
NULL
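Once you know the argument names, you can pass them explicitly, in any order. A short sketch using round() and its digits argument:

```r
round(4.147, digits = 2)       # 4.15
round(digits = 1, x = 4.147)   # 4.1, named arguments can come in any order
round(4.147, 2)                # 4.15, arguments matched by position
```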

Objects in R

R lets you save data by storing it inside an R object.

What’s an object?

It is just a name that you can use to call up stored data. For example, you can save data into an object like a or b.

> a <- 5
> a
[1] 5

How to create an Object in R?

To create an R object, choose a name and then use the less-than symbol, <, followed by a minus sign, -, to save data into it. This combination looks like an arrow, <-. R will make an object, give it your name, and store in it whatever follows the arrow.
When you ask R what’s in an object, it tells you on the next line. For example:

> die <- 1:6
> die
[1] 1 2 3 4 5 6

You can name an object in R almost anything you want, but there are a few rules. First, a name cannot start with a number. Second, a name cannot use some special symbols, like ^, !, $, @, +, -, /, or *.
R is also case-sensitive, so name and Name will refer to different objects.
You can see which object names you have already used with the function ls().
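These naming rules can be checked directly in the console. The sketch below uses made-up object names; make.names() is a base R helper that shows how R would repair a name that breaks the rules:

```r
name <- "lower case"
Name <- "capitalised"

identical(name, Name)          # FALSE: two separate objects
c("name", "Name") %in% ls()    # TRUE TRUE: both appear in the list of names

# An invalid name (starts with a digit, contains '-') and its repaired form
make.names("2nd-roll")         # "X2nd.roll"
```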


Learn R programming language basics in just 2 hours with this free course on statistical programming

Beau Carnes — Thu, 06 Jun 2019 17:45:51 +0000

Learn the R programming language in this course from Barton Poulson of datalab.cc. This is a hands-on overview of the statistical programming language R, one of the most important tools in data science.

The course covers:

Installing R
RStudio
Packages
plot()
Bar Charts
Histograms
Scatterplots
Overlaying Plots
summary()
describe()
Selecting Cases
Data Formats
Factors
Entering Data
Importing Data
Hierarchical Clustering
Principal Components
Regression
Next Steps

You can watch the full video course on the freeCodeCamp.org YouTube channel (2 hour watch).

How to build a Hacker News Frontpage scraper with just 7 lines of R code

freeCodeCamp — Tue, 06 Feb 2018 20:54:52 +0000

By AMR

Web scraping used to be a difficult task requiring expertise in XML Tree parsing and HTTP Requests. But with new-age scraping libraries like beautifulsoup (for Python) and rvest (for R), web scraping has become a toy for any beginner to play with.

This post aims to explain how simple it is to use R, a very nice programming language, to perform Data Analysis and Data Visualization. The task ahead is very simple. Build a web scraper that scrapes the content of one of the most popular pages on the Internet (at least among Coders): Hacker News Front Page.

Package Installation and Loading

The R package that we are going to use is rvest. rvest can be installed from CRAN and loaded into R like below:

# install.packages('rvest')  # run once if rvest is not installed yet
library(rvest)

The read_html() function of rvest extracts the HTML content of the URL passed as its argument.

content <- read_html('https://news.ycombinator.com/')

For read_html() to work without any concern, please make sure you are not behind any organization firewall. If so, configure your RStudio with a proxy to bypass the firewall, otherwise you might face a connection timed out error.

Below is the screenshot of HN front page layout (with key elements highlighted):

Now, with the HTML content of the Hacker News front page loaded into the R object content, let us extract the data that we need — starting with the Title.

There is one particularly important aspect of making any web scraping assignment successful. That is to identify the right CSS selector, or XPath values, of the HTML elements whose values are supposed to be scraped. The easiest way to get the right element value is to use the inspect tool in Developer Tools of any browser.

Here’s the screenshot of the CSS selector value. It is highlighted using the Chrome Inspect Tool when hovered over Title of the links present in Hacker News Frontpage.

title <- content %>% html_nodes('a.storylink') %>% html_text()
title
 [1] "Magic Leap One"
 [2] "Show HN: Terminal – native micro-GUIs for shell scripts and command line apps"
 [3] "Tokio internals: Understanding Rust's async I/O framework"
 [4] "Funding Yourself as a Free Software Developer"
 [5] "US Federal Ban on Making Lethal Viruses Is Lifted"
 [6] "Pass-Thru Income Deduction"
 [7] "Orson Welles' first attempt at movie-making"
 [8] "D’s Newfangled Name Mangling"
 [9] "Apple Plans Combined iPhone, iPad, and Mac Apps to Create One User Experience"
[10] "LiteDB – A .NET NoSQL Document Store in a Single Data File"
[11] "Taking a break from Adblock Plus development"
[12] "SpaceX’s Falcon Heavy rocket sets up at Cape Canaveral ahead of launch"
[13] "This is not a new year’s resolution"
[14] "Artists and writers whose works enter the public domain in 2018"
[15] "Open Beta of Texpad 1.8, macOS LaTeX editor with integrated real-time typesetting"
[16] "The triumph and near-tragedy of the first Moon landing"
[17] "Retrotechnology – PC desktop screenshots from 1983-2005"
[18] "Google Maps' Moat"
[19] "Regex Parser in C Using Continuation Passing"
[20] "AT&T giving $1000 bonus to all its employees because of tax reform"
[21] "How a PR Agency Stole Our Kickstarter Money"
[22] "Google Hangouts now on Firefox without plugins via WebRTC"
[23] "Ubuntu 17.10 corrupting BIOS of many Lenovo laptop models"
[24] "I Know What You Download on BitTorrent"
[25] "Carrie Fisher’s Private Philosophy Coach"
[26] "Show HN: Library of API collections for Postman"
[27] "Uber is officially a cab firm, says European court"
[28] "The end of the Iceweasel Age (2016)"
[29] "Google will turn on native ad-blocking in Chrome on February 15"
[30] "Bitcoin Cash deals frozen as insider trading is probed"

The rvest package supports the pipe operator %>%. Thus, the R object containing the content of the HTML page (read with read_html) can be piped into html_nodes(), which takes a CSS selector or XPath as its argument. It extracts the matching nodes of the XML tree (the HTML nodes), whose text values can then be extracted with the html_text() function.

The beauty of rvest is that it abstracts the entire XML parsing operation under the hood of functions like html_nodes() and html_text(), making it easier for us to achieve our scraping goal with minimal code.
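To see what the pipe is doing, it can help to compare it with the equivalent nested calls. The sketch below parses a tiny inline HTML snippet, invented for this example so that no network access is needed, and assumes rvest is installed:

```r
library(rvest)

# A made-up two-link page standing in for the real Hacker News HTML
page <- read_html('<html><body>
  <a class="storylink">First story</a>
  <a class="storylink">Second story</a>
</body></html>')

# Piped form: read left to right
piped <- page %>% html_nodes('a.storylink') %>% html_text()

# Equivalent nested form: read inside out
nested <- html_text(html_nodes(page, 'a.storylink'))

identical(piped, nested)   # TRUE
piped                      # "First story" "Second story"
```

Both forms run the same two steps; the pipe only changes the order in which you read them.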

Like with Title, the CSS selector value of other required elements of the web page can be identified with the Chrome Inspect tool. They can also be passed as an argument to html_nodes() function and respective values can be extracted and stored in R objects.

link_domain <- content %>% html_nodes('span.sitestr') %>% html_text()
score <- content %>% html_nodes('span.score') %>% html_text()
age <- content %>% html_nodes('span.age') %>% html_text()

All the essential pieces of information were extracted from the page. Now an R data frame can be made with the extracted elements to put the extracted data into a structured format.

df <- data.frame(title = title, link_domain = link_domain, score = score, age = age)
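The scraped values all arrive as text, so a common follow-up step is converting them to numbers. The sketch below uses hypothetical sample strings in the shape we assume html_text() returns for the score element (for example "123 points"); the real page content may differ. It also checks vector lengths first, since data.frame() needs columns of equal length.

```r
# Hypothetical scraped values, not real Hacker News data
title <- c("Story A", "Story B", "Story C")
score <- c("123 points", "1 point", "45 points")

# Strip everything that is not a digit, then convert to numeric
score_num <- as.numeric(gsub("[^0-9]", "", score))
score_num   # 123 1 45

# Guard against mismatched lengths before combining
stopifnot(length(title) == length(score_num))
df <- data.frame(title = title, score = score_num)
df
```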

Below is the screenshot of the final dataframe in RStudio viewer:

Thus, in just 7 lines of code, we have successfully built a Hacker News Frontpage Scraper in R.

R is a wonderful language to perform Data Analysis and Data Visualization. The code used here is available on my github.

Which languages should you learn for data science?

freeCodeCamp — Thu, 31 Aug 2017 16:07:30 +0000

By Peter Gleeson

Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real-world programming ability. There are many potential programming languages that the aspiring data scientist might consider specializing in.

While there is no correct answer, there are several things to take into consideration. Your success as a data scientist will depend on many points, including:

Specificity

When it comes to advanced data science, you will only get so far reinventing the wheel each time. Learn to master the various packages and modules offered in your chosen language. The extent to which this is possible depends on what domain-specific packages are available to you in the first place!

Generality

A top data scientist will have good all-round programming skills as well as the ability to crunch numbers. Much of the day-to-day work in data science revolves around sourcing and processing raw data or ‘data cleaning’. For this, no amount of fancy machine learning packages are going to help.

Productivity

In the often fast-paced world of commercial data science, there is much to be said for getting the job done quickly. However, this is what enables technical debt to creep in — and only with sensible practices can this be minimized.

Performance

In some cases it is vital to optimize the performance of your code, especially when dealing with large volumes of mission-critical data. Compiled languages are typically much faster than interpreted ones; likewise statically typed languages are considerably more fail-proof than dynamically typed. The obvious trade-off is against productivity.

To some extent, these can be seen as a pair of axes (Generality-Specificity, Performance-Productivity). Each of the languages below falls somewhere on these spectra.

With these core principles in mind, let’s take a look at some of the more popular languages used in data science. What follows is a combination of research and the personal experience of myself, friends and colleagues, but it is by no means definitive! In approximate order of popularity, here goes:

R

What you need to know

Released in 1995 as a direct descendant of the older S programming language, R has since gone from strength to strength. Written in C, Fortran and itself, the project is currently supported by the R Foundation for Statistical Computing.

License

Free!

Pros

Excellent range of high-quality, domain specific and open source packages. R has a package for almost every quantitative and statistical application imaginable. This includes neural networks, non-linear regression, phylogenetics, advanced plotting and many, many others.
The base installation comes with very comprehensive, in-built statistical functions and methods. R also handles matrix algebra particularly well.
Data visualization is a key strength with the use of libraries such as ggplot2.

Cons

Performance. There’s no two ways about it, R is not a quick language.
Domain specificity. R is fantastic for statistics and data science purposes. But less so for general purpose programming.
Quirks. R has a few unusual features that might catch out programmers experienced with other languages. For instance: indexing from 1, using multiple assignment operators, unconventional data structures.
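Two of the quirks listed above, 1-based indexing and multiple assignment operators, can be seen in a few lines. This is only an illustrative sketch with made-up variable names:

```r
x <- c(10, 20, 30)
x[1]          # 10: the first element is at index 1
x[0]          # numeric(0): index 0 gives an empty vector, not an error

# Three ways to assign a value
a <- 1        # leftward arrow (the conventional style)
b = 2         # equals sign
3 -> c_val    # rightward arrow
a + b + c_val # 6
```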

Verdict — “brilliant at what it’s designed for”

R is a powerful language that excels at a huge variety of statistical and data visualization applications, and being open source allows for a very active community of contributors. Its recent growth in popularity is a testament to how effective it is at what it does.

Python

What you need to know

Guido van Rossum introduced Python back in 1991. It has since become an extremely popular general purpose language, and is widely used within the data science community. The major versions are currently 3.6 and 2.7.

License

Free!

Pros

Python is a very popular, mainstream general purpose programming language. It has an extensive range of purpose-built modules and community support. Many online services provide a Python API.
Python is an easy language to learn. The low barrier to entry makes it an ideal first language for those new to programming.
Packages such as pandas, scikit-learn and Tensorflow make Python a solid option for advanced machine learning applications.

Cons

Type safety: Python is a dynamically typed language, which means you must show due care. Type errors (such as passing a String as an argument to a method which expects an Integer) are to be expected from time-to-time.
For specific statistical and data analysis purposes, R’s vast range of packages gives it a slight edge over Python. For general purpose languages, there are faster and safer alternatives to Python.

Verdict — “excellent all-rounder”

Python is a very good choice of language for data science, and not just at entry-level. Much of the data science process revolves around the ETL process (extraction-transformation-loading). This makes Python’s generality ideally suited. Libraries such as Google’s Tensorflow make Python a very exciting language to work in for machine learning.

SQL

What you need to know

SQL (‘Structured Query Language’) defines, manages and queries relational databases. The language appeared by 1974 and has since undergone many implementations, but the core principles remain the same.

License

Varies — some implementations are free, others proprietary

Pros

Very efficient at querying, updating and manipulating relational databases.
Declarative syntax makes SQL an often very readable language. There’s no ambiguity about what SELECT name FROM users WHERE age > 18 is supposed to do!
SQL is very widely used across a range of applications, making it a very useful language to be familiar with. Modules such as SQLAlchemy make integrating SQL with other languages straightforward.

Cons

SQL’s analytical capabilities are rather limited — beyond aggregating and summing, counting and averaging data, your options are limited.
For programmers coming from an imperative background, SQL’s declarative syntax can present a learning curve.
There are many different implementations of SQL, such as PostgreSQL, SQLite and MariaDB. They are all different enough to make inter-operability something of a headache.

Verdict — “timeless and efficient”

SQL is more useful as a data processing language than as an advanced analytical tool. Yet so much of the data science process hinges upon ETL, and SQL’s longevity and efficiency are proof that it is a very useful language for the modern data scientist to know.

Java

What you need to know

Java is an extremely popular, general purpose language which runs on the Java Virtual Machine (JVM). The JVM is an abstract computing system that enables seamless portability between platforms. Java is currently supported by Oracle Corporation.

License

Version 8 — Free! Legacy versions, proprietary.

Pros

Ubiquity. Many modern systems and applications are built upon a Java back-end. The ability to integrate data science methods directly into the existing codebase is a powerful one to have.
Strongly typed. Java is no-nonsense when it comes to ensuring type safety. For mission-critical big data applications, this is invaluable.
Java is a high-performance, general purpose, compiled language. This makes it suitable for writing efficient ETL production code and computationally intensive machine learning algorithms.

Cons

For ad-hoc analyses and more dedicated statistical applications, Java’s verbosity makes it an unlikely first choice. Dynamically typed scripting languages such as R and Python lend themselves to much greater productivity.
Compared to domain-specific languages like R, there aren’t a great number of libraries available for advanced statistical methods in Java.

Verdict — “a serious contender for data science”

There is a lot to be said for learning Java as a first choice data science language. Many companies will appreciate the ability to seamlessly integrate data science production code directly into their existing codebase, and you will find Java’s performance and type safety are real advantages.

However, you’ll be without the range of stats-specific packages available to other languages. That said, definitely one to consider — especially if you already know one of R and/or Python.

Scala

What you need to know

Developed by Martin Odersky and released in 2004, Scala is a language which runs on the JVM. It is a multi-paradigm language, enabling both object-oriented and functional approaches. Cluster computing framework Apache Spark is written in Scala.

License

Free!

Pros

Scala + Spark = High performance cluster computing. Scala is an ideal choice of language for those working with high-volume data sets.
Multi-paradigmatic: Scala programmers can have the best of both worlds. Both object-oriented and functional programming paradigms are available to them.
Scala is compiled to Java bytecode and runs on a JVM. This allows inter-operability with the Java language itself, making Scala a very powerful general purpose language, while also being well-suited for data science.

Cons

Scala is not a straightforward language to get up and running with if you’re just starting out. Your best bet is to download sbt and set up an IDE such as Eclipse or IntelliJ with a specific Scala plug-in.
The syntax and type system are often described as complex. This makes for a steep learning curve for those coming from dynamic languages such as Python.

Verdict — “perfect, for suitably big data”

When it comes to using cluster computing to work with Big Data, then Scala + Spark are fantastic solutions. If you have experience with Java and other statically typed languages, you’ll appreciate these features of Scala too.

Yet if your application doesn’t deal with the volumes of data that justify the added complexity of Scala, you will likely find your productivity being much higher using other languages such as R or Python.

Julia

What you need to know

Released just over 5 years ago, Julia has made an impression in the world of numerical computing. Its profile was raised thanks to early adoption by several major organizations including many in the finance industry.

License

Free!

Pros

Julia is a JIT (‘just-in-time’) compiled language, which lets it offer good performance. It also offers the simplicity, dynamic-typing and scripting capabilities of an interpreted language like Python.
Julia was purpose-designed for numerical analysis. It is capable of general purpose programming as well.
Readability. Many users of the language cite this as a key advantage.

Cons

Maturity. As a new language, some Julia users have experienced instability when using packages. But the core language itself is reportedly stable enough for production use.
Limited packages are another consequence of the language’s youthfulness and small development community. Unlike long-established R and Python, Julia doesn’t have the choice of packages (yet).

Verdict — “one for the future”

The main issue with Julia is one that it cannot be blamed for. As a recently developed language, it isn’t as mature or production-ready as its main alternatives Python and R.

But, if you are willing to be patient, there’s every reason to pay close attention as the language evolves in the coming years.

MATLAB

What you need to know

MATLAB is an established numerical computing language used throughout academia and industry. It is developed and licensed by MathWorks, a company established in 1984 to commercialize the software.

License

Proprietary — pricing varies depending on your use case

Pros

Designed for numerical computing. MATLAB is well-suited for quantitative applications with sophisticated mathematical requirements such as signal processing, Fourier transforms, matrix algebra and image processing.
Data Visualization. MATLAB has some great inbuilt plotting capabilities.
MATLAB is often taught as part of many undergraduate courses in quantitative subjects such as Physics, Engineering and Applied Mathematics. As a consequence, it is widely used within these fields.

Cons

Proprietary licence. Depending on your use-case (academic, personal or enterprise) you may have to fork out for a pricey licence. There are free alternatives available such as Octave. This is something you should give real consideration to.
MATLAB isn’t an obvious choice for general-purpose programming.

Verdict — “best for mathematically intensive applications”

MATLAB’s widespread use in a range of quantitative and numerical fields throughout industry and academia makes it a serious option for data science.

The clear use-case would be when your application or day-to-day role requires intensive, advanced mathematical functionality. Indeed, MATLAB was specifically designed for this.

Other Languages

There are other mainstream languages that may or may not be of interest to data scientists. This section provides a quick overview… with plenty of room for debate of course!

C++

C++ is not a common choice for data science, although it has lightning-fast performance and widespread mainstream popularity. The simple reason may be a question of productivity versus performance.

As one Quora user puts it:

“If you’re writing code to do some ad-hoc analysis that will probably only be run one time, would you rather spend 30 minutes writing a program that will run in 10 seconds, or 10 minutes writing a program that will run in 1 minute?”

The dude’s got a point. Yet for serious production-level performance, C++ would be an excellent choice for implementing machine learning algorithms optimized at a low level.
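To make the trade-off in the quote concrete, here is a minimal sketch in R that works out how many runs it takes before the faster program pays back its extra development time. The timings come straight from the quote; the variable names and the `total_time` helper are hypothetical, introduced only for illustration.

```r
# Figures taken from the quote above, converted to seconds.
dev_fast_program <- 30 * 60  # 30 minutes to write the faster program
run_fast_program <- 10       # each run takes 10 seconds
dev_slow_program <- 10 * 60  # 10 minutes to write the slower program
run_slow_program <- 60       # each run takes 1 minute

# Hypothetical helper: total cost (development + execution) after n runs.
total_time <- function(dev, run, n) dev + run * n

# Break-even point: the number of runs at which both totals are equal.
break_even_runs <- (dev_fast_program - dev_slow_program) /
  (run_slow_program - run_fast_program)

break_even_runs
```

Under these assumptions the break-even point is 24 runs: the extra 1,200 seconds of development time is recovered at 50 seconds per run. For a one-off analysis the quicker-to-write program wins comfortably, while code that runs many times over favours the faster one.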

Verdict — “not for day-to-day work, but if performance is critical…”

JavaScript

With the rise of Node.js in recent years, JavaScript has increasingly become a serious server-side language. However, its use in data science and machine learning domains has been limited to date (although check out brain.js and synaptic.js!). It suffers from the following disadvantages:

Late to the game (Node.js is only 8 years old!), meaning…
Few relevant data science libraries and modules are available. This means no real mainstream interest or momentum.
Performance-wise, Node.js is quick. But JavaScript as a language is not without its critics.

Node’s strengths are in asynchronous I/O, its widespread use and the existence of languages which compile to JavaScript. So it’s conceivable that a useful framework for data science and realtime ETL processing could come together.

The key question is whether this would offer anything different to what already exists.

Verdict — “there is much to do before JavaScript can be taken as a serious data science language”

Perl

Perl is known as a ‘Swiss-army knife of programming languages’, due to its versatility as a general-purpose scripting language. It shares a lot in common with Python, being a dynamically typed scripting language. But, it has not seen anything like the popularity Python has in the field of data science.

This is a little surprising, given its use in quantitative fields such as bioinformatics. Perl has several key disadvantages when it comes to data science. It isn’t stand-out fast, and its syntax is famously unfriendly. There hasn’t been the same drive towards developing data science specific libraries. And in any field, momentum is key.

Verdict — “a useful general purpose scripting language, yet it offers no real advantages for your data science CV”

Ruby

Ruby is another general-purpose, dynamically typed interpreted language. Yet it also hasn’t seen the same adoption for data science as Python has.

This might seem surprising, but is likely a result of Python’s dominance in academia, and a positive feedback effect. The more people use Python, the more modules and frameworks are developed, and the more people will turn to Python.

The SciRuby project exists to bring scientific computing functionality, such as matrix algebra, to Ruby. But for the time being, Python still leads the way.

Verdict — “not an obvious choice yet for data science, but won’t harm the CV”

Conclusion

Well, there you have it — a quickfire guide to which languages to consider for data science. The key here is to understand your usage requirements in terms of generality vs specificity, as well as your personal preferred development style of performance vs productivity.

I use R, Python and SQL on a regular basis, as my current role largely focuses on developing existing data pipelines and ETL processes. These languages give the right balance of generality and productivity to do the job, with the option of using R’s more advanced statistics packages when needed.

However — you may already have some experience with Java. Or you may want to use Scala for big data. Or, perhaps you’re keen to get involved with the Julia project.

Maybe you learned MATLAB at university, or want to give SciRuby a chance? Perhaps you have an altogether different suggestion. If so, please leave a reply below — I look forward to hearing from you!

Thanks for reading!

R Programming - freeCodeCamp.org

How to Create Boxplots and Model Data in R Using ggplot2

Table of Contents

Prerequisites

How to Set Up Your R Environment

How to Load and Inspect the Data

Structure

Key Columns & Meaning

Data Types

Observations

How to Clean and Prepare the Data

How to Use Boxplots

How to Create Boxplots with ggplot2

How to Perform Exploratory Data Analysis (EDA)

How to Build Linear Regression Models

How to Build Logistic Regression Models

Why Visualization Comes Before Modeling

Conclusion

How to Create Scatterplots and Model Data in R Using ggplot2

Table of Contents

Prerequisites

How to Set Up Your R Environment

How to Use Data Types in R

Common Data Types

Numeric Data Types in R

Integer Data Types in R

Character Data Types in R

Logical Data Types in R

Complex Data Types in R

How to Use Data Structures in R

Common Data Structures in R

How to Import Data in R

How to Read a CSV and Excel File

How to Visualize Data with ggplot2

Scatter Plot Example

How to Build Statistical Models in R

Linear Regression

Does the Model Fit the Data, and Why?

Logistic Regression

Conclusion

Web Scraping With RSelenium (Chrome Driver) and Rvest

Table of Contents

Project Overview

Project Setup

Prerequisites:

Setup and Install Chrome Driver

How to Understand and Inspect a Webpage

How to Extract Data Using RVest

Step 1: Load the webpage

Step 2: Identify the target elements

Step 3: Clean the “rating” data

How to Mimic Human Behaviour Using RSelenium

How Selenium Works

Automating Page Navigation and Data Collection with RSelenium

How to Combine RSelenium & RVest and Save to CSV

Create a scrape books function

Retrieve and parse the webpage

Write to CSV

Bringing it All Together

Conclusion

How to Model an Epidemic with R

What is the SEIR model?

Equation (1) - Susceptible

Equation (2) - Exposed

Equation (3) - Infectious

Equation (4) - Recovered

Equation (5) - Mortality (optional)

How to solve these equations

How to model intervention methods

Limitations of compartmental models

How to Choose the Best Programming Language for your Data Science Project

Most commonly used programming languages for Data Science

Python

R

4 Questions to help you choose the BEST suited language for your project

1. Which language/framework is preferred in your organisation/industry?

2. What is the scope of your project?

3. How experienced are you in the field of data science?

4. How much time do you have on hand, and what's the cost of learning?

Conclusion