R Language - freeCodeCamp.org

How to Create Scatterplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Mon, 05 Jan 2026 12:05:54 +0000

You can use R as a powerful tool for data analysis, data visualization, and statistical modelling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the models. By the end, you should know how to use R for your own projects.

Prerequisites
How to Set Up Your R Environment
How to Use Data Types in R
How to Use Data Structures in R
How to Import Data in R
How to Visualize Data with ggplot2
How to Build Statistical Models in R
Conclusion

Prerequisites

Before we get started, you should have the following:

R installed (version 4.0 or higher).
RStudio installed (recommended for beginners).
Basic familiarity with programming concepts such as variables and functions.
A basic understanding of statistics (mean, correlation, regression).

How to Set Up Your R Environment

Before you start working with data, load the required libraries:

library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files

These calls load the required libraries into your R session. If a package is not installed yet, run install.packages() for it once beforehand. tidyverse is a collection of packages used for data manipulation and visualization, including ggplot2. readxl allows you to import Excel files directly into R without converting them to CSV format first.

How to Use Data Types in R

Knowing data types helps you avoid errors and choose the right analysis methods.

Common Data Types

Data type	Example	Use case
Numeric	`x <- 5.7`	Measurements, prices
Integer	`y <- 10L`	Counts
Character	`"House prices"`	Text labels
Logical	`TRUE`	Conditions
Complex	`2 + 3i`	Advanced math
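If you are unsure which type a value has, you can ask R directly. The short sketch below uses the example values from the table with the base R class() and is.integer() functions:

```r
# Check how R classifies the example values from the table
class(5.7)             # "numeric"
class(10L)             # "integer"
class("House prices")  # "character"
class(TRUE)            # "logical"
class(2 + 3i)          # "complex"

# Without the L suffix, a whole number is still stored as a double
is.integer(10)   # FALSE
is.integer(10L)  # TRUE
```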

Numeric Data Types in R

price <- 199.99
tax <- 16.5
total_cost <- price + tax
total_cost

Numeric data is used for continuous values such as measurements, prices, or averages. As you can see, these are numeric values that can be used in a calculation. Numeric data types allow arithmetic operations such as addition, subtraction, multiplication, and division.

Integer Data Types in R

students <- 30L
classes <- 4L
total_students <- students * classes
total_students

Integers are whole numbers and are commonly used for counting. The L tells R that the values are integers. Integers are useful when working with counts, indexes, or discrete values.

Character Data Types in R

course_name <- "Data Science"
university <- "Harvard University"
paste(course_name, "at", university)

Character data is used to store text such as names, labels, or categories. The example above shows how character data can be combined using the paste() function. This data type cannot be used in mathematical operations.

Logical Data Types in R

score <- 75
passed <- score >= 50
passed

Logical data represents Boolean values: TRUE or FALSE. These are commonly used in conditions and filtering. Here, R evaluates a condition and returns TRUE because the score meets the requirement. Logical values are essential in decision-making and control flow.
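To see how logical values drive filtering, here is a minimal sketch. The scores vector is made up for illustration:

```r
# Hypothetical exam scores
scores <- c(45, 75, 60, 30)

# Comparing a vector returns one TRUE/FALSE per element
passed <- scores >= 50            # FALSE TRUE TRUE FALSE

# Use the logical vector to keep only the passing scores
passing_scores <- scores[passed]  # 75 60

# TRUE counts as 1, so sum() gives the number of passes
n_passed <- sum(passed)           # 2
```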

Complex Data Types in R

Complex numbers contain both real and imaginary parts and are mostly used in advanced mathematical computations.

z <- 2 + 3i
Mod(z)

This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.

How to Use Data Structures in R

R stores data in different structures depending on your goals. Choosing the right structure makes operations easier, because R's functions behave differently depending on the structure they receive. Structures also help R understand whether your data are numbers, categories, or text.

Common Data Structures in R

Structure	Best for
Vector	Single column of data
Matrix	Numeric tables
Data Frame	Spreadsheet-like data
List	Mixed objects

vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

str(lst) ##shows the structure of the list

Let's understand the code above:

vec is a vector that stores a single type of data.
mat is a matrix that organizes numeric values into rows and columns.
df is a data frame that works like a spreadsheet, allowing different data types in each column.
lst is a list that stores multiple objects of different types.
The str() function shows how these objects are nested within the list.
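Each structure has its own way of pulling values back out. The sketch below rebuilds the same four objects and shows one common accessor for each:

```r
vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

vec[2]            # second element of the vector: 2
mat[2, 3]         # row 2, column 3 (matrices fill column by column): 8
df$Number         # the Number column: 110 95
lst[["numbers"]]  # double brackets extract the stored object itself
lst$info$Name[1]  # first Name in the data frame stored in the list: "Car"
```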

How to Import Data in R

Now you can start working with your real data. You can import files into R by copying the path of the CSV or Excel file and pasting it into the command.

For Windows: Replace each single backslash (\) in the path with either a double backslash (\\) or a single forward slash (/). For example:

# Windows: either form works
data <- read.csv("C:\\Users\\file\\Documents\\data.csv")
data <- read.csv("C:/Users/file/Documents/data.csv")

For macOS/Linux: Single forward slashes work fine:

# macOS/Linux
data <- read.csv("/Users/file/Documents/data.csv")
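If you would rather not type separators by hand, base R's file.path() joins path pieces with forward slashes on every platform. The sketch below reuses the placeholder folder names from the examples above and adds a file.exists() check so that a wrong path produces a clear message:

```r
# Placeholder folder and file names, for illustration only
data_dir <- file.path("C:", "Users", "file", "Documents")
csv_path <- file.path(data_dir, "data.csv")
csv_path  # "C:/Users/file/Documents/data.csv"

# Only attempt the import if the file is really there
if (file.exists(csv_path)) {
  data <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```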

How to Read a CSV and Excel File

# Import a CSV file (Windows path shown; either form works)
data <- read.csv("C:/Users/file/Documents/data.csv")
# data <- read.csv("C:\\Users\\file\\Documents\\data.csv")

head(data)

You can import a CSV file into R using a file path. On Windows systems, file paths can use either single forward slashes (/) or double backslashes (\\). The imported data is stored as a data frame named data.

data_excel <- read_excel("C:/Users/file/Documents/HR Data Set.xlsx")
head(data_excel)

You can import an Excel file into R using the read_excel() function from the readxl package. The head() function is then used to preview the first few rows of the dataset.

Use the following commands to understand your data:

str(data)
summary(data)

str(data_excel)
summary(data_excel)

str() shows the structure of the dataset, including column names and data types. summary() provides descriptive statistics such as minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.

How to Visualize Data with ggplot2

Visualization helps you spot patterns before you build models.

Scatter Plot Example

We’ll use the built-in mtcars dataset in R. First, load the dataset and the ggplot2 library:

data(mtcars)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

Let us break down the code to grasp it fully:

data(mtcars) loads the built-in mtcars dataset, which contains information about car specifications.
library(ggplot2) enables data visualization.
aes() maps your dataset columns to the plot, which defines the x and y values.
geom_point() draws the points and controls how they look. For example, it sets the point size and color. Because color="blue" is fixed here, it overrides the color = factor(cyl) mapping in aes(), so all points are drawn in blue.
geom_smooth() adds a trend line. Here, we use method="lm" to fit a linear regression line. The se=TRUE/FALSE option controls the shading for confidence intervals. Use TRUE if you want the shading and FALSE if you don’t.
labs() labels the plot and sets the title, x-axis, and y-axis labels.
Finally, we set the plot theme using theme_minimal().
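If you want the points colored by number of cylinders, as the plot title suggests, one option is to drop the fixed point color so that the cyl mapping takes effect. This is a sketch of that variant, not the plot from the original example. The legend title "Cylinders" and the explicit formula = y ~ x are additions made here:

```r
library(ggplot2)

# Variant: color is mapped to cyl inside geom_point(), with no fixed color
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

p  # print the plot
```

Mapping color only in geom_point() keeps a single regression line for the whole dataset.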

Running this code produces a scatterplot of fuel efficiency against weight: a blue point for each car and a red regression line that slopes downward as weight increases.

How to Build Statistical Models in R

Linear Regression

You can use linear regression for continuous outcomes, basically to predict numerical values. For example, to predict a car’s miles per gallon (mpg) based on weight (wt) and horsepower (hp), you can use this formula:

lm_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)

But what does it mean?

lm() stands for linear model.
The response variable is mpg. This is the outcome you want to predict.
Predictor variables are wt and hp. These explain changes in the response.

Once you run the model, it should look like this in your console:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Here’s an interpretation of the linear regression model:

You created a model of miles per gallon (mpg) based on weight (wt) and horsepower (hp).
The intercept 37.227 is the predicted mpg when wt=0 and hp=0. In other words, when all predictors are 0, the baseline mpg is 37.227. No real car has zero weight or zero horsepower, so treat the intercept as a mathematical baseline rather than a realistic prediction.
With every additional unit of weight (1000 lbs), the mpg decreases by about 3.878, holding horsepower constant. The p-value is below 0.001, so the effect of weight is statistically significant.
With every additional unit of horsepower, the mpg decreases by about 0.032, holding weight constant. The p-value is 0.00145, which is less than 0.01, indicating that horsepower is a statistically significant predictor of mpg, although its per-unit effect is smaller than that of vehicle weight.
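To see the coefficients at work, you can ask the model for a prediction. The sketch below uses a made-up car (3,000 lbs and 110 horsepower) and checks the result of predict() against the same calculation done by hand:

```r
lm_model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical car: wt = 3 (3,000 lbs), hp = 110
new_car <- data.frame(wt = 3, hp = 110)
predicted_mpg <- predict(lm_model, newdata = new_car)
predicted_mpg  # roughly 22.1 mpg

# The same number by hand: intercept + wt slope * 3 + hp slope * 110
b <- coef(lm_model)
manual_mpg <- b["(Intercept)"] + b["wt"] * 3 + b["hp"] * 110
```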

Does the Model Fit the Data, and Why?

The R-squared value shows that 83% of the variation in mpg is explained by weight and horsepower.

Summary of the interpretation: Cars that are heavier and with more horsepower have lower fuel efficiency. These two variables explain most of the variation in mpg in the dataset.
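You don't have to read the R-squared off the printed summary. As a small sketch, the object returned by summary() stores both values, so you can extract them for reports or further calculations:

```r
lm_model <- lm(mpg ~ wt + hp, data = mtcars)
model_summary <- summary(lm_model)

r_squared <- model_summary$r.squared          # Multiple R-squared
adj_r_squared <- model_summary$adj.r.squared  # Adjusted R-squared

round(c(r_squared, adj_r_squared), 4)
```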

Logistic Regression

You can use logistic regression for binary outcomes, like yes/no questions. For example, predicting whether a vehicle is automatic or manual based on weight and horsepower.

glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)

Let's understand the code:

glm() stands for generalized linear model.
The family=binomial option tells R to run logistic regression.
The response variable am indicates transmission type: 0 = automatic, 1 = manual.
Predictor variables remain wt and hp.

Once you run the model, it should look like this in your console:

Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
hp           0.03626    0.01773   2.044  0.04091 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Here’s an interpretation of the logistic regression model:

The intercept 18.866 represents the log-odds of a car being manual when wt=0 and hp=0. In other words, when all predictors are 0, the baseline log-odds of the outcome is 18.866.
With every additional unit of weight (1000 lbs), the log-odds of the car being manual decrease by 8.083, holding horsepower constant. The p-value is 0.008, so the effect of weight is statistically significant.
With every additional unit of horsepower, the log-odds of the car being manual increase by 0.036, holding weight constant. The p-value is 0.041, which is statistically significant at the 0.05 level.

Summary of the interpretation: Heavier cars are more likely to be automatic, while higher horsepower slightly increases the chance of being manual. Together, wt and hp explain a large portion of transmission type variation.
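Log-odds are hard to read directly, so it often helps to convert them. The sketch below exponentiates the coefficients to get odds ratios and asks the model for a predicted probability. The example car (3,000 lbs, 110 horsepower) is made up for illustration:

```r
glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Exponentiate the log-odds coefficients to get odds ratios
odds_ratios <- exp(coef(glm_model))
odds_ratios  # wt is far below 1, hp is slightly above 1

# Predicted probability of a manual transmission for a hypothetical car
new_car <- data.frame(wt = 3, hp = 110)
prob_manual <- predict(glm_model, newdata = new_car, type = "response")
prob_manual
```

An odds ratio below 1 means the odds of a manual transmission shrink as that predictor increases, and a ratio above 1 means they grow.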

Conclusion

In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.

This article also showed you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.

Using ggplot2, we visualized the relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.

You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.

With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.

Learn R Programming from Harvard University

Beau Carnes — Tue, 02 Dec 2025 22:18:31 +0000

Harvard University creates amazing beginner computer science courses.

We just released Harvard CS50’s introduction to programming using a language called R, a popular language for statistical computing and graphics in data science and other domains. Carter Zenke developed this course.

Learn to use RStudio, a popular integrated development environment (IDE). Learn to represent real-world data with vectors, matrices, arrays, lists, and data frames. Filter data with conditions, via which you can analyze subsets of data. Apply functions and loops, via which you can manipulate and summarize data sets. Write functions to modularize code and raise exceptions when something goes wrong. Tidy data with R’s tidyverse and create colorful visualizations with R’s grammar of graphics. By course’s end, learn to package, test, and share R code for others to use. Assignments inspired by real-world data sets.

Here are the sections in this course:

Introduction
Representing Data
Transforming Data
Applying Functions
Tidying Data
Visualizing Data
Testing Programs
Packaging Programs

Watch the full course on the freeCodeCamp.org YouTube channel (9-hour watch).

How to Build a Local RAG App with Ollama and ChromaDB in the R Programming Language

Elabonga Atuo — Mon, 14 Apr 2025 18:58:16 +0000

A Large Language Model (LLM) is a type of machine learning model that is trained to understand and generate human-like text. These models are trained on vast datasets to capture the nuances of human language, enabling them to generate coherent and contextually relevant responses.

You can enhance the performance of an LLM by providing context — structured or unstructured data, such as documents, articles, or knowledge bases — tailored to the domain or information you want the model to specialize in. Using techniques like prompt engineering and context injection, you can build an intelligent chatbot capable of navigating extensive datasets, retrieving relevant information, and delivering responses.

Whether it's storing recipes, code documentation, research articles, or answering domain-specific queries, an LLM-based chatbot can adapt to your needs with customization and privacy. You can deploy it locally to create a highly specialized conversational assistant that respects your data.

In this article, you will learn how to build a local Retrieval-Augmented Generation (RAG) application using Ollama and ChromaDB in R. By the end, you'll have a custom conversational assistant with a Shiny interface that efficiently retrieves information while maintaining privacy and customization.

What is RAG?
Project Overview
Project Setup
Ollama Installation
Data Collection and Cleaning
How to Create Chunks
How to Generate Sentence Embeddings
How to Set Up the Vector Database for Embedding Storage
How to Write the User Input Query Embedding Function
Tool Calling
How to Initialize the Chat System, Design Prompts, and Integrate Tools
How to Interact with Your Chatbot Using a Shiny App
Complete Code
Conclusion

What is RAG?

Retrieval-Augmented Generation (RAG) is a method that integrates retrieval systems with generative AI, enabling chatbots to access recent and specific information from external sources.

By using a retrieval pipeline, the chatbot can fetch up-to-date, relevant data and combine it with the generative model’s language capabilities, producing responses that are both accurate and contextually enriched. This makes RAG particularly useful for applications requiring fact-based, real-time knowledge delivery.

Project Overview

Project Setup

Prerequisites

Before you begin, ensure you have installed the latest version of the items listed here:

RStudio: The IDE – RStudio is the primary workspace where you'll write and test your R code. Its user-friendly interface, debugging tools, and integrated environment make it ideal for data analysis and chatbot development.
R: The Programming Language – R is the backbone of your project. You'll use it to handle data manipulation, apply statistical models, and integrate your recipe chatbot components seamlessly.
Python – Some libraries, like the embedding library you'll use for text vectorization, are built on Python. It’s vital to have Python installed to enable these functionalities alongside your R code.
Java – Java serves as a foundational element for certain embedding libraries. It ensures efficient processing and compatibility for text embedding tasks required to train your chatbot.
Docker Desktop – Docker Desktop allows you to run ChromaDB, the vector database, locally on your machine. This enables fast and reliable storage of embeddings, ensuring your chatbot retrieves relevant information quickly.
Ollama – Ollama brings powerful Large Language Models (LLMs) directly to your local computer, removing the need for cloud resources. It lets you access multiple models, customize outputs, and integrate them into your chatbot effortlessly.

Ollama Installation

Ollama is an open-source tool you can use to run and manage LLMs on your computer. Once installed, you can access various LLMs as per your needs. You will be using the llama3.2:3b-instruct-q4_K_M model to build this chatbot.

A quantized model is a version of a machine learning model that has been optimized to use less memory and computational power by reducing the precision of the numbers it uses. This enables you to use an LLM locally, especially when you don’t have access to a GPU (Graphics Processing Unit, a specialized processor that performs complex computations).
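To get a feel for why lower precision saves memory, here is a rough back-of-the-envelope calculation. It assumes a model with about 3 billion weights and compares storing every weight in 16 bits against exactly 4 bits. Real quantized files also store scaling metadata and may keep some weights at higher precision, so actual sizes differ:

```r
# Assumed figures, for illustration only
n_params <- 3e9  # about 3 billion weights

bytes_16bit <- n_params * 16 / 8  # 2 bytes per weight
bytes_4bit  <- n_params * 4 / 8   # half a byte per weight

# Approximate size in gigabytes
c(gb_16bit = bytes_16bit / 1e9, gb_4bit = bytes_4bit / 1e9)  # 6.0 and 1.5
```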

To start, you can download and install the Ollama software here.

Then you can confirm installation by running this command:

ollama --version

Run the following command to start Ollama:

ollama serve

Next, run the following command to pull the Q4_K_M quantization of llama3.2:3b-instruct:

ollama pull llama3.2:3b-instruct-q4_K_M

Then confirm that the model was downloaded with this:

ollama list

If the download was successful, a list containing the model’s name, ID, and size will be returned.

Now you can chat with the model:

ollama run llama3.2:3b-instruct-q4_K_M

If successful, you should receive a prompt that you can test by asking a question and getting an answer.

Then you can exit the console by typing /bye or pressing Ctrl + D.

Data Collection and Cleaning

The chatbot you are building will be a cooking assistant that suggests recipes given your available ingredients, what you want to eat, and how much food a recipe yields.

You first have to get the data that the chatbot will draw on. You will be using a recipe dataset from Kaggle.

To start, load the necessary libraries:

# loading required libraries
library(xml2) #read, parse, and manipulate XML,HTML documents
library(jsonlite) #manipulate JSON objects

library(RKaggle) # download datasets from Kaggle 
library(dplyr)   # data manipulation

Then download and save the recipe dataset:

# Download and read the "recipe" dataset from Kaggle
recipes_list <- RKaggle::get_dataset("thedevastator/better-recipes-for-a-better-life")

Inspect the returned object and extract its first element like this:

# inspect the dataset
class(recipes_list)
str(recipes_list)
head(recipes_list)
# extract the first tibble
recipes_df <- recipes_list[[1]]

A quick inspection of the recipes_list object shows that it contains two objects of type tibble. You will be using only the first element for this project. A tibble is a type of data structure used for storing and manipulating data. It’s similar to a traditional dataframe, but it’s designed to enforce stricter rules and perform fewer automatic actions compared to traditional dataframes.

We’ll use a regular dataframe in this project because more people are likely familiar with it. It can also efficiently handle row indexing, which is crucial for accessing and manipulating specific rows in our recipe dataset.

In the code block below, you’ll convert the tibble to a dataframe and then drop the first column, which is the index column. Then you’ll inspect the newly converted dataframe and drop unnecessary columns.

Unnecessary columns are best removed to streamline the dataset and focus on relevant features. In this project, we’ll drop certain columns that aren’t particularly useful for training the chatbot. This ensures that the model concentrates on meaningful data to improve its accuracy and functionality.

# convert to dataframe and drop the first column
recipes_df <- as.data.frame(recipes_df[, -1])
# inspect the converted dataframe
head(recipes_df)
class(recipes_df)
colnames(recipes_df)
# drop unnecessary columns
cleaned_recipes_df <- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src))

Now you need to identify rows with NA (missing) values, which you can do like this:

# Identify rows and columns with NA values
which(is.na(cleaned_recipes_df), arr.ind = TRUE)

# a quick inspection reveals columns [2:4] have missing values
subset_column_names <- colnames(cleaned_recipes_df)[2:4]
subset_column_names

It is important to handle NA values to ensure that your data is complete, to prevent errors, and to preserve context.
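If the output of which() with arr.ind = TRUE is unfamiliar, the toy example below may help. The three-row data frame is made up and only borrows the column names used in this project:

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  recipe_name = c("Soup", "Salad", "Toast"),
  prep_time   = c("10 mins", NA, NA),
  cook_time   = c("20 mins", NA, "5 mins")
)

# One row per missing cell, giving its row and column position
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column  # recipe_name 0, prep_time 2, cook_time 1
```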

Now, replace the NA values and confirm that there are no missing values:

# Replace NA values dynamically based on conditions
cols_to_modify <- c("prep_time", "cook_time", "total_time")
cleaned_recipes_df[cols_to_modify] <- lapply(
  cleaned_recipes_df[cols_to_modify],
  function(x, df) {
    # Replace NA in prep_time and cook_time where both are NA
    replace(x, is.na(df$prep_time) & is.na(df$cook_time), "unknown")
  },
  df = cleaned_recipes_df  # Pass the whole dataframe for conditions
)
cleaned_recipes_df <- cleaned_recipes_df %>%
  mutate(
    prep_time = case_when(
      # If cooktime is present but preptime is NA, replace with "no preparation required"
      !is.na(cook_time) & is.na(prep_time) ~ "no preparation required",
      # Otherwise, retain original value
      TRUE ~ as.character(prep_time)
    ),
    cook_time = case_when(
      # If prep_time is present but cook_time is NA, replace with "no cooking required"
      !is.na(prep_time) & is.na(cook_time) ~ "no cooking required",
      # Otherwise, retain original value
      TRUE ~ as.character(cook_time)
    )
  )
# confirm there are no missing values
any(is.na(cleaned_recipes_df))

# confirm the replacing NA logic works by inspecting specific rows
cleaned_recipes_df[1081,]
cleaned_recipes_df[1,]
cleaned_recipes_df[405,]

For this tutorial, we’ll subset the dataframe to the first 250 rows for demo purposes. This saves time when generating embeddings.

# recommended for demo/learning purposes
cleaned_recipes_df <- head(cleaned_recipes_df,250)

How to Create Chunks

To understand why chunking is important before embedding, you need to understand what an embedding is.

An embedding is a vector representation of a word or a sentence. Machines don’t understand human text – they understand numbers. LLMs work by transforming human text to numerical representations in order to give answers. The process of generating embeddings requires a lot of computation, and breaking down the data to be embedded optimizes the embedding process.

So now we’re going to split the dataframe into smaller chunks of a specified size to enable efficient batch processing and iteration.

# Define the size of each chunk (number of rows per chunk)
chunk_size <- 1

# Get the total number of rows in the dataframe
n <- nrow(cleaned_recipes_df)

# Create a vector of group numbers for chunking
# Each group number repeats for 'chunk_size' rows
# Ensure the vector matches the total number of rows
r <- rep(1:ceiling(n/chunk_size), each = chunk_size)[1:n]

# Split the dataframe into smaller chunks (subsets) based on the group numbers
chunks <- split(cleaned_recipes_df, r)
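With chunk_size set to 1, every chunk is a single row, so the grouping logic is easy to miss. The sketch below runs the same steps on a made-up five-row data frame with a chunk size of 2 to show how the last, smaller chunk is handled:

```r
# Made-up data frame with five rows
toy_df <- data.frame(id = 1:5, dish = c("a", "b", "c", "d", "e"))

toy_chunk_size <- 2
n_toy <- nrow(toy_df)

# Group numbers: 1 1 2 2 3 (the trailing [1:n_toy] trims the extra 3)
groups <- rep(1:ceiling(n_toy / toy_chunk_size), each = toy_chunk_size)[1:n_toy]

toy_chunks <- split(toy_df, groups)
length(toy_chunks)                       # 3 chunks
chunk_rows <- sapply(toy_chunks, nrow)   # 2 2 1
```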

How to Generate Sentence Embeddings

As previously mentioned, embeddings are vector representations of words or sentences. Embeddings can be generated from both words and sentences. How you choose to generate embeddings depends on your intended application of the LLM.

Word embeddings are numerical representations of individual words in a continuous vector space. They capture semantic relationships between words, allowing similar words to have vectors close to each other.

Word embeddings can be used in search engines as they support word-level queries by matching embeddings to retrieve relevant documents. They can also be used in text classification to classify documents, emails, or tweets based on word-level features (for example, detecting spam emails or sentiment analysis).

Sentence embeddings are numerical representations of entire sentences in a vector space, designed to capture the overall meaning and context of the sentence. They are used in settings where sentences provide better context like question answering systems where user queries are matched to relevant sentences or documents for more precise retrieval.

For our recipe chatbot, sentence embedding is the best choice.

First, create an empty dataframe that has three columns.

#empty dataframe
recipe_sentence_embeddings <-  data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)

The first column will hold the actual recipe in text form, the recipe_vec_embeddings column will hold the generated sentence embeddings, and the recipe_id holds a unique id for each recipe. This will help in indexing and retrieval from the vector database.

Next, it’s helpful to define a progress bar, which you can do like this:

# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)

Embedding can take a while, so it’s important to keep track of the progress of the process.

Now it’s time to generate embeddings and populate the dataframe.

Write a for loop that runs the code block once for each chunk.

for (i in 1:length(chunks)) {}

The recipe field is the text of the chunk currently being processed, and the unique recipe id is generated by pasting the text "recipe" together with the index of the chunk.

for (i in 1:length(chunks)) {
    recipe <- as.character(chunks[i])
    recipe_id <- paste0("recipe",i)
}

The textEmbed() function from the text library generates either sentence or word embeddings. It takes in a character variable or a dataframe and produces a tibble of embeddings. You can read the loading instructions at https://www.r-text.org/ for smooth running of the text library.

The batch_size argument defines how many rows are embedded at a time from the input. Setting keep_token_embeddings to FALSE discards the embeddings for individual tokens after processing, and setting aggregation_from_layers_to_tokens to "concatenate" combines embeddings from the specified layers to create detailed embeddings for each token. A token is the smallest unit of text that a model can process.

for (i in 1:length(chunks)) {
    recipe <- as.character(chunks[i])
    recipe_id <- paste0("recipe",i)
    recipe_embeddings <- textEmbed(as.character(recipe),
                                layers = 10:11,
                                aggregation_from_layers_to_tokens = "concatenate",
                                aggregation_from_tokens_to_texts = "mean",
                                keep_token_embeddings = FALSE,
                                batch_size = 1
  )
}

To get sentence embeddings, set the aggregation_from_tokens_to_texts argument to "mean".

aggregation_from_tokens_to_texts = "mean"

The "mean" operation averages the embeddings of all tokens in a sentence to generate a single vector that represents the entire sentence. This sentence-level embedding captures the overall meaning and semantics of the text, regardless of its token length.
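The averaging step can be illustrated with plain base R. The numbers below are made up (real embeddings have hundreds of dimensions), and this only sketches the idea of mean pooling rather than what textEmbed() does internally:

```r
# Made-up 3-dimensional embeddings for a 4-token sentence (one row per token)
token_embeddings <- matrix(
  c(0.2, 0.4, 0.6,
    0.0, 0.2, 0.4,
    0.4, 0.6, 0.8,
    0.2, 0.0, 0.2),
  nrow = 4, byrow = TRUE
)

# Average each dimension across tokens to get one vector for the sentence
sentence_embedding <- colMeans(token_embeddings)
sentence_embedding  # 0.2 0.3 0.5
```

However many tokens the sentence has, the result always has the same length as a single token embedding.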

# convert tibble to vector
  recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
  recipe_vec_embeddings <- list(recipe_vec_embeddings)

The embedding function returns a tibble object. To obtain a vector embedding, first unlist the tibble and drop the names, which gives a plain numeric vector. Then wrap that vector in a list so that it can be stored as a single entry in the recipe_vec_embeddings list column.
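Here is the same two-step conversion on a tiny stand-in object. The one-row data frame and its Dim1 to Dim3 column names are made up for illustration and are much smaller than a real embedding:

```r
# Stand-in for an embedding result: one row, one column per dimension
toy_embedding <- data.frame(Dim1 = 0.1, Dim2 = 0.2, Dim3 = 0.3)

# Step 1: flatten to a plain numeric vector without names
flat <- unlist(toy_embedding, use.names = FALSE)
flat           # 0.1 0.2 0.3

# Step 2: wrap the vector in a list so it counts as one item
wrapped <- list(flat)
length(flat)     # 3 numbers
length(wrapped)  # 1 list element holding all 3 numbers
```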

  # Append the current chunk's data to the dataframe
  recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )

Finally, update the empty dataframe after each iteration with the newly generated data.

  # track embedding progress
  setTxtProgressBar(pb, i)

In order to keep track of the embedding progress, you can use the earlier defined progress bar inside the loop. It will update at the end of every iteration.

Complete Code Block:

# load required library
library(text)
# # ensure to read loading instructions here for smooth running of the 'text' library
# # https://www.r-text.org/
# embedding data
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe",i)
  recipe_embeddings <- textEmbed(as.character(recipe),
                                layers = 10:11,
                                aggregation_from_layers_to_tokens = "concatenate",
                                aggregation_from_tokens_to_texts = "mean",
                                keep_token_embeddings = FALSE,
                                batch_size = 1
  )

  # convert tibble to vector
  recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
  recipe_vec_embeddings <- list(recipe_vec_embeddings)

  # Append the current chunk's data to the dataframe
  recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )

  # track embedding progress
  setTxtProgressBar(pb, i)

}

How to Set Up the Vector Database for Embedding Storage

A vector database is a special type of database that stores embeddings and allows you to query and retrieve relevant information. There are numerous vector databases available, but for this project, you will use ChromaDB, an open-source option that integrates with the R environment through the rchroma library.

ChromaDB runs locally in a Docker container. Just make sure you have Docker installed and running on your device.

Then load the rchroma library and run your ChromaDB instance:

# load rchroma library
library(rchroma)
# run ChromaDB instance.
chroma_docker_run()

If it was successful, you should see this in the console:

Next, connect to a local ChromaDB instance and check the connection:

# Connect to a local ChromaDB instance
client <- chroma_connect()

# Check the connection
heartbeat(client)
version(client)

Now you’ll need to create a collection and confirm that it was created. Collections in ChromaDB function similarly to tables in conventional databases.

# Create a new collection
create_collection(client, "recipes_collection")

# List all collections
list_collections(client)

Now, add the embeddings to recipes_collection using the add_documents() function.

# Add documents to the collection
add_documents(
  client,
  "recipes_collection",
  documents = recipe_sentence_embeddings$recipe,
  ids = recipe_sentence_embeddings$recipe_id,
  embeddings = recipe_sentence_embeddings$recipe_vec_embeddings
)

The add_documents() function is used to add recipe data to the recipes_collection. Here's a breakdown of its arguments and how the corresponding data is accessed:

documents: This argument represents the recipe text. It is sourced from the recipe column of the recipe_sentence_embeddings dataframe.
ids: This is the unique identifier for each recipe. It is extracted from the recipe_id column of the same dataframe.
embeddings: This contains the sentence embeddings, which were previously generated for each recipe. These embeddings are accessed from the recipe_vec_embeddings column of the dataframe.

All three arguments—documents, ids, and embeddings—are obtained by subsetting their respective columns from the recipe_sentence_embeddings dataframe.
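Before loading the data, it can help to confirm that the three columns line up. The following is a minimal sketch on a toy data frame. The values and the check_embeddings() helper are hypothetical and not part of rchroma. It checks that the ids are unique and that every embedding vector has the same length.

```r
# Toy stand-in for recipe_sentence_embeddings (values are made up)
toy <- data.frame(
  recipe = c("pancakes", "omelette"),
  recipe_id = c("recipe1", "recipe2")
)
toy$recipe_vec_embeddings <- list(c(0.1, 0.2, 0.3), c(0.4, 0.5, 0.6))

# Hypothetical helper: TRUE when ids are unique and all vectors share one length
check_embeddings <- function(df) {
  dims <- vapply(df$recipe_vec_embeddings, length, integer(1))
  anyDuplicated(df$recipe_id) == 0 && length(unique(dims)) == 1
}

check_embeddings(toy)
```

If a check like this fails, inspecting the offending rows before calling add_documents() is usually easier than debugging a rejected upload.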

How to Write the User Input Query Embedding Function

In order to retrieve information from a vector database, you must first embed your query text. The database compares your query's embedding with its stored embeddings to find and retrieve the most relevant document.

It's important to ensure that the dimensionality of your query embedding (the length of the embedding vector) matches that of the stored embeddings. You achieve this by generating the query embedding with the same embedding model and the same settings used for the documents.

Matching embeddings involves calculating the similarity (for example, cosine similarity) between the query and stored embeddings, identifying the closest match for effective retrieval.
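To make the idea concrete, here is a minimal base R sketch of cosine similarity on toy vectors. The vectors and the cosine_similarity() helper are made up for illustration, and the metric a vector database actually uses depends on how the collection is configured.

```r
# Cosine similarity between two numeric vectors of equal length
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional "embeddings" (made-up values)
query <- c(1, 0, 1)
stored <- list(
  recipe1 = c(1, 0, 1),   # same direction as the query
  recipe2 = c(0, 1, 0)    # orthogonal to the query
)

# Score every stored vector against the query and pick the best match
scores <- vapply(stored, cosine_similarity, numeric(1), b = query)
names(which.max(scores))
```

The stopifnot() call inside the helper is the dimension check described above: vectors of different lengths cannot be compared.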

Let’s write a function that embeds a query and then uses the generated embedding to retrieve similar documents. Wrapping this logic in a function makes it reusable.

  #sentence embeddings function and query
  question <- function(sentence){
    sentence_embeddings <- textEmbed(sentence,
                                     layers = 10:11,
                                     aggregation_from_layers_to_tokens = "concatenate",
                                     aggregation_from_tokens_to_texts = "mean",
                                     keep_token_embeddings = FALSE
    )

    # convert tibble to vector
    sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE)
    sentence_vec_embeddings <- list(sentence_vec_embeddings)

    # Query similar documents using embeddings
    results <- query(
      client,
      "recipes_collection",
      query_embeddings = sentence_vec_embeddings ,
      n_results = 2
    )
    results

  }

This code mirrors the earlier use of the textEmbed() function. The added query() call searches the vector database, specifically recipes_collection, and returns the two documents that most closely match the user’s query.

The function therefore takes a sentence as its argument, embeds it, queries the database with the resulting embedding, and returns the two best-matching documents.

Tool Calling

To interact with Ollama in R, you will utilize the ellmer library. This library streamlines the use of large language models (LLMs) by offering an interface that enables seamless access to and interaction with a variety of LLM providers.

To make the LLM more useful, you need to give it context. You can do this with tool calling, which lets an LLM access external resources in order to extend its functionality.

For this project, we are implementing Retrieval-Augmented Generation (RAG), which combines retrieving relevant information from a vector database and generating responses using an LLM. This approach improves the chatbot's ability to provide accurate and contextually relevant answers.

Now, define a function that links to the LLM to provide context using the tool() function from the ellmer library.

# load ellmer library
library(ellmer)

# function that links to llm to provide context
  tool_context  <- tool(
    question,
    "obtains the right context for a given question",
    sentence = type_string()

  )

The first argument to tool() is the question function, which returns the relevant documents that serve as context. The LLM uses these documents to answer questions accordingly.

The text, "obtains the right context for a given question", is a description of what the tool will be doing.

Finally, the sentence = type_string() defines what type of object the question() function expects.

How to Initialize the Chat System, Design Prompts, and Integrate Tools

Next, you’ll set up a conversational AI system by defining its role and functionality. Using system prompt design, you will shape the assistant’s behavior, tone, and focus as a culinary assistant. You’ll also integrate external tools to extend the chatbot’s capabilities by registering tools. Let’s dive in.

First, you need to initialize a Chat Object:

#  Initialize the chat system with prompt instructions.
  chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity.",
                      model = "llama3.2:3b-instruct-q4_K_M")

You can do that using the chat_ollama() function. This sets up a conversational agent with the specified system prompt and model.

The system prompt defines the conversational behavior, tone, and focus of the LLM while the model argument specifies the language model (llama3.2:3b-instruct-q4_K_M) that the chat system will use to generate responses.
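Because the prompt above is a single string literal spread over several lines, the line breaks and leading spaces become part of the prompt text. That is usually harmless, but if you prefer a clean single-line prompt, one option is to assemble it with paste(). This is only a sketch of an alternative, and the variable name system_prompt is an assumption.

```r
# Build the system prompt from separate sentences so no indentation leaks in
system_prompt <- paste(
  "You are a knowledgeable culinary assistant specializing in recipe recommendations.",
  "You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.",
  "Ensure the recipes align closely with the user's inputs and yield the expected quantity."
)

# The result is one string with single spaces between sentences
nchar(system_prompt)
```

You could then pass this variable as the system_prompt argument of chat_ollama() instead of the inline literal.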

Next, you need to register a tool.

 #register tool
  chat$register_tool(tool_context)

The chat object needs to know about the tool_context tool. You do this by registering it with the register_tool() method.

How to Interact with Your Chatbot Using a Shiny App

To interact with the chatbot you’ve just created, we’ll use Shiny, a framework for building interactive web applications in R. Shiny provides a user-friendly graphical interface that allows seamless interaction with the chatbot.

For this purpose, we’ll use the shinychat library, which simplifies the process of building a chat interface within a Shiny app. This involves defining two key components:

User Interface (UI):
- Responsible for the visual layout and what the user sees.
- In this case, chat_ui("chat") is used to create the interactive chat interface.
Server Function:
- Handles the functionality and logic of the application.
- It connects the chatbot to external tools and manages processes like embedding queries, retrieving relevant responses, and handling user inputs.

# load the required library
library(shinychat)

# wrap the chat code in a Shiny App
ui <- bslib::page_fluid(
  chat_ui("chat")
)

server <- function(input, output, session) {
  # Connect to a local ChromaDB instance running on docker with embeddings loaded
  client <- chroma_connect()

  #sentence embeddings function and query
  question <- function(sentence){
    sentence_embeddings <- textEmbed(sentence,
                                     layers = 10:11,
                                     aggregation_from_layers_to_tokens = "concatenate",
                                     aggregation_from_tokens_to_texts = "mean",
                                     keep_token_embeddings = FALSE
    )

    # convert tibble to vector
    sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE)
    sentence_vec_embeddings <- list(sentence_vec_embeddings)

    # Query similar documents using embeddings
    results <- query(
      client,
      "recipes_collection",
      query_embeddings = sentence_vec_embeddings ,
      n_results = 2
    )
    results

  }


  # function that provides context
  tool_context  <- tool(
    question,
    "obtains the right context for a given question",
    sentence = type_string()

  )

  #  Initialize the chat system with the first chunk
  chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity.",
                      model = "llama3.2:3b-instruct-q4_K_M")
  #register tool
  chat$register_tool(tool_context)

  observeEvent(input$chat_user_input, {
    stream <- chat$stream_async(input$chat_user_input)
    chat_append("chat", stream)
  })
}

shinyApp(ui, server)

Alright, let’s understand how this is working:

User input monitoring with observeEvent(): The observeEvent() block monitors user inputs from the chat interface (input$chat_user_input). When a user sends a message, the chatbot processes it, retrieves relevant context using the embeddings, and streams the response dynamically to the chat interface.
Tool calling for context: The chatbot employs tool calling to interact with external resources (like the vector database) and enhance its functionality. In this project, Retrieval-Augmented Generation (RAG) ensures the chatbot provides accurate and context-rich responses by integrating retrieval and generation seamlessly.

This approach brings the chatbot to life, enabling users to interact with it dynamically through a responsive Shiny app.

Complete Code

The R scripts have been split in two, with data.R containing code that handles data gathering and cleaning, text chunking, sentence embeddings generation, creating a vector database, and loading documents to it.

The chat.R script contains code that handles user input querying, context retrieval, chat initialization, system prompt design, tool integration, and a chat Shiny app.

data.R

# install and load required packages
# install devtools from CRAN
install.packages('devtools')
devtools::install_github("benyamindsmith/RKaggle")

library(text)
library(rchroma)
library(RKaggle)
library(dplyr)

# run ChromaDB instance.
chroma_docker_run()

# Connect to a local ChromaDB instance
client <- chroma_connect()

# Check the connection
heartbeat(client)
version(client)


# Create a new collection
create_collection(client, "recipes_collection")

# List all collections
list_collections(client)

# Download and read the "recipe" dataset from Kaggle
recipes_list <- RKaggle::get_dataset("thedevastator/better-recipes-for-a-better-life")

# extract the first tibble
recipes_df <- recipes_list[[1]]

# convert to dataframe and drop the first column
recipes_df <- as.data.frame(recipes_df[, -1])

# drop unnecessary columns
cleaned_recipes_df <- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src))

## Replace NA values dynamically based on conditions
# Replace NA when all columns have NA values
cols_to_modify <- c("prep_time", "cook_time", "total_time")
cleaned_recipes_df[cols_to_modify] <- lapply(
  cleaned_recipes_df[cols_to_modify],
  function(x, df) {
    # Replace NA in prep_time and cook_time where both are NA
    replace(x, is.na(df$prep_time) & is.na(df$cook_time), "unknown")
  },
  df = cleaned_recipes_df  
)

# Replace NA when only one of the two columns has an NA value
cleaned_recipes_df <- cleaned_recipes_df %>%
  mutate(
    prep_time = case_when(
      # If cook_time is present but prep_time is NA, replace with "no preparation required"
      !is.na(cook_time) & is.na(prep_time) ~ "no preparation required",
      # Otherwise, retain original value
      TRUE ~ as.character(prep_time)
    ),
    cook_time = case_when(
      # If prep_time is present but cook_time is NA, replace with "no cooking required"
      !is.na(prep_time) & is.na(cook_time) ~ "no cooking required",
      # Otherwise, retain original value
      TRUE ~ as.character(cook_time)
    )
  )

# chunk the dataset
chunk_size <- 1
n <- nrow(cleaned_recipes_df)
r <- rep(1:ceiling(n/chunk_size),each = chunk_size)[1:n]
chunks <- split(cleaned_recipes_df,r)

#empty dataframe
recipe_sentence_embeddings <-  data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)

# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)

# embedding data
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe",i)
  recipe_embeddings <- textEmbed(as.character(recipe),
                                layers = 10:11,
                                aggregation_from_layers_to_tokens = "concatenate",
                                aggregation_from_tokens_to_texts = "mean",
                                keep_token_embeddings = FALSE,
                                batch_size = 1
  )

  # convert tibble to vector
  recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
  recipe_vec_embeddings <- list(recipe_vec_embeddings)

  # Append the current chunk's data to the dataframe
  recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )

  # track embedding progress
  setTxtProgressBar(pb, i)

}

# Add documents to the collection
add_documents(
  client,
  "recipes_collection",
  documents = recipe_sentence_embeddings$recipe,
  ids = recipe_sentence_embeddings$recipe_id,
  embeddings = recipe_sentence_embeddings$recipe_vec_embeddings
)
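
data.R uses chunk_size <- 1, so each recipe becomes its own chunk. To see how the same chunking lines behave with a larger chunk size, here is a sketch on a toy data frame (toy_df and its values are made up for illustration).

```r
# Toy data frame standing in for cleaned_recipes_df
toy_df <- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"))

chunk_size <- 2
n <- nrow(toy_df)

# Group label for each row; the last group may be smaller than chunk_size
r <- rep(1:ceiling(n / chunk_size), each = chunk_size)[1:n]
chunks <- split(toy_df, r)

length(chunks)          # number of chunks
sapply(chunks, nrow)    # rows per chunk
```

With five rows and a chunk size of two, the rows are grouped as two full chunks and one chunk holding the remaining row.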

chat.R

# Load required packages
library(ellmer)
library(text)
library(rchroma)
library(shinychat)

ui <- bslib::page_fluid(
  chat_ui("chat")
)

server <- function(input, output, session) {
  # Connect to a local ChromaDB instance running on docker with embeddings loaded 
  client <- chroma_connect()

  # sentence embeddings function and query
  question <- function(sentence){
    sentence_embeddings <- textEmbed(sentence,
                                     layers = 10:11,
                                     aggregation_from_layers_to_tokens = "concatenate",
                                     aggregation_from_tokens_to_texts = "mean",
                                     keep_token_embeddings = FALSE
    )

    # convert tibble to vector
    sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE)
    sentence_vec_embeddings <- list(sentence_vec_embeddings)

    # Query similar documents
    results <- query(
      client,
      "recipes_collection",
      query_embeddings = sentence_vec_embeddings ,
      n_results = 2
    )
    results

  }


  # function that provides context
  tool_context  <- tool(
    question,
    "obtains the right context for a given question",
    sentence = type_string()

  )

  #  Initialize the chat system 
  chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. 
                      You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings.
                      Ensure the recipes align closely with the user's inputs and yield the expected quantity.",
                      model = "llama3.2:3b-instruct-q4_K_M")
  #register tool
  chat$register_tool(tool_context)

  observeEvent(input$chat_user_input, {
    stream <- chat$stream_async(input$chat_user_input)
    chat_append("chat", stream)
  })
}

shinyApp(ui, server)

You can find the complete code here.

Conclusion

Building a local Retrieval-Augmented Generation (RAG) application using Ollama and ChromaDB in R offers a powerful way to create a specialized conversational assistant.

By leveraging the capabilities of large language models and vector databases, you can efficiently manage and retrieve relevant information from extensive datasets.

This approach not only enhances the performance of language models but also ensures customization and privacy by running the application locally.

Whether you're developing a cooking assistant or any other domain-specific chatbot, this method provides a robust framework for delivering intelligent and contextually aware responses.

How to Build a Weather App with R Shiny

Elabonga Atuo — Mon, 09 Dec 2024 15:30:42 +0000

In this tutorial, you’ll learn how to build a weather app in R. Really – a weather app, in R? Wait, hear me out.

When you think of R, you probably imagine someone wearing chunky thick prescription glasses and devouring a book. You know, a statistician dealing with complex models, an insane amount of mathematical equations, and copious amounts of data.

But R is far more than just a tool for statistics. It shines when you need to turn raw data into actionable insights and present those insights in a clear, engaging way.

With frameworks like Shiny, R takes this one step further, enabling you to create fully interactive web apps without having to worry about frontends, backends, or learning an entirely new programming language.

In this tutorial, you will create a simple weather app that fetches data from an API and displays the results in a good-looking app.

Project Overview
Project Setup
API Keys: Storage and Retrieval
How to Make Your First API Call
How to Build the Shiny App
Conclusion

Project Overview

Here’s what we’re going to be building:

For the weather app to work, you will need to make two separate API calls. We’ll use the One Call API 3.0 to update weather data and the OpenWeather API for geocoding. You can get your API Key here. Just keep in mind that if this is your first time signing up for an API key, activation may take up to 24 hours.

The weather app will take the location/city from user input. The input will then be geocoded by making the call to OpenWeather API. Then, from its response, the coordinates (latitude and longitude) will be extracted. The coordinates will be used as query arguments for the One Call API call to obtain the weather data in JSON format.

Prerequisites:

To follow along with this tutorial, you will need:

R programming knowledge
HTML and a bit of JavaScript knowledge
RStudio installed

Project Setup

Create a folder in your desired directory. Set and confirm the project folder as the working directory using the following command in the R console:

setwd("path/to/your/project/folder")
getwd()

Create a project in the set path using the following command:

#create R project
usethis::create_project(path = ".", open = FALSE)

You should have a folder structure that looks like this.

Create an R file in the root directory and save it as app.R. All your R code will be contained here.

Install and load the following libraries that you are going to work with:

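# install the packages once, if you have not already:
# install.packages(c("shiny", "bslib", "shinyjs", "httr2", "lubridate", "shiny.semantic"))
library(shiny)
library(bslib)
library(shinyjs)
library(httr2)
library(lubridate)
library(shiny.semantic)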

API Keys: Storage and Retrieval

Storing your credentials in a location separate from your scripts and global environment is a good practice. This ensures security, scalability, and flexibility, especially when working in shared or production environments. The .Renviron file best serves that purpose.

Open and edit your .Renviron file in the following way:

#open and edit .Renviron
usethis::edit_r_environ(scope = "project")

The scope argument set to project sets up the .Renviron specifically to your project. In the newly opened file, add your API key as follows:

OPENWEATHERAPIKEY="yourapikey"

How to Make Your First API Call

You will be using the httr2 library (the successor to httr) to obtain data from the API. It gives you more control over how you make requests to the web.

Make the API Key accessible in the script

First, you’ll need to securely access and store the API key in the script without hardcoding it. You can do that like this:

#access API keys in script
readRenviron(".Renviron")
api_key <- Sys.getenv("OPENWEATHERAPIKEY")
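
Sys.getenv() returns an empty string when a variable is not set, so a missing key can go unnoticed until the first API call fails. A minimal sketch of an early check follows. The get_api_key() helper and the dummy key value are hypothetical and only for illustration.

```r
# Hypothetical helper: read a key from the environment and fail early if missing
get_api_key <- function(name = "OPENWEATHERAPIKEY") {
  key <- Sys.getenv(name)   # returns "" when the variable is not set
  if (!nzchar(key)) {
    stop("Environment variable ", name, " is not set. Check your .Renviron file.")
  }
  key
}

# Demonstration with a dummy value set for this session only
Sys.setenv(OPENWEATHERAPIKEY = "dummy-key-for-demo")
api_key <- get_api_key()
```

In the real app you would skip the Sys.setenv() line and rely on the value loaded from .Renviron.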

Define the Geocoding Function

You will create a function that takes a location and an API key as inputs, sends a request to the OpenWeather geocoding API, and returns the coordinates of the specified location.

Start by creating a request. The pipe operator (|>) chains the steps of the HTTP request in a clear, readable way. The geocoding URL takes two parameters: the location, passed as q, and the API key, passed as appid. The req_url_query() function appends these parameters to the URL as a query string.

Then chain req_perform() to send the request and resp_body_json() to parse the response body as JSON. The last line passes the parsed body to coordinates(), a helper function defined in the next section.

# Geocoding URL
geocoding_url <- "https://api.openweathermap.org/data/2.5/weather"
geocode <- function(location, api_key) {
  request(geocoding_url) |> 
    req_url_query(`q` = location, `appid` = api_key) |> 
    req_perform() |> 
    resp_body_json() |>
    coordinates()
}

Define the coordinate-extracting function

The coordinates() function is a helper function that extracts the latitude and longitude values from the JSON response. A quick inspection of the JSON response reveals where the coordinates are stored. The parsed JSON object is simply a nested list, and you can access elements by subsetting it.

A blank data body would imply that the city/location is unavailable, and you’d get the message "No such city exists!". If the JSON contains an element, the length would be more than 0 – it is a list after all.

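coordinates <- function(body) {
  # a non-empty body means the response contains data for the location
  if(length(body) != 0) { 
    lat <- body$coord$lat
    lng <- body$coord$lon
    town <- body$name
    # c() coerces lat and lng to character because town is a string
    c(lat, lng, town)
  } else {
    "No such city exists!"
  }
}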
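
To try the helper without calling the API, you can feed it a hand-written list shaped like the parsed response. The sketch below repeats the helper in compact form so it runs on its own. The mock_body values are illustrative and not real API output.

```r
# Compact copy of the helper above so this sketch runs on its own
coordinates <- function(body) {
  if (length(body) != 0) {
    c(body$coord$lat, body$coord$lon, body$name)
  } else {
    "No such city exists!"
  }
}

# Hand-written stand-in for a parsed API response (values are illustrative)
mock_body <- list(coord = list(lon = 36.8167, lat = -1.2833), name = "Nairobi")

result <- coordinates(mock_body)
result                          # a character vector: c() turns the numbers into strings

lat <- as.numeric(result[1])    # convert back before using them as numbers
lng <- as.numeric(result[2])

coordinates(list())             # an empty body falls through to the message
```

Because the result is a character vector, converting the first two elements with as.numeric() is a reasonable precaution wherever numeric coordinates are needed.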

Define the weather-update function

You will create a function that sends a request to the OpenWeather API with specified query parameters, handles errors using a predefined function, and returns the parsed JSON response containing the weather data.

As implemented in the geocoding function, start by creating a request and adding the necessary query parameters using the req_url_query() function. The openweather_json() function accepts two main arguments:

api_key: A required argument used for authentication with the OpenWeather API. It is matched by position.
...: This represents optional keyword arguments that you can use to customize the query. You can pass as many additional parameters as needed, provided they are specified as named arguments.

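# One Call API 3.0 endpoint used for the weather data
current_weather_url <- "https://api.openweathermap.org/data/3.0/onecall"

openweather_json <- function(api_key, ...) { 
  request(current_weather_url) |> 
    # named arguments passed through ... become query parameters
    req_url_query(..., `appid` = api_key, `units` = "metric") |> 
    req_error(body = openweather_error_body) |>
    req_perform() |> 
    resp_body_json()
}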
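
The way ... forwards named arguments can be seen without making a request. In this sketch, build_query() is a hypothetical helper that only mirrors how the parameters are assembled, and the key and coordinates are dummy values.

```r
# Hypothetical helper that mirrors how openweather_json() assembles its query
build_query <- function(api_key, ...) {
  c(list(...), list(appid = api_key, units = "metric"))
}

# lat and lon are passed as named arguments and forwarded through ...
params <- build_query("dummy-key", lat = -1.2833, lon = 36.8167)
names(params)
```

Any other named argument supplied by the caller would be added to the list in the same way.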

Error Handling: Extracting and Managing Status Codes

You will create an error-handling function that extracts non-200 status codes from a response and defines how to manage them. The structure of this function depends on how the API reports errors and where the relevant information is stored.

Define the weather-update error body

The req_error() in openweather_json() introduces a new concept: error handling. API requests may throw exceptions, and getting the status codes helps you know what message to show the user and how to resolve it.

Create an error body, which is a function that extracts the error message when the status code is not 200 (200 means everything is OK).

The function takes a response and extracts the error message stored in the message element of the JSON body. The underscore (_) is a placeholder for the parsed JSON object. Using it with $ in this way requires R 4.3 or later.

openweather_error_body <- function(resp) {
  resp |> resp_body_json() |> _$message 
}
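
If you are on an R version where the _ placeholder cannot be used for extraction, an anonymous function gives the same result. This is a sketch on a hand-written list. The error_body values are illustrative and not real API output.

```r
# Hand-written stand-in for a parsed error body (values are illustrative)
error_body <- list(cod = 401, message = "Invalid API key.")

# Anonymous-function form: works without the `_` extraction placeholder
msg <- error_body |> (\(x) x$message)()
msg
```

The same pattern could replace _$message inside openweather_error_body() without changing what the function returns.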

Define the geocode error body

This error body function will prove useful in the Shiny App. This is a simple walkthrough.

The req_error() function allows you to customize how response errors are handled. Its is_error argument determines whether a given response should be considered an error. By setting is_error to \(resp) FALSE (an anonymous function that always returns FALSE), all responses, regardless of the status code, are treated as successful. This prevents the app from exiting due to non-200 status codes.

With this setup, you can perform the request and pipe the response into the resp_status() function to retrieve the exact status code.

openstreetmap_error_body <- function(location, api_key) {
  resp <- request(geocoding_url) |> 
    req_url_query(`q` = location, `appid` = api_key) |> 
    req_error(is_error = \(resp) FALSE) |>
    req_perform() |>  resp_status()
  resp
}
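
Once you have the status code, you need to decide what to tell the user. One way to do that is a small lookup, sketched below. The status_message() helper and its wording are hypothetical. The codes themselves are standard HTTP status codes.

```r
# Hypothetical helper: turn an HTTP status code into a message for the user
status_message <- function(status) {
  switch(
    as.character(status),
    "200" = "OK",
    "401" = "The API key was rejected. Check the key in .Renviron.",
    "404" = "No such city exists!",
    "429" = "Too many requests. Try again later.",
    paste("Unexpected status code:", status)  # fallback for anything else
  )
}

status_message(404)
status_message(500)
```

In the Shiny app, a message like this could be shown in a notification instead of letting the request error stop the session.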

How to Build the Shiny App

Now that you have nailed down how to obtain data from the API, it’s time to render the results in an interpretable and interactive format. For this, you will use Shiny. Shiny is a framework that allows you to create interactive web apps.

A Shiny App is made up of two components:

The UI: what the user interacts with. It defines the layout and appearance of the app.
The server: contains the app’s logic and behaviour.

Building the Shiny UI

Shiny UI provides a collection of elements that allow users to input data, make selections, and trigger events seamlessly.

You will include a textInput element that takes in the location and the weather data will be fetched and rendered upon submission. The input_task_button button prevents the user from clicking when an API call is in progress. The other elements are output elements where the weather data will be displayed and a mode-switching button.

Styling the Shiny app

You can use shiny.semantic, a library built on top of Fomantic-UI, to style your Shiny dashboard. Fomantic-UI is a front-end framework that provides a rich collection of pre-styled HTML components like buttons, modals, form inputs, and more. It simplifies UI design by allowing developers to create visually appealing and responsive interfaces without needing extensive custom CSS or HTML knowledge.

Fomantic-UI styling is applied by wrapping elements in their corresponding classes, which define their behavior and appearance.

A grid in Fomantic-UI is a flexible layout system used to organize content. It acts as a canvas that divides the layout into rows (horizontally aligned) and columns (vertically aligned). A root grid can contain up to 16 columns, making it ideal for creating structured and responsive designs.

To specify a column's width, you append classes like wide and the size (a number from 1 to 16) to represent its span. The total width of all columns in a row should sum up to 16.
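
The width rule can be checked in plain R before you write any UI code. In this sketch, column_classes() is a hypothetical helper, not part of shiny.semantic. It turns numeric widths into Fomantic-UI class strings and stops if a row does not add up to 16.

```r
# Number words used by Fomantic-UI column classes
width_words <- c("one", "two", "three", "four", "five", "six", "seven", "eight",
                 "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                 "fifteen", "sixteen")

# Hypothetical helper: build column classes and check the row fills the grid
column_classes <- function(widths) {
  stopifnot(all(widths %in% 1:16), sum(widths) == 16)
  paste(width_words[widths], "wide column")
}

# A row split into widths 2, 10, 2 and 2
column_classes(c(2, 10, 2, 2))
```

Each resulting string can be used as the class argument of a div().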

A segment groups related content, while a card displays detailed, content-rich items, such as a user's social media profile. Dividers are visual elements used to separate sections or content within a layout.

For the weather app, first create a div of class grid within which you’ll nest the various elements.

Search bar section

Divide the grid into sixteen columns and create a segment that groups elements in the search bar section. Add a theme toggle button, location input that takes in user input, a search button for submitting the location to the API, and a notification button, defining their width by the column size.

div(class = "sixteen wide column",
          div(class = "ui segment",
              div(class = "ui grid",
                  div(class = "two wide column",
                      button(
                        class = "ui button icon basic",
                        input_id = "darkmode",
                        label = NULL,
                        icon = icon("moon icon")
                      )
                  ),
                  div(class = "ten wide column",
                      textInput(
                        "location",
                        label = NULL,
                        placeholder = "Search for your preferred city"
                      )
                  ),
                  div(class = "two wide column",
                      tags$div(
                        class = "ui button",
                        id = "my-custom-button",
                        input_task_button("search", label = "Search", icon = icon("search"))
                      )
                  ),
                  div(class = "two wide column",
                      actionButton("show_alert", label = icon("bell"), class = "bell-no-alert"),
                      textOutput("alert_message")
                  )
              )
          )
      )

Location and current weather section

Divide the grid into sixteen columns and nest another grid within the partitions that will host two columns.

Within the grid, define two columns. The first column is for time, location, and date data, and the second column will hold current weather data.

Then create card elements to hold each weather parameter, its unit of measurement, and the corresponding icon.

div(class = "sixteen wide column",
          div(class = "ui equal-height-grid grid",
              div(class = "left floated center aligned four wide column",
                  div(class = "ui raised equal-height-two-segment segment",
                      style = "flex: 1;",
                      div(class = "column center aligned",
                          div(class = "ui hidden section divider"),
                          span(class = "ui large text", textOutput("city")),
                          div(class = "ui hidden section divider"),
                          span(class = "ui big text", textOutput("currentTime")),
                          div(class = "ui hidden section divider"),
                          span(class = "ui large text", textOutput("currentDate")),
                          div(class = "ui hidden section divider")
                      )
                  )
              ),
              div(class = "right floated center aligned twelve wide column",
                  div(class = "ui raised segment",
                      div(class = "ui horizontal equal width segments",
                          div(class = "ui equal-height-two-segment segment",
                              style = "flex: 3;",
                              div(class = "column",
                                  span(class = "ui big text centered", textOutput("currentTemp")),
                                  textOutput("feelsLike"),
                                  card(
                                    class = "ui mini",
                                    div(class = "content", icon(class = "large sun"),
                                        div(class = "sub header", "Sunrise"),
                                        div(class = "description", textOutput("sunriseTime"))
                                    )
                                  ),
                                  card(
                                    class = "ui mini",
                                    div(class = "content", icon(class = "large moon"),
                                        div(class = "sub header", "Sunset"),
                                        div(class = "description", textOutput("sunsetTime"))
                                    )
                                  )
                              )
                          ),
                          div(class = "ui segment",
                              style = "flex: 3;",
                              div(
                                class = "column center aligned",
                                div(class = "ui hidden divider"),
                                htmlOutput("currentWeatherIcon"),
                                span(class = "ui large text", textOutput("currentWeatherDescription"))
                              )
                          ),
                          div(class = "ui segment",
                              style = "flex: 3;",
                              div(class = "column",
                                  card(
                                    class = "ui tiny",
                                    div(class = "content", icon(class = "big tint"),
                                        div(class = "sub header", "Humidity"),
                                        div(class = "description", textOutput("currentHumidity"))
                                    )
                                  ),
                                  card(
                                    class = "ui tiny",
                                    div(class = "content", icon(class = "big tachometer alternate"),
                                        div(class = "sub header", "Pressure"),
                                        div(class = "description", textOutput("currentPressure"))
                                    )
                                  )
                              )
                          ),
                          div(class = "ui segment",
                              style = "flex: 3;",
                              div(class = "column center aligned",
                                  card(
                                    class = "ui tiny",
                                    div(class = "content", icon(class = "big wind"),
                                        div(class = "sub header", "Wind Speed"),
                                        div(class = "description", textOutput("currentWindSpeed"))
                                    )
                                  ),
                                  card(
                                    class = "ui tiny",
                                    div(class = "content", icon(class = "big umbrella"),
                                        div(class = "sub header", "UV Index"),
                                        div(class = "description", textOutput("currentUV"))
                                    )
                                  )
                              )
                          )
                      )
                  )
              )
          )
      )

Forecast section

This section holds the forecast data. Use a column that spans all sixteen grid units, and nest another grid with two columns inside it.

Within the grid, define two columns. The first column holds the 5-Day Forecast data. Separate the elements containing different values using rows. The second column contains Hourly Forecast data. Separate the elements containing different values using columns.

      # Forecast section
      div(class = "sixteen wide column",
          div(class = "ui grid equal-height-grid",
              div(class = "left floated center aligned six wide column",
                  div(class = "ui raised segment special-segment equal-height-segment",
                      h4("5 Days Forecast:"),
                      div(class = "ui three column special-column grid",
                          # Day forecasts
                          div(class = "row",
                              div(class = "five wide column", textOutput("dailyDtOne")),
                              div(class = "three wide column", textOutput("dailyTempOne")),
                              div(class = "three wide column", htmlOutput("dailyIconOne"))
                          ),
                          div(class = "row",
                              div(class = "five wide column", textOutput("dailyDtTwo")),
                              div(class = "three wide column", textOutput("dailyTempTwo")),
                              div(class = "three wide column", htmlOutput("dailyIconTwo"))
                          ),
                          div(class = "row",
                              div(class = "five wide column", textOutput("dailyDtThree")),
                              div(class = "three wide column", textOutput("dailyTempThree")),
                              div(class = "three wide column", htmlOutput("dailyIconThree"))
                          ),
                          div(class = "row",
                              div(class = "five wide column", textOutput("dailyDtFour")),
                              div(class = "three wide column", textOutput("dailyTempFour")),
                              div(class = "three wide column", htmlOutput("dailyIconFour"))
                          ),
                          div(class = "row",
                              div(class = "five wide column", textOutput("dailyDtFive")),
                              div(class = "three wide column", textOutput("dailyTempFive")),
                              div(class = "three wide column", htmlOutput("dailyIconFive"))
                          )
                      )
                  )
              ),
              div(class = "right floated center aligned ten wide column",
                  div(class = "ui raised segment special-segment equal-height-segment",
                      h4("Hourly Forecast:"),
                      div(
                        class = "ui grid",
                        style = "display: flex; flex-direction: row; align-items: center; justify-content: space-around; flex-wrap: wrap; height: 100%;",
                        # Hourly forecasts
                        div(class = "column",
                            textOutput("hourlyDtOne"),
                            htmlOutput("hourlyIconOne"),
                            textOutput("hourlyTempOne")
                        ),
                        div(class = "column",
                            textOutput("hourlyDtTwo"),
                            htmlOutput("hourlyIconTwo"),
                            textOutput("hourlyTempTwo")
                        ),
                        div(class = "column",
                            textOutput("hourlyDtThree"),
                            htmlOutput("hourlyIconThree"),
                            textOutput("hourlyTempThree")
                        ),
                        div(class = "column",
                            textOutput("hourlyDtFour"),
                            htmlOutput("hourlyIconFour"),
                            textOutput("hourlyTempFour")
                        ),
                        div(class = "column",
                            textOutput("hourlyDtFive"),
                            htmlOutput("hourlyIconFive"),
                            textOutput("hourlyTempFive")
                        )
                      )
                  )
              )
          )
      )
  )

Building the Shiny Server

Each element in the UI section has an ID (a unique identifier) that the server uses to decide what data or information is displayed in it.

The render*() family of functions defines the type of output, while output$<id> assigns the rendered result to the UI element with the matching ID. Together, they link each visual element to the server logic. Most elements display data extracted from the JSON list, except for the weather icons, which reference an external link as their source.
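As a minimal sketch of this link, the example below uses plain shiny rather than shiny.semantic, and the IDs name and greeting are hypothetical (they are not part of the weather app). The ID passed to textOutput() in the UI is the same name that is assigned on output in the server:

```r
library(shiny)

# UI: reserve a slot for text, identified by the (hypothetical) ID "greeting"
ui <- fluidPage(
  textInput("name", "Your name", value = "World"),
  textOutput("greeting")
)

# Server: renderText() produces the text, output$greeting places it in the slot
server <- function(input, output, session) {
  output$greeting <- renderText({
    paste("Hello,", input$name)
  })
}

# Build the app object; run it interactively with runApp(app)
app <- shinyApp(ui, server)
```

If the two IDs do not match, the slot simply stays empty, so a typo in an ID is worth checking first when an output does not appear.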

Reactivity

Reactivity is what makes Shiny apps dynamic—outputs automatically update when their dependencies change.

Two key components of reactivity are reactives and observers. A reactive computes and returns a value based on its dependencies, while an observer monitors reactive values and runs code that causes side effects, like logging or updating a database.

To control reactivity, you can use bindEvent() to delay execution until a specific event occurs or observeEvent() to listen for a user action and trigger a code block. Together, these tools provide flexibility for managing app behavior.
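The sketch below illustrates these pieces in a server function of its own. The inputs n and go are assumptions made up for the example (a numeric input and a button), not inputs from the weather app:

```r
library(shiny)

server <- function(input, output, session) {
  # A reactive: computes and returns a value from its dependencies.
  # bindEvent() makes it re-run only when input$go changes,
  # not every time input$n changes.
  doubled <- bindEvent(
    reactive({
      input$n * 2
    }),
    input$go
  )

  # An observer: runs for its side effect (here, logging to the console)
  observeEvent(input$go, {
    message("Button pressed; doubled value is ", doubled())
  })
}
```

Changing n alone leaves doubled() at its previous value; it is only recomputed the next time go fires.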

The Server Code

location reactive

The location reactive includes an if-else conditional block that defines which message to display depending on the status code. The query variable contains the city or location that will be geocoded to obtain coordinates. The reactive is piped to bindEvent(), so it re-runs only when input$search fires rather than every time the text in input$location changes, which reduces unnecessary API requests.

location <- reactive({
    query <- input$location
    if(openstreetmap_error_body(query, api_key) == "404"){
      validate("No such city/town exists. Check your spelling!")
    }
    else if(openstreetmap_error_body(query, api_key) == "400"){
      validate("Bad request")
    }
    coords <- geocode(query, api_key)
  }) %>% bindEvent(input$search)

weather_data reactive

The weather_data reactive calls the weather API using the coordinates returned by location(), which in turn come from the geocoding API call:

  weather_data <- reactive({
    loc <- location()
    openweather_json(api_key, lat = loc[1], lon = loc[2])
  })

To access the JSON objects returned by the API call, you call the reactive as if it were a function. The specific values to be extracted can then be accessed by subsetting the JSON value.

# subsetting weather data.
  output$city <- renderText({
    location()[3]
  })

  output$currentWeatherDescription <- renderText({
    weather_data()$current$weather[[1]]$description
  })
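To make the subsetting concrete, here is a small sketch with a hand-written list that only mimics the nesting used above. The values are made up for illustration and are not real API output:

```r
# Hypothetical stand-in for the parsed JSON (made-up values, not real API output)
weather <- list(
  current = list(
    temp = 21.4,
    weather = list(
      list(main = "Rain", description = "light rain")
    )
  )
)

# $ selects a named element; [[1]] selects the first item of an unnamed list
description <- weather$current$weather[[1]]$description
description
```

The [[1]] step is needed because the weather element is a list of conditions, even when it holds only one.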

Create a Parse Date function

All the time data in the JSON response, forecasted or current, is provided as Unix timestamps (seconds since 1 January 1970, UTC). To make this information user-friendly, it needs to be converted into a human-readable format. You can do this by creating a function that takes the time data as input and uses functions from the lubridate package to handle the conversion.

First, convert the timestamp to a datetime object. Then format the time on a 12-hour clock, and format the date to include the day of the week, the day of the month, and the month. The time format uses these codes:

%I: Displays the hour in a 12-hour clock format (01-12).
%M: Displays the minutes (00-59).
%p: Adds the AM/PM indicator.

The paste function concatenates the values. The function returns a vector containing date and time values to be extracted by subsetting.

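parse_date <- function(timestamp) {
  # Convert the Unix timestamp (in seconds) to a datetime object
  datetime <- as_datetime(timestamp)
  # Build the date from the weekday, day of the month, and month name
  # (paste0 avoids the stray space that paste() would put before the comma)
  date <- paste0(weekdays(datetime), ", ", day(datetime), " ", months(datetime))
  # Format the time on a 12-hour clock with an AM/PM indicator
  time <- format(as.POSIXct(datetime), format = "%I:%M %p")
  c(date, time)
}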
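To try the format codes on their own, here is a small base R sketch that does not need lubridate. The timestamp is an arbitrary example value, and the AM/PM text produced by %p depends on your locale:

```r
# Arbitrary example: a Unix timestamp in seconds since 1970-01-01 UTC
timestamp <- 1700000000

# Convert to a datetime in UTC (a base R counterpart to lubridate's as_datetime())
datetime <- as.POSIXct(timestamp, origin = "1970-01-01", tz = "UTC")

# %I = hour on a 12-hour clock, %M = minutes, %p = AM/PM indicator
time_12h <- format(datetime, format = "%I:%M %p")
# %H = hour on a 24-hour clock, shown for comparison
time_24h <- format(datetime, format = "%H:%M")

c(time_12h, time_24h)
```

Both strings describe the same instant; only the clock convention differs.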

Add a modal to display error messages

The location reactive already provides a way to handle errors. To improve the user experience, you can show a modal whenever an error occurs: it overlays the page and disables its content until the user completes a specified action.

You’ll add JavaScript to control when and how the modal shows.

Add two modals in the UI section, each featuring an explanation of the error (header) and an outline of the required action (content). The actions div includes a button that enables the user to close the modal.

# modals - UI
  div(id = "notFound", class = "ui modal",
      div(class = "header", "Location Not Found"),
      div(class = "content", "No such city/town exists. Check your spelling!"),
      div(class = "actions",
          div(class = "ui button", id = "closeNotFound", "OK"))
  ),
  div(id = "badRequest", class = "ui modal",
      div(class = "header", "Invalid Request"),
      div(class = "content", "Bad request. Please try again with valid details."),
      div(class = "actions",
          div(class = "ui button", id = "closeBadRequest", "OK"))
  )

Slightly adjust the location reactive to incorporate the modal. The validate() calls are commented out and replaced with JavaScript calls. The runjs function shows the modal that matches the error encountered, and req(FALSE) then stops the reactive so that nothing downstream runs.

# show and hide modals  - Server
location <- reactive({
    query <- input$location
    if(openstreetmap_error_body(query, api_key) == "404"){
      #validate("No such city/town exists. Check your spelling!")
      runjs("$('#notFound').modal('show');")
      req(FALSE)
    }
    else if(openstreetmap_error_body(query, api_key) == "400"){
      #validate("Bad request")
      runjs("$('#badRequest').modal('show');")
      req(FALSE)
    }
    coords <- geocode(query, api_key)
  }) %>% bindEvent(input$search)

# listens for button click on modals to hide modal
observeEvent(input$closeNotFound, {
    runjs("$('#notFound').modal('hide');")
  })

observeEvent(input$closeBadRequest, {
    runjs("$('#badRequest').modal('hide');")
  })
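To see what req(FALSE) does in isolation, consider the minimal sketch below. The city reactive and its upper-casing step are hypothetical and only stand in for the geocoding call:

```r
library(shiny)

server <- function(input, output, session) {
  # Hypothetical reactive: stops quietly when the search box is empty
  city <- reactive({
    if (!nzchar(input$location)) {
      req(FALSE)  # cancels this reactive and anything that depends on it
    }
    toupper(input$location)
  })
}
```

Unlike stop(), req(FALSE) raises a silent condition, so outputs that depend on the reactive are left blank instead of showing a red error message. That leaves the modal as the only feedback the user sees.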

Conclusion

In this tutorial, you have built a weather app using Shiny that retrieves weather data from an API and displays it in an interactive and visually appealing way.

To do this, you used the following libraries:

httr2 for making API requests and handling responses
shiny.semantic for styling the app
lubridate for working with and formatting time data
shinyjs for integrating JavaScript features into the app

This combination of tools allowed you to create a functional, user-friendly weather app.

You can find the complete code for the project here.

La Fin!

How to Run R Programs Directly in Jupyter Notebook Locally

Md. Fahim Bin Amin — Thu, 03 Oct 2024 19:12:33 +0000

R is a popular programming language that’s now widely used in research-related fields like Bioinformatics.

To use R, you’d typically install R itself along with RStudio. But did you know that you can also run your R code directly in a Jupyter Notebook? This helps in many ways if you are already used to Jupyter Notebook for Machine Learning-related tasks in Python.

In this tutorial, I’ll show you exactly how you can set up your local machine to run the R programming language directly in Jupyter Notebook. The steps I am going to show you today apply equally to all major operating systems (Windows, macOS, and Linux).

Install Conda
Create a New Environment
Activate Your Conda Environment
Install ipykernel and jupyter
Install R in the Conda Environment
Open the Jupyter Notebook
Run R in Jupyter Notebook
Conclusion

Install Conda

You’d normally use Conda to handle multiple environments in Python. And here, we’re going to use the same Conda program to install R in our environment. You can either use Anaconda or Miniconda.

I prefer Miniconda because it’s lightweight. You’ll also be able to install the latest packages directly using Miniconda. But you can simply go with Anaconda if you are already comfortable with it.

Create a New Environment

Many people tend to use the Base environment. But I never like to use the Base environment directly, as you typically need multiple environments to handle different packages and different versions of those packages.

So I’ll create a new environment where I’ll work on my R programming language-related tasks using Jupyter Notebook.

To create a new Conda environment, simply use the following command:

conda create --name r-conda

Here, r-conda is my Conda environment’s name. You can choose any other name, but keep in mind that the conda env name cannot contain any whitespace.

It will create a new Conda environment named r-conda for me.

Activate Your Conda Environment

If you want to work on a separate conda environment, you’ll need to make sure that you’re activating that specific conda environment before starting to do anything.

I want to work on the r-conda conda environment. So I can simply activate the conda environment using the following command:

conda activate r-conda

If your conda env has a different name than r-conda, use that exact name in the command instead.

💡

Keep in mind that you need to activate the conda environment successfully before proceeding further.

You will see the conda environment’s name as (conda-env-name) at the left side of your terminal.

Install `ipykernel` and `jupyter`

I always like to install the ipykernel and jupyter in all of my conda environments as they help manage different conda environments’ Jupyter notebooks/labs separately.

So I’m going to install them together in my conda env by using the command below:

conda install ipykernel jupyter

This will install both ipykernel and jupyter in the activated conda environment.

Install R in the Conda Environment

To install R directly in the conda environment, simply use the following command:

conda install -c r r-irkernel

This will install the necessary components that enable your local computer to run the R program in your Jupyter Notebook.

Open the Jupyter Notebook

Now you can open the Jupyter Notebook either by using jupyter notebook or jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser --allow-root --NotebookApp.token=''. Just make sure to modify the IP, port, root configuration, and token as you see fit for your work.

Open the link shown in the terminal to launch Jupyter Notebook in your web browser.

Run R in Jupyter Notebook

After opening Jupyter Notebook in your web browser, when you want to create a new notebook for R, you will find R listed directly in the “New” menu.

Now, you can use the R language directly in your Jupyter Notebook!

You can also see the R programming language logo at the top right side of your Notebook.
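As a quick check that the kernel works, you could run a cell like the one below. It is only a sketch using base R, and the numbers are arbitrary:

```r
# Confirm which R version the notebook kernel is running
R.version.string

# A small computation to check that cells execute
x <- c(2, 4, 6, 8)
mean(x)
```

If the cell prints a version string and a mean, the notebook is running your code with R rather than Python.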

Conclusion

Thank you for reading the entire article. I hope you have learned something new here.

If you have enjoyed the procedures step-by-step, then don't forget to let me know on Twitter/X or LinkedIn. I would appreciate it if you could endorse me for some relevant skillsets on LinkedIn. I would also recommend that you subscribe to my YouTube channel for regular programming-related content.

You can follow me on GitHub as well if you are interested in open source. Make sure to check my website as well.

Thank you so much! 😀

Build Interactive Data-Driven Web Apps With R Shiny

Beau Carnes — Wed, 22 Sep 2021 13:17:46 +0000

Shiny is an R package that makes it easy to build interactive and data-driven web apps straight from R.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use R Shiny.

Dr. Chanin Nantasenamat, also known as the Data Professor, teaches this course. He is an Associate Professor of Bioinformatics at a Research University and has more than 15 years of experience in data science.

Apps created with Shiny can be hosted on a standalone webpage or embedded in R Markdown documents. Shiny makes it possible to build these web apps from R and also to create them using only R code.

This R is a little shiny. :)

In this course, you will first learn the basics of Shiny. Then you will learn how to use Shiny to build the following apps:

Print User Input
Display Histogram
Machine Learning (Weather Dataset)
Machine Learning (Iris Dataset)
BMI Calculator

After building those apps, you will learn how to deploy them using Heroku.

Watch the full course below or on the freeCodeCamp.org YouTube channel (90-minute watch).

Transcript

(autogenerated)

Learn to use R to build an interactive data driven application with the R shiny package.

Dr. Chanin Nantasenamat, also known as the Data Professor, teaches this course.

Besides teaching on his YouTube channel, he is also a university professor.

You probably know that the R programming language can help you to perform statistical analysis.

But did you know that you could use R to build an interactive data driven web application?

In this course on freeCodeCamp, you will be learning about how you could use the R Shiny package to build interactive and data driven web applications, ranging from a simple application that allows you to print user inputs, to web applications that display data visualizations, to web applications that make predictions from machine learning models.

Finally, you'll be learning how to deploy the web applications that you have created to the cloud by means of the Heroku platform.

All codes that are used in this tutorial will be provided in the video description.

And without further ado, let's dive in.

Before we begin, let's cover the basics of what the Shiny package is.

So Shiny is an R package that allows you to build an interactive web application. There are several extension packages that will allow you to extend the functionality of Shiny, including shinythemes, shinydashboard, shinyjs, and several others.

And once you develop your web app in Shiny, then you want to deploy it.

So you have two options: you can deploy it on your own server, for example, using a service like DigitalOcean, or to shinyapps.io.

There are lots of example codes that can get you started.

And this is available in the shiny gallery.

So the links are in the slides.

Okay, so what we will learn today, first of all, we will learn about the structure of a shiny web application.

And then we're going to have a look at some of the examples of the shiny web application.

And finally, we will show you step by step how you can build your interactive web application.

So let's have a look at the structure of a shiny web application.

So essentially, a shiny web app comprises of three components.

So the first component is the user interface, which is housed within a file called ui.R.

And the second is the server function, which will perform the processing of the data, which is housed in the file called server.R.

And then the shinyApp function will fuse the UI and server components together.

So the UI is the front end that accepts the user input values, the server is the back end that processes these input values to finally produce output results that are displayed on the website.

Okay, so you see that input data will flow into the user interface, which is the website that you see.

And then you will enter data into the text box, and then the data will be submitted to the server, the server processes the information, and then it will produce the result.

And the result is displayed on the websites.

Okay, and then the user will see the results.

Okay, so let's have a look at some of the shiny web applications.

So let's go to this link.

Okay, so this is the gallery available from Shiny.

And you can see that there are a lot of examples.

So there's the integration of maps right inside a Shiny application, and also interactive scatter plots.

You can also embed Google charts as well, you could perform k-means clustering, you could create some bar charts using available data sets from the datasets package.

And then you could also create a word cloud, okay, and there are several others.

And then there are widgets like buttons, tables, slider input, sliders, okay, downloading files, uploading files, subsetting data sets, and all that.

Okay, so there are several examples on how you can develop custom shiny web apps.

So why don't we click on one, the first one, okay, so this map is interactive.

If we click on it, we can zoom in.

Right, so you click on the color, and then it will update the map based on your input. There's also Data Explorer.

So it's an interactive table.

You can also sort the data as well, okay.

Or how about a word cloud generator.

Okay, you could play around with the input parameters, like the minimum frequency of each word.

For example, if it's 25, it means that the current word, like love, has to be present at least 25 times.

So, say, 26 times, in order for it to be counted here. How many words are we limiting to be displayed here? Okay, and these accept input from the books of our choice: A Midsummer Night's Dream, The Merchant of Venice, or Romeo and Juliet. We have to click on the Change button.

K-means clustering, using the iris dataset, right.

Okay.

Okay, and then the next one is to have a look at some of the web applications coming out of my own research lab.

So let's have a look.

Let's go to codes.bio/osfp.

So as I'm a bioinformatics researcher, and data scientist, so what we do in our lab is we try to apply machine learning in order to make sense of biological data and chemical data as well.

And so the objective of this web server is to take as input the protein sequence, and then we will predict whether the protein sequence is an oligomer or a monomer.

Okay, so let's click on the Insert example data, and then the input will be a FASTA format of the protein sequence.

So the first line, which contains the greater than symbol followed by the name of the protein, is given here.

So we see that the first protein is a monomer.

And the second protein is a tetramer.

And let's click on the submit button to make the prediction.

Okay, and so we see that the prediction is correct on both occasions, because the first one is a monomer.

And it predicts it to be a monomer.

And the second one is a tetramer.

And it predicts it to be an oligomer.

Okay, so this is the interface of the prediction web server.

And if we click on the other buttons, it will look like any other ordinary website.

Okay, so these are descriptions on how to use the web server.

And they are written in Markdown, okay.

So this Shiny app can also embed Markdown inside as well.

Okay, so we also provide the data set for download as well.

And we host it on GitHub.

And if you're interested in reading this paper, you can click on the link.

Okay, so this is the paper that we published back in 2016, in the Journal of Cheminformatics.

Okay, so let's go back to the slide.

And let's get started.

We're creating our own web app using Shiny.

So what you want to do now is fire up your RStudio or RStudio Cloud, okay.

And so the code that will be used today is available on the Data Professor GitHub.

So if you go to github.com/dataprofessor, okay, and then you click on code, and then find shiny/001-first-app.

And then you want to click on app.R.

And then you want to right click on the Raw button right here.

And then you want to click on Save Link As, and then select a suitable location where you want to save the file.

And because I already have it, I will just click on Cancel, but if you don't have it yet, click on the Save button.

Okay, so let's open up the app.R file right inside RStudio.

Okay, so before we begin, credit to Winston Chang for developing this template, which we greatly modified and simplified to make this app.R file.

So if you want to check out the full version, go ahead here, links are provided here.

Okay, so in this simplification, we're going to start with the baby steps.

So this web app is an interactive web application whereby it will accept input values in the form of text, primarily the given name and surname.

So let's have a look.

Okay, so the app will accept input, which is the given name and the surname. Okay, so let's go ahead and type the given name John, and the surname is Doe, okay, and so the name John Doe will appear in the output here. And the name of the app is My First App; you can also modify this to your own liking. Okay, and in this example, we have three navigation bars, and we intentionally left the other two blank here, according to the original template by Winston Chang.

Okay, so the code that we have is located on Navbar 1. So a point to notice is that you can also create several web apps inside different navigation bars.

Let's say that you want to modify the name like let's say given name is Jennifer.

So then you'll see that the name is automatically updated.

So notice that there is no Submit button and whenever you type in an updated name, it will automatically update the results.

So in R, they implement reactivity. Let's have a look: Shiny reactive, reactive expressions, Reactivity: An Overview. Okay, so it's based on the principles of reactive programming, which is used by the Shiny package.

So we're not going to go into detail.

But if you're interested, I can also provide the links in the file as well.

Concepts about reactive programming used by Shiny.

Okay, so I'm going to provide the link for you here.

Okay, so a moment ago, we took a look at what the web application will look like, which is the end outcome of this code.

So let's look under the hood. What does the code look like? Okay, so in the slides, I've shown you that it comprises three components.

So let's have a look.

So the first component, the UI, is right here.

So it's on line 19 until line 43, okay, lines 19 until 43.

This is the UI or the user interface, and then lines 47 until 52 are the server component.

So you're going to notice that we're not doing anything much here.

We're just displaying the results.

And so the code is very concise.

And the third component is the shinyApp function.

So this thing will piece together the UI and the server.

So it's essentially just saying that, okay, this part here is the UI.

This part here is the server, and it fuses them both to create a Shiny app object.

Okay, so that's all there is to it at the conceptual level.

So let's have a look at the components inside the UI object.

Okay, so here, inside the fluidPage tag, it's using the theme argument, and it's telling that we want to use the cerulean theme.

Okay.

And the cerulean theme is the blue theme that you've seen a moment ago.

Let's say I want to change it to united.

And I can click on the reload, or I need to save it first.

And then I'll click on the reload, and then it changes to united. Okay, I want to change to, say, yeti, save it, and then it becomes the yeti theme.

So maybe you're wondering what the available options are for you.

So if you search for shinythemes in Google, okay, the first result, just click on it.

So here, this is how cerulean looks. If you like that, you could type in cerulean. There's cosmo, cyborg, darkly, paper, lumen, journal, flatly, readable, sandstone, simplex, slate, spacelab, superhero, united, and yeti.

Let's try superhero.

Okay, so it's John Doe.

Okay.

There you go.

I'll just default back to cerulean.

So let's envisage the code as modular components.

So you're going to see that inside the UI, you're going to have a fluidPage, okay, and within this fluidPage, you're going to define the theme.

And inside the fluidPage, aside from the theme, you're going to have a navbarPage, right? So the navigation bar page is right here.

It's this bar.

And so the name of the app is my first app.

So this is the name of the navigation bar page.

Inside the navbarPage, there is the tabPanel.

Okay, so the tab panels comprise Navbar 1, Navbar 2, and Navbar 3.

Okay, inside Navbar 1, you have the sidebar panel right here to the left, right, you have here sidebarPanel, and your sidebar panel contains the tag h3; h3 is the third-level heading, "Input".

And then the textInput is the given name.

And textInput is the type of input.

So if you change this to something else, it will look differently here.

And there are a lot of widgets, okay, so you can find what you want.

You can shop for what widget you like, and then just replace it right here in the code.

Okay, so the given name is right here, displayed here.

And then this thing here is the default value.

So let's say that I could type in John Doe, and let's save it and reload the app.

So you see that John Doe will automatically by default appear in the text box, okay, but I can also leave it blank as well.

Right.

So this is the contents of the sidebar panel.

So the sidebar panel will accept the input, right, and then the main panel is right here, where we see Header 1, Output 1, and John Doe, which is the result.

So in the main panel, right, Header 1 is inside the h1.

So h1 is the tag, which is the biggest heading tag available.

And h4 is a smaller tag, right? So in order from big to small, we have h1, h2, h3, h4, right.

So for the input here, we use h3; if we change it to h1, it will be bigger.

It'll be the same size as the Header 1 here, but it's too big.

So I'm going to change it to just h3. We could even make this h3 as well.

Right, so you've got a little bit bigger for the Output 1 here.

Right, so you can play around with changing the options here.

Okay, and so verbatimTextOutput is simply a text box that will return the output value.

So it's just a simple text box, and then Navbar 2 and Navbar 3, as we have previously mentioned, are intentionally left blank.
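The layout described above could be sketched roughly as follows. This is a reconstruction from the spoken walkthrough, not the actual app.R file, so the labels and placeholder text are assumptions:

```r
library(shiny)

ui <- fluidPage(
  # The video also sets theme = shinytheme("cerulean") from the shinythemes
  # package here; it is left out so this sketch only needs shiny.
  navbarPage(
    "My first app",
    tabPanel(
      "Navbar 1",
      sidebarPanel(
        tags$h3("Input:"),
        textInput("txt1", "Given Name:", ""),  # third argument is the default value
        textInput("txt2", "Surname:", "")
      ),
      mainPanel(
        h1("Header 1"),
        h4("Output 1"),
        verbatimTextOutput("txtout")
      )
    ),
    # Placeholder text (an assumption) for the two panels left blank
    tabPanel("Navbar 2", "This panel is intentionally left blank"),
    tabPanel("Navbar 3", "This panel is intentionally left blank")
  )
)
```

Each nested call maps to one visual component, which is why swapping textInput() for another widget only changes that one line.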

Okay, so there's, that's all there is to the UI.

But the confusing part is how does UI and server interact? How do they send information back and forth? Right? How does UI send the input value to the server? And how does the server accept the input value? Okay, let's have a look right here.

So notice that the text box has this thing called txt1.

That's txt1, right, for the given name, okay.

And the surname is txt2, okay.

Now, let's make a note of that. How about I put it in the comments: txt1 and txt2. Okay, and make a note of this too: txtout.

So txt1 will be sent to the server, txt2 will also be sent to the server, and txtout is generated by the server.

Okay, so let's go back to the slides.

Okay, why don't I create a new slide.

So let's duplicate this slide.

So let's call this the first web app.

And we're going to modify this to reflect the contents of this web app.

So the input data is txt1 and txt2, right, and the output is txtout.

So it will send txt1 and txt2 to the server.

And so, actually, the server sends txtout, and the UI displays txtout.

So here txt1 and txt2 will be sent to the server. On the server side, txt1 and txt2 are input$txt1 and input$txt2. And so the question is, how does it send back txtout? It's right here, output$txtout, and it's going to use this function called renderText.

Okay, so there are several render functions, like renderTable and renderText, that you can use.

You can also find them in the Shiny documentation. So what this output$txtout essentially does is use the paste function to combine txt1 and txt2, separated by a space.

And then it will store the result, the concatenated text of txt1 and txt2, inside the txtout variable.

And then this variable will be called from within verbatimTextOutput, which will display the text inside the text box.

That's all there is to this shiny web application, it will seem a bit confusing, but if you get the concepts straight, it will be very simple.
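The walkthrough above can be condensed into a minimal sketch. The IDs txt1, txt2, and txtout, the h3/h1/h4 tags, paste, renderText, and verbatimTextOutput come from the video; the page title, the label strings, and the placeholder text in the blank tabs are assumptions made for illustration, so treat this as an approximation rather than the exact file from the repository.

```r
library(shiny)

# UI: three tab panels; only the first has content.
# Title and label strings below are assumed, not taken from the video.
ui <- fluidPage(
  navbarPage(
    "My first app",
    tabPanel(
      "Navbar 1",
      sidebarPanel(
        tags$h3("Input:"),
        textInput("txt1", "Given Name:", ""),  # sent to the server as input$txt1
        textInput("txt2", "Surname:", "")      # sent to the server as input$txt2
      ),
      mainPanel(
        h1("Header 1"),
        h4("Output 1"),
        verbatimTextOutput("txtout")           # displays output$txtout
      )
    ),
    tabPanel("Navbar 2", "This panel is intentionally left blank"),
    tabPanel("Navbar 3", "This panel is intentionally left blank")
  )
)

# Server: combine the two inputs with paste and hand the text back to the UI
server <- function(input, output) {
  output$txtout <- renderText({
    paste(input$txt1, input$txt2, sep = " ")
  })
}

# Fuse the UI and server; call runApp(app) to launch it
app <- shinyApp(ui = ui, server = server)
```

The object is only created here, not launched, so the script can be sourced without opening a browser window.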

And you could create any web application you can imagine. You can make this web application data driven: you could ask for input, or upload a file of input data, and then the input data will be sent to the server. In the server, you could create a machine learning model, and once the machine learning model is built, it would relay the results back to the UI, and the UI would display the predicted results.

Okay, so this can be very powerful as a model deployment approach for your machine learning model. There are several tips and tricks which we use in our research lab, and we can share these in a future video.

And so if you're finding value out of this video, please smash the like button.

Okay, so let me recap this process.

In summary, this app.R file will contain three components. The UI component, which is the user interface, will accept the input, which is txt1 and txt2, corresponding to the given name and the surname.

And when you input the given name and surname, they will be sent to the server.

And then the paste function will combine txt1 and txt2 and put the result inside the txtout variable.

And then this txtout variable is embedded inside verbatimTextOutput, which is a text box on the UI.

And as a result, you will see the input values that you typed in displayed in the text box.

Okay, so this is the first web app, which essentially starts from the basics. Nothing fancy here, just a simple web app where you can type in the first name and last name, and then it will display the result.

Okay, so in this series, called Web Apps in R, we're going to have several other videos.

And if you have ideas on what application you would like us to develop, let us know.

So please comment down below, and I'll see you in the next one.

Okay, so this video represents the second episode of the Web Apps in R series.

In the first video, we covered how you can develop your very first Shiny web app, which allows you to enter the first name and last name and then displays the output.

For example, if you enter the first name as John and the last name as Doe, then the output panel will output John Doe.

So in this video, we're going to show you how you can develop your second web app in R.

And so the web app today is really quite simple.

The web app will display a histogram of the air quality data set, particularly the ozone levels, and the user will be able to adjust the bin size, and then the histogram will adjust accordingly.

Okay, so let's get started.

So what you want to do now is go to the Data Professor GitHub, click on the code folder, then click on the shiny folder, and then click on the 002 histogram folder.

And then what you want to do now is click on app.R, then right click on the Raw link, choose Save Link As, and save it to a desired destination.

Since I have already downloaded the file, let's open up the app.R file.

So why don't we go ahead and run the application.

So as you can see, the web app has the title as the ozone level and the side panel shown on the left as the number of bins as a slider input value.

So you can adjust this by sliding to the left or right and then the resulting histogram will be updated automatically in real time.

So let's say that we adjust it to seven bins, and you will see that in the histogram, there will be a total of seven bars.

If you adjust it to 12, then the number of bins, or the number of bars, will be adjusted to 12.

So what is a bin? The number of bins in a histogram is essentially the number of bars, and each bar represents a range of values, for example from 0 to 5 or 0 to 10.

And so if you adjust it to one bin, then you will see only one bar here.

And if you adjust it to two bins, you will see two bars, and so on, up to a maximum of 50 bars.

Okay, so let's go back to the code.

So in this app.R file, you will see that line number nine essentially loads the shiny library, and line number 10 loads the airquality data set into memory.

And I have already marked the UI in red here, and the server in red, and the shinyApp function in red.

So as in the previous video, I have already mentioned that the shiny app has essentially three major components consisting of the UI component, which is the user interface and the server component that will accept the input value from the UI, and it will do some processing as shown here.

And finally it will generate the output and the output will then be sent back to the UI for display in the main panel.

Okay, so let's recap that again.

So this UI is the user interface, and it will allow you to specify the name in the title panel here, which is specified as Ozone level.

So let's have a look.

So ozone level is specified by the title panel.

So you can modify this name if you like.

Okay, so let's say you want to call it just Ozone; then you have to save it and reload the app, okay, and the app will then be called Ozone.

Okay.

Okay, so this is the title panel here.

And the next block of code here will be the sidebar layout.

And the sidebar layout will allow you to specify the sidebar panel and inside the sidebar panel will then be a slider input.

And the slider input is essentially the number of bins, right number of bins in the UI, and it will have an input ID of bins, which the server component will recognize.

Okay, I'm going to show you in just a moment.

And so the minimum value here is one and the maximum value of the bins is 50.

And the default value is 30.

So you see here that the minimum is one, the maximum is 50.

And the default value is 30.

So you can adjust the maximum to say 40.

And the default, you can test it with 20.

And then we will save it and reload the app, okay, and you're going to see that the web app automatically updates to a range of 1 to 40.

And with a default of 20.

Okay, and you when you slide it, it'll update as before.

So here you can see that the step size is one because when you slide the button, it will incrementally increase by one.

Let's say that you want to modify the step size to another number.

Can you do that? Yes, you can.

So you want to specify step to be equal to let's say, two.

And the minimum you want, adjust it to zero, save it and reload the app.

And here now the step size becomes 2, okay.

So it's 18, and then you move it and it becomes 20, you move it again and it becomes 22, okay.

So you notice that I've also modified the minimum to be zero, because if it's one, then the steps will be 1, 3, 5, 7, 9.

But if you make it an even number, then the steps will also be even numbers. Okay, so let's have a look: if the minimum is one, it will be 1, 3, 5, right, it will be 1, 3, 5.

So I cannot select 20.

Okay, so it has to be only odd numbers.

So if you want it to be even number then you want to put the minimum to become zero.

Okay.

So here now, you can even make the step size to be five.

Right? 0, 5, 10, 15, 20, 25, 30, right.
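The arithmetic behind those slider steps can be sketched in base R. This is only an illustration of which values a given minimum and step land on, using seq as a stand-in for the slider; it does not call sliderInput itself, and the variable names are made up for the example.

```r
# Starting at 1 and stepping by 2 reaches odd numbers only
from_one <- seq(from = 1, to = 40, by = 2)

# Starting at 0 and stepping by 2 reaches even numbers, so 20 is selectable
from_zero <- seq(from = 0, to = 40, by = 2)

# Starting at 0 and stepping by 5
by_five <- seq(from = 0, to = 30, by = 5)

print(from_one)
print(from_zero)
print(by_five)
```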

Okay, so notice that the slider input has an input ID of bins, and let's go into the server component.

And let's find bins. Where is bins? It's right here, and bins is right here.

Okay, so the server component has bins in two places: input$bins, and bins as the value of the breaks argument. So it will allow you to specify the number of bins in the histogram, and x will be the airquality data set.

And then we use the dollar sign to specify the column called Ozone.

Because in the airquality data set... let me stop the Shiny app first.

Let's have a look: airquality$.

And then notice that you have Ozone, you have Solar.R, you have Wind, you have Temp, Month, and Day.

Okay, so we specify Ozone.

And we'll also notice that Ozone has some missing data.

Right, it has some missing data here, shown as NA. What we're going to do with the missing data is omit it using the na.omit function and then save the result back into the x variable.

Okay, and then the bins variable will then determine what is the minimum value of the bin and what is the maximum value of the bin.
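As a rough sketch of this step outside of Shiny, the same idea can be tried in base R. The fixed n_bins value stands in for input$bins, and building the break points with seq from the minimum to the maximum is one way to get exactly the requested number of bars; the video does not show the exact expression, so that part is an assumption.

```r
# Ozone column of the built-in airquality data, with missing values dropped
x <- na.omit(airquality$Ozone)

# Stand-in for input$bins (hypothetical fixed value for this sketch)
n_bins <- 7

# n_bins bars need n_bins + 1 break points, spanning min(x) to max(x)
bins <- seq(min(x), max(x), length.out = n_bins + 1)

# Compute the histogram without drawing it, then count the bars
h <- hist(x, breaks = bins, plot = FALSE)
length(h$counts)
```

Passing a vector of break points, rather than a single number, matters here: a single number given to breaks is only treated as a suggestion by hist.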

Okay, so moving on, the next function is the histogram function, where x is the input data, which is the airquality Ozone column, and breaks will be equal to the bins that we specify here. Okay, so the color will be this color, which is a bluish color, and the border will be black.

So this is the blue color mentioned here, and the border is black.

So we see a black line, and the x label is Ozone level.

So the x label is right here.

And then the main is histogram of ozone level.

So main is right here, right? So you can adjust the x label or the main text as well.

So in this output$distPlot, we will use renderPlot, and this will generate an output called output$distPlot.

And then we're going to notice this.

If we scroll up to the main panel of the UI component, the plotOutput function will have an output ID equal to distPlot.

Okay, so the name distPlot here and distPlot here refer to the same object.

So the server will generate this output object called distPlot and send it to the UI component for display on the main panel.

So the main panel is located right here.

Okay, so finally, the shiny app function will fuse together the UI component and the server component.

So you're going to see how the code communicates between the UI and the server: the UI will accept the input, which is the number of bins, and send the number of bins to the server component, and the server component will generate the histogram plot.

And the histogram plot will be contained within this output$distPlot, and it will be sent to the plotOutput function in the main panel of the UI.

And so you will see the resulting histogram being generated.

And if you adjust the value of the input bin number, then the plot will also update automatically using the reactive function of Shiny.

Okay, so you can customize the color if you like, let's make it 003366.

Okay, so it's dark blue? Or what if we just call it blue? Or how about let's use red.

Okay, so the plot will become red.

If we use green, the histogram will also be green.

So here, you can customize the color to your own liking and experiment.
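Putting the pieces of this episode together, a minimal sketch of the histogram app might look like the following. The input ID bins, the output ID distPlot, the slider range of 1 to 50 with a default of 30, the labels, and the color 003366 (written with a leading # as a hex color) all come from the walkthrough; the slider label text and the seq expression for the break points are assumptions, so the real app.R may differ in detail.

```r
library(shiny)

# UI: a title, a slider in the sidebar, and a plot in the main panel
ui <- fluidPage(
  titlePanel("Ozone level"),
  sidebarLayout(
    sidebarPanel(
      # "Number of bins:" is an assumed label; the ID "bins" is what the server reads
      sliderInput(inputId = "bins", label = "Number of bins:",
                  min = 1, max = 50, value = 30)
    ),
    mainPanel(
      plotOutput(outputId = "distPlot")
    )
  )
)

# Server: redraw the histogram whenever input$bins changes
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- na.omit(airquality$Ozone)
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = "#003366", border = "black",
         xlab = "Ozone level", main = "Histogram of Ozone level")
  })
}

# Fuse the UI and server; call runApp(app) to launch it
app <- shinyApp(ui = ui, server = server)
```

Swapping the col value for "red" or "green" is enough to reproduce the color experiments described above.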

And don't forget to upload this to your GitHub so that you can start building your portfolio of data science projects, and then you're going to have several repositories in your GitHub in no time.

Congratulations, you have built your second Shiny web application in R.

Okay, so this is the third episode of the Web Apps in R series.

So today, we're going to build a play golf web application.

So probably you're wondering what is a play golf application.

So the play golf web application that we're going to build today is going to be based on the weather data set provided by the Weka data mining software.

So let's have a look here.

So the data set is a relatively small data set where it has a total of five variables.

So four variables are the outlook, temperature, humidity, and windy, and the class label would be to play or not play golf, which is a function of the weather conditions: whether it is sunny, whether the temperature is high, whether the humidity is high, low, or medium, and whether there is wind or not, true or false.

And then the final decision is to either play or not play golf.

Okay.

So before we dive into the code, let's have a look at what the web application looks like.

Okay, so the web app looks like this.

It's a very simple application.

So the name of the web app is play golf.

And so here there are the input parameters, comprising the four variables that I have mentioned.

The first one is the outlook.

And so the user can select one of three outlooks: whether it is sunny, whether it is overcast with cloud, or whether there is rain.

And the second variable is the temperature.

And so this is a slider input, so the user can slide the input value, and then the humidity is also a slider input.

And then windy is either a yes or a no, which is a drop down menu.

And then when you're ready to make a prediction, you just click on the submit button.

Okay, and here you see a prediction is being made.

And the prediction says yes, and then we also see the underlying probability, in that the No has a 27% probability.

And the Yes has a 73% probability.

And so you could play around with this, right? For example, if it is sunny, and the humidity is very high, and the temperature is very high, and there is wind, should you play golf? No, right? I mean, really, if the temperature is high, it's very humid, it's very windy, and it's sunny? Probably not.

Right.

Okay.

What if What if the humidity is low temperature is quite cool, and it's sunny, and it's not so windy.

Would you play golf, right? Yes.

Right.

So there's an 87% probability for Yes, and a 13% probability for No.

Okay, so this is the web app we're going to build today.

So let's go ahead and stop the web app for a moment.

Okay, so what you want to do now is go to the Data Professor GitHub.

And then you want to click on the code directory, followed by clicking on the shiny directory.

And finally click on the 003 play golf directory.

And then click on the app.R file.

Okay, so what you want to do is right click on the raw link, and then save to file.

So I'm going to save it into the weather folder, save it, okay, it's right here, click on it, make sure that the app works.

Okay, it works.

A prediction has been made.

Okay, cool.

So let's just clear this up by pressing Ctrl and L.

Okay, so let's have a look under the hood. What does the app.R file look like? The first couple of lines import the libraries used by this app.R file.

These comprise the shiny package, the shinythemes package, the data.table library, the RCurl library, and the randomForest library.

Okay, so the next line of code creates a data object called weather by reading the CSV, which is downloaded from the Data Professor GitHub in the data folder.

And the file is called weather-weka.csv. Okay, let's have a look at the data set.

What does it look like? I'll click on the line weather and then I hit on the control and enter button.

And then let's type in weather.

Okay, so this is wanting me to see, let's go with head and then weather.

Okay, and so we see that there are five columns outlook, temperature, humidity, windy and play.

Right.

And let's have a look at the data types of the data set. We see that outlook has three factor levels and play has two factor levels.

So these are categorical labels: no and yes.

And the outlook is overcast, rainy and sunny.

Windy is a true and false humidity and temperature are integers.

And so a random forest model will be created by using the four variables, comprising outlook, temperature, humidity, and windy, as the input variables, and the play variable here will be used as the output variable, or the variable that we want to predict.

Okay, and the data equals weather, which is the weather data object here.

And we're going to set ntree, the number of trees, to 500.

And because there are four input variables, we're just going to use an mtry of four.

Okay, so let's try building a model and the model has been built.

And let's apply the model for prediction, shall we, I mean, just to test that the code is working properly.

So let's try applying the model on the input file that I have previously mentioned about.

And so we're going to run this line, putting the data into the variable called test, okay, and we're going to assign the factor levels, because if we don't do it, then it will produce an error.

So before we run this line of code, let's just try to make a prediction with the model.

And we should be able to see an error coming up.

Okay, so we got this error: error in predict.randomForest, type of predictors in new data do not match.

So what we notice is that if we type in the str, and then we type in weather, and then we notice that the outlook has factors with three level, but if we type in str, and then the test variable, notice that the outlook has a factor of only one level.

And this is because the input data has only one line of data, which is essentially one row of data.

And so the outlook is only sunny, because the input for the prediction has only one row.

But in reality, it should have three levels of the factor.

So we're going to define that by telling the code that there are three possible factor levels: overcast, rainy, and sunny.

So let's run that line of code and then run the prediction again, okay, and it works.
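The factor-level mismatch can be reproduced in base R without the model. In this sketch the one-row test data frame is built by hand with made-up values, standing in for the single row read from the input file; note that recent versions of R read text columns as character rather than factor by default, so the one-level factor is created explicitly here.

```r
# A single row of input: the outlook factor only knows the level it has seen
test <- data.frame(
  outlook     = factor("sunny"),
  temperature = 85L,
  humidity    = 95L,
  windy       = TRUE
)
levels_before <- levels(test$outlook)

# Declare all three possible levels so the column matches the training data
test$outlook <- factor(test$outlook, levels = c("overcast", "rainy", "sunny"))
levels_after <- levels(test$outlook)

print(levels_before)
print(levels_after)
str(test)
```

The stored value is still sunny after the fix; only the set of allowed levels changes.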

And then print outputs.

And here we go, we got the prediction, which is exactly what is going to be displayed on the web application.

Let me show you.

Right, we make the prediction, and shown here.

So this table you see here is shown right here.

So the model works, and let's go to the other lines of code.

So the next one would be the user interface.

Right.

As I mentioned in the previous video, the user interface represents the first component of the Shiny web app.

And this is followed by the second component, which is the server and then this is followed by the third component, which is the fusion of the user interface and the server component using the shiny app function.

So let's talk about the UI.

So this UI object makes use of the fluidPage function, and we're going to use theme equal to the shinytheme united.

And so the united theme will give the buttons a red color.

So if we change it to cerulean, then we're going to have the cerulean color theme, which is a bit blue.

Okay, so please refer to the first video of the Web Apps in R series in order to see the selection of web templates that you can choose from.

Okay, so let's run the app again.

And I'm going to put the app just about right, right here.

So the header panel is play golf.

And so this is right here, play golf.

So if you want to change the name, feel free to do so.

And then the next one is the sidebar panel.

So the sidebar panel, which accepts the input parameters, is located to the left, and there will be a total of four inputs.

And so the first one is selectInput, which is a drop down menu.

And if you click on it, you get three selections, Sunny, Overcast, and Rainy. When you select Sunny in the drop down menu, it will, under the hood, select the sunny object, and if you select Overcast, it will, under the hood, be equivalent to the overcast object.

If you select Rainy, it will be equivalent to the rainy object, and the default is to select rainy, right here. What if you change it to sunny? Now let's change the temperature to be a high value, let's say 85, and the humidity to be 95.

And it's windy, that's true.

Reload the app again, right.

So high temperature high humidity windy, Sunny, don't play golf.

Okay, so here you can change the default value to your liking.

Okay, so we have mentioned the three data objects here: sunny, overcast, and rainy.

So keep that in mind; we're going to make use of that in the server function.

And note that when we refer to the inputs later on in the server function, they will be referred to as input$outlook, input$temperature, input$humidity, and input$windy.

Here, why don't we just scroll down and have a look: input$outlook, input$temperature, input$humidity, input$windy, okay.

And then let's move back up.

Notice the spelling here, using the small letter, not the capital one.

So the capitalized ones here are the labels.

So it's exactly what we're going to see in the web application outlook with the colon is right here on the label.

Okay, actually, you don't have to type in label if you don't want to, we could just you know, delete it, and it will give the same results.

It's just implied, okay.

But if you want to add the label argument, you could feel free to do so.

Right? But if you do it for one, well might as well do it for all.

So that would actually make the code look a bit easier to read.

Right.

So let me show here that, okay, this is the object name, outlook.

And the label is Outlook with a capital O. Reload the app.

Here you go.

Right, it works as usual.

So here, this outlook here is that outlook object.

And this temperature here is the temperature object.

And this action button here is the submit button.

So this Submit button is added in order to overcome the reactive behavior.

So we just add the familiar Submit button so that users can initiate the prediction process when they feel ready to do so, instead of having the web app be reactive and make the prediction spontaneously whenever the input values are slid up and down.

Because when it's reactive, if you move it by one notch and then let go of the mouse, it'll make a prediction.

But for this one, with the Submit button, no prediction will be made until you actually click on the Submit button. Actually, this might be a good thing on the server side, because the server will work a bit less if the prediction is made only once versus in reactive mode.

In reactive mode, if you slide the input value and then change your mind later on, a prediction will be made at each change of the input value of the slider here, or even the drop down menu, right.

But we do it once with the use of the submit button.

Okay, and that's it for this left sidebar panel.

And then the main panel here will display the result from the output generated by the server function.

So we're going to talk about that in just a moment.

So why don't we note that the outputs being generated by the server component will be called contents and tabledata.

Okay.

Okay, so let's hop on to the server function which is the second component.

So in this datasetInput variable, the first step is to create a data frame which accepts the four input values from the web application, comprising outlook, temperature, humidity, and windy, which are right here. Then it combines them with the play variable, which is the fifth column of the original data set, and then it transposes the data set, that is, rotates it, and writes an input.csv file. It then reads the input.csv file back in, into the test variable.

And then it will apply the factor function in order to tell that the outlook has three levels.

And finally, a prediction will be made using the model generated earlier by means of the random forest and apply the prediction model to predict the input values from the user.
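The reshape-and-reload part of that sequence can be sketched in base R. Everything here is illustrative: the input values are hard-coded instead of coming from input$..., the play entry is a placeholder value, base t() is used for the transpose although the app loads data.table, and the file goes to a temporary path rather than input.csv in the working directory.

```r
# Long layout: one row per variable, as collected from the inputs (values assumed)
df <- data.frame(
  Name  = c("outlook", "temperature", "humidity", "windy", "play"),
  Value = c("sunny", "85", "95", "TRUE", "play"),
  stringsAsFactors = FALSE
)

# Transpose so that names become the first row and values the second row
wide <- t(df)

# Write the two rows without row or column names, then read them back;
# the first line is treated as the header
csv_path <- tempfile(fileext = ".csv")
write.table(wide, csv_path, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
test <- read.csv(csv_path, header = TRUE)

str(test)
```

Reading the file back in lets read.csv infer the column types, so temperature and humidity come back as numbers rather than text.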

And once the prediction is made, it will be sent from here into output$tabledata by way of the datasetInput function right here.

And then it's going to render the table as we will see in the web application.

So the table that is being rendered will be right here, and it comprises three columns: the prediction, the No probability, and the Yes probability. Okay, and this status output text box is just essentially this box right here.

So if we load the app for the first time, it will just say server is ready for calculation.

And if we click on the submit button, the text will change to calculation complete and it will be followed by the prediction results table.

Okay, so note that there are two outputs being generated right here: output$contents and output$tabledata.

So these two outputs will be sent to the UI component right here, tabledata and contents. tabledata will be displayed as a table using the tableOutput function.

And the status of the prediction, whether it is ready for prediction or the prediction has been made, will be displayed by the verbatimTextOutput function, okay.

And this is just the label of the status output text box shown right here, which has the h3 font size. If you change it to h2 to make it a bit bigger, then you will notice that the font becomes bigger, right? So I'm just going to change it back to h3.
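One way to get this submit-button behavior is sketched below. The output names contents and tabledata, the two status messages, and the use of an action button, verbatimTextOutput, and tableOutput follow the walkthrough; the button ID submitbutton, the use of isolate(), the label strings, and the slider range are assumptions, and the table simply echoes the input instead of calling a prediction model.

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Slider range and default are arbitrary values for this sketch
      sliderInput("temperature", "Temperature:", min = 64, max = 86, value = 70),
      actionButton("submitbutton", "Submit")   # hypothetical button ID
    ),
    mainPanel(
      tags$label(h3("Status/Output")),         # assumed label text
      verbatimTextOutput("contents"),
      tableOutput("tabledata")
    )
  )
)

server <- function(input, output) {
  # Status text: depends only on the button's click count
  output$contents <- renderPrint({
    if (input$submitbutton > 0) {
      "Calculation complete."
    } else {
      "Server is ready for calculation."
    }
  })

  # Table: re-rendered on each click; isolate() reads the slider
  # without making the table react to slider movement
  output$tabledata <- renderTable({
    if (input$submitbutton > 0) {
      isolate(data.frame(temperature = input$temperature))
    }
  })
}

app <- shinyApp(ui = ui, server = server)
```

Because the slider is only read inside isolate(), moving it does nothing until the button is clicked again, which is the "work a bit less" effect described earlier.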

Okay.

So that's all right.

And then the last component is the shinyApp function, which fuses the UI and the server together. Again, you have all of this in 121 lines of code.

And so nothing fancy here, just a simple web application that you can create using Shiny.

Okay, so this video represents the fourth episode of the Web Apps in R series.

And today we're going to cover how we can develop an Iris predictor, which has a machine learning model in the background.

And the web app allows the user to select the input values for the four input parameters and press on the submit button and make a prediction.

So without further ado, let's get started.

Okay, so the first thing that you want to do is go to the data Professor GitHub.

Okay, once you arrive here, you click on the code link, then find shiny, and then click on 004, the Iris predictor.

So what you want to do now is download the first three files, comprising app-numeric.R, app-slider.R, and model.R, because the other three files found below will be generated automatically when we run the code.

Okay, so why don't we just click on each of them manually.

And then for each, right click on the Raw button and click on Save Link As, okay, and then select the location on your computer where you want to save the files.

So you do this for all three files: app-numeric.R, app-slider.R, and model.R.

Okay, so I have already done that.

And I will go back to the RStudio application.

Okay, so before we begin, let's have a look at what the Iris predictor web application that we are going to develop today looks like. Before you hit Run App, you need to make sure that your working directory is the folder that contains all of the necessary files, the ones that you have just downloaded.

Okay, and once you have made sure already, you want to click on the run app button.

Alright, so this is what the app looks like.

And it allows you to put in the four input parameters.

And so these are the default values which you can adjust accordingly.

And then when you click on the submit button, the prediction will be made.

And here the prediction is that the input is predicted to be an Iris setosa flower, and the probability of it being an Iris setosa is 100%. Okay, and so if you change the input parameters, the prediction will also change, because the input parameters will be fed into the predictive model, which is a random forest.

And then the random forest will perform the classification.

And it has classified this input as an Iris virginica with 100% probability.

So let's have a look under the hood. What does the code actually look like? Okay, so the first file that you want to open up right now is model.R.

So in this tutorial, we're going to prebuild the random forest model, and then we're going to load it in, right.

So as you recall, in the previous videos of this channel, we have shown you how you can deploy your predictive model into an RDS file.

And so what you want to do is develop the model in this model.R file and save it as an RDS, right? So you're deploying that.

And then you're going to read that in here, on line number 15. You're going to read the model.rds in and give it a name, which is model, and then we're going to use this model for making the prediction.

So the advantage of this is that the model is already built.

And so there is no additional workload on the shiny application.

So it can just readily read in the model and perform the classification.

So this will be beneficial in the case in which the predictive model will take a long time to build the model.
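The save-once, load-later pattern can be tried with base R alone. In this sketch a simple linear model stands in for the random forest so that no extra package is needed, and the file is written to a temporary directory; both choices are assumptions made for illustration.

```r
# Build a model once (an lm stand-in for the random forest in the video)
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# "Deploy" it: serialize the fitted object to an RDS file
rds_path <- file.path(tempdir(), "model.rds")
saveRDS(fit, rds_path)

# Later, for example at the top of app.R: load the prebuilt model
model <- readRDS(rds_path)

# The reloaded object predicts exactly like the original
new_flower <- data.frame(Petal.Length = 1.4)
predict(model, new_flower)
```

The app then only pays the cost of readRDS at startup, not the cost of training.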

Okay, so let's have a look at the model.R file, where we will be building the model.

So the first step that we want to do now is to load the libraries, which include RCurl and randomForest.

The RCurl library will allow us to read from the Data Professor GitHub to download the iris data set, and randomForest will be used to create the prediction model.

And we also need the caret package in order to do the data splitting.

Okay, so iris here means that we create a data object called iris by reading in the CSV file, called iris.csv, retrieved from the Data Professor GitHub. Then we use the caret package to perform data splitting, using a ratio of 80/20.

The 0.8 here is the 80% split, which will go into the training index. We will subsequently use the training index to create a training set by slicing the original iris data frame, and then the remaining 20% will go to the testing set.

So what we're going to do next is we're going to write the training set and the testing set out into the CSV files, right, because that would help to remedy possible shuffling of the data that will go into the training set and the testing set, so it will allow reproducibility in the future.

So in the future, we can just read in the training.csv file instead of performing the data splitting again, right.
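A base R approximation of the split-and-save step is sketched below. The video uses the caret package for the split; here a per-species sample() call stands in for it, the built-in iris data replaces the downloaded CSV, and the files are written to a temporary directory. All names other than training.csv and testing.csv are made up for the example.

```r
set.seed(123)  # fixed seed so the shuffle is repeatable within this sketch

# Pick 80% of the row indices within each species (a stratified split)
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

training_set <- iris[train_idx, ]   # 80% for building the model
testing_set  <- iris[-train_idx, ]  # remaining 20% for testing

# Write both sets out so later runs can reuse exactly the same split
out_dir <- tempdir()
write.csv(training_set, file.path(out_dir, "training.csv"))
write.csv(testing_set, file.path(out_dir, "testing.csv"))

# Read the training set back and drop the first column (the saved row index)
train_again <- read.csv(file.path(out_dir, "training.csv"), header = TRUE)
train_again <- train_again[, -1]

dim(train_again)
```

write.csv stores the row names as an unnamed first column, which is why the first column is removed after reading the file back in.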

So here, we're going to read in the training dot CSV file and give it the same name, which is the train set Canada, we're going to delete the first column, which is the index number.

And then we're going to build a model and assign the built model into the model data object.

And once a model has been built, we're going to save it as the RDS file.

So we're going to deploy the model in the RDS format.

So in this random forest function code, we're specifying that we want to predict the species of the iris flower.

And we're going to use all four input parameters, and for the data set, we'll be using the training set for making the model. Then we're going to assign an ntree value, which is a parameter of the random forest, to be 500, and we're going to assign the mtry parameter to be 4, right, and then we're going to assign a TRUE value for the importance argument. Okay, and so what you want to do is run all of these blocks of code, so you could just press Ctrl+A to select everything and then Ctrl+Enter.

Okay, and then the data will be read, and then a model will be built, and it will be saved as model.rds.
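As a rough sketch of the model-building step just described, the following trains a random forest with the parameters mentioned (ntree of 500, mtry of 4, importance set to TRUE) and saves it as an RDS file. The built-in iris data set is used here in place of the training set, and the temporary file path and the seed are assumptions for illustration.

```r
library(randomForest)

set.seed(42)  # arbitrary seed for repeatability

# Predict Species from all four measurements
model <- randomForest(Species ~ ., data = iris,
                      ntree = 500, mtry = 4, importance = TRUE)

# Save the fitted model so an app can load it without retraining
model_path <- file.path(tempdir(), "model.rds")
saveRDS(model, model_path)

# Reading it back gives a ready-to-use model object
reloaded <- readRDS(model_path)
predict(reloaded, head(iris, 1))
```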

Okay, so this concludes the model.R file, and then we're going to close that.

And then we're going to open up the second file, which is app-numeric.R, okay.

So let's have a look.

The first few lines will be importing the necessary libraries, which will be the shiny library, data.table, and the randomForest package, and then we're going to read in the model that we have built in the previous step.

And we're going to assign it into a model object, right.

And then, like in the previous video, the shiny web application will contain three components.

So the first component being the UI, and the second component being the server.

And the third component being the shinyApp function, which will essentially piece together the UI and the server.

Okay, so let's have a look at the UI.

And we're going to open up the web application and have a look right at the same time.

And for readability of the code, I will just add some extra line breaks to it.

So that when I open up the web browser concurrently, the values here won't be hidden.

Okay? Save it and go back to the web application.

Alright, so here, the name of this web application is Iris Predictor.

And so it is in the header panel here.

So we put in the iris predictor, if you want to change the name, feel free to do so right here.

And then we're going to have the sidebar panel, which is on the left, and then we're going to have the main panel, which is on the right. So as always, the sidebar panel on the left will take in the input parameters, and then clicking on the submit button, which is right here, will send the input parameters to the server function, and the server will feed those input parameters into the predictive model, which is the random forest model, and make a prediction.

And once your prediction has been made, the resulting output value will then be fed back into the main panel right here.

And then the results will be displayed in the table data, which is going to be occurring right below this text message.

So the table data will be shown right here, which is the prediction being made.

Okay, so for the input parameters, we're going to use the HTML function, and inside we're going to set the size of the header to h3, right, and the name will be Input parameters right here.

So further showing the versatility of the shiny application framework.

So notice that the S and L are capital letters, and this is the ID of this input parameter, Sepal.Length, and it is case sensitive.

So we have to type it in exactly as is when we're going to use it in the next step.

So it's going to be like input, dollar sign, and then Sepal.Length, that is, input$Sepal.Length.

And then this will be the input parameter, which the server function will be using as the data to be fed into the random forest model.

Okay, and so the label here will be sepal length.

And the label is what is displayed right here, and the value is the default value, which is 5, and here it shows 5.

So if you change the default value to 5.1, and you save it, run the app again.

So you see that the 5.1 will be updated right here in place of the 5.0.

Okay, so the same thing will be for the sepal width, petal length and petal width, right with the label and with the value, which is the default value right here.

And then the next block of code here is the actionButton function.

And this will be the submit button.

So it will override the reactive behavior in which, when there is no submit button, every time we modify the numbers in here, a prediction will be made.

So that would put a heavy load onto the server, because every time that you update the value here, a prediction will be made.

So imagine that you update the values 10 times or 20 times, and then 20 predictions will be made.

Whereas in a situation where you have the submit button, you can take all the time you need and update the values as many times as you need, right? Let's say I enter a value and then I change my mind and I want to have it be 4.9.

So do this 10 more times.

And so the prediction will not be made, right.

So it's going to wait for you until you click on the submit button and then the prediction will be made.

Okay, so this will be more economical on the server side.

And also for familiarity, where we normally would click on some button in order to initiate the process of the prediction.
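The UI described so far can be sketched roughly as below. This is a sketch rather than the exact file from the video: the input IDs follow the narration, while the default values, the button ID submitbutton, and the output IDs contents and tabledata are assumptions for illustration.

```r
library(shiny)

ui <- pageWithSidebar(
  # Title shown at the top of the page
  headerPanel("Iris Predictor"),

  # Left side: input parameters and the submit button
  sidebarPanel(
    HTML("<h3>Input parameters</h3>"),
    # The first argument is the case-sensitive ID, used later as input$Sepal.Length
    numericInput("Sepal.Length", label = "Sepal Length", value = 5.1),
    numericInput("Sepal.Width",  label = "Sepal Width",  value = 3.6),
    numericInput("Petal.Length", label = "Petal Length", value = 1.4),
    numericInput("Petal.Width",  label = "Petal Width",  value = 0.2),
    # Nothing is predicted until this button is clicked
    actionButton("submitbutton", "Submit", class = "btn btn-primary")
  ),

  # Right side: status message and the prediction table
  mainPanel(
    tags$label(h3("Status/Output")),
    verbatimTextOutput("contents"),
    tableOutput("tabledata")
  )
)
```

Changing a value argument and reloading the app updates the default shown in the corresponding box.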

Okay, and then the following block of code main panel will be right here.

So the tags$label with the h3 Status/Output will be this part.

So notice that this block of code is exactly the same as the HTML block of code.

So I'm just showing you the versatility of the shiny web application.

And you could use either one, okay, but this is the shiny way of doing things, right.

So let me show you by putting it right here.

And then I'm going to comment that out and put in Input parameters here to replace the value inside.

Okay, reload the application, right, and then it looks exactly the same, right.
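The two ways of writing the same heading can be compared side by side. In this small sketch the variable names are made up for illustration; the first form passes raw HTML, the second builds the same h3 heading from shiny tag functions.

```r
library(shiny)

# Raw HTML string passed straight through
heading_html <- HTML("<h3>Input parameters</h3>")

# The same heading built with tag functions (wrapped in a label)
heading_tags <- tags$label(h3("Input parameters"))

# Both contain an h3 element with the same text
cat(as.character(heading_html), "\n")
cat(as.character(heading_tags), "\n")
```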

So you can do it both ways, right, and then the following text box shown here will tell you that the server is ready for calculation.

So this will be displayed upon loading of the web application.

And upon clicking on the submit button, the value will be changed to be calculation complete.

Okay, so this will be on the server side.

So I will show you in just a moment.

Okay, so we finished with the UI component.

And now let's go on to the server component.

And so here we're going to load in the function, okay, so this block of code here will be the input parameters, which will be obtained from the UI component where the user will input the input parameters and click on Submit button.

And upon doing that, all of the input parameters will come in as shown in this block of code here.

And this block of code will essentially generate the input.csv file, which will be read into the test object, and then the model is applied to make a prediction on this test object.

And once the prediction has been made, this block of code, datasetInput, will contain the prediction, and the prediction value will be inserted right here, right, and it's going to be encapsulated in the output$tabledata variable. And then output$tabledata will be sent to the main panel in the UI to be displayed.

So it's right here, right.

So this one will come from the tabledata right here, tabledata, right, this one highlighted in blue, and it will be coming from the prediction results table, right, for which we use the renderTable function here, and the datasetInput here will contain the prediction which is coming out from the Output data object.

Okay, so let me go specifically line by line here.

So a data frame will be created, and then the names will be the header variable names on the first row, and then the values will take the input parameter values from the UI.

So input$Sepal.Length, input$Sepal.Width, input$Petal.Length, and input$Petal.Width will come from the input text boxes right here: 5.1, 3.6, 1.4, 2.2.

So these text boxes will be input$Sepal.Length, input$Sepal.Width, input$Petal.Length, and input$Petal.Width, okay, and then we're going to create a data frame.

And once we have done that, we will write it out as an input.csv file.

Now we're going to read it back in, and then we're going to put it into the test object, and then we're going to create an Output object and a data frame will be created.

And we apply the prediction function in order to make a prediction using the random forest model on the input test data.

And once the prediction has been made, we will also report the probability rounded to three digits. Okay, and once a prediction has been made, it will then be sent to this Output data object and printed out, and it will represent the datasetInput, and this datasetInput will be inserted into the renderTable function, and a table will be generated to show you the output prediction results shown right here.
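The server logic walked through above can be sketched as follows. This is a simplified sketch, not the file from the video: it skips the input.csv round trip and builds the test data frame directly, the helper make_prediction is a made-up name, the button ID submitbutton and the status messages are assumptions, and a model trained on the built-in iris data stands in for the one read from model.rds.

```r
library(shiny)
library(randomForest)

# Stand-in for reading the saved model, e.g. readRDS("model.rds")
set.seed(42)
model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 4)

# Hypothetical helper: one-row data frame in, prediction plus probabilities out
make_prediction <- function(model, sepal_length, sepal_width,
                            petal_length, petal_width) {
  test <- data.frame(
    Sepal.Length = sepal_length,
    Sepal.Width  = sepal_width,
    Petal.Length = petal_length,
    Petal.Width  = petal_width
  )
  data.frame(
    Prediction = predict(model, test),
    round(predict(model, test, type = "prob"), 3)  # probabilities, 3 digits
  )
}

server <- function(input, output, session) {
  # Collect the four inputs and run the model on them
  datasetInput <- reactive({
    make_prediction(model, input$Sepal.Length, input$Sepal.Width,
                    input$Petal.Length, input$Petal.Width)
  })

  # Status text: changes once the submit button has been clicked
  output$contents <- renderPrint({
    if (input$submitbutton > 0) {
      isolate("Calculation complete.")
    } else {
      "Server is ready for calculation."
    }
  })

  # Prediction table: only computed after a click, not on every input change
  output$tabledata <- renderTable({
    if (input$submitbutton > 0) {
      isolate(datasetInput())
    }
  })
}
```

Wrapping datasetInput() in isolate means that editing the inputs alone does not trigger a new prediction; only a click on the button does.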

Right.

Okay.

So that's essentially it for this Iris predictor in the numeric form.

So let's close this and hop on to the next one.

Okay, so now we're going to proceed with the app slider version.

And before doing so you want to clean the workspace environment.

So click on the broom button.

And then after you have done that, you want to click on app-slider.R and then press Ctrl+A and then Ctrl+Enter.

Right, and then the web app will be loaded.

So you see that now, instead of a text box where we put in the numerical value, you're going to have a slider, right, and then you set the input parameter by sliding here, and then after you're satisfied with the input values, you click on the submit button, and then the prediction will be made.

And as always, it looks exactly the same, but the only difference is the input parameters will have the slider bar instead of the textbox.

Okay, so let's have a look under the hood.

So what new code did we add to this file? We've added lines 17, 18, and 19.

And then we've also added two new arguments, which are the min and the max arguments, to each of the inputs.

And we also changed the name from numericInput to sliderInput.

And that's essentially it, we just change a couple of lines of code and the web app will look like this.

Instead of a numeric text box, we're going to have a slider bar.

So the value of the minimum here will be taken from the min function applied to TrainSet$Sepal.Length, right.

So I don't have to manually put in the minimum value or maximum value, but I will do this programmatically.

So I'm going to use the min function, and inside the min function, as the argument, I'm going to say, okay, I want to have the TrainSet object.

And I want to have the Sepal.Length column.

And I want to know what is the minimum value, right, it's going to be like this.

So let me close this, and I'm going to read in the file.

So let me show you what it looks like.

And if I run TrainSet, it will look like this.

And then I'm going to run this line, and then notice that the first column will be gone, right, I don't want the index to be shown.

So I just deleted it.

And then when I say TrainSet$Sepal.Length, what I get will be this, so it's going to be the values of only the first column.

And upon adding the minimum function in front, I'm going to get the minimum value.

And if I use the maximum function, I'm going to get the maximum value of this column.

So the minimum is 4.3.

And the maximum is 7.9.

So instead of putting in the values manually, 4.3 7.9, I'm going to do it programmatically, and it's going to be so much easier, right? And then I just put it in right here.
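Setting the slider range programmatically might look like the sketch below. The built-in iris measurements stand in for the TrainSet object read from training.csv, and the default value is an assumption.

```r
library(shiny)

# Stand-in for the TrainSet object read from training.csv
TrainSet <- iris

# min and max are computed from the data instead of being typed in by hand
slider <- sliderInput(
  "Sepal.Length",
  label = "Sepal Length",
  value = 5.1,
  min = min(TrainSet$Sepal.Length),
  max = max(TrainSet$Sepal.Length)
)

range(TrainSet$Sepal.Length)
```

If the training data changes, the slider limits follow automatically without editing the UI code.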

And that's all for modifying the code and everything else works exactly the same.

And you get a new feel to the web application and it's not that difficult.

Okay, so play around and let me know what kind of web application you want to be made, or the input data that you want me to use for making the web app.

Okay, so today represents the fifth episode of the Web Apps in R series.

And today we're going to cover how you can develop a BMI calculator.

So if you're wondering what BMI is, essentially, BMI stands for body mass index, and it is computed by dividing the weight in kilograms by the square of the height in meters.

So for example, if you weighed 70 kilograms and you are 170 centimeters tall, then you would first have to convert the height to a meter.

So 170 centimeters would then become 1.7 meters.

And according to the equation, you would take your weight, which is 70 kilograms divided by 1.7 meter squared, okay, so let me calculate that.

So 1.7 times 1.7, would be 2.89.

And if I weighed 70 kilograms, divide that by the squared height, then my BMI would be 24.2.

Okay, so let's have a look at the scale of the BMI in adults.

So if you have a BMI below 18.5, then it would mean that you are underweight.

If you have a BMI in the range of 18.5 and 24.9, then it means that you have a healthy weight.

And if your BMI is between 25 and 29.9, it means that you are overweight.

And if you have a BMI of 30 or greater, then you are obese.

Okay, so in the previous example, a BMI of 24.2 would mean that the weight is a healthy weight.
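The calculation and the adult scale above can be written as two small functions. This is a sketch with made-up function names; the bands are treated as half-open ranges (for example, 25 up to but not including 30 counts as overweight), which is an assumption about how to handle values between the quoted limits.

```r
# BMI = weight in kilograms divided by height in meters, squared
compute_bmi <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # convert centimeters to meters
  weight_kg / (height_m * height_m)
}

# Map a BMI value to the adult categories described above
bmi_category <- function(bmi) {
  if (bmi < 18.5) {
    "Underweight"
  } else if (bmi < 25) {
    "Healthy weight"
  } else if (bmi < 30) {
    "Overweight"
  } else {
    "Obese"
  }
}

bmi <- compute_bmi(weight_kg = 70, height_cm = 170)
round(bmi, 2)
bmi_category(bmi)
```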

So without further ado, let's get started in developing our BMI web application.

So you want to go first to the GitHub of the data professor.

And so click on the code folder, find shiny and click on the shiny folder.

And then find 005-bmi and click on that.

And then you want to download both about.md and app.R onto your computer.

So let's do that: right-click on Raw and choose Save link as, because we're going to download it into a BMI folder.

We save it there.

And then download the second file: right-click on the Raw link, choose Save link as, and save it to the BMI folder.

Okay, and now we have to upload the folder.

Okay, there you have it, you have two files, app.R and about.md.

So let's have a quick look at what these files look like.

So as you can see, app.R is the R code comprising the three major components: number one is the UI, the user interface, and number two is the server.

And number three is the shinyApp function, which fuses both the UI and the server function.

And in the second file, you have the about.md.

So this is written in the markdown language.

And it's going to be used by the app.R file.

So we're going to see that in just a moment.

So before we dive deep into the R code, let's have a look at what the web application looks like.

Click on Run App.

Okay, so this is a simple web application where you can put in your input height and your weight, so the height will be in centimeters and the weight will be in kilograms.

So the minimum value here is 40 for the heights, and 250 for the maximum value, and for the weight, the minimum value is 20.

And the maximum value is 100.

So please note that this BMI calculator is developed for adults and it's not suitable for children.

If you want to develop a BMI for children, then we will have to refer to this second link here.

Okay, so notice that when you click on the About link on the navigation bar, it shows the information in the About page.

So originally, the code was written here in the Markdown language, and here in the website it displays it as a normal webpage.

So here you can add boldness to the text, you can add superscript, make it stand out as an equation, and you could add italic font, right.

So all of this is within the Markdown language, right? For example, if you use two asterisks, it will mean that the text will be in bold.

So meaning that you have to use two asterisks before and after the text you want to make bold.

And if you use four hash symbols, it means that you are going to use the header level 4 tag, which is the h4 tag in HTML.

And if you're using the greater-than symbol here, it means that it's going to display this light gray bar to the left.

So you know that it's an equation.

And if you use one asterisk, then it means that the text will be in italic form, right.

And here we use four hash symbols, and so it becomes a header.

And then we make BMI Calculator italic by using the asterisk before and after, and we even add links to websites, right.

So the text that you want to make into a link, you have to put in brackets, and immediately following that, you have to put the URL of the webpage in parentheses.

And so this is formatted like a normal web page.

And so the web application is mobile friendly, and you can use it on your mobile phone, okay, it would look something like this on the phone.

And if you click on it, you get the BMI, right. In my previous example, with a height of 170 and a weight of 70, you will get a BMI of 24.22145.

And because we're rounding it, we therefore get 24.22.

Okay, so this web application seems simple enough, okay.

And so let's dive deep into the code.

Okay, so let's have a look at the code of the app.R file.

So the first two lines here load the shiny and shinythemes library packages.

And then following that we have the user interface.

So inside the ui object, it's going to be the fluidPage function.

And here, we define that we're going to use the shinytheme called united. Let's first run the app and have a look.

So the navbarPage function here shows that we're going to use BMI Calculator as the name of this navigation bar.

And then the tabPanel will have the first navigation tab here, which is Home.

Okay, and inside the tabPanel here, we're going to use the sidebarPanel and the mainPanel.

So as usual, the sidebar panel is right here to the left, and to the right, with the status output, we're going to have the main panel. The sidebar panel will contain the input parameters, which comprise two input parameters, the height and the weight, and we're using a slider input, so you can slide the bar here to get the desired value, click on the submit button, and then you get the calculated BMI value.

Okay, so the slider input here is responsible for this slider button.

And so the name of this slider input is Height, here in the label, Height.

And the first argument will be the ID of this specific slider input.

And so this slider input has an ID of height, notice the small h, and the second slider input has an ID of weight, notice the small w. These two slider inputs will then be used in the next step by the server function as input$weight and input$height in order to calculate the BMI. Okay, and then the actionButton function will be the red button that you click to initiate the calculation process.

And so the main panel will have an h3 tag here showing the Status/Output, okay, and then the verbatimTextOutput will contain the contents ID, which is from the output in the server function.

And the tableOutput is also from the output of the server function; it is called tabledata.

And this is the table data containing the computed BMI value.

So let me recap that again.

So here with the slider inputs, we have two of them, height and weight, and so they will be referred to as input$height and input$weight.

And as the user slides this value, it will adjust the value of the height parameter or the weight parameter, and the input$height value and the input$weight value will then go to the server function.

I will show you right now, right here.

So we go to the server function, to the equation that we're going to use to compute the BMI.

So here we're taking input$weight and dividing it by, in parentheses, input$height divided by 100, right, because we want to convert the centimeters to meters, so we have to divide the centimeter value by one hundred, and that will then put it in meters.

And then we're going to multiply the height by itself so that we get the squared height value.

And then we're going to divide the weight by the squared height in order to get the BMI.

And we're going to encapsulate the BMI value inside a data frame so that we can display it in the final output. Here below, in output$contents, it will show that the server is ready for calculation or that the server has already completed the calculation.

So this will be modified by the submit button, the red button that we clicked right here.

So when we don't click the button, it will say server is ready for calculation.

But upon clicking on the red button, the BMI will be calculated.

And then in this text box, the text will change to calculation complete. Okay, and the following output here is called output$tabledata.

And inside here, we're going to use the renderTable function.

So the result from datasetInput will be the computed BMI value right here in print(bmi).

So let me recap again, let's have a look at the web application again.

So this web application will take two input parameters, the height in centimeters and the weight in kilograms, and they will be referred to as input$height and input$weight.

And upon clicking on the red button, they will be sent to the server function, into this BMI calculation.

So it will then take the input weight and the input height, perform the calculation, and return the BMI value.

And then we put the BMI value into a data frame.

And then we print it out, and the printed BMI result is the value of the datasetInput variable.

And that is called within the renderTable function of output$tabledata, and output$tabledata will go to the main panel right here to be displayed in the tableOutput.

And it looks like this right here.

So you see that it's called BMI, and then we have the BMI value right beneath it, right.

And that's all there is to building this BMI web application.
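Putting the pieces together, the app described above could be sketched like this. It is a reconstruction from the narration, not the downloaded app.R: the About tab is omitted, and the default slider values, the button ID submitbutton, and the status messages are assumptions. The app object is stored in a variable so that nothing launches until it is run explicitly.

```r
library(shiny)
library(shinythemes)

ui <- fluidPage(
  theme = shinytheme("united"),
  navbarPage(
    "BMI Calculator",
    tabPanel(
      "Home",
      # Left: the two sliders and the submit button
      sidebarPanel(
        HTML("<h3>Input parameters</h3>"),
        sliderInput("height", label = "Height", value = 170, min = 40, max = 250),
        sliderInput("weight", label = "Weight", value = 70, min = 20, max = 100),
        actionButton("submitbutton", "Submit", class = "btn btn-primary")
      ),
      # Right: status text and the BMI table
      mainPanel(
        tags$label(h3("Status/Output")),
        verbatimTextOutput("contents"),
        tableOutput("tabledata")
      )
    )
  )
)

server <- function(input, output, session) {
  # Weight divided by squared height, with height converted from cm to m
  datasetInput <- reactive({
    bmi <- input$weight / ((input$height / 100) * (input$height / 100))
    bmi <- data.frame(bmi)
    names(bmi) <- "BMI"
    bmi
  })

  output$contents <- renderPrint({
    if (input$submitbutton > 0) {
      isolate("Calculation complete.")
    } else {
      "Server is ready for calculation."
    }
  })

  output$tabledata <- renderTable({
    if (input$submitbutton > 0) {
      isolate(datasetInput())
    }
  })
}

# Combine UI and server; start it with runApp(app)
app <- shinyApp(ui = ui, server = server)
```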

So you can play around with this code.

And you can change the default value, for example, the height, you can make it 180 and the value of the weight, you could make it say 75, right and then run the code again.

So the default value then becomes updated to be 180 and 75.

So let's say that you want to update the maximum value to be 300, minimum value to be 50.

And the weight, you want to update it to say 30.

And the maximum would be 120.

And then we load the application and here you see the minimum values and maximum values are updated accordingly.

So you see here, if the weight is maintained the same and the height increases, then the BMI becomes less.

But if the height decreases, then the BMI becomes higher, right, because of the equation of the BMI, whereby the weight is divided by the height squared. Okay, and so you can play around with the existing values and play around with the template.

So let's say you want to change the united theme to cerulean.

Save it, reload the app, and here you go, you get a different colored web application.

Right.

So the types of theme could be obtained by looking at the websites.

You can Google for shiny themes and click on rstudio.github.io/shinythemes.

So I'll provide the link in the description down below.

So check that out.

In this video, I'm going to show you how you could deploy a shiny web application.

And without further ado, we're starting right now.

Okay, so the first thing that you want to do is head over to the GitHub of the data professor.

And you want to click on Repositories, then find and click on iris-r-heroku.

And so all of the files that are needed to deploy your app are found here.

So feel free to clone this to your own GitHub, or you could also download the entire folder content here by clicking on Code and then Download ZIP.

So let's have a look here.

So you're going to see that we have ui.R, which is the user interface.

And we also have server.R, which is the server-side component of the web app.

So essentially, the R Shiny web app comprises two components, the user interface and the server component.

And then we're going to have the training data set and the testing data set as CSV files.

And the actual model will be contained within model.rds.

And so the machine learning model is saved as model.rds.

And it will be loaded into the web application when we run it.

And then there are two additional r files.

Let's have a look.

So the first one is init.R.

And so let's have a look here.

So it will allow us to install the necessary libraries.

So we're going to install randomForest and data.table.

And run.R will allow us to run the R Shiny app and assign the proper port.
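The roles of these two helper files can be sketched as below. This is an illustration of the idea rather than the contents of the repository: the helper names missing_packages and get_port, the fallback port, and the use of a PORT environment variable set by the hosting platform are assumptions. The install and run calls are left commented out so that the sketch has no side effects.

```r
# init.R-style step: work out which required packages are not installed yet
my_packages <- c("randomForest", "data.table")

missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

# On the server you would then install whatever is missing:
# install.packages(missing_packages(my_packages))

# run.R-style step: read the port assigned by the platform, with a fallback
get_port <- function(default = 8080) {
  port <- Sys.getenv("PORT", unset = NA)
  if (is.na(port) || port == "") default else as.numeric(port)
}

# And start the app from the current directory on that port:
# shiny::runApp(appDir = getwd(), host = "0.0.0.0", port = get_port())
```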

Okay.

And so that's all there is to having the necessary components for deploying your R Shiny web application.

So let's head over to Heroku.

And so you want to click on New, then Create new app.

And then you want to give the app a name.

So let me call it dp-iris-r and click Create app.

And then I want to connect to GitHub.

And I'll find iris-r-heroku.

So for your case, you want to find your own GitHub and your own iris-r-heroku and then connect.

And then this is very important: you want to click on Settings.

And in order to have support for R, you're going to need to add a custom buildpack.

And so you want to click on Add buildpack.

And you're going to notice that there are some officially supported buildpacks.

So by default, you're going to have Python, right, and you're going to have others like PHP, Ruby, Java, and Node.js.

And so for R, we're going to use a third-party one.

And the third-party link to the buildpack for R is given here.

So I'm going to provide you this link in the description of this video.

So you want to copy that and then put it here as well.

And then click on save changes.

And then you're going to see that it has been added successfully here.

Now you want to head back to deploy, scroll down and you want to click on deploy branch in the manual deploy.

So at this point, you want to take a break, grab a cup of coffee and wait for the web app to deploy.

So we're going to see the log of what is happening here.

So initially, it is installing R version 3.6.3 here.

And it is downloading the buildpack directly from Amazon Web Services, and it includes shiny as well.

Right, so it's installing the data dot table library.

And right now it's building the environment.

So this web application has already been built in a previous video.

So the link to that video will be provided in the description down below.

And in the meantime, maybe I could show you that in the shiny folder it is number four, the iris predictor.

And so you're going to notice that we've been using the testing and training files, and for this app, we divided the UI and server components into separate files.

Okay, and so it's compressing the environment from 499 megabytes to 121 megabytes.

Although the R packages occupy 121 megabytes, okay, so it has compressed it down to 152.

And it is deploying the web application to dp-iris-r.herokuapp.com.

And so in just a moment, you're going to see a link to view the deployed websites.

Okay, finished.

And so it says that your app was successfully deployed.

Let's click on it.

Alright, so it says here that server is ready for a calculation.

Let's submit. All right, so the prediction seems to work, and it is predicted to be setosa with a probability of 100%.

Okay, this is predicted to be setosa as well.

Well, same thing, setosa, and now it is predicted to be virginica.

100% virginica.

Thank you for watching until the end of this video, and I hope that this video was helpful to you.

And for more tutorials in data science and bioinformatics, as well as Python and R coding tutorials, please check out my YouTube channel, Data Professor, and also my new, second YouTube channel, Coding Professor.

And you can also find me on the Medium platform, where I blog about data science as well as doing Python tutorials.

And last but not least, I would like to thank freeCodeCamp for this awesome collaboration.

And please don't forget to smash the like button, subscribe if you haven't already.

And until next time, the best way to learn data science is to do data science, and please enjoy the journey.

An introduction to aggregates in R: a powerful tool for playing with data

freeCodeCamp — Tue, 12 Mar 2019 05:56:18 +0000

By Satyam Singh Chauhan

Data Visualization is not just about colors and graphs. It’s about exploring the data and visualizing the right thing.

[Source](https://newatlas.com/art-ones-and-zeros-data-visualization/49926/)

While playing with the data, the most powerful tool that comes in handy is aggregates. An aggregate is just a type of transformation that we apply to any given data.

We have 11 aggregate functions available to us:

avg
Average of all numeric values is calculated and returned.
count
Function count returns total number of items in each group.
first
The first value of each group is returned by the function first.
last
The last value of each group is returned by the function last.
max
The max value of each group is returned by the function max.
It is very helpful to identify outliers as well.
median
The median of all numeric values for the mentioned group is returned by the function median.
min
The min value of each group is returned by the function min.
It is very helpful to identify outliers as well.
mode
The mode of all numeric values for the mentioned group is returned by the function mode.
rms
Root Mean Square, rms value for all numeric values in the group is returned by the function rms.
stddev
Standard Deviation of all numeric values given in the group is returned by the function stddev.
sum
Sum of all the numeric values is returned by the function sum.

Basic Examples

Basic Visual Scatter plot using aggregate function — sum

#Include the Library
library(plotly)

#Store the graph in one variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  y = iris$Petal.Length/iris$Petal.Width,
  x = iris$Species,
  mode = 'markers',
  marker = list(
    size = 15,
    color = 'green',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(
          target = 'y', func = 'sum', enabled = T
        )
      )
    )
  )
)

#Display the graph
p

What does this mean?

Function sum, as mentioned above, calculates the sum of each group.
Thus, here the groups are categorized by species. This code uses the Iris Data Set, which consists of three different species: setosa, versicolor, and virginica. For each species there are 50 observations in the data set. This data set is available in R (built-in) and can be loaded directly.

Two data sets are available, “iris” and “iris3”. You can choose either of them to run this code. The data set used in this article is “iris”.

Fig. 1 Sum of Petal Length

What does this code do exactly?

This code uses the function sum and calculates the sum of all the Petal.Length of each group respectively. The calculated sums are then plotted, with Species on the x-axis and the summation on the y-axis.

From this graph, we can get an idea that the petal size of setosa is smallest as the sum is the smallest, but it’s not conclusive evidence. To get conclusive evidence we can use the function avg.

The function sum is suitable for almost any data set. For example, one of the best places where this can be used is a population data set. In a world population data set, we can aggregate countries by continent and find the total population of the countries in each one.
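The same group-wise sum and average can be checked outside the plot with base R. This is a small sketch that assumes the quantity of interest is Petal.Length, as the text describes; the variable names are made up.

```r
# Sum and mean of Petal.Length within each species
petal_sum <- tapply(iris$Petal.Length, iris$Species, sum)
petal_avg <- tapply(iris$Petal.Length, iris$Species, mean)

petal_sum
petal_avg

# Species with the smallest total and the smallest average
names(which.min(petal_sum))
names(which.min(petal_avg))
```

Since every species has the same number of observations here, the sum and the average rank the species in the same order; with unequal group sizes, only the average would allow a fair comparison.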

Most used function — avg

# Include the library
library(plotly)

# Store the graph in one variable to make it easier to manipulate.
q <- plot_ly(
  type = 'bar',
  x = iris$Species,
  y = iris$Petal.Length / iris$Petal.Width,
  color = iris$Species,
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
)

# Display the graph
q

What does this mean?

The iris data set contains two columns for petals, Petal.Width and Petal.Length. We can use them to calculate the average ratio of Petal.Length to Petal.Width for each species.

Fig. 2 Average ratio of Petal Length to Petal Width

What does this code do exactly?

For each observation, the ratio of Petal.Length to Petal.Width is calculated, and then the average of these ratios is plotted for each species. As we can see from this bar plot, setosa has the largest ratio at close to 7, which means that a setosa petal is, on average, about 7 times as long as it is wide. Virginica, on the other hand, has the smallest ratio, with a length of nearly 3 times the width.
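The same averages can be reproduced without plotly, which makes it easy to read off the exact numbers behind the bars. The sketch below uses base R only. The names petal_ratio and ratio_avg are illustrative assumptions, not names from the original code.

```r
# Ratio of petal length to petal width for every observation.
petal_ratio <- iris$Petal.Length / iris$Petal.Width

# Average of those ratios within each species (what func = 'avg' plots).
ratio_avg <- tapply(petal_ratio, iris$Species, mean)
print(round(ratio_avg, 2))
```

Note that this is the mean of the per-flower ratios, which is not the same thing as the ratio of the mean length to the mean width.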

The avg function is flexible and gives the best results when it is applied to a carefully chosen quantity. For example, with a population data set, we could calculate the average birth-to-death ratio for each country.

Let’s use all the functions in one graph. Now we’re going to plot a scatter plot for each category and we’re going to use all the functions. To this graph we will add a button from which we can select the desired function to make our work easier and get the results quicker.

Aggregation of all functions — all functions in one-graph

# Include the library
library(plotly)

# Read the list of available aggregation functions from the plotly schema.
s <- schema()
agg <- s$transforms$aggregate$attributes$aggregations$items$aggregation$func$values

# Build one dropdown button per aggregation function.
l <- list()
for (i in seq_along(agg)) {
  ll <- list(
    method = "restyle",
    args = list('transforms[0].aggregations[0].func', agg[i]),
    label = agg[i]
  )
  l[[i]] <- ll
}

# Store the graph in one variable to make it easier to manipulate.
p <- plot_ly(
  type = 'scatter',
  x = iris$Species,
  y = iris$Sepal.Length / iris$Sepal.Width,
  mode = 'markers',
  marker = list(
    size = 20,
    color = 'orange',
    opacity = 0.8
  ),
  transforms = list(
    list(
      type = 'aggregate',
      groups = iris$Species,
      aggregations = list(
        list(target = 'y', func = 'avg', enabled = TRUE)
      )
    )
  )
) %>%
  layout(
    title = 'Plotly Aggregations by Satyam Chauhan<br>use dropdown to change aggregation<br>Sepal ratio of Length to Width',
    xaxis = list(title = 'Species'),
    yaxis = list(title = 'Sepal ratio: Length/Width'),
    updatemenus = list(
      list(
        x = 0.2,
        y = 1.2,
        xref = 'paper',
        yref = 'paper',
        yanchor = 'top',
        buttons = l
      )
    )
  )

# Display the graph
p

What does this mean?

We build a list that holds the name of every aggregation function plotly offers, and turn each name into a dropdown button. We can then use the dropdown to experiment with all of the aggregation functions on the same graph.

A few of the graphs with different examples are shown below.

Fig. 3 Illustrates the function mode.

What does this code do exactly?

First, a list is created as mentioned earlier, in which all the functions are stored. After the list is made, the y-axis is set to the ratio of Sepal.Length to Sepal.Width and the x-axis is set to Species.

After the ratio is calculated, the aggregate transform is applied with func = ‘avg’ as the initial setting. When we run this code and select the function ‘mode’, we get Fig. 3 (above), which shows that the mode of setosa is the lowest of the three, at around 1.4. The mode tells us that the ratio 1.4 occurs most often, so it is the value most likely to be sampled. The highest mode belongs to versicolor, at close to 2.2.

Fig. 4 Left: the change in the Sepal ratio of Length to Width. Right: the root mean square (rms) value of the ratio.

In Fig. 4 above, the change in the ratio of Sepal Length to Sepal Width is plotted, and the results are very different from the rest of the graphs. The change for setosa and virginica is the same and positive, while the change for versicolor is negative and roughly three times as large in magnitude.

The right figure, on the other hand, shows the rms value for each species. We can easily see that versicolor and virginica have almost the same value, which is significantly greater than the rms value of setosa.
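To see where numbers like these come from, the three aggregations discussed above can be sketched in base R. The helper names stat_mode, change and rms are hypothetical and written only for this illustration. The sketch assumes that "change" means the last value in a group minus the first, and that ties for the mode go to the value seen first, so plotly's own results may differ in edge cases.

```r
# Ratio of sepal length to sepal width for every observation.
sepal_ratio <- iris$Sepal.Length / iris$Sepal.Width

# Most frequent value; ties go to the value that appears first (assumption).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Last observation minus first observation within the group (assumption).
change <- function(x) x[length(x)] - x[1]

# Root mean square: square root of the mean of the squared values.
rms <- function(x) sqrt(mean(x^2))

# One column per species, one row per aggregation.
by_species <- sapply(
  split(sepal_ratio, iris$Species),
  function(x) c(mode = stat_mode(x), change = change(x), rms = rms(x))
)
print(round(by_species, 3))
```

Since "change" depends only on the first and last rows of each group, it is sensitive to row order and says little about the group as a whole.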

Conclusion

Aggregation functions are among the most powerful tools a developer can ask for. They can reveal patterns and results that you wouldn’t expect. To analyse data visually, you have to play with it, and that means manipulating and transforming it. Aggregation functions do that for you, and they are among the most widely used transforms. This article is just a start. You can certainly explore more and apply more. That’s what explorers do.

How to avoid scope creep, and other software design lessons learned the hard way

freeCodeCamp — Fri, 15 Feb 2019 20:39:55 +0000

By Dror Berel

From a data-science perspective.

You’ve got a fresh new project on your desk, some exciting data, a challenging Kaggle competition, a new client you wish to impress, and you are fully motivated. At first, the problem seems to be well defined, and you even feel comfortable with the task in hand. You have just completed a similar task. This new one should not be much different. Maybe even just a few copy/pastes with some modifications at the edges.

But then it comes… The client / collaborator / boss has just one simple additional request… It usually goes like this:

‘Hmmmm, I wonder how would the results look like if instead of x, we do only a minor change, just do y, or… you know what, let’s try both and see how it affects the results’.

Can the initial tool/solution you chose handle such an adjustment? It may be easy to copy/paste it with a couple of alterations, but what if you have to do it again and again? For how long are you going to stick to your initial plan?

Within the context of machine learning, some examples are:

Tuning: ‘let’s see how a different model parameter affects it’

Benchmarking: ‘let’s see how various models affect it’

Ensemble: ‘let’s try combining the best models together’

Resampling / cross-validation: ‘we must inspect for over-fitting’

Imagine adding on top of that some complex, messy, multi-layer, high-throughput genomics data that can easily go into a very fine resolution level (gene expression / mutation / sequence, …),… AND THEN adding multiple layers of various multi-genomic data on top of each other, …AND THEN doing it for multiple cohorts / studies in a meta-analysis level … you may end up with a VERY … BIG … UGLY … MESS!

Sound familiar? Unfortunately, I have been in this situation more than once. As much as I was motivated to please my collaborators, at those times, my tools were… limited, and not sufficient to deliver the broader scope resolution. At that time, I might not even have been aware that a higher level of scope was relevant.

Image source: https://leankit.com/wp-content/uploads/2013/11/Screen-Shot-2013-11-25-at-4.25.52-PM.png

A lot has been written about scope creep in the context of project management. But what would a scientist, who was mostly trained to care about the rightness of the analysis / tools, rather than the ‘management’ of the whole project, have to say about it?

The good news, my friend, is that it is never too late to learn from someone else’s mistakes. Here are a couple of lessons, learned the hard way. (No worries, this is not another blog post about reproducible research).

Lesson #1: Begin at the end! Define what your scope is. Do you need to extend it?

Make sure you understand what the highest expected resolution is! Brainstorm what the craziest outcomes of your project would be, and then agree on reasonable expectations within your timeframe and budget.

Have a very detailed, clear definition of the project scope. For example, is your solution going to handle just one data set, or more? How are you going to validate your results? There are always going to be more methods/data sets for that, but what would be just sufficient?

The tricky challenge with scope creep is that the client doesn’t really care or think in terms of “scope”. Their goal is to get a solution that solves a hypothesis, or a business need. Whether their request is within or outside scope is entirely your problem! DEAL WITH IT!

In the context of machine learning, back in the day, I used ad-hoc R packages that do just one multivariate model. They did the work well, but were too specific to the developer’s domain, and lacked the higher resolution needed for comparing the model with other models, aggregating other models, or resampling. Only later did I learn to utilize machine learning meta/aggregator packages such as mlr, tidymodels (the successor to caret), or SuperLearner to extend my scope. Read more about it here.

Lesson #2: Do not reinvent the wheel! There are other experts that know how to do it better than you!

In a role where you are expected to be multidisciplinary, and new tools/methods pop up daily that are accessible for everyone to use, exploring every new approach can be a slippery fall into a very deep rabbit hole. And guess what, nobody wants you to waste their time/money on that.

How to bet on the right tool? Ask yourself, what do the experts in that domain use? How mature is the tool they developed? Is it going to be maintained, or deprecated? They of course had their own learning curve, and over time, have perfected their tools to overcome the common pitfalls you are about to discover.

For me, with genomics data, it was Bioconductor object-oriented S4 classes. Read more here about why that was the best tool for my need. Sure, it wasn’t trivial to learn, but I felt comfortable betting on it when I saw how it is used at top academic and industry organizations. I also knew that it was not another open source resource that might die. Instead, it is a government and academia-funded project, powered by the best experts of the domain, open, and free, for all of us to use.

Lesson #3: Found a gap? Be creative, but keep it simple!

But what if something in the analytical pipeline is still not in place? A missing link, nowhere to be found, that would better fit your specific need and bridge the gap?

Here you might need to get some dirty work done, and stop depending on others to provide you the solution. Another potentially slippery scope creep rabbit hole? Maybe… if you are not careful enough!

How to avoid it? Very easy: Keep it Simple!

Here is a very simple example. Suppose you have to solve an unsupervised problem. There is definitely more than one way to do it. Which one to choose? Would the simplest one, say ‘hierarchical clustering’, be good enough to begin with? Implement it, see how it works with the rest of your analytical components (data, scalability, reproducibility), and later on, after things have worked out as you planned, relax that simplification into a more complex method. Do it very carefully and gradually.
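To make the "start simple" idea concrete, here is a minimal sketch of hierarchical clustering in base R. The article does not name a data set, so the built-in iris measurements stand in as an assumption, and the choice of three clusters and complete linkage is equally arbitrary.

```r
# Stand-in data: the four numeric iris columns, scaled to comparable units.
features <- scale(iris[, 1:4])

# Simplest possible first pass: Euclidean distances + complete linkage.
hc <- hclust(dist(features), method = "complete")

# Cut the tree into three groups (an arbitrary choice for this sketch).
clusters <- cutree(hc, k = 3)

# Compare the clusters with the known species labels.
print(table(clusters, iris$Species))
```

Once a baseline like this runs end to end with the rest of the pipeline, swapping in a more elaborate method becomes a contained change instead of a rewrite.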

More examples to follow next.

Lesson #4: Do not be afraid to refactor!

Tired of patching and debugging poorly cohesive and poorly designed code that someone else, maybe even your boss, wrote a long time ago, before better tools became available? You ask yourself, GRRRRR, this is such an ugly workaround, why not just simply use that new approach that was designed specifically for this task? (see lesson #2).

Yes, it is risky to begin everything from scratch, and sometimes you may not have the resources to do it, but perhaps it is time for a reality check.

But what if the refactored solution gives us different results from what our collaborators are already counting on? Well, if there was indeed a past error/bug/mistake, it is better to face it and acknowledge it now, before even more damage is done. But also remember lesson #3: If you stick to simple solutions at the core, refactoring them under a broader wrapping solution should help produce similar results.

Lesson #5: go to lesson #1.

Case studies:

Here are two case studies from my own experience working with multi genomic data. (Could easily expand to other types of data, but perhaps that is a topic for a future post).

Case Study #1: Bioc2mlr: A utility function to transform Bioconductor’s S4 omic classes into mlr’s task and CPOs.

https://drorberel.github.io/Bioc2mlr/

I love using Bioconductor data containers for genomic data, but I also love machine learning meta-aggregator toolkits for analysis at higher level scope. The only problem was that they were not necessarily compatible with each other.

The S4 object-oriented classes had multiple dimensions (slots), tied together by complex constraints that were intentionally designed to meet some purpose. But the machine learning approach was designed for a simplified, flat, two-dimensional, matrix-like input structure: columns for the features/variables, and rows for the subjects/observations.

I needed some way of breaking the S4 constrained ties, and flattening it. But unfortunately, to the best of my knowledge, I couldn’t find a way to do so. What should I have done?

Remember lesson #3: Should I spend my time on this task? Well… yes, why not? I felt comfortable enough with both approaches, had already experienced the ins and outs, the soft bellies, and I definitely appreciated the tremendous value of both approaches separately, but also jointly. In fact, creating this adapter package, Bioc2mlr, did not take too much effort, and if you look at the code itself, you will see relatively simple steps.

Conclusion of case 1: When you have a couple of good tools, but they are not compatible, create a simple new adapter to link them.


Case study #2: meta analysis

But that wasn’t enough for me…(see lesson #5).

My scope extension required me to provide a solution to an even higher level of analysis. Meta-analysis of multiple studies/cohorts, each with a multi-omic data cube, each with a downstream machine learning analytics pipeline, implementing resampling, and all that jazz, across all studies, and at scale. Phewww!

Quite a challenge! How should I address that while applying the lessons above?

Lesson #1: I began at the end. My ‘observation-unit’, the row, in a tidy fashion, is not the subject, nor the gene, nor just one of the omics. It is the entire study/cohort (that is, a whole data cube) well-compressed into a single object in R. More than one cohort? Not a problem at all. Add as many rows as you need for more cohorts.

Lesson #2: Didn’t have to invent a new tool. The experts in our field have already figured it out for us. They might have not had this implementation in mind when they did so, but if I can do it, so can you. Just give it a try.

Lesson #3: I found a simple solution. Should I invent/extend a new S4 object oriented class for this type of multi-cohort, multi-omic data? Of course not. There must be a simple solution. My simple solution: a tidy / nested data structure, with non-atomic objects at each cell. Read more about it here.
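A nested structure of this kind can be sketched in base R with a list-column, where each cell holds a whole object rather than a single value. Everything in the sketch is hypothetical: the cohort names, the random matrices standing in for omic data, and the column names are assumptions made for illustration, not the author's actual implementation.

```r
set.seed(1)  # make the made-up data reproducible

# One row per study/cohort (hypothetical names).
cohorts <- data.frame(study = c("cohort_A", "cohort_B"),
                      stringsAsFactors = FALSE)

# A list-column: each cell holds a non-atomic object, here a matrix
# of random numbers standing in for one cohort's omic data.
cohorts$omics <- list(
  matrix(rnorm(20), nrow = 5, dimnames = list(NULL, paste0("gene", 1:4))),
  matrix(rnorm(30), nrow = 5, dimnames = list(NULL, paste0("gene", 1:6)))
)

# Work on every cohort at once by mapping over the list-column.
cohorts$n_features <- vapply(cohorts$omics, ncol, integer(1))
print(cohorts[, c("study", "n_features")])
```

The appeal of this layout is that per-cohort steps such as model fitting or resampling become functions mapped over a column, so adding a cohort means adding a row.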

Lesson #4: Refactor? Well. Maybe I am not there yet now, since so far my (current) scope can handle all my wildest dreams. But if you show me a better approach, perhaps a data.table one (I know), or even in python (god forbid), I would not hesitate to give it a try, even if it is beyond my comfort zone.

Lesson #5: Meta-meta analysis? (Not a typo). Who knows. Maybe one day.

Conclusion of case 2: tidy everything! Even non-atomic objects.

https://drorberel.github.io/aboutme

One last piece of advice: Get an expert’s opinion, at least until you become one yourself.

‘If only I had known that before. That could have saved me so much time and effort…’

To the expert, your current challenges are yesterday’s resolution. They had already figured that out when we were still in kindergarten. They have spent their entire career just on that. Shoot them an email and ask a very clear question, with dependency-free proof-of-concept examples or case studies to demonstrate your challenge. My experience is that they are happy to assist if you respect their time and authority.

Final words

When you figure out what type of tool/solution you are passionate about, make it happen! Don’t fool yourself with excuses why it is not a good time for your new tool to be created. Just do it!

Don’t give up. Focus. Decide what you want to achieve. Do not be afraid to extend your scope, but do it with simple solutions! Refactor. It will be worth your time. Maybe not immediately, but in days to come. Be creative!

And last but not least, don’t be shy. Tell everyone about it. Share it with your community. Make the universe a better place with your solution. You may even earn an extra buck on the side. Who knows?

p.s.

This post is dedicated with love to all of my former anxious collaborators / clients / bosses. I appreciate your patience, and wish I had known the above before. You were there to assist and support me in learning these lessons the hard way, for both good and bad. Let me make it up to you. Shoot me an email and I will redo my old work in just a few lines of code, reflecting my current level of scope.

Check more related topics here: https://drorberel.github.io/

Consultant: currently accepting new projects!

Useful references:

Clean Coder Blog: On the Diminished Capacity to Discuss Things Rationally (blog.cleancoder.com)
Scope Creep in Project Management: Definition, Causes & Solutions (www.workamajig.com)

How to make Beautiful Ruby Plots with Galaaz

freeCodeCamp — Mon, 26 Nov 2018 18:13:24 +0000

By Rodrigo Botafogo

By Rodrigo Botafogo & Daniel Mossé

According to Wikipedia “Ruby is a dynamic, interpreted, reflective, object-oriented, general-purpose programming language. It was designed and developed in the mid-1990s by Yukihiro “Matz” Matsumoto in Japan.” It reached high popularity with the development of Ruby on Rails (RoR) by David Heinemeier Hansson.

RoR is a web application framework first released around 2005. It makes extensive use of Ruby’s meta-programming features. With RoR, Ruby became very popular. According to Ruby’s Tiobe index it peaked in popularity around 2008, then declined until 2015 when it started picking up again.

At the time of this writing (November 2018), the Tiobe index ranks Ruby as the 16th most popular language.

Python, a language similar to Ruby, ranks 4th in the index. Java, C and C++ take the first three positions. Ruby is often criticized for its focus on web applications. But Ruby can do much more than just web applications. Yet, for scientific computing, Ruby lags way behind Python and R. Python has the Django framework for web, NumPy for numerical arrays, and Pandas for data analysis. R is a free software environment for statistical computing and graphics with thousands of libraries for data analysis.

Until recently, there was no real way for Ruby to bridge this gap. Implementing a complete scientific computing infrastructure would take too long. Enter Oracle’s GraalVM:

GraalVM is a universal virtual machine for running applications written in JavaScript, Python 3, Ruby, R, JVM-based languages like Java, Scala, Kotlin, and LLVM-based languages such as C and C++.

GraalVM removes the isolation between programming languages and enables interoperability in a shared run-time. It can run either standalone or in the context of OpenJDK, Node.js, Oracle Database, or MySQL.

GraalVM allows you to write polyglot applications with a seamless way to pass values from one language to another. With GraalVM there is no copying or marshaling necessary as it is with other polyglot systems. This lets you achieve high performance when language boundaries are crossed. Most of the time there is no additional cost for crossing a language boundary at all.

Often developers have to make uncomfortable compromises that require them to rewrite their software in other languages. For example:

“That library is not available in my language. I need to rewrite it.”

“That language would be the perfect fit for my problem, but we cannot run it in our environment.”

“That problem is already solved in my language, but the language is too slow.”

With GraalVM we aim to allow developers to freely choose the right language for the task at hand without making compromises.

As stated above, GraalVM is a universal virtual machine that allows Ruby and R (and other languages) to run on the same environment. GraalVM allows polyglot applications to seamlessly interact with one another and pass values from one language to the other.

GraalVM is a very powerful environment. Yet, it still requires application writers to know several languages. To eliminate that requirement, we built Galaaz, a gem for Ruby, to tightly couple Ruby and R and allow those languages to interact in a way that the user will be unaware of such interaction. In other words, a Ruby programmer will be able to use all the capabilities of R without knowing the R syntax.

Library wrapping is a usual way of bringing features from one language into another. To improve performance, Python often wraps more efficient C libraries. For the Python developer, the existence of such C libraries is hidden. The problem with library wrapping is that for any new library, there is the need to handcraft a new wrapper requiring a high level of expertise and time.

Galaaz, instead of wrapping a single C or R library, wraps the whole R language in Ruby. As a result, all of the thousands of R libraries are available immediately to Ruby developers without any new wrapping effort.

To show the power of Galaaz, we show in this article how Ruby can use R’s ggplot2 library transparently bringing to Ruby the power of high quality scientific plotting. We also show that migrating from R to Ruby with Galaaz is a matter of small syntactic changes. By using Ruby, the R developer can use all of Ruby’s powerful object-oriented features. Also, with Ruby, it becomes much easier to move code from the analysis phase to the production phase.

In this article we will explore the R ToothGrowth dataset. To illustrate, we will create some boxplots. A primer on boxplot is available in this article.

We will also create a Corporate Template ensuring that plots will have a consistent visualization. This template is built using a Ruby module. There is a way of building ggplot themes that will work the same as the Ruby module. Yet, writing a new theme requires specific knowledge on theme writing. Ruby modules are standard to the language and don’t need special knowledge.

Here we show a scatter plot in Ruby also with Galaaz.

gKnit

Knitr is an application that converts text written in rmarkdown to many different output formats. For instance, a writer can convert an rmarkdown document to HTML, LaTeX, docx and many other formats.

Rmarkdown documents can contain text and code chunks. Knitr formats code chunks in a grayed box in the output document. It also executes the code chunks and formats the output in a white box. Every line of output from the execution code is preceded by ‘##’.

Knitr allows code chunks to be in R, Python, Ruby and dozens of other languages. Yet, while R and Python chunks can share data, in other languages, chunks are independent. This means that a variable defined in one chunk cannot be used in another chunk.

With gKnit, Ruby code chunks can share data. In gKnit, each Ruby chunk executes in its own scope, so local variables defined in one chunk are not accessible from other chunks. However, all chunks execute in the scope of a ‘chunk’ class, and instance variables (‘@’) are available in all chunks.

Exploring the Dataset

Let’s start by exploring our selected dataset. A dataset is like a simple Excel spreadsheet, in which each column has only one type of data. For instance, one column can hold floats, another integers, and a third strings.

The ToothGrowth R dataset records the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs, where each animal received one of three dose levels of Vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice (OJ) or ascorbic acid (a form of vitamin C and coded as VC).

The ToothGrowth dataset contains three columns: ‘len’, ‘supp’ and ‘dose’. Let’s take a look at a few rows of this dataset.

In Galaaz, R variables are accessed by using the corresponding Ruby symbol preceded by the tilde (‘~’) function. Note in the following chunk that ‘ToothGrowth’ is the R variable and Ruby’s ‘@tooth_growth’ is assigned the value of ‘~:ToothGrowth’.

# Read the R ToothGrowth variable and assign it to the
# Ruby instance variable @tooth_growth that will be
# available to all Ruby chunks in this document.
@tooth_growth = ~:ToothGrowth

# print the first few elements of the dataset
puts @tooth_growth.head

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Great! We’ve managed to read the ToothGrowth dataset and take a look at its elements. We see here the first 6 rows of the dataset. To access a column, follow the dataset name with a dot (‘.’) and the name of the column. Also use dot notation to chain methods in usual Ruby style.

# Access the tooth_growth 'len' column and print the first few
# elements of this column with the 'head' method.
puts @tooth_growth.len.head

## [1]  4.2 11.5  7.3  5.8  6.4 10.0

The ‘dose’ column contains a numeric value of either 0.5, 1 or 2, although the first 6 rows seen above only contain the value 0.5. Even though those are numbers, they are better interpreted as a factor or category. So, let’s convert our ‘dose’ column from numeric to ‘factor’.

In R, the function ‘as.factor’ is used to convert data in a vector to factors. To use this function from Galaaz, the dot (‘.’) in the function name is replaced by ‘__’ (double underscore). The function ‘as.factor’ becomes ‘R.as__factor’ or just ‘as__factor’ when chaining.

# convert the dose to a factor
@tooth_growth.dose = @tooth_growth.dose.as__factor
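For comparison, the same conversion in plain R can be sketched as follows. The copy named tooth_growth is an assumption made here so that the built-in dataset is left untouched. It simply mirrors the Ruby variable name.

```r
# Work on a copy of the built-in dataset (tooth_growth is an illustrative name).
tooth_growth <- ToothGrowth

# Convert the numeric dose column into a factor with one level per dose.
tooth_growth$dose <- as.factor(tooth_growth$dose)

print(levels(tooth_growth$dose))
```

Seen side by side, the Ruby version differs mainly in using method chaining and the double underscore in place of the dot.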

Let’s explore some more details of this dataset. In particular, let’s look at its dimensions, structure and summary statistics.

puts @tooth_growth.dim

## [1] 60  3

This dataset has 60 rows, one for each subject and 3 columns, as we have already seen.

Note that we do not need to call ‘puts’ when using the ‘str’ function. This function does not return anything and prints the structure of the dataset as a side effect.

@tooth_growth.str

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Observe that both variables ‘supp’ and ‘dose’ are factors. The system made variable ‘supp’ a factor automatically, since it contains two strings OJ and VC.

Finally, using the summary method, we get the statistical summary for the dataset:

puts @tooth_growth.summary

##       len        supp     dose
##  Min.   : 4.20   OJ:30   0.5:20
##  1st Qu.:13.07   VC:30   1  :20
##  Median :19.25           2  :20
##  Mean   :18.81
##  3rd Qu.:25.27
##  Max.   :33.90

Doing the Data Analysis

Quick plot for seeing the data

Let’s now create our first plot with the given data by accessing ggplot2 from Ruby. For Rubyists that have never seen or used ggplot2, here is the description of ggplot found in its home page:

“ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”

This description might be a bit cryptic and it is best to see it at work to understand it. Basically, in the grammar of graphics developers add layers of components such as grid, axes, data, title, subtitle and also graphical primitives such as bar plot, box plot, to form the final graphics.

Interested readers can look up the following articles on the grammar of graphics on Medium: A Comprehensive Guide to the Grammar of Graphics for Effective Visualization of Multi-dimensional Data and What are the Ingredients of a Terrible Data Story?

In order to make a plot, we apply the ‘ggplot’ function to the dataset. In R, this would be written as ggplot(<dataset>, ...). Galaaz gives you the flexibility to use either R.ggplot(<dataset>, ...) or <dataset>.ggplot(...). In the graph specification below, we use the second notation, which looks more like Ruby. Ggplot uses the ‘aes’ method to specify the x and y axes; in this case, the ‘dose’ on the x axis and the length (‘len’) on the y axis: ‘E.aes(x: :dose, y: :len)’. To specify the type of plot, add a geom to the plot. For a boxplot, the geom is R.geom_boxplot.


Note also that we have a call to ‘R.png’ before plotting and ‘R.dev__off’ after the print statement. ‘R.png’ opens a ‘png device’ for outputting the plot. If we do not pass a name to the ‘png’ function, the image gets a default name of ‘Rplot<nnn>.png’, where <nnn> is the number of the plot. ‘R.dev__off’ closes the device and creates the ‘png’ file. We can then include the generated ‘png’ file in the document by adding an rmarkdown directive.

Figure 1: Creating a Boxplot in png format for the ToothGrowth dataset. Dose x Length of Odontoblasts

Great! We’ve just managed to create and save our first plot in Ruby with only four lines of code. With this plot, we can now easily see a clear trend: as the dose of the supplement increases, so does the length of the teeth.
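
The trend seen in the boxplot can also be checked numerically. The centre line of each box is the median, so a minimal base R sketch (the name len_by_dose is illustrative) is to compute the median length for each dose:

```r
# Median odontoblast length for each dose level (0.5, 1 and 2 mg/day).
# These are the values drawn as the centre lines of the boxes.
len_by_dose <- tapply(ToothGrowth$len, ToothGrowth$dose, median)
print(len_by_dose)
```

If the medians increase from one dose to the next, the numbers agree with the visual impression from the plot.
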

Faceting the plot

This first plot shows a trend, but our data has information about two different delivery methods: orange juice (OJ) or ascorbic acid (VC). Let’s create a plot that helps us discern the effect of each delivery method.

This next plot is a faceted plot where each delivery method gets its own plot. On the left side, the plot shows the OJ delivery method. On the right side, we see the VC delivery method. To obtain this plot, we use the ‘R.facet_grid’ function, which automatically creates the facets based on the delivery method factors. The parameter to the ‘facet_grid’ method is a formula.

In Galaaz, we give programmers the flexibility to write formulas in two different ways. In the first way, the following changes from writing formulas in R (for example ‘x ~ y’) are necessary:

R symbols are represented by the same Ruby symbol prefixed with the ‘+’ method. The symbol x in R becomes +:x in Ruby;
The ‘~’ operator in R becomes ‘=~’ in Ruby. The formula x ~ y in R is written as +:x =~ +:y in Ruby;
The ‘.’ symbol in R becomes ‘+:all’

Another way of writing a formula is to use the ‘formula’ function with the actual formula as a string. The formula x ~ y in R can be written as R.formula("x ~ y"). For more complex formulas, the use of the ‘formula’ function is preferred.

The formula +:all =~ +:supp indicates to the ‘facet_grid’ function that it needs to facet the plot based on the supp variable and split the plot vertically. Changing the formula to +:supp =~ +:all would split the plot horizontally.
R.png("figures/facet_by_delivery.png")
@base_tooth = @tooth_growth.ggplot(E.aes(x: :dose, y: :len, group: :dose))
@bp = @base_tooth + R.geom_boxplot +
      # Split in vertical direction
      R.facet_grid(+:all =~ +:supp)
puts @bp
R.dev__off

Figure 2: ToothGrowth dataset faceted by delivery method
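For comparison, here is a rough sketch of what the two formulas look like in plain R with ggplot2, where `. ~ supp` corresponds to `+:all =~ +:supp` and `supp ~ .` to `+:supp =~ +:all`. The variable names `tooth`, `base_plot`, `by_columns` and `by_rows` are illustrative assumptions.

```r
library(ggplot2)

# ToothGrowth ships with R; dose is converted to a factor for grouping
tooth <- transform(ToothGrowth, dose = factor(dose))

base_plot <- ggplot(tooth, aes(x = dose, y = len, group = dose)) +
  geom_boxplot()

# `. ~ supp`: one facet per delivery method, placed side by side
by_columns <- base_plot + facet_grid(. ~ supp)

# `supp ~ .`: one facet per delivery method, stacked on top of each other
by_rows <- base_plot + facet_grid(supp ~ .)
```

Printing `by_columns` or `by_rows` renders the corresponding layout.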
It now becomes clear that although both delivery methods have a direct impact on tooth growth, the effect of OJ is non-linear: it has a higher impact at smaller doses of ascorbic acid, and its impact diminishes as the dose increases. With the VC approach, the impact seems to be more linear.
Adding Color
If we were writing about data analysis, we would make a better analysis of the trends and improve the statistical analysis. But here we are interested in working with ggplot in Ruby. So, let’s add some colors to this plot to make the trend and comparison more visible.
In the following plot, the boxes are color coded by dose. To add color, it is enough to add fill: :dose to the aesthetic of boxplot. With this command each ‘dose’ factor gets its own color.
R.png("figures/facets_by_delivery_color.png")
@bp = @bp + R.geom_boxplot(E.aes(fill: :dose))
puts @bp
R.dev__off

Figure 3: Adding color to the faceted boxplot figure
Faceting helps us compare the general trends for each delivery method. Adding color allows us to compare specifically how each dosage impacts tooth growth. It is possible to observe that with smaller doses, up to 1mg, OJ performs better than VC (red color). For 2mg, both OJ and VC have the same median, but OJ is less dispersed (blue color). For 1mg (green color), OJ is significantly better than VC. From this very quick visual analysis, it seems that OJ is a better delivery method than VC.
Clarifying the data
Boxplots give us a nice idea of the distribution of data, but looking at those plots with large colored boxes leaves us wondering what else is going on. According to Edward Tufte in Envisioning Information:

Thin data rightly prompts suspicions: “What are they leaving out? Is that really everything they know? What are they hiding? Is that all they did?” Now and then it is claimed that vacant space is “friendly” (anthropomorphizing an inherently murky idea) but it is not how much empty space there is, but rather how it is used. It is not how much information there is, but rather how effectively it is arranged.

And he states:

A most unconventional design strategy is revealed: to clarify, add detail.

Let’s use this wisdom and add yet another layer of data to our plot, so that we clarify it with detail and do not leave large empty boxes. In this next plot, we add data points for each of the 60 pigs in the experiment. For that, add the function ‘R.geom_point’ to the plot.
R.png("figures/facets_with_points.png")
# Add point for each subject
@bp = @bp + R.geom_point
puts @bp
R.dev__off

Figure 4: Adding points for all data. Not everything can be seen because of data hiding (some points on top of others)
Now we can see the actual distribution of all the 60 subjects. Actually, this is not totally true. We have a hard time seeing all 60 subjects. It seems that some points might be placed one over the other hiding useful information.
But no sweat! Another layer might solve the problem. In the following plot a new layer called ‘geom_jitter’ is added to the plot. Jitter adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets. This makes it easier to see all of the points and prevents data hiding. We also add color and change the shape of the points, making them even easier to see.
R.png("figures/facets_with_jitter.png")
# Use small diamonds in a light blue color (cyan3)
# to plot the subjects of the experiment
puts @bp + R.geom_jitter(shape: 23, color: "cyan3", size: 1)
R.dev__off

Figure 5: Jittering the points to show those on top of each other — All data is now visible
Preparing the Plot for Presentation
We have come a long way since our first plot. As we’ve already said, this is not an article about data analysis and the focus is on the integration of Ruby and ggplot. So, let’s assume that the analysis is now done. Yet, ending the analysis does not mean that the work is done. On the contrary, the hardest part is yet to come!
After the analysis it is necessary to communicate it by making a final plot for presentation. The last plot has all the information we want to share, but it is not very pleasing to the eye.
Improving Colors
Let’s start by trying to improve colors. For now, we will not use the jitter layer. The previous plot uses three bright colors. Is there any obvious, or non-obvious for that matter, interpretation for the colors? Clearly, they are just random colors selected automatically by our software. Although those colors helped us understand the data, for a final presentation random colors can distract the viewer.
In the following plot we use the ‘scale_fill_manual’ function to change the colors of the boxes and the order of the labels. For colors, we use shades of blue for each dosage, with light blue (‘cyan’) representing the lower dose and deep blue (‘deepskyblue4’) the higher dose.
Also, the legend could be improved: we use the ‘breaks’ parameter to put the smaller value (0.5) at the bottom of the labels and the largest (2) at the top. This ordering seems more natural and matches with the actual order of the colors in the plot.
R.png("figures/facets_by_delivery_color2.png")
@bp = @bp +
      R.scale_fill_manual(values: R.c("cyan", "deepskyblue",
                                      "deepskyblue4"),
                          breaks: R.c("2","1","0.5"))
puts @bp
R.dev__off

Figure 6: Shades of blue representing lower to higher doses
Violin Plot and Jitter
The boxplot with jitter did look a bit overwhelming. The next plot uses a variation of a boxplot known as a violin plot with jittered data.
From Wikipedia

A violin plot is a method of plotting numeric data. It is similar to a box plot with a rotated kernel density plot on each side.
A violin plot has four layers. The outer shape represents all possible results, with thickness indicating how common. (Thus the thickest section represents the mode average.) The next layer inside represents the values that occur 95% of the time. The next layer (if it exists) inside represents the values that occur 50% of the time. The central dot represents the median average value.

R.png("figures/violin_with_jitter.png")
@violin = @base_tooth + R.geom_violin(E.aes(fill: :dose)) +
   R.facet_grid(+:all =~ +:supp) +
   R.geom_jitter(shape: 23, color: "cyan3", size: 1) +
   R.scale_fill_manual(values: R.c("cyan", "deepskyblue",
                                   "deepskyblue4"),
                       breaks: R.c("2","1","0.5"))
puts @violin
R.dev__off

Figure 7: Violin plot with shades of blue and jitter
This plot is an alternative to the original boxplot. For the final presentation, it is important to think which graphics will be best understood by our audience. A violin plot is a less known plot and could add mental overhead, yet, in my opinion, it does look a bit better than the boxplot and provides even more information than the boxplot with jitter.
Adding Decoration
Our final plot is starting to take shape, but a presentation plot should have at least a title, labels on the axes and maybe some other decorations. Let’s start adding those. Since decoration requires more graph area, this new plot has a ‘width’ and ‘height’ specification. When there is no specification, the default values from R for width and height are 480 pixels.
The ‘labs’ function adds the required decoration. In this example we use ‘title’, ‘subtitle’, ‘x’ for the x axis label and ‘y’, for the y axis label, and ‘caption’ for information about the plot (for clarity, we defined a caption variable using Ruby’s Here Doc style).
R.png("figures/facets_with_decorations.png", width: 540, height: 560)
caption = <<-EOT
Length of odontoblasts in 60 guinea pigs. Each animal received one of three dose levels of vitamin C.
EOT
@decorations =
  R.labs(title: "Tooth Growth:  Length vs Vitamin C Dose",
         subtitle: "Faceted by delivery method, OJ or VC",
         x: "Dose (mg)", y: "Teeth length",
         caption: caption)
puts @bp + @decorations
R.dev__off

Figure 8: Adding title, subtitle, axes names and caption
The Corp Theme
We are almost done. But the default plot configuration does not yet look nice to the eye. We are still distracted by many aspects of the graph. First, the black font color does not look good. Then plot background, borders, grids all add clutter to the plot.
We will now define our corporate theme in a module that can be used/loaded for all plots, similar to CSS or any other style definition.
In this theme, we remove borders and grids. The background is left for faceted plots but removed for non-faceted plots. Font colors are a shade of blue (color: ‘#000080’). Axes labels are moved near the end of the axis and written in ‘bold’.
module CorpTheme
  R.install_and_loads 'RColorBrewer'

  #----------------------------------------------------------------
  # face can be  (1=plain, 2=bold, 3=italic, 4=bold-italic)
  #----------------------------------------------------------------
  def self.text_element(size, face: "plain", hjust: nil)
    E.element_text(color: "#000080",
                   face: face,
                   size: size,
                   hjust: hjust)
  end

  #----------------------------------------------------------------
  # Defines the plot theme (visualization).  In this theme we
  # remove major and minor grids, borders and background.  We
  # also turn-off scientific notation.
  # 'faceted' is a keyword argument, matching the calls
  # CorpTheme.global_theme(faceted: true) used below.
  #----------------------------------------------------------------
  def self.global_theme(faceted: false)
    # turn-off scientific notation like 1e+48
    R.options(scipen: 999)
    # remove major grids
    gb = R.theme(panel__grid__major: E.element_blank())
    # remove minor grids
    gb = gb + R.theme(panel__grid__minor: E.element_blank)
    # remove border
    gb = gb + R.theme(panel__border: E.element_blank)
    # remove background. When working with faceted graphs,
    # the background makes it easier to see each facet, so
    # leave it
    gb = gb +
      R.theme(panel__background: E.element_blank) if !faceted
    # Change axis font
    gb = gb + R.theme(axis__text: text_element(8))
    # change axis title font
    gb = gb +
      R.theme(axis__title:
        text_element(10, face: "bold", hjust: 1))
    # change font of title
    gb = gb + R.theme(title: text_element(12, face: "bold"))
    # change font of subtitle
    gb = gb + R.theme(plot__subtitle: text_element(9))
    # change font of captions
    gb = gb + R.theme(plot__caption: text_element(8))
  end
end
Final Box Plot
We can now easily make our final boxplot and violin plot. All the layers for the plot were added in order to expose our understanding of the data and the need to present the result to our audience.
The final specification is just the addition of all layers built up to this point (@bp), plus the decorations (@decorations), plus the corporate theme.
Here is our final boxplot, without jitter.
R.png("figures/final_box_plot.png", width: 540, height: 560)
puts @bp + @decorations + CorpTheme.global_theme(faceted: true)
R.dev__off

Figure 9: Final boxplot with all decoration, but no jitter
And here is the final violin plot, with jitter and the same look and feel of the corporate boxplot.
R.png("figures/final_violin_plot.png", width: 540, height: 560)
puts @violin + @decorations + CorpTheme.global_theme(faceted: true)
R.dev__off

Figure 10: Final violin plot, with decorations and jitter
Another View
We now make another plot, with the same look and feel as before but faceted by dose and not by supplement. This shows how easy it is to create new plots by changing just a few statements in the grammar of graphics.
R.png("figures/facet_by_dose.png", width: 540, height: 560)
caption = <<-EOT
Length of odontoblasts in 60 guinea pigs. Each animal received one of three dose levels of vitamin C.
EOT
@bp = @tooth_growth.ggplot(E.aes(x: :supp, y: :len, group: :supp)) +
      R.geom_boxplot(E.aes(fill: :supp)) +
      R.facet_grid(+:all =~ +:dose) +
      R.scale_fill_manual(values: R.c("cyan", "deepskyblue4")) +
      R.labs(title: "Tooth Growth:  Length by Dose",
             subtitle: "Faceted by dose",
             x: "Delivery method", y: "Teeth length",
             caption: caption) +
      CorpTheme.global_theme(faceted: true)
puts @bp
R.dev__off

Figure 11: New plot with the same ‘corporate’ look and feel
Conclusion
In this article, we introduce Galaaz and show how to tightly couple Ruby and R in a way that Ruby developers do not need to be aware of the executing R engine. For the Ruby developer the existence of R is of no consequence, she is just coding in Ruby. On the other hand, for the R developer, migration to Ruby is a matter of small syntactic changes with a very gentle learning curve. As the R developer becomes more proficient in Ruby, he can start using ‘classes’, ‘modules’, ‘procs’, ‘lambdas’.
Trying to bring to Ruby the power of R starting from scratch is an enormous endeavor and would probably never be accomplished. Today’s data scientists would certainly stick with either Python or R. Now, both the Ruby and R communities can benefit from this marriage, provided by Galaaz on top of GraalVM and Truffle’s polyglot environment.
We developed the coupling of Ruby and R, but the process we used can also be done to couple Ruby and JavaScript or Ruby and Python. In a polyglot world we believe that a uniglot library might be extremely relevant.
From the perspective of performance, GraalVM and Truffle promise speedups that could reach over 10 times, both for FastR and for TruffleRuby.
This article has shown how to improve a plot step by step. Starting from a very simple boxplot with all default configurations, we moved slowly to our final plot. The important point here is not whether the final plot is actually beautiful (as beauty is in the eye of the beholder), but that there is a process of small, incremental improvements that can be followed to get a final plot ready for presentation.
Finally, this whole article was written in rmarkdown and compiled to HTML by gknit, an application that wraps knitr and allows documenting Ruby code. This application can be of great help for any Rubyist trying to write articles, blogs or documentation for Ruby.
Installing Galaaz
Prerequisites

GraalVM (>= rc8): https://github.com/oracle/graal/releases
TruffleRuby
FastR

The following R packages will be automatically installed when necessary, but could be installed prior to using gKnit if desired:

ggplot2
gridExtra
knitr

Installation of R packages requires a development environment and can be time consuming. On Linux, the GNU compiler and tools should be enough. I am not sure what is needed on the Mac.
In order to run the ‘specs’ the following Ruby package is necessary:

gem install rspec

Preparation

gem install galaaz

Usage

gknit 
In a script, add: require 'galaaz'

Running the demos
After installation, many galaaz demos are available. Running:
> galaaz -T
will show a list with all available demos. To run any of the demos in the list, replace the call to ‘rake’ with ‘galaaz’. For instance, one of the examples in the list is ‘rake sthda:bar’. In order to run this example just do ‘galaaz sthda:bar’. Doing ‘galaaz sthda:all’ will run all demos in the sthda category, in this case a slide show with over 80 ggplot graphics written in Ruby.
Some of the examples require ‘rspec’ to be available. To install ‘rspec’ just do ‘gem install rspec’.



 An introduction to web scraping using R 
freeCodeCamp — Wed, 24 Oct 2018 21:25:46 +0000
 By Hiren Patel
With the e-commerce boom, businesses have gone online. Customers, too, look for products online. Unlike the offline marketplace, a customer can compare the price of a product available at different places in real time.
Therefore, competitive pricing is something that has become the most crucial part of a business strategy.
In order to keep prices of your products competitive and attractive, you need to monitor and keep track of prices set by your competitors. If you know what your competitors’ pricing strategy is, you can accordingly align your pricing strategy to get an edge over them.
Hence, price monitoring has become a vital part of the process of running an e-commerce business.
You might wonder how to get hold of the data to compare prices.
The top 3 ways of getting the data you need for price comparison
1. Feeds from Merchants
As you might be aware, there are several price comparison sites available on the internet. These sites reach an agreement with businesses under which they receive product data directly from them and use it for price comparison.
These businesses put into place an API, or utilize FTP to provide the data. Generally, a referral commission is what makes a price comparison site financially viable.
2. Product feeds from third-party APIs
On the other hand, there are services which offer e-commerce data through an API. When such a service is used, the third party pays for the volume of data.
3. Web Scraping
Web scraping is one of the most robust and reliable ways of getting web data from the internet. It is increasingly used in price intelligence because it is an efficient way of getting the product data from e-commerce sites.
You may not have access to the first and second option. Hence, web scraping can come to your rescue. You can use web scraping to leverage the power of data to arrive at competitive pricing for your business.
Web scraping can be used to get current prices for the current market scenario, and e-commerce more generally. We will use web scraping to get the data from an e-commerce site. In this blog, you will learn how to scrape the names and prices of products from Amazon in all categories, under a particular brand.
Extracting data from Amazon periodically can help you keep track of the market trends of pricing and enable you to set your prices accordingly.
Table of contents

Web scraping for price comparison
Web scraping in R
Implementation
End note

1. Web scraping for price comparison
As the market wisdom says, price is everything. The customers make their purchase decisions based on price. They base their understanding of the quality of a product on price. In short, price is what drives the customers and, hence, the market.
Therefore, price comparison sites are in great demand. Customers can easily navigate the whole market by looking at the prices of the same product across the brands. These price comparison websites extract the price of the same product from different sites.
Along with price, price comparison websites also scrape data such as the product description, technical specifications, and features. They project the whole gamut of information on a single page in a comparative way.
This answers the question the prospective buyer has asked in their search. Now the prospective buyer can compare the products and their prices, along with information such as features, payment, and shipping options, so that they can identify the best possible deal available.
Pricing optimization has its impact on the business in the sense that such techniques can enhance profit margins by 10%.
E-commerce is all about competitive pricing, and it has spread to other business domains as well. Take the case of travel. Now even travel-related websites scrape the price from airline websites in real time to provide the price comparison of different airlines.
The only challenge in this is to update the data in real time and stay up to date every second as prices keep changing on the source sites. Price comparison sites update prices either with cron jobs or at view time, depending on how the site owner has configured them.
This is where this blog can help you. You will be able to work out a scraping script that you can customize to suit your needs. You will be able to extract product feeds, images, price, and all other relevant details regarding a product from a number of different websites. With this, you can create your own powerful database for a price comparison site.
2. Web scraping in R
Price comparison becomes cumbersome because getting web data is not that easy — there are technologies like HTML, XML, and JSON to distribute the content.
So, in order to get the data you need, you must effectively navigate through these different technologies. R can help you access data stored in these technologies. However, it requires a bit of in-depth understanding of R before you get started.
What is R?
Web scraping is an advanced task that not many people perform. Web scraping with R is, certainly, technical and advanced programming. An adequate understanding of R is essential for web scraping in this way.
To start with, R is a language for statistical computing and graphics. Statisticians and data miners use R a lot due to its evolving statistical software, and its focus on data analysis.
One reason R is such a favorite among this set of people is the quality of plots which can be worked out, including mathematical symbols and formulae wherever required.
R is wonderful because it offers a vast variety of functions and packages that can handle data mining tasks.
rvest, Rcrawler, and similar R packages are used for data collection.
In this segment, we will see what kinds of tools are required to work with R to carry out web scraping. We will see it through the use case of the Amazon website, from which we will try to get the product data and store it in JSON form.
Requirements
In this use case, knowledge of R is essential, and I am assuming that you have a basic understanding of R. You should be familiar with at least one R interface, such as RStudio. The base R installation interface is fine.
If you are not aware of R and the other associated interfaces, you should go through this tutorial.
Now let’s understand how the packages we’re going to use will be installed.
Packages:
1. rvest
Hadley Wickham authored the rvest package for web scraping in R. rvest is useful in extracting the information you need from web pages.
Along with this, you also need to install the selectr and ‘xml2’ packages.
Installation steps:
install.packages('selectr')
install.packages('xml2')
install.packages('rvest')
rvest contains the basic web scraping functions, which are quite effective. Using the following functions, we will try to extract the data from web sites.

read_html(url) : scrape HTML content from a given URL
html_nodes(): identifies HTML wrappers.
html_nodes(".class"): calls node based on CSS class
html_nodes("#id"): calls node based on id
html_nodes(xpath="xpath"): calls node based on xpath (we’ll cover this later)
html_attrs(): identifies attributes (useful for debugging)
html_table(): turns HTML tables into data frames
html_text(): strips the HTML tags and extracts only the text
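As a minimal sketch of how a few of these functions fit together, the example below parses a small hand-written HTML string instead of a live page. The markup, the `.price` class, and the variable names are made up for illustration; real pages will need their own selectors.

```r
library(rvest)

# A tiny hand-written page standing in for a real product page (hypothetical)
html <- '<html><body>
  <h1 id="title">  Example Phone  </h1>
  <span class="price">199.00</span>
  <table>
    <tr><th>Spec</th><th>Value</th></tr>
    <tr><td>Memory</td><td>64GB</td></tr>
  </table>
</body></html>'

# read_html() also accepts literal HTML, not only URLs
page <- read_html(html)

# Select by id and by class, then strip the tags
title <- html_text(html_nodes(page, "h1#title"))
price <- html_text(html_nodes(page, ".price"))

# html_table() returns one data frame per matched table
specs <- html_table(html_nodes(page, "table"))[[1]]

trimws(title)
price
specs
```

Working on an inline string like this is a convenient way to test selectors before pointing the same code at a real URL.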

2. stringr
stringr comes into play when you think of tasks related to data cleaning and preparation.
There are four essential sets of functions in stringr:

stringr functions are useful because they enable you to work around the individual characters within the strings in character vectors
there are whitespace tools which can be used to add, remove, and manipulate whitespace
there are locale sensitive operations whose operations will differ from locale to locale
there are pattern matching functions. These functions recognize four engines of pattern description. Regular expressions are the standard one, but there are other tools as well

Installation
install.packages('stringr')
3. jsonlite
What makes the jsonlite package useful is that it is a JSON parser/generator which is optimized for the web.
It is vital because it enables an effective mapping between JSON data and the crucial R data types. Using this, we are able to convert between R objects and JSON without loss of type or information, and without the need for any manual data wrangling.
This works really well for interacting with web APIs, or if you want to create ways through which data can travel in and out of R using JSON.
Installation
install.packages('jsonlite')
Before we jump into it, let’s see how it works:
It should be clear at the outset that each website is different, because the coding that goes into a website is different.
Web scraping is the technique of identifying and using these patterns of coding to extract the data you need. Your browser makes the website available to you from HTML. Web scraping is simply about parsing the HTML made available to you from your browser.
Web scraping has a set process that works like this, generally:

Access a page from R
Instruct R where to “look” on the page
Convert data in a usable format within R using the rvest package

Now let’s go to implementation to understand it better.
3. Implementation
Let’s implement it and see how it works. We will scrape the Amazon website for the price comparison of a product called “One Plus 6”, a mobile phone.
You can see it here.
Step 1: Loading the packages we need
We need to be in the console, at the R command prompt, to start the process. Once we are there, we need to load the packages required as shown below:
# loading the packages
library(xml2)
library(rvest)
library(stringr)
Step 2: Reading the HTML content from Amazon
# Specifying the URL of the website to be scraped
url <- 'https://www.amazon.in/OnePlus-Mirror-Black-64GB-Memory/dp/B0756Z43QS?tag=googinhydr18418-21&tag=googinkenshoo-21&ascsubtag=aee9a916-6acd-4409-92ca-3bdbeb549f80'
# Reading the HTML content from Amazon
webpage <- read_html(url)
In this code, we read the HTML content from the given URL, and assign that HTML into the webpage variable.
Step 3: Scrape product details from Amazon
Now, as the next step, we will extract the following information from the website:
Title: The title of the product.
Price: The price of the product.
Description: The description of the product.
Rating: The user rating of the product.
Size: The size of the product.
Color: The color of the product.
This screenshot shows how these fields are arranged.

Next, we will make use of HTML tags, like the title of the product and price, for extracting data using Inspect Element.
In order to find out the class of the HTML tag, use the following steps:
=> go to chrome browser => go to this URL => right click => inspect element
NOTE: If you are not using the Chrome browser, check out this article.
Based on CSS selectors such as class and id, we will scrape the data from the HTML. To find the CSS class for the product title, we need to right-click on title and select “Inspect” or “Inspect Element”.

As you can see below, I extracted the title of the product with the help of html_nodes in which I passed the id of the title — h1#title — and webpage which had stored HTML content.
I could also get the title text using html_text and print the text of the title with the help of the head() function.
# scrape title of the product
title_html <- html_nodes(webpage, 'h1#title')
title <- html_text(title_html)
head(title)
The output is shown below:

We got the title of the product, but it is padded with spaces and \n characters.
The next step is to remove the spaces and new lines with the help of the str_replace_all() and str_trim() functions in the stringr library.
# remove all new lines and surrounding spaces, keeping the cleaned value
title <- str_replace_all(title, "[\r\n]", "")
title <- str_trim(title)
title
Output:

Now we will need to extract the other related information of the product following the same process.
Price of the product:
# scrape the price of the product
price_html <- html_nodes(webpage, 'span#priceblock_ourprice')
price <- html_text(price_html)
# remove new lines (applied to price, and assigned back)
price <- str_replace_all(price, "[\r\n]", "")
# print price value
head(price)
Output:

Product description:
# scrape product description
desc_html <- html_nodes(webpage, 'div#productDescription')
desc <- html_text(desc_html)
# replace new lines and tabs, then trim surrounding spaces
desc <- str_replace_all(desc, "[\r\n\t]", "")
desc <- str_trim(desc)
head(desc)
Output:

Rating of the product:
# scrape product rating
rate_html <- html_nodes(webpage, 'span#acrPopover')
rate <- html_text(rate_html)
# remove new lines, then trim surrounding spaces
rate <- str_replace_all(rate, "[\r\n]", "")
rate <- str_trim(rate)
# print rating of the product
head(rate)
Output:

Size of the product:
# Scrape size of the product
size_html <- html_nodes(webpage, 'div#variation_size_name')
size_html <- html_nodes(size_html, 'span.selection')
size <- html_text(size_html)
# trim surrounding whitespace from text
size <- str_trim(size)
# Print product size
head(size)
Output:

Color of the product:
# Scrape product color
color_html <- html_nodes(webpage, 'div#variation_color_name')
color_html <- html_nodes(color_html, 'span.selection')
color <- html_text(color_html)
# trim surrounding whitespace from text
color <- str_trim(color)
# print product color
head(color)
Output:

Step 4: We have successfully extracted data from all the fields which can be used to compare the product information from another site.
Let’s compile and combine them to work out a dataframe and inspect its structure.
# Combining all the vectors to form a data frame
product_data <- data.frame(Title = title, Price = price,
                           Description = desc, Rating = rate,
                           Size = size, Color = color)
# Structure of the data frame
str(product_data)
Output:

In this output we can see all the scraped data in the data frame.
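One practical caveat: if a selector matches nothing, html_text() returns a zero-length vector and data.frame() stops with a "differing number of rows" error. A minimal sketch of one way to guard against that is shown here; the helper name `first_or_na` and the sample values are hypothetical, not part of the original tutorial.

```r
# Hypothetical helper: return the first scraped value, or NA when the
# selector matched nothing (a zero-length character vector)
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

# Stand-ins for scraped vectors; `rate` simulates a selector with no match
title <- "Example Phone"
price <- "199.00"
rate  <- character(0)

product_data <- data.frame(
  Title  = first_or_na(title),
  Price  = first_or_na(price),
  Rating = first_or_na(rate),
  stringsAsFactors = FALSE
)

str(product_data)
```

The missing field then shows up as NA in a one-row data frame instead of aborting the whole script.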
Step 5: Store data in JSON format:
Once the data is collected, we can carry out different tasks on it, such as comparing and analyzing it to arrive at business insights. We can also think of training machine learning models on this data.
We will store the data in JSON format for further processing.
Follow the given code and get the JSON result.
# Include 'jsonlite' library to convert in JSON form.
library(jsonlite)
# convert dataframe into JSON format
json_data <- toJSON(product_data)
# print output
cat(json_data)
In the code above, I have included jsonlite library for using the toJSON() function to convert the dataframe object into JSON form.
At the end of the process, we have stored data in JSON format and printed it.
It is possible to store data in a csv file also or in the database for further processing, if we wish.
Output:

Following this practical example, you can also extract the relevant data for the same product from https://www.oneplus.in/6 and compare it with Amazon to work out the fair value of the product. In the same way, you can compare the data with other websites.
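The option of saving the results to a file, mentioned earlier, can be sketched as follows. This assumes the jsonlite package is installed; the sample data frame and the temporary file paths are placeholders for illustration only.

```r
library(jsonlite)

# Stand-in for the scraped data frame (values are made up)
product_data <- data.frame(
  Title = "Example Phone",
  Price = "199.00",
  stringsAsFactors = FALSE
)

# Write to temporary files so the sketch leaves nothing behind
json_path <- tempfile(fileext = ".json")
csv_path  <- tempfile(fileext = ".csv")

write_json(product_data, json_path, pretty = TRUE)
write.csv(product_data, csv_path, row.names = FALSE)

# Read both files back to confirm the round trip
from_json <- fromJSON(json_path)
from_csv  <- read.csv(csv_path, colClasses = "character")
```

Reading the CSV back with `colClasses = "character"` keeps the price as the scraped text rather than letting R convert it to a number.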
4. End note
As you can see, R can give you great leverage in scraping data from different websites. With this practical illustration of how R can be used, you can now explore it on your own and extract product data from Amazon or any other e-commerce website.
A word of caution for you: certain websites have anti-scraping policies. If you overdo it, you will be blocked and you will begin to see captchas instead of product details. Of course, you can also learn to work your way around the captchas using different services available. However, you do need to understand the legality of scraping data and whatever you are doing with the scraped data.
Feel free to send to me your feedback and suggestions regarding this post!
 


 Finding Correlations in Non-Linear Data 
freeCodeCamp — Mon, 29 Jan 2018 21:26:00 +0000
 By Peter Gleeson
From a signalling perspective, the world is a noisy place. In order to make sense of anything, we have to be selective with our attention.
We humans have, over the course of millions of years of natural selection, become fairly good at filtering out background signals. We learn to associate particular signals with certain events.
For instance, imagine you’re playing table tennis in a busy office.
To return your opponent’s shot, you need to make a huge array of complex calculations and judgements, taking into account multiple competing sensory signals.
To predict the motion of the ball, your brain has to repeatedly sample the ball’s current position and estimate its future trajectory. More advanced players will also take into account any spin their opponent applied to the shot.
Finally, in order to play your own shot, you need to account for the position of your opponent, your own position, the speed of the ball, and any spin you intend to apply.
All of this involves an amazing amount of subconscious differential calculus. We take it for granted that, generally speaking, our nervous system can do this automatically (at least after a bit of practice).
Just as impressive is how the human brain differentially assigns importance to each of the myriad competing signals it receives. The position of the ball, for example, is judged to be more relevant than, say, the conversation taking place behind you, or the door opening in front of you.
This may sound so obvious as to seem unworthy of stating, but that is testament to just how good we are at learning to make accurate predictions out of noisy data.
Certainly, a blank-slate machine given a continuous stream of audiovisual data would face a difficult task knowing which signals best predict the optimal course of action.
Luckily, there are statistical and computational methods that can be used to identify patterns in noisy, complex data.
Correlation 101
Generally speaking, when we talk of ‘correlation’ between two variables, we are referring to their ‘relatedness’ in some sense.
Correlated variables are those which contain information about each other. The stronger the correlation, the more one variable tells us about the other.

You may have seen it all before: Positive correlation, zero correlation, negative correlation
You may well already have some understanding of correlation, how it works and what its limitations are. Indeed, it’s something of a data science cliche:

“Correlation does not imply causation”

This is of course true — there are good reasons why even a strong correlation between two variables is not a guarantor of causality. The observed correlation could be due to the effects of a hidden third variable, or just entirely down to chance.
That said, correlation does allow for predictions about one variable to be made based upon another. There are several methods that can be used to estimate correlated-ness for both linear and non-linear data. Let’s take a look at how they work.
We’ll go through the math and the code implementation, using Python and R. The code for the examples in this article can be found here.
Pearson’s Correlation Coefficient
What is it?
Pearson’s Correlation Coefficient (PCC, or Pearson’s r) is a widely used linear correlation measure. It’s often the first one taught in many elementary stats courses. Mathematically speaking, it is defined as “the covariance between two vectors, normalized by the product of their standard deviations”.
Tell me more…
The covariance between two paired vectors is a measure of their tendency to vary above or below their means together. That is, a measure of whether each pair tend to be on similar or opposite sides of their respective means.

Let’s see this implemented in Python:
def mean(x):
    return sum(x)/len(x)

def covariance(x,y):
    calc = []
    for i in range(len(x)):
        xi = x[i] - mean(x)
        yi = y[i] - mean(y)
        calc.append(xi * yi)
    return sum(calc)/(len(x) - 1)

a = [1,2,3,4,5] ; b = [5,4,3,2,1]
print(covariance(a,b))

The covariance is calculated by taking each pair of values, and subtracting their respective means from them. Then, multiply these two differences together.

If they are both above their mean (or both below), then this will produce a positive number, because a positive×positive=positive, and likewise a negative×negative=positive.
If they are on different sides of their means, then this produces a negative number (because positive×negative=negative).

Once we have all these values calculated for each pair, sum them up, and divide by n-1, where n is the sample size. This is the sample covariance.
If the pairs have a tendency to both be on the same side of their respective means, the covariance will be a positive number. If they have a tendency to be on opposite sides of their means, the covariance will be a negative number. The stronger this tendency, the larger the absolute value of the covariance.
If there is no overall pattern, then the covariance will be close to zero. This is because the positive and negative values will cancel each other out.
At first, it might appear that the covariance is a sufficient measure of ‘relatedness’ between two variables. However, take a look at the graph below:

Covariance = 0.00003. From a [question posted recently on stackexchange](https://stats.stackexchange.com/questions/320001/why-does-this-set-of-data-have-no-covariance)
Looks like there’s a strong relationship between the variables, right? So why is the covariance so low, at approximately 0.00003?
The key here is to realise that the covariance is scale-dependent. Look at the x and y axes — pretty much all the data points fall between the range of 0.015 and 0.04. The covariance is likewise going to be close to zero, since it is calculated by subtracting the means from each individual observation.
To obtain a more meaningful figure, it is important to normalize the covariance. This is done by dividing it by the product of the standard deviations of each of the vectors.

The Greek letter rho is often used to denote Pearson’s r
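Written out, the normalization described above looks like the following. The symbols are the conventional ones rather than a transcription of the original figure: the first line is the population form using rho, and the second is the sample estimate r for n paired observations.

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
\qquad
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Note that the n-1 terms from the sample covariance and the sample standard deviations cancel, which is why they do not appear in the second expression.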
In Python:
import math

def stDev(x):
    # Sample standard deviation (n - 1 denominator, matching covariance() above)
    variance = 0
    for i in x:
        variance += (i - mean(x)) ** 2 / (len(x) - 1)
    return math.sqrt(variance)

def Pearsons(x,y):
    cov = covariance(x,y)
    return cov / (stDev(x) * stDev(y))

This works because the standard deviation of a vector is the square root of its variance. This means if two vectors are identical, then multiplying their standard deviations will equal their variance.
Funnily enough, the covariance of two identical vectors is also equal to their variance.

Therefore, the maximum value the covariance between two vectors can take is equal to the product of their standard deviations, which occurs when the vectors are perfectly correlated. It is this which bounds the correlation coefficient between -1 and +1.
Which way do the arrows point?
As an aside, a much cooler way of defining the PCC of two vectors comes from linear algebra.
First, we center the vectors, by subtracting their means from their individual values.
a = [1,2,3,4,5] ; b = [5,4,3,2,1]

a_centered = [i - mean(a) for i in a]
b_centered = [j - mean(b) for j in b]

Now, we can make use of the fact that vectors can be considered as ‘arrows’ pointing in a given direction.
For instance, in 2-D, the vector [1,3] could be represented as an arrow pointing 1 unit along the x-axis, and 3 units along the y-axis. Likewise, the vector [2,1] could be represented as an arrow pointing 2 units along the x-axis, and 1 unit along the y-axis.

Two vectors (1,3) and (2,1) shown as arrows.
Similarly, we can represent our data vectors as arrows in an n-dimensional space (although don’t try visualising when n > 3…)
The angle ϴ between these arrows can be worked out using the dot product of the two vectors. This is defined as:

Or, in Python:
def dotProduct(x,y):
    calc = 0
    for i in range(len(x)):
        calc += x[i] * y[i]
    return calc

The dot product can also be defined as:

Where ||x|| is the magnitude (or ‘length’) of the vector x (think Pythagoras’ theorem), and ϴ is the angle between the arrow vectors.

As a Python function:
def magnitude(x):
    x_sq = [i ** 2 for i in x]
    return math.sqrt(sum(x_sq))

This lets us find cos(ϴ), by dividing the dot product by the product of the magnitudes of the two vectors.

def cosTheta(x,y):
    mag_x = magnitude(x)
    mag_y = magnitude(y)
    return dotProduct(x,y) / (mag_x * mag_y)

Now, if you know a little trigonometry, you may recall that the cosine function produces a graph that oscillates between +1 and -1.

[Source](http://cda.mrs.umn.edu/~mcquarrb/teachingarchive/Precalculus/Animations/SineCosineAnim.html)
The value of cos(ϴ) will vary depending on the angle between the two arrow vectors.

When the angle is zero (i.e., the vectors point in the exact same direction), cos(ϴ) will equal 1.
When the angle is 180° (the vectors point in exact opposite directions), then cos(ϴ) will equal -1.
When the angle is 90° (the vectors point in completely unrelated directions), then cos(ϴ) will equal zero.

This might look familiar — a measure between +1 and -1 that seems to describe the relatedness of two vectors? Isn’t that Pearson’s r?
Well — that is exactly what it is! By considering the data as arrow vectors in a high-dimensional space, we can use the angle ϴ between them as a measure of similarity.

A) Positively correlated vectors; B) Negatively correlated vectors; C) Uncorrelated vectors
The cosine of this angle ϴ is mathematically identical to Pearson’s Correlation Coefficient.
When viewed as high-dimensional arrows, positively correlated vectors will point in a similar direction.
Negatively correlated vectors will point towards opposite directions.
And uncorrelated vectors will point at right-angles to one another.
Personally, I think this is a really intuitive way to make sense of correlation.
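To see the equivalence numerically, here is a small self-contained sketch. The function names (pearson, cos_theta_centered) and the third vector c are my own illustrations, not part of the original code. It computes Pearson's r the "statistics" way and the cosine of the angle between the centered vectors the "linear algebra" way, and the two should agree up to floating-point rounding.

```python
import math

def mean(x):
    return sum(x) / len(x)

def pearson(x, y):
    # Sample covariance divided by the product of sample standard deviations
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sd_x * sd_y)

def cos_theta_centered(x, y):
    # Center both vectors, then divide their dot product by the product of magnitudes
    mx, my = mean(x), mean(y)
    xc = [xi - mx for xi in x]
    yc = [yi - my for yi in y]
    dot = sum(p * q for p, q in zip(xc, yc))
    mag_x = math.sqrt(sum(p * p for p in xc))
    mag_y = math.sqrt(sum(q * q for q in yc))
    return dot / (mag_x * mag_y)

a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]   # perfectly negatively correlated with a
c = [2, 1, 4, 3, 5]   # hypothetical, partly shuffled version of a

print(pearson(a, b), cos_theta_centered(a, b))  # both approximately -1
print(pearson(a, c), cos_theta_centered(a, c))  # both approximately 0.8
```

The centering step matters: without subtracting the means first, the cosine measures the angle between the raw vectors, which is a different quantity (often called cosine similarity).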
Statistical significance?
As is always the case with frequentist statistics, it is important to ask how significant a test statistic calculated from a given sample actually is. Pearson’s r is no exception.
Unfortunately, whacking confidence intervals on an estimate of PCC is not entirely straightforward.
This is because Pearson’s r is bounded between -1 and +1, and therefore isn’t normally distributed. An estimated PCC of, say, +0.95 has only so much room for error above it, but plenty of room below.
Luckily, there is a solution — using a trick called Fisher’s Z-transform:

Calculate an estimate of Pearson’s r as usual.
Transform r→z using Fisher’s Z-transform. This can be done by using the formula z = arctanh(r), where arctanh is the inverse hyperbolic tangent function.
Now calculate the standard deviation of z. Luckily, this is straightforward to calculate, and is given by SDz = 1/sqrt(n-3), where n is the sample size.
Choose your significance threshold, alpha, and check how many standard deviations from the mean this corresponds to. If we take alpha = 0.05 (a 95% confidence level), use 1.96.
Find the upper estimate by calculating z +(1.96 × SDz), and the lower bound by calculating z - (1.96 × SDz).
Convert these back to r, using r = tanh(z), where tanh is the hyperbolic tangent function.
If the upper and lower bounds are both the same side of zero, you have statistical significance!

Here’s a Python implementation:
r = Pearsons(x,y)
z = math.atanh(r)
SD_z = 1 / math.sqrt(len(x) - 3)
z_upper = z + 1.96 * SD_z
z_lower = z - 1.96 * SD_z
r_upper = math.tanh(z_upper)
r_lower = math.tanh(z_lower)
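The snippet above assumes x and y already exist. As a runnable sketch, the same steps can be wrapped in a function and applied to some sample data. The names pearson and fisher_interval and the two lists below are illustrative assumptions, not taken from the article's repository.

```python
import math

def pearson(x, y):
    # Pearson's r; the (n - 1) factors cancel, so plain sums are enough
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_interval(r, n, z_crit=1.96):
    # Transform r -> z, build the interval on the z scale, transform back to r
    z = math.atanh(r)
    sd_z = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * sd_z), math.tanh(z + z_crit * sd_z)

# Hypothetical paired observations
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r = pearson(x, y)
lower, upper = fisher_interval(r, len(x))
# Significant if the whole interval sits on one side of zero
significant = lower > 0 or upper < 0
print(round(r, 3), round(lower, 3), round(upper, 3), significant)
```

Notice that the interval is not symmetric around r: there is less room above a high positive estimate than below it, which is exactly the problem the transform is meant to handle.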

Of course, when given a large data set of many potentially correlated variables, it may be tempting to check every pairwise correlation. This is often referred to as ‘data dredging’ — scouring the data set for any apparent relationships between the variables.
If you do take this multiple comparison approach, you should use stricter significance thresholds to reduce your risk of discovering false positives (that is, finding unrelated variables which appear correlated purely by chance).
One method for doing this is to use the Bonferroni correction.
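As a rough sketch of what that means in practice: the Bonferroni correction divides alpha by the number of comparisons, and the critical value used in the confidence interval grows accordingly. The helper names below are my own, and the ten-variable scenario is made up for illustration.

```python
from statistics import NormalDist

def bonferroni_alpha(alpha, n_tests):
    # Each individual test must meet a threshold n_tests times stricter
    return alpha / n_tests

def two_sided_z(alpha):
    # Number of standard deviations matching a two-sided threshold alpha
    return NormalDist().inv_cdf(1 - alpha / 2)

n_variables = 10                                  # hypothetical data set
n_pairs = n_variables * (n_variables - 1) // 2    # every pairwise correlation
alpha_adjusted = bonferroni_alpha(0.05, n_pairs)
z_crit = two_sided_z(alpha_adjusted)

print(n_pairs, alpha_adjusted, z_crit)
```

The adjusted z_crit would then replace 1.96 in the Fisher interval above. Bonferroni is deliberately conservative, so it trades some power for protection against false positives.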
The small print
So far, so good. We’ve seen how Pearson’s r can be used to calculate the correlation coefficient between two variables, and how to assess the statistical significance of the result. Given an unseen set of data, it is possible to start mining for significant relationships between the variables.
However, there is a major catch — Pearson’s r only works for linear data.
Look at the graphs below. They clearly show what looks like a non-random relationship, but Pearson’s r is very close to zero.

The reason why is because the variables in these graphs have a non-linear relationship.
We can generally picture a relationship between two variables as a ‘cloud’ of points scattered either side of a line. The wider the scatter, the ‘noisier’ the data, and the weaker the relationship.
However, Pearson’s r compares each individual data point only with the overall means of the two variables. This means it can only consider straight lines. It’s not great at detecting any non-linear relationships.
In the graphs above, Pearson’s r doesn’t reveal there being much correlation to talk of.
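A noise-free parabola makes the point concretely. In the sketch below (my own example data, not the data behind the graphs), y is completely determined by x, yet Pearson's r comes out at essentially zero because the left and right halves of the curve cancel each other out.

```python
import math

def pearson(x, y):
    # Pearson's r; the (n - 1) factors cancel, so plain sums are enough
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = list(range(-10, 11))
y_line = [2 * a + 1 for a in x]      # perfectly linear in x
y_parabola = [a ** 2 for a in x]     # perfectly determined by x, but not linear

print(pearson(x, y_line))      # approximately 1
print(pearson(x, y_parabola))  # approximately 0
```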
Yet the relationship between these variables is still clearly non-random, and that makes them potentially useful predictors of each other. How can machines identify this? Luckily, there are different correlation measures available to us.
Let’s take a look at a couple of them.
Distance Correlation
What is it?
Distance correlation bears some resemblance to Pearson’s r, but is actually calculated using a rather different notion of covariance. The method works by replacing our everyday concepts of covariance and standard deviation (as defined above) with “distance” analogues.
Much like Pearson’s r, “distance correlation” is defined as the “distance covariance” normalized by the “distance standard deviation”.
Instead of assessing how two variables tend to co-vary in their distance from their respective means, distance correlation assesses how they tend to co-vary in terms of their distances from all other points.
This opens up the potential to better capture non-linear dependencies between variables.
The finer details…
Robert Brown was a Scottish botanist born in 1773. While studying plant pollen under his microscope, Brown noticed tiny organic particles jittering about at random on the surface of the water he was using.
Little could he have suspected a chance observation of his would lead to his name being immortalized as the (re-)discoverer of Brownian motion.
Even less could he have known that it would take nearly a century before Albert Einstein would provide an explanation for the phenomenon, and in doing so prove the existence of atoms, in the same year he published papers on E=mc² and special relativity and helped kick-start the field of quantum theory.
Brownian motion is a physical process whereby particles move about at random due to collisions with surrounding molecules.
The math behind this process can be generalized into a concept known as the Wiener process. Among other things, the Wiener process plays an important part in mathematical finance’s most famous model, Black-Scholes.
Interestingly, Brownian motion and the Wiener process turn out to be relevant to a non-linear correlation measure developed in the mid-2000s through the work of Gabor Szekely.

Let’s run through how this can be calculated for two vectors x and y, each of length N.

First, we form N×N distance matrices for each of the vectors. A distance matrix is exactly like a road distance chart in an atlas — the intersection of each row and column shows the distance between the corresponding cities. Here, the intersection between row i and column j gives the distance between the i-th and j-th elements of the vector.



Next, the matrices are “double-centered”. This means for each element, we subtract the mean of its row and the mean of its column. Then, we add the grand mean of the entire matrix.


The ‘hat’ symbols mean ‘double-centered’; the ‘bar’ symbols mean ‘mean’

With the two double-centered matrices, we can calculate the square of the distance covariance by taking the average of each element in X multiplied by its corresponding element in Y.



Now, we can use a similar approach to find the “distance variance”. Remember — the covariance of two identical vectors is equivalent to their variance. Therefore, the squared distance variance can be described as below:



Finally, we have all the pieces to calculate the distance correlation. Remember that the (distance) standard deviation is equal to the square-root of the (distance) variance.
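Collected in one place, the steps above can be written as follows. This is my own notation following the text (hats for double-centered, bars for means), and it is meant to match the R implementation rather than reproduce the original figures.

```latex
X_{ij} = |x_i - x_j|, \qquad Y_{ij} = |y_i - y_j|

\hat{X}_{ij} = X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot},
\qquad
\hat{Y}_{ij} = Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}

\operatorname{dCov}^2(x, y) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \hat{X}_{ij} \, \hat{Y}_{ij}

\operatorname{dVar}^2(x) = \operatorname{dCov}^2(x, x)

\operatorname{dCor}(x, y) = \frac{\operatorname{dCov}(x, y)}{\sqrt{\operatorname{dVar}(x) \, \operatorname{dVar}(y)}}
```

Here the bar with a dot in the second index is a row mean, the bar with a dot in the first index is a column mean, and the bar with two dots is the grand mean of the matrix.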


If you prefer to work through code instead of math notation (after all, there is a reason people tend to write software in one and not the other…), then check out the R implementation below:
set.seed(1234)

doubleCenter <- function(x){
  centered <- x
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      centered[i,j] <- x[i,j] - mean(x[i,]) - mean(x[,j]) + mean(x)
      }
    }
  return(centered)
}

distanceCovariance <- function(x,y){
  N <- length(x)
  distX <- as.matrix(dist(x))
  distY <- as.matrix(dist(y))
  centeredX <- doubleCenter(distX)
  centeredY <- doubleCenter(distY)
  calc <- sum(centeredX * centeredY)
  return(sqrt(calc/(N^2)))
 }

distanceVariance <- function(x){
  return(distanceCovariance(x,x))
}
distanceCorrelation <- function(x,y){
  cov <- distanceCovariance(x,y)
  sd <- sqrt(distanceVariance(x)*distanceVariance(y))
  return(cov/sd)
}

# Compare with Pearson's r
x <- -10:10
y <- x^2 + rnorm(21,0,10)
cor(x,y) # --> 0.057
distanceCorrelation(x,y) # --> 0.509

The distance correlation between any two variables is bound between zero and one. Zero implies the variables are independent, whereas a score closer to one indicates a dependent relationship.
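Since the rest of the article mixes Python and R, here is a plain-Python sketch of the same calculation, assuming one-dimensional inputs and using absolute differences as the distances. The function names are my own. It is written for readability rather than speed, so expect it to be slow on large samples.

```python
import math

def distance_matrix(v):
    # Pairwise absolute differences between the elements of a 1-D vector
    return [[abs(a - b) for b in v] for a in v]

def double_center(m):
    # Subtract row and column means, then add back the grand mean
    n = len(m)
    row_means = [sum(row) / n for row in m]
    col_means = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [[m[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_covariance(x, y):
    n = len(x)
    a = double_center(distance_matrix(x))
    b = double_center(distance_matrix(y))
    total = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(total, 0.0) / n ** 2)

def distance_correlation(x, y):
    dcov = distance_covariance(x, y)
    dvar_x = distance_covariance(x, x)
    dvar_y = distance_covariance(y, y)
    return dcov / math.sqrt(dvar_x * dvar_y)

x = list(range(-10, 11))
y_line = [2 * a + 1 for a in x]
y_parabola = [a ** 2 for a in x]

print(distance_correlation(x, y_line))      # approximately 1
print(distance_correlation(x, y_parabola))  # clearly above zero, unlike Pearson's r
```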
If you’d rather not write your own distance correlation methods from scratch, you can install R’s energy package, written by the very researchers who devised the method. The methods available in this package call functions written in C, giving a great speed advantage.
Physical interpretation
One of the more surprising results relating to the formulation of distance correlation is that it bears an exact equivalence to Brownian correlation.
Brownian correlation refers to the independence (or dependence) of two Brownian processes. Brownian processes that are dependent will show a tendency to ‘follow’ each other.
A simple metaphor to help grasp the concept of distance correlation is to picture a fleet of paper boats floating on the surface of a lake.
If there is no prevailing wind direction, then each boat will drift about at random — in a way that’s (kind of) analogous to Brownian motion.

Boats drifting under no prevailing wind
If there is a prevailing wind, then the direction the boats drift in will be dependent upon the strength of the wind. The stronger the wind, the stronger the dependence.

Under a prevailing wind, the boats will tend to drift in the same direction
In a comparable way, uncorrelated variables can be thought of as boats drifting without a prevailing wind. Correlated variables can be thought of as boats drifting under the influence of a prevailing wind. In this metaphor, the wind represents the strength of the relationship between the two variables.
If we allow the prevailing wind direction to vary at different points on the lake, then we can bring a notion of non-linearity into the analogy. Distance correlation uses the distances between the ‘boats’ to infer the strength of the prevailing wind.

Confidence Intervals?
Confidence intervals can be established for a distance correlation estimate using a ‘resampling’ technique. A simple example is bootstrap resampling.
This is a neat statistical trick that requires us to ‘reconstruct’ the data by randomly sampling (with replacement) from the original data set. This is repeated many times (e.g., 1000), and each time the statistic of interest is calculated.
This will produce a range of different estimates for the statistic we’re interested in. We can use these to estimate the upper and lower bounds for a given level of confidence.
Check out the R code below for a simple bootstrap function:
set.seed(1234)

bootstrap <- function(x,y,reps,alpha){
  estimates <- c()
  original <- data.frame(x,y)
  N <- dim(original)[1]
  for(i in 1:reps){
    S <- original[sample(1:N, N, replace = TRUE),]
    estimates <- append(estimates, distanceCorrelation(S$x, S$y))
  }
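  # Basic bootstrap interval: reflect the upper and lower quantiles around the observed estimate
  u <- 1 - alpha/2 ; l <- alpha/2
  interval <- quantile(estimates, c(u, l))
  return(2*(distanceCorrelation(x,y)) - as.numeric(interval[1:2]))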
}

# Use with 1000 reps and threshold alpha = 0.05
x <- -10:10
y <- x^2 + rnorm(21,0,10)
bootstrap(x,y,1000,0.05) # --> 0.237 to 0.546

If you want to establish statistical significance, there is another resampling trick available, called a ‘permutation test’.
This is slightly different to the bootstrap method defined above. Here, we keep one vector constant and ‘shuffle’ the other by resampling. This approximates the null hypothesis — that there is no dependency between the variables.
The ‘shuffled’ variable is then used to calculate the distance correlation between it and the constant variable. This is done many times, and the distribution of outcomes is compared against the actual distance correlation (obtained from the unshuffled data).
The proportion of ‘shuffled’ outcomes greater than or equal to the ‘real’ outcome is then taken as a p-value, which can be compared to a given significance threshold (e.g., 0.05).
Check out the code to see how this works:
permutationTest <- function(x,y,reps){
  estimates <- c()
  observed <- distanceCorrelation(x,y)
  N <- length(x)
  for(i in 1:reps){
    y_i <- sample(y, length(y), replace = FALSE)  # shuffle y without replacement
    estimates <- append(estimates, distanceCorrelation(x, y_i))
  }
  p_value <- mean(estimates >= observed)
  return(p_value)
}

# Use with 1000 reps
x <- -10:10
y <- x^2 + rnorm(21,0,10)
permutationTest(x,y,1000) # --> 0.036

Maximal Information Coefficient
What is it?
The Maximal Information Coefficient (MIC) is a recent method for detecting non-linear dependencies between variables, devised in 2011. The algorithm used to calculate MIC applies concepts from information theory and probability to continuous data.
Diving in…
Information theory is a fascinating field within mathematics that was pioneered by Claude Shannon in the mid-twentieth century.
A key concept is entropy — a measure of the uncertainty in a given probability distribution. A probability distribution describes the probabilities of a given set of outcomes associated with a particular event.

Entropy of a probability distribution is minus “the sum of the probability of each outcome, multiplied by the logarithm of itself”
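In symbols, and assuming base-2 logarithms as in the R code that follows (so that entropy is measured in bits), the definition reads:

```latex
H(X) = -\sum_{i} p(x_i) \, \log_2 p(x_i)
```

Outcomes with zero probability are left out of the sum, since they contribute nothing.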
To understand how this works, compare the two probability distributions below:

Possible outcomes are on the X-axis; their respective probabilities are on the Y-axis
On the left is that of a fair six-sided dice, and on the right is the distribution of a not-so-fair six-sided dice.
Intuitively, which would you expect to have the higher entropy? For which dice is the outcome the least certain? Let’s calculate the entropy and see what the answer turns out to be.
entropy <- function(x){
  pr <- prop.table(table(x))
  H <- sum(pr * log(pr,2))
  return(-H)
}

dice1 <- 1:6
dice2 <- c(1,1,1,1,2:6)
entropy(dice1) # --> 2.585
entropy(dice2) # --> 2.281

As you may have expected, the fairer dice has the higher entropy.
That is because each outcome is as likely as any other, so we cannot know in advance which to favour.
The unfair dice gives us more information — some outcomes are much more likely than others — so there is less uncertainty about the outcome.
By that reasoning, we can see that entropy will be highest when each outcome is equally likely. This type of probability distribution is called a ‘uniform’ distribution.
Cross-entropy is an extension to the concept of entropy, that takes into account a second probability distribution.

crossEntropy <- function(x,y){
  prX <- prop.table(table(x))
  prY <- prop.table(table(y))
  H <- sum(prX * log(prY,2))
  return(-H)
}

This has the property that the cross-entropy between two identical probability distributions is equal to their individual entropy. When considering two non-identical probability distributions, there will be a difference between their cross-entropy and their individual entropies.
This difference, or ‘divergence’, can be quantified by calculating their Kullback-Leibler divergence, or KL-divergence.
The KL-divergence of two probability distributions X and Y is:

KL-divergence of probability distributions X and Y equals their cross-entropy, minus the entropy of X
The minimum value of the KL-divergence between two distributions is zero. This only happens when the distributions are identical.
KL_divergence <- function(x,y){
  kl <- crossEntropy(x,y) - entropy(x)
  return(kl)
}
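For a quick numerical check of these properties, here is a Python sketch that works directly on probability tables (dictionaries mapping outcome to probability) rather than on raw samples. The function names are my own, and the "loaded" die reuses the proportions from dice2 above. It assumes the second distribution assigns non-zero probability to every outcome the first one can produce.

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    # Assumes q[k] > 0 wherever p[k] > 0
    return -sum(px * math.log2(q[k]) for k, px in p.items() if px > 0)

def kl_divergence(p, q):
    # Cross-entropy of p relative to q, minus the entropy of p
    return cross_entropy(p, q) - entropy(p)

fair = {face: 1 / 6 for face in range(1, 7)}
loaded = {1: 4 / 9, 2: 1 / 9, 3: 1 / 9, 4: 1 / 9, 5: 1 / 9, 6: 1 / 9}

print(kl_divergence(fair, fair))    # approximately 0: identical distributions
print(kl_divergence(loaded, fair))  # positive: the distributions differ
print(kl_divergence(fair, loaded))  # also positive, but a different value
```

The last two lines show that KL-divergence is not symmetric, so the order of the arguments matters.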

One use for KL-divergence in the context of discovering correlations is to calculate the Mutual Information (MI) of two variables.
Mutual Information can be defined as “the KL-divergence between the joint distribution of two random variables and the product of their marginal distributions”. If these are identical, MI will equal zero. If they are at all different, then MI will be a positive number. The more the joint distribution differs from the product of the marginals, the higher the MI.
To understand this better, let’s take a moment to revisit some probability concepts.
The joint distribution of variables X and Y is simply the probability of them co-occurring. For instance, if you flipped two coins X and Y, their joint distribution would reflect the probability of each observed outcome. Say you flip the coins 100 times, and get the result “heads, heads” 40 times. The joint distribution would reflect this.
P(X=H, Y=H) = 40/100 = 0.4
jointDist <- function(x,y){
  N <- length(x)
  u <- unique(append(x,y))
  joint <- c()
  for(i in u){
    for(j in u){
      f <- x[paste0(x,y) == paste0(i,j)]
      joint <- append(joint, length(f)/N)
    }
  }
  return(joint)
}

The marginal distribution is the probability distribution of one variable in the absence of any information about the other. The product of two marginal distributions gives the probability of two events’ co-occurrence under the assumption of independence. 
For the coin flipping example, say both coins produce 50 heads and 50 tails. Their marginal distributions would reflect this.
P(X=H) = 50/100 = 0.5 ; P(Y=H) = 50/100 = 0.5
P(X=H) × P(Y=H) = 0.5 × 0.5 = 0.25
marginalProduct <- function(x,y){
  N <- length(x)
  u <- unique(append(x,y))
  marginal <- c()
  for(i in u){
    for(j in u){
      fX <- length(x[x == i]) / N
      fY <- length(y[y == j]) / N 
      marginal <- append(marginal, fX * fY)
    }
  }
  return(marginal)
}

Returning to the coin flipping example, the product of the marginal distributions will give the probability of observing each outcome if the two coins are independent, while the joint distribution will give the probability of each outcome, as actually observed.
If the coins genuinely are independent, then the joint distribution should be (approximately) identical to the product of the marginal distributions. If they are in some way dependent, then there will be a divergence.
In the example, P(X=H,Y=H) > P(X=H) × P(Y=H). This suggests the coins both land on heads more often than would be expected by chance.
The bigger the divergence between the joint and marginal product distributions, the more likely it is the events are dependent in some way. The measure of this divergence is defined by the Mutual Information of the two variables.

The Mutual Information of X and Y equals “the KL divergence of their joint distribution, and the product of their marginal distributions”
mutualInfo <- function(x,y){
  joint <- jointDist(x,y)
  marginal <- marginalProduct(x,y)
  Hjm <- - sum(joint[marginal > 0] * log(marginal[marginal > 0],2))
  Hj <- - sum(joint[joint > 0] * log(joint[joint > 0],2))
  return(Hjm - Hj)
}
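To put a number on the coin example, here is a compact Python sketch. The text only fixes "heads, heads" at 40 out of 100 and the marginals at 50/50, so the remaining counts below (10 "heads, tails", 10 "tails, heads", 40 "tails, tails") are my own assumption, chosen to be consistent with those figures.

```python
import math
from collections import Counter

def mutual_information(x, y):
    # Sum over observed pairs of p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (xi, yi), count in joint.items():
        p_joint = count / n
        p_independent = (count_x[xi] / n) * (count_y[yi] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi

# Dependent coins: 40 HH, 10 HT, 10 TH, 40 TT (assumed counts)
coin_x = ["H"] * 50 + ["T"] * 50
coin_y = ["H"] * 40 + ["T"] * 10 + ["H"] * 10 + ["T"] * 40

# Independent coins: every combination occurs equally often
indep_x = ["H", "H", "T", "T"] * 25
indep_y = ["H", "T", "H", "T"] * 25

print(mutual_information(coin_x, coin_y))    # about 0.278 bits
print(mutual_information(indep_x, indep_y))  # 0.0
```

This is the same quantity as the cross-entropy minus entropy form used in mutualInfo, just rearranged into a single sum.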

A major assumption here is that we are working with discrete probability distributions. How can we apply these concepts to continuous data?
Binning
One approach is to quantize the data (make the variables discrete). This is achieved by binning (assigning data points to discrete categories).
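As a minimal sketch of equal-width binning, the function below splits the range of a variable into n_bins intervals of the same width and labels each value with the number of the interval it falls into. The function name and the sample values are made up for illustration. The edge handling is also simpler than R's cut(), which is used later, so results on boundary values may differ.

```python
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # Index of the interval containing v; the maximum goes in the last bin
        index = int((v - lo) / width) if width > 0 else 0
        labels.append(min(index, n_bins - 1) + 1)
    return labels

# Hypothetical continuous measurements
heights = [150.0, 152.5, 161.0, 168.2, 175.0, 181.3, 190.0]
print(equal_width_bins(heights, 4))  # [1, 1, 2, 2, 3, 4, 4]
```

Once both variables are labeled this way, they can be treated as discrete and passed to a mutual information function.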

The key issue now is deciding how many bins to use. Luckily, the original paper on the Maximal Information Coefficient provides a suggestion: try most of them!
That is to say, try differing numbers of bins and see which produces the greatest result of Mutual Information between the variables. This raises two challenges, though:

How many bins to try? Technically, you could quantize a variable into any number of bins, simply by making the bin size ever smaller.
Mutual Information is sensitive to the number of bins used. How do you fairly compare MI between different numbers of bins?

The first challenge means it is technically impossible to try every possible number of bins. However, the authors of the paper offer a heuristic solution (that is, a solution which is not ‘guaranteed perfect’, but is a pretty good approximation). They also suggest an upper limit on the number of bins to try.

The maximum number of bins to try is determined by the sample size, N
As for fairly comparing MI values between different binning schemes, there’s a simple fix… normalize it! This can be done by dividing each MI score by the maximum it could theoretically take for that particular combination of bins.
The binning combination that produces the highest normalized MI overall is the one to use.

Mutual Information can be normalized by dividing by the logarithm of the smallest number of bins
The highest normalized MI is then reported as the Maximal Information Coefficient (or ‘MIC’) for those two variables. Let’s check out some code that will estimate the MIC of two continuous variables.
MIC <- function(x,y){
  N <- length(x)
  maxBins <- ceiling(N ** 0.6)
  MI <- c()
  for(i in 2:maxBins) {
    for (j in 2:maxBins){
      if(i * j > maxBins){
        next
      }
      Xbins <- i; Ybins <- j
      binnedX <-cut(x, breaks=Xbins, labels = 1:Xbins)
      binnedY <-cut(y, breaks=Ybins, labels = 1:Ybins)
      MI_estimate <- mutualInfo(binnedX,binnedY)
      MI_normalized <- MI_estimate / log(min(Xbins,Ybins),2)
      MI <- append(MI, MI_normalized)
    }
  }
  return(max(MI))
}

x <- runif(100,-10,10)
y <- x**2 + rnorm(100,0,10)
MIC(x,y) # --> 0.751

The above code is a simplification of the method outlined in the original paper. A more faithful implementation of the algorithm is available in the R package minerva. In Python, you can use the minepy module.
MIC is capable of picking out all kinds of linear and non-linear relationships, and has found use in a range of different applications. It is bounded between 0 and 1, with higher values indicating greater dependence.
Confidence Intervals?
To establish confidence bounds on an estimate of MIC, you can simply use a bootstrapping technique like the one we looked at earlier.
To generalize the bootstrap function, we can take advantage of R’s functional programming capabilities, by passing the technique we want to use as an argument.
bootstrap <- function(x,y,func,reps,alpha){
  estimates <- c()
  original <- data.frame(x,y)
  N <- dim(original)[1]
  for(i in 1:reps){
    S <- original[sample(1:N, N, replace = TRUE),]
    estimates <- append(estimates, func(S$x, S$y))
  }
  l <- alpha/2 ; u <- 1 - l
  interval <- quantile(estimates, c(u, l))
  return(2*(func(x,y)) - as.numeric(interval[1:2]))
}

bootstrap(x,y,MIC,100,0.05) # --> 0.594 to 0.88

Summary
To conclude this tour of correlation, let’s test each different method against a range of artificially generated data. The code for these examples can be found here.
Noise

set.seed(123)

# Noise
x0 <- rnorm(100,0,1)
y0 <- rnorm(100,0,1)
plot(y0~x0, pch = 18)

cor(x0,y0)
distanceCorrelation(x0,y0)
MIC(x0,y0)


Pearson’s r = -0.05
Distance Correlation = 0.157
MIC = 0.097

Simple linear

# Simple linear relationship
x1 <- -20:20
y1 <- x1 + rnorm(41,0,4)
plot(y1~x1, pch =18)

cor(x1,y1)
distanceCorrelation(x1,y1)
MIC(x1,y1)


Pearson’s r = +0.95
Distance Correlation = 0.95
MIC = 0.89

Simple quadratic

# y ~ x**2
x2 <- -20:20
y2 <- x2**2 + rnorm(41,0,40)
plot(y2~x2, pch = 18)

cor(x2,y2)
distanceCorrelation(x2,y2)
MIC(x2,y2)


Pearson’s r = +0.003
Distance Correlation = 0.474
MIC = 0.594

Trigonometric

# Cosine
x3 <- -20:20
y3 <- cos(x3/4) + rnorm(41,0,0.2)
plot(y3~x3, type='p', pch=18)

cor(x3,y3)
distanceCorrelation(x3,y3)
MIC(x3,y3)


Pearson’s r = -0.035
Distance Correlation = 0.382
MIC = 0.484

Circle

# Circle

n <- 50
theta <- runif (n, 0, 2*pi)
x4 <- append(cos(theta), cos(theta))
y4 <- append(sin(theta), -sin(theta))
plot(x4,y4, pch=18)

cor(x4,y4)
distanceCorrelation(x4,y4)
MIC(x4,y4)


Pearson’s r < 0.001
Distance Correlation = 0.234
MIC = 0.218

Thanks for reading!
 


 How you can use linear regression models to predict quadratic, root, and polynomial functions 
freeCodeCamp — Wed, 11 Oct 2017 16:00:41 +0000
 By Björn Hartmann
When reading articles about machine learning, I often suspect that authors misunderstand the term “linear model.” Many authors suggest that linear models can only be applied if data can be described with a line. But this is way too restrictive.
Linear models assume the functional form is linear — not the relationship between your variables.
I’ll show you how you can improve your linear regressions with quadratic, root, and exponential functions.

So what’s the functional form?
The functional form is the equation you want to estimate.
Let us start with an example and think about how we could describe salaries of data scientists. Suppose an average data scientist (i) receives an entry-level salary (entry_level_salary) plus a bonus for each year of his experience (experience_i).
Thus, his salary (salary_i) is given by the following functional form:
salary_i = entry_level_salary + beta_1 * experience_i
Now, we can interpret the coefficient beta_1 as the bonus for each year of experience. And with this coefficient we can start making predictions by just knowing the level of experience.
As your machine learning model estimates the coefficient beta_1, all you need to enter in R or any other software is:
model_1 <- lm(salary ~ entry_level_salary + experience)
Linearity in the functional form requires that we sum up each determinant on the right-hand side of the equation.
Imagine we are right with our assumptions. Each point indicates one data scientist with his level of experience and salary. Finally, the red line shows our predictions.
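To make this concrete, here is a small sketch on simulated data. The numbers (a 50,000 entry-level salary and a 3,000 bonus per year) are invented for illustration. Note that lm() adds an intercept automatically, so in this sketch the intercept plays the role of entry_level_salary and only experience appears in the formula.

```r
# Simulated data: the true values below are assumptions for illustration only.
set.seed(1)
experience <- runif(200, 0, 20)
salary <- 50000 + 3000 * experience + rnorm(200, 0, 5000)

# lm() estimates an intercept (entry-level salary) and a slope (beta_1).
model_1 <- lm(salary ~ experience)
coef(model_1)

# Predict the salary of a data scientist with 5 years of experience.
predicted <- predict(model_1, newdata = data.frame(experience = 5))
predicted
```

The estimated slope should land close to the simulated bonus of 3,000 per year, and the intercept close to 50,000.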

Many aspiring data scientists already run similar predictions. But often that is all they do with linear models…
How to estimate quadratic models?
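When we want to estimate a quadratic model, we cannot simply type in something like this:
model_2 <- lm(salary ~ entry_level_salary + experience^2)
>> This will not fit the quadratic model you intended
Inside an R formula, ^ is a formula operator rather than arithmetic, so experience^2 is read as plain experience. R does not raise an error here. It silently fits the linear model again. Other tools may throw an error message instead, because most modelling functions do not expect to transform your input variables. The model stays linear as long as the right-hand side is a sum of terms, so the squared value has to enter as a variable of its own.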
Note: You need to compute experience^2 before adding it to your model. Thus, you will run:
# First, compute the square values of experience
experience_2 <- experience^2
# Then add them into your regression
model_2 <- lm(salary ~ entry_level_salary + experience_2)
In return, you get a nice quadratic function.
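The sketch below, on simulated data with invented coefficients, shows what R does with each way of writing the model. It relies on two facts about R formulas: ^ is a formula operator, so experience^2 collapses to experience, and I() protects an arithmetic expression so it is evaluated as written. Unlike the snippet above, this sketch also keeps the linear term, and the intercept that lm() adds automatically stands in for entry_level_salary.

```r
# Simulated data: all coefficients here are assumptions for illustration.
set.seed(2)
experience <- runif(200, 0, 20)
salary <- 40000 + 6000 * experience - 150 * experience^2 + rnorm(200, 0, 3000)

# In a formula, experience^2 is interpreted as just `experience`.
fit_caret  <- lm(salary ~ experience^2)
fit_linear <- lm(salary ~ experience)
same_as_linear <- isTRUE(all.equal(unname(coef(fit_caret)),
                                   unname(coef(fit_linear))))
same_as_linear

# Option 1: precompute the squared variable.
experience_2 <- experience^2
fit_precomputed <- lm(salary ~ experience + experience_2)

# Option 2: wrap the arithmetic in I() inside the formula.
fit_inline <- lm(salary ~ experience + I(experience^2))

coef(fit_precomputed)
coef(fit_inline)
```

Both options give the same fit. Precomputing keeps the formula simple, while I() keeps the transformation next to the model so predict() can apply it to new data.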

Estimate root functions with linear models
Often we observe values that rise fast in the beginning and level off afterwards. Let us modify our example and estimate a typical learning curve.
In the beginning, a learning curve tends to be very steep, and it flattens after some years.
One function that features such a trend is the root function. So we use the square root of experience to capture this relationship:
# First, compute the square root values of experience
sqrt_experience <- sqrt(experience)
# Then add them into your regression
model_3 <- lm(knowledge ~ sqrt_experience)
Again, make sure you compute the square root before you add it to your model:

Or you might want to use the logarithmic function, as it describes a similar trend. But its values are negative between zero and one, and it is undefined at zero. So make sure this is not a problem for you and your data.
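As a rough comparison of the two transformations, the following sketch fits both to a simulated learning curve. The data are generated from a square-root relationship with made-up parameters, so the root model is expected to fit better here. With real data, either one could win.

```r
# Simulated learning curve: the true form and parameters are assumptions.
set.seed(3)
experience <- runif(200, 0.1, 20)
knowledge <- 10 + 20 * sqrt(experience) + rnorm(200, 0, 5)

# Root model: transform first, then fit.
sqrt_experience <- sqrt(experience)
model_sqrt <- lm(knowledge ~ sqrt_experience)

# Log model: same recipe with a different transformation.
log_experience <- log(experience)
model_log <- lm(knowledge ~ log_experience)

# Compare the share of variance each model explains.
r2_sqrt <- summary(model_sqrt)$r.squared
r2_log  <- summary(model_log)$r.squared
c(sqrt = r2_sqrt, log = r2_log)

# The log-transformed values are negative whenever experience is below 1.
range(log_experience)
```

The simulated experience values start at 0.1 rather than 0 because log(0) evaluates to -Inf in R, which lm() cannot use.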
Mastering linear models
Finally, you can even estimate polynomial functions with higher orders or exponential functions. All you need to do is to compute all variables before you add them into your linear model:
# First, compute polynomials
experience_2 <- experience^2
experience_3 <- experience^3
# Then add them into your regression
model_4 <- lm(salary ~ experience + experience_2 + experience_3)
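Here is a runnable sketch of the cubic model on simulated data with invented coefficients. It also shows one alternative that base R offers: poly() with raw = TRUE builds the same power terms inside the formula, so the two fits should agree.

```r
# Simulated data: the cubic coefficients are assumptions for illustration.
set.seed(4)
experience <- runif(200, 0, 10)
salary <- 30 + 8 * experience - 1.2 * experience^2 + 0.06 * experience^3 +
  rnorm(200, 0, 2)

# Precompute the powers, then fit.
experience_2 <- experience^2
experience_3 <- experience^3
model_4 <- lm(salary ~ experience + experience_2 + experience_3)

# Same model with the powers generated inside the formula.
model_4_poly <- lm(salary ~ poly(experience, 3, raw = TRUE))

coef(model_4)
coef(model_4_poly)
```

Without raw = TRUE, poly() uses orthogonal polynomials. The fitted values are the same, but the coefficients are no longer directly comparable to the precomputed version.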

Two cases where you should use other models
Although linear models can be applied to many cases, there are limitations. The most common ones fall into two categories:
1. Probabilities:
If you want to estimate the probability of an event, you should use a Probit or Logit model (or a Tobit model when the outcome is censored). A linear function can predict values below 0 or above 1, which are not valid probabilities. Which model you choose depends on the distribution you assume.
2. Count variables
Finally, when estimating a count variable you want to use a Poisson model. Count variables are variables that can only take non-negative integer values such as 0, 1, 2, 3.
Examples include the number of children, the number of purchases a customer makes, or the number of accidents in a region.
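In R, both cases can be handled with glm() from the base stats package by choosing a family. The sketch below uses simulated data. The variable names (purchased, exposure, accidents) and all parameter values are invented for illustration.

```r
set.seed(5)
n <- 300

# 1. Probabilities: a binary outcome modelled with a Logit model.
x <- runif(n, -4, 4)
purchased <- rbinom(n, size = 1, prob = plogis(1.5 * x))

linear_fit <- lm(purchased ~ x)
logit_fit  <- glm(purchased ~ x, family = binomial(link = "logit"))

# Compare predictions at the edges of the observed range.
edge <- data.frame(x = c(-4, 4))
linear_pred <- predict(linear_fit, newdata = edge)
logit_pred  <- predict(logit_fit, newdata = edge, type = "response")
linear_pred  # the linear fit leaves the [0, 1] range
logit_pred   # the Logit predictions stay between 0 and 1

# 2. Count variables: a Poisson model with a log link.
exposure <- runif(n, 0, 2)
accidents <- rpois(n, lambda = exp(0.5 + 0.8 * exposure))
poisson_fit <- glm(accidents ~ exposure, family = poisson(link = "log"))
coef(poisson_fit)
```

A Probit model uses the same call with binomial(link = "probit"). Tobit models are not part of base R and need an additional package.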
What to take away from this article
There are two things I want you to remember:

Improve your linear models and try quadratic, root or polynomial functions.
Always transform your data before you add them to your regression.

I uploaded the R code for all examples on GitHub. Feel free to download them, play with them, or share them with your friends and colleagues.

If you have any questions, write a comment below or contact me. I appreciate your feedback.
