webscraping - freeCodeCamp.org

Web Scraping With RSelenium (Chrome Driver) and Rvest

Elabonga Atuo — Mon, 17 Mar 2025 13:44:10 +0000

Web scraping lets you automatically extract data from websites, so you can store it in a structured format for later use.

In this article, you'll explore how to use popular R libraries for web scraping to extract data from a website. The target website displays different books across multiple pages, requiring navigation between them. You'll learn how to use RVest for data extraction and RSelenium to automate button clicks.

There are a couple of housekeeping rules when it comes to harvesting data on the internet:

Inspect the robots.txt file: Check the robots.txt file of a website to understand what data you are allowed to extract. You can find this file by appending “/robots.txt” to the website's home URL.
Review terms and conditions: Before scraping, read the website's terms and conditions to understand the legal expectations regarding data extraction.
Limit requests: Avoid overloading the server with requests by implementing rate limiting. The polite library in R can help manage request rates effectively.

Let’s dive in!

Project Overview
Project Setup
How to Understand and Inspect a Webpage
How to Extract Data Using RVest
How to Mimic Human Behaviour Using RSelenium
How to Combine RSelenium & RVest and Save to CSV
Bringing it All Together
Conclusion

Project Overview

Here’s what we’re going to be building:

This approach to web scraping allows you to see the browser in action as it navigates and extracts data from the website. Unlike headless browsing, where everything runs in the background without a visible interface, this method provides a graphical UI, making it easier to monitor and debug the process.

To practice your data mining skills, you will be scraping data from a website built specifically for that: Books To Scrape. You are going to be using a driver to drive a browser which will then open your target website. It’ll navigate from the first page, mimicking human behaviour (clicking the next button) while collecting data about the books, right to the last page.

Project Setup

Prerequisites:

To follow along with this tutorial, you will need:

R programming knowledge
HTML knowledge
R Studio installed

Note that I’m building this tutorial on a Windows machine.

Setup and Install Chrome Driver

First, you’ll want to check to make sure you have Java installed on your computer by running this terminal command:

java -version

If it’s not present, download and install Java here.

Next, install the Chrome browser if you don’t already have it. Once it’s installed, check for your browser version in the settings section.

Then you can download the Browser Driver that corresponds to your Browser Version here. Check where other browser drivers are stored on your device by running this in RStudio terminal:

# install and load wdman and binman packages
install.packages("wdman")
library(wdman)

install.packages("binman")
library(binman)

# check drivers already installed
binman::list_versions(appname = "chromedriver")

# check browser driver locations
wdman::selenium(retcommand = TRUE, check = FALSE)

Extract the driver “.exe“ and store it at the specified folder location. This is usually the following location:

"C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\chromedriver.exe"

Now, add the drivers to your system path by specifying the folder path excluding the application. Confirm installation by running the following terminal command.

# Chromedriver SYSTEMS PATH: "C:\Users\YourName\AppData\Local\binman\binman_chromedriver\win32\version\"
# check chromedriver installation
chromedriver -version

How to Understand and Inspect a Webpage

A webpage is a visual representation of an HTML document that is available on the internet and accessed through a web browser. The components of a webpage, called elements, are structured hierarchically in an HTML DOM (Document Object Model) tree. Each element can be located using specific paths called selectors or locators, which you can read more about here.

Developer Tools are a set of tools available in your browser. They’re helpful for inspecting and analyzing a webpage’s structure. The feature “Inspect“ helps examine the structure and styling of a specific element. You can access this feature by selecting the element you would like to inspect, right clicking on it, and clicking “Inspect”.

How to Extract Data Using RVest

RVest is an R package that contains a set of functions that enable you to extract data from HTML and XML web pages.

We are interested in extracting the following information about books from every page on the website’s catalogue:

Book Title
Book Rating
Book Price
Individual Book Link
Cover Image Link

Let’s go through the steps for using RVest to extract this data.

Step 1: Load the webpage

To load the first page of your target website and parse the HTML document using the RVest package in R, follow these steps:

Install and load the RVest package: If you haven't already installed the RVest package, you can do so by running the following command in R:
```
 install.packages("rvest")
```
Then, load the package:
```
 library(rvest)
```
Load the webpage and parse the HTML: Use the read_html() function from the RVest package to fetch and parse the HTML content of the webpage. Here's an example of how to do this:
```
 # Specify the URL of the target website
 url <- "https://books.toscrape.com/"

 # Fetch and parse the HTML content
 webpage <- read_html(url)
```

This code will download the HTML content of the specified webpage and convert it into an XML document, making it easier to structure and organize the data for further processing or storage.

Step 2: Identify the target elements

The target elements are the HTML elements that contain the specific data you intend to extract.

A quick inspection of the webpage using developer tools shows that each book’s information is contained in an article tag and forms part of an ordered list. It’s important to specify the

The pipe %>% operator facilitates chaining operations, making it easier to extract elements step by step. html_element() returns the first matching element while html_elements() returns all the elements that match the defined path.

# define the path from which other details will be extracted
# (webpage is the parsed document returned by read_html() in Step 1)
book <- webpage %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

# extracting details using css locators.
# title
title <- book %>% 
  html_element("h3 a") %>% 
  html_attr("title")

# rating
rating <- book %>% 
  html_element("p") %>% 
  html_attr("class")

# price
price <- book %>% 
  html_element(".product_price p") %>% 
  html_text2()

#link to book page
book_link <- book %>% 
  html_element("h3 a") %>% 
  html_attr("href")

# cover page image link
cover_page_link <- book %>% 
  html_element(".image_container a img") %>% 
  html_attr("src")

# inspect right format by selecting the first element of each detail
title[[1]]
rating[[1]]
price[[1]]
book_link[[1]]
cover_page_link[[1]]

Step 3: Clean the “rating” data

To clean the "star-rating" data, you can use the stringr package in R to remove the unnecessary text and trim any whitespace. Here's how you can do it:

library(stringr)

# Example of extracted rating data
rating_data <- "star-rating Three"

# Remove "star-rating " and trim whitespace
cleaned_rating <- str_trim(str_replace(rating_data, "star-rating ", ""))

# Output the cleaned rating
cleaned_rating

This code will output "Three", effectively removing the "star-rating" prefix and any leading or trailing whitespace.

How to Mimic Human Behaviour Using RSelenium

How Selenium Works

Selenium is a tool that allows you to simulate user actions on a website, usually for testing purposes. RSelenium is an R library that allows you to access this functionality.

We need a script, a browser, and browser driver to mimic user behaviour. The code you write that contains the instructions detailing the actions you would like to automate is the script. The browser driver acts as a bridge between your script and the browser and performs your desired actions by translating the script into actions.

The script, when run, is the client which requests and receives info from the browser driver’s server.

When you run a script, the script is converted to JSON format data which is then transferred to the browser driver via the JSON Wire Protocol. A protocol is simply a set of rules that define how data should be managed and handled during transfer across devices.

The driver receives and validates the received data. If successful, it communicates the actions defined in the script to the browser. If it’s unsuccessful, an error is sent to the client.

On browser initialization, the driver performs the actions step by step. This carries on to completion or until an error is encountered (missing elements, server errors, and so on). The bidirectional communication between the driver and browser is via HTTP. Finally, the results are sent back to the client and the browser is shut down.

# install and load RSelenium
install.packages("RSelenium")
library(RSelenium)

# initialize and run the chrome driver
rD <- rsDriver(browser = "chrome", port = 4567L)

# extract and assign the client
remDr <- rD[["client"]]

Running rsDriver() starts a Selenium server that launches ChromeDriver. Extract and assign the rD[["client"]] to a variable. This variable allows you to control and interact with the browser.

Sometimes, starting the driver may fail due to reasons such as permission restrictions, missing dependencies, or incorrect setup. If that happens, you can manually launch ChromeDriver by adding the following block of code right after loading the libraries at the top of the script. It is important to ensure the port numbers match.

cDrv <- chrome(verbose = FALSE, check = FALSE, port = 4567L)
cDrv$process

Now, navigate to the target webpage:

# navigate to the target site
remDr$navigate("https://books.toscrape.com/")

# maximize Chrome window size
remDr$maxWindowSize()

And scroll to the bottom of the page:

# scroll to the bottom of the page
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

The above code locates the body element and simulates pressing the End key, which scrolls to the bottom of the page.

Now, click Next to navigate to the next page:

# locate next button and click next
nextPage <-  remDr$findElement(using = "css selector",
                               value = ".next > a")
nextPage$clickElement()

This finds the element that contains the link to the next page and clicks it, which loads the next page in the browser.

Now we’re going to write a while loop that navigates through all the pages, up to page 50, and then closes the browser once it’s done.

A while loop executes a piece of code as long as a specific condition is met. Once the condition is not met, the loop exits.

i <- 1
while (i <= 3) {  # runs as long as the condition is TRUE
  print(i)        # do something
  i <- i + 1      # update the state so the loop can end
}

Write a loop that ensures the next page button is clicked as long as the element containing the link to the next page is visible in the HTML DOM.

First, locate the next button element. Its presence in the open webpage makes sure that the loop runs.

The last page does not have a next button, so the loop will exit when it reaches that page (and Selenium will throw an error due to the missing element).

nextPage <- remDr$findElement(using = "css selector", value = ".next > a")

Wrap the nextPage element search in a tryCatch() block. This prevents the script from crashing if the 'Next' button is missing. If an error occurs, tryCatch() returns NULL, signaling that there are no more pages to navigate.

An if block then checks for a NULL value. If encountered, a message is displayed to inform the client that no 'Next' button was found, and the break statement exits the loop.

Finally, close the browser once the driver navigates to the last page (page 50 in the catalogue) to free up system resources using remDr$close().


while (TRUE) {  
  # Try to find and click "Next" button
  nextPage <- tryCatch({
    remDr$findElement(using = "css selector", value = ".next > a")
  }, error = function(e) {
    return(NULL)  # No more pages
  })

  if (is.null(nextPage)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  nextPage$clickElement()
  Sys.sleep(3)  # Allow next page to load

}
print("finished scraping")
remDr$close()

How to Combine RSelenium & RVest and Save to CSV

Now that we’ve extracted data from specific HTML elements using RVest and automated user actions using RSelenium, let’s combine the two to scrape data from all the pages in the website.

Create a scrape books function

You will be saving the scraped books information in a CSV file. First, create an empty dataframe to hold the scraped data:

# install and load dplyr for dataframe manipulation
install.packages("dplyr")
library(dplyr)

# create a dataframe to hold book information
Books <-  data.frame()

Retrieve and parse the webpage

For RVest to work with RSelenium, you have to retrieve the HTML source of the page currently loaded in the Selenium-controlled browser. Use remDr$getPageSource()[[1]] to extract the HTML content.

page <- remDr$getPageSource()[[1]]

Convert the HTML content to XML using read_html() like this:

# define the path from which other details will be extracted
book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

Extract each book’s details using CSS selectors with rvest functions. The scraped objects returned are XML objects and lists. They need to be formatted to character strings, preventing unexpected data type issues when working with the data. Do this by piping as.character() at the very end of each extracted detail.

    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character()

Wrap the block of code used to extract details from HTML elements in a function and return a dataframe whose column values are the book details. This makes the code reusable and modular.


scrape_books <- function() {
    page <- remDr$getPageSource()[[1]]

    # define the path from which other details will be extracted
    book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

    # extracting details using css locators.
    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character() 

    # rating
    rating <- book %>% 
      html_element("p") %>% 
      html_attr("class") %>% 
      as.character() 

    cleaned_rating <- str_trim(gsub("star-rating", "", rating))

    # price
    price <- book %>% 
      html_element(".product_price p") %>% 
      html_text2() %>% 
      as.character() 

    #link to book page
    book_link <- book %>% 
      html_element("h3 a") %>% 
      html_attr("href") %>% 
      as.character() 

    # image link
    cover_page_link <- book %>% 
      html_element(".image_container a img") %>% 
      html_attr("src") %>% 
      as.character() 

    return(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = FALSE))
}

Write to CSV

Save the dataframe to a CSV file named “books.csv“:

write.csv(Books, file = "./books.csv", fileEncoding = "UTF-8")

Bringing it All Together

Let’s review what we’ve done so far: First, the script to scrape book data begins by loading the browser, maximizing the window size, and navigating to the Books To Scrape Page.

Then we created an empty dataframe to hold the scraped data. We then scraped the data from the first page, saved it to the dataframe, and located the ‘Next‘ button in order to navigate to the next page – from which we scraped data and stored it.

The process of scraping, adding to the dataframe, and clicking the next page button continues until the ‘Next’ button is no longer available in the HTML DOM.

Once the last page has been reached, the code exits the loop and saves the data to CSV. Finally, it closes the driver to free up system resources.

# load libraries
library(wdman)
library(binman)
library(rvest)
library(stringr)
library(RSelenium)
library(dplyr)


cDrv <- chrome(verbose = FALSE, check = FALSE, port = 4450L)
cDrv$process

rD <- rsDriver(browser = "chrome", port = 4450L)
remDr <- rD[["client"]]


remDr$navigate("https://books.toscrape.com/")
remDr$maxWindowSize()

# scroll to the bottom of the page
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))

# create an empty dataframe to hold the scraped book data
# (no "Next" click here: the loop below scrapes page 1 before moving on)
Books <- data.frame()

scrape_books <- function() {
    page <- remDr$getPageSource()[[1]]

    # define the path from which other details will be extracted
    book <- read_html(page) %>% html_element("ol") %>% html_elements("li") %>% html_element("article")

    # extracting details using css locators.
    # title
    title <- book %>% 
      html_element("h3 a") %>% 
      html_attr("title") %>% 
      as.character() 

    # rating
    rating <- book %>% 
      html_element("p") %>% 
      html_attr("class") %>% 
      as.character() 

    cleaned_rating <- str_trim(gsub("star-rating", "", rating))

    # price
    price <- book %>% 
      html_element(".product_price p") %>% 
      html_text2() %>% 
      as.character() 

    #link to book page
    book_link <- book %>% 
      html_element("h3 a") %>% 
      html_attr("href") %>% 
      as.character() 

    # image link
    cover_page_link <- book %>% 
      html_element(".image_container a img") %>% 
      html_attr("src") %>% 
      as.character() 

    return(data.frame(title,cleaned_rating,price,book_link,cover_page_link, stringsAsFactors = FALSE))
}

while (TRUE) {
  # scrape the current page and append its rows (once per page)
  Books <- rbind(Books, scrape_books())

  # find and click "next" button
  nextPage <- tryCatch({
    remDr$findElement(using = "css selector", value = ".next > a")
  }, error = function(e) {
    return(NULL)  # No more pages
  })

  # exit loop if "next" button is missing
  if (is.null(nextPage)) {
    message("No 'Next' button found. Exiting loop.")
    break
  }

  nextPage$clickElement()
  # Allow next page to load
  Sys.sleep(3)  

}

write.csv(Books, file = "./books.csv", fileEncoding = "UTF-8")
print("finished scraping")
remDr$close()

Conclusion

In this tutorial, you learned how to effectively combine RSelenium and RVest to scrape data from a website. By leveraging RSelenium, you can automate user interactions and navigate through web pages, while RVest allows you to extract specific data from HTML elements.

This approach provides a powerful and flexible method for web scraping, enabling you to handle dynamic content and mimic human behavior. By following the steps outlined here, you can successfully scrape data from multiple pages and save it to a CSV file for further analysis.

How to Build a Dynamic Web Scraper App with Playwright and React: A Step-by-Step Guide

Mihail Gaberov — Wed, 15 Jan 2025 14:51:30 +0000

Today we are going to build a small web scraper app. This application will scrape data from the Airbnb website and display it in a nice grid view. We will also add a Refresh button that will be able to trigger a new scraping round and update the results.

In order to make our app a bit more performant, we will utilize the browser’s local storage to store already scraped data so that we don’t trigger new scraping requests every time the browser is refreshed.

Here’s what it will look like:

How to Spin Up the App with Vite
How to Build the Server
How to Build the Front End
How to Deploy to render.com
Conclusion

If you want to jump straight into the code, here 💁 is the GitHub repository with a detailed README 🙌, and here you can see the live demo.

Now if you’re ready, let’s go step by step and see how to build and deploy the app.

First, let’s get everything set up and ready to go.

How to Spin Up the App with Vite

We will use the Vite build tool to quickly spin up a bare bones React application, equipped with TailwindCSS for styling. In order to do that, run this in your terminal app:

npm create vite@latest web-scraper -- --template react

And then install and configure TailwindCSS as follows:

npm install -D tailwindcss postcss autoprefixer
npx tailwindcss init -p

Add the paths to all of your template files in your tailwind.config.js file like this:

/** @type {import('tailwindcss').Config} */
export default {
  content: [
    "./index.html",
    "./src/**/*.{js,ts,jsx,tsx}",
  ],
  theme: {
    extend: {},
  },
  plugins: [],
}

Now you should have a brand new React application with Tailwind installed and configured.

Let’s start our work with the server.

How to Build the Server

Since we are building a full stack application, the bare minimum we need to have in place is a server, a client, and an API. The API will live in the server world and the client app will call the endpoints it exposes in order to fetch the data we need to display on the front end.

Set Up the HTTP Server with Express.js

We are going to use the Express.js library to spin up an HTTP server that will handle our API requests. To do so, follow these steps:

First, install the necessary packages with this command:

npm install express cors playwright

Then create an empty server.js file in the project's root folder and add the following code:

import express from "express";
import { chromium } from "playwright";
import cors from "cors";
import { scrapeListings } from "./utils/scraper.js";

const app = express();
const PORT = 5001;

app.use(cors());

app.get("/scrape", async (req, res) => {
  let browser;
  try {
    browser = await chromium.launch();
    // start at 0 so the scraper can retry up to MAX_RETRIES times
    const listings = await scrapeListings({ browser, retryCount: 0 });
    res.json(listings);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(PORT, () => {
  console.log(`Scraper server running on http://localhost:${PORT}`);
});

Before we continue with the scraper, let me first explain what we are doing here.

This is a pretty simple setup of an Express server that is exposing an endpoint called “scrape“. Our client-side application (the front end) can send GET requests to this endpoint and receive the data returned as a result.

What’s important here is the async callback function that we pass to the app.get method. This is where we call our scraping function within a try/catch block. It will return the scraped data or an error if something goes wrong.

The last few lines indicate that our server will listen on the specified PORT, which is set to 5001 here, and display a message in the terminal to show that the server is running.

What is Web Scraping?

Before diving into the code, I want to briefly explain web scraping if you’re unfamiliar with it. Web scraping involves automatically reading content from websites using a piece of software. This software is called a “web scraper“. In our case, the scraper is what’s in the scrapeListings function.

An essential part of the scraping process is finding something in the DOM tree of the target website that you can use to select the data you want to scrape. This something is known as a selector. Selectors can be different HTML elements, such as tags (h3, p, table) or attributes, like class names or IDs.

You can use various programming techniques or features of the programming language you’re using to create the scraper, aiming for better results when implementing the selecting part of the scraper.

In our case, we’re using [itemprop="itemListElement"] as the selector. But you might wonder, how did we figure this out and decide to use that? How do you know which selector to use?

This is where it gets tricky. You have to manually inspect the DOM tree of the target website and determine what would work best. That’s the case unless the site provides an API specifically designed for scrapers.

Here is what this looks like in practice. This is a screenshot from the Airbnb website:

Usually, you’ll need the information you’re scraping for some particular purpose, which means you’ll need to store it somewhere and then process it. This processing often involves some kind of visualization of the data. This is where our client application comes into play.

We will store the results of our scraping in the browser's local storage. Then, we will easily display those results in a grid layout using React and TailwindCSS. But before we get to all this, let’s go back to the code to understand how the scraping process is done.

Set Up Playwright

For the scraping functionality, we will use another library that has gotten pretty famous over the last few years: Playwright. It mainly serves as an e2e testing solution, but, as you’ll see now, we can use it for scraping the web as well.

We will put the scraping function in a separate file so that we have it all in order and keep the separation of concerns in place.

Create a new folder in the root directory and name it utils. Inside this folder, add a new file named scraper.js and include the following code:

const MAX_RETRIES = 3;

const validateListing = (listing) => {
  return (
    typeof listing.title === "string" &&
    typeof listing.price === "string" &&
    typeof listing.link === "string"
  );
};

export const scrapeListings = async ({ browser, retryCount }) => {
  try {
    const page = await browser.newPage();

    try {
      await page.goto("https://www.airbnb.com/", { waitUntil: "load" });

      await page.waitForSelector('[itemprop="itemListElement"]', {
        timeout: 10000,
      });

      const listings = await page.$$eval(
        '[itemprop="itemListElement"]',
        (elements) => {
          return elements.slice(0, 10).map((element) => {
            const title =
              element.querySelector(".t1jojoys")?.innerText || "N/A";
            const price =
              element.querySelector("._11jcbg2")?.innerText || "N/A";
            const link = element.querySelector("a")?.href || "N/A";
            return { title, price, link };
          });
        }
      );

      const validListings = listings.filter(validateListing);

      if (validListings.length === 0) {
        throw new Error("No listings found");
      }

      return validListings;
    } catch (pageError) {
      if (retryCount < MAX_RETRIES) {
        console.log(`Retrying... (${retryCount + 1}/${MAX_RETRIES})`);
        // pass the same object shape the function destructures
        return await scrapeListings({ browser, retryCount: retryCount + 1 });
      } else {
        throw new Error(
          `Failed to scrape data after ${MAX_RETRIES} attempts: ${pageError.message}`
        );
      }
    } finally {
      await page.close();
    }
  } catch (browserError) {
    throw new Error(`Failed to launch browser: ${browserError.message}`);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
};

Retry Mechanism

At the top of the file, there's a constant called MAX_RETRIES used to implement a retry mechanism. This tactic is often used by web scrapers to bypass or overcome protections that some websites have against scraping. We will see how to use it below.

It's important to mention the legal aspect here as well. Always respect the terms and conditions along with the privacy policy of the website you plan to scrape. Use these techniques only to handle technical challenges, not to break the law.

A small helper function follows that you can use to validate the received data. Nothing interesting here.

Next is the main scraping function. We’re passing the browser object, provided by Playwright, and the number of retry attempts as arguments to the function.

There are two try/catch blocks to handle possible failures: one for launching the browser (in headless mode, meaning you won't see anything) and one for the scraping process. In the latter, we’ll use Playwright's features to request the website, wait until the page is fully loaded, and then locate the selector we defined.
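To make the order of events in the nested try/catch/finally structure easier to follow, here is a small sketch that records what happens when the page step fails. The fake browser and page objects, the scrapeOnce function, and the example.com URL are assumptions made for illustration. They are not Playwright objects or part of the project.

```javascript
const events = [];

// Fake browser/page objects (assumptions for illustration, not Playwright).
const makeFakeBrowser = () => ({
  newPage: async () => ({
    goto: async () => {
      throw new Error("navigation failed");
    },
    close: async () => events.push("page closed"),
  }),
  close: async () => events.push("browser closed"),
});

// Same nesting as the scraper: inner block for the page, outer for the browser.
const scrapeOnce = async (browser) => {
  try {
    const page = await browser.newPage();
    try {
      await page.goto("https://example.com/");
      return "scraped";
    } catch (pageError) {
      events.push(`page error: ${pageError.message}`);
      throw pageError; // hand the error to the outer catch
    } finally {
      await page.close(); // runs whether or not the page step failed
    }
  } catch (outerError) {
    events.push("outer catch");
    return "failed";
  } finally {
    await browser.close(); // always runs last
  }
};

const outcome = scrapeOnce(makeFakeBrowser());
outcome.then((value) => console.log(value, events));
```

The inner finally closes the page before the outer catch sees the error, and the outer finally closes the browser at the very end.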

In the callback function we pass to $$eval, we access the elements returned by the scraping, allowing us to process them and get the data we want. In this case, I’m using three selectors to fetch the title, price, and link of the property. The first two are class names, and the last is the HTML tag a (the anchor element).

Then we return an object, { title, price, link }, with the fetched data, that is the values of the three properties. At the end of the try part, we validate the results before returning them to the front end.
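One detail of the validation step is worth seeing with data: because the scraper falls back to the string "N/A" when a selector finds nothing, a typeof check alone accepts those placeholders. The sketch below shows this with made-up sample listings. isCompleteListing is a hypothetical stricter helper, not something from the project.

```javascript
// Same check as in scraper.js: every field must be a string.
const validateListing = (listing) =>
  typeof listing.title === "string" &&
  typeof listing.price === "string" &&
  typeof listing.link === "string";

// Hypothetical stricter check (an assumption, not in the original project):
// also reject empty strings and the "N/A" placeholder.
const isCompleteListing = (listing) =>
  validateListing(listing) &&
  [listing.title, listing.price, listing.link].every(
    (value) => value.trim() !== "" && value !== "N/A"
  );

// Made-up sample data for illustration.
const samples = [
  { title: "Cabin by the lake", price: "$120", link: "https://example.com/1" },
  { title: "N/A", price: "N/A", link: "N/A" },
  { title: "Loft", price: 95, link: "https://example.com/2" },
];

const typeChecked = samples.filter(validateListing);
const complete = samples.filter(isCompleteListing);
console.log(typeChecked.length, complete.length);
```

Whether you want the stricter behaviour depends on whether a partially filled card is acceptable in the grid.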

What follows in the catch part is the implementation of the retry mechanism we talked about a minute ago:

 } catch (pageError) {
      if (retryCount < MAX_RETRIES) {
        console.log(`Retrying... (${retryCount + 1}/${MAX_RETRIES})`);
        return await scrapeListings({ browser, retryCount: retryCount + 1 });
      } else {
        throw new Error(
          `Failed to scrape data after ${MAX_RETRIES} attempts: ${pageError.message}`
        );
      }
    }

If an error occurs during the reading process, we enter the catch phase and check if the retry count is below the maximum limit we set. If it is, we try again by running the function recursively. Otherwise, we throw an error indicating that the scraping failed and the maximum retry attempts have been reached.
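The same idea can also be written as a loop instead of recursion. The following is a minimal sketch under a few assumptions: withRetries and flakyTask are hypothetical names, and the task here is a stand-in that fails twice, not the real scraping call.

```javascript
// Hypothetical generic helper (not in the original project): call `task`
// until it succeeds or `maxRetries` retries have been used up.
const withRetries = async (task, maxRetries = 3) => {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        console.log(`Retrying... (${attempt + 1}/${maxRetries})`);
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} retries: ${lastError.message}`);
};

// Made-up task that fails twice and then succeeds.
let calls = 0;
const flakyTask = async () => {
  calls += 1;
  if (calls < 3) throw new Error("temporary failure");
  return "ok";
};

const resultPromise = withRetries(flakyTask, 3);
resultPromise.then((value) => console.log(value, calls));
```

A loop keeps the retry count in one place, so there is no argument to thread through each recursive call.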

That's all you need to set up basic web scraping of the Airbnb homepage.

You can see all this in the GitHub repo of the project so don’t worry if you missed something here.

How to Build the Front End

Now it's time to put the scraped data to use.

Let's display the 10 scraped properties in a grid layout, allowing you (or anyone) to open them by clicking on their links. We will also add a Refresh feature that lets you perform a new scrape to get the most up-to-date data.

This is how the structure of the front end part of the project looks:

We have a simple app structure: one main container (App.jsx) that holds all the components and includes some logic for making requests to the API and storing the data in local storage.

import { useEffect, useState } from "react";
import { useLocalStorage } from "@uidotdev/usehooks";
import axios from "axios";
import Footer from "./components/Footer";
import Header from "./components/Header";
import RefreshButton from "./components/RefreshButton";
import Grid from "./components/Grid";
import Loader from "./components/Loader";

function App() {
  const [listings, setListings] = useLocalStorage("properties", []);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState("");

  const fetchListings = async () => {
    setLoading(true);
    setError("");
    setListings([]);

    try {
      const response = await axios.get("http://localhost:5001/scrape");
      if (response.data.length === 0) {
        throw new Error("No listings found");
      }
      setListings(response.data);
    } catch (err) {
      setError(
        err.response?.data?.error ||
          "Failed to fetch listings. Please try again."
      );
    } finally {
      setLoading(false);
    }
  };

  useEffect(() => {
    if (listings.length === 0) {
      fetchListings();
    }
  }, []);

  return (
    <div className="flex flex-col items-center justify-center min-h-screen bg-gray-100">
      <Header />
      <RefreshButton callback={fetchListings} loading={loading} />
      <main className="flex flex-col items-center justify-center flex-1 w-full px-4 relative">
        {error && <p className="text-red-500">{error}</p>}
        {loading ? <Loader /> : <Grid listings={listings} />}
      </main>
      <Footer />
    </div>
  );
}

export default App;
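The caching idea behind this component (use stored listings when they exist, otherwise scrape and store) can be sketched outside React. This is only an illustration: getListings, createMemoryStorage, and fakeFetch are hypothetical names, the in-memory storage stands in for the browser's localStorage, and this is not how the useLocalStorage hook is implemented.

```javascript
// Hypothetical helper: return cached listings if present, otherwise call
// `fetchFresh` and cache the result. `storage` only needs getItem/setItem,
// so window.localStorage would fit in a browser.
const getListings = async (storage, key, fetchFresh) => {
  const cached = JSON.parse(storage.getItem(key) ?? "[]");
  if (Array.isArray(cached) && cached.length > 0) {
    return cached;
  }
  const fresh = await fetchFresh();
  storage.setItem(key, JSON.stringify(fresh));
  return fresh;
};

// In-memory stand-in for localStorage so the sketch runs outside a browser.
const createMemoryStorage = () => {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => data.set(k, String(v)),
  };
};

// Made-up fetcher that counts how often a "scrape" would be triggered.
let fetchCount = 0;
const fakeFetch = async () => {
  fetchCount += 1;
  return [{ title: "Cabin", price: "$120", link: "https://example.com/1" }];
};

const storage = createMemoryStorage();
const demo = (async () => {
  const first = await getListings(storage, "properties", fakeFetch);
  const second = await getListings(storage, "properties", fakeFetch);
  return { first, second, fetches: fetchCount };
})();
demo.then(({ fetches }) => console.log(`fetches: ${fetches}`));
```

The second call is served from storage, which mirrors why a browser refresh in the app does not trigger a new scrape.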

All components are placed in the components directory (this is what I call a surprise, ah:)). Most of the components are quite simple, and I included them to give the app a more complete appearance.

The Header displays the top bar. The RefreshButton sends a new request to get the latest data. In the <main> section, we show an error message if fetching fails, followed by either a Loader component while the request is in progress or a Grid component once it has finished.

The loading part is straightforward. The Grid component is the interesting one. We pass the scraping results to it using a prop called 'listings'. Inside, we use a simple map() function to go through them and display the properties. We use Tailwind to style the grid, ensuring the properties are neatly listed and look good on both desktop and mobile screens.

And at the end we have the Footer component, which renders a simple bar with text. Again, I’ve added it just for completeness.

How to Deploy to render.com

Maybe a little over a year ago, I needed a place to deploy full-stack applications, ideally for free, since they were just for educational purposes.

After some research, I found a platform called Render and managed to deploy an app with both client and server parts, getting it to work online. I left it there until now. Since our scraper requires both parts to function properly, we will deploy it there and have it working online, as you can see here.

To do this, you need to create an account with Render and use their dashboard application. The process is simple, but I'll include a few screenshots below for clarity.

This is the Overview page where you can see all your projects:

Here is the Project page where you can view and manage your projects. In our case, we can see both the server and the client app as separate services.

You can click on each service to open its page, where you can view the deployments and the commits that triggered them. You can find even more details if you explore further.

You should be able to manage the deployment process on your own, as everything is clearly explained. But if you need help, feel free to reach out.

I should mention that I am not affiliated with Render in any way and I am not receiving any benefits for mentioning them here. I just found it to be a useful tool and wanted to share it with you – so I’ve used it here.

Conclusion

A web scraper app can be a powerful tool for gathering data, but there are several areas for improvement and important considerations to keep in mind here.

Firstly, you can enhance the app's performance and efficiency by optimizing the scraping process and ensuring that the data is stored and processed effectively. You can also implement more robust error handling and retry mechanisms to improve the reliability of the scraper.
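
As one possible take on the retry idea, here is a minimal sketch of a generic retry helper with exponential backoff. The names backoffDelay and withRetry, the default of 3 retries, and the 500 ms base delay are assumptions made for illustration and are not part of the app built in this article.

```javascript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at `maxMs` so the waits never grow without bound.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Resolves after `ms` milliseconds.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs an async task and retries it when it throws or rejects.
// `task` is any function returning a promise, for example
// () => axios.get("http://localhost:5001/scrape").
async function withRetry(task, { retries = 3, baseMs = 500, sleep = defaultSleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < retries) {
        await sleep(backoffDelay(attempt, baseMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Growing the delay between attempts gives a struggling server time to recover instead of hitting it again immediately, which also fits the advice about not overloading the target website.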

Also, keep in mind that ethical scraping is crucial, and it's important to always respect the terms of service and privacy policies of the websites you are scraping. This includes not overloading the website with requests and ensuring that the data is used responsibly. Always seek permission from the site if required and consider using APIs provided by the website as a more ethical and reliable alternative.

Lastly, respecting the law is paramount. Ensure that your scraping activities comply with legal regulations and guidelines to avoid any potential legal issues. By focusing on these aspects, you can build a more effective, ethical, and legally compliant web scraper app.
